Message Queues & Brokers
Drop the message, walk away. The broker delivers — eventually.
The hook
Service A calls Service B over HTTP. B is down for two minutes. Now A is down too — every request piling up, timing out, taking the user with it.
Synchronous calls couple uptime. If you call something, you depend on it being alive right now. That's fine for two services. It's a nightmare for fifty.
The fix is a queue between them. A drops a message. B picks it up when it's ready — five seconds later, five minutes later, doesn't matter. A doesn't wait. A doesn't care. The broker holds the message until B is ready.
That single move — replacing a synchronous call with a queue — unlocks reliability, throughput smoothing, and async work. It's the duct tape of distributed systems.
The concept
A message broker is a server that accepts messages from producers, holds them, and delivers them to consumers. Producers and consumers don't know about each other. They only know the broker.
Two main patterns:
| Pattern | Who gets the message | Used for |
|---|---|---|
| Queue (point-to-point) | One consumer per message | Distributing work across workers |
| Pub/Sub (topic-based) | Every subscriber gets a copy | Broadcasting events to many listeners |
Queue example: 10,000 image-resize jobs land on a queue, 50 workers pull from it, and each job is handed to exactly one worker (it runs again only if that attempt fails and the broker redelivers it).
Pub/sub example: a user.signed_up event lands on a topic, the email service sends a welcome, the analytics service counts it, the CRM service creates a record — all from the same message.
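The difference between the two patterns can be sketched in a few lines of in-memory Python. This is a toy model, not a real broker: the `WorkQueue` and `Topic` classes and the message names are made up for illustration, and there is no persistence, networking, or acknowledgement.

```python
from collections import deque


class WorkQueue:
    """Point-to-point: each message is handed to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Whichever worker calls receive() first takes the message;
        # no other worker will see it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Pub/sub: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._inboxes = {}

    def subscribe(self, name):
        self._inboxes[name] = deque()
        return self._inboxes[name]

    def publish(self, message):
        for inbox in self._inboxes.values():
            inbox.append(message)


# Queue: three jobs split across two workers.
jobs = WorkQueue()
for n in range(3):
    jobs.send(f"resize-{n}")
worker_1 = [jobs.receive(), jobs.receive()]
worker_2 = [jobs.receive()]

# Topic: one event, one copy per subscriber.
events = Topic()
email_inbox = events.subscribe("email")
analytics_inbox = events.subscribe("analytics")
events.publish("user.signed_up")
```

After this runs, the three jobs are divided between the two workers with no overlap, while both inboxes hold the same `user.signed_up` event.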
Brokers handle the boring-but-critical stuff so producers and consumers don't have to:
- Persistence — survive a broker restart without losing messages
- Ordering — within a partition or queue, messages stay in the order produced
- Retries — if a consumer fails, redeliver
- Dead-letter queues (DLQ) — after N failed attempts, park the message somewhere a human can look at it
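Retries and dead-lettering fit together roughly as in the sketch below. It is a simplified, single-process model under stated assumptions: the limit of three attempts, the `drain` helper, and the `corrupt-payload` message are all hypothetical, and real brokers track delivery counts on the broker side rather than in a local dict.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit; real brokers make this configurable


def drain(queue, handler, max_attempts=MAX_ATTEMPTS):
    """Process every message, retrying failures and parking poison messages."""
    dead_letters = []
    attempts = {}
    while queue:
        message = queue.popleft()
        try:
            handler(message)
        except Exception as error:
            attempts[message] = attempts.get(message, 0) + 1
            if attempts[message] >= max_attempts:
                # Give up: park the message with the reason it failed.
                dead_letters.append((message, str(error)))
            else:
                queue.append(message)  # put it back for redelivery
    return dead_letters


processed = []


def handle(message):
    if message == "corrupt-payload":
        raise ValueError("cannot parse")
    processed.append(message)


queue = deque(["job-1", "corrupt-payload", "job-2"])
dead = drain(queue, handle)
```

The healthy jobs complete, and the message that fails three times ends up in `dead` alongside its error text instead of blocking the queue or vanishing.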
Diagram
flowchart LR
P1[Producer A] --> Q[Queue]
Q --> C1[Worker 1]
Q -.->|fail x3| DLQ[Dead Letter Queue]
P2[Producer B] --> T[Topic]
T --> S1[Subscriber: Search]
T --> S2[Subscriber: Analytics]
T --> S3[Subscriber: Email]
style DLQ stroke:#f66,stroke-dasharray:5
Top half: queue. One message, one worker. Failed messages get retried, then parked in the DLQ.
Bottom half: pub/sub. One event, three independent subscribers, each at its own pace.
Example: Kafka at LinkedIn
LinkedIn invented Kafka because they were drowning.
By 2010, they had hundreds of services, and every service knew about every other service it needed. The User Service called Search directly. Search called Analytics. Analytics called Email. Adding a new consumer meant editing every upstream service. Every outage cascaded. The integration graph looked like spaghetti dropped on a fan.
So they built Kafka as a durable, ordered, replayable log. Services publish events to a topic. Other services subscribe and consume at their own pace. Nobody calls anybody. Today Kafka handles trillions of messages a day across LinkedIn.
Concretely, here's what a profile update looks like now:
- User edits their profile.
- User Service publishes `user.profile.updated` to Kafka. Done. Returns to the user immediately.
- Search Service consumes the event and re-indexes the profile.
- Analytics consumes the event and increments a counter.
- Email consumes the event and notifies recruiters who saved the profile.
The User Service doesn't know who's listening. New consumers can show up next quarter without a single change upstream. If Email is down for an hour, Email catches up later — Search and Analytics keep going.
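The "catch up later" behaviour follows from each consumer tracking its own position in a shared log. The sketch below is a toy model of that idea, not Kafka's API: the `EventLog` class, its method names, and the event strings are invented for illustration.

```python
class EventLog:
    """A toy append-only log; each consumer tracks its own read position."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> index of the next event to read

    def publish(self, event):
        self._events.append(event)

    def poll(self, consumer):
        """Return everything the consumer has not seen yet, then advance."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer's position, e.g. back to 0 to replay."""
        self._offsets[consumer] = offset


log = EventLog()
log.publish("user.profile.updated:alice")
search_first = log.poll("search")        # Search is up and keeps pace

log.publish("user.profile.updated:bob")  # Email has been down for both events
search_second = log.poll("search")
email_catch_up = log.poll("email")       # Email comes back and catches up

log.seek("search", 0)                    # rewind Search to replay everything
search_replay = log.poll("search")
```

The publisher never references a consumer, and a slow or rewound consumer has no effect on the others because the only per-consumer state is an offset.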
Why Kafka is fast (the part everyone asks about):
- Append-only log on disk. Sequential writes can be orders of magnitude faster than random writes, especially on spinning disks. Kafka avoids seeks on the write path; it just appends.
- Zero-copy. Data moves from disk to network card without a trip through the application: `sendfile()` does it in the kernel.
- Partitions. A topic is sharded across brokers. Consumers read partitions in parallel. Throughput scales horizontally.
A single Kafka cluster can push millions of messages per second on commodity hardware. The "secret" is mostly that it does less than other brokers.
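Partitioning is what lets throughput scale while still keeping order where it matters. A minimal sketch of key-based partitioning follows, assuming four partitions and using CRC32 as a stand-in hash; real clients use their own hash functions, and the keys and payloads here are made up.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for the example


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # CRC32 is stable across runs, unlike Python's built-in hash() for str,
    # which is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [
    ("alice", "updated-name"),
    ("bob", "updated-photo"),
    ("alice", "updated-title"),
]
for key, payload in events:
    # Appending preserves production order within each partition.
    partitions[partition_for(key)].append((key, payload))
```

Every event with the same key lands in the same partition, so one user's updates stay in order even though different partitions are consumed in parallel with no ordering between them.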
Mechanics: five brokers, five jobs
| Broker | Pattern | Delivery | Ordering | Throughput | Pick when |
|---|---|---|---|---|---|
| Kafka | Pub/sub on a log | At-least-once (exactly-once with care) | Per partition | Very high (millions/sec) | Event streaming, replay, large scale, audit log |
| RabbitMQ | Queue + flexible routing (direct, fanout, topic, headers) | At-least-once | Per queue | Moderate (tens of thousands/sec) | RPC-style work, complex routing rules, mature ops |
| AWS SQS | Queue (Standard or FIFO) | At-least-once (Standard) or exactly-once processing via deduplication (FIFO) | Best-effort (Standard) / strict per message group (FIFO) | High, auto-scales | Zero-ops async work on AWS, decoupling Lambda functions |
| Redis Streams | Lightweight log inside Redis | At-least-once | Per stream | Moderate, very low latency | You already run Redis, modest scale, latency matters |
| AWS SNS / Google Pub/Sub | Pub/sub fanout | At-least-once | None (mostly) | High, managed | Broadcasting events across services or to mobile push |
Quick reading guide:
- Need replay or a durable event log? Kafka.
- Need flexible routing rules and you're not at huge scale? RabbitMQ.
- On AWS and want to forget about ops? SQS for queues, SNS for pub/sub.
- Already running Redis and the workload is small? Redis Streams. Don't add Kafka for 100 messages a minute.
- Broadcasting to a fleet of mobile clients or services? Managed pub/sub.
A real talk note on delivery guarantees: at-least-once is the realistic default for every broker on this list. "Exactly-once" exists, but it's narrow — usually scoped to a single transaction inside one cluster, with constraints. The professional move is to make consumers idempotent (running the same message twice produces the same result) and stop worrying about duplicates.
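One common way to make a consumer idempotent is to remember which message IDs have already been applied. The sketch below assumes each message carries a unique `id` field; the `BalanceConsumer` class and the message shape are hypothetical. A production version would keep the seen IDs in durable storage and commit them together with the state change.

```python
class BalanceConsumer:
    """Applies each message at most once, keyed on its message ID."""

    def __init__(self):
        self.balance = 0
        self._seen_ids = set()

    def handle(self, message):
        if message["id"] in self._seen_ids:
            return False  # duplicate delivery: skip, the effect already happened
        self.balance += message["amount"]
        self._seen_ids.add(message["id"])
        return True


consumer = BalanceConsumer()
deliveries = [
    {"id": "m-1", "amount": 50},
    {"id": "m-2", "amount": 20},
    {"id": "m-1", "amount": 50},  # the broker redelivers m-1
]
results = [consumer.handle(m) for m in deliveries]
```

The redelivered `m-1` is recognised and ignored, so the balance reflects two credits rather than three.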
Related concepts
| Concept | What it is | How it relates |
|---|---|---|
| Event Sourcing | Storing state as an ordered log of events | Same shape as Kafka — replay the log to rebuild state |
| Event-Driven Architecture | Services react to events instead of being called | Brokers are the substrate that makes EDA work |
| Microservices | Many small services with their own data | Queues decouple them — without a broker, microservices revert to spaghetti calls |
| Idempotency | Running an operation twice has the same effect as once | Required on consumers because at-least-once means duplicates |
| Webhooks | HTTP callbacks to external systems on events | The external-facing version of pub/sub — same pattern, different transport |
| Distributed Patterns (Saga, Outbox) | Multi-step workflows across services | Sagas use queues to coordinate steps; the outbox pattern is how you publish reliably from a database |
| Backpressure | Slowing producers when consumers can't keep up | Queues buffer the surge so the consumer doesn't drown; a bounded queue adds backpressure by blocking or rejecting producers once it fills |
| Dead Letter Queue | Side queue for messages that keep failing | The "I give up, look at this manually" tray every broker should have |
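Of the concepts in the table, the outbox pattern is probably the least obvious from a one-line description, so here is a rough sketch using an in-memory SQLite database. The table layout, the `rename_user` and `relay` helpers, and the `publish` callback are all assumptions for illustration; a real relay would run as a separate process and call an actual broker client.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "topic TEXT, payload TEXT, published INTEGER DEFAULT 0)"
)


def rename_user(user_id, name):
    # One transaction: the state change and the event row commit together
    # or roll back together.
    with db:
        db.execute(
            "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
            (user_id, name),
        )
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("user.profile.updated", f"{user_id}:{name}"),
        )


def relay(publish):
    """Push unpublished outbox rows to the broker, oldest first."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # a crash after this line means a resend later
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
rename_user(1, "Alice")
first_pass = relay(lambda topic, payload: sent.append((topic, payload)))
second_pass = relay(lambda topic, payload: sent.append((topic, payload)))
```

Because the relay marks a row as published only after the publish call returns, a crash in between leads to a duplicate rather than a lost event, which is another reason consumers need to be idempotent.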
When (and when not) to use a queue
Reach for a queue when:
- Async work that doesn't need an immediate answer — send the welcome email, generate the thumbnail, run the export. The user doesn't wait; the worker picks it up.
- Smoothing spikes — 10,000 requests land in a second. The queue absorbs them; a steady pool of workers drains the queue at a sane rate. The downstream system never sees the spike.
- Decoupling services — when you don't want Service A's outage to take Service B with it, put a buffer between them.
- Replay and audit — Kafka's killer feature. Bug in your analytics pipeline last Tuesday? Reset the consumer offset, replay the week.
- Broadcasting events — one event, many independent listeners. Pub/sub.
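The spike-smoothing case is easy to see with a small simulation. This is a sketch under simple assumptions: time moves in discrete ticks, the worker pool handles a fixed 2,500 messages per tick (a made-up figure), and the burst arrives all at once.

```python
def simulate(arrivals_per_tick, worker_capacity):
    """Return (queue depth after each tick, messages processed each tick)."""
    depth = 0
    depths, processed = [], []
    for arrivals in arrivals_per_tick:
        depth += arrivals                    # the queue absorbs whatever arrives
        done = min(depth, worker_capacity)   # workers drain at a fixed rate
        depth -= done
        depths.append(depth)
        processed.append(done)
    return depths, processed


# A burst of 10,000 messages in the first tick, then nothing.
arrivals = [10_000] + [0] * 5
depths, processed = simulate(arrivals, worker_capacity=2_500)
# depths    -> [7500, 5000, 2500, 0, 0, 0]
# processed -> [2500, 2500, 2500, 2500, 0, 0]
```

The downstream system never sees more than 2,500 messages in a tick. The cost is latency: the last message of the burst waits four ticks before it is processed.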
Skip the queue when:
- Low traffic and sync calls work fine. A broker is real ops cost — another moving part to monitor, scale, and pay for. Don't add it for 50 requests a minute.
- Strict global ordering across services. Queues give you ordering within a partition or queue — coordinating order across services is a hard distributed-systems problem, not something a broker hands you.
- Real-time bidirectional traffic. Chat, gaming, live cursors. Use WebSockets or gRPC streaming. A broker adds latency and the wrong shape of communication.
- You need a synchronous answer right now. If the caller has to know whether the operation succeeded before returning to the user, a queue is the wrong tool — the operation is async by definition once you queue it.
The default question for any new service-to-service call is: does the caller need the answer right now? If no, a queue is probably the right shape.
Key takeaway
- Queues decouple producers from consumers, smooth spikes, and let services fail independently — the async backbone of distributed systems.
- Queue vs pub/sub: one consumer per message vs one copy per subscriber. Pick based on whether you're distributing work or broadcasting events.
- At-least-once is the realistic default. Make your consumers idempotent and stop chasing exactly-once.
- Pick the broker for the job: Kafka for streaming and replay, RabbitMQ for routing, SQS/SNS for managed AWS, Redis Streams for small-scale low-latency, managed pub/sub for broadcasts.
- Always wire up a dead-letter queue. Messages that can't be processed shouldn't disappear — they should land somewhere a human can look at them.