Case Study: Payment Systems Architecture (Stripe, FedNow, Shopify)¶
Money plus distributed systems is the hardest problem in tech. ACID is non-negotiable.
The hook¶
Payments are the canonical CP system. A charge has two acceptable end states: fully completed, or fully rolled back. Never both. Never neither. Networks will partition, services will die, banks will time out — and the ledger still has to balance at midnight.
Stripe runs hundreds of billions of dollars a year on a system that takes ACID dead seriously. Shopify designs for Black Friday peaks where a five-second timeout hits 80,000 carts. The Federal Reserve launched FedNow in 2023 to settle directly between bank accounts in seconds, 24x7. Three different scales, same hard problem: every dollar has to be accounted for, and "we'll figure it out later" is not a strategy.
If you only learn one architecture in this course end-to-end, make it this one. The patterns survive everywhere money moves.
The concept¶
Payment systems are built from a small set of patterns that show up together because the problem demands them.
- Idempotency keys on every write. The client sends a unique key with each charge. The server stores it. Same key on retry returns the original response, not a second charge.
- Saga pattern for multi-step workflows. A payment touches your DB, the card network, the issuing bank, your ledger, and the merchant's webhook. You can't hold a 2PC lock across companies. You commit each step locally and define compensations (refund, void, reverse) when a later step fails.
- Event sourcing for the ledger. The source of truth is an append-only log of state transitions: intent_created, auth_succeeded, captured, refunded. The current row is a projection. Every dollar is auditable backwards.
- Circuit breakers around external dependencies. Banks have transient errors. Card networks have bad afternoons. You wrap every external call so a slow downstream cannot eat your thread pool.
- Reconciliation, not just code. Every night you compare your ledger against bank statements. Mismatches become tickets. The code is the optimist; reconciliation is the audit.
ACID at the row level keeps individual records honest. Sagas at the workflow level keep cross-system flows honest. Both are required.
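The idempotency-key bullet above can be sketched in a few lines. This is a minimal illustration, not Stripe's implementation: `IdempotencyStore`, `charge_card`, and the key `checkout-attempt-1` are hypothetical names, and an in-memory dict stands in for a durable key-value store.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IdempotencyStore:
    """In-memory stand-in (assumption) for the store that remembers processed keys."""
    _responses: Dict[str, dict] = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def execute(self, key: str, operation: Callable[[], dict]) -> dict:
        """Run `operation` once per key; later calls replay the stored response."""
        with self._lock:
            if key in self._responses:
                return self._responses[key]  # retry: no second side effect
            response = operation()           # first time: perform the write
            self._responses[key] = response
            return response


charges = []  # stands in for the side effect that must not be repeated


def charge_card() -> dict:
    charges.append(1000)
    return {"status": "succeeded", "amount": 1000, "charge_number": len(charges)}


store = IdempotencyStore()
first = store.execute("checkout-attempt-1", charge_card)
retry = store.execute("checkout-attempt-1", charge_card)  # same key, replayed
```

Holding one lock across the whole operation keeps the sketch short; a real service would more likely record an "in progress" marker per key so unrelated requests do not queue behind each other.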
Diagram¶
flowchart LR
C[Client SDK] -->|idempotency-key| API[Stripe API]
API --> IK{Key seen<br/>before?}
IK -->|yes| RET[Return cached response]
IK -->|no| DB[(PaymentIntent<br/>row in DB)]
DB --> EV[Event log:<br/>intent_created]
EV --> NET[Card network<br/>Visa / MC]
NET --> BANK[Issuing bank<br/>auth]
BANK -->|ok| CAP[Capture funds]
BANK -->|fail| COMP[Saga compensation:<br/>void + refund]
CAP --> LED[(Ledger update)]
LED --> WH[Webhook to<br/>merchant]
COMP --> WH
The compensation arrow is the saga in one picture. Any step downstream of intent_created can fail; each failure has a defined rollback that puts the system back into a balanced state. The webhook fires either way — the merchant gets told the truth.
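The compensation path in the diagram can be sketched as a small saga runner. This is an illustrative sketch under assumptions: `run_saga`, the step names, and the no-op lambdas are hypothetical, and real compensations would call the card network (void, refund) rather than do nothing.

```python
from typing import Callable, List, Tuple

# (name, action, compensation) for one locally committed step
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"{name}:ok")
            completed.append((name, action, compensate))
        except Exception as exc:
            log.append(f"{name}:failed({exc})")
            for done_name, _, undo in reversed(completed):
                undo()
                log.append(f"{done_name}:compensated")
            return False
    return True


def fail_capture() -> None:
    raise RuntimeError("capture declined")


log: List[str] = []
ok = run_saga(
    [
        ("create_intent", lambda: None, lambda: None),
        ("authorize", lambda: None, lambda: None),  # compensation would void the auth
        ("capture", fail_capture, lambda: None),
    ],
    log,
)
```

Compensations run in reverse order so that the most recent commit is undone first, which mirrors how a void has to precede marking the intent as failed.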
Example — tracing one Stripe charge¶
A customer taps "Pay" on a Shopify checkout. Walk it from client to settled funds.
1. Client SDK call. The merchant's server calls stripe.paymentIntents.create() with an Idempotency-Key header set to a ULID generated for this checkout attempt. The header looks like Idempotency-Key: your-idempotency-key-here. The Stripe SDK retries on network errors, reusing the same key.
2. Idempotency check. Stripe's API gateway hashes the key, looks it up in a fast key-value store. Hit means "we processed this already" — return the original JSON response and stop. Miss means "first time" — proceed and store the key.
3. Persist the intent. A PaymentIntent row is written in a transactional DB (Postgres, sharded by account). Status: requires_confirmation. This commit is the durable promise that something happened. (See: acid-cap-base)
4. Publish the event. An intent_created event lands in Kafka. Downstream services — fraud scoring, risk, analytics — consume independently. The event log is the auditable timeline. (See: event-sourcing)
5. Talk to the card network. Stripe calls Visa or Mastercard with the card details. The network forwards to the issuing bank for authorization. This call is wrapped in a circuit breaker: if the network is degraded, the breaker opens and fails fast instead of piling up requests. Per Shopify's playbook, the read timeout is around 5 seconds and the write timeout around 1 second.
6. Bank decision. The issuing bank approves, declines, or asks for 3D Secure. Approval returns an auth_code. Stripe writes auth_succeeded to the event log and updates the row.
7. Capture (often async). For most flows, capture happens immediately and money moves into Stripe's settlement account. Some flows (auth-then-capture for marketplaces) hold authorization for hours or days and capture later.
8. Settlement. The actual interbank movement takes 1–2 business days through ACH or card-network settlement. Stripe fronts the merchant during that window — they're carrying the credit risk.
9. Webhook to merchant. Stripe POSTs payment_intent.succeeded to the merchant's URL. The webhook is signed, retried with exponential backoff, and replayable from the event log if the merchant's endpoint was down. (See: webhooks)
10. Saga rollback (if step 5 or 6 fails). Bank declines after the row is written? Mark requires_payment_method, fire payment_intent.payment_failed, no money moved. Network call succeeded but capture failed? Issue a void to the network and unwind. Every failure mode has a defined compensating action.
End-to-end: roughly ten services, multiple cross-company hops, every step idempotent and recorded. The customer waits around 1–3 seconds. The ledger balances by morning.
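The circuit breaker from step 5 can be sketched as below. This is a simplified illustration, not Semian or any production library: the `CircuitBreaker` class, its parameters, and the injected `clock` are hypothetical, and it trips on a count of consecutive failures rather than an error rate.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; dependency is degraded")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # success closes the breaker
        return result


now = [0.0]  # fake clock so the example does not sleep
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: now[0])


def flaky_network() -> str:
    raise TimeoutError("card network timed out")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_network)
    except CircuitOpenError:
        outcomes.append("fast-fail")
    except TimeoutError:
        outcomes.append("timeout")

now[0] = 11.0  # past the reset window: the next call is a trial
recovered = breaker.call(lambda: "approved")
```

The third call never reaches the network: once the breaker is open, callers get an immediate error instead of waiting out another timeout, which is what protects the thread pool.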
Mechanics — the payments pattern table¶
| Pattern | What it does | Where it shows up |
|---|---|---|
| Idempotency keys | Client-supplied unique ID; server replays original response on retry | Every write API. Stripe accepts one on POST requests; well-designed APIs expose it. Use ULID over UUIDv4: sortable, paginatable. |
| Saga | Sequence of local commits with defined compensations on failure | Multi-step flows: charge → auth → capture → settle. Replaces 2PC across systems you don't own. (See: distributed-patterns) |
| Event sourcing | Append-only log of state changes; current state is a projection | The ledger. Lets you audit any row backwards and replay history into a new system. |
| At-least-once retries | Repeat failed external calls until success or final failure | Banks and card networks have transient errors. Combined with idempotency, repeats are safe. |
| Circuit breakers | Stop calling a failing dependency after error rate crosses threshold | Wrap every external call. Shopify's Semian library does this for HTTP, MySQL, Redis, gRPC. |
| Reconciliation | Nightly compare your ledger against bank statements; flag mismatches | The audit layer. Code is optimistic; recon catches what code missed. Store breaks in a DB and chase them down. |
| Webhooks | Push state changes to merchants over HTTPS, signed and retried | How merchants find out a charge succeeded. Built on the same event log. |
| Pull payments | Merchant pulls money from cardholder's account on approval | Standard card swipe. Sender authorizes; receiver initiates. Settlement is delayed (1–2 days for cards, 1–3 for ACH). |
| Push payments | Sender's bank pushes money in real time, often instantly | FedNow, RTP, Visa Direct, Mastercard Send. Settlement is final in seconds. Different fraud profile — irreversible. |
The pull-vs-push distinction matters. Cards and ACH are pull: the receiver initiates, money settles later, and chargebacks are possible for weeks. FedNow and RTP are push: the sender initiates, settlement is final in seconds, and there is no chargeback. Architectures look similar; risk models do not.
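The reconciliation row in the table above reduces to a set comparison. The sketch below is a minimal version under assumptions: `reconcile`, the transaction IDs, and the break labels are hypothetical, and both sides are presumed already normalized to amounts in cents keyed by a shared transaction ID.

```python
from typing import Dict, List, Tuple


def reconcile(ledger: Dict[str, int], bank: Dict[str, int]) -> List[Tuple[str, str]]:
    """Compare amounts in cents by transaction ID; return (txn_id, reason) breaks."""
    breaks: List[Tuple[str, str]] = []
    for txn_id in sorted(ledger.keys() | bank.keys()):
        if txn_id not in bank:
            breaks.append((txn_id, "missing_from_bank"))
        elif txn_id not in ledger:
            breaks.append((txn_id, "missing_from_ledger"))
        elif ledger[txn_id] != bank[txn_id]:
            breaks.append((txn_id, "amount_mismatch"))
    return breaks


ledger = {"txn_1": 1000, "txn_2": 2500, "txn_3": 400}
bank = {"txn_1": 1000, "txn_2": 2400, "txn_4": 900}
breaks = reconcile(ledger, bank)
```

Here `txn_2` disagrees on amount, `txn_3` never reached the bank, and `txn_4` appears only on the statement; each would become a stored break to investigate. In practice the hard part is usually matching records that lack a shared ID, which this sketch assumes away.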
Related concepts¶
| Concept | Why it shows up here |
|---|---|
| ACID & CAP/BASE | Payments are the textbook CP example. You give up availability before you give up consistency — better to fail a charge than to double-charge. |
| Distributed Patterns | Saga, idempotency, circuit breakers, retries with backoff. The whole pattern set is on display in one product. |
| Event Sourcing | The ledger is an event log. Every state change is a fact you can replay. Auditors and engineers both need this. |
| Webhooks | The standard way payment processors notify merchants. Signed, retried, idempotent on the receiving side. |
| Observability | Four golden signals (latency, traffic, errors, saturation) plus structured logs plus reconciliation. You cannot run a payment system you cannot see. |
| Message Queues | Kafka or equivalent under the event log. Decouples downstream consumers (fraud, analytics) from the request path. |
| API Gateway | Where idempotency lookup, auth, and rate limiting live before the request reaches business logic. |
| Case Study: Stack Overflow | The opposite story — a high-traffic system that chose simpler consistency because the cost of a duplicated upvote is zero. Useful contrast. |
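The event-sourcing row above says current state is a projection of the log. A minimal sketch of that idea follows, using the event names from this case study; the `Event` class, the `project` function, the `pi_1` identifier, and the `authorized` status label are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    intent_id: str
    kind: str        # intent_created, auth_succeeded, captured, refunded
    amount: int = 0  # cents; only meaningful for captured and refunded


def project(events: Iterable[Event]) -> dict:
    """Fold the append-only log into the current row for one payment intent."""
    state = {"status": None, "captured": 0, "refunded": 0}
    for event in events:
        if event.kind == "intent_created":
            state["status"] = "requires_confirmation"
        elif event.kind == "auth_succeeded":
            state["status"] = "authorized"
        elif event.kind == "captured":
            state["status"] = "succeeded"
            state["captured"] += event.amount
        elif event.kind == "refunded":
            state["status"] = "refunded"
            state["refunded"] += event.amount
    state["net"] = state["captured"] - state["refunded"]
    return state


event_log = [
    Event("pi_1", "intent_created"),
    Event("pi_1", "auth_succeeded"),
    Event("pi_1", "captured", 1000),
    Event("pi_1", "refunded", 300),
]
current = project(event_log)
as_of_capture = project(event_log[:3])  # replay a prefix to audit an earlier state
```

Because the log is never rewritten, replaying any prefix reproduces the state as it stood at that point, which is what makes the ledger auditable backwards.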
When (and when not) to copy this pattern¶
Copy it when:
- You are moving money or any irreversible value — payouts, refunds, transfers, crypto, gift cards, points with cash equivalence.
- Double-execution is unacceptable: the operation cannot be safely repeated. Sending an email twice is fine; charging a card twice or sending a wire twice is a Tuesday-night incident.
- You have external systems you don't control in the critical path — banks, networks, partners. 2PC is off the table; sagas and idempotency are mandatory.
- You have regulatory or audit obligations — the event log isn't optional, it's evidence.
Skip it (or simplify) when:
- The operation is safely retryable without state. A read API, a search, a cache update. Idempotency keys add overhead with no payoff.
- Your "transactions" are inside one database. A regular DB transaction beats a hand-rolled saga every day. Reach for sagas only when the transaction crosses systems.
- You're early-stage and not handling money yet. Adding event sourcing and reconciliation infrastructure to a CRUD app is a tax you don't owe. Pay it when the product earns it.
- You already have exactly-once semantics from a higher layer (exactly-once processing via Kafka transactions, for instance) and don't need application-level idempotency.
The honest version: most apps don't need the full payments stack. But the moment you touch money or anything irreversible, every shortcut comes back as a 2 a.m. ledger break.
Key takeaway¶
- Money systems face trade-offs that low-stakes apps never have to think about, and the patterns that survive at Stripe's scale (idempotency + saga + event log + reconciliation) are the templates the rest of us copy when the stakes climb.
- Idempotency keys on every write. The client owns the key, the server enforces dedup. ULID over UUIDv4 — sortable beats random.
- Saga, not 2PC. You don't own the bank's locks. Local commits with explicit compensations are the only realistic shape across companies.
- Event log is the ledger. Append-only, auditable, replayable. Current state is a projection of the log, not a competing source of truth.
- Reconcile every night. Code lies; the bank statement doesn't. Mismatches become tickets, not silence.
- Pull vs push changes the risk model. Cards = chargebacks for weeks. FedNow/RTP = final in seconds, no take-backs.
Quiz available in the SLAM OG app — three questions on idempotency keys, why sagas beat 2PC across companies, and when the full payments stack is overkill.