Idempotency Layer for a Payments API
1. Reframing: Kill “Exactly Once” Before It Kills You
[STAFF SIGNAL: exactly-once-reframing] “Exactly once” is marketing. In a distributed system with crashes, network partitions, and asynchronous downstream calls, exactly-once execution of a side effect is mathematically impossible in the general case. What we can build is a system that satisfies a precise external contract.
Three invariants [STAFF SIGNAL: invariant-based thinking]:
- At-most-once external side effect. For a given idempotency key K, the externally observable side effect (a charge against a card, a transfer to a bank) happens zero or one time, never two or more, regardless of how many internal attempts occur.
- Reproducible successful response. Once a request with key K has produced a response, every subsequent retry with key K and the same request fingerprint observes a byte-identical response: same status, same headers, same body, including any UUIDs and timestamps we minted.
- Fingerprint-bound replay. A request with key K and a different fingerprint than the original is never replayed and never executed; it is rejected as misuse.
Internally the system may attempt the side effect 0, 1, or many times. What makes the contract holdable is that the side effect itself must be idempotent on a key we control. [STAFF SIGNAL: end-to-end idempotency] API-layer idempotency without downstream cooperation is a lie; it just relocates the double-charge from “our bug” to “the bank’s bug.” The PSP/issuer must dedupe on our key. This is a hard partner-integration requirement, not a nice-to-have.
[STAFF SIGNAL: side-effect-ordering-as-central] The central engineering problem is the ordering between the idempotency record write and the side-effect execution under arbitrary crashes. Every other concern (concurrency, response storage, retention, fingerprinting) is downstream of that; the rest of the design is the choreography that keeps this ordering correct under arbitrary process death.
2. Scoping
[STAFF SIGNAL: scope negotiation] Committing:
- API surface: synchronous REST POST + JSON. Idempotency-Key header, UUIDv4-shape, ≤255 chars. Endpoints: POST /v1/charges, POST /v1/transfers. Async webhook delivery has its own related idempotency layer; out of scope here.
- Side effect: outbound HTTPS to a card network / PSP / issuer. Downstream supports an idempotency key we send, treated as a hard partner-integration requirement.
- Retention: 24h replay window + 24h tombstone for stale-key error messaging; hard delete at 48h.
- Multi-region: active-active across two regions. Each idempotency key is pinned to a home region at first write. Cross-region retries forwarded to home region; we accept ~80ms penalty on cross-region retries to avoid global consensus on every claim.
- Throughput: 50K req/s steady, 100K req/s peak. Replay rate ~3%. Concurrent same-key conflict rate empirically <0.05%.
- Tenancy: multi-tenant. Per-tenant rate limits upstream prevent one tenant saturating a shard.
3. Capacity Math
[STAFF SIGNAL: capacity math]

    Per-record size:
      key (UUID)                         36 B
      fingerprint (SHA-256)              32 B
      request metadata                  ~64 B
      response status + headers        ~512 B
      response body (typical charge)   ~1.5 KB
      state, timestamps, lease, fence   ~64 B
      ───────────────────────────────────────
      total                            ~2.2 KB

    Steady state (24h retention, 50K req/s):
      50,000 × 86,400 × 2.2 KB ≈ 9.5 TB

    Peak window (48h tombstone, 100K req/s):
      upper bound ≈ ~38 TB

    Latency budget per request:
      Conditional claim write      1.5 ms p50, 5 ms p99
      Response durable write       2.0 ms p50, 6 ms p99
      Idempotency overhead total   ~3.5 ms p50, ~10 ms p99
      PSP side-effect              80–300 ms (dominates)

    Lease tuning:
      Initial lease       30 s   (PSP timeout + buffer)
      Heartbeat renewal   10 s   (while side-effect inflight)
      Hard ceiling        180 s

A 9.5TB working set at 50K req/s requires a horizontally sharded substrate; single-Postgres is excluded on this number alone.
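The sizing figures above can be re-derived with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 KB = 1,000 B), reading the "~1.5 KB" body as 1,500 B, and assuming every request writes one record; the names `FIELD_BYTES`, `RECORD_BYTES`, and `storage_tb` are illustrative only.

```python
# Field sizes in bytes, taken from the per-record breakdown (1.5 KB read as 1,500 B).
FIELD_BYTES = {
    "key": 36,
    "fingerprint": 32,
    "request_metadata": 64,
    "response_status_headers": 512,
    "response_body": 1_500,
    "state_timestamps_lease_fence": 64,
}
RECORD_BYTES = 2_200  # the rounded ~2.2 KB figure used for sizing


def storage_tb(req_per_s: int, retention_hours: int, record_bytes: int = RECORD_BYTES) -> float:
    """Working-set size in decimal terabytes, assuming one record per request."""
    records = req_per_s * retention_hours * 3600
    return records * record_bytes / 1e12


record_total = sum(FIELD_BYTES.values())  # close to the ~2.2 KB total
steady_tb = storage_tb(50_000, 24)        # 24h retention at steady load
peak_tb = storage_tb(100_000, 48)         # 48h retention at peak load
print(record_total, round(steady_tb, 1), round(peak_tb, 1))
```

Because roughly 3% of traffic is replay and writes no new record, these numbers are a slight overestimate, which is the safe direction for capacity planning.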
4. State Machine of an Idempotency Record
[STAFF SIGNAL: state-machine precision]

                            ┌─────────┐
                            │ (none)  │  no record exists
                            └────┬────┘
                                 │ conditional INSERT
                                 │ ON CONFLICT DO NOTHING
                                 │ + fence_token = N
                                 ▼
                          ┌──────────────┐
        lease expires ◄───┤  IN_FLIGHT   │───► concurrent same-key arrives
        (orphan recovery: │  fence = N   │     → poll loop on this row
         fence → N+1)     │ lease_exp=T  │
                          └──────┬───────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          │ CAS on fence=N       │ CAS on fence=N       │ CAS on fence=N
          │ side-effect ok       │ recoverable failure  │ terminal failure
          ▼                      ▼                      ▼
    ┌───────────┐        ┌───────────────┐     ┌─────────────────┐
    │ COMPLETED │        │ FAILED_RETRY  │     │ FAILED_TERMINAL │
    │ resp body │        │ (next attempt │     │ resp body of    │
    │ stored    │        │  may proceed) │     │ failure stored  │
    └─────┬─────┘        └───────┬───────┘     └────────┬────────┘
          │                      │                      │
    retry → replay       retry → re-claim       retry → replay
    200/201 + body       (new fence)            4xx + body
                                 │
                                 └─► back to IN_FLIGHT (bounded)

    Terminal states age:
       ┌──────────────┐
       │   EXPIRED    │  24h after last update
       │ (tombstone)  │  → retries get 410 Gone
       └──────┬───────┘
              │ +24h
              ▼
       (deleted, key reusable as new request)

Transitions and atomicity:
- (none) → IN_FLIGHT: conditional INSERT, atomic at the storage layer; issues fresh fence token. Crash here: insert either committed or didn’t; next retry sees no row (re-claim) or IN_FLIGHT with dead lease (orphan path).
- IN_FLIGHT → COMPLETED / FAILED_TERMINAL: conditional UPDATE with WHERE fence = N. Fence prevents a stale-but-resumed worker (post GC pause) from clobbering a fresh worker’s result. [STAFF SIGNAL: lease/fencing discipline]
- IN_FLIGHT → IN_FLIGHT (orphan recovery): a fresh request finds lease_exp < now(), conditional-UPDATEs to bump fence to N+1 and reset lease. The original holder, if it ever wakes, fails its CAS at WHERE fence=N and aborts.
- FAILED_RETRY → IN_FLIGHT: explicit re-claim. Bounded by an attempts counter to prevent infinite retry loops; exceeding it transitions to FAILED_TERMINAL with an operational alert.
- * → EXPIRED: explicit application-level check at read time (if ttl < now() return 410). Physical deletion via DynamoDB TTL within 48h after the ttl mark.
The hardest case is the IN_FLIGHT orphan path. Lease too short → false expiry while side effect is genuinely running → potential double-execute (mitigated only by PSP dedupe on our key). Lease too long → real failures cause user-visible stalls. We pick 30s base + 10s heartbeat, hard ceiling 180s, with the explicit assumption that the PSP also dedupes — the in-flight key from the original holder and the retry holder are the same key, so the PSP collapses both attempts to a single side effect and returns the canonical response to whichever attempt commits the CAS first.
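The claim, orphan-recovery, and fenced-completion transitions can be modelled in a few lines. This is a toy, single-process sketch rather than the real store: `Store`, `Record`, and the method names are hypothetical, time is passed in explicitly, and each method stands in for one conditional write that the real storage layer would execute atomically.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    state: str
    fence: int
    lease_exp: float
    response: Optional[bytes] = None


class Store:
    """In-memory stand-in for the idempotency store (hypothetical)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Record] = {}

    def claim(self, key: str, now: float, lease_s: float = 30.0) -> Optional[int]:
        """(none) -> IN_FLIGHT. Returns the fence token, or None if a row exists."""
        if key in self.rows:
            return None
        self.rows[key] = Record("IN_FLIGHT", 1, now + lease_s)
        return 1

    def reclaim_orphan(self, key: str, now: float, lease_s: float = 30.0) -> Optional[int]:
        """IN_FLIGHT -> IN_FLIGHT when the lease is dead: bump fence, reset lease."""
        row = self.rows.get(key)
        if row is None or row.state != "IN_FLIGHT" or row.lease_exp >= now:
            return None
        row.fence += 1
        row.lease_exp = now + lease_s
        return row.fence

    def complete(self, key: str, fence: int, response: bytes) -> bool:
        """IN_FLIGHT -> COMPLETED, only if the caller still holds the current fence."""
        row = self.rows.get(key)
        if row is None or row.state != "IN_FLIGHT" or row.fence != fence:
            return False
        row.state, row.response = "COMPLETED", response
        return True


store = Store()
fence_a = store.claim("K", now=0.0)            # server A wins the claim
fence_b = store.reclaim_orphan("K", now=61.0)  # A's lease expired; B takes over
stale_ok = store.complete("K", fence_a, b"stale")       # A wakes up late
fresh_ok = store.complete("K", fence_b, b'{"ok":true}')  # B commits
```

In this run the late writer's completion is refused because its fence no longer matches, which is the behaviour the orphan path depends on.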
5. High-Level Architecture
             ┌────────────────────────────────────┐
             │         Client (merchant)          │
             │    sends Idempotency-Key header    │
             └─────────────┬──────────────────────┘
                           │ HTTPS
                           ▼
             ┌──────────────────────────────────────┐
             │             API Gateway              │
             │  TLS, per-tenant rate limit,         │
             │  forwards to home region for key     │
             └─────────────┬────────────────────────┘
                           │
                           ▼
             ┌──────────────────────────────────────┐
             │         Payments App Server          │
             │  ┌────────────────────────────────┐  │
             │  │ 1. Compute fingerprint         │  │
             │  │ 2. Conditional claim (DDB)     │  │
             │  │ 3. If COMPLETED → replay       │  │
             │  │ 4. If IN_FLIGHT → poll/wait    │  │
             │  │ 5. If claimed:                 │  │
             │  │    a. Outbox row in same txn   │  │
             │  │    b. Mint stable IDs          │  │
             │  │    c. Call PSP w/ our key      │  │
             │  │    d. CAS → COMPLETED + body   │  │
             │  │ 6. Reply                       │  │
             │  └────────────────────────────────┘  │
             └──────┬─────────────────────┬─────────┘
                    │                     │
        conditional ▼                     ▼ outbox
        claim/CAS
             ┌────────────────┐    ┌──────────────┐
             │ Idempotency    │    │ Outbox Table │
             │ Store (DDB,    │    │ (same DB,    │
             │ sharded by key,│    │ same txn as  │
             │ multi-AZ, PITR)│    │ claim)       │
             └────────────────┘    └──────┬───────┘
                                          │ CDC / poll
                                          ▼
                                   ┌────────────────┐
                                   │ Outbox Worker  │
                                   │ Pool (PSP      │
                                   │ at-least-once  │
                                   │ with our key)  │
                                   └──────┬─────────┘
                                          │ HTTPS (idempotent)
                                          ▼
                                   ┌────────────────┐
                                   │ PSP / Network  │
                                   │ (dedupes on    │
                                   │ our key)       │
                                   └────────────────┘

The synchronous path is the fast path; the outbox is the recovery path. In steady state the app server completes the PSP call inline and CAS-writes COMPLETED. On crash mid-execute, the outbox worker retries the PSP call (which dedupes on the same key), receives the canonical response, and writes COMPLETED. The same downstream key serves both paths, which is what makes recovery determinism-preserving.
6. Side-Effect Ordering and the Outbox Pattern
[STAFF SIGNAL: side-effect-ordering-as-central] Walk the four candidate orderings.
Option A — Side effect first, then store record. Charge the card, then write the idempotency record. Crash between → next retry sees no record → charges again. Reject: silent double-charge.
Option B — Store record first, then side effect (the naive textbook design). Mark IN_FLIGHT, execute, mark COMPLETED. Crash mid-execution leaves IN_FLIGHT. Recovery requires (1) lease timeout for orphan detection and (2) the side effect to itself be idempotent so the next attempt doesn’t double-charge. Without (2), unsafe under crash. With (2), it works for the synchronous path but has no story for “server got PSP 200 but died before writing COMPLETED” — the next retry sees IN_FLIGHT, eventually times out, re-attempts the PSP call (which dedupes and returns 200 again), and writes COMPLETED. Functional, but the second PSP response need not byte-equal the first (different PSP-side request_id, different timestamps), breaking the byte-identical-replay invariant. Acceptable; inferior on determinism.
Option C — Two-phase commit between idempotency store and PSP. Pre-allocate a transaction ID, pre-write a “preparing” record, call PSP with prepared ID, finalize. Adds round-trips; PSP must support prepare semantics, which most don’t. Reject: pushes coordination protocol onto a downstream we don’t control.
Option D — Outbox pattern. [STAFF SIGNAL: outbox-pattern-or-equivalent] In a single local DB transaction, atomically:
    BEGIN;
    -- Claim the key; the insert is a no-op if a row for it already exists.
    WITH claim AS (
      INSERT INTO idempotency (key, fingerprint, state, fence, lease_exp, stable_response_id)
      VALUES (:key, :fingerprint, 'IN_FLIGHT', 1, now() + interval '30 seconds', :generated_uuid)
      ON CONFLICT (key) DO NOTHING
      RETURNING key
    )
    -- Enqueue the PSP call only when the claim above actually inserted a row.
    INSERT INTO outbox (key, payload, target, status, attempts)
    SELECT key, :payload, 'psp.charge', 'PENDING', 0 FROM claim;
    COMMIT;

Both rows commit or neither does. After commit, the synchronous path optimistically calls the PSP inline. On success: CAS UPDATE the idempotency row to COMPLETED + serialized body and mark the outbox row DONE in a single transaction. On any failure or crash: the outbox row remains PENDING; a worker pool tails the outbox (CDC or short poll) and retries the PSP call with our key as the downstream idempotency key. The PSP dedupes, we get the canonical response back, and the worker writes COMPLETED + DONE.
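The both-or-neither property of the claim and the outbox row can be exercised locally. The sketch below swaps in SQLite (via Python's standard `sqlite3` module) purely as a stand-in for the real store; the trimmed table layout and the function names `open_db` and `claim_with_outbox` are assumptions made for illustration.

```python
import sqlite3
from typing import Optional


def open_db() -> sqlite3.Connection:
    """In-memory database with cut-down versions of the two tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE idempotency (
            key TEXT PRIMARY KEY, fingerprint BLOB, state TEXT, fence INTEGER);
        CREATE TABLE outbox (
            key TEXT, payload TEXT NOT NULL, target TEXT, status TEXT, attempts INTEGER);
        """
    )
    return db


def claim_with_outbox(db: sqlite3.Connection, key: str, fingerprint: bytes,
                      payload: Optional[str]) -> bool:
    """Write the claim and the outbox row in one transaction; True if we won."""
    with db:  # commits on normal exit, rolls back if an exception escapes
        cur = db.execute(
            "INSERT OR IGNORE INTO idempotency (key, fingerprint, state, fence) "
            "VALUES (?, ?, 'IN_FLIGHT', 1)",
            (key, fingerprint),
        )
        if cur.rowcount == 0:
            return False  # another request already holds this key
        db.execute(
            "INSERT INTO outbox (key, payload, target, status, attempts) "
            "VALUES (?, ?, 'psp.charge', 'PENDING', 0)",
            (key, payload),
        )
        return True
```

Calling `claim_with_outbox` twice with the same key should return `True` and then `False`, leaving exactly one outbox row. If the outbox insert fails (here, a `None` payload violates `NOT NULL`), the claim row is rolled back with it.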
The unavoidable conclusion [STAFF SIGNAL: end-to-end idempotency]: outbox works only because the PSP itself dedupes on our key. If the downstream is not idempotent, no API-layer design prevents double-charge under crash. Idempotency is a property of the entire pipeline; designing it as a single API-layer module is the mid-level mistake.
Why inline-with-outbox-recovery, not pure async outbox? [STAFF SIGNAL: rejected alternative] Pure async (always queue, never call PSP inline) is correct and simpler but adds 50–500ms tail latency on every charge. Merchants reject. We pay complexity to keep the median path inline.
Why polling outbox, not transactional Kafka? [STAFF SIGNAL: rejected alternative] Kafka transactions add a transaction coordinator and isolation-level reasoning; at our scale a polling outbox table on the same DB is simpler and the load is bounded by the failure rate, not the request rate. If our scale grew 10× we’d revisit.
Stable IDs minted at claim, not response time. Every response field that varies between executions — charge_id, created_at, server-generated reference numbers — is generated at claim time and stored in the idempotency row. The PSP call passes them in. The response object embeds them. On retry, replay regenerates nothing.
7. Concurrent Same-Key Handling
[STAFF SIGNAL: concurrent-same-key]
Client (or retry storm) sends K twice → two app servers.

    ┌────────────┐                            ┌────────────┐
    │  Server A  │                            │  Server B  │
    └─────┬──────┘                            └─────┬──────┘
          │ CondPut(K, IN_FLIGHT, fence=1)          │
          │   if attribute_not_exists(K)            │
          ├──────────────► Idempotency Store ◄──────┤
          │ ◄ SUCCESS ─                             │ CondPut(K, IN_FLIGHT, fence=1)
          │                                         ├─────► Store
          │                                         │ ◄ ConditionalCheckFailed
          │ executes side-effect (PSP)              │
          │ ...                                     │ read existing row
          │                                         │ → IN_FLIGHT, lease_exp=T
          │                                         │
          │                                         │ poll loop:
          │                                         │   read every 50ms,
          │                                         │   bounded 5s + jitter
          │ CAS: IN_FLIGHT,fence=1 → COMPLETED      │
          ├──────────────► Store                    │
          │ ◄ SUCCESS ─                             │ poll: row=COMPLETED
          │                                         │ ◄ row=COMPLETED + body
          │ reply to client                         │ replay stored response
          │                                         │ to its (retry) client

Atomic claim primitive: DynamoDB PutItem with ConditionExpression: attribute_not_exists(pk). Strongly consistent on the partition. Postgres equivalent: INSERT ... ON CONFLICT DO NOTHING RETURNING *.
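As a rough illustration of the claim primitive, the sketch below only builds the request parameters for a DynamoDB `PutItem` call with the `attribute_not_exists(pk)` condition. The table name, the attribute names other than `pk`, and the helper `build_claim_request` are assumptions; sending the request (for example with boto3's low-level client) is left as a comment so the snippet runs without AWS access.

```python
from typing import Any, Dict


def build_claim_request(table: str, key: str, fingerprint: bytes,
                        lease_exp_ms: int) -> Dict[str, Any]:
    """Parameters for a conditional PutItem that succeeds only for the first claimer."""
    return {
        "TableName": table,
        "Item": {
            "pk": {"S": key},
            "state": {"S": "IN_FLIGHT"},
            "fence": {"N": "1"},
            "lease_exp": {"N": str(lease_exp_ms)},
            "fingerprint": {"B": fingerprint},
        },
        # The write is rejected if an item with this partition key already exists.
        "ConditionExpression": "attribute_not_exists(pk)",
    }


request = build_claim_request("idempotency", "K", b"\x00" * 32, 1_700_000_030_000)
# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("dynamodb").put_item(**request)
# The losing writer receives a ClientError whose error code is
# "ConditionalCheckFailedException" and falls through to the poll path.
```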
Rejected: [STAFF SIGNAL: rejected alternative] distributed lock service (ZooKeeper/etcd) — adds external dependency and moves the claim out of the same store as the data, breaking claim+outbox atomicity. Also rejected: Redis SETNX as source of truth — async replication can lose the most recent claim on failover, producing two winners.
Losing-side behavior: poll the row with bounded backoff, capped at 5s + small jitter. If it reaches COMPLETED, replay. If it reaches FAILED_TERMINAL, replay the failure response. If still IN_FLIGHT after 5s, return 409 Conflict with {"error":"idempotency_key_in_use","retry_after_ms":5000}. We do not silently extend the wait; clients must distinguish “in flight” from “stuck” and SDK retries should not pile up.
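The losing-side wait can be sketched as a bounded poll. This is an illustrative version only: `wait_for_winner` and the row shape (`state`, `status`, `body`) are hypothetical, the poll interval is fixed at 50 ms, and the clock and sleep functions are injectable so the loop can be tested without real waiting.

```python
import random
import time
from typing import Callable, Dict, Tuple

IN_USE = b'{"error":"idempotency_key_in_use","retry_after_ms":5000}'


def wait_for_winner(read_row: Callable[[], Dict],
                    max_wait_s: float = 5.0,
                    interval_s: float = 0.05,
                    jitter_s: float = 0.25,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> Tuple[int, bytes]:
    """Poll the row until it is terminal, or give up with 409 after the deadline."""
    deadline = clock() + max_wait_s + random.uniform(0.0, jitter_s)
    while True:
        row = read_row()
        if row["state"] in ("COMPLETED", "FAILED_TERMINAL"):
            return row["status"], row["body"]  # replay success or stored failure
        if clock() >= deadline:
            return 409, IN_USE  # still in flight: tell the client, do not keep waiting
        sleep(interval_s)
```

Any state other than COMPLETED or FAILED_TERMINAL is treated as "still in flight" here, so a row sitting in FAILED_RETRY simply keeps the loser waiting until the deadline.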
[STAFF SIGNAL: lease/fencing discipline] Fencing token discipline. The row carries a monotonic fence counter. Every CAS to advance state requires WHERE fence = N. If server A GC-pauses for 60s, lease expires; server B reclaims via WHERE fence=1 AND lease_exp<now(), bumps to fence=2. When server A wakes and tries to write COMPLETED at WHERE fence=1, the CAS fails. Server A aborts and issues no further side effect. If server A had already issued a PSP call before pausing, the PSP dedupes our key — whichever attempt the PSP committed first is canonical, and server B’s CAS captures the corresponding response. [STAFF SIGNAL: blast radius reasoning] This is the Kleppmann distributed-lock argument transplanted: locks alone are insufficient; the side-effect target must also enforce the fence (here, by deduping on our key).
8. Response Storage and Replay
[STAFF SIGNAL: response-replay-with-determinism] A retry must observe a byte-identical response, including any IDs we minted.
Determinism capture at first execute. Every non-deterministic field — server-minted charge_id, created_at, receipt numbers, anything that would differ between “compute again” and “remember from last time” — is generated once at claim time, stored in the idempotency row, and re-emitted on replay. A naive design that regenerates breaks the contract: clients caching by ID see two different “successful” charges and reconcile incorrectly.
Concrete: charge_id is generated before the PSP call so we can pass it as our reference. The full response object, including PSP-returned fields we surface, is serialized verbatim into the COMPLETED record.
Ordering: durable response write before reply. If we reply first and async-write the response, a crash in between loses the response. The next retry sees IN_FLIGHT, lease expires, outbox re-fires the PSP, gets the deduped response back — but the second PSP response need not byte-equal the first (different PSP-side request_id, different timestamp). Contract broken.
The correct ordering:
    PSP returns 200
          │
          ▼
    Serialize full response (using stable IDs minted at claim)
          │
          ▼
    CAS UPDATE: IN_FLIGHT,fence=N → COMPLETED, body=<bytes>
    (durably committed, multi-AZ)
          │
          ▼
    Reply to client with same bytes

We pay roughly 2 ms at p50 and 6 ms at p99 for the durable write before reply. We do not optimize this away with an in-memory cache and async durability: a failover would lose recent responses and break determinism. The latency is the cost of the contract.
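The ordering above, combined with minting stable IDs at claim time, can be sketched as follows. This is a simplified, single-process illustration: a plain dict stands in for the durable store, fencing and leases are omitted, and the names `mint_stable_ids`, `handle`, and the `ch_` ID prefix are hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Callable, Dict


def mint_stable_ids() -> Dict[str, str]:
    """Generated once, at claim time, and stored in the idempotency row."""
    return {
        "charge_id": "ch_" + uuid.uuid4().hex,  # hypothetical ID format
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def handle(store: Dict[str, dict], key: str, psp_call: Callable[[str], dict]) -> bytes:
    row = store.get(key)
    if row is not None and row["state"] == "COMPLETED":
        return row["body"]  # replay the stored bytes; regenerate nothing
    if row is None:
        row = store[key] = {"state": "IN_FLIGHT", "stable_ids": mint_stable_ids(), "body": None}
    # First attempt or recovery attempt: reuse the IDs already in the row.
    psp_result = psp_call(row["stable_ids"]["charge_id"])
    body = json.dumps({**row["stable_ids"], **psp_result},
                      sort_keys=True, separators=(",", ":")).encode("utf-8")
    row["state"], row["body"] = "COMPLETED", body  # stands in for the durable CAS write
    return body  # reply only after the write, with the same bytes
```

A second call with the same key returns the stored bytes without touching the PSP, and a recovery attempt on a row left IN_FLIGHT reuses the `charge_id` that was minted at claim.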
Large responses. Payments responses are <10KB; kept inline. For larger payloads (statement attachments), store a content-hash pointer plus blob in object storage with the same retention.
9. Fingerprint Validation
[STAFF SIGNAL: fingerprint-validation] Threat: a client sends key K with one body, then reuses K with a different body (for example, “charge 1000”). Naive replay returns a fake success for a request that never ran. Naive re-execute breaks the at-most-once invariant. Either way, broken.
Mitigation. At first claim, store fingerprint = SHA-256(canonical(method | path | normalized_body | tenant_id)) in the row. On retry, recompute and compare:
- Match → replay stored response.
- Mismatch → reject with 422 Unprocessable Entity, error code idempotency_key_fingerprint_mismatch. Do not execute, do not replay. Log for fraud analysis.
What goes into canonical(...):
- method, path: yes.
- Request body: yes, after JSON canonicalization (sorted keys, no whitespace, integer/string normalization).
- tenant_id (or auth principal hash): yes; same key from different tenants is a different request, full stop.
- Idempotency-Key itself: no; it’s the key, not the payload.
- User-Agent, X-Request-ID, IP, accept-language, trace headers: no. Operationally varying; including them creates spurious mismatches when an SDK upgrades.
- Content-Type: yes if it changes parsing semantics.
The grey zone — number normalization. {"amount": 100} vs {"amount": 100.0} must canonicalize to the same fingerprint, or SDK serializer changes break replay. We canonicalize money as integer minor units (cents) before hashing.
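A minimal sketch of the fingerprint computation, assuming the body is a flat JSON object whose `amount` field is already expressed in minor units. The helper names, the newline separator between parts, and the rule of rejecting fractional amounts are illustrative choices, not part of the specification above.

```python
import hashlib
import json
from typing import Any, Dict


def normalize_money(body: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce an integral `amount` (100 or 100.0) to an int of minor units."""
    out = dict(body)
    amount = out.get("amount")
    if isinstance(amount, float):
        if not amount.is_integer():
            raise ValueError("amount must be a whole number of minor units")
        out["amount"] = int(amount)
    return out


def canonical_json(body: Dict[str, Any]) -> str:
    """Sorted keys and no insignificant whitespace."""
    return json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def fingerprint(method: str, path: str, body: Dict[str, Any], tenant_id: str) -> bytes:
    """SHA-256 over method, path, canonical body, and tenant."""
    parts = [method.upper(), path, canonical_json(normalize_money(body)), tenant_id]
    # Newline-joined for simplicity; a length-prefixed encoding would be stricter.
    return hashlib.sha256("\n".join(parts).encode("utf-8")).digest()
```

With this sketch, `{"amount": 100, "currency": "usd"}` and `{"currency": "usd", "amount": 100.0}` hash to the same 32-byte value, while a different amount or a different tenant produces a different one.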
Rejected: [STAFF SIGNAL: rejected alternative] comparing full request bytes — too brittle; any whitespace change breaks replay.
Operational signal. Sustained nonzero fingerprint mismatch rate is either (a) a client bug regenerating the same key for different requests, (b) an SDK serialization change, or (c) abuse. Surface to per-tenant dashboards; alert at 10/min/tenant.
10. Storage Substrate: DynamoDB
Committed choice: DynamoDB (Spanner where multi-region symmetric workloads dominate).
Requirements check:
- Conditional writes with strong per-partition consistency — yes.
- Horizontal scale at 100K req/s peak — yes, partitioned by hash of idempotency key.
- Multi-AZ durability — default.
- TTL — native (with up to 48h delete lag, which is why we tombstone in application code, not rely on TTL for correctness).
- Per-shard hot-key risk — keys are UUIDs, uniformly distributed; per-tenant rate limits prevent adversarial concentration on a partition.
Rejected [STAFF SIGNAL: rejected alternative]:
- Single Postgres with partitioning. Works to ~30K req/s with care. At our scale we’d be sharding by hand and managing failover at shard level — re-implementing DynamoDB worse. Rejected on operability.
- Redis as source of truth (AOF + replication). Sub-ms latency wins, but Redis failover under partition can drop the most recent fsync — and that’s the most recent claim. For payments the worst case is “lost claim for an in-flight charge” → double-charge. Acceptable as a read-through cache in front of DynamoDB for the COMPLETED replay path (5–10× cost reduction on retry-heavy traffic) but never as source of truth for the claim.
- CockroachDB / Spanner. Strong consistency without sharding pain, multi-region native. Best for global active-active. We pick DynamoDB because home-region pinning gives us most of the multi-region benefit at lower cost; if the workload became truly region-symmetric we’d revisit.
Schema (DynamoDB):
    PK: idempotency_key (string, hash key)
    Attributes:
      state         NEW | IN_FLIGHT | COMPLETED | FAILED_RETRY | FAILED_TERMINAL | EXPIRED
      fence         number, monotonic
      lease_exp     number, epoch ms
      fingerprint   binary, 32 B
      request_meta  map: method, path, tenant_id, created_at
      stable_ids    map: charge_id, created, etc. (minted at claim)
      response      map: status, headers, body (set on COMPLETED/FAILED_TERMINAL)
      outbox_id     string, → outbox table
      attempts      number, bounded
      ttl           number, epoch s, drives 48h hard delete

11. Retention and Expiration
[STAFF SIGNAL: expired-key contract] Three windows, each with a precise contract:
- 0–24h: replay window. Retries observe identical response.
- 24h–48h: tombstone window. Record marked EXPIRED. Retries get 410 Gone with {"error":"idempotency_key_expired","original_request_at":"..."}. We do not silently start a new execution; that would double-charge a client who cached a key.
- >48h: deleted. A new request with that key is treated as brand new and executed. Documented client contract.
The expired-key race. Without a tombstone, a record TTL’d at 24h plus a retry at 24h+50ms results in re-execution and double-charge. With the tombstone, the retry gets 410 Gone and the client knows to use a fresh key. The tombstone is correctness, not nice-to-have.
Mechanism. Application checks ttl < now() at read time and returns 410 if so. DynamoDB TTL physically deletes the item later, typically within 48h after ttl, but that figure is best effort rather than a guarantee, so we do not rely on DynamoDB TTL for correctness.
Rejected: [STAFF SIGNAL: rejected alternative] “extend retention forever” — storage cost grows unbounded; clients depend on bounded expiry semantics for their own bookkeeping. Bounded, documented window is correct.
API contract documented: “Idempotency keys are honored for replay for 24h after first use. Keys older than 24h that are reused will return 410 Gone. After 48h, keys are eligible to be reused for new requests.” This is part of the public API surface, not a hidden behavior.
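The three retention windows reduce to a small classification on the age of the key. The sketch below is one possible reading: it measures age from first use, treats the boundaries as half-open intervals (exactly 24h falls in the tombstone window, exactly 48h counts as new), and uses hypothetical names `KeyStatus` and `key_status`.

```python
from enum import Enum

REPLAY_WINDOW_S = 24 * 3600
TOMBSTONE_WINDOW_S = 24 * 3600


class KeyStatus(Enum):
    REPLAY = "replay stored response"
    GONE = "410 Gone"
    NEW = "treat as new request"


def key_status(first_used_at: float, now: float) -> KeyStatus:
    """Classify a retry by how long ago the key was first used (epoch seconds)."""
    age = now - first_used_at
    if age < REPLAY_WINDOW_S:
        return KeyStatus.REPLAY
    if age < REPLAY_WINDOW_S + TOMBSTONE_WINDOW_S:
        return KeyStatus.GONE  # tombstone: refuse rather than re-execute
    return KeyStatus.NEW
```

A retry arriving at 24h plus 50 ms lands in the `GONE` branch instead of triggering a second execution, which is the race the tombstone exists to close.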
12. Failure Mode Catalog
[STAFF SIGNAL: failure mode precision]

            ┌──────────────────────────────┐
            │     CRASH RECOVERY FLOW      │
            └──────────────────────────────┘

    T0  Request arrives, fingerprint computed
     │
    T1  Conditional claim: IN_FLIGHT, fence=N, lease=T1+30s
     │
     ┌── crash A (claim never committed) ──► retry: claim succeeds, normal
     │
     ├── crash B (claim wrote, app died) ──► retry: sees IN_FLIGHT, polls,
     │                                        lease expires, orphan recovery,
     │                                        outbox re-fires PSP (dedupes),
     │                                        CAS COMPLETED
     ▼
    T2  Outbox INSERT (same txn as claim)
     │
    T3  PSP call (with our key as downstream idempotency key)
     │
     ┌── crash C (PSP call mid-flight) ──► outbox re-calls; PSP dedupes;
     │                                      same canonical body → CAS COMPLETED
     ▼
    T4  PSP returns 200 + canonical body
     │
    T5  Serialize response (stable IDs already in row from T1)
     │
     ┌── crash D (got 200, didn't store) ──► IN_FLIGHT until lease;
     │                                        outbox re-fires PSP;
     │                                        dedupe → same body → COMPLETED
     ▼
    T6  CAS UPDATE: IN_FLIGHT,fence=N → COMPLETED + body
     │
    T7  Reply to client
     ┌── crash E (replied, then died) ──► next retry: COMPLETED → replay

| Scenario | Detection | Response |
|---|---|---|
| Idempotency store unavailable | Conditional write throws | [STAFF SIGNAL: fail-closed-policy] Fail closed: 503 with Retry-After. Do not proceed without a claim. Refusing service for N seconds is bounded; double-charge is unbounded. |
| Crash between claim and outbox | Both-or-neither via single txn | If neither: retry re-claims. If both: outbox worker reattempts. |
| Crash mid-PSP-call | Lease expiry | Orphan recovery; outbox re-calls PSP; PSP dedupes our key. |
| Crash post-PSP, pre-COMPLETED | Lease expiry | Outbox worker re-calls PSP → same canonical body → COMPLETED. Determinism preserved because PSP returns the same bytes. |
| Crash post-COMPLETED, pre-reply | None needed | Next retry sees COMPLETED, replays. |
| Network partition app↔store | Conditional-write timeout | Fail closed. |
| Concurrent same-key, fingerprint match | Loser sees existing row | Wait/poll → replay. |
| Concurrent same-key, fingerprint mismatch | First writer’s fingerprint vs second’s body | Reject second with 422. |
| Bad-actor key reuse, different payload | Fingerprint mismatch | 422 + abuse log. |
| Clock skew on lease | Use server-side now() from store, never client wall clock | Bounded skew within store. |
| GC pause longer than lease | Fence CAS fails | Late writer aborts; PSP dedupes any in-flight side effect. |
| Outbox worker pool down | Outbox row stays PENDING | Inline path may still succeed; alert on stuck-in-flight. |
| PSP returns ambiguous 5xx | Mark FAILED_RETRY, increment attempts | Bounded retries; after N → FAILED_TERMINAL + page. |
| Region failover | Cross-region forwarder detects | Promote standby; outbox replicated via Global Tables; same key still dedupes at PSP. |
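The fail-closed rows in the table can be illustrated with a small request wrapper. This is a sketch under assumptions: `StoreUnavailable`, `handle`, the error code in the 503 body, and the 2-second `Retry-After` default are all hypothetical, and the claim, execute, and conflict paths are passed in as callables.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], bytes]


class StoreUnavailable(Exception):
    """Raised by the (hypothetical) store client on timeout or outage."""


def handle(claim: Callable[[], bool],
           execute: Callable[[], Response],
           on_conflict: Callable[[], Response],
           retry_after_s: int = 2) -> Response:
    try:
        won = claim()
    except StoreUnavailable:
        # Fail closed: never touch the PSP without a durable claim.
        return (503, {"Retry-After": str(retry_after_s)},
                b'{"error":"idempotency_store_unavailable"}')
    return execute() if won else on_conflict()
```

The property worth checking is that `execute` is never invoked when the claim raises, so a store outage can cost availability but not a duplicate side effect.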
13. Observability
[STAFF SIGNAL: observability discipline] Per-tenant and global, plus alerting:
- Claim conflict rate (concurrent same-key losses ÷ total claims). Baseline <0.05%. Spike → retry storm or load-balancer misbehavior.
- Replay rate (COMPLETED reads ÷ total claims). Baseline ~3%. Spike → upstream client retry storm or downstream incident causing client-side timeouts. Leading indicator for downstream issues.
- Fingerprint mismatch rate per tenant. Nonzero → client bug or abuse. Per-tenant alert at 10/min.
- In-flight wait p50/p99/p999. Latency the loser pays. p99 < 1s; >5s → stuck IN_FLIGHT, lease misconfigured, or outbox falling behind.
- Stuck-in-flight gauge (rows with state=IN_FLIGHT AND lease_exp < now()-60s). Should be near zero. Nonzero → page.
- Outbox lag (oldest PENDING age). p99 <5s; breach → page.
- Fail-closed rate (503 from store-unavailable). Direct measure of customer impact during store incidents.
- Audit log: every claim, every state transition, every PSP attempt, every reply. Append-only; retained per compliance (typically 7 years for payments). Non-negotiable for chargeback investigations and SOX. This is part of correctness, not an “operations afterthought.”
End-to-end p99 SLO ≤500ms (PSP-dominated); idempotency layer’s contribution ≤15ms p99.
14. Tradeoffs Taken and What Would Change Them
We chose inline-with-outbox-recovery over pure async outbox, paying complexity for ~100ms median latency. SLO loosening to 500ms median → pure async is simpler and equally safe.
We chose home-region pinning over global consensus on every claim, paying ~80ms penalty on rare cross-region retries for ~5ms savings on the hot path. True symmetric active-active → Spanner replaces DynamoDB.
We chose 24h replay + 24h tombstone. Stripe-class. Larger windows scale storage linearly (~9.5TB → ~30TB at 72h) — affordable but unjustified.
We chose fail-closed on store unavailability. For PSP integration where double-charge is six-figure dollars per hour, this is correct. For a system where double-execution is cheaply reconcilable, fail-open with reconciliation may dominate.
15. What I Would Push Back On
[STAFF SIGNAL: saying no]
- “Exactly once.” Replaced with the three invariants. The phrase invites mid-level designs that ignore downstream cooperation.
- “Idempotency at the API layer.” API-layer dedup without downstream dedup is theater. PSP idempotency support is a hard partner-onboarding requirement; reject PSPs that don’t provide it.
- “Single region is fine.” For real payments it isn’t; commit to home-region pinning early. Retrofitting multi-region into a region-naive design is a 6-month project.
- “Just use a Redis lock.” Locks aren’t idempotency; locks plus a separate response store reintroduce the ordering problem less obviously. The idempotency record and the claim must be the same row in the same store.
- “We’ll add monitoring later.” Fingerprint-mismatch and stuck-in-flight are part of the correctness story. Without them, silent abuse or a leaking bug produces double-charges undetected for weeks.
The system above does not achieve “exactly once.” It achieves at-most-once externally, byte-identical replay within 24h, and explicit failure modes everywhere else. That is the contract a payments API can actually keep.