7-Notification System
“Design a notification system that delivers messages to users across web, mobile push, email, SMS. 100M users, support for personalized batching (‘don’t send more than 3 notifications/hour to a user’), preference management, retry.”
1. Reframing
Section titled “1. Reframing”The mid-level read of this question is “fan-out to four channels with a Kafka queue and per-user prefs in Postgres.” That answer is wrong because it locates the complexity in the wrong layer. The hard problem is not delivering a message to FCM; it is deciding what to send, to whom, on which channel, at what time, batched with what other notifications, under whose consent. The channel adapters are commodity glue. The orchestration layer — the per-recipient decision engine that consumes events and emits send commands — is where staff-level design lives. [STAFF SIGNAL: orchestration-as-central]
Two architectural facts dominate everything downstream:
Fact 1: per-user state is the scaling axis, not events/sec. Every send decision requires reading preferences, recent delivery history, current batch state, presence, device tokens, and consent flags for the recipient. At 100M users, that’s hundreds of GB of hot state, and every event multiplied by fan-out hits it. [STAFF SIGNAL: per-user-state-as-scaling-axis]
Fact 2: channels are not interchangeable. Push has <1s latency but no delivery guarantee. Email tolerates minutes but has deliverability/reputation concerns and bounce semantics. SMS costs $0.005–0.05 per message and is regulated under TCPA. Web push only works when the browser is open. In-app is real-time when online, queued otherwise. A design that treats them as a uniform “send” interface hides the entire problem. [STAFF SIGNAL: channel-heterogeneity discipline]
The third fact, which the prompt’s batching requirement forces: “don’t send more than 3/hour to a user” cannot be enforced by a stateless check before each send — by the time you check, more notifications have been generated and concurrent workers race. It requires a stateful per-user scheduler with serialized decisions per user. This shifts the architecture from stateless workers consuming a queue to stateful per-user orchestration. [STAFF SIGNAL: stateful-scheduler-for-batching]
2. Scoping
Section titled “2. Scoping”I’m committing to the following before designing: [STAFF SIGNAL: scope negotiation]
- Notification mix: ~70% social (someone followed/liked/replied), ~25% transactional (security alerts, order status, password resets), ~5% marketing. Different SLOs and consent rules per class.
- Workload shape: bimodal. Steady stream of per-user social/transactional events (~hundreds to low thousands/sec) plus occasional viral fan-out bursts (one event → 10M–100M recipients) that dominate peak load.
- Build vs buy on channel adapters: I’m building the orchestration, scheduler, and preference layer in-house, and using third-party providers (FCM, APNS, SES/SendGrid, Twilio) for the actual channel delivery. Building APNS connection pooling at scale is a solved problem and not where engineering effort earns return.
- Geographic footprint: multi-region (US, EU, APAC) for latency and data residency. Preferences and delivery state are regionally partitioned by user home region. GDPR forces EU isolation.
- Latency SLOs by class: transactional p95 < 30s end-to-end; social p95 < 2 min; marketing best-effort within hours.
- Out of scope: notification content authoring/CMS, A/B test experiment definition (we integrate with an existing experimentation platform), user-facing notification feed UI, real-time chat (different system).
What I’m explicitly rejecting as scope: this is not a feed system. Lossy delivery is not acceptable for transactional. Quality (relevance, fatigue) matters more than raw throughput.
3. Capacity Math
Section titled “3. Capacity Math”[STAFF SIGNAL: capacity math]
Assumptions: 100M registered users, 30M DAU, average 5 notifications/user/day (suppressed/delivered combined).
Metric Value Notes------------------------------------- ----------------- ----------------------------------Notifications generated/day 150M 30M DAU * 5Avg generation rate ~1,700/sec 150M / 86,400Peak (3x avg) ~5,000/sec diurnal + burstsViral fan-out burst 10M–100M one event, processed over 5–10 minDecisions/sec at peak with fan-out 50,000–100,000 each recipient = one decisionState reads per decision 3–5 prefs, freq counter, presence, tokensCache reads/sec at peak ~300,000 mandatory cache; DB cannot absorbPer-user preference state ~1 KB 100M * 1KB = 100 GBPer-user delivery state ~500 B counters, last-N timestamps; 50 GBPer-user pending batch ~200 B avg 50M active * 200B = 10 GBTotal hot state ~160 GB fits in a sharded Redis clusterDevice tokens ~3 per user 300M tokens; ~30 GBPush delivery cost ~$0 FCM/APNS freeEmail cost $0.0001/msg SES; ~$15/day at 150M*0.7 channel mixSMS cost $0.01/msg expensive; restrict to txn + opted-inSMS daily cost if 5% of notifs $75K/day forces aggressive gating policyThe SMS line is the one that drives a real architectural constraint: SMS is expensive enough that it must be gated by class (transactional only by default), by per-user opt-in, and by a hard global daily budget circuit breaker. A bug that sends marketing via SMS to all DAU is a six-figure incident. [STAFF SIGNAL: cost-explicit]
Fan-out burst arithmetic: one event with 50M recipients, at 50K decisions/sec sustained, is 1,000 seconds (~17 minutes). I will spread non-urgent fan-out over 5–10 minutes deliberately rather than try to do it instantly — saves capacity and is invisible to users.
4. High-Level Architecture
Section titled “4. High-Level Architecture” ┌─────────────────────────┐ producer services →→→ │ Event Ingest (gRPC) │ validates, classifies (txn/social/mkt), (post created, └────────────┬────────────┘ assigns event_id, persists for replay order shipped, etc.) ↓ ┌─────────────────────────┐ │ Fan-out Service │ expands 1 event → N recipients │ (sharded by event) │ batches (1000 recipients/msg) └────────────┬────────────┘ applies push-vs-pull policy ↓ ┌─────────────────────────────────┐ │ Per-User Work Queue │ Kafka, partitioned by user_id │ (key = user_id, ordered) │ ordering per user matters └────────────┬────────────────────┘ ↓ ┌──────────────────────────────────────────────┐ │ Orchestration / Per-User Scheduler │ │ (sharded by user_id, one shard owner per │ │ user; in-memory state for active users) │ │ - reads prefs (cache → DB) │ │ - applies frequency cap & batching │ │ - applies quiet hours, presence, consent │ │ - selects channel(s) │ │ - emits per-channel SendCommand │ └────────────┬─────────────────────────────────┘ ↓ ┌───────────────────┼───────────────────┬──────────────┐ ↓ ↓ ↓ ↓ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Push Queue │ │ Email Queue │ │ SMS Queue │ │ In-App Queue│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ↓ ↓ ↓ ↓ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │FCM/APNS Adp │ │ SES Adapter │ │Twilio Adptr │ │WS Push Adptr│ │ + circuit │ │ + circuit │ │ + circuit │ │ + presence │ │ breaker │ │ breaker │ │ breaker │ │ layer │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ↓ ↓ ↓ ↓ FCM/APNS SES Twilio WebSocket fleet │ │ │ │ └──────────────────┴────────┬─────────┴────────────────┘ ↓ ┌──────────────────────────┐ │ Status / Feedback Stream │ delivery confirmed, │ (callbacks from providers│ bounce, complaint, │ + provider FBLs) │ FBL spam reports └────────────┬─────────────┘ ↓ ┌──────────────────────────┐ │ Preference & State Store │ updates prefs on FBL, │ (sharded Postgres SoT + │ persists delivery history, │ Redis hot cache) │ feeds analytics/audit └──────────────────────────┘Three persistence tiers: (1) Postgres (sharded by user_id) as source of truth for preferences, consent, audit log, device tokens. (2) Redis cluster (sharded by user_id, same partitioning as orchestration) for hot per-user state — frequency counters, current batch, presence, device-token cache. (3) Kafka for inter-stage queues with per-user ordering and replay.
Rejected alternatives: [STAFF SIGNAL: rejected alternative]
- Stateless worker pool consuming an unkeyed queue: rejected because frequency caps and batching require per-user serialization. Concurrent workers for the same user race on the cap counter and produce 4-in-an-hour bugs.
- Single Postgres for all hot state: rejected — at 300K reads/sec, a relational DB on the hot path collapses or costs absurdly. Cache-aside is mandatory.
- Stream processor (Flink) for orchestration: viable alternative, considered. Rejected for primary path because operational maturity at this org is lower than for sharded services + Redis, and the per-user-actor abstraction is cleaner. Flink is used for analytics aggregation downstream.
5. The Orchestration Layer
Section titled “5. The Orchestration Layer”This is the central engine. For each (event, recipient) pair, it answers four questions and emits zero or more channel-specific send commands.
Input: (event, recipient_user_id) │ ↓┌─────────────────────────────────────────────────────────┐│ 1. Should this user be notified at all? ││ - Hard consent check (GDPR/TCPA: opt-in present?) ││ - Topic subscribed? (e.g., "replies" enabled?) ││ - Block list? (recipient blocked sender?) ││ - Mute on this thread/source? ││ → if NO at any check: emit Suppressed audit event, ││ return (no send) │└─────────────────────────────────────────────────────────┘ │ YES ↓┌─────────────────────────────────────────────────────────┐│ 2. Which channel(s)? ││ - User channel preference for this topic ││ - Channel availability (device token? verified email?││ phone verified? in-app session?) ││ - Class policy (SMS only for txn, marketing not SMS) ││ - Cross-channel rule (one of {push, email}, not both)││ → produces channel candidate set │└─────────────────────────────────────────────────────────┘ │ ↓┌─────────────────────────────────────────────────────────┐│ 3. When? ││ - Quiet hours? (user-local TZ; transactional bypass) ││ - Frequency cap budget remaining for this hour? ││ - Currently batching? (next-window scheduled?) ││ → "now" | "deferred to T" | "merge into pending" │└─────────────────────────────────────────────────────────┘ │ ↓┌─────────────────────────────────────────────────────────┐│ 4. In what form? ││ - Standalone vs merged ("3 new replies" vs 3 sends) ││ - Full content vs preview (privacy/locked-screen) ││ - Localization (user language pref) ││ → render channel-specific payloads │└─────────────────────────────────────────────────────────┘ │ ↓ emit SendCommand(s) to channel queuesData model on the read path (per recipient, every event):
| Read | Source | Latency target | Hit rate target |
|---|---|---|---|
| Preferences (channels, topics, quiet hours, lang) | Redis → Postgres | <2 ms | 99.5% |
| Frequency counter (sliding-window per hour) | Redis (per-user key) | <1 ms | 99.9% |
| Pending batch state | Redis | <1 ms | 99.9% |
| Presence (online sessions) | Redis pub/sub digest | <1 ms | 99% |
| Device tokens / verified contacts | Redis → Postgres | <2 ms | 99% |
| Block list (against sender) | Redis (bloom + set) | <1 ms | 99.9% |
A single decision is 5–10 ms wall clock, dominated by Redis round-trips. Pipelined: one Redis MULTI/pipeline call returns all state for one user in ~1 ms.
Sharding: orchestrator nodes own user_id ranges via consistent hashing. The per-user Kafka partition (also user_id keyed) routes that user’s events to the same orchestrator instance, which then holds in-memory the active per-user scheduler state. Failover: another node takes the partition, rehydrates state from Redis (the source of hot truth) within seconds. The user’s brief “outage” surfaces as a few-second latency bump, not lost notifications (Kafka retains the events).
Invariant: preferences are read from cache for the orchestrator decision and re-checked at the channel adapter immediately before send. Defense in depth — a stale cache must never cause a consent violation. Cache invalidation happens on preference write via Postgres → Redis synchronous invalidation (write-through), with a TTL backstop. [STAFF SIGNAL: invariant-based thinking]
6. Per-User Batching and Frequency Caps
Section titled “6. Per-User Batching and Frequency Caps”The naïve “check counter, then send” is broken under concurrency. The correct primitive is a per-user actor — a logical owner that processes notifications for one user serially. [STAFF SIGNAL: stateful-scheduler-for-batching]
State machine per user:
┌─────────────────────────┐ │ IDLE (no pending) │ │ sent_this_hour = k │ │ window_resets_at = T │ └────────┬────────────────┘ new notif arrives │ ↓ ┌──────────────────────────────────────┐ │ Decide deliverability for this notif│ │ (consent, channel, class, presence) │ └────┬──────────────┬──────────────────┘ high-pri │ │ normal txn ↓ ↓ ┌─────────────────────────┐ ┌────────────────────────────┐ │ Priority Breakthrough │ │ k < cap? │ │ - bypasses freq cap │ │ YES → deliver now, │ │ - own per-hour budget │ │ k++ │ │ (max 1 breakthrough/h)│ │ NO → enqueue in pending; │ │ - bypasses quiet hours │ │ schedule wake @ T+ε │ │ only if class=critical│ │ if not scheduled │ └─────────┬───────────────┘ └────────────┬───────────────┘ ↓ │ deliver ↓ ┌─────────────────────┐ │ BATCHING │ │ pending = [n1, n2] │ │ wake_at = T_window │ └────────┬────────────┘ wake fires (or │ user comes online) ↓ ┌─────────────────────┐ │ Window tick: │ │ - merge if same │ │ source ("3 new │ │ replies from X") │ │ - emit up to (cap- │ │ k) sends │ │ - rest stay queued │ │ or are dropped │ │ per class policy │ └────────┬────────────┘ ↓ IDLECap semantics: sliding hour window, not fixed top-of-hour, to prevent the 6-in-2-minutes pathology at hour boundaries (3 at 10:59, 3 at 11:00). Implemented as a sorted set in Redis: one entry per delivery with the timestamp; eviction-on-read removes entries older than 1 hour; cardinality is the current count. ZADD/ZREMRANGEBYSCORE/ZCARD in a single pipeline.
Priority breakthrough policy: notification class is one of {critical, transactional, social, marketing}. [STAFF SIGNAL: priority-breakthrough policy]
- critical (security alert, account compromise): bypasses cap and quiet hours; capped at 2/hour against an independent budget to prevent abuse.
- transactional (order shipped, password reset): bypasses cap up to 1/hour; respects quiet hours unless time-sensitive flag set.
- social, marketing: hard cap; quiet hours respected.
The two budgets — cap budget and breakthrough budget — are independent counters. A bug that misclassifies marketing as critical is contained by the breakthrough cap.
Presence-aware early flush: when a user’s app comes to foreground, the presence service publishes a “user online” event keyed by user_id. The orchestrator instance owning that user’s actor wakes the scheduler immediately, draining pending batched notifications via in-app delivery (no push needed) and possibly merging them into a single “you have 4 new things” summary. [STAFF SIGNAL: presence-aware-delivery]
Scale: 100M users, ~10M with at least one pending notification at any moment. Each active actor’s hot state is ~500 bytes in Redis plus negligible in-memory orchestrator state for currently-scheduled wakes. With 100 orchestrator nodes, ~100K active actors per node — trivial.
Rejected alternatives: Per-user scheduled jobs in a job system (Sidekiq-style): rejected because 10M scheduled jobs/hour stresses the job system and loses the in-memory locality benefit. Stateful Flink with user-keyed state: viable, considered; rejected because debugging Flink state in a 3am incident is painful and the per-user-actor pattern is more direct.
7. Fan-out Burst Handling
Section titled “7. Fan-out Burst Handling”A celebrity post fans out to 50M followers. [STAFF SIGNAL: fan-out burst handling] [STAFF SIGNAL: blast radius reasoning]
┌──────────────────────────────┐ │ Event: user_X posted │ │ follower_count = 50M │ └──────────────┬───────────────┘ ↓ ┌────────────────────────────────────┐ │ Fan-out Policy Decision │ │ fanout_size > PUSH_THRESHOLD (1M)?│ │ YES → hybrid push+pull │ │ NO → full push │ │ also: urgency class = social │ │ → spread over T_spread (5 min) │ └──────────────┬─────────────────────┘ ↓ ┌────────────────────────────────────┐ │ Recipient Iterator │ │ - reads follower index in chunks │ │ of 10K (paginated) │ │ - for each chunk, computes │ │ target enqueue time uniformly │ │ across [now, now + T_spread] │ │ - active-followers-only filter │ │ (skip 90-day inactive) │ └──────────────┬─────────────────────┘ ↓ ┌────────────────────────────────────┐ │ Batched writes to user-keyed queue │ │ 1000 recipients per Kafka message │ │ → 50M / 1000 = 50K messages │ │ Consumer expands batch into 1000 │ │ per-user records before processing │ └──────────────┬─────────────────────┘ ↓ per-user orchestration (existing path) │ ↓ ┌────────────────────────────────────┐ │ For very-inactive followers: │ │ PULL MODEL — no push at all. │ │ Notification materialized into │ │ in-app feed only, surfaced when │ │ user opens the app. │ │ Saves 30%+ of fan-out work for │ │ the long tail. │ └────────────────────────────────────┘Spreading over T_spread: instead of trying to push 50M notifications in 1 second, the fan-out service distributes them uniformly across 5 minutes. From a user perspective indistinguishable; from a capacity perspective the difference between “system melts” and “system runs at 1.5x baseline for 5 minutes.”
Push-vs-pull threshold: for fan-out > 10M with notification class = social/marketing, the long-tail (followers inactive >30 days) is moved to pull-only — they will see it in their in-app feed when they next open the app, no push, no email. For followers active in last 24h, full push. This is a deliberate quality decision: a notification has near-zero value if delivered to a dormant user, so spending capacity on them is waste.
Rejected: push to all 50M synchronously. At our per-decision cost it’s 17 minutes of full-fleet saturation per celebrity post. No spreading, queue-and-let-it-drain. Works but produces unpredictable latency for non-celebrity events queued behind. Spreading bounds the impact.
Backpressure: the Fan-out Service monitors per-user-queue depth. If queue depth > threshold, it pauses or further-spreads new fan-outs. Producers (event ingest) are not directly throttled — the queue absorbs the burst, the spreading is the throttle.
Containment of runaway fan-out: hard ceiling per event — no event can fan out more than CONFIG_MAX_FANOUT recipients (set to ~100M, the entire user base). Per-actor (sender) fan-out budget per hour: a sender producing >5 fan-out events of >10M each gets throttled — protects against a compromised celebrity account spamming.
8. Channel Adapters and Channel-Specific Concerns
Section titled “8. Channel Adapters and Channel-Specific Concerns”Each adapter is a stateless worker pool consuming from its channel queue. The work is mostly: render the channel-specific payload, call the provider with idempotency key, handle the response, post a status event back. The interesting parts are channel-specific:
Push (FCM + APNS):
- Connection multiplexing: APNS requires HTTP/2 with persistent connections; pool ~100 connections per APNS region per worker. FCM via batched HTTP.
- Token lifecycle: tokens expire silently; a 410/404/
Unregisteredfrom FCM/APNS triggers token deletion from user state. A user with all tokens expired has push automatically disabled until a fresh registration arrives from the app. - No real delivery confirmation. APNS will tell you “I accepted it”; nothing tells you the user saw it. Status callbacks here are about acceptance, not delivery.
- Per-user device set: notifications go to all of a user’s devices (matches user expectation), but this multiplies cost for high-fan-out users.
Email (SES primary, SendGrid fallback):
- Bounces (hard/soft) and complaints (FBL) flow via SNS callbacks → status stream → preference store. Hard bounce auto-disables email channel for that address. Complaint also auto-disables and logs for compliance.
- Per-recipient-domain throttling: gmail.com can take 1000s/sec from us, but a small corporate domain might rate-limit at 5/sec. The adapter shapes traffic per destination domain.
- Reputation: bounce rate >5% or complaint rate >0.1% triggers SES suppression. We monitor and proactively suppress sends to risky addresses.
- Built-in retry: provider handles transient retry; we only retry on our own ingestion failures.
SMS (Twilio):
- Cost-gated: SMS is restricted to class=critical and class=transactional, plus explicit user opt-in for any other class. Hard global daily budget; circuit-breaks at threshold.
- TCPA compliance: prior express consent recorded with timestamp + source. STOP/UNSUBSCRIBE keywords processed by Twilio webhook → preference store update within seconds; we will never send another SMS to that number.
- Time-of-day: many US states restrict SMS marketing to 8am–9pm local; the orchestrator’s quiet-hours logic enforces this with destination phone number’s area-code TZ.
- Per-message idempotency via Twilio’s
X-Idempotency-Key-equivalent (MessagingServiceSid + ProviderMessageId).
Web Push:
- Subscription managed in user state; expires; refresh on page load.
- VAPID-signed; payload < 4 KB.
- Best-effort; browser must be running.
In-App:
- Real-time when user is online via the WebSocket presence layer. Otherwise persisted in the user’s notification feed (Postgres + Redis recent cache) and surfaced on next app open. The most reliable channel because we own it end-to-end.
All adapters share an identical interface (SendCommand → DeliveryStatus), per-channel circuit breaker, per-channel rate limiter (protects provider), and idempotency by notification_id.
9. Preferences, Consent, and Regulatory
Section titled “9. Preferences, Consent, and Regulatory”[STAFF SIGNAL: regulatory awareness]
Data model (hierarchical):
user_preferences├── global│ ├── enabled_channels: {push, email, sms, web, in_app}│ ├── language: en-US│ └── quiet_hours: 22:00–07:00 local├── per_topic[]│ ├── topic: "replies" | "marketing" | "security" | ...│ ├── enabled: bool│ ├── channel_override: optional channel set│ └── frequency_cap_override: optional int└── consent_record[] ├── channel: sms | email | ... ├── source: "signup_form_v3" | "settings_toggle" | ... ├── timestamp: ... └── ip_address / user_agent (for SMS / regulated channels)Source of truth: Postgres, sharded by user_id, per-region (EU users in EU shard for GDPR). Cache: Redis. Write-through invalidation: every preference write commits to Postgres and synchronously invalidates Redis. Cache miss falls through to Postgres, but never fails-open — if both are unreachable, the orchestrator suppresses non-critical sends rather than risking a consent violation. [STAFF SIGNAL: invariant-based thinking]
Defense in depth: the channel adapter re-checks consent for the specific channel against its own cache before calling the provider. A drift between orchestrator decision and reality at send time is caught here. Adapter-level suppression is logged.
Audit log: every decision (sent / suppressed-for-reason / failed) is written to an append-only audit stream (Kafka → S3 + searchable index) with retention >7 years for regulated classes. This is the answer to “why did I get / not get this notification?” — both for users and regulators.
Specific obligations the architecture supports:
- GDPR: explicit consent for marketing recorded; right-to-be-forgotten erases user state across all stores via a coordinated tombstone (replicated to all caches and the audit log marks the user as erased while retaining minimal compliance metadata).
- TCPA: SMS opt-in stored with provenance; STOP keyword handled within seconds via Twilio webhook → preference write.
- CAN-SPAM: every marketing email contains unsubscribe link; click writes to preference store within 10 days (in practice, seconds).
10. Retry, Dead-Lettering, Idempotency
Section titled “10. Retry, Dead-Lettering, Idempotency”[STAFF SIGNAL: retry-without-amplification]
The retry-amplification trap: a transient FCM 503 should be retried; a “user marked us as spam” feedback should never be retried; an “invalid email” must immediately disable the channel.
┌─────────────────────────────────────┐ │ SendCommand → Adapter │ └────────────────┬────────────────────┘ ↓ Provider call w/ idempotency_key=notif_id │ ┌───────────┼────────────┬─────────────┐ ↓ ↓ ↓ ↓ Success Transient Hard fail Spam/FBL │ (5xx, net) (bad addr, (complaint) │ │ bounce) │ ↓ ↓ ↓ ↓ audit:OK retry budget audit:FAIL disable channel update remaining? update for user; counters YES → backoff user state audit:SPAM; enqueue (token no retry; retry delete, alert if rate NO → DLQ email elevated disable)Per-channel retry budgets:
| Channel | Max retries | Backoff | Retry on hard fail |
|---|---|---|---|
| Push | 2 | 30s, 2m | No (likely offline; not worth it) |
| 3 | 1m, 5m, 30m | No (provider already retried) | |
| SMS | 2 | 1m, 10m | No (carrier failures often permanent) |
| Web Push | 1 | 30s | No |
| In-App | 5 (own infra) | exp backoff | n/a (we own it) |
After the budget, the message goes to a per-channel dead-letter queue. DLQ items are inspected by an alerting system: a sudden spike in DLQ for one channel is a provider/regional issue and pages.
Idempotency: every notification has a globally unique notification_id = hash(event_id, recipient_user_id, channel). Adapters pass this as the provider’s idempotency key where supported (Twilio, modern email providers). For providers without native support, the adapter maintains a short-TTL Redis dedup set (SET notif_id NX EX 3600) — if the key already exists, skip the call.
Specific scenario the design must survive: a bug enqueues the same notification 10 times for 100M users. [STAFF SIGNAL: blast radius reasoning] Defenses, layered:
- Orchestrator dedup on
(user_id, notification_id)within a 24h Redis set. - Per-channel adapter dedup as above.
- Hard per-user per-channel rate ceiling (e.g., 100 sends/user/hour, regardless of class) at the adapter — defense beyond all preference-level limits.
- Global anomaly detector on per-user send rate; auto-throttles if a user’s send rate exceeds 5σ.
11. Cross-Channel Coordination
Section titled “11. Cross-Channel Coordination”Three patterns, each appropriate for different classes:
Pattern A: Single-channel with fallback (transactional default) ───────────────────────────────────────────────────────────── Orchestrator → push → wait T_fb (e.g., 60s) for ack │ ack received? ─── YES → done │ NO → orchestrator emits email send (idempotency-keyed; if push actually delivered late, the duplicate is on the user's view, not the system's)
Pattern B: Channel selection by presence (social default) ───────────────────────────────────────────────────────────── Orchestrator reads presence: user is active in web app NOW → in-app only, no push user has no recent web session → push to mobile user has neither and topic=high → digest email at next window
Pattern C: Parallel multicast (rare; only critical security) ───────────────────────────────────────────────────────────── Push + email + SMS in parallel; user expected to see all, acceptable for "your password was changed" or "new login". Marked clearly as security event.Presence-aware suppression: if the user is currently active in the in-app session, push to the same device is suppressed (the user already saw it in-app via the WebSocket). This requires a presence service maintaining current session state per user, queried in the orchestration decision. [STAFF SIGNAL: presence-aware-delivery]
Cross-channel dedup: the orchestrator emits at most one channel-set per (user_id, notification_id). If the same logical notification arrives from two upstream paths (e.g., a re-published event), the dedup key collapses them.
Rejected: parallel-to-all-channels-by-default. Produces visible duplication (“got a push and an email about the same like”), which is the #1 user complaint that drives unsubscribes. Pattern A is the default for a reason — quality over coverage.
12. Failure Modes and Graceful Degradation
Section titled “12. Failure Modes and Graceful Degradation”[STAFF SIGNAL: failure mode precision]
| Failure | Detection | Response | Containment |
|---|---|---|---|
| FCM down | error rate spike on push adapter; circuit breaker opens | push queue accumulates; if class=transactional, orchestrator failover-emits email after timeout; class=social held until recovery (with TTL) | per-channel breaker isolates from email/SMS |
| SES down | bounce on adapter | failover to SendGrid (warm secondary); if both, queue and alert | reputation damage if held too long → drop class=marketing first |
| Preference DB down | Postgres errors on cache miss | serve from Redis only; suppress non-critical if both miss; allow critical with most-recent-known consent | never send unconsented; better to delay than violate |
| Redis cluster partition | latency / unavailability on shard | degrade to direct Postgres reads, drop throughput by 10x; shed load by class (drop marketing, defer social) | consistent-hash failover on shard; per-shard isolation |
| Orchestrator overload | per-user queue depth grows | backpressure to fan-out; spread non-urgent fan-outs further; class-priority shedding | shedding order: marketing → social → transactional → critical |
| Massive fan-out (Taylor Swift) | fan-out service watermark | spread, push-vs-pull policy kicks in, long-tail goes pull-only | bounded peak load |
| Runaway notification bug (10x to 100M users) | global anomaly detector on send rate per (user, source) | per-user hard cap (defense #3) blocks the spam; DLQ accumulates; on-call paged | bug is contained at adapter ceiling, not at orchestrator logic |
| Corrupted preferences for a user | audit log + checksum on read | restore from latest snapshot; conservative defaults (no marketing, only transactional) until restored | per-user blast radius |
| FBL spike | complaint rate watcher | auto-disable channel for affected users; investigate content/source | avoids reputation damage |
| Provider idempotency-key collision | dedup set hit | skip; log; verify at audit | rare; design uses globally unique IDs |
Shed order under load is explicit: marketing first, then social non-essential, then social essential, then transactional, never critical. This is configured policy, not ad-hoc.
13. Operational Reality
Section titled “13. Operational Reality”Metrics that matter (and what they predict):
- End-to-end delivery latency p50/p95/p99 by channel, by class. Tail latency growth is the leading indicator of saturation.
- Per-channel delivery success rate. Drop signals provider issue or content-quality issue.
- Queue depth at each stage. Primary capacity signal; pages on sustained growth.
- Suppression rate by reason (consent, cap, quiet-hours, presence). Sudden change signals upstream behavior change.
- Unsubscribe rate by source/class. The most important quality metric. Rising unsubscribes mean we are over-notifying — architecture should support cutting volume, not just throughput. [STAFF SIGNAL: quality-over-throughput]
- Per-channel cost (especially SMS) vs. budget. Daily reconciliation.
- FBL/spam rate. Reputation early-warning.
Audit log: every decision indexed by user_id and notification_id. Customer support self-serve — “why didn’t I get this?” answered without engineering. This is operationally cheap and saves enormous toil.
A/B testing integration: notifications integrate with the experimentation platform via a treatment field on the SendCommand. Send-time, content, channel choice are all standard A/B’d. The experimentation system subscribes to delivery + downstream outcome events (open, click, conversion).
Cost attribution: each notification carries a source_team tag; daily cost rollups attribute SMS/email spend. Internal teams have budgets — bad citizens are visible.
Runbooks: per failure mode above, an explicit runbook with the shed-order, the failover toggles, and the metrics to watch.
14. Tradeoffs Taken and What Would Change Them
Section titled “14. Tradeoffs Taken and What Would Change Them”- Per-user actor with sharded ownership, not stateless workers. Changes if frequency caps and batching are dropped from the requirements; then a stateless fanout suffices and is simpler.
- Build orchestration, buy delivery. Changes if channel volume justifies dedicated ESP relationships.
- Spread fan-outs over minutes for non-urgent. Changes if product demands simultaneity (live events).
- Hybrid push+pull at extreme fan-out. Changes if active-rate is high or universal push is expected.
- Preferences in regional Postgres. Changes under stricter data-residency regimes needing isolation.
15. What I Would Push Back On
Section titled “15. What I Would Push Back On”[STAFF SIGNAL: saying no]
- “100M users” is the wrong scaling unit. Peak events/sec, max fan-out per event, and per-user state size predict cost and design far better. I’d ask for those numbers before sizing capacity.
- “Support all four channels equally.” In practice, push + in-app dominate volume, email is steady-state, SMS is rare and expensive. Designing for channel parity wastes engineering on SMS-at-scale features that don’t exist in the workload.
- The implicit framing that more notifications = better. [STAFF SIGNAL: quality-over-throughput] The highest-leverage notification system sends fewer, more relevant notifications. The architecture I described — frequency caps, batching, presence-aware suppression, unsubscribe-as-first-class-metric — is built to enable saying “no” to sends, not just enabling more of them. If the product team measures success by notifications-per-day-per-user going up, the system is pointed in the wrong direction.
- “Build vs buy” deserves explicit discussion. For many companies the right answer is integrate Braze/OneSignal/Customer.io and own only the upstream event layer; the conversation about what we are actually differentiating on changes the design entirely.