Distributed counter at scale
"Design a system that counts events (think: video views, like counts, ad impressions) with eventual consistency. 200B events/day, counters need to be readable within a few seconds of write. Tail latency on reads matters."Counter System at 200B events/day
Section titled “Counter System at 200B events/day”1. Scope and requirements
Section titled “1. Scope and requirements”Before committing to a design: “video views, likes, ad impressions” are three different systems. They share infra but not correctness contracts.
[STAFF SIGNAL: scope negotiation] The three are:
- View counter: display number. Off-by-1% is invisible. Drift-tolerant. Dimensional rollups (country, device, referrer) are the product. This is the cheapest semantic.
- Like counter: per-user idempotency is the whole problem. A user liking twice must collapse to one. The hot key here is the creator, aggregated across their content. Scale is 10–100x smaller than views but the dedup set is user × content, which does not fit in RAM.
- Ad impression counter: billing-grade. Undercounting costs advertisers trust, overcounting costs the platform millions per day. Audited against SOX-style controls. Exactly-once is not a lie here — it’s a contract. This is a completely different system and if the interviewer is asking for this under “eventual consistency,” they’ve mis-specified.
I’ll design for video view counters as primary, with a clear note on what changes if it were ads. View counter semantics: at-least-once ingestion, best-effort dedup within a short window, display counts eventually correct to within ~0.1% of the “truth” (which is itself ill-defined), dimensional rollups supported, no billing linkage. [STAFF SIGNAL: correctness semantics precision]
Requirements I’m committing to:
- 200B events/day globally. Peak burst factor 3–5x (Super Bowl, World Cup, K-pop drops). Peak = ~10M events/sec.
- Read freshness: “few seconds.” I will push back on this in §9 — the cost delta between 1s and 10s is large and “few seconds” is almost certainly not a real product requirement.
- Read latency: p99 < 10ms on counter lookup, p999 < 50ms. This one is real; it’s on the video playback hot path.
- Dimensional rollups:
count by (video, country, device, hour). Cardinality on video is ~1B lifetime, ~100M active (read in last 30d).
2. Capacity math
Section titled “2. Capacity math”Writes Avg: 200B / 86400 = 2.31M events/sec Peak: ~10M events/sec (5x burst) Event size: 250B on wire after enrichment (video_id 8B, user_id 8B, ts 8B, geo 2B, device 4B, ref_source 8B, session 16B, headers/framing ~196B) Raw bandwidth peak: 2.5 GB/s ingest, 200 Gbps aggregate Raw storage: 200B × 250B = 50 TB/day uncompressed, ~12 TB/day with zstd (4x ratio on these fields) → 4.4 PB/year raw archiveReads Video playback emits 1 count read per play. Global plays ≈ video events → read QPS ≈ write QPS in steady state ≈ 2.3M QPS avg, 10M peak. Mitigated heavily by caching; serving tier QPS after 95% edge hit ≈ 115k QPS (avg), 500k QPS (peak).Hot set Zipfian: top 0.1% of videos = ~100k videos get ~50% of traffic. Keep top 10M videos (active counters for last 7d) in RAM. Per-counter state: video_id(8) + K shard counts (K avg = 4, max 256) + last_update_ts(8) + tombstone bits ≈ 80B average, worst case 2 KB for hot shards. RAM for hot set: 10M × 80B = 800 MB base counters, + 1000 hot-sharded videos × 2KB = 2 MB negligible → one Redis shard per region (16GB box) holds hot set with headroomAggregation ratio needed If read path hits Redis for every count, we need commits per key <= 1/s. Input: 10M writes/sec across ~100M active keys over a second → most keys 0-1 writes/sec, hot keys 1M/sec Aggregation job must compress 10M input writes/sec into ~10M output key-updates/sec (1:1 naive), OR ~100k/sec if we batch 100ms windows. Target: 100:1 input-to-commit ratio via 100ms micro-batch per subtask. → serving-tier write load = 100k/s per region. Redis handles this at one shard.[STAFF SIGNAL: capacity math]
3. High-level architecture
Section titled “3. High-level architecture” clients (web/mobile/TV) │ ▼ HTTPS batch POST, 10 events/req typical ┌─────────────────────┐ │ Edge Collectors │ stateless, regional, 500 nodes global │ (Envoy + lua/Rust) │ auth, schema validate, enrich (geo from IP) └──────────┬──────────┘ │ produce, partitioned by hash(video_id) mod 4096 ▼ ┌─────────────────────┐ │ Kafka: raw_events │ 4096 partitions, RF=3, retention=72h │ (or Redpanda) │ compressed=zstd, batch=64KB, linger=20ms └──────────┬──────────┘ │ ▼ ┌─────────────────────┐ ┌────────────────────────┐ │ Flink: preaggregator│──────────▶│ S3 raw archive (iceberg)│ │ tumbling 1s, EvTime │ │ 1h partitions, parquet │ └──────────┬──────────┘ │ for backfill + audit │ │ partial_aggs topic └────────────────────────┘ │ keyed by video_id (trust contract / §6) ▼ ┌─────────────────────┐ │ Flink: finalizer │ merges partials across edge preaggs │ tumbling 1s, 10s │ two-phase: hot keys split, see §5.2 │ allowed lateness │ └──────────┬──────────┘ │ commit path splits: ┌─────┴──────────────┬───────────────────┐ ▼ ▼ ▼ Redis cluster Pinot (OLAP) Kafka: commits_log (current counts) (dim rollups) (downstream fanout)
reads ──────▶ [edge cache] ──▶ [CountService] ──▶ [L1 in-proc LRU] ──▶ Redis ──▶ Pinot (cold/dim)I’m using Flink + Kafka + Redis + Pinot. I’ll justify each against a specific alternative in §5.
4. Pre-aggregation topology
Section titled “4. Pre-aggregation topology” raw_events partition P (single Kafka partition, ~2.5k ev/s avg, 25k peak) │ ▼ ┌──────────────────────────┐ │ Preagg Task (Flink op) │ state: HashMap<video_id, count> │ │ flush: every 100ms OR 10k unique keys │ event → localMap[vid]++ │ evict: LRU on 10k cap → emit partial └──────────┬───────────────┘ │ ▼ partial record: (vid, shard_hint, count, window_start, window_end) ┌──────────────────────────┐ │ keyBy(vid) │ <-- shuffle boundary, network cost here │ + hot-key rebalance │ hot keys get split; see §5.2 └──────────┬───────────────┘ │ ▼ ┌──────────────────────────┐ │ Finalizer Task │ state: Map<vid, running_count> │ window: tumbling 1s │ backed by RocksDB (Flink state backend) │ trigger: every 1s │ checkpoint: 30s, incremental, S3 │ allowed lateness: 10s │ late events → side output → reconciler └──────────┬───────────────┘ │ ▼ committed delta: (vid, window, delta_count) serving tierWindowing model: tumbling 1s event-time windows with a 10s lateness allowance, early-firing trigger every 1s. Late events past 10s go to a side output that feeds the reconciler (§6.3). Watermark = min across sources, which in practice is bounded by the slowest edge collector clock skew (~2s).
5. Deep dives
Section titled “5. Deep dives”5.1 Pre-aggregation: partial aggregation, aggregation ratio, shuffle cost
Section titled “5.1 Pre-aggregation: partial aggregation, aggregation ratio, shuffle cost”The 100:1 aggregation target in §2 is a design constraint that drives everything downstream. The naive “emit one record per event to keyBy” pushes 10M recs/sec across the shuffle boundary, which is 200Gbps of intra-cluster traffic just for the shuffle — infeasible on most clusters.
The preagg task keeps a local hashmap, no state backend, not checkpointed — it’s deliberately volatile. On a 100ms flush tick or 10k-key cap, it emits partials. For Zipfian key distributions, a single preagg task sees ~5k unique keys per 100ms from 25k events → 5:1 local reduction. Not enough. Extend the flush to 500ms and we get 20:1, which is workable. Tradeoff: 500ms of additional p99 latency on the write path.
Why not stateful preagg with Flink-managed state? Because the preagg is pre-shuffle, so it’s running on partitions keyed by ingestion order, not by video_id. Keeping Flink-managed state here adds checkpointing cost for state that’s fundamentally transient — we’ll re-aggregate downstream anyway. The local volatile hashmap is the right abstraction. [STAFF SIGNAL: rejected alternative] Cassandra counter columns — rejected because a single counter column write at 1M ops/sec to a hot partition causes SSTable write amplification that cascades into compaction storms; we observed this pattern at Meta on analogous workloads and it’s documented in the Cassandra 4.x release notes around counter accuracy improvements. Counters in Cassandra are also non-idempotent on retry, which makes the exactly-once story worse, not better.
Shuffle optimization: the partial records have a shard_hint that lets the keyBy partitioner use (video_id, shard_hint) for hot keys and just video_id for cold keys. This requires a sideband hot-key registry — see §5.2.
5.2 The hot counter problem
Section titled “5.2 The hot counter problem”This is the problem. Every other decision is downstream of solving this.
Observation: for a Super Bowl ad running globally, a single video_id can see 1M events/sec sustained for 30s. A single Flink task can commit roughly 100k state updates/sec to RocksDB with incremental checkpoints. So the naive keyBy(video_id) → single-task-per-key pattern melts at 10x below the peak.
Solution: adaptive key splitting with a write-side shard registry and read-side fan-in.
Write path for key V with current shard count K_V:
event(V) │ ▼ lookup K_V from shard registry (Zookeeper/etcd, 100ms-stale OK) │ ├── K_V = 1 ──▶ emit (V, 0, +1) [cold key] ├── K_V = 8 ──▶ emit (V, rand%8, +1) [warm] ├── K_V = 64 ──▶ emit (V, rand%64, +1) [hot] └── K_V =256 ──▶ emit (V, rand%256, +1) [scorching] + route to dedicated "celebrity" worker pool
Storage layout in Redis: HSET video:V shard:0 <count> HSET video:V shard:1 <count> ... HSET video:V K <K_V> (metadata) HSET video:V updated <ts>
Read path: HGETALL video:V (single round trip, pipelined) count = sum(shard:i values) latency: ~1 ms for K<=64, ~3 ms for K=256
Promotion policy (runs every 10s): observed_qps(V) > 50k/s AND K_V < 8 → K_V := 8 observed_qps(V) > 500k/s AND K_V < 64 → K_V := 64 observed_qps(V) > 5M/s AND K_V < 256 → K_V := 256
demotion: observed_qps(V) < K_V * 1000/s for 1h → halve K_V (halving consolidates shards to shard:i for i < K_V/2)Why shard counts in powers of 2 with explicit registry rather than consistent-hash-style uniform splitting? Two reasons. First, reads cost O(K) network ops — we don’t want every cold key paying 256x read cost. Second, promotion requires a rewrite of historical data when K increases; with discrete tiers we do this 3 times per key max, not continuously.
The two-phase aggregation fallback for anything not yet promoted:
- Phase 1 (local): preagg task aggregates
(V, shard_hint=null)locally. - Phase 2 (keyed): finalizer receives
(V, count_delta)and commits.
When phase-1 detects a local key with >1000 events in a single 100ms flush, it auto-assigns a random shard_hint and emits N records. This is the hot-key “first response” before the registry catches up and promotes K_V globally. It prevents melting while the control plane is reacting.
Failure mode: what if the shard registry is unreachable? Edge collectors cache the last-known K_V for 10 minutes locally. A full 10-minute registry outage causes new hot videos to be under-sharded and hot-spot for that window. Acceptable because registry is replicated and this is a correlated failure the blast radius doc addresses. [STAFF SIGNAL: failure mode precision]
Threshold math: promotion at 50k/s assumes a single Flink task handles ~100k state ops/s, so we promote at 50% saturation to give headroom for the promotion itself. These numbers are hardware-specific; on newer NVMe-backed RocksDB setups I’ve seen 300k/s per task sustained, which would shift the promotion threshold to 150k/s.
5.3 Idempotency, exactly-once-ish, and the dedup problem
Section titled “5.3 Idempotency, exactly-once-ish, and the dedup problem”[STAFF SIGNAL: correctness semantics precision] “Exactly-once” across a distributed stream is a lie we tell ourselves when three things are all true: the producer is idempotent, the stream processor has transactional checkpoints, and the commit is idempotent at the sink. If any of those is false, you have at-least-once with small duplicate rates.
For this system:
- Producer idempotency: Kafka idempotent producer (per-session dedup on
<producerId, partition, sequence>). Survives retries within a session. Does NOT survive producer restart — the session ID changes. So edge collector restarts produce duplicates. - Stream processor: Flink with incremental RocksDB checkpoints every 30s, two-phase commit to Kafka commits_log. Re-processing on failure replays from the last checkpoint, and the commits_log dedup is sequence-based.
- Sink commit: Redis
HINCRBYis idempotent only if deltas are tagged with<window_id, shard_id>and we keep a Bloom filter of seenwindow_ids (1MB per 10M windows at 1% FP). Cheap.
Dedup window for client events: events carry <session_id, sequence_number>. We keep a Redis Bloom filter per session (1 KB, TTL 60s). Client retries within 60s collapse; retries after 60s (app backgrounded, resumed) get counted as a second event. This is a conscious product decision — we are not building per-user strict dedup for views. If we were (likes, ad impressions), this would be a per-user Bloom filter with hours-to-days TTL, which pushes the dedup state from MBs to hundreds of GBs and forces a sharded architecture for it.
Restart semantics: Flink crash at t=T. Job restarts from checkpoint at T-30s. Replays 30s of Kafka, reconstructs partial aggregates. The commits_log has a window_id sequence so downstream sinks reject replays. Redis already applied some deltas pre-crash; replays with same window_id get rejected by the Bloom filter. Consumers see no double-counting but may see counts pause for ~30s during recovery. [STAFF SIGNAL: failure mode precision]
5.4 Read path architecture
Section titled “5.4 Read path architecture”Three tiers:
- Edge cache (Varnish/envoy): 1s TTL per counter. Absorbs 90%+ of read load for hot videos. Staleness: ≤1s, which is within the “few seconds” requirement without any work from the backend.
- Count service with in-process LRU L1 (100k entries per instance, 100ms TTL). Shields Redis from cache miss storms during edge deploys.
- Redis cluster sharded by
hash(video_id) mod M. 32 shards, 2 replicas each, per region. ~2M QPS per shard ceiling on modern hardware with pipelining.
Why not serve directly from Pinot/Druid/ClickHouse? [STAFF SIGNAL: rejected alternative] Pinot’s p99 at 500k QPS on a keyed lookup is ~50ms, vs Redis at ~1ms. At 10ms p99 SLO, Pinot fails by 5x. We use Pinot for dimensional queries (views by country last 24h) where the query is not a single point lookup. For the bare counter, Redis wins on tail latency by an order of magnitude.
Why not DynamoDB or Cassandra? Again tail latency: DynamoDB p99 ~10ms but p999 can spike to 100ms during partition splits; Cassandra’s counter columns have compaction-induced p999 issues noted in §5.1. Redis with a warm hot set is the right answer for the display counter. The cold tail goes to a KV fallback (DynamoDB) with a 50ms SLA — fine because by definition nobody is waiting on the count for a video with 7 views.
Degradation under write pipeline lag: if Flink falls behind, Redis counts go stale. Read path returns stale count + a staleness_seconds header. The player UI can decide to show “views updating” if staleness > 30s. The counter is still available, just drifting, which is the correct behavior under eventual consistency.
5.5 Late-arriving data and backfill
Section titled “5.5 Late-arriving data and backfill”[STAFF SIGNAL: precision under ambiguity] Policy: events with event_time within 10s of watermark are included in their window. Events 10s–2h late go to a late side output feeding a reconciler job that emits a count_correction record every 10 minutes. Events >2h late are written to S3 for the next daily restate job. We do NOT drop late data silently — dropping is a correctness fiction.
Backfill mechanic for a 3-hour double-counting bug:
1. Detect via reconciler divergence (§6): live count vs batch recompute from S3 raw events diverges by >0.1% for window W.
2. Compute corrections: batch Flink job reads S3 raw for [T-3h, T], re-applies dedup, produces (vid, correct_count) for the window.
3. Compute delta: correct_count - live_count per vid.
4. Shadow apply: write corrections to Redis with key prefix shadow:video:V for 10 minutes. Dashboard team runs comparison.
5. Atomic swap: MULTI/EXEC rewrite video:V → shadow value. Publish count_correction event to downstream consumers so anyone who already ingested the wrong value can compensate.The compensation story for downstream consumers is the hard part. If a creator-analytics pipeline already snapshotted the wrong count into their daily report, sending them a correction 3 hours later means their daily report is wrong. You need versioned counters: every counter read carries a generation_id, and the correction increments the generation. Downstream consumers can detect stale reads and recompute. [STAFF SIGNAL: blast radius reasoning]
5.6 Storage, cost, and the long tail
Section titled “5.6 Storage, cost, and the long tail”The dominant cost isn’t the hot-counter RAM — it’s the billion-video tail. Most videos get <10 views/day. Keeping them in Redis is wasteful:
Counter distribution (estimated): top 10M videos: in Redis 800 MB × 3 replicas × 5 regions = 12 GB next 100M videos: in DynamoDB 100M × 80B = 8 GB, $200/mo long tail 900M: in S3 + Pinot queried rarely, merge-rollup daily
Raw archive: 12 TB/day compressed × 90d = 1 PB on S3 standard $23k/mo at $0.023/GB. Post-90d → Glacier, $4/TB-mo. Annual archive cost ~$400k. Compared to ~$10M/yr cluster cost, archive is <5%.TTL policy: counters idle for 30d evict from Redis to DynamoDB. Counters cold 365d go to Pinot + S3-backed rollup tables. A background job rolls per-day counters into per-week and per-month bands to shrink the tail.
5.7 Fan-out dimensions
Section titled “5.7 Fan-out dimensions”Views-by-(country, device, ref_source, hour) is the product ask. Naive materialization of all dimension combinations is O(|vid| × |cntry| × |dev| × |ref| × |hr|) which is 100M × 200 × 10 × 50 × 24 = 2.4 × 10^13 cells. Infeasible.
Three-level strategy:
- Full cardinality base counter in Redis: just
(video_id). - Pre-materialized common rollups in Pinot via a star-tree index:
(video, country)and(video, country, hour). Pinot’s star-tree materializes the aggregation tree incrementally — writes are O(dims) not O(cells). Query latency p99 < 50ms for these two rollups. - Query-time aggregation for anything not pre-materialized: Pinot scans raw events with predicate pushdown. Latency 1–5s, fine for analytics UI.
The cardinality defense: we cap per-video dimensional cardinality. If a single video has >1000 unique ref_source values (bot traffic signature), we bucket the tail into ref_source=other. Protects Pinot index explosion from adversarial inputs.
5.8 Observability and the trust contract
Section titled “5.8 Observability and the trust contract”At 10M events/sec, we cannot log every event’s fate. The trust contract is built from three independent measurements:
- Live pipeline count: sum of commits emitted by finalizer per window.
- Independent batch recompute: hourly Spark job over S3 raw archive, applies same dedup logic, emits per-window counts.
- Sampled end-to-end probe: synthetic events injected at edge with unique
probe_ids, traced through to Redis commit, latency measured.
Reconciliation metric: |live - batch| / batch per 1h window. Alert if >0.1% for 3 consecutive windows. This is the single most important metric the system emits. [STAFF SIGNAL: cross-cutting concern] It’s also the primitive that any downstream billing-adjacent system (ads, creator monetization) consumes to decide whether to trust a count.
Anomaly detection on counter derivatives: d(count)/dt for each video. A 10x jump in a minute without a corresponding trending-signal input is a bot or a double-counting bug. Feed these into the abuse-detection pipeline for creator-integrity review.
6. Failure modes (not an afterthought)
Section titled “6. Failure modes (not an afterthought)”Kafka raw_events partition leader election storm (a broker dies during peak): producers buffer ~30s, then start dropping. Edge collectors fall back to local NVMe spool (100GB per node, 2 hours of events buffered). Kafka recovers, collectors drain in reverse chronological order (newest first, oldest last) so fresh counts resume quickly and history fills in. Watermark stays low during drain. [STAFF SIGNAL: operational reality]
Flink finalizer OOM during a hot-key promotion: the just-promoted K=256 video’s state snapshot is 200KB instead of 2KB. Cascading snapshot costs blow through heap. Mitigation: hot keys use a separate job with dedicated heap; promotion routes the key and its state via savepoint-and-replay into the hot job. 5-minute interruption on that key’s counter, consumers see staleness header.
Redis shard failover: 30s unavailability for the affected shard’s hot set. CountService has a stale-read fallback to DynamoDB with an “approximate” marker. Display shows count from 5 minutes ago with UI indicator. Acceptable.
Dedup Bloom filter state loss: session dedup window resets; short window of ~1% duplicate counting. Detected by reconciler, corrected in next restate cycle. Published as a known incident.
The “double-counting bug in production for 3 hours” scenario (§5.5 already covered the mechanic): the key insight is that your system must have versioned counters and a correction event stream before the bug happens. Adding them after is twice the work and leaves corrupted historical data.
7. Recent developments
Section titled “7. Recent developments”[STAFF SIGNAL: modern awareness]
- Apache Paimon / Iceberg V3 for the S3 archive: streaming-native table formats with primary-key merging at read. For this system, Paimon’s bucketed LSM-on-S3 lets us treat the archive as a queryable table with CDC, which means the reconciler (§5.8) becomes a straight SQL join rather than a custom Spark job. Iceberg V3 added deletion vectors and row-level updates that similarly simplify corrections.
- Pinot’s upsert tables with star-tree: Pinot post-0.12 supports upsert semantics on dimensional aggregates, so a correction event can update a rollup cell directly rather than forcing full segment rewrites. Big operational win for the backfill story.
- Redpanda vs Kafka at this scale: Redpanda’s per-core thread-per-shard architecture and Raft-native replication cut p99 produce latency roughly in half for small-batch workloads, but the ecosystem (connectors, Flink integration maturity) still favors Kafka for a counter pipeline. For the raw ingest tier where tail latency matters less than throughput, Kafka stays. For a realtime-correction sidecar topic, Redpanda is a legit choice.
- Flink 1.18+ with disaggregated state: Flink’s async state backend and the disaggregated state proposal (ForSt) let RocksDB live on remote storage, which reduces the checkpointing cost for hot-keyed jobs dramatically. For the hot-key finalizer, this matters — the 200KB-per-promoted-key checkpoint cost stops being a node-local disk pressure problem.
- RisingWave / Materialize: streaming databases with incremental materialized views. Tempting for the dimensional rollup problem, but at 10M events/sec sustained neither is production-proven at that throughput per my read as of early 2026. Watch, don’t bet.
- CRDT counters (PN-counters) at scale: papers and systems (Riak, AntidoteDB) show that multi-region active-active counters with CRDT semantics work for up to ~10k writes/sec per counter. They do NOT solve the hot-counter problem at 1M/sec because the merge cost per replica scales with the number of active nodes. CRDT is the right answer for geo-replication of the Redis layer across regions, not for the write path.
- Kafka KIP-848 (new consumer group protocol): reduces rebalance pauses for long-running Flink consumers, which matters because Flink rescaling was historically a 30s+ operation that manifested as visible counter stalls. Production-ready in Kafka 3.7+.
8. Tradeoffs I’m taking and what would force redesign
Section titled “8. Tradeoffs I’m taking and what would force redesign”- At-least-once ingestion with window-bounded dedup is acceptable because views are display-only. If this were ads, I’d move to transactional producer + Flink exactly-once sink + per-impression persistent dedup keyed on
<user, ad, minute>. That adds ~40% cost and halves throughput per node. - Hot-key splitting with discrete tiers accepts O(K) read cost for hot videos. If reads became more expensive than writes (e.g., a new product feature that hammers counters from every search result card), I’d consider a read-side cache that stores the summed value with 1s TTL, pushing read cost back to O(1) per shard. Currently not worth the staleness.
- 1s commit windows accept ~1s min staleness. If “few seconds” got tightened to sub-second (it won’t, but hypothetically), I’d need to kill Flink’s windowing and go to continuous aggregation with periodic snapshots, roughly tripling the state-backend pressure. Not worth it for views.
- Single-region Redis with geo replication: if the product requires cross-region strong consistency (it doesn’t for views), the whole architecture changes. CRDT PN-counters become the answer and read latency goes up.
9. What I’d push back on
Section titled “9. What I’d push back on”[STAFF SIGNAL: saying no]
-
“Few seconds of write” freshness: for a view counter, users cannot distinguish 1s from 30s. The product requirement is probably “user sees count update within a page-reload cycle,” which is 30–60s. Tightening to seconds costs us the ability to buffer and pre-aggregate more aggressively, which is a 2–3x cost multiplier. I’d get the PM to defend 5s with data. If they can’t, we ship with 30s and save the money.
-
“Dimensional rollups for all dimensions from day one”: the star-tree approach handles a few common ones cheaply; all of them costs 100x storage. I’d ship base counter +
(video, country, hour)in v1, add dimensions as product validates demand. Resist the “future-proof all dimensions” instinct. -
“One system for views, likes, and ad impressions”: tempting for engineering tidiness. Bad for correctness. Likes and ads have different dedup, audit, and billing contracts. Share the infra substrate (Kafka, Flink framework, S3 archive), split the pipelines.
-
The interviewer probably assumed “eventual consistency” means “no correctness guarantees”: it doesn’t. Eventual consistency means “the count converges to a defined truth given enough time, and the truth is computable from the raw log.” If we can’t recompute truth from the archive, we don’t have eventual consistency — we have best-effort with drift. The archive-as-source-of-truth story (§5.8, §6) is the thing that makes this design actually eventually consistent rather than just vaguely-approximately-correct.
If you want, I can drop into any of the deep dives further — the hot-key promotion/demotion control loop and the backfill + versioned-counter mechanic are the two places where this answer has more depth that I compressed. The star-tree dimensional thing is also a rabbit hole.