Online Feature Store (simplified)
"Design the online feature store for a recommendation system. 50K feature lookups per request, 1M requests/sec, p99 budget for feature fetch is 20ms. Features include real-time counters, embeddings, and historical aggregates. Walk me through it."Online Feature Store — 40-Minute Delivery Version
Section titled “Online Feature Store — 40-Minute Delivery Version”Pacing guide
Section titled “Pacing guide”| # | Section | Time | Running total |
|---|---|---|---|
| 1 | Scope + forcing function | 4 min | 4 |
| 2 | Capacity math + latency budget | 6 min | 10 |
| 3 | High-level architecture | 7 min | 17 |
| 4 | Feature types | 4 min | 21 |
| 5 | Deep dive: latency budget | 4 min | 25 |
| 6 | Deep dive: embedding storage | 5 min | 30 |
| 7 | Deep dive: skew prevention | 4 min | 34 |
| 8 | Failure modes | 3 min | 37 |
| 9 | Pushback + close | 3 min | 40 |
If running long at minute 20, compress section 4 (feature types) to 2 minutes. Don’t compress the deep dives — that’s where the signal lives.
1. Scope and Forcing Function (4 min)
Section titled “1. Scope and Forcing Function (4 min)”“Before I design anything, let me narrow the scope and do some quick math, because the numbers force the architecture.”
Scope. “Recommendation system” is too broad. Four stages: candidate generation (approximate nearest neighbor over embeddings, tiny feature count), retrieval (~10K items, lightweight features), fine ranking (~1000 candidates, rich features), re-ranking (~50 items, business rules). 50K feature lookups per request points at fine ranking. I’ll commit to this stage.
Breakdown of the 50K: ~50 user features (fetched once, shared across candidates) + ~30 item features × 1000 candidates (30K) + ~15 cross features × 1000 pairs (15K), plus some redundancy from how the model’s input tensor is shaped. [STAFF SIGNAL: scope negotiation]
The math is the forcing function. 50K features × 1M requests/sec = 50 billion feature reads per second fleet-wide. 20ms p99 budget. If I put features behind a standard RPC at 1ms median / 5ms p99 per call, 50K serial lookups is 250 seconds. Even with 1000-way parallelism, 250ms — 12× over budget. The answer cannot be “faster Redis.” It has to be: colocate the hot data with the ranker, specialize storage per feature type, batch ruthlessly. [STAFF SIGNAL: latency budget reframing]
Second architectural concern: training/serving skew. The model trains on features computed one way (usually Spark, offline, over batches) and serves on features computed another way (Flink, online, streaming). Even small drift — a bugfix applied to one path but not the other, a default value computed differently, a timestamp boundary off by a second — silently degrades quality over weeks. I treat skew as an architectural invariant: the value the model trained on must be byte-equivalent to the value served. The mechanism that enforces this is log-and-train, built in from the start — not bolted on later. [STAFF SIGNAL: skew as first-class]
One pushback up front. [STAFF SIGNAL: saying no] 50K features per request is probably too many. In every large recsys I’ve seen audited, 30-50% of features contribute negligible lift. A pruning initiative — drop candidates, retrain, online A/B — typically removes half with no measurable regression and buys back half the latency budget. I’ll design for 50K, but I’d fund pruning in parallel; it’s a larger win than any infra optimization.
2. Capacity Math and Latency Budget (6 min)
Section titled “2. Capacity Math and Latency Budget (6 min)”How many bytes cross the network per request?
Section titled “How many bytes cross the network per request?”Different feature types have different sizes. Sizing each:
Feature type | Count per req | Bytes each | Total------------------------|----------------|------------|---------Counters | 50 | 8 | 400 BScalar aggregates | 5,000 | 8 | 40 KBDense numeric features | 10,000 | 4 | 40 KBEmbeddings (2 user + 1K item, 64-dim int8, 2B scale + 1B zero-pt + 64B values = 67B rounded to ~96B) | 1,002 rows | 96 | 96 KBCross features | 15,000 | 4 | 60 KB------------------------|----------------|------------|---------Total per request ~240 KBAt 1M requests/sec ~240 GB/s aggregate240 GB/s is the total bandwidth the feature substrate has to deliver to rankers across the fleet. If features are remote, that’s 240 GB/s of network-card budget across ~2000 ranker boxes — a 100-Gbps NIC is 12.5 GB/s, so about 20 NICs worth, spread across the fleet. Feasible. But if features sit in the same process as the ranker, the same bandwidth is just memory bandwidth — ~200 GB/s per CPU socket, essentially free. That alone is the argument for colocation.
Embedding table sizes
Section titled “Embedding table sizes”Table | Rows | Size (int8, 64-dim) | × models----------------------|------------|---------------------|----------User embedding | 1B | 64 GB | × 3 = 192 GBItem embedding | 300M | 19 GB | × 3 = 57 GBCross-surface embeds | 5 tables | ~50 GB each | = 250 GBCategory / taxonomy | 20 tables | ~1 GB each | = 20 GB----------------------|---------------------------------- |----------Total hot (int8) ~520 GBAt fp32 equivalent ~2 TBHistorical aggregates (7-day, 30-day windows over ~500 features): ~100M active users × 500 features × 8 bytes ≈ 400 GB.
Working set
Section titled “Working set”User access is heavily skewed: the top 10% of users get ~80% of traffic (classic power-law distribution). So the hot user set we actually need fast access to is ~100M users, hot items ~50M. Multiply by embedding size: 100M × 64 × int8 × 3 tables = ~20 GB per box for the hot embedding working set. That fits comfortably in RAM on every ranker machine. This is the dominant implication — we don’t need a separate embedding service for the hot path, only for the long tail. [STAFF SIGNAL: capacity math]
20ms budget broken down
Section titled “20ms budget broken down”Stage | Budget | Where the time goes---------------------------------------|---------|-----------------------Request batching wait | 1 ms | wait for parallel reqsRanker → local feature orchestrator | 0.2 ms | in-process function callIn-process lookup (hot tier) | 1 ms | 80% of features hit hereOn-box sidecar (warm tier) | 2 ms | 18% hit hereRemote shard fetch (cold tier) | 6 ms | 2% cold, this is the tailTensor assembly + dequantize | 3 ms | arrange data for GPURemote aggregate fetch (rare) | 4 ms | ScyllaDB fallbackSlack for GC, scheduler jitter, tail | 2.8 ms | always buffer---------------------------------------|---------|-----------------------Total | 20 ms |The in-process layer must deliver ~80% hit rate, or the budget breaks. That’s the concrete design pressure.
3. High-Level Architecture (7 min)
Section titled “3. High-Level Architecture (7 min)”Draw this on the board as you narrate:
┌───────────────────────────────────────────────────────────────────┐│ OFFLINE SUBSTRATE ││ ┌────────────┐ ┌──────────────┐ ┌──────────────────┐ ││ │ Event log │ → │ Spark/Flink │ → │ Feature │ ││ │ (Kafka → │ │ batch+stream │ │ warehouse │ ││ │ HDFS) │ │ jobs │ │ (Iceberg format) │ ││ └────────────┘ └──────────────┘ └──────────────────┘ ││ ▲ │ ││ │ ▼ ││ ┌────────┴───────┐ ┌────────────────────┐ ││ │ Feature DSL │ │ Training loop │ ││ │ (ONE defn, │ │ (trains on LOGGED │ ││ │ codegens to │ │ served features) │ ││ │ Flink+Spark) │ └────────────────────┘ ││ └────────────────┘ │├────────────────────────┼──────────────────────────────────────────││ ▼ ONLINE SUBSTRATE ││ ┌────────────┐ ┌──────────────┐ ││ │ Kafka │ → │ Flink │──────┐ ││ │ (events) │ │ (codegenned) │ │ ││ └────────────┘ └──────────────┘ │ ││ │ │ ││ ▼ ▼ ││ ┌──────────────────────┐ ┌──────────────────────┐ ││ │ Online feature store │ │ Embedding service │ ││ │ (ScyllaDB: counters │ │ (separate fleet; │ ││ │ + aggregate tail) │ │ int8, row-major) │ ││ └──────────────────────┘ └──────────────────────┘ ││ │ │ ││ ▼ ▼ ││ ┌───────────────────────────────────────────┐ ││ │ RANKER BOX (colocation) │ ││ │ ┌───────────────────────────────────┐ │ ││ │ │ In-process tier (20 GB hot) │ │ ││ │ ├───────────────────────────────────┤ │ ││ │ │ On-box sidecar (200 GB warm SSD) │ │ ││ │ ├───────────────────────────────────┤ │ ││ │ │ Feature logger ────────────────── │─────│→ Kafka log ││ │ └───────────────────────────────────┘ │ topic ││ │ GPU ranker │ (back to ││ └───────────────────────────────────────────┘ training) │└───────────────────────────────────────────────────────────────────┘Four big ideas, each replacing something simpler:
1. Three-tier read path (in-process → on-box sidecar → remote). Not one monolithic store. [STAFF SIGNAL: storage tiering]
- In-process tier: the 20 GB hottest entities held in the ranker’s own memory, using a SwissTable hash (Google’s open-addressed hash table, cache-friendly layout).
- On-box sidecar: a separate process on the same machine, running RocksDB on local SSD, connected to the ranker by a Unix domain socket (fast same-machine IPC — basically a loopback network without the network).
- Remote tier: ScyllaDB (a C++ rewrite of Cassandra, lower-latency) for the cold long tail.
2. Embedding service is separate from the main store. [STAFF SIGNAL: feature-type differentiation] Embeddings are batched, row-oriented, and bandwidth-heavy. Counters and aggregates are single-value point reads. Forcing them into one store compromises both.
3. Log-and-train loop. [STAFF SIGNAL: skew as first-class] The ranker writes the exact served feature vector to a Kafka log. The trainer reads from that log. Training data literally is what serving produced. This eliminates one whole class of skew by construction.
4. Single feature definition, two codegens. A DSL declares each feature once; a compiler generates both the Flink streaming job (online) and the Spark batch job (offline/warehouse). The definition is the single source of truth; the two execution paths can’t drift.
Rejected alternatives [STAFF SIGNAL: rejected alternative]:
- A single global Redis cluster. 50K serial lookups × 1ms = 50s. Even 1000-way parallel, 250ms — 12× over budget. Network latency kills this, not Redis speed.
- Feast as serving substrate. Feast is a good control plane — feature registration, lineage, offline joins — but not designed to be in the hot path at 1M rps × 50K lookups. Use it for the registry; don’t use it as the store.
- Tecton-style unified online+offline execution. Tempting for skew elimination, but the unified DSL can’t express 30-day windowed aggregates efficiently (online engine would need 30 days of state per user). Compromise: unified definition, separate execution.
- In-memory-only, no remote tier. The long tail (dormant users, new items) pushes working set past per-box RAM. Remote exists for completeness, not latency.
4. Feature Types (4 min)
Section titled “4. Feature Types (4 min)”[STAFF SIGNAL: feature-type differentiation] Three feature types, three pipelines:
Real-time counters (e.g., user_clicks_last_5min).
Event → Kafka → Flink (with RocksDB state for sliding-window aggregation) → ScyllaDB (authoritative) + a changelog Kafka topic that on-box sidecars subscribe to for cache refresh.
Freshness target: event-to-available p95 under 10 seconds.
Exactly-once delivery isn’t needed for most counters (a double-counted click is small noise); a whitelist of fraud-style counters uses an exactly-once pipeline at ~10× cost.
Embeddings (user, item, query). Refreshed on 6-24 hour retraining cycles, with an optional streaming two-tower producing incremental updates at minute-scale freshness. Storage: the dedicated embedding service, sharded by hash(entity_id), row-major int8. A “candidate batch fetch” is one RPC per shard carrying up to 1000 item IDs, returning a matrix. Critical invariant: the int8 form stored online is the form the model trained on — we do quantization-aware training (QAT: the model sees int8 during training, not just at deploy time). Training on fp32 and serving int8 introduces a silent ~0.1% quality gap.
Historical aggregates (e.g., purchases_7d, sessions_30d).
Long windows make pure streaming infeasible — 30-day sliding-window state per user at 100M users is prohibitive in Flink. Instead: daily batch backfill from the warehouse, written as compact columnar blobs per entity, loaded to each ranker’s warm tier on a daily cycle. Values change once a day, so on-box hit rate is >99% — effectively free to read.
5. Deep Dive: Latency Budget (4 min)
Section titled “5. Deep Dive: Latency Budget (4 min)”Stage-by-stage with timing:
t=0 ms Request arrives at ranker box │ Candidate gen already ran; we have 1000 item IDs + user ctx │t=0.2 ├─► Orchestrator partitions lookups by tier × type │ ┌──────────────┬─────────────┬──────────────┐ │ │ │ │ │t=0.3 ├───► │ In-process │ On-box │ Remote │ │ │ (40K, ≤1 ms) │ (9K, ≤2 ms) │ (1K, ≤6 ms) │ │ │ SwissTable │ UDS+shmem │ gRPC batched │ │ │ │ zero-copy │ per shard │ │ └──────┬───────┴─────┬───────┴──────┬───────┘ │ │ │ │t=1.3 ├────────────┘ in-process returnst=2.3 ├──────────────────────── on-box returnst=6.3 ├─────────────────────────────────────── remote returns │t=6.3 ├─► Tensor assembly: scatter values into the layout the model │ expects. Zero-copy for embeddings: the sidecar writes to │ shared memory, the ranker reads by pointer — no data copy. │t=9.3 ├─► Fill missing values with learned per-feature defaults │t=10.3 └─► Hand off to GPU ranker ┌── remaining ~9.7 ms for model inference ──┐ └── plus slack for tail ─┘Three non-obvious things:
The in-process tier is not an LRU cache. [STAFF SIGNAL: invariant-based thinking] A standard cache would have unpredictable hit rates. Instead, a hot-set controller runs as a background process: reads global traffic logs every 5 minutes, computes per-entity QPS, picks the top-N (~20 GB worth), pushes that set to every ranker box. Invariant: any entity served at >1000 QPS globally must be in-process on every ranker. Eviction is deterministic, not opportunistic.
UDS + shared memory for zero-copy. The sidecar and ranker live on the same box. Even gRPC over loopback adds ~200 microseconds per call from protobuf serialization. At 9K warm-tier lookups per request, that’s 1.8 seconds of overhead — unacceptable. Instead, the sidecar maps a shared-memory region, writes embedding rows directly there, and passes the ranker an offset + length. Ranker reads from shared memory with no copy. At 50K lookups per request, this micro-optimization stops being a micro-optimization.
Batching strategy. 1ms request-level batch window. At 1M rps across 2000 boxes = 500 rps/box, 1ms captures only 0.5 requests — too little for useful cross-request batching. So I batch within a single request (grouping the 50K lookups by tier and shard), not across requests. Cross-request batching only helps the remote tier and is handled inside the sidecar transparently.
6. Deep Dive: Embedding Storage (5 min)
Section titled “6. Deep Dive: Embedding Storage (5 min)”Embeddings are the hardest type — worth going deep.
SHARD (one of 256; sharded by hash(entity_id) % 256) ┌─────────────────────────────────────────────────────────┐ │ Header: {version, row_count, dim, dtype, layout} │ ├─────────────────────────────────────────────────────────┤ │ Row 0: [scale(fp16) | zero_pt(int8) | 64 × int8 values] │ │ Row 1: [scale | zero_pt | 64 × int8 values] │ │ ... │ │ Row N: [scale | zero_pt | 64 × int8 values] │ ├─────────────────────────────────────────────────────────┤ │ ID → row-offset index (perfect hash: no collisions, │ │ built once at load time) │ └─────────────────────────────────────────────────────────┘
Row size: 2 (scale) + 1 (zero_pt) + 64 (values) = 67 B, padded to 72 B
CANDIDATE-BATCH FETCH (the 1000-item case): Client: fetch([id_1, id_2, ..., id_1000]) from table T_item │ ├─ Partition by shard: s_0 gets its IDs, s_1 gets its, ... ├─ Fan out 256-way in parallel (each shard sees ~4 IDs on average) │ Each shard: perfect-hash lookup → memcpy 72 B per row into │ a contiguous response buffer (one pass, no pointer chasing) └─ Client assembles [1000 × 72 B], dequantizes on GPU with a fused kernel that uses the per-row scale and zero_ptWhy int8 quantization. A 1B-user embedding at fp32 (4 bytes × 64 dims) is 256 GB per table. At int8 it’s 64 GB — 4× smaller. Times 3 models: 192 GB vs 768 GB. This is the difference between a 20 GB hot working set that fits in RAM and one that doesn’t.
Options I considered [STAFF SIGNAL: embedding-specific thinking]:
| Scheme | Bytes/row | Quality impact | Verdict |
|---|---|---|---|
| fp32 | 256 | baseline | reject (4× larger, no benefit) |
| fp16 | 128 | matches int8 | reject (int8 does better at same quality) |
| int8 per-row scale | 72 | ~0.01% NE regression with QAT | chosen |
| int4 groupwise | 40 | 0.05-0.1% regression | reserve for biggest tables only |
| Product quantization | 8 | 0.5%+ regression | too lossy for ranking |
Training-serving equivalence is the invariant. If the model trains on fp32 and serves on int8, the served embeddings are subtly different from what the model learned, and a ~0.1% quality gap shows up silently in online metrics. Fix: quantization-aware training (QAT) — during training, quantize embeddings to int8 on the fly and dequantize before the model consumes them, so the model learns to tolerate rounding error. Costs ~15% more training FLOPs. Non-negotiable; otherwise the serving path is lying to the model.
Sharding: hash by entity ID, not by popularity. Popularity sharding (put hot entities on dedicated shards) sounds smart but creates a single hotspot shard; one replica failure there has outsized impact. Hash sharding spreads load uniformly and handles hot entities with replication instead (covered in failure modes). 256 virtual shards per physical shard → consistent hashing → rebalance without full reshuffle when capacity changes.
7. Deep Dive: Skew Prevention (4 min)
Section titled “7. Deep Dive: Skew Prevention (4 min)”[STAFF SIGNAL: skew as first-class] Training/serving skew is the systemic failure mode of recsys. The architecture that defends:
┌──────────────────────────────────────────────────────────┐ │ FEATURE DEFINITION (single source of truth) │ │ ┌────────────────────────────────────────────────┐ │ │ │ feature user_clicks_last_5min: │ │ │ │ window: 5m sliding │ │ │ │ source: event_stream │ │ │ │ agg: count │ │ │ │ key: user_id │ │ │ │ default: 0 │ │ │ └────────────────────────────────────────────────┘ │ │ │ │ │ │ ▼ ▼ │ │ ┌──────────────┐ ┌──────────────┐ │ │ │ Flink codegen│ │ Spark codegen│ │ │ │ (online) │ │ (offline) │ │ │ └──────────────┘ └──────────────┘ │ └────────│──────────────────────────────│──────────────────┘ │ │ ▼ ▼ ┌────────────┐ ┌────────────────┐ │ Online FS │── served vals ─►│ Feature log │ ◄─ source of │ (Scylla, │ │ (Kafka, │ truth for │ emb svc) │── logged vals ─►│ immutable) │ training └────────────┘ └────────────────┘ │ │ ▼ ▼ ┌─────────────┐ ┌────────────────┐ │ Ranker │── outcomes ──►│ Join(logged │ │ (serves) │ (labels) │ + labels) │ └─────────────┘ └────────────────┘ │ ▼ ┌────────────────┐ │ Training data │ └────────────────┘Five distinct ways skew creeps in, and the defense for each:
1. Code-path skew — same feature implemented differently in the offline job and the online job. Defense: single DSL, codegens for both. About 5% of features need custom code the DSL can’t express; for those, both implementations exist side by side with a CI parity test that runs daily on a sampled workload and alerts on >0.01% value mismatch.
2. Time-travel skew — training uses an end-of-day snapshot that “knows the future” relative to what serving could see at decision time. Defense: log-and-train. The training row literally is the feature vector that was served. Eliminates point-in-time-correctness bugs by construction — the single most important defense.
3. Freshness skew — online values are a few seconds stale; training rows assume perfect freshness. Defense: feature logs capture as-served staleness, so the training distribution matches the serving distribution, including its staleness profile.
4. Missing-value skew [STAFF SIGNAL: missing-value discipline] — online fills missing values with one default; offline fills with another. Defense: the DSL declares the default; both codegens use it. Feature logs include a sentinel bit per value indicating whether it was a real read or a default. Critically, we distinguish two reasons a value is missing: “entity is new” (normal — model should handle) vs. “pipeline is broken” (incident — mask the feature entirely and alert). The sentinel lets the model architecture ignore broken-pipeline defaults instead of training on them as real signal.
5. Distribution drift — not really skew, but related. The world changes after the model ships. Defense: distribution-similarity tests (KS test or Population Stability Index) run hourly, per feature, comparing live serving distribution to the training distribution. Alert on divergence. Detection, not prevention.
The invariant I’m enforcing: feature_value_served(t) == feature_value_used_in_training(log_row_at_t). Log-and-train guarantees this for 95% of features by construction. The 5% custom-code features have a weaker invariant (within epsilon), enforced by parity CI.
Log storage cost. 1M rps × 50 sampled features × 100 B × 10% sampling = ~500 GB/day. Manageable. If it grew to 10 PB/year we’d switch to reservoir sampling per user.
8. Failure Modes (3 min)
Section titled “8. Failure Modes (3 min)”[STAFF SIGNAL: failure mode precision] [STAFF SIGNAL: blast radius reasoning] Concrete scenarios, concrete responses:
Single embedding shard down (1/256 = 0.4% of rows unavailable). Cold-tier fetches return a learned default embedding (not zeros — zeros may be in-distribution and cause silent bias). Quality: ~0.4% of candidates get defaults → ~0.1% NE regression. Shard has a hot standby; promotion <30s. Not outage-level.
Flink counter pipeline lag (5 minutes behind real time). Sidecar serves last-known-good. If lag >30 min, per-counter staleness monitor fires; the ranker masks affected features (replaced with default + a “stale” sentinel bit that the model was trained to handle). Quality: ~0.05% CTR loss during degradation.
Hot-entity surge (viral item hits 100K QPS). Detection: per-key QPS counter in the embedding service. Mitigation is layered: (a) replicate that row to all 256 shards (trivial cost for a handful of rows); (b) ranker in-process already has it via the hot-set controller; (c) if sustained, permanently pin to in-process on every box. Replication consistency handled via gossip (~1s bounded staleness).
5% of features stale simultaneously (the yellow-zone scenario). Empirically ~0.2-0.4% NE regression. Runbook: shift 10% of traffic to a fallback simpler model using only the core 50 features; if the quality gap widens beyond 1%, failover simpler model to 100% while primary is repaired. Quarterly GameDay for this.
Cardinality explosion [STAFF SIGNAL: cardinality defense]. A team ships a feature keyed by user × item × context → 10^18 possible keys. Defense layers: DSL requires explicit max_cardinality declaration; registry tracks per-feature unique-key growth and auto-pauses Flink jobs growing >10× WoW; storage enforces per-feature quota and switches to sample-on-write when exceeded. Separately, abandoned features are reaped — ownership annotation + zero-reader detection + 14-day owner notification + auto-delete. Without this, 40% of storage becomes dead features within a year. I’ve seen it happen twice. [STAFF SIGNAL: operational longevity]
9. Pushback and Close (3 min)
Section titled “9. Pushback and Close (3 min)”[STAFF SIGNAL: saying no] What I’d push back on if this were a real project:
1. 50K features per request is almost certainly too many. Pruning studies consistently find 30-50% are pass-through noise. Before scaling infra, run a pruning initiative — cheaper, reduces operational surface, buys back latency budget. Land this infra design, but argue hard for pruning in parallel.
2. The 20ms budget shouldn’t be allocated uniformly across features. A small number (top embeddings, key counters) drive most of the model’s discrimination. Those should live in-process regardless of hit rate; the long tail should live in the cheapest tier with loosest freshness. Per-feature importance-based budgeting is probably a 20% efficiency win.
3. “Unified offline+online feature store” is often the wrong frame. The actual goal is skew elimination. Unified definition gets 95% of the benefit; unified execution costs a lot and constrains expressiveness. Insist on the former; resist the latter.
4. Freshness requirements are probably over-specified. [STAFF SIGNAL: freshness-cost tradeoff] The 80% of features that are 7-day or 30-day aggregates don’t need sub-minute freshness — the signal moves on a day timescale. Per-feature freshness SLAs unlock ~10× cost savings on the long tail.
5. The real system is model + features + ranker, not the feature store alone. If the modeling team can tolerate lossy compression, default-filled missing values, and a smaller feature count, infra gets dramatically cheaper. That cross-team conversation is worth 10× any internal infra optimization. As a staff engineer I’d own that conversation, not just the infra.
Recent work worth citing briefly [STAFF SIGNAL: modern awareness]:
- Uber’s Michelangelo pioneered log-and-train.
- Meta’s ZionEX / TorchRec handle 100+ TB embedding tables with disaggregated memory — the pattern to reach for if tables outgrow per-box RAM.
- Matryoshka embeddings (Kusupati et al. 2022) let you serve 32-dim to cold candidates and 128-dim to top-k — worth piloting on item embeddings.
- Airbnb’s Chronon and Stripe’s Zipline converge on “single definition, dual execution, log-and-train” — the industry’s converging answer.
Jargon cheat-sheet (for your reference, not for delivery)
Section titled “Jargon cheat-sheet (for your reference, not for delivery)”| Term | What it actually means |
|---|---|
| UDS (Unix domain socket) | Like a TCP socket but same-machine; much lower latency than loopback |
| shmem / shared memory | A memory region multiple processes can read without copying |
| Zero-copy | Pass a pointer and length instead of copying the data |
| SwissTable | Google’s optimized open-addressed hash table, cache-friendly |
| RocksDB | Embedded key-value store (LSM tree); what you run inside a sidecar |
| ScyllaDB | C++ rewrite of Cassandra; distributed KV, lower latency than Redis at scale |
| gRPC | Google’s RPC framework; what most RPCs use these days |
| Protobuf | Binary serialization format; what gRPC encodes messages as |
| Kafka | Distributed log; the default event-streaming substrate |
| Flink | Stream-processing engine; handles windowed aggregations with state |
| Iceberg | Table format for data warehouses; gives you ACID on top of Parquet files |
| QAT (quantization-aware training) | Train with fake quantization in the forward pass so the model learns to tolerate rounding |
| NE (normalized entropy) | Offline quality metric for ranking models; lower is better |
| KS test / PSI | Statistical tests for comparing two distributions (serving vs. training) |
| Perfect hash | A hash function built at load time with no collisions; O(1) guaranteed lookup |
| Gossip protocol | Decentralized way for replicas to converge on state by pairwise exchange |
| Consistent hashing | Hashing scheme where adding/removing shards only re-maps a small fraction of keys |
| Hot-set controller | Your own background service; chooses which entities get in-process treatment |
Memorized numbers to have at your fingertips
Section titled “Memorized numbers to have at your fingertips”| Number | What it is |
|---|---|
| 50B / sec | Aggregate feature lookups fleet-wide |
| 240 GB/s | Aggregate bandwidth required |
| 520 GB | Total hot embedding storage (int8) |
| 20 GB | Hot working set per ranker box |
| 72 bytes | One int8 embedding row, padded |
| 80% / 18% / 2% | In-process / on-box / remote hit rates |
| 10 seconds | Default counter freshness target |
| 0.01% | QAT quality regression |
| 0.1% | Quality regression from single shard failure (0.4% of rows) |
| 0.2-0.4% | Quality regression from 5% stale features |
| 40% | Storage that becomes dead features within a year without lifecycle governance |
Delivery tips
Section titled “Delivery tips”- Do the arithmetic out loud on the board. Don’t just state “240 GB/s” — write
240 KB × 1M = 240 GB/sas you say it. The interviewer wants to see you reason through, not recite. - State the invariant before the mechanism. “The invariant is feature-value-equivalence between training and serving; the mechanism is log-and-train.” This inversion signals architectural thinking.
- Commit to numbers, then caveat. Say “80% hit rate in-process” with confidence, then “and if measurement shows that’s off, we adjust the hot-set size.” Staff engineers commit, not hedge-first.
- If interrupted, go deeper on the interruption, don’t defend the prior point. Changing direction cleanly is a staff-level habit.
- If running long at minute 25 (past architecture), compress feature types (section 4) to one sentence per type and keep moving. Never compress the deep dives.
- End with pushback, not summary. Summarizing is mid-level. Pushback closes at senior-staff.