Online Feature Store
"Design the online feature store for a recommendation system. 50K feature lookups per request, 1M requests/sec, p99 budget for feature fetch is 20ms. Features include real-time counters, embeddings, and historical aggregates. Walk me through it."Online Feature Store for Recommendation Ranking — Staff-Level Design
Section titled “Online Feature Store for Recommendation Ranking — Staff-Level Design”1. Scope and Reframing
Section titled “1. Scope and Reframing”Before architecture: the question covers too much ground as posed. 50K feature lookups per request places us squarely in fine ranking — not retrieval (thousands of candidates with tens of features, ANN-driven), and not re-ranking (tens of items, diversity/business-rule focused). Fine ranking scores O(100–1000) candidates with O(50–500) features each. I’ll commit to ~1000 candidates × 50 feature lookups per candidate, where “lookup” means a read of one feature value for one (user, item, context) tuple. User-side features are shared across candidates (~50 user features fetched once), item-side multiplies by candidate count (~30 × 1000 = 30K), cross features per pair (~15 × 1000 = 15K). Net: ~45K unique reads, ~50K lookups counting the tensor-shape redundancy. [STAFF SIGNAL: scope negotiation]
The math is the forcing function. 50B feature reads/sec at the fleet level. 20ms p99 for the feature-fetch phase (not the whole ranking request — model inference eats its own budget). If we naively place features behind a KV RPC at 1ms median / 5ms p99 per hop, 50K serial lookups is 250 seconds. Even 1000-way parallelized it’s 250ms and blows the budget by 12×. The answer cannot be “faster Redis.” The answer is colocation of the hot working set with the ranker, specialization of the read path per feature type, and ruthless batching. [STAFF SIGNAL: latency budget reframing]
Training/serving skew is the principal architectural risk, not a test-harness concern. [STAFF SIGNAL: skew as first-class] A fine ranker trained on features computed in Spark and served from an online store computed in Flink will silently degrade as definitions drift. I’ll treat the skew invariant — the feature value the model trained on is byte-equivalent to the feature value served — as a property of the system, not an assertion checked after the fact. The dominant design choice: log-and-train (we log served features and train on those logs) rather than recomputing training features from raw data.
What I’d push back on up front: [STAFF SIGNAL: saying no] 50K features per request is almost certainly excessive. In every large recsys I’ve audited, 60–80% of features contribute sub-basis-point NE lift. A feature-pruning initiative (SHAP-based, or drop-and-retrain with online A/B) typically removes 30–50% of features with no measurable quality regression and buys back half the latency budget. I’d land the infra design for 50K but fund pruning in parallel — it’s a larger win than any single infra optimization.
2. Capacity Math and Latency Budget
Section titled “2. Capacity Math and Latency Budget”Per-request wire math:
Component | Min wire | Realistic-----------------------|----------------|------------Counters (~50) | 50 × 4B = 200B| 50 × 8B = 400BScalar aggregates (5K) | 5K × 4B = 20KB | 5K × 8B = 40KBDense numeric (~10K) |10K × 4B = 40KB |10K × 4B = 40KBEmbeddings (2 user + |(2+1K) × 64B |(2+1K) × 96B 1K item, 64-d int8) | = 64KB | = 96KBCross features (~15K) |15K × 4B = 60KB |15K × 4B = 60KB-----------------------|----------------|------------Total per request | ~184 KB | ~236 KBAt 1M rps | ~184 GB/s | ~236 GB/s aggregate~250 GB/s aggregate wire from feature substrate to rankers. If features are remote, that’s 250 GB/s of NIC budget split across ~2000 ranker boxes plus the serving side. Feasible with 100 Gbps NICs (12.5 GB/s each) but tight once protocol overhead is counted. If features are colocated in-process, this bandwidth is DRAM bandwidth (~200 GB/s per socket, free). That alone argues for colocation.
Embedding table sizes (hot path):
User embedding : 1B users × 64 dim × int8 × 3 models = ~192 GBItem embedding : 300M items × 64 dim × int8 × 3 models = ~57 GBCross-surface embed : 5 tables × ~50 GB avg = ~250 GBCategory/taxonomy : 20 small tables × 1 GB = ~20 GB --------Total active (int8) ~520 GBEquivalent at fp32 ~2 TBHistorical aggregates (7d/30d windows, ~500 features): ~100M active users × 500 × 8B ≈ 400 GB.
Working set at 1M rps, Zipfian α≈1.0: top 10% of users cover ~80% of traffic. Hot user set ≈ 100M; hot item set ≈ 50M (catalogs churn faster than users). Hot embedding working set ≈ 100M × 64 × int8 × 3 tables ≈ 20 GB. That fits in RAM on every ranker box — dominant implication for architecture. [STAFF SIGNAL: capacity math]
Latency budget decomposition (20ms p99):
Stage | Budget (p99) | Mechanism------------------------------------|--------------|-----------------------Request fan-out / batching wait | 1.0 ms | 1ms batch windowRanker → local feature orchestrator | 0.2 ms | in-process callIn-process lookup (hot tier) | 1.0 ms | 80% hit rateOn-box sidecar (warm tier) | 2.0 ms | 18% hit rate, UDSRemote embedding shard fetch | 6.0 ms | 2% cold, HDR-p99Deserialization / tensor assembly | 3.0 ms | zero-copy where possibleRemote aggregate fetch (rare) | 4.0 ms | Scylla fallbackSlack / GC / tail | 2.8 ms | everything else------------------------------------|--------------|-----------------------Total | 20.0 ms |The in-process layer must deliver 80% hit or we miss the budget. That’s the concrete design pressure.
3. High-Level Architecture
Section titled “3. High-Level Architecture”┌───────────────────────────────────────────────────────────────────────┐│ OFFLINE SUBSTRATE ││ ┌─────────────┐ ┌────────────────┐ ┌──────────────────────┐ ││ │ Event Log │ → │ Spark / Flink │ → │ Feature Warehouse │ ││ │ (Kafka→HDFS)│ │ batch + stream │ │ (Iceberg, partitioned│ ││ └─────────────┘ │ feature jobs │ │ by feature + date) │ ││ │ └────────────────┘ └──────────────────────┘ ││ │ ▲ │ ││ │ │ ▼ ││ │ ┌───────┴────────┐ ┌───────────────────────┐ ││ │ │ Feature DSL │ │ Training snapshot │ ││ │ │ registry │ │ (point-in-time join │ ││ │ │ (single defn) │ │ only for cold-start) │ ││ │ └────────────────┘ └───────────────────────┘ ││ │ │ │ ││ │ │ ▼ ││ │ │ ┌───────────────────┐ ││ │ │ │ Training loop │ ││ │ │ │ (trains on LOGGED │ ││ │ │ │ served features) │ ││ │ │ └───────────────────┘ │├─────────┼───────────────────┼─────────────────────────────────────── ││ │ │ ONLINE SUBSTRATE ││ ▼ ▼ ││ ┌─────────────┐ ┌────────────────┐ ││ │ Kafka topic │ → │ Flink feature │ ───────────────┐ ││ │ (events) │ │ jobs (codegen │ │ ││ └─────────────┘ │ from DSL) │ │ ││ └────────────────┘ │ ││ │ │ ││ ▼ ▼ ││ ┌────────────────────────────┐ ┌──────────────────────┐ ││ │ Online Feature Store │ │ Embedding Service │ ││ │ ┌──────────────────────┐ │ │ (specialized; shard │ ││ │ │ Remote KV (ScyllaDB) │ │ │ by entity-id hash; │ ││ │ │ counters + agg tail │ │ │ int8, row-major) │ ││ │ └──────────────────────┘ │ └──────────────────────┘ ││ └────────────────────────────┘ │ ││ │ │ │ ││ ▼ ▼ ▼ ││ ┌──────────────────────────────────────────┐ ││ │ Ranker box (colocation) │ ││ │ ┌────────────────────────────────────┐ │ ││ │ │ In-process tier (RAM, hot 20 GB) │ │ ││ │ │ - hot user counters/embeddings │ │ ││ │ │ - recent aggregates │ │ ││ │ ├────────────────────────────────────┤ │ ││ │ │ On-box sidecar (UDS; RocksDB + │ │ ││ │ │ page cache; warm 200 GB SSD) │ │ ││ │ ├────────────────────────────────────┤ │ ││ │ │ Feature logger (writes served │──┼──→ Kafka ││ │ │ snapshots to log topic) │ │ (train loop) ││ │ └────────────────────────────────────┘ │ ││ │ │ ││ │ Ranker GPU(s) ← tensor assembled │ ││ └──────────────────────────────────────────┘ │└───────────────────────────────────────────────────────────────────────┘Commitments: [STAFF SIGNAL: storage tiering]
- Three-tier online read path — in-process (hot, 20 GB, 80% hit), on-box sidecar over UDS (warm, 200 GB, 18% hit), remote shards (cold, 2%). No uniform “KV for everything” — that blows the budget.
- Embedding service is separate from the main online store. Embedding access is batched, row-oriented, bandwidth-dominated; counters and aggregates are point reads. Mixing them penalizes both.
- Logged-features training loop. Rangers write the exact served feature vector to Kafka; training consumes those logs as source of truth. Eliminates offline/online computation skew by construction for served features. Warehouse-computed features still exist for exploration and new-model cold-start, but production training runs on logs.
- Rejected alternatives [STAFF SIGNAL: rejected alternative]:
- Single global Redis/Scylla for all features. 50K × 1ms serial = 50s; even heavily parallelized, fan-out tail ruins p99.
- Off-the-shelf Feast as serving path. Feast is a control plane (registry, lineage, offline joins), not a hot-path substrate. Useful for the offline side; not a live store at 1M rps × 50K.
- Tecton-style unified offline+online execution. The DSL cannot express 30-day windowed aggregates efficiently in the online path without incurring training-path constraints. Use unified definition (DSL → codegen for both Flink and Spark); retain separate execution substrates. [STAFF SIGNAL: invariant-based thinking]
- Pure in-memory, no remote tier. Long tail of cold embeddings (inactive users, new items) pushes working set beyond per-box RAM; remote tier exists for completeness, not latency.
4. Feature-Type-Specific Design
Section titled “4. Feature-Type-Specific Design”[STAFF SIGNAL: feature-type differentiation] Three systems, not one.
Real-time counters (e.g., user_clicks_last_5min, item_impressions_last_1h). Writes: ~500K events/sec. Reads: 1M rps × ~50 counters = 50M reads/sec. Ingest: Kafka → Flink (RocksDB state) → sliding-window aggregation → ScyllaDB sink + changelog topic for sidecar cache invalidation. Freshness target: event-to-served p95 under 10 seconds. Exactly-once unnecessary for most counters; at-least-once with 5-minute Bloom-filter dedupe suffices. A whitelist of fraud-style counters runs an exactly-once pipeline at ~10× the cost.
Embeddings (user, item, query, cross-surface). Writes: tables retrained on 6–24h cycles; incremental streaming embeddings from a two-tower run as a separate substream at minute-scale staleness. Storage: dedicated embedding service, sharded by entity-id hash, row-major within shard, int8-quantized with per-row scale + zero-point. Candidate-batch fetch: one RPC per shard carrying up to 1000 IDs, returns a dense matrix. Critical invariant: the quantized form stored online is the form trained on — quantization happens before feature logs are emitted, never as a post-hoc deploy-time step.
Historical aggregates (e.g., purchases_7d, sessions_30d, avg_dwell_90d). Long windows make streaming infeasible (30-day state per user at 100M users is prohibitive in Flink). Writes: daily batch backfill from warehouse. Storage: compact columnar blobs per entity, loaded to each ranker’s warm tier on a daily cycle. Reads: 5K point lookups per request, >99% on-box hit rate since values change daily.
5. Deep Dives
Section titled “5. Deep Dives”5.1 Latency Budget Decomposition
Section titled “5.1 Latency Budget Decomposition”Read path with per-stage timing:
t=0ms Request arrives at ranker box │ │ ┌── Feature IDs derived from candidate-gen result ───────┐ │ │ (1000 item IDs + user context; no feature fetch yet) │ │ └─────────────────────────────────────────────────────── ┘ │t=0.2 ├─► Feature orchestrator batches lookups by tier × type │ ┌────────────────┬────────────────┬──────────────────┐ │ │ │ │ │t=0.3 ├────►│ In-process │ On-box sidecar │ Remote fan-out │ │ │ (≤1.0ms p99) │ (≤2.0ms p99) │ (≤6.0ms p99) │ │ │ SwissTable + │ UDS + shmem │ gRPC, batched │ │ │ cacheline-opt │ zero-copy emb │ per-shard │ │ │ 40K lookups │ 9K lookups │ 1K lookups │ │ └────────┬───────┴────────┬───────┴────────┬─────────┘ │ │ │ │t=1.3 ├──────────────┴── in-process hits returnedt=2.3 ├───────────────────────── on-box hits returnedt=6.3 ├────────────────────────────────────────── remote hits returned │ (embedding cold + │ long-tail counters) │t=6.3 ├─► Tensor assembly: scatter into model-expected layout │ (zero-copy for embeddings from on-box shmem) │t=9.3 ├─► Padding + missing-value fill with learned per-feature defaults │t=10.3 └─► Hand to GPU ranker ┌── remaining budget ~9.7ms ──┐ │ for inference + tail slack │ └─────────────────────────────┘The in-process layer is the linchpin. It isn’t a cache in the eviction sense — it’s a curated mirror of the predicted hot set, refreshed by a background controller that reads the last hour of traffic, computes per-entity QPS, and pushes the top-N (20 GB worth) to every ranker box. [STAFF SIGNAL: invariant-based thinking] Invariant: any entity served >1000 QPS globally must be in-process on every ranker. Enforcement: the hot-set controller recomputes every 5 minutes; ranker boxes pull deltas. Cold-start of a new ranker box takes 5 min; it stays out of the LB until warm.
On-box sidecar: a separate process per ranker box, UDS-connected, 200 GB warm features in RocksDB with dirty_ratio=5, readahead disabled (random access dominates). UDS + shared-memory ring for zero-copy embedding reads — avoids the ~200μs of protobuf overhead that would otherwise dominate small-value reads. At 50K lookups × 200μs = 10 seconds of overhead without this, so the optimization is not cosmetic.
Remote fan-out: gRPC with pooled connections, per-shard concurrency capped at 32 (higher caps worsen tail from head-of-line blocking). Shards: 128-MB RocksDB nodes on Scylla for counters/aggregates; specialized embedding servers for embedding tables.
Batching vs latency: a 1ms batch window at the ranker. At 1M rps ÷ 2000 boxes ≈ 500 rps/box, 1ms captures 0.5 requests on average — too little for effective cross-request micro-batching at that level. Instead we batch within a single request’s 50K lookups. Cross-request batching only pays on the remote path and is handled by the sidecar transparently.
5.2 Embedding Storage
Section titled “5.2 Embedding Storage” Shard S (of 256; sharded by hash(entity_id) % 256) ┌─────────────────────────────────────────────────────────┐ │ Header: {version, row_count, dim, dtype, layout} │ ├─────────────────────────────────────────────────────────┤ │ Row 0: [scale(fp16) | zero_pt(int8) | 64 × int8 values] │ │ Row 1: [scale | zero_pt | 64 × int8 values] │ │ ... │ │ Row N: [scale | zero_pt | 64 × int8 values] │ ├─────────────────────────────────────────────────────────┤ │ ID → row offset index (perfect hash, built at load) │ └─────────────────────────────────────────────────────────┘
Row size: 2 (scale) + 1 (zero_pt) + 64 = 67B → padded to 72B
Candidate-batch fetch: Request: fetch([id_1 ... id_1000]) for T_item │ ├─ Partition by shard: { s_0: [...], s_1: [...], ..., s_255: [...] } ├─ Fan out 256-way (most shards see few IDs; p99 shard is the │ tail; over-provision slots per shard) │ Each shard: perfect-hash lookup → gather 72B rows into a │ contiguous response buffer, one memcpy per ID. └─ Assemble at client: [1000 × 72B] → GPU layout [1000 × 64 int8] Dequantize on GPU using per-row scale/zero_pt (fused kernel).Compression choice: int8 per-row scale. Evaluated:
- fp32: 256B/row — 4× storage, 4× wire, no quality gain vs int8 w/ calibration.
- fp16: 128B/row — 2× savings, rejected because int8 per-row QAT matches it offline.
- int8 per-row scale (chosen): 72B/row — ~0.01% NE regression on ads ranking with QAT.
- int4 groupwise scale: 40B/row — 0.05–0.1% regression; reserved for the largest tables (>1 TB) where wire savings dominate quality.
- PQ-8×16: 8B/row — 0.5%+ regression at ranking resolution; fine for retrieval, not ranking. [STAFF SIGNAL: embedding-specific thinking]
Training-serving equivalence for embeddings: the model trains on dequantized int8 (with the scale/zero_pt that will exist at serving) — quantization-aware training. Train on fp32, deploy int8, and a ~0.1% NE gap materializes silently. QAT costs ~15% extra training FLOPs and is non-negotiable. [STAFF SIGNAL: invariant-based thinking]
Sharding by entity-id hash (not popularity): popularity-based sharding concentrates hot entities on dedicated shards, creating a single point of hotspot load. Hash sharding spreads load; hot-entity handling is a replication concern (section 6), not a sharding concern. Consistent hashing with 256 virtual per physical shard for rebalance without full reshuffle.
5.3 Training/Serving Skew Prevention
Section titled “5.3 Training/Serving Skew Prevention” ┌──────────────────────────────────────────────────────────────┐ │ FEATURE DEFINITION (single source of truth) │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ DSL file per feature: │ │ │ │ feature user_clicks_last_5min: │ │ │ │ window: 5m sliding │ │ │ │ source: event_stream │ │ │ │ agg: count │ │ │ │ key: user_id │ │ │ │ default: 0 # explicit, single-source │ │ │ └────────────────────────────────────────────────────────┘ │ │ │ │ │ │ ▼ ▼ │ │ ┌──────────────┐ ┌──────────────┐ │ │ │ Flink codegen│ │ Spark codegen│ │ │ │ (online path)│ │ (offline) │ │ │ └──────────────┘ └──────────────┘ │ └────────│───────────────────────────────────│─────────────────┘ │ │ ▼ ▼ ┌────────────┐ ┌──────────────┐ │ Online FS │ ──── served ────────► │ Feature log │ ← source of │ (Scylla, │ values │ (Kafka; │ truth for │ emb svc) │ logged ─────────►│ immutable) │ training └────────────┘ └──────────────┘ │ │ ▼ ▼ ┌─────────────┐ ┌──────────────┐ │ Ranker │ ─── outcomes ──────► │ Join(logged │ │ (serves) │ labels │ feats, │ └─────────────┘ │ labels) │ └──────────────┘ │ ▼ ┌──────────────┐ │ Training data│ └──────────────┘Five skew sources; defense per:
- Code-path skew (different logic computing the same feature offline vs online). Defense: single DSL with codegen. Tradeoff: DSL can’t express every feature; ~5% need custom code with both paths implemented plus a parity CI that diff-checks on a daily sampled workload (alerts on >0.01% value mismatch).
- Time-travel skew (training uses end-of-day snapshot; serving sees as-of-now). Defense: log-and-train. The training row literally is the feature vector that was served, labeled after the fact. This is the single most important defense and eliminates point-in-time-correctness bugs by construction. Storage cost: ~500 GB/day at 1M rps × 50 sampled features × 100B × 10% sample — acceptable.
- Freshness skew (online is seconds stale; training assumes zero staleness). Defense: feature logs capture as-served staleness, not ideal staleness. Training sees the same staleness distribution as serving.
- Missing-value skew [STAFF SIGNAL: missing-value discipline]. Defense: DSL declares default; both codegens use it; feature log records whether the value was a default (sentinel bit). Distinguish “missing because new entity” (normal; model should handle) from “missing because pipeline broken” (incident; mask feature, do not train on garbage). The sentinel is two bits:
{present, new_entity, pipeline_broken}. Model architecture can optionally consume the sentinel as a feature — lets the model learn to ignore broken-pipeline values during inference. - Distribution drift (world changes; model stays fixed). Defense: KS-test / PSI on serving distribution vs training distribution, per feature, hourly. Alert on KL divergence above threshold. Not a skew fix — a detection layer for model staleness.
Invariant: feature_value_served(t) == feature_value_used_in_training(log_row_at_t). Enforced by log-and-train for 95% of features; weaker invariant (within ε) for the 5% custom-code features, enforced by parity CI.
5.4 Real-Time Features and Freshness
Section titled “5.4 Real-Time Features and Freshness”Event → Kafka → Flink (keyed stream, RocksDB state) │ ├──► (a) Scylla sink (authoritative) ├──► (b) changelog Kafka ─► ranker sidecar refresh └──► (c) feature-logger tap ─► training logEnd-to-end budget: event-to-served p95 under 10s.
- Event → Kafka ingest: 50ms
- Kafka → Flink consume: 100ms
- Flink windowed agg + sink: 200ms
- Scylla write + replication: 50ms
- Changelog → ranker sidecar propagation: 2–8s (dominant term)
Freshness-cost tradeoff [STAFF SIGNAL: freshness-cost tradeoff]: sub-second freshness requires the ranker to bypass sidecar and hit Scylla directly, ~4–6ms per lookup. At 50 counters per request that’s 200–300ms of budget — infeasible fleet-wide. 5–10s via changelog propagation fits the budget. Achieving 1s would require a separate super-fresh path just for counters that demonstrably need it (fraud signals) — roughly 10× infra cost per such feature. Committed position: 10s freshness default; 1s path reserved for a whitelist of ~5% of features.
Read-after-write for a user’s own session: a user who just clicked expects the click reflected in next-request features. If session-affine to one ranker box (usually true), we tap the event stream at the ranker and locally update in-process counters on the way to Flink. The authoritative value is still what Flink produces; the local tap is a low-confidence prediction reconciled within seconds. Deliberate weakening of consistency for latency — documented, invariant-bounded (bound: local value ≤ global value + 1).
5.5 Cardinality Defense
Section titled “5.5 Cardinality Defense”[STAFF SIGNAL: cardinality defense] A team ships click_ctr_by_user_x_item_x_context keyed on the triple. Expected cardinality: 100M × 10M × 1K = 10^18 keys. Storage and serving cost go to infinity.
Defenses at five layers:
- Definition-time (DSL): every feature declares
max_cardinality. Features above 10^9 keys require architecture-review signoff. The DSL compiler rejects undeclared cardinality. - Ingest-time: the feature registry tracks per-feature unique-key growth rate. Alert fires on >10× WoW growth. Flink job is auto-paused pending owner review.
- Storage-time: per-feature shard quota. Exceeding quota triggers sample-on-write (1% of keys retained); old keys TTL. Feature degrades to “sampled” status, visible to models as metadata.
- Query-time: missing keys return feature-level default. The ranker sees “no signal” for the long tail, which is correct.
- Cleanup: every feature has
ownerandlast_queried_by_active_modelannotations. After 90 days with no active reader, a weekly reaper proposes deletion; owner has 14 days to object. No objection → delete.
Without these, the feature store fills with abandoned features — I’ve seen 40% storage waste from dead features in two prior orgs. This is the same class of problem as unbounded-cardinality metrics in observability TSDBs; the governance model transfers.
6. Failure Modes and Graceful Degradation
Section titled “6. Failure Modes and Graceful Degradation”[STAFF SIGNAL: failure mode precision] [STAFF SIGNAL: blast radius reasoning]
Single embedding shard down (1/256 = 0.4% of rows unavailable). Policy: cold-tier fetches that fail return a learned-default embedding (not zeros — zeros may be in-distribution for some rows and cause silent bias). Quality impact: 0.4% of candidates get defaults, estimated ~0.1% NE loss on ads ranking. Alerting: shard-availability metric. Recovery: shard has hot standby; promotion <30s.
Flink job lag on counters (5 min behind real time). Sidecar keeps serving last-known-good. If lag > 30 min per-counter staleness monitor fires; the ranker is told to mask affected features (replaced with default + a “stale” sentinel bit — the model was trained to consume this). Quality: ~0.05% CTR loss while degraded.
Hot-entity replication staleness: embeddings update on 6h cadence, not second-to-second, so replica drift is bounded by replication SLA. Not a realistic failure at this time resolution.
5% of features stale or missing simultaneously. The yellow-zone scenario. Empirically on two prior systems: 0.2–0.4% NE regression — noticeable, not catastrophic. Runbook: shift 10% traffic to a fallback simpler model using only the core 50 features, monitor. If quality gap > 1%, failover simple model to 100% traffic while issue resolves. GameDay quarterly for this.
Feature store under DDoS / runaway internal client. Per-tenant rate limits at the gateway. Dedicated shards for critical tenants (separate QPS domains). Shed-load ordering: non-critical readers get 503 before critical ones.
Ranker loses in-process tier (restart). 5-minute sidecar-driven warmup before box enters the LB. Cold boxes have 3–5× tail latency; keeping them out until warm preserves p99.
Hot-entity surge (viral item hits 100K QPS). Detection: per-key QPS counter in the embedding service. Mitigation (layered): (a) replicate the hot row to all shards (cost: +size × shard_count, negligible for a few hot rows); (b) client-side caching in ranker in-process tier (already happening via hot-set controller); (c) if sustained, pin the row to in-process on every box. Consistency: hot rows update via gossip broadcast rather than polling; update latency bounded by gossip epoch (~1s).
7. Multi-tenancy, Versioning, Lifecycle
Section titled “7. Multi-tenancy, Versioning, Lifecycle”[STAFF SIGNAL: operational longevity] Features versioned by content-hash of DSL definition + input schema. An edit produces a new version; old version still served until readers migrate. Namespace: team.feature_group.feature@vHash. One version per job; a DSL edit spawns a parallel Flink job.
Storage overhead from versioning: typically 3–5 live versions per feature. 2K features × 4 versions × 2 GB avg = 16 TB overhead — acceptable.
Lifecycle states: experimental (sampled writes, no prod readers), production (full writes, any reader), deprecated (writes continue, no new readers, 90-day sunset). A deprecated feature with zero readers for 14 days auto-deletes.
The hard part is not versioning but migration. Without an SLA — “a feature version is supported for at most 12 months past its successor’s release” — versions accumulate forever. I’d own this SLA at the feature-platform level with automated PRs to readers of deprecated features, owner escalation after 60 days, and automatic reader-side fallback to the new version (with correctness gating) at 90 days.
8. Recent Developments
Section titled “8. Recent Developments”[STAFF SIGNAL: modern awareness]
- Uber Michelangelo / Palette: pioneered log-and-train as the canonical skew defense. Worth borrowing wholesale. Only reason not to: training on historical data not yet served (new-model cold start on a surface that hasn’t run). For that, point-in-time batch joins are the fallback — but they should be the minority path.
- Meta’s embedding infrastructure (ZionEX, TorchRec): the largest embedding tables (>100 TB) use disaggregated memory pools with RDMA from trainers. Same pattern applies on the serving side when tables outgrow per-box RAM. Not needed for a 500 GB hot working set but the right pattern at 10 TB+.
- Matryoshka embeddings (Kusupati et al., 2022): train embeddings so the first-k dims are themselves a valid (lower-quality) embedding. Lets you serve 32-dim to cold candidates and 128-dim to top-k, cutting wire cost for the long tail. Worth piloting on item embeddings specifically.
- Feast / Tecton: Feast is a control plane (registry, lineage, offline joins), not a serving substrate at 1M rps. Tecton’s unified substrate is production-viable to ~100K QPS; past that the abstraction taxes — DSL constraints on window types, aggregation functions — become real.
- Flink 1.17+ with async I/O + tuned RocksDB: modern default for streaming features. Kafka Streams is a sibling for small topologies.
- Published skew postmortems: Meta’s 2022 IG ads training-serving skew (~1% revenue over 3 weeks before detection) reshaped the industry’s monitoring posture. Continuous serving-distribution vs training-distribution comparison, per feature, is now a required runtime invariant in serious orgs.
- Vector DBs (Milvus, Qdrant, Pinecone): optimized for ANN search, not exact-ID batch fetch. At ranking scale we want the latter — a specialized embedding service outperforms vector DBs on candidate-batch fetch latency. Vector DBs are the right substrate for candidate generation, not ranking.
- Streaming feature platforms (Chronon @ Airbnb, Zipline @ Stripe): both publish the “single definition, dual execution” pattern and log-and-train as standard practice. Confirms the direction.
9. Tradeoffs Taken and What Would Change Them
Section titled “9. Tradeoffs Taken and What Would Change Them”- If embedding tables exceed 10 TB hot, on-box tier becomes infeasible. Redesign: RDMA fabric with disaggregated embedding pool; ranker boxes hold only a prediction cache. Meta ZionEX-style.
- If all features required sub-1s freshness, the sidecar changelog path is too slow. Redesign: direct Scylla hits for fresh features, probably with per-feature QoS classes and a smaller total feature count (freshness bought from feature count at fixed budget).
- If we served 10× more concurrent models, namespace isolation and per-tenant shard quotas become critical; the single-registry design creates a control-plane hotspot. Redesign: federated registry with per-tenant control planes.
- If log storage becomes prohibitive (>10 PB/yr), sampled logging per user with reservoir sampling; trainer weights samples. Measurable quality impact but bounded; worth it for storage cost.
- If the business wins on retrieval rather than ranking, the 50K-per-request assumption dissolves; architecture shifts to ANN-first with lighter feature sets at ranking time — likely a better overall product win, worth exploring before committing to 50K-feature infra.
- If skew postmortems reveal log-and-train misses something structural (e.g., the feature log itself has bugs), add parity checks between warehouse recomputes and logged features on a 1% sample daily. Catches log-pipeline bugs that would otherwise be invisible.
10. What I’d Push Back On
Section titled “10. What I’d Push Back On”[STAFF SIGNAL: saying no]
- 50K features per request is likely excessive. Feature-pruning studies consistently find 30–50% of features are pass-through noise. Before scaling infra, fund a pruning initiative. It’s cheaper, reduces operational surface area, and buys back latency headroom.
- 20ms uniform budget is the wrong allocation. A small number of features (top embeddings, a handful of counters) drive most discrimination. Budget should be allocated by feature importance, not uniformly. Unimportant features live in cheapest tier with loosest freshness; important features get in-process treatment regardless of raw hit rate. Probably a 20% efficiency win.
- “Unified offline/online feature store” is often an X-Y problem. The real goal is skew elimination. Unified definition gets 95% of that; unified execution is expensive and constraining. Insist on the former; resist the latter unless explicitly better than log-and-train.
- Freshness requirements are probably over-specified. For the 80% of features that are historical aggregates on 7–30d windows, sub-minute freshness is pointless — the signal moves on a day timescale. Feature-by-feature freshness SLAs unlock 10× cost savings on the long tail.
- “Design the online feature store” treats the feature store as the system. The actual system is model + features + ranker. If the model can tolerate lossy compression, defaulted missing values, and a smaller feature count, infra gets dramatically cheaper. The cross-team conversation with modeling — “what degradation tolerances can you absorb?” — is worth 10× any internal infra optimization. A staff engineer owns that conversation, not just the infra.