Skip to content

KV Cache Management for an LLM Inference

KV Cache Management for an LLM Inference Platform — Staff-Level System Design

Section titled “KV Cache Management for an LLM Inference Platform — Staff-Level System Design”

1. Research Pass — State of the Art as of April 2026

Section titled “1. Research Pass — State of the Art as of April 2026”

Before designing anything, here’s what the actual frontier looks like in 2026, and what surprised me. I’ll cite specific systems and give them architectural payload, not name-drop them.

Mooncake (Kimi/Moonshot) is now in production across thousands of nodes. The FAST ‘25 paper (“Trading More Storage for Less Computation”) makes the thesis explicit: a disaggregated KVCache pool, holding 100B+ tokens daily, is cheaper per request than recomputing. The architecture is not “vLLM with a remote cache” — it is a Conductor (global scheduler), a Mooncake Store (distributed KV pool over CPU DRAM/SSD/NIC across the GPU cluster), and a Transfer Engine (RDMA, 8×400 Gbps fabric, topology-aware path selection). The Conductor selects a (prefill, decode) instance pair per request and explicitly manages KVCache replication across hot prefixes. As of July 2025 Mooncake powers Kimi K2 on 128 H200 GPUs hitting 224k tokens/sec prefill / 288k tokens/sec decode. It joined the PyTorch ecosystem in February 2026.

NVIDIA Dynamo + NIXL (announced GTC 2025) is the most operationally relevant frontier framing. Dynamo is the orchestration layer above SGLang/TRT-LLM/vLLM — Smart Router does KV-aware routing, Planner does SLA-driven scheduling and prefill/decode pool sizing, KVBM tracks blocks, and NIXL is the transport. NIXL is the architectural primitive that didn’t exist 18 months ago: a unified async point-to-point library with pluggable backends (UCX, GDS, S3, NVMe-oF), automatic backend selection per source/destination memory tier, ETCD-based metadata side-channel. The fact that NIXL had to be built — that “general RDMA” wasn’t sufficient — is the signal that KV transfer is now a first-class data plane, not network-as-tier-4. Mooncake’s Transfer Engine became a NIXL backend plugin in May 2025. By late 2025 Dell, WEKA, and Azure had built native NIXL plugins (e.g. WEKA reports 19× faster TTFT with NIXL+NeuralMesh).

LMCache matters because it’s the cross-engine glue layer for vLLM. Two modes: persistent KV offload to CPU/disk/Redis, and PD-disaggregation with peer-to-peer KV transfer over NIXL. The LMCache tech report shows up to 15× throughput improvement combined with vLLM. Critically, LMCache’s experiments show that at 32 Gbps, KV loading beats recomputation only above 256K context, but at 64+ Gbps, KV loading wins at every context length. This breakeven number is the thing I will keep referring back to.

SGLang HiCache generalized RadixAttention from L1 (GPU) to L1/L2/L3 (GPU/CPU/distributed via Mooncake, 3FS, NIXL, AIBrix). The HiRadixTree extends the radix tree with location metadata — this is the right data structure for the local index. ByteDance/Tencent’s FlexKV (Jan 2026) is a distributed KV store that integrates with Mooncake.

Anthropic and OpenAI prompt caching as productized features tell us how the underlying systems are actually built. Anthropic: explicit breakpoints (up to 4), 5-min default TTL with 1-hour extended option, min 1024 tokens, cache write 1.25× base / 1-hour 2× base, cache read 0.1× base. ZDR-compatible, in-memory only, organization-isolated. The economic structure (read at 10% of write cost) reveals that the breakeven is one cache hit at 5-min TTL, two hits at 1-hour. The fact that Anthropic exposes this control rather than hiding it (vs OpenAI’s automatic caching) suggests Anthropic concluded that giving customers explicit breakpoint control is worth the API surface complexity, presumably because automatic caching loses too much when cache breakpoints land in the wrong place for agentic workloads (where dynamic content sits in the middle of the prompt — see ProjectDiscovery’s 59% cost reduction via manual breakpoint placement).

Hardware substrate (2026): GB200 NVL72 made the biggest architectural shift. Inside one rack, 72 Blackwell GPUs share a 130 TB/s NVLink Switch domain — 1.8 TB/s per GPU bidirectional, all-to-all, single-hop through NVSwitch. This effectively dissolves “intra-node vs inter-node” within the rack. Cross-rack still uses InfiniBand Quantum-X800 (now 800 Gbps per ConnectX-8 NIC, 4-rail optimized in production deployments). CXL memory pooling for KV exists in research and a few hyperscaler deployments but isn’t the dominant pattern — RDMA over IB/RoCE to remote DRAM is.

KV compression in production: FP8 KV is mostly free and broadly deployed. INT8 KV is essentially lossless. INT4 KV (KIVI’s per-channel-key + per-token-value scheme, also used by KVQuant) is deployed in lmdeploy and a few production stacks but with quality monitoring. Per-channel key quantization is the load-bearing trick — Liu et al. and Hooper et al. independently discovered that key cache outliers concentrate on fixed channels while value cache outliers are diffuse. As of 2026, 2-bit KV (Kitty, BitDecoding, TurboQuant via Hadamard rotation) is research-quality and seeing limited production trials — TurboQuant in lmdeploy claims K=4-bit + V=2-bit ≈ 3-bit average with near-zero accuracy loss (ICLR 2026).

Cache-aware scheduling research that contradicts the simple framing: DualMap (Feb 2026) argues — correctly, in my reading — that Mooncake/Dynamo/Preble all fail to simultaneously achieve cache affinity and load balance because they work in a single mapping space and time-slice between strategies. DualMap’s prefix-bound dual candidates (each prefix hashes to two candidate engines) gives 2 degrees of freedom for load balancing while preserving cache locality. This is the kind of research-to-production gap a staff candidate should flag: if you build cache-aware routing today using only Mooncake’s pattern, you will under-perform what’s already known to be possible.

One contradiction with the prompt itself: the prompt suggests 4K-KV transfers over 8×400 Gbps IB taking ~16ms. Math: 8×400 Gbps = 400 GB/s, 640 MB / 400 GB/s = 1.6 ms. Even with RDMA overhead and pipeline-fill costs, ~3-5 ms is realistic, not 16 ms. [STAFF SIGNAL: saying no — pushing back on prompt assumptions] The 16ms number is closer to what you’d see on a single-rail 400G NIC with non-pipelined transfer, not the full 8-rail node-level bandwidth.


2. Scope, Reframing, and Committed Assumptions

Section titled “2. Scope, Reframing, and Committed Assumptions”

[STAFF SIGNAL: distributed-fabric reframing] This is not a node-local cache problem. KV is a first-class artifact transferable across engines. Node-local hierarchy (HBM/DRAM/NVMe) is a sub-component inside a larger distributed fabric; the fabric — distributed index, transfer protocol, replica placement, eviction coordination, multi-tenant accounting — is the architecture. I’ll design the fabric first, then the local hierarchy as a leaf.

[STAFF SIGNAL: scope negotiation] Concrete commitments:

  • Topology: multi-cluster, multi-region. Single region is the primary deployment (US-East and US-West as twin regions); cross-region cache sharing is not in scope (latency floor of 50–80 ms makes inter-region KV transfer almost never beat recompute, and tenant data residency forbids it for many).
  • Per-region cluster: ~1000 GPUs split across 100 nodes. Each node is either GB200 NVL72 rack-scale (the new build-out) or 8×H100 SXM5 (the legacy fleet). NVLink Switch domain inside NVL72 (130 TB/s); ConnectX-7 400 GbE per H100, ConnectX-8 800 GbE per Blackwell, with InfiniBand Quantum-X800 between racks.
  • Disaggregated prefill/decode: yes, default. Prefill on dense H100/B200 nodes, decode on bandwidth-optimized configs. PD-aggregated remains a fallback for low-load scenarios where the Planner detects under-utilization (Dynamo Planner does this).
  • Multi-tenant: thousands of tenants, hard isolation. Tenant-A KV must never be discoverable by Tenant-B, even via timing.
  • Prompt distribution: highly skewed. The top 100 system prompts account for ~60% of input tokens (typical of platforms with chat + agentic + RAG workloads). Median prompt 4K tokens, p99 200K tokens, max 1M tokens.
  • LoRA adapters: in scope. KV cache key includes (model_id, lora_adapter_id, prefix_hash, precision). LoRA adapter doesn’t usually invalidate prefix KV (because LoRA modifies attention output projections, not prefix K/V production), but I’ll address this nuance below.
  • Productization: explicit breakpoints exposed (Anthropic-style, not OpenAI-automatic). Cache reads priced at 10% of base, cache writes at 1.25×.
  • Model: I’ll use a 70B GQA model (8 KV heads, 80 layers, head_dim 128) as the worked example. FP8 KV by default.

What I’m explicitly not doing: cross-region active-active, non-RDMA fabrics for the hot path, single-tenant deployments, or HBM-only architectures.


3. Capacity and Bandwidth Math — The Numbers That Drive Every Decision

Section titled “3. Capacity and Bandwidth Math — The Numbers That Drive Every Decision”

[STAFF SIGNAL: capacity math] Per-token KV size for 70B GQA, FP8:

2 (K+V) × 80 layers × 8 KV-heads × 128 head_dim × 1 byte = 163,840 B ≈ 160 KB/token

This number is the unit of all that follows. A few calibration points:

  • 4K-token prefill KV: 640 MB
  • 32K-token: 5.1 GB
  • 128K-token: 20.5 GB
  • 1M-token: 160 GB (will not fit in a single H100’s 80GB HBM uncompressed — the long-context wall)

Per H100 (80 GB HBM), with model weights ~70 GB FP8, ~10 GB available for KV → ~64K tokens of KV per GPU. Per H200 (141 GB), ~70 GB available → ~450K tokens. Per B200 (192 GB), ~120 GB available → ~750K tokens. The HBM wall scales with hardware generation but never enough.

Transfer time per link, for a 4K-token (640 MB) KV:

LinkEffective BW640 MB transfer32K-token (5.1 GB)1M-token (160 GB)
NVL72 NVLink Switch (per-GPU)1.8 TB/s0.36 ms2.8 ms89 ms
H100 NVLink intra-node900 GB/s0.71 ms5.7 ms178 ms
ConnectX-8 800G (single-rail)100 GB/s6.4 ms51 ms1.6 s
ConnectX-7 400G × 8 rails (full node)400 GB/s1.6 ms12.8 ms400 ms
Local NVMe (PCIe Gen5)12 GB/s53 ms425 ms13 s
100 GbE12.5 GB/s51 ms410 ms12.8 s
10 GbE1.25 GB/s510 ms4.1 s128 s
Cross-region (~10 Gbps with overhead)1 GB/s640 ms5.1 s160 s

Recompute time (the reference cost): 70B FP8 prefill on H100 SXM5 sustains roughly 8K tokens/sec (TFLOPS-bound for short prompts, memory-bound at long context). On B200 FP8/FP4: ~25K tokens/sec.

LengthH100 prefillB200 prefill
4K500 ms160 ms
32K4.0 s1.3 s
1M125 s40 s

Transfer-vs-recompute breakeven: for any link from NVLink down to 100 GbE inclusive, transfer always wins over recompute on H100. The crossover is around 10 GbE for moderate contexts and below cross-region for any context. Implication: across a single datacenter, you should always prefer transfer over recompute when a remote replica exists. The math LMCache’s report shows (32 Gbps loses to recompute below 256K; 64+ Gbps wins everywhere) is consistent with this.

Capacity at each tier per node (8×H100 node):

  • HBM: 8 × 10 GB free for KV ≈ 80 GB → ~500K tokens
  • DRAM: 1 TB typical → ~6.4M tokens
  • Local NVMe: 30 TB → ~190M tokens

Aggregate per region (1000 GPUs):

  • HBM: 10 TB → 64M tokens
  • DRAM: 100 TB → 640M tokens
  • NVMe: 3 PB → 19B tokens

These working sets are large enough that, with 60% prefix concentration, we’re permanently in the regime where high cache hit rates are achievable.


[STAFF SIGNAL: distributed-fabric reframing]

┌─────────────────────────────────────────────────────────┐
│ Frontend / API Layer │
│ (auth, tenant routing, breakpoint parsing, billing) │
└───────────┬─────────────────────────────────────────────┘
│ request + cache_control breakpoints
┌─────────────────────────────────────────────────────────┐
│ CONDUCTOR (regional, replicated, sharded) │
│ ┌──────────────────────┐ ┌────────────────────────┐ │
│ │ Distributed KV Index │ │ Multi-Objective Router │ │
│ │ (sharded radix tree) │ │ (cache+load+xfer+SLO) │ │
│ └──────────────────────┘ └────────────────────────┘ │
│ ┌──────────────────────┐ ┌────────────────────────┐ │
│ │ Replica Placement / │ │ Planner: PD pool sizing│ │
│ │ Hot Prefix Manager │ │ + autoscale + early-rej│ │
│ └──────────────────────┘ └────────────────────────┘ │
└────────────┬───────────────────────────┬────────────────┘
│ control plane (NATS) │
┌─────────────────┴────────┐ ┌────────┴──────────┐
▼ ▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ PREFILL Pool │ │ PREFILL Pool │ │ DECODE Pool │ │ DECODE Pool │
│ B200 nodes │ │ H100 nodes │ │ H200 nodes │ │ B200 NVL72 │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Engine + │ │ │ │ Engine + │ │ │ │ Engine + │ │ │ │ Engine + │ │
│ │ HiCache │ │ │ │ HiCache │ │ │ │ HiCache │ │ │ │ HiCache │ │
│ │ HBM/DRAM │ │ │ │ HBM/DRAM │ │ │ │ HBM/DRAM │ │ │ │ HBM/DRAM │ │
│ │ /NVMe │ │ │ │ /NVMe │ │ │ │ /NVMe │ │ │ │ /NVMe │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
│ NIXL agent │ │ NIXL agent │ │ NIXL agent │ │ NIXL agent │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │ │
└────────────────────────┴─────────────────┴─────────────────┘
KV TRANSFER FABRIC (NIXL over UCX/GDS/IB/NVLink)
─ NVLink Switch within rack (1.8 TB/s/GPU)
─ IB Quantum-X800 across racks (800 Gbps/NIC)
─ GDS for NVMe tier
┌──────────────────────────────────────────┐
│ DRAM/NVMe Pool (Mooncake Store-like) │
│ ─ aggregates underutilized CPU mem + SSD │
│ ─ remote tier (L3) for cold prefixes │
└──────────────────────────────────────────┘

The Conductor is logically central but physically replicated and sharded by tenant_id (and prefix-hash within tenant). Each node runs an inference engine (vLLM, SGLang, or TRT-LLM) with HiCache-style local hierarchy, plus a NIXL agent that handles all RDMA/NVLink/GDS/object-store transfers. NATS JetStream + ETCD provide control-plane coordination; data-plane never touches them.

[STAFF SIGNAL: rejected alternative] I considered a fully decentralized DHT-style index (Chord/Kademlia) and rejected it. Two reasons: (1) routing decisions need joint reasoning over cache state, load, and transfer cost — a DHT gives you locality but not load info, requiring a second control layer anyway; (2) Mooncake’s centralized Conductor is in production at Kimi scale (thousands of nodes, 100B tokens/day), so the ceiling on centralized scheduling is high enough that the operational simplicity wins. I shard the Conductor by tenant for horizontal scale.


5. Distributed Index and Multi-Objective Routing

Section titled “5. Distributed Index and Multi-Objective Routing”

[STAFF SIGNAL: distributed-index discipline]

Each engine hashes its KV blocks (block size 256 tokens — the LMCache default; SGLang uses 16, Mooncake supports configurable; 256 wins on per-block transfer overhead amortization). Block hash = SHA-256 over (tenant_id || model_id || lora_id || precision || parent_block_hash || token_ids). The parent_block_hash links blocks into a chain so prefix matching is just longest-chain matching, equivalent to radix-tree traversal.

Engines emit kv_block_created and kv_block_evicted events to NATS. Each Conductor shard maintains a sharded radix tree indexed by tenant_id; within a tenant, it’s the standard prefix tree but each leaf records {engine_id, tier (HBM/DRAM/NVMe/L3), last_access, popularity_qps}.

Consistency model: eventual. A new block’s create event propagates to the index in ~10 ms (NATS local-region p99). The index lags reality on the order of a single batch latency. The cost of strong consistency isn’t worth it: the routing decision tolerates ~10 ms of staleness because (a) a 10-ms-stale answer routes to some engine that has the prefix or had it 10 ms ago — that engine’s local match still works at request time; (b) the worst case is a wasted lookup, not incorrect output. Mooncake’s heuristic-based replication, Dynamo’s NATS-backed prefix tree, and llm-d’s kvcache.Index all converge on this — the design space has narrowed.

Index scale. At 1000 engines × 10M blocks each = 10B blocks. At 64 bytes per radix-tree leaf (hash + tier + engine_id + metadata) = 640 GB across the cluster. Sharded by tenant across, say, 32 Conductor instances → ~20 GB per shard, fits in DRAM. The radix tree’s path compression keeps internal node count well below leaf count. This is tractable, which is why Mooncake’s centralized Conductor scales.

[STAFF SIGNAL: routing-as-multi-objective] A naive least-loaded router fails. A cache-only router fails (popular-prefix stampede). The right scoring function is unified:

score(engine_e, request_r) =
α · prefill_cost(e, r)
+ β · transfer_cost(e, r)
+ γ · queue_delay(e)
+ δ · replica_pressure(e, r)
− SLO_credit(e, r)
prefill_cost(e, r) = (input_tokens(r) − overlap_tokens(e, r)) / prefill_TPS(e)
transfer_cost(e, r) = remote_overlap_bytes(e, r) / effective_BW(e, source(r))
queue_delay(e) = pending_tokens(e) / prefill_TPS(e)
replica_pressure(e,r) = local_kv_used(e) / local_kv_capacity(e) // avoid full engines
SLO_credit(e, r) = max(0, request_SLO − P95_TTFT(e))

where overlap_tokens(e, r) comes from the Conductor’s radix-tree match against engine e’s reported blocks, and remote_overlap_bytes accounts for the cost of pulling missing-on-e-but-present-elsewhere blocks if the router decides to do KV migration.

The router considers three macro-decisions and picks the minimum:

  1. Route to engine with best local match (skip transfer): low transfer_cost, possibly worse load.
  2. Route to least-loaded engine, transfer KV from peer: lower queue_delay, transfer_cost > 0.
  3. Route to least-loaded engine, recompute prefill: highest prefill_cost, zero transfer_cost.

In our regime (intra-DC, RDMA available), option 3 almost never wins — the transfer cost is dominated by ~ms while recompute is hundreds of ms. So the practical decision is between options 1 and 2, and the right answer depends on (local_match_size, load_skew). Dynamo’s --router-kv-overlap-score-weight is the operational knob; α = 1.0 is the default and roughly optimal for prefill-heavy workloads. For decode-heavy (short input, long output), drop α toward 0.

[STAFF SIGNAL: rejected alternative] I considered DualMap’s prefix-bound dual-candidate scheme (each prefix hashes to two engines, choose between them by SLO-aware TTFT estimation). It’s elegant and the paper’s experiments show it dominates Mooncake/Dynamo in the cache-affinity-vs-load-balance frontier. I’d ship the simpler scoring function above as v1 because it’s understood at scale and integrates with existing Dynamo/Mooncake telemetry; DualMap-style routing is what I’d target for v2 once the v1 telemetry tells me where it’s leaving performance on the table. This is the kind of “what’s known to academia, not yet productized” gap a staff engineer should explicitly call out.

REQUEST ROUTING DECISION FLOW
Request arrives ──▶ Tokenize, compute prefix hashes
Query Conductor.Index(tenant_id, prefix_hashes)
┌─────────────────────────────────────────────┐
│ Returns: list of (engine, overlap_blocks, │
│ tier_distribution, current_load) candidates │
└─────────────────────────────────────────────┘
Score each candidate via cost function
(prefill + transfer + queue + pressure)
Pick min-cost candidate
┌────────────────────────────────────────────┐
│ if best.transfer_cost > 0: │
│ issue NIXL transfer from source(s) to │
│ chosen prefill engine in parallel with │
│ request dispatch (overlap) │
│ else: │
│ dispatch directly │
└────────────────────────────────────────────┘
Prefill streams KV to Decode engine
(layer-by-layer, overlapped with prefill)

I’ll go deep on eight areas. Each: problem, design, tradeoffs, failure modes.

6.1 The KV Transfer Fabric (NIXL, GPUDirect, transport selection)

Section titled “6.1 The KV Transfer Fabric (NIXL, GPUDirect, transport selection)”

[STAFF SIGNAL: transfer-fabric awareness] The transport library is load-bearing. The reason NVIDIA had to build NIXL — rather than use bare UCX or libfabric — is that no single transport is right for all KV transfers. NIXL’s job: given (source_mem_type, dest_mem_type, both endpoints’ available backends), pick the optimal transport per transfer.

Source → DestOptimal backendRationale
GPU-HBM → GPU-HBM, same NVL72NVLink Switch via UCX1.8 TB/s, single-hop NVSwitch
GPU-HBM → GPU-HBM, same H100 8-GPU nodeIntra-node NVLink900 GB/s, 1-hop
GPU-HBM → GPU-HBM, cross-rackGPUDirect RDMA via UCX/IB~100 GB/s/rail, no CPU bounce
GPU-HBM → CPU-DRAMCUDA UVA + IB if remotePinned memory
GPU-HBM → NVMeGPUDirect Storage (GDS)Skip CPU bounce buffer
CPU-DRAM → CPU-DRAM, remoteUCX over IBStandard RDMA verbs
GPU-HBM → S3/objectNIXL S3 plugin (e.g. WEKA, Dell)Cold tier

Concrete latency budgets for a 4K-token KV (640 MB) under realistic conditions:

  • Intra-NVL72: ~360 µs, dominated by SerDes setup, not transfer
  • Intra-H100 node: ~750 µs
  • Cross-rack via 4× IB rails: ~3 ms (1.6 ms transfer + ~1 ms RDMA setup + congestion margin)
  • HBM→NVMe via GDS: ~55 ms
  • HBM→remote DRAM via IB: ~5–8 ms

Pinned-memory and connection-management cost is the real-at-scale problem. Each NIXL agent registers buffer pools at startup. At 100 engines × 99 peers × multiple backends = ~10K active connections per agent; UCX endpoint memory ~16 KB/connection → 160 MB just for endpoint state per node. Tractable, but you don’t want a fully-meshed N×N. The right pattern is on-demand connection establishment with LRU connection eviction and pre-warmed connections to topology-near peers.

Failure modes:

  • NIXL transfer timeout (hung NIC, peer crash) → fall through to recompute-on-receiver; this is the dominant failure path.
  • Pinned-memory exhaustion → graceful degrade to staged-DRAM transfer; observable.
  • Connection storms during scale-out → pre-warm topology-near peers, exponential backoff for far peers.

6.2 Prefill-Decode Disaggregation and the KV Handoff

Section titled “6.2 Prefill-Decode Disaggregation and the KV Handoff”

[STAFF SIGNAL: prefill-decode disaggregation]

The handoff is the load-bearing protocol. Mooncake’s Messenger streams KV layer-by-layer in parallel with prefill compute on the source side, and async-loads it on the destination side concurrent with decode. This is non-negotiable: if you wait for full KV before starting decode, you’ve added the entire transfer time to TTFT.

PREFILL→DECODE OVERLAPPED PIPELINE
Prefill Engine Decode Engine
┌──────────────┐ ┌──────────────┐
│ Layer 1 ────────── streaming │ ┌──────────┐ │
│ │ (NIXL push) ──▶ │ │ Recv L1 │ │
│ Layer 2 ────────── streaming │ │ to HBM │ │
│ │ (NIXL push) ──▶ │ └──────────┘ │
│ Layer 3 ────────── streaming │ ┌──────────┐ │
│ ... │ │ │ Decode │ │
│ Layer 80 │ │ │ starts │ │
└──────────────┘ │ │ at L1 │ │
│ └──────────┘ │
└──────────────┘
Total TTFT = max(prefill_time, transfer_time + first_decode_step)
≈ prefill_time + ~5-10ms tail
vs naive: prefill_time + full_transfer + decode

When does disaggregation pay off? When the prefill/decode compute ratio differs sharply from the GPU’s natural compute mix. For long-context, short-output workloads (RAG, summarization) prefill dominates, so you want few high-FLOPS prefill GPUs (B200) and many memory-bandwidth-optimized decode GPUs (could even be cheaper H200). For short-context, long-output reasoning workloads, the inverse — but disaggregation still helps because reasoning requests benefit from dedicated decode capacity not contending with prefill.

Quantification: disaggregation pays off above roughly 50–100 QPS per cluster (below that, the network overhead and pool-sizing overhead exceed the benefits). At Kimi-K2 scale (224K tokens/sec prefill on 128 H200), disaggregation delivers the documented 30%+ throughput vs aggregated.

[STAFF SIGNAL: rejected alternative] I considered keeping prefill+decode aggregated and using KV migration only for cache hits. This is what early vLLM does. The decision against: at 1000-GPU scale, the arithmetic intensity mismatch between prefill (compute-bound) and decode (bandwidth-bound) means aggregated GPUs are never well-utilized for both. Dynamo’s Planner explicitly addresses this by adjusting PD pool sizes based on real-time queue depths — that’s the production answer.

Failure mode: decode pool can’t reach prefill pool’s KV. Two recovery options: (1) re-prefill on the decode pool (if it has prefill capacity) — expensive but always works; (2) fail the request. I’d ship (1) with rate-limiting (only the first N% of failures retry; the rest fail-fast to prevent cascades).

6.3 Cross-Engine Prefix Sharing and Distributed Index Mechanics

Section titled “6.3 Cross-Engine Prefix Sharing and Distributed Index Mechanics”

[STAFF SIGNAL: distributed-index discipline]

Concrete scenario: a request with 8K-token system prompt arrives. The Conductor’s index says engine B has 6K of those tokens cached in HBM, engine C has the same 6K + an additional 2K in DRAM. Routing options:

  • (a) Route to B (6K hit, 2K to compute, 0 transfer)
  • (b) Route to C (8K hit but 2K of it requires DRAM→HBM load locally, 0 transfer)
  • (c) Route to least-loaded engine A, transfer 8K from C, 0 compute except first new tokens

Cost calculation (rough, B200 prefill at 25K t/s, intra-rack NIXL at 100 GB/s):

  • (a): 2K / 25K = 80 ms prefill; 0 transfer
  • (b): 0 compute + 2K × 160 KB / (12 GB/s DRAM→HBM) = 26 ms local load
  • (c): 8K × 160 KB / 100 GB/s = ~13 ms transfer + minimal compute

Option (c) wins on raw latency. But (c) consumes fabric bandwidth that other transfers also need; under load, the router picks (b) for fabric conservation. Under low load, (c). The scoring function captures this through transfer_cost weighted against current queue_delay.

Index update semantics. When engine B evicts a block, it emits kv_block_evicted. The Conductor index removes it. Race: between the evict event and a routing decision, an in-flight request might be routed to B for that block. Resolution: at request execution time, B re-checks its local cache (HiRadixTree). On miss, B asks the Conductor for a peer that has it (rapid retry) or recomputes. This makes the eventual consistency safe.

Tenant-aware index keys. Block hashes incorporate tenant_id so cross-tenant collisions are impossible by construction. This is the same mechanism Anthropic uses (“Cache entries are isolated between organizations”). A tenant cannot probe another tenant’s cache via prefix-hash because the tenant_id is in the hash preimage.

Section titled “6.4 Replica Placement and the Popular-Prefix Problem”

[STAFF SIGNAL: replica-placement discipline]

The dominant failure mode of cache-aware-only routing: the popular prefix concentrates traffic on one engine. With Anthropic-style productized caching, customers have 5–10 system prompts that account for 60% of input tokens. Naive cache-aware routing sends all those requests to whichever engine first cached the prefix. Result: that engine queues, others idle.

Detection: Conductor tracks per-prefix QPS via a sliding-window counter on the index. Trigger replication when:

prefix_QPS × estimated_recompute_cost > replication_overhead

For a 4K prefix at 1000 QPS with 80-ms recompute → 80 GPU-seconds/sec of recompute would happen if cache miss; replication onto 4 engines costs 4× 640 MB × 13 ms transfers = 33 GPU-ms. The math is overwhelmingly in favor of replicating.

Placement policy: random subset of K engines, topology-aware to spread across racks (so a rack-level failure doesn’t kill all replicas). K is set by:

K = ceil(prefix_QPS × avg_request_duration / per_engine_concurrency)

Replica counts and capacity overhead. With ~10K hot prefixes (long tail beyond top 100) at K=4 average, that’s 40K replicas across 100 engines = 400 replicas/engine. At 4K tokens/replica × 160 KB = 256 GB per engine, which is more HBM than we have. Therefore most replicas live in DRAM/NVMe, with a small HBM working set of the very-very-hot prefixes (top 10 system prompts).

Update consistency. When a system prompt is updated (a new version), all replicas must be invalidated. Easy: bump the version in the cache key (different hash → different blocks). The old replicas become orphans, evicted by LRU. No coordination needed for invalidation, which is the win of content-addressed storage (every published article on Mooncake/LMCache emphasizes this — content-addressing eliminates a class of consistency problems).

[STAFF SIGNAL: rejected alternative] I considered manual hot-prefix pinning (the operator declares which prefixes are hot). Rejected: workload distribution shifts faster than humans can update pin lists, and missing a newly-hot prefix causes a stampede. Automatic detection with short windows (10s) and conservative replication is the right default.

6.5 Distributed Eviction and Cooperative Eviction

Section titled “6.5 Distributed Eviction and Cooperative Eviction”

[STAFF SIGNAL: eviction coordination]

Each engine runs local LRU on its HiRadixTree (L1 HBM and L2 DRAM tiers). The local-only failure mode: engine A has the only replica of a globally-hot but locally-cold prefix; A’s local LRU evicts it; next request for that prefix recomputes on the whole fleet.

Coordination mechanisms:

  1. Donate-before-evict. Before evicting a block whose replica count is 1, A queries Conductor for a peer with capacity and pushes the block instead. Cost: ~5 ms transfer; benefit: prevents a recompute that costs hundreds of ms.
  2. Replica-count-aware LRU. Local eviction policy weights blocks by local_recency × (1 / replica_count). The last replica of a popular prefix is sticky.
  3. Eviction broadcast. Every eviction emits an event; Conductor decrements the index. If replica_count drops below K_min for a hot prefix, Conductor triggers re-replication.
  4. Anti-thrashing. Without coordination, all engines might evict the same low-popularity block at once. Conductor’s eviction load-balancing: when global pressure is high, Conductor signals which engines should preferentially evict (those with the most-replicated blocks).

Quantifying coordination overhead. At 10K evictions/sec across the fleet, each emits ~100-byte event; that’s 1 MB/s on the control plane — negligible. The donate-before-evict adds ~5 ms latency to evict decisions, which is fine because evictions are batched and async.

6.6 Multi-Tenant Isolation Across the Fabric

Section titled “6.6 Multi-Tenant Isolation Across the Fabric”

[STAFF SIGNAL: multi-tenant security across fabric]

The threat model expands when caches are shared:

  1. Cross-tenant hash collision. Mitigation: tenant_id in hash preimage; SHA-256 collision probability is irrelevant.
  2. Cross-tenant timing side-channel. A tenant probes “did anyone else recently cache X?” by measuring TTFT. Mitigation: never share KV across tenants; the index lookups are per-tenant-shard, so tenant A cannot probe tenant B’s index. Anthropic’s productized model uses this exact pattern (“Cache entries are isolated between organizations”).
  3. Cache-write/read accounting. Tenant A writes a 100K-token cache; tenant B’s request happens to land on the same engine and uses HBM that A’s cache occupies. Eviction of A’s cache by B’s load — who pays? Resolution: per-tenant cache budget. A free-tier tenant’s cache write is best-effort; their cache may be evicted preferentially. Paid tenants get reserved cache lines (a quota expressed in KV-bytes-hours).
  4. Shared-prefix special case (system prompts). If the platform’s own system prompt is identical across tenants, that prefix could be shared. This is risky — it leaks the platform’s system prompt structure via cache hit timing. Anthropic doesn’t do this. Neither would I.
  5. Transfer fabric isolation. A tenant’s KV traversing NIXL to a peer engine — does another tenant’s traffic on the same fabric leak timing info? Mostly no, because RDMA scheduling is FIFO per QP; but for ultra-paranoid tenants (defense, healthcare), dedicate prefill pools and avoid cross-tenant fabric sharing on the critical path. Add this as an SKU, not a default.

[STAFF SIGNAL: saying no] Implicit cross-tenant prefix sharing is technically the highest-leverage cache optimization possible (a global system prompt could be shared across millions of requests). I’d refuse to ship it. The information-leak surface and the surprise-factor in incident response make the risk-adjusted value negative. Anthropic and OpenAI both decline this for the same reason.

6.7 KV Compression and Mixed Precision Across the Fleet

Section titled “6.7 KV Compression and Mixed Precision Across the Fleet”

[STAFF SIGNAL: bandwidth-aware reasoning]

FP8 KV is the default — essentially free quality cost, ~50% size reduction vs FP16. INT4 (KIVI-style: per-channel keys, per-token values) is the next lever; ~75% reduction with ~1% quality loss on most benchmarks. The bandwidth win equals the size win: 4× less data to transfer is 4× faster transfer. At 100 GB/s NIXL, INT4 KV halves cross-rack transfer time relative to FP8.

Per-layer mixed precision. Some layers (typically early layers and layers around residual streams) tolerate INT4; others require FP8. The cache key must include the precision schedule. This complicates the radix tree (different precision schedules → different blocks) but the distinction is per-model, not per-request, so the explosion is bounded.

Calibration across engines. Engine A’s INT4 KV must be bit-identical to engine B’s INT4 KV for the same input. With deterministic quantization (KIVI’s per-channel scheme is deterministic given a calibration set), they’re identical up to floating-point reductions. We pin the quantization scheme as a model deployment artifact (versioned alongside weights).

Operational complexity. Two precision tiers (FP8 default, INT4 optional) is the practical sweet spot. Three or more precisions multiplies test surface area. I’d ship FP8 universal, INT4 opt-in per model deployment, and observe the real quality regression on production traffic before broader rollout. This matches the deployment cadence at Meta for FP8 KV that I led — production rollouts of new precisions take quarters, not weeks, because the long-tail quality issues only appear at production scale.

2-bit (TurboQuant, Kitty) is research-stage. I’d track it for 2026 H2 production trials.

[STAFF SIGNAL: cache-as-capacity-and-product] [STAFF SIGNAL: capacity-planning consequence]

The 90% cache hit rate isn’t a latency optimization; it’s a 10× reduction in prefill compute for the hit fraction. This reshapes the fleet:

Steady-state prefill GPUs needed = raw_input_QPS × avg_prompt_tokens × (1 − hit_rate)
× prefill_seconds_per_token
/ GPU_count
At 1000 QPS × 8K tokens × (1 − 0.9) × 0.05 ms/token
= 1000 × 8000 × 0.1 × 0.05 ms
= 40 GPU-seconds/sec = 40 prefill GPUs
vs without caching: 400 prefill GPUs.

Decode capacity is unchanged by caching (output length doesn’t change). So cache hit rate reshapes the prefill:decode ratio dramatically. At 0% hit rate the ratio might be 1:2 (prefill-heavy); at 90% hit rate it inverts to 1:20 (decode-heavy). The Planner must re-balance pools as observed hit rate shifts.

Failure mode: hit rate crashes. A model upgrade (claude-3.6claude-4.0) invalidates all caches. Suddenly the prefill pool is sized for 90% hit rate but is hit at 0%. The fleet must absorb 10× the prefill load for ~5–60 minutes until caches re-warm. Mitigation: gradual model rollout (1% → 10% → 50% → 100% over a day, allowing cache warm-up at each stage). Without this, the launch melts the prefill pool.

Pricing surface. Cache reads at 10% of base input price (Anthropic’s number, also OpenAI’s roughly). Cache writes at 1.25× (5-min) or 2× (1-hour). The 1.25× write multiplier covers the storage cost (HBM/DRAM/NVMe occupancy for the TTL window) plus the indexing/replication overhead. The 10% read multiplier reflects the marginal cost of a cache hit (memory load + attention compute on cached K/V vs full prefill — roughly 10% of full cost). These multipliers aren’t arbitrary — they’re the recovered marginal economics, and they tell you what the underlying cost structure is.

Should the platform expose explicit cache control or hide it? Anthropic’s choice (explicit) is right for power users (agentic platforms benefit from manual breakpoint placement). OpenAI’s choice (automatic) is right for low-friction adoption. The right answer is both: automatic by default, explicit when the user provides cache_control. Anthropic now offers automatic caching as well. I’d ship the dual mode.

6.9 (Bonus) Failure Modes in the Distributed Setting

Section titled “6.9 (Bonus) Failure Modes in the Distributed Setting”

[STAFF SIGNAL: failure mode precision] [STAFF SIGNAL: blast radius reasoning]

Concrete scenarios:

Scenario A: hot-prefix engine dies. Engine X holds the only HBM replica of a 1K-QPS system prompt. X dies. Detection: NATS heartbeat timeout (~3 sec). During those 3 sec, ~3000 requests routed to X fail or queue. Recovery: Conductor’s index marks X’s blocks invalid; next requests fall through to a peer DRAM/NVMe replica (slower but available) or recompute. Cascade risk: 1000 simultaneous recomputes on the prefill pool. Mitigation: (a) replica policy ensures K≥3 for hot prefixes, all topology-spread, so HBM hits remain; (b) prefill pool back-pressure: the Conductor rate-limits new requests for the dead prefix to prefill_pool_capacity / recompute_time so the recompute storm doesn’t cascade into prefix-unrelated requests.

Scenario B: rack-level network partition. A switch fails, isolating one rack of 8 engines. Inside the rack, NVLink works. Outside, no IB. The Conductor sees them as down. The 8 engines’ KV is unreachable. If they held unique replicas of any prefix, those prefixes effectively disappear. Recovery: replica policy enforces cross-rack spread, so unique-replica loss is rare. Surviving replicas absorb load. The 8 isolated engines, when partition heals, re-emit their kv_block_created events — their cache becomes available again with no manual intervention (idempotent event semantics).

Scenario C: Conductor shard fails. One of 32 Conductor shards crashes. Tenant requests that hash to that shard cannot get index lookups. Recovery: shard replicates to a hot standby; failover ~5 sec. During failover, requests fall back to “dumb” routing (least-loaded), accepting cache-miss penalty. The data-plane never depends on the Conductor — engines own their local caches and can serve requests independently. This is the right invariant: control-plane failures degrade hit rate but don’t kill the system.

Scenario D: NIXL transfer storm. Sudden burst of cross-rack transfers saturates IB. Effective per-transfer bandwidth drops 5×; transfers that should take 5 ms take 25 ms; TTFT climbs. Mitigation: the router’s transfer_cost term sees the rising effective bandwidth (telemetry feedback) and starts preferring local-recompute or lower-overlap engines. Self-stabilizing.


[STAFF SIGNAL: invariant-based thinking]

Per-engine hit rate is insufficient. The metrics that actually matter:

  • Global hit rate, per-tenant hit rate. Per-tenant matters for billing and SLA.
  • Transfer-vs-recompute decision distribution. Was the router right? Counterfactual: at the time of the decision, what was the cost of the alternative? Sample 1% of decisions and run the would-have-been math.
  • Per-prefix popularity, replica count, replica utilization. Track top 1000 prefixes by QPS. Underutilized replicas → over-replication; oversubscribed → under-replication.
  • Fabric bandwidth utilization, per link tier. NVLink, IB, NVMe. Saturation triggers planner action.
  • Transfer latency distribution per link tier. P50/P95/P99 per (source_tier, dest_tier) pair. Drift indicates fabric problems before they cascade.
  • Eviction rate by reason. LRU vs capacity vs TTL vs version-bump. Each tells you something different about workload.
  • Index staleness lag. Time from kv_block_evicted event to index update. Should be <50 ms; alert at >500 ms.
  • “Would-have-hit” counterfactual. For requests that recomputed: did some other engine have the prefix? If yes, the router made the wrong call. Sample at 1% and aggregate; this is the most diagnostic single metric.

Data volume: 100K req/sec × 10 KB log per decision = 1 GB/sec. Sampling at 1% → 10 MB/sec, very tractable.

The metric the business actually cares about: cost per million input tokens, blended. Cache hit rate drives this directly. The rest of the dashboard is to explain the headline number.


8. When the Distributed Fabric Doesn’t Pay Off

Section titled “8. When the Distributed Fabric Doesn’t Pay Off”

[STAFF SIGNAL: when-not-to-distribute]

Three regimes where this entire architecture is overkill:

  1. Small models (<8B), small fleet (<32 GPUs). Per-token KV at 8B GQA is ~16 KB; full prefix transfers are sub-ms over NVLink, and the fleet fits in a single rack. PagedAttention + LMCache local mode is enough. The Conductor’s coordination cost (NATS, ETCD, replica management, eviction broadcast) exceeds the benefit.

  2. Workloads with no prefix overlap. Per-user system prompts, per-call random tool schemas, encrypted user state — cross-request sharing doesn’t apply. Just do PagedAttention + per-engine prefix caching. The distributed index has nothing to do.

  3. Single-region, single-tenant, latency-critical (sub-50ms TTFT) deployment. The transfer-vs-recompute math still favors transfer, but the coordination latency of the multi-engine routing decision (~few ms even for an in-DRAM index lookup) is a measurable fraction of the latency budget. A single engine with full local cache, replicated for HA via active-standby, is simpler.

[STAFF SIGNAL: saying no] I’d refuse to deploy this entire fabric for a use case that doesn’t meet the size/overlap/SLO triggers. It’s a real cost — operationally, in failure-mode complexity, in on-call surface area. The architecture should be earned, not defaulted.


DecisionWhy nowWhat flips it
Centralized Conductor (sharded)Mooncake-scale precedent, simplerIf shards exceed 1000, gossip becomes attractive
Eventual consistency for index10ms staleness is harmlessStrict SLOs (<10ms TTFT budget) might want stronger guarantees
FP8 default, INT4 opt-inQuality margin provenFrontier model proves 2-bit lossless → flip to 2-bit default
Tenant-isolated cachesSide-channel safetyPublic/foundation-model deployments where there’s no tenant model
PD disaggregation default>100 QPS workloadsEdge / on-prem / single-customer dedicated → aggregated wins
NIXL as transportProduction-validatedCustom fabric (e.g., Anthropic’s internal Trainium fabric) might require its own backend

[STAFF SIGNAL: saying no]

Three pieces:

  1. “4K KV transfers in ~16 ms over IB 8×400 Gbps.” As I worked out in the Research Pass: the math gives ~1.6 ms transfer time, and with realistic RDMA overhead 3–5 ms is the right number. 16 ms is a single-rail under-utilized worst case, not a representative number. Numbers framing in the prompt drives the answer; I’d want to start from accurate numbers.

  2. “Distributed eviction coordination” as a separate first-class deep-dive. I’d argue eviction coordination is mostly subsumed by replica-count-aware LRU and the Conductor’s hot-prefix manager. A dedicated cross-engine eviction protocol adds machinery without much marginal benefit beyond donate-before-evict. The DualMap paper essentially shows that getting routing right reduces eviction-coordination importance.

  3. The implicit assumption that cache-aware routing is “the central scheduling problem.” Mostly correct, but: in disaggregated serving, the central scheduling problem is the joint PD pool sizing + cache-aware routing decision. The Dynamo Planner addresses this; treating routing in isolation under-counts how much of the 30%+ throughput gain comes from pool sizing. A staff candidate should name both.


  1. distributed-fabric reframing (§2, §4)
  2. research before design (§1)
  3. prefill-decode disaggregation (§2, §6.2)
  4. cache-as-capacity-and-product (§6.8)
  5. routing-as-multi-objective (§5)
  6. distributed-index discipline (§5, §6.3)
  7. transfer-fabric awareness (§6.1)
  8. scope negotiation (§2)
  9. rejected alternative (§4, §5, §6.2, §6.4)
  10. capacity math (§3)
  11. bandwidth-aware reasoning (§6.7)
  12. failure mode precision (§6.9)
  13. replica-placement discipline (§6.4)
  14. eviction coordination (§6.5)
  15. multi-tenant security across fabric (§6.6)
  16. invariant-based thinking (§7)
  17. blast radius reasoning (§6.9)
  18. capacity-planning consequence (§6.8)
  19. when-not-to-distribute (§8)
  20. saying no (§1, §6.6, §8, §10)
  21. 2026-cutting-edge awareness (throughout, especially §1 with Mooncake/Dynamo/NIXL/HiCache/DualMap/TurboQuant references)

Key Citations (used architecturally, not just name-dropped)

Section titled “Key Citations (used architecturally, not just name-dropped)”
  • Mooncake (FAST ‘25 Best Paper, Qin et al.): Conductor pattern, KV-centric scheduling, hot-prefix replication. Production at Kimi-K2 scale.
  • NVIDIA Dynamo + NIXL (GTC 2025, expanded 2025–2026): KV-aware routing via overlap_score_weight, Planner for PD sizing, NIXL for transport abstraction over UCX/GDS/IB.
  • LMCache (tech report 2025): Cross-engine KV reuse, NIXL integration, vLLM v1 connector. The 32 Gbps vs 64 Gbps breakeven number drives the transfer-vs-recompute decision logic.
  • SGLang HiCache + RadixAttention: HiRadixTree as the local index data structure, L1/L2/L3 hierarchy with Mooncake/3FS/NIXL backends.
  • Anthropic prompt caching docs: explicit-breakpoint productization, 5-min/1-hour TTL, 1.25× write / 0.1× read pricing model — what the fabric must support to enable this product.
  • DualMap (Feb 2026): Prefix-bound dual-candidate routing — the v2 routing strategy I’d target after v1 telemetry.
  • KIVI (ICML 2024) + TurboQuant (ICLR 2026): Per-channel key / per-token value INT4 quantization, and the next-gen Hadamard-rotation 3-bit average scheme.
  • GB200 NVL72: 130 TB/s NVLink Switch domain reshapes intra-rack bandwidth assumptions; KV transfer within rack now <1 ms for any reasonable context length.
  • llm-d (kvcache.Index): The current open-source production reference for distributed precise prefix-cache scheduling.

Word count: ~6,800 words. This answer assumes ~50 minutes of whiteboard time with deep follow-up questions. The primary axes of uncertainty I’d want the interviewer to push on: (a) DualMap-style dual-candidate routing — would I actually ship it in v1 if I had the eval data? (b) cross-tenant shared system prompts — at what observed economics would the timing-leak risk become acceptable? (c) at what fleet size does a sharded centralized Conductor stop scaling, and what’s the migration path to gossip-based coordination?