Fundamentals: Design an LLM Inference Platform
Design a production inference platform for a large transformer
Interviewer prompt: “Design the serving platform for a ~70B-parameter LLM: mixed traffic (interactive chat, agentic tool-use loops, and offline batch jobs) at, say, 10K requests/minute peak. Walk me through the architecture and the key trade-offs.”
Interview-style answer. First-person, as the candidate. This is the safety-net question — if the interviewer goes generic instead of world-models, this is the 40-minute version. Deeper treatments live in my LLM Sys Design notes (context management, KV cache management, LLM inference); this doc is the interview-shaped spine.
1. Clarify scope and assumptions
- Traffic mix and SLOs? Chat wants low TTFT [1]; agent loops want low per-token latency and cheap repeated context; batch wants throughput and doesn’t care about latency.
- One model or a family? (Assume one 70B dense model; note where a small model changes the answer.)
- Context lengths? (Assume up to 128K, median ~4K — the gap between median and max drives several designs.)
- Quality constraints on optimization? (Quantization allowed; engine swaps must pass quality regression.)
Assumptions: 70B dense, GQA (grouped-query attention: many query heads share a few KV heads, shrinking the KV cache ~8× vs. one-KV-per-query; it’s why the numbers below are merely painful instead of fatal), FP8 weights ≈ 70 GB; H100s ($2/GPU-hr, 80 GB HBM @ 3.35 TB/s, ~1 PFLOP/s dense BF16 and roughly twice that in FP8 on paper, of which ~40–50% is achievable; the math below uses ~0.5 PFLOP/s sustained).
2. The napkin math that generates the whole design
Staff-level signal: every serving technique below is a consequence of three numbers. Derive them first and the rest of the interview is downhill.
(a) Prefill is compute-bound. Processing a 4K-token prompt = 2 × 70e9 × 4096 ≈ 0.57 PFLOP — about a second of one GPU’s compute. All prompt tokens are processed in parallel; the GPU’s arithmetic units are the bottleneck.
(b) Decode is memory-bandwidth-bound. Generating one token requires reading all 70 GB of weights from HBM to do ~2 FLOPs per weight (≈0.14 TFLOP) — a trivial amount of math gated by an enormous read. (An MoE model changes this arithmetic: per-token weight traffic scales with active parameters, which is most of why sparse models win on serving cost.)
    single stream: 3.35 TB/s ÷ 70 GB ≈ 48 tokens/s, and the GPU's compute sits ~99% idle while you get it.
The fix is batching: B concurrent streams share one weight-read per step, so decode throughput scales ~linearly with B until something else runs out. What runs out is…
(c) KV cache [2] is the currency of the system.
    per token (70B-class: 80 layers, GQA 8 KV-heads × 128, FP8): 2 × 80 × 8 × 128 × 1 B ≈ 160 KB/token
    one 32K-context stream ≈ 5 GB
    one H100 after weights ≈ 8 GB free → barely 1–2 long streams!
So a single 80 GB GPU can’t even hold the weights plus a respectable batch’s KV. The three numbers force the architecture: split the model (TP2 → 160 GB, ~90 GB for KV ≈ batch of ~18 full-32K streams, far more at the 4K median), batch decode aggressively, and treat KV bytes, not FLOPs, as the resource you schedule, evict, and cache.
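The three numbers can be reproduced in a few lines. This is a minimal sketch under the section's stated assumptions (70B dense, FP8 weights and KV, 80 layers, 8 KV heads of dimension 128, H100 at 3.35 TB/s and 80 GB); the function names are illustrative, and runtime overhead (activations, fragmentation) is ignored. Exact arithmetic gives 16 full-32K streams on TP2 rather than the ~18 you get from rounding a stream to 5 GB, which is well inside napkin tolerance.

```python
PARAMS = 70e9          # dense parameter count
WEIGHT_BYTES = 70e9    # FP8 weights: 1 byte per parameter
HBM_BW = 3.35e12       # H100 HBM bandwidth, bytes/s
HBM_BYTES = 80e9       # H100 HBM capacity, bytes

# Assumed 70B-class shape with an FP8 KV cache (1 byte per element).
LAYERS, KV_HEADS, HEAD_DIM, KV_ELEM_BYTES = 80, 8, 128, 1


def prefill_flops(prompt_tokens: int) -> float:
    """~2 FLOPs per parameter per token; ignores attention's quadratic term."""
    return 2 * PARAMS * prompt_tokens


def single_stream_decode_tok_s(num_gpus: int = 1) -> float:
    """Bandwidth ceiling: one full weight read per generated token."""
    return num_gpus * HBM_BW / WEIGHT_BYTES


def kv_bytes_per_token() -> int:
    """K and V vectors for every layer and every KV head."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_ELEM_BYTES


def streams_that_fit(num_gpus: int, context_tokens: int) -> int:
    """Streams of a given context length that fit in HBM left after weights."""
    free_bytes = num_gpus * HBM_BYTES - WEIGHT_BYTES
    return int(free_bytes // (kv_bytes_per_token() * context_tokens))


print(f"prefill 4K: {prefill_flops(4096) / 1e15:.2f} PFLOP")
print(f"decode, 1 stream: {single_stream_decode_tok_s():.0f} tok/s")
print(f"KV per token: {kv_bytes_per_token() / 1024:.0f} KiB")
print(f"TP2 streams at 32K: {streams_that_fit(2, 32 * 1024)}")
print(f"TP2 streams at 4K:  {streams_that_fit(2, 4 * 1024)}")
```

Swapping in a different model shape or GPU changes every output but none of the structure, which is the point of deriving the numbers rather than memorizing them.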
3. The serving engine — six techniques, each earning its place
Each one exists because of a specific failure of the naive design:
- Continuous batching. Naive batching waits for the whole batch to finish; one 2K-token answer holds 31 finished streams hostage. Instead, admit/retire streams every step. This alone is typically 2–4× throughput, and it’s why every modern engine (vLLM, TRT-LLM, SGLang) is built around it.
- Paged KV cache. Contiguous per-stream KV allocations fragment HBM exactly like un-paged RAM; paging KV into fixed-size blocks (vLLM’s core idea) gets utilization from ~60% to >95%, and the block table gives you copy-on-write sharing for free.
- Prefix caching. The system prompt, the few-shot header, the agent’s tool definitions — identical across thousands of requests. Cache their KV once (paged blocks + hashing make this natural), and a 3K-token shared prefix turns into a lookup instead of 3K tokens of prefill. For agentic traffic — same context replayed every tool-call iteration — this is the single biggest cost lever there is.
- Chunked prefill. A 128K-token prefill is ~35 s of compute; scheduled whole, it stalls every decode stream on the GPU (TPOT spikes — head-of-line blocking). Slice prefills into chunks interleaved with decode steps: TTFT for the big request degrades slightly, TPOT for everyone else stays flat.
- Speculative decoding. A small draft model proposes k tokens; the 70B verifies them in one parallel pass — one weight-read for several tokens, 2–3× single-stream speedup. The nuance an interviewer wants: it spends compute to save bandwidth, so it’s strongest at low batch (latency-critical, GPU half-idle) and fades as batch grows toward compute-bound — though modern EAGLE-class drafters have pushed the crossover out (see §7), so “off above batch 4” is a stale rule; the honest answer is “profile the crossover on your traffic, then schedule it per tier.”
- Quantization. FP8 weights halve the decode weight-read (≈2× decode speed) and the footprint; FP8 KV doubles how many streams fit. Gated per-release by quality regression, same discipline as every other optimization.
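To make the paged-KV and prefix-caching bullets concrete, here is a toy block allocator. It is a sketch of the general idea, not vLLM's implementation: the class name, the 16-token block size, and the use of Python's built-in `hash` are all assumptions for illustration. Full blocks are keyed by a hash chained over the whole prefix, so two prompts that share a system prompt share its blocks by reference count, and only the partial tail block is private.

```python
BLOCK_TOKENS = 16  # tokens per KV block (illustrative choice)


class PagedKVCache:
    """Toy paged KV allocator with hash-addressed prefix sharing."""

    def __init__(self, num_blocks: int) -> None:
        self.free: list[int] = list(range(num_blocks))
        self.refcount: dict[int, int] = {}
        self.by_hash: dict[int, int] = {}   # chained prefix hash -> block id
        self.hash_of: dict[int, int] = {}   # block id -> chained prefix hash

    def _alloc(self) -> int:
        if not self.free:
            raise MemoryError("out of KV blocks: evict or reject")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def admit(self, tokens: list[int]) -> tuple[list[int], int]:
        """Return (block_table, tokens_served_from_cache) for a prompt."""
        table: list[int] = []
        cached = 0
        parent = 0
        full = len(tokens) - len(tokens) % BLOCK_TOKENS
        for start in range(0, full, BLOCK_TOKENS):
            # Chain the hash so a block matches only if its whole prefix matches.
            parent = hash((parent, tuple(tokens[start:start + BLOCK_TOKENS])))
            block = self.by_hash.get(parent)
            if block is not None:           # prefix hit: share, skip prefill
                self.refcount[block] += 1
                cached += BLOCK_TOKENS
            else:                           # miss: allocate and publish
                block = self._alloc()
                self.by_hash[parent] = block
                self.hash_of[block] = parent
            table.append(block)
        if full < len(tokens):              # partial tail is private
            table.append(self._alloc())
        return table, cached

    def release(self, table: list[int]) -> None:
        """Drop one reference per block; free blocks nobody holds."""
        for block in table:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                key = self.hash_of.pop(block, None)
                if key is not None:
                    del self.by_hash[key]
                self.free.append(block)
```

A real engine would keep unreferenced prefix blocks around as evictable cache instead of freeing them immediately, would use a collision-resistant hash, and would roll back a partially admitted request on allocation failure; all three are omitted here to keep the sharing mechanism visible.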
Parallelism note: TP2–TP4 inside an NVLink island, replicas beyond that. Decode-phase TP all-reduces carry one token’s activations (~16 KB) — latency-bound, benign. This is the exact opposite regime from the world-model design (#1), where every pass moved 12K tokens and 75 MB per all-reduce — same formula, the token count flips which term dominates. Knowing why the same technique is cheap here and expensive there is the transferable skill.
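The "same formula, different regime" claim can be checked with a crude latency-plus-bandwidth model. This is a sketch with assumed constants: a hidden size of 8192 with 2-byte activations (which reproduces the ~16 KB figure), and a hypothetical 10 µs per-collective latency and 450 GB/s link bandwidth. It also ignores the extra traffic factor of a real ring or tree all-reduce.

```python
def allreduce_payload_bytes(tokens: int, hidden_size: int,
                            bytes_per_elem: int = 2) -> int:
    """Activation bytes entering one TP all-reduce: one hidden vector per token."""
    return tokens * hidden_size * bytes_per_elem


def dominant_term(payload_bytes: float,
                  link_latency_s: float = 10e-6,    # assumed fixed cost per collective
                  link_bandwidth: float = 450e9) -> str:  # assumed bytes/s
    """Crude alpha-beta model: time = latency + payload / bandwidth."""
    transfer_s = payload_bytes / link_bandwidth
    return "latency" if link_latency_s > transfer_s else "bandwidth"


one_token = allreduce_payload_bytes(tokens=1, hidden_size=8192)
print(one_token, dominant_term(one_token))   # 16384 latency
print(dominant_term(75e6))                   # bandwidth
```

With a decode batch of B streams the payload is B times larger, so even batch 32 is about half a megabyte per all-reduce and stays on the latency-bound side under these assumptions.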
4. Platform architecture — tiers, routing, disaggregation
                 ┌───────────────────────┐
                 │  Router / Scheduler   │
                 │  prefix-aware, load-  │
                 │  aware, priority-aware│
                 └──┬────────┬────────┬──┘
     interactive    │        │        │        batch
     (TTFT SLO)     ▼        ▼        ▼        (no SLO, spot/
┌──────────────────────┐  ┌─────────────────┐  preemptible)
│  PREFILL POOL        │  │  DECODE POOL    │  ┌─────────────────┐
│  compute-heavy,      │─►│  bandwidth-     │  │  BATCH TIER     │
│  chunked, KV         │KV│  heavy, big     │  │  max batch, no  │
│  streamed out        │  │  cont. batches  │  │  spec-decode,   │
└──────────────────────┘  └─────────────────┘  │  fills valleys  │
           ▲                                   └─────────────────┘
           │   shared prefix KV store (paged blocks, hash-addressed,
           ▼   HBM → CPU DRAM → SSD tiers)
┌──────────────────────────────────────────────┐
│  KV / PREFIX CACHE  (the system's real state)│
└──────────────────────────────────────────────┘
- Prefill/decode disaggregation: the two phases want opposite hardware profiles (compute vs bandwidth) and pollute each other’s SLOs when colocated. Separating them, with KV streamed between pools, is the Mooncake/Dynamo architecture; worth it at scale, overkill for a single-node deployment (say which regime you’re in).
- Routing is cache-aware first: sending a request to the replica that already holds its prefix KV beats any load-balancing heuristic. Session affinity for agents (their KV is here), hash-routing for shared prefixes.
- The batch tier is the economic shock absorber: it runs preemptible, fills diurnal valleys, and makes peak capacity for interactive traffic affordable. Goodput — requests completing within their SLO per GPU-hour — is the metric, not raw tokens/s; a platform at 100% utilization missing every TTFT target is a failed platform at great efficiency.
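The cache-aware routing bullet above can be sketched as a small router. Everything here is hypothetical: the `Router` class, keying the prefix on the first 256 characters (a stand-in for hashing token blocks), and the per-replica `max_load` cap. The policy is: prefer a replica that already holds the prefix KV if it has headroom, otherwise fall back to the least-loaded replica and record it as an additional home, which also gives hot prefixes a way to spread.

```python
import hashlib


class Router:
    """Toy prefix-aware, load-aware router."""

    def __init__(self, replicas: list[str], max_load: int) -> None:
        self.load: dict[str, int] = {r: 0 for r in replicas}
        self.max_load = max_load
        self.prefix_homes: dict[str, set[str]] = {}  # prefix key -> warm replicas

    @staticmethod
    def prefix_key(prompt: str, prefix_chars: int = 256) -> str:
        # Assumption: the shared prefix is approximated by a fixed-length head.
        return hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()

    def route(self, prompt: str) -> str:
        homes = self.prefix_homes.setdefault(self.prefix_key(prompt), set())
        warm = [r for r in homes if self.load[r] < self.max_load]
        if warm:
            # Cache hit: pick the least-loaded replica that holds the prefix.
            choice = min(warm, key=self.load.__getitem__)
        else:
            # Miss or all warm replicas saturated: least-loaded overall.
            choice = min(self.load, key=self.load.__getitem__)
            homes.add(choice)
        self.load[choice] += 1
        return choice

    def finish(self, replica: str) -> None:
        """Call when a request completes to release its load slot."""
        self.load[replica] -= 1
```

This sketch never rejects: if every replica is at `max_load` it still picks the least-loaded one. A production router would pair it with admission control and would learn prefix locations from the engines' actual cache state rather than from its own routing history.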
Cost sanity check: TP2 decode at batch ~32: step time ≈ 70 GB ÷ 6.7 TB/s ≈ 10.5 ms → ~95 steps/s × 32 streams ≈ 3,000 tok/s per pair ≈ 1,500 tok/s/GPU → at $2/GPU-hr, ≈ $0.37 per million output tokens before prefill, cache hits, and margin. Knowing this number to within 2× lets you sanity-check every vendor claim and capacity plan in the room.
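The cost check is easy to parameterize. This is a sketch using the figures above; the function name is illustrative. The optional `kv_bytes_per_step` term is an addition that the estimate above leaves out: each decode step also reads the KV of every active stream, so it matters once contexts get long.

```python
def cost_per_million_tokens(weight_bytes: float,
                            hbm_bw_per_gpu: float,
                            num_gpus: int,
                            batch: int,
                            gpu_hour_usd: float,
                            kv_bytes_per_step: float = 0.0) -> float:
    """Bandwidth-bound decode cost in USD per million output tokens."""
    # One step reads all weights (plus, optionally, all active KV) once.
    step_s = (weight_bytes + kv_bytes_per_step) / (hbm_bw_per_gpu * num_gpus)
    tok_per_gpu_hour = batch / step_s / num_gpus * 3600
    return gpu_hour_usd / tok_per_gpu_hour * 1e6


weights_only = cost_per_million_tokens(70e9, 3.35e12, num_gpus=2, batch=32,
                                       gpu_hour_usd=2.0)
print(f"${weights_only:.2f} per million output tokens")

# Same batch with 4K-token contexts at 160 KiB of KV per token.
with_kv = cost_per_million_tokens(70e9, 3.35e12, num_gpus=2, batch=32,
                                  gpu_hour_usd=2.0,
                                  kv_bytes_per_step=32 * 4096 * 160 * 1024)
print(f"${with_kv:.2f} per million output tokens including KV reads")
```

The weights-only figure lands at about $0.36 (the $0.37 above comes from rounding to 1,500 tok/s/GPU); adding KV reads for 4K contexts pushes it to roughly $0.47, still inside the 2× band the text asks for.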
5. Evaluation, monitoring, rollout
The serving platform changes weekly (engine versions, kernels, quantization, schedulers) under a model that’s supposed to stay exactly the same, so the discipline is “prove nothing changed”:
- Quality invariance gates: golden-prompt suite with pinned seeds — logprob deltas (the per-token probabilities the model assigns; far more sensitive to a kernel or quantization change than eyeballing outputs) and output diffs per release. Quantization, kernel, and engine swaps all ship through it. (Bitwise-identical output is not achievable across kernels; statistically indistinguishable is the standard, and the gate encodes it.)
- Latency: TTFT/TPOT percentiles per traffic class — p99, not means; the means always look fine.
- Cache health: prefix hit rate, KV utilization, eviction rates. A hit-rate drop is an early-warning for both a cost regression and a routing bug.
- Failure modes worth naming before they page you: OOM cascades (one long-context burst evicts the cache, misses pile up, prefill load doubles — admission control by predicted KV footprint, not request count); hot prefixes (one viral system prompt — replicate its blocks); head-of-line blocking (chunked prefill plus a max-tokens-in-flight cap).
- Rollout: shadow traffic for engine upgrades, canary by traffic percentage, instant rollback (stateless workers + versioned engine images — the KV cache drains, nothing else to migrate).
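The "admission control by predicted KV footprint, not request count" idea from the failure-modes bullet can be sketched as a byte budget. The class name, the 10% headroom, and the choice to reserve the worst case (prompt plus `max_new_tokens`) are assumptions for illustration; the 160 KiB/token constant comes from §2(c).

```python
KV_BYTES_PER_TOKEN = 160 * 1024  # from the napkin math in section 2(c)


class KVAdmission:
    """Admit requests against a KV byte budget instead of a request count."""

    def __init__(self, kv_budget_bytes: int, headroom: float = 0.10) -> None:
        # Keep some headroom so a burst cannot force cache eviction.
        self.limit = int(kv_budget_bytes * (1 - headroom))
        self.reserved = 0
        self.by_request: dict[str, int] = {}

    def try_admit(self, request_id: str, prompt_tokens: int,
                  max_new_tokens: int) -> bool:
        """Reserve the worst-case KV footprint, or refuse (queue or shed)."""
        need = (prompt_tokens + max_new_tokens) * KV_BYTES_PER_TOKEN
        if self.reserved + need > self.limit:
            return False
        self.reserved += need
        self.by_request[request_id] = need
        return True

    def release(self, request_id: str) -> None:
        """Return a finished request's reservation to the budget."""
        self.reserved -= self.by_request.pop(request_id)


adm = KVAdmission(kv_budget_bytes=90 * 10**9)   # the TP2 KV budget from section 2
print([adm.try_admit(f"long-{i}", 128 * 1024, 1024) for i in range(4)])
print(adm.try_admit("chat-0", 4096, 512))
```

Under these numbers three 128K-context requests fit, the fourth is refused, and a 4K chat request still gets in, which is exactly the behavior a request-count limit cannot express. Reserving the worst case is conservative; a real scheduler might reserve an estimate and preempt on overrun.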
6. When simpler wins
- A smaller model with a fatter cache often beats a bigger model: if 8B + good retrieval passes the task evals, it’s ~9× cheaper per token and fits whole on one GPU (no TP, trivial ops). Run the task eval before the capacity plan.
- Semantic/response caching above the platform: the cheapest token is one you never generate.
- Single-node first: one TP2 box with vLLM serves a surprising amount of traffic; disaggregation, KV tiering, and multi-pool routing earn their complexity only past the point where you can measure the colocation interference they remove.
7. Research pass — new developments (as of June 2026)
- Speculative decoding survives concurrency now. EAGLE-3 drafts from fused multi-layer hidden states (not just the final layer), raising acceptance rates; EAGLE 3.1 (May 2026, with the vLLM team) reports 2.03× at concurrency 1, 1.71× at C=4, and still 1.66× at C=16. The old “spec decode is a single-stream trick” heuristic is dead; it now earns its place in moderately-batched interactive tiers, and only the saturated batch tier turns it off.
- KV compression moved into the architecture. DeepSeek-style MLA (multi-head latent attention) stores a compressed latent instead of per-head K/V — several-fold smaller KV than GQA. The §2(c) “currency” math is increasingly attacked at model-design time, which is the cheapest place to attack it; when evaluating a new model for serving, KV-bytes-per-token now belongs next to parameter count on the datasheet.
- Disaggregation is mainstream, not exotic: prefill/decode separation shipped as a first-class vLLM abstraction (KVConnector), with Dynamo/Mooncake-class orchestration above it — my KV-cache doc covers that frontier in depth; this design’s §4 diagram is now the default shape at scale, not the ambitious one.
- Blackwell moves the constants, not the logic: B200-class HBM (~8 TB/s) raises the per-GPU decode ceiling from §2(b) by roughly 2.4×. Every regime boundary in this doc shifts; none of them disappears, which is exactly why the napkin math, not the constants, is the thing to carry into the interview.
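For the speculative-decoding bullet at the top of this section, the link between acceptance rate and speedup can be sketched with the standard independent-acceptance model. The assumptions are loud ones: each drafted token is accepted with the same probability `alpha`, verifying `k` drafted tokens costs one target step (true only while decode is bandwidth-bound, i.e. at low batch), and `draft_cost` is a hypothetical ratio of draft-step to target-step time. It is not a model of any specific EAGLE result.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass when k tokens are drafted and
    each is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding, assuming verification costs one target
    step and each draft step costs `draft_cost` target steps."""
    return expected_tokens_per_verify(alpha, k) / (1 + k * draft_cost)


for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost=0.05), 2))
```

As batch grows and the verification pass stops being free, the denominator's "one target step" assumption breaks first, which is why the crossover has to be profiled rather than computed.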
8. Summary — what I’d want the interviewer to remember
- Three numbers force everything: prefill is compute-bound, decode is one-full-weight-read-per-token (48 tok/s unbatched), and KV at 160 KB/token is the scarce resource. The architecture is just these facts arranged in order.
- Batch to amortize the weight read; page and share the KV; cache every repeated prefix — continuous batching, paged KV, prefix cache are the non-negotiable core.
- Know which regime each trick lives in: speculative decoding strongest at low-to-moderate concurrency (EAGLE-3-class drafters moved the crossover — profile it, don’t recite it); chunked prefill to protect TPOT; disaggregation at scale only. Same math, opposite answer to the world-model case — and saying why is the differentiator.
- Goodput per GPU-hour, per traffic class — not tokens/s. The batch tier absorbs the peaks and pays for the valleys.
- The model is constant, the platform isn’t: quality-invariance gates on every engine change, p99s per class, admission control in KV-bytes, boring rollbacks.
Footnotes
1. TTFT / TPOT: time-to-first-token (how long before the user sees anything; dominated by prefill) and time-per-output-token (the streaming rate; dominated by decode). They’re the two latency SLOs of LLM serving, and almost every architecture decision trades one against the other or against cost.
2. KV cache: in attention, every generated token attends to all previous tokens’ key/value vectors. Recomputing them every step would be quadratic-and-ruinous, so they’re cached, which converts the problem from “compute” to “memory”: every active conversation holds megabytes-to-gigabytes of state on the GPU for its whole lifetime. My KV-cache doc covers the full frontier (Mooncake, Dynamo/NIXL, LMCache); here it’s enough that KV is the scarce resource.