Distillation as a Design Problem: Building the Efficient Variant
Design the teacher→student distillation program for a world model
Interviewer prompt: “You own the 30B teacher world model from the earlier design. Leadership has approved the ‘efficient variant’ strategy: a distilled student powers the bulk simulation fleet. Design the distillation program end to end (data, training, evaluation, and the ongoing process), not just the loss function.”
Interview-style answer. First-person, as the candidate. I talk through decisions out loud, flag tradeoffs explicitly, and mark Staff-level signals wherever they appear. Numbers are stated assumptions, consistent with designs #1 and #2.
1. Clarify scope and assumptions
Questions I’d ask:
- What’s the compression target? Design #1’s cost model wants ~150×: capacity (30B → ~3B) and steps (30 → ~4 denoising passes). Is that fixed, or do I get to trade quality against it per use case?
- Is the teacher frozen, or does it keep improving? (It keeps improving — which changes this from a project into a process.)
- One student or a family? (Bulk-fleet student, maybe a tinier triage variant later. I’ll design for one and keep the pipeline reusable.)
- What’s the acceptance bar? Inherited from designs #1/#2: per-slice downstream agreement with the teacher, plus the validation tiers. I’m not inventing new gates; I’m building the thing that has to pass them.
Assumptions: 30B diffusion-based teacher (camera + lidar, 10 Hz, 30 steps), target student ≈ 3B at 2–4 steps, FP8 serving. Teacher retrains roughly quarterly. I control a distillation compute budget and the teacher pool’s idle capacity.
2. Framing — what distillation actually has to preserve
The lazy framing is “make a small model match the big model’s outputs.” The correct contract, inherited from design #2’s task-relative realism:
The student must preserve the distribution of futures the teacher generates, well enough that the planner-under-test’s evaluations are unchanged — per slice, including the tail.
Two failure modes follow directly from that sentence, and they’re the two things naive distillation does worst:
- Oversmoothing (mode averaging). The future is genuinely multimodal: the pedestrian at the curb either steps out or doesn’t. A student trained with a regression-style loss against teacher samples learns the average of the modes — a half-committed ghost pedestrian that neither crosses nor stays. The planner reacts to that blur with unrealistic caution, and every downstream metric shifts. Sample-matching losses break exactly where simulation matters most: decision points.
- Mode dropping. The student covers the common futures and quietly loses rare ones: the tail slices, again. Note the irony: the sharpest few-step distillation methods are adversarial (distribution-matching with a discriminator-like critic), which reintroduces the GAN-family failure we rejected in design #1’s architecture table [1]. Sharpness and tail coverage trade against each other, and the gate has to measure both.
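A minimal numeric sketch of the mode-averaging failure, under toy assumptions that are mine and not part of the design: the future is reduced to one scalar (a hypothetical pedestrian lateral offset, 0.0 for "stays at the curb", 1.0 for "in the lane"), and the student emits a single point prediction scored with squared error.

```python
# Toy illustration: a squared-error (regression-style) student facing a
# bimodal future. All values are hypothetical.

# Teacher samples at a decision point: half the time the pedestrian stays
# at the curb (0.0), half the time they are in the lane (1.0).
teacher_samples = [0.0] * 50 + [1.0] * 50


def mse(prediction: float, samples: list) -> float:
    """Mean squared error of one point prediction against all samples."""
    return sum((prediction - s) ** 2 for s in samples) / len(samples)


# Search a grid of candidate point predictions for the MSE minimizer.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: mse(p, teacher_samples))

print(f"MSE-optimal prediction:  {best}")
print(f"MSE at the optimum:      {mse(best, teacher_samples):.3f}")
print(f"MSE at either real mode: {mse(0.0, teacher_samples):.3f}")
```

The minimizer lands at 0.5, a position no teacher sample ever takes. That is the ghost pedestrian: the loss prefers it to either real mode, which is why the acceptance gate has to check outcome spread rather than average error.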
Staff-level signal: state what must be preserved as a distributional property and connect each loss-function choice to a named failure mode and a named gate. “I’d use consistency distillation” is a junior answer; “here’s which failure each candidate loss invites, and which eval catches it” is the staff answer.
3. The distillation dataset — where the design is actually won
What goes in it
- Stratified scenario coverage: every slice × scenario-type cell filled deliberately, uniform-ish over conditions, not over miles driven (fleet-frequency sampling would hand the student a sunny-highway curriculum and starve the tail).
- Deliberate tail oversampling: the slices where agreement gates are tightest get the most teacher examples, not the fewest. The sampling weights come straight from design #2’s per-slice acceptance bands — tightest band, highest weight.
- The divergence queue from design #2: every paired-replay case where the teacher itself was wrong-but-correctable is gold; every case where the previous student diverged from the teacher is targeted curriculum for the next one.
- Multiple teacher samples per condition: to teach a distribution you must show more than one draw from it — K samples per scenario seed at decision points, so the student sees that the pedestrian sometimes crosses and sometimes doesn’t.
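One way to turn "tightest band, highest weight" into numbers is sketched below. The slice names, the band widths, and the inverse-width weighting rule are all illustrative assumptions; the text only fixes the ordering, not the formula.

```python
from typing import Dict

# Hypothetical per-slice acceptance band widths (smaller = tighter gate).
# Real values would come from design #2's per-slice acceptance bands.
ACCEPTANCE_BANDS: Dict[str, float] = {
    "sunny_highway": 0.08,
    "urban_dusk": 0.04,
    "night_rain": 0.02,
}


def slice_weights(bands: Dict[str, float]) -> Dict[str, float]:
    """Sampling weight per slice, proportional to 1 / band width (assumed rule)."""
    inverse = {name: 1.0 / width for name, width in bands.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def allocate_miles(weights: Dict[str, float], budget_miles: int) -> Dict[str, int]:
    """Split a teacher-mile budget across slices according to the weights."""
    return {name: round(w * budget_miles) for name, w in weights.items()}


weights = slice_weights(ACCEPTANCE_BANDS)
miles = allocate_miles(weights, budget_miles=50_000)
for name in sorted(miles, key=miles.get, reverse=True):
    print(f"{name:>14}: weight={weights[name]:.2f}, miles={miles[name]}")
```

Any monotone rule would satisfy the requirement; inverse width is simply the easiest to explain, and it keeps the common slices above zero so the student does not forget the easy cases.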
The napkin math that reorders priorities
    Generating fresh teacher data (design #1 rates: 18 GPU-hr, $36 per mile):
        50K stratified miles → ~900K GPU-hours ≈ $1.8M

    Training the 3B student on it:
        50K miles ≈ 60M frames ≈ 7.2e11 token-instances
        ~6 × 3e9 × 7.2e11 × ~2 epochs ≈ 2.6e22 FLOPs
        on 512 H100s @ 0.4 PFLOP/s ≈ ~36 hours ≈ 18K GPU-hours ≈ $37K

Data generation outweighs training compute ~50:1. Three consequences:
- Free-ride aggressively. The teacher pool already produces rollouts as a byproduct — golden-scenario libraries, the standing audit stream, arbitration re-runs (design #1 §6). Log all of it in distillation-ready form. A large fraction of the dataset costs zero marginal dollars.
- Iterate on training, not data. At $37K a run, training experiments are cheap; fresh data campaigns are not. Sweep losses and schedules freely; commission new teacher miles only when a slice gate fails and the failure is traced to coverage.
- The program pays for itself absurdly fast. Running the bulk fleet on the teacher would cost roughly 150× the student’s ~$24K/day (design #1), on the order of $3.6M/day, so the entire ~$2M program is paid back in under a day of fleet operation. Ask for the budget without apologizing.
Staff-level signal: doing the cost asymmetry math before designing the loss. The interviewer learns more from “data is 50× the training cost, so here’s my data strategy” than from any amount of loss-function erudition.
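The napkin math above can be reproduced in a few lines. Every input is a figure stated in the text; the only derived quantity is the implied $2 per GPU-hour ($36 per mile over 18 GPU-hours per mile). Treat it as a sanity check on the arithmetic, not a cost model.

```python
# Inputs stated in the text (design #1 rates and the student training setup).
MILES = 50_000
GPU_HOURS_PER_MILE = 18
DOLLARS_PER_MILE = 36
PARAMS = 3e9             # 3B student
TOKENS = 7.2e11          # token-instances in 50K miles
EPOCHS = 2
NUM_GPUS = 512
FLOPS_PER_GPU = 0.4e15   # 0.4 PFLOP/s per H100, as assumed in the text

# Teacher data generation.
gen_gpu_hours = MILES * GPU_HOURS_PER_MILE
gen_cost = MILES * DOLLARS_PER_MILE
dollars_per_gpu_hour = DOLLARS_PER_MILE / GPU_HOURS_PER_MILE  # implied rate

# Student training, using the ~6 * params * tokens FLOPs estimate.
train_flops = 6 * PARAMS * TOKENS * EPOCHS
train_hours = train_flops / (NUM_GPUS * FLOPS_PER_GPU) / 3600
train_gpu_hours = train_hours * NUM_GPUS
train_cost = train_gpu_hours * dollars_per_gpu_hour

print(f"generation: {gen_gpu_hours:,} GPU-hours, ${gen_cost:,}")
print(f"training:   {train_flops:.2e} FLOPs, {train_hours:.1f} h wall-clock")
print(f"            {train_gpu_hours:,.0f} GPU-hours, ${train_cost:,.0f}")
print(f"generation / training cost: {gen_cost / train_cost:.0f}:1")
```

Unrounded, training comes to about 35 hours and $36K; the $37K in the text follows from rounding up to 36 hours first. Either way the ratio stays near 50:1, which is the number that drives the data strategy.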
4. Technique choices — and the order of operations
Step distillation (30 → 2–4 passes)

    teacher:      noise ●→●→●→●→●→●→ ... →● clean   (30 small denoise hops)
    progressive:  noise ●→──→●→──→●→ ... →● clean   (student learns 2-hops: 30→15→8→4, retrain each halving)
    consistency:  noise ●─────────────────►● clean   (from ANY noise level, jump straight to the answer)

- Progressive distillation: stable, well-understood; multiple rounds of training; quality degrades gracefully as steps halve.
- Consistency-style distillation: reaches 1–4 steps directly; tends to soften fine detail — watch lidar-camera consistency metrics, since “softened” geometry breaks cross-modal coherence before it looks bad to a human.
- Adversarial / distribution-matching: sharpest few-step results; carries the mode-dropping risk of footnote 1 and the training instability of its GAN ancestry. If used, anchor it with a regression term and gate it on tail slices.
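For the progressive route, the number of retraining rounds falls straight out of the halving schedule. A small sketch follows; the function name is hypothetical, and rounding odd step counts up is an assumption chosen to match the 30→15→8→4 sequence in the diagram.

```python
import math
from typing import List


def halving_schedule(teacher_steps: int, target_steps: int) -> List[int]:
    """Step counts for successive progressive-distillation rounds."""
    if target_steps < 1 or teacher_steps < target_steps:
        raise ValueError("need 1 <= target_steps <= teacher_steps")
    schedule = [teacher_steps]
    while schedule[-1] > target_steps:
        # Halve, rounding up (assumed), without dropping below the target.
        schedule.append(max(target_steps, math.ceil(schedule[-1] / 2)))
    return schedule


schedule = halving_schedule(teacher_steps=30, target_steps=4)
print(" -> ".join(str(s) for s in schedule))
print(f"training rounds needed: {len(schedule) - 1}")
```

Three rounds to reach 4 steps, four to reach 2. At the training costs estimated earlier that is affordable, which is part of why progressive distillation stays a reasonable baseline even when a consistency-style method could jump there directly.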
Capacity distillation (30B → 3B)
Output matching on teacher samples is table stakes. The decisions that matter:
- Feature matching (align intermediate activations, not just outputs) transfers more per example but requires an architectural correspondence between teacher and student layers — constrain the student to be a narrower/shallower sibling of the teacher, not an exotic new design, and this comes nearly free.
- Trajectory-level supervision: match multi-frame rollouts, not single frames, so temporal coherence is learned rather than hoped for.
Order of operations
Capacity first, then steps: train the 3B student as an ordinary 30-step diffusion model against teacher outputs (a stable, well-posed target), then step-distill the student. Doing both jointly stacks two unstable optimizations; doing steps first means re-doing them after every capacity change. The boring sequencing is the right one.
The on-policy correction loop — the piece most designs miss
Everything above is off-policy, borrowing RL’s vocabulary: the training data comes from contexts the teacher generated, not from situations the student gets itself into. (On-policy = trained on data produced by the student’s own behavior.) The problem: in production (closed-loop, design #2’s distinction) the student consumes its own slightly-degraded history, and small errors compound into drift the teacher’s data never showed it how to escape.

    ┌──────────────────────────────────────────────────────────┐
    │ 1. STUDENT rolls out N frames closed-loop (its own       │
    │    imperfect history, including its drift)               │
    │ 2. TEACHER, conditioned on that SAME student-generated   │
    │    history, produces the next-frame target               │
    │ 3. student is trained toward the teacher's recovery      │
    │ 4. repeat, mixing on-policy batches with the offline set │
    └──────────────────────────────────────────────────────────┘

This is the DAgger recipe from imitation learning [2], transplanted: the expert labels the student’s states, so the student learns to recover from mistakes only it makes. In world-model terms it directly attacks the error-accumulation curve, which is the metric (max trusted horizon, design #2 §3) where distilled students fail first.
Staff-level signal: off-policy distillation alone produces students that ace open-loop evals and drift in closed-loop. Naming exposure bias as the gap, and on-policy correction as the fix, connects the training design to the deployment failure mode — the full-system thinking the rubric asks for.
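The four-step loop can be sketched on a deliberately tiny stand-in: scalar states instead of frames, a fixed function as the "teacher", and a least-squares line as the "student". All names and the toy dynamics are hypothetical; the only thing the sketch is meant to show is the data flow, where the teacher labels states the student reached on its own.

```python
import math
from typing import Callable, List, Tuple

Pair = Tuple[float, float]  # (state the model was in, teacher's next state)


def rollout(step: Callable[[float], float], x0: float, n: int) -> List[float]:
    """Closed-loop rollout: each state is produced from the previous output."""
    states = [x0]
    for _ in range(n):
        states.append(step(states[-1]))
    return states


def fit_linear(pairs: List[Pair]) -> Tuple[float, float]:
    """Least-squares fit of next = a * state + b (the toy student)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0.0:
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / var_x
    return a, mean_y - a * mean_x


def on_policy_distill(teacher: Callable[[float], float], x0: float,
                      horizon: int, rounds: int):
    # Offline (off-policy) set: states from the teacher's own rollout.
    data = [(x, teacher(x)) for x in rollout(teacher, x0, horizon)[:-1]]
    a, b = fit_linear(data)
    for _ in range(rounds):
        # 1. Student rolls out closed-loop on its own imperfect history.
        visited = rollout(lambda x: a * x + b, x0, horizon)[:-1]
        # 2. Teacher labels the states the student actually reached.
        # 3-4. Aggregate with the earlier data and refit the student.
        data.extend((x, teacher(x)) for x in visited)
        a, b = fit_linear(data)
    return (a, b), data


# Hypothetical nonlinear teacher that a linear student cannot match exactly.
toy_teacher = lambda x: math.tanh(1.5 * x) + 0.1
(student_a, student_b), dataset = on_policy_distill(toy_teacher, 0.0, 20, 5)
print(f"aggregated pairs: {len(dataset)}, student: {student_a:.3f}*x + {student_b:.3f}")
```

In the real program the refit is a gradient-based fine-tune and the teacher call is a conditioned denoising pass, so each on-policy batch costs teacher GPU time. That cost is the reason to mix on-policy batches with the offline set rather than replace it.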
5. The evaluation gauntlet — inherited, plus distillation-specific traps
The student passes through design #1 §7’s gauntlet unchanged (regression suite → downstream agreement → drift horizon → shadow → canary). What distillation adds is which failures to hunt:
- Mode coverage per decision point: generate K rollouts per scenario seed from teacher and student; compare the spread of outcomes (did the student preserve the cross/don’t-cross split, or average it?). A diversity metric per slice, gated.
- Sharpness vs tail tradeoff: adversarial-distilled students get an extra tail-slice audit at elevated sample sizes, per footnote 1.
- Cross-modal coherence: few-step students soften geometry first; lidar-camera consistency (design #2’s L0) catches it before FVD does.
- Drift horizon regression: the closed-loop divergence curve (design #2 §3) re-measured per candidate; the max-trusted-horizon is recomputed and may shrink — which the scheduler must then enforce. A student that’s 150× cheaper but trusted for 4 s instead of 8 s may or may not be a win; that’s a portfolio decision, made explicit.
- Controllability preservation: the teacher’s control surfaces — driving-action conditioning, scene-layout control, language control — are product features, and distillation can silently weaken them (a student can match unconditional realism metrics while half-ignoring its conditioning). Gate it directly: same control input, K seeds, compare teacher-vs-student response to the control, per control axis. A student that renders beautiful rain but won’t produce rain on command is a regression no realism metric sees.
- Gate what you ship: the gauntlet runs on the FP8-quantized student exactly as served — not the BF16 training checkpoint. Quantizing after gating quietly un-gates the model (and per design #1 §10.2, quantization error lands hardest on the tail slices, exactly where the bands are tightest).
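Two of the checks above reduce to small computations once rollouts are summarized. The sketch below assumes each rollout has already been labeled with a discrete outcome (for example "cross" or "stay") by some hypothetical classifier, and that the closed-loop divergence curve is sampled at a fixed interval. Total variation distance and the threshold value are illustrative choices, not the gate definitions from design #2.

```python
from collections import Counter
from typing import Dict, List, Sequence


def outcome_distribution(outcomes: Sequence[str]) -> Dict[str, float]:
    """Empirical frequency of each outcome label across K rollouts."""
    counts = Counter(outcomes)
    return {label: c / len(outcomes) for label, c in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """TV distance between two outcome distributions (0 = same spread)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


def max_trusted_horizon(divergence: List[float], threshold: float, dt: float) -> float:
    """Longest horizon (seconds) before divergence first exceeds the threshold.

    divergence[i] is the teacher-vs-student divergence at time (i + 1) * dt.
    """
    trusted = 0.0
    for i, d in enumerate(divergence):
        if d > threshold:
            break
        trusted = (i + 1) * dt
    return trusted


# K = 10 rollouts from one scenario seed; the student has dropped a mode.
teacher_outcomes = ["cross"] * 4 + ["stay"] * 6
student_outcomes = ["stay"] * 10
gap = total_variation(outcome_distribution(teacher_outcomes),
                      outcome_distribution(student_outcomes))
print(f"outcome-spread gap: {gap:.2f}")

# Hypothetical divergence curve sampled once per second.
curve = [0.1, 0.2, 0.3, 0.6, 0.9]
print(f"trusted horizon: {max_trusted_horizon(curve, threshold=0.5, dt=1.0)} s")
```

A discrete-outcome comparison catches mode dropping but not mode averaging on its own, since a blurred rollout still has to be assigned some label. Pairing it with the sharpness and cross-modal checks in the list covers that gap.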
6. Distillation as a process, not a project
The teacher retrains quarterly; cities launch; the divergence queue never empties. If producing a student takes a bespoke heroic effort, the fleet runs stale models. So the deliverable is a distillation CI pipeline:

    teacher release (quarterly-ish)     divergence queue (continuous)
                   │                                  │
                   ▼                                  ▼
    ┌───────────────────────────────────────────────────┐
    │ DATA REFRESH: stratified set + byproduct logs     │
    │ + targeted miles for any slice that failed last   │
    │ cycle (and only those)                            │
    └─────────────────────────┬─────────────────────────┘
                              ▼
    ┌───────────────────────────────────────────────────┐
    │ TRAIN: capacity distill → step distill →          │
    │ on-policy correction rounds (~$40K, ~2 days)      │
    └─────────────────────────┬─────────────────────────┘
                              ▼
    ┌───────────────────────────────────────────────────┐
    │ GAUNTLET: design #1 §7 + §5 traps above;          │
    │ auto-promote on green, auto-page on red           │
    └─────────────────────────┬─────────────────────────┘
                              ▼
            fleet rollout: shadow → canary → bulk

Operating notes:
- Distillation lag is a tracked metric: days between teacher release and student promotion. While lagging, the fleet runs the old student against the old teacher’s gates — never a new-teacher/old-student mismatch, which silently invalidates the agreement baseline.
- Slice failures trigger targeted data, not full campaigns: a failed night-rain gate buys night-rain teacher miles, not a full ~$1.8M refresh.
- Every student is versioned with its teacher (design #2’s versioning discipline): a student is only meaningful relative to the teacher it was distilled from and gated against.
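The lag metric and the teacher/student pairing rule are simple enough to enforce in code. A minimal sketch, with hypothetical record types, version strings, and dates:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TeacherRelease:
    version: str
    released: date


@dataclass(frozen=True)
class StudentRelease:
    version: str
    teacher_version: str  # teacher this student was distilled from and gated against
    promoted: date


def distillation_lag_days(teacher: TeacherRelease, student: StudentRelease) -> int:
    """Days between a teacher release and promotion of its own student."""
    if student.teacher_version != teacher.version:
        raise ValueError("student was not distilled from this teacher")
    return (student.promoted - teacher.released).days


def baseline_is_valid(student: StudentRelease, gating_teacher_version: str) -> bool:
    """Agreement gates only mean something against the student's own teacher."""
    return student.teacher_version == gating_teacher_version


teacher_q3 = TeacherRelease("teacher-2026Q3", date(2026, 7, 1))
student_q3 = StudentRelease("student-2026Q3-a", "teacher-2026Q3", date(2026, 7, 15))

print(distillation_lag_days(teacher_q3, student_q3))    # lag in days
print(baseline_is_valid(student_q3, "teacher-2026Q4"))  # mismatch check
```

Raising on a version mismatch, rather than returning a number, is a deliberate choice here: a lag computed across unrelated versions would look plausible on a dashboard while measuring nothing.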
Staff-level signal: the question asks for a distillation design; the staff answer is mostly about the second and tenth iterations, not the first. One-shot distillation is a demo. The pipeline (lag metric, targeted refresh, auto-gauntlet) is the product.
7. When simpler wins
- If a workload needs only ~4× (not 150×), quantization + serving engineering (design #1’s ladder, levels 1–2) gets there with no new model and no gauntlet. Don’t distill for fun.
- If a use case doesn’t need sensor realism at all, it belongs in object-level sim (design #1 §9) — the cheapest distillation is the one you skip.
- A worse but honest student beats a better but uncharacterized one: a student with a known 4 s trusted horizon is usable inside the portfolio today; one with great averages and unmeasured tail behavior is not usable for anything safety-adjacent.
8. Research pass — new developments (as of June 2026)
- The sharpness-vs-tail trade now has a published mitigation: rCM (NVIDIA, ICLR 2026) regularizes continuous-time consistency distillation with a score term, matching DMD2’s quality while explicitly mitigating mode collapse and improving diversity, at 1–4 steps (15–50× sampling speedup), scaled to 10B+ video models. This is §2’s footnote-1 tension, solved-ish in the literature: the frontier answer is consistency/score hybrids, not a pick between the adversarial and consistency columns.
- Distillation training itself got cheaper: SGMD (May 2026) reports ~3× training speedup over DMD2 with better motion dynamics in 4-step students — relevant to §3’s economics, though data generation still dominates by an order of magnitude.
- On-policy correction is now the named SOTA recipe, not a nice-to-have: Self-Forcing (2025) trains the student on its own rollouts — §4’s loop, as the paper’s title concept — and Causal Forcing (Feb 2026) fixed the initialization pathologies in that family. The DAgger framing went from analogy to literature.
- A fourth distillation axis emerged: factorization. CausVid-style asymmetric distillation changes the student’s generation order, not just its size: bidirectional teacher → causal autoregressive student, making the student streamable and KV-cacheable. For this program, that means the §4 sequencing gains a step — capacity → steps → factorization → on-policy — and the student inherits the serving benefits design #1 §11 now treats as central.
9. Summary — what I’d want the interviewer to remember
- The contract is distributional: preserve the spread of futures per slice. The two named enemies are mode averaging (ghost pedestrians at decision points) and mode dropping (the tail, again).
- Data strategy beats loss strategy: generation outweighs training ~50:1, so free-ride on teacher-pool byproducts, oversample by gate tightness, and let the $37K training runs iterate freely.
- Boring sequencing: capacity-distill at full steps, then step-distill, then on-policy correction — never stack unstable optimizations.
- On-policy correction is the difference between a student that aces open-loop evals and one that survives closed-loop; exposure bias is the named gap, DAgger-style teacher labeling of student states is the fix.
- Gate the distillation-specific failures explicitly: outcome diversity per decision point, tail audits for adversarial losses, cross-modal coherence, re-measured drift horizon, controllability response per control axis — and run the gauntlet on the quantized artifact you actually ship.
- Build the pipeline, not the artifact: quarterly teachers make distillation a CI process with a lag metric — the tenth student should be push-button.
Footnotes
1. Design #1, footnote on mode collapse: a GAN’s generator can win by producing a few safe outputs, dropping low-density modes, and the long tail is the low-density modes. Adversarial distillation objectives (e.g., distribution-matching distillation) inherit a milder version of the same risk. The standard mitigation is keeping a likelihood-style anchor term in the loss alongside the adversarial one (a plain reconstruction/matching penalty that charges the student everywhere it deviates, not just where the critic happens to look), plus per-slice tail metrics in the acceptance gate so any dropping is caught rather than assumed away.
2. DAgger (“dataset aggregation”): an imitation-learning loop where the learner acts, an expert labels the states the learner actually visited, and the learner retrains on the aggregate. The point: a policy trained only on expert demonstrations never sees the states its own mistakes create, so one error walks it off the training distribution. This is the same mechanism as exposure bias in sequence generation.