Diffusion Interview Map
This is the memorization map for Waymo-style diffusion and world-model interviews. The goal is not to name every paper. The goal is to have the architecture substrate cold, know where each technique sits in the stack, and be able to explain the tradeoffs without reaching for notes.
Use this page as the checklist. Each row tells you what to memorize, the one sentence you should be able to say out loud, and where to review it in this primer.
What to memorize cold

| Priority | Topic | Memorize this | Where to find it |
|---|---|---|---|
| P0 | DDPM | Forward process adds Gaussian noise; reverse model learns to denoise, often by predicting epsilon with an MSE loss. | Diffusion Basics |
| P0 | DDIM | Same trained model, deterministic non-Markovian sampler, fewer NFEs than ancestral DDPM. | Training-Free Diffusion Solvers |
| P0 | Score SDE / probability-flow ODE | Diffusion sampling can be seen as solving an SDE or deterministic ODE from noise to data. | Training-Free Diffusion Solvers |
| P0 | NFE accounting | Cost is denoiser calls times guidance, horizon, candidate actions, and samples per action. | Fast Diffusion Sampling |
| P0 | Classifier-free guidance | Combine unconditional and conditional predictions; guidance improves conditioning but doubles NFE and can reduce diversity. | Fast Diffusion Sampling |
| P0 | Training-free solvers | DDIM, DPM-Solver++, and UniPC change only the sampler/integrator; they preserve the teacher better than distillation but bottom out around low double-digit NFE. | Training-Free Diffusion Solvers |
| P0 | Latent diffusion | Diffuse in a compressed latent instead of pixels; this is the core efficiency move for video/world models. | Diffusion World Model |
| P0 | Tokenizers: VAE vs VQ | Continuous latents pair naturally with diffusion/flow matching; discrete tokens pair naturally with autoregressive transformers. | Diffusion World Model |
| P0 | Flow matching / rectified flow | Learn a velocity field along straighter paths; straighter paths need fewer solver steps. | Fast Diffusion Sampling |
| P1 | U-Net | Original diffusion backbone: convolutional encoder-decoder with skips and attention. Know it as the baseline architecture. | U-Net and DiT Backbones |
| P1 | DiT | Patchify latent tokens and use transformer blocks; modern diffusion backbone, especially for image/video/world models. | U-Net and DiT Backbones |
| P1 | Conditioning injection | Action, map, route, agent boxes, text, and event controls enter through cross-attention, AdaLN/FiLM, concatenation, or ControlNet-style branches. | Diffusion World Model |
| P1 | Video latent world model | History is encoded into latents; a denoiser samples future latents; decoder renders future video. | Training-Free Diffusion Solvers |
| P1 | BEV / occupancy world model | Predict future occupancy or scene tokens instead of pixels; easier to validate and more directly plannable. | Diffusion World Model |
| P1 | Trajectory diffusion | Sample future agent trajectories conditioned on history, map, and ego action; cheap but abstracts away perception. | Training-Free Diffusion Solvers |
| P1 | Distillation | Retrain a student for 1-8 NFE; faster than solvers, but risks quality and tail-mode loss. | Fast Diffusion Sampling |
| P1 | Diversity / tail coverage | For world models, lost modes are lost futures; a fast sampler that drops rare dangerous futures is a safety failure. | Fast Diffusion Sampling |
| P1 | Validation hierarchy | Looks-real is not enough; validate physical consistency, reactivity, distributional coverage, and downstream predictive validity. | Diffusion Validation |
| P2 | GAIA-2 | AV-specific controllable multi-view latent diffusion world model; know it as the driving-video reference architecture. | Diffusion World Model |
| P2 | Vista | Open driving world model for controllable high-fidelity video prediction. | Diffusion World Model |
| P2 | OccWorld | 3D occupancy tokenizer plus GPT-like spatiotemporal transformer for scene and ego-token rollout. | Diffusion World Model |
| P2 | Cosmos | World foundation model platform: tokenizer plus diffusion and autoregressive models for physical AI. | Diffusion World Model |
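
The NFE accounting and classifier-free guidance rows can be rehearsed as arithmetic. The sketch below is a minimal illustration, not taken from the primer: `total_nfe`, `cfg_combine`, and every number in the example call are hypothetical, and "horizon" is assumed to mean the number of autoregressive chunks needed to cover the rollout.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction
    and extrapolate toward the conditional one.
    guidance_scale = 1 returns the conditional prediction unchanged."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def total_nfe(solver_steps, use_cfg, horizon_chunks,
              candidate_actions, samples_per_action):
    """Denoiser calls for one planning query (hypothetical accounting).

    CFG needs a conditional and an unconditional pass per solver step,
    so it doubles the per-step cost.
    """
    calls_per_step = 2 if use_cfg else 1
    return (solver_steps * calls_per_step * horizon_chunks
            * candidate_actions * samples_per_action)


# Illustrative numbers only: 20 solver steps, CFG on, 4 rollout chunks,
# 8 candidate ego actions, 16 sampled futures per action.
nfe = total_nfe(solver_steps=20, use_cfg=True, horizon_chunks=4,
                candidate_actions=8, samples_per_action=16)
print(nfe)  # 20 * 2 * 4 * 8 * 16 = 20480
```

The multiplication is the point: halving solver steps or dropping CFG scales the whole product, which is why sampler choice matters more in a planning loop than in one-off generation.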
The architecture you should be able to whiteboard
If you memorize only one diagram, memorize this:

    history / scene state
            |
            v
    encoder or tokenizer
            |
            v
    history latent / context tokens
            |
            |    action, map, route, agent boxes, text/event controls
            |         |
            v         v
    random noisy future x_T -> denoiser f_theta(x_t, t, c)
                                    |
                                    v
                             sampler / solver
                                    |
                                    v
                             clean future x_0
                                    |
                                    v
                                 decoder
                                    |
                                    v
                  future video, occupancy, or trajectories

The key interview sentence:
The tokenizer and denoiser are architecture. The solver is inference-time numerical integration. The conditioning stack is what makes the generator a world model instead of just a video model.
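
To make the split between denoiser and solver concrete, here is a minimal pure-Python sketch of a deterministic DDIM loop (eta = 0) wrapped around a swappable denoiser. Everything in it is an illustrative assumption: the linear beta schedule, the scalar "future", and `toy_denoiser`, which stands in for `f_theta(x_t, t, c)` by assuming the only possible future given context `c` is `x_0 = c`.

```python
import math
import random


def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level alpha_bar_t for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_train_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


ALPHA_BAR = make_alpha_bar()


def toy_denoiser(x_t, t, c):
    """Hypothetical stand-in for f_theta(x_t, t, c).

    Toy assumption: the data distribution given c is a point mass at c,
    so the exact noise estimate has a closed form.
    """
    a_t = ALPHA_BAR[t]
    return (x_t - math.sqrt(a_t) * c) / math.sqrt(1.0 - a_t)


def ddim_sample(denoiser, c, num_steps, rng):
    """Deterministic DDIM over num_steps evenly spaced timesteps.

    The denoiser is passed in untouched; only the step schedule changes.
    """
    last = len(ALPHA_BAR) - 1
    timesteps = [last - (i * last) // max(num_steps - 1, 1)
                 for i in range(num_steps)]
    x = rng.gauss(0.0, 1.0)  # x_T: pure noise
    nfe = 0
    for i, t in enumerate(timesteps):
        a_t = ALPHA_BAR[t]
        a_prev = ALPHA_BAR[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = denoiser(x, t, c)  # one network call = one NFE
        nfe += 1
        x0_pred = (x - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
        x = math.sqrt(a_prev) * x0_pred + math.sqrt(1.0 - a_prev) * eps
    return x, nfe


rng = random.Random(0)
for steps in (50, 10, 1):
    x0, nfe = ddim_sample(toy_denoiser, c=2.5, num_steps=steps, rng=rng)
    print(f"steps={steps:2d}  NFE={nfe:2d}  x_0={x0:.4f}")
```

Because the toy denoiser is exact for a point-mass target, every step count lands on `c`, which hides the discretization error a learned denoiser would show at low NFE. What the sketch does show is the structure: changing `num_steps` changes cost without retraining or modifying the denoiser.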
Why U-Net and DiT came next
I made U-Net and DiT Backbones the next study article because it is the missing middle between diffusion math and world-model systems design.
Priority for your study:
- P0: diffusion math and sampling (DDPM, DDIM, ODE view, CFG, NFE accounting)
- P1: denoiser architecture (U-Net, latent U-Net, DiT, conditioning injection, video extensions)
- P1: tokenization and representation (VAE vs VQ, latent size, continuous vs discrete state)
- P2: named world-model systems (GAIA-2, Vista, OccWorld, Cosmos)

Reason: if an interviewer asks you to design the model, the backbone is the first real architecture whiteboard. You need to explain what the denoiser is, why U-Net was the original default, why DiT became the modern scalable default, where action/map/history conditioning enters, and how this differs from the sampler. Named systems like GAIA-2 or Vista become easier once this substrate is automatic.
The model taxonomy
Use this taxonomy to place any architecture they mention:

| Family | Representation | Strength | Weakness | Examples to know |
|---|---|---|---|---|
| Video latent diffusion | Future camera/video latents | Sensor realism, perception stress tests | Expensive, hard to validate physically | GAIA-2, Vista, Cosmos Diffusion |
| Occupancy / BEV world model | 3D occupancy, BEV grids, scene tokens | Geometric, checkable, planning-relevant | Less appearance detail | OccWorld, Drive-OccWorld-style models |
| Autoregressive token world model | Discrete video/scene/action tokens | Natural long rollout, token-level prediction | Drift, quantization, exposure bias | Cosmos AR, Genie-style models |
| Trajectory world model | Agent future trajectories | Cheap, directly useful for sim agents | Assumes perception and geometry upstream | Motion forecasting / sim-agent models |
The interview move is to pick the representation from the use case:
- Use video latent diffusion when you need sensor-realistic scenario generation.
- Use occupancy or BEV when you need checkable dynamics and planning relevance.
- Use trajectory models when you need cheap closed-loop agent behavior.
- Use autoregressive token models when long rollout and tokenized state are the natural fit.
End-to-end system design questions (in ML Design)
The articles above are the substrate: math, backbones, taxonomy. The three problems below apply that substrate end to end. They live in the ML Design section but are listed here because each covers one AV specialization (perception/forecasting, simulation, end-to-end driving). Each walks the same arc: clarify → frame (real-world vs. training objective) → metrics → fallback when uncertain → data and long tail → baseline → modeling → serving within budget → offline and online eval → shadow/canary/rollback with drift monitoring.

| Question | Specialization | The core thing it tests |
|---|---|---|
| Onboard motion forecasting | Motion forecasting | Calibrated multimodal multi-agent prediction under a hard onboard latency budget; never drop the dangerous mode. |
| Closed-loop driving simulator | Simulation | Realistic, reactive, diverse sim-agents + scenario gen at massive throughput, validated by sim-to-real correlation. |
| End-to-end driving with a world model | End-to-end driving | Joint optimization of an end-to-end stack with interpretable intermediates plus world-model planning, bounded by a hard safety layer. |
These three deliberately reuse the same engine in different regimes: a scene-conditioned (often diffusion) generator of trajectories/futures, served once onboard with bounded latency (forecasting), once offline at huge throughput (simulation), and once in the planning loop (end-to-end). The contrasts in serving strategy across the three are themselves a strong interview point.
Suggested dedicated articles
This is the article plan for the new diffusion section. Some pages already exist elsewhere and can be moved or rewritten into this folder.

| Article | Status |
|---|---|
| 1-ddpm-basics.md | Exists as Diffusion Basics. |
| 2-ddim-and-ode-view.md | Covered in Training-Free Diffusion Solvers. |
| 3-training-free-solvers.md | Exists as Training-Free Diffusion Solvers. |
| 4-classifier-free-guidance.md | Covered in Fast Diffusion Sampling. |
| 5-latent-diffusion-and-tokenizers.md | Covered in Diffusion World Model. |
| 6-unet-dit-backbones.md | Exists as U-Net and DiT Backbones. |
| 7-flow-matching-rectified-flow.md | Covered in Fast Diffusion Sampling. |
| 8-video-diffusion-architectures.md | Planned. |
| 9-driving-world-models.md | Covered in Diffusion World Model. |
| 10-speed-distillation-and-tradeoffs.md | Covered in Fast Diffusion Sampling. |
| 11-validation-for-world-models.md | Exists as Diffusion Validation. |