
Diffusion interview map

This is the memorization map for Waymo-style diffusion and world-model interviews. The goal is not to name every paper. The goal is to have the architecture substrate cold, know where each technique sits in the stack, and be able to explain the tradeoffs without reaching for notes.

Use this page as the checklist. Each row tells you what to memorize, the one sentence you should be able to say out loud, and where to review it in this primer.


| Priority | Topic | Memorize this | Where to find it |
| --- | --- | --- | --- |
| P0 | DDPM | Forward process adds Gaussian noise; reverse model learns to denoise, often by predicting epsilon with an MSE loss. | Diffusion Basics |
| P0 | DDIM | Same trained model, deterministic non-Markovian sampler, fewer NFEs than ancestral DDPM. | Training-Free Diffusion Solvers |
| P0 | Score SDE / probability-flow ODE | Diffusion sampling can be seen as solving an SDE or deterministic ODE from noise to data. | Training-Free Diffusion Solvers |
| P0 | NFE accounting | Total cost is denoiser calls per sample multiplied by guidance passes, rollout horizon, candidate actions, and samples per action. | Fast Diffusion Sampling |
| P0 | Classifier-free guidance | Combine unconditional and conditional predictions; guidance improves conditioning but doubles NFE and can reduce diversity. | Fast Diffusion Sampling |
| P0 | Training-free solvers | DDIM, DPM-Solver++, and UniPC change only the sampler/integrator; they preserve the teacher better than distillation but bottom out around low double-digit NFE. | Training-Free Diffusion Solvers |
| P0 | Latent diffusion | Diffuse in a compressed latent instead of pixels; this is the core efficiency move for video/world models. | Diffusion World Model |
| P0 | Tokenizers: VAE vs VQ | Continuous latents pair naturally with diffusion/flow matching; discrete tokens pair naturally with autoregressive transformers. | Diffusion World Model |
| P0 | Flow matching / rectified flow | Learn a velocity field along straighter paths; straighter paths need fewer solver steps. | Fast Diffusion Sampling |
| P1 | U-Net | Original diffusion backbone: convolutional encoder-decoder with skips and attention. Know it as the baseline architecture. | U-Net and DiT Backbones |
| P1 | DiT | Patchify latent tokens and use transformer blocks; modern diffusion backbone, especially for image/video/world models. | U-Net and DiT Backbones |
| P1 | Conditioning injection | Action, map, route, agent boxes, text, and event controls enter through cross-attention, AdaLN/FiLM, concatenation, or ControlNet-style branches. | Diffusion World Model |
| P1 | Video latent world model | History is encoded into latents; a denoiser samples future latents; a decoder renders future video. | Diffusion World Model |
| P1 | BEV / occupancy world model | Predict future occupancy or scene tokens instead of pixels; easier to validate and more directly plannable. | Diffusion World Model |
| P1 | Trajectory diffusion | Sample future agent trajectories conditioned on history, map, and ego action; cheap but abstracts away perception. | Diffusion World Model |
| P1 | Distillation | Retrain a student for 1-8 NFE; faster than solvers, but risks quality and tail-mode loss. | Fast Diffusion Sampling |
| P1 | Diversity / tail coverage | For world models, lost modes are lost futures; a fast sampler that drops rare dangerous futures is a safety failure. | Fast Diffusion Sampling |
| P1 | Validation hierarchy | Looking real is not enough; validate physical consistency, reactivity, distributional coverage, and downstream predictive validity. | Diffusion Validation |
| P2 | GAIA-2 | AV-specific controllable multi-view latent diffusion world model; know it as the driving-video reference architecture. | Diffusion World Model |
| P2 | Vista | Open driving world model for controllable high-fidelity video prediction. | Diffusion World Model |
| P2 | OccWorld | 3D occupancy tokenizer plus GPT-like spatiotemporal transformer for scene and ego-token rollout. | Diffusion World Model |
| P2 | Cosmos | World foundation model platform: tokenizer plus diffusion and autoregressive models for physical AI. | Diffusion World Model |
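
The NFE accounting row is easier to say out loud after working one example. The sketch below is only illustrative: the function name `total_nfe`, the reading of "horizon" as a count of autoregressive rollout chunks, and every number in it are assumptions, not figures from any real system.

```python
def total_nfe(solver_steps, horizon_chunks, candidate_actions,
              samples_per_action, use_cfg=True):
    """Count denoiser forward passes for one planning query.

    Assumption: each rollout chunk is sampled with its own solver loop,
    and classifier-free guidance costs two denoiser calls per solver step.
    """
    guidance_factor = 2 if use_cfg else 1
    return (solver_steps * guidance_factor * horizon_chunks
            * candidate_actions * samples_per_action)


# Hypothetical budget: 20-step training-free solver with CFG.
solver_budget = total_nfe(solver_steps=20, horizon_chunks=4,
                          candidate_actions=8, samples_per_action=4)

# Hypothetical budget: 2-step distilled student with guidance folded in.
distilled_budget = total_nfe(solver_steps=2, horizon_chunks=4,
                             candidate_actions=8, samples_per_action=4,
                             use_cfg=False)

print(solver_budget, distilled_budget)  # 5120 256
```

The point to take from the arithmetic is that the per-sample step count is only one factor; the multipliers from guidance, horizon, and candidate actions are what turn a 20-step sampler into thousands of denoiser calls.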

The architecture you should be able to whiteboard


If you memorize only one diagram, memorize this:

    history / scene state
            |
            v
    encoder or tokenizer
            |
            v
    history latent / context tokens
            |
            |        action, map, route, agent boxes, text/event controls
            |                         |
            v                         v
    random noisy future x_T -> denoiser f_theta(x_t, t, c)
                                      |
                                      v
                              sampler / solver
                                      |
                                      v
                              clean future x_0
                                      |
                                      v
                                   decoder
                                      |
                                      v
                  future video, occupancy, or trajectories

The key interview sentence:

The tokenizer and denoiser are architecture. The solver is inference-time numerical integration. The conditioning stack is what makes the generator a world model instead of just a video model.
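
To make the split between denoiser and solver concrete, here is a minimal sketch of a deterministic DDIM loop wrapped around a classifier-free-guided denoiser. Everything named here is an assumption for illustration: `toy_eps_model` is a stand-in for f_theta that returns the exact noise for data concentrated at a single point, the linear beta schedule uses common default values, and the conditioning `cond` is just a target vector. A real world model would replace the toy denoiser with a trained network and leave the loop unchanged.

```python
import numpy as np


def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)


def toy_eps_model(x_t, alpha_bar_t, cond):
    """Stand-in denoiser: exact epsilon when all data sits at `cond`.

    Passing cond=None plays the role of the unconditional branch (data at 0).
    """
    target = np.zeros_like(x_t) if cond is None else cond
    return (x_t - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)


def guided_eps(x_t, alpha_bar_t, cond, guidance_scale):
    """Classifier-free guidance: two denoiser calls per solver step."""
    eps_uncond = toy_eps_model(x_t, alpha_bar_t, None)
    eps_cond = toy_eps_model(x_t, alpha_bar_t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def ddim_sample(cond, num_steps=10, guidance_scale=1.0, seed=0):
    """Deterministic DDIM (eta = 0) over a subsampled timestep grid."""
    alpha_bar = make_alpha_bar()
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)
    x = np.random.default_rng(seed).standard_normal(cond.shape)  # x_T
    nfe = 0
    for i, t in enumerate(timesteps):
        eps = guided_eps(x, alpha_bar[t], cond, guidance_scale)
        nfe += 2  # conditional + unconditional pass
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # Signal level of the next grid point; 1.0 means fully clean.
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
    return x, nfe


sample, nfe = ddim_sample(np.array([1.0, -2.0]), num_steps=10)
```

Changing `num_steps` or swapping the update rule for a higher-order solver touches only `ddim_sample`; the denoiser and its conditioning are untouched, which is the sense in which the solver is inference-time numerical integration.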


I made U-Net and DiT Backbones the next study article because it is the missing middle between diffusion math and world-model systems design.

Priority for your study:

  • P0: diffusion math and sampling (DDPM, DDIM, ODE view, CFG, NFE accounting)
  • P1: denoiser architecture (U-Net, latent U-Net, DiT, conditioning injection, video extensions)
  • P1: tokenization and representation (VAE vs VQ, latent size, continuous vs discrete state)
  • P2: named world-model systems (GAIA-2, Vista, OccWorld, Cosmos)

Reason: if an interviewer asks you to design the model, the backbone is the first real piece of architecture you will whiteboard. You need to explain what the denoiser is, why U-Net was the original default, why DiT became the modern scalable default, where action/map/history conditioning enters, and how all of this differs from the sampler. Named systems like GAIA-2 or Vista become easier once this substrate is automatic.
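
One of the conditioning routes listed in this primer, AdaLN/FiLM-style modulation, is small enough to sketch directly. The example below is a hedged illustration, not any specific model's block: the sizes, the random projection matrices `w_scale` and `w_shift`, and the one-hot "action embeddings" are all hypothetical. It shows the general pattern of normalizing the latent tokens and then scaling and shifting them with values computed from the conditioning vector.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaln_modulate(tokens, cond_embedding, w_scale, w_shift):
    """Scale and shift normalized tokens using the conditioning vector.

    tokens:         (num_tokens, dim) latent patch tokens
    cond_embedding: (cond_dim,) embedding of timestep/action/etc.
    w_scale, w_shift: (cond_dim, dim) projections (learned in a real model)
    """
    scale = cond_embedding @ w_scale  # (dim,)
    shift = cond_embedding @ w_shift  # (dim,)
    return layer_norm(tokens) * (1.0 + scale) + shift


rng = np.random.default_rng(0)
dim, cond_dim, num_tokens = 8, 4, 16  # hypothetical sizes
tokens = rng.standard_normal((num_tokens, dim))
w_scale = 0.1 * rng.standard_normal((cond_dim, dim))
w_shift = 0.1 * rng.standard_normal((cond_dim, dim))

# Two hypothetical ego-action embeddings.
brake = np.array([1.0, 0.0, 0.0, 0.0])
steer_left = np.array([0.0, 1.0, 0.0, 0.0])

out_brake = adaln_modulate(tokens, brake, w_scale, w_shift)
out_left = adaln_modulate(tokens, steer_left, w_scale, w_shift)
```

The same tokens produce different activations under different actions, and with zero projection weights the block reduces to a plain layer norm. Cross-attention, concatenation, and ControlNet-style branches inject the same information at other points in the backbone.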


Use this taxonomy to place any architecture they mention:

| Family | Representation | Strength | Weakness | Examples to know |
| --- | --- | --- | --- | --- |
| Video latent diffusion | Future camera/video latents | Sensor realism, perception stress tests | Expensive, hard to validate physically | GAIA-2, Vista, Cosmos Diffusion |
| Occupancy / BEV world model | 3D occupancy, BEV grids, scene tokens | Geometric, checkable, planning-relevant | Less appearance detail | OccWorld, Drive-OccWorld-style models |
| Autoregressive token world model | Discrete video/scene/action tokens | Natural long rollout, token-level prediction | Drift, quantization, exposure bias | Cosmos AR, Genie-style models |
| Trajectory world model | Agent future trajectories | Cheap, directly useful for sim agents | Assumes perception and geometry upstream | Motion forecasting / sim-agent models |

The interview move is to pick the representation from the use case:

  • Use video latent diffusion when you need sensor-realistic scenario generation.
  • Use occupancy or BEV when you need checkable dynamics and planning relevance.
  • Use trajectory models when you need cheap closed-loop agent behavior.
  • Use autoregressive token models when long rollout and tokenized state are the natural fit.

End-to-end system design questions (in ML Design)


The articles above are the substrate: math, backbones, taxonomy. The three problems below apply that substrate end to end. They live in the ML Design section but are listed here because each covers one AV specialization (perception/forecasting, simulation, end-to-end driving). Each walks the same arc: clarify → frame (real-world vs. training objective) → metrics → fallback when uncertain → data & long-tail → baseline → modeling → serving within budget → offline+online eval → shadow/canary/rollback with drift monitoring.

| Question | Specialization | The core thing it tests |
| --- | --- | --- |
| Onboard motion forecasting | Motion forecasting | Calibrated multimodal multi-agent prediction under a hard onboard latency budget; never drop the dangerous mode. |
| Closed-loop driving simulator | Simulation | Realistic, reactive, diverse sim agents plus scenario generation at massive throughput, validated by sim-to-real correlation. |
| End-to-end driving with a world model | End-to-end driving | Joint optimization through an end-to-end stack with interpretable intermediates, plus world-model planning, bounded by a hard safety layer. |

These three deliberately reuse the same engine in different regimes: a scene-conditioned (often diffusion) generator of trajectories/futures, served once onboard with bounded latency (forecasting), once offline at huge throughput (simulation), and once in the planning loop (end-to-end). The contrasts in serving strategy across the three are themselves a strong interview point.


This is the article plan for the new diffusion section. Some pages already exist elsewhere and can be moved or rewritten into this folder.

| Article | Status |
| --- | --- |
| 1-ddpm-basics.md | Exists as Diffusion Basics. |
| 2-ddim-and-ode-view.md | Covered in Training-Free Diffusion Solvers. |
| 3-training-free-solvers.md | Exists as Training-Free Diffusion Solvers. |
| 4-classifier-free-guidance.md | Covered in Fast Diffusion Sampling. |
| 5-latent-diffusion-and-tokenizers.md | Covered in Diffusion World Model. |
| 6-unet-dit-backbones.md | Exists as U-Net and DiT Backbones. |
| 7-flow-matching-rectified-flow.md | Covered in Fast Diffusion Sampling. |
| 8-video-diffusion-architectures.md | Planned. |
| 9-driving-world-models.md | Covered in Diffusion World Model. |
| 10-speed-distillation-and-tradeoffs.md | Covered in Fast Diffusion Sampling. |
| 11-validation-for-world-models.md | Exists as Diffusion Validation. |