Diffusion Basics: A 1-Hour Interview Learning Session
Companion notebook: diffusion_basics_colab.ipynb
Diffusion models look intimidating because papers introduce many symbols. For an interview, you need a clean mental model:
Diffusion training teaches a neural network to remove known Gaussian noise. Generation starts from noise and repeatedly denoises until a sample appears.
In autonomous driving simulation, the sample may be a future trajectory. In robotics, it may be an action sequence. In vision, it may be an image. The core math is the same.
0. One-hour plan
0-10 min   Why diffusion exists: sampling multi-modal distributions
10-20 min  Forward process: add Gaussian noise
20-35 min  Reverse process: learn to denoise
35-45 min  Why predict epsilon
45-55 min  Minimal PyTorch implementation
55-60 min  Interview answers and drills
By the end, you should be able to explain DDPM without paper-level detail, write the noising equation, implement the training loss, and explain why diffusion is useful for multi-modal future prediction.
1. Why you should care
Many ML models predict one answer. Driving futures are not like that. Given the same scene, a car may yield, merge, accelerate, or brake. A pedestrian may wait or cross.
Diffusion models are useful because they model a distribution, not just a point estimate. Different random seeds can generate different plausible futures.
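To make the point-estimate problem concrete, here is a minimal sketch with made-up numbers (the lateral offsets -1.0 and +1.0 are illustrative assumptions, not values from any driving dataset). A single prediction trained with MSE is pulled toward the mean of the targets, which lies between the two modes and matches neither.

```python
# Toy illustration: the same scene has two plausible outcomes.
# Hypothetical lateral offsets: -1.0 (go left) and +1.0 (go right).
targets = [-1.0, 1.0, -1.0, 1.0]

def mse(prediction, values):
    """Mean squared error of one constant prediction against all values."""
    return sum((prediction - v) ** 2 for v in values) / len(values)

# The constant that minimizes MSE is the mean of the targets.
point_estimate = sum(targets) / len(targets)

# The mean scores better under MSE than either real mode,
# yet it is not one of the plausible futures.
mse_at_mean = mse(point_estimate, targets)
mse_at_left_mode = mse(-1.0, targets)
mse_at_right_mode = mse(1.0, targets)
```

A model that samples from the distribution can return either mode on different draws instead of the average between them.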
You should care in interviews because diffusion tests three fundamentals:
- Can you reason about probability distributions?
- Can you explain a training objective from first principles?
- Can you implement tensor code without getting lost in notation?
2. Forward noise process
Let $x_0$ be clean data. For a trajectory, this might be:
$$x_0 \in \mathbb{R}^{H \times 2}$$
where $H$ is the number of future timesteps and each point is $(x, y)$.
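The forward process gradually adds Gaussian noise:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$
where:
- $t$ is the diffusion step.
- $\beta_t$ is the noise variance at step $t$.
- $I$ is identity covariance.
Define:
$$\alpha_t = 1 - \beta_t$$
and:
$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$
The useful shortcut is:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$
where:
$$\epsilon \sim \mathcal{N}(0, I)$$
This says: the noisy sample equals a scaled copy of the clean signal plus scaled Gaussian noise. As $t$ increases, $\bar{\alpha}_t$ shrinks, so signal decreases and noise increases.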
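The behavior of the signal and noise coefficients can be checked numerically. The sketch below assumes a linear beta schedule from 1e-4 to 0.02 over 100 steps (illustrative values; `linear_beta_schedule` and `cumulative_alpha_bar` are hypothetical helper names) and prints the two coefficients of the closed-form shortcut at a few steps.

```python
import math

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances (an assumed linear schedule)."""
    if steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (steps - 1)
    return [beta_start + i * step for i in range(steps)]

def cumulative_alpha_bar(betas):
    """alpha_bar at each step: running product of (1 - beta)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

betas = linear_beta_schedule(100)
alpha_bar = cumulative_alpha_bar(betas)

# Signal and noise coefficients of the shortcut at early, middle, late steps.
# Their squares always sum to 1, so the total variance scale is preserved.
for t in (0, 49, 99):
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    print(f"t={t:3d}  signal={signal:.3f}  noise={noise:.3f}")
```

With this particular 100-step schedule, the last value of `alpha_bar` stays well above zero (roughly 0.36), so the final noisy sample still carries noticeable signal. A longer schedule or larger betas would push it closer to pure noise.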

    x0   clean trajectory
    |
    |  add small noise
    v
    x1   slightly noisy
    |
    v
    ...
    |
    v
    xT   almost Gaussian noise

3. Reverse denoising process
Generation runs backward:

    random noise xT
    |
    |  denoise
    v
    xT-1
    |
    v
    ...
    |
    v
    x0   generated sample

The model learns:
$$p_\theta(x_{t-1} \mid x_t)$$
In practice, many DDPM-style models predict the noise that was added:
$$\epsilon_\theta(x_t, t) \approx \epsilon$$
Training objective:
$$L = \mathbb{E}_{x_0, t, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$
This is just MSE between true noise and predicted noise.
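The text does not spell out the update applied at each denoising step. One common choice is DDPM ancestral sampling, sketched below on a toy 1-D problem where the data is -1 or +1 with equal probability. Everything specific here is an assumption for illustration: the 1000-step linear schedule, the step variance equal to beta, and `eps_model`, which is a closed-form stand-in for a trained network rather than a learned model.

```python
import math
import random

random.seed(0)

# Assumed schedule: linear betas from 1e-4 to 0.02 over 1000 steps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bar, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bar.append(running)

def eps_model(xt, t):
    """Stand-in for a trained noise predictor (assumption for this sketch).

    For data that is -1 or +1 with equal probability, the ideal estimate of
    the clean sample is E[x0 | xt] = tanh(sqrt(ab) * xt / (1 - ab)), and the
    ideal noise prediction follows by inverting the forward shortcut.
    """
    ab = alpha_bar[t]
    x0_hat = math.tanh(math.sqrt(ab) * xt / (1.0 - ab))
    return (xt - math.sqrt(ab) * x0_hat) / math.sqrt(1.0 - ab)

def sample():
    """Run the reverse chain from Gaussian noise down to a clean sample."""
    x = random.gauss(0.0, 1.0)
    for t in reversed(range(T)):
        eps = eps_model(x, t)
        # Mean of the reverse step given the predicted noise.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bar[t]) * eps) / math.sqrt(alphas[t])
        # Fresh noise with variance beta is added at every step except the last.
        x = mean + (math.sqrt(betas[t]) * random.gauss(0.0, 1.0) if t > 0 else 0.0)
    return x

samples = [sample() for _ in range(50)]
```

Each call to `sample` evaluates the noise predictor `T` times, and different random draws end up near -1 or near +1 rather than at the mean 0.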
4. Why predict epsilon?
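Predicting noise is convenient because noise has a simple distribution:
$$\epsilon \sim \mathcal{N}(0, I)$$
The target is normalized and stable across data domains. If the model predicts $\epsilon_\theta(x_t, t)$, we can recover an estimate of $x_0$:
$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$$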
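The recovery step is plain algebra on the forward shortcut, which the following sketch checks with toy numbers. The value of `alpha_bar_t`, the sample `x0`, and the helper name `estimate_x0` are all illustrative assumptions; the true noise is used in place of a model prediction to show the inversion is exact.

```python
import math
import random

random.seed(0)

alpha_bar_t = 0.5                  # hypothetical alpha_bar at the chosen step
x0 = [1.0, -2.0, 0.5, 3.0]         # toy clean sample
eps = [random.gauss(0.0, 1.0) for _ in x0]

# Forward shortcut: xt = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
xt = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
      for x, e in zip(x0, eps)]

def estimate_x0(xt, eps_pred, alpha_bar_t):
    """Invert the shortcut: (xt - sqrt(1 - ab) * eps_pred) / sqrt(ab)."""
    return [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
            for x, e in zip(xt, eps_pred)]

# With a perfect noise prediction, the clean sample comes back up to rounding.
x0_hat = estimate_x0(xt, eps, alpha_bar_t)
```

With an imperfect prediction the same formula still applies, but the error is divided by the square root of `alpha_bar_t`, so estimates made at very noisy steps are much less reliable.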
Interview answer:
Predicting epsilon turns denoising into a supervised regression problem with a standardized target. Once I know the noise, I can algebraically estimate the clean sample.
There are other parameterizations, such as predicting $x_0$ or velocity $v$, but epsilon prediction is the easiest to explain and implement.
5. Minimal PyTorch implementation

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def q_sample(x0, t, noise, alpha_bar):
        # Per-example alpha_bar, reshaped to (B, 1) so it broadcasts over features.
        a = alpha_bar[t].view(-1, 1)
        return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

    class TinyDenoiser(nn.Module):
        def __init__(self, dim, steps=100, hidden=64):
            super().__init__()
            self.time = nn.Embedding(steps, hidden)  # learned timestep embedding
            self.net = nn.Sequential(
                nn.Linear(dim + hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, xt, t):
            # Concatenate the noisy sample with its timestep embedding.
            return self.net(torch.cat([xt, self.time(t)], dim=-1))

    # Linear noise schedule and its cumulative product.
    steps = 100
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    B, D = 32, 4
    model = TinyDenoiser(D, steps)
    x0 = torch.randn(B, D)               # stand-in for clean data
    t = torch.randint(0, steps, (B,))    # one random timestep per example
    noise = torch.randn_like(x0)
    xt = q_sample(x0, t, noise, alpha_bar)

    pred_noise = model(xt, t)
    loss = F.mse_loss(pred_noise, noise)
    loss.backward()

This is the core of one training step. A full training loop repeats it over batches of real data and applies an optimizer update after each backward pass.
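As a rough sketch of how that single step could be wrapped into a loop, the block below adds an optimizer and iterates over random mini-batches. The optimizer (Adam), learning rate, iteration count, toy dataset, and the helper name `train_step` are assumptions chosen for illustration, not settings taken from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

steps = 100
betas = torch.linspace(1e-4, 0.02, steps)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class TinyDenoiser(nn.Module):
    """Same shape of model as in the text: timestep embedding plus a small MLP."""
    def __init__(self, dim, steps=100, hidden=64):
        super().__init__()
        self.time = nn.Embedding(steps, hidden)
        self.net = nn.Sequential(
            nn.Linear(dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, xt, t):
        return self.net(torch.cat([xt, self.time(t)], dim=-1))

def train_step(model, optimizer, x0):
    """One optimizer update of the epsilon-prediction objective."""
    t = torch.randint(0, steps, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1)
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    loss = F.mse_loss(model(xt, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyDenoiser(dim=4, steps=steps)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed settings
data = torch.randn(256, 4) * 0.1 + 2.0  # toy dataset standing in for real samples

losses = []
for _ in range(200):
    batch = data[torch.randint(0, data.shape[0], (32,))]
    losses.append(train_step(model, optimizer, batch))
```

Because the timestep and noise are resampled on every step, the loss is noisy from one iteration to the next; looking at a running average of `losses` gives a clearer picture than any single value.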
6. Autonomous driving and robotics interpretation
For driving:

    x0 = future trajectory or future scene
    c  = map + agent history + traffic lights + route

For robotics:

    x0 = action sequence
    c  = camera/state observation + task instruction

Diffusion learns a conditional distribution:
$$p_\theta(x_0 \mid c)$$
The basic loss becomes:
$$L = \mathbb{E}_{x_0, c, t, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|^2\right]$$
The conditioning is what makes generation useful rather than random.
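One simple way to feed the context into the network is to encode it and concatenate it with the noisy sample and the timestep embedding. The sketch below assumes the context has already been reduced to a fixed-size vector; `ConditionalDenoiser`, the dimensions, and the random stand-in tensors are all hypothetical. Real systems often use richer encoders or attention, which this sketch does not attempt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDenoiser(nn.Module):
    """Hypothetical noise predictor that also receives a context vector c."""
    def __init__(self, dim, cond_dim, steps=100, hidden=64):
        super().__init__()
        self.time = nn.Embedding(steps, hidden)
        self.cond = nn.Linear(cond_dim, hidden)  # encodes the context features
        self.net = nn.Sequential(
            nn.Linear(dim + 2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, xt, t, c):
        features = torch.cat([xt, self.time(t), self.cond(c)], dim=-1)
        return self.net(features)

steps = 100
betas = torch.linspace(1e-4, 0.02, steps)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

B, D, C = 32, 4, 8          # batch, sample dim, context dim (illustrative)
model = ConditionalDenoiser(D, C, steps)
x0 = torch.randn(B, D)      # stand-in for future trajectories
c = torch.randn(B, C)       # stand-in for encoded scene context
t = torch.randint(0, steps, (B,))
noise = torch.randn_like(x0)

a = alpha_bar[t].view(-1, 1)
xt = a.sqrt() * x0 + (1.0 - a).sqrt() * noise

# Same MSE objective as the unconditional case; the model just also sees c.
pred_noise = model(xt, t, c)
loss = F.mse_loss(pred_noise, noise)
loss.backward()
```

Only the clean sample is noised. The context `c` is passed in unchanged at every timestep.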
7. Failure modes and debugging checklist
- Timestep embedding is missing or broken.
- Noise schedule is too aggressive.
- Tensor broadcasting is silently wrong.
- Model predicts $x_0$ but loss compares to epsilon.
- Sampling code does not match training parameterization.
- Model ignores conditioning.
- Training loss goes down but samples are bad.
Checklist:
- Plot $x_t$ at low, medium, high timesteps.
- Verify $x_t$ becomes noise as $t$ increases.
- Overfit 32 examples.
- Print tensor shapes.
- Compare predicted noise scale to true noise scale.
- Test one denoising step before full sampling.
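The broadcasting failure is easy to reproduce, which makes it a useful thing to demonstrate. In the sketch below the batch size and feature size are deliberately equal (an assumption chosen to hide the bug), so the wrong version runs without any error. `check_q_sample_shapes` is a hypothetical helper showing the kind of shape assertion that catches it.

```python
import torch

steps = 100
betas = torch.linspace(1e-4, 0.02, steps)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

B = D = 4                       # equal sizes hide the bug
x0 = torch.ones(B, D)
t = torch.tensor([0, 30, 60, 99])

# Wrong: a (B,) coefficient broadcasts against the LAST axis, so each
# feature column, not each example row, gets its own timestep coefficient.
wrong = alpha_bar[t].sqrt() * x0

# Right: a (B, 1) coefficient gives every example its own value.
right = alpha_bar[t].view(-1, 1).sqrt() * x0

def check_q_sample_shapes(x0, t, coeff):
    """Cheap assertions worth keeping in a training script."""
    assert t.shape == (x0.shape[0],), "expected one timestep per example"
    assert coeff.shape == (x0.shape[0], 1), "coefficient must have shape (B, 1)"

check_q_sample_shapes(x0, t, alpha_bar[t].view(-1, 1))
```

Here `wrong` is the transpose of `right`: same shape, no exception, different numbers. When the batch and feature sizes differ, the wrong version raises a shape error instead, so testing with unequal sizes is another quick way to surface the problem.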
8. Common interview questions and strong answers
Q: What is the forward process?
A: A fixed process that gradually adds Gaussian noise to clean data using a known noise schedule.
Q: What does the network learn?
A: It learns the reverse denoising process. In DDPM training, it often predicts the noise added to $x_0$.
Q: Why is diffusion good for multi-modal prediction?
A: Generation starts from random noise, so the same conditioning input can produce multiple plausible samples.
Q: Why predict epsilon?
A: Epsilon is standardized Gaussian noise, which is a stable regression target, and predicting it lets us reconstruct the clean sample.
9. A 60-second explanation you can say out loud
A diffusion model learns to reverse a noising process. During training, I take clean data $x_0$, choose a timestep $t$, add Gaussian noise using $\bar{\alpha}_t$, and train a network to predict the noise $\epsilon$. At generation time, I start from Gaussian noise and repeatedly denoise. Predicting epsilon is common because the target is normalized and easy to regress. For driving simulation, this is useful because one scene can have many plausible futures, and different noise samples can generate different trajectories.
10. Practice exercises with answers
Exercise 1: What happens as $\bar{\alpha}_t \to 0$?
Answer: The clean signal disappears and $x_t$ becomes mostly Gaussian noise.
Exercise 2: Why does the model need timestep $t$?
Answer: Denoising a lightly corrupted sample is different from denoising almost pure noise.
Exercise 3: Write the epsilon training objective.
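Answer: $L = \mathbb{E}_{x_0, t, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$.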
Exercise 4: Why is diffusion slower than one-shot regression?
Answer: Sampling usually requires many denoising steps instead of one forward pass.