Diffusion Basics: A 1-Hour Interview Learning Session

Companion notebook: diffusion_basics_colab.ipynb

Diffusion models look intimidating because papers introduce many symbols. For an interview, you need a clean mental model:

Diffusion training teaches a neural network to remove known Gaussian noise. Generation starts from noise and repeatedly denoises until a sample appears.

In autonomous driving simulation, the sample may be a future trajectory. In robotics, it may be an action sequence. In vision, it may be an image. The core math is the same.

0. One-hour plan

0-10 min   Why diffusion exists: sampling multi-modal distributions
10-20 min  Forward process: add Gaussian noise
20-35 min  Reverse process: learn to denoise
35-45 min  Why predict epsilon
45-55 min  Minimal PyTorch implementation
55-60 min  Interview answers and drills

By the end, you should be able to explain DDPM without paper-level detail, write the noising equation, implement the training loss, and explain why diffusion is useful for multi-modal future prediction.

1. Why you should care

Many ML models predict one answer. Driving futures are not like that. Given the same scene, a car may yield, merge, accelerate, or brake. A pedestrian may wait or cross.

Diffusion models are useful because they model a distribution, not just a point estimate. Different random seeds can generate different plausible futures.

You should care in interviews because diffusion tests three fundamentals:

Can you reason about probability distributions?
Can you explain a training objective from first principles?
Can you implement tensor code without getting lost in notation?

2. Forward noise process

Let $x_0$ be clean data. For a trajectory, this might be:

x_0 \in \mathbb{R}^{T \times 2}

where $T$ is future timesteps and each point is $(x,y)$ .

The forward process gradually adds Gaussian noise:

q(x_t|x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}x_{t-1}, \beta_t I)

where:

$t$ is the diffusion step.
$\beta_t$ is the noise variance at step $t$ .
$I$ is identity covariance.

Define:

\alpha_t = 1-\beta_t

and:

\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s

The useful shortcut is:

x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon

where:

\epsilon \sim \mathcal{N}(0,I)

This says: noisy sample equals clean signal plus Gaussian noise. As $t$ increases, $\bar{\alpha}_t$ shrinks, so signal decreases and noise increases.

x0 clean trajectory
   |
   | add small noise
   v
x1 slightly noisy
   |
   v
...
   |
   v
xT almost Gaussian noise

3. Reverse denoising process

Generation runs backward:

random noise xT
   |
   | denoise
   v
xT-1
   |
   v
...
   |
   v
x0 generated sample

The model learns:

p_\theta(x_{t-1}|x_t)

In practice, many DDPM-style models predict the noise that was added:

\epsilon_\theta(x_t,t)

Training objective:

\mathcal{L} = \mathbb{E}_{x_0,t,\epsilon} \left[ \|\epsilon-\epsilon_\theta(x_t,t)\|_2^2 \right]

This is just MSE between true noise and predicted noise.

4. Why predict epsilon?

Predicting noise is convenient because noise has a simple distribution:

\epsilon \sim \mathcal{N}(0,I)

The target is normalized and stable across data domains. If the model predicts $\epsilon$ , we can recover an estimate of $x_0$ :

\hat{x}_0 = \frac{x_t-\sqrt{1-\bar{\alpha}_t}\epsilon_\theta(x_t,t)} {\sqrt{\bar{\alpha}_t}}

Interview answer:

Predicting epsilon turns denoising into a supervised regression problem with a standardized target. Once I know the noise, I can algebraically estimate the clean sample.

There are other parameterizations, such as predicting $x_0$ or velocity $v$ , but epsilon prediction is the easiest to explain and implement.

5. Minimal PyTorch implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

def q_sample(x0, t, noise, alpha_bar):
    a = alpha_bar[t].view(-1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

class TinyDenoiser(nn.Module):
    def __init__(self, dim, steps=100, hidden=64):
        super().__init__()
        self.time = nn.Embedding(steps, hidden)
        self.net = nn.Sequential(
            nn.Linear(dim + hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, xt, t):
        return self.net(torch.cat([xt, self.time(t)], dim=-1))

steps = 100
betas = torch.linspace(1e-4, 0.02, steps)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

B, D = 32, 4
model = TinyDenoiser(D, steps)
x0 = torch.randn(B, D)
t = torch.randint(0, steps, (B,))
noise = torch.randn_like(x0)
xt = q_sample(x0, t, noise, alpha_bar)

pred_noise = model(xt, t)
loss = F.mse_loss(pred_noise, noise)
loss.backward()

This is the core training loop.

6. Autonomous driving and robotics interpretation

For driving:

x0 = future trajectory or future scene
c  = map + agent history + traffic lights + route

For robotics:

x0 = action sequence
c  = camera/state observation + task instruction

Diffusion learns a conditional distribution:

p_\theta(x_0|c)

The basic loss becomes:

\|\epsilon-\epsilon_\theta(x_t,t,c)\|_2^2

The conditioning $c$ is what makes generation useful rather than random.

7. Failure modes and debugging checklist

Timestep embedding is missing or broken.
Noise schedule is too aggressive.
Tensor broadcasting silently wrong.
Model predicts $x_0$ but loss compares to epsilon.
Sampling code does not match training parameterization.
Model ignores conditioning.
Training loss goes down but samples are bad.

Checklist:

Plot $x_t$ at low, medium, high timesteps.
Verify $x_t$ becomes noise as $t$ increases.
Overfit 32 examples.
Print tensor shapes.
Compare predicted noise scale to true noise scale.
Test one denoising step before full sampling.

8. Common interview questions and strong answers

Q: What is the forward process?
A: A fixed process that gradually adds Gaussian noise to clean data using a known noise schedule.

Q: What does the network learn?
A: It learns the reverse denoising process. In DDPM training, it often predicts the noise added to $x_0$ .

Q: Why is diffusion good for multi-modal prediction?
A: Generation starts from random noise, so the same conditioning input can produce multiple plausible samples.

Q: Why predict epsilon?
A: Epsilon is standardized Gaussian noise, which is a stable regression target, and predicting it lets us reconstruct the clean sample.

9. A 60-second explanation you can say out loud

A diffusion model learns to reverse a noising process. During training, I take clean data $x_0$ , choose a timestep $t$ , add Gaussian noise using $x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon$ , and train a network to predict the noise $\epsilon$ . At generation time, I start from Gaussian noise and repeatedly denoise. Predicting epsilon is common because the target is normalized and easy to regress. For driving simulation, this is useful because one scene can have many plausible futures, and different noise samples can generate different trajectories.

10. Practice exercises with answers

Exercise 1: What happens as $\bar{\alpha}_t \to 0$ ?
Answer: The clean signal disappears and $x_t$ becomes mostly Gaussian noise.

Exercise 2: Why does the model need timestep $t$ ?
Answer: Denoising a lightly corrupted sample is different from denoising almost pure noise.

Exercise 3: Write the epsilon training objective.
Answer: $\mathbb{E}\|\epsilon-\epsilon_\theta(x_t,t)\|_2^2$ .

Exercise 4: Why is diffusion slower than one-shot regression?
Answer: Sampling usually requires many denoising steps instead of one forward pass.