Diffusion Basics: A 1-Hour Interview Learning Session

Companion notebook: diffusion_basics_colab.ipynb

Diffusion models look intimidating because papers introduce many symbols. For an interview, you need a clean mental model:

Diffusion training teaches a neural network to remove known Gaussian noise. Generation starts from noise and repeatedly denoises until a sample appears.

In autonomous driving simulation, the sample may be a future trajectory. In robotics, it may be an action sequence. In vision, it may be an image. The core math is the same.

0-10 min     Why diffusion exists: sampling multi-modal distributions
10-20 min    Forward process: add Gaussian noise
20-35 min    Reverse process: learn to denoise
35-45 min    Why predict epsilon
45-55 min    Minimal PyTorch implementation
55-60 min    Interview answers and drills

By the end, you should be able to explain DDPM without paper-level detail, write the noising equation, implement the training loss, and explain why diffusion is useful for multi-modal future prediction.


Many ML models predict one answer. Driving futures are not like that. Given the same scene, a car may yield, merge, accelerate, or brake. A pedestrian may wait or cross.

Diffusion models are useful because they model a distribution, not just a point estimate. Different random seeds can generate different plausible futures.
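A toy illustration of why a point estimate falls short, assuming a hypothetical scene where the car ends up at lateral offset -1 (left) or +1 (right) with equal probability. The numbers and names are made up for this sketch. The constant that minimizes MSE is the mean, which sits between the two modes and matches neither.

```python
import random

random.seed(0)

# Hypothetical bimodal "future": lateral offset is -1.0 (left) or +1.0 (right).
futures = [random.choice([-1.0, 1.0]) for _ in range(10_000)]


def mse(pred, targets):
    """Mean squared error of a single constant prediction."""
    return sum((pred - y) ** 2 for y in targets) / len(targets)


# The constant prediction that minimizes MSE is the sample mean.
point_estimate = sum(futures) / len(futures)

# Distance from that prediction to the nearest real mode.
gap_to_nearest_mode = min(abs(point_estimate - m) for m in (-1.0, 1.0))

print(f"point estimate:      {point_estimate:+.3f}")
print(f"gap to nearest mode: {gap_to_nearest_mode:.3f}")
print(f"MSE at the mean:     {mse(point_estimate, futures):.3f}")
```

A sampler that returns -1 on some seeds and +1 on others describes this scene better than any single number can.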

You should care in interviews because diffusion tests three fundamentals:

  • Can you reason about probability distributions?
  • Can you explain a training objective from first principles?
  • Can you implement tensor code without getting lost in notation?

Let $x_0$ be clean data. For a trajectory, this might be:

$$x_0 \in \mathbb{R}^{T \times 2}$$

where $T$ is the number of future timesteps and each point is $(x,y)$.

The forward process gradually adds Gaussian noise:

$$q(x_t|x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}x_{t-1}, \beta_t I)$$

where:

  • $t$ is the diffusion step.
  • $\beta_t$ is the noise variance at step $t$.
  • $I$ is the identity matrix, so the covariance $\beta_t I$ adds independent noise to every dimension.

Define:

$$\alpha_t = 1-\beta_t$$

and:

$$\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$$
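As a concrete check on these two definitions, the sketch below computes $\alpha_t$ and $\bar{\alpha}_t$ in plain Python for a linear $\beta$ schedule. The endpoints 1e-4 and 0.02 match the PyTorch snippet later in this session. The helper names `linear_betas`, `cumulative_product`, and `alpha_bar_schedule` are illustrative, not from any library.

```python
def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Return `steps` noise variances spaced linearly from beta_start to beta_end."""
    if steps < 2:
        raise ValueError("need at least 2 steps")
    width = (beta_end - beta_start) / (steps - 1)
    return [beta_start + i * width for i in range(steps)]


def cumulative_product(values):
    """Running product: out[t] = values[0] * ... * values[t]."""
    out, running = [], 1.0
    for v in values:
        running *= v
        out.append(running)
    return out


def alpha_bar_schedule(steps):
    alphas = [1.0 - b for b in linear_betas(steps)]  # alpha_t = 1 - beta_t
    return cumulative_product(alphas)                # alpha_bar_t = prod_{s<=t} alpha_s


for steps in (100, 1000):
    alpha_bar = alpha_bar_schedule(steps)
    print(f"T={steps}: alpha_bar first={alpha_bar[0]:.4f}, last={alpha_bar[-1]:.6f}")
```

Every $\alpha_t$ is slightly below 1, so $\bar{\alpha}_t$ decreases monotonically. With this $\beta$ range, the final value stays well above zero at T=100 (roughly 0.36) and is close to zero at T=1000, so the endpoint depends on both the $\beta$ range and the number of steps.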

The useful shortcut is:

$$x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$$

where:

$$\epsilon \sim \mathcal{N}(0,I)$$

This says: the noisy sample is a scaled copy of the clean signal plus scaled Gaussian noise. As $t$ increases, $\bar{\alpha}_t$ shrinks, so the signal weight decreases and the noise weight increases.
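The shortcut follows from applying the one-step rule repeatedly: each step multiplies the existing signal by $\sqrt{\alpha_t}$ and adds fresh noise with variance $\beta_t$. A minimal numeric check, assuming a linear schedule for illustration, tracks the coefficient on $x_0$ and the accumulated noise variance step by step and compares both to the closed form. The variable names are illustrative.

```python
import math

# Assumed linear schedule, for illustration only.
steps = 100
betas = [1e-4 + i * (0.02 - 1e-4) / (steps - 1) for i in range(steps)]

signal_coef = 1.0  # coefficient multiplying x_0 inside x_t
noise_var = 0.0    # variance of the accumulated Gaussian noise inside x_t
alpha_bar = 1.0

for beta in betas:
    alpha = 1.0 - beta
    # One forward step: x_t = sqrt(alpha) * x_{t-1} + sqrt(beta) * fresh_noise
    signal_coef *= math.sqrt(alpha)
    noise_var = alpha * noise_var + beta  # independent Gaussians: variances add
    alpha_bar *= alpha
    # The closed form predicts sqrt(alpha_bar) and 1 - alpha_bar at every step.
    assert math.isclose(signal_coef, math.sqrt(alpha_bar), rel_tol=1e-9)
    assert math.isclose(noise_var, 1.0 - alpha_bar, rel_tol=1e-9, abs_tol=1e-12)

print(f"after {steps} steps: signal coef {signal_coef:.4f}, "
      f"noise std {math.sqrt(noise_var):.4f}")
```

Because the squared signal coefficient and the noise variance always sum to 1, data with unit variance keeps unit variance at every step.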

x0 clean trajectory
|
| add small noise
v
x1 slightly noisy
|
v
...
|
v
xT almost Gaussian noise

Generation runs backward:

random noise xT
|
| denoise
v
xT-1
|
v
...
|
v
x0 generated sample

The model learns:

$$p_\theta(x_{t-1}|x_t)$$

In practice, many DDPM-style models predict the noise that was added:

$$\epsilon_\theta(x_t,t)$$

Training objective:

$$\mathcal{L} = \mathbb{E}_{x_0,t,\epsilon} \left[ \|\epsilon-\epsilon_\theta(x_t,t)\|_2^2 \right]$$

This is just MSE between true noise and predicted noise.
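To connect the noise prediction back to generation, here is a sketch of one common reverse update, DDPM ancestral sampling with variance $\sigma_t^2 = \beta_t$. This session does not derive the update formula, so treat it as the standard recipe under that variance choice rather than the only option. `dummy_eps_model` is a hypothetical stand-in for a trained network and only shows the call signature.

```python
import torch


@torch.no_grad()
def ddpm_sample(eps_model, shape, betas):
    """Sketch of DDPM ancestral sampling with sigma_t^2 = beta_t (one common choice)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                    # x_T: start from Gaussian noise
    for step in reversed(range(len(betas))):  # T-1, ..., 0 (0-indexed steps)
        t = torch.full((shape[0],), step, dtype=torch.long)
        eps_hat = eps_model(x, t)             # predicted noise
        # Mean of p(x_{t-1} | x_t): remove the predicted noise, undo the sqrt(alpha_t) scaling.
        coef = betas[step] / (1.0 - alpha_bar[step]).sqrt()
        mean = (x - coef * eps_hat) / alphas[step].sqrt()
        if step > 0:
            x = mean + betas[step].sqrt() * torch.randn_like(x)  # add fresh noise
        else:
            x = mean                          # the final step adds no noise
    return x


def dummy_eps_model(xt, t):
    """Hypothetical placeholder: a real model would be a trained network."""
    return torch.zeros_like(xt)


betas = torch.linspace(1e-4, 0.02, 100)
samples = ddpm_sample(dummy_eps_model, (8, 4), betas)
print(samples.shape)
```

Each sample costs one network call per diffusion step, which is where the sampling cost of diffusion comes from.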


Predicting noise is convenient because noise has a simple distribution:

$$\epsilon \sim \mathcal{N}(0,I)$$

The target is normalized and stable across data domains. If the model predicts $\epsilon$, we can recover an estimate of $x_0$:

$$\hat{x}_0 = \frac{x_t-\sqrt{1-\bar{\alpha}_t}\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}$$
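A quick round-trip check of this formula, assuming a linear schedule and random tensors as stand-ins for data: noise a batch with the forward shortcut, then invert it using the true noise in place of the model output. The function name `predict_x0` is illustrative.

```python
import torch

torch.manual_seed(0)

betas = torch.linspace(1e-4, 0.02, 100)        # assumed linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)


def predict_x0(xt, t, eps_hat, alpha_bar):
    """Solve x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0, given a noise estimate."""
    ab = alpha_bar[t].view(-1, 1)
    return (xt - (1.0 - ab).sqrt() * eps_hat) / ab.sqrt()


x0 = torch.randn(5, 4)
t = torch.randint(0, 100, (5,))
eps = torch.randn_like(x0)
ab = alpha_bar[t].view(-1, 1)
xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps  # forward noising

# With the true noise, the recovery is exact up to floating-point error.
x0_hat = predict_x0(xt, t, eps, alpha_bar)
print(torch.allclose(x0_hat, x0, atol=1e-4))
```

With a learned $\epsilon_\theta$ the recovery is only approximate: an error in the noise estimate is scaled by $\sqrt{1-\bar{\alpha}_t}/\sqrt{\bar{\alpha}_t}$, so $\hat{x}_0$ is least reliable at large $t$.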

Interview answer:

Predicting epsilon turns denoising into a supervised regression problem with a standardized target. Once I know the noise, I can algebraically estimate the clean sample.

There are other parameterizations, such as predicting $x_0$ or velocity $v$, but epsilon prediction is the easiest to explain and implement.


import torch
import torch.nn as nn
import torch.nn.functional as F


def q_sample(x0, t, noise, alpha_bar):
    # Closed-form forward noising; assumes x0 has shape (B, D).
    a = alpha_bar[t].view(-1, 1)  # (B,) -> (B, 1) so it broadcasts over D
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise


class TinyDenoiser(nn.Module):
    def __init__(self, dim, steps=100, hidden=64):
        super().__init__()
        self.time = nn.Embedding(steps, hidden)  # learned timestep embedding
        self.net = nn.Sequential(
            nn.Linear(dim + hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, xt, t):
        # Predict the added noise from the noisy sample and its timestep.
        return self.net(torch.cat([xt, self.time(t)], dim=-1))


# Linear noise schedule
steps = 100
betas = torch.linspace(1e-4, 0.02, steps)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

# One training step on a random batch (stand-in for real data)
B, D = 32, 4
model = TinyDenoiser(D, steps)
x0 = torch.randn(B, D)
t = torch.randint(0, steps, (B,))  # one timestep per sample
noise = torch.randn_like(x0)
xt = q_sample(x0, t, noise, alpha_bar)
pred_noise = model(xt, t)
loss = F.mse_loss(pred_noise, noise)
loss.backward()

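This is the core of one training step. A full training loop repeats it over batches, with an optimizer step after each backward pass.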


6. Autonomous driving and robotics interpretation

For driving:

x0 = future trajectory or future scene
c = map + agent history + traffic lights + route

For robotics:

x0 = action sequence
c = camera/state observation + task instruction

Diffusion learns a conditional distribution:

$$p_\theta(x_0|c)$$

The basic loss becomes:

$$\|\epsilon-\epsilon_\theta(x_t,t,c)\|_2^2$$

The conditioning $c$ is what makes generation useful rather than random.
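A minimal sketch of the conditional loss, assuming the conditioning has already been encoded into a fixed-size vector. `TinyConditionalDenoiser` is a hypothetical name, and concatenating $c$ with the other inputs is only the simplest way to inject it. The random tensors are placeholders for real trajectories and scene encodings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyConditionalDenoiser(nn.Module):
    """Illustrative eps_theta(x_t, t, c): concatenates x_t, a timestep embedding, and c."""

    def __init__(self, dim, cond_dim, steps=100, hidden=64):
        super().__init__()
        self.time = nn.Embedding(steps, hidden)
        self.net = nn.Sequential(
            nn.Linear(dim + hidden + cond_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, xt, t, cond):
        return self.net(torch.cat([xt, self.time(t), cond], dim=-1))


steps, B, D, C = 100, 32, 4, 8
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, steps), dim=0)

model = TinyConditionalDenoiser(D, C, steps)
x0 = torch.randn(B, D)    # placeholder for flattened trajectories or actions
cond = torch.randn(B, C)  # placeholder for an encoded scene or observation
t = torch.randint(0, steps, (B,))
noise = torch.randn_like(x0)
ab = alpha_bar[t].view(-1, 1)
xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

# Same MSE objective as before; the only change is that the model also sees c.
loss = F.mse_loss(model(xt, t, cond), noise)
loss.backward()
```

The noising step never touches $c$. Only the denoiser receives it, so the same $c$ must also be passed at every step during sampling.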


Common failure modes:

  • Timestep embedding is missing or broken.
  • Noise schedule is too aggressive.
  • Tensor broadcasting is silently wrong.
  • Model predicts $x_0$ but the loss compares against epsilon.
  • Sampling code does not match the training parameterization.
  • Model ignores conditioning.
  • Training loss goes down but samples are bad.
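The broadcasting failure is easy to reproduce. The sketch below assumes trajectories of shape (B, T, 2) and a batch size that happens to equal the horizon. A per-sample scale reshaped to (B, 1) then broadcasts along the time axis instead of the batch axis, and PyTorch raises no error. `q_sample_nd` is an illustrative helper that reshapes the scale to (B, 1, ..., 1) for any input rank.

```python
import torch


def q_sample_nd(x0, t, noise, alpha_bar):
    """Forward noising for x0 of shape (B, ...): reshape alpha_bar[t] to (B, 1, ..., 1)."""
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise


torch.manual_seed(0)
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 100), dim=0)

B, T = 6, 6                                # batch size happens to equal the horizon
x0 = torch.randn(B, T, 2)                  # trajectories: (batch, timesteps, xy)
t = torch.tensor([0, 20, 40, 60, 80, 99])  # a different diffusion step per sample
noise = torch.randn_like(x0)

# Pitfall: a (B, 1) scale lines up against (B, T, 2) as (1, B, 1), so the
# per-sample scales are applied along the time axis. No error when B == T.
a_wrong = alpha_bar[t].view(-1, 1)
xt_wrong = a_wrong.sqrt() * x0 + (1.0 - a_wrong).sqrt() * noise

xt = q_sample_nd(x0, t, noise, alpha_bar)
print(xt.shape == xt_wrong.shape)    # shapes match
print(torch.allclose(xt, xt_wrong))  # values do not
```

When B and T differ, the (B, 1) version raises a shape error instead, which is the lucky case. Checking one sample by hand against the closed-form equation catches both.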

Checklist:

  • Plot $x_t$ at low, medium, and high timesteps.
  • Verify $x_t$ becomes noise as $t$ increases.
  • Overfit 32 examples.
  • Print tensor shapes.
  • Compare predicted noise scale to true noise scale.
  • Test one denoising step before full sampling.
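Two of these checks fit in one short script: overfit a fixed batch of 32 examples, then compare the loss against a reference point. Since $\mathbb{E}[\epsilon^2] = 1$, a model that always predicts zero scores an MSE near 1, so a trained model should land below that. This is a sketch under assumed settings (Adam, learning rate 1e-2, 500 steps, random data), reusing the small architecture from the implementation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class TinyDenoiser(nn.Module):
    def __init__(self, dim, steps=100, hidden=64):
        super().__init__()
        self.time = nn.Embedding(steps, hidden)
        self.net = nn.Sequential(
            nn.Linear(dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, xt, t):
        return self.net(torch.cat([xt, self.time(t)], dim=-1))


steps = 100
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, steps), dim=0)

x0 = torch.randn(32, 4)                              # 32 fixed examples to overfit
model = TinyDenoiser(dim=4, steps=steps)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)  # assumed optimizer settings

losses = []
for _ in range(500):
    t = torch.randint(0, steps, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1)
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    loss = F.mse_loss(model(xt, t), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# Reference point: predicting zero noise gives an MSE near 1.
baseline = F.mse_loss(torch.zeros(4096, 4), torch.randn(4096, 4)).item()
final = sum(losses[-50:]) / 50
print(f"zero-prediction baseline {baseline:.3f}, final training loss {final:.3f}")
```

If the final loss stays at the baseline, the model is not using its inputs, which usually points to the timestep embedding, the schedule, or a shape bug.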

8. Common interview questions and strong answers

Q: What is the forward process?
A: A fixed process that gradually adds Gaussian noise to clean data using a known noise schedule.

Q: What does the network learn?
A: It learns the reverse denoising process. In DDPM training, it often predicts the noise added to $x_0$.

Q: Why is diffusion good for multi-modal prediction?
A: Generation starts from random noise, so the same conditioning input can produce multiple plausible samples.

Q: Why predict epsilon?
A: Epsilon is standardized Gaussian noise, which is a stable regression target, and predicting it lets us reconstruct the clean sample.


9. A 60-second explanation you can say out loud

A diffusion model learns to reverse a noising process. During training, I take clean data $x_0$, choose a timestep $t$, add Gaussian noise using $x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon$, and train a network to predict the noise $\epsilon$. At generation time, I start from Gaussian noise and repeatedly denoise. Predicting epsilon is common because the target is normalized and easy to regress. For driving simulation, this is useful because one scene can have many plausible futures, and different noise samples can generate different trajectories.


Exercise 1: What happens as $\bar{\alpha}_t \to 0$?
Answer: The clean signal disappears and $x_t$ becomes mostly Gaussian noise.

Exercise 2: Why does the model need timestep $t$?
Answer: Denoising a lightly corrupted sample is different from denoising almost pure noise.

Exercise 3: Write the epsilon training objective.
Answer: $\mathbb{E}\|\epsilon-\epsilon_\theta(x_t,t)\|_2^2$.

Exercise 4: Why is diffusion slower than one-shot regression?
Answer: Sampling usually requires many denoising steps instead of one forward pass.