Debugging a Model That Is Not Learning: A 1-Hour Interview Session
Companion notebook: debugging_model_not_learning_colab.ipynb
When a model is not learning, the worst response is random architecture tweaking.
The right response:
Reduce the problem until it should work, then find which assumption is false.
For autonomous driving simulation, the bug is often not the neural network. It is data skew, label noise, coordinate frames, timestamp alignment, leakage, normalization, or a loss that rewards the wrong behavior.
0. One-hour plan
    0-10 min   The debugging mindset
    10-20 min  Overfit-one-batch test
    20-35 min  Data, labels, normalization, leakage
    35-45 min  LR, gradients, loss bugs
    45-55 min  Overfitting vs underfitting diagnosis
    55-60 min  Interview drills
1. The debugging mindset
Training minimizes:
$$\min_\theta \; \frac{1}{N} \sum_{i=1}^{N} L\big(f_\theta(x_i),\, y_i\big)$$
If learning fails, inspect each piece:
    x_i        inputs
    y_i        labels
    f_theta    model
    L          loss
    optimizer  update
    eval       metric
Do not start with “make the model bigger.” Start with:
Can this model overfit one small batch?
If not, there is a bug or mismatch.
2. Overfit-one-batch test
Take 8-32 examples. Train only on them. The model should drive training loss very low.
If it cannot:
- labels may be wrong,
- target shape may be wrong,
- loss may be wrong,
- learning rate may be bad,
- gradients may be missing,
- model may be in eval mode,
- inputs may be normalized incorrectly.
This is the most useful interview debugging move.
    Full training fails
            |
            v
    Can overfit one batch?
            |
            +-- no  -> bug in local training setup
            |
            +-- yes -> data scale, generalization, imbalance, eval mismatch
3. Data and label checks
For autonomous driving:
- Are positions in meters or centimeters?
- Are headings radians or degrees?
- Are trajectories in ego frame or world frame?
- Are labels aligned with timestamps?
- Are map features from the correct time/version?
- Are traffic lights current or future?
- Are rare labels reliable?
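Two of the checks above, heading units and coordinate frames, are cheap to automate. The sketch below is a minimal illustration in plain Python. The function names `looks_like_degrees` and `world_to_ego` are hypothetical, and the convention used (ego x forward, y left, yaw in radians measured counter-clockwise from the world x-axis) is an assumption for this example, not something any particular dataset guarantees.

```python
import math

def looks_like_degrees(headings):
    """Heuristic: headings stored in radians should stay within [-2*pi, 2*pi]."""
    return any(abs(h) > 2 * math.pi for h in headings)

def world_to_ego(px, py, ego_x, ego_y, ego_yaw):
    """Express a world-frame point in the ego frame (x forward, y left).

    Assumes ego_yaw is in radians, counter-clockwise from the world x-axis.
    """
    dx, dy = px - ego_x, py - ego_y
    cos_y, sin_y = math.cos(ego_yaw), math.sin(ego_yaw)
    # Rotate the offset by -ego_yaw to undo the ego's heading.
    return (cos_y * dx + sin_y * dy, -sin_y * dx + cos_y * dy)

# Ego at (10, 5) facing +y in the world; the point (10, 7) is 2 m straight ahead.
x_ego, y_ego = world_to_ego(10.0, 7.0, 10.0, 5.0, math.pi / 2)
```

The degrees check is only a heuristic: a degree-valued heading that happens to be small will pass, so it is worth pairing with a plot of a few trajectories.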
Bad label examples:
- future trajectory shifted by one timestep,
- actor IDs swapped,
- traffic light state from the future,
- map lane ID mismatched,
- cut-in label based on a heuristic with many false positives.
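The first item in this list, a future trajectory shifted by one timestep, can often be caught with a timestamp check alone. The sketch below assumes a fixed label spacing `dt` and that the first future label sits exactly one step after the current time; both are assumptions for illustration, and `misaligned_steps` is a hypothetical helper name.

```python
def misaligned_steps(t_now, label_times, dt, tol=1e-3):
    """Return indices k where label_times[k] is not t_now + (k + 1) * dt."""
    bad = []
    for k, t in enumerate(label_times):
        expected = t_now + (k + 1) * dt
        if abs(t - expected) > tol:
            bad.append(k)
    return bad

# Correctly aligned labels: first future point is one step after t_now.
aligned = misaligned_steps(12.0, [12.1, 12.2, 12.3], dt=0.1)

# Labels shifted by one timestep: every index is flagged.
shifted = misaligned_steps(12.0, [12.2, 12.3, 12.4], dt=0.1)
```

Running a check like this over a sample of scenes before training is much cheaper than discovering the shift through a loss that refuses to go down.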
Data skew:
    99% lane following
    1% rare interaction
The model may learn normal driving well and fail the interview-relevant rare cases.
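A quick way to quantify this skew is to count labels and derive inverse-frequency weights, which can then feed a weighted loss or a sampler. This is a minimal sketch; the label names and the particular weighting formula (`n / (k * count)`) are assumptions chosen for illustration, not a recommendation from the original notes.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so a balanced dataset gets weight 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical label distribution matching the 99% / 1% split above.
labels = ["lane_follow"] * 99 + ["rare_interaction"] * 1
weights = inverse_frequency_weights(labels)
# The rare class gets weight 50.0, the common class roughly 0.505.
```

Large weights on a rare class also amplify any label noise in that class, so it helps to check rare-label quality before reweighting.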
4. Normalization problems
Normalization bugs are common in robotics and driving.
Check:
- feature means and standard deviations,
- train vs eval normalization,
- units,
- clipping,
- missing-value encoding,
- coordinate frame transforms.
Example:
    training uses ego-frame positions
    inference uses world-frame positions
The model may appear not to learn because inputs are inconsistent.
Print:
    x.mean(dim=0), x.std(dim=0), x.min(), x.max()
For trajectories, plot them. Visual inspection catches bugs metrics hide.
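The same statistics can be compared between the training and evaluation pipelines to catch a frame or unit mismatch automatically. The sketch below uses only the standard library. The helper names, the toy data, and the threshold of 3 training standard deviations are all assumptions for illustration.

```python
import statistics

def column_stats(rows):
    """Per-feature (mean, population std) for a list of equal-length rows."""
    cols = list(zip(*rows))
    return [(statistics.fmean(c), statistics.pstdev(c)) for c in cols]

def shifted_features(train_rows, eval_rows, max_z=3.0):
    """Indices of features whose eval mean is far from the train mean."""
    flagged = []
    stats = zip(column_stats(train_rows), column_stats(eval_rows))
    for j, ((mu_t, sd_t), (mu_e, _)) in enumerate(stats):
        z = abs(mu_e - mu_t) / max(sd_t, 1e-8)  # distance in train std units
        if z > max_z:
            flagged.append(j)
    return flagged

# Feature 0: ego-frame x in training, world-frame x at eval. Feature 1: unchanged.
train = [[1.0, 0.5], [-1.0, 0.4], [2.0, 0.6], [-2.0, 0.5]]
evald = [[1001.0, 0.5], [999.0, 0.45], [1002.0, 0.55], [998.0, 0.5]]
flagged = shifted_features(train, evald)
```

Here only feature 0 is flagged, which points directly at the position input rather than at the model.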
5. Learning rate and gradients
Optimizer update:
$$\theta \leftarrow \theta - \eta \, \nabla_\theta L$$
where $\eta$ is the learning rate.
Symptoms:
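    LR too high:          loss explodes or oscillates
    LR too low:           loss barely changes
    zero gradients:       detach, no requires_grad, wrong branch, saturated activation
    exploding gradients:  huge norms, NaN, unstable loss
Track gradient norm:
$$\lVert g \rVert_2 = \sqrt{\sum_{p} \lVert \nabla_{\theta_p} L \rVert_2^2}$$
where the sum runs over the model's parameter tensors $\theta_p$.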
If gradient norm is zero, inspect graph wiring. If huge, lower LR, clip gradients, or inspect loss scale.
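The learning-rate symptoms are easy to reproduce on a toy problem, which makes them easier to recognize in a real run. The sketch below runs plain gradient descent on `f(w) = w**2` with three learning rates. The function, the step count, and the specific rates are assumptions picked to make each regime visible.

```python
def final_loss(lr, steps=50, w0=1.0):
    """Run gradient descent on f(w) = w**2 and return the final loss."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # plain gradient-descent update
    return w * w

# Too low, reasonable, and too high learning rates on the same problem.
losses = {lr: final_loss(lr) for lr in (1e-4, 1e-1, 1.1)}
# lr=1e-4: loss barely moves from 1.0
# lr=1e-1: loss shrinks toward 0
# lr=1.1:  each step overshoots and |w| grows, so the loss explodes
```

A small sweep like this over a real model, spaced by factors of about 3 to 10, usually shows the same three regimes within a few hundred steps.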
6. Loss function problems
The model may be learning exactly what the loss asks for, but the loss asks for the wrong thing.
Examples:
- MSE on multi-modal futures creates average trajectories.
- Unweighted CE ignores rare classes.
- Loss ignores collision/offroad.
- Regression loss scale dominates classification loss.
- Mode probability loss too weak.
- Labels are continuous but treated as class IDs.
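The first item, MSE averaging over multi-modal futures, can be shown with a one-dimensional toy case. In the sketch below the targets are hypothetical lateral offsets where half the agents go left and half go right. The `min_over_modes` helper is an illustrative winner-takes-all loss, named here as one possible alternative rather than the method the notes prescribe.

```python
def mse(pred, targets):
    """Mean squared error of a single constant prediction."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

def min_over_modes(modes, targets):
    """Winner-takes-all loss: each target is scored against its closest mode."""
    return sum(min((m - t) ** 2 for m in modes) for t in targets) / len(targets)

# Hypothetical lateral offsets (m): half the agents go left, half go right.
targets = [-1.0] * 50 + [1.0] * 50

best_single = sum(targets) / len(targets)  # the MSE-optimal constant is the mean
single_loss = mse(best_single, targets)
two_mode_loss = min_over_modes([-1.0, 1.0], targets)
```

The MSE-optimal single prediction is 0.0, a path straight down the middle that no agent actually took, while two modes fit the data exactly.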
For simulation, compare training loss to actual metrics:
    loss down, ADE down, collision up  -> loss is missing a safety/validity term
    loss down, rare recall flat        -> imbalance issue
    loss down, diversity down          -> mode collapse
7. Data leakage
Data leakage means the model sees information during training that will not be available at inference.
Driving examples:
- future positions as input,
- future traffic light state,
- labels encoded in scenario metadata,
- same scene in train and eval split,
- route derived from future behavior,
- map annotations unavailable at runtime.
Leakage often causes suspiciously good validation performance and bad real-world performance.
Ask:
At inference time, would this feature be known?
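Two of these leaks can be guarded against mechanically: the same scene landing in both splits, and input features stamped later than the prediction time. The sketch below is a minimal illustration. The helper names, the scene IDs, the feature names, and the hash-based bucketing are all assumptions for this example.

```python
import hashlib

def split_for_scene(scene_id, eval_fraction=0.2):
    """Deterministically assign a whole scene to 'train' or 'eval'."""
    digest = hashlib.md5(scene_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000 / 1000.0  # stable value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"

def future_features(feature_times, t_now):
    """Names of input features whose timestamp is later than the prediction time."""
    return sorted(name for name, t in feature_times.items() if t > t_now)

# Every frame of a scene goes to the same split because the key is the scene ID.
frames = [("scene_a", 0), ("scene_a", 1), ("scene_b", 0), ("scene_b", 1)]
assignment = {(s, i): split_for_scene(s) for s, i in frames}

# Hypothetical feature timestamps checked against a prediction time of 10.0 s.
leaks = future_features(
    {"ego_history": 9.9, "traffic_light_state": 10.5, "lane_graph": 10.0},
    t_now=10.0,
)
```

Splitting on the scene ID rather than the frame index is the design choice that matters here: a random per-frame split would put near-duplicate frames on both sides.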
8. Overfitting vs underfitting
    Train loss low, eval loss high:    overfitting, leakage, distribution shift, weak regularization
    Train loss high, eval loss high:   underfitting, optimization bug, bad data, weak model, wrong loss
    Train loss low, eval metric bad:   loss-metric mismatch
For driving, also slice eval:
- geography,
- weather,
- agent type,
- maneuver,
- speed,
- rare scenario,
- long-tail interactions.
Aggregate metrics hide failures.
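A small example shows how an acceptable-looking average can hide a failing slice. The sketch below groups per-example errors by a slice key. The record layout, the `maneuver` field, and the error values are hypothetical.

```python
from collections import defaultdict

def mean_error_by_slice(records, key):
    """Average the 'error' field separately for each value of records[key]."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        sums[r[key]] += r["error"]
        counts[r[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}

# Nine easy lane-following examples and one badly predicted cut-in.
records = (
    [{"maneuver": "lane_follow", "error": 0.5} for _ in range(9)]
    + [{"maneuver": "cut_in", "error": 5.0}]
)
overall = sum(r["error"] for r in records) / len(records)
per_slice = mean_error_by_slice(records, "maneuver")
```

The overall mean is 0.95, but the cut-in slice sits at 5.0, ten times worse than lane following.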
9. Minimal PyTorch debugging snippet
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def grad_norm(model):
        """Global L2 norm over all parameter gradients."""
        # Start from a tensor so .sqrt() works even if no parameter has a gradient.
        total = torch.zeros(())
        for p in model.parameters():
            if p.grad is not None:
                total = total + p.grad.detach().pow(2).sum()
        return total.sqrt()

    class TinyMLP(nn.Module):
        def __init__(self, d, c):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d, 64),
                nn.ReLU(),
                nn.Linear(64, c),
            )

        def forward(self, x):
            return self.net(x)

    # One fixed batch of random inputs and labels: the model only has to memorize it.
    B, D, C = 16, 8, 3
    x = torch.randn(B, D)
    y = torch.randint(0, C, (B,))

    model = TinyMLP(D, C)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    for step in range(200):
        opt.zero_grad()
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        loss.backward()
        g = grad_norm(model)  # measure after backward, before the update
        opt.step()
        if step % 50 == 0 or step == 199:
            print(f"step {step:3d}  loss {loss.item():.4f}  grad_norm {g.item():.4f}")
Expected: one batch should overfit. If not, debug locally before training at scale.
10. Common interview questions and strong answers
Q: First thing you do when a model is not learning?
A: Overfit one small batch. If that fails, the issue is local: data, labels, loss, optimizer, gradients, or model wiring.
Q: How do you detect learning-rate issues?
A: Too high causes divergence or oscillation. Too low causes very slow loss movement. I would run a small LR sweep and inspect gradient norms.
Q: What AD-specific bugs do you check?
A: Coordinate frames, timestamp alignment, actor IDs, traffic light timing, map validity, future leakage, rare-label quality.
Q: How do you distinguish overfitting and underfitting?
A: Compare train and eval loss. Low train/high eval suggests overfitting or shift. High train/high eval suggests underfitting or optimization/data problems.
11. A 60-second explanation you can say out loud
If a model is not learning, I start by trying to overfit one small batch. If that fails, I look for local bugs: bad labels, wrong target shape, wrong loss, learning rate, missing gradients, eval mode, or normalization. If it can overfit one batch but fails broadly, I inspect data skew, class imbalance, train/eval shift, leakage, and whether the loss matches the metric. For autonomous driving, I specifically check coordinate frames, timestamp alignment, traffic light timing, map features, actor IDs, and rare-event label quality.
12. Practice exercises with answers
Exercise 1: Model cannot overfit one batch. Name four likely causes.
Answer: Any four of: bad labels, wrong loss, target shape bug, LR issue, missing gradients, eval mode, normalization bug.
Exercise 2: Train loss low, eval collision high. What might be wrong?
Answer: Loss does not penalize collisions, eval distribution differs, or model overfits common behavior.
Exercise 3: Give one driving-specific data leakage example.
Answer: Future traffic light state or future agent trajectory included as input.
Exercise 4: Loss decreases but rare cut-in recall stays bad. What do you check?
Answer: Class imbalance, label quality, per-class loss, sampling strategy, weighted/focal loss, and thresholding.
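For the focal loss mentioned in this answer, a per-example sketch shows why it helps with rare classes: confidently correct examples are down-weighted far more than hard ones. The code below uses the standard form `-alpha * (1 - p)**gamma * log(p)`, where `p` is the probability assigned to the true class. The function name, defaults, and example probabilities are assumptions for illustration.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example, given the probability of the true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# gamma=0 reduces to ordinary cross-entropy for that example.
easy_ce, easy_focal = focal_loss(0.9, gamma=0.0), focal_loss(0.9, gamma=2.0)
hard_ce, hard_focal = focal_loss(0.1, gamma=0.0), focal_loss(0.1, gamma=2.0)
# The easy example keeps about 1% of its cross-entropy loss,
# while the hard example keeps about 81%.
```

With abundant easy lane-following examples and few hard cut-ins, this shifts the gradient budget toward the rare class without changing the sampling.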