Optimizers
Optimizers: ML Coding Interview Primer
Section titled “Optimizers: ML Coding Interview Primer”The one-sentence version
Section titled “The one-sentence version”An optimizer uses the gradient to update weights. All optimizers do this — they differ in how they use the gradient: just the current one, a smoothed history of it, or a per-parameter scaled version.
Quick reference
Section titled “Quick reference”| Optimizer | Core idea | Default hyperparams | Reach for it when… |
|---|---|---|---|
| SGD | move opposite the gradient | lr=0.01 | baselines, simple problems |
| SGD + Momentum | accumulate velocity across steps | lr=0.01, β=0.9 | oscillating loss, CNN training |
| Adam | per-parameter adaptive LR | lr=0.001, β1=0.9, β2=0.999 | most deep learning; sparse features |
| AdamW | Adam + decoupled weight decay | same + λ=0.01 | transformer fine-tuning |
| Schedule | Core idea | Reach for it when… |
|---|---|---|
| Constant | no decay | quick experiments, small models |
| Step decay | halve LR every N epochs | CNNs, supervised learning |
| Cosine decay | smooth curve from max to min LR | most modern training |
| Warmup + cosine | ramp up, then decay | large models, fine-tuning |
How gradient descent works (plain English)
Section titled “How gradient descent works (plain English)”Imagine the loss as a hilly landscape. Your model’s parameters define a position in that landscape. Training is about finding the lowest valley.
The gradient tells you the slope at your current position: which direction is uphill and how steep it is. Moving opposite the gradient moves downhill. Repeat this enough times and you reach a valley.
The learning rate controls step size. Too large and you overshoot the valley. Too small and you barely move.
new_weight = old_weight − learning_rate × gradientThat one line is the core of all gradient descent. Every optimizer in this article is a variation on it.
The three gradient descent paradigms
Section titled “The three gradient descent paradigms”Before optimizers: what gradient are you computing?
| Paradigm | Gradient computed over | Updates per epoch | Noise |
|---|---|---|---|
| Batch GD | all N examples | 1 | none (exact) |
| Mini-batch GD | random subset of 32–512 | N / batch_size | moderate |
| Stochastic GD | 1 example | N | high |
In practice, “SGD” almost always means mini-batch GD. True single-example GD is rarely used.
import numpy as np
# All three on the same data — just the sampling differsn = 1000X, y = np.random.randn(n, 2), np.random.randn(n)W = np.zeros(2)lr = 0.01
def grad(Xb, yb, W): err = Xb @ W - yb return 2 * Xb.T @ err / len(yb)
# Batch GD: one gradient from all 1000 examples, one update per epochW -= lr * grad(X, y, W)
# Mini-batch GD: gradient from 64 examples, ~16 updates per epochfor i in range(0, n, 64): W -= lr * grad(X[i:i+64], y[i:i+64], W)
# Stochastic GD: gradient from 1 example, 1000 updates per epoch (very noisy)for xi, yi in zip(X, y): W -= lr * grad(xi.reshape(1,-1), np.array([yi]), W)Why mini-batch wins: batch GD on 10M examples means 10M gradient computations before one weight update. Single-example GD is so noisy the loss oscillates rather than converging. Mini-batch balances both — enough examples for a stable gradient, small enough to update frequently and fit in GPU memory.
Interview answer: “SGD in practice means mini-batch gradient descent. Batch size is a hyperparameter — larger batches give lower-variance gradients but need more memory and often generalize slightly worse because they see fewer update steps per epoch.”
Problem 1 — Train a model from scratch → Vanilla SGD
Section titled “Problem 1 — Train a model from scratch → Vanilla SGD”Problem: predict delivery time from distance (km).
x = [2, 5, 10, 3, 8 ] # distance (km)y = [15, 30, 55, 18, 48] # delivery time (min)Model: ŷ = w * x + b. Two parameters: w (slope) and b (intercept).
Computing gradients — once, by hand
Section titled “Computing gradients — once, by hand”Loss = MSE = mean((ŷ − y)²) = mean((wx + b − y)²)
∂Loss/∂w = mean(2 × (ŷ − y) × x)∂Loss/∂b = mean(2 × (ŷ − y))Plain English: the gradient is “prediction error, weighted by how much each parameter contributed.” If x is large, w contributed a lot to the error, so the gradient for w is large.
SGD in NumPy
Section titled “SGD in NumPy”import numpy as np
X = np.array([2.0, 5.0, 10.0, 3.0, 8.0])y = np.array([15.0, 30.0, 55.0, 18.0, 48.0])
w, b = 0.0, 0.0lr = 0.001
for step in range(500): y_pred = w * X + b error = y_pred - y # prediction error per example
d_w = (2 * error * X).mean() # gradient for w d_b = (2 * error).mean() # gradient for b
w -= lr * d_w # move opposite the gradient b -= lr * d_b
if step % 100 == 0: loss = (error**2).mean() print(f"step {step:3d} loss={loss:.2f} w={w:.3f} b={b:.3f}")step 0 loss=1025.20 w=0.268 b=0.048step 100 loss= 18.43 w=4.891 b=3.017step 200 loss= 5.54 w=5.231 b=1.862step 400 loss= 4.72 w=5.333 b=1.384 ← converged near y ≈ 5.3x + 1.4Learning rate sensitivity
Section titled “Learning rate sensitivity”| Learning rate | Result |
|---|---|
| 10.0 | loss explodes to NaN in 10 steps |
| 0.1 | converges in ~50 steps |
| 0.001 | converges in ~400 steps |
| 0.00001 | barely moves after 500 steps |
Interview answer: “SGD is the baseline. It moves each parameter opposite its gradient. The learning rate controls step size — the most important hyperparameter. I’d start with a value from literature for the task type (0.1 for SGD, 0.001 for Adam), run a short sweep if convergence is poor, and treat divergence (rising loss) as the signal that LR is too high.”
Regularization here — L2: after training, if the model overfits to five routes and fails on new ones, add weight decay. SGD with L2:
d_w = (2 * error * X).mean() + 2 * lambda_ * w # L2 gradient appendedAll features (distance, traffic, restaurant load) should stay active, so L2 (shrink all, zero none) is the right choice over L1.
Problem 2 — Loss oscillates and won’t converge → Momentum
Section titled “Problem 2 — Loss oscillates and won’t converge → Momentum”Why SGD oscillates in narrow valleys
Section titled “Why SGD oscillates in narrow valleys”Suppose the loss landscape is steep in one direction and flat in another — a narrow ravine. The gradient in the steep direction is large; in the flat direction, small.
With a fixed learning rate:
- In the steep direction: SGD takes large steps, overshoots, and oscillates back and forth.
- In the flat direction: SGD takes tiny steps, moving along the valley slowly.
The same LR can’t fix both at once.
Two parameters at different scales: w1 gradient ≈ 0.01 (flat direction — needs big steps) w2 gradient ≈ 5.0 (steep direction — needs small steps)
With lr=0.1: w1 update: 0.1 × 0.01 = 0.001 → barely moves w2 update: 0.1 × 5.0 = 0.5 → oscillates wildlyMomentum: build up speed in consistent directions
Section titled “Momentum: build up speed in consistent directions”Instead of using only the current gradient, keep a running weighted average — a velocity vector:
velocity = β × velocity + (1 − β) × gradientweight = weight − lr × velocity- In the steep direction: consecutive gradients alternate sign (oscillating). They partially cancel in the velocity. The effective step shrinks.
- In the flat direction: consecutive gradients point the same way. They accumulate in the velocity. The effective step grows.
Momentum damps oscillations and accelerates progress simultaneously. β=0.9 means “90% last step’s velocity, 10% current gradient.”
Momentum in NumPy
Section titled “Momentum in NumPy”class Momentum: def __init__(self, lr=0.01, beta=0.9): self.lr = lr self.beta = beta self.v = {} # velocity per parameter index
def step(self, params, grads): for i, (p, g) in enumerate(zip(params, grads)): if i not in self.v: self.v[i] = np.zeros_like(p) # 90% of last step + 10% of current gradient self.v[i] = self.beta * self.v[i] + (1 - self.beta) * g p -= self.lr * self.v[i]Oscillation comparison
Section titled “Oscillation comparison”Step SGD (w2) Momentum (w2) 0 5.0 → update −0.50 5.0 → update −0.050 ← gentler start 1 −4.8 → update +0.48 −0.45 → update +0.004 ← velocity partially cancels 2 4.6 → update −0.46 0.41 → update −0.042 ← oscillation dampens 10 still oscillating near convergedInterview answer: “Momentum helps when the loss landscape has directions of very different curvature. It accumulates velocity in directions the gradient consistently points, and cancels velocity where gradients oscillate. β=0.9 is almost always the right setting — the main lever is the learning rate, not β.”
Problem 3 — Sparse features or fine-tuning a large model → Adam
Section titled “Problem 3 — Sparse features or fine-tuning a large model → Adam”Why a single global learning rate fails for sparse features
Section titled “Why a single global learning rate fails for sparse features”Suppose you have a feature “user_has_coupon” that’s 1 for 2% of examples. The gradient for its weight is nonzero only on those 2% of batches. When it is nonzero, the update must be large. But if you set lr high enough for sparse features, dense features (nonzero gradient every batch) overshoot.
Adam fixes this with per-parameter adaptive learning rates.
How Adam works
Section titled “How Adam works”Adam tracks two statistics per parameter across all training steps:
- m (first moment): running average of the gradient — “what direction has this parameter been moving?”
- v (second moment): running average of the squared gradient — “how large have the gradients been?”
The update divides by sqrt(v):
m = β1 × m + (1 − β1) × g # smoothed directionv = β2 × v + (1 − β2) × g² # smoothed squared magnitude
weight = weight − lr × m / (sqrt(v) + ε)- Sparse parameter (“has_coupon”): v is small (rare gradients) →
1/sqrt(v)is large → bigger effective step. - Dense parameter (e.g., a bias updated every batch): v is large → smaller effective step.
Each parameter automatically gets the learning rate it needs.
Bias correction: why Adam is slow to start without it
Section titled “Bias correction: why Adam is slow to start without it”At step 1, m and v are initialized to zero. After one gradient:
m = 0.9 × 0 + 0.1 × g = 0.1 × gm is 10× smaller than the actual gradient — the model barely moves early on. Adam corrects this:
m_hat = m / (1 − β1^t) # at t=1: divides by 0.1, giving 10× amplificationv_hat = v / (1 − β2^t) # restores true scale in early stepsThe correction vanishes as t grows (1 − 0.9^100 ≈ 1). It only matters in the first ~10 steps, but without it, Adam is sluggish at initialization.
Adam in NumPy
Section titled “Adam in NumPy”class Adam: def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8): self.lr = lr self.beta1 = beta1 self.beta2 = beta2 self.eps = eps self.m = {} # first moment per parameter self.v = {} # second moment per parameter self.t = 0 # step counter for bias correction
def step(self, params, grads): self.t += 1 for i, (p, g) in enumerate(zip(params, grads)): if i not in self.m: self.m[i] = np.zeros_like(p) self.v[i] = np.zeros_like(p)
# Update running moments self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2
# Bias-corrected estimates m_hat = self.m[i] / (1 - self.beta1**self.t) v_hat = self.v[i] / (1 - self.beta2**self.t)
# Adaptive step: scale by 1/sqrt(v) per parameter p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)eps=1e-8 prevents division by zero when v is near zero (sparse parameters early in training).
Adam vs SGD: when each wins
Section titled “Adam vs SGD: when each wins”| Scenario | Better choice | Reason |
|---|---|---|
| Deep networks, NLP, sparse inputs | Adam | Adapts LR per parameter automatically |
| CNNs, image classification | SGD + Momentum | Often generalizes better with tuned LR schedule |
| Fine-tuning a pretrained transformer | AdamW | Stable convergence, correct weight decay |
| Quick experiment, unknown problem | Adam | Works without careful tuning |
AdamW: Adam with regularization done correctly
Section titled “AdamW: Adam with regularization done correctly”Standard Adam applies L2 regularization by adding it to the gradient:
g_with_l2 = g + lambda_ * w # L2 term added to gradientBut Adam then scales this by 1/sqrt(v). Weights with large gradients get less regularization than those with small gradients — the opposite of what L2 intends.
AdamW decouples weight decay from the adaptive scaling:
# Adam step (adaptive gradient)p -= lr * m_hat / (np.sqrt(v_hat) + eps)
# Weight decay applied directly, NOT scaled by 1/sqrt(v)p -= lr * lambda_ * pRegularization here: for transformer fine-tuning, always use AdamW. The correct weight decay keeps all weights small (L2 behavior) without distorting the per-parameter adaptation. Use L1 only if you need sparse weights — and L1 via AdamW requires manual implementation since frameworks don’t include it.
Interview answer: “Adam adapts the learning rate per parameter based on gradient history. Sparse parameters get larger effective LR; dense ones get smaller. β1=0.9 and β2=0.999 are nearly always left at defaults. ε prevents division by zero for sparse parameters. I’d use AdamW over Adam whenever there’s L2 regularization — standard Adam’s weight decay interacts badly with the adaptive scaling.”
Problem 4 — Training diverges early or plateaus late → Learning rate schedules
Section titled “Problem 4 — Training diverges early or plateaus late → Learning rate schedules”Why a fixed learning rate fails
Section titled “Why a fixed learning rate fails”- Too high at the start: random initialization means large gradients. A high LR overshoots every update and loss explodes.
- Too high at the end: once the model is close to a solution, a large LR keeps jumping over the minimum. The model oscillates around it instead of settling in.
The fix: start with a different LR than you finish with.
Schedule 1: Warmup + cosine decay (modern standard)
Section titled “Schedule 1: Warmup + cosine decay (modern standard)”Ramp LR up linearly for the first few percent of training, then decay smoothly:
import numpy as np
def lr_schedule(step, warmup_steps, total_steps, max_lr, min_lr=0.0): if step < warmup_steps: # Linear warmup: 0 → max_lr return max_lr * step / warmup_steps else: # Cosine decay: max_lr → min_lr progress = (step - warmup_steps) / (total_steps - warmup_steps) return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * progress))
# 10k warmup, 100k total steps, peak lr = 3e-4for step in [0, 5_000, 10_000, 55_000, 100_000]: lr = lr_schedule(step, 10_000, 100_000, max_lr=3e-4) print(f"step {step:6d} lr={lr:.6f}")step 0 lr=0.000000 ← just startedstep 5000 lr=0.000150 ← halfway through warmupstep 10000 lr=0.000300 ← peakstep 55000 lr=0.000150 ← halfway through cosine decaystep 100000 lr=0.000000 ← end of trainingWhy large models need warmup
Section titled “Why large models need warmup”At initialization, a pretrained model’s weights encode useful structure. A high LR immediately disrupts that — the model “forgets” pretraining in the first few batches.
Warmup also gives Adam’s m and v time to build up reliable estimates before taking large steps. In the first few steps, Adam’s bias-corrected moments are based on almost no history — unreliable. Large steps on unreliable statistics → divergence.
Rule of thumb: warmup_steps ≈ 5–10% of total_steps for transformer fine-tuning.
Schedule 2: Step decay
Section titled “Schedule 2: Step decay”Halve the LR every N epochs. Simple, still standard for CNNs:
def step_decay(initial_lr, epoch, drop_factor=0.5, drop_every=10): return initial_lr * (drop_factor ** (epoch // drop_every))
# epoch 0: lr = 0.1# epoch 10: lr = 0.05# epoch 20: lr = 0.025Applying a schedule in a training loop
Section titled “Applying a schedule in a training loop”optimizer = Adam(lr=0.001)total_steps = 1000
for step in range(total_steps): # Update LR before each step optimizer.lr = lr_schedule(step, warmup_steps=50, total_steps=total_steps, max_lr=0.001) # ... rest of training loopInterview answer: “I’d use warmup + cosine decay for any large model. Warmup prevents early instability and gives Adam time to calibrate its moment estimates. Cosine decay makes the final approach to the minimum smoother than a step drop. For smaller CNNs, step decay is practical and easier to reason about.”
Problem 5 — Loss spikes suddenly mid-training → Gradient clipping
Section titled “Problem 5 — Loss spikes suddenly mid-training → Gradient clipping”When gradients explode
Section titled “When gradients explode”Transformers and RNNs processing long sequences can produce enormous gradients from a single unusual batch. One bad step → catastrophic weight update → loss jumps from 0.3 to 1000.
Clip the total norm, not individual gradients
Section titled “Clip the total norm, not individual gradients”Clipping each gradient independently distorts the update direction. The right approach clips the total norm of all gradients together — preserving direction, capping only magnitude:
def clip_grad_norm(grads, max_norm=1.0): # Global norm across all parameters total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
if total_norm > max_norm: scale = max_norm / (total_norm + 1e-6) grads = [g * scale for g in grads]
return grads, total_norm # return norm for monitoringIn PyTorch (call this after .backward(), before .step()):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)What max_norm to use
Section titled “What max_norm to use”Start with 1.0. Log total_norm during training. If clipping activates on nearly every batch, the gradient norm distribution is pathologically wide — clipping is hiding a real problem (bad loss function, bad architecture, or LR too high). Clipping should handle occasional spikes, not be permanently active.
Interview answer: “Gradient clipping prevents one bad batch from destroying training. I always clip by global norm, not per-parameter, so the update direction is preserved. max_norm=1.0 is the standard starting point. If clipping fires every step, I investigate the root cause rather than just tightening the clip threshold.”
Putting it together: a complete training loop
Section titled “Putting it together: a complete training loop”import numpy as np
def sigmoid(x): return np.where(x >= 0, 1/(1+np.exp(-x)), np.exp(x)/(1+np.exp(x)))
# Binary classification: 2 features, 500 examplesnp.random.seed(0)n = 500X = np.random.randn(n, 2)y = (X[:, 0] - X[:, 1] > 0).astype(float)
W = np.zeros(2)b = np.array([0.0])
opt = Adam(lr=0.01)lambda_ = 0.01 # L2 weight decay (AdamW-style)total_steps = 1000
for step in range(total_steps): # ── LR schedule ────────────────────────────────── opt.lr = lr_schedule(step, warmup_steps=50, total_steps=total_steps, max_lr=0.01)
# ── Mini-batch sample ───────────────────────────── idx = np.random.choice(n, size=64, replace=False) Xb, yb = X[idx], y[idx]
# ── Forward pass ────────────────────────────────── logits = Xb @ W + b[0] p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
# ── BCE loss ────────────────────────────────────── loss = -(yb * np.log(p) + (1 - yb) * np.log(1 - p)).mean()
# ── Gradients ───────────────────────────────────── d_logit = (p - yb) / len(yb) d_W = Xb.T @ d_logit # shape (2,) d_b = np.array([d_logit.sum()]) # shape (1,)
# ── Gradient clipping ───────────────────────────── grads, gnorm = clip_grad_norm([d_W, d_b], max_norm=1.0)
# ── Optimizer step ──────────────────────────────── opt.step([W, b], grads)
# ── Weight decay (AdamW-style, after the Adam step) ─ W -= opt.lr * lambda_ * W
if step % 200 == 0: acc = ((sigmoid(X @ W + b[0]) > 0.5) == y).mean() print(f"step {step:4d} lr={opt.lr:.5f} loss={loss:.4f}" f" acc={acc:.3f} gnorm={gnorm:.3f}")step 0 lr=0.00000 loss=0.6931 acc=0.510 gnorm=0.051step 200 lr=0.00917 loss=0.1823 acc=0.932 gnorm=0.312step 400 lr=0.00823 loss=0.1241 acc=0.960 gnorm=0.198step 600 lr=0.00588 loss=0.0998 acc=0.968 gnorm=0.143step 800 lr=0.00293 loss=0.0912 acc=0.972 gnorm=0.098Every concept from the article appears in this loop: mini-batch sampling, Adam update, LR schedule, gradient clipping, and decoupled weight decay.
Debugging: when optimization fails
Section titled “Debugging: when optimization fails”Loss won’t decrease at all
Section titled “Loss won’t decrease at all”- Overfit one tiny batch (8 examples). If the model can’t memorize 8 examples, gradient computation is wrong.
- Print gradient norms. Zero means gradients aren’t flowing (dead ReLU, wrong computation graph). Enormous means LR is too high.
- Try Adam if using SGD. Adam is more forgiving of LR choice and often works out of the box.
- Check the loss formula. Manually compute the loss for one example by hand and verify it matches code.
Loss decreases then plateaus early
Section titled “Loss decreases then plateaus early”- LR is probably too low after warmup. Try 3× increase.
- Momentum/Adam may have accumulated stale state. Reset optimizer state and restart.
- Model capacity is too low. Add parameters.
Loss oscillates without settling
Section titled “Loss oscillates without settling”- LR too high — halve it.
- Batch size too small — gradient estimates are too noisy. Increase batch size or add momentum.
- Adam β2 too low — v doesn’t smooth enough. Try β2=0.9999.
Loss spikes and partially recovers
Section titled “Loss spikes and partially recovers”- Add gradient clipping (max_norm=1.0).
- Inspect the highest-loss examples — mislabeled or corrupted data can cause large gradient spikes.
- Check whether the spike correlates with a LR schedule restart.
Trains well but validation loss diverges (overfitting)
Section titled “Trains well but validation loss diverges (overfitting)”- Add L2 weight decay. Use AdamW for transformer models.
- For models with many irrelevant features: switch to L1 (zero out weak features).
- Reduce model capacity or add dropout.
Interview framework: “why this optimizer / schedule?”
Section titled “Interview framework: “why this optimizer / schedule?””When asked to justify optimization choices, answer in order:
1. Optimizer: “I’d default to Adam (lr=0.001). For CNNs where generalization matters, I’d consider SGD + Momentum with a step decay schedule — it often generalizes better with more tuning. For any transformer, AdamW.”
2. Learning rate: “I’d look at prior work for the architecture and task. If starting blind, I’d do a brief LR range test: train for one epoch while increasing LR from 1e-6 to 1e-1 and find where loss stops falling.”
3. Schedule: “Warmup + cosine decay for large models or fine-tuning. Step decay for smaller models where I want explicit control. I’d set warmup to ~5% of total steps.”
4. Gradient clipping: “I always add clipping (max_norm=1.0) for sequence models. For feedforward nets I’d add it reactively if I see loss spikes.”
5. Regularization: “L2 via AdamW weight decay for general overfitting. L1 if I want automatic feature selection or need to explain which inputs drive the model. Elastic net (L1 + L2) if I need both stability and sparsity.”