Cost Functions
Cost Functions: Interview Primer
The one-sentence version
A cost function is the score that training tries to minimize. It answers: how bad was that prediction?
If you pick the wrong cost function, your model can score perfectly on paper and still behave badly in production. That’s why interviewers care about it.
Quick reference (read this first, return to it after)
| Loss function | Use when… | Output type | Key behavior |
|---|---|---|---|
| MSE | predicting a number, no big outliers | continuous | punishes large errors much harder than small ones |
| MAE | predicting a number, outliers in the data | continuous | penalty grows in proportion to the error, so every extra unit costs the same |
| Huber | predicting a number, some outliers | continuous | MSE for small errors, MAE for big ones |
| Binary cross-entropy | yes/no outcome | probability 0–1 | punishes confident wrong predictions hardest |
| Weighted BCE | yes/no, class imbalance matters | probability 0–1 | multiplies the rare class’s penalty |
| Focal loss | yes/no, millions of easy negatives | probability 0–1 | down-weights the examples the model already handles well |
| Softmax cross-entropy | one of N categories | probabilities over N classes | same idea as BCE, extended to multiple classes |
| Reconstruction loss | generative models (VAE, autoencoder) | pixel / token values | measures similarity to original directly |
| Adversarial loss | generative models (GAN) | fooling a discriminator | rewards realistic-looking output, not pixel similarity |
How training actually works (plain English)
Training repeats a loop:
1. Make a prediction.
2. Compute the cost (how wrong was it?).
3. Nudge the model’s parameters in whichever direction reduces the cost.
4. Repeat millions of times.
The cost function is step 2. It turns “the model predicted 26 minutes, the real answer was 30 minutes” into a single number the optimizer can work with.
Two properties matter:
- Errors must not cancel. Raw signed errors do: a prediction 4 minutes early and one 4 minutes late average to zero, so the optimizer thinks everything is fine. The cost function must remove that cancellation.
- Size of the penalty shapes behavior. A loss that punishes a 10-minute error much harder than a 1-minute error teaches the model to avoid large misses. A loss that treats them equally teaches something different.
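A minimal sketch of both properties, using four made-up errors (the values are illustrative, not taken from any dataset). The raw mean hides the misses; the absolute and squared versions do not, and squaring lets the 10-minute misses dominate.

```python
import numpy as np

# Hypothetical errors (predicted - actual) in minutes: two early, two late.
errors = np.array([-4.0, 4.0, -10.0, 10.0])

mean_raw = errors.mean()           # signs cancel
mean_abs = np.abs(errors).mean()   # MAE-style: every minute counts once
mean_sq = (errors ** 2).mean()     # MSE-style: big misses dominate

print(f"mean raw error:      {mean_raw:.1f}")   # 0.0
print(f"mean absolute error: {mean_abs:.1f}")   # 7.0
print(f"mean squared error:  {mean_sq:.1f}")    # 58.0
```

The two 10-minute misses supply 200 of the 232 total squared error, but only 20 of the 28 total absolute error.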
Problem 1 — Predict delivery time → MSE

    actual:    30 min
    predicted: 26 min
    error:     −4 min

Raw error cancels when you average positives and negatives, so we square it:

    squared error = (26 − 30)² = 16

Averaged over many orders, this is Mean Squared Error (MSE):

    MSE = average of (predicted − actual)² over all examples

Why squaring matters

| Error | Squared error |
|---|---|
| 2 min | 4 |
| 10 min | 100 |
A 5× larger error gets a 25× larger penalty. MSE tells the model: big mistakes are much worse than small ones.
This is the right call when one catastrophically wrong prediction (e.g., 3 hours late) is genuinely worse than many small ones, not just inconveniently wrong.
When MSE breaks down
MSE learns to predict the average outcome. That’s correct when there’s one right answer. It becomes a problem when:
- There are extreme outliers in your labels (one corrupted record can dominate the whole loss).
- There are multiple valid answers (the average of “turn left” and “turn right” is “drive straight into the median”).
Interview answer
“I’d start with MSE for regression. It penalizes large errors disproportionately, which is usually what you want. I’d reconsider if the training data has significant outliers or if the target has multiple valid values — both break MSE’s assumptions.”
Regularization here — L2: delivery time is influenced by many features (distance, time of day, traffic, restaurant load). You want all of them to contribute — just with controlled magnitude, not zeroed out. L2 is the right fit: it shrinks all weights proportionally but leaves every feature active. L1 would be wrong here — it might zero out “restaurant load” because it’s correlated with “time of day,” discarding a genuinely useful signal.

    import numpy as np

    def mse_l2(y_pred, y_true, W, lambda_=0.01):
        mse = ((y_pred - y_true) ** 2).mean()
        l2_reg = lambda_ * np.sum(W ** 2)
        return mse + l2_reg

    def mse_l2_gradient(X, y_pred, y_true, W, lambda_=0.01):
        n = len(y_true)
        d_W = 2 * (y_pred - y_true) @ X / n + 2 * lambda_ * W  # ← L2 appended
        d_b = 2 * (y_pred - y_true).mean()
        return d_W, d_b

The only difference from plain MSE is the + 2 * lambda_ * W on the gradient line. Large weights get a stronger push toward zero; small weights get almost none.
Problem 2 — One corrupted record → MAE and Huber
Suppose nearly all deliveries take 10–60 minutes, but one record says 900 minutes because the order was never marked complete.
With MSE, if the model predicts 30 minutes for that order, the one record contributes:

    (30 − 900)² = 756,900

That can outweigh thousands of normal examples. The model distorts its predictions for real orders just to partially reduce this one bad record’s cost.
MAE: treat every extra minute equally
Mean Absolute Error (MAE) uses the absolute value instead of squaring:

    MAE = average of |predicted − actual| over all examples

The corrupted record now costs 870, not 756,900. Critically, once a prediction is wrong, the cost increases at a constant rate no matter how wrong it gets. An extreme outlier can’t dominate the way it does with MSE.
Key conceptual difference:
| | MSE | MAE |
|---|---|---|
| What it learns | Conditional mean | Conditional median |
| Outlier influence | High | Low |
| Error sensitivity | Grows with error size | Constant |
Concrete example: Five deliveries take 20, 21, 22, 23, 100 minutes. MSE’s optimal prediction is the mean (37.2 min — pulled by the outlier). MAE’s optimal prediction is the median (22 min — not pulled at all).
Huber: best of both
MAE’s downside: its push has the same strength however close the prediction already is, so it cannot ease off near the correct answer the way MSE does. Whether you’re 1 minute off or 2 minutes off, MAE applies the same-size correction; MSE applies a gentler correction for the 1-minute miss.
Huber loss combines them with a threshold δ:

    |error| ≤ δ → use squared error (smooth, like MSE)
    |error| > δ → use absolute error (robust, like MAE)

You pick δ based on what counts as a “large” error for your specific problem. For delivery time in minutes, δ might be 15 or 20.

    import numpy as np

    # Six deliveries: five normal, one corrupted record
    y_true = np.array([20., 21., 22., 23., 24., 900.])
    y_pred = np.array([21., 22., 21., 24., 23., 30.])  # model guesses ~30 for the outlier

    def mse(y_pred, y_true):
        return ((y_pred - y_true) ** 2).mean()

    def mae(y_pred, y_true):
        return np.abs(y_pred - y_true).mean()

    def huber(y_pred, y_true, delta=15.0):
        error = y_pred - y_true
        is_small = np.abs(error) <= delta
        return np.where(is_small,
                        0.5 * error**2,                         # MSE branch
                        delta * (np.abs(error) - 0.5 * delta)   # MAE branch
                        ).mean()

    print(f"MSE: {mse(y_pred, y_true):>10.1f}")      # dominated by the 900 record
    print(f"MAE: {mae(y_pred, y_true):>10.1f}")      # the 900 record costs 870, not 756,900
    print(f"Huber: {huber(y_pred, y_true):>10.1f}")  # grows linearly (slope δ) past the threshold

    # Per-example view: see exactly where MSE explodes
    print("\ntrue pred MSE_cost MAE_cost")
    for t, p in zip(y_true, y_pred):
        print(f"{t:5.0f} {p:5.0f} {(p-t)**2:>9.0f} {abs(p-t):>8.0f}")

Output:

    MSE:   126150.8
    MAE:      145.8
    Huber:     2156.7

    true pred MSE_cost MAE_cost
       20    21         1        1
       21    22         1        1
       22    21         1        1
       23    24         1        1
       24    23         1        1
      900    30    756900      870   ← one record, 756k vs 870

The mean vs median difference in one snippet:

    times = np.array([20., 21., 22., 23., 100.])

    print(f"Mean (what MSE learns): {times.mean():.1f}")        # 37.2, pulled by the outlier
    print(f"Median (what MAE learns): {np.median(times):.1f}")  # 22.0, ignores the outlier

Huber’s switching behaviour at the threshold:

    delta = 15.0
    errors = np.array([2., 8., 15., 20., 50., 100.])

    print("error cost mode")
    for e in errors:
        if abs(e) <= delta:
            cost, mode = 0.5 * e**2, "MSE (quadratic)"
        else:
            cost, mode = delta * (abs(e) - 0.5*delta), "MAE (linear)"
        print(f" {e:5.0f} {cost:7.1f} {mode}")

Output:

    error cost mode
         2     2.0 MSE (quadratic)
         8    32.0 MSE (quadratic)
        15   112.5 MSE (quadratic)   ← last MSE step
        20   187.5 MAE (linear)      ← first MAE step
        50   637.5 MAE (linear)
       100  1387.5 MAE (linear)

Interview answer
“If training data has occasional extreme label errors or genuine outliers, I’d switch from MSE to Huber. I’d pick the δ threshold by plotting the distribution of prediction errors and finding where the tail starts. If the outliers are actually real and meaningful, MAE is the safer choice. If they’re label noise, Huber is more practical.”
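One possible way to turn "find where the tail starts" into a number is to take a high percentile of the absolute residuals from a baseline model. The sketch below assumes the 85th percentile and a made-up residual array; both are illustrative choices, and `pick_delta` is a hypothetical helper, not a library function.

```python
import numpy as np

def pick_delta(residuals, percentile=85.0):
    """Return a percentile of |residual| as a candidate Huber delta."""
    return float(np.percentile(np.abs(residuals), percentile))

# Hypothetical residuals in minutes: mostly small, two extreme records.
residuals = np.array([-3., 1., 2., -2., 4., -1., 3., -4., 2., 1.,
                      -2., 3., 1., -1., 2., -3., 4., -2., 870., -400.])

delta = pick_delta(residuals)
print(f"candidate delta: {delta:.1f} minutes")  # 4.0
```

The percentile is computed on absolute residuals, so the two extreme records sit above the cut-off instead of dragging it upward. Treat the result as a starting point to check against a histogram, not a final value.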
Problem 3 — Predict if it will rain → Binary cross-entropy
For a yes/no outcome, we want the model to output a probability (a number from 0 to 1). MSE on a probability is a poor fit: the squared error is capped at 1, so even a 99.99%-confident wrong prediction costs less than 1, and with a sigmoid output the gradient nearly vanishes exactly when the model is confidently wrong.
Consider:

    Model A predicted 60% chance of rain → no rain
    Model B predicted 99% chance of rain → no rain

Both are wrong. But B was confidently wrong. Any good loss should punish B much more.
Binary cross-entropy (BCE) does this by measuring “how surprised the model should be by what actually happened”:

    BCE = −log(probability assigned to the correct outcome)

For a no-rain day:

| Predicted rain probability | BCE penalty |
|---|---|
| 10% | small (~0.1) |
| 60% | moderate (~0.9) |
| 99% | large (~4.6) |
The penalty grows sharply as confidence grows. A model that said “99% rain” when it didn’t rain gets hammered. A model that said “10% rain” gets a small nudge.

    import numpy as np

    def sigmoid(x):
        return np.where(x >= 0, 1/(1+np.exp(-x)), np.exp(x)/(1+np.exp(x)))

    def bce(logits, targets):
        p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
        return -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean()

    # Reproduce the table: no-rain day (target=0), three different predictions.
    # The logits are rounded, so the probabilities are only approximately 10%, 60%, 99%.
    target = np.array([0.])
    for logit, label in [(-2.2, "10% rain"), (0.4, "60% rain"), (4.6, "99% rain")]:
        loss = bce(np.array([logit]), target)
        p = sigmoid(np.array([logit]))[0]
        print(f" Predicted {label:8s} (p={p:.2f}) BCE={loss:.3f}")

Output:

     Predicted 10% rain (p=0.10) BCE=0.105
     Predicted 60% rain (p=0.60) BCE=0.913
     Predicted 99% rain (p=0.99) BCE=4.610   ← confident wrong = expensive

The implementation detail that matters
Models internally produce an unconstrained number (called a logit) and a sigmoid function converts it to a probability. If you compute sigmoid(logit) first and then log(probability), you can get numerical errors (log of zero breaks everything).
Always use the library’s built-in version that takes the raw logit:

    import torch
    import torch.nn.functional as F

    # logits and targets are float tensors of the same shape

    # Correct: stable, no NaN risk
    loss = F.binary_cross_entropy_with_logits(logits, targets)

    # Risky: manual sigmoid + log can produce inf or NaN
    p = torch.sigmoid(logits)
    loss = -(targets * torch.log(p) + (1 - targets) * torch.log(1 - p)).mean()  # breaks when p rounds to 0 or 1

Softmax cross-entropy (multi-class)
The same idea scales to multiple classes (e.g., classifying an image as cat, dog, or bird). The model outputs one score per class, softmax converts them to probabilities summing to 1, and cross-entropy penalizes a low probability on the correct class.

    def softmax(x):
        e = np.exp(x - x.max())  # subtract max for numerical stability
        return e / e.sum()

    def cross_entropy(logits, true_class):
        probs = softmax(logits)
        return -np.log(probs[true_class] + 1e-7)

    logits = np.array([2.0, 1.0, 0.1])  # model scores: cat=2.0, dog=1.0, bird=0.1
    probs = softmax(logits)
    print(f"Probs - cat:{probs[0]:.2f} dog:{probs[1]:.2f} bird:{probs[2]:.2f}")

    print(f"Loss if true=cat: {cross_entropy(logits, 0):.3f}")   # small: the model favored the right class
    print(f"Loss if true=bird: {cross_entropy(logits, 2):.3f}")  # large: the model gave bird only ~10%

Output:

    Probs - cat:0.66 dog:0.24 bird:0.10
    Loss if true=cat: 0.417    ← right answer, moderate penalty (not 100% confident)
    Loss if true=bird: 2.317   ← wrong answer, high penalty

Interview answer
“For binary classification, I’d use BCE because the output is a probability and I want to penalize confident wrong predictions heavily. I’d always compute it from logits, not from the probability directly, for numerical stability. The decision threshold (0.5 by default) is a business lever I’d tune separately from training.”
Regularization here: L1. A weather model might have hundreds of input signals: humidity, pressure, dew point, wind speed, cloud cover, and their lagged versions. Most of them are correlated. L1 regularization tends to keep one feature from each correlated group and zero out the rest, leaving a short list of active inputs. Which member of a group survives can be somewhat arbitrary, so treat the selection as a starting point. L2 would keep all of them small and active, which makes the model harder to interpret. If an interviewer asks “how would you do feature selection?”, L1 regularization is a clean answer.

    def sigmoid(x):
        return np.where(x >= 0, 1/(1+np.exp(-x)), np.exp(x)/(1+np.exp(x)))

    def bce_l1(logits, targets, W, lambda_=0.01):
        p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
        bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean()
        l1_reg = lambda_ * np.sum(np.abs(W))
        return bce + l1_reg

    def bce_l1_gradient(X, p, targets, W, lambda_=0.01):
        n = len(targets)
        d_W = X.T @ (p - targets) / n + lambda_ * np.sign(W)  # ← L1 appended
        d_b = (p - targets).mean()
        return d_W, d_b

Compare the gradient term to the L2 version from Problem 1:

    L2: + 2 * lambda_ * W        # push proportional to weight size; weakens as the weight shrinks
    L1: + lambda_ * np.sign(W)   # push is always ±lambda_, regardless of weight size → can reach zero

np.sign(W) returns +1, −1, or 0. The gradient is a fixed-size shove toward zero on every step, which is why L1 can drive weights all the way to zero while L2 only approaches it. With plain gradient steps the weight ends up hovering within one step size of zero; solvers that use soft-thresholding set it to exactly zero.
Problem 4 — 1-in-1000 transactions is fraud → Weighted BCE and focal loss
The trap: at 0.1% fraud rate, a model that predicts “not fraud” for everything gets 99.9% accuracy. The cost function never told it that missing fraud is expensive.
First move — don’t reach for a custom loss immediately. Check:
- How many flagged cases are actually fraud? (precision)
- How much fraud is caught overall? (recall)
- What happens as you move the decision threshold?
If you’ve confirmed that missed fraud is genuinely much more costly than a false alarm, then encode that cost.
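The three checks above can be sketched in a few lines. The labels and scores below are invented for illustration, and `precision_recall` is a hypothetical helper written for this example rather than a library call.

```python
import numpy as np

def precision_recall(y_true, scores, threshold):
    """Precision and recall when flagging every score >= threshold."""
    flagged = scores >= threshold
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    fn = np.sum(~flagged & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

# Hypothetical labels (1 = fraud) and model scores for ten transactions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90])

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.8 moves precision from 3/7 up to 1.0 while recall falls from 1.0 to 1/3. That trade is what you want to see before deciding whether a custom loss is needed at all.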
Weighted BCE
Add weights to the BCE so that each fraud example contributes more to the cost:

    weighted BCE = −(w_fraud × log(p) for fraud) − (w_legit × log(1−p) for legit)

Set w_fraud higher than w_legit. The optimizer spends more effort on fraud examples.

    def weighted_bce(logits, targets, fraud_weight=19.0):
        p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
        wts = np.where(targets == 1, fraud_weight, 1.0)
        bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
        return (wts * bce).mean()

    # Compare: same model confidence, different label → very different cost
    logit = np.array([-2.0])  # model says ~12% fraud probability

    loss_fn = weighted_bce(logit, np.array([1.]), fraud_weight=19.)  # it WAS fraud
    loss_fp = weighted_bce(logit, np.array([0.]), fraud_weight=19.)  # it was NOT fraud

    print(f"Missed fraud (FN): {loss_fn:.3f}")   # 19× amplified
    print(f"Flagged legit (FP): {loss_fp:.3f}")  # standard penalty
    print(f"Ratio: {loss_fn/loss_fp:.1f}x")

Output:

    Missed fraud (FN): 40.412
    Flagged legit (FP): 0.127
    Ratio: 318.4x   ← missing fraud is now vastly more expensive

The ratio combines two effects: at 12% predicted fraud, plain BCE already charges about 17× more when the label is fraud, and the class weight multiplies that by 19.
Tradeoff: higher weight for fraud → higher recall → more false alarms. It’s a dial, not a solution.
Focal loss
A different problem: your dataset has millions of obvious legitimate transactions. The model learned to classify those perfectly early in training, but they still dominate the loss because there are so many of them.
Focal loss multiplies each example’s BCE by a factor that shrinks as the model gets more confident in the correct answer:

    focal multiplier = (1 − p_correct)^γ

- If the model assigns 99% to the correct answer: (1 − 0.99)^2 = 0.0001 → the easy example barely counts.
- If the model assigns only 20% to the correct answer: (1 − 0.20)^2 = 0.64 → the hard example still counts.

The γ (gamma) parameter controls how aggressively easy examples are suppressed. Higher γ = more focus on hard cases.

    def focal_bce(logits, targets, gamma=2.0, alpha=0.25):
        p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
        p_correct = np.where(targets == 1, p, 1 - p)  # confidence on the right answer
        focus = (1 - p_correct) ** gamma              # shrinks toward 0 when confident
        bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
        return (alpha * focus * bce).mean()           # alpha is a constant scale here

    # Show how the focusing factor suppresses easy examples
    print("p_correct focus (γ=2) interpretation")
    for pc in [0.99, 0.90, 0.70, 0.50, 0.20]:
        f = (1 - pc)**2
        print(f" {pc:.2f} {f:.4f} {'easy, nearly ignored' if pc > 0.8 else 'hard, still trains'}")

Output:

    p_correct focus (γ=2) interpretation
     0.99 0.0001 easy, nearly ignored
     0.90 0.0100 easy, nearly ignored
     0.70 0.0900 hard, still trains
     0.50 0.2500 hard, still trains
     0.20 0.6400 hard, still trains

When NOT to use focal loss:
- When “hard” examples are actually mislabeled (focal loss will overfit to noise).
- When you need the model’s output to be a trustworthy probability (focal loss distorts calibration).
Interview distinction
“Weighted BCE expresses importance by class — fraud matters more than legit. Focal loss expresses importance by difficulty — examples the model struggles with now matter more than ones it already handles well. I’d use weighted BCE when I know the business cost ratio. I’d add focal loss if I also have a long-tail of genuinely hard examples drowning in easy ones.”
Regularization here — L2 first, then L1 if you need interpretability:
Upweighting fraud examples makes the optimizer treat each fraud example as if it were 19 legitimate ones. That’s intentional — but it also amplifies noise. The model might memorize specific patterns from 50 fraudulent transactions and fail on the 51st that looks slightly different. L2 regularization limits how strongly any individual feature can drive the output, which directly counteracts this overfitting risk.
L1 serves a different purpose here: fraud models often need to be explainable to compliance teams or regulators. A model with L1 regularization produces a short list of nonzero feature weights (“transaction velocity in last 1 hour, country mismatch, device age”) that you can hand to a compliance officer. L2 produces a model where every feature has a small nonzero weight — explainable in principle, but not as a short list.
In practice: start with L2 for stability, add L1 (or an elastic net that combines both) if explainability is a hard requirement.

    def weighted_bce_l2(logits, targets, W, fraud_weight=19.0, lambda_=0.01):
        p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
        wts = np.where(targets == 1, fraud_weight, 1.0)
        bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
        l2_reg = lambda_ * np.sum(W ** 2)
        return (wts * bce).mean() + l2_reg

    # Elastic net = L2 stability + L1 sparsity
    def weighted_bce_elastic(logits, targets, W, fraud_weight=19.0, l1=0.001, l2=0.01):
        p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
        wts = np.where(targets == 1, fraud_weight, 1.0)
        bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
        reg = l2 * np.sum(W ** 2) + l1 * np.sum(np.abs(W))
        return (wts * bce).mean() + reg

    def elastic_gradient(X, p, targets, W, fraud_weight=19.0, l1=0.001, l2=0.01):
        n = len(targets)
        wts = np.where(targets == 1, fraud_weight, 1.0)
        d_logit = wts * (p - targets) / n
        d_W = X.T @ d_logit + 2 * l2 * W + l1 * np.sign(W)  # three forces
        d_b = d_logit.sum()
        return d_W, d_b

The gradient line shows all three forces at once:

    X.T @ d_logit       # data: reduce prediction error
    + 2 * l2 * W        # L2: shrink large weights proportionally
    + l1 * np.sign(W)   # L1: constant push; zeros out weak features

Tune l2 first (controls stability), then l1 (controls how sparse the final weights are).
Problem 5 — Multiple valid answers → Mixture loss (brief)
Suppose you’re predicting a car’s position 5 seconds from now. At an intersection, it might validly turn left (−10 m) or right (+10 m).
MSE will predict 0 m: the mathematical average of the two valid outcomes, but physically impossible.
This is called multi-modality: the answer distribution has multiple peaks.
The fix: instead of predicting one answer, predict several possibilities with probabilities:

    Option A: turn left (−10 m), probability 50%
    Option B: turn right (+10 m), probability 50%

The loss rewards the model for assigning high probability to whatever actually happened.

    import math
    import numpy as np

    # Two equally likely futures: turn left (-10 m) or turn right (+10 m)
    outcomes = np.array([-10., 10.])
    probs = np.array([0.5, 0.5])

    # What MSE predicts: the weighted mean
    mse_prediction = (probs * outcomes).sum()
    print(f"MSE predicts: {mse_prediction:.1f} m")  # 0.0, drives into the barrier

    # Expected MSE at the "optimal" prediction vs committing to either valid answer
    def expected_mse(prediction):
        return (probs * (prediction - outcomes)**2).sum()

    print(f"MSE cost at m=0 (MSE optimum): {expected_mse(0):.1f}")    # 100
    print(f"MSE cost at m=-10 (left turn): {expected_mse(-10):.1f}")  # 200
    print(f"MSE cost at m=+10 (right turn): {expected_mse(10):.1f}")  # 200
    # MSE scores the impossible midpoint as better than either real turn

    # What a mixture model does: assign probability to each mode
    def log_gaussian(y, mu, sigma=2.0):
        return -0.5*((y-mu)/sigma)**2 - math.log(sigma) - 0.5*math.log(2*math.pi)

    def mixture_nll(y_observed, mus, mix_probs, sigma=2.0):
        """Negative log likelihood for a 2-component mixture."""
        log_components = [math.log(pi) + log_gaussian(y_observed, mu, sigma)
                          for pi, mu in zip(mix_probs, mus)]
        # logsumexp for numerical stability
        max_log = max(log_components)
        log_sum = max_log + math.log(sum(math.exp(l - max_log) for l in log_components))
        return -log_sum

    y_left = -10.0   # car turned left
    y_right = 10.0   # car turned right

    loss_left = mixture_nll(y_left, mus=[-10., 10.], mix_probs=[0.5, 0.5])
    loss_right = mixture_nll(y_right, mus=[-10., 10.], mix_probs=[0.5, 0.5])

    print(f"\nMixture NLL for left turn: {loss_left:.3f}")  # low: the left mode covers it
    print(f"Mixture NLL for right turn: {loss_right:.3f}")  # equally low: the right mode covers it

Output:

    MSE predicts: 0.0 m
    MSE cost at m=0 (MSE optimum): 100.0
    MSE cost at m=-10 (left turn): 200.0   ← committing to a real turn costs twice as much
    MSE cost at m=+10 (right turn): 200.0

    Mixture NLL for left turn: 2.305   ← mixture handles both correctly
    Mixture NLL for right turn: 2.305

This is more advanced and mainly comes up in autonomous vehicles, robotics, and trajectory prediction. For most product ML interviews, knowing why MSE fails here is enough.
Interview answer: “If the target has multiple valid outcomes — like a car that could turn left or right — MSE would predict the invalid average. I’d model the output as a distribution with multiple modes rather than a single point, and train with a likelihood loss.”
Problem 6 — Regularization and the L1 vs L2 visual guide
Regularization isn’t a cost function — it’s an addition to the cost function that penalizes the model for being unnecessarily complicated.
Imagine two models that fit the training data equally:
- Model A uses moderate weights.
- Model B uses enormous weights that happen to cancel each other out.
Model B is fragile: change the input slightly and the output changes wildly.
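A minimal sketch of that fragility, assuming two nearly identical input features and hand-picked weights (all numbers are hypothetical). Both models give the same output when the features agree; only Model B overreacts when one feature moves by 0.01.

```python
import numpy as np

# Two hypothetical linear models over two nearly identical features.
w_a = np.array([1.0, 1.0])         # Model A: moderate weights
w_b = np.array([1001.0, -999.0])   # Model B: enormous weights that cancel

x = np.array([3.0, 3.0])                 # input where both features agree
x_shifted = x + np.array([0.01, 0.0])    # nudge the first feature by 0.01

for name, w in (("A", w_a), ("B", w_b)):
    before, after = w @ x, w @ x_shifted
    print(f"Model {name}: before={before:.2f} after={after:.2f} change={after - before:.2f}")

# The L2 penalty (sum of squared weights) separates the two models clearly.
print(f"sum of squared weights: A={np.sum(w_a**2):.0f}, B={np.sum(w_b**2):.0f}")
```

Model A's output moves by 0.01 and Model B's by about 10. The sum of squared weights is 2 for A and 2,000,002 for B, which is the quantity an L2 penalty would charge for.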
The mathematical difference (visual)
Both regularizers add a penalty based on the model’s weights. They differ in how the penalty grows:

    Weight value:      0.0   0.1   0.5   1.0   2.0   5.0
    L2 penalty (w²):   0     0.01  0.25  1.0   4.0   25.0
    L1 penalty (|w|):  0     0.1   0.5   1.0   2.0   5.0

What this means in plain English:
- L2 penalty grows as the square of the weight. A weight of 5 gets penalized 25× harder than a weight of 1. The optimizer attacks big weights aggressively, but once a weight is small (say 0.1), the penalty is tiny (0.01) and the push nearly disappears. This is why L2 shrinks weights without fully zeroing them.
- L1 penalty grows linearly. A weight of 5 is penalized 5× harder than 1, but a weight of 0.1 still has a meaningful penalty (0.1). The push toward zero doesn’t disappear even when the weight is nearly zero, so the optimizer keeps pushing until the weight is zero. This is why L1 creates sparsity (some weights become exactly zero).

    import numpy as np

    weights = np.array([0.0, 0.1, 0.5, 1.0, 2.0, 5.0])

    print(f"{'weight':>8} {'L2 penalty':>12} {'L1 penalty':>12} {'L2 gradient':>13} {'L1 gradient':>13}")
    for w in weights:
        l2_pen = w**2
        l1_pen = abs(w)
        l2_grad = 2 * w        # ∝ w: weakens as w shrinks
        l1_grad = np.sign(w)   # always ±1 for nonzero w: constant push
        print(f"{w:>8.1f} {l2_pen:>12.3f} {l1_pen:>12.3f} {l2_grad:>13.3f} {l1_grad:>13.0f}")

Output:

      weight   L2 penalty   L1 penalty   L2 gradient   L1 gradient
         0.0        0.000        0.000         0.000             0
         0.1        0.010        0.100         0.200             1   ← L1 still pushes hard
         0.5        0.250        0.500         1.000             1
         1.0        1.000        1.000         2.000             1
         2.0        4.000        2.000         4.000             1
         5.0       25.000        5.000        10.000             1

The L1 gradient column is what drives zeroing: it’s always 1 for any nonzero weight, regardless of size. The L2 gradient column shrinks to 0.2 at w=0.1 and keeps shrinking as the weight does, so each L2 step removes only a fixed fraction of what is left: the weight decays toward zero but never lands on it. L1 removes a fixed amount per step, so it reaches zero after a finite number of steps.
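A small sketch of what repeated updates do to a weight that starts at 0.1 under each penalty alone. It assumes illustrative values for the learning rate and λ, and for L1 it swaps in a soft-thresholding (proximal) step instead of the raw sign gradient, because a plain sign-gradient step would overshoot and oscillate around zero rather than stop on it.

```python
import numpy as np

def l2_step(w, lr, lam):
    """Gradient step on lam * w**2: shrinks w by a constant factor."""
    return w - lr * 2.0 * lam * w

def l1_prox_step(w, lr, lam):
    """Soft-thresholding step for lam * |w|: moves w toward 0 by lr*lam, stopping at 0."""
    return np.sign(w) * max(abs(w) - lr * lam, 0.0)

w_l2 = w_l1 = 0.1
lr, lam = 0.01, 1.0   # illustrative values, not tuned
for _ in range(1000):
    w_l2 = l2_step(w_l2, lr, lam)
    w_l1 = l1_prox_step(w_l1, lr, lam)

print(f"L2 after 1000 steps: {w_l2:.3e}")   # tiny but still nonzero
print(f"L1 after 1000 steps: {w_l1:.3e}")   # exactly zero
```

The L2 weight is multiplied by 0.98 on every step, so it becomes very small without ever being exactly zero. The L1 weight loses 0.01 per step and is pinned at zero after roughly ten steps.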
Why sparsity matters
If you have 1,000 input features and only 20 are actually useful, L1 can zero out the irrelevant 980. L2 just makes them very small. For feature selection or interpretability, L1 is the stronger tool.
Visual intuition: the diamond vs circle
Think of the regularizer as drawing a boundary around “allowed” weight values. L2 draws a circle: solutions land anywhere on the smooth curve, usually not at zero. L1 draws a diamond: it has sharp corners on the axes. The optimal solution tends to “snap” to a corner, where one or more weights are exactly zero.

    L2 region (circle)            L1 region (diamond)
            w2                            w2
            |                             |
        .---+---.                        /|\
       /    |    \                      / | \
      |     |     |                    /  |  \
  ----+-----0-----+---- w1        ----+---0---+---- w1
      |     |     |                    \  |  /
       \    |    /                      \ | /
        '---+---'                        \|/
            |
    solutions land anywhere       solutions snap to corners
    on the circle                 (axis intersections = sparse)

Don’t confuse these

| Concept | What it scores |
|---|---|
| MSE / MAE | Prediction errors |
| L1 / L2 regularization | Model weight sizes |
Interview answer
“L1 and L2 both penalize large weights, but differently. L2 penalty grows quadratically so it attacks big weights hard and leaves small weights nearly alone — all weights shrink but rarely reach zero. L1 penalty grows linearly so the push toward zero stays constant even for tiny weights — some weights go all the way to zero, giving you sparsity and implicit feature selection. I’d use L1 when I want to identify which features actually matter. I’d use L2 for general regularization. I’d tune the strength λ on a validation set.”
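Tuning λ on a validation set can be sketched as a simple sweep. The example below assumes synthetic data, an arbitrary 40/80 train/validation split, and a closed-form ridge (L2-regularized least squares) fit in place of gradient training; the λ grid and helper names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 120, 30
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.0, 0.5]            # only three features matter
y = X @ true_w + rng.normal(scale=1.0, size=n)

X_train, y_train = X[:40], y[:40]        # small training set, so overfitting is possible
X_val, y_val = X[40:], y[40:]

def fit_ridge(X, y, lam):
    """Closed-form minimizer of sum((Xw - y)^2) + lam * sum(w^2), no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def mse(X, y, w):
    return float(((X @ w - y) ** 2).mean())

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
val_scores = {lam: mse(X_val, y_val, fit_ridge(X_train, y_train, lam)) for lam in lambdas}
best_lam = min(val_scores, key=val_scores.get)

for lam in lambdas:
    print(f"lambda={lam:6.1f}  val_mse={val_scores[lam]:.3f}")
print("best lambda:", best_lam)
```

The selection uses validation error only. Training error is lowest at λ = 0 by construction, so it cannot be used to choose the regularization strength.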
Problem 7 — Implementing a custom loss function
Interviews sometimes ask you to write a loss for a non-standard problem. The approach is always the same regardless of the specific loss.
The checklist before writing any custom loss
- Can you reuse a standard loss? Weighted BCE, focal loss, and Huber cover most real cases. A custom loss adds maintenance cost.
- What does the model output? A number? A probability? Multiple numbers? The loss must match.
- What’s the expensive mistake? Write down the mistake in English before writing code.
- Is the loss differentiable? Training needs gradients. If you write `if error > threshold: ...`, make sure both branches have usable gradients and that the two pieces join without a jump at the threshold, as Huber’s do.
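One way to check the last item is to compare a hand-written gradient against a finite-difference estimate, including right at the branch point. The sketch below uses Huber with δ = 15 as the test subject; `numeric_grad` is a hypothetical helper and the step size `eps` is an assumption.

```python
import numpy as np

def huber(error, delta=15.0):
    """Per-example Huber loss on a scalar or array of errors."""
    abs_err = np.abs(error)
    return np.where(abs_err <= delta, 0.5 * error**2, delta * (abs_err - 0.5 * delta))

def huber_grad(error, delta=15.0):
    """Analytic derivative: the error itself, clipped to [-delta, delta]."""
    return np.clip(error, -delta, delta)

def numeric_grad(f, x, eps=1e-5):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for e in (-40.0, -15.0, 3.0, 15.0, 40.0):
    analytic = float(huber_grad(e))
    numeric = float(numeric_grad(huber, e))
    print(f"error={e:6.1f} analytic={analytic:7.3f} numeric={numeric:7.3f}")
```

The two columns agree at every point, including ±15 where the branches meet. A loss whose branches did not join smoothly would show a mismatch or a jump there.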
Example 1: weighted BCE for imbalanced classes
Problem: 99% legitimate, 1% fraud. Standard BCE lets the model all but ignore fraud.

    import torch
    import torch.nn.functional as F

    def weighted_bce(logits, targets, fraud_weight=10.0):
        # pos_weight multiplies the loss for positive (fraud) examples
        # fraud_weight=10 means each fraud example counts 10× as much
        loss = F.binary_cross_entropy_with_logits(
            logits, targets,
            pos_weight=torch.tensor(fraud_weight)
        )
        return loss

Line by line:
- logits: raw model outputs before sigmoid, not probabilities (more stable)
- pos_weight=10: each fraud label’s gradient is 10× stronger
- The library handles the numerical stability; don’t reimplement it

How to pick the weight: start with (# negatives) / (# positives). A 99:1 ratio → weight ≈ 99. Then tune based on the precision/recall tradeoff you need.
The same thing in NumPy — no magic:

    import numpy as np

    def sigmoid(x):
        # Two-branch form keeps the selected value finite for very large |x|.
        # np.where still evaluates both branches, so NumPy may emit an overflow warning.
        return np.where(x >= 0,
                        1 / (1 + np.exp(-x)),
                        np.exp(x) / (1 + np.exp(x)))

    def weighted_bce_numpy(logits, targets, fraud_weight=10.0):
        p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)  # clip so log(0) never happens

        # fraud examples use fraud_weight; legit examples use 1.0
        weights = np.where(targets == 1, fraud_weight, 1.0)

        # BCE formula written out: -[y*log(p) + (1-y)*log(1-p)]
        # When target=1: only the first term survives → -log(p)
        # When target=0: only the second term survives → -log(1-p)
        bce_per_example = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))

        return (weights * bce_per_example).mean()

Run it side by side with the PyTorch version and you’ll get the same numbers (up to floating-point rounding).
Example 2: focal loss for hard examples
Problem: millions of easy negatives drown out the few hard examples.

    def focal_bce(logits, targets, gamma=2.0, alpha=0.25):
        bce = F.binary_cross_entropy_with_logits(
            logits, targets,
            reduction="none"  # keep per-example losses
        )
        p_correct = torch.where(
            targets == 1,
            torch.sigmoid(logits),      # if target is 1, p_correct = p
            1 - torch.sigmoid(logits)   # if target is 0, p_correct = 1-p
        )
        focus = (1 - p_correct) ** gamma  # easy examples (high p_correct) → near 0
        return (alpha * focus * bce).mean()

Line by line:
- reduction="none": keep per-example losses so we can multiply each one
- p_correct: “how confident was the model about the right answer?”
- focus: when p_correct is 0.99, focus ≈ 0.0001 (nearly zero). When 0.5, focus = 0.25 (still matters).
- alpha: a constant scale factor in this simplified version. The original focal loss formulation uses a class-dependent weight (α for positives, 1 − α for negatives) to balance fraud against legit examples on top of the focusing.
The same thing in NumPy:

    def focal_bce_numpy(logits, targets, gamma=2.0, alpha=0.25):
        p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)

        # How confident was the model about whichever label was correct?
        # If target=1 (fraud): p_correct = p      (prob it said "fraud")
        # If target=0 (legit): p_correct = 1 - p  (prob it said "not fraud")
        p_correct = np.where(targets == 1, p, 1 - p)

        # The focusing multiplier: shrinks toward 0 for confident correct predictions
        # Example: p_correct=0.99, gamma=2 → (1-0.99)^2 = 0.0001 (easy example, ignore it)
        #          p_correct=0.30, gamma=2 → (1-0.30)^2 = 0.49   (hard example, keep it)
        focus = (1 - p_correct) ** gamma

        bce_per_example = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))

        return (alpha * focus * bce_per_example).mean()

The main difference from weighted BCE is the focus multiplier, with a constant alpha standing in for the per-class weight. Everything else is identical.
Example 3: adding regularization to a custom loss
Regularization is not a separate concept: it’s literally an extra term you add to the loss before returning it.

```
total_loss = data_loss + λ × regularization_penalty
```

Here’s the complete weighted BCE + L2 in NumPy, with a real feature so there’s an actual weight to regularize:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # helper so this block runs on its own

# A simple linear model: logit = w * x + b
# x = transaction amount (one feature), w = its weight, b = bias
# We regularize w (the learned coefficient), not b.

np.random.seed(0)
n = 200
X = np.random.randn(n)  # transaction amounts
targets = ((X + np.random.randn(n) * 0.5) > 0.5).astype(float)  # fraud label

w, b = 0.0, 0.0  # model parameters
lr = 0.1
lambda_ = 0.5    # regularization strength: larger = simpler model

for step in range(300):
    logits = w * X + b  # forward pass

    # ── data loss ──
    p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
    wts = np.where(targets == 1, 9.0, 1.0)  # 9:1 class weight
    bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    loss_data = (wts * bce).mean()

    # ── L2 regularization penalty ──
    loss_reg = lambda_ * w**2  # only on w, not bias

    total_loss = loss_data + loss_reg  # ← the one number training minimizes

    # ── gradients ──
    d_logit = wts * (p - targets) / n            # data loss gradient w.r.t. logit
    d_w = np.sum(d_logit * X) + 2 * lambda_ * w  # chain rule + L2 gradient
    d_b = np.sum(d_logit)                        # bias has no regularization

    # Log before updating so w matches the losses printed on the same row
    if step % 100 == 0:
        print(f"step {step:3d} total={total_loss:.3f} data={loss_data:.3f} reg={loss_reg:.3f} w={w:.3f}")

    w -= lr * d_w
    b -= lr * d_b
```

The loop prints three rows, at steps 0, 100, and 200 (`range(300)` stops at 299, so there is no step-300 row). At step 0 the reg column is exactly 0.000 because w starts at zero, and the data loss sits above log(2) ≈ 0.693 because the 9:1 class weights push the mean weight above 1. As the model learns a nonzero w, reg grows while total falls and then flattens. The optimizer settles where pushing w higher would reduce data loss less than it would increase reg loss. That’s the regularizer working as intended.
Switching to L1 takes two lines, the penalty and its gradient:

```python
loss_reg = lambda_ * abs(w)                       # L1 penalty
d_w = np.sum(d_logit * X) + lambda_ * np.sign(w)  # L1 gradient: constant ±λ
```

The sign of w determines direction. If w is positive, sign(w) = +1 → constant push downward. The push doesn’t weaken as w shrinks toward zero, which is why L1 can drive weights all the way to exactly zero.
The two key things to remember:
- Regularize the weights (learned parameters), not the logits or the loss terms.
- The gradient of the total loss = gradient of data loss + gradient of regularization term. They add together naturally.
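To see why a constant push reaches exactly zero while a proportional one does not, here is a small sketch that applies only the penalty step, with no data loss at all. The L1 update is written as a soft-thresholding step, which is one common way to implement it (an assumption here; the snippet above uses the plain sign-based gradient), so the weight stops at zero instead of overshooting. The function names are illustrative.

```python
import numpy as np

def l2_step(w, lr, lam):
    # Gradient step on lam * w**2: shrinks w by a constant factor each time
    return w - lr * 2 * lam * w

def l1_step(w, lr, lam):
    # Soft-thresholding step on lam * |w|: moves w toward zero by a fixed
    # amount lr * lam and clamps at exactly zero
    return np.sign(w) * max(abs(w) - lr * lam, 0.0)

w_l2 = w_l1 = 1.0
lr, lam = 0.1, 0.5
for _ in range(50):
    w_l2 = l2_step(w_l2, lr, lam)
    w_l1 = l1_step(w_l1, lr, lam)

print(f"L2 after 50 steps: {w_l2:.6f}")  # small but still nonzero
print(f"L1 after 50 steps: {w_l1:.6f}")  # exactly zero
```

L2 multiplies the weight by 0.9 every step, so it approaches zero without ever arriving. L1 subtracts a fixed 0.05 per step and hits zero after about 20 steps.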
Verifying your custom loss works
Before training for hours, run this sanity check:

```python
# 1. Does it decrease when you make a better prediction?
bad_logit = torch.tensor([-5.0])   # model says "definitely not fraud"
good_logit = torch.tensor([5.0])   # model says "definitely fraud"
target = torch.tensor([1.0])       # actual: fraud

print(focal_bce(bad_logit, target))   # should be HIGH
print(focal_bce(good_logit, target))  # should be LOW

# 2. Does it give a reasonable value for a random model?
# At initialization (logits = 0, p = 0.5), plain BCE is log(2) ≈ 0.693.
# focal_bce scales that by alpha * (1 - 0.5)**gamma = 0.25 * 0.25.
random_logits = torch.zeros(100)
random_targets = torch.randint(0, 2, (100,)).float()
print(focal_bce(random_logits, random_targets))  # should be ≈ 0.043 with alpha=0.25, gamma=2
```

What the gradient actually is (the signal that drives learning)

The optimizer never sees the loss value directly. It sees the gradient: how much the loss changes when you nudge each logit up or down.
For weighted BCE, this works out to a clean formula:
```
gradient for one example = weight × (predicted_probability − true_label)
```

In NumPy:

```python
def weighted_bce_gradient(logits, targets, fraud_weight=10.0):
    p = sigmoid(logits)
    weights = np.where(targets == 1, fraud_weight, 1.0)
    # Divide by n to match the .mean() in the loss
    return weights * (p - targets) / len(targets)
```

Trace through two examples with fraud_weight=19 to see the weighting in action (per-example values, before the division by n):

```
Fraud example (target=1), model predicted p=0.30:
  gradient = 19 × (0.30 − 1.0) = −13.3   ← large negative gradient, logit goes UP → p rises

Legit example (target=0), model predicted p=0.30:
  gradient = 1 × (0.30 − 0.0) = +0.3     ← small positive gradient, logit goes DOWN → p falls
```

The fraud example has 44× more influence on the update. That’s exactly what fraud_weight=19 does: it’s not magic, it’s a multiplier on (prediction − label).
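A quick way to trust a hand-derived gradient like this one is to compare it against a finite-difference estimate. The sketch below redefines the loss and gradient so it runs standalone; the helper names and the test logits are illustrative, not part of the code above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_bce(logits, targets, fraud_weight):
    p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
    weights = np.where(targets == 1, fraud_weight, 1.0)
    bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return (weights * bce).mean()

def analytic_grad(logits, targets, fraud_weight):
    weights = np.where(targets == 1, fraud_weight, 1.0)
    return weights * (sigmoid(logits) - targets) / len(targets)

def numeric_grad(logits, targets, fraud_weight, eps=1e-6):
    grad = np.zeros_like(logits)
    for i in range(len(logits)):
        bump = np.zeros_like(logits)
        bump[i] = eps
        up = weighted_bce(logits + bump, targets, fraud_weight)
        down = weighted_bce(logits - bump, targets, fraud_weight)
        grad[i] = (up - down) / (2 * eps)  # central difference
    return grad

logits = np.array([-0.85, 0.4, 2.0, -1.5])
targets = np.array([1.0, 0.0, 1.0, 0.0])
max_gap = np.max(np.abs(
    analytic_grad(logits, targets, 19.0) - numeric_grad(logits, targets, 19.0)
))
print(f"largest analytic-vs-numeric gap: {max_gap:.2e}")
```

If the gap is not tiny, either the formula or the loss has a bug, and it is far cheaper to find that here than after hours of training.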
A full gradient-descent loop in NumPy (no PyTorch, no autograd):
```python
# 5 fraud, 95 legit, and a single shared bias b (no features, just to see the loop)
targets = np.array([1.0] * 5 + [0.0] * 95)
b = np.log(5 / 95)  # start at the base rate: p = 0.05, where unweighted BCE would settle

for step in range(200):
    logits = np.full(100, b)  # every example gets the same logit
    loss = weighted_bce_numpy(logits, targets, fraud_weight=19.0)
    grad = weighted_bce_gradient(logits, targets, fraud_weight=19.0)

    if step % 50 == 0:
        print(f"step {step:3d} loss={loss:.4f} p={sigmoid(b):.3f}")

    # Chain rule: the shared bias collects every example's gradient
    b -= 0.5 * grad.sum()  # gradient descent: move opposite the gradient

# step   0 loss=2.8947 p=0.050   ← starts at the 5% base rate
# step  50 loss=1.3170 p=0.500
# step 100 loss=1.3170 p=0.500
# step 150 loss=1.3170 p=0.500   ← pushed toward fraud until both classes pull equally
```

The model has no features here, so it can only learn one number. But you can watch the gradient push the probability upward until the five fraud examples, each weighted 19×, pull exactly as hard as the 95 legit ones (5 × 19 = 95), which happens at p = 0.5. In a real model this same loop runs for each weight in the network.
Interview answer
“I’d start by checking whether a standard loss with weights covers the case; it usually does. If I need something custom, I’d write it to operate on logits or log-probabilities (never raw probabilities) to avoid numerical issues, verify that a manually better prediction produces lower loss, and monitor gradient norms to confirm training signal is flowing.”
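The “logits, never raw probabilities” advice can be made concrete with a single confidently wrong prediction. The sketch below compares a naive probability-based BCE with the standard rearrangement that works directly on the logit; both function names are illustrative.

```python
import numpy as np

def bce_from_probability(logit, target):
    # Naive route: squash to a probability first, then take logs
    p = 1.0 / (1.0 + np.exp(-logit))
    with np.errstate(divide="ignore", invalid="ignore"):
        return -(target * np.log(p) + (1 - target) * np.log(1 - p))

def bce_from_logit(logit, target):
    # Algebraically the same loss, rearranged so no log ever sees 0:
    # max(z, 0) - z*y + log(1 + exp(-|z|))
    return np.maximum(logit, 0) - logit * target + np.log1p(np.exp(-np.abs(logit)))

confident_wrong = 40.0  # model is very sure of "fraud", true label is 0
naive = bce_from_probability(confident_wrong, 0.0)
stable = bce_from_logit(confident_wrong, 0.0)
print(f"from probability: {naive}")
print(f"from logit:       {stable}")
```

At a logit of 40 the sigmoid rounds to exactly 1.0 in float64, so the naive version takes log(0) and returns infinity, while the logit version returns a finite loss of about 40. For moderate logits the two agree.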
Problem 8 — Reward functions as cost functions (RL)
In reinforcement learning, a model (called an agent) takes actions and receives a reward signal. Maximizing reward is equivalent to minimizing negative reward, so the reward function is the cost function, negated.
This sounds simple. It causes most of the problems in RL.
The core pitfall: reward hacking
You write a reward. The model optimizes exactly what you wrote, not what you meant.

```python
import numpy as np

# Intended reward: maximize delivery speed
def reward_speed(delivery_time_minutes):
    return 1.0 / delivery_time_minutes

print("Intended usage:")
for t in [10, 30, 60]:
    print(f"  {t} min delivery → reward {reward_speed(t):.3f}")

print("\nWhat the model discovers (drop package immediately):")
print(f"  0.1 min 'delivery' → reward {reward_speed(0.1):.1f}")  # 100× the 10-minute reward
print(f"  0.01 min → reward {reward_speed(0.01):.1f}")           # 1000× the 10-minute reward
```

```
Intended usage:
  10 min delivery → reward 0.100
  30 min delivery → reward 0.033
  60 min delivery → reward 0.017

What the model discovers (drop package immediately):
  0.1 min 'delivery' → reward 10.0
  0.01 min → reward 100.0
```

The reward function is technically correct: fast is better. But it never specified that the package must arrive. The model found the loophole.
Interview signal: always ask “what behavior does this reward incentivize that I don’t want?” before finalizing a reward function.
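One way to close this particular loophole is to pay for speed only when the goal was actually achieved. The sketch below is an illustration, not a complete fix: the `delivered` flag and `failure_penalty` value are hypothetical stand-ins for whatever ground-truth confirmation and cost a real system has.

```python
def reward_speed_if_delivered(delivery_time_minutes, delivered, failure_penalty=-1.0):
    """Pay for speed only when the package actually arrived.

    `delivered` and `failure_penalty` are hypothetical: they stand in for
    the real system's delivery confirmation and the cost of a lost package.
    """
    if not delivered:
        return failure_penalty
    return 1.0 / delivery_time_minutes

honest = reward_speed_if_delivered(10, delivered=True)
dropped = reward_speed_if_delivered(0.01, delivered=False)
print(f"10 min, delivered → reward {honest:.3f}")
print(f"0.01 min, dropped → reward {dropped:.3f}")
```

The same question applies to the patched version: if the confirmation signal itself can be faked, the agent now has an incentive to fake it.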
Pitfall: sparse rewards
Imagine learning to play chess where the only feedback is “you won” or “you lost” at the very end. For the first several thousand training games, every move gets the same signal: nothing. The model can’t tell which of its 40 moves caused it to lose.

```python
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """Convert per-step rewards into discounted future returns.
    This is how RL assigns credit backward through time."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Chess game: 40 moves, reward only at the very end
sparse = np.zeros(40)
sparse[-1] = 1.0  # won the game

returns = compute_returns(sparse, gamma=0.99)

print("Step  Raw reward  Discounted return (credit received)")
for t in [0, 10, 20, 30, 38, 39]:
    print(f"{t:4d}  {sparse[t]:10.1f}  {returns[t]:.4f}")
```

```
Step  Raw reward  Discounted return (credit received)
   0         0.0  0.6757   ← most discounted: 39 steps before the reward
  10         0.0  0.7472
  20         0.0  0.8262
  30         0.0  0.9135
  38         0.0  0.9900
  39         1.0  1.0000   ← only this step has raw reward
```

Judged by raw reward alone, every step before step 39 gets zero credit. Discounted returns pass the final reward backward: with γ=0.99, step 0 gets about 68 cents on the dollar. That is enough to train on, but every move in a won game receives nearly the same credit, so the signal still says little about which move mattered.
Common fixes:
- Reward shaping: add intermediate signals (e.g., +0.1 for capturing a piece). Risk: introduces new loopholes.
- Potential-based shaping: structure intermediate rewards as differences in a progress function — guaranteed not to change which final behavior is optimal.
- Curriculum learning: start with simpler problems that have denser rewards.
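The guarantee behind potential-based shaping comes from a telescoping sum: along any path, the discounted shaping terms γ·Φ(s') − Φ(s) add up to γ^T·Φ(s_T) − Φ(s_0), which depends on where the path starts and ends and how long it is, not on the route taken. A minimal check, assuming a hypothetical 1-D progress function Φ(s) = −|s| and two made-up paths of equal length:

```python
import math

def phi(s):
    return -abs(s)  # hypothetical progress function: closer to 0 is better

def shaping_total(path, gamma):
    # Discounted sum of F(s, s') = gamma * phi(s') - phi(s) along the path
    return sum(
        gamma**t * (gamma * phi(s_next) - phi(s))
        for t, (s, s_next) in enumerate(zip(path[:-1], path[1:]))
    )

gamma = 0.99
direct = [5, 4, 3, 2, 1, 0]  # walks straight to the goal
detour = [5, 6, 4, 3, 1, 0]  # wanders first; same start, end, and length

closed_form = gamma**5 * phi(0) - phi(5)
total_direct = shaping_total(direct, gamma)
total_detour = shaping_total(detour, gamma)
print(total_direct, total_detour, closed_form)
```

All three values agree up to floating-point rounding. Because the shaping bonus cannot favor one route over another, any preference between routes still comes from the real reward.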
```python
# Potential-based shaping: safe intermediate rewards
# F(s, s') = γ * Φ(s') − Φ(s)   where Φ measures "progress"

def potential(position, goal=np.array([0., 0.])):
    return -np.linalg.norm(position - goal)  # negative distance = higher potential near goal

gamma = 0.99
# Robot moving from (5,5) toward goal (0,0)
trajectory = np.array([[5., 5.], [4., 4.], [3., 3.], [2., 2.], [1., 1.], [0., 0.]])

print("Step  Position  Shaped reward (positive = moving toward goal)")
for t in range(len(trajectory) - 1):
    s, s_next = trajectory[t], trajectory[t + 1]
    shaped = gamma * potential(s_next) - potential(s)
    print(f"{t:4d}  {str(s):8s}  {shaped:+.3f}")
```

```
Step  Position  Shaped reward (positive = moving toward goal)
   0  [5. 5.]   +1.471   ← moving closer, positive reward
   1  [4. 4.]   +1.457
   2  [3. 3.]   +1.442
   3  [2. 2.]   +1.428
   4  [1. 1.]   +1.414
```

Pitfall: numerical instability

Reward signals can have wildly different scales. If “win the game” gives +1,000 and each step costs −0.001, the win signal swamps every other term in the gradient update when it finally arrives.
```python
import numpy as np

# Mixed-scale rewards: small step penalties + huge win bonus
raw_rewards = np.array([-0.001] * 39 + [1000.0])

print(f"Scale mismatch: {raw_rewards.max() / abs(raw_rewards.min()):.0f}x")
print(f"  Step penalty: {raw_rewards[0]}")
print(f"  Win bonus: {raw_rewards[-1]}")

# Fix 1: clip rewards to [-1, 1]
clipped = np.clip(raw_rewards, -1.0, 1.0)
print(f"\nClipped: max {clipped.max()}, min {clipped.min()}")
# Clipping caps the bonus at 1.0; the tiny step penalty is already in range

# Fix 2: normalize per episode
def normalize(rewards):
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

normed = normalize(raw_rewards)
print(f"Normalized: max {normed.max():.2f}, min {normed.min():.2f}")
# The win bonus is now about 39× a step penalty in magnitude, not 1,000,000×
```

```
Scale mismatch: 1000000x
  Step penalty: -0.001
  Win bonus: 1000.0

Clipped: max 1.0, min -0.001
Normalized: max 6.24, min -0.16   ← win bonus no longer 1M× bigger
```

Pitfall: delayed credit assignment

Which action caused the reward? If the reward arrives 500 steps after the action, the discounting in compute_returns above handles it mechanically, but with γ=0.99 over 500 steps, the credit is 0.99^500 ≈ 0.007. The signal is nearly zero. This is the credit assignment problem, and it’s why RL problems with long horizons are fundamentally harder than supervised learning.
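How far back credit reaches is set almost entirely by γ. The sketch below (helper names are illustrative) computes the credit left after 500 steps and the number of steps before credit falls below 1%, for a few discount factors.

```python
import math

def credit_after(steps, gamma):
    # Fraction of a reward that reaches an action taken `steps` earlier
    return gamma ** steps

def steps_until(fraction, gamma):
    # Smallest number of steps back at which credit falls below `fraction`
    return math.ceil(math.log(fraction) / math.log(gamma))

for gamma in [0.9, 0.99, 0.999]:
    print(f"gamma={gamma}: credit at 500 steps = {credit_after(500, gamma):.4f}, "
          f"below 1% after {steps_until(0.01, gamma)} steps")
```

Raising γ extends the horizon, at the cost of giving every earlier action, relevant or not, a larger share of the same late reward.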
Interview answer
“Reward functions become cost functions by negating them, but they introduce unique failure modes. I’d watch for three things: reward hacking (verify the written rule can’t be gamed without achieving the actual goal); sparse signals (add potential-based shaping so intermediate steps get credit); and scale mismatch (normalize or clip rewards so a late large bonus doesn’t explode the gradient when it arrives).”
Problem 9 — Reconstruction loss vs adversarial loss (generative models)
If the task is to generate something new (an image, a sentence, a molecule), you need a loss that says “how realistic is this?” That’s fundamentally harder than “how wrong is this number?”
Reconstruction loss: pixel-by-pixel accuracy
How it works: compare the generated output to the original example directly. For images, this is typically MSE per pixel. For text, it’s cross-entropy per token.

```python
import numpy as np

# 4-pixel "image" (simplified)
original  = np.array([0.9, 0.1, 0.8, 0.2])      # sharp: bright, dark, bright, dark
blurry    = np.array([0.5, 0.5, 0.5, 0.5])      # generator averaged everything
sharp_gen = np.array([0.85, 0.15, 0.75, 0.25])  # generator learned the pattern

def reconstruction_loss(generated, original):
    return ((generated - original)**2).mean()

print(f"Blurry output MSE: {reconstruction_loss(blurry, original):.4f}")
print(f"Sharp output MSE: {reconstruction_loss(sharp_gen, original):.4f}")
# The blurry output never commits to a pattern, yet its MSE stays moderate:
# reconstruction loss rewards "safe" averages
```

```
Blurry output MSE: 0.1250   ← moderate: averaging is never badly wrong
Sharp output MSE: 0.0025    ← lowest, but only if the generator commits to the right pattern
```

For comparison, a sharp guess with the pattern inverted ([0.1, 0.9, 0.2, 0.8]) would score 0.5000, four times worse than the blur. That asymmetry is what makes blur attractive to an uncertain model.

What it produces: outputs that are “safe”, averaging over all the possible correct things. For faces, this often means blurry images. The model is unsure whether hair should be lighter or darker, so it outputs a grey average of both, which keeps MSE down but looks fake.
Why it’s stable: there’s always a signal. Every generated example can be compared to a real one.
Used in: autoencoders, VAEs (variational autoencoders), early image generation.
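The averaging behavior follows directly from the math: for a set of equally plausible targets, the single prediction that minimizes MSE is their mean. A small sketch with made-up pixel values shows the hair example numerically.

```python
import numpy as np

# Hypothetical training targets for one pixel: half the faces have
# dark hair (0.2), half have light hair (0.8).
targets = np.array([0.2, 0.8, 0.2, 0.8, 0.2, 0.8])

candidates = np.linspace(0.0, 1.0, 101)  # every constant guess from 0.00 to 1.00
mse = np.array([((c - targets) ** 2).mean() for c in candidates])
best = candidates[mse.argmin()]

mse_commit = ((0.2 - targets) ** 2).mean()            # always predict dark hair
mse_hedge = ((targets.mean() - targets) ** 2).mean()  # predict the grey average

print(f"MSE-optimal constant prediction: {best:.2f}")
print(f"MSE if it commits to dark (0.2): {mse_commit:.3f}")
print(f"MSE if it hedges at the mean:    {mse_hedge:.3f}")
```

The optimum lands on 0.5, a value that appears in none of the training examples. Committing to either real hair color costs twice as much MSE as the grey hedge.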
Adversarial loss: fool the judge
How it works: train a second model called the discriminator whose job is to distinguish real examples from generated ones. The generator’s loss is low when the discriminator gets fooled.

```python
import numpy as np

def bce_loss(p, target):
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

# Discriminator outputs: probability that each image is real
d_on_real = np.array([0.92, 0.88, 0.95])  # real images → D should say ~1
d_on_fake = np.array([0.35, 0.28, 0.41])  # generated → D should say ~0

# Discriminator wants to be right on both:
disc_loss = (bce_loss(d_on_real, np.ones(3)) +
             bce_loss(d_on_fake, np.zeros(3)))

# Generator wants D to say its fake images are real:
gen_loss = bce_loss(d_on_fake, np.ones(3))

print(f"Discriminator loss: {disc_loss:.3f} (lower = D better at spotting fakes)")
print(f"Generator loss: {gen_loss:.3f} (lower = G better at fooling D)")
```

```
Discriminator loss: 0.516 (lower = D better at spotting fakes)
Generator loss: 1.071 (lower = G better at fooling D)
```

Both models train simultaneously: a constant competition.
What it produces: sharp, realistic outputs. The discriminator can detect blurriness as “fake,” so the generator learns to produce sharply detailed images.
Why it’s unstable: the generator and discriminator must stay balanced. If the discriminator gets too good, it rejects everything and the generator gets no useful gradient. If the generator gets too good, the discriminator collapses and loses all discriminating ability.
Mode collapse: the generator discovers one or two outputs that always fool the discriminator and starts producing only those, ignoring the full variety of real data.
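The “no useful gradient” failure is easiest to see in the original minimax form of the generator loss, log(1 − D), where D is the discriminator's probability that a generated sample is real. The sketch below compares its gradient with that of the −log(D) form used by `gen_loss` above, both taken with respect to the discriminator's logit. The function names are illustrative, and this covers only the gradient at the discriminator's output, not every source of GAN instability.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# a = discriminator's logit on a generated sample; D = sigmoid(a).
def grad_minimax(a):
    # loss = log(1 - D)  →  d(loss)/da = -D
    return -sigmoid(a)

def grad_non_saturating(a):
    # loss = -log(D)     →  d(loss)/da = -(1 - D)
    return -(1.0 - sigmoid(a))

for a in [0.0, -4.0, -8.0]:  # more negative = discriminator more sure it's fake
    print(f"D={sigmoid(a):.4f}  minimax grad={grad_minimax(a):+.4f}  "
          f"non-saturating grad={grad_non_saturating(a):+.4f}")
```

As the discriminator grows more confident that a sample is fake, the minimax gradient shrinks toward zero, while the −log(D) form keeps a gradient close to −1. That is one reason the second form is commonly preferred in practice.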
Side-by-side comparison
| | Reconstruction loss | Adversarial loss |
|---|---|---|
| What it measures | Pixel/token similarity to original | Whether a judge can tell real from fake |
| Output quality | Blurry, averaged | Sharp, realistic |
| Training stability | Stable (always has signal) | Unstable (requires balance) |
| Main failure mode | Blurriness (averaging valid answers) | Mode collapse, training instability |
| Common in | VAE, autoencoder | GAN |
In practice: combine them
Many practical generative systems use both:

```python
lambda_adv = 0.1  # how much adversarial loss contributes

rec_loss = reconstruction_loss(blurry, original)
total = rec_loss + lambda_adv * gen_loss

print(f"Reconstruction loss: {rec_loss:.4f}")
print(f"Adversarial loss: {gen_loss:.4f} (× {lambda_adv})")
print(f"Total loss: {total:.4f}")
# Tune lambda_adv: higher → sharper but less stable; lower → blurrier but stable
```

- Reconstruction provides a stable base signal and ensures the output is at least in the right ballpark.
- Adversarial adds the sharpness and realism the reconstruction loss can’t provide.

Diffusion models (the approach behind many modern image generators) sidestep this trade-off with a simple noise-prediction objective: the network regresses the noise that was added to a training example, with no discriminator involved, which is why they’re much more stable to train than GANs.
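For a sense of what a noise-prediction objective looks like, here is a minimal NumPy sketch in the style of the simplified DDPM loss. The two “models” are stand-ins invented for this example (one guesses zero noise, one cheats by knowing the clean input), and `alpha_bar` is the usual signal-retention factor at a chosen noise level.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sample(x0, alpha_bar, rng):
    # Forward process: blend the clean example with Gaussian noise
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def noise_prediction_loss(predict_noise, x0, alpha_bar, rng):
    # Plain MSE between the true noise and the model's guess
    x_t, eps = noisy_sample(x0, alpha_bar, rng)
    return ((predict_noise(x_t, alpha_bar) - eps) ** 2).mean()

x0 = np.array([0.9, 0.1, 0.8, 0.2])  # the 4-pixel "image" from earlier

def guess_zero(x_t, alpha_bar):
    # Stand-in model: always predicts "no noise"
    return np.zeros_like(x_t)

def oracle(x_t, alpha_bar):
    # Stand-in model that cheats: knows x0 and inverts the forward process
    return (x_t - np.sqrt(alpha_bar) * x0) / np.sqrt(1.0 - alpha_bar)

loss_zero = noise_prediction_loss(guess_zero, x0, alpha_bar=0.5, rng=rng)
loss_oracle = noise_prediction_loss(oracle, x0, alpha_bar=0.5, rng=rng)
print(f"zero-guess loss: {loss_zero:.4f}, oracle loss: {loss_oracle:.2e}")
```

The target is always available and always well-defined, so every training step has a signal, and there is no second network to keep in balance.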
Interview answer
“Reconstruction loss measures pixel-level accuracy, which is easy to train but produces blurry outputs because it averages over all valid answers. Adversarial loss trains a discriminator to catch fakes, which produces sharp outputs but is notoriously unstable and prone to mode collapse. In practice I’d combine them (reconstruction loss for stability, adversarial loss for realism) and tune the balance weight. For modern work I’d consider a diffusion-based objective, which avoids both failure modes.”
Loss isn’t decreasing at all
- Try to overfit one tiny batch (5–10 examples). If the model can’t memorize 5 examples, something is broken.
- Verify the target values — wrong dtype, wrong shape, or off-by-one in labels.
- Manually make a better prediction and check that the loss goes down. If it doesn’t, the loss function has a bug.
- Check gradients — if they’re zero, the model isn’t learning.
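The first check in this list can be sketched in a few lines. The example below trains a plain logistic regression on six hand-made, linearly separable points (hypothetical data) and expects the loss to fall close to zero; with a real model the idea is the same, only the forward pass changes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Six hand-made, linearly separable examples (hypothetical data)
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -0.5],
              [1.0, 2.0], [2.0, 1.0], [0.5, 1.5]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w, b = np.zeros(2), 0.0
first_loss = None
for step in range(500):
    p = np.clip(sigmoid(X @ w + b), 1e-7, 1 - 1e-7)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
    if first_loss is None:
        first_loss = loss
    d_logit = (p - y) / len(y)   # gradient of mean BCE w.r.t. each logit
    w -= 0.5 * (X.T @ d_logit)   # gradient descent on the weights
    b -= 0.5 * d_logit.sum()     # and on the bias

print(f"loss at start: {first_loss:.4f}, loss after 500 steps: {loss:.4f}")
```

If a model cannot drive the loss near zero on a batch this small, the problem is in the pipeline (labels, loss, gradients, learning rate), not in the amount of data.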
Loss explodes or goes NaN
- Check for `log(0)`: use the library’s numerically stable versions (BCE with logits, log-sum-exp).
- Look at the loss scale: using `sum` over a batch instead of `mean` makes the loss grow with batch size.
- Check for division by a very small variance in probabilistic losses.
- Gradient clipping can contain occasional spikes, but it shouldn’t hide a fundamentally broken loss.
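As an illustration of the log-sum-exp point, the sketch below computes softmax cross-entropy two ways on deliberately extreme logits. Both function names are made up for this example; in practice a library's built-in cross-entropy on logits already does the stable version.

```python
import numpy as np

def softmax_ce_naive(logits, label):
    # Exponentiate first, normalize, then take the log
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        probs = np.exp(logits) / np.exp(logits).sum()
        return -np.log(probs[label])

def softmax_ce_stable(logits, label):
    # log-sum-exp trick: subtract the max before exponentiating
    shifted = logits - logits.max()
    log_sum_exp = np.log(np.exp(shifted).sum())
    return log_sum_exp - shifted[label]

logits = np.array([1000.0, 0.0, -1000.0])
naive = softmax_ce_naive(logits, label=0)
stable = softmax_ce_stable(logits, label=0)
print(f"naive: {naive}, stable: {stable}")
```

The naive version overflows to inf/inf and returns NaN, which then poisons every parameter on the next update. The shifted version returns the correct loss of 0 for the same inputs.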
The interview framework
When asked “why this loss?”, answer in this order:
1. Output: “The model outputs a probability / a continuous value / one of N classes.”
2. Data shape: “The data has / doesn’t have class imbalance / outliers / multiple valid outcomes.”
3. Mistake cost: “The expensive mistake here is false negatives / large errors / confident wrong predictions.”
4. Loss choice: “So I’d use X because it penalizes Y specifically.”
5. Tradeoff: “The risk with X is Z, so I’d watch for it.”
6. Implementation: “I’d implement it from logits / with log-sum-exp to avoid numerical issues.”
Example answer (fraud detection)
“The model outputs a fraud probability. At 0.1% fraud rate, standard accuracy is useless: a model that says ‘not fraud’ every time scores 99.9%. I’d evaluate with precision-recall instead. If missing fraud costs much more than a false alarm, I’d encode that with weighted BCE, setting the fraud weight proportional to the cost ratio. If easy legitimate examples dominate training, I’d add focal loss on top. I’d implement both from logits, not from probabilities, and monitor calibration: if the model says 20% fraud, roughly 20% of those should actually be fraud.”
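The accuracy claim in that answer is easy to verify. The sketch below builds 100,000 hypothetical transactions at a 0.1% fraud rate and scores a “model” that never flags anything.

```python
import numpy as np

# 100,000 hypothetical transactions at a 0.1% fraud rate
labels = np.zeros(100_000, dtype=int)
labels[:100] = 1

always_legit = np.zeros_like(labels)  # a "model" that never flags anything

accuracy = (always_legit == labels).mean()
true_pos = ((always_legit == 1) & (labels == 1)).sum()
recall = true_pos / labels.sum()

print(f"accuracy: {accuracy:.3%}")
print(f"recall on fraud: {recall:.1%}")
```

Accuracy comes out at 99.9% while recall on the fraud class is zero, which is why the evaluation metric has to be chosen alongside the loss.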
The core mental model
Start from the mistake you need the model to care about. Then choose the loss that makes that mistake expensive.
| Mistake you want to avoid | Loss that punishes it |
|---|---|
| Big prediction errors (regression) | MSE |
| Outlier-driven distortion | MAE or Huber |
| Confident wrong probability | Binary cross-entropy |
| Missing the rare class | Weighted BCE |
| Easy examples drowning hard ones | Focal loss |
| Multiple valid answers averaged together | Mixture / multi-modal loss |
| Model too complex, generalizes poorly | + L2 regularization (shrinks all weights) |
| Many correlated features, want selection | + L1 regularization (zeros out weak features) |
| Need both stability and interpretability | + Elastic net (L1 + L2 combined) |
| Overfitting to a rare upweighted class | + L2 to limit noise amplification |
| Blurry generative output | + Adversarial loss on top of reconstruction |
| Unstable GAN training | Switch to reconstruction loss; consider diffusion objective |
| Reward hacking in RL | Redesign reward; use potential-based shaping |
| Sparse RL reward | Add intermediate shaped rewards; curriculum learning |