2 layer classifier
1. Interview structure
Section titled “1. Interview structure”I would say:
“We have a 2-layer classifier:
Let:
X: (N, D)W1: (D, H)b1: (H,)W2: (H, C)b2: (C,)y: (N,) integer class labelsThe key simplification is:
Then everything else is normal chain rule.”
Interview Q1
Section titled “Interview Q1”Perfect Accuracy ⇒ Loss Bounds – If a classifier achieves 100% accuracy on its training set, what are the minimum and maximum possible values of the cross-entropy loss on a single example ? (Hint: with perfect accuracy, the true class probability is 1, so loss can be arbitrarily low or high depending on confidence).I would answer by first clarifying the assumption.
For one example, cross-entropy is:
where (p_y) is the model probability assigned to the true class.
Minimum possible loss
Section titled “Minimum possible loss”If the model assigns probability 1 to the true class:
then:
So the minimum is: 0
Maximum possible loss
Section titled “Maximum possible loss”This depends on what we mean by “100% accuracy.”
If prediction is based on argmax of the softmax probabilities, then the true class must have the highest probability.
For (C) classes, the smallest possible winning probability is about:
So the largest loss while still being correct is approximately:
Example for binary classification:
Correct prediction, but loss is:
So for binary classification, max loss under correct argmax is about:
Important caveat
Section titled “Important caveat”If “accuracy” is computed from some separate hard prediction rule, not from the probability used in cross-entropy, then the loss could be arbitrarily high if :
So my interview answer would be:
Minimum is 0, when the model puts probability 1 on the true class. If correctness is determined by softmax argmax over (C) classes, the maximum single-example loss while still correct is about (\log C). But if the hard prediction is decoupled from the probability used for cross-entropy, the loss can be arbitrarily large.
2. Gradients
Section titled “2. Gradients”Forward:
Z1 = X @ W1 + b1A1 = max(0, Z1)Z2 = A1 @ W2 + b2P = softmax(Z2)L = -mean(log(P[range(N), y]))Backward:
dZ2 = PdZ2[range(N), y] -= 1dZ2 /= N
dW2 = A1.T @ dZ2db2 = sum(dZ2, axis=0)
dA1 = dZ2 @ W2.TdZ1 = dA1 * (Z1 > 0)
dW1 = X.T @ dZ1db1 = sum(dZ1, axis=0)Update:
W1 -= lr * dW1b1 -= lr * db1W2 -= lr * dW2b2 -= lr * db23. Clean NumPy implementation
Section titled “3. Clean NumPy implementation”import numpy as np
def init_params(D, H, C, seed=0): rng = np.random.default_rng(seed) return { "W1": 0.01 * rng.standard_normal((D, H)), "b1": np.zeros(H), "W2": 0.01 * rng.standard_normal((H, C)), "b2": np.zeros(C), }
def forward_loss(X, y, params): W1, b1 = params["W1"], params["b1"] W2, b2 = params["W2"], params["b2"]
Z1 = X @ W1 + b1 A1 = np.maximum(0, Z1) Z2 = A1 @ W2 + b2
# stable softmax shifted = Z2 - np.max(Z2, axis=1, keepdims=True) exp_scores = np.exp(shifted) P = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
N = X.shape[0] loss = -np.mean(np.log(P[np.arange(N), y] + 1e-12))
cache = { "X": X, "y": y, "Z1": Z1, "A1": A1, "P": P, }
return loss, cache
def backward(cache, params): X = cache["X"] y = cache["y"] Z1 = cache["Z1"] A1 = cache["A1"] P = cache["P"]
W2 = params["W2"] N = X.shape[0]
dZ2 = P.copy() dZ2[np.arange(N), y] -= 1 dZ2 /= N
dW2 = A1.T @ dZ2 db2 = np.sum(dZ2, axis=0)
dA1 = dZ2 @ W2.T dZ1 = dA1 * (Z1 > 0)
dW1 = X.T @ dZ1 db1 = np.sum(dZ1, axis=0)
return { "W1": dW1, "b1": db1, "W2": dW2, "b2": db2, }
def train_step(X, y, params, lr=1e-1): loss, cache = forward_loss(X, y, params) grads = backward(cache, params)
for k in params: params[k] -= lr * grads[k]
return loss, grads4. Gradient checking
Section titled “4. Gradient checking”Use central difference:
def rel_error(a, b): return np.max( np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)) )
def grad_check(X, y, params, grads, h=1e-5): errors = {}
for name in params: p = params[name] grad_num = np.zeros_like(p)
it = np.nditer(p, flags=["multi_index"], op_flags=["readwrite"])
while not it.finished: idx = it.multi_index old = p[idx]
p[idx] = old + h loss_plus, _ = forward_loss(X, y, params)
p[idx] = old - h loss_minus, _ = forward_loss(X, y, params)
p[idx] = old
grad_num[idx] = (loss_plus - loss_minus) / (2 * h)
it.iternext()
errors[name] = rel_error(grad_num, grads[name])
return errorsToy test:
np.random.seed(1)
N, D, H, C = 5, 4, 10, 3
X = np.random.randn(N, D).astype(np.float64)y = np.array([0, 1, 2, 2, 1])
params = init_params(D, H, C, seed=42)
loss, cache = forward_loss(X, y, params)grads = backward(cache, params)errors = grad_check(X, y, params, grads)
print("loss:", loss)for k, v in errors.items(): print(k, v)Expected output should be around:
loss: 1.09866W1 1e-7b1 1e-8W2 1e-8b2 1e-10Anything below around 1e-5 is usually fine for this tiny network. 1e-7 or better is very good.
5. Debugging checklist interviewers care about
Section titled “5. Debugging checklist interviewers care about”Most common bugs:
1. Forgot dZ2 /= N Symptom: gradients are exactly N times too large.
2. Modified P in-place without copy Wrong: dZ2 = P Correct: dZ2 = P.copy()
3. Wrong label shape y should be shape (N,), not (N, 1). Correct: P[np.arange(N), y]
4. Bad ReLU mask Correct: dZ1 = dA1 * (Z1 > 0)
5. Softmax not numerically stable Correct: shifted = logits - max(logits)
6. Using float32 for gradient check Prefer float64 for finite differences.
7. Updating params before gradient check Check gradients before applying SGD update.A strong interview line:
“If gradient check fails, I first compare shapes, then check whether the error is a constant factor like N, then isolate layer-by-layer: check dZ2, then dW2/db2, then ReLU mask, then dW1/db1.”