Skip to content

2 layer classifier

I would say:

“We have a 2-layer classifier:

XZ1=XW1+b1A1=ReLU(Z1)Z2=A1W2+b2softmaxCEX \rightarrow Z_1 = XW_1 + b_1 \rightarrow A_1 = ReLU(Z_1) \rightarrow Z_2 = A_1W_2 + b_2 \rightarrow softmax \rightarrow CE

Let:

X: (N, D)
W1: (D, H)
b1: (H,)
W2: (H, C)
b2: (C,)
y: (N,) integer class labels

The key simplification is:

LZ2=softmax(Z2)onehot(y)N\frac{\partial L}{\partial Z_2} = \frac{softmax(Z_2) - onehot(y)}{N}

Then everything else is normal chain rule.”


Perfect Accuracy ⇒ Loss Bounds – If a classifier achieves 100% accuracy on its training set, what are the minimum and maximum possible values of the cross-entropy loss on a single example ? (Hint: with perfect accuracy, the true class probability is 1, so loss can be arbitrarily low or high depending on confidence).

I would answer by first clarifying the assumption.

For one example, cross-entropy is:

L=logpyL = -\log p_y

where (p_y) is the model probability assigned to the true class.

If the model assigns probability 1 to the true class:

py=1p_y = 1

then:

L=log(1)=0L = -\log(1) = 0

So the minimum is: 0

This depends on what we mean by “100% accuracy.”

If prediction is based on argmax of the softmax probabilities, then the true class must have the highest probability.

For (C) classes, the smallest possible winning probability is about:

py1Cp_y \approx \frac{1}{C}

So the largest loss while still being correct is approximately:

log1C=logC-\log \frac{1}{C} = \log C

Example for binary classification:

py=0.5001p_y = 0.5001

Correct prediction, but loss is:

log(0.5001)0.693-\log(0.5001) \approx 0.693

So for binary classification, max loss under correct argmax is about:

log2\log 2

If “accuracy” is computed from some separate hard prediction rule, not from the probability used in cross-entropy, then the loss could be arbitrarily high if py0p_y \to 0:

log(py)-\log(p_y) \to \infty

So my interview answer would be:

Minimum is 0, when the model puts probability 1 on the true class. If correctness is determined by softmax argmax over (C) classes, the maximum single-example loss while still correct is about (\log C). But if the hard prediction is decoupled from the probability used for cross-entropy, the loss can be arbitrarily large.

Forward:

Z1 = X @ W1 + b1
A1 = max(0, Z1)
Z2 = A1 @ W2 + b2
P = softmax(Z2)
L = -mean(log(P[range(N), y]))

Backward:

dZ2 = P
dZ2[range(N), y] -= 1
dZ2 /= N
dW2 = A1.T @ dZ2
db2 = sum(dZ2, axis=0)
dA1 = dZ2 @ W2.T
dZ1 = dA1 * (Z1 > 0)
dW1 = X.T @ dZ1
db1 = sum(dZ1, axis=0)

Update:

W1 -= lr * dW1
b1 -= lr * db1
W2 -= lr * dW2
b2 -= lr * db2

import numpy as np
def init_params(D, H, C, seed=0):
rng = np.random.default_rng(seed)
return {
"W1": 0.01 * rng.standard_normal((D, H)),
"b1": np.zeros(H),
"W2": 0.01 * rng.standard_normal((H, C)),
"b2": np.zeros(C),
}
def forward_loss(X, y, params):
W1, b1 = params["W1"], params["b1"]
W2, b2 = params["W2"], params["b2"]
Z1 = X @ W1 + b1
A1 = np.maximum(0, Z1)
Z2 = A1 @ W2 + b2
# stable softmax
shifted = Z2 - np.max(Z2, axis=1, keepdims=True)
exp_scores = np.exp(shifted)
P = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
N = X.shape[0]
loss = -np.mean(np.log(P[np.arange(N), y] + 1e-12))
cache = {
"X": X,
"y": y,
"Z1": Z1,
"A1": A1,
"P": P,
}
return loss, cache
def backward(cache, params):
X = cache["X"]
y = cache["y"]
Z1 = cache["Z1"]
A1 = cache["A1"]
P = cache["P"]
W2 = params["W2"]
N = X.shape[0]
dZ2 = P.copy()
dZ2[np.arange(N), y] -= 1
dZ2 /= N
dW2 = A1.T @ dZ2
db2 = np.sum(dZ2, axis=0)
dA1 = dZ2 @ W2.T
dZ1 = dA1 * (Z1 > 0)
dW1 = X.T @ dZ1
db1 = np.sum(dZ1, axis=0)
return {
"W1": dW1,
"b1": db1,
"W2": dW2,
"b2": db2,
}
def train_step(X, y, params, lr=1e-1):
loss, cache = forward_loss(X, y, params)
grads = backward(cache, params)
for k in params:
params[k] -= lr * grads[k]
return loss, grads

Use central difference:

LθiL(θi+h)L(θih)2h\frac{\partial L}{\partial \theta_i} \approx \frac{L(\theta_i + h) - L(\theta_i - h)}{2h}
def rel_error(a, b):
return np.max(
np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b))
)
def grad_check(X, y, params, grads, h=1e-5):
errors = {}
for name in params:
p = params[name]
grad_num = np.zeros_like(p)
it = np.nditer(p, flags=["multi_index"], op_flags=["readwrite"])
while not it.finished:
idx = it.multi_index
old = p[idx]
p[idx] = old + h
loss_plus, _ = forward_loss(X, y, params)
p[idx] = old - h
loss_minus, _ = forward_loss(X, y, params)
p[idx] = old
grad_num[idx] = (loss_plus - loss_minus) / (2 * h)
it.iternext()
errors[name] = rel_error(grad_num, grads[name])
return errors

Toy test:

np.random.seed(1)
N, D, H, C = 5, 4, 10, 3
X = np.random.randn(N, D).astype(np.float64)
y = np.array([0, 1, 2, 2, 1])
params = init_params(D, H, C, seed=42)
loss, cache = forward_loss(X, y, params)
grads = backward(cache, params)
errors = grad_check(X, y, params, grads)
print("loss:", loss)
for k, v in errors.items():
print(k, v)

Expected output should be around:

loss: 1.09866
W1 1e-7
b1 1e-8
W2 1e-8
b2 1e-10

Anything below around 1e-5 is usually fine for this tiny network. 1e-7 or better is very good.


5. Debugging checklist interviewers care about

Section titled “5. Debugging checklist interviewers care about”

Most common bugs:

1. Forgot dZ2 /= N
Symptom: gradients are exactly N times too large.
2. Modified P in-place without copy
Wrong:
dZ2 = P
Correct:
dZ2 = P.copy()
3. Wrong label shape
y should be shape (N,), not (N, 1).
Correct:
P[np.arange(N), y]
4. Bad ReLU mask
Correct:
dZ1 = dA1 * (Z1 > 0)
5. Softmax not numerically stable
Correct:
shifted = logits - max(logits)
6. Using float32 for gradient check
Prefer float64 for finite differences.
7. Updating params before gradient check
Check gradients before applying SGD update.

A strong interview line:

“If gradient check fails, I first compare shapes, then check whether the error is a constant factor like N, then isolate layer-by-layer: check dZ2, then dW2/db2, then ReLU mask, then dW1/db1.”