2-layer regression
1. Forward pass shapes
Assume:

    X:  [B, D]
    W1: [D, H]
    b1: [H]
    W2: [H, C]
    b2: [C]
    y:  [B] integer class labels

Forward:

    Z1 = X @ W1 + b1                      # [B, H]
    A1 = ReLU(Z1)                         # [B, H]
    Z2 = A1 @ W2 + b2                     # [B, C], logits
    loss = softmax_cross_entropy(Z2, y)

Stable softmax cross entropy (subtracting the row-wise max leaves the softmax unchanged but keeps exp from overflowing):

    shifted = Z2 - max(Z2, axis=1, keepdims=True)
    probs = exp(shifted) / sum(exp(shifted), axis=1, keepdims=True)
    loss = mean(-log(probs[range(B), y]))

2. Backward pass derivation
Key shortcut:

    dZ2 = probs
    dZ2[range(B), y] -= 1
    dZ2 /= B

Then:

    dW2 = A1.T @ dZ2          # [H, C]
    db2 = sum(dZ2, axis=0)    # [C]

    dA1 = dZ2 @ W2.T          # [B, H]
    dZ1 = dA1 * (Z1 > 0)      # [B, H]

    dW1 = X.T @ dZ1           # [D, H]
    db1 = sum(dZ1, axis=0)    # [H]

    dX = dZ1 @ W1.T           # [B, D]

Main interview invariant:
Gradient of a tensor has the same shape as that tensor. So:

    dW1.shape == W1.shape
    db1.shape == b1.shape
    dW2.shape == W2.shape
    db2.shape == b2.shape
    dX.shape == X.shape

3. NumPy implementation
    import numpy as np

    def forward_backward(X, y, W1, b1, W2, b2):
        """
        X:  [B, D]
        y:  [B]
        W1: [D, H]
        b1: [H]
        W2: [H, C]
        b2: [C]
        """
        B = X.shape[0]

        # ----- forward -----
        Z1 = X @ W1 + b1                  # [B, H]
        A1 = np.maximum(0, Z1)            # [B, H]
        logits = A1 @ W2 + b2             # [B, C]

        # stable softmax
        shifted = logits - np.max(logits, axis=1, keepdims=True)
        exp_scores = np.exp(shifted)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        loss = -np.log(probs[np.arange(B), y]).mean()

        # ----- backward -----
        dlogits = probs.copy()
        dlogits[np.arange(B), y] -= 1
        dlogits /= B                      # because loss is mean over batch

        dW2 = A1.T @ dlogits              # [H, C]
        db2 = np.sum(dlogits, axis=0)     # [C]

        dA1 = dlogits @ W2.T              # [B, H]
        dZ1 = dA1 * (Z1 > 0)              # [B, H]

        dW1 = X.T @ dZ1                   # [D, H]
        db1 = np.sum(dZ1, axis=0)         # [H]

        dX = dZ1 @ W1.T                   # [B, D]

        grads = {
            "dX": dX,
            "dW1": dW1,
            "db1": db1,
            "dW2": dW2,
            "db2": db2,
        }

        cache = {
            "Z1": Z1,
            "A1": A1,
            "logits": logits,
            "probs": probs,
        }

        return loss, grads, cache

Example usage:
    np.random.seed(0)

    B, D, H, C = 4, 5, 10, 3

    X = np.random.randn(B, D)
    y = np.array([0, 2, 1, 2])

    W1 = 0.01 * np.random.randn(D, H)
    b1 = np.zeros(H)

    W2 = 0.01 * np.random.randn(H, C)
    b2 = np.zeros(C)

    loss, grads, cache = forward_backward(X, y, W1, b1, W2, b2)

    print(loss)
    for k, v in grads.items():
        print(k, v.shape)

Expected shapes:

    dX  [B, D]
    dW1 [D, H]
    db1 [H]
    dW2 [H, C]
    db2 [C]

4. Parameter update
    lr = 1e-1

    W1 -= lr * grads["dW1"]
    b1 -= lr * grads["db1"]
    W2 -= lr * grads["dW2"]
    b2 -= lr * grads["db2"]

Do not update X unless you explicitly want gradients with respect to the input.
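A minimal sketch of repeating this update in a loop. It assumes the forward_backward function from section 3, restated compactly here (parameter gradients only) so the block runs on its own. The toy data, learning rate, and step count are illustrative assumptions, not tuned values.

```python
import numpy as np

def forward_backward(X, y, W1, b1, W2, b2):
    # Compact restatement of the section 3 function: loss and parameter grads.
    B = X.shape[0]
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)
    logits = A1 @ W2 + b2
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(B), y]).mean()

    dlogits = probs.copy()
    dlogits[np.arange(B), y] -= 1
    dlogits /= B
    dA1 = dlogits @ W2.T
    dZ1 = dA1 * (Z1 > 0)
    grads = {
        "dW1": X.T @ dZ1,
        "db1": dZ1.sum(axis=0),
        "dW2": A1.T @ dlogits,
        "db2": dlogits.sum(axis=0),
    }
    return loss, grads

rng = np.random.default_rng(0)
B, D, H, C = 4, 5, 10, 3
X = rng.standard_normal((B, D))
y = np.array([0, 2, 1, 2])
W1 = 0.01 * rng.standard_normal((D, H))
b1 = np.zeros(H)
W2 = 0.01 * rng.standard_normal((H, C))
b2 = np.zeros(C)

lr = 1e-1
losses = []
for step in range(200):
    loss, grads = forward_backward(X, y, W1, b1, W2, b2)
    losses.append(loss)
    # In-place SGD step on the parameters only; X is left untouched.
    W1 -= lr * grads["dW1"]
    b1 -= lr * grads["db1"]
    W2 -= lr * grads["dW2"]
    b2 -= lr * grads["db2"]

print(losses[0], losses[-1])  # the loss should go down on this toy batch
```

The first loss is close to log(C), since small random weights give nearly uniform predictions.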
5. PyTorch autograd version
Using the same layout as NumPy:

    import torch
    import torch.nn.functional as F

    B, D, H, C = 4, 5, 10, 3

    X = torch.randn(B, D, requires_grad=True)
    y = torch.tensor([0, 2, 1, 2])

    # Scale first, then call requires_grad_(), so W1 and W2 are leaf tensors.
    # torch.randn(..., requires_grad=True) * 0.01 would be a non-leaf result,
    # and backward() would not populate its .grad.
    W1 = (0.01 * torch.randn(D, H)).requires_grad_()
    b1 = torch.zeros(H, requires_grad=True)

    W2 = (0.01 * torch.randn(H, C)).requires_grad_()
    b2 = torch.zeros(C, requires_grad=True)
    # forward
    Z1 = X @ W1 + b1
    A1 = F.relu(Z1)
    logits = A1 @ W2 + b2

    loss = F.cross_entropy(logits, y)

    # backward
    loss.backward()

    print(loss.item())
    print(X.grad.shape)     # [B, D]
    print(W1.grad.shape)    # [D, H]
    print(b1.grad.shape)    # [H]
    print(W2.grad.shape)    # [H, C]
    print(b2.grad.shape)    # [C]

PyTorch’s F.cross_entropy(logits, y) internally does:

    log_softmax(logits) + negative log likelihood loss

So you should pass raw logits, not softmax probabilities.
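A small check of that decomposition, as a sketch: it compares F.cross_entropy against F.log_softmax followed by F.nll_loss on random logits. The tensor sizes and seed are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # [B, C] raw scores
y = torch.tensor([0, 2, 1, 2])        # [B] integer class IDs

loss_ce = F.cross_entropy(logits, y)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), y)

# Feeding probabilities applies softmax twice and gives a different loss.
loss_double = F.cross_entropy(torch.softmax(logits, dim=1), y)

print(loss_ce.item(), loss_manual.item(), loss_double.item())
```

The first two values should agree up to floating-point error, while the third generally does not.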
Wrong:

    loss = F.cross_entropy(torch.softmax(logits, dim=1), y)

Correct:

    loss = F.cross_entropy(logits, y)

6. How PyTorch autograd works
When you do:

    Z1 = X @ W1 + b1
    A1 = F.relu(Z1)
    logits = A1 @ W2 + b2
    loss = F.cross_entropy(logits, y)

PyTorch builds a dynamic computation graph.

Each tensor remembers:

- how it was created
- which tensors created it
- how to backprop through that operation

When you call:

    loss.backward()

PyTorch walks the graph backward and applies the chain rule.
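So internally it computes the same gradients:

- dlogits
- dW2, db2
- dA1
- dZ1 through ReLU mask
- dW1, db1
- dX

Gradients are stored in:

    X.grad
    W1.grad
    b1.grad
    W2.grad
    b2.grad

Important: gradients accumulate.

So this:

    loss.backward(retain_graph=True)
    loss.backward()

would add gradients twice unless you clear them in between. (Without retain_graph=True on the first call, the second call on the same graph typically raises a RuntimeError, because the saved intermediate values are freed after one backward pass. In practice, accumulation shows up across training iterations, where each iteration builds a new graph.)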
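A minimal sketch of the accumulation behaviour on a toy parameter. The function loss_fn and the values of w are made up for illustration; each call builds a fresh graph, as a training loop would.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def loss_fn(w):
    # Hypothetical toy loss; its gradient is 2 * w.
    return (w ** 2).sum()

loss_fn(w).backward()
first = w.grad.clone()         # 2 * w

loss_fn(w).backward()          # no clearing in between: gradients are added
accumulated = w.grad.clone()   # 4 * w

w.grad.zero_()                 # manual clearing, the role optimizer.zero_grad() plays
loss_fn(w).backward()
after_reset = w.grad.clone()   # back to 2 * w

print(first, accumulated, after_reset)
```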
With an optimizer:

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

7. Common pitfalls

1. Forgetting / B

If loss is averaged over batch, then:

    dlogits /= B

Without this, gradients are too large by factor B.
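One way to catch this bug is a finite-difference check on the logits. The sketch below uses hypothetical helper names (mean_ce, analytic_grad, numeric_grad) and arbitrary random data; it compares the analytic gradient with and without the division by B against a central-difference estimate of the mean loss.

```python
import numpy as np

def mean_ce(logits, y):
    # Stable softmax cross entropy, averaged over the batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def analytic_grad(logits, y, divide_by_B=True):
    B = len(y)
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    d = probs.copy()
    d[np.arange(B), y] -= 1
    return d / B if divide_by_B else d

def numeric_grad(f, x, eps=1e-6):
    # Central differences, one entry at a time (fine for tiny arrays).
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
B, C = 4, 3
logits = rng.standard_normal((B, C))
y = np.array([0, 2, 1, 2])

num = numeric_grad(lambda z: mean_ce(z, y), logits)
with_B = analytic_grad(logits, y, divide_by_B=True)
without_B = analytic_grad(logits, y, divide_by_B=False)

print(np.abs(num - with_B).max())         # tiny: matches the mean loss
print(np.abs(num * B - without_B).max())  # tiny: the other version is B times too large
```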
2. Wrong label shape
Correct:

    y.shape == [B]

Wrong:

    y.shape == [B, 1]

For PyTorch cross_entropy, labels should be integer class IDs, not one-hot vectors.
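In the NumPy implementation from section 3, the same mistake does not raise an error; it silently changes what the fancy indexing selects. A short sketch with dummy probabilities (the values are placeholders):

```python
import numpy as np

B, C = 4, 3
probs = np.full((B, C), 1.0 / C)      # dummy uniform probabilities, [B, C]

y_good = np.array([0, 2, 1, 2])       # shape (B,)
y_bad = y_good.reshape(B, 1)          # shape (B, 1)

picked_good = probs[np.arange(B), y_good]
picked_bad = probs[np.arange(B), y_bad]

print(picked_good.shape)  # (4,): one probability per example
print(picked_bad.shape)   # (4, 4): the index arrays broadcast against each other
```

With y of shape [B, 1], the mean is then taken over B * B entries instead of B, so the loss value is wrong without any exception. Flattening labels with y.reshape(-1) before indexing avoids this.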
3. Applying softmax before cross entropy
Wrong:

    probs = torch.softmax(logits, dim=1)
    loss = F.cross_entropy(probs, y)

Correct:

    loss = F.cross_entropy(logits, y)

4. Bias broadcasting confusion
Forward:

    Z1 = X @ W1 + b1

b1 broadcasts from [H] to [B, H].

Backward:

    db1 = dZ1.sum(axis=0)

Because every row used the same shared bias.
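A quick numerical confirmation, as a sketch: G stands in for an arbitrary upstream gradient dZ1, and scalar_loss is a made-up loss chosen so that its gradient with respect to Z1 is exactly G.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, H = 4, 5, 3
X = rng.standard_normal((B, D))
W1 = rng.standard_normal((D, H))
b1 = rng.standard_normal(H)
G = rng.standard_normal((B, H))       # stand-in for dZ1

def scalar_loss(b):
    Z1 = X @ W1 + b                   # b broadcasts from [H] to [B, H]
    return (Z1 * G).sum()             # dL/dZ1 == G by construction

db1 = G.sum(axis=0)                   # sum over the batch axis, shape [H]

# Central-difference estimate of dL/db1, one component at a time.
eps = 1e-6
numeric = np.zeros(H)
for j in range(H):
    step = np.zeros(H)
    step[j] = eps
    numeric[j] = (scalar_loss(b1 + step) - scalar_loss(b1 - step)) / (2 * eps)

print(db1.shape == b1.shape)          # gradient shape matches the parameter
print(np.abs(db1 - numeric).max())    # small
```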
5. In-place ops
This can break autograd:

    A1.relu_()

Safer:

    A1 = F.relu(Z1)

In-place ops are sometimes okay, but in interviews avoid them unless you are sure.
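A sketch of how such a failure can look, using exp rather than ReLU because exp is a well-known case where autograd keeps the op's output for the backward pass. The tensor values are arbitrary; the exact error message depends on the PyTorch version.

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

y = x.exp()            # autograd saves this output to compute the gradient
y.add_(1.0)            # in-place edit overwrites the saved output
try:
    y.sum().backward()
    failed = False
except RuntimeError as err:
    failed = True
    print("autograd error:", str(err)[:60])

# Out-of-place version: a new tensor is created, the saved output is intact.
x2 = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
y2 = x2.exp()
y2 = y2 + 1.0
y2.sum().backward()
print(x2.grad)         # equals exp(x2)
```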
6. Forgetting .zero_grad()
PyTorch accumulates gradients:

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Without zero_grad(), gradients from previous batches leak into the current step.
8. Interview explanation in one pass
I would say:

I start with batched input X of shape [B, D]. The first affine layer gives Z1 = X W1 + b1, shape [B, H]. ReLU keeps the same shape. The second affine layer gives logits of shape [B, C]. For classification, I use stable softmax cross entropy directly from logits. In backward, the key simplification is that the gradient of softmax plus cross entropy is probs - one_hot(y), divided by batch size if the loss is averaged. From there, gradients follow by matrix calculus: dW2 = A1.T @ dlogits, db2 = sum(dlogits), then propagate through ReLU with (Z1 > 0), then dW1 = X.T @ dZ1, db1 = sum(dZ1), and dX = dZ1 @ W1.T. In PyTorch, the same operations build a dynamic computation graph, and .backward() applies the same chain rule automatically, accumulating gradients into .grad. Common bugs are wrong label shape, applying softmax before cross entropy, forgetting the 1/B factor from batch averaging, bad bias broadcasting, in-place ops, and forgetting to zero gradients.