KL Divergence
Imagine we have a classifier like this:

    Z1 = X @ W1 + b1
    A1 = max(0, Z1)
    Z2 = A1 @ W2 + b2
    P = softmax(Z2)

Not hard. For this code, "KL loss" has two possible meanings:
1. KL from true labels to prediction
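Your current cross-entropy loss is already basically:

\[ D_{KL}(y \| P) = \sum_c y_c \log \frac{y_c}{P_c} \]

Because for a one-hot label (using the convention \(0 \log 0 = 0\)):

\[ y_c \in \{0, 1\}, \qquad \sum_c y_c \log y_c = 0 \]

Only the true class has \(y_c = 1\), so:

\[ D_{KL}(y \| P) = -\log P_{\text{true}} \]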
So this line:

    loss = -np.mean(np.log(P[np.arange(N), y] + 1e-12))

is already the KL loss for one-hot labels.
Your backward is also correct:

    dZ2 = P.copy()
    dZ2[np.arange(N), y] -= 1
    dZ2 /= N

So for hard labels, you do not need to change anything.
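The equivalence claimed above can be checked numerically. The following is a minimal sketch, assuming random logits and labels (the shapes `N`, `C` and the seed are arbitrary illustrative choices, not part of the original code): it computes the cross-entropy loss exactly as written above, then the KL divergence from the one-hot labels to `P`, and compares the two.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 4, 3                                # hypothetical batch size and class count
Z2 = rng.normal(size=(N, C))               # hypothetical logits
y = rng.integers(0, C, size=N)             # hypothetical hard labels

# Numerically stable softmax
shifted = Z2 - Z2.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
P = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Cross-entropy loss, same expression as in the text
ce_loss = -np.mean(np.log(P[np.arange(N), y] + 1e-12))

# KL(Y || P) with one-hot Y, using the convention 0 * log(0) = 0:
# zeros in Y are replaced by 1 inside the log so that log(1) = 0 there
Y = np.eye(C)[y]
log_ratio = np.log(np.where(Y > 0, Y, 1.0)) - np.log(P + 1e-12)
kl_loss = np.mean(np.sum(Y * log_ratio, axis=1))

print(np.isclose(ce_loss, kl_loss))
```

Both quantities reduce to the mean of `-log(P[true class])`, so the comparison should hold for any logits and labels.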
2. KL between two probability distributions
Example: teacher model distribution Q and your model prediction P. The loss is \(D_{KL}(Q \| P) = \sum_c Q_c \log \frac{Q_c}{P_c}\), averaged over the batch.
Code:

    def kl_loss_from_probs(P, Q):
        eps = 1e-12
        return np.mean(np.sum(Q * (np.log(Q + eps) - np.log(P + eps)), axis=1))

Where:

    P.shape == (N, C)  # student prediction
    Q.shape == (N, C)  # target/teacher distribution

Backward w.r.t. logits Z2 is simple, provided each row of Q sums to 1:

    dZ2 = (P - Q) / N

So your backward changes only this part:

    dZ2 = (P - Q) / N

instead of:

    dZ2 = P.copy()
    dZ2[np.arange(N), y] -= 1
    dZ2 /= N

Clean version for soft-label KL

    import numpy as np

    def forward_kl_loss(X, Q, params):
        W1, b1 = params["W1"], params["b1"]
        W2, b2 = params["W2"], params["b2"]

        # Hidden layer with ReLU, then output logits
        Z1 = X @ W1 + b1
        A1 = np.maximum(0, Z1)
        Z2 = A1 @ W2 + b2

        # Numerically stable softmax
        shifted = Z2 - np.max(Z2, axis=1, keepdims=True)
        exp_scores = np.exp(shifted)
        P = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Batch mean of KL(Q || P)
        eps = 1e-12
        loss = np.mean(np.sum(Q * (np.log(Q + eps) - np.log(P + eps)), axis=1))

        cache = {
            "X": X,
            "Z1": Z1,
            "A1": A1,
            "P": P,
            "Q": Q,
        }

        return loss, cache

Backward:

    def backward_kl(cache, params):
        X = cache["X"]
        Z1 = cache["Z1"]
        A1 = cache["A1"]
        P = cache["P"]
        Q = cache["Q"]

        W2 = params["W2"]
        N = X.shape[0]

        # Gradient of the batch-mean KL w.r.t. the logits
        dZ2 = (P - Q) / N

        dW2 = A1.T @ dZ2
        db2 = np.sum(dZ2, axis=0)

        # Backprop through ReLU
        dA1 = dZ2 @ W2.T
        dZ1 = dA1 * (Z1 > 0)

        dW1 = X.T @ dZ1
        db1 = np.sum(dZ1, axis=0)

        return {
            "W1": dW1,
            "b1": db1,
            "W2": dW2,
            "b2": db2,
        }

Main idea (up to the 1/N batch-averaging factor):

    # hard labels
    dZ2 = P - one_hot(y)

    # soft labels / KL
    dZ2 = P - Q

So difficulty is low. The math is almost identical to softmax cross-entropy.
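One way to gain confidence in `dZ2 = (P - Q) / N` is a finite-difference gradient check. The sketch below is illustrative only: the helper names `softmax` and `kl_from_logits`, the shapes, and the random targets are assumptions introduced here, and the check works directly on the logits rather than on the full two-layer network.

```python
import numpy as np

def softmax(Z):
    # Numerically stable row-wise softmax
    shifted = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def kl_from_logits(Z, Q, eps=1e-12):
    # Batch mean of KL(Q || softmax(Z)), same expression as in the text
    P = softmax(Z)
    return np.mean(np.sum(Q * (np.log(Q + eps) - np.log(P + eps)), axis=1))

rng = np.random.default_rng(0)
N, C = 5, 4                                 # hypothetical shapes
Z2 = rng.normal(size=(N, C))                # hypothetical logits
Q = softmax(rng.normal(size=(N, C)))        # hypothetical targets, rows sum to 1

# Analytic gradient from the text
analytic = (softmax(Z2) - Q) / N

# Central finite differences, one logit at a time
h = 1e-5
numeric = np.zeros_like(Z2)
for i in range(N):
    for j in range(C):
        Zp, Zm = Z2.copy(), Z2.copy()
        Zp[i, j] += h
        Zm[i, j] -= h
        numeric[i, j] = (kl_from_logits(Zp, Q) - kl_from_logits(Zm, Q)) / (2 * h)

max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```

The maximum error should be tiny compared with the gradient entries. If the rows of `Q` did not sum to 1, the analytic formula would no longer match, which is why the targets here are produced by a softmax.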
Interview Q1
Compute KL Divergence (Example): Given two simple distributions \(p\) and \(q\) over a discrete set, how do you compute the KL divergence \(D_{KL}(p \| q)\)? For example, if \(p = [0.5, 0.5]\) and \(q = [0.8, 0.2]\), calculate \(D_{KL}(p \| q)\).
I would answer:
KL divergence measures how different distribution q is from distribution p when p is treated as the reference.
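The formula is:

\[ D_{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i} \]

For:

    p = [0.5, 0.5]
    q = [0.8, 0.2]

we compute:

\[ D_{KL}(p \| q) = 0.5 \log \frac{0.5}{0.8} + 0.5 \log \frac{0.5}{0.2} \]

Using natural log:

\[ 0.5 \ln(0.625) + 0.5 \ln(2.5) \approx 0.5(-0.4700) + 0.5(0.9163) \]

So:

\[ D_{KL}(p \| q) \approx 0.2231 \text{ nats} \]

That is in nats because we used natural log. If we used log base 2, the answer would be in bits:

\[ D_{KL}(p \| q) \approx 0.3219 \text{ bits} \]

Important interview note: KL is not symmetric:

\[ D_{KL}(p \| q) \neq D_{KL}(q \| p) \]

And if \(q_i = 0\) where \(p_i > 0\), KL becomes infinite.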
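The worked example above can be reproduced in a few lines. This is a small sketch, assuming a hypothetical helper `kl_divergence` (not a library function) that skips terms with \(p_i = 0\) and returns infinity when \(q_i = 0\) where \(p_i > 0\).

```python
import numpy as np

def kl_divergence(p, q, base=np.e):
    """Discrete KL(p || q); `base` selects nats (e) or bits (2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                            # terms with p_i = 0 contribute 0
    if np.any(q[mask] == 0):
        return float("inf")                 # q_i = 0 where p_i > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base))

p = [0.5, 0.5]
q = [0.8, 0.2]

kl_pq_nats = kl_divergence(p, q)            # forward direction, natural log
kl_pq_bits = kl_divergence(p, q, base=2)    # same value expressed in bits
kl_qp_nats = kl_divergence(q, p)            # swapped arguments

print(round(kl_pq_nats, 4), round(kl_pq_bits, 4), round(kl_qp_nats, 4))
# prints: 0.2231 0.3219 0.1927
```

The third number differs from the first, which illustrates the asymmetry on this particular pair of distributions.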
Interview Q2
Mode-Seeking vs. Covering (KL-Divergence): Explain how the choice of KL divergence direction affects learned solutions. For example, minimizing KL(q‖p) (model q vs data p) tends to focus on the modes of p, while minimizing KL(p‖q) encourages covering the support of p. What are the intuitive differences in the resulting model behavior?

I would answer like this:
KL is asymmetric, so the direction matters.
Assume:

    p = true data distribution
    q = model distribution

The two objectives are:

\[ \min_q D_{KL}(p \| q) \qquad \text{and} \qquad \min_q D_{KL}(q \| p) \]

1. Minimizing \(D_{KL}(p \| q)\), covering behavior
This is often called forward KL.
It heavily penalizes cases where:

\[ p(x) > 0 \quad \text{but} \quad q(x) \approx 0 \]
Meaning: if the data says something is possible, but the model assigns almost no probability to it, the loss becomes very large.
So the model is encouraged to cover all regions where data exists.
Behavior:
- less likely to miss modes
- more likely to spread probability mass
- can assign probability to low-density areas between modes

Example: if p has two modes and q is a single Gaussian, minimizing \(D_{KL}(p \| q)\) may put q between the two modes and make it wide enough to cover both.
So it is mode-covering.
2. Minimizing \(D_{KL}(q \| p)\), mode-seeking behavior
This is often called reverse KL.
It heavily penalizes cases where:

\[ q(x) > 0 \quad \text{but} \quad p(x) \approx 0 \]
Meaning: if the model puts probability somewhere the data distribution says is unlikely, it gets punished hard.
So the model prefers to place mass only where p is very high.
Behavior:
- sharp samples
- avoids low-probability regions
- may ignore some modes entirely

Example: if p has two modes and q is a single Gaussian, minimizing \(D_{KL}(q \| p)\) may choose one mode and ignore the other, because placing mass between modes is punished.
So it is mode-seeking.
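The two-mode example used in both cases above can be tried out numerically. The sketch below makes several assumptions that are not in the original text: the target p is an equal mixture of two Gaussians at -3 and +3 with standard deviation 0.7, densities are evaluated on a fixed grid, and the single-Gaussian q is chosen by brute-force grid search over its mean and standard deviation instead of gradient descent.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)          # evaluation grid (assumed range)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hypothetical bimodal target: equal mixture of N(-3, 0.7^2) and N(3, 0.7^2)
p = 0.5 * gaussian(x, -3.0, 0.7) + 0.5 * gaussian(x, 3.0, 0.7)

def kl(a, b, eps=1e-300):
    # Riemann-sum approximation of the integral of a * log(a / b);
    # eps only guards against log(0) where a density underflows
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

best_fwd = None                              # (KL(p || q), mu, sigma)
best_rev = None                              # (KL(q || p), mu, sigma)
for mu in np.linspace(-4.0, 4.0, 81):
    for sigma in np.linspace(0.3, 5.0, 48):
        q = gaussian(x, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward KL fit: mu=%.2f sigma=%.2f" % best_fwd[1:])
print("reverse KL fit: mu=%.2f sigma=%.2f" % best_rev[1:])
```

Under these assumptions, the forward-KL fit should land near the midpoint with a large standard deviation (covering both modes), while the reverse-KL fit should sit on one of the two modes with a standard deviation close to that mode's own width.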
Interview summary
I would say:
The direction of KL controls what mistakes are expensive. \(D_{KL}(p \| q)\) punishes the model for missing data support, so it encourages broad coverage. \(D_{KL}(q \| p)\) punishes the model for putting mass where the data has little mass, so it prefers high-density regions and can collapse onto one mode. Forward KL is mode-covering, reverse KL is mode-seeking.