Skip to content

KL Divergence

Imagine we have a classifier like this:

Z1 = X @ W1 + b1
A1 = max(0, Z1)
Z2 = A1 @ W2 + b2
P = softmax(Z2)

Not hard. For this code, there are two possible meanings:

Your current cross-entropy loss is already basically:

KL(yone-hotP)KL(y_{\text{one-hot}} | P)

Because for one-hot label:

KL(qP)=iqilogqiPiKL(q | P) = \sum_i q_i \log \frac{q_i}{P_i}

Only the true class has qi=1q_i = 1, so:

KL=logPtrueKL = -\log P_{\text{true}}

So this line:

loss = -np.mean(np.log(P[np.arange(N), y] + 1e-12))

is already the KL loss for one-hot labels.

Your backward is also correct:

dZ2 = P.copy()
dZ2[np.arange(N), y] -= 1
dZ2 /= N

So for hard labels, you do not need to change anything.


2. KL between two probability distributions

Section titled “2. KL between two probability distributions”

Example: teacher model distribution Q and your model prediction P.

KL(QP)=iQi(logQilogPi)KL(Q | P) = \sum_i Q_i(\log Q_i - \log P_i)

Code:

def kl_loss_from_probs(P, Q):
eps = 1e-12
return np.mean(np.sum(Q * (np.log(Q + eps) - np.log(P + eps)), axis=1))

Where:

P.shape == (N, C) # student prediction
Q.shape == (N, C) # target/teacher distribution

Backward w.r.t. logits Z2 is simple:

dZ2 = (P - Q) / N

So your backward changes only this part:

dZ2 = (P - Q) / N

instead of:

dZ2 = P.copy()
dZ2[np.arange(N), y] -= 1
dZ2 /= N

def forward_kl_loss(X, Q, params):
W1, b1 = params["W1"], params["b1"]
W2, b2 = params["W2"], params["b2"]
Z1 = X @ W1 + b1
A1 = np.maximum(0, Z1)
Z2 = A1 @ W2 + b2
shifted = Z2 - np.max(Z2, axis=1, keepdims=True)
exp_scores = np.exp(shifted)
P = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
eps = 1e-12
loss = np.mean(np.sum(Q * (np.log(Q + eps) - np.log(P + eps)), axis=1))
cache = {
"X": X,
"Z1": Z1,
"A1": A1,
"P": P,
"Q": Q,
}
return loss, cache

Backward:

def backward_kl(cache, params):
X = cache["X"]
Z1 = cache["Z1"]
A1 = cache["A1"]
P = cache["P"]
Q = cache["Q"]
W2 = params["W2"]
N = X.shape[0]
dZ2 = (P - Q) / N
dW2 = A1.T @ dZ2
db2 = np.sum(dZ2, axis=0)
dA1 = dZ2 @ W2.T
dZ1 = dA1 * (Z1 > 0)
dW1 = X.T @ dZ1
db1 = np.sum(dZ1, axis=0)
return {
"W1": dW1,
"b1": db1,
"W2": dW2,
"b2": db2,
}

Main idea:

# hard labels
dZ2 = P - one_hot(y)
# soft labels / KL
dZ2 = P - Q

So difficulty is low. The math is almost identical to softmax cross-entropy.

Compute KL Divergence (Example) – Given two simple distributions (p) and (q) over a discrete set, how do you compute the KL divergence DKL(pq)D_{KL}(p|q)? For example, if (p = [0.5,0.5]) and (q=[0.8,0.2]), calculate DKL(pq)D_{KL}(p|q)

I would answer:

KL divergence measures how different distribution q is from distribution p when p is treated as the reference.

The formula is:

DKL(pq)=ipilogpiqiD_{KL}(p | q) = \sum_i p_i \log \frac{p_i}{q_i}

For:

p = [0.5, 0.5]
q = [0.8, 0.2]

we compute:

DKL(pq)=0.5log0.50.8+0.5log0.50.2D_{KL}(p | q)= 0.5 \log \frac{0.5}{0.8} + 0.5 \log \frac{0.5}{0.2}

Using natural log:

0.5(0.4700)+0.5(0.9163)=0.2350+0.4581=0.22310.5(-0.4700) + 0.5(0.9163) = -0.2350 + 0.4581 = 0.2231

So:

DKL(pq)0.223D_{KL}(p | q) \approx 0.223

That is in nats because we used natural log. If we used log base 2, the answer would be in bits:

0.2231/log(2)0.3220.2231 / \log(2) \approx 0.322

Important interview note: KL is not symmetric:

DKL(pq)DKL(qp)D_{KL}(p | q) \neq D_{KL}(q | p)

And if q_i = 0 where p_i > 0, KL becomes infinite.

Mode-Seeking vs. Covering (KL-Divergence) – Explain how the choice of KL divergence direction affects learned solutions. For example, minimizing KL(q‖p) (model q vs data p) tends to focus on the modes of p, while minimizing KL(p‖q) encourages covering the support of p
. What are the intuitive differences in the resulting model behavior?

I would answer like this:

KL is asymmetric, so the direction matters.

Assume:

p = true data distribution
q = model distribution

The two objectives are:

DKL(pq)=xp(x)logp(x)q(x)D_{KL}(p | q) = \sum_x p(x)\log \frac{p(x)}{q(x)} DKL(qp)=xq(x)logq(x)p(x)D_{KL}(q | p) = \sum_x q(x)\log \frac{q(x)}{p(x)}

1. Minimizing DKL(pq)D_{KL}(p | q), covering behavior

Section titled “1. Minimizing DKL(p∣q)D_{KL}(p | q)DKL​(p∣q), covering behavior”

This is often called forward KL.

It heavily penalizes cases where:

p(x)>0butq(x)0p(x) > 0 \quad \text{but} \quad q(x) \approx 0

Meaning: if the data says something is possible, but the model assigns almost no probability to it, the loss becomes very large.

So the model is encouraged to cover all regions where data exists.

Behavior:

less likely to miss modes
more likely to spread probability mass
can assign probability to low-density areas between modes

Example: if p has two modes and q is a single Gaussian, minimizing (KL(p | q)) may put q between the two modes and make it wide enough to cover both.

So it is mode-covering.

2. Minimizing (D_{KL}(q | p)), mode-seeking behavior

Section titled “2. Minimizing (D_{KL}(q | p)), mode-seeking behavior”

This is often called reverse KL.

It heavily penalizes cases where:

q(x)>0butp(x)0q(x) > 0 \quad \text{but} \quad p(x) \approx 0

Meaning: if the model puts probability somewhere the data distribution says is unlikely, it gets punished hard.

So the model prefers to place mass only where p is very high.

Behavior:

sharp samples
avoids low-probability regions
may ignore some modes entirely

Example: if p has two modes and q is a single Gaussian, minimizing (KL(q | p)) may choose one mode and ignore the other, because placing mass between modes is punished.

So it is mode-seeking.

I would say:

The direction of KL controls what mistakes are expensive. KL(pq)KL(p | q) punishes the model for missing data support, so it encourages broad coverage. KL(qp)KL(q | p) punishes the model for putting mass where the data has little mass, so it prefers high-density regions and can collapse onto one mode. Forward KL is coverage-seeking, reverse KL is mode-seeking.