Matrix Calculus (Linear Layer) – For a linear operation \(Y = XW\), what is the gradient \(\partial Y / \partial W\)? Use the identity \((AB)^\top = B^\top A^\top\) to show that \(\partial (XW) / \partial W = X^\top\) when backpropagating through a batch.
I would answer:
For a linear layer:
\[ Y = XW \]
Assume shapes:
\[ X \in \mathbb{R}^{B \times D}, \quad W \in \mathbb{R}^{D \times C}, \quad Y \in \mathbb{R}^{B \times C} \]
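Strictly speaking, \(\partial Y / \partial W\) is a fourth-order tensor with entries \(\partial Y_{b,c} / \partial W_{d,c'}\) and shape \(B \times C \times D \times C\). In backprop, though, we usually care about the gradient of a scalar loss \(L\) with respect to \(W\).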
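To make the "higher-rank tensor" remark concrete, here is a small NumPy sketch. The sizes, the random data, and the variable names are illustrative assumptions. It builds the full Jacobian explicitly and contracts it with an upstream gradient to recover a matrix of the same shape as \(W\).

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, C = 4, 3, 2  # small illustrative sizes (assumed)
X = rng.standard_normal((B, D))
G = rng.standard_normal((B, C))  # stand-in for the upstream gradient dL/dY

# Full Jacobian: J[b, c, d, e] = dY[b, c] / dW[d, e] = X[b, d] * delta(c, e)
J = np.einsum("bd,ce->bcde", X, np.eye(C))

# Chain rule: contract the upstream gradient with the Jacobian over (b, c)
dW_from_jacobian = np.einsum("bc,bcde->de", G, J)

jacobian_shape = J.shape
matches = np.allclose(dW_from_jacobian, X.T @ G)
print(jacobian_shape, matches)
```

In practice the four-index tensor is never materialized. The Kronecker delta makes most of its entries zero, which is why the contraction reduces to a single matrix product.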
Let:
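\[ G = \frac{\partial L}{\partial Y} \]
where:
\[ G \in \mathbb{R}^{B \times C} \]
Then:
\[ \frac{\partial L}{\partial W} = X^\top G \]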
Shape check:
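\[ X^\top \in \mathbb{R}^{D \times B}, \quad G \in \mathbb{R}^{B \times C} \]
so:
\[ X^\top G \in \mathbb{R}^{D \times C} \]
the same shape as \(W\).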
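One way to see where the transpose comes from, and where the identity \((AB)^\top = B^\top A^\top\) enters, is to work with differentials. This is a sketch that assumes the usual trace convention \(dL = \operatorname{tr}\big((\partial L / \partial W)^\top dW\big)\), writes \(G = \partial L / \partial Y\), and holds \(X\) fixed:

\[
\begin{aligned}
dY &= X \, dW \\
dL &= \operatorname{tr}(G^\top dY) = \operatorname{tr}(G^\top X \, dW) \\
   &= \operatorname{tr}\big((X^\top G)^\top dW\big)
      \quad \text{since } (X^\top G)^\top = G^\top (X^\top)^\top = G^\top X \\
\Rightarrow \quad \frac{\partial L}{\partial W} &= X^\top G
\end{aligned}
\]

Matching the last line against the trace convention reads off the gradient directly. The transpose identity is what moves \(X\) from the right of \(G^\top\) to the left of \(G\).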
For a single element:
\[ Y_{b,c} = \sum_{d} X_{b,d} W_{d,c} \]
So:
\[ \frac{\partial Y_{b,c}}{\partial W_{d,c}} = X_{b,d} \]
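Write \(dY = \partial L / \partial Y\) and \(dW = \partial L / \partial W\). Only the outputs in column \(c\) depend on \(W_{d,c}\), so the chain-rule sum over output dimensions collapses to that column, and accumulating over the batch gives:
\[ dW_{d,c} = \sum_{b} X_{b,d} \, dY_{b,c} \]
which is exactly:
\[ dW = X^\top dY \]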
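The element-wise sum and the matrix form can be compared directly. The following is a minimal NumPy sketch with made-up sizes and random data (all names are illustrative assumptions), computing the same gradient three ways.

```python
import numpy as np

rng = np.random.default_rng(1)
B, D, C = 5, 4, 3  # illustrative sizes (assumed)
X = rng.standard_normal((B, D))
dY = rng.standard_normal((B, C))  # stand-in for the upstream gradient

# Literal translation of dW[d, c] = sum_b X[b, d] * dY[b, c]
dW_loop = np.zeros((D, C))
for d in range(D):
    for c in range(C):
        for b in range(B):
            dW_loop[d, c] += X[b, d] * dY[b, c]

# Same contraction written as an einsum over the batch index b
dW_einsum = np.einsum("bd,bc->dc", X, dY)

# Matrix form
dW_matmul = X.T @ dY

print(np.allclose(dW_loop, dW_matmul), np.allclose(dW_einsum, dW_matmul))
```

The einsum string mirrors the index notation: `b` is summed out, while `d` and `c` survive as the row and column of the result.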
So the interview-safe answer is:
For \(Y = XW\), the local Jacobian with respect to \(W\) contributes \(X^\top\) during backprop. Given upstream gradient \(dY\), the parameter gradient is \(dW = X^\top dY\). Similarly, the input gradient is \(dX = dY \, W^\top\).
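Both formulas can be sanity-checked numerically. The sketch below is not tied to any particular framework. The helper names `linear_forward`, `linear_backward`, and `numerical_grad` are hypothetical, and the scalar loss \(L = \sum_{b,c} dY_{b,c} Y_{b,c}\) is an assumption chosen so that \(\partial L / \partial Y\) equals the given upstream gradient.

```python
import numpy as np

def linear_forward(X, W):
    """Forward pass of a bias-free linear layer: Y = X W."""
    return X @ W

def linear_backward(X, W, dY):
    """Backward pass: returns (dX, dW) given the upstream gradient dY."""
    dW = X.T @ dY   # shape (D, C), same as W
    dX = dY @ W.T   # shape (B, D), same as X
    return dX, dW

def numerical_grad(f, A, eps=1e-6):
    """Central-difference gradient of the scalar f() w.r.t. array A (perturbed in place)."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
B, D, C = 4, 3, 2  # illustrative sizes (assumed)
X = rng.standard_normal((B, D))
W = rng.standard_normal((D, C))
dY = rng.standard_normal((B, C))

# Scalar loss whose gradient with respect to Y is exactly dY
loss = lambda: float(np.sum(dY * linear_forward(X, W)))

dX, dW = linear_backward(X, W, dY)
dW_ok = np.allclose(dW, numerical_grad(loss, W), atol=1e-6)
dX_ok = np.allclose(dX, numerical_grad(loss, X), atol=1e-6)
print(dW_ok, dX_ok)
```

A finite-difference check like this is slow and only practical for tiny shapes, but it is a common way to catch a misplaced transpose in a hand-written backward pass.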