Matrix Calculus

Matrix Calculus (Linear Layer): For a linear operation \(Y = XW\), what is the gradient \(\partial Y/\partial W\)? Use the identity \((AB)^T = B^T A^T\) to show that \(\partial (XW)/\partial W = X^T\) when backpropagating through a batch.


I would answer:

For a linear layer:

\[ Y = XW \]

Assume shapes:

\[ X \in \mathbb{R}^{B \times D}, \quad W \in \mathbb{R}^{D \times C}, \quad Y \in \mathbb{R}^{B \times C} \]

Strictly speaking, \(\partial Y / \partial W\) is a fourth-order tensor with entries \(\partial Y_{b,c} / \partial W_{d,c'}\). In backprop, however, we usually care about the gradient of a scalar loss \(L\) with respect to \(W\).

Let:

\[ G = \frac{\partial L}{\partial Y} \]

where:

\[ G \in \mathbb{R}^{B \times C} \]

Then:

\[ \frac{\partial L}{\partial W} = X^T G \]
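The question asks for the transpose identity, and one way to make its role explicit is to work with differentials. The sketch below assumes the standard trace form of the first-order change in a scalar, \(dL = \operatorname{tr}\big((\partial L/\partial A)^T\, dA\big)\). That notation is introduced here for illustration and is not part of the original answer.

\[
\begin{aligned}
dL &= \operatorname{tr}\!\left(G^T\, dY\right) = \operatorname{tr}\!\left(G^T X\, dW\right) && \text{since } dY = X\, dW \text{ with } X \text{ held fixed} \\
   &= \operatorname{tr}\!\left((X^T G)^T\, dW\right) && \text{since } (X^T G)^T = G^T X \text{ by } (AB)^T = B^T A^T \\
\Rightarrow\quad \frac{\partial L}{\partial W} &= X^T G .
\end{aligned}
\]

Matching the last trace against \(dL = \operatorname{tr}\big((\partial L/\partial W)^T\, dW\big)\) is what identifies \(X^T G\) as the gradient.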

Shape check:

\[ X^T \in \mathbb{R}^{D \times B}, \quad G \in \mathbb{R}^{B \times C} \]

so:

\[ X^T G \in \mathbb{R}^{D \times C} \]

the same shape as \(W\).
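To connect the higher-rank Jacobian with the compact formula, here is a minimal NumPy sketch. The sizes, the random data, and the variable names (`J`, `dW_from_jacobian`) are illustrative assumptions. It builds the full Jacobian explicitly, contracts it with a stand-in upstream gradient, and compares the result with \(X^T G\).

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, C = 4, 3, 2                      # batch, input features, output features (arbitrary)
X = rng.standard_normal((B, D))
W = rng.standard_normal((D, C))
G = rng.standard_normal((B, C))        # stand-in for the upstream gradient dL/dY

# Full Jacobian: dY[b, c] / dW[d, k] = X[b, d] * delta(c, k), shape (B, C, D, C).
J = np.einsum("bd,ck->bcdk", X, np.eye(C))

# Chain rule: contract the Jacobian with G over the output indices (b, c).
dW_from_jacobian = np.einsum("bc,bcdk->dk", G, J)

# Compact form used in practice.
dW = X.T @ G

print(J.shape)                             # shape of the explicit Jacobian
print(dW.shape == W.shape)                 # gradient has the same shape as W
print(np.allclose(dW, dW_from_jacobian))   # both routes agree
```

The explicit Jacobian has \(B \cdot C \cdot D \cdot C\) entries, most of them zero, which is why implementations typically compute \(X^T G\) directly instead of materializing it.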

For a single element:

\[ Y_{b,c} = \sum_d X_{b,d} W_{d,c} \]

So:

\[ \frac{\partial Y_{b,c}}{\partial W_{d,c}} = X_{b,d} \]

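Only column \(c\) of \(Y\) depends on \(W_{d,c}\), so the chain rule sum over output dimensions collapses and only the sum over the batch remains. Writing \(dY\) for the upstream gradient \(\partial L/\partial Y\) and \(dW\) for \(\partial L/\partial W\), this gives: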

\[ dW_{d,c} = \sum_b X_{b,d} \, dY_{b,c} \]

which is exactly:

\[ dW = X^T dY \]
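The element-wise sum and the matrix form can be compared directly. The following is a small sketch with arbitrary sizes and random data (all names are illustrative). It accumulates each entry of `dW` with explicit loops and checks the result against `X.T @ dY`.

```python
import numpy as np

rng = np.random.default_rng(1)
B, D, C = 5, 4, 3                      # arbitrary sizes for illustration
X = rng.standard_normal((B, D))
dY = rng.standard_normal((B, C))       # stand-in upstream gradient

# Element-wise accumulation: dW[d, c] = sum over b of X[b, d] * dY[b, c]
dW_loop = np.zeros((D, C))
for d in range(D):
    for c in range(C):
        for b in range(B):
            dW_loop[d, c] += X[b, d] * dY[b, c]

# Matrix form of the same sum.
dW_matmul = X.T @ dY

print(np.allclose(dW_loop, dW_matmul))
```

The sum over `b` is the inner dimension of the product \(X^T dY\), which is how the per-example contributions get accumulated across the batch.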

So the interview-safe answer is:

For \(Y = XW\), the local Jacobian with respect to \(W\) contributes \(X^T\) during backprop. Given upstream gradient \(dY\), the parameter gradient is \(dW = X^T dY\). Similarly, the input gradient is \(dX = dY W^T\).
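Both formulas can be sanity-checked numerically. The sketch below assumes a squared-error loss against hypothetical targets `T`, chosen only so that there is a scalar to differentiate. The helper names `linear_backward` and `numerical_grad` are made up for this example. It compares the analytic gradients with central finite differences.

```python
import numpy as np

def linear_backward(X, W, dY):
    """Return (dX, dW) for Y = X @ W, given the upstream gradient dY = dL/dY."""
    dW = X.T @ dY      # shape (D, C), same as W
    dX = dY @ W.T      # shape (B, D), same as X
    return dX, dW

def numerical_grad(f, A, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to array A.

    A is perturbed in place one entry at a time and restored afterwards.
    """
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
B, D, C = 4, 3, 2
X = rng.standard_normal((B, D))
W = rng.standard_normal((D, C))
T = rng.standard_normal((B, C))        # hypothetical targets, for illustration only

def loss():
    # Assumed loss: L = 0.5 * ||XW - T||^2, so dL/dY = XW - T.
    return 0.5 * np.sum((X @ W - T) ** 2)

dY = X @ W - T
dX, dW = linear_backward(X, W, dY)

print(np.allclose(dW, numerical_grad(loss, W), atol=1e-5))
print(np.allclose(dX, numerical_grad(loss, X), atol=1e-5))
```

A finite-difference check like this is slow and only practical for tiny shapes, but it is a common way to catch a misplaced transpose in a hand-written backward pass.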