Matrix Calculus: Common Forms

Tags	Backprop

Common Encounters

First derivative

$\nabla_x b^Tx =\nabla_xx^Tb= b$ (see “practicals” for an explanation for the transpose)

$\nabla_x x^Tx = 2x$ (helps to think of it having an invisible $I$ in the middle)

$\partial(Tr(X)) = Tr(\partial X)$

$\partial(X^{-1}) = -X^{-1}(\partial X)X^{-1}$

$\partial a^TXb / \partial X = ab^T$ (derive this with a summation)
- it follows that $\partial a^T X^T b / \partial X = b^Ta$

Second derivative

$\nabla_x x^TAx = 2Ax$

$\nabla^2_x x^TAx = 2A$

$\nabla^2_x x^Tx = 2I$ (think about it!)

Quadratic forms

The last line with the combination is possible because $A$ is symmetric. The key takeaway is that $\nabla_x x^TAx = 2Ax$

👉

What's really cool is that it's EXACTLY like single variable calculus, the way the power rule works!

You can derive the hessian through the interpretation that the hessian is just the gradient of each component. We use the symmetric properties of $A$

Derivative through the matrix

In this case, the matrix contains variables you want to optimize.

You can derive the following through sums:

$\nabla_A x^TAx$ = $x x^T$

Gradient of an inverse matrix

The general form (regardless of what thing you're taking the derivative WRT) is this:

\nabla_x Y^{-1} = -Y^{-1}(\nabla_xY) Y^{-1}

The proof is pretty simple. We use the identity $Y^{-1}Y = I$ and the fact that the derivative of any matrix that doens't have variables in it (we assume $Y$ to be a matrix of variables) is zero.

\nabla Y^{-1}Y = (\nabla Y^{-1}) Y + Y^{-1}\nabla Y = \nabla I = 0

Then, after you rearrange you get

(\nabla Y^{-1}) Y = -Y^{-1}\nabla Y \\\nabla Y^{-1} = -Y^{-1} \nabla Y Y^{-1}

And from this, you can derive certain identities, like

$\nabla_A a^T A^{-1}b = -A^{-T} ab^T A^{-T}$