Matrix Calculus: fundamentals

Tags: Backprop

Gradient

Symbolically, $\nabla$ denotes the explicit gradient of a scalar-valued function with respect to a vector of inputs. The $\partial$ notation is a looser interpretation: it can denote a gradient, or a derivative with respect to a matrix, etc. Keep that in mind.

The gradient of a function $f: \mathbb{R}^{m\times n} \to \mathbb{R}$ is defined as

\nabla_A f(A) = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix}

In other words

(\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}
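
As a quick sanity check of the entry-wise definition, here is a minimal sketch using JAX (my choice of tool, not from the notes; the function $f(A)=\sum_{ij} A_{ij}^2$ is just an arbitrary example): `jax.grad` returns an $m \times n$ matrix of partials, and we compare one entry against a finite-difference estimate.

```python
import jax
import jax.numpy as jnp

# Arbitrary scalar-valued function of an m x n matrix: f(A) = sum of squared entries.
def f(A):
    return jnp.sum(A ** 2)

A = jnp.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])          # m = 2, n = 3

grad_A = jax.grad(f)(A)                    # same shape as A: (2, 3)

# Check (grad f)_{ij} = d f / d A_{ij} with a finite difference at (i, j) = (0, 1).
eps = 1e-3
E01 = jnp.zeros_like(A).at[0, 1].set(1.0)  # perturb only A_{01}
fd = (f(A + eps * E01) - f(A - eps * E01)) / (2 * eps)

print(grad_A)            # analytically 2 * A
print(grad_A[0, 1], fd)  # both ~ 4.0
```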

A gradient is ONLY defined for functions that return scalars. The gradient is also technically a linear operator, because these hold:

\nabla_x (f(x) + g(x)) = \nabla_x f(x) + \nabla_x g(x)

\nabla_x (t\, f(x)) = t\, \nabla_x f(x) \quad \text{for } t \in \mathbb{R}
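
A minimal sketch verifying the two linearity identities numerically with `jax.grad` (the particular functions `f` and `g` below are arbitrary choices, not from the notes):

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(jnp.sin(x))      # arbitrary scalar-valued functions of a vector
g = lambda x: jnp.dot(x, x)

x = jnp.array([0.5, -1.0, 2.0])
t = 3.0

# grad(f + g) == grad f + grad g
lhs_sum = jax.grad(lambda x: f(x) + g(x))(x)
rhs_sum = jax.grad(f)(x) + jax.grad(g)(x)

# grad(t * f) == t * grad f
lhs_scale = jax.grad(lambda x: t * f(x))(x)
rhs_scale = t * jax.grad(f)(x)

print(jnp.allclose(lhs_sum, rhs_sum))      # True
print(jnp.allclose(lhs_scale, rhs_scale))  # True
```

(`jax.grad` also enforces the scalar-output rule: calling it on a vector-valued function raises an error.)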

Hessian

A Hessian matrix takes in a function $f: \mathbb{R}^n \to \mathbb{R}$ and returns an $n \times n$ symmetric matrix, defined as

H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

Or more graphically,

\nabla_x^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}
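
A minimal JAX sketch of the definition (the quadratic-plus-sine function is an arbitrary example): `jax.hessian` returns the $n \times n$ matrix of second partials, and we can check that it is symmetric.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Arbitrary scalar function of a vector in R^3.
    return jnp.dot(x, x) + jnp.sin(x[0]) * x[1]

x = jnp.array([0.3, 1.0, -2.0])

H = jax.hessian(f)(x)            # shape (3, 3), H[i, j] = d^2 f / (dx_i dx_j)

print(H.shape)                   # (3, 3)
print(jnp.allclose(H, H.T))      # True: the Hessian is symmetric
```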

Interpretations

Chain rule

The chain rule is just multiplication of Jacobians. When you need a derivative with respect to a matrix, remember that a matrix isn’t inherently a 2d thing…it can be flattened into a vector. That’s how you get out of doing things with 4d tensors. We will talk more about this in the “practicals” section, where we learn common tricks.
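
A minimal sketch of “chain rule = Jacobian multiplication”, using `jax.jacobian` (the maps `g` and `h` are arbitrary examples, not from the notes): the Jacobian of the composition equals the matrix product of the individual Jacobians, evaluated at the right points.

```python
import jax
import jax.numpy as jnp

# Two arbitrary vector-valued maps: h: R^3 -> R^2, g: R^2 -> R^2.
def h(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

def g(u):
    return jnp.array([u[0] + u[1] ** 2, jnp.exp(u[0])])

x = jnp.array([1.0, 2.0, 0.5])

J_composed = jax.jacobian(lambda x: g(h(x)))(x)        # d(g∘h)/dx, shape (2, 3)
J_chain = jax.jacobian(g)(h(x)) @ jax.jacobian(h)(x)   # J_g(h(x)) @ J_h(x)

print(jnp.allclose(J_composed, J_chain))               # True
```

For derivatives with respect to a matrix `W`, the same flattening idea applies: reshape `W` to a vector (e.g. `W.reshape(-1)`), and the Jacobian stays an ordinary 2d matrix instead of becoming a 4d tensor.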

Micro and Macro

You can always fall back on element-wise analysis. But more often than not, try to use the macro (whole-matrix) properties first, as they are often more informative.
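
As an illustration (my example, not from the notes): the macro identity $\nabla_A \operatorname{tr}(AB) = B^\top$ drops out of a one-line trace argument, while the element-wise route grinds through $\frac{\partial}{\partial A_{ij}} \sum_{k,l} A_{kl} B_{lk} = B_{ji}$. A quick numeric check with `jax.grad`:

```python
import jax
import jax.numpy as jnp

A = jax.random.normal(jax.random.PRNGKey(0), (3, 4))
B = jax.random.normal(jax.random.PRNGKey(1), (4, 3))

f = lambda A: jnp.trace(A @ B)     # scalar function of the matrix A

grad_A = jax.grad(f)(A)            # macro identity says this is B^T

print(jnp.allclose(grad_A, B.T))   # True
```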