Optimization basics

Tags: Basics

Cost function

A cost function is basically a measure of how badly we're doing. We want to minimize the cost function.
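As a concrete sketch (assuming a simple regression setting with made-up predictions and targets), the mean squared error is one common cost function:

```python
import numpy as np

def mse_cost(predictions, targets):
    """Mean squared error: the average squared gap between predictions and targets."""
    return np.mean((predictions - targets) ** 2)

# Hypothetical values, just for illustration.
predictions = np.array([2.5, 0.0, 2.1])
targets = np.array([3.0, -0.5, 2.0])
print(mse_cost(predictions, targets))  # smaller is better; 0 would mean a perfect fit
```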

Likelihood

The likelihood is basically a measure of how well we're doing, in a probabilistic sense: it tells us how probable the observed training data is under the model. Maximizing the likelihood means finding the parameters that best explain the training set.

We often work with the log-likelihood instead, because the logarithm is monotonic (so it preserves the maximizer) and it decomposes products over samples into sums.

A cost function can be the negative (log-)likelihood, so minimizing the cost is the same as maximizing the likelihood.
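For instance (a minimal sketch, assuming a Bernoulli model with hypothetical predicted probabilities and binary labels), the negative log-likelihood is the familiar binary cross-entropy cost:

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Bernoulli negative log-likelihood (binary cross-entropy).
    Minimizing this cost is the same as maximizing the log-likelihood."""
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

# Hypothetical predicted probabilities and true labels.
probs = np.array([0.9, 0.2, 0.7])
labels = np.array([1, 0, 1])
print(negative_log_likelihood(probs, labels))
```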

Gradient descent

When we have a cost function $J(\theta)$, gradient descent updates the parameters by $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate.
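A minimal sketch of that update rule in code, assuming a toy quadratic cost $J(\theta) = \theta^2$ whose gradient is $2\theta$:

```python
def gradient_descent(grad, theta0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy cost J(theta) = theta**2, whose gradient is 2 * theta; the minimum is at 0.
print(gradient_descent(grad=lambda t: 2 * t, theta0=5.0))  # converges toward 0.0
```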

We can compute the gradient over the entire dataset (batch gradient descent), over a single sample (stochastic gradient descent, SGD), or over a subset of samples (minibatch gradient descent), as sketched below.
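A rough sketch of the three variants, assuming a generic data-dependent gradient function (the names here are hypothetical); the only difference is how many samples feed each gradient estimate:

```python
import numpy as np

def sgd_step(theta, X, y, grad_fn, learning_rate=0.01, batch_size=None):
    """One parameter update.
    batch_size=None          -> batch gradient descent (all samples)
    batch_size=1             -> stochastic gradient descent (one sample)
    1 < batch_size < len(X)  -> minibatch gradient descent."""
    if batch_size is None:
        Xb, yb = X, y
    else:
        idx = np.random.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
    return theta - learning_rate * grad_fn(theta, Xb, yb)
```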

Nuances

Batch gradient descent, with a sufficiently small learning rate, is guaranteed to decrease the training loss of a convex objective at every step.

Minibatch and stochastic gradient descent are not guaranteed to decrease the training loss at every step, because their gradient estimates are noisy.

Any of these optimization methods will fail to converge if the learning rate is too high.

Momentum

From "Why Momentum Really Works" (https://distill.pub/2017/momentum/):

"Here's a popular story about momentum: gradient descent is a man walking down a hill. He follows the steepest path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill."
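A minimal sketch of the heavy-ball idea, assuming the classical momentum update in which a velocity term accumulates past gradients and is decayed by a factor beta:

```python
def momentum_step(theta, velocity, grad, learning_rate=0.01, beta=0.9):
    """Classical momentum: the velocity remembers past gradients,
    so steps keep moving in directions that have been consistently downhill."""
    velocity = beta * velocity - learning_rate * grad(theta)
    theta = theta + velocity
    return theta, velocity

# Toy usage on J(theta) = theta**2 (gradient 2 * theta), starting from theta = 5.
theta, velocity = 5.0, 0.0
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, grad=lambda t: 2 * t)
print(theta)  # approaches the minimum at 0
```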