Linear Classifiers (general)

Tags: CS 231N, Regressions

The simple linear classifier (SVM)

Let’s say we wanted to create a function $f$ that maps a flattened image to a vector of class scores. We could just use a simple linear classifier:

$$f(x; W, b) = Wx + b$$

where $W$ is a $k \times d$ matrix, $d$ is the size of the flattened image, and $k$ is the number of classes. You can actually merge the $b$ into the $W$ if you append a $1$ to $x$.
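A minimal sketch of this in NumPy, with made-up shapes (a CIFAR-10-style 3072-dimensional image and 10 classes are assumptions for illustration):

```python
import numpy as np

k, d = 10, 3072            # assumed: 10 classes, 32*32*3 = 3072 pixels
W = np.random.randn(k, d) * 0.001
b = np.zeros(k)
x = np.random.randn(d)     # a flattened (and preprocessed) image

scores = W.dot(x) + b      # k class scores

# Bias trick: fold b into W as an extra column, and append a 1 to x
W_ext = np.hstack([W, b[:, None]])   # shape (k, d+1)
x_ext = np.append(x, 1.0)            # shape (d+1,)
assert np.allclose(W_ext.dot(x_ext), scores)
```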

To train, we need to make sure to zero-center our data (more on this later) and scale the pixels to the range $[-1, 1]$.
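A quick preprocessing sketch under the assumption that raw pixel values lie in $[0, 255]$ (the array shapes here are placeholders):

```python
import numpy as np

# Assumed: raw pixel values in [0, 255], N images of d pixels each
N, d = 500, 3072
X_train = np.random.randint(0, 256, size=(N, d)).astype(np.float64)
X_test = np.random.randint(0, 256, size=(100, d)).astype(np.float64)

# Zero-center with the training-set mean, then scale to roughly [-1, 1]
mean_image = X_train.mean(axis=0)
X_train = (X_train - mean_image) / 128.0
X_test = (X_test - mean_image) / 128.0   # reuse the training mean at test time
```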

You can think of this matrix multiplication as running $k$ separate dot-product linear classifiers. As such, we can visualize the classification as dicing up the input space with a set of hyperplanes (each line represents the zero point of one classifier).

Another interpretation is that each of the rows in $W$ is a “template” image. You are taking the dot product between each of the templates and the test image, which is essentially an (unnormalized) cosine similarity. Here’s an example of what this might look like.
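If you want to inspect those templates yourself, one possible sketch (assuming CIFAR-10-shaped images and a trained weight matrix of shape (10, 3072), without the bias column):

```python
import numpy as np

W = np.random.randn(10, 3072)   # stand-in for a trained weight matrix

# Each row is a "template"; reshape it back into a 32x32x3 image to visualize
templates = W.reshape(10, 32, 32, 3)

# Rescale each template to [0, 255] for display
lo = templates.min(axis=(1, 2, 3), keepdims=True)
hi = templates.max(axis=(1, 2, 3), keepdims=True)
template_images = 255.0 * (templates - lo) / (hi - lo)
```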

Multiclass support vector machine loss

Now, we look at how we might optimize $W$.

Let $s$ be the score vector: $s = Wx + b$. We compute the SVM loss for example $i$ as

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$

Let’s unpack this for a second. The term $s_j - s_{y_i}$ is the gap between the score of class $j$ and the score of the true class $y_i$. Ideally, you want $s_j$ to be very small (negative) and $s_{y_i}$ to be very large (positive). This makes the difference negative, so the whole term becomes $0$. The $\Delta$ is the margin, which is basically how much breathing room you give: it is a positive number, and it forces $s_j$ to be at least $\Delta$ smaller than $s_{y_i}$. Here’s a good visualization.

The $\max(0, \cdot)$ function is known as the hinge loss.
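A minimal sketch of this per-example loss (NumPy; `scores` is the vector $s$, `y` is the index of the true class, and $\Delta = 1$):

```python
import numpy as np

def svm_loss_i(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss for a single example."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0               # don't count j == y_i in the sum
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])   # example scores
print(svm_loss_i(scores, y=0))        # 2.9 with the default margin of 1
```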

We might add on a regularization term, such as the L2 penalty $R(W) = \sum_k \sum_l W_{k,l}^2$ weighted by a strength $\lambda$, which acts like Occam’s razor: we select the simplest $W$ that yields a low loss. It’s mathematically equivalent to a Bayesian prior (see 228 notes) and it helps with overfitting.

Interestingly, this $\Delta$ can just be set to $1$ all the time while we only tune $\lambda$. This is because $\Delta$ is just the margin of prediction, and we can change the scale of the score differences simply by changing the magnitude of $W$, which of course is modulated by $\lambda$. Therefore, $\Delta$ and $\lambda$ are actually modulating the same trade-off, and you only need to change one.
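A sketch of the full training objective, combining the average data loss with the L2 penalty (this reuses the `svm_loss_i` helper from the snippet above; `X` is (N, d), `y` is (N,), and `lam` plays the role of $\lambda$):

```python
import numpy as np

def full_svm_loss(W, X, y, lam, delta=1.0):
    # Average hinge loss over the dataset, plus lambda * ||W||^2
    data_loss = np.mean([svm_loss_i(W.dot(X[i]), y[i], delta)
                         for i in range(X.shape[0])])
    reg_loss = lam * np.sum(W * W)
    return data_loss + reg_loss
```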

Gradient of the SVM

This is something that you are very familiar with from 229 and 228: just take the gradient! This gives you a closed-form expression. For example, for the following SVM loss

$$L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$$

the gradient with respect to the target class row $w_{y_i}$ is

$$\nabla_{w_{y_i}} L_i = -\left(\sum_{j \neq y_i} \mathbb{1}\!\left(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0\right)\right) x_i$$

and the gradient with respect to any other class row $w_j$ (for $j \neq y_i$) is just

$$\nabla_{w_j} L_i = \mathbb{1}\!\left(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0\right) x_i$$

Our eventual goal was to take the derivative with respect to $W$, and we did this by splitting $W$ up into its rows, because that was the most convenient.
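A sketch of this row-by-row gradient for a single example (NumPy; `W` is (k, d), `x` is (d,), and `y` is the true class index):

```python
import numpy as np

def svm_loss_and_grad_i(W, x, y, delta=1.0):
    """SVM loss and gradient dL_i/dW for a single example."""
    scores = W.dot(x)
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0

    dW = np.zeros_like(W)
    active = (margins > 0).astype(np.float64)   # indicator: margin violated
    dW += np.outer(active, x)                   # rows j != y_i: +x if violated
    dW[y] = -active.sum() * x                   # row y_i: -(number of violations) * x
    return margins.sum(), dW
```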

Softmax classifier loss function

Next, we look at a different, more probabilistic loss function. Why don’t we turn the output of $f$ into a probability distribution and then take the cross-entropy?

$$H(p, q) = -\sum_x p(x) \log q(x), \quad \text{where } q_j = \frac{e^{f_j}}{\sum_k e^{f_k}}$$

Because $p(x)$ is a delta distribution (i.e., a one-hot vector on the true class), we can rewrite this in a simpler form. Essentially, it’s just the softmax term of the true class:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

which is equivalent to $-\log p(y_i \mid x_i; W)$. Just think about it for a second! You want to maximize the log probability, which you do by minimizing its negative. For numeric stability, we use a little trick to prevent overflow:

$$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{C\, e^{f_{y_i}}}{C \sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}, \quad \text{commonly with } \log C = -\max_j f_j$$
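A quick sketch of the trick in NumPy (the scores here are deliberately large so a naive `exp` would overflow; shifting by the max is the common choice of $\log C$):

```python
import numpy as np

f = np.array([123, 456, 789], dtype=np.float64)  # large scores: naive exp overflows

# Shift so the largest score is 0 before exponentiating
f_shifted = f - np.max(f)
p = np.exp(f_shifted) / np.sum(np.exp(f_shifted))   # safe softmax probabilities

loss = -np.log(p[2])   # cross-entropy loss if class 2 were the true class
```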

Gradient of Softmax

For the target class $k = y_i$, the gradient is

$$\frac{dL}{dW_k} = -x + \frac{e^{f_k}}{\sum_j e^{f_j}}\, x = \left(\frac{e^{f_k}}{\sum_j e^{f_j}} - 1\right) x$$

And if $k \neq y_i$, then we have

$$\frac{dL}{dW_k} = \frac{e^{f_k}}{\sum_j e^{f_j}}\, x$$

This is VERY similar to the SVM gradient. In fact, because the softmax components sum to $1$, you can say that the gradient of the softmax is just the negative of the gradient of the SVM, which makes sense because the softmax objective is a maximization at heart that we negate.
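A sketch of the corresponding gradient for one example (NumPy; `W` is (k, d), `x` is (d,), `y` is the true class index): the `p - one_hot` form below covers both cases above in one line.

```python
import numpy as np

def softmax_loss_and_grad_i(W, x, y):
    """Softmax (cross-entropy) loss and gradient dL_i/dW for a single example."""
    f = W.dot(x)
    f -= np.max(f)                        # numeric stability shift
    p = np.exp(f) / np.sum(np.exp(f))     # softmax probabilities

    loss = -np.log(p[y])

    dscores = p.copy()
    dscores[y] -= 1.0                     # p_k - 1 for the true class, p_k otherwise
    dW = np.outer(dscores, x)             # row k of dW gets dscores[k] * x
    return loss, dW
```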

SVM or Softmax?

The SVM is an asymmetric loss: if you’re correct, it doesn’t really care by how much, as long as you beat the margin. Softmax is a smoother loss, and it cares about how correct you are as well as whether you are correct. This has both pros and cons: with the SVM, you essentially pay less attention to irrelevant details.

Topography of the SVM loss

Recall that the SVM loss looks like this:

$$L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$$

Each $L_i$ is a piecewise-linear function of $W$, so the final loss function (a sum of the $L_i$) will also be piecewise linear.
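One way to see this structure is to evaluate the (unregularized) data loss along a random ray $W + a D$ in weight space; the resulting 1-D slice is a piecewise-linear curve. A sketch, reusing the `svm_loss_i` helper from earlier with small random data:

```python
import numpy as np

N, d, k = 50, 20, 3
X = np.random.randn(N, d)
y = np.random.randint(k, size=N)
W = np.random.randn(k, d)
D = np.random.randn(k, d)               # a random direction in weight space

def data_loss(W):
    return np.mean([svm_loss_i(W.dot(X[i]), y[i]) for i in range(N)])

# Loss along the 1-D slice a -> L(W + a*D): a piecewise-linear curve
alphas = np.linspace(-3, 3, 101)
losses = [data_loss(W + a * D) for a in alphas]
```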