Support Vector Machines

Tags: CS 229, Regressions

Key things to know

In hard-SVM, we assume that the data is linearly separable. We want to find the separating line (hyperplane) with the largest margin.

We predict things by

h_{w, b}(x) = \text{sign}(w^Tx + b)

To optimize w, b, we use the following objective:

\min_{w, b} \frac{1}{2}||w||^2 \quad \text{s.t.} \quad y^{(i)}(w^Tx^{(i)} + b) \geq 1

After working through the Lagrangian dual, you get that the optimal solution has the form

w^* = \sum_i^n\alpha_iy^{(i)}x^{(i)}

which means that the kernel trick can be used

If we are doing soft-SVM, we can think of SVM as minimizing hinge loss + L2 regularization
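
As a quick sanity check of that view, here is a minimal numpy sketch (the function names and the default C are my own illustrative choices) that evaluates the soft-SVM objective as hinge loss plus L2 regularization, and predicts with sign(w^Tx + b):

```python
import numpy as np

def soft_svm_objective(w, b, X, y, C=1.0):
    """Soft-SVM objective viewed as hinge loss + L2 regularization.

    X: (n, d) data matrix, y: labels in {-1, +1}.
    """
    margins = y * (X @ w + b)                 # functional margins y(w^T x + b)
    hinge = np.maximum(0.0, 1.0 - margins)    # hinge loss per example
    return 0.5 * np.dot(w, w) + C * hinge.sum()

def predict(w, b, X):
    """h_{w,b}(x) = sign(w^T x + b)."""
    return np.sign(X @ w + b)
```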

The big idea

Given two clusters of data, how can we separate them such that the margin between the two clusters is as big as possible?

In the notes below, we will mostly talk about linear SVMs. But then we'll talk about how we can use a kernel to get an arbitrary boundary out of this "linear" classifier!

Notation

Previously in logistic regression we used labels y \in \{0, 1\}. Now, we will use labels y \in \{-1, 1\}, and we parameterize the model as

h_{w, b}(x) = g(w^Tx + b)

This also means that we drop the convention that x_0 = 1.

Furthermore, g is no longer a continuous function. Rather, if w^T x + b < 0, it predicts -1, and if it's \geq 0, it predicts 1. We call g the \text{sign} function.

So a trained SVM model is just like a kernelized perceptron

What is a margin?

Back in logistic regression, you could understand g(\theta^T x) as a measure of model confidence that x is a positive example. If it's greater than 0.5, we just called it a positive prediction, and vice versa.

The further we are from that dividing line of \theta^T x = 0, the more confident we are. This distance is called the margin.

Functional margins

The functional margin of (w, b) with respect to a training example (x^{(i)}, y^{(i)}) is defined as

\hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)} + b)

As you notice, if \hat{\gamma}^{(i)} is positive, then we are classifying correctly. The larger \hat{\gamma}^{(i)} is, the more confident we are.

Functional margins WRT a set

Given a training set S, we define the functional margin of (w, b) with respect to S as

\hat{\gamma} = \min_i \hat{\gamma}^{(i)}

or, in other words, the least confident (or the most incorrect) prediction. This is the definition we use when we talk about functional margins.

Geometric Margins

We provide another way of understanding a margin, this time geometrically. Picture the decision boundary w^Tx + b = 0 and a point A at some distance \gamma from it. We know that w / ||w|| is the unit vector normal to the boundary. As such, we can imagine moving back from point A by traveling a length \gamma to reach a point B on the boundary. This is mathematically expressed as

w^T\left(x - \gamma \frac{w}{||w||}\right) + b = 0

When we solve for \gamma, we get

\gamma = \frac{w^Tx + b}{||w||} = \left(\frac{w}{||w||}\right)^T x + \frac{b}{||w||}

Now, margins should be positive for correct predictions. To account for the negative distance on the other side of the boundary, we multiply everything by y, giving the geometric margin

\gamma = y\left(\left(\frac{w}{||w||}\right)^T x + \frac{b}{||w||}\right)

Relating geometric and functional margins

Note how the geometric and the functional margins look very similar, and they are equivalent if ||w|| = 1.

The geometric margin value does not change when we scale w and b, because we divide by the weight norm. A good margin quantifier should exploit this feature: scaling w, b does not change the decision boundary at all, so the margin shouldn't change either.

Generally, when we talk about margins, we talk about the margin with respect to a dataset, which means the worst-case example. We represent this with \hat{\gamma} (note the hat). And generally, we will refer to the geometric margin.
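
To make the two definitions concrete, here is a small numpy sketch (the data points and the (w, b) values are made up purely for illustration) that computes the functional and geometric margins of a set and checks that scaling (w, b) only changes the functional one:

```python
import numpy as np

# Toy data (made up for illustration); y in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -1.0

def functional_margin(w, b, X, y):
    # \hat{gamma} = min_i y^(i) (w^T x^(i) + b)
    return np.min(y * (X @ w + b))

def geometric_margin(w, b, X, y):
    # gamma = min_i y^(i) ((w/||w||)^T x^(i) + b/||w||)
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

print(functional_margin(w, b, X, y), geometric_margin(w, b, X, y))
# Scaling (w, b) rescales the functional margin but leaves the geometric one alone:
print(functional_margin(10 * w, 10 * b, X, y), geometric_margin(10 * w, 10 * b, X, y))
```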

Optimal margin classifier objective

The direct objective

What do we want to optimize? We want to maximize the margin of a training set, which means maximizing the lower bound of y(w^Tx + b). This looks like

\max_{w, b}\min_{i}y^{(i)}(w^Tx^{(i)} + b)

Writing this direct objective

With this goal in mind, we start to formulate the constrained optimization problem. Our overall objective is

\max_{w, b}\min_{i}y^{(i)}(w^Tx^{(i)} + b)

but we constrain it to ||w|| = 1 to keep the functional margin and the geometric margin equivalent.

In addition, we turn the inner \min into another constraint, which gets us this objective:

\max_{\gamma, w, b} \ \gamma \quad \text{s.t.} \quad y^{(i)}(w^Tx^{(i)} + b) \geq \gamma \ \text{for all } i, \quad ||w|| = 1

However, we reach a problem. Our constraints are not affine (||w|| = 1 is not even convex), which means that we can't apply the dual Lagrangian properties (see the Lagrangian section) to our problem. Duality allows for an easier pathway of optimization.

Improving the objective

The problem child is the ||w|| = 1 constraint getting in the way. We motivated it by the equivalence between the margin types, but can we just compute the functional margin \hat{\gamma} and convert it to the geometric one by dividing by ||w||? This is our motivation below:

\max_{\hat{\gamma}, w, b} \ \frac{\hat{\gamma}}{||w||} \quad \text{s.t.} \quad y^{(i)}(w^Tx^{(i)} + b) \geq \hat{\gamma} \ \text{for all } i

Unfortunately, this is still not there yet. Our constraints are now linear, but the objective \hat{\gamma}/||w|| is not convex, so the KKT conditions are still not met.

The final form

Now, to make the final push, we make one crucial insight. The margin is a unitless constant, and we are dealing with the functional margin, which is not invariant to scaling. So what if we just defined some scalar c such that c = 1/\hat{\gamma}? In this case, we would have y(cw^Tx + cb) \geq 1. And we can swallow this c into the w and the b.

Essentially, we argue that instead of juggling both \hat{\gamma} and ||w|| as we were trying above, why don't we hold one of them constant? After all, \hat{\gamma} and w are related, SPECIFICALLY because we are dealing with the FUNCTIONAL margin.

With this insight, we reframe the objective to be \max_{w, b} 1/||w||. To maximize this, all we need to do is minimize ||w||, or equivalently ||w||^2. We arrive at this objective:

\min_{w, b} \ \frac{1}{2}||w||^2 \quad \text{s.t.} \quad y^{(i)}(w^Tx^{(i)} + b) \geq 1 \ \text{for all } i

Now this is what we are looking for! A convex objective with linear constraints. The solution (w, b) gives the optimal margin classifier, and this can be solved using quadratic programming.
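
For illustration, here is a minimal sketch of that QP (assuming cvxpy is available; any QP solver would do, and the toy data is made up):

```python
import numpy as np
import cvxpy as cp  # assumption: cvxpy is installed

# Toy separable data (made up for illustration); y in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# min 1/2 ||w||^2  s.t.  y^(i) (w^T x^(i) + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, "b* =", b.value)
```

In practice we solve the dual instead (derived below) so that kernels can be used, but the primal QP is the most direct translation of the objective above.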

Optimal margin classifier derivation

Understanding the role of \alpha

Now, we are ready to start using Lagrangians to derive the optimal margin classifier. We need the general Lagrangian. First, we define the constraint function g_i:

g_i(w) = -y^{(i)} (w^T x^{(i)} + b) + 1 \leq 0

We will eventually multiply g_i by its Lagrange multiplier \alpha_i.

Now, let's look back at the KKT conditions (see the Lagrangian notes). Recall that we have \alpha_i^* g_i(w^*) = 0, meaning \alpha_i > 0 only when g_i(w) = 0, i.e. when y^{(i)}(w^Tx^{(i)} + b) = 1. This happens only for the points whose margin is the smallest. If you're confused, refer back to our objective: the constraint is the lower bound for the margins.

We represent this graphically below.

The solid line is the boundary, and the maximum margin is denoted with the dotted lines passing through the points with the least margin. These points are the ONLY points whose \alpha > 0. We call these points the support vectors, and we note that only a small subset of the dataset are support vectors.

Intuitively, it means that anything that isn't a support vector can essentially be removed from the dataset without changing ANYTHING. We will see the math behind this in a second

The Lagrangian

The Lagrangian becomes

\mathcal{L}(w, b, \alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^n \alpha_i\left[y^{(i)}(w^Tx^{(i)} + b) - 1\right]

Usually, we would have to work with the min-max (primal) form. However, this Lagrangian problem satisfies the conditions for strong duality and KKT: the objective is convex (its Hessian is PSD) and the constraints are affine.

As such, we can also find the max-min. This means that we first minimize \mathcal{L} with respect to w, b. Then, we maximize that expression with respect to \alpha.

Minimizing

We just take the derivative and set it equal to zero. First, let's start with w:

\nabla_w \mathcal{L} = w - \sum_i \alpha_i y^{(i)}x^{(i)} = 0

This means that

w^* = \sum_i \alpha_i y^{(i)}x^{(i)}

And now, let's do b:

\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y^{(i)} = 0

We didn't solve for b, but we arrived at a constraint: \sum_i \alpha_i y^{(i)} = 0. Now, let's take these two insights and simplify the Lagrangian. Here are the steps we took:

  1. Simplify ||w||^2 using w^Tw. We can directly transpose the summation for w^* because the transpose distributes.
  2. Simplify w^Tx^{(i)} using the same technique.
  3. Note that \frac{1}{2}||w||^2 now has the exact same form as \sum_i \alpha_iy^{(i)}w^Tx^{(i)}, so we combine them into the middle term below (1/2 - 1 = -1/2 is where that negative comes from).
  4. The first and last terms below come from the other parts of the second term of the Lagrangian.

We arrive at this form:

\mathcal{L} = \sum_i \alpha_i - \frac{1}{2}\sum_{i, j} y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle - b\sum_i \alpha_i y^{(i)}

We also notice that the last term is 0, by the constraint \sum_i \alpha_i y^{(i)} = 0 we derived. Therefore

W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i, j} y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle

As such, we have completed the min step of the optimization. Now, let's do the max!

Maximizing

Now, this expression above is what we are left with after performing the minimization. With these constraints in mind, we will maximize the expression above with respect to \alpha, subject to two key constraints:

\max_\alpha W(\alpha) \quad \text{s.t.} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y^{(i)} = 0

The first constraint arises from the definition of \alpha, and the second constraint arises from our derivation of b. Now, the implicit third constraint lies in how we wrote the maximization objective: we didn't use w, but rather the expression we derived above.

How do we do this? Well, let's talk about it in SMO (scroll down). For the next few sections, we assume that we have this solution, and we'll talk a little bit about what we can do with it.
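
As a small concrete reference, here is a numpy sketch (assuming a linear kernel, i.e. the plain Gram matrix of inner products; helper names are mine) of the dual objective W(\alpha) and its constraints:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>."""
    G = X @ X.T                             # Gram matrix of inner products
    Q = (y[:, None] * y[None, :]) * G       # y_i y_j <x_i, x_j>
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def is_dual_feasible(alpha, y, tol=1e-8):
    """Check alpha_i >= 0 and sum_i alpha_i y_i = 0."""
    return bool(np.all(alpha >= -tol) and abs(alpha @ y) < tol)
```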

Deriving the needed w, b

Once we solve for \alpha, we will be able to find w through the equation below.

w^* = \sum_i \alpha_i^* y^{(i)}x^{(i)}

Now, b^* can be derived. How? Well, it has a graphical intuition

With our w^*, we've already settled on the orientation of the line. We know our final solution has the form w^Tx + b. We want to find the closest positive and negative points (the support vectors) and place the boundary halfway between them.

Now, the intercept just puts the boundary at that midpoint:

b^* = -\frac{\max_{i: y^{(i)} = -1} w^{*T}x^{(i)} + \min_{i: y^{(i)} = 1} w^{*T}x^{(i)}}{2}

The Support Vector Insight

But this allows us to do something very interesting. When we run the model, we just need to compute w^Tx + b. We can replace w with the w^* definition, which gets us the following:

w^Tx + b = \sum_i^n\alpha_i y^{(i)}\langle x^{(i)}, x\rangle + b

Now, this gives us a lot of interesting insight. If we've found the necessary \alpha, then predicting the output is equivalent to taking inner products with the points in the training set. And this is even neater: as we recalled, \alpha_i is only non-zero when x^{(i)} is on the edge of the margin! So, the prediction can be done by just looking at a select few points, which we called support vectors. Nice!!
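
A minimal numpy sketch of this recovery and support-vector-only prediction (the helper names are mine, and b^* uses the averaging formula above):

```python
import numpy as np

def recover_w(alpha, X, y):
    """w* = sum_i alpha_i y^(i) x^(i)."""
    return (alpha * y) @ X

def recover_b(alpha, X, y):
    """b* = -(max_{i:y=-1} w*^T x^(i) + min_{i:y=+1} w*^T x^(i)) / 2."""
    w = recover_w(alpha, X, y)
    scores = X @ w
    return -(scores[y == -1].max() + scores[y == 1].min()) / 2

def predict(alpha, X, y, b, x_new, tol=1e-8):
    """Predict using only the support vectors (alpha_i > 0)."""
    sv = alpha > tol
    return np.sign(np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b)
```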

The Kernel Insight

We note that at test time, we get that

w^Tx + b = \sum_i^n\alpha_i y^{(i)}\langle x^{(i)}, x\rangle + b

Hold up! This inner product is reminiscent of the kernel! So...how can we extend this linear classifier derivation to arbitrary boundaries?

w^Tx + b = \sum_i^n\alpha_i y^{(i)}K(x^{(i)}, x) + b

So this is how it works at test time, but what about training?

Well, recall that our maximization step looks like this:

\max_\alpha W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i, j}^n y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle

So you see the inner product? This is the kernel corresponding to the identity feature map. In reality, we can replace this kernel with anything! This gets us

\max_\alpha W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i, j}^ny^{(i)}y^{(j)}\alpha_i\alpha_j K(x^{(i)}, x^{(j)})

And there we have it! We are suddenly able to create an arbitrary boundary that is more complicated than a linear boundary!

The infinite kernel: arbitrary boundaries

Recall that we have the Gaussian kernel

K(x, z) = \exp\left(\frac{-||x - z||^2}{2\sigma^2}\right)

What if \sigma approached 0? Recall that our prediction algorithm is

w^Tx + b = \sum_i^n\alpha_i y^{(i)}K(x^{(i)}, x) + b

As such, the value of the summation is determined almost solely by the point x^{(i)} that is the closest to x (let's assume that \alpha_i = 1 for simplicity). This means that we can draw a perfect boundary between two classes scattered arbitrarily!

This goes back to the idea that the Gaussian kernel is infinite dimensional. It can make any boundary it wants! It also goes to show that the smaller the \sigma, the more "overfit" your boundary can become.
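
A tiny numpy sketch of this effect (the toy points, \alpha_i = 1, and b = 0 are assumptions purely for illustration): with a very small \sigma, the kernel weight of the nearest training point dwarfs the rest, so the prediction simply copies that point's label.

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def kernel_predict(alpha, X, y, b, x_new, sigma):
    """f(x) = sum_i alpha_i y^(i) K(x^(i), x) + b with the Gaussian kernel."""
    k = np.array([gaussian_kernel(xi, x_new, sigma) for xi in X])
    return np.sign(np.sum(alpha * y * k) + b)

# With alpha_i = 1, b = 0, and a tiny sigma, the nearest training point dominates:
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([1, -1, 1])
alpha, b = np.ones(3), 0.0
print(kernel_predict(alpha, X, y, b, np.array([0.9, 1.1]), sigma=0.05))  # -> -1.0
```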

Regularization (soft SVM)

If the data is linearly separable, then the previous derivation of SVM is good enough. However, real data is not that nice. It may not be linearly separable, or an outlier might force a very skewed margin

(when we add one outlier point, the margin must pivot to a very extreme angle).

Kernels may help when we are dealing with non-linearly separable data, but for outliers, we can use a regularizer (\ell_2 regularization).

Essentially, we allow examples to have narrower margins (including negative margins, meaning an incorrect prediction), but we want to minimize the amount of "borrowing" we have to do, which is what the \min expression below does:

\min_{w, b, \xi} \ \frac{1}{2}||w||^2 + C\sum_i \xi_i \quad \text{s.t.} \quad y^{(i)}(w^Tx^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0

The Lagrangian is formulated as follows:

\mathcal{L}(w, b, \xi, \alpha, r) = \frac{1}{2}w^Tw + C\sum_i \xi_i - \sum_i \alpha_i\left[y^{(i)}(w^Tx^{(i)} + b) - 1 + \xi_i\right] - \sum_i r_i\xi_i

Here, r, \alpha are our Lagrange multipliers. After doing the optimization and simplifying, we get the following sub-objective:

\max_\alpha W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i, j} y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y^{(i)} = 0

Interestingly, the only change to the dual problem is that each \alpha_i is now bounded above by the constant C, which makes sense because C controls how much slack we allow (formally, the stationarity condition for \xi_i gives C - \alpha_i - r_i = 0 with r_i \geq 0).

Soft-SVM as hinge loss

We can actually rewrite the soft-SVM objective as a sort of hinge loss. Essentially, instead of setting a lower bound on the borrowing, why don't we just minimize the borrowing directly?

\min_{w, b} \ \sum_i \max\left(0, 1 - y^{(i)}(w^Tx^{(i)} + b)\right) + \lambda ||w||^2

Here, we arrive at the hinge loss. When the model is confidently correct, y and w^Tx + b have the same sign and y(w^Tx + b) \geq 1, so the loss is exactly zero; predictions just inside the margin incur small losses.
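
A rough numpy sketch of training under this view via subgradient descent (the step size, \lambda, and epoch count are arbitrary illustrative choices, not a tuned recipe):

```python
import numpy as np

def train_hinge_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize (1/n) sum_i max(0, 1 - y_i (w^T x_i + b)) + lam ||w||^2
    by subgradient descent. A rough sketch, not a production solver."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                  # examples with nonzero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```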

SMO Algorithm (solving the dual problem)

Now, in the case of SVM, after performing one round of Lagrangian optimization, we arrive at another constrained optimization problem: the dual above, where we maximize W(\alpha) subject to 0 \leq \alpha_i \leq C and \sum_i \alpha_i y^{(i)} = 0.

How do you actually solve this?

We can use a technique called sequential minimal optimization (SMO), which does something interesting: it optimizes one parameter at a time. So it might tune \alpha_1, then tune \alpha_2, and so on.

This is what it might look like:
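
As a rough illustration of plain coordinate ascent (the grid-based line search over each coordinate is just for clarity; it is not how SMO actually solves each subproblem):

```python
import numpy as np

def coordinate_ascent(W, alpha, n_passes=10, step=0.01, n_steps=100):
    """Crude coordinate ascent: repeatedly sweep over the coordinates and
    improve each one while holding the others fixed (tiny grid line search)."""
    alpha = alpha.copy()
    for _ in range(n_passes):
        for i in range(len(alpha)):
            # Try small moves in alpha_i and keep the best one.
            candidates = alpha[i] + step * np.arange(-n_steps, n_steps + 1)
            values = []
            for c in candidates:
                trial = alpha.copy()
                trial[i] = c
                values.append(W(trial))
            alpha[i] = candidates[int(np.argmax(values))]
    return alpha
```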

Adapting SMO to work with our problem

The issue is that we can't just tune one \alpha_i at a time, because \sum_i \alpha_i y^{(i)} = 0: fixing all the other \alpha's completely determines the remaining one. As such, we must perform SMO on a pair of \alpha's at a time.

Let's say that we picked \alpha_1, \alpha_2. This optimization problem is actually really easy, because \alpha_1y^{(1)} + \alpha_2y^{(2)} = -\sum_{k \neq 1, 2} \alpha_k y^{(k)}, a constant we call \zeta. This is just a line! Furthermore, in the case of the regularized SVM, we also know that 0 \leq \alpha_i \leq C, so we are confined to a short line segment.

Because y \in \{-1, 1\} (so 1/y^{(1)} = y^{(1)}), we can actually write \alpha_1 in terms of \alpha_2 as

\alpha_1 = (\zeta-\alpha_2 y^{(2)})y^{(1)}

The objective then becomes

W(\alpha_1, \alpha_2, \ldots, \alpha_n) = W\left((\zeta-\alpha_2 y^{(2)})y^{(1)}, \alpha_2, \ldots, \alpha_n\right)

If you look at the definition of W, this just means that W is a quadratic function in \alpha_2, which has a guaranteed extremum. In this case, it's a maximum (clipped back to the feasible segment if needed).

To check for convergence, we just verify that the KKT conditions are satisfied within some error tolerance.
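
To make the pairwise update concrete, here is a hedged numpy sketch (a linear kernel and a crude grid search stand in for SMO's closed-form update; the function name and grid size are my own choices):

```python
import numpy as np

def pairwise_smo_step(alpha, X, y, C, i, j, n_grid=200):
    """One SMO-style step on the pair (alpha_i, alpha_j): hold the other alphas
    fixed, keep alpha_i y_i + alpha_j y_j constant, and maximize W along that
    line segment inside the box [0, C]."""
    G = X @ X.T                               # linear-kernel Gram matrix
    Q = (y[:, None] * y[None, :]) * G

    def W(a):
        return a.sum() - 0.5 * a @ Q @ a

    zeta = alpha[i] * y[i] + alpha[j] * y[j]  # constant fixed by the constraint
    best_val, best = -np.inf, (alpha[i], alpha[j])
    for aj in np.linspace(0.0, C, n_grid):
        ai = (zeta - aj * y[j]) * y[i]        # alpha_i implied by the constraint
        if ai < 0 or ai > C:                  # stay inside the box [0, C]
            continue
        trial = alpha.copy()
        trial[i], trial[j] = ai, aj
        val = W(trial)
        if val > best_val:
            best_val, best = val, (ai, aj)
    alpha[i], alpha[j] = best
    return alpha
```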

SVM vs Logistic regression

There's a really subtle difference here that is very important. Logistic regression does NOT assume that there is a clean boundary between the classes. As such, it will find a \theta that creates enough of a "fuzzy zone" to best represent the data.

However, SVMs without regularization assume that there IS indeed such a boundary.

If you are sure that the data is separable, you should use an SVM. This is because if you used logistic regression, \theta would diverge as it makes the boundary sharper and sharper. Since the SVM starts from an infinitely sharp boundary, there's no need for this weird behavior.