Kernel Methods

Tags: CS 229, Regressions

Feature maps

We talked a little bit about this before: we can reframe polynomial regression as a linear regression problem by defining a feature map, e.g. $\phi(x) = [1, x, x^2, x^3]^T$ for the cubic case.

A cubic function of $x$ is a linear function of $\phi(x)$. We call the original input $x$ the attributes and the outputs of $\phi$ the features, and so we call $\phi$ a feature map.

As we have derived before, linear regression does not depend on what the feature map is. We can just replace all instances of $x$ with $\phi(x)$.

While in this example $\phi$ maps a real number to a vector, we can also have $\phi$ map a vector to another, expanded vector. So don't think it has to be scalar to vector!
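
To make this concrete, here is a minimal sketch (assuming NumPy; the function names are just for illustration) of a cubic feature map on a scalar attribute and a quadratic-style map on a vector attribute:

```python
import numpy as np

def phi_cubic(x):
    """Cubic feature map on a scalar attribute: R -> R^4."""
    return np.array([1.0, x, x**2, x**3])

def phi_quadratic_vec(x):
    """Feature map on a vector attribute: concatenates 1, every x_i,
    and every pairwise product x_i * x_j, so R^d -> R^(1 + d + d^2)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

# "Linear" regression over the features: h(x) = theta^T phi(x)
theta = np.random.randn(4)
print(theta @ phi_cubic(2.0))
```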

Problem with feature maps

Feature maps go from $R^d \to R^p$ with $p > d$. As such, $\theta \in R^p$. The more expressive you want to be, the larger $p$ has to become. This may eventually become computationally intractable. Can we produce some sort of $\theta$ update and prediction algorithm whose complexity doesn't depend on $p$?

The Kernel Trick

Let's work this out for simple linear regression, and we'll draw it out to a more general setting at the end.

The problem is that $\theta$ has the same number of elements as $\phi(x)$, which can be very large. The observation is that, if we start with $\theta = 0$, then at any point during training we can express $\theta$ as a linear combination of the feature vectors $\phi(x^{(1)}), \phi(x^{(2)}), \dots$. Now, more formally, we claim that we can always write

$$\theta = \sum_{i=1}^n \beta_i\, \phi(x^{(i)})$$

for some coefficients $\beta_1, \dots, \beta_n$.

Proving this: induction

The base case is obvious: if we start with $\theta = 0$, then taking all $\beta_i = 0$ works.

Now, the induction step. First, assume that we have $\theta = \sum_i \beta_i \phi(x^{(i)})$. Now, let's apply the gradient update!

$$\theta := \theta + \alpha\sum_{i=1}^n \left(y^{(i)} - \theta^T\phi(x^{(i)})\right)\phi(x^{(i)}) = \sum_{i=1}^n\left(\beta_i + \alpha\left(y^{(i)} - \theta^T\phi(x^{(i)})\right)\right)\phi(x^{(i)})$$

Do you see how the update collapsed into a new expression for each $\beta_i$? That's the trick!!

As such, we derive the $\beta$ rule to be

$$\beta_i := \beta_i + \alpha \left(y^{(i)} - \theta^T \phi(x^{(i)})\right)$$

Now, we can expand the old $\theta$ inside the parentheses to get

$$\beta_i := \beta_i + \alpha \left(y^{(i)} - \sum_{j=1}^n \beta_j\, \phi(x^{(j)})^T \phi(x^{(i)})\right)$$

Using the kernel trick: prediction

So we arrive at an important conclusion. We can express $\theta$ in terms of $\beta$ only, and we perform updates using the equation above.

To predict, we note that we want $\theta^T\phi(x)$, which we can rewrite using our summation definition of $\theta$. Eventually, we get

$$\theta^T \phi(x) = \sum_{j=1}^n\beta_j \langle \phi(x^{(j)}), \phi(x)\rangle$$

Looking at complexity, a prediction is $O(np)$, where $p$ is the dimension of the feature vector $\phi(x)$. We can potentially get past this $p$ if we can compute the inner product more cheaply (more on this in a second).
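
As a sketch of what this prediction looks like in code (helper names are mine; `kernel` can be any callable that computes the inner product, including a plain $\phi(x)^T\phi(z)$):

```python
def kernel_predict(beta, X_train, x, kernel):
    """theta^T phi(x) = sum_j beta_j * K(x^(j), x).

    beta    : length-n list/array of coefficients
    X_train : n training attribute vectors
    x       : new attribute vector
    kernel  : callable K(u, v) -> float
    """
    return sum(b * kernel(xj, x) for b, xj in zip(beta, X_train))
```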

Using the kernel trick: updates

We have already arrived at the update formulation. Instead of updating $\theta$, we train by updating $\beta$, which by proxy updates $\theta$. We can rewrite the equation above as the update formula simply by writing the $\phi^T\phi$ term as an inner product:

$$\beta_i := \beta_i + \alpha \left(y^{(i)} - \sum_{j=1}^n\beta_j \langle \phi(x^{(j)}), \phi(x^{(i)})\rangle \right)$$

Looking at complexity, one update of a single $\beta_i$ takes $O(n)$ kernel evaluations, which means that one whole pass over all of $\beta$ is $O(n^2)$.
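
A rough sketch of one full pass of this update (function and variable names are mine, not from the notes):

```python
def beta_pass(beta, X_train, y, alpha, kernel):
    """One full pass: beta_i := beta_i + alpha * (y_i - sum_j beta_j K(x^(j), x^(i))).
    Uses the old betas for every prediction, matching the batch update above."""
    n = len(beta)
    new_beta = list(beta)
    for i in range(n):
        pred_i = sum(beta[j] * kernel(X_train[j], X_train[i]) for j in range(n))
        new_beta[i] = beta[i] + alpha * (y[i] - pred_i)
    return new_beta
```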

Introducing the kernel

What you might have noticed is that we were able to reduce both the prediction and the update to some sort of inner product. We give this a special name!

The kernel is the inner product $\langle \phi(x), \phi(z)\rangle$. We formally define it as

$$K(x, z) = \langle \phi(x), \phi(z)\rangle = \phi(x)^T\phi(z)$$

Any function $K$ that can be expressed in this form is a kernel. We will talk more about this in a second.

The power of kernels

We might ask, why does this kernel definition help us? Well, let's look at a concrete example.

Suppose that we had a cubic feature map whose entries were $[1,\, x_i,\, \dots,\, x_ix_j,\, \dots,\, x_ix_jx_k]$. If $x$ were $d$-dimensional, $\phi(x)$ would have dimension $1 + d + d^2 + d^3$, which is pretty large. But let's actually calculate $\langle \phi(x), \phi(z)\rangle$.

$$\langle \phi(x), \phi(z)\rangle = 1 + \sum_i x_i z_i + \sum_{i, j}x_ix_jz_iz_j + \sum_{i, j, k}x_ix_jx_kz_iz_jz_k$$

Now, since each sum factors (e.g. $\sum_{i,j} x_i x_j z_i z_j = \left(\sum_i x_i z_i\right)^2$), we get that

$$\langle \phi(x), \phi(z)\rangle = 1 + \sum_i x_i z_i + \left(\sum_ix_iz_i\right)^2 + \left(\sum_ix_iz_i\right)^3$$

Which means that

$$\langle \phi(x), \phi(z)\rangle = 1 + \langle x, z\rangle + \langle x, z\rangle^2 + \langle x, z\rangle^3$$

We have decoupled the inner product computation from the feature mapping! Computing the kernel this way takes $O(d)$ time, even though $\phi(x)$ has $1 + d + d^2 + d^3$ entries.
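
Here is a quick numeric sanity check of this identity (my own sketch, assuming NumPy; the explicit map enumerates all $1 + d + d^2 + d^3$ monomials, while the kernel-trick version only ever touches $d$ numbers):

```python
import numpy as np
from itertools import product

def phi_cubic_explicit(x):
    """[1] + all x_i + all x_i*x_j + all x_i*x_j*x_k."""
    feats = [1.0] + list(x)
    feats += [a * b for a, b in product(x, x)]
    feats += [a * b * c for a, b, c in product(x, x, x)]
    return np.array(feats)

def cubic_kernel(x, z):
    """1 + <x,z> + <x,z>^2 + <x,z>^3, computed in O(d) time."""
    s = float(np.dot(x, z))
    return 1 + s + s**2 + s**3

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(phi_cubic_explicit(x) @ phi_cubic_explicit(z), cubic_kernel(x, z)))  # True
```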

Kernel functions

There are many kernel functions out there. Above, we looked at the kernel for

$$1 + x^Tz + (x^Tz)^2 + (x^Tz)^3$$

In general, the kernel $K(x, z) = (x^Tz + c)^k$ is the polynomial kernel. The corresponding feature map contains the individual terms of the expansion of the polynomial. A polynomial kernel of degree $k$ corresponds to a feature map of dimension on the order of $d^k$, where $d$ is the dimension of the input $x$.

For continuous inputs, the most popular kernel is the Gaussian kernel, which is also known as the radial basis function (RBF) kernel.

$$K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$$

We can actually show that this Gaussian kernel corresponds to a feature map of infinite dimension.
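
A minimal sketch of this kernel in code (assuming NumPy; `sigma` is a hyperparameter you choose):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian / RBF kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2 * sigma**2))
```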

There are still other kernels for specialized applications. For example, in a sequence classification problem, we might use $\phi(x)$ to count every possible sequence of length 4 that occurs in $x$.

Picking your kernel function

At heart, we must remember that the kernel function is some measure of similarity between inputs. For sequences, it might be shared subsequences. For classification over points in space, it might be distance. Distance itself is a tricky business.

If you use a simple dot-product kernel, the decision boundary must be linear. Polynomial kernels allow for boundaries with varying curves, and the Gaussian kernel allows for circular (closed) separation. You can think of the kernel as determining the family of separations the model can express. Choosing the right kernel is a design choice.

When we make a prediction, we effectively care mostly about the training points that are most "similar" to the input, and we look at their $\beta_j$. These most-similar points should have $\beta_j$ of the same sign, so that the combined products $\beta_j K$ push the prediction one way. Of course, in real life it's noisier, but with many data points, the majority consensus wins. This is the intuition behind the kernel.

Kernel Matrices

We can produce a kernel matrix from the definition of a kernel:

$$K_{ij} = K(x^{(i)}, x^{(j)})$$

The really neat part is that the kernel matrix can be computed beforehand with a complexity of $O(n^2d)$. After that, training time is independent of the feature dimension. We can even use tricks like the ones described above to compute each entry without ever evaluating $\phi$.
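
A sketch of that precomputation (assuming NumPy; `kernel` is any kernel function, such as the polynomial or RBF sketches above):

```python
import numpy as np

def kernel_matrix(X, kernel):
    """Precompute K_ij = K(x^(i), x^(j)) for all training pairs.
    O(n^2) kernel evaluations, i.e. O(n^2 d) total if each evaluation is O(d)."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(X[i], X[j])  # fill both halves by symmetry
    return K
```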

Properties of kernel matrices

We know that $K$ must be symmetric (a property of inner products), and it must also be PSD. We can prove this by computing $z^TKz$ as follows:

$$z^TKz = \sum_i\sum_j z_i z_j \langle \phi(x^{(i)}), \phi(x^{(j)})\rangle = \left\|\sum_i z_i\, \phi(x^{(i)})\right\|^2 \ge 0$$

Mercer's Theorem

Now, we are going to generalize a bit more. Not only are symmetry and positive semidefiniteness necessary properties of a kernel matrix, they are also sufficient for the kernel to be valid. This reverse direction is the content of Mercer's theorem:

Let $K: R^d \times R^d \to R$. Then $K$ is a valid kernel if and only if, for any finite set of points $\{x^{(1)}, \dots, x^{(n)}\}$, the corresponding kernel matrix is symmetric and positive semidefinite.

This means that you no longer need to derive a feature map to prove that $K$ is a valid kernel; all you need to do is prove that its kernel matrices are PSD and symmetric!

Proving valid kernels

There's no perfect recipe, but if you're dealing with polynomials (like $(1 + x^Ty)^3$), it's best to keep the transpose notation and start with the summation expansion of $z^TKz$, like

$$\sum_i\sum_j z_iz_j\left(1 + (x^{(i)})^Tx^{(j)}\right)$$

Keeping the transpose notation generally makes these expansions easier to manipulate.
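
For instance, finishing this kind of expansion for the simpler degree-one kernel $K(x, y) = 1 + x^Ty$ (a sketch; the cubic case works the same way, just with more cross terms):

$$\sum_i\sum_j z_iz_j\left(1 + (x^{(i)})^Tx^{(j)}\right) = \sum_i\sum_j z_iz_j + \sum_i\sum_j z_iz_j\,(x^{(i)})^Tx^{(j)} = \left(\sum_i z_i\right)^2 + \left\|\sum_i z_i\, x^{(i)}\right\|^2 \ge 0$$

so every kernel matrix is PSD (and symmetry is immediate), which makes the kernel valid.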

In general, here are the things to keep in mind:

  1. Powers of a valid kernel (and, more generally, any polynomial of a valid kernel with nonnegative coefficients) are still valid.
  2. The exponential of a valid kernel is also valid, by Taylor expansion.
  3. Element-wise products of valid kernels are still valid.
    • Proof

      Essentially, you can define a new feature map whose entries are the products $\phi_c(x)_{mn} = \phi_a(x)_m\, \phi_b(x)_n$ and show that its dot product is exactly the product kernel $K_a(x, z)\, K_b(x, z)$.

  4. Sums of valid kernels are still valid (because sums of PSD matrices are still PSD).

Vectorizing with Kernel Matrices

Now, with the kernel matrix, we are able to vectorize the training. The training update is as follows:

$$\beta := \beta + \alpha(y - K\beta)$$

(think about this real quick and it should make perfect sense: entry $i$ of $K\beta$ is exactly $\sum_j \beta_j K(x^{(i)}, x^{(j)})$, the prediction for $x^{(i)}$)

Unfortunately, however, the prediction step can't reuse the precomputed matrix, because a new input $x$ requires computing fresh kernel values $K(x^{(j)}, x)$ against every training point.
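
Here is a minimal sketch of the vectorized training loop (assuming NumPy and a precomputed kernel matrix `K`, e.g. from the `kernel_matrix` sketch above):

```python
import numpy as np

def train_kernelized(K, y, alpha=0.01, n_iters=1000):
    """Vectorized beta updates: beta := beta + alpha * (y - K @ beta)."""
    beta = np.zeros(len(y))
    for _ in range(n_iters):
        beta = beta + alpha * (y - K @ beta)
    return beta
```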

Kernels above and beyond

Previously, we looked at kernels when it came to linear regression. This is cool, but we can do so much more. How about logistic regression?

Well, we know that the update rule is

$$\theta := \theta + \alpha\left(y - h_\theta(\phi(x))\right)\phi(x)$$

Therefore, you can apply the exact same kernel trick, except the hypothesis wraps the summation in a sigmoid:

$$h_\theta(x) = \sigma\left(\sum_{j=1}^n\beta_j\, K(x^{(j)}, x) \right)$$

You can imagine how we can kernelize other GLMs, and even the perceptron! Pretty neat stuff!
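
As an illustration, here is a rough sketch of kernelized logistic regression with the precomputed kernel matrix, using the same $\beta$ bookkeeping as before (labels assumed to be in $\{0, 1\}$; names are mine):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_kernel_logistic(K, y, alpha=0.1, n_iters=1000):
    """beta_i := beta_i + alpha * (y_i - h(x^(i))), where
    h(x^(i)) = sigmoid( sum_j beta_j K(x^(j), x^(i)) ) = sigmoid((K @ beta)_i)."""
    beta = np.zeros(len(y))
    for _ in range(n_iters):
        preds = sigmoid(K @ beta)          # h(x^(i)) for every training point
        beta = beta + alpha * (y - preds)  # same shape of update as the linear case
    return beta

def predict_kernel_logistic(beta, X_train, x, kernel):
    """h(x) = sigmoid( sum_j beta_j * K(x^(j), x) )."""
    return sigmoid(sum(b * kernel(xj, x) for b, xj in zip(beta, X_train)))
```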