PCA

Tags: CS 229, Unsupervised

From factor analysis

Factor analysis learned a matrix $\Lambda$ that effectively turned a low-dimensional representation $z$ into the high-dimensional $x$ space. It used the EM algorithm to get the encoding. As it turns out, if we only care about the low-dimensional representation, we can also use principal component analysis (PCA), which only needs eigenvectors.

Often, your different dimensions can be highly correlated. For example, you might have miles per hour and kilometers per hour. In reality, you only get one piece of information. We can extract this through PCA!

PCA can work as noise reduction, overfitting reduction, and can even help us visualize high-dimensional data.

Dimensionality reduction

Here's the big idea. How can we project a set of points $x$ with dimension $d$ onto a subspace of dimension $k$ while maintaining as much information as possible? In other words, if $A$ were the projection matrix and $z = Ax$, then we want the reconstruction $\hat x = A^Tz$ to be as close to $x$ as possible. (We can't literally invert the non-square $A$, but with orthonormal rows $A^T$ acts as its pseudo-inverse.)
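For concreteness, here's a tiny numpy sketch of that setup, with an assumed (not yet derived) orthonormal basis making up the rows of $A$:

```python
import numpy as np

# Toy setup: project 3-D points onto a k = 2 dimensional subspace.
# The rows of A are an assumed orthonormal basis; PCA's job (derived below)
# is to choose these rows well.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])     # shape (k, d), orthonormal rows

x = np.array([2.0, -1.0, 0.1])      # original d-dimensional point

z = A @ x                           # compress: k numbers instead of d
x_hat = A.T @ z                     # reconstruct back in d dimensions

print(z)       # [ 2. -1.]
print(x_hat)   # [ 2. -1.  0.]  -- close to x if the dropped direction mattered little
```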

The derivation

Projections

When we talk about a projection matrix, we just want to project onto an orthonormal basis $u_1, \dots, u_k$, whose vectors make up the rows of $A$. We need to find these vectors!

We establish some $\bar x$ which we subtract from $x$ before projecting. This is just so our later derivation works out perfectly; it does not influence the conceptual understanding of the problem. To project, we just do

$$z_i = (x - \bar x) \cdot u_i$$

Now, $z_i$ would be in the changed basis, and $x, u_i$ would both be WRT the canonical basis. Here's the tricky part: we can imagine that we have $d$ orthonormal vectors such that we are just projecting onto a rotated coordinate system without losing any information. However, when we do compression, we only pick $k$ of these (and we order $u_1, \dots, u_k$ to be the most important, somehow).

We can express $\hat x$ as the following:

$$\hat x = \bar x + \sum_j^k z_j u_j$$

Think about this for a second—it's important
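As a minimal numpy sketch of that round trip (the basis $U$ and mean here are made up; finding a good $U$ is the rest of this derivation):

```python
import numpy as np

def encode(x, x_bar, U):
    """Coordinates z_j = (x - x_bar) . u_j, where the columns of U are u_1..u_k."""
    return U.T @ (x - x_bar)

def decode(z, x_bar, U):
    """Reconstruction x_hat = x_bar + sum_j z_j u_j."""
    return x_bar + U @ z

# Hypothetical example: d = 3, k = 2, with an assumed orthonormal basis.
x_bar = np.array([1.0, 1.0, 1.0])
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])          # shape (d, k), orthonormal columns

x = np.array([3.0, 0.5, 1.2])
z = encode(x, x_bar, U)
x_hat = decode(z, x_bar, U)
print(z, x_hat)   # [ 2.  -0.5] [3.  0.5 1. ]
```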

Cost function

Our error is as follows:

$$\text{error}_k = \sum_i^N(x^i - \hat x^i)^2 = \sum_i^N\left(\bar x + \sum_j^d z_j^iu_j - \left(\bar x + \sum_j^kz_j^iu_j\right)\right)^2$$

Let's make sure we digested that correctly: the first part inside the square is just another way of representing $x^i$ (expanded in the full $d$-dimensional basis), and the second part is the reconstruction $\hat x^i$ from above.

We note that the $\bar x$'s and the first $k$ terms of the two sums cancel, and we get

$$\text{error}_k = \sum_i^N(x^i - \hat x^i)^2 = \sum_i^N\left( \sum_{j = k+1}^d z_j^iu_j \right)^2$$

Now, this square is actually a dot product of the vector with itself. So we can expand it into non-cross terms and cross terms:

$$\text{error}_k = \sum_i^N \left[\sum_{j = k+1}^d z_j^iu_j \cdot u_jz^i_j + \sum_{j=k+1}^d\sum_{j' \neq j}z_j^iu_j\cdot u_{j'}z_{j'}^i\right]$$

Now, we recall that the $u_j$ form an orthonormal basis: $u_j \cdot u_{j'} = 0$ for $j \neq j'$ and $u_j \cdot u_j = 1$. As such, the cross terms vanish and the first term simplifies:

$$\text{error}_k = \sum_i^N\sum_{j = k+1}^d (z_j^i)^2$$

So, tl;dr—to minimize the error, we just need to minimize the coefficients OUTSIDE of our selected basis. This makes a lot of sense!

The covariance matrix

But how do we minimize this? Well, let's expand what we know about $z$:

$$\text{error}_k = \sum_i^N\sum_{j = k+1}^d (z_j^i)^2 = \sum_i^N\sum_{j = k+1}^d (u_j \cdot (x^i - \bar x))^2$$

Let's rewrite and move the summations around

$$\text{error}_k = \sum_i^N\sum_{j = k+1}^d (u_j \cdot (x^i - \bar x))^2 = \sum_{j = k+1}^du_j^T\left[\sum_i^N(x^i - \bar x)(x^i - \bar x)^T\right] u_j$$

We did this because $(y^Tz)^2 = y^Tzz^Ty$.
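Spelling that step out for a single $u_j$:

$$\sum_i^N\left(u_j \cdot (x^i - \bar x)\right)^2 = \sum_i^N u_j^T(x^i - \bar x)(x^i - \bar x)^Tu_j = u_j^T\left[\sum_i^N(x^i - \bar x)(x^i - \bar x)^T\right]u_j$$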

And if we assign $\bar x = E[X]$, the mean of the $x^i$, then we have the following:

$$\text{error}_k = N\sum_{j = k+1}^d u_j^T\Sigma u_j$$

(recall that the covariance is just $\Sigma = \frac{1}{N}\sum_i^N(x^i - E[X])(x^i - E[X])^T$)

So minimizing our error comes down to minimizing a sum of quadratic forms!
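As a quick sanity check, here's a numpy sketch (on made-up data, with an arbitrary orthonormal basis pulled from a QR decomposition) verifying that the reconstruction error really equals $N\sum_{j=k+1}^d u_j^T\Sigma u_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 500, 5, 2

X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))   # made-up correlated data
x_bar = X.mean(axis=0)
Xc = X - x_bar
Sigma = (Xc.T @ Xc) / N                                  # covariance with the 1/N convention

# Any orthonormal basis satisfies the identity; take one from a QR decomposition.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))             # columns u_1..u_d

Z = Xc @ U                                               # coordinates z_j^i
X_hat = x_bar + Z[:, :k] @ U[:, :k].T                    # keep only the first k directions

lhs = np.sum((X - X_hat) ** 2)                           # sum_i ||x^i - x_hat^i||^2
rhs = N * np.sum([U[:, j] @ Sigma @ U[:, j] for j in range(k, d)])
print(np.isclose(lhs, rhs))                              # True
```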

Eigenvectors

Here's the rub: why don't we just use the eigenvectors of $\Sigma$? We need to select some orthonormal basis that minimizes our error, and since $\Sigma$ is symmetric, its eigenvectors form a perfectly valid orthonormal basis (spectral theorem). They are also the most convenient choice, because they make each quadratic form trivial to evaluate, and picking the right eigenvectors turns out to minimize the error.

If $u_j$ is an eigenvector of $\Sigma$ with eigenvalue $\lambda_j$, then

$$\text{error}_k = N\sum_{j = k+1}^d u_j^T\Sigma u_j = N\sum_{j = k+1}^d\lambda_j$$

(take a minute to convince yourself of this)
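Indeed, since $\Sigma u_j = \lambda_j u_j$ and $u_j^Tu_j = 1$:

$$u_j^T\Sigma u_j = u_j^T(\lambda_j u_j) = \lambda_j\, u_j^Tu_j = \lambda_j$$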

As such, to minimize the error, we simply need to pick $k$ eigenvectors to form $u_1, \dots, u_k$, and we pick them by selecting the eigenvectors of $\Sigma$ with the highest eigenvalues!

Algorithm, reviewed

  1. Start from an $N \times d$ data matrix $X$ (one row per datapoint).
  1. Subtract the column means from each row to center the data, forming $X_C$.
  1. Compute the covariance matrix $\Sigma = \frac{1}{N}X_C^TX_C$ (take a second to convince yourself that this is indeed the covariance; it helps to see the matrix product as a sum of outer products).
  1. Find the eigenvectors and eigenvalues of $\Sigma$; the principal components are the $k$ eigenvectors with the highest eigenvalues. It's worth noting that the eigenvectors are the orthonormal basis, and they live in the canonical (i.e. non-transformed) space. Confused? See below for a practical example.

To compress a point, just subtract the mean and compute the inner products with these (normalized) eigenvectors!
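Here's a minimal numpy sketch of the whole recipe (function and variable names are my own, using the $\frac{1}{N}$ covariance convention from above):

```python
import numpy as np

def pca_fit(X, k):
    """PCA via the covariance matrix: returns mean, top-k eigenvectors, eigenvalues."""
    N, d = X.shape
    x_bar = X.mean(axis=0)                     # column means
    Xc = X - x_bar                             # centered data X_C
    Sigma = (Xc.T @ Xc) / N                    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh, since Sigma is symmetric
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    U = eigvecs[:, order[:k]]                  # principal components as columns (d, k)
    return x_bar, U, eigvals[order[:k]]

def pca_encode(X, x_bar, U):
    """Compress: inner products of the centered points with the eigenvectors."""
    return (X - x_bar) @ U                     # shape (N, k)

def pca_decode(Z, x_bar, U):
    """Reconstruct back in the original d-dimensional space."""
    return x_bar + Z @ U.T

# Usage on made-up data
X = np.random.default_rng(1).normal(size=(200, 10))
x_bar, U, lam = pca_fit(X, k=3)
Z = pca_encode(X, x_bar, U)
X_hat = pca_decode(Z, x_bar, U)
```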

Eigenfaces

In eigenfaces, we had an $N\times d$ matrix where $N$ is the number of people and $d$ is the length of a flattened image. For example, if the picture were $100 \times 100$, then $d = 10000$.

By computing PCA on it and keeping only the top $k$ eigenvectors, we get the eigenfaces: the eigenvectors corresponding to the $k$ largest eigenvalues. Each eigenvector can itself be reshaped back into a picture. We can reconstruct faces by projecting an image onto the eigenface basis; the larger $k$ is, the better the reconstruction.
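As a sketch of that pipeline in numpy (the `faces` array is a hypothetical stand-in, shrunk to $32 \times 32$ images to keep the toy light):

```python
import numpy as np

# Hypothetical stack of N grayscale face images, each 32 x 32.
faces = np.random.default_rng(2).random((50, 32, 32))
X = faces.reshape(len(faces), -1)              # flatten: N x 1024

x_bar = X.mean(axis=0)
Xc = X - x_bar
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
U = eigvecs[:, np.argsort(eigvals)[::-1][:20]] # top k = 20 eigenfaces, as columns

eigenface_0 = U[:, 0].reshape(32, 32)          # an eigenface is itself an image

# Reconstruct the first face from its 20 coordinates in the eigenface basis.
z = (X[0] - x_bar) @ U
face_0_hat = (x_bar + U @ z).reshape(32, 32)
```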

And here's an image of a city reconstructed from eigenvectors

SVD to speed things up

Essentially you write $X_C = WSV^T$, where $X_C$ is the (centered) data matrix with one row per datapoint, $W$ is a weight matrix (the left singular vectors), $S$ is the diagonal matrix of singular values, and $V$ is the (right) singular vector matrix.

You can think of the rows of $WS$ as the coordinates of the $x^i$ in eigenspace, and $S$ contains the singular values $s_j$, which give the eigenvalues via $\lambda_j = s_j^2/N$.

SVD helps because you don't need to compute $\Sigma$ explicitly. Rather, you take the SVD of $X_C$ directly: since $\Sigma = \frac{1}{N}X_C^TX_C = V\frac{S^2}{N}V^T$, the columns of $V$ are exactly the eigenvectors you desire, and you keep the $k$ with the largest singular values.

Another key distinction is that SVD by itself isn't lossy; you're just factorizing the matrix into three components. You drop components by zeroing out the smaller entries of $S$.
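Here's a small numpy sketch of the SVD route, checked against the covariance route (made-up data; `np.linalg.svd` returns $W$, the diagonal of $S$, and $V^T$):

```python
import numpy as np

X = np.random.default_rng(3).normal(size=(200, 6))
N = X.shape[0]
Xc = X - X.mean(axis=0)                        # SVD of the *centered* data matrix

W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T                                       # columns of V = eigenvectors of Sigma
eigvals_svd = s**2 / N                         # lambda_j = s_j^2 / N

# Compare against the explicit covariance route.
Sigma = (Xc.T @ Xc) / N
eigvals, _ = np.linalg.eigh(Sigma)
print(np.allclose(np.sort(eigvals_svd), np.sort(eigvals)))   # True

k = 2
Z = Xc @ V[:, :k]        # coordinates; equivalently (W * s)[:, :k]
```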

Kernel methods

You can also replace the dot products we used with a kernel. Without getting into it too much (it's a similar idea to kernels elsewhere), a kernel lets you express far more complicated, nonlinear features.
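To give a flavor, here's a minimal kernel PCA sketch with an RBF kernel. It follows the standard recipe (center the kernel matrix, eigendecompose, scale); the bandwidth `gamma` is just an assumed hyperparameter:

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Kernel PCA with an RBF kernel: returns k-dimensional projections of the training points."""
    N = X.shape[0]
    # RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    # Center the kernel matrix (the kernel-space analogue of subtracting the mean).
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigendecompose and keep the top-k components.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    alphas = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return Kc @ alphas                         # projections of the training points

Z = kernel_pca(np.random.default_rng(4).normal(size=(100, 3)), k=2)
```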