Factor Analysis

Tags: CS 229, Unsupervised

The problem

Let $d$ be the dimensionality of the data, and let $n$ be the number of samples. EM works very well when $n > d$, since there is enough data to estimate the Gaussian clusters reliably and draw insights from them. However, there are cases where $d \gg n$, and then we are in trouble.

Specifically, with so few points the data spans only a proper affine subspace of the feature space, so the MLE yields a singular $\Sigma$ that cannot be inverted, and the Gaussian density is not even defined. The MLE simply places all of the probability mass in that affine subspace.
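
To see the problem concretely, here is a minimal numpy sketch (my own illustration, with arbitrary sizes, not from the notes) showing that the unrestricted MLE covariance is singular whenever $n < d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 30                            # far fewer samples than dimensions
X = rng.normal(size=(n, d))             # n data points in R^d

mu = X.mean(axis=0)
Sigma_mle = (X - mu).T @ (X - mu) / n   # unrestricted MLE covariance, d x d

# The centered data spans at most an (n - 1)-dimensional subspace, so
# Sigma_mle cannot have full rank and the Gaussian density is undefined.
print(np.linalg.matrix_rank(Sigma_mle))  # at most n - 1 = 4
print(np.linalg.det(Sigma_mle))          # ~0: not invertible
```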

Solution 1: restrictions

Maybe we can prevent $\Sigma$ from becoming singular by restricting it to be diagonal. Intuitively, the MLE would be

$$\Sigma_{jj} = \frac{1}{n}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)^2,$$

which is just the empirical variance of coordinate $j$. We can derive this by taking our MLE for the unrestricted $\Sigma$, namely $\Sigma_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(x^{(i)} - \mu)(x^{(i)} - \mu)^T$, and keeping only its diagonal entries.

We may also want all the variances to be equal, that is, $\Sigma = \sigma^2 I$. The MLE for this single parameter is

$$\sigma^2 = \frac{1}{nd}\sum_{j=1}^{d}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)^2.$$

Under either of these restrictions, we only need $n \geq 2$ data points to obtain a non-singular $\Sigma$. This is a step in the right direction!
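
As a quick sanity check, here is a minimal numpy sketch of the two restricted estimators (my own illustration; the variable names are not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 30
X = rng.normal(size=(n, d))
centered = X - X.mean(axis=0)

# Diagonal restriction: Sigma_jj is the empirical variance of coordinate j.
Sigma_diag = np.diag((centered ** 2).mean(axis=0))

# Spherical restriction: a single shared variance sigma^2.
sigma2 = (centered ** 2).mean()
Sigma_spherical = sigma2 * np.eye(d)

# Both have full rank (hence are invertible) as long as every coordinate
# shows some variation, even though n << d.
print(np.linalg.matrix_rank(Sigma_diag), np.linalg.matrix_rank(Sigma_spherical))
```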

However, this makes the strong assumption that the coordinates of the data are independent (and, in the spherical case, share a single variance). Can we find a compromise that captures some correlations without estimating a full $\Sigma$?

Review of gaussians

Suppose we have a vector-valued random variable that consists of two stacked vectors,

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$

We'd have a mean $\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$ and a covariance matrix based on this $x$. The covariance matrix is a block matrix,

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$$

If $x_1$ has length $r$ and $x_2$ has length $s$, then $\Sigma_{11}$ is $r\times r$, $\Sigma_{12}$ is $r\times s$, and so on. We also know that, by symmetry of $\Sigma$, $\Sigma_{21} = \Sigma_{12}^T$.

When dealing with block matrices, we can manipulate the blocks almost as if they were ordinary numbers, as long as the dimensions are compatible and we respect the order of multiplication. For example,

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} E \\ F \end{bmatrix} = \begin{bmatrix} AE + BF \\ CE + DF \end{bmatrix}.$$

And here's the magic of block matrices: they behave almost like numbers, and you can treat them as such.

The marginal distribution of $x_1$ is just $\mathcal{N}(\mu_1, \Sigma_{11})$, and the same pattern holds for $x_2$. You can also show that the conditional distribution is Gaussian,

$$x_1 \mid x_2 \sim \mathcal{N}\!\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$
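
These formulas are mechanical to apply, so here is a small numpy helper (a sketch of my own; the function name is not from the notes) that computes the parameters of $x_1 \mid x_2$ from a partitioned mean and covariance. The exact same computation reappears in the E step of factor analysis below.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, x2, r):
    """Mean and covariance of x1 | x2, where the joint Gaussian over
    x = [x1; x2] has mean mu and covariance Sigma, and x1 is the first
    r coordinates."""
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

    # mu_{1|2} = mu_1 + Sigma_12 Sigma_22^{-1} (x2 - mu_2)
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    # Sigma_{1|2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)
    return mu_cond, Sigma_cond
```
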
Factor Analysis model

Factor analysis uses a latent random variable $z$ such that

$$z \sim \mathcal{N}(0, I), \qquad x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi).$$

In the mixture-of-Gaussians EM, we assumed that $x \mid z \sim \mathcal{N}(\mu_z, \Sigma_z)$ with one mean and covariance per discrete value of $z$. Now $z$ is continuous, and $x \mid z$ has a single covariance $\Psi$ and a mean $\mu + \Lambda z$ that depends linearly on $z$.

More specifically, $z$ is a $k$-dimensional vector, $\mu$ is a $d$-dimensional vector, and $\Lambda$ is a $d\times k$ matrix. $\Psi$ is a diagonal ($d \times d$) matrix. Usually, $k < d$.

To generate a data point, we first sample $z$, map it to $\mu + \Lambda z$, and then add uncorrelated noise with covariance $\Psi$ to the outcome. As such, we can also express factor analysis as

$$x = \mu + \Lambda z + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Psi),$$

where $z$ and $\epsilon$ are independent.
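
Viewed this way, the generative process is only a few lines of code. The sketch below (parameter values and names are arbitrary choices for illustration) samples data this way and checks empirically that the covariance of $x$ comes out close to $\Lambda\Lambda^T + \Psi$, which is exactly the marginal covariance derived in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 50_000                 # illustrative sizes, k < d

mu = rng.normal(size=d)                 # arbitrary parameters for the sketch
Lam = rng.normal(size=(d, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))   # diagonal noise covariance

# Generative process: z ~ N(0, I), then x = mu + Lam z + eps, eps ~ N(0, Psi).
z = rng.normal(size=(n, k))
eps = rng.multivariate_normal(np.zeros(d), Psi, size=n)
X = mu + z @ Lam.T + eps

# The empirical covariance of x should be close to Lam Lam^T + Psi.
print(np.allclose(np.cov(X, rowvar=False), Lam @ Lam.T + Psi, atol=0.1))
```
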
Expressing factor analysis as a joint distribution

Because $z$ is Gaussian and $x \mid z$ is Gaussian (with mean linear in $z$), we can think of $z$ and $x$ as sitting in a joint Gaussian distribution:

$$\begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}(\mu_{zx}, \Sigma).$$

The expectation of $z$ is zero by definition, and the expectation of $x$ is just

$$E[x] = E[\mu + \Lambda z + \epsilon] = \mu + \Lambda E[z] + E[\epsilon] = \mu,$$

meaning that

$$\mu_{zx} = \begin{bmatrix} 0 \\ \mu \end{bmatrix}.$$

What about $\Sigma$? Well, we can calculate $\Sigma_{zz}$, $\Sigma_{xx}$, $\Sigma_{zx}$ and put them together.

First, we start out by observing that $\Sigma_{zz} = \mathrm{Cov}(z) = I$.

From the knowledge that $E[z] = 0$ and that $E[zz^T] = E[(z - 0)(z - 0)^T] = \mathrm{Cov}(z) = I$, we have that

$$\Sigma_{zx} = E\!\left[(z - E[z])(x - E[x])^T\right] = E\!\left[z(\Lambda z + \epsilon)^T\right] = E[zz^T]\Lambda^T + E[z\epsilon^T] = \Lambda^T,$$

where the cross term vanishes because $z$ and $\epsilon$ are independent and zero-mean. With similar tricks, we also get

$$\Sigma_{xx} = E\!\left[(\Lambda z + \epsilon)(\Lambda z + \epsilon)^T\right] = \Lambda E[zz^T]\Lambda^T + E[\epsilon\epsilon^T] = \Lambda\Lambda^T + \Psi.$$

And when we put everything together, we get

$$\begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^T \\ \Lambda & \Lambda\Lambda^T + \Psi \end{bmatrix}\right).$$

From this joint, the marginal distribution of $x$ is given by

$$x \sim \mathcal{N}(\mu, \Lambda\Lambda^T + \Psi),$$

and therefore the log likelihood objective becomes

$$\ell(\mu, \Lambda, \Psi) = \log \prod_{i=1}^{n} \frac{1}{(2\pi)^{d/2}\,|\Lambda\Lambda^T + \Psi|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x^{(i)} - \mu)^T(\Lambda\Lambda^T + \Psi)^{-1}(x^{(i)} - \mu)\right).$$

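For concreteness, here is a direct numpy translation of this objective (a sketch; the function name and the use of `slogdet` are my choices, not from the notes). It only evaluates the log likelihood; maximizing it is the hard part.

```python
import numpy as np

def fa_log_likelihood(X, mu, Lam, Psi_diag):
    """Log likelihood of the rows of X (n x d) under x ~ N(mu, Lam Lam^T + Psi),
    where Psi_diag holds the diagonal of Psi."""
    n, d = X.shape
    cov = Lam @ Lam.T + np.diag(Psi_diag)

    _, logdet = np.linalg.slogdet(cov)
    diff = X - mu                                   # rows are x^(i) - mu
    # Quadratic form (x - mu)^T cov^{-1} (x - mu) for every sample at once.
    quad = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)

    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + quad.sum())
```
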
EM for factor analysis

Maximizing this log likelihood directly is not very fun, and there is no closed-form expression for the maximizer. As such, we resort to the EM algorithm.

E step

We can find the conditional distribution of $z \mid x$ by applying the conditional formulas derived above to the joint $\Sigma$. We get $z^{(i)} \mid x^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}|x^{(i)}}, \Sigma_{z^{(i)}|x^{(i)}})$ with

$$\mu_{z^{(i)}|x^{(i)}} = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}(x^{(i)} - \mu), \qquad \Sigma_{z^{(i)}|x^{(i)}} = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda.$$

As such, our E step is pretty straightforward: we set $Q_i(z^{(i)})$ to be exactly this Gaussian density,

$$Q_i(z^{(i)}) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_{z^{(i)}|x^{(i)}}|^{1/2}} \exp\!\left(-\tfrac{1}{2}\left(z^{(i)} - \mu_{z^{(i)}|x^{(i)}}\right)^T \Sigma_{z^{(i)}|x^{(i)}}^{-1}\left(z^{(i)} - \mu_{z^{(i)}|x^{(i)}}\right)\right).$$

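In code, the E step just evaluates these two quantities for each data point; note that $\Sigma_{z^{(i)}|x^{(i)}}$ is the same for every $i$ because it does not depend on $x^{(i)}$. A minimal numpy sketch (function and variable names are mine):

```python
import numpy as np

def fa_e_step(X, mu, Lam, Psi_diag):
    """Posterior Q_i(z) = N(mu_post[i], Sigma_post) for each row of X (n x d)."""
    d, k = Lam.shape
    cov_x = Lam @ Lam.T + np.diag(Psi_diag)   # Lam Lam^T + Psi
    # G = Lam^T (Lam Lam^T + Psi)^{-1}, via a solve instead of an explicit inverse.
    G = np.linalg.solve(cov_x, Lam).T         # k x d

    mu_post = (X - mu) @ G.T                  # row i is Lam^T cov_x^{-1} (x^(i) - mu)
    Sigma_post = np.eye(k) - G @ Lam          # I - Lam^T cov_x^{-1} Lam
    return mu_post, Sigma_post
```
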
M step, $\Lambda$

Our goal is to maximize

$$\sum_{i=1}^{n} \int_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \mu, \Lambda, \Psi)}{Q_i(z^{(i)})}\, dz^{(i)}$$

with respect to $\mu, \Lambda, \Psi$. We start by simplifying the expression using log rules:

$$\sum_{i=1}^{n} E_{z^{(i)} \sim Q_i}\!\left[\log p(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi) + \log p(z^{(i)}) - \log Q_i(z^{(i)})\right].$$

Now, we will focus on $\Lambda$ as an exercise and tackle the other parameters later. Only the first term depends on $\Lambda$, so we only need to keep

$$\sum_{i=1}^{n} E\!\left[-\tfrac{1}{2}\log|\Psi| - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\left(x^{(i)} - \mu - \Lambda z^{(i)}\right)^T \Psi^{-1}\left(x^{(i)} - \mu - \Lambda z^{(i)}\right)\right].$$

Keeping only the terms that involve $\Lambda$ and using some trace properties, the gradient is

$$\nabla_\Lambda \sum_{i=1}^{n} E\!\left[-\operatorname{tr}\tfrac{1}{2}\, z^{(i)T}\Lambda^T\Psi^{-1}\Lambda z^{(i)} + \operatorname{tr}\, z^{(i)T}\Lambda^T\Psi^{-1}(x^{(i)} - \mu)\right] = \sum_{i=1}^{n} \left(-\Psi^{-1}\Lambda\, E\!\left[z^{(i)} z^{(i)T}\right] + \Psi^{-1}(x^{(i)} - \mu)\, E\!\left[z^{(i)T}\right]\right).$$

We can set this equal to zero and simplify to get

$$\Lambda = \left(\sum_{i=1}^{n} (x^{(i)} - \mu)\, E_{z^{(i)}\sim Q_i}\!\left[z^{(i)T}\right]\right)\left(\sum_{i=1}^{n} E_{z^{(i)}\sim Q_i}\!\left[z^{(i)} z^{(i)T}\right]\right)^{-1}.$$

If we use the identity $E[YY^T] = E[Y]E[Y]^T + \mathrm{Cov}(Y)$ (applied to $z^{(i)} \sim Q_i$), we get that

$$\Lambda = \left(\sum_{i=1}^{n} (x^{(i)} - \mu)\,\mu_{z^{(i)}|x^{(i)}}^T\right)\left(\sum_{i=1}^{n} \mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)^{-1}.$$

M step, $\mu, \Psi$

It can be shown that the M step yields

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x^{(i)},$$

which does not depend on the other parameters and therefore only needs to be computed once. As for $\Psi$, we calculate $\Phi$ as follows:

$$\Phi = \frac{1}{n}\sum_{i=1}^{n}\left[(x^{(i)} - \mu)(x^{(i)} - \mu)^T - (x^{(i)} - \mu)\mu_{z^{(i)}|x^{(i)}}^T\Lambda^T - \Lambda\mu_{z^{(i)}|x^{(i)}}(x^{(i)} - \mu)^T + \Lambda\left(\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)\Lambda^T\right],$$

then we set $\Psi_{ii} = \Phi_{ii}$, i.e., we keep only the diagonal of $\Phi$.
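
Putting the $\mu$, $\Lambda$, and $\Psi$ updates into code gives the sketch below (my own illustrative translation of the formulas, using the posterior moments returned by the E-step sketch above). Iterating the E step and this M step, with $\mu$ fixed at the data mean, and monitoring the log likelihood gives the full EM loop.

```python
import numpy as np

def fa_m_step(X, mu, mu_post, Sigma_post):
    """One M step: returns updated (Lam, Psi_diag).

    X is n x d, mu is the (fixed) data mean, mu_post is n x k with the
    posterior means mu_{z|x}, and Sigma_post is the shared k x k posterior
    covariance Sigma_{z|x}.
    """
    n, d = X.shape
    diff = X - mu                                   # rows are x^(i) - mu
    E_zz = mu_post.T @ mu_post + n * Sigma_post     # sum_i E[z^(i) z^(i)T]

    # Lambda update: (sum_i (x^(i)-mu) mu_{z|x}^T) (sum_i E[z^(i) z^(i)T])^{-1}
    Lam = (diff.T @ mu_post) @ np.linalg.inv(E_zz)

    # Psi update: Psi_ii = Phi_ii
    Phi = (diff.T @ diff
           - diff.T @ mu_post @ Lam.T
           - Lam @ (mu_post.T @ diff)
           + Lam @ E_zz @ Lam.T) / n
    Psi_diag = np.diag(Phi).copy()
    return Lam, Psi_diag
```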

So what are we doing again?

Our big goal in factor analysis was to model high-dimensional data with far fewer parameters than a full-covariance Gaussian requires, so that we can still fit it with EM when $d \gg n$. We did this by sampling a lower-dimensional $z$ vector, mapping it up to the high-dimensional $x$ space with $\Lambda$, and adding independent noise $\Psi$.