Factor Analysis

Tags: CS 229, Unsupervised

The problem

Let $d$ be the dimensionality of the data, and let $n$ be the number of samples. EM works very well when $n > d$, since there is enough data to estimate the Gaussian clusters reliably and draw insights from them. However, there are cases where $d \gg n$, and then we are in trouble.

Specifically, with so few points the data spans only a proper affine subspace of the feature space, so the MLE yields a singular $\Sigma$ that cannot be inverted, and the Gaussian density is not even defined. The MLE simply places all of the probability mass in that affine subspace.
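
To see the problem concretely, here is a minimal numpy sketch (my own illustration, with arbitrary sizes, not from the notes) showing that the unrestricted MLE covariance is singular whenever $n < d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 30                            # far fewer samples than dimensions
X = rng.normal(size=(n, d))             # n data points in R^d

mu = X.mean(axis=0)
Sigma_mle = (X - mu).T @ (X - mu) / n   # unrestricted MLE covariance, d x d

# The centered data spans at most an (n - 1)-dimensional subspace, so
# Sigma_mle cannot have full rank and the Gaussian density is undefined.
print(np.linalg.matrix_rank(Sigma_mle))  # at most n - 1 = 4
print(np.linalg.det(Sigma_mle))          # ~0: not invertible
```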

Solution 1: restrictions

Maybe we can prevent $\Sigma$ from becoming singular by restricting it to be diagonal. Intuitively, the MLE would be

$$\Sigma_{jj} = \frac{1}{n}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)^2,$$

which is just the empirical variance of coordinate $j$. We can derive this by taking our MLE for the unrestricted $\Sigma$, namely $\Sigma_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(x^{(i)} - \mu)(x^{(i)} - \mu)^T$, and keeping only its diagonal entries.

We may also want all the variances to be equal, that is, $\Sigma = \sigma^2 I$. The MLE for this single parameter is

$$\sigma^2 = \frac{1}{nd}\sum_{j=1}^{d}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)^2.$$

Under either of these restrictions, we only need $n \geq 2$ data points to obtain a non-singular $\Sigma$. This is a step in the right direction!
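
As a quick sanity check, here is a minimal numpy sketch of the two restricted estimators (my own illustration; the variable names are not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 30
X = rng.normal(size=(n, d))
centered = X - X.mean(axis=0)

# Diagonal restriction: Sigma_jj is the empirical variance of coordinate j.
Sigma_diag = np.diag((centered ** 2).mean(axis=0))

# Spherical restriction: a single shared variance sigma^2.
sigma2 = (centered ** 2).mean()
Sigma_spherical = sigma2 * np.eye(d)

# Both have full rank (hence are invertible) as long as every coordinate
# shows some variation, even though n << d.
print(np.linalg.matrix_rank(Sigma_diag), np.linalg.matrix_rank(Sigma_spherical))
```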

However, this makes the strong assumption that the coordinates of the data are independent (and, in the spherical case, share a single variance). Can we find a compromise that captures some correlations without estimating a full $\Sigma$?

Review of gaussians

Suppose we have a vector-valued random variable that consists of two stacked vectors,

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$

We'd have a mean $\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$ and a covariance matrix based on this $x$. The covariance matrix is a block matrix,

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$$

If $x_1$ has length $r$ and $x_2$ has length $s$, then $\Sigma_{11}$ is $r\times r$, $\Sigma_{12}$ is $r\times s$, and so on. We also know that, by symmetry of $\Sigma$, $\Sigma_{21} = \Sigma_{12}^T$.

When dealing with block matrices, we can manipulate the blocks almost as if they were ordinary numbers, as long as the dimensions are compatible and we respect the order of multiplication. For example,

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} E \\ F \end{bmatrix} = \begin{bmatrix} AE + BF \\ CE + DF \end{bmatrix}.$$

And here's the magic of block matrices: they behave almost like numbers, and you can treat them as such.

The marginal distribution of $x_1$ is just $\mathcal{N}(\mu_1, \Sigma_{11})$, and the same pattern holds for $x_2$. You can also show that the conditional distribution is Gaussian,

$$x_1 \mid x_2 \sim \mathcal{N}\!\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$
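
These formulas are mechanical to apply, so here is a small numpy helper (a sketch of my own; the function name is not from the notes) that computes the parameters of $x_1 \mid x_2$ from a partitioned mean and covariance. The exact same computation reappears in the E step of factor analysis below.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, x2, r):
    """Mean and covariance of x1 | x2, where the joint Gaussian over
    x = [x1; x2] has mean mu and covariance Sigma, and x1 is the first
    r coordinates."""
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

    # mu_{1|2} = mu_1 + Sigma_12 Sigma_22^{-1} (x2 - mu_2)
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    # Sigma_{1|2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)
    return mu_cond, Sigma_cond
```
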
Factor Analysis model

Factor analysis uses a latent random variable $z$ such that

$$z \sim \mathcal{N}(0, I), \qquad x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi).$$

In the mixture-of-Gaussians EM, we assumed that $x \mid z \sim \mathcal{N}(\mu_z, \Sigma_z)$ with one mean and covariance per discrete value of $z$. Now $z$ is continuous, and $x \mid z$ has a single covariance $\Psi$ and a mean $\mu + \Lambda z$ that depends linearly on $z$.

More specifically, $z$ is a $k$-dimensional vector, $\mu$ is a $d$-dimensional vector, and $\Lambda$ is a $d\times k$ matrix. $\Psi$ is a diagonal ($d \times d$) matrix. Usually, $k < d$.

To generate a data point, we first sample $z$, map it to $\mu + \Lambda z$, and then add uncorrelated noise with covariance $\Psi$ to the outcome. As such, we can also express factor analysis as

$$x = \mu + \Lambda z + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Psi),$$

where $z$ and $\epsilon$ are independent.
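
Viewed this way, the generative process is only a few lines of code. The sketch below (parameter values and names are arbitrary choices for illustration) samples data this way and checks empirically that the covariance of $x$ comes out close to $\Lambda\Lambda^T + \Psi$, which is exactly the marginal covariance derived in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 50_000                 # illustrative sizes, k < d

mu = rng.normal(size=d)                 # arbitrary parameters for the sketch
Lam = rng.normal(size=(d, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))   # diagonal noise covariance

# Generative process: z ~ N(0, I), then x = mu + Lam z + eps, eps ~ N(0, Psi).
z = rng.normal(size=(n, k))
eps = rng.multivariate_normal(np.zeros(d), Psi, size=n)
X = mu + z @ Lam.T + eps

# The empirical covariance of x should be close to Lam Lam^T + Psi.
print(np.allclose(np.cov(X, rowvar=False), Lam @ Lam.T + Psi, atol=0.1))
```
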
Expressing factor analysis as a joint distribution

Because $z$ is Gaussian and $x \mid z$ is Gaussian (with mean linear in $z$), we can think of $z$ and $x$ as sitting in a joint Gaussian distribution:

$$\begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}(\mu_{zx}, \Sigma).$$

The expectation of $z$ is zero by definition, and the expectation of $x$ is just

$$E[x] = E[\mu + \Lambda z + \epsilon] = \mu + \Lambda E[z] + E[\epsilon] = \mu,$$

meaning that

$$\mu_{zx} = \begin{bmatrix} 0 \\ \mu \end{bmatrix}.$$

What about $\Sigma$? Well, we can calculate $\Sigma_{zz}$, $\Sigma_{xx}$, $\Sigma_{zx}$ and put them together.

First, we start out by observing that $\Sigma_{zz} = \mathrm{Cov}(z) = I$.

From the knowledge that $E[z] = 0$ and that $E[zz^T] = E[(z - 0)(z - 0)^T] = \mathrm{Cov}(z) = I$, we have that

$$\Sigma_{zx} = E\!\left[(z - E[z])(x - E[x])^T\right] = E\!\left[z(\Lambda z + \epsilon)^T\right] = E[zz^T]\Lambda^T + E[z\epsilon^T] = \Lambda^T,$$

where the cross term vanishes because $z$ and $\epsilon$ are independent and zero-mean. With similar tricks, we also get

$$\Sigma_{xx} = E\!\left[(\Lambda z + \epsilon)(\Lambda z + \epsilon)^T\right] = \Lambda E[zz^T]\Lambda^T + E[\epsilon\epsilon^T] = \Lambda\Lambda^T + \Psi.$$

And when we put everything together, we get

$$\begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^T \\ \Lambda & \Lambda\Lambda^T + \Psi \end{bmatrix}\right).$$

From this joint, the marginal distribution of $x$ is given by

$$x \sim \mathcal{N}(\mu, \Lambda\Lambda^T + \Psi),$$

and therefore the log likelihood objective becomes

$$\ell(\mu, \Lambda, \Psi) = \log \prod_{i=1}^{n} \frac{1}{(2\pi)^{d/2}\,|\Lambda\Lambda^T + \Psi|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x^{(i)} - \mu)^T(\Lambda\Lambda^T + \Psi)^{-1}(x^{(i)} - \mu)\right).$$

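For concreteness, here is a direct numpy translation of this objective (a sketch; the function name and the use of `slogdet` are my choices, not from the notes). It only evaluates the log likelihood; maximizing it is the hard part.

```python
import numpy as np

def fa_log_likelihood(X, mu, Lam, Psi_diag):
    """Log likelihood of the rows of X (n x d) under x ~ N(mu, Lam Lam^T + Psi),
    where Psi_diag holds the diagonal of Psi."""
    n, d = X.shape
    cov = Lam @ Lam.T + np.diag(Psi_diag)

    _, logdet = np.linalg.slogdet(cov)
    diff = X - mu                                   # rows are x^(i) - mu
    # Quadratic form (x - mu)^T cov^{-1} (x - mu) for every sample at once.
    quad = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)

    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + quad.sum())
```
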
EM for factor analysis

Maximizing this log likelihood directly is not very fun, and there is no closed-form expression for the maximizer. As such, we resort to the EM algorithm.

E step

We can find the conditional distribution of $z \mid x$ by applying the conditional formulas derived above to the joint $\Sigma$. We get $z^{(i)} \mid x^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}|x^{(i)}}, \Sigma_{z^{(i)}|x^{(i)}})$ with

$$\mu_{z^{(i)}|x^{(i)}} = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}(x^{(i)} - \mu), \qquad \Sigma_{z^{(i)}|x^{(i)}} = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda.$$

As such, our E step is pretty straightforward: we set $Q_i(z^{(i)})$ to be exactly this Gaussian density,

$$Q_i(z^{(i)}) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_{z^{(i)}|x^{(i)}}|^{1/2}} \exp\!\left(-\tfrac{1}{2}\left(z^{(i)} - \mu_{z^{(i)}|x^{(i)}}\right)^T \Sigma_{z^{(i)}|x^{(i)}}^{-1}\left(z^{(i)} - \mu_{z^{(i)}|x^{(i)}}\right)\right).$$

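In code, the E step just evaluates these two quantities for each data point; note that $\Sigma_{z^{(i)}|x^{(i)}}$ is the same for every $i$ because it does not depend on $x^{(i)}$. A minimal numpy sketch (function and variable names are mine):

```python
import numpy as np

def fa_e_step(X, mu, Lam, Psi_diag):
    """Posterior Q_i(z) = N(mu_post[i], Sigma_post) for each row of X (n x d)."""
    d, k = Lam.shape
    cov_x = Lam @ Lam.T + np.diag(Psi_diag)   # Lam Lam^T + Psi
    # G = Lam^T (Lam Lam^T + Psi)^{-1}, via a solve instead of an explicit inverse.
    G = np.linalg.solve(cov_x, Lam).T         # k x d

    mu_post = (X - mu) @ G.T                  # row i is Lam^T cov_x^{-1} (x^(i) - mu)
    Sigma_post = np.eye(k) - G @ Lam          # I - Lam^T cov_x^{-1} Lam
    return mu_post, Sigma_post
```
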
M step, $\Lambda$

Our goal is to maximize

$$\sum_{i=1}^{n} \int_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \mu, \Lambda, \Psi)}{Q_i(z^{(i)})}\, dz^{(i)}$$

with respect to $\mu, \Lambda, \Psi$. We start by simplifying the expression using log rules:

$$\sum_{i=1}^{n} E_{z^{(i)} \sim Q_i}\!\left[\log p(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi) + \log p(z^{(i)}) - \log Q_i(z^{(i)})\right].$$

Now, we will focus on $\Lambda$ as an exercise and tackle the other parameters later. Only the first term depends on $\Lambda$, so we only need to keep

$$\sum_{i=1}^{n} E\!\left[-\tfrac{1}{2}\log|\Psi| - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\left(x^{(i)} - \mu - \Lambda z^{(i)}\right)^T \Psi^{-1}\left(x^{(i)} - \mu - \Lambda z^{(i)}\right)\right].$$

Keeping only the terms that involve $\Lambda$ and using some trace properties, the gradient is

$$\nabla_\Lambda \sum_{i=1}^{n} E\!\left[-\operatorname{tr}\tfrac{1}{2}\, z^{(i)T}\Lambda^T\Psi^{-1}\Lambda z^{(i)} + \operatorname{tr}\, z^{(i)T}\Lambda^T\Psi^{-1}(x^{(i)} - \mu)\right] = \sum_{i=1}^{n} \left(-\Psi^{-1}\Lambda\, E\!\left[z^{(i)} z^{(i)T}\right] + \Psi^{-1}(x^{(i)} - \mu)\, E\!\left[z^{(i)T}\right]\right).$$

We can set this equal to zero and simplify to get

$$\Lambda = \left(\sum_{i=1}^{n} (x^{(i)} - \mu)\, E_{z^{(i)}\sim Q_i}\!\left[z^{(i)T}\right]\right)\left(\sum_{i=1}^{n} E_{z^{(i)}\sim Q_i}\!\left[z^{(i)} z^{(i)T}\right]\right)^{-1}.$$

If we use the identity $E[YY^T] = E[Y]E[Y]^T + \mathrm{Cov}(Y)$ (applied to $z^{(i)} \sim Q_i$), we get that

$$\Lambda = \left(\sum_{i=1}^{n} (x^{(i)} - \mu)\,\mu_{z^{(i)}|x^{(i)}}^T\right)\left(\sum_{i=1}^{n} \mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)^{-1}.$$

M step, $\mu, \Psi$

It can be shown that the M step yields

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x^{(i)},$$

which does not depend on the other parameters and therefore only needs to be computed once. As for $\Psi$, we calculate $\Phi$ as follows:

$$\Phi = \frac{1}{n}\sum_{i=1}^{n}\left[(x^{(i)} - \mu)(x^{(i)} - \mu)^T - (x^{(i)} - \mu)\mu_{z^{(i)}|x^{(i)}}^T\Lambda^T - \Lambda\mu_{z^{(i)}|x^{(i)}}(x^{(i)} - \mu)^T + \Lambda\left(\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)\Lambda^T\right],$$

then we set $\Psi_{ii} = \Phi_{ii}$, i.e., we keep only the diagonal of $\Phi$.
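
Putting the $\mu$, $\Lambda$, and $\Psi$ updates into code gives the sketch below (my own illustrative translation of the formulas, using the posterior moments returned by the E-step sketch above). Iterating the E step and this M step, with $\mu$ fixed at the data mean, and monitoring the log likelihood gives the full EM loop.

```python
import numpy as np

def fa_m_step(X, mu, mu_post, Sigma_post):
    """One M step: returns updated (Lam, Psi_diag).

    X is n x d, mu is the (fixed) data mean, mu_post is n x k with the
    posterior means mu_{z|x}, and Sigma_post is the shared k x k posterior
    covariance Sigma_{z|x}.
    """
    n, d = X.shape
    diff = X - mu                                   # rows are x^(i) - mu
    E_zz = mu_post.T @ mu_post + n * Sigma_post     # sum_i E[z^(i) z^(i)T]

    # Lambda update: (sum_i (x^(i)-mu) mu_{z|x}^T) (sum_i E[z^(i) z^(i)T])^{-1}
    Lam = (diff.T @ mu_post) @ np.linalg.inv(E_zz)

    # Psi update: Psi_ii = Phi_ii
    Phi = (diff.T @ diff
           - diff.T @ mu_post @ Lam.T
           - Lam @ (mu_post.T @ diff)
           + Lam @ E_zz @ Lam.T) / n
    Psi_diag = np.diag(Phi).copy()
    return Lam, Psi_diag
```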

So what are we doing again?

Our big goal in factor analysis was to model high-dimensional data with far fewer parameters than a full-covariance Gaussian requires, so that we can still fit it with EM when $d \gg n$. We did this by sampling a lower-dimensional $z$ vector, mapping it up to the high-dimensional $x$ space with $\Lambda$, and adding independent noise $\Psi$.