ICA

Tags: CS 229, Unsupervised

The big idea

In PCA, we formulated a basis whose axes capture the most variation in the data. In ICA, we also formulate a basis, but one that "decorrelates" the data.

*Figure: the same data shown in the original basis (left) and a shifted basis (right).*

In this example, you see how the left basis creates a correlation in the blue data. But in the right basis, there is no correlation!
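As a toy illustration of what "decorrelating" via a change of basis means, here is a minimal sketch (assuming NumPy; this uses the covariance eigenbasis, which removes linear correlation, not ICA itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build correlated 2-D data: in the original basis the covariance
# matrix has a nonzero off-diagonal entry.
M = np.array([[2.0, 1.0],
              [0.0, 1.0]])                 # arbitrary linear map
data = M @ rng.normal(size=(2, 2000))
cov = np.cov(data)
assert abs(cov[0, 1]) > 0.5                # correlated in this basis

# Shift to the eigenbasis of the covariance matrix: in this basis
# the off-diagonal entries (the correlations) vanish.
_, U = np.linalg.eigh(cov)
cov_shifted = np.cov(U.T @ data)
assert abs(cov_shifted[0, 1]) < 1e-8
```

The same cloud of points is correlated in one basis and uncorrelated in the other; only the coordinate system changed.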

We will attempt to motivate ICA again in a more practical case below.

The cocktail problem

Suppose that you had multiple people talking at one time, but you wanted to extract the audio from one person. You have multiple microphones so you can get a sense of depth, but how do you parse out this one person?

More formally, you have a $d$-dimensional source vector $s$ and an observation $x$. The elements of $s$ are independent random variables, but the observation $x$ mixes them together, so its elements are correlated. We want to recover the independent sources from the correlated observation.

A mixing matrix $A$ will "mix" the elements of $s$ and hand them off to $x$, which is also a $d$-dimensional vector:

$$x = As$$

Can we recover $s$ from $x$? Hypothetically, if $A$ were non-singular, this is totally possible. We can find some $W = A^{-1}$, called the unmixing matrix, such that $s^{(i)} = W x^{(i)}$.
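A minimal numerical sketch of this mixing/unmixing step (the sources and mixing matrix here are hypothetical, and $A$ is assumed known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent, non-Gaussian sources; s is d x n (d = 2 here).
s = rng.uniform(-1, 1, size=(2, 1000))

# A hypothetical non-singular mixing matrix.
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])

x = A @ s                      # each column of x is one observation

# When A is known and non-singular, unmixing is just inversion.
W = np.linalg.inv(A)
assert np.allclose(W @ x, s)   # exact recovery of the sources
```

The whole difficulty of ICA is that in practice we do *not* know $A$, so $W$ must be estimated from the observations alone.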

So essentially we want to transform a set of random variables into another one.

What can't you do?

Well, it's actually more complicated than finding an inverse matrix. The permutation of the original sources is ambiguous, and so is the scaling of each source (because you could scale the mixing matrix by $\alpha$ and the sources by $1/\alpha$ and the output would be the same).
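Both ambiguities are easy to check numerically. A small sketch (sources and mixing matrix are again hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=(2, 500))       # hypothetical sources
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])                   # hypothetical mixing matrix
x = A @ s

# Scaling ambiguity: scale A up by alpha and the sources down by alpha.
alpha = 3.0
assert np.allclose((alpha * A) @ (s / alpha), x)

# Permutation ambiguity: swap the columns of A and the rows of s.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])                   # a swap; P is its own inverse
assert np.allclose((A @ P) @ (P @ s), x)
```

Since the observations are identical in every case, no algorithm can distinguish these alternatives from $x$ alone.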

In addition, the sources can't be Gaussian. Why? Well, we can show that any arbitrary "rigid motion" (orthogonal matrix) applied to the mixing matrix produces the same outcome. First, if $s \sim \mathcal{N}(0, I)$, then $x = As \sim \mathcal{N}(0, AA^T)$.

Now, we can apply some rigid transformation $R$ such that $A' = AR$. Now, we notice that

$$E[A R s s^T R^T A^T] = A R R^T A^T = A A^T$$

(using $E[ss^T] = I$ and $RR^T = I$).

Therefore, given any rigid motion matrix $R$, the output distribution of $x$ will be unchanged, so Gaussian sources leave the mixing matrix unidentifiable!
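The covariance argument can be verified directly. A small check, using a hypothetical mixing matrix and an arbitrary rotation:

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.3, 2.0]])                    # hypothetical mixing matrix

theta = 0.7                                   # any rotation angle works
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(R @ R.T, np.eye(2))        # R is orthogonal

A_prime = A @ R

# Same covariance AA^T, hence the same Gaussian distribution for x.
assert np.allclose(A_prime @ A_prime.T, A @ A.T)
```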

Densities and linear transformations

If you have a random variable $s$ and you have a transformation $A$, then the density of $x = As$ is not as straightforward as you might think. Imagine if you just pushed $p(s)$ through $A$. You run the risk of creating an invalid distribution that doesn't integrate to 1!

To fix this, let $W = A^{-1}$. The density transformation is defined as

$$p_x(x) = p_s(Wx)\,|W|$$

In a way, the $|W|$ keeps track of where things stretch and where things squeeze. We mapped $x$ to $s$'s domain by using $W$, but we must account for our stretch debt. This is why the $|W|$ term is there.

For example, let's consider the case where $W$ shrinks $x$ by a factor of two to get $s$. This means that $x$ has double the range, so it should have half the density. Out of the box, a function's output doesn't change when its input is stretched. This is fine for ordinary functions, but not for densities, which must integrate to 1. This is where the $|W|$ comes in—in our example, to scale the density by half. Hopefully, this toy example clarifies things.
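The toy example above can be written out concretely. A sketch, assuming $s \sim \mathrm{Uniform}(0, 1)$ and $A = 2$:

```python
# s ~ Uniform(0, 1), so p_s(s) = 1 on [0, 1].
# Let x = As with A = 2, so W = A^{-1} = 1/2 and |W| = 1/2.
# x then ranges over [0, 2] and should have half the density.

def p_s(s):
    return 1.0 if 0.0 <= s <= 1.0 else 0.0

W = 0.5

def p_x(x):
    # map x back to s's domain, then pay the "stretch debt" |W|
    return p_s(W * x) * abs(W)

assert p_x(1.5) == 0.5    # inside [0, 2]: half of p_s's height
assert p_x(3.0) == 0.0    # outside the support
# total mass: width 2 * height 0.5 = 1, still a valid density
```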

ICA algorithm

We assume that each source is independent, so the joint density factorizes:

$$p(s) = \prod_{j=1}^{d} p_s(s_j)$$

We apply the transformation $W$ to the data by using the density-transformation trick we just discussed:

$$p(x) = \left( \prod_{j=1}^{d} p_s(w_j^T x) \right) |W|$$

(we just use the row-dot-product form of matrix multiplication to convey the different sources: $s_j = w_j^T x$)

The $p_s$ could take any form, and if you knew what the source distribution was like, then you should substitute that distribution. However, in a pinch, we can use the density whose CDF is the sigmoid function $g = \sigma$. As such, $p_s(s) = g'(s)$ and the log-likelihood of the data becomes

$$\ell(W) = \sum_{i=1}^{n} \left( \sum_{j=1}^{d} \log g'\left(w_j^T x^{(i)}\right) + \log |W| \right)$$

If we take the derivative and run stochastic gradient ascent, we get the update rule

$$W := W + \alpha \left( \begin{bmatrix} 1 - 2g(w_1^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_d^T x^{(i)}) \end{bmatrix} x^{(i)T} + (W^T)^{-1} \right)$$

So, after the algorithm converges, all we need to do is compute $s^{(i)} = W x^{(i)}$.
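Putting the pieces together, here is a sketch of the whole algorithm in NumPy. The learning rate, epoch count, and demo data (Laplace sources, which are super-Gaussian and thus a reasonable match for the sigmoid prior) are all illustrative choices, not part of the original derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, lr=0.005, epochs=50, seed=0):
    """Sketch of ICA by stochastic gradient ascent on the sigmoid
    log-likelihood. X is n x d: one observation x^(i) per row."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.eye(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = X[i].reshape(d, 1)
            # update: W += lr * ((1 - 2 g(Wx)) x^T + (W^T)^{-1})
            W += lr * ((1.0 - 2.0 * sigmoid(W @ x)) @ x.T
                       + np.linalg.inv(W.T))
    return W

# Demo: mix two independent Laplace sources with a hypothetical A.
rng = np.random.default_rng(1)
S = rng.laplace(size=(1000, 2))             # true sources, n x d
A = np.array([[1.0, 0.5], [0.3, 2.0]])      # hypothetical mixing matrix
X = S @ A.T                                  # x^(i) = A s^(i)

W = ica(X)
S_hat = X @ W.T   # recovered sources (up to permutation and scaling)
```

Because of the ambiguities discussed earlier, the recovered sources should only match the originals up to permutation and scaling, never exactly.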