K-Means

Tags: CS 229, Unsupervised

The big idea

Given $n$ data points, how can we cluster them into $k$ groups?

The intuition is very simple: start out randomly with $k$ centroids $\mu_1, \dots, \mu_k$. For each $x_i$, assign that $x_i$ to the closest centroid. Then, take all the $x$ assigned to centroid $\mu_j$ and redefine that centroid to be the center of mass of those $x$'s. Repeat, and you will converge to a clustering (we'll see why below).
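A minimal sketch of this loop in Python/numpy (the function and its defaults are my own, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: send each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (keeping the old centroid if a cluster ends up empty).
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # assignments have stabilized
            break
        mu = new_mu
    return c, mu
```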

The mathematical expression

This is just the mathematical expression of what we just intuited: alternate between the two steps below until convergence.
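With $c^{(i)}$ denoting the cluster assigned to point $x^{(i)}$ (the same notation as the distortion function below), the two alternating steps can be written as:

$$c^{(i)} := \arg\min_j ||x^{(i)} - \mu_j||^2$$

$$\mu_j := \frac{\sum_i 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_i 1\{c^{(i)} = j\}}$$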

Example
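As a quick illustration, here's the kmeans sketch from above run on a made-up 2D dataset: two well-separated blobs that k-means should recover.

```python
# Two obvious blobs; the kmeans sketch above should separate them.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
c, mu = kmeans(X, k=2)
print(c)   # e.g. [0 0 0 1 1 1] (labels may come out permuted)
print(mu)  # centroids near (0.1, 0.13) and (5.0, 5.0)
```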

K-means as coordinate descent

We can define the distortion function / potential function as

$$F(c, \mu) = \sum_i||x^{(i)} - \mu_{c^{(i)}}||^2$$

The optimization is just

$$\min_\mu\min_c F(c, \mu)$$

When you re-compute the centroids, you're minimizing over $\mu$. When you re-assign points to clusters, you're minimizing over $c$. As such, you can view this whole deal as coordinate descent on $F$: we fix one variable, optimize the other, then swap.

We are guaranteed to converge. Why? Well, each step is guaranteed not to make things worse: $F(c_t, \mu_t) \geq F(c_{t+1}, \mu_t) \geq F(c_{t+1}, \mu_{t+1})$. Since $F$ is non-negative and there are only finitely many possible assignments, this monotone descent must eventually stop.

However, $F$ is non-convex, so there are no convergence guarantees for the global minimum. To account for this, it's often best to do a few K-means runs with different initializations and pick the one with the lowest converged distortion.
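A sketch of that restart recipe, reusing the hypothetical kmeans from above and keeping the run with the lowest distortion $F$:

```python
def kmeans_best_of(X, k, n_runs=10):
    """Run k-means n_runs times; keep the run with the lowest distortion."""
    best = None
    for seed in range(n_runs):
        c, mu = kmeans(X, k, seed=seed)
        F = ((X - mu[c]) ** 2).sum()  # distortion of this converged run
        if best is None or F < best[0]:
            best = (F, c, mu)
    return best  # (F, assignments, centroids) of the best run
```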

Agglomerative clustering

An interesting agglomerative twist on k-means is to start with a large number of clusters, and then, after running k-means for a while, cluster the clusters! This allows you to start with high specificity and then merge your way down to coarser groupings, which helps avoid overfitting. (This is distinct from k-means++, which is a scheme for choosing better initial centroids.)
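A minimal sketch of the cluster-the-clusters idea, again reusing the hypothetical kmeans from above (this assumes X is a much larger dataset than the toy one, with far more points than the initial cluster count):

```python
# Over-cluster first, then run k-means on the centroids themselves.
c_fine, mu_fine = kmeans(X, k=20)           # many small, specific clusters
c_coarse, mu_coarse = kmeans(mu_fine, k=3)  # cluster the 20 centroids into 3
# Each point inherits the coarse label of its fine centroid.
labels = c_coarse[c_fine]
```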

Limitations

K-means implicitly assumes that your cluster distributions are circular, separable, and roughly the same size (the size assumption arises because we assign points based purely on distance to the nearest center).

But real data often violates these assumptions: elongated, overlapping, or unevenly sized clusters are common.

To help us overcome these problems, we turn to a more general class of clustering models.