I am reading this paper about the relationship between PCA and kmeans and I just don't understand what is going on. Can anyone give me some reason why PCA and kmeans should be very related?

In my mind, kmeans is about finding where to place the centers of hyperballs, and it's sensitive to a data point moving away from a center in any direction. On the other hand, PCA is about the "important directions" in the data, and moving a data point far along an eigenvector will not change the outcome of PCA.

Where is my intuition failing?

asked Aug 02 '11 at 12:13

Rob%20Renaud's gravatar image

Rob Renaud
41551321


One Answer:

The main result seems to be that if you are attempting a k-means clustering with k=2, then there will be some cutoff point of the first principle component of the data that gives about the same (or exactly the same) division of the data into 2 clusters. This more or less makes sense. PCA tries to find successive directions of maximum variance in the data, while k-means attempts to explain variance as much as possible by treating data as generated from cluster centers. For a visual, imagine that you project all your data along your first principal component. Then cut this line of data points and color one half red, and one half blue. There will be a place you can cut it that corresponds to a k-means clustering. Your loss of intuition is because moving a point along the principal component until it is an outlier will not affect PCA, but it will effect K-means. This is explained by the fact that you need to move the point where you cut the set of data points.

There are some other aspects of this paper that I haven't explained because I don't understand them quite so well. But it seems to imply that K-means on data that has been reduced with PCA should yield near-optimal results.

answered Aug 02 '11 at 13:00

Jacob%20Jensen's gravatar image

Jacob Jensen
1644285360

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.