Long story short: I'm trying to perform k-means clustering on a really large dataset, too big to fit into memory. What would be the easiest/most straightforward way to go about clustering it? I'm open to other clustering algorithms.

Update: I didn't know this earlier, but it turns out that an implementation that handles a dataset exceeding the available memory is called an out-of-core implementation.

asked Mar 05 '11 at 00:31

Hao Wooi Lim

edited Mar 06 '11 at 01:46


5 Answers:

This sounds like a distributed clustering question that was posted like 2 days ago:

My suggestions are:

Divide your dataset into chunks of equal size. Run k-means on the first chunk and save only the locations of the cluster centers; then, using those centers as the starting point, run k-means on the next chunk, and so on.
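
A minimal sketch of that chunk-wise warm-starting, assuming scikit-learn; iter_chunks (an iterable of NumPy arrays loaded from disk one chunk at a time) and k are hypothetical placeholders:

```python
# Chunk-wise k-means: fit each chunk, warm-starting from the previous
# chunk's centers. iter_chunks and k are hypothetical placeholders.
from sklearn.cluster import KMeans

def chunked_kmeans(iter_chunks, k):
    centers = None
    for chunk in iter_chunks:
        if centers is None:
            km = KMeans(n_clusters=k, n_init=10)               # first chunk: fresh init
        else:
            km = KMeans(n_clusters=k, init=centers, n_init=1)  # warm start
        km.fit(chunk)
        centers = km.cluster_centers_
    return centers
```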

In this kind of setting I would recommend a mixture of Gaussians, since you estimate a mean and a variance for each component, which are easy to carry over to the next chunks.
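
In the same spirit, a minimal sketch of carrying the mixture parameters across chunks, again assuming scikit-learn and the same hypothetical iter_chunks and k:

```python
# Chunk-wise Gaussian mixture: warm-start EM on each chunk from the
# weights/means/precisions estimated on the previous chunk.
from sklearn.mixture import GaussianMixture

params = None
for chunk in iter_chunks:
    if params is None:
        gmm = GaussianMixture(n_components=k, covariance_type="diag")
    else:
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              weights_init=params[0],
                              means_init=params[1],
                              precisions_init=params[2])
    gmm.fit(chunk)
    params = (gmm.weights_, gmm.means_, gmm.precisions_)
```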

answered Mar 05 '11 at 00:55

Leon Palafox ♦

Leon, why would GMMs be more appropriate for parallelising compared to k-means?

(Mar 05 '11 at 06:57) Oscar Täckström

Since k-means is not Bayesian per se and a GMM is, using a prior that you learned from a previous subset of the same dataset seems more natural.

Any mixture model would be good, I guess.

(Mar 05 '11 at 10:12) Leon Palafox ♦

You could store only the cluster representatives (centroids) in memory and stream the data from disk one item at a time. This is also trivially parallelisable. But I don't think there is any reason to implement your own k-means; I would first try out the implementation in Apache Mahout.
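
For illustration, a minimal sketch of such a streaming k-means pass that keeps only the centroids (plus counts) in memory; stream_points and initial_centers are hypothetical placeholders for a point-at-a-time reader and a set of seed centers:

```python
# Online k-means: assign each streamed point to its nearest centroid and
# move that centroid by the incremental-mean update. Only the centroids
# and their counts live in memory.
import numpy as np

def online_kmeans(stream_points, initial_centers):
    centers = np.array(initial_centers, dtype=float)
    counts = np.zeros(len(centers))
    for x in stream_points:
        x = np.asarray(x, dtype=float)
        j = np.argmin(((centers - x) ** 2).sum(axis=1))  # nearest centroid
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]       # incremental mean
    return centers
```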

There is a series of video lectures on parallel clustering algorithms at Google Code University (see lecture 4 for k-means).

Edit: if you have a large number of clusters and a large number of features, even storing the centroids in memory might be difficult. One way to keep the number of dimensions down is to use feature hashing.
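
If it helps, a minimal sketch of the hashing trick, assuming scikit-learn; doc_stream is a hypothetical iterable of {feature_name: count} dicts, one per data point:

```python
# Hash raw features into a fixed-size space so the centroid matrix
# stays bounded regardless of how many distinct features appear.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=2 ** 18)  # fixed dimensionality
X = hasher.transform(doc_stream)            # sparse matrix, usable by k-means
```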

answered Mar 05 '11 at 04:57

Oscar Täckström

edited Mar 06 '11 at 10:38

This is a really good idea. Now why didn't I think of this?

(Mar 06 '11 at 01:40) Hao Wooi Lim

This may also be useful: Web-Scale K-Means Clustering by D. Sculley. It is very easy to implement, and if you're not feeling so bold, there is a highly optimized open source version.
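
Not necessarily the optimized version referred to above, but scikit-learn's MiniBatchKMeans is based on the same mini-batch algorithm; a minimal sketch with a hypothetical iter_chunks() generator and placeholder parameter values:

```python
# Mini-batch k-means over data streamed in chunks from disk.
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=100, batch_size=1000)
for chunk in iter_chunks():
    mbk.partial_fit(chunk)        # one mini-batch update per chunk
centers = mbk.cluster_centers_
```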

answered Mar 05 '11 at 08:12

downer

edited Mar 06 '11 at 11:29

I did some research on this and found an out-of-core implementation of k-means that seems to work pretty well. Just sharing. Link: http://www.cs.princeton.edu/~wdong/kmeans/

answered Mar 06 '11 at 01:41

Hao Wooi Lim

edited Mar 06 '11 at 01:42

You should also consider Efficient Clustering of Web-Derived Data Sets (Sarmento et al., 2009). It is a promising approach to clustering large datasets without getting excessively fragmented clusters (clusters that are too small and should be agglomerated with other clusters).

The basic idea is that you look at the next k examples, and try to find one that is "close enough" to link the two examples into one cluster.
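
A rough illustration of that linking idea (as paraphrased here, not the authors' exact algorithm); the window size k and the distance threshold are hypothetical parameters:

```python
# Single-pass linking: for each point, scan the next k points and merge
# it with the first one that falls within the distance threshold,
# tracking cluster membership with a small union-find forest.
import numpy as np

def link_clusters(points, k, threshold):
    points = np.asarray(points, dtype=float)
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, min(i + 1 + k, len(points))):
            if np.linalg.norm(points[i] - points[j]) < threshold:
                parent[find(j)] = find(i)   # link the two examples
                break
    return [find(i) for i in range(len(points))]  # cluster id per point
```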

If you are interested, I have some unpublished details about implementing this algorithm, from personal communications with Sarmento.

If your final clusters will be too large to store in memory, that is a separate question.

answered Apr 11 '11 at 05:02

Joseph Turian ♦♦
