I'm trying to perform k-means clustering on a very large dataset, one that is too big to fit into memory. What would be the easiest/most straightforward way to go about it? I'm open to other clustering algorithms.

Update: I didn't know this earlier, but it turns out that an implementation where the dataset exceeds the amount of available memory is called an out-of-core implementation.
This sounds like the distributed clustering question that was posted a couple of days ago. My suggestion: divide your dataset into chunks of equal size, run k-means on the first chunk and save only the locations of the cluster centers, then use those centers as the starting point for k-means on the next chunk, and so on (a sketch follows the comment thread below). In this kind of setting I would recommend a mixture of Gaussians, since you estimate a mean and a variance per cluster, which are easier to carry over to the next chunks.

Leon, why would GMMs be more appropriate for parallelising compared to k-means?
(Mar 05 '11 at 06:57)
Oscar Täckström
Since k-means is not Bayesian per se and a GMM is, using a prior learned from a previous subset of the same dataset seems more natural. Any mixture model would do, I guess.
(Mar 05 '11 at 10:12)
Leon Palafox ♦
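A minimal sketch of the chunked warm-start idea, assuming scikit-learn's KMeans and a hypothetical load_chunks() generator that yields one in-memory chunk of the dataset at a time:

```python
from sklearn.cluster import KMeans

def chunked_kmeans(load_chunks, n_clusters=10):
    """Run k-means chunk by chunk, carrying the centers forward."""
    centers = None
    for chunk in load_chunks():                  # each chunk fits in memory
        if centers is None:
            km = KMeans(n_clusters=n_clusters)   # first chunk: default init
        else:
            # warm start: the previous centers seed the next run
            km = KMeans(n_clusters=n_clusters, init=centers, n_init=1)
        km.fit(chunk)
        centers = km.cluster_centers_            # only the centers are kept
    return centers
```

Only the k-by-d matrix of centers survives between chunks, which is what keeps the memory footprint small.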
You could store only the cluster representatives (centroids) in memory and stream the data from disk one item at a time. This is also trivially parallelisable. But I don't think there is any reason to implement your own k-means; I would first try out the implementation in Apache Mahout. There is a series of video lectures on parallel clustering algorithms at Google Code University (see lecture 4 for k-means).

Edit: if you have a large number of clusters and a large number of features, even storing the centroids in memory might be difficult. One way to keep the number of dimensions down is to use feature hashing. A sketch of the streaming approach follows the comment below.

This is a really good idea. Now why didn't I think of this?
(Mar 06 '11 at 01:40)
Hao Wooi Lim
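A rough sketch of the streaming scheme combined with the hashing trick, assuming text items for illustration; HashingVectorizer is real scikit-learn, but stream_documents() is a hypothetical reader:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

def online_kmeans(stream_documents, k, n_features=2**12):
    """One-pass k-means: only the k hashed centroids live in memory."""
    vec = HashingVectorizer(n_features=n_features)  # feature hashing bounds the width
    centers = np.zeros((k, n_features))
    counts = np.zeros(k)
    for i, doc in enumerate(stream_documents()):    # stream items from disk
        x = vec.transform([doc]).toarray().ravel()
        if i < k:                                   # seed centroids with first k items
            centers[i], counts[i] = x, 1
            continue
        j = int(np.argmin(((centers - x) ** 2).sum(axis=1)))  # nearest centroid
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]  # running-mean update
    return centers
```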
This may also be useful: Web-Scale K-Means Clustering by D. Sculley. It is very easy to implement, and if you're not feeling so bold, there is a highly optimized open-source version.
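For reference, scikit-learn's MiniBatchKMeans follows Sculley's mini-batch algorithm and supports out-of-core use through partial_fit; a minimal sketch, with iter_chunks() as a hypothetical reader that yields arrays from disk:

```python
from sklearn.cluster import MiniBatchKMeans

def minibatch_kmeans(iter_chunks, n_clusters=10):
    """Fit k-means one mini-batch at a time; only the model stays in memory."""
    mbk = MiniBatchKMeans(n_clusters=n_clusters, batch_size=1000)
    for chunk in iter_chunks():      # each chunk is a (batch, features) array
        mbk.partial_fit(chunk)       # Sculley-style mini-batch update
    return mbk.cluster_centers_
```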
I did some research on this matter and found this out-of-core implementation of k-means that seems to work pretty well. Just sharing. Link: http://www.cs.princeton.edu/~wdong/kmeans/
You should also consider Efficient Clustering of Web-Derived Data Sets (Sarmento et al., 2009). It is a promising approach to clustering large data sets without producing excessively fragmented clusters (clusters that are too small and should be agglomerated with other clusters). The basic idea is that you look at the next k examples and try to find one that is "close enough" to link the two examples into one cluster. If you are interested, I have some unpublished details about implementing this algorithm, from personal communications with Sarmento. If your final clusters will be too large to store in memory, that is a separate question.
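I can't share the paper's own code, but a generic one-pass, threshold-based linking scheme in the spirit of the description above might look like the sketch below; the cosine threshold tau and the buffer size are illustrative choices, not values from the paper:

```python
import numpy as np

def threshold_link(stream_items, tau=0.8, buffer_size=1000):
    """Greedy one-pass clustering: link each item to the closest cluster
    whose representative is "close enough", else start a new cluster."""
    reps, labels = [], []                        # one representative per cluster
    for x in stream_items():                     # x: a feature vector
        x = x / np.linalg.norm(x)                # unit-normalise for cosine similarity
        cand = reps[-buffer_size:]               # only compare against recent clusters
        sims = [float(r @ x) for r in cand]
        if sims and max(sims) >= tau:
            offset = len(reps) - len(cand)
            labels.append(offset + int(np.argmax(sims)))  # join the closest cluster
        else:
            labels.append(len(reps))             # nothing close enough: new cluster
            reps.append(x)
    return labels
```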