
I am trying to do some (k-means) clustering on a very large matrix.

The matrix is approximately 500,000 rows x 4,000 columns, yet very sparse (only a couple of "1" values per row). I want to get around 2,000 clusters.

I have two questions:

  • Can someone recommend an open source platform or tool for doing that (maybe using k-means, maybe with something better)?

  • How can I best estimate the time the algorithm will need to finish? I tried Weka once, but aborted the job after a couple of days because I couldn't tell how much time it would take.

Thanks!

asked Jul 08 '10 at 11:51 by movingabout
edited Jul 10 '10 at 16:30 by Joseph Turian ♦♦

3 Answers:

Sparse SVD with the Lanczos algorithm is a good way to reduce the dimensionality of your data. It performs a Singular Value Decomposition but only computes the few most significant singular vectors (a few hundred at most).

500,000 rows x 4,000 columns with a couple of values per row is quite a small dataset for the Lanczos algorithm. You can get an implementation here: http://tedlab.mit.edu/~dr/SVDLIBC/
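If you'd rather prototype the same idea before wiring up SVDLIBC, here is a minimal sketch using SciPy's Lanczos-style truncated sparse SVD (scipy.sparse.linalg.svds); the matrix below is a small random stand-in, and the shapes and number of components are assumptions based on the question, not anything SVDLIBC-specific.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds  # Lanczos-style truncated SVD for sparse matrices

# Small random stand-in for the real data; the actual matrix is
# ~500,000 x 4,000 with only a few "1" entries per row.
X = sparse_random(50_000, 4_000, density=0.0005, format="csr", dtype=np.float64)

# Keep only the k largest singular triplets (a few hundred at most).
k = 100
U, S, Vt = svds(X, k=k)

# Dense, low-dimensional representation of each row, ready for clustering.
X_reduced = U * S   # shape: (50_000, k)
print(X_reduced.shape)
```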

However, if your similarity matrix (row x row) is sparse, then Affinity Propagation is the best solution for your problem. It outperforms k-means and is well suited to large sparse datasets; it was published in Science in 2007.

You can get working code on the psi.toronto.edu site.
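As a rough illustration, here is Affinity Propagation using scikit-learn's implementation rather than the psi.toronto.edu code; note that this implementation materializes the full row-by-row similarity matrix, so the sketch runs on a small subsample, and all sizes below are assumptions.

```python
from scipy.sparse import random as sparse_random
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

# Affinity Propagation picks the number of clusters itself, so there is no k=2000.
# scikit-learn stores the full (n x n) similarity matrix, so use a subsample
# of the 500,000 rows (or a reduced representation) rather than the whole thing.
X_sample = sparse_random(2_000, 4_000, density=0.001, format="csr")

S = cosine_similarity(X_sample)                  # dense 2000 x 2000 similarity matrix
ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(S)
print("clusters found:", len(ap.cluster_centers_indices_))
```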

This answer is marked "community wiki".

answered Jul 09 '10 at 01:02 by DirectedGraph
edited Jul 09 '10 at 06:50 by ogrisel

Consider using Mahout: http://mahout.apache.org/

They have a scalable PCA/SVD implementation for dimensionality reduction and a MapReduce version of k-means which you can run on a Hadoop cluster (e.g. Amazon EC2).
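If you want to try the k-means part on a single machine before standing up a Hadoop cluster, a mini-batch variant of k-means scales to this matrix size; the sketch below uses scikit-learn's MiniBatchKMeans (my assumption, not anything Mahout-specific), fed directly with the sparse matrix.

```python
from scipy.sparse import random as sparse_random
from sklearn.cluster import MiniBatchKMeans

# Random stand-in for the real ~500,000 x 4,000 sparse binary matrix.
X = sparse_random(100_000, 4_000, density=0.001, format="csr")

# Mini-batch k-means updates the centroids from small batches, so it copes
# with large sparse inputs on one machine; k=2000 matches the question.
mbk = MiniBatchKMeans(n_clusters=2000, batch_size=10_000, random_state=0)
labels = mbk.fit_predict(X)
print(labels[:10])
```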

answered Jul 08 '10 at 13:49 by alex
edited Jul 08 '10 at 13:56

Mahout is overkill for this dataset. It might look big, but given how sparse it is, it can easily be handled by a single computer. I have routinely calculated sparse SVDs of million-by-hundred-thousand sparse matrices (~2-5% nonzero values, ~100 eigenvectors) on a Pentium 4 with 3 GB of RAM.

(Jul 09 '10 at 01:05) DirectedGraph

I think the first thing you want to do is dimensionality reduction, so that you have a dense representation; most clustering algorithms I know of work best with dense representations. You can use PCA, but in my experience it is much faster to use random projections. I have Java code for random projections here; you can check out the whole project here.

Once you've done dimensionality reduction, clustering will be very quick and probably more reliable. Another trick is to run k-means on a random sample of your data and then run it again on more data, initialized with the K centroids from the previous run.
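Here is a rough Python sketch of both ideas, using scikit-learn's SparseRandomProjection and KMeans rather than the Java code linked above; the sizes, number of components, and k are assumptions chosen to keep the example small.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.random_projection import SparseRandomProjection
from sklearn.cluster import KMeans

# Random stand-in for the sparse ~500,000 x 4,000 matrix.
X = sparse_random(50_000, 4_000, density=0.001, format="csr")

# 1) Random projection down to a dense, low-dimensional representation.
rp = SparseRandomProjection(n_components=300, random_state=0)
X_low = rp.fit_transform(X).toarray()

# 2) Run k-means on a random sample first ...
k = 200   # the question wants ~2000; smaller here to keep the sketch quick
rng = np.random.RandomState(0)
sample = X_low[rng.choice(len(X_low), size=5_000, replace=False)]
km_sample = KMeans(n_clusters=k, n_init=1, random_state=0).fit(sample)

# ... then re-run on the full data, seeded with the sample's centroids.
km_full = KMeans(n_clusters=k, init=km_sample.cluster_centers_, n_init=1).fit(X_low)
print(np.bincount(km_full.labels_)[:10])
```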

answered Jul 08 '10 at 12:11 by aria42
edited Jul 08 '10 at 12:22

In my experience, random projections are useful when the matrix has more than ~10 million entries and isn't highly sparse. In this case, the Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) should be sufficient.

(Jul 09 '10 at 01:08) DirectedGraph