|
I am trying to do (k-means) clustering on a very large matrix. The matrix is approximately 500,000 rows x 4,000 columns, yet very sparse (only a couple of "1" values per row). I want to get around 2,000 clusters. I have two questions:
Thanks!
|
Sparse SVD with the Lanczos algorithm is a good way to reduce the dimensionality of your data. It performs singular value decomposition but computes only a small number of significant eigenvectors (around a few hundred). A 500,000 rows x 4,000 columns matrix with a couple of values per row is quite a small dataset for the Lanczos algorithm; you can get an implementation here: http://tedlab.mit.edu/~dr/SVDLIBC/

However, if your similarity matrix (row x row) is sparse, then Affinity Propagation is the best solution for your problem. It outperforms k-means and is well suited to large sparse datasets. It was published in Science in 2007. You can get working code on the psi.toronto.edu site.
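As a minimal sketch of the truncated-SVD idea above: SciPy's `svds` uses an implicitly restarted Lanczos/Arnoldi method (via ARPACK) to compute only the top-k singular triplets of a sparse matrix. The matrix here is a small random stand-in for the real 500,000 x 4,000 binary matrix, and the sizes (5,000 x 400, k = 50) are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical stand-in for the real 500,000 x 4,000 sparse binary matrix;
# a smaller random sparse matrix keeps the sketch quick to run.
X = sparse_random(5000, 400, density=0.001, format="csr", random_state=0)

# svds computes only the k largest singular triplets, never forming
# a dense copy of X -- this is what makes it viable for sparse data.
k = 50
U, s, Vt = svds(X, k=k)

# Project rows into the k-dimensional latent space for clustering.
X_reduced = U * s  # shape: (n_rows, k)
print(X_reduced.shape)
```

After this step, k-means can be run on the dense `X_reduced` instead of the raw sparse matrix.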
|
|
Consider using Mahout: http://mahout.apache.org/. They have a scalable PCA/SVD implementation for dimensionality reduction and a MapReduce version of k-means which you can run on a Hadoop cluster (e.g., Amazon EC2).
Mahout is overkill for this dataset. It might look big, but since it is sparse, it can easily be handled by a single computer. I have routinely calculated sparse SVDs of million-by-hundred-thousand sparse matrices (~2-5% nonzero values, ~100 eigenvectors) on a Pentium 4 with 3 GB of RAM.
(Jul 09 '10 at 01:05)
DirectedGraph
|
|
I think the first thing you want to do is dimensionality reduction on your data, so that you have a dense representation; most clustering algorithms I know of operate best on dense representations. You can use PCA, but in my experience it is much faster to use random projection for the dimensionality reduction. I have Java code for random projection here; you can check out the whole project here. Once you've done dimensionality reduction, clustering will be very quick and probably more reliable. Another trick is to run k-means on a random sample of your data, and then run it again on more data using the K centroids from the previous run as initialization. In my experience, random projections are useful when the matrix has more than ~10 million entries and isn't highly sparse. In this case, http://en.wikipedia.org/wiki/Lanczos_algorithm should be sufficient.
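The two tricks above (Gaussian random projection, then seeding a full k-means run with centroids learned on a sample) can be sketched as follows. All sizes here are illustrative assumptions, not the poster's actual data, and the k-means routine used is SciPy's `kmeans2` rather than the poster's Java code:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)

# Hypothetical sparse input standing in for the 500,000 x 4,000 matrix.
X = sparse_random(2000, 400, density=0.01, format="csr", random_state=0)

# Gaussian random projection: multiplying by a random matrix yields a
# dense low-dimensional representation (Johnson-Lindenstrauss lemma).
d = 50
R = rng.standard_normal((X.shape[1], d)) / np.sqrt(d)
X_dense = X @ R  # dense (2000, 50) ndarray

# Trick from the comment: run k-means on a random sample first, then
# reuse those centroids to initialize k-means on the full data.
sample = X_dense[rng.choice(X_dense.shape[0], size=500, replace=False)]
centroids, _ = kmeans2(sample, 20, minit="++", seed=0)
centroids, labels = kmeans2(X_dense, centroids, minit="matrix", seed=0)
print(centroids.shape, labels.shape)
```

Seeding the second run with already-reasonable centroids typically means it converges in far fewer iterations than starting from scratch on the full dataset.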
(Jul 09 '10 at 01:08)
DirectedGraph
|