I would like to cluster a large sparse matrix in Python. Does anyone know of an implementation that would allow me to do this? "Clustering on very large sparse matrix?" suggests first doing PCA and then clustering the resulting dense matrix. Aria says that the time complexity of clustering a sparse matrix is no better than for a dense one. I don't mind a dense time complexity, as long as the memory complexity is sparse: my matrix is simply too big to store in memory in a dense representation. Does anyone have a Python implementation of clustering that accepts a sparse matrix?
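To make the memory constraint concrete, here is a back-of-the-envelope sketch comparing a dense float64 layout with a scipy.sparse CSR layout. The matrix dimensions and density are hypothetical, chosen only for illustration:

```python
# Hypothetical sizes, for illustration only: a 1,000,000 x 10,000 matrix
# with 0.1% non-zero entries.
n_rows, n_cols = 1_000_000, 10_000
nnz = n_rows * n_cols // 1000  # 0.1% non-zeros

# Dense float64: 8 bytes per entry, zeros included.
dense_bytes = n_rows * n_cols * 8

# CSR: 8 bytes of data + 4 bytes of column index per non-zero,
# plus one 4-byte row pointer per row (int32 indices assumed).
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4

print(f"dense: {dense_bytes / 1e9:.1f} GB")  # dense: 80.0 GB
print(f"CSR:   {csr_bytes / 1e9:.1f} GB")    # CSR:   0.1 GB
```

So a matrix that is hopeless dense can be a fraction of a gigabyte in CSR form, which is why an implementation that consumes the sparse matrix directly matters here.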
To run KMeans on … I have tested it on 20 newsgroups and it seems to work without performance issues, although I haven't manually checked the quality of the clusters in detail yet. I am also working on another experimental branch where I use a new function … Have a look at this script in my experimental branch for usage examples. Ignore the … Edit: I just merged the sparse …

I've not used it with sparse data, but I was impressed with scikit's incremental MiniBatchKMeans in my own tests using dense data.
(Jul 18 '11 at 16:56)
Cerin
Interesting, to what kind of data have you applied it?
(Jul 18 '11 at 18:13)
ogrisel
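The sparse KMeans support discussed above can be sketched on current scikit-learn roughly like this. This is a minimal sketch, not the branch referenced in the answer; the matrix shape, density, and all parameter values are made up for illustration, and MiniBatchKMeans is assumed to accept scipy.sparse CSR input directly:

```python
import scipy.sparse as sp
from sklearn.cluster import MiniBatchKMeans

# Toy stand-in for a large sparse feature matrix (e.g. TF-IDF of
# 20 newsgroups): 10,000 samples, 1,000 features, ~1% non-zeros.
# Sizes are hypothetical.
X = sp.random(10_000, 1_000, density=0.01, format="csr", random_state=0)

km = MiniBatchKMeans(n_clusters=20, batch_size=1_000, n_init=3,
                     random_state=0)
labels = km.fit_predict(X)  # X stays sparse throughout; no densification
print(labels.shape)         # (10000,)
```

The mini-batch variant also keeps memory bounded per iteration, which is the point of using it on data that cannot be densified.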
Is the question still active, after a week? If so, what are ~ N, dim, and k, and what metric? Two general problems -- maybe not in your case, but easy to test:
If sqrt(N)*dim fits in memory, how about running kmeans on some dense random samples of size ~ sqrt(N) and looking at the centres? (Advt) This short code uses any of the 20-odd metrics in scipy.spatial.distance, so you can try L1, cosine …
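That subsampling idea can be sketched as follows. This is not the code advertised above, just a hypothetical helper (`sample_kmeans`, with made-up sizes): densify only a ~sqrt(N) row sample, run k-means on it, then assign every row to its nearest centre in chunks using any metric that scipy.spatial.distance.cdist accepts:

```python
import numpy as np
import scipy.sparse as sp
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist

def sample_kmeans(X, k, metric="euclidean", seed=0, chunk=10_000):
    """X: scipy.sparse CSR (N x dim). Returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = max(k, int(np.sqrt(n)))            # sample size ~ sqrt(N)
    idx = rng.choice(n, size=m, replace=False)
    sample = X[idx].toarray()              # only sqrt(N) * dim densified
    centres, _ = kmeans(sample, k, seed=seed)
    # Assign all N rows to the nearest centre, chunk by chunk, so the
    # full matrix is never densified at once. metric can be "cityblock"
    # (L1), "cosine", or any other cdist metric name.
    labels = np.empty(n, dtype=np.intp)
    for start in range(0, n, chunk):
        block = X[start:start + chunk].toarray()
        labels[start:start + chunk] = cdist(block, centres,
                                            metric).argmin(axis=1)
    return centres, labels

# Hypothetical sizes for the demo.
X = sp.random(5_000, 50, density=0.05, format="csr", random_state=0)
centres, labels = sample_kmeans(X, k=4)
print(centres.shape[1], labels.shape)      # 50 (5000,)
```

Peak memory is roughly max(sqrt(N), chunk) * dim floats plus the sparse matrix itself, which is the trade the answer is suggesting.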