
I would like to cluster a large sparse matrix in Python. Does anyone know an implementation that would allow me to do this?

The thread "Clustering on very large sparse matrix?" suggests first doing PCA and then clustering the resulting dense matrix. In it, Aria says that the time complexity of clustering a sparse matrix is no better than for a dense one.

I don't mind a dense time complexity, as long as the memory complexity is sparse. My matrix is simply too big to store in memory in a dense representation.

Does anyone have a Python implementation of clustering that accepts a scipy.sparse matrix as input?

asked Jul 18 '11 at 14:44


Joseph Turian ♦♦


2 Answers:

To run KMeans on scipy.sparse data, there is this pull request that should be merged into the master branch of scikit-learn very soon.

I have tested it on 20 newsgroups and it seems to work without performance issues, although I haven't manually checked the quality of the clusters in detail yet.
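For reference, here is a minimal sketch of what this looks like with a current scikit-learn where the sparse support has landed; the random matrix stands in for real TF-IDF document vectors:

```python
from scipy.sparse import random as sparse_random
from sklearn.cluster import MiniBatchKMeans

# A random CSR matrix standing in for, e.g., sparse TF-IDF document vectors.
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)

# MiniBatchKMeans accepts scipy.sparse input directly,
# so the data is never densified.
km = MiniBatchKMeans(n_clusters=20, random_state=0, n_init=3)
labels = km.fit_predict(X)
print(labels.shape)  # one cluster label per row
```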

I am also working on another experimental branch where I use a new cosine_similarity function to build a symmetric affinity matrix over the sparse document vectors. I have the impression that running scikit-learn's SpectralClustering on this affinity matrix gives much higher quality results than MinibatchKMeans in the original feature space.
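A rough sketch of that pipeline, assuming the cosine_similarity / precomputed-affinity API of a recent scikit-learn (note that the affinity matrix is n_samples x n_samples, so this only scales to moderate N):

```python
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

# Sparse "document" vectors (non-negative, like TF-IDF weights).
X = sparse_random(200, 1000, density=0.02, format="csr", random_state=0)

# Dense symmetric affinity matrix of pairwise cosine similarities.
affinity = cosine_similarity(X)

# Cluster the precomputed affinity matrix directly.
sc = SpectralClustering(n_clusters=5, affinity="precomputed", random_state=0)
labels = sc.fit_predict(affinity)
print(labels.shape)
```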

Have a look at this script in my experimental branch for usage examples. Ignore the PowerIterationClustering model for now, as that implementation seems to be broken (more bug hunting required).

Edit: I just merged the sparse MinibatchKMeans branch into master

answered Jul 18 '11 at 15:05


ogrisel

edited Jul 18 '11 at 15:51

I've not used it with sparse data, but I was impressed with scikit-learn's incremental MiniBatchKMeans in my own tests on dense data.

(Jul 18 '11 at 16:56) Cerin

Interesting, to what kind of data have you applied it?

(Jul 18 '11 at 18:13) ogrisel

Is the question still active after a week? If so, roughly what are N, dim, and k, and what metric are you using?

Two general problems -- maybe not in your case, but easy to test:

  • distance whiteout: in high dimensions, all pairwise distances look roughly normally distributed and nearly equal, so "nearest" centres are barely nearer than the rest
  • the means of many sparse vectors quickly become dense, so you end up computing np.dot( dense centres, x.toarray() ) anyway.

If sqrt(N)*dim fits in memory, how about running kmeans on a dense random sample of size ~ sqrt(N) and looking at the centres? (Advt) this short code uses any of the 20-odd metrics in scipy.spatial.distance, so you can try L1, cosine ...
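The subsampling idea can be sketched as follows; this is my own illustration with standard scipy/scikit-learn pieces (not denis's linked code), densifying only a sqrt(N)-sized sample and then assigning the full sparse data chunk by chunk with an arbitrary scipy.spatial.distance metric:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Sparse data too big to densify all at once (stand-in example).
X = sparse_random(10_000, 100, density=0.05, format="csr", random_state=0)

# Draw ~sqrt(N) rows, densify only the sample, and cluster it.
rng = np.random.default_rng(0)
n_sample = int(np.sqrt(X.shape[0]))
sample = X[rng.choice(X.shape[0], size=n_sample, replace=False)].toarray()
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(sample)

# Assign the full data in dense chunks, using any scipy.spatial.distance
# metric (here cityblock, i.e. L1) against the dense centres.
labels = np.empty(X.shape[0], dtype=int)
for start in range(0, X.shape[0], 1000):
    chunk = X[start:start + 1000].toarray()
    labels[start:start + 1000] = cdist(
        chunk, km.cluster_centers_, metric="cityblock"
    ).argmin(axis=1)
print(labels.shape)
```

Since only one chunk and the sample are ever dense, peak memory stays far below the full dense matrix.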

answered Jul 24 '11 at 07:21


denis


powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.