|
 I am trying to cluster a collection of 100k tweets. Is there an algorithm with a source-code implementation available for download? I am aware of OnlineLDA, but I am looking for something along the lines of streaming K-Means or hierarchical agglomerative clustering. Thanks. |
|
Check out scikit-learn's MiniBatchKMeans, which I've used to reasonably good effect in similar situations in the past: http://scikit-learn.org/stable/modules/clustering.html#mini-batch-kmeans http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html |
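 For readers who want to see what this looks like in practice, here is a minimal sketch of feeding text to MiniBatchKMeans in batches via `partial_fit`. The sample tweets, the choice of `HashingVectorizer` for features, and the values of `n_clusters` and `n_features` are all assumptions made for illustration; none of them come from the answer above.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical stand-in for a real tweet stream, delivered in small batches.
tweet_batches = [
    ["new phone battery lasts all day", "phone battery died again",
     "great goal in the football match", "football match tonight"],
    ["battery life on this phone is bad", "what a football goal",
     "phone screen cracked", "the match ended in a draw"],
]

n_clusters = 2        # assumed value; tune for the real collection
n_features = 2 ** 12  # size of the hashed feature space (assumed)

# HashingVectorizer is stateless, so it needs no pass over the full corpus.
vectorizer = HashingVectorizer(n_features=n_features, alternate_sign=False)

# n_init is set explicitly only to avoid version-dependent defaults.
kmeans = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=0)

for batch in tweet_batches:
    X = vectorizer.transform(batch)  # sparse matrix, one row per tweet
    kmeans.partial_fit(X)            # update the centroids with this batch

# Assign cluster labels to the tweets of the first batch.
labels = kmeans.predict(vectorizer.transform(tweet_batches[0]))
```

 Note that the first batch passed to `partial_fit` has to contain at least `n_clusters` samples, since the centroids are initialised from it.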
|
 In addition to the other suggestions, you may want to look into locality-sensitive hashing and random indexing, which can be done sequentially / online / out-of-core: http://en.wikipedia.org/wiki/Locality-sensitive_hashing |
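 To make the online aspect concrete, here is a rough sketch of one locality-sensitive hashing scheme, MinHash with banding, which targets Jaccard similarity between token sets. MinHash is just one LSH family picked for illustration; the class name `MinHashLSH`, the whitespace tokenisation, and the parameter values are assumptions rather than anything from the answer above.

```python
import hashlib
from collections import defaultdict


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=32):
    """One minimum per seeded hash function over the token set."""
    unique = set(tokens)
    return tuple(min(_hash(t, seed) for t in unique) for seed in range(num_hashes))


class MinHashLSH:
    """Hypothetical online index: documents are added one at a time."""

    def __init__(self, num_hashes=32, bands=8):
        if num_hashes % bands != 0:
            raise ValueError("num_hashes must be divisible by bands")
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.buckets = defaultdict(list)

    def add(self, doc_id, text):
        """Index a document and return ids of earlier near-duplicate candidates."""
        tokens = text.lower().split()
        if not tokens:
            return []
        sig = minhash_signature(tokens, self.num_hashes)
        candidates = set()
        for b in range(self.bands):
            # Documents sharing every row of a band land in the same bucket.
            key = (b, sig[b * self.rows:(b + 1) * self.rows])
            candidates.update(self.buckets[key])
            self.buckets[key].append(doc_id)
        return sorted(candidates)


index = MinHashLSH()
first = index.add(0, "big storm hitting the coast tonight")
second = index.add(1, "big storm hitting the coast tonight")
third = index.add(2, "my cat refuses to eat breakfast")
```

 Each tweet is processed once and only the bucket table is kept in memory, which is what makes this usable sequentially. The candidate lists can then be grouped into clusters, for example by treating them as edges and taking connected components.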
|
 Such algorithms do exist, e.g. check out BIRCH.
 There are several implementations, e.g. JBIRCH in Java. If you Google for "birch [your preferred language]", you'll find others. |
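 As one example of such an implementation, scikit-learn ships a `Birch` estimator that supports incremental updates through `partial_fit`. The sketch below is only an illustration on synthetic 2-D points; the data, the `threshold` value, and the batch sizes are assumptions, and real tweets would first have to be turned into numeric vectors.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs stand in for real feature vectors.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# n_clusters=None skips the final global clustering step, so the CF-tree
# subclusters themselves are returned as the clusters.
birch = Birch(threshold=1.5, branching_factor=50, n_clusters=None)

for _ in range(5):
    # Each batch holds 20 noisy points around each blob centre.
    batch = np.vstack(
        [c + rng.normal(scale=0.3, size=(20, 2)) for c in centers]
    )
    birch.partial_fit(batch)  # update the CF-tree incrementally

# Points at the two blob centres should fall into different subclusters.
labels = birch.predict(centers)
```

 The `threshold` parameter bounds the radius of a subcluster, so it controls how fine-grained the result is and usually needs tuning for the data at hand.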
|
 There is a survey of data stream clustering recently published in ACM Computing Surveys (Oct 2013). Although it is not exactly what the OP asked for, it may nevertheless be relevant to other readers interested in this question. |