I am trying to cluster a collection of tweets (100k). Is there an algorithm with an implementation whose source code is available for download? Thanks.

I am aware of OnlineLDA, but I am looking for something along the lines of streaming k-means or hierarchical agglomerative clustering. Thanks.

asked Oct 29 '13 at 21:10

cherhan


4 Answers:

answered Oct 30 '13 at 04:29

Fred Mailhot

In addition to the other suggestions, you may want to look into locality-sensitive hashing and random indexing, both of which can be run sequentially, online, or out-of-core: http://en.wikipedia.org/wiki/Locality-sensitive_hashing
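To make the online aspect concrete, here is a minimal sketch of one common locality-sensitive hashing scheme, MinHash with banding, using only the Python standard library. The class name `MinHashLSH`, the word-set tokenisation, and the values of `NUM_HASHES` and `ROWS_PER_BAND` are illustrative assumptions, not something taken from a particular library; treat it as a starting point rather than a tuned implementation.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32    # signature length (illustrative choice)
ROWS_PER_BAND = 2  # rows per band, giving 16 bands in total


def tokens(text):
    """Lowercased set of whitespace-separated words (a deliberately crude tokeniser)."""
    return set(text.lower().split())


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(token_set):
    """MinHash signature: for each seed, the smallest hash value over the set."""
    return [min(_hash(seed, t) for t in token_set) for seed in range(NUM_HASHES)]


class MinHashLSH:
    """Online index: tweets are added one at a time and bucketed band by band."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def _keys(self, signature):
        # Each band of the signature becomes one bucket key.
        for start in range(0, NUM_HASHES, ROWS_PER_BAND):
            yield (start, tuple(signature[start:start + ROWS_PER_BAND]))

    def add(self, doc_id, text):
        """Insert one tweet; return ids of earlier tweets sharing at least one bucket."""
        token_set = tokens(text) or {""}  # guard against empty tweets
        candidates = set()
        for key in self._keys(minhash_signature(token_set)):
            candidates |= self.buckets[key]
            self.buckets[key].add(doc_id)
        return candidates


index = MinHashLSH()
index.add(0, "breaking: big earthquake hits city centre")
index.add(1, "lovely sunny morning for a run")
print(index.add(2, "breaking: big earthquake hits the city centre"))
```

Each tweet is processed once and only the bucket table is kept in memory, so the candidate sets can be fed to whatever clustering step you prefer. Fewer rows per band makes the index more permissive, more rows makes it stricter.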

answered Nov 22 '13 at 11:10

Ben Gimpert

Such algorithms do exist; for example, check out BIRCH:

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database. In addition, BIRCH is recognized as the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively".

There are several implementations, e.g. this one in Java: JBIRCH. If you Google for "birch [your preferred language]", you'll find others.
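To give a feel for the single-scan, incremental behaviour described above, here is a rough sketch of the clustering-feature (CF) bookkeeping that BIRCH is built on. It is deliberately simplified: it keeps a flat list of CFs instead of a CF-tree, and the names `ClusteringFeature` and `cluster_stream`, as well as the radius threshold value, are my own assumptions for illustration. It also assumes the tweets have already been turned into numeric vectors.

```python
import math


class ClusteringFeature:
    """BIRCH-style cluster summary: point count, linear sum, sum of squared norms."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius_if_added(self, point):
        """RMS distance to the centroid that the cluster would have after absorbing point."""
        n = self.n + 1
        ls = [s + x for s, x in zip(self.ls, point)]
        ss = self.ss + sum(x * x for x in point)
        # R^2 = SS/N - ||LS/N||^2; clamp tiny negatives caused by rounding
        r2 = ss / n - sum((s / n) ** 2 for s in ls)
        return math.sqrt(max(r2, 0.0))

    def add(self, point):
        """Absorb a point by updating the three summary statistics in place."""
        self.n += 1
        self.ls = [s + x for s, x in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)


def cluster_stream(points, threshold):
    """Single pass: absorb each point into the nearest CF if the radius stays small."""
    clusters = []
    for p in points:
        best = min(clusters, key=lambda c: math.dist(c.centroid(), p), default=None)
        if best is not None and best.radius_if_added(p) <= threshold:
            best.add(p)
        else:
            clusters.append(ClusteringFeature(p))  # start a new cluster
    return clusters


stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
for cf in cluster_stream(stream, threshold=0.5):
    print(cf.n, cf.centroid())
```

Because each cluster is summarised by three numbers that can be updated in constant time per dimension, the raw points never need to be stored. A full BIRCH implementation additionally organises these summaries in a height-balanced tree and usually runs a global clustering step over the leaf entries.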

answered Oct 30 '13 at 00:11

Max

A survey of data stream clustering was just published in ACM Computing Surveys (Oct 2013). Although it is not exactly what the OP asked for, it may nevertheless be relevant to other readers interested in this question.

http://dx.doi.org/10.1145/2522968.2522981

answered Nov 20 '13 at 17:07

JWainer


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.