|
I maintain a java package for evaluating distributional similarity and performing Word Sense Induction. As part of this package, I'd like a really fast implementation of Agglomerative Clustering, particularly with the Average Link Criteria. We currently have a naive implementation that runs in O(n^3) time, but there's plenty of ways to do it in O(N^2 log N) time, which i've recently added. However i've read that an O(N^2) algorithms exit but I haven't seen it explained clearly anywhere. I've also poked around other implementations like R's and the one in Cluto Clustering. The R implementation seems to run at about the same speed as my O(N^2 log N) method but Cluto blows both out of the water. As an example, with about 3000 points, R takes about a minute to cluster whereas Cluto can do this in about 10 seconds. Does anyone have any idea how Cluto can do this so quickly? Are they cheating and not really doing pure agglomerative clustering? I've read the papers related to Cluto but they are reasonably vague about how they actually achieve their performance speeds. Thanks! |
|
After doing a lot of literature searching and scanning through other open source implementations, I found an awesome O(N^2) implementation of Agglomerative Clustering. It's reasonably well summarized in Algorithms for hierarchical clustering: an overview. Oddly no one seems to really cite or use this implementation, even though research papers occasionally refer to an older version of the paper written in 1984 by one of the same authors. The trick behind doing things so quickly is building nearest neighbor chains and immediately merging nodes that are labeled as reciprocal nearest neighbors (RNN)s, which are just two nodes that are both nearest neighbors of each other. The other trick comes in how you evaluate the similarity/distance between two clusters. Rather than iterating through points in two clusters, you can do a nifty summation in constant time. After a bit of debugging and testing, I wrote this up in my java package here. I also wrote up a simpler to read and standalone version in scala as a gist here. So if anyone's curious about how to do this in super fast time, this will hopefully make things clearer. Hi Keith, It seems your links don't work..Could you please update them as I'm very interested in the code you created... Thanks, Yantra.
(Aug 06 '13 at 04:27)
Yantra Shikva
|
Bringing runtime down from 60 to 10 seconds sounds like something which should be doable with engineering effort (leveraging the system cache, using lower-level code, a faster similarity function computation, etc). Are you sure this is not it?
Regardless, a log factor in these cases is pretty much constant, so it shouldn't accout by itself for this big difference in performance.
I'm not familiar with cluto, so I'll refrain from commenting further
Engineering optimizations are one possibility. There's also the expected difference between using Java (my code) and C (Cluto's language).
Cluto is also doing some optimized evaluations of the Average Link criteria function. While most implementations do the cross product between two clusters and average together all the similarity scores, Cluto simply uses the dot product between summation vectors. However, my code is also making this same optimization based on their papers.
I suppose I should evaluate the rate of growth in computation time for Cluto.