Basically I'm trying to perform unsupervised document clustering where document set size is ~1 million documents.. I've defined a distance metric which is weighted combination of jaccard distance between strings and number of other features(like number of paragraphs in docs etc.) Problem is I don't whether the distance matrix is euclidean..I probably can verify this in O(n^3) operations but this gets expensive...

General question : Is this a good approach to do this kind of clustering?

asked Aug 06 '13 at 02:28

Yantra%20Shikva's gravatar image

Yantra Shikva
1222


One Answer:

No, the distance matrix doesn't need to be euclidean, but if you cannot verify basic properties on it (such as the triangular inequality), your DBSCAN implementation will probably switch to a mode in which it takes much longer to run (how much longer is probably implementation dependent, but my guess would be that you will be switching for an, in the best case, O(n log n) algorithm to an, in the best case, O(n^2) algorithm.

The best algorithms avoid computing the full distance matrix, and build a tree on the samples.

answered Aug 06 '13 at 07:08

Gael%20Varoquaux's gravatar image

Gael Varoquaux
92141426

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.