|
Basically I'm trying to perform unsupervised document clustering where document set size is ~1 million documents.. I've defined a distance metric which is weighted combination of jaccard distance between strings and number of other features(like number of paragraphs in docs etc.) Problem is I don't whether the distance matrix is euclidean..I probably can verify this in O(n^3) operations but this gets expensive... General question : Is this a good approach to do this kind of clustering? |
|
No, the distance matrix doesn't need to be euclidean, but if you cannot verify basic properties on it (such as the triangular inequality), your DBSCAN implementation will probably switch to a mode in which it takes much longer to run (how much longer is probably implementation dependent, but my guess would be that you will be switching for an, in the best case, O(n log n) algorithm to an, in the best case, O(n^2) algorithm. The best algorithms avoid computing the full distance matrix, and build a tree on the samples. |