I'm analyzing a set of protein sequences, think words from an alphabet with 20 letters, and I'm attempting to use Neighbor-Joining (NJ) as my clustering technique. I'm using NJ because it works on a pair-wise distance matrix directly, which is great because I'm only able to calculate a (dis)similarity matrix on my data. However, the output is a rooted binary tree and I'm having trouble on how to interpret the tree. Most of the literature on NJ is from phylogenetics, which isn't applicable because my data isn't from multiple sequence alignment, but the NJ technique is perfect due to lack of a linear space for my data. I'm use ape (an R library) if it matters. To get actual clusters, do I start at the root node and branch until I have N sets and those are my clusters? Does the edge length factor in? Finally, I feel like building the tree and traversing the tree is redundant; is it possible to interrupt the actual NJ algorithm once it has joined the data into N clusters? I'm pretty new to machine learning, so please correct me on my semantics/vocabulary ;)
asked Jun 28 '11 at 01:47
I know nothing about NJ, but there are plenty of algorithms you can use.
You can use any graph based clustering method for your data, although you may need to 'roll your own' when it comes to finding implementations. You could also use something like DBScan or DCBOR, density based clustering algorithms, which perform perfectly well without a vector based dataset. In fact, these algorithms often ignore (except in cases of optimisation) the original data and just use the derived distance matrix. The only criteria is to make sure your metric is a true metric or you may get some weird results (although this is more a theoretical consideration that is not always applicable in real applications).
edit: It was pointed out that the metric being used is not a true metric. In that case, try something like Fuzzy c-means MST clustering (FMC - reference, paper free online). The basic idea, which can be expanded quite easily if you wish, is to calculate the MST of the graph, and then determine a threshold. In FMC, this weight is determined using the fuzzy c-means aglorithm with k=2, and all edges in the "higher" cluster are removed. All edges in the MST with a weight greater than* this weight are removed, and the resulting partitions are clusters. NetworkX is a python library which can calculate the minimal spanning tree for you, but a quick google search showed many links to do so in R.
* Less than if using a similarity measure.