|
I have distance matrices comparing pairs of documents (authorship attribution), and there is no dataset as such (no feature values, etc), just the distances between points. I'm wondering if there is an implementation of a classification algorithm (preferably in python but I can work with anything) that takes just this matrix as input and builds a model for classifying new instances, also using a distance matrix. I'd prefer something recent, SVM level, etc. So far, all of the machine learning libraries I have used (weka, orange) take a dataset as input and not a distance matrix. Distance was calculated using a metric (triangle equality and everything) and non-negative. |
|
you can use multidimensional scaling to produce a low-dimensional representation of your documents, for which Euclidean distance approximate the original distance. And then to train your classifier by some your preferred classification algorithm. If new instances are available in advance, there are no problem. Otherwise, you will encounter the out-of-sample problem. For this problem, you can refer Trosset and Priebe (2008).
This answer is marked "community wiki".
|
|
Graph-based and kernel-based algorithms can surely be an option as Alexandre suggests. However, using random projections, you can also actually extract explicit features given a kernel matrix. See this Balcan & Blum Kernels as Features paper that describes this idea. Once you get the features, you can just train any standard classifier using them. This is in fact also somewhat similar in spirit to the idea of using MDS on pairwise distances to compute a low dimensional representation, as Shuo suggests. |
|
You should be able to use MDS to place your documents into some subspace. Once they are in this subspace, any classification algorithm should be able to do the rest. Alternatively, you may be able to use spectral clustering to place your documents into a latent space. This approach is more semi-supervised, but should yield fairly accurate predictions. |
|
You can use any graph-based algorithm (similar to the semi-supervised learning setting), any kernel-based algorithm (libsvm for example allows you to define your own kernel, and something like K(x,y) = max_dist - dist(x,y), or exp(-dist(x,y)) should do as a kernel), and this includes gaussian processes. To classify a new example in the kernel-based setting all you need is its distance to the support vectors. 1
Thanks. I can only give one 'accepted answer' (I did try), so Shuo gets it, but it was a tie.
(Sep 30 '10 at 18:51)
Robert Layton
|
I should note that I have implemented KNN. I'm looking for something....better?