I have distance matrices comparing pairs of documents (authorship attribution), and there is no dataset as such (no feature values, etc), just the distances between points.

I'm wondering if there is an implementation of a classification algorithm (preferably in python but I can work with anything) that takes just this matrix as input and builds a model for classifying new instances, also using a distance matrix. I'd prefer something recent, SVM level, etc. So far, all of the machine learning libraries I have used (weka, orange) take a dataset as input and not a distance matrix. Distance was calculated using a metric (triangle equality and everything) and non-negative.

asked Sep 29 '10 at 02:36

Robert%20Layton's gravatar image

Robert Layton
1520102337

I should note that I have implemented KNN. I'm looking for something....better?

(Sep 29 '10 at 02:37) Robert Layton

4 Answers:

you can use multidimensional scaling to produce a low-dimensional representation of your documents, for which Euclidean distance approximate the original distance. And then to train your classifier by some your preferred classification algorithm.

If new instances are available in advance, there are no problem. Otherwise, you will encounter the out-of-sample problem. For this problem, you can refer Trosset and Priebe (2008).

This answer is marked "community wiki".

answered Sep 29 '10 at 03:37

Shuo%20Xu's gravatar image

Shuo Xu
1136

edited Sep 29 '10 at 03:37

Graph-based and kernel-based algorithms can surely be an option as Alexandre suggests. However, using random projections, you can also actually extract explicit features given a kernel matrix. See this Balcan & Blum Kernels as Features paper that describes this idea. Once you get the features, you can just train any standard classifier using them. This is in fact also somewhat similar in spirit to the idea of using MDS on pairwise distances to compute a low dimensional representation, as Shuo suggests.

answered Sep 29 '10 at 18:53

spinxl39's gravatar image

spinxl39
3458104368

You should be able to use MDS to place your documents into some subspace. Once they are in this subspace, any classification algorithm should be able to do the rest.

Alternatively, you may be able to use spectral clustering to place your documents into a latent space. This approach is more semi-supervised, but should yield fairly accurate predictions.

answered Sep 29 '10 at 19:49

zaxtax's gravatar image

zaxtax
75191934

You can use any graph-based algorithm (similar to the semi-supervised learning setting), any kernel-based algorithm (libsvm for example allows you to define your own kernel, and something like K(x,y) = max_dist - dist(x,y), or exp(-dist(x,y)) should do as a kernel), and this includes gaussian processes. To classify a new example in the kernel-based setting all you need is its distance to the support vectors.

answered Sep 29 '10 at 05:45

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1893744214333

1

Thanks. I can only give one 'accepted answer' (I did try), so Shuo gets it, but it was a tie.

(Sep 30 '10 at 18:51) Robert Layton
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.