The Bayes point machine was proposed by Herbrich et al., and on reddit jigsus asked a question about what it is. What is it? What are the main concepts? How is it implemented?

asked Jan 11 '12 at 10:25


Alexandre Passos ♦


One Answer:

The Bayes point machine paper spends a lot of time explaining the Bayesian viewpoint on kernel classification before going into detail on the algorithm itself, hence the confusion.

The paper proceeds by first defining a Bayes-optimal classifier; that is, a classifier that predicts for any test point the class that minimizes average error when marginalizing over all possible hypotheses and all possible samplings of the data. Of course, computing this is ridiculously expensive, so they then decide to approximate this Bayes-optimal classifier by finding the hypothesis in a fixed hypothesis space which is closest to this classifier in terms of the predictions it actually makes.
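To make the idea concrete, here is a minimal sketch (not the paper's notation) of the Bayes-optimal prediction when the hypothesis space is finite and we have posterior weights for each hypothesis; the function and argument names are my own:

```python
def bayes_optimal_predict(x, hypotheses, posterior_weights):
    """Predict the label that minimizes expected 0/1 error under the posterior.

    hypotheses: list of callables h(x) -> label in {-1, +1}
    posterior_weights: list of p(h | data), one per hypothesis, summing to 1.
    (Illustrative only; the paper works with a continuous hypothesis space.)
    """
    votes = {}
    for h, p in zip(hypotheses, posterior_weights):
        y = h(x)
        votes[y] = votes.get(y, 0.0) + p
    # The class with the largest posterior-weighted vote minimizes expected error.
    return max(votes, key=votes.get)
```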

This approximation is defined by randomly drawing some test points, evaluating the Bayes-optimal prediction on these points, and searching over all hypotheses for the one which is closest to the Bayes-optimal predictions. This, unfortunately, is still really slow in the general case. So the authors turn to linear classifiers.

A linear classifier is a classifier that, given a feature mapping phi, predicts the class of an instance x by sign(w dot phi(x)), where w is a learned parameter vector.
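In code, the prediction rule is just the following (a minimal sketch, with phi standing in for whatever feature map is being used):

```python
import numpy as np

def linear_predict(w, phi, x):
    """Predict the class of x as sign(<w, phi(x)>)."""
    return np.sign(np.dot(w, phi(x)))
```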

Given a training set with zero training error, they prove that the center of mass of the set of all unit-norm weight vectors w that correctly classify the training set (the version space) is a good approximation to the Bayes-optimal classifier under the metric defined above. They then prove that the SVM can be derived as an approximation to this Bayes point classifier, and that the Bayes point classifier admits a kernel representation.
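In symbols (my paraphrase rather than the paper's exact notation), the Bayes point is the center of mass of the version space V(D) of unit-norm weight vectors consistent with the data:

$$ w_{\mathrm{bp}} = \mathbb{E}\big[\, w \mid w \in V(D) \,\big], \qquad V(D) = \{\, w : \|w\| = 1,\; y_i \langle w, \phi(x_i)\rangle > 0 \ \text{for all } i \,\}. $$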

Estimating this center of mass can still be hard, so they do it by Monte Carlo. A very simple Monte Carlo estimator would be: (1) draw a random vector uniformly from the unit sphere, (2) discard it if it does not correctly classify all training points, (3) once enough vectors have survived, average the survivors. The problem with this method is that in high dimensions the acceptance probability in step (2) can be vanishingly small, so they introduce cleverer sampling techniques. A correct, faster Monte Carlo algorithm is the kernel billiard, in which a weight vector consistent with the training data is given a random momentum and "bounced off" the constraints defined by the training examples enough times that its trajectory approximates a uniform sample from the version space. A much faster way is to use something like the kernel perceptron to find many different vectors consistent with all the training points and average them; these different vectors can be obtained by running the perceptron on different orderings of the data (this approach is, however, not exact). You can also use EP (expectation propagation) and other techniques to approximate the Bayes point. A sketch of the naive sampler and the perceptron-averaging shortcut is given below.
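Here is a minimal sketch of the naive rejection sampler and of the perceptron-averaging shortcut (my own code, not from the paper; it works in the primal with raw feature vectors rather than a kernel, and the function names and default parameters are made up):

```python
import numpy as np

def naive_bayes_point(X, y, n_accept=100, max_tries=1_000_000, rng=None):
    """Estimate the Bayes point by rejection sampling on the unit sphere.

    X: (n, d) array of feature vectors, y: labels in {-1, +1}.
    Draw random unit vectors, keep only those that classify every training
    point correctly, and average the survivors. As noted above, this is
    impractical in high dimensions because almost every draw is rejected.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    accepted = []
    for _ in range(max_tries):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)              # uniform direction on the unit sphere
        if np.all(y * (X @ w) > 0):         # consistent with the whole training set
            accepted.append(w)
            if len(accepted) == n_accept:
                break
    if not accepted:
        raise RuntimeError("no consistent vector found (data may not be separable)")
    w_bp = np.mean(accepted, axis=0)
    return w_bp / np.linalg.norm(w_bp)


def perceptron_average_bayes_point(X, y, n_runs=20, n_epochs=10, rng=None):
    """Cheaper (but not exact) approximation: average perceptron solutions
    obtained from different random orderings of the training data."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    ws = []
    for _ in range(n_runs):
        w = np.zeros(d)
        order = rng.permutation(n)
        for _ in range(n_epochs):
            for i in order:
                if y[i] * (X[i] @ w) <= 0:  # mistake: apply the perceptron update
                    w += y[i] * X[i]
        ws.append(w / np.linalg.norm(w))
    w_bp = np.mean(ws, axis=0)
    return w_bp / np.linalg.norm(w_bp)
```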

Finally, they show how this can be extended to allow for nonzero training error, by using a modified kernel that has access to the identity of each training point and can "spend mass" on these identity features instead of being forced to classify every example correctly.
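If I recall the construction correctly (this is my paraphrase, with lambda an assumed regularization parameter), the modified kernel simply gives every training point a private identity feature:

$$ \tilde{k}(x_i, x_j) = k(x_i, x_j) + \lambda\,[i = j], $$

so the classifier can put weight on a point's identity feature rather than fitting it through the shared features.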

answered Jan 11 '12 at 10:26


Alexandre Passos ♦

You just answered your own question one minute after posting it. Is this your new way of keeping notes for yourself? :-)

(Jan 11 '12 at 11:14) Kevin Canini

Someone asked this on reddit and after writing this whole answer down I realized I might want to keep it, so I asked-and-answered here.

(Jan 11 '12 at 11:18) Alexandre Passos ♦