I have a question about turning SVMs into multi-class classifiers using one-vs-all training. I had the impression that this method was used in practice, but I don't understand why it should work. As far as I understand, if one trains multiple one-vs-all classifiers and tries to predict the class, one chooses the class for which the classifier has the highest confidence. To my knowledge, SVMs usually give the margin of the example as the confidence. My question is: why are these margins comparable? And how? Do you just pick the classifier with the highest margin? If the classifiers gave some probabilistic output, the reasoning behind the method would be pretty clear, but without probabilistic confidence, I don't understand it.
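To make the question concrete, here is a minimal sketch of the decision rule being asked about (a hypothetical toy setup, not a full SVM): each binary "classifier" is just a linear scoring function w_k · x + b_k, and prediction takes the argmax of the raw decision values. The weights below are hand-picked for illustration; in practice each w_k would come from training class k against the rest.

```python
import numpy as np

def ova_predict(W, b, X):
    """One-vs-all decision rule: for each row of X, pick the class k
    whose decision value w_k . x + b_k is largest."""
    scores = X @ W.T + b          # shape (n_samples, n_classes)
    return np.argmax(scores, axis=1)

# Three hand-picked linear classifiers for three classes (assumed
# weights, purely for illustration).
W = np.array([[ 1.0,  0.0],   # class 0: large x[0]
              [-1.0,  0.0],   # class 1: small x[0]
              [ 0.0,  1.0]])  # class 2: large x[1]
b = np.zeros(3)

X = np.array([[ 2.0, 0.0],
              [-2.0, 0.0],
              [ 0.0, 3.0]])
print(ova_predict(W, b, X))   # -> [0 1 2]
```

The question is exactly about the `argmax` step: the K decision values come from K independently trained classifiers, so nothing obviously guarantees they live on the same scale.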

asked Jul 06 '11 at 09:49


Andreas Mueller

3 Answers:

Indeed, as Mathieu said, there is no guarantee that this one-vs-all reduction is optimal. If you look at John Langford's page on learning reductions, you will see more than one paper with reductions from multiclass to binary classification that trade off the regret bound against the number of classifiers you're willing to use. This concept of reductions is very powerful and, if you want some theory, I think it is related to what you're looking for.

answered Jul 29 '11 at 16:15


Alexandre Passos ♦


Thanks for the reference. Last week at the CVML summer school I talked to Francis Bach and Christoph Lampert about it. I know now that one-vs-all is not Bayes consistent. But apparently in practice there is no significant difference between the Crammer-Singer multi-class SVM and one-vs-all. According to Christoph Lampert, "no one knows why" ;)

(Jul 30 '11 at 09:49) Andreas Mueller

I had never thought about it but I agree with you that when using one-vs-all to convert a binary classifier to multi-class, nothing really constrains each classifier to output comparable confidences. Intuitively, this can become a problem especially if each class has an imbalanced number of examples. In practice however, the various loss functions, especially the hinge loss, don't really encourage overconfident predictions and hence the predictions of each class are more or less in the same range. In my opinion, even logistic regression is in theory subject to this problem. The log loss optimizes for accurate probability estimates but it doesn't necessarily mean that the estimates of each class are globally optimal for the multi-class problem.
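Mathieu's point about the hinge loss can be illustrated with a quick sketch: once the margin m = y·f(x) exceeds 1, the hinge loss is exactly zero, so there is no pressure pushing scores further out, whereas the log loss keeps decreasing and so keeps rewarding larger scores.

```python
import math

def hinge(m):
    """Hinge loss as a function of the signed margin m = y * f(x):
    max(0, 1 - m). Exactly zero for m >= 1."""
    return max(0.0, 1.0 - m)

def log_loss(m):
    """Logistic (log) loss of the signed margin: log(1 + exp(-m)).
    Strictly positive and decreasing for all m."""
    return math.log(1.0 + math.exp(-m))

for m in [0.5, 1.0, 2.0, 10.0]:
    print(m, hinge(m), log_loss(m))
```

The hinge column is 0 from m = 1 onward, while the log-loss column keeps shrinking; that flatness is one informal reason hinge-trained scores tend to stay in roughly comparable ranges across the K classifiers.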

answered Jul 29 '11 at 14:00


Mathieu Blondel

We have used one-vs-all in practice and it worked really well. For each class we had a threshold which controlled precision/recall. From all classes with scores above the threshold, we chose the one with the highest score. Wild guess: the distances in one-vs-all are comparable, because every classifier used the same set of training data.
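The thresholded scheme described above can be sketched as follows (hypothetical function name; the per-class thresholds are assumed to have been tuned beforehand for the desired precision/recall trade-off):

```python
def ova_predict_with_reject(scores, thresholds):
    """One-vs-all with per-class thresholds: among the classes whose
    score clears its threshold, return the index of the highest-scoring
    one; if none clears, return None ("don't know")."""
    candidates = [(s, k) for k, (s, t) in enumerate(zip(scores, thresholds))
                  if s > t]
    if not candidates:
        return None
    return max(candidates)[1]

# Class 1 clears its threshold and has the highest score:
print(ova_predict_with_reject([0.2, 0.9, 0.4], [0.5, 0.5, 0.5]))  # -> 1
# No class clears its threshold, so the output is "don't know":
print(ova_predict_with_reject([0.2, 0.3, 0.4], [0.5, 0.5, 0.5]))  # -> None
```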

answered Jul 06 '11 at 11:47


Jochen Wersdörfer

Thanks. It seems to me that many are using it in this way. It would be nice to know if anyone has any theoretical insight. You said you used all scores above a threshold, so you also allowed the output "don't know"?

(Jul 06 '11 at 11:51) Andreas Mueller

Yup, if there are no scores above their thresholds, the classification result is 'None'.

(Jul 06 '11 at 13:24) Jochen Wersdörfer

I have a related question. If these models are trained independently, there can be cases where, for some test data, all the classifiers answer FALSE even though the data certainly belongs to one of the predefined categories. What can one do in this case? Simply choose the class that gives the highest confidence or probability? In my case, I used logistic regression with a one-vs-all scheme for multi-class classification; each LR model has a threshold value which controls the TRUE or FALSE prediction. How can I adjust these threshold values systematically to avoid an all-FALSE answer?

(Jul 07 '11 at 02:23) hopexy

I don't see why you are using a threshold. Logistic regression gives you a probability estimate. Taking the one with the highest probability (if all the classes have the same prior probability) gives you the maximum likelihood solution.

(Jul 07 '11 at 03:14) Andreas Mueller

Sorry, I did not read the previous comments carefully before posting my question; they already answer it in some way. There is a non-negligible number of 'None' answers on my test data, which hurts performance badly. In fact, I did simply choose the highest-probability category in this case, but I am not sure whether this kind of probability comparison is consistent across multiple categories, because the models were trained and used independently, just like the SVM case in the original question.

(Jul 07 '11 at 03:54) hopexy

@Andreas: in my experience, if you have high-enough dimensional data, the "probabilities" you get from maxent are very very crappy as probabilities (they are OK for classification).

@hopexy: if you are doing maxent, why not just train a multiclass maxent classifier to begin with? and are you sure you are really losing on precision by allowing a 'don't know' answer?

(Jul 07 '11 at 08:17) yoavg


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.