I have a question about turning SVMs into multiclass classifiers using one-vs-all training. I had the impression that this method was used in practice, but I don't understand why it should work. As far as I understand, one trains multiple one-vs-all classifiers and, at prediction time, chooses the class whose classifier reports the highest confidence. To my knowledge, SVMs usually give the margin of the example as the confidence. My question is: why are these margins comparable, and how? Do you just pick the classifier with the highest margin? If the classifiers gave some probabilistic output, the reasoning behind the method would be pretty clear, but without probabilistic confidences, I don't understand it. asked Jul 06 '11 at 09:49 Andreas Mueller 
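To make the decision rule in question concrete, here is a minimal sketch of one-vs-all prediction. The weight vectors and biases are made-up illustration values; in practice each row would come from training one binary SVM of class k against the rest, and the scores are the signed margin-like values whose comparability the question is about.

```python
import numpy as np

# Hypothetical per-class linear classifiers (one row per class).
# These numbers are invented for illustration only.
W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0]])
b = np.array([0.0, 0.0, -0.5])

def ova_predict(x):
    # Each entry of `scores` is the margin of x under one independently
    # trained binary classifier; one-vs-all simply takes the argmax.
    scores = W @ x + b
    return int(np.argmax(scores))

x = np.array([2.0, 0.5])
print(ova_predict(x))  # scores are [2.0, -2.0, 0.0], so class 0 wins
```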
Indeed, as Mathieu said, there is no guarantee that this one-vs-all reduction is optimal. If you look at John Langford's page on learning reductions, you will see more than one paper with reductions from multiclass to binary classification that trade off the regret bound against how many classifiers you're willing to use. This concept of reductions is very powerful and I think it is related to what you're looking for, if you want some theory. answered Jul 29 '11 at 16:15 Alexandre Passos ♦ 1
Thanks for the reference. Last week at the CVML Summer School I talked to Francis Bach and Christoph Lampert about it. I know now that one-vs-all is not Bayes consistent. But apparently in practice there is no significant difference between the Crammer-Singer multiclass SVM and one-vs-all. According to Christoph Lampert, "no one knows why" ;)
(Jul 30 '11 at 09:49)
Andreas Mueller

I had never thought about it, but I agree with you that when using one-vs-all to convert a binary classifier to multiclass, nothing really constrains each classifier to output comparable confidences. Intuitively, this can become a problem, especially if the classes have imbalanced numbers of examples. In practice, however, the various loss functions, especially the hinge loss, don't really encourage overconfident predictions, and hence the predictions for each class are more or less in the same range. In my opinion, even logistic regression is in theory subject to this problem. The log loss optimizes for accurate probability estimates, but that doesn't necessarily mean the estimates for each class are globally optimal for the multiclass problem. answered Jul 29 '11 at 14:00 Mathieu Blondel 
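The logistic-regression caveat can be illustrated with a tiny numerical sketch (the raw scores are made-up values standing in for outputs of three independently trained one-vs-all models): each sigmoid output may be well calibrated for its own binary problem, yet jointly the per-class "probabilities" need not sum to one, because nothing ties the binary problems together.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores of three independently trained one-vs-all
# logistic regressions evaluated on the same input (invented numbers).
scores = np.array([1.2, 0.4, -0.3])
probs = sigmoid(scores)

# Each entry is a P(class k vs. rest | x) for its own binary problem,
# but the entries are not jointly normalized:
print(probs.sum())  # clearly greater than 1 here
```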
We have used one-vs-all in practice and it worked really well. For each class we had a threshold which controlled precision/recall. From all classes with scores above their thresholds, we chose the one with the highest score. Wild guess: the distances in one-vs-all are comparable because every classifier used the same set of training data. answered Jul 06 '11 at 11:47 Jochen Wersdörfer 
Thanks. It seems to me that many are using it in this way. It would be nice to know if anyone has any theoretical insight. You said you used all scores above a threshold, so you also allowed the output "don't know"?
(Jul 06 '11 at 11:51)
Andreas Mueller
Yup, if there are no scores above their thresholds, the classification result is 'None'.
(Jul 06 '11 at 13:24)
Jochen Wersdörfer
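The thresholded scheme described above can be sketched in a few lines. The scores and thresholds are made-up illustration values; the point is the decision rule: only classes whose score clears their own threshold compete, and if none does, the classifier abstains.

```python
# Sketch of per-class thresholds with a reject option, as described in
# the answer above. Scores and thresholds are invented for illustration.
def ova_predict_with_reject(scores, thresholds):
    # Keep only classes whose score exceeds that class's threshold.
    above = [k for k, (s, t) in enumerate(zip(scores, thresholds)) if s > t]
    if not above:
        return None  # the "don't know" outcome
    # Among the surviving classes, pick the highest score.
    return max(above, key=lambda k: scores[k])

thresholds = [0.0, 0.5, 0.2]
print(ova_predict_with_reject([0.3, 0.6, -0.1], thresholds))  # -> 1
print(ova_predict_with_reject([-0.2, 0.4, 0.1], thresholds))  # -> None
```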
I have a related question here. If these models are trained independently, there can be cases where all classifiers give FALSE answers on some test data, even though the data certainly belongs to one of the predefined categories. What can one do in this case? Simply choose the class with the highest confidence or probability? In my case I used logistic regression in a one-vs-all scheme for multiclass classification, where each LR model has a threshold value that controls the TRUE/FALSE prediction. How can I adjust these thresholds systematically to avoid an all-FALSE answer?
(Jul 07 '11 at 02:23)
hopexy
I don't see why you are using a threshold. Logistic regression gives you a probability estimate. Taking the one with the highest probability (if all the classes have the same prior probability) gives you the maximum likelihood solution.
(Jul 07 '11 at 03:14)
Andreas Mueller
Sorry, I did not read the previous comments carefully before posting my question; they already answer it in some way. There is a non-negligible number of 'None' answers on my test data, which hurts performance badly. In fact I did simply choose the highest-probability category in this case, but I am not sure whether this kind of probability comparison is consistent across categories, because the models were trained and applied independently, just like in the SVM case from the original question.
(Jul 07 '11 at 03:54)
hopexy
@Andreas: in my experience, if you have high-enough-dimensional data, the "probabilities" you get from maxent are very, very crappy as probabilities (they are OK for classification). @hopexy: if you are doing maxent, why not just train a multiclass maxent classifier to begin with? And are you sure you are really losing precision by allowing a 'don't know' answer?
(Jul 07 '11 at 08:17)
yoavg
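The alternative yoavg suggests, a single multiclass maxent (softmax) model, sidesteps the comparability issue because its outputs are jointly normalized by construction. A sketch with made-up weights:

```python
import numpy as np

# Invented weights for a single 3-class softmax model; unlike the
# one-vs-all setup, all classes share one jointly normalized output.
W = np.array([[ 0.5, -0.2],
              [-0.1,  0.8],
              [ 0.3,  0.1]])
b = np.zeros(3)

def softmax_probs(x):
    z = W @ x + b
    z = z - z.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax_probs(np.array([1.0, 1.0]))
print(p.sum())  # sums to 1 up to floating point, by construction
```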
