During an experiment on text classification, I found that a ridge classifier consistently produced results at the top of my tests, beating classifiers that are more commonly mentioned and applied to text mining tasks, such as SVM, NB, kNN, etc. However, I haven't put much effort into optimizing each classifier for this specific task beyond some simple parameter tweaks. A similar observation was mentioned by Dikran Marsupial on Stats.SE. After reading through a few materials online, I still cannot figure out the main reasons for this. Could anyone provide some insight into this outcome? Disclaimer: this is a duplicate of a question previously asked on Stats.SE.
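For reference, a minimal sketch of the kind of comparison described above, assuming scikit-learn and using 20 Newsgroups as a stand-in corpus (the actual dataset, features, and settings of the experiment are not specified here):

```python
# Hedged sketch: compare a ridge classifier against SVM, NB, and kNN on text,
# assuming scikit-learn and the 20 Newsgroups data as a stand-in corpus.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

classifiers = {
    "ridge": RidgeClassifier(alpha=1.0),
    "svm": LinearSVC(C=1.0),
    "nb": MultinomialNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

for name, clf in classifiers.items():
    # tf-idf features followed by the classifier under test
    pipe = make_pipeline(TfidfVectorizer(sublinear_tf=True), clf)
    scores = cross_val_score(pipe, train.data, train.target, cv=3)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```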
Ridge regression solves almost the same problem as an SVM, just with a slightly different loss function (one that behaves differently on misclassified points), so you should expect it to perform in the same ballpark, sometimes better and sometimes worse. The question is: why shouldn't it?
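For concreteness, this is the comparison the answer seems to have in mind, with labels $y_i \in \{-1, +1\}$ and margin $y_i\, w^\top x_i$; only the loss term differs between regularized least squares and the soft-margin linear SVM:

$$
\min_w \;\sum_i \bigl(1 - y_i\, w^\top x_i\bigr)^2 + \lambda \lVert w \rVert^2
\qquad\text{vs.}\qquad
\min_w \;\sum_i \max\bigl(0,\; 1 - y_i\, w^\top x_i\bigr) + \lambda \lVert w \rVert^2
$$

(Since $y_i^2 = 1$, the squared error $(y_i - w^\top x_i)^2$ can be rewritten in the margin form used on the left.)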
Because the squared loss penalizes overly confident correct predictions just as much as incorrect ones. For example, predicting 3 instead of 1 is penalized just as much as predicting -1 instead of 1, even though 3 would still give the correct prediction and -1 would not.
(Nov 05 '11 at 04:57)
Mathieu Blondel
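A tiny numeric sketch of this point: with true label $y = +1$, the squared loss treats an over-confident but correct score of $+3$ exactly like a wrong score of $-1$, while the hinge loss does not.

```python
# Sketch only: compare squared loss and hinge loss on two predictions
# for a positive example (y = +1).
def squared_loss(y, f):
    return (y - f) ** 2

def hinge_loss(y, f):
    return max(0.0, 1.0 - y * f)

y = 1
for f in (3.0, -1.0):
    print(f"f={f:+.0f}  squared={squared_loss(y, f):.1f}  hinge={hinge_loss(y, f):.1f}")
# f=+3  squared=4.0  hinge=0.0   (correct but over-confident)
# f=-1  squared=4.0  hinge=2.0   (incorrect)
```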
Mathieu, as far as I understood the OP, it was a "ridge regression classifier", which I took to mean logistic regression with an L2 penalty on the weights, NOT something trained with squared loss. Could flake please clarify this point?
(Nov 06 '11 at 01:41)
gdahl ♦
No, a ridge classifier (or regularized least-squares classifier) is simply ridge regression applied to binary classification. The problem with casting binary classification as a regression problem is that you're asking too much of the learner. The learner doesn't need to get the output values exactly right, it just needs to predict the correct class (be on the right side of the separating hyperplane). The reason the OP asked this question is that in practice it doesn't seem to matter that much, as pointed out in some papers like "In Defense of One-Vs-All Classification" (http://jmlr.csail.mit.edu/papers/v5/rifkin04a.html) or "Text Categorization Based on Regularized Linear Classification Methods" (http://www.stat.rutgers.edu/home/tzhang/papers/ir01_textcat.pdf).
(Nov 06 '11 at 04:20)
Mathieu Blondel
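A small sanity check of the "ridge classifier is just ridge regression on ±1 targets" point, assuming scikit-learn: plain Ridge regression on $\{-1, +1\}$ targets followed by thresholding at zero should give the same predictions as RidgeClassifier.

```python
# Sketch: RidgeClassifier vs. plain Ridge regression on +/-1 labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = RidgeClassifier(alpha=1.0).fit(X, y)

# Same model fit as a regression problem on {-1, +1} targets,
# then thresholded at zero to recover class labels.
reg = Ridge(alpha=1.0).fit(X, 2 * y - 1)
pred_from_regression = (reg.predict(X) > 0).astype(int)

# Fraction of agreement; should print 1.0 if the two routes coincide.
print(np.mean(clf.predict(X) == pred_from_regression))
```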
Mathieu: this is true, but it's not the full picture. A lot of learning theory bounds, for example, work better with the square loss than with the log loss because the square loss is bounded (as long as the norms of the parameter vector and examples are bounded, or something like this). See for example http://hunch.net/?p=547&cpage=1 for a discussion of this by John Langford.
(Nov 06 '11 at 08:01)
Alexandre Passos ♦
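One way to make that boundedness claim concrete (a paraphrase, not Langford's exact statement): if $\lVert w \rVert \le B$, $\lVert x \rVert \le R$, and $y \in \{-1, +1\}$, then by Cauchy–Schwarz

$$
(y - w^\top x)^2 \;\le\; \bigl(|y| + |w^\top x|\bigr)^2 \;\le\; (1 + BR)^2 .
$$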