People tend to optimize for one loss even though the task is evaluated on another. Often this substitution doesn't hurt them. For instance, in this paper Altun/Hoffman found that the choice of loss function to optimize didn't matter much for classification error rate in POS tagging. Some loss functions (like F1) are tricky to optimize for, so is the effort really worth it? Hal Daume III wrote that for easy tasks the choice of loss to optimize doesn't matter much, but as tasks get more complicated, the "trained for the wrong loss" performance penalty grows. So, what are some good examples where making the "optimization" loss look more like the evaluation loss gave a big improvement?
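To make the accuracy/F1 mismatch concrete, here is a minimal sketch (illustrative scikit-learn code of my own, not from any of the papers mentioned): on imbalanced data, the decision threshold that looks best under accuracy is generally not the one that looks best under F1, so a model tuned for one loss can score poorly on the other.

```python
# Hypothetical sketch of the loss mismatch in the question: on imbalanced
# data, the threshold that maximizes accuracy (a 0/1-style loss) is
# generally not the threshold that maximizes F1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# ~95% negatives, so accuracy rewards conservative positive predictions.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

for threshold in (0.5, 0.3, 0.1):
    pred = (probs >= threshold).astype(int)
    print(f"t={threshold}: accuracy={accuracy_score(y, pred):.3f} "
          f"F1={f1_score(y, pred):.3f}")
# Accuracy typically peaks near t=0.5 while F1 prefers a lower threshold,
# i.e. optimizing one loss does not optimize the other.
```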
As far as I know, the consensus is that as long as you're not optimizing a bad loss function (i.e., as long as your loss is a proper loss, or a margin loss, and you're using some sort of regularization to avoid overfitting), you will get reasonably good results, but you can almost always improve them by optimizing something closer to the real loss (unless your real loss is 0/1 loss or Hamming loss, in which case you're pretty well off with standard margin losses). Some obvious examples where this matters: asymmetric losses, cost-sensitive classification (where some examples are worth more than others; see the sketch after this comment), learning to rank (where you can get increasing performance by: predicting the ordinal rank variable, optimizing unweighted pairwise losses, optimizing weighted pairwise losses, optimizing listwise losses uncorrelated with the actual objective, optimizing losses upper-bounding or approximating the actual objective), and most structured classification problems (where, for example, a slightly different likelihood function for a well-known unsupervised model can beat the state of the art, as in this paper). In the Altun/Hoffman paper above, choosing the optimization loss to be more like the evaluation loss actually made things worse more than half the time.
(Jul 30 '10 at 19:04)
Yaroslav Bulatov
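As one concrete instance of the cost-sensitive case mentioned above, here is a small sketch (my own illustration, assuming scikit-learn and a made-up 10:1 cost ratio): reweighting training examples pulls the optimization loss toward the asymmetric evaluation loss.

```python
# Hypothetical cost-sensitive sketch: the evaluation loss charges 10x more
# for false negatives than for false positives, so weighting positive
# training examples by 10 brings the optimized loss closer to the
# evaluated one.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

def asymmetric_cost(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return fn_cost * fn + fp_cost * fp

plain = LogisticRegression().fit(Xtr, ytr)
weighted = LogisticRegression().fit(
    Xtr, ytr, sample_weight=np.where(ytr == 1, 10.0, 1.0))

print("plain   :", asymmetric_cost(yte, plain.predict(Xte)))
print("weighted:", asymmetric_cost(yte, weighted.predict(Xte)))
# The reweighted model usually incurs lower cost under this evaluation loss.
```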
In that paper it seems that they only really optimize surrogate losses for Hamming loss and 0/1 loss, and both of these losses have (I think) theorems proving that convex surrogates approximate them really well (but you could ask about this in another question if you're interested; there are a few theorists here who could answer it very well).
(Jul 30 '10 at 19:32)
Alexandre Passos ♦
Also, in a lot of different situations it pays to have the test environment as close as possible to the training environment. This paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.6301&rep=rep1&type=pdf and this paper http://arxiv.org/pdf/0907.0809 have data showing that using slightly different approximations at training and test time can hurt performance. This doesn't relate directly to your question, but I hope an analogy can be made with the loss functions used for optimizing and for testing.
(Jul 30 '10 at 19:48)
Alexandre Passos ♦
The confounding issue here is that the wrong loss during training acts as a kind of regularizer, so it makes you less likely to overfit.
(Jul 30 '10 at 20:02)
Yaroslav Bulatov
I'm not sure I buy that; overfitting to the wrong loss doesn't really seem better than overfitting to the right loss.
(Jul 30 '10 at 20:16)
Alexandre Passos ♦
Take AdaBoost: it minimizes exponential loss, yet it is able to keep reducing 0-1 loss on the test set even after thousands of trees are added as components. If you greedily minimized 0-1 loss, surely your test set error would start to increase much sooner.
(Jul 30 '10 at 21:22)
Yaroslav Bulatov
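A quick sketch of the AdaBoost behavior described in this comment (my own illustration with scikit-learn on synthetic data; the exact curve depends on the dataset): test 0-1 loss can keep falling over hundreds of boosting rounds even though the algorithm is minimizing exponential loss.

```python
# Sketch of the observation above: AdaBoost minimizes exponential loss,
# yet its test 0-1 loss often keeps improving over many boosting rounds.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_informative=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# The default base learner is a depth-1 decision stump.
ada = AdaBoostClassifier(n_estimators=500, random_state=0).fit(Xtr, ytr)

# staged_score yields test accuracy after each boosting round.
for rounds, acc in enumerate(ada.staged_score(Xte, yte), start=1):
    if rounds % 100 == 0:
        print(f"{rounds} rounds: test 0-1 loss = {1 - acc:.3f}")
```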
Exponential loss is a surrogate loss in the sense that minimizing it guarantees good generalization performance on 0/1 loss. It's not a case of "the wrong loss to minimize". That would be minimizing 0/1 when you're actually evaluated on recall, or something like that.
(Jul 30 '10 at 21:29)
Alexandre Passos ♦
Don't know what "wrong loss" means; here I'm simply talking about instances of minimizing a loss that's different from the evaluation loss.
(Jul 30 '10 at 21:51)
Yaroslav Bulatov
Ah, I see. There are roughly two cases: minimizing a surrogate loss (like exponential for 0-1, logistic for 0-1, CRF likelihood for Hamming, etc.) and minimizing a "wrong" loss (0-1 for precision/recall, word error rate for BLEU, etc.). They behave very differently. There is a question by me on surrogate losses that you can search for more info.
(Jul 30 '10 at 22:15)
Alexandre Passos ♦
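The surrogate-loss side of this distinction can be checked in a few lines (a standard textbook fact, not specific to this thread): exponential, logistic (base 2), and hinge losses are convex upper bounds on 0-1 loss as a function of the margin, which is what makes minimizing them a reasonable proxy for minimizing 0-1 error.

```python
# The surrogate losses named above, as functions of the margin m = y * f(x):
# each is a convex upper bound on the 0-1 loss, which is why driving the
# surrogate down also drives down 0-1 error.
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)
zero_one = (margins <= 0).astype(float)
exponential = np.exp(-margins)                # AdaBoost's loss
logistic = np.log2(1.0 + np.exp(-margins))    # base 2 so it passes through (0, 1)
hinge = np.maximum(0.0, 1.0 - margins)        # SVM's loss

for m, z, e, l, h in zip(margins, zero_one, exponential, logistic, hinge):
    assert min(e, l, h) >= z  # every surrogate upper-bounds 0-1 loss
    print(f"m={m:+.1f}  0/1={z:.0f}  exp={e:.2f}  log={l:.2f}  hinge={h:.2f}")
```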
BTW, CRF likelihood is more like a surrogate for 0-1 loss (i.e., 1 if you get the whole labeling correct, 0 otherwise); for Hamming loss, -2*(sum of marginal log-likelihoods) forms an upper bound (Kakade's "An Alternate Objective Function for Markovian Fields" gives details).
(Jul 31 '10 at 04:28)
Yaroslav Bulatov
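That bound can be sanity-checked numerically (my own check, stated for binary marginal decoding; see the Kakade et al. paper for the general statement): whenever a position is decoded wrongly, the true label's marginal p_i is at most 1/2, so -2*ln(p_i) >= 2*ln(2) > 1, and summing over positions gives Hamming loss <= -2*(sum of marginal log-likelihoods).

```python
# Numeric check of the Hamming-loss bound above, for binary marginal
# decoding: a position is in error only when the true label's marginal
# p_i <= 1/2, and there -2*ln(p_i) >= 2*ln(2) > 1, so each per-position
# error is dominated by -2*ln(p_i) and the bound follows by summing.
import numpy as np

rng = np.random.default_rng(0)
marginals = rng.uniform(0.01, 0.99, size=1000)  # p(true label) per position

hamming = np.sum(marginals <= 0.5)          # errors under marginal decoding
bound = -2.0 * np.sum(np.log(marginals))    # -2 * sum of marginal log-likelihoods
assert hamming <= bound
print(f"Hamming loss = {hamming}, bound = {bound:.1f}")
```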
An interesting paper on the issue is D. Skalak, A. Niculescu-Mizil, and R. Caruana, "Classifier Loss under Metric Uncertainty", ECML 2007. It studies a number of commonly (and uncommonly) used performance metrics and shows that, at least for the data sets they study, certain loss metrics are better than others for achieving "generalizable" performance gains. Hope this helps.