Heuristically, I think that if the loss being optimized by a classifier has a margin concept, then you can apply the generalization bounds for SVMs. If it has a bounded value (as in the ramp loss paper) and/or increases slowly (as in the Huber loss), it is less sensitive to outliers, but might be harder to optimize (no convexity, small or zero gradients). For some reason I've never seen hinge-loss neural networks, but log-loss supposedly works better for multilayer perceptrons than squared loss.

In general, if I'm training a classifier and will minimize "loss(data) + regularizer(feature vector)", I'm fairly sure how to choose the regularizer (the squared norm if I want a dense representation and the l1 norm if I want a sparse one usually work), but if for some odd reason I come up with a weird model and want to train it, I would have trouble choosing which loss function to use.

Is there a principled way of choosing a loss based on some intuitive properties of the problem, or is it really best to just try everything? And why isn't the SVM loss used more, if it works so well for SVMs and has the margin bounds?
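
To make these shapes concrete, here is a minimal sketch in plain NumPy of the losses I have in mind, written as functions of the margin m = y*f(x) for labels y in {-1, +1} (the particular huberised and ramp definitions below are just one common choice each, not taken from any specific paper):

    import numpy as np

    # Classification losses as functions of the margin m = y * f(x).
    def hinge(m):
        return np.maximum(0.0, 1.0 - m)                   # convex, has a margin: exactly zero once m >= 1

    def log_loss(m):
        return np.log1p(np.exp(-m))                       # convex, smooth, never exactly zero

    def squared(m):
        return (1.0 - m) ** 2                             # also penalises being "too correct" (m > 1)

    def ramp(m):
        return np.minimum(1.0, np.maximum(0.0, 1.0 - m))  # bounded (truncated hinge), non-convex

    def modified_huber(m):
        # one "huberised" classification loss: quadratic near the margin, linear for m < -1
        return np.where(m >= -1.0, np.maximum(0.0, 1.0 - m) ** 2, -4.0 * m)

    margins = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0])
    for name, loss in [("hinge", hinge), ("log", log_loss), ("squared", squared),
                       ("ramp", ramp), ("modified Huber", modified_huber)]:
        print(f"{name:>15}: {np.round(loss(margins), 2)}")

The bounded ramp loss stops growing for badly misclassified points, which is where the outlier robustness comes from, at the price of losing convexity.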

asked Jul 03 '10 at 11:26


Alexandre Passos ♦


3 Answers:

Surrogate losses, as their name suggests, are losses you minimise instead of the loss you would really like to minimise. As you mention, the hinge loss (i.e., the one used in SVMs) is a classic example of a surrogate for the 0-1 loss. The reason the hinge loss is commonly used as a surrogate is that minimising the 0-1 loss directly is a combinatorial problem that is computationally expensive.
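
As a quick illustration (just a NumPy sketch), the surrogate relationship means the hinge loss is a convex upper bound on the 0-1 loss in the margin m = y*f(x):

    import numpy as np

    # The 0-1 loss is piecewise constant (zero gradient almost everywhere), which is
    # what makes it hard to minimise directly; the hinge loss is convex and dominates it.
    m = np.linspace(-5.0, 5.0, 1001)       # margins m = y * f(x)
    zero_one = (m <= 0).astype(float)
    hinge = np.maximum(0.0, 1.0 - m)
    assert np.all(hinge >= zero_one)       # the surrogate upper-bounds the true loss everywhere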

When you use log or square loss as a surrogate for a classification problem, what you are effectively doing is class probability estimation. You can then threshold these estimates to classify. However, when you use hinge loss it is not possible to get consistent class probability estimates (the hinge loss is not "proper"). The values from a predictor minimising hinge loss will tend to be extreme (i.e., tend to ±∞).
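
As a small illustration of that difference (using scikit-learn purely for convenience), compare the outputs of a log-loss minimiser and a hinge-loss minimiser trained on the same data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    logreg = LogisticRegression().fit(X, y)   # minimises log-loss -> class probability estimates
    svm = LinearSVC().fit(X, y)               # minimises a (squared) hinge loss -> raw scores

    print(logreg.predict_proba(X[:3])[:, 1])  # values in [0, 1], usable as estimates of P(y=1|x)
    print(svm.decision_function(X[:3]))       # unbounded margin values: threshold at 0 to classify

The logistic model's outputs can be thresholded at 0.5 and read as probabilities; the SVM's decision values should only be thresholded at 0 and not interpreted as probabilities.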

I'm not too familiar with the nitty-gritty of neural nets, but I suspect that they don't use hinge losses because of this extremal behaviour. Perhaps it is more useful for a node in a network to encode probability estimates instead of classifications, as the former can be combined in multilayer networks in more interesting ways. If all the nodes output classifications, the neural net is effectively a Boolean circuit, which I believe is less expressive.

John Langford has a post asking about "optimal" surrogate losses (he calls them proxy losses). I have a forthcoming JMLR paper ("Composite Binary Losses") that, amongst other things, attempts to answer this question (or at least formalise it) for binary problems. John also notes in a later post, discussing an ALT 2009 paper, that, in some sense, it doesn't really matter which surrogate you use, as they all do effectively the same thing.

answered Jul 12 '10 at 22:16


Mark Reid

Thanks for the answer.

(Jul 13 '10 at 10:11) Alexandre Passos ♦

Is there a preprint of the JMLR paper you mention? I didn't find it on your website.

(Jul 13 '10 at 10:20) Alexandre Passos ♦

There is a preprint on arXiv: http://arxiv.org/abs/0912.3301 - I should probably link to it from my site.

(Jul 13 '10 at 20:34) Mark Reid

Also quite mathy and so more on the theoretical side, Ingo Steinwart's "How to Compare Different Loss Functions and Their Risks" and especially his corresponding book chapter are appealing too. From the abstract:

"The goal of this chapter is to systematically develop a theory that makes it possible to identify suitable surrogate losses for general learning problems."

Steinwart deals there with general ERM theory and practice beyond classification. He even studies such an esoteric issue as finding a supervised surrogate loss for unsupervised novelty detection (casting the problem as density level detection and generating artificial data). Most importantly, he provides consistent definitions of what a good surrogate is.

answered Jul 13 '10 at 03:54


Santi Villalba

edited Jul 13 '10 at 04:25

There is an interesting discussion of what makes a good vs. a bad loss function in the Tutorial on Energy-Based Learning by Yann LeCun et al., and they agree that having a margin is an important feature of a good loss function.

Edit: I forgot about the following ICML 2009 paper by Mark Reid: "Surrogate Regret Bounds for Proper Losses", although it's too theoretical for me to derive practical intuitions from it. And by the way, thanks for the reference to "Trading Convexity for Scalability"; it looks very interesting.

answered Jul 03 '10 at 11:46


ogrisel

edited Jul 03 '10 at 12:07

Hm, I had forgotten that LeCun's energy-based tutorial went this far. In a previous draft of the question I mentioned its remark that square loss by itself is useless in a structured setting. I'll read the Reid paper and see if it helps. Thanks.

(Jul 03 '10 at 12:04) Alexandre Passos ♦