What is the relationship between large-margin classification and regularization? The standard framework for learning these days seems to be regularized empirical risk minimization, where we can plug in "arbitrary" loss functions and regularizers. However, I am a bit confused about the relationship between margin-based loss functions, e.g., the hinge loss, and regularization. Isn't the idea of both simply to limit the complexity of the function class? Why do we then use both? To get more fine-grained control over the structural risk? Is there some relation to the use of slack in the SVM lurking in there?

Edit: I realize now that the regularized empirical risk with the hinge loss and an L2 regularizer is just the unconstrained version of the standard SVM problem (minimize the norm of the weight vector subject to the margin constraints).
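For reference, here is the equivalence from the edit written out (standard soft-margin SVM without a bias term; the two problems coincide up to an overall scaling, with C = 1/λ):

```latex
% Regularized ERM with the hinge loss (unconstrained form)
\min_{\mathbf{w}} \;\; \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^2
  \;+\; \sum_{i=1}^{n} \max\!\bigl(0,\; 1 - y_i\,\mathbf{w}^{\top}\mathbf{x}_i\bigr)

% Standard soft-margin SVM (constrained form with slack variables \xi_i)
\min_{\mathbf{w},\,\boldsymbol{\xi}} \;\; \frac{1}{2}\,\lVert \mathbf{w} \rVert^2
  \;+\; C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,\mathbf{w}^{\top}\mathbf{x}_i \ge 1 - \xi_i,
\;\; \xi_i \ge 0

% At the optimum, \xi_i = \max(0,\, 1 - y_i \mathbf{w}^{\top}\mathbf{x}_i),
% which recovers the unconstrained hinge-loss objective with C = 1/\lambda.
```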

Also, if I am not mistaken, unless you use a margin-based loss or the log loss, the regularization term becomes vacuous*, because you can then uniformly rescale the parameters without changing the predictions (illustrated in the sketch below). An alternative is then to use averaging, which also corresponds to a large-margin solution (cf. Freund & Schapire, 1999). But is anything lost by this averaging procedure compared to using a margin-based loss directly?

*) at least in the case of an L2-regularizer, but is this true in general?
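A small numpy sketch of the rescaling point (toy data and variable names are my own): with the 0-1 loss, predictions depend only on the sign of w·x, so shrinking w drives the L2 penalty toward zero without changing a single prediction, whereas the hinge loss does react to the rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                   # toy feature matrix (made up)
w = np.array([1.0, -2.0, 0.5])                 # some weight vector
y = np.sign(X @ w)                             # labels consistent with w

def zero_one_risk(v):
    """Empirical 0-1 risk: depends only on the sign of X @ v."""
    return np.mean(np.sign(X @ v) != y)

def hinge_risk(v):
    """Empirical hinge risk: sensitive to the scale of v."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ v)))

for c in (1.0, 0.1, 0.01):
    print(c, zero_one_risk(c * w), hinge_risk(c * w), c**2 * (w @ w))

# The 0-1 risk (and every prediction) is identical for every c > 0, while
# the L2 penalty ||c*w||^2 goes to zero as c shrinks, so with the 0-1 loss
# the regularizer constrains nothing.  The hinge risk, by contrast, grows
# as c shrinks, so the loss/penalty trade-off is meaningful.
```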

asked Dec 17 '12 at 06:07

Oscar Täckström

edited Dec 18 '12 at 08:44


One Answer:

The relationship between the hinge loss and the regularizer is simple. Note that if the data are linearly separable, you can bring the hinge loss to zero by scaling up the norm of any separating weight vector. The regularizer prevents this from happening, effectively forcing the largest-soft-margin classifier to be chosen.
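To make this concrete, here is a minimal numpy sketch (toy separable data and names of my own choosing): scaling up a separating w drives the hinge loss to zero while the norm blows up, and the regularized objective is what rules this out.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
w_sep = np.array([2.0, -1.0])                  # a separating direction (made up)
y = np.sign(X @ w_sep)                         # linearly separable by construction

def hinge(w):
    return np.sum(np.maximum(0.0, 1.0 - y * (X @ w)))

lam = 0.1                                      # arbitrary regularization strength

for c in (1.0, 10.0, 100.0):
    w = c * w_sep
    print(f"scale {c:6.1f}   hinge {hinge(w):8.3f}   ||w||^2 {w @ w:10.1f}   "
          f"hinge + lam*||w||^2 {hinge(w) + lam * (w @ w):10.1f}")

# Scaling up a separating w drives the hinge loss to zero (once every
# functional margin y_i * w.x_i exceeds 1), while ||w||^2 blows up.  The
# regularized objective is instead minimized by the smallest-norm w that
# keeps the margins at 1, i.e. the maximum-(soft-)margin classifier.
```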

There are other justifications for regularizers, for example ones related to stability in online learning: following the leader has high regret, while following the regularized leader has low regret because it is more stable.
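A toy illustration of that point, as a numpy sketch: linear losses f_t(w) = z_t * w over the interval [-1, 1] with alternating signs (a version of this example appears in the survey linked in the comments below; T and eta here are arbitrary choices of mine).

```python
import numpy as np

T = 1000
eta = 1.0 / np.sqrt(T)

z = np.empty(T)
z[0] = -0.5
z[1::2] = 1.0        # rounds 2, 4, 6, ...
z[2::2] = -1.0       # rounds 3, 5, 7, ...

def project(w):      # projection onto the decision set [-1, 1]
    return float(np.clip(w, -1.0, 1.0))

ftl_loss = ftrl_loss = 0.0
cum = 0.0            # sum of the z_s seen so far
for t in range(T):
    # FTL plays the minimizer of the past losses: argmin_w cum * w over [-1, 1].
    w_ftl = 0.0 if cum == 0 else -float(np.sign(cum))
    # FTRL adds (1 / (2 * eta)) * w**2 to the past losses before minimizing.
    w_ftrl = project(-eta * cum)
    ftl_loss += z[t] * w_ftl
    ftrl_loss += z[t] * w_ftrl
    cum += z[t]

best_fixed = min(cum * w for w in (-1.0, 1.0))   # best constant w in hindsight
print("FTL  regret:", ftl_loss - best_fixed)     # grows linearly with T
print("FTRL regret:", ftrl_loss - best_fixed)    # stays O(sqrt(T))
```

The leader flips between the endpoints every round and pays the maximum loss each time, while the regularized leader moves only a little per round and its regret grows like sqrt(T).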

answered Dec 18 '12 at 12:26

Alexandre Passos ♦

Thanks, I realized that the hinge loss + L2 regularizer is just the standard SVM after posting the question, but I didn't know about the online learning stuff. What is the intuition there?

(Dec 18 '12 at 18:30) Oscar Täckström

Also, I'm not sure what you mean when you say "unless you use a margin-based loss or the log loss, the regularization term becomes vacuous". You can use L2 regularization in least squares, which gives ridge regression. The regularization doesn't rescale all parameters uniformly; it shrinks them proportionally, i.e., beta / (1 + lambda) in the orthonormal case.

(Dec 19 '12 at 18:30) digdug

I liked reading Shai Shalev-Shwartz's survey on online learning; it's pretty clear on the intuition: http://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf (see the parts on follow-the-leader and follow-the-regularized-leader).

The intuition is that the optimum of the sequence of loss functions you've seen so far can be arbitrarily far from the optimum of the problem you'd get by adding the next loss function to them. Adding a regularization term keeps this wiggling from happening and ensures that your iterates never stray too far from the global minimum.

(Dec 19 '12 at 20:00) Alexandre Passos ♦

@digdug, I was talking about classification, sorry for being unclear. By "uniformly" I meant that you rescale every parameter by multiplying with the same scalar, as you say. Sorry for the bad terminology.

(Dec 20 '12 at 02:56) Oscar Täckström

@Alexandre, thanks!

(Dec 20 '12 at 02:57) Oscar Täckström