
I've been reading up on regularization and wanted to verify that I haven't misunderstood. Any problems with this summary?

  • Regularization is used to prevent overfitting by penalizing model complexity. At a high level, regularization can be thought of as "smoothing" out the model. Overfitting can be caused by large weights on individual parameters, too many parameters in the model, or both.

  • L2 regularization penalizes parameters with large weights. This penalty shrinks feature weights across the model, making it "smoother" and more generalizable.

  • L1 regularization zeros out individual feature parameters, creating sparsity in the model. Put another way, it helps eliminate irrelevant features from the model. L1 may be useful if your dataset has a large number of (mostly irrelevant) features compared to the number of examples?

  • You can use both L1 and L2 together (the "elastic net") to gain the benefits of both penalties. This "will often give you most of the performance of L2 while completely zeroing out the overly noisy features." (source)

  • You can fit a model using L1 on your data to prune out irrelevant features, then refit using L2 to smooth out the remaining features (see the sketch after this list).
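
A minimal sketch of these three penalties, assuming scikit-learn's Lasso (L1), Ridge (L2), and ElasticNet classes on a made-up synthetic dataset (none of this is from the original post; alpha is scikit-learn's regularization strength):

    # Sketch: L1, L2, and elastic-net penalties on synthetic data with many
    # irrelevant features (assumes scikit-learn; alpha = regularization strength).
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge, ElasticNet

    rng = np.random.RandomState(0)
    n_samples, n_features = 50, 200              # far more features than examples
    X = rng.randn(n_samples, n_features)
    true_coef = np.zeros(n_features)
    true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]  # only 5 features actually matter
    y = X @ true_coef + 0.1 * rng.randn(n_samples)

    l1 = Lasso(alpha=0.1).fit(X, y)                       # L1: sparse weights
    l2 = Ridge(alpha=0.1).fit(X, y)                       # L2: small, dense weights
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of both

    for name, model in [("L1 (Lasso)", l1), ("L2 (Ridge)", l2), ("Elastic net", enet)]:
        print(name, "non-zero coefficients:", int(np.sum(model.coef_ != 0)), "of", n_features)

With a sparse true signal like this, the L1 and elastic-net fits typically zero out most of the 200 coefficients, while the L2 fit keeps all of them, just shrunken toward zero.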

Does all that sound ok so far? Am I missing anything big, or misconstruing any concepts?

If that all checks out, I'm curious about practical tips for L1/L2 regularization. What are some typical values? Which direction is "strong regularization" - e.g., is 0.001 or 10 more regularization?

What are some tell-tale signs that you are overfitting your data (and need to add regularization)? Is it when your training converges, but the performance metric on your test set is poor?
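
One concrete way to see that sign (a sketch with scikit-learn, not part of the original question): fit a weakly regularized model on data with far more features than examples and compare the training score against a held-out score; a large gap suggests overfitting.

    # Sketch: a big train/test gap is a tell-tale sign of overfitting
    # (assumes scikit-learn; C is the inverse regularization strength, so a
    # large C means almost no regularization).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=60, n_features=300, n_informative=5,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    weak_reg = LogisticRegression(C=100.0, max_iter=5000).fit(X_tr, y_tr)
    print("train accuracy:", weak_reg.score(X_tr, y_tr))  # typically ~1.0
    print("test accuracy: ", weak_reg.score(X_te, y_te))  # noticeably lower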

asked Jul 11 '12 at 14:07

Zach

One Answer:

From a Bayesian perspective, regularization can be interpreted as incorporating prior knowledge into the system. For example, L1 regularization in logistic regression prunes out irrelevant features. See Andrew Ng's paper "Feature selection, L1 vs. L2 regularization, and rotational invariance".
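
As a small illustration of that pruning effect (my sketch, assuming scikit-learn's LogisticRegression rather than anything from the Ng paper), an L1 penalty drives many of the learned coefficients to exactly zero:

    # Sketch: L1-penalized logistic regression zeroing out irrelevant features
    # (assumes scikit-learn; the liblinear solver supports penalty="l1").
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                               n_redundant=0, random_state=0)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
    print("non-zero coefficients:", int(np.sum(clf.coef_ != 0)), "of", X.shape[1])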

Coming to your question about the regularization value: what value to use depends on how you pose the problem. With a penalty of the form \lambda ||x|| (where lambda is your regularization parameter), a smaller value indicates weaker regularization; in other words, a large variance on your prior gives the likelihood term more freedom to represent the data. In any case, one can always place a hyper-prior on the regularization parameter to set it automatically. Check Bayesian compressive sensing.
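
On the 0.001-versus-10 question from the original post: under the \lambda ||x|| convention above, a larger \lambda means stronger regularization and smaller weights (beware that some APIs, e.g. scikit-learn's C in LogisticRegression, use the inverse convention). A minimal sketch, assuming scikit-learn's Ridge, where alpha plays the role of \lambda:

    # Sketch: increasing alpha (i.e. lambda) strengthens the penalty and shrinks
    # the learned weights toward zero (assumes scikit-learn's Ridge).
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)
    y = X @ rng.randn(10) + 0.1 * rng.randn(100)

    for alpha in [0.001, 0.1, 10.0, 1000.0]:
        coef = Ridge(alpha=alpha).fit(X, y).coef_
        print("alpha =", alpha, " ||w|| =", round(float(np.linalg.norm(coef)), 3))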

answered Jul 11 '12 at 16:07

Rakesh

edited Jul 11 '12 at 16:08
