I know what L1 and L2 mean: they are different regularization terms used in many learning schemes, including SVM, logistic regression, CRF, etc. How can I predict whether L2 will work better than L1 for my data? What properties of the data should I consider (number of relevant features, number of correlated features, etc.)?
As per Andrew Ng's paper "Feature selection, L1 vs. L2 regularization, and rotational invariance", expect l1 regularization to be better than l2 regularization if you have many fewer examples than features. Conversely, if your features are generated by something like PCA, SVD, or any other model that assumes rotational invariance, or you have enough examples, l2 regularization is expected to do better, because it is directly related to minimizing the VC dimension of the learned classifier, while l1 regularization doesn't have this property. Of course, you can always use elastic net regularization (which is a sum of the l1 and l2 regularizers) and tune the parameters to get the best of both worlds. Scikits.learn has a good and efficient implementation of elastic net and grid search to tune the hyperparameters.
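As a quick illustration of that last point, here is a minimal sketch (my addition, not from the original answer) of tuning an elastic net with a grid search in scikit-learn. The parameter names alpha and l1_ratio follow the current ElasticNet API and may be spelled differently in older scikits.learn releases; the data is purely synthetic.

    # Sketch: grid search over elastic net hyperparameters (synthetic data).
    import numpy as np
    from sklearn.linear_model import ElasticNet
    from sklearn.model_selection import GridSearchCV

    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)  # toy regression target

    param_grid = {
        "alpha": [0.01, 0.1, 1.0],    # overall regularization strength
        "l1_ratio": [0.1, 0.5, 0.9],  # mix between l1 (1.0) and l2 (0.0)
    }
    search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)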
L1-regularization has only been of benefit to me in one application: detecting hedges and uncertainty in Wikipedia articles. There we had tons of features (uni- and bigrams), but only a few specific features seemed to be effective. Still the improvement compared to L2-regularization was minor. I guess the surprise here is rather that L2 does such a good job, while L1 is theoretically superior in this setting.
(Mar 18 '11 at 06:00)
Oscar Täckström
There are some great answers here. Regarding the elastic net that was proposed by Alexandre Passos, does anyone have some sample code for how to implement a logistic regression with elastic net in python using scikit-learn? Do I need to use Stochastic Gradient Descent or are there alternatives? I'm a bit confused by the ElasticNet class and any (minimal) example would be very helpful.
(Feb 12 '12 at 17:39)
ctw
@ctw, scikits.learn has a fast coordinate descent implementation of the elastic net. They have example code online, for example https://github.com/scikit-learn/scikit-learn/blob/master/examples/linear_model/lasso_and_elasticnet.py
(Feb 12 '12 at 17:56)
Alexandre Passos ♦
@Alexandre Thanks for the quick reply. I saw that (and some other) code, but found no example of an implementation of a logistic regression with elastic net. Unfortunately, I'm still confused about how to generalize these examples to the logistic regression case. SGDClassifier lets you specify loss='log', but there seems to be no way to specify a loss function for ElasticNet. I'm probably missing something very obvious ... any further pointers or a minimal example of a logistic regression with elastic net would be very much appreciated.
(Feb 12 '12 at 19:17)
ctw
SGDClassifier can take a "penalty" argument which does elastic net regularization if you pass "elasticnet" as its value. See the documentation at http://scikit-learn.sourceforge.net/stable/modules/generated/sklearn.linear_model.sparse.SGDClassifier.html
(Feb 12 '12 at 19:23)
Alexandre Passos ♦
Thanks again and sorry, I'm not making myself clear. I know about the "elasticnet" "penalty" argument to SGDClassifier and managed to get an elastic net logistic regression implemented by specifying "loss='log'" and "penalty='elasticnet'". However it seems (and your previous answer suggests) that I shouldn't have to use the SGDClassifier to get a logistic regression with elastic net regularization. Is there a way to implement a "normal" logistic regression (i.e., without using the SGDClassifier) with elastic net regularization? It seems that there must be, but I can't figure out how to do it .... Thanks again for all your help!
(Feb 12 '12 at 20:12)
ctw
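For reference, a minimal sketch (not from the thread) of the SGDClassifier route described in the comment above, i.e. elastic-net-penalized logistic regression. loss='log' matches the scikit-learn versions discussed here; newer releases spell it 'log_loss'. The data is synthetic.

    # Sketch: elastic-net-penalized logistic regression via SGDClassifier.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(200, 50)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary labels

    # loss='log' gives logistic regression; penalty='elasticnet' mixes l1 and l2.
    clf = SGDClassifier(loss='log', penalty='elasticnet',
                        alpha=1e-4, l1_ratio=0.15, max_iter=1000)
    clf.fit(X, y)
    print(clf.score(X, y))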
What's the issue with using
(Feb 13 '12 at 02:37)
ogrisel
BTW the
(Feb 13 '12 at 02:47)
ogrisel
Thanks so much @ogrisel! I had assumed that using SGDClassifier for relatively small problems was not efficient and had thought that there was a way to use the ElasticNet class directly with logistic regression. Thanks for your answer!
(Feb 13 '12 at 04:49)
ctw
Something that hasn't been mentioned yet, but that I believe is very important, is that for l1 to bring benefits, as in A. Ng's paper, the ground truth, or asymptotic set of predictive features, must be sparse in the basis chosen. This does not apply to l2, as it is rotationally invariant, which can explain the good performance of l2 in practice.
This answer is marked "community wiki".
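To make the point concrete, here is a small synthetic sketch (my addition, not part of the community-wiki answer): when the true coefficient vector is sparse in the chosen basis and there are fewer examples than features, an l1 penalty (Lasso) tends to recover it better than an l2 penalty (Ridge). Model names and parameters are the standard scikit-learn ones.

    # Sketch: l1 vs l2 when the true coefficients are sparse (synthetic data).
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    n_samples, n_features = 60, 200          # fewer examples than features
    X = rng.randn(n_samples, n_features)
    true_coef = np.zeros(n_features)
    true_coef[:5] = 3.0                      # only 5 truly relevant features
    y = X @ true_coef + 0.5 * rng.randn(n_samples)

    for model in (Lasso(alpha=0.1), Ridge(alpha=1.0)):
        model.fit(X, y)
        err = np.linalg.norm(model.coef_ - true_coef)
        print(type(model).__name__, "coefficient error:", round(err, 2))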
Boring background stuff you can skip if you like: A vector (or matrix) norm is a measure of the size of a quantity of interest. The L_1, L_2, and L_inf norms come up fairly often; the L_inf norm can be thought of as the limit of the L_(2n) norm as n->inf -- (sum(|x_i|^(2n)))^(1/(2n)) -- which is just max(abs(x(:))). I'll try to summarize without getting too mathy.
edit Sorry, forgot to bring it home. Deciding what to use is not cut and dried. A lot of research has been done, but in the end it comes down to: what happens [that is undesirable] when you don't use regularization, and what type of regularization does the best job of controlling that behavior and minimizing its effect? -Brian
This answer is marked "community wiki".
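A small numeric sketch (my addition, not part of the answer above) of the norm background: as the even power 2n grows, the L_(2n) norm of a vector approaches the L_inf norm, i.e. the largest absolute entry.

    # Sketch: L_p norms converge to the max absolute entry as p grows.
    import numpy as np

    x = np.array([0.5, -3.0, 2.0, 1.5])
    for p in (2, 4, 8, 32, 128):
        print(p, np.sum(np.abs(x) ** p) ** (1.0 / p))
    print("inf", np.max(np.abs(x)))  # the limit: max absolute entry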
I won't add to the great theoretical points mentioned by others, just a practical trick: use elastic net with a larger weight on L2, e.g. 0.15 * L1 + 0.85 * L2. That will often give you most of the performance of L2 while completely zeroing out the overly noisy features. You can combine that with a smart search for the optimal lambda using cross-validation and the warm restarts trick. Edit: I had not fully read Alexandre's answer and replied too early, as usual :) I leave this answer as a complement to his.
What is the warm restarts trick?
(Nov 14 '11 at 15:32)
Justin Bayer
When you are doing your hyperparameter sweep to find the regularization parameters for an iteratively trained model, re-use the previously converged weights as the starting point for the next training run. It's especially nice if you are optimizing a convex function, since then you don't need to worry that your parameter sweep and learning are interacting in strange ways. It makes solving n optimization problems cheaper than n times the cost of one optimization, because the optimal solution for one set of parameters is probably close to the optimal solution of a problem with slightly different hyperparameters.
(Nov 14 '11 at 16:24)
Rob Renaud
You train the model once with a strong regularizer, then you decrease the strength of the regularization term and re-fit the same model with the weights pre-initialized from the output of the previous fit. Convergence for the second fit should be reached much faster. This strategy is implemented in the Lasso and ElasticNet models in glmnet and scikit-learn, for instance.
(Nov 14 '11 at 16:24)
ogrisel
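A rough sketch (not from the thread) of the warm-restart trick described in the comments above, using scikit-learn's ElasticNet with warm_start=True: fit with a strong penalty first, then reuse the converged coefficients as the starting point for weaker penalties. The l1_ratio=0.15 mix mirrors the 0.15 * L1 + 0.85 * L2 suggestion, and the data is synthetic.

    # Sketch: warm restarts along a decreasing regularization path.
    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.RandomState(0)
    X = rng.randn(100, 30)
    y = X[:, 0] - X[:, 1] + 0.1 * rng.randn(100)

    model = ElasticNet(l1_ratio=0.15, warm_start=True, max_iter=10000)
    for alpha in [1.0, 0.3, 0.1, 0.03, 0.01]:   # strongest regularization first
        model.alpha = alpha                     # reuse coefficients from the last fit
        model.fit(X, y)                         # converges faster thanks to the warm start
        print(alpha, model.score(X, y))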