3
1

L1 regularization is often convenient since it effectively acts as feature selection mechanism during training. But the resulting objective function is not smooth anymore, which makes optimization tricky.

I have two questions:

  • What are the best tricks to optimize an L1-regularized objective function with L-BFGS?

  • Is the L-BFGS implementation in R (see optim()) capable of optimizing such functions?

asked Jul 20 '10 at 17:15

Frank's gravatar image

Frank
1169254150

edited Jul 28 '10 at 05:46

ogrisel's gravatar image

ogrisel
398464480


6 Answers:

This paper suggests L-BFGS shouldn't work for l1-regularized losses because of the unsmoothness, and suggests a modification to L-BFGS that fixes it.

However, in my experience, giving L-BFGS a subgradient (for example, defining the partial derivative of a weight as 0 if it's 0) of the l1-regularized function works well enough in terms of the objective value (I didn't actually see if the solution was as sparse as it should be). I haven't used R's optim(), but scipy's port of the original L-BFGS-B code from Nocedal worked for me, so I don't see why R's optim() should be different.

answered Jul 20 '10 at 17:37

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1893744214333

edited Jul 20 '10 at 17:59

You will probably also need a truncation of the updates to set the coordinates that cross 0.0 to 0.0 instead of taking the full partial derivative update. Otherwise you loose the sparsifying ability of L1 reg.

(Jul 20 '10 at 18:17) ogrisel

@alexandre: That modification suggested in the paper unfortunately doesn't work for me because my function is non-convex. I already tried their code on my function with L2-regularizer and it crashes with line search error, while R's optim finds a good optimum. With L2 the line search becomes even harder.

(Jul 20 '10 at 21:05) Frank

That's weird, this shouldn't happen, even with non convex functions (unless your function is really not convex). The line search used in this sort of method is usually a backtracking line search in the estimated best descent direction. What you can do is write an outer loop around the optimizer restarting it if it thinks it's done (see this answer http://metaoptimize.com/qa/questions/14/what-are-the-state-of-the-art-optimization-algorithms-for-problems-with-numerous-local-minima#74 )

(Jul 20 '10 at 21:08) Alexandre Passos ♦

Also, a good way of dealing with a non convex objective is using a convex optimizer with many different initializations and parameter configurations, and picking the best result.

But maybe if you really want sparsity, it's better to optimize using Langford et al's truncated gradient, which is easy to implement stochastically (and stochastic optimizers can be a good idea for non convex problems).

(Jul 20 '10 at 21:19) Alexandre Passos ♦

@Alexandre: Yeah, this restarting is also what R does when it doesn't seem to find a good direction. But apparently it's not implemented in the implementation of the OWL-QN method (http://bit.ly/d2YDcB).

(Jul 21 '10 at 13:16) Frank

@Alexandre: Langford's paper looks interesting. In general, I haven't had much success with stochastic methods on non-convex functions, at least SGD seemed too brittle and not converge to the right thing, whereas the L-BFGS in R didn't have much trouble.

(Jul 21 '10 at 13:17) Frank

@ogrisel: Do you have more details about how to truncate?

(Jul 21 '10 at 15:48) Frank

@Frank: you choose a threshold and make all values smaller than that go to zero. See the truncated gradient paper I linked for a reference for other methods, as well.

(Jul 21 '10 at 15:50) Alexandre Passos ♦

@Frank use "soft thresholding" during the parameter update: for instance if you do SGD, first compute: dw_j = learning_rate * grad(obj(y_i, x_i), w_j) then apply the truncated update: w_j = sign(w_j) * max(abs(w_j) - dw_j, 0) where "obj" is your objective function that combines the loss and the L1 regularizer

(Jul 21 '10 at 16:36) ogrisel
showing 5 of 9 show all

These papers present a review of L1-regularized optimization methods, including approximations that are differentiable at the expense of making the objective ill-conditioned. I'd probably try the smooth approximations, iteratively running the minimization with smaller and smaller epsilons.

answered Jul 29 '10 at 23:35

Jey%20Kottalam's gravatar image

Jey Kottalam
454

Mark Schmidt, Glenn Fung and Romer Rosales have made available a MATLAB based optimization toolkit for L1 regularized optimization problems. It supports a number of ways to optimize the L1 regularized objective discussed in this technical report and this paper, and has a very much plug-and-play kind of interface. The software can be downloaded here.

answered Jul 28 '10 at 02:29

spinxl39's gravatar image

spinxl39
3458104368

Perhaps the Willow project SPArse Modeling Software kit could help as well?

answered Jul 21 '10 at 02:30

ualex's gravatar image

ualex
131259

You might also want to look at Compressive Sensing code, since it is based on L1 regularization.
http://dsp.rice.edu/cs

answered Jul 20 '10 at 19:20

DirectedGraph's gravatar image

DirectedGraph
54531422

Have you tried to use glmnet that features both elastic net (or pure L1) regularized regression using coordinate descent and elastic net regularized logistic regression (for multi class classification)? Or maybe do you want to optimize a loss function that involves a non-linear model?

answered Jul 21 '10 at 05:10

ogrisel's gravatar image

ogrisel
398464480

Thanks for the link. But yeah, it's a big log-linear model.

(Jul 21 '10 at 12:30) Frank
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.