Alexandre Passos (p.c.) suggested to me that I could induce sparsity using l1- or l0.5 regularization. I assume l0.5 regularization penalizes by the square root of the weight. (Correct?)

Where has l0.5 been used? Are there references on how it induces sparsity and on its properties? When is it appropriate?

I guess I can no longer have negative weights, yes? Or is the penalty = sgn(w) * sqrt(abs(w)) ?

asked Jul 14 '10 at 10:42 by Joseph Turian ♦♦
edited Dec 03 '10 at 07:13 by Alexandre Passos ♦


4 Answers:

You can have negative weights, but the penalty is sqrt(abs(w)). The penalty can't be negative when the weight is negative, or any sensible minimizer would take the weights to -infinity and attain a really good, really useless minimum of the regularized loss.

This technique was mentioned by Francis Bach in his NIPS tutorial last year (slide 43).

This produces sparser solutions because, if you look at the lp balls with p < 1, they concentrate progressively more mass on zero-valued weights, while still being continuous. A good reference is this paper. In my experience, however, optimizing this sort of regularized loss is hard, and l-bfgs doesn't do a very good job at it (but sgd does worse).
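For concreteness, here is a minimal sketch (not part of the original answer) of fitting a least-squares model with the penalty lam * sum_i sqrt(|w_i|) using scipy's L-BFGS. The penalty is smoothed as (w_i^2 + eps)^0.25 because its gradient is undefined at zero; the data, lam, and eps are all illustrative choices:

    # l0.5-regularized least squares, optimized with L-BFGS.
    # The sqrt(|w|) penalty is smoothed with a small eps so the gradient
    # exists at w = 0; this is an illustrative choice, not a prescription.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    w_true = np.zeros(20)
    w_true[:3] = [2.0, -1.5, 1.0]                  # only three relevant features
    y = X @ w_true + 0.1 * rng.normal(size=100)

    lam, eps = 0.5, 1e-8

    def objective(w):
        resid = X @ w - y
        penalty = np.sum((w ** 2 + eps) ** 0.25)   # smoothed sqrt(|w_i|)
        return 0.5 * np.mean(resid ** 2) + lam * penalty

    def gradient(w):
        grad_loss = X.T @ (X @ w - y) / len(y)
        grad_pen = 0.5 * w * (w ** 2 + eps) ** -0.75
        return grad_loss + lam * grad_pen

    res = minimize(objective, np.zeros(20), jac=gradient, method="L-BFGS-B")
    print(np.round(res.x, 3))                      # irrelevant weights end up near zero

Because the objective is non-convex, the result depends on the starting point, and with the smoothing the irrelevant weights only get close to zero rather than exactly zero; exact zeros usually require thresholding or a solver designed for non-smooth penalties.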

answered Jul 14 '10 at 10:48 by Alexandre Passos ♦
edited Jul 14 '10 at 12:26

Maybe this paper can help: Zongben Xu, Hai Zhang, Yao Wang, Xiangyu Chang. L1/2 regularizer. They claim better sparsity than L1 regularization, i.e. a better surrogate for L0, while still being solvable in practice despite losing the convexity of L1 and L2.

answered Jul 15 '10 at 12:02 by ogrisel

So, L1 is convex but not smooth, and L1/2 is neither convex nor smooth?

(Jul 16 '10 at 01:52) Frank

Yes. Look at the l0.5 ball at http://www.wolframalpha.com/input/?i=sqrt(abs(x))+%2B+sqrt(abs(y))+%3D+1 and note that the gradient goes to infinity whenever any parameter hits zero.

(Jul 16 '10 at 07:02) Alexandre Passos ♦
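A quick numeric illustration of the comment above (my own addition, not the commenter's): the derivative of sqrt(|w|) for w > 0 is 1 / (2 * sqrt(w)), which grows without bound as w approaches zero.

    # The gradient of sqrt(|w|) blows up near zero, so there is a strong pull
    # toward exactly zero but no well-defined gradient once you get there.
    import numpy as np
    for w in [1.0, 0.1, 1e-2, 1e-4, 1e-8]:
        print(f"w = {w:g}, d/dw sqrt(w) = {1.0 / (2.0 * np.sqrt(w)):g}")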

Somewhat counter-intuitively, there are results showing that in order to achieve fast learning rates, you need a regularizer that shrinks large parameters by a smaller amount than small parameters. L1 regularizers shrink all parameters by the same amount regardless of magnitude, whereas the square-root regularizer has the desired property.

There's a discussion of this in Ch. 3 of Meinshausen (2005); for regression, see Knight and Fu (2000); for classification, Tsybakov and van de Geer (2005).
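As a small numeric check of the shrinkage claim above (an added illustration with an arbitrary penalty strength lam): the magnitude of the penalty gradient, i.e. how hard each weight is pulled toward zero, is constant for l1 but decreases with |w| for the square-root penalty.

    # Pull toward zero (magnitude of the penalty gradient) at various weights:
    # constant for l1, decreasing in |w| for the sqrt (l0.5) penalty.
    import numpy as np
    lam = 0.1
    for w in [0.01, 0.1, 1.0, 10.0]:
        l1_pull = lam                          # |d/dw lam * |w||       = lam
        sqrt_pull = lam / (2.0 * np.sqrt(w))   # |d/dw lam * sqrt(|w|)| = lam / (2 sqrt(|w|))
        print(f"w = {w:5}: l1 pull = {l1_pull:.3f}, sqrt pull = {sqrt_pull:.4f}")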

answered Sep 15 '10 at 13:25 by Yaroslav Bulatov
edited Sep 15 '10 at 15:47

Theoretical or empirical results? What is the intuition?

(Sep 15 '10 at 19:32) Joseph Turian ♦♦

The intuition is that you want to get as close as you can to l0 penalization (that is, penalize by the number of nonzero weights), which is optimal for variable selection but NP-complete. What l0 penalization essentially says is that using a feature has a cost, but once you do use it you want the best weight value possible (disregarding generalization error). This means that the smaller a feature's optimal weight gets, the more you want to push it down (since you can probably do without it), but once it gets large enough you don't really care. lp norms for p < 1 have this property. The dissertation Yaroslav cites mentions these other papers as providing theoretical justifications for this, essentially saying that predictors using lp norms with p < 1 converge faster than l1-regularized predictors to the set of actually relevant variables.

(Sep 15 '10 at 19:38) Alexandre Passos ♦

I only learned of this today, but one possible intuition is given in Knight and Fu. The way I understand it: when the number of dimensions is larger than the number of datapoints, the number of dimensions you fit grows with the number of datapoints. The L1 penalty shrinks all parameters towards 0 equally, so there's no hope of obtaining an unbiased estimate. Lp with p < 1 gives the same variance but is asymptotically unbiased (this is the surprising part!), hence it has lower error; see the discussion under Theorem 3 of Knight and Fu.

(Sep 15 '10 at 19:48) Yaroslav Bulatov

Sorry, but I can't resist the opportunity to give some shameless promotion to an old paper of mine, which I think goes a long way towards answering your question:

A. Kaban and R. J. Durrant. Learning with Lq<1 vs L1-norm regularization with exponentially many irrelevant features. Proc. of the 19th European Conference on Machine Learning (ECML08), 15-19 Sept 2008, Antwerp, Belgium. W. Daelemans et al. (eds.): LNAI 5211, pp. 580-596. Springer.

Abstract: We study the use of fractional norms for regularisation in supervised learning from high dimensional data, in conditions of a large number of irrelevant features, focusing on logistic regression. We develop a variational method for parameter estimation, and show an equivalence between two approximations recently proposed in the statistics literature. Building on previous work by A. Ng, we show the fractional norm regularised logistic regression enjoys a sample complexity that grows logarithmically with the data dimensions and polynomially with the number of relevant dimensions. In addition, extensive empirical testing indicates that fractional-norm regularisation is more suitable than L1 in cases when the number of relevant features is very small, and works very well despite a large number of irrelevant features.

answered Sep 16 '10 at 05:54 by Bob Durrant
edited Sep 16 '10 at 18:52 by Joseph Turian ♦♦


Self-promotion is fine, since part of the goal of this site is for researchers to create additional impact beyond traditional publication models.

(Sep 16 '10 at 18:53) Joseph Turian ♦♦

@Joseph Turian, maybe you could post a note somewhere saying that self-promotion is even encouraged: for most questions I asked I'd be really happy to have answers from the author of a relevant paper.

(Sep 16 '10 at 19:10) Alexandre Passos ♦