Hello everyone, here's a problem that has been bothering me for quite some time now.

When doing regression, one can place a Laplace prior on the weights W in order to encourage sparsity. However, working with the Laplace prior directly is awkward, because MAP estimation of the weights W becomes a non-linear problem with no closed-form solution.

A common solution to this problem (e.g. Figueiredo: Adaptive sparseness for supervised learning) is to adopt a hierarchical view of the Laplace prior:

  • Each weight Wi (i is the index) has a zero-mean Gaussian prior p(Wi|Ti) = N(Wi|0,Ti), where Ti is the variance of the i-th weight.
  • Each variance Ti has an exponential prior p(Ti|k) = (k/2) exp(-(k/2) * Ti)

Now we can integrate out the variance Ti and obtain a Laplace prior on Wi:

  • p(Wi|k) = ∫ p(Wi|Ti) p(Ti|k) dTi = sqrt(k)/2 exp(-sqrt(k) abs(Wi))

The integral runs over Ti from zero to infinity.
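The identity can at least be checked numerically before trying to prove it. The snippet below is a minimal sketch, not taken from any of the references: it assumes the exponential density is (k/2) exp(-(k/2) Ti) with a positive normalising constant, and the function names, the substitution Ti = u^2, the midpoint rule and the cut-off are all choices made here for illustration.

```python
import math

def laplace_pdf(w, k):
    """Closed form stated in the post: sqrt(k)/2 * exp(-sqrt(k) * |w|)."""
    s = math.sqrt(k)
    return 0.5 * s * math.exp(-s * abs(w))

def mixture_pdf(w, k, n=20_000):
    """Numerically integrate N(w|0,T) * (k/2) exp(-k T / 2) over T in (0, inf).

    Substituting T = u**2 removes the 1/sqrt(T) factor and leaves
    k / sqrt(2 pi) * exp(-w**2 / (2 u**2) - k u**2 / 2) to integrate over u.
    """
    upper = 12.0 / math.sqrt(k)  # exp(-k u^2 / 2) is negligible beyond this
    h = upper / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h  # midpoint rule, so u is never exactly zero
        total += math.exp(-w * w / (2.0 * u * u) - 0.5 * k * u * u)
    return k / math.sqrt(2.0 * math.pi) * total * h

if __name__ == "__main__":
    for k in (0.5, 2.0):
        for w in (0.0, 0.3, 1.5):
            print(k, w, mixture_pdf(w, k), laplace_pdf(w, k))
```

If the two printed columns agree for a range of k and Wi, that is good evidence the constants in the formula are right, although it is of course not a derivation.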

MAP estimation can now be performed using an EM algorithm.
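To make the EM idea concrete, here is a minimal sketch for the simplest possible case: a single weight and a known noise variance. Both restrictions, and the names `sxx`, `sxy` and `sigma2`, are assumptions made for this illustration and are not part of the setup above. The E-step uses E[1/Ti | Wi] = sqrt(k)/abs(Wi), which follows from the hierarchical prior; the M-step is then a ridge-like update.

```python
import math

def em_map_single_weight(sxx, sxy, k, sigma2=1.0, n_iter=200):
    """EM iterations for the MAP estimate of one regression weight.

    sxx = sum(x_j**2), sxy = sum(x_j * y_j); sigma2 is the noise variance
    (assumed known here); k is the hyperparameter of the exponential prior.
    """
    c = sigma2 * math.sqrt(k)
    w = sxy / sxx  # start from the least-squares estimate
    for _ in range(n_iter):
        # E-step: v = E[1/T | w] = sqrt(k) / |w|.
        # M-step: w = sxy / (sxx + sigma2 * v).
        # Both steps are fused and multiplied through by |w|, so that
        # w -> 0 does not cause a division by zero.
        w = abs(w) * sxy / (abs(w) * sxx + c)
    return w

def soft_threshold_solution(sxx, sxy, k, sigma2=1.0):
    """Direct minimiser of sum((y - w x)^2) / (2 sigma2) + sqrt(k) * |w|."""
    c = sigma2 * math.sqrt(k)
    return math.copysign(max(abs(sxy) - c, 0.0), sxy) / sxx

if __name__ == "__main__":
    print(em_map_single_weight(4.0, 6.0, 4.0), soft_threshold_solution(4.0, 6.0, 4.0))
    print(em_map_single_weight(4.0, 1.0, 4.0), soft_threshold_solution(4.0, 1.0, 4.0))
```

In this one-dimensional case the EM fixed point coincides with the soft-thresholding solution of the Laplace-penalised least-squares problem, which is a useful check. With several weights the M-step becomes a linear solve with diag(E[1/Ti]) added to the normal equations, and that case is not covered by this sketch.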

My question: how is this integral worked out? I've tried to follow the references back to the papers that introduced this representation, which are based on Gaussian scale mixtures, but I find them very challenging. Could somebody help?

Thanks in advance! N.

asked Jun 01 '11 at 05:59

Nikos G

edited Jun 01 '11 at 06:36

I'm not sure, but you might try looking into the conjugate pair of Gamma and Gaussian distributions, which is what they seem to be using there.

(Jun 01 '11 at 06:49) Leon Palafox

Hi Leon, thanks for your comment.

There are two reasons why I would like to use this particular prior:

1) I've had good success with it in the past and

2) there is only one parameter, k above, that needs to be set, while in the Gaussian-gamma method there are two (the parameters of the gamma prior), and I've never had much success in setting them to good values. Perhaps there are some good "recipes" here that I am unaware of; if so, I would be grateful to hear them.

Thanks, Nikos

(Jun 01 '11 at 08:16) Nikos G

Yeah, that's what I meant: your prior looks like a gamma distribution, and if that's so, then the product of the two might be another normal (not sure about that, though) due to the conjugate properties.

(Jun 01 '11 at 09:25) Leon Palafox

Hello Leon, thanks for your interest.

It is true, the exponential is just a special case of the gamma distribution. The gamma is the conjugate prior for the precision (inverse variance) of a Gaussian. However, in this hierarchical Laplace prior, the exponential prior is placed on the variance and therefore the conjugate property is lost.

Thanks again, N.

(Jun 01 '11 at 09:33) Nikos G

One Answer:

This paper seems to explain it (as a special case of a more general technique), but I haven't worked through the math myself.

answered Jun 01 '11 at 14:25

Kevin Canini

Hi Kevin,

I know of this paper, but I have been wondering whether there is some other reference that is a bit less technical than that.

Thanks, N.

(Jun 02 '11 at 06:48) Nikos G
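
For this special case the integral can also be done with elementary tools. The following is a sketch under stated assumptions rather than the argument of any particular reference: it takes the exponential density to be (k/2) exp(-(k/2) Ti) with a positive constant, and the symbols u, a and b are introduced here only for the derivation.

```latex
\begin{align*}
p(w_i \mid k)
  &= \int_0^\infty \frac{1}{\sqrt{2\pi T_i}}
     \exp\!\Big(-\frac{w_i^2}{2T_i}\Big)\,
     \frac{k}{2}\exp\!\Big(-\frac{k}{2}T_i\Big)\,dT_i \\
  &= \frac{k}{\sqrt{2\pi}} \int_0^\infty
     \exp\!\Big(-\frac{a}{u^2} - b\,u^2\Big)\,du
     && (T_i = u^2,\ a = w_i^2/2,\ b = k/2) \\
  &= \frac{k}{\sqrt{2\pi}}\, e^{-2\sqrt{ab}} \int_0^\infty
     \exp\!\Big(-\big(\sqrt{b}\,u - \sqrt{a}/u\big)^2\Big)\,du
     && \text{(completing the square)} \\
  &= \frac{k}{\sqrt{2\pi}}\, e^{-2\sqrt{ab}}\, \frac{1}{2}\sqrt{\frac{\pi}{b}} \\
  &= \frac{\sqrt{k}}{2}\, \exp\!\big(-\sqrt{k}\,|w_i|\big).
\end{align*}
```

The only non-obvious step is the fourth line. For a > 0, substituting u -> sqrt(a/b)/u gives a second expression for the same integral; averaging the two and changing variables to s = sqrt(b) u - sqrt(a)/u turns it into half of a standard Gaussian integral over the whole real line, divided by sqrt(b). For Wi = 0 the second line is already a half-Gaussian integral and gives sqrt(k)/2 directly.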

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.