Hi all. Is there a (conjugate) prior for all the parameters Xi = (mu, kappa, Psi, nu) of a Normal-Inverse-Wishart distribution, such that I can sample Xi from its posterior (that is, given the sought prior distribution and known samples mu_i, Sigma_i drawn from the NIW)?

NOTE: I'm not asking how to sample the (mu_i, Sigma_i) from their posterior; I want to sample Xi from its posterior.

In math:

  • Assume a Gaussian Mixture Model with K Gaussian components
  • Component i has parameters (mu_i, Sigma_i), which are sampled from a single Normal-Inverse-Wishart (NIW) distribution, thus

    (mu_i, Sigma_i) ~ NIW(mu0, kappa0, Psi0, nu0), for all 1 <= i <= K

    which by definition of the NIW distribution means that

    Sigma_i ~ IW(Psi0, nu0), and mu_i ~ N(mu0, Sigma_i / kappa0)

  • Given the K components, is there a distribution P(mu0, kappa0, Psi0, nu0 | Theta) such that I can sample from the posterior

    (mu0, kappa0, Psi0, nu0) ~ P(. | Theta, mu_1, Sigma_1, ..., mu_K, Sigma_K) ?

  • I have tried mu0 ~ N(.) and Psi0 ~ IW(.)*nu0, but this still assumes nu0 and kappa0 are fixed and correct (a rough sketch of such a partial scheme is given after this list). I could not find whether there is a way to sample nu0 and kappa0 too.
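
For concreteness, here is a minimal sketch of the partial scheme in the last bullet: resampling mu0 and Psi0 from their conditional posteriors given the K component parameters, with kappa0 and nu0 held fixed. The hyper-hyperparameters m0, S0, n0, V0 are hypothetical placeholders, and I use a Wishart prior on Psi0 as the semi-conjugate choice, which may differ from what you tried -- please double-check the algebra.

    import numpy as np
    from scipy.stats import wishart

    def resample_mu0_psi0(mus, Sigmas, kappa0, nu0, m0, S0, n0, V0, rng):
        """mus: (K, p) component means; Sigmas: (K, p, p) component covariances."""
        K, p = mus.shape
        Sigma_invs = np.array([np.linalg.inv(S) for S in Sigmas])
        sum_inv = Sigma_invs.sum(axis=0)

        # mu0 | rest: Gaussian conjugate update for a mean with known
        # per-observation covariances Sigma_i / kappa0 and prior N(m0, S0).
        prec = np.linalg.inv(S0) + kappa0 * sum_inv
        cov = np.linalg.inv(prec)
        mean = cov @ (np.linalg.inv(S0) @ m0
                      + kappa0 * sum(Si @ mi for Si, mi in zip(Sigma_invs, mus)))
        mu0 = rng.multivariate_normal(mean, cov)

        # Psi0 | rest: each Sigma_i ~ IW(Psi0, nu0) contributes a Wishart kernel
        # in Psi0, so a Wishart(n0, V0) prior stays Wishart a posteriori.
        df_new = n0 + K * nu0
        scale_new = np.linalg.inv(np.linalg.inv(V0) + sum_inv)
        Psi0 = wishart(df=df_new, scale=scale_new).rvs(random_state=rng)

        return mu0, Psi0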

Context (why I am trying to do this)

I am trying to apply Gibbs sampling to a Dirichlet Process (DP) mixture model with Gaussian mixture components, i.e. I am inferring a Gaussian Mixture Model where the number of components is itself inferred from the data. Until now I have fixed the hyperparameters Xi to something I believed to be "reasonable" but "uninformative". However, I noticed that finding an appropriate prior is more important than I anticipated, because I need to compute the likelihood that a datapoint x_j is generated by a new mixture component, which is p(x_j | Xi) = Integral p(x_j | mu, Sigma) p(mu, Sigma | Xi) d(mu, Sigma). If Xi is too uninformative this likelihood will be very low and I find fewer components than I would expect (for example if the mean mu0 is way off or kappa0 is too low). So I figured I should infer Xi too, such that it reflects the known components; new components would then get a similar mean and covariance, reducing the importance of my initial estimate of Xi.
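
As a side note on the integral above: under a NIW(mu0, kappa0, Psi0, nu0) prior, this marginal has, as far as I know, the standard closed form of a multivariate Student-t, which avoids doing the integral numerically. A minimal sketch, assuming that standard result -- worth checking against a reference such as Murphy's "Conjugate Bayesian analysis of the Gaussian distribution":

    import numpy as np
    from scipy.stats import multivariate_t

    def log_pred_new_component(x, mu0, kappa0, Psi0, nu0):
        """log p(x | Xi) = log Integral p(x|mu,Sigma) p(mu,Sigma|Xi) d(mu,Sigma)."""
        p = len(mu0)
        df = nu0 - p + 1                               # requires nu0 > p - 1
        shape = Psi0 * (kappa0 + 1) / (kappa0 * df)    # scale matrix of the t
        return multivariate_t(loc=mu0, shape=shape, df=df).logpdf(x)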

Any feedback/ideas are welcome :)

asked Feb 23 '12 at 09:43

juliank

I don't think you can integrate everything out analytically, as even integrating out the variance of a univariate Gaussian leads to something which is not in the exponential family (Student's t distribution). So unless you want to deal with the multivariate equivalent of that in your sampler, you should keep the variables without integrating them out.

Given that, you might as well parametrize these things in simple terms and slice sample them.

(Feb 23 '12 at 12:41) Alexandre Passos ♦
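
A minimal illustration of the slice-sampling suggestion above, e.g. for kappa0 or nu0: log_post is a hypothetical function returning the unnormalised log conditional posterior of that scalar given everything else, and should return -inf outside its support (e.g. kappa0 <= 0 or nu0 <= p - 1). The step-out width w is an arbitrary assumption.

    import numpy as np

    def slice_sample_scalar(x0, log_post, w=1.0, max_steps=50, rng=None):
        rng = rng or np.random.default_rng()
        log_y = log_post(x0) + np.log(rng.uniform())   # draw the slice level
        left = x0 - w * rng.uniform()                  # random initial bracket
        right = left + w
        for _ in range(max_steps):                     # step out to the left
            if log_post(left) < log_y:
                break
            left -= w
        for _ in range(max_steps):                     # step out to the right
            if log_post(right) < log_y:
                break
            right += w
        while True:                                    # shrink until accepted
            x1 = rng.uniform(left, right)
            if log_post(x1) >= log_y:
                return x1
            if x1 < x0:
                left = x1
            else:
                right = x1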

One Answer:

I'll try to lend what experience I have, but I can't promise it'll be what you need. My background in this is in RBMs, so if that's not what you're tinkering with I hope this at least makes sense.


If you haven't already, read through the paper Exponential Family Harmoniums with an Application to Information Retrieval by Welling, Rosen-Zvi, and Hinton. It demonstrates that an RBM can be used with any exponential family distribution.

You want to construct an energy function such that, when you solve it for P(v_i|h) and P(h_j|v), both conditionals are either exponential-family distributions or can otherwise be demonstrated to work even if they aren't (e.g. sigmoid, tanh, ReLUs).

I haven't done the math for the Wishart, Inverse-Wishart, or Normal-Inverse-Wishart distributions, but I have worked through designing RBMs that use the Normal, Bernoulli, Binomial, Multinomial, Negative Binomial, and Poisson distributions, and, while I haven't been successful, I've at least attempted the Pareto, Geometric, Exponential, Gamma, Chi-Squared, Beta, and Weibull.

The ones I succeeded on are 'easy' in the sense that I didn't run into any significant hurdles finding a representation that seemed useful, with little in the way of fudging. The ones I didn't succeed on usually had restricted support or other complicating constraints on their parameters.

For example, the exponential distribution only has support on [0, inf), and the inverse scale parameter (lambda) is required to be positive as well. Since the exponential distribution has only a single parameter, it's the only parameter that could contain the linear combinations W*tr(h) or v*W that would ultimately be part of the energy function, but any such linear combination could potentially violate the constraint lambda > 0 -- which makes things annoying when it comes time to sample from the distribution.

There are ways to get around that sort of problem, but by the time I got to this point I felt I had learned enough about the process (and I had no need to go on), so I moved on to other projects.
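
One standard workaround, which may or may not be what is meant above: keep the linear combination unconstrained and map it into the valid parameter range with a link function, e.g. a softplus to force lambda > 0 for an exponential visible unit. W, b, and h here are hypothetical RBM quantities.

    import numpy as np

    def sample_exponential_visible(h, W, b, rng=None):
        rng = rng or np.random.default_rng()
        pre = W @ h + b                       # unconstrained linear combination
        lam = np.log1p(np.exp(pre))           # softplus keeps the rate positive
        return rng.exponential(scale=1.0 / lam)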

You'll run into similar issues with the Inverse-Wishart. I'm not at all familiar with that distribution, but from the wikipedia page on it I'm going to assume the univariate version of the IW distribution needs psi > 0 and degrees of freedom m > p - 1 = 0 (since psi is p x p and here p = 1). This situation is similar to the Pareto distribution in that all the parameters are restricted, and the support is restricted.


With all of that out of the way (my apologies for the verbosity), this is essentially the process I went through to construct an energy function and derive conditional distributions for doing Gibbs sampling:

  • Construct a generic energy function of the form E(v,h) = EVH(v,h) + EV(v) + EH(h).
  • Supposing P(v|h) is the target distribution, EH(h) can be completely disregarded -- it will fall out while solving for P(v_i|h). Further, since the weight matrix is meant to link v to h, EV(v) can only depend on the visible bias -- though you can't disregard v just yet (that ends up being the problem with using the Poisson distribution: when sampling v_i, it requires knowledge of all of v, hence Salakhutdinov & Hinton 'cheated' by breaking the conditional independence assumption).
  • Minus the EH(h) term, E(v,h) will take a form that is nearly identical to ln( P(v_i|h) ) when you're through -- assuming P(v_i|h) is a true exponential-family distribution.
  • It then becomes a matter of choosing a form for EVH and EV that, when you start working out P(v|h) = P(v,h) / P(h), makes it easy to reduce P(v|h) to P(v_i|h) and additionally makes P(v_i|h) an exponential-family distribution (a concrete Bernoulli-Bernoulli example is sketched after this list).
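
To make the recipe concrete without doing the NIW algebra, here is the textbook Bernoulli-Bernoulli case: with E(v,h) = -v.W.h - b.v - c.h, both conditionals come out as independent Bernoullis with sigmoid means, which is exactly the exponential-family property the steps above aim for. This is only a sketch of the standard RBM, not the distribution asked about.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v, W, b, c, rng=None):
        """One block-Gibbs sweep; W is (n_visible, n_hidden), b and c are the biases."""
        rng = rng or np.random.default_rng()
        # P(h_j = 1 | v) = sigmoid(c_j + v . W[:, j])
        h = (rng.uniform(size=c.shape) < sigmoid(c + v @ W)).astype(float)
        # P(v_i = 1 | h) = sigmoid(b_i + W[i, :] . h)
        v_new = (rng.uniform(size=b.shape) < sigmoid(b + W @ h)).astype(float)
        return v_new, h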

If you want some more details, when I get home I can post a small example of how I came up with an energy function for using the Poisson distribution.

Lastly: take a look at Alexander Krizhevsky's (non-thesis) master's thesis, Learning Multiple Layers of Features from Tiny Images. I got some very valuable information about the process I outlined above, both from his paper and from email conversations. He doesn't outline the energy-function derivation -- that he just states outright. But he does derive the conditionals, and he explains that his training procedure required him to learn two sets of parameters -- the weights/biases, and a standard deviation for each visible unit.

I don't recall whether his paper says he had to train them in separate phases, as Yann LeCun recommends doing when optimizing for multiple goals or constraints, but that's the impression I have -- you optimize the weights/biases holding the standard deviations constant, then hold those constant and optimize the standard deviations; rinse, repeat.

With the number of parameters the NIW has, I'm guessing you'll find yourself in a similar situation.

answered Feb 23 '12 at 12:11

Brian Vandenberg

edited Feb 23 '12 at 12:22

Thanks for the extensive reply; this was a completely different direction from what I was thinking of (I'm not very familiar with Restricted Boltzmann Machines), but thinking outside the box is good :). I'm going to check out your references.

(Feb 24 '12 at 06:13) juliank

This paper by Le Roux et al. also has a couple of energy functions for more 'exotic' RBMs (Beta RBM, Gaussian RBM with learnt precision): Learning a generative model of images by factoring appearance and shape - I'm also curious to see how you derived a Poisson energy function.

(Feb 24 '12 at 07:51) Sander Dieleman

Sure, not a problem. I took notes on the process back when I did this; here's the text file for it: http://upload.decay.org/1253_mvXUZpu9AY5dVxEa_poisson.txt

(Feb 24 '12 at 10:19) Brian Vandenberg

Here's my notes for the other distributions: http://upload.decay.org/1254_hu5o2ZK1L9EfAal5_other_distributions.txt

(Feb 24 '12 at 10:21) Brian Vandenberg

Awesome, thanks :)

(Feb 24 '12 at 10:24) Sander Dieleman

Thanks for sharing, Brian

(Feb 24 '12 at 11:03) juliank