Hi all. Is there a (conjugate) prior for all the parameters of the Normal-Inverse-Wishart distribution? NOTE: I'm not asking how to sample the Gaussian component parameters under a fixed NIW prior, but whether I can put a prior on the NIW parameters themselves.
Context (why I'm trying to do this): I am trying to apply Gibbs sampling to a Dirichlet Process (DP) mixture model with Gaussian mixture components, i.e. I'm inferring a Gaussian mixture model where the number of components is itself inferred from the data. Till now I have kept the hyperparameters fixed.
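To make the "fixed hyperparameters" part concrete, the per-component conjugate update each Gibbs sweep relies on looks roughly like this (a minimal sketch using the standard NIW parameterisation; mu0, kappa0, nu0, Psi0 are the fixed hyperparameters, the other names are just illustrative):

```python
import numpy as np

def niw_posterior(X, mu0, kappa0, nu0, Psi0):
    """Standard conjugate Normal-Inverse-Wishart update for one Gaussian
    component, given the n x d data matrix X currently assigned to it
    (assumes n >= 1). Returns the posterior NIW parameters."""
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                      # scatter about the sample mean
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    diff = (xbar - mu0).reshape(-1, 1)
    Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * (diff @ diff.T)
    return mu_n, kappa_n, nu_n, Psi_n
```

The question is whether I can put a (conjugate) prior on mu0, kappa0, nu0 and Psi0 as well, so that they get sampled instead of staying fixed.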
Any feedback/ideas are welcome :)
I'll try to lend what experience I have, but I can't promise it'll be what you need. My background here is in RBMs, so if that's not what you're tinkering with, I hope this at least makes sense. If you haven't already, read through the paper Exponential Family Harmoniums with an Application to Information Retrieval by Welling, Rosen-Zvi, and Hinton. It demonstrates that an RBM can be used with any exponential-family distribution. You want to construct an energy function such that, when you solve for P(v_i|h) or P(h_j|v), both conditionals are either exponential family, or can otherwise be demonstrated to work if they aren't (e.g. sigmoid, tanh, ReLUs).
I haven't done the math for the Wishart, Inverse-Wishart, or Normal-Inverse-Wishart distributions, but I have worked through designing an RBM that can use the Normal, Bernoulli, Binomial, Multinomial, Negative Binomial, and Poisson, and, while I haven't been successful, I've at least attempted the Pareto, Geometric, Exponential, Gamma, Chi-Squared, Beta, and Weibull. The ones I succeeded on were 'easy' in the sense that I didn't run into any significant hurdles finding a representation that seemed useful, with little in the way of fudging. The ones I didn't succeed on usually had restricted support or other complicating constraints on their parameters. For example, the exponential distribution only has support on [0, inf), and its inverse-scale parameter (lambda) is required to be positive as well. Since the exponential distribution has only a single parameter, that parameter is the only place the linear combinations W*tr(h) or v*W (which ultimately form part of the energy function) can appear, but any such linear combination could violate the constraint lambda > 0 -- which makes things annoying when it comes time to sample from the distribution. There are ways to get around that sort of problem, but by the time I got to this point I felt I had learned enough about the process (and had no need to go further), so I moved on to other projects.
You'll run into similar issues with the Inverse-Wishart. I'm not at all familiar with that distribution, but from the Wikipedia page on it, I'm going to assume the univariate version of the IW distribution needs parameters psi > 0 and m > 0 (psi is p x p and it's univariate, so p = 1, which gives m > p - 1 = 0). This situation is similar to the Pareto distribution in that all the parameters are restricted, and the support is restricted.
With all of that out of the way (my apologies for the verbosity), that is essentially the process I went through to construct an energy function and derive conditional distributions for doing Gibbs sampling.
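To make the "exponential-family conditionals" point concrete, here's a minimal sketch of one block-Gibbs sweep for a Gaussian-Bernoulli RBM. This is just one common parameterisation with a per-unit standard deviation, not the exact energy function from my notes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, sigma, rng):
    """One block-Gibbs sweep for a Gaussian-Bernoulli RBM with energy
        E(v, h) = sum_i (v_i - b_i)^2 / (2 sigma_i^2)
                  - sum_j c_j h_j
                  - sum_{ij} (v_i / sigma_i) W_ij h_j
    Both conditionals come out in closed, exponential-family form:
    p(h_j = 1 | v) is Bernoulli (logistic), p(v_i | h) is Gaussian."""
    p_h = sigmoid(c + (v / sigma) @ W)                  # hiddens given visibles
    h = (rng.uniform(size=p_h.shape) < p_h).astype(float)
    mean_v = b + sigma * (W @ h)                        # visibles given hiddens
    v_new = rng.normal(mean_v, sigma)
    return v_new, h

# Tiny usage example with made-up shapes:
rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b, c, sigma = np.zeros(n_vis), np.zeros(n_hid), np.ones(n_vis)
v = rng.normal(size=n_vis)
for _ in range(10):
    v, h = gibbs_step(v, W, b, c, sigma, rng)
```

Distributions with awkward support (the exponential, Pareto, and IW cases above) are exactly the ones where a linear combination like c + (v / sigma) @ W can push a parameter outside its allowed range, which is the problem described above.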
If you want more details, when I get home I can post a small example of how I came up with an energy function for the Poisson distribution.
Lastly: take a look at Alexander Krizhevsky's (non-thesis) master's thesis, Learning Multiple Layers of Features from Tiny Images. I got some very valuable information about the process I outlined above, both from his paper and from email conversations. He doesn't outline the energy-function derivation -- that he just states outright -- but he does derive the conditionals, and explains that his training procedure required him to learn two sets of parameters: the weights/biases, and a standard deviation for each visible unit. I don't recall whether his paper says he had to train them in separate phases, as Yann LeCun recommends doing when optimizing for multiple goals or constraints, but that's the impression I have: you optimize the weights/biases holding the standard deviations constant, then hold those constant and optimize the standard deviations; rinse, repeat. With the number of parameters the NIW has, I'm guessing you'll find yourself in a similar situation.
Thanks for the extensive reply, this was a completely different direction than what I was thinking of (I'm not very familiar with Restricted Boltzmann Machines), but thinking outside the box is good :). I'm going to check out your references.
(Feb 24 '12 at 06:13)
juliank
This paper by Le Roux et al. also has a couple of energy functions for more 'exotic' RBMs (a Beta RBM, and a Gaussian RBM with learnt precision): Learning a generative model of images by factoring appearance and shape. I'm also curious to see how you derived a Poisson energy function.
(Feb 24 '12 at 07:51)
Sander Dieleman
Sure, not a problem. I took notes on the process back when I did this; here's the text file for it: http://upload.decay.org/1253_mvXUZpu9AY5dVxEa_poisson.txt
(Feb 24 '12 at 10:19)
Brian Vandenberg
Here's my notes for the other distributions: http://upload.decay.org/1254_hu5o2ZK1L9EfAal5_other_distributions.txt
(Feb 24 '12 at 10:21)
Brian Vandenberg
Awesome, thanks :)
(Feb 24 '12 at 10:24)
Sander Dieleman
Thanks for sharing, Brian
(Feb 24 '12 at 11:03)
juliank
I don't think you can integrate everything out analytically, as even integrating out the variance of a univariate Gaussian will lead to something which is not in the exponential family (Student's t distribution). So unless you want to deal with the multivariate equivalent of that in your sampler you should keep the variables without integrating them out.
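Concretely, for a univariate Gaussian with the usual conjugate inverse-gamma prior on the variance, the marginal over $\sigma^2$ is

$$p(x \mid \mu, \alpha, \beta) = \int_0^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\,\text{Inv-Gamma}(\sigma^2 \mid \alpha, \beta)\, d\sigma^2 = t_{2\alpha}\!\left(x \mid \mu, \tfrac{\beta}{\alpha}\right),$$

i.e. a Student's t with $2\alpha$ degrees of freedom, location $\mu$ and squared scale $\beta/\alpha$, which is not in the exponential family.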
Given that, you might as well parametrize these things in simple terms and slice-sample them.
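For example, a univariate stepping-out/shrinkage slice sampler (Neal, 2003) is only a few lines and can be applied to each (suitably transformed) hyperparameter in turn inside the Gibbs sweep. A rough sketch, with the conditional log-density supplied as a function:

```python
import numpy as np

def slice_sample(x0, log_density, w=1.0, max_steps=50, rng=None):
    """One slice-sampling update (stepping-out + shrinkage, Neal 2003)
    for a univariate variable with unnormalised log-density `log_density`."""
    rng = np.random.default_rng() if rng is None else rng
    log_y = log_density(x0) + np.log(rng.uniform())     # height of the slice
    # Stepping out: randomly position an interval of width w, then grow it
    left = x0 - w * rng.uniform()
    right = left + w
    j = max_steps
    while j > 0 and log_density(left) > log_y:
        left -= w
        j -= 1
    j = max_steps
    while j > 0 and log_density(right) > log_y:
        right += w
        j -= 1
    # Shrinkage: sample uniformly on the interval, shrink it on rejection
    while True:
        x1 = rng.uniform(left, right)
        if log_density(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

# Example (placeholder names): update log(kappa0) given everything else,
# where log_cond_post_kappa0 is your conditional log-posterior; the "+ t"
# term is the log-Jacobian of the exp transform.
# log_kappa0 = slice_sample(log_kappa0,
#                           lambda t: log_cond_post_kappa0(np.exp(t)) + t)
```

Working on the log scale for the positive hyperparameters keeps the support unconstrained, so the sampler never proposes invalid values.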