I have some questions about Neal (1998), specifically the algorithm he describes for sampling from a DP mixture with conjugate distributions.

When he computes the integrals, he writes the differential with respect to a distribution, as in equation 3.6:

∫ F(y_i, phi) dG_0(phi)

Does this mean it is an integral over the prior, or is it just a notational quirk and he simply means an integral over phi?

If it is the integral over phi, then in the case of a mixture of Gaussians with a Gaussian G, does the integral simply become an integral against a single Gaussian? Or is it better to choose G_0 as a conjugate prior?

Also, which phi should be used when a new class is created? Should we just sample it from the prior G_0?

Thanks

asked May 23 '12 at 07:25


Leon Palafox ♦


One Answer:

It's the integral over phi. This integral is the posterior predictive distribution for the data point y (actually the prior predictive distribution, because no data points are associated with this new class yet).

Eq. (3.6) describes the Gibbs sampling step for the association c of the data point y with a class. To evaluate the association probability for a given class you need to evaluate the prior (which depends on the number of points already associated with that class) and the data likelihood (which depends on y and on phi, the parameter of that class's data distribution). This is fine for the existing classes, but there is a problem when evaluating it for a new class: you know how to evaluate the prior (there are always 'alpha' points associated with a new class), but you cannot evaluate the data likelihood, because the new class does not yet have a parameter phi associated with it. Therefore, you integrate over all possible phi. Because the prior G0 over phi and the data distribution F(y|phi) form a conjugate pair (this is assumed in Alg. 2), you can evaluate this integral analytically, and the result is the above-mentioned prior predictive distribution.
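
To make this concrete, here is a minimal sketch (not from the paper) of that association step for a toy 1-D model: the likelihood F(y|phi) = N(y; phi, sigma2) with known variance sigma2, and a conjugate prior G0 = N(mu0, tau0_2) on the mean phi. All names and constants below are made up for illustration.

    import numpy as np
    from scipy.stats import norm

    # Toy constants (illustrative only): known likelihood variance,
    # prior mean/variance of phi, and DP concentration alpha.
    sigma2, mu0, tau0_2, alpha = 1.0, 0.0, 4.0, 1.0

    def association_probs(y, cluster_phis, cluster_counts):
        """Probabilities of associating point y with each existing
        class or with a brand-new one (Alg.-2-style Gibbs step)."""
        probs = []
        # Existing classes: n_c * F(y | phi_c)
        for phi, n_c in zip(cluster_phis, cluster_counts):
            probs.append(n_c * norm.pdf(y, loc=phi, scale=np.sqrt(sigma2)))
        # New class: alpha * integral of F(y | phi) dG0(phi).
        # For this Normal-Normal pair the prior predictive is again Normal:
        # N(y; mu0, sigma2 + tau0_2).
        probs.append(alpha * norm.pdf(y, loc=mu0, scale=np.sqrt(sigma2 + tau0_2)))
        p = np.array(probs)
        return p / p.sum()

The last entry of the returned vector is the new-class probability; sampling an index from this vector is the Gibbs update for c.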

As suggested in the last paragraph of your question, you could also just sample a phi from the prior G0 and use this phi to evaluate the data likelihood of y for the new class. You can think of this as a Monte Carlo approximation to the integral -- but a very crude one, because it is based on just a single sample. And, as mentioned above, you don't need this approximation, because you can evaluate the integral analytically. If you nevertheless want to (or have to) use a Monte Carlo approximation for this integral, have a look at Fig. 1 and Alg. 8 of the paper (it is quite similar to the one-sample approximation, but it uses more than one sample for the MC estimate).
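
For completeness, here is a crude multi-sample Monte Carlo version of the new-class term, continuing the toy sketch above (this is only in the spirit of the multi-sample idea behind Alg. 8, not a transcription of it):

    def new_class_weight_mc(y, m=10, rng=np.random.default_rng(0)):
        """Estimate alpha * int F(y|phi) dG0(phi) by averaging the
        likelihood over m draws of phi from the prior G0
        (m = 1 is the one-sample approximation discussed above)."""
        phis = rng.normal(mu0, np.sqrt(tau0_2), size=m)
        return alpha * norm.pdf(y, loc=phis, scale=np.sqrt(sigma2)).mean()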

Whenever you associate a point with a new class, you must also sample a phi for this newly instantiated class, because you need this phi to evaluate the association probabilities of the next data point. To sample this phi you draw from the posterior G0(phi|y), with y now being part of the given data. A complete Gibbs sweep also resamples the phi of every class from the posterior G0(phi|y1,...,yn), where (y1,...,yn) are the points currently associated with that class.
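
A sketch of that resampling step for the same toy Normal-Normal model as above (names are again purely illustrative):

    def sample_phi_posterior(ys, rng=np.random.default_rng(0)):
        """Draw phi for one class from the conjugate posterior
        G0(phi | y_1,...,y_n), where ys are the points currently
        associated with that class (a single y for a brand-new class)."""
        ys = np.atleast_1d(ys)
        post_prec = 1.0 / tau0_2 + len(ys) / sigma2      # posterior precision
        post_mean = (mu0 / tau0_2 + ys.sum() / sigma2) / post_prec
        return rng.normal(post_mean, np.sqrt(1.0 / post_prec))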

Alg. 2 represents the phi explicitly. But, because it is assumed here anyway that G0 and F form a conjugate pair, you could also get rid of all the phi by integrating them out. This means you always work with the posterior predictive distribution, and the phi are no longer part of your representation. This is exactly what the next algorithm in the paper, Alg. 3, describes. Note that by integrating out all the phi you have then integrated out the complete draw G ~ DP from the Dirichlet process prior: this draw consists of the weights (which are already integrated out in Alg. 2) and the atoms phi (which are then also integrated out in Alg. 3).
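
In this collapsed view, the per-class score in the Gibbs step above is simply replaced by the posterior predictive density given the points already in that class. A sketch, still under the same toy Normal-Normal assumptions:

    def collapsed_association_probs(y, clusters):
        """Alg.-3-style step: clusters is a list of arrays, each holding the
        points currently associated with one class; no explicit phi is kept."""
        probs = []
        for ys in clusters:
            ys = np.atleast_1d(ys)
            post_prec = 1.0 / tau0_2 + len(ys) / sigma2
            post_mean = (mu0 / tau0_2 + ys.sum() / sigma2) / post_prec
            # Posterior predictive: N(y; post_mean, sigma2 + 1/post_prec)
            probs.append(len(ys) * norm.pdf(y, loc=post_mean,
                                            scale=np.sqrt(sigma2 + 1.0 / post_prec)))
        # New-class term is unchanged: alpha * prior predictive.
        probs.append(alpha * norm.pdf(y, loc=mu0, scale=np.sqrt(sigma2 + tau0_2)))
        p = np.array(probs)
        return p / p.sum()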

answered May 23 '12 at 10:43


Dominik

edited May 24 '12 at 03:00

When we calculate the integral, what is the distribution G_0 in the case of a mixture of Normals? Is it another Normal, or perhaps an Inverse-Gamma?

In most examples I've seen, G_0 is a Gaussian.

(May 24 '12 at 06:36) Leon Palafox ♦

It depends on what you assume is known about the Normals in the mixture. If you know the variance of the mixture components but not their means, then a 'phi' is the mean vector of a cluster and G0 is in this case also a Gaussian (a distribution over mean vectors). If both the mean and the variance of the cluster components are unknown, then 'phi' consists of both a mean vector and a covariance matrix (or precision matrix); the conjugate prior G0 in this case is the normal-Wishart distribution (or the normal-inverse-Wishart distribution). Wikipedia has a nice table of conjugate pairs (at the end of the conjugate prior article, in the section "Continuous likelihood distributions").

If you need to implement these conjugate-prior computations yourself, you might find the paper "Conjugate Bayesian analysis of the Gaussian distribution" by Kevin P. Murphy helpful -- it has all the formulas and details needed for an implementation.
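
For the 1-D unknown-mean-and-variance case, here is a sketch of the posterior update and one draw of phi = (mu, sigma2) under a normal-inverse-gamma prior; the hyperparameter names are my own, and the update formulas are the standard conjugate ones collected in Murphy's note.

    import numpy as np

    def sample_phi_nig(ys, mu0=0.0, kappa0=1.0, a0=2.0, b0=1.0,
                       rng=np.random.default_rng(0)):
        """Normal-inverse-gamma prior: sigma2 ~ InvGamma(a0, b0),
        mu | sigma2 ~ N(mu0, sigma2/kappa0).  Returns one posterior draw
        of (mu, sigma2) given the points ys of a class."""
        ys = np.atleast_1d(ys)
        n, ybar = len(ys), ys.mean()
        kappa_n = kappa0 + n
        mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
        a_n = a0 + n / 2.0
        b_n = (b0 + 0.5 * np.sum((ys - ybar) ** 2)
               + kappa0 * n * (ybar - mu0) ** 2 / (2.0 * kappa_n))
        sigma2 = b_n / rng.gamma(a_n)          # InvGamma(a_n, b_n) draw
        mu = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
        return mu, sigma2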

(May 24 '12 at 07:32) Dominik

Cool, thanks. One last thing, about updating alpha: Rasmussen combines a prior on alpha with the likelihood of the class assignments and samples a new alpha via ARS (adaptive rejection sampling).

However, in other, simpler approaches such as naive Bayes, alpha is updated based on the counts per category, and I would think that this might make sense for a new class as well.

Which would be a better approach?

(May 25 '12 at 00:07) Leon Palafox ♦