
Hi all,

My problem is related to Bayesian inference. I extended the original LDA with several non-conjugate priors, so the simple collapsed Gibbs sampler no longer applies. I understand that there are basically two ways to do the inference: MCMC and variational inference. But how should I choose which one to develop? Most of the time, I see papers choosing variational inference. Does that imply that variational inference is simpler to implement?

I would really appreciate any hints or suggestions.

asked Jul 08 '10 at 12:48

Liangjie Hong

3 Answers:

Variational inference is usually harder to derive and implement than MCMC sampling, but it has some other nice properties that can be interesting (it's easier to get a bound on the normalization factor; it's a convex optimization; it's deterministic). There are also generic variational inference procedures, usually based on message passing. However, most of these don't do very well on LDA, and Blei (and his group) usually use mean-field variational methods, for which there is no clear recipe for deriving the algorithm. I think in LDA-like models most people just use what they're comfortable with, so you see Blei's papers using mean-field variational inference, Griffiths' papers using Gibbs sampling, Tom Minka using expectation propagation, etc.

To implement a non-collapsed Gibbs sampler for LDA, what you need to do is write down the likelihood function of your model in terms of all variables (you can do this easily for any directed graphical model) and sample each variable from its posterior given all the others. That is, to sample a variable X, sample from P(X | all the rest), which, by Bayes' theorem, is proportional to P(all the rest | X) Prior(X) (you must normalize this, but if X is discrete this is easy to do; if X is continuous, use slice sampling or Metropolis-Hastings).

More explicitly, I don't know of any papers that explain this sampler, but it's trivial. Taking plain LDA as the example, the model is

theta_d ~ Dirichlet(alpha)

eta_t ~ Dirichlet(beta)

z_i ~ Categorical(theta_d)

w_i ~ Categorical(eta_z_i)

To resample the z in the collapsed Gibbs sampler, you choose a topic with probability proportional to the product of its smoothed count in that document and its smoothed probability of generating that word; that is,

p(z_i = t) ∝ (C_dt + alpha)/(sum_t'(C_dt' + alpha)) * (C_tw + beta)/(sum_w'(C_tw' + beta))
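
For concreteness, here's a minimal sketch of that update in Python/numpy; the count arrays C_dt, C_tw and the hyperparameters alpha, beta are names I'm assuming (with the current token already removed from the counts), not something from a particular paper:

    import numpy as np

    def resample_z_collapsed(d, w, C_dt, C_tw, alpha, beta, rng=np.random):
        # C_dt[d, t]: tokens in document d assigned to topic t
        # C_tw[t, w]: occurrences of word w assigned to topic t
        p = (C_dt[d] + alpha) * (C_tw[:, w] + beta) / (C_tw.sum(axis=1) + beta * C_tw.shape[1])
        p /= p.sum()
        return rng.choice(len(p), p=p)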

To do uncollapsed sampling, you sample the z variable with

p(z_i = t) ∝ theta_d_t * eta_t_w_i

and, after that, resample the thetas and the etas from the Dirichlet distribution given by their counts, which you can do by drawing, for each component, a gamma variable with shape equal to the smoothed count and normalizing (or by just using a Dirichlet sampler, as available in numpy, for example).
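
A rough sketch of one such uncollapsed sweep, under my own choice of variable names (docs as lists of word ids, K topics, V words), could look like this:

    import numpy as np

    def uncollapsed_sweep(docs, z, theta, eta, alpha, beta, K, V, rng=np.random):
        # docs: list of word-id lists; z: matching list of current topic assignments
        C_dt = np.zeros((len(docs), K))
        C_tw = np.zeros((K, V))
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                # sample z_i with p(z_i = t) proportional to theta[d, t] * eta[t, w]
                p = theta[d] * eta[:, w]
                z[d][i] = rng.choice(K, p=p / p.sum())
                C_dt[d, z[d][i]] += 1
                C_tw[z[d][i], w] += 1
            # resample theta_d from the Dirichlet of its smoothed counts
            theta[d] = rng.dirichlet(C_dt[d] + alpha)
        # resample each topic's word distribution the same way
        for t in range(K):
            eta[t] = rng.dirichlet(C_tw[t] + beta)
        return z, theta, eta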

To do Gibbs sampling with other priors, essentially all you have to change is how you resample theta and eta.
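
For instance, if the prior on theta is non-conjugate, one possible replacement for the Dirichlet resampling step is an independence Metropolis-Hastings move that proposes from the likelihood part (a Dirichlet over the counts) and accepts with the prior ratio, since the likelihood terms cancel. This is only a sketch under my own assumptions; log_prior stands in for whatever log-density your actual prior has:

    import numpy as np

    def resample_theta_nonconjugate(theta_d, counts_d, log_prior, rng=np.random):
        # counts_d[t]: tokens in this document currently assigned to topic t
        # Propose from Dirichlet(counts + 1), i.e. the likelihood viewed as a Dirichlet.
        proposal = rng.dirichlet(counts_d + 1.0)
        # With this proposal the likelihood cancels in the MH ratio,
        # leaving only the ratio of prior densities.
        if np.log(rng.uniform()) < log_prior(proposal) - log_prior(theta_d):
            return proposal
        return theta_d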

answered Jul 08 '10 at 13:51

Alexandre Passos ♦

edited Jul 08 '10 at 14:24

Thanks for your detailed answer. However, I haven't seen any papers that use a non-collapsed Gibbs sampler for any LDA extensions.

(Jul 08 '10 at 13:57) Liangjie Hong

@liangjie-hong, I edited the answer to include an explanation of the uncollapsed sampler.

(Jul 08 '10 at 14:10) Alexandre Passos ♦

@Alex, thanks for the detailed explanation. So, essentially, for LDA, the difficulty of the un-collapsed sampler reduces to sampling theta given alpha and the counts, i.e., drawing samples from a Dirichlet. Do I need to re-sample theta after each word, or only once per document? In my understanding, theta is only sampled once per document.

(Jul 09 '10 at 02:30) Liangjie Hong

As long as you eventually resample theta, your Gibbs sampler is valid. You can choose how often you do it, in whatever way suits you best. Resampling after each word will approximate the collapsed sampler, but might be very slow for non-conjugate priors. Resampling after each document is fine. You can also resample it less often, say once every two passes over the words, if you think it will converge faster that way.

(Jul 09 '10 at 07:14) Alexandre Passos ♦

Hi, I think Alex is absolutely right. However, I think it's a more obvious choice to use variational methods (like mean-field methods) rather than MCMC when we have non-conjugate priors. Variational methods are typically hard to implement but seem to give a pretty reasonable answer.

(Oct 01 '10 at 01:56) Tanmoy Mukherjee

Alex has explained everything rightly. However, I think that if we have non-conjugacy we would prefer to use variational methods (as explained in Dynamic Topic Models). Variational inference is definitely not easier to implement than MCMC (I don't know if that's right to say either). In fact, I would like to know whether, given non-conjugacy, it would be preferable to use variational methods (though the original LDA paper is based on conjugate priors).

(Oct 01 '10 at 02:03) Tanmoy Mukherjee

There is one obvious answer: use a non-collapsed Gibbs sampler. Essentially, conditioned on the topic assignments, explicitly draw a sample of the parameters. If you can't do this with your new prior, you might be hosed. You should probably derive the un-collapsed LDA sampling equations to make sure you understand what's going on.

Also, Variational Bayes can be applied to non-conjugate settings. You simply need, in the mean-field case, to be able to compute E_{q_{-i}}[log P(x)] (the expected log-joint under the other variational factors), which has a closed form when the variational distribution is a Dirichlet.
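
For the Dirichlet case mentioned above, the key expectation is the standard digamma identity; a tiny illustration (gamma here is just a hypothetical vector of variational Dirichlet parameters):

    import numpy as np
    from scipy.special import digamma

    def expected_log_theta(gamma):
        # E_q[log theta_t] under q(theta) = Dirichlet(gamma)
        # equals psi(gamma_t) - psi(sum_t' gamma_t')
        return digamma(gamma) - digamma(np.sum(gamma))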

answered Jul 08 '10 at 22:27

aria42

@aria42, thanks. But is there any comparison? I came across a paper called "On Evaluation and Smoothing of Topic Models" and they concluded that collapsed variational inference seems to work the best empirically. But I don't know whether that should be the default algorithm in general, or whether it holds for LDA only.

(Jul 09 '10 at 02:34) Liangjie Hong

That is for LDA only, unfortunately. Although, if you read the paper, the collapsed Gibbs sampler comes really close to collapsed variational Bayes, and it ends up being a good default across many graphical models.

(Jul 09 '10 at 07:15) Alexandre Passos ♦

In what sense empirically better? Do you mean convergence or quality of topics? I've never quite believed evaluations of the latter; it's very problem-dependent. The Gibbs vs. variational debate probably doesn't have a solid answer across the board; I would just be more concerned with choosing something that allows you to do inference with your new prior.

(Jul 09 '10 at 09:57) aria42

I haven't used it myself yet, but you might want to check out the Hierarchical Bayes Compiler, which can basically accept your generative equations and return an inference system that uses Gibbs sampling.

answered Jul 08 '10 at 12:59

Aditya Mukherji

Thanks. I'll try it soon.

(Jul 08 '10 at 13:02) Liangjie Hong

I don't think that will work with non-conjugate priors.

(Jul 08 '10 at 22:18) aria42

Theoretically it could, but in practice it probably doesn't support the priors the original asker wants.

(Jul 08 '10 at 22:24) Alexandre Passos ♦