The dirichlet process has two representations commonly used: the chinese restaurant process and the stick-breaking representation. I understand how they work, and how to do sampling on both representations, which is essentially how to write out their marginal representations, or the probability of the next sample given the others. However, say we have a standard DP mixture model:

G   ~  DP(G0, alpha)
z_i ~  G
x_i ~  Normal(z_i, sigma)

If we want to derive a variational algorithm for this model we'll have to define an approximating distribution (usually either a truncated stick-breaking distribution or a finite symmetrical dirichlet distribution) and compute E_q[log P(G)], where G is the DP prior, and I'm not sure how to do this without actually writing down the density of the DP, except I can't find it in most references. Another problem is computing E_q[log P(z|G)], but the papers I know of doing variational inference for DP mixture models ( Blei and Jordan. Variational methods for the dirichlet process and Kurihara, Welling, and Teh. Collapsed variational dirichlet process mixture models) seem to think it's ok to compute E_q[log P(z_i|z-)] (where z- is the set of all zs other than z_i) and use the marginal form of the DP for that. Looking at the resulting update equations it seems as if only a small change is involved between DP mixtures and standard finite mixtures, but I can't wrap my head around:

  1. How can I actually compute the value of the lower bound?

  2. How to write down P(G) and P(z|G) so that I can write down the lower bound, derive it, and get to the update equations? Why is it ok to use the marginal equation P(z_i|z-) in this case but not necessarily so in the finite case?

  3. Am I just supposed to not bound P(G)?

asked Jan 26 '11 at 10:05

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1896744214334

edited Jan 26 '11 at 19:32

Did you find an answer to this question? I'm also curious if it is possible to compute the lower bound at all in this case. I'm implementing a variation of this algorithm at the moment and haven't got a clue about how to monitor convergence and check for bugs without computing it.

(Feb 09 '11 at 08:21) Philemon Brakel

In the truncated stick-breaking representation it is easy to compute the lower bound, as what effectively happens is that all terms in the infinite sum except for a few finite terms are zero, so there is no effective difference between using mean field to approximate the DP and to approximate the truncated distribution.

For the finite symmetric dirichlet case I'm still not sure, but according the other paper ebony cited (apart from the journal version of the Blei and Jordan) there are some deep issues with using the finite dirichlet approximation. They report that not only the lower bound is lower because of unidentifiability (but this can be fixed) but it diverges when K goes to infinity leading to worse approximations instead of better (my memory is fuzzy on the specific reasons, but you should read that paper if you're interested in variational approximations to the DP, it's very thorough).

(Feb 09 '11 at 09:37) Alexandre Passos ♦

Thanks for the information. I started reading the paper ebony suggested a bit earlier today and it seems to be very detailed indeed.

(Feb 09 '11 at 09:54) Philemon Brakel

One Answer:

I don't quite understand why you need E_q[log P(G)] where G is the DP prior? In the Blei & Jordan paper, don't you just start from eq 12 and then just use the truncated stick-breaking distribution (eq 13) to derive the update equations? There is also a journal version of this paper which might give you some more details. There is another paper Mean field inference for the Dirichlet process mixture model which might be useful for you.

answered Jan 27 '11 at 08:31

ebony's gravatar image

ebony
18181014

On that paper it seems that he is computing E[log P(V)] and E[log P(z)] using the truncated expressions for P(V) and P(z), so I don't understand how is this different from directly doing inference on the truncated model.

(Jan 27 '11 at 09:36) Alexandre Passos ♦

I think the difference comes from the fact the model by itself isn't truncated -- it is still fully DP, whereas only the variational distribution is truncated.

(Jan 27 '11 at 11:13) ebony
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.