
From reading the paper, my impression was that the variational Bayesian algorithm for LDA proceeds by:

  • initializing the phi variational mean field for each word in the corpus
  • initializing the gamma variational pseudo-count for each document-topic pair
  • using equation 16 (appendix A.3.1) to update all phi given the values of all gamma and the actual word counts for the documents
  • using equation 17 (appendix A.3.2) to update all gamma given the values of all phi (both update equations are restated after this list)
  • looping over these last two steps until convergence (and, if needed, updating the alpha and beta hyperparameters to further improve the variational lower bound)
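
For reference, this is how I read equations 16 and 17 (Ψ is the digamma function and β the topic-word probabilities; the notation below is my restatement, not a quote from the paper):

    \phi_{ni} \;\propto\; \beta_{i w_n} \exp\!\big(\Psi(\gamma_i) - \Psi(\textstyle\sum_{j=1}^{k} \gamma_j)\big) \qquad \text{(eq. 16)}

    \gamma_i \;=\; \alpha_i + \sum_{n=1}^{N} \phi_{ni} \qquad \text{(eq. 17)}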

However, looking at the corresponding code, it seems to do no such thing. I'm sure I'm misunderstanding something, so:

  • why is inference used every time the document E step is computed (i.e., why does lda-estimate.c:doc_e_step call lda-inference.c:lda_inference)?
  • why in lda-inference.c:lda_inference is phi_nk updated as digamma(gamma_k) + log(Pk(w)) instead of Pk(w) exp(digamma(gamma_k) - digamma(sum(gamma))), as equation 16 from the paper would lead one to conclude? I see that it is computing log(phi_nk) instead of phi_nk, but there seems to be a missing digamma factor. Can he omit that because it disappears after sum-normalizing the phi variables?
  • lda-inference.c:lda_inference seems to do many passes until convergence. Why should it converge when you're just estimating the model? Don't these extra iterations screw up the estimations of phi and gamma?
  • phi seems to be re-estimated starting from a default gamma (in lda-inference.c:lda_inference) equal to alpha*nwords/ntopics, and gamma is then re-estimated (why?) as this_value + count[w] * (phi_nk - oldphi_nk) after phi is normalized, with this estimate updated again for each word in each document during inference. Doesn't this mean that the phi for different words will see different values of gamma? And why not follow equation 17 from the paper?

asked Aug 09 '10 at 16:03

Alexandre Passos ♦

edited Dec 03 '10 at 07:04

Yeah, what you said makes sense, and it isn't obvious to me either why it's done like this. I have myself only looked at the code superficially. :) Maybe you should try dropping a mail to the topic-models mailing list about it: https://lists.cs.princeton.edu/mailman/listinfo/topic-models

I'm guessing some people on that mailing list will have first-hand experience with the code.

(Aug 09 '10 at 18:59) spinxl39

Good idea. I'll do that.

(Aug 09 '10 at 19:00) Alexandre Passos ♦

@Alexandre: Deleted the other comments to avoid any confusion for those reading them. :)

(Aug 09 '10 at 21:09) spinxl39

One Answer:

Ok, let me take a stab at this:

why is inference used every time the document E step is computed (i.e., why does lda-estimate.c:doc_e_step call lda-inference.c:lda_inference)?

Instead of going through steps 3 & 4 above by sweeping across all documents, you can do that on a per-document basis, since the documents are independent conditional on the topics.
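
As a rough structural sketch (this is not the actual lda-c source; the types, stub bodies, and loop bounds are made up for illustration), the outer EM loop runs the full per-document coordinate ascent inside the E step, which is why doc_e_step calls lda_inference:

    #include <stdio.h>

    typedef struct { int length; int *terms; int *counts; } document;

    /* placeholder for the per-document coordinate ascent on phi and gamma
       (see the sketch further down for what happens inside) */
    static double lda_inference(const document *doc) { (void) doc; return 0.0; }

    /* placeholder for accumulating the expected counts used to re-estimate beta */
    static void accumulate_suff_stats(const document *doc) { (void) doc; }

    static double doc_e_step(const document *doc)
    {
        double bound = lda_inference(doc);  /* optimize phi and gamma for this doc only */
        accumulate_suff_stats(doc);         /* contribute this doc's stats to the M step */
        return bound;
    }

    int main(void)
    {
        document corpus[2] = {{0, NULL, NULL}, {0, NULL, NULL}};
        int num_docs = 2, max_iter = 5;

        for (int iter = 0; iter < max_iter; iter++) {
            double bound = 0.0;
            /* E step: documents are conditionally independent given beta,
               so each one can be fully processed on its own */
            for (int d = 0; d < num_docs; d++)
                bound += doc_e_step(&corpus[d]);
            /* M step: re-estimate beta (and optionally alpha) from the
               accumulated sufficient statistics (omitted here) */
            printf("iteration %d, variational bound %f\n", iter, bound);
        }
        return 0;
    }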

why in lda-inference.c:lda_inference is phi_nk updated as digamma(gamma_k) + log(Pk(w)) instead of Pk(w) exp(digamma(gamma_k) - digamma(sum(gamma))), as equation 16 from the paper would lead one to conclude? I see that it is computing log(phi_nk) instead of phi_nk, but there seems to be a missing digamma factor. Can he omit that because it disappears after sum-normalizing the phi variables?

Yes, the last term is constant in k, so once phi is normalized it is irrelevant.
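
Spelled out (using the notation from the question, with Ψ the digamma function; this is just the normalization argument, not a quote from the paper or the code):

    \phi_{nk} \;\propto\; \beta_{k w_n} \exp\!\big(\Psi(\gamma_k) - \Psi(\textstyle\sum_j \gamma_j)\big)
              \;=\; e^{-\Psi(\sum_j \gamma_j)} \, \beta_{k w_n} \, e^{\Psi(\gamma_k)}

The factor e^{-\Psi(\sum_j \gamma_j)} is the same for every k, so it cancels when you divide by \sum_{k'} \phi_{nk'}. Computing \log\phi_{nk} = \log\beta_{k w_n} + \Psi(\gamma_k) up to an additive constant and then normalizing therefore gives exactly the update in equation 16.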

lda-inference.c:lda_inference seems to do many passes until convergence. Why should it converge when you're just estimating the model? Don't these extra iterations screw up the estimations of phi and gamma?

Again, the order in which you do steps 3 and 4 is just a heuristic. It's debatable which approach is actually better, but, as above, the conditional independence of the documents makes this perfectly valid.

phi seems to be re-estimated starting from a default gamma (in lda-inference.c:lda_inference) equal to alpha*nwords/ntopics, and gamma is then re-estimated (why?) as this_value + count[w] * (phi_nk - oldphi_nk) after phi is normalized, with this estimate updated again for each word in each document during inference. Doesn't this mean that the phi for different words will see different values of gamma? And why not follow equation 17 from the paper?

This is coordinate ascent, so again the order is arbitrary. You can update gamma, then update all the phis, then update gamma, then update all the phis, and so on. Or you can update gamma, update phi_1, update gamma, update phi_2, and just repeat. I think the approach here gives you empirically better convergence, but it's really a toss-up.
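
Here is a small self-contained sketch of that per-word schedule (to be clear, this is not the actual lda_inference code: the toy beta matrix, the digamma approximation, and the initialization are all made up for illustration). Because gamma is adjusted by count * (new phi - old phi) immediately after each word's phi row is normalized, later words in the same pass see the already-updated gamma:

    #include <stdio.h>
    #include <math.h>

    #define K 3   /* number of topics */
    #define N 4   /* number of distinct terms in the toy document */

    /* crude digamma approximation, good enough for a sketch */
    static double digamma(double x)
    {
        double r = 0.0;
        while (x < 6.0) { r -= 1.0 / x; x += 1.0; }
        double f = 1.0 / (x * x);
        return r + log(x) - 0.5 / x - f * (1.0/12 - f * (1.0/120 - f / 252));
    }

    int main(void)
    {
        /* toy inputs: word counts and made-up per-topic word probabilities */
        int    counts[N]  = {2, 1, 3, 1};
        double beta[K][N] = {{0.5, 0.2, 0.2, 0.1},
                             {0.1, 0.4, 0.3, 0.2},
                             {0.3, 0.1, 0.5, 0.1}};
        double alpha = 0.1, total = 0.0;
        double gamma[K], phi[N][K];

        for (int n = 0; n < N; n++) total += counts[n];

        /* initialize phi uniformly and gamma consistently with it, so that
           gamma_k = alpha + sum_n counts[n] * phi[n][k] holds from the start */
        for (int k = 0; k < K; k++) gamma[k] = alpha + total / K;
        for (int n = 0; n < N; n++)
            for (int k = 0; k < K; k++) phi[n][k] = 1.0 / K;

        for (int iter = 0; iter < 20; iter++) {
            for (int n = 0; n < N; n++) {
                double oldphi[K], norm = 0.0;
                for (int k = 0; k < K; k++) {
                    oldphi[k] = phi[n][k];
                    /* beta_k(w) * exp(digamma(gamma_k)); the digamma(sum gamma)
                       term is dropped since it cancels in the normalization */
                    phi[n][k] = beta[k][n] * exp(digamma(gamma[k]));
                    norm += phi[n][k];
                }
                for (int k = 0; k < K; k++) {
                    phi[n][k] /= norm;
                    /* incremental gamma update: gamma is adjusted right away,
                       so the next word's phi update uses the new gamma */
                    gamma[k] += counts[n] * (phi[n][k] - oldphi[k]);
                }
            }
        }

        for (int k = 0; k < K; k++) printf("gamma[%d] = %f\n", k, gamma[k]);
        return 0;
    }

The incremental update keeps the invariant gamma_k = alpha + sum_n counts[n] * phi[n][k] at all times, so at the end of any pass gamma agrees with a batch application of equation 17 to the current phi; the only difference is the schedule.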

answered Aug 09 '10 at 19:18

Jonathan Chang

Thanks a lot :-).

(Aug 09 '10 at 19:20) Alexandre Passos ♦