I am stuck on Eq 1 in the paper by D. Mimno and D. Blei, “Bayesian Checking for Topic Models,” Empirical Methods in Natural Language Processing, 2011, and would appreciate a push off dead-center.

It seems that the terms in the first expression in Eq 1 should be available from the phi and theta matrices that result from a traditional LDA topic analysis. However, I keep stumbling over the calculation of P(w,d | k) and P(w | d,k).

Any insight would be appreciated.

asked Jan 26 at 16:45

Aengus Robinson


One Answer:

Those equations are not about the full posterior of the LDA model but about counts collected in a single Gibbs sample (although you could replace them with averages over many samples, which would then give you posterior estimates of those quantities). Regarding the terms inside the logarithms, P(w|d,k) is the relative frequency of word w among the tokens assigned to topic k in document d: the count of word w assigned to topic k in document d, divided by the count of all words assigned to topic k in document d. The other quantities are ratios of counts in the same way; for example, P(w,d|k) is that same count divided by the total number of tokens assigned to topic k.
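To make the counting concrete, here is a minimal sketch in plain Python. The toy token list, the document and word labels, and the helper name `topic_estimates` are all made up for illustration; they do not come from the paper or from any LDA library. It assumes you have one (document, word, topic) triple per token from a single Gibbs sample.

```python
from collections import Counter

# Hypothetical Gibbs state: one (document, word, topic) triple per token.
tokens = [
    ("d0", "apple", 0), ("d0", "pear", 0), ("d0", "pear", 1),
    ("d1", "apple", 0), ("d1", "plum", 0), ("d1", "plum", 0),
    ("d1", "pear", 1),
]

def topic_estimates(tokens, k):
    """Count-based estimates of P(w,d|k), P(w|d,k), P(w|k) and P(d|k)."""
    # N(w, d, k): how often word w is assigned to topic k in document d.
    n_wd = Counter((w, d) for d, w, z in tokens if z == k)
    # Marginal counts N(w, k) and N(d, k), and the topic total N(k).
    n_w, n_d = Counter(), Counter()
    for (w, d), n in n_wd.items():
        n_w[w] += n
        n_d[d] += n
    n_k = sum(n_wd.values())
    p_wd = {wd: n / n_k for wd, n in n_wd.items()}
    p_w_given_d = {(w, d): n / n_d[d] for (w, d), n in n_wd.items()}
    p_w = {w: n / n_k for w, n in n_w.items()}
    p_d = {d: n / n_k for d, n in n_d.items()}
    return p_wd, p_w_given_d, p_w, p_d

p_wd, p_w_given_d, p_w, p_d = topic_estimates(tokens, k=0)
print(p_w_given_d[("plum", "d1")])  # 2 of the 3 topic-0 tokens in d1
print(p_wd[("apple", "d0")])        # 1 of the 5 topic-0 tokens overall
```

Note that the two quantities from the question differ only in the denominator, and that P(w,d|k) = P(w|d,k) P(d|k) holds for these count-based estimates by construction.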

answered Jan 26 at 17:12

Alexandre Passos ♦

Alexandre: thanks, but I think that's pretty much what the paper already says. I'm looking for some additional insight into how the elements of the information-based measure, MI, relate to the distributions within Theta and Phi.

(Jan 26 at 19:01) Aengus Robinson

I see. That value is not directly related to the LDA model at all; it's an external metric. It measures the mutual information between word types and documents for a set of words, given their topic assignments. Ideally, according to the LDA model, this mutual information should be zero (that is, knowing which document a word is in should tell you nothing about its identity if you already know which topic it came from). It is therefore more directly related to the model's conditional independence assumptions than to its specific parametrization.
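A small sketch of that idea, using the standard definition of mutual information on count-based estimates for a single topic. The function name and the two toy data sets are hypothetical, and this is not claimed to reproduce the paper's exact procedure. Each pair is a (word, document) token already assigned to the topic of interest.

```python
import math
from collections import Counter

def topic_mutual_information(pairs):
    """Empirical MI between word and document for one topic's tokens.

    Computes sum over (w, d) of P(w,d|k) * log(P(w,d|k) / (P(w|k) P(d|k))),
    with every probability estimated as a ratio of counts.
    """
    n = len(pairs)
    n_wd = Counter(pairs)
    n_w = Counter(w for w, _ in pairs)
    n_d = Counter(d for _, d in pairs)
    mi = 0.0
    for (w, d), c in n_wd.items():
        # (c/n) / ((n_w/n) * (n_d/n)) simplifies to c*n / (n_w * n_d).
        mi += (c / n) * math.log(c * n / (n_w[w] * n_d[d]))
    return mi

# Word proportions are identical in both documents: the document tells you
# nothing about the word once the topic is known.
independent = [("a", "d0"), ("b", "d0"), ("a", "d1"), ("b", "d1")]
# Each word occurs in only one document: the document determines the word.
dependent = [("a", "d0"), ("a", "d0"), ("b", "d1"), ("b", "d1")]

mi_independent = topic_mutual_information(independent)
mi_dependent = topic_mutual_information(dependent)
print(mi_independent, mi_dependent)
```

The first case gives zero and the second gives log 2 (in nats). With real data the count-based estimate will generally come out above zero even when the independence assumption holds, simply because the counts are finite, so the raw value needs a reference point to be interpreted.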

(Jan 26 at 22:00) Alexandre Passos ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.