|
In "Finding scientific topics", model selection is used to choose the appropriate number of topics. The harmonic mean of a set of values P(W|Z,T) is used to approximate P(W|T). For a model with K topics, is it 1/P(W|T=K) approx (1/M)*(sum_over_d 1/P_d(W|Z,T=K)), for a set M of documents d? And is P_d(W|Z,T=K) the product of the phi values? Computed this way, logP(W|T) is a negative number in my experiments, quite different from the values of magnitude 10^7 in Fig. 3 of "Finding scientific topics". I'm confused and not sure whether the formula is right. Can anyone help me? Thanks a lot. |
|
P(w|z) is a multinomial distribution or, the way it is done in "Finding scientific topics", an integrated Dirichlet-multinomial pair, so it is either a probability from your topic-word distribution or the smoothed, normalized counts from that same distribution. Each w then has an independent probability, so P(W|Z) is the product, over all words, of the probability of that word being drawn from its assigned topic. |
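To make the product concrete, here is a minimal sketch that accumulates it as a sum of logs. The function name and the toy `phi`, `words` and `topics` values are made up for illustration; `phi[z][w]` is assumed to hold the probability of word w under topic z. Since every factor is at most 1, the result is never positive.

```python
import math

def log_p_words_given_topics(words, topics, phi):
    """Return log P(W|Z) = sum over tokens of log phi[z_i][w_i].

    words  : word id of each token
    topics : topic assigned to each token (same length as words)
    phi    : phi[z][w] = probability of word w under topic z (assumed layout)
    """
    return sum(math.log(phi[z][w]) for w, z in zip(words, topics))

# Toy example (hypothetical numbers): 2 topics, vocabulary of 2 words.
phi = [[0.5, 0.5],
       [0.9, 0.1]]
words = [0, 1, 0]
topics = [0, 0, 1]
log_lik = log_p_words_given_topics(words, topics, phi)  # log(0.5 * 0.5 * 0.9)
```

Summing logs gives the same quantity as taking the log of the product, but it stays finite on a real corpus where the raw product would underflow to zero.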
|
Thanks, Alexandre. Is it right that 1/P(W|T=K) approx (1/K)*(sum_over_j 1/P(W|Z=j)), where P(W|Z=j) is the product of P(w=t|Z=j) over the V terms? For a corpus with K=200 and V=38898, logP(W|T=K) is -Infinity, so I guess something is still wrong. I also tried formula (2) in "Finding scientific topics" on the same corpus, with the counts from Gibbs sampling. However, it is not applicable, since gamma(V*beta) is infinity with V=38898 and beta=0.1. What is the problem? Could you please give me some hints? Many thanks.
You need to work in log space: log(1/prod(p)) is -sum(log p), which avoids underflow. Similarly, use gammaln instead of computing log(gamma(x)) yourself.
(Apr 05 '11 at 03:39)
Alexandre Passos ♦
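A rough sketch of what working in log space could look like for both pieces. It rests on two assumptions: that formula (2) is the Dirichlet-multinomial form P(w|z) = (Gamma(V*beta)/Gamma(beta)^V)^K * prod_j [prod_w Gamma(n_jw + beta)] / Gamma(n_j + V*beta), and that the harmonic mean is taken over the log-likelihoods of several Gibbs samples of z. The function names are made up, and Python's standard `math.lgamma` stands in for `gammaln` (SciPy's `scipy.special.gammaln` is the vectorized equivalent).

```python
import math

def log_p_w_given_z(topic_word_counts, beta):
    """log P(w|z) for one Gibbs sample, from K rows of V word counts.

    Uses lgamma throughout, so gamma(V*beta) is never evaluated directly.
    """
    V = len(topic_word_counts[0])
    total = 0.0
    for row in topic_word_counts:
        total += math.lgamma(V * beta) - V * math.lgamma(beta)
        total += sum(math.lgamma(n + beta) for n in row)
        total -= math.lgamma(sum(row) + V * beta)
    return total

def log_harmonic_mean(log_liks):
    """log of the harmonic mean of exp(log_liks), without leaving log space.

    log H = log S - log(sum_s exp(-log_lik_s)), with the largest term
    factored out of the sum before exponentiating.
    """
    neg = [-x for x in log_liks]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(len(log_liks)) - log_sum

# Log-likelihoods this large in magnitude would underflow if exponentiated directly.
estimate = log_harmonic_mean([-7000000.0, -7000010.0, -7000005.0])
```

With V=38898 and beta=0.1, `math.lgamma(V * beta)` is an ordinary finite number, whereas gamma(V*beta) itself overflows. The harmonic mean is dominated by the smallest likelihood, so the estimate lands close to the lowest log-likelihood in the list.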
|
|
Thanks. Does it make sense that the approximations of P(W|T) differ depending on whether I use the probability from the topic-word distribution or the smoothed normalized counts from that same distribution? |