|
If I have, say, a collapsed Gibbs sampler for LDA (like the original Griffiths and Steyvers sampler), can I compute the log-likelihood of a sample (up to a normalization constant) just by adding up the log-probability of each word being assigned to the topic it is currently assigned to, or must I start from the joint equation for the likelihood and fill in the terms ("uncollapsing" the model in the process by estimating the likelihood of the posterior for the collapsed variables)? At first I thought this must work, since the sampling equations are just the likelihood of the model (up to a constant factor) restricted to the terms that depend on that single variable; on the other hand, it feels weird to perform the metaphorical "remove that variable and insert it with a value proportional to the model likelihood using that value" process to compute the likelihood. If this is true, is it so for all graphical models? |
|
This paper, Evaluation Methods for Topic Models, has a great discussion of this and other issues around computing the "held-out" marginal likelihood of a new document given a collection of documents. It's much more comprehensive than I will be here. To answer your question more directly: is what you want to compute the marginal likelihood of new data given old data and sampled values of the topic variables (wherever Gibbs sampling landed)? Given old data and samples of the topic variables, you can compute the collapsed probability of a word given a topic by integrating out the Dirichlet parametrization; you can also compute the marginal probability of a word by marginalizing out $T$. The problem is that the words in your new document are not independent, since the parameters correlate all the words of a document. There are a couple of things you can do:
(1) Just compute the per-word perplexity and essentially ignore the correlation between words.
(2) Using the data you trained on, get a point estimate of the parameters, $\hat{\theta}$, and compute the exact likelihood of the new data under that single point estimate; this decouples the words of the document and lets you easily compute the log probability of the document.
I was almost looking for the information in this paper, so thanks. However, I was looking for the training set likelihood, the one you measure to see whether your model has reached a comfortable zone and to help you resample the hyperparameters and such things.
(Jul 02 '10 at 16:50)
Alexandre Passos ♦
I'll mark this answer as accepted because it is usually what someone reading this question must be after. Also, I really liked reading your papers on coreference and summarization.
(Jul 02 '10 at 16:53)
Alexandre Passos ♦
Thank you so much. That's very kind.
(Jul 03 '10 at 10:20)
aria42
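Option (2) from the answer above can be sketched roughly as follows. This is only a minimal illustration, not code from the paper or from any poster: the function names, the toy counts, the symmetric smoothing value `beta`, and the fixed topic mixture `theta` used for the new document are all assumptions made for the example.

```python
import math

def topic_word_estimate(topic_word_counts, beta):
    """Posterior-mean point estimate of each topic's word distribution.

    topic_word_counts[k][w] is the number of training tokens of word w
    currently assigned to topic k; beta is a symmetric Dirichlet
    hyperparameter (assumed here).
    """
    vocab_size = len(topic_word_counts[0])
    phi = []
    for counts in topic_word_counts:
        total = sum(counts) + vocab_size * beta
        phi.append([(c + beta) / total for c in counts])
    return phi

def document_log_prob(doc, phi, theta):
    """Log probability of a document (a list of word ids) with phi and theta
    held fixed, so that the words become independent of each other."""
    log_prob = 0.0
    for w in doc:
        p_w = sum(theta[k] * phi[k][w] for k in range(len(phi)))
        log_prob += math.log(p_w)
    return log_prob

# Toy example: 2 topics, vocabulary of 3 words (made-up counts).
counts = [[3, 1, 0], [0, 2, 4]]
phi = topic_word_estimate(counts, beta=0.1)
theta = [0.5, 0.5]   # assumed topic mixture for the new document
new_doc = [0, 2, 2]
ll = document_log_prob(new_doc, phi, theta)
perplexity = math.exp(-ll / len(new_doc))
print(ll, perplexity)
```

Because the parameters are fixed, the document log probability is just a sum over its words, which is exactly the decoupling the answer describes. How to choose `theta` for an unseen document is left open here; the paper mentioned above discusses better-founded alternatives.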
|
|
The purpose of what you're doing isn't clear. Why not just report the marginal (collapsed) likelihood? The uncollapsed likelihood doesn't have any special status. You can always expand a model to include more latent variables while leaving the marginal likelihood unchanged. Integrating out nuisance parameters, rather than fixing them at some specific value, is a natural choice. See Bayarri and DeGroot (1992), Difficulties and ambiguities in the definition of a likelihood function.
That's an interesting paper, and I've added it to my reading list, although it does not seem to deal with my problem specifically.
(Jul 02 '10 at 16:47)
Alexandre Passos ♦
|
Do you mean "training set likelihood" or "test set likelihood"?
I guess they are calculated differently. For "training set likelihood", Griffiths et al. discuss it in their paper.
I meant training set likelihood. Referring to Griffiths et al, my question was "does it work if I do model selection based on the current values of equation 5 for all z_i, or must I always compute the values of equations 2 and 3?".
Have you found a solution to this issue?
Yes. I was looking for the complete-data likelihood of the training data, and it is trivially not equal to just summing over the sampling probabilities. Its value is, for example, easy to compute from the formulas in the Griffiths and Steyvers Finding Scientific Topics paper, as Turian mentioned.
Part of my question arose because I had confused the complete-data likelihood of the training data with the held-out marginal likelihood of the test data.
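For readers who want to see what that computation looks like, here is a minimal sketch of the complete-data log-likelihood $\log P(w, z) = \log P(w \mid z) + \log P(z)$ with the topic-word and document-topic distributions integrated out, which is how I understand the collapsed expressions in the Griffiths and Steyvers paper. It assumes symmetric Dirichlet hyperparameters `alpha` and `beta`; the function names and toy counts are made up for illustration.

```python
import math

def _log_dirichlet_multinomial(count_rows, prior):
    """Sum over rows of the log Dirichlet-multinomial term for a symmetric
    Dirichlet(prior): the row's distribution is integrated out."""
    n_rows = len(count_rows)
    dim = len(count_rows[0])
    result = n_rows * (math.lgamma(dim * prior) - dim * math.lgamma(prior))
    for counts in count_rows:
        result += sum(math.lgamma(c + prior) for c in counts)
        result -= math.lgamma(sum(counts) + dim * prior)
    return result

def log_p_words_given_topics(topic_word_counts, beta):
    """log P(w | z); topic_word_counts[k][w] = tokens of word w in topic k."""
    return _log_dirichlet_multinomial(topic_word_counts, beta)

def log_p_topics(doc_topic_counts, alpha):
    """log P(z); doc_topic_counts[d][k] = tokens of document d in topic k."""
    return _log_dirichlet_multinomial(doc_topic_counts, alpha)

def complete_data_log_likelihood(topic_word_counts, doc_topic_counts, alpha, beta):
    """log P(w, z) for the current assignment of every token to a topic."""
    return (log_p_words_given_topics(topic_word_counts, beta)
            + log_p_topics(doc_topic_counts, alpha))

# Toy state of a sampler: 2 topics, 3 words, 2 documents, 10 tokens in total.
topic_word = [[3, 1, 0], [0, 2, 4]]
doc_topic = [[3, 2], [1, 4]]
joint = complete_data_log_likelihood(topic_word, doc_topic, alpha=0.5, beta=0.1)
print(joint)
```

Only the count tables that a collapsed Gibbs sampler already maintains are needed, so this value can be tracked after each sweep to monitor convergence. It is a different quantity from the sum of the per-token log sampling probabilities, since each of those conditions on all the other assignments.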