I'm trying to understand the idea behind topic models in document clustering. In Latent Dirichlet Allocation, it is necessary to approximate the posterior distribution over topic assignments for the documents. One of the methods for doing this is Gibbs sampling. I'm not sure I understand how it works; I'm basing my understanding on this paper: http://faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf.

It starts with a random word-topic assignment, and then new samples are computed iteratively, each conditioned on the previous results; after some number of iterations it starts to converge. But I don't really get why multiple samples need to be kept. The paper mentions discarding the samples from the first iterations (the author calls this the burn-in phase). What exactly does he mean by 'discarding' these samples? Isn't the result of the last iteration the best approximation of the desired distribution?

I would be grateful for a very simple explanation; my knowledge of this area is quite limited.

asked Aug 18 '13 at 08:51


watari

edited Aug 18 '13 at 08:52


2 Answers:

Thank you very much for your explanation, but there is still one more thing. In some papers I read that we keep samples at a specified interval, for example every 100th sample (to avoid correlation between them), and then calculate the mean, which is taken as the approximation of the true distribution. So my question is: let's say we perform 10000 iterations (generate 10000 samples) - isn't the 10000th sample the best approximation of the desired distribution? Why calculate the mean?

answered Aug 19 '13 at 12:30


watari

edited Aug 19 '13 at 12:31

We have to collect enough samples for a robust approximation: a single draw is just one point from the posterior, not an estimate of the whole distribution. What you suggest would amount to starting at the 10000th iteration and drawing a huge number of samples after that. Since we can't run the chain forever, we approximate this by keeping one sample every N iterations. The basic idea is that nearby samples are correlated; to overcome that correlation we either need a huge number of samples, or we keep samples that are spaced far apart from each other.
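
To make that concrete, here is a minimal sketch (not LDA, just a two-variable Gibbs sampler for a correlated Gaussian) showing burn-in, thinning every Nth draw, and averaging the kept samples; all the numbers are arbitrary:

```python
# Minimal sketch: Gibbs sampling a standard bivariate normal with
# correlation rho, illustrating burn-in, thinning (keep every Nth draw),
# and averaging. Not LDA; it only shows why many thinned samples are
# averaged instead of trusting the single last draw.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9                      # correlation between x and y
n_iter, burn_in, thin = 10000, 1000, 100

x, y = 0.0, 0.0                # arbitrary starting point
kept = []
for it in range(n_iter):
    # Conditionals of a standard bivariate normal:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    if it >= burn_in and (it - burn_in) % thin == 0:
        kept.append((x, y))

kept = np.array(kept)
print("mean of kept samples:", kept.mean(axis=0))  # close to (0, 0)
print("last draw:           ", kept[-1])           # just one noisy point
```

The average over the kept (roughly independent) draws is a much more stable estimate than the single final draw, which is only one random point from the distribution.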

(Aug 21 '13 at 08:59) Arun Kumar

Let's start with sampling. Sampling is done here to approximate the true distribution: we collect samples from the distribution without ever computing it explicitly, which would be computationally intensive. Gibbs sampling belongs to the class of MCMC (Markov chain Monte Carlo) algorithms. Basically, we start from some starting point and sample a new point by a procedure (the sampling step); we then move to that new point and again randomly choose where to go next, based on the underlying probability distribution.
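
For concreteness, here is a toy sketch of the collapsed Gibbs sampler that the Steyvers & Griffiths chapter describes for LDA: every word token's topic is resampled in turn, conditioned on the current assignments of all other tokens. The tiny corpus, hyperparameter values, and variable names are only illustrative:

```python
# Toy collapsed Gibbs sampler for LDA. Each token's topic is removed from
# the counts, resampled from its conditional distribution, and added back.
import numpy as np

rng = np.random.default_rng(1)
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 4, 5, 1]]  # word ids
V, K = 6, 2                    # vocabulary size, number of topics
alpha, beta = 0.5, 0.1         # Dirichlet hyperparameters

# Count matrices: document-topic, topic-word, topic totals.
n_dk = np.zeros((len(docs), K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)

# Random initial topic assignment for every token.
z = [[rng.integers(K) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for sweep in range(2000):      # one sweep = resample every token once
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove this token's current assignment from the counts...
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # ...compute p(z_i = k | all other z) up to a constant...
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            # ...and add the freshly sampled assignment back.
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# Point estimate of the topic-word distributions from the final counts.
phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)
print(np.round(phi, 2))
```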

You can think of this step like the PageRank calculation. Suppose you start at a page such as yahoo.com, randomly follow a link out of that page, and so on; the probability of taking a particular link depends on its weight among all the outgoing links. If you continue this process long enough, in the limit the probability of being at any given page is proportional to its PageRank. So we obtain the global probability of being on a page using only the relative probabilities of the links on each page. Notice that in the initial steps the probability of being at a page close to yahoo.com is higher, but in the limit every page is visited with its true probability. This gives a tradeoff between accuracy and time: the longer you run, the closer the distribution is to the true one. The burn-in period is simply the point after which we declare the distribution close enough and start keeping samples from there.
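
Here is a rough sketch of that random-surfer idea on a made-up three-page link graph: the first step still depends entirely on the starting page, while the visit frequencies after a burn-in are close to the stationary distribution (the "PageRank"):

```python
# Random walk on a tiny made-up link graph: early steps depend on the
# start page, long-run visit frequencies approach the stationary
# distribution regardless of where the walk started.
import numpy as np

rng = np.random.default_rng(2)
# Row-stochastic transition matrix: P[i, j] = probability of following
# the link from page i to page j.
P = np.array([[0.0, 0.7, 0.3],
              [0.4, 0.0, 0.6],
              [0.5, 0.5, 0.0]])

# Reference: the stationary distribution, computed by power iteration.
pi = np.ones(3) / 3
for _ in range(1000):
    pi = pi @ P

# One step from page 0 the walker can only be on a page that page 0
# links to, so the distribution still reflects the starting point.
print("after 1 step from page 0:", P[0])

# Simulate a long walk; visit frequencies after a burn-in approach pi.
state, visits = 0, []
for step in range(50000):
    state = rng.choice(3, p=P[state])
    visits.append(state)
burned = visits[5000:]
freq = np.bincount(burned, minlength=3) / len(burned)
print("stationary (PageRank):    ", np.round(pi, 3))
print("visit freq. after burn-in:", np.round(freq, 3))
```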

answered Aug 19 '13 at 01:08


Arun Kumar
