Hi all,

I have a question about the meaning of the prior distribution in generative models. Take standard LDA, for example, where we have a Dirichlet prior with parameter alpha over theta. If I interpret it correctly, it encodes our prior knowledge of the theta values, so E[theta_i] should equal alpha_i / sum_j alpha_j. This part is clear. However, what happens if we set one component of the Dirichlet alpha equal to 0? In theory, this means that the expected value E[theta_i] for that particular topic should equal 0, i.e., on average we don't expect to see any words assigned to this topic. But in reality, in my case, even when I set one component equal to 0, that topic still gets words assigned to it.

So, what is the point of the prior distribution in the model? Does it only help to prevent over-fitting?

Thanks for any hints.

asked Aug 11 '10 at 03:25

Liangjie Hong

edited Aug 11 '10 at 03:26


4 Answers:

Just to touch on the general issue of inference, introducing a Dirichlet prior can be viewed as adding extra data to the problem.

For instance, suppose you observe k heads out of n tosses and your prior is Dirichlet(a, b). What is the probability that the next toss is heads? The Bayesian answer is (k + a)/(n + a + b). This is the same as the maximum likelihood estimate you would get if you had observed a additional heads and b additional tails. As n increases with a and b fixed, the relative contribution of a and b becomes small. In the limit of a and b going to 0, the Bayesian and maximum likelihood estimates coincide.
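
A minimal sketch of this in code (plain Python 3; the prior values and the coin bias below are made up for illustration):

    # Compare the Bayesian estimate (k + a)/(n + a + b) with the maximum
    # likelihood estimate k/n as the amount of data grows, for a fixed prior.
    def bayes_estimate(k, n, a, b):
        return (k + a) / (n + a + b)

    def ml_estimate(k, n):
        return k / n

    a, b = 2.0, 2.0    # prior pseudo-counts for heads and tails
    true_p = 0.7       # assumed probability of heads

    for n in (10, 100, 10000):
        k = int(round(true_p * n))   # idealized number of observed heads
        print(n, ml_estimate(k, n), bayes_estimate(k, n, a, b))

With n = 10 the prior pulls the estimate noticeably toward 0.5; by n = 10000 the two estimates are nearly identical.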

answered Aug 11 '10 at 15:22

Yaroslav Bulatov

I'll answer my own question.

After some debugging (the code is indeed correct) and thinking, I believe the problem is not really a problem. If a component of alpha equals zero, then the expected value of the corresponding component of theta should also be near or equal to zero. However, since this is an expected value, it does not imply that the actual counts for this topic are completely empty. You may still see counts for this topic, but on average only a very small number of documents get counts there, and on average theta_i is very small (near zero).

Therefore, I think that merely setting some component of the Dirichlet prior to zero does not really impose a zero-count constraint (sparsity). We need other techniques to impose a sparse prior.
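
To make this concrete, here is a small sketch (it uses numpy, and the near-zero value 0.01 and the document/word counts are my own choices for illustration) showing that a near-zero alpha_i gives a near-zero average theta_i without forcing every sampled count to zero:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([1.0, 1.0, 0.01])   # third component near zero

    n_docs, words_per_doc = 1000, 50
    docs_with_topic3 = 0
    theta3_sum = 0.0
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)                      # per-document topic proportions
        counts = rng.multinomial(words_per_doc, theta)    # topic counts for this document's words
        theta3_sum += theta[2]
        if counts[2] > 0:
            docs_with_topic3 += 1

    print("average theta_3:", theta3_sum / n_docs)   # close to 0.01 / 2.01, i.e. about 0.005
    print("documents with any word in topic 3:", docs_with_topic3, "of", n_docs)

On average theta_3 is tiny, yet some documents still put a word or two in topic 3.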

answered Aug 11 '10 at 14:08

Liangjie Hong

Just out of curiosity, which inference algorithm are you using?

(Aug 11 '10 at 14:11) Alexandre Passos ♦

Gibbs sampling. If I use a uniform initialization, the result is what I described.

BTW, when I said "equal to zero", I did not actually set it to 0 but to something like 1e-15.

(Aug 11 '10 at 14:15) Liangjie Hong

Ok. So let it be equal to zero and sample from your topic prior for initialization, instead of sampling uniformly, and you will see that the corresponding topic will have no words assigned to it.

(Aug 11 '10 at 14:22) Alexandre Passos ♦

@Alex, I agree that if I initialize the counts non-uniformly, I get the desired result. However, as you said, this is a hack rather than a real solution to the problem.

I just came across the paper "Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process". They have a simple method to really impose sparsity on the counts.

(Aug 11 '10 at 14:29) Liangjie Hong

Yes, that is a very good paper. I was thinking you were more interested in something like what is described in the Labeled LDA paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.155.3678&rep=rep1&type=pdf

(Aug 11 '10 at 14:31) Alexandre Passos ♦

@Alex, thanks a lot!

(Aug 11 '10 at 14:39) Liangjie Hong

You probably have a bug in your inference algorithm. Are you using sampling or variational inference? Either way, when you initialize the topics you should take the prior into account. If you just assign words to topics uniformly, you risk taking a very long time until your model settles down to the correct posterior proportions, and with variational inference this might never happen.

A quick hack I use with asymmetric priors (that is, with one alpha_i different from the others) is to assign words not according to uniform(Ntopics) but according to Discrete(alpha/sum(alpha)), or even Discrete(Dirichlet(alpha)), which is more correct, or even Discrete(Dirichlet(alpha + current_counts)), although this can sometimes bias your model in uninteresting ways, especially if the alphas are small.
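
For concreteness, here is a sketch of what such an initialization might look like (the function names, the 1e-15 value, and the word count are mine for illustration, not from any particular LDA implementation; uses numpy):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([1.0, 1.0, 1e-15])   # asymmetric prior with one near-zero component
    n_topics = len(alpha)

    def init_uniform(n_words):
        # naive initialization: every topic equally likely
        return rng.integers(n_topics, size=n_words)

    def init_from_prior_mean(n_words):
        # Discrete(alpha / sum(alpha)): topics drawn in proportion to the prior mean
        return rng.choice(n_topics, size=n_words, p=alpha / alpha.sum())

    def init_from_dirichlet(n_words):
        # Discrete(Dirichlet(alpha)): first draw theta from the prior, then draw topics
        theta = rng.dirichlet(alpha)
        return rng.choice(n_topics, size=n_words, p=theta)

    for init in (init_uniform, init_from_prior_mean, init_from_dirichlet):
        z = init(10000)
        print(init.__name__, np.bincount(z, minlength=n_topics))

With the uniform initialization each topic starts with roughly a third of the words; with the prior-based initializations the near-zero topic starts essentially empty.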

Noel Welsh is correct that the Dirichlet distribution is not defined for alpha = 0, but it doesn't lose its meaning or any of its properties if you just assume that a component with prior alpha_i = 0 has posterior counts of 0 with probability one.

To see why the distribution stays empty forever at the component with alpha_i = 0, you have to look at the sampling/variational equations. When Gibbs sampling, the probability of adding a word to a topic is proportional to (count_topic_i_in_document + alpha_i) * (count_word_in_topic_i + beta_i) / (all_words_in_topic_i + V*beta_i). So if alpha_i is 0 and count_topic_i_in_document is zero as well, this probability is zero and no word will ever be assigned to that topic. With variational inference this is less clear, but I don't know of any VB implementation out there that lets you use asymmetric alphas.
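
To spell that out numerically, here is a tiny check of that term (the counts, V, and beta values are invented for illustration; uses numpy):

    # p(topic i) is proportional to
    # (count_topic_i_in_document + alpha_i) * (count_word_in_topic_i + beta) / (all_words_in_topic_i + V*beta)
    import numpy as np

    V = 1000                                      # vocabulary size
    beta = 0.01                                   # symmetric topic-word prior
    alpha = np.array([0.1, 0.1, 0.0])             # last topic has alpha_i = 0
    count_topic_in_doc = np.array([5, 3, 0])      # counts of each topic in the current document
    count_word_in_topic = np.array([12, 7, 0])    # counts of the current word in each topic
    all_words_in_topic = np.array([400, 350, 0])  # total words assigned to each topic

    weights = (count_topic_in_doc + alpha) * (count_word_in_topic + beta) / (all_words_in_topic + V * beta)
    print(weights / weights.sum())   # the last topic gets probability exactly 0

Since both alpha_i and the topic's count in the document are zero, the unnormalized weight for that topic is exactly zero, so it can never be resurrected.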

answered Aug 11 '10 at 05:30

Alexandre Passos ♦

edited Aug 11 '10 at 06:14

Hi @Alex, first, thanks. I don't quite get why, if alpha_i = 0, count_topic_i_in_document should go to zero. Why does that happen?

(Aug 11 '10 at 12:55) Liangjie Hong

I use alpha_i because alpha is a vector, so by alpha_i I mean the i-th component of that vector.

About the counts, what I mean is: if you set one of the alphas to zero and don't put any words in the corresponding topic, no words will ever be put there afterwards, since the probability will be zero. Also, when sampling a theta from the alpha (assuming your data is an actual sample from an LDA model), no such theta will have a nonzero value for that component, since, all else being equal, the expected value of theta_i goes to zero as alpha_i goes to zero. And since the corresponding topic has no words, the count of that topic in the document will be zero.

(Aug 11 '10 at 13:56) Alexandre Passos ♦

@Alex, I see. So the criterion is that there are no words in this topic. Please see my own answer.

(Aug 11 '10 at 14:02) Liangjie Hong

There are at least two questions here: what role a prior distribution plays in Bayesian inference, and why your particular implementation of LDA behaves the way it does.

Addressing the second question first: the Dirichlet distribution is only defined for alpha > 0. The Gamma function, from which the normalising constant is calculated, is not defined at zero. My guess is that you're seeing a numerical issue in your code.

Now onto the first question. One view is that yes, the prior is just a way of performing smoothing. Others argue that there is always a prior, it just might not be explicit. From a purely engineering point of view, you need a prior to compute a posterior. I find Bayesian methods appealing because they 1) quantify uncertainty in the true model, which is useful for decision making, and 2) provide a simple unifying framework for inference.

answered Aug 11 '10 at 04:48

Noel Welsh

Thanks for the answer. Since the prior acts as a kind of smoothing, does that mean that if we have enough data, the role of smoothing becomes less important and less effective? Back in the LDA scenario, if I already have enough counts for a particular topic, does it matter that the prior is near 0 given the counts we already have?

(Aug 11 '10 at 12:56) Liangjie Hong

Yes, as the amount of data increases, the effect of the prior is diminished. In the limit, the Bayesian and maximum likelihood solutions should converge to the same point estimate. (This convergence property is known as consistency.) Smoothing isn't less effective in this case; rather, it is less important.

(Aug 12 '10 at 03:58) Noel Welsh

@Liangjie Hong: this depends on many things. If all the other priors are big and this one is small, this topic will suffer and will probably have almost no words assigned to it (if you initialize correctly). I don't think you can say that priors become less important in LDA, because of identifiability issues: there is nothing a priori to distinguish one topic from another, which makes it hard to say that you have a lot of counts in one topic before running the model.

(Aug 12 '10 at 11:48) Alexandre Passos ♦