|
Hello,

Perhaps this is not quite the right forum for the question, but I ran into some trouble while implementing the Gibbs Sampling for Naive Bayes used here (it's a really nice intro, worth checking out). I have one question and one issue:

-Question: In equation 49, when doing the Gibbs sampling updates, are C_x and N modified so that they do not take into account the variable we are sampling? (I'm guessing C_x has to be modified, but since N is common to all the probabilities, it should not make a difference.)

-Issue: I'm not used to working with large dictionaries, and in equation 49 a large dictionary (37,000 words) tends to make the product of the probabilities theta_{x,i}^{W_{j,i}} (the multinomial) a really small number (Python outputs zero). Is this an error in my implementation, or do I have to use something like the bigfloat package to handle really small numbers?

Thanks a lot,
Leon |
|
For your first question, yes. Note the (-j) on the C in the probability expression: it means you should decrement the counts for the document currently being sampled. For your second question, you should do as Philemon said and work in log-space, then exponentiate and normalize before sampling.

Is there a way to avoid having obscenely large numbers in the log, since the exp tends to go to zero anyway? I found an implementation here, and in line 120 the author divides the exp by a common denominator. http://sourceforge.net/p/nbgibbs/code/ci/f34b66c3e908ecb4842b46a5c14d38ca9a2c4d48/tree/nbgibbs.py I'm not sure whether this is standard practice, but I think it makes sense, since we only care about the ratio, and the common factor cancels in the sampling.
(Apr 19 '12 at 07:13)
Leon Palafox ♦
Since you will divide everything by a constant after exponentiating anyway, you can do this operation in log space beforehand to make everything easier. By this I mean: subtract the value of the largest log from each log (bringing that one to zero and everything else to some negative number), and then exponentiate and normalize.
(Apr 19 '12 at 07:21)
Alexandre Passos ♦
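The shift-then-exponentiate step described in the comment above could be sketched roughly as follows. This is only an illustration in plain Python: the function name `normalize_log_weights` and the example values are made up for this sketch, not taken from the tutorial or from the linked nbgibbs code.

```python
import math

def normalize_log_weights(log_weights):
    """Turn unnormalized log-probabilities into probabilities that sum to 1.

    Hypothetical helper: subtracts the largest log value first, so the
    largest term becomes exp(0) = 1 and the rest are at most 1.
    """
    largest = max(log_weights)
    shifted = [math.exp(lw - largest) for lw in log_weights]
    total = sum(shifted)
    return [s / total for s in shifted]

# Made-up unnormalized log-probabilities for two class labels.
# Exponentiating them directly would give 0.0 for both.
log_weights = [-95000.0, -95002.0]
probs = normalize_log_weights(log_weights)
```

Subtracting a constant from every log only rescales all the weights by the same factor, which the final normalization removes, so the resulting probabilities are unchanged.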
Cool, thanks!
(Apr 19 '12 at 07:32)
Leon Palafox ♦
|
About the issue: You can probably replace that very large product by the sum of the logarithms of the values. Transforming to the log domain is the most common trick for handling very large or very small numbers.
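
To make the difference concrete, here is a small sketch comparing the direct product with the sum of logs. The uniform `theta`, the toy `counts` vector, and the helper name `log_multinomial_term` are assumptions chosen for illustration; only the vocabulary size of 37,000 comes from the question.

```python
import math

def log_multinomial_term(theta, counts):
    """Return sum_i counts[i] * log(theta[i]), the log of prod_i theta[i] ** counts[i].

    Hypothetical helper; words with a zero count contribute nothing and are skipped.
    """
    return sum(c * math.log(t) for t, c in zip(theta, counts) if c > 0)

vocab_size = 37000
theta = [1.0 / vocab_size] * vocab_size          # assumed uniform word probabilities
counts = [1] * 200 + [0] * (vocab_size - 200)    # toy document: 200 distinct words

# Direct product: underflows to 0.0 in double precision.
naive = 1.0
for t, c in zip(theta, counts):
    naive *= t ** c

# Log-domain version: an ordinary negative number, roughly -2104 here.
log_value = log_multinomial_term(theta, counts)
```

The log values for the different labels can then be compared or normalized without ever forming the tiny product itself.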