Hi,

A couple of questions regarding Naive Bayes classifiers for text categorization:

1) When using a bag-of-words representation together with a Naive Bayes classifier, how are "zero" occurrences dealt with?

Let's say that in a corpus of 100 documents, a certain word/feature appears in 20 documents (count k > 0) and is not present in 80 documents (k = 0). If I take P(x=k|c) to be a multinomial distribution, should this distribution be

a) defined for k > 0 and thus estimated on the set of documents with k > 0, or b) defined for k >= 0 and thus estimated on the entire set of documents?

I am guessing the former could be interpreted as allowing missing features: one would drop or otherwise mute p(x=0|c) when multiplying the conditionally independent distributions of the Naive Bayes classifier, whereas the latter includes "0 occurrences" as a possible feature value.

Is one of the choices more correct than the other?

2) Is there any sensible way of combining discrete class-conditional probability mass functions (such as those arising from "count features") and continuous class-conditional pdfs in the same Naive Bayes classifier? Multiplying them together seems somewhat nonsensical.

3) I mentioned above that I use the multinomial distribution to estimate class-conditional distributions. Would it be better to use the Poisson distribution, or is that generally not the case?

4) If, instead of word counts, one used term frequency-inverse document frequency (tf-idf) values as features, what kind of continuous (?) class-conditional distribution would be adequate in that case?

Any help is appreciated,

/David

asked Mar 28 '11 at 18:55

David the Dude


5 Answers:

For (1) I'd say most people do (a), since (b) doesn't make sense in Naive Bayes. Remember that you're estimating P(word|class), not P(class|word), so for each class you compute a smoothed fraction of the documents that contain the word, rather than a class distribution for each word.
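
As a rough illustration of that per-class estimate, here is a minimal Python sketch that computes P(word|class) as a smoothed fraction of a class's documents containing the word (the toy documents and the add-one smoothing constant are assumptions for illustration, not part of the original answer):

    from collections import defaultdict

    # Made-up toy corpus: (class label, set of words appearing in the document).
    docs = [
        ("sports", {"ball", "team", "goal"}),
        ("sports", {"team", "coach"}),
        ("autos",  {"engine", "wheel", "team"}),
    ]

    alpha = 1.0                                   # add-one (Laplace) smoothing constant
    doc_count = defaultdict(int)                  # class -> number of documents
    word_doc_count = defaultdict(lambda: defaultdict(int))  # class -> word -> #docs containing it

    for label, words in docs:
        doc_count[label] += 1
        for w in words:
            word_doc_count[label][w] += 1

    def p_word_given_class(word, label):
        # Smoothed fraction of the class's documents that contain the word.
        return (word_doc_count[label][word] + alpha) / (doc_count[label] + 2 * alpha)

    print(p_word_given_class("team", "sports"))    # relatively high
    print(p_word_given_class("engine", "sports"))  # low, but never exactly zero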

For (2), most people first bin the continuous features and then compute the class-conditional "bin presence" probability instead of a word probability. Performance degrades quickly with correlated features, however.
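
For example, a toy sketch of that binning step (the bin edges and data below are arbitrary assumptions):

    import bisect

    # Arbitrary bin edges for one continuous feature: 4 bins in total.
    bin_edges = [0.1, 0.5, 2.0]
    n_bins = len(bin_edges) + 1

    def to_bin(x):
        # Index of the bin the value falls into.
        return bisect.bisect_left(bin_edges, x)

    # Made-up feature values observed in documents of a single class.
    values_in_class = [0.05, 0.3, 0.7, 0.2, 3.1]
    alpha = 1.0

    counts = [0] * n_bins
    for v in values_in_class:
        counts[to_bin(v)] += 1

    # Smoothed class-conditional probability of each bin; these play the role
    # of word probabilities in the usual Naive Bayes product.
    probs = [(c + alpha) / (len(values_in_class) + alpha * n_bins) for c in counts]
    print(probs)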

For (3), I've seen some people do that; you get something similar to Good-Turing estimation of word frequencies per class. It might not be worth the hassle, however.
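
If you did want to try it, a per-word class-conditional Poisson might look roughly like this (toy counts; the small additive smoothing of the rate is just one arbitrary way to keep it positive):

    import math

    # Made-up per-document counts of two words within one class.
    class_doc_counts = {
        "ball":   [2, 0, 1],
        "engine": [0, 0, 0],
    }
    n_docs = 3
    smoothing = 0.1   # keeps the estimated rate strictly positive

    def poisson_log_pmf(k, lam):
        # log of the Poisson pmf: k*log(lam) - lam - log(k!)
        return k * math.log(lam) - lam - math.lgamma(k + 1)

    def word_log_likelihood(word, k):
        counts = class_doc_counts.get(word, [0] * n_docs)
        lam = (sum(counts) + smoothing) / n_docs   # smoothed mean count = Poisson rate
        return poisson_log_pmf(k, lam)

    print(word_log_likelihood("ball", 1))
    print(word_log_likelihood("engine", 2))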

What is usually done for (4), I think, is to note that Naive Bayes actually learns a log-linear classifier, in the sense that P(document|class) = exp( sum_w log-freq(class, w) * count(document, w) ), where log-freq(class, w) = log( sum_{d in class c} sign(count(d, w)) ) - log(N_c) and N_c is the total number of documents in that class. You can then keep the same learning and classification rule and change only count(d, w) to a tf-idf weight. More often, people just trim the too-frequent and too-rare words.
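
For concreteness, here's a minimal sketch of that scoring rule with the counts swapped for arbitrary weights (all probabilities and weights below are made-up placeholders, not estimates from real data):

    import math

    # Assumed already-estimated (and smoothed) log P(word | class) tables.
    log_p = {
        "sports": {"ball": math.log(0.02), "engine": math.log(0.001)},
        "autos":  {"ball": math.log(0.001), "engine": math.log(0.03)},
    }
    log_prior = {"sports": math.log(0.5), "autos": math.log(0.5)}
    default_log_p = math.log(1e-4)   # crude fallback for unseen words

    def score(doc_weights, label):
        # doc_weights maps word -> raw count for plain Naive Bayes,
        # or word -> tf-idf weight if you swap in weighted features.
        s = log_prior[label]
        for w, weight in doc_weights.items():
            s += weight * log_p[label].get(w, default_log_p)
        return s

    doc = {"ball": 3.2, "engine": 0.4}   # e.g. tf-idf weights instead of integer counts
    print(max(log_prior, key=lambda c: score(doc, c)))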

answered Mar 28 '11 at 19:21


Alexandre Passos ♦

edited Mar 28 '11 at 19:22

Re: the discussion of (1): at a previous employer, an early Naive Bayes implementation someone wrote assumed that if a word was never encountered in a category, then the probability of encountering it was zero: P(basketball|auto-websites) = 0, because the corpus of auto-mechanic websites never used the word 'basketball'. (1b) doesn't sound as bad as that choice, but it seems to make the blanket statement that if a word wasn't encountered in the corpus for a given category then the estimated probability of a zero count for it in that category is 1 -- in other words, the word has no effect on the classification.

(Mar 31 '11 at 13:32) Brian Vandenberg

Dovetailing with Alexandre's response to (3): Hinton & Salakhutdinov wrote a paper on Semantic Hashing (2007) that used a Poisson distribution. A later model of theirs, described in Replicated Softmax: An Undirected Topic Model (2009), used a multinomial distribution on the same problem and contrasted the results of the two approaches.

The biggest problem I saw with using a Poisson in this setting was the independence assumptions. In the RBM model there is an explicit assumption that you can sample the hidden units given the visibles, and vice versa, independently of the other hiddens (or visibles).

However, the energy function constructed in the first paper violates that assumption. They show that it works in practice, but in the second paper they also present a model that does not violate the conditional independence assumptions and works better than the first one.

I apologize for not linking to anything specifically on Naive Bayes; I've worked with it in the past but haven't done in-depth research on it, so I don't know of any good papers that relate directly to NB. Nevertheless, I hope this helps.


answered Mar 31 '11 at 13:26


Brian Vandenberg

edited Mar 31 '11 at 13:33

I just realized I should probably back up the statement about the independence problems. It's alluded to in the second paper, but more directly: the conditional P(v|h) depends on N = sum_i v_i, the length of the document. When sampling from the Poisson governing v, the word counts won't always keep N at a fixed value. In other words, when sampling v, each v_i depends on the word counts at the current iteration, but they gloss over this by fixing N either at the N of the original sample or at the value from the previous iteration -- I'm not sure which, but it's probably the original sample.

(Apr 01 '11 at 13:48) Brian Vandenberg

David,

With respect to your problem in (a), you should look at something called smoothing. Andrew Ng's lecture notes (p. 11) delve into one very basic form of smoothing called Laplace smoothing.
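
As a quick illustration, a minimal sketch of add-one (Laplace) smoothing for word probabilities (the counts and vocabulary below are made up):

    # Made-up word counts within one class and a tiny vocabulary.
    word_counts = {"ball": 20, "team": 15, "engine": 0}
    vocab_size = len(word_counts)
    total = sum(word_counts.values())

    def smoothed_prob(word):
        # Add-one (Laplace) smoothing: no word gets probability zero.
        return (word_counts.get(word, 0) + 1) / (total + vocab_size)

    print(smoothed_prob("engine"))   # nonzero even though its count is zero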

answered Apr 28 '11 at 13:43


Dexter

Hi,

I came to think of another variation on what I referred to as "bag-of-words Naive Bayes" in my post above. My question is below.

Notation used below:

  • x: a word,
  • f: a count,
  • xf: a count f for word x
  • x1, x2, ...: several words in the vocabulary

a)

My current implementation models the class-conditional distribution p(xf|c) as a frequency distribution over word counts. If, for example, for word x, f=0 in 30% of the docs of class c, f=1 in 50% and f=2 in the remaining 20%, then p(xf|c) would be specified by p(xf=0|c) = 0.3, p(xf=1|c) = 0.5, p(xf=2|c) = 0.2. The document likelihood under the Naive Bayes assumption is then:

p(d|c) ∝ p(x1f|c) * p(x2f|c) * ...

b)

"multinomial Naive Bayes" on the other hand models the entire document likelihood as a multinomial distribution. p(x|c) here is the probability that word x occurs in a document at all. This is unlike in a), where word frequency is incorporated into the class conditional probability p(xf|c) directly. Denoting words and their counts as x1, x2,... and f1, f2,... respectively the document likelihood is then:

p(d|c) ∝ p(x1|c)^f1 * p(x2|c)^f2 * ... (i.e. a multinomial pmf)
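
To make the contrast concrete, here is a toy sketch that computes the document log-likelihood both ways (all probability tables are made-up placeholders):

    import math

    # (a) per-word frequency distributions over counts: word -> {count: probability}
    count_dist = {
        "ball":   {0: 0.3, 1: 0.5, 2: 0.2},
        "engine": {0: 0.9, 1: 0.08, 2: 0.02},
    }

    # (b) multinomial per-token probabilities p(x|c), summing to 1 over the vocabulary.
    token_prob = {"ball": 0.7, "engine": 0.3}

    doc = {"ball": 2, "engine": 0}   # word -> count in the document

    # (a): every word in the vocabulary contributes a factor, including zero counts.
    log_lik_a = sum(math.log(count_dist[w][doc.get(w, 0)]) for w in count_dist)

    # (b): only words that occur contribute, each raised to its count.
    log_lik_b = sum(f * math.log(token_prob[w]) for w, f in doc.items() if f > 0)

    print(log_lik_a, log_lik_b)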

My question is whether there is an inherent advantage of either method over the other.

One thing I did notice is that, when evaluating the decision function, b) can be cast as a (sparse) matrix multiplication, whereas for a) one would have to iterate over the samples/vocabulary.
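
For illustration, a minimal sketch of that matrix form of the decision function for b) (the numbers are placeholders; in practice the count matrix would typically be sparse, e.g. a scipy.sparse matrix):

    import numpy as np

    # Documents x vocabulary matrix of word counts (or tf-idf weights); made up here.
    X = np.array([[2, 0, 1],
                  [0, 3, 0]])

    # Classes x vocabulary matrix of p(word | class), plus class priors; made up here.
    log_p = np.log(np.array([[0.5, 0.2, 0.3],
                             [0.1, 0.8, 0.1]]))
    log_prior = np.log(np.array([0.5, 0.5]))

    scores = X @ log_p.T + log_prior      # documents x classes
    print(scores.argmax(axis=1))          # predicted class index per document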

Other than that, Apache Mahout goes with implementation b), whereas NLTK implements Naive Bayes as in a).

Thank you,

/David

answered Apr 06 '11 at 12:44


David the Dude

edited Apr 06 '11 at 12:46

Indeed, there are two ways of formulating Naive Bayes. For text data, (a) is usually better, I think. For a deep discussion of where these versions come from, how to derive them, and what they mean, see McCallum and Nigam, "A Comparison of Event Models for Naive Bayes Text Classification".

(Apr 06 '11 at 12:48) Alexandre Passos ♦

Hi Alexandre,

Yeah, I took a fairly quick look at the paper you referenced. It seems the comparison there is between "multinomial" and "multivariate Bernoulli" Naive Bayes models. The former is what I described as b) in my post above; the latter is not quite a), since the multivariate Bernoulli model only considers the binary feature word-occurred/did-not-occur. That is to say, it does not take word counts into account at all, whereas the method in a) does so by way of the frequency distribution. I could be mistaken, though; I will check again.

Thank you for your reply.

(Apr 06 '11 at 12:55) David the Dude

David, you're right that (b) is multinomial and that multivariate Bernoulli isn't exactly (a). However, if you change from multivariate Bernoulli to multivariate binomial, you incorporate word counts in a way that is identical to that of (a).

(Apr 06 '11 at 12:57) Alexandre Passos ♦

(The main difference is that you would need an extra parameter for the number of trials for each word, or would have to keep it fixed or something similar; otherwise your model won't necessarily normalize.)
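
For concreteness, a toy sketch of such a per-word binomial likelihood with a fixed number of trials (the value of n and the probabilities are arbitrary assumptions; choosing n is exactly the awkward part noted above):

    from math import comb

    n = 10   # assumed fixed number of trials per word (the awkward extra parameter)
    p_success = {"ball": 0.15, "engine": 0.01}   # made-up per-class success probabilities

    def binom_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    doc = {"ball": 2, "engine": 0}   # word -> count in the document
    likelihood = 1.0
    for word, k in doc.items():
        likelihood *= binom_pmf(k, n, p_success[word])
    print(likelihood)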

(Apr 06 '11 at 12:58) Alexandre Passos ♦

Here is a write-up. For (2), you can either bin the data or use pmfs. Multiplying them together should be fine, because you are only estimating likelihoods first and then normalizing them all at the end.
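
As a small illustration of that point, a toy sketch that mixes a discrete pmf factor with a continuous pdf factor and normalizes at the end (all parameters are made up):

    import math

    def gaussian_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    # class -> (prior, discrete pmf value for the observed count feature,
    #           (mu, sigma) of a Gaussian over some continuous feature); all made up.
    classes = {
        "sports": (0.5, 0.20, (300.0, 50.0)),
        "autos":  (0.5, 0.05, (500.0, 80.0)),
    }

    length = 320.0   # observed continuous feature value

    unnormalized = {c: prior * pmf * gaussian_pdf(length, *params)
                    for c, (prior, pmf, params) in classes.items()}
    z = sum(unnormalized.values())
    posterior = {c: v / z for c, v in unnormalized.items()}
    print(posterior)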

answered Dec 13 '12 at 11:54


Broccoli
