I have a set of ~3k documents, each of which has one or more labels associated with it. There are about 30 unique labels in total. Ultimately what I need is a classifier that will be able to assign one or more labels to a new document. Ideally the output for each document should be a probability for each label.
My first question is whether it makes sense to train one NLTK Naive Bayes classifier per label (using words within the documents as features) and then use the probability of True reported by each classifier to assign labels to a new document. Is there an obviously better approach, or an obvious reason why this approach is bad? I saw in another post that Naive Bayes is bad at producing actual probabilities, since the independence assumption drives most probabilities toward 0 or 1.
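In case it helps clarify what I mean, here is a minimal sketch of the one-classifier-per-label setup using NLTK's `NaiveBayesClassifier`; the toy documents, label names, and the `features` helper are made up for illustration:

```python
from nltk.classify import NaiveBayesClassifier

# Toy corpus: each document is a list of words plus its set of labels.
docs = [
    (["price", "stock", "market"], {"finance"}),
    (["game", "score", "team"],    {"sports"}),
    (["stock", "team", "merger"],  {"finance", "sports"}),
    (["election", "vote"],         {"politics"}),
]
all_labels = {"finance", "sports", "politics"}

def features(words):
    # Bag-of-words feature dict, as NLTK expects.
    return {w: True for w in words}

# One binary classifier per label: the target is True iff the
# document carries that label.
classifiers = {}
for label in all_labels:
    train = [(features(words), label in labels) for words, labels in docs]
    classifiers[label] = NaiveBayesClassifier.train(train)

# For a new document, read off P(True) from each label's classifier.
new_doc = ["stock", "market", "vote"]
probs = {label: clf.prob_classify(features(new_doc)).prob(True)
         for label, clf in classifiers.items()}
```

Each classifier is trained and queried independently, so the per-label probabilities need not sum to 1 across labels (which is what I want for multi-label output).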
My main question is about selecting data to use in 10-fold cross validation. It seems like I should use all the data available, but when I do, each classifier sees roughly 30 times more negative examples than positive ones, since each label applies to roughly 1/30 of the documents. With this setup my overall precision and recall are .11 and .95, respectively.
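One thing I do make sure of is that each fold preserves the skewed positive/negative ratio for the label in question, i.e. a stratified split. A stdlib-only sketch of how I build the folds (the function name and the boolean-label framing are just for illustration):

```python
import random

def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs where every test fold keeps
    roughly the same positive/negative ratio as the full dataset."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Deal positives and negatives round-robin into k folds.
    folds = [pos[f::k] + neg[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

# Roughly 1 positive per 30 documents, as in my data.
labels = [True] * 10 + [False] * 290
folds = list(stratified_kfold(labels, k=10))
```

With 10 positives and 10 folds, each test fold ends up with exactly one positive example, which at least keeps the per-fold class ratio constant.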
Another approach I took was to use all of the positive examples and then randomly draw an equal number of negative examples from documents with other labels. In this case precision and recall were .8 and .9. These numbers are obviously much better, but I'm not sure whether this makes sense. On the one hand, I'm giving each classifier a far higher proportion of positive examples than it will likely see in the real world. On the other hand, I'm doing the same thing for every classifier/label. I'm also not sure whether the label distribution in my training data will match that of the actual data.
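Concretely, the undersampling step for a single label looks like this (stdlib only; the function name and toy documents are mine):

```python
import random

def balanced_training_set(docs, label, seed=0):
    """Keep every positive example for `label` and sample an equal
    number of negatives from documents carrying other labels."""
    rng = random.Random(seed)
    pos = [(words, True) for words, labels in docs if label in labels]
    neg = [(words, False) for words, labels in docs if label not in labels]
    neg = rng.sample(neg, min(len(pos), len(neg)))
    train = pos + neg
    rng.shuffle(train)
    return train

# Toy data: 3 documents with label "a", 27 with label "b".
docs = ([(["word%d" % i], {"a"}) for i in range(3)] +
        [(["word%d" % i], {"b"}) for i in range(3, 30)])
train = balanced_training_set(docs, "a")
```

So each label's classifier trains on a 50/50 class split regardless of how rare the label is overall.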
I'd appreciate any advice or tips.
asked Apr 25 '11 at 01:55
Naive Bayes is really badly calibrated by itself. Undersampling the negative class sounds like a good idea, and I don't think it will introduce that much bias given that you're already using Naive Bayes. Those numbers sound good. If I were in your place I'd get some more data, check whether the result generalizes, and if so, stop and go do something else.
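If you want to see how miscalibrated the probabilities actually are, one cheap check is a reliability table: bin the predicted probabilities on held-out data and compare the mean predicted probability in each bin against the observed fraction of positives. A stdlib-only sketch (the function name is mine):

```python
from collections import defaultdict

def reliability_table(probs, truths, n_bins=10):
    """Bin predictions by predicted probability and report, per bin,
    (mean predicted probability, observed fraction of positives).
    Well-calibrated output has the two numbers close in every bin."""
    bins = defaultdict(list)
    for p, y in zip(probs, truths):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    table = {}
    for b, pairs in sorted(bins.items()):
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        frac_pos = sum(y for _, y in pairs) / len(pairs)
        table[b] = (mean_p, frac_pos)
    return table

# Tiny illustrative check with hand-picked predictions.
table = reliability_table([0.95, 0.95, 0.05, 0.05], [1, 1, 0, 0])
```

With Naive Bayes you'd typically see the mass piled up in the extreme bins, with observed frequencies much less extreme than the predicted probabilities.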
answered Apr 25 '11 at 05:55
Alexandre Passos ♦