I'm using NaiveBayesClassifier with nltk to classify free-form text into various groups.

The show_most_informative_features call is very interesting, allowing a small peek into what drives the classifier.

What other ways are there for exploring how the classifier makes decisions? For example, can I see what caused the classifier to classify a specific sample the way it did?

There are specific data points that the classifier misclassifies with high confidence (prob_classify returns a high probability for the wrong class). I'd love to figure out what it is about those cases that drives the misclassification.
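As a starting point, the confidently wrong cases can be collected into a list for inspection. The sketch below is only an illustration: `confident_errors`, the 0.9 threshold, and the two stand-in classes are made up for this example. It relies only on `prob_classify` returning an object with `max()` and `prob(label)`, which is how NLTK's probability distributions behave; the stand-in classes mimic that interface so the snippet runs without NLTK or a trained model.

```python
def confident_errors(classifier, labeled_featuresets, threshold=0.9):
    """Return (features, gold, guess, confidence) for every sample that is
    misclassified with confidence >= threshold, most confident first."""
    errors = []
    for features, gold in labeled_featuresets:
        dist = classifier.prob_classify(features)
        guess = dist.max()             # most probable label
        confidence = dist.prob(guess)  # probability assigned to that label
        if guess != gold and confidence >= threshold:
            errors.append((features, gold, guess, confidence))
    errors.sort(key=lambda row: row[3], reverse=True)
    return errors


# Hypothetical stand-ins for a trained classifier, for demonstration only.
class FakeDist:
    def __init__(self, probs):
        self._probs = probs

    def max(self):
        return max(self._probs, key=self._probs.get)

    def prob(self, label):
        return self._probs.get(label, 0.0)


class FakeClassifier:
    def prob_classify(self, features):
        if features.get("cheap"):
            return FakeDist({"spam": 0.95, "ham": 0.05})
        return FakeDist({"spam": 0.4, "ham": 0.6})


test_set = [
    ({"cheap": True}, "ham"),   # wrong, and confident
    ({"cheap": True}, "spam"),  # correct
    ({"hello": True}, "spam"),  # wrong, but not confident
]
errors = confident_errors(FakeClassifier(), test_set)
for features, gold, guess, confidence in errors:
    print(features, "gold:", gold, "guess:", guess, "p:", confidence)
```

With a real classifier, the resulting list gives the samples worth examining in more detail.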

asked Nov 12 '10 at 17:05

Parand


2 Answers:

A naive Bayes classifier essentially has a weight for each word in the document, and the classification is (roughly) a function of the sum of these weights. So it's not like a decision tree, where you can usually follow the decision path for a sample and see what made it go the wrong way (although even with decision trees the usual caveat applies: you won't necessarily understand the earlier decisions). In a linear model, any set of factors could be responsible for "tipping the scale".

However, there is something you can do (although I don't know how to do this with the NLTK API): look at the words in the document sorted by their weights according to the classifier, so that you see the words with high weights before the words with low weights. If you also show the incremental score, you can see at which point the classifier is so hopelessly committed to the wrong class that whatever comes after can't really matter. Then you might think about how to adjust the dataset (as this is the only tweaking you can do in a naive Bayes classifier) to make sure those words don't tip so heavily to the wrong side.
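The idea above can be sketched in plain Python. This is not the NLTK API: it is a small multinomial naive Bayes for two classes with add-one smoothing, written from scratch for illustration, and the names `train_counts` and `explain` and the toy data are all made up. Each word's weight is its log-odds between the two classes. The words of one document are sorted by absolute weight and a running total is kept, so you can see where the score commits to one side.

```python
import math
from collections import Counter, defaultdict


def train_counts(labeled_docs):
    """Count word occurrences and documents per label from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    return word_counts, doc_counts


def explain(tokens, word_counts, doc_counts, label_a, label_b, alpha=1.0):
    """Return the prior log-odds (a vs. b) and a list of
    (word, weight, running_total) rows sorted by absolute weight.
    A positive total favours label_a, a negative one label_b."""
    vocab = set(word_counts[label_a]) | set(word_counts[label_b])
    total_a = sum(word_counts[label_a].values())
    total_b = sum(word_counts[label_b].values())
    v = len(vocab)

    def weight(word):
        # Smoothed log-odds of the word under the two classes.
        p_a = (word_counts[label_a][word] + alpha) / (total_a + alpha * v)
        p_b = (word_counts[label_b][word] + alpha) / (total_b + alpha * v)
        return math.log(p_a / p_b)

    prior = math.log(doc_counts[label_a] / doc_counts[label_b])
    # Words never seen in training are skipped in this sketch.
    contributions = [(w, weight(w)) for w in tokens if w in vocab]
    contributions.sort(key=lambda pair: abs(pair[1]), reverse=True)

    rows = []
    running = prior
    for word, w in contributions:
        running += w
        rows.append((word, w, running))
    return prior, rows


# Toy training data, invented for this example.
train = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch meeting now".split(), "ham"),
]
word_counts, doc_counts = train_counts(train)
prior, rows = explain("cheap meeting now".split(),
                      word_counts, doc_counts, "spam", "ham")

print(f"{'(prior)':<10} {prior:+.3f} {prior:+.3f}")
for word, w, running in rows:
    print(f"{word:<10} {w:+.3f} {running:+.3f}")
```

For a misclassified sample, the first few rows usually show which words pushed the score toward the wrong class.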

You can also switch to a logistic regression classifier. Instead of making the weights proportional to the log of the number of times each word appears in each class, it chooses weights so that, on the training data, the count of each word in the documents the model assigns to a class (weighted by the predicted probabilities) matches the observed count of that word in that class. This generally avoids the overestimation error of naive Bayes (especially when regularized), but performs rather poorly with small training sets.
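The count-matching condition can be written out for the unregularized multinomial case. The notation here is introduced only for this sketch: x_{ij} is the count of word j in document i, y_i is its true class, and w_c is the weight vector for class c.

```latex
p(c \mid x_i) = \frac{\exp(w_c \cdot x_i)}{\sum_{c'} \exp(w_{c'} \cdot x_i)},
\qquad
\mathcal{L}(w) = \sum_i \log p(y_i \mid x_i)

\frac{\partial \mathcal{L}}{\partial w_{c,j}}
  = \sum_i x_{ij} \left( \mathbf{1}[y_i = c] - p(c \mid x_i) \right)

\frac{\partial \mathcal{L}}{\partial w_{c,j}} = 0
\;\Longrightarrow\;
\sum_i x_{ij}\, \mathbf{1}[y_i = c] = \sum_i x_{ij}\, p(c \mid x_i)
```

The left side of the last line is the observed count of word j in class c, and the right side is the count the model expects. With a regularization term added to the objective, the two sides no longer match exactly.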

answered Nov 12 '10 at 19:31

Alexandre Passos ♦

Thanks Alexandre. I'll see if there's a way to peek into the weights for all words inside the classifier.

I do have access to a fairly large training set, so I'll give logistic regression a try as well.

Anything else you'd recommend trying?

(Nov 12 '10 at 20:44) Parand

You might look into more complex models, such as LDA (latent Dirichlet allocation), to actually model the topics across which your data is distributed. You would also get a sense of how each word depends on each topic, which allows for some really cool analysis.

I think someone started a motion in March to include LDA in NLTK, but I'm not sure how that turned out.

You might try to look into that.

answered Nov 12 '10 at 23:54

Leon Palafox

Thanks Leon. I'm just learning about LDA, but it looks like it's an unsupervised method that derives its own topics. In this case I have existing topics (it's a classification problem), so if there are other supervised learning methods I should be looking at, please let me know. It doesn't have to be NLTK either; generally anything Python-based would be good.

(Nov 14 '10 at 21:16) Parand

Actually, it is only a matter of modeling. The topics are distributed as a random Dirichlet process, but if you already have the set of topics it becomes easier, and you do not need a Chinese Restaurant Process. There was a NIPS paper that used LDA on a fixed set of topics, and it worked really well.

(Nov 15 '10 at 07:38) Leon Palafox

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.