Hi all -

Beginner to text classification here with a beginner question. I'm currently using NLTK's naive Bayesian classifier to classify text into just two categories based on training data. After training, the most-informative-features and auto-classification test both look reasonable - the most-informative-features list is intuitive and the auto-test is near 0.95-0.99.

I am training with a subset of known classified data, and then doing accuracy testing on data that is known-classified but not necessarily used in training.
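For reference, one way to make sure the accuracy test never sees training documents is a disjoint split. This is only a minimal sketch: `split_train_test`, `test_fraction`, and `seed` are hypothetical names, and `labeled_docs` is assumed to be a list of (document, label) pairs.

```python
import random

def split_train_test(labeled_docs, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into disjoint train and test lists."""
    shuffled = list(labeled_docs)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test items are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

docs = [("doc%d" % i, i % 2) for i in range(10)]
train, test = split_train_test(docs)
```

If the test set overlaps the training set, the reported accuracy can look much better than accuracy on genuinely unseen documents.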

When I run the trained classifier to guess, the probabilities that it comes up with are very near 0.5 and 1.0 and 0.0 - for example, many guesses will be (0.0, 0.4999), (1.0, 0.0), (0.51, 0.49), (0.50, 0.49). Actual accuracy of the guesses (based on which probability is higher) runs very close to 50%.

This leads me to suspect that either my test data is not diverse enough (too closely matching my training set) or that my feature selection is somehow flawed. My feature selection is relatively basic: after dropping stopwords, I select single tokens and bigram phrases as features and feed them into the classifier's training set. I only report non-zero features in my extractor and let NLTK handle the zero-value features, based on some advice I received earlier.
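A minimal sketch of the kind of extractor described above, assuming whitespace tokenization and a tiny hypothetical stopword list (the real tokenizer and stopword list are not shown in the post). Only features that are present get reported, as a dict of the form NLTK classifiers accept in (featureset, label) pairs.

```python
# Hypothetical stopword list; a real one would be larger and domain-specific.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for"}

def extract_features(text, stopwords=STOPWORDS):
    """Return a dict of present-only unigram and bigram features."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    features = {}
    for token in tokens:
        features["uni:" + token] = True
    # Bigrams are formed over the tokens that survive stopword removal.
    for first, second in zip(tokens, tokens[1:]):
        features["bi:" + first + " " + second] = True
    return features

example = extract_features("The merger agreement is signed")
```

Note that forming bigrams after stopword removal joins words that were not adjacent in the original text; whether that is desirable depends on the documents.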

So my question is - do the guesses of 0, 0.5, and 1.0 represent something I am doing that is fundamentally incorrect, or is my testing data just not diverse enough?

Thanks for any responses!

asked Jan 07 '11 at 15:47

mfhughes


2 Answers:

Another suggestion is to use slightly better features. For starters, I would do the following:

  1. Make the stop word list more domain-specific and a bit more aggressive. This reduces the sparsity of the space you are trying to infer over, so the weights are distributed over a smaller set of words, which makes them more discriminative.

  2. Along the same lines as 1, weight all the words according to some measure of importance (say tf-idf) and keep only the high-scoring words.

  3. Add information about the structure of the document. For example, words in an email's subject should be more important than words in its body. This can be done by rewriting words in the subject as "subject:word" and words in the body as "body:word", and letting the algorithm learn their importance. Even though this works in opposition to (1) and (2), the intuition is that the sparsity introduced by such transformations is small compared to the sparsity removed by (1) and (2).

Also, as mentioned by others, use regularization. For a large sparse space such as words, L1 regularization works great (scikits.learn calls L1-regularized linear regression Lasso).

answered Feb 03 '11 at 01:54

kungpaochicken

Thanks kungpaochicken (lol)!

My next iterations will be to add case-specific stopword lists built by subject matter experts, as well as the addition of tf-idf scores to rank and limit the features used.

I'm working on a separate classifier (due to business rules) that looks at document metadata (as opposed to content) that does exactly what you suggested with the subject/body separation. At some point I'm planning on combining the feature inputs of these two classifiers in some way into a single one.

I believe the toolkit I'm using does regularization but calls it something like "sparse matrix support". I'll have to look into it, but duly noted.

(Feb 03 '11 at 16:41) mfhughes

Also, if both your models output into the same space (e.g., both models output class1/class2, but with different weights), then the easiest way to combine them is to weight each score by some confidence measure (e.g., 1/training error) and sum.

This has the advantage of being very easy to implement, it lets you use multiple classifiers, and it works well in practice.
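A rough sketch of that weighted sum, assuming each model returns a dict mapping label to score and that a training error is known for each model. `combine_scores` is a hypothetical helper; the final normalization and the guard against a zero training error are additions for convenience, not part of the suggestion above.

```python
def combine_scores(model_outputs):
    """model_outputs: list of (probs, training_error) pairs, where probs maps label -> score."""
    totals = {}
    for probs, training_error in model_outputs:
        # Confidence weight is 1/training error; the floor avoids division by zero.
        weight = 1.0 / max(training_error, 1e-6)
        for label, score in probs.items():
            totals[label] = totals.get(label, 0.0) + weight * score
    # Assumption: rescale so the combined scores sum to 1.
    norm = sum(totals.values())
    return {label: total / norm for label, total in totals.items()}

content_model = ({"class1": 0.6, "class2": 0.4}, 0.2)   # weight 5
metadata_model = ({"class1": 0.3, "class2": 0.7}, 0.1)  # weight 10
combined = combine_scores([content_model, metadata_model])
```

In this made-up example the metadata model has half the training error, so it gets twice the weight and the combined guess leans toward class2.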

(Feb 04 '11 at 02:09) kungpaochicken

This distribution on the conditional probabilities (being very close to 0 or 1 or .5) is a common property of the naive bayes classifier. If you want calibrated probability estimates you're better off using a logistic regression classifier (nltk has those as well). This is a consequence of the way naive bayes handles redundant features: they push the probabilities exponentially in the direction of either 0 or 1. Logistic regression works differently and so avoids this problem.
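The effect of redundant features can be illustrated with a toy calculation. This is a sketch under simplifying assumptions: two classes, and a single weakly informative feature that shows up as several perfectly redundant copies. `nb_posterior` and the 0.6/0.4 likelihoods are made up for illustration.

```python
def nb_posterior(p_feature_given_a, p_feature_given_b, copies, prior_a=0.5):
    """Posterior P(A | feature seen `copies` times) under the naive independence assumption."""
    # Naive Bayes multiplies the likelihood once per copy, as if each were new evidence.
    score_a = prior_a * p_feature_given_a ** copies
    score_b = (1.0 - prior_a) * p_feature_given_b ** copies
    return score_a / (score_a + score_b)

for copies in (0, 1, 5, 20):
    print(copies, round(nb_posterior(0.6, 0.4, copies), 4))
```

With no informative features the posterior stays at the prior (0.5 here), and each extra redundant copy multiplies the odds by the same ratio again, so the estimate heads toward 0 or 1 even though no new evidence was added. Overlapping unigram and bigram features are one plausible source of this kind of redundancy.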

The classification being correct only 50% of the time suggests that naive bayes is probably not appropriate to your problem. Without more information I'm not sure what you should do next. How much data do you have?

answered Jan 07 '11 at 16:11

Alexandre Passos ♦

Alexandre -

Thanks for the help. My data sets are medium sized - they are email-like documents (business documents, basically) of which I have several hundred thousand per category, ranging from 1kb to several hundred kb. Uncategorized data may run into the millions of documents. My training sizes are in the 1k - 10k range and my test runs are also in that range due to limited computing power. I was hoping to prove out a promising approach before delving into distributed processing of this data.

I'll look into the logistic regression classifier - does that operate on the same feature extraction concept, or will it require major reworking of my approach?

Thanks, mh.

(Jan 07 '11 at 16:27) mfhughes

The procedure is the same, as both are linear classifiers. The main difference is that it takes a bit more work to train a logistic regression classifier than naive bayes. Also, be sure to use regularization, if nltk provides it, as it usually makes a huge difference in performance.
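To show that the same present-only feature dicts carry over, here is a toy L2-regularized logistic regression trained by stochastic gradient descent. It is a sketch for illustration, not how nltk or scikits.learn implement it; `train_logistic`, `predict_proba`, and the hyperparameter values are assumptions, and the L2 penalty is applied only to the weights of features present in each example.

```python
import math

def train_logistic(examples, l2=0.01, learning_rate=0.5, epochs=200):
    """examples: list of (features, label); features is a dict of present keys, label is 0 or 1."""
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            z = bias + sum(weights.get(name, 0.0) for name in features)
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = predicted - label
            bias -= learning_rate * error
            for name in features:
                w = weights.get(name, 0.0)
                # Gradient of the log loss plus the L2 penalty on this weight.
                weights[name] = w - learning_rate * (error + l2 * w)
    return weights, bias

def predict_proba(model, features):
    """Probability of label 1 for a feature dict."""
    weights, bias = model
    z = bias + sum(weights.get(name, 0.0) for name in features)
    return 1.0 / (1.0 + math.exp(-z))

training = [
    ({"uni:contract": True}, 1),
    ({"uni:contract": True, "uni:signed": True}, 1),
    ({"uni:lunch": True}, 0),
    ({"uni:lunch": True, "uni:friday": True}, 0),
]
model = train_logistic(training)
```

The iterative fitting loop is the "bit more work" compared with naive bayes, which only needs counts; the `l2` term is what keeps individual weights from growing without bound on sparse features.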

(Jan 07 '11 at 16:33) Alexandre Passos ♦

Alexandre -

Looking through the nltk documentation, I can't seem to find anything that's described as a logistic classifier -

BinaryMaxentFeatureEncoding, ClassifierI, ConditionalExponentialClassifier, DecisionTreeClassifier, MaxentClassifier, MultiClassifierI, NaiveBayesClassifier, RTEFeatureExtractor, WekaClassifier

Searching around metaoptimize I found another library, scikits.learn, that seems to implement this (with regularization)... would that be a better choice?

Thanks, mh.

(Jan 07 '11 at 16:44) mfhughes

I like scikits.learn better than nltk, so yes. In nltk the logistic regression classifier is called maxent.

(Jan 07 '11 at 16:45) Alexandre Passos ♦

OK, I'll give both a shot.

Thanks much.

(Jan 07 '11 at 16:59) mfhughes

Alexandre,

Good news - I tried the maxent NLTK classifier (since I had much of the supporting code and feature extraction stuff already written), and after trying two different algorithms (with the help of scipy), I am getting much, much more accurate results. Approximately 70% of guesses are conclusive, and approximately 80% of those conclusive guesses are correct. For my purposes ("value-added" pre-review classification of documents for lawyers), this is more than acceptable.
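The two figures quoted above (share of conclusive guesses, and accuracy among them) can be computed along these lines. This is a sketch only: the post does not say how "conclusive" is defined, so the 0.7 threshold and the name `conclusive_stats` are assumptions.

```python
def conclusive_stats(predictions, threshold=0.7):
    """predictions: list of (prob_of_positive, true_label) with true_label 0 or 1."""
    # A guess counts as conclusive when either class gets at least `threshold`.
    conclusive = [
        (prob, label) for prob, label in predictions
        if prob >= threshold or prob <= 1.0 - threshold
    ]
    coverage = len(conclusive) / len(predictions)
    correct = sum(1 for prob, label in conclusive if (prob >= 0.5) == (label == 1))
    accuracy = correct / len(conclusive) if conclusive else 0.0
    return coverage, accuracy

# Made-up predictions, purely to exercise the function.
sample = [(0.95, 1), (0.9, 0), (0.1, 0), (0.52, 1), (0.05, 0)]
coverage, accuracy = conclusive_stats(sample)
```

Raising the threshold generally trades coverage for accuracy, which matters if inconclusive documents are simply passed on for manual review.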

I am using the 'conjugate gradient' algorithm with sparse matrices, although there are more algorithms available. I will try them as well to see if I can get even better results.

One interesting thing, however, is that the guesses are still close to 0.5 and 1.0 - they are just a whole lot more accurate. Any ideas as to why?

Anyway, thank you very much for your suggestions, and I will continue to work to improve the functional accuracy of this application.

(Feb 01 '11 at 14:45) mfhughes

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.