I'm a newcomer to ML and have been using Python's NLTK naive bayesian classifier to develop a text classifier. I've got a couple of fairly fundamental questions that someone may be able to help me with. My statistics background is not as strong as it should be but I'm actively working to rectify that. Any help would be appreciated to further my rudimentary understanding of this field.

Let's say I have a basic feature extractor that takes a string, drops all stopwords and punctuation, and treats each remaining word as a feature in the dictionary that becomes part of the training set, i.e.:

def get_features(cleanstring):
    features = {}
    # split on whitespace; iterating the string directly would
    # yield individual characters, not words
    for word in cleanstring.split():
        features['contains(%s)' % word] = True
    return features

One example in the NLTK documentation takes only the top N occurring words in the entire training corpus and includes each of them as a feature in every dictionary in the training set, so that every training example has the same feature keys. Is this necessary?

In other words, can I simply take each word from a set of documents and assign it as a feature, and NLTK will internally 'normalize' each feature to zero/False when it appears in some documents but not others? Or do I need to make absolutely sure that each document that goes into training explicitly shares the exact same features?
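Concretely, here's the situation in question (the extractor and documents are just illustrative): two documents naturally produce feature dicts with different keys, and the question is whether that's acceptable input.

```python
def get_features(cleanstring):
    # assumes cleanstring is already lowercased with stopwords
    # and punctuation removed; split on whitespace to get words
    features = {}
    for word in cleanstring.split():
        features['contains(%s)' % word] = True
    return features

doc1 = get_features("cat sat mat")
doc2 = get_features("dog barked")
print(sorted(doc1))  # ['contains(cat)', 'contains(mat)', 'contains(sat)']
print(sorted(doc2))  # ['contains(barked)', 'contains(dog)']
# note: doc1 has no 'contains(dog)' key at all, rather than
# an explicit False entry
```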

Thanks! mh

asked Jul 09 '10 at 14:05



edited Jul 09 '10 at 14:18

2 Answers:

Every training data point, when finally passed to the classifier, needs to have the same features. But as far as I remember, a feature extractor for an NLTK classifier just has to specify the nonzero ones, which you're doing already, so your feature extractor should work.
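To illustrate the point above, here's a minimal sketch of training and querying NLTK's naive Bayes classifier with sparse feature dicts (the labels and words are made up). Each featureset lists only its nonzero features; NLTK fills in the missing ones during training.

```python
import nltk

# each training example is (featureset, label); only the
# features present in that document are specified
train = [
    ({'contains(cat)': True, 'contains(mat)': True}, 'pets'),
    ({'contains(dog)': True, 'contains(bone)': True}, 'pets'),
    ({'contains(stock)': True, 'contains(market)': True}, 'finance'),
    ({'contains(bond)': True, 'contains(market)': True}, 'finance'),
]
classifier = nltk.NaiveBayesClassifier.train(train)

# classification input is also sparse: only nonzero features
print(classifier.classify({'contains(dog)': True}))  # 'pets'
```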

answered Jul 09 '10 at 15:28



edited Jul 09 '10 at 15:29

Thanks for the reply, aditi - so then would it stand to reason that anything sent in for classification/guessing would require only non-zero features to be defined as well, i.e., only the words/features that appear in the text to be classified?

thanks, mh.

(Jul 09 '10 at 16:34) mfhughes

Right, if this feature extractor is what you're using on the training data, you should be able to use it on the test/eval data as well.

(Jul 09 '10 at 16:42) aditi

If you do want to restrict features to the top N, I wrote about how to eliminate low information features. This is usually a good idea for your training features, but is unnecessary for your testing/actual features because the classifier will ignore any unknown features.
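One simple way to restrict training features to the top N is a frequency cutoff (note this is just the simplest scoring; ranking words by information gain or chi-squared, as discussed in the post mentioned above, usually works better):

```python
from collections import Counter

def top_n_words(documents, n):
    """Return the n most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in doc.split())
    return set(word for word, _ in counts.most_common(n))

def get_features(cleanstring, best):
    # only emit features for words in the selected vocabulary
    return {'contains(%s)' % w: True for w in cleanstring.split() if w in best}

docs = ["cat sat mat", "dog sat", "cat dog"]
best = top_n_words(docs, 2)
# words outside `best` are silently dropped from the featureset
print(get_features("cat mat", {'cat'}))  # {'contains(cat)': True}
```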

answered Feb 06 '11 at 23:34


Jacob Perkins



User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.