From a high-level perspective, my problem is very simple: I have a lot of sentences, and I want to perform supervised binary classification on them based on the words they contain (as well as a couple of other features, but these are not really relevant here).

I started with the obvious approach, which is to create one feature per word that appears in my training corpus. To avoid having too many features, I then did the second obvious thing, which is to keep only the features based on words with a high tf-idf.
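A minimal sketch of that filtering step, for concreteness. It assumes whitespace tokenization and that "high tf-idf" means a word's highest tf-idf score in any single sentence; the function name `top_tfidf_words` and the toy sentences are made up for illustration.

```python
import math
from collections import Counter

def top_tfidf_words(sentences, k):
    """Return the k words with the highest peak tf-idf over the corpus."""
    docs = [s.lower().split() for s in sentences]
    n_docs = len(docs)
    # Document frequency: number of sentences that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    best = {}
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[word])
            # Assumption: score a word by its best tf-idf in any sentence.
            best[word] = max(best.get(word, 0.0), tf * idf)
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return ranked[:k]

sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vocabulary = top_tfidf_words(sentences, 3)
print(vocabulary)
```

Note that this ranking never looks at the class labels: a word that occurs in every sentence gets an idf of zero, and rare words score highest whether or not they help the classification.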

Intuitively, I see that for a fixed number of words I only need to increase the size of the training corpus to avoid overfitting, but I wonder what general guideline I should follow to estimate how many features to keep given the size of the training corpus.

The obvious next step for me is to use word n-grams to increase the number of meaningful features (and, obviously, to use the tf-idf of these word n-grams to keep only a fraction of them). Here again, I wonder if someone could give me advice on a strategy for picking the number of word n-gram features to keep.
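To make "word n-grams" concrete, here is a small sketch that turns a sentence into unigram and bigram counts. Whitespace tokenization and the helper names `word_ngrams` and `ngram_counts` are assumptions for illustration only.

```python
from collections import Counter

def word_ngrams(tokens, n_min=1, n_max=2):
    """Yield word n-grams of length n_min..n_max as space-joined strings."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def ngram_counts(sentence, n_min=1, n_max=2):
    """Count the n-grams of one sentence (lowercased, split on whitespace)."""
    return Counter(word_ngrams(sentence.lower().split(), n_min, n_max))

features = ngram_counts("this movie is not good")
print(sorted(features))
```

A sentence of 5 words yields 5 unigrams and 4 bigrams, so the feature count grows quickly with the maximum n-gram length, which is why some pruning strategy is needed.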

Pointers to resources that discuss these issues would be most welcome, since I seem unable to feed Google the right keywords to get meaningful results.

asked Aug 21 '12 at 08:50


mathieu lacage

Did you try this class? https://class.coursera.org/nlp/lecture/preview

(Aug 23 '12 at 05:11) Leon Palafox ♦

One Answer:

Don't worry too much about having too many features. If you train an L1-regularized linear classifier, for example, it will tend to select the good ones by driving the weights of the others to zero. (See Vowpal Wabbit and the classifiers in scikit-learn.)
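To illustrate the mechanism only, here is a dependency-free sketch of L1-regularized logistic regression trained by proximal gradient descent. This is not how Vowpal Wabbit or scikit-learn implement it; the function names, the toy data, and the values of `l1`, `lr`, and `epochs` are all assumptions picked for the example.

```python
import math

def soft_threshold(value, amount):
    """Shrink value toward zero by amount (the proximal step for L1)."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def train_l1_logistic(X, y, l1=0.1, lr=0.5, epochs=500):
    """Fit logistic regression with an L1 penalty on the weights.

    X is a list of feature rows (lists of floats); y holds 0/1 labels.
    The bias term is not penalized.
    """
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            score = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-score)) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= lr * grad_b / n_samples
        # Gradient step on the log-loss, then shrink each weight toward zero.
        weights = [
            soft_threshold(w - lr * g / n_samples, lr * l1)
            for w, g in zip(weights, grad_w)
        ]
    return weights, bias

# Toy word-presence data: feature 0 matches the label exactly,
# feature 1 appears equally often in both classes.
X = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
weights, bias = train_l1_logistic(X, y)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print(weights, selected)
```

The features whose weights end up exactly zero can be dropped, so the penalty strength plays the role of the "how many features to keep" knob. In practice the library implementations mentioned above are the better choice.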

Worry about having the right features. Is a bag-of-words enough information to do the classification?

Would you, as a human, have enough information to make a prediction based only on word counts? This is a good heuristic for designing the model.

For some tasks, such as figuring out the topic or the language of a document, a bag-of-words is fine.

For tasks that involve a more complicated understanding of the meaning of the sentence, like sentiment analysis, a bag-of-words doesn't carry enough information. How to handle these more difficult tasks is an open question, and the answer can be very problem-specific.

answered Aug 23 '12 at 05:44


Joseph Turian ♦♦


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.