
I am doing bag-of-words text classification (text categorization) with very few labeled examples. As you can imagine, I have high-dimensional sparse feature vectors. Very few labeled examples means under 100, sometimes only 10 or 20.

As a baseline, I ran l2-regularized logistic regression. I chose hyperparameters (the l2 parameter, learning rate, and number of training passes) using leave-one-out cross validation. In particular, for each leave-one-out set, I compute the logistic loss of the trained classifier on the left-out example. I choose the hyperparameters that minimize the total logistic loss across all leave-one-out sets.
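A minimal sketch of this leave-one-out search, assuming current scikit-learn names (LogisticRegression, LeaveOneOut, log_loss); X and y are placeholders for the labeled feature matrix and labels, the C grid is illustrative, and only the l2 strength is searched here (learning rate and number of passes don't apply to this solver):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut
    from sklearn.metrics import log_loss

    def loo_total_log_loss(X, y, C):
        """Summed logistic loss on the left-out examples for one value of C (inverse l2 strength)."""
        total = 0.0
        for train_idx, test_idx in LeaveOneOut().split(X):
            clf = LogisticRegression(penalty="l2", C=C, max_iter=1000)
            clf.fit(X[train_idx], y[train_idx])
            proba = clf.predict_proba(X[test_idx])
            total += log_loss(y[test_idx], proba, labels=clf.classes_)
        return total

    # pick the regularization strength that minimizes the summed left-out loss
    best_C = min([0.01, 0.1, 1.0, 10.0], key=lambda C: loo_total_log_loss(X, y, C))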

[edit: This also happens when I use l1 and elasticnet. See my comment on ogrisel's answer for more details.]

The problem is that I am overfitting the rare features. I end up with a low regularization parameter, and the rare features (words that appear in one or two examples) have high weights.

How do I avoid overfitting the rare features when learning a classifier over very few labeled examples? Should I use a different model? Should I use a different cross-validation technique? What approach will give the best generalization when doing supervised classification over few high-dimensional labeled examples?

asked Apr 11 '11 at 04:36

Joseph Turian ♦♦

edited Apr 26 '11 at 16:12

Aside: The webpage for LIBLINEAR says: "If you are a beginner and your data sets are not large, you should consider LIBSVM first."

(Apr 11 '11 at 05:15) Joseph Turian ♦♦

"Learning from Little: Comparison of Classifiers Given Little Training" (Forman + Cohen) does a survey, comparing naive bayes, logreg, and SVMs, using various feature selection techniques (information gain and Bi-Normal Separation). Their result is that the choice of learner and feature selection method are interdependent, and also depend upon the number of positive and negative examples.

(May 06 '11 at 03:33) Joseph Turian ♦♦

5 Answers:

If you have a large unlabeled dataset (e.g. a Wikipedia dump), I would try to learn a scikits.learn.decomposition.RandomizedPCA model on the large unlabeled dataset (try several values for n_components e.g. 10, 50, 100, 500... expect diminishing returns, 100 is probably a good first guess) and then use it to transform the data of the small labeled dataset before feeding it to the linear classifier.
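A rough sketch of that pipeline under stated assumptions: RandomizedPCA from the 2011 scikits.learn API has since been folded into PCA/TruncatedSVD, and TruncatedSVD is used here because it accepts sparse bag-of-words matrices; X_unlabeled, X_train, and y_train are placeholder names:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression

    svd = TruncatedSVD(n_components=100)        # try 10, 50, 100, 500 ... expect diminishing returns
    svd.fit(X_unlabeled)                        # learn the components on the large unlabeled corpus

    clf = LogisticRegression(penalty="l2")
    clf.fit(svd.transform(X_train), y_train)    # train the linear classifier in the reduced space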

Also if over-fitting is an issue, Elastic Net regularization (l1 + l2) on the raw feature set might be better than l2 (not sure though, need extensive cross validation to check).
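For the elastic net option, a minimal sketch with SGDClassifier (the logistic loss is spelled "log_loss" in recent scikit-learn releases and "log" in older ones; alpha and l1_ratio are illustrative and would need the cross validation mentioned above):

    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss="log_loss", penalty="elasticnet",
                        alpha=1e-4, l1_ratio=0.15)   # both to be chosen by cross validation
    clf.fit(X_train, y_train)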

answered Apr 21 '11 at 09:21

ogrisel

@ogrisel: I have several hundred labeled examples and thousands (hundreds of thousands?) of features. I tried running scikit-learn's SGDClassifier with 5-fold cross-validation. I optimized F1, and cross-validated the loss and the penalty. Elastic net does best, and l1 does well, but a lot of the time they select a low penalty.

When I look at the model, I see that there are singleton features (things that occur only once in the training data, but may occur more often in the wild) that have high weights associated with them. What is happening is that elastic net and l1 want a low penalty for the common features, but that then leads to high weights for a few rare features (to "fix" some noisy examples, maybe?). These highly-weighted rare features are not detected in cross-validation because they are not present in the test fold.
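A small diagnostic sketch of that inspection, assuming clf is the fitted SGDClassifier and X_train the sparse count matrix:

    import numpy as np

    doc_freq = np.asarray((X_train > 0).sum(axis=0)).ravel()   # documents containing each word
    singleton = np.where(doc_freq == 1)[0]                     # features seen in exactly one document
    weights = clf.coef_.ravel()
    print("largest |weight| on a singleton feature:", np.abs(weights[singleton]).max())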

Besides PCA, is there a way to directly determine, while learning, what features are too rare / not likely to generalize?

(Apr 26 '11 at 16:10) Joseph Turian ♦♦

Interesting report. I think we need to improve the scikit-learn text vectorizer to add a cut-off that zeroes out features appearing only a couple of times in the whole dataset.

Have you tried to do that filtering manually? Does it improve the predictive accuracy?

(Apr 27 '11 at 02:43) ogrisel

I have not done direct comparisons, but using a frequency threshold is a simple but effective feature selection method.

However, more advanced feature selection methods for text categorization are available: "An extensive empirical study of feature selection metrics for text classification" by Forman (2003) finds that BNS is the best of all the methods he tests. Doc-freq's advantage, compared to BNS, is that it does not require class labels.
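A hedged sketch of the Bi-Normal Separation score from Forman (2003): the absolute difference of inverse normal CDFs of a word's true-positive and false-positive document rates, with the rates clipped away from 0 and 1 so the inverse CDF stays finite; variable names are placeholders:

    import numpy as np
    from scipy.stats import norm

    def bns(pos_doc_freq, neg_doc_freq, n_pos, n_neg, eps=0.0005):
        """Bi-Normal Separation per feature, given document frequencies in each class."""
        tpr = np.clip(pos_doc_freq / n_pos, eps, 1 - eps)
        fpr = np.clip(neg_doc_freq / n_neg, eps, 1 - eps)
        return np.abs(norm.ppf(tpr) - norm.ppf(fpr))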

(May 03 '11 at 21:09) Joseph Turian ♦♦

@Joseph: your comments suggest an interesting idea to me: different regularization penalties for different weights/features. I'd try picking a threshold (e.g. 5% occurrence) and then use CV to determine separate regularization weights for frequent vs. infrequent words. Ideally, you'd learn a function mapping occurrence to weight, but a single threshold is a good start to see whether the idea is any good.
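One way to approximate per-group penalties with an off-the-shelf l2 solver is to rescale the columns: under l2 regularization, multiplying a column by a factor s multiplies its effective penalty by 1/s^2, so shrinking the rare-word columns penalizes them more. A sketch assuming a dense X_train (for a sparse matrix, X_train.multiply(scales) would replace the broadcasting); the 5% threshold and the scale grid are placeholders:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    doc_freq = (X_train > 0).sum(axis=0)          # documents containing each word
    rare = doc_freq < 0.05 * X_train.shape[0]     # "infrequent" words per the 5% threshold

    def fit_with_rare_scale(rare_scale, C):
        scales = np.where(rare, rare_scale, 1.0)
        Xs = X_train * scales                     # rare_scale < 1 penalizes rare words more strongly
        return LogisticRegression(penalty="l2", C=C, max_iter=1000).fit(Xs, y_train)

    # cross-validate rare_scale over e.g. [1.0, 0.5, 0.2, 0.1] together with C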

(May 29 '11 at 10:13) jrennie

I don't think PCA will help much here, as PCA will retain high-variance dimensions, including common words. You want to retain only "interesting" dimensions, such as those which are usually dormant but spike occasionally, i.e. dimensions which look like they are created from a mixture distribution ("off" for certain "categories", "on" for others). It's also worth looking at classic term weighting work. IDF, in particular, is a metric that penalizes common words. It's weak on its own, but can be useful in combination with other metrics. I wrote a paper on the subject:

http://qwone.com/~jason/papers/sigir05-informativeness.pdf

I'd also suggest that you look at the Church and Gale paper on (residual) IDF.
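A hedged sketch of Church and Gale's residual IDF (worth checking against the paper): the observed IDF of a word minus the IDF a Poisson model would predict from its total count, so large values flag "bursty" words concentrated in a few documents:

    import numpy as np

    def residual_idf(doc_freq, term_freq, n_docs):
        idf = -np.log2(doc_freq / n_docs)                            # observed IDF
        expected_df = n_docs * (1.0 - np.exp(-term_freq / n_docs))   # Poisson-predicted document frequency
        expected_idf = -np.log2(expected_df / n_docs)
        return idf - expected_idf                                    # residual: observed minus predicted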

answered Apr 27 '11 at 09:25

jrennie

edited Apr 27 '11 at 12:21

ogrisel

This is a very interesting paper. Do you know if your mixing score + IDF combo or the R-IDF score can be used successfully in an extractive summarization setting to identify the top sentences that best summarize a document?

(Apr 28 '11 at 12:41) ogrisel

I don't know, but it certainly sounds like an interesting idea. :)

(Apr 28 '11 at 13:14) jrennie

I know it sounds like an overly simple solution, but have you tried simply removing the words with only one or two instances in your dataset? I know that in authorship attribution only the top L words are chosen (L can be anywhere between 50 and 1000 in normal usage, usually lower). I'm not a text categorisation expert, but I believe that you usually drop the most and least frequent words, i.e. list the set of frequencies (removing duplicates) and take the words with frequencies in the 5%-95% range. The most frequent words usually don't tell you anything about the category and the least frequent words do not contain enough information.
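A sketch of that filtering as it later landed in scikit-learn, where CountVectorizer grew min_df / max_df parameters; the thresholds below are illustrative and raw_documents stands for the labeled texts:

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(min_df=3,       # drop words appearing in fewer than 3 documents
                                 max_df=0.95)    # drop words appearing in more than 95% of documents
    X = vectorizer.fit_transform(raw_documents)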

answered Apr 27 '11 at 21:50

Robert Layton

We should indeed add a min_df (minimum document frequency) optional parameter to the TfIdfTransformer and a min_count to the CountVectorizer of scikit-learn (to drop features that have only one or two occurrences, for instance).

Any patch / pull request welcomed :)

(Apr 28 '11 at 03:13) ogrisel

I am not sure it is what you are looking for, but a quick hack could be to index your training set with Lucene (or your favorite search engine). You can then use the index for KNN with your unlabelled document as the query.

You get a model that copes well with rare words, and TF/IDF weighting for free.
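A rough scikit-learn approximation of the same idea without a search engine, with TF-IDF vectors and cosine nearest neighbours standing in for the Lucene index (train_texts, train_labels and query_texts are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(train_texts)

    knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")   # cosine similarity over TF-IDF vectors
    knn.fit(X_train_tfidf, train_labels)
    predictions = knn.predict(vectorizer.transform(query_texts))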

answered May 05 '11 at 22:46

Alexandre Patry

You could try training an RBM (replicated softmax visible units) generatively to model the bags of words and compressing them down to a lower dimensional hidden state. Then you can run logistic regression on that. This may end up ignoring some rare words. Since it is trained generatively, it might actually ignore too many rare words. I would need more information about the task to know if this was a good idea. This also leverages your unlabeled data directly in the feature learning.
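A loose analogue only: scikit-learn ships BernoulliRBM rather than the replicated softmax units mentioned here, so this just illustrates the shape of the pipeline (unsupervised feature learning on unlabeled data, then logistic regression on the hidden representation); X_unlabeled, X_train, and y_train are placeholders, and the counts are binarized because BernoulliRBM expects values in [0, 1]:

    from sklearn.neural_network import BernoulliRBM
    from sklearn.linear_model import LogisticRegression

    binarize = lambda X: (X > 0).astype(float)          # BernoulliRBM expects inputs in [0, 1]

    rbm = BernoulliRBM(n_components=100, learning_rate=0.05, n_iter=20)
    rbm.fit(binarize(X_unlabeled))                      # unsupervised pre-training on the unlabeled corpus

    clf = LogisticRegression(penalty="l2")
    clf.fit(rbm.transform(binarize(X_train)), y_train)  # classify on the compressed hidden activations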

Also, if you can share statistical strength across words by using some sensible word representations, that might also help. Even something as simple as compressing the vocabulary with Brown clustering. This might help the most if the rare words are in some sense "synonyms" of other words, but once again I would need to know a bit more about what sort of data you are working with to know whether this is reasonable. I could imagine the rare words being something like URLs, in which case this probably wouldn't help if the actual URL (and not just the fact that the word WAS a URL) was relevant to the classification task.

answered May 28 '11 at 01:54

gdahl ♦
