I am doing bag-of-words text classification (text categorization) with very few labeled examples, so, as you can imagine, I have high-dimensional sparse feature vectors. Very few labeled examples means under 100, sometimes only 10 or 20.

As a baseline, I ran l2-regularized logistic regression. I choose hyperparameters (l2 parameter, learning rate, number of training passes) using leave-one-out cross-validation: for each leave-one-out split, I compute the logistic loss of the trained classifier on the left-out example, and I choose the hyperparameters that minimize the total logistic loss across all splits. [edit: This also happens when I use l1 and elasticnet. See my comment on ogrisel's answer for more details.]

The problem is that I am overfitting the rare features. I end up with a low regularization parameter, and the rare features (words that appear in only one or two examples) have high weights.

How do I avoid overfitting the rare features when learning a classifier over very few labeled examples? Should I use a different model? Should I use a different cross-validation technique? What approach will give the best generalization when doing supervised classification over few high-dimensional labeled examples?
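For concreteness, here is a minimal sketch of the baseline described above, assuming scikit-learn; X (a sparse document-term matrix) and y (a NumPy label array) are placeholders, and only the l2 strength is searched rather than the full hyperparameter set:

```python
# Leave-one-out selection of the l2 strength by total logistic loss,
# as described in the question. X and y are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

def total_loo_log_loss(C, X, y):
    """Sum of per-example logistic losses over all leave-one-out splits."""
    total = 0.0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = LogisticRegression(penalty="l2", C=C, solver="liblinear")
        clf.fit(X[train_idx], y[train_idx])
        # probability the model assigns to the true class of the held-out example
        p = clf.predict_proba(X[test_idx])[0]
        true_col = list(clf.classes_).index(y[test_idx][0])
        total += -np.log(max(p[true_col], 1e-12))
    return total

# C is the *inverse* regularization strength: small C = strong l2 penalty.
candidate_Cs = np.logspace(-3, 3, 7)
best_C = min(candidate_Cs, key=lambda C: total_loo_log_loss(C, X, y))
```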
If you have a large unlabeled dataset (e.g. a Wikipedia dump), I would try to learn an unsupervised lower-dimensional representation of the documents from it and train the classifier on those features instead of the raw bag of words. Also, if over-fitting is an issue, Elastic Net regularization (l1 + l2) on the raw feature set might be better than l2 (not sure though, would need extensive cross-validation to check).

@ogrisel: I have several hundred labeled examples and thousands (hundreds of thousands?) of features. I tried running scikit-learn's SGDClassifier with 5-fold cross-validation, optimizing F1 and cross-validating the loss and penalty. Elasticnet does best, and l1 does well, but a lot of the time they select a low penalty. When I look at the model, I see that there are singleton features (things that occur only once in the training data, but may occur more often in the wild) that have high weights associated with them. What is happening is that elasticnet and l1 want a low penalty for the common features, but that then leads to high weights for a few rare features (to "fix" some noisy examples, maybe?). These highly-weighted rare features are not detected in cross-validation because they are not present in the test fold. Besides PCA, is there a way to directly determine, while learning, which features are too rare / not likely to generalize?
(Apr 26 '11 at 16:10)
Joseph Turian ♦♦
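As an aside, the grid search described in the comment above might look roughly like this in current scikit-learn (loss names and parameter ranges are illustrative; X and y are placeholders):

```python
# 5-fold grid search over loss, penalty, and regularization strength,
# selecting by F1, as in the comment above.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "loss": ["log_loss", "hinge"],            # logistic regression vs. linear SVM
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2],  # larger alpha = stronger penalty
}
search = GridSearchCV(SGDClassifier(max_iter=1000), param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```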
Interesting report. I think we need to improve the scikit-learn text vectorizer to add a cut-off that zeroes out features appearing only a couple of times in the whole dataset. Have you tried doing that filtering manually? Does it improve the predictive accuracy?
(Apr 27 '11 at 02:43)
ogrisel
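A minimal sketch of that manual filtering, assuming X is the (sparse) document-term count matrix:

```python
# Drop columns that occur in fewer than 3 documents before training.
import numpy as np

doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()  # document frequency per term
keep = doc_freq >= 3                                # discard near-singleton features
X_filtered = X[:, keep]
```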
I have not done direct comparisons, but using a frequency threshold is a simple yet effective feature selection method. However, more advanced feature selection methods for text categorization are available: "An extensive empirical study of feature selection metrics for text classification" by Forman (2003) finds that BNS is the best of all the methods he tests. Doc-freq's advantage, compared to BNS, is that it does not require class labels.
(May 03 '11 at 21:09)
Joseph Turian ♦♦
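For reference, a hedged sketch of BNS scoring as described in Forman (2003): each term is scored by abs(ppf(tpr) - ppf(fpr)), where ppf is the inverse standard normal CDF, tpr is the fraction of positive documents containing the term, and fpr the fraction of negative documents. X (a document-term matrix) and y (a 0/1 NumPy label vector) are placeholders:

```python
# Bi-Normal Separation term scores for a binary classification task.
import numpy as np
from scipy.stats import norm

def bns_scores(X, y, eps=0.0005):
    X = (X > 0)                       # use document presence, not term frequency
    pos, neg = (y == 1), (y == 0)
    tpr = np.asarray(X[pos].sum(axis=0)).ravel() / max(pos.sum(), 1)
    fpr = np.asarray(X[neg].sum(axis=0)).ravel() / max(neg.sum(), 1)
    # clip rates away from 0 and 1 so the inverse normal CDF stays finite
    tpr = np.clip(tpr, eps, 1 - eps)
    fpr = np.clip(fpr, eps, 1 - eps)
    return np.abs(norm.ppf(tpr) - norm.ppf(fpr))

# e.g. keep the 1000 highest-scoring terms:
# top = np.argsort(bns_scores(X, y))[-1000:]
```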
@Joseph: your comments suggest an interesting idea to me: different regularization penalties for different weights/features. I'd try picking a threshold (e.g. 5% occurrence) and then using CV to determine separate regularization weights for frequent vs. infrequent words. Ideally, you'd learn a function mapping occurrence to regularization weight, but a single threshold is a good start to see whether the idea is any good.
(May 29 '11 at 10:13)
jrennie
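One way to prototype this idea (a sketch under assumptions, not an established recipe): with a single global l2 penalty, rescaling a feature column by a factor s is equivalent to dividing that feature's effective penalty by s squared, so a separate penalty for infrequent words can be cross-validated through a column scale. X (sparse counts) and y are placeholders, and the 5% threshold and candidate scales are illustrative:

```python
# Per-group regularization via column rescaling under a single global l2 penalty.
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

doc_freq = np.asarray((X > 0).sum(axis=0)).ravel() / X.shape[0]
rare = doc_freq < 0.05                       # "infrequent" = below 5% of documents

def rescaled(X, rare_scale):
    # shrinking rare columns forces larger weights on them, so the shared
    # l2 penalty bites harder on exactly those features
    scale = np.where(rare, rare_scale, 1.0)
    return X @ sp.diags(scale)

best_scale = max(
    [0.1, 0.3, 1.0],                         # candidate scales for rare columns
    key=lambda s: cross_val_score(
        LogisticRegression(penalty="l2", C=1.0, solver="liblinear"),
        rescaled(X, s), y, cv=5, scoring="f1",
    ).mean(),
)
```

In practice you would tune the global C jointly with the rare-column scale; the scaling trick is exact for l2 and only approximate in spirit for l1/elasticnet.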
I don't think PCA will help much here, as PCA will retain high-variance dimensions, including common words. You want to retain only "interesting" dimensions, such as those which are usually dormant but spike occasionally, i.e. dimensions which look like they are generated from a mixture distribution ("off" for certain "categories", "on" for others). It's also worth looking at classic term weighting work. IDF, in particular, is a metric that penalizes common words. It's weak on its own, but can be useful in combination with other metrics. I wrote a paper on the subject: http://qwone.com/~jason/papers/sigir05-informativeness.pdf I'd also suggest that you look at the Church and Gale paper on (residual) IDF.

This is a very interesting paper. Do you know whether your mixing score + IDF combo, or the R-IDF score, could be used successfully in an extractive summarization setting to identify the top sentences that best summarize a document?
(Apr 28 '11 at 12:41)
ogrisel
I don't know, but it certainly sounds like an interesting idea. :)
(Apr 28 '11 at 13:14)
jrennie
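For concreteness, a small sketch of plain IDF and Church and Gale's residual IDF (observed IDF minus the IDF predicted by a Poisson model fit to each term's overall rate); X is a placeholder document-term count matrix in which every column occurs at least once:

```python
# IDF and residual IDF (Church & Gale) term scores.
import numpy as np

def idf_and_residual_idf(X):
    n_docs = X.shape[0]
    df = np.asarray((X > 0).sum(axis=0)).ravel()   # document frequency
    cf = np.asarray(X.sum(axis=0)).ravel()         # collection frequency
    idf = -np.log2(df / n_docs)
    expected_df_rate = 1.0 - np.exp(-cf / n_docs)  # Poisson-predicted doc rate
    predicted_idf = -np.log2(expected_df_rate)
    return idf, idf - predicted_idf                # residual IDF

# High residual IDF flags terms that appear in fewer documents than their
# overall frequency predicts, i.e. bursty / topical terms.
```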
I know it sounds like an overly simple solution, but have you tried simply removing the words with only one or two occurrences in your dataset? I know that in authorship attribution only the top L words are chosen (L can be anywhere between 50 and 1000 in normal usage, usually lower). I'm not a text categorisation expert, but I believe that you usually drop both the most and the least frequent words: list the set of frequencies (remove duplicates) and take the words with frequencies in the 5%-95% range. The most frequent words usually don't tell you anything about the category, and the least frequent words do not contain enough information.

We should indeed add such a cut-off to the scikit-learn vectorizer. Any patch / pull request welcomed :)
(Apr 28 '11 at 03:13)
ogrisel
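If you use scikit-learn's vectorizer, this kind of frequency-band filtering can be expressed directly (the thresholds below are illustrative; docs is a placeholder corpus):

```python
# Drop both the rarest and the most common words at vectorization time.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    min_df=2,      # drop words appearing in fewer than 2 documents
    max_df=0.95,   # drop words appearing in more than 95% of documents
)
X = vectorizer.fit_transform(docs)
```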
I am not sure it is what you are looking for, but a quick hack could be to index your training set with Lucene (or your favorite search engine). You can then use the index for KNN, with your unlabelled document as the query. You get a model that copes well with rare words, and TF/IDF weighting for free.
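A rough in-memory equivalent of the Lucene hack, using scikit-learn's TF-IDF vectorizer and a cosine-distance nearest-neighbour classifier (train_docs, train_labels, and new_docs are placeholders):

```python
# TF-IDF + cosine KNN as a stand-in for search-engine-based KNN.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train, train_labels)
predictions = knn.predict(vectorizer.transform(new_docs))
```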
You could try training an RBM (with replicated softmax visible units) generatively to model the bags of words and compress them down to a lower-dimensional hidden state, then run logistic regression on that. This may end up ignoring some rare words; since it is trained generatively, it might actually ignore too many rare words. I would need more information about the task to know whether this is a good idea. This approach also leverages your unlabeled data directly in the feature learning.

Also, sharing statistical strength across words by using some sensible word representations might help, even something as simple as compressing the vocabulary with Brown clusters. This might help the most if the rare words are in some sense "synonyms" of other words, but once again I would need to know a bit more about what sort of data you are working with to know whether this is reasonable. I could imagine the rare words being something like URLs, in which case this probably wouldn't help if the actual URL (and not just the fact that the word WAS a URL) was relevant to the classification task.
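As a very rough sketch of the pipeline shape only (scikit-learn does not ship a replicated-softmax RBM, so BernoulliRBM on binarized bags of words is used here as a stand-in; X_unlabeled, X_labeled, and y are placeholders):

```python
# Generative pretraining of a hidden representation, then logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import Binarizer

binarize = Binarizer()  # word presence/absence instead of counts

rbm = BernoulliRBM(n_components=100, learning_rate=0.05, n_iter=20)
rbm.fit(binarize.fit_transform(X_unlabeled))       # uses only unlabeled text

H = rbm.transform(binarize.transform(X_labeled))   # hidden representation
clf = LogisticRegression().fit(H, y)               # only this step sees labels
```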
See also: Feature selection on sentiment analysis and Text classifiers using accidental features
Aside: The webpage for LIBLINEAR says: "If you are a beginner and your data sets are not large, you should consider LIBSVM first."
"Learning from Little: Comparison of Classifiers Given Little Training" (Forman + Cohen) does a survey, comparing naive bayes, logreg, and SVMs, using various feature selection techniques (information gain and Bi-Normal Separation). Their result is that the choice of learner and feature selection method are interdependent, and also depend upon the number of positive and negative examples.