I am doing bag-of-words text classification (text categorization) with very few labeled examples: under 100, sometimes only 10 or 20. As you can imagine, my feature vectors are high-dimensional and sparse.

As a baseline, I ran l2-regularized logistic regression. I choose hyperparameters (the l2 parameter, the learning rate, and the number of training passes) using leave-one-out cross-validation: for each leave-one-out split, I compute the logistic loss of the trained classifier on the left-out example, and I pick the hyperparameters that minimize the total logistic loss across all splits.

The problem is that I am overfitting the rare features. Cross-validation selects a low regularization parameter, and the rare features (words that appear in only one or two examples) end up with high weights.

How do I avoid overfitting the rare features when learning a classifier from so few labeled examples? Should I use a different model? Should I use a different cross-validation technique? What approach will give the best generalization when doing supervised classification over few high-dimensional labeled examples?
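For concreteness, here is a minimal sketch of the baseline I described, written with scikit-learn. The data here is synthetic and the hyperparameter grid is illustrative (my real setup also tunes the learning rate and number of passes); it only shows the leave-one-out selection loop.

```python
# Sketch of the LOO hyperparameter-selection baseline described above.
# Synthetic data and grid values are illustrative, not my actual setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
n, d = 20, 500                                   # few examples, many features
X = (rng.random((n, d)) < 0.02).astype(float)    # sparse bag-of-words matrix
y = rng.integers(0, 2, size=n)                   # binary labels

def loo_log_loss(C):
    """Total logistic loss over all leave-one-out splits for one C value."""
    total = 0.0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[test_idx])
        total += log_loss(y[test_idx], proba, labels=[0, 1])
    return total

# Pick the C (inverse l2 strength, so small C = strong regularization)
# that minimizes total leave-one-out logistic loss.
best_C = min([0.01, 0.1, 1.0, 10.0], key=loo_log_loss)
```

With my real data this procedure keeps picking a large C (weak regularization), which is exactly where the rare-feature weights blow up.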