|
Hi, I am working on a text classification problem. My data is highly imbalanced. For example, one category has 700 documents while another has 30. I have around 30 categories. I tried different classifiers and the performance is consistently poor. What is the best way to tackle this issue? Thanks |
|
A quick hack is to add copies of the smaller class's training samples to your training data. (Duplicate them uniformly, though: don't add the 4th example more often than the 6th!)

Why does this work? If you use a discriminative approach (e.g. logistic regression or a neural network), the optimizer minimizes a loss that stands in for the number of misclassifications. Remember that such a loss is a sum over all training samples. For two classes, it can be partitioned into two terms: one summing over the training samples of the bigger class and one over those of the smaller class. Assuming both classes are equally likely to occur in the real world, the correct fix is to let each term contribute the same amount to the loss, i.e. to weight each term by the inverse of its class's frequency in the training set (its empirical prior probability).

If you use a generative approach, you model p(x|c) and p(c) and get p(c|x) via Bayes' theorem. In this case the prior probabilities appear explicitly as p(c). Again, if the training set does not reflect the real-world class balance, you don't have to estimate p(c) from the data; you can set it by hand. The quick hack above has the same effect: after duplication, the estimated priors come out right.

If the class balance does reflect the real world and misclassifying A as B is as undesirable as misclassifying B as A, you probably have to use a different model to get better results. In some cases, however, misclassification costs are not symmetric, e.g. labeling a fraud as a legitimate action is much more expensive than the other way around. In that case you can use decision theory to make optimal decisions based on the p(c|x) that your classifier spits out.

To add to what Justin has said, you can probably use bagging with your biased sampling frequency to perform the generative approach.
(Sep 25 '11 at 18:05)
crdrn
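
For anyone who wants to try the ideas from the answer, here is a minimal sketch in plain Python. It is not code from the thread: the function names, the toy 700-vs-30 data, and the fraud/legit cost values are all made up for illustration. It shows inverse-frequency class weights, uniform duplication of the smaller class, and a minimum-expected-cost decision on top of p(c|x).

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights that make every class contribute equally to a summed loss."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    # w_c = N / (K * n_c), so n_c * w_c is the same value (N / K) for every class.
    return {c: total / (n_classes * n) for c, n in counts.items()}


def oversample_uniformly(samples, labels):
    """Duplicate every sample of a class the same whole number of times."""
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = [], []
    for x, y in zip(samples, labels):
        # Same copy count for all samples of class y; sizes match only approximately.
        copies = max(1, round(target / counts[y]))
        out_samples.extend([x] * copies)
        out_labels.extend([y] * copies)
    return out_samples, out_labels


def min_expected_cost_class(posterior, cost):
    """Pick the prediction with the lowest expected cost.

    posterior: {class: p(c|x)}
    cost[true][pred]: cost of predicting `pred` when the truth is `true`.
    """
    def expected_cost(pred):
        return sum(p * cost[true][pred] for true, p in posterior.items())
    return min(posterior, key=expected_cost)


# Toy data mirroring the 700-vs-30 split in the question (documents are placeholders).
labels = ["big"] * 700 + ["small"] * 30
docs = [f"doc{i}" for i in range(len(labels))]

weights = inverse_frequency_weights(labels)
_, balanced_labels = oversample_uniformly(docs, labels)

# Hypothetical costs: missing a fraud is 10x worse than a false alarm.
cost = {"fraud": {"fraud": 0, "legit": 10}, "legit": {"fraud": 1, "legit": 0}}
decision = min_expected_cost_class({"fraud": 0.2, "legit": 0.8}, cost)

print(weights)
print(Counter(balanced_labels))
print(decision)
```

Note that with these costs the decision rule flags the example as fraud even though p(fraud|x) is only 0.2, which is the point of the decision-theory remark. In practice you would usually pass the weights to your classifier rather than duplicating data; scikit-learn's `class_weight="balanced"` option, for instance, uses the same N / (K * n_c) formula.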
|