I have data with heavy class imbalance: less than 1% of the data consists of positive examples. What is the effect of such class imbalance on a model trained with stochastic gradient descent? Is training directly on this data a good idea? If not, how should I modify or preprocess the data to get a better model?
If you have such a skewed class distribution, there is not much point in training a NN directly, especially because it is going to be very hard to beat your predictive baseline: if you assign every new example to the majority class, without doing any classification, you get a system that is right 99% of the time, and that is hard to beat. There are several things you can do. If you have enough data, you can take your positive class (only about 0.5% of the data) and train with the same number of negative examples, applying cross validation by selecting different samples from the larger set of negative examples. For such a skewed class distribution you could also use other methods, like SVMs with importance weights on the small class (cool explanation here). Here is a set of other thoughtful suggestions for dealing with skewed classes.

I can relate to what you are saying, and this is really what is happening. I'm getting an error rate of 0.3%, which sounds good, but actually means that I'm getting 30% of the positive cases wrong. Incidentally, I'm getting ALL of the negative examples correct. Actually, I have 10 or so classes, but I'm clubbing the other 9 classes together as the negative class. Is this a bad idea? Does clubbing classes into one reduce the information I'm providing the net? More formally: Case A: I train a single multi-class classifier on the 10 classes. Case B: I train 10 binary classifiers, each on one class versus the rest clubbed together, and then take the argmax of the confidences of these 10 models to make my prediction. Which would be better, or are they equivalent? Does the answer change if I'm considering SVMs, or convolutional networks? Should I append this to the main question?
(Dec 14 '12 at 18:42)
Sherjil Ozair
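To make the class-weighting idea from the answer above concrete, here is a minimal sketch using scikit-learn's SVC with its class_weight option; the data is a random placeholder, and the explicit weight of 100 mentioned in the comment is an assumed value, not something from the answer.

    # Minimal sketch: class-weighted SVM for a ~1% positive class (assumes scikit-learn).
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.randn(5000, 20)                    # placeholder features
    y = (rng.rand(5000) < 0.01).astype(int)    # roughly 1% positives, placeholder labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # class_weight="balanced" makes errors on the rare class cost more in the SVM loss,
    # roughly inversely proportional to class frequency; an explicit dict such as
    # {0: 1, 1: 100} also works (the 100 is an assumed value, not a recommendation).
    clf = SVC(kernel="rbf", class_weight="balanced")
    clf.fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))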
Some more information: I tried with 50-50 data in the batches. The error rate is now 3%, i.e. MUCH better than the predictive baseline. But the number of errors I'm making has vastly increased; now I'm making errors on the negative examples as well as the positive examples.
(Dec 14 '12 at 18:47)
Sherjil Ozair
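A rough sketch of how the 50-50 batches mentioned in the comment above could be drawn, assuming plain NumPy arrays for the data; the batch size and number of batches are placeholders.

    # Sketch: draw 50-50 minibatches for SGD from a heavily skewed dataset (NumPy only).
    import numpy as np

    def balanced_batches(X, y, batch_size=64, n_batches=100, seed=0):
        """Yield minibatches containing equal numbers of positive and negative rows."""
        rng = np.random.RandomState(seed)
        pos_idx = np.flatnonzero(y == 1)
        neg_idx = np.flatnonzero(y == 0)
        half = batch_size // 2
        for _ in range(n_batches):
            # The rare class is sampled with replacement so it can fill half of every batch.
            p = rng.choice(pos_idx, size=half, replace=True)
            n = rng.choice(neg_idx, size=half, replace=False)
            idx = rng.permutation(np.concatenate([p, n]))
            yield X[idx], y[idx]

    # Usage with placeholder data:
    X = np.random.randn(10000, 20)
    y = (np.random.rand(10000) < 0.01).astype(int)
    for xb, yb in balanced_batches(X, y):
        pass  # feed xb, yb to one SGD step of your model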
"like SVMs by giving importance weights to the small samples" Can't I simulate that same behaviour, by having different learning rates for different classes? Or another idea could be, to just copy the smaller class samples multiple times, so that they get to effect SGD more? Won't this effectively model the SVM's idea of penalties to different classes?
(Dec 14 '12 at 18:53)
Sherjil Ozair
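The copy-the-smaller-class idea from the comment above can be sketched as follows (NumPy only); the replication factor is simply chosen to even out the class counts, which is an assumption rather than a tuned value.

    # Sketch: oversample the minority class by replication so SGD sees it more often.
    import numpy as np

    def oversample_minority(X, y, seed=0):
        """Replicate minority-class rows until both classes are roughly the same size."""
        rng = np.random.RandomState(seed)
        pos_idx = np.flatnonzero(y == 1)
        neg_idx = np.flatnonzero(y == 0)
        factor = max(1, len(neg_idx) // max(1, len(pos_idx)))
        rep_idx = rng.permutation(np.concatenate([neg_idx, np.tile(pos_idx, factor)]))
        return X[rep_idx], y[rep_idx]

    # Replicating a positive example k times has the same effect on the summed gradient
    # as multiplying that example's loss (or learning rate) by k, which is essentially
    # the per-class penalty idea mentioned for SVMs.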
Did you do cross validation? If you have that many classes, you could train 10 classifiers and see how that behaves. Also, remember to calculate the recall as well as the accuracy, and the F-score couldn't hurt either.
(Dec 16 '12 at 22:17)
Leon Palafox ♦
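For the metrics mentioned in the comment above, a minimal sketch with scikit-learn; y_true and y_pred are tiny placeholder arrays.

    # Sketch: report recall and F-score alongside accuracy (assumes scikit-learn).
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # placeholder labels, heavily skewed
    y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # placeholder predictions

    print("accuracy :", accuracy_score(y_true, y_pred))   # 0.9, looks fine
    print("precision:", precision_score(y_true, y_pred))  # 1.0
    print("recall   :", recall_score(y_true, y_pred))     # 0.5, half the positives missed
    print("f1       :", f1_score(y_true, y_pred))         # ~0.67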
There really isn't much you can do if you have a large class skew. The best way forward is to pick the positive and negative cases in equal proportion. This is a very real problem that occurs quite often in credit card fraud data, where 99% of cases are legitimate and 1% are fraud. If you still see a large (or increased) error after the 50-50 split, you could try changing the model, say to a classification tree or even a simple linear regression.
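A hedged sketch of the 50-50 undersampling plus a simple classification-tree baseline suggested in this answer, assuming scikit-learn; the dataset and the tree depth are placeholders.

    # Sketch: undersample the majority class to a 50-50 split, then fit a simple tree.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(20000, 10)                    # placeholder features
    y = (rng.rand(20000) < 0.01).astype(int)    # roughly 1% positives, placeholder labels

    pos_idx = np.flatnonzero(y == 1)
    neg_idx = rng.choice(np.flatnonzero(y == 0), size=len(pos_idx), replace=False)
    idx = rng.permutation(np.concatenate([pos_idx, neg_idx]))

    tree = DecisionTreeClassifier(max_depth=5, random_state=0)   # assumed depth
    tree.fit(X[idx], y[idx])
    # Evaluate on the original skewed distribution, not only the balanced subset.
    print(tree.score(X, y))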
I work on this professionally (financial fraud detection), and the situation is not hopeless at all if you have enough good data; it's a business worth tens of millions of USD per year. The baseline positive-to-negative ratio in payment card fraud is more like 1:1000 or 1:10000, not even 1:100. I can't discuss proprietary techniques, but there are some general principles:
I have a similar problem with an 80%-20% ratio. So far I have gotten the best results by bootstrapping the samples of the smaller class so that I have the same number of samples from each class. I'm also measuring performance by looking at hit rate and false alarms (or d') rather than percent correct. I might also try optimizing a different criterion.
@rm9, could you explain how you are "bootstrapping the samples of the smaller class"? I'm unaware of this terminology, sorry.
To make the hit rate vs. false alarm graph, do you predict on data with the same class distribution as the training data?
More generally, what is the theory on having different class distributions in training and testing? In my intended application, the test scenario will have class imbalance, i.e. positive examples will be <1%. Keeping this in mind, should I train on this skewed distribution or on a 50-50 distribution?
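A minimal sketch of the bootstrap balancing and hit-rate/false-alarm (d') evaluation described a few comments above, assuming NumPy and SciPy; the clipping constant is an assumption to keep the z-transform finite.

    # Sketch: bootstrap (sample with replacement) the smaller class up to the size of the
    # larger one, and evaluate with hit rate, false-alarm rate, and d-prime.
    import numpy as np
    from scipy.stats import norm

    def bootstrap_balance(X, y, seed=0):
        """Resample minority-class rows with replacement up to the majority-class count."""
        rng = np.random.RandomState(seed)
        pos_idx = np.flatnonzero(y == 1)
        neg_idx = np.flatnonzero(y == 0)
        boot_pos = rng.choice(pos_idx, size=len(neg_idx), replace=True)
        idx = rng.permutation(np.concatenate([boot_pos, neg_idx]))
        return X[idx], y[idx]

    def hit_and_false_alarm(y_true, y_pred, eps=1e-3):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        hit_rate = np.mean(y_pred[y_true == 1] == 1)   # true positive rate
        fa_rate = np.mean(y_pred[y_true == 0] == 1)    # false positive rate
        # d' = z(hit rate) - z(false-alarm rate), clipped away from 0/1 to stay finite
        d_prime = (norm.ppf(np.clip(hit_rate, eps, 1 - eps))
                   - norm.ppf(np.clip(fa_rate, eps, 1 - eps)))
        return hit_rate, fa_rate, d_prime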
Another possibility to consider would be to downsample your common class by a factor of 20-100 and then train a large number of networks using different subsets of your common class and all of your uncommon class (with each network trained on a relatively nice 50-50-ish split). Evaluate the networks on different examples; hopefully they will be wrong on different ones due to the variation in your common class, and if so you can have them vote (or use some other ensemble variant such as boosting) to get a stronger classifier.
(I've never seen this done, but boosting can sometimes lead to very nice results, so it might be worth a try.)
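A rough sketch of the downsample-and-vote ensemble suggested above, using small scikit-learn MLPs as stand-ins for the networks; the ensemble size, subset sizes, and network architecture are assumptions.

    # Sketch: train several classifiers, each on all positives plus a different random
    # subset of the negatives, then combine them by majority vote.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(20000, 10)                    # placeholder features
    y = (rng.rand(20000) < 0.01).astype(int)    # roughly 1% positives, placeholder labels

    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)

    models = []
    for k in range(10):                         # assumed ensemble size
        sub_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
        idx = rng.permutation(np.concatenate([pos_idx, sub_neg]))
        net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=k)
        models.append(net.fit(X[idx], y[idx]))

    # Majority vote; averaging predicted probabilities would be a reasonable alternative.
    votes = np.mean([m.predict(X) for m in models], axis=0)
    y_vote = (votes >= 0.5).astype(int)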