
I have data with heavy class imbalance: less than 1% of the data consists of positive examples. What is the effect of such class imbalance on a model trained by stochastic gradient descent? Is training directly on this data a good idea? If not, how should I modify or preprocess the data to get a better model?

asked Dec 12 '12 at 00:26

Sherjil Ozair

I have a similar problem with an 80%-20% ratio. So far I have gotten the best result by bootstrapping the samples of the smaller class so that both classes have the same number of samples. I also measure performance by looking at hit rate and false alarms (or d'), not just percent correct. I might also try optimizing a different criterion.
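
For concreteness, here is a minimal sketch of what I mean by bootstrapping the smaller class (a hypothetical helper; X, y as NumPy arrays with 0/1 labels are assumptions):

    import numpy as np

    # Resample the minority (positive) class with replacement until it
    # matches the number of negatives, then shuffle the combined set.
    def bootstrap_minority(X, y, seed=0):
        rng = np.random.default_rng(seed)
        pos = np.flatnonzero(y == 1)
        neg = np.flatnonzero(y == 0)
        boot = rng.choice(pos, size=len(neg), replace=True)
        idx = rng.permutation(np.concatenate([neg, boot]))
        return X[idx], y[idx]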

(Dec 12 '12 at 14:39) rm9

@rm9, could you explain how you are "bootstrapping the samples of the smaller class"? I'm unaware of this terminology, sorry.

To make the hit rate vs. false alarm graph, do you predict on data with the same class distribution as the training data?

More generally, what is the theory on having different class distributions in training and testing? In my intended application, the testing scenario will have class imbalance, i.e. positive examples would be <1%. Keeping this in mind, should I train on this skewed distribution or on a 50-50 distribution?

(Dec 14 '12 at 18:51) Sherjil Ozair

Another possibility to consider would be to downsample your common class by a factor of 20-100 and then train a large number of networks using different subsets of your common class and all of your uncommon class (with each network being trained on a relatively nice 50-50-ish split). Evaluate your networks, and hopefully they will be wrong on different examples due to variation in your common class; if so, you can have them vote (a bagging-style ensemble) in order to get a stronger classifier.

(I've never seen this done, but this kind of ensembling can sometimes lead to very nice results, so it might be worth a try.)
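
A rough sketch of the idea, assuming NumPy arrays X_pos/X_neg for the two classes and a placeholder train_network function (not a specific library's API):

    import numpy as np

    def train_ensemble(X_pos, X_neg, n_models=20, seed=0):
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_models):
            # Each member sees all positives plus a fresh negative downsample.
            sub = rng.choice(len(X_neg), size=len(X_pos), replace=False)
            X = np.concatenate([X_pos, X_neg[sub]])
            y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(sub))])
            models.append(train_network(X, y))  # placeholder trainer
        return models

    def vote(models, X):
        # Majority vote over hard 0/1 predictions from each member.
        preds = np.stack([m.predict(X) for m in models])
        return (preds.mean(axis=0) > 0.5).astype(int)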

(Dec 24 '12 at 23:09) Andrew Gibiansky

3 Answers:

If you have such a skewed class distribution, there is not much point in training a NN directly, especially because it is going to be very hard to beat your predictive baseline.

If you assign every new example to the majority class, without doing any classification at all, you get a system that is right 99% of the time. It is going to be hard to beat that.

There are several things you can do. If you have enough data, you can take all of your positive class (the <1% of the data) and train with an equal number of negative examples; do this under cross-validation, selecting different subsamples from the much larger set of negative examples each time.
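
One fold of this procedure might look like the following sketch with scikit-learn (X, y are assumptions; repeat with different random_state values to cycle through negative subsamples):

    import numpy as np
    from sklearn.utils import resample

    X_pos, X_neg = X[y == 1], X[y == 0]
    # Draw a fresh negative subsample the same size as the positive class.
    X_neg_sub = resample(X_neg, n_samples=len(X_pos), replace=False,
                         random_state=0)
    X_bal = np.vstack([X_pos, X_neg_sub])
    y_bal = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg_sub))])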

However, for such a skewed class distribution you could also use other approaches, like SVMs that give importance weights to the minority-class samples (Cool Explanation here).
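
In scikit-learn this weighting is a one-liner; a sketch (the 100x weight is an arbitrary assumption, and class_weight="balanced" would infer weights from class frequencies instead):

    from sklearn.svm import SVC

    # Misclassifying a positive example costs 100x more than a negative one.
    clf = SVC(kernel="rbf", class_weight={0: 1.0, 1: 100.0})
    clf.fit(X, y)  # X, y assumed from the surrounding discussion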

Here is a set of other thoughtful suggestions to deal with skewed classes.

answered Dec 12 '12 at 01:02

Leon Palafox ♦

I can relate to what you are saying, and this is really what is happening. I'm getting an error rate of 0.3%, which sounds good, but actually means that I'm getting 30% of the positive cases wrong. Incidentally, I'm classifying ALL of the negative examples correctly.

Actually, I have 10 or so classes, but I'm lumping the other 9 classes together as the negative class. Is this a bad idea? Does lumping classes into one reduce the information I'm providing the net?

More formally, Case A: I train a single multi-class classifier on 10 classes. Case B: I train 10 binary classifiers, each of which takes one class as positive and lumps the rest together as negative; I then take the argmax of the confidences of these 10 models and make my prediction.
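
Case B, sketched with scikit-learn's one-vs-rest wrapper (X_train, y_train with 10 class labels are assumptions; ovr.predict would perform this argmax for you):

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    ovr = OneVsRestClassifier(LinearSVC())
    ovr.fit(X_train, y_train)
    scores = ovr.decision_function(X_test)   # one confidence per class
    pred = ovr.classes_[np.argmax(scores, axis=1)]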

Which would be better? Or are they equivalent? Does the answer change if I'm considering SVMs, or considering convolutional networks?

Should I append this to the main question?

(Dec 14 '12 at 18:42) Sherjil Ozair

Some more information.

I tried 50-50 data in the batches. The error rate is now 3%, i.e. MUCH better than the predictive baseline. But the number of errors I'm making has vastly increased: now I'm making errors on the negative examples as well as the positive ones.

(Dec 14 '12 at 18:47) Sherjil Ozair

"like SVMs by giving importance weights to the small samples"

Can't I simulate that same behaviour by having different learning rates for different classes?

Or another idea: just copy the smaller class's samples multiple times, so that they affect SGD more. Won't this effectively reproduce the SVM's idea of different penalties for different classes?
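
A sketch of the weighting version with scikit-learn's SGD (the 100x weight and the 0/1 labels are assumptions); weighting a positive example by k has essentially the same effect on the average gradient as copying it k times:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss="log_loss")      # logistic loss trained by SGD
    weights = np.where(y == 1, 100.0, 1.0)    # upweight the rare class
    clf.fit(X, y, sample_weight=weights)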

(Dec 14 '12 at 18:53) Sherjil Ozair

Did you do cross-validation? Since you have that many classes, you could train 10 classifiers and see how that behaves. Also, remember to calculate recall as well as accuracy, and the F-score could not hurt either.
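
For instance (a sketch; y_true and y_pred are assumed to exist):

    from sklearn.metrics import f1_score, precision_score, recall_score

    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("F-score:  ", f1_score(y_true, y_pred))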

(Dec 16 '12 at 22:17) Leon Palafox ♦

There really isn't much you can do if you have a large class skew. The best way forward is to pick positive and negative cases in equal proportion. This is a very real problem that occurs quite often in credit card fraud data, where 99% of cases are legitimate and 1% are fraud. If you still see a large (or increased) error after the 50-50 split, you could try changing the model to, say, a classification tree or even a simple linear regression.
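
A sketch of the suggested model switch on a 50-50 resampled set (X_bal, y_bal are assumptions, e.g. built as in the earlier answer):

    from sklearn.tree import DecisionTreeClassifier

    # A shallow tree as a simple, interpretable alternative model.
    tree = DecisionTreeClassifier(max_depth=5, class_weight="balanced")
    tree.fit(X_bal, y_bal)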

answered Dec 17 '12 at 11:46

Broccoli

I work on this professionally (financial fraud detection), and the situation is not hopeless at all if you have enough good data; it's a business worth tens of millions of USD per year. In payment card fraud the baseline positive-to-negative ratio is more like 1:1000 or 1:10000, not even 1:100. I can't discuss proprietary techniques, but there are some general principles:

  1. Be careful and think about your performance metric. What will be done with the score/prediction? Various choices change the 'operating point' around which the score is most useful. I like ROC, which is invariant to class proportions (though it is no longer invariant if you change the nature of the examples in each class); see the sketch after this list.
  2. It can be quite useful and practical to downsample/downweight the common class. There is a big range between the natural 1000:1 (e.g.) and a fully balanced 1:1. This choice has a big effect on what kind of model you actually end up optimizing.
  3. Very important: think of your effective "number of data points" as being close to the number of rare-class observations. If I had a million negative examples and only 50 positive examples, what sort of models should I consider? Should I try a big neural network, a nonlinear SVM, or a boosted tree ensemble? I wouldn't deliver one in the real world. With, say, 50 positives and 50 negatives, you or I might be comfortable with nothing more than an appropriately regularized, cross-validated logistic regression on at most 10 well-chosen features; with 50 positives and millions of negatives, it would still be risky to deploy anything more complex than that.
  4. With large amounts of the common class, you can also explore unsupervised modeling and clustering of it. Many of the resulting clusters will be "easy" to distinguish from the positives, but some will be "hard".
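
A sketch of the ROC evaluation from point 1 (y_true and continuous model scores are assumptions); since AUC does not change when you re-proportion the classes, models trained on different downsampling ratios can be compared on the same footing:

    from sklearn.metrics import roc_auc_score, roc_curve

    auc = roc_auc_score(y_true, scores)
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    # Pick the threshold whose (fpr, tpr) matches your operating point.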

answered Dec 24 '12 at 17:07

Matt

edited Dec 24 '12 at 17:09
