I have a learning/classification task to predict two classes (0 or 1). My test set is highly unbalanced, with only about 0.001% of the test set positive. If I use this test set with the ROC curve, almost all my algorithms show a very high AUC (around 0.98), and if I use precision-recall, the curve looks like a straight line (which I would also get from a near-random classifier). So neither of these gives answers that are useful for comparing algorithms.

Earlier, I was randomly selecting only a small portion of the negative test set to get equal numbers of positive and negative test examples. Is this the best way to go about it, or is there a better way to evaluate performance on highly unbalanced test sets? I would assume that in Information Retrieval one sometimes has a million times more irrelevant documents than true hits. How does one evaluate performance in those cases?
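
(A minimal sketch of how the two summary numbers being compared might be computed, assuming each classifier outputs a real-valued score for the positive class; scikit-learn and the tiny arrays below are only illustrative, not the asker's actual setup:)

    # Sketch: ROC AUC vs. PR AUC (average precision) on an imbalanced test set.
    # y_true/scores are placeholder data, not the asker's actual test set.
    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])               # 1 = rare positive class
    scores = np.array([0.1, 0.3, 0.2, 0.1, 0.4, 0.1, 0.2, 0.3, 0.5, 0.9])

    print("ROC AUC:               ", roc_auc_score(y_true, scores))
    print("PR AUC (avg precision):", average_precision_score(y_true, scores))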

Update: Thanks everyone for the answers. So the PR curve is actually much better than random, but it is still intriguing.

[Figures: precision-recall curve and ROC curve]

Update 2: I found that the problem was in my evaluation code: it was using the wrong test set examples. For highly unbalanced test sets, this is what the measures look like, and I think the PR curve is more informative than the ROC curve.

[Figures: corrected precision-recall curve and corrected ROC curve]

asked Feb 17 '11 at 00:07 by probreasoning (edited Feb 17 '11 at 19:24)

Both curves look like what you'd expect from random classification in an unbalanced dataset.

(Feb 17 '11 at 17:53) Alexandre Passos ♦

3 Answers:

It can be tempting to sample equally from two imbalanced classes, and the decision of whether to do it or not is probably something you should determine on a case-by-case basis.

On the one hand, if you do sample in equal amounts, this is equivalent to assuming that members of one class occur as often as members of the other. If you're using any probability estimates based on Bayes' rule, you're already violating your prior by implicitly stating that P(A) = P(B).

You could, however, modify your training procedure to take the prior into account. I'd be inclined to go down that road first.
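
(A minimal sketch of that road, assuming a linear model and using class weighting as one way to fold the prior into training instead of subsampling; the synthetic data is only illustrative:)

    # Sketch: keep the full imbalanced training set and reweight classes
    # by inverse frequency instead of subsampling negatives.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score

    X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)

    # class_weight="balanced" gives each example a weight inversely
    # proportional to its class frequency, so no negatives are discarded.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
    print(average_precision_score(y, clf.predict_proba(X)[:, 1]))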

Lastly, if the classifier is having trouble learning anything useful, the features it is being fed may not be informative enough, or some other issue is either preventing the information from being used appropriately or washing out the useful signal.

-Brian

This answer is marked "community wiki".

answered Feb 17 '11 at 15:12 by Brian Vandenberg

Just to be clear, you're talking about sampling the training set in an equal (balanced) manner, right?

(Feb 17 '11 at 15:21) Troy Raeder

Sorry, I should have been clearer about that. I'd go down the road of continuing to sample without forcing the classes to be equal, and attempt to incorporate a prior. If that doesn't work very well, only then would I start going down some other path.

Obviously there are extreme cases. Take market data as an example. It's hard to get anything approaching clean data in the forex markets that's more than about 5 years old. It's possible, but the data is usually full of holes, or the holes have been filled with data from other sources or with duplicates, so it isn't entirely reliable.

In other words, I have a limited span over which to perform analysis unless I'm working on long timeframes (e.g. daily, weekly, or monthly samples). Now suppose you want to predict a wild event like what happened last May (the 'flash crash'). There's a ton of data, and not many examples of such wild events.

In these cases you probably need to do something more exotic, in which case my answer doesn't really apply. For that, I'd probably start looking at it as a novelty detection problem instead of classification/prediction.

I'm looking at the original question from a perspective along the lines of "I have 10000 examples of porn sites, and only 100 examples of cooking sites. How do I train a classifier to correctly label web pages without drowning out the information from the cooking sites?"

In that classification problem, I'd be inclined to incorporate a prior and keep sampling from the (unmodified) distributions.

-Brian

(Feb 17 '11 at 16:07) Brian Vandenberg

Paul Mineiro's blog ( http://www.machinedlearnings.com/2010/11/on-unimportance-of-zeroes.html ) suggests that subsampling negatives is not all that harmful in practice, and he defends this viewpoint well with experimental evidence in that and subsequent posts.

(Feb 17 '11 at 16:10) Alexandre Passos ♦

I'm more or less arguing on (moral?) grounds; it's hard to argue with experimental evidence.

When you get right down to it, machine learning algorithms are more or less finding an optimal way to spread (risk?) somewhat like an investor would do when choosing how to distribute funds across stocks.

Based on that viewpoint, sampling from highly unbalanced classes such that both classes are given equal representation appears to me to be almost on par with saying that market crashes happen as often as other events in the market, or that earthquakes (in a small region) happen as often as sunny days.

-Brian

(Feb 17 '11 at 17:42) Brian Vandenberg

There are two things that can be learned: a positive/negative bias (a prior) and the actual decision surface. Subsampling can help you learn the direction of the hyperplane, and the bias can be computed from the known class proportions even while ignoring the specific examples. Plus, it doesn't affect either the ROC or the precision/recall curves.

(Feb 17 '11 at 17:52) Alexandre Passos ♦
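
(A minimal sketch of that bias correction, assuming logistic regression, all positives kept, and negatives retained with a known rate r; the data and names are illustrative:)

    # Sketch: subsample negatives for training, then correct the intercept
    # so the probabilities reflect the original class proportions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=50000, weights=[0.995, 0.005], random_state=0)

    r = 0.05                                        # keep 5% of negatives, all positives
    rng = np.random.default_rng(0)
    keep = (y == 1) | (rng.random(len(y)) < r)
    clf = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])

    # Subsampling negatives shifts the fitted log-odds up by -log(r);
    # adding log(r) back to the intercept undoes it, while the learned
    # direction of the hyperplane is left untouched.
    clf.intercept_ += np.log(r)
    calibrated = clf.predict_proba(X)[:, 1]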

The Owen paper, Infinitely Imbalanced Logistic Regression, supports my point in a convoluted way; it's an interesting read, anyway.

(Feb 17 '11 at 17:59) Alexandre Passos ♦

The paper:

H. He and E. A. Garcia, Learning from Imbalanced Data, IEEE TKDE 2009

has a section on performance metrics for imbalanced classification tasks, but for the most part I agree with Alexandre. If you compute precision and recall (or, more generally, a PR curve or the area under the PR curve), you should get a fair assessment of performance. If you're getting a nearly-random PR curve I would hypothesize that your classifier is nearly random.

For some practical issues about calculating PR area in a "fair" manner, see

Davis and Goadrich, The Relationship Between Precision-Recall and ROC Curves, ICML 2006
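
(One practical way to sidestep the interpolation issue they discuss is to summarize the curve with a step-wise average precision rather than trapezoidal integration of the PR points; a minimal sketch with made-up data:)

    # Sketch: two summaries of the same PR curve. Linear (trapezoidal)
    # interpolation between PR points can be overly optimistic;
    # average precision uses a step-wise sum instead.
    import numpy as np
    from sklearn.metrics import precision_recall_curve, auc, average_precision_score

    y_true = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])
    scores = np.array([0.2, 0.1, 0.8, 0.3, 0.1, 0.2, 0.4, 0.2, 0.1, 0.3])

    precision, recall, _ = precision_recall_curve(y_true, scores)
    print("trapezoidal PR area:", auc(recall, precision))
    print("average precision:  ", average_precision_score(y_true, scores))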

Hope this helps!

answered Feb 17 '11 at 14:17 by Troy Raeder

Usually I think you compute precision and recall on the positive class. A random classifier should have very low values for these quantities.
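
(A minimal sketch illustrating this with made-up data at roughly 0.1% prevalence: a random classifier's precision on the positive class ends up near the prevalence itself, i.e. very low:)

    # Sketch: precision/recall of the rare positive class for a classifier
    # that guesses at random on a heavily imbalanced test set.
    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    rng = np.random.default_rng(0)
    y_true = (rng.random(1_000_000) < 0.001).astype(int)    # ~0.1% positives
    y_pred = rng.integers(0, 2, size=y_true.shape)           # coin-flip predictions

    print("precision (positive class):", precision_score(y_true, y_pred))  # ~0.001
    print("recall    (positive class):", recall_score(y_true, y_pred))     # ~0.5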

answered Feb 17 '11 at 05:25 by Alexandre Passos ♦
