|
Consider a binary classification task where negative labels are more frequent than positive labels, e.g., negative labels are 10 times more likely apriori. I have this belief that if I subsample the negative labeled instances so that i have a test set that is approximately balanced, and compute an AUC on the subsampled test set, that in expectation I will get the same AUC as if I computed AUC on the complete test set (i.e., with many more negatives than positives). Is this known to be true? |
|
This is true. A point in the ROC curve is (true positive rate, false positive rate). True positive rate is the ratio of the number of examples correctly classified as positive and the total number of positive examples. This obviously does not change if you subsample the negative examples. The false positive rate is the number of negative examples classified as positive divided by the total number of negative examples, which should stay the same on average if you subsample from the set of negative examples. So subsampling is perfectly fine; since it does not change the ROC curve it cannot change the area under it. |
|
I worry a little bit when you say this:
I believe that you should never subsample the test set. The reason for this is that in an application, you do not get to choose which test samples you get to classify. You have to classify every test sample, which is why it is the "test set." An error that I have seen multiple times in submitted papers is that the authors, when dealing with an imbalanced problem such as you describe, will uniformly subsample the entire data set and then run cross-validation. This leads to an over-estimation of the performance, whether by AUC or by any other metric, simply because it artificially makes the problem easier. It is true that AUC as a metric is insensitive to class distribution but, as a prior poster alluded to, you cannot be certain that any subsampling you do maintains the difficulty of the problem. The proper approach, if you are going to do undersampling, is to undersample the training set to whatever level you desire, leaving the test set untouched, and do your evaluation that way. In addition to undersampling, you may wish to consider SMOTE or a combination of SMOTE and undersampling which often offer improved performance. A caveat though is that because SMOTE adds additional examples to the training set it may not be tractable for large data sets. I hope this helps and makes sense. |
|
Uniform subsampling does not affect AUC, but if your sample of either positive or negative examples is biased toward either "hard" or "easy" ones, AUC will be affected. I prefer the probabilistic interpretation of AUC over the geometric one: AUC is the probability that a random positive example will have a higher score than a random negative example (assuming no tied scores, which complicate things a bit). From this definition is follows that you select your random positive and negative examples without bias, you get the same expected value. |
|
When using AUC, one should be aware that there is an implicit cost function associated with it. See the following paper for a nice discussion of the issues: http://www.springerlink.com/content/y35743hp7010g354/ ... and the thing is, this implicit cost function may be different for different methods you want to compare.
(Jul 02 '10 at 05:39)
zeno
|
What do you mean by balanced? That the ratio of negative & positive labels stays the same?
Re: balanced, yes, same number of negative and positive examples. There is a deviation bound (http://research.microsoft.com/pubs/65637/agarwaletal04.pdf) which suggests for a particular computational budget the best estimate of AUC is from a balanced set.