Revision history
Revision n. 1

Jul 04 '10 at 00:45

Troy Raeder

I worry a little bit when you say this:

if I subsample the negative labeled instances so that i have a test set that is approximately balanced...

I believe that you should never subsample the test set. In an application, you do not get to choose which samples you classify. You have to classify every sample that arrives, and the test set is meant to stand in for exactly that situation.

An error that I have seen multiple times in submitted papers is that the authors, when dealing with an imbalanced problem such as you describe, uniformly subsample the entire data set and then run cross-validation. This leads to an overestimate of performance, whether by AUC or by any other metric, simply because it artificially makes the problem easier.

The proper approach, if you are going to do undersampling, is to undersample the training set to whatever level you desire, leaving the test set untouched, and do your evaluation that way.
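
To make the order of operations concrete, here is a minimal sketch in plain Python. The helper name `undersample_negatives`, the `neg_per_pos` parameter, the 70/30 split, and the toy data are all assumptions made for illustration; the only point is that the split happens first and only the training portion is undersampled.

```python
import random

def undersample_negatives(examples, neg_per_pos=1.0, seed=0):
    """Keep every positive and a random subset of the negatives.

    `examples` is a list of (features, label) pairs, with label 1 for the
    minority (positive) class and 0 for the majority (negative) class.
    `neg_per_pos` is the desired number of negatives per positive.
    """
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    n_keep = min(len(negatives), int(round(neg_per_pos * len(positives))))
    sampled = positives + rng.sample(negatives, n_keep)
    rng.shuffle(sampled)
    return sampled

# Toy data (hypothetical): 20 positives and 980 negatives.
rng = random.Random(42)
data = [((rng.random(),), 1) for _ in range(20)]
data += [((rng.random(),), 0) for _ in range(980)]
rng.shuffle(data)

# Split first, then undersample ONLY the training portion.
split = int(0.7 * len(data))
train, test = data[:split], data[split:]
balanced_train = undersample_negatives(train, neg_per_pos=1.0)

# `test` is left untouched, so it keeps the original class imbalance.
```

With cross-validation the same idea applies inside each fold: undersample the training folds, and evaluate on the held-out fold as it is.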

In addition to undersampling, you may wish to consider SMOTE (Synthetic Minority Over-sampling Technique), or a combination of SMOTE and undersampling, which often offers improved performance. A caveat, though, is that because SMOTE adds synthetic examples to the training set, it may not be tractable for large data sets.
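
For readers who have not seen SMOTE, the core idea is to create new minority examples by interpolating between an existing minority example and one of its nearest minority-class neighbours. The following is a simplified sketch of that idea, not a reference implementation: the function name `smote`, its parameters, and the toy points are assumptions, and the base example is picked at random rather than by looping over every minority example. The brute-force neighbour search also hints at why the method can become expensive on large data sets.

```python
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate `n_synthetic` points by interpolating between minority
    examples and their k nearest minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority examples")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute force: every other minority point is scanned and sorted
        # for each synthetic point that is generated.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy minority class (hypothetical 2-D points).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6, k=2)
```

As with undersampling, the synthetic points would be added to the training set only, never to the test set.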

I hope this helps and makes sense.

Revision n. 2

Jul 04 '10 at 01:01

Troy Raeder

An error that I have seen multiple times in submitted papers is that the authors, when dealing with an imbalanced problem such as you describe, uniformly subsample the entire data set and then run cross-validation. This leads to an overestimate of performance, whether by AUC or by any other metric, simply because it artificially makes the problem easier. It is true that AUC as a metric is insensitive to class distribution but, as a prior poster alluded to, you cannot be certain that any subsampling you do maintains the difficulty of the problem.
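
A small worked example of what "insensitive to class distribution" means may help. In the sketch below (the scores, labels, and the 0.5 threshold are made up for illustration), every negative is replicated exactly, so the per-class score distributions are unchanged: AUC stays the same while precision drops. A real random subsample only approximates this, which is why it cannot be relied on to preserve the difficulty of the problem.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

def precision(scores, labels, threshold):
    """Fraction of examples scored at or above `threshold` that are positive."""
    predicted_pos = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(predicted_pos) / len(predicted_pos)

# Hypothetical classifier scores on a small test set.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0, 0, 0]

# Same per-class score distribution, but three times as many negatives.
neg_scores = [s for s, y in zip(scores, labels) if y == 0]
scores_3x = scores + neg_scores * 2
labels_3x = labels + [0] * (2 * len(neg_scores))

auc_1x, auc_3x = auc(scores, labels), auc(scores_3x, labels_3x)
prec_1x = precision(scores, labels, 0.5)
prec_3x = precision(scores_3x, labels_3x, 0.5)
```

Here `auc_1x` and `auc_3x` agree, while `prec_3x` is lower than `prec_1x`, because the extra negatives add false positives without changing how positives rank against negatives.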

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.