
My data set is highly imbalanced: one class has 90,000 instances, while the other has only 800. After some research I found that SMOTE can help me overcome the class imbalance issue. After oversampling the minority class and performing ten-fold cross-validation on the oversampled data set, I got very good results (in terms of precision and recall). My question is: is it wrong to report this precision and recall, or am I missing some important detail?

I also tried dividing the data set into two parts (80:20), oversampling the first part and then using the second part as the test set. But this time I got very poor results: while my precision is over 0.5, the recall drops to only 0.2. If the previous approach to reporting results is wrong, is there any other way to improve the results, or any link to a research paper that can help me explain/defend my poor results? Thanks.
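
For reference, this is roughly the kind of setup I mean for the 80:20 experiment (just a sketch assuming scikit-learn and imbalanced-learn; the random forest is only a placeholder classifier):

    # 80:20 split; SMOTE is applied only to the training part,
    # the held-out test part is left untouched.
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import precision_score, recall_score
    from imblearn.over_sampling import SMOTE

    # X, y: features and labels of the full (imbalanced) data set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    print(precision_score(y_test, y_pred), recall_score(y_test, y_pred))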

asked Apr 16 '13 at 20:58


Muhammad Asaduzzaman


One Answer:

I do not think it is strictly wrong to report accuracy on a balanced set if you clearly highlight what you are doing. I often use this kind of metric, at least for internal reporting on my projects, but I have spent a lot of time explaining it to people. Personally I think area under the ROC curve is a better general metric in this kind of situation, but it is actually more confusing to people without a stats/ML background than balanced accuracy metrics.
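
For example, AUC only needs the classifier's scores on an untouched test set (a sketch with scikit-learn; clf is whatever binary classifier you trained, with the minority class labelled 1):

    # ROC AUC from predicted probabilities on the held-out data;
    # assumes clf exposes predict_proba and the minority class is label 1.
    from sklearn.metrics import roc_auc_score

    scores = clf.predict_proba(X_test)[:, 1]  # probability of the minority class
    print("ROC AUC:", roc_auc_score(y_test, scores))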

However, I think there is a major problem in using oversampling with cross-validation. It is easiest to see with simple oversampling: if you oversample before randomly splitting into folds, you will end up with copies of the same oversampled items in multiple folds, so you are not evaluating on unseen data; you are at least partly evaluating on the training set. To make this legitimate you need to either split the data into folds first and only oversample within the training folds, or alternatively use undersampling instead if you have enough data. With SMOTE you are not replicating the positive instances, but you are still deriving the new instances from the seen ones, so you should still only oversample within partitions. This also applies to train, validation & test splits; otherwise you are leaking information from your training data into your test data.
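
A minimal sketch of oversampling only within the training folds, assuming the imbalanced-learn package (its Pipeline re-fits SMOTE on each training fold and leaves the corresponding validation fold untouched; the random forest is just a placeholder classifier):

    # 10-fold CV where SMOTE is applied inside each training fold only.
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    pipe = Pipeline([
        ("smote", SMOTE(random_state=0)),
        ("clf", RandomForestClassifier(random_state=0)),
    ])

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    # Recall on the untouched validation folds; swap in any scorer you like.
    recalls = cross_val_score(pipe, X, y, cv=cv, scoring="recall")
    print(recalls.mean())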

I think there should be some kind of theoretical result showing that oversampling cannot be much better than undersampling, since no amount of majority class data can reduce the variance error due to the small amount of minority data, which will dominate.

answered Apr 19 '13 at 03:16


Daniel Mahler
