Hey there,

I'm new to machine learning and to this forum, and I have a beginner's question about imbalanced datasets. Here it goes:

I have a binary classification task where I'm mainly interested in accurately classifying the positive class (which is also the minority class in the target population). Unlike the common problem of not having enough positive (minority) class instances in the training set, my training dataset has the positive class in the majority.

Here's the target population composition (which I'd expect to find in the environment where my classifier/model would be deployed):

  • Positive Class: ~35%
  • Negative Class: ~65%

Here's my training set composition:

  • Positive Class: ~95%
  • Negative Class: ~5%

Since my training set composition differs drastically from the target population composition, will the classifier fail to generalize when classifying instances from the target population? As I mentioned earlier, I'm mainly interested in accurately classifying positive class instances, which my training set has in abundance.

I read the following description in a publication on imbalanced datasets: "The purpose of machine learning is for the classifier to estimate the probability distribution of the target population. Since that distribution is unknown we try to estimate the population distribution using a sample distribution. Statistics tells us that as long as the sample is drawn randomly, the sample distribution can be used to estimate the population distribution from where it was drawn. Hence, by learning the sample distribution we can learn to approximate the target distribution."

Since my training dataset cannot be considered a random sample of the target distribution, will this hurt the generalization power of my classifier? If so, what should be done to avoid it? Over/under-sampling? Cost matrices?

PS: I searched previous posts for issues similar to mine, but all of them dealt with not having sufficient examples of the minority class (the exact opposite of my scenario).

Thanks in advance, -S

asked Mar 19 '13 at 13:32

Sunny


One Answer:

Imbalanced data is an active research problem in machine learning. As you mention, over- and under-sampling are applicable, but they are often less effective than other approaches. Ensemble learning, for instance, is quite robust to imbalanced data; examples include random forests, bagging, and AdaBoost, all available in scikit-learn. If you wish to fit an SVM (or an ELM) to imbalanced data, one effective approach is to give more importance to the under-represented class (in your training set, the negative one) by increasing its class weight, which scales the misclassification penalty C for that class. This hyperparameter is easy to tune with scikit-learn. Hope this helps! :)
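For concreteness, here is a minimal sketch of both ideas in scikit-learn. The toy X and y below, and the specific weight values, are my own illustrative assumptions, not something from your setup: one plausible choice is to set each class weight to the ratio of its target-population prior to its training-set prior, i.e. 0.65/0.05 = 13 for the negative class and 0.35/0.95 ≈ 0.37 for the positive class.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC

    # Toy stand-ins for the real data: 1000 samples, 10 features,
    # ~95% positive / ~5% negative, mirroring the training set above.
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 10)
    y = (rng.rand(1000) < 0.95).astype(int)  # 1 = positive, 0 = negative

    # Class weights chosen as target prior / training prior (an assumed
    # heuristic), reweighting training toward the deployment distribution.
    weights = {0: 0.65 / 0.05,   # negative: rare in training, common at deployment
               1: 0.35 / 0.95}   # positive: abundant in training

    # Ensemble approach: a random forest with per-class weights.
    rf = RandomForestClassifier(n_estimators=100, class_weight=weights,
                                random_state=0)
    rf.fit(X, y)

    # SVM approach: class_weight[i] multiplies the penalty C for class i,
    # so misclassifying the upweighted (negative) class costs more.
    svm = SVC(kernel='rbf', C=1.0, class_weight=weights)
    svm.fit(X, y)

Alternatively, class_weight='balanced' reweights inversely to the training frequencies, which is a reasonable default when the deployment priors are unknown; since you know yours, the explicit ratios above directly correct the mismatch you describe.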

answered Mar 19 '13 at 16:47

Issam

edited Mar 19 '13 at 16:48
