Is there a specific classifier that can handle an unbalanced data set? I have a data set in which 80% of the instances come from one class and the rest from another. When I train classifiers like SVM or MaxEnt, they predict every instance as the majority class. Could someone suggest some ideas for how to improve the prediction accuracy?
Class and instance weighting is a great way to go. (As is under-sampling or over-sampling, though in some situations these are equivalent.) Decision trees and their variants (AdaBoost, random forests, etc.) also perform fairly well on imbalanced classes: the information gain (and other entropy-based) splitting criteria are more sensitive to the relative distribution of minority classes than maximum-likelihood (ML) or MCE-style loss functions. One comment about random forests and class imbalance: I've found that the rank ordering of the predicted scores is just fine, and on several data sets it was as good as the bag balancing Daniel mentions. The catch is that you need to adjust the decision threshold between classes to account for the class imbalance (a sketch of this threshold tuning follows below).
(Jan 08 '13 at 14:36)
Art Munson
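A minimal sketch of the threshold adjustment mentioned in the comment above: train a random forest on the imbalanced data, then pick a decision threshold on the predicted probabilities instead of relying on the default 0.5. The synthetic 80/20 data, the validation split, and the F1 criterion below are illustrative choices, not part of the original advice.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Synthetic 80/20 imbalanced binary problem, standing in for your data.
    X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_train, y_train)

    # Scores for the minority class; the rank ordering is often sensible even
    # when the default 0.5 cutoff predicts almost everything as the majority class.
    scores = forest.predict_proba(X_val)[:, 1]

    # Sweep thresholds on a held-out set and keep the one with the best F1.
    thresholds = np.linspace(0.05, 0.95, 19)
    best_t = max(thresholds, key=lambda t: f1_score(y_val, (scores >= t).astype(int)))
    print("chosen threshold:", best_t)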
Many of the available implementations support either class or instance weights. Weighting the minority class at four times the weight of the majority class should work for your data with those classifiers. Many of the classifiers in scikit-learn accept the string 'auto' for the class-weight parameter, which sets the weights so as to rebalance the training data. If your classifier does not support weights, you can resample your training data instead: either sample the minority-class items repeatedly (over-sampling) or undersample the majority class. You can also try a classifier that optimizes a loss function that does not have this problem; AUC is one such measure, and Philip Kegelmeyer recommends Hellinger distance. Unfortunately, not many freely available libraries offer these, so you may need to implement them yourself, but I believe Vowpal Wabbit, Sofia-ML, and the mboost library for R offer AUC optimization.
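A minimal sketch of the weighting and resampling suggestions above, using scikit-learn's linear SVM on synthetic 80/20 data (the data and the 4:1 weight are placeholders; note that newer scikit-learn releases use the string 'balanced' where older ones used 'auto'):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)

    # Explicit weights: majority class 0 gets weight 1, minority class 1 gets 4.
    svm_manual = LinearSVC(class_weight={0: 1, 1: 4})
    svm_manual.fit(X, y)

    # Or let the library set weights inversely proportional to class frequency.
    svm_rebalanced = LinearSVC(class_weight='balanced')
    svm_rebalanced.fit(X, y)

    # Resampling alternative: undersample the majority class so the classes match.
    rng = np.random.default_rng(0)
    majority = np.flatnonzero(y == 0)
    minority = np.flatnonzero(y == 1)
    keep = np.concatenate([rng.choice(majority, size=len(minority), replace=False), minority])
    svm_resampled = LinearSVC().fit(X[keep], y[keep])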
One way to approach the class imbalance problem is to transform the classification problem into a ranking problem. I recently read an excellent blog post discussing the practical machine learning tricks from the KDD 2011 best industry paper, published by several Googlers; handling class imbalance is one of them. I suggest you take a look at the blog post as well as the original paper. I haven't had a chance to try it yet, so if you decide to, please share your experience.
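For illustration only, here is one common way to recast classification as ranking: a RankSVM-style pairwise transform, where a linear model is trained on difference vectors so that minority examples score above majority examples (closely related to optimizing AUC). This is not necessarily the recipe from the paper mentioned above, and the data and pair counts are made up.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
    pos, neg = X[y == 1], X[y == 0]

    # Sample (positive, negative) pairs and label the direction of the difference.
    rng = np.random.default_rng(0)
    n_pairs = 5000
    p = pos[rng.integers(len(pos), size=n_pairs)]
    n = neg[rng.integers(len(neg), size=n_pairs)]
    diffs = np.vstack([p - n, n - p])
    labels = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])

    # No intercept: the model only needs to orient the difference vectors.
    ranker = LinearSVC(fit_intercept=False).fit(diffs, labels)

    # decision_function now gives scores whose ordering ranks positives above
    # negatives, rather than optimizing 0/1 accuracy on an imbalanced sample.
    scores = ranker.decision_function(X)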