I have a data set with the following class distribution:

In the training set, the positive and negative classes are roughly the same size, but in the test set the negative class is more than 10 times the size of the positive class. For this kind of data set, do I need to consider any class weighting when training a classifier?

asked Oct 31 '12 at 09:43 by surfreta

3 Answers:

If you care about finding a needle of positives in a haystack of negatives, then there are some things you can do.

  • If you're using a penalized method like an SVM or logistic regression, you can penalize one class more heavily than the other, e.g., positives get penalized less than negatives (see the sketch after this list).
  • You can try to simulate the rarity of positives at test time by down-sampling the positives in your training data to create subsets with a more realistic positive/negative ratio, and train the classifier on those. This will probably hurt your cross-validation performance but may be more realistic.
  • Not so much a method as an evaluation tool: use precision-recall curves, partial ROC curves (restricted to low false-positive rates), or positive-predictive-value (PPV) plots to evaluate how well your model does in the low false-positive-rate regime rather than over the entire range of FPRs (most of which you probably don't care about). Then you can check whether the two approaches above are actually working.
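A minimal sketch of the first two bullets, assuming scikit-learn; the synthetic data, the 10:1 class-weight ratio, and all variable names are illustrative assumptions rather than part of the original answer:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # Balanced synthetic training set standing in for the poster's data.
    X_train, y_train = make_classification(
        n_samples=2000, n_features=20, weights=[0.5, 0.5], random_state=0)

    # Class weights: penalize mistakes on the negatives (class 0) more heavily,
    # so the positives are penalized less, as described above.
    weighted_lr = LogisticRegression(class_weight={0: 10.0, 1: 1.0}, max_iter=1000)
    weighted_lr.fit(X_train, y_train)

    weighted_svm = SVC(class_weight={0: 10.0, 1: 1.0})
    weighted_svm.fit(X_train, y_train)

    # Down-sampling: keep roughly 1 positive for every 10 negatives so the
    # training subset mimics the skewed test-set ratio.
    rng = np.random.default_rng(0)
    pos_idx = np.flatnonzero(y_train == 1)
    neg_idx = np.flatnonzero(y_train == 0)
    keep_pos = rng.choice(pos_idx, size=max(1, len(neg_idx) // 10), replace=False)
    sub_idx = np.concatenate([keep_pos, neg_idx])

    downsampled_lr = LogisticRegression(max_iter=1000)
    downsampled_lr.fit(X_train[sub_idx], y_train[sub_idx])

Either variant can then be compared on a held-out set with the precision-recall or partial-ROC tools from the third bullet.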

answered Nov 05 '12 at 00:24 by digdug

Technically, provided each sample in the testing set gets classified independently (which is the case for most methods I know), the question of balance doesn't really arise.

However, the fact that your training and test samples have such different class proportions suggests that they are not drawn from the same distribution. Note that some methods assume the training and test data do come from the same distribution, and can perform sub-optimally when that assumption is violated.

Depending on the task, you might want to favor precision over recall or vice versa, which can be achieved by different means depending on the classifier used (see digdug's answer). These methods may include weighting the samples (a sketch follows below).
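A minimal sketch of two common levers, assuming scikit-learn: per-sample weights (mentioned above) and, as an additional illustration not named in the answer, moving the decision threshold. The data, the 5x weight, and the thresholds are arbitrary placeholders:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score

    # Imbalanced synthetic data standing in for the real problem.
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

    # Lever 1: per-sample weights. Counting each positive 5x pushes the model
    # toward higher recall at the expense of precision.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y, sample_weight=np.where(y == 1, 5.0, 1.0))

    # Lever 2: keep the model fixed and move the decision threshold; raising it
    # trades recall for precision.
    scores = clf.predict_proba(X)[:, 1]
    for threshold in (0.3, 0.5, 0.7):
        pred = (scores >= threshold).astype(int)
        print(f"threshold={threshold:.1f}  "
              f"precision={precision_score(y, pred, zero_division=0):.2f}  "
              f"recall={recall_score(y, pred):.2f}")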

answered Nov 05 '12 at 08:31 by Mikhail (edited Nov 05 '12 at 08:32)

You could also consider evaluation metrics like the area under the ROC curve (which is largely insensitive to the class distribution) in addition to those already mentioned. The precision-recall curve is also a good choice.
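A minimal sketch of computing these summaries with scikit-learn; the random labels and scores are placeholders for your own test labels and classifier scores, and the max_fpr argument assumes a scikit-learn version that supports partial AUC:

    import numpy as np
    from sklearn.metrics import (average_precision_score, precision_recall_curve,
                                 roc_auc_score)

    # Placeholder test labels and classifier scores.
    rng = np.random.default_rng(0)
    y_test = rng.integers(0, 2, size=1000)
    scores = 0.5 * rng.random(1000) + 0.4 * y_test

    print("ROC AUC:", roc_auc_score(y_test, scores))
    # Partial (standardized) AUC restricted to the low false-positive-rate regime.
    print("ROC AUC, FPR <= 0.1:", roc_auc_score(y_test, scores, max_fpr=0.1))
    # The precision-recall curve and its area (average precision).
    precision, recall, thresholds = precision_recall_curve(y_test, scores)
    print("Average precision:", average_precision_score(y_test, scores))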

answered Nov 08 '12 at 19:42 by karan sikka
