I have a data set with the following distribution: in the training set, the positive and negative cases are about equal in size, but in the test set, the negative cases outnumber the positive cases by more than 10 to 1. For this kind of data set, do I have to consider any weight selection when training a classifier?
If you care about finding a needle of positives in a haystack of negatives, then there are some things you can do.
Technically, provided each sample in the test set is classified independently (which is the case for most methods I know), the question of balance doesn't really arise. However, the fact that your training and test samples have such different class proportions suggests that they are not drawn from the same distribution; some methods assume that they are, and can perform sub-optimally when that assumption is violated. Depending on the task, you might want to favor precision over recall or vice versa, which can be achieved by different means depending on the classifier used (see digdug's answer). These methods may include weighting the samples.
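One classifier-agnostic way to trade precision against recall is to move the decision threshold on the classifier's scores. A minimal sketch, using made-up illustrative scores and labels (not output from any real model):

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for every
    score >= threshold (labels are 1 for positive, 0 for negative)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical classifier scores with their true labels.
scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1,    1,    0,   1,   0,   1,    0,   0]

# A high threshold favors precision (few, confident positives) ...
print(precision_recall(scores, labels, 0.8))  # → (1.0, 0.5)
# ... while a low threshold favors recall (catch more positives).
print(precision_recall(scores, labels, 0.3))
```

If false negatives are costly (the "needle in a haystack" case), you would lower the threshold and accept more false positives.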
You could also consider evaluation measures such as the area under the ROC curve (AUC), which is insensitive to the class distribution, in addition to those already mentioned. The precision-recall (P-R) curve is also a good choice, and is often more informative when positives are rare.
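The claim that ROC AUC is insensitive to the class distribution follows from its interpretation as the probability that a randomly chosen positive scores above a randomly chosen negative (the Mann-Whitney statistic). A small sketch with illustrative scores, checking that replicating the negatives tenfold, as in the skewed test set, leaves the AUC unchanged:

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs the classifier
    ranks correctly, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores with their true labels.
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3]
labels = [1,   1,   0,   1,    0,   0]
base = auc(scores, labels)

# Simulate a negative-heavy test set by repeating each negative 10x.
skewed_scores = scores + [s for s, y in zip(scores, labels) if y == 0] * 9
skewed_labels = labels + [0] * (9 * labels.count(0))

# Each pairwise comparison is duplicated proportionally, so AUC is unchanged.
assert abs(auc(skewed_scores, skewed_labels) - base) < 1e-12
print(base)
```

By contrast, precision (and hence the P-R curve) does depend on the positive/negative ratio, which is exactly why it is informative when positives are rare.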