In 'What does classifying more than 10,000 image categories tell us?' the authors use the ImageNet dataset, which consists of more than 10,000 classes with a total of 9 million images. Training is done with 200 to 1500 images per category, 450 on average.

My question is: how do you train on this data correctly? I learned that the training set needs to be unbiased, meaning that there should be the same number of images for every class; if that is not the case, one needs to use weighting. But they don't mention any special procedure, so I thought maybe it is not necessary after all.

[update] I just saw this MO post, Handling data imbalance in classification, but I doubt that they generated more images to fill up the small classes.

[update 2] Okay, I searched further and found a similar question on Quora: In classification, how do you handle an unbalanced training set? My short summary would be:
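The "generate more images to fill up the small classes" idea from the first update is plain oversampling. As a minimal sketch (the `oversample` function and the dict-based dataset are my own toy construction, not anything from the paper), each minority class is topped up by sampling its own examples with replacement until it matches the largest class:

```python
import random

def oversample(dataset, seed=0):
    """Duplicate samples from small classes until every class has as
    many samples as the largest one (sampling with replacement).

    `dataset` maps class label -> list of samples; both the name and
    the structure are hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    target = max(len(samples) for samples in dataset.values())
    balanced = {}
    for label, samples in dataset.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        balanced[label] = samples + extra
    return balanced

# A toy imbalanced dataset: class "a" has 4 samples, class "b" only 1.
data = {"a": [1, 2, 3, 4], "b": [9]}
balanced = oversample(data)
print({k: len(v) for k, v in balanced.items()})  # → {'a': 4, 'b': 4}
```

For images one would typically augment (crop, flip, jitter) rather than duplicate exact copies, since literal duplicates add no new information and encourage overfitting on the small classes.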
Your answers are essentially correct: it is an empirical question which approach will work best. Before you do anything, take a step back and think about the right evaluation measure. For example, if a false negative is much worse than a false positive, then you should reweight the examples. If a false negative is just as bad as a false positive, then you might not need to do anything: on the original (imbalanced) training set, the classifier will predict MANY negatives and thus accumulate many true negatives, at the expense of only a few false negatives.
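Both points above can be made concrete with a toy sketch (the helper `evaluate` and the 90/10 split are my own illustration): on a 9:1 imbalanced set, a classifier that always predicts "negative" already scores 90% accuracy with zero recall, which is why the evaluation measure matters; inverse-frequency weights are one standard way to reweight the examples.

```python
def evaluate(y_true, y_pred):
    """Accuracy, plus recall on the positive class (label 1)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    pos_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    recall = sum(pos_preds) / len(pos_preds) if pos_preds else 0.0
    return correct / len(y_true), recall

# 90 negatives, 10 positives; the lazy classifier says "negative" always.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
acc, rec = evaluate(y_true, y_pred)
print(acc, rec)  # → 0.9 0.0  (high accuracy, useless on positives)

# Inverse-frequency example weights: each class's weight is
# total / (n_classes * class_count), so each positive example counts
# 9x as much as a negative one in a weighted loss.
counts = {0: 90, 1: 10}
total, k = sum(counts.values()), len(counts)
weights = {c: total / (k * n) for c, n in counts.items()}
print(weights)  # positives get weight 5.0, negatives ~0.56
```

This weighting is the same heuristic that scikit-learn applies for `class_weight='balanced'`; whether it actually helps is, as above, an empirical question for your data and your cost of errors.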
More resources I've found: