Hi, I have a collection of text documents with multiple labels, and I would like to perform multi-class classification on them. The problem is that the dataset is heavily skewed across classes. Here's the distribution:

Total number of documents = 41439
Class 1 = 35643 (86.01%)
Class 2 = 2440
Class 3 = 2393
Class 4 = 553
Class 5 = 370
Class 6 = 185
Class 7 = 112
Class 8 = 29

How should I proceed with this? I can definitely start with unigrams/bigrams or with tf-idf.
I asked a similar question here. I tried training multiple one-versus-rest classifiers, using all examples from a given class as positives and randomly sampling an equal number of negatives from the rest. This worked well enough for the classes with many examples but very poorly for the classes with only a few, since each of those classifiers had so little data to train on.
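To make the sampling scheme concrete, here is a minimal sketch of how the per-class training sets could be built. It is an illustration only, not the code I actually ran: the function name `balanced_one_vs_rest_sets`, the input format (one set of class names per document), and the toy data are all assumptions. With the distribution in the question, Class 8 would get just 29 positives plus 29 sampled negatives, i.e. 58 training examples in total.

```python
import random

def balanced_one_vs_rest_sets(labels, seed=0):
    """Build one balanced binary training set per class.

    labels: list with one set of class names per document
            (hypothetical input format, chosen for illustration).
    Returns {class_name: (positive_indices, negative_indices)}.
    """
    rng = random.Random(seed)
    classes = sorted({c for doc_labels in labels for c in doc_labels})
    training_sets = {}
    for c in classes:
        # All documents carrying the class are positives.
        pos = [i for i, doc_labels in enumerate(labels) if c in doc_labels]
        # Everything else is the pool that negatives are drawn from.
        rest = [i for i, doc_labels in enumerate(labels) if c not in doc_labels]
        # Sample as many negatives as positives; for a majority class the
        # pool is smaller than the positives, so take the whole pool.
        k = min(len(pos), len(rest))
        neg = rng.sample(rest, k)
        training_sets[c] = (pos, neg)
    return training_sets

# Toy data (made up): 8 documents of class "a", 2 of class "b".
toy_labels = [{"a"}] * 8 + [{"b"}] * 2
toy_sets = balanced_one_vs_rest_sets(toy_labels)
rare_pos, rare_neg = toy_sets["b"]
print(len(rare_pos), len(rare_neg))
```

The rare class ends up with a tiny balanced set, which is exactly where this approach broke down for me.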
Training without oversampling gave fairly high overall F-scores, but only at the expense of the smaller classes.
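A small made-up example shows how an overall score can look fine while a small class is ignored completely. This is only a sketch under simplifying assumptions: the data is a toy single-label set (not the dataset from the question), the "classifier" always predicts the majority class, and `per_class_f1` is a hypothetical helper written for this illustration.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {class: F1} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores

# Toy data (made up): 90 "big" documents, 10 "small" ones,
# and a classifier that always predicts "big".
y_true = ["big"] * 90 + ["small"] * 10
y_pred = ["big"] * 100

scores = per_class_f1(y_true, y_pred)
# In the single-label case, micro-averaged F1 equals accuracy.
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Macro-averaged F1 weights every class equally.
macro_f1 = sum(scores.values()) / len(scores)

print(scores, micro_f1, macro_f1)
```

Here the micro-averaged score is 0.9 even though the F1 for the small class is 0, so it is worth looking at per-class or macro-averaged scores when judging results on skewed data like this.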
Ultimately I just used a single multi-class classifier trained on all the training examples. It didn't work that much better. I'd be interested in any better approaches you come up with.