Hi,

I have a collection of text documents with multiple labels, and I would like to perform multi-class classification on them. The problem, though, is that the dataset is heavily skewed toward one class. Here's the distribution:

Total number of documents = 41439

Class 1 = 35643 (86.01%)
Class 2 = 2440
Class 3 = 2393
Class 4 = 553
Class 5 = 370
Class 6 = 185
Class 7 = 112
Class 8 = 29
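To make the skew concrete, here is a small sketch that computes each class's share of the total from the counts above. The names `TOTAL_DOCS`, `class_counts`, and `class_shares` are just illustrative. One thing it surfaces: the per-class counts add up to more than the number of documents, which would be consistent with some documents carrying more than one label.

```python
# Counts copied from the distribution above; the variable names are illustrative.
TOTAL_DOCS = 41439
class_counts = {1: 35643, 2: 2440, 3: 2393, 4: 553, 5: 370, 6: 185, 7: 112, 8: 29}


def class_shares(counts, total):
    """Return each class's share of `total` as a percentage, rounded to 2 places."""
    return {label: round(100.0 * n / total, 2) for label, n in counts.items()}


shares = class_shares(class_counts, TOTAL_DOCS)
label_total = sum(class_counts.values())

for label, pct in shares.items():
    print(f"Class {label}: {class_counts[label]:>6} ({pct:.2f}%)")

# The label assignments exceed the document count, so some documents
# presumably belong to several classes at once.
print("Sum of per-class counts:", label_total)
print("Exceeds document count by:", label_total - TOTAL_DOCS)
```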

How should I proceed with this? I can definitely start with unigrams/bigrams or with tf-idf.

asked Jul 06 '11 at 10:36

Dexter

I asked a similar question here. I tried training multiple one-versus-many classifiers, taking all examples of a given class as positives and randomly sampling an equal number of negatives from the rest. This worked well enough for the classes with many examples, but horribly for the classes with only a few, since their training sets ended up very small.

Not oversampling resulted in fairly high F-scores, but at the expense of the smaller classes.

Ultimately I just used one multi-class classifier and used all the training examples. It didn't work that much better. I'd be interested in any better approaches you come up with.
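For anyone wanting to reproduce the one-versus-many setup with balanced negatives described above, here is a rough sketch of just the sampling step. It is not the exact code used; the function name `balanced_binary_sets`, the `(doc, labels)` input format, and the toy data are assumptions made for illustration, and no particular classifier is implied.

```python
import random


def balanced_binary_sets(labeled_docs, seed=0):
    """Build one positive/negative training set per class.

    labeled_docs: list of (doc, labels) pairs, where labels is a set of classes.
    For each class, all documents carrying it are positives, and an equally
    sized random sample of the remaining documents serves as negatives
    (capped at the number of remaining documents, if there are fewer).
    Returns {class_label: (positives, negatives)}.
    """
    rng = random.Random(seed)
    classes = sorted({c for _, labels in labeled_docs for c in labels})
    result = {}
    for c in classes:
        positives = [doc for doc, labels in labeled_docs if c in labels]
        rest = [doc for doc, labels in labeled_docs if c not in labels]
        k = min(len(positives), len(rest))
        negatives = rng.sample(rest, k)
        result[c] = (positives, negatives)
    return result


# Toy data, made up for illustration (not the dataset from this thread).
toy = [(f"doc{i}", {1}) for i in range(8)] + [("doc8", {2}), ("doc9", {2, 3})]
sets = balanced_binary_sets(toy)
for c, (pos, neg) in sets.items():
    print(f"class {c}: {len(pos)} positives, {len(neg)} negatives")
```

With this scheme a class with only 29 documents gets at most 58 training examples in total, which matches the weak results on the small classes.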

(Jul 08 '11 at 01:40) Colin Pollock

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.