In a text classification task, if I already know that the positive rate in the real world is very low (e.g. 5%), what should I watch out for in my analysis?

What is the appropriate positive rate for the training data set? 5% or 50%?

If my goal is to find the positive cases, and manual review can follow the text mining step, which measure matters most in this situation: accuracy, precision, recall, or F-measure? I think it would be precision; am I right?

Thanks.

asked Jul 09 '10 at 18:44

Jfly

retagged Jul 10 '10 at 18:35

Andrew Rosenberg


5 Answers:

How you use F-beta measures, precision vs. recall, and ROC curves depends heavily on the task you're looking at. For rare event detection you probably care about recall (e.g. disease detection, where a miss is more damaging than a false alarm). For IR on a mobile device, where you are constrained in the number of results you can return, precision may be more important. Independent of the task, P-R curves and ROC curves are both widely used, and both give a nice overall picture of classifier performance. (The paper showing that P-R is more informative than ROC is really interesting, though!)
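To make the precision/recall trade-off concrete, here is a minimal sketch in plain Python. The test set (100 documents, 5 positives) and the two sets of predictions are made-up illustrations, not data from the question, and the function names are my own.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Return (precision, recall, F-beta) for binary labels in {0, 1}.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical test set with a 5% positive rate.
y_true = [1] * 5 + [0] * 95

# Baseline that never predicts the positive class.
all_negative = [0] * 100

# Classifier that finds 4 of the 5 positives at the cost of 6 false alarms.
flagging = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

baseline_accuracy = accuracy(y_true, all_negative)
flagging_accuracy = accuracy(y_true, flagging)
precision, recall, f2 = precision_recall_fbeta(y_true, flagging, beta=2.0)
```

On this toy data the do-nothing baseline has higher accuracy than the classifier that actually finds positives, which is why accuracy alone tells you little at a 5% positive rate.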

As for the best proportion of positive examples in your training data, I don't believe there is a definitive answer that holds for all data sets and classifiers. You do have at least a few choices, though.

  • You can do no resampling at all and train on the unmodified data.
  • You can undersample the negative examples to make a 50/50 split.
  • You can oversample the positive examples. (In general I don't particularly like this, as it can really mess with variance estimates, but it can still be an effective training strategy.)
  • You can use ensemble sampling: train N classifiers, where each training set includes all of the positive examples and 1/N of the negative examples, then use your favorite combination technique to get an answer. (R. Yan, Y. Liu, R. Jin, and A. Hauptmann. On predicting rare cases with svm ensembles in scene classification. In ICASSP, 2003) I've had some good luck with this.
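The ensemble sampling option can be sketched roughly as follows. This is only an illustration of the data split and a simple averaging combiner, not the method from the cited paper: the one-dimensional features, the `train_midpoint` toy learner, and all function names are assumptions made up for the example.

```python
import random


def ensemble_training_sets(positives, negatives, n_models, seed=0):
    """Build n_models training sets: all positives plus a disjoint 1/N slice of negatives."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    training_sets = []
    for i in range(n_models):
        neg_slice = shuffled[i::n_models]  # every N-th negative, slices do not overlap
        examples = list(positives) + neg_slice
        labels = [1] * len(positives) + [0] * len(neg_slice)
        training_sets.append((examples, labels))
    return training_sets


def train_midpoint(examples, labels):
    """Toy stand-in for a real learner: threshold halfway between the class means."""
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 if x > cut else 0.0


def average_vote(models, x):
    """Combine member outputs by simple averaging (one of many possible choices)."""
    return sum(model(x) for model in models) / len(models)


# Hypothetical 1-D data with a 5% positive rate: 5 positives, 95 negatives.
positives = [0.8, 0.9, 0.85, 0.95, 0.7]
negatives = [i / 200 for i in range(95)]

training_sets = ensemble_training_sets(positives, negatives, n_models=5)
models = [train_midpoint(examples, labels) for examples, labels in training_sets]
score_high = average_vote(models, 0.9)
score_low = average_vote(models, 0.1)
```

Each member sees a much more balanced training set (here 5 positives against 19 negatives) while the ensemble as a whole still uses every negative example.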

Also, here are some more references on F-measure optimization.

Martin Jansche has a nice paper on direct optimization of f-measure: Maximum expected F-measure training of logistic regression models.

There's also this paper by Liu, Tan and Jiang, Regularized F-Measure Maximization for Feature Selection and Classification, which I don't know quite as well.

answered Jul 10 '10 at 10:56

Andrew Rosenberg

Another evaluation measure used to deal with very unbalanced datasets is the truncated AUC, e.g. AUC50. (M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (roc) analysis to evaluate sequence matching. Computers and Chemistry, 20(1):25 – 33, 1996.)

AUC50 focuses on the top-ranked examples. It is defined as the normalised area under the ROC curve computed up to the first 50 false positives (i.e. negatives that appear near the top of the ranking) and is typically used to evaluate classifiers on datasets where the number of positives is much lower than the number of negatives. While for a complete test set the AUC of a useful classifier lies between 0.5 (random ranking) and 1, the AUC values for truncated top lists lie between 0 and 1. The shorter the top list, the smaller the AUC values tend to be. For more details on AUC and truncated AUC, see also (P. Sonego, A. Kocsor, and S. Pongor. ROC analysis: applications to the classification of biological sequences and 3D structures. Briefings in Bioinformatics, 9(3):198–209, 2008.)
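A minimal sketch of a truncated AUC, assuming the common ROC-n formulation: walk down the ranked list, and for each of the first n false positives add the number of true positives ranked above it, then divide by n times the number of positives. The function name, the `max_fp` parameter, and the toy scores are my own; ties are not treated specially here.

```python
def truncated_auc(scores, labels, max_fp=50):
    """Normalised area under the ROC curve up to the first max_fp false positives.

    labels are 0/1. If the data holds fewer than max_fp negatives, all of them
    are used, which gives the ordinary AUC (ignoring tied scores).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    limit = min(max_fp, n_neg)
    if n_pos == 0 or limit == 0:
        raise ValueError("need at least one positive and one negative example")
    tp_seen = 0
    fp_seen = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # positives ranked above this false positive
            if fp_seen == limit:
                break
    return area / (limit * n_pos)


# Made-up ranking: 3 positives and 3 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 1]

full_auc = truncated_auc(scores, labels, max_fp=50)  # only 3 negatives, so the full AUC
auc_1 = truncated_auc(scores, labels, max_fp=1)      # stop at the first false positive
```

On this toy ranking the value truncated at one false positive is lower than the full AUC, in line with the remark that shorter top lists give smaller values.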

answered Jul 12 '10 at 05:04

Georgiana Ifrim

I think recall is very important (you don't want to miss those few examples). This calls either for an asymmetric loss function (one that penalizes false negatives more than false positives) or for direct optimization of the F1 measure. Maybe something like this can help you.
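One simple form of asymmetric loss is a class-weighted log loss. The sketch below is just an illustration of the idea; the weights `fn_weight=5.0` and `fp_weight=1.0` are arbitrary assumptions, and in practice you would choose them from the relative cost of a miss versus a false alarm.

```python
import math


def weighted_log_loss(y_true, p_pred, fn_weight=5.0, fp_weight=1.0, eps=1e-12):
    """Mean log loss where errors on positives cost fn_weight and errors on negatives cost fp_weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -fn_weight * math.log(p)
        else:
            total += -fp_weight * math.log(1.0 - p)
    return total / len(y_true)


# A confident miss (true positive scored 0.1) versus an equally
# confident false alarm (true negative scored 0.9).
loss_miss = weighted_log_loss([1], [0.1])
loss_false_alarm = weighted_log_loss([0], [0.9])
```

With these weights the miss is penalized five times as heavily as the false alarm, so a model trained on this loss is pushed toward higher recall.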

answered Jul 09 '10 at 18:49

Alexandre Passos ♦

Precision-recall is important in retrieval settings; AUC is a widely used measure for unbalanced classification tasks.

(Jul 09 '10 at 18:52) DirectedGraph

This paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.2196&rep=rep1&type=pdf shows that for very unbalanced datasets the precision-recall curve might contain more information than the ROC curve.

(Jul 09 '10 at 18:54) Alexandre Passos ♦

Interesting!

(Jul 09 '10 at 19:08) DirectedGraph

Datasets with low positive rates are referred to as unbalanced datasets. They occur frequently in real-life situations; for example, data on whether molecules are effective against, say, HIV is unbalanced.

The most important measure is the area under the receiver operating characteristic curve (AUROC or AUC).

This is a good starting point to read more about unbalanced text classification. http://www.springerlink.com/content/f2728072350t3465/

answered Jul 09 '10 at 18:51

DirectedGraph

The traditional measures for this kind of thing are the F1 (and F2) measures, along with P@R figures, which are basically points on the precision/recall curve. For these kinds of problems you also need to be careful about how you train the model and how you perform inference.
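Reading P@R as "precision at a fixed recall level", one such point can be computed from a ranked list as in the sketch below. The function name and the toy scores are assumptions for illustration only.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the shortest ranked prefix whose recall reaches target_recall.

    labels are 0/1 and target_recall is expected to lie in (0, 1].
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    tp = 0
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        if tp / n_pos >= target_recall:
            return tp / k
    return tp / len(ranked)  # only reached if target_recall is above 1


# Made-up ranking: 3 positives and 3 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 1]

p_at_half = precision_at_recall(scores, labels, 0.5)
p_at_full = precision_at_recall(scores, labels, 1.0)
```

Reporting a few of these points (for example precision at 50%, 80% and 100% recall) summarizes the curve at the operating points you actually care about.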

Given that your actual loss function is asymmetric in the sense that recall errors are much worse than precision errors, you will want to perform some kind of min-risk inference. I can give you more details if you're interested.
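As a rough sketch of what min-risk inference can look like for a binary decision: given a calibrated probability that a document is positive, pick the label with the lower expected cost. The costs below (a miss is ten times worse than a false alarm) and the function names are assumptions chosen for illustration.

```python
def min_risk_decision(p_positive, cost_fn=10.0, cost_fp=1.0):
    """Return 1 if predicting positive has the lower expected cost, else 0.

    Assumes p_positive is a reasonably calibrated probability of the positive class.
    """
    risk_if_predict_negative = p_positive * cost_fn          # we might miss a positive
    risk_if_predict_positive = (1.0 - p_positive) * cost_fp  # we might raise a false alarm
    return 1 if risk_if_predict_positive < risk_if_predict_negative else 0


def risk_threshold(cost_fn, cost_fp):
    """Probability above which predicting positive is the cheaper decision."""
    return cost_fp / (cost_fp + cost_fn)


threshold = risk_threshold(cost_fn=10.0, cost_fp=1.0)
decision_for_20_percent = min_risk_decision(0.2)
decision_for_5_percent = min_risk_decision(0.05)
```

With these costs the decision threshold drops from 0.5 to 1/11, so a document with only a 20% chance of being positive is still flagged for manual review.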

answered Jul 09 '10 at 21:30

aria42

edited Jul 09 '10 at 21:34

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.