First, a bit of context:

I am working on a document dataset (the old Reuters-21578 collection) where each document can have one or more of approximately 100 labels. The classes are highly unbalanced. I am trying to compare my results with other published results, and the precision/recall curves don't match. Some (TF-IDF, for instance) are just slightly off, but others are completely different.

I have two reference results for comparison: one from Semi-supervised Learning of Compact Document Representations with Deep Networks, fig. 4, and the other from the Rate Adapting Poisson Model paper, fig. 6.

I'm almost exactly matching the results of the Rate Adapting Poisson paper. For the first paper, though, I get completely different results for their LSI curves. What is really surprising is that their results for LSI are actually below the baseline obtained with random choice (which is at approximately 0.12). I also have some minor differences with their TF-IDF baseline. I don't know where the difference comes from, but I suspect a difference in evaluation methodology.

Now, the question:

Is there a reference paper or book that explains how to compute precision and recall when

  • the documents can have multiple labels
  • the classes are unbalanced

More details:

  1. The problem of multiple labels per document (i.e. one document can be in several classes at once)

    To deal with multiple labels, you can simply count one true positive if the retrieved document has at least one label in common with the query document. This is a very simple rule, but it does not take into account that finding a document with exactly the same labels as the query document is more valuable than finding a document with just one label in common. However, I haven't found a simple method that could take this into account.

  2. How do you deal with unbalanced classes (even when documents can have only one label)?

    To illustrate why this can be a problem, imagine this setting: your known document set is made of 5 documents, 1 of class A and 4 of class B. You have a perfect algorithm for ranking documents, and two query documents: Q1 of class A and Q2 of class B. Q1 first returns the document of class A, then all the B documents; Q2 returns the 4 B documents, then the A document:

    Q1: A B B B B (in terms of true positives: 1 0 0 0 0)

    Q2: B B B B A (in terms of true positives: 1 1 1 1 0)

    If you take each result separately, the precision/recall curve is a perfect flat line at 100%. If you average the results rank by rank, precision is [100% 75% 66.7% 62.5% 50%] and recall is [62.5% 75% 87.5% 100% 100%], which gives a pretty poor curve compared with what it should be.
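To make the averaging issue concrete, here is a minimal Python sketch (the function name and structure are mine, not taken from any of the referenced papers) that reproduces the numbers above, counting a true positive whenever the retrieved document shares at least one label with the query, as in point 1:

```python
import numpy as np

def pr_curve(tp_flags):
    """Precision and recall at each rank for one query.

    tp_flags: 0/1 list, 1 if the document at that rank shares
    at least one label with the query document."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    ranks = np.arange(1, len(tp_flags) + 1)
    cum_tp = np.cumsum(tp_flags)
    precision = cum_tp / ranks
    recall = cum_tp / tp_flags.sum()
    return precision, recall

# Toy example from the question: Q1 is of class A, Q2 of class B,
# and the ranking is perfect for both queries.
q1 = [1, 0, 0, 0, 0]   # A B B B B, only the A document is relevant
q2 = [1, 1, 1, 1, 0]   # B B B B A, the four B documents are relevant

p1, r1 = pr_curve(q1)
p2, r2 = pr_curve(q2)

# Rank-wise average over the two queries, as in the question.
print("avg precision per rank:", (p1 + p2) / 2)  # [1.0, 0.75, 0.667, 0.625, 0.5]
print("avg recall per rank:   ", (r1 + r2) / 2)  # [0.625, 0.75, 0.875, 1.0, 1.0]
```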

I would appreciate any guidance about the right method (maybe existing ML frameworks have standard methods for this?).

asked Oct 05 '11 at 05:24

Guillaume Pitel

edited Oct 06 '11 at 10:13

I've just realized that in the version of the dataset I use (ModApte matlab files found somewhere), there is actually only one label per document... which makes me feel embarrassed, but does not necessarily invalidate this question.

(Oct 05 '11 at 05:48) Guillaume Pitel

Maybe a clue for this problem: if I naively average the precision/recall not over the whole test set but over each class (that is, I first obtain precision and recall values for each class, then average them), I get very different results (not surprising). However, my results are also very different from theirs for TF-IDF, for instance.

So it raises another question: how can we properly average precision and recall at the class level?

(Oct 05 '11 at 11:50) Guillaume Pitel

To help answer the original question, I can propose at least one alternative metric: the scalar product of normalized class vectors.

Let's represent a document's class membership by a vector of N 0/1 values (N is the number of classes). If we have 5 classes and document A is in classes 0, 1 and 3, then its representative vector is C_A = [1 1 0 1 0].

Now if we have a test document B with C_B = [1 0 0 0 1], instead of counting a true positive because there is one class in common, we should count 1/(sqrt(3)*sqrt(2)), the scalar product of the normalized C_* vectors (i.e. their cosine similarity).
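A minimal sketch of this graded relevance, assuming plain NumPy (the function name is mine):

```python
import numpy as np

def class_cosine(c_a, c_b):
    """Cosine similarity between two 0/1 class-membership vectors.

    Returns a graded relevance in [0, 1] instead of a binary
    true-positive count."""
    c_a, c_b = np.asarray(c_a, float), np.asarray(c_b, float)
    return c_a @ c_b / (np.linalg.norm(c_a) * np.linalg.norm(c_b))

c_a = [1, 1, 0, 1, 0]   # document A is in classes 0, 1 and 3
c_b = [1, 0, 0, 0, 1]   # document B is in classes 0 and 4
print(class_cosine(c_a, c_b))   # 1 / (sqrt(3) * sqrt(2)) ~= 0.408
```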

This proposal actually raises more questions: precision and recall cannot be rigorously adapted to a non-discrete metric.

(Oct 06 '11 at 07:50) Guillaume Pitel

3 Answers:
  1. I've done some work with multi-label classification before, and basically the task used the same F-measure but defined the true positives as the total number of correct labels over all annotated documents, and false positives similarly. This means that documents with more labels have more weight in the F-measure, which is not necessarily desirable. I suppose you could also use a fractional value for the true/false positives based on the percentage of TP/FP labels for each document.

  2. The F-measure people usually use is actually the micro F-measure, which weights each class according to its proportion in the data. The macro F-measure treats each class equally, regardless of how common or rare it is. Usually tasks intend for you to use the micro average unless they state otherwise. The intention of micro-averaging is to give more weight to how your classifier performs on common labels, which is generally what you want (see the sketch after this list).
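For reference, a small sketch of the micro vs. macro distinction on toy multi-label data, assuming scikit-learn is available (the data here are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label ground truth and predictions as binary indicator
# matrices (rows = documents, columns = classes).
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [0, 1, 1]])
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0]])

# Micro: pool TP/FP/FN over all classes, so frequent classes dominate.
print("micro F1:", f1_score(y_true, y_pred, average='micro'))
# Macro: compute F1 per class, then average, so every class counts equally.
print("macro F1:", f1_score(y_true, y_pred, average='macro'))
```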

In the example you give for ranking, are you required to return all documents for the task? If not, finding a good threshold would certainly improve the curve. Also try looking into average precision, which is an overall metric for ranking. In this case the average precision for both Q1 and Q2 would be 1.0 (perfect).
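A minimal sketch of average precision on the toy rankings from the question (the function name is mine):

```python
import numpy as np

def average_precision(tp_flags):
    """Average precision for one ranked result list: the mean of the
    precision values taken at each rank where a relevant document
    is retrieved."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    precision_at_k = np.cumsum(tp_flags) / np.arange(1, len(tp_flags) + 1)
    return precision_at_k[tp_flags == 1].mean()

print(average_precision([1, 0, 0, 0, 0]))  # Q1 -> 1.0
print(average_precision([1, 1, 1, 1, 0]))  # Q2 -> 1.0
```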

answered Oct 06 '11 at 10:40

Kirk Roberts

"In the example you give for ranking, are you required to return all documents for the task?"

Yes, I am required to return all documents for the task, or, at least, that's what everybody does. Finding a threshold could only improve the precision, but it would require another small learning phase.

(Oct 06 '11 at 11:02) Guillaume Pitel

Well then you are upper-bounded by a perfect ranking. In that case it is common for papers to put the "perfect" precision-recall curve in the graph so that readers understand how close your approach is to the upper bound.

(Oct 06 '11 at 12:03) Kirk Roberts

I think I may have found a solution to my problem, even though it may not completely help for comparison with existing precision/recall curves. I have actually come to the conclusion that P/R can't nicely cope with multiple labels, and even less so with unbalanced classes.

However, I think my solution has nice properties, most notably that it can also be expressed in terms of precision and recall (which is not the case for ROC, for instance).

First of all, imagine that any IR system's result actually falls between a best-case and a worst-case scenario. The best case is what I described in the question. The worst case would be this:

Q1: 0 0 0 0 1

Q2: 0 1 1 1 1

What I propose is simply that precision, instead of being expressed without knowledge of the classes and their sizes, should be expressed relative to the worst-case/best-case scenarios.

With this in mind, we just have to compute the cumulative sums of the best/worst cases (summed over both queries):

Best: 2 3 4 5 5

Worst: 0 1 2 3 5

By construction, Best(N-1) == Worst(N-1).

And express (EDIT: this does NOT work well):

  • Precision(k) as (TP(k) - Worst(k)) / (Best(k) - Worst(k)), for k in [0 .. N-2]
  • Recall(k) as TP(k) / Worst(N-1)

With such a solution, the PR curve of the best case scenario would be a flat 100% line.
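A small sketch of this normalization on the toy example from the question (variable names are mine, and, per the EDIT above, this is only the idea as first proposed, not a validated metric):

```python
import numpy as np

# Per-rank true-positive counts summed over the two toy queries.
actual = np.array([1, 0, 0, 0, 0]) + np.array([1, 1, 1, 1, 0])   # the perfect ranking
best   = actual.copy()                                           # best case == perfect here
worst  = np.array([0, 0, 0, 0, 1]) + np.array([0, 1, 1, 1, 1])   # worst-case ordering

TP, Best, Worst = np.cumsum(actual), np.cumsum(best), np.cumsum(worst)
# Best -> [2 3 4 5 5], Worst -> [0 1 2 3 5], and Best[-1] == Worst[-1]

k = np.arange(len(TP) - 1)                  # ranks 0 .. N-2, where Best(k) > Worst(k)
precision = (TP[k] - Worst[k]) / (Best[k] - Worst[k])
recall = TP / Worst[-1]
print(precision)   # all 1.0 for the best-case ranking
print(recall)      # [0.4, 0.6, 0.8, 1.0, 1.0]
```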

Another nice property of this solution is that the best-case and worst-case scenarios can be computed even with non-binary results, so a scalar-product implementation of the document/document similarity based on the class vectors could be used.

[Figure: test of the method on Reuters-21578 — Best (red), Worst (green) and random baseline (cyan) curves, drawn together with the TP count (blue).]

answered Oct 06 '11 at 10:59

Guillaume Pitel

edited Oct 11 '11 at 02:44

This method needs to be improved, to say the least.

(Oct 06 '11 at 16:05) Guillaume Pitel

Precision, recall, and F-measure are well-defined functions of a single 2x2 contingency table. So the only legitimate way to come up with a "new" version of these is to come up with a new way of transforming your data into a 2x2 contingency table (e.g. by changing thresholding, redefining the two classes, etc.).

As Kirk Roberts mentions above, as soon as you're applying these measures to multi-label data you need to compute one 2x2 table for each category, and then average them in some fashion. My paper with Yang, Rose, and Li on RCV-1 discusses micro-averaging vs. macro-averaging, and provides in an appendix the raw data from which the averages were computed.

However, it's not clear recall and precision are the right measures for you. It appears you're using a text categorization dataset to stand in for a faceted retrieval task, where categories correspond to aspects of relevance. In that case, if a query wants facets A and B, then a document matching both A and B is much more than twice as valuable as one matching just A or just B. So you might want to look at utility measures that take that into account.

Finally, Reuters-21578 actually has a rather small proportion of documents with 2+ labels compared to RCV-1, so you might want to look at the latter. RCV-1 is believed to have more accurate labeling as well.

answered Oct 07 '11 at 10:24

Dave Lewis

"Reuters-21578 actually has a rather small proportion of documents with 2+ labels compared to RCV-1"

Yes, I'm working on RCV-1 too, but since I want to compare my method with other existing methods, I need to use the same datasets.

(Oct 07 '11 at 10:56) Guillaume Pitel