First, a bit of context: I am working on a document dataset (the old Reuters-21578) where each document can have one or more of approximately 100 labels. The classes are highly unbalanced. I am trying to compare my results with other published results, and the precision/recall curves don't match. Some (TF-IDF, for instance) are just slightly off, but others are completely different.

I have two reference results for comparison: one from Semi-supervised Learning of Compact Document Representations with Deep Networks, fig. 4, the other from the Rate Adapting Poisson Model paper, fig. 6. I almost exactly match the results of the Rate Adapting Poisson paper. For the first paper, though, I get completely different results for their LSI curves. What is really surprising is that their results for LSI are actually below the baseline obtained by random choice (which is at approximately 0.12). I also have some minor differences with their TF-IDF baseline. I don't know where the differences come from, but I suspect a difference in evaluation methodology.

Now, the question: Is there a reference paper or book that explains how to compute precision and recall when each document can have several labels and the classes are highly unbalanced?
More details:
I would appreciate any guidance about the right method (maybe existing ML frameworks have standard methods for that?).
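For what it's worth, here is the kind of framework call I had in mind; a minimal sketch using scikit-learn, assuming binary label-indicator matrices, and not necessarily the methodology used in the cited papers:

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support

    # Toy multi-label ground truth and predictions as binary indicator matrices
    # (rows = documents, columns = classes). Values here are made up.
    y_true = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 1, 0],
                       [0, 0, 1]])
    y_pred = np.array([[1, 0, 0],
                       [0, 1, 1],
                       [1, 0, 0],
                       [0, 0, 1]])

    # One 2x2 contingency table per class, reported class by class...
    p, r, f, support = precision_recall_fscore_support(y_true, y_pred, average=None)
    print("per-class precision:", p)
    print("per-class recall   :", r)

    # ...or pooled/averaged across classes.
    for avg in ("micro", "macro"):
        p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
        print(avg, "precision =", p, "recall =", r)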
In the example you give for ranking, are you required to return all documents for the task? If not, that would certainly improve the curve if you could find a good threshold. Also try looking into average precision, which is an overall metric for ranking. In this case the average precision for both Q1 and Q2 would be 1.0 (perfect).

"In the example you give for ranking, are you required to return all documents for the task?" Yes, I am required to return all documents for the task, or at least that's what everybody does. Finding a threshold could only improve the precision, but it would require another small learning phase.
(Oct 06 '11 at 11:02)
Guillaume Pitel
Well then you are upper-bounded by a perfect ranking. In that case it is common for papers to put the "perfect" precision-recall curve in the graph so that readers understand how close your approach is to the upper bound.
(Oct 06 '11 at 12:03)
Kirk Roberts
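For reference, a minimal sketch of the average precision mentioned in these comments, over one ranked list with binary relevance flags in rank order (the helper name is made up):

    import numpy as np

    def average_precision(relevance):
        """Average precision of one ranked result list.

        relevance: 0/1 flags in rank order (1 = relevant document).
        Returns the mean of precision@k over the ranks where a
        relevant document is retrieved.
        """
        relevance = np.asarray(relevance, dtype=float)
        if relevance.sum() == 0:
            return 0.0
        hits_so_far = np.cumsum(relevance)
        precision_at_k = hits_so_far / np.arange(1, len(relevance) + 1)
        return float((precision_at_k * relevance).sum() / relevance.sum())

    # A perfect ranking (all relevant documents first) scores 1.0,
    # which is the upper bound discussed above.
    print(average_precision([1, 1, 0, 0]))   # 1.0
    print(average_precision([0, 1, 0, 1]))   # 0.5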
I think I may have found a solution to my problem, even though it may not completely help for comparison with existing precision/recall curves. I have actually come to the conclusion that P/R can't nicely cope with multiple labels, and even less with unbalanced classes. However, I think my solution has nice properties, mostly that it can be expressed in terms of precision and recall too (which is not the case for ROC, for instance).

First of all, imagine that any IR system's result actually falls between a best case scenario and a worst case scenario. The best case is what I have described in the question; the worst case is the reverse ranking, where the correct documents come last.
What I propose is simply that precision, instead of being expressed without knowledge of the classes and their arity, should be expressed relative to the worst case/best case scenarios. With this in mind, we just have to compute the cumulative sums Best(k) and Worst(k) of the best and worst cases over the ranked list.
By construction, Best(N-1) == Worst(N-1). The precision at rank k can then be expressed relative to Best(k) and Worst(k) (EDIT: this does NOT work well).
With such a solution, the PR curve of the best case scenario would be a flat 100% line. Another nice property of this solution is that the best case and worst case scenarios can be computed even with non-binary results, so a scalar product implementation of the document/document distance based on the class vector could be used.

Example on Reuters-21578: the Best (red), Worst (green), and random baseline (cyan) curves are drawn together with the TP count (blue).
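For concreteness, a rough sketch of how the three curves can be computed for one query (the helper name is made up; 0/1 or graded relevance scores per rank are assumed):

    import numpy as np

    def best_worst_curves(hits):
        """Cumulative true-positive counts for the observed ranking and for
        the best case (all hits first) and worst case (all hits last)."""
        hits = np.asarray(hits, dtype=float)   # 0/1 or graded relevance per rank
        observed = np.cumsum(hits)             # TP(k) for the actual ranking
        best = np.cumsum(np.sort(hits)[::-1])  # Best(k)
        worst = np.cumsum(np.sort(hits))       # Worst(k)
        return observed, best, worst

    observed, best, worst = best_worst_curves([1, 0, 1, 0, 0, 1])
    # The three curves meet at the last rank: Best(N-1) == Worst(N-1).
    print(observed, best, worst)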
This method needs to be improved, to say the least.
(Oct 06 '11 at 16:05)
Guillaume Pitel
Precision, recall, and F-measure are well-defined functions of a single 2x2 contingency table. So the only legitimate way to come up with a "new" version of these is to come up with a new way of transforming your data into a 2x2 contingency table (e.g. by changing thresholding, redefining the two classes, etc.). As Kirk Roberts mentions above, as soon as you're applying these measures to multilabel data you need to compute one 2x2 table for each category, and then average in some fashion. My paper with Yang, Rose, and Li on RCV-1 discusses microaveraging vs. macroaveraging, and provides in an appendix the raw data from which the averages were computed.

However, it's not clear recall and precision are the right measures for you. It appears you're using a text categorization dataset to stand in for a faceted retrieval task, where categories correspond to aspects of relevance. In that case, if a query wants facets A and B, then a document matching both A and B is much more than twice as valuable as one matching just A or just B. So you might want to look at utility measures that take that into account.

Finally, Reuters-21578 actually has a rather small proportion of documents with 2+ labels compared to RCV-1, so you might want to look at the latter. RCV-1 is believed to have more accurate labeling as well.

"Reuters-21578 actually has a rather small proportion of documents with 2+ labels compared to RCV-1" Yes, I'm working on RCV-1 too, but since I want to compare my method with other existing methods, I need to use the same datasets.
(Oct 07 '11 at 10:56)
Guillaume Pitel
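To make the per-category tables and the two averaging schemes from this answer concrete, here is a small numpy sketch (my own helper names, not the exact procedure of the RCV-1 paper):

    import numpy as np

    def per_class_tables(y_true, y_pred):
        """tp, fp, fn counts of the per-class 2x2 contingency tables,
        from binary indicator matrices (rows = documents, columns = classes)."""
        tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
        fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
        fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
        return tp, fp, fn

    def micro_macro(y_true, y_pred):
        tp, fp, fn = per_class_tables(y_true, y_pred)
        # Macro: precision/recall per class, then average the scores
        # (classes with no positive predictions count as 0 here).
        macro_p = np.mean(tp / np.maximum(tp + fp, 1))
        macro_r = np.mean(tp / np.maximum(tp + fn, 1))
        # Micro: pool all counts into one global 2x2 table first.
        micro_p = tp.sum() / max(tp.sum() + fp.sum(), 1)
        micro_r = tp.sum() / max(tp.sum() + fn.sum(), 1)
        return (micro_p, micro_r), (macro_p, macro_r)

    y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
    y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]])
    print(micro_macro(y_true, y_pred))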

I've just realized that in the version of the dataset I use (ModApte Matlab files found somewhere), there is actually only one label per document... which makes me feel embarrassed, but does not necessarily invalidate this question.
Maybe a clue for this problem: if I naively average the precision/recall not over the whole test set but over each class (that is, I first obtain precision and recall values for each class, then average them), I get very different results (not surprising). However, my results are also very different from theirs for TF-IDF, for instance.
So this raises another question: how can we properly average precision and recall at the class level?
To help answer the original question, I can propose at least one alternative metric: the scalar product of normalized class vectors.
Let's represent a document's class membership by a vector of N 0/1 values (N is the number of classes). If we have 5 classes and document A is in classes 0, 1 and 3, then its representative vector is C_A = [1 1 0 1 0].
Now if we have a test document B with C_B = [1 0 0 0 1], instead of counting a true positive because there is one class in common, we should count 1/(sqrt(3)*sqrt(2)), the scalar product of the normalized C_* vectors.
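A minimal sketch of this soft count, assuming 0/1 class vectors as above (the function name is made up):

    import numpy as np

    def class_overlap(c_a, c_b):
        """Scalar product of the L2-normalized class membership vectors."""
        c_a = np.asarray(c_a, dtype=float)
        c_b = np.asarray(c_b, dtype=float)
        return float(c_a @ c_b / (np.linalg.norm(c_a) * np.linalg.norm(c_b)))

    C_A = [1, 1, 0, 1, 0]   # document A is in classes 0, 1 and 3
    C_B = [1, 0, 0, 0, 1]   # document B is in classes 0 and 4
    print(class_overlap(C_A, C_B))   # 1/(sqrt(3)*sqrt(2)) ~ 0.41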
This proposal actually raises more questions: precision and recall cannot be rigorously adapted to a non-discrete metric.