
Guillaume Pitel, Oct 06 '11 at 10:13

Computing precision/recall curve for unbalanced multilabel document retrieval

First, a bit of context:

I am working on a document dataset (the old Reuters-21578) where each document can have one or more of approximately 100 labels, and the classes are highly unbalanced. I am trying to compare my results with other published results, and the precision/recall curves don't match: some (TF-IDF, for instance) are just slightly off, but others are completely different.

I have two reference results for comparison: one from "Semi-supervised Learning of Compact Document Representations with Deep Networks" (fig. 4), the other from the Rate Adapting Poisson Model paper (fig. 6).

I'm almost exactly matching the results of the Rate Adapting Poisson paper. For the first paper, though, I get completely different results for their LSI curves. What is really surprising is that their LSI results are actually below the baseline obtained with random choice (which is at approximately 0.12). I also have some minor differences with their TF-IDF baseline. I don't know where the differences come from, but I suspect a difference in evaluation methodology.
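As an aside, under the "count a true positive if the retrieved document shares at least one label with the query" rule described below, the precision of a random-choice baseline is simply the probability that two documents drawn at random share a label, which can be estimated directly from the label sets. A minimal sketch (the `labels` variable is illustrative: a list of per-document label sets):

    import random

    def random_baseline_precision(labels, n_pairs=100_000, seed=0):
        """Estimate the precision of random retrieval: the probability that two
        randomly chosen documents share at least one label."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(n_pairs):
            a, b = rng.sample(range(len(labels)), 2)   # a random (query, retrieved) pair
            if labels[a] & labels[b]:                  # at least one common label
                hits += 1
        return hits / n_pairs

    # labels = [set of labels of doc 0, set of labels of doc 1, ...]
    # print(random_baseline_precision(labels))  # compare with the ~0.12 baseline above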

Now, the question is this: is there a reference paper or book that explains how to compute precision and recall when:

  • the documents can have multiple labels
  • the classes are unbalanced

More details:

  1. The problem of multiple labels per document (i.e. one document can be in several classes at once).

    To deal with multiple labels, you can simply count one true positive if the retrieved document has at least one label in common with the query document (Peter Gehler, one of the authors of the Rate Adapting Poisson Model paper, told me he believed his method was the same). This is a very simple rule, as it does not take into account that finding a document with exactly the same labels as the query document is more valuable than a document with just one label in common. However, I haven't found a simple method that could take this into account. (A minimal sketch of this rule appears after the list.)

  2. How do you deal with unbalanced classes (even if the documents can have only one label)?

    To illustrate why it can be a problem, imagine this setting: your known document set is made of 5 documents, 1 of class A and 4 of class B. You have a perfect ranking algorithm and two query documents, Q1 of class A and Q2 of class B. Q1 first returns the class-A document and then all the B documents; Q2 returns the 4 B documents and then the A document:

    Q1: A B B B B (in terms of true positives: 1 0 0 0 0)

    Q2: B B B B A (in terms of true positives: 1 1 1 1 0)

    If you take each result separately, the precision/recall curve will be a perfect flat line at 100%. If you average the results, precision will be [100% 75% 66% 62.5% 50%] and recall [62.5% 75% 87.5% 100% 100%], which gives a pretty lousy curve compared with what it should be. (The sketch after the list reproduces these numbers.)
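For concreteness, here is a minimal sketch of the counting rule from point 1 combined with the rank-by-rank averaging from point 2 (the function and variable names, and the toy corpus, are purely illustrative); on the toy setting it reproduces the averaged numbers above:

    import numpy as np

    def averaged_pr_curve(rankings, query_labels, corpus_labels):
        """Precision/recall averaged over queries at each rank.
        rankings[q]      : corpus indices ranked by decreasing score for query q
        query_labels[q]  : set of labels of query document q
        corpus_labels[i] : set of labels of corpus document i
        A retrieved document is a true positive if it shares at least
        one label with the query document."""
        n_ranks = len(rankings[0])
        precision = np.zeros((len(rankings), n_ranks))
        recall = np.zeros((len(rankings), n_ranks))
        for q, ranking in enumerate(rankings):
            rel = np.array([bool(query_labels[q] & corpus_labels[i]) for i in ranking],
                           dtype=float)
            tp = np.cumsum(rel)                       # true positives up to each rank
            precision[q] = tp / np.arange(1, n_ranks + 1)
            recall[q] = tp / rel.sum()                # assumes every query has >= 1 relevant doc
        return precision.mean(axis=0), recall.mean(axis=0)   # rank-by-rank average

    # Toy corpus from point 2: one class-A document, four class-B documents.
    corpus = [{"A"}, {"B"}, {"B"}, {"B"}, {"B"}]
    queries = [{"A"}, {"B"}]
    rankings = [[0, 1, 2, 3, 4],   # Q1 retrieves A B B B B
                [1, 2, 3, 4, 0]]   # Q2 retrieves B B B B A
    p, r = averaged_pr_curve(rankings, queries, corpus)
    print(p)   # ~ [1.    0.75  0.667 0.625 0.5  ]
    print(r)   # ~ [0.625 0.75  0.875 1.    1.   ]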

I would appreciate any guidance about the right method (maybe existing ML frameworks have standard methods for that?)
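For what it's worth, once relevance is binarized with the shared-label rule, a single query's curve is an ordinary binary precision/recall curve, so a generic routine such as scikit-learn's precision_recall_curve can compute it per query (a sketch with made-up scores; averaging across queries still has to be handled separately, e.g. by interpolating precision at fixed recall levels):

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    # Illustrative inputs for a single query:
    # scores[i]   : retrieval score of corpus document i for this query
    # relevant[i] : 1 if document i shares at least one label with the query, else 0
    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.2])
    relevant = np.array([1, 0, 1, 1, 0])

    precision, recall, thresholds = precision_recall_curve(relevant, scores)
    # To average over queries, interpolate each query's precision at fixed recall
    # levels (e.g. 0.0, 0.1, ..., 1.0) before taking the mean across queries.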
