I am working on a document dataset (the old Reuters-21578) where each document can have one or more of approximately 100 labels. The classes are highly unbalanced. I am trying to compare my results with other published results, but the precision/recall curves don't match. Some (TF-IDF, for instance) are just slightly off, but others are completely different.
I've got two reference results for comparison: one from Semi-supervised Learning of Compact Document Representations with Deep Networks, fig. 4, the other from the Rate Adapting Poisson Model paper, fig. 6.
I'm almost exactly matching the results of the Rate Adapting Poisson paper. For the first paper, though, I get completely different results for their LSI curves. What is really surprising is that their LSI results actually fall below the baseline obtained with random choice (which is at approximately 0.12). I also have some minor differences with their TF-IDF baseline.
My actual question is this: do you know of a reference paper that explains how to compute precision and recall when the documents are multi-labeled? Here is my (simple) method (Peter Gehler, one of the authors of the Rate Adapting Poisson Model, told me that he believed his method was the same):
- count one true positive if the retrieved document has at least one label in common with the query document
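
In code, my computation looks roughly like this (a minimal sketch rather than my exact implementation; the similarity matrix, the fixed cutoff k, and the function name are just for illustration):

    import numpy as np

    def precision_recall_at_k(similarities, labels, query_idx, k):
        """Precision/recall at cutoff k for one query document.

        A retrieved document counts as a true positive as soon as it
        shares at least one label with the query document.

        similarities : (n_docs, n_docs) array, e.g. cosine on TF-IDF vectors
        labels       : list of sets of label ids, one set per document
        """
        # Rank all other documents by decreasing similarity to the query.
        order = np.argsort(-similarities[query_idx])
        order = order[order != query_idx]  # never retrieve the query itself
        retrieved = order[:k]

        # Relevant <=> at least one label in common with the query.
        query_labels = labels[query_idx]
        tp = sum(1 for d in retrieved if labels[d] & query_labels)
        n_relevant = sum(
            1 for d in range(len(labels))
            if d != query_idx and labels[d] & query_labels
        )

        precision = tp / k
        recall = tp / n_relevant if n_relevant else 0.0
        return precision, recall

Averaging these values over all query documents and sweeping k then traces out the precision/recall curve.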
This is a very simple rule, as it does not take into account that finding a document with exactly the same labels as the query document is more valuable than finding a document with just one label in common. However, I haven't found a simple method that takes this into account.
So I would appreciate any guidance about the right method (maybe existing ML frameworks have standard methods for that?).