Is there a standard way to compute inter-annotator agreement for multi-label classification data? Kappa statistics are often used for binary and multi-class classification, but I haven't been able to find anything on inter-annotator agreement for multi-label data. One possibility is to construct the label powerset and treat the problem as a multi-class, single-label problem, but this doesn't seem right because partial matches between annotations would get no credit. Another possibility is to compute kappa separately for each class, but then we don't have a single estimate.

asked Feb 12 '11 at 10:06

Jan Snajder

edited Feb 12 '11 at 10:12

I would probably just average together the kappa for each class, reporting any classes whose kappa is a significant outlier.

(Feb 12 '11 at 11:59) Kevin Canini
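
A minimal sketch of that per-class averaging, assuming two annotators whose labels are stored as binary indicator matrices Y1 and Y2 of shape (n_items, n_labels); the names are illustrative, and the example relies on scikit-learn's cohen_kappa_score:

```python
# Per-label Cohen's kappa for two annotators, averaged into a single number.
# Y1, Y2: (n_items, n_labels) 0/1 indicator arrays (illustrative assumption).
import numpy as np
from sklearn.metrics import cohen_kappa_score

def per_label_kappa(Y1, Y2):
    """Kappa for each label column plus their unweighted mean.
    Note: a label that neither annotator ever uses yields an undefined kappa."""
    kappas = np.array([cohen_kappa_score(Y1[:, k], Y2[:, k])
                       for k in range(Y1.shape[1])])
    return kappas, kappas.mean()

# Toy example: 6 items, 3 labels.
Y1 = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
Y2 = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]])
kappas, mean_kappa = per_label_kappa(Y1, Y2)
print(kappas, mean_kappa)  # inspect per-label values for outliers before averaging
```

Reporting the full per-label vector alongside the mean keeps the outlier information mentioned above.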

Maybe the right way of looking at this should involve some loss function?

(Feb 13 '11 at 05:11) Alexandre Passos ♦

Yes, this seems like a simple solution. How would you go about computing which kappa is a significant outlier? (The individual kappa values are not identically distributed, or are they?)

(Feb 13 '11 at 05:13) Jan Snajder

I probably wouldn't compute the outliers, but just rely on my own intuition for what seems surprising. I guess I was assuming that you only have ~10 classes, and that they would all have about the same kappa score. It's really hard to say how rigorous you should be without knowing the context for what you're doing. If this is just a small step in a larger evaluation, it's probably fine to be fairly loose about it, especially if the kappa scores are all roughly similar.

By the way, I'm pretty sure it's meaningless to say that the kappa scores are identically distributed (or not) unless you have a probability model for how your annotators are producing their labels. Anyway, I didn't mean "outliers" in the probabilistic sense, but rather in the everyday sense. The purpose of reporting them would be to be more academically honest when you replace an entire set of kappa scores with a single average, so people have a better idea of what exactly is being swept under the rug.

(Feb 13 '11 at 11:36) Kevin Canini

3 Answers:

I wrote a paper on this what feels like a long time ago. It's imperfect, to be sure, and maybe not applicable to what you're working on, but maybe it'll get some ideas and discussion going. The idea is, more or less, that each multiply labeled data point has partial membership in each of its labels. Depending on the task, this may or may not be appropriate.

Andrew Rosenberg and Ed Binkowski. 2004. Augmenting the Kappa Statistic to Determine Interannotator Reliability for Multiply Labeled Data Points. HLT/NAACL '04

answered Feb 12 '11 at 23:54

Andrew Rosenberg

Thanks! I like the idea of having a primary and a secondary label and then being able to select the weight for each. I guess this really can give us insight into the annotation quality. But obviously this won't work with more than two labels per item (as in my case). Another thing that bothers me is that I cannot (or don't know how to) compute an interval estimate (which is straightforward for standard kappa).

(Feb 13 '11 at 05:05) Jan Snajder

Information theory offers a straightforward way to handle data like this. For instance, calculate the mutual information of the outputs of two annotators (perhaps normalized by dividing by their joint entropy). I recommend the seminal work in this field, "The Mathematical Theory of Communication" by Shannon. If you want something quick (and free), see, for instance, this lecture:

Lecture 3: Joint entropy, conditional entropy, relative entropy, and mutual information (Biology 429) by Carl Bergstrom
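
A minimal sketch of this normalized mutual information idea for multi-label data, under one possible reduction (treat each label as a binary variable, compute I(A;B)/H(A,B) per label, and average); Y1 and Y2 are assumed binary indicator matrices of shape (n_items, n_labels), and all names are illustrative rather than a standard recipe:

```python
# Normalized mutual information between two annotators, computed per label over
# binary indicator columns and then averaged. Y1, Y2: (n_items, n_labels) 0/1 arrays.
import numpy as np

def normalized_mi(a, b):
    """I(A;B) / H(A,B) for two discrete label vectors a, b (base-2 entropies)."""
    values_a, a_idx = np.unique(a, return_inverse=True)
    values_b, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((len(values_a), len(values_b)))
    for i, j in zip(a_idx, b_idx):
        joint[i, j] += 1
    joint /= joint.sum()                      # empirical joint distribution
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    h_joint = -np.sum(joint[nz] * np.log2(joint[nz]))
    h_a = -np.sum(pa[pa > 0] * np.log2(pa[pa > 0]))
    h_b = -np.sum(pb[pb > 0] * np.log2(pb[pb > 0]))
    mi = h_a + h_b - h_joint
    return mi / h_joint if h_joint > 0 else 0.0

def multilabel_agreement(Y1, Y2):
    """Average normalized MI across label columns."""
    return np.mean([normalized_mi(Y1[:, k], Y2[:, k]) for k in range(Y1.shape[1])])
```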

answered Feb 13 '11 at 13:19

Will Dwinnell

I like the idea of using mutual information or conditional entropy to test inter-annotator agreement.

One of the attractive qualities of kappa is that it relates expected to actual agreement. Information theory doesn't offer a clear correlate to this. But I suppose you could say that chance agreement is 0 additional information above the baseline distribution. Something like H(A) - H(A|B) (i.e., the mutual information I(A;B)) could be informative.

The only remaining issue would be the calculation of p(X) under multiple labels: is a label of A and B equivalent to two labels, or to one label divided over two classes? But this could be determined by the task.

(Feb 13 '11 at 16:18) Andrew Rosenberg

What about using the relative entropy (Kullback-Leibler distance)? See, for instance, chapter 2 of "Elements of Information Theory", by Cover and Thomas: http://www1.cs.columbia.edu/~vh/courses/LexicalSemantics/Association/Cover&Thomas-Ch2.pdf

(Feb 14 '11 at 06:35) Will Dwinnell
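
One reading of the relative-entropy suggestion (an interpretation, not necessarily what was intended above): compare the two annotators' overall label-usage distributions with a symmetrized KL divergence. Note that this only compares marginal label frequencies and ignores item-by-item agreement, so it is at best a complementary check. A sketch, with illustrative names and add-one smoothing so the divergence stays finite:

```python
# Symmetrized KL divergence between two annotators' label-usage distributions.
# Y1, Y2: (n_items, n_labels) 0/1 indicator arrays (illustrative assumption).
import numpy as np
from scipy.stats import entropy

def label_distribution(Y, alpha=1.0):
    """Empirical label distribution for one annotator, with add-alpha smoothing
    so labels an annotator never uses do not produce infinite divergence."""
    counts = Y.sum(axis=0).astype(float) + alpha
    return counts / counts.sum()

def symmetric_kl(Y1, Y2):
    """0.5 * (D(P1 || P2) + D(P2 || P1)) over the smoothed label distributions."""
    p, q = label_distribution(Y1), label_distribution(Y2)
    return 0.5 * (entropy(p, q) + entropy(q, p))
```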

For inspiration, you might look at the MUC (Vilain et al., 1995) and B-CUBED (Bagga & Baldwin, 1998) metrics for scoring coreference chains. These measure agreement between the correct coreference chain (a set of noun phrase occurrences in text that refer to the same entity) and the predicted coreference chain. As with your multi-label problem, the trick is how to measure the overlap between sets. There may also be relevant follow-up work that addresses shortcomings of these metrics.
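
Borrowing just the set-overlap idea (this is not the MUC or B-CUBED scorer itself), one simple option is to score each item by the Jaccard overlap of the two annotators' label sets and average over items. A sketch with illustrative names:

```python
# Average per-item Jaccard overlap between two annotators' label sets.
def mean_jaccard(labels1, labels2):
    """labels1, labels2: lists of label sets, one set per item, same length."""
    scores = []
    for s1, s2 in zip(labels1, labels2):
        if not s1 and not s2:
            scores.append(1.0)  # both annotators assigned nothing: full agreement
        else:
            scores.append(len(s1 & s2) / len(s1 | s2))
    return sum(scores) / len(scores)

# Example: three items, partial overlap on the second item.
ann1 = [{"sports"}, {"politics", "economy"}, {"science"}]
ann2 = [{"sports"}, {"politics"}, {"art"}]
print(mean_jaccard(ann1, ann2))  # (1 + 0.5 + 0) / 3
```

Unlike kappa, this does not correct for chance agreement; it is closer to observed agreement with partial credit for overlapping label sets.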

answered Feb 12 '11 at 16:06

Art Munson
