There are a number of ways that NER "truth" data can be labeled, with variations in, for instance, whether the first token of an entity gets a special label (as in IOB tagging) and whether the "other" class gets an explicit "Other" label or no label at all.
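To make the distinction concrete, here is a small illustrative tagging of one sentence under two such schemes (the sentence and tags are my own example, not drawn from any particular corpus):

    # One sentence, tagged two ways (illustrative example only).
    tokens = ["John", "Smith", "works", "at", "Acme", "Corp", "."]

    # IOB (BIO) labeling: the first token of each entity gets a B- tag.
    iob_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O"]

    # IO labeling: no special beginning tag, so two adjacent entities
    # of the same type can no longer be told apart.
    io_tags = ["I-PER", "I-PER", "O", "O", "I-ORG", "I-ORG", "O"]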

There are also a number of ways to evaluate NER. F-measure or its components (precision and recall) are popular, but there are several ways to compute them. One could compute F-measure for each named-entity class separately and average the results, or collapse all of the classes into one class and compute precision and recall over that, and I've seen (and can imagine) other variations.
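As a rough sketch of the two aggregation strategies I mean (the class names and counts below are made up, and this is not a reference implementation of any standard scorer):

    def f1(tp, fp, fn):
        """F1 from true-positive / false-positive / false-negative counts."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # counts[c] = (tp, fp, fn) for each entity class c -- hypothetical numbers.
    counts = {"PER": (80, 10, 15), "ORG": (40, 20, 25), "LOC": (60, 5, 10)}

    # Variation 1: per-class F1, then average (macro-averaged F1).
    macro_f1 = sum(f1(*counts[c]) for c in counts) / len(counts)

    # Variation 2: collapse all classes into one and score jointly (micro-averaged F1).
    tp, fp, fn = (sum(x) for x in zip(*counts.values()))
    micro_f1 = f1(tp, fp, fn)

    print(macro_f1, micro_f1)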

My question is: is there a good reference that surveys these variations and associates some terminology with them? My goal is to be able to cite a paper that would let me say something like "I evaluated the results using entity-aggregate F1 with collapsed-IOB labeling" (or whatever) and have that be a precise, unambiguous, and compact description of my evaluation method.

asked Apr 05 '11 at 21:11


Philip Kegelmeyer


One Answer:

I think a standard form of evaluation is the one used in the CoNLL NER shared task. Basically, they use F-measure, but computed over whole entities, so getting an entity's boundary or type wrong gets you no credit for that entity.
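In other words, entities are scored by exact match: a predicted entity counts as a true positive only if both its span and its type agree with the gold annotation. A minimal sketch of that counting, assuming the entity spans have already been extracted from the tag sequences:

    def entity_level_f1(gold_entities, pred_entities):
        """Entity-level F1 in the CoNLL shared-task style: an entity is correct
        only if its (start, end, type) triple exactly matches a gold entity.

        Both arguments are sets of (start, end, type) tuples."""
        tp = len(gold_entities & pred_entities)
        fp = len(pred_entities - gold_entities)
        fn = len(gold_entities - pred_entities)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Example: one boundary error and one type error both count as complete misses.
    gold = {(0, 2, "PER"), (4, 6, "ORG")}
    pred = {(0, 1, "PER"), (4, 6, "LOC")}
    print(entity_level_f1(gold, pred))  # 0.0 -- no exact matches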

answered Apr 05 '11 at 21:28


Alexandre Passos ♦


