There are a number of ways that NER "truth" data can be labeled, with variations around, for instance, whether the beginning of an entity gets a special label (as in IOB) and whether the "other" class gets an explicit "Other" label or no label at all. There are also a number of ways to evaluate NER. F-measure or its components (precision and recall) are popular, but there are several ways to implement it: one could compute F-measure for each named-entity class separately and average the results, or collapse all of the classes into one class and compute precision and recall that way, and I've seen (and can imagine) other variations. My question is: is there a good reference that surveys these variations and associates some terminology with them? My goal is to be able to cite a paper that would let me say something like "I evaluated the results using entity-aggregate F1 with collapsed-IOB labeling" (or whatever) and have that be a precise, unambiguous, and compact description of my evaluation method.
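To make the distinction concrete, here is a rough sketch (in Python) of the two aggregation variants I have in mind; the class names and counts are made up purely for illustration:

```python
# A rough sketch of the two aggregation variants, using made-up
# entity-level counts (tp, fp, fn per class) purely for illustration.

counts = {
    "PER": (80, 10, 20),  # hypothetical (true positives, false positives, false negatives)
    "ORG": (50, 25, 30),
    "LOC": (90, 5, 10),
}

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Variant 1: compute F1 per class, then average over classes ("macro" averaging).
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Variant 2: collapse all classes into one and score the pooled counts ("micro" averaging).
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```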
I think a standard form of evaluation is the one used in the CoNLL NER shared tasks (e.g., CoNLL-2003). Basically, they use the F-measure, but only over whole entities: an entity is counted as correct only if both its boundaries and its type exactly match the gold annotation, so getting a boundary or type wrong gets you nothing.
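If you want to reproduce that exact-match scoring yourself, here is a rough sketch over IOB2-tagged token sequences; the tag sequences below are just illustrative, and for real work you'd more likely use the official conlleval script or a library such as seqeval:

```python
# A rough sketch of CoNLL-style exact-match scoring: an entity counts as
# correct only if both its span and its type match the gold annotation.
# The tag sequences here are hypothetical IOB2-labelled tokens.

def spans(tags):
    """Extract a set of (type, start, end) entity spans from an IOB2 tag sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last entity
        if tag.startswith("B-") or tag == "O" or (start is not None and tag[2:] != etype):
            if start is not None:
                out.append((etype, start, i))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:  # tolerate a stray I- after O
            start, etype = i, tag[2:]
    return set(out)

gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O"]

g, p = spans(gold), spans(pred)
tp = len(g & p)
precision = tp / len(p) if p else 0.0
recall = tp / len(g) if g else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # the mislabeled LOC/ORG span gets no credit
```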