
I'm interested in cleaning up a manually categorized text data set for the purpose of evaluating a machine-learning-based text classification system. We have an initial manual labeling, so an obvious approach is to train a classifier and use it to find examples that are likely to be mislabeled. This technique has been proposed many times, including in:

Guyon, I., Matic, N., and Vapnik, V. (1996). Discovering informative patterns and data cleaning. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 181–203. MIT Press.
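
For concreteness, here is a minimal sketch of that idea (the function name, vectorizer, and classifier choices are illustrative assumptions, not part of the cited method): each example is scored by cross-validated models that never trained on it, and examples whose assigned label receives low predicted probability are flagged for review.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def flag_suspect_labels(texts, labels, threshold=0.1):
        """Return indices whose assigned label gets low out-of-fold probability."""
        X = TfidfVectorizer().fit_transform(texts)
        y = np.asarray(labels)
        # Out-of-fold probabilities: each example is scored by a model that
        # never saw its own (possibly wrong) label during training.
        proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                                  cv=5, method="predict_proba")
        classes = list(np.unique(y))  # column order of the probability matrix
        p_assigned = proba[np.arange(len(y)), [classes.index(c) for c in y]]
        # Low probability on the assigned label => candidate for re-assessment.
        return np.where(p_assigned < threshold)[0]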

Such approaches are great if the goal is to produce a high-quality data set for operational use. But there is a danger of a self-fulfilling prophecy if the data set is to be used for evaluating a learning-based system: you just end up relabeling exactly the examples that machine learning would otherwise get wrong.

Developing an explicit and detailed coding manual, and carefully training assessors, is one way to limit the danger. Another is to apply the selective procedure only to the training data, while all test examples are manually re-assessed, with ties broken by a third assessor (as sketched below).
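
As an illustration of that test-set protocol (all names here are hypothetical; `third_assessor` stands for however the tie-breaking judgment is obtained):

    def adjudicate_test_labels(labels_a, labels_b, third_assessor):
        """Two assessors relabel each test example independently; where they
        agree, their label stands; a third assessor breaks ties."""
        final = []
        for i, (a, b) in enumerate(zip(labels_a, labels_b)):
            final.append(a if a == b else third_assessor(i))
        return final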

I'm interested in other suggestions, and in particular in citations to work where this issue has been discussed.

asked Dec 28 '10 at 18:49 by Dave Lewis


One Answer:

Depending on the expert time available, you may want to take your mis-categorised instances, add in some correctly categorised ones (perhaps the same number again?), and present all of them to the expert without showing any of the existing categorisations.
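
For example, a rough sketch of building such a blinded review set (names are illustrative, and it assumes `suspect_idx` comes from a classifier-based pass like the one in the question):

    import random

    def build_blind_review_set(texts, labels, suspect_idx, seed=0):
        """Mix suspects with an equal number of presumed-correct controls,
        shuffle, and hide the labels so the expert judges each text fresh."""
        rng = random.Random(seed)
        suspects = sorted(set(suspect_idx))
        suspect_set = set(suspects)
        controls = [i for i in range(len(texts)) if i not in suspect_set]
        control_sample = rng.sample(controls, min(len(suspects), len(controls)))
        review = suspects + control_sample
        rng.shuffle(review)
        hidden_key = {i: labels[i] for i in review}  # kept aside for scoring
        return [(i, texts[i]) for i in review], hidden_key

A side benefit of the control instances: besides blinding the expert, they give you a rough estimate of how often the flagging procedure raises false alarms.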

answered Jul 08 '11 at 20:18 by Robert Layton
