For classification tasks it is usually difficult to obtain a perfectly labelled training set (one with completely reliable labels): the human annotators who label the training data may introduce erroneous/noisy labels.

Setting aside crowdsourcing techniques, what is the state of the art of learning with such noisy labels? Do you know of any interesting papers that deal with this issue?

Note: question also posted on CrossValidated for those who are interested.

asked Sep 29 '13 at 07:08

shn

edited Sep 29 '13 at 09:34

Aren't ensemble methods with multiple classifiers one way to address the problem of noisy data?

(Oct 01 '13 at 05:38) Svetoslav Marinov

@SvetoslavMarinov The noise we are talking about is in terms of mislabelled training instances. This label noise will affect every classifier in the ensemble, so I'm looking instead for methods that can detect such mislabelled instances in the training set.

(Oct 01 '13 at 06:16) shn

I guess you've seen these, but is it something like that you are interested in: http://arxiv.org/pdf/1305.4987v1.pdf & http://www.cs.bris.ac.uk/~flach/ECMLPKDD2012papers/1125527.pdf

(Oct 02 '13 at 07:48) Svetoslav Marinov

2 Answers:

The way the question is asked looks to me like a slightly different framing of the problem of generalization. From that perspective, this problem is equivalent to empirical risk minimization. Consider support vector machines with slack variables: when a data point lies too far away on the "other side" of the decision boundary, we simply drop it and no longer consider it a support vector. This is often misunderstood as a noisy-data problem, but the same idea applies to noisy labels as well.
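As a rough sketch of that point: with a soft-margin SVM, a small regularization constant C lets mislabelled points be absorbed by the slack terms rather than reshape the decision boundary. The library, data, and C value below are my own assumptions for illustration, not something from the answer.

```python
# Sketch: a soft-margin SVM tolerating a few flipped labels.
# scikit-learn, the synthetic data, and C=0.1 are assumptions for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Two well-separated Gaussian clusters, 100 points each.
X = np.vstack([rng.randn(100, 2) + [2, 2],
               rng.randn(100, 2) + [-2, -2]])
y = np.array([1] * 100 + [-1] * 100)

# Flip 10% of the labels to simulate annotator noise.
noisy = y.copy()
flip = rng.choice(len(y), size=20, replace=False)
noisy[flip] *= -1

# A small C keeps the margin wide, so mislabelled points become
# margin violations (absorbed by slack) instead of support vectors
# that pull the boundary toward them.
clf = SVC(kernel="linear", C=0.1).fit(X, noisy)

# Accuracy is measured against the *clean* labels.
print(round(clf.score(X, y), 2))
```

Despite training on the corrupted labels, the wide-margin classifier recovers a boundary close to the clean one on this kind of data.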

answered Oct 01 '13 at 15:02

Rakesh Chalasani

Can you please give more details on how you relate the "mislabelling of some training instances (label noise problem)" to "empirical risk minimization and the noisy data problem"? How could "this" be applied to noisy labels? Please explain further (maybe with a synthetic example?), because it is not very clear to me what you are comparing.

Actually there are a few methods that deal with label noise, but most of them simply identify a mislabelled instance (x, y) in the training set X as one having a low P(y|x, h), where h is a classifier trained on X. I wonder whether there are more principled methods for dealing with label noise (learning with mislabelled data).

(Oct 01 '13 at 16:43) shn

I am not an expert at this, but empirical risk minimization considers the joint probability p(y, x) and in turn models the uncertainty in the prediction p(y|x) to determine the hypothesis (or decision boundary). So, if you fit a hypothesis (decision boundary) to the mislabelled data, you would end up with a higher risk than if you ignored such "outlier" predictions.

I am not well versed enough in this literature to give you pointers, but it does seem pretty close to the problem of generalization.

(Oct 03 '13 at 00:51) Rakesh Chalasani
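For reference, the empirical risk the comment appeals to is the standard average loss over the training sample; the connection to label noise is then just that some of the y_i in this sum are wrong:

$$ \hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} L\big(h(x_i),\, y_i\big) $$

Minimizing this directly fits the noisy labels y_i, so robustness has to come from somewhere else, e.g. a restricted hypothesis class, regularization, or a loss (such as the hinge loss with slack in the SVM case) that limits the influence of points on the wrong side of the boundary.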

I guess this theme is related to your question.

answered Sep 29 '13 at 11:58

VAvd



User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.