|
For classification, it is usually difficult to obtain a perfectly labelled training set (with completely reliable labels), so the human annotators who label the training data may produce some erroneous/noisy labels. Setting aside crowdsourcing techniques, what is the state of the art of learning with such noisy labels? Do you know of any interesting papers that deal with this issue? Note: this question is also posted on CrossValidated for those who are interested. |
|
The way the question is asked looks to me like framing the problem of generalization in a slightly different way. From that perspective, this problem is equivalent to empirical risk minimization. Consider support vector machines with slack variables. When a data point lies too far away on the "other side" of the decision boundary, we simply drop it and no longer consider it a support vector. This is often misunderstood as a noisy-data problem, but the same idea can be applied to noisy labels as well.

Can you please give more details about your comparison of the "mislabelling of some training instances (label noise problem)" with "empirical risk minimization and the noisy data problem"? How could "this" be applied to noisy labels? Please give more explanation (maybe with a synthetic example?), because it is not very clear to me what you are comparing. Actually there are a few methods that deal with label noise, but most of them just identify a mislabelled instance (x,y) in the training set X as one having a low P(y|x,h), where h is a classifier trained on X. I wonder whether there are more principled methods to deal with label noise (learning with mislabelled data).
(Oct 01 '13 at 16:43)
shn
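The "low P(y|x,h)" filtering heuristic mentioned above can be sketched as follows. This is a minimal illustration, not a method from any particular paper: it approximates P(y|x) with a k-NN vote among the other training points (the choice of k-NN, k=5, and the 0.2 threshold are assumptions made for the example; any probabilistic classifier h trained on X could play the same role).

```python
import numpy as np

def suspect_mislabelled(X, y, k=5, threshold=0.2):
    """Return indices i where the k-NN estimate of P(y_i | x_i) falls
    below `threshold`, i.e. the given label disagrees with its
    neighbourhood -- a crude stand-in for low P(y|x,h)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # pairwise squared distances between all training points
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude each point itself
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of k nearest neighbours
    # fraction of neighbours that agree with the given label
    agree = (y[nn] == y[:, None]).mean(axis=1)
    return np.where(agree < threshold)[0]

# Synthetic check: two well-separated clusters, one label flipped on purpose.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[3] = 1  # inject label noise
print(suspect_mislabelled(X, y))  # -> [3]
```

As the comment above notes, this is exactly the kind of single-pass filter most simple approaches amount to; its obvious weakness is that h (here the neighbourhood vote) is itself fit on the noisy labels.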
I am not an expert at this, but empirical risk minimization considers the joint probability p(y,x) and in turn models the uncertainty in the prediction p(y|x) to determine the hypothesis (or decision boundary). So, if you fit a hypothesis (decision boundary) to the mislabelled data, you would end up with a higher risk than if you ignored such "outlier" predictions. I am not well versed enough in this literature to give you pointers, but it does seem pretty close to the problem of generalization.
(Oct 03 '13 at 00:51)
Rakesh Chalasani
|
Aren't ensemble methods with multiple classifiers one way to address the problem of noisy data?
@SvetoslavMarinov The noise we are talking about is in terms of mislabelled training instances. This label noise will affect each classifier of the ensemble. So I'm rather looking for methods that allow one to detect such mislabelled instances in the training set.
I guess you've seen these, but is it something like this that you are interested in: http://arxiv.org/pdf/1305.4987v1.pdf & http://www.cs.bris.ac.uk/~flach/ECMLPKDD2012papers/1125527.pdf