I'm looking for pointers to papers on supervised learning methods that incorporate knowledge of label noise. For instance, I have a dataset with 10 labels, and approximately 6% of the training examples have an incorrect label. How can I use this knowledge?
One thing I was thinking about recently, but have so far failed to prove anything interesting about, is that the C in the SVM is a kind of l1 relaxation of a loss that deals with label noise. Think of the binary case, and suppose we know that X of the examples marked positive are actually negative and Y of the examples marked negative are actually positive, and we want to find the optimal hyperplane. Then the problem is something like minimizing the norm of w under the constraint that no more than X of the positive examples and no more than Y of the negative examples are misclassified. If you define decision variables for these violations, allow them to be "soft", and set X = Y = C, you recover the soft-margin SVM (sketched below). I think this suggests there should be a way to use the hinge loss to control this kind of behavior, especially if you allow per-class weights.

This is also useful in the information-retrieval setting, where you usually have a small set of curated true examples and a large mixed bag of examples of which you expect a certain proportion to be relevant. Maybe this kind of constraint can be expressed more naturally in the generalized-expectations framework. Usually you use generalized expectations to, for example, constrain the label proportions on test data to be as similar as possible to those in the training data. If you also add such an expectation term constraining the expectations on the training data (replacing the usual log-likelihood term) to reflect your beliefs about the false-positive rate, you should get something very similar to what I described above, and roughly equivalent to what you say you want. However, I also cannot really prove that this is better than doing something simple.

Thanks for the reference... Andrew McCallum's name on the article is a good sign that the method is practical.
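Concretely, here is a rough sketch of the formulation I have in mind (the positive/negative index sets P and N, the slack variables, and the big-M constant are just my notation for this comment, not taken from any paper). With 0/1 decision variables $z_i$ marking which examples are allowed to fall on the wrong side,

$$
\min_{w,\,b,\,z} \ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - M z_i,\ \ z_i \in \{0,1\},
\qquad \sum_{i \in P} z_i \le X, \quad \sum_{i \in N} z_i \le Y,
$$

the budget constraints say that at most X positives and at most Y negatives may be misclassified. Relaxing the 0/1 variables to continuous slacks $\xi_i \ge 0$ (absorbing M into $\xi_i$; this is the l1-style relaxation) and attaching Lagrange multipliers $C_+$ and $C_-$ to the two budget constraints turns the budgets into penalties,

$$
\min_{w,\,b,\,\xi \ge 0} \ \tfrac{1}{2}\|w\|^2 + C_+ \sum_{i \in P} \xi_i + C_- \sum_{i \in N} \xi_i
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,
$$

which is the per-class-weighted soft-margin SVM; with $C_+ = C_- = C$ it is the usual C-SVM.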
(Aug 29 '11 at 06:48)
Yaroslav Bulatov
(Partly related to Alex's interesting response and not really an answer to the original question, but it's an interesting discussion, so I wanted to write down a few things that came to mind.)

For me, overfitting and label noise are related problems: if the model overfits the data, the consequences will be much worse in the presence of label noise. Therefore regularization in general seems like the standard machine-learning way to deal with label noise (in addition to a lack of training data). For example, in the C-SVM, C controls the trade-off between sparsity/simplicity and expressiveness/complexity of the solution. Concretely, the dual weights alpha will never exceed the value of C: intuitively, this ensures that the model will not be changed too much by an example that was actually mislabeled.

Online algorithms are typically very sensitive to label noise, since they can take a step to "correct" a mistake which in fact wasn't one; this wrongly-taken step can severely impact the following rounds. A good way to deal with this problem is to use mini-batches, for their averaging effect (a toy sketch of this is given below).

Many researchers like to emphasize that real-world data is never (totally) clean. So, shouldn't label noise always be dealt with anyway?
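To make the mini-batch point concrete, here is a small toy sketch (not from any paper; the data, model, and hyperparameters are all invented for the illustration): logistic regression trained by SGD on 2-D Gaussian blobs where about 6% of the training labels are flipped, once with batch size 1 and once with batch size 32, then evaluated on clean labels. Whether and how large the gap is depends on the learning rate and the data, so treat it only as a way to play with the effect.

# Toy illustration of the averaging effect of mini-batches under label noise.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise_rate=0.06):
    """Two Gaussian blobs; labels in {-1, +1}; a fraction of labels flipped."""
    X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
                   rng.normal(+1.0, 1.0, size=(n // 2, 2))])
    y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])
    flip = rng.random(n) < noise_rate
    return X, y, np.where(flip, -y, y)

def sgd_logreg(X, y, batch_size, lr=0.1, epochs=20, l2=1e-3):
    """Plain mini-batch SGD on the mean logistic loss with l2 regularization."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            margins = yb * (Xb @ w + b)
            # d/d(margin) of log(1 + exp(-margin)) = -sigmoid(-margin),
            # written with logaddexp for numerical stability
            coef = -yb * np.exp(-np.logaddexp(0.0, margins))
            grad_w = Xb.T @ coef / len(idx) + l2 * w
            grad_b = coef.mean()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

def accuracy(w, b, X, y):
    return np.mean(np.sign(X @ w + b) == y)

X_tr, _, y_tr_noisy = make_data(2000)          # train on noisy labels
X_te, y_te, _ = make_data(2000, noise_rate=0)  # evaluate on clean labels

for bs in (1, 32):
    w, b = sgd_logreg(X_tr, y_tr_noisy, batch_size=bs)
    print(f"batch size {bs:>2}: clean test accuracy = {accuracy(w, b, X_te, y_te):.3f}")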
A modification of the hinge loss appears in this paper, for a (related) situation where you have multiple sets of labellers:
|
You may find these references useful too:

N. D. Lawrence and B. Schölkopf. Estimating a kernel Fisher discriminant in the presence of label noise. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 306–313. Morgan Kaufmann, San Francisco, CA, 2001.

R. Amini and P. Gallinari. Semi-supervised learning with an imperfect supervisor. Knowledge and Information Systems, 8(4):385–413, 2005. ISSN 0219-1377.

Y. Li, L. Wessels, D. De Ridder and M. Reinders. Classification in the presence of class noise using a probabilistic kernel Fisher method. Pattern Recognition, 40:3349–3357, 2007.