|
I have a large binary classification problem that I'm currently solving with primal gradient SVM (Pegasos). I want to bias this classifier to be extremely high precision (<0.5% false negatives). The classifier solves a latent subproblem when making its prediction, which is why I'm using a home-grown SVM rather than an off-the-shelf tool. Now, if I were doing normal SVM learning, I would use class-specific C-parameters so the slack cost of getting a true example wrong is very large. Unfortunately, Pegasos places its regularization parameter on the ||w||^2 term, not the hinge loss - so I'm not sure on how to do that here. Another possibility would be to increase the required margin for positive examples, or decrease it for negatives (as is done with costs in MIRA), which helps a little, but I can't seem to push Pegasos to the extremes I need with margin rescaling. What I am doing is altering my sampler during learning to up-sample positives. This seems to be working, but it is ridiculously slow, and I worry that it's hurting my final classifier by playing havoc with regularization and the learning rate. Are there other options that I haven't explored yet? Thanks! |
|
Another interesting approach is based on learning reductions and is named costing. It was introduced by Zadrozny, Langford, and Abe, Cost-sensitive learning by cost-proportionate example weighting, and the main idea is that you use importance sampling to train many classifiers. The idea is presented in the each-example-has-a-cost setting, but this generalizes the each-class-has-acost setting. When you oversample some datapoints you might bias the classifier in an odd way. The approach they suggest, which is optimal, is to create many rejection-sampled datasets where each dataset has a datum p with probability proportional to the importance of p. You then train one classifier on each such dataset and average their predictions, thereby creating an optimal cost-sensitive classifier. In your case you could, for example keep all positive examples and drop randomly x% of the negative ones. To keep the online setting you can have a for loop that, for each example, for each classifier, samples whether or not that classifier will learn from that example. I'm not sure if the proof holds in this case (it's been a while since I read the paper) but the idea is sound. 2
I wound up using the cost-based importance sampling and averaging suggested here, and it basically revitalized the project. The suggested paper is excellent, and I found it to be a great help in understanding the available approaches to adding costs. Thanks to both Alexandre and John! Since the paper shows that implementing cost directly into the learner is preferable to sampling, I'm now re-implementing my SVM with a version of dual exponentiated gradient that accounts for example-specific C-parameters (slack re-scaling).
(Oct 27 '10 at 15:16)
Colin Cherry
|
|
In my experience, the costing approach works pretty well when computation is not a problem---you'll see in the experiments section that this approach retroactively beats some prediction contest winners. If computation is a serious concern, it seems best to deal with the importance weights directly. In general, this is actually pretty difficult to do well with an online learning algorithm, because the update goes nonlinear quickly with large importance weights. Nikos and I have been experimenting with an importance-weight invariant update that partially addresses this. We don't have a writeup yet, but the actual update rule is in the VW codebase. To see it: git clone git://github.com/JohnLangford/vowpal_wabbit.git and look at 'loss_functions.cc'. You can experiment with it directly using VW or just cut&paste. Thanks! That's a great tip and I'll definitely check out the update in VW. You mentioned an experiments section - which paper?
(Sep 23 '10 at 20:23)
Colin Cherry
1
@Colin Cherry: I think it's the paper I referred to in my answer about costing.
(Sep 23 '10 at 21:05)
Alexandre Passos ♦
|
|
I've had reasonable performance scaling the margin, making it n times larger for positive instances (which at first means every positive instance is a margin error and update), where n is the ratio of the losses of predicting wrong. See Tsochantaridis et al, Support vector machines for structured and interdependent output spaces, they have a discussion of two approaches to this problem: margin rescaling (which is what I just suggested) and slack rescaling, and both should converge to the optimal regret minimization. Thanks for your response. I have also been thinking in terms of margin and slack rescaling. By up-sampling positives I am effectively doing crude slack rescaling, and as I mentioned above, I have tried margin rescaling, but I haven't had enough success with it so far. I should revisit margin rescaling, though, as it fits much more naturally into these online formalisms (at least in theory). In Pegasos, I believe the only algorithmic impact of margin rescaling is determining when to apply an update. I'll let you know if I'm able to push it to the point I need.
(Sep 23 '10 at 16:52)
Colin Cherry
I was going to suggest something similar. Essentially, you want to adapt your loss function so that the penalty for a false positive is K times higher than the penalty for a false negative, where K is a hyperparameter.
(Sep 23 '10 at 17:00)
Joseph Turian ♦♦
|
Maybe I'm mistaken here, but prc = true pos / (true pos + false pos), so don't you mean you want few false pos?
Good catch. Actually, I think what I should say is that I want extremely high recall :-)