Here's something that seems like a very natural problem, but for which I can't find any published work. Suppose you have a classifier (logistic regression, SVM,...) that assigns numeric scores to examples. You have a tuning set not used in training, so you can get an unbiased estimate of effectiveness at any desired threshold. Your goal is to choose the highest threshold that will give you, say, at least 90% actual recall on 95% of new test sets of size k to which you might apply the classifier. We'll assume the new test set is drawn from the same population as your tuning set.

This is a lot trickier than it seems at first. You have only a finite tuning set, so there's both variance and perhaps a sequential selection bias in choosing a threshold. More weirdly, you have to take into account the sampling variance on those future test sets, and thus the size k of those future test sets. (Consider the case where the future test sets are of size 1 vs. very large.) It also seems likely that knowing the distribution of scores on a particular future test set would be useful, so if you're allowed to use that information there's a transductive aspect too.
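To make the dependence on k concrete, here's a quick numerical sketch (hypothetical numbers: pretend we somehow knew the threshold's true population recall) showing how the chance of hitting 90% observed recall swings with the number of positives in the future test set:

    import numpy as np
    from scipy.stats import binom

    # Hypothetical illustration: suppose the chosen threshold has true
    # population recall p = 0.93.  A future test set with n_pos positives
    # then sees observed recall distributed as Binomial(n_pos, p) / n_pos.
    p = 0.93
    target = 0.90
    for n_pos in [1, 10, 100, 1000]:
        k_needed = int(np.ceil(target * n_pos))  # positives that must be caught
        prob = binom.sf(k_needed - 1, n_pos, p)  # P(X >= k_needed)
        print(f"n_pos = {n_pos:4d}: P(observed recall >= 0.90) = {prob:.3f}")

Even when the true recall is above the target, the probability of clearing 90% on a particular test set moves around substantially with n_pos, which is why the threshold can't be chosen independently of k.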

Has anyone seen this problem addressed before?

asked Jul 18 '13 at 11:06

Dave Lewis

edited Jul 22 '13 at 10:29

Sounds like it can be cast as a power analysis for a binomial test, since recall is equivalent to the number of successes in n trials.
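For instance, a minimal sketch of that binomial view (n and targets assumed purely for illustration):

    import numpy as np
    from scipy.stats import binom

    # If a future test set has n positives, what true recall p must the
    # threshold deliver so that observed recall >= 0.90 with prob >= 0.95?
    n = 100                                # assumed number of positives
    k_needed = int(np.ceil(0.90 * n))
    for p in np.arange(0.90, 1.0, 0.001):
        if binom.sf(k_needed - 1, n, p) >= 0.95:  # P(successes >= k_needed)
            print(f"true recall must be at least {p:.3f}")
            break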

(Jul 20 '13 at 07:08) digdug

The machinery of power analysis certainly seems relevant here, but the logic is somewhat inverted and I haven't gotten my head around it yet.

(Jul 22 '13 at 10:25) Dave Lewis

With power analysis you can compute the probability (= power) of detecting a difference in recall x' from your base recall x, given sample size n and significance level alpha, assuming the test data has the same distribution as the training data. You can also do things like hold power fixed and find the minimum required sample size, etc.

The tricky bit is that it won't say "you'll achieve recall of x in test data with some probability"; it will say "you'll achieve recall statistically indistinguishable from your base recall in test data" (if power is low) or "you'll achieve recall distinguishable from the base recall" (if power is high). In other words, if you're happy to accept wide confidence intervals on your test data, you need less data, but then you cannot claim that your result was (significantly) better than the base recall. So you can trade off precision of the estimated recall against sample size.
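As a concrete (hypothetical) instance of that calculation, using an exact one-sided binomial test:

    from scipy.stats import binom

    # Assumed numbers: base recall x = 0.90, alternative x' = 0.85,
    # n = 200 test positives, one-sided test at significance alpha = 0.05.
    x, x_alt, n, alpha = 0.90, 0.85, 200, 0.05

    # Critical value: the largest c with P(X <= c | p = x) <= alpha.
    c = int(binom.ppf(alpha, n, x))
    if binom.cdf(c, n, x) > alpha:
        c -= 1

    # Power = P(X <= c | p = x'): the chance of detecting the drop.
    power = binom.cdf(c, n, x_alt)
    print(f"reject if successes <= {c}; power = {power:.3f}")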

(Jul 25 '13 at 20:33) digdug