|
I have two binary classifiers, call them A and B. I measure their performance on a test set T by computing the area under the ROC curve (call these AUC_A and AUC_B respectively). How can I compare the performance of these two classifiers on T in a statistically sensible manner? Currently, I bootstrap the test set T, compute the AUC for each bootstrap sample, and estimate confidence intervals for AUC_A and AUC_B. I can look at the intervals and see whether they overlap. But that's not enough. I would like to test the null hypothesis that there is no difference between the means/medians of the distributions these AUC values come from. I think something like a permutation test is needed, but I can't figure out how to mix the rankings of different classifiers. Does anyone have an idea? |
|
This is a difficult question to answer because the mathematics of the underlying sampling is tricky. Bootstrap samples (or random subsamples, or anything else we could do) are not really independent, yet most statistical tests assume independence. Even with this limitation, many papers use paired t-tests to decide whether a difference between classifiers is significant. The assumptions of the t-test, and how they might be violated, are discussed somewhat in this paper.

The non-parametric alternative to the paired t-test is the Wilcoxon signed-rank test. It does not assume normality as the t-test does, but I believe it still makes assumptions about independence. I'm sure more details are available on Wikipedia.

You could also consider comparing the ROC curves themselves. There has been at least some research on constructing confidence intervals around ROC curves. See, for example, this paper.

I'm not sure that a test can ever be carried out in a statistically rigorous way, simply because of the complicated dependence between the samples, but hopefully these references give you an idea of how to start investigating the difference. The "classic" reference on statistical comparisons among classifiers, at the moment, is this paper, if you have the energy to read it.

I actually had the energy to read the related sections of Demsar's paper. Thanks for the Sunday reading recommendation! I think I'll go with the Wilcoxon signed-rank test or just the sign test on bootstrap samples. But please refer to the comments under Alexandre Passos' answer; I still have doubts about the proper sampling method.
(Jul 04 '10 at 19:05)
Amaç Herdağdelen
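For anyone who wants to try the sign test mentioned above, here is a minimal sketch in plain Python. The function name `sign_test` and the AUC values are made up for illustration. It assumes you already have paired AUC values for A and B (one pair per resample or split), and it inherits the independence caveat discussed in the answer, so treat the p-value as a rough indication only.

```python
from math import comb

def sign_test(auc_a, auc_b):
    """Two-sided sign test on paired AUC values; tied pairs are dropped."""
    diffs = [a - b for a, b in zip(auc_a, auc_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0  # no informative pairs
    k = sum(d > 0 for d in diffs)  # pairs where A beats B
    # Under H0 the number of positive differences is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired AUC values, one pair per resample.
auc_a = [0.91, 0.88, 0.90, 0.93, 0.89, 0.92, 0.90, 0.91]
auc_b = [0.89, 0.87, 0.88, 0.90, 0.88, 0.90, 0.89, 0.90]
print(sign_test(auc_a, auc_b))  # 0.0078125 (A wins all 8 pairs)
```

The sign test only uses the direction of each difference, so it is less powerful than the Wilcoxon signed-rank test but makes fewer assumptions about the shape of the differences.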
|
|
The following paper works out an asymptotic statistical test for comparing two (paired) AUC statistics:

Wieand S, Gail MH, James BR, James KL (1989). A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 76:585-92.

It would be a great service to the community for someone to code this up and test it, so that we could all use it.

Thanks for the reference.
(Jul 09 '10 at 08:04)
Amaç Herdağdelen
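To be clear, the sketch below is NOT the Wieand et al. statistic. It is the test of DeLong, DeLong and Clarke-Pearson (1988), a different asymptotic test for the same paired setting (both classifiers scored on the same test items), swapped in here because it is short enough to write out. The function names are my own, labels are assumed to be 0/1 with at least two items per class, and the code has not been validated against a reference implementation.

```python
from math import erfc, sqrt
from statistics import variance

def _psi(x, y):
    # 1 if the positive outranks the negative, 0.5 on a tie, 0 otherwise.
    return 1.0 if x > y else 0.5 if x == y else 0.0

def _components(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    v10 = [sum(_psi(p, q) for q in neg) / len(neg) for p in pos]  # per positive
    v01 = [sum(_psi(p, q) for p in pos) / len(pos) for q in neg]  # per negative
    return v10, v01

def delong_test(scores_a, scores_b, labels):
    """Return (AUC_A, AUC_B, z, two-sided p) for paired scores on one test set."""
    a10, a01 = _components(scores_a, labels)
    b10, b01 = _components(scores_b, labels)
    m, n = len(a10), len(a01)
    auc_a, auc_b = sum(a10) / m, sum(b10) / m
    # Variance of the AUC difference from the paired structural components.
    d10 = [a - b for a, b in zip(a10, b10)]
    d01 = [a - b for a, b in zip(a01, b01)]
    var = variance(d10) / m + variance(d01) / n
    if var <= 0:
        raise ValueError("zero variance estimate; the test is undefined")
    z = (auc_a - auc_b) / sqrt(var)
    return auc_a, auc_b, z, erfc(abs(z) / sqrt(2))

# Hypothetical scores for two classifiers on the same eight test items.
labels   = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
scores_b = [0.6, 0.9, 0.4, 0.5, 0.7, 0.3, 0.8, 0.2]
print(delong_test(scores_a, scores_b, labels))
```

The test is asymptotic, so with a test set as small as this toy one the p-value should not be trusted; the example only shows the mechanics.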
|
|
Here is a good paper on this subject:

Rather than bootstrapping the test set, you should use different test sets. The way I do it is to create ~10 random train/test splits and evaluate both classifiers on each of them, then use the mean and standard deviation of the results to determine relative efficiency.

Thanks for the reference, I'll check it out. I don't want to do a k-fold cross-validation because I am already given the train and test sets. I mean, I can do whatever I want in the training phase, but there is a fixed test set on which I want to compare the performance of the classifiers.
(Jul 04 '10 at 12:05)
Amaç Herdağdelen
|
|
Since you're using the bootstrap, can't you just estimate the mean/median difference in AUC between the classifiers by bootstrap and then compute the bootstrapped 95% confidence interval for it? Or you can look at the bootstrapped probability that one AUC is bigger than the other, and vice versa, and accept or reject based on a p-value.

I could compute the median/mean difference of AUC values by bootstrap and compute a CI for it as you suggest. Then I could do a t-test to see if the difference is significantly different from 0. But would that be valid? My null hypothesis is that there is no difference (i.e., the mean/median is the same), so I should be computing my statistics under that assumption. I fear that when I compute AUC by bootstrap I'm violating that assumption. Do you have an idea about that?
(Jul 04 '10 at 16:16)
Amaç Herdağdelen
I thought the whole point of the bootstrap is that you don't have to use standard tests, because they're usually based on false parametric assumptions. From what I recall from reading Efron's book, you can just compute the distribution of the quantity you're interested in by bootstrap and derive whatever you want from it, including a p-value for it being larger than 0 (different from 0 makes no sense, I think, because 0 has no measure, hence no probability).
(Jul 04 '10 at 16:22)
Alexandre Passos ♦
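A minimal sketch of what this suggestion could look like in plain Python, assuming both classifiers are scored on the same test items so each bootstrap resample is applied to A and B together. The names `auc` and `bootstrap_auc_diff` and the toy data are made up for illustration. Note that it resamples from the observed data, not from a distribution in which the null hypothesis holds, so the fraction of positive differences is a descriptive quantity and not a null-hypothesis p-value.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_diff(scores_a, scores_b, labels, n_boot=2000, seed=0):
    """Percentile 95% CI for AUC_A - AUC_B and the fraction of resamples with A > B."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # same items for A and B
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample lacks one class, AUC undefined
        a = auc([scores_a[i] for i in idx], y)
        b = auc([scores_b[i] for i in idx], y)
        diffs.append(a - b)
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    frac_a_better = sum(d > 0 for d in diffs) / n_boot
    return lo, hi, frac_a_better

# Hypothetical scores on a fixed test set of eight items.
labels   = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
scores_b = [0.6, 0.9, 0.4, 0.5, 0.7, 0.3, 0.8, 0.2]
print(bootstrap_auc_diff(scores_a, scores_b, labels))
```

Resampling the items jointly keeps the pairing between the two classifiers, which is why the interval is computed on the difference and not on each AUC separately.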
In Efron's book, to test the equality of means of two unknown distributions F and G, the observed samples are translated so that each has the mean of the combined sample. In other words, let's say f=f_1,...,f_n and g=g_1,...,g_n are the samples we observe for F and G, with means m(f) and m(g). Also call the mean of the combined sample of f and g m(f+g). Then in Efron's example, the bootstrap samples are NOT drawn from f and g but from the samples
f' = f_1 - m(f) + m(f+g), ..., f_n - m(f) + m(f+g) and
g' = g_1 - m(g) + m(f+g), ..., g_n - m(g) + m(f+g)
respectively. The rationale is that since equality of the means is our null hypothesis, the null distribution of our test statistic must reflect this assumption. If we use the original samples f and g for our bootstrap, then since their means are already different, the test statistic won't have the distribution it should have under the null. For this example, I understand the argument. But in my case, I have rankings of a set of items. How a null hypothesis of equal AUC values translates into the sampling of these items, I don't know.
(Jul 04 '10 at 19:16)
Amaç Herdağdelen
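The translation step described in this comment can be sketched as follows for the plain two-sample means case. This is a simplified illustration, not the book's exact algorithm: it uses the raw difference of means as the test statistic and a two-sided comparison, and the function name and data are made up. It does not address the open question of how to do the same for AUC.

```python
import random
from statistics import mean

def bootstrap_mean_test(f, g, n_boot=5000, seed=0):
    """Bootstrap test of equal means, resampling from samples shifted to a common mean."""
    rng = random.Random(seed)
    m_f, m_g, m_all = mean(f), mean(g), mean(list(f) + list(g))
    # Translate both samples so that each has the combined mean (H0 holds).
    f_shift = [x - m_f + m_all for x in f]
    g_shift = [x - m_g + m_all for x in g]
    observed = m_f - m_g
    extreme = 0
    for _ in range(n_boot):
        fb = rng.choices(f_shift, k=len(f_shift))
        gb = rng.choices(g_shift, k=len(g_shift))
        if abs(mean(fb) - mean(gb)) >= abs(observed):
            extreme += 1
    # The +1 terms keep the estimated p-value away from exactly zero.
    return observed, (extreme + 1) / (n_boot + 1)

# Hypothetical samples.
f = [0.82, 0.85, 0.80, 0.88, 0.84, 0.83]
g = [0.79, 0.81, 0.78, 0.84, 0.80, 0.82]
print(bootstrap_mean_test(f, g))
```

Because the resamples come from f' and g', the bootstrap differences are centered on zero, which is what makes them usable as a null distribution for the observed difference.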
Yes, that makes sense. Should I delete my answer, to not confuse newcomers? Or would you rather edit it?
(Jul 04 '10 at 19:21)
Alexandre Passos ♦
I think if I was a newcomer I'd like to see your answer. I still have hope in this direction :)
(Jul 04 '10 at 19:33)
Amaç Herdağdelen
|