I am training a set of classifiers on a fixed training set and predicting on a fixed test set (a common scenario in supervised learning). I now want to test whether the difference in performance of the different classifiers, as measured by accuracy and F1 (with respect to three classes), is statistically significant. It seems to me that something like bootstrapping in conjunction with Kruskal-Wallis might be the way to go, but I don't really understand the details of how to set this up, or whether it is a valid approach at all.

Does anyone have advice on what the proper procedure for hypothesis testing would be in this scenario? What would the null hypothesis be, and what would the proper sampling procedure be? One complicating factor is that I am predicting sequences, so the predictions on which I am measuring accuracy/F1 are not really independent.

asked Sep 27 '10 at 06:17


Oscar Täckström

edited Sep 27 '10 at 06:20

What do you mean by F1 with respect to three classes? F1 is an effectiveness measure for binary classification.

(Aug 26 '11 at 18:48) Dave Lewis

Micro/macro-averaged F1 over the three classes.

(Aug 27 '11 at 12:46) Oscar Täckström

One Answer:

Without any sort of iid assumption I don't think you can do anything in terms of testing the performance. The way to do it with the bootstrap is: sample with replacement from your test set, reduce everything to one number (say, total accuracy, or average F1 over the three classes), and compute the frequency with which the number for one classifier is bigger than the number for the other, as predicted under the null hypothesis; use that frequency as a p-value. You don't technically need to run a separate test after a bootstrap, because bootstrapping already gives you the distribution.
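A minimal sketch of that paired bootstrap, using toy stand-ins y_true, pred_a and pred_b for the fixed test set and two classifiers' predictions (for sequence data you would resample whole sequences rather than individual items, as noted in the question):

```python
# Paired bootstrap comparison of two classifiers on a fixed test set.
# y_true, pred_a, pred_b are toy placeholders for the real data.
import numpy as np
from sklearn.metrics import accuracy_score  # or f1_score(..., average="macro")

rng = np.random.default_rng(0)
y_true = np.array([0, 1, 2] * 100)                        # three classes, toy labels
pred_a = np.where(rng.random(300) < 0.80, y_true, (y_true + 1) % 3)
pred_b = np.where(rng.random(300) < 0.72, y_true, (y_true + 1) % 3)

n, n_boot = len(y_true), 10_000
wins_for_b = 0
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)                      # resample test items with replacement
    score_a = accuracy_score(y_true[idx], pred_a[idx])
    score_b = accuracy_score(y_true[idx], pred_b[idx])
    wins_for_b += score_b >= score_a                      # the event predicted under the null

p_value = wins_for_b / n_boot                             # small p => A reliably beats B
print(p_value)
```

For the sequence case, idx would index whole sequences instead of individual items, and the scores would be computed over the items of the resampled sequences.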

But I think the simplest thing you can do, for accuracy, is to use a test-set bound, as explained in John Langford's tutorial on prediction theory for classification. I assume that even if your individual examples are not iid, you can at least divide them into subsets (e.g., whole sequences) that you can consider iid without too much shame; otherwise you can't even do the bootstrap properly.
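As a rough sketch (assuming iid test examples; as far as I remember, the tutorial's test-set bound amounts to inverting the binomial tail, i.e. a Clopper-Pearson upper limit):

```python
# One-sided upper confidence bound on the true error rate from test-set error,
# via inversion of the binomial tail (Clopper-Pearson upper limit).
from scipy.stats import beta

def error_upper_bound(n_errors, n_test, delta=0.05):
    """Largest true error rate consistent, at confidence 1 - delta, with
    observing n_errors mistakes on n_test iid test examples."""
    if n_errors >= n_test:
        return 1.0
    return float(beta.ppf(1.0 - delta, n_errors + 1, n_test - n_errors))

# e.g. 130 errors on 1000 test examples -> bound a bit under 0.15
print(error_upper_bound(130, 1000))
```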

answered Sep 27 '10 at 07:12


Alexandre Passos ♦


Thanks for the quick response. Sampling whole sequences with replacement seems to me to be a valid approach with respect to the iid assumption. I think I see how to do this for pairs of classifiers, but what is the proper way to generalize it to multiple classifiers? I want to know both whether all of my classifiers perform better than a common baseline and whether the best-performing one is significantly better than the second best. What test does one typically perform when publishing these sorts of results?

Also, when doing a bootstrap, can I stop worrying about the variances of the classifiers on the resampled ("ghost") samples? With a Wilcoxon test, as I understand it, one has to assume that the variance is the same for each of the classifiers.

(Sep 27 '10 at 07:24) Oscar Täckström

The whole point of bootstrap is that you get to actually measure the quantity you care about, so you don't have to make as many modeling assumptions (instead you use a lot of computational power).

(1) To see if all your classifiers perform better than a baseline, just count the fraction of bootstrap samples in which this happens; one minus that fraction is your p-value.

(2) Seeing whether the best-performing classifier is better than the second best is similar: compute the fraction of bootstrap samples in which the second is better than or equal to the first. That's a p-value (see the sketch below).

etc
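A sketch of both counts in one loop, with toy stand-ins for the question's setup (y_true, a baseline, and a list preds of per-classifier prediction arrays; macro-F1 would plug in the same way as accuracy). Here "best" and "second best" are fixed by their scores on the full test set before resampling:

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y_true = np.array([0, 1, 2] * 100)                               # toy labels
noisy = lambda p: np.where(rng.random(y_true.size) < p, y_true, (y_true + 1) % 3)
baseline_pred = noisy(0.60)
preds = [noisy(0.78), noisy(0.82), noisy(0.70)]                  # three hypothetical classifiers

# Fix "best" and "second best" once, by score on the full test set.
full_scores = [accuracy_score(y_true, p) for p in preds]
best_i, second_i = np.argsort(full_scores)[::-1][:2]

n, n_boot = len(y_true), 10_000
all_beat_baseline = 0
second_ge_best = 0
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)
    base = accuracy_score(y_true[idx], baseline_pred[idx])
    scores = [accuracy_score(y_true[idx], p[idx]) for p in preds]
    all_beat_baseline += all(s > base for s in scores)           # event for test (1)
    second_ge_best += scores[second_i] >= scores[best_i]         # null event for test (2)

p_all_vs_baseline = 1.0 - all_beat_baseline / n_boot             # test (1)
p_best_vs_second = second_ge_best / n_boot                       # test (2)
print(p_all_vs_baseline, p_best_vs_second)
```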

(Sep 27 '10 at 07:27) Alexandre Passos ♦

This makes sense now. The only thing that remains, then, is the number of bootstrap samples to generate. Are there theoretical arguments for using a specific number (I guess it should depend on the size of the test set)?

(Sep 27 '10 at 07:35) Oscar Täckström

Yes, there is some work on the bootstrap variance, but I don't remember most of it, so I'd say just do as many passes as you can, or ask a question on this site to see if anyone has the answer. This page http://www.stata.com/support/faqs/stat/reps.html and this paper http://ideas.repec.org/p/qed/wpaper/1036.html suggest some procedures, but I have never tried them.

(Sep 27 '10 at 07:40) Alexandre Passos ♦

OK, I'll probably go with "the more the merrier" then. Thanks for the pointers.

(Sep 27 '10 at 07:45) Oscar Täckström
