|
Here is my problem. I built a multiclass classifier (using libsvm) for over 100 categories. The labeled data is not very accurate, but I am obtaining reasonable performance (~90%) with the classic train/test split on my labeled data. The problem is that the 'real' data I need to run the classifier on probably has a very different category distribution: in my labeled data every category is about the same size, which is not the case for the real data, and it is practically impossible to reflect the real distribution in the labeled set because the categories are very unbalanced. I need to demonstrate that the classifier performs well on the 'real' data, not just on my labeled data.

I am thinking of taking a small random sample of the 'real' data and evaluating it manually, possibly more than once (to compare runs of the classifier with different training data, different configurations, etc.). My question: how do I choose a random sample size and determine margins of error for this multiclass case (as opposed to the typical binary case), so that 1) I can state how confident I am about one particular manual evaluation run, and 2) I can determine whether the performance difference between two classifier output sets is significant?

Disclaimer: this is my first post here, and as you can tell I am a practitioner, not a theoretician.
|
1) In your place, I would aim for a sample large enough to contain multiple items from the rarest categories. How many is enough from one category? 30? 100? You want enough to feel that you have adequate test coverage of the category.

2) Are you sure you want a random sample of the real data? Or would it be better to use an error measure that is not dominated by the class distribution (the way accuracy is), combined with oversampling the rare classes? It may be better to look at the per-class accuracy for all categories. You can boil this down to one number, if you like, by taking the average of the per-class accuracies. (Aside: this is equivalent to macro-averaged recall, if that helps put this class-averaged accuracy in context.) Another interesting variant is to omit the most common category, if that category is uninteresting and mainly a nuisance.

3) You can use bootstrap evaluation to estimate the variance of your evaluation, and from that determine how confident you should be. Normally one would bootstrap the training data and look at the (in)stability of the model estimation process, but I don't see why you couldn't apply the same process to the test data to determine the sensitivity of the results to the test set composition (see the sketch below). The book "An Introduction to the Bootstrap" has the general information you need. I think (but could be wrong) that the best bootstrap method for error estimation is described in "Improvements on Cross-Validation: The .632+ Bootstrap Method" (Efron and Tibshirani, 1997). There is a short description of this method in The Elements of Statistical Learning (p. 219 in my 1st ed. copy).
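A minimal sketch of points 2 and 3, assuming a manually labeled sample of the 'real' data and the predictions from two classifier runs. The variable names, the toy data, and the use of scikit-learn's `recall_score` are my own illustrative choices, not something prescribed by the question; treat this as one way to get a bootstrap confidence interval, not the only way.

```python
import numpy as np
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Toy stand-ins for a manually labeled sample and two classifier runs.
n, n_classes = 500, 5
y_true = rng.integers(0, n_classes, size=n)
y_pred_a = np.where(rng.random(n) < 0.80, y_true, rng.integers(0, n_classes, size=n))
y_pred_b = np.where(rng.random(n) < 0.75, y_true, rng.integers(0, n_classes, size=n))

def bootstrap_macro_recall(y_true, y_pred, n_boot=2000):
    """Resample the labeled sample with replacement and recompute
    macro-averaged recall (= class-averaged accuracy) each time."""
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                     # bootstrap resample
        scores[b] = recall_score(y_true[idx], y_pred[idx],
                                 average="macro", zero_division=0)
    return scores

# 1) Confidence about one evaluation run: a percentile interval.
scores_a = bootstrap_macro_recall(y_true, y_pred_a)
lo, hi = np.percentile(scores_a, [2.5, 97.5])
print("run A macro recall: %.3f, 95%% CI [%.3f, %.3f]" % (scores_a.mean(), lo, hi))

# 2) Comparing two runs: resample the *same* indices for both, so the
#    difference reflects the classifiers rather than the resampling noise.
def paired_bootstrap_diff(y_true, y_pred_a, y_pred_b, n_boot=2000):
    n = len(y_true)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs[b] = (recall_score(y_true[idx], y_pred_a[idx], average="macro", zero_division=0)
                    - recall_score(y_true[idx], y_pred_b[idx], average="macro", zero_division=0))
    return diffs

diffs = paired_bootstrap_diff(y_true, y_pred_a, y_pred_b)
lo, hi = np.percentile(diffs, [2.5, 97.5])
print("A - B difference: %.3f, 95%% CI [%.3f, %.3f]" % (diffs.mean(), lo, hi))
# If the interval for the difference excludes 0, the gap between the runs
# is unlikely to be an artifact of which items ended up in the sample.
```

A practical side effect: you can run this on candidate sample sizes (subsample your labeled data to, say, 200, 500, 1000 items) and see how wide the interval gets, which gives a rough answer to the "how big should my manual sample be" part of the question.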
Do you know the distribution of the "real" data, or can you reasonably estimate it? You can usually incorporate the distribution information back into the training set to make the classifier more accurate on the actual distribution. If you are using a weighted learning algorithm, adjusting the weights on the training instances to match the real distribution helps. Alternatively, if you are learning a probabilistic classifier (like logistic regression), you can add an odds ratio to the output that represents the difference between the training and real distributions (equivalently, it encodes a different prior). If you are using neither a weighted learner nor a probabilistic model, you can still resample your training data to roughly match the real distribution. A sketch of the first two options follows below.
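A minimal sketch of the weighting and prior-correction ideas, assuming you have an estimate of the real-world class priors. The three-class toy data, the priors, and the use of scikit-learn's LogisticRegression are illustrative assumptions standing in for whatever estimate and learner you actually have.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy balanced labeled set with 3 classes (stand-in for the labeled data).
n_per_class = 200
X_train = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, 2)) for c in range(3)])
y_train = np.repeat(np.arange(3), n_per_class)

train_prior = np.bincount(y_train) / len(y_train)   # roughly [1/3, 1/3, 1/3]
real_prior = np.array([0.70, 0.25, 0.05])           # hypothetical estimate of the real data

# Option 1: weighted learning -- weight each training instance by how
# under/over-represented its class is relative to the real distribution.
sample_weight = (real_prior / train_prior)[y_train]
clf_weighted = LogisticRegression(max_iter=1000).fit(X_train, y_train,
                                                     sample_weight=sample_weight)

# Option 2: prior correction on a probabilistic classifier -- train on the
# balanced data, then rescale the predicted probabilities by the prior ratio
# and renormalize (the multiclass analogue of the odds-ratio adjustment).
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
X_new = rng.normal(size=(5, 2))                     # stand-in for 'real' data
p = clf.predict_proba(X_new) * (real_prior / train_prior)
p /= p.sum(axis=1, keepdims=True)
print(p.argmax(axis=1))                             # prior-corrected predictions
```

Both options leave the labeled data untouched; they only change how much each class counts during training or how the output probabilities are interpreted, which is why a rough estimate of the real priors is enough to start with.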
I don't know the real data distribution and it's not easy to estimate. Part of the reason is that the category set is in flux (merges and splits, additions and deletions). I also don't have much time. I'm planning to try techniques like those you suggest later, but for the time being I was hoping to determine the smallest sample whose manual evaluation would give me a reasonable assessment of how the classifier is doing.