
Let's say I have two classifiers (C1, C2). These two classifiers are trained on two different datasets (d1, d2) sampled from the same dataset (D) for a specific task (d1 and d2 have the same number of instances, and the same number of instances in each class). Can we compare the scores these two classifiers produce when classifying a particular instance, assuming the scores come from a Naive Bayes classifier, or from an SVM via the distance of the instance to the hyperplane or via a loss function? In short, can I use these metrics as confidences? If not, are there any methods or algorithms for this purpose, apart from ensemble algorithms?

Edit: I was not clear in my post that classifiers C1 and C2 use the same algorithm. For instance, C1 and C2 are both SVMs.

Also my question arose from the following thought experiment:

Let's say we have two Naive Bayes classifiers that can learn incrementally, and we have a stream of data (generated in real time by a pattern-matching application). Each instance has an instance number that increases one by one. We train C1 with the odd-numbered instances and C2 with the even-numbered instances. We have no idea of the distribution of the dataset, because it is streaming and can change at any time. Assume they then classify a third dataset. Is it reasonable to say that the one with the higher confidence classifies better?
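A minimal sketch of this setup (illustrative only: it assumes scikit-learn's GaussianNB as the incremental Naive Bayes classifier and uses a synthetic stream; all names and numbers are made up):

    # Two incremental NB classifiers, one fed the odd-numbered instances,
    # the other the even-numbered ones, then compared by max posterior.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    classes = np.array([0, 1])
    c1, c2 = GaussianNB(), GaussianNB()   # C1: odd instances, C2: even instances

    for i in range(1, 10001):             # simulated stream, numbered from 1
        y = int(rng.integers(0, 2))
        x = rng.normal(loc=y, scale=1.0, size=(1, 5))
        (c1 if i % 2 == 1 else c2).partial_fit(x, [y], classes=classes)

    x_new = rng.normal(size=(1, 5))       # a fresh instance to classify
    p1 = c1.predict_proba(x_new).max()    # C1's "confidence"
    p2 = c2.predict_proba(x_new).max()    # C2's "confidence"
    print(f"C1: {p1:.3f}  C2: {p2:.3f}")  # does the larger value mean the better classifier?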

asked Sep 15 '10 at 08:23


cfg

edited Sep 16 '10 at 08:32

Technically, the probability score predicted by an NB model does not give you an indication of confidence: that score is a point estimate, and any confidence interval around it could be large or small.

The distance from an SVM margin is a little fuzzier. In some sense, further distance does suggest higher confidence.

Regardless, I think it is useful to keep in mind that interpreting scalar predictions as confidences is tricky and often incorrect.

(Sep 16 '10 at 12:33) Art Munson

I agree with you, Art, and this is one of the reasons why I asked this question: I feel uncomfortable with the ideas I presented here, except the ensemble ones, because voting-based confidence in ensemble learning is much more concrete to me. Citing the "Bayesian inference" article from Wikipedia:

With enough evidence, the degree of confidence should become either very high or very low. Thus, proponents of Bayesian inference say that it can be used to discriminate between conflicting hypotheses: hypotheses with very high support should be accepted as true, and those with very low support should be rejected as false. However, detractors say that this inference method might be biased due to initial notions that one holds before any evidence is ever collected. (This is a form of inductive bias.)

(Sep 17 '10 at 02:53) cfg

4 Answers:

You should probably calibrate the scores from the different classifiers, perhaps in addition to the techniques suggested by Alex Passos. See Niculescu-Mizil and Caruana, Predicting Good Probabilities with Supervised Learning.
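For example, a hedged sketch of the calibration idea with scikit-learn (Platt scaling via CalibratedClassifierCV on synthetic data; this is an illustration, not the paper's exact protocol):

    # Wrap each classifier in a calibrator so that both output comparable
    # probabilities instead of raw, classifier-specific scores.
    from sklearn.svm import LinearSVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=2000, random_state=0)
    d1_X, d1_y = X[0::2], y[0::2]   # two same-size samples from the same pool
    d2_X, d2_y = X[1::2], y[1::2]

    # method="sigmoid" is Platt scaling; "isotonic" is the non-parametric option.
    c1 = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(d1_X, d1_y)
    c2 = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(d2_X, d2_y)

    x_new = X[:1]
    print(c1.predict_proba(x_new), c2.predict_proba(x_new))  # now on a common scale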

answered Sep 15 '10 at 13:03


Art Munson

Unfortunately there's also, I think, a credibility problem to tackle.

As a thought experiment, suppose the random draws from your distribution D are very representative of D for training set 1, and very unrepresentative of D for training set 2 (which happens with some probability, depending on the number of observations in each training set). What would you expect the loss to look like on a hold-out set? In particular, if both classifiers are reasonable choices for the problem at hand, you would expect the second to generalize poorly, because it was learnt from a set of outliers.

Now, you can bound the probability that this situation happens, and it is small if both training sets are large. Likewise, you can run tests (e.g. a t-test, ANOVA, etc.) to decide, at some confidence level, whether both training sets were drawn from the same distribution. On the other hand, how do you get around the issue that, presented with the results, everybody will ask the obvious question: why didn't you use the same data for both classifiers if you wanted to make empirical claims about their effectiveness on a particular problem?
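As an illustration, such a check might look like the following rough sketch (per-feature two-sample tests on synthetic data; the helper name is made up, and running one test per feature raises the usual multiple-testing caveats):

    import numpy as np
    from scipy.stats import ks_2samp, ttest_ind

    def same_distribution_check(d1_X, d2_X, alpha=0.05):
        # Fraction of features where a KS test or Welch t-test rejects the
        # hypothesis that d1 and d2 come from the same distribution.
        rejections = 0
        for j in range(d1_X.shape[1]):
            ks_p = ks_2samp(d1_X[:, j], d2_X[:, j]).pvalue
            t_p = ttest_ind(d1_X[:, j], d2_X[:, j], equal_var=False).pvalue
            if min(ks_p, t_p) < alpha:
                rejections += 1
        return rejections / d1_X.shape[1]

    rng = np.random.default_rng(0)
    d1_X = rng.normal(size=(500, 10))
    d2_X = rng.normal(size=(500, 10))
    print(same_distribution_check(d1_X, d2_X))  # close to 0 when the sets match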

Furthermore, if the performance of your two classifiers differs only marginally (say by 5% or even 10% in terms of the number of wrongly classified points), then, unless you used a great deal of data and ran statistical tests to show that the training sets were equally representative of D with very high confidence, I think it would be extremely hard to claim convincingly that any difference in performance was real, i.e. due to the choice of classifier rather than to the particular dataset.

answered Sep 16 '10 at 05:01


Bob Durrant

Different algorithms will give you different confidence metrics. What you're describing sounds like co-regularization. For Naive Bayes / logistic regression you can use the posterior probability of the labels; for purely discriminative classifiers you'll have to do more work. See Sindhwani et al., A Co-regularization Approach to Semi-supervised Learning with Multiple Views. Also see co-training.

Even if you get the probabilities, you will need a distance measure between them. See Ganchev et al., Multi-view Learning over Structured and Non-identical Outputs for a discussion of the performance of different distance measures.
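For instance, a minimal sketch of two such distance measures applied to the posteriors of two classifiers on the same instance (the posteriors below are made-up numbers):

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        # KL(p || q), with a small epsilon to avoid log(0)
        p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def total_variation(p, q):
        return 0.5 * float(np.abs(np.asarray(p, float) - np.asarray(q, float)).sum())

    p1 = [0.7, 0.2, 0.1]   # hypothetical posterior from classifier C1
    p2 = [0.5, 0.3, 0.2]   # hypothetical posterior from classifier C2
    print(kl_divergence(p1, p2), total_variation(p1, p2))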

answered Sep 15 '10 at 10:55


Alexandre Passos ♦

edited Sep 15 '10 at 10:57

@Alexandre Thanks for the papers. I knew a little about co-regularization and co-training (and I've read some papers about multi-view learning previously), but I will read these papers too.

"Different algorithms will give you different confidence metrics."

Yep, I know this, but I'm using the same classifier (algorithm) trained from different datasets.

(Sep 16 '10 at 07:18) cfg

@cfg This is a bit different from how I read your post. I thought you meant (for example) NB vs. SVM AND both algorithms trained on different data. If you want to say something about the performance of the same algorithm trained on different datasets D_i, where all the D_i have distribution D, then I don't see any problem. Reporting the average loss and the loss variance (e.g. with a plot with error bars) over a sequence of runs is a perfectly sensible, and accepted, way of doing this (with a built-in implicit notion of confidence); see the sketch below.
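For example, a rough sketch of that kind of reporting (synthetic data, LinearSVC as a stand-in, 0/1 loss on a held-out set):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=5000, random_state=0)
    losses = []
    for seed in range(20):  # 20 independent draws D_i of the training data
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=500, random_state=seed)
        clf = LinearSVC().fit(X_tr, y_tr)
        losses.append(1.0 - clf.score(X_te, y_te))  # 0/1 loss on held-out data

    mean, std = np.mean(losses), np.std(losses)
    print(f"loss = {mean:.3f} +/- {std:.3f}")
    plt.errorbar([0], [mean], yerr=[std], fmt="o")  # the error-bar plot
    plt.ylabel("held-out 0/1 loss")
    plt.show()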

(Sep 16 '10 at 07:59) Bob Durrant

@Bob Yes, you are right. I wasn't clear enough in my post; I'm going to edit it. I was writing you a long comment, but then I saw your comment and dropped it.

But what if we can't say anything about the distribution? (Or if the datasets aren't the same but only similar, in the sense of statistical significance under, e.g., ANOVA.) Can we still make a similar claim? Or can we build our classifiers so that their confidences are comparable?

(Sep 16 '10 at 08:15) cfg

Okay - I think that we're now into the realm of hard questions! If the data all come from the same distribution, then I think that you could get what you're after using a PAC-Bayes approach. Here's a nice tutorial:

http://videolectures.net/aop07_shawe_taylor_pba/

If the distributions are different but fairly similar, then you could try the same thing, but you'd have essentially the same problem I flagged in my earlier answer (only now it is the distributions over the data points that are incomparable, rather than the classifiers).

The following are first, not very deep, thoughts, and may not be at all helpful. One possible approach is to look for similarities in both the distributions and the learned classifiers together. For NB this would mean looking at the distance between the class priors and the diagonal precision matrices learned in each case; for SVM I imagine you'd look for structural similarities (margin, orientation of the normal to the hyperplane). BUT without knowing the data distribution(s) I don't see immediately how this would lead to the sort of confidence bound it looks like you're after. In fact, I think that to prove anything non-trivial you'd have to make some fairly strong assumptions about what the data distribution can look like. Furthermore, you need to choose some similarity measure in each case, which probably isn't as straightforward as it sounds (KL divergence or similar for NB, perhaps? Dot product between normals for SVM?). For kernel SVMs you might be opening a can of worms: the kernel implemented from the training set is essentially unique (i.e. different training data -> different kernels), so how do you relate linear boundaries in two different Hilbert spaces (probably requires a transformation of the unknown feature-space basis)?
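For the linear SVM case, a rough sketch of that kind of structural comparison (cosine similarity between the two normals, and 1/||w|| as a crude margin proxy; data and models are placeholders):

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=2000, random_state=0)
    c1 = LinearSVC().fit(X[0::2], y[0::2])   # trained on one half of the data
    c2 = LinearSVC().fit(X[1::2], y[1::2])   # trained on the other half

    w1, w2 = c1.coef_.ravel(), c2.coef_.ravel()
    cosine = float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))
    margin1, margin2 = 1.0 / np.linalg.norm(w1), 1.0 / np.linalg.norm(w2)
    print(f"cosine(normal1, normal2) = {cosine:.3f}")
    print(f"margin proxies: {margin1:.3f} vs {margin2:.3f}")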

In summary: looks tough, might be impossible.

[Now I wait for someone to post a link to where somebody else has done it! ;-)]

(Sep 16 '10 at 10:05) Bob Durrant

I think libSVM can attach probability estimates to its output scores; you have to supply the -b option to generate them. I don't know exactly how it computes them, but it's probably described in one of its papers.
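For reference, the same libsvm probability machinery is exposed in scikit-learn via SVC(probability=True), which (as far as I know) fits a sigmoid on cross-validated decision values; a minimal sketch on synthetic data:

    from sklearn.svm import SVC
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, random_state=0)
    clf = SVC(probability=True).fit(X, y)   # roughly the equivalent of libsvm's -b 1
    print(clf.predict_proba(X[:1]))         # calibrated class probabilities
    print(clf.decision_function(X[:1]))     # raw distance-to-hyperplane score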

answered Sep 16 '10 at 05:47


kpx
