
I've been finding surprisingly little concrete information spelling out how to generate a ROC curve, so I looked into various implementations of ROC curve generation, including this plotroc.py and some code from here:

http://metaoptimize.com/qa/questions/267/code-for-auc-area-under-the-receiver-operating-characteristic-curve

(It turns out it's somewhat similar to generating a CDF. I thought at first that each point on the curve was perhaps the result of running the classifier on a random subsample, or something.)

My first question is in the context of binary classification. Of the implementations that are explicitly designed for binary classification problems, why do they all have the following logic?

In the plotroc.py code:

tp = 0
fp = 0
for decision, label in data:
  if label is positive:
    tp += 1
  else:
    fp += 1
  curve_points.append((fp/neg, tp/pos))

In the Java code:

for (int i = 0; i < examples.size(); i++)
{
    ...
    if(current.label > 0.5)
    {
        //positive example
        tp++;
    }
    else
    {  
        //negative example
        fp++;
    }

In the PHP code:

foreach ($submission as $k) {
    ...
    if($solution[$index] == 1) {
        $count['A'] += 1 ;
    } else {
        $count['B'] += 1 ;
    }

Namely, why are all of these ignoring the decision and counting a "false positive" for every negatively labeled example? Shouldn't "false positives" be incremented only when the label is negative and the decision is positive?

My second question is: where do "scores" or "thresholds" come in? Implementations such as CROC and scikit-learn's metrics.py take scores or thresholds. What do these mean? Are they irrelevant for binary classification? How should I apply scikit-learn's implementation to a binary classification problem?

Thanks in advance for any answers!

asked Jan 21 '11 at 03:32

Yang Zhang


2 Answers:

The points are sorted by the value of the decision function (before thresholding to determine the label). An ROC curve should contain all possible values of the true positive/false positive rates a classifier can obtain. So you actually want to threshold the predictions everywhere; however, since the test set is finite, the curve will have one point per example in the test set. Intuitively, each point in the ROC curve corresponds to a different thresholding of the real-valued decision of the classifier.

Assume you go over the points from most positively classified to most negatively classified. You can start with the decision rule that says "always predict 0" (this has no true positives and no false positives either) and slowly add each point to the positive class, in order. If the classifier is perfect you should add all true positive points before any true negative point, which gives an ROC curve that looks like a capital Gamma (i.e., it climbs very fast, since at zero false positives you pass through all possible values of true positives up to 100%, and then it stays high, because for each value of false positives you can still reach 100% true positives). So it's normal that the code ignores the decision function itself apart from using the scores to sort the points.
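To make that concrete, here is a minimal sketch of the sort-then-sweep procedure described above (the function name, the 1/0 label encoding, and the lack of tie handling are my own assumptions, not taken from any of the implementations quoted in the question):

def roc_points(scores, labels):
    # scores: real-valued classifier outputs, higher means "more positive"
    # labels: ground-truth binary labels, 1 = positive, 0 = negative
    pos = sum(1 for l in labels if l == 1)
    neg = len(labels) - pos
    # Sort examples from most positively to most negatively classified.
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: "always predict 0"
    for score, label in ranked:
        # Lowering the threshold to this score makes this example predicted positive.
        if label == 1:
            tp += 1  # a correctly predicted positive
        else:
            fp += 1  # a negative example now (wrongly) predicted positive
        points.append((fp / neg, tp / pos))
    return points  # list of (false positive rate, true positive rate) pairs

(A fuller implementation would also collapse points that share the same score, so that tied examples move the curve in a single step.)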

answered Jan 21 '11 at 05:00

Alexandre Passos ♦

Your explanation makes sense to me, and it also fully elucidates the score-/threshold-based code, such as scikit-learn's metrics.py, which does seem to count TPs/FPs accurately. Is this also to say that the code snippets I highlighted above, which don't mention scores/thresholds anywhere, are invalid?

(Jan 21 '11 at 14:40) Yang Zhang

No. The Java code starts by sorting according to the decision value (which makes it valid). The PHP code does the same (see the call to array_multisort). I don't know where you got the Python code from, but it probably also sorts before doing that loop.

My point is that if you have the examples/labels sorted by the classifier score then that loop computes a valid ROC curve.

(Jan 21 '11 at 14:46) Alexandre Passos ♦

Thanks, I think I get it now. For posterity: say we start with the following (score, label) data, sorted:

[(.9, 1), (.8, 1), (.6, 0), (.4, 1), (.2, 0), (.1, 0)]

The code would first set the threshold to .9, so at that point only one example is predicted positive. Since that example's label is 1, we get FP=0, TP=1 and output the corresponding point on the curve. The remaining examples fall below the threshold, so they are predicted negative and don't count toward either counter.
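Continuing the sweep through the remaining thresholds (my own worked continuation; here pos = 3 and neg = 3), the full set of (FPR, TPR) points comes out as:

threshold .9 -> (0,   1/3)
threshold .8 -> (0,   2/3)
threshold .6 -> (1/3, 2/3)
threshold .4 -> (1/3, 1)
threshold .2 -> (2/3, 1)
threshold .1 -> (1,   1)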

(Jan 21 '11 at 17:24) Yang Zhang

The false positives should indeed only be added when the label is negative. The first code snippet seems to consider just the labels while not doing anything with the decisions of the model. This is very odd...

While 'scores' is a very general term, I assume it refers to the output of a classifier that gives real values instead of discrete labels (probabilities, for example, or a regression algorithm trained to map to {-1, 1}). The threshold would in that case determine at what value you start calling decisions positive. This is relevant for plotting an ROC curve, because you can vary the threshold to see how the plot of true vs. false positives changes as a function of it. So you actually need to be able to vary how conservative your classifier is to be able to plot the curve at all.
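As a tiny illustration of what a single threshold does (the variable names, the toy data, and the 0.5 cutoff are just placeholders for this sketch), hard decisions and the resulting TP/FP counts at one threshold would look like:

scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]   # real-valued classifier outputs
labels = [1, 1, 0, 1, 0, 0]               # ground truth, 1 = positive, 0 = negative
threshold = 0.5
# Call everything at or above the threshold positive.
predictions = [1 if s >= threshold else 0 for s in scores]
tp = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
fp = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
print(tp, fp)  # at this threshold: 2 true positives, 1 false positive

Sweeping the threshold over all observed scores yields one (FPR, TPR) point per threshold, which is exactly what the ROC curve plots.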

answered Jan 21 '11 at 04:35

Philemon Brakel

Thanks to your and Alexandre's answers, I think I now understand the score-/threshold-based code, but I'm still confused about the code I highlighted above: (1) how can it avoid using scores/thresholds altogether, and (2) why does it calculate FPs/TPs as it does? That several implementations do this prevents me from dismissing them as buggy, and I see the same thing in larger ML code bases (I believe Orange has this: its AUC computation doesn't take scores/thresholds, just the classification outputs, which could have come from any type of trained model).

(Jan 21 '11 at 14:45) Yang Zhang

I think I understand what's going on now. The decisions are not (as I thought) 0s and 1s; they're scores. And it actually is fine for the decision to be completely ignored inside the loop, since the points are sorted in descending score order. Each point on the curve corresponds to a different score threshold above which things are considered positive, so the algorithm is effectively setting the threshold equal to each distinct score in descending order. Hence, on each iteration of the loop, incrementing the FP/TP counters based just on the label is sufficient: everything up to and including the current example is predicted positive, so we're counting the correct/mistaken positives, and everything after it is under the threshold and thus predicted negative, contributing to neither counter.

I wish I could mark both answers as accepted.

(Jan 21 '11 at 17:08) Yang Zhang

@Yang Zhang: According to the documentation, you pass it a list of labels (y) and a list of probabilities of the positive label (probas_), and you get back the x and y coordinates of the ROC curve and the actual probability thresholds used to compute each (x, y) point. What don't you understand?

Their algorithm is a bit more convoluted: first they enumerate all the distinct probabilities of the positive class, in descending order, in the line

thresholds = np.sort(np.unique(probas_))[::-1]

then for each such probability they compute the true positive rate and false positive rate:

for i, t in enumerate(thresholds):
    tpr[i] = np.sum(y[probas_ >= t] == 1) / n_pos
    fpr[i] = np.sum(y[probas_ >= t] == 0) / n_neg

So while the other algorithms are O(n log n) this is O(n²), but this shouldn't matter much.
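In case it helps to see it end to end, here is how the same curve can be obtained with a current scikit-learn install via sklearn.metrics.roc_curve (at the time of this thread the package was still imported as scikits.learn; the data below is just the toy example from the comments above):

from sklearn.metrics import roc_curve, auc

# Toy (label, score) data from the comment above.
y = [1, 1, 0, 1, 0, 0]
probas_ = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]

# fpr/tpr are the x/y coordinates of the ROC curve; thresholds are the
# score cutoffs used to compute each point.
fpr, tpr, thresholds = roc_curve(y, probas_)
print(fpr, tpr, thresholds)
print("AUC:", auc(fpr, tpr))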

(Jan 21 '11 at 17:15) Alexandre Passos ♦

Oops, didn't see your last comment before editing mine. Thanks, everything's clear. I think the biggest source of my confusion was that it was never apparent to me that these algorithms don't take the classifier's 0/1 results; they expect real-valued scores from the classifier, which are then thresholded.

(Jan 21 '11 at 17:26) Yang Zhang