The UCSD Data Mining Contest 2010 is evaluating models using the AUC (area under the receiver operating characteristic curve). Does anyone have code, preferably Python, for computing this score?

asked Jul 01 '10 at 18:35

Joseph Turian ♦♦


10 Answers:

The metrics module of scikit-learn (a Python library for machine learning) has implementations of various classifier performance metrics, including not only the AUC but also the ROC curve itself, the confusion matrix, and the precision-recall curve.

Edit: the documentation has a simple first example showing the ROC plot for a classifier, and another example that combines it with the cross-validation strategies implemented in scikit-learn to plot the mean ROC curve and thereby reduce the variance of the AUC estimate.
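As a minimal sketch of what that looks like (function names follow the current scikit-learn API and may differ in older releases; the labels and scores below are made-up toy data):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

y_true = np.array([0, 0, 1, 1])             # binary ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # classifier scores

# Build the ROC curve explicitly, then integrate it...
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))                    # 0.75

# ...or get the same number in one call.
print(roc_auc_score(y_true, y_score))   # 0.75
```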

answered Jul 02 '10 at 04:35

ogrisel

edited Apr 09 '13 at 08:16

I don't know about Python, but I do recommend the Java code from Jesse Davis and Mark Goadrich. It's nice because it computes the AUC for both ROC graphs and Precision-Recall graphs. Their ICML 2006 paper has an enlightening discussion about the differences and similarities between the two.

See here for code and paper: http://mark.goadrich.com/programs/AUC/

answered Jul 01 '10 at 21:22

Kevin Duh

edited Jul 01 '10 at 21:25

I use the perf software from the KDD Cup, which has AUC and many other metrics built in, and which has had lots of eyeballs on it looking for bugs.

http://kodiak.cs.cornell.edu/kddcup/software.html

answered Jul 02 '10 at 11:22

Paul Mineiro

http://pypi.python.org/pypi/CROC/1.0.59

answered Jul 02 '10 at 12:15

V C

This is not for Python (either), but I have really gotten a lot of use out of the ROCR package for R. It deals nicely with things like cross-validation (so you get ROC curves with configurable error bars) and does precision-recall plots and other things as well.

answered Jul 02 '10 at 03:49

Mikael Huss

I mean, given what function do you want to evaluate it for? As I understand it, the ROC curve is the plot of sensitivity (true positive rate) against 1 − specificity (the false positive rate). If you want the area under it, integrate (pseudocode):

dx = 0.01
auc = 0.0
for false_positive_rate from 0 to 1 step dx:
    auc += whats_my_sensitivity_at(fpr=false_positive_rate) * dx

(Of course, in a practical version dx would be smaller still, and you would use trapezoids rather than the left-hand rectangle rule.)
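If you'd rather avoid numerical integration entirely, the AUC also equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. Here's a dependency-free sketch of that rank-statistic view (the function name is mine, not from any library):

```python
def auc_from_scores(labels, scores):
    """AUC as the Mann-Whitney rank statistic: the fraction of
    (positive, negative) pairs ranked correctly, ties counted as 1/2.
    O(n^2) for clarity; sort-based versions run in O(n log n)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```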

answered Jul 01 '10 at 19:04

sbirch

Here's a snippet from a project I'm currently working on. We're using probability trees, so our ROC is defined by stepping through the possible threshold values; x is the false positive rate and y is the true positive rate, with the points sorted by x.


import numpy as np

def AUC(xdata, ydata):
    """Given sorted x coordinates and y coordinates, return the area
    under the curve they define, using the trapezoidal rule."""
    xdata = np.asarray(xdata, dtype=float)
    ydata = np.asarray(ydata, dtype=float)
    dx = np.diff(xdata)                    # width of each trapezoid
    mid_y = (ydata[:-1] + ydata[1:]) / 2   # average height of each trapezoid
    return np.sum(dx * mid_y)
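One way to sanity-check a trapezoidal helper like this is NumPy's own trapezoidal integrator (the ROC points below are made up for illustration):

```python
import numpy as np

# np.trapezoid in NumPy >= 2.0; the same function was np.trapz before.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

# Hypothetical ROC points, sorted by x (false positive rate).
xdata = [0.0, 0.25, 0.5, 1.0]
ydata = [0.0, 0.5, 0.75, 1.0]

# NumPy applies the same trapezoidal rule, so it should agree with
# any hand-rolled implementation on the same points.
print(trapezoid(ydata, xdata))  # 0.65625
```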

answered Jul 01 '10 at 19:54

Dougal Sutherland

I can offer you the PHP code we use for Kaggle. $submission and $solution should be arrays. Let me know if you have any trouble implementing this.

PS. We'd be happy to host this competition for you in future.

function AUC($submission, $solution) {

    // Sort the predictions in descending order, reordering the
    // solution array to match.
    array_multisort($submission, SORT_NUMERIC, SORT_DESC, $solution);

    // Dump the sorted (prediction, label) pairs for debugging.
    $outFile = "output.csv";
    $fh = fopen($outFile, 'w');
    for ($i = 0; $i < count($submission); $i++) {
        fwrite($fh, $submission[$i] . ', ' . $solution[$i] . "\n");
    }
    fclose($fh);

    // Count positives ('A') and negatives ('B').
    $total = array('A' => 0, 'B' => 0);
    foreach ($solution as $s) {
        if ($s == 1)
            $total['A']++;
        elseif ($s == 0)
            $total['B']++;
    }

    $next_is_same = 0;
    $this_percent['A'] = 0.0;
    $this_percent['B'] = 0.0;
    $area1 = 0.0;
    $count['A'] = 0;
    $count['B'] = 0;
    $index = -1;
    foreach ($submission as $k) {
        $index += 1;
        if ($next_is_same == 0) {
            $last_percent['A'] = $this_percent['A'];
            $last_percent['B'] = $this_percent['B'];
        }

        if ($solution[$index] == 1) {
            $count['A'] += 1;
        } else {
            $count['B'] += 1;
        }
        // If the next predicted value is the same, don't compute the
        // area just yet: tied predictions move along a single segment.
        $next_is_same = 0;
        if ($index < (count($solution) - 1)) {
            if ($submission[$index] == $submission[$index + 1]) {
                $next_is_same = 1;
            }
        }
        if ($next_is_same == 0) {
            // Current true positive rate ('A') and false positive rate ('B').
            $this_percent['A'] = $count['A'] / $total['A'];
            $this_percent['B'] = $count['B'] / $total['B'];

            // Trapezoid between the previous and current ROC points.
            $triangle = ($this_percent['B'] - $last_percent['B']) * ($this_percent['A'] - $last_percent['A']) * 0.5;
            $rectangle = ($this_percent['B'] - $last_percent['B']) * $last_percent['A'];

            $area1 += $rectangle + $triangle;
        }
    }
    return $area1;
}
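For anyone who wants the same threshold-sweeping, tie-aware trapezoid computation in the question's language, here is a rough Python port (my own sketch, not Kaggle's code):

```python
def auc_trapezoid(labels, scores):
    """Sweep score thresholds from high to low, merging tied scores into
    a single ROC segment, and sum trapezoids under the curve."""
    data = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(l for _, l in data)
    n_neg = len(data) - n_pos
    area = 0.0
    tp = fp = 0
    last_tpr = last_fpr = 0.0
    i = 0
    while i < len(data):
        score = data[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(data) and data[i][0] == score:
            if data[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr = tp / n_pos
        fpr = fp / n_neg
        area += (fpr - last_fpr) * (tpr + last_tpr) / 2
        last_tpr, last_fpr = tpr, fpr
    return area

print(auc_trapezoid([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```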

answered Jul 01 '10 at 20:45

Anton Ballus

edited Jul 01 '10 at 20:46


Use CROC from the Python cheese shop (PyPI): http://pypi.python.org/pypi/CROC/

answered Jul 14 '10 at 10:19

DirectedGraph


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.