|
The UCSD Data Mining Contest 2010 is evaluating models using the AUC (area under the receiver operating characteristic curve). Does anyone have code, preferably Python, for computing this score? |
|
The metrics module of scikit-learn (a Python library for machine learning) implements various classifier performance metrics, including the AUC as well as the ROC curve itself, the confusion matrix, and the precision-recall curve. Edit: The documentation has a first simple example showing the ROC plot for a simple classifier, and another example that combines it with the cross-validation strategies implemented in scikit-learn to plot the mean ROC curve and so reduce the variance of the AUC estimate. |
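For anyone who wants to try this quickly, here is a minimal sketch. It assumes a current scikit-learn release where the package is imported as `sklearn` and the functions are named `roc_auc_score`, `roc_curve` and `auc`; older releases may use different names. The labels and scores are made-up toy data.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy data (made up for illustration): true 0/1 labels and classifier scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC computed directly from labels and scores.
direct_auc = roc_auc_score(y_true, y_score)

# Same value via the explicit ROC curve, which is also what you would plot.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
curve_auc = auc(fpr, tpr)

print(direct_auc, curve_auc)
```

Both routes should agree; the second one is handy if you also want to draw the curve.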
|
I don't know about Python, but I do recommend the Java code from Jesse Davis and Mark Goadrich. It's nice because it computes the AUC for both ROC graphs and Precision-Recall graphs. Their ICML 2006 paper has an enlightening discussion about the differences and similarities between the two. See here for code and paper: http://mark.goadrich.com/programs/AUC/ |
|
I use KDD's perf software, which has AUC and many other functions built-in and has had lots of eyeballs on it looking for bugs. |
|
This is not for Python (either), but I have really gotten a lot of use out of the ROCR package for R. It deals nicely with things like cross-validation (so you get ROC curves with configurable error bars) and does precision-recall plots and other things as well. |
|
I mean, given what inputs? As I understand it, the ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity). If you want the area under it, integrate (pseudocode):
(Of course, in a practical version dx would be smaller and you wouldn't use the right-hand rule). |
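The integration idea above can be sketched in Python as follows. This is not the pseudocode from the post: it swaps the right-hand rule for the trapezoid rule and assumes the curve is already available as a list of (false positive rate, true positive rate) points. The function name `auc_trapezoid` and the sample points are hypothetical.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points.

    Assumes x is the false positive rate and y the true positive rate,
    and that the list already includes the end points (0, 0) and (1, 1).
    """
    pts = sorted(points)  # sort by x, then by y
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Trapezoid between two neighbouring points on the curve.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points for a small example.
curve = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
area = auc_trapezoid(curve)
print(area)
```

Because an empirical ROC curve is a step function between observed points, the trapezoid rule over those points needs no dx at all.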
|
Here's a snippet from a project I'm currently working on. We're using probability trees, so our ROC is defined by stepping through the possible threshold values; x is the false positive rate and y is the true positive rate, and the resulting points are then sorted.
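The threshold-stepping approach described above might look roughly like this in Python. This is a sketch under assumptions, not the project's actual snippet: labels are taken to be 0/1, an example counts as predicted positive when its score is at or above the threshold, and the function name `roc_points` is made up.

```python
def roc_points(labels, scores):
    """Return sorted (fpr, tpr) points, one per distinct score threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    points = {(0.0, 0.0)}  # threshold above every score: nothing predicted positive
    for t in set(scores):
        # Count true and false positives when predicting positive for score >= t.
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y != 1)
        points.add((fp / n_neg, tp / n_pos))
    return sorted(points)


# Toy data, made up for illustration.
pts = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(pts)
```

This recounts the data for every threshold, so it is quadratic in the worst case; sorting once by score and sweeping would be faster, but the version here stays close to the description.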
|
|
I can offer you the PHP code we use for Kaggle. $submission and $solution should be arrays. Let me know if you have any trouble implementing this. PS. We'd be happy to host this competition for you in future.
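Since the original question asked for Python, here is a rough Python sketch that takes the same two inputs. It is not a port of the PHP code: it assumes `submission` holds predicted scores, `solution` holds 0/1 labels in the same order, and it uses the rank-sum (Mann-Whitney) formulation of the AUC with average ranks for ties. The function name `auc_from_ranks` is hypothetical.

```python
def auc_from_ranks(submission, solution):
    """AUC via the rank-sum statistic; tied scores get their average rank."""
    n = len(submission)
    if n != len(solution):
        raise ValueError("submission and solution must have the same length")

    # Indices ordered by ascending score.
    order = sorted(range(n), key=lambda i: submission[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < n and submission[order[j + 1]] == submission[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(1 for y in solution if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    rank_sum = sum(r for r, y in zip(ranks, solution) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


score = auc_from_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(score)
```

The result is the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as half.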
|