Lets say you have a feature classification problem where you need to apply a supervised learning technique. So you are given a learning set with binary labels (1 and 0). Each data point is an n-dimensional vector. How would you go about selecting a supervised learning technique, i.e., what information would you be looking at to make an initial selection of a supervised learning method?
asked Oct 27 '10 at 17:08
There are various theoretical guidelines like "use generative method if you have little data, discriminative if you have lots of data" or "use perceptron if your data is well separated" but I think they aren't as useful/reliable as a simple experimental comparison. If you have some sample data turn it into ARFF format, run Weka's 70 something different classifiers on it, and see what works best.
Weka has command line interface so it's easy to automate. I recently wrote this script to compare all of Weka's classifiers on a document categorization dataset, and the results were not something I would've predicted from theoretical considerations.
Edit 10/27: In reply to Joseph -- This was consulting project which I got through my blog, so I'll ask them if I can post dataset details, and if they OK it, I'll make a blogpost about it.
Meanwhile, I can say that bugs were surely involved. This involved about 3000 hours of Weka runs and many of them terminated with various exceptions (or failed to terminate). edit (actually I just checked the logs, about 10% of the runs failed, and none of them were due to Weka's fault) Naive Bayes always worked, and it's performance was the same as the performance of Naive Bayes classifier already in production. Other ones with consistent performance were SimpleCart, RepTree, HyperPipes, LibSVM, SMO, J48, Logistic Regression, JRip, AdaBoost with decision stumps. They were beaten by Naive Bayes consistently. As far as concrete performance numbers, Naive Bayes made 517 errors. One of the closest contenders, OneR, made 547. LibSVM made 597. OneR is even simpler than Naive Bayes, and described in "Very simple classification rules perform well on most commonly used datasets"
Edit 11/7 In reply to Aman on generative vs. discriminative. For text categorization, Naive Bayes has 2k parameters and k effective parameters. Roughly speaking, we can multiply p(positive|word_present) and p(positive|word_not_present) by some constant c and get (essentially) the same linear discriminant. Naive Bayes and Logistic Regression give you the same set of realizable classifiers, what's different is how each method chooses the classifier. You can see this by rewriting Naive Bayes classifier in terms of log-odds, like here
A common saying is that "generative has smaller variance, discriminative has smaller bias", and to make this precise, you need to look at generative-discriminative pairs. In other words, only compare learners that have the same hypothesis space. This paper ([Ng 2004]) talks more about generative-discriminative pairs. Also, Thomas Mitchell talks about generative vs. discriminative trade-off in an online book chapter. Once you treat choice of hypothesis space and fitting criteria separately, difference between generative and discriminative comes down to the objective function. For instance, Logistic Regression minimizes conditional log-loss, whereas Naive Bayes minimizes log-loss which can be rewritten as a sum of conditional log-loss and marginal log-loss. Now you can see the reason Naive Bayes has smaller variance -- "marginal log-loss" acts as a regularizer. With this view, you can take any learner that minimizes conditional log loss and add a term which is lambda * marginal log loss. When lambda is zero, the classifier is fully discriminative, when it is 1, the classifier is fully generative. For supervised learning with infinite training data, best lambda is 0, but for finite sample sizes, some intermediate value may be better, I think Ng had some paper that looked at smoothly varying lambda depending on sample size
Some more papers on generative vs. discriminative trade-offs in my library
Adding to ogrisel's answer, a good first approach is to use a regularized linear classifier, such as a linear SVM or regularized logistic regression (or naive bayes in the extremely rare training data limit). Surprisingly often this gives enough performance out of the box.
answered Oct 27 '10 at 18:01
Alexandre Passos ♦
I might recommend Lecture 11 of Andrew NG's Lecture in Stanford
In this lecture he goes over some advises to apply algorithms and some counsel on how to apply your algorithms.
The overall Lecture Series is great, but this specific one is good for that problem.
Experimenting with various models is a good empirical method but it is better to have some theoretical understanding of the data, and its distribution to validate the results. Sometimes, the heavy maths and numerical methods involved can mask a significant (bug in the implementation)/(erroneous assumption)/(error in data representation). Also, unless you understand why a particular algorithm is working well for the data, you can never be confident if the results will hold for more amount of training data or similar data from a different source.
Some of the theoretical guidelines for determining which classifiers to start with can be as follows:
1) How many classes are there ? - Use a binary or multi-class classifier.
2) What is the overlap among the classes / Is the data linearly separable ? - Use linear or non-linear method/kernel.
3) How many features are there ? - If, few: most of them might be useful, if, many: use a feature selection method, a L1-regularizer (if the optimization procedure allows it), dimensionality reducion.
4) What is the correlation between the various features ? If, features are orthogonal, can use Naive Bayes, if it is not clear (for e.g., values of pixel intensity in an image), use non-linear methods like Neural Networks, non-linear SVM.
5) Does the data need pre-processing ? Are the scales of the feature values similar or vary by orders of magnitude ? If there are continuous, discrete and binary-values features, you might have to use one-of-K encoding for binary features, quantize continuous values or smoothen discrete values.
6) The amount of data available ? Is the data enough to learn a generative model? Or should you use a discriminative approach?
answered Nov 06 '10 at 17:21
The traditional way to compare the performance of supervised classifiers is to perform cross validation: split your data in 5 folds (for instance), train your algorithms on the first four and measure the performance on the last fold. Rinse repeat on. For each algorithm take the median score, min score and max score of the 5 performance evaluations compare them to each other.
If the classes are imbalanced (many many samples with label 0 and a few percents of 1) then you should use precision, recall and f1 score as evaluation metrics instead of accuracy (percentage of correct classification) since always predicting 0 will yield good accuracy score (despite a poor classifier).
answered Oct 27 '10 at 17:33
I recently had this exact same task for a school project, and a simple naive Bayesian combined with weights to give a lower weight to those dimensions that we have little information on (ie. only one "1", or two "1"'s in different classes etc) worked out the best. This ridiculously simple design beat complex SVM's with Gaussian kernels, decision trees, you name it.