Lets say you have a feature classification problem where you need to apply a supervised learning technique. So you are given a learning set with binary labels (1 and 0). Each data point is an ndimensional vector. How would you go about selecting a supervised learning technique, i.e., what information would you be looking at to make an initial selection of a supervised learning method? asked Oct 27 '10 at 17:08 bustweiser 
There are various theoretical guidelines like "use generative method if you have little data, discriminative if you have lots of data" or "use perceptron if your data is well separated" but I think they aren't as useful/reliable as a simple experimental comparison. If you have some sample data turn it into ARFF format, run Weka's 70 something different classifiers on it, and see what works best. Weka has command line interface so it's easy to automate. I recently wrote this script to compare all of Weka's classifiers on a document categorization dataset, and the results were not something I would've predicted from theoretical considerations. Edit 10/27: In reply to Joseph  This was consulting project which I got through my blog, so I'll ask them if I can post dataset details, and if they OK it, I'll make a blogpost about it. Meanwhile, I can say that bugs were surely involved. This involved about 3000 hours of Weka runs and many of them terminated with various exceptions (or failed to terminate). edit (actually I just checked the logs, about 10% of the runs failed, and none of them were due to Weka's fault) Naive Bayes always worked, and it's performance was the same as the performance of Naive Bayes classifier already in production. Other ones with consistent performance were SimpleCart, RepTree, HyperPipes, LibSVM, SMO, J48, Logistic Regression, JRip, AdaBoost with decision stumps. They were beaten by Naive Bayes consistently. As far as concrete performance numbers, Naive Bayes made 517 errors. One of the closest contenders, OneR, made 547. LibSVM made 597. OneR is even simpler than Naive Bayes, and described in "Very simple classification rules perform well on most commonly used datasets" Edit 11/7 In reply to Aman on generative vs. discriminative. For text categorization, Naive Bayes has 2k parameters and k effective parameters. Roughly speaking, we can multiply p(positiveword_present) and p(positiveword_not_present) by some constant c and get (essentially) the same linear discriminant. Naive Bayes and Logistic Regression give you the same set of realizable classifiers, what's different is how each method chooses the classifier. You can see this by rewriting Naive Bayes classifier in terms of logodds, like here A common saying is that "generative has smaller variance, discriminative has smaller bias", and to make this precise, you need to look at generativediscriminative pairs. In other words, only compare learners that have the same hypothesis space. This paper ([Ng 2004]) talks more about generativediscriminative pairs. Also, Thomas Mitchell talks about generative vs. discriminative tradeoff in an online book chapter. Once you treat choice of hypothesis space and fitting criteria separately, difference between generative and discriminative comes down to the objective function. For instance, Logistic Regression minimizes conditional logloss, whereas Naive Bayes minimizes logloss which can be rewritten as a sum of conditional logloss and marginal logloss. Now you can see the reason Naive Bayes has smaller variance  "marginal logloss" acts as a regularizer. With this view, you can take any learner that minimizes conditional log loss and add a term which is lambda * marginal log loss. When lambda is zero, the classifier is fully discriminative, when it is 1, the classifier is fully generative. For supervised learning with infinite training data, best lambda is 0, but for finite sample sizes, some intermediate value may be better, I think Ng had some paper that looked at smoothly varying lambda depending on sample size Some more papers on generative vs. discriminative tradeoffs in my library answered Oct 27 '10 at 18:02 Yaroslav Bulatov Basically, Naive Bayes worked best. I expected it to beat Logistic Regression due to size of corpus, but didn't expect it to also outperform the remaining 60+ classifiers.
(Oct 27 '10 at 21:12)
Yaroslav Bulatov
1
Are you serious? Could you write this up in more depth, maybe as a blog post? Are you sure it wasn't a bug?
(Oct 27 '10 at 21:37)
Joseph Turian ♦♦
2
I don't doubt the power of naive bayes, but couldn't this result also be partially due to the fact that optimizing many of those other classifiers requires careful tuning of some hyper parameters?
(Oct 28 '10 at 10:01)
Philemon Brakel
Hyperparameter tuning would help, but in the task I got, probably not by much, here are some details  http://yaroslavvb.blogspot.com/2010/10/sometimessimplestlearnersarebest.html
(Oct 28 '10 at 20:21)
Yaroslav Bulatov
@Yaroslav: Shouldn't your first sentence be: "use generative method if you have lots of data and discriminative approach when you have little data" since generative approach involves learning many more parameters than discriminative approach as you have to model the distribution of the data as well ?
(Nov 06 '10 at 17:00)
Aman
Aman  compare Naive Bayes vs. Logistic Regression for text classification with wordlevel features, same number of parameters, naive bayes has smaller variance
(Nov 06 '10 at 18:28)
Yaroslav Bulatov
Yaroslav  Naive Bayes still needs at least twice as many parameters as Logistic Regression. If the number of features = k, then LR needs k params, and NB at least requires 2k for a binary classifier. Also, NB is one of the simplest generative models and is an exception. I didn't understand how you can compare the variances of two different models. Plz give me a pointer to that.
(Nov 07 '10 at 11:57)
Aman
My reply was too long, so I added it into my Answer
(Nov 07 '10 at 12:47)
Yaroslav Bulatov
showing 5 of 9
show all

Adding to ogrisel's answer, a good first approach is to use a regularized linear classifier, such as a linear SVM or regularized logistic regression (or naive bayes in the extremely rare training data limit). Surprisingly often this gives enough performance out of the box. answered Oct 27 '10 at 18:01 Alexandre Passos ♦ 2
Adding to Alexandre's answer I would tell you to tune lambda (the hyper parameter value, a.k.a the strength of the regularizer term in your SVM or logistic regressor) by cross validating 5 models with uniformly distributed lambdas and taking the one that yields the best scores.
(Oct 27 '10 at 18:08)
ogrisel
2
Adding to ogrisel's answer I want to say that anyone who does not perform hyperparameter search before using a supervised learning method should be beaten e.g. with a printed copy of "A practical guide to SVM classification" until they understand ;) Seriously, tuning hyperparameters is essential when comparing methods. http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
(Nov 04 '10 at 15:05)
zeno

I might recommend Lecture 11 of Andrew NG's Lecture in Stanford In this lecture he goes over some advises to apply algorithms and some counsel on how to apply your algorithms. The overall Lecture Series is great, but this specific one is good for that problem. answered Oct 29 '10 at 03:04 Leon Palafox ♦ Oliver Mitevski 
The traditional way to compare the performance of supervised classifiers is to perform cross validation: split your data in 5 folds (for instance), train your algorithms on the first four and measure the performance on the last fold. Rinse repeat on. For each algorithm take the median score, min score and max score of the 5 performance evaluations compare them to each other. If the classes are imbalanced (many many samples with label 0 and a few percents of 1) then you should use precision, recall and f1 score as evaluation metrics instead of accuracy (percentage of correct classification) since always predicting 0 will yield good accuracy score (despite a poor classifier). answered Oct 27 '10 at 17:33 ogrisel 
Experimenting with various models is a good empirical method but it is better to have some theoretical understanding of the data, and its distribution to validate the results. Sometimes, the heavy maths and numerical methods involved can mask a significant (bug in the implementation)/(erroneous assumption)/(error in data representation). Also, unless you understand why a particular algorithm is working well for the data, you can never be confident if the results will hold for more amount of training data or similar data from a different source. Some of the theoretical guidelines for determining which classifiers to start with can be as follows: 1) How many classes are there ?  Use a binary or multiclass classifier. 2) What is the overlap among the classes / Is the data linearly separable ?  Use linear or nonlinear method/kernel. 3) How many features are there ?  If, few: most of them might be useful, if, many: use a feature selection method, a L1regularizer (if the optimization procedure allows it), dimensionality reducion. 4) What is the correlation between the various features ? If, features are orthogonal, can use Naive Bayes, if it is not clear (for e.g., values of pixel intensity in an image), use nonlinear methods like Neural Networks, nonlinear SVM. 5) Does the data need preprocessing ? Are the scales of the feature values similar or vary by orders of magnitude ? If there are continuous, discrete and binaryvalues features, you might have to use oneofK encoding for binary features, quantize continuous values or smoothen discrete values. 6) The amount of data available ? Is the data enough to learn a generative model? Or should you use a discriminative approach? answered Nov 06 '10 at 17:21 Aman 
I recently had this exact same task for a school project, and a simple naive Bayesian combined with weights to give a lower weight to those dimensions that we have little information on (ie. only one "1", or two "1"'s in different classes etc) worked out the best. This ridiculously simple design beat complex SVM's with Gaussian kernels, decision trees, you name it. answered Nov 07 '10 at 20:59 tsiki 