Lets say you have a feature classification problem where you need to apply a supervised learning technique. So you are given a learning set with binary labels (1 and 0). Each data point is an n-dimensional vector. How would you go about selecting a supervised learning technique, i.e., what information would you be looking at to make an initial selection of a supervised learning method?

asked Oct 27 '10 at 17:08

bustweiser's gravatar image


6 Answers:

There are various theoretical guidelines like "use generative method if you have little data, discriminative if you have lots of data" or "use perceptron if your data is well separated" but I think they aren't as useful/reliable as a simple experimental comparison. If you have some sample data turn it into ARFF format, run Weka's 70 something different classifiers on it, and see what works best.

Weka has command line interface so it's easy to automate. I recently wrote this script to compare all of Weka's classifiers on a document categorization dataset, and the results were not something I would've predicted from theoretical considerations.

Edit 10/27: In reply to Joseph -- This was consulting project which I got through my blog, so I'll ask them if I can post dataset details, and if they OK it, I'll make a blogpost about it.

Meanwhile, I can say that bugs were surely involved. This involved about 3000 hours of Weka runs and many of them terminated with various exceptions (or failed to terminate). edit (actually I just checked the logs, about 10% of the runs failed, and none of them were due to Weka's fault) Naive Bayes always worked, and it's performance was the same as the performance of Naive Bayes classifier already in production. Other ones with consistent performance were SimpleCart, RepTree, HyperPipes, LibSVM, SMO, J48, Logistic Regression, JRip, AdaBoost with decision stumps. They were beaten by Naive Bayes consistently. As far as concrete performance numbers, Naive Bayes made 517 errors. One of the closest contenders, OneR, made 547. LibSVM made 597. OneR is even simpler than Naive Bayes, and described in "Very simple classification rules perform well on most commonly used datasets"

Edit 11/7 In reply to Aman on generative vs. discriminative. For text categorization, Naive Bayes has 2k parameters and k effective parameters. Roughly speaking, we can multiply p(positive|word_present) and p(positive|word_not_present) by some constant c and get (essentially) the same linear discriminant. Naive Bayes and Logistic Regression give you the same set of realizable classifiers, what's different is how each method chooses the classifier. You can see this by rewriting Naive Bayes classifier in terms of log-odds, like here

A common saying is that "generative has smaller variance, discriminative has smaller bias", and to make this precise, you need to look at generative-discriminative pairs. In other words, only compare learners that have the same hypothesis space. This paper ([Ng 2004]) talks more about generative-discriminative pairs. Also, Thomas Mitchell talks about generative vs. discriminative trade-off in an online book chapter. Once you treat choice of hypothesis space and fitting criteria separately, difference between generative and discriminative comes down to the objective function. For instance, Logistic Regression minimizes conditional log-loss, whereas Naive Bayes minimizes log-loss which can be rewritten as a sum of conditional log-loss and marginal log-loss. Now you can see the reason Naive Bayes has smaller variance -- "marginal log-loss" acts as a regularizer. With this view, you can take any learner that minimizes conditional log loss and add a term which is lambda * marginal log loss. When lambda is zero, the classifier is fully discriminative, when it is 1, the classifier is fully generative. For supervised learning with infinite training data, best lambda is 0, but for finite sample sizes, some intermediate value may be better, I think Ng had some paper that looked at smoothly varying lambda depending on sample size

Some more papers on generative vs. discriminative trade-offs in my library

answered Oct 27 '10 at 18:02

Yaroslav%20Bulatov's gravatar image

Yaroslav Bulatov

edited Nov 07 '10 at 12:52


Interesting, can you tell us more on the results?

(Oct 27 '10 at 18:09) ogrisel

Basically, Naive Bayes worked best. I expected it to beat Logistic Regression due to size of corpus, but didn't expect it to also outperform the remaining 60+ classifiers.

(Oct 27 '10 at 21:12) Yaroslav Bulatov

Are you serious? Could you write this up in more depth, maybe as a blog post? Are you sure it wasn't a bug?

(Oct 27 '10 at 21:37) Joseph Turian ♦♦

I don't doubt the power of naive bayes, but couldn't this result also be partially due to the fact that optimizing many of those other classifiers requires careful tuning of some hyper parameters?

(Oct 28 '10 at 10:01) Philemon Brakel

Hyper-parameter tuning would help, but in the task I got, probably not by much, here are some details -- http://yaroslavvb.blogspot.com/2010/10/sometimes-simplest-learners-are-best.html

(Oct 28 '10 at 20:21) Yaroslav Bulatov

@Yaroslav: Shouldn't your first sentence be: "use generative method if you have lots of data and discriminative approach when you have little data" since generative approach involves learning many more parameters than discriminative approach as you have to model the distribution of the data as well ?

(Nov 06 '10 at 17:00) Aman

Aman -- compare Naive Bayes vs. Logistic Regression for text classification with word-level features, same number of parameters, naive bayes has smaller variance

(Nov 06 '10 at 18:28) Yaroslav Bulatov

Yaroslav - Naive Bayes still needs at least twice as many parameters as Logistic Regression. If the number of features = k, then LR needs k params, and NB at least requires 2k for a binary classifier. Also, NB is one of the simplest generative models and is an exception. I didn't understand how you can compare the variances of two different models. Plz give me a pointer to that.

(Nov 07 '10 at 11:57) Aman

My reply was too long, so I added it into my Answer

(Nov 07 '10 at 12:47) Yaroslav Bulatov
showing 5 of 9 show all

Adding to ogrisel's answer, a good first approach is to use a regularized linear classifier, such as a linear SVM or regularized logistic regression (or naive bayes in the extremely rare training data limit). Surprisingly often this gives enough performance out of the box.

answered Oct 27 '10 at 18:01

Alexandre%20Passos's gravatar image

Alexandre Passos ♦


Adding to Alexandre's answer I would tell you to tune lambda (the hyper parameter value, a.k.a the strength of the regularizer term in your SVM or logistic regressor) by cross validating 5 models with uniformly distributed lambdas and taking the one that yields the best scores.

(Oct 27 '10 at 18:08) ogrisel

Adding to ogrisel's answer I want to say that anyone who does not perform hyperparameter search before using a supervised learning method should be beaten e.g. with a printed copy of "A practical guide to SVM classification" until they understand ;-)

Seriously, tuning hyperparameters is essential when comparing methods.


(Nov 04 '10 at 15:05) zeno

I might recommend Lecture 11 of Andrew NG's Lecture in Stanford

In this lecture he goes over some advises to apply algorithms and some counsel on how to apply your algorithms.

The overall Lecture Series is great, but this specific one is good for that problem.

answered Oct 29 '10 at 03:04

Leon%20Palafox's gravatar image

Leon Palafox ♦

edited Jan 27 '11 at 11:30

Oliver%20Mitevski's gravatar image

Oliver Mitevski

Experimenting with various models is a good empirical method but it is better to have some theoretical understanding of the data, and its distribution to validate the results. Sometimes, the heavy maths and numerical methods involved can mask a significant (bug in the implementation)/(erroneous assumption)/(error in data representation). Also, unless you understand why a particular algorithm is working well for the data, you can never be confident if the results will hold for more amount of training data or similar data from a different source.

Some of the theoretical guidelines for determining which classifiers to start with can be as follows:

1) How many classes are there ? - Use a binary or multi-class classifier.

2) What is the overlap among the classes / Is the data linearly separable ? - Use linear or non-linear method/kernel.

3) How many features are there ? - If, few: most of them might be useful, if, many: use a feature selection method, a L1-regularizer (if the optimization procedure allows it), dimensionality reducion.

4) What is the correlation between the various features ? If, features are orthogonal, can use Naive Bayes, if it is not clear (for e.g., values of pixel intensity in an image), use non-linear methods like Neural Networks, non-linear SVM.

5) Does the data need pre-processing ? Are the scales of the feature values similar or vary by orders of magnitude ? If there are continuous, discrete and binary-values features, you might have to use one-of-K encoding for binary features, quantize continuous values or smoothen discrete values.

6) The amount of data available ? Is the data enough to learn a generative model? Or should you use a discriminative approach?

answered Nov 06 '10 at 17:21

Aman's gravatar image


The traditional way to compare the performance of supervised classifiers is to perform cross validation: split your data in 5 folds (for instance), train your algorithms on the first four and measure the performance on the last fold. Rinse repeat on. For each algorithm take the median score, min score and max score of the 5 performance evaluations compare them to each other.

If the classes are imbalanced (many many samples with label 0 and a few percents of 1) then you should use precision, recall and f1 score as evaluation metrics instead of accuracy (percentage of correct classification) since always predicting 0 will yield good accuracy score (despite a poor classifier).

answered Oct 27 '10 at 17:33

ogrisel's gravatar image


I recently had this exact same task for a school project, and a simple naive Bayesian combined with weights to give a lower weight to those dimensions that we have little information on (ie. only one "1", or two "1"'s in different classes etc) worked out the best. This ridiculously simple design beat complex SVM's with Gaussian kernels, decision trees, you name it.

answered Nov 07 '10 at 20:59

tsiki's gravatar image


edited Nov 08 '10 at 00:42

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.