
I want to make a topic listing hack-free machine learning methods.

Many brilliant methods in machine learning - especially, in my experience, deep learning - are theoretically interesting and produce intriguing results, then simply refuse to work as advertised when I try to implement them myself.

This is not a slight against these methods! Some of them require more hacks than they really ought to, but for many it's likely that I'm just misunderstanding how they're supposed to work - how can I blame the method for that? Nonetheless, it's clear that there's a market (me) for idiot-proof machine learning.

My first pick would have to be k-means. The very definition of idiot-proof, almost. Just normalize your data in some way (z-score is how I often do it), then apply this four-step algorithm:

1) Initialize k cluster means randomly

2) Assign each point to its closest cluster mean

3) Move each cluster mean to the average of its member points

4) Repeat steps 2 and 3 until convergence

To make this even better, convergence is practically guaranteed and empirical results can be incredible - state-of-the-art scores on handwritten digit recognition are achievable with nothing more than a good choice of preprocessing and an application of k-means!
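For concreteness, here's a minimal numpy sketch of those four steps, with the z-score normalization up front (the toy data and variable names are just placeholders):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """The four steps above, in plain numpy."""
        rng = np.random.default_rng(seed)
        # Step 1: initialize the k cluster means at randomly chosen points
        means = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign each point to its closest cluster mean
            dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: move each cluster mean to the average of its member points
            new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else means[j] for j in range(k)])
            # Step 4: repeat until the means stop moving
            if np.allclose(new_means, means):
                break
            means = new_means
        return means, labels

    # z-score normalization: subtract each column's mean, divide by its standard deviation
    X = np.random.default_rng(1).random((500, 2))    # toy data
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    centers, assignments = kmeans(X, k=3)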

asked May 05 '11 at 01:05


Jacob Jensen

How do you normalize your data using z-score?

(Feb 29 '12 at 15:58) shn

10 Answers:

How about KNN? I think it doesn't get easier than that ;) Of course it's not the quickest method around, but it's very good and robust for medium-sized data sets.
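For what it's worth, the whole method fits in a few lines of numpy (a sketch; it assumes integer class labels and features on comparable scales):

    import numpy as np

    def knn_predict(X_train, y_train, X_test, k=5):
        """Majority vote among the k nearest training points (Euclidean distance)."""
        preds = []
        for x in X_test:
            dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
            nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest
            preds.append(np.bincount(nearest).argmax())   # majority vote
        return np.array(preds)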

answered May 05 '11 at 03:16


Andreas Mueller

I have a few concerns about KNN as "idiot-proof" that I would love to hear other people's thoughts on.

1) It is not necessarily trivial to integrate categorical attributes.

2) It can be sensitive to the units of the attributes.

3) If you normalize (e.g. zero-one scale) attributes so that the distance is not sensitive to the units, then the whole process becomes sensitive to outliers, skewed distributions, or noise (which may create outliers).

4) It is not trivial to apply in instances where one class greatly outnumbers the other.

I'd be interested to hear why people rate KNN so highly. It is easy to implement correctly and works well on "clean" data, but Naive Bayes seems to have fewer gotchas if you are using a prefab implementation from any ML package. Any thoughts?

(Nov 07 '11 at 14:40) Troy Raeder

I think 1) is true for nearly all models. Why is that easy in Naive Bayes?

2) is true; as you said, you can get rid of it by normalization. Usually I'd use zero-mean, unit-variance scaling or a whitening transformation.

About 3), I'm not so sure. Indeed, if you scale your attributes to unit variance, then the influence of a single point is in principle unlimited. But I think in practice that is not that big a problem if your dataset is not very small. If your variance is so dominated by outliers, maybe you should consider dealing with them explicitly.

4) I don't agree with that. Why can't you just apply it? I think the results will be quite good compared to other classifiers. Did you have other experiences?

The reason I prefer KNN over NB is that NB only works on very trivial datasets. What NB does is (more or less) store a single prototype for each class.

If I'm not mistaken (not sure at the moment), if you use NB with Gaussian distributions, your decision surface is quadratic. If you use KNN, it can be arbitrarily complex.

(Nov 07 '11 at 19:37) Andreas Mueller

Bagged decision trees and random forests are extremely easy to use, work well on most data sets, and almost never overfit. The main parameter you need to set is the number of models in the ensemble, and the results are not extremely sensitive to that. They are my default starting algorithm.

As another answer pointed out, it is best to grab one of the many implementations of decision trees rather than recode them yourself. You might or might not need to implement the meta learning step yourself.
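For example, with scikit-learn's implementation (a sketch, and the digits data is just a stand-in), the ensemble size really is the only knob you have to touch:

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)

    # n_estimators (the number of trees) is essentially the only setting;
    # more trees is rarely worse, just slower.
    forest = RandomForestClassifier(n_estimators=300, random_state=0)
    print(cross_val_score(forest, X, y, cv=5).mean())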

answered Aug 24 '11 at 20:39


Art Munson

They certainly do overfit in high-dimensional settings. However, I'll give you an upvote because I agree with you that these methods are fairly idiot-proof.

(Mar 11 '12 at 15:06) Gael Varoquaux

Gael: if you have specific data sets where they overfit in high dimensions or know papers that examine ensemble overfitting in high dimensions, I'd be interested to know them.

I can believe that overfitting is a risk, but I have not seen much evidence of it myself so far. One important factor here is how many models are in the ensemble; sometimes apparent overfitting actually indicates that the ensemble should be grown to 1000 members or more.

(Mar 12 '12 at 11:59) Art Munson

Both linear SVMs and Gaussian/RBF kernel SVMs are fairly idiot proof.

In general, I think that there are a wide class of algorithms for which you can get good results without any real expertise, SVMs being leaders of the pack. Getting great results often takes either some feature engineering/preprocessing expertise or tuning the hyperparameters of more complicated models. Ditto for scaling to really large datasets (which naive SVM solvers will not do, by a long shot).
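For instance, "good results without expertise" can look something like this sketch (scikit-learn's RBF-kernel SVM with default hyperparameters; the dataset is just an example, and searching over C and gamma is where the extra work comes in):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)

    # Scale the features, then an RBF-kernel SVM with default C and gamma.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    print(cross_val_score(model, X, y, cv=5).mean())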

In this way your example of k-means as "hack free" is a little misleading. Indeed, for basic clustering it is fairly simple to apply, though as for the CIFAR results of Coates et al., the devil is in the details of the preprocessing.

answered May 05 '11 at 17:01


David Warde Farley ♦

edited May 05 '11 at 17:05

You're right, you can't harness the full potential of K-means without good preprocessing. It can still do almost scarily well on some datasets. I was actually thinking of KNN when I mentioned the digit recognition example though.

I took a class that discussed k-means from a learning-theory perspective on clustering. Apparently the theoretical guarantees on k-means are pretty weak in terms of optimizing its objective function, but guarantees about slightly different measures or slightly different algorithms are very strong.

(May 06 '11 at 00:54) Jacob Jensen

K-means is a good bad example: yes, it's simple, and yes, it converges for just about any K. But without verification, what do you have? For example, try K-means on the UCI ML handwritten digit data (3823 training and 1797 test vectors, 8x8 pixels) for K = 8 .. 12. K = 10 should stand out, right? Then look at the "Elbow for K-means clustering" plot on Stack Overflow. It uses only the 1797 test vectors and the Euclidean metric, but it shows, I think, the dangers of unthinking K-means.
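If you want to run that check yourself, here is a sketch with scikit-learn (its bundled digits are that same 1797-sample, 8x8 test set); if K-means really "knew" there were 10 digits, K = 10 should stand out in the inertia values, so judge for yourself:

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits

    X, _ = load_digits(return_X_y=True)   # 1797 samples, 8x8 pixels

    # Sum of squared distances to the nearest centroid ("inertia") for a range of K.
    for k in range(8, 13):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_))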

answered May 09 '11 at 07:32


denis

edited Jul 25 '11 at 12:57

I'd say linear SVM and naive Bayes are pretty foolproof. I've always gotten decent to excellent results with linear SVM (once I normalize the features and balance the training data in the event that the classes are unbalanced). Naive Bayes also works well with little fiddling (though less well, in my experience). They both train quickly even on very large datasets, and don't seem to have a lot of gotchas.
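That recipe is only a few lines with scikit-learn (a sketch; here class_weight="balanced" stands in for explicitly rebalancing the training data):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    # Normalize the features; weight the classes inversely to their frequency
    # instead of resampling the training data.
    model = make_pipeline(StandardScaler(),
                          LinearSVC(C=1.0, class_weight="balanced"))
    # model.fit(X_train, y_train); model.predict(X_test)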

answered May 05 '11 at 01:36


Lucas Wiman

re: naive Bayes, I wouldn't say it's entirely foolproof. At a previous employer, someone wrote an implementation of naive Bayes that assumed the probability of a word given a category was zero whenever the word never appeared in that category's training samples. It seems like a reasonable assumption to make if you're new to the topic, and it's subtle enough that it wasn't caught until a developer more experienced with naive Bayes came on board.

(May 05 '11 at 12:24) Brian Vandenberg

Sure, and there are also gotchas around numerical stability (sum the log of the probabilities instead of multiplying). In other words, you shouldn't naively implement naive Bayes :-) OTOH, using a preexisting implementation works with few gotchas.
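A tiny sketch of both gotchas, with made-up counts: add-one (Laplace) smoothing so an unseen word doesn't zero out a whole class, and summing log-probabilities instead of multiplying them:

    import numpy as np

    # Toy word counts per class (rows: classes, columns: vocabulary words).
    counts = np.array([[3., 0., 5.],    # class 0 never saw word 1
                       [1., 4., 2.]])

    # Gotcha 1: add-one smoothing gives unseen words a small nonzero probability.
    probs = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)

    # Gotcha 2: score in log space; multiplying many small probabilities underflows.
    log_priors = np.log([0.5, 0.5])
    doc = np.array([2, 1, 0])           # word counts of a document to classify
    scores = log_priors + (doc * np.log(probs)).sum(axis=1)
    print(scores.argmax())              # predicted class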

(May 05 '11 at 13:35) Lucas Wiman

With respect to "...nothing more than a good choice of preprocessing", 90-99% of the work in any machine learning project is the definition and extraction of the input attributes, i.e. the preprocessing. If you have a good set of attributes, almost any learning algorithm will work.

I also found your question a bit confusing, in that you gave k-means, an unsupervised learning algorithm, as your example, but talked about state-of-the-art accuracy on digit recognition, a classification problem. Are you interested in supervised learning or unsupervised learning?

answered Aug 26 '11 at 14:51


Dave Lewis

This is an extremely important point. Most methods implicitly assume the Euclidean metric on the feature space, and if that assumption is wrong, all bets are off. The choice of metric is in some ways what machine learning is really about (i.e. figuring out which training examples are really most similar). This is where ensembles of classification/regression trees come in, since they are invariant to monotonic transformations of the features and effectively learn a metric from the interdependencies of the features. So if you want a method that really just figures it out, things like random forests are the way to go.

(Aug 26 '11 at 18:25) Daniel Mahler

It seems you are mainly looking for the methods that are often used as baselines in research papers because they are easy to implement and/or optimize while performing well without too much tweaking. Some others that haven't been mentioned yet in this thread are linear discriminant analysis and logistic regression for classification, linear regression, n-grams, and methods like PCA/SVD and ICA. Most of these methods have just a single (hyper)parameter to tune.
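As a rough illustration (a scikit-learn sketch; the dataset and settings are arbitrary), a PCA + logistic regression baseline really does come down to about one knob per step:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)

    # PCA: pick the number of components; logistic regression: pick C (left at default here).
    baseline = make_pipeline(PCA(n_components=30), LogisticRegression(max_iter=1000))
    print(cross_val_score(baseline, X, y, cv=5).mean())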

answered May 05 '11 at 08:54


Philemon Brakel

More recent or less well-known methods are also of interest to me. Relatively few are idiot-proof in the way I specified, but many are well implemented in software packages. Latent Dirichlet Allocation, for instance, is possibly straightforward enough to qualify, assuming it is applied to document data.

(May 05 '11 at 11:05) Jacob Jensen

In my experience LDA is not terribly straightforward to apply. Many of the topics come out "noisy", by which I mean they lack an intuitive interpretation, and many of the documents have non-trivial membership only in the noisy topics. Many of the papers I have seen that apply LDA do so by training some number like 100, 200, or 500 topics and then presenting 5 of them as results.

It is definitely possible (maybe even likely) that I'm missing something but I can't see putting it in the same category as, say, Naive Bayes.

(May 07 '11 at 20:21) Troy Raeder

For Bayesian modeling with models that allow it, Gibbs sampling has nothing to tune at all.
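As a toy illustration (not a model from this thread): a Gibbs sampler for a standard bivariate normal with correlation rho just alternates draws from the two conditionals, and the only choices are how long to run it and how much to discard:

    import numpy as np

    rng = np.random.default_rng(0)
    rho, n_samples, burn_in = 0.8, 5000, 500
    x = y = 0.0
    samples = []
    for i in range(n_samples + burn_in):
        # Each variable is drawn from its conditional given the other:
        # X | Y=y ~ N(rho*y, 1 - rho^2), and symmetrically for Y | X=x.
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))
        if i >= burn_in:                # discard the burn-in period
            samples.append((x, y))
    print(np.corrcoef(np.array(samples).T)[0, 1])   # should be close to rho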

answered May 07 '11 at 15:38


gdahl ♦

Maybe it's just me, but taking just one sample instead of several, or not using a burn-in period, seem like excellent ways to mess up Gibbs sampling. (No first-hand experience, just a guess.)

Turning a Gibbs sampler into its collapsed form is probably also something essential but non-trivial.

(Aug 25 '11 at 13:55) sqrt17

sqrt17, I don't understand your objection. Vacuously, a truly idiot-proof machine learning method doesn't exist: I could train on the test set, mess up the preprocessing, or ignore the instructions given whenever Gibbs sampling is described in an introductory course. So what criteria are we using to evaluate the "idiot-proofness" of methods? I think that if software packages exist that use it as a successful black box, that is a strong point in favor of a method being robust to idiocy (see this list http://www.cs.ubc.ca/~murphyk/Software/bnsoft.html and note that popular ones like BUGS use Gibbs sampling quite heavily). Furthermore, if the method has zero meta-parameters, that should also be a point in its favor.

(Nov 07 '11 at 02:15) gdahl ♦

Random forests are the most idiot-proof algorithm in my experience. Most algorithms have a sweet spot in terms of complexity and will over- or underfit if you don't use the proper set of tuning parameters. Once you've fit enough trees in an RF model, the out-of-sample prediction error just tends to flatline.
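One way to watch that happen, sketched with scikit-learn's out-of-bag estimate (the dataset and tree counts are arbitrary):

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_digits(return_X_y=True)

    # Out-of-bag error as the forest grows; past some point it just flattens out.
    for n in [25, 50, 100, 300, 1000]:
        rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                    random_state=0, n_jobs=-1).fit(X, y)
        print(n, round(1 - rf.oob_score_, 4))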

answered Mar 11 '12 at 14:53


darckeen

Even if your method is generic, the choice of distance may still change the outcome of the algorithm. The distance may be Euclidean, Manhattan, Mahalanobis, angular, radial, etc. Choosing the right or the wrong distance will impact your clustering.

Also, consider that even random data clusters nicely with k-means.
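A quick sketch of that point (scikit-learn on uniform noise): K-means will happily carve structureless data into k groups, and nothing in its output warns you.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).random((1000, 2))   # uniform random points, no structure
    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
    print(np.bincount(km.labels_))                   # five well-populated "clusters"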

As a joke, there is a theorem stating that no system can be idiot-proof, since you can always find a "better idiot".

answered May 09 '11 at 18:02


Razvan Popovici

edited May 09 '11 at 18:02
