A paper by Coates, Lee & Ng presented at NIPS 2010 compared several feature-learning approaches within the same convolution-based image-classification framework.
To many people's surprise, I think, they found that, when whitening is applied to the data, K-means(tri), a "soft" K-means variant, generates better features for classification than such sophisticated approaches as deep autoencoders and stacked RBMs.
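For concreteness, here is my understanding of the two ingredients, sketched in NumPy. This is my reading of the paper, not the authors' code, and the eps value is a guess on my part:

```python
import numpy as np

def zca_whiten(X, eps=1e-2):
    # ZCA whitening: rotate into the PCA basis, rescale each direction
    # to unit variance, rotate back. eps regularizes tiny eigenvalues
    # (the exact value is my assumption, not taken from the paper).
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

def kmeans_tri_features(X, centroids):
    # The "triangle" activation from the paper:
    #   f_k(x) = max(0, mu(z) - z_k),
    # where z_k = ||x - c_k||_2 and mu(z) is the mean of z over all centroids.
    z = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    mu = z.mean(axis=1, keepdims=True)
    return np.maximum(0.0, mu - z)  # roughly half the features come out exactly 0
```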
I couldn't help noticing that K-means (both "hard" and "soft") did better than Gaussian Mixture Models (GMMs), even though, computational efficiency aside, GMMs are often thought of as a "better", more robust K-means.
GMMs are actually quite similar to "soft" K-means, in that cluster membership is "fuzzy", i.e. a matter of degree. So it is especially strange that, with whitening, GMMs produced the worst results while "soft" K-means produced the best.
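To make the comparison concrete, here is what the GMM analogue of "soft" membership looks like, assuming diagonal covariances (which I believe is what the paper used):

```python
import numpy as np

def gmm_responsibilities(X, means, variances, weights):
    # Posterior membership p(cluster k | x) under a diagonal-covariance GMM.
    # means: (K, d), variances: (K, d) per-cluster diagonals, weights: (K,).
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2          # (n, K, d)
    log_p = (np.log(weights)
             - 0.5 * (diff2 / variances).sum(axis=2)
             - 0.5 * np.log(2 * np.pi * variances).sum(axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)                 # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)                   # each row sums to 1
```

One structural difference I notice: these responsibilities are normalized across clusters and almost never exactly zero, whereas the triangle features are unnormalized and sparse. Maybe that matters for the linear classifier sitting on top, but that's just my speculation.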
Why would GMMs do so poorly compared to K-means?
Has anyone tried reproducing these results? (I hope I'm not being rude to the authors, but bugs happen, and the more surprising the results, the higher the posterior probability of a mistake. For example, a poor GMM implementation could explain some of my surprise.)
Has anyone tried applying K-means to other problems where deep learning has ruled so far, such as speech recognition?
I wonder what happens if you stack several layers of whitening + kmeans(tri) (rough sketch below). It seems like it would be a natural thing for Ng's group to try, but this paper doesn't mention it.
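Here is roughly what I have in mind, sketched with scikit-learn's KMeans. I'm ignoring the convolutional patch extraction and pooling that the paper's pipeline does, so this is only the bare idea, not a faithful reproduction:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_tri_layer(H, k, eps=1e-2):
    # One hypothetical layer: ZCA-whiten the current representation,
    # cluster it, then apply the triangle activation to get the next one.
    Hc = H - H.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Hc, rowvar=False))
    Hw = Hc @ vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # whiten
    c = KMeans(n_clusters=k, n_init=10).fit(Hw).cluster_centers_
    z = np.linalg.norm(Hw[:, None, :] - c[None, :, :], axis=2)
    return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)    # kmeans(tri)

# e.g. two stacked layers on data X of shape (n_samples, n_features):
# H1 = kmeans_tri_layer(X, k=400)
# H2 = kmeans_tri_layer(H1, k=400)
```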