A paper by Coates, Lee & Ng presented at NIPS 2010 compared different machine learning approaches within the same convolution-based image classification framework. I think to many people's surprise, they found that, if whitening is applied to the data, K-means (tri), a "soft" K-means variant, generates better features for classification than such sophisticated approaches as deep autoencoders and stacked RBMs. I couldn't help noticing that K-means (both "hard" and "soft") did better than Gaussian Mixture Models (GMMs), even though, computational efficiency aside, GMMs are often thought of as a "better", more robust K-means. GMMs are actually quite similar to "soft" K-means, because cluster membership is "fuzzy", i.e. a matter of degree. So it is especially strange that, with whitening, GMMs produced the worst results while "soft" K-means produced the best. Questions:
Edit: to clarify the nomenclature:
asked Feb 27 '11 at 14:31 Oleg Trott 
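(For concreteness: the whitening step referred to above is, as I understand the paper, ZCA whitening of the patches. A minimal numpy sketch - the function name and the regularizer `eps` are my own choices, not the paper's:)

```python
import numpy as np

def zca_whiten(X, eps=1e-2):
    """ZCA whitening of data rows: decorrelate the dimensions and
    equalize their variances, keeping the result close to the input."""
    Xc = X - X.mean(axis=0)                      # center each dimension
    cov = Xc.T @ Xc / Xc.shape[0]                # sample covariance
    vals, vecs = np.linalg.eigh(cov)             # eigendecomposition
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W
```

With a small `eps` the whitened data has (approximately) identity covariance; a larger `eps` damps the low-variance directions instead of blowing them up.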
My guess at point 1: GMMs have many more parameters than soft k-means, since you also need to estimate covariance matrices. Actually, soft k-means is a special case of a GMM in which you assume a fixed and tied covariance matrix. Because the likelihood function is more complex, naively initializing a GMM is more prone to getting you stuck in bad local maxima than soft k-means is. Further, looking at the curves, the more clusters you use, the better the classification. I would guess that with that many clusters, modeling the shape of the clusters matters less, since you can always capture a complex part of the distribution with lots of simple clusters. Since you don't need to interpret the results, there is not much loss in adding a couple of extra clusters rather than adequately modeling the shape of one cluster. I don't have any intuition as to why hard k-means would perform so much worse than soft k-means. answered Feb 28 '11 at 10:07 Oscar Täckström 1
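(To make the "special case of a GMM" point concrete, here is a minimal numpy sketch - the function name and the stiffness parameter `beta` are mine. Soft k-means memberships are exactly the posteriors of a GMM with uniform mixing weights and a fixed, tied spherical covariance, with `beta` playing the role of inverse variance:)

```python
import numpy as np

def soft_kmeans_responsibilities(X, centroids, beta=1.0):
    """Soft k-means memberships = GMM posteriors under uniform mixing
    weights and a shared spherical covariance (1 / (2*beta)) * I."""
    # Squared distance from every point to every centroid: shape (n, k)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)       # rows sum to one
```

As `beta` grows, the memberships sharpen toward the hard k-means assignment; a full GMM additionally fits the covariances and mixing weights, which is where the extra parameters come from.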
Well, if you use hard k-means, your feature vector will be able to represent log k bits of information (with k being the number of clusters), while with soft k-means it's "somewhat more".
(Feb 28 '11 at 16:00)
Justin Bayer
Hmm, this I don't understand. Just because I use hard k-means to induce features does not mean that I need to use a one-hot representation for my data instances. I can find the clusters with hard k-means and then measure the distance to each centroid and use the resulting vector as my representation, or use more complex representations. I also don't understand in what sense a feature vector is able to represent log k bits of information, or what you mean by "somewhat more" with soft k-means. Soft k-means is an algorithm for finding centroids so that likelihood is maximized. Once the centroids are found, the amount of information you can encode should not depend on the algorithm used to induce them. Am I missing something?
(Mar 01 '11 at 15:34)
Oscar Täckström
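(A minimal numpy sketch of the representation Oscar describes - hard k-means via Lloyd's algorithm, then the vector of distances to all centroids as the feature vector. The function name is mine:)

```python
import numpy as np

def kmeans_distance_features(X, k=8, iters=20, seed=0):
    """Run hard k-means (Lloyd's algorithm), then represent each point
    by its Euclidean distances to all k centroids, not a one-hot code."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()  # init from data
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        assign = d.argmin(axis=1)                 # hard cluster assignment
        for j in range(k):
            members = X[assign == j]
            if len(members):                      # leave empty clusters fixed
                C[j] = members.mean(axis=0)
    return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
```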
No, using k-means does not necessarily mean 1-of-k feature vectors. It does in this context, however, since that is what the paper is about. The only difference between triangular and hard k-means (in this paper) is the way the feature vectors are constructed. I did not go into detail about how many bits a triangular k-means feature vector can represent because I was too lazy to think about it at the time. From a quick guess, I would say d*(k-1) bits (all centroids but one are less than the mean - if all contributed, they would be equal, so this does not amount to k free real numbers), which is in any case an exponentially larger space than what hard k-means can do. (d is the number of bits needed to represent a real number, which is a nontrivial thing. :)
(Mar 01 '11 at 16:59)
Justin Bayer
Oh, sorry, now it makes sense. In the original post the K-means (tri) representation was referred to as "soft k-means", so I thought this referred to the clustering algorithm used. I should have actually read the paper before making such comments :) I guess the specific number of bits you can represent will depend on your code book, that is, on how close to orthogonal the cluster centroids are. Anyway, the triangular representation will of course be able to represent more information in all but degenerate cases. It would be interesting to see how a 1-of-k feature representation using a GMM would perform, and how placing a Gaussian kernel of fixed variance at each centroid fitted by k-means would fare. Actually, thinking about this makes me suspect that my original answer might be off target. Perhaps it is not the fitting of the GMM that is the problem, but the fact that if the mixture components have low variance, then the number of active dimensions will be very low if we use equation (4) in the paper.
(Mar 01 '11 at 18:09)
Oscar Täckström

I have not read the paper, but from personal experience: 1. Unless you tune your GMM pretty well, that is, choose the correct priors, you sometimes end up with crazy stuff and incredibly high processing times. 2. Bishop mentions in his book that it is good practice to make a couple of runs of k-means to obtain the priors for a good GMM. answered Feb 27 '11 at 22:36 Leon Palafox ♦
They do use a single iteration of K-means to initialize the GMM. In the footnotes, they say: "When K-means is run to convergence we have found that the mixture model does not learn features substantially different from the K-means result."
(Mar 01 '11 at 21:43)
Oleg Trott
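(A minimal sketch of the initialization scheme Leon and the footnote describe, assuming scikit-learn is available - `means_init` is a real `GaussianMixture` parameter in current versions; the toy data and all variable names are mine:)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated blobs as toy data
X = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
               rng.normal(+3.0, 1.0, size=(100, 2))])

# A short k-means run supplies the initial means; EM then refines the
# full mixture (covariances, mixing weights) from that starting point.
km = KMeans(n_clusters=2, n_init=1, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, means_init=km.cluster_centers_,
                      random_state=0).fit(X)
```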
I think it is also important to note that a GMM is nothing but the specific case of a mixture model where the component distributions are Gaussians. You can choose a different distribution if it better matches your data, and your mixture parameters become the parameters of that distribution rather than just the means; this is a flexibility that K-means does not give.
(Mar 01 '11 at 21:47)
Leon Palafox ♦

I've done a bunch of reading recently that gives some more context: In the paper above, Ng's group uses K-means for dictionary learning; that is, to learn small image patches such that a sparse set of them can reconstruct any image patch from the data. An extremely successful approach to this is sparse coding, which constructs vocabulary patches such that each data patch requires the activation of only a small number of component words. K-means normally assigns each data point to its closest word (centroid), while in soft k-means each data point is at least a little related to every word but heavily related to very few. Tri k-means is a successful heuristic for assigning each data point to a very few words, like sparse coding. This requires that each word/centroid be an important component in reconstructing a good handful of data points, but doesn't mind if it's completely useless for reconstructing faraway data. This is essentially the same as finding a good regularizer (meeting the Goldilocks criterion: not too strong, not too weak) on the vocabulary. answered May 28 '11 at 12:28 Jacob Jensen
There is a follow-up paper, The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization, arguing that it's not about finding the dictionary but about finding a good encoding. answered May 29 '11 at 10:28 Andreas Mueller
From what I see in the paper, they are not doing soft k-means. They are doing hard k-means (dictionary learning) and then use their soft function heuristic to create the features (encoding). answered Nov 24 '11 at 03:13 Mathieu Blondel
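(A minimal numpy sketch of the encoding step of that two-stage recipe, as I read the paper: the dictionary comes from ordinary hard k-means, which I omit here, and the "triangle" activation is f_k(x) = max(0, mu(x) - z_k(x)), where z_k is the distance to centroid k and mu is the mean of those distances. The function name is mine:)

```python
import numpy as np

def triangle_encode(X, centroids):
    """'Triangle' K-means encoding: any centroid farther from the point
    than the average centroid distance contributes 0, so the resulting
    feature vector is sparse without being one-hot."""
    z = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    mu = z.mean(axis=1, keepdims=True)    # per-point mean distance
    return np.maximum(0.0, mu - z)
```

Roughly half the features are zeroed for each point, which gives the sparse-coding-like behavior discussed above while staying far cheaper than actual sparse coding.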
There is a Google site dedicated to this: https://sites.google.com/site/kmeanslearning/
Interesting questions! I have two, probably stupid, questions: Why would we expect a mixture of Gaussians to be a good model for image features in the first place? And what would the results look like using raw image features?
To 3: I recently heard a talk where GMM features were used for acoustic models. It is some work by Simon Wiesler. Maybe the reference is http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5495228 but I'm not so sure whether this includes what he was talking about.
To 1: I don't see that GMMs do very poorly compared to k-means. If you compare hard K-means and GMMs, there is not SO much difference. Actually, I would expect the GMM to do slightly better - since, as Leon said, it is usually initialized with k-means, and I see it as a sort of refinement. The method that works really well in the paper is "soft k-means", which is a kind of ad-hoc method they created, as far as I know. It is somewhat similar to a local sparse coding in a way (like LCC), and I feel it does a little bit more than "just k-means".
Is there a good paper on the triangle encoding for k-means? The linked paper only glosses over it.