A paper by Coates, Lee & Ng presented at NIPS 2010 compared different machine learning approaches within the same convolution-based image classification framework.

I think to many people's surprise they found that, if whitening is applied to the data, K-means(tri), a "soft" K-means variant, generates better features for classification than such sophisticated approaches as deep autoencoders and stacked RBMs.


I couldn't help noticing that K-means (both "hard" and "soft") did better than Gaussian Mixture Models (GMMs), even though, computational efficiency aside, GMMs are often thought of as a "better", more robust version of K-means.

GMMs are actually quite similar to "soft" K-means, because cluster membership is "fuzzy", i.e. a matter of degree. So it is especially strange that, with whitening, GMMs produced the worst, while "soft" K-means produced the best results.


  1. Why would GMMs do so poorly compared to K-means?

  2. Has anyone tried reproducing these results? (I hope I'm not being rude to the authors, but bugs happen, and the more surprising the results, the higher the posterior probability of a mistake. Say, a poor GMM implementation could explain some of my surprise)

  3. Has anyone tried applying K-means to other problems where deep learning ruled so far, like speech recognition?

  4. I wonder what happens if you stack several layers of whitening + kmeans(tri)? It seems like it would be a natural thing for Ng's group to try, but this paper doesn't mention it.

Edit: to clarify the nomenclature:

  • "soft" K-means == kmeans (tri)
  • "hard" K-means == kmeans (hard)

asked Feb 27 '11 at 14:31


Oleg Trott

edited Mar 01 '11 at 21:47


There is a google site dedicated to this: https://sites.google.com/site/kmeanslearning/

(Feb 27 '11 at 14:41) Justin Bayer

Interesting questions! I have two, probably stupid, questions: Why would we expect that a mixture of Gaussians is a good model for image features in the first place? What would the results look like using raw image features?

(Feb 28 '11 at 03:01) Oscar Täckström

to 3: I recently heard a talk where GMM features were used for acoustic models. It is some work by Simon Wiesler. Maybe the reference is http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5495228 but I'm not so sure if this includes what he was talking about.

to 1: I don't see that GMMs do very poorly compared to k-means. If you compare hard K-means and GMMs there is not SO much difference. Actually, I would expect GMMs to do slightly better, since, as Leon said, a GMM is usually initialized with k-means, and I see it as a sort of refinement. The method that works really well in the paper is "soft k-means", which is kind of an ad-hoc method they created, as far as I know. It is somewhat similar to local sparse coding (like LCC), and I feel it does a little bit more than "just k-means".

(Feb 28 '11 at 09:41) Andreas Mueller

Is there a good paper on the triangle initialisation for k-means? The linked paper only glosses over it.

(May 30 '11 at 01:38) Robert Layton

5 Answers:

My guess at point 1: GMMs have many more parameters than soft k-means, since you also need to estimate covariance matrices. Actually, soft k-means is a special case of a GMM in which you assume a fixed and tied covariance matrix. Because the likelihood function is more complex, naively initializing a GMM is more prone to getting you stuck in bad local maxima, compared to soft k-means.
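To make the "special case" claim concrete, here is a minimal NumPy sketch (the data and cluster counts are made up for illustration): with equal priors and a shared isotropic covariance sigma2 * I, the Gaussian normalizers cancel and the GMM E-step responsibilities reduce to a softmax over negative squared distances, which is exactly a soft k-means assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))    # 5 illustrative data points in 2-D
mu = rng.normal(size=(3, 2))   # 3 illustrative centroids
sigma2 = 0.5                   # shared (tied) isotropic variance

# Squared distances from every point to every centroid, shape (5, 3)
sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)

# GMM responsibilities with equal priors and covariance sigma2 * I:
# a softmax over -||x - mu_k||^2 / (2 * sigma2)
logits = -sq_dist / (2 * sigma2)
resp = np.exp(logits - logits.max(axis=1, keepdims=True))
resp /= resp.sum(axis=1, keepdims=True)

# Responsibilities are a proper soft assignment: rows sum to 1,
# and the most responsible component is the nearest centroid.
assert np.allclose(resp.sum(axis=1), 1.0)
assert (resp.argmax(axis=1) == sq_dist.argmin(axis=1)).all()
```

As the shared variance shrinks toward zero, the softmax sharpens and the assignment becomes hard k-means.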

Further, looking at the curves, the more clusters you use, the better the classification. I would guess that with that many clusters, modeling the shape of the clusters would be less important, since you can always capture parts of the distribution with high complexity by using lots of simple clusters. Since you don't need to be able to interpret the results, there is not much loss in adding a couple of extra clusters rather than adequately modeling the shape of one cluster.

I don't have any intuition as to why hard k-means would perform so much worse than soft k-means.

answered Feb 28 '11 at 10:07


Oscar Täckström

edited Feb 28 '11 at 10:16


Well, if you use hard k-means your feature vector will be able to represent log k bits of information (with k being the number of clusters) while with soft k-means it's "somewhat more".

(Feb 28 '11 at 16:00) Justin Bayer

Hmm, this I don't understand. Just because I use hard k-means to induce features does not mean that I need to use a 1-hot representation for my data instances. I can find the clusters with hard k-means and then measure the distance to each centroid and use the resulting vector as my representation, or use more complex representations.

I also don't understand in what sense a feature vector is able to represent log k bits of information and what you mean by "somewhat more" with soft k-means. Soft k-means is an algorithm to find centroids so that likelihood is maximized. Once the centroids are found, the amount of information you can encode should not depend on the algorithm used to induce them. Am I missing something?

(Mar 01 '11 at 15:34) Oscar Täckström

No, using k-means does not necessarily mean 1-of-k feature vectors. It does in this context, however, since that is what the paper is about. The only difference between triangular and hard k-means (in this paper) is the way the feature vectors are constructed.

I did not go into detail about how many bits a triangular k-means feature vector can represent because I was too lazy to think about it at the time. From a quick guess, I would say d*(k-1) bits (all centroids but one are below the mean; if all contributed, they would be equal, and this would not count as k real numbers), which is in any case an exponentially larger space than what hard k-means can do. (d is the number of bits needed to represent a real number, and a non-trivial thing. :)

(Mar 01 '11 at 16:59) Justin Bayer
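For what it's worth, the two encodings being discussed can be contrasted in a few lines of NumPy. This is an illustrative sketch with made-up data, not the paper's code: hard k-means yields a 1-of-k indicator over the centroids, while the paper's triangle encoding, f_k(x) = max(0, mean(z) - z_k) with z_k the distance to centroid k, keeps graded but sparse activations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # 3 illustrative data points
C = rng.normal(size=(5, 4))   # 5 illustrative centroids

# Distance from every point to every centroid, shape (3, 5)
z = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1))

# Hard k-means encoding: 1-of-k indicator of the nearest centroid
hard = np.zeros_like(z)
hard[np.arange(len(X)), z.argmin(axis=1)] = 1.0

# Triangle encoding: zero for centroids farther than the mean distance,
# graded activation for the (roughly half) that are closer
tri = np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)

assert (hard.sum(axis=1) == 1).all()        # exactly one active word
assert (tri >= 0).all()                     # sparse, non-negative code
assert (tri > 0).sum() >= (hard > 0).sum()  # tri activates at least as many
```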

Oh, sorry, now it makes sense. In the original post the k-means (tri) representation was referred to as "soft k-means", so I thought this referred to the clustering algorithm used. I should have actually read the paper before making such comments :)

I guess the specific number of bits you will be able to represent will depend on your code book, that is, on how close to orthogonal the cluster centroids are. Anyway, the triangular representation will of course be able to represent more information in all but degenerate cases.

It would be interesting to see how a 1-of-k feature representation using a GMM would perform and how placing a Gaussian kernel of fixed variance at each centroid fitted by k-means would fare. Actually, thinking about this makes me suspect that my original answer might be off target. Perhaps it is not the fitting of the GMM that is the problem, but the fact that if mixture components have low variance, then the number of active dimensions will be very low if we use equation (4) in the paper.

(Mar 01 '11 at 18:09) Oscar Täckström

I have not read the paper, but in a personal experience:

1.- Unless you tune your GMM pretty well, that is, choose the correct priors, you sometimes end up with crazy stuff and incredibly high processing times.

2.- Bishop mentions in his book that it is good practice to make a couple of runs of k-means to obtain the priors for a good GMM.
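A minimal sketch of that practice using scikit-learn (the synthetic data and parameter choices are illustrative assumptions, not from the paper): run k-means first, then seed the mixture's means from its centroids via `GaussianMixture`'s `means_init` parameter.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Two well-separated synthetic clusters, purely for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(5.0, 1.0, (100, 2))])

# Step 1: a few k-means restarts to find good starting centroids
km = KMeans(n_clusters=2, n_init=5, random_state=0).fit(X)

# Step 2: fit the GMM, seeded with the k-means centroids
gmm = GaussianMixture(n_components=2,
                      means_init=km.cluster_centers_,
                      random_state=0).fit(X)
```

With this initialization, EM starts near a good optimum instead of relying on a random seed.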

answered Feb 27 '11 at 22:36


Leon Palafox ♦

They do use a single iteration of K-means to initialize GMM.

In the footnotes, they say "When K-means is run to convergence we have found that the mixture model does not learn features substantially different from the K-means result."

(Mar 01 '11 at 21:43) Oleg Trott

I think it is also important to notice that a GMM is nothing but the specific case of a mixture model where the component distributions are Gaussians.

You can choose a different distribution if it matches your data best, and your mixing parameters become that distribution's parameters instead of the means; this is a flexibility that K-means does not give.

(Mar 01 '11 at 21:47) Leon Palafox ♦

I've done a bunch of reading recently that gives some more context:

In the paper above, Ng's group uses K-means for dictionary learning; that is, to learn small image patches such that a sparse set of these patches can reconstruct any image patch from the data. An extremely successful approach to this is sparse coding, which constructs vocabulary patches such that each data patch requires only the activation of a small number of component words.

K-means normally assigns each data point to its closest word (centroid), while in soft k-means each data point is at least a little related to every word but heavily related to very few. Tri k-means is a successful heuristic to assign each data point to a very few words, like sparse coding. This requires that each word/centroid be an important component in reconstructing a good handful of data points, but doesn't mind if it's completely useless for reconstructing far-away data. This is essentially the same as finding a good regularizer (meeting the Goldilocks criterion: not too strong, not too weak) on the vocabulary.

answered May 28 '11 at 12:28


Jacob Jensen

There is a follow up paper, The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization, arguing that it's not about finding the dictionary but about finding a good encoding.

answered May 29 '11 at 10:28


Andreas Mueller

From what I see in the paper, they are not doing soft k-means. They are doing hard k-means (dictionary learning) and using their soft-function heuristic to create the features (encoding).

answered Nov 24 '11 at 03:13


Mathieu Blondel

