A paper by Coates, Lee & Ng presented at NIPS 2010 compared different machine learning approaches within the same convolution-based image classification framework.
I think, to many people's surprise, they found that, if whitening is applied to the data, K-means (tri), a "soft" K-means variant, generates better features for classification than such sophisticated approaches as deep autoencoders and stacked RBMs.
I couldn't help noticing that K-means (both "hard" and "soft") did better than Gaussian Mixture Models (GMMs), even though, computational efficiency aside, GMMs are often thought of as a "better", more robust version of K-means.
GMMs are actually quite similar to "soft" K-means, in that cluster membership is "fuzzy", i.e. a matter of degree. So it is especially strange that, with whitening, GMMs produced the worst results while "soft" K-means produced the best.
Edit: to clarify the nomenclature: by "hard" K-means I mean standard K-means, which assigns each point to exactly one centroid, and by "soft" K-means the K-means (tri) variant from the paper, where each point gets a graded, nonzero degree of membership in more than one cluster.
My guess at point 1: GMMs have many more parameters than soft K-means, since you also need to estimate covariance matrices; for d-dimensional data, a full-covariance GMM must fit an extra d(d+1)/2 covariance entries per cluster. In fact, soft K-means is a special case of a GMM in which you assume a fixed covariance matrix that is tied (shared) across all components. Because the likelihood function of the full model is more complex, naively initializing a GMM is more prone to getting you stuck in bad local maxima than soft K-means is.
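To make the relationship concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) of the soft-assignment step. With a fixed, tied isotropic covariance sigma^2 * I and equal mixing weights, the GMM responsibilities reduce to soft K-means memberships, and as sigma -> 0 they collapse to hard K-means assignments:

```python
import numpy as np

def soft_assignments(X, centroids, sigma=1.0):
    """Soft K-means memberships, i.e. GMM responsibilities under a
    fixed, tied isotropic covariance sigma^2 * I and equal mixing weights.

    X:         (n, d) data matrix
    centroids: (k, d) cluster centers
    """
    # Squared Euclidean distances to every centroid, shape (n, k)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    # Unnormalized log-responsibilities; subtract the row max for stability
    logits = -d2 / (2.0 * sigma**2)
    logits -= logits.max(axis=1, keepdims=True)
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

# As sigma -> 0, each row approaches a one-hot vector, i.e. hard K-means.
```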
Further, looking at the curves, the more clusters you use, the better the classification. I would guess that with that many clusters, modeling the shape of the individual clusters becomes less important, since you can always capture a high-complexity part of the distribution with lots of simple clusters. And since you don't need to be able to interpret the results, there is little lost in adding a couple of extra clusters rather than adequately modeling the shape of one.
I don't have any intuition as to why hard K-means would perform so much worse than soft K-means.
I have not read the paper, but from my personal experience:
1. Unless you tune your GMM pretty well, that is, choose the correct priors, you sometimes end up with crazy results and incredibly high processing times.
2. Bishop mentions in his book that it is good practice to make a couple of runs of K-means to obtain the initialization for a good GMM (a sketch of this follows below).
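For reference, a minimal sketch of that practice, assuming scikit-learn (GaussianMixture can run K-means internally to initialize, or you can pass in centers from your own K-means run):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))  # toy data standing in for real features

# Option 1: let GaussianMixture initialize itself with internal k-means runs
gmm = GaussianMixture(n_components=10, init_params="kmeans", n_init=3).fit(X)

# Option 2: seed the component means from an explicit k-means run
km = KMeans(n_clusters=10, n_init=5).fit(X)
gmm2 = GaussianMixture(n_components=10, means_init=km.cluster_centers_).fit(X)
```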
answered Feb 27 '11 at 22:36
Leon Palafox ♦
I've done a bunch of reading recently that gives some more context:
In the paper above, Ng's group uses K-means for dictionary learning, that is, to learn small image patches such that a sparse set of these patches can reconstruct any image patch from the data. An extremely successful approach to this is sparse coding, which constructs a vocabulary of patches such that each data patch requires the activation of only a small number of vocabulary words.
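For concreteness (my notation, not necessarily the paper's), sparse coding learns a dictionary $D$ and a code $z$ for each patch $x$ by minimizing

$$\min_{D,\,z}\; \|x - Dz\|_2^2 + \lambda \|z\|_1,$$

where the $\ell_1$ penalty is what forces each patch to use only a few vocabulary words.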
K-means normally assigns each data point to its single closest word (centroid), while in soft K-means each data point is at least a little related to every word but heavily related to very few. Tri K-means is a successful heuristic for assigning each data point to a very few words, much like sparse coding does. This requires that each word/centroid be an important component in reconstructing a good handful of data points, but doesn't mind if it's completely useless for reconstructing far-away data. This is essentially the same as finding a good regularizer (meeting the Goldilocks criterion: not too strong, not too weak) on the vocabulary.
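Here is a minimal numpy sketch of that "triangle" activation as described in the Coates, Lee & Ng paper: f_k(x) = max(0, mean(z) - z_k), where z_k is the Euclidean distance from x to centroid k, so only closer-than-average centroids receive nonzero activation:

```python
import numpy as np

def triangle_encode(X, centroids):
    """K-means (tri) features: f_k(x) = max(0, mean(z) - z_k),
    with z_k = ||x - c_k||. Each point activates only the centroids
    that are closer than average, giving a sparse, soft encoding.
    """
    # Euclidean distances to all centroids, shape (n, k)
    z = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1))
    mu = z.mean(axis=1, keepdims=True)  # per-point mean distance
    return np.maximum(0.0, mu - z)
```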
answered May 28 '11 at 12:28
There is a follow-up paper, "The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization", arguing that it's not about finding the dictionary but about finding a good encoding.
answered May 29 '11 at 10:28
From what I can see in the paper, they are not doing soft K-means. They are doing hard K-means (dictionary learning) and use their soft "triangle" function as a heuristic to create the features (encoding); a sketch of the full pipeline follows below.
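To make that split concrete, a minimal end-to-end sketch (my own illustration, assuming scikit-learn and SciPy; the 108 dimensions stand in for whitened 6x6x3 image patches):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
patches = rng.normal(size=(5000, 108))  # stand-in for whitened patches

# Dictionary learning: plain hard k-means on the patches
centroids = KMeans(n_clusters=256, n_init=3).fit(patches).cluster_centers_

# Encoding: the soft "triangle" activation on top of the hard dictionary
z = cdist(patches, centroids)  # (5000, 256) Euclidean distances
features = np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)
```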
answered Nov 24 '11 at 03:13