Hello:

I just spent some time implementing the Mixture of Gaussian Algorithm as described in Bishop's Book, as well as in Andre NG's Lectures.

Although it is very good when initializing with values more or less close to the real means of the clusters, it has some issues when doing an initialization with somehow close values between them (similar means), most times it will converge to a single large covariance Gaussian that will cover both clusters (2 clusters case)

Bishop mentions that Mix of Gau should be initialized with an initial run of KL Means.

My question is, if you already ran K-L Means, is it that much the gain in classification you get from the mixture.

I understand that if your objective is to generate a couple of Gaussian mixtures to do later recognition this is good, as you only have to evaluate the likelihood of the the data with your trained model.

Is there any other reason to use Mixture of Gaussians rather than KL means

asked Oct 29 '10 at 04:48

Leon%20Palafox's gravatar image

Leon Palafox
31265471107

Hi,

How did you implemented the code? can you share some details? I am also reading the book and the lectures/notes from Andrew's website. Any help is greatly appreciated.

(Oct 29 '10 at 06:12) kapil_dalwani

The code is not that hard to implement, I did it in Matlab. It took me little time, but is far from being a clean code, I did it mainly to understand the algorithm and what was it doing.

(Oct 29 '10 at 08:25) Leon Palafox

If one Gaussian distribution has large variance, what about the other Gaussian ? You do need to fix the number of Gaussian distributions before running EM to learn their parameters. So, try looking at the parameters of the other distribution. Is it similar to the first Gaussian. If yes, then you need to run the code for a large number of iterations. If no, then you might be stuck in the problem described by @Philemon. Bishop proposes a fix for that.

(Dec 02 '10 at 00:04) Aman

6 Answers:

Converging to a single large cluster really shouldn't happen with K=2, specially since the likelihood is always higher for a "small" covariance matrix, and this is what the algorithm will prefer. If you put two means at epsilon from each other with similar covariances, almost all the assignments will have similar probability (in the E step), so they will take a while to diverge, but they will eventually diverge, unless you're using hard EM.

There are many reasons to use mixture-of-gaussians modeling, and not all of them are directly related to clustering. It is true, however, that very often there aren't very large gains in performance/accuracy in moving up from k-means to mixtures of gaussians. A nice usage of mixture of gaussians that would be harder with k-means is background removal in surveillance camera data, as is Stauffer and Grimson Adaptive background mixture models for real-time tracking.

answered Oct 29 '10 at 07:01

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1896744214334

The k-means++ algorithm is a seeding algorithm for k-means, but would also generate good starting points for EM. It is shown in the paper that its a benefit to use k-means++ to initialise rather than just run heaps of runs of k-means and choosing the best one.

answered Dec 01 '10 at 23:02

Robert%20Layton's gravatar image

Robert Layton
1520102337

I think the main point of Bishop's suggestion to use k-means first is that it generally converges a lot faster to a reasonable solution. In your case slow convergence from the initial conditions seems to be exactly the problem you want to solve so I'd say doing some iterations of k-means first might indeed help a lot even though running EM for the gaussian mixtures should eventually also converge.

Could it be that you are experiencing means that are collapsing on data points? This could lead to one gaussian with almost zero variance that only models one data point and one with very high variance that models the rest of the data. The Bishop book talks about this issue and some heuristic to prevent it from happening as well.

answered Oct 29 '10 at 07:54

Philemon%20Brakel's gravatar image

Philemon Brakel
153092244

The utility of Gaussian mixture modeling (likely via EM) vs. k-means depends on your analysis goals. The k-means algorithm yields a set of cluster means and a (hard) partition of the data points, which may be sufficient if your goal is to partition the data, or identify representative points for each of k clusters, or to do vector quantization. The GMM approach additionally yields a probabilistic model (Gaussian) for each cluster, which is useful if you require posterior membership estimates or partial membership in multiple clusters. It can also be more effective if your clusters are of very different sizes, or extremely non-(hyper)-spherical in feature space, since the default Euclidean metric used for k-means weights all dimensions equally, but the EM Gaussians can have a different standard deviation in each dimension.

Note that EM is more general than just the estimation of Gaussian cluster models. The alternating parameter estimation/optimization process can be applied to other types of data modeling, as well.

answered Dec 04 '10 at 16:22

Kiri%20Wagstaff's gravatar image

Kiri Wagstaff
464

The advantage of k-means is that it's fast, that of EM is that it's more accurate and honors the covariance-structure of the data. Since EM is slower it therefore makes sense to approximate the solution by running k-means first and then improving on the solution by running EM.

EM is relatively robust against getting trapped in local optima but not immune - especially the more data you have and the higher the dimensionality is. If you believe that you are getting trapped in local optima, a good heuristic (although a slow one) is to initialize k-means with more components, say 3*k. Then you repeatedly select k components from the k-means solution and use them to initialize EM. In the end you select the solution with the highest likelihood.

answered Nov 04 '10 at 17:45

Hannes%20Bretschneider's gravatar image

Hannes Bretschneider
162

One great advantage of Gaussian Mixtures is that if you use Dirichlet priors on the mixture components, the number of Gaussians that are actually used is adapted to the dataset.

answered Dec 02 '10 at 09:42

Andreas%20Mueller's gravatar image

Andreas Mueller
1817133671

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.