I've taken a couple classes that have made use of the EM algorithm in graphical models, but they always seem to have some form specific to the model at hand. Given that I have a model with some seed probabilities over 2 sets of random variables O (observed) and U (unobserved), what are the general steps necessary for performing the EM Algorithm?

asked Aug 02 '10 at 04:50

Daniel%20Duckwoth's gravatar image

Daniel Duckwoth
789162635

edited Aug 02 '10 at 11:38


4 Answers:

It might be instructive to take a look at this paper which presents the EM algorithms from a number of perspectives (and variants of EM), apart from the standard view of the E step computing the expectations of the latent variables and the M step maximizing the expected complete (i.e., in terms of latent + observed) log likelihood of the data.

answered Aug 02 '10 at 09:01

spinxl39's gravatar image

spinxl39
3458104368

edited Aug 02 '10 at 09:05

That's a very good reference, thanks!

(Aug 02 '10 at 14:12) Alexandre Passos ♦

Is it just me, or is the link down?

(Aug 02 '10 at 21:42) Daniel Duckwoth

It's back up for me.

(Aug 02 '10 at 21:45) Alexandre Passos ♦

I'm having some trouble understanding the notation of the paper. There's frequent integrals with "dx" when no variable "x" exists, and on page 2 I fail to see how the equation for L(theta)-L(theta') goes from line 2 to 3 or 3 to 4.

(Aug 05 '10 at 03:44) Daniel Duckwoth

@Daniel Duckworth: About that equation, yes, it should be dz, and seems to be a typo. Line 3 follows from 2 because p(z,y|theta') = P(z|y,theta') P(y|theta'). Line 4 follows from 3 because, sinze P(y|z,theta) is independent from theta (by the conditional independent assumption inherent in the model), you can write P(z,y|theta) = P(y|z,theta)P(z|theta), and P(z,y|theta') = P(y|z,theta')P(z|theta'), and, since P(y|z,theta') = P(y|z,theta), you can cancel them out.

(Aug 05 '10 at 05:04) Alexandre Passos ♦

This paper by Collins might help you, or this one.

The general form of the EM algorithm is

E step: given the current values of the parameters for the observed values, compute the expectation of the unobserved values

M step: given that expectation, choose the current values of the parameters that maximize the likelihood.

EM maximizes the expectation of the log likelihood given the observed data.

answered Aug 02 '10 at 07:36

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1893744214333

Thanks! Could you elaborate on the M-step a bit? I understand what it means to want the Maximum Likelihood estimate, but is there a general form to do that for all distributions in the Exponential family?

(Aug 02 '10 at 11:39) Daniel Duckwoth

I think there is, since all exponential family distributions can be expressed as (modulo normalization constant) exp(Nparam dot F(data)), so the log of that is just Nparam dot F(data), which has an easy to find maximum in terms of the natural parameters, since it's a linear function.

(Aug 02 '10 at 14:44) Alexandre Passos ♦

Also, are all Exponential Family distributions convex? In my classes we sort of took that on faith (take derivative, set to 0, and assume that's the MLE estimate).

(Aug 02 '10 at 21:39) Daniel Duckwoth

They are log-linear on the natural parameters and sufficient statistics, so the log likelihood at least has to be convex.

(Aug 02 '10 at 21:42) Alexandre Passos ♦
-1

EM???

A) Usually, EM is considered as a special example of MM algorithm(majorize/minimize)

B) The paper below provides a REALLY COOL view about EM algorithm.

It generalizes EM to MM(not the one in A) e.g. K-means) and EE(Meanfield Approximation)!!!

www.ics.uci.edu/~welling/publications/papers/BKM_techrep.pdf

C) Basically:

i) In a probabilistic model, we have x(observation), w(hidden variable) and alpha(hyperparameter). For the generative approach to clustering problem, hidden variables w are divided into two classes: cluster assignment z(discrete) and generative model parameters theta(continuous)

The goal is to maximize P(x|alpha), which is NP-hard in general

For variational methods, the goal is to put a lower bound: to minimize the KL divergence between q(z,theta) and the posterior distribution $p(w |vec x,alpha)$.

In general, we factorize q(z,theta)=q(z)q(theta), which can either be a point estimation or distribution estimation.

Ok, Here we go:

EM:

(E-step)distribution estimate z (multinomial distribution)
(M-step) point estimate \theta  (MLE estimator)

MM:(K-means)

(M-step) point estimate z (Greedily assign the z)
(M-step) point estimate \theta  (MLE estimator)

answered Aug 02 '10 at 20:53

Eminemya's gravatar image

Eminemya
15223

Could someone elaborate why this post is so heavily down-voted?

(Aug 02 '10 at 21:38) Daniel Duckwoth

Probably because it's not a faithful description of the paper...

The cited paper describes 4 perspectives on iterative optimization, EM, EE, ME, and MM. (The paper describes these as "alternating model learning") . This post proposes to generalize EM as MM, despite their differences. EM isn't a special case of MM, it's just a different way of performing optimization of a joint by iterating between optimizations of partial descriptions (marginals or conditional parameters)

(Aug 02 '10 at 22:52) Andrew Rosenberg
-2

[This is wrong. Ignore it]

I'm not a big fan of the standard "General Form" which Alex and Spin cites, as it complicates things unnecessarily. I prefer to see the EM algorithm as something used on a model with two sets of Latent Variables, each of which can be maximized easily with the other known.

What does that mean? Think of it as a kind of a "Catch-22" situation. To know X, you need to know Y. To know Y, you need to know X. The serpent eats it's tail. For example, in Mixture models, to estimate the parameters of the model, you need to know which points belong to which part of the mixture. But to know that, well you need to know the model parameters. Damn! How does one make progress?

Well, by the most obvious way possible. Just start wherever. Make a first guess of X. Then, using X, estimate Y. Then using Y, estimate X, and so on, till you get somewhere. Since each step guarantees an improvement, you're always going to be better off than before. Done.

The E-step and the M-step are not different as they appear. Maximizing the log likelihood is the same as maximizing the expected likelihood (this proof is left as an exercise to the reader), but this has become, IMHO, a source of great confusion. The reason for this silliness is because statisticians like to make the arbitrary distinction between "parameters" of a model, and latent variables. So they, argue, it does not make sense for a log likelihood to be maximized on latent variables, only the expected log likelihood. This is utter tripe, and is just the kind of thing statisticians get their knickers in a knot about. You are best off ignoring that bullshit

answered Aug 02 '10 at 13:41

dogy's gravatar image

dogy
3065918

edited Aug 02 '10 at 19:13

Not really, no. When you look at EM as an inference algorithm for graphical models, the distinction between parameters and latent variables is clear, and it follows easily from the distinction between observed and complete data.

Also, EM in its standard form does depend on the difference between parameters and latent variables. EM works by marginalizing out the latent variables and maximizing over the parameters, and you get different algorithms if you treat them differently. For example, if you maximize over the latent variables as well you get hard EM (also known as max-product or viterbi EM). A nice table representing the known options can be seen in http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.127&rep=rep1&type=pdf .

(Aug 02 '10 at 14:19) Alexandre Passos ♦

The Neal and Hinton paper that Alexandre cited presents EM as a maximizaton-maximization over two unknowns, namely the distribution over latent variables and the parameters (and the one I cited presents this views as one of the explanations of EM). Each maximization w.r.t. one unknown is done assuming that the other has been estimated (thus "known") from the previous step. So I don't think it differs in anyway from the explanation of EM you provided above.

Note that one doesn't really perform the two maximizations -- the one w.r.t. the distribution over latent variables is (usually) replaced by computing expectations of latent variables which are needed for computing the expected likelihood which is maximized in the M step. That's why many people explain EM as doing the alternate expectation and maximization step. IMHO, I don't think such a view introduces any confusion.

(Aug 02 '10 at 14:21) spinxl39

I believe, on second thought that Alex may be right. If I can come up with a simple counterexample where E[...Max[E[Max[p(x,a,b]]...] differs from Max{...Max{Max{p(x,a,b)}}...}, that will clarify things. The paper cited states different algorithms, I know, but they may be conceptually the same.

(Aug 02 '10 at 15:16) dogy

For example, take a mixture of gaussians model. If you maximize the parameters (means, variances) over the expectation of the latent cluster membership variables, you have to deal with fractional cluster memberships. If you just maximize-maximize, you're only dealing with hard assignments.

(Aug 02 '10 at 15:30) Alexandre Passos ♦

No, don't think that's a good example. If I only maximize over the means only, yes we do get the K-means algorithm, what you call "hard assignments". But if you maximize over both the mean and the variances, you can get the "soft assignments". The difference here seems to be a conceptual one.

(Aug 02 '10 at 15:44) dogy

Not really. You can have hard assignments and maximize over the variances as well, and have soft assignments and maximize over the means.

For example, assume we have one dimension, three points, at (1, 2, and 3), and two means (currently at 1.5 and 2.4). Using hard assignments, after updating, we get the two means at 1.5 and 3. Using soft assignments (assuming variance 1) the means should both move towards 2. I didn't do the math out of laziness, but you're welcome to verify this for yourself.

(Aug 02 '10 at 15:50) Alexandre Passos ♦

Well, let's dissect your example from two different perspectives

  • On one hand, as the variance in your "soft assignments" algorithm assumed increases, then the new means will approach the values of the hard assignment 1.5 and 3 correct? So you can see the hard assignments problem as a special case of the soft assignments problem.

  • On the other hand, in the assumptions of the "hard assignments" problem, we can maximize over the means, the individual variances and assume that the covariances are exactly 1. Doing that (I believe) will "draw" the new means closer together as we are imposing a squared penalty if they are too far apart. So the hard assignments problem is a special case of the soft assignments problem.

Since they are both special cases of each other, then they are equivalent.

(Aug 02 '10 at 16:15) dogy

Erm, what?

Leaving the variances aside, the maximize-maximize EM algorithm is:

E step: assign each point to the closest gaussian M step: replace the mean of each gaussian by the mean of its points

So, if we have points at 1, 2, and 3, and means at 1.5 and 2.4, the E step produces the assignments 1 2 2, and the M step produces the means 1, 2.5

The soft algorith is:

E step: for each point, assign to it a distribution over the gaussians, weighted by their likelihood M step: for each gaussian, compute its mean as (sum_i W_i x_i )/(sum_i W_i), where W_i are the distributions obtained from the E step.

So, if we have points at 1, 2, and 3, and means at 1.5 and 2.4, the E step assigns the distributions: [ 0.84683628, 0.15316372], [ 0.47751518, 0.52248482], and [ 0.13124447, 0.86875553], and the new means are 1.508 and 2.463.

So they are different, yes.

(Aug 02 '10 at 16:27) Alexandre Passos ♦

Also, for the specific case of gaussian likelihood, the hard and soft algorithms become more similar when the variance decreases (if they do, the probabilities converge to a single one and a lot of zeros). But his point is moot if you reestimate the variance, which you can do both in the maximization-maximization (hard assignments) and expecation-maximization (soft assignments) algorithms.

(Aug 02 '10 at 16:35) Alexandre Passos ♦

Sorry if that didn't make sense.

Leaving the variances aside, the maximize-maximize EM algorithm is:

E step: assign each point to the closest gaussian M step: replace the mean of each gaussian by the mean of its points

Aha! This is where you and I differ. My version of the E-step is this

E step: maximize the log-likelihood over the means and variances of the joint (multivariate-gaussian) distribution over all means.

In other words, in your example, with three data-points and 2 means, we are finding the maximum value of a distribution from the sample (this is also E[X] - hence the equivalence)

X ~ MultivariateNormal([1.5,3], Sigma)

Where

Sigma is a 2x2 matrix which looks like this. The variances are either actively re-estimated or assumed to be constant.

[[v1, 1] [1, v2]]

In other words, once you do this, the maximum of that distribution does not occur at

(1 + 2)/2 and (3)/1

But somewhere in the middle. They would, however, if

Sigma = [[v1, 0] [0, v2]]

(Aug 02 '10 at 16:55) dogy

Usually the E step adjusts the latent variables and the M step the parameters. Your M step seems to jumble everything together. Also, you don't need a covariance matrix in one-dimensional data, only a variance/precision parameter. I don't understand which distribution are you maximizing. The likelihood in this model is not a multivariate gaussian, it's a mixture of gaussians, something like P(x) = alpha P_gaussian(x; mean_1, sigma_1) + (1-alpha)P_gaussian(x; mean_2, sigma_2).

What are you maximizing, exactly, and how? In what I described, you maximize over the latent variables (ie, the variable determining from which gaussian came a specific point), and this is clearly done by assigning them to the closes point. Computing their expectation is what you get in the soft EM.

(Aug 02 '10 at 17:02) Alexandre Passos ♦

Hey Alex, I am maximizing over the likelihood, or the log likelihood of the k-dimensional Gaussian with the means [1.5,3] and the covariances defined above (note I edited a typo) Specifically, the minimum of the function (this is the "energy")

http://mathurl.com/28lg7z2.png

(Aug 02 '10 at 17:09) dogy

note that without the third term, the maximum occurs exactly at k1, k2. Not so with the third term.

(Aug 02 '10 at 17:10) dogy

@dogy, this sounds like a completely different hierarchical model, almost like a gaussian process prior (ie, each mean is one dimension of a multivariate gaussian) for the mixture of gaussians model; but it still doesn't explain how you can maximize over it without marginalizing over all values of the latent variables, which is intractable.

(Aug 02 '10 at 17:12) Alexandre Passos ♦

Also, how do you get to this equation, from the traditional graphical model formulation of the mixture of gaussians density estimation problem?

(Aug 02 '10 at 17:13) Alexandre Passos ♦

Yes, you're right, it's like a Gaussian process prior. You see Gaussians are magic in that way, as doing marginalization = finding the mode of the distribution = doing a maximization on a quadratic function. This is still slightly unbelievable to me :)

(Aug 02 '10 at 17:15) dogy

It's a bit complex to type out. It's pretty standard stuff in a stats class, you take the log of the distribution, normalization constants fall out, you're only doing an expectation over the thing which is in the exponent of the numerator.

(Aug 02 '10 at 17:18) dogy

Because the mixture of gaussians likelihood is not at all hard to type out; L(x, m1, s1, m2, s2) = 1/2 Pr(x;m1,s1) + 1/2 Pr(x;m2,s2); since this is a sum, you can't just take the log to get rid of the exponent, and it involves no priors on the means m1 m2 or the variances s1 s2. By making the variances constant and equal to one, I mean the likelihood as L(x, m1, m2) = 1/2 Pr(x;m1,1) + 1/2 Pr(x;m2,1). To the get the complete-data likelihood you just multiply this function for all x. Gaussian conjugacy can enter if you have gaussian priors on the means (not gaussian process priors!), which would make the complete data likelihood L(x, m1, m2) = prod_i (1/2 Pr(x_i, m1, 1) + 1/2 Pr(x_i, m2, 1)) prod_j Pr(m_j, hyper mean, hyper variance). Then, the posterior mean is a function of the sample mean and the hyper mean (and the number of samples), due to conjugacy. It has nothing to do with the means being the dimensions of a gaussian process, the means are independent samples from it.

(Aug 02 '10 at 17:25) Alexandre Passos ♦

You are making a classic mistake here. The log likelihood is a sum, but the likelihood (with a big L) is a product. The log likelihood is a sum.

Pr(x;m1,s1)Pr(x;m2,s2) = Pr(x;[m1,m2],S)

Maybe I should type this out in latex

(Aug 02 '10 at 17:34) dogy

The log likelihood is a sum if and only if the likelihood is a product. In a mixture model, however, the likelihood is a sum.

(Aug 02 '10 at 17:36) Alexandre Passos ♦

you're right, i see the mistake. I'm not thinking of a mixture model at all. This also confirms the original point, which was that the maximum is not the same as the expectation.

(Aug 02 '10 at 17:48) dogy

for what it's worth, this has been an excellent discussion (for me, perhaps more an exercise in patience for you) thank you for correcting a deep seated misunderstanding on my part.

(Aug 02 '10 at 17:56) dogy

You're welcome. I'd read up on one of the referenced texts, btw, they are very interesting and show nice aspects of the EM algorithm.

(Aug 02 '10 at 17:59) Alexandre Passos ♦
showing 5 of 23 show all
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.