Many probabilistic models are typically optimized with expectation maximization: HMMs/LDS, mixture models, factor analysis, and sometimes even pPCA. For all of them, the log likelihood can be written down easily, and so can its derivative. I wonder if anyone has practical experience with using gradient-based optimizers such as SGD, CG, LBFGS, RPROP, etc. directly on the log likelihood. I especially have the following questions in mind:
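To fix ideas, here is a minimal sketch of what "gradient-based optimization directly on the log likelihood" could look like for a toy two-component Gaussian mixture, optimizing only the means with L-BFGS (the data, the fixed equal weights, and the unit variances are all assumptions made for illustration, not part of the question):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)
# toy data from a two-component 1-D mixture
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def neg_marginal_loglik(means):
    # marginal log-likelihood with equal weights and unit variances:
    # log p(x_i) = logsumexp_k [ log(1/K) - 0.5 * (x_i - mu_k)^2 - 0.5 * log(2 pi) ]
    log_comp = (np.log(1.0 / len(means))
                - 0.5 * (x[:, None] - means[None, :]) ** 2
                - 0.5 * np.log(2 * np.pi))
    return -logsumexp(log_comp, axis=1).sum()

# L-BFGS on the (non-convex) marginal log-likelihood; the gradient is
# approximated by finite differences here, but could be supplied analytically.
result = minimize(neg_marginal_loglik, x0=np.array([0.0, 1.0]), method="L-BFGS-B")
print(result.x)  # estimated component means
```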
There are two things which could roughly be described as directly optimizing the likelihood. The first is using EM to bound the true marginal log-likelihood function, but optimizing the EM bound (as a function of Z and theta) with true optimizers instead of doing coordinate ascent. The other is to not use the EM bound at all.

The second is really not practical in most cases, because you'd have to sum an exponential number of terms to compute the normalizing constant (which, as a function of the parameters, is not a constant at all here; roughly speaking, it is the equivalent of the feature expectations according to the model in logistic regression), and that sum is needed to compute the gradient and the Hessian. In the clustering case, for example, you'd have to sum over all possible clustering assignments of all points jointly, since for each value of these assignments you have different optimal parameter settings (the EM bound is derived by assuming independence). In models like LDA, however, that might be practical.

So from now on I'll assume you're talking about the first alternative: using true optimization algorithms to optimize the EM bound as a function of Z and theta. You will find some discussion of this in a question I asked a while ago on this website. The main motivation for not using gradient optimizers is that coordinate ascent in these models is just too easy, and most people are afraid of constrained optimization (it needs to be constrained, as every Z has to be positive and all Zs must sum to 1). So, to your specific points:
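To make the first alternative concrete, here is a rough sketch (my own illustration with assumed names, not code from the answer) of the EM bound written out as a joint function of the responsibilities Z and the parameters for a 1-D Gaussian mixture. Ordinary EM is coordinate ascent on exactly this function; a gradient-based alternative would climb it jointly in (Z, theta), which is where the constraints on Z come in:

```python
import numpy as np
from scipy.stats import norm

def em_bound(Z, weights, means, sigmas, x):
    """EM lower bound F(Z, theta) = E_Z[log p(x, z | theta)] + H(Z).

    Z has shape (n_points, n_components); each row must be non-negative
    and sum to one, which is the constraint mentioned above."""
    log_joint = (np.log(weights)[None, :]
                 + norm.logpdf(x[:, None], loc=means[None, :], scale=sigmas[None, :]))
    expected_complete = np.sum(Z * log_joint)               # E_Z[log p(x, z | theta)]
    entropy = -np.sum(Z * np.log(np.clip(Z, 1e-12, None)))  # H(Z)
    return expected_complete + entropy

# Coordinate ascent (plain EM) on this bound:
#   E-step: for fixed theta, the optimal Z[i, k] is proportional to
#           weights[k] * N(x_i | means[k], sigmas[k]), with rows renormalized.
#   M-step: for fixed Z, weights, means, and sigmas have closed-form
#           weighted-average updates.
# A "true" optimizer would instead ascend F in (Z, theta) jointly, handling
# Z >= 0 and Z.sum(axis=1) == 1 by projection, a barrier, or reparametrization.
```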
I was actually asking about the second alternative. I was not aware that EM is also practical in cases where the partition function cannot be computed efficiently. For the models I am using, the normalizing constant is not an issue, however: constraints on the parameters make sure the density function integrates to one (e.g. for a mixture of Gaussians, all you have to do is make sure that the responsibilities are positive and sum to one, which can be done by optimizing with respect to unconstrained parameters that are passed through a softmax/Boltzmann distribution to obtain the actual responsibilities). Thanks for the link to the other question.
(Aug 27 '11 at 16:01)
Justin Bayer
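Roughly, the softmax/Boltzmann reparametrization described above could look like the following for the mixture weights and scales of a 1-D Gaussian mixture (a sketch under my own naming, not code from the thread); because every parameter is unconstrained, a plain optimizer such as L-BFGS can be applied directly to the log likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def unpack(params, K):
    # unconstrained parameters: K weight logits, K means, K log standard deviations
    logits, means, log_sigmas = np.split(params, [K, 2 * K])
    log_weights = logits - logsumexp(logits)  # log-softmax: weights are positive and sum to one
    return log_weights, means, np.exp(log_sigmas)

def neg_loglik(params, x, K):
    log_w, means, sigmas = unpack(params, K)
    log_comp = (log_w[None, :]
                - 0.5 * ((x[:, None] - means[None, :]) / sigmas[None, :]) ** 2
                - np.log(sigmas[None, :]) - 0.5 * np.log(2 * np.pi))
    return -logsumexp(log_comp, axis=1).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])
init = np.concatenate([np.zeros(2), np.array([-1.0, 1.0]), np.zeros(2)])
result = minimize(neg_loglik, init, args=(x, 2), method="L-BFGS-B")
print(unpack(result.x, 2))  # fitted log-weights, means, standard deviations
```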
@Justin: In mixtures of Gaussians the Zs are actually discrete variables; the choice of a vector of responsibilities that sums to one is a relaxation involved in EM. (Think about the mixture model generatively: for each point you first pick a mean/covariance pair, either with uniform probability or by sampling from a global vector of probabilities phi (in the case where you integrate out the mixture proportions), and then from that single mean/covariance pair you sample the point at hand.) So if you know which mixture component each point was sampled from (the latent variables), you can write down the log-likelihood as a sum of the priors of the mixture components plus a sum over points of a Gaussian centered on the right component. If you don't know the latent variables then you have to marginalize them out, and you have to marginalize over all possible z variables jointly, since (for example) if you permute the cluster assignments you need to get the same probabilities out. This induces a multimodal posterior, which is the source of the problem, hence the EM bound. You can see more details in one of the many questions about EM on this site: http://metaoptimize.com/qa/questions/1655/what-is-a-general-form-of-the-em-algorithm (see spinxl39's answer or my own for references).
(Aug 27 '11 at 19:51)
Alexandre Passos ♦
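In symbols (my own notation, just to spell out the two quantities the comment contrasts): with known assignments $z_i$, the complete-data log-likelihood is

$$
\log p(x, z \mid \theta) = \sum_i \left[ \log \pi_{z_i} + \log \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}) \right],
$$

whereas with the assignments unknown they are marginalized out,

$$
\log p(x \mid \theta) = \sum_i \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k).
$$

The marginal is invariant under relabelling the $K$ components, so the likelihood surface has at least $K!$ symmetric modes, which is the multimodality the comment refers to.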
Thanks, very insightful.
(Aug 28 '11 at 11:58)
Justin Bayer