
I do know of this paper, which compares CG and EM for certain models, shows varying levels of performance, gives some conditions for the convergence of EM, and also considers hybrid methods.

However, I don't know of any work on using standard gradient-based optimization techniques for variational Bayesian models, and I don't see anything stopping one from computing the gradient (or Hessian) of the variational lower bound on the likelihood. The optimization literature keeps pointing out that Newton and quasi-Newton methods are much better than coordinate ascent (and plain gradient ascent) when the variables are correlated, and in many standard variational algorithms, such as the one for LDA, the variational parameters are known to be correlated (see this paper if you doubt that statement).

So my question is: is there an obvious reason, other than historical precedent, analogy with EM, ease of fitting into the message-passing framework, and ease of derivation (just set each partial derivative to zero analytically), for avoiding standard optimization algorithms in probabilistic settings, or is this an open area for research?
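
To make this concrete, here is a minimal sketch (my own toy example, not taken from any of the papers mentioned) of doing exactly that: plain gradient ascent on the variational lower bound for a trivially small model, observations x_i ~ N(mu, 1) with prior mu ~ N(0, 1) and q(mu) = N(m, exp(2s)):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=50)    # observed data
    N = len(x)

    def elbo_and_grad(m, s):
        """ELBO and its gradient for q(mu) = N(m, exp(2s)); additive constants dropped."""
        v = np.exp(2.0 * s)                        # variational variance
        elbo = (-0.5 * np.sum((x - m) ** 2 + v)    # E_q[log p(x | mu)]
                - 0.5 * (m ** 2 + v)               # E_q[log p(mu)]
                + s)                               # entropy of q
        dm = np.sum(x - m) - m                     # d ELBO / d m
        ds = 1.0 - (N + 1) * v                     # d ELBO / d s
        return elbo, dm, ds

    m, s, lr = 0.0, 0.0, 0.01
    for _ in range(2000):                          # plain gradient ascent
        _, dm, ds = elbo_and_grad(m, s)
        m, s = m + lr * dm, s + lr * ds

    # q is in the exact posterior family here, so this should recover the
    # closed-form posterior N(sum(x) / (N + 1), 1 / (N + 1)).
    print(m, np.exp(2 * s), x.sum() / (N + 1), 1.0 / (N + 1))

Nothing prevents swapping the plain gradient step for an off-the-shelf quasi-Newton routine on the same objective, which is exactly the substitution I am asking about.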

asked Jul 19 '10 at 19:20

Alexandre Passos ♦

edited Jul 20 '10 at 16:08


3 Answers:

I think that for models outside the conjugate-exponential family, variational EM can't actually be applied readily. I'm not intimately familiar with the literature in this area, but I have seen some papers directly applying gradient-based methods to variational inference. For example, take a look at these two, where such ideas are applied: Natural Conjugate Gradient in Variational Inference, and A Gradient-Based Algorithm Competitive with Variational Bayesian EM for Mixture of Gaussians.

answered Jul 19 '10 at 23:20

spinxl39

As for your question of why EM or variational EM is preferred over direct gradient-based methods: the reason is the difficult optimization problem that results after integrating out the latent variables. So instead we keep the latent variables (or introduce them if the original model doesn't have any) and then estimate both the parameters and the latent variables in an alternating fashion, as done in EM.
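
A minimal sketch of this alternating scheme for a two-component 1-D Gaussian mixture with known unit variances (my own toy example; variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

    pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])     # initial parameters
    for _ in range(100):
        # E step: responsibility of each component for each point
        resp = pi * np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters given the responsibilities
        nk = resp.sum(axis=0)
        pi, mu = nk / len(x), (resp * x[:, None]).sum(axis=0) / nk

    print(pi, mu)    # roughly [0.4, 0.6] and [-2, 3]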

(Jul 19 '10 at 23:41) spinxl39

Thanks for your answers. However, since all you're doing is maximizing a lower bound on the likelihood (or a point estimate of it, in the case of EM), why can't you just compute the gradient with respect to both the parameters and the latent variables?

(Jul 20 '10 at 05:11) Alexandre Passos ♦

I'm not sure what you mean by maximizing a "point estimate" of the log-likelihood in EM. In EM, don't you maximize the expected log-likelihood, which requires computing the expectation of the log-likelihood w.r.t. the latent-variable distribution and then maximizing it? I think the reason you can't do the maximization w.r.t. parameters and latent variables jointly is that you need the distribution over the latent variables, and the E step of the EM algorithm gives you exactly this distribution.

As an aside, EM can in fact be seen as doing a joint maximization over the distribution over latent variables and the parameters (but in an alternating fashion). The seminal paper by Neal & Hinton presents this view: http://www.stat.columbia.edu/~liam/teaching/neurostat-spr09/papers/EM/EM-neal-hinton.pdf . The maximization over a space of distributions (the E step) may seem daunting at first, but what saves us is that the maximizing distribution turns out to be precisely the distribution of the latent variables given the parameters (which are obtained from the M step).
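
In that view, both steps coordinate-ascend the same objective, the free energy (written here in standard notation, not copied from the paper):

    \mathcal{F}(q, \theta)
      = \mathbb{E}_{q(z)}\left[\log p(x, z \mid \theta)\right] + H[q]
      = \log p(x \mid \theta) - \mathrm{KL}\left(q(z) \,\|\, p(z \mid x, \theta)\right)

The E step maximizes F over q, which sets q(z) = p(z | x, theta) and drives the KL term to zero; the M step maximizes F over theta with q held fixed.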

(Jul 20 '10 at 06:46) spinxl39

An aside: Even in the Expectation Conjugate Gradient algorithm of "Optimization with EM and Expectation-Conjugate-Gradient" paper by Salakhutdinov, Roweis & Ghahramani, you need the distribution over latent variables in order to compute the gradient w.r.t. parameters (what they call the EG step), before invoking a gradient based optimizer for parameters.
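
The identity behind this (a standard one, not specific to that paper) says that the gradient of the marginal log-likelihood is the posterior expectation of the complete-data gradient:

    \nabla_\theta \log p(x \mid \theta)
      = \mathbb{E}_{p(z \mid x, \theta)}\left[\nabla_\theta \log p(x, z \mid \theta)\right]

so even a pure gradient method still has to compute, or approximate, the distribution over the latent variables.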

(Jul 20 '10 at 06:54) spinxl39

Thanks, you're right. I'll read the Neal & Hinton paper more carefully.

(Jul 20 '10 at 13:49) Alexandre Passos ♦

Two other important points:

  1. For many variational algorithms you end up with a huge number of free variational parameters (for example, variational LDA has N_topics * N_words + N_docs * N_topics of them), and while each is very easy to maximize locally, a gradient-based algorithm would probably flail around trying to tweak them all at once to improve the objective function.

  2. There are usually constraints (such as normalization constraints) that would either make it necessary to use a constrained optimizer (which is slower than an unconstrained one, and I expect this to get worse as the number of parameters grows) or force you to impose the constraints artificially (for example, by normalizing before computing the likelihood and its gradient), which might break an optimizer's line search; a small sketch of that workaround follows this list.
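
To illustrate the workaround in the second point (my own sketch, not tied to any particular variational model): a normalized parameter vector can be reparameterized through a softmax so that an unconstrained optimizer applies, at the cost of pushing the gradient through the softmax Jacobian.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    counts = np.array([10.0, 3.0, 1.0])         # toy sufficient statistics

    z = np.zeros(3)                             # unconstrained logits
    for _ in range(500):
        theta = softmax(z)                      # normalized parameters
        grad_theta = counts / theta             # gradient of counts @ log(theta)
        # chain rule through the softmax: J^T g with J = diag(theta) - theta theta^T
        grad_z = theta * (grad_theta - theta @ grad_theta)
        z += 0.1 * grad_z

    print(softmax(z), counts / counts.sum())    # should roughly agree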

answered Aug 11 '10 at 06:29

Alexandre Passos ♦

Discriminative models are often fit with numerical methods for parameter estimation. For instance, logistic regression models are often learned with gradient descent, Newton's method, or L-BFGS. It also just so happens that minimizing the logistic loss corresponds to maximizing the likelihood of the observed data. Conditional random fields are also commonly estimated using gradient methods. In supervised learning it may not always make sense to use EM, as there are sometimes no latent variables over which to estimate a distribution.

Furthermore, it can be shown that the likelihood function for CRFs (including logistic regression) is convex, so optimization converges to the global optimum and does not get stuck in local optima (which can happen with EM).
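
For concreteness, a minimal sketch of that logistic-regression case (toy data of my own, no regularization), maximizing the concave log-likelihood with plain gradient ascent:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 2))
    w_true = np.array([1.5, -2.0])
    y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

    w = np.zeros(2)
    for _ in range(1000):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
        w += 0.01 * X.T @ (y - p)               # ascent on the log-likelihood

    print(w)    # should land in the rough vicinity of w_true

Substituting L-BFGS or Newton's method for the plain gradient step changes only the update rule, not the objective.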

gradient methods for CRFs: http://www.cs.ubc.ca/~murphyk/Papers/ICML06.pdf

good intro to CRFs: http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf

answered Jul 20 '10 at 16:03

acm

I know that. I was thinking about generative models, however.

(Jul 20 '10 at 16:08) Alexandre Passos ♦

Ah, thanks for specifying the question. I was sure discriminative models fell under the umbrella of "probabilistic settings".

(Jul 20 '10 at 18:32) acm

I'm sorry, I should have been more clear from the start.

(Jul 20 '10 at 20:39) Alexandre Passos ♦

Regarding the comment --- "In supervised learning it may not always make sense to use EM, as there sometimes are no latent variables over which to estimate a distribution" --- I'd like to add that even in the presence of latent variables, for some models, EM can still be side-stepped and direct optimization based approaches can be used. For example, see this CRF paper "Conditional Random Fields for Object Recognition": http://groups.csail.mit.edu/vision/vip/papers/quattoni_nips04.pdf

(Jul 20 '10 at 21:16) spinxl39