When the empirical distribution is p, maximum likelihood estimation (MLE) amounts to minimizing KL(p,q) over the model q. What about minimizing KL(q,p)? Csiszár calls this an "I-projection", and Koller/Friedman mention in their book that it is sometimes chosen over MLE for computational efficiency reasons. Are there reasons to prefer the I-projection over MLE besides computational efficiency?

Example: suppose your data is generated by a non-realizable p (one that lies outside the model family), and you have unlimited training data. Then choosing q to minimize KL(p,q) gives you the distribution that minimizes the expected code length of symbols drawn from p, so you can compress future data better. On the other hand, minimizing KL(q,p) gives you a q' that is, in general, different from q and produces a sub-optimal code. Is q' better than q for any real-world tasks?
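A small numeric sketch of this example (not from the original post: the three-symbol p, the Binomial(2, theta) model family, and the grid search are all assumptions chosen for illustration). It finds both projections by brute force and compares the expected code length each one gives on data drawn from p.

```python
import math

def kl(a, b):
    """KL(a || b) in nats for two discrete distributions on the same support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def cross_entropy_bits(p, q):
    """Expected bits per symbol when symbols drawn from p are coded with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def model(theta):
    """Hypothetical one-parameter family: Binomial(2, theta) over the symbols 0, 1, 2."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

# Assumed "true" distribution. It is bimodal, so no member of the family matches it.
p = [0.45, 0.10, 0.45]

# Brute-force grid over the open interval (0, 1).
thetas = [i / 1000 for i in range(1, 1000)]

theta_m = min(thetas, key=lambda t: kl(p, model(t)))  # argmin KL(p, q): the MLE limit
theta_i = min(thetas, key=lambda t: kl(model(t), p))  # argmin KL(q, p): I-projection

bits_m = cross_entropy_bits(p, model(theta_m))
bits_i = cross_entropy_bits(p, model(theta_i))

print("M-projection theta:", theta_m, "bits/symbol:", round(bits_m, 3))
print("I-projection theta:", theta_i, "bits/symbol:", round(bits_i, 3))
```

In this toy setup the KL(p,q) minimizer sits at the symmetric point of the family, while the KL(q,p) minimizer leans towards one of the two heavy symbols and pays for it with a longer expected code. Because p is symmetric, there are two equally good I-projections, and which one the grid search returns is arbitrary.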

asked Aug 23 '10 at 22:37

Yaroslav Bulatov

edited Aug 25 '10 at 19:44

I think it might not apply to the specific MLE setting you mentioned, but in general there are also model-specific, rather than computational, reasons for preferring one form of KL divergence over the other.

The two forms of KL divergence also distinguish variational Bayes (which uses KL(q||p)) from Expectation Propagation (which uses KL(p||q)). It's argued that for multi-modal posteriors EP does badly, because the approximation tries to capture all modes and then averages them, which can be bad in a mixture-model kind of setting since the average of two good parameter values isn't necessarily a good parameter value. So here one might prefer variational Bayes (KL(q||p)), which just finds a single mode (which might still be acceptable). There are other models, however (e.g., logistic-style), where EP (KL(p||q)) might be preferred over variational Bayes (KL(q||p)). The PRML book by Chris Bishop has a discussion of this (chapter on approximate inference).

(Aug 23 '10 at 23:15) spinxl39

Thanks for the reference; pages 468-470 seem to be especially relevant. So MLE corresponds to minimizing what he calls the "reverse KL divergence", equivalent to minimizing the "alpha-divergence" with alpha=1, which makes it zero-avoiding (i.e., MLE with enough degrees of freedom will never assign 0 probability to a datapoint that occurs in the training data), whereas the approach above minimizes the alpha-divergence with alpha=-1, which makes it zero-avoiding and, from his examples, looks like it turns a convex optimization problem into a non-convex one.

(Aug 24 '10 at 01:57) Yaroslav Bulatov

I meant "alpha=-1 makes it zero forcing"
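For reference, a sketch of the alpha-divergence family being discussed, written from memory in the form used in the approximate inference chapter of PRML (treat the exact normalization as an assumption and check it against the book):

```latex
D_{\alpha}(p \,\|\, q) = \frac{4}{1-\alpha^{2}}
  \left( 1 - \int p(x)^{(1+\alpha)/2} \, q(x)^{(1-\alpha)/2} \, dx \right),
\qquad
\lim_{\alpha \to 1} D_{\alpha}(p \,\|\, q) = \mathrm{KL}(p \,\|\, q),
\qquad
\lim_{\alpha \to -1} D_{\alpha}(p \,\|\, q) = \mathrm{KL}(q \,\|\, p).
```

With p the empirical distribution and q the model, the limit alpha -> 1 is the MLE objective (zero-avoiding for alpha >= 1), and the limit alpha -> -1 is the I-projection objective (zero-forcing for alpha <= -1).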

(Aug 24 '10 at 04:26) Yaroslav Bulatov

One Answer:

I-projection fits the mode, instead of the mean, of the distribution, so this is useful if you want to do that.

answered Aug 24 '10 at 05:07

Alexandre Passos ♦

edited Aug 25 '10 at 20:26

But why would you care about fitting the mean or the mode? The criterion should be improved performance on future data. For instance, MLE optimizes predictive code-length, so it's best for compression tasks with enough training data. The question is then -- for which tasks is argmin_q KL(q,p) better?

(Aug 25 '10 at 19:50) Yaroslav Bulatov

As spinxl39 said yesterday, this matters, for example, for mixture models. Let's assume we're doing a naive mean field approximation of a mixture model (i.e., one where the true distribution samples, for each point, a z value and then a value for that point from dist[z]), and we fit the approximating distribution (since this is naive mean field, the zs and dist[]s are assumed independent) by minimizing KL(q,p) or KL(p,q). If we're minimizing KL(q,p) (I-projection), the optimum is a distribution q such that samples from q have high probability under p. That is, it will find one of the unidentifiable modes of the distribution. M-projection, on the other hand, will seek values of the zs and dist[]s such that samples from p have high probability under q. So it will converge towards an "averaged solution", with all dist parameters having the same values, as well as all the z parameters (since the true distribution is perfectly exchangeable, minimizing KL(p,q) has to find some average value for the parameters).

So in this setting the I-projection finds a relevant result by searching for the mode of the distribution, while the M-projection searches for the mean and finds something meaningless. Apparently I had them mixed up in my original answer, so I edited it out.

When is the M-projection better? When you actually want a q such that samples from p have high probability under q. If you're fitting a density from observed data, for example, this is usually better, since it will "smooth out" the kinks in the empirical distribution (while I-projection would probably overfit the irregularities of your empirical distribution).
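A minimal sketch of the mode-versus-mean behaviour, under simplifying assumptions that are not in the thread: instead of a mean field approximation of a mixture model, it fits a single Gaussian q to a hypothetical two-component target p on a 1-D grid. The M-projection is computed by moment matching (which is what minimizing KL(p,q) reduces to for a Gaussian family), and the I-projection by brute-force grid search over (mu, sigma).

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

DX = 0.05
xs = [-12 + DX * i for i in range(481)]  # integration grid on [-12, 12]

# Assumed target: an equal mixture of N(-3, 1) and N(3, 1).
p_vals = [0.5 * normal_pdf(x, -3, 1) + 0.5 * normal_pdf(x, 3, 1) for x in xs]
log_p = [math.log(v) for v in p_vals]

# M-projection, argmin KL(p, q): match the mean and variance of p.
m_mu = sum(x * v for x, v in zip(xs, p_vals)) * DX
m_sigma = math.sqrt(sum((x - m_mu) ** 2 * v for x, v in zip(xs, p_vals)) * DX)

def kl_q_p(mu, sigma):
    """Riemann-sum approximation of KL(q, p) for q = N(mu, sigma^2)."""
    total = 0.0
    for x, lp in zip(xs, log_p):
        lq = normal_logpdf(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)  # contributes 0 where q underflows
    return total * DX

# I-projection, argmin KL(q, p): brute-force search over a coarse parameter grid.
candidates = [(-5 + 0.25 * i, 0.25 * j) for i in range(41) for j in range(1, 17)]
i_mu, i_sigma = min(candidates, key=lambda c: kl_q_p(*c))

print("M-projection: mu = %.3f, sigma = %.3f" % (m_mu, m_sigma))
print("I-projection: mu = %.3f, sigma = %.3f" % (i_mu, i_sigma))
```

In this sketch the M-projection is a wide Gaussian centred between the two components, where p itself has little mass, while the I-projection is a narrow Gaussian sitting on one component. The target is symmetric, so which of the two components the search lands on is arbitrary.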

(Aug 25 '10 at 20:25) Alexandre Passos ♦

For more information on this (I finally found the reference I was looking for!) see Dan Klein's tutorial on variational methods for NLP, http://www.eecs.berkeley.edu/~klein/papers/tutorial-acl2007.pdf , slide 25 (on page 18).

(Aug 25 '10 at 20:32) Alexandre Passos ♦

The question really comes down to this: are there any real-life losses resembling the one minimized by I-projection? MLE can be justified by saying that it's an empirical risk minimizer for compression; one might want to justify I-projection similarly, by finding a task for which it's a risk minimizer.

(Aug 25 '10 at 21:55) Yaroslav Bulatov

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.