4
1

How is it defined and is it non-linear (non-orthogonal)?

asked Sep 28 '10 at 07:55

Breno's gravatar image

Breno
81136


One Answer:

A KL projection is a projection with regards to the KL divergence. So to understand it, you must understand both of these things.

  1. A projection is a solution to an optimization problem that looks like "choose the y to minimize distance(x,y) subject to y being in some prespecified set". In an orthogonal projection, projecting a vector x into a subspace Y is usually defined as "choose the vector y that minimizes que squared norm ||x-y||2 subject to y being in the subspace Y".

  2. KL divergence is a natural information-theoretic measure of the "distance" between two distributions. KL(p||q) is E_p[log(p(x)/q(x))] (the expectation, over the distribution p, of the log of the ratio of the densities p and q).

There are two possible ways of doing the KL projection of a distribution p: M-projection, which is choosing q that minimizes KL(p||q) and I-projection, which is choosing q that minimizes KL(q||p). In the limit, when p is in the constraint set, both approaches will choose q = p. When p is not in the constraint setting, however, they will diverge. There are two ways in which each of them is used:

  • M-projection as maximum likelihood estimation. If p is an empirical distribution (i.e., a set of samples), choosing q that minimizes KL(p||q) with q constrained to be a distribution in a parametric model is equivalent to maximum likelihood estimation (to see that note that p(x) is a constant for all samples and zero otherwise, so essentially you're choosing q that minimizes sum(-log(q(x_i))), which is the same thing as maximizing product(q(x_i)), which is the likelihood of the data). This is a general way of estimating models from a sample and is consistent in the limit and has very nice properties.

  • I-projection as a local approximation. Suppose we are doing bayesian estimation and we have a complicated posterior distribution p that can have many modes and behave erratically in the limit, and whose normalization constant we can't compute (as it probably integrates/sums either over a multidimensional space or a discrete space with exponentially many states). One way out of it is choose a set of simpler distributions and pick a distribution from this set that behaves locally (i.e., in places where it has high mass) as close as you can to the true, complicated distribution. To do this you can choose q that minimizes KL(q||p). This implies choosing q such that samples from q will have high probability under p, and the better approximations are the ones that focus most of q's mass around the mode of p. If the set of allowed distributions is a set of dirac delta (i.e., point-wise approximations), I-projection is equivalent to maximum-a-posteriori estimation, as essentially you're choosing q=x to minimize -log P(x), which is essentially choosing x to maximize P(x).

So these are the two most common (as far as I know) uses of KL projection in machine learning. A good reference to these issues (in the context of graphical models) is Wainwright and Jordan's Graphical Models, Exponential Families and Variational Inference.

answered Sep 28 '10 at 10:13

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1893744214333

edited Sep 28 '10 at 10:42

In Bishops book (Pattern Recognition and Machine Learning), on page 143, he presents a geometrical view of a maximum likelihood solution of a linear regression problem with Gaussian noise assumption. Roughly, he shows that the least-square solution is equivalent to an orthogonal projection (up to a factor) of the vector composed of the target values on the subspace spanned by the corresponding sample points. As far as I can tell, this is the same as the M-projection you were talking about, right? I.e. if the model assumes Gaussian noise, the M-projection is equivalent to an orthogonal projection, since the error function is the sum-of-squares.

(Sep 28 '10 at 12:31) Breno

I think so, yes. Fill in the probabilities in the formula for the M-projection and see if it's the same.

(Sep 28 '10 at 13:00) Alexandre Passos ♦

"This implies choosing q such that samples from q will have high probability under p, and the better approximations are the ones that focus most of q's mass around the mode of p".

So basically, we assume that we can generate sufficiently good samples from the posterior p so that we can at least estimate its most important modes?

(Nov 26 '10 at 04:31) Oscar Täckström

@Oscar: Not really, KL does not use sampling, I mentioned sampling as an analogy to help understand the difference, which is that M-projection makes the whole density of q approximate p, while I-projection makes q only assign significant mass to things that are well-covered by p.

(Nov 26 '10 at 04:34) Alexandre Passos ♦
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.