A KL projection is a projection with respect to the KL divergence. So to understand it, you need to understand both what a projection is and what the KL divergence is.
There are two possible ways of doing the KL projection of a distribution p onto a constraint set: the M-projection, which chooses the q in the set that minimizes KL(p||q), and the I-projection, which chooses the q in the set that minimizes KL(q||p). When p itself is in the constraint set, both approaches will choose q = p. When p is not in the constraint set, however, they will generally give different answers. There are two ways in which each of them is used:
So these are the two most common (as far as I know) uses of KL projection in machine learning. A good reference on these issues (in the context of graphical models) is Wainwright and Jordan's Graphical Models, Exponential Families, and Variational Inference.
In Bishop's book (Pattern Recognition and Machine Learning), on page 143, he presents a geometrical view of the maximum likelihood solution of a linear regression problem under a Gaussian noise assumption. Roughly, he shows that the least-squares solution is equivalent to an orthogonal projection (up to a factor) of the vector of target values onto the subspace spanned by the corresponding sample points. As far as I can tell, this is the same as the M-projection you were talking about, right? I.e., if the model assumes Gaussian noise, the M-projection is equivalent to an orthogonal projection, since the error function is the sum of squares.
(Sep 28 '10 at 12:31)
Breno
I think so, yes. Fill in the probabilities in the formula for the M-projection and see if it's the same.
(Sep 28 '10 at 13:00)
Alexandre Passos ♦
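For what it's worth, here is a rough sketch of what "filling in the probabilities" might look like. The notation is my own assumption, not taken from the thread: \hat{p} is the empirical distribution over the N training pairs (x_n, t_n), and q_w is the Gaussian-noise model with a fixed variance \sigma^2 and features \phi. The terms that do not depend on w are treated loosely as a constant.

```latex
\begin{align*}
\mathrm{KL}(\hat{p} \,\|\, q_w)
  &= -H(\hat{p}) - \frac{1}{N} \sum_{n=1}^{N} \log q_w(t_n \mid x_n),
  \qquad q_w(t \mid x) = \mathcal{N}\!\left(t \mid w^\top \phi(x),\, \sigma^2\right) \\
  &= \text{const} + \frac{1}{2\sigma^2 N} \sum_{n=1}^{N} \left(t_n - w^\top \phi(x_n)\right)^2 .
\end{align*}
```

Under these assumptions, minimizing KL(\hat{p}||q_w) over w is the same as minimizing the sum-of-squares error, which is consistent with the maximum likelihood / least-squares view in the question.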
"This implies choosing q such that samples from q will have high probability under p, and the better approximations are the ones that focus most of q's mass around the mode of p". So basically, we assume that we can generate sufficiently good samples from the posterior p so that we can at least estimate its most important modes?
(Nov 26 '10 at 04:31)
Oscar Täckström
@Oscar: Not really. KL does not use sampling; I mentioned sampling only as an analogy to help explain the difference, which is that the M-projection makes the whole density of q approximate p, while the I-projection makes q assign significant mass only to regions that are well covered by p.
(Nov 26 '10 at 04:34)
Alexandre Passos ♦
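To see the difference between the two projections numerically, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than something from the thread: p is a made-up bimodal mixture of two unit-variance Gaussians at -3 and +3, the constraint set is the family of single Gaussians, the densities are discretized on a grid, and the projections are found by brute-force grid search over (mu, sigma). The helper names (log_normal, kl_p_q, kl_q_p, project) are hypothetical.

```python
import numpy as np

# Discretization grid for the densities (an assumption made for illustration).
x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]


def log_normal(x, mu, sigma):
    """Log-density of a Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))


# Target p: an equal mixture of N(-3, 1) and N(3, 1), so it has two modes.
p = 0.5 * np.exp(log_normal(x, -3.0, 1.0)) + 0.5 * np.exp(log_normal(x, 3.0, 1.0))
log_p = np.log(p)


def kl_p_q(log_q):
    """Discretized KL(p || q): the objective of the M-projection."""
    return float(np.sum(p * (log_p - log_q)) * dx)


def kl_q_p(log_q):
    """Discretized KL(q || p): the objective of the I-projection."""
    q = np.exp(log_q)
    return float(np.sum(q * (log_q - log_p)) * dx)


def project(objective):
    """Grid search over single Gaussians; returns (best value, mu, sigma)."""
    best = None
    for mu in np.linspace(-5.0, 5.0, 51):
        for sigma in np.linspace(0.5, 5.0, 46):
            value = objective(log_normal(x, mu, sigma))
            if best is None or value < best[0]:
                best = (value, float(mu), float(sigma))
    return best


m_kl, m_mu, m_sigma = project(kl_p_q)  # M-projection: minimize KL(p || q)
i_kl, i_mu, i_sigma = project(kl_q_p)  # I-projection: minimize KL(q || p)

print("M-projection: mu=%.2f sigma=%.2f" % (m_mu, m_sigma))
print("I-projection: mu=%.2f sigma=%.2f" % (i_mu, i_sigma))
```

In this toy setup the M-projection should come out as one broad Gaussian centered between the modes (for a Gaussian family it matches the mean and variance of p), while the I-projection should sit on a single mode with a width close to that mode's. Which of the two modes it picks is arbitrary here, since p is symmetric.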