There are many ways to measure distance between probability distributions. I'd like to get a sense of which ones are good and why. Here are a couple I know how to justify.
Some of the ones I'd like more justification for are KL(.,p), Hellinger distance, and more generally Ali-Silvey distances, Bregman divergences, and geodesic distance.
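For concreteness, here is a small sketch (with made-up example distributions) of a few of the measures under discussion, for finite discrete distributions:

```python
import numpy as np

def kl(p, q):
    """KL(p || q), with the convention 0 * log(0/q) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def hellinger(p, q):
    """Hellinger distance: (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2))

def bhattacharyya(p, q):
    """Bhattacharyya distance: -log sum_y sqrt(p(y) * q(y))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.log(np.sum(np.sqrt(p * q))))

# two example distributions over a 3-element sample space
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
```

All three vanish when the two distributions coincide; KL is asymmetric in its arguments, while Hellinger and Bhattacharyya are symmetric.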
Another measure, which can be derived from the KL divergence, is the Bhattacharyya distance. It is used in the multi-view learning algorithm in the posterior regularization framework of Kuzman Ganchev et al., where it appears as the solution of the problem: minimize KL(q(y1,y2) || p1(y1) p2(y2)) subject to E_q[delta(y1=y) - delta(y2=y)] = 0 for all y. However, I'd reformulate your question as being about measures of goodness of fit between a probability distribution and empirical data, or distances between samples and distributions. The problem of estimating the distance between two distributions (which is what your question is actually asking) or between two sets of samples is different from this one, but any sample can be viewed as a probability distribution and vice versa.
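A quick numeric check of that connection, with made-up p1 and p2: working out the Lagrangian for the agreement problem above, the minimizer is a product q = m x m with m proportional to sqrt(p1 * p2) (so both marginals of q are equal by construction), and the minimized KL comes out to twice the Bhattacharyya distance -log sum_y sqrt(p1(y) p2(y)):

```python
import numpy as np

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.2, 0.5, 0.3])

# Bhattacharyya distance between p1 and p2
s = np.sqrt(p1 * p2)
Z = s.sum()                  # Bhattacharyya coefficient
bhatt = -np.log(Z)

# Candidate minimizer: q = outer(m, m) with m proportional to sqrt(p1 * p2);
# both marginals of q equal m, so the agreement constraint holds.
m = s / Z
q = np.outer(m, m)
joint = np.outer(p1, p2)     # the product distribution p1(y1) p2(y2)
kl_opt = float(np.sum(q * np.log(q / joint)))
# kl_opt == 2 * bhatt (twice the Bhattacharyya distance)
```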
(Aug 29 '10 at 18:39)
Yaroslav Bulatov
But it's not always productive to do so. For example, most measures of distance/similarity between continuous distributions don't behave well when one distribution is continuous and the other is discrete, as I showed with KL(.,p). Also, any measure of similarity between two discrete distributions is zero when computed between two samples from the same continuous distribution, which is clearly pathological behavior.
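To make that pathology concrete, a small sketch: two finite samples from the same continuous distribution almost surely share no values, so their empirical distributions have disjoint support, and any overlap-based similarity (the Bhattacharyya coefficient, say) between them is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000)  # sample 1 from N(0, 1)
b = rng.normal(size=1000)  # sample 2 from the same N(0, 1)

# The two empirical supports are (almost surely) disjoint:
common = np.intersect1d(a, b)
# len(common) == 0, so e.g. the Bhattacharyya coefficient of the two
# empirical distributions is 0, i.e. "maximally dissimilar" samples
# drawn from the very same distribution.
```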
(Aug 29 '10 at 18:57)
Alexandre Passos ♦
You are right: allowing continuous distributions makes it harder to avoid pathological behavior, so I'll change the question to discrete-only for simplicity. In practice we compute with finite precision, so all distributions are discrete anyway.
(Aug 29 '10 at 19:15)
Yaroslav Bulatov
Restricting to discrete distributions doesn't solve the problem. In a discrete setting with a large sample space and very few samples you'll run into the same problem I mentioned, when comparing two samples from the same distribution, or a sample against an estimated density for that same distribution.
(Aug 29 '10 at 19:22)
Alexandre Passos ♦
Doesn't solve what problem? Any specific discrete distribution is a multinomial distribution, and any sample can be viewed as a multinomial distribution, so any measure of similarity for multinomial distributions can be applied to samples and vice versa.
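A minimal sketch of that identification, mapping a (made-up) sample to its empirical multinomial:

```python
from collections import Counter

# a sample over a discrete alphabet
sample = ['a', 'b', 'b', 'c', 'b', 'a']
n = len(sample)

# the sample viewed as a multinomial: outcome -> empirical frequency
counts = Counter(sample)
empirical = {x: c / n for x, c in counts.items()}
# empirical == {'a': 1/3, 'b': 1/2, 'c': 1/6}
```

Any similarity measure defined for multinomials can now be evaluated on `empirical`.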
(Aug 29 '10 at 19:36)
Yaroslav Bulatov
About your other question regarding KL(.,p): I realized the other day that we were considering different problems. KL(p,.) makes sense when p is an empirical distribution, while KL(.,p) makes sense when p is the true Bayesian posterior distribution you want to approximate. Computationally, only these pairings make sense in this setting, since minimizing KL(.,p) when p is an empirical distribution (and the other argument isn't) is really hard to justify: the model will definitely assign density to many things p doesn't, so computing KL(.,p) means taking logs of ratios with a zero denominator, which diverge.
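The divergence issue above can be seen concretely in a small sketch (with a made-up empirical p and smooth model q): KL(p,q) only sums over outcomes p has observed and stays finite, while KL(q,p) hits a zero in the denominator wherever q has mass that p does not:

```python
import math

# p: an empirical distribution over 4 outcomes; the last was never observed
p = [0.5, 0.25, 0.25, 0.0]
# q: a smooth model distribution assigning some mass everywhere
q = [0.4, 0.3, 0.2, 0.1]

def kl(a, b):
    # KL(a || b), with the convention 0 * log(0/b) = 0
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

forward = kl(p, q)       # finite: p only asks about observed outcomes
try:
    reverse = kl(q, p)   # q puts mass where p has none...
except ZeroDivisionError:
    reverse = math.inf   # ...so KL(q, p) diverges
```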
Why not minimize KL(p,.) when p is the true Bayesian posterior distribution? Because we want to fit a mode instead of the mean? That seems subjective.