
There are many ways to measure distance between probability distributions. I'd like to get a sense of which ones are good and why. Here are a couple I know how to justify:

  1. KL(p, .) where p is the true (empirical) probability distribution. There are many ways to justify minimizing this measure: the minimizer is asymptotically consistent and efficient, and it achieves the lowest expected coding length and the lowest asymptotic error for optimal hypothesis tests. KL(p, .) minimizers are easy to analyze theoretically in the framework of M-estimators, and there are efficient algorithms to compute KL(p, .) and its gradient for high-dimensional distributions. A related measure, Chernoff information, shares some of the "optimal hypothesis test" justification. (See the sketch after this list.)

  2. L-inf distance in the space of log-odds. This measure, or a monotonic transformation of it, has come up under many names (dynamic range, L-inf quotient metric, "A Distance Measure for Bounding Probabilistic Belief Change", Hilbert's projective metric). People probably keep reinventing it because it is the unique (up to monotonic transformation) measure under which any two positive vectors are brought "closer together" by positive matrix multiplication. Since both exact and approximate inference in general graphical models can be written in terms of matrix multiplications, where the matrices are joint probability tables, this measure gives a useful tool for getting correlation and accuracy bounds. (Also illustrated in the sketch after this list.)

  3. Any distance function based on empirical risk. For instance, the pseudo-likelihood objective gives us the empirical risk of modeling each variable given all the others, hence a KL divergence based on the pseudo-likelihood partitioning is relevant.
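
To make items 1 and 2 concrete, here is a minimal sketch for discrete distributions. It assumes only NumPy; the function names kl_divergence and log_odds_linf are mine, not from any particular library.

    import numpy as np

    def kl_divergence(p, q):
        # KL(p, q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions.
        # Terms with p(x) = 0 contribute nothing; if q(x) = 0 somewhere p(x) > 0,
        # the divergence is infinite.
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        support = p > 0
        if np.any(q[support] == 0):
            return np.inf
        return np.sum(p[support] * np.log(p[support] / q[support]))

    def log_odds_linf(p, q):
        # L-inf distance between log-odds vectors:
        # max_{x,y} |log(p(x)/p(y)) - log(q(x)/q(y))| = max(r) - min(r),
        # where r = log(p) - log(q).  This is Hilbert's projective metric on the
        # positive orthant; both p and q must be strictly positive.
        r = np.log(np.asarray(p, dtype=float)) - np.log(np.asarray(q, dtype=float))
        return r.max() - r.min()

    # Toy example: empirical distribution p versus a model q.
    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(kl_divergence(p, q), log_odds_linf(p, q))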

Some of the ones I'd like more justification for are KL(., p), Hellinger distance, and more generally Ali-Silvey distances, Bregman divergences, and geodesic distances.

asked Aug 29 '10 at 17:24

Yaroslav Bulatov

edited Aug 29 '10 at 19:12

About your other question regarding KL(., p): I realized the other day that we were considering different problems. KL(p, .) makes sense when p is an empirical distribution, while KL(., p) makes sense when p is the true Bayesian posterior distribution you want to approximate. Computationally, only these pairings make sense in this setting, since minimizing KL(., p) when p is an empirical distribution and the other argument isn't is really hard to justify: the other argument will assign density to many things p doesn't, so you'll be computing log 0 a lot when evaluating KL(., p). (A small numeric illustration follows this comment.)

(Aug 29 '10 at 18:04) Alexandre Passos ♦
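
A minimal numeric illustration of the point above, using a two-outcome empirical distribution; the kl helper here is my own, not from any library.

    import numpy as np

    # Empirical distribution from coin flips that all came up heads,
    # compared against a smooth model q over {heads, tails}.
    p_emp = np.array([1.0, 0.0])
    q     = np.array([0.9, 0.1])

    def kl(a, b):
        # KL(a, b) for discrete distributions; infinite if b is zero on a's support.
        support = a > 0
        if np.any(b[support] == 0):
            return np.inf
        return np.sum(a[support] * np.log(a[support] / b[support]))

    print(kl(p_emp, q))   # KL(p_emp, q): finite (about 0.105)
    print(kl(q, p_emp))   # KL(q, p_emp): infinite -- q puts mass where p_emp has none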

Why not minimize KL(p, .) when p is the true Bayesian posterior distribution? Because we want to fit the mode instead of the mean? That seems subjective.

(Aug 29 '10 at 18:45) Yaroslav Bulatov
  1. Computational considerations (it's usually hard to compute expectations over p). 2. You want a local approximation of p, i.e., something that closely resembles its structure around its maximum but can be arbitrarily bad everywhere else. This is useful in many settings where you want a confidence measure, and minimizing KL(p, .) wouldn't give you such a thing (i.e., it could behave arbitrarily badly around p's mode but better when averaged over p).
(Aug 29 '10 at 19:04) Alexandre Passos ♦

One Answer:

Another measure, which can be derived from the KL divergence, is the Bhattacharyya distance. It is used in the multi-view learning algorithm in the posterior regularization framework by Kuzman Ganchev et al. It appears as the solution of the problem: minimize KL(q(y1, y2) || p1(y1) p2(y2)) subject to E_q[delta(y1 = y) - delta(y2 = y)] = 0 for all y.
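
For concreteness, here is a minimal sketch of the Bhattacharyya distance between two discrete distributions (just the distance itself, not the posterior regularization machinery; the function name is mine).

    import numpy as np

    def bhattacharyya_distance(p, q):
        # D_B(p, q) = -log sum_x sqrt(p(x) * q(x)) for discrete distributions.
        # The coefficient sum_x sqrt(p(x) * q(x)) is 1 iff p == q,
        # and 0 iff the supports are disjoint.
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return -np.log(np.sum(np.sqrt(p * q)))

    p1 = np.array([0.5, 0.3, 0.2])
    p2 = np.array([0.2, 0.3, 0.5])
    print(bhattacharyya_distance(p1, p2))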

However, I'd reformulate your question as being about measures of goodness of fit between a probability distribution and empirical data, or distances between samples and distributions. The problems of estimating the distance between two distributions (which is what your question literally asks about) or between two sets of samples are different from this one.

answered Aug 29 '10 at 18:19

Alexandre Passos ♦

But any sample can be viewed as a probability distribution, and vice versa.

(Aug 29 '10 at 18:39) Yaroslav Bulatov

But it's not always productive to do so. For example, most measures of distance/similarity between continuous distributions don't behave well when one distribution is continuous and the other is discrete, as I showed with KL(., p). Also, any measure of similarity between two discrete distributions is zero when computed on two samples from the same continuous distribution (the samples are almost surely disjoint), which is clearly pathological behavior.

(Aug 29 '10 at 18:57) Alexandre Passos ♦

You are right: allowing continuous distributions makes it harder to avoid pathological behavior, so I'll change the question to discrete-only for simplicity. In practice we compute with finite precision, so all distributions are discrete anyway.

(Aug 29 '10 at 19:15) Yaroslav Bulatov

Restricting to discrete distributions doesn't solve the problem. In a discrete setting with a large sample space and very few samples, you'll run into the same issue I mentioned when comparing two samples from the same distribution, or a sample against an estimated density for that same distribution.

(Aug 29 '10 at 19:22) Alexandre Passos ♦

Doesn't solve what problem? Any specific discrete distribution is a multinomial distribution. Any sample can be viewed as a multinomial distribution. So any measure of similarity for multinomial distributions can be applied to samples, and vice versa.

(Aug 29 '10 at 19:36) Yaroslav Bulatov
