So i'm trying to figure out if i should use z-score normalization on a variable, x or log(x). Since z-score normalization assumes a normal distribution, i want to pick the distribution that is most gaussian. KL divergence from the ML estimate works fine for this -- I can pick the term with the lower divergence from the hypothesized distribution. No problem, but this got me thinking...

Is there a hypothesis test for KL divergence?

(There are loads of ways to get a do hypothesis testing to ask the likelihood of a sample being drawn from a distribution, I'm mostly thinking about the distribution of expected values, E(KL(P,Q)).)

asked Jul 22 '10 at 11:09

Andrew%20Rosenberg's gravatar image

Andrew Rosenberg
156252135

It seems hard to compute such an expectation analytically, and you kind of need that to have a p-value, no?

(Jul 22 '10 at 11:26) Alexandre Passos ♦

I agree. It definitely seems daunting. I'm mostly curious if someone's already done it.

(Jul 22 '10 at 11:36) Andrew Rosenberg

Which question are you hoping to answer:

1) Is the difference between the two KL divergences statistically significant?

or 2) Does my distribution differ significantly from the hypothesized distribution?

The second can be answered with a KS-test. The first, I don't know.

(Jul 22 '10 at 13:01) Troy Raeder

I'd like to know if given a KL divergence value, KL(P,Q) (and information about degrees of freedom) a p-value saying how likely the sample distribution P was drawn from the true distribution Q can be calculated.

Question 1) is a very interesting related question.

Question 2) has a lot of answers including the t-test, KS-test, etc, etc.

Aside: Can you apply a KS test to a multinomial distribution?

(Jul 22 '10 at 13:09) Andrew Rosenberg

About applying KS tests to multinomials I don't think it makes sense; KS is inherently unidimensional and continuous. How could you define a cdf-like statistic for the multinomial distribution? And why not just computing a p-value directly from the multinomial?

(Jul 22 '10 at 13:20) Alexandre Passos ♦

right...but i can get KL divergence between two multinomials...so a KS-test doesn't quite address this question.

(Jul 22 '10 at 13:23) Andrew Rosenberg

What I'd do is use monte carlo to compute an empirical distribution for the KL divergence and create your signficante test with the p-value being the fraction of monte carlo estimated KL divergences (from samples from tue true distribution, that is) that is smaller than the divergence you want to test. But I'm not sure this qualifies as an answer.

(Jul 22 '10 at 14:26) Alexandre Passos ♦

I'd like to know if given a KL divergence value, KL(P,Q) (and information about degrees of freedom) a p-value saying how likely the sample distribution P was drawn from the true distribution Q can be calculated.

Compared to what? What is the null hypothesis here?

(Jul 28 '10 at 17:57) dogy

The null hypothesis would be "P and Q are drawn from the same underlying distribution" or to be more faithful to the asymmetry of KL-divergence "P is drawn from the distribution Q". Based on KL-divergence and degrees of freedom with what probability, p, can we reject the null hypothesis?

(Jul 29 '10 at 02:00) Andrew Rosenberg

@Andrew: in what sense is a monte carlo p-value unsuitable for this? By a monte carlo p-value I mean generate a lot of samples S from Q of the same size as P and compute, for all S, KL(S,Q). Then you count what fraction of them have KL(S,Q) higher than KL(P,Q). That, then, is a monte-carlo computed p-value (as in, the probability of generating that or something less likely according to the base distribution). You might need a lot of samples, but otherwise I think this should work.

(Jul 29 '10 at 02:26) Alexandre Passos ♦

@Alexandre: I definitely think that would work, but its slightly unsatisfying from a theoretical point of view. But yes, if KL-divergence doesn't follow a standard distribution that may be the best route, barring some breakthrough.

(Jul 29 '10 at 07:55) Andrew Rosenberg
showing 5 of 11 show all

4 Answers:

KL-divergence is a statistic. To make a test out of it, it really depends on what assumptions you make for P and Q.

P is your truth. The issue is that given a certain set of data, there are many different P's that would generate it. If you're Bayesian, P can just be your posterior, and you should trust your prior.

Q is your null model. A much simpler model with much less variation. If you're Bayesian, that's your posterior predictive.

When you test P against Q, it should be very unlikely that Q beats P on accuracy.

If you're Bayesian, you can just draw a bunch of samples from the posterior, and test Q against them. If you're doing bootstrap, just compare Q fit on the whole data against P that's fit on the resample. The proportion of Q's wins against P corresponds to the p-value with the hypothesis test.

You could also flip it around like permutation testing, the above is good for starters.

For more, here's my ICML paper from a few years back.

answered Aug 19 '10 at 22:49

Aleks%20Jakulin's gravatar image

Aleks Jakulin
240110

Not sure if it's the same for Gaussians, but for multinomials sampled from Dirichlet, probability of encountering KL divergence above k falls as Exp(-c k), as a consequence of Sanov's Theorem (p.292 of Thomas/Cover "Elements of Information Theory" 1991)

answered Aug 19 '10 at 23:28

Yaroslav%20Bulatov's gravatar image

Yaroslav Bulatov
1963193458

edited Aug 19 '10 at 23:29

I am trying to think about this in more formal terms. Firstly, KL Divergence is

alt text

And the The KL Divergence is therefore the (easier) sum

alt text

Remember, that we are operating upon the assumption that the null hypothesis is true, that is alt text. So, the latter term will be entropy of the distribution, and the former will be, I'm not sure yet. This seems to be the hard part, and relies upon the assumptions that you are using to estimate p(X). This is a kind of "missing piece". If you could somehow believe that

1) The true mean of the estimation will converge to the true mean of the estimator, as n->infinity

2) The variance of the estimation will converge to 0

Then we can do the expansion of the distribution in terms of moments ... which cancel out, and we are only left with third order terms - that is to say, the mean of a chi-squared distribution with n degrees of freedom, where n is, hmmm . Analysis is hard.

Usual Disclaimer: I'm just refreshing some statistics, which is pretty rusty for me. Please don't laugh at me if it is completely wrong.

answered Jul 29 '10 at 23:25

dogy's gravatar image

dogy
3065918

edited Jul 29 '10 at 23:33

aha. A shot in the dark here. The form of the KL Divergence bears a lot of resemblance to the likelihood ratio test,

alt text

So in the limit I'd expect the exponential of the KL Divergence to be chi squared. Don't take my word for anything, however, I've not thought this out through carefully and there are likely holes in my reasoning.

answered Jul 29 '10 at 03:38

dogy's gravatar image

dogy
3065918

edited Jul 29 '10 at 03:44

I'm not sure. Isn't KL divergence something like E_q[(P(i)log(P(i)/Q(i))]? So the distribution of the KL divergence should be something like E_p[E_q[(P(i)log(P(i)/Q(i))]], and from the specific forms of P and Q to this you need to do two hard integrals.

(Jul 29 '10 at 05:34) Alexandre Passos ♦

For the derivation of KL from log likelihood, see this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.366&rep=rep1&type=pdf

(Jul 30 '10 at 08:50) cglr
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.