|
So i'm trying to figure out if i should use z-score normalization on a variable, x or log(x). Since z-score normalization assumes a normal distribution, i want to pick the distribution that is most gaussian. KL divergence from the ML estimate works fine for this -- I can pick the term with the lower divergence from the hypothesized distribution. No problem, but this got me thinking... Is there a hypothesis test for KL divergence? (There are loads of ways to get a do hypothesis testing to ask the likelihood of a sample being drawn from a distribution, I'm mostly thinking about the distribution of expected values, E(KL(P,Q)).)
showing 5 of 11
show all
|
|
KL-divergence is a statistic. To make a test out of it, it really depends on what assumptions you make for P and Q. P is your truth. The issue is that given a certain set of data, there are many different P's that would generate it. If you're Bayesian, P can just be your posterior, and you should trust your prior. Q is your null model. A much simpler model with much less variation. If you're Bayesian, that's your posterior predictive. When you test P against Q, it should be very unlikely that Q beats P on accuracy. If you're Bayesian, you can just draw a bunch of samples from the posterior, and test Q against them. If you're doing bootstrap, just compare Q fit on the whole data against P that's fit on the resample. The proportion of Q's wins against P corresponds to the p-value with the hypothesis test. You could also flip it around like permutation testing, the above is good for starters. For more, here's my ICML paper from a few years back. |
|
Not sure if it's the same for Gaussians, but for multinomials sampled from Dirichlet, probability of encountering KL divergence above k falls as Exp(-c k), as a consequence of Sanov's Theorem (p.292 of Thomas/Cover "Elements of Information Theory" 1991) |
|
I am trying to think about this in more formal terms. Firstly, KL Divergence is
And the The KL Divergence is therefore the (easier) sum
Remember, that we are operating upon the assumption that the null hypothesis is true, that is 1) The true mean of the estimation will converge to the true mean of the estimator, as n->infinity 2) The variance of the estimation will converge to 0 Then we can do the expansion of the distribution in terms of moments ... which cancel out, and we are only left with third order terms - that is to say, the mean of a chi-squared distribution with n degrees of freedom, where n is, hmmm . Analysis is hard. Usual Disclaimer: I'm just refreshing some statistics, which is pretty rusty for me. Please don't laugh at me if it is completely wrong. |
|
aha. A shot in the dark here. The form of the KL Divergence bears a lot of resemblance to the likelihood ratio test,
So in the limit I'd expect the exponential of the KL Divergence to be chi squared. Don't take my word for anything, however, I've not thought this out through carefully and there are likely holes in my reasoning. I'm not sure. Isn't KL divergence something like E_q[(P(i)log(P(i)/Q(i))]? So the distribution of the KL divergence should be something like E_p[E_q[(P(i)log(P(i)/Q(i))]], and from the specific forms of P and Q to this you need to do two hard integrals.
(Jul 29 '10 at 05:34)
Alexandre Passos ♦
For the derivation of KL from log likelihood, see this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.366&rep=rep1&type=pdf
(Jul 30 '10 at 08:50)
cglr
|


. So, the latter term will be entropy of the distribution, and the former will be, I'm not sure yet. This seems to be the hard part, and relies upon the assumptions that you are using to estimate p(X). This is a kind of "missing piece". If you could somehow believe that 
It seems hard to compute such an expectation analytically, and you kind of need that to have a p-value, no?
I agree. It definitely seems daunting. I'm mostly curious if someone's already done it.
Which question are you hoping to answer:
1) Is the difference between the two KL divergences statistically significant?
or 2) Does my distribution differ significantly from the hypothesized distribution?
The second can be answered with a KS-test. The first, I don't know.
I'd like to know if given a KL divergence value, KL(P,Q) (and information about degrees of freedom) a p-value saying how likely the sample distribution P was drawn from the true distribution Q can be calculated.
Question 1) is a very interesting related question.
Question 2) has a lot of answers including the t-test, KS-test, etc, etc.
Aside: Can you apply a KS test to a multinomial distribution?
About applying KS tests to multinomials I don't think it makes sense; KS is inherently unidimensional and continuous. How could you define a cdf-like statistic for the multinomial distribution? And why not just computing a p-value directly from the multinomial?
right...but i can get KL divergence between two multinomials...so a KS-test doesn't quite address this question.
What I'd do is use monte carlo to compute an empirical distribution for the KL divergence and create your signficante test with the p-value being the fraction of monte carlo estimated KL divergences (from samples from tue true distribution, that is) that is smaller than the divergence you want to test. But I'm not sure this qualifies as an answer.
Compared to what? What is the null hypothesis here?
The null hypothesis would be "P and Q are drawn from the same underlying distribution" or to be more faithful to the asymmetry of KL-divergence "P is drawn from the distribution Q". Based on KL-divergence and degrees of freedom with what probability, p, can we reject the null hypothesis?
@Andrew: in what sense is a monte carlo p-value unsuitable for this? By a monte carlo p-value I mean generate a lot of samples S from Q of the same size as P and compute, for all S, KL(S,Q). Then you count what fraction of them have KL(S,Q) higher than KL(P,Q). That, then, is a monte-carlo computed p-value (as in, the probability of generating that or something less likely according to the base distribution). You might need a lot of samples, but otherwise I think this should work.
@Alexandre: I definitely think that would work, but its slightly unsatisfying from a theoretical point of view. But yes, if KL-divergence doesn't follow a standard distribution that may be the best route, barring some breakthrough.