e.g., q hasn't observed all possible outcomes. Obviously, treating these as probability 0 doesn't work, and somehow fudging the distribution to have some epsilon probability doesn't seem appealing either. I'm kind of stumped. Should I leave these outcomes out of the KL equation, or charge the number of bits as if all remaining outcomes had uniform probability?
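
To make the problem concrete, here is a minimal sketch (the distributions are made up) of how the KL blows up when q assigns probability 0 to an outcome that p supports:

```python
import math

# Made-up example: p is the "true" distribution over three outcomes,
# but q was estimated from a sample that never saw the third one.
p = [0.5, 0.4, 0.1]
q = [0.6, 0.4, 0.0]

def kl_bits(p, q):
    """KL(p || q) in bits; zero-p terms drop out (x*log x -> 0 as x -> 0),
    but a zero in q where p > 0 makes the divergence infinite."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return float("inf")
        total += pi * math.log2(pi / qi)
    return total

print(kl_bits(p, q))  # inf -- the unobserved outcome breaks the sum
```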

asked Oct 23 '10 at 16:30

downer

When people minimize KL(p,q) over q, p is the empirical distribution of the data. It's fully observed and has a lot of 0's, which is good because the KL can be computed faster.

(Oct 23 '10 at 19:53) Yaroslav Bulatov

@Yaroslav Bulatov: not always. For example, in Haghighi and Vanderwende's paper on summarization, http://aria42.com/pubs/naacl09-topical.pdf , the KLSum criterion they define minimizes KL(q,p) between the corpus's empirical word-count distribution and the summary's empirical word-count distribution, so you can have this problem with zeros.

(Oct 23 '10 at 19:58) Alexandre Passos ♦

3 Answers:

Fudging the probability to have some epsilon can be justified theoretically. For example, if you add epsilon to all counts and then normalize, you're effectively using the posterior mode of a Dirichlet-multinomial distribution with a uniform prior of epsilon. If this leads to very poor results you can look at richer estimation methods, such as Good-Turing. Another option is smoothing q by replacing it with alpha*q + (1-alpha)*p, for large alpha (this can be cleverer than adding epsilon uniformly, as it more or less follows the concentration in p, if p is some sort of true posterior).

But, yes, in general, if you don't smooth, the KL between two discrete distributions computed from empirical counts often diverges.
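
A minimal sketch of both smoothing schemes (in Python; the counts, eps, and alpha values here are arbitrary):

```python
import numpy as np

def add_epsilon(counts, eps=1.0):
    """Add eps to every count and normalize (the 'fudge' above);
    a Bayesian estimate under a symmetric Dirichlet(eps) prior."""
    counts = np.asarray(counts, dtype=float) + eps
    return counts / counts.sum()

def interpolate(q, p, alpha=0.9):
    """Smooth q toward p: alpha*q + (1-alpha)*p, so every outcome
    with nonzero mass under p gets nonzero mass under the result."""
    return alpha * np.asarray(q, float) + (1 - alpha) * np.asarray(p, float)

def kl_bits(p, q):
    """KL(p || q) in bits; zero-p terms drop out, and a zero in q
    where p > 0 makes the divergence infinite."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return float("inf")
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Made-up counts: q's sample never saw the third outcome.
p = add_epsilon([50, 40, 10], eps=0.0)      # fully observed empirical p
q_raw = add_epsilon([55, 45, 0], eps=0.0)
print(kl_bits(p, q_raw))                    # inf
print(kl_bits(p, add_epsilon([55, 45, 0]))) # finite after add-eps smoothing
print(kl_bits(p, interpolate(q_raw, p)))    # finite after interpolation
```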

answered Oct 23 '10 at 17:59

Alexandre Passos ♦

edited Oct 23 '10 at 17:59

Thanks guys. I'm trying to estimate the true distribution, p, in an online setting, and my goal here is to evaluate the quality of the estimator. I'm using KL and MSE at the moment; I'm not sure if there's a better way. I figured that in my case a simple multinomial was the most naive baseline, and other techniques could be compared to it.
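
For concreteness, here is a rough sketch of the evaluation loop I have in mind (the true p, the add-one-smoothed multinomial baseline, and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true distribution; the baseline is the empirical multinomial
# with add-one smoothing, scored against p with KL and MSE as data arrives.
p_true = np.array([0.5, 0.3, 0.15, 0.05])
counts = np.zeros_like(p_true)

def kl_bits(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

for t in range(1, 1001):
    x = rng.choice(len(p_true), p=p_true)  # one new observation
    counts[x] += 1
    q_hat = (counts + 1.0) / (counts.sum() + len(counts))  # add-one smoothing
    if t % 250 == 0:
        mse = float(np.mean((p_true - q_hat) ** 2))
        print(f"t={t}: KL={kl_bits(p_true, q_hat):.4f} bits, MSE={mse:.6f}")
```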

answered Oct 24 '10 at 15:33

downer

Indeed, as already suggested, one possibility is to smooth the distributions. An alternative might be to use the Jensen-Shannon divergence instead (cf. http://www.cs.cornell.edu/home/llee/papers/aistats01.pdf).
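
For instance, here is a minimal sketch (with made-up distributions) showing that the JSD stays finite even when one distribution has zeros, because both are compared against their mixture:

```python
import numpy as np

def kl_bits(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd_bits(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their
    mixture m. Always finite (bounded by 1 bit), since m > 0
    wherever either p or q is positive."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# q has an unobserved outcome, yet the JSD is finite.
print(jsd_bits(np.array([0.5, 0.4, 0.1]), np.array([0.6, 0.4, 0.0])))
```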

answered Oct 25 '10 at 09:41

b_a
