
e.g., q hasn't observed all possible outcomes. Obviously treating these as probability 0 doesn't work, and fudging the distribution to give them some epsilon probability doesn't seem appealing either. I'm kind of stumped. Should I leave these terms out of the KL equation, or charge the number of bits as if there were uniform probability over all remaining outcomes?

Fudging the probability with some epsilon can be justified theoretically. For example, if you add epsilon to all counts and then normalize, you're effectively using the posterior mean of a Dirichlet-multinomial model with a symmetric Dirichlet(epsilon) prior. If this gives very poor results you can look at richer estimation methods, such as Good-Turing. Another option is smoothing q by replacing it with alpha*q + (1-alpha)*p, for large alpha (this can be cleverer than adding epsilon uniformly, as it more or less follows the concentration in p, if p is some sort of true posterior). But yes, in general, if you don't smooth, the KL divergence between two discrete distributions computed from empirical counts often diverges.
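For concreteness, here's a minimal sketch of both smoothing options (the function names and the epsilon/alpha defaults are just illustrative choices, not anything canonical):

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) in bits; terms with p = 0 drop out (0 log 0 = 0),
    # but q must be > 0 wherever p > 0, hence the smoothing below.
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def additive_smoothing(counts, epsilon=0.5):
    # Add epsilon to every count and normalize: the posterior mean of a
    # Dirichlet-multinomial with a symmetric Dirichlet(epsilon) prior.
    counts = np.asarray(counts, dtype=float) + epsilon
    return counts / counts.sum()

def interpolation_smoothing(q, p, alpha=0.95):
    # Replace q with alpha*q + (1-alpha)*p, so the added mass follows
    # the concentration of p instead of being spread uniformly.
    return alpha * np.asarray(q, dtype=float) + (1 - alpha) * np.asarray(p, dtype=float)

p = additive_smoothing([5, 3, 2])   # "true" side, no zeros left
q = additive_smoothing([4, 6, 0])   # unobserved outcome gets epsilon mass
print(kl_divergence(p, q))          # finite, no divergence
```

Either way the point is the same: once every outcome has nonzero mass under q, the KL sum is finite.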

Thanks, guys. I'm trying to estimate the true distribution, p, in an online setting, and my goal here is to evaluate the quality of the estimator. I am using KL and MSE at the moment; I'm not sure if there's a better way. I figured that in my case a simple multinomial was the most naive baseline, and other techniques could be compared against it.

Indeed, as already suggested, one possibility is to smooth the distributions. An alternative is to use the Jensen-Shannon divergence instead, which is always finite (cf. http://www.cs.cornell.edu/home/llee/papers/aistats01.pdf).
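A quick sketch of JSD (my own helper, base 2 so the result lands in [0, 1]); it stays finite even when the supports disagree, because each distribution is compared against the mixture m = (p + q) / 2, which is positive wherever either distribution is:

```python
import numpy as np

def jensen_shannon(p, q):
    # JSD(p, q) = 0.5*KL(p || m) + 0.5*KL(q || m), with m = (p + q) / 2.
    # m > 0 wherever p > 0 or q > 0, so no smoothing is needed.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                    # 0 log 0 = 0 convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jensen_shannon([1.0, 0.0], [0.0, 1.0]))  # disjoint supports -> 1.0
```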
When people minimize KL(p, q) over q, p is usually the empirical distribution of the data. It's fully observed and has a lot of 0's, which is fine: by the convention 0 log 0 = 0 those terms drop out, so KL can even be computed faster.
@Yaroslav Bulatov: not always. For example, in Haghighi and Vanderwende's paper on summarization, http://aria42.com/pubs/naacl09-topical.pdf , the KLSum criterion they define minimizes KL(q, p) between the corpus empirical word-count distribution and the summary empirical word-count distribution, so you can hit this problem with zeroes.