I'm still struggling with this problem of doing a decent job of density estimation, e.g., observing many URLs arriving in a stream, approximately following a power law, when I have no idea how many URLs there are in the wild. In this case, I only have the URL itself to predict density on. My quest has led me to Good-Turing density estimators, but I don't really understand them. I gather they incorporate the probability of probabilities, where the estimated probability of seeing foo is p(foo) ~ (f_foo + 1)/N * E(f_foo + 1)/E(f_foo), where f_foo is the number of times foo has been observed in a sample population and N is the number of observations made.

Specific questions:

1) Why use (f_foo + 1) and not f_foo?
2) What is the meaning behind E(f_foo + 1)/E(f_foo)?
3) Am I on the right track using Good-Turing estimators for my problem? I have thoroughly debugged, but it seems to be doing no better than just using a uniform density, while a Dirichlet-multinomial with a uniform prior does much better. (Am I applying the technique appropriately?)
4) What other techniques should I consider for this problem? I'd like to perform a comparison.
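For concreteness, here is a toy sketch of the two estimators I'm comparing in (3). It is not my actual code: the URLs, alpha, and vocab_size_guess are made-up placeholders, and the "Good-Turing" part is just the raw, unsmoothed Turing estimate rather than a full implementation.

```python
from collections import Counter

def turing_estimate(counts, total):
    """Basic (unsmoothed) Turing estimate: p(x) ~ (r + 1) * N_{r+1} / (N_r * total),
    where N_r is the number of distinct items seen exactly r times."""
    freq_of_freq = Counter(counts.values())            # r -> N_r
    probs = {}
    for item, r in counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        # With a raw histogram, N_{r+1} is often zero for the largest r; real
        # implementations smooth the N_r curve first, so here we just fall
        # back to the maximum-likelihood estimate r / total.
        if n_r_plus_1 > 0:
            probs[item] = (r + 1) * n_r_plus_1 / (n_r * total)
        else:
            probs[item] = r / total
    p_unseen_total = freq_of_freq.get(1, 0) / total    # mass for ALL unseen URLs combined
    return probs, p_unseen_total

def dirichlet_multinomial(counts, total, alpha=1.0, vocab_size_guess=10**6):
    """Add-alpha smoothing; unlike Good-Turing, it needs an explicit guess at
    how many distinct URLs could exist (vocab_size_guess)."""
    denom = total + alpha * vocab_size_guess
    return {item: (r + alpha) / denom for item, r in counts.items()}

urls = ["a.com", "b.com", "a.com", "c.com", "a.com", "b.com", "d.com"]
counts = Counter(urls)
gt_probs, p0 = turing_estimate(counts, len(urls))
dm_probs = dirichlet_multinomial(counts, len(urls))
print(gt_probs, p0)
print(dm_probs)
```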
http://www.grsampson.net/AGtf1.html
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.110.8518
Good-Turing empirically works very well, and people in NLP have moved away from it mainly because by exploiting the structure of the problem (with back-off estimation) you can improve language modeling. The idea behind Good-Turing smoothing is to compute a frequencies-of-frequencies histogram and smooth this histogram (instead of the individual features) by making it look more power-law-like. Essentially, a bit of the mass of the high-frequency elements is passed down to lower-frequency elements, which pass a lot of their mass to unseen elements. I'm not sure the formula you displayed is correct, but it's been a long time since I wrote code for Good-Turing estimation. While Good-Turing is a very good way of estimating how much mass should be assigned to all unseen events taken together, it neither discriminates between unseen events nor lets you guess how many distinct events should occur, so it's not a way of estimating the total number of URLs. About your questions:
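Roughly, the histogram-smoothing idea looks like the following simplified sketch, in the spirit of Gale & Sampson's Simple Good-Turing (the real procedure uses a more careful switch-over between raw and smoothed counts; this assumes at least two distinct observed frequencies, and the example counts are made up).

```python
import math
from collections import Counter

def smoothed_freq_of_freq(counts):
    """Fit log(N_r) = a + b*log(r) by least squares, so N_r is defined (and
    power-law-like) even where the raw histogram has gaps."""
    fof = Counter(counts.values())                      # r -> N_r
    rs = sorted(fof)
    xs = [math.log(r) for r in rs]
    ys = [math.log(fof[r]) for r in rs]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    return lambda r: math.exp(a + b * math.log(r))      # smoothed N_r

def adjusted_count(r, s):
    """Good-Turing adjusted count r* = (r + 1) * S(r + 1) / S(r): mass drifts
    from higher frequencies toward lower ones, and N_1 / N is what gets
    reserved for unseen events."""
    return (r + 1) * s(r + 1) / s(r)

counts = Counter({"a.com": 5, "b.com": 3, "c.com": 2, "d.com": 1, "e.com": 1})
s = smoothed_freq_of_freq(counts)
print({url: adjusted_count(r, s) for url, r in counts.items()})
```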
Hmm, I've read several other resources and I'm still no closer. My empirical and smoothed values match almost perfectly, as per Manning, Church, or Gale. What I haven't figured out is how to estimate the number of unseen objects; I've only seen examples for bigrams. I compute p_0 as N_1/N and re-calibrate my r/N estimates by that amount of probability, but I don't know how to hand p_0 out to unseen instances. Put another way: given X potential future observations, and given some object, what is the expected frequency of that object over those X? Here are my empirical and smoothed r values, just for laughs.
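To make the sticking point concrete, here is a toy sketch of what I mean; n_unseen is a made-up guess at the number of distinct unseen URLs (Good-Turing doesn't give me this), and splitting p_0 uniformly across them is purely an assumption.

```python
from collections import Counter

def unseen_allocation(counts, total, n_unseen, future_draws):
    """Distribute the Good-Turing unseen mass p_0 = N_1 / N uniformly over an
    assumed number of unseen URLs, and report the expected count of one such
    URL over a given number of future draws."""
    fof = Counter(counts.values())              # r -> N_r
    p0 = fof.get(1, 0) / total                  # total probability of all unseen URLs
    p_each_unseen = p0 / n_unseen               # uniform split: pure assumption
    expected_count = future_draws * p_each_unseen
    return p0, p_each_unseen, expected_count

counts = Counter(["a.com", "b.com", "a.com", "c.com"])
print(unseen_allocation(counts, total=4, n_unseen=1000, future_draws=10_000))
```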
As I mentioned in my answer, Good-Turing does not estimate the number of unseen objects, just all the mass that one should allocate to them.
(Nov 09 '10 at 03:59) Alexandre Passos ♦
I don't know about Good-Turing; can you give the URLs of the references you used?