I'm still struggling with this problem of doing a decent job of density estimation, e.g., I'm observing many URLs arriving in a stream, approximating a power law, when I have no idea how many URLs there are in the wild. In this case, I only have the URL itself to predict density on. My quest has led me to Good-Turing density estimators, but I don't really understand them. I suppose they incorporate the probability of probabilities, where the estimated probability of seeing foo is p(foo) ~ (f_foo + 1)/N * E(N_{f_foo+1})/E(N_{f_foo})

where f_foo is the number of times foo has been observed in a sample population, N is the number of observations made, and N_r is the number of distinct items observed exactly r times.
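
For concreteness, here is my naive reading of that formula in Python (the URL counts are made up, and there is no smoothing of the N_r histogram yet):

    from collections import Counter

    def naive_good_turing(counts):
        # counts: item -> observed frequency f_x
        N = sum(counts.values())           # total number of observations
        Nr = Counter(counts.values())      # N_r: number of distinct items seen exactly r times
        # unsmoothed Good-Turing: p(x) = (f_x + 1)/N * N_{f_x + 1} / N_{f_x}
        # (drops to zero whenever no item was seen exactly f_x + 1 times,
        # which is why the N_r histogram gets smoothed in practice)
        return {x: (f + 1) / N * Nr.get(f + 1, 0) / Nr[f] for x, f in counts.items()}

    # made-up URL counts, just to show the shape of the computation
    print(naive_good_turing({"a.com": 3, "b.com": 1, "c.com": 1, "d.com": 2, "e.com": 1}))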

Specific questions:

1) Why use (f_foo + 1) and not f_foo?

2) What is the meaning behind E(N_{f_foo+1})/E(N_{f_foo})?

3) Am I on the right track using Good-Turing estimators for my problem? I have thoroughly debugged, but it seems to be doing no better than just using a uniform density, while Dirichlet-multinomial with a uniform prior does much better. (Am I applying the technique appropriately?)

4) What other techniques should I consider for this problem? I'd like to perform a comparison.

asked Nov 07 '10 at 13:36 by downer

I don't know about Good-Turing; can you give the URLs of the references you used?

(Nov 07 '10 at 13:51) ogrisel

3 Answers:

http://www.grsampson.net/AGtf1.html

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.110.8518

answered Nov 07 '10 at 14:53 by downer

Good-Turing empirically works very well, and people in NLP have moved away from it mainly because, by exploiting the structure of the problem (with back-off estimation), you can improve language modeling.

The idea behind Good-Turing smoothing is to compute a frequencies-of-frequencies histogram and smooth that histogram (instead of the individual features) by making it more power-law-like. Essentially, a bit of the mass of the high-frequency elements is passed down to lower-frequency elements, which pass a lot of their mass to unseen elements. I'm not sure the formula you displayed is correct, but it's been a long time since I wrote code for Good-Turing estimation.
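
Roughly, the procedure looks like this. This is a stripped-down sketch of the "Simple Good-Turing" idea from the Gale & Sampson reference linked in the other answer: fit a power law to the N_r histogram and plug the fitted values into the Good-Turing formula. A real implementation also averages N_r over gaps in r and switches between raw and fitted estimates; the function name is just illustrative.

    import math
    from collections import Counter

    def simple_good_turing_sketch(counts):
        # counts: item -> observed frequency
        Nr = Counter(counts.values())            # frequencies-of-frequencies histogram
        rs = sorted(Nr)                          # assumes at least two distinct frequencies
        # fit log(N_r) = a + b*log(r) by least squares, i.e. force the histogram to be power-law-like
        xs = [math.log(r) for r in rs]
        ys = [math.log(Nr[r]) for r in rs]
        x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
        b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
        a = y_bar - b * x_bar
        S = lambda r: math.exp(a + b * math.log(r))   # smoothed N_r
        # smoothed counts r* = (r + 1) * S(r + 1) / S(r); total mass reserved for unseen items is N_1 / N
        N = sum(counts.values())
        r_star = {r: (r + 1) * S(r + 1) / S(r) for r in Nr}
        p0 = Nr.get(1, 0) / N
        return r_star, p0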

While Good-Turing is a very good way of estimating how much mass should be assigned to all unseen events, it neither discriminates between unseen events nor lets you guess how many distinct events remain unseen, so it's not a way of estimating the total number of URLs.

About your questions,

  1. This is due to probability mass "trickling down" in Good-Turing: in a power-law setting, any finite sample is likely to oversample the frequent events and undersample the rare ones.
  2. It's the smoothing factor. You can find better explanations, for example, in the Manning and Schütze statistical NLP book.
  3. You're on the right track, but it is very odd if Good-Turing is underperforming a smoothed Dirichlet-multinomial. In my (admittedly small: two or three problems a few years ago) experience, it was very easy to implement Good-Turing smoothing incorrectly, but when correctly implemented it performed a lot better than add-alpha smoothing (which is what a Dirichlet-multinomial with a uniform prior amounts to; see the sketch after this list).
  4. You should maybe read the language-modeling chapter in the Manning and Schütze Foundations of Statistical Natural Language Processing book, as it shows some competing alternatives to Good-Turing, discusses implementation details, and does an evaluation. Of course, your mileage may vary.
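
For comparison, the add-alpha baseline mentioned in point 3 is only a few lines; note that it needs the total vocabulary size (seen plus unseen URLs), which is exactly the quantity you say you don't know, so here it is a guess supplied by the caller:

    def add_alpha_probs(counts, vocab_size, alpha=1.0):
        # Dirichlet-multinomial with a symmetric (uniform) prior:
        # p(x) = (f_x + alpha) / (N + alpha * vocab_size)
        # vocab_size must cover the unseen items too
        N = sum(counts.values())
        denom = N + alpha * vocab_size
        seen = {x: (f + alpha) / denom for x, f in counts.items()}
        unseen_each = alpha / denom              # probability of any single unseen item
        return seen, unseen_each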

answered Nov 07 '10 at 19:49 by Alexandre Passos ♦

Hmm, I've read several other resources and am still no closer. My empirical and smoothed values match almost perfectly, as per Manning, Church, or Gale. What I haven't figured out is how to estimate the number of unseen objects; I've only seen examples for bigrams. I compute p_0 as N_1 / N and re-calibrate my r/N by this amount of probability, but I don't know how to give the p_0 out to unseen instances. Put another way: given X potential future observations and some objects I'm given, what is the expected frequency of each over that X? Here are my empirical and smoothed r values, just for laughs (the renormalisation step I'm using is sketched after the table).

r    N_r    p_emp(r)    r*
1    8346   1.0E-4      0.27728834878476494
2    316    2.0E-4      0.944421540482021
3    87     3.0E-4      1.7616392907127507
4    30     4.0E-4      2.646817695848339
5    22     5.0E-4      3.5681397398974313
6    12     6.0E-4      4.510902755855956
7    6      7.0E-4      5.467408943356538
8    8      8.0E-4      6.433245753277302
9    4      9.0E-4      7.405704733259923
10   6      0.0010      8.383032212990438
11   2      0.0011      9.364043019672588
12   1      0.0012      10.347907545174808
13   3      0.0013      11.334027786800613
14   1      0.0014      12.321961838050752
17   1      0.0017      15.293675091365033
20   1      0.0020      18.273356493394637
21   1      0.0021      19.267798615589253
25   1      0.0025      23.249794268439054
28   2      0.0028      26.239517625095186
31   1      0.0031      29.231150350433058
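
And the renormalisation step I'm using, for reference (the function name is just illustrative; it gives p_0 and a per-item probability for each seen frequency r, but still says nothing about how to split p_0 among individual unseen URLs):

    def renormalise(r_star, Nr, N):
        # reserve p_0 = N_1 / N for all unseen items, and scale the smoothed counts of the
        # seen items so that they account for the remaining 1 - p_0 of the probability mass
        p0 = Nr.get(1, 0) / N
        seen_mass = sum(Nr[r] * r_star[r] for r in Nr)
        p_seen = {r: (1 - p0) * r_star[r] / seen_mass for r in Nr}   # prob of one item observed r times
        return p0, p_seen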

answered Nov 08 '10 at 21:56 by downer

As I mentioned in my answer, Good-Turing does not estimate the number of unseen objects, just the total mass that one should allocate to them.

(Nov 09 '10 at 03:59) Alexandre Passos ♦