|
I am currently faced with the following situation: I am observing a stream of x's drawn with replacement from some unknown p(x). p(x) is heavy-tailed, so many x's won't be seen at all, or will be seen only once over a given period. Given that I can monitor this stream of x's, how do I come up with a good estimate hat{p}(x)? Ideally, I would like to estimate two quantities: the probability that the next draw from the stream will be a particular x (independence can be assumed or not), and, after a given number of draws (say 1M), the expected number of times I'll see a given x. I don't know what this kind of estimation is called, let alone any good techniques, so I'd like some pointers. What is the most naive approach (I guess multinomial or Dirichlet-multinomial?), all the way up to more advanced approaches? Any and all help would be appreciated!
|
If your distribution is discrete and you know the size of the sample space, the most naive approach would be a Dirichlet-multinomial with a fairly high alpha hyperparameter. Assuming x can take N possible values, the probability of the next draw being equal to xi is (count(xi) + alpha) / (sum_j count(xj) + N*alpha). If alpha is big enough, it takes a lot of draws to move away from the uniform distribution toward a more peaked one. As pointed out by Kevin Kanini, however, the notion of a "tail" is not very usual in discrete distributions unless you have a really huge sample space.

Another possible setting is a discrete distribution where you don't know how many different values x can take. The usual approach there is a Dirichlet process, which predicts the probability of the next draw being xi as count(xi) / (sum_j count(xj) + alpha) if xi has already been observed, and predicts a new value (any new value) with probability alpha / (sum_j count(xj) + alpha).

If you want a power-law distribution, you can use a Pitman-Yor process, which adds a discount parameter d: the probability of the next sample being an already-observed xi is (count(xi) - d) / (sum_j count(xj) + alpha), and the probability of any new value is (K*d + alpha) / (sum_j count(xj) + alpha), where K is the number of different values observed so far.

You can estimate these hyperparameters either by cross-validation or by sampling, if you're willing to do more than one pass over the data.
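To make the formulas above concrete, here is a minimal Python sketch of the predictive rules: Dirichlet-multinomial for a known sample space, and Pitman-Yor (with the Dirichlet process as the d=0 special case), plus the expected-count calculation from the question. The function names, the toy stream, and the fixed alpha and d values are just illustrations; in practice you would tune the hyperparameters by cross-validation or sampling as noted above.

    from collections import Counter

    def dirichlet_multinomial_prob(counts, x, alpha, vocab_size):
        """Predictive probability of x under a symmetric Dirichlet-multinomial
        with known sample-space size `vocab_size` and pseudo-count `alpha`."""
        total = sum(counts.values())
        return (counts.get(x, 0) + alpha) / (total + vocab_size * alpha)

    def pitman_yor_prob(counts, x, alpha, d=0.0):
        """Predictive probability under a Pitman-Yor process with concentration
        `alpha` and discount `d` (d=0 reduces to the Dirichlet process).
        If x is unseen, this returns the total mass assigned to *any* new value,
        not the probability of one specific unseen x."""
        total = sum(counts.values())
        k = len(counts)  # number of distinct values observed so far
        if x in counts:
            return (counts[x] - d) / (total + alpha)
        return (k * d + alpha) / (total + alpha)

    def expected_count(prob, n_future):
        """Expected number of times x appears in the next n_future i.i.d. draws."""
        return prob * n_future

    # Toy usage: a short stream of query ids
    stream = ["a", "b", "a", "c", "a", "b"]
    counts = Counter(stream)

    p_a = pitman_yor_prob(counts, "a", alpha=1.0, d=0.5)
    print("P(next = 'a')          :", p_a)
    print("P(next is a new value) :", pitman_yor_prob(counts, "zzz", alpha=1.0, d=0.5))
    print("E[# of 'a' in next 1M] :", expected_count(p_a, 1_000_000))

Setting d=0 gives the Dirichlet process predictive; increasing d shifts mass toward new values, which is what produces the power-law behavior in the number of distinct values observed.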
Can you tell us a bit more about the space the x's are being drawn from? You mentioned multinomial, which makes me think it's a finite space. If so, in what way is your p(x) "heavy-tailed"?
One can think of queries being posted to a search engine: the frequencies seem to follow a power law (or something similar), and one does not know the set of candidates in advance. Here the x's would, for instance, be IDs for each query, or optionally the terms in the query represented as some vector (if that makes things easier). I was thinking of using kernel density estimation, but I don't think it's a perfect approach: I don't know whether I should take the influence of nearby x's into account, and the number of kernels would, of course, become huge.
Thanks for the help. I'll do a little research on the suggestions and respond when I understand things a little better.