Hello everyone, first question here,

I'm doing some work on stream processing within our services, beginning with something simple like hostnames being connected to within the last timeslice, but also which hostnames are most outside of their normal range. Similar to what twitter/google/etc do with trending topics.

I've done some research into algorithms for doing this and it seems that Count Sketch would be a solid approach. However after reading through the paper that introduced it a few times I'm still not quite understanding a few things. Specifically the hash functions (both for distributing to multiple buckets and also the one referred to as 's' in the paper that hashes objects to {-1.+1}) , and how one would go about selecting appropriate ones.

Specific hash examples, and why they'd work well in this situation are very much appreciated. Thank you

asked Dec 21 '11 at 13:24

Daniel%20Peck's gravatar image

Daniel Peck

edited Dec 23 '11 at 18:22

One Answer:

You should probably link to the paper, as not everybody here has read it. It also seems at best marginally related to the topic of this forum.

The h functions are supposed to pick which bucket each example gets added to, and the s functions are supposed to pick whether you add or subtract from the value in that bucket. In practice probably any family of hash functions will do, say murmurhash, where you just pick the appropriate number of bits from the hash (for the s functions you really only need one bit per function, so if you have less than 32 of them a single 32-bit hash will suffice).

answered Dec 22 '11 at 06:45

Alexandre%20Passos's gravatar image

Alexandre Passos ♦

apologies for choosing the wrong forum to ask this question, I thought this algorithm would be of some use to to those doig data minig on streaming data in this community.

Thank you for the advice, i will explore the murmurhash family and will do my best to keep any future questions more on topic.

(Dec 23 '11 at 02:29) Daniel Peck
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.