Hello everyone, first question here,
I'm working on stream processing within our services. I'm starting with something simple, such as which hostnames were connected to within the last timeslice, and then want to find which hostnames are furthest outside their normal range, similar to what Twitter, Google, etc. do with trending topics.
I've done some research into algorithms for this, and Count Sketch seems like a solid approach. However, after reading the paper that introduced it a few times, I still don't understand a few things, specifically the hash functions (both the ones that distribute items across multiple buckets and the one the paper calls 's', which hashes objects to {-1, +1}) and how one would go about selecting appropriate ones.
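To make the question concrete, here is roughly how the structure could look in Python. This is only a minimal sketch, not the paper's reference implementation: the class name `CountSketch`, the parameters `depth` and `width`, and the helper `_hashes` are all hypothetical, and a salted BLAKE2b digest is an assumption standing in for the pairwise-independent hash families the paper asks for.

```python
import hashlib
import statistics


class CountSketch:
    """Hypothetical minimal Count Sketch: `depth` rows of `width` counters."""

    def __init__(self, depth=5, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item, row):
        # Assumption: a salted BLAKE2b digest replaces the paper's
        # pairwise-independent hash families. One digest per row supplies
        # both the bucket hash and the sign hash 's' for that row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        value = int.from_bytes(digest, "little")
        bucket = (value >> 1) % self.width  # bucket hash: item -> [0, width)
        sign = 1 if value & 1 else -1       # sign hash s: item -> {-1, +1}
        return bucket, sign

    def add(self, item, count=1):
        # Each row updates exactly one counter by +count or -count.
        for row in range(self.depth):
            bucket, sign = self._hashes(item, row)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        # Undo the sign in each row, then take the median across rows.
        per_row = []
        for row in range(self.depth):
            bucket, sign = self._hashes(item, row)
            per_row.append(sign * self.table[row][bucket])
        return statistics.median(per_row)


sketch = CountSketch()
for _ in range(7):
    sketch.add("db01.example.com")
print(sketch.estimate("db01.example.com"))
```

In this sketch an odd `depth` keeps the median equal to one of the per-row values, and a larger `width` reduces the chance that unrelated hostnames collide in the same counter. Whether a general-purpose hash like this is an acceptable substitute for the paper's hash families is exactly the part I'm unsure about.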
Pseudocode examples would be very much appreciated. Thank you.