I need a deterministic, random-access PRNG. In particular, I would like to take an arbitrary Python object and convert it (deterministically and quickly) to a number uniform in the range [0, 1).

Is murmur hash appropriate for generating random numbers for machine learning (not cryptography)?

How do I test if this is a good PRNG?

In particular, I am using Murmur 0.1.3 Python hooks, which are from February 2009 and use murmur hash 2.0. My Python code is located here, and excerpted as follows:

import murmur

MAX_UNSIGNED_LONG = 2**32  # number of distinct 32-bit hash values

def deterministicrandom(x):
    # Hash the repr of the object, then scale the 32-bit hash into [0, 1).
    return 1.0 * murmur.string_hash(repr(x)) / MAX_UNSIGNED_LONG

asked Jun 23 '10 at 17:10 by Joseph Turian ♦♦

6 Answers:

The hasard lib has tools to plot and measure the entropy (and other statistics) of PRNG-generated data.

Or you can use the test programs directly: TestU01, dieharder, and ENT.

answered Jun 23 '10 at 17:22 by ogrisel

Olivier gives some great answers.

From the documentation, it seems that dieharder gives a rigorous battery of tests, as does TestU01. However, they seem to require being able to call your PRNG from C. ENT is easier to use, because you can just output a file of random bytes and execute ENT on it.

I have modified my Python code so that it automatically dumps 500000 random bytes and runs ENT on it. It gives the following output, which appears superior to a 500000 byte file created by radioactive decay:

  • Entropy = 7.999962 bits per byte.
  • Optimum compression would reduce the size of this 4997120 byte file by 0 percent.
  • Chi square distribution for 4997120 samples is 263.84, and randomly would exceed this value 50.00 percent of the times.
  • Arithmetic mean value of data bytes is 127.5297 (127.5 = random).
  • Monte Carlo value for Pi is 3.141591613 (error 0.00 percent).
  • Serial correlation coefficient is 0.000609 (totally uncorrelated = 0.0).
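The dump-and-run step looks roughly like this (a minimal sketch rather than my exact script; the file name and hash count are placeholders, and it packs the raw 32-bit hash of consecutive integers instead of going through deterministicrandom):

    import struct
    import subprocess
    import murmur

    def dump_random_bytes(path, n_hashes=125000):
        # Four bytes per hash: pack the full 32-bit hash of consecutive integers.
        with open(path, "wb") as f:
            for i in range(n_hashes):
                h = murmur.string_hash(repr(i)) % 2**32
                f.write(struct.pack("<I", h))

    dump_random_bytes("murmur_bytes.bin")
    # ENT reports entropy, chi-square, arithmetic mean, Monte Carlo pi,
    # and serial correlation for the file.
    subprocess.call(["ent", "murmur_bytes.bin"])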

answered Jun 24 '10 at 03:40 by Joseph Turian ♦♦

Instead of relying on empirical tests, you can use more principled concepts such as almost universality or k-wise independence (sometimes called strong universality). It is not so difficult.

Quick definitions:

Consider families of hash functions from strings to L-bit values. A hash family is universal if the probability of a collision is at most 1/2^L. It is epsilon-almost universal, for some epsilon < 1, if the probability of a collision is no larger than epsilon. It is strongly universal or pairwise independent if the hash values of any two distinct strings are independent. It is k-wise independent if the hash values of any k distinct strings are independent.

Some practical choices:

You can get almost universal hashing with polynomials over a finite field. Essentially, you pick t at random in [0,p) for some prime p and compute the polynomial t^n + m1 t^{n-1} + ... + mn modulo p, where the mi's are the bytes of your object and n is the maximum length of your objects in bytes. The result is an (n/2^L)-almost universal family of hash values over [0,p). (See C++ code below.)

For pairwise independent hashing, consider the following multilinear hash function: the sum of ki * mi, where the mi are the bytes of data and the ki's are random numbers in [0,p) for some prime p. The sum must be computed modulo p (in the field of size p). Of course, you need as many random numbers as you can have bytes in your objects, but with a PRNG such as the Mersenne Twister you can generate them very quickly as you need them. There are faster variations on multilinear hashing that avoid the modulo reductions, but they are slightly more complicated. (See C++ code below.)

For even better results, you might want 3-wise independent hashing; you can get it by tabulation. See for example one of my recent papers: Daniel Lemire and Owen Kaser, Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698-710, 2010. (http://arxiv.org/abs/0705.4676) You can find related source code at http://code.google.com/p/ngramhashing/.

Once you have generated an L-bit hash value, you must convert your number in [0,p) to a floating point number in [0,1). It should not be difficult.
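For instance, the conversion is just a division (a minimal Python sketch; the prime is only an example, and for very large primes you would want to guard against the result rounding up to 1.0):

    PRIME = 2**31 - 1   # example prime; use whatever prime p your hash works over

    def to_unit_interval(h, p=PRIME):
        # h is a hash value in [0, p); the result lies on a grid of p points in [0, 1).
        return float(h) / p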

Source code examples in C++:

Almost universal scheme:

        // b and e are iterators over the bytes of the object; t is the random
        // point in [0, prime); prime must be small enough that h * t fits in 64 bits.
        uint64 h = 1;
        for(vector<chartype>::const_iterator i = b; i!=e; ++i) {
            h = (h * t) % prime;        // Horner step: multiply the running polynomial by t
            h += *i;                    // ...then add the next byte as a coefficient
            if(h >= prime) h -= prime;  // reduce back into [0, prime)
        }
        return h;                       // t^n + m1*t^(n-1) + ... + mn  (mod prime)

Pairwise independent multilinear scheme (based on a pseudo-random number generator class mt, I use the following Mersenne Twister generator: http://www-personal.umich.edu/~wagnerr/MersenneTwister.h):

        mt.seed(seed);  // the seed selects the hash function: it fixes the random ki's
        uint64 h = 0;
        for(vector<chartype>::const_iterator i = b; i!=e; ++i) {
            // add ki * mi (mod prime), drawing ki lazily from the Mersenne Twister
            h += ( static_cast<uint64>(mt.randInt(prime - 1))
                             * static_cast<uint64>(*i) ) % prime;
            if(h >= prime) h -= prime;
        }
        // a final random term makes the scheme pairwise independent, not just universal
        h += ( static_cast<uint64>(mt.randInt(prime - 1)) ) % prime;
        if(h >= prime) h -= prime;
        return h;

This answer is marked "community wiki".

answered Jun 30 '10 at 13:01 by Daniel Lemire

The only RNGs I have code for that work in random-access contexts are LCGs, EICGs, and hash-based RNGs. I think I might have also seen or heard of code for random access to some GFSR-type generators, but I don't have any lying around myself.

Murmur hash 2 on its own is surprisingly good as a random-access RNG. I just tried applying it to an incrementing 64-bit counter (upper 32 bits passed in as the "seed", lower 32 bits passed in as the data):

  • TestU01 SmallCrush: passes all tests
  • TestU01 Crush: failed 5 tests (out of 144)
  • TestU01 BigCrush: not tested (takes forever, and it failed Crush anyway)
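A rough Python version of that counter experiment, using murmur.string_hash as in the question (no separate seed argument is used here, so both halves of the counter go into the hashed bytes):

    import struct
    import murmur

    def random_access_value(counter):
        # Hash the 8-byte little-endian encoding of a 64-bit counter and
        # scale the 32-bit result into [0, 1).
        data = struct.pack("<Q", counter & (2**64 - 1))
        h = murmur.string_hash(data) % 2**32
        return h / float(2**32)

    # Random access: index 123456789 gives the same value every time.
    print(random_access_value(123456789))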

For use as a general RNG it would be a failure due to low speed and mediocre output quality, but for random access it looks like a very good option compared to the few random-access RNGs I have in my collection. LCGs are random access but produce low-quality output (and seeking is slower). EICGs are random access but very, very slow. I think some GFSR-type generators have some random-access capabilities, but I have not seen code for such and I imagine seeking would not be fast. Hash-function-based RNGs tend to be slow; I think this might be the fastest I have seen for its level of quality.
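For comparison, "random access" for an LCG means jump-ahead: the affine update x -> a*x + c (mod m) can be composed with itself in O(log n) steps rather than O(1). A sketch of that composition trick (the parameters you would plug in are not shown and are your choice, not a recommendation):

    def lcg_jump(a, c, m, x0, n):
        # Return the n-th LCG state x_n, where x_{k+1} = (a*x_k + c) % m,
        # without iterating n times.
        A, C = 1, 0                      # accumulated map, starts as the identity
        base_a, base_c = a % m, c % m    # the single-step map
        while n > 0:
            if n & 1:
                # compose: apply the accumulated map first, then the base map
                A, C = (base_a * A) % m, (base_a * C + base_c) % m
            # square the base map (one step becomes two steps)
            base_a, base_c = (base_a * base_a) % m, (base_a * base_c + base_c) % m
            n >>= 1
        return (A * x0 + C) % m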

While you can technically get whatever period you want, both empirical evidence and the structure of the code suggest that you should consider it to have a period on the order of 2**32. Output sequences longer than that tend to fail tests badly, not just slightly, in my brief testing. And the code itself forces the hashed state down to 32 bits between each 4-byte block.

answered Jul 01 '10 at 08:09 by orz

Murmur hash can probably be significantly upgraded to a pretty darned good RNG by simply feeding it the output of a linear generator rather than consecutive integers. Murmur hash is good at bit avalanche, but not perfect, so starting with inputs that share fewer bits would help it out a lot.

Finally, at the fairly gross hack level, you can simply xor the output of two murmur hash generators with mutually prime periods. This won't really give you a strong generator at the composite period, but it will improve the strength compared to the shorter generator.

The 32 bit issue is real, but there is a 64 bit version of murmurhash that might do better.

Murmurhash is used as a random-access generator at the core of the Mahout collections package, and empirical tests indicate that it gives a net speedup relative to simpler generators such as those used by the Trove library. This was associated with lower collision rates, so it seems that the better quality of the hashes more than compensates for the higher cost of the hashing.

answered Jul 03 '10 at 13:37 by Ted Dunning

Could you discuss more details of feeding it with a linear generator?

What precisely is the 32-bit issue, in your opinion?

Why do you say the high cost of hashing? Isn't murmurhash one of the fastest hashes in benchmarks?

(Jul 03 '10 at 15:07) Joseph Turian ♦♦

The original suggestion was to use murmurhash(x). If you instead use murmurhash((x * K1 + K2) % BIG_PRIME), then I think that the results of hashing tests will be better. K1, K2 and BIG_PRIME should be selected well, of course. BIG_PRIME could be elided, I think, with the correct choice of K1 and K2.
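In code, the pre-mixing step might look like this (the constants below are illustrative only; as noted, K1, K2, and BIG_PRIME need to be chosen well):

    import murmur

    K1 = 2654435761        # example multiplier (Knuth's multiplicative-hash constant)
    K2 = 40503             # arbitrary example offset
    BIG_PRIME = 2**31 - 1

    def prehashed(x):
        # Decorrelate consecutive integers before handing them to murmur.
        y = (x * K1 + K2) % BIG_PRIME
        return murmur.string_hash(repr(y)) % 2**32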

The 32 bit issue alludes to what orz mentioned about the state of the hash being restricted to 32 bits every 4 byte block. One way of avoiding this (subject to code inspection) is to use the 64 bit murmurhash. The other is to use two independent murmurhash sequences.

I mention the high cost of hashing because, in the context of hashed data structures, it is more common to use a simple linear congruential scheme which is enormously faster than murmurhash, but also clearly a bit too simple based on the speedups that murmurhash achieves. Murmurhash is definitely one of the fastest strong hashing functions around, but it is still slower than a single multiply, add and mod operation.

(Jul 03 '10 at 15:38) Ted Dunning

There are classical methods to test the output of RNGs; I would highly suggest you take a look at the Law and Kelton book. It is the bible of simulation. It has a chapter on random number generators, their quality, and how to test new RNGs for robustness. [See chapter 7]

In general there are two kind of tests.

  1. Empirical Tests
  2. Theoretical Tests

A very crude way is the "poker" test: you build a poker-hand dealer on top of your RNG and check whether the probabilities of the various hands match what is given in standard tables.

Or you can simply run a chi-square test. The ones I like are correlation tests.
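A rough sketch of a chi-square uniformity check on the hash-based generator from the question (the bin count and sample size are arbitrary; with 256 bins the statistic has 255 degrees of freedom, so values far from roughly 255 indicate trouble):

    import murmur

    def chi_square_uniformity(n_samples=100000, n_bins=256):
        # Bin the [0, 1) outputs and compare against a uniform expectation.
        counts = [0] * n_bins
        for i in range(n_samples):
            u = (murmur.string_hash(repr(i)) % 2**32) / float(2**32)
            counts[int(u * n_bins)] += 1
        expected = float(n_samples) / n_bins
        return sum((c - expected) ** 2 / expected for c in counts)

    print(chi_square_uniformity())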

There is one test that uses the lattice structure; you should get a structured diagram like the following when you run it on your RNG:

[lattice-structure diagram omitted]

answered Jul 03 '10 at 18:31 by Mark Alen
