2009-06-12

Hash 'n Mix

How good is the hash function in your language of choice, particularly for strings? Chances are, not very. I applied the classical statistical tests popularized by Knuth and I was very surprised to see how bad OCaml's hash function on strings is.

I applied the chi-square test to the distribution of hashes over /usr/share/dict/words, for almost 235,000 words. I binned the integers from the most significant bit down, taking successively more bits into account, and tested each resulting distribution against the uniform distribution. Here are the results:

χ² test of 234936 hashes with Hashtbl.hash

 Bins               X²   Pr[χ² ≤ X²]
    2     5250.0017026   1.0000000
    4     6491.6332618   1.0000000
    8     9310.9851194   1.0000000
   16    20682.9648245   1.0000000
   32    42563.4542173   1.0000000
   64    88265.1637552   1.0000000
  128    94121.1397827   1.0000000
  256   155277.5725134   1.0000000
  512   209363.2757857   1.0000000
 1024   317909.0541084   1.0000000
 2048   389422.3450812   1.0000000
 4096   448253.5283822   1.0000000
 8192   663614.9166071   1.0000000
16384   887812.2319610   1.0000000
32768  1037118.9427589   1.0000000

The first column indicates how many bins were used; the reference χ² distribution has one degree of freedom less than the number of bins (ν = bins - 1). The second column shows the computed statistic X², and the third the probability that a χ² deviate with ν degrees of freedom would be at most this value. By Knuth's criterion an entry fails if that probability falls in the first or the last percentile, and is suspect if it falls in the first five or the last five percentiles; here every entry fails.

Pretty dismal, isn't it? The Kolmogorov-Smirnov test is equally appalling: the probability that the hashes deviate this much above a uniform deviate is p+ = 0.0016593, and the probability that they deviate this much below it is p- = 1.0000000; both are extreme.
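
For concreteness, here is a minimal sketch of how the binned χ² statistic above can be computed. This is my reconstruction, not the code behind the table, and it assumes the hashes are 30-bit non-negative integers, as Hashtbl.hash produces:

let chi_square b hashes =
  (* bin each hash by its b most significant bits *)
  let bins = 1 lsl b in
  let counts = Array.make bins 0 in
  List.iter (fun h ->
    let i = h lsr (30 - b) in
    counts.(i) <- counts.(i) + 1) hashes;
  (* X² = Σ (observed - expected)² / expected over all bins *)
  let expected = float (List.length hashes) /. float bins in
  Array.fold_left (fun x2 c ->
    let d = float c -. expected in
    x2 +. d *. d /. expected) 0. counts

For instance, chi_square 10 (List.map Hashtbl.hash words) would give the statistic for the 1024-bin row.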

Postprocessing the hashes with a simple mixing function helps matters considerably. I use the Murmur hash specialized to a single 32-bit value:

let murmur_hash_32 k h =
  (* MurmurHash2 on a single 32-bit key k with seed h *)
  let m = 0x5bd1e995l in
  (* mix the key *)
  let k = Int32.mul k m in
  let k = Int32.logxor k (Int32.shift_right_logical k 24) in
  let k = Int32.mul k m in
  (* fold it into the accumulator *)
  let h = Int32.mul h m in
  let h = Int32.logxor h k in
  (* final avalanche *)
  let h = Int32.logxor h (Int32.shift_right_logical h 13) in
  let h = Int32.mul h m in
  let h = Int32.logxor h (Int32.shift_right_logical h 15) in
  h

let universal_hash i h =
  (* mix h with seed i + 1, then truncate to a non-negative 30-bit int *)
  let r = murmur_hash_32 (Int32.of_int h) (Int32.of_int (succ i)) in
  Int32.to_int (Int32.logand r 0x3fffffffl)
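
As an aside of my own, and not part of the original post, one way to put the mixed hash to work is through Hashtbl.Make, fixing an arbitrary index to select one member of the family:

module MixedString = struct
  type t = string
  let equal (a : string) b = a = b
  (* illustration only: mix the polymorphic hash, using index 0 *)
  let hash s = universal_hash 0 (Hashtbl.hash s)
end

module MixedTbl = Hashtbl.Make (MixedString)

let () =
  let t = MixedTbl.create 16 in
  MixedTbl.replace t "hash" 1;
  MixedTbl.replace t "mix" 2;
  Printf.printf "%d %d\n" (MixedTbl.find t "hash") (MixedTbl.find t "mix")

Different indices give independent-looking hash functions over the same keys.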

Since Murmur hash achieves full 32-bit avalanche, I can use it to construct a universal hash function. The results of applying this mixing function to Hashtbl.hash are much more encouraging:

χ² test of 234936 hashes with mixing function

 Bins             X²   Pr[χ² ≤ X²]
    2      0.0360268   0.1505399
    4      4.9407498   0.8238125
    8     11.4643648   0.8803928
   16     17.1057309   0.6874168
   32     25.6769844   0.2633473
   64     55.6071781   0.2655752
  128    121.7178466   0.3843221
  256    250.5078489   0.4322981
  512    480.1555487   0.1675241
 1024    934.7566997   0.0230148
 2048   1983.7491061   0.1614639
 4096   3951.4456363   0.0550442
 8192   8032.6611503   0.1074982
16384  16249.4879286   0.2308936
32768  32526.7885722   0.1741126

Much better, except maybe for the distribution on 1024 bins (the 10 most significant bits). The Kolmogorov-Smirnov test shows that the probability that the hashes deviate this much above a uniform deviate is p+ = 0.7688490, and the probability that they deviate this much below it is p- = 0.4623243; these are much more balanced.
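
For reference, here is a sketch, again my own and not the code used for the figures above, of the Kolmogorov-Smirnov statistics K+ and K- for hashes scaled to [0, 1) and tested against the uniform distribution; the probabilities p+ and p- are then looked up in the distribution of these statistics:

let kolmogorov_smirnov hashes =
  (* scale 30-bit hashes into [0, 1) *)
  let xs = Array.of_list (List.map (fun h -> float h /. 1073741824.) hashes) in
  Array.sort compare xs;
  let n = float (Array.length xs) in
  let kplus = ref neg_infinity and kminus = ref neg_infinity in
  Array.iteri (fun i x ->
    (* largest deviations of the empirical CDF above and below the uniform CDF *)
    kplus  := max !kplus  (float (i + 1) /. n -. x);
    kminus := max !kminus (x -. float i /. n)) xs;
  (sqrt n *. !kplus, sqrt n *. !kminus)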

Know your hash function. It should be random.

2 comments:

Jon Harrop said...

Only in theory. :-)

I found I could get substantial speedups in my naive GC by altering the hash function to deliberately collide on close input values (essentially by dropping a few least significant bits). If your bucket is an array then that greatly improves cache coherence.

Matías Giovannini said...

@Jon, with linear chaining (almost) any hash function is acceptable. In the worst case you have a linear search, but the algorithms for lookup, insertion &c. all terminate. With open addressing schemes you're out of luck, since the correctness of the algorithms depends crucially on the hash function behaving essentially as a random map from your key space to your index space.

In fact, I embarked on this analysis because I can't seem to make insertion into a Cuckoo hash work with a particular dataset I found.