2008-02-26

Pass the Bucket

Although not without its problems, hash tables are a popular data structure for storage and retrieval of key/value pairs, otherwise known as finite maps or finite functions. It is the main data structure in dynamic languages like JavaScript, AWK, Arc, Io, Python, Perl and others. To recapitulate, the idea behind a hash table is to store data in an array (which is a finite map with domain a contiguous range of nonnegative numbers), by encoding the position of the datum associated with key K by means of a function hash K, returning integral values between 0 and N - 1. What is the advantage of this? If the time to compute this function is essentially constant, and since array access is in turn constant, this makes a hash table a constant-time finite map on an arbitrary domain of keys.

All good and dandy except for the fact that, in general, hash is not injective: whenever two different keys KK′ have hash K = hash K′, a collision has occurred, and we must devise some way to arbitrate between the data wanting to go into the same position of the array. A popular way to do this is to crowd more than one item in the same position: by chaining overflowing items into a linked list, the hash table may become oblivious to hash collisions. An obvious problem with this is that if the hash function is very poor (think, for instance, on the hash that sends every item to 0), the expected constant-time algorithm degenerates into a linear search, which for thousands or millions of items is, in Knuth's words, "too horrible to contemplate".

The solution to this is two-pronged: first, choose a function hash that essentially behaves like a random function: the probability of hash K being equal to a given (arbitrary) number should be as near to 1/N as possible. This is in general pretty difficult, which is why good hash functions are hard to find. Secondly, it is important not to let the hash table become too crowded. By the pigeonhole principle, it is obvious that a table of size M containing N > M data would have at least an overflowed entry. To avoid this, increase the table size once it reaches a certain occupancy factor α = N/M. It is important that the size increases exponentially in order to maintain average-case constant-time operations (the literature explains why.) This, however, leaves wasted space behind, especially if after a number of insertions follow a number of deletions, as the extra space is never reclaimed. To address this last problem, shrink the table once α falls below some value. It is important, however, to add some hysteresis to the process to avoid successive resizings when insertions and deletions alternate on a table of critical size. Typically, a hash table must trigger an expansion when α ≥ 0.75 and a shrinkage when α < 0.5, for instance.

The bottom line is that it is easy to cobble together a substandard, underperforming implementation of hash tables. It is important to understand that if the hash function behaves essentially as a random variable there can be no performance guarantees, only probabilistic expectations. However, the comment usually heard is that hash tables "perform well in practice"; I should say that good implementations of hash tables perform well in practice. OCaml's implementation shows all the subtleties a good hash table data structure must take into account. The point I wanted to make, and that needed this long introduction (maybe it didn't really needed it, but then I hope you found it interesting nonetheless) is, what happens when you remove the buckets from a hash table? Why, you get vectors!

It is a glaring omission in the Ocaml's standard library the lack of an extensible, dynamically-resizing array. This is such a useful data structure that it should come included; it is, however, not very difficult to implement. The idea is to use an underlying array big enough for the elements currently stored, with a count of said elements. Upon getting full, the code resizes the array. For this, we need not only a mutable count but also a mutable array:

type 'a t = { mutable vec : 'a array; mutable cnt : int; }

On creation, it is convenient for the array to have a not too-small size to avoid frequent early doublings. Clearly an initial empty vector (of size 0) is less than convenient, as it would need special-casing the calculation. I settled on 16 initial elements, but this can be tuned:

let initial_size = 16

Now the problem with a non-empty array is that it needs initializing with an element to avoid type safety violations:

let make e = { vec = Array.make initial_size e; cnt = 0; }

The code must ensure, at all times, the precondition that no position in the array beyond cnt will ever be accessed. This would allow using an unsafe initialization:

let create () = make (Obj.magic 0)

This, however, demands that the type of vectors be made abstract; this is left as an exercise to you, the reader. With this, simple manipulations of the element count give us some handy functions:

let length v = v.cnt

let clear v = v.cnt <- 0

Getting and setting elements in the array must respect the vector invariant, by checking bounds:

let get v i =
  if 0 > i || i >= v.cnt then invalid_arg "index out of bounds" else
  Array.unsafe_get v.vec i

let set v i x =
  if 0 > i || i >= v.cnt then invalid_arg "index out of bounds" else
  Array.unsafe_set v.vec i x

It would be futile to check twice for bounds, once for our vectors and one for OCaml's native arrays, whose size is at least that of our vectors; so I can use an unsafe array operation with confidence. This is now enough to give us simple iterators over vectors:

let iteri (f : int -> 'a -> unit) v =
  for i = 0 to v.cnt - 1 do
    f i (Array.unsafe_get v.vec i)
  done

let fold f e v =
  let acc = ref e in
  iteri (fun _ x -> acc := f !acc x) v;
  !acc

Again, the validity of the invariant allows me to use unsafe accesses in iteri with impunity. I've used an imperative fold (left) based on iteri, but doing it the other way around is as easy. So far, the code leaves out the small matter of extension: all our vectors have constant size zero! Checking for the vector invariant is best left to an auxiliary function:

let check v =
  let siz = Array.length v.vec in
  if v.cnt == siz then begin
    let cnt = min (siz lsl 1) Sys.max_array_length in
    if cnt == siz then failwith "cannot resize" else
    let vec = Array.make cnt v.vec.(0) in
    Array.blit v.vec 0 vec 0 siz;
    v.vec <- vec
  end

If the array has leftover capacity to grow, there is obviously nothing to be done. If not, there is one implementation detail that must be taken into account: arrays in OCaml have a maximum size (which is rather small), so the new size is double the old size but only up to Sys.max_array_length. Beyond that, there's not much we could do except fail. If not, a new array is created and initialized with an arbitrary but existing (check that this is the case!) element of the old array; the elements are copied over and the array replaced. You can see the same code in the implementation of Hashtbl.resize, except that if the array reaches the maximum allowable length nothing is done and the hash table is allowed to overflow.

With this, we can extend a vector from its end:

let add v e =
  check v;
  Array.unsafe_set v.vec v.cnt e;
  v.cnt <- v.cnt + 1

The check ensures that the precondition is satisfied, and the underlying array can be accessed in an unsafe manner. Similarly, to remove the element at the end:

let remove v =
  if v.cnt == 0 then failwith "pop" else
  v.cnt <- v.cnt - 1;
  Array.unsafe_get v.vec v.cnt

These two operations endow the vector with a FIFO discipline, making it suitable for a stack with random access. The signature for the abstract data type implementation is:

module Vec :
  sig
    type 'a t
    val make   : 'a -> 'a t
    val create : unit -> 'a t
    val length : 'a t -> int
    val clear  : 'a t -> unit
    val get    : 'a t -> int -> 'a
    val set    : 'a t -> int -> 'a -> unit
    val iteri  : (int -> 'a -> unit) -> 'a t -> unit
    val fold   : ('a -> 'b -> 'a) -> 'a -> 'b t -> 'a
    val add    : 'a t -> 'a -> unit
    val remove : 'a t -> 'a
  end

It is interesting to compare the issues of extensibility in the face of type safety with Java's ArrayList<T> implementation. You can see in the constructor and in ensureCapacity unsafe code analogous to create. However, since the ArrayList stores object references, it avoids the need to initialize the underlying array with a dummy object: null suffices! The same idea could be used in OCaml: instead of using an underlying 'a array as the store, we could use an 'a option array initialized with None. Then, set and add would box data items with Some x, and get and remove would unbox them with a pattern match. The resulting interface would be unchanged, since no valid entry in the vector could be None, and the implementation would actually be type-safe. The fact that both versions would be observationally equivalent means that this implementation, although unsafe, is sound.

1 comment:

Anonymous said...

DynArray may also be worth a look!

http://ocaml-lib.sourceforge.net/doc/DynArray.html

http://t-a-w.blogspot.com/2006/05/ocaml-programming-best-practice.html