2010-04-22

The elusive Binary Search

Jon Bentley has scarred a couple of generations of programmers, or he was and continues to be right:

I've assigned this problem in courses at Bell Labs and IBM. […] In several classes and with over a hundred programmers, the results varied little: ninety percent of the programmers found bugs in their programs (and I wasn't always convinced of the correctness of the code in which no bugs were found).

Jon Bentley, Programming Pearls, p. 36

This disjunction is not meant to be exclusive. The question whether programmers know how to write a binary search or not crops again in connection with this quote (but the topic is perennial), and reddit jumps to try its collective hand at the task and fail with numbers echoing Bentley's. The remedy (namely, closely study Dijkstra's A Discipline of Programming, or at least the van Gasteren and Feijen's paper) is as effective as it is bitter. The alternative is to learn the algorithm by heart:

let binsearch v e =
  let n = Array.length v in
  if n = 0 then raise Not_found else
  let i = ref 0
  and j = ref n in
  while !i <> !j - 1 do
    let m = (!i + !j) / 2 in
    if v.(m) <= e
      then i := m
      else j := m
  done;
  if v.(!i) = e then !i else raise Not_found

(note: the midpoint calculation cannot overflow in OCaml). No ifs, no buts, not a single thing changed: the algorithm as presented above is proved correct. Alas, we live in an age of practical assurances, and like Thomas we trust more a green-lighted battery of tests than a formal derivation (who understands those anyway?). There is a nice, 4096-case test set available on-line; a parser and a test harness are simple enough to build for it. Each case consists of a label, a test value, a list of values and an expected outcome:

type test = Test of string * int * int array * bool

The file format is line-oriented; this makes easy to use a regexp-based parser. To ease the use of regular expressions, I build a test function out of a pattern:

let matching patt =
  let re = Str.regexp patt in
  fun str -> Str.string_match re str 0

(I close over the compiled regexp for efficiency). The required regexps are:

let is_label  = matching "^Problem [0-9]+$"
let is_start  = matching "^in \\[$"
let is_value  = matching "^-?[0-9]+$"
let is_endyes = matching "^\\]\\? yes$"
let is_endno  = matching "^\\]\\? no$"
let is_sep    = matching "^$"

Syntax errors in the test cases file are signalled by an exception:

exception SyntaxError of int * string

I keep track of the lines read for purposes of error reporting:

let lineno = ref 0

let syntax_error str = raise (SyntaxError (!lineno, str))

let read inch =
  try let l = input_line inch in incr lineno; Some l
  with End_of_file -> None

The parser reads a file and returns a list of test cases; it uses mutually-recursive functions as a traditional recursive-descent parser with synthetic attributes. Each function corresponds to a terminal in the production case ::= label value start value* end sep:

let rec parse_tests inch = List.rev (parse_case inch [])

and parse_case inch tests = match read inch with
| Some line when is_label  line -> parse_member inch tests line
| None -> tests
| _ -> syntax_error "PROBLEM expected"

and parse_member inch tests label = match read inch with
| Some line when is_value  line -> parse_start  inch tests label (int_of_string line)
| _ -> syntax_error "MEMBER expected"

and parse_start inch tests label cand = match read inch with
| Some line when is_start  line -> parse_value  inch tests label cand []
| _ -> syntax_error "START expected"

and parse_value inch tests label cand vals = match read inch with
| Some line when is_endyes line -> parse_finish inch tests label cand vals true
| Some line when is_endno  line -> parse_finish inch tests label cand vals false
| Some line when is_value  line ->
  parse_value  inch tests label cand (int_of_string line :: vals)
| _ -> syntax_error "VALUE or END expected"

and parse_finish inch tests label cand vals test = match read inch with
| Some line when is_sep    line ->
  parse_case   inch (Test (label, cand, Array.of_list (List.rev vals), test) :: tests)
| _ -> syntax_error "SEPARATOR expected"

The test harness iterates a test function over the resulting list, recording the number of successful tests along the total:

let test_binsearch inch =
  let count = ref 0
  and pass  = ref 0 in
  List.iter (fun (Test (label, cand, vals, test)) ->
    let result =
      try ignore (binsearch vals cand); test
      with Not_found -> not test
    in
    incr count;
    if result then begin
      Printf.printf "TEST %s: PASS\n" label;
      incr pass
    end else
      Printf.printf "TEST %s: FAIL\n" label
    ) (parse_tests inch);
  Printf.printf "\nSUMMARY: Passed %d of %d tests (%.2f %%)\n"
    !pass !count (100. *. float !pass /. float !pass)

In order to be tidy about using files, I use a helper handler:

let unwind ~protect f x =
  try let y = f x in let () = protect x in y
  with e -> let () = protect x in raise e

let with_input_channel f inch = unwind ~protect:close_in f inch

The main expression turns this little (less than 90 lines of code) program into an executable suitable for command-line execution:

let () =
  let inch =
    try let fname = Array.get Sys.argv 1 in open_in_bin fname
    with Invalid_argument _ -> stdin
  in with_input_channel test_binsearch inch

That's it. Compile and execute it with:

$ ocamlopt.opt -o binsrch.exe str.cmxa binsrch.ml
$ ./binsrch.exe tests.txt > tests.out

(I'm using Cygwin here, don't hate me). The result is as expected:

TEST Problem 1: PASS
TEST Problem 2: PASS
TEST Problem 3: PASS
[snip…]
TEST Problem 4094: PASS
TEST Problem 4095: PASS
TEST Problem 4096: PASS

SUMMARY: Passed 4096 of 4096 tests (100.00 %)

Now do yourself a favor and go read Dijkstra.

2 comments:

Anonymous said...

Could you be more specific on `the midpoint calculation cannot overflow in OCaml'?

Chris said...

Anonymous, OCaml defines Sys.max_array_size << max_int. It is possible for integer arithmetic to overflow in OCaml (e.g., (max_int + max_int)/2 is -1), but never for the midpoint calculation with valid array indices.