2010-11-13

A First-Principles DNS Client

Ever wondered how nslookup works? I'm a protocol junkie, and Domain Name (RFC 1035) has it all. It's a simple protocol with a highly structured message format, so the rich machinery of OCaml makes easy to build a simple but relatively complete client. The code is long for a blog post (a bit less than 500 lines), so I'll show the highlights and leave the rest for you to use as you see fit. Since I'm not constrained by presenting a literate bottom-up program, I'll start with a couple of examples. A simple query is straightforward:

# query_dns "10.0.0.1" (query ~q_type:`A 0 "www.nytimes.com");;
- : dns_record =
{id = 0; detail = 33152;
 question = [{q_name = "www.nytimes.com"; q_type = `A; q_class = `IN}];
 answer =
  [{rr_name = "www.nytimes.com"; rr_type = `A; rr_class = `IN; rr_ttl = 120l;
    rr_rdata = `Address 164.107.65.0}];
 authority =
  [{rr_name = "www.nytimes.com"; rr_type = `NS; rr_class = `IN; rr_ttl = 60l;
    rr_rdata = `Domain "nss1.sea1.nytimes.com"};
   {rr_name = "www.nytimes.com"; rr_type = `NS; rr_class = `IN; rr_ttl = 60l;
    rr_rdata = `Domain "nss1.lga2.nytimes.com"}];
 additional =
  [{rr_name = "nss1.lga2.nytimes.com"; rr_type = `A; rr_class = `IN;
    rr_ttl = 60l; rr_rdata = `Address 164.107.65.0}]}

(I installed a top-level pretty printer to show the inet_addresses). As another example, here's how to query the servers to which to send mail for a domain:

let mail_servers server domain =
  let res = query_dns server (query ~q_type:`MX 0 domain) in
  List.sort (fun (p, _) (p', _) -> compare p p')
    (List.map (function { rr_rdata = `Exchange (p, d); _ } -> (p, d))
      (List.filter (fun { rr_type; _ } -> rr_type = `MX)
        res.answer))

And here it is in action:

# mail_servers "10.0.0.1" "google.com";;
- : (int * domain_name) list =
[(100, "google.com.s9a1.psmtp.com"); (200, "google.com.s9a2.psmtp.com");
 (300, "google.com.s9b1.psmtp.com"); (400, "google.com.s9b2.psmtp.com")]

In total, I think I've put in 20 hours to this little project, including this write-up. The networking part is very simple, though it took me a number of tries to get it right (I'm rusty):

let dns_port  = (Unix.getservbyname "domain" "udp").Unix.s_port

let query_dns addr q =
  let len = 4096 in
  let buf  = String.create len in
  let msg  = Writer.run (write_dns_record q) in
  let sock = Unix.socket Unix.PF_INET Unix.SOCK_DGRAM 0 in
  let peer = Unix.ADDR_INET (Unix.inet_addr_of_string addr, dns_port) in
  unwind ~protect:Unix.close (fun sock ->
    Unix.setsockopt_float sock Unix.SO_RCVTIMEO  1.;
    let _ = Unix.sendto sock msg 0 (String.length msg) [] peer in
    let cnt, _ = Unix.recvfrom sock buf 0 len [] in
    match Parser.run parse_dns_record (String.sub buf 0 cnt) with
    | Some dns -> dns
    | None     -> failwith "Parse error"
  ) sock

It is important to set the receive timeout for the datagram socket, otherwise it blocks trying to read too many bytes. The function takes the address of the Domain Name server to query and a dns_record containing the query. It is serialized, sent to the server, a response is received and parsed. For that, I use a parsing monad and a writer monoid to structure the handling of messages (the slogan is "parsing is monadic, pretty-printing is monoidal"):

module Parser = struct
  type cursor = string * int

  type 'a parser = Parser of (cursor -> 'a option * cursor)

  include Monad (struct
    type 'a t = 'a parser
    let return x = Parser (fun cur -> (Some x, cur))
    let bind f (Parser p) = Parser (fun cur ->
      match p cur with
      | (None  , _  ) -> (None, cur)
      | (Some x, cur) -> let Parser q = f x in q cur)
    let fail = Parser (fun cur -> (None, cur))
  end)

This is a simple deterministic parsing monad using option instead of list. The only quirk is that Domain Name messages use back-references for data compression, so I need full random-access to the message buffer with positioning information in the form of a cursor. Running a parser is done by providing it with a buffer:

  let run (Parser p) str = let (res, _) = p (str, 0) in res

The use of positioning information means that the parser is also a state monad:

  let tell     = Parser (fun (_, pos as cur) -> (Some pos, cur))
  let seek off = Parser (fun (str, _ as cur) ->
    if 0 <= off && off <= String.length str
    then (Some (), (str, off))
    else (None, cur) )

  (* … code omitted … *)
end

I deal with low-level entities like bytes (properly octets, but let's not quibble) and big-endian integers. Parsing these is straightforward:

type int16  = int
type byte   = int
type bytes  = string

let byte : byte Parser.t = Parser.(fmap int_of_char next)

let bytes cnt : bytes Parser.t = Parser.take cnt

let int16 : int16 Parser.t = 
  let open Parser in
  byte >>= fun hi ->
  byte >>= fun lo ->
  return ((hi lsl 8) lor lo)

let int32 : int32 Parser.t =
  let open Parser in
  let (<|) m n = Int32.logor (Int32.shift_left m 8) (Int32.of_int n) in
  byte >>= fun b0 ->
  byte >>= fun b1 ->
  byte >>= fun b2 ->
  byte >>= fun b3 ->
  return (((Int32.of_int b0 <| b1) <| b2) <| b3)

By the way, I love the new syntax for modules in OCaml 3.12! Now labels are length-prefixed sequences of bytes, and domain names are sequences of labels:

type label        = bytes
type domain_name  = bytes

let write_bytes (str : bytes) =
  let len = String.length str in
  if len > 255 then invalid_arg "write_bytes" else
  Writer.(byte len ++ bytes str)

let parse_bytes = Parser.(byte >>= bytes)

let write_label (lbl : label) =
  let len = String.length lbl in
  if len > 63 then invalid_arg "write_label" else
  Writer.(byte len ++ bytes lbl)

let parse_label : label Parser.t =
  let open Parser in
  ensure (byte >>= fun n -> guard (0 < n && n < 64)) >> byte >>= bytes

The RFC specifies that labels are limited to 63 octets in length, and domain names can use back-references for reducing the size of a packet. Thus a domain name is terminated by the null label (corresponding to the top-level domain "."), or by a pointer to another label in the packet:

let re_dot = Str.regexp_string "."

let split_domain (name : domain_name) = Str.split re_dot name

let write_domain_name (name : domain_name) =
  let labels = split_domain name in
  Writer.(sequence write_label labels ++ byte 0)

let parse_domain_name : domain_name Parser.t =
 let open Parser in
 let rec labels () =
    sequence parse_label       >>= fun ls  ->
    byte                       >>= fun n   ->
    if n = 0 then return ls else
    guard (n land 0xc0 = 0xc0) >>
    byte                       >>= fun m   ->
    let off = ((n land 0x3f) lsl 8) lor m in
    tell                       >>= fun pos ->
    seek off                   >>
    labels ()                  >>= fun ls' ->
    seek pos                   >>
    return (ls @ ls')
  in fmap (String.concat ".") $ labels ()

Writing a domain name is straightforward, as I don't do compression. Reading a domain name is done recursively. Note that parse_label fails (via ensure) whenever the label is empty or overlong. The first case corresponds to the top-level domain, which ends the sequence; the second case corresponds to a 16-bit pointer which is signalled by its two most-significant bits set. To read the back-reference I record the position in the packet via tell, seek to the referenced position, parse the label sequence recursively (which might trigger yet more back-references), restore the original position and return all the results. For example, consider the following complete response (don't mind the names yet):

id        0000  \000\000
detail    0002  \132\000
qdcount   0004  \000\001
ancount   0006  \000\002
nscount   0008  \000\000
arcount   0010  \000\000
q_name    0012  \009_services\007_dns-sd\004_udp
          0035  \005local\000
q_type    0042  \000\012
q_class   0044  \000\001
rr_name   0046  \192\012
rr_type   0048  \000\012
rr_class  0050  \000\001
rr_ttl    0052  \000\000\000\010
rr_dlen   0056  \000\020
rr_rdata  0058  \012_workstation
          0071  \004_tcp\192\035
rr_name   0078  \192\012
rr_type   0080  \000\012
rr_class  0082  \000\001
rr_ttl    0084  \000\000\000\010
rr_dlen   0088  \000\008
rr_rdata  0090  \005_http\192\071
          0098

The domain name at offset 0090 (_http) refers back to 0071 (_tcp) which in turn refers back to 0035 (local), which spells _http._tcp.local. Now Domain Name is basically a distributed database system which replies to queries with responses containing zero or more resources. The types of resources is spelled out in detail in the RFC:

type rr_type = [
| `A | `NS | `MD | `MF | `CNAME | `SOA | `MB | `MG
| `MR | `NULL | `WKS | `PTR | `HINFO | `MINFO | `MX | `TXT
]

Why polymorphic variants? Because query types are a superset of resource types:

type q_type = [ rr_type | `AXFR | `MAILB | `MAILA | `ANY ]

This lets me reuse the code for converting back and forth between labels and protocol integers. In turn, each standard resource has its particular format:

type resource = [
| `Hostinfo  of string * string
| `Domain    of domain_name
| `Mailbox   of domain_name * domain_name
| `Exchange  of int * domain_name
| `Data      of bytes
| `Text      of bytes list
| `Authority of domain_name * domain_name * int32 * int32 * int32 * int32 * int32
| `Address   of Unix.inet_addr
| `Services  of int32 * byte * bytes
]

The types are dictated by the protocol, and they in turn dictate how to format and write resource data. The first is straightforward:

let write_resource =
  let open Writer in function
  | `Hostinfo (cpu, os) ->
       write_bytes       cpu
    ++ write_bytes       os
  | `Domain name ->
       write_domain_name name
  | `Mailbox  (rmbx, embx) ->
       write_domain_name rmbx
    ++ write_domain_name embx
  | `Exchange (pref, exch) ->
       int16             pref
    ++ write_domain_name exch
  | `Data data ->
       bytes             data
  | `Text texts ->
    sequence write_bytes texts
  | `Authority (mname, rname, serial, refresh, retry, expire, minimum) ->
       write_domain_name mname
    ++ write_domain_name rname
    ++ int32             serial
    ++ int32             refresh
    ++ int32             retry
    ++ int32             expire
    ++ int32             minimum
  | `Address addr ->
       int32  (Obj.magic addr : int32)
  | `Services (addr, proto, bmap) ->
       int32             addr
    ++ byte              proto
    ++ bytes             bmap

(the use of Obj.magic to coerce addresses back and forth is relatively unperilous). Parsing requires knowing the resource type to decode the payload. Note that Writer.sequence formats a list by formatting each element in turn while Parser.sequence applies repeatedly a parser until it fails and returns the list of intermediate parsers:

let parse_resource rr_type rr_dlen =
  let open Parser in match rr_type with
  | `HINFO ->
    parse_bytes                >>= fun cpu ->
    parse_bytes                >>= fun os  ->
    return (`Hostinfo (cpu, os))
  | `MB | `MD | `MF | `MG | `MR | `NS
  | `CNAME | `PTR ->
    parse_domain_name          >>= fun name ->
    return (`Domain name)
  | `MINFO ->
    parse_domain_name          >>= fun rmailbx ->
    parse_domain_name          >>= fun emailbx ->
    return (`Mailbox (rmailbx, emailbx))
  | `MX ->
    int16                      >>= fun preference ->
    parse_domain_name          >>= fun exchange   ->
    return (`Exchange (preference, exchange))
  | `NULL ->
    bytes rr_dlen              >>= fun data ->
    return (`Data data)
  | `TXT ->
    sequence (byte >>= bytes)  >>= fun texts ->
    return (`Text texts)
  | `SOA ->
    parse_domain_name          >>= fun mname   ->
    parse_domain_name          >>= fun rname   ->
    int32                      >>= fun serial  ->
    int32                      >>= fun refresh ->
    int32                      >>= fun retry   ->
    int32                      >>= fun expire  ->
    int32                      >>= fun minimum ->
    return (`Authority (mname, rname, serial, refresh, retry, expire, minimum))
  | `A ->
    int32                      >>= fun addr ->
    return (`Address  (Obj.magic addr : Unix.inet_addr))
  | `WKS ->
    int32                      >>= fun addr   ->
    byte                       >>= fun proto  ->
    bytes (rr_dlen-5)          >>= fun bitmap ->
    return (`Services (addr, proto, bitmap))

There's nothing to it, really, as monadic notation makes the code read straight. A resource is described by a rsrc_record, which can be written and parsed fairly simply:

type rsrc_record = {
  rr_name  : domain_name;
  rr_type  : rr_type;
  rr_class : rr_class;
  rr_ttl   : int32;
  rr_rdata : resource;
}

let write_rsrc_record r =
  let open Writer in
  let rr_rdata = run (write_resource r.rr_rdata) in
     write_domain_name      r.rr_name
  ++ int16 (int_of_rr_type  r.rr_type)
  ++ int16 (int_of_rr_class r.rr_class)
  ++ int32                  r.rr_ttl
  ++ int16     (String.length rr_rdata)
  ++ bytes                    rr_rdata

let parse_rsrc_record =
  let open Parser in
  parse_domain_name              >>= fun rr_name  ->
  fmap rr_type_of_int  int16     >>= fun rr_type  ->
  fmap rr_class_of_int int16     >>= fun rr_class ->
  int32                          >>= fun rr_ttl   ->
  int16                          >>= fun rr_dlen  ->
  parse_resource rr_type rr_dlen >>= fun rr_rdata ->
  return { rr_name; rr_type; rr_class; rr_ttl; rr_rdata; }

Note that since the resource payload is prefixed by its length I have to write it out-of-band to know its length. This is the only instance of buffer copying in the implementation. A question is handled similarly:

type question = {
  q_name  : domain_name;
  q_type  : q_type;
  q_class : q_class;
}
 
let write_question q =
  let open Writer in
     write_domain_name     q.q_name
  ++ int16 (int_of_q_type  q.q_type )
  ++ int16 (int_of_q_class q.q_class)

let parse_question =
  let open Parser in
  parse_domain_name         >>= fun q_name  ->
  fmap q_type_of_int  int16 >>= fun q_type  ->
  fmap q_class_of_int int16 >>= fun q_class ->
  return { q_name; q_type; q_class; }

A Domain Name record has a serial number, a bit field with options and response codes and a sequence of question followed by various sequences of resources as answers:

type dns_record = {
  id         : int16;
  detail     : int16;
  question   : question    list;
  answer     : rsrc_record list;
  authority  : rsrc_record list;
  additional : rsrc_record list;
}

let write_dns_record d =
  let open Writer in
     int16                      d.id
  ++ int16                      d.detail
  ++ int16 (List.length         d.question  )
  ++ int16 (List.length         d.answer    )
  ++ int16 (List.length         d.authority )
  ++ int16 (List.length         d.additional)
  ++ sequence write_question    d.question
  ++ sequence write_rsrc_record d.answer
  ++ sequence write_rsrc_record d.authority
  ++ sequence write_rsrc_record d.additional

let parse_dns_record =
  let open Parser in
  int16                            >>= fun id         ->
  int16                            >>= fun detail     ->
  int16                            >>= fun qdcount    ->
  int16                            >>= fun ancount    ->
  int16                            >>= fun nscount    ->
  int16                            >>= fun arcount    ->
  repeat qdcount parse_question    >>= fun question   ->
  repeat ancount parse_rsrc_record >>= fun answer     ->
  repeat nscount parse_rsrc_record >>= fun authority  ->
  repeat arcount parse_rsrc_record >>= fun additional ->
  return { id; detail; question; answer; authority; additional; }

Very high-level, straight-line code thanks to the structuring power of the algebraic structures! A query involves asking for a resource identified by domain name. If you analyze the bit field, you'll notice that I ask for a recursive query:

let query ?(q_type=`A) id q_name =
  if id land 0xffff <> id then invalid_arg "query" else {
  id;
  detail     = 0b0_0000_0010_000_0000;
  question   = [ { q_name; q_type; q_class = `IN; } ];
  answer     = [];
  authority  = [];
  additional = []; }

That's it! Grab the code and play with it; I've put it in the Public Domain so that you can use it for whatever purpose you want.

2010-10-17

Text to PDF

I find that the best way for me to learn is to reexpress what I read in a different but equivalent form. Transcribing notes and translation are two applications of this heuristic. In this case, I've translated this simple and pretty Factor PDF text formatter into OCaml. The result is almost as compact, at less than 200 lines of code, thanks to the compositional nature that OCaml and Factor share, even if OCaml's library of combinators is rather poor. As a conscious style decision, I try to keep all definitions as short as possible. As usual, I start with an operator for functional composition:

let ( % ) f g x = f (g x)

PDF is essentially a textual format, so it's convenient to perform the conversion in memory by using strings. A PDF object is a string or stream tagged by a reference number:

let pdf_object n str =
  String.concat "\n" [string_of_int n ^ " 0 obj"; str; "endobj"]

These objects can be referenced later by such a number:

let pdf_ref n = string_of_int n ^ " 0 R"

PDF strings are quoted by parentheses like in Postscript, so they must be escaped. The function escape_with is somewhat crude but sufficient for the small number of escapes required:

let escape_with escapes str =
  let buf = Buffer.create (String.length str) in
  String.iter (fun c ->
    try Buffer.add_string buf (List.assoc c escapes)
    with Not_found -> Buffer.add_char buf c) str;
  Buffer.contents buf

let pdf_escapes = [
  '\008', "\\b";
  '\t'  , "\\t";
  '\n'  , "\\n";
  '\012', "\\f";
  '\r'  , "\\r";
  '('   , "\\(";
  ')'   , "\\)";
  '\\'  , "\\\\";
]

let pdf_string str = "(" ^ escape_with pdf_escapes str ^ ")"

A PDF dictionary is a list of key-value pairs with special syntax. I indulge in a bit of special-casing for the sake of better formatting:

let pdf_dict kvs =
  let buf = Buffer.create 16 in
  Buffer.add_string buf "<<";
  begin match kvs with
  | []     -> ()
  | [k, v] -> Printf.bprintf buf " /%s %s " k v
  | kvs    ->
    Buffer.add_char buf '\n';
    List.iter (fun (k, v) -> Printf.bprintf buf "/%s %s\n" k v) kvs
  end;
  Buffer.add_string buf ">>";
  Buffer.contents buf

This way short dictionaries occupy a single line. Every PDF file is annotated by metadata describing the particulars of its creation:

let iso_time () =
  let tm = Unix.gmtime (Unix.time ()) in
  Printf.sprintf "%04d%02d%02d%02d%02d%02d"
    (tm.Unix.tm_year + 1900) (tm.Unix.tm_mon + 1) tm.Unix.tm_mday
    tm.Unix.tm_hour tm.Unix.tm_min tm.Unix.tm_sec

let app_name = Filename.basename Sys.executable_name
and usr_name = try Sys.getenv "USER" with Not_found -> "unknown"

let pdf_info () =
  pdf_dict [
    "CreationDate", "D:" ^ iso_time ();
    "Producer"    , pdf_string app_name;
    "Author"      , pdf_string usr_name;
    "Creator"     , pdf_string "created with OCaml";
  ]

The pages in a PDF document form a tree rooted in a special page table referenced by a Catalog object. Here, as in the original program, the reference numbers are hardwired since the document structure is statically known:

let pdf_catalog () =
  pdf_dict [
    "Type" , "/Catalog";
    "Pages", pdf_ref 4;
  ]

All fonts in use must be similarly declared beforehand:

let pdf_font () =
  pdf_dict [
    "Type"    , "/Font";
    "Subtype" , "/Type1";
    "BaseFont", "/Courier";
  ]

As with dictionaries, arrays have their own syntax, very much like OCaml lists:

let pdf_array vs =
  let buf = Buffer.create 16
  and any = ref false in
  Buffer.add_string buf "[";
  List.iter (fun v -> Printf.bprintf buf " %s" v; any := true) vs;
  if !any then Buffer.add_char buf ' ';
  Buffer.add_string buf "]";
  Buffer.contents buf

PDF pages have definite sizes represented by a media box which serves as a coordinate system and clipping region:

let paper_size = function
| `A4     -> ( 595,  842)
| `Letter -> ( 612,  792)
| _       -> failwith "paper_size"

let media_box paper =
  let (w, h) = paper_size paper in
  pdf_array ["0"; "0"; string_of_int w; string_of_int h]

As I mentioned before, pages are rooted on a page table that forward references each child. Each page is comprised of a page dictionary and of its contents themselves, hence the need to reference every other object:

let rec table f n =
  let rec go i =
    if i = n then [] else f i :: go (succ i)
  in go 0

let pdf_pages ?(paper=`Letter) npages =
  pdf_dict [
    "Type"    , "/Pages";
    "MediaBox", media_box paper;
    "Count"   , string_of_int npages;
    "Kids"    , pdf_array (table (fun i -> pdf_ref (5 + 2 * i)) npages);
  ]

The page dictionary links back to the page table it depends on, the contents that follow it and any resources it needs:

let pdf_page pageno =
  pdf_dict [
    "Type"     , "/Page";
    "Parent"   , pdf_ref 4;
    "Contents" , pdf_ref (succ pageno);
    "Resources", pdf_dict ["Font", pdf_dict ["F1", pdf_ref 3]];
  ]

The page contents are represented by a PDF stream. In this case, I use the simplest formatting I can get away with:

let pdf_stream str =
  String.concat "\n" [
    pdf_dict ["Length", string_of_int (String.length str + 1)];
    "stream";
    str;
    "endstream"
  ]

Textual content is a kind of PDF stream comprised of text layout commands. Lines of text are shipped out by the ' (quote) operator one by one, after defining the characteristics (origin, font, etc) of the text box they are displayed in:

let pdf_text text =
  pdf_stream (String.concat "\n" ([
    "BT";
    "54 738 Td";
    "/F1 10 Tf";
    "12 TL";
  ] @ List.map (fun l -> pdf_string l ^ "'") text @ [
    "ET"
  ]))

All a PDF file contains is a sequence of cross-referenced objects. Given a collection of lines of text grouped into pages, the program writes the metadata, the catalog, the required resources, the page table and the sequence of pages given by their dictionary and contents:

let mapi f =
  let rec go i = function
  | [] -> []
  | x :: xs -> f i x :: go (succ i) xs
  in go 0

let objects_of_pages pages =
  mapi (pdf_object % succ) (List.concat ([
    pdf_info ();
    pdf_catalog ();
    pdf_font ();
    pdf_pages (List.length pages);
  ] :: mapi (fun i p -> [pdf_page (5 + 2 * i); pdf_text p]) pages))

Again, since the structure of the PDF file is known the references can be hardwired. Once the content is written, the PDF file ends with a trailer that cross-references the positions of all its contained objects by stream length. I fold over the list of objects to compute the file offsets at which they begin, starting at 9 to account for the PDF header:

let pdf_trailer objects =
  let len, tot, xref = List.fold_left ( fun (i, n, l) s ->
    (succ i, String.length s + 1 + n, Printf.sprintf "%010x 00000 n" n :: l) )
    (1, 9, []) objects in
  String.concat "\n" ([
    "xref";
    "0 " ^ string_of_int len;
    "0000000000 65535 f";
  ] @ List.rev xref @ [
    "trailer";
    pdf_dict [
      "Size", string_of_int len;
      "Info", pdf_ref 1;
      "Root", pdf_ref 2;
    ];
    "startxref";
    string_of_int tot;
    "%%EOF";
  ])

In this simple case the cross-reference table is written at the end; a stream-aware PDF document would have the table at the beginning, but that would need a two-pass algorithm. That is all that it's needed to write the PDF document:

let pdf_of_objects objects =
  String.concat "\n" (["%PDF-1.4"] @ objects @ [pdf_trailer objects])

(the header has length 9, as indicated above). It remains to format the text into lines. Lines will be split by terminator, accounting for all possibilities and taking care not to lose empty lines at the beginning or end of the text:

let split_lines = Str.split_delim (Str.regexp "\n\\|\r\n?")

In the same way that the original Factor code tab expansion is rather naïve:

let expand_tabs =
  String.concat "    " % Str.split_delim (Str.regexp_string "\t")

Composing both, making sure like in Factor that empty lines have at least something to print:

let lines_of_string =
  List.map (fun s -> if s = "" then " " else s) % split_lines % expand_tabs

Line wrapping and page layout are performed by grouping content lists by length. In the case of lines, I group lists of characters, so I need to convert from strings to lists and back. This is unnecessary in Factor since the word <groups> is generic on sequences, and so it would be in Haskell, but this is OCaml:

let explode str =
  let res = ref [] in
  String.iter (fun c -> res := c :: !res) str;
  List.rev !res

and implode cs =
  let buf = Buffer.create 16 in
  List.iter (Buffer.add_char buf) cs;
  Buffer.contents buf

Grouping lists by length is a simple function with a number of special cases:

let groups_of n =
  let rec go gss gs i = function
  | [] when gs = [] -> List.rev                 gss
  | []              -> List.rev (List.rev gs :: gss)
  | xs when i = n   -> go (List.rev gs :: gss) [] 0  xs
  | x :: xs         -> go gss (x :: gs)     (succ i) xs
  in go [] [] 0

As with the Factor code, I format text in pages of 57 lines of 84 characters. Except for the reading direction and use of ordnance, this function reads exactly like the original:

let pages_of_lines =
  groups_of 57 % List.concat % List.map (List.map implode % groups_of 84 % explode)

That's it. Formatting a text into a PDF document is a pipeline connecting all the pieces:

let pdf_of_text =
  pdf_of_objects % objects_of_pages % pages_of_lines % lines_of_string

As a bonus, a simple file filter:

let unwind ~(protect:'a -> unit) f (x : 'a) =
  try let y = f x in protect x; y with e -> protect x; raise e

let with_input_file  f file = unwind ~protect:close_in  f (open_in_bin file)
and with_output_file f file = unwind ~protect:close_out f (open_out    file)

let file_contents = with_input_file (fun inch ->
  let len = in_channel_length inch in
  let buf = String.create len in
  really_input inch buf 0 len;
  buf)

let pdf_of_file inf =
  let pdf = pdf_of_text (file_contents inf) in
  with_output_file (fun otch -> output_string otch pdf)

The entire program comes down to less than four pages in a formatted PDF document. As the original author remarks, this solution also compares favorably to a 600 line Python version and a 450 line C version. I expect that a Haskell version would be much nearer the 140 lines in the original program.

2010-09-14

Parametric Modular Programming

Generic programming as a concept popularized by Stepanov's STL is concerned with the uniform expression of algorithms decoupled from the specific data structures they operate on. By parametric modular programming I mean the uniform expression of a particular algorithm decoupling it from the specific shape of or rules obeyed by the objects it manipulates. It resolves the same driving forces behind the GoF Strategy pattern but statically, at the time of compilation. As an example, it allows me to do the following:

# T.enumerate_to T.is_connected 10 ;;
- : int list = [1; 1; 1; 3; 4; 12; 24; 66; 159; 444]

(cf A070765)

# P.enumerate_to P.is_connected 10 ;;
- : int list = [1; 1; 2; 5; 12; 35; 107; 363; 1248; 4460]

(cf A000104)

# H.enumerate_to H.is_connected 10 ;;
- : int list = [1; 1; 3; 7; 22; 81; 331; 1435; 6505; 30086]

(cf A018190) with the same implementation by just varying the underlying logic. I estimate that I spent a quarter of the total effort to extend my old code to enumerate polyominos actually refactoring and generalizing the algorithm, with most of the effort trying to come up with a good topology for the polyiamonds. I've used the technique in the past and the only mildly nontrivial lesson I take away from it is that coming up with and enforcing good logical invariants for the parameters is more crucial than having a uniform interface for them. This is where almost all the creativity goes into.

For this program, I started with the same foundation I used the last time. I chose a functional, point-free style supported by the use of functional sets, for which I need a number of utility functions:

let (%) f g x = f (g x)
let id x = x
let const k _ = k

let rec nest n f x =
  if n == 0 then [ ] else
  if n == 1 then [x] else
  x :: nest (pred n) f (f x)

and extensions to the standard Set functor:

module FSet (O : Set.OrderedType) = struct
  include Set.Make(O)
  let collect f s = fold (union % f) s empty
  let map     f s = fold (add   % f) s empty
  let of_list l   = List.fold_right add l empty
end

collect is akin to monadic bind, except for the fact that sets do not form monads (but note that indeed map f ≡ collect (singleton % f)). By definition, ominos are connected finite subsets of the Platonic tilings of the plane obeying certain further restrictions. As such, they underlie a lattice on ℝ². By a suitable change of coordinates this lattice can be taken as ℤ² or a subset thereof. Hence I represent cells by their integer coordinates:

module Cell = struct
  type cell = int * int

Cell sets are lexicographically ordered by coordinate:

  module Set = FSet(struct
    type t = cell
    let compare (i, j) (i', j') =
      let c = compare j j' in
      if c <> 0 then c else
      compare i i'
    end)

I need a small number of generic operations on cells or on their sets:

  let origin = (0, 0)
  let of_points = Set.of_list
  let to_points = Set.elements

  let shift (i, j) (di, dj) = (i + di, j + dj)

  let extents cells =
    Set.fold (fun (i, j) (mi, mj, xi, xj) ->
      (min mi i, min mj j, max xi i, max xj j)
    ) cells (max_int, max_int, min_int, min_int)

The simplest transform that "normalizes" a set of cells into a coordinate-independent set is a translation that takes its axis-aligned bounding box to the origin:

  let normalize cells =
    let (mi, mj, _, _) = extents cells in
    Set.map (shift (-mi, -mj)) cells

This will not work if the underlying lattice has more structure than ℤ², but it's a minimal initial default. Finding a connected component of a set of cells requires an adjacency structure, here given by the function expand:

  let find_component expand cells =
    let neighbors = Set.inter cells % expand in
    let rec visit pending visited =
      if Set.is_empty pending then visited else
      let cell      = Set.choose pending in
      let pending   = Set.remove cell pending in
      let unvisited = Set.diff (neighbors cell) visited in
      visit (Set.union unvisited pending) (Set.add cell visited)
    in visit (Set.singleton (Set.choose cells)) Set.empty
end

This is a standard graph search but made non-deterministic by the choice of a set to represent the visit schedule. This ends the module. Now the structure of the omino is represented by a module signature giving both the geometric and the topological structure of the plane graph:

module type TOPO = sig
  val origin     : Cell.cell
  val neighbors  : Cell.cell  -> Cell.Set.t
  (* The exterior of a cell must include its neighborhood *)
  val exterior   : Cell.cell  -> Cell.Set.t
  (* Transform a set of cells in general position to a canonical position *)
  val normalize  : Cell.Set.t -> Cell.Set.t
  (* Plane symmetries, that is, Dn *)
  val symmetries : (Cell.cell -> Cell.cell) list
end

The comments hint at the algebraic properties that the implementations must obey. neighbors gives the connectivity of the omino, while exterior is required to conform to it by returning a superset of the former. normalize choses a representative from the class of ominos translated around the lattice. symmetries gives the plane symmetries of the underlying shape. Every implementation must make sure that these symmetries are dihedral Dn. The simplest way to do this is to show that they are finitely represented by two generators rot and flip, that is, that rotnflip² ≡ rot % flip % rot % flipid. With this, the generation of the actual ominos is straightforward and concise. An omino is, of course, a set of canonical tiles obeying the given topology:

module Omino (T : TOPO) = struct
  type omino = Cell.Set.t

Omino sets are trivial:

  module Set = FSet(struct
    type t = omino
    let compare = Cell.Set.compare
  end)

The symmetries of an omino are given by the symmetries of the underlying shape, modulo translations. Given that ominos are ordered as sets, their sets are canonical (since OCaml's Sets are ordered) and choose-ing one of them as representative of the equivalence class is arbitrary but deterministic:

  let symmetries cells =
    Set.of_list (List.map (fun f -> T.normalize (Cell.Set.map f cells)) T.symmetries)

  let canon = Set.choose % symmetries

This is the crucial difference between ominos and cell sets: the former have more structure as given by the ADT:

  let of_points = canon % Cell.of_points

In other words, canon has type Cell.Set.t → Omino.omino. To extend an omino we consider all the exterior neighbors of each of their cells that make the resulting omino obey a given predicate:

  let extensions (p : omino -> bool) omino =
    let candidates cell     = Cell.Set.diff (T.neighbors cell) omino
    and extend     cell     = canon (Cell.Set.add cell omino) in
    let extend_if  cell set =
      let omino = extend cell in
      if p omino then Set.add omino set else set
    in
    let extend_at  cell set = Cell.Set.fold extend_if (candidates cell) set in
    Cell.Set.fold extend_at omino Set.empty

  let extend_set p = Set.collect (extensions p)

The code is to be read as English, except perhaps for extend_at which accumulates into set the valid extensions of omino around cell.

I'm particularly interested in simple ominos. By definition, an omino is simple or of genus 0 iff its exterior is connected:

  let is_connected omino =
    let exterior = Cell.Set.diff (Cell.Set.collect T.exterior omino) omino in
    Cell.Set.equal exterior (Cell.find_component T.neighbors exterior)

Generating and enumerating the ominos is trivial:

  let monomino         = of_points [T.origin]
  let ominos_to    p n = nest n (extend_set p) (Set.singleton monomino)
  let enumerate_to p   = List.map Set.cardinal % ominos_to p
end

This was the easy part, as the code I had was essentially correct and just required refactoring (as a side benefit, I think it has greatly gained in clarity. Compare with the old version). The next part is to provide the topological structures for each kind of omino. The polyominos fall out of the old implementation:

module PolyominoTopo = struct
  let origin     = Cell.origin
  let normalize  = Cell.normalize
  let neighbors  =
    let deltas = [1, 0; 0, 1; -1, 0; 0, -1]
    in fun cell -> Cell.Set.of_list (List.map (Cell.shift cell) deltas)
  let exterior   =
    let deltas = [1, 0; 1, 1; 0, 1; -1, 1; -1, 0; -1, -1; 0, -1; 1, -1]
    in fun cell -> Cell.Set.of_list (List.map (Cell.shift cell) deltas)

  (* D4: rot^4 = flip^2 = rot.flip.rot.flip = id *)
  let rot  (i, j) = (-j,  i)
  let flip (i, j) = ( i, -j)
  let symmetries = [
    id; rot; rot % rot; rot % rot % rot;
    flip; rot % flip; rot % rot % flip; rot % rot % rot % flip
  ]
end

Polyominos live exactly in ℤ², so I can use the default behavior in Cell. The neighbors of a cell form a Von Neumann neighborhood, and its exterior forms a Moore neighborhood. It's trivial to check that rot and flip generate D₄. Interestingly enough, under a suitable change of coordinates (given by the basis {(√3, 1)/2, (0, 1)}) hexominos also live in ℤ², so that each cell is oriented such that the Y axis is a double apothegm:

module HexominoTopo = struct
  let origin     = Cell.origin
  let normalize  = Cell.normalize
  let neighbors  =
    let deltas = [1, 0; 0, 1; -1, 1; -1, 0; 0, -1; 1, -1]
    in fun cell -> Cell.Set.of_list (List.map (Cell.shift cell) deltas)
  let exterior   = neighbors

  (* D6: rot^6 = flip^2 = rot.flip.rot.flip = id *)
  let rot  (i, j) = (-j, i + j)
  let flip (i, j) = ( j, i    )
  let symmetries = [
    id; rot; rot % rot; rot % rot % rot;
      rot % rot % rot % rot; rot % rot % rot % rot % rot;
    flip; rot % flip; rot % rot % flip; rot % rot % rot % flip;
      rot % rot % rot % rot % flip; rot % rot % rot % rot % rot % flip;
  ]
end

Each hexagonal cell is much more connected than square or triangular cells, to the point that the neighbors of a cell are exactly its exterior. The basis rotation took me some time to discover, but again, checking that rot and flip generate D₆ is very easy. Now for something completely different.

Polyiamonds are in a sense dual to hexominos, in more than one way: either each hexagon corresponds to one triangle or to six. In the first case alternating vertices form one half of the tiling, the other half corresponding to opposing vertices where three hexagons meet. Conway calls them deep and shallow vertices in connection with spherical lattices. Since lattice points correspond to shape centers, this requires a direct sum of the hexagonal lattice with itself shifted by ⅓. Rescaling by 3, we have the subset of ℤ² where coordinates ij ≡ 0 or 1 (mod 3); this does not form a lattice (as 2×(3i + 1, 3j + 1) doesn't belong to it). Crucially, the underlying symmetry group is D₃ which keeps deep and shallow triangles (i.e., those pointing left or right) separated, missing out on mirror symmetries. The alternative is to subdivide each hexagon into six triangles and rescale by 3. The resulting subset of ℤ² has coordinates ij ≡ 1 or 2 (mod 3). Again, this is not a lattice but a direct sum of two lattices, but the underlying symmetry is D₆ with transforms of odd order switching between one copy and the other.

The upshot of this disquisition is that, since the triangular "lattice" is formed of two copies or halves, the topology must be aware of the half each shape lives in. Since machine arithmetic is not modular arithmetic, I need a helper function:

module TriominoTopo = struct
  let rem x y =
    let r = x - y * (x / y) in
    if r >= 0 then r else if y >= 0 then r + y else r - y

This is the Euclidean remainder of x and y which is always in [0, |y|). With this I can tell copies apart:

  let is_deep (i, j) =
    let c = rem i 3 in
    assert (c = rem j 3);
    match c with
    | 1 -> true
    | 2 -> false
    | _ -> assert false

The predicate is a good place to make extra careful that I don't make any errors in the following. The origin cannot be (0, 0) since it doesn't belong to either copy. Thus, the monomino must be one of (1, 1) and its translates or (-1, -1) and its translates. I choose arbitrarily the former:

  let origin = (1, 1)

Since -1 ≡ 2 (mod 3) and vice-versa, the deltas for shallow points are the negatives of those for deep points:

  let neighbors =
    let deltas = [1, 1; -2, 1; 1, -2]
    in fun (i, j as c) -> if is_deep c
      then Cell.Set.of_list (List.map (fun (di, dj) -> (i + di, j + dj)) deltas)
      else Cell.Set.of_list (List.map (fun (di, dj) -> (i - di, j - dj)) deltas)

  let exterior =
    let deltas = [
      -2, 1; -3, 0; -2, -2; 0, -3; 1, -2; 3, -3;
      4, -2; 3, 0; 1, 1; 0, 3; -2, 4; -3, 3
    ] in fun (i, j as c) -> if is_deep c
      then Cell.Set.of_list (List.map (fun (di, dj) -> (i + di, j + dj)) deltas)
      else Cell.Set.of_list (List.map (fun (di, dj) -> (i - di, j - dj)) deltas)

Since the axis-aligned bounding rectangle of a set of centers doesn't necessarily have opposite vertices in either lattice (minimal example: {(1, 1), (-1, 2)}), normalizing a set of cells is a bit delicate. Given that cells are totally ordered, their sets have a unique minimum element which can be taken as the origin, either (1, 1) or (-1, -1) depending on whether it is deep or shallow:

  let normalize cells =
    let (di, dj as c) = Cell.Set.min_elt cells in
    if is_deep c
      then Cell.Set.map (Cell.shift (-di + 1, -dj + 1)) cells
      else Cell.Set.map (Cell.shift (-di - 1, -dj - 1)) cells

The symmetries can be borrowed from the underlying hexagonal lattice:

  let symmetries = HexominoTopo.symmetries
end

It only remains to instantiate the ominos with their corresponding topologies:

module P = Omino(PolyominoTopo)
module H = Omino(HexominoTopo)
module T = Omino(TriominoTopo)

As you can see, the intelligence is neatly segregated to different modules. The price to pay is that knowledge of the data structures (in this case, sets of cells) is shared among them, and changing representations must affect the entire program. Also, and perhaps more problematic is that the enumeration doesn't take into account any special characteristics of the underlying topology except its connectivity structure. Knowing that there are many neighbors of an hexagon requires a refined pruning strategy for searching hexominos; this code doesn't know about that and it's possible that it cannot know, in principle. All in all I'm very satisfied with the result.

2010-09-01

Rewriting the Rules

The last article on pa_do sparked an interesting conversation with its author, Christophe Troestler, and reader bluestorm. It's true that pa_infix can establish optional rewrite rules for operators as a side-effect (or as an "added bonus" in the creator's words) of it installing new symbols with given arity, precedence and associativity. Given that rewrite rules are purely syntactical artifacts, there is no function associated with an operator unless explicitely defined. Of course, once defined they can be re-defined, and it is responsability of authors and users to keep both semantics (reduction and rewrite) in sync.

On the other hand, rewrite rules open the door to interesting optimizations before and beyond what the compiler sees and can perform. In the case of an extension for functional application and composition, the optimization is obvious: the identity function can be stripped when encountered in an applicative expression. Here is the syntax extension pa_compose.ml with these built-in optimizations (Edit: I've expanded a bit the check for the identity to catch references to the qualified module):

open Camlp4.PreCast
open Pa_do
open Pa_infix
module L = Level

let is_id = function
| <:ident< $uid:"Compose"$.$lid:"id"$ >>
| <:ident<                 $lid:"id"$ >> -> true
| _                                      -> false

let () =
  let expr x f _loc = match f with
  | <:expr< $id:f$ >> when is_id f -> x
  | f -> <:expr< $f$ $x$ >>
  in infix "|>" ~expr (L.binary (L.Higher L.assignment) ~assoc:L.LeftA)

let () =
  let expr f x _loc = match f with
  | <:expr< $id:f$ >> when is_id f -> x
  | f -> <:expr< $f$ $x$ >>
  in infix "%%" ~expr (L.binary (L.Higher L.assignment) ~assoc:L.RightA)

let () =
  let expr f g _loc = match f, g with
  | <:expr< $id:f$ >>, g when is_id f -> g
  | f, <:expr< $id:g$ >> when is_id g -> f
  | f, g ->
    let x = Delimited_overloading.new_lid () in
    <:expr< fun $lid:x$ -> $f$ ($g$ $lid:x$) >>
  in infix "%" ~expr (L.binary (L.Higher L.disjunction) ~assoc:L.RightA)

The great thing about CamlP4 is that it extends the syntax of OCaml to allow for writing matchings on the AST directly. The rules match explicitly on the lowercase identifier "id" and optimize away the application or the composition. In order to give executable semantics to the rewritten syntax, it is necessary to provide an implementation for the operators. Here is compose.ml:

external id : 'a -> 'a = "%identity"
let ( |> ) x f   = f x
let ( %% ) f x   = f x
let ( %  ) f g x = f (g x)

and here is its interface, compose.mli:

external id : 'a -> 'a = "%identity"
val ( |> )  : 'a -> ('a -> 'b) -> 'b
val ( %% )  : ('a -> 'b) -> 'a -> 'b
val ( %  )  : ('b -> 'c) -> ('a -> 'b) -> ('a -> 'c)

Both are utterly trivial, but necessary in order that the following:

let test9 () =
  [succ; succ; pred; id; succ] |> List.fold_left (%) id

works as intended. The following program, use.ml is a non-exhaustive test of the syntax extension:

(* Test of composition operators *)
open Compose

let test1 () =
  [1; 2; 3] |> List.map succ

let test2 () =
  List.map succ %% [1; 2; 3]

let test3 () =
  let x = ref 7 in
  x := !x |> succ

let test4 () =
  let x = ref 7 in
  x := succ %% !x

let test5 () =
  [1; 2; 3] |> List.map succ |> List.fold_left (+) 0 |> succ

let test6 () =
  succ %% List.fold_left (+) 0 %% List.map succ %% [1; 2; 3]

let test7 () =
  succ % List.fold_left (+) 0 % List.map succ %% [1; 2; 3]

let test8 () =
  [1; 2; 3] |> succ % List.fold_left (+) 0 % List.map succ

let test9 () =
  [succ; succ; pred; id; succ] |> List.fold_left (%) id

let test10 () =
  succ % Compose.id %% 10

let test11 () =
  id % succ %% 10

let test12 () =
  id % succ % id % pred %% 10

let test13 () =
  10 |> id

let test14 () =
  id %% 10

let test15 () =
  id % Compose.id % id %% 10

let () = Printf.printf "%d (should be 9)\n" (test9 () 7)

A quick-and-dirty Makefile to build the extension, its library, and the test program (provided you have findlib and pa_do installed, of course):

OCAMLC=    ocamlfind c   -w A
OCAMLOPT=  ocamlfind opt -w A -inline 10
PA_DO=     -package pa_do -syntax camlp4o

EXE=       .exe
PROG=      use$(EXE) use.opt$(EXE)
PACKAGE=   META compose.cmi compose.cmo compose.cmx pa_compose.cmo

ALL: $(PROG)

use$(EXE): compose.cmo use.cmo
    $(OCAMLC) -o $@ $^
use.opt$(EXE): compose.cmx use.cmx
    $(OCAMLOPT) -o $@ $^

pa_compose.cmo: pa_compose.ml
    $(OCAMLC) -c $(PA_DO) -ppopt q_MLast.cmo $<

pa_compose.cmx: pa_compose.ml
    $(OCAMLOPT) -c $(PA_DO) -ppopt q_MLast.cmo $<

use.cmo: use.ml
    $(OCAMLC) -c $(PA_DO) -ppopt pa_infix.cmo -ppopt pa_compose.cmo $<

use.cmx: use.ml
    $(OCAMLOPT) -c $(PA_DO) -ppopt pa_infix.cmo -ppopt pa_compose.cmo $<

use.pre.ml: pa_compose.cmo use.ml
    camlp4 -o $@ -parser o -parser op -printer o -I `ocamlfind -query pa_do` \
        pa_do.cmo pa_infix.cmo pa_compose.cmo use.ml

# Dependencies

compose.cmo: compose.cmi
compose.cmx: compose.cmi
pa_compose.cmo:
pa_compose.cmx:
use.cmo: pa_compose.cmo compose.cmi
use.cmx: pa_compose.cmo compose.cmi

# Phony targets

clean:
    rm -f *.cm* *.o *.obj *~

distclean: clean
    rm -f use.pre.ml $(PROG)

install: $(PACKAGE)
    ocamlfind install pa_compose $(PACKAGE)

uninstall:
    ocamlfind remove pa_compose

# Rules

.SUFFIXES: .cmo .cmi .cmx .ml .mli

.ml.cmo:
    $(OCAMLC) -c $<

.mli.cmi:
    $(OCAMLC) -c $<

.ml.cmx:
    $(OCAMLOPT) -c $<

(replace the spaces by tabs!) And a META file to package it as a ocamlfind extension:

name = "pa_compose"
version = "1.0"
description = "Rewrite rules for functional composition"
requires = "camlp4 pa_do pa_do.infix"
archive(syntax) = "@pa_do/pa_infix.cmo pa_compose.cmo"
archive(byte,toploop) = "@pa_do/pa_infix.cmo compose.cmo pa_compose.cmo"
archive(byte) = "compose.cmo"
archive(native) = "compose.cmx"

If you build the pre-processed example with make use.pre.ml, you'll see how the ids get rewritten away. Once built and installed, it can be directly used from the top-level:

$ ocaml
        Objective Caml version 3.12.0

Findlib has been successfully loaded. Additional directives:
  #require "package";;      to load a package
  #list;;                   to list the available packages
  #camlp4o;;                to load camlp4 (standard syntax)
  #camlp4r;;                to load camlp4 (revised syntax)
  #predicates "p,q,...";;   to set these predicates
  Topfind.reset();;         to force that packages will be reloaded
  #thread;;                 to enable threads

/usr/local/lib/ocaml/str.cma: loaded
/usr/local/lib/ocaml/dynlink.cma: loaded
# #require "pa_compose";;
C:/cygwin/usr/local/lib/ocaml/camlp4: added to search path
C:/cygwin/usr/local/lib/ocaml/site-lib/pa_do: added to search path
/usr/local/lib/ocaml/camlp4/camlp4o.cma: loaded
C:/cygwin/usr/local/lib/ocaml/site-lib/pa_do/pa_do.cmo: loaded
C:/cygwin/usr/local/lib/ocaml/site-lib/pa_do/pa_do_top.cma: loaded
C:/cygwin/usr/local/lib/ocaml/site-lib/pa_compose: added to search path
C:/cygwin/usr/local/lib/ocaml/site-lib/pa_do/pa_infix.cmo: loaded
C:/cygwin/usr/local/lib/ocaml/site-lib/pa_compose/compose.cmo: loaded
C:/cygwin/usr/local/lib/ocaml/site-lib/pa_compose/pa_compose.cmo: loaded
        Camlp4 Parsing version 3.12.0


# 3 |> succ |> succ |> id |> pred |> id |> succ ;;
- : int = 5

Of course, the functions themselves exist:

# open Compose ;;
# id ;;
- : 'a -> 'a = <fun>
# ( % ) ;;
- : ('a -> 'b) -> ('c -> 'a) -> 'c -> 'b = <fun>
# ( |> ) ;;
- : 'a -> ('a -> 'b) -> 'b = <fun>

Happy rewriting!

2010-08-22

Polymorphic recursion with rank-2 polymorphism in OCaml 3.12

Using polymorphic recursion in OCaml just got easy and direct! With the new syntax, Okasaki's binary random-access lists translate to OCaml 3.12 practically verbatim, without the need for cumbersome encodings. You should refer to the book (Purely Functional Data Structures, p. 147) for comparison, but here's the implementation in its entirety:

type 'a seq = Nil | Zero of ('a * 'a) seq | One of 'a * ('a * 'a) seq

let nil = Nil

let is_empty = function Nil -> true | _ -> false

let rec cons : 'a . 'a -> 'a seq -> 'a seq =
  fun x l -> match l with
  | Nil         -> One (x, Nil)
  | Zero    ps  -> One (x, ps)
  | One (y, ps) -> Zero (cons (x, y) ps)

let rec uncons : 'a . 'a seq -> 'a * 'a seq = function
| Nil          -> failwith "uncons"
| One (x, Nil) -> x, Nil
| One (x, ps ) -> x, Zero ps
| Zero    ps   ->
  let (x, y), qs = uncons ps in
  x, One (y, qs)

let head l = let x, _  = uncons l in x
and tail l = let _, xs = uncons l in xs

let rec lookup : 'a . int -> 'a seq -> 'a =
  fun n l -> match l with
  | Nil                    -> failwith "lookup"
  | One (x, _ ) when n = 0 -> x
  | One (_, ps)            -> lookup (n - 1) (Zero ps)
  | Zero ps                ->
    let (x, y) = lookup (n / 2) ps in
    if n mod 2 = 0 then x else y

let update n e =
  let rec go : 'a . ('a -> 'a) -> int -> 'a seq -> 'a seq =
    fun f n l -> match l with
    | Nil                    -> failwith "update"
    | One (x, ps) when n = 0 -> One (f x, ps)
    | One (x, ps)            -> cons x (go f (n - 1) (Zero ps))
    | Zero ps                ->
      let g (x, y) = if n mod 2 = 0 then (f x, y) else (x, f y) in
      Zero (go g (n / 2) ps)
  in go (fun x -> e) n

The implementation given in the book is rather bare-bones, but it can be extended with some thought and by paying close attention to the techniques Okasaki uses. To begin with, a length function is a very simple O(log n) mapping from constructors to integers:

let rec length : 'a . 'a seq -> int
  = fun l -> match l with
  | Nil         -> 0
  | Zero ps     ->     2 * length ps
  | One (_, ps) -> 1 + 2 * length ps

It is also rather easy to write a map for binary random-access lists:

let rec map : 'a 'b . ('a -> 'b) -> 'a seq -> 'b seq =
  fun f l -> match l with
  | Nil          -> Nil
  | One (x,  ps) -> One (f x, map (fun (x, y) -> (f x, f y)) ps)
  | Zero ps      -> Zero     (map (fun (x, y) -> (f x, f y)) ps)

Note two things: first, that both parameters need to be generalized as both the argument and the return type vary from invocation to invocation, as shown by the Zero case. Second, that there is no need to use cons, as map preserves the shape of the list. With this as a warm-up, writing fold_right is analogous:

let rec fold_right : 'a 'b . ('a -> 'b -> 'b) -> 'a seq -> 'b -> 'b =
  fun f l e -> match l with
  | Nil          -> e
  | One (x, ps)  -> f x (fold_right (fun (x, y) z -> f x (f y z)) ps e)
  | Zero ps      ->      fold_right (fun (x, y) z -> f x (f y z)) ps e

Given a right fold, any catamorphism is a one-liner:

let append l = fold_right cons l

let of_list l = List.fold_right cons l nil
and to_list l = fold_right (fun x l -> x :: l) l []

Now, armed with fold_right, filling up a list library is easy; but taking advantage of the logarithmic nature of the representation requires thought. For instance, building a random-access list of size n can be done in logarithmic time with maximal sharing:

let repeat n x =
  let rec go : 'a . int -> 'a -> 'a seq = fun n x ->
     if n = 0 then Nil else
  if n = 1 then One (x, Nil) else
     let t = go (n / 2) (x, x) in
     if n mod 2 = 0 then Zero t else One (x, t)
  in
  if n < 0 then failwith "repeat" else go n x

By analogy with binary adding, there is also a fast O(log n) merge:

let rec merge : 'a . 'a seq -> 'a seq -> 'a seq =
  fun l r -> match l, r with
  | Nil        ,         ps
  |         ps , Nil         -> ps
  | Zero    ps , Zero    qs  -> Zero (merge ps qs)
  | Zero    ps , One (x, qs)
  | One (x, ps), Zero    qs  -> One (x, merge ps qs)
  | One (x, ps), One (y, qs) -> Zero (cons (x, y) (merge ps qs))

It walks both lists, "adding" the "bits" at the head. The only complication is the case where both lists are heded by two Ones, which requires rippling the carry with a call to cons. An alternative is to explicitly keep track of the carry, but that doubles the number of branches. This merge operation does not preserve the order of the elements on both lists. It can be used, however, as the basis for a very fast nondeterminism monad:

let return x = One (x, Nil)
let join mm  = fold_right merge mm Nil
let bind f m = join (map f m)

let mzero = Nil
let mplus = merge

The rank-2 extension to the typing algorithm does not extend to signatures, as in Haskell. It only has effect on the typing of the function body, by keeping the type parameter fresh during unification, as the signature of the module shows:

type 'a seq = Nil | Zero of ('a * 'a) seq | One of 'a * ('a * 'a) seq
val nil : 'a seq
val is_empty : 'a seq -> bool
val cons : 'a -> 'a seq -> 'a seq
val uncons : 'a seq -> 'a * 'a seq
val head : 'a seq -> 'a
val tail : 'a seq -> 'a seq
val lookup : int -> 'a seq -> 'a
val update : int -> 'a -> 'a seq -> 'a seq
val length : 'a seq -> int
val map : ('a -> 'b) -> 'a seq -> 'b seq
val fold_right : ('a -> 'b -> 'b) -> 'a seq -> 'b -> 'b
val append : 'a seq -> 'a seq -> 'a seq
val of_list : 'a list -> 'a seq
val to_list : 'a seq -> 'a list
val repeat : int -> 'a -> 'a seq
val merge : 'a seq -> 'a seq -> 'a seq
val return : 'a -> 'a seq
val join : 'a seq seq -> 'a seq
val bind : ('a -> 'b seq) -> 'a seq -> 'b seq
val mzero : 'a seq
val mplus : 'a seq -> 'a seq -> 'a seq

To use rank-2 types in interfaces it is still necessary to encode them via records or objects.

2010-08-18

What can pa_do for you?

(My puns are getting worse) A lot of things. Christophe Troestler pointed out that pa_do is a much more complete alternative to the new delimited overloading syntax in OCaml 3.12. In particular, it allows not only for the overloading of operators but of constants also, so that the following:

# Num.(3**(561-1) mod 561);;
- : Num.num = <num 375>

(561 = 3×11×17) works, with the simple (!) expedient of having a sane OCaml environment (findlib and omake are must-haves) and the inclusion of the relevant extensions. The delimited overloading expressions is the most visible, or ultimate, functionality of pa_do, but a key piece of the extension is the ability to define new operators much more flexibly than Standard OCaml syntax allows via pa_infix. The bad news is that the documentation is not entirely clear on how to proceed, so you need to dig through examples and list posts and experiment a bit to discover how to put it to productive use. One of the examples in the distribution is examples/pa_compos.ml, a good start for two reasons:

  • It shows how to use the pa_infix API
  • It shows how to leverage CamlP4 quotations to simplify (or optimize, or partially evaluate) generated code

I've extended the example a bit to create a syntax filter pa_compose.ml that defines three operators:

  • A leftwards application operator x |> ff x
  • A rightwards application operator f %% xf x
  • A function composition operator f % g ≡ fun xf (g x)

Note: Since <<, >>, $ and $$ are reserved by CamlP4 they are unavailable as operator names.

The key differences between pa_infix- and Standard Syntax-defined operators are three:

  • They have specific precedence
  • They have specific associativity
  • They are rewrite rules, not regular functions

The |> and %% operators bind just higher than assignment :=, the former from left to right, the latter from right to left. The % operator binds higher than those, also from right to left. First I need to open the necessary modules:

open Camlp4.PreCast
open Pa_infix
module L = Level

The |> operator is just as in the example cited above, modulo some adjustments:

let () =
  let expr x y _loc = <:expr< $y$ $x$ >> in
  infix "|>" ~expr (L.binary (L.Higher L.assignment) ~assoc:L.LeftA)

The operator is assigned an arity, a precedence level and an associativity, and paired with a specific rewriting in the form of a CamlP4 quotation. This rewriting is made at the level of abstract syntax trees after the code is parsed, so there is no danger of it being syntactically incorrect. The %% operator is its mirror image:

let () =
  let expr x y _loc = <:expr< $x$ $y$ >> in
  infix "%%" ~expr (L.binary (L.Higher L.assignment) ~assoc:L.RightA)

The rewrite rules just make the operators disappear, leaving behind the semantics in the form of a direct application. The % operator is more involved, as it must create a variable to abstract the composition away. Since CamlP4 is not hygienic in the Scheme sense, this variable must be explicitly created fresh:

let gensym =
  Random.self_init();
  let prefix = Printf.sprintf "__pa_compose_%04x_" (Random.int 65536) in
  let count  = ref 0 in
  fun () -> prefix ^ string_of_int (incr count; !count)

For each rewriting of f % g as fun xf (g x), the variable x must be fresh, otherwise it would capture in f % g % h:

let () =
  let expr f g _loc = let x = gensym () in <:expr< fun $lid:x$ -> $f$ ($g$ $lid:x$) >> in
  infix "%" ~expr (L.binary (L.Higher L.disjunction) ~assoc:L.RightA)

(The tag lid indicates an AST node containing an identifier). This is the entirety of the extension. It can be compiled with:

ocamlfind ocamlc -package pa_do -syntax camlp4o -ppopt q_MLast.cmo -c pa_compose.ml

to give pa_compose.cmo. With it, I'll try some examples, first using CamlP4 as a filter with the Standard Syntax pretty-printer:

camlp4 -parser o -parser op -printer o -I "$OCAMLLIB/site-lib/pa_do" pa_infix.cmo pa_compose.cmo compose.ml

The first examples are simple uses of the application operators. The code:

let test1 () = [1; 2; 3] |> List.map succ

gets transformed to:

let test1 () = List.map succ [ 1; 2; 3 ]

and the code:

let test2 () = List.map succ %% [1; 2; 3]

gets transformed to:

let test2 () = List.map succ [ 1; 2; 3 ]

Note that in both cases the operators disappear and the application is direct. The binding power of the first operator is shown as this code:

let test3 () =
  let x = ref 7 in
  x := !x |> succ

is transformed into this code:

let test3 () = let x = ref 7 in x := succ !x

and this code:

let test4 () =
  let x = ref 7 in
  x := succ %% !x

is transformed to this:

let test4 () = let x = ref 7 in x := succ !x

The rewriting is syntactically correct even in the presence of chained operators. This code:

let test5 () =
  [1; 2; 3] |> List.map succ |> List.fold_left (+) 0 |> succ

results in this transformation:

let test5 () = succ (List.fold_left ( + ) 0 (List.map succ [ 1; 2; 3 ]))

and this code:

let test6 () =
  succ %% List.fold_left (+) 0 %% List.map succ %% [1; 2; 3]

is transformed into:

let test6 () = succ (List.fold_left ( + ) 0 (List.map succ [ 1; 2; 3 ]))

The operators rewrite exactly to the same code, and they associate correctly into properly nested applications. The composition operator results in more complex expressions. The code:

let test7 () =
  succ % List.fold_left (+) 0 % List.map succ %% [1; 2; 3]

transforms into:

let test7 () =
  (fun __pa_compose_e2b2_2 ->
     succ
       ((fun __pa_compose_e2b2_1 ->
           List.fold_left ( + ) 0 (List.map succ __pa_compose_e2b2_1))
          __pa_compose_e2b2_2))
    [ 1; 2; 3 ]

which after β-reduction is identical to the code for test6 (of course % and %% are intimately related combinators, in that (%%) ≡ (%) id). Note that each abstraction has a freshly-generated, random-named variable. Similarly, the code:

let test8 () =
  [1; 2; 3] |> succ % List.fold_left (+) 0 % List.map succ

(not that it makes much sense) transforms into the same result, modulo α-conversion:

let test8 () =
  (fun __pa_compose_e2b2_4 ->
     succ
       ((fun __pa_compose_e2b2_3 ->
           List.fold_left ( + ) 0 (List.map succ __pa_compose_e2b2_3))
          __pa_compose_e2b2_4))
    [ 1; 2; 3 ]

To use the extension pa_compose to pre-process your source for compilation, it can be used like this:

ocamlfind ocamlc -package pa_do -syntax camlp4o -ppopt pa_infix.cmo -ppopt pa_compose.cmo -c compose.ml

It only remains building an OMakefile that compiles the extension both in bytecode and optimized assembly, packages it and installs it with findlib into the site library, so that it can be #require'd into the top-level.

2010-08-16

Compiling OMake 0.9.8.5 with OCaml 3.12/MinGW under Cygwin

Success! Since OMake doesn't really have a notion of a build environment, just one of platform (UNIX or Win32), compiling the latest release (0.9.8.5-3) from sources requires a number of modifications to the makefiles. Furthermore, OCaml has incorporated a number of warnings that make necessary adjusting some command-line parameters.

Note: These steps are specifically for the MinGW port of OCaml. If you're trying to build with OCaml/Cygwin, you'd want to keep most of the settings in the makefiles intact, except for adjusting the OCaml warnings. YMMV.

Here are the steps I followed for a clean build from sources:

  1. Make sure that your Cygwin environment is sane. In particular, prune from your $PATH variable all entries with spaces, make sure that $OCAMLLIB is properly set and eliminate the $OMAKELIB variable if present. Also, using Cygwin under Windows Vista or Windows 7 is fraught with privilege peril; set your $TMP variable to Cygwin's /tmp directory
  2. Download and unpack omake-0.9.8.5-3.tar.gz to Cygwin's root /
  3. Apply the source patches detailed here:
    1. omake_exec.ml.patch:
      --- a/src/exec/omake_exec.ml    2006-12-08 23:52:01.000000000 +0100
      +++ b/src/exec/omake_exec.ml    2009-10-30 23:45:49.688006630 +0100
      @@ -46,7 +46,7 @@
       open Omake_options
       open Omake_command_type
      
      -external sync : unit -> unit = "caml_sync"
      +(* external sync : unit -> unit = "caml_sync" *)
      
       module Exec =
       struct
      
    2. lm_printf.c.patch:
      --- a/src/libmojave-external/cutil/lm_printf.c  2007-07-15 19:55:23.000000000 +0200
      +++ b/src/libmojave-external/cutil/lm_printf.c  2009-10-30 23:45:26.600007718 +0100
      @@ -142,12 +142,12 @@
       #endif
           if(code < 0) {
               if(bufp != buffer)
      -            free(buffer);
      +            free(bufp);
               failwith("ml_print_string");
           }
           v_result = copy_string(bufp);
           if(bufp != buffer)
      -        free(buffer);
      +        free(bufp);
           return v_result;
       }
      
      @@ -190,12 +190,12 @@
       #endif
           if(code < 0) {
               if(bufp != buffer)
      -            free(buffer);
      +            free(bufp);
               failwith("ml_print_string");
           }
           v_result = copy_string(bufp);
           if(bufp != buffer)
      -        free(buffer);
      +        free(bufp);
           return v_result;
       }
      
  4. Patch the Makefile with the following Makefile.patch:
    --- Makefile.orig 2007-05-21 13:48:00.000000000 -0300
    +++ Makefile 2010-08-16 13:23:37.352895400 -0300
    @@ -18,18 +18,18 @@
      @exit 1
     
     bootstrap: boot/Makefile
    - @cd boot; $(MAKE) Makefile.dep; $(MAKE) omake
    - @ln -sf boot/omake omake-boot
    + @cd boot; $(MAKE) Makefile.dep; $(MAKE) omake.exe
    + @cp boot/omake.exe omake-boot.exe
     
     boot/Makefile: src/Makefile
      mkdir -p boot
      @touch boot/Makefile.dep
      @sleep 1
    - ln -sf ../src/Makefile boot/Makefile
    + cp src/Makefile boot/Makefile
     
     all: bootstrap
      touch .config
    - OMAKEFLAGS= OMAKEPATH=lib ./omake-boot --dotomake .omake --force-dotomake -j2 main
    + OMAKEFLAGS= OMAKEPATH=lib ./omake-boot --dotomake .omake --force-dotomake -j2 main src/main/osh.exe
      OMAKEFLAGS= OMAKEPATH=lib src/main/omake --dotomake .omake --force-dotomake -j2 all
     
     install: all
    

    These modifications account for MinGW applications's inability to read from symbolic links, and the need to use .exe as the executable files's extension.

  5. Patch src/Makefile with the following src_Makefile.patch:
    --- src/Makefile.orig 2007-06-27 20:59:16.000000000 -0300
    +++ src/Makefile 2010-08-16 13:26:35.342346000 -0300
    @@ -7,32 +7,32 @@
     #
     # System config
     #
    -LN = ln -sf
    +LN = cp
     RM = rm -f
     DOT = ./
     slash = /
     
    -win32 = unix
    -system = null
    +win32 = win32
    +system = system
     
     #
     # C configuration
     #
    -CC = cc
    -CFLAGS =
    +CC = gcc
    +CFLAGS = -mno-cygwin -I"$(OCAMLLIB)" -DWIN32 -DFAM_ENABLED -DFAM_PSEUDO
     AR = ar cq
     AROUT =
     EXT_OBJ = .o
     EXT_LIB = .a
    -EXE =
    +EXE = .exe
     
    -OCAMLFLAGS =
    -THREADSLIB =
    +OCAMLFLAGS = -thread
    +THREADSLIB = threads.cma
     
     .SUFFIXES: .mll .mly .mli .ml .c .cmi .cmo .cma .o
     
     .c.o:
    - $(CC) $(CFLAGS) -I"`ocamlc -where`" -c $*.c
    + $(CC) $(CFLAGS) -c $*.c
     
     
     #
    

    Again, this accounts for the fact that MinGW applications under Cygwin are Windows applications with UNIX syntax.

  6. Patch OMakeroot with the following OMakeroot.patch:
    --- OMakeroot.orig 2010-08-16 13:31:07.517242100 -0300
    +++ OMakeroot 2010-08-16 13:31:13.423692900 -0300
    @@ -14,7 +14,13 @@
     #
     # Include the standard configuration
     #
    -include build/C
    +OSTYPE = Cygwin
    +open build/Common
    +EXT_LIB = .a
    +EXT_OBJ = .o
    +EXE = .exe
    +open build/C
    +CC = gcc -mno-cygwin
     include build/OCaml
     include build/LaTeX
     
    @@ -29,6 +35,9 @@
     #
     include configure/Configure
     
    +OSTYPE = Win32
    +TOOLCHAIN = MinGW
    +
     #
     # Include the OMakefile
     #
    

    The rationale behind these modifications can be found in this message. In short, it tricks the initial configuration into thinking it's a UNIX platform, and then reverts to the Win32 port, setting a variable for OMakefile.

  7. Patch OMakefile with the following OMakefile.patch:
    --- OMakefile.orig 2010-08-16 13:29:07.788616600 -0300
    +++ OMakefile 2010-08-16 13:29:13.828708200 -0300
    @@ -15,7 +15,14 @@
         #
         # Extra options for the C compiler
         #
    -    CFLAGS += /MT /DWIN32 /W4 /WX
    +    if $(equal $(TOOLCHAIN), MinGW)
    +        CFLAGS += -DWIN32
    +        FAM_CFLAGS = -DFAM_ENABLED -DFAM_PSEUDO
    +        export
    +
    +    else
    +        CFLAGS += /MT /DWIN32 /W4 /WX
    +        export
     
         #
         # Don't build man pages
    @@ -57,7 +64,7 @@
     #
     # OCaml options
     #
    -OCAMLFLAGS[] += -w Ae$(if $(OCAML_ACCEPTS_Z_WARNING), z)
    +OCAMLFLAGS[] += -w Ae$(if $(OCAML_ACCEPTS_Z_WARNING), z)-9-27..29
     if $(THREADS_ENABLED)
         OCAMLFLAGS += -thread
         export
    

    This modification has two parts. First, it detects the MinGW toolchain to change the command line syntax of some of the build options. Second, it adjusts the OCaml warning flags so that the source compiles under 3.12.

  8. Do make bootstrap
  9. Edit .config to set your installation root
  10. Do make all
  11. Do make install

Good luck!

2010-08-14

Smooth Operators

Edit: Of course the module Num is already present in the standard library. I've renamed the module to Arith.

The newly-released OCaml 3.12 includes many extensions to the sub-language of modules. One of the simplest but most practical is the syntax for locally opening modules (let open M in e), and for evaluating expressions in the context of an implicitly-opened module (M.(e), equivalent to the former). The biggest payoff this syntax affords is overloading operators and functions in an expression, in a delimited context denoted by a module used as a dictionary:

let degree n =
  let pi =  3.1415926535897931 in
  Arith.F.(2. * pi / 180. * of_int n)

Note that the operators are the usual ones! Another neat example:

let rgba r g b a =
  Arith.I32.((of_int r lsl 8 lor of_int g) lsl 8 lor of_int b) lsl 8 lor of_int a)

(I've purposefully written the example to showcase the operator precedence, not for clarity). Unfortunately, the standard library doesn't yet include the modules necessary for this to work. Here's my version of the built-in numeric instances, suitable for inclusion in your .ocamlinit file. It is structured as a top-level module Arith, but can be put into a arith.ml file for separate compilation (if you do that, take care to include in the .mli interface file the complete module signature, including externals, so that the compiler can inline the definitions). This module contains sub-modules with definitions for each of the types int, int32, int64, Big_int, float, and Ratio. Every sub-module conforms to the NUM signature (inspired by the type classes in Haskell's Prelude):

module type NUM = sig
 type t
 val min_value : t
 val max_value : t
 val of_int    : int -> t
 val to_int    : t -> int
 val of_string : string -> t
 val to_string : t -> string
 val ( ~- )    : t -> t
 val ( ~+ )    : t -> t
 val (  + )    : t -> t -> t
 val (  - )    : t -> t -> t
 val (  * )    : t -> t -> t
 val (  / )    : t -> t -> t
 val (mod )    : t -> t -> t
 val abs       : t -> t
end

so that with the following top-level definitions:

let show (type t) d =
  let module N = (val d : NUM with type t = t) in N.to_string

let read (type t) d =
  let module N = (val d : NUM with type t = t) in N.of_string

(note the syntax for modules as first-class values) the following code works:

# read (module Arith.I : NUM with type t = int) "123" ;;
- : int = 123
# read (module Arith.I32 : NUM with type t = int32) "123" ;;
- : int32 = 123l
# read (module Arith.I64 : NUM with type t = int64) "123" ;;
- : int64 = 123L
# read (module Arith.F : NUM with type t = float) "123" ;;
- : float = 123.

(the syntax for binding first-class module values is pretty heavy). They also conform to the ORD signature (also borrowed from Haskell):

module type ORD = sig
  type t
  val compare   : t -> t -> int
  val ( =  )    : t -> t -> bool
  val ( <> )    : t -> t -> bool
  val ( <  )    : t -> t -> bool
  val ( <= )    : t -> t -> bool
  val ( >  )    : t -> t -> bool
  val ( >= )    : t -> t -> bool
end

so that the following code is generic on the module implementing it:

let max (type t) d (x : t) (y : t) : t =
  let module N = (val d : ORD with type t = t) in
  N.(if x < y then y else x)

let min (type t) d (x : t) (y : t) : t =
  let module N = (val d : ORD with type t = t) in
  N.(if x < y then x else y)

The sub-modules have short, mnemonic names I, I32, I64, Z, F and Q so that they don't clash with the corresponding standard modules. The first four, the binary integral types, conform to the following BIN signature:

module type BIN = sig
  type t
  val succ   : t -> t
  val pred   : t -> t
  val (land) : t -> t -> t
  val (lor ) : t -> t -> t
  val (lxor) : t -> t -> t
  val lnot   : t -> t
  val (lsl ) : t -> int -> t
  val (lsr ) : t -> int -> t
  val (asr ) : t -> int -> t
end

So, for those of you that can't or won't avail yourselves to extension libraries like OCaml Batteries, here is the complete code for the module Arith:

module Arith = struct
  module I = struct
    type t = int
    let min_value      : t = 1 lsl (if 1 lsl 31 = 0 then 30 else 62)
    let max_value      : t = min_value - 1
    external of_int    : int -> t = "%identity"
    external to_int    : t -> int = "%identity"
    external of_string : string -> t = "caml_int_of_string"
    let      to_string : t -> string = Pervasives.string_of_int
    external ( ~- )    : t -> t = "%negint"
    external ( ~+ )    : t -> t = "%identity"
    external (  + )    : t -> t -> t = "%addint"
    external (  - )    : t -> t -> t = "%subint"
    external (  * )    : t -> t -> t = "%mulint"
    external (  / )    : t -> t -> t = "%divint"
    external (mod )    : t -> t -> t = "%modint"
    let abs  (x: t)    : t = if x >= 0 then x else -x
    let compare        : t -> t -> int  = Pervasives.compare
    let      ( =  )    : t -> t -> bool = Pervasives.( =  )
    let      ( <> )    : t -> t -> bool = Pervasives.( <> )
    let      ( <  )    : t -> t -> bool = Pervasives.( <  )
    let      ( <= )    : t -> t -> bool = Pervasives.( <= )
    let      ( >  )    : t -> t -> bool = Pervasives.( >  )
    let      ( >= )    : t -> t -> bool = Pervasives.( >= )
    external succ      : t -> t = "%succint"
    external pred      : t -> t = "%predint"
    external (land)    : t -> t -> t = "%andint"
    external (lor )    : t -> t -> t = "%orint"
    external (lxor)    : t -> t -> t = "%xorint"
    let lnot (x: t)    : t = x lxor (-1)
    external (lsl )    : t -> int -> t = "%lslint"
    external (lsr )    : t -> int -> t = "%lsrint"
    external (asr )    : t -> int -> t = "%asrint"
  end
  module I32 = struct
    type t = int32
    let min_value      : t = Int32.min_int
    let max_value      : t = Int32.max_int
    external of_int    : int -> t = "%int32_of_int"
    external to_int    : t -> int = "%int32_to_int"
    external of_string : string -> t = "caml_int32_of_string"
    let      to_string : t -> string = Int32.to_string
    external ( ~- )    : t -> t = "%int32_neg"
    external ( ~+ )    : t -> t = "%identity"
    external (  + )    : t -> t -> t = "%int32_add"
    external (  - )    : t -> t -> t = "%int32_sub"
    external (  * )    : t -> t -> t = "%int32_mul"
    external (  / )    : t -> t -> t = "%int32_div"
    external (mod )    : t -> t -> t = "%int32_mod"
    let abs  (x: t)    : t = if x >= 0l then x else -x
    let compare        : t -> t -> int  = Pervasives.compare
    let      ( =  )    : t -> t -> bool = Pervasives.( =  )
    let      ( <> )    : t -> t -> bool = Pervasives.( <> )
    let      ( <  )    : t -> t -> bool = Pervasives.( <  )
    let      ( <= )    : t -> t -> bool = Pervasives.( <= )
    let      ( >  )    : t -> t -> bool = Pervasives.( >  )
    let      ( >= )    : t -> t -> bool = Pervasives.( >= )
    let      succ      : t -> t = Int32.succ
    let      pred      : t -> t = Int32.pred
    external (land)    : t -> t -> t = "%int32_and"
    external (lor )    : t -> t -> t = "%int32_or"
    external (lxor)    : t -> t -> t = "%int32_xor"
    let lnot (x: t)    : t = x lxor (-1l)
    external (lsl )    : t -> int -> t = "%int32_lsl"
    external (lsr )    : t -> int -> t = "%int32_asr"
    external (asr )    : t -> int -> t = "%int32_lsr"
  end
  module I64 = struct
    type t = int64
    let min_value      : t = Int64.min_int
    let max_value      : t = Int64.max_int
    external of_int    : int -> t = "%int64_of_int"
    external to_int    : t -> int = "%int64_to_int"
    external of_string : string -> t = "caml_int64_of_string"
    let      to_string : t -> string = Int64.to_string
    external ( ~- )    : t -> t = "%int64_neg"
    external ( ~+ )    : t -> t = "%identity"
    external (  + )    : t -> t -> t = "%int64_add"
    external (  - )    : t -> t -> t = "%int64_sub"
    external (  * )    : t -> t -> t = "%int64_mul"
    external (  / )    : t -> t -> t = "%int64_div"
    external (mod )    : t -> t -> t = "%int64_mod"
    let abs  (x: t)    : t = if x >= 0L then x else -x
    let compare        : t -> t -> int  = Pervasives.compare
    let      ( =  )    : t -> t -> bool = Pervasives.( =  )
    let      ( <> )    : t -> t -> bool = Pervasives.( <> )
    let      ( <  )    : t -> t -> bool = Pervasives.( <  )
    let      ( <= )    : t -> t -> bool = Pervasives.( <= )
    let      ( >  )    : t -> t -> bool = Pervasives.( >  )
    let      ( >= )    : t -> t -> bool = Pervasives.( >= )
    let      succ      : t -> t = Int64.succ
    let      pred      : t -> t = Int64.pred
    external (land)    : t -> t -> t = "%int64_and"
    external (lor )    : t -> t -> t = "%int64_or"
    external (lxor)    : t -> t -> t = "%int64_xor"
    let lnot (x: t)    : t = x lxor (-1L)
    external (lsl )    : t -> int -> t = "%int64_lsl"
    external (lsr )    : t -> int -> t = "%int64_asr"
    external (asr )    : t -> int -> t = "%int64_lsr"
  end
  module Z = struct
    type t = Big_int.big_int
    let min_value   : t = Big_int.zero_big_int
    let max_value   : t = Big_int.zero_big_int
    let of_int      : int -> t = Big_int.big_int_of_int
    let to_int      : t -> int = Big_int.int_of_big_int
    let of_string   : string -> t = Big_int.big_int_of_string
    let to_string   : t -> string = Big_int.string_of_big_int
    let ( ~- )      : t -> t = Big_int.minus_big_int
    external ( ~+ ) : t -> t = "%identity"
    let (  + )      : t -> t -> t = Big_int.add_big_int
    let (  - )      : t -> t -> t = Big_int.sub_big_int
    let (  * )      : t -> t -> t = Big_int.mult_big_int
    let (  / )      : t -> t -> t = Big_int.div_big_int
    let (mod )      : t -> t -> t = Big_int.mod_big_int
    let abs         : t -> t = Big_int.abs_big_int
    let compare     : t -> t -> int  = Big_int.compare_big_int
    let ( =  )      : t -> t -> bool = Big_int.eq_big_int
    let ( <> )      (x:t) (y:t) = not (x = y)
    let ( <  )      : t -> t -> bool = Big_int.lt_big_int
    let ( <= )      : t -> t -> bool = Big_int.le_big_int
    let ( >  )      : t -> t -> bool = Big_int.gt_big_int
    let ( >= )      : t -> t -> bool = Big_int.ge_big_int
    let succ        : t -> t = Big_int.succ_big_int
    let pred        : t -> t = Big_int.pred_big_int
    let (land)      : t -> t -> t = Big_int.and_big_int
    let (lor )      : t -> t -> t = Big_int.or_big_int
    let (lxor)      : t -> t -> t = Big_int.xor_big_int
    let lnot        : t -> t = let m1 = of_int (-1) in fun x -> x lxor m1
    let (lsl )      : t -> int -> t = Big_int.shift_left_big_int
    let (lsr )      : t -> int -> t = Big_int.shift_right_big_int
    let (asr )      : t -> int -> t = Big_int.shift_right_towards_zero_big_int
  end
  module F = struct
    type t = float
    let min_value      : t = Int64.float_of_bits 0xFF_F0_00_00_00_00_00_00L
    let max_value      : t = Int64.float_of_bits 0x7F_F0_00_00_00_00_00_00L
    external of_int    : int -> t = "%floatofint"
    external to_int    : t -> int = "%intoffloat"
    external of_string : string -> t = "caml_float_of_string"
    let      to_string : t -> string = Pervasives.string_of_float
    external ( ~- )    : t -> t = "%negfloat"
    external ( ~+ )    : t -> t = "%identity"
    external (  + )    : t -> t -> t = "%addfloat"
    external (  - )    : t -> t -> t = "%subfloat"
    external (  * )    : t -> t -> t = "%mulfloat"
    external (  / )    : t -> t -> t = "%divfloat"
    external (mod )    : t -> t -> t = "caml_modf_float"
    let abs  (x: t)    : t = if x >= 0. then x else -x
    external compare   : t -> t -> int = "%compare"
    external ( =  )    : t -> t -> bool = "%equal"
    external ( <> )    : t -> t -> bool = "%notequal"
    external ( <  )    : t -> t -> bool = "%lessthan"
    external ( <= )    : t -> t -> bool = "%lessequal"
    external ( >  )    : t -> t -> bool = "%greaterthan"
    external ( >= )    : t -> t -> bool = "%greaterequal"
  end
  module Q = struct
    type t = Ratio.ratio
    let min_value   : t =
      let flag = Arith_status.get_error_when_null_denominator () in
      Arith_status.set_error_when_null_denominator false;
      let v = Ratio.minus_ratio (Ratio.create_ratio Big_int.unit_big_int Big_int.zero_big_int) in
      Arith_status.set_error_when_null_denominator flag;
      v
    let max_value   : t =
      let flag = Arith_status.get_error_when_null_denominator () in
      Arith_status.set_error_when_null_denominator false;
      let v = Ratio.create_ratio Big_int.unit_big_int Big_int.zero_big_int in
      Arith_status.set_error_when_null_denominator flag;
      v
    let of_int      : int -> t = Ratio.ratio_of_int
    let to_int      : t -> int = Ratio.int_of_ratio
    let of_string   : string -> t = Ratio.ratio_of_string
    let to_string   : t -> string = Ratio.string_of_ratio
    let ( ~- )      : t -> t = Ratio.minus_ratio
    external ( ~+ ) : t -> t = "%identity"
    let (  + )      : t -> t -> t = Ratio.add_ratio
    let (  - )      : t -> t -> t = Ratio.sub_ratio
    let (  * )      : t -> t -> t = Ratio.mult_ratio
    let (  / )      : t -> t -> t = Ratio.div_ratio
    let (mod ) (x:t) (y:t) : t =
      Ratio.sub_ratio x 
        (Ratio.mult_ratio y
          (Ratio.ratio_of_big_int
            (Ratio.floor_ratio
              (Ratio.div_ratio x y))))
    let abs         : t -> t = Ratio.abs_ratio
    let compare     : t -> t -> int  = Ratio.compare_ratio
    let ( =  )      : t -> t -> bool = Ratio.eq_ratio
    let ( <> ) (x:t) (y:t) = not (x = y)
    let ( <  )      : t -> t -> bool = Ratio.lt_ratio
    let ( <= )      : t -> t -> bool = Ratio.le_ratio
    let ( >  )      : t -> t -> bool = Ratio.gt_ratio
    let ( >= )      : t -> t -> bool = Ratio.ge_ratio
  end
end

In the case of Z, there are no meaningful extremal values. I haven't included a module for NativeInt, but you can do so quite easily. Note that, if any of the external functions in the standard library changes, this module must be revised. I hope you find it useful.

2010-08-05

Reviving an Old Tiger

An old machine that proved its mettle on the line can be a handy development server, on a pinch. Especially if you're starting up by pulling your boot straps, as I currently am. I have this Blue & White G3 that served me faithfully for the last, say, 10 years. Bitrot, alas, is not a hardware problem: software version numbers climb and software authors drop support for older operating systems and configurations, and we misers are left struggling and scavenging for information to make things compile and install. So, for my own reference, here's what I did this time. This applies to Mac OS X 10.4.11, AKA Tiger:

Sybase 12
Get the latest EBF for Mac OS, EBF17473. It works beautifully.
XCode 2.5
I had to root quite a bit around the newly redesigned Apple Developer site. The image name is xcode25_8m2558_developerdvd.dmg
MySQL 5.1.49
A challenge for GCC 4.0.1, it kernel-panicked the machine once. I followed these fine instructions. I set it up to start up automatically on boot, via the following launchd script:

<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
 <key>KeepAlive</key>
 <true/>
 <key>Label</key>
 <string>com.mysql.mysqld</string>
 <key>Program</key>
 <string>/usr/local/mysql/bin/mysqld_safe</string>
 <key>RunAtLoad</key>
 <true/>
 <key>UserName</key>
 <string>mysql</string>
 <key>WorkingDirectory</key>
 <string>/usr/local/mysql</string>
</dict>
</plist>

Put it in /Library/LaunchDaemons/com.mysql.mysqld.plist.

Apache httpd 2.2.16
I followed these instructions. The layout I used is this:
<Layout DarwinLocal>
    prefix:        /usr/local/apache2
    exec_prefix:   ${prefix}
    bindir:        ${exec_prefix}/bin
    sbindir:       ${exec_prefix}/sbin
    libdir:        ${exec_prefix}/lib
    libexecdir:    ${exec_prefix}/modules
    mandir:        ${prefix}/man
    datadir:       /Library/Documents
    sysconfdir:    /etc/httpd
    installbuilddir: ${datadir}/build
    errordir:      ${datadir}/error
    iconsdir:      ${datadir}/icons
    htdocsdir:     /Library/WebServer/Documents
    manualdir:     ${datadir}/manual
    cgidir:        /Library/WebServer/CGI-Executables
    includedir:    ${prefix}/include+
    localstatedir: /var
    runtimedir:    ${localstatedir}/run
    logfiledir:    ${localstatedir}/log/httpd
    proxycachedir: ${runtimedir}/proxy
</Layout>

and the configuration is this (culled from here):

CFLAGS="-arch ppc -isysroot /Developer/SDKs/MacOSX10.4u.sdk" ./configure \
--enable-layout=DarwinLocal \
--enable-access \
--enable-actions --enable-alias --enable-asis --enable-auth \
--enable-auth_dbm --enable-auth_digest --enable-autoindex \
--enable-cache --enable-cgi --enable-dav --enable-dav_fs \
--enable-deflate --enable-dir --enable-disk_cache \
--enable-dumpio --enable-env --enable-expires --enable-fastcgi \
--enable-file_cache --enable-headers --enable-imap \
--enable-include --enable-info --enable-log_config \
--enable-log_forensic --enable-logio --enable-mem_cache --enable-mime \
--enable-mime_magic --enable-negotiation --enable-perl \
--enable-rewrite --enable-setenvif --enable-speling --enable-ssl \
--enable-status --enable-suexec --enable-unique_id --enable-userdir \
--enable-usertrack --enable-version --enable-vhost_alias --enable-so \
--enable-module=all --enable-shared=max

These instructions allow automatic launching of Apache at startup.

zlib 1.2.5
libpng 1.4.3
Beware, as the DLL gets installed with the wrong name: it should be libpng14.14.dylib.
jpegsrc v8b
freetype 2.4.1
gd-2.0.35
I configured it with the usual caveats.
libmcrypt 2.5.8
mhash 0.9.9.9
mcrypt-2.6.8
php 5.3.3
It took me quite a long while sort out the correct incantation to configure PHP. This is what worked for me:
export CFLAGS="-DSQLITE_ENABLE_LOCKING_STYLE=0 -DBIND_8_COMPAT"
./configure \
-prefix=/usr/local/php5 \
-with-config-file-path=/etc \
-with-zlib=/usr/local \
-with-xml \
-with-xsl \
-enable-ctype \
-enable-dom \
-enable-exif \
-enable-filter \
-enable-ftp \
-enable-gd-native-ttf \
-enable-libxml \
-enable-mbregex \
-enable-mbstring \
-enable-pcntl \
-enable-pdo \
-enable-posix \
-enable-shmop \
-enable-simplexml \
-enable-soap \
-enable-xml \
-enable-bcmath \
-with-bz2=/usr/bin \
-with-curl \
-with-freetype-dir=/usr/local \
-with-gd=/usr/local \
-with-iconv \
-with-jpeg-dir=/usr/local \
-with-ldap \
-with-mcrypt=/usr/local \
-with-mysql=/usr/local/mysql \
--with-mysqli=/usr/local/mysql/bin/mysql_config \
-with-openssl \
-with-pdo-mysql=/usr/local/mysql \
-with-png-dir=/usr/local \
-with-zlib=/usr/local \
-enable-dbx \
-enable-sockets \
-with-apxs2=/usr/local/apache2/sbin/apxs \
-with-kerberos=/usr \
-with-mysqlsock=/var/mysql/mysql.sock \
-without-pear

Don't forget to apply the patch described in this post. Be sure to make test!

Next, configure Apache to run phpMyAdmin, Drupal and Mantis (I won't have anything to do with either Python or Ruby, sorry). I expect this old Tiger to bend under the pressure but not break.

2010-07-02

Sukhotin's Algorithm

There is an obscure algorithm devised by one Boris Viktorovich Sukhotin, a Russian… linguist? mathematician? cryptographer? let's say researcher, in the early '70s, usually called Sukhotin's Vowel Identification Algorithm, whose initial aim was to quickly attack an alphabetic cyphertext encoded with an unknown substitution cypher. Jacques Guy, a linguist and programmer well-known for his interest in the Voynich manuscript, translated the algorithm from Russian and simplified it but his publishing in Cryptologia didn't help it being more widely known. Besides, his explanation is not very illuminating, so I'll show the algorithm and then try to explain it from first principles.

This classification algorithm, given a text over an alphabet Σ of size N = |Σ|, partitions Σ into two disjoint subsets V and K (for vowels and consonants). It takes as input the symmetric digraph frequencies F[xy] = F[yx], that is, the frequency with which letter xΣ appears next to letter yΣ in the text. The algorithm is predicated on the assumption that in natural languages vowels tend to appear next to consonants rather than to other vowels. On input, the array d records the frequencies F[xy], subject to the condition:

⟨∀i, j : 0 ≤ i, j < N : d.i.j = d.j.i ≥ 0⟩

On exit, the stack v will contain the letters identified as vowels by the algorithm. I use Dijkstra's guarded command language, except that I omit trivial skip branches in single-branched ifs.

begin
   var v: stack of int;
   var r: array [0 … N) of int;
   var m, i, j: int;
   m, i := 0, 0;
   do iNr.i, j := 0, 0;
      do jNr.i := r.i + d.i.j;
         j := j + 1
      od;
      if r.m < r.im := i fi;
      i := i + 1
   od;
   { P }
   do r.m > 0 →
      v:push.m;
      i, j := 0, 0;
      do iNif imr.i :=  r.i − 2⋅d.i.mi = mr.i := −r.i
         fi;
         if r.j < r.ij := i fi;
         i := i + 1
      od;
      m := j
      { P }
   od
   { v = V }
end

The algorithm initially assumes that every letter is a consonant, that is, initially K = Σ. The first loop computes in r.i the cumulative frequency of character i, recording in m the character with maximum frequency, thus establishing the initial validity of the loop invariant P :

P ≡ ⟨∀ i : 0 ≤ i < N : r.i = ⟨∑j : jK : F[ij]⟩ ∧ r.mr.i

The second loop tries to maximize:

Q = ⟨∑i : iV : ⟨∑j : jK : F[ij]⟩⟩

under invariance of P. This measure is motivated by the original assumption that vowels are more likely to be preceded or followed by consonants than by other vowels, and that a partition of Σ should maximize this likelihood. The second loop proceeds to hill-climb to this maximum by assuming that the most frequent letter is a vowel, adding each r.m to V and removing it from K under the invariance of P. Its inner loop restores it by discounting twice the frequency for m: once for d.m.i and then for d.i.m, except for d.m.m, since after reduction it is precisely equal to r.m. It also keeps track of the new maximum for the next iteration. This process continues as long as Q can increase, which is implied by the loop condition.

References

  1. B. V. Sukhotin. DECIPHERING METHODS AS A MEANS OF LINGUISTIC RESEARCH, COLING Vol.1 1973. Online.
  2. ———. OPTIMIZATION ALGORITHMS OF DECIPHERING AS THE ELEMENTS OF A LINGUISTIC THEORY, COLING Budapest Vol.1 1988. Online.
  3. Jacques B. M. Guy. VOWEL IDENTIFICATION: AN OLD (BUT GOOD) ALGORITHM, CRYPTOLOGIA Vol.XV No.3 1991. Online.
  4. David M. W. Powers. Unsupervised learning of linguistic structure. IJCL Vol.2 No.1 1997. Online.