2012-08-06

Merge Right

The so-called master-transaction update is one of the, if not the defining algorithms of the discipline formerly known as "data processing". Given two sorted files of records with increasing keys, the process applies each record in the transaction file to each record of the the master file and outputs the result, if any, to the updated master file in one pass over each input. The same algorithm can compute the union, intersection or difference of sorted sequences. For instance, the union of two sets represented as sorted lists of unique elements is:

let union       =
  let rec go l r = match l, r with
  | [], xs | xs, []  -> xs
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> x :: go       xs (y :: ys)
    |  0 -> x :: go       xs       ys
    |  1 -> y :: go (x :: xs)      ys
    |  _ -> assert false
  in go

Intersection is:

let inter       =
  let rec go l r = match l, r with
  | [], _  | _, []   -> []
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 ->      go       xs (y :: ys)
    |  0 -> x :: go       xs       ys
    |  1 ->      go (x :: xs)      ys
    |  _ -> assert false
  in go

while difference is:

let diff       =
  let rec go l r = match l, r with
  | [], _  | _, []   -> l
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> x :: go       xs (y :: ys)
    |  0 ->      go       xs       ys
    |  1 ->      go (x :: xs)      ys
    |  _ -> assert false
  in go

And so on. The three functions use the same underlying merge schemata; what varies is the operation to perform in each of the five possible cases:

  1. The left sequence is nil
  2. The right sequence is nil
  3. The next element in the left sequence is less than the next element in the right sequence
  4. The next element in the left sequence is equal to the next element in the right sequence
  5. The next element in the left sequence is greater than the next element in the right sequence

The question is, then, how many set operations can the merge algorithm implement? These five cases partition both input sequences in disjoint sets such that the output sequence is the natural merge of zero or more of them. For example, processing the sets { 1, 3, 4, 6, 7, 8 } ⋈ { 2, 3, 5, 6, 8, 9 } results in the following steps:

LNRNLTEQGTArguments
1 { 1, 3, 4, 6, 7, 8 } ⋈ { 2, 3, 5, 6, 8, 9, 10 }
2{ 3, 4, 6, 7, 8 } ⋈ { 2, 3, 5, 6, 8, 9, 10 }
3 { 3, 4, 6, 7, 8 } ⋈ { 3, 5, 6, 8, 9, 10 }
4 { 4, 6, 7, 8 } ⋈ { 5, 6, 8, 9, 10 }
5{ 6, 7, 8 } ⋈ { 5, 6, 8, 9, 10 }
6 { 6, 7, 8 } ⋈ { 6, 8, 9, 10 }
7 { 7, 8 } ⋈ { 8, 9, 10 }
8 { 8 } ⋈ { 9, 10 }
9,10∅ ⋈ { 9, 10 }

Abstracting away the operations to perform in each of these five cases we have the following schema:

let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
  | [], ys -> ln ys
  | xs, [] -> rn xs
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> lt x (go       xs (y :: ys))
    |  0 -> eq x (go       xs       ys )
    |  1 -> gt y (go (x :: xs)      ys )
    |  _ -> assert false
  in go

Both ln and rn must decide what to do with the remaining list and so have type α list → α list, while lt, eq and gt must decide what to do with the element in consideration and so have type αα list → α list; thus the type of merge is (α list → α list) → (α list → α list) → (αα list → α list) → (αα list → α list) → (αα list → α list) → α list → α list → α list. The operations on the remainder either pass it unchanged or return nil, while the operations on the next element either add it to the output sequence or not:

let self   xs =      xs
and null   _  =      []
and cons x xs = x :: xs
and tail _ xs =      xs

(some of these have well-known names in functional programming, but here I choose to use these neat, four-letter ones.) With the proviso that the output sequence must be increasing these four functions exhaust the possibilities by parametricity; otherwise, duplications and rearrangements would satisfy the parametric signature. Now union, intersection and difference are simply:

let union l r = merge self self  cons cons cons l r
and inter l r = merge null null  tail cons tail l r
and diff  l r = merge null self  cons tail tail l r

It is obvious that the question I posed above is answered as 25 = 32 possible set operations obtainable by varying each of the five operations. The next question is, then, what are these 32 operations? Let me characterize each of the five sets ln, rn, lt, eq and gt. The easiest one is eq, as it obviously is the intersection of both sets:

eq(A, B) = A ∩ B

By substitution in merge it is possible to show that ln(A, B) = rn(B, A) and vice versa; hence just one set expression suffices. The merge ends with rn for every element in A that is greater than every element in B, as the latter were included in the comparison sets; and conversely for ln. Hence:

ln(A, B) = B / A ≡ ⟨S y : B : ⟨∀ x : A : y > x⟩⟩
rn(A, B) = A / B ≡ ⟨S x : A : ⟨∀ y : B : x > y⟩⟩

(read A / B as "A over B"). All three sets are pairwise disjoint, since A / B ⊆ A and A / B ∩ B = ∅, and conversely, by construction.

The two remaining sets are also symmetric in the sense that lt(A, B) = gt(B, A) but are more difficult to characterize. My first attempt was to think of each element in A as being processed in turn and put into lt(A, B) just when strictly less than all the elements in B against which it could be matched, namely lt(A, B) = ⟨S x : A : x < ⟨min y : B : x ≤ y⟩⟩. The condition can be simplified with a bit of equational reasoning:

   x∈A ∧ x < ⟨min y : B : x ≤ y⟩
≡ { GLB }
   x∈A ∧ ⟨∀ y : y∈B ∧ x ≤ y : x < y⟩⟩
≡ { Trading }
   x∈A ∧ ⟨∀ y : y∈B : x > y ∨ x < y⟩⟩
≡ { Trichotomy }
   x∈A ∧ ⟨∀ y : y∈B : x ≠ y⟩⟩
≡
   x∈A ∧ x∉B

In other words, A − B. The problem is that, since the quantification over an empty set is trivially true, this set is too big as it includes the respective remainder; that is to say A / B ⊆ A − B as I showed above. To preserve disjointness I define:

lt(A, B) = A − B − A / B
gt(A, B) = B − A − B / A

In a Venn diagram, these five sets are:

Venn diagram of the five component sets

So by including or excluding one of the five components depending on the function passed to each of the five operations, the 32 set operations achievable by merge are:

Venn diagrams for all 32 set operations

Or in tabular form:

NlnrnlteqgtFunction
0selfselfconsconsconsA∪B
1selfselfconsconstailA B/A
2selfselfconstailconsA∆B
3selfselfconstailtailA−BB/A
4selfselftailconsconsB A/B
5selfselftailconstailA∩BB/AA/B
6selfselftailtailconsB−AA/B
7selfselftailtailtail B/AA/B
8selfnullconsconsconsA∪BA/B
9selfnullconsconstailA B/AA/B
10selfnullconstailconsA∆BA/B
11selfnullconstailtailA−BB/AA/B
12selfnulltailconsconsB
13selfnulltailconstailA∩BB/A
14selfnulltailtailconsB−A
15selfnulltailtailtail B/A
16nullselfconsconsconsA∪BB/A
17nullselfconsconstailA
18nullselfconstailconsA∆BB/A
19nullselfconstailtailA−B
20nullselftailconsconsB B/AA/B
21nullselftailconstailA∩BA/B
22nullselftailtailconsB−AB/AA/B
23nullselftailtailtail A/B
24nullnullconsconsconsA∪BB/AA/B
25nullnullconsconstailA A/B
26nullnullconstailconsA∆BB/AA/B
27nullnullconstailtailA−BA/B
28nullnulltailconsconsB B/A
29nullnulltailconstailA∩B
30nullnulltailtailconsB−AB/A
31nullnulltailtailtail

Arguably, besides the traditional five set operations A ∪ B, A ∩ B, A − B, B − A and A ∆ B, only the remainders A / B, B / A and perhaps A / B ∪ B / A = A ⊔ B, the join of A and B (not to be confused with the relational operation), are independently useful. These three are obscure, and as far as I know have no name, although I'd love to hear if there is literature about them. This might mean that this exhaustive catalog of set merges is rather pointless, but at least now I know for sure.

2012-08-02

A Helping Phantom Hand

You don't have to be writing an interpreter or some other kind of abstract code to profit from some phantom types. Suppose you have two or more functions that work by "cooking" a simple value (a float, say) with a lengthy computation before proceeding:

let sun_geometric_longitude j =
  let je = to_jcen (to_jde j) in
  (* … computation with je … *)

let sun_apparent_longitude j =
  let je = to_jcen (to_jde j) in
  let q  = sun_geometric_longitude j in
  (* … computation with je … *)

In this case j is a date expressed in Julian Days as a float, and to_jde computes the Ephemeris Time as a 63-term trigonometric polynomial correction on it. sun_apparent_longitude calls sun_geometric_longitude and both call to_jde. Obviously this unnecessary duplication can be factored out:

let sun_geometric_longitude je =
  let t  = to_jcen je in
  (* … computation with je … *)

let sun_apparent_longitude je =
  let q  = sun_geometric_longitude je in
  let t  = to_jcen je in
  (* … computation with je … *)

let foo j =
  let je = to_jde j in
  let l  = sun_apparent_longitude je in
  (* … *)

(to_jcen is cheap and not worth factoring out.) But now a naked float represent two different things, Universal Time and Ephemeris Time, and we have a valid concern of mixing them up. We can wrap the time in an ADT:

type dt = UT of float | TD of float

let to_jcen j = match j with
| UT j ->
  (* … lengthy computation … *)
  TD j
| TD _ -> invalid_arg "to_jcen"

let sun_geometric_longitude je = match je with
| TD je ->
  let t  = to_jcen je in
  (* … computation with je … *)
| UT _  -> invalid_arg "sun_geometric_longitude"

let sun_apparent_longitude je = match je with
| TD je ->
  let q  = sun_geometric_longitude je in
  let t  = to_jcen je in
  (* … computation with je … *)
| UT _  -> invalid_arg "sun_apparent_longitude"

let foo j =
  let je = to_jde j in
  (* … computation with sun_apparent_longitude je … *)

but this forces us to check at run-time whether we mixed times up in our code. A better technique is to use a phantom type. First fix an abstract signature for the module implementing these functions:

module Test : sig
  type 'a dt

  val datetime : yy:int -> mm:int -> dd:int -> hh:int -> nn:int -> ss:float -> [ `JD ] dt
  val to_jde   : [ `JD ] dt -> [ `JDE ] dt
  val to_jcen  : 'a dt -> float

  val sun_geometric_longitude : [ `JDE ] dt -> float
  val sun_apparent_longitude  : [` JDE ] dt -> float
end = struct
  (* … *)

We have a way to construct our type α dt from a Universal datetime, a way to convert it to Dynamical Time with to_jde and operations that respect the kind of measure. The implementation is as before:

  (* … *)
  type 'a dt = float (* phantom type! *)

  let datetime ~yy ~mm ~dd ~hh ~nn ~ss = (* … *)

  let to_jde  j = (* … *)
  let to_jcen j = (* … *)

  let sun_geometric_longitude je =
    (* … computation with a statically checked je … *)

  let sun_apparent_longitude je =
    let q  = sun_geometric_longitude je in
    (* … computation with a statically checked je … *)
end

Now the compiler checks for us that we don't mix up measures. The only inconvenient of this approach is that the type α dt is fully abstract, and you must provide coercions, string_ofs and pretty printers for it if you need to show them or debug your code. There is a way out, though; just make it a private type abbreviation:

module Test : sig
  type 'a dt = private float
  (* … signature exactly as before … *)
end = struct
  type 'a dt = float (* phantom type! *)
  (* … implementation exactly as before … *)
end

Now α dt will be shown in the top-level, can be printed with a coercion (je :> float), etc.

For another simple example, suppose you want to represent sets as lists. The best way to do that is to keep them sorted so that set operations run in linear time. If you want to provide some operations that temporarily destroy the ordering, a phantom type can keep track of the invariant "this list is sorted":

module Set : sig
  type ('a, 'b) t = private 'a list

  val of_list : 'a list -> ('a, [ `S ]) t
  val as_set  : ('a, [ `U ]) t -> ('a, [ `S ]) t
  val empty   : ('a, [ `S ]) t
  val mem     : 'a -> ('a, [ `S ]) t -> bool
  val add     : 'a -> ('a, [ `S ]) t -> ('a, [ `S ]) t
  val union   : ('a, [ `S ]) t -> ('a, [ `S ]) t -> ('a, [ `S ]) t
  val inter   : ('a, [ `S ]) t -> ('a, [ `S ]) t -> ('a, [ `S ]) t
  val append  : ('a, 'b) t -> ('a, 'c) t -> ('a, [ `U ]) t
end = struct
  type ('a, 'b) t = 'a list

  let of_list l   = List.sort compare l
  and as_set  l   = List.sort compare l
  and empty       = []
  let union   l r = (* … linear merge … *)
  and inter   l r = (* … linear merge … *)
  let mem     e   = List.mem e
  and add     e   = union [e]
  and append      = List.append
end

The phantom type [ `S | `U ] tracks the sortedness of the list. Note that in the case of append the argument lists can have any ordering but the result is known to be unsorted. Note also how the fact that the empty list is by definition sorted is directly reflected in the type.