GADTs: variants with type-level information
This lecture switches gears entirely from the monad story. So far in Module 8 we have been about sequencing computations; now we turn to a more advanced type-system feature called generalized algebraic data types, almost always abbreviated GADTs. They are the second half of the OCaml toolkit needed for typed embedded languages, and they show up in serious OCaml code whenever you want the compiler to do more work for you.
Ordinary variants say "this value is one of a finite set of cases". GADTs add: "and each case can have a different type index". The practical consequence is that the compiler can prove things at compile time that an ordinary variant would have to check at runtime. Wrong combinations become type errors, not crashes.
The idea is in some ways simple; the notation is unusual; the type theory is involved. We keep it light and lean on worked examples.
Ordinary variant: same parameter for all constructors
To set the contrast, an ordinary parameterised variant:
Here 'a tree is uniformly indexed: every Leaf and every Node
inside an 'a tree value uses the same 'a. If 'a = int, then
the Node values carry ints; if 'a = string, they carry
strings. The constructors do not choose their own type index; they
share whatever the surrounding type says.
That is fine for most data structures. A list of ints, a tree
of strings, a record of options: in all these the parameter is
fixed by the outside. But sometimes the constructors need to
choose their own indices independently. That is the case GADTs
handle.
The GADT form
Look at this expression-AST type:
The syntax Constructor : args -> result_type is unusual. Read it
as an explicit type signature for the constructor: like a function
signature, but for a data constructor. Ordinary variant constructors
implicitly have a result type of "the type being defined, with the
same parameters as the type header". GADT constructors say so
explicitly, which lets them choose different result types.
Concretely:
Int_lit nis a value of typeint expr. SoInt_lit 3 : int expr.Bool_lit bis a value of typebool expr. SoBool_lit true : bool expr.Add (a, b)is a value of typeint expr, and it can only be built if bothaandbalready have typeint expr. You cannot pass anint exprand abool exprtoAdd.If (c, t, e)is a value of type'a exprfor some'a, with the constraint that the condition isbool exprand the two branches share the same'a. So anIfreturningint exprmust have twoint exprbranches; anIfreturningbool exprmust have twobool exprbranches.
This is what we mean by "type-level information". The constructor not only tags the data; it pins down the type of the value, and the compiler uses that information at compile time.
What we get: type-safe construction
Compare what the compiler accepts and rejects:
This is the GADT design payoff: programs that would otherwise crash an interpreter become type errors. With ordinary variants you would have to write an interpreter, walk the AST, and check at every node that the children have compatible types. With a GADT, the compiler refuses to even build a tree with incompatible children. The runtime check vanishes; the bad construction is rejected before any code runs.
The cost is on the construction side: every constructor application
has to be at the right index. The compiler is unforgiving about
this. If you want to write a program that builds an int expr,
every step of the construction has to commit to int-typed values.
You cannot have a runtime-dispatched "I will figure out later
what type this is" because the compiler wants to know now.
Pattern matching with type refinement
A breath before the next bit. So far you have seen GADT syntax,
the _ placeholder, and what type-safe construction buys. The
section below is the centerpiece of the lecture: how the
compiler uses the constructor's result type to refine the type
during pattern matching, and the type a. ... annotation that
turns this on. Read the eval walkthrough that follows slowly;
the four cases each illustrate a different facet of the
refinement.
First, see why a plain annotation will not do. The obvious
eval : 'a expr -> 'a does not type-check:
The fix is the type a. annotation: it lets a be a different
type in each branch. With it, OCaml's
pattern matching refines the type
index per case: each branch knows which constructor fired and
specialises a accordingly.
The type a. ... syntax says "this function is polymorphic in a,
and inside the body a is an abstract type." Why is it needed here,
when most OCaml functions need no annotation at all? Because eval
does polymorphic recursion: it calls itself at a different type
than the one it was called at. In the If case, the condition c
is a bool expr even when the whole If is an int expr, so
eval c recurses at bool while the outer call is at int.
OCaml's automatic inference assumes a recursive function calls
itself at the same type (that assumption is what keeps inference
decidable), so it cannot work this case out on its own. The
type a. annotation supplies the type up front, and that is exactly
what lets each branch refine a independently. This is one of the
few places in OCaml where an annotation is not optional.
Reading the cases:
Int_lit n -> n. The constructorInt_lit : int -> int exprfires. The compiler refines: in this branch,a = int. The variablen : int. The return type isa, which is nowint. The expressionnis anint. Type-checks.Bool_lit b -> b. Similarly,a = boolin this branch.b : bool. Returningbhas typebool, which isahere. Type-checks.Add (a, b) -> eval a + eval b. The constructor's result isint expr, so this branch refinesa = int. The two recursive callseval aandeval b(whereaandbare the GADT's sub-expressions) produceints. Adding them gives anint, which matches the refineda = int. Type-checks.If (c, t, e). The constructor preserves'a, so no refinement happens here;astays whatever it was. The conditionc : bool expr, soeval c : bool. The branchest, e : a expr, soeval t : aandeval e : a. Theifreturns ana. Type-checks.
The compiler is doing real work in every case: tracking which constructor matched, refining the abstract type accordingly, and type-checking the right-hand side under the refinement. This is the central feature that distinguishes GADTs from ordinary variants. With ordinary variants the index is fixed across branches; with GADTs it can change.
Why is this useful?
A more concrete way to think about the first benefit: when you
build a program with Add (Bool_lit true, Int_lit 5), an
ordinary AST would represent the program just fine, and the
interpreter would eventually try to add a boolean to an integer
and raise a runtime "type error". The compiler had no way to
notice the mistake when you wrote the source. With a GADT, the
compiler refuses to compile the source: the construction itself
is ill-typed, before any evaluation runs.
For a toy interpreter that is a parlour trick. For a real language (a SQL query builder, a financial calculation engine, an embedded scripting language), it is a massive win: bugs that would otherwise hide in branches your tests never exercise become impossible to write.
The cost is real too. GADTs require more annotations, more locally-abstract-type declarations, and more thought. Pattern matching can occasionally need explicit type assertions to help the compiler refine correctly. The error messages, when GADT inference fails, are harder to read than ordinary type errors. Most OCaml code does not need GADTs and is better off without them; the code that does need them needs them in a serious way.
Units of measure
The Orbiter was not even the first famous unit victim. In 1983, Air Canada flight 143 ran out of fuel at 41,000 feet because the ground crew, mid-way through Canada's imperial-to-metric transition, computed the fuel load in pounds where the new 767 expected kilograms; the aircraft, suddenly a 132-seat glider, dead-sticked onto a drag strip at Gimli with no fatalities (the Gimli Glider). Sixteen years apart, the same bug: a number travelled without its unit.
That Mars Climate Orbiter loss is the motivating example: types can
make a units mix-up impossible. We tag a float with its unit by
declaring three uninhabited types, kelvin, celsius, and
fahrenheit, that have no values of their own and exist only to sit
in a type index. Then a GADT temp whose constructor picks the
unit:
Now write addition so it can only ever combine temperatures of the same unit:
This is a phantom type encoded with a GADT: the unit lives only in the type index, never in the runtime value, yet it is enough for the compiler to keep the units apart. Real OCaml code uses exactly this pattern for units, currencies, and tagged identifiers.
When to reach for GADTs
A useful question to ask before reaching for a GADT: "what
specifically would go wrong if I used an ordinary variant and
runtime checks?" If the answer is "I would crash with a
type-mismatch error after running for hours", GADTs are worth it.
If the answer is "I would have to add an
option return
type and pattern-match in two places", they are probably not.
The next lecture shows real use
cases that pull this into focus, typed pretty-printers, type-safe
builders, and length-indexed lists, and the lecture after adds
heterogeneous lists and the GADT machinery behind Printf.
A quick check
What is the type of Add (Int_lit 1, Int_lit 2)?
int exprbool expr(int * int) exprint expr * int expr
Why: the constructor's signature is Add : int expr * int expr -> int expr. Given two int expr arguments, the result type is
int expr. The compiler refuses to apply Add to anything else.
Why does the eval function need the annotation type a. a expr -> a?
- OCaml requires every function to be annotated.
- The function is polymorphic and each branch refines
ato a different concrete type; the annotation tells the compiler to treataas abstract and allow per-branch refinement. - It is purely cosmetic; the compiler infers it anyway.
- It optimises the compiled code.
Why: without the locally-abstract-type annotation, the
compiler tries to infer a single concrete type for a and fails
because different branches refine it differently. The type a. ...
annotation says "treat a as fresh and abstract", which is what
GADT pattern matching needs. This is a recurring quirk: GADT
functions often need this annotation explicitly.
Show reference solution
A code quiz:
Define a GADT type _ value with two constructors VInt : int -> int value and VBool : bool -> bool value. Write unwrap : type a. a value -> a that returns the underlying value.
Show reference solution
Reference solution:
let unwrap : type a. a value -> a = function
| VInt n -> n
| VBool b -> b
The type a. ... annotation makes a locally abstract so that
each branch can refine it (a = int for VInt, a = bool for
VBool). Without the annotation, the compiler cannot find a
single concrete a to satisfy both branches.
What is next
The next lecture takes the basic machinery here and shows real applications. The tutorial later combines a GADT-based typed AST with an option-monad evaluator that can fail at runtime (division by zero, say) while still guaranteeing type safety on the success path. That is the capstone for the OCaml half of the course.
Reading
- Real World OCaml, GADTs: https://dev.realworldocaml.org/gadts.html
Sources
This lecture's prose, worked examples, and quizzes are original to
this course. Materials referenced during preparation are listed in
the Reading section above; Cornell CS3110 and Real World OCaml
are CC BY-NC-ND-licensed and have not been derivatively reused.
See LICENSES.md
at the repository root for the full source posture.