Tutorial for Module 4 (part 1)
Module 4 introduced tuples,
records, variants,
recursive variants and polymorphism,
and option / result. This
tutorial puts those pieces to work on a worked example, building a
small algebraic data type and a handful of concrete values on it.
The example: a tiny abstract syntax tree (AST) for OCaml itself.
Every OCaml expression you have written so far (5, x + 3, if y < 0 then 0 else y, let x = 5 in x + x) is, internally, a value
of a variant type inside the OCaml compiler. The compiler reads
source text, parses it into a tree of constructors, and then every
later stage (type checking, optimisation, code generation) is
just recursive functions on that tree. For this lecture we are
going to be a tiny version of that compiler: we will design the
tree.
We are deliberately staying on the design side: type definitions
and constructed values, no tree walks. Walking an AST (writing an
evaluator, a type checker, or a pretty printer) needs
pattern matching, which gets its full treatment in
Module 5. We will return to this
exact expr type there and write eval, vars_used, and a
pretty printer on it.
What is an AST?
An abstract syntax tree is the tree-shaped data representation of a program. "Abstract" because it throws away surface details that do not matter for meaning (whitespace, comments, parentheses that the parser already resolved); "syntax" because it captures the structure of how the program is written; "tree" because nested expressions become nested constructors.
Concretely: the expression 5 + 3 is, as text, a five-character
string. As an AST, it is a tree with + at the root and two leaves
5 and 3. The expression (5 + 3) * 2 is a tree with * at the
root, one leaf 2, and one subtree (the + node) as the other
child. Parentheses do not appear in the tree: the structure already
encodes which operator binds first.
A first cut: literals and arithmetic
Let's start with the smallest useful AST: integer literals and addition.
Two constructors. Int carries an OCaml int; Add carries
two sub-expressions. That second one is the key piece: the
recursion is what lets us build arbitrarily nested arithmetic.
A few example trees:
e1 is the literal 5; e2 is 5 + 3; e3 is (1 + 2) + 4.
Notice how e3's left child is itself an Add node, exactly
mirroring the parenthesised source.
Adding more operators
Real programs use more than +. The simplest extension is one
constructor per operator:
Now we can build (5 + 3) * 2:
The shape of the OCaml value mirrors the shape of the tree we drew
earlier. Mul at the root, Add (Int 5, Int 3) on the left,
Int 2 on the right.
Adding variables and let ... in
So far our ASTs are closed: every leaf is a literal. To talk
about names we need a Var constructor (a reference to a bound
name) and a Let_in constructor (a binding form):
Var carries just a string (the name being referenced). Let_in
carries three pieces: the name being bound, the binding
expression whose value gets bound to that name, and the body
in which the name is in scope. That is the same let name = e1 in e2 shape from the let-bindings lecture, now expressed as data.
The AST for let x = 5 in x + 3:
The bound name is "x", the binding is Int 5, the body is
Add (Var "x", Int 3). Notice how the body references the bound
name as Var "x", not as a special token: the AST treats names
as ordinary strings, and the compiler's later stages decide what
each Var resolves to.
Nested let bindings become nested Let_in constructors. The AST
for let x = 5 in let y = 10 in x + y:
The body of the outer Let_in is another Let_in; that one's
body is the addition. Exactly the right-associative nesting of
source let chains we saw in the let-bindings lecture.
Adding if, booleans, and comparison
Conditionals need a boolean kind of value and a way to compare. We
add Bool, two comparison constructors (Lt and Eq), and an
If constructor with three sub-expressions:
If (cond, then_branch, else_branch) carries three children: the
test, the value when the test is true, and the value when the test
is false. That matches the three slots in OCaml's
if ... then ... else ... expression from the if-expressions
lecture.
The AST for if x < 0 then 0 else x:
The condition is Lt (Var "x", Int 0), a comparison subtree; the
then-branch is Int 0; the else-branch is Var "x". Each branch
is just another expr, so they can themselves be arbitrarily
nested.
Design decision: per-operator vs Binop
We chose one constructor per operator above: Add, Sub,
Mul, Lt, Eq. The alternative is a single Binop
constructor that carries an operator tag plus two sub-expressions.
The two designs:
Trade-offs:
- The per-operator type is more explicit. Each operator gets a name; the type system makes you spell it out at construction.
- The
Binoptype is more compact. Five constructors collapse into one. Consumers of the AST (the eval function, the pretty printer) match one case and dispatch onopinternally. - The per-operator type scales worse: adding a new operator is
a new constructor and a new pattern-match arm everywhere. The
Binoptype adds a new operator as just a newopvariant. - The
Binoptype groups related shapes: if every binary operator has the sameexpr * exprpayload, it captures that regularity in one place.
Real OCaml compilers use the Binop style, with the operator (or
"primitive") tag as a richer variant. For our tiny AST either
choice is reasonable; we will keep the per-operator design for the
rest of this lecture because it makes each example tree easier to
read at first glance.
Design decision: optional type annotations
OCaml let lets you annotate the binding with a type:
let x : int = 5 in .... The annotation is optional: when
absent, the type checker infers; when present, it must agree with
inference.
That "may or may not be there" is exactly what option is for. We
add a ty type for the small set of types our AST knows about, and
weave a ty option into the Let_in payload:
Two example trees, one with and one without an annotation:
The option cleanly captures "the source either wrote a type or
did not." This is exactly the kind of consumer uncertainty the
recursive-types lecture's rule of thumb
covers: a later stage (the type checker) has to decide what to do
when the annotation is missing vs present.
The shape of Module 4
This tutorial used every Module 4 idea:
- Variants with payloads (every
exprconstructor). - Recursive variants (sub-expressions are themselves
exprs). - Tuples in constructor payloads (
expr * exprfor binary ops). optionfor the optional type annotation.- Type abbreviations would fit naturally for an
id = stringalias, if we wanted to makeVarself-documenting.
The next tutorial (M04-L06) builds a second worked example on a different domain (a tiny file system), so you see the same toolkit applied to data with a very different shape.
What we have not done yet is take any of these trees apart. To
evaluate e_clamp, to substitute a value for Var, or to
pretty-print an expr back to source: each of those is a recursive
function that needs pattern matching. That is the whole of
Module 5. We will return to this
exact AST there.
A quick check
Given this AST type, which value represents (5 - 3) * 2?
type expr =
| Int of int
| Add of expr * expr
| Sub of expr * expr
| Mul of expr * expr
Mul (Sub (5, 3), 2)Mul (Sub (Int 5, Int 3), Int 2)Sub (Mul (Int 5, Int 3), Int 2)Mul (Int 5, Sub (Int 3, Int 2))
Why: every leaf has to be wrapped in Int, because the
payloads of Sub and Mul are expr * expr, not int * int.
The outer operator is the last one applied; here that is the
*, so the root is Mul. The left child is Sub (Int 5, Int 3) (the parenthesised part), and the right child is Int 2.
The third option puts the Sub at the root, which would
represent (5 * 3) - 2.
Why does the AST encode the source let x = 5 in x + 3 as
Let_in ("x", Int 5, Add (Var "x", Int 3))
rather than as
Let_in (Var "x", Int 5, Add (Var "x", Int 3))
?
- Both work; the choice is purely stylistic.
Let_in's first slot is the bound name (astring), not a referenced variable.Var "x"would be type-wrong because that slot expectsstring, notexpr.Varis only allowed insideAdd, never insideLet_in.- The first slot of
Let_inis the value, not the name.
Why: the Let_in constructor has payload string * expr * expr: the name being bound (a plain string from the source),
the expression whose value is bound to it, and the body. The
body may mention the name via Var "x" because there x is
being read; in the first slot the name is being introduced,
so it is just a string. Putting Var "x" in the first slot
would be a type error.
Activity: extending the type
Which is the correct AST for let x = 10 in print x?
type expr =
| Int of int
| Bool of bool
| Add of expr * expr
| Var of string
| Let_in of string * expr * expr
| Print of expr (* new *)
Let_in ("x", Int 10, Print (Var "x"))Let_in ("x", Print (Int 10), Var "x")Print (Let_in ("x", Int 10, Var "x"))Let_in (Print "x", Int 10, Var "x")
Why: Let_in is (name, bound_expr, body). The bound name is
"x", the bound expression is Int 10, and the body is what
print x parses to, namely Print (Var "x"). The third option
prints the result of the whole let; the second prints the bound
value before the binding even happens; the fourth puts a Print
in the name slot, which is type-wrong because the slot expects
a string, not an expr.
What's next
You have now seen the full Module 4 toolkit (variants, tuples,
recursion, option) applied to an AST. The next
tutorial (M04-L06) reapplies the same
toolkit to a tiny file system, so you see how the same ingredients
fit a very different domain. Walking either tree (an AST evaluator,
a file-system du) is the job of pattern matching, which arrives
in Module 5.
Reading
- Cornell CS3110, Algebraic data types: https://cs3110.github.io/textbook/chapters/data/algebraic_data_types.html
- Real World OCaml, Variants: https://dev.realworldocaml.org/variants.html
Sources
This lecture's prose, worked examples, and quizzes are original to
this course. Materials referenced during preparation are listed in
the Reading section above; Cornell CS3110 and Real World OCaml
are CC BY-NC-ND-licensed and have not been derivatively reused.
See LICENSES.md
at the repository root for the full source posture.