Tutorial for Module 4 (part 1)

Functional Programming with OCaml

Tutorial: a tiny AST for OCaml

Module 4 · Lecture 5

KC Sivaramakrishnan
IIT Madras

Module 4 introduced tuples, records, variants, recursive variants and polymorphism, and option / result. This tutorial puts those pieces to work on a worked example, building a small algebraic data type and a handful of concrete values on it.

The example: a tiny abstract syntax tree (AST) for OCaml itself. Every OCaml expression you have written so far (5, x + 3, if y < 0 then 0 else y, let x = 5 in x + x) is, internally, a value of a variant type inside the OCaml compiler. The compiler reads source text, parses it into a tree of constructors, and then every later stage (type checking, optimisation, code generation) is just recursive functions on that tree. For this lecture we are going to be a tiny version of that compiler: we will design the tree.

We are deliberately staying on the design side: type definitions and constructed values, no tree walks. Walking an AST (writing an evaluator, a type checker, or a pretty printer) needs pattern matching, which gets its full treatment in Module 5. We will return to this exact expr type there and write eval, vars_used, and a pretty printer on it.

What is an AST?

An abstract syntax tree is the tree-shaped data representation of a program. "Abstract" because it throws away surface details that do not matter for meaning (whitespace, comments, parentheses that the parser already resolved); "syntax" because it captures the structure of how the program is written; "tree" because nested expressions become nested constructors.

Concretely: the expression 5 + 3 is, as text, a five-character string. As an AST, it is a tree with + at the root and two leaves 5 and 3. The expression (5 + 3) * 2 is a tree with * at the root, one leaf 2, and one subtree (the + node) as the other child. Parentheses do not appear in the tree: the structure already encodes which operator binds first.

What is an AST?

Source vs tree: 5 + 3

Source: 5 + 3

Tree:

   +
  / \
 5   3

Source vs tree: (5 + 3) * 2

Source: (5 + 3) * 2

Tree:

       *
      / \
     +   2
    / \
   5   3

A first cut: literals and arithmetic

Let's start with the smallest useful AST: integer literals and addition.

type expr = | Int of int | Add of expr * expr

Two constructors. Int carries an OCaml int; Add carries two sub-expressions. That second one is the key piece: the recursion is what lets us build arbitrarily nested arithmetic.

A few example trees:

let e1 = Int 5 let e2 = Add (Int 5, Int 3) let e3 = Add (Add (Int 1, Int 2), Int 4)

e1 is the literal 5; e2 is 5 + 3; e3 is (1 + 2) + 4. Notice how e3's left child is itself an Add node, exactly mirroring the parenthesised source.

A first cut

type expr = | Int of int | Add of expr * expr let e1 = Int 5 (* 5 *) let e2 = Add (Int 5, Int 3) (* 5 + 3 *) let e3 = Add (Add (Int 1, Int 2), Int 4) (* (1+2)+4 *)

Adding more operators

Real programs use more than +. The simplest extension is one constructor per operator:

type expr = | Int of int | Add of expr * expr | Sub of expr * expr | Mul of expr * expr

Now we can build (5 + 3) * 2:

let e_a = Mul (Add (Int 5, Int 3), Int 2)

The shape of the OCaml value mirrors the shape of the tree we drew earlier. Mul at the root, Add (Int 5, Int 3) on the left, Int 2 on the right.

More arithmetic

type expr = | Int of int | Add of expr * expr | Sub of expr * expr | Mul of expr * expr (* (5 + 3) * 2 *) let e_a = Mul (Add (Int 5, Int 3), Int 2)

Adding variables and let ... in

So far our ASTs are closed: every leaf is a literal. To talk about names we need a Var constructor (a reference to a bound name) and a Let_in constructor (a binding form):

type expr = | Int of int | Add of expr * expr | Sub of expr * expr | Mul of expr * expr | Var of string | Let_in of string * expr * expr

Var carries just a string (the name being referenced). Let_in carries three pieces: the name being bound, the binding expression whose value gets bound to that name, and the body in which the name is in scope. That is the same let name = e1 in e2 shape from the let-bindings lecture, now expressed as data.

The AST for let x = 5 in x + 3:

let e_let = Let_in ("x", Int 5, Add (Var "x", Int 3))

The bound name is "x", the binding is Int 5, the body is Add (Var "x", Int 3). Notice how the body references the bound name as Var "x", not as a special token: the AST treats names as ordinary strings, and the compiler's later stages decide what each Var resolves to.

Nested let bindings become nested Let_in constructors. The AST for let x = 5 in let y = 10 in x + y:

let e_let_nested = Let_in ("x", Int 5, Let_in ("y", Int 10, Add (Var "x", Var "y")))

The body of the outer Let_in is another Let_in; that one's body is the addition. Exactly the right-associative nesting of source let chains we saw in the let-bindings lecture.

Variables and let ... in

type expr = | Int of int | Add of expr * expr | Sub of expr * expr | Mul of expr * expr | Var of string | Let_in of string * expr * expr (* let x = 5 in x + 3 *) let e_let = Let_in ("x", Int 5, Add (Var "x", Int 3))

Nested let ... in

type expr = | Int of int | Add of expr * expr | Sub of expr * expr | Mul of expr * expr | Var of string | Let_in of string * expr * expr (* let x = 5 in let y = 10 in x + y *) let e_let_nested = Let_in ("x", Int 5, Let_in ("y", Int 10, Add (Var "x", Var "y")))

Adding if, booleans, and comparison

Conditionals need a boolean kind of value and a way to compare. We add Bool, two comparison constructors (Lt and Eq), and an If constructor with three sub-expressions:

type expr = | Int of int | Bool of bool | Add of expr * expr | Sub of expr * expr | Mul of expr * expr | Lt of expr * expr | Eq of expr * expr | If of expr * expr * expr | Var of string | Let_in of string * expr * expr

If (cond, then_branch, else_branch) carries three children: the test, the value when the test is true, and the value when the test is false. That matches the three slots in OCaml's if ... then ... else ... expression from the if-expressions lecture.

The AST for if x < 0 then 0 else x:

let e_clamp = If (Lt (Var "x", Int 0), Int 0, Var "x")

The condition is Lt (Var "x", Int 0), a comparison subtree; the then-branch is Int 0; the else-branch is Var "x". Each branch is just another expr, so they can themselves be arbitrarily nested.

if, booleans, comparison

type expr = | Int of int | Bool of bool | Add of expr * expr | Sub of expr * expr | Mul of expr * expr | Lt of expr * expr | Eq of expr * expr | If of expr * expr * expr | Var of string | Let_in of string * expr * expr

Building if x < 0 then 0 else x

type expr = | Int of int | Bool of bool | Add of expr * expr | Sub of expr * expr | Mul of expr * expr | Lt of expr * expr | Eq of expr * expr | If of expr * expr * expr | Var of string | Let_in of string * expr * expr let e_clamp = If (Lt (Var "x", Int 0), Int 0, Var "x")

Design decision: per-operator vs Binop

We chose one constructor per operator above: Add, Sub, Mul, Lt, Eq. The alternative is a single Binop constructor that carries an operator tag plus two sub-expressions. The two designs:

(* Per-operator. *) type expr = | Int of int | Add of expr * expr | Sub of expr * expr | Mul of expr * expr | Lt of expr * expr | Eq of expr * expr (* Single Binop, with an operator tag. *) type op = | Plus | Minus | Times | Less | Equal type expr = | Int of int | Binop of op * expr * expr

Trade-offs:

Real OCaml compilers use the Binop style, with the operator (or "primitive") tag as a richer variant. For our tiny AST either choice is reasonable; we will keep the per-operator design for the rest of this lecture because it makes each example tree easier to read at first glance.

Design decision: per-op vs Binop

Two ways to represent binary operators:

(* Per-operator. *) type expr = | Int of int | Add of expr * expr | Sub of expr * expr | Mul of expr * expr | Lt of expr * expr | Eq of expr * expr (* Binop with an operator tag. *) type op = Plus | Minus | Times | Less | Equal type expr = | Int of int | Binop of op * expr * expr

Design decision: optional type annotations

OCaml let lets you annotate the binding with a type: let x : int = 5 in .... The annotation is optional: when absent, the type checker infers; when present, it must agree with inference.

That "may or may not be there" is exactly what option is for. We add a ty type for the small set of types our AST knows about, and weave a ty option into the Let_in payload:

type ty = | T_int | T_bool type expr = | Int of int | Bool of bool | Add of expr * expr | Var of string | Let_in of string * ty option * expr * expr

Two example trees, one with and one without an annotation:

(* let x = 5 in x + 3 *) let e_unannotated = Let_in ("x", None, Int 5, Add (Var "x", Int 3)) (* let x : int = 5 in x + 3 *) let e_annotated = Let_in ("x", Some T_int, Int 5, Add (Var "x", Int 3))

The option cleanly captures "the source either wrote a type or did not." This is exactly the kind of consumer uncertainty the recursive-types lecture's rule of thumb covers: a later stage (the type checker) has to decide what to do when the annotation is missing vs present.

Design decision: optional annotations

OCaml: let x = 5 in ... or let x : int = 5 in ....

The annotation is optional - perfect fit for option:

type ty = T_int | T_bool type expr = | Int of int | Bool of bool | Add of expr * expr | Var of string | Let_in of string * ty option * expr * expr (* let x = 5 in x + 3 *) let e_u = Let_in ("x", None, Int 5, Add (Var "x", Int 3)) (* let x : int = 5 in x + 3 *) let e_a = Let_in ("x", Some T_int, Int 5, Add (Var "x", Int 3))

The shape of Module 4

This tutorial used every Module 4 idea:

The next tutorial (M04-L06) builds a second worked example on a different domain (a tiny file system), so you see the same toolkit applied to data with a very different shape.

What we have not done yet is take any of these trees apart. To evaluate e_clamp, to substitute a value for Var, or to pretty-print an expr back to source: each of those is a recursive function that needs pattern matching. That is the whole of Module 5. We will return to this exact AST there.

The shape of Module 4

This tutorial used:

Next tutorial (M04-L06) reapplies the toolkit to a file system. Walking ASTs (evaluator, pretty-printer) is Module 5.

A quick check

Given this AST type, which value represents (5 - 3) * 2?

type expr =
  | Int of int
  | Add of expr * expr
  | Sub of expr * expr
  | Mul of expr * expr

Why: every leaf has to be wrapped in Int, because the payloads of Sub and Mul are expr * expr, not int * int. The outer operator is the last one applied; here that is the *, so the root is Mul. The left child is Sub (Int 5, Int 3) (the parenthesised part), and the right child is Int 2. The third option puts the Sub at the root, which would represent (5 * 3) - 2.

Why does the AST encode the source let x = 5 in x + 3 as

Let_in ("x", Int 5, Add (Var "x", Int 3))

rather than as

Let_in (Var "x", Int 5, Add (Var "x", Int 3))

?

Why: the Let_in constructor has payload string * expr * expr: the name being bound (a plain string from the source), the expression whose value is bound to it, and the body. The body may mention the name via Var "x" because there x is being read; in the first slot the name is being introduced, so it is just a string. Putting Var "x" in the first slot would be a type error.

Activity: extending the type

Activity

Extend expr with a new constructor Print of expr that represents printing the value of a sub-expression. Then build the AST for the program:

let x = 10 in print x

Treat print x as Print (Var "x") (the body of the let).

Which is the correct AST for let x = 10 in print x?

type expr =
  | Int of int
  | Bool of bool
  | Add of expr * expr
  | Var of string
  | Let_in of string * expr * expr
  | Print of expr             (* new *)

Why: Let_in is (name, bound_expr, body). The bound name is "x", the bound expression is Int 10, and the body is what print x parses to, namely Print (Var "x"). The third option prints the result of the whole let; the second prints the bound value before the binding even happens; the fourth puts a Print in the name slot, which is type-wrong because the slot expects a string, not an expr.

Activity discussion

Add one constructor:

type expr =
  ...
  | Print of expr            (* new *)

Build:

type expr = | Int of int | Var of string | Let_in of string * expr * expr | Print of expr (* new *) let prog = Let_in ("x", Int 10, Print (Var "x"))

What's next

What's next

You have now seen the full Module 4 toolkit applied to an AST.

You have now seen the full Module 4 toolkit (variants, tuples, recursion, option) applied to an AST. The next tutorial (M04-L06) reapplies the same toolkit to a tiny file system, so you see how the same ingredients fit a very different domain. Walking either tree (an AST evaluator, a file-system du) is the job of pattern matching, which arrives in Module 5.

Reading

Sources

This lecture's prose, worked examples, and quizzes are original to this course. Materials referenced during preparation are listed in the Reading section above; Cornell CS3110 and Real World OCaml are CC BY-NC-ND-licensed and have not been derivatively reused. See LICENSES.md at the repository root for the full source posture.