Tutorial for Module 4 (part 2)
The previous tutorial modelled a tiny AST
for OCaml itself. This one applies the same toolkit (records,
variants, recursion, option) to a very different domain: a
file system. The shape of the data is different, but the
ingredients are the same; that is the point of seeing two worked
examples in a row.
A file system is built out of two kinds of thing: files, which hold content, and directories, which hold other entries. The "other entries" can themselves be either kind, so the structure is recursive. Every file system you have used (the disk on your laptop, a tarball, a Git repository's tree) is a value of essentially this shape.
We are still on the design side of the toolkit: types and constructed values, no walks. Walking a file system (computing its total size, finding all files with a given extension, listing contents at a path) needs pattern matching, which lands in Module 5; we will return to this tree there.
What's in a file system?
Two kinds of entry:
- A file has a name, a size, and access permissions.
- A directory has a name, access permissions, and a list of entries inside it.
That is genuinely it. Everything else (a path like /home/alice,
the notion of a "root", whether two files have the same content)
is derived from this structure by code that walks it. The
declared type just captures the shape.
A picture of a small tree:
home/
+- alice/
+- notes.txt
+- photos/
| +- cat.jpg
| +- dog.jpg
+- thesis.pdf
home is a directory. Its only entry is alice, also a directory.
Inside alice there are two files and another directory. Inside
photos there are two more files. The recursion bottoms out at
the files (the leaves); the branching happens at directories.
A first cut: file vs directory
Two pieces of data live on every entry: its name and its
permissions. We will use an inline record (from
the variants lecture) inside each
variant constructor to keep the fields self-documenting.
Permissions get their own small record type:
The File constructor carries three fields; Dir carries three
as well, the last of which is a entry list. That self-reference
is what makes the structure recursive: a directory's contents are
themselves entries.
The smallest interesting value is a single file:
rw is a re-usable "read-write" permission value; notes is a
1284-byte file with those permissions. The OCaml value names the
fields by hand, which costs a few extra characters but reads like
a small data sheet.
Constructing the tree
Building up the tree from the picture, leaves first. The photos
directory holds two image files:
photos is a directory whose contents is a two-element list of
files. The variant constructor names the kind (Dir), the inline
record names the fields, and the list literal carries the children.
The alice directory holds two files and the photos directory:
The mix of files and a directory in the same list is fine: every
element has type entry, regardless of which constructor it was
built with. That is the whole point of a variant: a single type
that absorbs multiple shapes.
And the root:
The OCaml value is the tree. Each constructor names its kind; each
inline record names its fields; the recursion threads through the
contents field of every Dir. No path strings, no separators, no
parsing: just constructors and records.
Adding an owner: optional metadata
Some entries have an owner (a user that owns the file or directory);
some, for our purposes, do not (think of a public-read mount, or
an entry whose owner data is missing from a backup). The "may or
may not be present" pattern is exactly what option is for. We
add an owner field of type string option to both variants:
Two example files, one with an owner and one without:
Some "alice" says "this file is owned by alice"; None says
"no owner data attached." Note that this is genuinely "no value
attached," not "an owner who is the empty string"; the type
forces the two cases apart. That distinction (and the
recursive-types lecture's rule of thumb:
option for consumer uncertainty, not for an explicit domain
value) is what this kind of design buys you.
Design decision: lookup outcome with result
A common operation a consumer will write (in Module 5, once we
have pattern matching) is "find the entry at path /home/alice/cat.jpg."
That operation can succeed (return the entry) or fail in two
different ways: the path may not exist, or the entry may exist but
the consumer may lack read permission on it. Plain option cannot
distinguish those two outcomes; result can.
We declare a domain type for the failure reason and use result:
The signature of the future lookup function will then be:
val lookup : entry -> string list -> (entry, lookup_error) result
(string list is the path, broken into segments; the result is
either the found entry or one of the two failure reasons.)
We are not writing lookup here, but we can still construct
example values of the return type, to make sure the design fits:
Three values of the same (entry, lookup_error) result type. The
constructors carry the meaning: Ok says success and carries the
found entry; Error Not_found and Error Permission_denied
distinguish the two failure modes.
The shape of Module 4
This tutorial used the same Module 4 toolkit as the AST tutorial, applied to a very different domain:
- Records (
perms, the inline records inside each variant). - Variants with payloads (
File,Dir). - Recursive variants (
Dir.contents : entry list). - Lists (
entry listfor directory contents). optionfor optional metadata (owner : string option).resultfor a lookup outcome with two failure modes.
What we have not done is take any of these values apart.
Computing the total size of a directory, listing the names at a
path, finding all .jpg files - each is a recursive function on
the tree, and each needs pattern matching. That arrives in
Module 5, where we will pick this
exact entry tree back up.
A quick check
The entry type's Dir variant has contents : entry list.
Why does that one field make the type recursive?
- Because
Diris a variant constructor. - Because the field's type,
entry list, refers back to the typeentrybeing defined; a directory can contain other directories, which can contain more directories, and so on. - Because
listis itself a recursive type. - Because
FileandDirare mutually recursive.
Why: the recursive self-reference is entry list. That
single occurrence of entry inside its own definition is what
lets Dir hold sub-Dirs, which can hold further sub-Dirs.
Without that reference the tree could only be one level deep:
a directory of files. File and Dir are not "mutually
recursive" in the OCaml sense; they are the two cases of one
recursive type.
Why use owner : string option instead of owner : string
(using "" to mean "no owner")?
string optionsaves memory.- The two are equivalent; the choice is purely stylistic.
string optionmakes "no owner" visible in the type: every reader has to handle theNonecase explicitly. Withowner : string, the empty string is a legal value that the type system cannot distinguish from a real owner name.string optionis required by OCaml whenever a field may be missing.
Why: this is the "make illegal states unrepresentable"
slogan from the recursive-types lecture. With owner : string,
the absence of an
owner is encoded as a sentinel (the empty string), and the
compiler cannot tell that sentinel apart from a genuine owner
named ""; you would rely on every reader remembering to
check. With owner : string option, the two cases are
separate constructors (None vs Some _), and the type
system forces every consumer to handle both.
Activity: a new kind of entry
Which of these is the correct constructed value?
type entry =
| File of { name : string; size : int; perms : perms }
| Dir of { name : string; perms : perms; contents : entry list }
| Symlink of { name : string; target : string } (* new *)
Dir { name = "alice"; perms = rx; contents = [notes; Symlink { name = "latest.txt"; target = "notes.txt" }] }Dir { name = "alice"; perms = rx; contents = [notes; "latest.txt" -> "notes.txt"] }Symlink { name = "alice"; target = "notes.txt" }Dir { name = "alice"; perms = rx; contents = [File { name = "latest.txt"; target = "notes.txt" }] }
Why: the directory holds a list of entry values. The first
element is the existing file notes; the second is a fresh
Symlink { ... } value with the two required string fields.
The second option uses a notation that is not OCaml; the third
makes the directory itself a symlink; the fourth tries to give
File a target field that does not exist in the File
constructor's payload (it would not type-check).
What you should be able to do now
You have now seen the same construction toolkit applied to two domains: an AST for OCaml and a small file system (this tutorial). The shapes are very different; the ingredients are the same. That is the through-line for Module 4: variants + records + recursion is enough to describe almost any data you will encounter. Module 5 gives you the matching syntax to take these values apart.
Reading
- Cornell CS3110, Algebraic data types: https://cs3110.github.io/textbook/chapters/data/algebraic_data_types.html
- Real World OCaml, Variants: https://dev.realworldocaml.org/variants.html
Sources
This lecture's prose, worked examples, and quizzes are original to
this course. Materials referenced during preparation are listed in
the Reading section above; Cornell CS3110 and Real World OCaml
are CC BY-NC-ND-licensed and have not been derivatively reused.
See LICENSES.md
at the repository root for the full source posture.