Tutorial: fold across data structures

Functional Programming with OCaml

Tutorial: fold across data structures

Module 6 · Lecture 6

KC Sivaramakrishnan
IIT Madras

This lecture is the worked-exercise capstone of Module 6. Across the previous five lectures we built up a small but powerful toolkit: higher-order functions, map, filter, fold_left and fold_right, the pipeline operator |>, and function composition. The thesis of this module has been that this small toolkit is enough to express a surprising amount of list processing without writing a single hand-coded recursion. In this tutorial we try to make good on that claim: pick a list function from the standard library, then build it from the toolkit.

The point of the exercise is not that you should always re-derive standard library functions in real code; you should not. Use List.length rather than List.fold_left (fun n _ -> n + 1) 0, because the standard library version expresses intent more clearly and is usually faster. The point is to see how versatile the toolkit is: how few primitives you need before the rest follows. By the end of this lecture, if I asked you to write a new list-flavoured function on the spot, you should reach for map, filter, or fold first.

We will work through seven problems: four rebuild small List functions on top of fold, and three lift fold to other data structures (binary tree pre/in/post/level order, and rose trees).

This tutorial: fold across data structures

Problem 1: sum and product

let sum xs = List.fold_left (+) 0 xs let _ = sum [1; 2; 3; 4; 5] (* = 15 *)

Problem 1: sum and product

let sum xs = List.fold_left (+) 0 xs let _ = sum [1; 2; 3; 4; 5] (* = 15 *)

Same shape for product:

let product xs = List.fold_left ( * ) 1 xs let _ = product [1; 2; 3; 4; 5] (* = 120 *)

sum is the canonical fold: pass the operator (+) and the identity element for the operator (0). For product, the operator is ( * ) and the identity is 1. The general pattern: any associative binary operator with an identity element gives you a one-line fold-based aggregation.

(Why does the identity element matter? Because of the empty-list case. fold_left (+) 0 [] returns 0; fold_left ( * ) 1 [] returns 1. Picking the identity makes those answers consistent with the mathematical convention that an empty sum is zero and an empty product is one.)

Problem 2: concat

Flatten a list of lists into a single list. (Also called flatten in some libraries.)

let concat xss = List.fold_right (fun xs acc -> xs @ acc) xss [] let _ = concat [[1; 2]; [3; 4; 5]; [6]] (* = [1; 2; 3; 4; 5; 6] *)

Problem 2: concat

Flatten a list of lists into a single list.

let concat xss = List.fold_right (fun xs acc -> xs @ acc) xss [] let _ = concat [[1; 2]; [3; 4; 5]; [6]] (* = [1; 2; 3; 4; 5; 6] *)

The operator @ is list-append. Each inner list is appended to the accumulator; the rightmost inner list ends up at the back, the leftmost at the front. Note the use of fold_right rather than fold_left: with fold_left, the accumulator would build up backwards because we would append the first inner list last. (Try it on paper if it is not obvious why.) fold_right walks right-to-left, so the leftmost inner list is the last one prepended, and ends up at the front of the result.

The standard library's List.concat is similar but optimised for the common case.

Problem 3: for_all and exists

Two list predicates.

let for_all p xs = List.fold_left (fun acc x -> acc && p x) true xs let exists p xs = List.fold_left (fun acc x -> acc || p x) false xs let _ = for_all (fun n -> n > 0) [1; 2; 3] (* = true *) let _ = for_all (fun n -> n > 0) [1; -2; 3] (* = false *) let _ = exists (fun n -> n < 0) [1; -2; 3] (* = true *)

Problem 3: for_all and exists

let for_all p xs = List.fold_left (fun acc x -> acc && p x) true xs let exists p xs = List.fold_left (fun acc x -> acc || p x) false xs let _ = for_all (fun n -> n > 0) [1; 2; 3] (* = true *) let _ = for_all (fun n -> n > 0) [1; -2; 3] (* = false *) let _ = exists (fun n -> n < 0) [1; -2; 3] (* = true *)

for_all p xs is true if every element satisfies p; exists p xs is true if at least one does. The two fold-based implementations use && and || respectively, with the appropriate identity (true for AND, false for OR).

A subtlety: these implementations do not short-circuit. The fold visits every element of the list, even if the answer is already determined. The standard library's List.for_all and List.exists are written directly and do short-circuit (returning false as soon as an element fails for_all, returning true as soon as one succeeds in exists). For long lists where failure or success arrives early, the standard library is faster. Another reason to prefer the library version over the home-rolled fold one in real code.

Problem 4: count

Count how many elements satisfy a predicate.

let count p xs = List.fold_left (fun n x -> if p x then n + 1 else n) 0 xs let _ = count (fun n -> n > 0) [-1; 5; -3; 8; 0; 2] (* = 3 *)

Problem 4: count

How many elements satisfy a predicate?

let count p xs = List.fold_left (fun n x -> if p x then n + 1 else n) 0 xs let _ = count (fun n -> n > 0) [-1; 5; -3; 8; 0; 2] (* = 3 *)

The result is 3 (the three strictly positive elements: 5, 8, 2). We could have written this as List.length (List.filter p xs): filter to keep the passing elements, then count. The two-step version is arguably clearer; the fold version makes one pass and never allocates the intermediate list. For short lists this does not matter; for long ones the fold version saves both time and garbage.

Problem 5: tree folds in three orderings

Everything so far has been a fold on a list. The fold pattern generalises directly to any recursive data type. A binary tree:

type 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

Three natural traversal orderings, all the same recursion with f acc v permuted:

let rec fold_inorder f acc = function | Leaf -> acc | Node (l, v, r) -> let acc = fold_inorder f acc l in let acc = f acc v in fold_inorder f acc r let rec fold_preorder f acc = function | Leaf -> acc | Node (l, v, r) -> let acc = f acc v in let acc = fold_preorder f acc l in fold_preorder f acc r let rec fold_postorder f acc = function | Leaf -> acc | Node (l, v, r) -> let acc = fold_postorder f acc l in let acc = fold_postorder f acc r in f acc v

The only thing that changes is where the visit f acc v sits relative to the two recursive calls. In-order: between the children. Pre-order: before. Post-order: after.

Problem 5a: in-order

type 'a tree = Leaf | Node of 'a tree * 'a * 'a tree let rec fold_inorder f acc = function | Leaf -> acc | Node (l, v, r) -> let acc = fold_inorder f acc l in let acc = f acc v in fold_inorder f acc r

Problem 5b: pre-order

let rec fold_preorder f acc = function | Leaf -> acc | Node (l, v, r) -> let acc = f acc v in let acc = fold_preorder f acc l in fold_preorder f acc r

Problem 5c: post-order

let rec fold_postorder f acc = function | Leaf -> acc | Node (l, v, r) -> let acc = fold_postorder f acc l in let acc = fold_postorder f acc r in f acc v

To see the difference, try them on a small binary search tree:

let t = Node ( Node (Node (Leaf, 1, Leaf), 2, Node (Leaf, 3, Leaf)), 4, Node (Node (Leaf, 5, Leaf), 6, Node (Leaf, 7, Leaf))) let collect = fun acc x -> acc @ [x] let _ = fold_inorder collect [] t (* = [1; 2; 3; 4; 5; 6; 7] *) let _ = fold_preorder collect [] t (* = [4; 2; 1; 3; 6; 5; 7] *) let _ = fold_postorder collect [] t (* = [1; 3; 2; 5; 7; 6; 4] *)

In-order on a BST returns elements in sorted order: that is the defining property of a BST. Pre-order lists every node before its children, which mirrors how the tree is built. Post-order lists every node after its children, which is what you want when each node's value depends on its subtrees being processed first (think "evaluate the leaves before the internal nodes").

Three orderings on one tree

The tree:

       4
      / \
     2   6
    / \ / \
   1  3 5  7
let t = Node (Node (Node (Leaf, 1, Leaf), 2, Node (Leaf, 3, Leaf)), 4, Node (Node (Leaf, 5, Leaf), 6, Node (Leaf, 7, Leaf))) let collect acc x = acc @ [x] let _ = fold_inorder collect [] t (* = [1; 2; 3; 4; 5; 6; 7] *) let _ = fold_preorder collect [] t (* = [4; 2; 1; 3; 6; 5; 7] *) let _ = fold_postorder collect [] t (* = [1; 3; 2; 5; 7; 6; 4] *)

Problem 6: level-order with a queue

Pre/in/post are easy because they follow the tree's own recursive shape. Level-order (or breadth-first) traversal is different: it visits all nodes at depth 0, then all at depth 1, then 2, and so on. The recursive shape of the tree does not give you that order for free, because each recursive call dives straight to a child before visiting the sibling.

The standard trick is an explicit queue of pending nodes:

let levelorder_fold f acc tree = let rec go acc = function | [] -> acc | Leaf :: rest -> go acc rest | Node (l, v, r) :: rest -> go (f acc v) (rest @ [l; r]) in go acc [tree] let _ = levelorder_fold collect [] t (* = [4; 2; 6; 1; 3; 5; 7] *)

The inner function go takes a list-as-queue. At each step, take the head; if it's a Leaf, drop it and continue; if it's a Node, visit v and append the two children to the back of the queue with @ [l; r]. Appending to the back is what makes the traversal breadth-first: a node's children are queued behind all the nodes already pending at the current depth, so the current level finishes before the next one starts.

Problem 6: level-order with a queue

let levelorder_fold f acc tree = let rec go acc = function | [] -> acc | Leaf :: rest -> go acc rest | Node (l, v, r) :: rest -> go (f acc v) (rest @ [l; r]) in go acc [tree] let _ = levelorder_fold collect [] t (* = [4; 2; 6; 1; 3; 5; 7] *)

Problem 7: fold over rose trees

A rose tree is an n-ary tree: each node carries a value and a list of children, instead of exactly two:

type 'a rose = Rose of 'a * 'a rose list

A fold over a rose tree combines tree recursion with list recursion: visit this node's value, then List.fold_left over the children, recursively folding each one:

let rec fold_rose f acc (Rose (v, children)) = let acc = f acc v in List.fold_left (fold_rose f) acc children

The body has two folds nested in each other: an outer "fold this node" (pre-order on the rose tree) and an inner List.fold_left across the children's results. This is the pattern that scales fold to almost any algebraic data type: one fold per recursive position.

A small example: a threaded discussion. Each comment carries text (or here, a label) and a list of direct replies; each reply is itself a comment that may have its own replies.

let thread = Rose ("Original post", [ Rose ("Reply A", [ Rose ("Reply A.1", []); Rose ("Reply A.2", []) ]); Rose ("Reply B", [ Rose ("Reply B.1", []) ]); Rose ("Reply C", []) ]) let _ = fold_rose (fun acc s -> acc @ [s]) [] thread (* = ["Original post"; "Reply A"; ...; "Reply C"] *)

That is the pre-order walk of the discussion: each comment listed before its replies, replies listed before their sub-replies. This is the same order most threaded forums use to render comments top-to-bottom on a page.

Problem 7a: fold over rose trees

type 'a rose = Rose of 'a * 'a rose list let rec fold_rose f acc (Rose (v, children)) = let acc = f acc v in List.fold_left (fold_rose f) acc children

Problem 7b: walking a threaded discussion

let thread = Rose ("Post", [ Rose ("A", [Rose ("A.1", [])]); Rose ("B", []) ]) let _ = fold_rose (fun acc s -> acc @ [s]) [] thread (* = ["Post"; "A"; "A.1"; "B"] *)
  • Pre-order: each comment is listed before its replies.
  • That's how forums render threads top-to-bottom.
  • Same fold engine; only the combining function and accumulator change to match the question being asked.
+ Post
  + A
    + A.1
  + B

A wider example: word frequencies

Let us combine pieces from across the module into a slightly larger example. We count how often each word appears in a piece of text. We will return the answer as an association list (a list of pairs); in Module 7 we will see proper hash tables and balanced maps.

let word_counts text = text |> String.lowercase_ascii |> String.split_on_char ' ' |> List.filter (fun s -> s <> "") |> List.fold_left (fun counts w -> let n = match List.assoc_opt w counts with | Some n -> n | None -> 0 in (w, n + 1) :: List.remove_assoc w counts ) [] let _ = word_counts "the quick brown fox jumps over the lazy dog the fox" (* = [("fox", 2); ("the", 3); ...; ("quick", 1)] *)

The result is [("fox", 2); ("the", 3); ("dog", 1); ("lazy", 1); ("over", 1); ("jumps", 1); ("brown", 1); ("quick", 1)]: each fold step prepends the freshly-bumped pair, so the words appear in most-recently-touched-first order.

The pipeline reads top-to-bottom: lowercase, split into words, drop empty pieces, then fold to build up a frequency table. The fold's accumulator is an association list of word/count pairs; for each word, we look up its current count (or 0 if absent), remove the old entry, and prepend a new one with the bumped count.

This is the kind of code that, in Python or Java, would take a small loop with a hash table. In OCaml with the higher-order toolkit, it is a single pipeline. The trade-off is that this implementation is O(n^2) in the number of distinct words (each List.assoc and List.remove_assoc is linear); the proper solution uses Map or Hashtbl, which we will meet in Module 7. For now, the point is that the shape of the computation is captured cleanly.

A wider example: word frequencies

let word_counts text = text |> String.lowercase_ascii |> String.split_on_char ' ' |> List.filter (fun s -> s <> "") |> List.fold_left (fun counts w -> let n = match List.assoc_opt w counts with | Some n -> n | None -> 0 in (w, n + 1) :: List.remove_assoc w counts ) []

assoc_opt is not magic

let rec assoc_opt k = function | [] -> None | (k', v) :: rest -> if k = k' then Some v else assoc_opt k rest let _ = assoc_opt "fox" [("the", 3); ("fox", 2)] (* = Some 2 *)

word_counts: a worked example

let _ = word_counts "the quick brown fox jumps over the lazy dog the fox" (* = [("fox", 2); ("the", 3); ("dog", 1); ("lazy", 1); ("over", 1); ("jumps", 1); ("brown", 1); ("quick", 1)] *)

A quick check

Which of the following is not a fold doing constant work per element as it walks the list once?

Why: length, filter, and map are all linear walks of the list, combining each element into the accumulator with a constant-time step. They are textbook folds. Sorting is the odd one out, with a qualifier worth knowing: you can write insertion sort as a fold (fold_right insert xs []), but the combining step (insert) is itself a linear walk of the accumulator, so the whole thing is O(n^2). What no fold gives you is List.sort's O(n log n) merge sort: that algorithm repeatedly compares elements far apart, which does not fit the one-pass, combine-as-you-go shape.

List.fold_left (fun acc x -> x :: acc) [] [1; 2; 3] is...

Why: fold_left walks left to right. Initial acc = []. After 1: [1]. After 2: [2; 1]. After 3: [3; 2; 1]. So this is the classic one-line List.rev. To get back the original order, use fold_right (which walks right-to-left).

A code challenge:

Write longest : string list -> string option that returns the longest string in a list, or None for an empty list. If several strings share the greatest length, return the first of them. Use List.fold_left with String.length. (Hint: the accumulator is a string option.)

let longest xs = failwith "not implemented"
Show reference solution

Reference solution:

let longest xs =
  List.fold_left
    (fun acc s ->
      match acc with
      | None -> Some s
      | Some t ->
          if String.length s > String.length t then Some s else acc)
    None xs

The accumulator is a string option, starting at None. For each string: if the accumulator is None, take this string as the current best. Otherwise keep whichever is longer; the strict > means a tie keeps the earlier string. The result is None exactly when the list was empty.

Activity

Activity

Write maximum : 'a list -> 'a option returning the largest element of a list, or None if empty. Use List.fold_left.

Show reference solution

Activity solution

let maximum xs = List.fold_left (fun acc x -> match acc with | None -> Some x | Some m -> Some (max m x)) None xs let _ = maximum [3; 7; 1; 9; 5] (* = Some 9 *) let _ = maximum ([] : int list) (* = None *)
  • Accumulator is an option ('a option; an int option in the calls above).
  • Starts at None (no element seen yet).
  • For each element: if None, take this element; otherwise keep the larger.
  • ([] : int list) annotation is needed: [] alone is polymorphic and OCaml needs to pick a type.

What you should be able to do now

By the end of Module 6 you should be able to:

What you should be able to do now

After Module 6 you can:

What's coming up:

What's next

Module 7 is a turn back toward the imperative side of OCaml: side effects (mutable references, exceptions, Printf), and modules (the OCaml language feature for organising code into named, type-bearing units). Higher-order functions remain in play throughout; we will see them again in Module 7 in the form of references that hold functions and in Module 8 in the form of monads, where the whole programming pattern is built on higher-order composition.

Reading

Sources

This lecture's prose, worked examples, and quizzes are original to this course. Materials referenced during preparation are listed in the Reading section above; Cornell CS3110 and Real World OCaml are CC BY-NC-ND-licensed and have not been derivatively reused. See LICENSES.md at the repository root for the full source posture.