Model-based testing of stateful data structures
Lecture 5 developed
property-based testing on pure functions: a list reversal, a
sort. The properties were statements
about a single function call: "for every input xs, rev (rev xs) = xs." No state. No mutation. No sequence of calls. Just
input goes in, output comes out, property holds.
Most of real software is not like that. A queue is stateful:
what dequeue returns depends on every enqueue and dequeue
that came before it. A hash table is stateful: the result of
find depends on every previous add and remove. A file
handle is stateful. A database is stateful. The entire OS is
stateful. And the property "for every input, the output is
correct" no longer obviously applies, because there is no
single input; there is a history.
This lecture shows how to extend PBT to stateful code. The technique is model-based testing: write a simple, obviously- correct reference implementation (often a list or a sorted list); generate random sequences of operations; run each sequence against both your sophisticated implementation and the reference; assert observable equivalence at every step.
If your real implementation is correct, the test passes for any operation sequence. If it has a bug, some sequence triggers a divergence between the two implementations, and QCheck shrinks that sequence to the smallest one that still diverges. The result is a tiny reproducer for a stateful bug, which is what you need to debug it.
The state problem
We have a function. We want to test it. The
property-based testing recipe
says: generate a random input, run the function, check a
property of the output. For sort, that is
QCheck.Test.make QCheck.(list int)
(fun xs ->
let ys = sort xs in
is_sorted ys && same_elements xs ys)
One generated input, one call, one property. Done.
Now we have a queue of ints, with this interface:
create : unit -> queue
enqueue : queue -> int -> unit
dequeue : queue -> int option
size : queue -> int
What is the input we generate? A single call to enqueue
does not test much; the interesting behaviour is in the
interaction between enqueue and dequeue over a sequence
of calls: what comes out, and in what order, depends on
everything that went in.
What is the property? "The queue is correct" is not a property; it is a paragraph. We need something concrete. And the natural concrete thing is: the queue behaves the same as a much simpler, obviously-correct implementation of the same interface.
The PBT shape stays the same: generate a random "input" (now a sequence of operations), run the function (now a function sequence), check a property (now observable equivalence with the reference at each step). It is the same idea applied at a higher level.
The recipe in five steps
Model-based testing has a fixed structure. The five pieces:
-
A representation of operations as data. Define a variant type
commandwhose constructors mirror the public API of the data structure under test.Enqueue of int,Dequeue,Size. -
A generator for
command list. Use QCheck combinators to produce random sequences of commands. Make sure the mix is interesting (e.g. enoughDequeues that the queue sometimes drains empty, enoughEnqueues that it sometimes grows long). -
Two interpreters. A function
run_real : queue -> command -> observationandrun_ref : model -> command -> observation. Each takes the (possibly mutable) state, applies the command, and returns whatever the operation observably produced (the result ofdequeue, the new size, ...). -
A property. Given a
command list, run it through both interpreters in lockstep; check that at every step, both produce the same observation. If any step diverges, the property fails and that sequence is the bug. -
A shrinker. Attach QCheck's list shrinker to the command list. When a failing sequence is found, the shrinker tries to delete commands, simplify their arguments, and find the smallest sequence that still diverges.
The five pieces are all the model-based testing you need. Once you have them, the test is one line.
A worked example: a two-list queue
Concretely: we test a two-list queue against a reference
implementation built from a single int list. The two-list
queue is a classic: you met its persistent cousin in
the modules tutorial, as a functor.
Here it returns in mutable clothing, as a test subject.
Under test: the two-list queue
The idea: keep two lists in mutable fields. front holds the
elements about to leave, oldest at the head; back holds the
recent arrivals, newest at the head. enqueue conses onto
back (constant time). dequeue takes the head of front;
when front runs dry, it reverses back into front and
clears back. Each element is moved at most once, so the cost
is constant amortised.
Twenty lines of mutable state with one delicate moment: the
refill, where dequeue must reverse back, install the tail
as the new front, clear back, and hand out the head, all in
the right order. Plenty of room for a bug. The reference
implementation (m_ for model) looks like this:
Eight lines of substance. The queue is a list; m_enqueue
appends at the far end, m_dequeue takes the head. Obviously
correct, at the price of m_enqueue walking the whole list. The
reference does not need to be fast, it does not need to scale,
it only needs to be unambiguously correct.
This is the heart of model-based testing: the reference is the spec, written in code. Anything more complex than the reference is something to test against the reference.
Step 1: commands as data
A pretty-printer comes with it, because QCheck's failure
messages will show command sequences, and <opaque> is not
helpful:
Each constructor mirrors one operation of the queue API. We carry the arguments inside the constructor.
We deliberately do not model create as a command; that is
the initial state. Each test starts with a freshly created
queue and applies a sequence of commands to it.
Step 2: a generator for command list
Gen.oneof picks one of the listed generators uniformly;
Gen.map builds an Enqueue around a random small int;
Gen.return is the generator that always produces exactly its
argument (the same return idea as in the monads module).
QCheck.make wraps the generator into an arbitrary; ~print
attaches our pretty-printer so failure messages show readable
command lists.
The ~shrink argument deserves a note; it is the one custom
shrinker the
PBT lecture promised you
would write. For built-in types, QCheck.(list int) comes with
a shrinker; for a custom type, QCheck.make cannot guess one,
and without it a failing 30-command sequence is reported as
is. A shrinker maps a failing value to a stream
(QCheck.Iter.t) of smaller candidates to try in its place. We
attach Shrink.list, which drops commands from the list, and
give it a per-command shrinker: Shrink.int walks an
Enqueue's argument toward zero (Iter.map wraps each shrunk
int back into the constructor), while Dequeue and Size
carry nothing to shrink (Iter.empty).
Three observations on the generator:
The enqueued values are small (0 to 9). For a queue, the values never steer control flow, so they only need to be readable in counterexamples; single digits are perfect. (For a keyed structure, a map or a hash table, the analogous choice is the most important one in the harness: a small key space forces operations to collide on the same keys, which is where keyed bugs live. The activity takes this up.)
The command mix is uniform. oneof picks each constructor
with equal probability, so Dequeue fires about as often as
Enqueue: the queue repeatedly drains to empty (exercising the
refill) and Dequeue-on-empty happens regularly too.
The list length is bounded. list_size (int_range 0 30)
caps sequences at 30 commands: long enough for several
fill-and-drain cycles, short enough that each test is fast and
the shrinker has manageable work.
Step 3: two interpreters
The interpreters apply a single command to a piece of state and
return an observation, which captures whatever the operation
makes visible.
The reference's interpreter mirrors the same three arms against the model:
The observations are exactly the parts of each call's result
that are publicly visible. Enqueue returns unit, so its
observation is OUnit (it passes trivially); Dequeue returns
an int option, observed as ODeq; Size returns an int,
observed as OInt.
Why an observation type at all? Because the property has to
compare the observable output of each call. A test that just
checked "no exception was raised" would miss bugs that silently
return wrong values from Dequeue. By making every visible
result an observation, we explicitly compare the parts of
behaviour the user can see. And the cheap Size observation
earns its keep: it can expose a corrupted internal state right
after the operation that corrupted it, before any Dequeue
gets around to revealing it.
Step 4: the property
Five lines of property. Read them carefully because they encode the entire model-based-testing idea:
- Create both a real queue and a reference queue, fresh.
- Walk the command list. At each step, apply the same command to both. Compute their observations.
- The step passes if the observations are equal. The step fails if they differ.
- The property is
trueiff every step passes.
Run it against our (correct) two-list queue:
If the queue is correct, both implementations produce the same
observation for every command, the equality holds for every
step, and the test passes, as it just did: a thousand random
operation sequences, every step agreeing. If the queue has a
bug, some operation eventually produces a different observation
than the reference, the equality fails, the step returns
false, and List.for_all short-circuits to false. The
property fails and QCheck reports the operation sequence as a
counterexample. We will watch that happen in a moment.
Step 5: what shrinking does to operation sequences
The list shrinker we attached in step 2 is exactly what we want. When the property fails on a long sequence, the shrinker:
- Tries dropping a single command. Re-runs both interpreters on the shorter sequence. Does the divergence still happen? If yes, the smaller sequence is the new witness; recurse.
- If dropping any single command makes the divergence go away, the shrinker tries dropping pairs, then halves.
- Eventually the shrinker arrives at a minimum-length sequence that still triggers the bug. For most queue bugs, this is 2-4 commands.
- Once the length is minimal, the shrinker tries to simplify
the arguments inside each command (integers toward zero).
Enqueue 7shrinks toEnqueue 0if the smaller version still triggers the bug.
The shrunk sequence is the bug report. For a typical refill bug, the shrinker might end up reporting:
Test failed on input:
[Enqueue 0; Dequeue; Dequeue]
Three commands. A long failing sequence with random
distractors would be unreadable. The three-command shrunk witness reads
itself: "enqueue one element, dequeue it, dequeue again.
The two implementations disagree on what comes out." Now you
know exactly where to look: the moment dequeue crosses from
a drained front into back.
The full test, end to end
Putting the pieces together (this is the complete file you
would put in test/test_queue.ml, below the queue and model
definitions):
type command =
| Enqueue of int
| Dequeue
| Size
let command_to_string = function
| Enqueue x -> Printf.sprintf "Enqueue %d" x
| Dequeue -> "Dequeue"
| Size -> "Size"
type observation =
| OUnit
| OInt of int
| ODeq of int option
let run_real (q : queue) = function
| Enqueue x -> enqueue q x; OUnit
| Dequeue -> ODeq (dequeue q)
| Size -> OInt (size q)
let run_ref (q : model) = function
| Enqueue x -> m_enqueue q x; OUnit
| Dequeue -> ODeq (m_dequeue q)
| Size -> OInt (m_size q)
let command_gen : command QCheck.Gen.t =
let open QCheck.Gen in
oneof [
map (fun x -> Enqueue x) (int_range 0 9);
return Dequeue;
return Size;
]
let command_shrink (c : command) : command QCheck.Iter.t =
match c with
| Enqueue x -> QCheck.Iter.map (fun x' -> Enqueue x') (QCheck.Shrink.int x)
| Dequeue | Size -> QCheck.Iter.empty
let command_list_gen : command list QCheck.arbitrary =
QCheck.make
~print:(fun cs ->
"[" ^ String.concat "; " (List.map command_to_string cs) ^ "]")
~shrink:(QCheck.Shrink.list ~shrink:command_shrink)
(QCheck.Gen.list_size (QCheck.Gen.int_range 0 30) command_gen)
let test_q_matches_ref =
QCheck.Test.make
~name:"queue matches reference on all command sequences"
~count:1000
command_list_gen
(fun cs ->
let real = create () in
let model = m_create () in
List.for_all
(fun c -> run_real real c = run_ref model c)
cs)
let () = QCheck_runner.run_tests_main [test_q_matches_ref]
Some fifty lines for a complete stateful test. The dune:
(test
(name test_queue)
(libraries qcheck))
One test executable, one dune runtest, 1000 random operation
sequences per run, automatic shrinking on failure.
Watching it catch a real bug
To make this concrete: suppose someone writes the refill in
dequeue and forgets one assignment. The reverse happens, the
new front is installed, but back is never cleared. The
type and the other operations are untouched; only dequeue
changes:
The consequence: every element that was in back at refill
time stays there, and the next refill delivers all of them a
second time. The queue duplicates elements. But notice how
quiet the bug is: the refill itself returns the right element,
and every dequeue before the next refill is right too.
Point the same harness at the bad dequeue (the interpreter
reuses enqueue and size; only the Dequeue arm changes):
QCheck finds a diverging sequence almost immediately and shrinks it. The minimal witness is three commands, in one of two equivalent shapes (which one you get varies with the seed):
Test buggy queue matches reference failed (... shrink steps):
[Enqueue 0; Dequeue; Size]
Read it: enqueue one element, dequeue it (this refill leaves
the element stranded in back), then ask for the size. The
real queue says 1, the reference says 0. The other shape,
[Enqueue 0; Dequeue; Dequeue], catches the duplicate
directly: the second Dequeue returns Some 0 instead of
None. Either way the witness is immediately legible. A
unit-test suite would need to think of this scenario; the
model-based test generates it.
This is the value proposition: writing the reference plus the five-step harness costs maybe half an hour. The harness then catches every refill bug, every ordering bug, every lost or duplicated element, every thing the random sequences happen to trip over. The cost is bounded; the coverage is unbounded.
When does this scale?
Model-based testing scales beautifully to any data structure with a clean interface. Some examples:
Hash tables. Reference: an association list
(key * value) list. Test: open addressing, chaining,
resizing. (This is the activity.)
Stacks. Reference: a list (head = top). Test: any stack
implementation (array-backed, linked, persistent).
Priority queues. Reference: a sorted list. Test: heap, pairing heap, leftist heap.
Maps and sets. Reference: an association list or a sorted list. Test: AVL tree, red-black tree, hash table, B-tree.
LRU caches. Reference: a list plus a counter for recency. Test: a doubly-linked-list + hash table implementation.
Persistent data structures. Reference: a list of all versions. Test: a persistent vector, finger tree, zipper.
In every case the pattern is the same: the reference is the simplest possible thing that satisfies the interface, the implementation under test is the optimised version, and the property checks that they agree on every sequence of operations.
Limits and extensions
Limit 1: non-determinism. If the implementation is
non-deterministic (e.g. concurrent code, hash-table iteration
order), the simple equality check or_ = oref fails because
the observations legitimately differ. The fix is to compare
equivalence classes: maps as their multi-sorted contents,
iteration as the sorted result of to_list. The shape of the
property gets richer, but the framework still applies.
Limit 2: concurrent code. For multi-threaded data
structures, you want to test linearisability: the parallel
history is equivalent to some sequential interleaving. This
is what the multicoretests
library
does for OCaml 5's parallel runtime, with qcheck-stm
(stateful model) and qcheck-lin (linearisability). The
underlying technique is the same as this lecture: define
commands, run on a model, check equivalence.
Limit 3: external state. If the data structure interacts with the filesystem, network, or database, the reference can sometimes be an in-memory mock. When the external state itself is the spec, model-based testing degrades to integration testing. The boundary is fuzzy and worth thinking about.
Extension: state-aware command generation. A more advanced
technique is to make the generator itself aware of state. For
example, after an Add 5, the generator should be biased to
generate Find 5 or Remove 5 (to exercise interesting
follow-ups), not just uniform random keys. Libraries like
qcheck-stm support this via a "command precondition"
mechanism: each command can be conditioned on the current
model state. Beyond the scope of this introductory lecture but
worth knowing exists.
The relationship to formal specifications
A philosophical aside. When you write the reference implementation, you are writing a specification in code. It is executable, it is type-checked, it is testable, and it is human-readable. It is not a formal proof of correctness for the optimised version, but it is a very strong specification: any difference between the two on any operation sequence is, by definition, a bug in the optimised version.
There is a classical name for the idea. An equational
specification (or algebraic specification) defines a data
abstraction by equations between its operations, with no
implementation in sight: for a queue, laws like "front of
enqueue x on an empty queue is x" and "dequeue then
enqueue commutes with enqueue then dequeue on a non-empty
queue". A reference implementation is those equations packaged
as code: the simplest structure that satisfies all of them. When
the test harness compares the optimised implementation against
the reference on every operation sequence, it is checking the
equations, wholesale, without ever writing them down one by one.
This puts model-based testing into a productive middle ground. On one side, formal verification (Rocq, Lean) gives mathematical certainty but costs months of effort per data structure. On the other side, unit testing gives the cases you thought of. Model- based PBT gives you, for one hour of harness writing, exhaustive sampling of operation sequences against a hand- written spec. It catches almost all the bugs that unit testing misses, almost free of cost. It is the highest-leverage testing technique we will cover.
Some industrial cases worth knowing:
- John Hughes's group at Quviq applied this technique to industrial codebases at AUTOSAR, Volvo Cars, and Ericsson; see the Experiences with QuickCheck paper in the Reading.
- Kyle Kingsbury's Jepsen does exactly this to distributed databases: it drives a cluster with random operation sequences, checks the observed history against a reference consistency model, and injects network partitions mid-run. It has found lost writes and consistency violations in Redis, MongoDB, and many other production systems. The data structure is a whole database and the divergence is a violated consistency guarantee, but it is the same recipe as the queue above: commands, a reference, equivalence checked over the history.
- Industrial OCaml codebases lean on property tests with
generators derived from type definitions
(
ppx_quickcheck); the RWO Testing chapter shows the pattern. - The OCaml 5 runtime's lock-free data structures are tested
with
multicoretestsagainst a linearisability model.
Activity
You are testing a custom red-black tree implementation using
model-based PBT against an association-list reference. The
default command_gen produces Add k v, Remove k, and
Find k with equal probability and k chosen uniformly from
int_range 0 1_000_000.
After running 1000 random sequences, the test passes. A user
later reports a bug: Remove on a key already in the tree
sometimes leaves the tree unbalanced. Why did the test not
catch this?
- Property-based testing cannot find bugs in tree algorithms.
- With a key space of one million and short command sequences,
Remove kalmost never targets a key that is actually in the tree, so the rebalancing branch is rarely exercised. Restrict the key space to e.g.int_range 0 20, or generateRemovekeys from the set of currently-added keys. - The reference implementation must also be a red-black tree, otherwise the comparison is meaningless.
Removecannot be tested by model-based PBT; it requires a dedicated invariant checker.
Why: the issue is distribution. A million-element key
space means random commands almost never collide. Remove k
on a key not in the tree is a no-op for both implementations,
so observations agree trivially. The interesting case (remove
a present key) almost never happens. The fix is the keyed-
structure version of the lecture's generator-design lesson:
use a small key space so that random commands frequently
target existing keys and exercise the algorithm's interesting
branches.
In the worked example, the property compares observations step-by-step:
List.for_all (fun c -> run_real real c = run_ref model c) cs
Why is step-by-step equivalence important? Why not just compare the final state of the two queues after running all commands?
- OCaml's
=does not work on mutable state. - Step-by-step is faster.
- A divergence might happen in the middle of the sequence but be "fixed up" by a later command (e.g. a
Dequeuethat brings the two impls back into agreement by coincidence). Step-by-step equivalence catches the bug at the exact step where it occurred and gives a smaller, more readable counterexample after shrinking. - Final-state comparison is impossible because the queue is mutable.
Why: if two impls diverge at step 7 and then reconverge at
step 12, a final-state comparison passes; the bug is hidden. By
asserting equivalence at every step, the test fires at the
earliest step where the divergence occurs. Combined with
shrinking, this gives you the smallest possible bug reproducer:
"the first time these two implementations disagree is on a
3-command sequence ending in this Dequeue."
Take the hash table from the activity question: add : t -> int -> string -> unit, find : t -> int -> string option,
remove : t -> int -> unit, and size : t -> int. (The right
reference is an association list (int * string) list: it is
the simplest finite-map structure, and each operation is one
line.) Write the harness's command variant type, one
constructor per operation, and a command_to_string function
that pretty-prints each command for failure messages. (%S in
a Printf format prints a string with quotes and escapes.)
Show reference solution
Reference solution:
Four constructors mirroring the four operations, plus a printer
that uses Printf.sprintf for the parameterised cases (%S
prints the string quoted and escaped). The rest of the harness
follows the lecture's recipe: an association-list reference,
interpreters returning OUnit / OFound of string option /
OInt of int observations, and one crucial generator choice:
draw the keys from a small range, so Find and Remove
frequently hit keys that are actually present.
Common pitfalls
Pitfall 1: the reference is wrong too. If the reference implementation contains the same bug as the system under test, or quietly implements different semantics, the test passes despite both being wrong. Two defences. Keep the reference trivial: the whole point is that it is so simple that any reader can convince themselves it is right; if the reference is more complex than 30-50 lines, you have lost the property. And check the reference against the actual specification once, by hand or with a few unit tests, before using it as the oracle. Garbage in, garbage out.
Pitfall 2: incomplete observations. Comparing only
Dequeue results misses bugs where the internal state has gone
wrong but the divergence has not yet reached the front of the
queue (our seeded bug was exactly this kind of quiet). Include
enough observations to make any state divergence visible: at
minimum, compare the size after every operation; ideally,
compare the full visible state.
Pitfall 3: distribution that misses interesting cases. The
canonical mistake in a keyed structure: use the default
QCheck.int for keys. The test passes vacuously because no two
random commands ever operate on the same key. Force collisions
with a small key range.
Pitfall 4: not enough commands. A bug that only fires after several drain-and-refill cycles (or, for a growing structure, several resizes) needs sequences long enough to trigger them. If your tests are too short, you miss bugs that emerge from interactions between many operations. Bound sequences at 30-50 commands by default; experiment with longer for data structures that grow.
Pitfall 5: assuming determinism. If your data structure has
any randomised behaviour (e.g. a hash with Random.int salt,
or any concurrency), the equality check breaks. Either seed the
randomness explicitly so the test is reproducible, or compare
equivalence classes rather than raw observations.
What's next
Lecture 7 is the module's wrap-up
tutorial: the whole toolkit, applied end to end to a single
worked example. We take a small arithmetic expr evaluator (a
float cousin of the pattern-matching tutorial's interpreter),
write its specification,
design black-box cases from it, mechanise them with OUnit2,
add QCheck properties with a custom generator, and watch the
shrinker corner a deliberately introduced bug.
Reading
- QCheck-STM, a QCheck extension specifically for model-based testing of stateful code. Used in the OCaml 5 multicore runtime for testing lock-free data structures: https://github.com/ocaml-multicore/multicoretests
- Cornell CS3110, Randomized testing with QCheck. The abstraction layering used in this lecture matches CS3110's: https://cs3110.github.io/textbook/chapters/correctness/randomized.html
- John Hughes, Experiences with QuickCheck: Testing the Hard Stuff and Staying Sane (2016). Industrial-scale applications of QuickCheck, including model-based testing of AUTOSAR drivers: https://publications.lib.chalmers.se/records/fulltext/232550/local_232550.pdf
- Claessen and Hughes, QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs (2000). The original paper introducing PBT, and the "test against a model" example with finite maps: https://www.cs.tufts.edu/~nr/cs257/archive/john-hughes/quick.pdf
- Real World OCaml, Testing. The Quickcheck section
covers manual generators for compound types, with
ppx_quickcheckfor boilerplate reduction: https://dev.realworldocaml.org/testing.html
Sources
This lecture's prose, worked examples, and quizzes are original
to this course. The "test stateful impl against a simple
reference using random operation sequences" pattern is the
canonical model-based testing technique, originally articulated
in Claessen and Hughes 2000 and elaborated in Hughes 2016
(both cited in the Reading); the OCaml-specific phrasing and
the hash-table / queue worked examples are our own. The
two-stack queue (Banker's queue without lazy thunks) is a
classic data structure originally due to Burton; our
presentation is the textbook one. The QCheck library
(Simon Cruanes and contributors) is BSD-2-Clause licensed and
linked through its public API. Cornell CS3110's testing
chapters are CC BY-NC-ND licensed and have not been
derivatively reused. Real World OCaml's testing chapter is
linked for further reading; its Quickcheck and ppx_quickcheck
discussion is independent of ours.
See LICENSES.md
at the repository root for the full source posture.