Let's flood a HashMap!

October 29, 2025 · 14 min read

Rynco Maekawa

This article gives a brief introduction to the structure of a hash table, demonstrates the hash flooding attack -- a common attack on it -- and shows how to mitigate it when implementing this data structure.

Everybody loves hashmaps.

They provide a blazing fast average $O(1)$ access^* to associate any value to any key, asking for only two things in return: an equality comparer and a hash function, nothing more. This unique property makes hashmaps often more efficient than other associative data structures like search trees. As a result, hashmaps are nowadays one of the most used data structures in programming languages.

From the humble dict in Python, to databases and distributed systems, and even JavaScript objects, they're everywhere. They power database indexing systems, enable efficient caching mechanisms, and form the backbone of web frameworks for routing requests. Modern compilers use them for symbol tables, operating systems rely on them for process management, and virtually every web application uses them to manage user state.

Whether you're building a web server, parsing JSON values, dealing with configurations, or just counting word frequencies, chances are you'll reach for a hashmap. They've become so fundamental that many developers take their $O(1)$ magic for granted -- but the $1$ in $O(1)$ has got some strings^* attached.

The anatomy of a hashmap

A hashmap is made of two parts: a bucket array and a hash function.

struct MyHashMap[K, V] {
  buckets : Array[Bucket[K, V]]
  hash_fn : (K) -> UInt
}

The bucket array contains a list of what we call "buckets". Each bucket stores some data we have inserted.

The hash function H associates each key with an integer. This integer is used to find an index in the bucket array to store our value. Usually, the index is derived by simply taking the integer modulo the size of the bucket array, i.e. index = H(key) % bucket_array_size. The hashmap expects the function to satisfy two important properties:

1. The same key is always converted to the same number, i.e., if a == b, then H(a) == H(b).

   This property ensures that, once we have used a key to find a bucket to insert into, we can always find that same bucket again using the same key.

2. The resulting numbers are distributed uniformly across the space of possible results for different keys.

   This property ensures that different keys are unlikely to have the same associated integer and, in consequence, unlikely to be mapped to the same bucket in the array, allowing us to retrieve the value efficiently.
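As a small illustration of the index computation described above, here is a minimal sketch. The helper name `bucket_index` is hypothetical and not part of this article's hashmap; it only mirrors index = H(key) % bucket_array_size, with the hash values passed in directly.

```moonbit
/// Map a hash value to a bucket index: index = hash % bucket_count.
/// `bucket_index` is a hypothetical helper, used here only for illustration.
fn bucket_index(hash : UInt, bucket_count : Int) -> Int {
  (hash % bucket_count.reinterpret_as_uint()).reinterpret_as_int()
}

test {
  // 17 % 8 == 1: a key whose hash is 17 lands in bucket 1 of 8.
  inspect(bucket_index(17U, 8), content="1")
  // Equal hashes always give the same bucket (the first property).
  inspect(bucket_index(17U, 8) == bucket_index(17U, 8), content="true")
  // Different hashes can still share a bucket: 25 % 8 == 1 as well.
  inspect(bucket_index(25U, 8), content="1")
}
```

Note that the second property only makes a shared bucket unlikely; it cannot rule it out, since there are usually far more possible keys than buckets.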

Now, you may ask, what would happen if two keys map to the same bucket? This comes to the realm of hash collisions.

Hash collisions

When two keys have the same hash value, or more broadly, when two keys map to the same bucket, a hash collision occurs.

Because a hashmap locates everything based on the hash value (or bucket index), the two keys now look the same to the hashmap itself -- they should be put into the same place, yet they are still unequal and must not overwrite each other.

Hashmap designers have a couple of strategies to deal with collisions, which fall into two broad categories:

The chaining method puts these keys in the same bucket. Each bucket may now contain the data for a number of keys, instead of just one. When searching for a colliding key, every key in the bucket has to be checked.

struct ChainingBucket[K, V] {
  values : Array[(K, V)]
}

Java's HashMap is a popular example of this approach.

The open addressing method still has one key per bucket, but uses a separate strategy to choose another bucket index when keys collide. When searching for a key, buckets are searched in the order given by the strategy until it is obvious that there are no more keys that could match.
```
struct OpenAddressBucket[K, V] {
  hash : Int
  key : K
  value : V
}
```
MoonBit's standard library Map is an example of this approach.
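To make the probing idea concrete, here is a minimal sketch of a lookup with linear probing, the simplest open addressing strategy. It is an illustration under stated assumptions, not how MoonBit's Map is implemented: the slot layout, the function name `probe_lookup`, and using the key itself as its hash are all made up for brevity.

```moonbit
/// Look up `key` with linear probing.
/// Returns the value (if found) and the number of slots inspected.
/// Assumption: the key is used directly as its own hash value.
fn probe_lookup(slots : Array[(UInt, Int)?], key : UInt) -> (Int?, Int) {
  let n = slots.length()
  let start = (key % n.reinterpret_as_uint()).reinterpret_as_int()
  let mut n_probes = 0
  for i = 0; i < n; i = i + 1 {
    n_probes += 1
    let slot = slots[(start + i) % n]
    if slot is Some((k, v)) {
      if k == key {
        return (Some(v), n_probes)
      }
    } else {
      // An empty slot ends the probe sequence: the key cannot be further along.
      return (None, n_probes)
    }
  }
  // Every slot was occupied by some other key.
  return (None, n_probes)
}

test {
  // 4 slots. Keys 1 and 5 both map to slot 1, so 5 was placed in slot 2.
  let slots : Array[(UInt, Int)?] = [None, Some((1U, 10)), Some((5U, 50)), None]
  let (value, probes) = probe_lookup(slots, 5U)
  inspect(value == Some(50), content="true")
  inspect(probes, content="2")
}
```

The colliding key 5 costs two probes instead of one, which is the open addressing counterpart of walking through a chained bucket.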

In either case, when a hash collision happens, we have no choice but to search through everything corresponding to the bucket we've found, to determine whether the key we are looking for is there or not.

Using a chaining hashmap (for simplicity), the whole operation looks something like this:

typealias ChainingBucket as Bucket

/// Search for the place where the key is stored.
///
/// Returns `(bucket, index, number_of_searches_done)`
fn[K : Eq, V] MyHashMap::search(
  self : MyHashMap[K, V],
  key : K,
) -> (Int, Int?, Int) {
  let hash = (self.hash_fn)(key)
  let bucket = (hash % self.buckets.length().reinterpret_as_uint()).reinterpret_as_int()
  // Result
  let mut found_index = None
  let mut n_searches = 0
  // Search through all key-value pairs in the bucket.
  for index, keyvalue in self.buckets[bucket].values {
    n_searches += 1
    if keyvalue.0 == key { // Check if the key matches.
      found_index = Some(index)
      break
    }
  }
  return (bucket, found_index, n_searches)
}

/// Insert a new key-value pair.
///
/// Returns the number of searches done.
fn[K : Eq, V] MyHashMap::insert(
  self : MyHashMap[K, V],
  key : K,
  value : V,
) -> Int {
  let (bucket, index, n_searches) = self.search(key)
  if index is Some(index) {
    self.buckets[bucket].values[index] = (key, value)
  } else {
    self.buckets[bucket].values.push((key, value))
  }
  n_searches
}

This is the string attached to the $O(1)$ access magic -- we'd have to search through everything if we're unlucky. This gives the hashmap a worst-case complexity of $O(n)$, where $n$ is the number of keys in the hashmap.

Crafting a collision

For most hash functions we use for hashmaps, unlucky collisions are rare. This means that we usually don't need to bother with the worst-case scenario and can enjoy the $O(1)$ speed the vast majority of the time.

That is, unless someone, ~~maybe some black-suited hackerman with some malicious intent,~~ forces you into one.

Hash functions are usually designed to be deterministic and fast, so even without advanced cryptanalysis of the function itself, we can still find some keys that will collide with each other by brute force. ¹

fn find_collision(
  bucket_count : Int,
  target_bucket : Int,
  n_collision_want : Int,
  hash_fn : (String) -> UInt,
) -> Array[String] {
  let result = []
  let bucket_count = bucket_count.reinterpret_as_uint()
  let target_bucket = target_bucket.reinterpret_as_uint()
  for i = 0; ; i = i + 1 {
    // Generate some string key.
    let s = i.to_string(radix=36)
    // Calculate the hash value
    let hash = hash_fn(s)
    let bucket_index = hash % bucket_count
    let bucket_index = if bucket_index < 0 {
      bucket_index + bucket_count
    } else {
      bucket_index
    }
    // Check if it collides with our target bucket.
    if bucket_index == target_bucket {
      result.push(s)
      if result.length() >= n_collision_want {
        break
      }
    }
  }
  result
}

Hash flooding attack

With colliding keys in hand, we (in the role of malicious hackermen) can now attack hashmaps and consistently trigger their worst-case complexity.

Consider the following case: you are inserting keys into the same hashmap, but every key hashes into the same bucket. With each insert, the hashmap must search through all the existing keys in the bucket to determine whether the new key is already there.

The first insertion compares with 0 keys, the second with 1 key, the third compares with 2 keys, and the number of keys compared grows linearly with each insertion. For $n$ insertions, the total number of keys compared is:

$$0 + 1 + \dots + (n - 1) = \frac{n(n - 1)}{2} = \frac{n^2 - n}{2}$$

The whole sequence of $n$ insertions now takes $O(n^2)$ comparisons to complete², as opposed to $O(n)$ comparisons in the average case. The operation now takes far more time than it ought to.
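To put numbers on this, here is the same sum worked out for a concrete case. The choice of $n = 1000$ is only an illustrative assumption:

$$
\sum_{i=0}^{n-1} i = \frac{n(n-1)}{2}, \qquad n = 1000 \;\Rightarrow\; \frac{1000 \cdot 999}{2} = 499500
$$

That is 499,500 comparisons when every key collides, against a count that grows only linearly with $n$ in the average case.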

The attack is not limited to insertion. Every time a colliding key is looked up, the same number of keys has to be compared, so every single operation that would have been $O(1)$ now becomes $O(n)$. Hashmap operations that would otherwise take negligible time become severely slower, making it far easier for the attacker to exhaust the program's resources.

This is what we call a hash flooding attack, named for the way it floods a single bucket of the hashmap with colliding keys.

We can demonstrate this with the hashmap implementation we wrote earlier:

/// A simple string hasher via the Fowler-Noll-Vo hash function.
/// https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function
fn string_fnv_hash(s : String) -> UInt {
  // In reality this should directly operate on the underlying array of the string
  let s_bytes = @encoding/utf16.encode(s)
  let mut acc : UInt = 0x811c9dc5
  for b in s_bytes {
    acc = (acc ^ b.to_uint()) * 0x01000193
  }
  acc
}

fn test_attack(
  n_buckets : Int,
  keys : Array[String],
  hash_fn : (String) -> UInt,
) -> Int {
  let map = { buckets: Array::makei(n_buckets, _ => { values: [] }), hash_fn }
  let mut total_searches = 0
  for key in keys {
    total_searches += map.insert(key, 0)
  }
  total_searches
}

test {
  println("Demonstrate hash flooding attack")
  let bucket_count = 2048
  let target_bucket_id = 42
  let n_collision_want = 1000

  //
  println("First, try to insert non-colliding keys.")
  let non_colliding_keys = Array::makei(
    n_collision_want,
    i => (i * 37).to_string(radix=36),
  )
  let n_compares_nc = test_attack(
    bucket_count, non_colliding_keys, string_fnv_hash,
  )
  println(
    "Total compares for \{n_collision_want} non-colliding keys: \{n_compares_nc}",
  )
  println("")

  //
  println("Now, we want all keys to collide into bucket #\{target_bucket_id}.")
  let colliding_keys = find_collision(
    bucket_count, target_bucket_id, n_collision_want, string_fnv_hash,
  )
  println("Found \{colliding_keys.length()} colliding keys.")
  let n_compares_c = test_attack(bucket_count, colliding_keys, string_fnv_hash)
  println(
    "Total compares for \{n_collision_want} colliding keys: \{n_compares_c}",
  )

  //
  let increase = n_compares_c.to_double() / n_compares_nc.to_double()
  println("The number of compares increased by a factor of \{increase}")
}

The output of the code above is:

Demonstrate hash flooding attack
First, try to insert non-colliding keys.
Total compares for 1000 non-colliding keys: 347

Now, we want all keys to collide into bucket #42.
Found 1000 colliding keys.
Total compares for 1000 colliding keys: 499500
The number of compares increased by a factor of 1439.4812680115274

... as can be seen directly, inserting the colliding keys takes over 1,400 times as many comparisons!

In reality, although the number of buckets in a hashmap is not fixed as in our example, it often follows a certain growth sequence, such as doubling or following a list of predefined prime numbers. This growth pattern makes the bucket count very predictable. Thus, an attacker can launch a hash flooding attack even if they don't know the exact bucket count.
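As a small illustration of why a predictable growth sequence helps the attacker, consider a table that, by assumption, starts at a power-of-two size and doubles as it grows. Two hash values that collide at the largest size then collide at every smaller size too, so one set of keys keeps working while the table grows. The hash values below are made up for the example:

```moonbit
test {
  // Two hypothetical hash values that differ by a multiple of 2048 (= 2^11).
  let h1 = 42U
  let h2 = 42U + 2048U * 7U
  // They share a bucket for 2048 buckets and for every smaller power of two.
  for n in [2048U, 1024U, 512U, 256U] {
    inspect(h1 % n == h2 % n, content="true")
  }
}
```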

Mitigating hash flooding attacks

A hash flooding attack works because the attacker knows exactly how the hash function works, and how it determines where a key is inserted into the hashmap. If we change either of them, the attack will no longer work.

Seeded hash function

By far, the easiest way to do this is to prevent the attacker from knowing how the hash algorithm exactly works. This might sound impossible, but the properties of the hash function actually only need to hold within a single hashmap!

When dealing with hashmaps, we don't need a single, global "hash value" that can be used everywhere, because hashmaps don't care about what happens outside them. Simply swap out the hash function from table to table, and you get something that's unpredictable to the attacker.

But hey, you may say, "we don't have an infinite supply of different hash algorithms!"

Well, you do. Remember that hash functions need to distribute their values across the result space as uniformly as possible? That means, for a good hash function, a slight change in the input can cause a large change in the output. So, in order to get a hash function unique to each table, we only need to feed it some data unique to the table before feeding it the data we want to hash. This data is called a "seed" for the hash function, and each table can now use a different seed.

Let's demonstrate how the seed solves the problem with a seeded hash function and two tables with different seeds:

/// A modified version of the FNV hash before to allow a seed to be used.
fn string_fnv_hash_seeded(seed : UInt) -> (String) -> UInt {
  let seed_bytes = seed.to_le_bytes()
  fn string_fnv_hash(s : String) -> UInt {
    let s_bytes = @encoding/utf16.encode(s)
    let mut acc : UInt = 0x811c9dc5
    // Mix in the seed bytes.
    for b in seed_bytes {
      acc = (acc ^ b.to_uint()) * 0x01000193
    }
    // Hash the string bytes.
    for b in s_bytes {
      acc = (acc ^ b.to_uint()) * 0x01000193
    }
    acc
  }

  string_fnv_hash
}

test {
  println("Demonstrate flooding attack mitigation")
  let bucket_count = 2048
  let target_bucket_id = 42
  let n_collision_want = 1000

  // The first table has a seed of 42.
  let seed1 : UInt = 42
  println("We find collisions using the seed \{seed1}")
  let hash_fn1 = string_fnv_hash_seeded(seed1)
  let colliding_keys = find_collision(
    bucket_count, target_bucket_id, n_collision_want, hash_fn1,
  )
  let n_compares_c = test_attack(bucket_count, colliding_keys, hash_fn1)
  println(
    "Total compares for \{n_collision_want} colliding keys with seed \{seed1}: \{n_compares_c}",
  )
  println("")

  // The second table has a different seed
  let seed2 : UInt = 100
  println(
    "We now use a different seed for the second table, this time \{seed2}",
  )
  let hash_fn2 = string_fnv_hash_seeded(seed2)
  let n_compares_nc = test_attack(bucket_count, colliding_keys, hash_fn2)
  println(
    "Total compares for \{n_collision_want} keys that were meant to collide with seed \{seed1}: \{n_compares_nc}",
  )
}

The output of the program above was:

Demonstrate flooding attack mitigation
We find collisions using 42
Total compares for 1000 colliding keys with seed 42: 499500

We now use a different seed for the second table, this time 100
Total compares for 1000 keys that were meant to collide with seed 42: 6342

We can see that the keys that collided in the first table no longer collide in the second. ³ Therefore, we have successfully mitigated the hash flooding attack using this simple trick.

As for where the seed that randomizes each hashmap comes from... For programs with access to an external random source (like Linux's /dev/urandom), using that would generally be the best choice. For programs without such access (such as within a WebAssembly sandbox), a per-process random seed is also a preferable solution (this is what Python does). Even simpler, a counter that increments with each seeding attempt could be good enough -- guessing how many hashmaps have been created can still be quite hard for an attacker.
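The counter option is the easiest one to sketch. The snippet below is only an illustration: `seed_counter` and `fresh_seed` are hypothetical names, not part of any library, and a real implementation would preferably start the counter from a random per-process value rather than zero.

```moonbit
/// Hypothetical process-wide counter used to derive a fresh seed per table.
let seed_counter : Ref[UInt] = { val: 0U }

/// Returns a seed that differs from every seed handed out before
/// (until the 32-bit counter wraps around).
fn fresh_seed() -> UInt {
  let seed = seed_counter.val
  seed_counter.val = seed + 1U
  seed
}

test {
  // Two tables created one after the other receive different seeds.
  let seed_a = fresh_seed()
  let seed_b = fresh_seed()
  assert_true(seed_a != seed_b)
}
```

Each new table could then build its own hash function with `string_fnv_hash_seeded(fresh_seed())`, using the seeded FNV function from the demonstration above.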

Other choices

Java uses a different solution, by falling back to a binary search tree (red-black tree) when too many values occupy the same bucket. Yes, this requires the keys to be also comparable in addition to being hashable, but now it guarantees $O(\log n)$ worst-case complexity, which is far better than $O(n)$.
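The idea behind this fallback can be sketched without a full red-black tree. The example below is a simplification, not Java's actual implementation: it stands in a sorted array with binary search for the tree, and the function name `lookup_sorted` is hypothetical. It only shows how an ordering on the keys turns a linear bucket scan into a logarithmic lookup.

```moonbit
/// Looks up `key` in an overfull bucket whose entries are kept sorted by key
/// (according to String's `<`). Each step halves the search range, so a
/// lookup needs about log2(n) probes instead of scanning all n entries.
fn lookup_sorted(entries : Array[(String, Int)], key : String) -> Int? {
  let mut lo = 0
  let mut hi = entries.length()
  while lo < hi {
    let mid = lo + (hi - lo) / 2
    let (k, v) = entries[mid]
    if k == key {
      return Some(v)
    } else if k < key {
      lo = mid + 1
    } else {
      hi = mid
    }
  }
  None
}

test {
  // All four keys landed in the same bucket; the bucket keeps them sorted.
  let bucket = [("ant", 1), ("bee", 2), ("cat", 3), ("dog", 4)]
  assert_eq(lookup_sorted(bucket, "cat"), Some(3))
  assert_eq(lookup_sorted(bucket, "cow"), None)
}
```

With 1000 colliding keys, a lookup in such a bucket needs roughly 10 probes rather than up to 1000. A real tree additionally keeps inserts logarithmic, which a sorted array does not.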

Why does it matter to us?

Due to the ubiquitous nature of hashmaps, it's extremely easy to find some hashmap in a program where you can control the keys, especially in Web programs. Headers, cookies, query parameters and JSON bodies are all key-value pairs, and often stored in hashmaps, which might be vulnerable to hash flooding attacks.

A malicious attacker with enough knowledge of the program (programming language, frameworks, etc.) can then try to send carefully-crafted request payloads to the Web API endpoints. These requests take a lot longer to handle, so if a regular denial-of-service (DoS) attack takes n requests/s to bring down a server, a hash flooding attack might need only a tiny fraction of that number, often an order of magnitude smaller -- making it far more efficient for the attacker. This turns the DoS attack into a HashDoS attack.

Fortunately, by introducing even slightly unpredictable patterns (such as per-process randomness or keyed hashing) into hashmaps, we can make such attacks significantly harder, often impractical. Also, as such an attack is highly dependent on the language, framework, architecture and implementation of the target application, crafting one can be quite hard already, and modern, well-configured systems are even harder to exploit.

Takeaways

Hashmaps give us powerful, constant-time average access -- but that "constant" depends on assumptions an attacker can sometimes break. A targeted hash-flooding attack forces many keys into the same bucket and turns O(1) operations into O(n), enabling highly efficient resource exhaustion.

The good news is that the mitigations are simple and practical: introduce some unpredictability into your hashmaps, use additional information such as key ordering when the hash alone is not enough, or rehash when the behavior doesn't look right. With these, we can keep our hashmaps fast and secure.

Side note, this is also similar to how Bitcoin mining works -- finding a value to add to an existing string, so the hash of the entire thing (with bits reversed), modulo some given value, is zero. ↩
There's even a Tumblr blog for unexpected quadratic complexity in programming languages, Accidentally Quadratic. You can even find a hashmap-related one here! -- It's almost a manually-introduced hash flooding attack. ↩
You may notice that this number is still higher than what we got with randomly-generated, non-colliding keys. This might be because FNV is not designed to produce the highest-quality output. Since the two seeds are pretty close to each other, the results might still have some similarity. Using a better hash function (or even a cryptographically-secure one like SipHash) would greatly reduce this effect. ↩

Write a HTTP file server in MoonBit

October 22, 2025 · 17 min read

In this article, I will introduce MoonBit's async programming support and the moonbitlang/async library by writing a simple HTTP file server. If you have used Python before, you may know that it ships with a very convenient built-in HTTP server module: running python -m http.server from the command line launches an HTTP file server that shares the current directory, which is useful for LAN file sharing. In this article, we will write a program with similar functionality in MoonBit, and learn about MoonBit's async programming support along the way. We will also implement a useful feature absent from python -m http.server: downloading a whole directory as a .zip file.

A brief history of async programming

Async programming enables programs to perform multiple tasks at the same time. For example, for a file server, there may be many users accessing the server at the same time. The server needs to serve all users at the same time while making the experience of every user as fluent as possible. In a typical async program, such as a server, most time is spent on waiting for IO operations in a single task, and only a small portion of time is spent on actual computation. So, we don't really need a lot of computation power to handle a lot of tasks. The key here is to switch frequently between tasks: if a task starts waiting for IO, don't process it anymore, switch to a task that is immediately ready instead.

In the past, async programming was usually implemented via multi-threading: every task in the program corresponded to an operating system thread. However, OS threads are resource heavy, and the context switch between OS threads is expensive, too. So, today, async programming is usually implemented via event loops. In an event loop based async program, the whole program is structured as a big loop. In every iteration of the loop, the program checks for a list of completed IO operations, and resumes the tasks blocked on these IO operations, until they issue another IO request and enter the waiting state again. In this programming paradigm, the context switch between tasks happens in user space, on a single OS thread, so switching between tasks is very cheap.

Although the event loop solves the performance problem, it is very painful to write event loop based programs manually. The code of a single task needs to be split across multiple iterations of the event loop, damaging the readability of the program logic significantly. Fortunately, like most other modern programming languages, MoonBit provides native async programming support. Users can write async code just like normal, synchronous code. The MoonBit compiler will automatically split async code into multiple parts, while the moonbitlang/async library provides the event loop, various IO primitives, and a scheduler that actually runs the async code.

Async programming in MoonBit

In MoonBit, you can declare an async function using the async fn syntax. Async functions look exactly the same as normal, synchronous functions, except that they may be interrupted in the middle at run time, so that the program can switch between multiple tasks.

Unlike most other languages, MoonBit doesn't need special marks such as await when calling async functions. The compiler will automatically infer which function calls are async. However, if you read async MoonBit code in an IDE or text editor that supports MoonBit, you can see async function calls rendered in italics, and function calls that may raise an error rendered with an underline. So, you can still easily find all async function calls when reading code.

For async programming, it is also necessary to have an event loop, a task scheduler and various IO primitives. In MoonBit, these are implemented via the moonbitlang/async library. moonbitlang/async provides support for async primitives such as network IO, file IO and process creation, as well as a lot of useful task management facilities. In the following parts, we will learn about various features of moonbitlang/async while writing the HTTP file server.

The structure of an HTTP server

The structure of a typical HTTP server is:

the server listens on a TCP socket, waiting for incoming connections from users
after accepting a TCP connection from a user, the server reads the user's request from the TCP connection, processes it, and sends the result back to the user.

Every task described above must be performed asynchronously: while handling the request from the first user, the server should still keep waiting for new connections, and react to the connection request of the next user. If many users connect to the server at the same time, the server should handle the requests from all users in parallel. When handling user requests, all time-consuming operations, such as network IO and file IO, should be asynchronous: they should not block the program and affect the handling of other tasks.

moonbitlang/async provides a helper function @http.run_server, which automatically sets up an HTTP server and runs it:

async fn server_main(path~ : String, port~ : Int) -> Unit {
  @http.run_server(@socket.Addr::parse("[::]:\{port}"), fn (conn, addr) {
    @pipe.stderr.write("received new connection from \{addr}\n")
    handle_connection(path, conn)
  })
}

server_main accepts two parameters. path is the directory to serve, and port is the port to listen on. In moonbitlang/async, all async code is cancellable, and cancellation is performed by raising an error in the cancelled code. So, MoonBit assumes all async fn may raise an error by default, eliminating the need for explicitly marking async fn with raise.

In server_main, we use @http.run_server to create an HTTP server and run it. @http is the default alias for moonbitlang/async/http, which provides HTTP support for moonbitlang/async. The first parameter of @http.run_server is the address to listen on. Here we ask the server to listen on [::]:port, which means listening on port on any network interface. moonbitlang/async provides native IPv4/IPv6 dual stack support, so the server here can accept both IPv4 and IPv6 connections. The second parameter of @http.run_server is a callback function used for handling client requests. The callback function receives two parameters. The first one is the connection from the user, represented using the type @http.ServerConnection; the connection is created automatically by @http.run_server. The second one is the network address of the user. Here, we use a function handle_connection to handle the request; its implementation will be given later. @http.run_server will automatically create a new task, and run handle_connection in the new task. So, the server may run multiple instances of handle_connection in parallel, handling multiple user connections at the same time.

Handle user request

Now, let's implement the handle_connection function. handle_connection accepts two parameters: base_path is the directory being served, and conn is the connection from the user. The implementation of handle_connection is as follows:

async fn handle_connection(
  base_path : String,
  conn : @http.ServerConnection,
) -> Unit {
  for {
    let request = conn.read_request()
    conn.skip_request_body()
    guard request.meth is Get else {
      conn
      ..send_response(501, "Not Implemented")
      ..write("This request is not implemented")
      ..end_response()
    }
    let (path, download_zip) = match request.path {
      [ ..path, .."?download_zip" ] => (path.to_string(), true)
      path => (path, false)
    }
    if download_zip {
      serve_zip(conn, base_path + path)
    } else {
      let file = @fs.open(base_path + path, mode=ReadOnly) catch {
        _ => {
          conn
          ..send_response(404, "NotFound")
          ..write("File not found")
          ..end_response()
          continue
        }
      }
      defer file.close()
      if file.kind() is Directory {
        if download_zip {
        } else {
          serve_directory(conn, file.as_dir(), path~)
        }
      } else {
        server_file(conn, file, path~)
      }
    }
  }
}

In handle_connection, the program reads requests from the user connection and handles them in a big loop. In every iteration, we first read the next request from the user via conn.read_request(). conn.read_request() only reads the header part of an HTTP request, in order to allow streaming reads of large request bodies. Since our file server only handles GET requests, the body of a request is irrelevant. So, we use conn.skip_request_body() to skip the body of the user's request, so that the content of the next request can be processed normally.

If we meet a request that is not GET, the else block of the guard statement will be executed. Code after the guard statement will be skipped, and the program will enter the next iteration directly and handle the next request. In the else block, we use conn.send_response(..) to send a 501 "Not Implemented" response back to the user. conn.send_response(..) only sends the header part of the response. After send_response, we use conn.write(..) to write the body of the response to the connection. After writing all desired contents, we use conn.end_response() to tell the library that the response body is complete.

Here, we want to implement a useful feature absent in python -m http.server: download the whole directory as a zip file. If the requested URL has the shape /path/to/directory?download_zip, we package /path/to/directory into a .zip file and send it to the user. This feature is implemented using the serve_zip function to be given later.
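The URL handling can be isolated into a small pure function, which makes it easy to test on its own. The sketch below reuses the string pattern from handle_connection above; the function name `split_target` is hypothetical and only serves the illustration.

```moonbit
/// Splits a request target into the file system path and a flag telling
/// whether the `?download_zip` suffix was present.
fn split_target(target : String) -> (String, Bool) {
  match target {
    // The suffix pattern strips "?download_zip" and binds the rest as a view.
    [.. path, .. "?download_zip"] => (path.to_string(), true)
    path => (path, false)
  }
}

test {
  assert_eq(split_target("/photos?download_zip"), ("/photos", true))
  assert_eq(split_target("/photos/cat.png"), ("/photos/cat.png", false))
}
```

Note that this only recognizes the exact suffix `?download_zip`; any other query string is treated as part of the path, just as in handle_connection.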

Since we are implementing a file server, the requested path in the user's GET request maps directly to a file system path under base_path. @fs is the default alias of moonbitlang/async/fs, the package for file system IO support in moonbitlang/async. Here, we use @fs.open to open the requested file. If the @fs.open operation fails, we send the user a 404 response, notifying the user that the requested file does not exist.

If the requested file is successfully opened, we need to send its content to the user. Before that, we use defer file.close() to ensure that the opened file will be closed correctly. We can obtain the kind of the file via file.kind(). In a file server, directories need some special handling. Since we cannot send a directory over the network, we need to serve an HTML page to the user, which lists the contents of the directory, with links that jump to the corresponding page of each file in the directory. This part of the server is implemented in the serve_directory function, whose definition will be provided later.

If the requested file is a regular file, we simply send the content of the file to the user. This is implemented via the server_file function:

async fn server_file(
  conn : @http.ServerConnection,
  file : @fs.File,
  path~ : String,
) -> Unit {
  // Pick the Content-Type from the file name suffix.
  let content_type = match path {
    [.., .. ".png"] => "image/png"
    [.., .. ".jpg"] | [.., .. ".jpeg"] => "image/jpeg"
    [.., .. ".html"] => "text/html"
    [.., .. ".css"] => "text/css"
    [.., .. ".js"] => "text/javascript"
    [.., .. ".mp4"] => "video/mp4"
    [.., .. ".mpv"] => "video/mpv"
    [.., .. ".mpeg"] => "video/mpeg"
    [.., .. ".mkv"] => "video/x-matroska"
    _ => "application/octet-stream"
  }
  conn
  ..send_response(200, "OK", extra_headers={ "Content-Type": content_type })
  ..write_reader(file)
  ..end_response()
}

In the HTTP response header, we fill in different values for the Content-Type field based on the suffix of the requested file. With correct Content-Type, the users can view the content of image/video/HTML file in the browser directly. For other files, the value of Content-Type is set to application/octet-stream, which tells the browser to download the file automatically.

As before, we use conn.send_response to send the response header. The extra_headers parameter allows us to set extra header fields for the response. The body of the response is the content of the file. Here, conn.write_reader(..) streams the content of file to the user. Suppose the user requests a video file and plays it in the browser: if we read the whole video file into memory before sending it, the user would only see a response from the server after the whole file has been loaded, resulting in poor latency. Loading the whole video file is also a huge waste of memory. write_reader, on the other hand, automatically splits the file into small chunks, and sends the content of the file chunk by chunk. This way, users can start playing the video immediately, and the server saves a lot of memory.

Next, let's implement the serve_directory function:

async fn serve_directory(
  conn : @http.ServerConnection,
  dir : @fs.Directory,
  path~ : String,
) -> Unit {
  let files = dir.read_all()
  files.sort()
  conn
  ..send_response(200, "OK", extra_headers={ "Content-Type": "text/html" })
  ..write("<!DOCTYPE html><html><head></head><body>")
  ..write("<h1>\{path}</h1>\n")
  ..write("<div style=\"margin: 1em; font-size: 15pt\">\n")
  ..write("<a href=\"\{path}?download_zip\">download as zip</a><br/><br/>\n")
  if path[:-1].rev_find("/") is Some(index) {
    let parent = if index == 0 { "/" } else { path[:index].to_string() }
    conn.write("<a href=\"\{parent}\">..</a><br/><br/>\n")
  }
  for file in files {
    let file_url = if path[path.length() - 1] != '/' {
      "\{path}/\{file}"
    } else {
      "\{path}\{file}"
    }
    conn.write("<a href=\"\{file_url}\">\{file}</a><br/>\n")
  }
  conn
  ..write("</div></body></html>")
  ..end_response()
}

Here, we first read the list of files in the directory and sort them. Next, we build an HTML page based on the content of the directory. The body of the HTML page is the list of files in the directory, where each file corresponds to an <a> HTML link showing the name of the file. Users can jump to the page of the file by clicking the link. If the requested directory is not the root directory, we add a special link .. at the beginning of the page, which jumps to the parent directory of the current directory. Finally, the page also contains a download as zip link, which jumps to the zip download URL for the current directory.

Implement the download as zip feature

Finally, let's implement the "download as zip" feature. Here, for simplicity, we use the zip command for compression. The implementation of serve_zip is as follows:

async fn serve_zip(
  conn : @http.ServerConnection,
  path : String,
) -> Unit {
  let full_path = @fs.realpath(path)
  let zip_name = if full_path[:].rev_find("/") is Some(i) {
    full_path[i+1:].to_string()
  } else {
    path
  }
  @async.with_task_group(fn(group) {
    let (we_read_from_zip, zip_write_to_us) = @process.read_from_process()
    defer we_read_from_zip.close()
    group.spawn_bg(fn() {
      let exit_code = @process.run(
        "zip",
        [ "-q", "-r", "-", path ],
        stdout=zip_write_to_us,
      )
      if exit_code != 0 {
        fail("zip failed with exit code \{exit_code}")
      }
    })
    conn
    ..send_response(200, "OK", extra_headers={
      "Content-Type": "application/octet-stream",
      "Content-Disposition": "filename=\{zip_name}.zip",
    })
    ..write_reader(we_read_from_zip)
    ..end_response()
  })
}

At the beginning of serve_zip, we first compute the file name for the .zip file. Next, we create a new task group using @async.with_task_group. The task group is the core construct for task management in moonbitlang/async: every task must be spawned in a task group. Before we get into the details of with_task_group, let's first look at the rest of serve_zip. First, we use @process.read_from_process to create a temporary pipe. Data written to one end of the pipe can be read from the other end, so the pipe can be used to obtain the output of a system command. We pass the write end of the pipe, zip_write_to_us, to the zip command and let zip write the compressed output to it. Meanwhile, we read the output of the zip command from the read end of the pipe, we_read_from_zip, and send it to the user.

To accomplish this, we first spawn a new task in the task group using group.spawn_bg(..). group.spawn_bg(..) accepts a function as its argument and runs it in a new background task, in parallel with other code in the program. Within the new task, we use @process.run to launch the zip command. @process is the default alias of moonbitlang/async/process, which provides process spawning and manipulation support for moonbitlang/async. The arguments passed to zip have the following meaning:

-q: do not output log
-r: recursively compress the whole directory
-: write the result of compression to stdout
path: the directory to compress

When launching zip with @process.run, the stdout=zip_write_to_us part redirects the stdout of zip to zip_write_to_us, so that we can obtain the output of zip. Compared to creating a temporary .zip file to store the result, using a pipe is more efficient because:

the data exchange with zip is completely in-memory, which is more efficient than disk IO
we can send partial compression result on-the-fly while zip is still working, reducing latency

@process.run waits until zip finishes and returns its exit code. If the zip command fails with a non-zero exit code, we raise an error.

Outside the new task in spawn_bg, we use conn.send_response(..) to initiate a response to the user, and send the output of zip to the user via conn.write_reader(we_read_from_zip). The Content-Disposition HTTP header allows us to specify the file name for the .zip file. This part of the code runs in parallel with the @process.run task.

So far everything looks reasonable. But why do we need to create a new task group here? Why doesn't moonbitlang/async just provide a global task-spawning API, like many other languages do? There is a phenomenon in async programming: it is relatively easy to write an async program that works correctly when everything goes well, but much harder to write an async program that behaves correctly when things go wrong. For the serve_zip example:

what should we do if the zip command fails?
what should we do if some network error occurs, or the user closes the connection?

If the zip command fails, the whole serve_zip function should fail too. Since the user has already received some incomplete data, it is hard to bring the connection back to a normal state, so we have to close the whole connection. If a network error occurs when sending data, we should stop the zip command immediately, because its result is no longer useful. Keeping the zip command running is just a waste of resources. In the worst case, the pipe used to communicate with zip may fill up since we are no longer reading from it, and zip may block forever on writing to the pipe and linger as a stuck process.

In the code above, we did not perform any explicit error handling. However, when the aforementioned error cases occur, our program still behaves correctly and handles all edge cases. The magic lies in the @async.with_task_group function and the structured concurrency paradigm behind it. The semantics of @async.with_task_group(f) are as follows:

it creates a new task group group, and runs f(group) inside the new group
f can spawn new tasks in the group via group.spawn_bg(..)
with_task_group only returns after all tasks inside the group have terminated
if any task inside the group fails, with_task_group fails as well, and all remaining tasks in the group are automatically cancelled
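To make these four rules concrete, here is a rough model of them in TypeScript. It is only a sketch and not how moonbitlang/async is implemented: TaskGroup, withTaskGroup, spawnBg, sleep, and demo are hypothetical names, and cancellation is modeled cooperatively with the standard AbortController instead of real task cancellation.

```typescript
// Hypothetical model of a task group; NOT the moonbitlang/async implementation.
type Task = (signal: AbortSignal) => Promise<void>;

class TaskGroup {
  private readonly controller = new AbortController();
  private readonly running: Promise<void>[] = [];
  private firstError: unknown = undefined;
  private failed = false;

  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Rule 2: tasks are only ever spawned inside a group.
  spawnBg(task: Task): void {
    this.running.push(task(this.controller.signal).catch((e) => this.fail(e)));
  }

  // Rule 4: the first failure is recorded and every other task is cancelled.
  fail(error: unknown): void {
    if (!this.failed) {
      this.failed = true;
      this.firstError = error;
      this.controller.abort();
    }
  }

  // Rule 3: wait for every task, including tasks spawned while waiting.
  async join(): Promise<void> {
    for (let i = 0; i < this.running.length; i++) {
      await this.running[i];
    }
    if (this.failed) throw this.firstError;
  }
}

// Rule 1: create a group, run f inside it, return only when the group is done.
async function withTaskGroup(f: (group: TaskGroup) => Promise<void>): Promise<void> {
  const group = new TaskGroup();
  try {
    await f(group);
  } catch (e) {
    group.fail(e);
  }
  await group.join();
}

// A cancellable sleep that stands in for real work such as streaming a response.
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("cancelled"));
      return;
    }
    const timer = setTimeout(() => resolve(), ms);
    signal.addEventListener("abort", () => {
      clearTimeout(timer);
      reject(new Error("cancelled"));
    });
  });
}

// Mirrors the shape of serve_zip: a failing background task cancels the main work.
async function demo(): Promise<string[]> {
  const log: string[] = [];
  try {
    await withTaskGroup(async (group) => {
      group.spawnBg(async () => {
        throw new Error("zip failed with exit code 12");
      });
      try {
        await sleep(10000, group.signal);
        log.push("response sent");
      } catch (e) {
        log.push("response task: " + (e as Error).message);
        throw e;
      }
    });
  } catch (e) {
    log.push("group failed: " + (e as Error).message);
  }
  return log;
}
```

In this model, demo() resolves to a log with two entries: the response task observes the cancellation first, and the error that leaves the group is the original zip failure, not the cancellation it triggered.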

The last point here is the key to ensure correct error handling behavior:

if the zip command fails, the task that calls @process.run raises an error, failing the whole task. The error is propagated to the whole task group since no one catches it. with_task_group automatically cancels the response-sending task, propagates the error upwards, and closes the connection.
if a network error occurs, the main response-sending task fails. The error is also propagated to the whole task group, and the zip task is cancelled. When @process.run is cancelled, it automatically terminates the zip command by sending a SIGTERM signal.

So, when writing async programs using moonbitlang/async, users only need to insert task groups at appropriate places based on the structure of the program; all the remaining error handling details are handled automatically by with_task_group. This is the power of the structured concurrency paradigm of moonbitlang/async: it guides users to write async programs with a clearer structure, and makes programs behave correctly even when things go wrong.

Run the server

We have now implemented all features of the HTTP file server, so we can actually run it. MoonBit provides native support for async code: users can use async fn main to define the entry point of an async program, or use async test to test async code directly. Here, we let the HTTP server serve the content of the current working directory and listen on port 8000:

async test {
  server_main(path=".", port=8000)
}

To use the file server, just run the source code of this document via moon test /path/to/this/document.mbt.md, and open the address http://127.0.0.1:8000 in your browser.

Other features of moonbitlang/async can be found in its API document and GitHub repo.

Interacting with JavaScript in MoonBit: A First Look

September 25, 2025 · 13 min read

Introduction

In today's software world, no programming language ecosystem can be an isolated island. As an emerging general-purpose language, MoonBit's success in the vast technological landscape hinges on its seamless integration with existing ecosystems.

MoonBit provides multiple compilation backends, including JavaScript, which opens the door to the vast JavaScript ecosystem. This integration capability greatly expands MoonBit's application scenarios for both front-end browser development and Node.js applications. It allows developers to leverage the type safety and high performance of MoonBit while reusing a wide range of existing JavaScript libraries.

In this article, using Node.js as our example, we'll explore MoonBit's JavaScript FFI step-by-step. We'll cover various topics from basic function calls to complex type and error handling, demonstrating how to build an elegant bridge between the MoonBit and JavaScript worlds.

Prerequisites

Before we begin, let's configure our project. If you don't have an existing project, you can use the moon new tool to create a new MoonBit project.

To let the MoonBit toolchain know that our target platform is JavaScript, we need to add the following content to the moon.mod.json file in the project's root directory:

{
  "preferred-target": "js"
}

This configuration tells the compiler to use the JavaScript backend by default when executing commands like moon build or moon check. Of course, if you want to specify it temporarily on the command line, you can achieve the same effect with the --target=js option.

Building the Project

After completing the above configuration, simply run the familiar build command in the project's root directory:

> moon build

After the command executes successfully, since our project includes an executable entry by default, you can find the build artifacts in the target/js/debug/build/ directory. MoonBit conveniently generates three files for us:

.js file: The compiled JavaScript source code.
.js.map file: A Source Map file for debugging.
.d.ts file: A TypeScript declaration file, which is convenient for integration into TypeScript projects.

First JavaScript API Call

MoonBit's FFI design is principled and consistent. Similar to calling into C or other languages, we define an external function through a declaration with the extern keyword:

extern "js" fn consoleLog(msg : String) -> Unit = "(msg) => console.log(msg)"

This line of code is the core of enabling our FFI call. Let's break it down:

extern "js": Declares that this is an external function pointing to the JavaScript environment.
fn consoleLog(msg : String) -> Unit: This is the function's type signature in MoonBit. It accepts a parameter of type String and returns a unit value (Unit).
"(msg) => console.log(msg)": The string literal on the right side of the equals sign is the essence of this declaration, containing the native JavaScript function to be executed.

Here, we use a concise arrow function. The MoonBit compiler will embed this code as is into the final generated .js file, enabling the call from MoonBit to JavaScript.

Tip If your JavaScript code snippet is relatively complex, you can use the #| syntax to define multi-line strings to improve readability.

Once this FFI declaration is ready, we can call consoleLog in our MoonBit code just like a normal function:

test "hello" {
  consoleLog("Hello from JavaScript!")
}

Run moon test, and you will see the message printed by JavaScript's console.log in the console. Our first bridge is successfully built!

Interfacing with JavaScript Types

Establishing the call flow is just the first step. The real challenge lies in handling type differences between the two languages. MoonBit is a statically typed language, while JavaScript is dynamically typed. Establishing a safe and reliable type mapping between them is a key consideration in FFI design.

Below, we'll cover how to interface with different JavaScript types in MoonBit, starting from the easiest cases.

JavaScript Types Requiring No Conversion

The simplest case involves types in MoonBit whose underlying compiled representation in JavaScript corresponds directly to a native JavaScript type. In this case, we can pass them directly without any conversion.

The common "zero-cost" interface types are shown below:

MoonBit Type	Corresponding JavaScript Type
`String`	`string`
`Bool`	`boolean`
`Int`, `UInt`, `Float`, `Double`	`number`
`BigInt`	`bigint`
`Bytes`	`Uint8Array`
`Array[T]`	`Array<T>`
Function Type	`Function`

Based on these mappings, we can bind many simple JavaScript functions. In fact, in the previous example of binding the console.log function, we have already used the correspondence between the String type in MoonBit and the string type in JavaScript.

Note: Maintaining the Internal Invariants of MoonBit Types

A crucial detail is that all of MoonBit's standard numeric types (Int, Float, etc.) map to the number type in JavaScript, i.e., IEEE 754 double-precision floating-point numbers. This means that when an integer value crosses the FFI boundary into JavaScript, its behavior will follow floating-point semantics, which may lead to unexpected results from MoonBit's perspective, such as differences in integer overflow behavior:

extern "js" fn incr(x : Int) -> Int = "(x) => x + 1"

test "incr" {
  // In MoonBit, @int.max_value + 1 will overflow and wrap around
  inspect(@int.max_value + 1, content="-2147483648")
  // In JavaScript, it is treated as a floating-point number and does not overflow
  inspect(incr(@int.max_value), content="2147483648") // ???
}

This is essentially illegal because, according to the internal invariant of the Int type in MoonBit, its value cannot be 2147483648 (which exceeds the maximum value allowed by the type). This may cause unexpected behavior in other MoonBit code downstream that relies on this point. Similar issues may arise when handling other data types across the FFI boundary, so please be sure to pay attention to this when writing related logic.
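One possible way to keep the invariant is to normalize the result on the JavaScript side before it crosses back into MoonBit. This is offered as a suggestion, not as something the example above does. The TypeScript sketch below compares the original function body with a hypothetical variant that uses `| 0` to truncate the result to a signed 32-bit integer; incrNaive, incrWrapping, and INT_MAX are illustrative names.

```typescript
// Same body as the `incr` binding above: plain floating-point addition.
const incrNaive = (x: number): number => x + 1;

// Hypothetical alternative body: `| 0` converts the result to a signed
// 32-bit integer, so it always stays inside the range of MoonBit's Int.
const incrWrapping = (x: number): number => (x + 1) | 0;

const INT_MAX = 2147483647;

console.log(incrNaive(INT_MAX));    // 2147483648, outside the Int range
console.log(incrWrapping(INT_MAX)); // -2147483648, matching MoonBit's wrap-around
```

With a body like "(x) => (x + 1) | 0", the value returned to MoonBit would be a valid Int again, at the cost of one extra operation per call.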

External JavaScript Types

Of course, the JavaScript world is much richer than these basic types. We will quickly encounter undefined, null, symbol, and various complex host objects, which have no direct counterparts in MoonBit.

For this situation, MoonBit provides the #external annotation. This annotation acts as a contract, telling the compiler: "Please trust me, this type actually exists in the external world (JavaScript). You don't need to care about its internal structure, just treat it as an opaque handle."

For example, we can define a type that represents JavaScript's undefined like this:

#external
type Undefined

extern "js" fn Undefined::new() -> Self = "() => undefined"

However, a standalone Undefined type isn't very useful, as undefined typically appears as part of a union type, like string | undefined.

A more practical approach is to create an Optional[T] type that precisely maps to T | undefined in JavaScript, and which can be easily converted to and from MoonBit's built-in Option[T] (aliased as T?).

To achieve this, we first need a type to represent any JavaScript value, similar to TypeScript's any. This is where #external is useful:

#external
pub type Value

Consequently, we need methods to get the undefined value and to check if a given value is undefined:

extern "js" fn Value::undefined() -> Value =
  #| () => undefined

extern "js" fn Value::is_undefined(self : Self) -> Bool =
  #| (n) => Object.is(n, undefined)

For easier debugging, we'll implement the Show trait for our Value type, allowing it to be printed:

pub impl Show for Value with output(self, logger) {
  logger.write_string(self.to_string())
}

pub extern "js" fn Value::to_string(self : Value) -> String =
  #| (self) =>
  #|   self === undefined ? 'undefined'
  #|     : self === null ? 'null'
  #|     : self.toString()

Next comes the 'magic' of the conversion process. We'll define two special conversion functions:

fn[T] Value::cast_from(value : T) -> Value = "%identity"

fn[T] Value::cast(self : Self) -> T = "%identity"

What is %identity

%identity is a special intrinsic provided by MoonBit for zero-cost type casting. It performs type checking at compile time, but has no effect at runtime. It essentially tells the compiler: "Trust me, I know the real type of this value; just treat it as the target type."

This is a double-edged sword: it provides powerful expressiveness at the FFI boundary, but misuse can break type safety. Therefore, its use should be strictly limited to FFI-related code.
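Readers coming from TypeScript may find a type assertion a useful mental model for this trade-off. The sketch below is an analogy only, not MoonBit code, and unsafeCast is a hypothetical helper: the assertion changes the static type, generates no runtime code, and so leaves the underlying value untouched.

```typescript
// TypeScript analogue of a zero-cost cast: only the static type changes.
function unsafeCast<T>(value: unknown): T {
  return value as T;
}

const n: number = 42;
const s = unsafeCast<string>(n); // type-checks, but s still holds the number 42

// The runtime type is unchanged by the cast.
const runtimeType: string = typeof s;

// Misuse surfaces later, far from the cast itself.
let failure = "";
try {
  s.toUpperCase(); // accepted by the compiler, but numbers have no toUpperCase
} catch (e) {
  failure = (e as Error).name;
}

console.log(runtimeType, failure); // number TypeError
```

The failure shows up where the value is used, not where it was cast, which is why such casts are best kept inside a small, well-reviewed layer.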

With these building blocks, we can construct Optional[T]:

#external
type Optional[_] // Corresponds to T | undefined

/// Create an undefined Optional
fn[T] Optional::undefined() -> Optional[T] {
  Value::undefined().cast()
}

/// Check if an Optional is undefined
fn[T] Optional::is_undefined(self : Optional[T]) -> Bool {
  self |> Value::cast_from |> Value::is_undefined
}

/// Unwrap T from Optional[T], panic if it is undefined
fn[T] Optional::unwrap(self : Self[T]) -> T {
  guard !self.is_undefined() else { abort("Cannot unwrap an undefined value") }
  Value::cast_from(self).cast()
}

/// Convert Optional[T] to MoonBit's built-in T?
fn[T] Optional::to_option(self : Optional[T]) -> T? {
  guard !Value::cast_from(self).is_undefined() else { None }
  Some(Value::cast_from(self).cast())
}

/// Create Optional[T] from MoonBit's built-in T?
fn[T] Optional::from_option(value : T?) -> Optional[T] {
  guard value is Some(v) else { Optional::undefined() }
  Value::cast_from(v).cast()
}

test "Optional from and to Option" {
  let optional = Optional::from_option(Some(3))
  inspect(optional.unwrap(), content="3")
  inspect(optional.is_undefined(), content="false")
  inspect(optional.to_option(), content="Some(3)")
  let optional : Optional[Int] = Optional::from_option(None)
  inspect(optional.is_undefined(), content="true")
  inspect(optional.to_option(), content="None")
}

With this setup, we've successfully crafted a safe and ergonomic representation for T | undefined within MoonBit's type system. The same method can also be used to interface with other JavaScript-specific types like null, symbol, RegExp, etc.
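The same pair of conversions can be written from the JavaScript side, which also exposes one limitation of the T | undefined encoding. The TypeScript sketch below is a model for illustration only: the tagged-union Option type is an assumption made for this example and says nothing about how MoonBit actually represents T? at runtime.

```typescript
// Hypothetical model of the two representations, for illustration only.
type Option<T> = { tag: "Some"; value: T } | { tag: "None" };
type Optional<T> = T | undefined;

// Counterpart of Optional::to_option: undefined becomes None.
function toOption<T>(x: Optional<T>): Option<T> {
  return x === undefined ? { tag: "None" } : { tag: "Some", value: x };
}

// Counterpart of Optional::from_option: None becomes undefined.
function fromOption<T>(o: Option<T>): Optional<T> {
  return o.tag === "Some" ? o.value : undefined;
}

// Ordinary values survive a round trip.
const roundTrip = toOption(fromOption<number>({ tag: "Some", value: 3 }));

// A Some that wraps undefined does not: it collapses into None.
const collapsed = toOption<undefined>(
  fromOption<undefined>({ tag: "Some", value: undefined }),
);

console.log(roundTrip.tag, collapsed.tag); // Some None
```

Because the encoding cannot distinguish "no value" from "a value that is undefined", a wrapper of this kind fits best when T itself can never be undefined.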

Handling JavaScript Errors

A robust FFI layer must handle errors gracefully. By default, if JavaScript code throws an exception during an FFI call, it won't be caught by MoonBit's try-catch mechanism. Instead, it will crash the entire program:

// This is an FFI call that will throw an exception
extern "js" fn boom_naive() -> Value raise = "(u) => undefined.toString()"

test "boom_naive" {
  // This code will directly crash the test process instead of returning a `Result` via `try?`
  inspect(try? boom_naive()) // failed: TypeError: Cannot read properties of undefined (reading 'toString')
}

The correct approach is to wrap the call in a try...catch block on the JavaScript side, and then pass either the successful result or the caught error back to MoonBit. While we could do this directly in the JavaScript code of our extern "js" declaration, a more reusable solution exists:

First, let's define an Error_ type to encapsulate JavaScript errors:

suberror Error_ Value

pub impl Show for Error_ with output(self, logger) {
  logger.write_string("@js.Error: ")
  let Error_(inner) = self
  logger.write_object(inner)
}

Next, we'll define a core FFI wrapper function, Error_::wrap_ffi. Its role is to execute an operation (op) in the JavaScript realm and, depending on the outcome, call either a success (on_ok) or error (on_error) callback:

extern "js" fn Error_::wrap_ffi(
  op : () -> Value,
  on_ok : (Value) -> Unit,
  on_error : (Value) -> Unit,
) -> Unit =
  #| (op, on_ok, on_error) => { try { on_ok(op()); } catch (e) { on_error(e); } }

Finally, using this FFI function and MoonBit closures, we can create a more idiomatic Error_::wrap function that returns a T raise Error_:

fn[T] Error_::wrap(
  op : () -> Value,
  map_ok~ : (Value) -> T = Value::cast,
) -> T raise Error_ {
  // Define a variable to pass the result in and out of the closure
  let mut res : Result[Value, Error_] = Ok(Value::undefined())
  // Call the FFI, passing two closures that will modify the value of res based on the JS execution result
  Error_::wrap_ffi(op, fn(v) { res = Ok(v) }, fn(e) { res = Err(Error_(e)) })
  // Check the value of res and return the corresponding result or throw an error
  match res {
    Ok(v) => map_ok(v)
    Err(e) => raise e
  }
}

Now, we can safely call the function that previously threw an exception, and we can handle possible errors with pure MoonBit code:

extern "js" fn boom() -> Value = "(u) => undefined.toString()"

test "boom" {
  let result = try? Error_::wrap(boom)
  inspect(
    (result : Result[Value, Error_]),
    content="Err(@js.Error: TypeError: Cannot read properties of undefined (reading 'toString'))",
  )
}

Interfacing with External JavaScript APIs

Having mastered the key techniques for bridging types and handling errors, it's time to turn our attention to the wider world: the Node.js and NPM ecosystem. The entry point to all of it is a binding for the require() function:

extern "js" fn require_ffi(path : String) -> Value = "(path) => require(path)"

/// A more convenient wrapper that supports chained property access, e.g., require("a", keys=["b", "c"])
pub fn require(path : String, keys~ : Array[String] = []) -> Value {
  keys.fold(init=require_ffi(path), Value::get_with_string)
}

// ... where the definition of Value::get_with_string is as follows:

fn[T] Value::get_with_string(self : Self, key : String) -> T {
  self.get_ffi(Value::cast_from(key)).cast()
}

extern "js" fn Value::get_ffi(self : Self, key : Self) -> Self = "(obj, key) => obj[key]"

With this require function, we can easily load Node.js's built-in modules, such as the node:path module, and call its methods:

// Load the basename function of the node:path module
let basename : (String) -> String = require("node:path", keys=["basename"]).cast()

test "require Node API" {
  inspect(basename("/foo/bar/baz/asdf/quux.html"), content="quux.html")
}
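
The keys parameter is what makes chained property access possible: each key is looked up on the result of the previous lookup. As a small sketch, assuming the require wrapper and Value::cast from this article are in scope, and relying on Node's documented path.posix object, the same basename function can be reached one level deeper. The name posix_basename is ours, chosen only for illustration:

```moonbit
// Assumes `require` and `Value::cast` from this article are in scope.
// Equivalent to the JavaScript expression require("node:path").posix.basename
let posix_basename : (String) -> String = require(
  "node:path",
  keys=["posix", "basename"],
).cast()

test "require with chained keys" {
  inspect(posix_basename("/foo/bar/baz/asdf/quux.html"), content="quux.html")
}
```

Because fold starts from the loaded module and applies get_with_string once per key, an empty keys array simply returns the module object itself.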

More excitingly, we can use the same method to call the vast collection of third-party libraries on NPM. Let's take simple-statistics, a popular statistics library, as an example.

First, we need to initialize package.json and install dependencies, just like in a standard JavaScript project. Here we use pnpm, but npm or yarn work just as well:

> pnpm init
> pnpm install simple-statistics

Once the preparation is complete, we can directly require this library in our MoonBit code and get the standardDeviation function from it:

let standard_deviation : (Array[Double]) -> Double = require(
  "simple-statistics",
  keys=["standardDeviation"],
).cast()

Now, whether we use moon run or moon test, MoonBit can correctly load dependencies via Node.js and execute the code, returning the expected result.

test "require external lib" {
  inspect(standard_deviation([2, 4, 4, 4, 5, 5, 7, 9]), content="2")
}

This is quite powerful: with just a few lines of FFI code, we've connected MoonBit's type-safe world with NPM's vast and mature ecosystem.

Conclusion

In this article, we've explored the fundamentals of interacting with JavaScript in MoonBit, from basic type interfacing to error handling and, finally, to the integration of external libraries. These features bridge the gap between MoonBit's static type system and JavaScript's dynamic typing. They let developers keep MoonBit's type safety and modern language features while drawing on the vast JavaScript ecosystem.

Of course, with great power comes great responsibility. While the FFI is powerful, we must handle type conversions and error boundaries carefully to ensure program robustness.

Mastering these FFI techniques is a crucial skill for developers wanting to extend MoonBit applications with JavaScript libraries. By applying these techniques, we can build high-quality applications that leverage both the strengths of MoonBit and the rich resources of the JavaScript ecosystem.

To learn more about MoonBit's ongoing progress in JavaScript interoperability, please check out the web frontend of mooncakes.io and its underlying UI library, rabbit-tea, both built with MoonBit.

Two Approaches to Regex Engines: Derivative and Thompson VM

September 10, 2025 · 11 min read

Regular expression engines can be implemented using fundamentally different approaches, each with distinct trade-offs in performance, memory usage, and implementation complexity. This article explores two mathematically equivalent but practically different methods for regex matching: Brzozowski derivatives and Thompson's virtual machine approach.

Both methods operate on the same abstract syntax tree representation, providing a unified foundation for direct performance comparison. The key insight is how these seemingly different approaches solve identical problems through different computational strategies—one through algebraic transformation, the other through program execution.

Conventions & Definitions

To establish a common foundation, both regex engines start with a shared AST representation that captures the essential structure of regular expressions in a tree format:

enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)

Additionally, we provide smart constructors to simplify regex construction:

fn Ast::chr(chr : Char) -> Ast {
  Chr(chr)
}

fn Ast::seq(self : Ast, other : Ast) -> Ast {
  Seq(self, other)
}

fn Ast::rep(self : Ast, n? : Int) -> Ast {
  Rep(self, n)
}

fn Ast::opt(self : Ast) -> Ast {
  Opt(self)
}

The AST defines four fundamental regex operations:

Chr(Char) matches a single literal character.
Seq(Ast, Ast) matches one pattern followed by another through concatenation.
Rep(Ast, Int?) repeats a pattern zero or more times when the count is None, or exactly n times when it is Some(n).
Opt(Ast) makes a pattern optional, equivalent to pattern? in standard regex syntax.

For example, we can build the regex (ab*)?—an optional sequence of 'a' followed by zero or more 'b's—as:

Ast::chr('a').seq(Ast::chr('b').rep()).opt()
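
To make the shape of that expression concrete, here is a small test sketch that spells out the tree the smart constructors build. It assumes the Ast enum and the constructors above are in scope; the test name and the local names built and expected are ours:

```moonbit
// Assumes the Ast enum and its smart constructors defined above.
test "smart constructors build the expected tree" {
  let built = Ast::chr('a').seq(Ast::chr('b').rep())
    .opt()
  // rep() is called without n, so the repetition count is None
  let expected : Ast = Opt(Seq(Chr('a'), Rep(Chr('b'), None)))
  assert_eq(built, expected)
}
```

The type annotation on expected tells the compiler which enum the bare constructors belong to, which matters once a second enum with the same constructor names is in scope.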

Brzozowski Derivative

The derivative-based approach transforms regular expressions algebraically using formal language theory. For each input character, it computes the "derivative" of the regex by asking: "what remains to be matched after consuming this character?" This creates a new regex representing the remaining pattern.

We extend the basic Ast type to represent derivatives and nullability explicitly:

enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)

The constructors in Exp represent:

Nil represents an impossible pattern that can never match anything.
Eps matches the empty string.
Chr(Char) matches a single character.
Alt(Exp, Exp) represents alternation, providing choice between patterns.
Seq(Exp, Exp) represents concatenation of two patterns.
Rep(Exp) represents repetition of a pattern.

We use the Exp::of_ast function to convert the Ast into the more expressive Exp format:

fn Exp::of_ast(ast : Ast) -> Exp {
  match ast {
    Chr(c) => Chr(c)
    Seq(a, b) => Seq(Exp::of_ast(a), Exp::of_ast(b))
    Rep(a, None) => Rep(Exp::of_ast(a))
    Rep(a, Some(n)) => {
      let sec = Exp::of_ast(a)
      let mut exp = sec
      for _ in 1..<n {
        exp = Seq(exp, sec)
      }
      exp
    }
    Opt(a) => Alt(Exp::of_ast(a), Eps)
  }
}
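
As an illustration of what the conversion produces, the following test sketch checks two cases by hand: Opt is desugared into an alternation with Eps, and a bounded Rep is unrolled into nested Seq nodes. It assumes Ast, Exp, and Exp::of_ast from above are in scope; the local names are ours:

```moonbit
test "Exp::of_ast desugars Opt and bounded Rep" {
  // (ab*)? becomes an alternation with the empty string
  let optional = Exp::of_ast(Ast::chr('a').seq(Ast::chr('b').rep()).opt())
  let expected : Exp = Alt(Seq(Chr('a'), Rep(Chr('b'))), Eps)
  assert_eq(optional, expected)
  // Three repetitions of 'a' are unrolled into a left-nested sequence
  let thrice = Exp::of_ast(Ast::chr('a').rep(n=3))
  let unrolled : Exp = Seq(Seq(Chr('a'), Chr('a')), Chr('a'))
  assert_eq(thrice, unrolled)
}
```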

We also provide smart constructors for Exp to simplify pattern building:

fn Exp::seq(a : Exp, b : Exp) -> Exp {
  match (a, b) {
    (Nil, _) | (_, Nil) => Nil
    (Eps, b) => b
    (a, Eps) => a
    (a, b) => Seq(a, b)
  }
}
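
A short test sketch shows the simplifications this constructor performs. It assumes the Exp enum and Exp::seq above are in scope; the local names a, b, and nil are ours:

```moonbit
test "Exp::seq simplifies trivial cases" {
  let a : Exp = Chr('a')
  let b : Exp = Chr('b')
  let nil : Exp = Nil
  // Nil absorbs the whole sequence
  assert_eq(Exp::seq(nil, a), nil)
  assert_eq(Exp::seq(a, nil), nil)
  // Eps is the identity of concatenation
  assert_eq(Exp::seq(Eps, a), a)
  assert_eq(Exp::seq(a, Eps), a)
  // Otherwise a plain Seq node is built
  assert_eq(Exp::seq(a, b), Seq(a, b))
}
```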

The smart constructor for Alt, however, is strictly necessary: it ensures that the constructed Exp is normalized up to "similarity", as defined in Brzozowski's original paper. Two regexes are similar if one can be reduced to the other by applying the following rules:

\begin{align} & A \mid \emptyset &&\rightarrow A \\ & A \mid B &&\rightarrow B \mid A \\ & A \mid (B \mid C) &&\rightarrow (A \mid B) \mid C \end{align}

Therefore, we normalize the Alt construction to always use the same associativity and order of alternatives:

fn Exp::alt(a : Exp, b : Exp) -> Exp {
  match (a, b) {
    (Nil, b) => b
    (a, Nil) => a
    (Alt(a, b), c) => a.alt(b.alt(c))
    (a, b) => {
      if a == b {
        a
      } else if a > b {
        Alt(b, a)
      } else {
        Alt(a, b)
      }
    }
  }
}
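
The effect of this normalization can be checked with a few assertions. This is a sketch that assumes the Exp enum and Exp::alt above are in scope and relies on the derived Eq and Compare instances; the local names are ours:

```moonbit
test "Exp::alt normalizes alternatives" {
  let a : Exp = Chr('a')
  let b : Exp = Chr('b')
  let nil : Exp = Nil
  // Nil is dropped from an alternation
  assert_eq(Exp::alt(nil, a), a)
  assert_eq(Exp::alt(a, nil), a)
  // Duplicate alternatives collapse into one
  assert_eq(Exp::alt(a, a), a)
  // Operand order does not matter: both calls yield the same node
  assert_eq(Exp::alt(a, b), Exp::alt(b, a))
}
```

Because equal languages written in different orders end up as structurally equal values, repeated derivatives cannot keep producing new variants that differ only in the order of their alternatives.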

The nullable function determines if a pattern can match the empty string without consuming input:

fn Exp::nullable(self : Exp) -> Bool {
  match self {
    Nil => false
    Eps => true
    Chr(_) => false
    Alt(l, r) => l.nullable() || r.nullable()
    Seq(l, r) => l.nullable() && r.nullable()
    Rep(_) => true
  }
}
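
For a quick sanity check, here is a sketch of nullable applied to the patterns used so far. It assumes Ast, Exp::of_ast, and Exp::nullable above are in scope; the name ab_star is ours:

```moonbit
test "nullable" {
  let ab_star = Ast::chr('a').seq(Ast::chr('b').rep())
  // ab* needs at least an 'a', so it cannot match the empty string
  assert_false(Exp::of_ast(ab_star).nullable())
  // b* and (ab*)? both accept the empty string
  assert_true(Exp::of_ast(Ast::chr('b').rep()).nullable())
  assert_true(Exp::of_ast(ab_star.opt()).nullable())
}
```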

The deriv function computes the derivative of a pattern with respect to a character, transforming the pattern based on the rules defined in the Brzozowski derivative. We have reordered the rules to match the order in the deriv function:

\begin{align} D_{a} \emptyset &= \emptyset \\ D_{a} \epsilon &= \emptyset \\ D_{a} a &= \epsilon \\ D_{a} b &= \emptyset & \text{ for }(a \neq b) \\ D_{a} (P \mid Q) &= (D_{a} P) \mid (D_{a} Q) \\ D_{a} (P \cdot Q) &= (D_{a} P \cdot Q) \mid (\nu(P) \cdot D_{a} Q) \\ D_{a} (P\ast) &= D_{a} P \cdot P\ast \\ \end{align}

fn Exp::deriv(self : Exp, c : Char) -> Exp {
  match self {
    Nil => self
    Eps => Nil
    Chr(d) if d == c => Eps
    Chr(_) => Nil
    Alt(l, r) => l.deriv(c).alt(r.deriv(c))
    Seq(l, r) => {
      let dl = l.deriv(c)
      if l.nullable() {
        dl.seq(r).alt(r.deriv(c))
      } else {
        dl.seq(r)
      }
    }
    Rep(e) => e.deriv(c).seq(self)
  }
}
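
To see the rules in action, the following sketch traces a few derivatives of ab* by hand. It assumes the definitions above are in scope; the local names ab_star, b_star, and nil are ours:

```moonbit
test "derivatives of ab*" {
  let ab_star = Exp::of_ast(Ast::chr('a').seq(Ast::chr('b').rep()))
  let b_star : Exp = Rep(Chr('b'))
  let nil : Exp = Nil
  // Consuming 'a' leaves b*; the smart seq drops the leading Eps
  assert_eq(ab_star.deriv('a'), b_star)
  // Consuming 'b' from b* leaves b* again
  assert_eq(b_star.deriv('b'), b_star)
  // Any other character is a dead end
  assert_eq(ab_star.deriv('b'), nil)
  assert_eq(b_star.deriv('c'), nil)
}
```

Note how the smart constructors keep the results small: without them, the first derivative would be Seq(Eps, Rep(Chr('b'))) rather than Rep(Chr('b')).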

To simplify our implementation, we only perform strict matching—the pattern must match the entire input string. Therefore, we only check for nullability after the entire input has been consumed:

fn Exp::matches(self : Exp, s : String) -> Bool {
  loop (self, s.view()) {
    (Nil, _) => {
      return false
    }
    (e, []) => {
      return e.nullable()
    }
    (e, [c, .. s]) => {
      continue (e.deriv(c), s)
    }
  }
}
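
Putting the pieces together, a usage sketch for the derivative matcher on (ab*)? might look like this. It assumes all of the Exp functions above are in scope; the name re is ours:

```moonbit
test "whole-string matching with derivatives" {
  let re = Exp::of_ast(Ast::chr('a').seq(Ast::chr('b').rep()).opt())
  assert_true(re.matches(""))
  assert_true(re.matches("a"))
  assert_true(re.matches("abbb"))
  // A leading 'b' drives the expression to Nil immediately
  assert_false(re.matches("b"))
  // Strict matching: a matching prefix is not enough
  assert_false(re.matches("aba"))
}
```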

Virtual Machine

The VM approach compiles regular expressions into bytecode instructions for a simple virtual machine. This method transforms the pattern-matching problem into program execution, where the VM simulates all possible paths through a non-deterministic finite automaton simultaneously.

Ken Thompson's 1968 paper described a regex engine that compiled patterns into IBM 7094 machine code. The key insight was to avoid exponential backtracking by maintaining multiple execution threads that advance through input in lockstep, processing one character at a time across all possible matching paths.

Instruction Set and Program Representation

The VM operates on four fundamental instructions that correspond to NFA operations:

enum Ops {
  Done
  Char(Char)
  Jump(Int)
  Fork(Int)
} derive(Show)

Each instruction serves a specific purpose in NFA simulation. Done marks successful completion of pattern matching, equivalent to Thompson's original match. Char(c) consumes input character c and advances to the next instruction. Jump(addr) provides unconditional jump to instruction at address addr (Thompson's jmp). Fork(addr) creates two execution paths—one continues to the next instruction, another jumps to addr (Thompson's split).

The Fork instruction is crucial for handling non-determinism in patterns like alternation and repetition, where multiple execution paths must be explored simultaneously. This maps directly to NFA ε-transitions, where execution can spontaneously branch without consuming input.

We define a Prg that wraps an array of instructions with convenience methods for building and manipulating bytecode programs.

struct Prg(Array[Ops]) derive(Show)

fn Prg::push(self : Prg, inst : Ops) -> Unit {
  self.0.push(inst)
}

fn Prg::length(self : Prg) -> Int {
  self.0.length()
}

fn Prg::op_set(self : Prg, index : Int, inst : Ops) -> Unit {
  self.0[index] = inst
}

AST Compilation to Bytecode

The Prg::of_ast function translates AST patterns into VM instructions using standard NFA construction techniques:

Seq(a, b):
```
code for a
code for b
```

Rep(a, None) (unbounded repetition):

    Fork L1, L2
L1: code for a
    Jump L1
L2:

Rep(a, Some(n)) (fixed repetition):

code for a
code for a
... (n times) ...

Opt(a) (optional):
```
    Fork L1, L2
L1: code for a
L2:
```

Note that the Fork constructor only accepts one address, because we always want to proceed to the next instruction after the Fork.

fn Prg::of_ast(ast : Ast) -> Prg {
  fn compile(prog : Prg, ast : Ast) -> Unit {
    match ast {
      Chr(chr) => prog.push(Char(chr))
      Seq(l, r) => {
        compile(prog, l)
        compile(prog, r)
      }
      Rep(e, None) => {
        let fork = prog.length()
        prog.push(Fork(0))
        compile(prog, e)
        prog.push(Jump(fork))
        prog[fork] = Fork(prog.length())
      }
      Rep(e, Some(n)) =>
        for _ in 0..<n {
          compile(prog, e)
        }
      Opt(e) => {
        let fork_inst = prog.length()
        prog.push(Fork(0))
        compile(prog, e)
        prog[fork_inst] = Fork(prog.length())
      }
    }
  }

  let prog : Prg = []
  compile(prog, ast)
  prog.push(Done)
  prog
}
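
As a worked example of the compilation scheme, here is a sketch that compiles ab* and checks the resulting layout. It assumes Ast, Ops, Prg, and Prg::of_ast above are in scope; the instruction listing in the comment was traced by hand from the compile function, and the name prog is ours:

```moonbit
test "bytecode for ab*" {
  let prog = Prg::of_ast(Ast::chr('a').seq(Ast::chr('b').rep()))
  // Layout traced by hand from the compile function:
  //   0: Char('a')
  //   1: Fork(4)    placeholder Fork(0), patched after the loop body is emitted
  //   2: Char('b')
  //   3: Jump(1)    back to the Fork for another iteration
  //   4: Done
  assert_eq(prog.length(), 5)
  assert_true(prog.0[1] is Fork(4))
  assert_true(prog.0[3] is Jump(1))
  assert_true(prog.0[4] is Done)
}
```

The Fork target is only known once the loop body has been compiled, which is why the compiler first pushes a placeholder and then overwrites it through op_set.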

VM Execution Loop

In Rob Pike's implementation, the VM executes one step past the end of the input string to handle the final acceptance state. To make this explicit, our matches function implements the core VM execution loop in two phases:

Phase 1 handles character processing. For each input character, it processes all active threads in the current context. Char instructions that match the current character create new threads in the next context. Jump and Fork instructions immediately spawn new threads in the current context. After processing all threads, it swaps contexts and continues with the next character.

Phase 2 handles final acceptance. After consuming all input, it processes remaining threads looking for Done instructions. It handles any final Jump/Fork instructions that don't consume input. It returns true if any thread reaches a Done instruction.

fn Prg::matches(self : Prg, data : @string.View) -> Bool {
  let Prg(prog) = self
  let mut curr = Ctx::new(prog.length())
  let mut next = Ctx::new(prog.length())
  curr.add(0)
  for c in data {
    while curr.pop() is Some(pc) {
      match prog[pc] {
        Done => ()
        Char(char) if char == c => {
          next.add(pc + 1)
        }
        Jump(jump) =>
          curr.add(jump)
        Fork(fork) => {
          curr.add(fork)
          curr.add(pc + 1)
        }
        _ => ()
      }
    }
    let temp = curr
    curr = next
    next = temp
    next.reset()
  }
  while curr.pop() is Some(pc) {
    match prog[pc] {
      Done => return true
      Jump(x) => curr.add(x)
      Fork(x) => {
        curr.add(x)
        curr.add(pc + 1)
      }
      _ => ()
    }
  }
  false
}

In the original blog post, Rob Pike uses a recursive function to handle Fork and Jump instructions so that threads are executed according to their priorities. Instead, we use a stack-like structure to manage all threads of execution, which naturally respects thread priority:

struct Ctx {
  deque : @deque.Deque[Int]
  visit : FixedArray[Bool]
}

fn Ctx::new(length : Int) -> Ctx {
  { deque: @deque.new(), visit: FixedArray::make(length, false) }
}

fn Ctx::add(self : Ctx, pc : Int) -> Unit {
  if !self.visit[pc] {
    self.deque.push_back(pc)
    self.visit[pc] = true
  }
}

fn Ctx::pop(self : Ctx) -> Int? {
  match self.deque.pop_back() {
    Some(pc) => {
      self.visit[pc] = false
      Some(pc)
    }
    None => None
  }
}

fn Ctx::reset(self : Ctx) -> Unit {
  self.deque.clear()
  self.visit.fill(false)
}

The visit array is used to drop low-priority threads. When a new thread is added, we first check if it is already in the deque using the visit array. If it is, we drop it; otherwise, we add it to the deque and mark it as visited. This mechanism is necessary to avoid infinite loops or exponential blowup when the regex contains patterns that can be expanded indefinitely, such as (a?)*.
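
To see the effect of the visit array in isolation, here is a minimal sketch. It is not the article's Ctx: the helper add_once is a hypothetical name, and a plain Array stands in for the deque so the snippet needs nothing beyond the built-in types. It shows a duplicate pc being dropped and the most recently added thread being popped first.

```moonbit
// Hypothetical helper: queue `pc` only if it is not already pending.
fn add_once(pending : Array[Int], visit : FixedArray[Bool], pc : Int) -> Unit {
  if !visit[pc] {
    pending.push(pc)
    visit[pc] = true
  }
}

test "duplicate threads are dropped" {
  let pending : Array[Int] = []
  let visit = FixedArray::make(4, false)
  add_once(pending, visit, 2)
  add_once(pending, visit, 2) // same pc again: dropped
  add_once(pending, visit, 1)
  inspect(pending, content="[2, 1]")
  // Popping from the back runs the most recently added thread first.
  inspect(pending.pop(), content="Some(1)")
  inspect(pending.pop(), content="Some(2)")
}
```

Without the visit check, the second add_once(…, 2) would queue the same program counter twice, and a looping program could keep re-queuing it forever.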

Benchmarks and Performance Analysis

The benchmark runs both approaches on a pathological case that challenges many regex implementations:

test (b : @bench.T) {
  let n = 15
  let txt = "a".repeat(n)
  let chr = Ast::chr('a')
  let ast : Ast = chr.opt().rep(n~).seq(chr.rep(n~))
  let exp = Exp::of_ast(ast)
  b.bench(name="derive", () => exp.matches(txt) |> ignore())
  let tvm = Prg::of_ast(ast)
  b.bench(name="thompson", () => tvm.matches(txt) |> ignore())
}

The pattern (a?){n}a{n} is a classic exponential-blowup case for backtracking engines. Each of the n optional a? groups can either consume a character or match the empty string, so a naive backtracking matcher may try up to $2^n$ combinations before it finds one that matches n 'a' characters.
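
As a rough illustration of why this input is hard for a backtracking matcher: each of the n optional a? groups can either take a character or skip it, which gives 2^n take/skip combinations in the worst case. The sketch below only counts those combinations (the function name choices is made up for this example); it does not run any matcher.

```moonbit
// Hypothetical helper: number of take/skip assignments for n optional groups.
fn choices(n : Int) -> Int {
  let mut total = 1
  for _ in 0..<n {
    total = total * 2 // each optional group doubles the possibilities
  }
  total
}

test "search space for n = 15" {
  inspect(choices(15), content="32768")
}
```

With n = 15, as in the benchmark, that is 32768 possible assignments, while both engines measured here avoid enumerating them one by one.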

name     time (mean ± σ)         range (min … max)
derive     41.78 µs ±   0.14 µs    41.61 µs …  42.13 µs  in 10 ×   2359 runs
thompson   12.79 µs ±   0.04 µs    12.74 µs …  12.84 µs  in 10 ×   7815 runs

The benchmark results show that the VM approach is significantly faster than the derivative-based approach for this case. The derivative method frequently allocates intermediate regex structures, leading to higher overhead and slower performance. In contrast, the VM executes a fixed set of instructions and rarely allocates new structures once the deque grows to its full size.

However, the derivative approach is easier to reason about. We can easily prove termination of the algorithm, as the number of derivatives to be computed is bounded by the size of the AST and strictly decreases with each recursive application of the deriv function. The VM approach, on the other hand, can potentially run indefinitely if the input Prg contains infinite loops, and requires careful handling of thread priority to avoid infinite loops and exponential blowup in the number of threads.

Prettyprinter: Declarative Structured Data Formatting with Function Composition

September 3, 2025 · 8 min read

When working with structured data, printing it in a clear and adaptable format is a common challenge. This comes up often in debugging, logging, and code generation. For instance, an array literal [a,b,c] should ideally print on one line if the screen is wide enough, but gracefully wrap and indent when space is limited.

Traditional solutions often rely on manually concatenating strings while tracking indentation levels. This approach is not only tedious, but also error-prone.

A more elegant solution is to use function composition. With this approach, we build a prettyprinter: a system where users combine primitive formatting functions into a Doc structure that describes the intended layout. Given a maximum width, the prettyprinter automatically chooses the most readable formatting.

This makes the printing process declarative—you specify what the layout should look like under different conditions, and the system figures out how to render it.

SimpleDoc Primitives

We begin with a minimal representation called SimpleDoc. It consists of just four primitives:

enum SimpleDoc {
  Empty
  Line
  Text(String)
  Cat(SimpleDoc, SimpleDoc)
}

Empty: represents an empty string
Line: represents a newline
Text(String): plain text without line breaks
Cat(SimpleDoc, SimpleDoc): concatenates two SimpleDocs

Using these primitives, we can implement a simple rendering function. It flattens a SimpleDoc into a string using a stack-based traversal:

fn SimpleDoc::render(doc : SimpleDoc) -> String {
  let buf = StringBuilder::new()
  let stack = [doc]
  while stack.pop() is Some(doc) {
    match doc {
      Empty => ()
      Line => {
        buf..write_string("\n")
      }
      Text(text) => {
        buf.write_string(text)
      }
      Cat(left, right) =>
        stack..push(right)..push(left)
    }
  }
  buf.to_string()
}

Here’s a quick test. It shows that SimpleDoc is as expressive as String: Empty corresponds to "", Line corresponds to "\n", Text("a") corresponds to "a", and Cat(Text("a"), Text("b")) corresponds to "a" + "b".

test "simple doc" {
  let doc : SimpleDoc = Cat(Text("hello"), Cat(Line, Text("world")))
  inspect(
    doc.render(),
    content=(
      #|hello
      #|world
    ),
  )
}

At this stage, the SimpleDoc doesn’t yet handle indentation or layout choices—but we’re about to fix that.

ExtendDoc: Nest, Choice, Group

To handle real-world formatting, we extend SimpleDoc with three new primitives:

enum ExtendDoc {
  Empty
  Line
  Text(String)
  Cat(ExtendDoc, ExtendDoc)
  Nest(Int, ExtendDoc)
  Choice(ExtendDoc, ExtendDoc)
  Group(ExtendDoc)
}

Nest(Int, ExtendDoc): indents the doc by the given number of spaces after each line break. Nested levels accumulate.
Choice(ExtendDoc, ExtendDoc): stores two alternative layouts. Usually, the first parameter is the more compact layout without line breaks, and the second is the layout with Lines. The renderer uses the first layout in compact mode and the second otherwise.
Group(ExtendDoc): groups an ExtendDoc and decides between compact and non-compact layout based on the available width. If the remaining space is sufficient, it prints compactly; otherwise, it falls back to the layout with line breaks.

Measuring Space

To know whether the compact layout fits, we need a way to estimate how many characters a document would require:

let max_space = 9999

fn ExtendDoc::space(self : Self) -> Int {
  match self {
    Empty => 0
    Line => max_space
    Text(str) => str.length()
    Cat(a, b) => a.space() + b.space()
    Nest(_, a) | Choice(a, _) | Group(a) => a.space()
  }
}

Here, Line is treated as requiring “infinite” space. This guarantees that if a Group contains a line break, it won’t attempt to print compactly.
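
The effect of the "infinite" Line can be checked on a trimmed-down model. The sketch below is an illustration only: MiniDoc and mini_max_space are hypothetical names chosen so the snippet stands on its own, and the type keeps just three of the primitives. It shows that a document without line breaks measures as its text length, while any document containing a Line measures as wider than a typical 80-column limit.

```moonbit
// Hypothetical, reduced version of the document type: three primitives only.
enum MiniDoc {
  Line
  Text(String)
  Cat(MiniDoc, MiniDoc)
}

// Stand-in for "infinite" width, mirroring the 9999 used above.
let mini_max_space = 9999

fn MiniDoc::space(self : MiniDoc) -> Int {
  match self {
    Line => mini_max_space
    Text(s) => s.length()
    Cat(a, b) => a.space() + b.space()
  }
}

test "a line break never fits in compact mode" {
  let flat : MiniDoc = Cat(Text("("), Text("expr"))
  let broken : MiniDoc = Cat(Text("("), Cat(Line, Text("expr")))
  inspect(flat.space(), content="5")
  inspect(broken.space() > 80, content="true")
}
```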

Rendering ExtendDoc

We extend SimpleDoc::render to implement ExtendDoc::render. Since after printing a substructure we need to return to the original indentation level, the stack must also store two states for each pending ExtendDoc: indentation and whether compact mode is active. We also maintain a column variable to track the number of characters already used on the current line, in order to calculate remaining space. Finally, the function adds a width parameter to specify the maximum line width.

fn ExtendDoc::render(doc : ExtendDoc, width~ : Int = 80) -> String {
  let buf = StringBuilder::new()
  let stack = [(0, false, doc)] // default: no indentation, non-compact mode
  let mut column = 0
  while stack.pop() is Some((indent, fit, doc)) {
    match doc {
      Empty => ()
      Line => {
        buf..write_string("\n")
        for _ in 0..<indent {
          buf.write_string(" ")
        }
        column = indent
      }
      Text(text) => {
        buf.write_string(text)
        column += text.length()
      }
      Cat(left, right) =>
        stack..push((indent, fit, right))..push((indent, fit, left))
      Nest(n, doc) => stack..push((indent + n, fit, doc))
      Choice(a, b) =>
        stack.push(if fit { (indent, fit, a) } else { (indent, fit, b) })
      Group(doc) => {
        let fit = fit || column + doc.space() <= width
        stack.push((indent, fit, doc))
      }
    }
  }
  buf.to_string()
}

Let’s use ExtendDoc to describe (expr) and print it at different widths:

let softline : ExtendDoc = Choice(Empty, Line)

impl Add for ExtendDoc with op_add(a, b) {
  Cat(a, b)
}

test "tuple" {
  let tuple : ExtendDoc = Group(
    Text("(") + Nest(2, softline + Text("expr")) + softline + Text(")"),
  )
  inspect(tuple.render(width=40), content="(expr)")
  inspect(
    tuple.render(width=5),
    content=(
      #|(
      #|  expr
      #|)
    ),
  )
}

Here, softline is defined as a choice between Empty and Line. Since rendering starts in non-compact mode, we wrap the whole expression with Group. When the width is sufficient, the entire expression prints on one line; otherwise, it automatically wraps with indentation. To improve readability, we overloaded the + operator for ExtendDoc.

Composition Functions

In practice, users rely more on higher-level combinators built from the ExtendDoc primitives—like the softline above. Let’s introduce some useful functions for structured printing.

softline & softbreak

let softbreak : ExtendDoc = Choice(Text(" "), Line)

Similar to softline, except that in compact mode it inserts a space. Note that within the same Group, all Choices follow the same compact or non-compact decision.

let abc : ExtendDoc = Text("abc")
let def : ExtendDoc = Text("def")
let ghi : ExtendDoc = Text("ghi")

test "softbreak" {
  let doc : ExtendDoc = Group(abc + softbreak + def + softbreak + ghi)
  inspect(doc.render(width=20), content="abc def ghi")
  inspect(
    doc.render(width=10),
    content=(
      #|abc
      #|def
      #|ghi
    ),
  )
}

autoline & autobreak

let autoline : ExtendDoc = Group(softline)
let autobreak : ExtendDoc = Group(softbreak)

autoline and autobreak each decide independently whether to break, so as much content as possible fits on each line, much like word wrapping in a text editor.

test {
  let doc : ExtendDoc = Group(
    abc + autobreak + def + autobreak + ghi,
  )
  inspect(doc.render(width=10), content="abc def ghi")
  inspect(
    doc.render(width=5),
    content=(
      #|abc def
      #|ghi
    ),
  )
  inspect(
    doc.render(width=3),
    content=(
      #|abc
      #|def
      #|ghi
    ),
  )
}

sepby

fn sepby(xs : Array[ExtendDoc], sep : ExtendDoc) -> ExtendDoc {
  match xs {
    [] => Empty
    [x, .. xs] => xs.fold(init=x, (a, b) => a + sep + b)
  }
}

sepby inserts a separator sep between ExtendDocs.

let comma : ExtendDoc = Text(",")
test {
  let layout = Group(sepby([abc, def, ghi], comma + softbreak))
  inspect(layout.render(width=40), content="abc, def, ghi")
  inspect(
    layout.render(width=10),
    content=(
      #|abc,
      #|def,
      #|ghi
    ),
  )
}

surround

fn surround(m : ExtendDoc, l : ExtendDoc, r : ExtendDoc) -> ExtendDoc {
  l + m + r
}

surround wraps an ExtendDoc with left and right delimiters.

test {
  inspect(surround(abc, Text("("), Text(")")).render(), content="(abc)")
}
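
The combinators so far compose naturally. As a minimal sketch, here is a bracketed, comma-separated list built from sepby and surround. The helper name bracketed is hypothetical and not part of the article's code; the block assumes the ExtendDoc type, render, softline, softbreak, comma, sepby, surround, and the abc/def/ghi documents defined above.

```moonbit
// Hypothetical helper (not in the article): a bracketed, comma-separated list.
// Relies on ExtendDoc, softline, softbreak, comma, sepby, and surround
// as defined earlier in this article.
fn bracketed(xs : Array[ExtendDoc]) -> ExtendDoc {
  // Elements separated by "," followed by a space-or-newline choice.
  let items = sepby(xs, comma + softbreak)
  // Indent the elements by 2 when broken; the closing bracket is not indented.
  Group(surround(Nest(2, softline + items) + softline, Text("["), Text("]")))
}

test "bracketed list (sketch)" {
  let list = bracketed([abc, def, ghi])
  // A generous width: the whole group should stay on one line.
  println(list.render(width=40))
  // A narrow width: the group's choices should all switch to line breaks.
  println(list.render(width=10))
}
```

Because every Choice sits in a single Group, the list is laid out either entirely on one line or with each element on its own indented line, never a mix of the two.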

Printing JSON

Using the functions above, we can implement a JSON prettyprinter. This function recursively processes each JSON element and generates the appropriate layout.

fn pretty(x : Json) -> ExtendDoc {
  fn comma_list(xs, l, r) {
    (Nest(2, softline + sepby(xs, comma + softbreak)) + softline)
    |> surround(l, r)
    |> Group
  }

  match x {
    Array(elems) => {
      let elems = elems.iter().map(pretty).collect()
      comma_list(elems, Text("["), Text("]"))
    }
    Object(pairs) => {
      let pairs = pairs
        .iter()
        .map(p => Group(Text(p.0.escape()) + Text(": ") + pretty(p.1)))
        .collect()
      comma_list(pairs, Text("{"), Text("}"))
    }
    String(s) => Text(s.escape())
    Number(i) => Text(i.to_string())
    False => Text("false")
    True => Text("true")
    Null => Text("null")
  }
}

When rendered, the JSON automatically adapts to different widths:

test {
  let json : Json = {
    "key1": "string",
    "key2": [12345, 67890],
    "key3": [
      { "field1": 1, "field2": 2 },
      { "field1": 1, "field2": 2 },
      { "field1": [1, 2], "field2": 2 },
    ],
  }
  inspect(
    pretty(json).render(width=80),
    content=(
      #|{
      #|  "key1": "string",
      #|  "key2": [12345, 67890],
      #|  "key3": [
      #|    {"field1": 1, "field2": 2},
      #|    {"field1": 1, "field2": 2},
      #|    {"field1": [1, 2], "field2": 2}
      #|  ]
      #|}
    ),
  )
  inspect(
    pretty(json).render(width=30),
    content=(
      #|{
      #|  "key1": "string",
      #|  "key2": [12345, 67890],
      #|  "key3": [
      #|    {"field1": 1, "field2": 2},
      #|    {"field1": 1, "field2": 2},
      #|    {
      #|      "field1": [1, 2],
      #|      "field2": 2
      #|    }
      #|  ]
      #|}
    ),
  )
  inspect(
    pretty(json).render(width=20),
    content=(
      #|{
      #|  "key1": "string",
      #|  "key2": [
      #|    12345,
      #|    67890
      #|  ],
      #|  "key3": [
      #|    {
      #|      "field1": 1,
      #|      "field2": 2
      #|    },
      #|    {
      #|      "field1": 1,
      #|      "field2": 2
      #|    },
      #|    {
      #|      "field1": [
      #|        1,
      #|        2
      #|      ],
      #|      "field2": 2
      #|    }
      #|  ]
      #|}
    ),
  )
}

Conclusion

By combining a small set of primitives with function composition, we can build a flexible, declarative prettyprinter that adapts structured data layouts to the available screen width.

This approach scales well: you describe layout intentions with combinators like sepby, surround, or autobreak, and the rendering engine takes care of indentation, line breaks, and fitting.

The current implementation can be further optimized:

Memoizing space calculations to improve performance.
Adding a ribbon parameter to balance whitespace vs. content density.
Supporting advanced layouts like hanging indents or mandatory line breaks.

For a deeper dive, see Philip Wadler’s classic paper A prettier printer, as well as prettyprinter libraries in Haskell, OCaml, and other languages.

Mini-adapton: incremental computation in MoonBit

August 27, 2025 · 10 min read

Introduction

Let's first illustrate what incremental computation looks like with an example similar to a spreadsheet. First, define a dependency graph like this:

In this graph, t1's value is computed from n1 + n2 and t2's value is computed from t1 + n3.

When we ask for the value of t2, the computation defined by the graph runs: first t1 is computed as n1 + n2, then t2 is computed as t1 + n3. So far, this is the same as non-incremental computation.

However, once we start changing the values of n1, n2, or n3, things are different. Say we swap the values of n1 and n2 and then read t2 again. In non-incremental computation, both t1 and t2 are recomputed. But recomputing t2 is unnecessary, because neither of its dependencies, t1 and n3, has changed (swapping n1 and n2 does not change t1's value).
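
For contrast, here is a minimal sketch of the non-incremental baseline, written with plain closures and run counters. The helper baseline_runs and its counters are hypothetical names introduced only for illustration; nothing here uses the Cell or Thunk types.

```moonbit
// Hypothetical helper for illustration only; it is not part of the article's
// Cell/Thunk library. It evaluates the graph twice with plain closures and
// returns (value of t2 after the swap, runs of t1, runs of t2).
fn baseline_runs() -> (Int, Int, Int) {
  let mut n1 = 1
  let mut n2 = 2
  let n3 = 3
  let mut t1_runs = 0
  let mut t2_runs = 0
  // t1 = n1 + n2, recomputed on every call
  let t1 = fn() {
    t1_runs += 1
    n1 + n2
  }
  // t2 = t1 + n3, recomputed on every call
  let t2 = fn() {
    t2_runs += 1
    t1() + n3
  }
  let _ = t2() // first read: both nodes run once
  // Swap n1 and n2; their sum, and therefore t1's value, is unchanged.
  let tmp = n1
  n1 = n2
  n2 = tmp
  let value = t2() // second read: both nodes run again anyway
  (value, t1_runs, t2_runs)
}

test "non-incremental baseline" {
  let (value, t1_runs, t2_runs) = baseline_runs()
  inspect(value, content="6")
  inspect(t1_runs, content="2")
  inspect(t2_runs, content="2")
}
```

Both counters end at 2 even though the second read of t2 produces the same value. An incremental system aims to avoid that second run of t2.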

The following code example does exactly what we described above. We use Cell::new to define n1, n2, and n3, which hold plain values and need no computation, and Thunk::new to define t1 and t2, which are computed from other nodes.

test {
  // a counter to record how many times t2 is computed
  let mut cnt = 0
  // start defining the graph
  let n1 = Cell::new(1)
  let n2 = Cell::new(2)
  let n3 = Cell::new(3)
  let t1 = Thunk::new(fn() { n1.get() + n2.get() })
  let t2 = Thunk::new(fn() {
    cnt += 1
    t1.get() + n3.get()
  })
  // get the value of t2
  inspect(t2.get(), content="6")
  inspect(cnt, content="1")
  // swap the values of n1 and n2
  n1.set(2)
  n2.set(1)
  inspect(t2.get(), content="6")
  // t2 does not recompute
  inspect(cnt, content="1")
}

In this article, we will show how to implement an incremental computation library in MoonBit with the API used in the above example:

Cell::new
Cell::get
Cell::set
Thunk::new
Thunk::get

Problem Analysis and Solution

To implement the library, there are three main problems to solve:

Build up dependency graph on the fly

As a library in MoonBit, we have no easy way to build the dependency graph statically, since MoonBit does not currently have any metaprogramming mechanism. Therefore, we need to construct the dependency graph on the fly. Since all we care about is which cells/thunks a thunk depends on, a good moment to build the dependency graph is when the user calls Thunk::get. Take the code above as an example:

let n1 = Cell::new(1)
let n2 = Cell::new(2)
let n3 = Cell::new(3)
let t1 = Thunk::new(fn() { n1.get() + n2.get() })
let t2 = Thunk::new(fn() { t1.get() + n3.get() })
t2.get()

When the user calls t2.get(), we know at runtime that t1.get() and n3.get() are called inside it. Therefore, t1 and n3 are dependencies of t2, and we can construct a subgraph:

The same thing happens when t1.get() is called inside t2.get().

So here is the plan:

We declare a stack to record which thunk we are currently getting. We use a stack because we are essentially recording the call stack of every get.
Whenever we call get on a node, we mark that node as a dependency of the thunk on top of the stack. If the node is a thunk, we push it onto the stack.
Whenever a thunk's get finishes, we pop it off the stack.

Let's see the full process of the above example under this algorithm:

When we call t2.get, push t2 onto the stack.
When we call t1.get inside t2.get, mark t1 as a dependency of t2 and push t1 onto the stack.
When we call n1.get inside t1.get, mark n1 as a dependency of t1.
The same goes for n2.
When t1.get finishes, pop it from the stack.
When we call n3.get, mark n3 as a dependency of t2.
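
The walkthrough above can be replayed with a tiny standalone sketch that tracks nodes by name only. Everything in it (demo_stack, demo_edges, record, get_cell, get_thunk) is a hypothetical helper made up for illustration, not part of the library built in this article, and it assumes a recent MoonBit toolchain:

```moonbit
// Hypothetical illustration only: nodes are plain names, no values are computed.
let demo_stack : Array[String] = []

// Each entry is (dependent, dependency).
let demo_edges : Array[(String, String)] = []

// Record `name` as a dependency of the thunk currently on top of the stack.
fn record(name : String) -> Unit {
  if demo_stack.last() is Some(top) {
    demo_edges.push((top, name))
  }
}

// Getting a cell only records the edge.
fn get_cell(name : String) -> Unit {
  record(name)
}

// Getting a thunk records the edge, then runs its body with itself on top.
fn get_thunk(name : String, body : () -> Unit) -> Unit {
  record(name)
  demo_stack.push(name)
  body()
  demo_stack.unsafe_pop() |> ignore
}

test {
  // Mirrors t2 = t1 + n3 and t1 = n1 + n2.
  get_thunk("t2", fn() {
    get_thunk("t1", fn() {
      get_cell("n1")
      get_cell("n2")
    })
    get_cell("n3")
  })
  // Edges are recorded in call order: t2->t1, t1->n1, t1->n2, t2->n3.
  inspect(demo_edges.length(), content="4")
  let (first_from, first_to) = demo_edges[0]
  inspect(first_from, content="t2")
  inspect(first_to, content="t1")
  let (last_from, last_to) = demo_edges[3]
  inspect(last_from, content="t2")
  inspect(last_to, content="n3")
  // Every push was matched by a pop.
  inspect(demo_stack.length(), content="0")
}
```

The outermost get_thunk call records no edge because the stack is empty at that point, which matches the first step of the walkthrough.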

Besides the edge from dependent to dependency, we should also record an edge from dependency to dependent, so that we can easily traverse the graph backwards when needed.

In the code below, we'll use outgoing_edges to refer to edges from parent (dependent) to child (dependency) and incoming_edges to refer to edges in the opposite direction.

A mechanism to mark outdated nodes

Whenever we call Cell::set, the node itself and all nodes that depend on it should be marked as outdated. This will be one of the criteria for determining whether a thunk needs to be recomputed. It is essentially a recursive backward traversal starting from a leaf of the graph. We can describe the process as pseudo MoonBit code:

fn dirty(node: Node) -> Unit {
  for n in node.incoming_edges {
    n.set_dirty(true)
    // recurse into the dependent, not into `node` itself
    dirty(n)
  }
}

Determine whether a thunk needs to be recomputed

Whenever we call Thunk::get, we need to determine whether it really needs to be recomputed. But the dirty mechanism we described in the last subsection is not enough. If we only used dirtiness to decide whether a thunk needs to be recomputed, there would be unneeded computation. Let's look at the example we gave at the beginning:

n1.set(2)
n2.set(1)
inspect(t2.get(), content="6")

After we swap the values of n1 and n2, n1, n2, t1, and t2 are all marked as dirty, but when we call t2.get, there is no need to recompute t2, since the value of t1 does not change.

This tells us that besides dirtiness, we also need to record whether a node's value differs from its last value. A node needs to be recomputed only if it is dirty and at least one of its dependencies' values has changed.

We can describe the algorithm as the pseudo MoonBit code below:

fn propagate(self: Node) -> Unit {
  // When a node is dirty, it might need to be recomputed
  if self.is_dirty() {
    // after recomputing, it's no longer dirty
    self.set_dirty(false)
    for dependency in self.outgoing_edges() {
      // recursively recompute every dependency
      dependency.propagate()
      // If a dependency's value changed, the node needs to be recomputed
      if dependency.is_changed() {
        // remove all incoming_edges and outgoing_edges, since they will be reconstructed during evaluate
        self.incoming_edges().clear()
        self.outgoing_edges().clear()
        self.evaluate()
        return
      }
    }
  }
}

Implementation

Given the algorithms described in the last section, the implementation should be quite straightforward.

First, let's define Cell:

struct Cell[A] {
  mut is_dirty : Bool
  mut value : A
  mut is_changed : Bool
  incoming_edges : Array[&Node]
}

Since a Cell can only be a leaf node in the dependency graph, it does not have outgoing_edges. The trait Node here is used to abstract over the nodes in the dependency graph.

Then, let's define Thunk:

struct Thunk[A] {
  mut is_dirty : Bool
  mut value : A?
  mut is_changed : Bool
  thunk : () -> A
  incoming_edges : Array[&Node]
  outgoing_edges : Array[&Node]
}

Thunk's value is optional, since it only exists after we first call Thunk::get.

We can easily add new for both types:

fn[A : Eq] Cell::new(value : A) -> Cell[A] {
  Cell::{
    is_changed: false,
    value,
    incoming_edges: [],
    is_dirty: false,
  }
}

fn[A : Eq] Thunk::new(thunk : () -> A) -> Thunk[A] {
  Thunk::{
    value: None,
    is_changed: false,
    thunk,
    incoming_edges: [],
    outgoing_edges: [],
    is_dirty: false,
  }
}

Thunk and Cell are the two kinds of nodes in the dependency graph. We can use the trait Node mentioned above to abstract over them:

trait Node {
  is_dirty(Self) -> Bool
  set_dirty(Self, Bool) -> Unit
  incoming_edges(Self) -> Array[&Node]
  outgoing_edges(Self) -> Array[&Node]
  is_changed(Self) -> Bool
  evaluate(Self) -> Unit
}

And implement the trait for both types:

impl[A] Node for Cell[A] with incoming_edges(self) {
  self.incoming_edges
}

impl[A] Node for Cell[A] with outgoing_edges(_self) {
  []
}

impl[A] Node for Cell[A] with is_dirty(self) {
  self.is_dirty
}

impl[A] Node for Cell[A] with set_dirty(self, new_dirty) {
  self.is_dirty = new_dirty
}

impl[A] Node for Cell[A] with is_changed(self) {
  self.is_changed
}

impl[A] Node for Cell[A] with evaluate(_self) {
  ()
}

impl[A : Eq] Node for Thunk[A] with is_changed(self) {
  self.is_changed
}

impl[A : Eq] Node for Thunk[A] with outgoing_edges(self) {
  self.outgoing_edges
}

impl[A : Eq] Node for Thunk[A] with incoming_edges(self) {
  self.incoming_edges
}

impl[A : Eq] Node for Thunk[A] with is_dirty(self) {
  self.is_dirty
}

impl[A : Eq] Node for Thunk[A] with set_dirty(self, new_dirty) {
  self.is_dirty = new_dirty
}

impl[A : Eq] Node for Thunk[A] with evaluate(self) {
  // push self onto the top of node_stack
  // now self is the active target
  node_stack.push(self)
  // `self.thunk` might contain calls to `source.get()`,
  // such as `s1.get()`, `s2.get()` and `s3.get()`
  //
  // when `Thunk::get` or `Cell::get` is called,
  // it treats `node_stack.last()` as its own target.
  // if the source is a `Cell`, only its `incoming_edges` is recorded.
  // if the source is a `Thunk`, both `incoming_edges` and `outgoing_edges` are recorded, connecting the two nodes.
  //
  let value = (self.thunk)()
  self.is_changed = match self.value {
    None => true
    Some(v) => v != value
  }
  self.value = Some(value)
  // pop self from node_stack
  // now self is no longer the active target
  node_stack.unsafe_pop() |> ignore
}

The only complicated implementation is Thunk's evaluate. Here we first need to push the thunk onto the stack for dependency recording. node_stack is defined as below:

let node_stack : Array[&Node] = []

Then we do the real computation and compare the result with the last value to update self.is_changed. is_changed is used later to determine whether we need to recompute a thunk.

dirty and propagate are almost the same as the pseudo code described above:

fn &Node::dirty(self : &Node) -> Unit {
  for dependent in self.incoming_edges() {
    if not(dependent.is_dirty()) {
      dependent.set_dirty(true)
      dependent.dirty()
    }
  }
}

fn &Node::propagate(self : &Node) -> Unit {
  if self.is_dirty() {
    self.set_dirty(false)
    for dependency in self.outgoing_edges() {
      dependency.propagate()
      if dependency.is_changed() {
        self.incoming_edges().clear()
        self.outgoing_edges().clear()
        self.evaluate()
        return
      }
    }
  }
}

With all the foundations we have built, the three main APIs, Cell::get, Cell::set, and Thunk::get, are easy to implement.

To get the value from a cell, we simply return the value field of the struct. But before that, we first need to record the cell as a dependency if get is called inside Thunk::get.

fn[A] Cell::get(self : Cell[A]) -> A {
  if node_stack.last() is Some(target) {
    target.outgoing_edges().push(self)
    self.incoming_edges.push(target)
  }
  self.value
}

Whenever we set a cell, we first need to make sure that the two states is_changed and is_dirty are updated correctly. Then we mark every dependent as dirty.

fn[A : Eq] Cell::set(self : Cell[A], new_value : A) -> Unit {
  if self.value != new_value {
    self.is_changed = true
    self.value = new_value
    self.set_dirty(true)
    &Node::dirty(self)
  }
}

In Thunk::get, similar to Cell::get, we first need to record self as a dependency. After that, we pattern match on self.value. If it's None, this is the first time the user tries to get the thunk's value, so we can safely just evaluate it. If it's Some, we use propagate to make sure that we only recompute the thunks that really need it.

fn[A : Eq] Thunk::get(self : Thunk[A]) -> A {
  if node_stack.last() is Some(target) {
    target.outgoing_edges().push(self)
    self.incoming_edges.push(target)
  }
  match self.value {
    None => self.evaluate()
    Some(_) => &Node::propagate(self)
  }
  self.value.unwrap()
}
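
As a rough illustration of the dependency tracking described above, here is a simplified Python sketch. It is not the article's MoonBit implementation: it keeps only a dirty flag, drops the is_changed bookkeeping and the propagate step (so a dirty thunk is always recomputed), and the helper record_reader and the runs counter are hypothetical names added for the example.

```python
# Stack of thunks that are currently being evaluated.
node_stack = []


def record_reader(node):
    """Register the thunk on top of the stack as a reader of `node`."""
    if node_stack:
        reader = node_stack[-1]
        if reader not in node.incoming_edges:
            node.incoming_edges.append(reader)


class Cell:
    def __init__(self, value):
        self.value = value
        self.incoming_edges = []  # thunks that have read this cell

    def get(self):
        record_reader(self)
        return self.value

    def set(self, new_value):
        # Only a real change invalidates the readers.
        if self.value != new_value:
            self.value = new_value
            for reader in self.incoming_edges:
                reader.mark_dirty()


class Thunk:
    def __init__(self, compute):
        self.compute = compute
        self.value = None
        self.is_dirty = True      # never evaluated yet
        self.incoming_edges = []  # thunks that have read this thunk
        self.runs = 0             # how many times compute() actually ran

    def mark_dirty(self):
        if not self.is_dirty:
            self.is_dirty = True
            for reader in self.incoming_edges:
                reader.mark_dirty()

    def get(self):
        record_reader(self)
        if self.is_dirty:
            node_stack.append(self)
            try:
                self.value = self.compute()
                self.runs += 1
            finally:
                node_stack.pop()
            self.is_dirty = False
        return self.value


x = Cell(1)
y = Cell(2)
total = Thunk(lambda: x.get() + y.get())
first = total.get()   # evaluates the thunk
second = total.get()  # served from the cached value
x.set(10)             # marks `total` dirty
third = total.get()   # evaluates again
```

Setting a cell to the value it already holds leaves every reader clean, which is the same check Cell::set performs before calling dirty.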

Reference

Adapton: Composable, demand-driven incremental computation, PLDI 2014 original paper of adapton
illusory0x0/adapton.mbt adapton library in MoonBit

A Guide to MoonBit Python Integration

August 19, 2025 · 12 min read

Introduction

Python, with its concise syntax and vast ecosystem, has become one of the most popular programming languages today. However, discussions around its performance bottlenecks and the maintainability of its dynamic typing system in large-scale projects have never ceased. To address these challenges, the developer community has explored various optimization paths.

The python.mbt tool, officially launched by MoonBit, offers a new perspective. It allows developers to call Python code directly within the MoonBit environment. This combination aims to merge MoonBit's static type safety and high-performance potential with Python's mature ecosystem. Through python.mbt, developers can leverage MoonBit's static analysis capabilities, modern build and testing tools, while enjoying Python's rich library functions, making it possible to build large-scale, high-performance system-level software.

This article aims to delve into the working principles of python.mbt and provide a practical guide. It will answer common questions such as: How does python.mbt work? Is it slower than native Python due to an added intermediate layer? What are its advantages over existing tools like C++'s pybind11 or Rust's PyO3? To answer these questions, we first need to understand the basic workflow of the Python interpreter.

How the Python Interpreter Works

The Python interpreter executes code in three main stages:

Parsing: This stage includes lexical analysis and syntax analysis. The interpreter breaks down human-readable Python source code into tokens and then organizes these tokens into a tree-like structure, the Abstract Syntax Tree (AST), based on syntax rules.

For example, for the following Python code:

def add(x, y):
  return x + y

a = add(1, 2)
print(a)

We can use Python's ast module to view its generated AST structure:

Module(
    body=[
        FunctionDef(
            name='add',
            args=arguments(
                args=[
                    arg(arg='x'),
                    arg(arg='y')]),
            body=[
                Return(
                    value=BinOp(
                        left=Name(id='x', ctx=Load()),
                        op=Add(),
                        right=Name(id='y', ctx=Load())))]),
        Assign(
            targets=[
                Name(id='a', ctx=Store())],
            value=Call(
                func=Name(id='add', ctx=Load()),
                args=[
                    Constant(value=1),
                    Constant(value=2)])),
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Name(id='a', ctx=Load())]))])
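
A dump like the one above can be reproduced with a few lines of standard-library Python. This is a minimal sketch; the exact fields printed vary slightly between Python versions, and the variable names are only illustrative.

```python
import ast

SOURCE = """
def add(x, y):
  return x + y

a = add(1, 2)
print(a)
"""

# Parse the source text into an AST (lexing and parsing in one step).
tree = ast.parse(SOURCE)

# Pretty-print the tree; the indent argument needs Python 3.9 or newer.
dump = ast.dump(tree, indent=4)
print(dump)
```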

Compilation: Next, the Python interpreter compiles the AST into a lower-level, more linear intermediate representation called bytecode. This is a platform-independent instruction set designed for the Python Virtual Machine (PVM).

Using Python's dis module, we can view the bytecode corresponding to the above code:

  2           LOAD_CONST               0 (<code object add>)
              MAKE_FUNCTION
              STORE_NAME               0 (add)

  5           LOAD_NAME                0 (add)
              PUSH_NULL
              LOAD_CONST               1 (1)
              LOAD_CONST               2 (2)
              CALL                     2
              STORE_NAME               1 (a)

  6           LOAD_NAME                2 (print)
              PUSH_NULL
              LOAD_NAME                1 (a)
              CALL                     1
              POP_TOP
              RETURN_CONST             3 (None)
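
The listing above can be regenerated with the dis module, as in the following sketch. Opcode names and their order differ between CPython releases, so the output on another interpreter version will probably not match the listing line for line.

```python
import dis

SOURCE = """
def add(x, y):
  return x + y

a = add(1, 2)
print(a)
"""

# Compile the source text to a module-level code object.
code = compile(SOURCE, "<example>", "exec")

# Print a human-readable disassembly to standard output.
dis.dis(code)

# Collect the opcode names of the module-level instructions for inspection.
opnames = [instruction.opname for instruction in dis.get_instructions(code)]
```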

Execution: Finally, the Python Virtual Machine (PVM) executes the bytecode instructions one by one. Each instruction corresponds to a C function call in the CPython interpreter's underlying layer. For example, LOAD_NAME looks up a variable, and BINARY_OP performs a binary operation. It is this process of interpreting and executing instructions one by one that is the main source of Python's performance overhead. A simple 1 + 2 operation involves the entire complex process of parsing, compilation, and virtual machine execution.
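
The three stages can also be driven by hand from Python itself, which makes the separation between them visible. The sketch below uses only built-ins (ast.parse, compile, exec); the print call from the earlier example is left out so that the snippet produces no output.

```python
import ast

SOURCE = """
def add(x, y):
  return x + y

a = add(1, 2)
"""

tree = ast.parse(SOURCE)                   # 1. parsing: text -> AST
code = compile(tree, "<example>", "exec")  # 2. compilation: AST -> bytecode
namespace = {}
exec(code, namespace)                      # 3. execution: the VM runs the bytecode
result = namespace["a"]
```

Once a code object exists, it can be executed repeatedly without repeating the first two stages.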

Understanding this process helps us grasp the basic approaches to Python performance optimization and the design philosophy of python.mbt.

Paths to Optimizing Python Performance

Currently, there are two mainstream methods for improving Python program performance:

Just-In-Time (JIT) Compilation: Projects like PyPy analyze a running program and compile frequently executed "hotspot" bytecode into highly optimized native machine code, thereby bypassing the PVM's interpretation and significantly speeding up computationally intensive tasks. However, JIT is not a silver bullet; it cannot solve the inherent problems of Python's dynamic typing, such as the difficulty of effective static analysis in large projects, which poses challenges for software maintenance.
Native Extensions: Developers can use languages like C++ (with pybind11) or Rust (with PyO3) to directly call Python functions or to write performance-critical modules that are then called from Python. This method can achieve near-native performance, but it requires developers to be proficient in both Python and a complex system-level language, presenting a steep learning curve and a high barrier to entry for most Python programmers.

python.mbt is also a native extension. But compared to languages like C++ and Rust, it attempts to find a new balance between performance, ease of use, and engineering capabilities, with a greater emphasis on using Python features directly within the MoonBit language.

High-Performance Core: MoonBit is a statically typed, compiled language whose code can be efficiently compiled into native machine code. Developers can implement computationally intensive logic in MoonBit to achieve high performance from the ground up.
Seamless Python Calls: python.mbt interacts directly with CPython's C-API to call Python modules and functions. This means call overhead is minimized, bypassing Python's parsing and compilation stages and going straight to the virtual machine execution layer.
Gentler Learning Curve: Compared to C++ and Rust, MoonBit's syntax is more modern and concise. It also has comprehensive support for functional programming, a documentation system, unit testing, and static analysis tools, making it more friendly to developers accustomed to Python.
Improved Engineering and AI Collaboration: MoonBit's strong type system and clear interface definitions make code intent more explicit and easier for static analysis tools and AI-assisted programming tools to understand. This helps maintain code quality in large projects and improves the efficiency and accuracy of collaborative coding with AI.
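
To give a feel for what calling Python through the C-API looks like, the sketch below drives three real C-API functions from Python via ctypes.pythonapi: import a module, fetch an attribute, call it. Which C-API functions python.mbt actually uses is not stated in this article, so treat the choice of PyImport_ImportModule, PyObject_GetAttrString and PyObject_CallObject as an assumption made for illustration.

```python
import ctypes

api = ctypes.pythonapi

# Declare the C signatures so ctypes converts arguments and results correctly.
api.PyImport_ImportModule.argtypes = [ctypes.c_char_p]
api.PyImport_ImportModule.restype = ctypes.py_object
api.PyObject_GetAttrString.argtypes = [ctypes.py_object, ctypes.c_char_p]
api.PyObject_GetAttrString.restype = ctypes.py_object
api.PyObject_CallObject.argtypes = [ctypes.py_object, ctypes.py_object]
api.PyObject_CallObject.restype = ctypes.py_object

# No Python source text is parsed or compiled for the call itself:
# the module, the function object and the argument tuple are handled directly.
math_module = api.PyImport_ImportModule(b"math")
sqrt = api.PyObject_GetAttrString(math_module, b"sqrt")
result = api.PyObject_CallObject(sqrt, (16.0,))
```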

Using Pre-wrapped Python Libraries in MoonBit

To facilitate developer use, MoonBit will officially wrap mainstream Python libraries once the build system and IDE are mature. After wrapping, users can use these Python libraries in their projects just like importing regular MoonBit packages. Let's take the matplotlib plotting library as an example.

First, add the matplotlib dependency in your project's root moon.mod.json or via the terminal:

moon update
moon add Kaida-Amethyst/matplotlib

Then, declare the import in the moon.pkg.json of the sub-package where you want to use the library. Here, we follow Python's convention and set an alias plt:

{
  "import": [
    {
      "path": "Kaida-Amethyst/matplotlib",
      "alias": "plt"
    }
  ]
}

After configuration, you can call matplotlib in your MoonBit code to create plots:

let sin : (Double) -> Double = @math.sin

fn main {
  let x = Array::makei(100, fn(i) { i.to_double() * 0.1 })
  let y = x.map(sin)

  // To ensure type safety, the wrapped subplots interface always returns a tuple of a fixed type.
  // This avoids the dynamic behavior in Python where the return type depends on the arguments.
  let (_, axes) = @plt.subplots(1, 1)

  // Use the .. cascade call syntax
  axes[0][0]
  ..plot(x, y, color = Green, linestyle = Dashed, linewidth = 2)
  ..set_title("Sine of x")
  ..set_xlabel("x")
  ..set_ylabel("sin(x)")

  @plt.show()
}
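
The comment about subplots refers to a real quirk of matplotlib: with its default squeeze behavior, plt.subplots returns a single Axes for a 1x1 grid, a one-dimensional array for a single row or column, and a two-dimensional array otherwise. The sketch below imitates that rule with plain strings instead of real Axes objects (so it runs without matplotlib) and shows how a wrapper can normalize the result to a fixed two-dimensional shape. Both function names are hypothetical.

```python
def subplots_dynamic(nrows, ncols):
    """Stand-in that mimics the shape rules of matplotlib's default squeeze."""
    grid = [[f"axes[{r}][{c}]" for c in range(ncols)] for r in range(nrows)]
    if nrows == 1 and ncols == 1:
        return grid[0][0]                # a single object
    if nrows == 1:
        return grid[0]                   # a flat list for one row
    if ncols == 1:
        return [row[0] for row in grid]  # a flat list for one column
    return grid                          # a full 2-D grid


def subplots_fixed(nrows, ncols):
    """Wrapper that always returns a 2-D grid, whatever the arguments are."""
    result = subplots_dynamic(nrows, ncols)
    if nrows == 1 and ncols == 1:
        return [[result]]
    if nrows == 1:
        return [result]
    if ncols == 1:
        return [[item] for item in result]
    return result


single = subplots_fixed(1, 1)[0][0]
column = subplots_fixed(2, 1)[1][0]
row = subplots_fixed(1, 3)[0][2]
```

With the fixed shape, the caller can always index with two subscripts, which is what axes[0][0] relies on in the MoonBit example.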

Currently, on macOS and Linux, MoonBit's build system can automatically handle dependencies. On Windows, users may need to manually install a C compiler and configure the Python environment. Future MoonBit IDEs will aim to simplify this process.

Using Unwrapped Python Modules in MoonBit

The Python ecosystem is vast, and even with AI technology, relying solely on official wrappers is not realistic. Fortunately, we can use the core features of python.mbt to interact directly with any Python module. Below, we demonstrate this process using the simple time module from the Python standard library.

Introducing python.mbt

First, ensure your MoonBit toolchain is up to date, then add the python.mbt dependency:

moon update
moon add Kaida-Amethyst/python

Next, import it in your package's moon.pkg.json:

{
  "import": ["Kaida-Amethyst/python"]
}

python.mbt automatically handles the initialization (Py_Initialize) and shutdown of the Python interpreter, so developers don't need to manage it manually.

Importing Python Modules

Use the @python.pyimport function to import modules. To avoid performance loss from repeated imports, it is recommended to use a closure technique to cache the imported module object:

// Define a struct to hold the Python module object for enhanced type safety
pub struct TimeModule {
  time_mod: PyModule
}

// Define a function that returns a closure for getting a TimeModule instance
fn import_time_mod() -> () -> TimeModule {
  // The import operation is performed only on the first call
  guard @python.pyimport("time") is Some(time_mod) else {
    println("Failed to load Python module: time")
    panic("ModuleLoadError")
  }
  let time_mod = TimeModule::{ time_mod }
  // The returned closure captures the time_mod variable
  fn () { time_mod }
}

// Create a global time_mod "getter" function
let time_mod: () -> TimeModule = import_time_mod()

In subsequent code, we should always call time_mod() to get the module, not import_time_mod.
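
The same closure-based caching pattern can be sketched in Python for comparison. The function name make_module_getter is hypothetical; importlib.import_module is the standard-library call for importing a module by name.

```python
import importlib


def make_module_getter(name):
    # The import runs once, when the getter is created.
    module = importlib.import_module(name)

    def getter():
        # Later calls only return the captured module object.
        return module

    return getter


# Module-level "getter", analogous to the global time_mod above.
time_mod = make_module_getter("time")
```

Every call to time_mod() returns the same captured object, whereas calling make_module_getter again would repeat the import lookup.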

Converting Between MoonBit and Python Objects

To call Python functions, we need to convert between MoonBit objects and Python objects (PyObject).

Integers: Use PyInteger::from to create a PyInteger from an Int64, and to_int64() for the reverse conversion.

test "py_integer_conversion" {
  let n: Int64 = 42
  let py_int = PyInteger::from(n)
  inspect(py_int, content="42")
  assert_eq(py_int.to_int64(), 42L)
}

Floats: Use PyFloat::from and to_double.

test "py_float_conversion" {
  let n: Double = 3.5
  let py_float = PyFloat::from(n)
  inspect(py_float, content="3.5")
  assert_eq(py_float.to_double(), 3.5)
}

Strings: Use PyString::from and to_string.

test "py_string_conversion" {
  let py_str = PyString::from("hello")
  inspect(py_str, content="'hello'")
  assert_eq(py_str.to_string(), "hello")
}
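
For readers curious about what such scalar conversions look like at the C-API level, here is a sketch that calls the corresponding CPython functions through ctypes.pythonapi. The functions shown are real C-API entry points, but whether python.mbt is built on exactly these is an assumption; the article does not show its internals.

```python
import ctypes

api = ctypes.pythonapi

# 64-bit integer <-> Python int
api.PyLong_FromLongLong.argtypes = [ctypes.c_longlong]
api.PyLong_FromLongLong.restype = ctypes.py_object
api.PyLong_AsLongLong.argtypes = [ctypes.py_object]
api.PyLong_AsLongLong.restype = ctypes.c_longlong

# double <-> Python float
api.PyFloat_FromDouble.argtypes = [ctypes.c_double]
api.PyFloat_FromDouble.restype = ctypes.py_object
api.PyFloat_AsDouble.argtypes = [ctypes.py_object]
api.PyFloat_AsDouble.restype = ctypes.c_double

# UTF-8 C string -> Python str
api.PyUnicode_FromString.argtypes = [ctypes.c_char_p]
api.PyUnicode_FromString.restype = ctypes.py_object

py_int = api.PyLong_FromLongLong(42)
py_float = api.PyFloat_FromDouble(3.5)
py_str = api.PyUnicode_FromString(b"hello")

back_to_int = api.PyLong_AsLongLong(py_int)
back_to_double = api.PyFloat_AsDouble(py_float)
```

The repr of py_str is 'hello' with quotes, which is why the MoonBit test above expects the content "'hello'" when inspecting a PyString.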

Lists: You can create an empty PyList and append elements, or create one directly from an Array[&IsPyObject].

test "py_list_from_array" {
  let one = PyInteger::from(1)
  let two = PyFloat::from(2.0)
  let three = PyString::from("three")
  let arr: Array[&IsPyObject] = [one, two, three]

  let list = PyList::from(arr)
  inspect(list, content="[1, 2.0, 'three']")
}

Tuples: PyTuple requires specifying the size first, then filling elements one by one using the set method.

test "py_tuple_creation" {
  let tuple = PyTuple::new(3)
  tuple
  ..set(0, PyInteger::from(1))
  ..set(1, PyFloat::from(2.0))
  ..set(2, PyString::from("three"))

  inspect(tuple, content="(1, 2.0, 'three')")
}

Dictionaries: PyDict mainly supports strings as keys. Use new to create a dictionary and set to add key-value pairs. For non-string keys, use set_by_obj.

test "py_dict_creation" {
  let dict = PyDict::new()
  dict
  ..set("one", PyInteger::from(1))
  ..set("two", PyFloat::from(2.0))

  inspect(dict, content="{'one': 1, 'two': 2.0}")
}
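
The split between a string-keyed set and an object-keyed set_by_obj resembles a distinction that already exists in the CPython C-API, which offers PyDict_SetItemString for C string keys and PyDict_SetItem for arbitrary object keys. The sketch below calls both through ctypes.pythonapi; that python.mbt maps its two methods onto these two functions is an assumption, not something the article states.

```python
import ctypes

api = ctypes.pythonapi

# int PyDict_SetItemString(PyObject *p, const char *key, PyObject *val)
api.PyDict_SetItemString.argtypes = [
    ctypes.py_object, ctypes.c_char_p, ctypes.py_object,
]
api.PyDict_SetItemString.restype = ctypes.c_int

# int PyDict_SetItem(PyObject *p, PyObject *key, PyObject *val)
api.PyDict_SetItem.argtypes = [
    ctypes.py_object, ctypes.py_object, ctypes.py_object,
]
api.PyDict_SetItem.restype = ctypes.c_int

d = {}
status_str_key = api.PyDict_SetItemString(d, b"one", 1)  # string key
status_obj_key = api.PyDict_SetItem(d, 2, "two")         # non-string key
```

Both functions return 0 on success.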

When getting elements from Python composite types, python.mbt performs runtime type checking and returns an Optional[PyObjectEnum] to ensure type safety.

test "py_list_get" {
  let list = PyList::new()
  list.append(PyInteger::from(1))
  list.append(PyString::from("hello"))

  inspect(list.get(0).unwrap(), content="PyInteger(1)")
  inspect(list.get(1).unwrap(), content="PyString('hello')")
  inspect(list.get(2), content="None") // Index out of bounds returns None
}
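
The idea of a bounds-checked getter that also tags each element with its runtime type can be mimicked in plain Python. In this sketch the function name tagged_get and the fallback tag PyObject are hypothetical; the tags PyInteger, PyFloat and PyString follow the names used in this article.

```python
from typing import Any, List, Optional


def tagged_get(items: List[Any], index: int) -> Optional[str]:
    """Return a tagged description of items[index], or None if out of range."""
    if not 0 <= index < len(items):
        return None
    value = items[index]
    # Runtime type check: pick a tag based on the element's actual type.
    if isinstance(value, int):
        tag = "PyInteger"
    elif isinstance(value, float):
        tag = "PyFloat"
    elif isinstance(value, str):
        tag = "PyString"
    else:
        tag = "PyObject"  # hypothetical catch-all tag
    return f"{tag}({value!r})"


items = [1, "hello"]
first = tagged_get(items, 0)
second = tagged_get(items, 1)
missing = tagged_get(items, 2)
```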

Calling Functions in a Module

Calling a function is a two-step process: first, get the function object with get_attr, then execute the call with invoke. The return value of invoke is a PyObject that requires pattern matching and type conversion.

Here is the MoonBit wrapper for time.sleep and time.time:

// Wrap time.sleep
pub fn sleep(seconds: Double) -> Unit {
  let lib = time_mod()
  // Look up the `sleep` attribute and make sure it is callable
  guard lib.time_mod.get_attr("sleep") is Some(PyCallable(f)) else {
    println("get function `sleep` failed!")
    panic()
  }
  // Positional arguments are passed as a Python tuple
  let args = PyTuple::new(1)
  args.set(0, PyFloat::from(seconds))
  match (try? f.invoke(args)) {
    Ok(_) => ()
    Err(_) => {
      println("invoke `sleep` failed!")
      panic()
    }
  }
}

// Wrap time.time
pub fn time() -> Double {
  let lib = time_mod()
  guard lib.time_mod.get_attr("time") is Some(PyCallable(f)) else {
    println("get function `time` failed!")
    panic()
  }
  match (try? f.invoke()) {
    Ok(Some(PyFloat(t))) => t.to_double()
    _ => {
      println("invoke `time` failed!")
      panic()
    }
  }
}

After wrapping, we can use them in a type-safe way in MoonBit:

test "sleep" {
  // time() already returns a Double, so no unwrap is needed
  let start = time()
  sleep(1)
  let end = time()

  println("start = \{start}")
  println("end = \{end}")
}

Practical Advice

Define Clear Boundaries: Treat python.mbt as the "glue layer" connecting MoonBit and the Python ecosystem. Keep core computation and business logic in MoonBit to leverage its performance and type system advantages, and only use python.mbt when necessary to call Python-exclusive libraries.
Use ADTs Instead of String Magic: Many Python functions accept specific strings as arguments to control behavior. In MoonBit wrappers, these "magic strings" should be converted to Algebraic Data Types (ADTs), i.e., enums. This leverages MoonBit's type system to move runtime value checks to compile time, greatly enhancing code robustness.
Thorough Error Handling: The examples in this article use panic or return simple strings for brevity. In production code, you should define dedicated error types and pass and handle them through the Result type, providing clear error context.

Map Keyword Arguments: Python functions extensively use keyword arguments (kwargs), such as plot(color='blue', linewidth=2). This can be elegantly mapped to MoonBit's Labeled Arguments. When wrapping, prioritize using labeled arguments to provide a similar development experience.

For example, a Python function that accepts kwargs:

# graphics.py
def draw_line(points, color="black", width=1):
    # ... drawing logic ...
    print(f"Drawing line with color {color} and width {width}")

Its MoonBit wrapper can be designed as:

fn draw_line(points: Array[Point], color~: Color = Black, width~: Int = 1) -> Unit {
  let points : PyList = ... // convert Array[Point] to PyList

  // construct args
  let args = PyTuple::new(1)
  args..set(0, points)

  // construct kwargs
  let kwargs = PyDict::new()
  kwargs
  ..set("color", PyString::from(color))
  ..set("width", PyInteger::from(width))
  // f is the PyCallable for draw_line, obtained via get_attr as in the wrappers above
  match (try? f.invoke(args~, kwargs~)) {
    Ok(_) => ()
    _ => {
      // handle error
    }
  }
}

Beware of Dynamism: Always remember that Python is dynamically typed. Any data obtained from Python should be treated as "untrusted" and must undergo strict type checking and validation. Avoid using unwrap as much as possible; instead, use pattern matching to safely handle all possible cases.

Conclusion

This article has outlined the working principles of python.mbt and demonstrated how to use it to call Python code in MoonBit, whether through pre-wrapped libraries or by interacting directly with Python modules. python.mbt is not just a tool; it represents a fusion philosophy: combining MoonBit's static analysis, high performance, and engineering advantages with Python's vast and mature ecosystem. We hope this article provides developers in the MoonBit and Python communities with a new, more powerful option for building future software.

A Guide to MoonBit C-FFI

August 14, 2025 · 16 min read

Introduction

MoonBit is a modern functional programming language featuring a robust type system, highly readable syntax, and a toolchain designed for AI. However, reinventing the wheel is not always the best approach. Countless time-tested, high-performance libraries are written in C (or languages with a C-compatible ABI, like C++, Rust). From low-level hardware manipulation to complex scientific computing and graphics rendering, the C ecosystem is a treasure trove of powerful tools.

So, can we make the modern MoonBit work in harmony with these classic C libraries, allowing the pioneers of the new world to wield the powerful tools of the old? The answer is a resounding yes. Through the C Foreign Function Interface (C-FFI), MoonBit can call C functions, bridging these two worlds.

This article will be your guide, leading you step-by-step through the mysteries of MoonBit's C-FFI. We will use a concrete example—creating MoonBit bindings for a C math library called mymath—to learn how to handle different data types, pointers, structs, and even function pointers.

Prerequisites

To connect to any C library, we need to know the functions in its header file, how to find the header file, and how to find the library file. For our task, the header file for the C math library is mymath.h. It defines the various functions and types we want to call from MoonBit. We'll assume mymath is installed on the system, and we'll use -I/usr/include to find the header file and -L/usr/lib -lmymath to link the library during compilation. Here is a part of our mymath.h:

// mymath.h

// --- Basic Functions ---
void print_version();
int version_major();
int is_normal(double input);

// --- Floating-Point Calculations ---
float sinf(float input);
float cosf(float input);
float tanf(float input);
double sin(double input);
double cos(double input);
double tan(double input);

// --- Strings and Pointers ---
int parse_int(char* str);
char* version();
int tan_with_errcode(double input, double* output);

// --- Array Operations ---
int sin_array(int input_len, double* inputs, double* outputs);
int cos_array(int input_len, double* inputs, double* outputs);
int tan_array(int input_len, double* inputs, double* outputs);

// --- Structs and Complex Types ---
typedef struct {
  double real;
  double img;
} Complex;

Complex* new_complex(double r, double i);
void multiply(Complex* a, Complex* b, Complex** result);
void init_n_complexes(int n, Complex** complex_array);

// --- Function Pointers ---
void for_each_complex(int n, Complex** arr, void (*call_back)(Complex*));

The Groundwork

Before writing any FFI code, we need to build the bridge between MoonBit and C code.

Compiling to Native

First, the MoonBit code needs to be compiled into native machine code. This can be done with the following command:

moon build --target native

This command will compile your MoonBit project into C code and then use the system's C compiler (like GCC or Clang) to compile it into a final executable. The compiled C files are located in the target/native/release/build/ directory, stored in subdirectories corresponding to the package name. For example, main/main.mbt will be compiled to target/native/release/build/main/main.c.

Configuring Linkage

Compilation alone is not enough. We also need to tell the MoonBit compiler how to find and link to our mymath library. This is configured in the project's moon.pkg.json file.

{
  "supported-targets": ["native"],
  "link": {
    "native": {
      "cc": "clang",
      "cc-flags": "-I/usr/include",
      "cc-link-flags": "-L/usr/lib -lmymath"
    }
  }
}

cc: Specifies the compiler to use for C code, e.g., clang or gcc.
cc-flags: Flags needed when compiling C files, typically used to specify header search paths (-I).
cc-link-flags: Flags needed during linking, typically used to specify library search paths (-L) and the specific libraries to link (-l).

We also need a "glue" C file, which we'll name cwrap.c, to include the C library's header file and MoonBit's runtime header file.

// cwrap.c
#include <mymath.h>
#include <moonbit.h>

This glue file also needs to be declared to the MoonBit compiler via moon.pkg.json:

{
  // ... other configurations
  "native-stub": ["cwrap.c"]
}

With these configurations in place, our project is ready to link with the mymath library.

The First FFI Call

With everything set up, let's make our first true cross-language call. To declare an external C function in MoonBit, the syntax is as follows:

extern "C" fn moonbit_function_name(arg: Type) -> ReturnType = "c_function_name"

extern "C": Tells the MoonBit compiler that this is an external C function.
moonbit_function_name: The function name used in the MoonBit code.
"c_function_name": The name of the C function to link to.

Let's try it out with the simplest function in mymath.h, version_major:

extern "C" fn version_major() -> Int = "version_major"

Note: MoonBit has powerful Dead Code Elimination (DCE). If you only declare the FFI function above but never actually call it in your code (e.g., in the main function), the compiler will consider it unused code and will not include its declaration in the final generated C code. So, make sure you call it at least once!

Navigating the Type System Chasm

The real challenge lies in handling the data type differences between the two languages. Some of the more complex cases assume a working knowledge of C.

3.1 Basic Types

For basic numeric types, there is a direct and clear correspondence between MoonBit and C.

MoonBit Type	C Type	Notes
`Int`	`int32_t`
`Int64`	`int64_t`
`UInt`	`uint32_t`
`UInt64`	`uint64_t`
`Float`	`float`
`Double`	`double`
`Bool`	`int32_t`	Passed as a 32-bit integer (0/1) rather than as C99's `_Bool`.
`Unit`	`void` (return value)	Used to represent that a C function has no return value.
`Byte`	`uint8_t`
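The widths in this table can be spot-checked from the C side. The following is a minimal sketch, not part of mymath: `is_normal_sketch` is a hypothetical stand-in for the library's `is_normal`, written only to show a C function returning 0/1 in the shape that a MoonBit `Bool` return type expects.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

// Compile-time checks for the widths the table assumes.
static_assert(sizeof(int32_t) == 4, "Int, UInt and Bool are 32-bit");
static_assert(sizeof(int64_t) == 8, "Int64 and UInt64 are 64-bit");
static_assert(sizeof(float) == 4, "Float is a 32-bit float");
static_assert(sizeof(double) == 8, "Double is a 64-bit float");
static_assert(sizeof(uint8_t) == 1, "Byte is 8-bit");

// Hypothetical stand-in for mymath's is_normal (an assumption, not the
// library's code): returns 1 for normal finite values, 0 otherwise.
int32_t is_normal_sketch(double input) {
  double magnitude = input < 0 ? -input : input;
  // NaN fails both comparisons; zero and subnormals fail the first;
  // infinity fails the second.
  return (magnitude >= DBL_MIN && magnitude <= DBL_MAX) ? 1 : 0;
}
```

If any of the static assertions fails on a target platform, the mapping in the table should not be relied on for that platform.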

Based on this table, we can easily write FFI declarations for most of the simple functions in mymath.h:

extern "C" fn print_version() -> Unit = "print_version"
extern "C" fn version_major() -> Int = "version_major"

// The return value is semantically a boolean, using MoonBit's Bool type is clearer
extern "C" fn is_normal(input: Double) -> Bool = "is_normal"

extern "C" fn sinf(input: Float) -> Float = "sinf"
extern "C" fn cosf(input: Float) -> Float = "cosf"
extern "C" fn tanf(input: Float) -> Float = "tanf"

extern "C" fn sin(input: Double) -> Double = "sin"
extern "C" fn cos(input: Double) -> Double = "cos"
extern "C" fn tan(input: Double) -> Double = "tan"

3.2 Strings

Things get interesting when we encounter strings. You might instinctively map C's char* to MoonBit's String, but this is a common pitfall.

MoonBit's String and C's char* have completely different memory layouts. char* is a pointer to a \0-terminated sequence of bytes, while MoonBit's String is a GC-managed, complex object containing length information and UTF-16 encoded data.

Passing Arguments: From MoonBit to C

When we need to pass a MoonBit string to a C function that accepts a char* (like parse_int), we need to perform a manual conversion. A recommended approach is to convert it to the Bytes type.

// A helper function to convert a MoonBit String to the null-terminated byte array expected by C
// Caveat: to_bytes() yields UTF-16LE code units, so ASCII text carries a zero byte after
// each character; encode to UTF-8 first if the C function expects an ordinary C string.
fn string_to_c_bytes(s: String) -> Bytes {
  let mut arr = s.to_bytes().to_array()
  // Ensure it's null-terminated
  if arr.last() != Some(0) {
    arr.push(0)
  }
  Bytes::from_array(arr)
}

// FFI declaration, note the parameter type is Bytes
#borrow(s) // Tell the compiler we are just borrowing s, don't increase its reference count
extern "C" fn __parse_int(s: Bytes) -> Int = "parse_int"

// Wrap it in a user-friendly MoonBit function
fn parse_int(str: String) -> Int {
  let s = string_to_c_bytes(str)
  __parse_int(s)
}

The #borrow Annotation: The borrow annotation is an optimization hint. It tells the compiler that the C function only "borrows" this parameter and will not take ownership of it. This can avoid unnecessary reference counting operations and prevent potential memory leaks.
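Encoding matters as much as the terminator when handing bytes to a `char*` consumer. Because a MoonBit String holds UTF-16 code units, its raw bytes contain a zero after every ASCII character. The sketch below is plain C with a hypothetical `parse_int_sketch` standing in for mymath's `parse_int` (its real behavior is not shown in the header); it illustrates what C string functions see for the text "42" in each encoding.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for mymath's parse_int (an assumption).
int parse_int_sketch(const char *str) {
  return atoi(str);
}

// "42" as UTF-16LE bytes plus a terminator: '4', 0, '2', 0, 0.
static const char utf16le_42[] = {'4', 0, '2', 0, 0};

// "42" as UTF-8 (identical to ASCII here) plus a terminator.
static const char utf8_42[] = {'4', '2', 0};

// Length as seen by C string functions, which stop at the first zero byte.
size_t visible_length(const char *bytes) {
  return strlen(bytes);
}
```

With the UTF-16LE bytes, C stops reading after the first character, so only the UTF-8 form round-trips the whole number. If the target function expects ordinary C strings, converting to UTF-8 before appending the terminator is the safer choice.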

Return Values: From C to MoonBit

Conversely, when a C function returns a char* (like version), the situation is more complex. We absolutely must not declare it to return Bytes or String directly:

// Incorrect!
extern "C" fn version() -> Bytes = "version"

This is because the C function returns a raw pointer, which lacks the header information required by the MoonBit GC. A direct conversion like this will lead to a runtime crash.

The correct approach is to treat the returned char* as an opaque handle, and then write a conversion function in the C "glue" code to manually convert it into a valid MoonBit string.

MoonBit side:

// 1. Declare an external type to represent the C string pointer
#extern
type CStr

// 2. Declare an FFI function that calls the C wrapper
extern "C" fn CStr::to_string(self: Self) -> String = "cstr_to_moonbit_str"

// 3. Declare the original C function, which returns our opaque type
extern "C" fn __version() -> CStr = "version"

// 4. Wrap it in a safe MoonBit function
fn version() -> String {
  __version().to_string()
}

C side (add to cwrap.c):

#include <string.h> // for strlen

// This function is responsible for correctly converting a char* to a moonbit_string_t with a GC header
moonbit_string_t cstr_to_moonbit_str(char *ptr) {
  if (ptr == NULL) {
    return moonbit_make_string(0, 0);
  }
  int32_t len = strlen(ptr);
  // moonbit_make_string allocates a MoonBit string object with a GC header
  moonbit_string_t ms = moonbit_make_string(len, 0);
  for (int i = 0; i < len; i++) {
    ms[i] = (uint16_t)(unsigned char)ptr[i]; // Assumes ASCII input; the extra cast avoids sign extension for bytes >= 0x80
  }
  // Note: Whether to free(ptr) depends on the C library's API contract.
  // If the memory returned by version() needs to be freed by the caller, it should be freed here.
  return ms;
}

This pattern, while a bit cumbersome at first glance, ensures memory safety and is the standard way to handle C string return values.

3.3 The Art of Pointers: Passing by Reference and Arrays

C extensively uses pointers for "output parameters" and passing arrays. MoonBit provides specialized types for this.

"Output" Parameters for a Single Value

When a C function uses a pointer to return an additional value, like tan_with_errcode(double input, double* output), MoonBit uses the Ref[T] type.

extern "C" fn tan_with_errcode(input: Double, output: Ref[Double]) -> Int = "tan_with_errcode"

Ref[T] in MoonBit is a struct containing a single field of type T. When passed to C, MoonBit passes the address of this struct. From C's perspective, a pointer to struct { T val; } is equivalent in memory address to a pointer to T, so it works directly.
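The address equivalence described here can be checked in plain C. This is a minimal sketch under stated assumptions: `RefDouble` only imitates the single-field shape the text attributes to `Ref[Double]`, and `tan_with_errcode_sketch` is a hypothetical stand-in whose body is a placeholder computation, not mymath's implementation.

```c
#include <stddef.h>

// Imitates the single-field shape described for Ref[Double].
typedef struct {
  double val;
} RefDouble;

// Hypothetical stand-in with tan_with_errcode's signature: writes a result
// through the output pointer and returns 0 on success, -1 on a null pointer.
int tan_with_errcode_sketch(double input, double *output) {
  if (output == NULL) {
    return -1;
  }
  *output = input * 2.0; // placeholder computation, not a real tangent
  return 0;
}

// Passes the address of the wrapper struct where a double* is expected.
int call_through_ref(double input, RefDouble *ref) {
  // A pointer to a struct, suitably converted, points to its first member.
  return tan_with_errcode_sketch(input, (double *)ref);
}
```

After `call_through_ref(1.5, &r)`, the value written by the C function is visible in `r.val`, which is the behavior the MoonBit declaration relies on.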

Arrays: Passing Collections of Data

When a C function needs to process an array (e.g., double* inputs), MoonBit uses the FixedArray[T] type. FixedArray[T] is a contiguous block of T elements in memory, and its pointer can be passed directly to C.

extern "C" fn sin_array(len: Int, inputs: FixedArray[Double], outputs: FixedArray[Double]) -> Int = "sin_array"
extern "C" fn cos_array(len: Int, inputs: FixedArray[Double], outputs: FixedArray[Double]) -> Int = "cos_array"
extern "C" fn tan_array(len: Int, inputs: FixedArray[Double], outputs: FixedArray[Double]) -> Int = "tan_array"
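To make the contract of these array functions concrete, here is a sketch of what the C side does with the two pointers. `double_array_sketch` is a hypothetical function that only shares the shape of `sin_array`; it doubles each element instead of calling a math routine, and its error code convention is an assumption.

```c
#include <stddef.h>

// Hypothetical function with the same shape as sin_array: reads input_len
// doubles from inputs and writes input_len doubles to outputs.
int double_array_sketch(int input_len, double *inputs, double *outputs) {
  if (input_len < 0 || inputs == NULL || outputs == NULL) {
    return -1; // assumed error code for invalid arguments
  }
  for (int i = 0; i < input_len; i++) {
    outputs[i] = inputs[i] * 2.0;
  }
  return 0;
}
```

The C function trusts the length it is given, so on the MoonBit side both FixedArray values should be created with at least `len` elements before the call.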

3.4 External Types: Embracing Opaque C Structs

For C structs, like Complex, the best practice is usually to treat it as an "Opaque Type". We only create a reference (or handle) to it in MoonBit, without caring about its internal fields.

This is achieved with the #extern type syntax:

#extern
type Complex

This declaration tells MoonBit: "There is an external type named Complex. You don't need to know its internal structure, just treat it as a pointer-sized handle." In the generated C code, the Complex type will be treated as void*. This is usually safe because all operations on Complex are done within the C library, and the MoonBit side is only responsible for passing the pointer.

Based on this principle, we can write FFIs for the Complex-related functions in mymath.h:

// C: Complex* new_complex(double r, double i);
// Returns a pointer to Complex, which is a Complex handle in MoonBit
extern "C" fn new_complex(r: Double, i: Double) -> Complex = "new_complex"

// C: void multiply(Complex* a, Complex* b, Complex** result);
// Complex* corresponds to Complex, and Complex** corresponds to Ref[Complex]
extern "C" fn multiply(a: Complex, b: Complex, res: Ref[Complex]) -> Unit = "multiply"

// C: void init_n_complexes(int n, Complex** complex_array);
// Complex** is used as an array here, corresponding to FixedArray[Complex]
extern "C" fn init_n_complexes(n: Int, complex_array: FixedArray[Complex]) -> Unit = "init_n_complexes"

Best Practice: Encapsulate Raw FFIs. Directly exposing FFI functions can be confusing for users (e.g., Ref and FixedArray). It is strongly recommended to build a more user-friendly API for MoonBit users on top of the FFI declarations.

// Define methods on the Complex type to hide FFI details
fn Complex::mul(self: Complex, other: Complex) -> Complex {
  // Create a temporary Ref to receive the result
  let res: Ref[Complex] = Ref::{ val: new_complex(0, 0) }
  multiply(self, other, res)
  res.val // Return the result
}

fn init_n(n: Int) -> Array[Complex] {
  // Use FixedArray::make to create the array
  let arr = FixedArray::make(n, new_complex(0, 0))
  init_n_complexes(n, arr)
  // Convert FixedArray to the more user-friendly Array
  Array::from_fixed_array(arr)
}

3.5 Function Pointers: When C Needs to Call Back

The most complex function in mymath.h is for_each_complex, which takes a function pointer as an argument.

void for_each_complex(int n, Complex** arr, void (*call_back)(Complex*));

A common misconception is to try to map MoonBit's closure type (Complex) -> Unit directly to a C function pointer. This is not possible because a MoonBit closure is internally a struct with two parts: a pointer to the actual function code, and a pointer to its captured environment data.

To pass a pure, environment-free function pointer, MoonBit provides the FuncRef type:

extern "C" fn for_each_complex(
  n: Int,
  arr: FixedArray[Complex],
  call_back: FuncRef[(Complex) -> Unit] // Use FuncRef to wrap the function type
) -> Unit = "for_each_complex"

Any function type wrapped in FuncRef will be converted to a standard C function pointer when passed to C.
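The difference between a closure and a bare function pointer can be seen from the C side. The sketch below is illustrative only: `for_each_complex_sketch` is a hypothetical implementation with the same signature as `for_each_complex`, and `ClosureSketch` is an assumed two-pointer shape matching the description of closures above, not MoonBit's actual layout.

```c
// Stand-in for the Complex struct from mymath.h.
typedef struct {
  double real;
  double img;
} Complex;

// Hypothetical implementation with for_each_complex's signature: the callback
// is a bare code address, so there is nowhere to carry captured variables.
void for_each_complex_sketch(int n, Complex **arr, void (*call_back)(Complex *)) {
  for (int i = 0; i < n; i++) {
    call_back(arr[i]);
  }
}

// By contrast, a closure needs a code pointer plus an environment pointer.
typedef struct {
  void (*code)(void *env, Complex *c);
  void *env;
} ClosureSketch;

// A capture-free callback: it can only reach its argument and globals.
static double real_sum = 0.0;

void add_real(Complex *c) {
  real_sum += c->real;
}
```

Since the C signature has no slot for an environment pointer, only capture-free functions such as `add_real` fit, which is the restriction FuncRef enforces on the MoonBit side.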

How to declare a FuncRef? Just use let. As long as the function does not capture external variables, the declaration will succeed.

fn print_complex(c: Complex) -> Unit { ... }

fn main {
  let print_complex : FuncRef[(Complex) -> Unit] = (c) => print_complex(c)
  // ...
}

Advanced Topic: GC Management

We have covered most of the type conversion issues, but there is still a very important issue: memory management. C relies on manual malloc/free, while MoonBit has automatic garbage collection (GC). When a C library creates an object (like new_complex), who is responsible for freeing it?

Can we do without GC?

Some library authors may choose not to implement GC, leaving all destruction operations to the user. This approach has its merits in some libraries, such as some high-performance computing libraries, graphics libraries, etc. To improve performance or stability, they may abandon some GC features, but this raises the bar for programmers. Most libraries still need to provide GC to enhance the user experience.

Ideally, we want MoonBit's GC to automatically manage the lifecycle of these C objects. MoonBit provides two mechanisms to achieve this.

4.1 The Simple Case

If the C struct is very simple and you are sure that its memory layout is stable across all platforms, you can redefine it directly in MoonBit.

// mymath.h: typedef struct { double real; double img; } Complex;
// MoonBit:
struct Complex {
  r: Double,
  i: Double
}

By doing this, Complex becomes a true MoonBit object. The MoonBit compiler will automatically manage its memory and add a GC header. When you pass it to a C function, MoonBit will pass a pointer to its data part, which is usually feasible.

But this method has significant limitations:

It requires you to know the exact memory layout, alignment, etc., of the C struct, which can be fragile.
If a C function returns a Complex*, you cannot use it directly. You must, like handling string return values, write a C wrapper function to copy the data from the C struct into a newly created MoonBit Complex object with a GC header.
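One way to reduce the fragility of the first limitation is to pin the layout down with compile-time checks in the glue code. This is a sketch under an assumption: that the MoonBit struct's two Double fields are expected to sit contiguously in declaration order, which is what the redefinition approach relies on. `complex_img_offset` is a hypothetical helper added only for inspection.

```c
#include <assert.h>
#include <stddef.h>

// The struct as declared in mymath.h.
typedef struct {
  double real;
  double img;
} Complex;

// Compile-time checks: the build fails if the C layout differs from
// two contiguous doubles.
static_assert(sizeof(Complex) == 2 * sizeof(double), "no padding expected");
static_assert(offsetof(Complex, real) == 0, "real comes first");
static_assert(offsetof(Complex, img) == sizeof(double), "img follows real");

// Hypothetical helper returning the offset of img, for logging or tests.
size_t complex_img_offset(void) {
  return offsetof(Complex, img);
}
```

Checks like these catch a layout change in the C library at build time instead of as silent data corruption at run time.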

Therefore, this method is only suitable for the simplest cases. For most scenarios, we recommend a more robust finalizer solution.

4.2 The Complex Situation: Using Finalizers

This is a more general and safer method. The core idea is to create a MoonBit object to "wrap" the C pointer and tell the MoonBit GC that when this wrapper object is collected, a specific C function (a finalizer) should be called to release the underlying C pointer.

This process involves several steps:

1. Declare two types in MoonBit

#extern
type C_Complex // Represents the raw, opaque C pointer

type Complex C_Complex // A MoonBit type that wraps a C_Complex internally

type Complex C_Complex is a special declaration that creates a MoonBit object type named Complex, which has an internal field of type C_Complex. We can access this internal field with the .inner() method.

2. Provide a finalizer and wrapper functions in C

We need a C function to free the Complex object, and a function to create our GC-enabled MoonBit wrapper object.

C side (add to cwrap.c):

// The mymath library should provide a function to free Complex, let's assume it's free_complex
// void free_complex(Complex* c);

// We need a void* version of the finalizer for the MoonBit GC to use
void free_complex_finalizer(void* obj) {
    // The layout of a MoonBit external object is { void (*finalizer)(void*); T data; }
    // We need to extract the real Complex pointer from obj
    // Assuming the MoonBit Complex wrapper has only one field
    Complex* c_obj = *((Complex**)obj);
    free_complex(c_obj); // Call the real finalizer, if provided by the mymath library
    // free(c_obj); // If it was allocated with standard malloc
}

// Define what the MoonBit Complex wrapper looks like in C
typedef struct {
  Complex* val;
} MoonBit_Complex;

// Function to create the MoonBit wrapper object
MoonBit_Complex* new_mbt_complex(Complex* c_complex) {
  // `moonbit_make_external_obj` is the key
  // It creates a GC-managed external object and registers its finalizer.
  MoonBit_Complex* mbt_complex = moonbit_make_external_obj(
      &free_complex_finalizer,
      sizeof(MoonBit_Complex)
  );
  mbt_complex->val = c_complex;
  return mbt_complex;
}

3. Use the wrapper function in MoonBit

Now, instead of calling new_complex directly, we call our wrapper function new_mbt_complex.

// FFI declaration pointing to our C wrapper function
extern "C" fn __new_managed_complex(c_complex: C_Complex) -> Complex = "new_mbt_complex"

// The original C new_complex function returns a raw pointer
extern "C" fn __new_unmanaged_complex(r: Double, i: Double) -> C_Complex = "new_complex"

// The final, safe, GC-friendly new function provided to the user
fn Complex::new(r: Double, i: Double) -> Complex {
  let c_ptr = __new_unmanaged_complex(r, i)
  __new_managed_complex(c_ptr)
}

Now, when an object created by Complex::new is no longer used in MoonBit, the GC will automatically call free_complex_finalizer, safely freeing the memory allocated by the C library.
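The finalizer flow can be imitated in standalone C to see the order of events. Everything here is a simulation under stated assumptions: `ManagedSketch`, `wrap_complex`, and `collect` are hypothetical names that play the roles of the external object, `new_mbt_complex`, and the GC respectively. None of them is part of the MoonBit runtime, and the struct is not MoonBit's real object layout.

```c
#include <stdlib.h>

typedef struct {
  double real;
  double img;
} Complex;

// Counts frees so the effect of the finalizer can be observed.
static int freed_count = 0;

// Stand-in for the library's free_complex.
void free_complex_sketch(Complex *c) {
  free(c);
  freed_count++;
}

// Simulated managed object: a finalizer slot followed by the payload.
typedef struct {
  void (*finalizer)(void *payload);
  Complex *val; // payload: the wrapped C pointer
} ManagedSketch;

// Finalizer: receives the payload address and frees the wrapped pointer.
void free_complex_finalizer_sketch(void *payload) {
  Complex *c = *((Complex **)payload);
  free_complex_sketch(c);
}

// Plays the role of new_mbt_complex: wrap a raw pointer and register cleanup.
ManagedSketch *wrap_complex(Complex *c) {
  ManagedSketch *m = malloc(sizeof(ManagedSketch));
  if (m == NULL) {
    return NULL;
  }
  m->finalizer = free_complex_finalizer_sketch;
  m->val = c;
  return m;
}

// Plays the role of the GC collecting the wrapper.
void collect(ManagedSketch *m) {
  m->finalizer(&m->val); // run the finalizer on the payload first
  free(m);               // then release the wrapper itself
}
```

The point of the sketch is the ordering: the wrapped C object is released exactly once, at the moment the wrapper is collected, and never by the user directly.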

When we need to pass our managed Complex object to other C functions, we just use the .inner() method:

// Assume there is a C function `double length(Complex*);`
extern "C" fn length(c_complex: C_Complex) -> Double = "length"

fn Complex::length(self: Self) -> Double {
  // self.inner() returns the internal C_Complex (i.e., the C pointer)
  length(self.inner())
}

Conclusion

This article has walked through C-FFI in MoonBit, from basic types to opaque struct types and function pointers, and closed with how MoonBit's GC can manage objects allocated by C. We hope it helps readers build their own library bindings.

Dancing with LLVM: A Moonbit Chronicle (Part 2) - LLVM Backend Generation

August 6, 2025 · 17 min read

Introduction

In the process of programming language design, the frontend is responsible for understanding and verifying the structure and semantics of a program, while the compiler backend takes on the task of translating these abstract concepts into executable machine code. The implementation of the backend not only requires a deep understanding of the target architecture but also mastery of complex optimization techniques to generate efficient code.

LLVM (Low Level Virtual Machine), as a comprehensive modern compiler infrastructure, provides us with a powerful and flexible solution. By converting a program into LLVM Intermediate Representation (IR), we can leverage LLVM's mature toolchain to compile the code to various target architectures, including RISC-V, ARM, and x86.

Moonbit's LLVM Ecosystem

Moonbit officially provides two important LLVM-related projects:

llvm.mbt: Moonbit language bindings for the original LLVM, providing direct access to the llvm-c interface. It requires a complete LLVM toolchain installation, can only be used with the native backend, and leaves compilation and linking to you, but it generates IR that is fully compatible with the original LLVM.

MoonLLVM: A pure Moonbit implementation of an LLVM-like system. It can generate LLVM IR without external dependencies and supports JavaScript and WebAssembly backends.

This article chooses llvm.mbt as our tool. Its API design is inspired by the highly acclaimed inkwell library in the Rust ecosystem.

In the previous article, "Dancing with LLVM: A Moonbit Chronicle (Part 1) - Implementing the Frontend," we completed the conversion from source code to a typed abstract syntax tree. This article will build on that achievement, focusing on the core techniques and implementation details of code generation.

Chapter 1: Representing the LLVM Type System in Moonbit

Before diving into code generation, we first need to understand how llvm.mbt represents LLVM's various concepts within Moonbit's type system. LLVM's type system is quite complex, containing multiple levels such as basic types, composite types, and function types.

Trait Objects: An Abstract Representation of Types

In the API design of llvm.mbt, you will frequently encounter the core concept of &Type. This is not a concrete struct or enum, but a Trait Object—which can be understood as the functional equivalent of an abstract base class in object-oriented programming.

// &Type is a trait object representing any LLVM type
let some_type: &Type = context.i32_type()

Type Identification and Conversion

To determine the specific type of a &Type, we need to perform a runtime type check using the as_type_enum interface:

pub fn identify_type(ty: &Type) -> String {
  match ty.as_type_enum() {
    IntType(int_ty) => "Integer type with \{int_ty.get_bit_width()} bits"
    FloatType(float_ty) => "Floating point type"
    PointerType(ptr_ty) => "Pointer type"
    FunctionType(func_ty) => "Function type"
    ArrayType(array_ty) => "Array type"
    StructType(struct_ty) => "Structure type"
    VectorType(vec_ty) => "Vector type"
    ScalableVectorType(svec_ty) => "Scalable vector type"
    MetadataType(meta_ty) => "Metadata type"
  }
}

Safe Type Conversion Strategies

When we are certain that a &Type has a specific type, there are several conversion methods to choose from:

Direct Conversion (for deterministic scenarios)

let ty: &Type = context.i32_type()
let i32_ty = ty.into_int_type()  // Direct conversion, errors are handled by llvm.mbt
let bit_width = i32_ty.get_bit_width()  // Call a method specific to IntType

Defensive Conversion (recommended for production environments)

let ty: &Type = get_some_type()  // An unknown type obtained from somewhere

guard ty.as_type_enum() is IntType(i32_ty) else {
  raise CodeGenError("Expected integer type, got \{ty}")
}

// Now it's safe to use i32_ty
let bit_width = i32_ty.get_bit_width()

Constructing Composite Types

LLVM supports various composite types, which are usually constructed through methods of basic types:

pub fn create_composite_types(context: @llvm.Context) -> Unit {
  let i32_ty = context.i32_type()
  let f64_ty = context.f64_type()

  // Array type: [16 x i32]
  let i32_array_ty = i32_ty.array_type(16)

  // Function type: i32 (i32, i32)
  let add_func_ty = i32_ty.fn_type([i32_ty, i32_ty])

  // Struct type: {i32, f64}
  let struct_ty = context.struct_type([i32_ty, f64_ty])

  // Pointer type (all pointers are opaque in LLVM 18+)
  let ptr_ty = i32_ty.ptr_type()

  // Output type information for verification
  println("Array type: \{i32_array_ty}")      // [16 x i32]
  println("Function type: \{add_func_ty}")    // i32 (i32, i32)
  println("Struct type: \{struct_ty}")        // {i32, f64}
  println("Pointer type: \{ptr_ty}")          // ptr
}
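To make the printed forms in the comments concrete, here is a minimal Python sketch (not llvm.mbt code) that renders the same three composite types as text. The helper names array_type, fn_type and struct_type are hypothetical and merely mirror the calls above. Note that LLVM's own textual IR spells the 64-bit float type as double and prints struct bodies with inner spaces, so the exact output may differ slightly from the comments.

```python
# Illustrative model only: LLVM types rendered as the strings used in textual IR.

def array_type(elem: str, count: int) -> str:
    """Fixed-size array type, e.g. [16 x i32]."""
    return f"[{count} x {elem}]"

def fn_type(ret: str, params: list[str]) -> str:
    """Function type: the return type followed by the parameter list."""
    return f"{ret} ({', '.join(params)})"

def struct_type(fields: list[str]) -> str:
    """Anonymous (literal) struct type."""
    return "{ " + ", ".join(fields) + " }"

i32, f64 = "i32", "double"  # LLVM IR names the 64-bit float type `double`
print(array_type(i32, 16))
print(fn_type(i32, [i32, i32]))
print(struct_type([i32, f64]))
```

Every composite type is built from simpler ones, which is why the llvm.mbt methods above hang off a basic type or the context.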

Important Reminder: Opaque Pointers

Opaque pointers became the default in LLVM 15, and typed pointers were removed entirely in LLVM 17, so in LLVM 18 and later every pointer type is opaque. This means that regardless of what they point to, all pointers are represented as ptr in the IR, and the pointee type is no longer visible in the type system.
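One practical consequence is that the accessed type has to be spelled out on every memory instruction, since the pointer no longer carries it. The following Python sketch only formats IR strings to show that shape; the helpers load and store are hypothetical and are not part of any LLVM binding.

```python
# With opaque pointers the pointer operand is always plain `ptr`;
# the type being read or written is carried by the instruction itself.

def load(result: str, ty: str, ptr: str) -> str:
    return f"{result} = load {ty}, ptr {ptr}"

def store(ty: str, value: str, ptr: str) -> str:
    return f"store {ty} {value}, ptr {ptr}"

print(load("%1", "i32", "%x"))
print(store("i32", "%1", "%y"))
```

This is also why the code generator has to remember the type of each variable itself: it cannot be recovered from the pointer later.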

Chapter 2: The LLVM Value System and the BasicValue Concept

Compared to the type system, LLVM's value system is more complex. llvm.mbt, consistent with inkwell, divides values into two important abstract layers: Value and BasicValue. The difference lies in distinguishing the source of value creation from the way values are used:

Value: Focuses on how a value is produced (e.g., constants, instruction results).
BasicValue: Focuses on what basic type a value has (e.g., integer, float, pointer).

Practical Application Example

pub fn demonstrate_value_system(context: Context, builder: Builder) -> Unit {
  let i32_ty = context.i32_type()

  // Create two integer constants - these are directly IntValue
  let const1 = i32_ty.const_int(10)  // Value: IntValue, BasicValue: IntValue
  let const2 = i32_ty.const_int(20)  // Value: IntValue, BasicValue: IntValue

  // Perform an addition operation - the result is an InstructionValue
  let add_result = builder.build_int_add(const1, const2)

  // In different contexts, we need different perspectives:

  // As an instruction to check its properties
  let instruction = add_result.as_instruction()
  println("Instruction opcode: \{instruction.get_opcode()}")

  // As a basic value to get its type
  let basic_value = add_result.into_basic_value()
  println("Result type: \{basic_value.get_type()}")

  // As an integer value for subsequent calculations
  let int_value = add_result.into_int_value()
  let final_result = builder.build_int_mul(int_value, const1)
}

Complete Classification of Value Types

ValueEnum: All possible value types

pub enum ValueEnum {
  IntValue(IntValue)              // Integer value
  FloatValue(FloatValue)          // Floating-point value
  PointerValue(PointerValue)      // Pointer value
  StructValue(StructValue)        // Struct value
  FunctionValue(FunctionValue)    // Function value
  ArrayValue(ArrayValue)          // Array value
  VectorValue(VectorValue)        // Vector value
  PhiValue(PhiValue)              // Phi node value
  ScalableVectorValue(ScalableVectorValue)  // Scalable vector value
  MetadataValue(MetadataValue)    // Metadata value
  CallSiteValue(CallSiteValue)    // Call site value
  GlobalValue(GlobalValue)        // Global value
  InstructionValue(InstructionValue)  // Instruction value
} derive(Show)

BasicValueEnum: Values that have a basic type

pub enum BasicValueEnum {
  ArrayValue(ArrayValue)              // Array value
  IntValue(IntValue)                  // Integer value
  FloatValue(FloatValue)              // Floating-point value
  PointerValue(PointerValue)          // Pointer value
  StructValue(StructValue)            // Struct value
  VectorValue(VectorValue)            // Vector value
  ScalableVectorValue(ScalableVectorValue)  // Scalable vector value
} derive(Show)

💡 Best Practices for Value Conversion

In the actual code generation process, we often need to convert between different value perspectives:

pub fn value_conversion_patterns(instruction_result: &Value) -> Unit {
  // Pattern 1: I know what type this is, convert directly
  let int_val = instruction_result.into_int_value()

  // Pattern 2: I just need a basic value, I don't care about the specific type
  let basic_val = instruction_result.into_basic_value()

  // Pattern 3: Defensive programming, check before converting
  match instruction_result.as_value_enum() {
    // Handle integer values
    IntValue(int_val) => handle_integer(int_val)
    // Handle float values
    FloatValue(float_val) => handle_float(float_val)
    _ => raise CodeGenError("Unexpected value type")
  }
}

Through this two-layer abstraction, llvm.mbt maintains the integrity of the LLVM value system while providing an intuitive and easy-to-use interface for Moonbit developers.

Chapter 3: Practical LLVM IR Generation

Now that we understand the type and value systems, let's demonstrate how to use llvm.mbt to generate LLVM IR with a complete example. This example will implement a simple muladd function, showing the entire process from initialization to instruction generation.

Infrastructure Initialization

Any LLVM program begins by establishing three core components:

pub fn initialize_llvm() -> (Context, Module, Builder) {
  // 1. Create an LLVM context - a container for all LLVM objects
  let context = @llvm.Context::create()

  // 2. Create a module - a container for functions and global variables
  let module = context.create_module("demo_module")

  // 3. Create an IR builder - used to generate instructions
  let builder = context.create_builder()

  (context, module, builder)
}

A Simple Function Generation Example

Let's implement a function that calculates (a * b) + c:

pub fn generate_muladd_function() -> String {
  // Initialize LLVM infrastructure
  let (context, module, builder) = initialize_llvm()

  // Define the function signature
  let i32_ty = context.i32_type()
  let func_type = i32_ty.fn_type([i32_ty, i32_ty, i32_ty])
  let func_value = module.add_function("muladd", func_type)

  // Create the function entry basic block
  let entry_block = context.append_basic_block(func_value, "entry")
  builder.position_at_end(entry_block)

  // Get the function parameters
  let arg_a = func_value.get_nth_param(0).unwrap().into_int_value()
  let arg_b = func_value.get_nth_param(1).unwrap().into_int_value()
  let arg_c = func_value.get_nth_param(2).unwrap().into_int_value()

  // Generate calculation instructions
  let mul_result = builder.build_int_mul(arg_a, arg_b).into_int_value()
  let add_result = builder.build_int_add(mul_result, arg_c)

  // Generate the return instruction
  let _ = builder.build_return(add_result)

  // Output the generated IR
  module.dump()
}

Generated LLVM IR

Running the above code will produce the following LLVM Intermediate Representation:

; ModuleID = 'demo_module'
source_filename = "demo_module"

define i32 @muladd(i32 %0, i32 %1, i32 %2) {
entry:
  %3 = mul i32 %0, %1
  %4 = add i32 %3, %2
  ret i32 %4
}
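To check our reading of this IR, the following Python sketch models what the two instructions compute. It is an illustration of the i32 semantics rather than anything produced by llvm.mbt, and the helper wrap_i32 is a hypothetical name. Plain mul and add on i32 (without nsw or nuw flags) wrap around in two's complement, which Python integers do not do on their own.

```python
def wrap_i32(n: int) -> int:
    """Reduce a Python integer to a signed 32-bit value (two's complement)."""
    n &= 0xFFFFFFFF
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n

def muladd(a: int, b: int, c: int) -> int:
    product = wrap_i32(a * b)      # %3 = mul i32 %0, %1
    return wrap_i32(product + c)   # %4 = add i32 %3, %2

print(muladd(2, 3, 4))
print(muladd(2**31 - 1, 2, 0))  # the product overflows i32 and wraps around
```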

💡 Code Generation Best Practices

Naming Conventions

For instructions that return a value, the build methods accept an optional labelled argument, name, which attaches a name to the result of the instruction.

let mul_result = builder.build_int_mul(lhs, rhs, name="temp_product")
let final_result = builder.build_int_add(mul_result, offset, name="final_sum")

Error Handling

Use raise instead of panic for error handling, so that failures which cannot be ruled out in advance are reported to the caller as errors rather than aborting the compiler.

// Check for operations that might fail
match func_value.get_nth_param(index) {
  Some(param) => param.into_int_value()
  None => raise CodeGenError("Function parameter \{index} not found")
}
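The same idea, sketched in Python for readers less familiar with Moonbit's error handling: a failed lookup becomes an exception that the caller can catch and report. The function get_param is a hypothetical stand-in for the parameter lookup above, and only the error name CodeGenError is taken from the article.

```python
class CodeGenError(Exception):
    """A recoverable code generation failure."""

def get_param(params: list[str], index: int) -> str:
    """Return the parameter at `index`, or raise instead of crashing."""
    if 0 <= index < len(params):
        return params[index]
    raise CodeGenError(f"Function parameter {index} not found")

try:
    get_param(["%0", "%1"], 5)
except CodeGenError as err:
    # The driver can attach source locations, collect more errors, and so on.
    print(f"reported to the caller: {err}")
```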

Chapter 4: TinyMoonbit Compiler Implementation

Now let's turn our attention to the actual compiler implementation, converting the abstract syntax tree we built in the previous article into LLVM IR.

Type Mapping: From Parser to LLVM

First, we need to establish a mapping between the TinyMoonbit type system and the LLVM type system:

pub struct CodeGen {
  parser_program : Program                    // AST representation of the source program
  llvm_context : @llvm.Context               // LLVM context
  llvm_module : @llvm.Module                 // LLVM module
  builder : @llvm.Builder                    // IR builder
  llvm_functions : Map[String, @llvm.FunctionValue]  // Function map
}

pub fn CodeGen::convert_type(self : Self, parser_type : Type) -> &@llvm.Type raise {
  match parser_type {
    Type::Unit => self.llvm_context.void_type() as &@llvm.Type
    Type::Bool => self.llvm_context.bool_type()
    Type::Int => self.llvm_context.i32_type()
    Type::Double => self.llvm_context.f64_type()
    // Can be extended with more types as needed
  }
}
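As a quick reference, the mapping can be written out as a table of the type names that end up in the textual IR. This Python sketch makes one assumption beyond the source: that bool_type() corresponds to LLVM's i1, as it does in inkwell. The name LLVM_TYPE_NAMES is hypothetical.

```python
# TinyMoonbit type name -> LLVM IR type name (assumed, see the note above).
LLVM_TYPE_NAMES = {
    "Unit": "void",
    "Bool": "i1",
    "Int": "i32",
    "Double": "double",
}

def convert_type(parser_type: str) -> str:
    """Look up the LLVM type for a source type, failing loudly on unknown ones."""
    try:
        return LLVM_TYPE_NAMES[parser_type]
    except KeyError:
        raise ValueError(f"Unsupported type: {parser_type}") from None

for name in LLVM_TYPE_NAMES:
    print(f"{name} -> {convert_type(name)}")
```

Failing on unknown types, rather than silently defaulting, keeps a newly added source type from being compiled with the wrong representation.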

Environment Management: Mapping Variables to Values

During the code generation phase, we need to maintain a mapping from variable names to LLVM values:

pub struct Env {
  parent : Env?                               // Reference to the parent environment
  symbols : Map[String, &@llvm.Value]         // Local variable map

  // Global information
  codegen : CodeGen                           // Reference to the code generator
  parser_function : Function                  // AST of the current function
  llvm_function : @llvm.FunctionValue         // LLVM representation of the current function
}

pub fn Env::get_symbol(self : Self, name : String) -> &@llvm.Value? {
  match self.symbols.get(name) {
    Some(value) => Some(value)
    None =>
      match self.parent {
        Some(parent_env) => parent_env.get_symbol(name)
        None => None
      }
  }
}
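The lookup walks outward through the chain of environments until it finds the name or runs out of parents. A small Python model of the same structure shows how shadowing falls out of this rule. Here the values are plain strings standing in for LLVM values, and the sub_env helper is an assumption about how child scopes are created.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Env:
    parent: Optional["Env"] = None
    symbols: dict[str, str] = field(default_factory=dict)

    def get_symbol(self, name: str) -> Optional[str]:
        """Search this scope first, then each enclosing scope in turn."""
        env: Optional["Env"] = self
        while env is not None:
            if name in env.symbols:
                return env.symbols[name]
            env = env.parent
        return None

    def sub_env(self) -> "Env":
        """Create a child scope whose lookups fall back to this one."""
        return Env(parent=self)

outer = Env(symbols={"x": "%x.outer"})
inner = outer.sub_env()
inner.symbols["y"] = "%y.inner"
print(inner.get_symbol("x"))   # found in the parent scope
print(outer.get_symbol("y"))   # inner names are invisible outside
```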

Variable Handling: Memory Allocation Strategy

As a systems-level language, TinyMoonbit supports variable reassignment. In LLVM IR's SSA (Static Single Assignment) form, we need to use the alloca + load/store pattern to implement mutable variables:

pub fn Stmt::emit(self : Self, env : Env) -> Unit raise {
  match self {
    // Variable declaration: e.g., let x : Int = 5;
    Let(var_name, var_type, init_expr) => {
      // Convert the type and allocate stack space
      let llvm_type = env.codegen.convert_type(var_type)
      let alloca = env.codegen.builder.build_alloca(llvm_type, var_name)

      // Record the allocated pointer in the symbol table
      env.symbols.set(var_name, alloca as &@llvm.Value)

      // Calculate the value of the initialization expression
      let init_value = init_expr.emit(env).into_basic_value()

      // Store the initial value into the allocated memory
      let _ = env.codegen.builder.build_store(alloca, init_value)
    }

    // Variable assignment: x = 10;
    Assign(var_name, rhs_expr) => {
      // Get the memory address of the variable from the symbol table
      guard env.get_symbol(var_name) is Some(var_ptr) else {
        raise CodeGenError("Undefined variable: \{var_name}")
      }

      // Calculate the value of the right-hand side expression
      let rhs_value = rhs_expr.emit(env).into_basic_value()

      // Store the new value into the variable's memory
      let _ = env.codegen.builder.build_store(var_ptr, rhs_value)
    }

    // Other statement types...
    _ => { /* Handle other statements */ }
  }
}

Design Decision: Why use alloca?

In functional languages, immutable variables can be directly mapped to SSA values. However, TinyMoonbit supports variable reassignment, which conflicts with the SSA principle of "each variable is assigned only once."

The alloca + load/store pattern is the standard way to handle mutable variables:

alloca: Allocates memory space on the stack.

store: Writes a value to memory.

load: Reads a value from memory.

LLVM's mem2reg pass later promotes simple allocas (those whose address is only used by loads and stores) back to SSA values, so this pattern adds little or no cost once optimizations are enabled.
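To see the pattern in isolation, here is a minimal Python sketch that emits the IR text for a declaration, a reassignment and a read of one i32 variable. The class FunctionEmitter and the %x.addr naming are hypothetical and only illustrate the shape of the output, not how llvm.mbt names things.

```python
class FunctionEmitter:
    """Emits alloca/store/load text for mutable i32 variables."""

    def __init__(self) -> None:
        self.lines: list[str] = []
        self.slots: dict[str, str] = {}   # variable name -> stack slot pointer
        self.next_tmp = 1

    def declare(self, name: str, init: str) -> None:
        slot = f"%{name}.addr"
        self.lines.append(f"{slot} = alloca i32")
        self.slots[name] = slot
        self.assign(name, init)

    def assign(self, name: str, value: str) -> None:
        # Reassignment never creates a new SSA name; it overwrites the slot.
        self.lines.append(f"store i32 {value}, ptr {self.slots[name]}")

    def read(self, name: str) -> str:
        tmp = f"%{self.next_tmp}"
        self.next_tmp += 1
        self.lines.append(f"{tmp} = load i32, ptr {self.slots[name]}")
        return tmp

emitter = FunctionEmitter()
emitter.declare("x", "5")     # let x : Int = 5;
emitter.assign("x", "10")     # x = 10;
emitter.read("x")             # any later use of x
print("\n".join(emitter.lines))
```

Each SSA name (%x.addr, %1) is still defined exactly once. The mutation happens in memory, which is what lets the frontend stay simple.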

Expression Code Generation

Expression code generation is relatively straightforward, mainly involving calling the corresponding instruction-building methods based on the expression type:

fn Expr::emit(self: Self, env: Env) -> &@llvm.Value raise {
  match self {
    AtomExpr(atom_expr, ..) => atom_expr.emit(env)
    Unary("-", expr, ty = Some(Int)) => {
      let value = expr.emit(env).into_int_value()
      let zero = env.codegen.llvm_context.i32_type().const_zero()
      env.codegen.builder.build_int_sub(zero, value)
    }
    Unary("-", expr, ty = Some(Double)) => {
      let value = expr.emit(env).into_float_value()
      env.codegen.builder.build_float_neg(value)
    }
    Binary("+", lhs, rhs, ty = Some(Int)) => {
      let lhs_val = lhs.emit(env).into_int_value()
      let rhs_val = rhs.emit(env).into_int_value()
      env.codegen.builder.build_int_add(lhs_val, rhs_val)
    }
    // ... others
  }
}

Technical Detail: Floating-Point Negation

Note that when handling floating-point negation, we use build_float_neg instead of subtracting the operand from zero. This is because:

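IEEE 754 Standard: Floating-point numbers have special values (signed zeros, NaN, ∞), and subtracting from zero does not always match true negation. For example, 0.0 - 0.0 yields +0.0, whereas negating 0.0 yields -0.0.

Performance Considerations: A dedicated negation instruction only needs to flip the sign bit, which is usually at least as cheap as a subtraction on modern processors.

Precision Guarantee: Negation is exact by construction, so no rounding is involved.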
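One concrete case where the two approaches differ is the sign of zero. Python floats are IEEE 754 doubles, so a short check is enough to observe the difference; the comments relate each line to the corresponding LLVM instruction.

```python
import math

x = 0.0
by_subtraction = 0.0 - x   # what `fsub double 0.0, %x` computes
by_negation = -x           # what `fneg double %x` computes

# copysign exposes the sign bit, which `==` cannot see (0.0 == -0.0 is True).
print(math.copysign(1.0, by_subtraction))
print(math.copysign(1.0, by_negation))
```

The two results compare equal, yet they behave differently in later operations such as division, so a compiler that wants negation should emit negation.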

Chapter 5: Implementation of Control Flow Instructions

Control flow is the backbone of program logic, including conditional branches and loop structures. In LLVM IR, control flow is implemented through Basic Blocks and branch instructions. Each basic block represents a sequence of instructions with no internal jumps, and blocks are connected by branch instructions.

Conditional Branches: Implementing if-else Statements

Conditional branches require creating multiple basic blocks to represent different execution paths:

fn Stmt::emit(self: Self, env: Env) -> Unit raise {
  let ctx = env.codegen.llvm_context
  let llvm_func = env.llvm_function
  let builder = env.codegen.builder
  match self {
    If(cond, then_stmts, else_stmts) => {
      let cond_val = cond.emit(env).into_int_value()

      // Create three basic blocks
      let then_block = ctx.append_basic_block(llvm_func)
      let else_block = ctx.append_basic_block(llvm_func)
      let merge_block = ctx.append_basic_block(llvm_func)

      // Create the jump instruction
      let _ = builder.build_conditional_branch(
        cond_val, then_block, else_block,
      )

      // Generate code for the then_block
      builder.position_at_end(then_block)
      let then_env = env.sub_env()
      then_stmts.each(s => s.emit(then_env))
      let _ = builder.build_unconditional_branch(merge_block)

      // Generate code for the else_block
      builder.position_at_end(else_block)
      let else_env = env.sub_env()
      else_stmts.each(s => s.emit(else_env))
      let _ = builder.build_unconditional_branch(merge_block)

      // After code generation is complete, the builder's position should be on the merge_block
      builder.position_at_end(merge_block)
    }
    // ...
  }
}

Generated LLVM IR Example

For the following TinyMoonbit code:

if x > 0 {
  y = x + 1;
} else {
  y = x - 1;
}

It will generate LLVM IR similar to this:

  %1 = load i32, ptr %x, align 4
  %2 = icmp sgt i32 %1, 0
  br i1 %2, label %if.then, label %if.else

if.then:                                          ; preds = %0
  %3 = load i32, ptr %x, align 4
  %4 = add i32 %3, 1
  store i32 %4, ptr %y, align 4
  br label %if.end

if.else:                                          ; preds = %0
  %5 = load i32, ptr %x, align 4
  %6 = sub i32 %5, 1
  store i32 %6, ptr %y, align 4
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  ; Subsequent code...
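The block structure is easier to see with the LLVM details stripped away. The following Python sketch models a builder as nothing more than a dictionary of blocks plus a current insertion point. The class Builder and the function emit_if are hypothetical and only mimic the method names used above.

```python
class Builder:
    """Collects instructions per basic block and tracks the insertion point."""

    def __init__(self) -> None:
        self.blocks: dict[str, list[str]] = {}
        self.current = ""

    def append_basic_block(self, name: str) -> str:
        self.blocks[name] = []
        return name

    def position_at_end(self, block: str) -> None:
        self.current = block

    def emit(self, instruction: str) -> None:
        self.blocks[self.current].append(instruction)

def emit_if(b: Builder, cond: str, then_body: list[str], else_body: list[str]) -> None:
    then_bb = b.append_basic_block("if.then")
    else_bb = b.append_basic_block("if.else")
    merge_bb = b.append_basic_block("if.end")
    # Terminate the current block with the two-way branch.
    b.emit(f"br i1 {cond}, label %{then_bb}, label %{else_bb}")
    for block, body in ((then_bb, then_body), (else_bb, else_body)):
        b.position_at_end(block)
        for instruction in body:
            b.emit(instruction)
        b.emit(f"br label %{merge_bb}")   # every block ends in a terminator
    b.position_at_end(merge_bb)           # later code continues here

b = Builder()
b.position_at_end(b.append_basic_block("entry"))
emit_if(b, "%2", ["store i32 %4, ptr %y"], ["store i32 %6, ptr %y"])
for name, body in b.blocks.items():
    print(f"{name}:")
    for line in body:
        print(f"  {line}")
```

Forgetting the final position_at_end call would append all subsequent instructions to the else block, after its terminator, which LLVM rejects.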

Loop Structures: Implementing while Statements

The implementation of loops requires special attention to the correct connection of the condition check and the loop body:

fn Stmt::emit(self: Self, env: Env) -> Unit raise {
  let ctx = env.codegen.llvm_context
  let llvm_func = env.llvm_function
  let builder = env.codegen.builder
  match self {
    While(cond, body) => {
      // Generate three blocks
      let cond_block = ctx.append_basic_block(llvm_func)
      let body_block = ctx.append_basic_block(llvm_func)
      let merge_block = ctx.append_basic_block(llvm_func)

      // First, unconditionally jump to the cond block
      let _ = builder.build_unconditional_branch(cond_block)
      builder.position_at_end(cond_block)

      // Generate code within the cond block, as well as the conditional jump instruction
      let cond_val = cond.emit(env).into_int_value()
      let _ = builder.build_conditional_branch(
        cond_val, body_block, merge_block,
      )
      builder.position_at_end(body_block)

      // Generate code for the body block, with an unconditional jump to the cond block at the end
      let body_env = env.sub_env()
      body.each(s => s.emit(body_env))
      let _ = builder.build_unconditional_branch(cond_block)

      // After code generation is finished, position the builder at the merge block
      builder.position_at_end(merge_block)
    }
    // ...
  }
}

Generated LLVM IR Example

For the TinyMoonbit code:

while i < 10 {
  i = i + 1;
}

It will generate:

  br label %while.cond

while.cond:                                       ; preds = %while.body, %0
  %1 = load i32, ptr %i, align 4
  %2 = icmp slt i32 %1, 10
  br i1 %2, label %while.body, label %while.end

while.body:                                       ; preds = %while.cond
  %3 = load i32, ptr %i, align 4
  %4 = add i32 %3, 1
  store i32 %4, ptr %i, align 4
  br label %while.cond

while.end:                                        ; preds = %while.cond
  ; Subsequent code...
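To convince ourselves that these three blocks really implement the loop, we can step through them by hand. The Python sketch below is a tiny interpreter for exactly this control-flow graph. The function run_while_loop is hypothetical and hard-codes the condition i < 10 and the body i = i + 1 from the example.

```python
def run_while_loop(i: int) -> tuple[int, list[str]]:
    """Execute the while.cond / while.body / while.end blocks and record the path."""
    trace: list[str] = []
    block = "while.cond"                 # br label %while.cond
    while block != "while.end":
        trace.append(block)
        if block == "while.cond":
            # icmp slt followed by the conditional br
            block = "while.body" if i < 10 else "while.end"
        else:                            # while.body
            i = i + 1                    # load, add, store
            block = "while.cond"         # the back edge
    trace.append(block)
    return i, trace

final_i, trace = run_while_loop(7)
print(final_i)
print(trace)
```

The condition block runs once more than the body, because it is both the loop entry and the target of the back edge. That is why it needs its own basic block instead of being merged into the code before the loop.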

💡 Control Flow Design Points

Basic Block Naming Strategy

The append_basic_block function also accepts a labelled name argument.

// Use descriptive block names for easier debugging and understanding
let then_block = context.append_basic_block(func, name="if.then")
let else_block = context.append_basic_block(func, name="if.else")
let merge_block = context.append_basic_block(func, name="if.end")

Scope Management

// Create a separate scope for each branch and loop body
let branch_env = env.sub_env()
branch_stmts.each(stmt => stmt.emit(branch_env))

Builder Position Management

At the end, be sure to place the instruction builder on the correct basic block.

// Always ensure the builder points to the correct basic block
builder.position_at_end(merge_block)
// Generate instructions in this block...

Chapter 6: From LLVM IR to Machine Code

After generating the complete LLVM IR, we need to convert it into assembly code for the target machine. Although llvm.mbt provides a complete target machine configuration API, for learning purposes, we can use a simpler method.

Compiling with the `llc` Toolchain

The most direct method is to output the generated LLVM IR to a file and then use the LLVM toolchain to compile it:

Call the dump function of the Module, or you can use the println function.

let gen : CodeGen = ...
let prog = gen.llvm_module
prog.dump() // dump is recommended as it will be slightly faster than println, with the same effect

// or println(prog)

Complete Compilation Flow Example

Let's look at a complete compilation flow from source code to assembly code:

TinyMoonbit Source Code

fn factorial(n: Int) -> Int {
  if n <= 1 {
    return 1;
  }
  return n * factorial(n - 1);
}

fn main() -> Unit {
  let result: Int = factorial(5);
  print_int(result);
}

Generated LLVM IR

; ModuleID = 'tinymoonbit'
source_filename = "tinymoonbit"

define i32 @factorial(i32 %0) {
entry:
  %1 = alloca i32, align 4
  store i32 %0, ptr %1, align 4
  %2 = load i32, ptr %1, align 4
  %3 = icmp sle i32 %2, 1
  br i1 %3, label %4, label %6

4:                                                ; preds = %entry
  ret i32 1

6:                                                ; preds = %entry
  %7 = load i32, ptr %1, align 4
  %8 = load i32, ptr %1, align 4
  %9 = sub i32 %8, 1
  %10 = call i32 @factorial(i32 %9)
  %11 = mul i32 %7, %10
  ret i32 %11
}

define void @main() {
entry:
  %0 = alloca i32, align 4
  %1 = call i32 @factorial(i32 5)
  store i32 %1, ptr %0, align 4
  %2 = load i32, ptr %0, align 4
  call void @print_int(i32 %2)
  ret void
}

declare void @print_int(i32 %0)
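
To make the IR easier to read, here is a rough Python mirror of @factorial, with one Python statement per IR instruction. It is only a reading aid: the name factorial_ir and the one-element list standing in for the alloca slot are illustrative assumptions, and Python integers do not wrap at 32 bits the way i32 does.

```python
def factorial_ir(arg0: int) -> int:
    """Mirror of @factorial, one Python step per IR instruction (sketch)."""
    slot = [0]                # %1 = alloca i32 (modelled as a one-element list)
    slot[0] = arg0            # store i32 %0, ptr %1
    v2 = slot[0]              # %2 = load i32, ptr %1
    if v2 <= 1:               # %3 = icmp sle i32 %2, 1 ; br i1 %3, label %4, label %6
        return 1              # block 4: ret i32 1
    v7 = slot[0]              # block 6: %7 = load i32, ptr %1
    v8 = slot[0]              # %8 = load i32, ptr %1
    v9 = v8 - 1               # %9 = sub i32 %8, 1
    v10 = factorial_ir(v9)    # %10 = call i32 @factorial(i32 %9)
    return v7 * v10           # %11 = mul i32 %7, %10 ; ret i32 %11


print(factorial_ir(5))
```

The two separate loads in block 6 reflect that every read of the parameter n goes through its stack slot, which is why the unoptimized IR loads the same address twice.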

Generating RISC-V Assembly with llc

# Generate LLVM IR
moon run main --target native > fact.ll

# Generate RISC-V 64-bit assembly code
llc -march=riscv64 -mattr=+m -o fact.s fact.ll

Generated RISC-V Assembly Snippet

factorial:
.Lfunc_begin0:
	.cfi_startproc
	addi	sp, sp, -32
	.cfi_def_cfa_offset 32
	sd	ra, 24(sp)
	.cfi_offset ra, -8
	sd	s0, 16(sp)
	.cfi_offset s0, -16
	addi	s0, sp, 32
	.cfi_def_cfa s0, 0
	sw	a0, -20(s0)
	lw	a0, -20(s0)
	li	a1, 1
	blt	a1, a0, .LBB0_2
	li	a0, 1
	j	.LBB0_3
.LBB0_2:
	lw	a0, -20(s0)
	lw	a1, -20(s0)
	addi	a1, a1, -1
	sw	a0, -24(s0)
	mv	a0, a1
	call	factorial
	lw	a1, -24(s0)
	mul	a0, a1, a0
.LBB0_3:
	ld	ra, 24(sp)
	ld	s0, 16(sp)
	addi	sp, sp, 32
	ret

Conclusion

Through this two-part series, we have completed a fully functional, albeit simple, compiler implementation, covering everything from the lexical analysis of a character stream and the construction of an abstract syntax tree to the generation of LLVM IR and machine code output.

Review

Part 1:

An elegant lexer based on pattern matching
Implementation of a recursive descent parser
A complete type-checking system
Scope management with an environment chain

Part 2:

A deep dive into the LLVM type and value systems
Variable management strategies in SSA form
Correct implementation of control flow instructions
A complete code generation pipeline

Moonbit's Advantages in Compiler Development

Through this practical project, we have gained a deep appreciation for Moonbit's unique value in the field of compiler construction:

Expressive Pattern Matching: Greatly simplifies the complexity of AST processing and type analysis.
Functional Programming Paradigm: Immutable data structures and pure functions make the compiler logic clearer and more reliable.
Modern Type System: Trait objects, generics, and error handling mechanisms provide ample abstraction capabilities.
Excellent Engineering Features: Features like derive and JSON serialization significantly improve development efficiency.

Final Words

Compiler technology represents the perfect combination of computer science theory and engineering practice. With a modern tool like Moonbit, we can explore this ancient yet vibrant field in a more elegant and efficient way.

We hope this series of articles will provide readers with a powerful aid on their journey into compiler design.

Recommended Learning Resources

Moonbit Official Documentation

llvm.mbt Documentation

llvm.mbt Project

LLVM Official Tutorial

Dancing with LLVM: A Moonbit Chronicle (Part 1) - Implementing the Frontend

August 4, 2025 · 16 min read

Introduction

Programming language design and compiler implementation have long been considered among the most challenging topics in computer science. The traditional path to learning compilers often requires students to first master a complex set of theoretical foundations:

Automata Theory: Finite state machines and regular expressions
Type Theory: The mathematical underpinnings of λ-calculus and type systems
Computer Architecture: Low-level implementation from assembly language to machine code

However, Moonbit, a functional programming language designed for the modern development landscape, offers a fresh perspective. It not only features a rigorous type system and exceptional memory safety guarantees but, more importantly, its rich syntax and toolchain tailored for the AI era make it an ideal choice for learning and implementing compilers.

Series Overview

This series of articles will delve into the core concepts and best practices of modern compiler implementation by building a small programming language compiler called TinyMoonbit.

Part 1: Focuses on the implementation of the language frontend, including lexical analysis, parsing, and type checking, ultimately generating an abstract syntax tree with complete type annotations.

Part 2: Dives into the code generation phase, utilizing Moonbit's official llvm.mbt binding library to convert the abstract syntax tree into LLVM intermediate representation and finally generate RISC-V assembly code.

TinyMoonbit Language Design

TinyMoonbit is a systems-level programming language with an abstraction level comparable to C. Although its syntax heavily borrows from Moonbit, TinyMoonbit is not a subset of the Moonbit language. Instead, it is a simplified version designed to test the feature completeness of llvm.mbt while also serving an educational purpose.

Note: Due to space constraints, the TinyMoonbit implementation discussed in this series is simpler than the actual TinyMoonbit. For the complete version, please refer to TinyMoonbitLLVM.

Core Features

TinyMoonbit provides the fundamental features required for modern systems programming:

✅ Low-level Memory Operations: Direct pointer manipulation and memory management
✅ Control Flow Structures: Conditional branches, loops, and function calls
✅ Type Safety: Static type checking and explicit type declarations
❌ Simplified Design: To reduce implementation complexity, advanced features like type inference and closures are not supported.

Syntax Example

Let's demonstrate TinyMoonbit's syntax with a classic implementation of the Fibonacci sequence:

extern fn print_int(x : Int) -> Unit;

// Recursive implementation of the Fibonacci sequence
fn fib(n : Int) -> Int {
  if n <= 1 {
    return n;
  }
  return fib(n - 1) + fib(n - 2);
}

fn main {
  print_int(fib(10));
}

Compilation Target

After the complete compilation process, the above code will generate the following LLVM Intermediate Representation:

; ModuleID = 'tinymoonbit'
source_filename = "tinymoonbit"

define i32 @fib(i32 %0) {
entry:
  %1 = alloca i32, align 4
  store i32 %0, ptr %1, align 4
  %2 = load i32, ptr %1, align 4
  %3 = icmp sle i32 %2, 1
  br i1 %3, label %4, label %6

4:                                                ; preds = %entry
  %5 = load i32, ptr %1, align 4
  ret i32 %5

6:                                                ; preds = %entry
  %7 = load i32, ptr %1, align 4
  %8 = sub i32 %7, 1
  %9 = call i32 @fib(i32 %8)
  %10 = load i32, ptr %1, align 4
  %11 = sub i32 %10, 2
  %12 = call i32 @fib(i32 %11)
  %13 = add i32 %9, %12
  ret i32 %13
}

define void @main() {
entry:
  %0 = call i32 @fib(i32 10)
  call void @print_int(i32 %0)
  ret void
}

declare void @print_int(i32 %0)

Chapter 2: Lexical Analysis

Lexical Analysis is the first stage of the compilation process. Its core mission is to convert a continuous stream of characters into a sequence of meaningful tokens. This seemingly simple conversion process is, in fact, the cornerstone of the entire compiler pipeline.

From Characters to Symbols: Token Design and Implementation

Consider the following code snippet:

let x : Int = 5;

After being processed by the lexer, it will produce the following sequence of tokens:

(Keyword "let") → (Identifier "x") → (Symbol ":") →
(Type "Int") → (Operator "=") → (IntLiteral 5) → (Symbol ";")

This conversion process needs to handle various complex situations:

Whitespace Filtering: Skipping spaces, tabs, and newlines.
Keyword Recognition: Distinguishing reserved words from user-defined identifiers.
Numeric Parsing: Correctly identifying the boundaries of integers and floating-point numbers.
Operator Handling: Differentiating between single-character and multi-character operators.

Token Type System Design

Based on the TinyMoonbit syntax specification, we classify all possible symbols into the following token types:

pub enum Token {
  Bool(Bool)       // Boolean values: true, false
  Int(Int)         // Integers: 1, 2, 3, ...
  Double(Double)   // Floating-point numbers: 1.0, 2.5, 3.14, ...
  Keyword(String)  // Reserved words: let, if, while, fn, return
  Upper(String)    // Type identifiers: start with an uppercase letter, e.g., Int, Double, Bool
  Lower(String)    // Variable identifiers: start with a lowercase letter, e.g., x, y, result
  Symbol(String)   // Operators and punctuation: +, -, *, :, ;, ->
  Bracket(Char)    // Brackets: (, ), [, ], {, }
  EOF              // End-of-file marker
} derive(Show, Eq)
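
As a concrete illustration of these categories, the following Python sketch classifies lexemes that have already been split apart. It is not the article's lexer: the function name classify, the tagged-tuple representation, and the exact keyword set are assumptions made for this example.

```python
# Hypothetical keyword set, assumed for this sketch.
KEYWORDS = {"let", "fn", "if", "else", "while", "return", "extern"}


def classify(lexeme: str):
    """Map one pre-split lexeme to a (variant, payload) pair."""
    if lexeme in ("true", "false"):
        return ("Bool", lexeme == "true")
    if lexeme in KEYWORDS:
        return ("Keyword", lexeme)
    if lexeme.isdigit():
        return ("Int", int(lexeme))
    if lexeme.count(".") == 1 and lexeme.replace(".", "").isdigit():
        return ("Double", float(lexeme))
    if len(lexeme) == 1 and lexeme in "()[]{}":
        return ("Bracket", lexeme)
    if lexeme[:1].isupper():
        return ("Upper", lexeme)   # type identifier
    if lexeme[:1].islower():
        return ("Lower", lexeme)   # variable identifier
    return ("Symbol", lexeme)      # everything else: operators, punctuation


print([classify(lexeme) for lexeme in "let x : Int = 5 ;".split()])
```

Under these assumptions, Int is classified as Upper rather than as a keyword, and both : and = end up as Symbol values.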

Leveraging Pattern Matching

Moonbit's pattern matching lets us write the lexer as a direct list of cases over the remaining input. Compared to the traditional hand-written finite state machine approach, this pattern-matching-based implementation is more intuitive and easier to understand.

Core Analysis Function

pub fn lex(code: String) -> Array[Token] {
  let tokens = Array::new()

  loop code[:] {
    // Skip whitespace characters
    [' ' | '\n' | '\r' | '\t', ..rest] =>
      continue rest

    // Handle single-line comments
    [.."//", ..rest] =>
      continue loop rest {
        ['\n' | '\r', ..rest_str] => break rest_str
        [_, ..rest_str] => continue rest_str
        [] as rest_str => break rest_str
      }

    // Recognize multi-character operators (order is important!)
    [.."->", ..rest] => { tokens.push(Symbol("->")); continue rest }
    [.."==", ..rest] => { tokens.push(Symbol("==")); continue rest }
    [.."!=", ..rest] => { tokens.push(Symbol("!=")); continue rest }
    [.."<=", ..rest] => { tokens.push(Symbol("<=")); continue rest }
    [..">=", ..rest] => { tokens.push(Symbol(">=")); continue rest }

    // Recognize single-character operators and punctuation
    [':' | '.' | ',' | ';' | '+' | '-' | '*' |
     '/' | '%' | '>' | '<' | '=' as c, ..rest] => {
      tokens.push(Symbol("\{c}"))
      continue rest
    }

    // Recognize brackets
    ['(' | ')' | '[' | ']' | '{' | '}' as c, ..rest] => {
      tokens.push(Bracket(c))
      continue rest
    }

    // Recognize identifiers and literals
    ['a'..='z', ..] as code => {
      let (tok, rest) = lex_ident(code);
      tokens.push(tok)
      continue rest
    }

    ['A'..='Z', ..] => { ... }
    ['0'..='9', ..] => { ... }

    // Reached the end of the file
    [] => { tokens.push(EOF); break tokens }
  }
}

Keyword Recognition Strategy

Identifier parsing requires special handling for keyword recognition:

pub fn lex_ident(rest: @string.View) -> (Token, @string.View) {
  // Predefined keyword map
  let keyword_map = Map::from_array([
    ("let", Token::Keyword("let")),
    ("fn", Token::Keyword("fn")),
    ("if", Token::Keyword("if")),
    ("else", Token::Keyword("else")),
    ("while", Token::Keyword("while")),
    ("return", Token::Keyword("return")),
    ("extern", Token::Keyword("extern")),
    ("true", Token::Bool(true)),
    ("false", Token::Bool(false)),
  ])

  // Collect the longest run of identifier characters
  let identifier_chars = Array::new()
  let remaining = loop rest {
    ['a'..='z' | 'A'..='Z' | '0'..='9' | '_' as c, ..rest_str] => {
      identifier_chars.push(c)
      continue rest_str
    }
    _ as rest_str => break rest_str
  }

  // Keywords take precedence; anything else is an ordinary identifier
  let ident = String::from_array(identifier_chars)
  let token = keyword_map.get(ident).or_else(() => Token::Lower(ident))

  (token, remaining)
}
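
The same scan-then-look-up strategy can be sketched in a few lines of Python. This is an illustrative analogue, not the article's code: tokens are modelled as plain tuples, and the KEYWORDS table simply copies the entries of the keyword map above.

```python
from typing import Tuple

# Keyword table mirroring the MoonBit keyword map; tokens are (variant, payload) tuples.
KEYWORDS = {
    "let": ("Keyword", "let"),
    "fn": ("Keyword", "fn"),
    "if": ("Keyword", "if"),
    "else": ("Keyword", "else"),
    "while": ("Keyword", "while"),
    "return": ("Keyword", "return"),
    "extern": ("Keyword", "extern"),
    "true": ("Bool", True),
    "false": ("Bool", False),
}


def is_ident_char(ch: str) -> bool:
    """ASCII letters, digits and underscore, as in the pattern above."""
    return ch.isascii() and (ch.isalnum() or ch == "_")


def lex_ident(rest: str) -> Tuple[tuple, str]:
    """Consume the longest identifier prefix and return (token, remaining input)."""
    end = 0
    while end < len(rest) and is_ident_char(rest[end]):
        end += 1
    ident, remaining = rest[:end], rest[end:]
    # Keywords win; any other identifier becomes a Lower token.
    return KEYWORDS.get(ident, ("Lower", ident)), remaining


print(lex_ident("while x"))
print(lex_ident("letter = 1"))
```

Because the whole identifier is read before the table lookup, a name such as letter is not mistaken for the keyword let followed by ter.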

💡 In-depth Analysis of Moonbit Syntax Features

The implementation of the lexer above fully demonstrates several outstanding advantages of Moonbit in compiler development:

Functional Loop Construct
```
loop initial_value {
  pattern1 => continue new_value1
  pattern2 => continue new_value2
  pattern3 => break final_value
}
```
loop is not a traditional loop structure but a functional loop:
- It accepts an initial parameter as the loop state.
- It handles different cases through pattern matching.
- continue passes the new state to the next iteration.
- break terminates the loop and returns the final value.
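
For readers more familiar with imperative languages, the comment-skipping inner loop of the lexer can be approximated in Python as an explicit state variable that is rebound on every iteration. This is only an analogy under that assumption; the function name skip_line_comment is made up for the sketch.

```python
def skip_line_comment(text: str) -> str:
    """Return the input that follows the end of the current line."""
    rest = text                    # initial loop state
    while True:
        if rest == "":             # [] as rest_str => break rest_str
            return rest
        head, tail = rest[0], rest[1:]
        if head in "\n\r":         # ['\n' | '\r', ..rest_str] => break rest_str
            return tail
        rest = tail                # [_, ..rest_str] => continue rest_str


print(repr(skip_line_comment(" a comment\nlet x")))
```

Here "continue new_value" corresponds to rebinding rest, and "break final_value" corresponds to returning from the function.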

String Views and Pattern Matching

Moonbit's string pattern matching feature greatly simplifies text processing:

// Match a single character
['a', ..rest] => // Starts with the character 'a'

// Match a character range
['a'..='z' as c, ..rest] => // A lowercase letter, bound to the variable c

// Match a string literal
[.."hello", ..rest] => // Equivalent to ['h','e','l','l','o', ..rest]

// Match multiple possible characters
[' ' | '\t' | '\n', ..rest] => // Any whitespace character

The Importance of Pattern Matching Priority

⚠️ Important Reminder: The order of matching is crucial.

When writing pattern matching rules, you must place more specific patterns before more general ones. For example:

// ✅ Correct order
loop code[:] {
  [.."->", ..rest] => { ... }     // Match multi-character operators first
  ['-' | '>' as c, ..rest] => { ... }  // Then match single characters
}

// ❌ Incorrect order - "->" will never be matched
loop code[:] {
  ['-' | '>' as c, ..rest] => { ... }
  [.."->", ..rest] => { ... }     // This will never be executed
}

By using this pattern-matching-based approach, we not only avoid complex state machine implementations but also achieve a clearer and more maintainable code structure.
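
The effect of rule order can be reproduced with a small Python sketch. It is a simplified stand-in for the lexer above: it only handles whitespace and symbols, and the multi_first flag is a hypothetical switch that simulates placing the single-character rule first.

```python
MULTI_CHAR = ["->", "==", "!=", "<=", ">="]
SINGLE_CHAR = ":.,;+-*/%><="


def lex_symbols(code: str, multi_first: bool = True) -> list:
    """Tokenize operator characters; multi_first=False simulates the wrong rule order."""
    tokens = []
    i = 0
    while i < len(code):
        if code[i].isspace():
            i += 1
            continue
        two = code[i:i + 2]
        if multi_first and two in MULTI_CHAR:
            tokens.append(two)       # multi-character rule tried first
            i += 2
        elif code[i] in SINGLE_CHAR:
            tokens.append(code[i])   # single-character rule
            i += 1
        else:
            raise ValueError(f"unexpected character {code[i]!r}")
    return tokens


print(lex_symbols("->"))
print(lex_symbols("->", multi_first=False))
```

With the correct order the arrow stays one token; with the single-character rule taking priority it is split into a minus sign and a greater-than sign, which is the failure described above.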

Chapter 3: Parsing and Abstract Syntax Tree Construction

Syntactic Analysis (or Parsing) is the second core stage of the compiler. Its task is to reorganize the sequence of tokens produced by lexical analysis into a hierarchical Abstract Syntax Tree (AST). This process not only verifies whether the program conforms to the language's grammatical rules but also provides a structured data representation for subsequent semantic analysis and code generation.

Abstract Syntax Tree Design: A Structured Representation of the Program

Before building the parser, we need to carefully design the structure of the AST. This design determines how the program's syntactic structure is represented and how subsequent compilation stages will process these structures.

1. Core Type System

First, we define the representation of the TinyMoonbit type system in the AST:

pub enum Type {
  Unit    // Unit type, represents no return value
  Bool    // Boolean type: true, false
  Int     // 32-bit signed integer
  Double  // 64-bit double-precision floating-point number
} derive(Show, Eq, ToJson)

pub fn parse_type(type_name: String) -> Type {
  match type_name {
    "Unit" => Type::Unit
    "Bool" => Type::Bool
    "Int" => Type::Int
    "Double" => Type::Double
    _ => abort("Unknown type: \{type_name}")
  }
}

2. Layered AST Node Design

We use a layered design to clearly represent the different abstraction levels of the program:

Atomic Expressions (AtomExpr)

Represent the most basic, indivisible expression units:

pub enum AtomExpr {
  Bool(Bool)                                    // Boolean literal
  Int(Int)                                      // Integer literal
  Double(Double)                                // Floating-point literal
  Var(String, mut ty~ : Type?)                  // Variable reference
  Paren(Expr, mut ty~ : Type?)                  // Parenthesized expression
  Call(String, Array[Expr], mut ty~ : Type?)    // Function call
} derive(Show, Eq, ToJson)

Compound Expressions (Expr)

More complex structures that can contain operators and multiple sub-expressions:

pub enum Expr {
  AtomExpr(AtomExpr, mut ty~ : Type?)          // Wrapper for atomic expressions
  Unary(String, Expr, mut ty~ : Type?)         // Unary operation: -, !
  Binary(String, Expr, Expr, mut ty~ : Type?)  // Binary operation: +, -, *, /, ==, !=, etc.
} derive(Show, Eq, ToJson)

Statements (Stmt)

Represent executable units in the program:

pub enum Stmt {
  Let(String, Type, Expr)                      // Variable declaration: let x : Int = 5;
  Assign(String, Expr)                         // Assignment statement: x = 10;
  If(Expr, Array[Stmt], Array[Stmt])           // Conditional branch: if-else
  While(Expr, Array[Stmt])                     // Loop statement: while
  Return(Expr?)                                // Return statement: return expr;
  Expr(Expr)                                   // Expression statement
} derive(Show, Eq, ToJson)

Top-Level Structures

Function definitions and the complete program:

pub struct Function {
  name : String                     // Function name
  params : Array[(String, Type)]    // Parameter list: [(param_name, type)]
  ret_ty : Type                     // Return type
  body : Array[Stmt]                // Sequence of statements in the function body
} derive(Show, Eq, ToJson)

// The program is defined as a map from function names to function definitions
pub type Program Map[String, Function]

Design Highlight: Mutability of Type Annotations

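Notice that the Var, Paren, and Call atoms and every Expr node carry a mut ty~ : Type? field (the literal atoms do not, since a literal's type is evident from its constructor). The field starts out empty after parsing, and this design allows us to fill in type information during the type-checking phase without having to rebuild the entire AST.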
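
The idea of annotating the tree in place can be sketched in Python with dataclasses whose ty field starts as None. Everything here is an assumption made for illustration: the node classes, the annotate function, and the simplified typing rules are not the article's type checker.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class IntLit:
    value: int                    # literals need no annotation


@dataclass
class Var:
    name: str
    ty: Optional[str] = None      # filled in later by the checker


@dataclass
class Binary:
    op: str
    lhs: "Node"
    rhs: "Node"
    ty: Optional[str] = None      # filled in later by the checker


Node = Union[IntLit, Var, Binary]

COMPARISONS = ("==", "!=", "<", "<=", ">", ">=")


def annotate(node: Node, env: dict) -> str:
    """Write types into the existing nodes and return the node's type."""
    if isinstance(node, IntLit):
        return "Int"
    if isinstance(node, Var):
        node.ty = env[node.name]
        return node.ty
    lhs_ty = annotate(node.lhs, env)
    rhs_ty = annotate(node.rhs, env)
    if lhs_ty != rhs_ty:
        raise TypeError(f"operands of {node.op} differ: {lhs_ty} vs {rhs_ty}")
    node.ty = "Bool" if node.op in COMPARISONS else lhs_ty
    return node.ty


tree = Binary("+", Var("x"), IntLit(1))
annotate(tree, {"x": "Int"})
print(tree)
```

The parser builds the tree once with every ty left empty, and the checker then mutates the same objects instead of producing a second, typed copy of the tree.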

Recursive Descent Parsing: A Top-Down Construction Strategy

Recursive Descent is a top-down parsing method where the core idea is to write a corresponding parsing function for each grammar rule. In Moonbit, pattern matching makes the implementation of this method exceptionally elegant.
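
Before looking at the MoonBit implementation, the "one function per grammar rule" idea can be illustrated with a tiny Python sketch. The grammar (expr := term ('+' term)*, term := atom ('*' atom)*, atom := INT | '(' expr ')'), the tuple-based AST, and the function names are all assumptions chosen for this example. As in the parser that follows, each function returns the parsed node together with the remaining tokens.

```python
def parse_atom(tokens):
    """atom := INT | '(' expr ')'"""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    head, rest = tokens[0], tokens[1:]
    if isinstance(head, int):
        return head, rest
    if head == "(":
        node, rest = parse_expr(rest)
        if not rest or rest[0] != ")":
            raise SyntaxError("expected ')'")
        return node, rest[1:]
    raise SyntaxError(f"unexpected token {head!r}")


def parse_term(tokens):
    """term := atom ('*' atom)*"""
    node, rest = parse_atom(tokens)
    while rest and rest[0] == "*":
        rhs, rest = parse_atom(rest[1:])
        node = ("*", node, rhs)
    return node, rest


def parse_expr(tokens):
    """expr := term ('+' term)*"""
    node, rest = parse_term(tokens)
    while rest and rest[0] == "+":
        rhs, rest = parse_term(rest[1:])
        node = ("+", node, rhs)
    return node, rest


print(parse_expr([1, "+", 2, "*", 3]))
```

Operator precedence falls out of the call structure: parse_expr delegates to parse_term, so multiplication binds tighter than addition without any explicit precedence table.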

Parsing Atomic Expressions

pub fn parse_atom_expr(
  tokens: ArrayView[Token]
) -> (AtomExpr, ArrayView[Token]) raise {
  match tokens {
    // Parse literals
    [Bool(b), ..rest] => (AtomExpr::Bool(b), rest)
    [Int(i), ..rest] => (AtomExpr::Int(i), rest)
    [Double(d), ..rest] => (AtomExpr::Double(d), rest)

    // Parse function calls: func_name(arg1, arg2, ...)
    [Lower(func_name), Bracket('('), ..rest] => {
      let (args, rest) = parse_argument_list(rest)
      match rest {
        [Bracket(')'), ..remaining] =>
          (AtomExpr::Call(func_name, args, ty=None), remaining)
        _ => raise SyntaxError("Expected ')' after function arguments")
      }
    }

    // Parse variable references
    [Lower(var_name), ..rest] =>
      (AtomExpr::Var(var_name, ty=None), rest)

    // Parse parenthesized expressions: (expression)
    [Bracket('('), ..rest] => {
      let (expr, rest) = parse_expression(rest)
      match rest {
        [Bracket(')'), ..remaining] =>
          (AtomExpr::Paren(expr, ty=None), remaining)
        _ => raise SyntaxError("Expected ')' after expression")
      }
    }

    _ => raise SyntaxError("Expected atomic expression")
  }
}

Parsing Statements

Statement parsing needs to dispatch to different handler functions based on the starting keyword:

pub fn parse_stmt(tokens : ArrayView[Token]) -> (Stmt, ArrayView[Token]) {
  match tokens {
    // Parse let statements
    [Keyword("let"), Lower(var_name), Symbol(":"), ..] => { /* ... */ }

    // Parse if/while/return statements
    [Keyword("if"), .. rest] => parse_if_stmt(rest)
    [Keyword("while"), .. rest] => parse_while_stmt(rest)
    [Keyword("return"), .. rest] => { /* ... */ }

    // Parse assignment statements
    [Lower(_), Symbol("="), .. rest] => parse_assign_stmt(tokens)

    // Parse single expression statements
    [Lower(_), Symbol("="), .. rest] => parse_single_expr_stmt(tokens)

    _ => { /* Error handling */ }
  }
}

Challenge: Handling Operator Precedence

The most complex part of expression parsing is handling operator precedence. We need to ensure that 1 + 2 * 3 is correctly parsed as 1 + (2 * 3) and not (1 + 2) * 3.
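One common way to get this grouping is precedence climbing. The sketch below is written in Python rather than MoonBit so that it runs without the MoonBit toolchain, and it is not the article's actual implementation: the `PRECEDENCE` table, the flat token list of integers and operator strings, and the tuple-shaped AST are all assumptions made for illustration.

```python
# Hypothetical precedence table: higher numbers bind more tightly.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def parse_atom(tokens, pos):
    """Parse an integer literal or a parenthesized expression."""
    if pos >= len(tokens):
        raise SyntaxError("Expected atomic expression")
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_expr(tokens, pos + 1, 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("Expected ')' after expression")
        return node, pos + 1
    if isinstance(tok, int):
        return tok, pos + 1
    raise SyntaxError(f"Expected atomic expression, got {tok!r}")


def parse_expr(tokens, pos=0, min_prec=1):
    """Consume only operators that bind at least as tightly as min_prec."""
    lhs, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        # Left-associative: the right operand may only absorb tighter operators.
        rhs, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos


tree, _ = parse_expr([1, "+", 2, "*", 3])
print(tree)  # ('+', 1, ('*', 2, 3))
```

Because `*` has a higher precedence than `+`, the recursive call for the right operand of `+` absorbs `2 * 3` before returning, which yields the 1 + (2 * 3) grouping.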

💡 Application of Advanced Moonbit Features

Automatic Derivation Feature

pub enum Expr {
  // ...
} derive(Show, Eq, ToJson)

Moonbit's derive feature automatically generates common implementations for types. Here we use three:

Show: Provides debugging output functionality.
Eq: Supports equality comparison.
ToJson: Serializes to JSON format, which is convenient for debugging and persistence.

These automatically generated features are extremely useful in compiler development, especially during the debugging and testing phases.

Error Handling Mechanism

pub fn parse_expression(
  tokens: ArrayView[Token]
) -> (Expr, ArrayView[Token]) raise {
  // The 'raise' keyword indicates that this function may throw an exception
}

Moonbit's raise mechanism provides structured error handling, allowing syntax errors to be accurately located and reported.

Through this layered design and recursive descent parsing strategy, we have built a parser that is both flexible and efficient, laying a solid foundation for the subsequent type-checking phase.

Chapter 4: Type Checking and Semantic Analysis

Semantic Analysis is a crucial intermediate stage in compiler design. While parsing ensures the program's structure is correct, it doesn't mean the program is semantically valid. Type Checking, as the core component of semantic analysis, is responsible for verifying the type consistency of all operations in the program, ensuring type safety and runtime correctness.

Scope Management: Building the Environment Chain

The primary challenge in type checking is correctly handling variable scopes. At different levels of the program (global, function, block), the same variable name may refer to different entities. We adopt the classic design of an Environment Chain to solve this problem:

pub struct TypeEnv[K, V] {
  parent : TypeEnv[K, V]?     // Reference to the parent environment
  data : Map[K, V]            // Variable bindings in the current environment
}

The core of the environment chain is the variable lookup algorithm, which follows the rules of lexical scoping:

pub fn TypeEnv::get[K : Eq + Hash, V](self : Self[K, V], key : K) -> V? {
  match self.data.get(key) {
    Some(value) => Some(value)    // Found in the current environment
    None =>
      match self.parent {
        Some(parent_env) => parent_env.get(key)  // Recursively search the parent environment
        None => None              // Reached the top-level environment, variable not defined
      }
  }
}

Design Principle: Lexical Scoping

This design ensures that variable lookup follows lexical scoping rules:

First, search in the current scope.

If not found, recursively search in the parent scope.

Continue until the variable is found or the global scope is reached.
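The three lookup steps above can be tried out with a small Python sketch (Python is used here only so the example runs without the MoonBit toolchain). The class mirrors the `parent` and `data` fields of `TypeEnv`; the `set` method and the string type names are assumptions added for the demonstration.

```python
class TypeEnv:
    """A scope holding its own bindings plus a link to the enclosing scope."""

    def __init__(self, parent=None):
        self.parent = parent  # enclosing scope, or None at the top level
        self.data = {}        # bindings declared in this scope

    def set(self, key, value):
        # Hypothetical helper: bind a name in the current scope only.
        self.data[key] = value

    def get(self, key):
        if key in self.data:             # 1. search the current scope
            return self.data[key]
        if self.parent is not None:      # 2. otherwise ask the parent scope
            return self.parent.get(key)
        return None                      # 3. top level reached: undefined


global_env = TypeEnv()
global_env.set("x", "Int")
global_env.set("y", "Double")

block_env = TypeEnv(parent=global_env)
block_env.set("x", "Bool")  # shadows the outer x

print(block_env.get("x"))   # Bool (inner binding wins)
print(block_env.get("y"))   # Double (found in the parent)
print(block_env.get("z"))   # None (not defined anywhere)
```

Shadowing falls out of the search order: the inner `x` is found first, and the outer binding is untouched once the block's environment is discarded.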

Type Checker Architecture

Environment management alone is not sufficient to complete the type-checking task. Some operations (like function calls) need to access global program information. Therefore, we design a comprehensive type checker:

pub struct TypeChecker {
  local_env : TypeEnv[String, Type]    // Local variable environment
  current_func : Function              // The function currently being checked
  program : Program                    // Complete program information
}

Partial Implementation of Node Type Checking

The core of the type checker is to apply the corresponding type rules to different AST nodes. The following is the implementation of expression type checking:

pub fn Expr::check_type(
  self : Self,
  env : TypeEnv[String, Type]
) -> Type raise {
  match self {
    // Type checking for atomic expressions
    AtomExpr(atom_expr, ..) as node => {
      let ty = atom_expr.check_type(env)
      node.ty = Some(ty)  // Fill in the type information
      ty
    }

    // Type checking for unary operations
    Unary("-", expr, ..) as node => {
      let ty = expr.check_type(env)
      node.ty = Some(ty)
      ty
    }

    // Type checking for binary operations
    // (the operator is bound to `op` so the match below can inspect it)
    Binary(op, lhs, rhs, ..) as node => {
      let lhs_type = lhs.check_type(env)
      let rhs_type = rhs.check_type(env)

      // Ensure operand types are consistent
      guard lhs_type == rhs_type else {
        raise TypeCheckError(
          "Binary operation requires matching types, got \{lhs_type} and \{rhs_type}"
        )
      }

      let result_type = match op {
        // Comparison operators always return a boolean value
        "==" | "!=" | "<" | "<=" | ">" | ">=" => Type::Bool

        // Arithmetic operators, etc., maintain the operand type
        _ => lhs_type
      }

      node.ty = Some(result_type)
      result_type
    }
  }
}
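To see the binary-operation rule in isolation, here is a minimal Python sketch (not MoonBit, and not the article's code) of the same idea: check both operands, require matching types, return `Bool` for comparisons and the operand type otherwise, and record the result on the node. The `Var` and `Binary` classes, the plain-dict environment, and the string type names are simplifications assumed for illustration.

```python
from dataclasses import dataclass
from typing import Optional

COMPARISON_OPS = {"==", "!=", "<", "<=", ">", ">="}


class TypeCheckError(Exception):
    pass


@dataclass
class Var:
    name: str
    ty: Optional[str] = None  # filled in by check_type


@dataclass
class Binary:
    op: str
    lhs: object
    rhs: object
    ty: Optional[str] = None  # filled in by check_type


def check_type(expr, env):
    """Annotate expr (and its children) with types and return expr's type."""
    if isinstance(expr, Var):
        if expr.name not in env:
            raise TypeCheckError(f"Undefined variable {expr.name}")
        expr.ty = env[expr.name]
    elif isinstance(expr, Binary):
        lhs_ty = check_type(expr.lhs, env)
        rhs_ty = check_type(expr.rhs, env)
        if lhs_ty != rhs_ty:
            raise TypeCheckError(
                f"Binary operation requires matching types, got {lhs_ty} and {rhs_ty}"
            )
        # Comparisons yield Bool; other operators keep the operand type.
        expr.ty = "Bool" if expr.op in COMPARISON_OPS else lhs_ty
    else:
        raise TypeCheckError(f"Unsupported node {expr!r}")
    return expr.ty


env = {"x": "Int", "y": "Int", "flag": "Bool"}
node = Binary("<", Var("x"), Var("y"))
print(check_type(node, env))  # Bool
print(node.lhs.ty)            # Int
```

The node is annotated in place rather than rebuilt, which is the same trade-off the mutable `ty` field makes in the MoonBit AST.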

💡 Moonbit Enum Modification Trick

During the type-checking process, we need to fill in type information for the AST nodes. Moonbit provides an elegant way to modify the mutable fields of enum variants:

pub enum Expr {
  AtomExpr(AtomExpr, mut ty~ : Type?)
  Unary(String, Expr, mut ty~ : Type?)
  Binary(String, Expr, Expr, mut ty~ : Type?)
} derive(Show, Eq, ToJson)

By using the as binding in pattern matching, we can get a reference to the enum variant and modify its mutable fields:

match expr {
  AtomExpr(atom_expr, ..) as node => {
    let ty = atom_expr.check_type(env)
    node.ty = Some(ty)  // Modify the mutable field
    ty
  }
  // ...
}

This design avoids the overhead of rebuilding the entire AST while maintaining a functional programming style.

Complete Compilation Flow Demonstration

After the three stages of lexical analysis, parsing, and type checking, our compiler frontend is now able to convert source code into a fully typed abstract syntax tree. Let's demonstrate the complete process with a simple example:

Source Code Example

fn add(x: Int, y: Int) -> Int {
  return x + y;
}

Compilation Output: Typed AST

Using the derive(ToJson) feature, we can output the final AST in JSON format for inspection:

{
  "functions": {
    "add": {
      "name": "add",
      "params": [
        ["x", { "$tag": "Int" }],
        ["y", { "$tag": "Int" }]
      ],
      "ret_ty": { "$tag": "Int" },
      "body": [
        {
          "$tag": "Return",
          "0": {
            "$tag": "Binary",
            "0": "+",
            "1": {
              "$tag": "AtomExpr",
              "0": {
                "$tag": "Var",
                "0": "x",
                "ty": { "$tag": "Int" }
              },
              "ty": { "$tag": "Int" }
            },
            "2": {
              "$tag": "AtomExpr",
              "0": {
                "$tag": "Var",
                "0": "y",
                "ty": { "$tag": "Int" }
              },
              "ty": { "$tag": "Int" }
            },
            "ty": { "$tag": "Int" }
          }
        }
      ]
    }
  }
}

From this JSON output, we can clearly see:

Complete Function Signature: Including the parameter list and return type.
Type-Annotated AST Nodes: Each expression carries type information.
Structured Program Representation: Provides a clear data structure for the subsequent code generation phase.
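The second point can also be checked mechanically. The Python sketch below is an illustration only: it embeds just the `Return` statement from the JSON above and walks it, reporting any expression node that lacks a `ty` field. The `EXPR_TAGS` set and the `untyped_nodes` helper are hypothetical names; the tags are taken from the `Expr` and `AtomExpr` variants shown earlier.

```python
import json

# Tags of nodes that are expected to carry a "ty" field.
EXPR_TAGS = {"AtomExpr", "Unary", "Binary", "Var", "Paren", "Call"}

RETURN_STMT = json.loads("""
{
  "$tag": "Return",
  "0": {
    "$tag": "Binary",
    "0": "+",
    "1": {"$tag": "AtomExpr",
          "0": {"$tag": "Var", "0": "x", "ty": {"$tag": "Int"}},
          "ty": {"$tag": "Int"}},
    "2": {"$tag": "AtomExpr",
          "0": {"$tag": "Var", "0": "y", "ty": {"$tag": "Int"}},
          "ty": {"$tag": "Int"}},
    "ty": {"$tag": "Int"}
  }
}
""")


def untyped_nodes(node, path="$"):
    """Return the paths of expression nodes that have no 'ty' field."""
    missing = []
    if isinstance(node, dict):
        if node.get("$tag") in EXPR_TAGS and "ty" not in node:
            missing.append(path)
        for key, child in node.items():
            missing += untyped_nodes(child, f"{path}.{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            missing += untyped_nodes(child, f"{path}[{index}]")
    return missing


print(untyped_nodes(RETURN_STMT))  # []
```

An empty list means every expression node in the fragment was annotated by the type checker.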

Conclusion

In this article, we have delved into the complete implementation process of a compiler frontend. From a stream of characters to a typed abstract syntax tree, we have witnessed the unique advantages of the Moonbit language in compiler construction:

Core Takeaways

The Power of Pattern Matching: Moonbit's string pattern matching and structural pattern matching greatly simplify the implementation of lexical analysis and parsing.
Functional Programming Paradigm: The combination of the loop construct, environment chains, and immutable data structures provides a solution that is both elegant and efficient.
Expressive Type System: Through mutable fields in enums and trait objects, we can build data structures that are both type-safe and flexible.
Engineering Features: Features like derive, structured error handling, and JSON serialization significantly improve development efficiency.

Looking Ahead to Part 2

Having mastered the implementation of the frontend, the next article will guide us into the more exciting code generation phase. We will:

Delve into the design philosophy of LLVM Intermediate Representation.
Explore how to use Moonbit's official llvm.mbt binding library.
Implement the complete conversion from AST to LLVM IR.
Generate executable RISC-V assembly code.

Building a compiler is a complex and challenging process, but as we have shown in this article, Moonbit provides powerful and elegant tools for this task. Let's continue this exciting compiler construction journey in the next part.

Recommended Resources

Moonbit Official Documentation

llvm.mbt Documentation

llvm.mbt Project

LLVM Official Tutorial
