The OCaml compiler toolchain
Compiling source code into an executable involves a fairly complex set of parsers, libraries, linkers, and assembers. To develop its statically safe type system, OCaml runs code through a series of stages where each stage performs a unique job independent of other stages in the pipeline. For example, one stage of the compiler pipeline checks for type safety using knowledge of the OCaml type system, while the final stage of the OCaml compiler is outputting low-level assembly code that knows nothing about OCaml objects.
The book Real World OCaml provides an overview of the compilation pipeline that we can walk through to understand the entire process. Nodes in the diagram refer to build artifacts, and arrows are tasks or processes that convert build artifacts to different representations.
The OCaml compiler operates on source files. The first task in the pipeline is to parse the source code into a structured abstract syntax tree (AST). the parsing directory in the source distribution. Assuming the source file is valid OCaml sytanx, the abstract syntax tree is the first artifact generated by the comiler pipeline.
Given the syntax tree, the next stage in the pipeline is to verify that the syntax obeys the OCaml type rules. At this stage, OCaml performs three tasks:
- automatic type inference. Calculates types for a module where there are no type annotations.
- module system. Combines together modules using the knowledge of their type signatures to make sure that types match across modules.
- explicit subtyping. Checks that polymorphic rules are correct.
When the type checking process has successfully completed, it is combined with the AST to form a typed abstract syntax tree containing precise location information for every token in the input file with concrete type information for each token.
The parsing and type checking stage form OCamls error checking capabilities. Once the source file has passed these stages, it can stop emitting syntax and type errors and begin the process of compiling the well-formed modules into executable code.
Since we have verified that the source code is valid and conforms to OCaml’s type rules, the Lambda stage of the compiler pipeline eliminates the static type information into a simpler intermediate lambda form that can be optimized into high-performance code. This stage discards object and module information, and simplifies high-level concepts like pattern matching into optimized state machines .
After the lambda form has been generated, the OCaml compiler pipeline branches into two separate compilers: one for bytecode and one for native code.
Bytecode has the advantage of being simple, portable, and easily compiled, with the disadvantage of being slower to execute and requiring an OCaml runtime. Bytecode generated by the compiler is run by an OCaml runtime that includes an interpeter, garbage collection facilities, and some C functions that implement low-level system calls.
Native compilation is typically used for production use-cases. The compilation process is slower, but the resulting binary is highly optimized and offers faster performance.
BuckleScript takes a different approach to the same problem by inserting at a different point in the OCaml compiler pipeline, shown in the following figure: