Rami BALI Directed By Samuel TARDIEU
TELECOM ParisTech July 3rd, 2009
Bootstrapped Compilers
1. Introduction
Bootstrapping (also called self-hosting) is a technique that consists of writing a compiler for some programming language A in that same language A. Of course, before the first compiler for A exists, this compiler cannot compile itself. The usual approach is therefore to pick a host language that is reasonably close to the target language A and already available on the target machine, and to write a compiler for a subset of the target language in that host language. Alternatively, you can bootstrap the target language from “nothing”, building successive subsets of it up from machine code. We will not discuss this second option, because most compilers rely on a host language, but an example of building a compiler from nothing can be found in [1]. The first bootstrapped compiler was written for LISP by Tim Hart and Mike Levin at MIT in 1962. Many other languages, such as Scheme, Haskell, Forth, Factor, C and Pascal, have a bootstrapped compiler. I will first present some general bootstrapping techniques and then describe how well-known compilers like GHC, GNAT and Factor were bootstrapped.
2. Bootstrapping Techniques
2.1. Full Bootstrapping
If we want to implement a bootstrapped compiler for a target language A and no compiler for A exists yet, we first have to implement a compiler for A in a host language B, and then use the resulting compiler to compile an early version of the A compiler written in A. However, A may be very hard to translate to B, so we can begin by creating a bootstrapped compiler for a subset of A and then compile the full A compiler with the compiler for the subset. Here is an example from "Compiler, interpreter, and Bootstrapping":
Goal: We want to implement an Ada compiler for machine M.
Steps:
1. Write a compiler for AdaS (a subset of Ada) in C, for which a compiler already exists on M
[T-diagram: v1, AdaS -> M, written in C]
2. Compile this compiler
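[T-diagram: v1 (AdaS -> M, written in C) is compiled by the C -> M compiler (written in M, running on M) into v1 (AdaS -> M, written in M)]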
3. Write a compiler for AdaS (subset of Ada) in AdaS
[T-diagram: v2, AdaS -> M, written in AdaS]
4. Compile this compiler
[T-diagram: v2 (AdaS -> M, written in AdaS) is compiled by v1 (AdaS -> M, written in M, running on M) into v2 (AdaS -> M, written in M)]
5. Write a compiler for all Ada in AdaS
[T-diagram: v3, Ada -> M, written in AdaS]
6. Compile this Ada compiler
[T-diagram: v3 (Ada -> M, written in AdaS) is compiled by v2 (AdaS -> M, written in M, running on M) into v3 (Ada -> M, written in M)]
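The six steps above can be replayed with a small model of T-diagrams. The following Python sketch is only an illustration: the `Compiler` class, the `compile_with` function and the name `cc` for the existing C compiler are assumptions made for this example and do not come from the cited source. A compiler is described by the language it accepts, the language it produces, and the language it is written in; compiling a compiler changes only the last of these.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Compiler:
    """One T-diagram: translates `source` into `target`, written in `impl`."""
    name: str
    source: str
    target: str
    impl: str


def compile_with(program: Compiler, tool: Compiler, machine: str) -> Compiler:
    """Compile the compiler `program` with `tool`, which runs on `machine`."""
    if tool.impl != machine:
        raise ValueError(f"{tool.name} cannot run on machine {machine}")
    if program.impl != tool.source:
        raise ValueError(f"{tool.name} cannot read {program.impl} code")
    # Same translation as before, now expressed in the tool's target language.
    return replace(program, impl=tool.target)


def full_bootstrap(machine: str = "M") -> Compiler:
    # Hypothetical name for the C compiler that already exists on M.
    cc = Compiler("cc", source="C", target=machine, impl=machine)

    v1_src = Compiler("v1", source="AdaS", target=machine, impl="C")     # step 1
    v1 = compile_with(v1_src, cc, machine)                               # step 2
    v2_src = Compiler("v2", source="AdaS", target=machine, impl="AdaS")  # step 3
    v2 = compile_with(v2_src, v1, machine)                               # step 4
    v3_src = Compiler("v3", source="Ada", target=machine, impl="AdaS")   # step 5
    v3 = compile_with(v3_src, v2, machine)                               # step 6
    return v3


if __name__ == "__main__":
    print(full_bootstrap())
```

The two checks in `compile_with` correspond to the two ways T-diagrams must fit together: the tool must be executable on the machine, and it must accept the language in which the program is written.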
2.2. Simple Bootstrapping
If a compiler or an interpreter for the language A already exists on the target machine, you just have to write your new compiler directly in A and then compile it with the existing one. This technique was used to bootstrap Lisp, using an existing interpreter written in a language close to Fortran (see Lisp History [4]).
2.3. Half Bootstrapping
This technique is used when we want to bootstrap a compiler for a language A on some machine M, but only have a compiler for A on another machine N. It is easier than full bootstrapping. To build the bootstrapped compiler, we need a cross compiler that runs on machine N and produces the compiler for machine M. This example is also from "Compiler, interpreter, and Bootstrapping":
Goal: We want to implement an Ada compiler for machine M.
Steps:
1. Write the compiler in Ada
[T-diagram: Ada -> M, written in Ada]
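2. Build the cross compiler on machine N
[T-diagram: the compiler (Ada -> M, written in Ada) is compiled by the existing Ada -> N compiler (written in N, running on N) into a cross compiler (Ada -> M, written in N)]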
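3. Cross-compile the Ada compiler for M on machine N
[T-diagram: the compiler (Ada -> M, written in Ada) is compiled by the cross compiler (Ada -> M, written in N, running on N) into the final compiler (Ada -> M, written in M)]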
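The three steps of half bootstrapping can be sketched in the same spirit. In this Python illustration, the `Compiler` class and the functions `run_compiler` and `half_bootstrap` are hypothetical names chosen for the example. The point it makes is that the same Ada source is compiled twice on machine N: once by the native Ada compiler of N, which yields the cross compiler, and once by that cross compiler, which yields a compiler that runs on M.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Compiler:
    """Translates `source` into `target`; `impl` is the language it is written in."""
    source: str
    target: str
    impl: str


def run_compiler(tool: Compiler, program: Compiler, host: str) -> Compiler:
    """Run `tool` on machine `host` to compile the compiler `program`."""
    if tool.impl != host:
        raise ValueError(f"tool is written in {tool.impl}, cannot run on {host}")
    if program.impl != tool.source:
        raise ValueError(f"tool cannot read {program.impl} code")
    return replace(program, impl=tool.target)


def half_bootstrap(host: str = "N", target: str = "M"):
    # Assumed starting point: an Ada compiler that already works on the host.
    native_on_host = Compiler(source="Ada", target=host, impl=host)

    # Step 1: write the new compiler (Ada -> M) in Ada.
    new_src = Compiler(source="Ada", target=target, impl="Ada")
    # Step 2: compile it on N with the native compiler: a cross compiler.
    cross = run_compiler(native_on_host, new_src, host)
    # Step 3: compile the same source on N with the cross compiler.
    native_on_target = run_compiler(cross, new_src, host)
    return cross, native_on_target


if __name__ == "__main__":
    cross, final = half_bootstrap()
    print("cross compiler:", cross)
    print("final compiler:", final)
```

Note that machine M is never used during the build in this model; only the final result is copied to it.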
3. Examples of bootstrapped compilers
3.1. Haskell compiler GHC
GHC is a compiler for the functional programming language Haskell. Most of it is written in Haskell, with a small part written in C. However, the first release of this compiler was written in Lazy ML in 1989. Later that year, when GHC was rewritten in Haskell, it was simply compiled with the older version (see Simple Bootstrapping). Currently, GHC runs on Windows and Unix on several processor architectures, and specific releases already exist for those architectures, but it is also easy to port GHC to a new platform using half bootstrapping. As explained in the porting tutorial [3], the bootstrapping of GHC is based on compiling, on the target machine, intermediate C files generated by compiling the GHC source code on a host machine that already has a working GHC. To get GHC working on the target machine TM, you have to:
1. Download the GHC source code on TM and configure the build process.
2. Take this configuration to the host machine HM and half-compile GHC on HM with the TM configuration. (You will get the intermediate C files configured for TM.)
3. Copy those files to TM and compile them.
4. Possibly write some parts of the compiler by hand.
3.2. Ada compiler GNAT
GNAT is an open-source, multi-target compiler for Ada95. The GNAT project started in 1992 at New York University, under a contract awarded by the US Air Force, to support the new revision of Ada. GNAT is composed of a front end written in Ada95 and the GCC back end written in C, which makes GNAT portable to any platform on which GCC already exists. The GNAT team chose to write GNAT in Ada because of (among other technical and non-technical reasons) the use of hierarchical libraries for strong type checking. They started with a small subset of Ada83, which they extended step by step, and bootstrapped it with a commercial compiler. They were then able to write the entire GNAT front end in Ada95 (see The GNAT Project). The front end builds an Abstract Syntax Tree (AST) and then an expanded and decorated tree. This intermediate representation is translated into GCC tree fragments by a component written in C (GIGI), after which GCC carries on with the compilation. As with GHC, many cross compilers exist, which makes porting GNAT to new platforms easy.
3.3. Factor compiler
Factor is a new programming language created by Slava Pestov. Its compiler is composed of an optimizing compiler written entirely in Factor and a non-optimizing compiler, also called the VM, written in C++. In early versions of Factor, the non-optimizing compiler did not exist and an interpreter played its role. The interpreter, written in Java, was replaced in 2007 by the non-optimizing compiler written in C [8], which was ported to C++ in 2009 [9]. The optimizing compiler handles words with static stack effects and compiles them directly to machine code. On the other hand, the non-optimizing compiler compiles words with dynamic stack effects. The optimizing compiler is an eight-pass optimizer that cleans up and compacts the code, detects loops in recursive code, eliminates dead code, and so on [10]. Most user code is therefore compiled with this compiler. Moreover, Factor can save the state of the whole system into image files. The Factor virtual machine always boots from an image file, which can be chosen by the user or be the default factor.image, the initial image of Factor (see Factor Images [11]). Factor is ported to new platforms with the half bootstrapping technique (see this link).
4. Conclusion
Many programming languages have bootstrapped compilers, even though some people think that doing so is not very worthwhile if the target language is not intended for writing compilers [2]. However, one good reason is that future improvements to the language also benefit the compiler, which makes the compiler easier to maintain. I think this is especially true for high-level languages like Factor (whose compiler is written in Factor). Nevertheless, most bootstrapped compilers contain a small part written in a language other than the target language (mainly C). That makes them a little easier to write and much more portable.
5. Bib
[1] Compiling from nothing. http://homepage.ntlworld.com/edmund.grimleyevans/bcompiler.html
[2] The Compiler Bootstrap Koan. http://dcooney.com/ViewEntry.aspx?ID=528
[3] Cross Compiling GHC. http://hackage.haskell.org/trac/ghc/wiki/Building/Porting
[4] Lisp History. http://wwwformal.stanford.edu/jmc/history/lisp/lisp.html
[5] GNAT Book. https://www2.adacore.com/gapstatic/GNAT_Book/html/index.htm
[6] NY University about GNAT. http://cs.nyu.edu/cs/faculty/schonber/gnat.html
[7] The Ada Information Clearinghouse. http://www.adaic.com/compilers/articles/accgnat.html
[8] Factor Compiler. http://factorlanguage.blogspot.../twotiercompilationcomestofactor.html
[9] Factor VM in C++. http://factorlanguage.blogspot.com/2009/05/factorvmportedtoc.html
[10] Factor Optimizing compiler. http://factorlanguage.blogspot.com/2008/08/newoptimizer.html
[11] Factor Images. http://docs.factorcode.org/content/articlebootstrap.image.html