With Pith

Ethan Petuchowski

First Glimpse of Compilers

I am close to two chapters into “Compilers” (a.k.a. “The Dragon Book”), by Aho, Lam, Sethi, and Ullman. It is an exciting topic to learn about.

The most fundamental thing I have learned so far is the overall pipeline for understanding the modular components forming the way a compiler is typically constructed. It goes

  1. Lexical analyzer
  2. Syntax analyzer
  3. Semantic analyzer
  4. Intermediate code generator
  5. Machine-independent code optimizer (optional)
  6. Code generator
  7. Machine-dependent code optimizer (optional)

The input to the compiler pipeline is a “character stream” of the program. However I don’t recall the book ever dealing with specifications of that stream, except that it supports a generic getchar() operation. I can understand that if the entire source consists of one file, you just read character-by-character from the file into the compiler. Maybe the language designer decides how multiple-file projects are to be read by the compiler, i.e. it is beyond the scope of this book; or maybe they’ll go more in-depth on this later-on.

The Lexical Analyzer

The character stream is read into first component in the compilation pipeline: the lexical analyzer, which maps the character stream into a “token” stream. It seems like a token knows its tag, and its value. The tag is basically the type of this token in the eyes of the compiler. Possible tags in their example include NUM, ID, [keyword]s, and I added COMMENT as part of an exercise. The value for a token might be the literal value of a literal; or it might be the name of a variable. More about that below.

Symbol Tables

In addition to creating the token stream, the lexical analyzer builds the symbol table. It seems to that the symbol table maps names to places in memory. A symbol table also ‘by default’ implements the language’s preferred form of block scoping. Upon entering a new scope, a new symbol table is created that points to its “parent”, the one it is nested inside of. If a variable is declared, it is added to the symbol table to this scope. Later, when a variable is referenced (i.e. init’d, read, or updated), we will consult the table for the current innermost scope. If the variable’s data is not found there, we will continue to check ancestors until we find it.