I am close to two chapters into “Compilers” (a.k.a. “The Dragon Book”), by Aho, Lam, Sethi, and Ullman. It is an exciting topic to learn about.
The most fundamental thing I have learned so far is the overall pipeline for understanding the modular components forming the way a compiler is typically constructed. It goes
- Lexical analyzer
- Syntax analyzer
- Semantic analyzer
- Intermediate code generator
- Machine-independent code optimizer (optional)
- Code generator
- Machine-dependent code optimizer (optional)
The input to the compiler pipeline is a “character stream” of the program.
However I don’t recall the book ever dealing with specifications of that
stream, except that it supports a generic getchar()
operation. I can
understand that if the entire source consists of one file, you just read
character-by-character from the file into the compiler. Maybe the language
designer decides how multiple-file projects are to be read by the compiler,
i.e. it is beyond the scope of this book; or maybe they’ll go more in-depth on
this later-on.
The Lexical Analyzer
The character stream is read into first component in the compilation pipeline:
the lexical analyzer, which maps the character stream into a “token”
stream. It seems like a token knows its tag, and its value. The
tag is basically the type of this token in the eyes of the compiler.
Possible tags in their example include NUM
, ID
, [keyword]
s, and I added
COMMENT
as part of an exercise. The value for a token might be the
literal value of a literal; or it might be the name of a variable. More about
that below.
Symbol Tables
In addition to creating the token stream, the lexical analyzer builds the symbol table. It seems to that the symbol table maps names to places in memory. A symbol table also ‘by default’ implements the language’s preferred form of block scoping. Upon entering a new scope, a new symbol table is created that points to its “parent”, the one it is nested inside of. If a variable is declared, it is added to the symbol table to this scope. Later, when a variable is referenced (i.e. init’d, read, or updated), we will consult the table for the current innermost scope. If the variable’s data is not found there, we will continue to check ancestors until we find it.