Tuesday, June 14, 2005

How compilers are structured.

Compilers are programs that translate source code from one language (the input language) to another (the output language). Usually, the input language is the language the programmer writes in, and the output language is either assembly code or machine code.

Compilers make their job easier by breaking it into parts.

The first part is lexical analysis. This is done by a lexical analyzer, also known as a lexer, which breaks the input stream up into tokens. Tokens are lumps of text from the input stream, annotated with a type. So, given the expression:

42 + 4

The following tokens would be produced:
  1. INTEGER "42"
  2. PLUS "+"
  3. INTEGER "4"
  4. EOF
The last element is important: this is the End Of File marker, and it tells the parser that no more source code is to be expected.
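
To make that concrete, here's a rough sketch in Python (my choice of language for illustration; representing tokens as (type, text) pairs is just an assumption, not a design commitment):

  # One possible representation: a token is a (type, text) pair.
  # These are the tokens a lexer might produce for "42 + 4":
  tokens = [
      ("INTEGER", "42"),
      ("PLUS", "+"),
      ("INTEGER", "4"),
      ("EOF", ""),       # End Of File marker: no more input after this
  ]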

The second part is syntactic analysis, using a grammar. A grammar is a list of rules describing the recursive structure of the language.

(I'm going to use recursive descent in my parser since it's the easiest one to write and read without using a parser generator.) A recursive descent grammar for simple expressions might look like:

  expression : factor ( PLUS factor )* EOF ;

This rule means that an 'expression' is expected to be made up of a factor, followed by zero or more repetitions of the part in the ()s: that is, zero or more of "PLUS token immediately followed by a factor". The '*' means zero or more of the element just before it. A '+' means one or more of the element just before it. Finally, the expression ends in EOF, so the source code should finish after the expression rule has been matched.

  factor : INTEGER ;

Here, a 'factor' is in turn made up of an INTEGER.

The job of a parser is to eat tokens from the lexer (consume them, causing the next token to come to the front), but only in the order the grammar's rules allow. The parser drives the whole process: it follows the grammar as if each rule were a function, pulling and eating tokens from the lexer as it goes.
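
To show what "each rule as a function" might look like, here's a sketch in Python (the 'eat' helper and the canned token stream are names of my own choosing, not from any particular implementation):

  # Sketch only: a hand-written recursive descent parser for the
  # grammar above, fed from a fixed token list standing in for a
  # real lexer.
  tokens = iter([("INTEGER", "42"), ("PLUS", "+"), ("INTEGER", "4"), ("EOF", "")])
  current = next(tokens)   # the lexer's current token

  def eat(expected):
      # Check the current token's type, then advance to the next token.
      global current
      if current[0] != expected:
          raise SyntaxError("expected %s, got %s" % (expected, current[0]))
      current = next(tokens, ("EOF", ""))

  def expression():
      # expression : factor ( PLUS factor )* EOF ;
      factor()
      while current[0] == "PLUS":   # the current token decides: loop or not?
          eat("PLUS")
          factor()
      eat("EOF")

  def factor():
      # factor : INTEGER ;
      eat("INTEGER")

The trace below is exactly what happens when 'expression' runs against those tokens.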

Consider the expression "42 + 4" above. Given the stream of tokens INTEGER, PLUS, INTEGER, EOF from the lexer, a parser using the 'expression' rule as its starting rule runs like this:
  1. Parser starts with the expression rule. First thing is supposed to be a factor, and all factors start with INTEGER, so check that the lexer's current (i.e. head of the stream) token is an INTEGER.
  2. The lexer's current token is an INTEGER, so recurse into the factor rule.
  3. The factor rule eats the INTEGER token from the lexer, causing the lexer to move on to the next token (PLUS), and returns to the expression rule.
  4. The expression rule is now positioned just after the first 'factor'; it now has a choice to make. Either enter the loop of 'PLUS factor' (choice PLUS), or skip the loop (choice EOF). It uses the current token from the lexer (PLUS) to make a choice, so it enters the loop.
  5. The loop has been entered, and the PLUS token from the stream is eaten. The next token from the lexer is INTEGER "4". The parser recurses into the factor rule.
  6. The factor rule eats the INTEGER, leaving EOF as the current token in the lexer, and returns to the expression rule.
  7. The expression rule has reached the end of the loop, and must decide if it is going to go around again. The start of the loop is a PLUS, while breaking out of the loop causes an EOF to be matched. The current token (EOF) is used to make the decision, so the loop is exited.
  8. The EOF is matched, and expression returns to whoever called into the parser.
That's the basic process of a parser. Of course, if it just follows the grammar and eats tokens, it doesn't do anything other than match its language and throw errors on invalid input. That isn't terribly useful, so usually the parser does more work as it matches elements. However, that's enough on parsers for the moment.
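
Continuing the sketch from above: calling 'expression' on the tokens for "42 + 4" returns quietly, which means the input matched. Feed it something broken, say the tokens for "42 +", and the 'factor' rule complains when it wants an INTEGER and sees EOF:

  expression()   # consumes INTEGER, PLUS, INTEGER, EOF; silence means success

  # Reset the stream to the tokens for the invalid input "42 +":
  tokens = iter([("INTEGER", "42"), ("PLUS", "+"), ("EOF", "")])
  current = next(tokens)
  expression()   # raises SyntaxError: expected INTEGER, got EOF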

Next, I'll write a very simple little lexer that breaks text up into the tokens described above.
