Friday, June 17, 2005

A simple evaluating parser

Continuing on from the lexer from the last day, here's a simple parser. The parser grammar is as follows:
expression : simpleExpression EOF ;
simpleExpression : term ( PLUS term | MINUS term )* ;
term : factor ( ASTERISK factor | SLASH factor )* ;
factor :
    MINUS factor
  |  NUMBER
  |  LPAREN simpleExpression RPAREN
  ;
Writing a simple recursive descent parser for this grammar is easy enough. Every rule turns into a function, and each function follows its rule, using tokens from the lexer to decide what to match next. Because term sits below simpleExpression, multiplication and division bind more tightly than addition and subtraction. To make this parser useful, every rule returns the value of what it has matched so far: the factor rule, for example, returns either a number, or the result of recursing into factor or simpleExpression as appropriate.

Here's the parser:

using System;
using System.Text;

namespace Calc
{
    class Parser
    {
        private Lexer _lexer;
       
        public Parser(Lexer lexer)
        {
            _lexer = lexer;
        }
       
        private Lexer Lexer
        {
            get { return _lexer; }
        }
       
        private void Expect(TokenKind kind)
        {
            if (Lexer.Kind != kind)
                throw new ApplicationException(
                    string.Format("Expected {0}, found {1}.", kind, Lexer.Kind));
        }
       
        private void Eat(TokenKind kind)
        {
            Expect(kind);
            Lexer.NextToken();
        }
       
        // expression : simpleExpression EOF ;
        public double Expression()
        {
            double result = SimpleExpression(false);
            Eat(TokenKind.Eof);
            return result;
        }
       
        // simpleExpression : term ( PLUS term | MINUS term )* ;
        private double SimpleExpression(bool skip)
        {
            double result = Term(skip);
            for (;;)
                switch (Lexer.Kind)
                {
                    case TokenKind.Plus: result += Term(true); break;
                    case TokenKind.Minus: result -= Term(true); break;
                    default:
                        return result;
                }
        }
       
        // term : factor ( ASTERISK factor | SLASH factor )* ;
        private double Term(bool skip)
        {
            double result = Factor(skip);
            for (;;)
                switch (Lexer.Kind)
                {
                    case TokenKind.Asterisk: result *= Factor(true); break;
                    case TokenKind.Slash: result /= Factor(true); break;
                    default:
                        return result;
                }
        }
       
        // factor : MINUS factor | NUMBER | LPAREN simpleExpression RPAREN ;
        // 'skip' is true when the caller has already recognized the current token
        // (an operator or opening parenthesis); it is consumed before parsing the factor.
        private double Factor(bool skip)
        {
            if (skip)
                Lexer.NextToken();
            switch (Lexer.Kind)
            {
                case TokenKind.Number:
                {
                    double result = Lexer.TokenAsDouble;
                    Lexer.NextToken();
                    return result;
                }
               
                case TokenKind.LParen:
                {
                    double result = SimpleExpression(true);
                    Eat(TokenKind.RParen);
                    return result;
                }
               
                case TokenKind.Minus:
                    return -Factor(true);
               
                default:
                    throw new ApplicationException("Unexpected token: " + Lexer.Kind);
            }
        }
       
        static void Main(string[] args)
        {
            try
            {
                StringBuilder source = new StringBuilder();
                foreach (string arg in args)
                    source.Append(arg).Append(' ');
                if (args.Length > 0)
                    source.Length -= 1;
                Lexer lexer = new Lexer(source.ToString());
                Parser parser = new Parser(lexer);
                Console.WriteLine("Result: {0}", parser.Expression());
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine(ex);
            }
        }
    }
}
Remove the "static void Main" function from the lexer, and recompile with
csc Lexer.cs Parser.cs
and execute:
C:\> parser 45
Result: 45

C:\> parser 45 - 24
Result: 21

C:\> parser 2 * (3 + 4)
Result: 14
Next time, I'll talk about how simple stack-based virtual machines work, and after that, how this parser can be modified to emit code for such a machine instead of evaluating as it parses.

Wednesday, June 15, 2005

The lexer, for lexical analysis

Here, I'll describe a very simple lexical analyzer, or lexer. The lexer's job is to split the input into discrete lumps of text, with each lump annotated with a kind, such as IDENTIFIER, NUMBER or PLUS. Another part of the lexer's job is to ignore whitespace and comments.

Here's the outline of the Lexer class:

using System;

namespace Calc
{
    enum TokenKind
    {
        Eof, Number, LParen, RParen, Plus, Minus, Asterisk, Slash
    }
   
    class Lexer
    {
        private string _source;       // the source text buffer
        private int _currentPosition; // an index into _source
        private TokenKind _kind;      // the kind of the current token
        private int _tokenStart;      // the start of the current token
      
        public Lexer(string source)
        {
            _source = source;
            NextToken();
        }
       
        public void NextToken()
        {
            // [...] Written below
        }
       
        public TokenKind Kind
        {
            get { return _kind; }
        }
       
        public string TokenAsString
        {
            get
            {
                return _source.Substring(_tokenStart,
                    _currentPosition - _tokenStart);
            }
        }
       
        public double TokenAsDouble
        {
            get { return double.Parse(TokenAsString); }
        }
    }
}
The most important method (the only method, for now) is NextToken(). The job of NextToken() is this:
  1. The invariant: when NextToken() is entered and exited, _currentPosition is either:
    1. At the very start of the source.
    2. Pointing at the first character after the last read token.
    3. Equal to _source.Length, indicating the End Of File condition.
  2. NextToken() must skip whitespace until there is no more whitespace, or EOF is reached.
  3. If EOF, set token kind to EOF and return.
  4. At this point, _currentPosition points to the first character of the next token. Mark this position as _tokenStart.
  5. Determine the type of token from the first character of the token. For example, if the character is '1', then the next token is a number of some kind. If the character is '"', then it'll be a string. If it's 'x', then it's an identifier. If it's '+', then it's an operator.
  6. Based on the type of the token, read in the rest of the token and set _kind as appropriate. This means that after this step _currentPosition is either at the end of the source or pointing to the first character that isn't part of the token just read.
All this is pretty easy. Here's a simple NextToken() that recognizes the token kinds in the enumeration above, namely numbers (whole numbers only, for now) and a few single-character operators, in the class above with some small modifications:
using System;
using System.Text;

namespace Calc
{
    enum TokenKind
    {
        Eof, Number, LParen, RParen, Plus, Minus, Asterisk, Slash
    }
   
    class Lexer
    {
        private string _source;
        private int _currentPosition;
        private TokenKind _kind;
        private int _tokenStart;
       
        public Lexer(string source)
        {
            _source = source;
            NextToken();
        }
       
        public bool IsEof
        {
            get { return _currentPosition >= _source.Length; }
        }
       
        public void NextToken()
        {
            // Skip whitespace.
            while (!IsEof && char.IsWhiteSpace(_source, _currentPosition))
                ++_currentPosition;
           
            // Check for EOF.
            if (IsEof)
            {
                _kind = TokenKind.Eof;
                _tokenStart = _currentPosition;
                return;
            }
           
            // The token starts here.
            _tokenStart = _currentPosition;
           
            // Determine token kind.
            if (char.IsDigit(_source, _currentPosition))
            {
                do {
                    ++_currentPosition;
                } while (!IsEof && char.IsDigit(_source, _currentPosition));
                _kind = TokenKind.Number;
            }
            else
            {
                switch (_source[_currentPosition])
                {
                    case '+': _kind = TokenKind.Plus; break;
                    case '-': _kind = TokenKind.Minus; break;
                    case '*': _kind = TokenKind.Asterisk; break;
                    case '/': _kind = TokenKind.Slash; break;
                    default:
                        throw new ApplicationException("Unknown character: " + _source[_currentPosition]);
                }
                // Skip past the single-character token.
                ++_currentPosition;
            }
        }
       
        public TokenKind Kind
        {
            get { return _kind; }
        }
       
        public string TokenAsString
        {
            get { return _source.Substring(_tokenStart, _currentPosition - _tokenStart); }
        }
       
        public double TokenAsDouble
        {
            get { return double.Parse(TokenAsString); }
        }
       
        static void Main(string[] args)
        {
            try
            {
                StringBuilder source = new StringBuilder();
                foreach (string arg in args)
                    source.Append(arg).Append(' ');
                if (args.Length > 0)
                    source.Length -= 1;
                Lexer lexer = new Lexer(source.ToString());
                for (;;)
                {
                    Console.WriteLine("Token: {0} => \"{1}\"", lexer.Kind, lexer.TokenAsString);
                    if (lexer.Kind == TokenKind.Eof)
                        break;
                    lexer.NextToken();
                }
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine(ex);
            }
        }
    }
}
Compile this with "csc Lexer.cs" from the .NET SDK, and try it out:
C:\> Lexer + - 3 * 7 - (
Token: Plus => "+"
Token: Minus => "-"
Token: Number => "3"
Token: Asterisk => "*"
Token: Number => "7"
Token: Minus => "-"
Token: LParen => "("
Token: Eof => ""
That's enough of the basics of a lexer for the moment. We'll enhance this lexer to parse numbers with decimal points later; as a preview, here's a rough sketch of one way that change could look, reusing the fields and the IsEof helper from the class above (this is only a sketch, not yet part of the lexer):
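// Sketch only: the digit-scanning branch of NextToken() above, extended to
// accept an optional fractional part such as "3.14".
if (char.IsDigit(_source, _currentPosition))
{
    do {
        ++_currentPosition;
    } while (!IsEof && char.IsDigit(_source, _currentPosition));

    // A '.' must be followed by at least one digit to count as a fraction.
    if (!IsEof && _source[_currentPosition] == '.')
    {
        ++_currentPosition;
        if (IsEof || !char.IsDigit(_source, _currentPosition))
            throw new ApplicationException("Expected a digit after the decimal point.");
        do {
            ++_currentPosition;
        } while (!IsEof && char.IsDigit(_source, _currentPosition));
    }

    _kind = TokenKind.Number;
}
Next comes a simple parser.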

Tuesday, June 14, 2005

How compilers are structured

Compilers are programs that translate source code from one language (the input language) to another (the output language). Usually, the input language is the language the programmer writes in, and the output language is either assembly code or machine code.

Compilers make their job easier by breaking it into parts.

The first part is lexical analysis. This is done by a lexical analyzer, also known as a lexer. It breaks the input stream up into tokens. Tokens are lumps of text from the input stream, annotated with a type. So, given the expression:

42 + 4

The following tokens would be produced:
  1. INTEGER "42"
  2. PLUS "+"
  3. INTEGER "4"
  4. EOF
The last element is important: it is the End Of File marker, and it tells the parser that no more source code is to be expected.
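To make the idea of a token concrete, here's a tiny illustrative sketch in C#: a token is just a kind paired with the text it covers. The Token struct and TokenDemo class are purely for illustration and aren't part of the compiler we'll build; the lexer in the next post exposes the current token's kind and text as properties on the lexer itself rather than handing out token objects.

using System;

// Illustration only: a token as "a lump of text annotated with a kind".
enum TokenKind { Integer, Plus, Eof }

struct Token
{
    public TokenKind Kind;
    public string Text;

    public Token(TokenKind kind, string text)
    {
        Kind = kind;
        Text = text;
    }
}

class TokenDemo
{
    static void Main()
    {
        // The tokens produced from "42 + 4".
        Token[] tokens = {
            new Token(TokenKind.Integer, "42"),
            new Token(TokenKind.Plus, "+"),
            new Token(TokenKind.Integer, "4"),
            new Token(TokenKind.Eof, "")
        };
        foreach (Token token in tokens)
            Console.WriteLine("{0} \"{1}\"", token.Kind, token.Text);
    }
}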

The second part is syntactic analysis, using a grammar. A grammar is a list of rules describing the recursive structure of the language.

(I'm going to use recursive descent in my parser, since it's the easiest kind of parser to write and read without using a parser generator.) A grammar for simple expressions, written for recursive descent, might look like:

  expression : factor ( PLUS factor )* EOF ;

This rule means that an 'expression' is expected to be made up of a factor, followed by zero or more repetitions of the part in the parentheses: that is, zero or more occurrences of "a PLUS token immediately followed by a factor". The '*' means zero or more of the element just before it; a '+' means one or more of the element just before it. Finally, the expression ends in EOF, so the source code should finish after the expression rule has been matched.

  factor : INTEGER ;

Here, a 'factor' is in turn made up of an INTEGER.

The job of a parser is to eat tokens from the lexer (consume them, causing the next token to come to the front), but only in the order the grammar's rules allow. The parser drives the whole process: it follows the grammar as if each rule were a function, pulling and eating tokens from the lexer as it goes.
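To make that concrete, here's a minimal, self-contained sketch of the 'expression' and 'factor' rules written as C# methods. The names here (RecursiveDescentSketch, Kind, Current, Eat) are hypothetical, and the token array is just a stand-in for a lexer so the example runs on its own; the real lexer and parser come in the following posts.

using System;

class RecursiveDescentSketch
{
    enum Kind { Integer, Plus, Eof }

    private Kind[] _tokens;  // stand-in for a lexer: a pre-made stream of token kinds
    private int _position;

    RecursiveDescentSketch(Kind[] tokens)
    {
        _tokens = tokens;
    }

    private Kind Current
    {
        get { return _tokens[_position]; }
    }

    // Check that the current token is of the expected kind, then eat it.
    private void Eat(Kind kind)
    {
        if (Current != kind)
            throw new ApplicationException(
                string.Format("Expected {0}, found {1}.", kind, Current));
        if (Current != Kind.Eof)
            ++_position;
    }

    // expression : factor ( PLUS factor )* EOF ;
    public void Expression()
    {
        Factor();
        while (Current == Kind.Plus)  // zero or more repetitions of "PLUS factor"
        {
            Eat(Kind.Plus);
            Factor();
        }
        Eat(Kind.Eof);                // the source must finish here
    }

    // factor : INTEGER ;
    private void Factor()
    {
        Eat(Kind.Integer);
    }

    static void Main()
    {
        // The token stream for "42 + 4": INTEGER PLUS INTEGER EOF.
        Kind[] tokens = { Kind.Integer, Kind.Plus, Kind.Integer, Kind.Eof };
        new RecursiveDescentSketch(tokens).Expression();
        Console.WriteLine("Matched OK.");
    }
}
Feed it a token stream that doesn't match the grammar and Eat throws, which is all a bare parser can do: match its language or report an error.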

Consider the expression "42 + 4" above. A parser that uses the 'expression' rule as its starting rule, and is given the stream of tokens INTEGER, PLUS, INTEGER from the lexer, proceeds like this at run time:
  1. The parser starts with the expression rule. The first thing is supposed to be a factor, and all factors start with INTEGER, so check that the lexer's current token (the head of the stream) is an INTEGER.
  2. The lexer's current token is an INTEGER, so recurse into the factor rule.
  3. The factor rule eats the INTEGER token from the lexer, causing the lexer to move on to the next token (PLUS), and returns to the expression rule.
  4. The expression rule is now positioned just after the first 'factor'; it now has a choice to make. Either enter the loop of 'PLUS factor' (choice PLUS), or skip the loop (choice EOF). It uses the current token from the lexer (PLUS) to make a choice, so it enters the loop.
  5. The loop has been entered, and the PLUS token from the stream is eaten. The next token from the lexer is INTEGER "4". The parser recurses into the factor rule.
  6. The factor rule eats the INTEGER, leaving EOF as the current token in the lexer, and returns to the expression rule.
  7. The expression rule has reached the end of the loop, and must decide if it is going to go around again. The start of the loop is a PLUS, while breaking out of the loop causes an EOF to be matched. The current token (EOF) is used to make the decision, so the loop is exited.
  8. The EOF is matched, and expression returns to whoever called into the parser.
That's the basic process of a parser. Of course, if it just follows the grammar and eats tokens, it doesn't do anything other than match its language and throw errors on invalid input. That isn't terribly useful, so usually the parser does more work as it matches elements. However, that's enough on parsers for the moment.

Next, I'll write a very simple little lexer that breaks text up into the tokens described above.

Monday, June 13, 2005

Writing a compiler

The problem I remember most from being a young programmer (13 years old or so) was figuring out how to parse and evaluate expressions. It seemed like a black art. It was Jack Crenshaw's compiler tutorial which eventually sorted me out, several years later.

In homage to Crenshaw, I'm going to write a little series on how to write a compiler, updated to use C# 2.0 / .NET. Maybe somebody else out there will find it enlightening.

The project will cover the following areas:
  1. How compilers are structured.
  2. A lexer for lexical analysis.
  3. A parser for syntactic analysis and parse-time evaluation.
  4. Designing a virtual machine.
  5. A simple virtual machine.
  6. A new parser generating code for the virtual machine.
  7. Type analysis and operator overloading.
Each area will be covered in one or more blog posts (as needed), building up from the previous posts to a final finished little language; how capable remains to be seen.