Entropy Overload: The lexer, for lexical analysis

Here, I'll describe a very simple lexical analyzer, or lexer. The lexer's job is to split the input into discrete lumps of text, with each lump annotated with kind, such as IDENTIFIER, NUMBER or PLUS. Another part of the lexer's job is to ignore whitespace and comments.

Here's the outline of the Lexer class:

using System;

namespace Calc
{
    enum TokenKind
    {
        Eof, Number, LParen, RParen, Plus, Minus, Asterisk, Slash
    }
   
    class Lexer
    {
        private string _source;       // the source text buffer
        private int _currentPosition; // an index into _source
        private TokenKind _kind;      // the kind of the current token
        private int _tokenStart;      // the start of the current token
      
        public Lexer(string source)
        {
            _source = source;
            NextToken();
        }
       
        public void NextToken()
        {
            // [...] Written below
        }
       
        public TokenKind Kind
        {
            get { return _kind; }
        }
       
        public string TokenAsString
        {
            get
            {
                return _source.Substring(_tokenStart,
                    _currentPosition - _tokenStart);
            }
        }
       
        public double TokenAsDouble
        {
            get { return double.Parse(TokenAsString); }
        }
    }
}

The most important method (the only method, for now) is NextToken(). The job of NextToken() is this:

The invariant: when NextToken() is entered and exited, _currentPosition is either:
1. At the very start of the source.
2. Pointing at the first character after the last read token.
3. Equal to _source.Length, indicating the End Of File condition.
NextToken() must skip whitespace until there is no more whitespace, or EOF is reached.
If EOF, set token kind to EOF and return.
At this point, _currentPosition points to the first character of the next token. Mark this position as _tokenStart.
Determine the type of token from the first character of the token. For example, if the character is '1', then the next token is a number of some kind. If the character is '"', then it'll be a string. If it's 'x', then it's an identifier. If it's '+', then it's an operator.
Based on the type of the token, read in the rest of the token and set _kind as appropriate. This means that after this step _currentPosition is either at the end of the source or pointing to the first character that isn't part of the token just read.

All this is pretty easy. Here's a simple NextToken() that parses the token kinds in the enumeration above, namely floating-point numbers and some operators, in the class above with some small modifications:

using System;
using System.Text;

namespace Calc
{
    enum TokenKind
    {
        Eof, Number, LParen, RParen, Plus, Minus, Asterisk, Slash
    }
   
    class Lexer
    {
        private string _source;
        private int _currentPosition;
        private TokenKind _kind;
        private int _tokenStart;
       
        public Lexer(string source)
        {
            _source = source;
            NextToken();
        }
       
        public bool IsEof
        {
            get { return _currentPosition >= _source.Length; }
        }
       
        public void NextToken()
        {
            // Skip whitespace.
            while (!IsEof && char.IsWhiteSpace(_source, _currentPosition))
                ++_currentPosition;
           
            // Check for EOF.
            if (IsEof)
            {
                _kind = TokenKind.Eof;
                _tokenStart = _currentPosition;
                return;
            }
           
            // The token starts here.
            _tokenStart = _currentPosition;
           
            // Determine token kind.
            if (char.IsDigit(_source, _currentPosition))
            {
                do {
                    ++_currentPosition;
                } while (!IsEof && char.IsDigit(_source, _currentPosition));
                _kind = TokenKind.Number;
            }
            else
            {
                switch (_source[_currentPosition])
                {
                    case '+': _kind = TokenKind.Plus; break;
                    case '-': _kind = TokenKind.Minus; break;
                    case '*': _kind = TokenKind.Asterisk; break;
                    case '/': _kind = TokenKind.Slash; break;
                    default:
                        throw new ApplicationException("Unknown character: " + _source[_currentPosition]);
                }
                // Skip past the single-character token.
                ++_currentPosition;
            }
        }
       
        public TokenKind Kind
        {
            get { return _kind; }
        }
       
        public string TokenAsString
        {
            get { return _source.Substring(_tokenStart, _currentPosition - _tokenStart); }
        }
       
        public double TokenAsDouble
        {
            get { return double.Parse(TokenAsString); }
        }
       
        static void Main(string[] args)
        {
            try
            {
                StringBuilder source = new StringBuilder();
                foreach (string arg in args)
                    source.Append(arg).Append(' ');
                if (args.Length > 0)
                    source.Length -= 1;
                Lexer lexer = new Lexer(source.ToString());
                for (;;)
                {
                    Console.WriteLine("Token: {0} => \"{1}\"", lexer.Kind, lexer.TokenAsString);
                    if (lexer.Kind == TokenKind.Eof)
                        break;
                    lexer.NextToken();
                }
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine(ex);
            }
        }
    }
}

Compile this with "csc Lexer.cs" from the .NET SDK, and try it out:

C:\> Lexer + - 3 * 7 - (
Token: Plus => "+"
Token: Minus => "-"
Token: Number => "3"
Token: Asterisk => "*"
Token: Number => "7"
Token: Minus => "-"
Token: LParen => "("
Token: Eof => ""

That's enough of the basics of a lexer for the moment. We'll enhance this lexer to parse numbers with decimal points later. Next comes a simple parser.

3 comments:

Shri Dutta Pachori said...: Hi Kelly, This is really nice to see.. I need little help in the problem below

I am having a heavy string compression scenario which can be simple like 1000^1000 or 50000^50000.... so in the second case time increases exponentially.

I am using regex match now...wondering if switching to stack implementation will improve the performance.

my regex looks like [1,5][3,9][2,6][4,5][3,7]
my strings looks like 13647 or 13655 etc.; 6:15 pm
Barry Kelly said...: I don't think you posted enough information for me to figure out what you're trying to do. However, I do suggest that there is a better forum for your problem: Stack Overflow. You'll get a bigger audience there, and other people who have similar problems will more likely find whatever answers your question would turn up there, rather than here.; 7:56 pm
Barry Kelly said...: Eh, that would be stackoverflow.com.; 7:57 pm

Entropy Overload

Wednesday, June 15, 2005

The lexer, for lexical analysis

3 comments:

About Me

Blog Archive