About parsing

The aim for this site is to help programmers in the process of developing a component for parsing a specific file format or character stream format with the help of parser generator tools.

/Henrik Teinelund


2012-11-28: Corrected a few bugs in grammar for Coco/R Tutorial. Inserted Coco/R grammar diagram.
2012-10-08: Coco/R added a link to this site and tutorial.


Parsing is the process of analyzing a stream of bits, bytes or characters. To simplify the discussion only streams of characters will be considered, but the principal is valid for all types of streams. The aim of the parsing is one or some of

  • verify that the stream is valid.
  • convert the stream to something more understandable, an Abstract Syntax Tree.
  • do actions when patterns are matched with the stream.

Parsing may be done in many ways. Structured Parsing means that the parser is created by tool, a Parser Generator. Another common name is a Compiler Compiler.

The Parsing Process

A parser generated with a Parser Generator consist of two parts: a Scanner and a Parser.

The aim of the Scanner (or Lexer) is to convert the input stream of characters to a stream of tokens. A tokens is one or a sequence of several characters with a common property or that match a predefined pattern. All irrelevant characters, most often whitespace characters like space, line feed and so on, are discarded.


Figure 1. The Scanner is part of the Parser. The aim of the Scanner is to simplify for the Parser by get rid of irrelevant characters and convert characters into tokens. The input character sequence 'if ( x = 5 )' is converted to 6 tokens. The character sequence 'if'' is converted to one token, the left parenthesis is another one token and so on. How the Scanner work and the token conversion rules are described in a grammar file to the Parser Generator tool.

The stream of tokens is then send to the Parser. The Parser uses some parsing algorithm to reduce tokens and match them to some pattern. This is described in the grammar file which is fed to the Parser Generator tool. When a match occurs the parser do an action with respect to the aim of the parser. These actions are also described in the grammar file. In the case of a validating parser it does nothing. In the case of a parser that builds an Abstract Syntax Tree it builds a node in each action and connect it to the all ready constructed nodes. In the case of a parser that do some action when a pattern is matched, a parser could, for example, commit information to a database, or send data to a web server, or add text to file on the file system.


Figure 2. The Parser take a sequence of tokens and, token by token, match them against a grammar. When a match fits grammar rule some action is taken. In the figure a Syntax Tree is constructed with this parser.

Parsing algorithms

All parsers use some sort of parsing algorithm to reduce tokens from the token stream, match them against grammar rules and do some action when a grammar rule is a match. There are two distinct types of parsing algorithm: top down algorithms and bottom up algorithms.

The key idea of a top down algorithm is to reduce tokens from the token stream, one by one, and try to match them against the most top most grammar rule of the grammar rule tree. When matches are done, tokens are reduced and rules more near the bottom of the grammar rule tree are tried. Examples of top down algorithms are the LL algorithm, the recursive decent algorithm, tail recursive algorithm and Earleys algorithm.

A bottom up algorithm works the opposite as a top down algorithm. When tokens are reduced from the token stream, they are matched against the bottom grammar rules in the grammar tree first. As tokens are reduced from the stream, the algorithm works its way up in the grammar rule tree until it hits the top rule. Examples of bottom up algorithms are LR algorithm, LALR algorithm, GLR algorithm, CYK algorithm, SLR algorithm, CLR algorithm and recursive ascent algorithm.

Some algorithms allow one or several tokens of look ahead in the token stream. This is written as a number, or the letter k or as a star in parenthesis directly after the algorithm name. So, LL(1) may use one look ahead token when to figure out what rule to use. LL(k) may use k tokens as look ahead when to figure out what rule to use. Note that k stand for a fixed number. LL(*) is not bound to some fixed values of look ahead tokens to figure out the grammar rule to use. In general, the bigger value on the k, the slower the parser is. They tend to be more complex too, but they are more powerful than, say a LL(1) parser.

Parser Generator tools

A Parser Generator tool is a program. When run the parser generator is fed with a file containing a grammar and produces source code in some computer language (for example, java, C, C++, C#, ruby). The produced source code is the parser.


Figure 3. A Parser Generator takes a file containing grammar rules. It produces a Scanner and a Parser in some computer language. In the case above C or C++.

The grammar consist of rules of what are legal input streams to the parser. The grammar is described in a simple language like BNF. BNF is not very flexible, so many extensions exist. In many Parser Generator tools the grammar file consist of two distinct parts: the scanner part and the parsing rules part. That is, the Parser Generator tool produces source code to both a Scanner and a Parser. Some, how ever, need an external scanner, and only produces a Parser.

Grammar definition languages

The grammar file fed to the Parser Generator program must be described in some structured way. Many Parser Generator programs use the grammar language BNF, or some dialect of it, to describe grammar rules.


BNF, Backus Naur Form, is a simple language that is used to describe computer languages. All though, it is not restricted to describe computer languages only, but any sequence of characters that have some sort of structure, including log files, communication protocol, document formats, instruction sets and more. BNF describes syntax. A syntax consist of rules of what elements are legal and in what order these elements are legal. However it is not always possible to describe all aspects of the structure of a language (or protocol or document). One example: a variable in the computer language, C++, must be declared before it is used. That is not possible to describe in BNF, because BNF to simple and does not have enough expression power. That kind of rules is the semantic of a language, and must be validated programatically.

BNF uses terminals and non-terminals as building blocks. Terminals is everything that is constructed by characters between double quotation marks. "if"', "else", "#include", "<", "+", "[" are all examples of terminals. Non-terminals are written as an identifier between less-than greater-than characters: <A>, <B>, <term>, <name>, <letter> and <STMT> are all examples of non-terminals. A rule is a BNF construction where a non-terminal is equal to, written as ::=, a list constructed of non-terminals and/or terminals. Only a non-terminal is legal to the left of the equal sign in a rule.

<non-terminal> ::= __BNF_expression__

Example 1

<A> ::= "x" <B>
<B> ::= "y" <B>

The above example describes character sequences that begins with a x followed by one or more y:s. For example "xy", "xyyyy" and "xyyyyyyyyyy" are all example of legal character sequences to the rules <A> and <B>.

To make it more convenient to write rules, BNF also have the construction of a vertical bar, |, that express choice. Left and right of the vertical bar-sign goes terminals and non-terminals. An choice expression is evaluated as either the left side is legal or the right side is legal. Choice expressions must be on the right side of the equal sign of a BNF rule.

Example 2

<A> ::= "x" | "y"

means that a valid character sequence must consist of "x" or "y". Nothing else.

Example 3

<E> ::= <T> "+" <T> | <T>
<T> ::= <INT> <T> | <INT>
<INT> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

For the above grammar the following character sequences are all legal: "1", "5436002", "2+2", "54+400+158+21".


There are many extensions to BNF, and EBNF, Extended BNF, is very common. The syntax is slightly different than BNF and with more constructs. It is generally more easy to write and understand a grammar written in EBNF.

Terminals are written as in BNF, and consist of character sequence between double quotation marks. Non-terminals consist of identifier, but are written without lesser-than, greater-than characters. Equal is written with the '=' sign (and not '::=' as BNF). The choice construction is identical to BNF, and written with a vertical bar, '|'.

The extension to BNF is described in this dot list:

  • All rules must end with the character ';'. Example: A = "x";
  • () means grouping. Example: A = "x" ( "y" | "z" ); means "xy" or "xz" are valid.
  • [] means zero or one. Example: A = "x" ["y"]; means "x" and "xy" are all valid.
  • {} means zero or more. Example: A = "x" {"y"}; means "x", "xy" and "xyyyy" are all valid.

Example 4

Example 3 in EBNF

E = T "+" T;
T = INT {INT};
INT = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";


1 Writing a parser with Coco/R for Coco/R grammar files

In this tutorial a Coco/R parser is created. That is, a parser that are able to parse Coco/R grammar files. This is a good starting tutorial because the Coco/R grammar is quite manageable. This is done with the Coco/R parser generator tool, both with JFlex scanner and without JFlex scanner (that is, with Coco/R internal scanner).

Writing a Coco/R Parser with Coco/R - Tutorial

2 Writing a Java 1.5 parser with Coco/R parser generator tool

In this tutorial a Java 1.5 parser is created. The JFlex scanner generator tool will be used to generate a scanner.

Writing a Java 1.5 Parser with Coco/R - Tutorial


Scanner Generator

Flex is a Scanner Generator for C/C++.
JFlex is a Scanner Generator for Java. It is build on Flex for C/C++.
Flex C# is a port of JFlex for C#.

Parser Generator

Coco/R is a LL(1) compiler generator and may produce both a Scanner and Parser. Though, the parser is a LL(1) parser, a functionality in the parser - IF clauses - may extend the parser to a LL(k) parser and thus make it more powerful. Supports Java, C++, C# and other languages.


Coco/R Erratum
JFlex Erratum


Do you have questions, suggestions, critics or just curious who I am, please visit Henrik Teinelund.

Last updated



This site, with all sub-pages within this domain: Henrik Teinelund 2012 (c)

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License