Coco/R Parser with External Flex Scanner - part 1

<- back

Go to Creating Grammar Rules part 2


Use the Coco/R PDF manual side by side with this tutorial as a reference. This tutorial will not go into every detail of the Coco/R grammar in depth.

Creating Tokens

Scanning a stream and creating tokens may look trivial, something that shouldn't take much time. Nothing could be further from the truth. Getting the scanner to work correctly is paramount: it will save a lot of time and trouble later, when the parser rules are written, if the scanner behaves as intended. Here, unit tests are a workhorse.

Coco/R Grammar Syntax in Short

Coco/R uses its internal scanner by default (but the internal scanner will not be used in this tutorial). The rules for both the scanner and the parser reside in the same grammar file. The grammar file is divided into 4 parts (page 5 in the Coco/R PDF manual):

  • Imports - these lines are copied verbatim into the parser source code. This part should be used for import statements in Java and C#, and include statements in C++.
  • Global Fields and Methods - source code that is copied verbatim into the Parser class in the generated parser.
  • Scanner Specification - rules for the scanner to create tokens.
  • Parser Specification - rules for the parser that operates on a token stream.

Scanner Specification

The Scanner Specification is further divided into a few parts:

  • Character sets - macro specification to define sets of characters. Resides below the CHARACTERS statement in the grammar file.
  • Tokens - rules to define tokens. Resides below the TOKEN statement in the grammar file.
  • Pragmas - rules to define tokens that may appear anywhere in the token stream. Resides below the PRAGMAS statement in the grammar file.
  • Comment definition - define what is a comment. Uses the COMMENTS statement.
  • White space definition - define what should count as white space and consequently ignored. Uses the IGNORE statement.
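Put together, a minimal, hypothetical grammar file shows where these parts go. The name Sample and all rules below are placeholders for illustration, not part of this tutorial's grammar:

```
/* imports - copied verbatim into the parser source code */
import java.util.ArrayList;

COMPILER Sample
/* global fields and methods - copied into the Parser class */

CHARACTERS
  letter = 'A'..'Z' + 'a'..'z'.
  digit  = '0'..'9'.

TOKENS
  ident = letter {letter | digit}.

COMMENTS FROM "/*" TO "*/"

IGNORE '\t' + '\r' + '\n'

PRODUCTIONS
  Sample = ident {ident}.

END Sample.
```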

Coco/R Vocabulary
  • An identifier in the Coco/R grammar is any word that begins with a letter (uppercase or lowercase) and continues with letters and/or digits. Examples of legal identifiers: x, letter, digit, letter1, firstLetter. Examples of illegal identifiers: 1Letter, letter_1, letter#1.
  • A string is any characters (in a single line) in between two "-characters. Examples: "COMPILER", "ABCDEF", "*/".
  • A char is any character in between two '-characters. Examples: 'a', 'A', '1', '%', '+', '#', ' '. Note that only one character may be present between the two '-characters. There is an exception to this rule: some characters must be written as two characters, an escape sequence.
  • An escape sequence is one of those described on page 6 in the Coco/R PDF manual. They are the same as in Java and C++. A few examples: the tab, line feed and carriage return characters are written as \t, \n and \r.
  • A digit is any continuous stream of digits. Examples: 0, 1, 502873005.

In this section, only the Tokens part will be used. The tokens must be defined there to let the parser know what tokens are available. They are also defined in the JFlex grammar file, so there is a bit of redundancy.

What is JFlex?

JFlex is a powerful and versatile scanner generator. It gives almost total control over the tokenizer stage. The drawback is getting the two parts, the scanner from JFlex and the parser from Coco/R, to work together.

The power of a JFlex scanner is that it may hold state. Stateless scanners must consider all lexical rules at the same time. That makes it hard to create sound tokens, because token content often overlaps. Consider comments and text strings in programming languages: they are quite similar in construction. With a scanner that may hold state it is easier to be accurate, the rules are easier to write, it is easier to understand what is going on, and the risk of mistakes lessens.

The JFlex Scanner grammar file

The file with the JFlex grammar is constructed of three parts, all separated by %%:

  • User Code. This part is copied verbatim into the scanner and must contain Java code and/or Java comments (or be empty).
  • Options and Declarations. This part regulates the behavior of the scanner. It may also contain macro declarations.
  • Lexical Rules and Actions. This part consists of rules for how to construct tokens. Each rule may also contain an action - Java code - that is executed when the rule matches.

User Code

This section may contain Java code and/or Java comments or be empty. Everything in this section is copied into the scanner java code.

Options and Declarations

See the JFlex documentation for a full description of all options. Only those useful for this tutorial are presented here.

  • %class "classname" - Tells JFlex to give the generated class the name "classname" and to write the generated code to a file "classname.java".
  • %implements "interface 1"[, "interface 2", ..] - Makes the generated class implement the specified interfaces.
  • %public - Makes the generated class public.
  • %{ … %} - The code enclosed in %{ and %} is copied verbatim into the generated class. Here you can define your own member variables and functions in the generated scanner. Like all options, both %{ and %} must start a line in the specification.
  • %init{ … %init} - The code enclosed in %init{ and %init} is copied verbatim into the constructor of the generated class. Here, member variables declared in the %{%} directive can be initialised.
  • %type "typename" - Causes the scanning method to be declared as returning values of the specified type. Actions in the specification can then return values of typename as tokens.
  • %eofval{ … %eofval} - The code included in %eofval{%eofval} will be copied verbatim into the scanning method and will be executed each time when the end of file is reached (this is possible when the scanning method is called again after the end of file has been reached). The code should return the value that indicates the end of file to the parser.
  • %line - Turns line counting on. The int member variable yyline contains the number of lines (starting with 0) from the beginning of input to the beginning of the current token.
  • %column - Turns column counting on. The int member variable yycolumn contains the number of characters (starting with 0) from the beginning of the current line to the beginning of the current token.
  • %char - Turns character counting on. The int member variable yychar contains the number of characters (starting with 0) from the beginning of input to the beginning of the current token.
  • %s[tate] "state identifier" [, "state identifier", … ] - declares inclusive states.
  • %x[state] "state identifier" [, "state identifier", … ] - declares exclusive states.

There are many more options in the JFlex documentation, but these are the ones that will be used in this tutorial.

Macro definitions

The section Options and Declarations may also contain Macro definitions. These are used in the rules, in the last section. Macro definitions have the following syntax:

macroidentifier = regular expression

Example

LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]
WhiteSpace     = {LineTerminator} | [ \t\f]

Lexical Rules and Actions

The syntax for rules and actions is as follows, where STATE is a state name in between < >, expression is a regular expression (or a macro reference) and action is Java code in between { }:

<STATE> expression { action }

It states that when the scanner is in state STATE and the input stream matches the regular expression expression, the Java code action is executed.
<STATE> {
  expr1   { action1 }
  expr2   { action2 }
}

It states that when the scanner is in state STATE and the input stream matches one of the regular expressions expr1 or expr2, the corresponding Java code action1 or action2 is executed.
expression { action }

It states that when the scanner is in the predefined state YYINITIAL and the input stream matches the regular expression expression, the Java code action is executed.

Rules may be more complex, but these forms are enough for this tutorial. Note that an action that completes a token should return a token object.

Example:

<YYINITIAL> "abstract"           { return symbol(sym.ABSTRACT); }
<YYINITIAL> "boolean"            { return symbol(sym.BOOLEAN); }
<STRING> {
  \"                             { yybegin(YYINITIAL); 
                                   return symbol(sym.STRING_LITERAL, 
                                   string.toString()); }
  [^\n\r\"\\]+                   { string.append( yytext() ); }
  \\t                            { string.append('\t'); }
}

Example

This example is schematic. It shows where all the sections reside in the flex file:

/* Example Flex grammar file */

import my.parser.Token;
%%

%public
%class Scanner
%line
%column

/* main character classes */
LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]

WhiteSpace = {LineTerminator} | [ \t\f]

/* string and character literals */
StringCharacter = [^\r\n\"\\]
SingleCharacter = [^\r\n\'\\]

/* identifiers */
Identifier = [:jletter:] [:jletterdigit:]*

%state STRING

%%

<YYINITIAL> {

  /* keywords */
  "abstract"                     { return symbol(ABSTRACT); }
  "boolean"                      { return symbol(BOOLEAN); }
  "break"                        { return symbol(BREAK); }
  /* whitespace */
  {WhiteSpace}                   { /* ignore */ }

  /* identifiers */ 
  {Identifier}                   { return symbol(IDENTIFIER, yytext()); }  
}

<STRING> {
  \"                             { yybegin(YYINITIAL); return symbol(STRING_LITERAL, string.toString()); }
  {StringCharacter}+             { string.append( yytext() ); }
}

The Token class

When the built-in scanner in Coco/R is used, a Token class is defined in the generated scanner file. This is not the case with JFlex. JFlex is versatile enough that the programmer may choose how a token is created and how it works. So a Token class must be defined in the user code area (the area before the first %% marker).

class Token {
    public Token() { this.kind = Parser._EOF; }
    public Token( int kind ) { this.kind = kind;this.val = ""; }
    public Token( int kind, String val ) { this.kind = kind; this.val = val; }
    public Token( int kind, int col, int line, int charPos ) { this.kind = kind; this.col = col; this.line = line; this.charPos = charPos; this.val = ""; }
    public Token( int kind, int col, int line, int charPos, String val ) { this.kind = kind; this.col = col; this.line = line; this.charPos = charPos; this.val = val; }
    public int kind;    // token kind
    public int pos;     // token position in bytes in the source text (starting at 0)
    public int charPos; // token position in characters in the source text (starting at 0)
    public int col;     // token column (starting at 1)
    public int line;    // token line (starting at 1)
    public String val;  // token value
    public Token next;  // ML 2005-03-11 Peek tokens are kept in linked list
}

Comparing this code to the code Coco/R generates in its Scanner file, the version above adds a number of constructors, which is convenient to have. Apart from that, the code is identical.

package and import statements

Above the Token class a package statement should reside:

package org.structuredparsing.cocorgrammar.cocor.parser_jflex_scanner;

An import statement to the Coco/R generated Parser file would not hurt:

import org.structuredparsing.cocorgrammar.cocor.parser_jflex_scanner.Parser;

Options

Below the first %% marker options are placed:

%class Lexer
%public
%type Token
%line
%column
%char

Note that the scanner will not be named Scanner but Lexer. The name Scanner is reserved for a class that will wrap the Lexer class. The %type option states that the scanning method returns values of type Token, that is, the class just described in the user code section.

The Scanner class

A Scanner class must be created because the Coco/R Parser class expects one. When JFlex is used, the Scanner class is a glue between the Lexer class and the Parser class.

package org.structuredparsing.cocorgrammar.cocor.parser_jflex_scanner;
 
import java.io.InputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
 
public class Scanner {
 
    private List< Token > buffer = null;
    private int currentBufferIndex = 0;
    private int peekBufferIndex = 0;
    private Lexer lexer = null;
 
    public Scanner(InputStream s) {
        lexer = new Lexer( s );
    }
 
    public Token Scan() {
        Token token = null;
        if ( buffer == null ) {
            buffer = new ArrayList< Token >();
            do {
                try {
                    token = lexer.yylex();
                } catch (IOException e) {
                    e.printStackTrace();
                    buffer.add( new Token( Parser._Illegaltoken ) );
                    token = new Token( Parser._EOF );
                }
                buffer.add( token );
            } while( token.kind != Parser._EOF );
        }
        token = buffer.get(currentBufferIndex);
        ++currentBufferIndex;
        return token;
    }
 
    public Token Peek() {
        if ( peekBufferIndex < currentBufferIndex ) {
            peekBufferIndex = currentBufferIndex;
        }
        if ( peekBufferIndex >= buffer.size() ) {
            peekBufferIndex = buffer.size() - 1;
        }
        Token token = buffer.get(peekBufferIndex);
        ++peekBufferIndex;
        return token;
    }
 
    public void ResetPeek() {
        peekBufferIndex = currentBufferIndex + 1;
    }
}

The Scanner class above reads and tokenizes the whole source (until end-of-file is found) and stores all tokens in a buffer. The method Scan reads from this buffer one token at a time, from the beginning. The method Peek reads ahead without ruining the buffer position used by Scan.
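The interaction between the two indices can be illustrated with a stripped-down model. This is a sketch only: strings stand in for Token objects, and the class name PeekDemo is made up for this example.

```java
import java.util.List;

// Stripped-down model of the Scan/Peek bookkeeping in the Scanner class.
// Strings stand in for Token objects; the index logic is the same.
class PeekDemo {
    private final List<String> buffer;
    private int currentBufferIndex = 0;
    private int peekBufferIndex = 0;

    PeekDemo(List<String> tokens) { this.buffer = tokens; }

    String scan() {  // consume the next token
        return buffer.get(currentBufferIndex++);
    }

    String peek() {  // look ahead without moving the scan position
        if (peekBufferIndex < currentBufferIndex) peekBufferIndex = currentBufferIndex;
        if (peekBufferIndex >= buffer.size()) peekBufferIndex = buffer.size() - 1;
        return buffer.get(peekBufferIndex++);
    }

    public static void main(String[] args) {
        PeekDemo d = new PeekDemo(List.of("a", "b", "c", "EOF"));
        System.out.println(d.scan());  // a
        System.out.println(d.peek());  // b - lookahead starts after the scan position
        System.out.println(d.peek());  // c - lookahead advances on its own
        System.out.println(d.scan());  // b - the scan position was not disturbed
    }
}
```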

The Ident token

So we begin by defining the Coco/R Ident token. Where can we find a definition of how Coco/R tokens should look? The Coco/R PDF manual is the obvious answer, but there is no single page where all token definitions reside; we have to browse back and forth to find them. On page 5, 2.1 Vocabulary, the ident is defined.

ident  = letter {letter | digit}.

JFlex has a few neat predefined character classes up its sleeve: [:jletter:] and [:jletterdigit:]. jletter matches any character that may begin a Java identifier; jletterdigit matches any character that may appear inside one, letters and digits included. These will be used to define the token Ident.
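According to the JFlex manual, [:jletter:] and [:jletterdigit:] correspond to the JDK predicates Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. A quick check (the class name is made up for this sketch):

```java
public class JLetterDemo {
    public static void main(String[] args) {
        // [:jletter:] ~ Character.isJavaIdentifierStart (per the JFlex manual)
        System.out.println(Character.isJavaIdentifierStart('x'));  // true
        System.out.println(Character.isJavaIdentifierStart('1'));  // false
        // [:jletterdigit:] ~ Character.isJavaIdentifierPart
        System.out.println(Character.isJavaIdentifierPart('1'));   // true
        System.out.println(Character.isJavaIdentifierPart('#'));   // false
    }
}
```

Note that these classes are slightly broader than plain letters: '$' and '_' qualify as well. For this tutorial that difference is harmless.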

In the Coco/R grammar file:

TOKENS
  Ident = "PLACEHOLDER_IDENT".

In the JFlex grammar file:

Ident = [:jletter:] [:jletterdigit:]*

%%
<YYINITIAL> {
  {Ident}   { return new Token(Parser._Ident, yycolumn + 1, yyline + 1, yychar, yytext()); }
}

This means that a character stream must begin with a letter followed by zero or more letters or digits. The Ident definition in the Coco/R grammar file is just a placeholder; it generates a static class member in the Parser class named _Ident, which is used when the token is created in the JFlex grammar file: Parser._Ident. The scanners generated by Coco/R and JFlex differ in a few ways. For example, the Coco/R scanner begins column and line counting from one, JFlex from zero. That's why one is added to yycolumn and yyline. Also note that the Ident rule in the JFlex grammar file belongs to the YYINITIAL state, the state the JFlex scanner begins in.

Example of legal character streams that will be translated to an Ident token: x, foobar, FooBar, element1, element2, qX45we77x.

A few unit tests verify the token rule for Ident. The first unit test verifies that it is possible to use a single letter as an Ident token.

@Test
public void testScan_token_Ident_x() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Ident_x");
    // Initialize
    String sContent = "x";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Ident, 0, 0, 0, sContent);
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The next unit test verifies that it is possible to use several letters as an Ident token.

@Test
public void testScan_token_Ident_foobar() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Ident_foobar");
    // Initialize
    String sContent = "foobar";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Ident, 0, 0, 0, sContent);
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The last unit test verifies that it is possible to use letters and digits in an Ident token:

@Test
public void testScan_token_Ident_x12() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Ident_x12");
    // Initialize
    String sContent = "x12";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Ident, 0, 0, 0, sContent);
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The Number token

On page 5 in the Coco/R PDF manual we also find the definition of number

number = digit {digit}.

This could be defined in a number of ways. One is to follow the definition above:

Number = [0-9][0-9]*

This would allow the character stream 0003 to be converted into a Number token. Fine. A more restrictive definition could forbid leading zeros; that is, the number 3 must be written as 3 and not 0003.

The Coco/R grammar file

TOKENS
  Number = "PLACEHOLDER_NUMBER".

The JFlex grammar file

Number = 0 | [1-9][0-9]*

%%
<YYINITIAL> {
  {Number}       { return new Token(Parser._Number, yycolumn + 1, yyline + 1, yychar, yytext()); }
}
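The behaviour of the two alternatives can be checked with Java's own regex engine; the patterns below are valid in both java.util.regex and JFlex (the class name is made up for this sketch):

```java
import java.util.regex.Pattern;

public class NumberRegexDemo {
    public static void main(String[] args) {
        Pattern permissive  = Pattern.compile("[0-9][0-9]*");   // allows leading zeros
        Pattern restrictive = Pattern.compile("0|[1-9][0-9]*"); // forbids leading zeros

        System.out.println(permissive.matcher("0003").matches());  // true
        System.out.println(restrictive.matcher("0003").matches()); // false
        System.out.println(restrictive.matcher("0").matches());    // true
        System.out.println(restrictive.matcher("3").matches());    // true
    }
}
```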

In one way it is a matter of taste, but it is important to understand the difference. As a minimum, a unit test should verify that it is possible to create a Number token from one or more digits:

@Test
public void testScan_token_Number_123() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Number_123");
    // Initialize
    String sContent = "123";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Number, 0, 0, 0, sContent);
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

Unit tests that verify other combinations of character streams should also be written. For example, the stream 1x should be verified to produce two tokens: one Number token and one Ident token.
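Such a test could look like the sketch below, written in the same style as the other tests. It presupposes the generated Parser constants and the Scanner wrapper described earlier:

```java
@Test
public void testScan_token_Number_then_Ident() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Number_then_Ident");
    // Initialize: "1x" should produce a Number token followed by an Ident token
    String sContent = "1x";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    // Test
    Token first = instance.Scan();
    Token second = instance.Scan();
    // Validate
    assertNotNull( first );
    assertEquals( Parser._Number, first.kind );
    assertEquals( "1", first.val );
    assertNotNull( second );
    assertEquals( Parser._Ident, second.kind );
    assertEquals( "x", second.val );
}
```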

Note! The Number token is not used in the Coco/R grammar rules! It is described here because it is described in the Coco/R PDF manual on page 5.

The String token

A string is any character stream on a single line between double quotes. According to the Coco/R definition on page 5 in the Coco/R PDF manual

string = '"' {anyButQuote} '"'.

This definition is a bit too simplified. It should be possible to include an escaped double quote character, \", as described at the bottom of page 5 and continued on page 6. Now the power of JFlex will be shown.

The Coco/R grammar file

TOKENS
  String = "PLACEHOLDER_STRING".

The JFlex grammar file

%{
  StringBuffer textcontent = new StringBuffer();
  int nColumn, nLine, nChar;
%}

%state STRING

%%
<YYINITIAL> {
  "\""   { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; yybegin(STRING); }
}

<STRING> {
  "\""   { yybegin(YYINITIAL); return new Token(Parser._String, nColumn, nLine, nChar, textcontent.toString()); }
  [^\n\r\"\\]+        { textcontent.append( yytext() ); }
  "\\0"                { textcontent.append('\\'); textcontent.append('0'); }
  "\\a"                { textcontent.append('\\'); textcontent.append('a'); }
  "\\b"                { textcontent.append('\\'); textcontent.append('b'); }
  "\\f"                { textcontent.append('\\'); textcontent.append('f'); }
  "\\t"                { textcontent.append('\\'); textcontent.append('t'); }
  "\\n"                { textcontent.append('\\'); textcontent.append('n'); }
  "\\r"                { textcontent.append('\\'); textcontent.append('r'); }
  "\\v"                { textcontent.append('\\'); textcontent.append('v'); }
  "\\u"                { textcontent.append('\\'); textcontent.append('u'); }
  "\\'"                { textcontent.append('\\'); textcontent.append('\''); }
  "\\\""            { textcontent.append('\\'); textcontent.append('\"'); }
  "\\"                { textcontent.append('\\'); }
}

When a double quote character is encountered, the line, column and char positions are stored in the variables nLine, nColumn and nChar. These variables must be declared before use. That is done within %{ %}; the code inside is copied verbatim into the JFlex scanner source code, as member fields.

The code yybegin(STRING) sets the scanner into a new state, the STRING state (from the state YYINITIAL that is the default beginning state). Once in this new state all character parsing is done within this state and in this context. All states but the default state, YYINITIAL, must be declared. This is done with %state STRING.

The line [^\n\r\"\\]+ { textcontent.append( yytext() ); } means that all characters except new line, carriage return, double quote and backslash are appended to the StringBuffer object textcontent.

All rules beginning with a backslash handle escape sequences; these are special cases that are appended to textcontent verbatim.

The line "\"" { yybegin(YYINITIAL); return new Token(Parser._String, nColumn, nLine, nChar, textcontent.toString()); } means that when a double quote character is encountered, the scanner goes into the state YYINITIAL again. A Token object is also created and returned, carrying all characters appended while in the STRING state together with the line, column and char positions stored in nLine, nColumn and nChar.

Three good unit tests are: the empty string, an ordinary string, and a string containing an escape sequence.

@Test
public void testScan_token_String_Empty() throws UnsupportedEncodingException {
    System.out.println("testScan_token_String_Empty");
    // Initialize
    String sContent = "\"\"";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._String, 1, 1, 0, "" );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
@Test
public void testScan_token_String_abc() throws UnsupportedEncodingException {
    System.out.println("testScan_token_String_abc");
    // Initialize
    String sContent = "\"abc\"";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._String, 1, 1, 0, "abc" );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
@Test
public void testScan_token_String_escapeZero() throws UnsupportedEncodingException {
    System.out.println("testScan_token_String_escapeZero");
    // Initialize
    String sContent = "\"\\0\"";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._String, 1, 1, 0, "\\0" );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The Char token

The last token on page 5 in the Coco/R PDF manual is the char token. It is defined as

char   = '\'' anyButApostrophe '\''.

However, all the escape sequences must also be honoured. The definitions of strings and chars are quite similar in JFlex.

The Coco/R grammar file

TOKENS
  Char = "PLACEHOLDER_CHAR".

The JFlex grammar file

%{
  int nCharCount;
%}

%state STRING, CHAR

%%
<YYINITIAL> {
  "\'"   { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; nCharCount = 0; yybegin(CHAR); }
}

<CHAR> {
  "\'"   { yybegin(YYINITIAL); 
          if ( nCharCount == 5 ) {
             if ( textcontent.toString().startsWith( "\\u" ) ) {
                return new Token(Parser._Char, nColumn, nLine, nChar, textcontent.toString());
             }
             else {
                return new Token(Parser.maxT, nColumn, nLine, nChar, textcontent.toString());
             }
          }
          else if ( nCharCount == 1 ) {
             return new Token(Parser._Char, nColumn, nLine, nChar, textcontent.toString());
          }
          else {
             return new Token(Parser.maxT, nColumn, nLine, nChar, textcontent.toString());
          } }
  [^\n\r\'\\]        { textcontent.append( yytext() ); ++nCharCount; }
  "\\0"                { textcontent.append('\\'); textcontent.append('0'); ++nCharCount; }
  "\\a"                { textcontent.append('\\'); textcontent.append('a'); ++nCharCount; }
  "\\b"                { textcontent.append('\\'); textcontent.append('b'); ++nCharCount; }
  "\\f"                { textcontent.append('\\'); textcontent.append('f'); ++nCharCount; }
  "\\t"                { textcontent.append('\\'); textcontent.append('t'); ++nCharCount; }
  "\\n"                { textcontent.append('\\'); textcontent.append('n'); ++nCharCount; }
  "\\r"                { textcontent.append('\\'); textcontent.append('r'); ++nCharCount; }
  "\\v"                { textcontent.append('\\'); textcontent.append('v'); ++nCharCount; }
  "\\u"                { textcontent.append('\\'); textcontent.append('u'); ++nCharCount; }
  "\\'"                { textcontent.append('\\'); textcontent.append('\''); ++nCharCount; }
  "\\\""            { textcontent.append('\\'); textcontent.append('\"'); ++nCharCount; }
  "\\"                { textcontent.append('\\'); ++nCharCount; }
}

The definition of Char is very similar to String. One thing to mention, though, is the new variable nCharCount. It counts the characters added (an escape sequence counts as one character). If a Char token is about to contain more or fewer than one character, a maxT token (an illegal token, that is) is created instead. The exception is a Unicode escape such as \u0041: the "\\u" rule counts as one and each of the four hex digits counts as one, so nCharCount ends up as 5, which is accepted when the content starts with \u. Unit tests should cover a single character and some escape sequence. The empty character should not be allowed.

@Test
public void testScan_token_Char_x() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Char_x");
    // Initialize
    String sContent = "\'x\'";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Char, 1, 1, 0, "x" );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
 
@Test
public void testScan_token_Char_escapeZero() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Char_escapeZero");
    // Initialize
    String sContent = "\'\\0\'";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Char, 1, 1, 0, "\\0" );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
 
@Test
public void testScan_token_Char_emptyCharacter_not_allowed() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Char_emptyCharacter_not_allowed");
    // Initialize
    String sContent = "\'\'";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser.maxT;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
}

Reserved Keywords

On page 6 in the Coco/R PDF manual the reserved keywords for the Coco/R grammar are listed.

The Coco/R grammar file

TOKENS
  Any = "ANY".
  Context = "CONTEXT".
  Ignore = "IGNORE".
  Pragmas = "PRAGMAS".
  Tokens = "TOKENS".
  Characters = "CHARACTERS".
  End = "END".
  Ignorecase = "IGNORECASE".
  Productions = "PRODUCTIONS".
  Weak = "WEAK".
  Comments = "COMMENTS".
  From = "FROM".
  Nested = "NESTED".
  Sync = "SYNC".
  Compiler = "COMPILER".
  If = "IF".
  Out = "out".
  To = "TO".

The JFlex grammar file

%%
<YYINITIAL> {
  ANY      { return new Token(Parser._Any, yycolumn + 1, yyline + 1, yychar, yytext()); }
  CONTEXT  { return new Token(Parser._Context, yycolumn + 1, yyline + 1, yychar, yytext()); }
  IGNORE  { return new Token(Parser._Ignore, yycolumn + 1, yyline + 1, yychar, yytext()); }
  PRAGMAS  { return new Token(Parser._Pragmas, yycolumn + 1, yyline + 1, yychar, yytext()); }
  TOKENS  { return new Token(Parser._Tokens, yycolumn + 1, yyline + 1, yychar, yytext()); }
  CHARACTERS  { return new Token(Parser._Characters, yycolumn + 1, yyline + 1, yychar, yytext()); }
  END       { return new Token(Parser._End, yycolumn + 1, yyline + 1, yychar, yytext()); }
  IGNORECASE  { return new Token(Parser._Ignorecase, yycolumn + 1, yyline + 1, yychar, yytext()); }
  PRODUCTIONS  { return new Token(Parser._Productions, yycolumn + 1, yyline + 1, yychar, yytext()); }
  WEAK   { return new Token(Parser._Weak, yycolumn + 1, yyline + 1, yychar, yytext()); }
  COMMENTS   { return new Token(Parser._Comments, yycolumn + 1, yyline + 1, yychar, yytext()); }
  FROM   { return new Token(Parser._From, yycolumn + 1, yyline + 1, yychar, yytext()); }
  NESTED   { return new Token(Parser._Nested, yycolumn + 1, yyline + 1, yychar, yytext()); }
  SYNC   { return new Token(Parser._Sync, yycolumn + 1, yyline + 1, yychar, yytext()); }
  COMPILER   { return new Token(Parser._Compiler, yycolumn + 1, yyline + 1, yychar, yytext()); }
  IF   { return new Token(Parser._If, yycolumn + 1, yyline + 1, yychar, yytext()); }
  out   { return new Token(Parser._Out, yycolumn + 1, yyline + 1, yychar, yytext()); }
  TO   { return new Token(Parser._To, yycolumn + 1, yyline + 1, yychar, yytext()); }
}

Unit tests for all of them should be written. Here is an example:

@Test
public void testScan_token_Any() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Any");
    // Initialize
    String sContent = "ANY";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Any, 0, 0, 0 );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( sContent, result.val );
}

Operators

There is no single page listing all operator tokens, but a careful examination of the Syntax of Coco/R on page 34 in the Coco/R PDF manual reveals them.

The Coco/R grammar file

TOKENS
  PointPoint = "..".
  Point = '.'.
  Equal = '='.
  Plus = '+'.
  Minus = '-'.
  LeftParenthesis = '('.
  RightParenthesis = ')'.
  LeftCurlyBrace     = '{'.
  RightCurlyBrace    = '}'.
  LeftSquareBracket  = '['.
  RightSquareBracket = ']'.
  VerticalBar        = '|'.

The JFlex grammar file

  ".."   { return new Token(Parser._PointPoint, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "."   { return new Token(Parser._Point, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "="   { return new Token(Parser._Equal, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "+"   { return new Token(Parser._Plus, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "-"   { return new Token(Parser._Minus, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "("   { return new Token(Parser._LeftParenthesis, yycolumn + 1, yyline + 1, yychar, yytext()); }
  ")"   { return new Token(Parser._RightParenthesis, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "{"   { return new Token(Parser._LeftCurlyBrace, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "}"   { return new Token(Parser._RightCurlyBrace, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "["   { return new Token(Parser._LeftSquareBracket, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "]"   { return new Token(Parser._RightSquareBracket, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "|"   { return new Token(Parser._VerticalBar, yycolumn + 1, yyline + 1, yychar, yytext()); }

These are verified by unit tests in the same manner as the reserved keywords:

@Test
public void testScan_token_Equal() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Equal");
    // Initialize
    String sContent = "=";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Equal, 1, 1, 0 );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( sContent, result.val );
}

The Attribute token

A glance at the Syntax of Coco/R on page 34 in the Coco/R PDF manual shows that Attributes is defined not as a token, but as a rule:

Attributes  = '<' {ANY} '>' | "<." {ANY} ".>".

However, it is more convenient to use the power of the scanner and construct a token for this element. The drawback (in my opinion) of defining Attributes as a grammar rule, as the manual does, is that we get a stream of tokens between the '<' and '>' tokens (or between '<.' and '.>'). To reconstruct the attribute text we must iterate through these tokens and recalculate the spacing between them (because white space is ignored and dropped). That complicates things when the parse tree is constructed. Not impossible, but (in my opinion) messier.

The Coco/R grammar file

TOKENS
  LessThan = '<'.
  GreaterThan = '>'.
  AttributeStart = "<.".
  AttributeEnd = ".>".
  Attributes = "PLACEHOLDER_ATTRIBUTES".

The JFlex grammar file

%state STRING, CHAR, ATTRIBUTE1, ATTRIBUTE2

%%

<YYINITIAL> {
  "<."   { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; yybegin(ATTRIBUTE2); }
  "<"   { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; yybegin(ATTRIBUTE1); }
}

<ATTRIBUTE1> {
  ">"   { yybegin(YYINITIAL); 
            return new Token(Parser._Attributes, nColumn, nLine, nChar, textcontent.toString()); }
  [^>]+   { textcontent.append( yytext() ); }
}

<ATTRIBUTE2> {
  ".>"       { yybegin(YYINITIAL); 
              return new Token(Parser._Attributes, nColumn, nLine, nChar, textcontent.toString()); }
  [^\.]+   { textcontent.append( yytext() ); }
  "."[^>]   { textcontent.append( yytext() ); }
}

Attributes in Coco/R may be delimited either by a < > pair or by a <. .> pair. To keep these two forms apart, two states are needed: ATTRIBUTE1 and ATTRIBUTE2.
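The ATTRIBUTE2 state logic can be illustrated with a plain-Java sketch. Note that scanAttr2 is a hypothetical helper written only for illustration, not part of the generated scanner: it collects everything between "<." and the first ".>", treating a '.' not followed by '>' as content, just like the "."[^>] rule above.

```java
public class AttrDemo {
    // Hypothetical helper mirroring the ATTRIBUTE2 state: collect every
    // character after the leading "<." until the ".>" terminator is seen.
    static String scanAttr2(String input) {
        StringBuilder content = new StringBuilder();
        for (int i = 2; i < input.length(); i++) {      // skip the leading "<."
            char c = input.charAt(i);
            if (c == '.' && i + 1 < input.length() && input.charAt(i + 1) == '>') {
                return content.toString();              // reached ".>"
            }
            content.append(c);                          // '.' not followed by '>' is content
        }
        return null;                                    // unterminated attribute
    }

    public static void main(String[] args) {
        System.out.println(scanAttr2("<.int x, String s.>")); // int x, String s
        System.out.println(scanAttr2("<.a.b.>"));             // a.b
    }
}
```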

At minimum, legal Attributes tokens should be verified:

@Test
public void testScan_token_Attributs_startString_lessThan() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Attributs_startString_lessThan");
    // Initialize
    String sBeginAttributes = "<";
    String sEndAttributes = ">";
    String sAttributesContent = "int x, String s";
    String sContent = sBeginAttributes + sAttributesContent + sEndAttributes;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Attributes, 1, 1, 0, sAttributesContent );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
 
@Test
public void testScan_token_Attributs_startString_lessThanPoint() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Attributs_startString_lessThanPoint");
    // Initialize
    String sBeginAttributes = "<.";
    String sEndAttributes = ".>";
    String sAttributesContent = "int x, String s";
    String sContent = sBeginAttributes + sAttributesContent + sEndAttributes;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._Attributes, 1, 1, 0, sAttributesContent );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The SemAction token

Another look at the Syntax of Coco/R on page 34 in the Coco/R PDF manual shows that SemAction is also defined as a rule:

SemAction   = "(." {ANY} ".)".

This, too, can be defined as a token.

The Coco/R grammar file

TOKENS
  ActionStart = "(.".
  ActionEnd = ".)".
  SemAction = "PLACEHOLDER_SEMACTION".

The JFlex grammar file

%state STRING, CHAR, ATTRIBUTE1, ATTRIBUTE2, ACTION

%%
<YYINITIAL> {
  "(."   { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; yybegin(ACTION); }
}

<ACTION> {
  ".)"   { yybegin(YYINITIAL); 
               return new Token(Parser._SemAction, nColumn, nLine, nChar, textcontent.toString()); }
  [^\.]+   { textcontent.append( yytext() ); }
  "."[^)]   { textcontent.append( yytext() ); }
}

This is verified in the same way as the Attributes token:

@Test
public void testScan_token_Action() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Action");
    // Initialize
    String sBeginAction = "(.";
    String sEndAction = ".)";
    String sActionContent = "System.out.println(\"x=\" + Integer.toString( x ) );";
    String sContent = sBeginAction + sActionContent + sEndAction;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token( Parser._SemAction, 1, 1, 0, sActionContent );
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

Comments

On page 6 in the Coco/R PDF manual comments are described. Coco/R has a special construct for defining comments. The benefit of such a definition is that the generated Coco/R scanner handles comments wherever they appear in the character stream, without the grammar being polluted everywhere with rules to take care of them.

However, this construct will not be used with JFlex. Instead, comments must be handled with the power of JFlex.

The JFlex grammar file

LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]

/* comments */
Comment = {TraditionalComment} | {EndOfLineComment}

TraditionalComment   = "/*" [^*] ~"*/" | "/*" "*"+ "/"
EndOfLineComment     = "//" {InputCharacter}* {LineTerminator}

Ident = [:jletter:] [:jletterdigit:]*
Number = 0 | [1-9][0-9]*

%state STRING, CHAR, ATTRIBUTE1, ATTRIBUTE2, ACTION, COMMENT

%%
<YYINITIAL> {
  {TraditionalComment}   { /* ignore */ }
  {EndOfLineComment}       { /* ignore */ }
}

First, LineTerminator is defined as a line feed, a carriage return, or a carriage return followed by a line feed. InputCharacter is defined as any character except the line feed and carriage return characters.

TraditionalComment is defined as /* followed by a character other than *, then everything up to and including the first */ sequence. The second alternative covers degenerate comments such as /**/ and /***/ that contain only asterisks between the delimiters.

EndOfLineComment is defined as starting with // followed by any character until LineTerminator (that is, new line).

The rules below %% state that when these sequences are encountered, they are ignored.
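To get a feel for what these definitions match, the comment patterns can be approximated with java.util.regex. This is a rough equivalent for experimentation only, not the tutorial's scanner code; JFlex's ~"*/" ("everything up to and including the first */") corresponds to a reluctant match here:

```java
import java.util.regex.Pattern;

public class CommentRegexDemo {
    // Approximate java.util.regex equivalents of the JFlex definitions.
    // (?s:...) lets '.' match line terminators inside a traditional comment.
    static final Pattern TRADITIONAL =
            Pattern.compile("/\\*[^*](?s:.*?)\\*/|/\\*\\*+/");
    static final Pattern END_OF_LINE =
            Pattern.compile("//[^\r\n]*(\r|\n|\r\n)");

    public static void main(String[] args) {
        System.out.println(TRADITIONAL.matcher("/* x */").matches());  // true
        System.out.println(TRADITIONAL.matcher("/***/").matches());    // true
        System.out.println(TRADITIONAL.matcher("/* open").matches());  // false: unterminated
        System.out.println(END_OF_LINE.matcher("// y\r\n").matches()); // true
    }
}
```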

Comments may be verified by blending them in among other tokens:

@Test
public void testScan_line_col_charPos_Comments() throws UnsupportedEncodingException {
    System.out.println("testScan_line_col_charPos_Comments");
    // Initialize
    String sToken0 = "ANY";
    String sToken1 = "END";
    String sToken2 = "IF";
    String sComment = "/* x */";
    String sLineComment = "// y";
    String sNewLine = "\r\n";
    // A N Y / *   x   * /  E  N  D CR LF  I  F  /  /     y CR LF Eof
    // 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    String sContent = sToken0 + sComment + sToken1 + sNewLine + sToken2 + sLineComment + sNewLine;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expectedToken0 = new Token( Parser._Any, 1, 1, 0 );
    Token expectedToken1 = new Token( Parser._End, 11, 1, 10 );
    Token expectedToken2 = new Token( Parser._If, 1, 2, 15 );
    Token expectedToken3 = new Token( Parser._EOF, 1, 3, 23 );
    // Test
    Token resultToken0 = instance.Scan();
    Token resultToken1 = instance.Scan();
    Token resultToken2 = instance.Scan();
    Token resultToken3 = instance.Scan();
    // Validate
    assertEquals( expectedToken0.col, resultToken0.col );
    assertEquals( expectedToken1.col, resultToken1.col );
    assertEquals( expectedToken2.col, resultToken2.col );
    assertEquals( expectedToken3.col, resultToken3.col );
    assertEquals( expectedToken0.line, resultToken0.line );
    assertEquals( expectedToken1.line, resultToken1.line );
    assertEquals( expectedToken2.line, resultToken2.line );
    assertEquals( expectedToken3.line, resultToken3.line );
    assertEquals( expectedToken0.charPos, resultToken0.charPos );
    assertEquals( expectedToken1.charPos, resultToken1.charPos );
    assertEquals( expectedToken2.charPos, resultToken2.charPos );
    assertEquals( expectedToken3.charPos, resultToken3.charPos );
}
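The expected line and column values in a test like this can be derived mechanically from the 0-based character position. Below is a standalone sketch of that arithmetic (lineCol is a hypothetical helper, not part of the generated scanner), treating \r\n as a single line terminator as the LineTerminator definition does:

```java
public class PositionDemo {
    // Hypothetical helper: map a 0-based character offset in the input to
    // the 1-based (line, column) pair used by the Token class.
    static int[] lineCol(String text, int charPos) {
        int line = 1, col = 1;
        for (int i = 0; i < charPos; i++) {
            char c = text.charAt(i);
            boolean crlf = c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n';
            if (c == '\n' || (c == '\r' && !crlf)) {
                line++;            // terminator: the next character starts a new line
                col = 1;
            } else if (!crlf) {
                col++;             // ordinary character advances the column;
            }                      // the '\r' of a "\r\n" pair is skipped and
        }                          // counted when its '\n' is processed
        return new int[] { line, col };
    }

    public static void main(String[] args) {
        String content = "ANY/* x */END\r\nIF// y\r\n";
        int[] p = lineCol(content, 10);         // the 'E' of END
        System.out.println(p[0] + "," + p[1]);  // 1,11
    }
}
```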

White space

The last piece is to handle white space:

JFlex grammar file

WhiteSpace     = {LineTerminator} | [ \t\f]

%%
<YYINITIAL> {
  {WhiteSpace}   { /* ignore */ }
}

This, too, can be verified with unit tests:

@Test
public void testScan_line_col_charPos_Whitespace() throws UnsupportedEncodingException {
    System.out.println("testScan_line_col_charPos_Whitespace");
    // Initialize
    String sToken0 = "4";
    String sToken1 = "<.x.>";
    String sToken2 = "(.y.)";
    String sToken3 = "TO";
    String sSpace = " ";
    String sTab = "\t";
    // 4     < . x . >   (  .  y  .  )        T  O Eof
    // 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
    String sContent = sToken0 + sSpace + sSpace + sToken1 + sTab + sToken2 + sSpace + sTab + sToken3;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expectedToken0 = new Token( Parser._Number, 1, 1, 0, sToken0 );
    Token expectedToken1 = new Token( Parser._Attributes, 4, 1, 3, "x" );
    Token expectedToken2 = new Token( Parser._SemAction, 10, 1, 9, "y" );
    Token expectedToken3 = new Token( Parser._To, 17, 1, 16 );
    Token expectedToken4 = new Token( Parser._EOF, 19, 1, 18 );
    // Test
    Token resultToken0 = instance.Scan();
    Token resultToken1 = instance.Scan();
    Token resultToken2 = instance.Scan();
    Token resultToken3 = instance.Scan();
    Token resultToken4 = instance.Scan();
    // Validate
    assertEquals( expectedToken0.col, resultToken0.col );
    assertEquals( expectedToken1.col, resultToken1.col );
    assertEquals( expectedToken2.col, resultToken2.col );
    assertEquals( expectedToken3.col, resultToken3.col );
    assertEquals( expectedToken4.col, resultToken4.col );
    assertEquals( expectedToken0.line, resultToken0.line );
    assertEquals( expectedToken1.line, resultToken1.line );
    assertEquals( expectedToken2.line, resultToken2.line );
    assertEquals( expectedToken3.line, resultToken3.line );
    assertEquals( expectedToken4.line, resultToken4.line );
    assertEquals( expectedToken0.charPos, resultToken0.charPos );
    assertEquals( expectedToken1.charPos, resultToken1.charPos );
    assertEquals( expectedToken2.charPos, resultToken2.charPos );
    assertEquals( expectedToken3.charPos, resultToken3.charPos );
    assertEquals( expectedToken4.charPos, resultToken4.charPos );
}

The Grammar File So Far

This is the grammar without grammar rules. The PRODUCTIONS part is far from complete, but the topmost grammar rule must be present for the Coco/R parser generator to compile the grammar file.

The Coco/R Grammar file

package org.structuredparsing.cocorgrammar.cocor.parser_jflex_scanner;

COMPILER CocoR

TOKENS
  WhiteSpace = "PLACEHOLDER_WHITESPACE".
  Ident = "PLACEHOLDER_IDENT".
  Number = "PLACEHOLDER_NUMBER".
  QuotationMark = '"'.
  String = "PLACEHOLDER_STRING".
  Apostrophe = "'".
  Char = "PLACEHOLDER_CHAR".
  LessThan = '<'.
  GreaterThan = '>'.
  AttributeStart = "<.".
  AttributeEnd = ".>".
  Attributes = "PLACEHOLDER_ATTRIBUTES".
  ActionStart = "(.".
  ActionEnd = ".)".
  SemAction = "PLACEHOLDER_SEMACTION".
  Any = "ANY".
  Context = "CONTEXT".
  Ignore = "IGNORE".
  Pragmas = "PRAGMAS".
  Tokens = "TOKENS".
  Characters = "CHARACTERS".
  End = "END".
  Ignorecase = "IGNORECASE".
  Productions = "PRODUCTIONS".
  Weak = "WEAK".
  Comments = "COMMENTS".
  From = "FROM".
  Nested = "NESTED".
  Sync = "SYNC".
  Compiler = "COMPILER".
  If = "IF".
  Out = "out".
  To = "TO".
  PointPoint = "..".
  Point = '.'.
  Equal = '='.
  Plus = '+'.
  Minus = '-'.
  LeftParenthesis = '('.
  RightParenthesis = ')'.
  LeftCurlyBrace     = '{'.
  RightCurlyBrace    = '}'.
  LeftSquareBracket  = '['.
  RightSquareBracket = ']'.
  VerticalBar        = '|'.
  CommentStart = "/*".
  CommentEnd = "*/".
  CommentSingleLine = "//".
  Comment = "PLACEHOLDER_COMMENT".
  Illegaltoken = "PLACEHOLDER_ILLEGALTOKEN".

PRODUCTIONS

CocoR = 
  Compiler Ident.

END CocoR.

The JFlex grammar file, however, is complete:

package org.structuredparsing.cocorgrammar.cocor.parser_jflex_scanner;

import org.structuredparsing.cocorgrammar.cocor.parser_jflex_scanner.Parser;

class Token {
    public Token() { this.kind = Parser._EOF; }
    public Token( int kind ) { this.kind = kind;this.val = ""; }
    public Token( int kind, String val ) { this.kind = kind; this.val = val; }
    public Token( int kind, int col, int line, int charPos ) { this.kind = kind; this.col = col; this.line = line; this.charPos = charPos; this.val = ""; }
    public Token( int kind, int col, int line, int charPos, String val ) { this.kind = kind; this.col = col; this.line = line; this.charPos = charPos; this.val = val; }
    public int kind;    // token kind
    public int pos;     // token position in bytes in the source text (starting at 0)
    public int charPos; // token position in characters in the source text (starting at 0)
    public int col;     // token column (starting at 1)
    public int line;    // token line (starting at 1)
    public String val;  // token value
    public Token next;  // ML 2005-03-11 Peek tokens are kept in linked list
}

%%
%class Lexer
%public
%type Token
%line
%column
%char
%eofval{
    return new Token(Parser._EOF, yycolumn + 1, yyline + 1, yychar);
%eofval}

%{
  StringBuffer textcontent = new StringBuffer();
  int nColumn, nLine, nChar, nCharCount;
%}

LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]
WhiteSpace     = {LineTerminator} | [ \t\f]

/* comments */
Comment = {TraditionalComment} | {EndOfLineComment}

TraditionalComment   = "/*" [^*] ~"*/" | "/*" "*"+ "/"
EndOfLineComment     = "//" {InputCharacter}* {LineTerminator}

Ident = [:jletter:] [:jletterdigit:]*
Number = 0 | [1-9][0-9]*

%state STRING, CHAR, ATTRIBUTE1, ATTRIBUTE2, ACTION, COMMENT

%%

<YYINITIAL> {
  {Number}            { return new Token(Parser._Number, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "\""                { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; yybegin(STRING); }
  "\'"                { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; nCharCount = 0; yybegin(CHAR); }
  "<."                { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; yybegin(ATTRIBUTE2); }
  "<"                { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; yybegin(ATTRIBUTE1); }
  "(."                { textcontent.setLength(0); nColumn = yycolumn + 1; nLine = yyline + 1; nChar = yychar; yybegin(ACTION); }
  ANY                { return new Token(Parser._Any, yycolumn + 1, yyline + 1, yychar, yytext()); }
  CONTEXT            { return new Token(Parser._Context, yycolumn + 1, yyline + 1, yychar, yytext()); }
  IGNORE            { return new Token(Parser._Ignore, yycolumn + 1, yyline + 1, yychar, yytext()); }
  PRAGMAS            { return new Token(Parser._Pragmas, yycolumn + 1, yyline + 1, yychar, yytext()); }
  TOKENS            { return new Token(Parser._Tokens, yycolumn + 1, yyline + 1, yychar, yytext()); }
  CHARACTERS        { return new Token(Parser._Characters, yycolumn + 1, yyline + 1, yychar, yytext()); }
  END                { return new Token(Parser._End, yycolumn + 1, yyline + 1, yychar, yytext()); }
  IGNORECASE        { return new Token(Parser._Ignorecase, yycolumn + 1, yyline + 1, yychar, yytext()); }
  PRODUCTIONS        { return new Token(Parser._Productions, yycolumn + 1, yyline + 1, yychar, yytext()); }
  WEAK                { return new Token(Parser._Weak, yycolumn + 1, yyline + 1, yychar, yytext()); }
  COMMENTS            { return new Token(Parser._Comments, yycolumn + 1, yyline + 1, yychar, yytext()); }
  FROM                { return new Token(Parser._From, yycolumn + 1, yyline + 1, yychar, yytext()); }
  NESTED            { return new Token(Parser._Nested, yycolumn + 1, yyline + 1, yychar, yytext()); }
  SYNC                { return new Token(Parser._Sync, yycolumn + 1, yyline + 1, yychar, yytext()); }
  COMPILER            { return new Token(Parser._Compiler, yycolumn + 1, yyline + 1, yychar, yytext()); }
  IF                { return new Token(Parser._If, yycolumn + 1, yyline + 1, yychar, yytext()); }
  out                { return new Token(Parser._Out, yycolumn + 1, yyline + 1, yychar, yytext()); }
  TO                { return new Token(Parser._To, yycolumn + 1, yyline + 1, yychar, yytext()); }
  {Ident}            { return new Token(Parser._Ident, yycolumn + 1, yyline + 1, yychar, yytext()); }
  ".."                { return new Token(Parser._PointPoint, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "."                { return new Token(Parser._Point, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "="                { return new Token(Parser._Equal, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "+"                { return new Token(Parser._Plus, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "-"                { return new Token(Parser._Minus, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "("                { return new Token(Parser._LeftParenthesis, yycolumn + 1, yyline + 1, yychar, yytext()); }
  ")"                { return new Token(Parser._RightParenthesis, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "{"                { return new Token(Parser._LeftCurlyBrace, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "}"                { return new Token(Parser._RightCurlyBrace, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "["                { return new Token(Parser._LeftSquareBracket, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "]"                { return new Token(Parser._RightSquareBracket, yycolumn + 1, yyline + 1, yychar, yytext()); }
  "|"                { return new Token(Parser._VerticalBar, yycolumn + 1, yyline + 1, yychar, yytext()); }
  {TraditionalComment}        { /* ignore */ }
  {EndOfLineComment}        { /* ignore */ }
  {WhiteSpace}                { /* ignore */ }
}

<STRING> {
  "\""                { yybegin(YYINITIAL); 
                        return new Token(Parser._String, nColumn, nLine, nChar, textcontent.toString()); }
  [^\n\r\"\\]+        { textcontent.append( yytext() ); }
  "\\0"                { textcontent.append('\\'); textcontent.append('0'); }
  "\\a"                { textcontent.append('\\'); textcontent.append('a'); }
  "\\b"                { textcontent.append('\\'); textcontent.append('b'); }
  "\\f"                { textcontent.append('\\'); textcontent.append('f'); }
  "\\t"                { textcontent.append('\\'); textcontent.append('t'); }
  "\\n"                { textcontent.append('\\'); textcontent.append('n'); }
  "\\r"                { textcontent.append('\\'); textcontent.append('r'); }
  "\\v"                { textcontent.append('\\'); textcontent.append('v'); }
  "\\u"                { textcontent.append('\\'); textcontent.append('u'); }
  "\\'"                { textcontent.append('\\'); textcontent.append('\''); }
  "\\\""            { textcontent.append('\\'); textcontent.append('\"'); }
  "\\"                { textcontent.append('\\'); }
}

<CHAR> {
  "\'"                { yybegin(YYINITIAL); 
                        // nCharCount counts the units appended in this state:
                        // "\u" followed by four hex digits gives 5, any other
                        // single character or escape sequence gives 1;
                        // everything else is an illegal char literal (Parser.maxT).
                        if ( nCharCount == 5 ) {
                              if ( textcontent.toString().startsWith( "\\u" ) ) {
                                  return new Token(Parser._Char, nColumn, nLine, nChar, textcontent.toString());
                              }
                              else {
                                  return new Token(Parser.maxT, nColumn, nLine, nChar, textcontent.toString());
                              }
                          }
                        else if ( nCharCount == 1 ) {
                           return new Token(Parser._Char, nColumn, nLine, nChar, textcontent.toString());
                        }
                        else {
                           return new Token(Parser.maxT, nColumn, nLine, nChar, textcontent.toString());
                        } }
  [^\n\r\'\\]        { textcontent.append( yytext() ); ++nCharCount; }
  "\\0"                { textcontent.append('\\'); textcontent.append('0'); ++nCharCount; }
  "\\a"                { textcontent.append('\\'); textcontent.append('a'); ++nCharCount; }
  "\\b"                { textcontent.append('\\'); textcontent.append('b'); ++nCharCount; }
  "\\f"                { textcontent.append('\\'); textcontent.append('f'); ++nCharCount; }
  "\\t"                { textcontent.append('\\'); textcontent.append('t'); ++nCharCount; }
  "\\n"                { textcontent.append('\\'); textcontent.append('n'); ++nCharCount; }
  "\\r"                { textcontent.append('\\'); textcontent.append('r'); ++nCharCount; }
  "\\v"                { textcontent.append('\\'); textcontent.append('v'); ++nCharCount; }
  "\\u"                { textcontent.append('\\'); textcontent.append('u'); ++nCharCount; }
  "\\'"                { textcontent.append('\\'); textcontent.append('\''); ++nCharCount; }
  "\\\""            { textcontent.append('\\'); textcontent.append('\"'); ++nCharCount; }
  "\\"                { textcontent.append('\\'); ++nCharCount; }
}

<ATTRIBUTE1> {
  ">"                    { yybegin(YYINITIAL); 
                        return new Token(Parser._Attributes, nColumn, nLine, nChar, textcontent.toString()); }
  [^>]+                { textcontent.append( yytext() ); }
}

<ATTRIBUTE2> {
  ".>"                { yybegin(YYINITIAL); 
                        return new Token(Parser._Attributes, nColumn, nLine, nChar, textcontent.toString()); }
  [^\.]+            { textcontent.append( yytext() ); }
  "."[^>]            { textcontent.append( yytext() ); }
}

<ACTION> {
  ".)"                { yybegin(YYINITIAL); 
                        return new Token(Parser._SemAction, nColumn, nLine, nChar, textcontent.toString()); }
  [^\.]+            { textcontent.append( yytext() ); }
  "."[^)]            { textcontent.append( yytext() ); }
}

 /* error fallback */
.|\n                { return new Token(Parser._Illegaltoken, yycolumn + 1, yyline + 1, yychar, yytext()); }

Next part

Time to Go to Creating Grammar Rules part 2


<- back

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License