Coco/R Parser With Internal Scanner - part 1

<- back

Go to Creating Grammar Rules part 2


Use the Coco/R PDF manual side by side with this tutorial as a reference. This tutorial will not go too deep into explaining every detail of the Coco/R grammar.

Creating Tokens

Scanning a stream and creating tokens looks, perhaps, trivial and shouldn't take too much time. Nothing could be further from the truth. Getting the scanner to work correctly is paramount: if the scanner works as intended, it will save a lot of time and trouble when the parser rules are written. Here unit tests are a workhorse.

Coco/R Grammar Syntax in Short

Coco/R uses the internal scanner by default. The rules for both the scanner and the parser reside in the same grammar file. The grammar file is divided into 4 parts (page 5 in the Coco/R PDF Manual):

  • Imports - These lines will be copied into the parser source code verbatim. It should be used to include import statements in Java and C#, and include statements in C++.
  • Global Fields And Methods - Source code that will be copied verbatim into the Parser class in the parser source code.
  • Scanner Specification - rules for the scanner to create tokens.
  • Parser Specification - rules for the parser that operates on a token stream.

Scanner Specification

The Scanner Specification is further divided into a few parts:

  • Character sets - macro specification to define sets of characters. Resides below the CHARACTERS statement in the grammar file.
  • Tokens - rules to define tokens. Resides below the TOKEN statement in the grammar file.
  • Pragmas - rules to define tokens that may appear anywhere in the token stream. Resides below the PRAGMA statement in the grammar file.
  • Comment definition - defines what a comment is. Uses the COMMENTS statement.
  • White space definition - defines what should count as white space and consequently be ignored. Uses the IGNORE statement.

Coco/R Vocabulary
  • An identifier in the Coco/R grammar is any word that begins with a letter (uppercase or lowercase) and continues with uppercase and/or lowercase letters and/or digits. Examples of legal identifiers are: x, letter, digit, letter1, firstLetter. Examples of illegal identifiers are: 1Letter, letter_1, letter#1.
  • A string is any sequence of characters (on a single line) between two "-characters. Examples: "COMPILER", "ABCDEF", "*/".
  • A char is any character between two '-characters. Examples: 'a', 'A', '1', '%', '+', '#', ' '. Note, only one character may be present between the two '-characters. There are exceptions to this rule: some white space characters must be written with two letters, as an escape sequence.
  • An escape sequence is one of those described on page 6 in the Coco/R PDF Manual. They are the same as in Java and C++. A few examples: the tab character, line feed and carriage return are written as \t, \n and \r.
  • A digit is any contiguous stream of digits. Examples: 0, 1, 502873005.

Character Sets

This part begins with the CHARACTERS statement. Every line after it, up until the TOKENS statement, is part of the Character Sets. These definitions define sets of characters and are optional, but they make the grammar easier to read and understand. The character sets are then used in the TOKENS section.

A character set definition is written as:

identifier = <CHARACTER SET RULES> '.'

A Character Set Rule may contain any of these Coco/R elements:

  • string
  • identifier - a previously defined identifier.
  • char
  • char1 .. char2 - the set of characters in the range char1 to char2. Examples: 'a' .. 'z', 'A' .. 'Z'
  • ANY - the set of all characters in the range 0 to 65535.

Character Set Rules may also be built from other Character Set Rules, separated with - or +:

  • CharacterSet1 - CharacterSet2 - the set of all characters in CharacterSet1 minus the set of characters in CharacterSet2.
  • CharacterSet1 + CharacterSet2 - the set of all characters in CharacterSet1 plus the set of characters in CharacterSet2.

Example

CHARACTERS
  digit    = "0123456789".
  hexDigit = digit + "ABCDEF".
  letter   = 'A' .. 'Z'.
  eol      = '\r'.
  noDigit  = ANY - digit.

A digit is any of the characters 0, 1, 2, and so on up to the character 9. A hexDigit is any character in the character set digit plus the characters A, B, C and so on up to the character F. A letter is any character from the character A to the character Z. Note that the characters a, b, c, and so on up to the character z are not letters. eol is the Carriage Return character. A noDigit is any character except the characters in the character set digit.
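The + and - operations on character sets can be mimicked in plain Java. The sketch below is illustrative only (it is not part of the generated scanner) and models the definitions above with java.util.Set; the noDigit demonstration is restricted to a small sample range instead of the full 0..65535 range of ANY:

```java
import java.util.Set;
import java.util.TreeSet;

class CharSetDemo {
    // Build a set from the characters of a string, e.g. "0123456789".
    static Set<Character> fromString(String s) {
        Set<Character> set = new TreeSet<>();
        for (char c : s.toCharArray()) set.add(c);
        return set;
    }

    // Build a set from a range, e.g. range('A', 'Z') for 'A' .. 'Z'.
    static Set<Character> range(char from, char to) {
        Set<Character> set = new TreeSet<>();
        for (char c = from; c <= to; c++) set.add(c);
        return set;
    }

    public static void main(String[] args) {
        Set<Character> digit = fromString("0123456789");

        // hexDigit = digit + "ABCDEF".
        Set<Character> hexDigit = new TreeSet<>(digit);
        hexDigit.addAll(fromString("ABCDEF"));

        // noDigit = ANY - digit, demonstrated on a sample range only.
        Set<Character> noDigit = range('0', 'z');
        noDigit.removeAll(digit);

        System.out.println(hexDigit.contains('A')); // true
        System.out.println(noDigit.contains('5'));  // false
    }
}
```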

Tokens

This part begins with the TOKENS statement. This is the part where the tokens are being defined.

A token definition is defined as:

identifier = <TOKEN RULES> '.'

A Token rule may contain any of these basic elements:

  • Character Set Identifier
  • string
  • char

These basic elements are grouped together, separated with white space (most often space characters): BasicElement_1 BasicElement_2 … BasicElement_n. To make it easier to write tokens, these rules apply:

  • element_1 | element_2 - A vertical bar means element_1 OR element_2.
  • [element_1 element_2 … element_n] - One or several elements between square brackets means zero or one occurrence of these elements (they are optional).
  • (element_1 element_2 … element_n) - One or several elements between parentheses means they are grouped together.
  • {element_1 element_2 … element_n} - One or several elements between curly braces means zero or many occurrences of these elements.

Example

TOKENS
  ident  = letter {letter | digit | '_'}. 
  number = digit {digit} | "0x" hexDigit hexDigit hexDigit hexDigit. 
  float  = digit {digit} '.' {digit} ['E' ['+'|'-'] digit {digit}].

An ident token is a stream of characters where the first character is from the character set letter, followed by zero or more characters from the character set letter OR digit OR the character _. A number token is either a stream of one or several characters from the character set digit, OR a stream beginning with the characters 0 and x followed by four characters from the character set hexDigit. A float token is a stream of characters beginning with one or several characters from the character set digit, followed by a .-character, followed by zero or more characters from the digit character set, followed by an optional part. If the optional part is present it begins with the character E, followed by an optional +-character or --character, followed by one or several digit characters.
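The three token rules above can be restated as Java regular expressions. This is a sketch for illustration only: the generated Coco/R scanner does not use regexes, but these patterns accept the same strings, so they make the EBNF notation easy to experiment with:

```java
import java.util.regex.Pattern;

class TokenRegexDemo {
    // ident = letter {letter | digit | '_'}.
    static final Pattern IDENT  = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");
    // number = digit {digit} | "0x" hexDigit hexDigit hexDigit hexDigit.
    static final Pattern NUMBER = Pattern.compile("[0-9]+|0x[0-9A-F]{4}");
    // float = digit {digit} '.' {digit} ['E' ['+'|'-'] digit {digit}].
    static final Pattern FLOAT  = Pattern.compile("[0-9]+\\.[0-9]*(E[+-]?[0-9]+)?");

    public static void main(String[] args) {
        System.out.println(IDENT.matcher("foo_1").matches());   // true
        System.out.println(NUMBER.matcher("0xAB12").matches()); // true
        System.out.println(FLOAT.matcher("3.14E-2").matches()); // true
        System.out.println(FLOAT.matcher("3.").matches());      // true: {digit} allows zero digits
    }
}
```

Note how the optional exponent part maps to `(...)?` and the repeated `{digit}` maps to `*` or `+` depending on whether at least one occurrence is required.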

The Ident token

So, we begin by defining the token Coco/R Ident. Where can we find a definition of how to define Coco/R tokens? The Coco/R PDF manual is the obvious answer, but there is no single page where all token definitions reside. We have to browse back and forth to find them. On page 5, section 2.1 Vocabulary, ident is defined:

ident  = letter {letter | digit}.

To construct the token Ident it is neat to first define the character sets Letter and Digit:

CHARACTERS
  Letter = 'A' .. 'Z' + 'a' .. 'z'.
  Digit = '0' .. '9'.

The above definitions define a Letter to be all characters from the letter A to the letter Z plus all characters from the letter a to the letter z. They also define a Digit to be all characters from 0 to 9.

We use these two character sets in the TOKENS part to define the token Ident:

TOKENS
  Ident = Letter {Letter | Digit}.

This means that a character stream must begin with a letter followed by zero or more letters or digits. Example of legal character streams that will be translated to an Ident token: x, foobar, FooBar, element1, element2, qX45we77x.

A few unit tests verify the token rule for Ident. The first unit test verifies that it is possible to use a single letter as an Ident token.

@Test
public void testScan_token_Ident_x() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Ident_x");
    // Initialize
    String sContent = "x";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Ident;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The next unit test verifies that it is possible to use several letters as an Ident token.

@Test
public void testScan_token_Ident_foobar() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Ident_foobar");
    // Initialize
    String sContent = "foobar";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Ident;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The last unit test verifies that it is possible to use letters and digits in an Ident token:

@Test
public void testScan_token_Ident_x12() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Ident_x12");
    // Initialize
    String sContent = "x12";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Ident;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The Number token

On page 5 in the Coco/R PDF manual we also find the definition of number

number = digit {digit}.

This could be defined in a number of ways. One is to follow the definition above:

CHARACTERS
  Digit = '0' .. '9'.

TOKENS
  Number = Digit {Digit}.

This would allow the character stream 0003 to be converted into a Number token. Fine. A more restrictive definition could forbid leading zeros before the first non-zero digit. That is, the number 3 must be written as 3 and not as 0003.

CHARACTERS
  Digit = '0' .. '9'.
  Zero = '0'.
  NonZeroDigit = '1' .. '9'.

TOKENS
  Number = Zero | (NonZeroDigit {Digit}).

In one way it is a matter of taste, but it is important to understand the difference. As a minimum, unit tests should verify that it is possible to create a Number token from one or several digits:

@Test
public void testScan_token_Number_123() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Number_123");
    // Initialize
    String sContent = "123";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Number;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
 
Unit tests that verify other combinations of character streams should also be written. For example, the stream 1x should be verified to produce two tokens, one Number token and one Ident token.
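Why 1x splits into two tokens can be sketched in plain Java. The snippet below is an illustration only, not the generated scanner: each step takes the longest prefix that still matches a token rule, so the Number rule consumes 1 and the Ident rule consumes the remaining x:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoTokenDemo {
    // Regex stand-ins for the token rules Number and Ident.
    static final Pattern NUMBER = Pattern.compile("[0-9]+");
    static final Pattern IDENT  = Pattern.compile("[A-Za-z][A-Za-z0-9]*");

    // Returns the longest prefix of input (from offset) matched by p, or "".
    static String longestPrefix(Pattern p, String input, int offset) {
        Matcher m = p.matcher(input).region(offset, input.length());
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        String input = "1x";
        String first  = longestPrefix(NUMBER, input, 0);             // "1"
        String second = longestPrefix(IDENT, input, first.length()); // "x"
        System.out.println(first + " " + second);                    // prints "1 x"
    }
}
```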

Note! The Number token is not used in the Coco/R grammar rules! It is described here because it is described in the Coco/R PDF manual on page 5.

The String token

A string is any character stream on a single line between double quotes. According to the Coco/R definition on page 5 in the Coco/R PDF manual

string = '"' {anyButQuote} '"'.

This definition is a bit too simplified. It should be possible to include an escaped double quote character, \", as described at the bottom of page 5 and continued on page 6. So, this will be defined as:

CHARACTERS
  HexDigit = Digit + 'a' .. 'f'.
  CharInLine = ANY - '\r' - '\n'.
  AnyButDoubleQuote = CharInLine - '\"'.

TOKENS
  String = '"' {AnyButDoubleQuote | "\\\""} '"'.

Three good unit tests would be to test the empty string, a string (perhaps legal random string), and a string containing an escaped double quote character.

@Test
public void testScan_token_String_Empty() throws UnsupportedEncodingException {
    System.out.println("testScan_token_String_Empty");
    // Initialize
    String sContent = "\"\"";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._String;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
 
@Test
public void testScan_token_String_abc() throws UnsupportedEncodingException {
    System.out.println("testScan_token_String_abc");
    // Initialize
    String sContent = "\"abc\"";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._String;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
 
@Test
public void testScan_token_String_escapeDoubleQuote() throws UnsupportedEncodingException {
    System.out.println("testScan_token_String_escapeDoubleQuote");
    // Initialize
    String sContent = "\"\\\"\"";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._String;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The Char token

The last token on page 5 in the Coco/R PDF manual is the char token. It is defined as

char   = '\'' anyButApostrophe '\''.

However, all the escape sequences must also be honoured. Why did we not honour escape sequences in the String token definition? There it is done implicitly: the strings "\a" or "\ua8df" are recognized as String tokens as-is.
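To see why, the String token rule can be restated as a Java regular expression (a sketch for illustration only, not the generated scanner). Backslash is an ordinary CharInLine character, so escape sequences inside a string need no extra rules:

```java
import java.util.regex.Pattern;

class StringTokenDemo {
    // Regex stand-in for: String = '"' {AnyButDoubleQuote | "\\\""} '"'.
    // [^"\r\n] is AnyButDoubleQuote (it includes the backslash character),
    // \\" is the escaped double quote alternative.
    static final Pattern STRING = Pattern.compile("\"([^\"\\r\\n]|\\\\\")*\"");

    public static void main(String[] args) {
        System.out.println(STRING.matcher("\"\\a\"").matches());     // true
        System.out.println(STRING.matcher("\"\\ua8df\"").matches()); // true
    }
}
```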

CHARACTERS
  AnyButQuote = CharInLine - '\''.
  HexDigit = Digit + 'a' .. 'f'.

TOKENS
  Char = "'" (AnyButQuote
             | "\\\'" | "\\\"" | "\\\\" | "\\0" | "\\a" | "\\b" | "\\f" | "\\n" | "\\r" | "\\t" | "\\v"
             | "\\u" HexDigit HexDigit HexDigit HexDigit ) "'".

Unit tests should cover a single character and some escape sequences. The empty character should not be allowed.

@Test
public void testScan_token_Char_x() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Char_x");
    // Initialize
    String sContent = "\'x\'";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Char;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
 
@Test
public void testScan_token_Char_escapeZero() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Char_escapeZero");
    // Initialize
    String sContent = "\'\\0\'";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Char;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
 
@Test
public void testScan_token_Char_emptyCharacter_not_allowed() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Char_emptyCharacter_not_allowed");
    // Initialize
    String sContent = "\'\'";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser.maxT;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
}

Reserved Keywords

On page 6 in the Coco/R PDF manual the reserved keywords for the Coco/R grammar are listed.

TOKENS
  Any = "ANY".
  Context = "CONTEXT".
  Ignore = "IGNORE".
  Pragmas = "PRAGMAS".
  Tokens = "TOKENS".
  Character = "CHARACTERS".
  End = "END".
  Ignorecase = "IGNORECASE".
  Productions = "PRODUCTIONS".
  Weak = "WEAK".
  Comments = "COMMENTS".
  From = "FROM".
  Nested = "NESTED".
  Sync = "SYNC".
  Compiler = "COMPILER".
  If = "IF".
  Out = "out".
  To = "TO".

Unit tests covering them all should be written. Here is an example:

@Test
public void testScan_token_Any() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Any");
    // Initialize
    String sContent = "ANY";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Any;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( sContent, result.val );
}

Operators

There is no single page defining all operator tokens, but carefully examining the Syntax of Coco/R on page 34 in the Coco/R PDF manual reveals the operators:

TOKENS
  PointPoint = "..".
  Point = '.'.
  Equal = '='.
  Plus = '+'.
  Minus = '-'.
  LeftParenthesis = '('.
  RightParenthesis = ')'.
  LeftCurlyBrace     = '{'.
  RightCurlyBrace    = '}'.
  LeftSquareBracket  = '['.
  RightSquareBracket = ']'.
  VerticalBar        = '|'.

These are verified by unit tests in the same manner as the reserved keywords:

@Test
public void testScan_token_Equal() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Equal");
    // Initialize
    String sContent = "=";
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Equal;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( sContent, result.val );
}

The Attribute token

A glance at the Syntax of Coco/R on page 34 in the Coco/R PDF manual shows that Attributes is defined not as a token, but as a rule

Attributes  = '<' {ANY} '>' | "<." {ANY} ".>".

However, it is more convenient to use the power of the scanner and construct a token for this element. The drawback (in my opinion) of defining Attributes as a grammar rule, as the manual does, is that we get a stream of tokens between the '<' and '>' (or between the '<.' and '.>') tokens. To extract the information we must iterate through them and recalculate spacing (because white space is ignored and dropped). That complicates things a little when the parse tree is constructed. Not impossible, but (in my opinion) messier.

CHARACTERS
  AnyButGreaterThan = ANY - '>'.
  AnyButPoint = ANY - '.'.
  AnyButPointOrGreaterThan = ANY - '>' - '.'.

TOKENS
  Attributes = '<' ( (AnyButPoint {AnyButGreaterThan} '>') |
                   ( '.' {AnyButPointOrGreaterThan | ('.' AnyButGreaterThan) |
                   (AnyButPoint '>')} ".>") ).

At a minimum, legal Attributes tokens should be verified:

@Test
public void testScan_token_Attributs_startString_lessThan() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Attributs_startString_lessThan");
    // Initialize
    String sBeginAttributes = "<";
    String sEndAttributes = ">";
    String sAttributesContent = "int x, String s";
    String sContent = sBeginAttributes + sAttributesContent + sEndAttributes;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Attributes;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}
 
@Test
public void testScan_token_Attributs_startString_lessThanPoint() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Attributs_startString_lessThanPoint");
    // Initialize
    String sBeginAttributes = "<.";
    String sEndAttributes = ".>";
    String sAttributesContent = "int x, String s";
    String sContent = sBeginAttributes + sAttributesContent + sEndAttributes;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._Attributes;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

The SemAction token

Another look at the Syntax of Coco/R on page 34 in the Coco/R PDF manual shows that SemAction is also defined as a rule

SemAction   = "(." {ANY} ".)".

This could also be defined as a token:

CHARACTERS
  AnyButRightParenthesis = ANY - ')'.
  AnyButPointOrRightParenthesis = ANY - '.' - ')'.

TOKENS
  SemAction = "(." {AnyButPointOrRightParenthesis | ('.' AnyButRightParenthesis) |
                     (AnyButPoint ')')} ".)".

This is verified in the same way as the Attributes token:

@Test
public void testScan_token_Action() throws UnsupportedEncodingException {
    System.out.println("testScan_token_Action");
    // Initialize
    String sBeginAction = "(.";
    String sEndAction = ".)";
    String sActionContent = "System.out.println(\"x=\" + Integer.toString( x ) );";
    String sContent = sBeginAction + sActionContent + sEndAction;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expected = new Token();
    expected.kind = Parser._SemAction;
    expected.val = sContent;
    // Test
    Token result = instance.Scan();
    // Validate
    assertNotNull( result );
    assertEquals( expected.kind, result.kind );
    assertNotNull( result.val );
    assertEquals( expected.val, result.val );
}

Comments

On page 6 in the Coco/R PDF manual comments are described. Coco/R has a special construct to define comments. The benefit of a definition like this is that the Coco/R parser takes care of comments wherever they appear in the character stream, without the grammar being polluted with rules everywhere to handle them.

CHARACTERS
  lf  = '\n'.

COMMENTS FROM "/*" TO "*/"
COMMENTS FROM "//" TO lf

Comments may be verified by blending comments among other tokens:

@Test
public void testScan_line_col_charPos_Comments() throws UnsupportedEncodingException {
    System.out.println("testScan_line_col_charPos_Comments");
    // Initialize
    String sToken0 = "ANY";
    String sToken1 = "END";
    String sToken2 = "IF";
    String sComment = "/* x */";
    String sLineComment = "// y";
    String sSpace = " ";
    String sNewLine = "\r\n";
    // A N Y / *   x   * /  E  N  D CR LF  I  F  /  /     y CR LF Eof
    // 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    String sContent = sToken0 + sComment + sToken1 + sNewLine + sToken2 + sLineComment + sNewLine;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expectedToken0 = new Token();
    expectedToken0.kind = Parser._Any;
    expectedToken0.col = 1;
    expectedToken0.line = 1;
    expectedToken0.charPos = 0;
    Token expectedToken1 = new Token();
    expectedToken1.kind = Parser._End;
    expectedToken1.col = 11;
    expectedToken1.line = 1;
    expectedToken1.charPos = 10;
    Token expectedToken2 = new Token();
    expectedToken2.kind = Parser._If;
    expectedToken2.col = 1;
    expectedToken2.line = 2;
    expectedToken2.charPos = 15;
    Token expectedToken3 = new Token();
    expectedToken3.kind = Parser._EOF;
    expectedToken3.col = 1;
    expectedToken3.line = 3;
    expectedToken3.charPos = 23;
    // Test
    Token resultToken0 = instance.Scan();
    Token resultToken1 = instance.Scan();
    Token resultToken2 = instance.Scan();
    Token resultToken3 = instance.Scan();
    // Validate
    assertEquals( expectedToken0.col, resultToken0.col );
    assertEquals( expectedToken1.col, resultToken1.col );
    assertEquals( expectedToken2.col, resultToken2.col );
    assertEquals( expectedToken3.col, resultToken3.col );
    assertEquals( expectedToken0.line, resultToken0.line );
    assertEquals( expectedToken1.line, resultToken1.line );
    assertEquals( expectedToken2.line, resultToken2.line );
    assertEquals( expectedToken3.line, resultToken3.line );
    assertEquals( expectedToken0.charPos, resultToken0.charPos );
    assertEquals( expectedToken1.charPos, resultToken1.charPos );
    assertEquals( expectedToken2.charPos, resultToken2.charPos );
    assertEquals( expectedToken3.charPos, resultToken3.charPos );
}

White space

The last piece is to handle white space:

CHARACTERS
  cr  = '\r'.
  lf  = '\n'.
  ht  = '\t'.
  ff  = '\f'.

IGNORE cr + lf + ht + ff

This can also be verified with unit tests:

@Test
public void testScan_line_col_charPos_Whitespace() throws UnsupportedEncodingException {
    System.out.println("testScan_line_col_charPos_Whitespace");
    // Initialize
    String sToken0 = "4";
    String sToken1 = "<.x.>";
    String sToken2 = "(.y.)";
    String sToken3 = "TO";
    String sSpace = " ";
    String sTab = "\t";
    // 4     < . x . >   (  .  y  .  )        T  O Eof
    // 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
    String sContent = sToken0 + sSpace + sSpace + sToken1 + sTab + sToken2 + sSpace + sTab + sToken3;
    InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
    Scanner instance = new Scanner(is);
    Token expectedToken0 = new Token();
    expectedToken0.kind = Parser._Number;
    expectedToken0.val = sToken0;
    expectedToken0.col = 1;
    expectedToken0.line = 1;
    expectedToken0.charPos = 0;
    Token expectedToken1 = new Token();
    expectedToken1.kind = Parser._Attributes;
    expectedToken1.val = sToken1;
    expectedToken1.col = 4;
    expectedToken1.line = 1;
    expectedToken1.charPos = 3;
    Token expectedToken2 = new Token();
    expectedToken2.kind = Parser._SemAction;
    expectedToken2.val = sToken2;
    expectedToken2.col = 10;
    expectedToken2.line = 1;
    expectedToken2.charPos = 9;
    Token expectedToken3 = new Token();
    expectedToken3.kind = Parser._To;
    expectedToken3.col = 17;
    expectedToken3.line = 1;
    expectedToken3.charPos = 16;
    Token expectedToken4 = new Token();
    expectedToken4.kind = Parser._EOF;
    expectedToken4.col = 19;
    expectedToken4.line = 1;
    expectedToken4.charPos = 18;
    // Test
    Token resultToken0 = instance.Scan();
    Token resultToken1 = instance.Scan();
    Token resultToken2 = instance.Scan();
    Token resultToken3 = instance.Scan();
    Token resultToken4 = instance.Scan();
    // Validate
    assertEquals( expectedToken0.col, resultToken0.col );
    assertEquals( expectedToken1.col, resultToken1.col );
    assertEquals( expectedToken2.col, resultToken2.col );
    assertEquals( expectedToken3.col, resultToken3.col );
    assertEquals( expectedToken4.col, resultToken4.col );
    assertEquals( expectedToken0.line, resultToken0.line );
    assertEquals( expectedToken1.line, resultToken1.line );
    assertEquals( expectedToken2.line, resultToken2.line );
    assertEquals( expectedToken3.line, resultToken3.line );
    assertEquals( expectedToken4.line, resultToken4.line );
    assertEquals( expectedToken0.charPos, resultToken0.charPos );
    assertEquals( expectedToken1.charPos, resultToken1.charPos );
    assertEquals( expectedToken2.charPos, resultToken2.charPos );
    assertEquals( expectedToken3.charPos, resultToken3.charPos );
    assertEquals( expectedToken4.charPos, resultToken4.charPos );
}

The Grammar File So Far

This is the grammar without grammar rules. The PRODUCTIONS part is not at all complete, but the topmost grammar rule must be there for the Coco/R parser generator tool to compile the grammar.

package org.structuredparsing.cocorgrammar.cocor.parser;

COMPILER CocoR

CHARACTERS
  Letter = 'A' .. 'Z' + 'a' .. 'z'.
  Digit = '0' .. '9'.
  Zero = '0'.
  NonZeroDigit = '1' .. '9'.
  HexDigit = Digit + 'a' .. 'f'.
  CharInLine = ANY - '\r' - '\n'.
  AnyButQuote = CharInLine - '\''.
  AnyButDoubleQuote = CharInLine - '\"'.
  AnyButGreaterThan = ANY - '>'.
  AnyButPoint = ANY - '.'.
  AnyButPointOrGreaterThan = ANY - '>' - '.'.
  AnyButRightParenthesis = ANY - ')'.
  AnyButPointOrRightParenthesis = ANY - '.' - ')'.
  cr  = '\r'.
  lf  = '\n'.
  ht  = '\t'.
  ff  = '\f'.

TOKENS
  Ident = Letter {Letter | Digit}.
  Number = Zero | (NonZeroDigit {Digit}).
  String = '"' {AnyButDoubleQuote | "\\\""} '"'.
  Char = "'" (    AnyButQuote
                | "\\\'" | "\\\"" | "\\\\" | "\\0" | "\\a" | "\\b" | "\\f" | "\\n" | "\\r" | "\\t" | "\\v"
                | "\\u" HexDigit HexDigit HexDigit HexDigit
                ) 
        "'".

  Attributes = '<' ( (AnyButPoint {AnyButGreaterThan} '>') | ( '.' {AnyButPointOrGreaterThan | ('.' AnyButGreaterThan) | (AnyButPoint '>')} ".>") ).
  SemAction = "(." {AnyButPointOrRightParenthesis | ('.' AnyButRightParenthesis) | (AnyButPoint ')')} ".)".
  Any = "ANY".
  Context = "CONTEXT".
  Ignore = "IGNORE".
  Pragmas = "PRAGMAS".
  Tokens = "TOKENS".
  Character = "CHARACTERS".
  End = "END".
  Ignorecase = "IGNORECASE".
  Productions = "PRODUCTIONS".
  Weak = "WEAK".
  Comments = "COMMENTS".
  From = "FROM".
  Nested = "NESTED".
  Sync = "SYNC".
  Compiler = "COMPILER".
  If = "IF".
  Out = "out".
  To = "TO".
  PointPoint = "..".
  Point = '.'.
  Equal = '='.
  Plus = '+'.
  Minus = '-'.
  LeftParenthesis = '('.
  RightParenthesis = ')'.
  LeftCurlyBrace     = '{'.
  RightCurlyBrace    = '}'.
  LeftSquareBracket  = '['.
  RightSquareBracket = ']'.
  VerticalBar        = '|'.

COMMENTS FROM "/*" TO "*/"
COMMENTS FROM "//" TO lf

IGNORE cr + lf + ht + ff

PRODUCTIONS

CocoR = 
  Compiler Ident.

END CocoR.

Next part

Time to Go to Creating Grammar Rules part 2.


<- back

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License