Java 1.5 Parser - Scanner and Lexer - Part 2

Next Chapter Scanner and Lexer - part 3


Macro Definitions and Tokens for Java 1.5

A word on Unicode

The scanner is the first component that encounters Java code. Java source code uses the Unicode character set, and for Java 1.5/1.6 the scanner must understand Unicode version 4.0. The Java compiler, javac, reads source files in a byte encoding such as UTF-8 (selected with its -encoding option), so that is the aim for the scanner: to be able to read UTF-8.
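As a rough illustration of what "reading UTF-8" means before any token rule is applied, the sketch below decodes a byte stream into Java characters with an InputStreamReader. The class and method names are hypothetical and not part of the generated scanner; StandardCharsets and try-with-resources assume Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8DecodeSketch {
    /** Decodes the whole stream as UTF-8 and returns the resulting characters. */
    static String decode(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+03A3 (Greek capital sigma) takes two bytes in UTF-8 but is a single Java char.
        byte[] bytes = "\u03A3x".getBytes(StandardCharsets.UTF_8);
        String text = decode(new ByteArrayInputStream(bytes));
        System.out.println(bytes.length + " bytes -> " + text.length() + " chars");  // prints "3 bytes -> 2 chars"
    }
}
```

The scanner rules then work on characters, not bytes, which is why a multi-byte letter such as Σ can be matched by a single character class.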

Line Terminator

In Chapter 3.4 Line Terminators the following rules are given:

Java Syntax Rule
LineTerminator:
the ASCII LF character, also known as "newline"
the ASCII CR character, also known as "return"
the ASCII CR character followed by the ASCII LF character

InputCharacter:
UnicodeInputCharacter but not CR or LF

This is converted into the following JFlex macro definitions:

JFlex grammar Rule
%%
LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]

%%
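To see the effect of the LineTerminator macro outside JFlex, here is a small sketch using java.util.regex (the class name is hypothetical). One difference is worth noting: JFlex always prefers the longest match, so the order of the alternatives in the macro does not matter, while java.util.regex takes the first alternative that matches, so CR LF has to be listed first.

```java
import java.util.regex.Pattern;

class LineTerminatorSketch {
    // CR LF is listed first so that it is consumed as ONE terminator;
    // with "\r|\n|\r\n" the regex engine would stop after the CR.
    static final Pattern LINE_TERMINATOR = Pattern.compile("\r\n|\r|\n");

    /** Splits the source at every line terminator, keeping empty lines. */
    static String[] splitLines(String source) {
        return LINE_TERMINATOR.split(source, -1);
    }

    public static void main(String[] args) {
        // One CR, one LF and one CR LF: three terminators, four lines.
        System.out.println(splitLines("a\rb\nc\r\nd").length);  // prints 4
    }
}
```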

White Space

In Chapter 3.6 White Space the following rules are given:

Java Syntax Rule
WhiteSpace:
the ASCII SP character, also known as "space"
the ASCII HT character, also known as "horizontal tab"
the ASCII FF character, also known as "form feed"
LineTerminator

This is converted into the following macro definition and rule in JFlex:

JFlex grammar Rule
%%
WhiteSpace = {LineTerminator} | [ \t\f]

%%
<YYINITIAL> {
{WhiteSpace} { /* ignore */ }
}
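The WhiteSpace rule is narrower than Java's own Character.isWhitespace, which also accepts characters such as the vertical tab (U+000B). A minimal hand-written sketch of the same character set, with hypothetical names, could look like this:

```java
class WhiteSpaceSketch {
    /** True for the characters listed by the WhiteSpace rule: SP, HT, FF, CR and LF. */
    static boolean isJlsWhiteSpace(char c) {
        return c == ' ' || c == '\t' || c == '\f' || c == '\r' || c == '\n';
    }

    /** Returns the index of the first non-white-space character at or after 'from'. */
    static int skipWhiteSpace(String source, int from) {
        int i = from;
        while (i < source.length() && isJlsWhiteSpace(source.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // space, tab, CR, LF, space are skipped; 'x' sits at index 5
        System.out.println(skipWhiteSpace(" \t\r\n x", 0));  // prints 5
    }
}
```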

Comments

In Chapter 3.7 Comments the following rules are used to define Java comments.

Java Syntax Rule
Comment:
TraditionalComment
EndOfLineComment

TraditionalComment:
/ * CommentTail

EndOfLineComment:
/ / CharactersInLine(opt)

CommentTail:
* CommentTailStar
NotStar CommentTail

CommentTailStar:
/
* CommentTailStar
NotStarNotSlash CommentTail

NotStar:
InputCharacter but not *
LineTerminator

NotStarNotSlash:
InputCharacter but not * or /
LineTerminator

CharactersInLine:
InputCharacter
CharactersInLine InputCharacter

Java has two types of comments:

  • Single-line comment or end-of-line comment. It begins with // and runs to the end of the line.
  • Traditional comment or C-comment, which may span several lines. It begins with /* and ends with */.

These rules may be defined as macros in JFlex:

JFlex grammar Rule
%%
TraditionalComment = "/*" ~"*/"
EndOfLineComment = "//" {InputCharacter}* {LineTerminator}?
Comment = {TraditionalComment} | {EndOfLineComment}

%%
<YYINITIAL> {
{TraditionalComment} { /* ignore */ }
{EndOfLineComment} { /* ignore */ }
}
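For comparison with the compact JFlex macros, the CommentTail and CommentTailStar rules above can be read as a two-state machine: one state for "inside the comment" and one for "just saw a star". The sketch below is a hand-written illustration of that reading, not code produced by JFlex; the class and method names are hypothetical.

```java
class CommentSketch {
    /**
     * If a traditional comment starts at 'start', returns the index just past its
     * closing star-slash; returns -1 if no comment starts there or it never ends.
     */
    static int endOfTraditionalComment(String s, int start) {
        if (!s.startsWith("/*", start)) {
            return -1;
        }
        boolean afterStar = false;       // false: CommentTail, true: CommentTailStar
        for (int i = start + 2; i < s.length(); i++) {
            char c = s.charAt(i);
            if (afterStar && c == '/') {
                return i + 1;            // CommentTailStar followed by "/" ends the comment
            }
            afterStar = (c == '*');      // a star enters (or stays in) CommentTailStar
        }
        return -1;                       // end of input inside the comment
    }

    /** Returns the index of the line terminator (or end of input) ending a "//" comment, or -1. */
    static int endOfLineComment(String s, int start) {
        if (!s.startsWith("//", start)) {
            return -1;
        }
        int i = start + 2;
        while (i < s.length() && s.charAt(i) != '\r' && s.charAt(i) != '\n') {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // The comment starts at index 1 and the 'y' after it sits at index 16.
        System.out.println(endOfTraditionalComment("x/* xx\n xx\n xx*/y", 1));  // prints 16
    }
}
```

Note that the star of the opening delimiter does not count towards the closing one, so the three characters slash, star, slash do not form a complete comment.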

Identifier

Chapter 3.8 Identifier defines identifiers

Java Syntax Rule
Identifier:
IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral

IdentifierChars:
JavaLetter
IdentifierChars JavaLetterOrDigit

JavaLetter:
any Unicode character that is a Java letter (see below)

JavaLetterOrDigit:
any Unicode character that is a Java letter-or-digit (see below)

What is a legal Java identifier? Java identifiers may use more letters than those of most other programming languages. Reading the article linked above, it seems very complicated to implement identifiers. But because JFlex is used, this token is very simple to implement: the predefined JFlex character classes [:jletter:] and [:jletterdigit:] solve all problems with Unicode characters.

JFlex grammar Rule
%%

Identifier = [:jletter:] [:jletterdigit:]*

%%
<YYINITIAL> {
{Identifier} { return new Token(Parser._Identifier, yycolumn + 1, yyline + 1, yychar, yytext()); }
}

The token Identifier must also be defined in the Coco/R grammar file:

Coco/R EBNF Rule
TOKENS
Identifier
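The JFlex manual describes jletter and jletterdigit in terms of Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. Under that assumption, the IdentifierChars rule can be sketched in plain Java as follows. The class name is hypothetical, keywords and literals are not excluded here, and the char-based loop ignores supplementary characters.

```java
class IdentifierSketch {
    /** True if 's' has the shape JavaLetter followed by any number of JavaLetterOrDigit. */
    static boolean isIdentifierChars(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifierChars("x12"));       // prints true
        System.out.println(isIdentifierChars("12x"));       // prints false: a digit cannot start an identifier
        System.out.println(isIdentifierChars("\u03A3_$"));  // prints true: sigma, underscore, dollar
    }
}
```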

Unittest for whitespace, new line, comments and identifier

Now it is possible to create sensible unit tests for white space, new lines, comments and identifiers.

An empty java source file should produce just the End of file token:

    private static final String sEmptyString = "";
 
    @Test
    public void testScan_token_Eof() throws UnsupportedEncodingException {
        System.out.println("testScan_token_Eof");
        // Initialize
        String sContent = "";
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expected = new Token( Parser._EOF, 0, 0, 0 );
        // Test
        Token result = instance.Scan();
        // Validate
        assertNotNull( result );
        assertEquals( expected.kind, result.kind );
        assertNotNull( result.val );
        assertEquals( sEmptyString, result.val );
    }

Check that identifiers are recognized when they consist of legal characters, and that a character that may not start an identifier produces an illegal token.

    @Test
    public void testScan_token_Identifier_x() throws UnsupportedEncodingException {
        System.out.println("testScan_token_Identifier_x");
        // Initialize
        String sContent = "x";
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expected = new Token( Parser._Identifier, 0, 0, 0, sContent);
        // Test
        Token result = instance.Scan();
        // Validate
        assertNotNull( result );
        assertEquals( expected.kind, result.kind );
        assertNotNull( result.val );
        assertEquals( expected.val, result.val );
    }
 
    @Test
    public void testScan_token_Eof_after_last_token() throws UnsupportedEncodingException {
        System.out.println("testScan_token_Eof_after_last_token");
        // Initialize
        String sContent = "x";
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expected = new Token( Parser._EOF, 0, 0, 0 );
        // Test
        Token result = instance.Scan();
        result = instance.Scan();
        // Validate
        assertNotNull( result );
        assertEquals( expected.kind, result.kind );
        assertNotNull( result.val );
        assertEquals( expected.val, result.val );
    }
 
    @Test
    public void testScan_token_Identifier_foobar() throws UnsupportedEncodingException {
        System.out.println("testScan_token_Identifier_foobar");
        // Initialize
        String sContent = "foobar";
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expected = new Token( Parser._Identifier, 0, 0, 0, sContent);
        // Test
        Token result = instance.Scan();
        // Validate
        assertNotNull( result );
        assertEquals( expected.kind, result.kind );
        assertNotNull( result.val );
        assertEquals( expected.val, result.val );
    }
 
    @Test
    public void testScan_token_Identifier_x12() throws UnsupportedEncodingException {
        System.out.println("testScan_token_Identifier_x12");
        // Initialize
        String sContent = "x12";
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expected = new Token( Parser._Identifier, 0, 0, 0, sContent);
        // Test
        Token result = instance.Scan();
        // Validate
        assertNotNull( result );
        assertEquals( expected.kind, result.kind );
        assertNotNull( result.val );
        assertEquals( expected.val, result.val );
    }
 
    @Test
    public void testScan_token_Identifier_unicode() throws UnsupportedEncodingException {
        System.out.println("testScan_token_Identifier_unicode");
        // Initialize
        String sContent = "Σ۝";
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expected = new Token( Parser._Identifier, 0, 0, 0, sContent);
        // Test
        Token result = instance.Scan();
        // Validate
        assertNotNull( result );
        assertEquals( expected.kind, result.kind );
        assertNotNull( result.val );
        assertEquals( expected.val, result.val );
    }
 
    @Test
    public void testScan_token_Illegaltoken_illegal_unicode() throws UnsupportedEncodingException {
        System.out.println("testScan_token_Illegaltoken_illegal_unicode");
        // Initialize
        String sContent = "۝Σ";
        String sExpectedContent = "۝";    // The character after this one, the Σ, will be parsed as an Identifier.
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expected = new Token( Parser._Illegaltoken, 0, 0, 0, sExpectedContent);
        // Test
        Token result = instance.Scan();
        // Validate
        assertNotNull( result );
        assertEquals( expected.kind, result.kind );
        assertNotNull( result.val );
        assertEquals( expected.val, result.val );
    }

Check new line

    @Test
    public void testScan_line_col_charPos_two_lines_UnixNewLine() throws UnsupportedEncodingException {
        System.out.println("testScan_line_col_charPos_two_lines_UnixNewLine");
        // Initialize
        String sToken0 = "x";
        String sToken1 = "Y";
        String sToken2 = "zZ_";
        String sNewLine = "\n";
        String sSpace = " ";
        // x   y   LF z Z _ Eof
        // 0 1 2 3 4  5 6 7 8
        String sContent = sToken0+ sSpace + sToken1 + sSpace + sNewLine + sToken2;
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expectedToken0 = new Token( Parser._Identifier, 1, 1, 0, sToken0 );
        Token expectedToken1 = new Token( Parser._Identifier, 3, 1, 2, sToken1 );
        Token expectedToken2 = new Token( Parser._Identifier, 1, 2, 5, sToken2 );
        Token expectedToken3 = new Token( Parser._EOF, 4, 2, 8 );
        // Test
        Token resultToken0 = instance.Scan();
        Token resultToken1 = instance.Scan();
        Token resultToken2 = instance.Scan();
        Token resultToken3 = instance.Scan();
        // Validate
        assertEquals( expectedToken0.col, resultToken0.col );
        assertEquals( expectedToken1.col, resultToken1.col );
        assertEquals( expectedToken2.col, resultToken2.col );
        assertEquals( expectedToken3.col, resultToken3.col );
        assertEquals( expectedToken0.line, resultToken0.line );
        assertEquals( expectedToken1.line, resultToken1.line );
        assertEquals( expectedToken2.line, resultToken2.line );
        assertEquals( expectedToken3.line, resultToken3.line );
        assertEquals( expectedToken0.charPos, resultToken0.charPos );
        assertEquals( expectedToken1.charPos, resultToken1.charPos );
        assertEquals( expectedToken2.charPos, resultToken2.charPos );
        assertEquals( expectedToken3.charPos, resultToken3.charPos );
    }
 
    @Test
    public void testScan_line_col_charPos_two_lines_Mac_OS_9_NewLine_space_beginning_of_line() throws UnsupportedEncodingException {
        System.out.println("testScan_line_col_charPos_two_lines_Mac_OS_9_NewLine_space_beginning_of_line");
        // Initialize
        String sToken0 = "x";
        String sToken1 = "Y";
        String sToken2 = "zZ_";
        String sNewLine = "\r";
        String sSpace = " ";
        //     x   y  CR   z Z _    Eof
        // 0 1 2 3 4  5  6 7 8 9 10 11
        String sContent = sSpace + sSpace + sToken0+ sSpace + sToken1 + sNewLine + sSpace + sToken2 + sSpace;
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expectedToken0 = new Token( Parser._Identifier, 3, 1, 2, sToken0 );
        Token expectedToken1 = new Token( Parser._Identifier, 5, 1, 4, sToken1 );
        Token expectedToken2 = new Token( Parser._Identifier, 2, 2, 7, sToken2 );
        Token expectedToken3 = new Token( Parser._EOF, 6, 2, 11 );
        // Test
        Token resultToken0 = instance.Scan();
        Token resultToken1 = instance.Scan();
        Token resultToken2 = instance.Scan();
        Token resultToken3 = instance.Scan();
        // Validate
        assertEquals( expectedToken0.col, resultToken0.col );
        assertEquals( expectedToken1.col, resultToken1.col );
        assertEquals( expectedToken2.col, resultToken2.col );
        assertEquals( expectedToken3.col, resultToken3.col );
        assertEquals( expectedToken0.line, resultToken0.line );
        assertEquals( expectedToken1.line, resultToken1.line );
        assertEquals( expectedToken2.line, resultToken2.line );
        assertEquals( expectedToken3.line, resultToken3.line );
        assertEquals( expectedToken0.charPos, resultToken0.charPos );
        assertEquals( expectedToken1.charPos, resultToken1.charPos );
        assertEquals( expectedToken2.charPos, resultToken2.charPos );
        assertEquals( expectedToken3.charPos, resultToken3.charPos );
    }
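The expected values in the two tests above follow one convention: col and line are 1-based and charPos is a 0-based character offset, matching yycolumn + 1, yyline + 1 and yychar in the scanner rules. The sketch below recomputes those numbers by hand for identifiers only. It is an illustration under that assumption, not the generated scanner; Pos is a hypothetical stand-in for Token.

```java
import java.util.ArrayList;
import java.util.List;

class PositionSketch {
    /** Hypothetical stand-in for Token: col and line are 1-based, charPos is 0-based. */
    static final class Pos {
        final int col;
        final int line;
        final int charPos;
        final String val;

        Pos(int col, int line, int charPos, String val) {
            this.col = col;
            this.line = line;
            this.charPos = charPos;
            this.val = val;
        }

        @Override
        public String toString() {
            return "(" + col + "," + line + "," + charPos + ",\"" + val + "\")";
        }
    }

    /** Returns the positions of all identifiers in 'src', followed by an end-of-file entry. */
    static List<Pos> scan(String src) {
        List<Pos> out = new ArrayList<>();
        int i = 0;
        int line = 1;
        int col = 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\r' || c == '\n') {
                // CR LF counts as a single line terminator but as two characters.
                if (c == '\r' && i + 1 < src.length() && src.charAt(i + 1) == '\n') {
                    i++;
                }
                i++;
                line++;
                col = 1;
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                int startCol = col;
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) {
                    i++;
                    col++;
                }
                out.add(new Pos(startCol, line, start, src.substring(start, i)));
            } else {
                i++;      // white space (and anything else) is simply skipped here
                col++;
            }
        }
        out.add(new Pos(col, line, i, ""));   // end-of-file entry
        return out;
    }

    public static void main(String[] args) {
        // Same input as the Unix new line test.
        System.out.println(scan("x Y \nzZ_"));
        // prints [(1,1,0,"x"), (3,1,2,"Y"), (1,2,5,"zZ_"), (4,2,8,"")]
    }
}
```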

Verify End of line comment and C-comment

    @Test
    public void testScan_line_col_charPos_end_of_line_comment() throws UnsupportedEncodingException {
        System.out.println("testScan_line_col_charPos_end_of_line_comment");
        // Initialize
        String sToken0 = "x";
        String sNewLine = "\n";
        String sComment = "// x y";
        // x / /   x   y LF / /     x     y LF Eof
        // 0 1 2 3 4 5 6 7  8 9 10 11 12 13 14 15
        String sContent = sToken0 + sComment + sNewLine + sComment + sNewLine;
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expectedToken0 = new Token( Parser._Identifier, 1, 1, 0, sToken0 );
        Token expectedToken1 = new Token( Parser._EOF, 1, 3, 15 );
        // Test
        Token resultToken0 = instance.Scan();
        Token resultToken1 = instance.Scan();
        // Validate
        assertEquals( expectedToken0.col, resultToken0.col );
        assertEquals( expectedToken1.col, resultToken1.col );
        assertEquals( expectedToken0.line, resultToken0.line );
        assertEquals( expectedToken1.line, resultToken1.line );
        assertEquals( expectedToken0.charPos, resultToken0.charPos );
        assertEquals( expectedToken1.charPos, resultToken1.charPos );
    }
 
    @Test
    public void testScan_line_col_charPos_c_comment() throws UnsupportedEncodingException {
        System.out.println("testScan_line_col_charPos_c_comment");
        // Initialize
        String sToken0 = "x";
        String sToken1 = "y";
        String sNewLine = "\n";
        String sCommentStart = "/*";
        String sCommentEnd = "*/";
        String sCommentContent = " xx";
        // x / *   x x LF    x x LF     x  x  *  /  y Eof
        // 0 1 2 3 4 5 6  7  8 9 10 11 12 13 14 15 16 17
        String sContent = sToken0 + sCommentStart + sCommentContent + sNewLine + sCommentContent + sNewLine + sCommentContent + sCommentEnd + sToken1;
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expectedToken0 = new Token( Parser._Identifier, 1, 1, 0, sToken0 );
        Token expectedToken1 = new Token( Parser._Identifier, 6, 3, 16, sToken1 );
        Token expectedToken2 = new Token( Parser._EOF, 7, 3, 17 );
        // Test
        Token resultToken0 = instance.Scan();
        Token resultToken1 = instance.Scan();
        Token resultToken2 = instance.Scan();
        // Validate
        assertEquals( expectedToken0.col, resultToken0.col );
        assertEquals( expectedToken1.col, resultToken1.col );
        assertEquals( expectedToken2.col, resultToken2.col );
        assertEquals( expectedToken0.line, resultToken0.line );
        assertEquals( expectedToken1.line, resultToken1.line );
        assertEquals( expectedToken2.line, resultToken2.line );
        assertEquals( expectedToken0.charPos, resultToken0.charPos );
        assertEquals( expectedToken1.charPos, resultToken1.charPos );
        assertEquals( expectedToken2.charPos, resultToken2.charPos );
    }

Keywords

Chapter 3.9 Keywords defines Java keywords

Java Syntax Rule
Keyword: one of
abstract continue for new switch
assert default if package synchronized
boolean do goto private this
break double implements protected throw
byte else import public throws
case enum instanceof return transient
catch extends int short try
char final interface static void
class finally long strictfp volatile
const float native super while

Each keyword should be translated into a JFlex rule of the following form:

abstract   { return new Token(Parser._Abstract, yycolumn + 1, yyline + 1, yychar, yytext()); }

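In the JFlex grammar rules below just three keywords are presented, but the idea is the same for all. Because JFlex picks the earliest rule when two rules match text of the same length, the keyword rules must be placed before the {Identifier} rule; otherwise every keyword would be returned as an identifier.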

JFlex grammar Rule
%%

%%
<YYINITIAL> {
abstract { return new Token(Parser._Abstract, yycolumn + 1, yyline + 1, yychar, yytext()); }
assert { return new Token(Parser._Assert, yycolumn + 1, yyline + 1, yychar, yytext()); }
boolean { return new Token(Parser._Boolean, yycolumn + 1, yyline + 1, yychar, yytext()); }

}
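Since all keyword rules share one shape, the remaining lines can be generated rather than typed. The helper below is a hypothetical sketch that prints one rule per keyword; it assumes that every token constant follows the Parser._Xxx naming seen in the three rules above, which the original only shows for abstract, assert and boolean.

```java
class KeywordRuleSketch {
    /** Builds the JFlex rule line for one keyword, e.g. "abstract" maps to Parser._Abstract. */
    static String ruleFor(String keyword) {
        // Assumed naming convention: keyword with its first letter in upper case.
        String tokenName = Character.toUpperCase(keyword.charAt(0)) + keyword.substring(1);
        return keyword + " { return new Token(Parser._" + tokenName
                + ", yycolumn + 1, yyline + 1, yychar, yytext()); }";
    }

    public static void main(String[] args) {
        // Only three keywords here; extend the array with the full list from Chapter 3.9.
        String[] keywords = { "abstract", "assert", "boolean" };
        for (String keyword : keywords) {
            System.out.println(ruleFor(keyword));
        }
    }
}
```

The printed lines can be pasted into the <YYINITIAL> block, and the same token names then go into the TOKENS section of the Coco/R grammar.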

The Java keywords must also be defined in the Coco/R grammar file. Note: only three keywords are presented.

Coco/R EBNF Rule
TOKENS
Abstract
Assert
Boolean

Unittests

All keywords should be verified. A unit test for one keyword is presented here; the tests for the others follow the same pattern.

    @Test
    public void testScan_token_abstract() throws UnsupportedEncodingException {
        System.out.println("testScan_token_abstract");
        // Initialize
        String sContent = "abstract";
        InputStream is = new ByteArrayInputStream(sContent.getBytes("UTF-8"));
        Scanner instance = new Scanner(is);
        Token expected = new Token( Parser._Abstract, 0, 0, 0 );
        // Test
        Token result = instance.Scan();
        // Validate
        assertNotNull( result );
        assertEquals( expected.kind, result.kind );
    }

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License