Coco/R Parser Creating Syntax Tree - part 3

<- back

Creating Syntax Tree

In this part we will discus how to build a Syntax Tree with the grammar we have. Before we just have to deepen our knowledge in Coco/R grammar.

Coco/R Grammar Syntax in Short

Semantic Actions

It is nice that it is possible to define a whole grammar with these grammar rules. But, with just that nothing interesting can happen. Semantic Actions are pieces of code in the target language (say Java, C#, C++, …) that will be executed when an expression is consumed by the parser. Thats how an Syntax Tree may be built. However, any action may be executed, say updating a database, call a Web Service and more. Most often a Syntax Tree is preferable.

Lets consider the more advanced syntax of grammar rules (which still is a simplification):

identifier [<ATTRIBUTE>] = <GRAMMAR_RULE_EXPRESSION> [<SEM_ACTION>] '.'

At the end the GRAMMAR_RULE_EXPRESSION an Semantic Action may be added. Actually, an Semantic Action may be added to the end of any GRAMMAR_RULE_EXPRESSION, not just the last part of the expression. Each Semantic Action consist of a pair of '(.' '.) with code in between.

Examples

In the example with the ClassDeclaration it would we swell to store the class name of the about-to-be-declared-class. Assume a string sClassName is defined before in the parser we may store the name via the current token t by calling its member field val (value of the token).

ClassDeclaration =
    Class Identifier   (. sClassName = t.val; .)
    ClassBody.

One improvement to the code above would be to also create an object when a class is declared to store information about it

ClassDeclaration =
    Class  (. currentClass = new ClassDeclaration(); .)
    Identifier   (. currentClass.setName( t.val ); .)
    ClassBody.

Attributes

One very neat feature of Coco/R is Attributes. Before we dig deeper into this concept it is important to understand that each grammar rule is translated into a function in the parser. If Attributes are not used all these functions will have no parameters or return value.

Attributes are defined as a pair of '<' '>' or '<.' '.>' with code in the target language in them. That is, there are two way defining Attributes. The latter, with the point in it, enables the use of templates as arguments and return value. The return value must be the first parameter and decorated with an 'out' statement before the parameter.

Example

We add some grammar to the example above, a ClassBody rule that uses Attributes. A ClassBody will take a ClassDeclaration object as parameter and return a ClassBody object. These parameters and return objects may be used inside the definitions of ClassBody. Also note, that the definition of ClassDeclaration must be updated reflecting that ClassBody now uses a parameter and a return value, that may be used inside the ClassDeclaration definition.

ClassDeclaration =
    Class  (. currentClass = new ClassDeclaration(); .)
    Identifier   (. currentClass.setName( t.val ); .)
    ClassBody< out classbody, currentClass >   (. currentClass.setClassBody( classbody ); .).

ClassBody< out ClassBody classbody, ClassDeclaration classdecl > =
    LeftCurlyBrace  (. classbody = new ClassBody(); .)
    {Modifier} Ident      (. classbody.setCurrentFunctionName( t.val ); classdecl.incMembers();  .)
    ...

Syntax Tree classes

In order to be able to build a syntax tree we need to declare the classes we need. This is an UML diagram of the Syntax Tree. Note, though, this syntax tree will not cover the whole syntax, just the parser rules.

Coco-r_SyntaxTree_Classes.png

The Interface Documet

This is the interface that will be exposed from the parser to get hold of the Syntax Tree.

package org.structuredparsing.cocorgrammar.ast;
 
public interface Document {
    public Rule getTopRule();
    public RuleMap getRuleMap();
}

DocumentImpl

This is the implementation of the interface Document:

package org.structuredparsing.cocorgrammar.ast;
 
public class DocumentImpl implements Document {
    private Rule topRule;
    private RuleMap ruleMap;
 
    public void setTopRule( Rule rule ) {
        topRule = rule;
    }
 
    @Override
    public Rule getTopRule() {
        return topRule;
    }
 
    public void setRuleMap( RuleMap ruleMap ) {
        this.ruleMap = ruleMap;
    }
 
    @Override
    public RuleMap getRuleMap() {
        return this.ruleMap;
    }
}

Rule

This class holds a Coco/R grammar rule.

package org.structuredparsing.cocorgrammar.ast;
 
import java.util.ArrayList;
import java.util.List;
 
public class Rule {
 
    private AlternativeExpression alternative;
    private String name;
    private List< Rule > parents;
 
    public Rule( String name ) {
        this.alternative = null;
        this.name = name;
        this.parents = new ArrayList< Rule >();
    }
 
    public void set( AlternativeExpression alt ) {
        this.alternative = alt;
    }
 
    public AlternativeExpression get() {
        return this.alternative;
    }
 
    public String getName() {
        return this.name;
    }
 
    public void addParent( Rule parent ) {
        if ( ! (parents.contains(this)) ) {
            parents.add(parent);
        }
    }
 
    public int parentsSize() {
        return parents.size();
    }
 
    public Rule getParnt( int index ) {
        return parents.get(index);
    }
}

AlternativeExpression

This class stores expressions separated with '|':

package org.structuredparsing.cocorgrammar.ast;
 
import java.util.ArrayList;
import java.util.List;
 
public class AlternativeExpression implements Expression {
    private List< ListExpression > altList;
    private GRAMMAR_TYPE type;
 
    public AlternativeExpression() {
        altList = new ArrayList< ListExpression >();
    }
 
    public void setIteration() {
        type = GRAMMAR_TYPE.E_ITERATION;
    }
 
    public void setOptional() {
        type = GRAMMAR_TYPE.E_OPTIONAL;
    }
 
    public void setGroup() {
        type = GRAMMAR_TYPE.E_GROUP;
    }
 
    @Override
    public GRAMMAR_TYPE getType() {
        return type;
    }
 
    public void add( ListExpression list ) {
        altList.add(list);
    }
 
    public int size() {
        return altList.size();
    }
 
    public ListExpression get( int index ) {
        return altList.get(index);
    }
 
}

ListExpression

This class stores expressions as a list:

package org.structuredparsing.cocorgrammar.ast;
 
import java.util.ArrayList;
import java.util.List;
 
public class ListExpression {
    private List< Expression > list;
 
    public ListExpression() {
        list = new ArrayList< Expression >();
    }
 
    public void add( Expression factor ) {
        list.add(factor);
    }
 
    public int size() {
        return list.size();
    }
 
    public Expression get( int index ) {
        return list.get(index);
    }
}

PrimaryExpression

A PrimaryExpression stores a token name or a grammar rule identifier:

package org.structuredparsing.cocorgrammar.ast;
 
public class PrimaryExpression implements Expression {
 
    private String value;
    private Rule grammarRule;
 
    public PrimaryExpression() {
 
    }
 
    @Override
    public GRAMMAR_TYPE getType() {
        return GRAMMAR_TYPE.E_PRIMARY;
    }
 
    public void setValue(String val) {
        this.value = val;
    }
 
    public String getValue() {
        return this.value;
    }
 
    public void setGrammarRule( Rule rule ) {
        grammarRule = rule;
    }
 
    public Rule getGrammarRule() {
        return grammarRule;
    }
 
}

Expression

This is the interface of AlternativeExpression and PrimaryExpression:

package org.structuredparsing.cocorgrammar.ast;
 
public interface Expression {
    public GRAMMAR_TYPE getType();
}

GRAMMAR_TYPE

package org.structuredparsing.cocorgrammar.ast;
 
public enum GRAMMAR_TYPE {
    E_OPTIONAL, E_ITERATION, E_GROUP, E_PRIMARY
}

RuleMap

This class contains all Rule objects. It also connects Rule objects together: Rule objects know their parents:

package org.structuredparsing.cocorgrammar.ast;
 
import java.util.Hashtable;
import java.util.Map;
import java.util.Set;
 
public class RuleMap {
 
    private Map< String, Rule > map;
 
    public RuleMap() {
        map = new Hashtable< String, Rule >();
    }
 
    public void add( Rule rule ) {
        map.put( rule.getName(), rule);
    }
 
    public int size() {
        return map.size();
    }
 
    public Rule getRule( String key ) {
        return map.get(key);
    }
 
    public Set< String > getListOfRuleNames() {
        return map.keySet();
    }
 
    public void connectRules() {
        Set< String > set = getListOfRuleNames();
        for ( String rulename : set ) {
            Rule parentrule = getRule( rulename );
            AlternativeExpression alt = parentrule.get();
            connectRulesInGrammarAlternative( alt, parentrule, set );
        }
    }
 
    private void connectRulesInGrammarAlternative( AlternativeExpression alt, Rule parentrule, Set< String > set ) {
        for ( int i=0; i<alt.size(); ++i ) {
            ListExpression list = alt.get(i);
            connectRuleInGrammarList( list, parentrule, set );
        }
    }
 
    private void connectRuleInGrammarList( ListExpression list, Rule parentrule, Set< String > set ) {
        for ( int i=0; i<list.size(); ++i ) {
            Expression expression = list.get(i);
            if ( expression instanceof AlternativeExpression ) {
                connectRulesInGrammarAlternative( (AlternativeExpression) expression, parentrule, set );
            }
            else if ( expression instanceof PrimaryExpression ) {
                PrimaryExpression primary = (PrimaryExpression) expression;
                if ( set.contains(primary.getValue()) ) {
                    Rule childrule = this.map.get( primary.getValue() );
                    primary.setGrammarRule(childrule);
                    childrule.addParent(parentrule);
                }
            }
        }
    }
}

Import statements

At this point it is a good idea to write import statements into the Coco/R grammar file for the classes just defined. These are written just above the COMPILER statement.

package org.structuredparsing.cocorgrammar.cocor.parser_jflex_scanner;

import org.structuredparsing.cocorgrammar.ast.Expression;
import org.structuredparsing.cocorgrammar.ast.PrimaryExpression;
import org.structuredparsing.cocorgrammar.ast.AlternativeExpression;
import org.structuredparsing.cocorgrammar.ast.ListExpression;
import org.structuredparsing.cocorgrammar.ast.Rule;
import org.structuredparsing.cocorgrammar.ast.RuleMap;
import org.structuredparsing.cocorgrammar.ast.DocumentImpl;
import org.structuredparsing.cocorgrammar.ast.Document;

COMPILER CocoR

User code

A few lines of user code will simplify the task of building a syntax tree. User code is written just below the COMPILER statement.

COMPILER CocoR
 
    private RuleMap rulemap;
    private DocumentImpl doc;
    private String firstRuleName;
 
    public Document getResult() {
        return doc;
    }

RuleMap stores all Rule objects created. When the parsing is finished, it will connect all rules together. It is good idea to declare this object as member field in the Parser class, so it is accessible in all parts of the parser code.

DocumentImpl is the implementation of the Document interface and stores a RuleMap object and a pointer to the starting Rule object.

The String object firstRuleName stores the identifier name after the COMPILER keyword. This is the top Rule object name.

The member function getResult returns a Document object. Document is an interface, but the object returned will be the DocumentImpl object.

The First Rule Name

When the parser have parsed the Ident token in the rule

CocoR = 
  ...
  Compiler Ident
  ...

… the Ident token content should be stored into the variable firstRuleName. This is done by

  Compiler Ident        (. firstRuleName = t.val; .)

The object t is the current token, just parsed. The above java code firstRuleName = t.val; will be copied into the part where the Ident token is parsed

void CocoR() {
   while (StartOf(1)) {
      Get();
   }
   Expect(21);
   Expect(1);
   firstRuleName = t.val;
   ...

Expect(21) aims on the token with the kind value of 21. That is the Compiler token. Expect(1) aims on the token with the kind value of 1. That is the Ident token. These values are defined as public static final int in the Parser class:

public class Parser {
   public static final int _EOF = 0;
   public static final int _Ident = 1;
   public static final int _Number = 2;
   ...
   public static final int _Compiler = 21;
   public static final int _If = 22;
   ...
}

ParserSpecification

This parser only build a syntax tree for the parser rules. These reside in the ParserSpecification.

  ParserSpecification = "PRODUCTIONS" {Production}.

When this part begins it is a good idea to initialize our field members.

ParserSpecification   (. rulemap = new RuleMap(); doc = new DocumentImpl(); .)
  =
  Productions {Production}.

A Production is a parsed rule. This would be neat to grab and store into the rulemap object. One idea is to hook an Attribute on the Production that returns a Rule object.

ParserSpecification   (. Rule rule = null; rulemap = new RuleMap(); doc = new DocumentImpl(); .)
  =
  Productions
  {Production< out rule >   (. rulemap.add( rule ); .)
  }.

The local variable rule is also defined, Attributes added to Production.

Production

This is how Production first was defined

Production =
  Ident [Attributes] [SemAction] Equal Expression Point.

Now a return value has been added, Production must return a Rule object:

Production< out Rule rule > =
  Ident [Attributes] [SemAction] Equal Expression Point.

How to create a Rule object? Once the Attributes are added, all these including return variables are created by Coco/R parser. That is, there is a Rule object with the name rule defined already now. It is null, so we just need to explicit construct it and store information into it.

Production< out Rule rule > =
  Ident                                          (. rule = new Rule( t.val ); .)
  [Attributesrule] [SemActionrule] Equal Expression Point.

It would be neat if Expression would return an AlternativeExpression object. That is what our Rule object wants. Assume it does

Production< out Rule rule > =         (. AlternativeExpression alternative = null; .)
  Ident                                          (. rule = new Rule( t.val ); .)
  [Attributesrule] [SemActionrule] Equal
  Expression< out alternative >        (. rule.set( alternative );  .)
  Point.

An AlternativeExpression variable must be created locally to hold the returned object from Expression.

Expression

This is how Expression is defined without any Actions

Expression =
  Term {VerticalBar Term}.

Expression must now return an AlternativeExpression

Expression< out AlternativeExpression alternative > =
  Term {VerticalBar Term}.

It would be neat if Term returned a ListExpression object. That could be added to the AlternativeExpression object.

Expression< out AlternativeExpression alternative >  (. alternative = new AlternativeExpression(); ListExpression list = null; .)
  =
  Term< out list >   (. alternative.add( list ); .)
  {VerticalBar Term< out list >   (. alternative.add( list ); .)
  }.

Term

This is how Term looks like without Actions

Term =
  [[Resolver] Factor {Factor}].

Now it must return a ListExpression

Term< out ListExpression list > =
  [[Resolver] Factor {Factor}].

It would be neat if Factor would return an Expression object. That object could be added to the ListExpression object.

Term< out ListExpression list >   (. list = new ListExpression(); Expression expression = null; .)
  =
  [[Resolver] Factor< out expression >   (. if ( expression != null ) { list.add( expression ); } .)
  {Factor< out expression >                 (. if ( expression != null ) { list.add( expression ); } .)
  }].

Factor

The Factor definition before Actions

Factor =
  [Weak] Symbol [Attributes]  |
  LeftParenthesis Expression RightParenthesis |
  LeftSquareBracket Expression RightSquareBracket  |
  LeftCurlyBrace Expression RightCurlyBrace  |
  Any  |
  Sync  |
  SemAction.

Now, Factor returns a Expression object. And, as defined before, the rule Expression returns a AlternativeExpression

Factor< out Expression expression >
  =
  [Weak] Symbol [Attributes]  |
  LeftParenthesis Expression< out alternative >
  RightParenthesis |
  LeftSquareBracket Expression< out alternative >
  RightSquareBracket  |
  LeftCurlyBrace Expression< out alternative >
  RightCurlyBrace  |
  Any  |
  Sync  |
  SemAction.

It would be neat if Symbol returned a PrimaryExpression object. Then that object could be added to the Expression object

Factor< out Expression expression >   (. expression = null; PrimaryExpression primary = null; AlternativeExpression alternative = null; .)
  =
  [Weak] Symbol< out primary >             (. expression = primary; .)
  [Attributesrule] 
  | LeftParenthesis Expression< out alternative >  (. alternative.setGroup(); expression = alternative; .)
    RightParenthesis
  | LeftSquareBracket Expression< out alternative >  (. alternative.setOptional(); expression = alternative; .)
    RightSquareBracket 
  | LeftCurlyBrace Expression< out alternative >  (. alternative.setIteration(); expression = alternative; .)
    RightCurlyBrace 
  | Any                  (. primary = new PrimaryExpression(); primary.setValue( t.val ); expression = primary; .)
  | Sync 
  | SemActionrule.

Symbol

This is how Symbol is defined without Actions

Symbol =
  Ident | String | Char.

Now it must return a PrimaryExpression object

Symbol< out PrimaryExpression primary >  (. primary = new PrimaryExpression(); .)
  =
  Ident                           (. primary.setValue( t.val ); .)
  | String                       (. primary.setValue( t.val ); .)
  | Char                         (. primary.setValue( t.val ); .)
  .

The PrimaryExpression is filled with information via the member function setValue.

Avoiding compiling errors

Symbol is used in some other rules. These rules must be updated to avoid compiling errors

TokenDecl   (. PrimaryExpression primary = null; .)
  =
  Symbol< out primary > [Equal TokenExpr Point].

TokenFactor   (. PrimaryExpression primary = null; .)
  =
  Symbol< out primary >           
    | LeftParenthesis TokenExpr RightParenthesis
    | LeftSquareBracket TokenExpr RightSquareBracket
    | LeftCurlyBrace TokenExpr RightCurlyBrace.

The Last Part

One last part is to sew together the pieces: when the parsing is done

  • the rulemap should connect rules
  • the first rule should be set in the DocumentImpl object
  • and the rulemap should be set in the DocumentImpl object

This is neat to be done when the End token is consumed:

PRODUCTIONS

CocoR = 
  {ANY}
  Compiler Ident   (. firstRuleName = t.val; .)
  {ANY}
  ScannerSpecification 
  ParserSpecification 
  End   (. rulemap.connectRules(); doc.setTopRule(rulemap.getRule(firstRuleName)); doc.setRuleMap(rulemap); .)
  Ident Point.

Conclusion

It is a lot of work to write a parser. Though it is probably less work than write it by hand. The Coco/R grammar is quite small and manageable. It is also LL(1)! That is, we did not get any conflicts and did't have to write any IF-clauses with code to resolve the conflicts. Apart from this, I hope this tutorial has clarified the ideas how to think when using Coco/R and JFlex tools to write a parser.


<- back

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License