1
ANTLR v3 Overview (for ANTLR v2 users)
Terence Parr
University of San Francisco
2
Topics
Information flow
v3 grammars
Error recovery
Attributes
Tree construction
Tree grammars
Code generation
Internationalization
Runtime support
3
Block Info Flow Diagram
4
Grammar Syntax
header {…}
/** doc comment */
kind grammar name;
options {…}
tokens {…}
scopes…
action
rules…

/** doc comment */
rule[String s, int z] returns [int x, int y] throws E
options {…}
scopes
init {…}
  :
  |
  ;
  exceptions

Trees: ^(root child1 … childN)
Note: No inheritance
5
Grammar improvements
Single element EBNF like ID*
Combined parser/lexer
Allows 'c' and "literal" literals
Multiple parameters, return values
Labels do not have to be unique
  (x=ID|x=INT) {…$x…}
For combined grammars, warns when tokens are not defined
6
Example Grammar
grammar SimpleParser;
program : variable* method+ ;
variable: "int" ID ('=' expr)? ';' ;
method : "method" ID '(' ')' '{' variable* statement+ '}' ;
statement : ID '=' expr ';' | "return" expr ';' ;
expr : ID | INT ;
ID : ('a'..'z'|'A'..'Z')+ ;
INT : '0'..'9'+ ;
WS : (' '|'\t'|'\n')+ {channel=99;} ;
7
Using the parser
CharStream in = new ANTLRFileStream("inputfile");
SimpleParserLexer lexer = new SimpleParserLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SimpleParser p = new SimpleParser(tokens);
p.program(); // invoke start rule
8
Improved grammar warnings
they happen less often ;)
internationalized (templates again!)
gives (smallest) sample input sequence
better recursion warnings
9
Recursion Warnings
a : a A | B ;
t.g:2:5: Alternative 1 discovers infinite left-recursion to a from a

// with -Im 0 (secret internal parameter)
a : b | B ;
b : c ;
c : B b ;
t.g:2:5: Alternative 1: after matching input such as B decision cannot predict what comes next due to recursion overflow to c from b
10
Nondeterminisms
a : (A B|A B) C ;
t.g:2:5: Decision can match input such as "A B" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
t.g:2:5: The following alternatives are unreachable: 2

a : (A+ B|A+ B) C ;
t.g:2:5: Decision can match input such as "A B" using multiple alternatives: 1, 2
11
Runtime Objects of Interest
Lexer passes all tokens to the parser, but parser listens to only a single "channel"; channel 99, for example, where I place WS tokens, is ignored
Tokens have start/stop index into single text input buffer
Token is an abstract class
TokenSource: anything answering nextToken()
TokenStream: stream pulling from TokenSource; LT(i), …
CharStream: source of characters for a lexer; LT(i), …
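The channel idea can be sketched without the ANTLR runtime. The following is a minimal illustration only: `Tok`, `TokSource`, and `ChannelFilterStream` are hypothetical names invented here, not the v3 runtime classes. The source hands over every token, and the stream exposes only the channel it was asked to listen to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical token: a type name, its text, and the channel it was emitted on. */
class Tok {
    final String type;
    final String text;
    final int channel;
    Tok(String type, String text, int channel) {
        this.type = type;
        this.text = text;
        this.channel = channel;
    }
}

/** Anything answering nextToken(); in this sketch null signals end of input. */
interface TokSource {
    Tok nextToken();
}

/** Buffers every token from the source but exposes a single channel through LT(i). */
class ChannelFilterStream {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 99;   // where the example grammar puts WS

    private final List<Tok> all = new ArrayList<>();      // everything the lexer produced
    private final List<Tok> visible = new ArrayList<>();  // only the channel we listen to
    private int p = 0;                                     // index of next visible token

    ChannelFilterStream(TokSource source, int channel) {
        for (Tok t = source.nextToken(); t != null; t = source.nextToken()) {
            all.add(t);
            if (t.channel == channel) visible.add(t);
        }
    }

    /** i-th token of lookahead on the listened channel (1-based), or null past the end. */
    Tok LT(int i) {
        int idx = p + i - 1;
        return idx < visible.size() ? visible.get(idx) : null;
    }

    void consume() { if (p < visible.size()) p++; }

    int totalTokens() { return all.size(); }

    /** Feeds "x = 4" with WS on channel 99 and concatenates what the parser would see. */
    static String demo() {
        List<Tok> toks = Arrays.asList(
            new Tok("ID", "x", DEFAULT_CHANNEL), new Tok("WS", " ", HIDDEN_CHANNEL),
            new Tok("EQ", "=", DEFAULT_CHANNEL), new Tok("WS", " ", HIDDEN_CHANNEL),
            new Tok("INT", "4", DEFAULT_CHANNEL));
        Iterator<Tok> it = toks.iterator();
        ChannelFilterStream s =
            new ChannelFilterStream(() -> it.hasNext() ? it.next() : null, DEFAULT_CHANNEL);
        StringBuilder seen = new StringBuilder();
        while (s.LT(1) != null) {
            seen.append(s.LT(1).text);
            s.consume();
        }
        return seen + "/" + s.totalTokens();
    }
}
```

`ChannelFilterStream.demo()` returns `x=4/5`: three tokens reach the parser, while all five stay in the buffer.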
12
Error Recovery
ANTLR v3 does what Josef Grosch does in Cocktail
Does single token insertion or deletion if necessary to keep going
Computes context-sensitive FOLLOW to do insert/delete
  proper context is passed to each rule invocation
  knows precisely what can follow reference to r rather than what could follow any reference to r (per Wirth circa 1970)
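A rough sketch of the single-token insert/delete decision, not the actual v3 runtime code: `RecoveringMatcher` and its methods are hypothetical, tokens are plain strings, and the FOLLOW sets are written out by hand here, whereas ANTLR computes them and passes them down with each rule invocation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical matcher that repairs a mismatch with one deleted or one inserted token. */
class RecoveringMatcher {
    private final List<String> input;
    private int p = 0;                                 // index of the current token
    final List<String> errors = new ArrayList<>();

    RecoveringMatcher(String... tokens) { input = Arrays.asList(tokens); }

    private String LT(int i) {
        int idx = p + i - 1;
        return idx < input.size() ? input.get(idx) : "<EOF>";
    }

    /** Match expected; follow holds what may come after it in this exact context. */
    void match(String expected, Set<String> follow) {
        if (LT(1).equals(expected)) { p++; return; }
        if (LT(2).equals(expected)) {
            // Deletion: the current token is spurious, the one after it is what we want.
            errors.add("deleted " + LT(1) + " before " + expected);
            p += 2;
            return;
        }
        if (follow.contains(LT(1))) {
            // Insertion: act as if expected had been there and consume nothing.
            errors.add("inserted " + expected + " before " + LT(1));
            return;
        }
        throw new IllegalStateException(
            "cannot recover at " + LT(1) + "; expecting " + expected);
    }

    private static Set<String> set(String... s) { return new HashSet<>(Arrays.asList(s)); }

    /** Matches: "method" ID '(' ')' '{' and returns the repairs that were made. */
    static List<String> parseHeader(String... tokens) {
        RecoveringMatcher m = new RecoveringMatcher(tokens);
        m.match("method", set("ID"));
        m.match("ID", set("("));
        m.match("(", set(")"));
        m.match(")", set("{"));
        m.match("{", set("int", "ID", "return"));
        return m.errors;
    }
}
```

With the token sequence `method ID ( {` the sketch reports `inserted ) before {`; with `method ID ( ) ) {` it reports `deleted ) before {`. In both cases matching continues to the end of the header.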
13
Example Error Recovery
One token insertion
int i = 0;
method foo( {
  int j = i;
  i = 4
}
[program, method]: line 2:12 mismatched token:[@14,23:23='{',<14>,2:12]; expecting type ')'
[program, method, statement]: line 5:0 mismatched token:[@31,46:46='}',<15>,5:0]; expecting type ';'

One token deletion
int i = 0;
method foo() ) {
  int j = i;
  i = = 4;
}
[program, method]: line 2:13 mismatched token:[@15,24:24=')',<13>,2:13]; expecting type '{'
[program, method, statement, expr]: line 4:6 mismatched token:[@32,47:47='=',<6>,4:6]; expecting set null

Note: I put in two errors each so you’ll see it continues properly
14
Attributes
New label syntax and multiple return values
Unified token, rule, parameter, return value, tree reference syntax in actions
Dynamically scoped attributes!
a[String s] returns [float y]
  : id=ID f=field (ids+=ID)+ {$s, $y, $id, $id.text, $f.z; $ids.size();}
  ;
field returns [int x, int z]
  : …
  ;
15
Label properties
Token label reference properties
  text, type, line, pos, channel, index, tree
Rule label reference properties
  start, stop; indices of token boundaries
  tree
  text; text matched for whole rule
16
Rule Scope Attributes
A rule may define a scope of attributes visible to any invoked rule; operates like a stacked global variable
Avoids having to pass a value down
method
scope { String name; }
  : "method" ID '(' ')' {$name=$ID.text;} body
  ;
body: '{' stat* '}' ;
…
atom
init {… $method.name …}
  : ID | INT
  ;
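One way to picture the "stacked global variable" is a hand-written recursive-descent sketch. This is an illustration only, not the code ANTLR generates; `ScopedParserSketch`, `MethodScope`, and the method signatures are invented for the example. Entering the rule pushes a fresh scope, any rule invoked underneath reads the top of the stack, and leaving the rule pops it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical hand-written parser showing a rule scope as a stack. */
class ScopedParserSketch {
    /** One frame of the scope declared by rule method: scope { String name; } */
    static class MethodScope { String name; }

    private final Deque<MethodScope> methodStack = new ArrayDeque<>();
    final List<String> log = new ArrayList<>();

    /** method : "method" ID '(' ')' {$name=$ID.text;} body ; */
    void method(String id, String... atoms) {
        methodStack.push(new MethodScope());      // scope comes into existence on entry
        try {
            methodStack.peek().name = id;         // {$name=$ID.text;}
            body(atoms);
        } finally {
            methodStack.pop();                    // and disappears on exit
        }
    }

    /** body has no parameters: the name is not passed down. */
    void body(String... atoms) {
        for (String a : atoms) atom(a);
    }

    /** atom reads $method.name from the innermost enclosing method scope. */
    void atom(String text) {
        log.add(methodStack.peek().name + ":" + text);
    }

    static List<String> demo() {
        ScopedParserSketch p = new ScopedParserSketch();
        p.method("foo", "x", "1");
        p.method("bar", "y");
        return p.log;
    }
}
```

`ScopedParserSketch.demo()` yields `[foo:x, foo:1, bar:y]`; `atom` sees the right method name even though neither `body` nor `atom` takes it as a parameter.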
17
Global Scope Attributes
Named scopes; rules must explicitly request access
scope Symbols { List names; }
{int level=0;}

globals
scope Symbols;
init { level++; $Symbols.names = new ArrayList(); }
  : decl* {level--;}
  ;
block
scope Symbols;
init { level++; $Symbols.names = new ArrayList(); }
  : '{' decl* stat* '}' {level--;}
  ;
decl : "int" ID ';' {$Symbols.names.add($ID);} ;
*What if we want to keep the symbol tables around after parsing?
18
Tree Support
TreeAdaptor; how to create and navigate trees (like ASTFactory from v2); ANTLR assumes tree nodes are Object type
Tree; used by support code
BaseTree; List of children, w/o payload (no more child-sibling trees)
CommonTree; node wrapping Token as payload
ParseTree; used by interpreter to build trees
19
Tree Construction
Automatic mechanism is same as v2 except ^ is now ^^
  expr : atom ( '+'^^ atom )* ;
^ implies root of tree for enclosing subrule
  a : ( ID^ INT )* ; builds (a 1) (b 2) …
Token labels are $label not #label and rule invocation tree results are $ruleLabel.tree
Turn on
  options {output=AST;}
  (one can imagine output=text for templates)
Option: ASTLabelType=CommonTree;
20
Tree Rewrite Rules
Maps an input grammar fragment to an output tree grammar fragment
variable : type declarator ';' -> ^(VAR_DEF type declarator) ;
functionHeader : type ID '(' ( formalParameter ( ',' formalParameter )* )? ')' -> ^(FUNC_HDR type ID formalParameter+) ;
atom : … | '(' expr ')' -> expr ;
21
Mixed Rewrite/Auto Trees
Alternatives w/o -> rewrite use automatic mechanism
b : ID INT -> INT ID
  | INT // implies -> INT
  ;
22
Rewrites and labels
Disambiguates element references or used to construct imaginary nodes
forStat : "for" '(' start=assignStat ';' expr ';' next=assignStat ')' block
  -> ^("for" $start expr $next block)
  ;
block : lc='{' variable* stat* '}'
  -> ^(BLOCK[$lc] variable* stat*)
  ;
Concatenation += labels useful too:
/** match string representation of tree and build tree in memory */
tree : '^' '(' root=atom (children+=tree)+ ')' -> ^($root $children)
  | atom
  ;
23
Loops in Rewrites
Repeated element
  ID ID -> ^(VARS ID+)
  yields ^(VARS a b)
Repeated tree
  ID ID -> ^(VARS ID)+
  yields ^(VARS a) ^(VARS b)
Multiple elements in loop need same size
  ID INT ID INT -> ^( R ID ^( S INT) )+
  yields (R a (S 1)) (R b (S 2))
Checks cardinality + and * loops
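The three loop shapes can be mimicked by hand to show where the loop sits relative to the root. This is a sketch under simplifying assumptions, not ANTLR's generated rewrite code: `RewriteLoopSketch` and `Node` are hypothetical, and the matched tokens are passed in as plain string lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-rolled versions of the three rewrite loops. */
class RewriteLoopSketch {
    /** Minimal tree node: text plus a list of children. */
    static class Node {
        final String text;
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
        Node add(Node child) { children.add(child); return this; }
        @Override public String toString() {
            if (children.isEmpty()) return text;
            StringBuilder sb = new StringBuilder("(" + text);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    private static void requireNonEmpty(List<String> elements) {
        // A + loop needs at least one element to iterate over.
        if (elements.isEmpty()) throw new IllegalArgumentException("+ loop needs 1 or more elements");
    }

    /** -> ^(VARS ID+) : one root, loop over the children. */
    static Node repeatedElement(List<String> ids) {
        requireNonEmpty(ids);
        Node root = new Node("VARS");
        for (String id : ids) root.add(new Node(id));
        return root;
    }

    /** -> ^(VARS ID)+ : loop over the whole tree, one new root per element. */
    static List<Node> repeatedTree(List<String> ids) {
        requireNonEmpty(ids);
        List<Node> trees = new ArrayList<>();
        for (String id : ids) trees.add(new Node("VARS").add(new Node(id)));
        return trees;
    }

    /** -> ^(R ID ^(S INT))+ : both lists advance together, so sizes must agree. */
    static List<Node> pairedLoop(List<String> ids, List<String> ints) {
        requireNonEmpty(ids);
        if (ids.size() != ints.size()) {
            throw new IllegalArgumentException("elements in one loop need the same size");
        }
        List<Node> trees = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            Node s = new Node("S").add(new Node(ints.get(i)));
            trees.add(new Node("R").add(new Node(ids.get(i))).add(s));
        }
        return trees;
    }
}
```

For ids `a b` and ints `1 2`, the three methods print as `(VARS a b)`, `[(VARS a), (VARS b)]`, and `[(R a (S 1)), (R b (S 2))]`; mismatched list sizes are rejected with an exception.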
24
Preventing cyclic structures
Repeated elements get duplicated
  a : INT -> INT INT ; // dups INT!
  a : INT INT -> INT+ INT+ ; // 4 INTs!
Repeated rule references get duplicated
  a : atom -> ^(atom atom) ; // no cycle!
Duplicates whole tree for all but first ref to an element; here 2nd ref to atom results in a duplicated atom tree
*Useful example "int x,y" -> "^(int x) ^(int y)"
  decl : type ID (',' ID)* -> ^(type ID)+ ;
*Just noticed a bug in this one ;)
25
Predicated rewrites
Use semantic predicate to indicate which rewrite to choose from
a : ID INT -> {p1}? ID
           -> {p2}? INT
           ->
  ;
26
Misc Rewrite Elements
Arbitrary actions
  a : atom -> ^({adaptor.createToken(INT,"9")} atom) ;
rewrite always sets the rule’s AST not subrule’s
  b : "int" ( ID -> ^(TYPE "int" ID) | ID '=' INT -> ^(TYPE "int" ID INT) ) ;
Reference to previous value (useful?)
  a : (atom -> atom) (op='+' r=atom -> ^($op $a $r) )* ;
27
Tree Grammars
Syntax same as parser grammars, add ^(root children…) tree element
Uses LL(*) also; even derives from same superclass!
Tree is serialized to include DOWN, UP imaginary tokens to encode 2D structure for serial parser
variable : ^(VAR_DEF type ID)
  | ^(VAR_DEF type ID ^(INIT expr))
  ;
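A small sketch of the serialization idea, assuming a simple preorder walk: `TreeSerializerSketch` and `Node` are hypothetical stand-ins, not the runtime's tree node stream. A DOWN goes after every node that has children, and an UP goes after its last child, so a one-dimensional parser can recover the tree shape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical flattening of a tree into a token sequence with DOWN/UP markers. */
class TreeSerializerSketch {
    static class Node {
        final String text;
        final List<Node> children;
        Node(String text, Node... kids) {
            this.text = text;
            this.children = Arrays.asList(kids);
        }
    }

    /** Preorder walk that emits DOWN before a node's children and UP after them. */
    static List<String> serialize(Node t) {
        List<String> out = new ArrayList<>();
        walk(t, out);
        return out;
    }

    private static void walk(Node t, List<String> out) {
        out.add(t.text);
        if (t.children.isEmpty()) return;   // leaves need no DOWN/UP pair
        out.add("DOWN");
        for (Node c : t.children) walk(c, out);
        out.add("UP");
    }

    /** Serializes ^(VAR_DEF int x ^(INIT 4)), the second alternative of rule variable. */
    static String demo() {
        Node tree = new Node("VAR_DEF",
            new Node("int"), new Node("x"), new Node("INIT", new Node("4")));
        return String.join(" ", serialize(tree));
    }
}
```

`TreeSerializerSketch.demo()` gives `VAR_DEF DOWN int x INIT DOWN 4 UP UP`. The two alternatives of `variable` share the prefix `VAR_DEF DOWN type ID` and differ only in whether an `UP` or an `INIT` comes next.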
28
Code Generation
Uses StringTemplate to specify how each abstract ANTLR concept maps to code; wildly successful!
Separates code gen logic from output; not a single character of output in the Java code
Java.stg: 140 templates, 1300 lines
29
Sample code gen templates
/** Dump the elements one per line and stick in debugging
 *  location() trigger in front.
 */
element() ::= <<
<if(debug)>dbg.location(<it.line>,<it.pos>);<\n><endif>
<it.el><\n>
>>

/** match a token optionally with a label in front */
tokenRef(token,label,elementIndex) ::= <<
<if(label)><label>=input.LT(1);<\n><endif>
match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);
>>
30
Internationalization
ANTLR v3 uses StringTemplate to display all errors
Senses locale to load messages; en.stg: 76 templates
ErrorManager error number constants map to a template name; e.g.,
RULE_REDEFINITION(file,line,col,arg) ::= "<loc()>rule <arg> redefinition"

/* This factors out file location formatting; file,line,col inherited from
 * enclosing template; don't manually pass stuff in.
 */
loc() ::= "<file>:<line>:<col>: "
31
Runtime Support
Better organized, separated:
  org.antlr.runtime
  org.antlr.runtime.tree
  org.antlr.runtime.debug
Clean; Parser has input ptr only (except error recovery FOLLOW stack); Lexer also only has input ptr
4500 lines of Java code minus BSD header
32
Summary
v3 kicks ass
it sort of works!
http://www.antlr.org/download/…
ANTLRWorks progressing in parallel