4. Formal Grammars and Parsing and Top-down Parsing
Chih-Hung Wang
Compilers
References
1. C. N. Fischer, R. K. Cytron, and R. J. LeBlanc. Crafting a Compiler. Pearson Education Inc., 2010.
2. D. Grune, H. Bal, C. Jacobs, and K. Langendoen. Modern Compiler Design. John Wiley & Sons, 2000.
3. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986. (2nd Ed. 2006)
Introduction
• Context-free Grammar
    • The syntax of programming language constructs can be described by a context-free grammar.
• Important aspects
    • A grammar imposes a structure on the linear sequence of tokens that makes up the program.
    • Using techniques from the field of formal languages, a parser can be constructed automatically from a grammar.
    • Grammars help programmers write syntactically correct programs and answer detailed questions about the syntax.
Context-Free Grammars
A context-free grammar (CFG) is a compact, finite representation of a language, defined by the following four components:
• A finite terminal alphabet Σ
• A finite non-terminal alphabet N
• A start symbol S ∈ N
• A finite set of productions P
Leftmost Derivations
• A sentential form produced via a leftmost derivation is called a left sentential form.
• The production sequence discovered by a large class of parsers (the top-down parsers) is a leftmost derivation. Hence, these parsers are said to produce a leftmost parse.
• Example: f(V+V)
    E ⇒lm Prefix(E)
      ⇒lm f(E)
      ⇒lm f(V Tail)
      ⇒lm f(V+E)
      ⇒lm f(V+V Tail)
      ⇒lm f(V+V)
Rightmost Derivations
• As a bottom-up parser discovers the productions that derive a given token sequence, it traces a rightmost derivation, but the productions are applied in reverse order.
• The result is called a rightmost or canonical parse.
• Example: f(V+V)
    E ⇒rm Prefix(E)
      ⇒rm Prefix(V Tail)
      ⇒rm Prefix(V+E)
      ⇒rm Prefix(V+V Tail)
      ⇒rm Prefix(V+V)
      ⇒rm f(V+V)
Properties of CFGs
• The grammar may include useless symbols.
• The grammar may allow multiple, distinct derivations (parse trees) for some input string.
• The grammar may generate strings that do not belong in the language, or the grammar may exclude strings that are in the language.
Ambiguity (1)
• Some grammars allow a derived string to have two or more different parse trees (and thus a non-unique structure).
• Example:
    1. Expr → Expr – Expr
    2.      | id
• This grammar allows two different parse trees for id - id - id: one groups it as (id - id) - id, the other as id - (id - id).
Parsers and Recognizers
Two approaches
• A parser is considered top-down if it generates a parse tree by starting at the root of the tree and expanding it by applying productions in a depth-first manner.
• A bottom-up parser generates a parse tree by starting at the tree’s leaves and working toward its root.
Two approaches of Parser
• Deterministic left-to-right top-down
    • LL method
• Deterministic left-to-right bottom-up
    • LR method
• Left-to-right
    • The sequence of tokens is processed from left to right.
• Deterministic
    • No searching is involved: each token brings the parser one step closer to the goal of constructing the syntax tree.
Pre-order and post-order (1)
• The top-down method constructs the syntax tree in pre-order.
• The bottom-up method constructs the syntax tree in post-order.
Principles of top-down parsing
The main task of a top-down parser is to choose the correct alternatives for known non-terminals
Principles of bottom-up parsing
• The main task of a bottom-up parser is to repeatedly find the first node all of whose children have already been constructed.
Creating a top-down parser manually
• Recursive descent parsing
    • Simplest way but has its limitations
Drawbacks
Three drawbacks
• There is still some searching through the alternatives
• The method often fails to produce a correct parser
• Error handling leaves much to be desired
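The second drawback can be made concrete with a minimal C sketch. It assumes a hypothetical one-character token encoding (a lower-case letter is an IDENTIFIER) and a simplified rule term → IDENTIFIER | IDENTIFIER ‘[’ IDENTIFIER ‘]’. The names parse_naive, identifier and token are illustrative and do not come from the course material. Each routine tries its alternatives in order and reports success or failure:

```c
#include <stdbool.h>
#include <ctype.h>

/* Assumption: tokens are single characters; a lower-case letter is an IDENTIFIER. */
static const char *input;   /* remaining, not yet consumed input */

static bool token(char expected) {
    if (*input == expected) { input++; return true; }
    return false;
}

static bool identifier(void) {
    if (islower((unsigned char)*input)) { input++; return true; }
    return false;
}

/* term -> IDENTIFIER | IDENTIFIER '[' IDENTIFIER ']'
   The alternatives are tried in order; the first one that matches wins. */
static bool term(void) {
    const char *saved = input;

    if (identifier())                /* alternative 1 */
        return true;
    input = saved;

    /* alternative 2: only reached when alternative 1 failed,
       so it can never succeed (it also starts with an IDENTIFIER) */
    if (identifier() && token('[') && identifier() && token(']'))
        return true;
    input = saved;

    return false;
}

/* Accept only if a term is found and the whole input is consumed. */
bool parse_naive(const char *text) {
    input = text;
    return term() && *input == '\0';
}
```

Under these assumptions parse_naive("a") succeeds but parse_naive("a[b]") fails, even though a[b] is in the language: the first alternative matches the identifier and the routine never tries the second.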
Creating a top-down parser automatically
• The principles of constructing a top-down parser automatically derive from those of writing one by hand, by applying precomputation.
• Grammars for which such a top-down parser can be constructed are called LL(1) grammars.
LL(1) parsing
FIRST set
• The FIRST set of an alternative is the set of tokens that can start a string produced by that alternative.
• We have to precompute the FIRST sets of all non-terminals.
• The FIRST sets of the terminals are obvious.
• Finding FIRST(α) is trivial when α starts with a terminal.
• FIRST(N) is the union of the FIRST sets of its alternatives.
• FIRST(α) = { a ∈ Σ | α ⇒* a β }
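The precomputation of FIRST sets can be sketched as a fixed-point iteration in C. The encoding below is an assumption made for illustration: upper-case letters are non-terminals, any other character is a terminal, and each string "X=..." is one alternative of X. The grammar is a small expression grammar (E for expression, R for rest_expression, T for term, i for IDENTIFIER). The empty string is tracked in a separate nullable flag rather than stored in the FIRST set itself.

```c
#include <stdbool.h>
#include <ctype.h>

/* Assumed encoding: "X=rhs" is one alternative; "X=" is the empty alternative. */
static const char *const grammar[] = {
    "E=TR",   /* expression      -> term rest_expression */
    "R=+E",   /* rest_expression -> '+' expression       */
    "R=",     /* rest_expression -> (empty)              */
    "T=i",    /* term            -> IDENTIFIER           */
    "T=(E)",  /* term            -> '(' expression ')'   */
};
enum { N_RULES = sizeof grammar / sizeof grammar[0] };

bool first[26][128];   /* first[N - 'A'][t]: terminal t can start N   */
bool nullable[26];     /* nullable[N - 'A']: N can derive the empty string */

void compute_first(void) {
    bool changed = true;
    while (changed) {                     /* repeat until nothing is added */
        changed = false;
        for (int r = 0; r < N_RULES; r++) {
            int lhs = grammar[r][0] - 'A';
            const char *rhs = grammar[r] + 2;
            bool all_nullable = true;     /* all symbols seen so far can vanish */

            for (; *rhs && all_nullable; rhs++) {
                unsigned char sym = (unsigned char)*rhs;
                if (isupper(sym)) {
                    int n = sym - 'A';
                    /* FIRST of a leading non-terminal flows into FIRST(lhs) */
                    for (int t = 0; t < 128; t++)
                        if (first[n][t] && !first[lhs][t]) {
                            first[lhs][t] = true;
                            changed = true;
                        }
                    all_nullable = nullable[n];
                } else {
                    /* a terminal starts the rest of the alternative */
                    if (!first[lhs][sym]) {
                        first[lhs][sym] = true;
                        changed = true;
                    }
                    all_nullable = false;
                }
            }
            if (all_nullable && !nullable[lhs]) {
                nullable[lhs] = true;
                changed = true;
            }
        }
    }
}
```

After compute_first() runs on this grammar, FIRST(E) and FIRST(T) both contain i and (, FIRST(R) contains +, and only R is nullable.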
Predictive recursive descent parser
• The FIRST sets can be used to construct a predictive parser, so called because it predicts the presence of a given alternative without trying to find out whether it is there.
Practice
Find the FIRST sets of all alternatives of the following grammar.
    E  → T E’
    E’ → + T E’ | ε
    T  → F T’
    T’ → * F T’ | ε
    F  → ( E ) | id
Nullable alternatives
• A complication arises with the case label for the empty alternative (e.g. rest_expression). Since it does not itself start with any token, how can we decide whether it is the correct alternative?
FOLLOW sets
• FOLLOW(N) is the set of tokens that can immediately follow a given non-terminal N.
• FOLLOW(A) = { b ∈ Σ | S ⇒+ α A b β }
LL(1) parser
• ‘LL’ because the parser works from Left to right, identifying the nodes in what is called Leftmost derivation order.
• ‘(1)’ because all choices are based on a one-token look-ahead.
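FOLLOW sets can be computed by a fixed-point iteration as well. The sketch below uses an assumed encoding (upper-case letters are non-terminals, $ stands for EOF, i for IDENTIFIER) for the grammar input → expression EOF, expression → term rest_expression, rest_expression → ‘+’ expression | ε, term → IDENTIFIER | ‘(’ expression ‘)’. To keep it short, the FIRST sets and nullability are hard-coded in the hypothetical helpers first_of and is_nullable instead of being computed.

```c
#include <stdbool.h>
#include <ctype.h>

/* Assumed encoding: S = input, E = expression, R = rest_expression, T = term. */
static const char *const rules[] = {
    "S=E$", "E=TR", "R=+E", "R=", "T=i", "T=(E)"
};
enum { N_RULES = sizeof rules / sizeof rules[0] };

/* FIRST sets for this grammar, assumed to be precomputed by hand. */
static const char *first_of(char n) {
    switch (n) {
    case 'S': case 'E': case 'T': return "i(";
    case 'R':                     return "+";
    default:                      return "";
    }
}
static bool is_nullable(char n) { return n == 'R'; }

bool follow[26][128];   /* follow[N - 'A'][t]: terminal t can follow N */

/* Add t to FOLLOW(n); report whether the set grew. */
static bool add(int n, int t) {
    if (follow[n][t]) return false;
    follow[n][t] = true;
    return true;
}

void compute_follow(void) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int r = 0; r < N_RULES; r++) {
            int lhs = rules[r][0] - 'A';
            for (const char *p = rules[r] + 2; *p; p++) {
                if (!isupper((unsigned char)*p)) continue;
                int b = *p - 'A';          /* non-terminal B at this position */
                const char *q = p + 1;     /* scan what comes after B */
                for (; *q; q++) {
                    if (!isupper((unsigned char)*q)) {   /* terminal follows B */
                        changed |= add(b, *q);
                        break;
                    }
                    for (const char *f = first_of(*q); *f; f++)
                        changed |= add(b, *f);
                    if (!is_nullable(*q)) break;
                }
                if (*q == '\0')            /* everything after B can vanish */
                    for (int t = 0; t < 128; t++)
                        if (follow[lhs][t]) changed |= add(b, t);
            }
        }
    }
}
```

For this grammar the sketch yields FOLLOW(R) = { $, ) } and FOLLOW(T) = { +, $, ) }: T is followed by the nullable R, so it inherits both FIRST(R) and FOLLOW(E).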
Recall the predictive parser
rest_expression → ‘+’ expression | ε
FIRST(rest_expr) = {‘+’, ε}
FOLLOW(rest_expr) = {EOF, ‘)’}

void rest_expression(void) {
    switch (Token.class) {
    case '+':             /* in FIRST: take the '+' expression alternative */
        token('+');
        expression();
        break;
    case EOF: case ')':   /* in FOLLOW: take the empty alternative */
        break;
    default:
        error();
    }
}
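LL(1) conflicts
FIRST/FIRST conflict
    term → IDENTIFIER | IDENTIFIER ‘[’ expression ‘]’ | ‘(’ expression ‘)’
The first two alternatives both start with IDENTIFIER, so one token of look-ahead cannot choose between them.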
LL(1) conflicts
FIRST/FOLLOW conflict
                     FIRST set     FOLLOW set
    S → A ‘a’ ‘b’    { ‘a’ }       { }
    A → ‘a’ | ε      { ‘a’, ε }    { ‘a’ }
With look-ahead ‘a’, the parser cannot tell whether A should produce ‘a’ or ε.
LL(1) conflicts
left recursion
    expression → expression ‘-’ term | term
Look-ahead token
• The LL(1) method predicts the alternative Ak for a non-terminal N when the look-ahead token is in FIRST(Ak) or, if Ak is nullable, in FOLLOW(N).
LL(1) grammar
• No FIRST/FIRST conflicts
• No FIRST/FOLLOW conflicts
• No multiple nullable alternatives
    • No non-terminal can have more than one nullable alternative.
Making a grammar LL(1)
• manual labour
    • rewrite grammar
    • adjust semantic actions
• three rewrite methods
    • left factoring
    • substitution
    • left-recursion removal
Left-factoring
    term → IDENTIFIER | IDENTIFIER ‘[’ expression ‘]’
factor out common prefix
    term → IDENTIFIER after_identifier
    after_identifier → ε | ‘[’ expression ‘]’
The result is LL(1) only if ‘[’ ∉ FOLLOW(after_identifier).
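A minimal C sketch of a predictive parser for the left-factored rule follows. It assumes a hypothetical one-character token encoding (a lower-case letter is an IDENTIFIER) and simplifies expression to a single IDENTIFIER; the name parse_factored is illustrative only. After factoring, one token of look-ahead is enough to choose the alternative of after_identifier.

```c
#include <stdbool.h>
#include <ctype.h>

/* Assumption: tokens are single characters; a lower-case letter is an IDENTIFIER. */
static const char *look;   /* points at the look-ahead token */

static bool identifier(void) {
    if (islower((unsigned char)*look)) { look++; return true; }
    return false;
}

/* after_identifier -> (empty) | '[' expression ']'
   Simplification: expression is reduced to a single IDENTIFIER. */
static bool after_identifier(void) {
    if (*look != '[')
        return true;              /* look-ahead is not '[': empty alternative */
    look++;                       /* match '[' */
    if (!identifier()) return false;
    if (*look != ']') return false;
    look++;                       /* match ']' */
    return true;
}

/* term -> IDENTIFIER after_identifier */
static bool term(void) {
    return identifier() && after_identifier();
}

/* Accept only if a term is found and the whole input is consumed. */
bool parse_factored(const char *text) {
    look = text;
    return term() && *look == '\0';
}
```

With this encoding both parse_factored("a") and parse_factored("a[b]") succeed, because the decision between the two forms is postponed until the token after the identifier is visible.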
Substitution
    A → a | B c | ε
    S → p A q
replace non-terminal by its alternative
    S → p a q | p B c q | p q
Example
    S → A ‘a’ ‘b’
    A → ‘a’ | ε
replace non-terminal by its alternative
    S → ‘a’ ‘a’ ‘b’ | ‘a’ ‘b’
Left-recursion removal
Three types of left-recursion
• Direct left-recursion
    N → N α | …
• Indirect left-recursion (chain structure)
    N → A …
    A → B …
    …
    Z → N …
• Hidden left-recursion
    N → α N | …   (α can produce ε)
Left-recursion removal
    N → N α | β
replace by
    N → β M
    M → α M | ε
example
    expression → expression ‘-’ term | term
    ...
    expression → term expression_tail_option
    expression_tail_option → ‘-’ term expression_tail_option | ε
Practice
Make the following grammar LL(1):
    expression → expression ‘+’ term | expression ‘-’ term | term
    term → term ‘*’ factor | term ‘/’ factor | factor
    factor → ‘(’ expression ‘)’ | func-call | identifier | constant
    func-call → identifier ‘(’ expr-list? ‘)’
    expr-list → expression (‘,’ expression)*
Answers
• substitution
    F → ‘(’ E ‘)’ | ID ‘(’ expr-list? ‘)’ | ID | constant
• left factoring
    E → E ( ‘+’ | ‘-’ ) T | T
    T → T ( ‘*’ | ‘/’ ) F | F
    F → ‘(’ E ‘)’ | ID ( ‘(’ expr-list? ‘)’ )? | constant
• left recursion removal
    E → T (( ‘+’ | ‘-’ ) T )*
    T → F (( ‘*’ | ‘/’ ) F )*
Undoing the semantic effects of grammar transformations
• While it is often possible to transform our grammar into a new grammar that is acceptable to a parser generator and that generates the same language, the new grammar usually assigns a different structure to strings in the language than our original grammar did.
• Fortunately, in many cases we are not really interested in the structure but rather in the semantics implied by it.
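One common way to recover the original semantics is to pass the value computed so far down into the tail rule. The C sketch below does this for the transformed grammar expression → term expression_tail_option, expression_tail_option → ‘-’ term expression_tail_option | ε. It assumes that a term is a single decimal digit and omits error handling; the function names are illustrative. A version that follows the new right-leaning structure directly is included for contrast.

```c
#include <ctype.h>

static const char *look;   /* points at the look-ahead character */

/* term -> DIGIT  (simplifying assumption; returns 0 on anything else) */
static int term(void) {
    return isdigit((unsigned char)*look) ? *look++ - '0' : 0;
}

/* expression_tail_option -> '-' term expression_tail_option | (empty)
   'left' carries the value of everything parsed so far, so that
   subtraction stays left-associative as in the original grammar. */
static int expression_tail_option(int left) {
    if (*look != '-')
        return left;                       /* empty alternative */
    look++;                                /* match '-' */
    int right = term();
    return expression_tail_option(left - right);
}

int evaluate_left(const char *text) {
    look = text;
    return expression_tail_option(term());
}

/* For contrast: evaluating along the structure of the transformed
   grammar groups to the right, i.e. a-(b-(c-0)). */
static int tail_naive(void) {
    if (*look != '-')
        return 0;
    look++;
    int t = term();
    return t - tail_naive();
}

int evaluate_naive(const char *text) {
    look = text;
    int t = term();
    return t - tail_naive();
}
```

For the input 9-3-2, evaluate_left returns 4, which is (9-3)-2, while evaluate_naive returns 8, which is 9-(3-2). The parse structure changed, but the accumulating parameter keeps the intended meaning.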
Automatic conflict resolution (1)
There are two ways in which LL parsers can be strengthened:
• By increasing the look-ahead
    • Distinguishing alternatives not by their first token but by their first two tokens is called LL(2).
    • Disadvantage: the parser code can get much bigger.
• By allowing dynamic conflict resolvers
    • When a conflict arises during parsing, some conditions are evaluated to resolve it.
    • The parser generator LLgen requires a conflict resolver to be placed on the first of two conflicting alternatives.
Automatic conflict resolution (2)
If-else statement in C
• else_tail_option: both the FIRST set and the FOLLOW set contain the token ‘else’
• Conflict resolver
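The effect of a conflict resolver that always prefers the ‘else’ alternative can be sketched in plain C. This is not LLgen syntax. The token encoding (i for "if (condition)", e for else, s for any other statement), the bracketed output format and the name parse_if are assumptions made for illustration. The grammar is statement → ‘i’ statement else_tail_option | ‘s’ and else_tail_option → ‘e’ statement | ε.

```c
#include <stdbool.h>

static const char *look;   /* look-ahead token */
static char *out;          /* where the bracketed structure is written */

/* statement -> 'i' statement else_tail_option | 's'
   Every if-statement is written as "[i ... ]" so that the output shows
   which 'if' an 'else' was attached to. */
static bool statement(void) {
    if (*look == 's') {
        *out++ = *look++;
        return true;
    }
    if (*look != 'i')
        return false;
    *out++ = '[';
    *out++ = *look++;                  /* match 'i' */
    if (!statement())
        return false;

    /* else_tail_option -> 'e' statement | (empty)
       'e' is in both FIRST and FOLLOW; the resolver prefers the
       non-empty alternative whenever the look-ahead is 'e'. */
    if (*look == 'e') {
        *out++ = *look++;              /* match 'e' */
        if (!statement())
            return false;
    }
    *out++ = ']';
    return true;
}

/* 'tree' must have room for at least 3 * strlen(text) + 1 characters. */
bool parse_if(const char *text, char *tree) {
    look = text;
    out = tree;
    bool ok = statement() && *look == '\0';
    *out = '\0';
    return ok;
}
```

For the input iises the sketch writes [i[ises]]: the else is attached to the nearest if, which is the binding C prescribes.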
Push-down automaton (PDA)
Types of moves
• Prediction move
    • The top of the prediction stack is a non-terminal N.
    • N is removed from the stack.
    • Look up the prediction table.
    • Push the alternative of N onto the prediction stack.
• Match move
    • The top of the prediction stack is a terminal.
• Termination
    • Parsing terminates when the prediction stack is exhausted.
PDA example (1) to (6)
Input: aap + ( noot + mies ) EOF

Transition table (row: state = top of stack, column: look-ahead token)

    state        IDENT             +               (                 )     EOF
    input        expression EOF                    expression EOF
    expression   term rest-expr                    term rest-expr
    term         IDENT                             ( expression )
    rest-expr                      + expression                      ε     ε

Prediction stack in each step
    (1) input
    (2) input: replace non-terminal by transition entry
    (3) expression EOF
    (4) expression EOF: replace non-terminal by transition entry
    (5) term rest-expr EOF
    (6) term rest-expr EOF: replace non-terminal by transition entry
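The prediction and match moves can be put together in a small table-driven parser. The C sketch below is written for the grammar of the PDA example under an assumed one-character encoding: S is input, E is expression, T is term, R is rest-expr, i is IDENT and $ is EOF. The function predict plays the role of the transition table, and the names predict and ll1_parse are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <ctype.h>
#include <string.h>

/* Transition table: the alternative to push for (non-terminal, look-ahead),
   or NULL when the table entry is empty (syntax error). */
static const char *predict(char nonterminal, char lookahead) {
    switch (nonterminal) {
    case 'S': return (lookahead == 'i' || lookahead == '(') ? "E$" : NULL;
    case 'E': return (lookahead == 'i' || lookahead == '(') ? "TR" : NULL;
    case 'T': return lookahead == 'i' ? "i"
                   : lookahead == '(' ? "(E)" : NULL;
    case 'R': return lookahead == '+' ? "+E"
                   : (lookahead == ')' || lookahead == '$') ? "" : NULL;
    default:  return NULL;
    }
}

bool ll1_parse(const char *input) {
    char stack[256];               /* prediction stack */
    size_t top = 0;

    stack[top++] = 'S';            /* start with the start symbol */
    while (top > 0) {
        char symbol = stack[--top];
        if (isupper((unsigned char)symbol)) {
            /* prediction move: replace the non-terminal by its table entry */
            const char *rhs = predict(symbol, *input);
            if (rhs == NULL) return false;
            size_t n = strlen(rhs);
            if (top + n > sizeof stack) return false;
            while (n > 0)
                stack[top++] = rhs[--n];   /* reversed: leftmost symbol on top */
        } else {
            /* match move: the terminal on the stack must equal the token */
            if (*input != symbol) return false;
            input++;
        }
    }
    return *input == '\0';         /* stack exhausted: all input must be used */
}
```

With this encoding the example input aap + ( noot + mies ) EOF becomes "i+(i+i)$", which ll1_parse accepts; "i+$" is rejected because the table has no entry for expression with look-ahead EOF.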
Obtaining LL(1) Grammars
• Most LL(1) prediction conflicts can be grouped into two categories: common prefix and left recursion.
LLgen
• LLgen is part of the Amsterdam Compiler Kit.
• It takes an LL(1) grammar plus semantic actions in C and generates a recursive descent parser.
• The non-terminals in the grammar can have parameters, and rules can have local variables, both again expressed in C.
• LLgen features:
    • repetition operators
    • advanced error handling
    • parameter passing
    • control over semantic actions
    • dynamic conflict resolvers
LLgen
• start from LR(1) grammar
• make grammar LL(1)
    • use repetition operators
%token DIGIT;
main : [line]+
;
line : expr '\n'
;
expr : term [ '+' term ]*
;
term : factor [ '*' factor ]*
;
factor : '(' expr ')'
| DIGIT
;
LLgen
• add semantic actions
    • attach parameters to grammar rules
    • insert C-code between the symbols
LLgen code for a parser
• The code from the previous page resides in a file called parser.g. LLgen converts the file to one called parser.c, which contains a recursive descent parser.