Theory of Compilation 236360
Erez Petrank
Lecture 2: Syntax Analysis, Top-Down Parsing
You are here

Source text (txt) → Compiler → Executable code (exe)

Compiler phases: Lexical Analysis → Syntax Analysis (Parsing) → Semantic Analysis → Inter. Rep. (IR) → Code Gen.
Last Week: from characters to tokens (Using Regular Expressions)
x = b*b - 4*a*c
txt
<ID,”x”> <EQ> <ID,”b”> <MULT> <ID,”b”> <MINUS>
<INT,4> <MULT> <ID,”a”> <MULT> <ID,”c”>
Token Stream
The Lex Tool
• Lex automatically generates a lexical analyzer from a declaration file.
• Advantages: easy to produce a lexical analyzer from a short declaration; easily verified, modified, and maintained.
• Intuitively: Lex builds a DFA; the analyzer simulates the DFA on a given input.

Declaration file → Lex → Lexical Analysis; characters → Lexical Analysis → tokens
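The DFA simulation above can be approximated with a few lines of Python. This is a minimal sketch of what a generated analyzer does (try each token pattern at the current position, emit the matching token), not Lex itself; the token names follow the example on the earlier slide.

```python
import re

# Token patterns, ordered. Names (INT, ID, EQ, MULT, MINUS) are the ones
# used in the token-stream example; SKIP swallows whitespace.
TOKEN_SPEC = [
    ("INT",   r"\d+"),
    ("ID",    r"[A-Za-z_]\w*"),
    ("EQ",    r"="),
    ("MULT",  r"\*"),
    ("MINUS", r"-"),
    ("SKIP",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Return the token stream as (kind, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Running it on the slide's input `x = b*b - 4*a*c` reproduces the token stream `<ID,"x"> <EQ> <ID,"b"> <MULT> …` from slide 3.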
Today: from tokens to AST

Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.

<ID,"b"> <MULT> <ID,"b"> <MINUS> <INT,4> <MULT> <ID,"a"> <MULT> <ID,"c">

[Figure: the syntax tree for b*b - 4*a*c — expression, term, and factor nodes built over the ID and INT leaves, with MULT and MINUS as interior operators.]
Syntax Analysis (Parsing)
• Goal: discover the program structure.
– For example, a C program is built of functions, each function is built from declarations and instructions, each instruction is built from expressions, etc.
– Is a sequence of tokens a valid program in the language?
– Construct a structured representation of the input text.
– Error detection and reporting.
• A simple and accurate method for describing a program structure is context free grammars.
• We will look at families of grammars that can be efficiently parsed.
• The parser reads the token stream, makes sure it is derivable in the grammar (or reports an error), and constructs the derivation tree.
Context free grammars

G = (V, T, P, S)
• V – non-terminals
• T – terminals (tokens for us)
• P – derivation rules
– Each rule of the form V ➞ (T ∪ V)*
• S ∈ V – the initial symbol
Why do we need context free grammars?
• Important program structures cannot be expressed by regular expressions. E.g., balanced parentheses:
– S ➞ SS; S ➞ (S); S ➞ ()
• Anything expressible as a regular expression is expressible by a CFG. Why use regular expressions at all?
– Separation, modularity, simplification.
– No point in using strong (and less efficient) tools on easily analyzable regular expressions.
• Regular expressions describe lexical structures like identifiers, constants, keywords, etc.
• Grammars describe nested structures like balanced parentheses, matched begin-end, if-then-else, etc.
Example

S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

V = { S, E }
T = { id, ':=', ';', '+', '*', '(', ')' }
S is the initial variable.
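The quadruple G = (V, T, P, S) maps directly to plain data. The sketch below writes the example grammar as Python dicts (the representation — lists of symbol strings — is our own choice, not anything the course mandates) and checks that it is well-formed.

```python
# The example grammar G = (V, T, P, S) as plain Python data.
V = {"S", "E"}
T = {"id", ":=", ";", "+", "*", "(", ")"}
P = {
    "S": [["S", ";", "S"], ["id", ":=", "E"]],
    "E": [["id"], ["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"]],
}
S = "S"

def well_formed(V, T, P, S):
    """Every rule's left side is a nonterminal, every right-side symbol
    is a terminal or a nonterminal, and the start symbol is in V."""
    return (S in V
            and all(A in V for A in P)
            and all(sym in V or sym in T
                    for rhss in P.values() for rhs in rhss for sym in rhs))
```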
Terminology
• Derivation: a sequence of replacements of non-terminals using the derivation rules.
• Language: the set of strings of terminals derivable from the initial symbol.
• Sentential form (תבנית פסוקית): the result of a partial derivation, which may still contain non-terminals.
Derivation Example

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

S
S ; S
id := E ; S
id := id ; S
id := id ; id := E
id := id ; id := E + E
id := id ; id := E + id
id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ id, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id
Parse Tree

Input: x := z ; y := x + z
Derivation: S ⇒ S;S ⇒ id := E ; S ⇒ id := id ; S ⇒ id := id ; id := E ⇒ id := id ; id := E + E ⇒ id := id ; id := E + id ⇒ id := id ; id := id + id

[Figure: the parse tree — S at the root with children S, ';', S; the left S derives id := E with E ➞ id; the right S derives id := E with E ➞ E + E, and both inner E's derive id.]
Questions
• How did we know which rule to apply at every step?
• Does it matter?
• Would we always get the same result?
Ambiguity

Input: x := y+z*w
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

[Figure: two different parse trees for the same input — one with E ➞ E + E at the top, grouping as y+(z*w), and one with E ➞ E * E at the top, grouping as (y+z)*w.]
Leftmost/rightmost Derivation
• Leftmost derivation
– always expand the leftmost non-terminal
• Rightmost derivation
– always expand the rightmost non-terminal
• Allows us to describe a derivation by listing the sequence of rules only.
– we always know which symbol a rule is applied to
• Note that this does not necessarily resolve ambiguity (e.g., previous slide).
• These are the derivation orders applied in our parsers (coming soon).
Leftmost Derivation

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

S
S ; S
id := E ; S
id := id ; S
id := id ; id := E
id := id ; id := E + E
id := id ; id := id + E
id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ id, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id
Rightmost Derivation

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

S
S ; S
S ; id := E
S ; id := E + E
S ; id := E + id
S ; id := id + id
id := E ; id := id + id
id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id, S ➞ id := E, E ➞ id
Bottom-up Example

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

id := id ; id := id + id
id := E ; id := id + id
S ; id := id + id
S ; id := E + id
S ; id := E + E
S ; id := E
S ; S
S

Rules applied: E ➞ id, S ➞ id := E, E ➞ id, E ➞ id, E ➞ E + E, S ➞ id := E, S ➞ S;S

Bottom-up, reducing the leftmost alternative at every step = the rightmost derivation, read in reverse.
Parsing
• A context free language can be recognized by a non-deterministic pushdown automaton
– But not necessarily by a deterministic one…
• Parsing can be seen as a search problem
– Can you find a derivation from the start symbol to the input word?
– Easy (but very expensive) to solve with backtracking
• The Cocke-Younger-Kasami parser can parse any context-free language, but has complexity O(n³)
– Imagine a program with hundreds of thousands of lines of code.
• We want efficient parsers
– Linear in input size
– Deterministic pushdown automata
– We will sacrifice generality for efficiency
"Brute-force" Parsing

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

id := id ; id := id + id
  (E ➞ id)  id := E ; id := id + id
  (E ➞ id)  id := id ; id := E + id
  …

(not a parse tree… a search for the parse tree by exhaustively applying all rules)
Efficient Parsers
• Top-down (predictive)
– Construct the leftmost derivation
– Apply rules "from left to right"
– Predict which rule to apply based on the nonterminal and the next token
• Bottom-up (shift reduce)
– Construct the rightmost derivation
– Apply rules "from right to left"
– Reduce a right-hand side of a production to its non-terminal
Efficient Parsers
• Top-down (predictive parsing)
• Bottom-up (shift reduce)

[Figure: both parsers scan the input left to right — part already read, part still to be read.]
Top-down Parsing
• Given a grammar G=(V,T,P,S) and a word w
• Goal: derive w using G
• Idea
– Apply a production to the leftmost nonterminal
– Pick the production rule based on the next input token
• General grammar
– More than one option for choosing the next production based on a token
• Restricted grammars (LL)
– Know exactly which single rule to apply
– May require some lookahead to decide
An Easily Parse-able Grammar

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor

Input: not (not true or false)

E => not E
  => not ( E OP E )
  => not ( not E OP E )
  => not ( not LIT OP E )
  => not ( not true OP E )
  => not ( not true or E )
  => not ( not true or LIT )
  => not ( not true or false )

The production to apply is known from the next input token.

[Figure: the parse tree for not ( not true or false ).]

At any stage, looking at the current variable and the next input token, the rule can be easily determined.
LL(k) Grammars
• A grammar is in the class LL(k) when it can be derived via:
– Top-down derivation
– Scanning the input from left to right (L)
– Producing the leftmost derivation (L)
– With lookahead of k tokens (k)
• A language is said to be LL(k) when it has an LL(k) grammar
Recursive Descent Parsing
• Define a function for every nonterminal
• Every function simulates the derivation of the variable it represents:
– Find an applicable production rule
– For a terminal, check the match with the next input token
– For a nonterminal, call (recursively) its function
• If there are several applicable productions for a nonterminal, use lookahead
Matching tokens

• The variable current holds the current input token

void match(token t) {
  if (current == t)
    current = next_token();
  else
    error;
}
Functions for nonterminals

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor

void E() {
  if (current ∈ {TRUE, FALSE}) {       // E → LIT
    LIT();
  } else if (current == LPAREN) {      // E → ( E OP E )
    match(LPAREN); E(); OP(); E(); match(RPAREN);
  } else if (current == NOT) {         // E → not E
    match(NOT); E();
  } else
    error;
}
Functions for nonterminals

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor

void LIT() {
  if (current == TRUE)
    match(TRUE);
  else if (current == FALSE)
    match(FALSE);
  else
    error;
}
Functions for nonterminals

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor

void OP() {
  if (current == AND)
    match(AND);
  else if (current == OR)
    match(OR);
  else if (current == XOR)
    match(XOR);
  else
    error;
}
Overall: Functions for Grammar

E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor

void E() {
  if (current ∈ {TRUE, FALSE}) {
    LIT();
  } else if (current == LPAREN) {
    match(LPAREN); E(); OP(); E(); match(RPAREN);
  } else if (current == NOT) {
    match(NOT); E();
  } else
    error;
}

void LIT() {
  if (current == TRUE) match(TRUE);
  else if (current == FALSE) match(FALSE);
  else error;
}

void OP() {
  if (current == AND) match(AND);
  else if (current == OR) match(OR);
  else if (current == XOR) match(XOR);
  else error;
}
Adding semantic actions
• Can add an action to perform on each production rule, simply by executing it when the function is invoked.
• For example, can build the parse tree
– Every function returns an object of type Node
– Every Node maintains a list of children
– Function calls add new children
Building the parse tree

Node E() {
  result = new Node();
  result.name = "E";
  if (current ∈ {TRUE, FALSE}) {       // E → LIT
    result.addChild(LIT());
  } else if (current == LPAREN) {      // E → ( E OP E )
    result.addChild(match(LPAREN));
    result.addChild(E());
    result.addChild(OP());
    result.addChild(E());
    result.addChild(match(RPAREN));
  } else if (current == NOT) {         // E → not E
    result.addChild(match(NOT));
    result.addChild(E());
  } else
    error;
  return result;
}
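The tree-building recursive descent parser can be written out as a runnable Python sketch for the grammar E ➞ LIT | (E OP E) | not E; LIT ➞ true | false; OP ➞ and | or | xor. Class and token names here are our own choices, not part of the slides.

```python
class Node:
    """A parse-tree node: a name plus an ordered list of children."""
    def __init__(self, name):
        self.name, self.children = name, []
    def add(self, child):
        self.children.append(child)

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # $ marks end of input
        self.pos = 0
    @property
    def current(self):
        return self.tokens[self.pos]
    def match(self, t):
        if self.current != t:
            raise SyntaxError(f"expected {t}, got {self.current}")
        self.pos += 1
        return Node(t)                       # leaf node for the token
    def E(self):
        result = Node("E")
        if self.current in ("true", "false"):    # E -> LIT
            result.add(self.LIT())
        elif self.current == "(":                # E -> ( E OP E )
            result.add(self.match("("))
            result.add(self.E())
            result.add(self.OP())
            result.add(self.E())
            result.add(self.match(")"))
        elif self.current == "not":              # E -> not E
            result.add(self.match("not"))
            result.add(self.E())
        else:
            raise SyntaxError(f"unexpected {self.current}")
        return result
    def LIT(self):
        if self.current not in ("true", "false"):
            raise SyntaxError(f"unexpected {self.current}")
        result = Node("LIT")
        result.add(self.match(self.current))
        return result
    def OP(self):
        if self.current not in ("and", "or", "xor"):
            raise SyntaxError(f"unexpected {self.current}")
        result = Node("OP")
        result.add(self.match(self.current))
        return result

def parse(text):
    p = Parser(text.split())
    tree = p.E()
    p.match("$")                             # whole input must be consumed
    return tree
```

Calling `parse("not ( not true or false )")` returns a root "E" node whose children mirror the derivation on the earlier slide.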
Getting Back to the Example

• Input = "( not true and false )";
  Node treeRoot = E();

[Figure: the resulting parse tree — E with children '(', E, OP, E, ')'; the first inner E derives not LIT (true), OP derives and, and the second inner E derives LIT (false).]
Recursive Descent

• How do you pick the right A-production?
• Generally – try them all and use backtracking (costly).
• In our case – use lookahead.

In its basic form, each variable has a procedure that looks like:

void A() {
  choose an A-production, A → X1X2…Xk;
  for (i=1; i ≤ k; i++) {
    if (Xi is a nonterminal)
      call procedure Xi();
    else if (Xi == current)
      advance input;
    else
      report error;
  }
}
Recursive Descent: a problem

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

• With lookahead 1, the function for indexed_elem will never be tried…
– What happens for input of the form
• ID [ expr ]
Recursive Descent: Another Problem

S ➞ A a b
A ➞ a | ε

Bool S() {
  return A() && match(token('a')) && match(token('b'));
}
Bool A() {
  if (current == 'a')
    return match(token('a'));
  else
    return true;
}

What happens for input "ab"? What happens if you flip the order of the alternatives and try "aab"?
Recursive descent: a third problem

E ➞ E – term | term

Bool E() {
  return E() && match(token('-')) && term()
         || term();
}

What happens with this procedure? E() calls E() before consuming any input — infinite recursion. Recursive descent parsers cannot handle left-recursive grammars.
3 Bad Examples for Recursive Descent

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

S ➞ A a b
A ➞ a | ε

E ➞ E – term | term

Can we make them work?
The "FIRST" Sets
• To formalize the property (of a grammar) that we can determine a rule using a single lookahead, we define the FIRST sets.
• For every production rule A ➞ 𝞪
– FIRST(𝞪) = all terminals that 𝞪 can start with
– i.e., every token that can appear first under some derivation of 𝞪
• No intersection between the FIRST sets => can pick a single rule
• In our Boolean expressions example
– FIRST(LIT) = { true, false }
– FIRST( ( E OP E ) ) = { '(' }
– FIRST( not E ) = { not }

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor
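FIRST sets are computed by a standard fixpoint iteration: keep propagating terminals through the rules until nothing changes. The sketch below applies it to the Boolean-expression grammar (the representation is our own; "eps" stands for ε and matters only for nullable symbols, which this grammar does not have).

```python
EPS = "eps"

# The Boolean-expression grammar from the slide.
P = {
    "E":   [["LIT"], ["(", "E", "OP", "E", ")"], ["not", "E"]],
    "LIT": [["true"], ["false"]],
    "OP":  [["and"], ["or"], ["xor"]],
}

def first_sets(P):
    """Fixpoint computation of FIRST for every nonterminal in P."""
    nonterms = set(P)
    first = {A: set() for A in P}
    changed = True
    while changed:
        changed = False
        for A, rhss in P.items():
            for rhs in rhss:
                for sym in rhs:
                    if sym not in nonterms:      # a terminal starts the rule
                        if sym not in first[A]:
                            first[A].add(sym)
                            changed = True
                        break
                    add = first[sym] - {EPS}     # nonterminal: copy its FIRST
                    if not add <= first[A]:
                        first[A] |= add
                        changed = True
                    if EPS not in first[sym]:
                        break
                else:                            # every symbol was nullable
                    if EPS not in first[A]:
                        first[A].add(EPS)
                        changed = True
    return first
```

For this grammar the result matches the slide: FIRST(LIT) = {true, false}, and the three rules of E get the disjoint starter sets {true, false}, {'('}, {not}.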
The "FIRST" Sets
• No intersection between the FIRST sets => can pick a single rule
• If the FIRST sets intersect, we may need a longer lookahead
– LL(k) = the class of grammars in which the production rule can be determined using a lookahead of k tokens
– LL(1) is an important and useful class
The FOLLOW Sets
• FIRST is not enough when variables can be nullified.
• Consider: S ➞ AB | c ;  A ➞ a | ε ;  B ➞ b
• Need to know what comes afterwards to select the right production
• For any non-terminal A
– FOLLOW(A) = the set of tokens that can immediately follow A
• Can select the rule N ➞ 𝞪 with lookahead "b" if
– b ∈ FIRST(𝞪), or
– 𝞪 may be nullified and b ∈ FOLLOW(N).
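FOLLOW is computed with a second fixpoint pass over the rules, once FIRST is known. The sketch below runs it on the slide's grammar S ➞ AB | c; A ➞ a | ε; B ➞ b (representation and helper names are our own; "$" marks end of input).

```python
EPS, END = "eps", "$"

# The slide's grammar; the empty list encodes A -> eps.
P = {"S": [["A", "B"], ["c"]], "A": [["a"], []], "B": [["b"]]}

def first_of(seq, first):
    """FIRST of a sequence of symbols; a terminal's FIRST is itself."""
    out = set()
    for sym in seq:
        s = first.get(sym, {sym})
        out |= s - {EPS}
        if EPS not in s:
            return out
    out.add(EPS)                     # the whole sequence can be nullified
    return out

def first_sets(P):
    first = {A: set() for A in P}
    changed = True
    while changed:
        changed = False
        for A, rhss in P.items():
            for rhs in rhss:
                new = first_of(rhs, first)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

def follow_sets(P, start):
    first = first_sets(P)
    follow = {A: set() for A in P}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for A, rhss in P.items():
            for rhs in rhss:
                for i, sym in enumerate(rhs):
                    if sym not in P:         # terminals have no FOLLOW
                        continue
                    tail = first_of(rhs[i+1:], first)
                    add = (tail - {EPS}) | (follow[A] if EPS in tail else set())
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow
```

Here FOLLOW(A) = {b}: with lookahead b we pick A ➞ ε, and with lookahead a we pick A ➞ a, exactly the selection rule stated above.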
Back to our 1st example

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

• FIRST(ID) = { ID }
• FIRST(indexed_elem) = { ID }
• FIRST/FIRST conflict
• This grammar is not LL(1). Can we "fix" it?
Left factoring
• Rewrite into an equivalent grammar that is LL(1)

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

becomes

term ➞ ID after_ID
after_ID ➞ [ expr ] | ε

Intuition: just like factoring x*y + x*z into x*(y+z)
Left factoring – another example

S ➞ if E then S else S | if E then S | T

becomes

S ➞ if E then S S' | T
S' ➞ else S | ε
Back to our 2nd example

S ➞ A a b
A ➞ a | ε

• Select a rule for A with 'a' in the lookahead:
– Should we pick (1) A ➞ a or (2) A ➞ ε?
• (1) FIRST(a) = { 'a' } (and a cannot be nullified).
• (2) FIRST(ε) = ∅. Also, ε can (must) be nullified, and FOLLOW(A) = { 'a' }.
• FIRST/FOLLOW conflict
• The grammar is not LL(1).
An Equivalent Grammar via Substitution

S ➞ A a b
A ➞ a | ε

Substitute A in S:

S ➞ a a b | a b

Left factoring:

S ➞ a after_a
after_a ➞ a b | b
So Far
• We have tools to determine whether a grammar is LL(1)
– The FIRST and FOLLOW sets.
– The exercises will provide algorithms for finding and using them.
• We have some techniques for modifying a grammar to find an equivalent one that is LL(1).
– Left factoring,
– Substitution.
• Now let's look at the 3rd example and present one more such technique.
Back to our 3rd example

E ➞ E – term | term

• Left recursion cannot be handled with a bounded lookahead.
• What can we do?
• Any grammar with left recursion has an equivalent grammar with no left recursion.
Left Recursion Elimination

G1: N ➞ Nα | β        G2: N ➞ βN'
                          N' ➞ αN' | ε

• L(G1) = β, βα, βαα, βααα, …
• L(G2) = the same

For our 3rd example:

E ➞ E – term | term    becomes    E ➞ term TE
                                  TE ➞ – term TE | ε
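The G1 → G2 transformation is mechanical, so it is easy to sketch in code. The function below eliminates direct left recursion from one nonterminal's rules (the list-of-symbol-strings representation and the primed-name convention are our own choices) and is run on the 3rd example, E ➞ E - term | term.

```python
EPS = "eps"

def eliminate_direct(N, rhss):
    """Replace N -> N a1 | ... | N an | b1 | ... | bm  with
    N -> b1 N' | ... | bm N'  and  N' -> a1 N' | ... | an N' | eps."""
    alphas = [rhs[1:] for rhs in rhss if rhs and rhs[0] == N]   # N -> N alpha
    betas  = [rhs for rhs in rhss if not rhs or rhs[0] != N]    # the rest
    if not alphas:
        return {N: rhss}              # nothing to do
    Np = N + "'"
    return {
        N:  [beta + [Np] for beta in betas],
        Np: [alpha + [Np] for alpha in alphas] + [[EPS]],
    }

rules = eliminate_direct("E", [["E", "-", "term"], ["term"]])
```

The result is exactly the grammar on the slide: E ➞ term TE and TE ➞ - term TE | ε (with TE spelled E').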
Elimination of Left Recursion

Eliminating direct recursion: replace the rules
• A → Aα1 | Aα2 | ··· | Aαn | β1 | β2 | ··· | βm
with the rules
• A → β1A' | β2A' | ··· | βmA'
• A' → α1A' | α2A' | ··· | αnA' | ε
Note that the method does not work if some αi is empty, and it may create indirect left recursion if some βi is empty:
• If αi is empty, we get direct left recursion on A'.
• If βi is empty, indirect left recursion may arise when some αj starts with A: A → A' together with A' → A….
Elimination of Left Recursion

The replacement of the rules removes direct recursion. We must also handle indirect recursion. For example:
• S → Aa | b
• A → Ac | Sd | ε
For that, the algorithm is a bit more complex.
An Algorithm for Eliminating (Direct and Indirect) Left Recursion from a Grammar

• Input: a grammar G, possibly with left recursion, but with no cycles and no ε rules.
• Output: an equivalent grammar with no left recursion.
• Example of an ε rule: A → ε.
• Example of a cycle: A → B; B → A.
• ε rules and cycles can be eliminated from a grammar (automatically).
• The idea of the algorithm: arrange the variables in some order A1, A2, …, An.
• Go over the variables in order; for each Ai make every rule of Ai be of the form
• Ai → Ajβ with j > i.
• Why is this enough?
An Algorithm for Left-Recursion Elimination

• Input: Grammar G, possibly left-recursive, with no cycles and no ε productions.
• Output: An equivalent grammar with no left recursion.
• Method: Arrange the nonterminals in some order A1, A2, …, An.

for i := 1 to n do begin
  for s := 1 to i-1 do begin
    replace each production of the form Ai → Asβ by the productions
      Ai → d1β | d2β | … | dkβ
    where As → d1 | d2 | … | dk are all the current As-productions;
  end
  eliminate immediate left recursion among the Ai-productions
end
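The algorithm above can be sketched directly in Python: order the nonterminals, substitute the current productions of earlier nonterminals, then remove immediate recursion. Representation and names are our own; the test grammar S → Aa | b; A → Ac | Sd is the indirect-recursion example without the ε rule (the algorithm's precondition forbids ε rules).

```python
EPS = "eps"

def eliminate_immediate(N, rhss):
    """Remove immediate left recursion among N's productions."""
    alphas = [r[1:] for r in rhss if r[:1] == [N]]
    betas  = [r for r in rhss if r[:1] != [N]]
    if not alphas:
        return {N: rhss}
    Np = N + "'"
    return {N:  [b + [Np] for b in betas],
            Np: [a + [Np] for a in alphas] + [[EPS]]}

def eliminate_left_recursion(order, P):
    """order: the nonterminals A1..An; P: nonterminal -> list of rules."""
    P = {A: [list(r) for r in rhss] for A, rhss in P.items()}
    out = {}
    for i, Ai in enumerate(order):
        for As in order[:i]:
            # replace each Ai -> As beta using the current As-productions
            new = []
            for r in P[Ai]:
                if r[:1] == [As]:
                    new += [d + r[1:] for d in P[As]]
                else:
                    new.append(r)
            P[Ai] = new
        res = eliminate_immediate(Ai, P[Ai])
        P[Ai] = res[Ai]
        out.update(res)
    return out
```

With the order S, A: S's rules are untouched; A → Sd is first expanded to A → Aad | bd, and immediate elimination then yields A → bdA' and A' → cA' | adA' | ε.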
Analysis of the Algorithm

• We show that when the algorithm ends, every derivation rule of the form Ak → Atβ satisfies t > k.
• Invariant 1: when the inner loop finishes for some s (with Ai in the outer loop), all derivation rules of Ai begin with terminals or with variables Aj with j > s.
• Invariant 2: when we finish handling variable Ai, all its derivation rules begin with terminals or with variables Aj with j > i.
• Both invariants are proved together by induction on i and s.
• Conclusion: when the algorithm ends there is no left recursion (direct or indirect) among the original variables; this follows from Invariant 2.
• As for the new variables, they always appear rightmost, so they are never involved in left recursion.
LL(k) Parsers
• Recursive Descent
– Manual construction
– Uses recursion
• Wanted
– A parser that can be generated automatically
– Does not use recursion
LL(k) parsing with pushdown automata
• The pushdown automaton uses
– A stack
– An input stream
– A transition table
• nonterminals × tokens → production rule
• The entry indexed by nonterminal N and token t contains the rule of N that must be used when the current input starts with t
• The initial state:
– The input stream holds the input ($ marks its end).
– The stack starts with "S$" for the initial variable S.
LL(k) parsing with pushdown automata
• Two possible moves
– Prediction:
• When the top of the stack is a nonterminal N and the next token is t: pop N and look up table[N,t]. If table[N,t] is not empty, push the right-hand side of the rule onto the prediction stack; otherwise – syntax error.
– Match:
• When the top of the prediction stack is a terminal T and the next token is t: if (t == T), pop T and consume t; if (t ≠ T), syntax error.
• Parsing terminates when the prediction stack is empty. If the input is empty at that point, success; otherwise, syntax error.
Stack During the Run:

Stack (top on the left): if ( E ) then Stmt else Stmt ; Stmts ; } $
Remaining input: if ( id < id ) then id = id + num else break; id = id * id; …
Example transition table

Rules:
(1) E → LIT
(2) E → ( E OP E )
(3) E → not E
(4) LIT → true
(5) LIT → false
(6) OP → and
(7) OP → or
(8) OP → xor

Nonterminals (rows) × input tokens (columns); each entry says which rule to use:

        (    )    not  true false and  or   xor  $
  E     2         3    1    1
  LIT                  4    5
  OP                             6    7    8
Simple Example

A ➞ aAb | c
Input: aacbb$

Input suffix | Stack content | Move
aacbb$ | A$    | predict(A,a) = A ➞ aAb
aacbb$ | aAb$  | match(a,a)
acbb$  | Ab$   | predict(A,a) = A ➞ aAb
acbb$  | aAbb$ | match(a,a)
cbb$   | Abb$  | predict(A,c) = A ➞ c
cbb$   | cbb$  | match(c,c)
bb$    | bb$   | match(b,b)
b$     | b$    | match(b,b)
$      | $     | match($,$) – success

(Stack top on the left.)
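The run above can be reproduced with a short table-driven parser. This is a sketch for the single-nonterminal grammar A ➞ aAb | c; the rule numbering and predict table mirror the example, while the function and variable names are our own.

```python
# Rule number -> (left-hand side, right-hand side).
RULES = {1: ("A", ["a", "A", "b"]), 2: ("A", ["c"])}
# predict table: (nonterminal, lookahead token) -> rule number.
TABLE = {("A", "a"): 1, ("A", "c"): 2}
TERMINALS = {"a", "b", "c", "$"}

def parse(tokens):
    """Run the two-move LL(1) machine; return (moves, success)."""
    stack = ["A", "$"]                   # prediction stack, top on the left
    tokens = list(tokens) + ["$"]
    pos = 0
    moves = []
    while stack:
        top, tok = stack[0], tokens[pos]
        if top in TERMINALS:             # match move
            if top != tok:
                return moves, False      # match failed: syntax error
            moves.append(f"match({top},{tok})")
            stack.pop(0)
            pos += 1
        else:                            # prediction move
            rule = TABLE.get((top, tok))
            if rule is None:
                return moves, False      # empty table entry: syntax error
            moves.append(f"predict({top},{tok})")
            _, rhs = RULES[rule]
            stack = rhs + stack[1:]      # pop top, push the rule's right side
    return moves, pos == len(tokens)
```

On input aacbb the machine succeeds with exactly the move sequence in the table; on abcbb it stops at the empty entry predict(A,b), as in the "bad word" slide.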
The Transition Table
• Constructing the transition table is not hard.
– It builds on FIRST and FOLLOW.
• You will construct FIRST, FOLLOW, and the table in the exercises.
Simple Example on a Bad Word

A ➞ aAb | c
Input: abcbb$

Input suffix | Stack content | Move
abcbb$ | A$   | predict(A,a) = A ➞ aAb
abcbb$ | aAb$ | match(a,a)
bcbb$  | Ab$  | predict(A,b) = ERROR
Error Handling
• Types of errors:
– Lexical errors (typos)
– Syntax errors (e.g., imbalanced parentheses)
– Semantic errors (e.g., type mismatch)
– Logical errors (an infinite loop, but also use of '=' instead of '==').
• Requirements:
– Report the error clearly.
– Recover and continue, so that more errors can be discovered.
– Be reasonably efficient.
Error Handling and Recovery

x = a * (p+q * ( -b * (r-s);

Where should we report the error?
The valid prefix property.
Recovery is tricky:
Heuristics for dropping tokens, skipping to a semicolon, etc.
Error Handling in LL Parsers

S ➞ a c | b S
Input: c$

Transition table:
     a         b
S    S ➞ a c  S ➞ b S

Input suffix | Stack content | Move
c$ | S$ | predict(S,c) = ERROR

• Now what?
– Predict bS anyway: "missing token b inserted in line XXX"
Error Handling in LL Parsers

S ➞ a c | b S
Input: bc$

Input suffix | Stack content | Move
bc$ | S$  | predict(S,b) = S ➞ bS
bc$ | bS$ | match(b,b)
c$  | S$  | Looks familiar?

• Result: an infinite loop
Error Handling
• Requires more systematic treatment
• Some examples
– Panic mode (or acceptable-set method): drop tokens until reaching a synchronizing token, like a semicolon, a right parenthesis, end of file, etc.
– Phrase-level recovery: attempt local changes: replace "," with ";", eliminate or add a ";", etc.
– Error productions: anticipate common errors and handle them automatically by adding them to the grammar.
– Global correction: find the minimal modification to the program that makes it derivable in the grammar.
• Not a practical solution…
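Panic mode is the simplest of these to implement. The sketch below shows one possible recovery helper (the synchronizing-token set, the message format, and the function name are all our own choices): record the error, then skip forward until a synchronizing token, after which normal parsing can resume and further errors can be found.

```python
# Synchronizing tokens: parsing resumes at one of these ("$" = end of input).
SYNC = {";", "$"}

def recover(tokens, pos, errors, message):
    """Panic-mode recovery: report the error, then skip tokens until a
    synchronizing token and return the position to resume from."""
    errors.append(f"{message} at token {pos}")
    while tokens[pos] not in SYNC:
        pos += 1
    return pos
```

Because "$" is always in the synchronizing set, the skip loop terminates even when no semicolon remains; the parser can then report all collected errors in one run.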
An Example Structure of a Program

[Figure: a tree for a whole program — "program" at the root splits into the main function and more functions; each function consists of declarations (Decls) and statements (Stmts) inside { }; Decls expand into a list of Decl nodes (Type Id ;), and Stmts into a list of Stmt nodes (id = expr ;).]
Summary
• After peeling off the lexical layer, we parse the token sequence to understand the program structure.
• A program's legal structure is accurately described by a CFG.
• Parsing is executed top-down or bottom-up.
• Recursive descent uses recursion, with a function for each variable.
• General grammars may be hard to parse.
• LL(k) grammars can be parsed efficiently (for small k) using a pushdown automaton.
• Grammars that are not LL(k) may sometimes be "fixed" using left-recursion elimination, left factoring, and substitution.
Coming up next time
• Bottom-Up Parsing.