Theory of Compilation 236360
Erez Petrank
Lecture 2: Syntax Analysis, Top-Down Parsing
You are here

Source text (txt) → Compiler → Executable code (exe)

Compiler phases: Lexical Analysis → Syntax Analysis (Parsing) → Semantic Analysis → Inter. Rep. (IR) → Code Gen.
Last Week: from characters to tokens (Using Regular Expressions)
x = b*b - 4*a*c
txt
<ID,”x”> <EQ> <ID,”b”> <MULT> <ID,”b”> <MINUS>
<INT,4> <MULT> <ID,”a”> <MULT> <ID,”c”>
Token Stream
The Lex Tool
• Lex automatically generates a lexical analyzer from a declaration file.
• Advantages: easy to produce a lexical analyzer from a short declaration; easily verified, modified, and maintained.
• Intuitively: Lex builds a DFA; the analyzer simulates the DFA on a given input.

Declaration file → Lex → Lexical Analysis; characters → Lexical Analysis → tokens
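The DFA simulation above can be approximated with a few lines of Python. This is a minimal sketch of what a generated analyzer does (try each token pattern at the current position, emit the matching token), not Lex itself; the token names follow the example on the earlier slide.

```python
import re

# Token patterns, ordered. Names (INT, ID, EQ, MULT, MINUS) are the ones
# used in the token-stream example; SKIP swallows whitespace.
TOKEN_SPEC = [
    ("INT",   r"\d+"),
    ("ID",    r"[A-Za-z_]\w*"),
    ("EQ",    r"="),
    ("MULT",  r"\*"),
    ("MINUS", r"-"),
    ("SKIP",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Return the token stream as (kind, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Running it on the slide's input `x = b*b - 4*a*c` reproduces the token stream `<ID,"x"> <EQ> <ID,"b"> <MULT> …` from slide 3.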
Today: from tokens to AST

Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.

<ID,"b"> <MULT> <ID,"b"> <MINUS> <INT,4> <MULT> <ID,"a"> <MULT> <ID,"c">

[Figure: the syntax tree for b*b - 4*a*c — expression, term, and factor nodes built over the ID and INT leaves, with MULT and MINUS as interior operators.]
Syntax Analysis (Parsing)
• Goal: discover the program structure.
– For example, a C program is built of functions, each function is built from declarations and instructions, each instruction is built from expressions, etc.
– Is a sequence of tokens a valid program in the language?
– Construct a structured representation of the input text.
– Error detection and reporting.
• A simple and accurate method for describing a program structure is context free grammars.
• We will look at families of grammars that can be efficiently parsed.
• The parser reads the token stream, makes sure it is derivable in the grammar (or reports an error), and constructs the derivation tree.
Context free grammars

G = (V, T, P, S)
• V – non-terminals
• T – terminals (tokens for us)
• P – derivation rules
– Each rule of the form V ➞ (T ∪ V)*
• S ∈ V – the initial symbol
Why do we need context free grammars?
• Important program structures cannot be expressed by regular expressions. E.g., balanced parentheses:
– S ➞ SS; S ➞ (S); S ➞ ()
• Anything expressible as a regular expression is expressible by a CFG. Why use regular expressions at all?
– Separation, modularity, simplification.
– No point in using strong (and less efficient) tools on easily analyzable regular expressions.
• Regular expressions describe lexical structures like identifiers, constants, keywords, etc.
• Grammars describe nested structures like balanced parentheses, matched begin-end, if-then-else, etc.
Example

S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

V = { S, E }
T = { id, ':=', ';', '+', '*', '(', ')' }
S is the initial variable.
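The quadruple G = (V, T, P, S) maps directly to plain data. The sketch below writes the example grammar as Python dicts (the representation — lists of symbol strings — is our own choice, not anything the course mandates) and checks that it is well-formed.

```python
# The example grammar G = (V, T, P, S) as plain Python data.
V = {"S", "E"}
T = {"id", ":=", ";", "+", "*", "(", ")"}
P = {
    "S": [["S", ";", "S"], ["id", ":=", "E"]],
    "E": [["id"], ["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"]],
}
S = "S"

def well_formed(V, T, P, S):
    """Every rule's left side is a nonterminal, every right-side symbol
    is a terminal or a nonterminal, and the start symbol is in V."""
    return (S in V
            and all(A in V for A in P)
            and all(sym in V or sym in T
                    for rhss in P.values() for rhs in rhss for sym in rhs))
```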
Terminology
• Derivation: a sequence of replacements of non-terminals using the derivation rules.
• Language: the set of strings of terminals derivable from the initial symbol.
• Sentential form (תבנית פסוקית): the result of a partial derivation, which may still contain non-terminals.
Derivation Example

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

S
S ; S
id := E ; S
id := id ; S
id := id ; id := E
id := id ; id := E + E
id := id ; id := E + id
id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ id, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id
Parse Tree

Input: x := z ; y := x + z
Derivation: S ⇒ S;S ⇒ id := E ; S ⇒ id := id ; S ⇒ id := id ; id := E ⇒ id := id ; id := E + E ⇒ id := id ; id := E + id ⇒ id := id ; id := id + id

[Figure: the parse tree — S at the root with children S, ';', S; the left S derives id := E with E ➞ id; the right S derives id := E with E ➞ E + E, and both inner E's derive id.]
Questions
• How did we know which rule to apply at every step?
• Does it matter?
• Would we always get the same result?
Ambiguity

Input: x := y+z*w
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

[Figure: two different parse trees for the same input — one with E ➞ E + E at the top, grouping as y+(z*w), and one with E ➞ E * E at the top, grouping as (y+z)*w.]
Leftmost/rightmost Derivation
• Leftmost derivation
– always expand the leftmost non-terminal
• Rightmost derivation
– always expand the rightmost non-terminal
• Allows us to describe a derivation by listing the sequence of rules only.
– we always know which symbol a rule is applied to
• Note that this does not necessarily resolve ambiguity (e.g., previous slide).
• These are the derivation orders applied in our parsers (coming soon).
Leftmost Derivation

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

S
S ; S
id := E ; S
id := id ; S
id := id ; id := E
id := id ; id := E + E
id := id ; id := id + E
id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ id, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id
Rightmost Derivation

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

S
S ; S
S ; id := E
S ; id := E + E
S ; id := E + id
S ; id := id + id
id := E ; id := id + id
id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id, S ➞ id := E, E ➞ id
Bottom-up Example

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

id := id ; id := id + id
id := E ; id := id + id
S ; id := id + id
S ; id := E + id
S ; id := E + E
S ; id := E
S ; S
S

Rules applied: E ➞ id, S ➞ id := E, E ➞ id, E ➞ id, E ➞ E + E, S ➞ id := E, S ➞ S;S

Bottom-up, reducing the leftmost alternative at every step = the rightmost derivation, read in reverse.
Parsing
• A context free language can be recognized by a non-deterministic pushdown automaton
– But not necessarily by a deterministic one…
• Parsing can be seen as a search problem
– Can you find a derivation from the start symbol to the input word?
– Easy (but very expensive) to solve with backtracking
• The Cocke-Younger-Kasami parser can parse any context-free language, but has complexity O(n³)
– Imagine a program with hundreds of thousands of lines of code.
• We want efficient parsers
– Linear in input size
– Deterministic pushdown automata
– We will sacrifice generality for efficiency
"Brute-force" Parsing

Input: x := z ; y := x + z
Grammar: S ➞ S;S   S ➞ id := E   E ➞ id | E + E | E * E | ( E )

id := id ; id := id + id
  (E ➞ id)  id := E ; id := id + id
  (E ➞ id)  id := id ; id := E + id
  …

(not a parse tree… a search for the parse tree by exhaustively applying all rules)
Efficient Parsers
• Top-down (predictive)
– Construct the leftmost derivation
– Apply rules "from left to right"
– Predict which rule to apply based on the nonterminal and the next token
• Bottom-up (shift reduce)
– Construct the rightmost derivation
– Apply rules "from right to left"
– Reduce a right-hand side of a production to its non-terminal
Efficient Parsers
• Top-down (predictive parsing)
• Bottom-up (shift reduce)

[Figure: both parsers scan the input left to right — part already read, part still to be read.]
Top-down Parsing
• Given a grammar G=(V,T,P,S) and a word w
• Goal: derive w using G
• Idea
– Apply a production to the leftmost nonterminal
– Pick the production rule based on the next input token
• General grammar
– More than one option for choosing the next production based on a token
• Restricted grammars (LL)
– Know exactly which single rule to apply
– May require some lookahead to decide
An Easily Parse-able Grammar

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor

Input: not (not true or false)

E => not E
  => not ( E OP E )
  => not ( not E OP E )
  => not ( not LIT OP E )
  => not ( not true OP E )
  => not ( not true or E )
  => not ( not true or LIT )
  => not ( not true or false )

The production to apply is known from the next input token.

[Figure: the parse tree for not ( not true or false ).]

At any stage, looking at the current variable and the next input token, the rule can be easily determined.
LL(k) Grammars
• A grammar is in the class LL(k) when it can be derived via:
– Top-down derivation
– Scanning the input from left to right (L)
– Producing the leftmost derivation (L)
– With lookahead of k tokens (k)
• A language is said to be LL(k) when it has an LL(k) grammar
Recursive Descent Parsing
• Define a function for every nonterminal
• Every function simulates the derivation of the variable it represents:
– Find an applicable production rule
– For a terminal, check the match with the next input token
– For a nonterminal, call (recursively) its function
• If there are several applicable productions for a nonterminal, use lookahead
Matching tokens

• The variable current holds the current input token

void match(token t) {
  if (current == t)
    current = next_token();
  else
    error;
}
Functions for nonterminals

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor

void E() {
  if (current ∈ {TRUE, FALSE}) {       // E → LIT
    LIT();
  } else if (current == LPAREN) {      // E → ( E OP E )
    match(LPAREN); E(); OP(); E(); match(RPAREN);
  } else if (current == NOT) {         // E → not E
    match(NOT); E();
  } else
    error;
}
Functions for nonterminals

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor

void LIT() {
  if (current == TRUE)
    match(TRUE);
  else if (current == FALSE)
    match(FALSE);
  else
    error;
}
Functions for nonterminals

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor

void OP() {
  if (current == AND)
    match(AND);
  else if (current == OR)
    match(OR);
  else if (current == XOR)
    match(XOR);
  else
    error;
}
Overall: Functions for Grammar

E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor

void E() {
  if (current ∈ {TRUE, FALSE}) {
    LIT();
  } else if (current == LPAREN) {
    match(LPAREN); E(); OP(); E(); match(RPAREN);
  } else if (current == NOT) {
    match(NOT); E();
  } else
    error;
}

void LIT() {
  if (current == TRUE) match(TRUE);
  else if (current == FALSE) match(FALSE);
  else error;
}

void OP() {
  if (current == AND) match(AND);
  else if (current == OR) match(OR);
  else if (current == XOR) match(XOR);
  else error;
}
Adding semantic actions
• Can add an action to perform on each production rule, simply by executing it when the function is invoked.
• For example, can build the parse tree
– Every function returns an object of type Node
– Every Node maintains a list of children
– Function calls add new children
Building the parse tree

Node E() {
  result = new Node();
  result.name = "E";
  if (current ∈ {TRUE, FALSE}) {       // E → LIT
    result.addChild(LIT());
  } else if (current == LPAREN) {      // E → ( E OP E )
    result.addChild(match(LPAREN));
    result.addChild(E());
    result.addChild(OP());
    result.addChild(E());
    result.addChild(match(RPAREN));
  } else if (current == NOT) {         // E → not E
    result.addChild(match(NOT));
    result.addChild(E());
  } else
    error;
  return result;
}
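The tree-building recursive descent parser can be written out as a runnable Python sketch for the grammar E ➞ LIT | (E OP E) | not E; LIT ➞ true | false; OP ➞ and | or | xor. Class and token names here are our own choices, not part of the slides.

```python
class Node:
    """A parse-tree node: a name plus an ordered list of children."""
    def __init__(self, name):
        self.name, self.children = name, []
    def add(self, child):
        self.children.append(child)

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # $ marks end of input
        self.pos = 0
    @property
    def current(self):
        return self.tokens[self.pos]
    def match(self, t):
        if self.current != t:
            raise SyntaxError(f"expected {t}, got {self.current}")
        self.pos += 1
        return Node(t)                       # leaf node for the token
    def E(self):
        result = Node("E")
        if self.current in ("true", "false"):    # E -> LIT
            result.add(self.LIT())
        elif self.current == "(":                # E -> ( E OP E )
            result.add(self.match("("))
            result.add(self.E())
            result.add(self.OP())
            result.add(self.E())
            result.add(self.match(")"))
        elif self.current == "not":              # E -> not E
            result.add(self.match("not"))
            result.add(self.E())
        else:
            raise SyntaxError(f"unexpected {self.current}")
        return result
    def LIT(self):
        if self.current not in ("true", "false"):
            raise SyntaxError(f"unexpected {self.current}")
        result = Node("LIT")
        result.add(self.match(self.current))
        return result
    def OP(self):
        if self.current not in ("and", "or", "xor"):
            raise SyntaxError(f"unexpected {self.current}")
        result = Node("OP")
        result.add(self.match(self.current))
        return result

def parse(text):
    p = Parser(text.split())
    tree = p.E()
    p.match("$")                             # whole input must be consumed
    return tree
```

Calling `parse("not ( not true or false )")` returns a root "E" node whose children mirror the derivation on the earlier slide.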
Getting Back to the Example

• Input = "( not true and false )";
  Node treeRoot = E();

[Figure: the resulting parse tree — E with children '(', E, OP, E, ')'; the first inner E derives not LIT (true), OP derives and, and the second inner E derives LIT (false).]
Recursive Descent

• How do you pick the right A-production?
• Generally – try them all and use backtracking (costly).
• In our case – use lookahead.

In its basic form, each variable has a procedure that looks like:

void A() {
  choose an A-production, A → X1X2…Xk;
  for (i=1; i ≤ k; i++) {
    if (Xi is a nonterminal)
      call procedure Xi();
    else if (Xi == current)
      advance input;
    else
      report error;
  }
}
Recursive Descent: a problem

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

• With lookahead 1, the function for indexed_elem will never be tried…
– What happens for input of the form
• ID [ expr ]
Recursive Descent: Another Problem

S ➞ A a b
A ➞ a | ε

Bool S() {
  return A() && match(token('a')) && match(token('b'));
}
Bool A() {
  if (current == 'a')
    return match(token('a'));
  else
    return true;
}

What happens for input "ab"? What happens if you flip the order of the alternatives and try "aab"?
Recursive descent: a third problem

E ➞ E – term | term

Bool E() {
  return E() && match(token('-')) && term()
         || term();
}

What happens with this procedure? E() calls E() before consuming any input — infinite recursion. Recursive descent parsers cannot handle left-recursive grammars.
3 Bad Examples for Recursive Descent

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

S ➞ A a b
A ➞ a | ε

E ➞ E – term | term

Can we make them work?
The "FIRST" Sets
• To formalize the property (of a grammar) that we can determine a rule using a single lookahead, we define the FIRST sets.
• For every production rule A ➞ 𝞪
– FIRST(𝞪) = all terminals that 𝞪 can start with
– i.e., every token that can appear first under some derivation of 𝞪
• No intersection between the FIRST sets => can pick a single rule
• In our Boolean expressions example
– FIRST(LIT) = { true, false }
– FIRST( ( E OP E ) ) = { '(' }
– FIRST( not E ) = { not }

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor
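FIRST sets are computed by a standard fixpoint iteration: keep propagating terminals through the rules until nothing changes. The sketch below applies it to the Boolean-expression grammar (the representation is our own; "eps" stands for ε and matters only for nullable symbols, which this grammar does not have).

```python
EPS = "eps"

# The Boolean-expression grammar from the slide.
P = {
    "E":   [["LIT"], ["(", "E", "OP", "E", ")"], ["not", "E"]],
    "LIT": [["true"], ["false"]],
    "OP":  [["and"], ["or"], ["xor"]],
}

def first_sets(P):
    """Fixpoint computation of FIRST for every nonterminal in P."""
    nonterms = set(P)
    first = {A: set() for A in P}
    changed = True
    while changed:
        changed = False
        for A, rhss in P.items():
            for rhs in rhss:
                for sym in rhs:
                    if sym not in nonterms:      # a terminal starts the rule
                        if sym not in first[A]:
                            first[A].add(sym)
                            changed = True
                        break
                    add = first[sym] - {EPS}     # nonterminal: copy its FIRST
                    if not add <= first[A]:
                        first[A] |= add
                        changed = True
                    if EPS not in first[sym]:
                        break
                else:                            # every symbol was nullable
                    if EPS not in first[A]:
                        first[A].add(EPS)
                        changed = True
    return first
```

For this grammar the result matches the slide: FIRST(LIT) = {true, false}, and the three rules of E get the disjoint starter sets {true, false}, {'('}, {not}.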
The "FIRST" Sets
• No intersection between the FIRST sets => can pick a single rule
• If the FIRST sets intersect, we may need a longer lookahead
– LL(k) = the class of grammars in which the production rule can be determined using a lookahead of k tokens
– LL(1) is an important and useful class
The FOLLOW Sets
• FIRST is not enough when variables can be nullified.
• Consider: S ➞ AB | c ;  A ➞ a | ε ;  B ➞ b
• Need to know what comes afterwards to select the right production
• For any non-terminal A
– FOLLOW(A) = the set of tokens that can immediately follow A
• Can select the rule N ➞ 𝞪 with lookahead "b" if
– b ∈ FIRST(𝞪), or
– 𝞪 may be nullified and b ∈ FOLLOW(N).
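FOLLOW is computed with a second fixpoint pass over the rules, once FIRST is known. The sketch below runs it on the slide's grammar S ➞ AB | c; A ➞ a | ε; B ➞ b (representation and helper names are our own; "$" marks end of input).

```python
EPS, END = "eps", "$"

# The slide's grammar; the empty list encodes A -> eps.
P = {"S": [["A", "B"], ["c"]], "A": [["a"], []], "B": [["b"]]}

def first_of(seq, first):
    """FIRST of a sequence of symbols; a terminal's FIRST is itself."""
    out = set()
    for sym in seq:
        s = first.get(sym, {sym})
        out |= s - {EPS}
        if EPS not in s:
            return out
    out.add(EPS)                     # the whole sequence can be nullified
    return out

def first_sets(P):
    first = {A: set() for A in P}
    changed = True
    while changed:
        changed = False
        for A, rhss in P.items():
            for rhs in rhss:
                new = first_of(rhs, first)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

def follow_sets(P, start):
    first = first_sets(P)
    follow = {A: set() for A in P}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for A, rhss in P.items():
            for rhs in rhss:
                for i, sym in enumerate(rhs):
                    if sym not in P:         # terminals have no FOLLOW
                        continue
                    tail = first_of(rhs[i+1:], first)
                    add = (tail - {EPS}) | (follow[A] if EPS in tail else set())
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow
```

Here FOLLOW(A) = {b}: with lookahead b we pick A ➞ ε, and with lookahead a we pick A ➞ a, exactly the selection rule stated above.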
Back to our 1st example

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

• FIRST(ID) = { ID }
• FIRST(indexed_elem) = { ID }
• FIRST/FIRST conflict
• This grammar is not LL(1). Can we "fix" it?
Left factoring
• Rewrite into an equivalent grammar that is LL(1)

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

becomes

term ➞ ID after_ID
after_ID ➞ [ expr ] | ε

Intuition: just like factoring x*y + x*z into x*(y+z)
Left factoring – another example

S ➞ if E then S else S | if E then S | T

becomes

S ➞ if E then S S' | T
S' ➞ else S | ε
Back to our 2nd example

S ➞ A a b
A ➞ a | ε

• Select a rule for A with 'a' in the lookahead:
– Should we pick (1) A ➞ a or (2) A ➞ ε?
• (1) FIRST(a) = { 'a' } (and a cannot be nullified).
• (2) FIRST(ε) = ∅. Also, ε can (must) be nullified, and FOLLOW(A) = { 'a' }.
• FIRST/FOLLOW conflict
• The grammar is not LL(1).
An Equivalent Grammar via Substitution

S ➞ A a b
A ➞ a | ε

Substitute A in S:

S ➞ a a b | a b

Left factoring:

S ➞ a after_a
after_a ➞ a b | b
So Far
• We have tools to determine whether a grammar is LL(1)
– The FIRST and FOLLOW sets.
– The exercises will provide algorithms for finding and using them.
• We have some techniques for modifying a grammar to find an equivalent one that is LL(1).
– Left factoring,
– Substitution.
• Now let's look at the 3rd example and present one more such technique.
Back to our 3rd example

E ➞ E – term | term

• Left recursion cannot be handled with a bounded lookahead.
• What can we do?
• Any grammar with left recursion has an equivalent grammar with no left recursion.
Left Recursion Elimination

G1: N ➞ Nα | β        G2: N ➞ βN'
                          N' ➞ αN' | ε

• L(G1) = β, βα, βαα, βααα, …
• L(G2) = the same

For our 3rd example:

E ➞ E – term | term    becomes    E ➞ term TE
                                  TE ➞ – term TE | ε
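The G1 → G2 transformation is mechanical, so it is easy to sketch in code. The function below eliminates direct left recursion from one nonterminal's rules (the list-of-symbol-strings representation and the primed-name convention are our own choices) and is run on the 3rd example, E ➞ E - term | term.

```python
EPS = "eps"

def eliminate_direct(N, rhss):
    """Replace N -> N a1 | ... | N an | b1 | ... | bm  with
    N -> b1 N' | ... | bm N'  and  N' -> a1 N' | ... | an N' | eps."""
    alphas = [rhs[1:] for rhs in rhss if rhs and rhs[0] == N]   # N -> N alpha
    betas  = [rhs for rhs in rhss if not rhs or rhs[0] != N]    # the rest
    if not alphas:
        return {N: rhss}              # nothing to do
    Np = N + "'"
    return {
        N:  [beta + [Np] for beta in betas],
        Np: [alpha + [Np] for alpha in alphas] + [[EPS]],
    }

rules = eliminate_direct("E", [["E", "-", "term"], ["term"]])
```

The result is exactly the grammar on the slide: E ➞ term TE and TE ➞ - term TE | ε (with TE spelled E').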
Elimination of Left Recursion

Eliminating direct recursion: replace the rules
• A → Aα1 | Aα2 | ··· | Aαn | β1 | β2 | ··· | βm
with the rules
• A → β1A' | β2A' | ··· | βmA'
• A' → α1A' | α2A' | ··· | αnA' | ε
Note that the method does not work if some αi is empty, and it may create indirect left recursion if some βi is empty:
• If αi is empty, we get direct left recursion on A'.
• If βi is empty, indirect left recursion may arise when some αj starts with A: A → A' together with A' → A….
Elimination of Left Recursion

The replacement of the rules removes direct recursion. We must also handle indirect recursion. For example:
• S → Aa | b
• A → Ac | Sd | ε
For that, the algorithm is a bit more complex.
An Algorithm for Eliminating (Direct and Indirect) Left Recursion from a Grammar

• Input: a grammar G, possibly with left recursion, but with no cycles and no ε rules.
• Output: an equivalent grammar with no left recursion.
• Example of an ε rule: A → ε.
• Example of a cycle: A → B; B → A.
• ε rules and cycles can be eliminated from a grammar (automatically).
• The idea of the algorithm: arrange the variables in some order A1, A2, …, An.
• Go over the variables in order; for each Ai make every rule of Ai be of the form
• Ai → Ajβ with j > i.
• Why is this enough?
An Algorithm for Left-Recursion Elimination

• Input: Grammar G, possibly left-recursive, with no cycles and no ε productions.
• Output: An equivalent grammar with no left recursion.
• Method: Arrange the nonterminals in some order A1, A2, …, An.

for i := 1 to n do begin
  for s := 1 to i-1 do begin
    replace each production of the form Ai → Asβ by the productions
      Ai → d1β | d2β | … | dkβ
    where As → d1 | d2 | … | dk are all the current As-productions;
  end
  eliminate immediate left recursion among the Ai-productions
end
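The algorithm above can be sketched directly in Python: order the nonterminals, substitute the current productions of earlier nonterminals, then remove immediate recursion. Representation and names are our own; the test grammar S → Aa | b; A → Ac | Sd is the indirect-recursion example without the ε rule (the algorithm's precondition forbids ε rules).

```python
EPS = "eps"

def eliminate_immediate(N, rhss):
    """Remove immediate left recursion among N's productions."""
    alphas = [r[1:] for r in rhss if r[:1] == [N]]
    betas  = [r for r in rhss if r[:1] != [N]]
    if not alphas:
        return {N: rhss}
    Np = N + "'"
    return {N:  [b + [Np] for b in betas],
            Np: [a + [Np] for a in alphas] + [[EPS]]}

def eliminate_left_recursion(order, P):
    """order: the nonterminals A1..An; P: nonterminal -> list of rules."""
    P = {A: [list(r) for r in rhss] for A, rhss in P.items()}
    out = {}
    for i, Ai in enumerate(order):
        for As in order[:i]:
            # replace each Ai -> As beta using the current As-productions
            new = []
            for r in P[Ai]:
                if r[:1] == [As]:
                    new += [d + r[1:] for d in P[As]]
                else:
                    new.append(r)
            P[Ai] = new
        res = eliminate_immediate(Ai, P[Ai])
        P[Ai] = res[Ai]
        out.update(res)
    return out
```

With the order S, A: S's rules are untouched; A → Sd is first expanded to A → Aad | bd, and immediate elimination then yields A → bdA' and A' → cA' | adA' | ε.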
Analysis of the Algorithm

• We show that when the algorithm ends, every derivation rule of the form Ak → Atβ satisfies t > k.
• Invariant 1: when the inner loop finishes for some s (with Ai in the outer loop), all derivation rules of Ai begin with terminals or with variables Aj with j > s.
• Invariant 2: when we finish handling variable Ai, all its derivation rules begin with terminals or with variables Aj with j > i.
• Both invariants are proved together by induction on i and s.
• Conclusion: when the algorithm ends there is no left recursion (direct or indirect) among the original variables; this follows from Invariant 2.
• As for the new variables, they always appear rightmost, so they are never involved in left recursion.
LL(k) Parsers
• Recursive Descent
– Manual construction
– Uses recursion
• Wanted
– A parser that can be generated automatically
– Does not use recursion
LL(k) parsing with pushdown automata
• The pushdown automaton uses
– A stack
– An input stream
– A transition table
• nonterminals × tokens → production rule
• The entry indexed by nonterminal N and token t contains the rule of N that must be used when the current input starts with t
• The initial state:
– The input stream holds the input ($ marks its end).
– The stack starts with "S$" for the initial variable S.
LL(k) parsing with pushdown automata
• Two possible moves
– Prediction:
• When the top of the stack is a nonterminal N and the next token is t: pop N and look up table[N,t]. If table[N,t] is not empty, push the right-hand side of the rule onto the prediction stack; otherwise – syntax error.
– Match:
• When the top of the prediction stack is a terminal T and the next token is t: if (t == T), pop T and consume t; if (t ≠ T), syntax error.
• Parsing terminates when the prediction stack is empty. If the input is empty at that point, success; otherwise, syntax error.
Stack During the Run:

Stack (top on the left): if ( E ) then Stmt else Stmt ; Stmts ; } $
Remaining input: if ( id < id ) then id = id + num else break; id = id * id; …
Example transition table

Rules:
(1) E → LIT
(2) E → ( E OP E )
(3) E → not E
(4) LIT → true
(5) LIT → false
(6) OP → and
(7) OP → or
(8) OP → xor

Nonterminals (rows) × input tokens (columns); each entry says which rule to use:

        (    )    not  true false and  or   xor  $
  E     2         3    1    1
  LIT                  4    5
  OP                             6    7    8
Simple Example

A ➞ aAb | c
Input: aacbb$

Input suffix | Stack content | Move
aacbb$ | A$    | predict(A,a) = A ➞ aAb
aacbb$ | aAb$  | match(a,a)
acbb$  | Ab$   | predict(A,a) = A ➞ aAb
acbb$  | aAbb$ | match(a,a)
cbb$   | Abb$  | predict(A,c) = A ➞ c
cbb$   | cbb$  | match(c,c)
bb$    | bb$   | match(b,b)
b$     | b$    | match(b,b)
$      | $     | match($,$) – success

(Stack top on the left.)
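The run above can be reproduced with a short table-driven parser. This is a sketch for the single-nonterminal grammar A ➞ aAb | c; the rule numbering and predict table mirror the example, while the function and variable names are our own.

```python
# Rule number -> (left-hand side, right-hand side).
RULES = {1: ("A", ["a", "A", "b"]), 2: ("A", ["c"])}
# predict table: (nonterminal, lookahead token) -> rule number.
TABLE = {("A", "a"): 1, ("A", "c"): 2}
TERMINALS = {"a", "b", "c", "$"}

def parse(tokens):
    """Run the two-move LL(1) machine; return (moves, success)."""
    stack = ["A", "$"]                   # prediction stack, top on the left
    tokens = list(tokens) + ["$"]
    pos = 0
    moves = []
    while stack:
        top, tok = stack[0], tokens[pos]
        if top in TERMINALS:             # match move
            if top != tok:
                return moves, False      # match failed: syntax error
            moves.append(f"match({top},{tok})")
            stack.pop(0)
            pos += 1
        else:                            # prediction move
            rule = TABLE.get((top, tok))
            if rule is None:
                return moves, False      # empty table entry: syntax error
            moves.append(f"predict({top},{tok})")
            _, rhs = RULES[rule]
            stack = rhs + stack[1:]      # pop top, push the rule's right side
    return moves, pos == len(tokens)
```

On input aacbb the machine succeeds with exactly the move sequence in the table; on abcbb it stops at the empty entry predict(A,b), as in the "bad word" slide.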
The Transition Table
• Constructing the transition table is not hard.
– It builds on FIRST and FOLLOW.
• You will construct FIRST, FOLLOW, and the table in the exercises.
Simple Example on a Bad Word

A ➞ aAb | c
Input: abcbb$

Input suffix | Stack content | Move
abcbb$ | A$   | predict(A,a) = A ➞ aAb
abcbb$ | aAb$ | match(a,a)
bcbb$  | Ab$  | predict(A,b) = ERROR
Error Handling
• Types of errors:
– Lexical errors (typos)
– Syntax errors (e.g., imbalanced parentheses)
– Semantic errors (e.g., type mismatch)
– Logical errors (an infinite loop, but also use of '=' instead of '==').
• Requirements:
– Report the error clearly.
– Recover and continue, so that more errors can be discovered.
– Be reasonably efficient.
Error Handling and Recovery

x = a * (p+q * ( -b * (r-s);

Where should we report the error?
The valid prefix property.
Recovery is tricky:
Heuristics for dropping tokens, skipping to a semicolon, etc.
Error Handling in LL Parsers

S ➞ a c | b S
Input: c$

Transition table:
     a         b
S    S ➞ a c  S ➞ b S

Input suffix | Stack content | Move
c$ | S$ | predict(S,c) = ERROR

• Now what?
– Predict bS anyway: "missing token b inserted in line XXX"
Error Handling in LL Parsers

S ➞ a c | b S
Input: bc$

Input suffix | Stack content | Move
bc$ | S$  | predict(S,b) = S ➞ bS
bc$ | bS$ | match(b,b)
c$  | S$  | Looks familiar?

• Result: an infinite loop
Error Handling
• Requires more systematic treatment
• Some examples
– Panic mode (or acceptable-set method): drop tokens until reaching a synchronizing token, like a semicolon, a right parenthesis, end of file, etc.
– Phrase-level recovery: attempt local changes: replace "," with ";", eliminate or add a ";", etc.
– Error productions: anticipate common errors and handle them automatically by adding them to the grammar.
– Global correction: find the minimal modification to the program that makes it derivable in the grammar.
• Not a practical solution…
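Panic mode is the simplest of these to implement. The sketch below shows one possible recovery helper (the synchronizing-token set, the message format, and the function name are all our own choices): record the error, then skip forward until a synchronizing token, after which normal parsing can resume and further errors can be found.

```python
# Synchronizing tokens: parsing resumes at one of these ("$" = end of input).
SYNC = {";", "$"}

def recover(tokens, pos, errors, message):
    """Panic-mode recovery: report the error, then skip tokens until a
    synchronizing token and return the position to resume from."""
    errors.append(f"{message} at token {pos}")
    while tokens[pos] not in SYNC:
        pos += 1
    return pos
```

Because "$" is always in the synchronizing set, the skip loop terminates even when no semicolon remains; the parser can then report all collected errors in one run.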
An Example Structure of a Program

[Figure: a tree for a whole program — "program" at the root splits into the main function and more functions; each function consists of declarations (Decls) and statements (Stmts) inside { }; Decls expand into a list of Decl nodes (Type Id ;), and Stmts into a list of Stmt nodes (id = expr ;).]
Summary
• After peeling off the lexical layer, we parse the token sequence to understand the program structure.
• A program's legal structure is accurately described by a CFG.
• Parsing is executed top-down or bottom-up.
• Recursive descent uses recursion, with a function for each variable.
• General grammars may be hard to parse.
• LL(k) grammars can be parsed efficiently (for small k) using a pushdown automaton.
• Grammars that are not LL(k) may sometimes be "fixed" using left-recursion elimination, left factoring, and substitution.
Coming up next time
• Bottom-Up Parsing.