Topic #4: Syntactic Analysis (Parsing)

EE 456 – Compiling Techniques

Prof. Carl Sable

Fall 2003

Lexical Analyzer and Parser

Parser

• Accepts string of tokens from lexical analyzer (usually one token at a time)

• Verifies whether or not string can be generated by grammar

• Reports syntax errors (recovers if possible)

Errors

• Lexical errors (e.g. misspelled word)

• Syntax errors (e.g. unbalanced parentheses, missing semicolon)

• Semantic errors (e.g. type errors)

• Logical errors (e.g. infinite recursion)

Error Handling

• Report errors clearly and accurately

• Recover quickly if possible

• Poor error recovery may lead to an avalanche of errors

Error Recovery

• Panic mode: discard tokens one at a time until a synchronizing token is found

• Phrase-level recovery: Perform local correction that allows parsing to continue

• Error Productions: Augment grammar to handle predicted, common errors

• Global correction: Use a complex algorithm to compute least-cost sequence of changes leading to parseable code

Context Free Grammars

• CFGs can represent recursive constructs that regular expressions can not

• A CFG consists of:
– Tokens (terminals, symbols)
– Nonterminals (syntactic variables denoting sets of strings)
– Productions (rules specifying how terminals and nonterminals can combine to form strings)
– A start symbol (the set of strings it denotes is the language of the grammar)

Derivations (Part 1)

• One definition of language: the set of strings that have valid parse trees

• Another definition: the set of strings that can be derived from the start symbol

E → E + E | E * E | (E) | – E | id

E => -E (read E derives –E)

E => -E => -(E) => -(id)

Derivations (Part 2)

• αAβ => αγβ if A → γ is a production and α and β are arbitrary strings of grammar symbols

• If α1 => α2 => … => αn, we say α1 derives αn

• => means derives in one step
• *=> means derives in zero or more steps
• +=> means derives in one or more steps

Sentences and Languages

• Let L(G) be the language generated by the grammar G with start symbol S:
– Strings in L(G) may contain only tokens of G
– A string w is in L(G) if and only if S +=> w
– Such a string w is a sentence of G

• Any language that can be generated by a CFG is said to be a context-free language

• If two grammars generate the same language, they are said to be equivalent

Sentential Forms

• If S *=> α, where α may contain nonterminals, we say that α is a sentential form of G

• A sentence is a sentential form with no nonterminals

Leftmost Derivations

• Only the leftmost nonterminal in any sentential form is replaced at each step

• A leftmost step can be written as wAγ lm=> wδγ
– A → δ is a production
– w consists of only terminals
– γ is a string of grammar symbols

• If α derives β by a leftmost derivation, then we write α lm*=> β

• If S lm*=> α, then we say that α is a left-sentential form of the grammar

• Analogous terms exist for rightmost derivations

Parse Trees

• A parse tree can be viewed as a graphical representation of a derivation

• Every parse tree has a unique leftmost derivation (not true of every sentence)

• An ambiguous grammar has:
– more than one parse tree for at least one sentence
– more than one leftmost derivation for at least one sentence

Capability of Grammars

• Can describe most programming language constructs

• An exception: requiring that variables are declared before they are used
– Therefore, grammar accepts superset of actual language
– Later phase (semantic analysis) does type checking

Regular Expressions vs. CFGs

• Every construct that can be described by an RE can also be described by a CFG

• Why use REs at all?
– Lexical rules are simpler to describe this way
– REs are often easier to read
– More efficient lexical analyzers can be constructed

Verifying Grammars

• A proof that a grammar generates a language has two parts:
– Must show that every string generated by the grammar is part of the language
– Must show that every string that is part of the language can be generated by the grammar

• Rarely done for complete programming languages!

Eliminating Ambiguity (1)

stmt → if expr then stmt

| if expr then stmt else stmt

| other

if E1 then if E2 then S1 else S2

Eliminating Ambiguity (2)

Eliminating Ambiguity (3)

stmt → matched
     | unmatched

matched → if expr then matched else matched
        | other

unmatched → if expr then stmt
          | if expr then matched else unmatched

Left Recursion

• A grammar is left recursive if there exists a nonterminal A and a derivation A +=> Aα for some string α

• Most top-down parsing methods can not handle left-recursive grammars

Eliminating Left Recursion (1)

A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn

A → β1A’ | β2A’ | … | βnA’
A’ → α1A’ | α2A’ | … | αmA’ | ε

Harder case:
S → Aa | b
A → Ac | Sd | ε
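
The immediate case of this rule is mechanical enough to sketch in code. The following Python helper is hypothetical (not from the slides); it represents each production body as a tuple of symbols, with the empty tuple standing for ε:

```python
def eliminate_immediate_left_recursion(nt, productions):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn as
    A -> b1 A' | ... | bn A' and A' -> a1 A' | ... | am A' | eps.
    Bodies are tuples of symbol strings; () stands for epsilon."""
    new_nt = nt + "'"
    alphas = [p[1:] for p in productions if p and p[0] == nt]  # left-recursive bodies
    betas = [p for p in productions if not p or p[0] != nt]    # the rest
    if not alphas:                                             # nothing to eliminate
        return {nt: productions}
    return {
        nt: [beta + (new_nt,) for beta in betas],
        new_nt: [alpha + (new_nt,) for alpha in alphas] + [()],
    }

# E -> E + T | T  becomes  E -> T E',  E' -> + T E' | eps
print(eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
```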

Eliminating Left Recursion (2)

• First arrange the nonterminals in some order A1, A2, … An

• Apply the following algorithm:

for i = 1 to n {
    for j = 1 to i-1 {
        replace each production of the form Ai → Ajγ
        by the productions Ai → δ1γ | δ2γ | … | δkγ,
        where Aj → δ1 | δ2 | … | δk are the current Aj productions
    }
    eliminate the immediate left recursion among the Ai productions
}

Left Factoring

• Rewriting productions to delay decisions

• Helpful for predictive parsing

• Not guaranteed to remove ambiguity

A → αβ1 | αβ2

A → αA’
A’ → β1 | β2
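
As a sketch, a single left-factoring pass can be written as follows. This is hypothetical code and a simplification: it factors only on the first symbol, whereas full left factoring repeats the pass and factors the longest common prefix:

```python
def left_factor(nt, productions):
    """One pass of left factoring on the first symbol only: productions
    sharing a first symbol x are merged into A -> x A', with A' holding
    the remainders. Bodies are tuples; () stands for epsilon."""
    groups = {}
    for p in productions:
        groups.setdefault(p[:1], []).append(p)
    result = {nt: []}
    fresh = 0
    for prefix, group in groups.items():
        if len(group) == 1:
            result[nt].append(group[0])       # no common prefix: keep as-is
        else:
            fresh += 1
            new_nt = nt + "'" * fresh         # fresh nonterminal A', A'', ...
            result[nt].append(prefix + (new_nt,))
            result[new_nt] = [p[1:] for p in group]
    return result

# A -> a b1 | a b2  becomes  A -> a A',  A' -> b1 | b2
print(left_factor("A", [("a", "b1"), ("a", "b2")]))
```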

Limitations of CFGs

• Can not verify repeated strings
– Example: L1 = {wcw | w is in (a|b)*}
– Abstracts checking that variables are declared before they are used

• Can not verify repeated counts
– Example: L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
– Abstracts checking that the numbers of formal and actual parameters are equal

• Therefore, some checks put off until semantic analysis

Top Down Parsing

• Can be viewed two ways:
– Attempt to find leftmost derivation for input string
– Attempt to create parse tree, starting at the root, creating nodes in preorder

• General form is recursive descent parsing
– May require backtracking
– Backtracking parsers not used frequently because not needed

Predictive Parsing

• A special case of recursive-descent parsing that does not require backtracking

• Must always know which production to use based on current input symbol

• Can often create an appropriate grammar by:
– removing left recursion
– left factoring the resulting grammar
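
A recursive-descent predictive parser for the usual expression grammar (E → TE’, E’ → +TE’ | ε, T → FT’, T’ → *FT’ | ε, F → (E) | id) might look like the sketch below, with one procedure per nonterminal and every decision made on a single lookahead token. The class and token representation are illustrative, not from the slides:

```python
class PredictiveParser:
    """Hand-written recursive-descent parser (no backtracking) for
    E -> T E', E' -> + T E' | eps, T -> F T', T' -> * F T' | eps,
    F -> ( E ) | id. Tokens are plain strings; 'id' is any identifier."""

    def __init__(self, tokens):
        self.tokens = tokens + ["$"]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok}, got {self.peek()}")
        self.pos += 1

    def parse(self):
        self.E()
        self.match("$")      # all input must be consumed
        return True

    def E(self):
        self.T(); self.Eprime()

    def Eprime(self):
        if self.peek() == "+":
            self.match("+"); self.T(); self.Eprime()
        # else: the epsilon production

    def T(self):
        self.F(); self.Tprime()

    def Tprime(self):
        if self.peek() == "*":
            self.match("*"); self.F(); self.Tprime()

    def F(self):
        if self.peek() == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.match("id")

print(PredictiveParser(["id", "+", "id", "*", "id"]).parse())  # True
```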

Transition Diagrams

• For parser:
– One diagram for each nonterminal
– Edge labels can be tokens or nonterminals

• A transition on a token means we should take that transition if token is next input symbol

• A transition on a nonterminal can be thought of as a call to a procedure for that nonterminal

• As opposed to lexical analyzers:
– One (or more) diagrams for each token
– Labels are symbols of input alphabet

Creating Transition Diagrams

• First eliminate left recursion from grammar

• Then left factor grammar

• For each nonterminal A:
– Create an initial and final state
– For every production A → X1X2…Xn, create a path from initial to final state with edges labeled X1, X2, …, Xn

Using Transition Diagrams

• Predictive parsers:
– Start at start symbol of grammar
– From state s with edge to state t labeled with token a, if next input token is a:
• State changes to t
• Input cursor moves one position right
– If edge labeled by nonterminal A:
• State changes to start state for A
• Input cursor is not moved
• If final state of A reached, then state changes to t
– If edge labeled by ε, state changes to t

• Can be recursive or non-recursive using stack

Transition Diagram Example

E → E + T | T
T → T * F | F
F → (E) | id

E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

[Transition diagrams for E, F, T’, T, and E’ omitted.]

Simplifying Transition Diagrams

[Simplified transition diagrams for E’ and E omitted.]

Nonrecursive Predictive Parsing (1)

[Diagram of nonrecursive predictive parser omitted: input buffer, stack, parsing table M, and output stream.]

Nonrecursive Predictive Parsing (2)

• Program considers X, the symbol on top of the stack, and a, the next input symbol

• If X = a = $, parser halts successfully

• If X = a ≠ $, parser pops X off stack and advances to next input symbol

• If X is a nonterminal, the program consults M[X, a] (production or error entry)

Nonrecursive Predictive Parsing (3)

• Initialize stack with start symbol of grammar

• Initialize input pointer to first symbol of input

• After consulting parsing table:
– If entry is a production, parser replaces top entry of stack with right side of production (leftmost symbol on top)
– Otherwise, an error recovery routine is called
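
Putting these three slides together, a minimal table-driven parser might look like the sketch below. The table entries are those of the expression-grammar parsing table shown on the following slides; the function and variable names are illustrative:

```python
# M[X, a] for the expression grammar; [] is the epsilon production body.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens):
    """Table-driven predictive parser following the algorithm on these
    slides: stack starts as $ plus the start symbol; M[X, a] supplies
    productions. Returns the list of productions used (the output)."""
    stack = ["$", "E"]
    input_ = tokens + ["$"]
    ip, output = 0, []
    while True:
        X, a = stack[-1], input_[ip]
        if X == "$" and a == "$":
            return output                  # success
        if X == a:                         # matching terminal: pop and advance
            stack.pop(); ip += 1
        elif X in NONTERMINALS:
            rhs = TABLE.get((X, a))
            if rhs is None:
                raise SyntaxError(f"no entry M[{X}, {a}]")
            output.append((X, rhs))
            stack.pop()
            stack.extend(reversed(rhs))    # leftmost symbol ends up on top
        else:
            raise SyntaxError(f"unexpected token {a}")

out = predictive_parse(["id", "+", "id", "*", "id"])
print(len(out))  # 11 productions, matching the trace two slides ahead
```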

Predictive Parsing Table

Nonter-   Input Symbol
minal     id        +          *          (        )        $

E         E→TE’                           E→TE’
E’                  E’→+TE’                        E’→ε     E’→ε
T         T→FT’                           T→FT’
T’                  T’→ε       T’→*FT’             T’→ε     T’→ε
F         F→id                            F→(E)

Using a Predictive Parsing Table

Stack      Input        Output
$E         id+id*id$
$E’T       id+id*id$    E→TE’
$E’T’F     id+id*id$    T→FT’
$E’T’id    id+id*id$    F→id
$E’T’      +id*id$
$E’        +id*id$      T’→ε
$E’T+      +id*id$      E’→+TE’
$E’T       id*id$
$E’T’F     id*id$       T→FT’
$E’T’id    id*id$       F→id
$E’T’      *id$
$E’T’F*    *id$         T’→*FT’
$E’T’F     id$
$E’T’id    id$          F→id
$E’T’      $
$E’        $            T’→ε
$          $            E’→ε

FIRST

• FIRST(α) is the set of all terminals that begin any string derived from α

• Computing FIRST:
– If X is a terminal, FIRST(X) = {X}
– If X → ε is a production, add ε to FIRST(X)
– If X is a nonterminal and X → Y1Y2…Yn is a production:
• For all terminals a, add a to FIRST(X) if a is a member of some FIRST(Yi) and ε is a member of FIRST(Y1), FIRST(Y2), …, FIRST(Yi-1)
• If ε is a member of FIRST(Y1), FIRST(Y2), …, FIRST(Yn), add ε to FIRST(X)

FOLLOW

• FOLLOW(A), for any nonterminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form

• More formally, a is in FOLLOW(A) if and only if there exists a derivation of the form S *=>αAaβ

• $ is in FOLLOW(A) if and only if there exists a derivation of the form S *=> αA

Computing FOLLOW

• Place $ in FOLLOW(S)

• If there is a production A → αBβ, then everything in FIRST(β) (except for ε) is in FOLLOW(B)

• If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is also in FOLLOW(B)

FIRST and FOLLOW Example

E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E’) = {+, ε}
FIRST(T’) = {*, ε}
FOLLOW(E) = FOLLOW(E’) = {), $}
FOLLOW(T) = FOLLOW(T’) = {+, ), $}
FOLLOW(F) = {+, *, ), $}
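
The FIRST and FOLLOW rules can be computed by iterating to a fixed point. The sketch below (hypothetical helper names; the string "eps" stands for ε) recomputes the sets for this grammar:

```python
GRAMMAR = {   # the grammar on this slide; [] as a body stands for epsilon
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
START = "E"
NT = set(GRAMMAR)

def first_of_seq(seq, FIRST):
    """FIRST of a sequence of symbols; contains 'eps' only if every
    symbol in the sequence can derive epsilon."""
    out = set()
    for X in seq:
        f = FIRST[X] if X in NT else {X}
        out |= f - {"eps"}
        if "eps" not in f:
            return out
    out.add("eps")
    return out

def compute_first():
    FIRST = {A: set() for A in NT}
    changed = True
    while changed:                      # iterate until nothing new is added
        changed = False
        for A, bodies in GRAMMAR.items():
            for body in bodies:
                f = first_of_seq(body, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

def compute_follow(FIRST):
    FOLLOW = {A: set() for A in NT}
    FOLLOW[START].add("$")              # rule 1: $ goes in FOLLOW(S)
    changed = True
    while changed:
        changed = False
        for A, bodies in GRAMMAR.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in NT:
                        continue
                    rest = first_of_seq(body[i + 1:], FIRST)
                    add = rest - {"eps"}          # rule 2
                    if "eps" in rest:
                        add |= FOLLOW[A]          # rule 3
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add
                        changed = True
    return FOLLOW

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(sorted(FIRST["E"]), sorted(FOLLOW["F"]))
```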

Creating a Predictive Parsing Table

• For each production A → α:
– For each terminal a in FIRST(α), add A → α to M[A, a]
– If ε is in FIRST(α), add A → α to M[A, b] for every terminal b in FOLLOW(A)
– If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $]

• Mark each undefined entry of M as an error entry (use some recovery strategy)

Multiply-Defined Entries Example

S → iEtSS’ | a
S’ → eS | ε
E → b

Nonter-   Input Symbol
minal     a      b      i          t      e              $

S         S→a           S→iEtSS’
S’                                        S’→ε           S’→ε
                                          S’→eS
E                E→b

LL(1) Grammars (1)

• Algorithm covered in class can be applied to any grammar to produce a parsing table

• If parsing table has no multiply-defined entries, grammar is said to be “LL(1)”
– First “L”: left-to-right scanning of input
– Second “L”: produces leftmost derivation
– “1” refers to the number of lookahead symbols needed to make decisions

LL(1) Grammars (2)

• No ambiguous or left-recursive grammar can be LL(1)

• Eliminating left recursion and left factoring does not always lead to LL(1) grammar

• Some grammars can not be transformed into an LL(1) grammar at all

• Although the example of a non-LL(1) grammar we covered has a fix, there are no universal rules to handle cases like this

Shift-Reduce Parsing

• One simple form of bottom-up parsing is shift-reduce parsing

• Starts at the bottom (leaves, terminals) and works its way up to the top (root, start symbol)

• Each step is a “reduction”:
– Substring of input matching the right side of a production is “reduced”
– Replaced with the nonterminal on the left side of the production

• If all substrings are chosen correctly, a rightmost derivation is traced in reverse

Shift-Reduce Parsing Example

S → aABe
A → Abc | b
B → d

abbcde
aAbcde
aAde
aABe
S

S rm=> aABe rm=> aAde rm=> aAbcde rm=> abbcde

Handles (1)

• Informally, a “handle” of a string:
– Is a substring of the string
– Matches the right side of a production
– Reduction to left side of production is one step along reverse of rightmost derivation

• Leftmost substring matching right side of production is not necessarily a handle
– Might not be able to reduce resulting string to start symbol
– In example from previous slide, if we reduce aAbcde to aAAcde, we can not reduce this to S

Handles (2)

• Formally, a handle of a right-sentential form γ:
– Is a production A → β and a position of γ where β may be found and replaced with A
– Replacing β by A leads to the previous right-sentential form in a rightmost derivation of γ

• So if S rm*=> αAw rm=> αβw, then A → β in the position following α is a handle of αβw

• The string w to the right of the handle contains only terminals

• Can be more than one handle if grammar is ambiguous (more than one rightmost derivation)

Ambiguity and Handles Example

E → E + E
E → E * E
E → (E)
E → id

E rm=> E + E rm=> E + E * E rm=> E + E * id3 rm=> E + id2 * id3 rm=> id1 + id2 * id3

E rm=> E * E rm=> E * id3 rm=> E + E * id3 rm=> E + id2 * id3 rm=> id1 + id2 * id3

Handle Pruning

• Repeat the following process, starting from the string of tokens, until the start symbol is obtained:
– Locate handle in current right-sentential form
– Replace handle with left side of appropriate production

• Two problems that need to be solved:
– How to locate the handle
– How to choose the appropriate production

Shift-Reduce Parsing

• Data structures include a stack and an input buffer
– Stack holds grammar symbols and starts off empty
– Input buffer holds the string w to be parsed

• Parser shifts input symbols onto stack until a handle β is on top of the stack
– Handle is reduced to the left side of the appropriate production
– If stack contains only start symbol and input is empty, this indicates success

Actions of a Shift-Reduce Parser

• Shift – the next input symbol is shifted onto the top of the stack

• Reduce – The parser reduces the handle at the top of the stack to a nonterminal (the left side of the appropriate production)

• Accept – The parser announces success

• Error – The parser discovers a syntax error and calls a recovery routine

Shift-Reduce Parsing Example

Stack          Input              Action
$              id1 + id2 * id3$   shift
$ id1          + id2 * id3$       reduce by E → id
$ E            + id2 * id3$       shift
$ E +          id2 * id3$         shift
$ E + id2      * id3$             reduce by E → id
$ E + E        * id3$             shift
$ E + E *      id3$               shift
$ E + E * id3  $                  reduce by E → id
$ E + E * E    $                  reduce by E → E * E
$ E + E        $                  reduce by E → E + E
$ E            $                  accept

Viable Prefixes

• Two definitions of a viable prefix:
– A prefix of a right-sentential form that can appear on a stack during shift-reduce parsing
– A prefix of a right-sentential form that does not continue past the right end of the rightmost handle

• Can always add tokens to the end of a viable prefix to obtain a right-sentential form

Conflicts in Shift-Reduce Parsing

• There are grammars for which shift-reduce parsing can not be used

• Shift/reduce conflict: can not decide whether to shift or reduce

• Reduce/reduce conflict: can not decide which of multiple possible reductions to make

• Sometimes can add rule to adapt for use with ambiguous grammar

Operator-Precedence Parsing

• A form of shift-reduce parsing that can apply to certain simple grammars
– No production can have right side ε
– No right side can have two adjacent nonterminals
– Other essential requirements must be met

• Once the parser is built (often by hand), the grammar can be effectively ignored

Precedence Relations

Relation Meaning

a <· b a "yields precedence to" b

a ·= b a "has the same precedence as" b

a ·> b a "takes precedence over" b

Using Precedence Relations (1)

• Can be thought of as delimiting handles:
– <· marks left end of handle
– ·= appears in the interior of handle
– ·> marks right end of handle

• Consider right-sentential form β0a1β1a2β2…anβn:
– Each βi is either a single nonterminal or ε
– Each ai is a single token
– Suppose that exactly one precedence relation will hold for each ai, ai+1 pair

Using Precedence Relations (2)

• Mark beginning and end of string with $
• Remove the nonterminals
• Insert correct precedence relation between each pair of terminals

      id    +     *     $
id          ·>    ·>    ·>
+     <·    ·>    <·    ·>
*     <·    ·>    ·>    ·>
$     <·    <·    <·

id + id * id

$ <· id ·> + <· id ·> * <· id ·> $

Using Precedence Relations (3)

• To find the current handle:
– Scan the string from the left until the first ·> is encountered
– Scan backwards (left) from there until a <· is encountered
– Everything in between, including intervening or surrounding nonterminals, is the handle

• The nonterminals do not influence the parse!

Implementing the Algorithm

set ip to point to the first symbol in w$
initialize stack to $
repeat forever
    if $ is on top of stack and ip points to $
        return success
    else
        let a be topmost symbol on stack
        let b be symbol pointed to by ip
        if a <· b or a ·= b
            push b onto stack
            advance ip to next input symbol
        else if a ·> b
            repeat
                pop x
            until top symbol on stack <· x
        else
            error()
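
A direct transcription of this loop, using the relation table for id, +, *, and $ given two slides back, might look like the sketch below. As the slides note, the nonterminals are simply dropped; returning the list of reduced handles is an illustrative choice, not part of the algorithm:

```python
PREC = {  # precedence relation table: PREC[a][b] compares a to b
    "id": {"+": ">", "*": ">", "$": ">"},
    "+":  {"id": "<", "+": ">", "*": "<", "$": ">"},
    "*":  {"id": "<", "+": ">", "*": ">", "$": ">"},
    "$":  {"id": "<", "+": "<", "*": "<"},
}

def op_precedence_parse(tokens):
    """Operator-precedence loop from the slide: shift while a <. b or
    a .= b, pop a handle when a .> b. Returns the handles reduced,
    in order (nonterminals are omitted entirely)."""
    input_ = tokens + ["$"]
    stack = ["$"]
    ip, reductions = 0, []
    while True:
        a, b = stack[-1], input_[ip]
        if a == "$" and b == "$":
            return reductions              # success
        rel = PREC[a].get(b)
        if rel in ("<", "="):
            stack.append(b)                # shift
            ip += 1
        elif rel == ">":
            handle = []
            while True:                    # pop x until top of stack <. x
                x = stack.pop()
                handle.append(x)
                if PREC[stack[-1]].get(x) == "<":
                    break
            reductions.append(handle[::-1])
        else:
            raise SyntaxError(f"no relation between {a} and {b}")

print(op_precedence_parse(["id", "+", "id", "*", "id"]))
```

Note that the reductions come out in precedence order: all three ids first, then *, then +, exactly as the relations dictate.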

Precedence and Associativity (1)

• For grammars describing arithmetic expressions:
– Can construct table of operator-precedence relations automatically
– Heuristic based on precedence and associativity of operators

• Selects proper handles, even if grammar is ambiguous

Precedence and Associativity (2)

• If operator θ1 has higher precedence than operator θ2, make θ1 ·> θ2 and θ2 <· θ1

• If θ1 and θ2 are of equal precedence:

– If they are left associative, make θ1 ·> θ2 and θ2 ·> θ1

– If they are right associative, make θ1 <· θ2 and θ2 <· θ1

Precedence and Associativity (3)

• For all operators θ:
– θ <· id
– id ·> θ
– θ <· (
– ( <· θ
– ) ·> θ
– θ ·> )
– θ ·> $
– $ <· θ

• Also let:
– ( ·= )
– ( <· (
– ( <· id
– $ <· (
– id ·> $
– id ·> )
– $ <· id
– ) ·> $
– ) ·> )

Operator Grammar Example

• ^ is of highest precedence and is right-associative

• * and / are of next highest precedence and are left-associative

• + and – are of lowest precedence and are left-associative

E → E + E | E – E | E * E | E / E
  | E ^ E | (E) | -E | id

Computed Precedence Relations

      +     -     *     /     ^     id    (     )     $
+     ·>    ·>    <·    <·    <·    <·    <·    ·>    ·>
-     ·>    ·>    <·    <·    <·    <·    <·    ·>    ·>
*     ·>    ·>    ·>    ·>    <·    <·    <·    ·>    ·>
/     ·>    ·>    ·>    ·>    <·    <·    <·    ·>    ·>
^     ·>    ·>    ·>    ·>    <·    <·    <·    ·>    ·>
id    ·>    ·>    ·>    ·>    ·>                ·>    ·>
(     <·    <·    <·    <·    <·    <·    <·    ·=
)     ·>    ·>    ·>    ·>    ·>                ·>    ·>
$     <·    <·    <·    <·    <·    <·    <·

Handling Unary Operators

• If unary operator θ is not also a binary operator:
– Incorporate θ into table
– For all operators θn:
• Make θn <· θ no matter what
• If θ has higher precedence than θn, make θ ·> θn
• Otherwise, make θ <· θn

• Otherwise, the easiest thing to do is create a second lexical symbol (token)
– Lexical analyzer must distinguish one from the other
– Can’t use lookahead; must rely on previous token

Precedence Functions (1)

• Do not need to store entire table of precedence relations

• Select two precedence functions f and g:
– f(a) < g(b) whenever a <· b
– f(a) = g(b) whenever a ·= b
– f(a) > g(b) whenever a ·> b

      +    -    *    /    ^    (    )    id   $
f     2    2    4    4    4    0    6    6    0
g     1    1    3    3    5    5    0    5    0

Precedence Functions (2)

• Precedence relation between a and b is determined by comparing f(a) to g(b)

• Loss of error detection capability (errors caught later when no reduction for handle is found)

• It is not always possible to construct valid precedence functions

• When it is possible, functions can be computed automatically

Precedence Functions Algorithm

• Create symbols fa and ga for all tokens and $
• If a ·= b, then fa and gb must be in the same group
• Partition symbols into as many groups as possible
• For all cases where a <· b, draw edge from group of gb to group of fa
• For all cases where a ·> b, draw edge from group of fa to group of gb
• If graph has cycles, no precedence functions exist
• Otherwise:
– f(a) is the length of the longest path beginning at group of fa
– g(a) is the length of the longest path beginning at group of ga

Precedence Functions Example

      id    +     *     $
id          ·>    ·>    ·>
+     <·    ·>    <·    ·>
*     <·    ·>    ·>    ·>
$     <·    <·    <·

      +    *    id   $
f     2    4    4    0
g     1    3    5    0
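
The longest-path construction can be checked mechanically. The sketch below builds the graph for the table above and recomputes f and g; since this table has no ·= entries, the grouping step is a no-op, so every node is its own group:

```python
PREC = {  # precedence relations from this slide: PREC[(a, b)] in {'<', '>'}
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
}
TERMS = ["+", "*", "id", "$"]

# Nodes are ("f", a) and ("g", a); a <. b gives edge g_b -> f_a,
# and a .> b gives edge f_a -> g_b, per the algorithm above.
edges = {("f", a): set() for a in TERMS}
edges.update({("g", a): set() for a in TERMS})
for (a, b), rel in PREC.items():
    if rel == "<":
        edges[("g", b)].add(("f", a))
    else:
        edges[("f", a)].add(("g", b))

def longest_path(node, seen=()):
    """Length of the longest path starting at node; a cycle means
    no precedence functions exist, so raise."""
    if node in seen:
        raise ValueError("cycle: no precedence functions exist")
    return max((1 + longest_path(n, seen + (node,)) for n in edges[node]),
               default=0)

f = {a: longest_path(("f", a)) for a in TERMS}
g = {a: longest_path(("g", a)) for a in TERMS}
print(f, g)
```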

Detecting and Handling Errors

• Errors can occur at two points:
– If no precedence relation holds between the terminal on top of stack and current input
– If a handle has been found, but no production is found with this handle as right side

• Errors during reductions can be handled with diagnostic message

• Errors due to lack of precedence relation can be handled by recovery routines specified in table

LR Parsers

• LR parsers use an efficient, bottom-up parsing technique useful for a large class of CFGs

• Too difficult to construct by hand, but automatic generators to create them exist (e.g. Yacc)

• LR(k) grammars:
– “L” refers to left-to-right scanning of input
– “R” refers to rightmost derivation (produced in reverse order)
– “k” refers to the number of lookahead symbols needed for decisions (if omitted, assumed to be 1)

Benefits of LR Parsing

• Can be constructed to recognize virtually all programming language constructs for which a CFG can be written

• Most general non-backtracking shift-reduce parsing method known

• Can be implemented efficiently

• Handles a class of grammars that is a superset of those handled by predictive parsing

• Can detect syntactic errors as soon as possible with a left-to-right scan of input

Model of LR Parser

[Diagram of LR parser omitted: stack of states, input buffer, driver program, and action/goto parsing table.]

LR Parser (1)

• Driver program is the same for all LR Parsers

• Stack consists of states (si) and grammar symbols (Xi)
– Each state summarizes information contained in stack below it
– Grammar symbols do not actually need to be stored on stack in most implementations

• State symbol on top of stack and next input symbol used to determine shift/reduce decision

LR Parser (2)

• Parsing table includes action function and goto function

• Action function
– Based on state and next input symbol
– Actions are shift, reduce, accept or error

• Goto function
– Based on state and grammar symbol
– Produces next state

LR Parser (3)

• Configuration (s0X1s1…Xmsm,aiai+1…an$) indicates right-sentential form X1X2…Xmaiai+1…an

• If action[sm,ai] = shift s, enter configuration (s0X1s1…Xmsmais,ai+1…an$)

• If action[sm,ai] = reduce A → β, enter configuration (s0X1s1…Xm-rsm-rAs, aiai+1…an$), where s = goto[sm-r,A] and r is the length of β

• If action[sm,ai] = accept, signal success

• If action[sm,ai] = error, try error recovery

LR Parsing Algorithm

set ip to point to the first symbol in w$
initialize stack to s0
repeat forever
    let s be topmost state on stack
    let a be symbol pointed to by ip
    if action[s,a] = shift s’
        push a then s’ onto stack
        advance ip to next input symbol
    else if action[s,a] = reduce A → β
        pop 2*|β| symbols off stack
        let s’ be state now on top of stack
        push A then goto[s’,A] onto stack
        output production A → β
    else if action[s,a] = accept
        return success
    else
        error()

LR Parsing Table Example

State   Action                                    Goto
        id     +      *      (      )      $      E    T    F
0       s5                   s4                   1    2    3
1              s6                          acc
2              r2     s7            r2     r2
3              r4     r4            r4     r4
4       s5                   s4                   8    2    3
5              r6     r6            r6     r6
6       s5                   s4                        9    3
7       s5                   s4                             10
8              s6                   s11
9              r1     s7            r1     r1
10             r3     r3            r3     r3
11             r5     r5            r5     r5

(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → id

LR Parsing Example

     Stack                   Input           Action
(1)  s0                      id * id + id $  shift
(2)  s0 id s5                * id + id $     reduce by F → id
(3)  s0 F s3                 * id + id $     reduce by T → F
(4)  s0 T s2                 * id + id $     shift
(5)  s0 T s2 * s7            id + id $       shift
(6)  s0 T s2 * s7 id s5      + id $          reduce by F → id
(7)  s0 T s2 * s7 F s10      + id $          reduce by T → T * F
(8)  s0 T s2                 + id $          reduce by E → T
(9)  s0 E s1                 + id $          shift
(10) s0 E s1 + s6            id $            shift
(11) s0 E s1 + s6 id s5      $               reduce by F → id
(12) s0 E s1 + s6 F s3       $               reduce by T → F
(13) s0 E s1 + s6 T s9       $               reduce by E → E + T
(14) s0 E s1                 $               accept
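
The trace above can be reproduced by a small driver that hard-codes the action and goto tables from the previous slide. As that slide notes, it suffices to keep only states on the stack; the function and table names here are illustrative:

```python
# Grammar: (1) E->E+T (2) E->T (3) T->T*F (4) T->F (5) F->(E) (6) F->id
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
         4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}   # head and body length

ACTION = {  # (state, terminal) -> ("s", state) | ("r", production) | ("acc", 0)
    (0, "id"): ("s", 5), (0, "("): ("s", 4),
    (1, "+"): ("s", 6), (1, "$"): ("acc", 0),
    (2, "+"): ("r", 2), (2, "*"): ("s", 7), (2, ")"): ("r", 2), (2, "$"): ("r", 2),
    (3, "+"): ("r", 4), (3, "*"): ("r", 4), (3, ")"): ("r", 4), (3, "$"): ("r", 4),
    (4, "id"): ("s", 5), (4, "("): ("s", 4),
    (5, "+"): ("r", 6), (5, "*"): ("r", 6), (5, ")"): ("r", 6), (5, "$"): ("r", 6),
    (6, "id"): ("s", 5), (6, "("): ("s", 4),
    (7, "id"): ("s", 5), (7, "("): ("s", 4),
    (8, "+"): ("s", 6), (8, ")"): ("s", 11),
    (9, "+"): ("r", 1), (9, "*"): ("s", 7), (9, ")"): ("r", 1), (9, "$"): ("r", 1),
    (10, "+"): ("r", 3), (10, "*"): ("r", 3), (10, ")"): ("r", 3), (10, "$"): ("r", 3),
    (11, "+"): ("r", 5), (11, "*"): ("r", 5), (11, ")"): ("r", 5), (11, "$"): ("r", 5),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
        (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def lr_parse(tokens):
    """The LR driver loop from the slides, with states only on the
    stack. Returns the numbers of the productions output, in order."""
    stack = [0]
    input_ = tokens + ["$"]
    ip, output = 0, []
    while True:
        s, a = stack[-1], input_[ip]
        act = ACTION.get((s, a))
        if act is None:
            raise SyntaxError(f"error in state {s} on {a}")
        kind, n = act
        if kind == "s":                        # shift
            stack.append(n)
            ip += 1
        elif kind == "r":                      # reduce by production n
            head, size = PRODS[n]
            del stack[len(stack) - size:]      # pop |body| states
            stack.append(GOTO[(stack[-1], head)])
            output.append(n)
        else:                                  # accept
            return output

print(lr_parse(["id", "*", "id", "+", "id"]))  # [6, 4, 6, 3, 2, 6, 4, 1]
```

The output matches the reductions in the trace: F → id, T → F, F → id, T → T * F, E → T, F → id, T → F, E → E + T.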

Constructing LR Parsing Tables

• Three methods:
– SLR (simple LR)
• Not all that simple (but simpler than the other two)!
• Weakest of the three methods, easiest to implement
– Canonical LR
• Most general of the methods
• Constructed tables can be quite large
– LALR (lookahead LR)
• Tables smaller than canonical LR
• Most programming language constructs can be handled

• We will not cover any of these methods in class
– Too much detail
– Yacc will take care of it!

