Parsing using graphs

Parsing Needs

New Abstractions

11/23/2011 1

Problem

• Parsing of context-free languages – active research topic from 60’s to 80’s

– rich variety of parsing techniques are known

• general CFL parsing:

– Earley’s algorithm, Cocke-Younger-Kasami (CYK)

• deterministic parsing:

– SLL(k), LL(k), SLR(k), LR(k), LALR(k), LA(l)LR(k), ...

• Problem: most of these techniques were invented by automata theory people

– terminology is fairly obscure: leftmost derivations, rightmost derivations, handles, viable prefixes, ...

– string rewriting is very clean but not intuitive for most PL people

– descriptions in compiler textbooks are obscure/erroneous

– connections between different parsing techniques are lost

• Question: is there an easier way of thinking about parsing than in terms of strings and string rewriting?

New abstraction

• For any context-free grammar, construct a Grammar Flow Graph (GFG)

– syntax: representation of grammar as a control-flow graph

– semantics: executable representation

• special kind of non-deterministic pushdown automaton

• Parsing problems become path problems in GFG

• Alphabet soup of grammar classes like LL(k), SLL(k), LR(k), LALR(k), SLR(k) etc. can be viewed as choices along three dimensions

– non-determinism: how many paths can we explore at a time?

• all (Earley), only one (LL), some (LR)

– look-ahead: how much do we know about the future?

• solve fixpoint equations over sets

– context: how much do we remember about the past?

• procedure cloning

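GFG example

Grammar:

S → Aa | bAc | Bc | bBa

A → d

B → d

[Figure: GFG for this grammar. Start and end nodes: START-S, END-S, START-A, END-A, START-B, END-B. Production nodes: S→.Aa, S→A.a, S→Aa.; S→.bAc, S→b.Ac, S→bA.c, S→bAc.; S→.Bc, S→B.c, S→Bc.; S→.bBa, S→b.Ba, S→bB.a, S→bBa.; A→.d, A→d.; B→.d, B→d. Edges are labeled with a terminal (a, b, c, d) or with ε.]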

GFG construction

For each non-terminal A, create nodes labeled START-A and END-A.

For each production in the grammar, create a “procedure” and connect it to the START and END nodes of the LHS non-terminal as shown in the figure.

[Figure: procedures for productions A → ε and A → bXY. For A → ε: START-A –ε→ A→. –ε→ END-A. For A → bXY: START-A –ε→ A→.bXY –b→ A→b.XY –ε→ START-X ... END-X –ε→ A→bX.Y –ε→ START-Y ... END-Y –ε→ A→bXY. –ε→ END-A.]

Edges labeled ε: only at entry/exit of START-A and END-A nodes.

Fan-out: only at exit of START-A nodes and END-A nodes.

Terminal transition node: node whose outgoing edge is labeled with a terminal.
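The construction rules above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the grammar is a dictionary from non-terminal to a list of right-hand sides, a production node is a hypothetical triple (lhs, rhs, dot position), and the function name `build_gfg` is made up for this sketch.

```python
EPS = "ε"  # label used for epsilon edges


def build_gfg(grammar):
    """Return (nodes, edges) of the GFG; edges are (source, label, target) triples.

    `grammar` maps each non-terminal to a list of right-hand sides (tuples of
    symbols). A production node is the dotted item (lhs, rhs, dot_position).
    """
    nodes, edges = set(), set()
    for lhs in grammar:
        nodes.update({("START", lhs), ("END", lhs)})
    for lhs, alternatives in grammar.items():
        for rhs in alternatives:
            items = [(lhs, rhs, dot) for dot in range(len(rhs) + 1)]
            nodes.update(items)
            # Entry and exit of the "procedure" for this production.
            edges.add((("START", lhs), EPS, items[0]))
            edges.add((items[-1], EPS, ("END", lhs)))
            for dot, symbol in enumerate(rhs):
                if symbol in grammar:
                    # Non-terminal: call its START node, return from its END node.
                    edges.add((items[dot], EPS, ("START", symbol)))
                    edges.add((("END", symbol), EPS, items[dot + 1]))
                else:
                    # Terminal transition node: edge labeled with the terminal.
                    edges.add((items[dot], symbol, items[dot + 1]))
    return nodes, edges


# Grammar of the GFG example slide: S → Aa | bAc | Bc | bBa, A → d, B → d
GRAMMAR = {
    "S": [("A", "a"), ("b", "A", "c"), ("B", "c"), ("b", "B", "a")],
    "A": [("d",)],
    "B": [("d",)],
}
nodes, edges = build_gfg(GRAMMAR)
```

Fan-out appears only at START nodes (one ε edge per production) and at END nodes (one ε edge per call site), which matches the rules listed on the slide.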

Terminology

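[Figure: the same GFG fragment as on the previous slide (productions A → ε and A → bXY), annotated with node kinds.]

Start node: START-A. End node: END-A.

Entry node: first node of a production, e.g. A→.bXY. Exit node: last node of a production, e.g. A→bXY.

Call node: node just before a non-terminal, e.g. A→b.XY. Return node: node just after a non-terminal, e.g. A→bX.Y.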

Non-deterministic GFG automaton

• Interpretation of GFG: NGA

– similar to NFA

• Rules:

– begin at START-S

– at START nodes, make non-deterministic choice

– at END nodes, must follow CFL path

• “return to the same procedure from which you made the call”

• CFL path from START to END ↔ leftmost derivation

• Label(path):

– sequence of terminal symbols labeling edges in path

– Label of CFL path from START to END is a word in language generated by CFG

[Figure: GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, from START-S to END-S.]

Parsing problem

• Paths(l):

– set of paths with label l

– inverse relation of Label

• Parsing problem: given a grammar G and a string S,

– find all paths in GFG(G) that generate S, or

– demonstrate that there is no such path

• Parallel paths:

– P1 = START-S →+ A (a path of one or more edges from START-S to node A)

– P2 = START-S →+ B

– Label(P1) = Label(P2)

– Equivalence relation on paths originating at START-S

• Ambiguous grammar

– two or more parallel paths START-S →+ END-S

[Figure: GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, from START-S to END-S.]

Compressed paths


Addition to GFG

• We need to be able to talk about sentential forms, not just sentences

• Small modification to GFG:

– add transitions labeled with non-terminals at procedure calls

• Some paths will have edges labeled with non-terminals

– non-terminals that have not been “expanded out”

[Figure: extended GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, from START to END; each call node has an additional edge labeled A or B that leads directly to the matching return node.]

Compressed GFG paths

• More compact representation of GFG path

• Idea:

– collapse portion of path between start and end of a given procedure and replace with non-terminal

• Point: completed calls cannot affect further evolution of path so we need not store full path

• Edges going out of END nodes of procedures will never appear in compressed representation

[Figure: a path P1 from START that passes through START-P ... END-P; in the compressed path the portion between START-P and END-P is replaced by a single edge labeled P.]

NFA for compressed paths

• Start from extended GFG

• Remove edges out of END nodes since these will never be in compressed path

• Each path in NFA corresponds to a compressed GFG path


[Figure: NFA for compressed paths, obtained from the extended GFG for S → Aa | bAc | Bc | bBa, A → d, B → d by removing the edges out of END nodes.]

Following all paths: Earley’s algorithm

Recall: NFA simulation

• Input string is processed left to right, one symbol at a time

• Deterministic simulator keeps track of all states NFA could be in as the input is processed

• Simulation

– simulated state = subset of NFA states

– if current simulated state is C and next input symbol is t, compute next simulated state N as follows:

• scanning: if state si ∈ C and NFA has transition si –t→ sj, add sj to N

• prediction: if state sj ∈ N and NFA has transition sj –ε→ sk, add sk to N

– initial simulated state = set of initial states of NFA closed with prediction rule

Example: {s0,s1,s4} –a→ {s2} –a→ {s2,s3,s7} ...
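The scanning and prediction rules can be written down directly. The sketch below is a minimal illustration, not code from the slides: the NFA encoding (dictionaries `sym_edges` and `eps_edges`) and the small example automaton for the language a b* are assumptions made for the example.

```python
def epsilon_closure(states, eps_edges):
    """Prediction rule: follow ε edges until no new state can be added."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for target in eps_edges.get(s, ()):
            if target not in closure:
                closure.add(target)
                work.append(target)
    return closure


def simulate(initial, accepting, sym_edges, eps_edges, text):
    """Deterministic simulation: track the set of states the NFA could be in."""
    current = epsilon_closure(initial, eps_edges)
    for t in text:
        # Scanning rule: follow every edge labeled t out of the current set.
        scanned = {sj for si in current for sj in sym_edges.get((si, t), ())}
        current = epsilon_closure(scanned, eps_edges)
    return bool(current & accepting)


# Hypothetical NFA for a b*:  0 –a→ 1,  1 –ε→ 2,  2 –b→ 2,  accepting state 2
SYM_EDGES = {(0, "a"): {1}, (2, "b"): {2}}
EPS_EDGES = {1: {2}}
accepted = simulate({0}, {2}, SYM_EDGES, EPS_EDGES, "abb")
```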

Analog in GFG

• First cut: use exactly the same idea

– current state C, next state N, next input symbol is t

– scanning: if state si ∈ C and NFA has transition si –t→ sj, add sj to N

– prediction: if state sj ∈ N and NFA has transition sj –ε→ sk, add sk to N

• Problem: not clear how to make ε-transitions at return states like s18 and s12

• Solution: keep “return addresses” as in Earley

[Figure: GFG for S → Aa | bAc | Bc | bBa, A → d, B → d with nodes numbered S0 (start) to S19 (end).]

Example: {S0,S1,S4,S8,S13,S17,S11} –d→ {S12,S18, ?????}

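[Figure: GFG for E → int | (E+E) | (E-E) and the Earley sets computed for the input below. Each entry is a GFG node tagged with a return address.]

Grammar: E → int | (E+E) | (E-E)

Input string: (9+6)

Σ0 (start): <START-E,0> <E→.(E+E),0> <E→.(E-E),0> <E→.int,0>

Σ1 (after “(”): <E→(.E+E),0> <E→(.E-E),0> <START-E,1> <E→.(E+E),1> <E→.(E-E),1> <E→.int,1>

Σ2 (after “9”): <E→int.,1> <END-E,1> <E→(E.+E),0> <E→(E.-E),0>

Σ3 (after “+”): <E→(E+.E),0> <START-E,3> <E→.(E+E),3> <E→.(E-E),3> <E→.int,3>

Σ4 (after “6”): <E→int.,3> <END-E,3> <E→(E+E.),0>

Σ5 (after “)”): <E→(E+E).,0> <END-E,0>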
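The Σ sets above can be reproduced with a small recognizer. The following is a sketch under assumptions, not the slides' own code: an item is a hypothetical tuple (lhs, rhs, dot, origin) standing for a GFG production node tagged with its return address, the START-E and END-E nodes are folded into the prediction and completion steps, and the input is assumed to be already tokenized (numbers appear as the token "int").

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`."""
    n = len(tokens)
    sets = [set() for _ in range(n + 1)]
    sets[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        changed = True
        while changed:  # close the set under prediction and completion
            changed = False
            for lhs, rhs, dot, origin in list(sets[i]):
                if dot < len(rhs) and rhs[dot] in grammar:
                    # Prediction: call node, enter every production of the callee.
                    new = {(rhs[dot], alt, 0, i) for alt in grammar[rhs[dot]]}
                elif dot == len(rhs):
                    # Completion: exit node, return to the callers in set `origin`.
                    new = {(l, r, d + 1, o) for (l, r, d, o) in sets[origin]
                           if d < len(r) and r[d] == lhs}
                else:
                    new = set()
                if not new <= sets[i]:
                    sets[i] |= new
                    changed = True
        if i < n:
            # Scanning: advance over the next input symbol.
            sets[i + 1] = {(l, r, d + 1, o) for (l, r, d, o) in sets[i]
                           if d < len(r) and r[d] == tokens[i]}
    return any(l == start and d == len(r) and o == 0
               for (l, r, d, o) in sets[n])


EXPR = {"E": [("int",), ("(", "E", "+", "E", ")"), ("(", "E", "-", "E", ")")]}
ok = earley_recognize(EXPR, "E", ["(", "int", "+", "int", ")"])
```

Re-running prediction and completion until the set stops growing is a simple (not the most efficient) way to handle ε productions, where an item can complete in the same set in which it was predicted.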

Earley parser and GFG states

• A given Σ set can contain multiple instances of the same GFG state.

• Example: S → aS | a

• Earley set Σi

– <S→a.S, i-1>

– <S→a., i-1>

– <S→.aS, i>

– <S→.a, i>

– <S→aS., i-2>

– <S→aS., i-3>

– ...

– <S→aS., 0>

Earley’s parser and ambiguous grammars

• If an Earley configuration can be added to a given Σ set by two or more configurations, grammar is ambiguous

• Example: substring between positions p and t can be derived from A in two different ways

[Figure: in Σt, the configurations <X → α., p1> and <Y → β., p2> both lead to the addition of <Z → γA.δ, p>.]

Look-ahead computation


Look-ahead computation

• Look-ahead at point p in GFG:

– first k symbols you might encounter on path starting at p

– k is a small integer that is given for entire grammar

• Subtle point:

– look-ahead may depend on path from START that you took to get to p

– (e.g.) 2-look-ahead at entry of N is different for red and blue calls

• Two approaches:

– context-independent look-ahead: first k symbols on paths starting at p

– context-dependent look-ahead: given a path C from START to p, what are the first k symbols on any path starting at p that extends C

Grammar:

S → xNab | yNbc

N → a | ε

[Figure: GFG for this grammar, with the two calls to N (in S → xNab and S → yNbc) drawn in red and blue. 2-look-aheads: {xa} at S→.xNab, {ya,yb} at S→.yNbc, {aa,ab} at N→.a, {ab,bc} at N→. ]

FIRSTk sets

• FIRSTk(A): set of strings of length k or less

– If A ⇒* s where s is a terminal string of length k or less, s ∈ FIRSTk(A)

– If A ⇒* s where s is a string longer than k symbols, then k-prefix of s ∈ FIRSTk(A)

• Intuition:

– non-terminal A represents a set, which is the set of strings we can derive from it

– FIRSTk(A) is the set of k-prefixes of these strings

• Easy to extend FIRSTk to sequences of grammar symbols

Example (S → xNab | yNbc, N → a | ε):

FIRST2(N) = {a, ε}

FIRST2(Nab) = {aa, ab}

Useful string functions

• Concatenation: s + t

– (e.g.) xy + abc = xyabc

• k-prefix of string s: s_k

– (e.g.) (xyz)_2 = xy, (x)_2 = x, (ε)_2 = ε

• Composition of concatenation and k-prefix: s +k t

– defined as (s+t)_k

– (e.g.) x +2 yz = xy

– operation is associative

• Easy result: (s+t)_k = (s_k + t_k)_k = s_k +k t_k

• Operations can be extended to sets in the obvious way

– (e.g.) {a,bcd} +2 {ε,x,yz} = {a,ax,ay,bc}
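These string functions map directly onto Python string slicing. In the sketch below the function names are hypothetical, strings of terminals are ordinary Python strings, and ε is the empty string.

```python
def prefix(s, k):
    """k-prefix of s: the first k symbols (all of s if it is shorter)."""
    return s[:k]


def concat_k(s, t, k):
    """s +k t, defined as the k-prefix of the concatenation s + t."""
    return (s + t)[:k]


def set_concat_k(xs, ys, k):
    """Extension of +k to sets: combine every pair of strings."""
    return {concat_k(x, y, k) for x in xs for y in ys}


example = set_concat_k({"a", "bcd"}, {"", "x", "yz"}, 2)
```

For the sets used on the slide, `example` evaluates to {"a", "ax", "ay", "bc"}.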

FIRSTk

FIRSTk(ε) = {ε}

FIRSTk(t) = {t}

FIRSTk(A) = FIRSTk(X1X2…Xn) ∪ FIRSTk(Y1Y2…Ym) ∪ …   // one term per right-hand side of a production of A

FIRSTk(X1X2…Xn) = FIRSTk(X1) +k FIRSTk(X2) +k … +k FIRSTk(Xn)

FIRSTk example

Grammar:

S → aAab | bAb

A → cAB | ε | a

B → ε

Equations:

FIRST2(S) = FIRST2(aAab) ∪ FIRST2(bAb) = ({a} +2 FIRST2(A) +2 {ab}) ∪ ({b} +2 FIRST2(A) +2 {b})

FIRST2(A) = FIRST2(cAB) ∪ {ε} ∪ {a} = ({c} +2 FIRST2(A) +2 FIRST2(B)) ∪ {ε} ∪ {a}

FIRST2(B) = {ε}

Solution:

FIRST2(A) = {ε,a,c,ca,cc}

FIRST2(B) = {ε}

FIRST2(S) = {aa,ac,bb,ba,bc}
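The FIRSTk equations can be solved by iterating from empty sets until nothing changes. The sketch below is one way to do this, under assumptions that are not in the slides: every grammar symbol is a single character, right-hand sides are Python strings, ε is the empty string, and the function name `first_k` is made up.

```python
def first_k(grammar, k):
    """Least fixpoint of the FIRSTk equations: {non-terminal: set of strings}."""
    first = {a: set() for a in grammar}

    def first_of_sequence(symbols):
        # FIRSTk(X1...Xn) = FIRSTk(X1) +k ... +k FIRSTk(Xn)
        result = {""}
        for x in symbols:
            xs = first[x] if x in grammar else {x}
            result = {(s + t)[:k] for s in result for t in xs}
        return result

    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_sequence(rhs)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


# S → aAab | bAb,  A → cAB | ε | a,  B → ε
GRAMMAR = {"S": ["aAab", "bAb"], "A": ["cAB", "", "a"], "B": [""]}
FIRST2 = first_k(GRAMMAR, 2)
```

On this grammar the iteration reaches the solution shown on the slide after three passes.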

Context-independent look-aheads

Compute FOLLOWk(A) sets: strings of length k that can be encountered after you return from non-terminal A

[Figure: GFG for S → aAab | bAb, A → cAB | ε | a, B → ε. Se, Ae and Be denote the 2-look-ahead sets at END-S, END-A and END-B; the return edges from END-A into S carry {ab} and {b$}.]

Se = {$$}

Ae = (FIRST2({ab}) +2 Se) ∪ (FIRST2({b}) +2 Se) ∪ (FIRST2(B) +2 Ae)

Be = Ae

Solution: Se = {$$}, Ae = {ab,b$}, Be = {ab,b$}

From these FOLLOW sets, we can now compute look-ahead at any GFG point.

Computing context-independent look-ahead sets

• Algorithm:

– For each non-terminal A, compute FIRSTk(A)

• First k terminals you encounter on path START-A →+ END-A

– For each non-terminal A, compute FOLLOWk(A)

• First k terminals you encounter on path that extends a GFG path START →+ END-A

– Use the FIRSTk and FOLLOWk sets to compute the look-ahead at any point of interest in GFG

• You can even compute FIRSTk and FOLLOWk sets in one big iteration if you want.

• This computation is independent of the particular parsing method used
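The algorithm can be sketched as two fixpoint iterations, FIRSTk first and FOLLOWk second. As before this is an illustration under assumptions: single-character symbols, right-hand sides as Python strings, ε as the empty string, and hypothetical function names.

```python
def concat_k(xs, ys, k):
    """Set version of +k."""
    return {(x + y)[:k] for x in xs for y in ys}


def first_of_sequence(symbols, first, k):
    result = {""}
    for x in symbols:
        # A terminal x contributes {x}; a non-terminal contributes FIRSTk(x).
        result = concat_k(result, first.get(x, {x}), k)
    return result


def first_k(grammar, k):
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_sequence(rhs, first, k)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


def follow_k(grammar, start, k):
    first = first_k(grammar, k)
    follow = {a: set() for a in grammar}
    follow[start] = {"$" * k}  # look-ahead at the exit of the GFG
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                for i, x in enumerate(rhs):
                    if x in grammar:
                        # What follows this call: rest of the rhs, then FOLLOWk(a).
                        rest = first_of_sequence(rhs[i + 1:], first, k)
                        new = concat_k(rest, follow[a], k)
                        if not new <= follow[x]:
                            follow[x] |= new
                            changed = True
    return follow


# S → aAab | bAb,  A → cAB | ε | a,  B → ε
GRAMMAR = {"S": ["aAab", "bAb"], "A": ["cAB", "", "a"], "B": [""]}
FOLLOW2 = follow_k(GRAMMAR, "S", 2)
```

For this grammar the result agrees with the solution Se = {$$}, Ae = Be = {ab, b$} given earlier.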

Production cloning: a way of implementing context-dependence

Context-dependent look-ahead

• In running example,

– look-aheads for N for red call to N are disjoint

– look-aheads for N for blue call to N are disjoint

– context-independent look-ahead computation combines the look-aheads from all the call sites of N at the bottom of N and propagates them to the top

• Idea:

– compute look-aheads separately for each context

– keep track of context while parsing

⇒ we can get a more capable parser

Grammar:

S → xNab | yNbc

N → a | ε

Input string: xab$$

[Figure: GFG for this grammar. 2-look-aheads at the bottom of N: {ab} for the call in S → xNab, {bc} for the call in S → yNbc. Context-independent 2-look-aheads: {xa} at S→.xNab, {ya,yb} at S→.yNbc, {aa,ab} at N→.a, {ab,bc} at N→. ]

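Tracking context by cloning

• Grammar:

S → xNab | yNbc

N → a | ε

• Grammar after cloning:

S → xN1ab | yN2bc

N1 → a | ε

N2 → a | ε

[Figure: GFG of the cloned grammar. N1 = [N,{ab}] has 2-look-aheads {aa} for N1 → a and {ab} for N1 → ε; N2 = [N,{bc}] has 2-look-aheads {ab} for N2 → a and {bc} for N2 → ε.]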

General idea of cloning

• Cloning creates copies of productions

• Intuitively we would like to create a clone of a production for each of its contexts and write down the look-ahead

– but set of contexts for a production is usually infinite

• Solution:

– create finite number of equivalence classes of contexts for a given production

– create a clone for each equivalence class

– compute context-independent look-ahead

• Two cloning rules are important in practice

– k-look-ahead cloning: two contexts are in same equivalence class if their k-look-aheads are identical (used in LL(k))

– reachability cloning: two contexts C1 and C2 are in same equivalence class if the set of GFG nodes reachable by paths with label(C1) is equal to set of GFG nodes reachable by paths with label(C2) (used in LR(0))

– LR(k) uses a combination of them

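k-look-ahead cloning (intuitive idea)

[Figure: a GFG before and after 2-look-ahead cloning. The two calls to A in S have look-aheads {ab} and {b$}, giving clones [A,{ab}] and [A,{b$}]. Inside these clones, the calls in the production with A and B on its right-hand side give clones [A,{da}], [B,{ab}] and [A,{db}], [B,{b$}]. Other clones not shown.]

If there are |T| terminal symbols, you may end up with 2^(|T|^k) clones of a given production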

k-look-ahead cloning

• G = (V,T,P,S): grammar, k: positive integer.

• Tk(G) is the following grammar

– non-terminals: {[A,R] | A in V − T, and R ⊆ T^k}

– terminals: T

– start symbol: [S,{$^k}]

– rules: all rules of the form [A,R] → X1'X2'X3'...Xm' where for some rule A → X1X2X3...Xm in P

• Xi' = Xi if Xi is a terminal

• Xi' = [Xi, FIRSTk(Xi+1...Xm) +k R] when Xi is a non-terminal.

• Intuition:

– after this kind of cloning, k-look-aheads at the end of a procedure are identical for all return edges

– so doing a context-independent look-ahead computation on the transformed grammar does not tell you anything you did not already know about k-look-aheads
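A worklist version of this construction is sketched below. It departs from the definition in one labeled way: instead of creating a clone [A,R] for every subset R of T^k, it only creates the clones reachable from the start symbol [S,{$^k}]. Single-character symbols, string right-hand sides, ε as the empty string, and all function names are assumptions of the sketch.

```python
def concat_k(xs, ys, k):
    return {(x + y)[:k] for x in xs for y in ys}


def first_of_sequence(symbols, first, k):
    result = {""}
    for x in symbols:
        result = concat_k(result, first.get(x, {x}), k)
    return result


def first_k(grammar, k):
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_sequence(rhs, first, k)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


def clone_k(grammar, start, k):
    """Return (start clone, rules); a clone [A,R] is the pair (A, frozenset R)."""
    first = first_k(grammar, k)
    start_clone = (start, frozenset({"$" * k}))
    rules, work = {}, [start_clone]
    while work:
        clone = work.pop()
        if clone in rules:
            continue
        a, r = clone
        rules[clone] = []
        for rhs in grammar[a]:
            new_rhs = []
            for i, x in enumerate(rhs):
                if x in grammar:
                    # Xi' = [Xi, FIRSTk(Xi+1...Xm) +k R]
                    rest = first_of_sequence(rhs[i + 1:], first, k)
                    x = (x, frozenset(concat_k(rest, r, k)))
                    work.append(x)
                new_rhs.append(x)
            rules[clone].append(tuple(new_rhs))
    return start_clone, rules


# S → xNab | yNbc,  N → a | ε
GRAMMAR = {"S": ["xNab", "yNbc"], "N": ["a", ""]}
START_CLONE, RULES = clone_k(GRAMMAR, "S", 2)
```

On the running example this yields the clones [S,{$$}], [N,{ab}] and [N,{bc}], matching N1 and N2 from the cloning slide.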

LL(k) and SLL(k)


Intuition

• This class of grammars has the following property:

– if s is a string in the language, then for any prefix p of s, there is a unique path P from START such that label(P) = p (modulo look-ahead)

• So we need to follow only one path through GFG for a given input string, using look-ahead to eliminate alternatives

• Roughly analogous to DFAs in the CFL world

LL(k) parsing

• Only one path can be followed by the parser

– so at procedure call for non-terminal N, we must know exactly which procedure (rule) to call

• Simple LL(k) parsing:

– make decision based on context-independent look-ahead of k symbols at entry point for N

• LL(k) parsing:

– use context-dependent look-ahead of k symbols

– procedure cloning technique converts LL(k) grammar into SLL(k) grammar

Grammar: S → xNab | yNbc, N → a | ε

[Figure: GFG for this grammar with context-independent 2-look-aheads {xa} at S→.xNab, {ya,yb} at S→.yNbc, {aa,ab} at N→.a and {ab,bc} at N→. ]

Grammar is LL(2) but not SLL(2)

Parser

• Modify Earley parser to

– track compressed paths instead of full paths

• transitions labeled by non-terminals and terminals

– eliminate return addresses

• at the end of a production

– A → X1X2..Xn: pop n states off and make an A transition from the exposed state

– A → ε: make an A transition from current state

– use look-ahead to eliminate alternatives

Grammar: E → int | (+ E E) | (- E E)

Input string: ( - ( + 8 9 ) 7 )

[Figure: GFG for this grammar and the trace of the modified Earley parser on the input string. Each state on the stack is either START-E together with the entry nodes E→.(+ E E), E→.(- E E), E→.int, or a single production node such as E→(- . E E); when a production is completed (E→int., E→(+ E E)., E→(- E E).) its states are popped and an E transition is made from the exposed state.]
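Because two symbols of look-ahead always select a unique production of this prefix grammar, a single path can be followed with one procedure per non-terminal. The sketch below is a conventional recursive-descent parser, not the modified Earley parser of the slides; the token format (parentheses and operators as strings, numbers as Python integers) and the tuple-shaped parse tree are assumptions.

```python
def parse_prefix(tokens):
    """Parse E → int | (+ E E) | (- E E) and return a nested tuple tree."""
    pos = 0

    def parse_e():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if tok == "(":
            # The second look-ahead symbol selects (+ E E) or (- E E).
            op = tokens[pos + 1] if pos + 1 < len(tokens) else None
            if op not in ("+", "-"):
                raise SyntaxError(f"expected + or - at position {pos + 1}")
            pos += 2
            left = parse_e()
            right = parse_e()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError(f"expected ) at position {pos}")
            pos += 1
            return (op, left, right)
        if isinstance(tok, int):
            # Production E → int.
            pos += 1
            return tok
        raise SyntaxError(f"unexpected token {tok!r} at position {pos}")

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


TREE = parse_prefix(["(", "-", "(", "+", 8, 9, ")", 7, ")"])
```

Here the host language's call stack plays the role of the parser's stack of states.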

Many grammars are not LL(k)

• Grammar

– E → int | (E+E) | (E-E)

• Not clear which rule to apply until you see “+” or “-”

– this needs unbounded look-ahead, so grammar is not LL(k) for any k

• One solution:

– follow multiple paths till only one survives

[Figure: GFG for E → int | (E+E) | (E-E).]

LR(k),SLR(k),LALR(k)


LR grammars (informal)

• LR parsers permit limited non-determinism

– can follow more than one path but not all paths like Earley

• LR(0) condition: for any prefix of input, the corresponding fully extended compressed paths must have the same label

• Condition not true in general grammars: see example

– Consider string “da”

– For prefix “d”, there are two paths:

• red path

• blue path

– Labels of compressed paths:

• red path: “A”

• blue path: “B”

• We can use modified Earley parser for these grammars

[Figure: extended GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, with the red path (through A → d) and the blue path (through B → d) highlighted.]

Grammar: E → int | (E+E) | (E-E)

Input string: (3+4)

[Figure: GFG for this grammar with two traces for the input string. Top: the Earley sets Σ0 to Σ5 with return addresses, from <START-E,0> to <END-E,0>. Bottom: the states of the modified Earley parser for the same input, which carry no return addresses and make E transitions for completed calls.]

Parser for LR languages

• Use the modified Earley parser we used for LL grammars

– each Σ-state will have multiple items as in the original Earley parser since LR parsers follow multiple paths too

• Σ-states must follow a stack discipline for modified Earley parser to work

• Since we are following multiple paths, this might break down

– shift-reduce conflict: parallel compressed paths

• P1 to a scan node and P2 to an EXIT node (push/pop conflict)

– reduce-reduce conflict: parallel compressed paths

• P1 and P2 to different EXIT nodes (pop/pop conflict)

• If grammar does not have shift-reduce or reduce-reduce conflicts, we can use modified Earley parser and follow compressed paths while maintaining a stack discipline for Σ-states

• How do we determine whether grammar has shift-reduce or reduce-reduce conflicts?

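Finding LR(0) conflicts

• Compute the DFA corresponding to the compressed path NFA

• If conflicting states are in same DFA state, grammar has an LR(0) conflict

Grammar: S → Aa | bAc | Bc | bBa, A → d, B → d

DFA states and transitions:

– I0 = {S→.Aa, S→.bAc, S→.Bc, S→.bBa, A→.d, B→.d} (start state)

– I0 –d→ {A→d., B→d.} (reduce-reduce conflict)

– I0 –b→ I1 = {S→b.Ac, S→b.Ba, A→.d, B→.d}

– I0 –A→ {S→A.a} –a→ {S→Aa.}

– I0 –B→ {S→B.c} –c→ {S→Bc.}

– I1 –d→ {A→d., B→d.}

– I1 –A→ {S→bA.c} –c→ {S→bAc.}

– I1 –B→ {S→bB.a} –a→ {S→bBa.}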
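The DFA and its conflicts can be computed with the usual closure/goto subset construction over dotted items. The sketch below is an illustration under assumptions: items are hypothetical triples (lhs, rhs, dot), the start state is the closure of the start symbol's productions (no augmented start rule), and the function names are made up.

```python
def closure(items, grammar):
    """Add B→.γ for every item with the dot before a non-terminal B."""
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for alt in grammar[rhs[dot]]:
                item = (rhs[dot], alt, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def lr0_states(grammar, start):
    """Subset construction: all DFA states reachable from the start state."""
    initial = closure({(start, rhs, 0) for rhs in grammar[start]}, grammar)
    states, work = {initial}, [initial]
    while work:
        state = work.pop()
        symbols = {rhs[dot] for (_, rhs, dot) in state if dot < len(rhs)}
        for x in symbols:
            moved = {(l, r, d + 1) for (l, r, d) in state
                     if d < len(r) and r[d] == x}
            target = closure(moved, grammar)
            if target not in states:
                states.add(target)
                work.append(target)
    return states


def lr0_conflicts(grammar, start):
    """Return a list of (kind, state) pairs for states with LR(0) conflicts."""
    conflicts = []
    for state in lr0_states(grammar, start):
        complete = [it for it in state if it[2] == len(it[1])]
        shifts = [it for it in state
                  if it[2] < len(it[1]) and it[1][it[2]] not in grammar]
        if len(complete) > 1:
            conflicts.append(("reduce-reduce", state))
        if complete and shifts:
            conflicts.append(("shift-reduce", state))
    return conflicts


# S → Aa | bAc | Bc | bBa,  A → d,  B → d
GRAMMAR = {
    "S": [("A", "a"), ("b", "A", "c"), ("B", "c"), ("b", "B", "a")],
    "A": [("d",)],
    "B": [("d",)],
}
CONFLICTS = lr0_conflicts(GRAMMAR, "S")
```

For this grammar the only conflict reported is the reduce-reduce conflict in the state {A→d., B→d.}, which is reached both on d and on bd.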

LR(0) automaton for expression grammar

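Grammar: E → int | (E+E) | (E-E)

States:

– I0 = {E→.(E+E), E→.(E-E), E→.int} (start state)

– I1 = {E→(.E+E), E→(.E-E), E→.(E+E), E→.(E-E), E→.int}

– I2 = {E→(E.+E), E→(E.-E)}

– I3 = {E→(E+.E), E→.(E+E), E→.(E-E), E→.int}

– I4 = {E→(E-.E), E→.(E+E), E→.(E-E), E→.int}

– I5 = {E→(E+E.)}, I6 = {E→(E+E).}

– I7 = {E→(E-E.)}, I8 = {E→(E-E).}

– I9 = {E→int.}

Transitions: I0, I1, I3, I4 –(→ I1; I0, I1, I3, I4 –int→ I9; I1 –E→ I2; I2 –+→ I3; I2 –-→ I4; I3 –E→ I5 –)→ I6; I4 –E→ I7 –)→ I8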

Parser for LR(0) languages

• Use the modified Earley parser we used for LL grammars

– each Σ-state will have multiple items as in the original Earley parser since LR parsers follow multiple paths too

• No need to keep track of GFG nodes within each Σ-state

– states in compressed path DFA correspond to possible Σ-states

– So modified Earley parser just pushes and pops DFA states

GFG path interpretation

• Let P1 and P2 be two GFG paths with identical labels

• Sufficient condition for labels of compressed paths to be equal:

– sequences of completed calls in P1 and P2 are identical

• Most of the action in LR parsers happens at EXIT nodes of productions

[Figure: two paths P1 and P2 from START, each passing through a completed call START-P ... END-P.]

LR(0) conflicts: GFG

• LR(0) conflicts (GFG definition):

– Shift-reduce conflict: there are parallel paths P1: START →+ Aexit and P2: START →+ scan-node

– Reduce-reduce conflict: there are parallel paths P1: START →+ Aexit and P2: START →+ Bexit

• Claim: Let G be an LR(0) grammar according to GFG definition.

– P1 and P2 are two GFG paths that end at SCAN or END nodes, and C(P1) and C(P2) are their compressed equivalents

– P1 and P2 have the same label iff C(P1) and C(P2) have the same label

[Figure: reduce-reduce conflict: two paths from START with the same terminal label t* reach Aexit and Bexit. Shift-reduce conflict: two paths from START with the same terminal label t* reach Aexit and a scan node B whose outgoing edge is labeled u.]

LR(0) conflicts: GFG

• Claim: Let G be an LR(0) grammar according to GFG definition.

– P1 and P2 are two GFG paths that end at SCAN or END nodes, and C(P1) and C(P2) are their compressed equivalents

– P1 and P2 have the same label iff C(P1) and C(P2) have the same label

• This claim is not true if the paths do not end at SCAN or END nodes

– counterexample: in this LR(0) grammar, consider paths from START to nodes S→A.a and S→.Uc

S → Aa | Uc

U → Ab

A → ε

Example

• States with LR(0) conflicts

– (A→d., B→d.)

• Conflicting context pairs

(i) path label: d

– C1: START, S→.Aa, A→.d, A→d.

– C2: START, S→.Bc, B→.d, B→d.

(ii) path label: bd

– C3: START, S→.bAc, S→b.Ac, A→.d, A→d.

– C4: START, S→.bBa, S→b.Ba, B→.d, B→d.

• So grammar is not LR(0)

[Figure: extended GFG for S → Aa | bAc | Bc | bBa, A → d, B → d.]

LR(0) H&U

• A grammar G is LR(0) if

– its start symbol does not appear on the right side of any production, and

– for every viable prefix γ, whenever A → α. is a complete valid item for γ, then no other complete item nor any item with a terminal to the right of the dot is valid for γ.

• Comment:

– by this definition, the only other valid items that can occur together with A → α. are incomplete items with a non-terminal to the right of the dot, of the form B → β.Cδ

– if First(C) contains a terminal t, it can be shown that an item of the form Y → .tλ is valid for γ, violating the LR(0) condition. Therefore, First(C) = {ε}. It can be shown that this means α = ε

– Example: this grammar is LR(0) (A → . and B → .Cd are valid items for viable prefix ε)

• S → B

• B → Cd

• C → A

• A → ε

Look-ahead in LR grammars

• LR(k)

– for each pair of parallel paths to LR(0) conflicting states, k-look-ahead sets are disjoint

• SLR(k):

– if there is LR(0) conflict at nodes A and B, context-insensitive look-ahead sets of A and B are disjoint

• LALR(k): grammar is SLR(k) after reachability cloning

[Figure: the reduce-reduce and shift-reduce conflict diagrams from the “LR(0) conflicts: GFG” slide.]

Example

• States with LR(0) conflicts

– (A→d., B→d.)

• Conflicting context pairs

(i) path label: d

– C1: START, S→.Aa, A→.d, A→d.

– C2: START, S→.Bc, B→.d, B→d.

(ii) path label: bd

– C3: START, S→.bAc, S→b.Ac, A→.d, A→d.

– C4: START, S→.bBa, S→b.Ba, B→.d, B→d.

• Grammar is LR(1)

– Look-ahead for C1: {a}, look-ahead for C2: {c}

– Look-ahead for C3: {c}, look-ahead for C4: {a}

[Figure: extended GFG for S → Aa | bAc | Bc | bBa, A → d, B → d.]

LR(1) automaton

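Grammar: S → Aa | bAc | Bc | bBa, A → d, B → d

States and transitions (each item is followed by its look-ahead):

– I0 = {S→.Aa,$; S→.bAc,$; S→.Bc,$; S→.bBa,$; A→.d,a; B→.d,c} (start state)

– I0 –d→ {A→d.,a; B→d.,c}

– I0 –b→ I1 = {S→b.Ac,$; S→b.Ba,$; A→.d,c; B→.d,a}

– I1 –d→ {A→d.,c; B→d.,a}

– I0 –A→ {S→A.a,$} –a→ {S→Aa.,$}

– I0 –B→ {S→B.c,$} –c→ {S→Bc.,$}

– I1 –A→ {S→bA.c,$} –c→ {S→bAc.,$}

– I1 –B→ {S→bB.a,$} –a→ {S→bBa.,$}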

LALR look-ahead computation

• Key observation:

– each path START →* s in deterministic LR(0) automaton represents a set of contexts in the non-deterministic LR(0) automaton

• each context in this set ends at one of the items in s

– in general, there will be multiple paths to state s in deterministic LR(0) automaton

– so each state in LR(0) automaton represents a set of sets of contexts

– in LALR, we merge the look-aheads for those contexts

• LALR = reachability cloning + SLR (Bermudez and Logothetis) + unions at some nodes (see the R→L. state in the diagram on the next page)

LALR(1) but not SLR(1)

Grammar:

S’ → S $

S → L = R | R

L → *R | id

R → L

FOLLOW(S) = { $ }

FOLLOW(R) = { =, $ }

FOLLOW(L) = { =, $ }

LR(0) automaton states:

– {S’→.S$, S→.L=R, S→.R, L→.*R, L→.id, R→.L} (start state)

– {S’→S.$}, {S’→S$.}

– {S→L.=R, R→L.} (shift-reduce conflict)

– {S→R.}

– {S→L=.R, R→.L, L→.*R, L→.id}

– {S→L=R.}

– {R→L.}

– {L→id.}

– {L→*.R, R→.L, L→.*R, L→.id}

– {L→*R.}

LALR → SLR grammar

Original grammar:

S’ → S $

S → L = R | R

L → *R | id

R → L

[Figure: the LR(0) automaton of the previous slide with its non-terminal edges renamed: the three edges labeled L become L1, L2, L3 and the three edges labeled R become R1, R2, R3.]

Grammar after reachability cloning:

S’ → S $

S → L1 = R2 | R1

L1, L2, L3 → *R3 | id

R1 → L1

R2 → L2

R3 → L3

LR(0): Reachability cloning

• Motivation: NFA → DFA conversion for LR grammars

• Driven by compressed paths

• Need to verify that this cloning satisfies sanity condition even if grammar is not LR(0)

• Compressed contexts C1 and C2 of node A are in same equivalence class if

set of GFG nodes reachable by paths with label(C1) = set of GFG nodes reachable by paths with label(C2)

[Figure: three contexts C1, C2 and C3 from START to nodes A and B.]

C1 and C2 will be in the same equivalence class. C3 is in a different class.

Algorithm (need to write)

• G = (V,T,P,S): grammar

• R(G) is the following grammar

– non-terminals: {[Ai] | A in V − T, 1 <= i <= n and there are n edges labeled A in compressed path DFA}

– terminals: T

– start symbol: [S]

– rules: all rules of the form [Ai] → X1'X2'X3'...Xm' where for some rule A → X1X2X3...Xm in P

• Xi' = Xi if Xi is a terminal

• Xi' = [Xi] when Xi is a non-terminal.

Cloning for LALR(1)

• Same condition as LR(0): reachability cloning

• Extension to LA(k)LR(l):

– cloning is governed by LR(l)

– compute SLR(k) look-aheads

– LALR(k) is LA(k)LR(0)

– LR(k) is LA(k)LR(k): clone as in LR(l) with l = k

Summary

• New abstraction for CFL parsing

– Grammar Flow Graph (GFG)

• Parsing problems become path problems in GFG

• Earley parser emerges as simple extension of NFA simulation

• Mechanisms

– control number of paths followed during parsing

– look-ahead:

• algorithm: solving set constraints

– context-dependent look-ahead

• algorithm: cloning

• SLL(k), LL(k), SLR(k), LR(k), LALR(k) grammars arise from different choices of these mechanisms

• LL and LR parsers emerge as specializations of Earley parser

LR(0) ε-DFA

[Figure: LR(0) ε-DFA for E → int | (E+E) | (E-E) with states M0 to M9. Item sets shown: {E→.(E+E), E→.(E-E), E→.int}, {E→int.}, {E→(.E+E), E→(.E-E)}, {E→(E.+E), E→(E.-E)}, {E→(E+.E)}, {E→(E-.E)}, {E→(E+E.)}, {E→(E-E.)}, {E→(E+E).}, {E→(E-E).}. Edges are labeled int, (, E, +, -, ) or ε.]

Input string: ((2+3)-4)

Trace of <state, return address> pairs: <M0,0> <M2,0> <M0,1> <M2,1> <M0,2> <M1,2> <M4,1> <M3,1> <M0,4> <M1,4> <M6,1> <M8,1> <M4,0> <M5,0> <M0,7> <M1,7> <M7,0> <M9,0>

LALR(1) example from G&J

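Grammar:

S’ → S #

S → A B c

A → a

B → b

B → ε

LR(0) automaton:

– {S’→.S#, S→.ABc, A→.a} (start state)

– start state –S→ {S’→S.#}

– start state –a→ {A→a.}

– start state –A→ {S→A.Bc, B→., B→.b}

– {S→A.Bc, B→., B→.b} –b→ {B→b.}

– {S→A.Bc, B→., B→.b} –B→ {S→AB.c} –c→ {S→ABc.}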

Grammar:

S → L = R | R

L → *R | id

R → L

Shift-reduce conflict occurs at states C and Rend (conflicting paths are S → L → Lend → C and S → R → L → Lend → Rend).

1-look-ahead at C is =.

Context-independent 1-look-ahead at Rend is {=,$}, so grammar is not SLR(1).

LALR(1) figures out that for the conflicting state, the calling context must be S → R. Look-ahead at Rend is = for context S → L → T → R → L → Lend → Rend, but there is no context S →* C parallel to this one.

[Figure: GFG with procedures S ... Send, L ... Lend and R ... Rend, edges labeled *, id and =, and the nodes C and T referred to above.]

LR(1)

[Figure: GFG for S → L = R | R, L → *R | id, R → L, before and after procedure cloning.]

FIRST(L) = FIRST(R) = {*,id}

Shift-reduce conflict: id $

After procedure cloning: clones [L,{=}], [R,{$}], [R,{=}], [L,{$}]

LALR(1) look-aheads

Grammar: S → (S) | ε

LALR(1) automaton (look-aheads in brackets):

– T0 = {S’→.S$, S→.(S) [$], S→. [$]}

– T1 = {S→(.S) [$,)], S→.(S) [)], S→. [)]}

– other states: {S→(S.) [$,)]}, {S→(S). [$,)]}, {S’→S.$}

– transitions: T0 –(→ T1, T1 –(→ T1, T1 –S→ {S→(S.)} –)→ {S→(S).}, T0 –S→ {S’→S.$}

• After reduction S → (S), parsing can resume either in state T0 or T1.

• LR parser stack tells you which one to resume from

• LALR(1) look-aheads in state T1 are interesting. Item S→(.S) gets look-ahead from item S→.(S) in state T0 as well as item S→(.S) from state T1.

Parsing techniques

• Our focus: techniques that perform breadth-first traversal of GFG

– similar to online simulation of NFA

– input is read left to right one symbol at a time

– extend current GFG paths if possible, using symbol

• Three dimensions:

– non-determinism: how many paths can I follow at a given time?

– look-ahead: how many symbols of look-ahead are known at each point?

– context: how much context do we keep?

• this is implemented by procedure cloning, independent of look-ahead

What we would like to show

• Obvious algorithm:

– follow all CFL-paths in GFG

– essentially a fancy transitive closure in GFG

– leads to Earley’s algorithm

– O(n^3) complexity

• O(n) algorithms: LL/LR/LALR,…

– preprocessing to compute look-ahead sets

– maintain compressed paths

– ensure that Earley sets can be manipulated as a stack


What we would like to show (contd.)

• SLL(k) = no cloning + decision at procedure start

• LL(k) = k-look-ahead-cloning + decision at procedure start

• LA(l)LL(k) = l-look-ahead-cloning + context-independent k-look-ahead + decision at procedure start

• SLR(k) = no cloning + decision at procedure end

• LR(k) = k-look-ahead-cloning + decision at procedure end

• LALR(k) = reachability-cloning + decision at procedure end

Computing context-independent look-ahead

• Intuition:

– simple inter-procedural backward dataflow analysis in GFG

– assume look-ahead at exit of GFG is {$^k}

– propagate look-ahead back through GFG to determine look-aheads at other points

• How do we propagate look-aheads through non-terminal calls?

– would like to avoid repeatedly analyzing procedure for each look-ahead set we want to propagate through it

– need to handle recursive calls

– ideally, we would have a function that tells us how a look-ahead set at the exit of a procedure gets propagated to its input

Grammar: S → xNab | yNbc, N → a | ε

[Figure: GFG for this grammar with 2-symbol look-aheads {xa} at S→.xNab and {ya,yb} at S→.yNbc.]

Every LL(1) grammar is an SLL(1) grammar

[Figure: GFG from START to END in which procedure N is called from two contexts C1 and C2 (reached via x and y), with complementary contexts C1’ and C2’ after the returns; N has two choices P and Q.]

Let the strings generated by paths P and Q be SP and SQ.

Cases:

- SP = a and SQ = a: grammar is neither LL(1) nor SLL(1)

- SP = a and SQ = b: grammar is LL(1) and SLL(1)

- SP = ε and SQ = ε: grammar is neither LL(1) nor SLL(1)

- SP = a and SQ = ε:

- We show that there cannot be a context Ci for which the generated string for the complementary context Ci’ is a

- Otherwise, for context Ci, 1-lookahead for choice P is a and 1-lookahead for choice Q is a, so the grammar is not LL(1).

- Therefore, there is no context Ci for which the 1-lookahead for choice Q is a.

- But this means that the context-independent 1-lookahead for choice Q cannot contain a.

- Therefore the grammar is SLL(1).

LL(2) grammar that is not SLL(2)

- Grammar:

S → xNab

S → yNbc

N → a

N → ε

[Figure: GFG from START to END in which N is called from context C1 (after x, followed by ab) and context C2 (after y, followed by bc); N has two choices, P (N → a) and Q (N → ε).]

- Consider the context-sensitive look-aheads at N.

- For context C1,

2-lookahead for choice P is {aa}

2-lookahead for choice Q is {ab}

- For context C2,

2-lookahead for choice P is {ab}

2-lookahead for choice Q is {bc}.

- Therefore, grammar is LL(2).

- Context-independent lookaheads:

2-lookahead for choice P is {aa,ab}

2-lookahead for choice Q is {ab,bc}.

- Since these two sets are not disjoint, the grammar is not SLL(2).
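The comparison above can be checked mechanically for the choice at N. The sketch below is deliberately narrow and rests on stated assumptions: the choices at N contain only terminals, each call site is described by the terminal string that follows the call (its continuation), and the function name `lookaheads` is hypothetical. It does not handle general grammars.

```python
def lookaheads(choices, continuations, k):
    """Return (per-context, context-independent) k-look-aheads for each choice.

    `choices` are terminal-only right-hand sides of the non-terminal;
    `continuations` are the terminal strings that follow each call site.
    """
    pad = "$" * k  # end-of-input markers
    per_context = {
        cont: [(rhs + cont + pad)[:k] for rhs in choices]
        for cont in continuations
    }
    merged = [
        {(rhs + cont + pad)[:k] for cont in continuations}
        for rhs in choices
    ]
    return per_context, merged


# N → a | ε, called in S → xNab (continuation ab) and S → yNbc (continuation bc)
PER_CONTEXT, MERGED = lookaheads(["a", ""], ["ab", "bc"], 2)

# LL(2): within each context, the look-aheads of the choices are all different.
IS_LL2 = all(len(set(las)) == len(las) for las in PER_CONTEXT.values())

# SLL(2): the context-independent look-ahead sets are pairwise disjoint.
IS_SLL2 = all(
    MERGED[i].isdisjoint(MERGED[j])
    for i in range(len(MERGED))
    for j in range(i + 1, len(MERGED))
)
```

The string ab shows up as a look-ahead for choice P in context C2 and for choice Q in context C1, which is why merging the contexts loses the distinction.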

Cloning for LR(k)

• From Sippu & Soisalon-Soininen

– replace each non-terminal A in the original grammar G with the set of all pairs of the form ([γ]k, A) where γ is a viable prefix of the $-augmented grammar G

• [page 16] String γ1 is LR(k)-equivalent to string γ2 if VALIDk(γ1) = VALIDk(γ2); i.e. exactly those items valid for γ2 are valid for γ1 and vice versa.

• An item [A → α.β, y] is LR(k)-valid for γ if

S ⇒rm* δAz ⇒rm δαβz = γβz and k:z = y

• Question:

– is this a finer equivalence class than LL(k)?

Sanity condition on equivalence classes

• If C1 and C2 are two contexts for some node N and

– C1 = B1 + P

– C2 = B2 + P

– B1 and B2 are in the same equivalence class

then C1 and C2 must be in the same equivalence class

• Can we come up with a general construction procedure for cloning, given a specification of the equivalence classes?

[Figure: paths B1 and B2 from START join and continue along a common path P to node N.]

Date post:	06-Jul-2015
Upload:	kpingali