Parsing Needs
New Abstractions
11/23/2011 1
Problem
• Parsing of context-free languages
  – active research topic from the 1960s to the 1980s
  – a rich variety of parsing techniques is known
• general CFL parsing:
  – Earley’s algorithm, Cocke-Younger-Kasami (CYK)
• deterministic parsing:
  – SLL(k), LL(k), SLR(k), LR(k), LALR(k), LA(l)LR(k), ...
• Problem: most of these techniques were invented by automata theory people
  – terminology is fairly obscure: leftmost derivations, rightmost derivations, handles, viable prefixes, ...
  – string rewriting is very clean but not intuitive for most PL people
  – descriptions in compiler textbooks are obscure/erroneous
  – connections between different parsing techniques are lost
• Question: is there an easier way of thinking about parsing than in terms of strings and string rewriting?
New abstraction
• For any context-free grammar, construct a Grammar Flow Graph (GFG)
  – syntax: representation of grammar as a control-flow graph
  – semantics: executable representation
    • special kind of non-deterministic pushdown automaton
• Parsing problems become path problems in GFG
• Alphabet soup of grammar classes like LL(k), SLL(k), LR(k), LALR(k), SLR(k) etc. can be viewed as choices along three dimensions
  – non-determinism: how many paths can we explore at a time?
    • all (Earley), only one (LL), some (LR)
  – look-ahead: how much do we know about the future?
    • solve fixpoint equations over sets
  – context: how much do we remember about the past?
    • procedure cloning
GFG example
S → Aa | bAc | Bc | bBa
A → d
B → d
[Figure: GFG for this grammar. Nodes START-S, END-S, START-A, END-A, START-B, END-B, plus one procedure per production (S → .Aa, S → A.a, S → Aa., ..., A → .d, A → d., B → .d, B → d.). Edges are labeled with the terminals a, b, c, d or with ε.]
GFG construction
For each non-terminal A, create nodes labeled START-A and END-A.
For each production in the grammar, create a “procedure” and connect it to the START and END nodes of the LHS non-terminal.
[Figure: procedures for the productions A → ε and A → bXY. START-A has ε-edges to the nodes A → . and A → .bXY; the edge from A → .bXY to A → b.XY is labeled b; the call to X goes from A → b.XY to START-X and returns from END-X to A → bX.Y, and similarly for Y; the nodes A → . and A → bXY. have ε-edges to END-A.]
Edges labeled ε: only at entry/exit of START-A and END-A nodes.
Fan-out: only at exit of START-A nodes and END-A nodes.
Terminal transition node: node whose outgoing edge is labeled with a terminal.
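The construction rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding of the grammar, the node label strings, and the names `item` and `build_gfg` are hypothetical choices, every grammar symbol is a single character, and the empty label `""` stands for ε.

```python
def item(lhs, rhs, dot):
    """Label of the GFG node for production lhs -> rhs with the dot at position `dot`."""
    return f"{lhs} -> {rhs[:dot]}.{rhs[dot:]}"

def build_gfg(grammar):
    """Return the GFG edges as (source, label, target); label "" stands for epsilon."""
    edges = []
    for lhs, alternatives in grammar.items():
        for rhs in alternatives:
            # START-A fans out to the entry node of each production of A.
            edges.append((f"START-{lhs}", "", item(lhs, rhs, 0)))
            for dot, sym in enumerate(rhs):
                here, there = item(lhs, rhs, dot), item(lhs, rhs, dot + 1)
                if sym in grammar:
                    # Non-terminal: call edge to START-X, return edge from END-X.
                    edges.append((here, "", f"START-{sym}"))
                    edges.append((f"END-{sym}", "", there))
                else:
                    # Terminal transition node: outgoing edge labeled with the terminal.
                    edges.append((here, sym, there))
            # The exit node of the production leads to END-A.
            edges.append((item(lhs, rhs, len(rhs)), "", f"END-{lhs}"))
    return edges

# Running example of the slides: S -> Aa | bAc | Bc | bBa, A -> d, B -> d
GRAMMAR = {"S": ["Aa", "bAc", "Bc", "bBa"], "A": ["d"], "B": ["d"]}
EDGES = build_gfg(GRAMMAR)
```

In this encoding fan-out arises only at START and END nodes, as stated above: START-S has four outgoing edges (one per production of S) and END-A has two (one per call site of A).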
Terminology
[Figure: the GFG fragment for A → ε and A → bXY from the construction slide, with each kind of node labeled.]
Node kinds: start node (START-A), end node (END-A), entry node (A → .bXY), exit node (A → bXY.), call node (A → b.XY), return node (A → bX.Y).
Non-deterministic GFG automaton
• Interpretation of GFG: NGA
  – similar to NFA
• Rules:
  – begin at START-S
  – at START nodes, make a non-deterministic choice
  – at END nodes, must follow a CFL path
    • “return to the same procedure from which you made the call”
• A CFL path from START to END corresponds to a leftmost derivation
• Label(path):
  – sequence of terminal symbols labeling edges in path
  – label of a CFL path from START to END is a word in the language generated by the CFG
[Figure: GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, from START-S to END-S.]
Parsing problem
• Paths(l):
  – set of paths with label l
  – inverse relation of Label
• Parsing problem: given a grammar G and a string S,
  – find all paths in GFG(G) that generate S, or
  – demonstrate that there is no such path
• Parallel paths:
  – P1 = START-S ⇝ A
  – P2 = START-S ⇝ B
  – Label(P1) = Label(P2)
  – equivalence relation on paths originating at START-S
• Ambiguous grammar
  – two or more parallel paths START-S ⇝ END-S
[Figure: GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, from START-S to END-S.]
Compressed paths
Addition to GFG
• We need to be able to talk about sentential forms, not just sentences
• Small modification to GFG:
  – add transitions labeled with non-terminals at procedure calls
• Some paths will have edges labeled with non-terminals
  – non-terminals that have not been “expanded out”
[Figure: extended GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, with additional edges labeled A and B at the procedure calls.]
Compressed GFG paths
• More compact representation of GFG path
• Idea:
  – collapse portion of path between start and end of a given procedure and replace with non-terminal
• Point: completed calls cannot affect further evolution of the path, so we need not store the full path
• Edges going out of END nodes of procedures will never appear in the compressed representation
[Figure: a path P1 from START passes through START-P and END-P; in the compressed path this portion is replaced by a single edge labeled P.]
NFA for compressed paths
• Start from extended GFG
• Remove edges out of END nodes since these will never be in a compressed path
• Each path in the NFA corresponds to a compressed GFG path
[Figure: NFA for compressed paths of S → Aa | bAc | Bc | bBa, A → d, B → d.]
Following all paths: Earley’s algorithm
Recall: NFA simulation
• Input string is processed left to right, one symbol at a time
• Deterministic simulator keeps track of all states the NFA could be in as the input is processed
• Simulation
  – simulated state = subset of NFA states
  – if current simulated state is C and next input symbol is t, compute next simulated state N as follows:
    • scanning: if state s_i ∈ C and NFA has transition s_i -t-> s_j, add s_j to N
    • prediction: if state s_j ∈ N and NFA has transition s_j -ε-> s_k, add s_k to N
  – initial simulated state = set of initial states of NFA closed with the prediction rule
Example: {s0,s1,s4} -a-> {s2} -a-> {s2,s3,s7} ...
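The scanning and prediction rules can be sketched as a small Python simulator. The NFA used here is a made-up example (it accepts words over {a, b} that end in "ab"), and the names `eps_closure`, `simulate`, `DELTA`, `EPS` are hypothetical; the transition tables are plain dictionaries.

```python
def eps_closure(states, eps):
    """Prediction rule: add every state reachable through epsilon transitions."""
    seen, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                work.append(t)
    return seen

def simulate(word, start, delta, eps, accepting):
    """Track the set of states the NFA could be in, one input symbol at a time."""
    current = eps_closure(start, eps)          # initial simulated state
    for t in word:
        # Scanning rule: follow all transitions labeled t out of the current set.
        scanned = {sj for si in current for sj in delta.get((si, t), ())}
        # Prediction rule: close the new set under epsilon transitions.
        current = eps_closure(scanned, eps)
    return bool(current & accepting)

# Hypothetical NFA: state "s" has an epsilon edge to 0; 0 loops on a and b;
# 0 -a-> 1 and 1 -b-> 2, with 2 accepting.
DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
EPS = {"s": {0}}
```

With these tables, `simulate("aab", {"s"}, DELTA, EPS, {2})` accepts and `simulate("aba", {"s"}, DELTA, EPS, {2})` rejects.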
Analog in GFG
• First cut: use exactly the same idea
  – current state C, next state N, next input symbol is t
  – scanning: if state s_i ∈ C and NFA has transition s_i -t-> s_j, add s_j to N
  – prediction: if state s_j ∈ N and NFA has transition s_j -ε-> s_k, add s_k to N
• Problem: not clear how to make ε-transitions at return states like s18 and s12
• Solution: keep “return addresses” as in Earley
[Figure: GFG for S → Aa | bAc | Bc | bBa, A → d, B → d with nodes numbered S0 to S19. Simulation: {S0,S1,S4,S8,S13,S17,S11} -d-> {S12,S18, ?????}]
[Figure: Earley parse of the input string (9+6) for the grammar E → int | (E+E) | (E-E). The Earley sets Σ0 to Σ5 hold GFG nodes tagged with a return address, for example <START-E, 0>, <E → .(E+E), 0>, <E → (.E+E), 0>, <START-E, 1>, <E → int., 1>, <END-E, 1>, <E → (E.+E), 0>, <E → (E+.E), 0>, <START-E, 3>, <E → int., 3>, <END-E, 3>, <E → (E+E.), 0>, <E → (E+E)., 0>, <END-E, 0>.]
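A textbook-style Earley recognizer shows how the return addresses make the ε-transitions at END nodes well defined. The sketch below is not the GFG formulation itself: it stores dotted items with an origin position rather than explicit START and END nodes, it assumes a grammar without ε-productions (true for the expression grammar), and the names `earley_recognize` and `EXPR` are hypothetical.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if tokens can be derived from start (no epsilon productions assumed)."""
    # An item (lhs, rhs, dot, origin) is a dotted production plus its return address.
    sets = [set() for _ in range(len(tokens) + 1)]
    sets[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(len(tokens) + 1):
        work = list(sets[i])
        while work:
            lhs, rhs, dot, origin = work.pop()
            new = []
            if dot == len(rhs):
                # Completion (return): resume the callers waiting in set `origin`.
                new = [(l, r, d + 1, o) for (l, r, d, o) in sets[origin]
                       if d < len(r) and r[d] == lhs]
            elif rhs[dot] in grammar:
                # Prediction (call): enter every production of the non-terminal.
                new = [(rhs[dot], r, 0, i) for r in grammar[rhs[dot]]]
            elif i < len(tokens) and tokens[i] == rhs[dot]:
                # Scanning: consume the next input symbol.
                sets[i + 1].add((lhs, rhs, dot + 1, origin))
            for it in new:
                if it not in sets[i]:
                    sets[i].add(it)
                    work.append(it)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in sets[len(tokens)])

# E -> int | (E+E) | (E-E), with numbers tokenized as "int"
EXPR = {"E": [("int",), ("(", "E", "+", "E", ")"), ("(", "E", "-", "E", ")")]}
```

For the input (9+6), tokenized as `["(", "int", "+", "int", ")"]`, the recognizer accepts; dropping the second operand makes it reject.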
Earley parser and GFG states
• A given Σ set can contain multiple instances of the same GFG state.
• Example: S → aS | a
• Earley set Σ_i
  – <S → a.S, i-1>
  – <S → a., i-1>
  – <S → .aS, i>
  – <S → .a, i>
  – <S → aS., i-2>
  – <S → aS., i-3>
  – ...
  – <S → aS., 0>
Earley’s parser and ambiguous grammars
• If an Earley configuration can be added to a given Σ set by two or more configurations, the grammar is ambiguous
• Example: the substring between positions p and t can be derived from A in two different ways
[Figure: in Σ_t, the configurations <X → α., p1> and <Y → β., p2> both add the configuration <Z → γA.δ, p>.]
Look-ahead computation
• Look-ahead at point p in GFG:
  – first k symbols you might encounter on a path starting at p
  – k is a small integer that is given for the entire grammar
• Subtle point:
  – look-ahead may depend on the path from START that you took to get to p
  – (e.g.) 2-look-ahead at entry of N is different for red and blue calls
• Two approaches:
  – context-independent look-ahead: first k symbols on paths starting at p
  – context-dependent look-ahead: given a path C from START to p, what are the first k symbols on any path starting at p that extends C
S → xNab | yNbc
N → a | ε
[Figure: GFG for this grammar with two calls to N (red and blue). 2-look-aheads at the two alternatives of S: {xa} and {ya,yb}. Context-independent 2-look-aheads at the two alternatives of N: {aa,ab} and {ab,bc}.]
FIRST_k sets
• FIRST_k(A): set of strings of length k or less
  – if A ⇒* s where s is a terminal string of length k or less, s ∈ FIRST_k(A)
  – if A ⇒* s where s is a terminal string longer than k symbols, then the k-prefix of s ∈ FIRST_k(A)
• Intuition:
  – non-terminal A represents a set, which is the set of strings we can derive from it
  – FIRST_k(A) is the set of k-prefixes of these strings
• Easy to extend FIRST_k to sequences of grammar symbols
S → xNab | yNbc
N → a | ε
FIRST_2(N) = {a, ε}
FIRST_2(Nab) = {aa, ab}
[Figure: GFG for this grammar.]
Useful string functions
• Concatenation: s + t
  – (e.g.) xy + abc = xyabc
• k-prefix of string s: s_k
  – (e.g.) (xyz)_2 = xy, (x)_2 = x, (ε)_2 = ε
• Composition of concatenation and k-prefix: s +_k t
  – defined as (s+t)_k
  – (e.g.) x +_2 yz = xy
  – operation is associative
• Easy result: (s+t)_k = (s_k + t_k)_k = s_k +_k t_k
• Operations can be extended to sets in the obvious way
  – (e.g.) {a,bcd} +_2 {ε,x,yz} = {a,ax,ay,bc}
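These string functions map directly onto Python string slicing. A minimal sketch, with hypothetical names `k_prefix`, `concat_k` and `concat_k_sets`, and the empty string standing for ε:

```python
def k_prefix(s: str, k: int) -> str:
    """k-prefix of s: the first k symbols, or all of s if it is shorter."""
    return s[:k]

def concat_k(s: str, t: str, k: int) -> str:
    """s +_k t, defined as the k-prefix of the concatenation s + t."""
    return k_prefix(s + t, k)

def concat_k_sets(xs: set, ys: set, k: int) -> set:
    """Extension of +_k to sets: apply it to every pair of elements."""
    return {concat_k(x, y, k) for x in xs for y in ys}
```

The set example from the slide can be checked with `concat_k_sets({"a", "bcd"}, {"", "x", "yz"}, 2)`, which evaluates to `{"a", "ax", "ay", "bc"}`.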
FIRST_k
FIRST_k(ε) = {ε}
FIRST_k(t) = {t}
FIRST_k(A) = FIRST_k(X1 X2 ... Xn) ∪ FIRST_k(Y1 Y2 ... Ym) ∪ ...   // rhs of productions of A
FIRST_k(X1 X2 ... Xn) = FIRST_k(X1) +_k FIRST_k(X2) +_k ... +_k FIRST_k(Xn)
FIRST_k example
S → aAab | bAb
A → cAB | ε | a
B → ε
FIRST_2(S) = FIRST_2(aAab) ∪ FIRST_2(bAb)
           = ({a} +_2 FIRST_2(A) +_2 {ab}) ∪ ({b} +_2 FIRST_2(A) +_2 {b})
FIRST_2(A) = FIRST_2(cAB) ∪ {ε} ∪ {a} = ({c} +_2 FIRST_2(A) +_2 FIRST_2(B)) ∪ {ε} ∪ {a}
FIRST_2(B) = {ε}
Solution:
FIRST_2(A) = {ε, a, c, ca, cc}
FIRST_2(B) = {ε}
FIRST_2(S) = {aa, ac, bb, ba, bc}
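The equations above can be solved by iterating from empty sets until nothing changes. The following Python sketch assumes one character per grammar symbol and uses the empty string for ε; the function names and the dictionary encoding of the grammar are hypothetical.

```python
def concat_k_sets(xs, ys, k):
    """Set version of +_k: k-prefixes of all pairwise concatenations."""
    return {(x + y)[:k] for x in xs for y in ys}

def first_k_seq(seq, first, k):
    """FIRST_k of a sequence of grammar symbols, given FIRST_k of the non-terminals."""
    acc = {""}
    for sym in seq:
        # A terminal t contributes {t}; a non-terminal contributes its current set.
        acc = concat_k_sets(acc, first.get(sym, {sym}), k)
    return acc

def first_k(grammar, k):
    """Least solution of the FIRST_k equations, by fixpoint iteration."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_k_seq(rhs, first, k)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

# S -> aAab | bAb,  A -> cAB | epsilon | a,  B -> epsilon
G = {"S": ["aAab", "bAb"], "A": ["cAB", "", "a"], "B": [""]}
FIRST2 = first_k(G, 2)
```

On this grammar the iteration reproduces the sets on the slide: `FIRST2["A"]` is `{"", "a", "c", "ca", "cc"}` and `FIRST2["S"]` is `{"aa", "ac", "bb", "ba", "bc"}`.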
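Context-independent look-aheads
Compute FOLLOW_k(A) sets: strings of length k that can be encountered after you return from non-terminal A.
For the grammar S → aAab | bAb, A → cAB | ε | a, B → ε with k = 2, where S_e, A_e, B_e are the look-ahead sets at the ends of S, A and B:
S_e = {$$}
A_e = (FIRST_2(ab) +_2 S_e) ∪ (FIRST_2(b) +_2 S_e) ∪ (FIRST_2(B) +_2 A_e)
B_e = A_e
Solution: S_e = {$$}, A_e = {ab, b$}, B_e = {ab, b$}
From these FOLLOW sets, we can now compute look-ahead at any GFG point.
[Figure: GFG for this grammar, with look-aheads {ab} and {b$} at the two returns from A in the productions of S.]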
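The FOLLOW_k equations can be solved with the same kind of fixpoint iteration as FIRST_k. This is a sketch under the same assumptions as before (single-character symbols, empty string for ε, hypothetical function names); the end marker `$` repeated k times is used as the look-ahead at the end of the start symbol, as on the slide.

```python
def concat_k_sets(xs, ys, k):
    return {(x + y)[:k] for x in xs for y in ys}

def first_k_seq(seq, first, k):
    acc = {""}
    for sym in seq:
        acc = concat_k_sets(acc, first.get(sym, {sym}), k)
    return acc

def first_k(grammar, k):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_k_seq(rhs, first, k)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def follow_k(grammar, start, k):
    """Least solution of the FOLLOW_k equations (look-ahead at END-A for each A)."""
    first = first_k(grammar, k)
    follow = {nt: set() for nt in grammar}
    follow[start] = {"$" * k}
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue
                    # What follows this call: the rest of the rhs, then FOLLOW_k(lhs).
                    rest = first_k_seq(rhs[i + 1:], first, k)
                    new = concat_k_sets(rest, follow[lhs], k)
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

G = {"S": ["aAab", "bAb"], "A": ["cAB", "", "a"], "B": [""]}
FOLLOW2 = follow_k(G, "S", 2)
```

For this grammar the result agrees with the solution on the slide: `{"$$"}` for S and `{"ab", "b$"}` for both A and B.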
Computing context-independent look-ahead sets
• Algorithm:
  – for each non-terminal A, compute FIRST_k(A)
    • first k terminals you encounter on a path START-A ⇝ END-A
  – for each non-terminal A, compute FOLLOW_k(A)
    • first k terminals you encounter on a path that extends a GFG path START ⇝ END-A
  – use the FIRST_k and FOLLOW_k sets to compute the look-ahead at any point of interest in GFG
• You can even compute FIRST_k and FOLLOW_k sets in one big iteration if you want.
• This computation is independent of the particular parsing method used
Production cloning: a way of implementing context-dependence
Context-dependent look-ahead
• In the running example,
  – look-aheads for N for the red call to N are disjoint
  – look-aheads for N for the blue call to N are disjoint
  – context-independent look-ahead computation combines the look-aheads from all the call sites of N at the bottom of N and propagates them to the top
• Idea:
  – compute look-aheads separately for each context
  – keep track of context while parsing ⇒ we can get a more capable parser
S → xNab | yNbc
N → a | ε
Input string: xab$$
[Figure: GFG with the two calls to N. Context-independent 2-look-aheads at the alternatives of N are {aa,ab} and {ab,bc}; the look-ahead at the return from N is {ab} for one call and {bc} for the other. 2-look-aheads at the alternatives of S are {xa} and {ya,yb}.]
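Tracking context by cloning
• Original grammar:
S → xNab | yNbc
N → a | ε
• Cloned grammar:
S → xN1ab | yN2bc
N1 → a | ε
N2 → a | ε
[Figure: GFG of the cloned grammar, with N1 = [N,{ab}] and N2 = [N,{bc}]. 2-look-aheads at the alternatives of N1: {aa} and {ab}; at the alternatives of N2: {ab} and {bc}.]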
General idea of cloning
• Cloning creates copies of productions
• Intuitively we would like to create a clone of a production for each of its contexts and write down the look-ahead
  – but the set of contexts for a production is usually infinite
• Solution:
  – create a finite number of equivalence classes of contexts for a given production
  – create a clone for each equivalence class
  – compute context-independent look-ahead
• Two cloning rules are important in practice
  – k-look-ahead cloning: two contexts are in the same equivalence class if their k-look-aheads are identical (used in LL(k))
  – reachability cloning: two contexts C1 and C2 are in the same equivalence class if the set of GFG nodes reachable by paths with label(C1) is equal to the set of GFG nodes reachable by paths with label(C2) (used in LR(0))
  – LR(k) uses a combination of them
k-look-ahead cloning (intuitive idea)
[Figure: a GFG before and after 2-look-ahead cloning. The two calls to A, with return look-aheads {ab} and {b$}, become calls to the clones [A,{ab}] and [A,{b$}]; these in turn call clones such as [A,{da}], [B,{ab}], [A,{db}] and [B,{b$}]. Other clones not shown.]
If there are |T| terminal symbols, you may end up with 2^(|T|^k) clones of a given production.
k-look-ahead cloning
• G = (V,T,P,S): grammar, k: positive integer.
• T_k(G) is the following grammar
  – non-terminals: {[A,R] | A in V - T, and R ⊆ T^k}
  – terminals: T
  – start symbol: [S,{$^k}]
  – rules: all rules of the form [A,R] → X1' X2' X3' ... Xm' where for some rule A → X1 X2 X3 ... Xm in P
    • Xi' = Xi if Xi is a terminal
    • Xi' = [Xi, FIRST_k(Xi+1 ... Xm) +_k R] when Xi is a non-terminal.
• Intuition:
  – after this kind of cloning, k-look-aheads at the end of a procedure are identical for all return edges
  – so doing a context-independent look-ahead computation on the transformed grammar does not tell you anything you did not already know about k-look-aheads
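A worklist version of this transformation can be sketched in Python. Unlike the definition, which ranges over all subsets R, the sketch only generates the clones reachable from the start symbol [S,{$^k}]; that restriction, the tuple encoding of [A,R] as `(A, frozenset(R))`, and all function names are assumptions made for illustration.

```python
def concat_k_sets(xs, ys, k):
    return {(x + y)[:k] for x in xs for y in ys}

def first_k_seq(seq, first, k):
    acc = {""}
    for sym in seq:
        acc = concat_k_sets(acc, first.get(sym, {sym}), k)
    return acc

def first_k(grammar, k):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_k_seq(rhs, first, k)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def clone_k(grammar, start, k):
    """Rules of the cloned grammar, restricted to clones reachable from [S,{$^k}]."""
    first = first_k(grammar, k)
    root = (start, frozenset({"$" * k}))
    rules, work = {}, [root]
    while work:
        clone = work.pop()
        if clone in rules:
            continue
        name, ctx = clone
        rules[clone] = []
        for rhs in grammar[name]:
            new_rhs = []
            for i, sym in enumerate(rhs):
                if sym in grammar:
                    # Xi' = [Xi, FIRST_k(Xi+1 ... Xm) +_k R]
                    r = frozenset(concat_k_sets(first_k_seq(rhs[i + 1:], first, k), ctx, k))
                    new_rhs.append((sym, r))
                    work.append((sym, r))
                else:
                    new_rhs.append(sym)          # terminals are copied unchanged
            rules[clone].append(new_rhs)
    return rules

G = {"S": ["xNab", "yNbc"], "N": ["a", ""]}
CLONED = clone_k(G, "S", 2)
```

For the running example with k = 2 this yields three non-terminals, [S,{$$}], [N,{ab}] and [N,{bc}], matching the clones N1 and N2 shown on the "Tracking context by cloning" slide.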
LL(k) and SLL(k)
Intuition
• This class of grammars has the following property:
  – if s is a string in the language, then for any prefix p of s, there is a unique path P from START such that label(P) = p (modulo look-ahead)
• So we need to follow only one path through the GFG for a given input string, using look-ahead to eliminate alternatives
• Roughly analogous to DFAs in the CFL world
LL(k) parsing
• Only one path can be followed by the parser
  – so at a procedure call for non-terminal N, we must know exactly which procedure (rule) to call
• Simple LL(k) parsing:
  – make decision based on context-independent look-ahead of k symbols at entry point for N
• LL(k) parsing:
  – use context-dependent look-ahead of k symbols
  – procedure cloning technique converts LL(k) grammar into SLL(k) grammar
S → xNab | yNbc
N → a | ε
Grammar is LL(2) but not SLL(2)
[Figure: GFG for this grammar. 2-look-aheads at the alternatives of S: {xa} and {ya,yb}; context-independent 2-look-aheads at the alternatives of N: {aa,ab} and {ab,bc}.]
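The difference between the two decisions can be checked mechanically. In the sketch below the look-ahead of a choice is FIRST_k(rhs) +_k (what follows the call); the sets that follow each call to N, {ab} and {bc}, are taken from the slides and passed in by hand, and all function names are hypothetical.

```python
def concat_k_sets(xs, ys, k):
    return {(x + y)[:k] for x in xs for y in ys}

def first_k_seq(seq, first, k):
    acc = {""}
    for sym in seq:
        acc = concat_k_sets(acc, first.get(sym, {sym}), k)
    return acc

def first_k(grammar, k):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_k_seq(rhs, first, k)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def choice_lookaheads(grammar, nt, after, k):
    """k-look-ahead of each alternative of nt, given the set `after` that follows the call."""
    first = first_k(grammar, k)
    return {rhs: concat_k_sets(first_k_seq(rhs, first, k), after, k)
            for rhs in grammar[nt]}

def is_deterministic(lookaheads):
    """True if the look-ahead sets of the alternatives are pairwise disjoint."""
    sets = list(lookaheads.values())
    return all(a.isdisjoint(b) for i, a in enumerate(sets) for b in sets[i + 1:])

G = {"S": ["xNab", "yNbc"], "N": ["a", ""]}
SLL = choice_lookaheads(G, "N", {"ab", "bc"}, 2)   # all call sites merged
LL_X = choice_lookaheads(G, "N", {"ab"}, 2)        # call inside S -> xNab
LL_Y = choice_lookaheads(G, "N", {"bc"}, 2)        # call inside S -> yNbc
```

With the merged context the two alternatives of N get {aa,ab} and {ab,bc}, which overlap in ab, so the SLL(2) decision fails; with one context at a time the sets are disjoint, so the LL(2) decision succeeds.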
Parser
• Modify Earley parser to
  – track compressed paths instead of full paths
    • transitions labeled by non-terminals and terminals
  – eliminate return addresses
    • at the end of a production
      – A → X1X2..Xn: pop n states off and make an A transition from the exposed state
      – A → ε: make an A transition from current state
  – use look-ahead to eliminate alternatives
[Figure: trace of the modified Earley parser for the grammar E → int | (+ E E) | (- E E) on the input string ( - ( + 8 9 ) 7 ). Each state on the stack holds GFG nodes without return addresses, for example {START-E, E → .(+ E E), E → .(- E E), E → .int}, {E → (.- E E)}, {E → (- .E E), START-E, ...}, {E → int., END-E, E → (+ E .E), START-E, ...}, {E → (+ E E).}, {END-E, E → (- E E .)}, {E → (- E E).}; transitions are labeled with the input symbols and with E after each completed production.]
Many grammars are not LL(k)
• Grammar
  – E → int | (E+E) | (E-E)
• Not clear which rule to apply until you see “+” or “-”
  – this needs unbounded look-ahead, so grammar is not LL(k) for any k
• One solution:
  – follow multiple paths till only one survives
[Figure: GFG for this grammar.]
LR(k), SLR(k), LALR(k)
LR grammars (informal)
• LR parsers permit limited non-determinism
  – can follow more than one path but not all paths like Earley
• LR(0) condition: for any prefix of input, the corresponding fully extended compressed paths must have the same label
• Condition not true in general grammars: see example
  – consider string “da”
  – for prefix “d”, there are two paths:
    • red path
    • blue path
  – labels of compressed paths:
    • red path: “A”
    • blue path: “B”
• We can use the modified Earley parser for these grammars
[Figure: extended GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, with the red path (through A → d) and the blue path (through B → d) highlighted.]
[Figure: parse of the input string (3+4) for the grammar E → int | (E+E) | (E-E), shown twice over positions 0 to 5: once as Earley sets with return addresses (<START-E, 0>, <E → .(E+E), 0>, ..., <END-E, 0>) and once as the states of the modified Earley parser without return addresses, connected by transitions labeled (, 3, +, 4, ) and E.]
Parser for LR languages
• Use the modified Earley parser we used for LL grammars
  – each Σ-state will have multiple items as in the original Earley parser since LR parsers follow multiple paths too
• Σ-states must follow a stack discipline for the modified Earley parser to work
• Since we are following multiple paths, this might break down
  – shift-reduce conflict: parallel compressed paths
    • P1 to a scan node and P2 to an EXIT node (push/pop conflict)
  – reduce-reduce conflict: parallel compressed paths
    • P1 and P2 to different EXIT nodes (pop/pop conflict)
• If the grammar does not have shift-reduce or reduce-reduce conflicts, we can use the modified Earley parser and follow compressed paths while maintaining a stack discipline for Σ-states
• How do we determine whether a grammar has shift-reduce or reduce-reduce conflicts?
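Finding LR(0) conflicts
• Compute the DFA corresponding to the compressed path NFA
• If conflicting states are in the same DFA state, the grammar has an LR(0) conflict
S → Aa | bAc | Bc | bBa
A → d
B → d
[Figure: DFA for this grammar. The initial state {S → .Aa, S → .bAc, S → .Bc, S → .bBa, A → .d, B → .d} and the state {S → b.Ac, S → b.Ba, A → .d, B → .d} both move on d to the state {A → d., B → d.}, which has a reduce-reduce conflict. The remaining states are {S → A.a}, {S → Aa.}, {S → B.c}, {S → Bc.}, {S → bA.c}, {S → bAc.}, {S → bB.a}, {S → bBa.}.]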
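The conflict check can be sketched with the usual closure/goto subset construction, which is assumed here to play the role of the NFA to DFA conversion of the compressed-path NFA. Items are `(lhs, rhs, dot)` triples, each symbol is one character, no augmented start production is added (as in the figure), and all names are hypothetical.

```python
def closure(items, grammar):
    """Add the entry items of every non-terminal that appears right after a dot."""
    items = set(items)
    work = list(items)
    while work:
        _, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for alt in grammar[rhs[dot]]:
                new = (rhs[dot], alt, 0)
                if new not in items:
                    items.add(new)
                    work.append(new)
    return frozenset(items)

def goto(state, sym, grammar):
    """DFA transition on a terminal or non-terminal symbol."""
    moved = {(l, r, d + 1) for (l, r, d) in state if d < len(r) and r[d] == sym}
    return closure(moved, grammar)

def lr0_states(grammar, start):
    init = closure({(start, alt, 0) for alt in grammar[start]}, grammar)
    states, work = {init}, [init]
    while work:
        state = work.pop()
        for sym in {r[d] for (_, r, d) in state if d < len(r)}:
            nxt = goto(state, sym, grammar)
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states

def lr0_conflicts(grammar, start):
    """List (kind, state) for every DFA state that mixes conflicting items."""
    found = []
    for state in lr0_states(grammar, start):
        exits = [it for it in state if it[2] == len(it[1])]
        scans = [it for it in state if it[2] < len(it[1]) and it[1][it[2]] not in grammar]
        if len(exits) > 1:
            found.append(("reduce-reduce", state))
        if exits and scans:
            found.append(("shift-reduce", state))
    return found

G = {"S": ["Aa", "bAc", "Bc", "bBa"], "A": ["d"], "B": ["d"]}
CONFLICTS = lr0_conflicts(G, "S")
```

For the running example the only conflict found is the reduce-reduce conflict in the state {A → d., B → d.}.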
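LR(0) automaton for expression grammar
[Figure: LR(0) automaton for E → int | (E+E) | (E-E). States: {E → .int, E → .(E+E), E → .(E-E)}; {E → int.}; {E → (.E+E), E → (.E-E), E → .int, E → .(E+E), E → .(E-E)}; {E → (E.+E), E → (E.-E)}; {E → (E+.E), E → .int, E → .(E+E), E → .(E-E)}; {E → (E-.E), E → .int, E → .(E+E), E → .(E-E)}; {E → (E+E.)}; {E → (E-E.)}; {E → (E+E).}; {E → (E-E).}. Transitions are labeled int, (, ), +, - and E.]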
Parser for LR(0) languages
• Use the modified Earley parser we used for LL grammars
  – each Σ-state will have multiple items as in the original Earley parser since LR parsers follow multiple paths too
• No need to keep track of GFG nodes within each Σ-state
  – states in the compressed path DFA correspond to possible Σ-states
  – so the modified Earley parser just pushes and pops DFA states
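A parser that only pushes and pops DFA states can be sketched as follows. Several choices here are assumptions rather than part of the slides: the DFA states are computed on demand with closure/goto instead of being precomputed, an augmented production S' → E is added to detect acceptance, the grammar is assumed to be LR(0) (so a state with a complete item contains nothing else), and all names are hypothetical.

```python
def closure(items, grammar):
    items = set(items)
    work = list(items)
    while work:
        _, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for alt in grammar[rhs[dot]]:
                new = (rhs[dot], alt, 0)
                if new not in items:
                    items.add(new)
                    work.append(new)
    return frozenset(items)

def goto(state, sym, grammar):
    moved = {(l, r, d + 1) for (l, r, d) in state if d < len(r) and r[d] == sym}
    return closure(moved, grammar)

def lr0_parse(tokens, grammar, start):
    """Return the list of reductions performed; raise SyntaxError on bad input."""
    stack = [closure({("S'", (start,), 0)}, grammar)]
    pos, reductions = 0, []
    while True:
        complete = [it for it in stack[-1] if it[2] == len(it[1])]
        if complete:
            (lhs, rhs, _), = complete            # LR(0): exactly one complete item
            if lhs == "S'":
                if pos != len(tokens):
                    raise SyntaxError("unexpected trailing input")
                return reductions
            # End of production A -> X1..Xn: pop n states, then take the A transition.
            del stack[len(stack) - len(rhs):]
            reductions.append((lhs, rhs))
            stack.append(goto(stack[-1], lhs, grammar))
        else:
            # Scan node: push the DFA state reached on the next input symbol.
            nxt = goto(stack[-1], tokens[pos], grammar) if pos < len(tokens) else frozenset()
            if not nxt:
                raise SyntaxError(f"unexpected input at position {pos}")
            stack.append(nxt)
            pos += 1

# E -> int | (E+E) | (E-E), with numbers tokenized as "int"
EXPR = {"E": [("int",), ("(", "E", "+", "E", ")"), ("(", "E", "-", "E", ")")]}
TRACE = lr0_parse(["(", "int", "+", "int", ")"], EXPR, "E")
```

On the input (3+4) the parser performs three reductions: E → int twice, then E → (E+E).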
GFG path interpretation
• Let P1 and P2 be two GFG paths with identical labels
• Sufficient condition for labels of compressed paths to be equal:
  – sequences of completed calls in P1 and P2 are identical
• Most of the action in LR parsers happens at EXIT nodes of productions
[Figure: two paths P1 and P2 from START, each containing a completed call from START-P to END-P.]
LR(0) conflicts: GFG
• LR(0) conflicts (GFG definition):
  – Shift-reduce conflict: there are parallel paths P1: START ⇝ A_exit and P2: START ⇝ scan-node
  – Reduce-reduce conflict: there are parallel paths P1: START ⇝ A_exit and P2: START ⇝ B_exit
• Claim: Let G be an LR(0) grammar according to the GFG definition.
  – P1 and P2 are two GFG paths that end at SCAN or END nodes, and C(P1) and C(P2) are their compressed equivalents
  – P1 and P2 have the same label iff C(P1) and C(P2) have the same label
[Figure: reduce-reduce conflict (two paths from START with the same terminal label t*, ending at A_exit and B_exit) and shift-reduce conflict (two such paths, one ending at A_exit and one at a scan node in B whose outgoing edge is labeled u).]
LR(0) conflicts: GFG (contd.)
• The claim is not true if the paths do not end at SCAN or END nodes
  – counterexample: in this LR(0) grammar, consider paths from START to nodes S → A.a and S → .Uc
S → Aa | Uc
U → Ab
A → ε
Example
• States with LR(0) conflicts
  – (A → d., B → d.)
• Conflicting context pairs
  (i) path label: d
  – C1: START, S → .Aa, A → .d, A → d.
  – C2: START, S → .Bc, B → .d, B → d.
  (ii) path label: bd
  – C3: START, S → .bAc, S → b.Ac, A → .d, A → d.
  – C4: START, S → .bBa, S → b.Ba, B → .d, B → d.
• So grammar is not LR(0)
[Figure: GFG for S → Aa | bAc | Bc | bBa, A → d, B → d.]
LR(0): H&U (Hopcroft and Ullman) definition
• A grammar G is LR(0) if
  – its start symbol does not appear on the right side of any production, and
  – for every viable prefix γ, whenever A → α. is a complete valid item for γ, then no other complete item nor any item with a terminal to the right of the dot is valid for γ.
• Comment:
  – by this definition, the only other valid items that can occur together with A → α. are incomplete items with a non-terminal to the right of the dot, of the form B → β.Cδ
  – if First(C) contains a terminal t, it can be shown that an item of the form Y → .tλ is valid for γ, violating the LR(0) condition. Therefore, First(C) = {ε}. It can be shown that this means α = ε
  – Example: this grammar is LR(0) (A → . and B → .Cd are valid items for viable prefix ε)
    • S → B
    • B → Cd
    • C → A
    • A → ε
Look-ahead in LR grammars
• LR(k)
  – for each pair of parallel paths to LR(0) conflicting states, k-look-ahead sets are disjoint
• SLR(k):
  – if there is an LR(0) conflict at nodes A and B, context-insensitive look-ahead sets of A and B are disjoint
• LALR(k): grammar is SLR(k) after reachability cloning
[Figure: the reduce-reduce and shift-reduce conflict patterns: parallel paths from START with label t* ending at A_exit and B_exit, or at A_exit and a scan node in B.]
Example
• States with LR(0) conflicts
  – (A → d., B → d.)
• Conflicting context pairs
  (i) path label: d
  – C1: START, S → .Aa, A → .d, A → d.
  – C2: START, S → .Bc, B → .d, B → d.
  (ii) path label: bd
  – C3: START, S → .bAc, S → b.Ac, A → .d, A → d.
  – C4: START, S → .bBa, S → b.Ba, B → .d, B → d.
• Grammar is LR(1)
  – look-ahead for C1: {a}, look-ahead for C2: {c}
  – look-ahead for C3: {c}, look-ahead for C4: {a}
[Figure: GFG for S → Aa | bAc | Bc | bBa, A → d, B → d.]
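LR(1) automaton
S → Aa | bAc | Bc | bBa
A → d
B → d
[Figure: LR(1) automaton for this grammar. The initial state contains [A → .d, a] and [B → .d, c], and its d-successor is {[A → d., a], [B → d., c]}. The state reached on b contains [A → .d, c] and [B → .d, a], and its d-successor is {[A → d., c], [B → d., a]}. All items for productions of S carry the look-ahead $.]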
LALR look-ahead computation
• Key observation:
  – each path START ⇝ s in the deterministic LR(0) automaton represents a set of contexts in the non-deterministic LR(0) automaton
    • each context in this set ends at one of the items in s
  – in general, there will be multiple paths to state s in the deterministic LR(0) automaton
  – so each state in the LR(0) automaton represents a set of sets of contexts
  – in LALR, we merge the look-aheads for those contexts
• LALR = reachability cloning + SLR (Bermudez and Logothetis) + unions at some nodes (see the R → L. state in the diagram on the next page)
LALR(1) but not SLR(1)
S’ → S $
S → L = R | R
L → *R | id
R → L
FOLLOW(S) = { $ }
FOLLOW(R) = { =, $ }
FOLLOW(L) = { =, $ }
[Figure: LR(0) automaton for this grammar, with transitions labeled S, L, R, =, * and id. The state {S → L.=R, R → L.} is marked with a shift-reduce conflict.]
LALR → SLR grammar
Original grammar:
S’ → S $
S → L = R | R
L → *R | id
R → L
Grammar after cloning:
S’ → S $
S → L1 = R2 | R1
L1, L2, L3 → *R3 | id
R1 → L1
R2 → L2
R3 → L3
[Figure: the LR(0) automaton of the original grammar, with its edges labeled L and R renamed to L1, L2, L3 and R1, R2, R3.]
LR(0): Reachability cloning
• Motivation: NFA → DFA conversion for LR grammars
• Driven by compressed paths
• Need to verify that this cloning satisfies the sanity condition even if the grammar is not LR(0)
• Compressed contexts C1 and C2 of node A are in the same equivalence class if
  set of GFG nodes reachable by paths with label(C1) = set of GFG nodes reachable by paths with label(C2)
[Figure: contexts C1, C2 and C3 from START to nodes A and B.]
C1 and C2 will be in the same equivalence class. C3 is in a different class.
Algorithm (need to write)
• G = (V,T,P,S): grammar
• R(G) is the following grammar
  – non-terminals: {[Ai] | A in V - T, 1 <= i <= n and there are n edges labeled A in the compressed path DFA}
  – terminals: T
  – start symbol: [S]
  – rules: all rules of the form [Ai] → X1' X2' X3' ... Xm' where for some rule A → X1 X2 X3 ... Xm in P
    • Xi' = Xi if Xi is a terminal
    • Xi' = [Xi] when Xi is a non-terminal.
Cloning for LALR(1)
• Same condition as LR(0): reachability cloning
• Extension to LA(k)LR(l):
  – cloning is governed by LR(l)
  – compute SLR(k) look-aheads
  – LALR(k) is LA(k)LR(0)
  – LR(k) is LA(k)LR(k) (clone as in LR(l))
Summary
• New abstraction for CFL parsing
  – Grammar Flow Graph (GFG)
• Parsing problems become path problems in GFG
• Earley parser emerges as a simple extension of NFA simulation
• Mechanisms
  – control number of paths followed during parsing
  – look-ahead:
    • algorithm: solving set constraints
  – context-dependent look-ahead
    • algorithm: cloning
• SLL(k), LL(k), SLR(k), LR(k), LALR(k) grammars arise from different choices of these mechanisms
• LL and LR parsers emerge as specializations of Earley parser
LR(0) ε-DFA
[Figure: ε-DFA with states M0 to M9 for E → int | (E+E) | (E-E), and the sequence of <state, return address> pairs produced on the input ((2+3)-4): <M0,0>, <M2,0>, <M0,1>, <M2,1>, <M0,2>, <M1,2>, <M4,1>, <M3,1>, <M0,4>, <M1,4>, <M6,1>, <M8,1>, <M4,0>, <M5,0>, <M0,7>, <M1,7>, <M7,0>, <M9,0>.]
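LALR(1) example from G&J
S’ → S #
S → A B c
A → a
B → b
B → ε
[Figure: LR(0) automaton for this grammar, with states {S’ → .S#, S → .ABc, A → .a}, {S’ → S.#}, {S → A.Bc, B → ., B → .b}, {S → AB.c}, {S → ABc.}, {B → b.}, {A → a.} and transitions labeled a, S, A, B, b, c.]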
S → L = R | R
L → *R | id
R → L
Shift-reduce conflict occurs at states C and Rend (conflicting paths are S->L->Lend->C and S->R->L->Lend->Rend).
1-look-ahead at C is =.
Context-independent 1-look-ahead at Rend is {=,$}, so the grammar is not SLR(1).
LALR(1) figures out that for the conflicting state, the calling context must be S → R. Look-ahead at Rend is = for context S → L → T → R → L → Lend → Rend, but there is no context S →* C parallel to this one.
[Figure: GFG for this grammar with nodes S, Send, L, Lend, R, Rend, C and T, and edges labeled *, id and =.]
LR(1)
FIRST(L) = FIRST(R) = {*, id}
[Figure: GFG for S → L = R | R, L → *R | id, R → L, annotated "Shift-reduce conflict: id $", followed by the GFG after procedure cloning, with clones [L,{=}], [R,{$}], [R,{=}] and [L,{$}].]
LALR(1) look-aheads
S → (S)
S → ε
• After the reduction S → (S), parsing can resume either in state T0 or T1.
• The LR parser stack tells you which one to resume from.
• LALR(1) look-aheads in state T1 are interesting. Item S → (.S) gets look-ahead from item S → .(S) in state T0 as well as item S → (.S) from state T1.
[Figure: LALR(1) automaton with states T0 to T5. T0: S’ → .S$, S → .(S) [$], S → . [$]. T1: S → (.S) [$,)], S → .(S) [)], S → . [)]. Further states: S → (S.) [$,)], S → (S). [$,)], and S’ → S.$. Transitions are labeled (, ) and S.]
Parsing techniques
• Our focus: techniques that perform breadth-first traversal of GFG
  – similar to online simulation of NFA
  – input is read left to right one symbol at a time
  – extend current GFG paths if possible, using symbol
• Three dimensions:
  – non-determinism: how many paths can I follow at a given time?
  – look-ahead: how many symbols of look-ahead are known at each point?
  – context: how much context do we keep?
    • this is implemented by procedure cloning, independent of look-ahead
What we would like to show
• Obvious algorithm:
– follow all CFL-paths in GFG
– essentially a fancy transitive closure in GFG
– leads to Earley’s algorithm
– O(n^3) complexity
• O(n) algorithms: LL/LR/LALR, ...
– preprocessing to compute look-ahead sets
– maintain compressed paths
– ensure that Earley sets can be manipulated as a stack
What we would like to show (contd.)
• SLL(k) = no cloning + decision at procedure start
• LL(k) = k-look-ahead-cloning + decision at procedure start
• LA(l)LL(k) = l-look-ahead-cloning + context-independent k-look-ahead + decision at procedure start
• SLR(k) = no cloning + decision at procedure end
• LR(k) = k-look-ahead-cloning + decision at procedure end
• LALR(k) = reachability-cloning + decision at procedure end
Computing context-independent look-ahead
• Intuition:
  – simple inter-procedural backward dataflow analysis in GFG
  – assume look-ahead at exit of GFG is {$^k}
  – propagate look-ahead back through GFG to determine look-aheads at other points
• How do we propagate look-aheads through non-terminal calls?
  – would like to avoid repeatedly analyzing a procedure for each look-ahead set we want to propagate through it
  – need to handle recursive calls
  – ideally, we would have a function that tells us how a look-ahead set at the exit of a procedure gets propagated to its input
S → xNab | yNbc
N → a | ε
[Figure: GFG for this grammar with the 2-symbol look-aheads {xa} and {ya,yb} at the two alternatives of S.]
Every LL(1) grammar is an SLL(1) grammar
[Figure: node N has two choices P and Q. N is reached from START through contexts C1 and C2 (beginning with x and y); the complementary contexts C1’ and C2’ lead from N to END.]
Let the strings generated by paths P and Q be S_P and S_Q.
Cases:
- S_P = a and S_Q = a: grammar is neither LL(1) nor SLL(1)
- S_P = a and S_Q = b: grammar is LL(1) and SLL(1)
- S_P = ε and S_Q = ε: grammar is neither LL(1) nor SLL(1)
- S_P = a and S_Q = ε:
  - We show that there cannot be a context Ci for which the generated string for the complementary context Ci’ is a.
  - Otherwise, for context Ci, the 1-lookahead for choice P is a and the 1-lookahead for choice Q is a, so the grammar is not LL(1).
  - Therefore, there is no context Ci for which the 1-lookahead for choice Q is a.
  - But this means that the context-independent 1-lookahead for choice Q cannot contain a.
  - Therefore the grammar is SLL(1).
LL(2) grammar that is not SLL(2)
- Grammar:
  S → xNab
  S → yNbc
  N → a
  N → ε
[Figure: node N with choices P (N → a) and Q (N → ε); context C1 reaches N after x and continues with ab (C1’); context C2 reaches N after y and continues with bc (C2’).]
- Consider the context-sensitive look-aheads at N.
- For context C1,
  2-lookahead for choice P is {aa}
  2-lookahead for choice Q is {ab}
- For context C2,
  2-lookahead for choice P is {ab}
  2-lookahead for choice Q is {bc}.
- Therefore, grammar is LL(2).
- Context-independent lookaheads:
  2-lookahead for choice P is {aa,ab}
  2-lookahead for choice Q is {ab,bc}.
- Since these two sets are not disjoint, the grammar is not SLL(2).
Cloning for LR(k)
• From Sippu & Soisalon-Soininen
  – replace each non-terminal A in the original grammar G with the set of all pairs of the form ([γ]_k, A) where γ is a viable prefix of the $-augmented grammar G
• [page 16] String γ1 is LR(k)-equivalent to string γ2 if VALID_k(γ1) = VALID_k(γ2); i.e. exactly those items valid for γ2 are valid for γ1 and vice versa.
• An item [A → α.β, y] is LR(k)-valid for γ if
  S ⇒rm* δAz ⇒rm δαβz = γβz and k:z = y
• Question:
  – is this a finer equivalence class than LL(k)?
Sanity condition on equivalence classes
• If C1 and C2 are two contexts for some node N and
  – C1 = B1 + P
  – C2 = B2 + P
  – B1 and B2 are in the same equivalence class
  then C1 and C2 must be in the same equivalence class
• Can we come up with a general construction procedure for cloning, given a specification of the equivalence classes?
[Figure: paths B1 and B2 from START joining a common path P that leads to node N.]