Post on 20-Jan-2016
1
Syntax Analysis / Parser

[Diagram: the Lexical Analyzer reads the source program and hands a token to the Parser on each "get next token" request; the Parser builds the parse tree.]
• Parser works on a stream of tokens.
• The smallest item is a token.
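The interaction above (the parser pulling tokens from the lexical analyzer one at a time) can be sketched in a few lines. The token names and patterns below are illustrative assumptions, not part of any particular compiler.

```python
import re

# Hypothetical token specification, chosen only for illustration.
TOKEN_SPEC = [("ID", r"[A-Za-z_]\w*"), ("OP", r"[+\-*/()]"), ("WS", r"\s+")]

def tokens(source):
    """Lexical analyzer: turn the source program into a stream of tokens."""
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                pos += m.end()
                if name != "WS":          # whitespace is skipped, not a token
                    yield (name, m.group())
                break
        else:
            raise SyntaxError(f"unexpected character at position {pos}")

# The parser works on this stream; the smallest item it sees is a token.
stream = tokens("id1 + id2")
print(next(stream))   # get next token -> ('ID', 'id1')
print(next(stream))   # -> ('OP', '+')
```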
2
Syntax Analysis
• At syntax analysis phase, a compiler verifies whether or not the tokens generated by the lexical analyzer are grouped according to the syntactic rules of the language.
• If the tokens in a string are grouped according to the language’s rules of syntax, then the string of tokens generated by the lexical analyser is accepted as a valid construct of the language; otherwise an error handler is called.
• Therefore, a suitable notation must be used to specify the constructs of a language: the context-free grammar (CFG).
3
Syntax Analyzer
• Syntax Analyzer creates the syntactic structure (parse tree) of the given source program.
• The syntax of a programming language is described by a CFG. We will use BNF notation in the description of CFGs.
• The syntax analyzer checks whether a given source program satisfies the rules implied by a CFG or not.
– If it satisfies, the parser creates the parse tree of that program.
– Otherwise, the parser gives an error message.
• A context-free grammar
– gives a precise syntactic specification of a programming language.
– the design of the grammar is an initial phase of the design of a compiler.
4
Parsers
We categorize the parsers into two groups:
• Top-Down Parser
– the parse tree is created top to bottom, starting from the root (from non-terminals toward terminals).
• Bottom-Up Parser
– the parse tree is created bottom to top, starting from the leaves (from terminals toward non-terminals).
• Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).
5
Context-Free Grammars (CFG)
• Inherently recursive structures of a programming language are defined by a CFG.
• In a CFG, we have:
– A finite set of terminals (in our case, this will be the set of tokens)
– A finite set of non-terminals (syntactic variables)
– A finite set of production rules of the form A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string)
– A start symbol (one of the non-terminal symbols)
• Example:
E → E + E | E – E | E * E | E / E | - E
E → ( E )
E → id
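A CFG like the example above can be written down directly as data. The encoding below (a dict from non-terminal to a list of right-hand sides) is just one convenient assumed representation, not a standard API.

```python
# The expression grammar from the slide, encoded as a Python dict.
# Terminals are plain strings; non-terminals are the dict keys.
GRAMMAR = {
    "E": [["E", "+", "E"], ["E", "-", "E"], ["E", "*", "E"],
          ["E", "/", "E"], ["-", "E"], ["(", "E", ")"], ["id"]],
}
START = "E"

def is_nonterminal(symbol):
    """A symbol is a non-terminal exactly when it has productions."""
    return symbol in GRAMMAR

print(len(GRAMMAR["E"]))   # 7 production alternatives for E
```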
6
Derivations
E ⇒ E+E
E+E derives from E:
– we can replace E by E+E
– to be able to do this, we have to have a production rule E → E+E in our grammar.
E ⇒ E+E ⇒ id+E ⇒ id+id
"A sequence of replacements of non-terminal symbols is called a derivation."
In general, a derivation step is
αAβ ⇒ αγβ if there is a production rule A → γ in our grammar, where α and β are arbitrary strings of terminal and non-terminal symbols.
α1 ⇒ α2 ⇒ ... ⇒ αn (αn is derived from α1, or α1 derives αn)
⇒  : derives in one step
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps
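A single derivation step αAβ ⇒ αγβ is easy to mimic on a list-of-symbols sentential form. The helper below is a sketch under the dict grammar encoding assumed earlier (repeated here so the snippet stands alone).

```python
# Minimal grammar, enough for the derivation E => E+E => id+E => id+id.
GRAMMAR = {"E": [["E", "+", "E"], ["id"]]}

def derive_step(sentential_form, index, production):
    """Replace the non-terminal at `index` with `production`:
    alpha A beta  =>  alpha gamma beta."""
    symbol = sentential_form[index]
    assert symbol in GRAMMAR and production in GRAMMAR[symbol]
    return sentential_form[:index] + production + sentential_form[index + 1:]

form = ["E"]
form = derive_step(form, 0, ["E", "+", "E"])   # E => E+E
form = derive_step(form, 0, ["id"])            # => id+E
form = derive_step(form, 2, ["id"])            # => id+id
print(form)   # ['id', '+', 'id']
```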
7
CFG - Terminology
• L(G) is the language of G (the language generated by G), which is a set of sentences.
• A sentence of L(G) is a string of terminal symbols of G.
• If S is the start symbol of G, then ω is a sentence of L(G) iff S ⇒+ ω, where ω is a string of terminals of G.
• If G is a context-free grammar, L(G) is a context-free language.
• Two grammars are equivalent if they produce the same language.
• S ⇒* α
– If α contains non-terminals, it is called a sentential form of G.
– If α does not contain non-terminals, it is called a sentence of G.
8
Derivation Example
Let the grammar be:
E → E + E | E – E | E * E | E / E | - E
E → ( E )
E → id

Left-most: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E*E+E) ⇒ -(id*E+E) ⇒ -(id*id+E) ⇒ -(id*id+id)
Right-most: E ⇒ -E ⇒ -(E) ⇒ -(E*E) ⇒ -(E*E+E) ⇒ -(E*E+id) ⇒ -(E*id+id) ⇒ -(id*id+id)

• At each derivation step, we can choose any of the non-terminals in the sentential form for replacement.
• If we always choose the left-most non-terminal in each derivation step, the derivation is called a left-most derivation.
• If we always choose the right-most non-terminal in each derivation step, the derivation is called a right-most derivation.

Exercise
a) Derive the sentence -(id + id * id) using both LMD and RMD.
9
Left-Most and Right-Most Derivations
Left-Most Derivation: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
Right-Most Derivation: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
• Top-down parsers try to find the left-most derivation of the given source program.
• Bottom-up parsers try to find the right-most derivation of the given source program in the reverse order.
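The left-most versus right-most choice is simply which non-terminal gets replaced at each step. The sketch below replays the slide's left-most derivation of -(id+id) under the list-of-symbols form assumed earlier; the helper names are ours, not standard.

```python
NONTERMINALS = {"E"}

def leftmost_index(form):
    """Index of the left-most non-terminal, or -1 if the form is a sentence."""
    for i, symbol in enumerate(form):
        if symbol in NONTERMINALS:
            return i
    return -1

def rightmost_index(form):
    """Index of the right-most non-terminal, or -1 if none remains."""
    for i in range(len(form) - 1, -1, -1):
        if form[i] in NONTERMINALS:
            return i
    return -1

def replace(form, i, body):
    return form[:i] + body + form[i + 1:]

# Left-most derivation: E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
form = ["E"]
for body in (["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]):
    form = replace(form, leftmost_index(form), body)
print("".join(form))   # -(id+id)
```

Swapping `leftmost_index` for `rightmost_index` (and reordering the last two bodies) replays the right-most derivation instead.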
10
Parse Tree
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.

[Diagram: parse trees built step by step for the derivation E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id); each step attaches to the tree the children introduced by the production used.]
11
Ambiguity
• A grammar that produces more than one parse tree for a sentence is called an ambiguous grammar.
• Therefore, the grammar having the productions
E → E + E
E → E * E
E → id
is an ambiguous grammar, because there exists more than one parse tree for the sentence id + id * id in L(G).
12
Two parse trees for id + id*id
E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

[Diagram: the two parse trees for id+id*id — one with + at the root (E → E+E applied first), one with * at the root (E → E*E applied first).]
Ambiguity – Operator Precedence
Rewrite the grammar:
• Use a different non-terminal for each precedence level.
• Start with the lowest precedence.

E → E - E | E / E | ( E ) | id

Rewrite to:
E → E - T | T
T → T / F | F
F → id | ( E )

Cont...

W = id - id / id
E ⇒ E - T ⇒ E - T/F ⇒ E - T/id ⇒ E - F/id ⇒ E - id/id ⇒ T - id/id ⇒ F - id/id ⇒ id - id/id
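The layered grammar above is exactly what a recursive-descent evaluator mirrors: one function per precedence level, with the left recursion turned into iteration. The function and token names below are our own illustrative choices.

```python
def parse(tokens):
    """Recursive descent for E -> E - T | T ; T -> T / F | F ; F -> id | ( E ).
    Returns a nested tuple showing how the input groups."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expr():            # E: lowest precedence ('-')
        nonlocal pos
        node = term()
        while peek() == "-":
            pos += 1
            node = ("-", node, term())
        return node

    def term():            # T: higher precedence ('/')
        nonlocal pos
        node = factor()
        while peek() == "/":
            pos += 1
            node = ("/", node, factor())
        return node

    def factor():          # F: id or a parenthesized E
        nonlocal pos
        if peek() == "(":
            pos += 1
            node = expr()
            assert peek() == ")", "missing closing parenthesis"
            pos += 1
            return node
        assert peek() == "id", "expected id"
        pos += 1
        return "id"

    return expr()

# id - id / id groups as id - (id / id): '/' binds tighter than '-'.
print(parse(["id", "-", "id", "/", "id"]))   # ('-', 'id', ('/', 'id', 'id'))
```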
Workout
E → E + E | E – E | E * E | E / E
E → ( E )
E → id
Rewrite the above productions to encode operator precedence, and parse the string w = id+id-id (both in RMD and LMD).
16
Left Recursion
• A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.
• Top-down parsing techniques cannot handle left-recursive grammars.
• So, we have to convert our left-recursive grammar into a grammar which is not left-recursive.
• The left-recursion may appear in a single step of the derivation (immediate left-recursion), or may appear in more than one step of the derivation.
17
Left-Recursion -- Problem
• A grammar can have both direct and indirect left recursion.
• By just eliminating the immediate left-recursion, we may not get a grammar which is not left-recursive.

S → Aa | b
A → Sc | d

This grammar is not immediately left-recursive, but it is still left-recursive:
S ⇒ Aa ⇒ Sca, or
A ⇒ Sc ⇒ Aac
causes a left-recursion.
• So, we have to eliminate all left-recursions from our grammar.
Eliminating Left-Recursion
1. Eliminate ε-productions
2. Eliminate cycles A ⇒+ A (derivations of one or more steps in which A produces A again)
Example
CFG is given as:
S → XX | Y
X → aSb | S
Y → Yb | Z
Z → bZa | ε

Nullable variables: Z (Z → ε), Y (Y ⇒ Z), S (S ⇒ Y), X (X ⇒ S)

After eliminating ε-productions:
S → XX | Y | X
X → aSb | S | ab
Y → Yb | Z | b
Z → bZa | ba
Workout
S → XZ
X → Xb | ε
Z → aZ | ZX | ε
1. What are the nullable variables?
2. What are the new productions?
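The nullable set asked for in the workout can be computed by a fixed-point iteration: a non-terminal is nullable if some alternative consists entirely of nullable symbols. The sketch below reads the workout grammar as S → XZ, X → Xb | ε, Z → aZ | ZX | ε (our reconstruction), with ε written as an empty list.

```python
def nullable_set(grammar):
    """Fixed point: A is nullable iff some production A -> X1..Xk
    has every Xi nullable (the empty body counts trivially)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in nullable:
                continue
            for body in bodies:
                if all(symbol in nullable for symbol in body):
                    nullable.add(head)
                    changed = True
                    break
    return nullable

grammar = {"S": [["X", "Z"]],
           "X": [["X", "b"], []],          # [] is the eps-production
           "Z": [["a", "Z"], ["Z", "X"], []]}
print(sorted(nullable_set(grammar)))   # ['S', 'X', 'Z']
```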
Cont... (Eliminating Cycles)

The unit productions S → X and X → S form a cycle (S ⇒ X ⇒ S), so they must be removed. Copying the non-unit alternatives across and dropping the cycle gives:

S → XX | Y | aSb | ab
X → aSb | ab | XX | Y
Y → Yb | b | Z
Z → bZa | ba
Immediate Left-Recursion
For a production of the form A → Aα | β, where:
• A is a left-recursive non-terminal,
• α is a sequence of non-terminals and terminals that is not null,
• β is a sequence of non-terminals and terminals that does not start with A,
replace the A-production by the productions:
A → βA'
A' → αA' | ε
and create a new non-terminal A'. This newly created symbol is often called the "tail", or the "rest".
Immediate Left-Recursion -- Example
For each rule which contains a left-recursive option,
A → Ac | Bc

Introduce a new non-terminal A' and rewrite the rule as:
A → BcA'
A' → cA' | ε
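The rewrite A → Aα | β ⟹ A → βA', A' → αA' | ε can be mechanized. The sketch below handles one non-terminal's immediate left recursion, using the list encoding assumed earlier and the empty list for ε; the function name is ours.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Split A's alternatives into left-recursive ones (A alpha) and the
    rest (beta), then return A -> beta A' and A' -> alpha A' | eps."""
    tail = head + "'"                       # the new "tail" non-terminal
    alphas = [b[1:] for b in bodies if b and b[0] == head]
    betas = [b for b in bodies if not b or b[0] != head]
    if not alphas:
        return {head: bodies}               # no immediate left recursion
    return {head: [beta + [tail] for beta in betas],
            tail: [alpha + [tail] for alpha in alphas] + [[]]}

# A -> Ac | Bc   becomes   A -> BcA' ; A' -> cA' | eps
print(eliminate_immediate_left_recursion("A", [["A", "c"], ["B", "c"]]))
# {'A': [['B', 'c', "A'"]], "A'": [['c', "A'"], []]}
```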
24
Left-Factoring
• A predictive parser (a top-down parser without backtracking) insists that the grammar must be left-factored.
We convert the grammar into a new equivalent grammar suitable for predictive parsing.

stmt → if expr then stmt else stmt
     | if expr then stmt

After seeing "if expr then stmt", the parser cannot decide which of the two alternatives to use.
25
Left-Factoring -- Steps
• For each non-terminal A with two or more alternatives (production rules) sharing a common non-empty prefix α, say

A → αβ1 | ... | αβn | γ1 | ... | γm

convert it into:

A → αA' | γ1 | ... | γm
A' → β1 | ... | βn
26
Left-Factoring (cont.)
• In general, A → αβ1 | αβ2, where α is non-empty and the first symbols of β1 and β2 are different.
• When processing α, we cannot know whether to expand A to αβ1 or to αβ2.
• But if we re-write the grammar as follows:
A → αA'
A' → β1 | β2
we can immediately expand A to αA'.
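Left-factoring can likewise be mechanized. The sketch below factors out one given prefix per call, enough to replay the first step of Example 1; the function name and the explicit `prefix` argument are our own simplifying assumptions (a full tool would find the longest common prefix itself).

```python
def left_factor(head, bodies, prefix):
    """Factor alternatives sharing `prefix`:
    A -> p b1 | .. | p bn | g1 | .. | gm
    becomes A -> p A' | g1 | .. | gm ;  A' -> b1 | .. | bn."""
    k = len(prefix)
    tail = head + "'"
    factored = [b[k:] for b in bodies if b[:k] == prefix]   # the beta_i
    others = [b for b in bodies if b[:k] != prefix]         # the gamma_j
    return {head: [prefix + [tail]] + others, tail: factored}

# A -> abB | aB | cdg | cdeB | cdfB, factoring out the common prefix 'a'
g = left_factor("A", [["a", "b", "B"], ["a", "B"], ["c", "d", "g"],
                      ["c", "d", "e", "B"], ["c", "d", "f", "B"]], ["a"])
print(g["A"])    # A  -> aA' | cdg | cdeB | cdfB
print(g["A'"])   # A' -> bB | B
```

A second call on the new A with prefix `["c", "d"]` performs the remaining factoring step of Example 1.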
27
Left-Factoring – Example 1

A → abB | aB | cdg | cdeB | cdfB

First, factor the common prefix a:
A → aA' | cdg | cdeB | cdfB
A' → bB | B

Then factor the common prefix cd:
A → aA' | cdA''
A' → bB | B
A'' → g | eB | fB
28
Left-Factoring – Example 2

A → ad | a | ab | abc | b

First, factor the common prefix a:
A → aA' | b
A' → d | ε | b | bc

Then factor the common prefix b in A':
A → aA' | b
A' → d | ε | bA''
A'' → ε | c