Date post: | 16-Dec-2015 |
Upload: | vivian-george |
CS-338 Compiler Design
Dr. Syed Noman Hasany, Assistant Professor
College of Computer, Qassim University
Chapter 3: Lexical Analyzer
The Role of the Lexical Analyzer
It is the first phase of the compiler. It reads the input characters and produces as output a sequence of tokens that the parser uses for syntax analysis.
It strips comments and white space (blank, tab, and newline characters) from the source program.
It also correlates error messages from the compiler with the source program, because it keeps track of line numbers.
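Interaction of the Lexical Analyzer with the Parser
[Figure: the source program feeds the lexical analyzer. The parser sends "get next token" requests and receives (token, tokenval) pairs. Both the lexical analyzer and the parser consult the symbol table, and both can report errors.]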
The Reason Why Lexical Analysis is a Separate Phase
• Simplifies the design of the compiler
– Without a separate lexical analyzer, LL(1) or LR(1) parsing with one token of lookahead would not be possible, because recognizing a single token can require matching several characters
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers by hand or automatically from specifications
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character encodings can be normalized (e.g. trigraphs)
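Attributes of Tokens
[Figure: the lexical analyzer reads the input y := 31 + 28*x and passes the parser the stream
<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
where the first component of each pair is the token and the second is tokenval (the token attribute).]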
Tokens, Patterns, and Lexemes
• A token is a classification of lexical units
– For example: id and num
• Lexemes are the specific character strings that make up a token
– For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token
– For example: “letter followed by letters and digits” and “non-empty sequence of digits”
• A lexeme is a sequence of characters from the source program that is matched by a pattern for a token.

Token      Sample Lexemes         Informal Description of Pattern
const      const                  const
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or >= or >
id         pi, count, D2          letter followed by letters and digits
num        3.1416, 0, 6.02E23     any numeric constant
literal    "core dumped"          any characters between " and " except "

• The token classifies the pattern.
• The actual values (lexemes) are critical. This information is:
1. Stored in the symbol table
2. Returned to the parser
3.2 Input Buffering
• This section examines ways of speeding up the reading of the source program
– In a one-buffer technique, the lexeme currently being processed is overwritten when the buffer is reloaded.
– A two-buffer scheme handles large lookahead safely.
3.2.1 Buffer Pairs
• Two buffers of the same size, say 4096, are alternately reloaded.
• Two pointers to the input are maintained:
– Pointer lexeme_Begin marks the beginning of the current lexeme.
– Pointer forward scans ahead until a pattern match is found.

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
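3.2.2 Sentinels
• Each buffer half ends with a sentinel eof character, so the common case needs only one test per character.
[Figure: buffer pair with sentinels: E = M * eof | C * * 2 eof … eof. Each half ends with an eof sentinel; the eof after 2 marks the end of the input.]

forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer half signifies end of input */
        terminate lexical analysis
end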
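The sentinel test above can be sketched in C. This is a minimal illustration under stated assumptions, not the slides' implementation: the source is an in-memory string rather than a file, NUL stands in for the eof sentinel, each half holds only 4 characters, and the names Scanner, scanner_init, reload and next_char are hypothetical. Unlike the pseudocode, it tests the character under forward before advancing, and it does not model lexeme_Begin.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HALF 4          /* characters per buffer half (tiny, for illustration) */
#define SENTINEL '\0'   /* assumption: the source text never contains NUL */

typedef struct {
    const char *src;            /* unread source text (stands in for a file) */
    char buf[2 * (HALF + 1)];   /* two halves, each followed by a sentinel slot */
    size_t forward;             /* index of the next character to examine */
} Scanner;

/* Fill one half (0 or 1) with up to HALF characters from the source. */
static void reload(Scanner *sc, int half) {
    char *dst = sc->buf + half * (HALF + 1);
    size_t n = strlen(sc->src);
    if (n > HALF) n = HALF;
    memcpy(dst, sc->src, n);
    sc->src += n;
    dst[n] = SENTINEL;      /* an early sentinel marks the true end of input */
    dst[HALF] = SENTINEL;   /* sentinel slot at the end of the half */
}

void scanner_init(Scanner *sc, const char *text) {
    sc->src = text;
    sc->forward = 0;
    reload(sc, 0);
}

/* Return the next character, or SENTINEL once the input is exhausted. */
int next_char(Scanner *sc) {
    for (;;) {
        char c = sc->buf[sc->forward];
        if (c != SENTINEL) {                      /* common case: one test */
            sc->forward++;
            return (unsigned char)c;
        }
        if (sc->forward == HALF) {                /* end of first half */
            reload(sc, 1);
            sc->forward = HALF + 1;
        } else if (sc->forward == 2 * HALF + 1) { /* end of second half */
            reload(sc, 0);
            sc->forward = 0;
        } else {
            return SENTINEL;                      /* eof inside a half */
        }
    }
}
```

Only when the sentinel is seen does the scanner pay for the extra comparisons that decide between reloading a half and stopping.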
Specification of Patterns for Tokens: Definitions
• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ
Specification of Patterns for Tokens: String Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
s^0 = ε (the empty string: a string of length zero)
s^i = s^(i-1) s for i > 0
note that sε = εs = s
Specification of Patterns for Tokens: Language Operations
• Union: L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation: LM = {xy | x ∈ L and y ∈ M}
• Exponentiation: L^0 = {ε}; L^i = L^(i-1) L
• Kleene closure: L* = ∪ (i = 0, …, ∞) L^i
• Positive closure: L+ = ∪ (i = 1, …, ∞) L^i
Language Operations Examples
L = {A, B, C, D}    D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3}
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
L^2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}
L^4 = L^2 L^2 = ??
L* = {all possible strings over L, plus ε}
L+ = L* − {ε}
L (L ∪ D) = ??
L (L ∪ D)* = ??
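The concatenation LD from the examples above can be computed mechanically. The following C sketch is only an illustration: the function name concat_languages, the fixed string capacity MAX_LEN, and the representation of a finite language as an array of C strings are all assumptions made here. It does not remove duplicates, which can arise for other languages when two different pairs concatenate to the same string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_LEN 8   /* capacity of each result string, including the NUL */

/* Write every xy with x in l and y in m into out, which must have room
 * for nl * nm entries. Returns the number of strings written. */
size_t concat_languages(const char *l[], size_t nl,
                        const char *m[], size_t nm,
                        char out[][MAX_LEN]) {
    size_t k = 0;
    for (size_t i = 0; i < nl; i++) {
        for (size_t j = 0; j < nm; j++) {
            /* snprintf truncates rather than overflowing the slot */
            snprintf(out[k], MAX_LEN, "%s%s", l[i], m[j]);
            k++;
        }
    }
    return k;
}
```

Called with L = {A, B, C, D} and D = {1, 2, 3}, it produces the twelve strings A1 through D3 in the order listed on the slide.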
Specification of Patterns for Tokens: Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
– r|s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set
• Examples:
– let Σ = {a, b}
a | b
(a | b) (a | b)
a*
(a | b)*
a | a*b
– We assume that ‘*’ has the highest precedence and is left associative. Concatenation has the second highest precedence and is left associative, and ‘|’ has the lowest precedence and is left associative
• (a) | ((b)*(c)) = a | b*c
Algebraic Properties of Regular Expressions
AXIOM                        DESCRIPTION
r | s = s | r                | is commutative
r | (s | t) = (r | s) | t    | is associative
(r s) t = r (s t)            concatenation is associative
r (s | t) = r s | r t        concatenation distributes over |
(s | t) r = s r | t r
εr = r,  rε = r              ε is the identity element for concatenation
r* = (r | ε)*                relation between * and ε
r** = r*                     * is idempotent
Finite Automaton
• Given an input string, we need a “machine” that has a regular expression hard-coded in it and can tell whether the input string matches the pattern described by the regular expression or not.
• A machine that determines whether a given string belongs to a language is called a finite automaton.
Deterministic Finite Automaton
• Definition: a Deterministic Finite Automaton (DFA) is a five-tuple (Σ, S, δ, s0, F) where
– Σ is the alphabet
– S is the set of states
– δ is the transition function (S × Σ → S)
– s0 is the starting state
– F is the set of final states (F ⊆ S)
• Notation:
– Use a transition diagram to describe a DFA
– States are nodes, transitions are directed, labeled edges, some states are marked as final, and one state is marked as starting
• If the automaton stops at a final state on end of input, then the input string belongs to the language.
① a
Σ = {a}
L = {a}
S = {1,2}
δ(1,a)=2
s0 = 1
F = {2}
② a|b
Σ = {a,b}
L = {a,b}
S = {1,2}
δ(1,a)=2, δ(1,b)=2
s0 = 1
F = {2}
③ a(a|b)
Σ = {a,b}
L = {aa,ab}
S = {1,2,3}
δ(1,a)=2, δ(2,a)=3, δ(2,b)=3
s0 = 1
F = {3}
④ a*
Σ = {a}
L = {ε,a,aa,aaa,aaaa,…}
S = {1}
δ(1,a)=1
s0 = 1
F = {1}
⑤ a⁺
Σ = {a}
L = {a,aa,aaa,aaaa,…}
S = {1,2}
δ(1,a)=2, δ(2,a)=2
s0 = 1
F = {2}
Note: a⁺ = aa*
⑥ (a|b)(a|b)b
Σ = {a,b}
L = {aab,abb,bab,bbb}
S = {1,2,3,4}
δ(1,a)=2, δ(1,b)=2, δ(2,a)=3, δ(2,b)=3, δ(3,b)=4
s0 = 1
F = {4}
⑦ (a|b)*
Σ = {a,b}
L = {ε,a,b,aa,bb,ba,ab,aaa,…,bbb,…,abab,…,baba,bbba,…}
S = {1}
δ(1,a)=1, δ(1,b)=1
s0 = 1
F = {1}
⑧ (a|b)⁺
Σ = {a,b}
L = {a,b,aa,ab,ba,bb,aaa,…,bbb,…}
S = {1,2}
δ(1,a)=2, δ(1,b)=2, δ(2,a)=2, δ(2,b)=2
s0 = 1
F = {2}
Note: (a|b)⁺ = (a|b)(a|b)*
⑨ a⁺ | b⁺
Σ = {a,b}
L = {a,aa,aaa,…,b,bb,bbb,…}
S = {1,2,3}
δ(1,a)=2, δ(2,a)=2, δ(1,b)=3, δ(3,b)=3
s0 = 1
F = {2,3}
⑩ a(a|b)*
Σ = {a,b}
L = {a,aa,ab,aaa,aab,aba,abb,…,abbb,…,ababab,…}
S = {1,2}
δ(1,a)=2, δ(2,a)=2, δ(2,b)=2
s0 = 1
F = {2}
⑪ a(b|a)b⁺
Σ = {a,b}
L = {aab,abb,aabb,…,abbb,abbbb,…}
S = {1,2,3,4}
δ(1,a)=2, δ(2,a)=3, δ(2,b)=3, δ(3,b)=4, δ(4,b)=4
s0 = 1
F = {4}
⑫ ab*a(a⁺ | b⁺)
Σ = {a,b}
L = {aaa,aab,abaa,abbaa,…,abbab,abbabbb,…}
S = {1,2,3,4,5}
δ(1,a)=2, δ(2,b)=2, δ(2,a)=3, δ(3,a)=4, δ(4,a)=4, δ(3,b)=5, δ(5,b)=5
s0 = 1
F = {4,5}
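A DFA like the ones above can be simulated with a transition table. The sketch below encodes example ⑩, a(a|b)*, in C. It is only an illustration: the name dfa_accepts and the table layout are assumptions, and an explicit dead state 0 is added here to make the transition function total, since the slide leaves the move from state 1 on b undefined.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_STATES = 3, NUM_SYMBOLS = 2, DEAD = 0, START = 1 };

/* delta[state][symbol]: column 0 is 'a', column 1 is 'b'.
 * State 0 is an added dead state; states 1 and 2 match the slide. */
static const int delta[NUM_STATES][NUM_SYMBOLS] = {
    /* 0 (dead)  */ { DEAD, DEAD },
    /* 1 (start) */ { 2,    DEAD },
    /* 2 (final) */ { 2,    2    },
};

static const bool is_final[NUM_STATES] = { false, false, true };

/* Return true if input belongs to the language of a(a|b)*. */
bool dfa_accepts(const char *input) {
    int state = START;
    for (const char *p = input; *p != '\0'; p++) {
        if (*p != 'a' && *p != 'b')
            return false;               /* symbol outside the alphabet */
        state = delta[state][*p == 'a' ? 0 : 1];
    }
    return is_final[state];             /* accept only in a final state */
}
```

Changing the two tables is enough to simulate any of the other examples, which is why table-driven simulation is a common way to implement generated lexical analyzers.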
Specification of Patterns for Tokens: Regular Definitions
• Regular definitions introduce a naming convention:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over Σ ∪ {d1, d2, …, di-1}
• Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions
• Example:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
• Regular definitions are not recursive:
digits → digit digits | digit    wrong!
Specification of Patterns for Tokens: Notational Shorthand
• The following shorthands are often used:
r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
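To see how the shorthands read operationally, the num definition can be matched by hand. The C sketch below is one possible reading, not a prescribed implementation: the names digits and match_num are hypothetical, and the function returns the length of the longest prefix that matches num, or 0 if there is none. Each optional group (the r? shorthand) is taken only when all of it is present.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the run of decimal digits at the start of s (digit*). */
static size_t digits(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Length of the longest prefix of s matching
 *   num -> digit+ (. digit+)? ( E (+ | -)? digit+ )?
 * or 0 if no prefix matches. */
size_t match_num(const char *s) {
    size_t n = digits(s);               /* digit+ */
    if (n == 0) return 0;
    if (s[n] == '.') {                  /* (. digit+)? */
        size_t frac = digits(s + n + 1);
        if (frac > 0) n += 1 + frac;
    }
    if (s[n] == 'E') {                  /* ( E (+ | -)? digit+ )? */
        size_t k = n + 1;
        if (s[k] == '+' || s[k] == '-') k++;
        size_t exp = digits(s + k);
        if (exp > 0) n = k + exp;
    }
    return n;
}
```

For the input 12.x the function reports a match of length 2: the fraction group is abandoned because no digit follows the dot.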
Regular Definitions and Grammars
Grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
Regular definitions:
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
Constructing Transition Diagrams for Tokens
• Transition diagrams (TDs) are used to represent the tokens. They are automata.
• As characters are read, the relevant TDs are used to attempt to match a lexeme to a pattern
• Each TD has:
– States: represented by circles
– Actions: represented by arrows between states
– Start state: beginning of a pattern (arrowhead)
– Final state(s): end of pattern (concentric circles)
• Each TD is deterministic: there is no need to choose between two different actions!
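Example: All RELOPs
[Transition diagram, listed edge by edge. The start state is 0; a * marks a final state that retracts the input by one character.]
state 0 on <      → state 1
state 1 on =      → state 2: return(relop, LE)
state 1 on >      → state 3: return(relop, NE)
state 1 on other  → state 4*: return(relop, LT)
state 0 on =      → state 5: return(relop, EQ)
state 0 on >      → state 6
state 6 on =      → state 7: return(relop, GE)
state 6 on other  → state 8*: return(relop, GT)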
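The RELOP diagram translates almost directly into nested tests on the lookahead character. The C sketch below is an illustration only: the names Relop and scan_relop and the length out-parameter are assumptions introduced here, and the state numbers in the comments follow the case numbering of the nexttoken() code in these slides (state 1 after <, state 5 after =, state 6 after >). Not consuming the "other" character plays the role of the retraction marked by * in the diagram.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NOT_RELOP, LT, LE, EQ, NE, GT, GE } Relop;

/* Recognize a relational operator at the start of s.
 * Stores the number of characters consumed in *len (0 if none). */
Relop scan_relop(const char *s, size_t *len) {
    switch (s[0]) {
    case '<':                                   /* state 1 */
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1;                               /* "other" is not consumed */
        return LT;
    case '=':                                   /* state 5 */
        *len = 1;
        return EQ;
    case '>':                                   /* state 6 */
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1;                               /* "other" is not consumed */
        return GT;
    default:
        *len = 0;
        return NOT_RELOP;
    }
}
```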
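Example TDs: id and delim
[Transition diagrams, listed edge by edge. A * marks a final state that retracts the input by one character.]
Keyword or id (start state 9):
state 9 on letter            → state 10
state 10 on letter or digit  → state 10
state 10 on other            → state 11*: return(install_id(), gettoken())
delim (start state 28):
state 28 on delim  → state 29
state 29 on delim  → state 29
state 29 on other  → state 30*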
Combine TD for KW and IDs
• install_id(): determines the attribute
– It looks up the accepted lexeme in the list of keywords; if it matches, zero is returned.
– Otherwise it looks up the lexeme in the symbol table; if it is found, the address of the entry is returned.
– If the lexeme is not found in the symbol table, install_id() first installs the ID in the symbol table and returns the address of the newly created entry.
• gettoken(): determines the token
– If zero is returned by install_id(), the same word (or its numeric form) is returned as the token.
– Otherwise token “ID” is returned.
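The division of labour between install_id() and gettoken() can be sketched in C as follows. This is a hedged illustration, not the slides' code: the slides call both functions without arguments, whereas here the lexeme and the attribute are passed explicitly, and the token codes (TOK_ID and so on), the fixed-size symtab array, and the use of a 1-based index as the "address" are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 64

/* Hypothetical token codes; real values depend on the parser. */
enum { TOK_ID = 256, TOK_IF, TOK_THEN, TOK_ELSE };

static const struct { const char *lexeme; int token; } keywords[] = {
    { "if", TOK_IF }, { "then", TOK_THEN }, { "else", TOK_ELSE },
};
#define NUM_KEYWORDS (sizeof keywords / sizeof keywords[0])

/* Entry 0 is unused so that 0 can mean "keyword". */
static const char *symtab[MAX_SYMBOLS + 1];
static int symcount = 0;

/* Returns 0 for a keyword, otherwise the 1-based symbol-table index of
 * the identifier, installing it first if needed (-1 if the table is full).
 * Only the pointer is stored; the caller must keep the string alive. */
int install_id(const char *lexeme) {
    for (size_t i = 0; i < NUM_KEYWORDS; i++)
        if (strcmp(lexeme, keywords[i].lexeme) == 0) return 0;
    for (int i = 1; i <= symcount; i++)
        if (strcmp(lexeme, symtab[i]) == 0) return i;
    if (symcount == MAX_SYMBOLS) return -1;
    symtab[++symcount] = lexeme;
    return symcount;
}

/* Returns the keyword's own token when attr is 0, otherwise TOK_ID. */
int gettoken(const char *lexeme, int attr) {
    if (attr == 0)
        for (size_t i = 0; i < NUM_KEYWORDS; i++)
            if (strcmp(lexeme, keywords[i].lexeme) == 0)
                return keywords[i].token;
    return TOK_ID;
}
```

A linear search keeps the sketch short; a production symbol table would more likely use hashing.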
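Example TDs: Unsigned #s
[Three transition diagrams, listed edge by edge. A * marks a final state that retracts the input by one character.]
Diagram 1 (start state 12):
state 12 on digit  → state 13
state 13 on digit  → state 13
state 13 on .      → state 14
state 13 on E      → state 16
state 14 on digit  → state 15
state 15 on digit  → state 15
state 15 on E      → state 16
state 16 on + | -  → state 17
state 16 on digit  → state 18
state 17 on digit  → state 18
state 18 on digit  → state 18
state 18 on other  → state 19*
Diagram 2 (start state 20):
state 20 on digit  → state 21
state 21 on digit  → state 21
state 21 on .      → state 22
state 22 on digit  → state 23
state 23 on digit  → state 23
state 23 on other  → state 24*
Diagram 3 (start state 25):
state 25 on digit  → state 26
state 26 on digit  → state 26
state 26 on other  → state 27*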
Questions: Is ordering important for unsigned #s ?
Why are there no TDs for then, else, if ?
Keywords Recognition
All keywords / reserved words are matched as ids
• After the match, the symbol table or a special keyword table is consulted
• The keyword table contains string versions of all keywords and associated token values
if       0
begin    0
then     0
...      ...
• If a match is not found, then it is assumed that an id has been discovered
Transition Diagrams & Lexical Analyzers
state = 0;
token nexttoken()
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();
            /* c is lookahead character */
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;
                /* advance beginning of lexeme */
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
        … /* cases 1-8 here */
        case 9:
            c = nextchar();
            if (isletter(c)) state = 10;
            else state = fail();
            break;
        case 10:
            c = nextchar();
            if (isletter(c)) state = 10;
            else if (isdigit(c)) state = 10;
            else state = 11;
            break;
        case 11:
            retract(1); install_id();
            return ( gettoken() );
        … /* cases 12-24 here */
        case 25:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = fail();
            break;
        case 26:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = 27;
            break;
        case 27:
            retract(1); install_num();
            return ( NUM );
        }
    }
}
Case numbers correspond to transition diagram states!
When Failures Occur:
int state = 0, start = 0;
int lexical_value;
/* to “return” second component of token */
int fail()
{
    /* restart the scan at the beginning of the token, using the next diagram */
    forward = token_beginning;
    switch (start) {
    case 0:  start = 9;  break;
    case 9:  start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover();  break;
    default: /* compiler error */ break;
    }
    return start;
}
Using a Lex Generator
[Figure: three processing steps]
Lex source program lex.l → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream (input.c) → a.out → sequence of tokens