8/3/2019 14823_02. Chapter 3 - Lexical Analysis
Chapter 3
Lexical Analysis
Outline
Role of lexical analyzer
Specification of tokens
Recognition of tokens
Lexical analyzer generator
Finite automata
Design of lexical analyzer generator
The role of the lexical analyzer

(Diagram: the source program feeds the Lexical Analyzer; the Parser calls getNextToken and receives a token in return, then passes its output on to semantic analysis. Both components consult the symbol table.)
Why separate lexical analysis and parsing
1. Simplicity of design
2. Improved compiler efficiency
3. Enhanced compiler portability
Tokens, Patterns and Lexemes
A token is a pair of a token name and an optional token value.
A pattern is a description of the form that the lexemes of a token may take.
A lexeme is a sequence of characters in the source program that matches the pattern for a token.
Example

Token       Informal description                    Sample lexemes
if          characters i, f                         if
else        characters e, l, s, e                   else
comparison  < or > or <= or >= or == or !=
id          letter followed by letters and digits
number      any numeric constant
literal     anything but ", surrounded by "s
Attributes for tokens

E = M * C ** 2
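Written out, the lexical analyzer would return a (name, attribute) pair for each token in this statement. A plausible rendering in Python — using the lexeme itself where a real compiler would store a symbol-table pointer, so the names and attributes here are illustrative:

```python
# Hypothetical token stream for  E = M * C ** 2.
# Identifier attributes would normally be symbol-table pointers;
# the lexeme string is used here for readability.
tokens = [
    ("id", "E"),
    ("assign_op", None),
    ("id", "M"),
    ("mult_op", None),
    ("id", "C"),
    ("exp_op", None),     # ** is the exponentiation operator
    ("number", 2),        # attribute is the numeric value
]
```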
Lexical errors
Some errors are beyond the power of the lexical analyzer to recognize:
  fi (a == f(x))
However, it may be able to recognize errors like:
  d = 2r
Such errors are recognized when no pattern for tokens matches a character sequence.
Error recovery
Panic mode: successive characters are ignored until we reach a well-formed token
Delete one character from the remaining input
Insert a missing character into the remaining input
Replace a character by another character
Transpose two adjacent characters
Input buffering
Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return
  In C: we need to look past -, = or < to decide what token to return
  In Fortran: DO 5 I = 1.25
We need to introduce a two-buffer scheme to handle large look-aheads safely

E = M * C * * 2 eof
Sentinels

switch (*forward++) {
    case eof:
        if (forward is at end of first buffer) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if (forward is at end of second buffer) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    cases for the other characters;
}

E = M eof * C * * 2 eof eof
Specification of tokens
In the theory of compilation, regular expressions are used to formalize the specification of tokens
Regular expressions are a means for specifying regular languages
Example:
  letter_ (letter_ | digit)*
Each regular expression is a pattern specifying the form of strings
Regular expressions
ε is a regular expression, L(ε) = {ε}
If a is a symbol in Σ then a is a regular expression, L(a) = {a}
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
(r)(s) is a regular expression denoting the language L(r)L(s)
(r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Regular definitions
d1 -> r1
d2 -> r2
...
dn -> rn
Example:
  letter_ -> A | B | ... | Z | a | b | ... | z | _
  digit -> 0 | 1 | ... | 9
  id -> letter_ (letter_ | digit)*
Extensions
One or more instances: (r)+
Zero or one instance: r?
Character classes: [abc]
Example:
  letter_ -> [A-Za-z_]
  digit -> [0-9]
  id -> letter_ (letter_ | digit)*
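Python's re module supports the same extensions (+, ?, character classes), so the definitions above can be checked directly. A small sketch (the variable names are just for illustration):

```python
import re

# The slide's regular definitions, combined into one Python regex.
letter_ = r"[A-Za-z_]"
digit = r"[0-9]"
ident = re.compile(rf"{letter_}(?:{letter_}|{digit})*")

print(bool(ident.fullmatch("x_1")))   # True: letter_, then letters/digits
print(bool(ident.fullmatch("1x")))    # False: cannot start with a digit
```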
Recognition of tokens
Starting point is the language grammar, to understand the tokens:
  stmt -> if expr then stmt
        | if expr then stmt else stmt
        | ε
  expr -> term relop term
        | term
  term -> id
        | number
Recognition of tokens (cont.)
The next step is to formalize the patterns:
  digit -> [0-9]
  digits -> digit+
  number -> digits (. digits)? (E [+-]? digits)?
  letter -> [A-Za-z_]
  id -> letter (letter | digit)*
  if -> if
  then -> then
  else -> else
  relop -> < | > | <= | >= | = | <>
We also need to handle whitespace:
  ws -> (blank | tab | newline)+
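These patterns can be sketched as a small Python tokenizer (the helper name tokenize and the token-spec layout are illustrative, not part of the slides). Listing the keyword patterns before id makes "if" match as a keyword rather than as an identifier, mirroring the priority a Lex-style generator would apply:

```python
import re

# One named group per pattern from the slide (inner groups are
# non-capturing so m.lastgroup reports the token name).
token_spec = [
    ("if",     r"if\b"),
    ("then",   r"then\b"),
    ("else",   r"else\b"),
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("id",     r"[A-Za-z_][A-Za-z_0-9]*"),
    ("relop",  r"<=|>=|<>|<|>|="),
    ("ws",     r"[ \t\n]+"),
]
master = re.compile("|".join(f"(?P<{n}>{p})" for n, p in token_spec))

def tokenize(s):
    # Skip whitespace tokens, as the ws rule prescribes.
    return [(m.lastgroup, m.group()) for m in master.finditer(s)
            if m.lastgroup != "ws"]

print(tokenize("if x1 <= 2.5 then y"))
```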
Transition diagrams
Transition diagram for relop
Transition diagrams (cont.)
Transition diagram for reserved words and identifiers
Transition diagrams (cont.)
Transition diagram for unsigned numbers
Transition diagrams (cont.)
Transition diagram for whitespace
Transition-diagram-based lexical analyzer

TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0:
            c = nextchar();
            if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1:
            ...
        case 8:
            retract();
            retToken.attribute = GT;
            return(retToken);
        }
    }
}
Lexical Analyzer Generator - Lex

Lex source program (lex.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
input stream -> a.out -> sequence of tokens
Structure of Lex programs

declarations
%%
translation rules
%%
auxiliary functions

Each translation rule has the form: Pattern { Action }
Example

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}

%%

int installID() {/* function to install the lexeme, whose first
                    character is pointed to by yytext, and whose
                    length is yyleng, into the symbol table and
                    return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
                     constants into a separate table */
}
Finite Automata
Regular expressions = specification
Finite automata = implementation
A finite automaton consists of
  An input alphabet Σ
  A set of states S
  A start state n
  A set of accepting states F ⊆ S
  A set of transitions state --input--> state
Finite Automata
Transition
  s1 --a--> s2
is read: in state s1 on input a go to state s2
If at end of input: if in an accepting state => accept, otherwise => reject
If no transition possible => reject
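These three rules translate directly into code. A minimal sketch (run_dfa and the state names are made up for illustration): follow transitions while input remains, reject as soon as no transition is possible, and at end of input accept iff we are in an accepting state.

```python
def run_dfa(delta, start, accepting, s):
    state = start
    for ch in s:
        if (state, ch) not in delta:
            return False              # no transition possible => reject
        state = delta[(state, ch)]
    return state in accepting         # end of input: accept or reject

# Tiny machine that accepts exactly the string "1":
delta = {("q0", "1"): "q1"}
print(run_dfa(delta, "q0", {"q1"}, "1"))    # True
print(run_dfa(delta, "q0", {"q1"}, "11"))   # False: no transition from q1
```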
Finite Automata State Graphs
A state
The start state
An accepting state
A transition, labeled a
A Simple Example
A finite automaton that accepts only "1"
A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start to some accepting state
Another Simple Example
A finite automaton accepting any number of 1s followed by a single 0
Alphabet: {0, 1}
Check that 1110 is accepted but 110... (anything continuing past the 0) is not
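Simulated directly, this machine needs only two states (the names "A" and "B" are made up): A loops on 1 and moves to the accepting state B on 0, and any further input after the 0 has no transition and is rejected.

```python
delta = {("A", "1"): "A", ("A", "0"): "B"}

def accepts(s):
    state = "A"
    for ch in s:
        if (state, ch) not in delta:   # e.g. any symbol after the 0
            return False
        state = delta[(state, ch)]
    return state == "B"               # accepting only after the single 0

print(accepts("1110"))   # True
print(accepts("1101"))   # False: nothing may follow the 0
```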
And Another Example
Alphabet: {0, 1}
What language does this recognize?
(Transition diagram with edges labeled 0 and 1 omitted)
And Another Example
Alphabet still {0, 1}
The operation of the automaton is not completely defined by the input
  On input 11 the automaton could be in either state
Epsilon Moves
Another kind of transition: ε-moves
The machine can move from state A to state B without reading input
  A --ε--> B
Nondeterministic Automata
Deterministic Finite Automata (DFA)
  One transition per input per state
  No ε-moves
Nondeterministic Finite Automata (NFA)
  Can have multiple transitions for one input in a given state
  Can have ε-moves
Finite automata have finite memory
  Need only to encode the current state
Execution of Finite Automata
A DFA can take only one path through the state graph
  Completely determined by input
NFAs can choose
  Whether to make ε-moves
  Which of multiple transitions for a single input to take
Acceptance of NFAs
An NFA can get into multiple states
Input: 1 0 1
Rule: an NFA accepts if it can get into a final state
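This rule is implemented by tracking the *set* of states the NFA could be in after each input symbol, and accepting if any final state is in that set. A sketch with an illustrative machine (not the slide's exact diagram; this one accepts strings ending in 01):

```python
# Nondeterministic: a (state, input) pair may map to several states.
delta = {
    ("A", "1"): {"A"},
    ("A", "0"): {"A", "B"},   # the nondeterministic choice on 0
    ("B", "1"): {"C"},
}
final = {"C"}

def nfa_accepts(s):
    current = {"A"}
    for ch in s:
        current = set().union(*(delta.get((q, ch), set()) for q in current))
    return bool(current & final)   # accept if any final state is reachable

print(nfa_accepts("101"))   # True: one path reaches the final state C
print(nfa_accepts("100"))   # False
```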
NFA vs. DFA (1)
NFAs and DFAs recognize the same set of languages (the regular languages)
DFAs are easier to implement
  There are no choices to consider
NFA vs. DFA (2)
For a given language the NFA can be simpler than the DFA
(NFA and DFA diagrams for the same language omitted)
A DFA can be exponentially larger than an NFA
Regular Expressions to Finite Automata
High-level sketch:
  Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven implementation of DFA
Regular Expressions to NFA (1)
For each kind of rexp, define an NFA
Notation: NFA for rexp A
For ε
For input a
Regular Expressions to NFA (2)
For AB
For A | B
Regular Expressions to NFA (3)
For A*
Example of RegExp -> NFA conversion
Consider the regular expression (1 | 0)*1
The NFA is:
(Diagram: an NFA with states A through J, where C --1--> E and D --0--> F form the 1 | 0 branches, G closes the * loop back to B, and I --1--> J is the final 1-transition into the accepting state)
Next
  Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven implementation of DFA
NFA to DFA. The Trick
Simulate the NFA
Each state of the resulting DFA = a non-empty subset of the states of the NFA
Start state = the set of NFA states reachable through ε-moves from the NFA start state
Add a transition S --a--> S' to the DFA iff S' is the set of NFA states reachable from the states in S after seeing the input a, considering ε-moves as well
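The construction above can be sketched compactly in Python; frozensets of NFA states become DFA states. The function names and the demo NFA are illustrative assumptions, not from the slides:

```python
def eps_closure(states, eps):
    # All NFA states reachable from `states` through eps-moves alone.
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def nfa_to_dfa(start, delta, eps, alphabet):
    d0 = eps_closure({start}, eps)            # DFA start state
    trans, seen, work = {}, {d0}, [d0]
    while work:
        S = work.pop()
        for a in alphabet:
            moved = {r for q in S for r in delta.get((q, a), ())}
            T = eps_closure(moved, eps)       # consider eps-moves as well
            if not T:
                continue                      # only non-empty subsets
            trans[(S, a)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    return d0, trans

# Demo NFA: an eps-move A -> B, then B --1--> C.
d0, trans = nfa_to_dfa("A", {("B", "1"): {"C"}}, {"A": {"B"}}, "01")
print(sorted(d0))   # ['A', 'B']
```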
NFA -> DFA Example
(From the NFA for (1|0)*1 with states A through J: the DFA start state is the subset ABCDHI; on 0 it goes to FGABCDHI and on 1 to the accepting subset EJGABCDHI, and both of those states move the same way on each input.)
NFA to DFA. Remark
An NFA may be in many states at any time
How many different states?
If there are N states, the NFA must be in some subset of those N states
How many non-empty subsets are there?
  2^N - 1 = finitely many, but exponentially many
Implementation
A DFA can be implemented by a 2D table T
  One dimension is states
  The other dimension is input symbols
For every transition Si --a--> Sk define T[i, a] = k
DFA execution
  If in state Si on input a, read T[i, a] = k and skip to state Sk
  Very efficient
Table Implementation of a DFA

(DFA with states S, T and U: every state goes to T on 0 and to U on 1)

     0   1
S    T   U
T    T   U
U    T   U
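The table above maps directly onto a nested dict, driven by the lookup loop from the previous slide. A sketch, assuming U is the accepting state (the table itself does not mark one):

```python
# One row per state, one column per input symbol.
T = {
    "S": {"0": "T", "1": "U"},
    "T": {"0": "T", "1": "U"},
    "U": {"0": "T", "1": "U"},
}

def run(s):
    state = "S"
    for ch in s:
        state = T[state][ch]   # one table lookup per input symbol
    return state == "U"        # assumption: U accepts

print(run("101"))   # True: the last symbol read was a 1
print(run("10"))    # False
```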
Implementation (Cont.)
NFA -> DFA conversion is at the heart of tools such as flex or jflex
But DFAs can be huge
In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations