Lexical Analysis
Scanning Tokens
The first step in compiling a program is to break it into tokens (aka lexemes)
Consider the j-- program
Factorial.java
// Computes the factorial of a number recursively.
package pass;

import java.lang.System;

public class Factorial {
    private static int n = 5;

    public static int factorial(int n) {
        if (n <= 0) {
            return 1;
        } else {
            return n * factorial(n - 1);
        }
    }

    public static void main(String[] args) {
        int x = n;
        System.out.println(x + "! = " + factorial(x));
    }
}
For Factorial.java, we want to produce the sequence of tokens package, pass, ;, import, java, ., lang, ., System, ;, public, class, Factorial, {, and so on
We separate the lexemes into categories
In Factorial.java:
public, class, static, and void are among the reserved words
Factorial, main, String, args, System, out, and println are all identifiers
The token "! = " is a literal, a string literal in this instance
The remaining tokens are operators (eg, *) and separators (eg, ;)
The program that breaks the source program into a sequence of tokens is called a lexical analyzer or a scanner
A scanner may be hand-crafted or it may be generated from a specification consisting of regular expressions
State transition diagrams can be used for describing scanners
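A state transition diagram for recognizing identifiers and integers
[Figure: from start, a letter, _, or $ leads to state id, which loops on letter, digit, _, or $ and then ends in idEnd; a digit 1...9 leads to state int, which loops on digit and then ends in intEnd; a 0 leads directly to intEnd]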
if (isLetter(ch) || ch == '_' || ch == '$') {
    // An identifier: consume letters, digits, _ and $.
    buffer = new StringBuffer();
    while (isLetter(ch) || isDigit(ch) || ch == '_' || ch == '$') {
        buffer.append(ch);
        nextCh();
    }
    return new TokenInfo(IDENTIFIER, buffer.toString(), line);
}
else if (ch == '0') {
    // A lone 0 is a complete integer literal.
    nextCh();
    return new TokenInfo(INT_LITERAL, "0", line);
}
else if (isDigit(ch)) {
    // An integer literal starting with 1...9.
    buffer = new StringBuffer();
    while (isDigit(ch)) {
        buffer.append(ch);
        nextCh();
    }
    return new TokenInfo(INT_LITERAL, buffer.toString(), line);
}
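The fragment above depends on scanner state (ch, nextCh(), line) that is not shown here. As a rough, self-contained sketch of the same logic, the class below scans a string directly; the class name IdIntScanner, the string-based token format, and the use of a blank as the only separator are assumptions made for illustration, not part of the j-- scanner.

```java
import java.util.ArrayList;
import java.util.List;

final class IdIntScanner {
    static boolean isLetter(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }

    static boolean isDigit(char ch) {
        return ch >= '0' && ch <= '9';
    }

    static boolean isIdStart(char ch) {
        return isLetter(ch) || ch == '_' || ch == '$';
    }

    // Returns tokens as strings such as "IDENTIFIER(x)"; blanks separate tokens.
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i);
            int start = i;
            if (ch == ' ') {
                i++;
            } else if (isIdStart(ch)) {
                // State id: loop on letter, digit, _ and $.
                while (i < src.length()
                        && (isIdStart(src.charAt(i)) || isDigit(src.charAt(i)))) {
                    i++;
                }
                tokens.add("IDENTIFIER(" + src.substring(start, i) + ")");
            } else if (ch == '0') {
                // A lone 0 goes straight to intEnd.
                i++;
                tokens.add("INT_LITERAL(0)");
            } else if (isDigit(ch)) {
                // State int: loop on digits.
                while (i < src.length() && isDigit(src.charAt(i))) {
                    i++;
                }
                tokens.add("INT_LITERAL(" + src.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException("Unexpected character: " + ch);
            }
        }
        return tokens;
    }
}
```

For instance, IdIntScanner.scan("x1 0 42") yields an identifier followed by two integer literals.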
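A state transition diagram for recognizing keywords
[Figure: from start, a letter, _, or $ leads to state id, which loops on letter, digit, _, or $ and then ends in idEnd; from idEnd the lexeme is classified as a keyword if it is reserved and as an identifier if it is not (!reserved)]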
reserved = new Hashtable<String, Integer>();
reserved.put("abstract", ABSTRACT);
reserved.put("boolean", BOOLEAN);
reserved.put("char", CHAR);
...
reserved.put("while", WHILE);

if (isLetter(ch) || ch == '_' || ch == '$') {
    buffer = new StringBuffer();
    while (isLetter(ch) || isDigit(ch) || ch == '_' || ch == '$') {
        buffer.append(ch);
        nextCh();
    }
    // Scan the whole lexeme first, then decide whether it is reserved.
    String identifier = buffer.toString();
    if (reserved.containsKey(identifier)) {
        return new TokenInfo(reserved.get(identifier), line);
    }
    else {
        return new TokenInfo(IDENTIFIER, identifier, line);
    }
}
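The table lookup can be tried in isolation. The sketch below is a minimal illustration, assuming a hypothetical ReservedWords class that maps a complete lexeme to a token kind given as a string; only the four reserved words spelled out above are included.

```java
import java.util.Map;

final class ReservedWords {
    // Only the entries shown in the slide; the real table is longer.
    private static final Map<String, String> RESERVED = Map.of(
        "abstract", "ABSTRACT",
        "boolean", "BOOLEAN",
        "char", "CHAR",
        "while", "WHILE");

    // Classifies a complete lexeme: a reserved word or a plain identifier.
    static String classify(String lexeme) {
        return RESERVED.getOrDefault(lexeme, "IDENTIFIER");
    }
}
```

Because the whole lexeme is scanned before the lookup, a name such as whilex is classified as an identifier rather than as the keyword while followed by x.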
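A state transition diagram for recognizing separators and operators
[Figure: from start, each of ;, *, and ! leads to a state for that token; = leads to a state for =, from which a second = leads to a state for ==; further separators and operators are elided (...)]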
switch (ch) {
    ...
    case ';':
        nextCh();
        return new TokenInfo(SEMI, line);
    case '=':
        nextCh();
        // One character of lookahead distinguishes == from =.
        if (ch == '=') {
            nextCh();
            return new TokenInfo(EQUAL, line);
        }
        else {
            return new TokenInfo(ASSIGN, line);
        }
    case '!':
        nextCh();
        return new TokenInfo(LNOT, line);
    case '*':
        nextCh();
        return new TokenInfo(STAR, line);
    ...
}
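The one-character lookahead used for = and == can be exercised with a small standalone sketch. The class name OperatorScanner and the string token names are assumptions for illustration; only the four characters shown above are handled.

```java
import java.util.ArrayList;
import java.util.List;

final class OperatorScanner {
    // Scans a string made only of ;, *, ! and = into token names.
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i++);
            switch (ch) {
                case ';':
                    tokens.add("SEMI");
                    break;
                case '*':
                    tokens.add("STAR");
                    break;
                case '!':
                    tokens.add("LNOT");
                    break;
                case '=':
                    // Peek at the next character to tell == from =.
                    if (i < src.length() && src.charAt(i) == '=') {
                        i++;
                        tokens.add("EQUAL");
                    } else {
                        tokens.add("ASSIGN");
                    }
                    break;
                default:
                    throw new IllegalArgumentException("Unexpected character: " + ch);
            }
        }
        return tokens;
    }
}
```

Given the input ===; the sketch produces EQUAL, ASSIGN, SEMI: the scanner always takes the longest operator it can at each position.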
A state transition diagram for recognizing whitespace
[Figure: the start state loops on ' ', '\t', '\f', '\b', '\r', and '\n']
while (isWhitespace(ch)) {
    nextCh();
}
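A state transition diagram for recognizing comments
[Figure: the start state loops on whitespace; a / followed by a second / enters the comment state, which loops on any character that is not '\n' and not EOF and returns to start on '\n'; a / followed by anything that is not / leaves this part of the diagram (...)]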
boolean moreWhiteSpace = true;
while (moreWhiteSpace) {
    while (isWhitespace(ch)) {
        nextCh();
    }
    if (ch == '/') {
        nextCh();
        if (ch == '/') {
            // A single-line comment: skip to the end of the line.
            while (ch != '\n' && ch != EOFCH) {
                nextCh();
            }
        }
        else {
            reportScannerError("Operator / is not supported in j--.");
        }
    }
    else {
        moreWhiteSpace = false;
    }
}
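A standalone version of this loop is sketched below. It assumes a hypothetical TriviaSkipper class working on a string and an index. Unlike the fragment above, it does not report an error for a lone /; it simply stops there and leaves the character to the caller.

```java
final class TriviaSkipper {
    // Returns the index of the first character at or after i that is
    // neither whitespace nor part of a // comment.
    static int skip(String src, int i) {
        while (true) {
            while (i < src.length() && Character.isWhitespace(src.charAt(i))) {
                i++;
            }
            boolean comment = i + 1 < src.length()
                    && src.charAt(i) == '/' && src.charAt(i + 1) == '/';
            if (!comment) {
                return i;
            }
            // Skip the comment body; the newline itself is consumed as
            // whitespace on the next pass of the outer loop.
            i += 2;
            while (i < src.length() && src.charAt(i) != '\n') {
                i++;
            }
        }
    }
}
```

The outer loop mirrors moreWhiteSpace: whitespace and comments may alternate any number of times before the next token begins.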
Regular Expressions
A regular expression describes a language of strings over an alphabet Σ, and thus provides a notation for describing patterns of characters in a text
ε (epsilon) describes the language consisting of only the empty string
If a ∈ Σ, then a describes the language L(a) consisting of only the string a
If r and s are regular expressions, then their concatenation rs describes the language L(rs) consisting of all strings obtained by concatenating a string from L(r) to a string from L(s)
If r and s are regular expressions, then their alternation r|s describes the language L(r|s) consisting of all strings from L(r) or L(s)
If r is a regular expression, then the repetition (aka the Kleene closure) r∗ describes the language L(r∗) consisting of all strings obtained by concatenating zero or more instances of strings from L(r)
Both r and (r) describe the same language, ie, L(r) = L((r))
For example, given an alphabet Σ = {a, b}:
a(a|b)∗ describes the language of non-empty strings of a’s and b’s, beginning with an a
aa|ab|ba|bb describes the language of all two-symbol strings over the alphabet
(a|b)∗ab describes the language of all strings of a’s and b’s, ending in ab
In a programming language such as Java:
Reserved words may be described as abstract | boolean | char | ... | while
Operators may be described as = | == | > | ... | *
Identifiers may be described as ([a-zA-Z] | _ | $)([a-zA-Z0-9] | _ | $)*
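These expressions can be tried out with java.util.regex. The sketch below is only a convenience for experimenting: the class and constant names are assumptions, and Java's regex engine is a backtracking matcher with a much richer syntax than the notation above (note that $ must be escaped).

```java
import java.util.regex.Pattern;

final class RegexExamples {
    // a(a|b)*: non-empty strings of a's and b's beginning with an a
    static final Pattern STARTS_WITH_A = Pattern.compile("a(a|b)*");
    // aa|ab|ba|bb: all two-symbol strings over {a, b}
    static final Pattern TWO_SYMBOLS = Pattern.compile("aa|ab|ba|bb");
    // (a|b)*ab: strings of a's and b's ending in ab
    static final Pattern ENDS_IN_AB = Pattern.compile("(a|b)*ab");
    // Identifiers; $ is a metacharacter in Java regexes, so it is escaped.
    static final Pattern IDENTIFIER =
        Pattern.compile("([a-zA-Z]|_|\\$)([a-zA-Z0-9]|_|\\$)*");

    // True if the whole string belongs to the language of the pattern.
    static boolean inLanguage(Pattern p, String s) {
        return p.matcher(s).matches();
    }
}
```

Matcher.matches() requires the entire string to match, which corresponds to asking whether the string is in L(r).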
Finite State Automata
For any language described by a regular expression, there is a state transition diagram, called a finite state automaton, that can recognize strings in the language
A finite state automaton (FSA) F is a quintuple F = (Σ, S, s0, F, M), where:
Σ is the input alphabet
S is a set of states
s0 ∈ S is a special start state
F ⊆ S is a set of final states
M is a set of moves or state transitions of the form m(r, a) = s, where r, s ∈ S and a ∈ Σ
For example, consider the regular expression (a|b)a∗b over the alphabet {a, b}
An FSA F that recognizes the language described by the regular expression
[Figure: 0 --a, b--> 1; 1 --a--> 1; 1 --b--> 2; state 0 is the start state and state 2 is the final state]
Formally, F = (Σ, S, s0, F, M), where Σ = {a, b}, S = {0, 1, 2}, s0 = 0, F = {2}, and M is
r    a    m(r, a)
0    a    1
0    b    1
1    a    1
1    b    2
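The table M translates directly into a two-dimensional array. The sketch below simulates this FSA; the class name TableDfa and the use of -1 for a missing move are assumptions made for illustration.

```java
final class TableDfa {
    // MOVES[state][0] is the move on a, MOVES[state][1] the move on b;
    // -1 marks a move that is not defined (state 2 has none).
    private static final int[][] MOVES = { {1, 1}, {1, 2}, {-1, -1} };
    private static final int START = 0;
    private static final int FINAL = 2;

    // True if the automaton ends in the final state after reading input.
    static boolean accepts(String input) {
        int state = START;
        for (char ch : input.toCharArray()) {
            if (ch != 'a' && ch != 'b') {
                return false; // not in the alphabet
            }
            state = MOVES[state][ch - 'a'];
            if (state < 0) {
                return false; // no move defined
            }
        }
        return state == FINAL;
    }
}
```

Each input character costs one array lookup, which is why table-driven scanners are generated from deterministic automata.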
Non-deterministic (NFA) Versus Deterministic Finite State Automata (DFA)
A non-deterministic finite state automaton (NFA) is one that allows:
An ε-move defined on the empty string ε, ie, m(r, ε) = s
More than one move from the same state on the same input symbol a, ie, m(r, a) = s and m(r, a) = t, where s ≠ t
An NFA is said to recognize an input string if, starting in the start state, there exists a sequence of moves based on the input that takes us into one of the final states
A deterministic finite state automaton (DFA) is one without ε-moves, and there is a unique move from any state on an input symbol a, ie, if m(r, a) = s and m(r, a) = t, then s = t
For example, consider the regular expression a(a|b)∗b over the alphabet {a, b}
An NFA N that recognizes the language described by the regular expression
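[Figure: 0 --a--> 1; 1 --ε--> 0; 1 --a, b--> 1; 1 --b--> 2; state 0 is the start state and state 2 is the final state]
Formally, N = (Σ, S, s0, F, M), where Σ = {a, b}, S = {0, 1, 2}, s0 = 0, F = {2}, and M is
r    a    m(r, a)
0    a    1
1    ε    0
1    a    1
1    b    1
1    b    2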
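Since an NFA may be in several states at once, one way to simulate it is to track the set of states reachable so far. The sketch below does this for N using the moves in the table; the class name NfaSimulation and the encoding of ε as the character with code 0 are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

final class NfaSimulation {
    static final char EPS = 0; // stands for ε in this sketch
    // Each move is {from, symbol, to}, copied from the table for N.
    static final int[][] MOVES = {
        {0, 'a', 1}, {1, EPS, 0}, {1, 'a', 1}, {1, 'b', 1}, {1, 'b', 2}
    };

    // All states reachable from the given states using ε-moves alone.
    static Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> stack = new ArrayDeque<>(states);
        Set<Integer> result = new TreeSet<>(states);
        while (!stack.isEmpty()) {
            int r = stack.pop();
            for (int[] m : MOVES) {
                if (m[0] == r && m[1] == EPS && result.add(m[2])) {
                    stack.push(m[2]);
                }
            }
        }
        return result;
    }

    // True if some sequence of moves on input ends in the final state 2.
    static boolean accepts(String input) {
        Set<Integer> current = closure(Set.of(0));
        for (char ch : input.toCharArray()) {
            if (ch != 'a' && ch != 'b') {
                return false; // not in the alphabet
            }
            Set<Integer> next = new TreeSet<>();
            for (int[] m : MOVES) {
                if (m[1] == ch && current.contains(m[0])) {
                    next.add(m[2]);
                }
            }
            current = closure(next);
        }
        return current.contains(2);
    }
}
```

Tracking sets of states in this way is the same idea that the powerset construction later turns into a DFA ahead of time.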
Regular Expressions to NFA
Given any regular expression r, we can construct (using Thompson's construction procedure) an NFA N that recognizes the same language; ie, L(N) = L(r)
(Rule 1) NFA Nr for recognizing L(r = ε)
[Figure: start --ε--> final]
(Rule 2) NFA Nr for recognizing L(r = a)
[Figure: start --a--> final]
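(Rule 3) NFA Nrs for recognizing L(rs)
[Figure: the final state of Nr is connected by an ε-move to the start state of Ns; the start state of Nr and the final state of Ns become the start and final states of Nrs]
(Rule 4) NFA Nr|s for recognizing L(r|s)
[Figure: a new start state has ε-moves to the start states of Nr and Ns; the final states of Nr and Ns have ε-moves to a new final state]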
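(Rule 5) NFA Nr∗ for recognizing L(r∗)
[Figure: a new start state has ε-moves to the start state of Nr and to a new final state; the final state of Nr has ε-moves to the new final state and back to the start state of Nr]
(Rule 6) NFA Nr for recognizing L(r) also recognizes L((r))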
As an example, let's construct an NFA for the regular expression (a|b)a∗b, proceeding from left to right
Using Rule 2, we get the NFAs Na and Nb for recognizing a and b as
Na: 1 --a--> 2
Nb: 3 --b--> 4
Using Rules 4 and 6, we get the NFA N(a|b) for recognizing (a|b) as
[Figure: 0 --ε--> 1; 0 --ε--> 3; 1 --a--> 2; 3 --b--> 4; 2 --ε--> 5; 4 --ε--> 5]
Using Rule 2, we get the NFA Na for recognizing the second instance of a as
Na: 7 --a--> 8
Using Rule 5, we get the NFA Na∗ for recognizing a∗ as
[Figure: 6 --ε--> 7; 7 --a--> 8; 8 --ε--> 9; 6 --ε--> 9; 8 --ε--> 7]
Using Rule 3, we get the NFA N(a|b)a∗ for recognizing (a|b)a∗
[Figure: N(a|b) and Na∗ joined by the ε-move 5 --ε--> 6; the start state is 0 and the final state is 9]
Using Rule 2, we get the NFA Nb for recognizing the second instance of b as
Nb: 10 --b--> 11
Finally, using Rule 3, we get the NFA N(a|b)a∗b for recognizing (a|b)a∗b as
[Figure: N(a|b)a∗ and Nb joined by the ε-move 9 --ε--> 10; the start state is 0 and the final state is 11]
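The rules compose mechanically, so the construction can be sketched in a few lines. The class below is an illustrative sketch only: the name ThompsonNfa, the {start, final} pair used to represent a fragment, and the encoding of ε as the character with code 0 are assumptions, and the state numbers it assigns differ from those in the figures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ThompsonNfa {
    static final char EPS = 0;                    // stands for ε in this sketch
    final List<int[]> moves = new ArrayList<>();  // each move is {from, symbol, to}
    int states = 0;                               // number of states created so far

    int[] symbol(char a) {                        // Rule 2: start --a--> final
        int[] f = {states++, states++};
        moves.add(new int[] {f[0], a, f[1]});
        return f;
    }

    int[] concat(int[] r, int[] s) {              // Rule 3: final of Nr --ε--> start of Ns
        moves.add(new int[] {r[1], EPS, s[0]});
        return new int[] {r[0], s[1]};
    }

    int[] alt(int[] r, int[] s) {                 // Rule 4: new start and final states
        int[] f = {states++, states++};
        moves.add(new int[] {f[0], EPS, r[0]});
        moves.add(new int[] {f[0], EPS, s[0]});
        moves.add(new int[] {r[1], EPS, f[1]});
        moves.add(new int[] {s[1], EPS, f[1]});
        return f;
    }

    int[] star(int[] r) {                         // Rule 5
        int[] f = {states++, states++};
        moves.add(new int[] {f[0], EPS, r[0]});   // enter Nr
        moves.add(new int[] {r[1], EPS, f[1]});   // leave Nr
        moves.add(new int[] {f[0], EPS, f[1]});   // zero repetitions
        moves.add(new int[] {r[1], EPS, r[0]});   // repeat Nr
        return f;
    }

    Set<Integer> closure(Set<Integer> from) {     // ε-closure of a set of states
        Deque<Integer> stack = new ArrayDeque<>(from);
        Set<Integer> result = new TreeSet<>(from);
        while (!stack.isEmpty()) {
            int r = stack.pop();
            for (int[] m : moves) {
                if (m[0] == r && m[1] == EPS && result.add(m[2])) stack.push(m[2]);
            }
        }
        return result;
    }

    boolean accepts(int[] nfa, String input) {    // simulate by tracking sets of states
        Set<Integer> current = closure(Set.of(nfa[0]));
        for (char ch : input.toCharArray()) {
            Set<Integer> next = new TreeSet<>();
            for (int[] m : moves) {
                if (ch != EPS && m[1] == ch && current.contains(m[0])) next.add(m[2]);
            }
            current = closure(next);
        }
        return current.contains(nfa[1]);
    }
}
```

With t = new ThompsonNfa(), the expression (a|b)a∗b is built as t.concat(t.concat(t.alt(t.symbol('a'), t.symbol('b')), t.star(t.symbol('a'))), t.symbol('b')), which creates twelve states, the same count as in the worked example.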
NFA to DFA
For any NFA, there is an equivalent DFA that can be constructed using the powerset (or subset) construction procedure
The DFA is always in a state that simulates all the states that the NFA could possibly be in, having scanned the same portion of the input
The computation of all states reachable from a given state s based on ε-moves alone is called taking the ε-closure of that state
The ε-closure(s) for a state s includes s and all states reachable from s using ε-moves alone, ie, ε-closure(s) = {s} ∪ {r ∈ S | there is a path of only ε-moves from s to r}
The ε-closure(S) for a set of states S includes S and all states reachable from any state s ∈ S using ε-moves alone
Algorithm ε-closure(S) for a set of states S
Input: a set of states S
Output: ε-closure(S)
Stack P.addAll(S) // a stack containing all states in S
Set C.addAll(S) // the closure initially contains the states in S
while !P.empty() do
    r ← P.pop()
    for s in m(r, ε) do
        if s ∉ C then
            P.push(s)
            C.add(s)
        end if
    end for
end while
return C
Algorithm ε-closure(s) for a state s
Input: a state s
Output: ε-closure(s)
Set S.add(s) // S = {s}
return ε-closure(S)
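The stack-based algorithm maps closely onto Java collections. The sketch below applies it to the ε-moves of the NFA N(a|b)a∗b built earlier; the class name EpsilonClosure is an assumption, and the edge list is read off the construction steps (Rules 3, 4, and 5) rather than given explicitly in the text.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class EpsilonClosure {
    // ε-moves of N(a|b)a*b: state -> states reachable by a single ε-move.
    static final Map<Integer, List<Integer>> EPSILON_MOVES = Map.of(
        0, List.of(1, 3),
        2, List.of(5),
        4, List.of(5),
        5, List.of(6),
        6, List.of(7, 9),
        8, List.of(7, 9),
        9, List.of(10));

    // ε-closure(S): S plus every state reachable from S by ε-moves alone.
    static Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> stack = new ArrayDeque<>(states); // P in the algorithm
        Set<Integer> closure = new TreeSet<>(states);    // C in the algorithm
        while (!stack.isEmpty()) {
            int r = stack.pop();
            for (int s : EPSILON_MOVES.getOrDefault(r, List.of())) {
                if (closure.add(s)) { // add returns false if s was already in C
                    stack.push(s);
                }
            }
        }
        return closure;
    }

    // ε-closure(s) for a single state s.
    static Set<Integer> closure(int state) {
        return closure(Set.of(state));
    }
}
```

For example, closure(0) gives {0, 1, 3} and closure(2) gives {2, 5, 6, 7, 9, 10}.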
As an example, let’s convert the NFA N(a|b)a∗b to a DFA
[Figure: the NFA N(a|b)a∗b constructed earlier, with states 0 through 11, start state 0, and final state 11]
r                               a    m(r, a)
{0, 1, 3} = 0 (start state)     a    {2, 5, 6, 7, 9, 10} = 1
0                               b    {4, 5, 6, 7, 9, 10} = 2
1                               a    {7, 8, 9, 10} = 3
1                               b    {11} = 4 (accept state)
2                               a    3
2                               b    4
3                               a    3
3                               b    4
The DFA D(a|b)a∗b for recognizing (a|b)a∗b
[Figure: states 0 = {0, 1, 3}, 1 = {2, 5, 6, 7, 9, 10}, 2 = {4, 5, 6, 7, 9, 10}, 3 = {7, 8, 9, 10}, and 4 = {11}; moves 0 --a--> 1, 0 --b--> 2, 1 --a--> 3, 1 --b--> 4, 2 --a--> 3, 2 --b--> 4, 3 --a--> 3, 3 --b--> 4]
In the DFA, for a state r and an input symbol a, if there is no move m(r, a) = s defined, we invent a special dead state d (usually denoted φ), such that m(r, a) = d
Algorithm NFA to DFA construction
Input: an NFA N = (Σ, S, s0, M, F)
Output: an equivalent DFA D = (Σ, SD, sD0, MD, FD)
Set sD0 ← ε-closure(s0)
Set SD.add(sD0)
Moves MD
Stack stk.push(sD0)
i ← 0
while !stk.empty() do
    r ← stk.pop()
    for a in Σ do
        sD(i+1) ← ε-closure(m(r, a))
        if sD(i+1) ≠ {} then
            if sD(i+1) ∉ SD then
                SD.add(sD(i+1)) // We have a new state
                stk.push(sD(i+1))
                i ← i + 1
                MD.add(i) // the move from r on a goes to state i
            else if ∃j, sj ∈ SD and sD(i+1) = sj then
                MD.add(j) // The state already exists
            end if
        end if
    end for
end while
Algorithm NFA to DFA construction (contd.)
Set FD
for sD in SD do
    for s in sD do
        if s ∈ F then
            FD.add(sD)
        end if
    end for
end for
return D = (Σ, SD, sD0, MD, FD)
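A compact Java sketch of the construction, applied to N(a|b)a∗b, is given below. The class name SubsetConstruction, the "state,symbol" string keys for DFA moves, and the encoding of ε as the character with code 0 are all assumptions made for illustration; the NFA moves are those of the worked example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstruction {
    static final char EPS = 0; // stands for ε in this sketch
    // Moves {from, symbol, to} of N(a|b)a*b; state 0 is the start state.
    static final int[][] NFA = {
        {0, EPS, 1}, {0, EPS, 3}, {1, 'a', 2}, {3, 'b', 4}, {2, EPS, 5},
        {4, EPS, 5}, {5, EPS, 6}, {6, EPS, 7}, {6, EPS, 9}, {7, 'a', 8},
        {8, EPS, 7}, {8, EPS, 9}, {9, EPS, 10}, {10, 'b', 11}
    };

    static Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> stack = new ArrayDeque<>(states);
        Set<Integer> result = new TreeSet<>(states);
        while (!stack.isEmpty()) {
            int r = stack.pop();
            for (int[] m : NFA) {
                if (m[0] == r && m[1] == EPS && result.add(m[2])) stack.push(m[2]);
            }
        }
        return result;
    }

    // States reachable from the given NFA states on one input symbol.
    static Set<Integer> move(Set<Integer> states, char symbol) {
        Set<Integer> result = new TreeSet<>();
        for (int[] m : NFA) {
            if (m[1] == symbol && states.contains(m[0])) result.add(m[2]);
        }
        return result;
    }

    // Fills dfaStates with the discovered subsets (index = DFA state number)
    // and returns the DFA moves keyed by "state,symbol".
    static Map<String, Integer> build(List<Set<Integer>> dfaStates) {
        Map<String, Integer> dfaMoves = new LinkedHashMap<>();
        dfaStates.add(closure(Set.of(0)));
        Deque<Integer> stack = new ArrayDeque<>(List.of(0));
        while (!stack.isEmpty()) {
            int r = stack.pop();
            for (char a : new char[] {'a', 'b'}) {
                Set<Integer> target = closure(move(dfaStates.get(r), a));
                if (target.isEmpty()) continue; // no move: implicit dead state
                int index = dfaStates.indexOf(target);
                if (index < 0) {                // we have a new state
                    dfaStates.add(target);
                    index = dfaStates.size() - 1;
                    stack.push(index);
                }
                dfaMoves.put(r + "," + a, index);
            }
        }
        return dfaMoves;
    }
}
```

Calling build with an empty list discovers five subsets, and a subset is a final DFA state exactly when it contains the NFA's final state 11.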
DFA to Minimal DFA
To obtain a smaller but equivalent DFA, we must combine states such that the states in the new DFA are partitions of the states in the original (perhaps larger) DFA
A good strategy is to start with just one or two partitions and then split them as necessary
An obvious first partition has two sets: the set of final states and the set of non-final states
For example, consider the DFA for (a|b)a∗b, with its states initially partitioned into the non-final states and the final states
The two states in this new DFA consist of the start state, {0, 1, 2, 3}, and the final state, {4}
We must make sure that, from a particular partition, each input symbol moves us to an identical partition
From any state in {0, 1, 2, 3}, an a takes us to a state in {0, 1, 2, 3}
m(0, a) = 1
m(1, a) = 3
m(2, a) = 3
m(3, a) = 3
So a does not split {0, 1, 2, 3}
For the symbol b,
m(0, b) = 2
m(1, b) = 4
m(2, b) = 4
m(3, b) = 4
So b splits {0, 1, 2, 3} into {0} and {1, 2, 3}
We are left with a partition into three sets: {0}, {1, 2, 3}, and {4}
We need not worry about {0} and {4} as they contain just one state
We consider {1, 2, 3} to see if it is necessary to split it
m(1, a) = 3
m(2, a) = 3
m(3, a) = 3
m(1, b) = 4
m(2, b) = 4
m(3, b) = 4
There is no further state splitting to be done, so the minimal DFA has three states: {0} moves to {1, 2, 3} on a or b, and {1, 2, 3} moves to itself on a and to the final state {4} on b
Algorithm Minimizing a DFA
Input: a DFA D = (Σ, S, s0, M, F)
Output: a partition of S
Set partition ← {S − F, F} // Start with two sets: the non-final and the final states
// Splitting the states
while splitting occurs do
    for set in partition do
        if set.size() > 1 then
            for a in Σ do
                // Determine if moves from this 'state' force a split
                r ← a state chosen from set
                targetSet ← the set in the partition containing m(r, a)
                Set set1 ← {states t from set, such that m(t, a) ∈ targetSet}
                Set set2 ← {states t from set, such that m(t, a) ∉ targetSet}
                if set2 ≠ {} then
                    // Yes, split the states.
                    replace set in partition by set1 and set2 and break out of the for-loop to continue with the next set in the partition
                end if
            end for
        end if
    end for
end while
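The partitioning can be sketched in Java for the five-state DFA of the running example. The class name DfaMinimizer and the -1 encoding for a missing move (an implicit dead state) are assumptions; for simplicity the sketch rescans the whole partition after each split instead of continuing with the next set, which only changes the order in which splits are found.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class DfaMinimizer {
    // MOVES[state][0] is the move on a, MOVES[state][1] the move on b;
    // -1 means no move is defined.
    static final int[][] MOVES = { {1, 2}, {3, 4}, {3, 4}, {3, 4}, {-1, -1} };
    static final Set<Integer> FINAL = Set.of(4);

    // Index of the set containing state, or -1 for the implicit dead state.
    static int blockOf(List<Set<Integer>> partition, int state) {
        for (int i = 0; i < partition.size(); i++) {
            if (partition.get(i).contains(state)) return i;
        }
        return -1;
    }

    static List<Set<Integer>> minimize() {
        Set<Integer> nonFinal = new TreeSet<>();
        Set<Integer> fin = new TreeSet<>(FINAL);
        for (int s = 0; s < MOVES.length; s++) {
            if (!FINAL.contains(s)) nonFinal.add(s);
        }
        // Start with two sets: the non-final and the final states.
        List<Set<Integer>> partition = new ArrayList<>(List.of(nonFinal, fin));
        boolean splitting = true;
        while (splitting) {
            splitting = false;
            for (int p = 0; p < partition.size() && !splitting; p++) {
                Set<Integer> set = partition.get(p);
                for (int a = 0; a < 2 && set.size() > 1 && !splitting; a++) {
                    int r = set.iterator().next(); // a state chosen from set
                    int target = blockOf(partition, MOVES[r][a]);
                    Set<Integer> set1 = new TreeSet<>();
                    Set<Integer> set2 = new TreeSet<>();
                    for (int t : set) {
                        if (blockOf(partition, MOVES[t][a]) == target) set1.add(t);
                        else set2.add(t);
                    }
                    if (!set2.isEmpty()) {
                        // Split: replace set by set1 and set2, then rescan.
                        partition.set(p, set1);
                        partition.add(set2);
                        splitting = true;
                    }
                }
            }
        }
        return partition;
    }
}
```

On this DFA the symbol b splits {0, 1, 2, 3} once, after which no set splits further.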
Let us run through another example, starting from a regular expression, producing an NFA, then a DFA, and finally a minimal DFA
Consider the regular expression (a|b)∗baa
We apply Thompson's construction procedure to produce an NFA with states 0 through 13 [figure not reproduced]
Using the powerset construction method, we derive a DFA having the following states
s0 = {0, 1, 2, 4, 7, 8}
m(s0, a) : {1, 2, 3, 4, 6, 7, 8} = s1
m(s0, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2
m(s1, a) : {1, 2, 3, 4, 6, 7, 8} = s1
m(s1, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2
m(s2, a) : {1, 2, 3, 4, 6, 7, 8, 11, 12} = s3
m(s2, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2
m(s3, a) : {1, 2, 3, 4, 6, 7, 8, 13} = s4
m(s3, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2
m(s4, a) : {1, 2, 3, 4, 6, 7, 8} = s1
m(s4, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2
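The DFA itself has the five states s0 through s4 and the moves listed above [figure not reproduced]
We use partitioning to produce the minimal DFA; s0 and s1 have identical moves on a and b, so they fall into the same partition [figure not reproduced]
Finally, we re-number the states to produce the equivalent DFA [figure not reproduced]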
JavaCC: a Tool for Generating Scanners
JavaCC (the CC stands for compiler-compiler) is a tool for generating lexical analyzers from regular expressions and parsers from context-free grammars
A lexical grammar specification consists of a set of regular expressions and a set of lexical states
From a particular state, only certain regular expressions may be matched in scanning the input
There is a DEFAULT state in which scanning generally begins; one may specify additional states as required
Scanning a token proceeds by considering all regular expressions in the current state and choosing the one that consumes the greatest number of input characters
After a match, one can specify a state for the scanner to enter; otherwise the scanner stays in the current state
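The longest-match rule can be illustrated in plain Java with java.util.regex. This is only a sketch of the idea and not how JavaCC is implemented: the class name LongestMatch and the four rules are assumptions, and the tie-breaking choice (the rule listed first wins when two rules match the same number of characters) is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatch {
    // Rules in priority order; LinkedHashMap preserves insertion order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ASSIGN", Pattern.compile("="));
        RULES.put("EQUAL", Pattern.compile("=="));
        RULES.put("WHILE", Pattern.compile("while"));
        RULES.put("IDENTIFIER", Pattern.compile("[a-zA-Z_$][a-zA-Z0-9_$]*"));
    }

    // Returns "KIND:lexeme" for the rule consuming the most characters
    // starting at pos, or null if no rule matches there.
    static String next(String input, int pos) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input).region(pos, input.length());
            // lookingAt anchors the match at the start of the region.
            if (m.lookingAt() && m.end() - pos > bestLength) {
                bestLength = m.end() - pos;
                best = rule.getKey() + ":" + m.group();
            }
        }
        return best;
    }
}
```

With these rules, == is scanned as one EQUAL token rather than two ASSIGN tokens, and whilex is scanned as an identifier because the identifier rule consumes more characters than the keyword rule.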
There are four kinds of regular expressions that determine what happens when the regular expression has been matched:
SKIP: throws away the matched string
MORE: continues to the next state, taking the matched string along
TOKEN: creates a token from the matched string and returns it to the parser
SPECIAL_TOKEN: creates a special token that does not participate in the parsing
For example, a SKIP can be used for ignoring white space
SKIP: {" "|"\t"|"\n"|"\r"|"\f"}
We can deal with single-line comments with the following regular expressions
MORE: { "//": IN_SINGLE_LINE_COMMENT }
<IN_SINGLE_LINE_COMMENT>
SPECIAL_TOKEN: { <SINGLE_LINE_COMMENT: "\n"|"\r"|"\r\n"> : DEFAULT }
<IN_SINGLE_LINE_COMMENT>
MORE: { < ~[] > }
An alternative regular expression dealing with single-line comments
SPECIAL_TOKEN: {
<SINGLE_LINE_COMMENT: "//" (~["\n","\r"])* ("\n"|"\r"|"\r\n")>
}
Reserved words and symbols are specified by simply spelling them out; for example
TOKEN: {
< ABSTRACT: "abstract" >
| < BOOLEAN: "boolean" >
...
| < COMMA: "," >
| < DOT: "." >
}
A token for scanning identifiers
TOKEN: {
    < IDENTIFIER: (<LETTER>|"_"|"$") (<LETTER>|<DIGIT>|"_"|"$")* >
  | < #LETTER: ["a"-"z","A"-"Z"] >
  | < #DIGIT: ["0"-"9"] >
}
A token for scanning literals
TOKEN: {
    < INT_LITERAL: ("0" | <NON_ZERO_DIGIT> (<DIGIT>)*) >
  | < #NON_ZERO_DIGIT: ["1"-"9"] >
  | < CHAR_LITERAL: "'" (<ESC> | ~["'","\\","\n","\r"]) "'" >
  | < STRING_LITERAL: "\"" (<ESC> | ~["\"","\\","\n","\r"])* "\"" >
  | < #ESC: "\\" ["n","t","b","r","f","\\","'","\""] >
}
JavaCC takes a specification of the lexical syntax and produces several Java files, one of which is TokenManager.java, a program that implements a state machine; this is our scanner
The lexical specification for j-- is contained in $j/j--/src/jminusminus/j--.jj