Lexical Analysis · 2020-02-16 · Scanning Tokens We separate the lexemes into categories In...

Lexical Analysis

Scanning Tokens

The first step in compiling a program is to break it into tokens (aka lexemes)

Consider the j-- program

Factorial.java

// Computes the factorial of a number recursively.
package pass;

import java.lang.System;

public class Factorial {
    private static int n = 5;

    public static int factorial(int n) {
        if (n <= 0) {
            return 1;
        } else {
            return n * factorial(n - 1);
        }
    }

    public static void main(String[] args) {
        int x = n;
        System.out.println(x + "! = " + factorial(x));
    }
}

For Factorial.java, we want to produce the sequence of tokens package, pass, ;, import, java, ., lang, ., System, ;, public, class, Factorial, {, and so on

Scanning Tokens

We separate the lexemes into categories

In Factorial.java:

package, import, public, class, private, static, int, void, if, else, and return are reserved words

pass, java, lang, System, Factorial, n, factorial, main, String, args, x, out, and println are all identifiers

The token "! = " is a literal, a string literal in this instance; 5, 0, and 1 are integer literals

The rest are operators (eg, *) and separators (eg, ;)

The program that breaks the source program into a sequence of tokens is called a lexical analyzer or a scanner

A scanner may be hand-crafted or it may be generated from a specification consisting of regular expressions

Scanning Tokens

State transition diagrams can be used for describing scanners

A state transition diagram for recognizing identifiers and integers

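[Figure: from start, a letter, _, or $ leads to state id, which loops on letter, digit, _, or $ and otherwise moves to idEnd; a digit 1...9 leads to state int, which loops on digit and otherwise moves to intEnd; 0 leads directly to intEnd.]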

Scanning Tokens

if (isLetter(ch) || ch == '_' || ch == '$') {
    buffer = new StringBuffer();
    while (isLetter(ch) || isDigit(ch) || ch == '_' || ch == '$') {
        buffer.append(ch);
        nextCh();
    }
    return new TokenInfo(IDENTIFIER, buffer.toString(), line);
}
else if (ch == '0') {
    nextCh();
    return new TokenInfo(INT_LITERAL, "0", line);
}
else if (isDigit(ch)) {
    // Start a fresh buffer for the digits of the literal.
    buffer = new StringBuffer();
    while (isDigit(ch)) {
        buffer.append(ch);
        nextCh();
    }
    return new TokenInfo(INT_LITERAL, buffer.toString(), line);
}
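The fragment above depends on scanner state (ch, nextCh(), line, TokenInfo) that is not shown here. As a minimal self-contained sketch of the same logic, the hypothetical class IdIntScanner below works on a String and renders each token as "KIND:lexeme"; the class name, the string rendering, and the choice to silently skip every other character are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone scanner for identifiers and integer literals.
class IdIntScanner {
    private final String input;
    private int pos = 0;

    IdIntScanner(String input) {
        this.input = input;
    }

    private static boolean isIdStart(char c) {
        return Character.isLetter(c) || c == '_' || c == '$';
    }

    private static boolean isIdPart(char c) {
        return isIdStart(c) || Character.isDigit(c);
    }

    // Scans the whole input and returns the tokens as "KIND:lexeme" strings.
    List<String> scan() {
        List<String> tokens = new ArrayList<>();
        while (pos < input.length()) {
            char ch = input.charAt(pos);
            if (isIdStart(ch)) {
                int start = pos;
                while (pos < input.length() && isIdPart(input.charAt(pos))) {
                    pos++;
                }
                tokens.add("IDENTIFIER:" + input.substring(start, pos));
            } else if (ch == '0') {
                // A leading 0 is a complete literal on its own.
                pos++;
                tokens.add("INT_LITERAL:0");
            } else if (Character.isDigit(ch)) {
                int start = pos;
                while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
                    pos++;
                }
                tokens.add("INT_LITERAL:" + input.substring(start, pos));
            } else {
                pos++; // Assumption: everything else is skipped in this sketch.
            }
        }
        return tokens;
    }
}
```

For instance, new IdIntScanner("x1 42 0 _y$").scan() yields IDENTIFIER:x1, INT_LITERAL:42, INT_LITERAL:0, IDENTIFIER:_y$.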

Scanning Tokens

A state transition diagram for recognizing keywords

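[Figure: from start, a letter, _, or $ leads to state id, which loops on letter, digit, _, or $ and then moves to idEnd; from idEnd, the lexeme is a keyword if it is reserved and an identifier otherwise (!reserved).]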

Scanning Tokens

reserved = new Hashtable<String, Integer>();
reserved.put("abstract", ABSTRACT);
reserved.put("boolean", BOOLEAN);
reserved.put("char", CHAR);
...
reserved.put("while", WHILE);

if (isLetter(ch) || ch == '_' || ch == '$') {
    buffer = new StringBuffer();
    while (isLetter(ch) || isDigit(ch) || ch == '_' || ch == '$') {
        buffer.append(ch);
        nextCh();
    }
    String identifier = buffer.toString();
    // A lexeme found in the table is a reserved word; anything else is an identifier.
    if (reserved.containsKey(identifier)) {
        return new TokenInfo(reserved.get(identifier), line);
    }
    else {
        return new TokenInfo(IDENTIFIER, identifier, line);
    }
}
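The table lookup can be tried in isolation. The sketch below is a hypothetical helper, not the j-- code: it maps lexemes to token-kind names held as strings, and it lists only the four reserved words that appear in the fragment above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: decides whether a scanned lexeme is a reserved word.
class ReservedWords {
    private static final Map<String, String> RESERVED = new HashMap<>();

    static {
        // Only a few entries, for illustration; a real table lists every reserved word.
        RESERVED.put("abstract", "ABSTRACT");
        RESERVED.put("boolean", "BOOLEAN");
        RESERVED.put("char", "CHAR");
        RESERVED.put("while", "WHILE");
    }

    // Returns the reserved-word kind, or IDENTIFIER if the lexeme is not reserved.
    static String classify(String lexeme) {
        return RESERVED.getOrDefault(lexeme, "IDENTIFIER");
    }
}
```

Because the whole lexeme is scanned before the lookup, a name such as whileLoop is classified as an identifier even though it begins with a reserved word.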

Scanning Tokens

A state transition diagram for recognizing separators and operators

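[Figure: from start, each of the characters !, =, ;, and * leads to a state of its own; from state =, a second = leads to state ==. Other separators and operators (...) are handled in the same way.]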

Scanning Tokens

switch (ch) {
...
case ';':
    nextCh();
    return new TokenInfo(SEMI, line);
case '=':
    nextCh();
    if (ch == '=') {
        nextCh();
        return new TokenInfo(EQUAL, line);
    }
    else {
        return new TokenInfo(ASSIGN, line);
    }
case '!':
    nextCh();
    return new TokenInfo(LNOT, line);
case '*':
    nextCh();
    return new TokenInfo(STAR, line);
...
}
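The = case relies on one character of lookahead to choose between ASSIGN and EQUAL. A minimal runnable sketch of that idea follows; the class OperatorScanner, its string token names, and the decision to ignore all other characters are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: scans only the operators and separators shown above.
class OperatorScanner {
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char ch = input.charAt(i++);
            switch (ch) {
                case ';':
                    tokens.add("SEMI");
                    break;
                case '=':
                    // Look one character ahead to prefer the longer token ==.
                    if (i < input.length() && input.charAt(i) == '=') {
                        i++;
                        tokens.add("EQUAL");
                    } else {
                        tokens.add("ASSIGN");
                    }
                    break;
                case '!':
                    tokens.add("LNOT");
                    break;
                case '*':
                    tokens.add("STAR");
                    break;
                default:
                    break; // Assumption: other characters are ignored here.
            }
        }
        return tokens;
    }
}
```

On the input ===;!* the sketch produces EQUAL, ASSIGN, SEMI, LNOT, STAR: the first two = characters are consumed together, and the third stands alone.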

Scanning Tokens

A state transition diagram for recognizing whitespace

[Figure: the start state loops on the whitespace characters ' ', '\t', '\f', '\b', '\r', and '\n'.]

while (isWhitespace(ch)) {
    nextCh();
}

Scanning Tokens

A state transition diagram for recognizing comments

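[Figure: the start state loops on whitespace; a / followed by a second / leads to state comment, which loops on any character that is not '\n' and not EOF and returns to start on '\n'; a / followed by anything that is not / is an error.]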

Scanning Tokens

boolean moreWhiteSpace = true;
while (moreWhiteSpace) {
    while (isWhitespace(ch)) {
        nextCh();
    }
    if (ch == '/') {
        nextCh();
        if (ch == '/') {
            while (ch != '\n' && ch != EOFCH) {
                nextCh();
            }
        }
        else {
            reportScannerError("Operator / is not supported in j--.");
        }
    }
    else {
        moreWhiteSpace = false;
    }
}
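The same loop can be written over a String to make it testable on its own. In the sketch below, the class TriviaSkipper and its skip method are hypothetical names; unlike the fragment above, it leaves a lone / in place for the caller instead of reporting an error, and it uses Character.isWhitespace in place of the scanner's own isWhitespace.

```java
// Hypothetical sketch: skips whitespace and // comments in a String.
class TriviaSkipper {
    // Returns the index of the first character at or after pos that is
    // neither whitespace nor part of a single-line comment.
    static int skip(String src, int pos) {
        boolean moreWhiteSpace = true;
        while (moreWhiteSpace) {
            while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
                pos++;
            }
            if (src.startsWith("//", pos)) {
                // Consume the comment up to, but not including, the newline;
                // the newline is whitespace and is skipped on the next pass.
                while (pos < src.length() && src.charAt(pos) != '\n') {
                    pos++;
                }
            } else {
                moreWhiteSpace = false;
            }
        }
        return pos;
    }
}
```

The outer loop is what allows several comments separated by blank lines to be skipped in one call.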

Regular Expressions

A regular expression describes a language of strings over an alphabet Σ, and thus provides a notation for describing patterns of characters in a text

ε (epsilon) describes the language consisting of only the empty string

If a ∈ Σ, then a describes the language L(a) consisting of the string a

If r and s are regular expressions, then their concatenation rs describes the language L(rs) consisting of all strings obtained by concatenating a string from L(r) to a string from L(s)

If r and s are regular expressions, then their alternation r|s describes the language L(r|s) consisting of all strings from L(r) or L(s)

If r is a regular expression, then the repetition (aka the Kleene closure) r∗ describes the language L(r∗) consisting of all strings obtained by concatenating zero or more instances of strings from L(r)

Both r and (r) describe the same language, ie, L(r) = L((r))

Regular Expressions

For example, given an alphabet Σ = {a, b}:

a(a|b)∗ describes the language of non-empty strings of a’s and b’s, beginning with an a

aa|ab|ba|bb describes the language of all two-symbol strings over the alphabet

(a|b)∗ab describes the language of all strings of a’s and b’s, ending in ab

In a programming language such as Java:

Reserved words may be described as abstract | boolean | char | ... | while

Operators may be described as = | == | > | ... | *

Identifiers may be described as ([a-zA-Z] | _ | $)([a-zA-Z0-9] | _ | $)*
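These expressions can be tried out with java.util.regex, whose pattern syntax covers concatenation, alternation, and Kleene closure. This is only a sketch for experimenting: java.util.regex is a backtracking matcher with features beyond regular expressions in the formal sense, and the class and constant names below are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the example expressions written as java.util.regex patterns.
class RegexExamples {
    static final Pattern STARTS_WITH_A = Pattern.compile("a(a|b)*");
    static final Pattern TWO_SYMBOLS = Pattern.compile("aa|ab|ba|bb");
    static final Pattern ENDS_IN_AB = Pattern.compile("(a|b)*ab");
    // $ is a metacharacter in java.util.regex, so it is escaped here.
    static final Pattern IDENTIFIER =
            Pattern.compile("([a-zA-Z]|_|\\$)([a-zA-Z0-9]|_|\\$)*");

    // True if the whole string s belongs to the language described by p.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }
}
```

Matcher.matches() tests the entire string, which corresponds to asking whether the string is in the language, rather than whether it contains a matching substring.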

Finite State Automata

For any language described by a regular expression, there is a state transition diagram, called a finite state automaton, that can recognize strings in the language

A finite state automaton (FSA) F is a quintuple F = (Σ, S, s0, F, M), where:

Σ is the input alphabet

S is a set of states

s0 ∈ S is a special start state

F ⊆ S is a set of final states

M is a set of moves or state transitions of the form m(r, a) = s, where r, s ∈ S and a ∈ Σ

Finite State Automata

For example, consider the regular expression (a|b)a∗b over the alphabet {a, b}

An FSA F that recognizes the language described by the regular expression

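[Figure: FSA with states 0, 1, and 2; moves 0 --a--> 1, 0 --b--> 1, 1 --a--> 1, and 1 --b--> 2; state 0 is the start state and state 2 is the final state.]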

Formally, F = (Σ, S, s0, F,M), where Σ = {a, b}, S = {0, 1, 2}, s0 = 0, F = {2}, and M is

r    a    m(r, a)
0    a    1
0    b    1
1    a    1
1    b    2
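A move table like this translates directly into a two-dimensional array. The following is a minimal sketch of a table-driven recognizer for this FSA; the class name and the DEAD marker for undefined moves are assumptions.

```java
// Hypothetical sketch: table-driven recognizer for (a|b)a*b.
class TableDrivenFsa {
    static final int DEAD = -1; // Marks an undefined move.

    // MOVE[state][symbol], with column 0 for a and column 1 for b.
    static final int[][] MOVE = {
        {1, 1},       // state 0
        {1, 2},       // state 1
        {DEAD, DEAD}  // state 2 (final) has no outgoing moves
    };

    static boolean accepts(String input) {
        int state = 0; // start state s0
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') {
                return false; // Symbol outside the alphabet.
            }
            state = MOVE[state][c - 'a'];
            if (state == DEAD) {
                return false;
            }
        }
        return state == 2; // F = {2}
    }
}
```

For example, accepts("baab") follows 0, 1, 1, 1, 2 and returns true, while accepts("abb") fails because state 2 has no move on b.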

Non-deterministic (NFA) Versus Deterministic Finite State Automata (DFA)

A non-deterministic finite state automaton (NFA) is one that allows:

An ε-move defined on the empty string ε, ie, m(r, ε) = s

More than one move from the same state on the same input symbol a, ie, m(r, a) = s and m(r, a) = t, where s ≠ t

An NFA is said to recognize an input string if, starting in the start state, there exists a set of moves based on the input that takes us into one of the final states

A deterministic finite state automaton (DFA) is one without ε-moves, and there is a unique move from any state on an input symbol a, ie, if m(r, a) = s and m(r, a) = t, then s = t

Non-deterministic (NFA) Versus Deterministic Finite State Automata (DFA)

For example, consider the regular expression a(a|b)∗b over the alphabet {a, b}

An NFA N that recognizes the language described by the regular expression

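[Figure: NFA with states 0, 1, and 2; moves 0 --a--> 1, 1 --ε--> 0, 1 --a--> 1, 1 --b--> 1, and 1 --b--> 2; state 0 is the start state and state 2 is the final state.]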

Formally, N = (Σ, S, s0, F,M) where Σ = {a, b}, S = {0, 1, 2}, s0 = 0, F = {2}, and M is

r    a    m(r, a)
0    a    1
1    ε    0
1    a    1
1    b    1
1    b    2
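The phrase "there exists a set of moves" can be read as a search over the possible choices. The sketch below tries every applicable move recursively for the NFA N above; the class name and the encoding of ε as the character 0 are assumptions. A plain recursive search like this would not terminate on an NFA with a cycle of ε-moves, which N does not have.

```java
// Hypothetical sketch: recognizing a(a|b)*b by exhaustive search over NFA moves.
class NfaBacktracking {
    static final char EPS = 0; // Assumption: character 0 stands for ε.

    // Each move is {from, symbol, to}, following the table for N.
    static final int[][] MOVES = {
        {0, 'a', 1}, {1, EPS, 0}, {1, 'a', 1}, {1, 'b', 1}, {1, 'b', 2}
    };
    static final int FINAL_STATE = 2;

    static boolean accepts(String input) {
        return run(0, input, 0);
    }

    // True if some sequence of moves from state consumes input[pos..] and
    // ends in the final state.
    private static boolean run(int state, String input, int pos) {
        if (pos == input.length() && state == FINAL_STATE) {
            return true;
        }
        for (int[] m : MOVES) {
            if (m[0] != state) {
                continue;
            }
            if (m[1] == EPS) {
                if (run(m[2], input, pos)) {
                    return true;
                }
            } else if (pos < input.length() && input.charAt(pos) == m[1]) {
                if (run(m[2], input, pos + 1)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

The search can take exponential time in the worst case, which is one motivation for converting an NFA to a DFA before scanning.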

Regular Expressions to NFA

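Given any regular expression r, we can construct (using Thompson's construction procedure) an NFA N that recognizes the same language; ie, L(N) = L(r)

(Rule 1) NFA Nr for recognizing L(r = ε)

[Figure: start --ε--> final]

(Rule 2) NFA Nr for recognizing L(r = a)

[Figure: start --a--> final]

(Rule 3) NFA Nrs for recognizing L(rs)

[Figure: Nr followed by Ns, with an ε-move from the final state of Nr to the start state of Ns; the start state of Nr is the start state of Nrs and the final state of Ns is its final state.]

(Rule 4) NFA Nr|s for recognizing L(r|s)

[Figure: a new start state with ε-moves to the start states of Nr and Ns; the final states of Nr and Ns each have an ε-move to a new final state.]

(Rule 5) NFA Nr∗ for recognizing L(r∗)

[Figure: a new start state with ε-moves to the start state of Nr and to a new final state; the final state of Nr has ε-moves back to the start state of Nr and to the new final state.]

(Rule 6) NFA Nr for recognizing L(r) also recognizes L((r))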


As an example, let's construct an NFA for the regular expression (a|b)a∗b, proceeding from left to right

Using Rule 2, we get the NFAs Na and Nb for recognizing a and b as

[Figure: Na: 1 --a--> 2; Nb: 3 --b--> 4]

Using Rules 4 and 6, we get the NFA N(a|b) for recognizing (a|b) as

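[Figure: 0 --ε--> 1, 0 --ε--> 3, 1 --a--> 2, 3 --b--> 4, 2 --ε--> 5, 4 --ε--> 5; state 0 is the start state and state 5 is the final state.]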

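Using Rule 2, we get the NFA Na for recognizing the second instance of a as

[Figure: 7 --a--> 8]

Using Rule 5, we get the NFA Na∗ for recognizing a∗ as

[Figure: 6 --ε--> 7, 7 --a--> 8, 8 --ε--> 9, 6 --ε--> 9, 8 --ε--> 7; state 6 is the start state and state 9 is the final state.]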

Using Rule 3, we get the NFA N(a|b)a∗ for recognizing (a|b)a∗

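[Figure: 0 --ε--> 1, 0 --ε--> 3, 1 --a--> 2, 3 --b--> 4, 2 --ε--> 5, 4 --ε--> 5, 5 --ε--> 6, 6 --ε--> 7, 7 --a--> 8, 8 --ε--> 9, 6 --ε--> 9, 8 --ε--> 7; state 0 is the start state and state 9 is the final state.]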

Using Rule 2, we get the NFA Nb for recognizing the second instance of b as

[Figure: 10 --b--> 11]

Finally, using Rule 3, we get the NFA N(a|b)a∗b for recognizing (a|b)a∗b as

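[Figure: 0 --ε--> 1, 0 --ε--> 3, 1 --a--> 2, 3 --b--> 4, 2 --ε--> 5, 4 --ε--> 5, 5 --ε--> 6, 6 --ε--> 7, 7 --a--> 8, 8 --ε--> 9, 6 --ε--> 9, 8 --ε--> 7, 9 --ε--> 10, 10 --b--> 11; state 0 is the start state and state 11 is the final state.]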
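Rules 2 to 5 can be sketched as small builder methods that each return an NFA fragment as a {start, final} pair. The class Thompson below is a hypothetical illustration, not the procedure's canonical form: it numbers states in creation order, so the state numbers differ from those in the figures, and the accepts method is included only so that the result can be checked.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of Thompson's construction; a fragment is {start, final}.
class Thompson {
    static final char EPS = 0;                      // Assumption: 0 stands for ε.
    final List<int[]> moves = new ArrayList<>();    // Each move is {from, symbol, to}.
    private int nextState = 0;

    int stateCount() { return nextState; }

    int[] symbol(char a) {                          // Rule 2
        int s = nextState++, f = nextState++;
        moves.add(new int[] {s, a, f});
        return new int[] {s, f};
    }

    int[] concat(int[] r, int[] s) {                // Rule 3
        moves.add(new int[] {r[1], EPS, s[0]});
        return new int[] {r[0], s[1]};
    }

    int[] alt(int[] r, int[] s) {                   // Rule 4
        int start = nextState++, fin = nextState++;
        moves.add(new int[] {start, EPS, r[0]});
        moves.add(new int[] {start, EPS, s[0]});
        moves.add(new int[] {r[1], EPS, fin});
        moves.add(new int[] {s[1], EPS, fin});
        return new int[] {start, fin};
    }

    int[] star(int[] r) {                           // Rule 5
        int start = nextState++, fin = nextState++;
        moves.add(new int[] {start, EPS, r[0]});
        moves.add(new int[] {start, EPS, fin});
        moves.add(new int[] {r[1], EPS, r[0]});
        moves.add(new int[] {r[1], EPS, fin});
        return new int[] {start, fin};
    }

    // Simulates the NFA by tracking the set of states it could be in.
    boolean accepts(int[] nfa, String input) {
        Set<Integer> current = closure(Collections.singleton(nfa[0]));
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int[] m : moves) {
                if (m[1] == c && current.contains(m[0])) next.add(m[2]);
            }
            current = closure(next);
        }
        return current.contains(nfa[1]);
    }

    // Adds states reachable by ε-moves until nothing changes.
    private Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] m : moves) {
                if (m[1] == EPS && result.contains(m[0]) && result.add(m[2])) changed = true;
            }
        }
        return result;
    }
}
```

Building (a|b)a∗b as t.concat(t.concat(t.alt(t.symbol('a'), t.symbol('b')), t.star(t.symbol('a'))), t.symbol('b')) gives 12 states and 14 moves, the same counts as the NFA constructed step by step above.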

NFA to DFA

For any NFA, there is an equivalent DFA that can be constructed using the powerset (or subset) construction procedure

The DFA is always in a state that simulates all the states that the NFA could be in, having scanned the same portion of the input

The computation of all states reachable from a given state s based on ε-moves alone is called taking the ε-closure of that state

The ε-closure(s) for a state s includes s and all states reachable from s using ε-moves alone, ie, ε-closure(s) = {s} ∪ {r ∈ S | there is a path of only ε-moves from s to r}

The ε-closure(S) for a set of states S includes S and all states reachable from any state s ∈ S using ε-moves alone

NFA to DFA

Algorithm ε-closure(S) for a set of states S

Input: a set of states S
Output: ε-closure(S)

Stack P.addAll(S)   // a stack containing all states in S
Set C.addAll(S)     // the closure initially contains the states in S
while !P.empty() do
    r ← P.pop()
    for s in m(r, ε) do
        if s ∉ C then
            P.push(s)
            C.add(s)
        end if
    end for
end while
return C

NFA to DFA

Algorithm ε-closure(s) for a state s

Input: a state s
Output: ε-closure(s)

Set S.add(s)   // S = {s}
return ε-closure(S)
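The two algorithms can be sketched in Java as follows, using the ε-moves of the NFA for (a|b)a∗b built earlier (states 0 to 11). The class name and the array encoding of the ε-moves are assumptions; a TreeSet is used only so that results print in sorted order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the ε-closure algorithms.
class EpsilonClosure {
    // EPS_MOVES[r] lists the states reachable from r by a single ε-move.
    static final int[][] EPS_MOVES = {
        {1, 3}, {}, {5}, {}, {5}, {6}, {7, 9}, {}, {7, 9}, {10}, {}, {}
    };

    // ε-closure(S) for a set of states S.
    static Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> stack = new ArrayDeque<>(states); // P
        Set<Integer> closure = new TreeSet<>(states);    // C
        while (!stack.isEmpty()) {
            int r = stack.pop();
            for (int s : EPS_MOVES[r]) {
                // add() returns true only if s was not already in the closure.
                if (closure.add(s)) {
                    stack.push(s);
                }
            }
        }
        return closure;
    }

    // ε-closure(s) for a single state s.
    static Set<Integer> closure(int state) {
        Set<Integer> single = new TreeSet<>();
        single.add(state);
        return closure(single);
    }
}
```

With these moves, closure(0) is {0, 1, 3} and closure(2) is {2, 5, 6, 7, 9, 10}.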

NFA to DFA

As an example, let’s convert the NFA N(a|b)a∗b to a DFA

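[Figure: the NFA N(a|b)a∗b, with moves 0 --ε--> 1, 0 --ε--> 3, 1 --a--> 2, 3 --b--> 4, 2 --ε--> 5, 4 --ε--> 5, 5 --ε--> 6, 6 --ε--> 7, 7 --a--> 8, 8 --ε--> 9, 6 --ε--> 9, 8 --ε--> 7, 9 --ε--> 10, 10 --b--> 11]

r                              a    m(r, a)
{0, 1, 3} = 0 (start state)    a    {2, 5, 6, 7, 9, 10} = 1
0                              b    {4, 5, 6, 7, 9, 10} = 2
1                              a    {7, 8, 9, 10} = 3
1                              b    {11} = 4 (accept state)
2                              a    3
2                              b    4
3                              a    3
3                              b    4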

NFA to DFA

The DFA D(a|b)a∗b for recognizing (a|b)a∗b

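[Figure: DFA with states 0 = {0, 1, 3}, 1 = {2, 5, 6, 7, 9, 10}, 2 = {4, 5, 6, 7, 9, 10}, 3 = {7, 8, 9, 10}, and 4 = {11}; moves 0 --a--> 1, 0 --b--> 2, 1 --a--> 3, 1 --b--> 4, 2 --a--> 3, 2 --b--> 4, 3 --a--> 3, 3 --b--> 4; state 0 is the start state and state 4 is the accept state.]

In the DFA, for a state r and an input symbol a, if there is no move m(r, a) = s defined, we invent a special dead state d (usually denoted φ), such that m(r, a) = d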

NFA to DFA

Algorithm NFA to DFA construction

Input: an NFA N = (Σ, S, s0, F, M)
Output: an equivalent DFA D = (Σ, SD, sD0, FD, MD)

Set sD0 ← ε-closure(s0)
Set SD.add(sD0)
Moves MD
Stack stk.push(sD0)
i ← 0
while !stk.empty() do
    r ← stk.pop()
    for a in Σ do
        sDi+1 ← ε-closure(m(r, a))
        if sDi+1 ≠ {} then
            if sDi+1 ∉ SD then
                SD.add(sDi+1)   // We have a new state
                stk.push(sDi+1)
                i ← i + 1
                MD.add(i)       // Record the move from r on a to the new state i
            else if ∃j, sj ∈ SD and sDi+1 = sj then
                MD.add(j)       // The state already exists
            end if
        end if
    end for
end while

NFA to DFA

Algorithm NFA to DFA construction (contd.)

Set FD
for sD in SD do
    for s in sD do
        if s ∈ F then
            FD.add(sD)
        end if
    end for
end for
return D = (Σ, SD, sD0, FD, MD)
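A compact Java sketch of the construction, applied to the NFA for (a|b)a∗b with the state numbering used in the example, is given below. The class name, the encoding of ε as the character 0, and the use of "r,a" strings as keys for the DFA moves are assumptions made to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of the powerset (subset) construction.
class SubsetConstruction {
    static final char EPS = 0; // Assumption: 0 stands for ε.
    static final int NFA_FINAL = 11;
    // NFA moves {from, symbol, to} for (a|b)a*b.
    static final int[][] MOVES = {
        {0, EPS, 1}, {0, EPS, 3}, {1, 'a', 2}, {3, 'b', 4}, {2, EPS, 5},
        {4, EPS, 5}, {5, EPS, 6}, {6, EPS, 7}, {6, EPS, 9}, {7, 'a', 8},
        {8, EPS, 7}, {8, EPS, 9}, {9, EPS, 10}, {10, 'b', 11}
    };

    final List<Set<Integer>> states = new ArrayList<>(); // S_D, in discovery order
    final Map<String, Integer> moves = new TreeMap<>();   // M_D, keyed by "r,a"

    static Set<Integer> closure(Set<Integer> from) {
        Set<Integer> result = new TreeSet<>(from);
        Deque<Integer> stack = new ArrayDeque<>(from);
        while (!stack.isEmpty()) {
            int r = stack.pop();
            for (int[] m : MOVES) {
                if (m[0] == r && m[1] == EPS && result.add(m[2])) stack.push(m[2]);
            }
        }
        return result;
    }

    // All NFA states reachable from the given set on symbol a (no ε-moves).
    static Set<Integer> move(Set<Integer> from, char a) {
        Set<Integer> result = new TreeSet<>();
        for (int[] m : MOVES) {
            if (m[1] == a && from.contains(m[0])) result.add(m[2]);
        }
        return result;
    }

    SubsetConstruction() {
        Set<Integer> start = new TreeSet<>();
        start.add(0);
        states.add(closure(start));
        Deque<Integer> work = new ArrayDeque<>();
        work.push(0);
        while (!work.isEmpty()) {
            int r = work.pop();
            for (char a : new char[] {'a', 'b'}) {
                Set<Integer> target = closure(move(states.get(r), a));
                if (target.isEmpty()) continue;   // No move defined: dead state.
                int index = states.indexOf(target);
                if (index < 0) {                  // We have a new DFA state.
                    states.add(target);
                    index = states.size() - 1;
                    work.push(index);
                }
                moves.put(r + "," + a, index);
            }
        }
    }

    // A DFA state is final if it contains a final state of the NFA.
    boolean isFinal(int dfaState) {
        return states.get(dfaState).contains(NFA_FINAL);
    }
}
```

For this NFA the sketch discovers the five DFA states in the same order as the worked example, so its state indices 0 to 4 coincide with the numbering in the table.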

DFA to Minimal DFA

To obtain a smaller but equivalent DFA, we must combine states such that the states in the new DFA are partitions of the states in the original (perhaps larger) DFA

A good strategy is to start with just one or two partitions and then split them as necessary

An obvious first partition has two sets: the set of final states and the set of non-final states

DFA to Minimal DFA

For example, consider the DFA for (a|b)a∗b, partitioned as follows

The two states in this new DFA consist of the start state, {0, 1, 2, 3}, and the final state, {4}

We must make sure that, from a particular partition, each input symbol moves us to an identical partition

DFA to Minimal DFA

From any state in {0, 1, 2, 3}, an a takes us to a state in {0, 1, 2, 3}

m(0, a) = 1

m(1, a) = 3

m(2, a) = 3

m(3, a) = 3

So a does not split {0, 1, 2, 3}

For the symbol b,

m(0, b) = 2

m(1, b) = 4

m(2, b) = 4

m(3, b) = 4

So b splits {0, 1, 2, 3} into {0} and {1, 2, 3}

DFA to Minimal DFA

We are left with a partition into three sets: {0}, {1, 2, 3} and {4}, as shown below

DFA to Minimal DFA

We need not worry about {0} and {4} as they contain just one state

We consider {1, 2, 3} to see if it is necessary to split it

m(1, a) = 3

m(2, a) = 3

m(3, a) = 3

m(1, b) = 4

m(2, b) = 4

m(3, b) = 4

There is no further state splitting to be done, and we have the following minimal DFA

DFA to Minimal DFA

Algorithm Minimizing a DFA

Input: a DFA D = (Σ, S, s0, F, M)
Output: a partition of S

Set partition ← {S − F, F}   // Start with two sets: the non-final and the final states
// Splitting the states
while splitting occurs do
    for set in partition do
        if set.size() > 1 then
            for a in Σ do
                // Determine if moves from this 'state' force a split
                r ← a state chosen from set
                targetSet ← the set in the partition containing m(r, a)
                Set set1 ← {states t from set, such that m(t, a) ∈ targetSet}
                Set set2 ← {states t from set, such that m(t, a) ∉ targetSet}
                if set2 ≠ {} then
                    // Yes, split the states.
                    replace set in partition by set1 and set2 and break out of the for-loop to continue with the next set in the partition
                end if
            end for
        end if
    end for
end while
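The partitioning can be sketched in Java for the five-state DFA of (a|b)a∗b. The class name and the DEAD marker are assumptions, and the sketch deviates slightly from the pseudocode: after any split it rescans the partition from the beginning instead of continuing with the next set, which is simpler to write and reaches the same final partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of DFA minimization by partition refinement.
class DfaMinimizer {
    static final int DEAD = -1; // Undefined move.
    // MOVE[state][symbol], with column 0 for a and column 1 for b.
    static final int[][] MOVE = {{1, 2}, {3, 4}, {3, 4}, {3, 4}, {DEAD, DEAD}};
    static final int FINAL_STATE = 4;

    // Index of the set containing state, or -1 for the dead state.
    static int blockOf(List<Set<Integer>> partition, int state) {
        for (int i = 0; i < partition.size(); i++) {
            if (partition.get(i).contains(state)) return i;
        }
        return -1;
    }

    static List<Set<Integer>> minimize() {
        Set<Integer> nonFinal = new TreeSet<>();
        Set<Integer> fin = new TreeSet<>();
        for (int s = 0; s < MOVE.length; s++) {
            if (s == FINAL_STATE) fin.add(s); else nonFinal.add(s);
        }
        List<Set<Integer>> partition = new ArrayList<>();
        partition.add(nonFinal);
        partition.add(fin);

        boolean split = true;
        while (split) {
            split = false;
            for (int i = 0; i < partition.size() && !split; i++) {
                Set<Integer> set = partition.get(i);
                if (set.size() <= 1) continue;
                for (int a = 0; a < 2 && !split; a++) {
                    int r = set.iterator().next(); // a state chosen from set
                    int target = blockOf(partition, MOVE[r][a]);
                    Set<Integer> set1 = new TreeSet<>();
                    Set<Integer> set2 = new TreeSet<>();
                    for (int t : set) {
                        if (blockOf(partition, MOVE[t][a]) == target) set1.add(t);
                        else set2.add(t);
                    }
                    if (!set2.isEmpty()) { // The moves on a force a split.
                        partition.set(i, set1);
                        partition.add(set2);
                        split = true;
                    }
                }
            }
        }
        return partition;
    }
}
```

For this DFA the result consists of the three sets {0}, {1, 2, 3}, and {4}, in agreement with the worked example.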

DFA to Minimal DFA

Let us run through another example, starting from a regular expression, producing an NFA, then a DFA, and finally a minimal DFA

Consider the regular expression (a|b)∗baa

We apply Thompson's construction procedure to produce the following NFA

DFA to Minimal DFA

Using the powerset construction method, we derive a DFA having the following states

s0 = {0, 1, 2, 4, 7, 8}

m(s0, a) : {1, 2, 3, 4, 6, 7, 8} = s1

m(s0, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2

m(s1, a) : {1, 2, 3, 4, 6, 7, 8} = s1

m(s1, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2

m(s2, a) : {1, 2, 3, 4, 6, 7, 8, 11, 12} = s3

m(s2, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2

m(s3, a) : {1, 2, 3, 4, 6, 7, 8, 13} = s4

m(s3, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2

m(s4, a) : {1, 2, 3, 4, 6, 7, 8} = s1

m(s4, b) : {1, 2, 4, 5, 6, 7, 8, 9, 10} = s2

DFA to Minimal DFA

The DFA itself is shown below

DFA to Minimal DFA

We use partitioning to produce the minimal DFA shown below

DFA to Minimal DFA

Finally, we re-number the states to produce the equivalent DFA shown below

JavaCC: a Tool for Generating Scanners

JavaCC (the CC stands for compiler-compiler) is a tool for generating lexical analyzers from regular expressions and parsers from context-free grammars

A lexical grammar specification consists of a set of regular expressions and a set of lexical states

From a particular state, only certain regular expressions may be matched in scanning the input

There is a DEFAULT state in which scanning generally begins; one may specify additional states as required

Scanning a token proceeds by considering all regular expressions in the current state and choosing the one which consumes the greatest number of input characters

After a match, one can specify a state for the scanner to enter; otherwise the scanner stays in the current state


There are four kinds of regular expressions that determine what happens when the regular expression has been matched:

SKIP: throws away the matched string

MORE: continues to the next state, taking the matched string along

TOKEN: creates a token from the matched string and returns it to the parser

SPECIAL_TOKEN: creates a special token that does not participate in the parsing


For example, a SKIP can be used for ignoring white space

SKIP: {" "|"\t"|"\n"|"\r"|"\f"}

We can deal with single-line comments with the following regular expressions

MORE: { "//": IN_SINGLE_LINE_COMMENT }

<IN_SINGLE_LINE_COMMENT>
SPECIAL_TOKEN: { <SINGLE_LINE_COMMENT: "\n"|"\r"|"\r\n"> : DEFAULT }

<IN_SINGLE_LINE_COMMENT>
MORE: { < ~[] > }

An alternative regular expression dealing with single-line comments

SPECIAL_TOKEN: {

<SINGLE_LINE_COMMENT: "//" (~["\n","\r"])* ("\n"|"\r"|"\r\n")>

}

Reserved words and symbols are specified by simply spelling them out; for example

TOKEN: {

< ABSTRACT: "abstract" >

| < BOOLEAN: "boolean" >

...

| < COMMA: "," >

| < DOT: "." >

}


A token for scanning identifiers

TOKEN: {
    < IDENTIFIER: (<LETTER>|"_"|"$") (<LETTER>|<DIGIT>|"_"|"$")* >
  | < #LETTER: ["a"-"z","A"-"Z"] >
  | < #DIGIT: ["0"-"9"] >
}

A token for scanning literals

TOKEN: {
    < INT_LITERAL: ("0" | <NON_ZERO_DIGIT> (<DIGIT>)*) >
  | < #NON_ZERO_DIGIT: ["1"-"9"] >
  | < CHAR_LITERAL: "'" (<ESC> | ~["'","\\","\n","\r"]) "'" >
  | < STRING_LITERAL: "\"" (<ESC> | ~["\"","\\","\n","\r"])* "\"" >
  | < #ESC: "\\" ["n","t","b","r","f","\\","'","\""] >
}

JavaCC takes a specification of the lexical syntax and produces several Java files, one of which is TokenManager.java, a program that implements a state machine; this is our scanner

The lexical specification for j-- is contained in $j/j--/src/jminusminus/j--.jj

Date post:	19-Mar-2020