
Lesson 2 Lexical Analysis · 2003-04-08

CS 226/326 Spring 2003 Lesson 2 Lexical Analysis

CS 226/326, Spring 2003

Lesson 2: Lexical Analysis


Lexical Analysis

• Transform source program (a sequence of characters) into a sequence of tokens.

• Lexical structure is specified using regular expressions

• Secondary tasks

1. discard white space and comments

2. record positional attributes (e.g. char positions, line numbers)

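[Figure: source program → lexical analyzer → parser → parse tree. The parser sends "get token" requests to the lexical analyzer, which answers each one with a token.]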


Example Program

    let function g(a:int) = a
    in g(2,"str")
    end

A sample source program in Tiger

What are the tokens?

    LET FUNCTION ID "g" LPAREN ID "a" COLON ID "int" RPAREN EQ
    ID "a" IN ID "g" LPAREN INT "2" COMMA
    STRING "str" RPAREN END
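The token sequence above can be reproduced with a small sketch. It is written in Python rather than ML-Lex, the regular expressions are assumptions chosen to cover only this sample program, and the `\b` word boundaries stand in for the longest-match rule that a generated lexer would apply.

```python
import re

# Token classes for the sample program only. The names follow the slide;
# the regular expressions are assumptions made for this sketch.
TOKEN_SPEC = [
    ("LET", r"let\b"),
    ("FUNCTION", r"function\b"),
    ("IN", r"in\b"),
    ("END", r"end\b"),
    ("ID", r"[a-z][a-z0-9]*"),
    ("INT", r"[0-9]+"),
    ("STRING", r'"[^"\n]*"'),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("COLON", r":"),
    ("COMMA", r","),
    ("EQ", r"="),
    ("WS", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

# Token classes that carry the matched text as an attribute.
WITH_TEXT = {"ID", "INT", "STRING"}


def tokenize(source):
    """Return a list of tokens: bare names, or (name, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"illegal character at position {pos}")
        kind = m.lastgroup
        text = m.group()
        pos = m.end()
        if kind == "WS":
            continue  # secondary task: discard white space
        if kind == "STRING":
            text = text[1:-1]  # drop the surrounding quotes
        tokens.append((kind, text) if kind in WITH_TEXT else kind)
    return tokens


program = 'let function g(a:int) = a\nin g(2,"str")\nend'
print(tokenize(program))
```

Note that `int` is lexed as an ID: the keyword pattern `in\b` does not match inside it, so the identifier rule takes over.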


Tokens

    Token    Text        Description
    LET      let         keyword LET
    END      end         keyword END
    PLUS     +           arithmetic operator
    LPAREN   (           punctuation
    COLON    :           punctuation
    STRING   "str"       string
    RPAREN   )           punctuation
    INT      46          integer literal
    ID       g, a, int   variables, types
    EQ       =
    EOF                  end of file


Strings

• Alphabet: Σ - a set of basic characters or symbols
  • finite or infinite, but we will only be concerned with finite Σ
  • e.g. printable ASCII characters

• Strings: Σ* - finite sequences of symbols from Σ
  • e.g. ε (the empty string), abc, *?x_2

• Language: L ⊆ Σ* - a set of strings
  • e.g. L = {ε, a, aa, aaa, ...}

• Concatenation: s × t - concatenation of strings s and t
  • e.g. abc × xy = abcxy
  • 〈Σ*, ×, ε〉 is a monoid (a semigroup with identity element ε)

• Product of languages: L1 × L2 = { s×t | s ∈ L1 & t ∈ L2 }
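As a minimal sketch of these definitions, the product of two finite languages can be computed directly with Python sets. The function names and the sample languages `L1` and `L2` are hypothetical and used only for illustration.

```python
EPSILON = ""  # the empty string ε


def concat(s, t):
    """s × t: concatenation of two strings."""
    return s + t


def product(l1, l2):
    """L1 × L2 = { s×t | s ∈ L1 and t ∈ L2 }, for finite languages."""
    return {concat(s, t) for s in l1 for t in l2}


L1 = {EPSILON, "a", "aa"}
L2 = {"b", "xy"}
print(sorted(product(L1, L2)))

# ε is the identity for concatenation, and concatenation is associative.
assert concat("abc", EPSILON) == "abc" == concat(EPSILON, "abc")
assert concat(concat("a", "b"), "c") == concat("a", concat("b", "c"))
```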


Regular Expressions

Regular expressions are a small language for describing languages (i.e. subsets of Σ*).

Regular expressions are defined by the following grammar:

    M ::= a          -- a single symbol (a ∈ Σ)
          M1 | M2    -- alternation
          M1 × M2    -- concatenation (also M1M2)
          ε          -- epsilon
          M*         -- repetition (0 or more times)

Examples:

    (a × b) | ε
    (0 | 1)* × 0
    b*(abb*)*(a|ε)


Regular Expressions

The previous forms of regular expressions are adequate, but for convenience we add some redundant forms that could be defined in terms of the basic ones.

    M ::= ...
          M+       -- repetition (1 or more times)
          M?       -- 0 or 1 occurrence of M
          [a-z]    -- ranges of characters (alternation)
          .        -- any character other than newline (\n)
          "abc"    -- literal sequence of characters

Defs:

    M+    = M M*
    M?    = M | ε
    [a-z] = (a | b | c | ... | z)
    "abc" = a×b×c
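The redundancy of these forms can be checked experimentally. The sketch below uses Python's `re` module, whose notation differs from the slides (concatenation is juxtaposition and an empty alternative plays the role of ε), and compares languages only up to a bounded string length. The helper `language` is a hypothetical name.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """All strings over `alphabet`, up to `max_len` long, that match `pattern` entirely."""
    rx = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if rx.fullmatch("".join(chars))
    }


ALPHABET = "abc"

# M+ = M M*
assert language(r"(ab)+", ALPHABET, 6) == language(r"(ab)(ab)*", ALPHABET, 6)
# M? = M | ε  (the empty alternative stands for ε)
assert language(r"(ab)?", ALPHABET, 6) == language(r"(ab|)", ALPHABET, 6)
# [a-c] = (a | b | c)
assert language(r"[a-c]", ALPHABET, 6) == language(r"(a|b|c)", ALPHABET, 6)
```

Agreement on all strings up to length 6 is evidence, not a proof, that the two forms denote the same language.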


Meaning of Regular Expressions

The meaning of regular expressions is given by a function L from regular expressions (re’s) to languages (subsets of Σ*). L is defined by the equations:

    L(a) = {a}
    L(M1 | M2) = L(M1) ∪ L(M2)
    L(M1 × M2) = L(M1) × L(M2)
    L(ε) = {ε}
    L(M*) = {ε} ∪ (L(M) × L(M*))

Examples:

    L((a × b) | ε) = {ε, ab}
    L((0 | 1)* × 0) = even binary numbers
    L(b*(abb*)*(a|ε)) = strings of a, b with no consecutive a’s
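The equations for L can be turned into a small evaluator. Because L(M*) is usually infinite, this sketch computes only the strings of L(M) up to a length bound n; the tuple encoding of regular expressions and all function names are assumptions made for the example.

```python
def sym(a):
    return ("sym", a)


def alt(m1, m2):
    return ("alt", m1, m2)


def cat(m1, m2):
    return ("cat", m1, m2)


def star(m):
    return ("star", m)


EPS = ("eps",)


def lang(m, n):
    """The strings of L(m) whose length is at most n."""
    tag = m[0]
    if tag == "sym":    # L(a) = {a}
        return {m[1]} if n >= 1 else set()
    if tag == "eps":    # L(ε) = {ε}
        return {""}
    if tag == "alt":    # L(M1 | M2) = L(M1) ∪ L(M2)
        return lang(m[1], n) | lang(m[2], n)
    if tag == "cat":    # L(M1 × M2) = L(M1) × L(M2)
        return {s + t for s in lang(m[1], n) for t in lang(m[2], n)
                if len(s + t) <= n}
    if tag == "star":   # L(M*) = {ε} ∪ (L(M) × L(M*))
        base = lang(m[1], n)
        result = {""}
        while True:     # iterate the equation until nothing new appears
            bigger = result | {s + t for s in base for t in result
                               if len(s + t) <= n}
            if bigger == result:
                return result
            result = bigger
    raise ValueError(f"unknown form: {tag}")


ab_or_eps = alt(cat(sym("a"), sym("b")), EPS)
even = cat(star(alt(sym("0"), sym("1"))), sym("0"))
no_aa = cat(cat(star(sym("b")),
                star(cat(cat(sym("a"), sym("b")), star(sym("b"))))),
            alt(sym("a"), EPS))

print(sorted(lang(ab_or_eps, 3)))
print(sorted(lang(even, 3), key=lambda s: (len(s), s)))
```

The loop for `star` terminates because there are only finitely many strings of length at most n, so the set can grow only finitely often.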


Using R.E.s to Define Tokens

Regular expressions are used to define token classes in a specification of lexical structure:

    if                                    (IF)                  -- if keyword
    [a-z][a-z0-9]*                        (ID(str))             -- identifier
    [0-9]+                                (NUM(str))            -- integer const
    ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   (REAL(str))           -- real const
    ("--"[a-z]*"\n")                      (continue())          -- comment
    (" "|"\t"|"\n")+                      (continue())          -- white space
    .                                     (error();continue())  -- error

Patterns are matched “top-down”, and the longest match is preferred.


Choosing among Multiple Matches

    if               (IF)        -- if keyword
    [a-z][a-z0-9]*   (ID(str))   -- identifier

Consider the string “if8”. The initial segment “if” matches the first r.e., while the whole string matches the second r.e. In this case we choose the longest possible match, recognizing the string as an identifier.

Consider “if 8”. Both the first and second r.e.’s match the initial segment “if”, and no r.e. matches the entire string (or “if ” for that matter). In this case we choose the first matching r.e. and recognize the if keyword.

Summary: the longest match is preferred, and ties are resolved in favor of the earliest rule.
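The disambiguation rule can be sketched in a few lines of Python. This illustrates the rule only, not how ML-Lex implements it; `RULES` and `next_token` are hypothetical names. For these two patterns Python's `re.match` happens to return the longest match of each individual pattern, which is not guaranteed for patterns containing alternation.

```python
import re

# (pattern, token class) in priority order, as in the specification.
RULES = [
    (re.compile(r"if"), "IF"),
    (re.compile(r"[a-z][a-z0-9]*"), "ID"),
]


def next_token(text, pos=0):
    """Pick the rule with the longest match at `pos`; earlier rules win ties."""
    best = None  # (length, token class)
    for rx, kind in RULES:
        m = rx.match(text, pos)
        # A strictly greater length is required, so a tie keeps the earlier rule.
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), kind)
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    length, kind = best
    return kind, text[pos:pos + length]


print(next_token("if8"))   # ('ID', 'if8')
print(next_token("if 8"))  # ('IF', 'if')
```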


Homework Assignment 1

1. Program 1 (p. 10)
   file: prog1.sml

2. Exercise 1.1(a,b,c) (p. 12)
   file: ex1_1.sml


Finite State Machines

The r.e. recognition problem: for an r.e. M we want to build a machine that scans a string and tells us whether it belongs to L(M). Alternatively, in lexical analysis we want to scan a string and find a (longest) initial segment of the string that belongs to L(M).

re ⇒ nondeterministic finite automaton (NFA)

⇒ deterministic finite automaton (DFA)

⇒ optimization/simplification of the DFA

⇒ transition table + matching engine

⇒ code for a lexical analyzer
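The last two stages of this pipeline can be sketched as a transition table plus a small matching engine that finds the longest initial segment accepted by the DFA. The table below is a hand-written DFA for the NUM and REAL patterns; its state numbering is an assumption of this sketch and need not agree with the diagrams on the later slides.

```python
def char_class(c):
    """Map an input character to a column of the transition table."""
    if "0" <= c <= "9":
        return "digit"
    if c == ".":
        return "dot"
    return "other"


# (state, character class) -> next state; a missing entry means "stuck".
TRANSITIONS = {
    (1, "digit"): 2, (1, "dot"): 4,
    (2, "digit"): 2, (2, "dot"): 3,
    (3, "digit"): 3,
    (4, "digit"): 5,
    (5, "digit"): 5,
}
# Final states and the token class each one recognizes.
FINAL = {2: "NUM", 3: "REAL", 5: "REAL"}


def longest_match(text, pos=0):
    """Return (token class, lexeme) for the longest accepted prefix, or None."""
    state = 1
    last = None  # most recent final state seen, with the text consumed so far
    i = pos
    while i < len(text):
        state = TRANSITIONS.get((state, char_class(text[i])))
        if state is None:
            break  # stuck: fall back to the last final state
        i += 1
        if state in FINAL:
            last = (FINAL[state], text[pos:i])
    return last


print(longest_match("12.5+x"))
print(longest_match("42"))
```

Remembering the last final state, rather than stopping at the first one, is what makes the engine return the longest match.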


Finite State Machines

A finite state machine (finite automaton or FA) over alphabet Σ is a quadruple

M = 〈S, T, i, F〉 where

    S = a finite set of states (usually represented by numbers)
    T = a transition relation: T ⊆ S × Σ × S
    i = an initial state: i ∈ S
    F = a set of final states: F ⊆ S

Graphical representations:

[Figure: graphical notation for a state m ∈ S, a transition 〈m,a,n〉 ∈ T (an arrow labeled a from m to n), the initial state i ∈ S, and a final state f ∈ F]


Deterministic and Nondeterministic FA

A finite automaton M = 〈S, T, i, F〉 is deterministic (a DFA) if for each m ∈ S and a ∈ Σ there is at most one n ∈ S such that 〈m,a,n〉 ∈ T.

Graphically, in a DFA we don’t have any situations of the form:

[Figure: a state m with two outgoing transitions on the same symbol a, one to p and one to q]

If an FA is not deterministic, it is a nondeterministic FA (an NFA). Nondeterministic automata are also formed by introducing ε-transitions -- silent transitions that can be taken without consuming an input symbol.

[Figure: an ε-transition from m to n]
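A minimal sketch of the quadruple and the determinism condition follows. The `FA` record, the use of `None` to stand for ε, and the three sample automata are assumptions made for illustration.

```python
from collections import namedtuple

# M = <S, T, i, F>; T is a set of (m, a, n) triples, with a = None for ε.
FA = namedtuple("FA", ["states", "transitions", "initial", "finals"])
EPSILON = None


def is_deterministic(fa):
    """True if there is no ε-transition and no pair (m, a) with two targets."""
    seen = set()
    for m, a, n in fa.transitions:
        if a is EPSILON or (m, a) in seen:
            return False
        seen.add((m, a))
    return True


# DFA for the keyword "if": 1 -i-> 2 -f-> 3
dfa_if = FA({1, 2, 3}, {(1, "i", 2), (2, "f", 3)}, 1, {3})
# Two transitions on a out of m: nondeterministic.
nfa_fork = FA({"m", "p", "q"}, {("m", "a", "p"), ("m", "a", "q")}, "m", {"p"})
# An ε-transition from m to n: nondeterministic.
nfa_eps = FA({"m", "n"}, {("m", EPSILON, "n")}, "m", {"n"})

print(is_deterministic(dfa_if), is_deterministic(nfa_fork), is_deterministic(nfa_eps))
```

Because `transitions` is a set, seeing the same pair (m, a) twice can only mean two different target states.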


DFAs for Token Classes

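if (IF)

[Figure: DFA with states 1, 2, 3 and transitions 1 -i-> 2, 2 -f-> 3]

[a-z][a-z0-9]* (ID(str))

[Figure: DFA with states 1, 2, a transition 1 -a-z-> 2, and loops on state 2 labeled a-z and 0-9]

[0-9]+ (NUM(str))

[Figure: DFA with states 1, 2, a transition 1 -0-9-> 2, and a loop on state 2 labeled 0-9]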


DFAs for Token Classes

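([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+) (REAL(str))

[Figure: DFA for real constants with states 1-5 and transitions labeled 0-9 and .]

("--"[a-z]*"\n") (continue()) -- comment

[Figure: DFA for comments with states 1-4 and transitions labeled -, a-z and \n]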


DFAs for Token Classes

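(" "|"\t"|"\n")+ (continue()) -- white space

[Figure: DFA with states 1, 2, a transition 1 -ws-> 2, and a loop on state 2 labeled ws, where ws is (" "|"\t"|"\n")]

. (error();continue()) -- error

[Figure: DFA with states 1, 2 and a transition from 1 to 2 labeled "any but \n"]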


Combined DFA

[Figure: combined DFA with states 1-13. Final states are labeled with the token classes ID, IF, NUM, REAL, ws, comment and error; edges are labeled with character classes including i, f, a-z, a-h,j-z, a-e,g-z, 0-9, ., -, \n, ws and other.]


R.E. to NFA

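[Figure: NFA fragments for each form of regular expression: a, ε, M | N, M × N and M*. The composite forms connect the sub-automata for M and N with ε-transitions.]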
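One common way to realize this translation is Thompson's construction, sketched below. The exact shape of the fragments on the slide cannot be recovered from the text, so this is a standard variant rather than necessarily the one drawn there; all names are hypothetical, and `None` stands for ε.

```python
import itertools

EPSILON = None
_fresh = itertools.count(1)


def new_state():
    return next(_fresh)


# Each builder returns (start, accept, transitions); transitions is a set
# of (m, a, n) triples, with a = EPSILON for a silent move.
def nfa_sym(a):
    s, f = new_state(), new_state()
    return s, f, {(s, a, f)}


def nfa_eps():
    s, f = new_state(), new_state()
    return s, f, {(s, EPSILON, f)}


def nfa_alt(m, n):
    (ms, mf, mt), (ns, nf, nt) = m, n
    s, f = new_state(), new_state()
    return s, f, mt | nt | {(s, EPSILON, ms), (s, EPSILON, ns),
                            (mf, EPSILON, f), (nf, EPSILON, f)}


def nfa_cat(m, n):
    (ms, mf, mt), (ns, nf, nt) = m, n
    return ms, nf, mt | nt | {(mf, EPSILON, ns)}


def nfa_star(m):
    ms, mf, mt = m
    s, f = new_state(), new_state()
    return s, f, mt | {(s, EPSILON, ms), (s, EPSILON, f),
                       (mf, EPSILON, ms), (mf, EPSILON, f)}


def closure(states, trans):
    """All states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        m = stack.pop()
        for p, a, q in trans:
            if p == m and a is EPSILON and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def accepts(nfa, text):
    """Simulate the NFA by tracking the set of states it could be in."""
    start, accept, trans = nfa
    current = closure({start}, trans)
    for c in text:
        moved = {q for p, a, q in trans if p in current and a == c}
        current = closure(moved, trans)
    return accept in current


# b*(abb*)*(a|ε); every sub-expression gets its own fresh fragment.
example = nfa_cat(
    nfa_cat(nfa_star(nfa_sym("b")),
            nfa_star(nfa_cat(nfa_cat(nfa_sym("a"), nfa_sym("b")),
                             nfa_star(nfa_sym("b"))))),
    nfa_alt(nfa_sym("a"), nfa_eps()))

print(accepts(example, "bab"), accepts(example, "baab"))
```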


RE to NFA Example

b*(abb*)*(a|ε)

[Figure: NFA for b*(abb*)*(a|ε), with transitions labeled a, b and ε]


NFA to DFA

[Figure: example NFA with states 1-7 and transitions labeled x, y, z and ε. The slides build the equivalent DFA from it step by step.]

1. Start from the NFA's initial state, 1.

2. Take the ε-closure of 1: {1, 2, 3, 4}. This set is the first DFA state.

3. From this DFA state, on x, the NFA moves to state 5.

4. Take the ε-closure of 5: {5, 6, 7}. This set is the DFA state reached on x.

5. On y, the NFA moves to state 6, whose ε-closure {6, 7} is a third DFA state.

6. The z transition is added in the same way.

7. Renumber the DFA states as 1, 2, 3.

[Figure: resulting DFA with states 1, 2, 3 and transitions labeled x, y, z]
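The NFA-to-DFA conversion is the subset construction, sketched below. The exact edges of the slide's NFA are not recoverable from the text, so the edge set here is a hypothetical NFA chosen only so that the ε-closures come out as {1, 2, 3, 4}, {5, 6, 7} and {6, 7}; the choice of 7 as the final state is also an assumption.

```python
EPSILON = None

# Hypothetical NFA as (state, symbol, state) triples; None stands for ε.
NFA_TRANSITIONS = {
    (1, EPSILON, 2), (1, EPSILON, 3), (3, EPSILON, 4),
    (5, EPSILON, 6), (6, EPSILON, 7),
    (2, "x", 5), (4, "y", 6), (5, "z", 6),
}
NFA_INITIAL = 1
NFA_FINALS = {7}


def epsilon_closure(states):
    """All NFA states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        m = stack.pop()
        for p, a, q in NFA_TRANSITIONS:
            if p == m and a is EPSILON and q not in seen:
                seen.add(q)
                stack.append(q)
    return frozenset(seen)


def move(states, symbol):
    """NFA states reachable from `states` by one transition on `symbol`."""
    return {q for p, a, q in NFA_TRANSITIONS if p in states and a == symbol}


def subset_construction():
    """Return (start, transitions, finals); DFA states are sets of NFA states."""
    start = epsilon_closure({NFA_INITIAL})
    alphabet = sorted({a for _, a, _ in NFA_TRANSITIONS if a is not EPSILON})
    dfa = {}  # (DFA state, symbol) -> DFA state
    worklist, seen = [start], {start}
    while worklist:
        current = worklist.pop()
        for symbol in alphabet:
            target = epsilon_closure(move(current, symbol))
            if not target:
                continue  # no transition on this symbol: the DFA is partial
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    # A DFA state is final if it contains a final NFA state.
    finals = {s for s in seen if s & NFA_FINALS}
    return start, dfa, finals


start, dfa, finals = subset_construction()
for (source, symbol), target in sorted(dfa.items(), key=lambda kv: kv[0][1]):
    print(sorted(source), symbol, sorted(target))
```

Each DFA state is a frozenset so that it can be used as a dictionary key; renumbering the sets as 1, 2, 3 gives the compact DFA.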


ML-Lex

[Figure: foo.lex (lexer specification) → ML-Lex → foo.lex.sml (SML code for lexer)]

Specification for token values has to be supplied externally, usually in the form of a Tokens module that defines a token type and a set of functions for building tokens of various classes.


An ML-Lex specification

ML Declarations:

    type lexresult = Tokens.token
    fun eof() = Tokens.EOF(0,0)
    %%

Lex definitions:

    digits=[0-9]+;
    %%

Regular Expressions and Actions:

    if => (Tokens.IF(yypos,yypos+2));
    [a-z][a-z0-9]* => (Tokens.ID(yytext,yypos,yypos+size yytext));
    {digits} => (Tokens.NUM(valOf(Int.fromString yytext),yypos,yypos+size yytext));
    ({digits}"."[0-9]*)|([0-9]*"."{digits}) => (Tokens.REAL(valOf(Real.fromString yytext),yypos,yypos+size yytext));
    ("--"[a-z]*"\n") => (continue());
    (" "|"\n"|"\t") => (continue());
    . => (ErrorMsg.error yypos "illegal character"; continue());

Int.fromString and Real.fromString return option values, so valOf is applied to obtain the int and real that the NUM and REAL constructors expect.


Variables Defined by ML-Lex

ML-Lex defines several variables:

    lex()        recursively call the lexer
    continue()   same, but with %arg
    yytext       the string matched by the current r.e.
    yypos        character position at start of current r.e. match
    yylineno     line number at start of match (if command %count given)


Defining Tokens

    (* ML Declaration of a Tokens module (called a structure in ML): *)

    structure Tokens =
    struct
      type pos = int
      datatype token
        = EOF of pos * pos
        | IF of pos * pos
        | ID of string * pos * pos
        | NUM of int * pos * pos
        | REAL of real * pos * pos
        ...
    end (* structure Tokens *)


Start States

Several different lexing automata can be set up using start states. Additional start states are commonly used for handling comments and strings.

    ML decls...
    %%
    Lex decls...
    %s COMMENT;
    %%
    <INITIAL>if     => (Tokens.IF(yypos,yypos+2));
    <INITIAL>[a-z]+ => (Tokens.ID(yytext,yypos,yypos+size yytext));
    <INITIAL>"(*"   => (YYBEGIN COMMENT; continue());
    <COMMENT>"*)"   => (YYBEGIN INITIAL; continue());
    <COMMENT>.      => (continue());
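The effect of start states can be imitated in a short Python sketch: each rule is tagged with the start state in which it is active, and an action may switch states, much as YYBEGIN does. This illustrates the idea rather than ML-Lex itself. The white-space rule is an addition not present in the specification above, and, as in that specification, `.` does not match a newline inside a comment.

```python
import re

INITIAL, COMMENT = "INITIAL", "COMMENT"

# (start state, pattern, action); an action takes the matched text and its
# position and returns (token or None, next start state).
RULES = [
    (INITIAL, re.compile(r"if"), lambda s, p: (("IF", p, p + 2), INITIAL)),
    (INITIAL, re.compile(r"[a-z]+"), lambda s, p: (("ID", s, p, p + len(s)), INITIAL)),
    (INITIAL, re.compile(r"\(\*"), lambda s, p: (None, COMMENT)),  # YYBEGIN COMMENT
    (INITIAL, re.compile(r"[ \t\n]+"), lambda s, p: (None, INITIAL)),  # added for this sketch
    (COMMENT, re.compile(r"\*\)"), lambda s, p: (None, INITIAL)),  # YYBEGIN INITIAL
    (COMMENT, re.compile(r"."), lambda s, p: (None, COMMENT)),
]


def lex(text):
    """Tokenize `text`, using only the rules of the current start state."""
    tokens, state, pos = [], INITIAL, 0
    while pos < len(text):
        best = None  # (match, action) of the longest match so far
        for rule_state, rx, action in RULES:
            if rule_state != state:
                continue  # rule is not active in this start state
            m = rx.match(text, pos)
            if m and (best is None or m.end() > best[0].end()):
                best = (m, action)
        if best is None:
            raise ValueError(f"illegal character at position {pos}")
        m, action = best
        token, state = action(m.group(), pos)
        pos = m.end()
        if token is not None:
            tokens.append(token)
    return tokens


print(lex("if (* if x *) iffy"))
```

The `if` inside the comment produces no token, because the IF rule is not active while the lexer is in the COMMENT state.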

