The Input
- Read string input
  - Might be a sequence of characters (Unix)
  - Might be a sequence of lines (VMS)
- Character set
  - ASCII
  - ISO Latin-1
  - ISO 10646 (16-bit = Unicode)
  - Others (EBCDIC, JIS, etc.)
The Output
- A series of tokens
  - Punctuation:  ( ) ; , [ ]
  - Operators:  + - ** :=
  - Keywords:  begin end if
  - Identifiers:  Square_Root
  - String literals:  "hello this is a string"
  - Character literals:  'x'
  - Numeric literals:  123  4_5.23e+2  16#ac#
Free Form vs Fixed Form
- Free form languages
  - White space does not matter
    - Tabs, spaces, new lines, carriage returns
  - Only the ordering of tokens is important
- Fixed format languages
  - Layout is critical
    - Fortran: label in cols 1-6
    - COBOL: area A, area B
  - Lexical analyzer must worry about layout
Punctuation
- Typically individual special characters
  - Such as  + -
- Lexical analyzer attaches no meaning to them
- Sometimes double characters
  - E.g. (* treated as a kind of bracket
- Returned just as the identity of the token
  - And perhaps its location
    - For error messages and debugging purposes
Operators
- Like punctuation
  - No real difference for the lexical analyzer
- Typically single or double special characters
  - Operators:  + -
  - Operations:  :=
- Returned just as the identity of the token
  - And perhaps its location
Keywords
- Reserved identifiers
  - E.g. BEGIN END in Pascal, if in C
- May be distinguished from identifiers
  - E.g. the keyword mode vs the identifier mode in Algol-68
- Returned just as the token identity
  - With possible location information
- Unreserved keywords (e.g. PL/1)
  - Handled as identifiers (parser distinguishes)
Identifiers
- Rules differ between languages
  - Length, allowed characters, separators
- Need to build a table
  - So that a second occurrence of junk1 is recognized as the same junk1
  - Typical structure: hash table
- Lexical analyzer returns the token type
  - And a key to the table entry
  - Table entry includes location information
More on Identifier Tables
- Most common structure is a hash table
  - With a fixed number of headers
  - Chain according to hash code
  - Serial search on one chain
  - Hash code computed from the characters
  - No hash code is perfect!
- Avoid any arbitrary limits
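The structure above can be sketched in C. The names here (`intern`, `hash_name`, `NHEADERS`) are invented for illustration; a real compiler (e.g. GNAT's Namet package, discussed later) adds location data and avoids per-entry allocation overhead:

```c
/* Chained hash table for identifiers: fixed header array,
   serial search along one chain. Sketch only; names invented. */
#include <stdlib.h>
#include <string.h>

#define NHEADERS 512              /* fixed number of chain headers */

struct entry {
    char *name;                   /* identifier text, no length limit */
    int   token_key;              /* key returned to the parser */
    struct entry *next;           /* chain for this hash code */
};

static struct entry *headers[NHEADERS];
static int next_key = 0;

/* Hash code computed from the characters; no hash is perfect. */
static unsigned hash_name(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % NHEADERS;
}

/* Return the table key for an identifier, adding it if new, so
   that a second occurrence of junk1 maps to the same entry. */
int intern(const char *name)
{
    unsigned h = hash_name(name);
    struct entry *e;

    for (e = headers[h]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->token_key;          /* already present */

    e = malloc(sizeof *e);                /* error checks omitted */
    e->name = malloc(strlen(name) + 1);
    strcpy(e->name, name);
    e->token_key = next_key++;
    e->next = headers[h];
    headers[h] = e;
    return e->token_key;
}
```

Note that the header array is fixed but each chain grows without limit, which is what "avoid any arbitrary limits" asks for.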
String Literals
- Text must be stored
  - Actual characters are important
  - Not like identifiers
  - Character set issues
- Table needed
  - Lexical analyzer returns a key to the table
  - May or may not be worth hashing
Character Literals
- Similar issues to string literals
- Lexical analyzer returns
  - Token type
  - Identity of the character
- Note: cannot assume the character set of the host machine; the target's may be different
Numeric Literals
- Also need a table
- Typically record the value
  - E.g. 123 = 0123 = 01_23 (Ada)
- But cannot use a host int for values
  - Because the target may have different characteristics
- Float stuff is much more complex
  - Denormals, correct rounding
  - Very delicate stuff
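A minimal sketch of evaluating an Ada-style based literal such as 16#ac# (seen earlier), with underline separators. The function name is invented, and using `unsigned long long` with an explicit overflow check is only a stand-in: a real compiler (e.g. GNAT's uintp) keeps arbitrary-precision values rather than trusting any host integer type:

```c
/* Evaluate an Ada-style based literal base#digits#, e.g. 16#ac#.
   Returns the value, or -1 on malformed input or overflow. */
#include <ctype.h>
#include <limits.h>

long long based_value(const char *lit)
{
    unsigned long long base = 0, val = 0;
    const char *p = lit;

    while (isdigit((unsigned char)*p))     /* base part, e.g. 16 */
        base = base * 10 + (unsigned long long)(*p++ - '0');
    if (*p++ != '#' || base < 2 || base > 16)
        return -1;

    while (*p != '#') {                    /* digits up to closing # */
        int d;
        if (isdigit((unsigned char)*p))
            d = *p - '0';
        else if (isxdigit((unsigned char)*p))
            d = tolower((unsigned char)*p) - 'a' + 10;
        else if (*p == '_') { p++; continue; }   /* underline separator */
        else
            return -1;                     /* bad character or missing # */
        if (d >= (int)base ||
            val > ((unsigned long long)LLONG_MAX - (unsigned)d) / base)
            return -1;                     /* digit too big, or overflow */
        val = val * base + (unsigned long long)d;
        p++;
    }
    return (long long)val;
}
```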
Handling Comments
- Comments have no effect on the program
  - Can therefore be eliminated by the scanner
  - But may need to be retrieved by tools
- Error detection issues
  - E.g. unclosed comments
- Scanner does not return comments
Case Equivalence
- Some languages have case equivalence
  - Pascal, Ada
- Some do not
  - C, Java
- Lexical analyzer ignores case if needed
  - This_Routine = THIS_RouTine
- Error analysis may need exact casing
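For a case-equivalent language, the comparison above can be sketched as follows (the function name is invented; a real scanner would instead fold the name once before hashing, while keeping the exact casing for error messages):

```c
/* Case-insensitive identifier equality, as needed for Pascal/Ada. */
#include <ctype.h>

int same_ident(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* both strings must end together */
}
```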
Issues to Address
- Speed
  - Lexical analysis can take a lot of time
  - Minimize processing per character
  - I/O is also an issue (read large blocks)
- We compile frequently
  - Compilation time is important
  - Especially during development
General Approach
- Define a set of token codes
  - An enumeration type
  - Or a series of integer definitions
  - These are just codes (no semantics)
- Some codes associated with data
  - E.g. key for identifier table
- May be useful to build a tree node
  - For identifiers, literals, etc.
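The token-code scheme above might look like this in C. All names are invented for illustration; the point is that codes carry no semantics, and only some of them have associated data (a table key):

```c
/* Token codes: pure codes, no semantics. */
enum token_code {
    TOK_PLUS, TOK_MINUS, TOK_ASSIGN,            /* operators   */
    TOK_BEGIN, TOK_END, TOK_IF,                 /* keywords    */
    TOK_IDENT,                                  /* + table key */
    TOK_INT_LIT, TOK_REAL_LIT, TOK_STRING_LIT,  /* + table key */
    TOK_EOF
};

struct token {
    enum token_code code;
    int key;         /* identifier/literal table entry, if any */
    int line, col;   /* location for error messages            */
};

/* Which codes carry a table key with them. */
int token_has_key(enum token_code c)
{
    return c == TOK_IDENT    || c == TOK_INT_LIT ||
           c == TOK_REAL_LIT || c == TOK_STRING_LIT;
}
```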
Interface to Lexical Analyzer
- One approach: convert the entire file to a file of tokens
  - Lexical analyzer is a separate phase
- Alternative: parser calls the lexical analyzer
  - "Get next token" on demand
  - This approach avoids extra I/O
  - Parser builds the tree as we go along
Implementation of Scanner
- Given the input text
  - Generate the required tokens
  - Or provide them token by token on demand
- Before we describe implementations
  - We take this short break
  - To describe the relevant formalisms
Regular Grammars
- Regular grammars have
  - Non-terminals (arbitrary names)
  - Terminals (characters)
- Two forms of rules
  - Non-terminal ::= terminal
  - Non-terminal ::= terminal Non-terminal
- One non-terminal is the start symbol
- Regular (type 3) grammars cannot count
  - No concept of matching nested parens
Regular Grammars (continued)
- E.g. grammar of reals with no exponent:

     REAL    ::= 0 REAL1      (repeat for 1 .. 9)
     REAL1   ::= 0 REAL1      (repeat for 1 .. 9)
     REAL1   ::= . INTEGER
     INTEGER ::= 0 INTEGER    (repeat for 1 .. 9)
     INTEGER ::= 0            (repeat for 1 .. 9)

- Start symbol is REAL
Regular Expressions
- Regular expressions (REs) are defined by:
  - Any terminal character is an RE
  - Alternation:  RE1 | RE2
  - Concatenation:  RE1 RE2
  - Repetition:  RE* (zero or more REs)
- Language of REs = type 3 grammars
  - Regular expressions are more convenient
Specifying REs in Unix Tools
- Single characters:  a  b  c  d  \x
- Alternation:  [bcd]  [b-z]  ab|cd
- Match any character:  .
- Repetition:  x* (zero or more)  y+ (one or more)
- Concatenation:  abc[d-q]
- Optional:  [0-9]+(\.[0-9]*)?
Finite State Machines
- Languages and automata
  - A language is a set of strings
  - An automaton is a machine that determines whether a given string is in the language
- FSMs are automata that recognize regular languages (regular expressions)
Definitions of FSM
- A set of labeled states
- Directed arcs labeled with characters
- One state is the distinguished start state
- A state may be marked as terminal
- Transition from state S1 to S2
  - If and only if there is an arc from S1 to S2
  - Labeled with the next input character (which is eaten)
- A string is recognized if the machine ends up in a terminal state
Building an FSM from a Grammar
- One state for each non-terminal
- A rule of the form

     Nont1 ::= terminal

  generates a transition from S1 to the final state
- A rule of the form

     Nont1 ::= terminal Nont2

  generates a transition from S1 to S2
Building FSMs from REs
- Every RE corresponds to a grammar
- For all regular expressions
  - A natural translation to an FSM exists
- We will not give details of the algorithm here
Non-Deterministic FSM
- A non-deterministic FSM
  - Has at least one state
  - With two arcs to two separate states
  - Labeled with the same character
- Which way to go?
  - Implementation requires backtracking
  - Nasty!
Deterministic FSM
- For all states S
  - For all characters C
  - There is either ONE or NO arc from state S labeled with character C
- Much easier to implement
  - No backtracking
Dealing with ND FSMs
- Construction naturally leads to an ND FSM
- For example, consider the FSM for

     [0-9]+ | [0-9]+\.[0-9]+     (integer or real)

- We naturally get a start state with two sets of 0-9 branches
  - And thus non-determinism
Converting to Deterministic
- There is an algorithm for converting
  - From any ND FSM
  - To an equivalent deterministic FSM
- The algorithm is in the text book
- Example (given in terms of REs):

     [0-9]+ | [0-9]+\.[0-9]+     becomes     [0-9]+(\.[0-9]+)?
Implementing the Scanner
Three methods:
- Completely informal: just write code
- Define tokens using regular expressions
  - Convert REs to an ND finite state machine
  - Convert the ND FSM to a deterministic FSM
  - Program the FSM
- Use an automated program
  - To achieve the above three steps
Ad Hoc Code (forget FSMs)
- Write normal hand code
  - A procedure called Scan
  - Normal coding techniques
- Basically scan over white space and comments until a non-blank character is found
- Base subsequent processing on that character
  - E.g. a colon may be : or :=
  - A / may be an operator or the start of a comment
- Return the token found
- Write aggressively efficient code
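The ad hoc style above can be sketched in C. The token names, the string-based input, and the C-style /* */ comment syntax are all invented for this sketch; note how the : vs := and / vs comment decisions are made by peeking at the next character:

```c
/* Minimal ad hoc scanner: skip blanks and comments, then branch
   on the first significant character. Illustrative names only. */
#include <ctype.h>

enum tok { T_EOF, T_COLON, T_ASSIGN, T_SLASH, T_IDENT, T_NUM, T_OTHER };

static const char *src;                  /* scan position in the input */

void scan_init(const char *s) { src = s; }

enum tok scan(void)
{
    while (*src == ' ' || *src == '\t' || *src == '\n')
        src++;                           /* skip white space */
    if (*src == '\0')
        return T_EOF;

    switch (*src) {
    case ':':                            /* colon may be : or := */
        src++;
        if (*src == '=') { src++; return T_ASSIGN; }
        return T_COLON;
    case '/':                            /* operator, or start of comment? */
        src++;
        if (*src == '*') {               /* consume up to closing star-slash */
            src++;
            while (*src && !(src[0] == '*' && src[1] == '/'))
                src++;
            if (*src) src += 2;
            return scan();               /* comments are not returned */
        }
        return T_SLASH;
    default:
        if (isalpha((unsigned char)*src)) {
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
            return T_IDENT;
        }
        if (isdigit((unsigned char)*src)) {
            while (isdigit((unsigned char)*src))
                src++;
            return T_NUM;
        }
        src++;
        return T_OTHER;
    }
}
```

A real Scan would also record the token's text and location, as discussed earlier.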
Using FSM Formalisms
- Start with a regular grammar or RE
  - Typically found in the language standard
- For example, for Ada (Chapter 2, Lexical Elements):

     digit           ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
     decimal-literal ::= integer [.integer] [exponent]
     integer         ::= digit {[underline] digit}
     exponent        ::= E [+] integer | E - integer
Using FSM Formalisms, cont
- Given REs or a grammar
  - Convert to a finite state machine
  - Convert the ND FSM to a deterministic FSM
- Write a program to recognize tokens
  - Using the deterministic FSM
Implementing FSM (Method 1)
Each state is code of the form:

   <<State1>>
      case Next_Character is
         when 'a'    => goto State3;
         when 'b'    => goto State1;
         when others => End_Of_Token_Processing;
      end case;
   <<State2>>
      ...
Implementing FSM (Method 2)
There is a variable called State:

   loop
      case State is
         when State1 =>
            case Next_Character is
               when 'a'    => State := State3;
               when 'b'    => State := State1;
               when others => End_Token_Processing;
            end case;
         when State2 =>
            ...
      end case;
   end loop;
Implementing FSM (Method 3)
Use a transition table:

   T : array (State, Character) of State;

   while More_Input loop
      Curstate := T (Curstate, Next_Char);
      if Curstate = Error_State then
         ...
      end if;
   end loop;
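A fleshed-out version of Method 3 in C, for the deterministic FSM of [0-9]+(\.[0-9]+)? worked out above. The state names are invented, and character classes stand in for the full character dimension of the table:

```c
/* Table-driven deterministic FSM for [0-9]+(\.[0-9]+)? */
enum state { S_START, S_INT, S_DOT, S_FRAC, S_ERR };
enum cls   { C_DIGIT, C_DOT, C_OTHER };

static enum cls classify(char c)
{
    if (c >= '0' && c <= '9') return C_DIGIT;
    if (c == '.')             return C_DOT;
    return C_OTHER;
}

/* T[state][class] = next state: the transition table of Method 3. */
static const enum state T[5][3] = {
    /* S_START */ { S_INT,  S_ERR, S_ERR },
    /* S_INT   */ { S_INT,  S_DOT, S_ERR },
    /* S_DOT   */ { S_FRAC, S_ERR, S_ERR },
    /* S_FRAC  */ { S_FRAC, S_ERR, S_ERR },
    /* S_ERR   */ { S_ERR,  S_ERR, S_ERR },
};

/* Accept iff we end up in a terminal state (S_INT or S_FRAC). */
int matches(const char *s)
{
    enum state st = S_START;
    while (*s)
        st = T[st][classify(*s++)];
    return st == S_INT || st == S_FRAC;
}
```

The per-character work is a single table lookup, which is why this method supports the "minimize processing per character" goal mentioned earlier.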
Automatic FSM Generation
- Our example: FLEX
  - See home page for manual in HTML
- FLEX is given
  - A set of regular expressions
  - Actions associated with each RE
- It builds a scanner
  - Which matches the REs and executes the corresponding actions
Flex General Format
- Input to flex is a set of rules:

     regexp   actions (C statements)
     regexp   actions (C statements)
     ...

- Flex scans for the longest matching regexp
  - And executes the corresponding actions
An Example of a Flex Scanner

   DIGIT  [0-9]
   ID     [a-z][a-z0-9]*
   %%
   {DIGIT}+ {
      printf ("an integer %s (%d)\n", yytext, atoi (yytext));
   }
   {DIGIT}+"."{DIGIT}* {
      printf ("a float %s (%g)\n", yytext, atof (yytext));
   }
   if|then|begin|end|procedure|function {
      printf ("a keyword: %s\n", yytext);
   }
Flex Example (continued)

   {ID}              printf ("an identifier %s\n", yytext);
   "+"|"-"|"*"|"/"   { printf ("an operator %s\n", yytext); }
   "--".*\n          /* eat Ada-style comment */
   [ \t\n]+          /* eat white space */
   .                 printf ("unrecognized character");
   %%
Assembling the Flex Program

   %{
   #include <math.h>   /* for atof */
   %}

   <<flex text we gave goes here>>

   %%
   main (argc, argv)
   int argc;
   char **argv;
   {
      yyin = fopen (argv[1], "r");
      yylex ();
   }
Running Flex
- flex is a program that is executed
  - The input is as we have given
  - The output is a C program: the running scanner
- For Ada fans
  - Look at aflex (www.adapower.com)
- For C++ fans
  - flex can run in C++ mode
  - Generates appropriate classes
Choice Between Methods?
- Hand-written scanners
  - Typically much faster execution
  - And pretty easy to write
  - And easier for good error recovery
- Flex approach
  - Simple to use
  - Easy to modify the token language
The GNAT Scanner
- Hand written (scn.adb/scn.ads)
- Basically a call does
  - Super quick scan past blanks/comments etc.
  - Big case statement: process based on first character
- Calls special routines
  - Namet.Get_Name for identifiers (hashing)
  - Keywords recognized by a special hash
  - Strings (stringt.ads)
  - Integers (uintp.ads)
  - Reals (ureal.ads)
More on the GNAT Scanner
- Entire source read into memory
  - Single contiguous block
  - Source location is an index into this block
  - Different index range for each source file
- See sinput.adb/ads for source management
- See scans.ads for definitions of tokens
DTL (Dewar Trivial Language)
DTL grammar:

   Program    ::= DECLARE Decls BEGIN Stmts
   Decls      ::= {Decl}*
   Stmts      ::= {Stmt}+
   Type       ::= INTEGER | REAL
   Identifier ::= letter (_{digit}+)*
   Decl       ::= DECLARE identifier : Type
DTL (Continued)

   Integer_Literal ::= {digit}+
   Real_Literal    ::= {digit}+ "." {digit}*
   Stmt            ::= Assignstmt | Ifstmt | Whilestmt
   Assignstmt      ::= Identifier := Expr
   Expr            ::= Literal | (Expr) Op (Expr)
   Op              ::= + | -
   Literal         ::= Integer_Literal | Real_Literal
   Ifstmt          ::= IF Expr Relop Expr THEN Stmts
   Whilestmt       ::= WHILE Expr Relop Expr DO Stmts
   Relop           ::= > | < | >= | <=
DTL Example

   DECLARE
      DECL A_123 : INTEGER
      DECL B : REAL
   BEGIN
      A_123 := 23
      B := 2.4
      WHILE A_123 > (2) + (1) DO
         A_123 := A_123 - 1
ASSIGNMENT TWO
- Write a flex or aflex program
  - Recognize the tokens of a DTL program
  - Print out the tokens in the style of the flex example
- Extra credit
  - Build a hash table for identifiers
  - Output the hash table key
Preprocessors
- Some languages allow preprocessing
- This is a separate step
  - Input is source
  - Output is expanded source
- Can either be done as a separate phase
  - Or embedded into the lexical analyzer
  - Often done as a separate phase
- Need to keep track of source locations