The Input
- Read string input
  - Might be a sequence of characters (Unix)
  - Might be a sequence of lines (VMS)
- Character set
  - ASCII
  - ISO Latin-1
  - ISO 10646 (16-bit = Unicode)
  - Others (EBCDIC, JIS, etc.)
The Output
- A series of tokens
  - Punctuation:  ( ) ; , [ ]
  - Operators:  + - ** :=
  - Keywords:  begin end if
  - Identifiers:  Square_Root
  - String literals:  "hello this is a string"
  - Character literals:  'x'
  - Numeric literals:  123  4_5.23e+2  16#ac#
Free Form vs Fixed Form
- Free form languages
  - White space does not matter
    - Tabs, spaces, new lines, carriage returns
  - Only the ordering of tokens is important
- Fixed format languages
  - Layout is critical
    - Fortran: label in cols 1-6
    - COBOL: area A, area B
  - Lexical analyzer must worry about layout
Punctuation
- Typically individual special characters
  - Such as  + -
- Lexical analyzer attaches no meaning to them
- Sometimes double characters
  - E.g. (* treated as a kind of bracket
- Returned just as the identity of the token
  - And perhaps its location
    - For error messages and debugging purposes
Operators
- Like punctuation
  - No real difference for the lexical analyzer
- Typically single or double special characters
  - Operators:  + -
  - Operations:  :=
- Returned just as the identity of the token
  - And perhaps its location
Keywords
- Reserved identifiers
  - E.g. BEGIN END in Pascal, if in C
- May be distinguished from identifiers
  - E.g. the keyword mode vs the identifier mode in Algol-68
- Returned just as the token identity
  - With possible location information
- Unreserved keywords (e.g. PL/1)
  - Handled as identifiers (parser distinguishes)
Identifiers
- Rules differ between languages
  - Length, allowed characters, separators
- Need to build a table
  - So that a second occurrence of junk1 is recognized as the same junk1
  - Typical structure: hash table
- Lexical analyzer returns the token type
  - And a key to the table entry
  - Table entry includes location information
More on Identifier Tables
- Most common structure is a hash table
  - With a fixed number of headers
  - Chain according to hash code
  - Serial search on one chain
  - Hash code computed from the characters
  - No hash code is perfect!
- Avoid any arbitrary limits
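The structure above can be sketched in C. The names here (`intern`, `hash_name`, `NHEADERS`) are invented for illustration; a real compiler (e.g. GNAT's Namet package, discussed later) adds location data and avoids per-entry allocation overhead:

```c
/* Chained hash table for identifiers: fixed header array,
   serial search along one chain. Sketch only; names invented. */
#include <stdlib.h>
#include <string.h>

#define NHEADERS 512              /* fixed number of chain headers */

struct entry {
    char *name;                   /* identifier text, no length limit */
    int   token_key;              /* key returned to the parser */
    struct entry *next;           /* chain for this hash code */
};

static struct entry *headers[NHEADERS];
static int next_key = 0;

/* Hash code computed from the characters; no hash is perfect. */
static unsigned hash_name(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % NHEADERS;
}

/* Return the table key for an identifier, adding it if new, so
   that a second occurrence of junk1 maps to the same entry. */
int intern(const char *name)
{
    unsigned h = hash_name(name);
    struct entry *e;

    for (e = headers[h]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->token_key;          /* already present */

    e = malloc(sizeof *e);                /* error checks omitted */
    e->name = malloc(strlen(name) + 1);
    strcpy(e->name, name);
    e->token_key = next_key++;
    e->next = headers[h];
    headers[h] = e;
    return e->token_key;
}
```

Note that the header array is fixed but each chain grows without limit, which is what "avoid any arbitrary limits" asks for.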
String Literals
- Text must be stored
  - Actual characters are important
  - Not like identifiers
  - Character set issues
- Table needed
  - Lexical analyzer returns a key to the table
  - May or may not be worth hashing
Character Literals
- Similar issues to string literals
- Lexical analyzer returns
  - Token type
  - Identity of the character
- Note: cannot assume the character set of the host machine; the target's may be different
Numeric Literals
- Also need a table
- Typically record the value
  - E.g. 123 = 0123 = 01_23 (Ada)
- But cannot use a host int for values
  - Because the target may have different characteristics
- Float stuff is much more complex
  - Denormals, correct rounding
  - Very delicate stuff
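A minimal sketch of evaluating an Ada-style based literal such as 16#ac# (seen earlier), with underline separators. The function name is invented, and using `unsigned long long` with an explicit overflow check is only a stand-in: a real compiler (e.g. GNAT's uintp) keeps arbitrary-precision values rather than trusting any host integer type:

```c
/* Evaluate an Ada-style based literal base#digits#, e.g. 16#ac#.
   Returns the value, or -1 on malformed input or overflow. */
#include <ctype.h>
#include <limits.h>

long long based_value(const char *lit)
{
    unsigned long long base = 0, val = 0;
    const char *p = lit;

    while (isdigit((unsigned char)*p))     /* base part, e.g. 16 */
        base = base * 10 + (unsigned long long)(*p++ - '0');
    if (*p++ != '#' || base < 2 || base > 16)
        return -1;

    while (*p != '#') {                    /* digits up to closing # */
        int d;
        if (isdigit((unsigned char)*p))
            d = *p - '0';
        else if (isxdigit((unsigned char)*p))
            d = tolower((unsigned char)*p) - 'a' + 10;
        else if (*p == '_') { p++; continue; }   /* underline separator */
        else
            return -1;                     /* bad character or missing # */
        if (d >= (int)base ||
            val > ((unsigned long long)LLONG_MAX - (unsigned)d) / base)
            return -1;                     /* digit too big, or overflow */
        val = val * base + (unsigned long long)d;
        p++;
    }
    return (long long)val;
}
```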
Handling Comments
- Comments have no effect on the program
  - Can therefore be eliminated by the scanner
  - But may need to be retrieved by tools
- Error detection issues
  - E.g. unclosed comments
- Scanner does not return comments
Case Equivalence
- Some languages have case equivalence
  - Pascal, Ada
- Some do not
  - C, Java
- Lexical analyzer ignores case if needed
  - This_Routine = THIS_RouTine
- Error analysis may need exact casing
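For a case-equivalent language, the comparison above can be sketched as follows (the function name is invented; a real scanner would instead fold the name once before hashing, while keeping the exact casing for error messages):

```c
/* Case-insensitive identifier equality, as needed for Pascal/Ada. */
#include <ctype.h>

int same_ident(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* both strings must end together */
}
```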
Issues to Address
- Speed
  - Lexical analysis can take a lot of time
  - Minimize processing per character
  - I/O is also an issue (read large blocks)
- We compile frequently
  - Compilation time is important
  - Especially during development
General Approach
- Define a set of token codes
  - An enumeration type
  - Or a series of integer definitions
  - These are just codes (no semantics)
- Some codes associated with data
  - E.g. key for identifier table
- May be useful to build a tree node
  - For identifiers, literals, etc.
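The token-code scheme above might look like this in C. All names are invented for illustration; the point is that codes carry no semantics, and only some of them have associated data (a table key):

```c
/* Token codes: pure codes, no semantics. */
enum token_code {
    TOK_PLUS, TOK_MINUS, TOK_ASSIGN,            /* operators   */
    TOK_BEGIN, TOK_END, TOK_IF,                 /* keywords    */
    TOK_IDENT,                                  /* + table key */
    TOK_INT_LIT, TOK_REAL_LIT, TOK_STRING_LIT,  /* + table key */
    TOK_EOF
};

struct token {
    enum token_code code;
    int key;         /* identifier/literal table entry, if any */
    int line, col;   /* location for error messages            */
};

/* Which codes carry a table key with them. */
int token_has_key(enum token_code c)
{
    return c == TOK_IDENT    || c == TOK_INT_LIT ||
           c == TOK_REAL_LIT || c == TOK_STRING_LIT;
}
```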
Interface to Lexical Analyzer
- One approach: convert the entire file to a file of tokens
  - Lexical analyzer is a separate phase
- Alternative: parser calls the lexical analyzer
  - "Get next token" on demand
  - This approach avoids extra I/O
  - Parser builds the tree as we go along
Implementation of Scanner
- Given the input text
  - Generate the required tokens
  - Or provide them token by token on demand
- Before we describe implementations
  - We take this short break
  - To describe the relevant formalisms
Regular Grammars
- Regular grammars have
  - Non-terminals (arbitrary names)
  - Terminals (characters)
- Two forms of rules
  - Non-terminal ::= terminal
  - Non-terminal ::= terminal Non-terminal
- One non-terminal is the start symbol
- Regular (type 3) grammars cannot count
  - No concept of matching nested parens
Regular Grammars (continued)
- E.g. grammar of reals with no exponent:

     REAL    ::= 0 REAL1      (repeat for 1 .. 9)
     REAL1   ::= 0 REAL1      (repeat for 1 .. 9)
     REAL1   ::= . INTEGER
     INTEGER ::= 0 INTEGER    (repeat for 1 .. 9)
     INTEGER ::= 0            (repeat for 1 .. 9)

- Start symbol is REAL
Regular Expressions
- Regular expressions (REs) are defined by:
  - Any terminal character is an RE
  - Alternation:  RE1 | RE2
  - Concatenation:  RE1 RE2
  - Repetition:  RE* (zero or more REs)
- Language of REs = type 3 grammars
  - Regular expressions are more convenient
Specifying REs in Unix Tools
- Single characters:  a  b  c  d  \x
- Alternation:  [bcd]  [b-z]  ab|cd
- Match any character:  .
- Repetition:  x* (zero or more)  y+ (one or more)
- Concatenation:  abc[d-q]
- Optional:  [0-9]+(\.[0-9]*)?
Finite State Machines
- Languages and automata
  - A language is a set of strings
  - An automaton is a machine that determines whether a given string is in the language
- FSMs are automata that recognize regular languages (regular expressions)
Definitions of FSM
- A set of labeled states
- Directed arcs labeled with characters
- One state is the distinguished start state
- A state may be marked as terminal
- Transition from state S1 to S2
  - If and only if there is an arc from S1 to S2
  - Labeled with the next input character (which is eaten)
- A string is recognized if the machine ends up in a terminal state
Building an FSM from a Grammar
- One state for each non-terminal
- A rule of the form

     Nont1 ::= terminal

  generates a transition from S1 to the final state
- A rule of the form

     Nont1 ::= terminal Nont2

  generates a transition from S1 to S2
Building FSMs from REs
- Every RE corresponds to a grammar
- For all regular expressions
  - A natural translation to an FSM exists
- We will not give details of the algorithm here
Non-Deterministic FSM
- A non-deterministic FSM
  - Has at least one state
  - With two arcs to two separate states
  - Labeled with the same character
- Which way to go?
  - Implementation requires backtracking
  - Nasty!
Deterministic FSM
- For all states S
  - For all characters C
  - There is either ONE or NO arc from state S labeled with character C
- Much easier to implement
  - No backtracking
Dealing with ND FSMs
- Construction naturally leads to an ND FSM
- For example, consider the FSM for

     [0-9]+ | [0-9]+\.[0-9]+     (integer or real)

- We naturally get a start state with two sets of 0-9 branches
  - And thus non-determinism
Converting to Deterministic
- There is an algorithm for converting
  - From any ND FSM
  - To an equivalent deterministic FSM
- The algorithm is in the text book
- Example (given in terms of REs):

     [0-9]+ | [0-9]+\.[0-9]+     becomes     [0-9]+(\.[0-9]+)?
Implementing the Scanner
Three methods:
- Completely informal: just write code
- Define tokens using regular expressions
  - Convert REs to an ND finite state machine
  - Convert the ND FSM to a deterministic FSM
  - Program the FSM
- Use an automated program
  - To achieve the above three steps
Ad Hoc Code (forget FSMs)
- Write normal hand code
  - A procedure called Scan
  - Normal coding techniques
- Basically scan over white space and comments until a non-blank character is found
- Base subsequent processing on that character
  - E.g. a colon may be : or :=
  - A / may be an operator or the start of a comment
- Return the token found
- Write aggressively efficient code
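The ad hoc style above can be sketched in C. The token names, the string-based input, and the C-style /* */ comment syntax are all invented for this sketch; note how the : vs := and / vs comment decisions are made by peeking at the next character:

```c
/* Minimal ad hoc scanner: skip blanks and comments, then branch
   on the first significant character. Illustrative names only. */
#include <ctype.h>

enum tok { T_EOF, T_COLON, T_ASSIGN, T_SLASH, T_IDENT, T_NUM, T_OTHER };

static const char *src;                  /* scan position in the input */

void scan_init(const char *s) { src = s; }

enum tok scan(void)
{
    while (*src == ' ' || *src == '\t' || *src == '\n')
        src++;                           /* skip white space */
    if (*src == '\0')
        return T_EOF;

    switch (*src) {
    case ':':                            /* colon may be : or := */
        src++;
        if (*src == '=') { src++; return T_ASSIGN; }
        return T_COLON;
    case '/':                            /* operator, or start of comment? */
        src++;
        if (*src == '*') {               /* consume up to closing star-slash */
            src++;
            while (*src && !(src[0] == '*' && src[1] == '/'))
                src++;
            if (*src) src += 2;
            return scan();               /* comments are not returned */
        }
        return T_SLASH;
    default:
        if (isalpha((unsigned char)*src)) {
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
            return T_IDENT;
        }
        if (isdigit((unsigned char)*src)) {
            while (isdigit((unsigned char)*src))
                src++;
            return T_NUM;
        }
        src++;
        return T_OTHER;
    }
}
```

A real Scan would also record the token's text and location, as discussed earlier.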
Using FSM Formalisms
- Start with a regular grammar or RE
  - Typically found in the language standard
- For example, for Ada (Chapter 2, Lexical Elements):

     digit           ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
     decimal-literal ::= integer [.integer] [exponent]
     integer         ::= digit {[underline] digit}
     exponent        ::= E [+] integer | E - integer
Using FSM Formalisms, cont
- Given REs or a grammar
  - Convert to a finite state machine
  - Convert the ND FSM to a deterministic FSM
- Write a program to recognize tokens
  - Using the deterministic FSM
Implementing FSM (Method 1)
Each state is code of the form:

   <<State1>>
      case Next_Character is
         when 'a'    => goto State3;
         when 'b'    => goto State1;
         when others => End_Of_Token_Processing;
      end case;
   <<State2>>
      ...
Implementing FSM (Method 2)
There is a variable called State:

   loop
      case State is
         when State1 =>
            case Next_Character is
               when 'a'    => State := State3;
               when 'b'    => State := State1;
               when others => End_Token_Processing;
            end case;
         when State2 =>
            ...
      end case;
   end loop;
Implementing FSM (Method 3)
Use a transition table:

   T : array (State, Character) of State;

   while More_Input loop
      Curstate := T (Curstate, Next_Char);
      if Curstate = Error_State then
         ...
      end if;
   end loop;
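A fleshed-out version of Method 3 in C, for the deterministic FSM of [0-9]+(\.[0-9]+)? worked out above. The state names are invented, and character classes stand in for the full character dimension of the table:

```c
/* Table-driven deterministic FSM for [0-9]+(\.[0-9]+)? */
enum state { S_START, S_INT, S_DOT, S_FRAC, S_ERR };
enum cls   { C_DIGIT, C_DOT, C_OTHER };

static enum cls classify(char c)
{
    if (c >= '0' && c <= '9') return C_DIGIT;
    if (c == '.')             return C_DOT;
    return C_OTHER;
}

/* T[state][class] = next state: the transition table of Method 3. */
static const enum state T[5][3] = {
    /* S_START */ { S_INT,  S_ERR, S_ERR },
    /* S_INT   */ { S_INT,  S_DOT, S_ERR },
    /* S_DOT   */ { S_FRAC, S_ERR, S_ERR },
    /* S_FRAC  */ { S_FRAC, S_ERR, S_ERR },
    /* S_ERR   */ { S_ERR,  S_ERR, S_ERR },
};

/* Accept iff we end up in a terminal state (S_INT or S_FRAC). */
int matches(const char *s)
{
    enum state st = S_START;
    while (*s)
        st = T[st][classify(*s++)];
    return st == S_INT || st == S_FRAC;
}
```

The per-character work is a single table lookup, which is why this method supports the "minimize processing per character" goal mentioned earlier.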
Automatic FSM Generation
- Our example: FLEX
  - See home page for manual in HTML
- FLEX is given
  - A set of regular expressions
  - Actions associated with each RE
- It builds a scanner
  - Which matches the REs and executes the corresponding actions
Flex General Format
- Input to flex is a set of rules:

     regexp   actions (C statements)
     regexp   actions (C statements)
     ...

- Flex scans for the longest matching regexp
  - And executes the corresponding actions
An Example of a Flex Scanner

   DIGIT  [0-9]
   ID     [a-z][a-z0-9]*
   %%
   {DIGIT}+ {
      printf ("an integer %s (%d)\n", yytext, atoi (yytext));
   }
   {DIGIT}+"."{DIGIT}* {
      printf ("a float %s (%g)\n", yytext, atof (yytext));
   }
   if|then|begin|end|procedure|function {
      printf ("a keyword: %s\n", yytext);
   }
Flex Example (continued)

   {ID}              printf ("an identifier %s\n", yytext);
   "+"|"-"|"*"|"/"   { printf ("an operator %s\n", yytext); }
   "--".*\n          /* eat Ada-style comment */
   [ \t\n]+          /* eat white space */
   .                 printf ("unrecognized character");
   %%
Assembling the Flex Program

   %{
   #include <math.h>   /* for atof */
   %}

   <<flex text we gave goes here>>

   %%
   main (argc, argv)
   int argc;
   char **argv;
   {
      yyin = fopen (argv[1], "r");
      yylex ();
   }
Running Flex
- flex is a program that is executed
  - The input is as we have given
  - The output is a C program: the running scanner
- For Ada fans
  - Look at aflex (www.adapower.com)
- For C++ fans
  - flex can run in C++ mode
  - Generates appropriate classes
Choice Between Methods?
- Hand-written scanners
  - Typically much faster execution
  - And pretty easy to write
  - And easier for good error recovery
- Flex approach
  - Simple to use
  - Easy to modify the token language
The GNAT Scanner
- Hand written (scn.adb/scn.ads)
- Basically a call does
  - Super quick scan past blanks/comments etc.
  - Big case statement: process based on first character
- Calls special routines
  - Namet.Get_Name for identifiers (hashing)
  - Keywords recognized by a special hash
  - Strings (stringt.ads)
  - Integers (uintp.ads)
  - Reals (ureal.ads)
More on the GNAT Scanner
- Entire source read into memory
  - Single contiguous block
  - Source location is an index into this block
  - Different index range for each source file
- See sinput.adb/ads for source management
- See scans.ads for definitions of tokens
DTL (Dewar Trivial Language)
DTL grammar:

   Program    ::= DECLARE Decls BEGIN Stmts
   Decls      ::= {Decl}*
   Stmts      ::= {Stmt}+
   Type       ::= INTEGER | REAL
   Identifier ::= letter (_{digit}+)*
   Decl       ::= DECLARE identifier : Type
DTL (Continued)

   Integer_Literal ::= {digit}+
   Real_Literal    ::= {digit}+ "." {digit}*
   Stmt            ::= Assignstmt | Ifstmt | Whilestmt
   Assignstmt      ::= Identifier := Expr
   Expr            ::= Literal | (Expr) Op (Expr)
   Op              ::= + | -
   Literal         ::= Integer_Literal | Real_Literal
   Ifstmt          ::= IF Expr Relop Expr THEN Stmts
   Whilestmt       ::= WHILE Expr Relop Expr DO Stmts
   Relop           ::= > | < | >= | <=
DTL Example

   DECLARE
      DECL A_123 : INTEGER
      DECL B : REAL
   BEGIN
      A_123 := 23
      B := 2.4
      WHILE A_123 > (2) + (1) DO
         A_123 := A_123 - 1
ASSIGNMENT TWO
- Write a flex or aflex program
  - Recognize the tokens of a DTL program
  - Print out the tokens in the style of the flex example
- Extra credit
  - Build a hash table for identifiers
  - Output the hash table key
Preprocessors
- Some languages allow preprocessing
- This is a separate step
  - Input is source
  - Output is expanded source
- Can either be done as a separate phase
  - Or embedded into the lexical analyzer
  - Often done as a separate phase
- Need to keep track of source locations