Compiler Construction
Lexical Analysis
Rina Zviel-Girshin and Ohad Shacham
School of Computer Science
Tel-Aviv University

Generic compiler structure
[Figure: Source text (txt) → Compiler: Frontend (analysis) → Semantic Representation → Backend (synthesis) → Executable code (exe)]

Lexical Analysis
converts characters to tokens
class Quicksort {
    int[] a;
    int partition(int low, int high) {
        int pivot = a[low];
        ...
}
1: CLASS
1: CLASS_ID(Quicksort)
1: LCBR
2: INT
2: LB
2: RB
2: ID(a)
2: SEMI
. . .

Lexical Analysis
Tokens
    ID – _size, _num
    Num – 7, 5, 9, 4926209837
    COMMA – ,
    SEMI – ;
    …
Non tokens
    Comment – //
    Whitespace
    Macro
    …

Problem
Input
    Program text
    Tokens specification
Output
    Sequence of tokens
class Quicksort {
    int[] a;
    int partition(int low, int high) {
        int pivot = a[low];
        ...
}
1: CLASS
1: CLASS_ID(Quicksort)
1: LCBR
2: INT
2: LB
2: RB
2: ID(a)
2: SEMI
. . .

Solution
Write a lexical analyzer
Token nextToken() {
    int c;                      /* int, so that EOF from getchar() fits */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;        /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();          /* one character of lookahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin);   /* not part of this token: push it back */
                  return Plus;
        }
    case '<':
    case 'w':
    …
    }
}
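
For comparison, the same hand-written approach can be sketched in Java. This is only an illustration under assumptions: the class name HandLexer, the Token enum and the tokenize helper are made up for this sketch and are not part of the course code. java.io.PushbackReader plays the role that ungetc plays above.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; all names are illustrative only.
class HandLexer {
    enum Token { SEMI, PLUS, PLUS_PLUS, PLUS_EQUAL, EOF }

    private final PushbackReader in;

    HandLexer(String text) {
        // A pushback buffer of one character is enough for one-character lookahead.
        in = new PushbackReader(new StringReader(text), 1);
    }

    Token nextToken() throws IOException {
        while (true) {
            int c = in.read();
            switch (c) {
                case -1:
                    return Token.EOF;
                case ' ':
                case '\t':
                case '\n':
                    continue; // non-token: skip it and read again
                case ';':
                    return Token.SEMI;
                case '+': {
                    int next = in.read(); // look one character ahead
                    if (next == '+') return Token.PLUS_PLUS;
                    if (next == '=') return Token.PLUS_EQUAL;
                    if (next != -1) in.unread(next); // belongs to the next token
                    return Token.PLUS;
                }
                default:
                    throw new IllegalStateException("unexpected character: " + (char) c);
            }
        }
    }

    // Convenience helper: scans the whole text and collects the tokens before EOF.
    static List<Token> tokenize(String text) {
        HandLexer lexer = new HandLexer(text);
        List<Token> tokens = new ArrayList<>();
        try {
            for (Token t = lexer.nextToken(); t != Token.EOF; t = lexer.nextToken()) {
                tokens.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("++ += + ;"));
    }
}
```

Even this small version needs explicit lookahead and push-back for every operator that shares a prefix with another one.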

Solution’s Problems
    A lot of work
    Corner cases
    Error prone
    Hard to debug
    Exhausting
    Boring
    Hard to reuse
    Switching the parser’s code between people
    …

Scanner generator: history
LEX
    A lexical analyzer generator, written by Lesk and Schmidt at Bell Labs in 1975 for the UNIX operating system;
    It now exists for many operating systems;
    LEX produces a scanner which is a C program;
    LEX accepts regular expressions and allows actions (i.e., code to be executed) to be associated with each regular expression.
JLex
    A Lex that generates a scanner written in Java;
    It is itself also implemented in Java.
There are many similar tools, for most programming languages

Overall picture
[Figure: scanner generator pipeline. Labels: RE → NFA → DFA → Minimize DFA → Java scanner program (Simulate DFA); String stream in, Tokens out]
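
The "Simulate DFA" step can be sketched as a table lookup per input character. The sketch below assumes a tiny DFA built by hand for identifiers and numbers; a real generator would compute the table from the regular expressions. The class name DfaDemo, the state numbering and the table layout are all assumptions made for illustration.

```java
// Hypothetical table-driven DFA; states and table were written by hand for this sketch.
class DfaDemo {
    static final int START = 0, IN_ID = 1, IN_NUM = 2, DEAD = -1;

    // Transition table: rows are states, columns are character classes
    // (0 = letter or underscore, 1 = digit, 2 = anything else).
    static final int[][] DELTA = {
        /* START  */ { IN_ID, IN_NUM, DEAD },
        /* IN_ID  */ { IN_ID, IN_ID,  DEAD },
        /* IN_NUM */ { DEAD,  IN_NUM, DEAD },
    };

    // Token name for each accepting state; null marks a non-accepting state.
    static final String[] ACCEPT = { null, "ID", "NUM" };

    static int charClass(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_') return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    // Runs the DFA over the whole lexeme; returns the token name, or null if rejected.
    static String classify(String lexeme) {
        int state = START;
        for (int i = 0; i < lexeme.length(); i++) {
            state = DELTA[state][charClass(lexeme.charAt(i))];
            if (state == DEAD) return null; // no way to recover from the dead state
        }
        return ACCEPT[state];
    }

    public static void main(String[] args) {
        System.out.println(classify("_size"));
        System.out.println(classify("4926209837"));
        System.out.println(classify("4a"));
    }
}
```

The loop does constant work per character, which is why generated scanners favor a DFA over interpreting the regular expressions directly.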

JFlex
Off the shelf lexical analysis generator
Input
    scanner specification file
Output
    Lexical analyzer written in Java
[Figure: IC.lex → JFlex → Lexer.java → javac → Lexical analyzer; IC text → Lexical analyzer → tokens]

JFlex
    Simple
    Good for reuse
    Easy to understand
    Many developers and users debugged the generators
"+" { return new Symbol(sym.PLUS); }
"boolean" { return new Symbol(sym.BOOLEAN); }
"int" { return new Symbol(sym.INT); }
"null" { return new Symbol(sym.NULL); }
"while" { return new Symbol(sym.WHILE); }
"=" { return new Symbol(sym.ASSIGN); }
…

JFlex Spec File
User code
    Copied directly to Java file
    (Possible source of javac errors down the road)
%%
JFlex directives
    Define macros, state names
    e.g. DIGIT= [0-9], LETTER= [a-zA-Z], YYINITIAL
%%
Lexical analysis rules
    How to break input to tokens
    Action when token matched
    e.g. {LETTER}({LETTER}|{DIGIT})*

User code
package IC.Parser;
import IC.Parser.Token;
…any scanner-helper Java code…

JFlex Directives
Control JFlex internals
    %line                            switches line counting on
    %char                            switches character counting on
    %class class-name                changes the default name of the generated class
    %cup                             CUP compatibility mode
    %type token-class-name
    %public                          makes the generated class public (package by default)
    %function read-token-method
    %scanerror exception-type-name

JFlex Directives
State definitions
    %state state-name
    e.g. %state STRING
Macro definitions
    macro-name = regex

Regular Expression
r$        match reg. exp. r at end of a line
.         any character except the newline
"..."     string
{name}    macro expansion
*         zero or more repetitions
+         one or more repetitions
?         zero or one repetitions
(...)     grouping within regular expressions
a|b       match a or b
[...]     class of characters: any one character enclosed in brackets
a-b       range of characters
[^…]      negated class: any one character not enclosed in brackets

Example macros
ALPHA=[A-Za-z_]
DIGIT=[0-9]
ALPHA_NUMERIC={ALPHA}|{DIGIT}
IDENT={ALPHA}({ALPHA_NUMERIC})*
NUMBER=({DIGIT})+
(equivalently: NUMBER=[0-9]+)
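
To see what these macros expand to, the same definitions can be mimicked in plain Java with java.util.regex, using string concatenation in place of {name} expansion. This is a rough analogue rather than JFlex itself; the class name MacroDemo and the helper methods are assumptions for this sketch. The explicit (?:...) group keeps the alternation together so that the following * applies to all of it.

```java
import java.util.regex.Pattern;

// Hypothetical analogue of the macro definitions above, built with java.util.regex.
class MacroDemo {
    static final String ALPHA = "[A-Za-z_]";
    static final String DIGIT = "[0-9]";
    // Grouped so that a following '*' or '+' applies to the whole alternation.
    static final String ALPHA_NUMERIC = "(?:" + ALPHA + "|" + DIGIT + ")";

    static final Pattern IDENT = Pattern.compile(ALPHA + ALPHA_NUMERIC + "*");
    static final Pattern NUMBER = Pattern.compile(DIGIT + "+");

    // matches() requires the whole string to fit the pattern.
    static boolean isIdent(String s) {
        return IDENT.matcher(s).matches();
    }

    static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdent("_size"));
        System.out.println(isIdent("9lives"));
        System.out.println(isNumber("4926209837"));
    }
}
```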

Rules
[states] regexp {action as Java code}
    states: which regexps should be evaluated?
    regexp: breaks input to tokens
    action: invoked when the regexp matches
Priorities
    Longest match (break vs. breakdown)
    Order in the lex file (int: identifier or integer?)
Rules should match all inputs!
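
The two priorities can be sketched in Java as follows: at each position every rule is tried, the longest match wins, and on a tie the rule listed first is kept. This is an illustration of the policy only, not of how JFlex is implemented; the class name MaximalMunch and the rule names are assumptions. An input that no rule matches raises an error, which is why the rules should cover all inputs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule table; names and patterns are illustrative only.
class MaximalMunch {
    // LinkedHashMap keeps insertion order, standing in for the order in the spec file.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("INT", Pattern.compile("int"));
        RULES.put("BREAK", Pattern.compile("break"));
        RULES.put("ID", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
        RULES.put("WS", Pattern.compile("[ \\t\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            if (!bestName.equals("WS")) { // whitespace is a non-token: eat it
                tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int breakdown break interest 42"));
    }
}
```

Here "breakdown" and "interest" become identifiers because the identifier rule matches more characters, while "int" and "break" alone tie with the identifier rule and go to the keyword rules listed first.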

Rules Examples
<YYINITIAL> {DIGIT}+ {
return new Symbol(sym.NUMBER, yytext(), yyline);
}
<YYINITIAL> "-" {
return new Symbol(sym.MINUS, yytext(), yyline);
}
<YYINITIAL> [a-zA-Z]([a-zA-Z0-9])* {
return new Symbol(sym.ID, yytext(), yyline);
}

Rules – Action
Action
    Java code
    Can use special methods and vars
        yyline
        yytext()
    Returns a token when the match is a token
    Eats the chars when the match is a non token

Rules – State
State
    Which regexp should be evaluated?
    yybegin(stateX)
        jumps to stateX
    YYINITIAL
        JFlex’s initial state

Rules – State
<YYINITIAL> "//" { yybegin(COMMENTS); }
<COMMENTS> [^\n] { }
<COMMENTS> [\n] { yybegin(YYINITIAL); }
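[Figure: state diagram. YYINITIAL goes to COMMENTS on ‘//’; COMMENTS stays in COMMENTS on ^\n; COMMENTS goes back to YYINITIAL on \n]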
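
The effect of these three rules can be imitated with an explicit state variable. The Java sketch below is an assumption-laden illustration: the class name CommentStates is made up, and copying characters through in YYINITIAL merely stands in for whatever other YYINITIAL rules a real spec would have.

```java
// Hypothetical two-state scanner mirroring the YYINITIAL / COMMENTS rules above.
class CommentStates {
    enum State { YYINITIAL, COMMENTS }

    // Returns the input with every "//" comment removed, including its newline.
    static String stripLineComments(String input) {
        StringBuilder out = new StringBuilder();
        State state = State.YYINITIAL;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            switch (state) {
                case YYINITIAL:
                    if (input.startsWith("//", i)) {
                        state = State.COMMENTS; // like yybegin(COMMENTS)
                        i += 2;
                    } else {
                        out.append(c); // stand-in for the other YYINITIAL rules
                        i++;
                    }
                    break;
                case COMMENTS:
                    if (c == '\n') {
                        state = State.YYINITIAL; // like yybegin(YYINITIAL)
                    }
                    i++; // empty action: the character is eaten
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripLineComments("a // x\nb"));
    }
}
```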

Lines Count Example
import java_cup.runtime.Symbol;
%%
%cup
%{
    private int lineCounter = 0;
%}
%eofval{
    System.out.println("line number=" + lineCounter);
    return new Symbol(sym.EOF);
%eofval}
NEWLINE=\n
%%
<YYINITIAL>{NEWLINE} {
    lineCounter++;
}
<YYINITIAL>[^\n] { }

Lines Count Example
[Figure: lineCount.lex → JFlex → Yylex.java; Yylex.java, Main.java, sym.java → javac → Lexical analyzer; text → Lexical analyzer → tokens]
JFlex and JavaCup must be on CLASSPATH
java JFlex.Main lineCount.lex
javac *.java

Test Bed
import java.io.*;
import java_cup.runtime.Symbol;

public class Main {
    public static void main(String[] args) {
        Symbol currToken;
        try {
            FileReader txtFile = new FileReader(args[0]);
            Yylex scanner = new Yylex(txtFile);
            do {
                currToken = scanner.next_token();
                // do something with currToken
            } while (currToken.sym != sym.EOF);
        } catch (Exception e) {
            throw new RuntimeException("IO Error (brutal exit)" + e.toString());
        }
    }
}

Common Pitfalls
    Classpath
    Path to executable
    Define environment variables
        JAVA_HOME
        CLASSPATH

JFlex directives to use
    %cup            (integrate with cup)
    %line           (count lines)
    %type Token     (pass type Token)
    %class Lexer    (gen. scanner class)

Structure
[Figure: IC.lex → JFlex → Lexer.java; Lexer.java, sym.java, Token.java, LexicalError.java, Compiler.java → javac → Lexical analyzer; test.ic → Lexical analyzer → tokens]