Compiler Construction Lexical Analysis

Date post: 07-Feb-2016
Description:
Compiler Construction: Lexical Analysis. Rina Zviel-Girshin and Ohad Shacham, School of Computer Science, Tel-Aviv University. Covers the generic compiler structure: source text, frontend (analysis), semantic representation, backend (synthesis), executable code. PowerPoint presentation, 28 slides.
Transcript
Page 1: Compiler Construction  Lexical Analysis

Compiler Construction

Lexical Analysis

Rina Zviel-Girshin and Ohad Shacham
School of Computer Science
Tel-Aviv University

Page 2: Compiler Construction  Lexical Analysis

Generic compiler structure

Source text (txt) -> Compiler -> Executable code (exe)

Inside the compiler: Frontend (analysis) -> Semantic Representation -> Backend (synthesis)

Page 3: Compiler Construction  Lexical Analysis

Lexical Analysis

Converts characters to tokens.

class Quicksort {
    int[] a;
    int partition(int low, int high) {
        int pivot = a[low];
        ...
    }

Resulting tokens (source line: token):

1: CLASS
1: CLASS_ID(Quicksort)
1: LCBR
2: INT
2: LB
2: RB
2: ID(a)
2: SEMI
. . .

Page 4: Compiler Construction  Lexical Analysis

Lexical Analysis: Tokens

Tokens
  ID: _size, _num
  Num: 7, 5, 9, 4926209837
  COMMA: ,
  SEMI: ;
  ...

Non-tokens
  Comment: //
  Whitespace
  Macro
  ...

Page 5: Compiler Construction  Lexical Analysis

Problem

Input
  Program text
  Token specification

Output
  Sequence of tokens

class Quicksort {
    int[] a;
    int partition(int low, int high) {
        int pivot = a[low];
        ...
    }

1: CLASS
1: CLASS_ID(Quicksort)
1: LCBR
2: INT
2: LB
2: RB
2: ID(a)
2: SEMI
. . .

Page 6: Compiler Construction  Lexical Analysis

Solution

Write a lexical analyzer by hand:

Token nextToken() {
    int c;                        /* int rather than char, so EOF fits */
loop:
    c = getchar();
    switch (c) {
    case ' ':
        goto loop;                /* whitespace is not a token: skip it */
    case ';':
        return SemiColon;
    case '+':
        c = getchar();            /* look one character ahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:
            ungetc(c, stdin);     /* not part of this token: push it back */
            return Plus;
        }
    case '<':
    case 'w':
    …
    }
}
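For comparison, the same hand-written approach can be sketched in Java, the language the rest of the deck targets. This is only an illustration under stated assumptions: the class name HandScanner, the Tok enum and the scanAll helper are hypothetical and do not come from the slides. java.io.PushbackReader plays the role of ungetc.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names: HandScanner, Tok and scanAll are illustrative only.
class HandScanner {
    enum Tok { SEMI, PLUS, PLUS_PLUS, PLUS_EQUAL, EOF, UNKNOWN }

    private final PushbackReader in;

    HandScanner(String text) {
        // A pushback buffer of one character is enough for one-character lookahead.
        this.in = new PushbackReader(new StringReader(text), 1);
    }

    Tok nextToken() throws IOException {
        int c = in.read();
        while (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            c = in.read();                 // whitespace is a non-token: skip it
        }
        switch (c) {
            case -1:  return Tok.EOF;
            case ';': return Tok.SEMI;
            case '+': {
                int next = in.read();      // look one character ahead
                if (next == '+') return Tok.PLUS_PLUS;
                if (next == '=') return Tok.PLUS_EQUAL;
                if (next != -1) in.unread(next); // not part of this token: push it back
                return Tok.PLUS;
            }
            default:  return Tok.UNKNOWN;  // every other character is left unhandled here
        }
    }

    // Convenience wrapper: scan a whole string, including the final EOF token.
    static List<Tok> scanAll(String text) {
        try {
            HandScanner scanner = new HandScanner(text);
            List<Tok> tokens = new ArrayList<>();
            Tok t;
            do {
                t = scanner.nextToken();
                tokens.add(t);
            } while (t != Tok.EOF);
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(scanAll("++ += + ;"));
    }
}
```

Even this tiny scanner needs explicit lookahead and pushback for three operators, which hints at how quickly the manual approach grows.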

Page 7: Compiler Construction  Lexical Analysis

Problems with this solution

A lot of work
Corner cases
Error prone
Hard to debug
Exhausting
Boring
Hard to reuse
Hard to hand the parser's code over between people
...

Page 8: Compiler Construction  Lexical Analysis

Scanner generator: history

LEX
  A lexical analyzer generator, written by Lesk and Schmidt at Bell Labs in 1975 for the UNIX operating system
  It now exists for many operating systems
  LEX produces a scanner which is a C program
  LEX accepts regular expressions and allows actions (i.e., code to be executed) to be associated with each regular expression

JLex
  A Lex that generates a scanner written in Java
  It is itself implemented in Java

There are many similar tools, for most widely used programming languages

Page 9: Compiler Construction  Lexical Analysis

Overall picture

Scanner generator: RE -> NFA -> DFA -> Minimize DFA -> Java scanner program

Java scanner program (simulates the DFA): String stream -> Tokens
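The last step of the pipeline, simulating the DFA, can be sketched with a transition table. The code below is a hand-built illustration, not output of any generator: the states, the table and the names DfaDemo and classify are assumptions made for this example. It recognises identifiers of the form letter (letter | digit)* and numbers of the form digit+.

```java
// Hypothetical sketch: DfaDemo, classify and the table layout are illustrative names.
class DfaDemo {
    // DFA states
    static final int START = 0, IN_ID = 1, IN_NUM = 2, DEAD = 3;
    // Character classes (the columns of the transition table)
    static final int LETTER = 0, DIGIT = 1, OTHER = 2;

    // TRANSITION[state][charClass] gives the next state.
    static final int[][] TRANSITION = {
        /* START  */ { IN_ID, IN_NUM, DEAD },
        /* IN_ID  */ { IN_ID, IN_ID,  DEAD },
        /* IN_NUM */ { DEAD,  IN_NUM, DEAD },
        /* DEAD   */ { DEAD,  DEAD,   DEAD },
    };

    static int charClass(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return LETTER;
        if (c >= '0' && c <= '9') return DIGIT;
        return OTHER;
    }

    // Run the DFA over the whole lexeme and report the accepting state reached, if any.
    static String classify(String lexeme) {
        int state = START;
        for (int i = 0; i < lexeme.length(); i++) {
            state = TRANSITION[state][charClass(lexeme.charAt(i))];
        }
        switch (state) {
            case IN_ID:  return "ID";
            case IN_NUM: return "NUMBER";
            default:     return "ERROR";   // START (empty input) and DEAD do not accept
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("a1"));
        System.out.println(classify("42"));
        System.out.println(classify("4x"));
    }
}
```

The loop does one table lookup per input character, which is why a generated scanner can run in time linear in the input length regardless of how many rules the specification has.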

Page 10: Compiler Construction  Lexical Analysis

JFlex

Off-the-shelf lexical analyzer generator
Input: scanner specification file
Output: lexical analyzer written in Java

IC.lex -> JFlex -> Lexer.java -> javac -> Lexical analyzer
IC text -> Lexical analyzer -> tokens

Page 11: Compiler Construction  Lexical Analysis

JFlex

Simple
Good for reuse
Easy to understand
The generator has already been debugged by many developers and users

"+" { return new Symbol(sym.PLUS); }
"boolean" { return new Symbol(sym.BOOLEAN); }
"int" { return new Symbol(sym.INT); }
"null" { return new Symbol(sym.NULL); }
"while" { return new Symbol(sym.WHILE); }
"=" { return new Symbol(sym.ASSIGN); }
…

Page 12: Compiler Construction  Lexical Analysis

JFlex Spec File

User code
  Copied directly to the Java file
  (Possible source of javac errors down the road)

%%

JFlex directives
  Define macros, state names
  e.g. macros DIGIT= [0-9] and LETTER= [a-zA-Z], state YYINITIAL

%%

Lexical analysis rules
  How to break input into tokens
  Action when a token is matched
  e.g. {LETTER}({LETTER}|{DIGIT})*

Page 13: Compiler Construction  Lexical Analysis

User code

package IC.Parser;
import IC.Parser.Token;

…any scanner-helper Java code…

Page 14: Compiler Construction  Lexical Analysis

JFlex Directives

Control JFlex internals
  %line                           switches line counting on
  %char                           switches character counting on
  %class class-name               changes the default name
  %cup                            CUP compatibility mode
  %type token-class-name
  %public                         makes the generated class public (package by default)
  %function read-token-method
  %scanerror exception-type-name

Page 15: Compiler Construction  Lexical Analysis

JFlex Directives

State definitions
  %state state-name
  e.g. %state STRING

Macro definitions
  macro-name = regex

Page 16: Compiler Construction  Lexical Analysis

Regular Expression

r$       match reg. exp. r at the end of a line
.        any character except the newline
"..."    string
{name}   macro expansion
*        zero or more repetitions
+        one or more repetitions
?        zero or one repetitions
(...)    grouping within regular expressions
a|b      match a or b
[...]    class of characters: any one character enclosed in brackets
a-b      range of characters
[^…]     negated class: any one character not enclosed in brackets
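Most of these operators can be tried out quickly with java.util.regex, which shares them. This is an illustration only: java.util.regex is not JFlex, and its syntax differs in places (for example, it has no {name} macros and does not use double quotes for literal strings). The class name RegexOps is hypothetical.

```java
import java.util.regex.Pattern;

// Illustration only: uses java.util.regex, not JFlex. RegexOps is a hypothetical name.
class RegexOps {
    // True when the regex matches the entire input string.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("ab*", "a"));      // * : zero or more repetitions
        System.out.println(matches("ab+", "a"));      // + : one or more repetitions
        System.out.println(matches("ab?", "ab"));     // ? : zero or one repetition
        System.out.println(matches("(ab)+", "abab")); // (...) : grouping
        System.out.println(matches("a|b", "b"));      // | : alternation
        System.out.println(matches("[a-c]", "b"));    // [...] with a range
        System.out.println(matches("[^0-9]", "x"));   // [^...] : negated class
        System.out.println(matches(".", "\n"));       // . : does not match the newline
    }
}
```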

Page 17: Compiler Construction  Lexical Analysis

Example macros

ALPHA=[A-Za-z_]
DIGIT=[0-9]
ALPHA_NUMERIC={ALPHA}|{DIGIT}
IDENT={ALPHA}({ALPHA_NUMERIC})*
NUMBER=({DIGIT})+

Equivalently, without the macro:

NUMBER=[0-9]+
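To see what macro expansion amounts to, here is a small Java sketch that substitutes {NAME} references textually and then hands the result to java.util.regex. It is not how JFlex is implemented; MacroDemo, define, expand and slideMacros are hypothetical names, and wrapping each expansion in parentheses is an assumption made so that a macro containing | keeps its grouping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of textual macro expansion; not JFlex's actual implementation.
class MacroDemo {
    private final Map<String, String> macros = new LinkedHashMap<>();

    // Store a macro. Its body is expanded first, so it may only refer to earlier macros.
    void define(String name, String regex) {
        macros.put(name, expand(regex));
    }

    // Replace every {NAME} with its (already expanded) definition in parentheses.
    String expand(String regex) {
        String result = regex;
        for (Map.Entry<String, String> macro : macros.entrySet()) {
            result = result.replace("{" + macro.getKey() + "}", "(" + macro.getValue() + ")");
        }
        return result;
    }

    // The macros from the slide, defined in the same order.
    static MacroDemo slideMacros() {
        MacroDemo m = new MacroDemo();
        m.define("ALPHA", "[A-Za-z_]");
        m.define("DIGIT", "[0-9]");
        m.define("ALPHA_NUMERIC", "{ALPHA}|{DIGIT}");
        m.define("IDENT", "{ALPHA}({ALPHA_NUMERIC})*");
        m.define("NUMBER", "({DIGIT})+");
        return m;
    }

    // Expand a regex that uses the slide macros and match it against the whole input.
    static boolean matches(String regexWithMacros, String input) {
        return Pattern.matches(slideMacros().expand(regexWithMacros), input);
    }

    public static void main(String[] args) {
        System.out.println(slideMacros().expand("{IDENT}"));
        System.out.println(matches("{IDENT}", "_size"));
        System.out.println(matches("{NUMBER}", "4926209837"));
    }
}
```

Without the added parentheses, {ALPHA}({ALPHA_NUMERIC})* would still work here, but a use such as {ALPHA_NUMERIC}+ would expand to an expression in which + binds only to the digit alternative.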

Page 18: Compiler Construction  Lexical Analysis

Rules

[states] regexp {action as Java code}
  states: should the regexp be evaluated?
  regexp: breaks the input into tokens
  action: invoked when the regexp matches

Priorities
  Longest match (break vs. breakdown)
  Order in the lex file (int: identifier or integer?)

Rules should match all inputs!
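The two priorities can be made concrete with a small Java sketch: at each position every rule is tried, the longest match wins, and on a tie the rule listed first wins. This mimics the behaviour described on the slide using java.util.regex; it is not generated code, and the names MaximalMunch, RULES and tokenize as well as the particular rule set are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "longest match, then order in the file".
class MaximalMunch {
    // Rules in "file order": keywords are listed before the identifier rule.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("BREAK", Pattern.compile("break"));
        RULES.put("INT", Pattern.compile("int"));
        RULES.put("ID", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
        RULES.put("WS", Pattern.compile("[ \\t\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at pos; the strict '>' keeps the
                // earlier rule when two rules match the same length.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                // This is what "rules should match all inputs" guards against.
                throw new IllegalArgumentException("No rule matches at position " + pos);
            }
            if (!bestName.equals("WS")) {   // whitespace is consumed, no token returned
                tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("break breakdown int 42"));
    }
}
```

Here break and int become keyword tokens because their rules precede ID, while breakdown becomes an ID because the identifier rule matches more characters than the BREAK rule.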

Page 19: Compiler Construction  Lexical Analysis

Rules Examples

<YYINITIAL> {DIGIT}+ {
    return new Symbol(sym.NUMBER, yytext(), yyline);
}

<YYINITIAL> "-" {
    return new Symbol(sym.MINUS, yytext(), yyline);
}

<YYINITIAL> [a-zA-Z]([a-zA-Z0-9])* {
    return new Symbol(sym.ID, yytext(), yyline);
}

Page 20: Compiler Construction  Lexical Analysis

Rules – Action

Action
  Java code
  Can use special methods and variables
    yyline: the current line number (requires %line)
    yytext(): the text that was matched
  Returns a token when the match is a token
  Consumes the characters and returns nothing when the match is a non-token

Page 21: Compiler Construction  Lexical Analysis

Rules – State

State
  Which regexp should be evaluated?
  yybegin(stateX) jumps to stateX
  YYINITIAL is JFlex's initial state

Page 22: Compiler Construction  Lexical Analysis

Rules – State

<YYINITIAL> "//" { yybegin(COMMENTS); }
<COMMENTS> [^\n] { }
<COMMENTS> [\n] { yybegin(YYINITIAL); }

State diagram:
  YYINITIAL -- "//" --> COMMENTS
  COMMENTS -- [^\n] --> COMMENTS
  COMMENTS -- \n --> YYINITIAL
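The two-state machine above can be mirrored in plain Java to show what the state switch achieves. This is a sketch under assumptions: CommentStates and stripLineComments are hypothetical names, and unlike the rules on the slide, which have empty actions, this version echoes the non-comment text and keeps the newline so the effect is visible.

```java
// Hypothetical sketch mirroring the YYINITIAL / COMMENTS states.
class CommentStates {
    enum State { YYINITIAL, COMMENTS }

    // Returns the input with // comments removed; newlines are kept (an assumption).
    static String stripLineComments(String input) {
        StringBuilder out = new StringBuilder();
        State state = State.YYINITIAL;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (state == State.YYINITIAL) {
                if (input.startsWith("//", i)) {
                    state = State.COMMENTS;      // like yybegin(COMMENTS)
                    i += 2;
                } else {
                    out.append(c);               // ordinary text outside comments
                    i++;
                }
            } else {
                if (c == '\n') {
                    state = State.YYINITIAL;     // like yybegin(YYINITIAL)
                    out.append(c);
                }
                i++;                             // any other character is swallowed
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripLineComments("int a; // counter\nint b;"));
    }
}
```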

Page 23: Compiler Construction  Lexical Analysis

Lines Count Example

import java_cup.runtime.Symbol;

%%
%cup
%{
  private int lineCounter = 0;
%}

%eofval{
  System.out.println("line number=" + lineCounter);
  return new Symbol(sym.EOF);
%eofval}

NEWLINE=\n
%%

<YYINITIAL>{NEWLINE} {
  lineCounter++;
}
<YYINITIAL>[^\n] { }

Page 24: Compiler Construction  Lexical Analysis

Lines Count Example

lineCount.lex -> JFlex -> Yylex.java
Yylex.java + Main.java + sym.java -> javac -> Lexical analyzer
text -> Lexical analyzer -> tokens

JFlex and JavaCup must be on CLASSPATH

java JFlex.Main lineCount.lex
javac *.java

Page 25: Compiler Construction  Lexical Analysis

Test Bed

import java.io.*;
import java_cup.runtime.Symbol;

public class Main {
    public static void main(String[] args) {
        Symbol currToken;
        try {
            FileReader txtFile = new FileReader(args[0]);
            Yylex scanner = new Yylex(txtFile);
            do {
                currToken = scanner.next_token();
                // do something with currToken
            } while (currToken.sym != sym.EOF);
        } catch (Exception e) {
            throw new RuntimeException("IO Error (brutal exit)" + e.toString());
        }
    }
}

Page 26: Compiler Construction  Lexical Analysis

Common Pitfalls

Classpath
Path to executable
Define environment variables
  JAVA_HOME
  CLASSPATH

Page 27: Compiler Construction  Lexical Analysis

JFlex directives to use

%cup           (integrate with CUP)
%line          (count lines)
%type Token    (pass type Token)
%class Lexer   (generated scanner class)

Page 28: Compiler Construction  Lexical Analysis

Structure

IC.lex -> JFlex -> Lexer.java
Lexer.java + sym.java + Token.java + LexicalError.java + Compiler.java -> javac -> Lexical analyzer
test.ic -> Lexical analyzer -> tokens
