Lexical Analysis - Lecture 3 · kjleach.eecs.umich.edu/c18/l3.pdf · 2018-01-10

Lexical Analysis
Lecture 3

January 10, 2018

Announcements

- PA1c due tonight at 11:50pm!
- Don’t forget about PA1, the Cool implementation!
- Use Monday’s lecture, the video guides and Cool examples if you’re stuck with Cool!

Compiler Construction 2/39

Programming Assignments Going Forward

- C was allowed for PA1, but not for PA2 through PA6
- How comfortable are we with other languages?
  - Python, Haskell, Ruby, OCaml, and JavaScript


Lexical Analysis Summary

- Lexical analysis turns a stream of characters into a stream of tokens
- Regular expressions are a way to specify sets of strings, which we use to describe tokens

class Main { ...  →  Lexical Analyzer  →  CLASS, IDENT, LBRACE, ...








Cunning Plan

- Informal Sketch of Lexical Analysis
  - LA identifies tokens from input string
  - List<Token> lexer ( char[] )
- Issues in Lexical Analysis
  - Lookahead
  - Ambiguity
- Specifying Lexers
  - Regular Expressions
  - Examples


Definitions

- Token: a set of strings defining an atomic element with a distinct meaning
  - a syntactic category
  - In English: noun, verb, adjective
  - In Programming: identifier, integer, keyword, whitespace, ...
- Lexeme: a sequence of characters that can be categorized as a Token




Tokens and Lexemes

Token    Lexeme
CLASS    class
LT       <
FALSE    false
IDENT    variable_name




By the way, what do you think of Cool’s fi, pool, esac...?


Context for Lexers

- Lexing and Parsing go hand-in-hand
- Parser uses distinctions between tokens
  - e.g., a keyword is treated differently than an identifier

[Figure: pipeline of input, Lexer, and Parser, with edges labeled get_char(), character, and get_token()]


Lexical Analysis

- Consider this example:

    if(i=j) then
        z<-0
    else
        z<-1

- The input is simply a sequence of characters:

    if(i=j) then\n\tz<-0\nelse ...

- Goal: partition input strings into substrings
  - Then, classify them according to their role (tokenize!)


Tokens

- Tokens correspond to sets of strings
  - Identifier: strings of letters or digits, starting with a letter
  - Integer: a non-empty string of digits
  - Keyword: “else” or “class” or “let” ...
  - Whitespace: a non-empty sequence of blanks, newlines, and/or tabs
  - OpenParen: a left parenthesis (
  - CloseParen: a right parenthesis )


Building a Lexical Analyzer

- Lexer implementation must do three things
  1. Recognize substrings corresponding to tokens
  2. Return the value or lexeme of the token
  3. Report errors intelligently (line numbers for Cool)


Lexical Analyzer Implementation

- Lexer usually discards “uninteresting” tokens that don’t contribute to parsing
  - Examples: Whitespace, comments
  - Exceptions: Which languages care about whitespace?
- Review: What would happen if we removed all whitespace and comments before lexing?


Example

- Recall:

    if (i = j) then
        z<-0
    else
        z<-1

- Our Cool Lexer would return token-lexeme-linenumber tuples

    <IF, “if”, 1>
    <WHITESPACE, “ ”, 1>
    <OPENPAREN, “(”, 1>
    <IDENTIFIER, “i”, 1>
    <WHITESPACE, “ ”, 1>
    <EQUALS, “=”, 1>
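
A rough Python sketch of how such tuples could be produced, using the standard `re` module. The rule list, the token names beyond those shown on the slide (THEN, ELSE, INTEGER, ASSIGN, CLOSEPAREN), and the function name `lex_with_lines` are assumptions for illustration; this is not the PA2 token list or Cool's actual specification.

```python
import re

# Hypothetical rules covering only the slide's example. Keywords come before
# the identifier rule because Python tries alternatives in the order listed.
TOKEN_RULES = [
    ("IF", r"if\b"),
    ("THEN", r"then\b"),
    ("ELSE", r"else\b"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("INTEGER", r"[0-9]+"),
    ("ASSIGN", r"<-"),
    ("EQUALS", r"="),
    ("OPENPAREN", r"\("),
    ("CLOSEPAREN", r"\)"),
    ("WHITESPACE", r"[ \t\r\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))


def lex_with_lines(text):
    """Return (token, lexeme, line_number) tuples; raise on unknown input."""
    tuples = []
    position = 0
    line = 1
    while position < len(text):
        match = MASTER.match(text, position)
        if match is None:
            raise ValueError(f"line {line}: unexpected character {text[position]!r}")
        # lastgroup is the name of the alternative that matched.
        tuples.append((match.lastgroup, match.group(), line))
        line += match.group().count("\n")  # tokens after a newline start on a later line
        position = match.end()
    return tuples


source = "if (i = j) then\n\tz<-0\nelse\n\tz<-1"
for entry in lex_with_lines(source)[:6]:
    print(entry)
```

The first six tuples printed correspond to the six shown on the slide. A lexer that discards whitespace would simply skip appending WHITESPACE entries while still counting their newlines.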


Lexing Considerations

- The goal is to partition the input string into meaningful tokens.
  - Scan left to right (i.e., in order)
  - Recognize tokens
- We really need a way to describe the lexemes associated with each token
- And also a way to handle ambiguities
  - Is “if” two variables “i”, “f”?






Regular Languages

- Sounds like we can use DFAs to recognize lexemes
  - With accepting states corresponding to tokens!

Example: Capture the word “class”

[Figure: DFA with edges labeled c, l, a, s, s, WS]

Example: Capture some variable name

[Figure: DFA with edges labeled A, AN, WS]

A = letter, AN = alphanumeric, WS = whitespace


Capturing Multiple Tokens

What about both “class” and variable names?

[Figure: combined DFA with states labeled 1 and 2; edge labels c, l, a, s, s, WS; A-c, WS; WS, AN]
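
Since the diagram itself does not survive in this text, here is a hedged Python sketch of one way a single DFA can separate “class” from other variable names. The state encoding, the names `step` and `classify_word`, and the use of end-of-input in place of the WS edge are all assumptions, not a transcription of the slide's automaton.

```python
import string

KEYWORD = "class"
LETTERS = set(string.ascii_letters)          # "A" in the slide's legend
ALPHANUMERIC = LETTERS | set(string.digits)  # "AN" in the slide's legend


def step(state, ch):
    """One DFA transition.

    States 0..5 record how many characters of "class" have been matched,
    "ident" means the input is some other identifier, None is the dead state.
    """
    if state is None:
        return None
    if state != "ident" and state < len(KEYWORD) and ch == KEYWORD[state]:
        return state + 1
    if state == 0:
        return "ident" if ch in LETTERS else None  # must start with a letter
    return "ident" if ch in ALPHANUMERIC else None


def classify_word(word):
    """Run the DFA over `word`; end of input stands in for the WS edge."""
    state = 0
    for ch in word:
        state = step(state, ch)
    if state == len(KEYWORD):
        return "CLASS"
    if state == "ident" or state in (1, 2, 3, 4):
        return "IDENT"  # proper prefixes of "class" are ordinary identifiers
    return None


for word in ["class", "classes", "cl", "x9", "9x"]:
    print(word, classify_word(word))
```

Even for one keyword plus identifiers the transition function needs several special cases, which hints at why writing such automata by hand does not scale.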


Lexical Analyzer Generators

- We like regular languages as a means to categorize lexemes into tokens
  - We don’t like the complexity of implementing a DFA manually
- We use Regular Expressions to describe regular languages
  - And our tokens are recognizable as regular languages!
- Regular Expressions can be automatically turned into a DFA for rapid lexing!


Languages Review

- Definition: Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ. Σ is called the alphabet.


Examples of Languages

- Alphabet = English Characters
  - Language = English Sentences
  - Note: Not every string of English characters is an English sentence
  - Example: adsfasdklg gdsajkl
- Alphabet = ASCII characters
  - Language = C Programs
  - Note: ASCII character set is different from English character set


Notation

- Languages are sets of strings
- We need some notation for specifying which sets we want
  - i.e., which strings are in a set?
- For lexical analysis, we care about regular languages, which can be described using regular expressions


Regular Expressions

- Each regular expression is a notation for a regular language (a set of “words”)
  - Notation forthcoming!
- If A is a regular expression, we write L(A) to refer to the language denoted by A


Base Regular Expressions

- Single character: ‘c’
  - L(‘c’) = { ‘c’ }
- Concatenation: AB
  - A and B are both regular expressions
  - L(AB) = { ab | a ∈ L(A) and b ∈ L(B) }
  - Example: L(‘i’ ‘f’) = { ‘if’ }


Compound Regular Expressions

- Union
  - L(A|B) = { s | s ∈ L(A) or s ∈ L(B) }
- Examples
  - L(‘if’ | ‘then’ | ‘else’) = { ‘if’, ‘then’, ‘else’ }
  - L(‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’) = what?
  - L( (‘0’|‘1’) (‘0’|‘1’) ) = what?
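
One way to experiment with these definitions is to model a finite language directly as a Python set of strings. The helper names `char`, `concat`, and `union` below are made up for this sketch; they mirror the set definitions of L(‘c’), L(AB), and L(A|B).

```python
def char(c):
    """L('c') = { 'c' }"""
    return {c}


def concat(lang_a, lang_b):
    """L(AB) = { ab | a in L(A) and b in L(B) }"""
    return {a + b for a in lang_a for b in lang_b}


def union(lang_a, lang_b):
    """L(A|B) = { s | s in L(A) or s in L(B) }"""
    return lang_a | lang_b


bit = union(char("0"), char("1"))
print(concat(char("i"), char("f")))  # {'if'}
print(sorted(concat(bit, bit)))      # ['00', '01', '10', '11']
```

This only works because every language built from single characters, concatenation, and union is finite, so it can be written out in full.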


Starz!

- So far, base and compound regular expressions only describe finite languages
- Iteration: A*
  - L(A*) = {“”} ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ ...
- Examples
  - L(‘0’*) = {“”, “0”, “00”, “000”, ...}
  - L(‘1’ ‘0’*) = {“1”, “10”, “100”, “1000”, ...}
- Empty: ε
  - L(ε) = {“”}
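
Because L(A*) is infinite whenever A contains a non-empty string, it cannot be written out in full, but it can be enumerated up to a length bound. The function name `star_up_to` and the bound `max_len` are assumptions of this sketch.

```python
def star_up_to(lang, max_len):
    """All strings of L(A*) with length <= max_len, for a finite language `lang`."""
    result = {""}    # zero repetitions: the empty string
    frontier = {""}  # strings discovered in the previous round
    while frontier:
        frontier = {
            s + w for s in frontier for w in lang if len(s + w) <= max_len
        } - result
        result |= frontier
    return result


zeros = star_up_to({"0"}, 3)
print(sorted(zeros))                   # ['', '0', '00', '000']
print(sorted("1" + s for s in zeros))  # ['1', '10', '100', '1000']
```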


Example: Keyword

- Keywords: “else” or “if” or “fi”

    ‘else’ | ‘if’ | ‘fi’

(Recall that ‘else’ abbreviates concatenation of ‘e’ ‘l’ ‘s’ ‘e’)


Example: Integer

- Integer: a non-empty string of digits

    digit = ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’
    number = digit digit*

- Abbreviation: A+ = AA*


Example: Identifiers

- Identifier: string of letters or digits, starting with a letter

    letter = ‘A’ | ... | ‘Z’ | ‘a’ | ... | ‘z’
    ident = letter (letter | digit)*

- Is (letter*|digit*) the same?
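
A quick way to explore the question is to compare the two expressions on a few strings with Python's `re` module. The character classes below are the usual ASCII shorthand for letter and digit, and `in_language` is a hypothetical helper name.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")          # letter (letter | digit)*
LETTERS_OR_DIGITS = re.compile(r"[A-Za-z]*|[0-9]*")  # (letter* | digit*)


def in_language(pattern, s):
    """True when the whole string s is in the language of pattern."""
    return pattern.fullmatch(s) is not None


for s in ["abc", "x1", "123", ""]:
    print(repr(s), in_language(IDENT, s), in_language(LETTERS_OR_DIGITS, s))
```

The two expressions agree on "abc" but disagree on "x1", "123", and the empty string, so they do not describe the same language.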


Example: Whitespace

- Whitespace: a non-empty sequence of blanks, newlines, and tabs

( ‘ ’ | ‘\t’ | ‘\n’ | ‘\r’ ) +


Example: Phone Numbers

Regular expressions are everywhere!
Consider: (123)-234-4567

    Σ      = { 0, 1, 2, ..., 9, (, ), - }
    area   = digit digit digit
    exch   = digit digit digit
    phone  = digit digit digit digit
    number = ‘(’ area ‘)’ ‘-’ exch ‘-’ phone
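
The same definitions can be transcribed almost one-for-one into Python's `re` syntax. This is a sketch: the constant names follow the slide, while `is_phone_number` is a made-up helper.

```python
import re

DIGIT = r"[0-9]"
AREA = DIGIT * 3   # digit digit digit (string repetition builds the pattern)
EXCH = DIGIT * 3   # digit digit digit
PHONE = DIGIT * 4  # digit digit digit digit
NUMBER = re.compile(rf"\({AREA}\)-{EXCH}-{PHONE}")  # '(' area ')' '-' exch '-' phone


def is_phone_number(s):
    """True when the whole string s is in L(number)."""
    return NUMBER.fullmatch(s) is not None


print(is_phone_number("(123)-234-4567"))  # True
print(is_phone_number("123-234-4567"))    # False: parentheses are required
```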


Example: Email addresses

Consider [email protected]

    Σ       = { a, b, c, ..., z, ‘.’, ‘@’ }
    name    = letter+
    address = name ‘@’ name (‘.’ name)*
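
A hedged Python transcription of this grammar follows. The address on the slide is obscured in this copy, so the test strings are hypothetical, as is the helper name `is_address`. This simplified grammar is far narrower than real email syntax.

```python
import re

NAME = r"[a-z]+"  # name = letter+, over the lowercase alphabet given above
ADDRESS = re.compile(rf"{NAME}@{NAME}(\.{NAME})*")  # name '@' name ('.' name)*


def is_address(s):
    """True when the whole string s is in L(address)."""
    return ADDRESS.fullmatch(s) is not None


print(is_address("user@example.com"))  # True
print(is_address("user@example"))      # True: (‘.’ name)* allows zero repetitions
print(is_address("user@example."))     # False: a name must follow each ‘.’
```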


Regular Expression Summary

- Regular expressions describe many useful languages
- Given a string s and a regexp R, we can find if s ∈ L(R)
- Is this enough?
  - NO! Recall we need the original lexeme!
- We must adapt regular expressions to this goal







Next time

- Specifying lexical structure using regular expressions
- Finite automata
  - Deterministic Finite Automata
  - Nondeterministic Finite Automata
- Implementation of Regular Expressions
  - Regexp → NFA → DFA → lookup table
- Lexical Analyzer Generation (i.e., doing this all automatically)


Lexical Specification (1)

- Start with a set of tokens (protip, PA2 lists them for Cool)
- Write a regular expression for the lexemes representing each token
  - Number = digit+
  - IF = “if”
  - ELSE = “else”
  - IDENT = letter ( letter | digit )*
  - ...



- Construct R, matching all lexemes for all tokens
  - R = Number | IF | ELSE | IDENT | ...
  - R = R1 | R2 | R3 | ...
- If s ∈ L(R), then s is a lexeme
  - Also s ∈ L(Rj) for some j
  - The particular j corresponds to the type of token reported by lexer



- For an input x1, ..., xn
  - Each xi ∈ Σ
- For each 1 ≤ i ≤ n, check
  - is x1...xi ∈ L(R)?
- If so, it must be that x1...xi ∈ L(Rj) for some j
  - Remove x1...xi from input and restart
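
The loop above can be sketched directly in Python. The rule set is a small hypothetical one (Whitespace, Integer, Identifier, Plus), and the steps do not say which i to take when several prefixes match; this sketch takes the smallest, which is an assumption.

```python
import re

# Hypothetical rules R1..R4; R is their union.
RULES = [
    ("Whitespace", re.compile(r"[ \t\n\r]+")),
    ("Integer", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Plus", re.compile(r"\+")),
]


def matching_rule(prefix):
    """Return the name of the first rule Rj with prefix in L(Rj), else None."""
    for name, pattern in RULES:
        if pattern.fullmatch(prefix):
            return name
    return None


def lex_shortest_prefix(text):
    """Repeatedly remove the shortest prefix x1...xi that is in L(R)."""
    tokens = []
    while text:
        for i in range(1, len(text) + 1):
            name = matching_rule(text[:i])
            if name is not None:
                tokens.append((name, text[:i]))
                text = text[i:]  # remove x1...xi from the input and restart
                break
        else:
            raise ValueError(f"no prefix of {text!r} matches R")
    return tokens


print(lex_shortest_prefix("a + 1"))
print(lex_shortest_prefix("ab"))  # the shortest-prefix choice splits "ab" in two
```

Splitting "ab" into two identifiers shows that the choice of i matters once lexemes are longer than one character.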


Example Lexing

R = Whitespace | Integer | Identifier | Plus
Lex “f + 3 + g”

- “f” matches R (more specifically, Identifier)
- “ ” matches R (more specifically, Whitespace)
- ...

What does the lexer output look like for this example?


Ambiguities

- Ambiguities arise in this algorithm
- R = Whitespace | Integer | Identifier | Plus
- Lex “foo+3”
  - “f”, “fo”, and “foo” all match R, but not “foo+”
  - How much input do we consume?
- Maximal munch rule: pick the longest possible substring that matches R
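
A minimal Python sketch of maximal munch, assuming the same four hypothetical rules: try every prefix and keep the longest one that matches R. The names `rule_for` and `lex_maximal_munch` are made up here. Checking every prefix is quadratic and only meant to make the rule explicit.

```python
import re

# Hypothetical rules; R = Whitespace | Integer | Identifier | Plus.
RULES = [
    ("Whitespace", re.compile(r"[ \t\n\r]+")),
    ("Integer", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Plus", re.compile(r"\+")),
]


def rule_for(lexeme):
    """Return the name of the first rule whose language contains lexeme."""
    for name, pattern in RULES:
        if pattern.fullmatch(lexeme):
            return name
    return None


def lex_maximal_munch(text):
    tokens = []
    while text:
        best = None
        for i in range(1, len(text) + 1):
            name = rule_for(text[:i])
            if name is not None:
                best = (name, text[:i])  # overwritten by every longer match
        if best is None:
            raise ValueError(f"no prefix of {text!r} matches R")
        tokens.append(best)
        text = text[len(best[1]):]
    return tokens


print(lex_maximal_munch("foo+3"))
# [('Identifier', 'foo'), ('Plus', '+'), ('Integer', '3')]
```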


Ambiguities (2)

R = Whitespace | ‘new’ | Integer | Identifier | Plus

- Lex “new foo”
  - “new” matches both the ‘new’ rule and the ‘Identifier’ rule; which one do you pick?
- Generally, pick the rule listed first
  - Arbitrary, but typical
  - Important for PA2!

‘new’ was listed before ‘Identifier’, so the token given is ‘new’
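
Both tie-breaking rules can be combined in a short Python sketch: prefer the longest match, and among equally long matches prefer the rule listed first. The rule patterns and the names `next_token` and `lex` are assumptions. The sketch relies on each pattern being simple enough that `re`'s greedy matching finds that rule's longest match, which is not true of arbitrary regular expressions.

```python
import re

# Rule order encodes priority: 'new' is listed before Identifier.
RULES = [
    ("Whitespace", re.compile(r"[ \t\n\r]+")),
    ("new", re.compile(r"new")),
    ("Integer", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Plus", re.compile(r"\+")),
]


def next_token(text, position):
    """Return (rule_name, lexeme, end) for the best match at position."""
    best_name, best_end = None, position
    for name, pattern in RULES:
        match = pattern.match(text, position)
        # Strict '>' keeps the earlier rule when two rules match the same length.
        if match is not None and match.end() > best_end:
            best_name, best_end = name, match.end()
    if best_name is None:
        raise ValueError(f"unexpected character {text[position]!r}")
    return best_name, text[position:best_end], best_end


def lex(text):
    tokens, position = [], 0
    while position < len(text):
        name, lexeme, position = next_token(text, position)
        tokens.append((name, lexeme))
    return tokens


print(lex("new foo"))  # [('new', 'new'), ('Whitespace', ' '), ('Identifier', 'foo')]
print(lex("newt"))     # [('Identifier', 'newt')]
```

The second call shows the two rules interacting: "newt" is a longer match for Identifier than "new" is for the keyword rule, so maximal munch decides before rule order is consulted.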


Summary

- Regular expressions provide a concise notation for string patterns
- We need to adapt them for lexical analysis to
  - Resolve ambiguities
  - Handle errors (report line numbers)
- Next time, Lexical Analysis Generators



