Lexical Analysis
Lecture 3
January 10, 2018
Announcements
• PA1c due tonight at 11:50pm!
• Don’t forget about PA1, the Cool implementation!
• Use Monday’s lecture, the video guides and Cool examples if you’re stuck with Cool!
Compiler Construction 2/39
Programming Assignments Going Forward
• C was allowed for PA1, but not for PA2 through PA6
• How comfortable are we with other languages?
  • Python, Haskell, Ruby, OCaml, and JavaScript
Lexical Analysis Summary
• Lexical analysis turns a stream of characters into a stream of tokens
• Regular expressions are a way to specify sets of strings, which we use to describe tokens
[Figure: class Main { ... → Lexical Analyzer → CLASS, IDENT, LBRACE, ...]
Cunning Plan
• Informal Sketch of Lexical Analysis
  • LA identifies tokens from input string
  • List<Token> lexer ( char[] )
• Issues in Lexical Analysis
  • Lookahead
  • Ambiguity
• Specifying Lexers
  • Regular Expressions
  • Examples
Definitions
• Token: a set of strings defining an atomic element with a distinct meaning
  • a syntactic category
  • In English:
    • noun, verb, adjective
  • In Programming:
    • identifier, integer, keyword, whitespace, ...
• Lexeme: a sequence of characters that can be categorized as a Token
Tokens and Lexemes
Token    Lexeme
CLASS    class
LT       <
FALSE    false
IDENT    variable_name
By the way, what do you think of Cool’s fi, pool, esac...?
Context for Lexers
• Lexing and Parsing go hand-in-hand
• Parser uses distinctions between tokens
  • e.g., a keyword is treated differently than an identifier
[Figure: input → Lexer → Parser; labels: get_char(), character, get_token()]
Lexical Analysis
• Consider this example:
    if(i=j) then
        z<-0
    else
        z<-1
• The input is simply a sequence of characters:
    if(i=j) then\n\tz<-0\nelse ...
• Goal: partition the input string into substrings
  • Then, classify them according to their role (tokenize!)
Tokens
• Tokens correspond to sets of strings
  • Identifier: strings of letters or digits, starting with a letter
  • Integer: a non-empty string of digits
  • Keyword: “else” or “class” or “let” ...
  • Whitespace: a non-empty sequence of blanks, newlines, and/or tabs
  • OpenParen: a left parenthesis (
  • CloseParen: a right parenthesis )
Building a Lexical Analyzer
• Lexer implementation must do three things:
  1. Recognize substrings corresponding to tokens
  2. Return the value or lexeme of the token
  3. Report errors intelligently (line numbers for Cool)
Lexical Analyzer Implementation
• Lexer usually discards “uninteresting” tokens that don’t contribute to parsing
  • Examples: Whitespace, comments
  • Exceptions: Which languages care about whitespace?
• Review: What would happen if we removed all whitespace and comments before lexing?
Example
• Recall:
    if (i = j) then
        z<-0
    else
        z<-1
• Our Cool Lexer would return token-lexeme-linenumber tuples:
    <IF, “if”, 1>
    <WHITESPACE, “ ”, 1>
    <OPENPAREN, “(”, 1>
    <IDENTIFIER, “i”, 1>
    <WHITESPACE, “ ”, 1>
    <EQUALS, “=”, 1>
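A minimal sketch of a lexer that produces token-lexeme-line-number tuples like the ones listed above. It is not the PA2 reference lexer: the token names, the patterns, and the helper `lex` are assumptions chosen to fit this one example, and it relies on Python's `re` module trying alternatives in the order listed.

```python
import re

# Hypothetical token names and patterns, chosen only to cover the slide's example.
# Keywords come first and end in \b so that "iffy" still lexes as an identifier.
TOKEN_SPEC = [
    ("IF",         r"if\b"),
    ("THEN",       r"then\b"),
    ("ELSE",       r"else\b"),
    ("WHITESPACE", r"[ \t\r\n]+"),
    ("OPENPAREN",  r"\("),
    ("CLOSEPAREN", r"\)"),
    ("ASSIGN",     r"<-"),
    ("EQUALS",     r"="),
    ("INTEGER",    r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    """Return a list of (token, lexeme, line) tuples, or raise on a bad character."""
    tokens = []
    line = 1
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            # Report errors with a line number, as the slides ask for Cool.
            raise SyntaxError(f"line {line}: unexpected character {source[pos]!r}")
        tokens.append((m.lastgroup, m.group(), line))
        line += m.group().count("\n")  # a token is tagged with the line it starts on
        pos = m.end()
    return tokens

tokens = lex("if (i = j) then\n\tz<-0\nelse\n\tz<-1")
for token in tokens[:6]:
    print(token)
```

This version keeps WHITESPACE tokens so the output lines up with the slide; a real lexer would usually drop them before handing tokens to the parser.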
Lexing Considerations
• The goal is to partition the input string into meaningful tokens.
  • Scan left to right (i.e., in order)
  • Recognize tokens
• We really need a way to describe the lexemes associated with each token
• And also a way to handle ambiguities
  • Is “if” two variables, “i” and “f”?
Regular Languages
• Sounds like we can use DFAs to recognize lexemes
  • With accepting states corresponding to tokens!
Example: Capture the word “class”
[Figure: DFA with transitions c, l, a, s, s, WS]
Example: Capture some variable name
[Figure: DFA with transitions A, AN, WS, where A = letter, AN = alphanumeric, WS = whitespace]
Capturing Multiple Tokens
What about both “class” and variable names?
[Figure: combined DFA with accepting states 1 and 2; edge labels: c, l, a, s, s, WS; A-c, WS; WS, AN]
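One way to read the combined automaton is as a hand-written DFA that tracks progress through “class” and falls back to an identifier state as soon as the input diverges. The sketch below is an assumption about the intended machine, not a transcription of the figure: the state names, `step`, and `classify` are hypothetical, letters are limited to ASCII, and the end of the word stands in for the WS transition.

```python
import string

KEYWORD = "class"
LETTERS = set(string.ascii_letters)
ALNUM = LETTERS | set(string.digits)

def step(state, ch):
    """One DFA transition.

    States 0..5 count how many characters of "class" have been matched,
    "ident" is an identifier that has diverged from "class",
    and None is the dead (reject) state.
    """
    if state is None:
        return None
    if isinstance(state, int) and state < len(KEYWORD) and ch == KEYWORD[state]:
        return state + 1
    if state == 0:
        # The first character of an identifier must be a letter.
        return "ident" if ch in LETTERS else None
    # Later characters may be letters or digits.
    return "ident" if ch in ALNUM else None

def classify(word):
    """Run the DFA over a whitespace-free word and name the accepting state."""
    state = 0
    for ch in word:
        state = step(state, ch)
    if state == len(KEYWORD):
        return "CLASS"
    if state == "ident" or (isinstance(state, int) and state > 0):
        return "IDENT"  # includes proper prefixes of "class" such as "cla"
    return None

for word in ["class", "classy", "cla", "x1", "1x"]:
    print(word, classify(word))
```

Even for two token kinds the transition logic is fiddly, which is the motivation for generating the DFA from regular expressions instead of writing it by hand.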
Lexical Analyzer Generators
• We like regular languages as a means to categorize lexemes into tokens
  • We don’t like the complexity of implementing a DFA manually
• We use Regular Expressions to describe regular languages
  • And our tokens are recognizable as regular languages!
• Regular Expressions can be automatically turned into a DFA for rapid lexing!
Languages Review
• Definition: Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ. Σ is called the alphabet.
Examples of Languages
• Alphabet = English Characters
• Language = English Sentences
  • Note: Not every string of English characters is an English sentence
  • Example: adsfasdklg gdsajkl
• Alphabet = ASCII characters
• Language = C Programs
  • Note: ASCII character set is different from English character set
Notation
• Languages are sets of strings
• We need some notation for specifying which sets we want
  • i.e., which strings are in a set?
• For lexical analysis, we care about regular languages, which can be described using regular expressions
Regular Expressions
• Each regular expression is a notation for a regular language (a set of “words”)
  • Notation forthcoming!
• If A is a regular expression, we write L(A) to refer to the language denoted by A
Base Regular Expressions
• Single character: ‘c’
  • L(‘c’) = { ‘c’ }
• Concatenation: AB
  • A and B are both regular expressions
  • L(AB) = { ab | a ∈ L(A) and b ∈ L(B) }
  • Example: L(‘i’ ‘f’) = { ‘if’ }
Compound Regular Expressions
• Union
  • L(A|B) = { s | s ∈ L(A) or s ∈ L(B) }
• Examples
  • L(‘if’ | ‘then’ | ‘else’) = { ‘if’, ‘then’, ‘else’ }
  • L(‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’) = what?
  • L( (‘0’|‘1’) (‘0’|‘1’) ) = what?
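For finite languages, the definitions of concatenation and union can be tried out directly with Python sets. This is only an illustration of the set-theoretic definitions; the helper names `concat` and `union` are made up here and are not part of any regex library.

```python
def concat(A, B):
    """L(AB) = { ab | a in L(A) and b in L(B) }"""
    return {a + b for a in A for b in B}

def union(A, B):
    """L(A|B) = { s | s in L(A) or s in L(B) }"""
    return A | B

# L('i' 'f')
print(concat({"i"}, {"f"}))

# L('if' | 'then' | 'else')
keywords = union(union({"if"}, {"then"}), {"else"})
print(sorted(keywords))

# L('0'|'1'|...|'9'): the union of ten single-character languages
digit = set()
for c in "0123456789":
    digit = union(digit, {c})
print(sorted(digit))

# L( ('0'|'1') ('0'|'1') ): every two-character string over {0, 1}
bit = union({"0"}, {"1"})
two_bits = concat(bit, bit)
print(sorted(two_bits))
```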
Starz!
• So far, base and compound regular expressions only describe finite languages
• Iteration: A*
  • L(A*) = {“”} ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ ...
• Examples
  • L(‘0’*) = {“”, “0”, “00”, “000”, ...}
  • L(‘1’ ‘0’*) = {“1”, “10”, “100”, “1000”, ...}
• Empty: ε
  • L(ε) = {“”}
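L(A*) is infinite whenever L(A) contains a non-empty string, so code can only enumerate a finite slice of it. The sketch below builds the union of the first few terms; `star_up_to` and its bound `n` are hypothetical names introduced for illustration.

```python
def star_up_to(A, n):
    """Return the strings of L(A*) formed by concatenating at most n strings of L(A)."""
    result = {""}   # zero copies: the empty string
    layer = {""}    # strings made of exactly k copies, starting with k = 0
    for _ in range(n):
        layer = {s + a for s in layer for a in A}
        result |= layer
    return result

# L('0'*), truncated to at most three copies of '0'
zeros = star_up_to({"0"}, 3)
print(sorted(zeros, key=len))

# L('1' '0'*), using the same truncation
one_then_zeros = {"1" + s for s in zeros}
print(sorted(one_then_zeros, key=len))
```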
Example: Keyword
• Keywords: “else” or “if” or “fi”
  ‘else’ | ‘if’ | ‘fi’
  (Recall that ‘else’ abbreviates concatenation of ‘e’ ‘l’ ‘s’ ‘e’)
Example: Integer
• Integer: a non-empty string of digits
  digit = ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’
  number = digit digit*
• Abbreviation: A+ = AA*
Example: Identifiers
• Identifier: string of letters or digits, starting with a letter
  letter = ‘A’ | ... | ‘Z’ | ‘a’ | ... | ‘z’
  ident = letter (letter | digit)*
• Is (letter*|digit*) the same?
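The closing question can be explored with Python's `re` module. The patterns below are one translation of the slide's notation, assuming ASCII letters only; `accepts` is a hypothetical helper that tests whole-string membership with `fullmatch`.

```python
import re

# ident = letter (letter | digit)*
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# (letter* | digit*): all letters, or all digits, possibly empty
LETTERS_OR_DIGITS = re.compile(r"(?:[A-Za-z]*|[0-9]*)")

def accepts(pattern, s):
    """True when the whole string s is in the language of pattern."""
    return pattern.fullmatch(s) is not None

for s in ["x1", "abc", "123", ""]:
    print(repr(s), accepts(IDENT, s), accepts(LETTERS_OR_DIGITS, s))
```

The two languages differ in both directions: “x1” is an identifier but mixes letters and digits, while “123” and the empty string satisfy (letter*|digit*) without being identifiers.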
Example: Whitespace
• Whitespace: a non-empty sequence of blanks, newlines, and tabs
  ( ‘ ’ | ‘\t’ | ‘\n’ | ‘\r’ )+
Example: Phone Numbers
Regular expressions are everywhere!
Consider: (123)-234-4567
Σ      = { 0, 1, 2, ..., 9, (, ), - }
area   = digit digit digit
exch   = digit digit digit
phone  = digit digit digit digit
number = ‘(’ area ‘)’ ‘-’ exch ‘-’ phone
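The phone-number definition translates almost line for line into Python's `re` syntax. This is a sketch of that translation; the constant names mirror the slide, and `is_phone_number` is a hypothetical helper.

```python
import re

DIGIT = r"[0-9]"
AREA = DIGIT * 3    # digit digit digit
EXCH = DIGIT * 3    # digit digit digit
PHONE = DIGIT * 4   # digit digit digit digit

# number = '(' area ')' '-' exch '-' phone
NUMBER = re.compile(r"\(" + AREA + r"\)-" + EXCH + "-" + PHONE)

def is_phone_number(s):
    """True when the whole string s is in L(number)."""
    return NUMBER.fullmatch(s) is not None

print(is_phone_number("(123)-234-4567"))
print(is_phone_number("123-234-4567"))
```

Building the pattern by repeating `DIGIT` keeps the code close to the slide; in practice the counted form `[0-9]{3}` would be the usual shorthand.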
Example: Email addresses
Consider [email protected]
Σ       = { a, b, c, ..., z, ‘.’, ‘@’ }
name    = letter+
address = name ‘@’ name (‘.’ name)*
Regular Expression Summary
• Regular expressions describe many useful languages
• Given a string s and a regexp R, we can decide whether s ∈ L(R)
• Is this enough?
  • NO: recall we need the original lexeme!
• We must adapt regular expressions to this goal
Next time
• Specifying lexical structure using regular expressions
• Finite automata
  • Deterministic Finite Automata
  • Nondeterministic Finite Automata
• Implementation of Regular Expressions
  • Regexp → NFA → DFA → lookup table
• Lexical Analyzer Generation (i.e., doing this all automatically)
Lexical Specification (1)
• Start with a set of tokens (protip: PA2 lists them for Cool)
• Write a regular expression for the lexemes representing each token
  • Number = digit+
  • IF = “if”
  • ELSE = “else”
  • IDENT = letter ( letter | digit )*
  • ...
Lexical Specification (2)
• Construct R, matching all lexemes for all tokens
  • R = Number | IF | ELSE | IDENT | ...
  • R = R1 | R2 | R3 | ...
• If s ∈ L(R), then s is a lexeme
  • Also s ∈ L(Rj) for some j
  • The particular j corresponds to the type of token reported by lexer
Lexical Specification (3)
• For an input x1, ..., xn
  • Each xi ∈ Σ
• For each 1 ≤ i ≤ n, check
  • is x1...xi ∈ L(R)?
• If so, it must be that x1...xi ∈ L(Rj) for some j
  • Remove x1...xi from input and restart
Example Lexing
R = Whitespace | Integer | Identifier | Plus
Lex “f + 3 + g”
• “f” matches R (more specifically, Identifier)
• “ ” matches R (more specifically, Whitespace)
  ...
What does the lexer output look like for this example?
Ambiguities
• Ambiguities arise in this algorithm
• R = Whitespace | Integer | Identifier | Plus
• Lex “foo+3”
  • “f”, “fo”, and “foo” all match R, but not “foo+”
  • How much input do we consume?
• Maximal munch rule: pick the longest possible substring that matches R
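The prefix check and the maximal munch rule can be tried directly on “foo+3”. The sketch below tests every prefix against R with `re.fullmatch`, which is quadratic and only meant to mirror the description; the concrete patterns for Whitespace, Integer, Identifier, and Plus are assumptions, as are the helper names.

```python
import re

# R = Whitespace | Integer | Identifier | Plus  (assumed concrete patterns)
R = re.compile(r"[ \t\n]+|[0-9]+|[A-Za-z][A-Za-z0-9]*|\+")

def matching_prefixes(text):
    """All non-empty prefixes x1...xi of text that are in L(R), shortest first."""
    return [text[:i] for i in range(1, len(text) + 1) if R.fullmatch(text[:i])]

def maximal_munch(text):
    """The longest prefix of text that is in L(R), or None if no prefix matches."""
    prefixes = matching_prefixes(text)
    return prefixes[-1] if prefixes else None

print(matching_prefixes("foo+3"))
print(maximal_munch("foo+3"))
```

After consuming “foo”, the same procedure restarts on the remaining input “+3”.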
Ambiguities (2)
R = Whitespace | ‘new’ | Integer | Identifier | Plus
• Lex “new foo”
  • “new” matches both the ‘new’ rule and the ‘Identifier’ rule; which one do you pick?
• Generally, pick the rule listed first
  • Arbitrary, but typical
  • Important for PA2!
‘new’ was listed before ‘Identifier’, so the token given is ‘new’
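Both disambiguation rules fit in one small loop: take the longest match, and break ties in favour of the rule listed first. This is a sketch under assumed token names and patterns, not the PA2 specification; it also depends on each pattern being simple enough that Python's greedy matching returns that rule's longest match.

```python
import re

# Rules in priority order; 'new' is listed before Identifier, as on the slide.
RULES = [
    ("WHITESPACE", re.compile(r"[ \t\n]+")),
    ("NEW",        re.compile(r"new")),
    ("INTEGER",    re.compile(r"[0-9]+")),
    ("IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("PLUS",       re.compile(r"\+")),
]

def lex(text):
    """Return (token, lexeme) pairs using maximal munch, then rule order."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(lex("new foo"))
print(lex("newfoo"))
```

Note the interaction between the two rules: “new” alone ties between NEW and IDENTIFIER and goes to NEW, while “newfoo” is a single IDENTIFIER because the longer match wins before rule order is consulted.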
Summary
• Regular expressions provide a concise notation for string patterns
• We need to adapt them for lexical analysis to
  • Resolve ambiguities
  • Handle errors (report line numbers)
• Next time, Lexical Analysis Generators