
The Front End

Date post: 09-Jan-2016
Upload: phuc
Description:
The Front End. The purpose of the front end is to deal with the input language: perform a membership test (code ∈ source language?), check whether the program is well-formed (semantically), and build an IR version of the code for the rest of the compiler.
Transcript
Page 1: The Front End

1

The Front End

The purpose of the front end is to deal with the input language

• Perform a membership test: code ∈ source language?

• Is the program well-formed (semantically) ?

• Build an IR version of the code for the rest of the compiler

The front end deals with form (syntax) & meaning (semantics)

[Diagram: Source code → Front End → IR → Back End → Machine code, with Errors reported along the way]

Page 2: The Front End

2

The Front End

Implementation Strategy

[Diagram: Source code → Scanner → tokens → Parser → IR, with Errors reported along the way]

                        Scanning                          Parsing
Specify Syntax          regular expressions               context-free grammars
Implement Recognizer    deterministic finite automaton    push-down automaton
Perform Work            Actions on transitions in automaton

Page 3: The Front End

3

The Front End

Why separate the scanner and the parser?

• Scanner classifies words

• Parser constructs grammatical derivations

• Parsing is harder and slower

Separation simplifies the implementation

• Scanners are simple

• Scanner leads to a faster, smaller parser

A token is a pair <part of speech, lexeme>

[Diagram: stream of characters → Scanner (microsyntax) → stream of tokens → Parser (syntax) → IR + annotations, with Errors reported along the way]

The scanner is the only pass that touches every character of the input.

Page 4: The Front End

4

The Big Picture

The front end deals with syntax

• Language syntax is specified with parts of speech, not words

• Syntax checking matches parts of speech against a grammar

1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op   → +
7.      | –

S = goal

T = { number, id, +, - }    (parts of speech)

N = { goal, expr, term, op }    (syntactic variables)

P = { 1, 2, 3, 4, 5, 6, 7 }

Simple expression grammar

The scanner turns a stream of characters into a stream of words, and classifies them with their part of speech.
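
A minimal sketch of that idea, using Python's standard `re` module. The category names (`number`, `id`, `plus`, `minus`, `skip`) and the function name `scan` are assumptions made for illustration; the slides do not prescribe them.

```python
import re

# Hypothetical token specification: one named pattern per part of speech.
TOKEN_SPEC = [
    ("number", r"[0-9]+"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("plus",   r"\+"),
    ("minus",  r"-"),
    ("skip",   r"\s+"),          # blanks separate words but are not tokens
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Return a list of (part of speech, lexeme) pairs for the whole input."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, `scan("x + 42 - y")` yields pairs for `id`, `plus`, `number`, `minus`, `id`, which is the stream of parts of speech that the grammar's terminals describe.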

Page 5: The Front End

5

The Big Picture

Why study automatic scanner construction?

• Avoid writing scanners by hand

• Harness theory

Goals:

• To simplify specification & implementation of scanners

• To understand the underlying techniques and technologies

[Diagram: at design time, specifications (written as “regular expressions”) → Scanner Generator → tables or code; at compile time, source code → Scanner → parts of speech & words. Words are represented as indices into a global table.]

Page 6: The Front End

6

Regular Expressions

We constrain programming languages so that the spelling of a word always implies its part of speech

The rules that impose this mapping form a regular language

Regular expressions (REs) describe regular languages

Regular Expression (over alphabet Σ)

• ε is a RE denoting the set {ε}

• If a is in Σ, then a is a RE denoting {a}

• If x and y are REs denoting L(x) and L(y) then
— x | y is an RE denoting L(x) ∪ L(y)
— xy is an RE denoting L(x)L(y)
— x* is an RE denoting L(x)*

Precedence is closure, then concatenation, then alternation
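
The precedence rule can be checked with a small experiment. This sketch uses Python's `re` module, which applies the same ordering to these three operators; the pattern `a|bc*` and the helper name `in_language` are illustrative choices.

```python
import re

# a|bc* groups as a | (b (c*)): closure binds tightest, then concatenation,
# then alternation.
PATTERN = re.compile(r"a|bc*")

def in_language(word):
    """True if the whole word is in the language denoted by a|bc*."""
    return PATTERN.fullmatch(word) is not None
```

Under this grouping the language is { a, b, bc, bcc, … }. It contains neither "ac" (which (a|b)c* would accept) nor "bcbc" (which (a|bc)* would accept).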

Page 7: The Front End

7

Regular Expressions

How do these operators help?

Regular Expression (over alphabet Σ)

• ε is a RE denoting the set {ε}

• If a is in Σ, then a is a RE denoting {a}
    the spelling of any specific word is an RE

• If x and y are REs denoting L(x) and L(y) then
— x | y is an RE denoting L(x) ∪ L(y)
    any finite list of words can be written as an RE ( w0 | w1 | … | wn )
— xy is an RE denoting L(x)L(y)
— x* is an RE denoting L(x)*
    we can use concatenation & closure to write more concise patterns and to specify infinite sets that have finite descriptions

Page 8: The Front End

8

Examples of Regular Expressions

Identifiers:
Letter → (a|b|c| … |z|A|B|C| … |Z)
Digit → (0|1|2| … |9)
Identifier → Letter ( Letter | Digit )*

Numbers:
Integer → (+|-|ε) (0| (1|2|3| … |9)(Digit *) )
Decimal → Integer . Digit *
Real → ( Integer | Decimal ) E (+|-|ε) Digit *
Complex → ( Real , Real )

Numbers can get much more complicated!

Underlining indicates a letter in the input stream
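
These definitions translate almost directly into the notation of Python's `re` module. The sketch below is one such transcription; the constant names and the helper `matches` are assumptions, and `[+-]?` stands for the optional sign in the Integer and Real rules.

```python
import re

DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"
IDENTIFIER = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
INTEGER = rf"[+-]?(?:0|[1-9]{DIGIT}*)"          # optional sign, no leading zeros
DECIMAL = rf"{INTEGER}\.{DIGIT}*"
REAL = rf"(?:{DECIMAL}|{INTEGER})E[+-]?{DIGIT}*"

def matches(pattern, word):
    """True if the whole word is in the language of the pattern."""
    return re.fullmatch(pattern, word) is not None
```

Note that the rules as written allow an empty fraction ("3.") and an empty exponent, because both use Digit* rather than Digit Digit*.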

Page 9: The Front End

9

Regular Expressions

We use regular expressions to specify the mapping of words to parts of speech for the lexical analyzer

Using results from automata theory and theory of algorithms, we can automate construction of recognizers from REs

We study REs and associated theory to automate scanner construction!

Fortunately, the automatic techniques lead to fast scanners, which are also used in text editors, URL filtering software, …

Page 10: The Front End

10

Example

Consider the problem of recognizing ILOC register names

Register → r (0|1|2| … |9) (0|1|2| … |9)*

• Allows registers of arbitrary number

• Requires at least one digit

RE corresponds to a recognizer (or DFA)

[Diagram: Recognizer for Register. S0 -r→ S1 -(0|1|2| … 9)→ S2, with a (0|1|2| … 9) self-loop on S2]

Transitions on other inputs go to an error state, se

Page 11: The Front End

11

Example (continued)

DFA operation

• Start in state s0 & make transitions on each input character

• DFA accepts a word x iff x leaves it in a final state (s2)

So,

• r17 takes it through s0, s1, s2 and accepts

• r takes it through s0, s1 and fails

• a takes it straight to se

[Diagram: Recognizer for Register. S0 -r→ S1 -(0|1|2| … 9)→ S2, with a (0|1|2| … 9) self-loop on S2]

Page 12: The Front End

12

Example (continued)

To be useful, the recognizer must be converted into code

Table encoding the RE

State   r     0,1,2,3,4,5,6,7,8,9   All others
s0      s1    se                     se
s1      se    s2                     se
s2      se    s2                     se
se      se    se                     se

Skeleton recognizer

Char ← next character
State ← s0
while (Char ≠ EOF)
    State ← δ(State,Char)
    Char ← next character
if (State is a final state)
    then report success
    else report failure

O(1) cost per character (or per transition)
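
A runnable sketch of the table plus skeleton, in Python. The dictionary `DELTA`, the helper `column`, and the convention that a missing entry means the error state are choices made for this sketch; a generated scanner would more likely use a dense array.

```python
# Transition table for r Digit Digit*. Any (state, column) pair that is
# missing from the dictionary stands for a transition to the error state se.
DELTA = {
    ("s0", "r"): "s1",
    ("s1", "digit"): "s2",
    ("s2", "digit"): "s2",
}
FINAL = {"s2"}

def column(ch):
    """Map a character to its column in the table."""
    if ch == "r":
        return "r"
    if "0" <= ch <= "9":
        return "digit"
    return "other"

def recognize(word):
    """Run the skeleton recognizer: one table lookup per character."""
    state = "s0"
    for ch in word:
        state = DELTA.get((state, column(ch)), "se")
    return state in FINAL
```

With these definitions, `recognize("r17")` succeeds, while `recognize("r")` and `recognize("a")` fail, matching the walk-through on the earlier slide.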

Page 13: The Front End

13

Example (continued)

We can add “actions” to each transition

Table encoding RE (each entry is next state, action)

State   r            0,1,2,3,4,5,6,7,8,9   All others
s0      s1, start    se, error              se, error
s1      se, error    s2, add                se, error
s2      se, error    s2, add                se, error
se      se, error    se, error              se, error

Skeleton recognizer

Char ← next character
State ← s0
while (Char ≠ EOF)
    Next ← δ(State,Char)
    Act ← α(State,Char)
    perform action Act
    State ← Next
    Char ← next character
if (State is a final state)
    then report success
    else report failure

Typical action is to capture the lexeme
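
As an illustration, the sketch below pairs each transition with an action name and uses the actions to build the lexeme. The action names follow the table (start, add, error); the Python representation and the function name are assumptions.

```python
# Each entry maps (state, column) to (next state, action).
# Missing entries stand for (se, error).
DELTA_ACT = {
    ("s0", "r"): ("s1", "start"),
    ("s1", "digit"): ("s2", "add"),
    ("s2", "digit"): ("s2", "add"),
}

def recognize_with_lexeme(word):
    """Return the captured lexeme if word is a register name, else None."""
    state, lexeme = "s0", ""
    for ch in word:
        col = "r" if ch == "r" else "digit" if "0" <= ch <= "9" else "other"
        state, action = DELTA_ACT.get((state, col), ("se", "error"))
        if action == "start":
            lexeme = ch            # begin a new lexeme
        elif action == "add":
            lexeme += ch           # extend the lexeme
        else:
            return None            # error action: give up
    return lexeme if state == "s2" else None
```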

Page 14: The Front End

14

What if we need a tighter specification?

r Digit Digit* allows arbitrary numbers

• Accepts r00000

• Accepts r99999

• What if we want to limit it to r0 through r31?

Write a tighter regular expression
— Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
— Register → r0|r1|r2| … |r31|r00|r01|r02| … |r09

Produces a more complex DFA

• DFA has more states

• DFA has same cost per transition (or per character)

• DFA has same basic implementation

More states implies a larger table. The larger table might have mattered when computers had 128 KB or 640 KB of RAM. Today, when a cell phone has megabytes and a laptop has gigabytes, the concern seems outdated.
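
The first of the tighter REs can be tried out directly. This is a transcription into Python `re` syntax, with the alternative (3|30|31) written as `3[01]?`; the names `REGISTER` and `is_register` are illustrative.

```python
import re

# r ( (0|1|2)(Digit|ε) | (4|...|9) | (3|30|31) )
REGISTER = re.compile(r"r(?:[0-2][0-9]?|[4-9]|3[01]?)")

def is_register(name):
    """True if name is r0 through r31 (a leading zero, as in r07, is allowed)."""
    return REGISTER.fullmatch(name) is not None
```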

Page 15: The Front End

15

Tighter register specification (continued)

The DFA for Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )

• Accepts a more constrained set of register names

• Same set of actions, more states

[Diagram: DFA with transitions S0 -r→ S1; S1 -0,1,2→ S2; S1 -3→ S5; S1 -4,5,6,7,8,9→ S4; S2 -(0|1|2| … 9)→ S3; S5 -0,1→ S6]

Page 16: The Front End

16

Tighter register specification (continued)

State   r     0,1   2     3     4-9   All others
s0      s1    se    se    se    se    se
s1      se    s2    s2    s5    s4    se
s2      se    s3    s3    s3    s3    se
s3      se    se    se    se    se    se
s4      se    se    se    se    se    se
s5      se    s6    se    se    se    se
s6      se    se    se    se    se    se
se      se    se    se    se    se    se

Table encoding RE for the tighter register specification
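
The table can be dropped into the same skeleton recognizer. In the sketch below the rows are copied from the table; the set of accepting states (s2 through s6) is an assumption read off the DFA, since the table itself does not mark final states.

```python
# Columns, in order: r, "0,1", "2", "3", "4-9", all others.
TABLE = {
    "s0": ["s1", "se", "se", "se", "se", "se"],
    "s1": ["se", "s2", "s2", "s5", "s4", "se"],
    "s2": ["se", "s3", "s3", "s3", "s3", "se"],
    "s3": ["se"] * 6,
    "s4": ["se"] * 6,
    "s5": ["se", "s6", "se", "se", "se", "se"],
    "s6": ["se"] * 6,
    "se": ["se"] * 6,
}
ACCEPTING = {"s2", "s3", "s4", "s5", "s6"}   # assumption: not shown in the table

def column_index(ch):
    """Map a character to its column number in TABLE."""
    if ch == "r":
        return 0
    if ch in ("0", "1"):
        return 1
    if ch == "2":
        return 2
    if ch == "3":
        return 3
    if "4" <= ch <= "9":
        return 4
    return 5

def is_register(word):
    state = "s0"
    for ch in word:
        state = TABLE[state][column_index(ch)]
    return state in ACCEPTING
```

The loop is unchanged from the simpler recognizer; only the table grew.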

Page 17: The Front End

17

Tighter register specification (continued)

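State / Action table (each entry is next state and action; e is the error state)

State   r          0,1      2        3        4,5,6,7,8,9   other
0       1 start    e        e        e        e             e
1       e          2 add    2 add    5 add    4 add         e
2       e          3 add    3 add    3 add    3 add         e exit
3,4     e          e        e        e        e             e exit
5       e          6 add    e        e        e             e exit
6       e          e        e        e        e             e exit
e       e          e        e        e        e             e

[Diagram: DFA with transitions S0 -r→ S1; S1 -0,1,2→ S2; S1 -3→ S5; S1 -4,5,6,7,8,9→ S4; S2 -(0|1|2| … 9)→ S3; S5 -0,1→ S6]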

Page 18: The Front End

18

Table-Driven Scanners

Common strategy is to simulate DFA execution

• Table + Skeleton Scanner
— So far, we have used a simplified skeleton

• In practice, the skeleton is more complex
— Character classification for table compression
— Building the lexeme
— Recognizing subexpressions
    Practice is to combine all the REs into one DFA
    Must recognize individual words without hitting EOF

state ← s0;
while (state ≠ exit) do
    char ← NextChar( )         // read next character
    state ← δ(state,char);     // take the transition

[Diagram: DFA from s0 to final state sf, with transitions labeled r, 0 … 9, and 0 … 9]

Page 19: The Front End

19

Table-Driven Scanners

Character Classification

• Group together characters by their actions in the DFA
— Combine identical columns in the transition table
— Indexing by class shrinks the table

• Idea works well in ASCII (or EBCDIC)
— compact, byte-oriented character sets
— limited range of values

• Not clear how it extends to larger character sets (Unicode)

state ← s0;
while (state ≠ exit) do
    char ← NextChar( )         // read next character
    cat ← CharCat(char)        // classify character
    state ← δ(state,cat)       // take the transition
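
One simple way to realize CharCat for a byte-oriented character set is a 128-entry lookup table. The sketch below builds such a table for the register example; the category names and the fallback to "other" for non-ASCII characters are assumptions.

```python
# Classifier table indexed by ASCII code: every character starts as "other".
CHAR_CAT = ["other"] * 128
CHAR_CAT[ord("r")] = "r"
for code in range(ord("0"), ord("9") + 1):
    CHAR_CAT[code] = "digit"

def char_cat(ch):
    """Classify one character with a single indexed lookup."""
    code = ord(ch)
    return CHAR_CAT[code] if code < 128 else "other"
```

The transition table then needs only three columns (r, digit, other) instead of one column per character.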

Page 20: The Front End

20

Table-Driven Scanners

Building the Lexeme

• Scanner produces syntactic category (part of speech)
— Most applications want the lexeme (word), too

• This problem is trivial
— Save the characters

state ← s0
lexeme ← empty string
while (state ≠ exit) do
    char ← NextChar( )           // read next character
    lexeme ← lexeme + char       // concatenate onto lexeme
    cat ← CharCat(char)          // classify character
    state ← δ(state,cat)         // take the transition

Page 21: The Front End

21

Table-Driven Scanners

Choosing a Category from an Ambiguous RE

• We want one DFA, so we combine all the REs into one
— Some strings may fit RE for more than 1 syntactic category
    Keywords versus general identifiers
    Would like to encode them into the RE & recognize them
— Scanner must choose a category for ambiguous final states
    Classic answer: specify priority by order of REs (return 1st)

Alternate Implementation Strategy (Quite popular)

• Build hash table of keywords & fold keywords into identifiers

• Preload keywords into hash table

• Makes sense if
— Scanner will enter all identifiers in the table
— Scanner is hand coded

• Otherwise, let the DFA handle them (O(1) cost per character)

Separate keyword table can make matters worse
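
A minimal sketch of the keyword-table strategy: the scanner recognizes every word as an identifier, then consults a preloaded table to reclassify keywords. The keyword set and the function name are hypothetical.

```python
# Hypothetical keyword set, preloaded before scanning starts.
KEYWORDS = {"if", "then", "else", "while", "do"}

def classify_word(lexeme):
    """Fold keywords into identifiers: look the word up after scanning it."""
    part_of_speech = "keyword" if lexeme in KEYWORDS else "id"
    return (part_of_speech, lexeme)
```

The extra lookup touches every identifier a second time, which is the cost the slide warns about when the DFA could have classified the word on its own.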


Page 22: The Front End

22

Table-Driven Scanners

Scanning a Stream of Words

• Real scanners do not look for 1 word per input stream
— Want scanner to find all the words in the input stream, in order
— Want scanner to return one word at a time
— Syntactic solution: can insist on delimiters
    Blank, tab, punctuation, …
    Do you want to force blanks everywhere? in expressions?
— Implementation solution
    Run DFA to error or EOF, back up to accepting state

• Need the scanner to return token, not boolean
— Token is <Part of Speech, lexeme> pair
— Use a map from DFA’s state to Part of Speech (PoS)

Page 23: The Front End

23

Table-Driven Scanners

Handling a Stream of Words

// recognize words
state ← s0
lexeme ← empty string
clear stack
push (bad)

while (state ≠ se) do
    char ← NextChar( )
    lexeme ← lexeme + char
    if state ∈ SA
        then clear stack
    push (state)
    cat ← CharCat(char)
    state ← δ(state,cat)
end;

// clean up final state
while (state ∉ SA and state ≠ bad) do
    state ← pop()
    truncate lexeme
    roll back the input one character
end;

// report the results
if (state ∈ SA)
    then return <PoS(state), lexeme>
    else return invalid

Need a clever buffering scheme, such as double buffering, to support roll back
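
The scheme can be sketched in Python, where the input is a string and rolling back is just decrementing an index, so no buffering scheme is needed. The example DFA (register names), the names `DELTA`, `ACCEPTING`, `char_cat` and `next_word`, and the treatment of end of input as a loop exit are all assumptions of this sketch.

```python
BAD, ERROR = "bad", "se"

# Hypothetical example DFA: register names, r Digit Digit*.
DELTA = {("s0", "r"): "s1", ("s1", "digit"): "s2", ("s2", "digit"): "s2"}
ACCEPTING = {"s2": "register"}      # accepting state -> part of speech

def char_cat(ch):
    if ch == "r":
        return "r"
    return "digit" if "0" <= ch <= "9" else "other"

def next_word(text, pos):
    """Scan one word starting at text[pos].

    Returns (part of speech, lexeme, new position), or None if no word matches.
    """
    state, lexeme = "s0", ""
    stack = [BAD]
    while state != ERROR and pos < len(text):
        ch = text[pos]
        pos += 1
        lexeme += ch
        if state in ACCEPTING:
            stack.clear()           # older states can no longer be needed
        stack.append(state)
        state = DELTA.get((state, char_cat(ch)), ERROR)
    # Clean up: back up to the most recent accepting state, if there is one.
    while state not in ACCEPTING and state != BAD:
        state = stack.pop()
        if state != BAD:
            lexeme = lexeme[:-1]    # truncate lexeme
            pos -= 1                # roll back the input one character
    if state in ACCEPTING:
        return ACCEPTING[state], lexeme, pos
    return None
```

Calling `next_word("r17+r2", 0)` reads one character too far (the "+"), then rolls back to the accepting state and returns the register r17 with the input positioned at the "+".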

Page 24: The Front End

24

Avoiding Excess Rollback

• Some REs can produce quadratic rollback
— Consider ab | (ab)* c and its DFA
— Input “ababababc”
    s0, s1, s3, s4, s3, s4, s3, s4, s5
— Input “abababab”
    s0, s1, s3, s4, s3, s4, s3, s4, rollback 6 characters
    s0, s1, s3, s4, s3, s4, rollback 4 characters
    s0, s1, s3, s4, rollback 2 characters
    s0, s1, s3

• This behavior is preventable
— Have the scanner remember paths that fail on particular inputs
— Simple modification creates the “maximal munch scanner”

[Diagram: DFA for ab | (ab)* c, with states s0 through s5 and transitions labeled a, b, and c]

Not too pretty

Page 25: The Front End

25

Maximal Munch Scanner

// recognize words
state ← s0
lexeme ← empty string
clear stack
push (bad,bad)

while (state ≠ se) do
    char ← NextChar( )
    InputPos ← InputPos + 1
    lexeme ← lexeme + char
    if Failed[state,InputPos]
        then break;
    if state ∈ SA
        then clear stack
    push (state,InputPos)
    cat ← CharCat(char)
    state ← δ(state,cat)
end

// clean up final state
while (state ∉ SA and state ≠ bad) do
    Failed[state,InputPos] ← true
    〈state,InputPos〉 ← pop()
    truncate lexeme
    roll back the input one character
end

// report the results
if (state ∈ SA)
    then return <PoS(state), lexeme>
    else return invalid

InitializeScanner()
    InputPos ← 0
    for each state s in the DFA do
        for i ← 0 to |input| do
            Failed[s,i] ← false
        end;
    end;
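
A Python sketch of the same idea, specialized to ab | (ab)* c. The DFA below was worked out by hand for this sketch, so its state numbering is an assumption and need not match the diagram; `failed` is kept as a set of (state, position) pairs rather than a bit array.

```python
# Hand-built DFA for ab | (ab)* c; missing entries go to the error state.
DELTA = {
    ("s0", "a"): "s1", ("s0", "c"): "s5",
    ("s1", "b"): "s2",
    ("s2", "a"): "s3", ("s2", "c"): "s5",
    ("s3", "b"): "s4",
    ("s4", "a"): "s3", ("s4", "c"): "s5",
}
ACCEPTING = {"s2", "s5"}
BAD, ERROR = "bad", "se"

def tokenize(text):
    """Split text into longest-match words; raise ValueError if stuck."""
    failed = set()               # (state, position) pairs known to dead-end
    words, pos = [], 0
    while pos < len(text):
        start, state = pos, "s0"
        stack = [(BAD, start)]
        while state != ERROR and pos < len(text):
            if (state, pos) in failed:
                break            # this path already failed from here
            if state in ACCEPTING:
                stack.clear()
            stack.append((state, pos))
            state = DELTA.get((state, text[pos]), ERROR)
            pos += 1
        # Clean up: record the dead ends while backing up.
        while state not in ACCEPTING and state != BAD:
            failed.add((state, pos))
            state, pos = stack.pop()
        if state == BAD:
            raise ValueError(f"no word starts at position {start}")
        words.append(text[start:pos])
    return words
```

On "abababab" the first word still scans to the end of the input before backing up, but the later words stop as soon as they reach a recorded dead end, so the total work stays linear in the input length.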

Page 26: The Front End

26

Maximal Munch Scanner

• Uses a bit array Failed to track dead-end paths
— Initialize both InputPos & Failed in InitializeScanner()
— Failed requires space ∝ |input stream|
    Can reduce the space requirement with clever implementation

• Avoids quadratic rollback
— Produces an efficient scanner
— Can your favorite language cause quadratic rollback?
    If so, the solution is inexpensive
    If not, you might encounter the problem in other applications of these technologies

Thomas Reps, “‘Maximal munch’ tokenization in linear time”, ACM TOPLAS, 20(2), March 1998, pp 259-273.

Page 27: The Front End

27

Table-Driven Versus Direct-Coded Scanners

Table-driven scanners make heavy use of indexing

• Read the next character

• Classify it (index)

• Find the next state (index)

• Branch back to the top

state ← s0;
while (state ≠ exit) do
    char ← NextChar( )
    cat ← CharCat(char)
    state ← δ(state,cat);

Alternative strategy: direct coding

• Encode state in the program counter
— Each state is a separate piece of code

• Do transition tests locally and directly branch

• Generate ugly, spaghetti-like code

• More efficient than table-driven strategy
— Fewer memory operations, might have more branches

Code locality as opposed to random access in δ

Page 28: The Front End

28

Table-Driven Versus Direct-Coded Scanners

Overhead of Table Lookup

• Each lookup in CharCat or δ involves an address calculation and a memory operation
— CharCat(char) becomes
    @CharCat0 + char x w
    w is sizeof(element of CharCat)
— δ(state,cat) becomes
    @δ0 + (state x cols + cat) x w
    cols is # of columns in δ
    w is sizeof(element of δ)

• The references to CharCat and δ expand into multiple ops

• Fair amount of overhead work per character

• Avoid the table lookups and the scanner will run faster
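
The two-dimensional address calculation can be written out as a small function. The base address, table shape and element width used in the check are made-up numbers for illustration only.

```python
def element_address(base, row, col, cols, width):
    """Address of table[row][col] in a row-major table.

    base  : address of element [0][0]
    cols  : number of columns in the table
    width : size of one element in bytes
    """
    return base + (row * cols + col) * width
```

With a base of 1000, 6 columns and 4-byte elements, the entry for state 2 and category 3 lies at 1000 + (2 x 6 + 3) x 4 = 1060. That is one multiply-add pair, a scaling step, and then the load itself, repeated for every character.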

Page 29: The Front End

29

Building Faster Scanners from the DFA

A direct-coded recognizer for r Digit Digit*

start: accept ← se
       lexeme ← “”
       count ← 0
       goto s0

s0:    char ← NextChar
       lexeme ← lexeme + char
       count++
       if (char = ‘r’)
           then goto s1
           else goto sout

s1:    char ← NextChar
       lexeme ← lexeme + char
       count++
       if (‘0’ ≤ char ≤ ‘9’)
           then goto s2
           else goto sout

s2:    char ← NextChar
       lexeme ← lexeme + char
       count ← 1
       accept ← s2
       if (‘0’ ≤ char ≤ ‘9’)
           then goto s2
           else goto sout

sout:  if (accept ≠ se)
           then begin
               for i ← 1 to count
                   RollBack()
               report success
           end
           else report failure

Fewer (complex) memory operations

No character classifier

Use multiple strategies for test & branch

Page 30: The Front End

30

Building Faster Scanners from the DFA

A direct-coded recognizer for r Digit Digit*

start: accept ← se
       lexeme ← “”
       count ← 0
       goto s0

s0:    char ← NextChar
       lexeme ← lexeme + char
       count++
       if (char = ‘r’)
           then goto s1
           else goto sout

s1:    char ← NextChar
       lexeme ← lexeme + char
       count++
       if (‘0’ ≤ char ≤ ‘9’)
           then goto s2
           else goto sout

s2:    char ← NextChar
       lexeme ← lexeme + char
       count ← 1
       accept ← s2
       if (‘0’ ≤ char ≤ ‘9’)
           then goto s2
           else goto sout

sout:  if (accept ≠ se)
           then begin
               for i ← 1 to count
                   RollBack()
               report success
           end
           else report failure

If end of state test is complex (e.g., many cases), scanner generator should consider other schemes

• Table lookup (with classification?)

• Binary search

Direct coding the maximal munch scanner is easy, too.
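
Python has no goto, but the flavor of direct coding survives: each state becomes its own piece of code with local tests, and no transition table or classifier is consulted. In this sketch the self-loop on s2 is written as a while loop, and rollback is implicit because the index never advances past the last digit; the function name and return convention are assumptions.

```python
def scan_register(text, pos=0):
    """Return (lexeme, new position) for a register name at text[pos:], or None."""
    start = pos
    # s0: the word must begin with 'r'
    if pos >= len(text) or text[pos] != "r":
        return None
    pos += 1
    # s1: at least one digit is required
    if pos >= len(text) or not ("0" <= text[pos] <= "9"):
        return None
    pos += 1
    # s2: accepting state; stay here while digits keep coming
    while pos < len(text) and "0" <= text[pos] <= "9":
        pos += 1
    return text[start:pos], pos
```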

Page 31: The Front End

31

What About Hand-Coded Scanners?

Many (most?) modern compilers use hand-coded scanners

• Starting from a DFA simplifies design & understanding

• Avoiding the straitjacket of a tool allows flexibility
— Computing the value of an integer
    In LEX or FLEX, many folks use sscanf() & touch chars many times
    Can use old assembly trick and compute value as it appears
— Combine similar states (serial or parallel)

• Scanners are fun to write
— Compact, comprehensible, easy to debug, …
— Don’t get too cute (e.g., perfect hashing for keywords)
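
The "compute value as it appears" trick can be sketched as follows: the value is accumulated while the digits are being scanned, so each character is touched once. The function name and return convention are assumptions.

```python
def scan_integer(text, pos=0):
    """Scan an unsigned integer at text[pos:], computing its value on the fly.

    Returns (value, new position), or None if no digit is present.
    """
    start = pos
    value = 0
    while pos < len(text) and "0" <= text[pos] <= "9":
        value = value * 10 + (ord(text[pos]) - ord("0"))
        pos += 1
    if pos == start:
        return None
    return value, pos
```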

Page 32: The Front End

32

Building Scanners

The point

• All this technology lets us automate scanner construction

• Implementer writes down the regular expressions

• Scanner generator builds NFA, DFA, minimal DFA, and then writes out the (table-driven or direct-coded) code

• This reliably produces fast, robust scanners

For most modern language features, this works

• You should think twice before introducing a feature that defeats a DFA-based scanner

• The ones we’ve seen (e.g., insignificant blanks, non-reserved keywords) have not proven particularly useful or long lasting

