Lexical Analysis - Lecture 2 Sections 3.1 - 3 -...

Lexical Analysis

Robb T. Koether

Lexical Analysis

Regular Expressions

State Diagrams

Assignment

Lexical Analysis - Lecture 2

Sections 3.1 - 3.4

Robb T. Koether

Hampden-Sydney College

Mon, Jan 19, 2009

Outline

1 Lexical Analysis

2 Regular Expressions

3 State Diagrams

4 Assignment

Tokens

A token has a type and a value.
Types include id, num, assign, lparen, etc.
Values are used primarily with identifiers and numbers.
If we read “count”, the type is id and the value is “count”.
If we read “123”, the type is num and the value is “123”.
If we read “=”, the type is assign and the value is “=”.
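
As a minimal sketch of the type/value pairing described above, a token can be modeled as a small record. The class name `Token` and its field names are assumptions chosen for illustration; they do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A token as a (type, value) pair; the names here are illustrative."""
    type: str   # kind of token, e.g. "id", "num", "assign", "lparen"
    value: str  # the characters that were actually read


# The three examples from the slide.
examples = [
    Token("id", "count"),
    Token("num", "123"),
    Token("assign", "="),
]

for tok in examples:
    print(tok.type, tok.value)
```

The value matters most for id and num, where many different strings share one type.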

Analyzing Tokens

Each type of token can be described by a regular expression.
Therefore, the set of all tokens can be described by a regular expression. (Why?)
Every language described by a regular expression is accepted by a DFA.
Therefore, the set of all tokens can be processed and accepted by a DFA.

Regular Expressions

The set of all regular expressions may be defined in two parts.
The basic part:

ε represents the language {ε}.
a represents the language {a} for every a ∈ Σ.
Call these languages L(ε) and L(a), respectively.

Regular Expressions

The recursive part: Let r and s denote regular expressions.

r | s represents the language L(r) ∪ L(s).
rs represents the language L(r)L(s).
r∗ represents the language L(r)∗.

In other words,
L(r | s) = L(r) ∪ L(s).
L(rs) = L(r)L(s).
L(r∗) = L(r)∗.
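
The three operations can be tried out on small finite languages. The sketch below models a language as a Python set of strings, with ε as the empty string. Because L(r)∗ is infinite whenever L(r) contains a nonempty string, `star` only enumerates strings up to a chosen length; the function names and the `max_len` cutoff are assumptions made for this illustration.

```python
def union(A, B):
    """L(r | s) = L(r) ∪ L(s)."""
    return A | B


def concat(A, B):
    """L(rs) = L(r)L(s): every string of A followed by every string of B."""
    return {x + y for x in A for y in B}


def star(A, max_len):
    """The strings of L(r)* whose length is at most max_len."""
    result = {""}      # ε is always in L(r)*
    frontier = {""}    # strings found in the previous round
    while frontier:
        frontier = {
            x + y for x in frontier for y in A if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


L_a = {"a"}  # L(a)
L_b = {"b"}  # L(b)

print(sorted(union(L_a, L_b)))   # L(a | b)
print(sorted(concat(L_a, L_b)))  # L(ab)
print(sorted(star(L_a, 3)))      # L(a*) up to length 3
```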

Example

Example (Identifiers)
Identifiers in C++ can be represented by a regular expression.

r = A | B | · · · | Z | a | b | · · · | z
s = 0 | 1 | · · · | 9
t = r(r | s)∗

Regular Expressions

Definition (Regular definition)

A regular definition of a regular expression is a “grammar” of the form

d1 → r1
d2 → r2
...
dn → rn

where each ri is a regular expression over Σ ∪ {d1, d2, ..., di−1}.

Regular Expressions

Note that this definition does not allow recursively defined tokens.
In other words, di cannot be defined in terms of di, not even indirectly.

Example

Example (Identifiers)
We may now describe C++ identifiers as follows.

letter → A | B | · · · | Z | a | b | · · · | z
digit → 0 | 1 | · · · | 9
id → letter(letter | digit)∗
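
The regular definition for id can be checked with Python's `re` module, building each pattern only from patterns defined before it, just as a regular definition requires. This is a sketch that follows the slide's simplified rule; real C++ identifiers also permit the underscore, which is left out here. The helper name `is_identifier` is an assumption.

```python
import re

# Each name is defined only in terms of earlier names (no recursion).
letter = "[A-Za-z]"                        # letter → A | ... | Z | a | ... | z
digit = "[0-9]"                            # digit → 0 | 1 | ... | 9
ident = f"{letter}({letter}|{digit})*"     # id → letter(letter | digit)*


def is_identifier(s):
    """True if the whole string s matches the id pattern."""
    return re.fullmatch(ident, s) is not None


for s in ["count", "x1", "1x", ""]:
    print(repr(s), is_identifier(s))
```

`re.fullmatch` is used because the question is whether the entire string belongs to the language, not whether some prefix of it does.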

Lexical Analysis

After writing a regular expression for each kind of token, we may combine them into one big regular expression describing all tokens.

id → letter(letter | digit)∗
num → digit(digit)∗
relop → < | > | == | != | >= | <=
token → id | num | relop | . . .
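
One way to experiment with the combined expression is to give each alternative a named group in a single Python pattern, so that a successful match also reports which kind of token it was. This is only a sketch covering the three kinds listed above; the function name `classify` is an assumption.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# token → id | num | relop, with one named group per kind of token.
# The inner group is non-capturing so that only the three names are recorded.
TOKEN = re.compile(
    f"(?P<id>{LETTER}(?:{LETTER}|{DIGIT})*)"
    f"|(?P<num>{DIGIT}{DIGIT}*)"
    "|(?P<relop>==|!=|>=|<=|<|>)"
)


def classify(s):
    """Return "id", "num", or "relop" if all of s is one token, else None."""
    m = TOKEN.fullmatch(s)
    return m.lastgroup if m else None


for s in ["count", "123", "<=", "=", "9lives"]:
    print(repr(s), classify(s))
```

Python's `|` tries alternatives from left to right rather than choosing the longest match, so the two-character operators are listed before `<` and `>`. With `fullmatch` the order does not change the result, but it matters when matching a prefix of longer input.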

State Diagrams

A regular expression may be represented by a state diagram.
The state diagram provides a good guide to writing a lexical analyzer program.

Example

Example (State Diagrams)

[State diagrams with edges labeled letter and letter | digit (for id) and digit and digit (for num)]

State Diagrams

Unfortunately, it is not that simple.
At what point may we stop in an accepting state?
Do not read “count” as 5 identifiers: “c”, “o”, “u”, “n”, “t”.
When we stop in an accepting state, we must be able to determine the type of token processed.
Did we read the id token “count” or did we read the if token “if”?
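
One common way to settle the id-versus-keyword question is to scan every word with the id pattern and then look the result up in a table of reserved words. This is a sketch of that approach, not necessarily the one these slides go on to use; the keyword set and the name `classify_word` are assumptions.

```python
# A hypothetical, deliberately tiny set of reserved words.
KEYWORDS = {"if", "else", "while"}


def classify_word(lexeme):
    """Return the token type for a word already matched by the id pattern.

    A reserved word is its own token type; anything else is an id.
    """
    return lexeme if lexeme in KEYWORDS else "id"


for word in ["count", "if", "iffy"]:
    print(word, "->", classify_word(word))
```

Because the whole word is read before the lookup, "iffy" is classified as an id rather than as the keyword "if" followed by "fy".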

Example

Consider state diagrams to accept the relational operators ==, !=, <, >, <=, and >=.

Example

Combine them into a single state diagram.

[State diagram: relop]

State Diagrams

When we reach an accepting state, how can we tell which operator was processed?
In general, we design the diagram so that each kind of token has its own accepting state.
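
The idea of one accepting state per token can be sketched as a transition table for the six relational operators. The state names below are made up for this illustration and do not correspond to the numbered states in the lecture's diagrams.

```python
# (state, character) -> next state; any missing entry leads to the dead state.
TRANSITIONS = {
    ("start", "<"): "lt",
    ("start", ">"): "gt",
    ("start", "="): "eq1",    # a single "=" is not a relational operator
    ("start", "!"): "bang",   # a single "!" is not one either
    ("lt", "="): "le",
    ("gt", "="): "ge",
    ("eq1", "="): "eq",
    ("bang", "="): "ne",
}

# Each operator has its own accepting state, so the final state identifies it.
ACCEPTING = {
    "lt": "<", "gt": ">", "le": "<=", "ge": ">=", "eq": "==", "ne": "!=",
}


def which_relop(s):
    """Run the DFA on s; return the operator recognized, or None."""
    state = "start"
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:     # dead state
            return None
    return ACCEPTING.get(state)


for s in ["<", "<=", "==", "!=", "=", "!"]:
    print(repr(s), which_relop(s))
```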

State Diagrams

If we reach state 3, how do we decide whether to continue to state 4?
We read characters until the current character does not match any pattern, i.e., it would lead to the dead state.
At that point, we accept the string, minus the last character.
Later, processing resumes with the last character.

State Diagrams

The Maximal Munch Principle
Process as many symbols as possible while still matching a regular expression.
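
A small hand-written scanner can illustrate maximal munch: keep reading until the next character would lead to the dead state, then accept what was read and resume from that character. The sketch below covers only id, num, assign, and relop. It remembers the last accepting position, a slight generalization of "minus the last character" that also handles stopping in a non-accepting state. All state and function names are assumptions.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

# Accepting states and the token type each one reports.
ACCEPTING = {
    "id": "id", "num": "num", "assign": "assign",
    "lt": "relop", "gt": "relop", "le": "relop",
    "ge": "relop", "eq": "relop", "ne": "relop",
}


def step(state, ch):
    """Return the next state, or None for the dead state."""
    if state == "start":
        if ch in LETTERS:
            return "id"
        if ch in DIGITS:
            return "num"
        return {"<": "lt", ">": "gt", "=": "assign", "!": "bang"}.get(ch)
    if state == "id":
        return "id" if ch in LETTERS or ch in DIGITS else None
    if state == "num":
        return "num" if ch in DIGITS else None
    if ch == "=":
        return {"lt": "le", "gt": "ge", "assign": "eq", "bang": "ne"}.get(state)
    return None


def scan(text):
    """Split text into (type, value) pairs using maximal munch."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():   # skip whitespace between tokens
            pos += 1
            continue
        state, i = "start", pos
        last_accept = None        # (type, end) of the longest match so far
        while i < len(text):
            state = step(state, text[i])
            if state is None:     # dead state: stop reading
                break
            i += 1
            if state in ACCEPTING:
                last_accept = (ACCEPTING[state], i)
        if last_accept is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, end = last_accept
        tokens.append((kind, text[pos:end]))
        pos = end                 # resume with the character that stopped us
    return tokens


print(scan("count <= 123"))
print(scan("x==y"))
```

Here "count" comes out as one id rather than five, and "<=" as one relop rather than "<" followed by "=".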

Example

[State diagram for relop: edges labeled =, !, <, and >, each followed by edges labeled = and other]

Homework
Read Sections 3.1 - 3.4.
