Compiler Design

Summary in Compiler Design

MODULE I. INTRODUCTION TO COMPILER

Compiler writing spans programming languages, machine architecture, language theory, algorithms, and software engineering. Fortunately, a few basic compiler-writing techniques can be used to construct translators for a wide variety of languages and machines.

A compiler is a program that reads a program written in one language – the source language – and translates it into an equivalent program in another language – the target language – as illustrated in Figure 1.1. An important part of the translation process is that the compiler reports to its user the presence of errors in the source program.

At first glance, the variety of compilers may appear overwhelming. There are thousands of source languages, ranging from traditional programming languages such as Fortran and Pascal to specialized languages that have arisen in virtually every area of computer application. Target languages are equally varied; a target language may be another programming language, or the machine language of any computer, from a microprocessor to a supercomputer.

A compiler translates a source program into machine language. An interpreter program reads individual words of a source program and immediately executes corresponding machine-language segments. Interpretation occurs each time the program is used. Thus, once it has been compiled, a program written in a compiled language will run more quickly than a program in an interpreted language.

An interpreter is a computer program that translates commands written in a high-level computer language into machine-language commands that the computer can understand and execute. An interpreter's function is thus similar to that of a compiler, but the two differ in their mode of operation. A compiler translates a complete set of high-level commands, producing a corresponding set of machine-language commands that are then executed, whereas an interpreter translates and executes each command as the user enters it. Interpretive languages, such as the widely used BASIC, are relatively easy to learn, and programs written in them are easy to edit and correct. Compiled programs, on the other hand, operate more rapidly and efficiently.

THE ANALYSIS-SYNTHESIS MODEL OF COMPILATION

There are two parts of compilation: analysis and synthesis. The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program. The synthesis part constructs the desired target program from the intermediate representation.

During analysis, the operations implied by the source program are determined and recorded in a hierarchical structure called a tree. It is a special kind of tree called a syntax tree, in which each node represents an operation and the children of a node represent the arguments of the operation. An example is shown in Figure 1.2.

Figure 1.1 A Compiler

    source program --> [ compiler ] --> target program
                            |
                            v
                      error messages

Software tools that manipulate source programs:

1. Structure editors – take as input a sequence of commands to build a source program. The structure editor not only performs the text-creation and modification functions of an ordinary text editor, but also analyzes the program text, imposing an appropriate hierarchical structure on the source program.

2. Pretty printers – analyze a program and print it in such a way that the structure of the program becomes clearly visible. For example, comments may appear in a special font, and statements may appear with an amount of indentation proportional to the depth of their nesting in the hierarchical organization of the statements.

3. Static checkers – read a program, analyze it, and attempt to discover potential bugs without running the program. For example, a static checker may detect that parts of the source program can never be executed, or that a certain variable might be used before being defined.

4. Interpreters – perform the operations implied by the source program. For an assignment statement, for example, an interpreter might build a tree like the one in Figure 1.2 and then carry out the operations at the nodes as it “walks” the tree. Interpreters are frequently used to execute command languages, since each operator executed in a command language is usually an invocation of a complex routine such as an editor or compiler.
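The tree walk described in item 4 can be sketched concretely. This is a minimal illustration under stated assumptions, not an implementation from the text: the node classes and the environment dictionary are hypothetical.

```python
# Sketch of an interpreter that "walks" a syntax tree such as the one in
# Figure 1.2 (position := initial + rate * 60). Node classes and the
# environment dict are illustrative assumptions.

class Num:
    def __init__(self, value): self.value = value

class Var:
    def __init__(self, name): self.name = name

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

class Assign:
    def __init__(self, target, expr):
        self.target, self.expr = target, expr

def evaluate(node, env):
    """Carry out the operation at each node while walking the tree."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Var):
        return env[node.name]
    if isinstance(node, BinOp):
        left, right = evaluate(node.left, env), evaluate(node.right, env)
        return left + right if node.op == '+' else left * right
    if isinstance(node, Assign):
        env[node.target] = evaluate(node.expr, env)
        return env[node.target]

# position := initial + rate * 60
tree = Assign('position', BinOp('+', Var('initial'),
                                BinOp('*', Var('rate'), Num(60))))
env = {'initial': 10.0, 'rate': 0.5}
evaluate(tree, env)
print(env['position'])  # 10.0 + 0.5 * 60 = 40.0
```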

ANALYSIS OF THE SOURCE PROGRAM

It consists of three phases:

1. Linear analysis, in which the stream of characters making up the source program is read from left-to-right and grouped into tokens that are sequences of characters having a collective meaning.

2. Hierarchical analysis, in which characters or tokens are grouped hierarchically into nested collections with collective meaning.

3. Semantic analysis, in which certain checks are performed to ensure that the components of a program fit together meaningfully.

THE PHASES OF A COMPILER

Conceptually, a compiler operates in phases, each of which transforms the source program from one representation to another. A typical decomposition of a compiler is shown in Figure 1.3.

              :=
             /  \
     position    +
                / \
         initial   *
                  / \
              rate    60

Figure 1.2 Syntax tree for position := initial + rate * 60.


LEXICAL ANALYSIS

In a compiler, linear analysis is called lexical analysis or scanning. For example, in lexical analysis the characters in the assignment statement

position := initial + rate * 60

would be grouped into the following tokens:

1. The identifier position.
2. The assignment symbol :=.
3. The identifier initial.
4. The plus sign.
5. The identifier rate.
6. The multiplication sign.
7. The number 60.

The blanks separating the characters of these tokens would normally be eliminated during lexical analysis.
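A scanner of this kind can be sketched with regular expressions; the token names and the Python representation below are illustrative assumptions, not the text's implementation.

```python
import re

# Hypothetical token specification sufficient for the assignment-statement
# example; blanks are matched by 'skip' and eliminated, as the text notes.
TOKEN_SPEC = [
    ('num',    r'\d+(\.\d+)?'),
    ('id',     r'[A-Za-z][A-Za-z0-9]*'),
    ('assign', r':='),
    ('plus',   r'\+'),
    ('times',  r'\*'),
    ('skip',   r'[ \t]+'),
]
MASTER = re.compile('|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC))

def tokens(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != 'skip':        # blanks eliminated here
            yield (m.lastgroup, m.group())

print(list(tokens('position := initial + rate * 60')))
# [('id', 'position'), ('assign', ':='), ('id', 'initial'),
#  ('plus', '+'), ('id', 'rate'), ('times', '*'), ('num', '60')]
```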

SYNTAX ANALYSIS

Hierarchical analysis is called parsing or syntax analysis. It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. Usually, a parse tree such as the one shown in Figure 1.4 represents the grammatical phrases of the source program.

SEMANTIC ANALYSIS

The semantic analysis phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase. It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and operands of expressions and statements.


An important component of semantic analysis is type checking. Here the compiler checks that each operator has operands that are permitted by the source language specification. For example, many programming language definitions require a compiler to report an error every time a real number is used to index an array. However, the language specification may permit some operand coercions, for example, when a binary arithmetic operator is applied to an integer and a real. In this case, the compiler may need to convert the integer to a real.
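The two checks above can be sketched as small type-checking routines; the type names, function names, and error messages are hypothetical illustrations, not from the text.

```python
# Sketch of the coercion rule described above: applying a binary arithmetic
# operator to an integer and a real coerces the integer operand to real.

def check_binary(op, left_type, right_type):
    if left_type == right_type == 'integer':
        return 'integer'
    if {left_type, right_type} <= {'integer', 'real'}:
        return 'real'        # the integer operand is coerced to real
    raise TypeError(f'operands {left_type}, {right_type} invalid for {op}')

def check_index(index_type):
    # Many language definitions forbid indexing an array with a real.
    if index_type == 'real':
        raise TypeError('a real number may not index an array')

print(check_binary('+', 'integer', 'real'))  # real
```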

SYMBOL-TABLE MANAGEMENT

An essential function of a compiler is to record the identifiers used in the source program and collect information about various attributes of each identifier. These attributes may provide information about the storage allocated for an identifier, its type, its scope, and in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument, and the type returned, if any.

A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly.

When the lexical analyzer detects an identifier in the source program, the identifier is entered into the symbol table. However, the attributes of an identifier cannot normally be determined during lexical analysis. For example, in a Pascal declaration like

var position, initial, rate : real ;

the type real is not known when position, initial, and rate are seen by the lexical analyzer.
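A symbol table of this kind can be sketched as a dictionary of records; the class and method names below are illustrative assumptions. Identifiers are entered with an unknown type when the lexical analyzer sees them, and the type is filled in once the declaration is processed.

```python
# Minimal symbol table: one record per identifier, with fields for its
# attributes, as described in the text.

class SymbolTable:
    def __init__(self):
        self.records = {}

    def enter(self, name):
        """Called when the lexical analyzer detects an identifier."""
        self.records.setdefault(name, {'type': None})

    def set_attr(self, name, attr, value):
        """Attributes are filled in by later phases."""
        self.records[name][attr] = value

    def lookup(self, name):
        return self.records.get(name)

table = SymbolTable()
for ident in ('position', 'initial', 'rate'):   # seen before their type
    table.enter(ident)
for ident in ('position', 'initial', 'rate'):   # 'var ... : real' processed
    table.set_attr(ident, 'type', 'real')
print(table.lookup('rate'))  # {'type': 'real'}
```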

INTERMEDIATE CODE GENERATION

After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the source program. We can think of this intermediate representation as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to produce, and easy to translate into the target program.

CODE OPTIMIZATION

The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result.

CODE GENERATION

The final phase of the compiler is the generation of target code, consisting normally of relocatable machine code or assembly code. Memory locations are selected for each of the variables used by the program. Then, intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A crucial aspect is the assignment of variables to registers.

ERROR-HANDLING

Each phase can encounter errors. However, after detecting an error, a phase must somehow deal with that error, so that compilation can proceed, allowing further errors in the source program to be detected. A compiler that stops when it finds the first error is not as helpful as it could be.

The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the compiler. The lexical phase can detect errors where the characters remaining in the input do not form any token of the language. Errors where the token stream violates the structure rules (syntax) of the language are determined by the syntax analysis phase. During semantic analysis the compiler tries to detect constructs that have the right syntactic structure but no meaning to the operation involved.


COMPILER-WRITING TOOLS

The compiler writer, like any programmer, can profitably use software tools such as debuggers, version managers, profilers, and so on. Shortly after the first compilers were written, systems to help with the compiler-writing process appeared. These systems have often been referred to as compiler-compilers, compiler-generators, or translator-writing systems. Largely, they are oriented around a particular model of languages, and they are most suitable for generating compilers of languages similar to the model.

Some general tools have been created for the automatic design of specific compiler components. These tools use specialized languages for specifying and implementing the component, and many use algorithms that are quite sophisticated. The most successful tools are those that hide the details of the generation algorithm and produce components that can be easily integrated into the remainder of a compiler. The following is a list of some useful compiler-construction tools:

1. Parser generators. These produce syntax analyzers, normally from input that is based on a context-free grammar. In early compilers, syntax analysis consumed not only a large fraction of the running time of a compiler, but also a large fraction of the intellectual effort of writing a compiler. This phase is now considered one of the easiest to implement. Many parser generators utilize powerful parsing algorithms that are too complex to be carried out by hand.

2. Scanner generators. These automatically generate lexical analyzers, normally from a specification based on regular expressions. The basic organization of the resulting lexical analyzer is in effect a finite automaton.

3. Syntax-directed translation engines. These produce collections of routines that walk the parse tree, generating intermediate code. The basic idea is that one or more “translations” are associated with each node of the parse tree, and each translation is defined in terms of translations at its neighbor nodes in the tree.

4. Automatic code generators. Such a tool takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for the target machine. The rules must include sufficient detail that we can handle the different possible access methods for data; e.g., variables may be in registers, in a fixed (static) location in memory, or may be allocated a position on a stack. The basic technique is “template matching.” The intermediate code statements are replaced by “templates” that represent sequences of machine instructions, in such a way that the assumptions about storage of variables match from template to template.

5. Data-flow engines. Much of the information needed to perform good code optimization involves “data-flow analysis,” the gathering of information about how values are transmitted from one part of a program to each other part. Different tasks of this nature can be performed by essentially the same routine, with the user supplying details of the relationship between intermediate code statements and the information being gathered.

FORMAL SPECIFICATIONS AND GENERATION OF COMPILER MODULES

Compiler subtask                      | Specification mechanism | Automaton type
Lexical analysis                      | Regular expressions     | Deterministic finite automata
Syntax analysis                       | Context-free grammars   | Deterministic pushdown automata
Semantic analysis                     | Attribute grammars      |
Efficiency-increasing transformations | Tree transformations    | Finite tree transducers
Code selection in code generation     | Regular tree grammars   | Finite tree automata

MODULE II. LEXICAL ANALYSIS

THE ROLE OF LEXICAL ANALYZER


The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. Upon receiving a “get next token” command from the parser, the lexical analyzer reads input characters until it can identify the next token.

Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain secondary tasks at the user interface. One such task is stripping comments and white space (blanks, tabs, and newline characters) from the source program. Another is correlating error messages from the compiler with the source program.

Sometimes lexical analyzers are divided into a cascade of two phases, the first called “scanning” and the second “lexical analysis”. The scanner is responsible for simple tasks, while the lexical analyzer proper does the more complex operations.

ISSUES IN LEXICAL ANALYSIS

Reasons for separating the analysis phase of compiling into lexical analysis and parsing:

1. Simpler design is perhaps the most important consideration. The separation of lexical analysis from syntax analysis often allows us to simplify one or the other of these phases.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to construct a specialized and potentially more efficient processor for the task.
3. Compiler portability is enhanced. Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer.

TOKENS, PATTERNS, LEXEMES

A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. Examples of tokens:

Token    | Sample Lexemes       | Informal Description of Pattern
const    | const                | const
if       | if                   | if
relation | <, <=, =, <>, >, >=  | < or <= or = or <> or > or >=
id       | pi, count, D2        | letter followed by letters and digits
num      | 3.1416, 0, 6.02E23   | any numeric constant
literal  | “core dumped”        | any characters between “ and ” except “

Tokens are treated as terminal symbols in the grammar for the source language, using boldface names to represent tokens. The lexemes matched by the pattern for the token represent strings of characters in the source program that can be treated together as a lexical unit.

In most languages, the following constructs are treated as tokens: keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as parentheses, commas, and semicolons.

A pattern is a rule describing the set of lexemes that can represent a particular token in a source program. The pattern for the token const is just the single string const that spells out the keyword.

ATTRIBUTES FOR TOKENS

When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler.

The lexical analyzer collects information about tokens into their associated attributes. The tokens influence parsing decisions; the attributes influence the translation of tokens.

LEXICAL ERRORS

Few errors are discernible at the lexical level alone, because a lexical analyzer has a very localized view of the source program.


But suppose a situation does arise in which the lexical analyzer is unable to proceed because none of the patterns for the tokens matches a prefix of the remaining input. Perhaps the simplest recovery strategy is “panic mode” recovery.

Other possible error-recovery actions are:

1. deleting an extraneous character
2. inserting a missing character
3. replacing an incorrect character by a correct character
4. transposing two adjacent characters

Error transformations like these may be tried in attempting to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by just a single error transformation.

One way of finding the error in a program is to compute the minimum number of error transformations required to transform the erroneous program into one that is syntactically well formed.
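The minimum number of error transformations is an edit-distance computation, which can be sketched with dynamic programming. The sketch below counts deletions, insertions, and replacements (Levenshtein distance); the transposition transformation from the list above is omitted for brevity.

```python
# Minimum number of single-character error transformations (delete, insert,
# replace) needed to turn string a into string b, by dynamic programming.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # replace ca by cb
        prev = cur
    return prev[-1]

print(edit_distance('whlie', 'while'))  # 2 (without transposition)
```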

INPUT BUFFERING

Three general approaches to the implementation of a lexical analyzer:

1. Use a lexical analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

The three are listed in order of increasing difficulty for the implementor. Since the lexical analyzer is the only phase of the compiler that reads the source program character by character, it is possible to spend a considerable amount of time in the lexical analysis phase, even though the later phases are conceptually more complex.

Buffer Pairs

There are many times when the lexical analyzer needs to look ahead several characters beyond the lexeme for a pattern before a match can be announced. Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.

| E | = | M | * | C | * | * | 2 | eof |     |

An input buffer in two halves

Sentinels

Each advance of the forward pointer requires two tests: one to check whether the end of a buffer half has been reached, and one to determine the character read. The two tests can be reduced to one if each buffer half is extended to hold a sentinel character at the end.

| E | = | M | * | eof | C | * | * | 2 | eof |     | eof |

Sentinels at the end of each buffer half
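The sentinel idea can be sketched in a few lines: appending an end marker to the buffer lets the scanning loop make one test per character instead of two. The buffer contents, sentinel value, and function name here are illustrative assumptions.

```python
# Sketch of scanning with a sentinel: the sentinel character is guaranteed
# to fail the single "is this still part of the lexeme?" test, so no
# separate end-of-buffer test is needed inside the loop.

EOF = '\0'   # hypothetical sentinel, assumed not to occur in source text

def scan_identifier(buffer, start):
    """Advance the forward pointer over letters/digits; the sentinel stops it."""
    forward = start
    while buffer[forward].isalnum():   # one test per character
        forward += 1
    return buffer[start:forward], forward

buf = 'rate60 ' + EOF                  # a buffer half ending in a sentinel
lexeme, pos = scan_identifier(buf, 0)
print(lexeme)  # rate60
```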

SPECIFICATION OF TOKENS

Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions will serve as names for sets of strings.

Strings and languages

An alphabet or character class denotes any finite set of symbols.



Example: The set {0, 1} is the binary alphabet.

String – a finite sequence of symbols drawn from the alphabet. In language theory, the terms sentence and word are often used as synonyms for the term “string”. The length of a string s, usually written |s|, is the number of occurrences of symbols in s.

Empty string – denoted ε, is a special string of length zero. The term language denotes any set of strings over some fixed alphabet.

Term | Definition
Prefix of s | A string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.
Suffix of s | A string formed by deleting zero or more of the leading symbols of s; nana is a suffix of banana.
Substring of s | A string obtained by deleting a prefix and a suffix from s; nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or suffix of s.
Proper prefix, suffix, or substring of s | Any nonempty string x that is, respectively, a prefix, suffix, or substring of s, such that s is not equal to x.
Subsequence of s | Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.

Operations on Languages

There are several important operations that can be applied to languages. For lexical analysis, the primary operations are union, concatenation, and closure.

Operation | Definition
Union of L and M, written L ∪ M | L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M, written LM | LM = { st | s is in L and t is in M }
Kleene closure of L, written L* | L* denotes “zero or more concatenations of” L
Positive closure of L, written L+ | L+ denotes “one or more concatenations of” L
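For finite languages these operations can be sketched directly as set computations; this is an illustration, with Kleene closure truncated at a maximum string length, since L* is infinite in general.

```python
# Union, concatenation, and (length-bounded) Kleene closure on finite
# languages represented as Python sets of strings.

def union(L, M):
    return L | M

def concat(L, M):
    return {s + t for s in L for t in M}

def kleene(L, max_len):
    """All strings of L*, truncated to length <= max_len."""
    result, layer = {''}, {''}
    while True:
        layer = {s for s in concat(layer, L) if len(s) <= max_len}
        if not layer - result:      # no new strings within the bound
            return result
        result |= layer

L, M = {'a'}, {'b'}
print(sorted(union(L, M)))    # ['a', 'b']
print(sorted(concat(L, M)))   # ['ab']
print(sorted(kleene(L, 3)))   # ['', 'a', 'aa', 'aaa']
```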

REGULAR EXPRESSIONS

A regular expression is built up out of simpler regular expressions using a set of defining rules. Each regular expression r denotes a language L(r). The defining rules specify how L(r) is formed by combining, in various ways, the languages denoted by the subexpressions of r.

Rules that define regular expressions:

1. ε is a regular expression that denotes {ε}, the set containing only the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
   a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
   b) (r)(s) is a regular expression denoting L(r)L(s)
   c) (r)* is a regular expression denoting (L(r))*
   d) (r) is a regular expression denoting L(r)

A language denoted by a regular expression is said to be a regular set.

PRACTICE SET

1. What is the input alphabet of each of the following languages?
   a. Pascal
   b. C
   c. C++
   d. Visual Basic

2. What are the conventions regarding the use of blanks in each language in number 1?

3. In a string of length n, how many of the following are there?
   a. Prefixes
   b. Suffixes
   c. Substrings
   d. Proper prefixes
   e. Subsequences

4. Describe the language denoted by each of the following regular expressions:
   a. 0(0|1)*0
   b. (0|1)*0(0|1)(0|1)
   c. 0*10*10*
   d. (00|11)*((01|10)(00|11))

5. Write regular definitions for the following languages:
   a. all strings of letters that contain the five vowels in order
   b. all strings of letters in which the letters are in ascending lexicographic order
   c. comments consisting of a string surrounded by /* and */, without an intervening */, unless it appears inside the quotes “ and ”
   d. all strings of digits with at most one repeated digit

REGULAR EXPRESSIONS & REGULAR LANGUAGES

• “Familiar” algebraic properties/identities
  - An alphabet is always a finite set. It can be ANY alphabet, but in the theory of languages it is usually chosen to consist of two or three characters, typically a and b, or 0 and 1.

Σ = {a, b} = (a + b) = (a | b) = {a} ∪ {b} = a ∪ b are all identical regular-expression notations for Σ.

Definition
Consider a string w made up from the symbols in the alphabet Σ = {a, b}, such as “a” or “b” or “ab” or “ababbbbb” or “aaa” or “abbbaaabb”, etc. We then define the string “e”, called the empty string, such that for any string w:

ew = we = w

The symbol “e” is thus the “neutral element” for strings under the concatenation operation. (This is similar to defining 0 as the neutral element for real numbers under addition, or 1 as the neutral element under multiplication.) It is important to notice that e is never part of the alphabet.

a* = {a}* = {e, a, aa, aaa, aaaa, ...}; this is the Kleene star (or simply star) operation applied to one symbol of the alphabet.

Σ* = {a, b}* = (a, b)* = (a | b)* = an infinite set given by:
{e, a, b, aa, ab, ba, bb, aaa, aab, aba, baa, abb, bab, bba, bbb, aaaa, ...},
that is, all possible strings of a’s and b’s. This is the star operation applied to the entire alphabet. It can be proven that another regular expression for this same set is (a*b*)*.

Ø = { } ≠ e; notice that Ø is a set while e is a symbol.


Ø* = {e} ≠ Ø; the set Ø* has one element in it, namely e, the empty string, and thus it is NOT the empty set.

a | a*b denotes the set containing the string a or all strings of zero or more a’s followed by a b.

Note: The notation “a | a*b” can also be written as “(a + a*b)”, as we shall see when reviewing regular expressions. The symbol “|” (or the symbol “+”) in this case means the so-called “or” operation used in set theory. Therefore, the expression a*b means the following:

a*b = {e, a, aa, aaa, ...}b = {b, ab, aab, aaab, ...},

that is, the set of all strings of zero or more a’s followed by a b.

• Define a set of strings (i.e., a language L) over a given alphabet Σ using a set of construction rules:
  - The empty string e is a regular expression.
  - Every symbol a in the alphabet Σ is a regular expression.
  - Given the regular expressions r and s denoting the languages L(r) and L(s), then:
    • (r) + (s) is the regular expression denoting L(r) ∪ L(s)
    • (r)°(s) is the regular expression denoting L(r)°L(s)
    • (r)* is a regular expression denoting (L(r))*
    • (r) is a regular expression denoting L(r)

• Precedence rules to avoid parentheses:
  - the unary operator * has the highest precedence
  - concatenation ° has the second highest precedence
  - + (also “|”, ∪ (union), “or”) has the lowest precedence

• Algebraic properties of regular expressions r and s:
  - r+s = s+r => + is commutative
  - r+(s+t) = (r+s)+t => + is associative
  - (r°s)°t = r°(s°t) => ° is associative
  - r°(s+t) = r°s + r°t and (s+t)°r = s°r + t°r => ° distributes over +
  - e°r = r and r°e = r => e is the identity element for °
  - r* = (r+e)* => relation between * and e
  - r** = r* => * is idempotent

NOTE: If two regular expressions denote the same regular language, we say that they are equivalent.

Examples of regular expressions: strings over the alphabet Σ = {0, 1}

• L = 0*1(0 | 1)*0 = 0*1Σ*0; set of non-zero binary numbers that are multiples of 2. (They end in 0.)

• L = (0 + 1)* = Σ*; set of all strings of 0's and 1s.

• L = (0 + 1)*00(0 + 1)* = Σ*00Σ*; set of all strings of 0's and 1's with at least two consecutive 0's.

• L = (0 + 1)*011 = Σ*011; set of all strings of 0's and 1's ending in 011.

• L = letter (letter | digit)*; set of all strings of letters and digits beginning with a letter.

• (e + 00)* is equivalent to (00)*.

• 1 + 01 is equivalent to (e + 0)1.

• 0* + 0*(1 + 11)(00*(1 + 11))*0* ; set of all strings that do not have the substring 111.
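Two of these example languages can be spot-checked with Python's `re` module, which writes alternation as `|` rather than `+`; this is an illustrative check, not part of the notes.

```python
import re

# Membership checks for L = Σ*011 (strings ending in 011) and
# L = Σ*00Σ* (strings with at least two consecutive 0's).

ends_in_011 = re.compile(r'(0|1)*011')
two_zeros   = re.compile(r'(0|1)*00(0|1)*')

assert ends_in_011.fullmatch('10011') is not None
assert ends_in_011.fullmatch('110010') is None
assert two_zeros.fullmatch('0100') is not None
assert two_zeros.fullmatch('0101') is None
print('all membership checks pass')
```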

MODULE III. THE SYNTACTIC SPECIFICATION OF PROGRAMMING LANGUAGES

FINITE AUTOMATA

• Machine model that recognizes regular languages.
• The finite automaton (FA) machine M is defined by the 5-tuple M = (Σ, S, s0, δ, F), where the alphabet is Σ = {0, 1}; the set of states is S = {s0, s1, s2}; the starting state is s0; the set of final states is F = {s2}; and the transitions δ are defined by the table below.


δ  | 0        | 1
s0 | s0       | s1
s1 | {s1, s2} | s1
s2 | s2       | Ø

M can also be represented by the transition graph:

This figure (which corresponds to the transition table above) is a non-deterministic finite automaton, NFA. The big circles represent states and the double circles represent accepting or final states. The state with an unlabeled incoming arrow is the starting state. (A brief list of the differences between a deterministic finite automaton, DFA, and a non-deterministic finite automaton, NFA, is given below.)

NFA vs. DFA

Deterministic Finite Automaton (DFA)
• For every state there is exactly one outgoing edge per alphabet symbol (the transition function IS a function).
• For each symbol in the alphabet there is a corresponding output, and there is only one.

Non-Deterministic Finite Automaton (NFA)
• At least one of the states has more than one outgoing edge for the same alphabet symbol (the transition function IS NOT a function; it is a relation).
• There may be e-transitions (transitions that occur without the presence of an input symbol from the alphabet).

Are NFAs more powerful than DFAs?
• Not at all. Both NFAs and DFAs recognize regular languages and nothing more.
• Therefore, NFAs can be converted to DFAs, which in turn can be "programmed". (Only a DFA can be directly programmed on a computer!)

Example

The FAs below are equivalent and recognize the same regular language. A regular expression (there may be many regular expressions) corresponding to the regular language can be easily captured from the NFA construction:

0*1Σ*00*

that is, the regular language includes all strings of 0’s and 1’s containing an identified 1 that is eventually followed by an identified 0, regardless of how many characters lie between them, precede them, or follow them.

[Figures: transition diagrams of the equivalent DFA1 (states A, B, C) and NFA (states s0, s1, s2) over the alphabet {0, 1}; the diagrams are not recoverable from the text.]


ALGORITHM TO SIMULATE A DFA

Algorithm 1
Input: A string x terminated by an EOF character, and a DFA defined as a 5-tuple with s0 as its initial state and F as its set of accepting states.
Output: "YES" if the DFA accepts x, "NO" otherwise.

Method: Apply the pseudo-code algorithm below to the input string x. The function move(s, c) gives the state to which there is a transition from state s on input character c. The function nextchar returns the next character of the input string x.

s = s0;
c = nextchar;
while c != EOF
    s = move(s, c);
    c = nextchar;
end
if s is in F then
    return "YES"
else
    return "NO";
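Algorithm 1 translates directly into Python. The DFA below is a hypothetical example (the `delta` table, state names, and `simulate_dfa` are illustrative choices, not part of the module): it accepts binary strings containing an even number of 0's.

```python
def simulate_dfa(delta, s0, F, x):
    """Algorithm 1: return 'YES' if the DFA accepts x, 'NO' otherwise."""
    s = s0
    for c in x:              # plays the role of c = nextchar / while c != EOF
        s = delta[(s, c)]    # s = move(s, c)
    return "YES" if s in F else "NO"

# Hypothetical DFA: accepts strings over {0, 1} with an even number of 0's.
delta = {("s0", "0"): "s1", ("s0", "1"): "s0",
         ("s1", "0"): "s0", ("s1", "1"): "s1"}
print(simulate_dfa(delta, "s0", {"s0"}, "1001"))  # YES (two 0's)
print(simulate_dfa(delta, "s0", {"s0"}, "10"))    # NO  (one 0)
```

Because the machine is deterministic, each input character triggers exactly one table lookup, so the run takes time linear in the length of x.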

Finite Automata Parsing
• Accept the input string iff there exists at least one path from the initial state (s0) to a final state (sf) that spells the input string.
• If no such path exists, reject the string.
• Example: use the input string x = '11000'.

Example

Consider the NFA N = ({q0, ..., q4}, {0, 1}, δ, q0, {q2, q4}) shown below, with δ given by the transition table.

[Figure: transition diagram of N (states q0–q4); see the transition table below.]


δ  | 0        | 1
q0 | {q0, q3} | {q0, q1}
q1 | Ø        | {q2}
q2 | {q2}     | {q2}
q3 | {q4}     | Ø
q4 | {q4}     | {q4}

We show next the proliferation of states of the NFA under the input string 01001.
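This proliferation can be computed by tracking the set of states the NFA may occupy after each input symbol. A minimal sketch using the transition table above (`simulate_nfa` and the dictionary encoding are illustrative choices):

```python
def simulate_nfa(delta, start, F, x):
    """Track every state the NFA could be in after each symbol of x;
    accept iff at least one path spelling x ends in a final state."""
    current = {start}
    for c in x:
        # union of all successors of the current states on symbol c
        current = set().union(*(delta.get((s, c), set()) for s in current))
    return bool(current & F)

# Transition table of N from the text (missing entries are the empty set).
delta = {("q0", "0"): {"q0", "q3"}, ("q0", "1"): {"q0", "q1"},
         ("q1", "1"): {"q2"},
         ("q2", "0"): {"q2"}, ("q2", "1"): {"q2"},
         ("q3", "0"): {"q4"},
         ("q4", "0"): {"q4"}, ("q4", "1"): {"q4"}}
print(simulate_nfa(delta, "q0", {"q2", "q4"}, "01001"))  # True: q4 is reachable
```

On 01001 the state sets are {q0} → {q0, q3} → {q0, q1} → {q0, q3} → {q0, q3, q4} → {q0, q1, q4}; since q4 is final, the string is accepted.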

DEFINITIONS RELATED TO THE e-closure ALGORITHM

Definition

Given a state s of some NFA N, the e-closure(s) is the set of states in N reachable from s on e-transitions only.

Definition
Given a set of states T of some NFA N, the e-closure(T) is the set of states in N reachable from some state s in T on e-transitions alone.

Example
X = e-closure({0}) = {0, 1, 2, 4, 7}
Y = e-closure({2}) = {2}
Z = e-closure({6}) = {6, 1, 2, 4, 7} = {1, 2, 4, 6, 7}
T = e-closure({1, 2, 5}) = e-closure({1}) U e-closure({2}) U e-closure({5}) = {1, 2, 5, 4, 6, 7} = {1, 2, 4, 5, 6, 7}

[Figures: proliferation of the NFA's states (q0–q4) under the input string 01001, and an NFA with e-transitions (states 0–10 over the alphabet {a, b}) to which the e-closure example above refers.]


THE e-closure ALGORITHM
The computation of e-closure(T) is a typical process of searching a graph for the nodes reachable from a given set T of nodes by following the e-labeled edges of the NFA. The (pseudo-code) algorithm below, used to compute e-closure(T), uses a stack to hold the states whose edges have not yet been checked for e-labeled transitions.

Algorithm (e-closure)
push all states in T onto stack;
initialize e-closure(T) to T;
while stack is not empty
    pop t, the top element, off stack;
    for each state u with an edge from t to u labeled e
        if u is not in e-closure(T)
            add u to e-closure(T);
            push u onto stack
    end
end
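The stack-based algorithm above can be sketched in Python. The e-edges used in the demonstration are hypothetical, chosen only so that the results match the worked example earlier (e-closure({0}) = {0, 1, 2, 4, 7}, etc.); the original NFA figure is not available.

```python
def e_closure(T, e_edges):
    """Set of NFA states reachable from the states in T on e-transitions alone."""
    closure = set(T)
    stack = list(T)                    # states not yet checked for e-edges
    while stack:
        t = stack.pop()
        for u in e_edges.get(t, ()):   # each edge t --e--> u
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

# Hypothetical e-edges consistent with the worked example above.
e_edges = {0: [1, 7], 1: [2, 4], 3: [6], 5: [6], 6: [1, 7]}
print(sorted(e_closure({0}, e_edges)))        # [0, 1, 2, 4, 7]
print(sorted(e_closure({6}, e_edges)))        # [1, 2, 4, 6, 7]
print(sorted(e_closure({1, 2, 5}, e_edges)))  # [1, 2, 4, 5, 6, 7]
```

Each state is pushed at most once, so the search runs in time proportional to the number of states plus e-edges.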

ALGORITHM TO BUILD THE DFA FROM AN NFA
Potentially, the total number of states of the DFA is the size of the power set of the set of states of the NFA (2^N).
Input: An NFA N with starting state s0, accepting some regular language.
Output: A DFA D accepting the same language as N.

Method: The algorithm constructs a transition table DTran for D. We call Dstates the set of states of D. We use the definitions given for the e-closure algorithm plus:

move(T, a) = set of states to which there is a transition on input symbol a in Σ from some NFA state s in T.

Algorithm 2 ("subset construction")
initially, e-closure(s0) is the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates
    mark T;
    for each input symbol a in Σ
        U = e-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran(T, a) = U
    end
end
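Algorithm 2 can be sketched in Python. The example applies it to the NFA from page 11 (which has no e-transitions, so e-closure is the identity there); `subset_construction` and the dictionary encoding are illustrative choices.

```python
def subset_construction(nfa_delta, e_edges, s0, alphabet):
    """Algorithm 2 ("subset construction"): build DTran for the equivalent
    DFA, whose states are frozensets of NFA states."""
    def e_closure(T):
        closure, stack = set(T), list(T)
        while stack:
            t = stack.pop()
            for u in e_edges.get(t, ()):
                if u not in closure:
                    closure.add(u)
                    stack.append(u)
        return frozenset(closure)

    def move(T, a):   # all NFA states reachable from T on symbol a
        return set().union(*(nfa_delta.get((s, a), set()) for s in T))

    start = e_closure({s0})
    dstates, unmarked, dtran = {start}, [start], {}
    while unmarked:
        T = unmarked.pop()                      # mark T
        for a in alphabet:
            U = e_closure(move(T, a))
            if U not in dstates:                # add U as an unmarked state
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dtran, dstates

# NFA of page 11: s1 has two 0-successors, so it is non-deterministic.
nfa_delta = {("s0", "0"): {"s0"}, ("s0", "1"): {"s1"},
             ("s1", "0"): {"s1", "s2"}, ("s1", "1"): {"s1"},
             ("s2", "0"): {"s2"}}
dtran, dstates = subset_construction(nfa_delta, {}, "s0", "01")
print(len(dstates))  # 3 reachable DFA states: {s0}, {s1}, {s1, s2}
```

Only reachable subsets are ever created, so in practice the DFA is usually far smaller than the theoretical 2^N bound; here only three of the eight possible subsets appear.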

Algorithm for Subset Construction (taken from Parson's textbook)

Consider an NFA N and a DFA D accepting the same regular language. [Equivalent to Algorithm 2.]
1) Initially the list of states in D is empty. Create the starting state as e-closure(s0), named after the initial state {s0} in N. That is, the new start state is state(1) = e-closure(s0).
2) While there is an uncompleted row in the transition table for D, do:
   a) Let x = {s1, s2, ..., sk} be the state for this row.
   b) For each input symbol a, do:
      i) Find e-closure({s1, s2, ..., sk}, a) = N({s1}, a) U N({s2}, a) U ... U N({sk}, a) = some set we'll call T.
      ii) Create the D-state y = {T} corresponding to T.
      iii) If y is not already in the list of D-states, add it to the list. (This creates a new row in the table.)
      iv) Add the rule D(x, a) = y to the list of transition rules for D.
3) Identify the accepting states in D (they must include the accepting states of N). That is, make state(j) a final state iff at least one state in state(j) is a final state of the NFA.

Yet another algorithm for the construction of a DFA equivalent to an NFA:

state[0] = { };
state[1] = e-closure(s0);
p = 1; j = 0;
while (j <= p) {
    foreach c in ∑ do {
        e = DFAedge(state[j], c);
        if (e == state[i] for some i <= p) {
            trans[j, c] = i;
        } else {
            p = p + 1;
            state[p] = e;
            trans[j, c] = p;
        }
    }
    j = j + 1;
}

Example
Find all possible theoretical states for the NFA given below.

There are in total 2^3 = 8 states for the corresponding DFA. This is the power set of the set of NFA states S = {0, 1, 2}, of size 2^|S|. That is, the 8 states of the DFA are:

2^|S| = {Ø, {0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}}

and graphically, the transition diagram is:

[Figure: transition diagram of the NFA (states 0, 1, 2) over the alphabet {0, 1}.]


Trivially, the state Ø, plus the states {2}, {0, 2}, and {0, 1}, which have no incoming transitions, can be eliminated. After the first elimination cycle is complete, {0, 1, 2} has no incoming transitions and can also be eliminated. Only A = {0}, B = {1}, and C = {1, 2} remain.

Example
The NFA below represents the regular expression letter (letter | digit)*. Find its corresponding DFA. (The theoretical number of states in this case is 2^10 = 1,024.)

Solution:
A = e-closure({1}) = {1}
move(A, letter) = {2}   (function move defined in Algorithm 2)
move(A, digit) = Ø

B = e-closure({2}) = {2, 3, 4, 5, 7, 10}
move(B, letter) = {6}
move(B, digit) = {8}

C = e-closure({6}) = {6, 9, 10, 4, 5, 7} = {4, 5, 6, 7, 9, 10}
move(C, letter) = {6}
move(C, digit) = {8}

D = e-closure({8}) = {8, 9, 10, 4, 5, 7} = {4, 5, 7, 8, 9, 10}
move(D, letter) = {6}
move(D, digit) = {8}
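The closures and moves computed above can be checked mechanically. The e-edges and labeled transitions below are hypothetical reconstructions, chosen only to be consistent with the values in this example (the original NFA figure is not available):

```python
def e_closure(T, e_edges):
    """Stack-based e-closure, as defined earlier in the module."""
    closure, stack = set(T), list(T)
    while stack:
        t = stack.pop()
        for u in e_edges.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

# Hypothetical edges consistent with the worked example.
e_edges = {2: [3], 3: [4, 10], 4: [5, 7], 6: [9], 8: [9], 9: [4, 10]}
moves = {(1, "letter"): {2}, (5, "letter"): {6}, (7, "digit"): {8}}

B = e_closure({2}, e_edges)
print(sorted(B))                        # [2, 3, 4, 5, 7, 10]
print(sorted(e_closure({6}, e_edges)))  # [4, 5, 6, 7, 9, 10]
print(sorted(e_closure({8}, e_edges)))  # [4, 5, 7, 8, 9, 10]
# move(B, letter): union of the letter-transitions out of B's states
print(set().union(*(moves.get((s, "letter"), set()) for s in B)))  # {6}
```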

State A is the start state of the DFA and all states that include state 10 of the NFA are accepting states of the DFA. The transition diagram of the DFA is given below.

Saint Michael College

[Figures: NFA with e-transitions (states 1–10) for letter (letter | digit)*, and the transition diagram of the resulting DFA.]

[Figure: transition diagram over the eight power-set states Ø, {0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}, from the power-set example above.]


Cantilan, Surigao del Sur

“Compiler Design Module 1-6 Summary”

Submitted by:

Miranda, Peter Jake J., BSCS-4 (Student)

Submitted to:

Mrs. Cherly Sardovia (Instructress)

