Date post: | 19-Jan-2016 |
Upload: | nora-wright |
A Simple Syntax-Directed Translator
CS308 Compiler Theory
Lecture Outline
• We shall look at a simple programming language and describe the initial phases of compilation.
• We start off by creating a ‘simple’ syntax directed translator that maps infix arithmetic to postfix arithmetic.
• This translator is then extended to cater for more elaborate programs such as the following (see Aho, p. 39):
– while (true) { x = a[i]; a[i] = a[j]; a[j] = x; }
• The extended translator generates simplified intermediate code (see Aho, p. 40).
Two Main Phases (Analysis and Synthesis)
• Analysis phase: breaks up a source program into constituent pieces and produces an internal representation of it, called intermediate code.
• Synthesis phase: translates the intermediate code into the target program.
• During this lecture we shall focus on the analysis phase (compiler front end … see figure next slide)
A Model of a Compiler Front End
Syntax vs. Semantics
• The syntax of a programming language describes the proper form of its programs
• The semantics of the language defines what its programs mean.
A note on Grammars (Context-free)
• A formal grammar is used to specify the syntax of a formal language (for example a programming language like C, Java)
• A grammar describes the structure (usually hierarchical) of programming languages.
– For example, in Java an if-else statement has the form: if ( expression ) statement else statement
– As a production: statement -> if ( expression ) statement else statement
– Note the recursive nature of statement.
A CFG has four components
• A set of terminal symbols, sometimes referred to as ‘tokens’.
• A set of non-terminals, sometimes called ‘syntactic variables’.
• A set of productions.
• A designation of one of the non-terminals as the start symbol.
A Grammar for ‘list of digits separated by + or -’
list -> list + digit
list -> list - digit
list -> digit
digit -> 0 | 1 | … | 9
• Accepts strings such as 9-5+2, 3-1, or 7.
• list and digit are non-terminals
• 0 | 1 | … | 9, +, - are the terminal symbols
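As a rough illustration, the four components of this grammar can be written down as plain data, and a leftmost derivation of 9-5+2 can be replayed step by step. The names TERMINALS, NONTERMINALS, START, PRODUCTIONS, expand_leftmost and derive are hypothetical and chosen only for this sketch.

```python
# Hypothetical encoding of the four components of the 'list of digits' grammar.
TERMINALS = set("0123456789") | {"+", "-"}
NONTERMINALS = {"list", "digit"}
START = "list"
PRODUCTIONS = {
    "list": [["list", "+", "digit"], ["list", "-", "digit"], ["digit"]],
    "digit": [[d] for d in "0123456789"],
}

def expand_leftmost(form, body):
    """Replace the leftmost non-terminal in `form` by the production body `body`."""
    for i, symbol in enumerate(form):
        if symbol in NONTERMINALS:
            if body not in PRODUCTIONS[symbol]:
                raise ValueError(f"{symbol} -> {' '.join(body)} is not a production")
            return form[:i] + body + form[i + 1:]
    raise ValueError("no non-terminal left to expand")

def derive(bodies):
    """Apply the given production bodies in order, starting from the start symbol."""
    form = [START]
    steps = [form]
    for body in bodies:
        form = expand_leftmost(form, body)
        steps.append(form)
    return steps

steps = derive([
    ["list", "+", "digit"],   # list => list + digit
    ["list", "-", "digit"],   #      => list - digit + digit
    ["digit"],                #      => digit - digit + digit
    ["9"], ["5"], ["2"],      #      => 9 - 5 + 2
])
for form in steps:
    print(" ".join(form))
```

The last sentential form contains only terminals, so 9-5+2 is in the language of the grammar.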
Parsing and derivations
• Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar
• A grammar derives strings by beginning with the start symbol and repeatedly replacing a non-terminal by the body of a production
• If the string cannot be derived from the start symbol, the parser reports syntax errors within the string.
Parse Trees
• A parse tree pictorially shows how the start symbol of a grammar derives a string in the language
• A grammar can have more than one parse tree generating a given string of terminals (thus making it ambiguous)
Parse Trees
• A parse tree is a tree with the following properties:
– The root is labeled by the start symbol.
– Each leaf is labeled by a terminal or by ε (the empty string).
– Each interior node is labeled by a non-terminal.
– If A is the non-terminal labeling some interior node and X1, X2, ..., Xn are the labels of the children of that node from left to right, then there must be a production A -> X1 X2 ... Xn. Here, X1, X2, ..., Xn each stand for a symbol that is either a terminal or a non-terminal.
Ambiguity
string -> string + string | string - string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
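To see why this grammar is ambiguous, the sketch below builds two different parse structures for the same string 9-5+2 and evaluates each. The tuple encoding and the helper names evaluate and leaves are assumptions made for illustration only.

```python
def evaluate(tree):
    """Evaluate a tree given as an int leaf or a (left, op, right) tuple."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l - r

def leaves(tree):
    """Read the leaves left to right to recover the string the tree generates."""
    if isinstance(tree, int):
        return str(tree)
    left, op, right = tree
    return leaves(left) + op + leaves(right)

left_grouped = ((9, "-", 5), "+", 2)    # corresponds to (9-5)+2
right_grouped = (9, "-", (5, "+", 2))   # corresponds to 9-(5+2)

print(leaves(left_grouped), evaluate(left_grouped))
print(leaves(right_grouped), evaluate(right_grouped))
```

Both trees generate the same string of terminals, yet they evaluate to 6 and 2 respectively, which is why the grammar needs to be disambiguated.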
Operator Associativity and Precedence
• To resolve some of the ambiguity in grammars that have operators we use:
– Operator associativity: in most programming languages arithmetic operators are left-associative.
– E.g. 9+5-2 = (9+5)-2
– However, = is right-associative, i.e. a=b=c is equivalent to a=(b=c)
– Operator precedence: if an operator has higher precedence, then it binds to its operands first.
– E.g. * has higher precedence than +, therefore 9+5*2 is equivalent to 9+(5*2)
Syntax-Directed Translation
• Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar.
Postfix Notation
• If E is a variable or constant, then the postfix notation for E is E itself.
• If E is an expression of the form E1 op E2, where op is any binary operator, then the postfix notation for E is E’1 E’2 op, where E’1 and E’2 are the postfix notations for E1 and E2, respectively.
• If E is a parenthesized expression of the form (E1), then the postfix notation for E is the same as the postfix notation for E1.
• X=Y==Z+U*V?Y:0
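The three rules above translate almost directly into a recursive function. The following is a minimal sketch, assuming expressions are already given as trees: a string for a variable or constant, or a (left, op, right) tuple for a binary operation. Parentheses do not appear in the tree because grouping is expressed by nesting. The function name postfix is hypothetical.

```python
def postfix(e):
    """Return the postfix notation of expression tree `e`."""
    if isinstance(e, str):
        # Rule 1: a variable or constant is its own postfix notation.
        return e
    # Rule 2: E1 op E2 becomes E1' E2' op.
    left, op, right = e
    return f"{postfix(left)} {postfix(right)} {op}"

print(postfix((("9", "-", "5"), "+", "2")))   # tree for (9-5)+2
print(postfix(("9", "-", ("5", "+", "2"))))   # tree for 9-(5+2)
```

The first call yields 9 5 - 2 + and the second 9 5 2 + -, so the postfix form records the grouping without needing parentheses.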
Synthesized Attributes
• Attributes associated with non-terminals and terminals in a grammar.
• An attribute is said to be synthesized if its value at a parse-tree node N is determined from attribute values of the children of N and at N itself.
Semantic Rules for infix to postfix
A syntax-directed translation scheme
• A notation for specifying a translation by attaching program fragments to productions in a grammar.
• The program fragments are called semantic actions.
Parsing
• Parsing is the process of determining how a string of terminals can be generated by a grammar.
• Two classes :
– Bottom-up, where construction starts at the leaves and proceeds towards the root;
– Top-down, where construction starts at the root and proceeds towards the leaves.
Top-Down Parsing
• The top-down construction of a parse is done by starting with the root, labeled with the starting non-terminal stmt, and repeatedly performing the following two steps.
– At node N, labeled with non-terminal A, select one of the productions for A and construct children at N for the symbols in the production body.
– Find the next node at which a subtree is to be constructed, typically the leftmost unexpanded non-terminal of the tree.
Predictive Parsing
• Recursive descent parsing : a top-down method of syntax analysis in which a set of recursive procedures is used to process the input.
• Predictive parsing is a simple form of recursive-descent parsing, in which the lookahead symbol (the current terminal being scanned in the input, compared against the first symbols that each production body can generate) unambiguously determines the flow of control through the procedure body for each non-terminal.
Predictive Parsing
• Every non-terminal has such a procedure in a predictive parser.
Left Recursion
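• In a left-recursive production, the leftmost symbol of the body is the same as the non-terminal at the head, so the procedure for expr would call itself immediately.
• Since the lookahead symbol changes only when a terminal is matched, no change to the input takes place between recursive calls of expr, and the parser can loop forever.
• A left-recursive production can be eliminated by rewriting the offending production.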
Eliminating Left Recursion
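One standard rewriting replaces A -> A α | β by A -> β R and R -> α R | ε, where R is a new non-terminal. The sketch below applies that rule to immediate left recursion only. The function name eliminate_left_recursion, the list-of-lists encoding, and the use of an empty list for ε are assumptions of this example.

```python
def eliminate_left_recursion(head, bodies, new_head):
    """Rewrite head -> head α | β as head -> β new_head, new_head -> α new_head | ε.

    Bodies are lists of symbols; the empty list stands for ε.
    Only immediate left recursion is handled.
    """
    alphas = [body[1:] for body in bodies if body and body[0] == head]
    betas = [body for body in bodies if not body or body[0] != head]
    if not alphas:
        return {head: bodies}  # nothing to rewrite
    return {
        head: [beta + [new_head] for beta in betas],
        new_head: [alpha + [new_head] for alpha in alphas] + [[]],
    }

rewritten = eliminate_left_recursion(
    "list",
    [["list", "+", "digit"], ["list", "-", "digit"], ["digit"]],
    "rest",
)
for nonterminal, bodies in rewritten.items():
    for body in bodies:
        print(nonterminal, "->", " ".join(body) or "ε")
```

Applied to the list grammar used earlier, this gives list -> digit rest and rest -> + digit rest | - digit rest | ε, which a predictive parser can handle because every alternative for rest starts with a distinct terminal or is empty.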
Translators (using program fragments)
Left-recursion-eliminated
A Translator for Simple Expressions (formed by extending the predictive parser)
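A minimal sketch of such a translator, assuming single-digit operands and the operators + and - only. It follows the left-recursion-eliminated scheme expr -> term rest, rest -> + term { print('+') } rest | - term { print('-') } rest | ε, term -> digit { print(digit) }, with the tail-recursive rest folded into a loop. The names Parser, ParseError and translate are hypothetical, and output is collected in a list rather than printed so it can be inspected.

```python
class ParseError(Exception):
    """Raised when the input is not a valid expression."""

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.out = []          # semantic actions append to this list

    @property
    def lookahead(self):
        """Current input character, or None at end of input."""
        return self.text[self.pos] if self.pos < len(self.text) else None

    def match(self, terminal):
        """Consume the lookahead if it equals `terminal`, otherwise fail."""
        if self.lookahead != terminal:
            raise ParseError(f"expected {terminal!r} at position {self.pos}")
        self.pos += 1

    def expr(self):
        # expr -> term rest, with rest implemented as a loop
        self.term()
        while self.lookahead in ("+", "-"):
            op = self.lookahead
            self.match(op)
            self.term()
            self.out.append(op)      # action: emit the operator after both operands

    def term(self):
        # term -> digit { emit digit }
        t = self.lookahead
        if t is not None and t in "0123456789":
            self.match(t)
            self.out.append(t)
        else:
            raise ParseError(f"expected a digit at position {self.pos}")

def translate(text):
    """Translate an infix expression over digits, + and - into postfix."""
    parser = Parser(text)
    parser.expr()
    if parser.lookahead is not None:
        raise ParseError(f"unexpected {parser.lookahead!r} at position {parser.pos}")
    return "".join(parser.out)

print(translate("9-5+2"))
```

For the input 9-5+2 the actions fire in the order 9, 5, -, 2, +, giving the postfix string 95-2+.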
Lexical Analysis
• A lexical analyzer reads characters from the input and groups them into "token objects."
• Token
• Terminal symbol
• Lexeme
Extended translation scheme
Reading Ahead
• Is it '>' or '>='? The lexer needs to read one character ahead in order to decide what token to return to the parser.
• One-character read ahead usually suffices, so a simple solution is to use a variable, call it peek, to hold the next input character.
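The one-character read-ahead can be sketched as follows. The function name scan_relop and the choice to return the token together with the next input position are assumptions made for this example. The variable peek plays the role described above.

```python
def scan_relop(text, pos=0):
    """Scan '>' or '>=' starting at text[pos]; return (token, next_pos)."""
    if pos >= len(text) or text[pos] != ">":
        raise ValueError(f"expected '>' at position {pos}")
    # peek holds the next input character ('' at end of input)
    peek = text[pos + 1] if pos + 1 < len(text) else ""
    if peek == "=":
        return ">=", pos + 2     # peek was part of the token, so consume it
    return ">", pos + 1          # peek belongs to the next token, leave it unread

print(scan_relop("a>=b", 1))
print(scan_relop("a>b", 1))
```

In the second call the character b was inspected but not consumed, so the next token starts at it.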
Constant
• Write tokens as tuples enclosed between <>
– 31 + 28 + 59 is transformed into the sequence <num, 31> <+> <num, 28> <+> <num, 59>
• Simulate parsing some number:
if ( peek holds a digit ) {
    v = 0;
    do {
        v = v * 10 + integer value of digit peek;
        peek = next input character;
    } while ( peek holds a digit );
    return token <num, v>;
}
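A runnable counterpart of the pseudocode above, as a sketch. The function name tokenize and the tuple encoding of tokens, ("num", v) for numbers and a one-element tuple for any other character, are assumptions of this example.

```python
DIGITS = "0123456789"

def tokenize(text):
    """Group characters into tokens; numbers become ("num", value)."""
    tokens = []
    i = 0
    while i < len(text):
        peek = text[i]
        if peek.isspace():
            i += 1                       # skip white space
        elif peek in DIGITS:
            v = 0
            # accumulate the value while the next character is a digit
            while i < len(text) and text[i] in DIGITS:
                v = v * 10 + int(text[i])
                i += 1
            tokens.append(("num", v))
        else:
            tokens.append((peek,))       # single-character token such as +
            i += 1
    return tokens

print(tokenize("31 + 28 + 59"))
```

This reproduces the sequence <num, 31> <+> <num, 28> <+> <num, 59> from the slide in tuple form.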
Keywords and Identifiers
• A character string forms an identifier only if it is not a keyword.
Symbol Table
• Data structures that are used by compilers to hold information about the source-program constructs.
• Information is collected incrementally throughout the analysis phase and used for the synthesis phase.
• One symbol table per scope (of declaration). For example, the block
{ int x; char y; { bool y; x; y; } x; y; }
is translated, with each use annotated by its declared type, into
{ { x:int; y:bool; } x:int; y:char; }
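One way to realize a symbol table per scope is to chain tables, so that a lookup that fails in the innermost table continues in the enclosing one. The following is a minimal sketch. The class name Env and its methods put and get are assumptions of this example.

```python
class Env:
    """A symbol table for one scope, linked to the table of the enclosing scope."""

    def __init__(self, prev=None):
        self.table = {}
        self.prev = prev          # enclosing scope, or None for the outermost block

    def put(self, name, type_):
        """Record a declaration in the current scope."""
        self.table[name] = type_

    def get(self, name):
        """Find the nearest enclosing declaration of `name`, or None if undeclared."""
        env = self
        while env is not None:
            if name in env.table:
                return env.table[name]
            env = env.prev
        return None

# Mirrors the block { int x; char y; { bool y; x; y; } x; y; }
outer = Env()
outer.put("x", "int")
outer.put("y", "char")
inner = Env(outer)
inner.put("y", "bool")

print(inner.get("x"), inner.get("y"))   # uses inside the inner block
print(outer.get("x"), outer.get("y"))   # uses after the inner block closes
```

Inside the inner block y resolves to bool while x is found in the enclosing scope as int. After the inner block, y resolves to char again.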
Intermediate Code Generation
• The front end of a compiler constructs an intermediate representation of the source program from which the back end generates the target program.
• Two kinds of intermediate representations
– Tree, including parse trees and (abstract) syntax trees.
– Linear representations, especially “three-address code.”
Static Checking
• Done by a compiler front end
• To check that the program follows the syntactic and semantic rules
• Including:
– Syntactic checking
– Type checking
Syntax Trees
• For statements (n is a node in the syntax tree)
stmt -> while ( expr ) stmt1
{ stmt.n = new While(expr.n, stmt1.n); }
stmts -> stmts1 stmt
{ stmts.n = new Seq(stmts1.n, stmt.n); }
Syntax Trees
• For expressions
Three-Address Code
• Three-address code is a sequence of instructions of the form
x = y op z
• Arrays will be handled by using the following two variants of instructions:
1. x [ y ] = z
2. x = y [ z ]
• Instructions for control flow:
1. ifFalse x goto L
2. ifTrue x goto L
3. goto L
• Instruction for copying a value:
x = y
Translation of Statements
• Use jump instructions to implement the flow of control through the statement.
• The translation of if expr then stmt1
Functions lvalue(x:Expr) and rvalue(x:Expr)
• In a = a + 1, a is computed differently as an l-value (left side) and as an r-value (right side).
• Two functions are used to distinguish them:
– rvalue, which when applied to a nonleaf node x, generates the instructions to compute x into a temporary variable, and returns a new node representing the temporary.
– lvalue, which when applied to a nonleaf node x, generates instructions to compute the subtrees below x, and returns a node representing the “address” for x.
• R-values are what we usually think of as “values”, while l-values are “locations”.
Translation of Expressions
• Expressions contain binary operators, array accesses, assignments, constants and identifiers.
• We can take the simple approach of generating one three-address instruction for each operator node in the syntax tree of an expression.
• Expression: i-j+k translates into
t1 = i-j
t2 = t1+k
• Expression: 2 * a[i] translates into
t1 = a [ i ]
t2 = 2 * t1
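The one-instruction-per-operator approach can be sketched as a recursive walk over the syntax tree. The class name Gen, its method names, and the tuple encoding of nodes, ("op", operator, left, right) and ("index", array, index) with strings as leaves, are assumptions of this example. Only r-values are handled here.

```python
class Gen:
    """Generates three-address instructions for expression trees."""

    def __init__(self):
        self.code = []      # emitted instructions, in order
        self.count = 0      # number of temporaries created so far

    def new_temp(self):
        self.count += 1
        return f"t{self.count}"

    def rvalue(self, node):
        """Emit code computing `node`; return the name holding its value."""
        if isinstance(node, str):
            return node                      # identifier or constant: no code needed
        kind = node[0]
        if kind == "op":
            _, op, left, right = node
            l = self.rvalue(left)
            r = self.rvalue(right)
            t = self.new_temp()
            self.code.append(f"{t} = {l} {op} {r}")
            return t
        if kind == "index":
            _, array, index = node
            i = self.rvalue(index)
            t = self.new_temp()
            self.code.append(f"{t} = {array} [ {i} ]")
            return t
        raise ValueError(f"unknown node kind: {kind!r}")

g1 = Gen()
g1.rvalue(("op", "+", ("op", "-", "i", "j"), "k"))      # i-j+k
g2 = Gen()
g2.rvalue(("op", "*", "2", ("index", "a", "i")))        # 2 * a[i]
print("\n".join(g1.code))
print("\n".join(g2.code))
```

Each operator or array-access node produces exactly one instruction and one fresh temporary, which matches the two translations shown above.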
Test Yourself
• Generate three-address code for
if (x[2*a]==y[b]) x[2*a+1]=y[b+1];
Summary
• Grammars
• Parse trees and syntax trees
– Ambiguity
• Postfix notation
• Lexical analyzer
– Token
– Synthesized attributes
• Parsing
– Predictive parsing
• Syntax-directed translation
– Attaching rules to productions
– Attaching program fragments to productions
• Intermediate code
– Abstract syntax tree
– Three-address code