COP4020 Programming Languages

Date post: 25-Feb-2016
Upload: lilah
COP4020 Programming Languages Syntax analysis Prof. Xin Yuan
Transcript
Page 1: COP4020 Programming Languages

COP4020 Programming Languages
Syntax analysis
Prof. Xin Yuan

Page 2: COP4020 Programming Languages

COP4020 Spring 2014, 04/22/23

Overview
- Syntax analysis overview
- Grammar and context-free grammar
- Grammar derivations
- Parse trees

Page 3: COP4020 Programming Languages

Syntax analysis

- Syntax analysis is done by the parser.
- The parser detects whether the program follows the grammar rules and reports syntax errors.
- It produces a parse tree from which intermediate code can be generated.

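[Figure: the source program feeds the lexical analyzer, which returns a token each time the parser requests one; the parser passes a parse tree to the rest of the front end, which emits intermediate code. A symbol table is also shown.]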

Page 4: COP4020 Programming Languages

The syntax of a programming language is described by a context-free grammar, usually written in Backus-Naur Form (BNF).
- Context-free grammars are similar to regular expressions but describe a more general class of languages.
- A grammar gives a precise syntactic specification of a language.
- For some classes of grammars, tools exist that can automatically construct an efficient parser. These tools can also detect syntactic ambiguities and other problems automatically.
- A compiler based on a grammatical description of a language is more easily maintained and updated.

Page 5: COP4020 Programming Languages

Grammars

A grammar has four components, G = (N, T, P, S):
- T is a finite set of tokens (terminal symbols).
- N is a finite set of nonterminals.
- P is a finite set of productions of the form $\alpha \to \beta$, where $\alpha \in (N \cup T)^* N (N \cup T)^*$ and $\beta \in (N \cup T)^*$.
- S is a special nonterminal that is the designated start symbol.

Page 6: COP4020 Programming Languages

Example

Grammar for expressions (T=?, N=?, P=?, S=?)

Productions:
E -> E + E
E -> E - E
E -> ( E )
E -> - E
E -> num
E -> id

How does this correspond to a language? Informally, you expand the nonterminals using the productions until none remain; the resulting sentence (a sequence of tokens) is recognized by the grammar.


Page 7: COP4020 Programming Languages

Language recognized by a grammar

- We say "$\alpha A \beta$ derives $\alpha w \beta$ in one step", written $\alpha A \beta \Rightarrow \alpha w \beta$, if A -> w is a production and $\alpha$ and $\beta$ are arbitrary strings of terminal or nonterminal symbols.
- We say $\alpha_1$ derives $\alpha_m$ if $\alpha_1 \Rightarrow \alpha_2 \Rightarrow \dots \Rightarrow \alpha_m$, written $\alpha_1 \Rightarrow^* \alpha_m$.
- The language L(G) defined by G is the set of strings of terminals w such that $S \Rightarrow^* w$.
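The two relations above can be sketched in a few lines of Python. This is only an illustration, assuming the expression grammar from the earlier example; the names `PRODUCTIONS`, `one_step`, and `derives` are hypothetical and sentential forms are represented as tuples of symbols.

```python
# Productions of the expression grammar from the earlier example,
# written as (left-hand side, right-hand side) pairs over token tuples.
PRODUCTIONS = [
    ("E", ("E", "+", "E")),
    ("E", ("E", "-", "E")),
    ("E", ("(", "E", ")")),
    ("E", ("-", "E")),
    ("E", ("num",)),
    ("E", ("id",)),
]

def one_step(form):
    """Return the set of sentential forms reachable from `form` in one step."""
    results = set()
    for i, symbol in enumerate(form):
        for lhs, rhs in PRODUCTIONS:
            if symbol == lhs:
                # Replace the nonterminal at position i by the right-hand side.
                results.add(form[:i] + rhs + form[i + 1:])
    return results

def derives(start, target):
    """Check start =>* target by breadth-first search over sentential forms.

    No production in this grammar shrinks a form, so any form longer
    than the target can be pruned and the search always terminates.
    """
    seen, frontier = {start}, [start]
    while frontier:
        next_frontier = []
        for form in frontier:
            if form == target:
                return True
            for new_form in one_step(form):
                if len(new_form) <= len(target) and new_form not in seen:
                    seen.add(new_form)
                    next_frontier.append(new_form)
        frontier = next_frontier
    return False

print(("id", "+", "E") in one_step(("E", "+", "E")))
print(derives(("E",), ("id", "+", "(", "num", ")")))
print(derives(("E",), ("id", "+")))
```

The length-based pruning only works because this grammar has no empty productions; a grammar with ε on a right-hand side would need a different search bound.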

Page 8: COP4020 Programming Languages

Example

A -> aA
A -> bA
A -> a
A -> b

G = (N, T, P, S): N=? T=? P=? S=?

What is the language recognized by this grammar?
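One way to explore the question is to enumerate short sentences mechanically. The sketch below is an illustration only (the names `PRODUCTIONS` and `derive` are hypothetical): it expands the leftmost nonterminal until only terminals remain, then compares the result with every non-empty string over {a, b} up to the same length.

```python
from itertools import product

# Productions from the slide: A -> aA | bA | a | b
PRODUCTIONS = {"A": ["aA", "bA", "a", "b"]}

def derive(start, max_len):
    """Return all terminal strings of length <= max_len derivable from start."""
    sentences, frontier = set(), {start}
    while frontier:
        form = frontier.pop()
        # Find the leftmost nonterminal, if any.
        idx = next((i for i, s in enumerate(form) if s in PRODUCTIONS), None)
        if idx is None:
            sentences.add(form)          # only terminals left: a sentence
            continue
        for rhs in PRODUCTIONS[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # Forms never shrink in this grammar, so long ones can be dropped.
            if len(new_form) <= max_len:
                frontier.add(new_form)
    return sentences

derived = derive("A", 3)
expected = {"".join(p) for n in range(1, 4) for p in product("ab", repeat=n)}
print(sorted(derived, key=lambda s: (len(s), s)))
print(derived == expected)
```

Up to length 3 the two sets agree, which suggests (but does not prove) that the grammar generates every non-empty string of a's and b's.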


Page 9: COP4020 Programming Languages

Chomsky Hierarchy (classification of grammars)

A grammar is said to be
- regular if it is right-linear, where each production in P has the form $A \to wB$ or $A \to w$ (here A and B are nonterminals and w is a terminal), or left-linear;
- context-free if each production in P is of the form $A \to \alpha$, where $A \in N$ and $\alpha \in (N \cup T)^*$;
- context-sensitive if each production in P is of the form $\alpha \to \beta$, where $|\alpha| \le |\beta|$;
- unrestricted if each production in P is of the form $\alpha \to \beta$, where $\alpha \ne \varepsilon$.

All languages recognized by regular expressions can be represented by a regular grammar.

Page 10: COP4020 Programming Languages

A context-free grammar has four components, G = (N, T, P, S):
- T is a finite set of tokens (terminal symbols).
- N is a finite set of nonterminals.
- P is a finite set of productions of the form $A \to \alpha$, where $A \in N$ and $\alpha \in (N \cup T)^*$.
- S is a special nonterminal that is the designated start symbol.

Context-free grammars are more expressive than regular expressions. Consider the language {ab, aabb, aaabbb, …}.
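The language {ab, aabb, aaabbb, …} requires matching the number of a's with the number of b's, which a regular expression cannot do for unbounded n. A context-free grammar handles it with one recursive production. The grammar S -> a S b | a b used below is an assumption (the slide does not give one), and `matches_S` is a hypothetical name; the function simply mirrors the two productions.

```python
def matches_S(s):
    """Recognize strings derivable from the assumed grammar S -> a S b | a b."""
    if s == "ab":                          # production S -> a b
        return True
    if len(s) >= 4 and s[0] == "a" and s[-1] == "b":
        return matches_S(s[1:-1])          # production S -> a S b
    return False

for candidate in ["ab", "aabb", "aaabbb", "aab", "abab", ""]:
    print(repr(candidate), matches_S(candidate))
```

Each recursive call peels off one a from the front and one b from the back, which is exactly the nesting that the production S -> a S b expresses.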

Page 11: COP4020 Programming Languages


BNF Notation (another form of context-free grammar)

Backus-Naur Form (BNF) notation for productions:

<nonterminal> ::= sequence of (non)terminals

where
- each terminal in the grammar is a token;
- a <nonterminal> defines a syntactic category;
- the symbol | denotes alternative forms in a production;
- the special symbol ε denotes the empty string.

Page 12: COP4020 Programming Languages


Example

<Program>        ::= program <id> ( <id> <More_ids> ) ; <Block> .
<Block>          ::= <Variables> begin <Stmt> <More_Stmts> end
<More_ids>       ::= , <id> <More_ids>
                   | ε
<Variables>      ::= var <id> <More_ids> : <Type> ; <More_Variables>
                   | ε
<More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables>
                   | ε
<Stmt>           ::= <id> := <Exp>
                   | if <Exp> then <Stmt> else <Stmt>
                   | while <Exp> do <Stmt>
                   | begin <Stmt> <More_Stmts> end
<More_Stmts>     ::= ; <Stmt> <More_Stmts>
                   | ε
<Exp>            ::= <num>
                   | <id>
                   | <Exp> + <Exp>
                   | <Exp> - <Exp>

Page 13: COP4020 Programming Languages


Derivations

- From a grammar we can derive strings (= sequences of tokens).
  - This is the opposite process of parsing.
- Starting with the grammar's designated start symbol, in each derivation step a nonterminal is replaced by a right-hand side of a production for that nonterminal.
- A sentence (in the language) is a sequence of terminals that can be derived from the start symbol.
- A sentential form is a sequence of terminals and nonterminals that can be derived from the start symbol.

Page 14: COP4020 Programming Languages


Example Derivation

Grammar:
<expression> ::= identifier
               | unsigned_integer
               | - <expression>
               | ( <expression> )
               | <expression> <operator> <expression>
<operator>   ::= + | - | * | /

Derivation:
<expression>
  ⇒ <expression> <operator> <expression>
  ⇒ <expression> <operator> identifier
  ⇒ <expression> + identifier
  ⇒ <expression> <operator> <expression> + identifier
  ⇒ <expression> <operator> identifier + identifier
  ⇒ <expression> * identifier + identifier
  ⇒ identifier * identifier + identifier

- The derivation begins with the start symbol, <expression>.
- Each step replaces a nonterminal with the right-hand side of one of its productions.
- Every line of the derivation is a sentential form.
- The final string is the yield.

Page 15: COP4020 Programming Languages


Rightmost versus Leftmost Derivations

When the nonterminal on the far right (left) of a sentential form is replaced in each derivation step, the derivation is called rightmost (leftmost).

In the sentential form <expression> <operator> <expression>, a leftmost derivation replaces the first <expression> and a rightmost derivation replaces the last <expression>.

Rightmost step:
<expression> ⇒ <expression> <operator> <expression> ⇒ <expression> <operator> identifier

Leftmost step:
<expression> ⇒ <expression> <operator> <expression> ⇒ identifier <operator> <expression>
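The difference between the two strategies comes down to which nonterminal is picked for replacement. A minimal sketch, assuming sentential forms are stored as Python lists of symbols (`GRAMMAR` and `step` are hypothetical names, not part of the slides):

```python
# The expression grammar from the derivation example, as lists of symbols.
GRAMMAR = {
    "<expression>": [
        ["identifier"],
        ["unsigned_integer"],
        ["-", "<expression>"],
        ["(", "<expression>", ")"],
        ["<expression>", "<operator>", "<expression>"],
    ],
    "<operator>": [["+"], ["-"], ["*"], ["/"]],
}

def step(form, rhs, leftmost=True):
    """Replace the leftmost (or rightmost) nonterminal in `form` by `rhs`."""
    positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
    i = positions[0] if leftmost else positions[-1]
    if rhs not in GRAMMAR[form[i]]:
        raise ValueError(f"{form[i]} has no production with right-hand side {rhs}")
    return form[:i] + rhs + form[i + 1:]

form = ["<expression>", "<operator>", "<expression>"]
print(" ".join(step(form, ["identifier"], leftmost=True)))
print(" ".join(step(form, ["identifier"], leftmost=False)))
```

The first call reproduces the leftmost step shown above and the second the rightmost step; asking for a right-hand side that does not belong to the chosen nonterminal raises an error.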

Page 16: COP4020 Programming Languages


A Language Generated by a Grammar

- A context-free grammar is a generator of a context-free language.
- The language defined by a grammar G is the set of all strings w that can be derived from the start symbol S:

  $L(G) = \{\, w \mid S \Rightarrow^* w \,\}$

Examples:

<S> ::= a | '(' <S> ')'
L(G) = { a, (a), ((a)), (((a))), … }

<S> ::= <B> | <C>
<B> ::= <C> + <C>
<C> ::= 0 | 1
L(G) = { 0+0, 0+1, 1+0, 1+1, 0, 1 }
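Because the second grammar has no recursion, its language is finite and can be listed exhaustively. The following sketch does that by expanding every alternative; `GRAMMAR` and `language` are hypothetical names, and the approach would not terminate on a recursive grammar such as the first one.

```python
# <S> ::= <B> | <C>,  <B> ::= <C> + <C>,  <C> ::= 0 | 1
GRAMMAR = {
    "S": [["B"], ["C"]],
    "B": [["C", "+", "C"]],
    "C": [["0"], ["1"]],
}

def language(symbols):
    """Yield every terminal string derivable from a list of grammar symbols.

    Only terminates for grammars without recursion, as is the case here.
    """
    if not symbols:
        yield ""
        return
    head, rest = symbols[0], symbols[1:]
    expansions = GRAMMAR.get(head)
    if expansions is None:                 # head is a terminal: keep it
        for tail in language(rest):
            yield head + tail
    else:                                  # head is a nonterminal: try each alternative
        for rhs in expansions:
            yield from language(rhs + rest)

L = set(language(["S"]))
print(sorted(L))
```

The resulting set contains the same six strings listed for L(G) above.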

Page 17: COP4020 Programming Languages


Parse Trees

- A parse tree depicts the end result of a derivation.
- The internal nodes are the nonterminals.
- The children of a node are the symbols (terminals and nonterminals) on a right-hand side of a production.
- The leaves are the terminals.

[Figure: parse tree for identifier * identifier + identifier. The root <expression> has children <expression> <operator> <expression>; the left <expression> expands to identifier * identifier, the <operator> is +, and the right <expression> is identifier.]

Page 18: COP4020 Programming Languages

Parse Trees

[Figure: the parse tree for identifier * identifier + identifier, repeated from the previous page.]

The derivation that produces it:
<expression>
  ⇒ <expression> <operator> <expression>
  ⇒ <expression> <operator> identifier
  ⇒ <expression> + identifier
  ⇒ <expression> <operator> <expression> + identifier
  ⇒ <expression> <operator> identifier + identifier
  ⇒ <expression> * identifier + identifier
  ⇒ identifier * identifier + identifier
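A parse tree like this one can be written down as a small nested data structure, and reading its leaves from left to right gives back the derived string. The representation below (a `(label, children)` pair, with helper names `node`, `expr`, `op`, `ident`, and `tree_yield`) is an assumption chosen for illustration.

```python
def node(label, *children):
    """A tree node is a (label, children) pair; leaves have no children."""
    return (label, list(children))

def expr(*children):
    return node("<expression>", *children)

def op(symbol):
    return node("<operator>", node(symbol))

def ident():
    return node("identifier")

# The tree for identifier * identifier + identifier:
# the multiplication sits below the addition.
tree = expr(
    expr(expr(ident()), op("*"), expr(ident())),
    op("+"),
    expr(ident()),
)

def tree_yield(tree):
    """Collect the leaf labels from left to right."""
    label, children = tree
    if not children:
        return [label]
    return [leaf for child in children for leaf in tree_yield(child)]

print(" ".join(tree_yield(tree)))  # identifier * identifier + identifier
```

Internal nodes carry nonterminals and leaves carry terminals, matching the description of parse trees given earlier.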

Page 19: COP4020 Programming Languages


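Ambiguity

- There is another parse tree for the same grammar and input: the grammar is ambiguous.
- This parse tree is not desired, since it appears that + has precedence over *.

[Figure: a second parse tree for identifier * identifier + identifier. The root <expression> has children <expression> <operator> <expression>; the left <expression> is identifier, the <operator> is *, and the right <expression> expands to identifier + identifier.]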
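The ambiguity can be checked by counting parse trees. The sketch below uses a simplified version of the expression grammar, keeping only `<expression> ::= identifier | <expression> <operator> <expression>` (unary minus, parentheses, and unsigned_integer are left out for brevity). `OPERATORS` and `count_parse_trees` are hypothetical names.

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}

def count_parse_trees(tokens):
    """Count parse trees for `tokens` under the simplified ambiguous grammar."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of trees deriving tokens[lo:hi] from <expression>.
        if hi - lo == 1:
            return 1 if tokens[lo] == "identifier" else 0
        total = 0
        for k in range(lo + 1, hi - 1):    # k is the position of the top-level operator
            if tokens[k] in OPERATORS:
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees("identifier * identifier + identifier".split()))  # 2
```

The count of 2 corresponds to the two trees discussed here: one with + at the root and one with * at the root.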

Page 20: COP4020 Programming Languages


Ambiguous Grammars

- A grammar is ambiguous if some string has distinct derivations that result in different parse trees.
- A programming language construct should have only one parse tree to avoid misinterpretation by a compiler.
- For expression grammars, associativity and precedence of operators are used to disambiguate:

<expression> ::= <term> | <expression> <add_op> <term>
<term>       ::= <factor> | <term> <mult_op> <factor>
<factor>     ::= identifier | unsigned_integer | - <factor> | ( <expression> )
<add_op>     ::= + | -
<mult_op>    ::= * | /
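To see how the layered grammar fixes precedence and associativity, here is a small recursive-descent sketch with one function per nonterminal. It is an illustration under stated assumptions, not the course's parser: the left-recursive productions are implemented as loops (a standard rewrite for recursive descent), tokens are assumed to be pre-split strings, and the result is a nested tuple rather than a full parse tree. The name `parse` is hypothetical.

```python
def parse(tokens):
    """Parse a token list with the unambiguous expression grammar."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expression():                      # <expression> ::= <term> { <add_op> <term> }
        left = term()
        while peek() in ("+", "-"):
            op = advance()
            left = (op, left, term())      # grouping to the left: left associativity
        return left

    def term():                            # <term> ::= <factor> { <mult_op> <factor> }
        left = factor()
        while peek() in ("*", "/"):
            op = advance()
            left = (op, left, factor())
        return left

    def factor():
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "-":                     # <factor> ::= - <factor>
            advance()
            return ("neg", factor())
        if tok == "(":                     # <factor> ::= ( <expression> )
            advance()
            inner = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return inner
        if tok.isidentifier() or tok.isdigit():
            return advance()               # identifier or unsigned_integer
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

print(parse("a * b + c".split()))   # ('+', ('*', 'a', 'b'), 'c')
print(parse("a + b * c".split()))   # ('+', 'a', ('*', 'b', 'c'))
print(parse("a - b - c".split()))   # ('-', ('-', 'a', 'b'), 'c')
```

Because `<term>` sits below `<expression>` in the grammar, a multiplication is always grouped before the surrounding addition, so each input has exactly one result.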

Page 21: COP4020 Programming Languages


Ambiguous if-then-else: the “Dangling Else”

- A classical example of an ambiguous grammar is the set of grammar productions for if-then-else:

<stmt> ::= if <expr> then <stmt>
         | if <expr> then <stmt> else <stmt>

- It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design.
- Ada uses different syntax to avoid ambiguity:

<stmt> ::= if <expr> then <stmt> end if
         | if <expr> then <stmt> else <stmt> end if
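The dangling else can be made concrete by counting how many ways a nested statement parses under the first (ambiguous) pair of productions. In this sketch the token `e` stands in for any `<expr>` and `s` for any non-if statement; both placeholders, and the name `count_parses`, are assumptions added for illustration.

```python
from functools import lru_cache

def count_parses(tokens):
    """Count parses under <stmt> ::= if e then <stmt> | if e then <stmt> else <stmt> | s."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def stmt(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from <stmt>.
        if tokens[lo:hi] == ("s",):
            return 1
        if hi - lo < 4 or tokens[lo:lo + 3] != ("if", "e", "then"):
            return 0
        total = stmt(lo + 3, hi)                  # if e then <stmt>
        for k in range(lo + 4, hi - 1):           # k: the else that belongs to this if
            if tokens[k] == "else":
                total += stmt(lo + 3, k) * stmt(k + 1, hi)
        return total

    return stmt(0, len(tokens))

print(count_parses("if e then if e then s else s".split()))  # 2
print(count_parses("if e then s else s".split()))            # 1
```

The nested statement has two parses because the single else can attach to either if, which is the ambiguity the Ada-style `end if` removes.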

