BNF (Backus Normal Form or Backus–Naur Form)

BNF stands for "Backus–Naur Form" (also called Backus Normal Form). John Backus and Peter Naur introduced it as the first formal notation for describing the syntax of a given language. BNF is a notation technique for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document formats, instruction sets, and communication protocols. It is applied wherever exact descriptions of languages are needed: for instance, in official language specifications, in manuals, and in textbooks on programming language theory.

A BNF specification is a set of derivation rules, written as

<symbol> ::= __expression__

where <symbol> is a nonterminal, and the __expression__ consists of one or more sequences of symbols; alternative sequences are separated by the vertical bar '|', indicating a choice, the whole being a possible substitution for the symbol on the left. Symbols that never appear on a left-hand side are terminals. Symbols that do appear on a left-hand side are non-terminals and are always enclosed between the angle-bracket pair < >.

The meta-symbols of BNF are:

::=     meaning "is defined as"

|       meaning "or"

< >     angle brackets used to surround category names.

The angle brackets distinguish syntax rule names (also called non-terminal symbols) from terminal symbols, which are written exactly as they are to be represented. A BNF rule defining a nonterminal has the form:

nonterminal ::= sequence_of_alternatives

consisting of strings of terminals or nonterminals separated by the meta-symbol |.

For example, the BNF production for a mini-language is: <program> ::= program <declaration_sequence> begin <statements_sequence> end ;
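The remaining nonterminals would be defined by further rules. Purely as a hypothetical illustration (these rules are not part of the original example), the two sequences might be written as:

    <declaration_sequence> ::= <declaration> | <declaration> <declaration_sequence>
    <statements_sequence>  ::= <statement> | <statement> <statements_sequence>

Here the vertical bar gives each nonterminal two alternatives: a single item, or an item followed by the rest of the sequence.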


Types of Grammars

Recall that a formal grammar $G=(\Sigma,N,P,\sigma)$ consists of an alphabet $\Sigma$ , an alphabet $N$ of non-terminal symbols properly included in $\Sigma$ , a non-empty finite set $P$ of productions, and a symbol $\sigma\in N$ called the start symbol. The non-empty alphabet $T:=\Sigma-N$ is the set of terminal symbols. Then $G$ is called a

Type-0 grammar

if there are no restrictions on the productions. A type-0 grammar is also known as an unrestricted grammar, or a phrase-structure grammar.

Type-1 grammar

if the productions are of the form $uAv \to uWv$ , where $u,v,W\in \Sigma^*$ with $W\ne \lambda$ , and $A\in N$ , or $\sigma\to \lambda$ , provided that $\sigma$ does not occur on the right hand side of any production in $P$ . As $A$ is surrounded by the words $u,v$ , a type-1 grammar is also known as a context-sensitive grammar.

Type-2 grammar

if the productions are of the form $A\to W$ , where $A\in N$ and $W\in \Sigma^*$ . Type-2 grammars are also called context-free grammars, because the left-hand side of any production is ``free'' of context.

Type-3 grammar

if the productions are of the form $A\to u$ or $A\to uB$ , where $A,B\in N$ and $u\in T^*$ . Because languages generated by type-3 grammars can be represented by regular expressions, type-3 grammars are also known as regular grammars.

It is clear that every type-$i$ grammar is type-0, and every type-3 grammar is type-2. A type-2 grammar is not necessarily type-1, because it may contain both $\sigma\to \lambda$ and $A\to W$ , where $\sigma$ occurs in $W$ . Nevertheless, the relevance of the hierarchy has more to do with the languages generated by the grammars. Call a formal language a type-$i$ language if it is generated by a type-$i$ grammar. Then the following correspondence between grammar types, language families, and recognizing automata can be shown:

    grammar     language family             automaton
    type-0      recursively enumerable      Turing machine
    type-1      context-sensitive           linear bounded automaton
    type-2      context-free                pushdown automaton
    type-3      regular                     finite automaton

Classification of Grammars (due to Noam Chomsky, 1956)

Grammars are sets of productions of the form α = β.

class 0: Unrestricted grammars (α and β arbitrary)
    e.g.: X = a X b | Y c Y.   aYc = d.   dY = bb.
    X ⇒ aXb ⇒ aYcYb ⇒ dYb ⇒ bbb
    Recognized by Turing machines

class 1: Context-sensitive grammars (|α| ≤ |β|)
    e.g.: a X = a b c.
    Recognized by linear bounded automata

class 2: Context-free grammars (α = NT, β ≠ ε)
    e.g.: X = a b c.
    Recognized by push-down automata

class 3: Regular grammars (α = NT, β = T or T NT)
    e.g.: X = b | b Y.
    Recognized by finite automata

Only classes 2 and 3 are relevant in compiler construction.
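To make the classification concrete, the following Python sketch (not from the source) assigns a Chomsky class to a grammar given as (left-hand side, right-hand side) production pairs. It assumes the simplified convention used in the slide above: single uppercase letters are nonterminals (NT) and lowercase letters are terminals (T):

    def chomsky_type(productions):
        """Return the most restrictive class (3, 2, 1, or 0) fitting every production."""
        def is_nt(s):
            return len(s) == 1 and s.isupper()

        def is_regular(lhs, rhs):
            # class 3: lhs is a single NT, rhs is T or T NT
            if not is_nt(lhs):
                return False
            if len(rhs) == 1:
                return rhs.islower()
            return len(rhs) == 2 and rhs[0].islower() and is_nt(rhs[1])

        def is_context_free(lhs, rhs):
            # class 2: lhs is a single NT, rhs is not empty
            return is_nt(lhs) and len(rhs) > 0

        def is_context_sensitive(lhs, rhs):
            # class 1: the right side is at least as long as the left side
            return len(lhs) <= len(rhs)

        if all(is_regular(l, r) for l, r in productions):
            return 3
        if all(is_context_free(l, r) for l, r in productions):
            return 2
        if all(is_context_sensitive(l, r) for l, r in productions):
            return 1
        return 0

    # The class-3 example from the slide: X = b | b Y.
    print(chomsky_type([("X", "b"), ("X", "bY")]))                                  # 3
    # The class-0 example: X = a X b | Y c Y.  aYc = d.  dY = bb.
    print(chomsky_type([("X", "aXb"), ("X", "YcY"), ("aYc", "d"), ("dY", "bb")]))   # 0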


Introduction to Compilers


What is a Compiler?

1. A compiler is software (a program) that translates a high-level programming language to machine language.  So, a simple representation would be:

Source Code ----> Compiler -----> Machine Language (Object File)

2. But, a compiler has to translate high-level code to machine language, so it's not as simple as an assembler translator.  A compiler has to perform several steps to produce an object file in machine code form.

Analysis of the source code:

o Lexical Analysis:  scan the input source file to identify tokens of the programming language.  Tokens are basic units (keywords, identifier names, etc.) that can be identified using rules.  This step is performed by a lexical recognizer, or scanner (a minimal sketch appears after this list).

o Syntax Analysis:  group the tokens identified by the scanner into grammatical phrases that will be used by the compiler to generate the output.  This process is called parsing and is performed by a parser based on the formal grammar of the programming language.  The parser is created from the grammar using a parser generator or compiler-compiler.

o Semantic Analysis:  check the source program for semantic (meaning) errors and gather type information for the subsequent code generation phase.  It uses the hierarchical structure determined by the parser to identify the operators and operands of expressions and statements.
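As an illustration of the scanning step, here is a minimal Python sketch (not taken from any particular compiler; the token categories and the names KEYWORDS, TOKEN_SPEC, and tokenize are chosen for this example). It splits a source string into (kind, text) token pairs for a toy language with keywords, identifiers, numbers, and a few operators:

    import re

    # Assumed toy language: a few keywords, identifiers, integers, operators.
    KEYWORDS = {"program", "begin", "end", "if", "then", "while", "do"}

    TOKEN_SPEC = [
        ("NUMBER",   r"\d+"),                 # integer constants
        ("NAME",     r"[A-Za-z_]\w*"),        # identifiers (and keywords)
        ("OP",       r"[+\-*/=<>;(),]"),      # single-character operators
        ("SKIP",     r"\s+"),                 # whitespace is discarded
        ("MISMATCH", r"."),                   # anything else is an error
    ]
    TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def tokenize(source):
        """Yield (kind, text) pairs -- the tokens handed on to the parser."""
        for m in TOKEN_RE.finditer(source):
            kind, text = m.lastgroup, m.group()
            if kind == "SKIP":
                continue
            if kind == "MISMATCH":
                raise SyntaxError(f"unexpected character {text!r}")
            if kind == "NAME" and text in KEYWORDS:
                kind = "KEYWORD"
            yield kind, text

    print(list(tokenize("begin x = x + 42; end")))
    # [('KEYWORD', 'begin'), ('NAME', 'x'), ('OP', '='), ('NAME', 'x'),
    #  ('OP', '+'), ('NUMBER', '42'), ('OP', ';'), ('KEYWORD', 'end')]

The parser then works on this token stream rather than on raw characters.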


Synthesis of the target program:

o Generate an intermediate representation of the source program.  Some compilers perform this step, but not all.  An intermediate representation can be thought of as a program for an abstract machine: it should be easy to produce and easy to translate into the target program (see the worked example after this list).

o Code Optimization:  improve the intermediate code to produce a faster running machine code in the final translation.   Not all compilers include the code optimization step, which can require a lot of time.

o Code Generation:  generate the target code, normally either relocatable machine code or assembly code.  If the compiler produces assembly code, the compiler output has to subsequently be translated to machine code by an assembler translator as an extra step.
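As a small worked example of these synthesis steps (illustrative only; the exact form of the intermediate code differs between compilers), a source statement such as

    a = b + c * d

might first be translated into three-address intermediate code using compiler-generated temporaries t1 and t2,

    t1 = c * d
    t2 = b + t1
    a  = t2

and then into relocatable machine code or assembly code. A code optimizer might, for instance, keep the intermediate results in registers so that the temporaries never appear in the final target code.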

3. After the compiler (or assembler translator) has produced the object file, two additional steps are needed to produce and run the program.

Linking the program:  the linker (link-editor) links the object files from the program modules, and any additional library files to create the executable program.  Usually this includes the use of relocatable addresses within the program to allow the program to run in different memory locations.

Loading the program:  the loader identifies a memory location in which the program can be loaded and alters the relocatable machine code addresses to run in the designated memory location.  A program is loaded into memory each time it is run (unless it's a TSR that remains in memory, even when not active).  In some situations, the loader performs both steps of linking and loading a program.

Objectives:

The specific objectives for our discussion of compilers will be lexical analysis and syntax analysis.  We will create lexical analyzers (scanners) using a scanner generator and syntax analyzers (parsers) using a parser generator.  We will leave the actual generation of machine code to the C compiler.

Grammars and Languages

Definitions

Alphabet.  A finite set of symbols.

Token.  A terminal symbol in the grammar for the source language.

Typical tokens in a programming language include: keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as parentheses, commas, and semicolons.

String.    A finite sequence of symbols drawn from an alphabet.

Greek letters are used to denote strings.  Roman letters are used to denote symbols.

The length of a string α, denoted |α|, is the number of occurrences of symbols in α.

The empty string, denoted ε, is a string of length 0.

If α is a string, then by α^i we mean αα...α (α written i times).

A terminal string is one composed only of terminal symbols (tokens).  ε is considered a terminal string also.
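For instance (a small illustration using Python strings; not part of the original notes), with the alphabet {a, b}:

    alpha = "abba"        # a string over the alphabet {a, b}
    print(len(alpha))     # |alpha| = 4, the number of symbol occurrences
    epsilon = ""          # the empty string, of length 0
    print(len(epsilon))   # 0
    print(alpha * 3)      # alpha written 3 times: 'abbaabbaabba'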

Language.  Any set of strings over some fixed alphabet.  This general definition includes:

The empty set, ∅, and the set containing only the empty string, {ε}.

Grammar.  Rules that specify the syntactic structure of well-formed programs (sentences) in a language.


A grammar is a 4-tuple (VN, VT, G0, P) where VN is a set of non-terminal symbols, VT is a set of terminal symbols (tokens), G0 is the goal symbol, and P is a set of productions (rules) of the grammar.

V, which denotes VN ∪ VT, is called the alphabet or vocabulary of the grammar.

Terminal symbols are the basic symbols from which strings are formed.  The word "token" is a synonym for terminal.

Nonterminal symbols are syntactic variables that denote sets of strings.  They impose a hierarchical structure on the language that is useful for syntax analysis and translation.

In a grammar, one nonterminal symbol is defined as the start (goal) symbol.  The set of strings it generates constitutes the language defined by the grammar.

The productions (rules) of a grammar specify the manner in which the terminals and nonterminals can be combined to form strings.   Each production consists of a nonterminal, followed by an arrow (sometimes the symbol ::= is used instead), followed by a string of nonterminals and terminals.

    E.g.,        A -> B d
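As a concrete (purely illustrative) picture of the 4-tuple, a grammar can be written down directly as Python data; the variable names and the extra production below are chosen for this sketch only:

    # A grammar as a 4-tuple (VN, VT, G0, P): nonterminal symbols, terminal
    # symbols, goal symbol, and productions written as (lhs, rhs) pairs.
    VN = {"A", "B"}
    VT = {"d"}
    G0 = "A"
    P  = [("A", "Bd"),   # the production A -> B d from the example above
          ("B", "d")]    # an extra, hypothetical production so that B derives something
    grammar = (VN, VT, G0, P)

Here each right-hand side is written as a string of symbols, with single-character symbol names assumed.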

Classification of Grammars According to Types of Productions Allowed

Type 0: Phrase Structure Grammars (most general classification)

Productions allowed:    α -> β    where β may equal ε

E.g.,     a b c -> d e f g

Type 1: Context-Sensitive Grammars

Productions allowed:

αXβ -> αγβ    where X ∈ VN and γ ≠ ε    (α and β are called the context of X),  and  G0 -> ε


Type 2:  Context-Free Grammars

Productions allowed:    X -> α    where X ∈ VN and α may equal ε.

Type 3:    Regular Grammars (most restrictive classification)

Productions allowed:

    X -> a       and

    X -> Y a     where X, Y ∈ VN, a ∈ VT

Note: Type 3 ⊂ LR(0) ⊂ LR(1) ⊂ LR(k) ⊂ Type 2

Definition:   

If β -> γ is a production and αβδ is a string of the grammar (i.e., a string of symbols from the vocabulary), then we write αβδ ==> αγδ     (the notation for "immediately derives")

Definition:   

If in some grammar α1, α2, ..., αt are strings such that, for 1 ≤ i ≤ t - 1, αi ==> αi+1, or t = 1, then we write α1 =*=> αt     (the notation for "derives")

Note:  α ==> α might not be true, but α =*=> α is always true.
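To make the ==> relation concrete, here is a small Python sketch (not from the notes) that, given a sentential form and a set of productions, produces every string reachable in one step. It assumes each grammar symbol is a single character, so sentential forms are plain strings:

    def immediately_derives(sentential_form, productions):
        """Return all strings obtainable from sentential_form by applying
        one production (lhs -> rhs) at one position -- the ==> relation."""
        results = set()
        for lhs, rhs in productions:
            start = 0
            while True:
                i = sentential_form.find(lhs, start)
                if i == -1:
                    break
                results.add(sentential_form[:i] + rhs + sentential_form[i + len(lhs):])
                start = i + 1
        return results

    # A tiny hypothetical grammar: S -> aSb | ab
    P = [("S", "aSb"), ("S", "ab")]
    print(immediately_derives("S", P))     # {'aSb', 'ab'}
    print(immediately_derives("aSb", P))   # {'aaSbb', 'aabb'}

Applying this step zero or more times gives the =*=> ("derives") relation.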

Definition:

The language defined by a grammar G, denoted as L(G), is given by

   L(G) = { α | G0 =*=> α, where α is a terminal string, i.e., α is composed entirely of terminals }

If G0 =*=> α, then α is called a sentential form.  A terminal sentential form is called a sentence.

Definition:

A language (on some alphabet) is a set of strings on that alphabet.  A language is called a {phrase-structure | context-sensitive | context-free | regular} language if it has a grammar of the corresponding type.


Definition:

If α =*=> β, then β is called a descendant of α, and α is called an ancestor of β.

If α ==> β, then β is called an immediate descendant of α, and α is called an immediate ancestor of β.

Definition:

Backus Naur Form is a method for representing context-free grammars.

Examples of Grammars and Derivations:

Grammar 1:

1.  SENTENCE -> NOUNPHRASE VERB NOUNPHRASE

2.  NOUNPHRASE -> the ADJECTIVE NOUN

3.  NOUNPHRASE -> the NOUN

4.  VERB -> pushed

5.  VERB -> helped

6.  ADJECTIVE -> pretty

7.  ADJECTIVE -> poor

8.  NOUN -> man

9.  NOUN -> boy

10. NOUN -> cat

Derivation of the sentence:   "the man helped the poor boy"

1.      SENTENCE                       (goal symbol) 

2. ==>  NOUNPHRASE VERB NOUNPHRASE     (by Rule 1)

3. ==>  the NOUN VERB NOUNPHRASE       (Rule 3)


4. ==>  the man VERB NOUNPHRASE        (Rule 8)

5. ==>  the man helped NOUNPHRASE          (Rule 5)

6. ==>  the man helped the ADJECTIVE NOUN  (Rule 2)

7. ==>  the man helped the poor NOUN       (Rule 7)

8. ==>  the man helped the poor boy        (Rule 9)

(this derivation shows that "the man helped the poor boy" is a sentence in the language defined by the grammar.)

This derivation may also be represented diagrammatically by a syntax tree:

    [syntax tree for "the man helped the poor boy": SENTENCE at the root, with the words of the sentence at the leaves]

Typical format of a grammar for a programming language:

PROGRAM -> PROGRAM STATEMENT

PROGRAM -> STATEMENT

STATEMENT -> ASSIGNMENT-STATEMENT

STATEMENT -> IF-STATEMENT

STATEMENT -> DO-STATEMENT

...

ASSIGNMENT-STATEMENT -> ...

...


IF-STATEMENT -> ...

...

DO-STATEMENT -> ...

...

Grammar 2 (a simple grammar for arithmetic statements):

1.    E -> E + T
2.    E -> T
3.    T -> T * a
4.    T -> a

Derivation of:   a + a * a

1.        E              Goal Symbol
2. ==>    E + T          Rule 1
3. ==>    E + T * a      Rule 3
4. ==>    E + a * a      Rule 4
5. ==>    T + a * a      Rule 2
6. ==>    a + a * a      Rule 4

Derivation of: a + a * a written in reverse:

1.        a + a * a      Given sentential form
2.        T + a * a      Rule 4 in reverse
3.        E + a * a      Rule 2 in reverse
4.        E + T * a      Rule 4 in reverse
5.        E + T          Rule 3 in reverse
6.        E              Rule 1 in reverse

Note: a derivation in which the terminal symbols are introduced (or resolved) from right to left is called a rightmost derivation.   (It is also possible to use leftmost derivations.)
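To connect Grammar 2 back to syntax analysis, here is a Python sketch of a parser for the same language (an illustration, not part of the original notes). Because the grammar as written is left-recursive (E -> E + T), a naive recursive-descent parser would loop forever, so the sketch uses the equivalent iterative forms E -> T { + T } and T -> a { * a }:

    def parse(tokens):
        """Accept or reject a token list for Grammar 2:
           E -> E + T | T,   T -> T * a | a,
           handled iteratively as E -> T { + T } and T -> a { * a }."""
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def expect(symbol):
            nonlocal pos
            if peek() != symbol:
                raise SyntaxError(f"expected {symbol!r}, found {peek()!r}")
            pos += 1

        def parse_T():
            expect("a")              # T -> a
            while peek() == "*":     # T -> T * a, applied iteratively
                expect("*")
                expect("a")

        def parse_E():
            parse_T()                # E -> T
            while peek() == "+":     # E -> E + T, applied iteratively
                expect("+")
                parse_T()

        parse_E()
        if peek() is not None:
            raise SyntaxError(f"unexpected trailing token {peek()!r}")
        return True

    print(parse(["a", "+", "a", "*", "a"]))   # True: "a + a * a" is a sentence

A real parser would also build the syntax tree (or drive translation) while matching, rather than merely accepting the input.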

