CS252: Systems Programming Ninghui Li Topic 5: Parsing Prepared by Evan Hanau [email protected].

CS252: Systems Programming

Ninghui Li

Topic 5: Parsing

Prepared by Evan Hanau

[email protected]

mailto:[email protected]

Introduction to Parsing with Yacc

An Introduction to Parsing with Yacc

• Context-Free Grammars• Yacc Parsing• An example Infix Calculator Program

Context-Free Grammar

Background: The Context-Free Grammar

• By CS252 you are already somewhat familiar with Regular Expressions.

• Regular expressions can be used to describe regular languages, which belong to a larger classification of language types.

In CS, we classify languages on the Chomsky Hierarchy.

Type-0 Recursively EnumerableType-1 Context-SensitiveType-2 Context-Free Type-3 Regular


Type-(i) is a superset of Type-(i+1)

• Languages generated by regular expressions belong to type 3.

• Note: Your specific regular expression engine (e.g. POSIX extended RE) is likely capable of more complex productions.

• In any case, we need more than regular expressions to parse computer programming languages and shell scripts.


• You can do a great deal with regular expressions.

• Exercise: Create a regular expression that matches on any English phrase that is a palindrome, for instance the string “some men interpret nine memos”.


• This is in fact not possible with regex (by its strict CS definition!). You would be limited to palindromes of a finite length only.• RE cannot express “anbn”, a string with some number of

a’s followed by equal number of b’s• The expression a*b* does not require number of a’s

equal that of b’s

• We must use a context-free grammar to describe palindromes and other constructs.

• More powerful than a regular expression, and useful when some notion of “what came before” is required.


• BNF or Backus-Naur form is used in CS to describe context-free grammars. It is often used to describe the syntax of programming languages.consists of one or more of the following:

<nonterminal> ::= __expression__

• Where __expression__ consists of one or more terminals and nonterminals or nothing (epsilon).

Backus-Naur form

<postal-address> ::= <name-part> <street-address> <zip-part>

<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>

| <personal-part> <name-part>

<personal-part> ::= <first-name> | <initial> "."

<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>

<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>

<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""

<opt-apt-num> ::= <apt-num> | ""

US Post Address in Backus-Naur form (from wikipedia)

Let’s define a grammar for a primitive add or multiply expression:

<expr> ::= <expr> * <expr> | <expr> + <expr>

| number

In this case, <expr> is a terminal and number, *, and + are the nonterminals.

Context-Free Grammar for Simple Expressions

• Clearly, there is some ambiguity here, because operator precedence (sometimes referred to as binding) is not defined.

• The grammar does not distinguish between 2+2*2+2 = 16 (incorrect under normal rules) or 2+2*2+2 = 8 (correct).


One Solution: Define expressions of different levels:<expr> ::= <add_expr><add_expr> ::= <add_expr> + <mul_expr> | <mul_expr><mul_expr> ::= <mul_expr> * number | number

Now, multiplication will bind tighter than addition (this may require a few sample expressions to wrap your head around!)


Associativity follows from the above example (Hint: What side of the multiply and add operation did we have the “deeper” production on?)


<palindrome> ::= letter

| // empty string

<palindrome> ::= “a” <palindrome> “a”

| “b” <palindrome> “b”

….

Or, <palindrome> ::= letter <palindrome> letter

However, we need to check the two letters are the same.

CFG for anbn:

<anbn> = // empty string | “a” <anbn> “b”

CFG for palindrome

Chomsky Hierarchy (From Wikipedia Page)

Grammar Languages AutomatonProduction rules

(constraints)

Type-0 Recursively enumerable

Turing machine (no restrictions)

Type-1 Context-sensitive Linear-bounded non-deterministic Turing machine

(equivalently, Right side no shorter than left)

Type-2 Context-freeNon-deterministic pushdown automaton

Type-3 Regular Finite state automaton

and

http://en.wikipedia.org/wiki/Recursively_enumerable_language

http://en.wikipedia.org/wiki/Recursively_enumerable_language

http://en.wikipedia.org/wiki/Turing_machine

http://en.wikipedia.org/wiki/Context-sensitive_grammar

http://en.wikipedia.org/wiki/Linear_bounded_automaton



http://en.wikipedia.org/wiki/Context-free_grammar

http://en.wikipedia.org/wiki/Pushdown_automaton

http://en.wikipedia.org/wiki/Pushdown_automaton

http://en.wikipedia.org/wiki/Regular_grammar

http://en.wikipedia.org/wiki/Finite_state_automaton

http://en.wikipedia.org/wiki/Finite_state_automaton

Type-0 Recursively EnumerableType-1 Context-Sensitive

Cannot encode all strings r1r2 such that r1 and r2 are two regular expressions that are equivalentType-2 Context-Free (Pushdown Automaton, i.e, Finite State Automaton with a Stack)

Can encode anbn, but not anbncn

Type-3 Regular (Finite State Automaton)

Can encode a*b*, but not anbn

Chomsky Hierarchy Revisited.

Why is Context-Free Grammar Called Context Free?

In a CFG, the left hand of each production is a single non-terminal, e.g., <palindrome> ::= “a” <palindrome> “a”

This means that “a”, followed by a <palindrome>, and by “a” will always be considered as <palindrome>, no matter what is the context, hence context free.

In a Context-Sensitive Grammar, left hand of production rules can include other things

An Example Context-Sensitive Grammar for anbncn

1. S a B C

2. S a S B C

3. B C C B

4. a B a b

5. b B b b

6. b C b c

7. c C c c

S→2 aSBC →1 aaBCBC

→3 aaBBCC

→4 aabBCC

→5 aabbCC

→6 aabbcC

→7 aabbcc

Yacc & Parsing

There are many ways to parse BNF grammars, most of which are discussed in a compilers course.• Recall: A finite state automaton (FSA) is

used for regular expressions. (CS182).• For a context-free grammar, we use a

pushdown automaton, which combines features of a FSA with a stack.

Yacc generates what is known as a LALR parser, which is generated from the BNF grammar in your Yacc file. This parser is defined in the C source file that Yacc generates.

We use Lex to make a lexer to generate our terminals, which are matched with regular expressions before being fed into the parser.

Yacc & Parsing

Yacc is capable of generating a powerful parser that will handle many different grammars.

LEX Lexer

YACC Parser

Rule-Based

Behavior

terminals

Input Characters

Yacc & Parsing

• Recall that parsing combines a state machine with a stack. States go on a stack to keep track of where parsing is. Yacc uses a parse table which defines possible states.

• Yacc’s parser operates using two primary actions, shift and reduce.

• shift puts a state on the stack, reduce pops state(s) off the stack and reduces combinations of nonterminals and terminals to a single nonterminal. After a reduction to a rule, Yacc’s parser will optionally run some user-defined code.

Yacc & Parsing

A very basic example:

<rule> := “hello” “world” “\n”

The parser would shift each word, successively pushing each state (.”hello”, .”world”, .”\n”) onto the stack. Then at the end of the rule, reduce everything to <rule> and pop the three states.

Yacc & Parsing

A Lex/Yacc Infix Calculator

• Yacc’s parser is powerful, but is not capable of parsing all grammars.

• Certain ambiguous grammars may produce what is known as a shift/reduce or reduce/reduce conflict. Yacc will, by default, shift instead of reduce.

Consider the classic shift/reduce conflict example:

<ifexp> ::= IF <expr> THEN <stmt> ELSE <stmt> | IF <expr> THEN <stmt>

Yacc will have a shift/reduce conflict here, but will go with shift (the top option) by default. It’s greedy!

Yacc & Parsing


• To demonstrate the utility of Lex and Yacc (or in our case, Flex and Bison) we provide an example infix calculator.

• Similar to several of the examples provided on the Lex and Yacc manpage at http://dinosaur.compilertools.net, but with added features


• Make sure to read ALL source code comments, particularly those that describe source file organization.

• Lex definition file: calculator.l• Yacc grammar file: calculator.y• AST Classes: ast.cc• Symbol table: symtab.cc


The example calculator application uses Lex and Yacc to parse mathematical expressions and produce an Abstract Syntax Tree, which is then used to evaluate those expressions.

It allows the =, +, *, -, +, ^, () and unary minus operators, with appropriate levels of binding and precedence. Examine calculator.y, because it is heavily commented.


• The symbol table (implemented here in simple O(n) access time) maps variables to values.

• Print the AST after every expression evaluation by running calculator with the –t flag, e.g. “calc –t”.

calc> 2*2^3/3 = 5.333333 3.000 /(/) \ 3.000 / (^) \ 2.000 / (*) \ 2.000

A calculator example. Type “2*2^3/3 and press enter:


Review

• What are required:• Able to write simple Context Free

Grammars, similar to those used in implementing FIZ

• Able to determine whether a string of tokens is accepted by a grammar

• Able to show how a string of tokens is parsed into some non-terminal (i.e., draw the parsing tree)

Date post:	14-Jan-2016
Category:	Documents
Upload:	jean-claycomb
View:	223 times
Download:	0 times

CS252: Systems Programming Ninghui Li Topic 5: Parsing Prepared by Evan Hanau [email protected].

Documents