
Top-Down Parsing using Regular Expressions A seminar by Brian Westphal.

Date post: 22-Dec-2015
Top-Down Parsing using Regular Expressions

A seminar by

Brian Westphal

Questions and comments

• Please feel free to interrupt at any time throughout this presentation to make comments and ask questions.

• However, if there is a question that you feel might take more than a minute or two to explain, please wait until I specifically ask for questions.

Part I

General Discussion

What is a top-down parser?

• Starts at the most general case (e.g. the source code for a computer program).

• Tries to reach more specific cases (e.g. variable, loop, init statement, etc.) until a string is broken down into its smallest elements.

• End result is a tree of parts, each part described by a rule.

An example of a tree in English

This tree is used to parse a Sentence which is made up of an ordered set including Article, Noun, Verb, ….

It starts with the most general - Sentence, and gets more specific.

The grammar (or rules) in BNF

Backus-Naur Form

English ::= Sentence*
Sentence ::= (Article S)? Noun S Verb S (Preposition S)? (Article S)? Noun '/.'
Article ::= 'a' | 'an' | 'the'
Noun ::= 'house' | 'car' | 'person'
Verb ::= 'plays' | 'sits' | 'goes'
Preposition ::= 'on' | 'over' | 'above'
S ::= ' '

Regular Expressions v Grammars

• Regular expressions compose the most specific elements in a grammar (e.g. ‘a’, ‘car’, etc.).

• BNF notation allows rules to be combined in a regular expression-like manner.

• The grammar holds a collection of rules which eventually terminate in regular expressions.

The process of parsing, by example

• We want to parse the following string with our example grammar: “A person sits on the car. A person sits on the house.”

• (Ignore capitalization for now.)

• Start with the most general rule, English (not shown in the graph).

• Try to match the string to rule English.

• 1. English is equal to Sentence*

• In a loop (until the whole string is consumed or a failure is reached), match each Sentence.

• 2. Sentence is equal to (Article S)? Noun S …

• Match each sub-rule where “(Article S)?” is a single sub-rule, Noun is a single sub-rule, etc.

• 3. Article is equal to ‘a’ | ‘an’ | ‘the’

• 4. Try to match the string with ‘a’, ‘an’, and ‘the’.

– A match is found with ‘a’ (the longest of the acceptable matches).

• 5. Try to match the remaining string (chop off ‘a’) with rule S.

• 6. S is equal to ‘ ’ (a single space). A match is found with ‘ ’.

• 7. Try to match remaining string with rule Noun.

• 8. Continue pattern.
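
Steps 3 through 7 can be sketched in Java as follows. This is only an illustration, not the seminar's code: the class name StepDemo and the helper longestPrefix are hypothetical, and the terminals are compared as plain strings rather than as regular expressions.

```java
import java.util.Arrays;
import java.util.List;

class StepDemo {
    // Hypothetical helper: returns the longest alternative that is a prefix of
    // input, or null if none of the alternatives matches.
    static String longestPrefix(String input, List<String> alternatives) {
        String best = null;
        for (String alt : alternatives) {
            if (input.startsWith(alt) && (best == null || alt.length() > best.length())) {
                best = alt;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String input = "a person sits on the car.";
        List<String> article = Arrays.asList("a", "an", "the");
        List<String> space = Arrays.asList(" ");
        List<String> noun = Arrays.asList("house", "car", "person");

        // Steps 3-4: match Article, then chop the match off the input.
        String a = longestPrefix(input, article);
        input = input.substring(a.length());
        // Steps 5-6: match S (a single space) against the remaining string.
        String s = longestPrefix(input, space);
        input = input.substring(s.length());
        // Step 7: match Noun against the remaining string.
        String n = longestPrefix(input, noun);

        System.out.println("[" + a + "][" + s + "][" + n + "]");  // [a][ ][person]
    }
}
```

Each successful match shortens the remaining input, which is what lets the next sub-rule start exactly where the previous one stopped.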

What happens if a rule does not match?

• It depends on the situation.

• If a rule is followed by a * or a ?, it can never fail, because matching it is not required.

• Otherwise, there is an error in the data. Report an error message or try to fix it.
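
The difference between a required rule and an optional one can be sketched with java.util.regex. The class OptionalDemo and its three methods are hypothetical names chosen for this illustration; a return value of -1 stands for failure, and any other value is the number of characters used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalDemo {
    // Required rule: returns the number of characters matched at the start of
    // input, or -1 to signal failure.
    static int required(Pattern rule, String input) {
        Matcher m = rule.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    // rule? : zero or one occurrence, so a miss uses no input instead of failing.
    static int maybe(Pattern rule, String input) {
        int n = required(rule, input);
        return n < 0 ? 0 : n;
    }

    // rule* : zero or more occurrences; stops at the first miss or empty match.
    static int star(Pattern rule, String input) {
        int used = 0;
        while (true) {
            int n = required(rule, input.substring(used));
            if (n <= 0) {
                return used;
            }
            used += n;
        }
    }

    public static void main(String[] args) {
        Pattern article = Pattern.compile("the|an|a");
        System.out.println(required(article, "car"));  // -1
        System.out.println(maybe(article, "car"));     // 0
        System.out.println(star(article, "car"));      // 0
        System.out.println(star(article, "aaab"));     // 3
    }
}
```

Only the required form can report an error; the other two simply use less input.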

Failure/Success Examples

• “sits on the car.”: failure (missing noun and space)

• “person sits on the car.”: success

• “person sits on the car”: failure (missing period)

• “a car goes over person.”: success

Part II

Goals and Requirements

List of goals

• Parse any data that conforms to an EBNF grammar, such as source code.

• Use and extend our existing regular expression classes.

• Output a syntax tree describing the full structure of the parsed data.

• For the end result we also wish to develop a tool that builds parsers for us (based on specified EBNF grammar files).

Goal Requirements

• A class to process EBNF rules (also called productions).

• A class to process more basic rules.

• A class to process regular expressions as rules.

• A class to act as the tree structure.

A BNF grammar describing EBNF grammars

• Show EBNF.grammar

In other words

• An EBNF grammar (Extended Backus-Naur Form) supports a subtraction operation on top of regular BNF functionality.

• Subtraction (A-B) implies that for the same section of data the system matches A but does not match B. Clearly this is not a typical regular expression-like operation.
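
One way to read the subtraction rule in code is shown below. The seminar does not spell out how "the same section" is determined, so this sketch assumes that B must match the whole of the text that A matched; SubtractionDemo and minus are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubtractionDemo {
    // Returns the text matched by a at the start of input, provided that b does
    // not match that same text in full; returns null otherwise.
    static String minus(Pattern a, Pattern b, String input) {
        Matcher ma = a.matcher(input);
        if (!ma.lookingAt()) {
            return null;                  // A does not match at all
        }
        String section = ma.group();
        if (b.matcher(section).matches()) {
            return null;                  // B matches the same section
        }
        return section;
    }

    public static void main(String[] args) {
        Pattern identifier = Pattern.compile("[a-z]+");
        Pattern keyword = Pattern.compile("if|while");
        System.out.println(minus(identifier, keyword, "count = 1"));  // count
        System.out.println(minus(identifier, keyword, "while (x)"));  // null
        System.out.println(minus(identifier, keyword, "whilst"));     // whilst
    }
}
```

Here "identifier - keyword" accepts any lowercase word except the two keywords, which a single positive regular expression would express only awkwardly.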

Part III

Basic Class Descriptions

RETree

• The structure for the syntax tree. Each node in an RETree is either an RETree or an RE (regular expression).

• Each node may also be associated with a type (e.g. English, Sentence, Noun, etc.).

REProcessor

• Handles regular expressions in binary or unary operations as parts of productions.

• Binary modes (operations) are: andnot, or, follow

• Unary modes are: follow, star, plus, maybe

SubProductionProcessor

• Extension of REProcessor.

• Handles REProcessors and references to other productions and subproductions.

• In this way the unary and binary operations can be used for productions in general.

• Assigns a type in the tree to matching values.

ProductionProcessor

• Extension of SubProductionProcessor.

• Contains a reference array to the other productions.

• Breaks down a single rule into subproductions by splitting the rule into more manageable (unary or binary) pieces.

Part IV

Code Overview

Coding RETree

• Take advantage of Java’s excellent polymorphism.

• RETree has two properties: Object branches and String type

• branches can be of type String or LinkedList.

Useful function in RETree

• Collapse

– Returns a single string of the section matched by the RETree (e.g. for English in our previous example, it would be the two sentences; for Sentence it would be a whole sentence, etc.).

• Size

– Returns the length of the string (so that we can know how much of the input we have used).
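
A tree node along these lines might look as follows. The class name RETreeSketch, the constructor, and the lower-case method names are assumptions made for this sketch; only the two properties (an Object holding either a String or a LinkedList, plus a String type) come from the description above.

```java
import java.util.LinkedList;

class RETreeSketch {
    final String type;       // e.g. "Sentence" or "Noun"; may be null
    final Object branches;   // a String (leaf) or a LinkedList of RETreeSketch

    RETreeSketch(String type, Object branches) {
        this.type = type;
        this.branches = branches;
    }

    // Concatenates the matched text of every leaf below this node.
    String collapse() {
        if (branches instanceof String) {
            return (String) branches;
        }
        StringBuilder text = new StringBuilder();
        for (Object child : (LinkedList<?>) branches) {
            text.append(((RETreeSketch) child).collapse());
        }
        return text.toString();
    }

    // Length of the matched text, i.e. how much of the input this node used.
    int size() {
        return collapse().length();
    }

    public static void main(String[] args) {
        LinkedList<RETreeSketch> parts = new LinkedList<>();
        parts.add(new RETreeSketch("Article", "a"));
        parts.add(new RETreeSketch("S", " "));
        parts.add(new RETreeSketch("Noun", "car"));
        RETreeSketch tree = new RETreeSketch("Sentence", parts);
        System.out.println(tree.collapse() + " / " + tree.size());  // a car / 5
    }
}
```

Storing branches as a plain Object keeps leaves and inner nodes in one class, at the cost of an instanceof check in each method.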

Coding REProcessor

• Make some constants for the modes (operations).

• REProcessor has three properties: Object A, Object B, and byte mode

• A and B can be of type REProcessor or RE.

• Mode will be one of the constant values.

Useful functions in REProcessor

• beginningMatches

– Takes a string of input and returns an RETree if the REProcessor matches the beginning of the input (returns the longest match if multiple matches exist). Returns null if no match is found.

• evaluate

– Calls beginningMatches for either REProcessors or REs.

More on beginningMatches

• Must perform actions for each mode.

• Calls evaluate on substrings of the input for A and/or B as many times as is necessary to find the longest match or failure.

• Mode follow has both a unary and a binary operation.

• In unary mode it checks if A matches; in binary mode it checks if A matches and then B matches the section immediately following A.
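
The mode handling can be sketched as a single switch. Several simplifications are assumed here and are not taken from the seminar's source: REProcessorSketch returns the matched text instead of an RETree, java.util.regex.Pattern stands in for the RE class, a Pattern operand yields the regex engine's preferred match rather than a guaranteed longest one, and andnot rejects only when B matches all of A's text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class REProcessorSketch {
    static final byte FOLLOW = 0, OR = 1, ANDNOT = 2, STAR = 3, PLUS = 4, MAYBE = 5;

    final Object a, b;   // each a Pattern or an REProcessorSketch; b is null in unary modes
    final byte mode;

    REProcessorSketch(byte mode, Object a, Object b) {
        this.mode = mode;
        this.a = a;
        this.b = b;
    }

    // Dispatches on the operand type; returns the matched prefix or null.
    static String evaluate(Object operand, String input) {
        if (operand instanceof REProcessorSketch) {
            return ((REProcessorSketch) operand).beginningMatches(input);
        }
        Matcher m = ((Pattern) operand).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    String beginningMatches(String input) {
        String first = evaluate(a, input);
        switch (mode) {
            case FOLLOW:
                if (first == null || b == null) return first;   // failure, or unary follow
                String rest = evaluate(b, input.substring(first.length()));
                return rest == null ? null : first + rest;
            case OR:
                String second = evaluate(b, input);
                if (first == null) return second;
                return (second != null && second.length() > first.length()) ? second : first;
            case ANDNOT:
                if (first == null) return null;
                return first.equals(evaluate(b, first)) ? null : first;
            case MAYBE:
                return first == null ? "" : first;
            case STAR:
            case PLUS:
                StringBuilder all = new StringBuilder();
                String piece = first;
                while (piece != null && !piece.isEmpty()) {
                    all.append(piece);
                    piece = evaluate(a, input.substring(all.length()));
                }
                return (mode == PLUS && first == null) ? null : all.toString();
            default:
                throw new IllegalStateException("unknown mode " + mode);
        }
    }

    public static void main(String[] args) {
        Pattern article = Pattern.compile("the|an|a");
        Pattern noun = Pattern.compile("house|car|person");
        // (Article S)? Noun, built from one binary follow, one maybe, one binary follow.
        REProcessorSketch articleSpace = new REProcessorSketch(FOLLOW, article, Pattern.compile(" "));
        REProcessorSketch optional = new REProcessorSketch(MAYBE, articleSpace, null);
        REProcessorSketch subject = new REProcessorSketch(FOLLOW, optional, noun);
        System.out.println(subject.beginningMatches("the car goes"));  // the car
        System.out.println(subject.beginningMatches("person sits"));   // person
        System.out.println(subject.beginningMatches("sits on"));       // null
    }
}
```

Because an operand may itself be a processor, nesting unary and binary nodes is enough to express a whole rule.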

Coding SubProductionProcessor

• SubProductionProcessor has two properties: String type and SubProductionProcessor [] subproduction

• Uses the super functions for beginningMatches and evaluate (with a few minor changes).

• Not much code to this class.

Changes to beginningMatches

• First calls the superclass's beginningMatches function. If it is successful and no type has been assigned to the resulting tree, a type is added.

Changes to evaluate

• If the automaton passed to evaluate is of type Integer, the function gets the subproduction associated with the value and calls beginningMatches for it.

• Otherwise, it calls the superclass's evaluate function.
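
The Integer dispatch can be illustrated with two small stand-in classes. Everything here is hypothetical (ReferenceDemo, Base, WithReferences, and the table layout); the sketch also resolves a reference by calling evaluate again rather than beginningMatches, and returns matched text rather than a tree.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReferenceDemo {
    // Stand-in for REProcessor: evaluates a plain regular expression.
    static class Base {
        String evaluate(Object operand, String input) {
            Matcher m = ((Pattern) operand).matcher(input);
            return m.lookingAt() ? m.group() : null;
        }
    }

    // Stand-in for SubProductionProcessor: also understands Integer references.
    static class WithReferences extends Base {
        final Object[] subproduction;   // hypothetical table indexed by the Integer

        WithReferences(Object[] subproduction) {
            this.subproduction = subproduction;
        }

        @Override
        String evaluate(Object operand, String input) {
            if (operand instanceof Integer) {
                // Resolve the reference, then evaluate whatever it points at.
                return evaluate(subproduction[(Integer) operand], input);
            }
            return super.evaluate(operand, input);
        }
    }

    public static void main(String[] args) {
        // Entry 1 is a reference to entry 0, which holds the actual expression.
        Object[] table = { Pattern.compile("house|car|person"), Integer.valueOf(0) };
        WithReferences processor = new WithReferences(table);
        System.out.println(processor.evaluate(Integer.valueOf(1), "car goes"));  // car
    }
}
```

Referring to a rule by index means a rule can mention another rule before that rule has been constructed, since the lookup happens only at match time.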

Coding ProductionProcessor

• ProductionProcessor has three properties: ProductionProcessor [] production, LinkedList subproductionlist, and int numsubproductions

• The production array contains references to all other productions and the subproductions for the current production.

• subproductionlist holds 5-tuples describing subproductions. These are only copied into actual subproductions during the InitializeProcessors function call.

• numsubproductions holds the number of subproductions currently in use (incremented during the BuildSubProductionProcessors function call).

A note about ProductionProcessor

• Pieces to be matched by ProductionProcessor are passed in chunks. This way, special characters are easy to process. For example, instead of passing the rule “(hello)?” as a single string we would pass “(”, “hello”, “)”, “?”. Searching for parentheses and the question mark in this way is much quicker.
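
A chunking step of this kind could look like the sketch below. The class ChunkDemo, the method chunk, and the exact set of special characters are assumptions; quoted literals are not handled, so a special character inside a quoted expression would be split off as well.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkDemo {
    // Splits a rule into chunks: each special character becomes its own chunk,
    // and runs of other characters stay together. Whitespace separates chunks.
    static List<String> chunk(String rule) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : rule.toCharArray()) {
            boolean special = "()?*+|-".indexOf(c) >= 0;
            if (special || Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
                if (special) {
                    chunks.add(String.valueOf(c));
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("(hello)?"));  // [(, hello, ), ?]
    }
}
```

Once the rule is a list of chunks, finding an operator is a matter of comparing whole list elements rather than scanning characters.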

Useful functions in ProductionProcessor

• InitializeProcessors

– Adds intermediate subproduction 5-tuples to the production array so that subproductions and productions are regarded as inherently similar items. This takes great advantage of Java’s excellent polymorphism.

• BuildSubProductionProcessors

– Builds the intermediate 5-tuples representing the subproductions. This is the first step in a rather complex series of function calls which break down the production into unary and/or binary pieces.

– It begins by splitting the rule on or operations (the pipe symbol).

– Following order of operations, it then splits on the minus operation, leaving groups that no longer contain or and minus operations.

– No more detail will be provided about this sequence as it is too complicated for one class period.
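
The first two splitting steps can be sketched on a list of chunks. SplitDemo and splitTopLevel are hypothetical names, and the assumption that operators inside parentheses are left for a later pass is mine; the seminar does not describe how nesting is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Splits chunks on the given operator, ignoring operators inside parentheses.
    static List<List<String>> splitTopLevel(List<String> chunks, String operator) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int depth = 0;
        for (String chunk : chunks) {
            if (chunk.equals("(")) depth++;
            if (chunk.equals(")")) depth--;
            if (depth == 0 && chunk.equals(operator)) {
                groups.add(current);
                current = new ArrayList<>();
            } else {
                current.add(chunk);
            }
        }
        groups.add(current);
        return groups;
    }

    public static void main(String[] args) {
        // a - b | ( c | d ) e
        List<String> rule = Arrays.asList("a", "-", "b", "|", "(", "c", "|", "d", ")", "e");
        // First split on or, then split each alternative on minus.
        for (List<String> alternative : splitTopLevel(rule, "|")) {
            System.out.println(splitTopLevel(alternative, "-"));
        }
        // prints [[a], [b]] and then [[(, c, |, d, ), e]]
    }
}
```

Splitting on or before minus gives or the lower precedence, so "a - b | c" is read as "(a - b) | c".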

Changes to beginningMatches

• Calls beginningMatches for the first subproduction (as they are linked together, this will eventually call beginningMatches for all subproductions).

• A type is also assigned if none is available when the matching RETree is returned.

Part V

Building a Parser Generator

Parsing an EBNF grammar

• To build a parser generator we must first complete our parser by making a class to parse EBNF grammar data.

• We have already seen the BNF grammar for EBNF grammars.

rule ::= symbol whitespace eq whitespace expression
whitespace ::= '[\t ]*'
symbol ::= '[a-zA-Z0-9]+'
re ::= hexCharacter | characterList | standardRE
//re supporting rules:
hexCharacter ::= '/#x[0-9A-F]+'
characterList ::= lbracket characterList1? rbracket
characterList1 ::= characterList2 | characterList2 characterList1
characterList2 ::= '[^/]//]' | escapeSequence
standardRE ::= singleQuote standardRE1? singleQuote
standardRE1 ::= standardRE2 | standardRE2 standardRE1
standardRE2 ::= '[^/'//]' | escapeSequence
escapeSequence ::= '//.'
lbracket ::= '/['
rbracket ::= '/]'
singleQuote ::= '/''
//------------------------------------------------------------------------------
eq ::= '::='
expression ::= group (whitespace or whitespace expression)? (whitespace or whitespace epsilon)?
group ::= sequence (whitespace minus whitespace group)?
sequence ::= sequenceLHS modifier? whitespace sequence?
sequenceLHS ::= symbol | re | lparen whitespace expression whitespace rparen
modifier ::= '[*+?]'
lparen ::= '/('
rparen ::= '/)'
minus ::= '/-'
or ::= '/|'
epsilon ::= '_'

The Tokens

//Production constants: each production takes the next index into the production array.
public static int PRODUCTIONS = 0;
public static final Integer RULE = Integer.valueOf (PRODUCTIONS++);
public static final Integer WHITESPACE = Integer.valueOf (PRODUCTIONS++);
public static final Integer SYMBOL = Integer.valueOf (PRODUCTIONS++);
public static final Integer RE = Integer.valueOf (PRODUCTIONS++);
public static final Integer HEXCHARACTER = Integer.valueOf (PRODUCTIONS++);
public static final Integer CHARACTERLIST = Integer.valueOf (PRODUCTIONS++);
//Only the first six constants are shown; the remaining productions are declared the same way.

//Declared after the constants, so PRODUCTIONS already holds the final count.
public static ProductionProcessor [] production = new ProductionProcessor[PRODUCTIONS];

The Productions

Object [] parameters;

//rule ::= symbol whitespace eq whitespace expression
parameters = new Object[5];
parameters[0] = SYMBOL;
parameters[1] = WHITESPACE;
parameters[2] = EQ;
parameters[3] = WHITESPACE;
parameters[4] = EXPRESSION;
production[RULE.intValue ()] = new ProductionProcessor (production, "rule", parameters);

//whitespace ::= '[\t ]*'
parameters = new Object[1];
parameters[0] = "[\t ]*";
production[WHITESPACE.intValue ()] = new ProductionProcessor (production, "whitespace", parameters);

• Continue with this process for each rule.

Make sure to call InitializeProcessors

//Initializing production processors
for (int index = 0; index < production.length; index++)
{
    production[index].InitializeProcessors ();
}

Continuing On…

• That is pretty much it for the foundations of the parser generator.

• Actually writing the parser generator to do what you want is up to you. I will show you my code (or you can download it from www.fokno.org), but it is too difficult to present line by line.

The End

• Thank you for coming. Please feel free to ask questions and/or make comments at this time.

• Be sure to visit www.fokno.org to download a copy of the discussed source code.

