Recursive Descent Parsing for XML Developers

Roger L. Costello
15 October 2014

Table of Contents

• Introduction to parsing in general, recursive descent parsing in particular
• Example #1: How to do recursive descent parsing on Book data
• Example #2: How to do recursive descent parsing for a grammar that contains alternatives
• Limitations of recursive descent parsing

Flat XML Document

You might receive an XML document that has no structure. For example, this XML document contains a flat (linear) list of Book data:

<input>
   <Title>Parsing Techniques</Title>
   <Author>Dick Grune</Author>
   <Author>Ceriel J.H. Jacobs</Author>
   <Date>2007</Date>
   <ISBN>978-0-387-20248-8</ISBN>
   <Publisher>Springer</Publisher>
   <Title>Introduction to Graph Theory</Title>
   <Author>Richard J. Trudeau</Author>
   <Date>1993</Date>
   <ISBN>0-486-67870-9</ISBN>
   <Publisher>Dover Publications</Publisher>
   <Title>Introduction to Formal Languages</Title>
   <Author>Gyorgy E. Revesz</Author>
   <Date>2012</Date>
   <ISBN>0-486-66697-2</ISBN>
   <Publisher>Dover Publications</Publisher>
</input>

Give it structure to facilitate processing

<Books>
   <Book>
      <Title>Parsing Techniques</Title>
      <Authors>
         <Author>Dick Grune</Author>
         <Author>Ceriel J.H. Jacobs</Author>
      </Authors>
      <Date>2007</Date>
      <ISBN>978-0-387-20248-8</ISBN>
      <Publisher>Springer</Publisher>
   </Book>
   <Book>
      <Title>Introduction to Graph Theory</Title>
      <Authors>
         <Author>Richard J. Trudeau</Author>
      </Authors>
      <Date>1993</Date>
      <ISBN>0-486-67870-9</ISBN>
      <Publisher>Dover Publications</Publisher>
   </Book>
   <Book>
      <Title>Introduction to Formal Languages</Title>
      <Authors>
         <Author>Gyorgy E. Revesz</Author>
      </Authors>
      <Date>2012</Date>
      <ISBN>0-486-66697-2</ISBN>
      <Publisher>Dover Publications</Publisher>
   </Book>
</Books>

That’s parsing!

Parsing is taking a flat (linear) sequence of items and adding structure so that the result conforms to a grammar.

Parsing

flat <input> document → parse → structured <Books> document

From the book: “Parsing Techniques”

• Parsing is the process of structuring a linear representation in accordance with a given grammar.

• The “linear representation” may be:
   • a flat sequence of XML elements
   • a sentence
   • a computer program
   • a knitting pattern
   • a sequence of geological strata
   • a piece of music
   • actions of ritual behavior

Grammar

• A grammar is a succinct description of the structure.
• Here is a grammar for Books:

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
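To make the grammar concrete, here is a minimal Python sketch of one way to hold it as data. The names GRAMMAR, TERMINALS, is_nonterminal and symbols_used are hypothetical, and the sketch assumes the `X → text` rules are folded into the terminal set, since the slides treat each Title, Author, Date, ISBN and Publisher element as a single token.

```python
# Hypothetical encoding of the Books grammar as plain Python data.
# Each rule maps a non-terminal to a list of (symbol, repeat) pairs,
# where repeat is "1" (exactly once) or "+" (one or more).
GRAMMAR = {
    "Books": [("Book", "+")],
    "Book": [("Title", "1"), ("Authors", "1"), ("Date", "1"),
             ("ISBN", "1"), ("Publisher", "1")],
    "Authors": [("Author", "+")],
}

# Assumption: elements whose rule is "X -> text" are treated as terminals.
TERMINALS = {"Title", "Author", "Date", "ISBN", "Publisher"}


def is_nonterminal(symbol: str) -> bool:
    """A symbol is a non-terminal if it has a rule of its own."""
    return symbol in GRAMMAR


def symbols_used() -> set:
    """Every symbol that appears on the right-hand side of some rule."""
    return {symbol for rhs in GRAMMAR.values() for symbol, _ in rhs}
```

Every right-hand-side symbol that has no rule of its own turns out to be one of the five token names, which matches the split between tokens and wrapper elements used in the rest of the slides.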

Parsing

Grammar + linear representation → parser → structured representation

The parser takes the Books grammar and the flat <input> document (the linear representation) and produces the <Books> document (the structured representation).

Alternate view of the parser output

Given the same grammar and linear representation, the parser output can also be viewed as a parse tree:

Books
   Book
      Title: Parsing Techniques
      Authors
         Author: Dick Grune
         Author: Ceriel J.H. Jacobs
      Date: 2007
      ISBN: 978-0-387-20248-8
      Publisher: Springer
   Book
      Title: Introduction to Graph Theory
      Authors
         Author: Richard J. Trudeau
      Date: 1993
      ISBN: 0-486-67870-9
      Publisher: Dover Publications
   Book
      Title: Introduction to Formal Languages
      Authors
         Author: Gyorgy E. Revesz
      Date: 2012
      ISBN: 0-486-66697-2
      Publisher: Dover Publications

Parsing Techniques

• Over the last 50 years many parsing techniques have been created.
• Some parsing techniques work from the starting grammar rule to the bottom. Those are called top-down parsing techniques.
• Other parsing techniques work from the bottom grammar rules to the starting grammar rule. Those are called bottom-up parsing techniques.
• The following slides explain the “recursive descent parsing technique.” It is a top-down parsing technique.

Terminology: Token

• A token is an atomic (indivisible) unit.
• Each item in the input is a token.
• After parsing, the tokens will be leaf nodes.

The input consists of a sequence of tokens

Each child element of <input> (each Title, Author, Date, ISBN, and Publisher element) is a token. This input consists of 16 tokens.
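The token count can be checked with a short Python sketch. It assumes the flat document is available as a string (FLAT_XML is a hypothetical name) and uses the standard-library xml.etree.ElementTree module to treat each child of <input> as one token.

```python
import xml.etree.ElementTree as ET

# The flat Book data from the slides, held in a string for this sketch.
FLAT_XML = """<input>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</input>"""

root = ET.fromstring(FLAT_XML)

# Each child element of <input> is one token: (element name, text content).
tokens = [(child.tag, (child.text or "").strip()) for child in root]

print(len(tokens))  # 16
print(tokens[0])    # ('Title', 'Parsing Techniques')
```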

After parsing the tokens will be leaf nodes

In the structured <Books> document, the tokens (terminal symbols) are the leaf elements: Title, Author, Date, ISBN, and Publisher.

Another view of the tokens, after parsing

In the parse tree, the tokens are the leaves: the Title, Author, Date, ISBN, and Publisher nodes under each Book.

Parsing structures the input by wrapping the tokens in non-terminal symbols

In the structured <Books> document, the non-terminal symbols are the wrapper elements: Books, Book, and Authors.

Recursive descent parsing

Recursive descent parsing works like this:
• Start at the grammar’s start symbol and output it. In our grammar, the start symbol is Books, so output <Books>.
• Progress through each grammar rule. For a non-terminal symbol, output it. For a terminal symbol (i.e., token), check whether the current token in the input stream matches the terminal symbol; if it matches, output it.

Initial

Start with the grammar’s start symbol and the first token in the input stream.

Output the start symbol

Output:

<Books>
</Books>

Grammar says there must be at least one Book

So the input stream must contain all the tokens for at least one Book. Let’s process the grammar rule for Book.

Output <Book>

Output:

<Books>
   <Book>
   </Book>
</Books>

Grammar says the token in the input stream must be Title

The current token is <Title>Parsing Techniques</Title>. Yea, the input token matches the grammar rule.

Output:

<Books>
   <Book>
      <Title>Parsing Techniques</Title>
   </Book>
</Books>

Grammar: after Title must be Authors

So the input stream must contain Author tokens. Let’s process the rule for Authors.

Output <Authors>

Output:

<Books>
   <Book>
      <Title>Parsing Techniques</Title>
      <Authors>
      </Authors>
   </Book>
</Books>

Grammar says the next token in the input stream must be an Author token

The current token is <Author>Dick Grune</Author>. Yea, the input token matches the grammar rule.

Output:

<Books>
   <Book>
      <Title>Parsing Techniques</Title>
      <Authors>
         <Author>Dick Grune</Author>
      </Authors>
   </Book>
</Books>

Grammar says the next token in the input stream may be an Author token

The current token is <Author>Ceriel J.H. Jacobs</Author>. Another Author match.

Output:

<Books>
   <Book>
      <Title>Parsing Techniques</Title>
      <Authors>
         <Author>Dick Grune</Author>
         <Author>Ceriel J.H. Jacobs</Author>
      </Authors>
   </Book>
</Books>

The next token in the input stream is not an Author token

So, return to the caller (i.e., return to the Book rule).

Grammar says the input stream token must be a Date token

The current token is <Date>2007</Date>. Yea, the input token matches the grammar rule.

Output:

<Books>
   <Book>
      <Title>Parsing Techniques</Title>
      <Authors>
         <Author>Dick Grune</Author>
         <Author>Ceriel J.H. Jacobs</Author>
      </Authors>
      <Date>2007</Date>
   </Book>
</Books>

Grammar says the input stream token must be an ISBN token

The current token is <ISBN>978-0-387-20248-8</ISBN>. Yea, the input token matches the grammar rule.

Output:

<Books>
   <Book>
      <Title>Parsing Techniques</Title>
      <Authors>
         <Author>Dick Grune</Author>
         <Author>Ceriel J.H. Jacobs</Author>
      </Authors>
      <Date>2007</Date>
      <ISBN>978-0-387-20248-8</ISBN>
   </Book>
</Books>

Grammar says the input stream token must be a Publisher token

The current token is <Publisher>Springer</Publisher>. Yea, the input token matches the grammar rule.

Output:

<Books>
   <Book>
      <Title>Parsing Techniques</Title>
      <Authors>
         <Author>Dick Grune</Author>
         <Author>Ceriel J.H. Jacobs</Author>
      </Authors>
      <Date>2007</Date>
      <ISBN>978-0-387-20248-8</ISBN>
      <Publisher>Springer</Publisher>
   </Book>
</Books>

We’ve completed structuring the first 6 input tokens

Output:

<Books>
   <Book>
      <Title>Parsing Techniques</Title>
      <Authors>
         <Author>Dick Grune</Author>
         <Author>Ceriel J.H. Jacobs</Author>
      </Authors>
      <Date>2007</Date>
      <ISBN>978-0-387-20248-8</ISBN>
      <Publisher>Springer</Publisher>
   </Book>
</Books>

Completed the Book rule

We’ve finished processing the Book rule, so return to the caller (i.e., the Books rule).

Begin work on structuring the next Book


Implementation

The following slides show, in a step-by-step manner, how to implement a recursive descent parser

Step 1

Create a function for each non-terminal symbol in the grammar:

Grammar rule                                  Function
Books → Book+                                 Books() { … }
Book → Title Authors Date ISBN Publisher      Book() { … }
Authors → Author+                             Authors() { … }

Step 2

Create a global variable, Token, that identifies the current position in the input stream. Initialize Token to 0:

Token = 0

Step 3

Create a function, get_next_token(). When it is called, it increments the current position in the input stream:

get_next_token() {
   Token = Token + 1
}
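Steps 2 and 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's code: the cursor is named position, the helper current_token() is hypothetical, and the sketch uses Python's 0-based list indexing, whereas the slides' pseudocode appears to address tokens with 1-based positions.

```python
# Token names only, for brevity; a fuller sketch would keep the text too.
tokens = ["Title", "Author", "Author", "Date", "ISBN", "Publisher"]

# Step 2: a module-level cursor into the token list (the slides call it Token).
position = 0


def get_next_token() -> None:
    """Step 3: advance the cursor by one token."""
    global position
    position += 1


def current_token():
    """Hypothetical helper: the token under the cursor, or None at end of input."""
    return tokens[position] if position < len(tokens) else None
```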

Step 4

Create a function, token(), and pass it a name, tk. The purpose of this function is to answer the question: “Does the token at the current position in the input stream match tk?”

Example of using the token() function

Suppose that during recursive descent parsing the grammar indicates that the next token in the input stream must be “Title.” Suppose the global variable, Token, indicates that we are at a Title token in the input stream, for example:

<Title>Parsing Techniques</Title>

Example (cont.)

The token() function determines that there is a match, so it calls get_next_token() to increment the position in the input stream and returns the matched token.

The token() function

token(string tk) {
   if (tk != name(input[position() = Token])) then
      return ()
   else {
      matched = input[position() = Token]
      get_next_token()
      return matched
   }
}

Notice that token() returns empty if there is not a match.

Motivation for Step 5

Suppose that during recursive descent parsing we are in the Book() function. The Book() function first checks—by calling the token() function—to see if the current position of the input stream contains “Title.” Suppose it does. Then, according to the grammar, there must be Authors, Date, ISBN, and then Publisher:

Book → Title Authors Date ISBN Publisher

Step 5

Create a function, require(), and pass it a token, found. If the token is empty (i.e., the token() function returned empty because there was not a match) then call the error() function. Otherwise, return the token.

require(element found) {
   if empty(found) then
      error('Invalid input')
   else
      return found
}

Step 6

Create an error function, error(). Pass it a string. It outputs the string and then halts the parser.

error(string s) {
   output s
   stop
}
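Steps 5 and 6 might look like this in Python. The sketch swaps "output the string and halt" for raising an exception, which is the usual Python way to abandon a parse; ParseError is a hypothetical name, and None stands in for the slides' "empty".

```python
class ParseError(Exception):
    """Raised when required input is missing (stands in for halting the parser)."""


def error(message: str) -> None:
    """Step 6: report the problem and abandon the parse."""
    raise ParseError(message)


def require(found):
    """Step 5: pass a matched token through, or fail if nothing was matched."""
    if found is None:
        error("Invalid input")
    return found
```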

The complete implementation

• Recursive descent has been around a long time and people have developed beautiful code for it.

• The following two slides collect all the code from the previous slides. I recommend spending some time studying it to appreciate its beauty.

Code for a Recursive Descent Parser

Token = 0

main() {
   get_next_token()
   require(input())
}

input() {
   return require(Books())
}

Books() {
   <Books>
      return (require(Book()), optional_additional_Books())
   </Books>
}

optional_additional_Books() {
   book = Book()
   if exists(book) then
      return (book, optional_additional_Books())
}

Book() {
   title = token('Title')
   if exists(title) then
      <Book>
         return (title,
                 require(Authors()),
                 require(token('Date')),
                 require(token('ISBN')),
                 require(token('Publisher')))
      </Book>
}

Authors() {
   <Authors>
      return (require(token('Author')), optional_additional_Authors())
   </Authors>
}

optional_additional_Authors() {
   author = token('Author')
   if exists(author) then
      return (author, optional_additional_Authors())
}

token(string tk) {
   if (tk != name(input[position() = Token])) then
      return ()
   else {
      matched = input[position() = Token]
      get_next_token()
      return matched
   }
}

require(element found) {
   if empty(found) then
      error('Invalid input')
   else
      return found
}

get_next_token() {
   Token = Token + 1
}
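For readers who want something runnable, here is a Python sketch of the same parser built on the standard-library xml.etree.ElementTree module. It is an illustration, not the author's XSLT: BooksParser and ParseError are hypothetical names, while loops replace the optional_additional_* recursion, the cursor is an instance attribute rather than a global, and the check for leftover tokens in parse() is an added assumption that the pseudocode does not include.

```python
import xml.etree.ElementTree as ET


class ParseError(Exception):
    """Raised when the token stream does not fit the Books grammar."""


class BooksParser:
    def __init__(self, flat_root):
        self.tokens = list(flat_root)  # child elements of <input>
        self.pos = 0                   # cursor (0-based in this sketch)

    def token(self, name):
        """Return the current token and advance if its tag matches, else None."""
        if self.pos < len(self.tokens) and self.tokens[self.pos].tag == name:
            matched = self.tokens[self.pos]
            self.pos += 1
            return matched
        return None

    def require(self, found):
        if found is None:
            raise ParseError(f"Invalid input at token {self.pos}")
        return found

    def authors(self):
        # Authors -> Author+
        node = ET.Element("Authors")
        node.append(self.require(self.token("Author")))
        while (author := self.token("Author")) is not None:
            node.append(author)
        return node

    def book(self):
        # Book -> Title Authors Date ISBN Publisher
        title = self.token("Title")
        if title is None:
            return None  # no Book starts here
        node = ET.Element("Book")
        node.append(title)
        node.append(self.authors())
        for name in ("Date", "ISBN", "Publisher"):
            node.append(self.require(self.token(name)))
        return node

    def books(self):
        # Books -> Book+
        node = ET.Element("Books")
        node.append(self.require(self.book()))
        while (book := self.book()) is not None:
            node.append(book)
        return node

    def parse(self):
        tree = self.books()
        if self.pos != len(self.tokens):  # assumption: leftover tokens are an error
            raise ParseError(f"Unexpected token at position {self.pos}")
        return tree


flat = ET.fromstring(
    "<input><Title>T1</Title><Author>A</Author><Author>B</Author>"
    "<Date>2007</Date><ISBN>1</ISBN><Publisher>P</Publisher>"
    "<Title>T2</Title><Author>C</Author><Date>1993</Date>"
    "<ISBN>2</ISBN><Publisher>Q</Publisher></input>"
)
books = BooksParser(flat).parse()
print(ET.tostring(books, encoding="unicode"))
```

The comparisons use `is None` rather than truthiness because an ElementTree element with no children evaluates as false.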

XSLT Implementation

• I created an XSLT implementation. I tried to mirror the beautiful code shown on the previous slides.

• If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document:

• http://www.xfront.com/parsing-techniques/recursive-descent-parser/books-parser.xsl

• http://www.xfront.com/parsing-techniques/recursive-descent-parser/books-test.xml

Page 49: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Richer example

• The Books example shown on the previous slides was fine for introducing recursive descent parsing.

• But it glossed over an important problem: grammar rules with alternatives.

• The following example shows how to do recursive descent parsing with a grammar that has alternatives.

Page 50: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Expressions

• Let’s parse a simple expression language that has these tokens: IDENTIFIER, the addition operator (+), left and right parentheses, and EoF.

• Here are a few examples of expressions:

IDENTIFIER EoF
(IDENTIFIER) EoF
IDENTIFIER + IDENTIFIER EoF
(IDENTIFIER + IDENTIFIER) EoF
IDENTIFIER + (IDENTIFIER + IDENTIFIER) EoF
(IDENTIFIER + IDENTIFIER) + IDENTIFIER EoF
IDENTIFIER + (IDENTIFIER + (IDENTIFIER + IDENTIFIER)) EoF

Each expression ends with an end-of-file (EoF) token.

Page 51: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Expression grammar

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

Page 52: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Parse tree for: IDENTIFIER EoF

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

input
    expression
        term
            IDENTIFIER
        rest_expression
            ε
    EoF

Page 53: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Parser selects the first alternative

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

input
    expression
        term
            IDENTIFIER
        rest_expression
            ε
    EoF

term has two alternatives. The parser selected the first alternative.

Page 54: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Parse tree for: (IDENTIFIER) EoF

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

input
    expression
        term
            parenthesized_expression
                (
                expression
                    term
                        IDENTIFIER
                    rest_expression
                        ε
                )
        rest_expression
            ε
    EoF

Page 55: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Parser selects the second alternative

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

input
    expression
        term
            parenthesized_expression
                (
                expression
                    term
                        IDENTIFIER
                    rest_expression
                        ε
                )
        rest_expression
            ε
    EoF

term’s second alternative is selected.

Page 56: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Question

How does a recursive descent parser know whether it should select the first or the second alternative?

term → IDENTIFIER | parenthesized_expression

How does the parser know which alternative to select?

Page 57: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Answer

• The parser doesn’t know.

• It tries the first alternative. If that fails, it tries the second alternative (i.e., the parser backtracks and tries the next alternative). It repeats until it finds an alternative that succeeds; if no alternative succeeds, the rule itself fails.
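The try-then-backtrack strategy can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the slides: each alternative is assumed to be a function that takes the token list and a position and returns either (tree, new_position) or None, and the second alternative below is a simplified stand-in that only accepts '(' IDENTIFIER ')'. Because the position is passed in rather than mutated, "backing up" is simply calling the next alternative with the same position.

```python
def try_alternatives(tokens, position, alternatives):
    """Try each alternative in order, from the same position; return the first success."""
    for alternative in alternatives:
        result = alternative(tokens, position)
        if result is not None:
            return result  # (tree, new_position)
    return None  # every alternative failed, so the rule fails


def match_identifier(tokens, position):
    """First alternative of term: a single IDENTIFIER token."""
    if position < len(tokens) and tokens[position] == "IDENTIFIER":
        return ("IDENTIFIER", position + 1)
    return None


def match_parenthesized_identifier(tokens, position):
    """Simplified stand-in for parenthesized_expression: '(' IDENTIFIER ')'."""
    if tokens[position:position + 3] == ["(", "IDENTIFIER", ")"]:
        return (("(", "IDENTIFIER", ")"), position + 3)
    return None


def term(tokens, position):
    return try_alternatives(
        tokens, position, [match_identifier, match_parenthesized_identifier]
    )
```

For the tokens ( IDENTIFIER ) EoF, term first tries match_identifier, which fails on the ( token, and then succeeds with the second alternative.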

Page 58: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Processing the first token in the input stream

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

Input tokens: ( IDENTIFIER ) EoF

input
    expression
        term
            IDENTIFIER

Try the first alternative of term, which says the input token must be IDENTIFIER. However, the input token is (, so we must back up and try the next alternative.

Page 59: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Implementation of the term() function

term() {
    <term>
        identifier = token('IDENTIFIER')
        if exists(identifier) then
            return (identifier)
        else
            return (require(parenthesized_expression()))
    </term>
}

Check the current token in the input stream to see if it is IDENTIFIER.

Page 60: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

term() function (cont.)

term() {
    <term>
        identifier = token('IDENTIFIER')
        if exists(identifier) then
            return (identifier)
        else
            return (require(parenthesized_expression()))
    </term>
}

If there is a match, return the token.

Page 61: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

term() function (cont.)

term() {
    <term>
        identifier = token('IDENTIFIER')
        if exists(identifier) then
            return (identifier)
        else
            return (require(parenthesized_expression()))
    </term>
}

Otherwise, try the other alternative; it must succeed.

Page 62: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Let’s represent each expression as XML

Instead of this input: IDENTIFIER EoF

our input will be this:

<input>
    <IDENTIFIER />
    <EoF />
</input>

Page 63: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

XML representation (cont.)

Instead of this input: (IDENTIFIER) EoF

our input will be this:

<input>
    <LP />
    <IDENTIFIER />
    <RP />
    <EoF />
</input>

Page 64: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

XML representation (cont.)

Instead of this input: IDENTIFIER + IDENTIFIER EoF

our input will be this:

<input>
    <IDENTIFIER />
    <PLUS />
    <IDENTIFIER />
    <EoF />
</input>

Page 65: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

XML representation (cont.)

Instead of this input: IDENTIFIER + (IDENTIFIER + IDENTIFIER) EoF

our input will be this XML input:

<input>
    <IDENTIFIER />
    <PLUS />
    <LP />
    <IDENTIFIER />
    <PLUS />
    <IDENTIFIER />
    <RP />
    <EoF />
</input>
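The mapping from an expression to its XML token list is mechanical, so it can be generated. The following Python sketch uses the standard xml.etree.ElementTree module; the function name expression_to_xml and the TOKEN_NAMES table are assumptions made for this illustration, not part of the slides or of the XSLT implementation.

```python
import xml.etree.ElementTree as ET

# Element names used on the slides for the three punctuation tokens.
TOKEN_NAMES = {"(": "LP", ")": "RP", "+": "PLUS"}
ALLOWED = {"LP", "RP", "PLUS", "IDENTIFIER", "EoF"}


def expression_to_xml(expression):
    """Turn e.g. 'IDENTIFIER + (IDENTIFIER) EoF' into an <input> element of empty token elements."""
    # Put spaces around punctuation so that split() separates every token.
    for symbol in TOKEN_NAMES:
        expression = expression.replace(symbol, f" {symbol} ")
    root = ET.Element("input")
    for word in expression.split():
        name = TOKEN_NAMES.get(word, word)
        if name not in ALLOWED:
            raise ValueError(f"Unknown token: {word!r}")
        ET.SubElement(root, name)
    return root


if __name__ == "__main__":
    element = expression_to_xml("IDENTIFIER + (IDENTIFIER + IDENTIFIER) EoF")
    print(ET.tostring(element, encoding="unicode"))
```

The returned element has one empty child element per token, in input order, which is the flat form the parser consumes.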

Page 66: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Parsing

<input>
  <IDENTIFIER />
  <EoF />
</input>

Parser

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

<output>
  <expression>
    <term>
      <IDENTIFIER/>
    </term>
    <rest_expression/>
  </expression>
  <EoF/>
</output>

Page 67: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Parsing (cont.)

<input>
  <LP />
  <IDENTIFIER />
  <RP />
  <EoF />
</input>

Parser

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

<output>
  <expression>
    <term>
      <parenthesized_expression>
        <LP/>
        <expression>
          <term>
            <IDENTIFIER/>
          </term>
          <rest_expression/>
        </expression>
        <RP/>
      </parenthesized_expression>
    </term>
    <rest_expression/>
  </expression>
  <EoF/>
</output>

Page 68: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Parsing (cont.)

<input>
  <IDENTIFIER />
  <PLUS />
  <IDENTIFIER />
  <EoF />
</input>

Parser

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

<output>
  <expression>
    <term>
      <IDENTIFIER/>
    </term>
    <rest_expression>
      <PLUS/>
      <expression>
        <term>
          <IDENTIFIER/>
        </term>
        <rest_expression/>
      </expression>
    </rest_expression>
  </expression>
  <EoF/>
</output>

Page 69: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Parsing (cont.)

<input>
  <IDENTIFIER />
  <PLUS />
  <LP />
  <IDENTIFIER />
  <PLUS />
  <IDENTIFIER />
  <RP />
  <EoF />
</input>

Parser

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

<output>
  <expression>
    <term>
      <IDENTIFIER/>
    </term>
    <rest_expression>
      <PLUS/>
      <expression>
        <term>
          <parenthesized_expression>
            <LP/>
            <expression>
              <term>
                <IDENTIFIER/>
              </term>
              <rest_expression>
                <PLUS/>
                <expression>
                  <term>
                    <IDENTIFIER/>
                  </term>
                  <rest_expression/>
                </expression>
              </rest_expression>
            </expression>
            <RP/>
          </parenthesized_expression>
        </term>
        <rest_expression/>
      </expression>
    </rest_expression>
  </expression>
  <EoF/>
</output>
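To experiment with the input-to-output mapping shown on these slides without an XSLT processor, here is a Python sketch of a recursive descent parser for the expression grammar. It is an illustration, not the author's XSLT program: the class name ExpressionParser, the ParseError exception and the wrap helper are assumptions, and the standard xml.etree.ElementTree module is used to read the flat input and build the nested output.

```python
import xml.etree.ElementTree as ET


class ParseError(Exception):
    """Raised when the token stream does not match the grammar (hypothetical name)."""


class ExpressionParser:
    def __init__(self, input_xml):
        # The children of <input> are the tokens: IDENTIFIER, PLUS, LP, RP, EoF.
        self.tokens = [child.tag for child in ET.fromstring(input_xml)]
        self.pos = 0

    def token(self, name):
        """Consume the current token if it is `name`; otherwise return None."""
        if self.pos < len(self.tokens) and self.tokens[self.pos] == name:
            self.pos += 1
            return ET.Element(name)
        return None

    def require(self, found):
        if found is None:  # compare with None: an empty Element is falsy
            raise ParseError(f"Invalid input at token {self.pos + 1}")
        return found

    @staticmethod
    def wrap(name, children):
        node = ET.Element(name)
        node.extend(children)
        return node

    def parse(self):
        # input → expression EoF
        return self.wrap("output", [self.expression(), self.require(self.token("EoF"))])

    def expression(self):
        # expression → term rest_expression
        return self.wrap("expression", [self.term(), self.rest_expression()])

    def term(self):
        # term → IDENTIFIER | parenthesized_expression
        identifier = self.token("IDENTIFIER")
        if identifier is not None:
            return self.wrap("term", [identifier])
        return self.wrap("term", [self.require(self.parenthesized_expression())])

    def parenthesized_expression(self):
        # parenthesized_expression → '(' expression ')'
        lp = self.token("LP")
        if lp is None:
            return None
        expression = self.expression()
        rp = self.require(self.token("RP"))
        return self.wrap("parenthesized_expression", [lp, expression, rp])

    def rest_expression(self):
        # rest_expression → '+' expression | ε
        plus = self.token("PLUS")
        if plus is None:
            return ET.Element("rest_expression")  # ε: an empty element
        return self.wrap("rest_expression", [plus, self.expression()])
```

ExpressionParser(xml_string).parse() returns the output element; passing it to ET.tostring serializes the parse tree as XML.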

Page 70: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

XSLT Implementation

• I created an XSLT implementation of a recursive descent parser for the expression language.

• If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document:

• http://www.xfront.com/parsing-techniques/recursive-descent-parser/expression-parser.xsl

• http://www.xfront.com/parsing-techniques/recursive-descent-parser/expression-test.xml

Page 71: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Limitations of Recursive Descent Parsers

Recall that in a rule containing alternatives, we tried the first alternative; if it failed, we backtracked and tried the second alternative. Searching the alternatives is time-consuming, because the parser may have to re-read the same tokens for each alternative it tries.

Page 72: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Limitations (cont.)

Recursive descent parsers can’t handle left-recursive grammar rules. The parser goes into an infinite loop.

Example: suppose the grammar has this rule:

expression → expression '-' term

That is a “left-recursive” rule: its right-hand side starts with the same symbol as its left-hand side (i.e., expression). The recursive descent routine for this rule is:

expression() {
    return expression() and require(token('-')) and require(term())
}

expression() calls itself before consuming any token: (infinite) recursion!
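The problem is easy to reproduce. In the Python sketch below, the left-recursive routine never consumes a token before calling itself, so Python eventually raises RecursionError. The second routine shows one common workaround that the slides do not cover: assuming the grammar also has the alternative expression → term, the rule can be rewritten as a term followed by zero or more '-' term pairs and parsed with a loop. All function names here are assumptions made for the illustration, and term is simplified to a single IDENTIFIER.

```python
def term(tokens, position):
    """Simplified term: a single IDENTIFIER token."""
    if position < len(tokens) and tokens[position] == "IDENTIFIER":
        return "IDENTIFIER", position + 1
    raise SyntaxError(f"expected IDENTIFIER at position {position}")


def expression_left_recursive(tokens, position=0):
    """Direct transcription of: expression → expression '-' term."""
    # The recursive call comes first and nothing has been consumed yet,
    # so this line is reached again and again with the same position.
    left, position = expression_left_recursive(tokens, position)
    if position < len(tokens) and tokens[position] == "-":
        right, position = term(tokens, position + 1)
        return (left, "-", right), position
    raise SyntaxError("expected '-'")


def expression_iterative(tokens, position=0):
    """Assumed rewrite: expression → term ('-' term)*, parsed with a loop."""
    tree, position = term(tokens, position)
    while position < len(tokens) and tokens[position] == "-":
        right, position = term(tokens, position + 1)
        tree = (tree, "-", right)  # keeps the grouping left-associative
    return tree, position
```

Calling expression_left_recursive on any token list fails with RecursionError, while expression_iterative consumes IDENTIFIER - IDENTIFIER - IDENTIFIER and groups it from the left.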

Page 73: Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 2014 1.

Limitations (cont.)

Suppose we add an array element as a term:

term → IDENTIFIER | indexed_element | parenthesized_expression
indexed_element → IDENTIFIER '[' expression ']'

and create a recursive descent parser for the new grammar. The routine for indexed_element will never be tried: when the sequence IDENTIFIER '[' occurs in the input, the first alternative of term will succeed, consume the identifier, and leave the indigestible part '[' expression ']' in the input.
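The following Python sketch reproduces this behaviour and shows one possible workaround, which is not from the slides: try the longer alternative (indexed_element) before the bare IDENTIFIER. To keep it short, the expression inside the brackets is simplified to a term and the parenthesized alternative is omitted; all function names are assumptions made for the illustration.

```python
def term_identifier_first(tokens, pos):
    """term → IDENTIFIER | indexed_element, tried in the grammar's order."""
    if pos < len(tokens) and tokens[pos] == "IDENTIFIER":
        # The first alternative succeeds, so indexed_element is never tried
        # and any following '[' ... ']' is left in the input.
        return "IDENTIFIER", pos + 1
    return indexed_element(tokens, pos)


def indexed_element(tokens, pos):
    """indexed_element → IDENTIFIER '[' term ']' (inner expression simplified to a term)."""
    if tokens[pos:pos + 2] != ["IDENTIFIER", "["]:
        return None
    inner = term_reordered(tokens, pos + 2)
    if inner is None:
        return None
    tree, pos = inner
    if pos < len(tokens) and tokens[pos] == "]":
        return ("indexed_element", tree), pos + 1
    return None


def term_reordered(tokens, pos):
    """Assumed workaround: term → indexed_element | IDENTIFIER."""
    result = indexed_element(tokens, pos)
    if result is not None:
        return result
    # indexed_element failed; pos is unchanged, so fall back to a bare IDENTIFIER.
    if pos < len(tokens) and tokens[pos] == "IDENTIFIER":
        return "IDENTIFIER", pos + 1
    return None
```

On the tokens IDENTIFIER [ IDENTIFIER ], term_identifier_first stops after the first token, whereas term_reordered consumes all four. Reordering works here because failed alternatives leave the position untouched; it is only one of several ways to deal with alternatives that share a common prefix.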

