Parsing for XML Developers
Roger L. Costello
28 September 2014
Flat XML Document
You might receive an XML document that has no structure. For example, this XML document contains a flat (linear) list of Book data:
<Books>
    <Title>Parsing Techniques</Title>
    <Author>Dick Grune</Author>
    <Author>Ceriel J.H. Jacobs</Author>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
    <Title>Introduction to Graph Theory</Title>
    <Author>Richard J. Trudeau</Author>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
    <Title>Introduction to Formal Languages</Title>
    <Author>Gyorgy E. Revesz</Author>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
</Books>
Give it structure to facilitate processing
<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
        <ISBN>978-0-387-20248-8</ISBN>
        <Publisher>Springer</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Graph Theory</Title>
        <Authors>
            <Author>Richard J. Trudeau</Author>
        </Authors>
        <Date>1993</Date>
        <ISBN>0-486-67870-9</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Formal Languages</Title>
        <Authors>
            <Author>Gyorgy E. Revesz</Author>
        </Authors>
        <Date>2012</Date>
        <ISBN>0-486-66697-2</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
</Books>
That’s parsing!
Parsing is taking a flat (linear) sequence of items and adding structure so that the result conforms to a grammar.
Parsing
[Figure: the flat Books document passes through a "parse" step and comes out as the structured Books document.]
From the book: “Parsing Techniques”
• Parsing is the process of structuring a linear representation in accordance with a given grammar.
• The “linear representation” may be:
    • a flat sequence of XML elements
    • a sentence
    • a computer program
    • a knitting pattern
    • a sequence of geological strata
    • a piece of music
    • actions of ritual behavior
Grammar
• A grammar is a succinct description of the structure.
• Here is a grammar for Books:
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
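The grammar above can also be written down as data and used to check whether a document conforms to it. The following is a minimal sketch, not part of the original slides: the names GRAMMAR and conforms are illustrative assumptions, and each rule is encoded as a regular expression over the names of an element's children.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical encoding of the Books grammar: each rule maps an element name
# to a regular expression over its children's names (each name followed by a
# space). None marks a text-only rule such as Title → text.
GRAMMAR = {
    "Books": r"(Book )+",
    "Book": r"Title Authors Date ISBN Publisher ",
    "Authors": r"(Author )+",
    "Title": None,
    "Author": None,
    "Date": None,
    "ISBN": None,
    "Publisher": None,
}

def conforms(element):
    """Return True if element and all its descendants follow GRAMMAR."""
    if element.tag not in GRAMMAR:
        return False
    rule = GRAMMAR[element.tag]
    children = list(element)
    if rule is None:
        return not children  # a text-only element has no child elements
    names = "".join(child.tag + " " for child in children)
    if re.fullmatch(rule, names) is None:
        return False
    return all(conforms(child) for child in children)
```

Under these assumptions, conforms returns False for the flat Books document (its children are not Book elements) and True for the structured one.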
Parsing
[Figure: a parser takes two inputs, the grammar and the linear representation (the flat Books document), and produces the structured representation (the Books document with Book and Authors elements).]
Parsing Techniques
• Over the last 50 years many parsing techniques have been created.
• Some parsing techniques work from the starting grammar rule to the bottom. These are called top-down parsing techniques.
• Other parsing techniques work from the bottom grammar rules to the starting grammar rule. These are called bottom-up parsing techniques.
• The following slides show how to apply a powerful bottom-up parsing technique to the Books example.
What does “powerful” mean?
• The previous slide said, … following slides show how to apply a powerful bottom-up parsing technique …
• “Powerful” means the technique can be used with lots of grammars, i.e., it can be used to generate lots of different structures.
Suppose we were to structure the XML from scratch. We might follow these steps:
<Books> </Books>
<Books> <Book> </Book> </Books>
<Books> <Book> <Title>Parsing Techniques</Title> </Book> </Books>
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> </Authors> </Book> </Books>
Follow these steps (cont.):
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> </Authors> </Book> </Books>
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J.H. Jacobs</Author> </Authors> </Book> </Books>
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J.H. Jacobs</Author> </Authors> <Date>2007</Date> </Book> </Books>
Follow these steps (cont.):
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J.H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978-0-387-20248-8</ISBN> </Book> </Books>
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J.H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978-0-387-20248-8</ISBN> <Publisher>Springer</Publisher> </Book> </Books>
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J.H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978-0-387-20248-8</ISBN> <Publisher>Springer</Publisher> </Book> <Book> </Book> </Books>
and so forth, filling in the second Book then the third Book
Last step: add the last Book’s Publisher
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J.H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978-0-387-20248-8</ISBN> <Publisher>Springer</Publisher> </Book> <Book> <Title>Introduction to Graph Theory</Title> <Authors> <Author>Richard J. Trudeau</Author> </Authors> <Date>1993</Date> <ISBN>0-486-67870-9</ISBN> <Publisher>Dover Publications</Publisher> </Book> <Book> <Title>Introduction to Formal Languages</Title> <Authors> <Author>Gyorgy E. Revesz</Author> </Authors> <Date>2012</Date> <ISBN>0-486-66697-2</ISBN>
</Book></Books>
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J.H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978-0-387-20248-8</ISBN> <Publisher>Springer</Publisher> </Book> <Book> <Title>Introduction to Graph Theory</Title> <Authors> <Author>Richard J. Trudeau</Author> </Authors> <Date>1993</Date> <ISBN>0-486-67870-9</ISBN> <Publisher>Dover Publications</Publisher> </Book> <Book> <Title>Introduction to Formal Languages</Title> <Authors> <Author>Gyorgy E. Revesz</Author> </Authors> <Date>2012</Date> <ISBN>0-486-66697-2</ISBN> <Publisher>Dover Publications</Publisher> </Book></Books>
The last step adds <Publisher>Dover Publications</Publisher> to the third Book.
Alternate view of the steps (a tree view)
[Figure: the same production steps drawn as trees. The tree starts as the single node Books and gains one node per step: Book, Title, Authors, Author, Author, Date, ISBN, Publisher, then a second Book, and so forth, filling in the second Book then the third Book. The last step adds the last Book’s Publisher, giving the final tree:]
Books
    Book
        Title
        Authors
            Author
            Author
        Date
        ISBN
        Publisher
    Book
        Title
        Authors
            Author
        Date
        ISBN
        Publisher
    Book
        Title
        Authors
            Author
        Date
        ISBN
        Publisher
Terminology: Production Step
<Books> </Books>
<Books> <Book> </Book> </Books>
<Books> <Book> <Title>Parsing Techniques</Title> </Book> </Books>
<Books> <Book> <Title>Parsing Techniques</Title> <Authors> </Authors> </Book> </Books>
Each step is called a production step
Top down
The previous slides showed the generation of the structured XML by starting from the top (root element) down to the bottom (leaf nodes).
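The top-down production of the first Book can be replayed in code. This is only a sketch under assumptions: the STEPS list and the produce function are hypothetical names, each entry is one production step (parent element, new element, text), and Python's standard xml.etree.ElementTree module stands in for whatever tool you would actually use.

```python
import xml.etree.ElementTree as ET

# One entry per production step: (parent element name, new element name, text).
# The first entry creates the root, so it has no parent.
STEPS = [
    (None, "Books", None),
    ("Books", "Book", None),
    ("Book", "Title", "Parsing Techniques"),
    ("Book", "Authors", None),
    ("Authors", "Author", "Dick Grune"),
    ("Authors", "Author", "Ceriel J.H. Jacobs"),
    ("Book", "Date", "2007"),
    ("Book", "ISBN", "978-0-387-20248-8"),
    ("Book", "Publisher", "Springer"),
]

def produce(steps):
    """Apply the steps in order and return the serialized document after each."""
    latest = {}      # most recently created element for each name
    root = None
    snapshots = []
    for parent_name, name, text in steps:
        if parent_name is None:
            node = root = ET.Element(name)
        else:
            node = ET.SubElement(latest[parent_name], name)
        node.text = text
        latest[name] = node
        snapshots.append(ET.tostring(root, encoding="unicode"))
    return snapshots

for snapshot in produce(STEPS):
    print(snapshot)
```

Each printed line corresponds to one of the production steps shown on the earlier slides, from the empty Books element down to the first Book's Publisher.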
Bottom-up parsing
In bottom-up parsing we work backward: from the last step to the first step.
Let’s begin …
• One production step must have been the last, and its result must be visible in the linear representation.
• We recognize the rule Publisher → text in the last element of the flat XML, <Publisher>Dover Publications</Publisher>.
This gives us the final step in the production process (and the first step in bottom-up parsing):
<Title>Parsing Techniques</Title><Author>Dick Grune</Author><Author> Ceriel J.H. Jacobs</Author><Date>2007</Date><ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher><Title>Introduction to Graph Theory</Title><Author>Richard J. Trudeau</Author><Date>1993</Date><ISBN>0-486-67870-9</ISBN><Publisher>Dover Publications</Publisher><Title>Introduction to Formal Languages</Title><Author>Gyorgy E. Revesz</Author><Date>2012</Date><ISBN>0-486-66697-2</ISBN><Publisher>Dover Publications</Publisher>
Next
We recognize the rule ISBN → text in the last ISBN element, <ISBN>0-486-66697-2</ISBN>.
This gives us the next-to-last step in the production process (and the second step in bottom-up parsing):
<Title>Parsing Techniques</Title><Author>Dick Grune</Author><Author> Ceriel J.H. Jacobs</Author><Date>2007</Date><ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher><Title>Introduction to Graph Theory</Title><Author>Richard J. Trudeau</Author><Date>1993</Date><ISBN>0-486-67870-9</ISBN><Publisher>Dover Publications</Publisher><Title>Introduction to Formal Languages</Title><Author>Gyorgy E. Revesz</Author><Date>2012</Date><ISBN>0-486-66697-2</ISBN><Publisher>Dover Publications</Publisher>
Next
We recognize the rule Date → text in the last Date element, <Date>2012</Date>.
This gives us the third step in bottom-up parsing:
<Title>Parsing Techniques</Title><Author>Dick Grune</Author><Author> Ceriel J.H. Jacobs</Author><Date>2007</Date><ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher><Title>Introduction to Graph Theory</Title><Author>Richard J. Trudeau</Author><Date>1993</Date><ISBN>0-486-67870-9</ISBN><Publisher>Dover Publications</Publisher><Title>Introduction to Formal Languages</Title><Author>Gyorgy E. Revesz</Author><Date>2012</Date><ISBN>0-486-66697-2</ISBN><Publisher>Dover Publications</Publisher>
Next
We recognize the rule Author → text in the last Author element, <Author>Gyorgy E. Revesz</Author>.
This gives us the fourth step in bottom-up parsing:
<Title>Parsing Techniques</Title><Author>Dick Grune</Author><Author> Ceriel J.H. Jacobs</Author><Date>2007</Date><ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher><Title>Introduction to Graph Theory</Title><Author>Richard J. Trudeau</Author><Date>1993</Date><ISBN>0-486-67870-9</ISBN><Publisher>Dover Publications</Publisher><Title>Introduction to Formal Languages</Title><Author>Gyorgy E. Revesz</Author><Date>2012</Date><ISBN>0-486-66697-2</ISBN><Publisher>Dover Publications</Publisher>
Next
We recognize the rule Authors → Author+ in the last Author element, which we wrap in an Authors element.
This gives us the fifth step in bottom-up parsing:
<Title>Parsing Techniques</Title><Author>Dick Grune</Author><Author> Ceriel J.H. Jacobs</Author><Date>2007</Date><ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher><Title>Introduction to Graph Theory</Title><Author>Richard J. Trudeau</Author><Date>1993</Date><ISBN>0-486-67870-9</ISBN><Publisher>Dover Publications</Publisher><Title>Introduction to Formal Languages</Title><Authors> <Author>Gyorgy E. Revesz</Author></Authors><Date>2012</Date><ISBN>0-486-66697-2</ISBN><Publisher>Dover Publications</Publisher>
Next
We recognize the rule Title → text in the last Title element, <Title>Introduction to Formal Languages</Title>.
This gives us the sixth step in bottom-up parsing:
<Title>Parsing Techniques</Title><Author>Dick Grune</Author><Author> Ceriel J.H. Jacobs</Author><Date>2007</Date><ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher><Title>Introduction to Graph Theory</Title><Author>Richard J. Trudeau</Author><Date>1993</Date><ISBN>0-486-67870-9</ISBN><Publisher>Dover Publications</Publisher><Title>Introduction to Formal Languages</Title><Authors> <Author>Gyorgy E. Revesz</Author></Authors><Date>2012</Date><ISBN>0-486-66697-2</ISBN><Publisher>Dover Publications</Publisher>
Next
We recognize the rule Book → Title Authors Date ISBN Publisher in the last five items (Title, Authors, Date, ISBN, Publisher), which we wrap in a Book element.
This gives us the seventh step in bottom-up parsing:
<Title>Parsing Techniques</Title><Author>Dick Grune</Author><Author> Ceriel J.H. Jacobs</Author><Date>2007</Date><ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher><Title>Introduction to Graph Theory</Title><Author>Richard J. Trudeau</Author><Date>1993</Date><ISBN>0-486-67870-9</ISBN><Publisher>Dover Publications</Publisher><Book> <Title>Introduction to Formal Languages</Title> <Authors> <Author>Gyorgy E. Revesz</Author> </Authors> <Date>2012</Date> <ISBN>0-486-66697-2</ISBN> <Publisher>Dover Publications</Publisher></Book>
See the algorithm?
See how we are working backwards, from the bottom grammar rules up to the starting grammar rule? In the process we are adding structure to the flat (linear) XML – neat!
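The backward walk of the previous slides can be sketched in Python. This is not the XSLT implementation mentioned later in these slides; it is a simplified illustration under assumptions: parse_bottom_up and BOOK_RHS are hypothetical names, the XML parser itself takes care of the leaf rules (Title → text and so on), and the code only handles the Books grammar.

```python
import xml.etree.ElementTree as ET

BOOK_RHS = ["Title", "Authors", "Date", "ISBN", "Publisher"]

def parse_bottom_up(flat_xml):
    """Structure a flat Books document, working from its last item backward."""
    items = list(ET.fromstring(flat_xml))  # the linear representation
    reduced = []                           # items already reduced, in document order
    while items:
        item = items.pop()                 # rightmost item not yet processed
        if item.tag == "Author":
            # Reduce with Authors → Author+ : take the whole run of Authors.
            run = [item]
            while items and items[-1].tag == "Author":
                run.insert(0, items.pop())
            item = ET.Element("Authors")
            item.extend(run)
        reduced.insert(0, item)
        # Reduce with Book → Title Authors Date ISBN Publisher.
        if [e.tag for e in reduced[:5]] == BOOK_RHS:
            book = ET.Element("Book")
            book.extend(reduced[:5])
            reduced[:5] = [book]
    # Reduce with Books → Book+ ; anything else left over is an error.
    if not reduced or any(e.tag != "Book" for e in reduced):
        raise ValueError("input does not conform to the Books grammar")
    books = ET.Element("Books")
    books.extend(reduced)
    return books
```

Calling parse_bottom_up on the flat Books document and passing the result to ET.tostring gives the structured document; an input that cannot be reduced to Book elements raises ValueError.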
Terminology: Reduction
• In bottom-up parsing a collection of symbols is recognized as derived from a symbol. For example, Title, Authors, Date, ISBN, Publisher is derived from Book:
Book
    Title Authors Date ISBN Publisher
• Title, Authors, Date, ISBN, Publisher is reduced to Book.
• So the bottom-up parsing process is a reduction process.
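A single reduction can be shown on plain lists of symbol names. The function below is an illustrative sketch (reduce_rightmost is a hypothetical name, not something from the slides): it replaces the rightmost occurrence of a rule's right-hand side with the rule's left-hand side.

```python
def reduce_rightmost(symbols, lhs, rhs):
    """Replace the rightmost occurrence of rhs in symbols with lhs.

    Returns the new list of symbols, or None if rhs does not occur.
    """
    n = len(rhs)
    for start in range(len(symbols) - n, -1, -1):
        if symbols[start:start + n] == rhs:
            return symbols[:start] + [lhs] + symbols[start + n:]
    return None

book_rhs = ["Title", "Authors", "Date", "ISBN", "Publisher"]
print(reduce_rightmost(book_rhs, "Book", book_rhs))  # prints ['Book']
```

Applying such reductions repeatedly, until only the start symbol Books remains, is the reduction process the slide describes.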
Build your own bottom-up parser!
You now have enough knowledge that you can go off and build your own bottom-up parser.
I implemented a bottom-up parser
• I used XSLT to implement a bottom-up parser.
• If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document:
    • http://www.xfront.com/parsing-techniques/bottom-up-parser/bottom-up-parser-for-Books.xsl
    • http://www.xfront.com/parsing-techniques/bottom-up-parser/Books.xml