Dan Suciu XML Toolkit 1
From Searching Text to Querying XML Streams
Dan Suciu
www.cs.washington.edu/homes/suciu
About Me
• Born 1957, Romania
• BS: Bucharest, PhD: University of Pennsylvania
• Now: University of Washington (Seattle)
My work is on semistructured data
• Book: Data on the Web: From Relations to Semistructured Data and XML
Past/present projects:
• XML-QL = precursor of XQuery
• XMill = the XML compressor
• XML toolkit
Motivation
• Text databases
  – Studied over the past 15 years
  – Traditional client/server model
  – Struggled with lack of standard text syntax
• Recently, new standard: XML
  – Traditional client/server: in today’s DBMS
  – New applications: stream processing
• This talk: processing streaming XML data
  – My motivation: work on the XML Toolkit project
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
Background: Relational Databases
• Structured, stored in tables
• Schema separate from data
• Queries: precise, refer to schema and data (SQL)
BOOKS:
ISBN        Title                     Year  Publisher
0201537710  Foundations of Databases  1995  AW
155860622X  Data on the Web           1999  MK

AUTHOR:
AID  Name       Country
44   Abiteboul  FR
06   Buneman    UK
62   Hull       USA
12   Suciu      USA
29   Vianu      USA

WROTE:
ISBN        AID
0201537710  44
0201537710  62
0201537710  29
155860622X  44
155860622X  06
155860622X  12

Hard to publish, easy to query precisely
Background: Text Databases
• Unstructured, stored in documents
• No schema, only data
• Queries: imprecise, refer to data only (keywords)
Foundations of Databases,
Abiteboul (FR), Hull (USA), Vianu (USA)
Addison Wesley,
1995

Data on the Web
Abiteboul (FR), Buneman (UK), Suciu (USA)
Morgan Kaufmann,
1999

Easy to publish, hard to query precisely
Background: XML Data
• Semistructured
• Schema and data are together: self-describing
• Queries: precise, refer to schema and data (XPath, XQuery)
<bib>
  <book>
    <title> Foundations… </title>
    <author> <name> Abiteboul </name> <country> FR </country> </author>
    <author> <name> Hull </name> <country> USA </country> </author>
    <author> <name> Vianu </name> <country> USA </country> </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bib>
XML: Easier to publish, easy to query precisely
Background: XML Data
[Figure: the bibliography drawn as a tree rooted at bib. Node labels include book, paper, title, author, publisher, journal, name, and country; leaf values include "Data on the Web", "Abiteboul", "FR", "Buneman", "UK", and "Addison Wesley".]
Data model = tree
Background: XML Data
• Querying with XPath (and XQuery)
• This talk: XPath queries restricted to:
  tag    /    //    *    [ ]    path="constant"
Background: XPath in One Slide
/bib/book[author/name="Abiteboul"]
/bib/book[year="1995" and author[name="Abiteboul" and country="FR"]]
/bib/book/author/name
/bib/book//name/*/zip
Constructs and what they give:
• tag, / : this is precisely the “region algebra”. E.g. use proximal nodes [Navarro&Baeza-Yates’97]
• //, * : navigate partially known structure
• [ ] : conjunctive queries à la SQL
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
Main Application: XML Packet Routing
• Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.02]
• XML content routing [Snoeren et al.01]
• SOAP Message routing in Application Servers
XML Packet Routing
[Figure: many XML packets being routed, each of the form <doc> <tag> value </tag> </doc>]
XPath expressions:
/bib/book/publisher="MK"
/bib/book[category="recent"]/title="Web"
/bib/book//address//*/zip="123"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="123"
/bib/book/address/field="567"
Input XML stream: <bib> <book>...</bib>
Output XML streams
The XML Stream Processing Problem
Given:
• A set of XPath expressions
• An incoming stream of XML documents
Decide:
• For each document, which expressions it matches
Hard:
• Large number of XPath expressions, e.g. 10^3 - 10^6
• Streaming XML data, high throughput, e.g. 5MB/s
Easy:
• Shallow XML data, e.g. depth=20
• Short XPath expressions
The Approaches
Basic techniques
• NFA plus optimizations:
  – XFilter/YFilter [Altinel&Franklin’00]
  – XTrie [Chan et al.02]
• DFA:
  – XML Toolkit
Beyond the obvious
• Stream indexes (XML Toolkit)
• Stream views
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
From XPath to NFA
/catalog/product[category="tools"][*/price = 200]/quantity//price
Extra processing needed to combine branches (not in this talk)
[Figure: the NFA for this expression; its labels are catalog, product, category, price, quantity, "tools", 200, *, price, *]
Basic NFA Evaluation
[Figure: the XPath expressions (/bib/book/publisher="MK", /bib/book[category="recent"]/title, /bib/book//address//*/zip="123", /bib/book//address//*="Galaxy", . . .) are turned into an NFA. SAX events from the input <bib> <book>...</bib> drive it, and a STACK holds the sets of current states, e.g. {1,55,99,...}, {2,3,543,43,254}, {3,66,102,4534,...}.]
Basic NFA Evaluation
Properties:
• Space = linear in the number of XPath expressions
• Throughput = decreases linearly as the number of XPath expressions grows
Systems:
• XFilter [Altinel&Franklin’00], YFilter
• XTrie [Chan et al.’02]
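A minimal sketch of the stack-of-state-sets evaluation, not the XFilter/YFilter or XTrie code. It assumes linear XPath only (tag, /, //, *; no predicates), and the names `parse_path`, `NFAFilter`, and `match` are made up for illustration. An NFA state is a pair (query index, number of steps already matched).

```python
import re
import xml.sax

def parse_path(xpath):
    """Split a linear XPath such as /a//b/* into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", xpath)

class NFAFilter(xml.sax.ContentHandler):
    """Runs all queries at once; a state is (query index, steps consumed)."""

    def __init__(self, xpaths):
        super().__init__()
        self.xpaths = list(xpaths)
        self.queries = [parse_path(x) for x in self.xpaths]
        self.matched = set()
        # Stack of active state sets; the bottom entry holds every initial state.
        self.stack = [{(q, 0) for q in range(len(self.queries))}]

    def startElement(self, name, attrs):
        nxt = set()
        for q, i in self.stack[-1]:
            steps = self.queries[q]
            if i == len(steps):      # accepting state: no outgoing moves
                continue
            axis, tag = steps[i]
            if axis == "//":         # self-loop: // may skip this element
                nxt.add((q, i))
            if tag == "*" or tag == name:
                nxt.add((q, i + 1))
                if i + 1 == len(steps):
                    self.matched.add(self.xpaths[q])
        self.stack.append(nxt)       # one state set per open element

    def endElement(self, name):
        self.stack.pop()             # back to the parent's state set

def match(xpaths, document):
    """Return the subset of xpaths matched by one XML document (bytes)."""
    handler = NFAFilter(xpaths)
    xml.sax.parseString(document, handler)
    return handler.matched
```

Each start tag costs time proportional to the size of the current state set, which is why throughput drops as more expressions are added.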
Basic DFA Evaluation
[Figure: the same XPath expressions are turned into DFAs. SAX events from the input <bib> <book>...</bib> drive them, and the STACK now holds a single current state per level, e.g. 1, 552, 399.]
Basic DFA Evaluation
Properties:
• Throughput = constant!
• Space = GOOD QUESTION
System:
• XML Toolkit [University of Washington] http://xmltk.sourceforge.net
XMLTK: An XML Toolkit for Scalable XML Stream Processing
I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka,
D. Raven, D. Suciu
Motivation
• Lots of data sits in large text files
  – ad hoc data formats
• “Queried” with Unix command line tools
  – grep, sort, tail, etc.
• Would be nice to XML-ize it...
• ...but then the Unix command line tools won’t work any more.
Example
• In the old Unix world…
Text file (columns: score, decision, paperID, title):
6 accept P054 “Theory of XML parsing”
7 reject P021 “Experience with an XML optimizer”
7 accept P069 “Towards a unified theory of data models”
. . . . . .
• Find the top ten rejected papers (in score order):
grep "reject" papers.txt | sort | tail -n 10
Example (cont’d)
• In the new XML world…
<submissions>
  <paper>
    <score> 6 </score>
    <decision> accept </decision>
    <paperID> P054 </paperID>
    <title> Theory of XML parsing </title>
  </paper>
  <paper>
    <score> 3 </score>
    <decision> reject </decision>
    <paperID> P021 </paperID>
    <title> Experience with an XML optimizer </title>
  </paper>
  . . . . .
… can’t use those tools anymore
Example (cont’d)
Doing it with the XML Toolkit:
xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml | xtail -c /submissions -e paper -n 10
Finds top ten rejected <paper>s, in <score> order
Goals of the XML Toolkit
Simple, scalable tools for XML processing
• Provides service: there are people who need this
• Provides a research platform: for XML stream processing
The Tools
Current tools:
• xsort (will talk only about this one)
• xagg
• xnest
• xflatten
• xdelete
• xpair
• xhead
• xtail
• file2xml
• xmill
May look like plenty, but actually still incomplete...
XSort: Definition
General form:
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
-c = the context, i.e. where to sort
-e = the item, i.e. what to sort
-k = the key, i.e. what to sort on
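To make the roles of context, item, and key concrete, here is a rough in-memory sketch of the simplest case (one -c, one -e, one -k). It is not the xmltk implementation: the function name `xsort` and its signature are made up, it uses Python's ElementTree, the context is given relative to the root element ("." for the root itself), and the key is a child tag such as `title` rather than `title/text()`, since ElementTree's XPath subset has no `text()`.

```python
import xml.etree.ElementTree as ET

def xsort(root, context, item, key):
    """In each context element, keep only the item children, sorted by key text.

    Children of a context that are not selected as items are dropped,
    as in the slides' description of the single -e case.
    """
    for ctx in root.findall(context):
        items = ctx.findall(item)
        # Missing keys sort first as the empty string (an assumption of this sketch).
        items.sort(key=lambda e: (e.findtext(key) or "").strip())
        ctx[:] = items
    return root

if __name__ == "__main__":
    bib = ET.fromstring(
        "<bib><paper><title>B</title></paper>"
        "<book><title>A</title></book>"
        "<paper><title>A</title></paper></bib>"
    )
    xsort(bib, ".", "paper", "title")
    print(ET.tostring(bib, encoding="unicode"))
```

This roughly corresponds to `xsort -c /bib -e paper -k title/text()`; unlike the real tool it loads the whole document into memory.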
XSort: Definition
[Figure: XSort applied to a tree with context nodes c and items e1 ... e9; in the output the items appear reordered under their context nodes (e4, e1, e3, e2, e6, e7, e5, e9, e8).]
XSort Examples
Examples illustrated on data like this:
<bib>
  <book>
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
    <title>XML in a Nutshell</title>
    <publisher>O'Reilly</publisher>
    <year>2001</year>
    <isbn>0-596-00058-8</isbn>
  </book>
  <paper>
    <author>Sylvain Devillers</author>
    <title>XML and XSLT Modeling for Multimedia Bitstream Manipulation.</title>
    <year>2001</year>
    <booktitle>WWW Posters</booktitle>
    <ee>http://www10.org/cdrom/posters/1112.pdf</ee>
    <url>db/conf/www/www2001p.html#Devillers01</url>
  </paper>
  . . . . .
XSort: Examples
xsort -c /bib -e paper -k title/text()
Sorts the <paper>s, by <title>. The <book>s are dropped from the output.
<bib>
  <paper> . . . </paper>
  <paper> . . . </paper>
  . . . . .
</bib>
Compare to…
xsort -c /bib -e * -k title/text()
xsort -c /bib -e paper -k title/text() -e book -k title/text()
XSort: Examples
xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s, by <lastName> then <firstName>
<bib>
  <author> . . . </author>
  <author> . . . </author>
  . . . . .
</bib>
XSort: Examples
xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest
<bib>
  <paper> . . . </paper>
  <paper> . . . </paper>
  . . . . .
  <article> . . . </article>
  . . . . .
  <book> . . . </book>
  . . . . .
</bib>
XSort: Examples
xsort -c /bib/* -e author -e title -e year -e *
Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged
XSort: Implementation
• Sorts one context at a time, copies the rest
• For each context:
  – Create a “global key” for each item
  – Sort items, with a two-pass, multiway merge sort
• Quote from Databases 101 (news from the trenches):
  – with disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes!
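A sketch of a two-pass, multiway merge sort and of the arithmetic behind the 4TB figure. It is a generic textbook version, not the xmltk code: `external_sort`, `two_pass_capacity`, and `run_size` are illustrative names, items are newline-free strings, and the capacity estimate ignores the block reserved for output. Pass 1 writes sorted runs that each fit in memory; pass 2 merges all runs at once, one buffered block per run.

```python
import heapq
import tempfile

def _write_run(items):
    """Sort one memory-sized chunk and spill it to a temporary file."""
    f = tempfile.TemporaryFile("w+")
    f.writelines(s + "\n" for s in sorted(items))
    f.seek(0)
    return f

def external_sort(items, run_size):
    """Pass 1: sorted runs of at most run_size items. Pass 2: one k-way merge."""
    runs, buf = [], []
    for item in items:
        buf.append(item)
        if len(buf) == run_size:
            runs.append(_write_run(buf))
            buf = []
    if buf:
        runs.append(_write_run(buf))
    streams = [(line.rstrip("\n") for line in f) for f in runs]
    result = list(heapq.merge(*streams))   # reads each run sequentially
    for f in runs:
        f.close()
    return result

def two_pass_capacity(memory_bytes, block_bytes):
    """Largest input sortable in two passes: run size times merge fan-in."""
    fan_in = memory_bytes // block_bytes   # one block of buffer per run
    return memory_bytes * fan_in

if __name__ == "__main__":
    print(external_sort(["d", "a", "c", "b", "e"], run_size=2))
    # 128MB / 4KB = 32768 runs of 128MB each = 4TB
    print(two_pass_capacity(128 * 2**20, 4 * 2**10) // 2**40, "TB")
```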
XSort: Performance
xsort -c /dblp -e * -k title/text()
Size (KB)    Xalan (sec)  Xsort (sec)
0.41         0.08         0.00
4.91         0.09         0.00
76.22        0.27         0.02
991.79       2.52         0.26
9671.79      27.45        2.85
100964.43    -            43.97
1009643.71   -            461.36
1GB: 8 minutes!
The XPath Processor
Common to all tools is the following problem:
Given:
• Set of correlated XPath expressions
• Stream of SAX events
Decide:
• When are the expressions true → variable events
The XPath Processor
How we did it:
• All XPath expressions → Deterministic Finite Automaton
  – Restriction: no predicates yet (current work...)
• Does this scale to many, many XPath expressions?
  – Yes, if we compute the DFA lazily (upcoming ICDT’2003 paper)
• Evaluation time = parsing time
• Can do even better with a Stream IndeX (next)
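A minimal sketch of what "compute the DFA lazily" can look like, under the same restriction as the slide (no predicates; only tag, /, //, *). It is an illustration, not the XML Toolkit code; `LazyDFA` and `run` are made-up names. A DFA state is a set of NFA states, and a transition is built by subset construction only the first time a (state, tag) pair is seen, then reused from the table.

```python
import re
import xml.sax

class LazyDFA(xml.sax.ContentHandler):
    """Lazy subset construction over linear XPath queries."""

    def __init__(self, xpaths):
        super().__init__()
        self.xpaths = list(xpaths)
        self.queries = [re.findall(r"(//|/)([^/]+)", x) for x in self.xpaths]
        start = frozenset((q, 0) for q in range(len(self.queries)))
        self.table = {}        # (DFA state, tag) -> (next state, accepted queries)
        self.stack = [start]   # exactly one DFA state per open element
        self.matched = set()

    def _expand(self, state, tag):
        """Subset construction for a single (state, tag) pair."""
        nxt, accepted = set(), set()
        for q, i in state:
            steps = self.queries[q]
            if i == len(steps):
                continue
            axis, want = steps[i]
            if axis == "//":
                nxt.add((q, i))
            if want in ("*", tag):
                nxt.add((q, i + 1))
                if i + 1 == len(steps):
                    accepted.add(self.xpaths[q])
        return frozenset(nxt), frozenset(accepted)

    def startElement(self, name, attrs):
        key = (self.stack[-1], name)
        if key not in self.table:          # only the first visit pays for expansion
            self.table[key] = self._expand(*key)
        state, accepted = self.table[key]
        self.stack.append(state)
        self.matched |= accepted

    def endElement(self, name):
        self.stack.pop()

def run(xpaths, document):
    dfa = LazyDFA(xpaths)
    xml.sax.parseString(document, dfa)
    return dfa
```

Only transitions that the data actually exercises are materialized, so the table tends to track the shape of the documents rather than the number of queries. A production version would number the states instead of hashing sets on every event.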
Stream IndeX (SIX)
Solution: “Index” the XML stream, parse only partially
Definition: The SIX = a table of (start, end) offsets
News: The parser is the main bottleneck in XPath stream processing!
Stream IndeX (SIX): Construction
XML:
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
SIX:
           start  end
bib        0      1490124
book       3      409023
publisher  12     423
author     426    879
author     978    . . .
. . .
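A small sketch of building such a table. The slides only say the SIX is a table of (start, end) offsets; everything else here is an assumption: offsets are byte offsets, `start` is the position of the `<` of the start tag, `end` is the position just past the matching end tag, and a regex stands in for a real tokenizer (no comments, CDATA, or processing instructions inside the body). `build_six` is an illustrative name.

```python
import re

# Simplified tag pattern: optional "/", a name, anything up to ">", optional "/".
TAG = re.compile(rb"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")

def build_six(data):
    """Return [(tag, start, end)] in document order for an XML byte string."""
    six = []
    open_stack = []   # indexes into six of elements that are still open
    for m in TAG.finditer(data):
        closing, name, self_closing = m.group(1), m.group(2).decode(), m.group(3)
        if closing:
            idx = open_stack.pop()
            tag, start, _ = six[idx]
            six[idx] = (tag, start, m.end())   # fill in the end offset
        elif self_closing:
            six.append((name, m.start(), m.end()))
        else:
            six.append((name, m.start(), None))
            open_stack.append(len(six) - 1)
    return six

if __name__ == "__main__":
    data = b"<bib><book><title>T</title></book><paper/></bib>"
    for row in build_six(data):
        print(row)
```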
Stream IndeX (SIX): Skip Parsing
XPath:
/bib/paper/title
. . .
XML:
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
  <paper>. . . . . .
</bib>
Skip Parsing: a <book> cannot match /bib/paper/title, so the parser skips it
Stream IndeX (SIX) in XML Stream Processing
[Figure: each XML document <bib> <book>...</bib> in the stream is accompanied by its SIX, a list of (start, end) pairs such as (0, 205), (30, 66), (72, 188), (90, 110), (95, 98). Labels: SIX, XML, (E.g. DIME).]
The SIX stream is about 6% of the data stream
And can be made MUCH smaller
Throughput improvements from SIX (stable)
[Figure: throughput (MB/s, axis 0-35) versus XML stream size (MB, axis 55-105), with and without SIX, for Theta = 3%, 8%, and 14%]
Effect of Decreasing the SIX Size
[Figure: throughput (MB/s, axis 0-30) and SIX size (in KB, log axis 1-10000) versus size of XML elements deleted (0k-10k)]
Conclusions
• The toolkit is already available:
  – http://www.cs.washington.edu/homes/suciu/XMLTK
  – http://xmltk.sourceforge.net
• What it does so far it does very well:
  – Sorting, aggregation, nest/unnest
• But doesn’t do too much:
  – Restricted selections, no projections, no restructurings yet
  – Volunteers welcome!
• Can one process XML data without parsing it completely?
  – SIX