A Unified Model for XQuery Evaluation over XML Data Streams
Jinhui Jian
Hong Su
Elke A. Rundensteiner
Worcester Polytechnic Institute
ER 2003
Need for Stream Processing
New environment
• Data source is everywhere
• Data request is everywhere
New applications
• Sensor networks
• Analysis of XML web logs
• Selective dissemination of XML information (e.g., news)
New features
• On-line arriving data
• Potentially unstable data
• Real-time response requirement
• Scalability requirement
Specific Challenges for XML Streams
Pattern retrieval on nested data
+ filtering/restructuring
FOR $b in doc (bib.xml) //book
LET $p := $b/price,
    $t := $b/title
WHERE $p > 50
RETURN <expensive> $t </expensive>
<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price> 65.95</price>
</book>
…
Token-by-Token access manner
timeline
<bib> <book> <title> TCP/IP Illustrated </title> …
A token can be an open tag, a close tag, or PCDATA; it is not a direct counterpart of a tuple
Observations and Questions
Observations
• Pattern retrieval: the automata model has long been studied for pattern retrieval on tokens
• Filtering/restructuring: the algebraic model has long been studied for optimizing query plans on tuples
Questions
• How to integrate the two models?
• How to optimize a query within the integrated query model?
A Running Example
Give me book titles whose price is greater than 50:
FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN <expensive> $b/title </expensive>
<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price> 65.95</price>
</book>
<book year="2000">
<title>Languages and Machines</title>
<author><last>Sudkamp</last><first>T.</first></author>
<publisher> Addison-Wesley </publisher>
<price>39.95</price>
</book>
…
</bib>
<expensive> <title>TCP/IP Illustrated</title> </expensive> …
timeline
<bib> <book> <title> TCP/IP Illustrated </title> <author> <last> Stevens</last> …</book>…
Input XML stream
Automata Computation: NFAs + Buffers
FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title
[Figure: NFA with states 1, 2, 3 and 4 and transitions labeled *, book, title and price]
Buffer for title: <title>TCP/IP Illustrated</title>
Buffer for price: <price>65.95</price>
input (t0 to t7): <bib> <book> <title> TCP/IP Illustrated </title> <price> 65.95 </price> …
active states: +0 +1 +1,2 +1,4 -1,4 +1,3 …
stack snapshots, in order:
[0]
[0] [1]
[0] [1] [1,2]
[0] [1] [1,2] [1,4]
[0] [1] [1,2]
[0] [1] [1,2] [1,3]
…
• No materialization needed
• Multiple patterns resolved in one pass
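The stack-and-buffer behavior on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the Raindrop implementation: the transition table is inferred from the active-state trace (any element moves 0 to 1, state 1 loops on any element, book leads to 2, title to 4, price to 3), and the names `run_nfa`, `TOKENS`, `WILDCARD` and `BUFFERED`, as well as the token tuples, are hypothetical.

```python
# Illustrative sketch only; the transition table is inferred from the trace.
TRANSITIONS = {(1, "book"): 2, (2, "title"): 4, (2, "price"): 3}
WILDCARD = {0: 1, 1: 1}              # '*' moves: 0 -> 1 and the self-loop on 1
BUFFERED = {4: "title", 3: "price"}  # states whose elements are buffered

TOKENS = [("open", "bib"), ("open", "book"),
          ("open", "title"), ("text", "TCP/IP Illustrated"), ("close", "title"),
          ("open", "price"), ("text", "65.95"), ("close", "price"),
          ("close", "book"), ("close", "bib")]


def run_nfa(tokens):
    """Run the NFA over a token stream; return the buffers and the trace."""
    stack = [{0}]                    # one set of active states per open element
    buffers = {"title": [], "price": []}
    collecting = []                  # buffers currently receiving tokens
    trace = []
    for kind, value in tokens:
        if kind == "open":
            active = set()
            for state in stack[-1]:
                if state in WILDCARD:
                    active.add(WILDCARD[state])
                if (state, value) in TRANSITIONS:
                    active.add(TRANSITIONS[(state, value)])
            stack.append(active)     # push on an open tag
            for state in active:
                if state in BUFFERED:
                    buffers[BUFFERED[state]].append("")
                    collecting.append(BUFFERED[state])
            text = "<%s>" % value
        elif kind == "close":
            text = "</%s>" % value
        else:
            text = value
        for name in collecting:      # copy the token into every open buffer
            buffers[name][-1] += text
        if kind == "close":
            for state in stack.pop():  # pop on a close tag
                if state in BUFFERED:
                    collecting.remove(BUFFERED[state])
        trace.append(sorted(stack[-1]))
    return buffers, trace


BUFFERS, TRACE = run_nfa(TOKENS)
```

Both patterns are resolved in the same pass, and only the title and price fragments are ever stored; the enclosing book element is never materialized.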
Algebraic Computation
FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title
Extract //book
Navigate //book, price
Select price > 50
Tagger
Navigate //book, title
[Figure: three materialized book element trees; each book has title, author (last, first), publisher and price children with Text leaves]
• Selection push-down enabled
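The operator chain on this slide can be imitated over materialized elements. The sketch below uses Python's standard `xml.etree.ElementTree`; the function names `extract`, `navigate`, `select`, `tagger` and `algebraic_plan`, the tuple layout, and the shortened `BIB` document are assumptions made for illustration, and every book is assumed to have a price and a title.

```python
import xml.etree.ElementTree as ET

# Shortened sample document (assumption: author and publisher omitted).
BIB = ("<bib>"
       "<book><title>TCP/IP Illustrated</title><price>65.95</price></book>"
       "<book><title>Languages and Machines</title><price>39.95</price></book>"
       "</bib>")


def extract(root, tag):
    """Extract //tag: materialize each matching element as a 1-column tuple."""
    return [(elem,) for elem in root.iter(tag)]


def navigate(tuples, child):
    """Navigate: append the first `child` of column 0 to each tuple."""
    return [tup + (tup[0].find(child),) for tup in tuples]


def select(tuples, pred):
    """Select: keep the tuples that satisfy the predicate."""
    return [tup for tup in tuples if pred(tup)]


def tagger(tuples, tag):
    """Tagger: wrap the title in the last column in a new element."""
    return ["<%s><title>%s</title></%s>" % (tag, tup[-1].text, tag)
            for tup in tuples]


def algebraic_plan(xml):
    books = extract(ET.fromstring(xml), "book")
    priced = navigate(books, "price")
    expensive = select(priced, lambda tup: float(tup[1].text) > 50)
    titled = navigate(expensive, "title")   # runs only on surviving tuples
    return tagger(titled, "expensive")


RESULT = algebraic_plan(BIB)
```

Placing `select` before the second `navigate` is the selection push-down the slide refers to: titles are looked up only for books that pass the price test.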
The Raindrop Approach
• Uniform: automata computation is modeled in an algebraic manner
• Tight coupling: automata and regular tuple-based computation are interchangeable
Path Bindings in XQuery
FOR $b in doc (bib.xml) //book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t
FLWR expression: FOR … LET … WHERE … RETURN …
FOR and LET clauses: path bindings
WHERE and RETURN clauses: filtering and restructuring
“The purpose of path bindings is to produce a tuple stream in which each tuple consists of one or more bound variables” [W3C]
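As a rough picture of that tuple stream, the sketch below binds `$b`, `$p` and `$t` for the query above and then applies WHERE and RETURN to the resulting tuples. It is illustrative Python over a shortened sample document; `path_bindings` is a hypothetical name, and binding `$p` and `$t` to the text of the first matching child is a simplification of XQuery's sequence-valued LET.

```python
import xml.etree.ElementTree as ET

# Shortened sample document (assumption: author and publisher omitted).
BIB = ("<bib>"
       "<book><title>TCP/IP Illustrated</title><price>65.95</price></book>"
       "<book><title>Languages and Machines</title><price>39.95</price></book>"
       "</bib>")


def path_bindings(root):
    """Yield one tuple of bound variables per book (the FOR and LET clauses)."""
    for book in root.iter("book"):               # FOR $b in //book
        yield {"$b": book,
               "$p": book.findtext("price"),     # LET $p := $b/price
               "$t": book.findtext("title")}     # LET $t := $b/title


TUPLES = list(path_bindings(ET.fromstring(BIB)))
# WHERE $p > 50 RETURN $t, evaluated on the tuple stream
TITLES = [tup["$t"] for tup in TUPLES if float(tup["$p"]) > 50]
```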
Modeling the Automata Plan: Black Box [xscan] vs. White Box
Black Box
Automata Plan
Q1 := //book
Q2 := //book/price
Q3 := //book/title
White Box
SJoin //book
Extract //book/price
Extract //book/title
A Unified Process at the Logical View
Select //book/price > 50
Navigate //book, //book/title
SJoin //book
Extract //book/price
Extract //book/title
The Algebra Core
Op | Symbol subscript | Semantics
Relational-like operators
Selection | pred | Filter tuples based on the predicate pred
Projection | v | Filter columns in the input tuples based on the variable list v
Join | pred | Join input tuples based on the predicate pred
Aggregate | f | Aggregate over input tuples with the aggregate function f, e.g., sum and average
XML-specific operators
Tagger | pt (symbol T) | Format outputs based on the pattern pt, i.e., reconstruct XML tags
Navigate | p1, p2 | Take input elements of path p1 and output ancestor elements of path p2
Extract | p | Identify elements of path p from the input stream
Structural Join | p | Join input tuples on their structural relationship, e.g., the common parent relationship p
The Extract Operator
Extract //book/title
[Figure: automaton with transitions *, book and title]
Input: <bib> <book> <title> TCP/IP Illustrated </title> … </book>…
Output:
<title> TCP/IP Illustrated </title>
<title> Data on the Web </title>
<title>Advanced Programming in the Unix environment</title>
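The Extract operator's behavior on a token stream can be sketched as follows. This is illustrative Python, not the authors' code: a stack of open tag names stands in for the automaton, and `tokenize`, `extract`, `FORMAT` and the two-book `STREAM` are hypothetical names and data.

```python
import re

FORMAT = {"open": "<%s>", "close": "</%s>", "text": "%s"}


def tokenize(xml):
    """Split an XML string into (kind, value) tokens; attributes are dropped."""
    for close, tag, text in re.findall(r"<(/?)(\w+)[^>]*>|([^<]+)", xml):
        if tag:
            yield ("close" if close else "open", tag)
        elif text.strip():
            yield ("text", text.strip())


def extract(tokens, parent, child):
    """Yield each complete child element that sits directly under a parent."""
    path, start, current = [], None, ""
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
            if start is None and path[-2:] == [parent, child]:
                start, current = len(path), ""   # begin collecting tokens
        if start is not None:
            current += FORMAT[kind] % value
        if kind == "close":
            if start == len(path):               # the collected element closed
                yield current
                start = None
            path.pop()


STREAM = ("<bib><book><title>TCP/IP Illustrated</title></book>"
          "<book><title>Data on the Web</title></book></bib>")
TITLES = list(extract(tokenize(STREAM), "book", "title"))
```

Each title is emitted as soon as its close tag arrives, without waiting for the end of the enclosing book or of the stream.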
The Structural Join Operator
[Figure: automaton with states 1, 2, 3 and 4 and transitions *, book, title and price]
FOR $b in doc (bib.xml) //book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t
Input: <bib> <book> <title> TCP/IP Illustrated </title> … </book>… <book>… </book>
Extract //book/title output:
<title>…</title>
<title>…</title>
Extract //book/price output:
<price>…</price>
<price>…</price>
SJoin //book output:
<title>…</title> <price>…</price>
<title>…</title> <price>…</price>
Tight coupling
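One way to picture the structural join is a single pass that buffers title and price content and joins the buffers whenever a book element closes. The sketch below is illustrative Python under assumptions: it folds the two Extract operators into the buffers, keeps only text content, assumes book elements do not nest, and uses the hypothetical names `tokenize`, `sjoin_book` and `STREAM`.

```python
import re


def tokenize(xml):
    """Split an XML string into (kind, value) tokens; attributes are dropped."""
    for close, tag, text in re.findall(r"<(/?)(\w+)[^>]*>|([^<]+)", xml):
        if tag:
            yield ("close" if close else "open", tag)
        elif text.strip():
            yield ("text", text.strip())


def sjoin_book(tokens):
    """Join buffered titles and prices that share the same book parent."""
    path = []
    buffers = {"title": [], "price": []}
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
        elif kind == "text":
            # buffer text of title and price elements directly under a book
            if path[-2:-1] == ["book"] and path[-1] in buffers:
                buffers[path[-1]].append(value)
        else:
            path.pop()
            if value == "book":                  # common parent has closed
                for title in buffers["title"]:
                    for price in buffers["price"]:
                        yield (title, price)
                buffers = {"title": [], "price": []}


STREAM = ("<bib>"
          "<book><title>TCP/IP Illustrated</title><price>65.95</price></book>"
          "<book><title>Languages and Machines</title><price>39.95</price></book>"
          "</bib>")
PAIRS = list(sjoin_book(tokenize(STREAM)))
```

Because the join fires on the close tag of the common parent, the buffers never hold more than one book's worth of titles and prices.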
The Navigate Operator
<book year="1994"> <title>TCP/IP Illustrated</title> <author> <last> Stevens </last> <first> W. </first> </author> <publisher> Addison-Wesley </publisher> <price> 65.95 </price> </book>
<book>… … </book>
<book>… … </book>
<book>… … </book> <title>…</title>
<book>… … </book> <title>…</title>
<book>… … </book> <title>…</title>
Navigate //book, title
In or Out?
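[Figure: an XML data stream enters the automata plan, which passes a tuple stream to the regular algebraic plan, which produces the query answer; the question is in which of the two plans pattern retrieval belongs]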
Pattern Retrieval Alternatives
In Automata (/title, /price)
[Figure: automaton with states 1, 2, 3 and 4 and transitions *, book, title and price]
<title>…</title> <price>…</price>
<title>…</title> <price>…</price>
<price>…</price>
<price>…</price>
<title>…</title>
<title>…</title>
Out of Automata (/title, /price)
[Figure: automaton with states 1 and 2 and transitions * and book]
<book year="1994"> <title>TCP/IP Illustrated</title> <author> <last> Stevens </last> <first> W. </first> </author> <publisher> Addison-Wesley </publisher> <price> 65.95 </price> </book>
<book>… … </book>
<book>… … </book>
<book>… … </book> <title>…</title>
<book>… … </book> <title>…</title>
<book>… … </book> <title>…</title> <price>…</price>
<book>… … </book> <title>…</title> <price>…</price>
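Plan Alternatives
The pull-out plan (automaton: states 1 and 2; transitions * and book), in data-flow order:
Extract //book
Navigate //book, price
Select price > 50
Navigate //book, title
Tagger
The push-in plan (automaton: states 1, 2, 3 and 4; transitions *, book, title and price), in data-flow order:
Extract //book/price
Extract //book/title
SJoin //book
Select //book/price > 50
Tagger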
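The two plans should return the same answer while doing their pattern retrieval in different places. The sketch below contrasts them in illustrative Python using the standard `xml.etree.ElementTree`: `pull_out_plan` materializes whole book elements and navigates inside them, while `push_in_plan` retrieves title and price during a single streaming pass and joins them when the book closes. Both function names and the shortened `BIB` document are assumptions, the Tagger step is omitted, and neither function is the authors' implementation.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample document (assumption: author and publisher omitted).
BIB = ("<bib>"
       "<book><title>TCP/IP Illustrated</title><price>65.95</price></book>"
       "<book><title>Languages and Machines</title><price>39.95</price></book>"
       "</bib>")


def pull_out_plan(xml):
    """Retrieve only //book up front; title and price are navigated later."""
    books = ET.fromstring(xml).iter("book")               # Extract //book
    priced = ((b, b.find("price")) for b in books)        # Navigate //book, price
    kept = (b for b, p in priced
            if p is not None and float(p.text) > 50)      # Select price > 50
    return [b.findtext("title") for b in kept]            # Navigate //book, title


def push_in_plan(xml):
    """Retrieve title and price while streaming; join them per book."""
    answers, path, buf = [], [], {}
    events = ET.iterparse(io.BytesIO(xml.encode()), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            path.append(elem.tag)
            continue
        if path[-2:] in (["book", "title"], ["book", "price"]):
            buf.setdefault(elem.tag, []).append(elem.text)  # the two Extracts
        path.pop()
        if elem.tag == "book":                              # SJoin //book
            answers += [t for t in buf.get("title", [])
                        for p in buf.get("price", [])
                        if float(p) > 50]                   # Select price > 50
            buf = {}
    return answers


PULL_OUT = pull_out_plan(BIB)
PUSH_IN = push_in_plan(BIB)
```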
Camp 1: Complete Automata Model [XSQ, XSM, XPush]
• All details on the same level
• Hard to understand
• Not suitable for optimizing at different levels
• Using automata as a query processing paradigm is little studied
FOR $x in $R/a RETURN
FOR $y in $x/b RETURN
<res> $y, $x </res>
[Figure: automaton for this query in the complete automata model, with ten states labeled by triples such as 0,0,0 and 2,2,1 and guarded transitions of the form condition|actions, for example *r=<b>|w(x,<b>),r++]
Camp 2: Automata-Algebra Loosely Coupled Model [Tukwila, YFilter]
• Fixed interface for automata computation (all pattern retrieval pushed down)
• No opportunity of pushing computation into or pulling computation out of the automata
• Bloated, black-box operator
• Algebraic rewriting impossible for internal optimization
Automata Plan
$b := //book
$p := //book/price
$t := //book/title
$b $p $t
Contribution
• Automata and algebra modeled in one framework, allowing a uniform logical view
• Query rewriting provides the opportunity to push computation into the automata and to pull it out
• Necessity of optimization verified by experiments