Date post: | 19-Dec-2015 |
Querying Streaming XML Data
Layout of the presentation
Introduction
Common problems faced
Solution proposed
Basic building blocks of the solution
How to build up a solution to a given query
Features of the system
Streaming XML
XML is the standard for information exchange.
Some XML documents are only available in streaming format.
Streaming is like reading data from a tape drive.
Used in stock market data, news, and network statistics.
Predecessor systems were used to filter documents.
Structure of an XPath Query
Consists of a Location path and an Output Expression (name).
The location path consists of a closure axis (//), a node test (book), and a predicate (year>2000).
e.g. //book[year>2000]/name
Features of our Approach
Efficient
Easy-to-understand design
Design of BPDT is tricky
Common Problems faced
<root>
  <pub>
    <book id="1">
      <price> 12.00 </price>
      <name> First </name>
      <author> A </author>
      <price type="discount"> 10.00 </price>
    </book>
    <book id="2">
      <price> 14.00 </price>
      <name> Second </name>
      <author> A </author>
      <author> B </author>
      <price type="discount"> 12.00 </price>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: /pub[year=2002]/book[price<11]/author
Element satisfies the path
Failure??
Test passed. But year=2002?
Failed price<11. Remove
Buffer both A & B
Test passed. Output
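The buffering behaviour annotated above can be sketched in a few lines of Python. This is a hand-coded evaluator for this one query only, not the BPDT machinery described later; the function name `matching_authors`, the use of `xml.etree.ElementTree.iterparse` as the event source, and treating `<root>` as a wrapper around the path are all assumptions made for illustration. As a simplification, a book's authors are only moved out of the book buffer when the book ends.

```python
import io
import xml.etree.ElementTree as ET

DOC = b"""<root><pub>
<book id="1"><price>12.00</price><name>First</name><author>A</author>
<price type="discount">10.00</price></book>
<book id="2"><price>14.00</price><name>Second</name><author>A</author>
<author>B</author><price type="discount">12.00</price></book>
<year>2002</year></pub></root>"""

def matching_authors(stream):
    """Evaluate /pub[year=2002]/book[price<11]/author in one pass."""
    out, path = [], []
    pub_buffer, book_buffer = [], []   # candidates waiting on a predicate
    year_ok = price_ok = False
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            if path[1:] == ["pub"]:
                pub_buffer, year_ok = [], False
            elif path[1:] == ["pub", "book"]:
                book_buffer, price_ok = [], False
            continue
        steps = path[1:]               # path below the <root> wrapper
        text = (elem.text or "").strip()
        if steps == ["pub", "book", "author"]:
            book_buffer.append(text)   # satisfies the path, predicates unknown
        elif steps == ["pub", "book", "price"]:
            price_ok = price_ok or float(text) < 11
        elif steps == ["pub", "book"]:
            if price_ok:               # keep: output now or wait for year
                (out if year_ok else pub_buffer).extend(book_buffer)
            book_buffer = []           # failed price<11: candidates dropped
        elif steps == ["pub", "year"] and text == "2002":
            year_ok = True
            out.extend(pub_buffer)     # test passed: output buffered items
            pub_buffer = []
        path.pop()
    return out

print(matching_authors(io.BytesIO(DOC)))
```

Only the author of the first book is reported: both authors of the second book are buffered and then dropped when that book ends without a price below 11.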
Problems caused by closure axis
1. <root>
2. <pub>
3. <book>
4. <name> X </name>
5. <author> A </author>
6. </book>
7. <book>
8. <name> Y </name>
9. <pub>
10. <book>
11. <name> Z </name>
12. <author> B </author>
13. </book>
14. <year> 1999 </year>
15. </pub>
16. </book>
17. <year> 2002 </year>
18. </pub>
19. </root>
Query: //pub[year=2002]//book[author]//name
pub[year=2002]     book[author]
Line 2: True       Line 7: False
Line 2: True       Line 10: True
Line 9: False      Line 10: True
Fails year=2002
Passes year=2002
Problems caused by closure axis
1. <root>
2. <pub>
3. <book>
4. <name> X </name>
5. <author> A </author>
6. </book>
7. <book>
8. <name> Y </name>
9. <author> B </author>
10. <pub>
11. <book>
12. <name> Z </name>
13. <author> B </author>
14. </book>
15. <year> 1999 </year>
16. </pub>
17. </book>
18. <year> 2002 </year>
19. </pub>
20. </root>
Query: //pub[year=2002]//book[author]//name
Let's add an author (line 9). Result?
Handling XML Stream
Input: well-formed XML stream.
Use the SAX API to parse XML.
Events belong to:
Begin = {(a, attrs, d)}
End = {(/a, d)}
Text = {(a, text(), d)}
XML Stream: {e1, e2, …, ei, …} | ei ∈ Begin ∪ End ∪ Text
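As a rough illustration of the event model, the sketch below uses Python's standard `xml.sax` module to turn a document into begin, end, and text events that carry a depth. The class name `EventCollector`, the tuple layout, and counting the outermost element as depth 1 are assumptions for this example, not the system's actual representation.

```python
import xml.sax

class EventCollector(xml.sax.ContentHandler):
    """Collects (kind, tag, payload, depth) tuples from SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.stack = []  # names of open elements; its length is the depth

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.events.append(("begin", name, dict(attrs.items()), len(self.stack)))

    def characters(self, content):
        # SAX may deliver text in several chunks; whitespace-only chunks
        # are skipped here for brevity.
        text = content.strip()
        if text:
            self.events.append(("text", self.stack[-1], text, len(self.stack)))

    def endElement(self, name):
        self.events.append(("end", name, None, len(self.stack)))
        self.stack.pop()

def sax_events(xml_text):
    handler = EventCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events

for event in sax_events("<pub><year>2002</year></pub>"):
    print(event)
```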
Grammar for XPath Queries
Q → N+ [/O]
N → [/ | //] tag [F]
F → [FO [OP constant]]
FO → @attribute | tag [@attribute] | text()
O → @attribute | text()
OP → > | ≥ | = | < | ≤ | ≠ | contains
XPath query of the form N1N2…Nn/O
Can't handle reverse axes or positional functions.
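To make the query shape N1N2…Nn/O concrete, here is a minimal regular-expression sketch that splits a query into its location steps and optional output expression. The names `STEP`, `OUTPUT`, and `parse_query` are hypothetical, and predicates are returned as raw strings rather than parsed into FO, OP, and constant.

```python
import re

STEP = re.compile(
    r"(?P<axis>//|/)"              # child axis (/) or closure axis (//)
    r"(?P<tag>\w+)"                # node test
    r"(?:\[(?P<pred>[^\]]*)\])?"   # optional predicate, kept as raw text
)
OUTPUT = re.compile(r"/(text\(\)|@\w+)$")  # optional output expression O

def parse_query(query):
    """Return (steps, output); each step is (axis, tag, predicate or None)."""
    output = None
    m = OUTPUT.search(query)
    if m:
        output = m.group(1)
        query = query[:m.start()]
    steps, pos = [], 0
    while pos < len(query):
        m = STEP.match(query, pos)
        if m is None:
            raise ValueError(f"cannot parse location step at offset {pos}")
        steps.append((m.group("axis"), m.group("tag"), m.group("pred")))
        pos = m.end()
    return steps, output

print(parse_query("/pub[year>2000]/book[author]/name/text()"))
```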
Solution to Query
Query: /pub[year=2002]/book[price<11]/author
PDA PDT
Basic PushDown Transducer (BPDT)
Similar to a pushdown automaton
Actions defined on transition arcs
Finite set of states
A start state
A set of final states
Set of input symbols
Set of stack symbols
Book – Author: Buffer for future: Begin event of Author.
Book – Author: Remove from Buffer: End event of Book.
Book – Author: Output result if predicates true: Begin event of Author.
Building a BPDT
Query: /pub[year>2000]/book[author]/name/text()
Consider location step: /book[author]
Basic Building Blocks
XPath Expression: /tag[child]
Buffer operations needed:
Enqueue(x): Add x to the end of the queue.
Clear(): Removes all items from the queue.
Flush(): Outputs all items in the queue in FIFO order.
Upload(): Moves all items to the end of the queue of a parent BPDT.
No Dequeue operation needed.
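The four operations map naturally onto a simple queue. The sketch below is one possible rendering; the class name `BpdtBuffer`, the `parent` link, and the choice to empty the queue after a flush are assumptions, since the slides only name the operations.

```python
from collections import deque

class BpdtBuffer:
    """Queue of potential results held by one BPDT."""

    def __init__(self, parent=None):
        self.items = deque()
        self.parent = parent  # buffer of the parent BPDT, if any

    def enqueue(self, x):
        """Add x to the end of the queue."""
        self.items.append(x)

    def clear(self):
        """Remove all items: a predicate failed, so they are not results."""
        self.items.clear()

    def flush(self, out):
        """Output all items in FIFO order (here: append them to `out`)."""
        out.extend(self.items)
        self.items.clear()

    def upload(self):
        """Move all items to the end of the parent's queue."""
        self.parent.items.extend(self.items)
        self.items.clear()

pub = BpdtBuffer()
book = BpdtBuffer(parent=pub)
book.enqueue("A")
book.enqueue("B")
book.upload()        # book predicates resolved; pub predicate still open
results = []
pub.flush(results)   # pub predicate passed
print(results)
```

Items are only ever added at the tail and removed all at once, which is why no individual dequeue is required.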
Basic Building Blocks
XPath Expression: /tag[@attr=val]
Basic Building Blocks
XPath Expression: /tag[text()=val]
Basic Building Blocks
XPath Expression: /tag[child@attr=val]
Basic Building Blocks
XPath Expression: /tag[child=val]
A sample BPDT
Query: /pub[year>2000]
Building a solution
HPDT for Query:
//pub[year>2000]//book[author]//name/text()
HPDT Structure
Each BPDT in HPDT has:
Position: BPDT position (l, k), where l = depth of the BPDT in the HPDT and k = sequence number from right to left.
BPDT position (i-1, k) has right child BPDT position (i, 2k), connected to the NA state.
BPDT position (i-1, k) has left child BPDT position (i, 2k+1), connected to the True state.
BPDT position (i, 2^i - 1) means predicates in higher-level BPDTs evaluate to true.
Buffer: potential results
Stack: stack of element (SAX) events
Depth vector
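The position arithmetic can be checked with a few lines of Python. Following only the left (True-state) links from the root gives k = 1, 3, 7, ..., that is, 2^i - 1 at depth i. The function names are hypothetical, and placing the root BPDT at position (0, 0) is an assumption made for this sketch.

```python
def right_child(level, k):
    """Child reached from the NA state of the BPDT at (level, k)."""
    return (level + 1, 2 * k)

def left_child(level, k):
    """Child reached from the True state of the BPDT at (level, k)."""
    return (level + 1, 2 * k + 1)

def all_true_position(level, root=(0, 0)):
    """Position reached by following only True-state links from the root."""
    pos = root
    for _ in range(level):
        pos = left_child(*pos)
    return pos

for i in range(4):
    print(i, all_true_position(i), 2 ** i - 1)
```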
Example Query
1. <root>
2. <pub>
3. <book>
4. <name> X </name>
5. <author> A </author>
6. </book>
7. <book>
8. <name> Y </name>
9. <pub>
10. <book>
11. <name> Z </name>
12. <author> B </author>
13. </book>
14. <year> 1999 </year>
15. </pub>
16. </book>
17. <year> 2002 </year>
18. </pub>
19. </root>
Query: //pub[year=2002]//book[author]//name
root  pub  book  name
1     2    7     11
1     2    10    11
1     9    10    11
3 paths from $1 to $14
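The three rows above can be reproduced by enumerating, for the name element on line 11, every way its open ancestors can play the roles of pub and book under the closure axis. This is a small sketch that ignores the predicates; the helper `closure_matches` and the (line, tag) representation of the element stack are assumptions for illustration.

```python
from itertools import combinations

def closure_matches(open_elements, tags):
    """All ways to pick ancestors, in document order, whose tags equal `tags`."""
    return [
        tuple(line for line, _ in combo)
        for combo in combinations(open_elements, len(tags))
        if [tag for _, tag in combo] == list(tags)
    ]

# Open elements as (line, tag) when <name> on line 11 begins.
stack = [(1, "root"), (2, "pub"), (7, "book"), (9, "pub"), (10, "book")]

# Match //pub//book below the root, then close each path with the name line.
paths = [(1,) + m + (11,) for m in closure_matches(stack[1:], ("pub", "book"))]
for path in paths:
    print(path)
```

Each path is a separate candidate match for the same name element, and the predicates on pub and book can resolve differently for each one.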
System Features
Name | Support | Streaming | Multiple Predicates | Closure | Buffered Predicate Evaluation
XSQ-F XPath X X X X
XSQ-NC XPath X X X
XMLTK XPath X X
XQEngine XQuery X X
Galax XQuery X X
Joost STX X X
Reference
Feng Peng and Sudarshan S. Chawathe. XPath Queries on Streaming Data. In SIGMOD 2003.
Thank You
???