Date post: | 22-Dec-2015 |
1
Introduction to XML Algebra
Based on talk prepared for CS561 by Wan Liu and Bintou Kane
2
Data Model
Data model ~ core data structures and data types supported by a DBMS
A relational database uses a table (set-oriented) data model
The XML format is a tree-structured hierarchical model
3
Why XML Algebra?
It is common to translate a query language into an algebra.
First, the algebra is used to give a semantics for the query language.
Second, the algebra is used to support query optimization.
5
NIAGARA
Title: Following the Paths of XML Data: An Algebraic Framework for XML Query Evaluation
By: Leonidas Galanis, Efstratios Viglas, David J. DeWitt, Jeffrey F. Naughton, and David Maier
Univ. of Wisconsin
6
Outline
Concepts of Niagara Algebra
Operations
Optimization
7
Goals of Niagara Algebra
Be independent of schema information
Query on both structure and content
Generate simple, flexible, yet powerful algebraic expressions
Allow re-use of traditional optimization techniques
8
Example: XML Source Documents
Invoice.xml
<Invoice_Document>
<invoice No="1">
<account_number>2 </account_number>
<carrier>AT&amp;T</carrier>
<total>$0.25</total>
</invoice>
<invoice>
<account_number>1 </account_number>
<carrier>Sprint</carrier>
<total>$1.20</total>
</invoice>
<invoice>
<account_number>1 </account_number>
<carrier>AT&amp;T</carrier>
<total>$0.75</total>
</invoice>
</Invoice_Document>
Customer.xml
<Customer_Document>
<customer>
<account>1 </account>
<name>Tom </name>
</customer >
<customer>
<account>2 </account>
<name>George </name>
</customer >
</Customer_Document>
9
XML Data Model and Tree Graph
Example: Invoice_Document
[Figure: ordered tree graph. The Invoice_Document root has invoice children; each invoice has number, carrier, and total children with leaf values (2, AT&T, $0.25) and (1, Sprint, $1.20), …]
<Invoice_Document>
<invoice> <number>2</number> <carrier>AT&amp;T</carrier> <total>$0.25</total> </invoice>
<invoice> <number>1</number> <carrier>Sprint</carrier> <total>$1.20</total> </invoice>
</Invoice_Document>
Ordered tree graph, semi-structured data
10
XML Data Model [GVDNM01]
Collection of bags of vertices.
Vertices in a bag have no order.
Example: a bag with the vertices Root "invoice.xml", invoice, and invoice.account_number:
[Root "invoice.xml", invoice, invoice.account_number]
The invoice vertex holds <invoice> invoice-element-content </invoice>.
The invoice.account_number vertex holds <account_number> element-content </account_number>.
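The bag above can be sketched in Python. This is only an illustration, assuming a bag is modeled as a dict from entry-point label to vertex content; the dict layout and the variable names are hypothetical and not part of the Niagara system.

```python
# A bag modeled as a dict: entry-point label -> vertex content (an assumption).
# Dict equality ignores insertion order, which mirrors
# "vertices in a bag have no order".
bag = {
    "Root": "invoice.xml",
    "invoice": "<invoice>invoice-element-content</invoice>",
    "invoice.account_number": "<account_number>element-content</account_number>",
}

# The same vertices listed in a different order form the same bag.
same_bag_other_order = {
    "invoice.account_number": "<account_number>element-content</account_number>",
    "Root": "invoice.xml",
    "invoice": "<invoice>invoice-element-content</invoice>",
}

# A collection is a list of bags.
collection = [bag]

print(bag == same_bag_other_order)  # True
```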
11
Data Model
Bag elements are reachable by path expressions.
A path expression consists of two parts:
An entry point
A relative forward part
Example: account_number:invoice (entry point invoice, relative forward part account_number)
12
Operators
Source (S), Follow (Φ), Select, Join, Rename, Expose, Vertex, Group, Union, Intersection, Difference (-), Cartesian Product.
13
Source Operator S
Input: a list of documents
Output: a collection of singleton bags
Examples:
S (*): all known XML documents
S (invoice*.xml): all XML documents whose filename matches "invoice*.xml"
S (*, schema.dtd): all known XML documents that conform to schema.dtd
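A minimal sketch of the first two examples, assuming the "known documents" are held in an in-memory dict (KNOWN_DOCUMENTS and the function name source are hypothetical). The schema.dtd variant is not modeled here.

```python
from fnmatch import fnmatch

# Hypothetical store of known documents: filename -> XML text.
KNOWN_DOCUMENTS = {
    "invoice1.xml": "<Invoice_Document/>",
    "invoice2.xml": "<Invoice_Document/>",
    "customer.xml": "<Customer_Document/>",
}


def source(pattern="*"):
    """Return one singleton bag per document whose filename matches pattern."""
    return [
        {"Root": name}
        for name in sorted(KNOWN_DOCUMENTS)
        if fnmatch(name, pattern)
    ]


all_docs = source()                    # S (*)
invoice_docs = source("invoice*.xml")  # S (invoice*.xml)
print(len(all_docs), len(invoice_docs))  # 3 2
```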
14
Follow operator
Input: a path expression in entry point notation
Functionality: extracts vertices reachable by the path expression
Output: a new bag that consists of the extracted vertex plus all contents of the original bag (in case of unnesting follow)
15
Follow operator (Example*)
Input: {[Root invoice.xml, invoice]}
The invoice vertex holds <invoice> invoice-element-content </invoice>.
Operator: Φμ(carrier:invoice) (* unnesting follow)
Output: {[Root invoice.xml, invoice, invoice.carrier]}
The new invoice.carrier vertex holds <carrier> carrier-element-content </carrier>.
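The unnesting follow in this example can be sketched with Python's standard xml.etree.ElementTree. The function follow_unnest and the dict-based bags are hypothetical, and the sketch assumes that a bag whose entry vertex has no matching child produces no output bag.

```python
import xml.etree.ElementTree as ET

INVOICES = """<Invoice_Document>
  <invoice><carrier>Sprint</carrier></invoice>
  <invoice><carrier>AT&amp;T</carrier></invoice>
  <invoice></invoice>
</Invoice_Document>"""


def follow_unnest(bags, step, entry):
    """Sketch of an unnesting follow for the path 'step:entry'.

    For every child named `step` under the vertex stored at `entry`,
    emit a copy of the bag extended with the new vertex 'entry.step'.
    """
    out = []
    for bag in bags:
        for child in bag[entry].findall(step):
            new_bag = dict(bag)  # keep all contents of the original bag
            new_bag[f"{entry}.{step}"] = child
            out.append(new_bag)
    return out


root = ET.fromstring(INVOICES)
bags = [{"Root": "invoice.xml", "invoice": inv} for inv in root.findall("invoice")]
result = follow_unnest(bags, "carrier", "invoice")
print([b["invoice.carrier"].text for b in result])  # ['Sprint', 'AT&T']
```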
16
Select operator
Input: a set of bags
Functionality: filters the bags of a collection using a predicate
Output: a set of bags that conform to the predicate
Predicate: logical operators (∧, ∨, ¬) or simple qualifications (<, >, ≤, ≥, =, ≠)
17
Select operator (Example)
Predicate: invoice.carrier = Sprint
Input: {[Root invoice.xml, invoice], [Root invoice.xml, invoice], …}
Output: {[Root invoice.xml, invoice], …}
Each invoice vertex holds <invoice> invoice-element-content </invoice>.
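As a rough illustration, a select can be written as a filter over a list of bags. The bags here are hypothetical dicts that already carry an invoice.carrier value; the function name select is an assumption.

```python
def select(bags, predicate):
    """Keep only the bags for which the predicate holds."""
    return [bag for bag in bags if predicate(bag)]


bags = [
    {"Root": "invoice.xml", "invoice.carrier": "AT&T"},
    {"Root": "invoice.xml", "invoice.carrier": "Sprint"},
    {"Root": "invoice.xml", "invoice.carrier": "AT&T"},
]

# Predicate: invoice.carrier = Sprint
sprint = select(bags, lambda bag: bag["invoice.carrier"] == "Sprint")
print(len(sprint))  # 1
```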
18
Join operator
Input: two collections of bags
Functionality: joins the two collections based on a predicate
Output: the concatenation of pairs of bags that satisfy the predicate
19
Join operator (Example)
Predicate: account_number:invoice = account:customer
Left input: {[Root invoice.xml, invoice]}
The invoice vertex holds <invoice> invoice-element-content </invoice>.
Right input: {[Root customer.xml, customer]}
The customer vertex holds <customer> customer-element-content </customer>.
Output: {[Root invoice.xml, invoice, Root customer.xml, customer]}
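A nested-loop sketch of the join, assuming dict-based bags with distinct root labels so that concatenating two bags loses nothing. The function name join and the key names are hypothetical; the values come from the example documents.

```python
def join(left, right, predicate):
    """Concatenate every (left, right) pair of bags that satisfies predicate."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


invoices = [
    {"Root invoice.xml": "invoice.xml", "invoice.account_number": "2"},
    {"Root invoice.xml": "invoice.xml", "invoice.account_number": "1"},
    {"Root invoice.xml": "invoice.xml", "invoice.account_number": "1"},
]
customers = [
    {"Root customer.xml": "customer.xml", "customer.account": "1", "customer.name": "Tom"},
    {"Root customer.xml": "customer.xml", "customer.account": "2", "customer.name": "George"},
]

joined = join(
    invoices,
    customers,
    lambda i, c: i["invoice.account_number"] == c["customer.account"],
)
print([bag["customer.name"] for bag in joined])  # ['George', 'Tom', 'Tom']
```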
20
Expose operator
Input: a list of path expressions of vertices to be exposed
Output: a set of bags that contain the vertices in the parameter list, in the same order
21
Expose operator (Example)
Input: {[Root invoice.xml, invoice, invoice.carrier, invoice.bill_period]}
Operator: Expose (bill_period, carrier)
Output: {[Root invoice.xml, invoice.bill_period, invoice.carrier]}
The output keeps only the exposed vertices, in parameter-list order.
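A small sketch of expose, assuming dict-based bags (Python dicts preserve insertion order, which stands in for the parameter-list order). The function name expose and the bill_period value are hypothetical placeholders.

```python
def expose(bags, labels):
    """Keep the root entry plus the listed vertices, in parameter-list order."""
    return [
        {"Root": bag["Root"], **{label: bag[label] for label in labels}}
        for bag in bags
    ]


bags = [
    {
        "Root": "invoice.xml",
        "invoice": "<invoice>...</invoice>",
        "invoice.carrier": "Sprint",
        "invoice.bill_period": "placeholder-period",  # hypothetical value
    }
]

exposed = expose(bags, ["invoice.bill_period", "invoice.carrier"])
print(list(exposed[0]))  # ['Root', 'invoice.bill_period', 'invoice.carrier']
```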
22
Vertex operator
Creates the actual XML vertex that will encompass everything created by an expose operator
Example :
(Customer_invoice)[((account)[invoice.account_number], (inv_total)[invoice.total])]
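The effect of this vertex expression can be approximated with xml.etree.ElementTree. The helper vertex and the sample values are hypothetical; the sketch only shows a Customer_invoice element wrapping an account and an inv_total child built from exposed values.

```python
import xml.etree.ElementTree as ET


def vertex(tag, children):
    """Build a new element `tag` wrapping (child_tag, text) pairs."""
    elem = ET.Element(tag)
    for child_tag, text in children:
        ET.SubElement(elem, child_tag).text = text
    return elem


# Values that an expose step might have produced (sample data).
bag = {"invoice.account_number": "1", "invoice.total": "$1.20"}

customer_invoice = vertex(
    "Customer_invoice",
    [
        ("account", bag["invoice.account_number"]),
        ("inv_total", bag["invoice.total"]),
    ],
)
xml_text = ET.tostring(customer_invoice, encoding="unicode")
print(xml_text)
# <Customer_invoice><account>1</account><inv_total>$1.20</inv_total></Customer_invoice>
```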
23
Other operators
Group: used for arbitrary grouping of elements based on their values
Aggregate functions (e.g. average) can be used with the group operator
Rename: changes the entry point annotation of elements of a bag
Example: (invoice.bill_period, date)
24
Example: XML Source Documents
Invoice.xml
<Invoice_Document>
<invoice>
<account_number>2 </account_number>
<carrier>AT&amp;T</carrier>
<total>$0.25</total>
</invoice>
<invoice>
<account_number>1 </account_number>
<carrier>Sprint</carrier>
<total>$1.20</total>
</invoice>
<invoice>
<account_number>1 </account_number>
<total>$0.75</total>
</invoice>
<auditor> maria </auditor>
</Invoice_Document>
Customer.xml
<Customer_Document>
<customer>
<account>1 </account>
<name>Tom </name>
</customer >
<customer>
<account>2 </account>
<name>George </name>
</customer >
</Customer_Document>
25
XQuery Example
List account number, customer name, and invoice total for all invoices that have carrier = "Sprint".
for $i in doc("invoices.xml")//invoice,
    $c in doc("customers.xml")//customer
where $i/carrier = "Sprint" and
      $i/account_number = $c/account
return
  <Sprint_invoices>
    { $i/account_number,
      $c/name,
      $i/total }
  </Sprint_invoices>
26
Example: XQuery output
<Sprint_invoices>
<account_number>1 </account_number>
<name>Tom </name>
<total>$1.20</total>
</Sprint_invoices>
27
Algebra Tree Execution
Invoice branch:
Source (invoices.xml)
Follow (*.invoice): invoice (1), invoice (2), invoice (3)
Select (carrier = "Sprint"): invoice (2)
Customer branch:
Source (customers.xml)
Follow (*.customer): customer (1), customer (2)
Combined:
Join (*.invoice.account_number = *.customer.account): invoice (2), customer (1)
Expose (*.account_number, *.name, *.total): account_number, name, total
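The plan above can be traced end to end in a few lines of Python. This is a simplified sketch over the example documents using xml.etree.ElementTree, with plain list comprehensions standing in for the operators; the helper name text is hypothetical, and this is not how the Niagara engine itself is implemented.

```python
import xml.etree.ElementTree as ET

INVOICES = """<Invoice_Document>
<invoice><account_number>2 </account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
<invoice><account_number>1 </account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
<invoice><account_number>1 </account_number><total>$0.75</total></invoice>
</Invoice_Document>"""

CUSTOMERS = """<Customer_Document>
<customer><account>1 </account><name>Tom </name></customer>
<customer><account>2 </account><name>George </name></customer>
</Customer_Document>"""


def text(elem, tag):
    """Stripped text of the first child named tag, or None if it is absent."""
    value = elem.findtext(tag)
    return value.strip() if value is not None else None


# Source + Follow: one vertex per invoice / customer element.
invoices = ET.fromstring(INVOICES).findall("invoice")
customers = ET.fromstring(CUSTOMERS).findall("customer")

# Select (carrier = "Sprint")
sprint = [i for i in invoices if text(i, "carrier") == "Sprint"]

# Join (invoice.account_number = customer.account)
pairs = [
    (i, c)
    for i in sprint
    for c in customers
    if text(i, "account_number") == text(c, "account")
]

# Expose (account_number, name, total)
rows = [(text(i, "account_number"), text(c, "name"), text(i, "total")) for i, c in pairs]
print(rows)  # [('1', 'Tom', '$1.20')]
```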
28
Optimization with Niagara
Optimizer based on Niagara algebra:
Use the operations more efficiently
Produce simpler expressions by combining operations
29
Language Convention
A and B are path expressions
A < B --- path expression A is a prefix of B
AnB --- common prefix of paths A and B
AńB --- greatest common prefix of paths A and B
┴ --- null path expression
30
Heuristics using Rewrite Rules
Allow optimization based on path selectivity
when applying the unnesting follow operation Φμ
31
Interchangeability of Follow operation
Φμ(A)[Φμ(B)] = Φμ(B)[Φμ(A)]
TRUE when there exists C such that C < A and C < B and C = AńB,
or when AnB = ┴
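The condition can be checked mechanically. The sketch below is an interpretation, not the paper's definition: it assumes a path such as carrier:invoice is written as the tuple ("invoice", "carrier") with the entry point first, and it reads "<" as a strict prefix, so two follows do not commute when one path is a prefix of the other. The function names are hypothetical.

```python
def greatest_common_prefix(a, b):
    """Longest shared leading run of two paths given as tuples."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def follows_commute(a, b):
    """True when the two unnesting follows may be interchanged (under the stated reading)."""
    c = greatest_common_prefix(a, b)
    if not c:
        return True  # null common prefix: the paths are unrelated
    return c != a and c != b  # C must be a strict prefix of both A and B


# acc_Num:invoice vs carrier:invoice, greatest common prefix is invoice
print(follows_commute(("invoice", "acc_Num"), ("invoice", "carrier")))  # True
# acc_Num:invoice vs acc_Num:customer, common prefix is null
print(follows_commute(("invoice", "acc_Num"), ("customer", "acc_Num")))  # True
# One path is a prefix of the other: the second follow depends on the first
print(follows_commute(("invoice",), ("invoice", "carrier")))  # False
```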
32
Application of Rule on Invoice
Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] (*)
=? Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] (**)
Equivalent because both share the common prefix “invoice”.
Case AńB = invoice
34
Benefit of Rule Application
NOTE: let us assume that acc_Num is required for each invoice element, while carrier is not required for an invoice element.
THEN: Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] (*)
=? Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] (**)
Then which algebra tree do we prefer?
Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] makes more sense than (**). Why?
35
Discussion
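Reduction of input size on the first sub-operation:
Φμ(carrier:invoice)
Because carrier is optional, this follow discards invoices without a carrier before the acc_Num follow runs.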
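The effect of the ordering can be made concrete by counting how many bags each follow has to examine. This is a toy cost model with hypothetical names (follow_unnest, run), assuming an unnesting follow drops a bag when the requested child is missing; the three invoices mirror the example where one invoice has no carrier.

```python
def follow_unnest(bags, step):
    """Extend each bag with invoice.<step>; drop bags whose invoice lacks it.

    Returns the output bags and the number of input bags examined.
    """
    out = [
        {**bag, f"invoice.{step}": bag["invoice"][step]}
        for bag in bags
        if step in bag["invoice"]
    ]
    return out, len(bags)


invoices = [
    {"acc_Num": "2", "carrier": "AT&T"},
    {"acc_Num": "1", "carrier": "Sprint"},
    {"acc_Num": "1"},  # carrier is optional and missing here
]
bags = [{"invoice": inv} for inv in invoices]


def run(first, second):
    """Apply two follows in sequence and total the bags examined."""
    intermediate, n1 = follow_unnest(bags, first)
    result, n2 = follow_unnest(intermediate, second)
    return result, n1 + n2


out_carrier_first, cost_carrier_first = run("carrier", "acc_Num")
out_acc_first, cost_acc_first = run("acc_Num", "carrier")

print(cost_carrier_first, cost_acc_first)  # 5 6
print(out_carrier_first == out_acc_first)  # True
```

Both orders return the same bags, but running the selective carrier follow first examines fewer bags overall.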
36
Should we/can we apply the rule below?
Φμ(acc_Num:invoice)[Φμ(acc_Num:customer)]
37
“acc_Num:invoice” and “acc_Num:customer” are two totally different paths.
Case: AnB = ┴
So yes, the rule is valid.