Unordered Tree Matching and Strict Unordered Tree Matching: the
Evaluation of Tree Pattern Queries
Dr. Yangjun Chen
Dept. Applied Computer Science,
University of Winnipeg
515 Portage Ave.
Winnipeg, Manitoba, Canada R3B 2E9
Outline
• Motivation
• Algorithm for unordered tree pattern query evaluation
- Tree encoding
- Algorithm description
• On the strict unordered tree matching
• Experiment results
• Summary
Motivation
Efficient method to evaluate XPath expression queries (XML query processing)
Inputs: XML documents and a tree pattern query
Document:
<Purchase>
  <Seller>
    <Name>dell</Name>
    <Item>
      <Manufacturer>IBM</Manufacturer>
      <Name>part#1</Name>
      <Item>
        <Manufacturer>Intel</Manufacturer>
      </Item>
    </Item>
    <Item>
      <Name>Part#2</Name>
    </Item>
    <Location>Houston</Location>
  </Seller>
  <Buyer>
    <Location>Winnipeg</Location>
    <Name>Y-Chen</Name>
  </Buyer>
</Purchase>
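[Figure: document tree of the Purchase document. P = Purchase, S = Seller, B = Buyer, I = Item, L = Location, N = Name, M = Manufacturer; leaf values: Dell, IBM, Part#1, Intel, Part#2, Houston, Winnipeg, Y-Chen]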
Motivation
Document: Query – XPath expressions:
Q1: /Purchase[Seller[Loc = ‘Boston’]]/Buyer[Loc = ‘New York’]
Q2: /Purchase//Item[Manufacturer = ‘Intel’]
[Figure: two tree patterns. First: Purchase with children Seller and Buyer, each with a Location child (‘Houston’ and ‘Winnipeg’). Second: Buyer, Item, Manufacturer, ‘Intel’.]
d-edge: ancestor-descendant relationship
c-edge: parent-child relationship
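As a point of reference, Q2 can be tried against the Purchase document with Python's standard xml.etree.ElementTree module, which supports a limited XPath subset. This is only an illustrative sketch and not the evaluation method presented in these slides. Because the parsed root already is the Purchase element, the step /Purchase//Item is written here as the relative path .//Item.

```python
import xml.etree.ElementTree as ET

DOC = """
<Purchase>
  <Seller>
    <Name>dell</Name>
    <Item>
      <Manufacturer>IBM</Manufacturer>
      <Name>part#1</Name>
      <Item><Manufacturer>Intel</Manufacturer></Item>
    </Item>
    <Item><Name>Part#2</Name></Item>
    <Location>Houston</Location>
  </Seller>
  <Buyer>
    <Location>Winnipeg</Location>
    <Name>Y-Chen</Name>
  </Buyer>
</Purchase>
"""

root = ET.fromstring(DOC)

# Q2: every Item (at any depth) that has a Manufacturer child with text 'Intel'.
items = root.findall(".//Item[Manufacturer='Intel']")

for item in items:
    print(item.tag, item.findtext("Manufacturer"))
```

Only the nested Item is selected; the outer Item has the manufacturer IBM and does not satisfy the predicate.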
Motivation
- XPath expression
a[b[c and .//d]]/b[c and e//d]
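book[title = ‘Art of Programming’]//author[fn = ‘Donald’ and ln = ‘Knuth’]
[Figure: tree patterns for the two expressions. First: a with two b children; one b has c and d below it, the other has c and e, with d below e. Second: book with title (‘Art of Programming’) and author, which has fn (‘Donald’) and ln (‘Knuth’).]
<document>
  <book>
    <title>Art of Programming</title>
    <author>
      <fn>Donald Knuth</fn>
      … …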
Motivation XPath evaluation against XML documents
- Evaluation based on unordered tree matching:
Definition 1 An embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
(i) Preserve node label: For each u ∈ Q, label(u) matches label(f(u)).
(ii) Preserve parent-child/ancestor-descendant relationships: If u → v is a c-edge in Q, then f(v) is a child of f(u) in T; if u ⇒ v is a d-edge in Q, then f(v) is a descendant of f(u) in T.
[Figure: tree pattern Q (nodes q1, q2, q3 labeled b, c, a) and document tree T (nodes v1 to v6 labeled a, b, c)]
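The two conditions of the embedding definition can be checked directly for a given mapping. The sketch below is a brute-force checker under stated assumptions: trees are plain dictionaries, "matches" is simplified to label equality, and the small trees in the example are hypothetical (they are not the Q and T of the figure). All function and variable names are illustrative.

```python
def is_descendant(parent_of, node, ancestor):
    """Return True if `ancestor` is a proper ancestor of `node`."""
    node = parent_of.get(node)
    while node is not None:
        if node == ancestor:
            return True
        node = parent_of.get(node)
    return False


def is_embedding(f, q_label, q_edges, t_label, t_parent):
    """Check conditions (i) and (ii) for a mapping f from query to document nodes.

    q_edges is a list of (u, v, kind) with kind "c" (parent-child)
    or "d" (ancestor-descendant); t_parent maps each document node to its parent.
    """
    # (i) labels must be preserved (simplified here to equality)
    for u, image in f.items():
        if q_label[u] != t_label[image]:
            return False
    # (ii) c-edges map to parent-child pairs, d-edges to ancestor-descendant pairs
    for u, v, kind in q_edges:
        if kind == "c":
            if t_parent.get(f[v]) != f[u]:
                return False
        elif not is_descendant(t_parent, f[v], f[u]):
            return False
    return True


# Hypothetical document: a chain v1 (a) -> v2 (b) -> v3 (c).
t_label = {"v1": "a", "v2": "b", "v3": "c"}
t_parent = {"v2": "v1", "v3": "v2"}

# Hypothetical query: q3 (a) with children q1 (b) and q2 (c).
q_label = {"q1": "b", "q2": "c", "q3": "a"}
f = {"q3": "v1", "q1": "v2", "q2": "v3"}

with_d_edges = [("q3", "q1", "d"), ("q3", "q2", "d")]
with_c_edges = [("q3", "q1", "c"), ("q3", "q2", "c")]

print(is_embedding(f, q_label, with_d_edges, t_label, t_parent))
print(is_embedding(f, q_label, with_c_edges, t_label, t_parent))
```

With d-edges the mapping is accepted, because v3 is a descendant of v1. With c-edges it is rejected, because v3 is not a child of v1.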
Algorithm for query evaluation Tree encoding
Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.
<A>
  <B>
    <C></C>
    <B>
      <C></C><C></C><D></D>
    </B>
  </B>
  <B></B>
</A>
T:
Node  Label  (DocId, LeftPos, RightPos, LevelNum)
v1    A      (1, 1, 11, 1)
v2    B      (1, 2, 9, 2)
v3    C      (1, 3, 3, 3)
v4    B      (1, 4, 8, 3)
v5    C      (1, 5, 5, 4)
v6    C      (1, 6, 6, 4)
v7    D      (1, 7, 7, 4)
v8    B      (1, 10, 10, 2)
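The quadruples of the example can be reproduced with a single document-order traversal. The counting rule used below is an assumption inferred from the example numbers: the counter advances at every start tag and at the end tag of every element that has children, so a childless element gets LeftPos = RightPos. Text content is ignored, since the example contains none. The function name encode is illustrative.

```python
import xml.etree.ElementTree as ET

XML = "<A><B><C></C><B><C></C><C></C><D></D></B></B><B></B></A>"


def encode(xml_text, doc_id=1):
    """Return [(tag, (DocId, LeftPos, RightPos, LevelNum)), ...] in document order."""
    root = ET.fromstring(xml_text)
    counter = 0
    result = []

    def visit(elem, level):
        nonlocal counter
        counter += 1                  # start tag
        left = counter
        slot = len(result)
        result.append(None)           # reserve the document-order position
        for child in elem:
            visit(child, level + 1)
        if len(elem) > 0:
            counter += 1              # end tag of an element with children
        right = counter               # equals `left` for a childless element
        result[slot] = (elem.tag, (doc_id, left, right, level))

    visit(root, 1)
    return result


for tag, quad in encode(XML):
    print(tag, quad)
```

The first entry is A with (1, 1, 11, 1) and the last is the second-level B with (1, 10, 10, 2), matching the figure.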
Tree encoding
With the quadruples (DocId, LeftPos, RightPos, LevelNum) defined above, structural relationships between nodes can be tested directly:
(i) ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, and r1 > r2.
(ii) parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, r1 > r2, and ln2 = ln1 + 1.
(iii) from left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1 = d2 and r1 < l2.
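The three tests translate into constant-time comparisons on the quadruples. A minimal sketch, using quadruples taken from the example tree; the type name Pos and the function names are illustrative.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    doc_id: int
    left: int
    right: int
    level: int


def is_ancestor(a: Pos, b: Pos) -> bool:
    """(i) a is an ancestor of b."""
    return a.doc_id == b.doc_id and a.left < b.left and a.right > b.right


def is_parent(a: Pos, b: Pos) -> bool:
    """(ii) a is the parent of b."""
    return is_ancestor(a, b) and b.level == a.level + 1


def is_left_of(a: Pos, b: Pos) -> bool:
    """(iii) a is to the left of b."""
    return a.doc_id == b.doc_id and a.right < b.left


v1 = Pos(1, 1, 11, 1)   # A
v2 = Pos(1, 2, 9, 2)    # B
v4 = Pos(1, 4, 8, 3)    # B
v7 = Pos(1, 7, 7, 4)    # D
v8 = Pos(1, 10, 10, 2)  # B

print(is_ancestor(v1, v7), is_parent(v4, v7), is_parent(v2, v7), is_left_of(v2, v8))
```

Here v2 is an ancestor of v7 but not its parent, since the level numbers differ by two.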
Data streams (sorted by LeftPos values):
A: (1, 1, 11, 1)
B: (1, 2, 9, 2), (1, 4, 8, 3), (1, 10, 10, 2)
C: (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4)
D: (1, 7, 7, 4)
Algorithm for query evaluation Algorithm description
• Our algorithm works bottom-up. Therefore, we need to sort XML data streams by (DocId, RightPos) values.
• Each time a query Q is submitted to the system, we associate each query node q with a data stream L(q) such that label(v) = label(q) for each v ∈ L(q). That is, each query node is attached with a list of matching nodes of the document tree.
[Figure: document tree T and query tree Q with nodes q1 (A), q2 and q5 (B), q3 and q4 (C); each query node q is attached to its stream L(q)]
Data streams (sorted by RightPos values):
L(q1) = (1, 1, 11, 1), i.e. {v1}
L(q2) = L(q5) = (1, 4, 8, 3), (1, 2, 9, 2), (1, 10, 10, 2), i.e. {v4, v2, v8}
L(q3) = L(q4) = (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4), i.e. {v3, v5, v6}
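The streams L(q) of the example can be built by grouping document nodes by label and sorting each group by (DocId, RightPos). A minimal in-memory sketch; the names NODES, QUERY_LABELS and build_streams are illustrative, and a real system would read the streams from an index rather than a Python list.

```python
from collections import defaultdict

# (node name, label, (DocId, LeftPos, RightPos, LevelNum)) from the example tree
NODES = [
    ("v1", "A", (1, 1, 11, 1)),
    ("v2", "B", (1, 2, 9, 2)),
    ("v3", "C", (1, 3, 3, 3)),
    ("v4", "B", (1, 4, 8, 3)),
    ("v5", "C", (1, 5, 5, 4)),
    ("v6", "C", (1, 6, 6, 4)),
    ("v7", "D", (1, 7, 7, 4)),
    ("v8", "B", (1, 10, 10, 2)),
]

QUERY_LABELS = {"q1": "A", "q2": "B", "q3": "C", "q4": "C", "q5": "B"}


def build_streams(nodes):
    """Group nodes by label and sort each group by (DocId, RightPos)."""
    streams = defaultdict(list)
    for name, label, quad in nodes:
        streams[label].append((name, quad))
    for entries in streams.values():
        entries.sort(key=lambda entry: (entry[1][0], entry[1][2]))
    return dict(streams)


streams = build_streams(NODES)
# Query nodes with the same label share the same stream.
L = {q: streams.get(label, []) for q, label in QUERY_LABELS.items()}

for q in sorted(L):
    print(q, [name for name, _ in L[q]])
```

Sorting by RightPos yields the nodes in postorder, which is the order a bottom-up algorithm needs: v4 precedes v2 in L(q2) because v4 is closed first.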
Algorithm for query evaluation Algorithm description
1. S(v): the set of query nodes q associated with document tree node v such that Q[q] can be embedded in T[v].
2. δ(q): a variable associated with query node q.
• During the process, δ(q) will be dynamically assigned a series of values a0, a1, ..., am for some m in sequence, where a0 = ∅ and the ai (i = 1, ..., m) are different nodes of T.
• Initially, δ(q) is set to a0 = ∅. δ(q) will be changed from ai-1 to ai = v (i = 1, ..., m) when the following conditions are satisfied:
i) v is the node currently encountered.
ii) q appears in S(u) for some child node u of v.
iii) q is a d-child; or q is a c-child, and u is a c-child with label(u) = label(q).
[Figure: document tree T with S(v) attached to each node v; query tree Q with δ(q) attached to each node q]
Algorithm for query evaluation Algorithm description
Illustration of conditions i) to iii), where q is a child of q' in Q, u is a child of v in T, and S(u) = {…, q, …}:
- d-child: q is a d-child of q'. δ(q) is changed from ai-1 to ai = v.
- c-child: q is a c-child of q'. δ(q) is changed from ai-1 to ai = v only if label(u) = label(q); u must be a direct child of v.
Algorithm for query evaluation Algorithm description
3. Subtree embedding by using δ(q)
Let q1, ..., ql be the children of q' in Q. If δ(q1) = δ(q2) = … = δ(ql) = v and label(v) = label(q'), then T[v] embeds Q[q'].
Algorithm for query evaluation Algorithm description
4. Construction of S(v)
[Figure: query tree Q (nodes q1 to q5 labeled A, B, C) annotated with postorder numbers 1 to 5]
• The nodes of Q are numbered in postorder. So the nodes in Q will be referenced by their postorder numbers.
• For a leaf node v in T, any encountered leaf node q in Q will be inserted into S(v) if label(v) = label(q). (Since Q is explored bottom-up, the query nodes in S(v) are sorted in increasing order.)
• For a non-leaf node v with children v1, …, vk, S(v) = S(v1) ∪ … ∪ S(vk) ∪ {all nodes q with T[v] embedding Q[q]}.
Since each S(vi) is sorted, the union can be implemented using a merge operation.
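The bottom-up construction of S(v) can be sketched with plain Python sets. This is a simplified illustration, not the algorithm of the slides: it replaces the per-query-node variable and the sorted-list merge with set unions, so it does not achieve the stated time bound. It also keeps, for every document node, a second set of query nodes embedded with their root exactly at that node, which is an assumption made here to test c-edges precisely. All class and function names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class TNode:
    label: str
    children: list = field(default_factory=list)


@dataclass
class QNode:
    label: str
    # Each child is a pair (edge, node); edge is "c" or "d".
    children: list = field(default_factory=list)


def postorder(q, out):
    for _, child in q.children:
        postorder(child, out)
    out.append(q)
    return out


def evaluate(t_root, q_root):
    """Return the document nodes v where Q embeds with its root mapped to v."""
    q_nodes = postorder(q_root, [])
    num = {id(q): i for i, q in enumerate(q_nodes)}  # postorder numbers
    root_num = num[id(q_root)]
    answers = []

    def visit(v):
        below = set()   # union of S(u) over the children u of v
        direct = set()  # query nodes rooted exactly at some child of v
        for u in v.children:
            s_u, rooted_u = visit(u)
            below |= s_u
            direct |= rooted_u
        rooted_v = set()
        for i, q in enumerate(q_nodes):
            if q.label != v.label:
                continue
            # c-children must be rooted at a child of v, d-children anywhere below v
            if all(num[id(c)] in (direct if edge == "c" else below)
                   for edge, c in q.children):
                rooted_v.add(i)
        if root_num in rooted_v:
            answers.append(v)
        return below | rooted_v, rooted_v  # (S(v), nodes rooted at v)

    visit(t_root)
    return answers


# Document tree of the tree-encoding example.
doc = TNode("A", [
    TNode("B", [TNode("C"), TNode("B", [TNode("C"), TNode("C"), TNode("D")])]),
    TNode("B"),
])

# Hypothetical queries: A//B[C and D] and A/B/D.
q_ok = QNode("A", [("d", QNode("B", [("c", QNode("C")), ("c", QNode("D"))]))])
q_no = QNode("A", [("c", QNode("B", [("c", QNode("D"))]))])

print(len(evaluate(doc, q_ok)), len(evaluate(doc, q_no)))
```

The first query matches at the root, because the inner B has both a C child and a D child. The second does not, because neither B child of A has a D child.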
Strict unordered tree matching Definition
Definition 2 A strict embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
(i) same as (i) in Definition 1.
(ii) same as (ii) in Definition 1.
(iii) For any two nodes v1 ∈ Q and v2 ∈ Q, if v1 and v2 are not related by an ancestor-descendant relationship, then f(v1) and f(v2) in T are not related by an ancestor-descendant relationship.
[Figure: tree pattern Q (nodes q1, q2, q3 labeled b, c, a) and document tree T (nodes v1 to v6), with a mapping from Q to T]
By Definition 2, this mapping is not allowed.
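Condition (iii) can be tested on (LeftPos, RightPos) intervals alone. The sketch below checks only condition (iii) for a given mapping. The trees are hypothetical (a three-node query and two small documents, not the Q and T of the figure), and all names are illustrative.

```python
def related(a, b):
    """True if one (left, right) interval strictly contains the other,
    i.e. the two nodes are in an ancestor-descendant relationship."""
    return (a[0] < b[0] and a[1] > b[1]) or (b[0] < a[0] and b[1] > a[1])


def satisfies_condition_iii(f, q_pos, t_pos):
    """Unrelated query nodes must be mapped to unrelated document nodes."""
    q_nodes = list(f)
    for i, x in enumerate(q_nodes):
        for y in q_nodes[i + 1:]:
            if not related(q_pos[x], q_pos[y]) and related(t_pos[f[x]], t_pos[f[y]]):
                return False
    return True


# Hypothetical query: q3 (a) with two children q1 (b) and q2 (c).
q_pos = {"q3": (1, 4), "q1": (2, 2), "q2": (3, 3)}

# Hypothetical document 1: a chain v1 (a) -> v2 (b) -> v3 (c).
chain_pos = {"v1": (1, 5), "v2": (2, 4), "v3": (3, 3)}
f_chain = {"q3": "v1", "q1": "v2", "q2": "v3"}

# Hypothetical document 2: u1 (a) with two children u2 (b) and u3 (c).
flat_pos = {"u1": (1, 4), "u2": (2, 2), "u3": (3, 3)}
f_flat = {"q3": "u1", "q1": "u2", "q2": "u3"}

print(satisfies_condition_iii(f_chain, q_pos, chain_pos))
print(satisfies_condition_iii(f_flat, q_pos, flat_pos))
```

The chain mapping is rejected: q1 and q2 are siblings in Q, but their images v2 and v3 are ancestor and descendant in the document.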
Strict unordered tree matching
Strict unordered tree matching needs exponential time.
[Figure: tree pattern Q (root a with children b, c, d, e) and document tree T (root a with children x, d, y, b, c, e, …)]
O(|T||Q|·d^k)
d: the largest out-degree of the nodes in T.
k: the largest out-degree of the nodes in Q.
Using the concept of hypergraphs, the time complexity can be slightly improved to O(|T||Q|·2^k).
Algorithm for query evaluation Experiments
• We conducted our experiments on a DELL desktop PC equipped with a Pentium(R) 4 CPU at 2.80GHz, 0.99GB RAM and a 20GB hard disk. The code was compiled using Microsoft Visual C++ compiler version 6.0, running standalone.
• Tested methods
In the experiments, we have tested five methods:
- TwigStack (TS for short) [1],
- Twig2Stack (T2S for short) [2],
- Twig-List (TL for short) [3],
- One-Phase Holistic (OPH for short) [4],
- tree-embedding (discussed in this paper, TE for short).
Algorithm for query evaluation Experiments
[1] N. Bruno, N. Koudas, and D. Srivastava, Holistic Twig Joins: Optimal XML Pattern Matching, in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002, pp. 310-321.
[2] S. Chen, H.-G. Li, J. Tatemura, W.-P. Hsiung, D. Agrawal, and K. S. Candan, Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents, in Proc. VLDB, Seoul, Korea, Sept. 2006, pp. 283-294.
[3] L. Qin, J. X. Yu, and B. Ding, TwigList: Make Twig Pattern Matching Fast, in Proc. 12th Int’l Conf. on Database Systems for Advanced Applications (DASFAA), Apr. 2007, pp. 850-862.
[4] Z. Jiang, C. Luo, W.-C. Hou, Q. Zhu, and D. Che, Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution, in Proc. 18th Int’l Conf. on Database and Expert Systems Applications (DEXA), Sept. 2007, pp. 87-97.
Algorithm for query evaluation Experiments
• Theoretical computational complexities
Method       Query time                        Runtime space usage
TwigStack    O(|D||Q|)                         O(|D||Q|)
Twig2Stack   O(|D||Q|^2 + |subTwigResults|)    O(|D||Q|)
TL           O(|D||Q|^2)                       O(|D||Q|)
OPH          O(|D||Q|^2)                       O(|D||Q|)
TE           O(|D||Q|)                         O(|D||Q|)
D: a largest data stream associated with a node q of Q such that for each v in the data stream we have label(v) = label(q).
Algorithm for query evaluation Experiments
• Indexes
All the tested methods use XB-trees as their index structure.
Algorithm for query evaluation Experiments
• Data sets
The data sets used for the tests are the DBLP data set [30] and a synthetic XMark data set [35]. The DBLP data set is a real data set with high similarity in structure. It is in fact a wide and shallow document.
                  DBLP      XMark
Data size         127 MB    113 MB
Number of nodes   3332k     1666k
Max/Avg. depth    6/2.9     12/5.5
Algorithm for query evaluation Experiments
• Queries: altogether 20 queries, divided into 4 groups
Group I:
Q1: //inproceedings[author]//year[text() = ‘2004’]
Q2: //inproceedings[author and title]//year[text() = ‘2004’]
Q3: //inproceedings[author and title and .//pages]//year[text() = ‘2004’]
Q4: //inproceedings[author and title and .//pages and .//url]//year[text() = ‘2004’]
Q5: //articles[author and title and .//volume and .//pages and .//url]//year[text() = ‘2004’]
Algorithm for query evaluation Experiments
• Test results
For all the experiments, the buffer pool size was fixed at 2000 pages. A page size of 8KB was used. For each data set, all the tag names are stored in a single list, and each tag name is then represented by its order number in that list during the evaluation of queries. In our implementation, each DocId occupies 4 bytes, while a number in a Prüfer sequence, a LeftPos or a RightPos occupies 2 bytes. A LevelNum value takes only 1 byte.
[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for Q1 to Q5]
Algorithm for query evaluation Experiments
• Queries
Group II:
Q6: //inproceedings[author/* and ./*]/year
Q7: //inproceedings[author/* and title and ./*]/year
Q8: //inproceedings[author/* and title and .//pages and ./*]/year
Q9: //inproceedings[author/* and title and .//pages and .//url and ./*]/year
Q10: //articles[author/* and title and .//volume and .//pages and .//pages and .//url and ./*]/year
[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for Q6 to Q10]
Algorithm for query evaluation Experiments
• Queries
Group III:
Q11: /site//open_auction[.//seller/person]//date[text() = ‘10/23/2006’]
Q12: /site//open_auction[.//seller/person and .//bidder]//date[text() = ‘10/23/2006’]
Q13: /site//open_auction[.//seller/person and .//bidder/increase]//date[text() = ‘10/23/2006’]
Q14: /site//open_auction[.//seller/person and .//bidder/increase and .//initial]//date[text() = ‘10/23/2006’]
Q15: /site//open_auction[.//seller/person and .//bidder/increase and .//initial and .//description]//date[text() = ‘10/23/2006’]
[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for Q11 to Q15]
Algorithm for query evaluation Experiments
• Queries
Group IV:
Q16: /site//open_auction[.//seller/person/* and ./*]/date
Q17: /site//open_auction[.//seller/person/* and .//bidder and ./*]/date
Q18: /site//open_auction[.//seller/person/* and .//bidder/increase]/date
Q19: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and ./*]/date
Q20: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and .//description and ./*]/date
[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for Q16 to Q20]
Algorithm for query evaluation Experiments
• Test results
In the following figure, we compare the runtime memory usage of all five tested approaches for the second group of queries. By memory usage, we mean the intermediate data structures, not including data streams (concretely, path stacks for TwigStack; hierarchical stacks for Twig2Stack, TL and OPH; and QSs for ours).
[Figure: runtime space usage of TS, T2S, OPH, TL and TM for Q6 to Q10; vertical axis from 0 to 10000]
Summary
• An efficient method for evaluating unordered tree pattern queries in XML document databases
- parent/child and ancestor/descendant relations
- O(|D||Q|) time and O(|D||Q|) space
• Strict unordered tree matching
- exponential time
• Experiments
- TreeBank database, DBLP and XMark documents
- I/O time and CPU processing time
Thank you.