APWeb 2004 Hangzhou, China

Labeling and Querying Dynamic XML Trees

Jiaheng Lu and Tok Wang Ling
School of Computing, National University of Singapore

Contents

Introduction
  Introduction to structural join
  Introduction to labeling scheme

Our Methods
  Preliminary definition
  Group based prefix labeling scheme
  Group based join algorithm

Our Experiments

Introduction to Structural Join

XML employs a tree-structured model for representing data.

An XML query can be decomposed into a set of basic structural (parent-child or ancestor-descendant) relationships between pairs of nodes.

a) XML source

<book title="XML">
  <allauthors>
    <author>John</author>
    <author>Tom</author>
  </allauthors>
  <year>2003</year>
  <chapter>
    <head>….</head>
    <section>…</section>
  </chapter>
</book>

b) XML tree

book
  title
    XML
  allauthors
    author
      John
    author
      Tom
  year
    2003
  chapter
    head
    section

[Figure: c) XPath tree pattern over book, title, author and their values; d) the basic structural relationships (parent-child and ancestor-descendant) into which the pattern is decomposed.]

Any node in the XML tree may be an element, an attribute, or a value of the XML source.

Labeling scheme

In order to perform a structural join, each node in an XML tree is assigned a unique label.

We can determine the ancestor-descendant (or parent-child) relationship for any two nodes from their labels.

The method of assigning the labels is called a labeling scheme.

Range labeling scheme

In the range labeling scheme, the label of a node v is interpreted as a pair of numbers <a_v, b_v>: a_v is called the start position and b_v the end position. A node v (<a_v, b_v>) is an ancestor of u (<a_u, b_u>) iff a_v ≤ a_u ≤ b_u ≤ b_v. In other words, range <a_u, b_u> is contained in range <a_v, b_v>.

Range labeling scheme (example)

Book <1,12>
  Title <1,2>
    XML <1,1>
  Allauthors <3,6>
    Author <3,4>
      John <3,3>
    Author <5,6>
      Tom <5,5>
  Year <7,8>
    2003 <7,7>
  Chapter <9,11>
    Head <9,9>
    Section <10,10>
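
The containment test for range labels can be sketched in a few lines of Python. This is only an illustration built on the labels of the example tree: the function name `is_ancestor` and the dictionary `labels` are hypothetical, and the extra check that the two labels differ is an assumption added here so that a node is not reported as its own ancestor.

```python
# Range labels <start, end> taken from the example tree on the slide.
# The keys "Author1"/"Author2" are hypothetical names for the two Author nodes.
labels = {
    "Book": (1, 12), "Title": (1, 2), "XML": (1, 1),
    "Allauthors": (3, 6), "Author1": (3, 4), "John": (3, 3),
    "Author2": (5, 6), "Tom": (5, 5),
    "Year": (7, 8), "2003": (7, 7),
    "Chapter": (9, 11), "Head": (9, 9), "Section": (10, 10),
}


def is_ancestor(v, u):
    """Return True if range label v contains range label u.

    Implements a_v <= a_u <= b_u <= b_v. The test v != u is an added
    assumption: it keeps a node from counting as its own ancestor.
    """
    a_v, b_v = v
    a_u, b_u = u
    return a_v <= a_u <= b_u <= b_v and v != u


book_over_john = is_ancestor(labels["Book"], labels["John"])    # True
title_over_john = is_ancestor(labels["Title"], labels["John"])  # False
```

The check needs only four integer comparisons, which is why the relationship can be decided in constant time.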

Range labeling scheme

Pros: The ancestor-descendant relationship can be decided in constant time.

Cons: This method lacks flexibility. Inserting nodes causes a renumbering problem. To get around it, some papers propose leaving "gaps" between the numbers of the leaves. However, if one part of the document is heavily updated, the available numbers may still run out and the tree must be renumbered, so this approach does not ultimately solve the problem.

Prefix labeling scheme

Cohen et al. (1), in PODS 2002, propose a simple prefix labeling scheme that avoids renumbering for any insertion sequence.

For any new node v, the label is L(v) = L(u) + 1…10, that is, L(u) followed by i ones and a single 0, where u is the parent of v, i is the number of already labeled children of u, and + denotes string concatenation. The root node is labeled with the empty string.

(1) Edith Cohen, Haim Kaplan, Tova Milo. Labeling Dynamic XML Trees. In ACM PODS 2002.

How to generate simple prefix label

L(v) = L(u) + 1…10 (i ones followed by a single 0)

u is the parent of v and i is the number of already labeled children of u.

Book ""
  Title "0"
  Authors "10"
    Author "100"
    Author "1010"
    Author "10110"
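
The labeling rule can be tried out with a small Python sketch. The class `Node`, its method `add_child`, and the helper `is_ancestor` are hypothetical names chosen for this illustration. The ancestor test assumes, as is usual for prefix schemes, that a node is an ancestor of another exactly when its label is a proper prefix of the other's label.

```python
class Node:
    """A tree node carrying a simple prefix label (a string of 0s and 1s)."""

    def __init__(self, name, label=""):
        self.name = name
        self.label = label          # the root keeps the empty string
        self.labeled_children = 0   # i in the rule L(v) = L(u) + 1...10

    def add_child(self, name):
        # New label = parent label + i ones + one zero.
        child = Node(name, self.label + "1" * self.labeled_children + "0")
        self.labeled_children += 1
        return child


def is_ancestor(a, d):
    """Assumed test: a is an ancestor of d iff a.label is a proper prefix of d.label."""
    return len(a.label) < len(d.label) and d.label.startswith(a.label)


# Rebuild the example tree from the slide in insertion order.
book = Node("Book")
title = book.add_child("Title")        # label "0"
authors = book.add_child("Authors")    # label "10"
author1 = authors.add_child("Author")  # label "100"
author2 = authors.add_child("Author")  # label "1010"
author3 = authors.add_child("Author")  # label "10110"
```

Because a new label is derived only from the parent's label and the parent's child counter, inserting a node never changes any label that was already assigned.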

Simple prefix labeling scheme

Pros: Compared with other labeling schemes, such as the range labeling scheme, the simple prefix scheme does not need renumbering for any insertion sequence.

Cons: The index size is too large. The tight bound on the size is O(N^2) in the worst case, where N is the number of nodes in the XML tree.
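
To see where quadratic growth can come from, consider a parent that receives N children one after another: the child inserted when i children are already labeled gets i ones and a 0 appended, so its label is i + 1 bits longer than the parent's. The sketch below only adds up these lengths for that one hypothetical shape (a root with the empty label and N children); it illustrates the growth and is not a proof of the bound. The function name is an assumption.

```python
def total_label_bits_star(n_children):
    """Total label length, in bits, for a root with n_children children.

    The root has the empty label, and the child inserted after i siblings
    gets the label 1...10 with i ones, which is i + 1 bits long.
    """
    return sum(i + 1 for i in range(n_children))


small = total_label_bits_star(3)     # labels "0", "10", "110"
large = total_label_bits_star(1000)  # equals 1000 * 1001 / 2
```

The sum is N(N+1)/2, so the total label size for this shape grows quadratically with the number of nodes.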

Group

Definition: Given an XML tree T, a group is a set of subtrees such that the root nodes of all subtrees in the set have a common parent node in T.

[Figure: an example tree with nodes A to J, marking one subtree, two subtrees, and a group.]

Group

Property 1: Given an XML tree T and a group S, for any node n ∈ T with n ∉ S, one of the following two conditions must be satisfied: (1) n is an ancestor of all nodes in S; (2) n is not an ancestor of any node in S.

In other words, n cannot be an ancestor of only some of the nodes in S.
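
Property 1 can be checked by brute force on a small tree. The sketch below uses a hypothetical seven-node tree (it is not the tree from the figure), stored as a child-to-parent map; all names in it are assumptions made for illustration.

```python
# Hypothetical tree, given as child -> parent (A is the root).
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "F": "B", "G": "D"}
NODES = set(PARENT) | set(PARENT.values())


def ancestors(n):
    """Return the set of proper ancestors of n."""
    out = set()
    while n in PARENT:
        n = PARENT[n]
        out.add(n)
    return out


def subtree(root):
    """Return root together with all of its descendants."""
    return {n for n in NODES if n == root or root in ancestors(n)}


def satisfies_property_1(s):
    """True if every node outside s is an ancestor of all of s or of none of it."""
    for n in NODES - s:
        covered = [n in ancestors(m) for m in s]
        if any(covered) and not all(covered):
            return False
    return True


group = subtree("D") | subtree("E")  # roots D and E share the parent B
not_a_group = {"D", "G", "C"}        # D and C have different parents
```

Here `satisfies_property_1(group)` holds, while `not_a_group` fails the check because B is an ancestor of D and G but not of C.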

GRP Labeling Scheme

The Group based Prefix (GRP) labeling scheme associates each node n in the XML tree with a pair <groupID, prefix-label>, where groupID is a nonnegative integer and prefix-label is a binary string. All nodes in the same group have the same groupID and are distinguished by their prefix-labels.

GRP label example:

Root <1,"0">
A <1,"00">, B <1,"010">
C <2,"0">, D <2,"10">, E <2,"110">

In this example, each group holds at most three nodes.

Group based Structural Join (GRJ) Algorithm

The main idea in GRJ is to divide the join operations into two classes: intra-group join and inter-group join.

Intra-group join means the join happens among the elements in the same group.

Inter-group join means the join happens among the elements in different groups.

Intra-group Join

Intra-group join is easy to understand. There are two alternative methods to perform this join:

(i) Simply compare the prefix labels of any two elements to identify their relationship, like a nested-loop join in an RDBMS.

(ii) A cleverer method first sorts the prefix labels, then uses a stack to cache the potential ancestors and scans the join data only once, like a sort-merge join in an RDBMS.
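
The two alternatives can be sketched as follows for a single group. This is an illustrative reading of the slide, not the authors' implementation: elements are represented only by their prefix-label strings, labels within a group are assumed to be unique, and the function names are hypothetical. An element is taken to be an ancestor of another when its label is a proper prefix of the other's.

```python
def nested_loop_join(anc, desc):
    """Method (i): compare every ancestor label with every descendant label."""
    return sorted((a, d) for a in anc for d in desc
                  if len(a) < len(d) and d.startswith(a))


def stack_join(anc, desc):
    """Method (ii): sort the labels once, then scan them with a stack.

    In lexicographic order a label comes before every label it prefixes,
    so the stack always holds a chain of nested potential ancestors.
    Tag 0 marks a descendant and tag 1 an ancestor; on equal labels the
    descendant is handled first, so an element is never paired with itself.
    """
    merged = sorted([(lab, 1) for lab in anc] + [(lab, 0) for lab in desc])
    stack, out = [], []
    for lab, is_anc in merged:
        # Drop cached ancestors that cannot contain the current label.
        while stack and not lab.startswith(stack[-1]):
            stack.pop()
        if is_anc:
            stack.append(lab)
        else:
            out.extend((a, lab) for a in stack)
    return sorted(out)


# Hypothetical prefix labels inside one group.
anc_labels = ["0", "10"]
desc_labels = ["00", "010", "100", "10"]
pairs = stack_join(anc_labels, desc_labels)
```

Both functions return the same pairs; the second one touches each label once after sorting instead of comparing all combinations.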

Inter-group Join algorithm

The key point of inter-group join is to use a hash table to cache the ancestor nodes of each group.

Each key of the hash table is the group ID of a descendant set, and the corresponding value is the parent element of that group.

Algorithm: GRJ

Input: A is the ancestor list and D is the descendant list
Output: Pairs of ancestor-descendant elements

1. Scan the A and D lists once to assign every element to its respective group bucket;
2. Initialize DgroupHash as a hash table, where the keys are the group IDs of Dlist and each value is initialized as an empty set;
   /* DgroupHash will cache the ancestors of any group in Dlist. */
3. For i := 2 to max group do
   /* since group 1 only contains the root node, begin from group 2 */
4.    Output each element in set DgroupHash(i) as the ancestor of each element in Dlist of group i;
5.    Delete key i from DgroupHash;
6.    Perform intra-group join for group i;
7.    Perform inter-group join for group i (the join result is stored in the hash table DgroupHash);
8. End For

Experiment setup

Comprehensive experiments were conducted to study the effectiveness and efficiency of the GRJ algorithm.

We use synthetic and real-life data, including XMARK, the IBM XML generator, and DBLP.

Query performance

For the GRP scheme, we use the GRJ algorithm. For the simple prefix (SP) scheme, we first use the block nested loop (BNL) algorithm: when labels are assigned directly in insertion order, the element lists are usually unsorted, so a more efficient algorithm cannot be used.

Experiment result: GRJ is much more efficient than BNL.

[Figure: elapsed time (sec) versus number of nodes (K), at 3, 10, 20, 25 and 30, for the GRJ algorithm and the BNL algorithm.]

Query Performance

When special effort is taken to guarantee that the element lists are sorted, a more efficient algorithm, Stack-Tree-Desc (2), can be used to perform the structural join for the SP labeling scheme. Stack-Tree-Desc is like a sort-merge join in an RDBMS.

Since the original Stack-Tree-Desc algorithm is based on the range labeling scheme, we first modify it to use the SP labeling scheme (the main idea is the same).

(2) D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel and Yuqing Wu. Structural Joins: A primitive for efficient XML query pattern matching. In ICDE 2002.

[Figure: elapsed time versus number of nodes in the join set (K), at 50, 100, 150 and 250, for SP-Stack-Tree-Desc and GRJ.]

Query Performance

Interestingly, although the GRJ algorithm needs to scan the element lists twice and the Stack-Tree-Desc algorithm scans them only once, GRJ still performs better than Stack-Tree-Desc on large data.

Query Performance

This can be explained as follows: the Stack-Tree-Desc algorithm is based on the SP labeling scheme, while GRJ is based on the GRP scheme. Since SP labels are much larger than GRP labels, accessing the GRP labels twice may still take less time than accessing the SP labels once. As a result, the GRJ algorithm outperforms the Stack-Tree-Desc algorithm.

This result shows the importance of label size.

------ End-------

Thank you !

Question and Answer

