Estimating the Selectivity of XML Path Expressions for Internet Scale Applications
Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton
Computer Sciences Department, University of Wisconsin - Madison

Motivation
- XML enables Internet scale applications that query data from many sources
  - Niagara, Xyleme, …
- Queries over XML data use path expressions
- Optimizing these queries requires estimating the selectivity of the path expressions
- Focus of this talk: building statistics for XML data and using them to estimate the selectivity of simple path expressions

What is XML?
    <readings>
      <play>
        <title>Pygmalion</title>
        <author>Bernard Shaw</author>
      </play>
      <novel>
        <title>David Copperfield</title>
        <author>Charles Dickens</author>
      </novel>
    </readings>

Querying XML
    FOR $n_auth IN document("*")//novel/author
        $p_auth IN document("*")//play/author
    WHERE $n_auth/text() = $p_auth/text()
    RETURN $n_auth
- Optimizing this query requires estimating the selectivity of the path expressions
- This requires information about the structure of the XML data

Goal of this Work
- Build database statistics that capture the structure of XML data
- Ensure that the statistics fit in a small amount of memory
  - For efficient query optimization
  - Important for Internet scale applications
- Use the statistics to estimate the selectivity of simple XML path expressions of the form //t1/t2/…/tn

Outline of Presentation
- Introduction
- Path Trees
- Markov Tables
- Performance Evaluation
- Conclusions

Path Trees
    <A>
      <B> </B>
      <B>
        <D> </D>
      </B>
      <C>
        <D> </D>
        <E> </E>
        <E> </E>
        <E> </E>
      </C>
    </A>
Path tree for this document (tag and frequency, children indented under their parent):
    A 1
      B 2
        D 1
      C 1
        D 1
        E 3
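
The construction above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the names `build_path_tree` and `selectivity` are hypothetical, and the path tree is stored as a flat counter keyed by root-to-node tag paths rather than as linked nodes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The example document from the slide, written on one line.
DOC = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"

def build_path_tree(xml_text):
    """Map each distinct root-to-node tag path to the number of elements reached by it."""
    freq = Counter()

    def visit(elem, prefix):
        path = prefix + (elem.tag,)
        freq[path] += 1
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return freq

def selectivity(path_tree, query):
    """Selectivity of //t1/t2/.../tn: total frequency of the nodes whose path ends with the query tags."""
    q = tuple(query)
    return sum(f for path, f in path_tree.items() if path[-len(q):] == q)

path_tree = build_path_tree(DOC)
print(path_tree[("A", "B")])            # frequency of node B under A
print(selectivity(path_tree, ["D"]))    # //D matches the D under B and the D under C
```

With the unsummarized path tree the answers are exact; the approximation only enters once nodes are removed to save memory.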

Summarizing Path Trees
- Path trees contain all the information needed for selectivity estimation
- Problem: May not fit in available memory
  - Small available memory
  - Internet scale
- Remove low frequency nodes
- Removed nodes replaced with *-nodes
  - Tag name: * meaning "any tag"
  - Frequency: Average frequency of replaced nodes
- Sibling-*, Level-*, Global-*, No-*

Sibling-* Summarization
Original path tree (tag and frequency; one tree level per line, sibling groups separated by "|"):
    A 1
    B 13, C 9
    D 7, E 5 | F 15, G 10, H 6
    I 2, J 4, K 11 | K 12
Steps shown in the animation:
1. I 2 and J 4 are coalesced into a *-node with f=6, n=2
   - *-nodes represent deleted sibling nodes
   - Memory saved by coalescing nodes
2. D 7 and E 5 are coalesced into a *-node with f=12, n=2; K 12, K 11, and the *-node (f=6, n=2) become siblings under it
3. G 10 and H 6 are coalesced into a *-node with f=16, n=2; F 15 is kept
4. K 12 and K 11 are coalesced into a single K node with f=23, n=2
Summarized path tree (each *-node labeled with its average frequency f/n):
    A 1
    B 13, C 9
    * 6 | F 15, * 8
    K (f=23, n=2), * 3
- Try to retain as much information as possible about the deleted nodes
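
The bookkeeping behind the *-node labels can be illustrated with a short Python sketch. The class `StarNode` and the function `coalesce` are hypothetical names introduced here; the sketch only reproduces the arithmetic visible in the frames (total frequency f, node count n, average f/n), not the full summarization algorithm.

```python
from dataclasses import dataclass

@dataclass
class StarNode:
    f: int  # total frequency of the deleted nodes this *-node stands for
    n: int  # number of deleted nodes this *-node stands for

    @property
    def average(self):
        # Frequency reported for the *-node in the summarized tree.
        return self.f / self.n

def coalesce(frequencies):
    """Replace a group of deleted sibling nodes by one *-node."""
    return StarNode(f=sum(frequencies), n=len(frequencies))

star_ij = coalesce([2, 4])    # I 2 and J 4
star_de = coalesce([7, 5])    # D 7 and E 5
star_gh = coalesce([10, 6])   # G 10 and H 6

for star in (star_ij, star_de, star_gh):
    print(star, star.average)
```

Keeping f and n separately until the end is what lets further nodes be folded into an existing *-node without losing the correct average.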

Level-* Summarization
Starting from the same original path tree, the summarized tree has one *-node per tree level (one tree level per line):
    A 1
    B 13, C 9
    F 15, G 10, * 6
    K 11, K 12, * 3
("* 6" is the average of D 7, E 5, and H 6; "* 3" is the average of I 2 and J 4)
- Less information about deleted nodes than sibling-*
- Deletes fewer nodes than sibling-*

Global-* Summarization
Starting from the same original path tree, the nodes A 1, I 2, J 4, and E 5 are deleted and represented by a single *-node (remaining nodes listed one tree level per line):
    B 13, C 9
    D 7, F 15, G 10, H 6
    K 11, K 12
    * 3
("* 3" is the average of A 1, I 2, J 4, and E 5)

No-* Summarization
Starting from the same original path tree, the nodes A 1, I 2, and J 4 are deleted and no *-node is kept (remaining nodes listed one tree level per line):
    B 13, C 9
    D 7, E 5, F 15, G 10, H 6
    K 11, K 12
- Memory savings similar to global-*
- Conservative assumption about deleted nodes
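
The global-* and no-* results above can be reproduced with a small Python sketch. It is a simplification under stated assumptions: the tree is flattened to a list of (tag, frequency) pairs, so parent and child links are ignored, and the number of deleted nodes `k` is passed in by hand to match the slides instead of being derived from a memory budget. The function names are hypothetical.

```python
# Nodes of the example path tree as (tag, frequency); tree structure is ignored in this sketch.
NODES = [("A", 1), ("B", 13), ("C", 9), ("D", 7), ("E", 5), ("F", 15),
         ("G", 10), ("H", 6), ("I", 2), ("J", 4), ("K", 11), ("K", 12)]

def delete_lowest(nodes, k):
    """Split nodes into (kept, deleted), deleting the k lowest-frequency nodes."""
    ranked = sorted(nodes, key=lambda node: node[1])
    return ranked[k:], ranked[:k]

def global_star(nodes, k):
    """Global-*: all deleted nodes are represented by one *-node holding their average frequency."""
    kept, deleted = delete_lowest(nodes, k)
    star = sum(f for _, f in deleted) / len(deleted)
    return kept, star

def no_star(nodes, k):
    """No-*: deleted nodes are dropped and assumed not to exist in the data."""
    kept, _ = delete_lowest(nodes, k)
    return kept

kept_global, star = global_star(NODES, 4)   # deletes A 1, I 2, J 4, E 5
kept_no = no_star(NODES, 3)                 # deletes A 1, I 2, J 4
print(star, len(kept_global) + 1, len(kept_no))
```

In this example both summaries end up with nine entries (eight nodes plus the *-node for global-*, nine nodes for no-*), which is in line with the slide's remark that the memory savings are similar.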

Markov Tables
- A table of all distinct paths of length up to m and their frequencies
- For paths of length greater than m, combine paths from the Markov table
- Example:

$$f(A/B/C/D) = f(A/B/C) \times \frac{f(B/C/D)}{f(B/C)}$$

- Uses "short memory" or "Markov" property
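
Assuming the same pattern is applied repeatedly, the example extends to any simple path of length n > m. The following is a sketch of that generalization in the notation of the example, with t_1, …, t_n standing for the tags of the query path:

$$f(t_1/t_2/\cdots/t_n) \approx f(t_1/\cdots/t_m) \times \prod_{i=1}^{n-m} \frac{f(t_{1+i}/\cdots/t_{m+i})}{f(t_{1+i}/\cdots/t_{m+i-1})}$$

Each factor is the fraction of occurrences of a path of length m-1 that continue with the next tag, so only table entries of length m and m-1 are needed. With m = 3 and n = 4 the product has a single factor and the formula reduces to the example for f(A/B/C/D).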

Markov Tables
    Path  Freq    Path  Freq
    A        1    AC       6
    B       11    AD       4
    C       15    BC       9
    D       19    BD       7
    AB      11    CD       8
Path tree summarized by this table (tag and frequency, children indented under their parent):
    A 1
      B 11
        C 9
          D 8
        D 7
      C 6
      D 4
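
As an illustration of how such a table could be used, here is a minimal Python sketch that estimates the frequency of a path longer than the table's maximum path length, which is 2 for the table above. The function name `estimate` is hypothetical. The sketch also assumes single-character tags so that a path can be written as a string, and it treats paths missing from the table as having frequency 0.

```python
# The Markov table from the slide: paths of length 1 and 2 with their frequencies.
MARKOV = {"A": 1, "B": 11, "C": 15, "D": 19,
          "AB": 11, "AC": 6, "AD": 4, "BC": 9, "BD": 7, "CD": 8}

def estimate(tags, table=MARKOV, m=2):
    """Estimate the frequency of the path t1/t2/.../tn from a table of paths up to length m."""
    tags = list(tags)
    if len(tags) <= m:
        # Short paths are looked up directly and are exact.
        return float(table.get("".join(tags), 0))
    est = float(table.get("".join(tags[:m]), 0))
    for i in range(1, len(tags) - m + 1):
        num = table.get("".join(tags[i:i + m]), 0)      # path of length m
        den = table.get("".join(tags[i:i + m - 1]), 0)  # its prefix of length m - 1
        if den == 0:
            return 0.0
        est = est * num / den
    return est

print(estimate("AB"))    # exact lookup
print(estimate("ABC"))   # f(AB) * f(BC) / f(B)
print(estimate("ABCD"))  # f(AB) * f(BC) / f(B) * f(CD) / f(C)
```

For A/B/C the estimate is 11 × 9 / 11 = 9, which matches the path tree. For A/B/C/D the estimate is lower than the 8 occurrences in the tree, because the table only knows that 8 of the 15 C elements have a D child, not which C elements those are.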

Summarizing Markov Tables
- Exact selectivities for paths of length up to m
- Approximate selectivities for paths longer than m
- Problem: May not fit in available memory
- Remove low frequency paths
  - Discard removed paths of length > 2
  - Replace removed paths of length 1 or 2 with *-paths
- Suffix-*, Global-*, No-*

Suffix-* Summarization
Original Markov table:
    Path  Freq    Path  Freq
    A        1    AC       6
    B       11    AD       4
    C       15    BC       9
    D       19    BD       7
    AB      11    CD       8
Steps shown in the animation:
1. Add the *-paths * and **, each with frequency 0
2. Delete A 1 (length 1): * becomes f=1, n=1
3. Keep a set SD of deleted paths of length 2, initially SD = { }
4. Delete AD 4: SD = { (AD,4) }
5. Delete AC 6: AC and AD share the first tag A and are combined into A* with f=10, n=2; SD = { }
6. Delete BD 7: SD = { (BD,7) }
7. Delete CD 8: SD = { (BD,7), (CD,8) }
8. Delete BC 9: BC and BD are combined into B* with f=16, n=2; SD = { (CD,8) }
9. Delete A*: it is folded into **, which becomes f=10, n=2; SD = { (CD,8) }
Summarized Markov table (*-paths hold average frequencies; ** 6 is consistent with averaging A* and the remaining SD entry CD 8, that is, 18/3):
    Path  Freq    Path  Freq
    B       11    B*       8
    C       15    *        1
    D       19    **       6
    AB      11
    SD = { }
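
The frames above can be replayed with a short Python sketch. It is not the authors' algorithm: the deletion order is hard-coded from the slides rather than chosen by frequency, the function name `replay_suffix_star` is hypothetical, tags are assumed to be single characters, and the branch that adds a deleted path to an already existing X* entry is an assumption that the slides never exercise.

```python
MARKOV = {"A": 1, "B": 11, "C": 15, "D": 19,
          "AB": 11, "AC": 6, "AD": 4, "BC": 9, "BD": 7, "CD": 8}

def replay_suffix_star(table, deletions):
    """Apply the given deletions and return the summarized table with averaged *-paths."""
    table = dict(table)
    stars = {"*": [0, 0], "**": [0, 0]}  # *-path -> [total frequency f, count n]
    sd = {}                              # deleted length-2 paths waiting for a partner

    for path in deletions:
        if path.endswith("*"):
            # Deleting an X* path folds it into **.
            f, n = stars.pop(path)
            stars["**"][0] += f
            stars["**"][1] += n
        elif len(path) == 1:
            stars["*"][0] += table.pop(path)
            stars["*"][1] += 1
        else:
            freq = table.pop(path)
            key = path[0] + "*"
            partner = next((p for p in sd if p[0] == path[0]), None)
            if key in stars:
                # Assumption: extend an existing X* entry (not shown on the slides).
                stars[key][0] += freq
                stars[key][1] += 1
            elif partner is not None:
                # Two deleted paths with the same first tag become X*.
                stars[key] = [freq + sd.pop(partner), 2]
            else:
                sd[path] = freq

    # Paths left in SD at the end are folded into **.
    for freq in sd.values():
        stars["**"][0] += freq
        stars["**"][1] += 1
    for key, (f, n) in stars.items():
        table[key] = f / n if n else 0
    return table

summary = replay_suffix_star(MARKOV, ["A", "AD", "AC", "BD", "CD", "BC", "A*"])
print(summary)
```

Run on the example table with the deletion order from the slides, the sketch ends with the same seven entries as the final frame: B, C, D, AB, B*, *, and **.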

Global-*, No-* Summarization
- Global-*
  - Two *-paths, * and **
  - Deletes fewer paths than suffix-* to summarize the Markov table
- No-*
  - No *-paths
  - Conservatively assumes that paths not in the Markov table do not exist in the data

Data Sets for Experiments
- Synthetic data set
  - 100,000 XML elements
  - Path tree: 3197 nodes, 6 levels, 38 KB
  - Element frequencies: Zipfian (z=1)
- DBLP data set
  - 1,399,765 XML elements
  - Path tree: 5883 nodes, 6 levels, 69 KB

Query Workloads
- 1,000 paths of length between 1 and 4
- Random paths
  - All query paths exist in the data
- Random tags
  - Most query paths of length 2 or more do not exist in the data
- Available memory between 5 and 50 KB

Best Summarization Methods
- Path trees
  - Query paths in data: Global-*
  - Query paths not in data: No-*
- Markov tables
  - m = 2 is best
  - Query paths in data: Suffix-*
  - Query paths not in data: No-*

Path Trees vs. Markov Tables
- When to use path trees and when to use Markov tables?
- Also compared against Pruned Suffix Trees (PSTs) [Chen et al, ICDE 2001]
  - Can handle branching path expressions
  - Can handle conditions on element values

Synthetic Data – Random Paths
[Figure: absolute error (y-axis, 0 to 16) vs. available memory in KB (x-axis, 0 to 50) for Tree Global-*, Markov Suffix-*, and PST]

Synthetic Data – Random Tags
[Figure: absolute error (y-axis, 0 to 8) vs. available memory in KB (x-axis, 0 to 50) for Tree No-*, Markov No-*, and PST]

DBLP Data – Random Paths
[Figure: absolute error (y-axis, 0 to 100) vs. available memory in KB (x-axis, 0 to 50) for Tree Global-*, Markov Suffix-*, and PST]

DBLP Data – Random Tags
[Figure: absolute error (y-axis, 0 to 4) vs. available memory in KB (x-axis, 0 to 50) for Tree No-*, Markov No-*, and PST]

When are Markov Tables Better?
- DBLP: repeated sub-structures effectively captured by Markov tables
    <sigmod>
      <inproceedings>
        <author>…</author>
        …
      </inproceedings>
      …
    </sigmod>
    <vldb>
      <inproceedings>
        <author>…</author>
        …
      </inproceedings>
      …
    </vldb>

Conclusions
- Novel statistics for estimating the selectivity of XML path expressions
  - Scale to "all the XML data on the Internet"
  - More accurate than best previously known alternative
- Repeated sub-structures: Markov tables
- No repeated sub-structures: Path trees
- Query paths exist in the data: Global-*, Suffix-*
- Query paths do not exist in the data: No-*
- To appear in VLDB 2001