XML processing in DHT networks
Serge Abiteboul, Ioana Manolescu, Neoklis Polyzotis, Nicoleta Preda, Chong Sun
INRIA-Saclay & UC Santa Cruz
Outline
• Topic
• KadoP system
  – Overview of DHT
  – Query evaluation
• Optimization techniques
  a) DPP: Distributed postings partitioning
  b) Structural Bloom Filters
• Conclusion
Topic
Querying large volumes of content in a P2P network for a community of users
– Focus on indexing
– Content = XML
– P2P network = structured, built around a DHT
XML indexing in DHT networks
Example: Edos distribution system
A system for managing a Linux distribution (Mandriva)
System releases
– about 10,000 software packages + metadata (XML)
Community of open-source developers: thousands
Functionalities
– Publish/update releases
– Query the metadata
– Retrieve packages
The KadoP system
DHT: A P2P indexing infrastructure
[Figure: ring of peers storing [K, Object] pairs. The legend marks pointers in the finger table, look-up (K) from client ID0, and look-up (K) from client ID1. Peer labels: ID0 = x mod 2^4, ID1 = (x + 2^0) mod 2^4, ID2 = (x + 2^1) mod 2^4, ID4 = (x + 2^2) mod 2^4, ID8 = (x + 2^3) mod 2^4, ID15.]
Use a ring
─ each peer takes an ID in the space modulo 2^N
─ each peer stores (K, Object) pairs, for K satisfying: ID peer ≤ K < ID next peer
Which API?
─ locate (K) → Peer IP
─ get (K) → Object
─ put (K, Object)
Pastry
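The ring rule and the three API calls above can be illustrated with a small single-process sketch. It is not KadoP or Pastry code: the class `RingDHT`, the helper `finger_targets`, the use of SHA-1 to map keys to IDs, and the fact that `locate` returns a peer ID rather than an IP address are all assumptions made for illustration.

```python
import hashlib
from bisect import bisect_right


def finger_targets(x, bits=4):
    """IDs a peer with ID x points to: (x + 2^i) mod 2^bits, for i < bits."""
    return [(x + 2 ** i) % 2 ** bits for i in range(bits)]


class RingDHT:
    """Toy ring: every peer is a local dict, keyed by the peer's ID."""

    def __init__(self, peer_ids, bits=4):
        self.size = 2 ** bits                      # ID space is modulo 2^N
        self.peer_ids = sorted(set(peer_ids))
        self.storage = {pid: {} for pid in self.peer_ids}

    def key_id(self, key):
        # Assumption: keys are hashed with SHA-1 and reduced modulo 2^N.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.size

    def owner_of_id(self, k):
        # Peer with ID_peer <= k < ID_next_peer; index -1 wraps to the last peer.
        return self.peer_ids[bisect_right(self.peer_ids, k) - 1]

    def locate(self, key):
        return self.owner_of_id(self.key_id(key))

    def put(self, key, obj):
        self.storage[self.locate(key)][key] = obj

    def get(self, key):
        return self.storage[self.locate(key)].get(key)
```

For a peer with ID 0 in a 2^4 space, `finger_targets(0)` gives the IDs 1, 2, 4 and 8, which matches the labels in the figure. A real DHT would route look-ups hop by hop through such pointers instead of consulting a global sorted list.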
Advantages and Disadvantages
Advantages
• Availability and reliability
  – No centralization (no bottleneck), and replication
• Scalability
  – Scalable solution for keyword queries
Disadvantage
• Difficult to maintain the structure
  – Not suited for a transient population of peers
XML query processing in KadoP
Query evaluation:
Step 1.
• Given an XQuery Q, decompose Q into tree pattern queries
• Evaluate each tree pattern query using the DHT index to identify a set of candidate peers P that can provide answers
Step 2.
• Ship Q to these peers P and evaluate it there
Indexing XML documents
X ancestor of Y ↔ start(X) < start(Y) ≤ end(X)
X parent of Y ↔ X ancestor of Y and level(X) = level(Y) - 1
Posting = (peer, doc, start, end, level)
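The two structural tests translate directly into code. The sketch below is only an illustration: the `Posting` tuple and the function names are hypothetical, and the check that both postings come from the same peer and document is an assumption left implicit on the slide.

```python
from typing import NamedTuple


class Posting(NamedTuple):
    peer: str
    doc: str
    start: int
    end: int
    level: int


def is_ancestor(x: Posting, y: Posting) -> bool:
    # Assumption: positions are only comparable within one (peer, doc).
    same_doc = (x.peer, x.doc) == (y.peer, y.doc)
    return same_doc and x.start < y.start <= x.end


def is_parent(x: Posting, y: Posting) -> bool:
    return is_ancestor(x, y) and x.level == y.level - 1
```

For example, with a = (p1, d1, 1, 8, 1), b = (p1, d1, 2, 5, 2) and c = (p1, d1, 3, 4, 3), a is an ancestor of both b and c, but a parent of b only.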
[Figure: tree of Doc.xml with elements A, B, C, D, E, F, G and the keyword “John”; each node is annotated with its start and end positions]
Publish them via a DHT
– put (k, postings), where k is a label or a keyword
Remark: all the postings for author accumulate at the same peer
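To make the publication step concrete, the following sketch derives postings from a small XML document and groups them by key (element label or keyword), which is the shape that put (k, postings) expects. The numbering scheme is an assumption chosen to agree with the ancestor test above: start is the pre-order rank of a node, and end is the largest rank found in its subtree. The function name `extract_postings` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def extract_postings(peer, doc, xml_text):
    """Return {key: [(peer, doc, start, end, level), ...]} for one document."""
    index = defaultdict(list)
    counter = [0]  # last position handed out

    def add_words(text, level):
        # Every word becomes a leaf posting with start == end.
        for word in (text or "").split():
            counter[0] += 1
            index[word].append((peer, doc, counter[0], counter[0], level))

    def visit(elem, level):
        counter[0] += 1
        start = counter[0]
        add_words(elem.text, level + 1)
        for child in elem:
            visit(child, level + 1)
            add_words(child.tail, level + 1)
        index[elem.tag].append((peer, doc, start, counter[0], level))

    visit(ET.fromstring(xml_text), 1)
    for postings in index.values():
        postings.sort()  # keep each list ordered by (peer, doc, start)
    return dict(index)
```

Each entry of the returned dictionary would then be handed to the DHT, one put per key, so the lists for a frequent label such as author grow at a single peer.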
XML indexing in DHT
[Figure: peers p1 and p2 publish through the DHT with put(author;[p1,d2,start,end,lev]) and put(author;[p2,d2,start,end,lev]); peer p(author) holds the posting list for author]
Some technical issues
Goal: manage millions of documents with thousands of peers
First experiments were a disaster
First improvements
• Replace the index storage of the DHT in a file system by storage in a database (Berkeley DB)
• Extend the API of the DHT with Append and not only Read/Write
• Extend the API of the DHT with a streaming exchange of postings
With this, KadoP scaled but was slow due to e.g. long postings
Optimization
Main issue: long postings
Transfer of long posting lists hurts performance
• Bad response time
  – Parallelization: Distributed Posting Partitioning (DPP)
• Communication load
  – Bloom filter: Structural Bloom Filter
[Figure: long posting list for “Name”, stored at peer p(Name)]
DPP structure
– Split and distribute postings according to conditions
– Each condition is an interval: C1 = [(p1,d1), (p2,d2)]
– Any two conditions are over disjoint intervals
Some kind of B-tree for postings
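A minimal sketch of this partitioning, under stated assumptions: postings are tuples beginning with (peer, doc), a block is closed once it holds `max_block` postings, and a block boundary is only placed between two different (peer, doc) values so that the closed intervals stay disjoint. The names `partition_postings` and `find_block` are hypothetical, and in a real deployment each block would live on a different peer.

```python
def partition_postings(postings, max_block):
    """Split a posting list into blocks, each guarded by a condition.

    Returns a list of ((low, high), block) where low and high are
    (peer, doc) pairs and the intervals [low, high] are pairwise disjoint.
    """
    blocks, current = [], []
    for posting in sorted(postings):
        key = posting[:2]
        # Cut only when the block is full AND the (peer, doc) value changes.
        if len(current) >= max_block and current[-1][:2] != key:
            blocks.append(current)
            current = []
        current.append(posting)
    if current:
        blocks.append(current)
    return [((block[0][:2], block[-1][:2]), block) for block in blocks]


def find_block(dpp, key):
    """Return the block whose condition contains the (peer, doc) key."""
    for (low, high), block in dpp:
        if low <= key <= high:
            return block
    return None
```

The list of conditions plays the role of the root of the B-tree-like structure: it is small, stays at p(Name), and tells a reader which block to fetch for a given (peer, doc).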
[Figure: p(Name) with conditions C1, C2, C3, C4, delimited by (p1,d1), (p2,d2), (p3,d3), (p4,d4); (p,d) denotes the peer and document of a posting]
Query processing (no DPP)
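[Figure: tree pattern query with root article and branches author (“Ullman”) and abstract (“database”); posting streams for article, author, abstract, Ullman and database, each ranging from 0 to ∞, flow from the index to the QP peer]
– Pipeline transfers of postings to the query processing (QP) peer
– Holistic twig-join algorithm to compute the result in parallel at the QP peer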
Query processing with DPP
[Figure: query abstract containing “XML”; the conditions C1 to C5 of the two posting lists are collected at p(client)]
– Fetch from p(abstract) and p(XML) the conditions C1-C5
– Prune intervals
– Transfer and compute in parallel the join for each sub-interval
Conditions are sorted according to (p,d)
At p(client)
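The pruning step can be sketched as a merge over the two sorted condition lists: only pairs of conditions whose intervals overlap can contribute to the join, so only those blocks need to be transferred. This is an illustration under assumptions, not the KadoP implementation; the function name `overlapping_pairs` is hypothetical, and conditions are modelled as (low, high) pairs of (peer, doc) tuples.

```python
def overlapping_pairs(conds_a, conds_b):
    """Return index pairs (i, j) such that conds_a[i] overlaps conds_b[j].

    Both inputs must be sorted and made of pairwise disjoint intervals.
    """
    pairs = []
    i = j = 0
    while i < len(conds_a) and j < len(conds_b):
        low_a, high_a = conds_a[i]
        low_b, high_b = conds_b[j]
        if max(low_a, low_b) <= min(high_a, high_b):
            pairs.append((i, j))
        # Advance the interval that ends first; it cannot overlap anything later.
        if high_a <= high_b:
            i += 1
        else:
            j += 1
    return pairs
```

Each surviving pair describes an independent sub-join, which is what allows the joins to be computed in parallel.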
Experiments
• Platform
  – Grid5000: P2P platform for research in P2P systems
  – Distributed geographically across 6 sites in France
• KadoP tested on more than 100 machines
  – 1000 logical peers
• Conclusions in brief
  – Good performance
  – KadoP scales very nicely
  – Issue: does not support high churn of peers (index copying)
Query response time
Q=article//author//Ullman
Optimization
(b) Structural Bloom Filters
– Ancestor Bloom Filter
– Also in the paper: Descendant Bloom Filter
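Using an Ancestor Bloom Filter
Query: a//b
– Compute the Bloom Filter of the a-postings, ABF(a), and send it to p(b)
– At p(b), compute F(b, ABF(a)): the b-postings that have an a-ancestor (and possibly more, because of false positives)
– Send the result to p(a), which can compute the answer
[Figure: over the DHT, p(a) holds L(a) and sends ABF(a) to p(b); p(b) holds L(b) and returns F(b, ABF(a))]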
Technique: dyadic intervals
[Figure: dyadic intervals over positions 1 to 8, by size. 2^0: [1,1] [2,2] [3,3] [4,4] [5,5] [6,6] [7,7] [8,8]; 2^1: [1,2] [3,4] [5,6] [7,8]; 2^2: [1,4] [5,8]; 2^3: [1,8]. The posting ap is drawn from its start to its end over positions 1 to 7; bp is marked at position 3.]
Dyadic covers
– D(ap) = {[1,4], [5,6], [7,7]}
– ap is an ancestor of bp if start(bp) ∈ δ for some δ ∈ D(ap)
– Here 3 ∈ [1,4], so the answer is yes!
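The dyadic cover of an interval can be computed greedily, always taking the largest aligned dyadic interval that starts at the current position and still fits. The sketch below assumes 1-based positions, as in the figure; the function names are hypothetical.

```python
def dyadic_cover(start, end):
    """Minimal list of dyadic intervals that exactly covers [start, end]."""
    cover = []
    pos = start
    while pos <= end:
        size = 1
        # Double the size while the interval stays aligned and inside [pos, end].
        while (pos - 1) % (2 * size) == 0 and pos + 2 * size - 1 <= end:
            size *= 2
        cover.append((pos, pos + size - 1))
        pos += size
    return cover


def dyadic_intervals_containing(pos, max_level):
    """All dyadic intervals of size 2^0 .. 2^max_level that contain pos."""
    result = []
    for level in range(max_level + 1):
        size = 2 ** level
        low = ((pos - 1) // size) * size + 1
        result.append((low, low + size - 1))
    return result
```

With these helpers, `dyadic_cover(1, 7)` yields [1,4], [5,6], [7,7], and position 3 belongs to [3,3], [3,4], [1,4] and [1,8]. The two sets share [1,4], which is the containment that the slide example relies on.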
Ancestor Bloom Filter (simplified)
• Publication: for each d, each ap in d, and each δ ∈ D(ap)
  – Insert a trace in the Bloom Filter
    • Say T[h(d,δ)] = 1 for some hash function h
• Test: for bp in d,
  – for each dyadic interval δ s.t. start(bp) ∈ δ, test if T[h(d,δ)] = 1
  – If one test is positive, conclude bp in d is a solution
• False positives because of hash collisions
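The simplified filter can be sketched as follows. This is an illustration under assumptions, not the structure from the paper: it uses a single hash function built on SHA-1, a plain Python list as the bit table T, 1-based positions, and a fixed `max_level` bounding the largest dyadic interval that is probed. All class and function names are hypothetical.

```python
import hashlib


def dyadic_cover(start, end):
    """Minimal list of dyadic intervals that exactly covers [start, end]."""
    cover, pos = [], start
    while pos <= end:
        size = 1
        while (pos - 1) % (2 * size) == 0 and pos + 2 * size - 1 <= end:
            size *= 2
        cover.append((pos, pos + size - 1))
        pos += size
    return cover


def dyadic_intervals_containing(pos, max_level):
    """All dyadic intervals of size 2^0 .. 2^max_level that contain pos."""
    out = []
    for level in range(max_level + 1):
        size = 2 ** level
        low = ((pos - 1) // size) * size + 1
        out.append((low, low + size - 1))
    return out


class AncestorBloomFilter:
    def __init__(self, num_bits=1024, max_level=16):
        self.num_bits = num_bits
        self.max_level = max_level
        self.table = [False] * num_bits  # the bit table T

    def _h(self, doc, interval):
        data = f"{doc}|{interval[0]}|{interval[1]}".encode("utf-8")
        return int.from_bytes(hashlib.sha1(data).digest(), "big") % self.num_bits

    def publish(self, doc, start, end):
        # One trace per interval of the dyadic cover of [start, end].
        for interval in dyadic_cover(start, end):
            self.table[self._h(doc, interval)] = True

    def may_have_ancestor(self, doc, start):
        # True if some dyadic interval containing `start` left a trace.
        return any(self.table[self._h(doc, interval)]
                   for interval in dyadic_intervals_containing(start, self.max_level))


def reduce_postings(b_postings, abf):
    """F(b, ABF(a)): keep the (doc, start) pairs that pass the filter."""
    return [(doc, start) for doc, start in b_postings
            if abf.may_have_ancestor(doc, start)]
```

A positive answer is only a hint, since two different (d, δ) pairs may hash to the same bit; a negative answer is definitive, which is what makes the filter safe for pruning b-postings before they are shipped.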
Query evaluation strategies
[Figure: Ancestor Bloom Reducer. Labels: p(a), ABF(a), p(b), b’ = F(b, ABF(a)), ABF(b’), p(c), c’ = F(c, ABF(b’)), p(d), d’ = F(d, ABF(b’))]
[Figure: Descendant Bloom Reducer. Labels: p(a), DBF(b’’), p(b), b’’ = F(b, DBF(c) ^ DBF(d)), DBF(c), p(c), p(d)]
Performances
[Figure: performance comparison of Postings, DB Filter and AB Filter]
Conclusion
Related works
• Very active area
• DHT-based platforms for XML data management
  – Locating data sources (Galanis et al., VLDB03)
  – XPath lookup queries in P2P networks (Bonifati et al., WIDM04)
• Other DHT-based systems for data management
  – PIER query processor (Huebsch et al., CIDR05)
  – Indexing in P2P networks (Aberer et al., VLDB05)
• Dyadic intervals
  – Maintenance of dynamic intervals (Gilbert et al., VLDB02)
Contribution
• Two optimization techniques for index processing
  – Distributed Posting Partitioning
  – Structural Bloom Filters
• A full system for P2P XML indexing
  – As opposed to some simulation
  – Lots of engineering details that are important for performance
  – Extensively tested for performance
  – Tested with a real application, EDOS
On-going and future work
• New indexing techniques
  – Trading off precision for performance
    • Publish summarizations of documents
    • Index/transfer postings at a coarse level of detail
  – Index views (query caching)
• Query optimizer for KadoP
  – This is standard distributed query processing
  – Use standard optimization techniques, e.g., use the OptiMax ActiveXML optimizer (demo in ICDE08)
  – Develop what is specific for KadoP: cost model
Thank you
Indexing time