Complex Queries in DHT-based Peer-to-Peer Networks Matthew Harren, Joe Hellerstein, Ryan Huebsch,...

Complex Queries in DHT-based Peer-to-Peer Networks

Matthew Harren, Joe Hellerstein, Ryan Huebsch, Boon Thau Loo, Scott Shenker, Ion Stoica

[email protected] Berkeley, CS Division

IPTPS 3/8/02

2

Outline
- Contrast P2P & DB systems
- Motivation
- Architecture
  - DHT Requirements
  - Query Processor
- Current Status
- Future Research

3

[Figure: Query Processor (SQL, Relational Data, Joins, Predicates, Group By, Aggregation) combined with DHT (CAN, Chord, Tapestry, Pastry)]

Uniting DHTs and Query Processing…

4

P2P & DB Systems

P2P:
- Flexibility
- Decentralized
- Fault Tolerance
- Lightweight

DB:
- Strong Semantics
- Powerful query facilities
- Transactions & Concurrency Control

5

P2P + DB = ?
- P2P Database? No!
  - ACID transactional guarantees do not scale, nor does the everyday user want ACID semantics
  - Much too heavyweight of a solution for the everyday user
- Query Processing on P2P!
  - Both P2P and DBs do data location and movement
  - Can be naturally unified (lessons in both directions)
  - P2P brings scalability & flexibility
  - DB brings relational model & query facilities

6

P2P Query Processing (Simple) Example

Filesharing+

    SELECT song, size, server…
    FROM album, song
    WHERE album.ID = song.albumID
      AND album.name = 'Rubber Soul'

Keyword searching is ONE canned SQL query. Imagine what else you could do!

7

P2P Query Processing (Simple) Example

Filesharing+

    SELECT song, size, server…
    FROM album-ngrams AN, song
    WHERE AN.ID = song.albumID
      AND AN.ngram IN <list of search ngrams>
    GROUP BY AN.ID
    HAVING COUNT(AN.ngram) >= <# of ngrams in search>

Keyword searching is ONE canned SQL query. Imagine what else you could do!

Fuzzy Searching, Resource Discovery, Enhanced DNS
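
To make the n-gram query concrete, here is a minimal sketch that runs it against a local in-memory SQLite database rather than a DHT. The table name `album_ngrams`, the sample rows, the choice of 3-character n-grams, and the helper names `ngrams` and `search` are all assumptions for illustration. The GROUP BY / HAVING step is moved into a subquery so that the statement is valid SQL and each matching song is returned as its own row.

```python
import sqlite3

def ngrams(text, n=3):
    """Return the set of lowercase character n-grams of text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE album (ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE album_ngrams (ID INTEGER, ngram TEXT);
    CREATE TABLE song (song TEXT, size INTEGER, server TEXT, albumID INTEGER);
""")

# Hypothetical sample data.
albums = [(1, "Rubber Soul"), (2, "Revolver")]
songs = [
    ("Drive My Car", 2400, "hostA", 1),
    ("Michelle", 2600, "hostB", 1),
    ("Taxman", 2500, "hostC", 2),
]
conn.executemany("INSERT INTO album VALUES (?, ?)", albums)
conn.executemany("INSERT INTO song VALUES (?, ?, ?, ?)", songs)
for album_id, name in albums:
    conn.executemany(
        "INSERT INTO album_ngrams VALUES (?, ?)",
        [(album_id, g) for g in ngrams(name)],
    )

def search(conn, term):
    """Return songs on albums whose name contains every n-gram of term."""
    grams = sorted(ngrams(term))
    if not grams:
        return []
    placeholders = ",".join("?" * len(grams))
    sql = f"""
        SELECT song, size, server
        FROM song
        WHERE albumID IN (
            SELECT AN.ID
            FROM album_ngrams AN
            WHERE AN.ngram IN ({placeholders})
            GROUP BY AN.ID
            HAVING COUNT(DISTINCT AN.ngram) >= ?)
        ORDER BY song
    """
    return conn.execute(sql, grams + [len(grams)]).fetchall()
```

Calling `search(conn, "rubber")` returns the two songs stored for "Rubber Soul". Lowering the HAVING threshold below the number of search n-grams would be one way to get the fuzzy matching the slide mentions.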

8

What this project IS and IS NOT about…
- IS NOT ABOUT: Absolute Performance
  - In most situations a centralized solution could be faster…
- IS ABOUT: Decentralized Features
  - No administrator, anonymity, shared resources, tolerates failures, resistant to censorship…
- IS NOT ABOUT: Replacing RDBMS
  - Centralized solutions still have their place for many applications (commercial records, etc.)
- IS ABOUT: Research synergies
  - Unifying/morphing design principles and techniques from DB and NW communities

9

General Architecture

Based on Distributed Hash Tables (DHT) to get many good networking properties

A query processor is built on top

Note: the data is stored separately from the query engine, not a standard DB practice!

10

DHT – API
- Basic API
  - publish(RID, object)
  - lookup(RID)
  - multicast(object)
- NOTE: Applications can only fetch-by-name… a very limited query language!
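
As a rough illustration of the three calls, the sketch below simulates a DHT inside one process. The class name `ToyDHT`, the fixed node count, and the hash-modulo placement are assumptions made for brevity; a real DHT such as CAN or Chord routes messages between machines and handles nodes joining and leaving.

```python
import hashlib

class ToyDHT:
    """In-process stand-in for a DHT: a fixed list of nodes, each with a local store."""

    def __init__(self, num_nodes=4):
        self.stores = [dict() for _ in range(num_nodes)]
        self.inboxes = [[] for _ in range(num_nodes)]

    def _owner(self, rid):
        # Hash the RID to pick the responsible node (simplified placement).
        digest = hashlib.sha1(str(rid).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.stores)

    def publish(self, rid, obj):
        """Store obj at the node responsible for rid."""
        self.stores[self._owner(rid)][rid] = obj

    def lookup(self, rid):
        """Fetch by name: return the object stored under rid, or None."""
        return self.stores[self._owner(rid)].get(rid)

    def multicast(self, obj):
        """Deliver obj to every node."""
        for inbox in self.inboxes:
            inbox.append(obj)
```

Note that the only way to read data back is `lookup` with an exact RID, which is the "fetch-by-name" limitation the slide points out.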

11

DHT – API Enhancements I
- Basic API
  - publish(namespace, RID, object)
  - lookup(namespace, RID)
  - multicast(namespace, object)
- Namespaces: subsets of the ID space for logical and physical data partitioning

12

DHT – API Enhancements II
- Additions
  - lscan(namespace) – retrieve the data stored locally from a particular namespace
  - newData(namespace) – receive a callback when new data is inserted into the local store for the namespace
- This violates the abstraction of location independence
  - Why necessary? Parallel scanning of base relation
  - Why acceptable? Access is limited to reading; applications cannot control the location of data
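
The enhanced API can be sketched in the same in-process style. Everything here is an assumption for illustration: the class names `Node` and `NamespacedDHT`, hashing the namespace together with the RID, and modelling `newData` as a registered Python callable (`new_data`).

```python
import hashlib
from collections import defaultdict

class Node:
    """One DHT node with a per-namespace local store."""

    def __init__(self):
        self.store = defaultdict(dict)      # namespace -> {rid: object}
        self.callbacks = defaultdict(list)  # namespace -> [callable]

    def lscan(self, namespace):
        """Return the objects of a namespace that are stored on this node."""
        return list(self.store[namespace].values())

    def new_data(self, namespace, callback):
        """Register callback(rid, obj) for future local inserts into namespace."""
        self.callbacks[namespace].append(callback)

    def _insert(self, namespace, rid, obj):
        self.store[namespace][rid] = obj
        for callback in self.callbacks[namespace]:
            callback(rid, obj)

class NamespacedDHT:
    def __init__(self, num_nodes=4):
        self.nodes = [Node() for _ in range(num_nodes)]

    def _owner(self, namespace, rid):
        # Placement depends on both namespace and RID (simplified).
        key = f"{namespace}/{rid}".encode("utf-8")
        digest = hashlib.sha1(key).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def publish(self, namespace, rid, obj):
        self._owner(namespace, rid)._insert(namespace, rid, obj)

    def lookup(self, namespace, rid):
        return self._owner(namespace, rid).store[namespace].get(rid)
```

A parallel scan of a base relation then amounts to every node calling `lscan` on its own store. The application reads whatever happens to be local, and it still has no say in where `publish` places a tuple.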

13

Query Processor (QP) Architecture
- QP is just another application as far as the DHT is concerned…
  - DHT objects = QP tuples
- User applications can use QP to query data using a subset of SQL
  - Select
  - Project
  - Joins
  - Group By / Aggregate
- Data can be metadata (for a file sharing type application) or entire records; the mechanisms are the same

14

Indexes. The lifeblood of a database engine.
- DHT's mapping of RID/Object is equivalent to an index
- Additional indexes are created by adding another key/value pair with the key being the value of the indexed field(s) and value being a 'pointer' to the object (the RID or primary key)

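[Figure: Primary Index: the primary DHT maps PKey to Data. Secondary Index: a DHT in the index namespace (Index NS) maps Key to Ptr, and the Ptr is resolved through the primary DHT (PKey to Data).]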
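
The indexing scheme above can be sketched as two namespaces in a toy key/value store. The class `TinyDHT`, the namespace names `"song"` and `"song.albumID"`, and the helper functions are hypothetical; the store is a plain dictionary standing in for the DHT.

```python
from collections import defaultdict

class TinyDHT:
    """Dictionary-backed stand-in for a namespaced DHT (several objects per RID)."""

    def __init__(self):
        self.data = defaultdict(dict)  # namespace -> {rid: [objects]}

    def publish(self, namespace, rid, obj):
        self.data[namespace].setdefault(rid, []).append(obj)

    def lookup(self, namespace, rid):
        return list(self.data[namespace].get(rid, []))

def publish_song(dht, song):
    # Primary index: the RID is the primary key, the object is the tuple.
    dht.publish("song", song["id"], song)
    # Secondary index: the key is the indexed field's value,
    # the value is a pointer (the primary key) back to the tuple.
    dht.publish("song.albumID", song["albumID"], song["id"])

def find_songs_by_album(dht, album_id):
    """Follow secondary-index pointers, then fetch each tuple by primary key."""
    pointers = dht.lookup("song.albumID", album_id)
    return [dht.lookup("song", pkey)[0] for pkey in pointers]
```

An indexed read costs two lookups (index entry, then tuple), which is the price of keeping only pointers in the secondary index instead of full copies of the data.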

15

Relational Algorithms
- Selection/Projection
- Join Algorithms
  - Symmetric Hash
    - Use lscan on tables R & S. Republish tuples in a temporary namespace using the join attributes as the RID. Nodes in the temporary namespace perform mini-joins locally as tuples arrive and forward results to the requestor.
  - Fetch Matches
    - If there is an index on the join attribute(s) for one table (say R), use lscan for the other table (say S) and then issue a lookup probing for matches in R.
  - Semi-Join like algorithms
  - Bloom-Join like algorithms
- Group-By (Aggregation)
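
The two join strategies described above can be sketched in a single process. This is a simulation under stated assumptions, not the PIER implementation: `JoinNode` stands for a node in the temporary namespace, Python's built-in `hash` stands in for DHT routing, and `r_lookup` stands in for a DHT lookup against an index on R. All names are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest

class JoinNode:
    """A node in the temporary namespace: one hash table per input relation."""

    def __init__(self):
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}

    def receive(self, side, key, tup):
        """Insert the arriving tuple, probe the other side, return new matches."""
        other = "S" if side == "R" else "R"
        self.tables[side][key].append(tup)
        matches = self.tables[other].get(key, [])
        if side == "R":
            return [(tup, m) for m in matches]
        return [(m, tup) for m in matches]

def symmetric_hash_join(r_tuples, s_tuples, r_key, s_key, num_nodes=4):
    """Rehash both relations on the join attribute and join as tuples arrive."""
    nodes = [JoinNode() for _ in range(num_nodes)]
    results = []
    # Interleave arrivals to mimic both scans running at the same time.
    for r, s in zip_longest(r_tuples, s_tuples):
        for side, tup, attr in (("R", r, r_key), ("S", s, s_key)):
            if tup is None:
                continue
            key = tup[attr]
            node = nodes[hash(key) % num_nodes]  # stand-in for DHT routing
            results.extend(node.receive(side, key, tup))
    return results

def fetch_matches_join(s_tuples, s_key, r_lookup):
    """Scan S and probe an existing index on R for each tuple."""
    results = []
    for s in s_tuples:
        for r in r_lookup(s[s_key]):
            results.append((r, s))
    return results
```

Symmetric hash moves both relations across the network but needs no index. Fetch Matches moves nothing from R up front, at the cost of one lookup per scanned S tuple.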

16

Interesting note…
- The state of the join is stored in the DHT store
- Rehashed data is automatically re-routed to the proper node if the coordinate space is adjusted
- When a node splits (to accept a new node into the network) the data is also split; this includes previously delivered rehashed tuples
- Allows for graceful re-organization of the network that does not interfere with ongoing operations
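
A one-dimensional sketch of this behaviour follows, assuming a CAN-like space reduced to the interval [0, 1). The names `Zone`, `Space`, `point`, and `split` are hypothetical, and a real CAN uses a multi-dimensional coordinate space with neighbour-to-neighbour routing. The point is only that join state lives in the store and follows its key when a zone is split.

```python
import hashlib

def point(rid):
    """Map an RID to a coordinate in [0, 1)."""
    digest = hashlib.sha1(str(rid).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32

class Zone:
    """The slice [lo, hi) of the coordinate space owned by one node."""

    def __init__(self, lo, hi):
        self.lo, self.hi, self.store = lo, hi, {}

    def owns(self, p):
        return self.lo <= p < self.hi

class Space:
    def __init__(self):
        self.zones = [Zone(0.0, 1.0)]  # a single node owns everything at first

    def _owner(self, rid):
        p = point(rid)
        return next(z for z in self.zones if z.owns(p))

    def publish(self, rid, obj):
        self._owner(rid).store.setdefault(rid, []).append(obj)

    def lookup(self, rid):
        return list(self._owner(rid).store.get(rid, []))

    def split(self, index):
        """A new node joins: halve zone `index` and hand over the upper half's data."""
        old = self.zones[index]
        mid = (old.lo + old.hi) / 2
        new = Zone(mid, old.hi)
        old.hi = mid
        for rid in [r for r in old.store if not old.owns(point(r))]:
            new.store[rid] = old.store.pop(rid)
        self.zones.append(new)
        return new
```

Tuples rehashed before the split stay reachable afterwards, and tuples published after the split land next to their earlier partners, so an in-progress join is not disturbed by the reorganization.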

17

Where we are…
- A working real implementation of our query processor (currently named PIER) on top of a CAN simulator
- Initial work studying and analyzing algorithms… nothing really ground-breaking… YET!
- Analyzing the design space and which problems seem most interesting to pursue

18

Where to go from here?
- Common Issues:
  - Caching – Both at DHT and QP levels
  - Using Replication – for speed and fault tolerance (both in data and computation)
  - Security
- Database Issues:
  - Pre-computation of (intermediate) results
  - Continuous queries/alerters
  - Query optimization (Is this like network routing?)
  - More algorithms, Dist-DBMS have more tricks
  - Performance Metrics for P2P QP Systems
- What are the new apps the system enables?
