Post on 01-Apr-2015

Searching and Data Sharing in P2P Systems

Beng Chin Ooi
Department of Computer Science
National University of Singapore

ooibc@comp.nus.edu.sg
www.comp.nus.edu.sg/~ooibc

Acknowledgement

A few slides are borrowed or adapted from Hellerstein’s group and his VLDB’04 tutorial slides

Some are screen dumps as examples

What is P2P?

Client Server Architecture

Peer-to-Peer Architecture

P2P Systems?

Effective use of the Internet: connected PCs/workstations participate directly in the Internet

Sites are autonomous

Similar functionalities and responsibilities

Each peer consumes and serves

Resources are distributed

Driving Forces

Main driving forces:
  Exploiting existing resources
  Computational efficiency is not the main goal
  Sharing costs among users
  Autonomy
  Anonymity
  Legal protection

P2P Systems

“A class of applications that takes advantage of resources like storage, CPU cycles, content and even human presence available at the edges of the Internet” -- Clay Shirky, an investment advisor

P2P Applications

[Figure: P2P applications. Labels: Instant Messaging, File Sharing, Groupware, Others, Computation, Bandwidth, Storage, Resource Utilisation, Collaboration]

Peer-to-Peer Applications

[Figure: example applications. Labels: Freenet, P2P Messenger, Upriser, Groove, SETI, Folding@home]

Properties of P2P Applications?

Dynamic and Self-Organizing
Enduring
Resilient
Collaborative

P2P Future

Aberdeen Group’s prediction:
  US$930 million by end of 2004
  From US$20.6 at end of 2000

Standardization
  NPI (New Productivity Initiative)
  Peer-to-Peer Working Group (P2PWG)
    NAT, Taxonomy, Security, File Services, Interoperability

Overlay Networks

P2P applications need to:
  Track identities & (IP) addresses of peers
    May be many!
    May have significant churn
    Best not to have n^2 ID references
  Route messages among peers
    If you don’t keep track of all peers, this is “multi-hop”

This is an overlay network
  Peers are doing both naming and routing
  IP becomes “just” the low-level transport
    All the IP routing is opaque
  Control over naming and routing is powerful
    And as we’ll see, brings networks into the database era

Infecting the Network, Peer-to-Peer

The Internet is hard to change. But Overlay Nets are easy!

  P2P is a wonderful “host” for infecting network designs
  The “next” Internet is likely to be very different
    “Naming” is a key design issue today
    Querying and data independence key tomorrow?

Don’t forget:
  The Internet was originally an overlay on the telephone network
  There is no money to be made in the bit-shipping business

A modest goal for DB research: Don’t query the Internet.

The Evolution of P2P systems

First generation: centralized P2P systems
  E.g. Napster, SETI@home
Second generation: decentralized & unstructured P2P systems
  E.g. Gnutella
Third generation: structured P2P systems
  DHT systems (CAN/Chord/Pastry/Tapestry)
  Skip-list based systems
  …

Unstructured P2P Systems

P2P with Central Servers
P2P with fully Autonomous Peers (pure P2P)
P2P with Superpeers (SuperNodes)

Unstructured Centralized P2P Systems -- Napster

Searching is efficient, with only a few messages exchanged;

Non-scalable, a central point of failure;

[Figure: Peer A asks the Directory Server “Who has X?”; the server answers “B has X”; A sends “Get X” to Peer B; B replies with X.]
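The lookup in the Napster figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `DirectoryServer` and `Peer` and the methods `publish`, `who_has` and `get` are hypothetical and do not come from the Napster protocol. It shows why search is cheap (one query to the server, one direct transfer) and why the server is a single point of failure (every lookup goes through it).

```python
class DirectoryServer:
    """Central index mapping file names to the peers that share them (hypothetical)."""

    def __init__(self):
        self.index = {}  # file name -> set of peer names

    def publish(self, peer, filename):
        self.index.setdefault(filename, set()).add(peer)

    def who_has(self, filename):
        return sorted(self.index.get(filename, ()))


class Peer:
    def __init__(self, name, files=None):
        self.name = name
        self.files = dict(files or {})  # file name -> content

    def get(self, filename):
        # Direct peer-to-peer transfer; the directory server is not involved.
        return self.files[filename]


server = DirectoryServer()
peers = {"A": Peer("A"), "B": Peer("B", {"X": b"contents of X"})}
for peer in peers.values():
    for filename in peer.files:
        server.publish(peer.name, filename)

holders = server.who_has("X")       # A asks the server "Who has X?"
data = peers[holders[0]].get("X")   # A sends "Get X" straight to B
```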

Harnessing Idle CPU Cycles – SETI@HOME

[Figure: Peers A, B, C, D and E, each connected to a central data source.]

Unstructured Fully Decentralized -- Gnutella

Searching is inherently flooding (unscalable);
Time-to-Live (TTL) is used to partially address this problem;
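A minimal sketch of TTL-limited flooding, assuming a small hand-made topology (the `graph` and `files` dictionaries below are illustrative, not taken from the slides). It counts one message per forwarded copy, which makes the cost of flooding visible, and shows that a TTL that is too small misses a file that exists in the network.

```python
from collections import deque


def flood_search(graph, files, start, wanted, ttl):
    """Breadth-first flood from `start`; each hop uses up one unit of TTL.

    Returns (peers holding `wanted` that were reached, messages sent).
    """
    seen = {start}
    hits = []
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if wanted in files.get(node, ()):
            hits.append(node)
        if remaining == 0:
            continue  # TTL exhausted: do not forward any further
        for neighbour in graph[node]:
            messages += 1  # every forwarded copy costs a message
            if neighbour not in seen:  # duplicate copies are dropped
                seen.add(neighbour)
                queue.append((neighbour, remaining - 1))
    return hits, messages


# Hypothetical topology: only E holds file X, three hops away from A.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
         "D": ["B", "C", "E"], "E": ["D"]}
files = {"E": {"X"}}

missed = flood_search(graph, files, "A", "X", ttl=2)
found = flood_search(graph, files, "A", "X", ttl=3)
```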

Techniques for improving search in Gnutella-like networks

Expanding Ring;
Random Walks;
Good Peer;
Local indices;
Routing indices;
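As one example from this list, an expanding-ring search can be sketched as repeated searches with a growing TTL that stop at the first ring containing a hit. The helper names, the TTL schedule `(1, 2, 4)` and the chain topology are assumptions made for illustration.

```python
def nodes_within(graph, start, ttl):
    """Peers reachable from `start` in at most `ttl` hops."""
    frontier, seen = {start}, {start}
    for _ in range(ttl):
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return seen


def expanding_ring(graph, files, start, wanted, ttls=(1, 2, 4)):
    """Retry the search with a growing TTL; stop at the first ring with a hit."""
    for ttl in ttls:
        hits = sorted(p for p in nodes_within(graph, start, ttl)
                      if wanted in files.get(p, ()))
        if hits:
            return ttl, hits
    return None, []


# Hypothetical chain A - B - C - D - E; both C and E hold file X.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D"]}
holdings = {"C": {"X"}, "E": {"X"}}

ring_result = expanding_ring(chain, holdings, "A", "X")
```

The nearer copy at C is found with TTL 2, so the more distant copy at E is never contacted; this is the saving the technique aims for when popular files are well replicated.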

Freenet

[Figure: network of Peers A, B, C, D and E. Query: “Who has file X”. File X is downloaded from Peer E.]

Worst Case for Freenet

[Figure: network of Peers A, B, C, D, E and F. The query ends with “NOT FOUND!” while Peer F announces “I HAVE FILE X!”.]

Peer F has the requested file, but the query never reaches it because of a poor routing decision made at Peer D, so the query goes unmatched.

In this case, the query is rerouted along an alternate path.
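The rerouting behaviour can be illustrated with a depth-first search that tries neighbours in preference order and backtracks when a branch fails. This is a simplified sketch, not Freenet's actual routing: the topology, the preference order (list order) and the use of `hops_to_live` as a plain depth limit are all assumptions.

```python
def routed_search(graph, files, node, wanted, hops_to_live, visited=None):
    """Depth-first search with backtracking.

    Neighbours are tried in list order (the node's routing preference).
    Returns the path to a peer holding `wanted`, or None.
    """
    visited = visited if visited is not None else set()
    visited.add(node)
    if wanted in files.get(node, ()):
        return [node]
    if hops_to_live == 0:
        return None
    for neighbour in graph[node]:
        if neighbour in visited:
            continue
        path = routed_search(graph, files, neighbour, wanted,
                             hops_to_live - 1, visited)
        if path is not None:
            return [node] + path
    return None  # dead end: the caller backtracks and tries its next neighbour


# Hypothetical topology: D prefers E, but only F holds file X.
net = {"A": ["B"], "B": ["A", "D"], "D": ["B", "E", "F"],
       "E": ["D"], "F": ["D"]}
store = {"F": {"X"}}

path_found = routed_search(net, store, "A", "X", hops_to_live=3)
path_missed = routed_search(net, store, "A", "X", hops_to_live=2)
```

With enough hops the query fails at E, backtracks to D and then succeeds at F; with too few hops it never gets past D and the file is reported as not found even though it exists.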

Unstructured P2P with Supernodes

Combine the benefits of centralized and decentralized search;

Take advantage of the heterogeneity of peer capabilities;

Morpheus

[Figure: three clusters of peers (A to I), each with a central index for its cluster, linked through a supernode layer. Query: “Who has file X”. Reply: “Peer H has file X”. File X is downloaded from Peer H.]
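A rough sketch of the supernode idea: each supernode keeps a central index for its own cluster, and a query that misses locally is passed to the other supernodes only. The class `Supernode`, its methods and the cluster membership below are hypothetical and are not Morpheus's actual protocol.

```python
class Supernode:
    """Indexes the files shared by the peers of its own cluster (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.index = {}       # file name -> name of the peer holding it
        self.neighbours = []  # the other supernodes in the supernode layer

    def register(self, peer, filenames):
        for filename in filenames:
            self.index[filename] = peer

    def lookup(self, filename):
        # Centralized search inside the cluster ...
        if filename in self.index:
            return self.index[filename]
        # ... and decentralized search across the supernode layer.
        for other in self.neighbours:
            if filename in other.index:
                return other.index[filename]
        return None


s1, s2, s3 = Supernode("S1"), Supernode("S2"), Supernode("S3")
layer = [s1, s2, s3]
for s in layer:
    s.neighbours = [o for o in layer if o is not s]

s1.register("A", ["a.txt"])
s3.register("H", ["X"])

holder = s1.lookup("X")  # a peer in cluster 1 asks its supernode "Who has file X"
```

Ordinary peers never see the query traffic; only the more capable supernodes do, which is how the design exploits the heterogeneity of peer capabilities.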

What is Grid?

“A hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities”

-- Ian Foster & Carl Kesselman, 1998

“Sharing environment implemented via the deployment of a persistent, standards-based service infrastructure that supports the creation of, and resource sharing within distributed communities”

-- Ian Foster & Adriana Iamnitchi, 2003

A basic concept in Grid -- “Virtual Organization”

The evolution of Grid Systems

First generation systems involved proprietary solutions for sharing high performance computing resources; e.g. Condor

Second generation systems introduced middleware to cope with scale and heterogeneity, with a focus on large scale computational power and large volumes of data; e.g. Globus, EU DataGrid

Third generation systems are adopting a service-oriented approach, take a more holistic view of the e-Science infrastructure, are metadata-enabled and may exhibit autonomic features; e.g. Open Grid Services Architecture (OGSA)

P2P vs. Grid --similarities

Both P2P and Grid address the same problem, share the same goal

Resource sharing within distributed resources.

Both offer promising paradigms for developing distributed systems and applications

P2P vs. Grid --differences

Resources
  Grid: higher-end resources, better connected, with high levels of availability
  P2P: edge-level devices, intermittently connected, with highly variable availability

P2P vs. Grid --differences

Services
  Dependent on the nature of communities
  E.g. 1: Resource Discovery
    Grid: very well structured and stable network, making this less of an issue
    P2P: unstable network
  E.g. 2: Security
    Grid: authentication, authorization, accountability
    P2P: anonymity, censorship resistance

P2P vs. Grid --differences

Infrastructure
  Grid: more emphasis on standardization and interoperability
  P2P: little emphasis, no interoperability

Applications
  Grid: large range of applications, more computation and data intensive
  P2P: more social-based, less computation and data intensive

P2P vs. Grid --differences

Scalability
  Grid: most services, such as resource discovery, are mainly based on centralized or hierarchical models
  P2P: most P2P systems are decentralized

P2P vs. Grid --summary

Grid needs to do more to address decentralization, self-organization, fault tolerance and scalability, which are strong points of P2P.

P2P should put more effort into standard infrastructure and provide more services.

The P2P model could help to ensure Grid scalability
The two technologies are likely to converge (Grid + structured P2P)

Data sharing in P2P systems

Provide only file-level sharing and lack content-based search
  coarse granularity of information sharing
Lack of extensibility and flexibility
  no easy and rapid means to expand applications
Node’s neighbors are typically statically defined
  difficult to utilize network bandwidth and optimize system performance

Relational data sharing in Unstructured P2P vs. Distributed DB

Distributed Database Systems | P2P
Nodes are added/removed from the network in a controlled manner. | Nodes can join and leave the network anytime.
Have some knowledge of a shared schema. Queries: SQL | Usually no predetermined (global) schema among nodes. Queries: Keywords
Exact location to direct the query is typically known. | Content location is typically by “word-of-mouth”, e.g., node routes query to its neighbors and so on…
Can actually retrieve the complete set of answers. | Answers to queries are typically incomplete.*

* By “completeness” we mean all answers that satisfy a query

P2P & DB Systems

[Figure: diagram relating P2P and DB properties. Labels: Lightweight, Fault Tolerance, Powerful query facilities, Transactions & Concurrency Control, Strong Semantics, Decentralized, Flexibility]

Taken from Hellerstein’s group ppt

P2P + DB = ?

P2P Database? No!
  ACID transactional guarantees do not scale, nor does the everyday user want ACID semantics
  Much too heavyweight a solution for the everyday user

Query Processing on P2P!
  Both P2P and DBs do data location and movement
  Can be naturally unified (lessons in both directions)
    P2P brings scalability & flexibility
    DB brings relational model & query facilities

Taken from Hellerstein’s group ppt

Many New Challenges

Relative to other parallel/distributed systems
  Partial failure
  Churn
  Few guarantees on transport, storage, etc.
  Huge optimization space
  Network bottlenecks & other resource constraints
  No administrative organizations
  Trust issues: security, privacy, incentives

Relative to IP networking
  Much higher function, more flexible
  Much less controllable/predictable

Some Proposals on Data Sharing…

Database:
  Data Mapping (SIGMOD’03)
  Piazza (ICDE’03)
  PeerDB (ICDE’03)
  …

IR:
  PlanetP (HPDC’03)
  SummaryIndex (TKDE’04 special issue on P2P)
  …

The Birth of BestPeer…

Started in 1998
  To steal storage and CPU cycles from staff machines
  To provide a virtual and parallelised content-based document retrieval system
  To be able to move processes from one PC to another quickly when users need the PC back

Extended to P2P in early 2000
  VC showed interest in the project

W.S. Ng, B. C. Ooi and K.L. Tan: BestPeer: A self configurable peer-to-peer system. ICDE’2002.

BestPeer Network

BestPeer is a generic P2P system designed to serve as a platform on which P2P applications can be developed easily and efficiently

Integrates mobile agent with P2P technologies
Each participant runs the BestPeer software
  Provides communication facilities and shares resources with other peers
  Provides an environment in which agents can reside and perform their tasks

BestPeer Network
  Large # of peers, small # of LIGLOs
  Each node holds two types of data: private data and sharable data

New node registration:
  Register with LIGLO
  Obtain a unique BPID from LIGLO.
  LIGLO sends a list of (BPID, IP) pairs that the node can communicate with directly.
  Node is ready to communicate with other peers.

BestPeer Network

Node rejoins:
  Send node’s current IP to LIGLO
  For each peer p of the node, send p’s BPID to p’s registered LIGLO
  p’s registered LIGLO replies with the IP of p if p is currently connected to the network
  Node has rejoined
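The registration and rejoin steps can be sketched as follows. This is only a sketch under assumptions: the class `Liglo`, its methods, and the BPID format `"<LIGLO name>:<counter>"` are hypothetical, chosen so that a BPID tells the node which LIGLO to ask; the slides do not specify the actual format.

```python
import itertools


class Liglo:
    """Server with a fixed address that issues BPIDs and tracks members' IPs."""

    def __init__(self, name):
        self.name = name
        self._counter = itertools.count(1)
        self.members = {}  # BPID -> current IP, or None while offline

    def register(self, ip):
        """New node: hand out a BPID plus (BPID, IP) pairs of online members."""
        bpid = f"{self.name}:{next(self._counter)}"  # assumed BPID format
        online = [(b, a) for b, a in self.members.items() if a is not None]
        self.members[bpid] = ip
        return bpid, online

    def update_ip(self, bpid, ip):
        self.members[bpid] = ip

    def resolve(self, bpid):
        return self.members.get(bpid)


def rejoin(bpid, new_ip, known_peers, liglos):
    """Report the new IP, then resolve each known peer via that peer's own LIGLO."""
    liglos[bpid.split(":")[0]].update_ip(bpid, new_ip)
    reachable = {}
    for p in known_peers:
        ip = liglos[p.split(":")[0]].resolve(p)
        if ip is not None:  # peers that are offline are skipped
            reachable[p] = ip
    return reachable


l1, l2 = Liglo("L1"), Liglo("L2")
servers = {"L1": l1, "L2": l2}
a, _ = l1.register("10.0.0.1")
b, first_contacts = l1.register("10.0.0.2")
c, _ = l2.register("10.0.1.1")
l2.update_ip(c, None)  # peer c goes offline

reachable = rejoin(a, "10.0.0.9", [b, c], servers)
```

Because the BPID stays fixed while the IP changes, peers can keep referring to each other across sessions, which is the purpose of the location-independent name.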

BestPeer Network

Access data from other nodes:
  Propagation broadcast
  Node with matching result will respond to the initiating node directly

Two modes to access data:
  Phase 1:
    Node with matching answer will return the result directly
    or
    Node with matching answer will only indicate that it has the information
  Phase 2:
    The initiating node will then send a further message to some, if not all, of these nodes to obtain the desired information

Reconfigurable BestPeer Network

A node in the BestPeer network can dynamically reconfigure itself by keeping peers that benefit it most.

Based on the assumption that peers that benefit a node most for a query are most likely to provide the greatest gain for subsequent queries.

Every node controls the maximum number of direct peers it can have

Reconfigurable BestPeer Network

BestPeer applies an autonomous strategy, where each node tries to keep promising peers as close as possible, with no information exchange between peers.

BestPeer provides two default reconfiguration strategies:
  MaxCount
    Maximizes the number of objects a node can obtain from its directly connected peers.
  MinHops
    Minimizes the number of hops that a node needs to travel
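One possible reading of the two strategies, as a sketch: after a query, a node knows for each answering peer how many objects it returned and how many hops away it was, and keeps at most `k` direct peers. The function names, the statistics layout and the tie-breaking rules are assumptions; in particular, interpreting MinHops as "promote the most distant answering peers so later queries travel fewer hops" is an interpretation, not something the slides state.

```python
def max_count(stats, k):
    """Keep the k peers that returned the most objects.

    stats maps peer -> (objects_returned, hops_away).
    """
    return sorted(stats, key=lambda p: (-stats[p][0], p))[:k]


def min_hops(stats, k):
    """Promote the k most distant answering peers to direct peers.

    Ties on distance are broken in favour of the peer with more answers.
    """
    answering = [p for p in stats if stats[p][0] > 0]
    return sorted(answering, key=lambda p: (-stats[p][1], -stats[p][0], p))[:k]


# Hypothetical statistics gathered from one query.
stats = {"B": (5, 1), "C": (0, 1), "D": (9, 3), "E": (2, 4)}

by_count = max_count(stats, k=2)
by_hops = min_hops(stats, k=2)
```

Both functions use only locally observed statistics, in keeping with the autonomous strategy: no information is exchanged between peers to make the decision.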

Location-Independent Global Names Lookup Server (LIGLO)

To facilitate identification of a single node that may have different IP addresses on different occasions

A LIGLO is a node that has a fixed IP and runs the LIGLO software

LIGLO:
  Generates BestPeer Global Identity (BPID)
  Maintains peers’ current status

LIGLO applies a distributed approach: each LIGLO only needs to maintain its own members’ names

Features of BestPeer

Combines the power of agent technology and P2P technology in a single system

Supports a finer granularity of data sharing, and sharing of computational power

Facilitates dynamic reconfiguration of the BestPeer network

Adopts a distributed approach to minimize bottlenecks of servers acting as LIGLO

Integration of Mobile Agent and P2P Technologies

P2P technologies provide resource-sharing capabilities among nodes; mobile agents further extend the functionality

Java-based agent system

BestPeer search agent vs. traditional search agent:
  (Traditional) predefined itinerary vs. automatic and transparent
  TTL/hops-based lifetime
  Result/cost-based lifespan

PeerDB

PeerDB is built on top of BestPeer
Four components are integrated and implemented on the application layer:
  Data management system
    Facilitates storage, manipulation and retrieval of the data
    MySQL as the backend for supporting the SQL query facility
  Local Dictionary
    Metadata stored in the Local Dictionary
  Export Dictionary
    Metadata sharable to other nodes
  Cache Manager
    Caching remote data in secondary storage
    Caching/replacement policy

B.C. Ooi, K.L. Tan, A. Zhou, C.H. Goh, Y.G. Li, C.Y. Liau, B. Ling, W.S. Ng, Y. Shu, X.Y. Wang, M. Zhang: PeerDB: Peering into Personal Databases. SIGMOD’2003, Demo.

W.S. Ng, B. C. Ooi, K.L. Tan, A. Zhou: PeerDB: A P2P-based System for Distributed Data Sharing. ICDE’2003

PeerDB

Agent Layer: DBAgent
  Provides the environment for mobile agents (Java agents) to operate in.
  Each PeerDB node has a master agent that manages the user query.
  Clones and dispatches worker agents to neighboring nodes

P2P Layer:
  Network management and message management
  Monitors statistics and manages network reconfiguration

Architecture

Sharing Data Without Global Schema

Information Retrieval (IR) approach
Meta-data (keywords) are maintained for each relation’s name and attributes
  Serve as a kind of synonymous names (i.e., miniature thesaurus)

Relations:
  P1: Kinases(SeqID, length, proteinSeq)
  P2: Protein(SeqNo, len, sequence)
  P3: ProteinKLen(ID, seqLength), ProteinKSeq(ID, sequence)
  P4: Protein(Name, char)

Peer | Names | Keywords
P1 | Kinases | protein, human
P1 | SeqID | key, identifier, ID
P1 | length | length
P1 | proteinSeq | sequence, protein sequence
P2 | Protein | protein, annexin, zebrafish
P2 | SeqNo | number, identifier
P2 | len | length
P2 | sequence | sequence
P3 | ProteinKLen | protein, kinases, length
P3 | ID | number, identifier
P3 | seqLength | length
P3 | ProteinKSeq | protein, sequence
P3 | ID | number, identifier
P3 | sequence | sequence
P4 | Protein | protein, kinases, annexin
P4 | Name | name
P4 | char | characteristics, features, functions

P1 Query

SELECT SeqID, proteinSeq
FROM Kinases
WHERE length > 30

* Knows own schema but not the schema of other peers

P2, P3 and P4 match the query relation
SeqID, proteinSeq and length all have matching keywords in P2 and P3
  Note: For P3, the query may have to be turned into a join query over ProteinKLen and ProteinKSeq
P4 (relation match only) ranks lower than P2 and P3
Semantically, P2’s data are not actually those that P1 is interested in… the meta-data & info are returned to the user before fetching the data.

Example
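The matching step in this example can be sketched as a keyword-overlap score. The metadata below is transcribed from the slide as best it can be read, and the scoring rule (a relation matches when its keywords overlap the query relation's keywords; a peer's score is the number of query attributes covered by its matching relations) is an assumption made for illustration, not PeerDB's published ranking function.

```python
def kw(*words):
    """Case-insensitive keyword set."""
    return {w.lower() for w in words}


# P1's query: relation Kinases with attributes SeqID, proteinSeq, length.
QUERY = {
    "relation": kw("protein", "human"),
    "attrs": {
        "SeqID": kw("key", "identifier", "ID"),
        "proteinSeq": kw("sequence", "protein sequence"),
        "length": kw("length"),
    },
}

# peer -> relation -> (relation keywords, {attribute: keywords})
PEERS = {
    "P2": {"Protein": (kw("protein", "annexin", "zebrafish"),
                       {"SeqNo": kw("number", "identifier"),
                        "len": kw("length"),
                        "sequence": kw("sequence")})},
    "P3": {"ProteinKLen": (kw("protein", "kinases", "length"),
                           {"ID": kw("number", "identifier"),
                            "seqLength": kw("length")}),
           "ProteinKSeq": (kw("protein", "sequence"),
                           {"ID": kw("number", "identifier"),
                            "sequence": kw("sequence")})},
    "P4": {"Protein": (kw("protein", "kinases", "annexin"),
                       {"Name": kw("name"),
                        "char": kw("characteristics", "features", "functions")})},
}


def score(relations, query):
    """Return (number of query attributes covered, matching relation names)."""
    covered, matched = set(), []
    for rel, (rel_kw, attrs) in relations.items():
        if not rel_kw & query["relation"]:
            continue  # relation keywords do not overlap: skip this relation
        matched.append(rel)
        for q_attr, q_kw in query["attrs"].items():
            if any(q_kw & a_kw for a_kw in attrs.values()):
                covered.add(q_attr)
    return len(covered), matched


ranking = sorted(PEERS, key=lambda p: (-score(PEERS[p], QUERY)[0], p))
```

P3 covers all three attributes only by combining two relations, which is why its query has to become a join; P4 matches on the relation alone and therefore ranks last. The score says nothing about whether P2's zebrafish annexin data is what P1 wants, which is why the metadata is shown to the user before any data is fetched.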

Query Processing Strategy

Completely assisted by agents, which interact with the DBMS.
Query may be rewritten into another form by the DBAgent.
  e.g., single query -> join query involving multiple relations
Local query vs. remote query: a query is local to a node if it is initiated there, and remote otherwise.

Convergence of Technologies on P2P Network

DBMS Search Engine

Information Aggregation

• Possible Business Model for P2P?

Keyword Join – Current Work

A means to facilitate information aggregation
Tuples are “joined” based on similar values (not exact as in a normal join)
  IR similarity matching between attribute values + contents
Top-K answers
E.g. Database search patent filing
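A minimal sketch of the idea, assuming word-level Jaccard similarity as the IR measure (the slides do not say which similarity function is used) and made-up attribute values. Pairs of values are scored, pairs with no overlap are discarded, and only the top K pairs are returned.

```python
import heapq


def tokens(text):
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard similarity over words; an assumed stand-in for the IR measure."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def keyword_join(left, right, k):
    """Return the k most similar (score, left value, right value) pairs."""
    scored = ((similarity(l, r), l, r) for l in left for r in right)
    return heapq.nlargest(k, (s for s in scored if s[0] > 0))


# Hypothetical values from two peers: paper titles and patent titles.
papers = ["distributed query processing", "peer to peer data sharing"]
patents = ["method for peer to peer file sharing",
           "query processing apparatus",
           "coffee maker"]

top2 = keyword_join(papers, patents, k=2)
```

Unlike an equi-join, no pair here has equal values; the pairs are related only by shared terms, and the ranking lets the user inspect the best few instead of the full cross product.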

Some BestPeer work

Wee Siong Ng, Beng Chin Ooi, Yan Feng Shu, Kian Lee Tan and Wee Hyong Tok. Efficient Distributed CQ Processing using Peers (Poster). The Twelfth International World Wide Web Conference 2004.

B.C. Ooi, K.L. Tan, A.Y. Zhou, C.H. Goh, Y.G. Li, C.Y. Liau, B. Ling, W.S. Ng, Y.F. Shu, X.Y. Wang, M. Zhang. PeerDB: Peering into Personal Databases. The 2003 ACM SIGMOD Intl. Conf. on Management of Data (Demo).

Wee Siong Ng, Beng Chin Ooi, Kian Lee Tan and AoYing Zhou. PeerDB: A P2P-based System for Distributed Data Sharing. The 19th International Conference on Data Engineering 2003.

Panos Kalnis*, Wee Siong Ng, Beng Chin Ooi, Dimitris Papadias*, Kian-Lee Tan. An Adaptive Peer-to-Peer Network for Distributed Caching of OLAP Results. ACM SIGMOD Conference 2002.

Wee Siong Ng, Beng Chin Ooi and Kian Lee Tan. BestPeer: A Self-Configurable Peer-to-Peer System (Poster). The 18th International Conference on Data Engineering 2002.