Comparing Hybrid Peer-to-Peer Systems

Based on an article by Hector Garcia-Molina and Beverly Yang

by Tudor Balan

P2P short survey

P2P advantages
– Resources of many computers can be gathered to form large pools of information and significant computing power.
– Network bandwidth use improves significantly as computers communicate directly.

P2P drawbacks
– Due to the decentralized nature
• Ex. Gnutella (network flooding, no scalability)
– Improvements
• Ex. Napster (restricted server search, fractional indexing)

Goal
– Study the functionality of P2P systems in order to understand their tradeoffs
– Concentrate on data sharing and hybrid P2P systems.

Data sharing overview

Data sharing systems

Pure data sharing systems

Hybrid data sharing systems

Hybrid data sharing systems are hugely popular, but are they also well studied?

Which is the best way to organize server nodes?

Should indexes be replicated?

Which are the common queries asked by users?

How to treat disconnected users?

Problem analysis and treatment

• Present several architectures for P2P data sharing systems, either already in use or proposed.

• Give a probabilistic model for user queries and for the result size.

• Illustrate a model for evaluating system performance.

• Based on the above, compare the architectures.

Server architecture: General concepts

Login
– library
– connecting
– metadata upload
– index
– connection information (client IP, line speed)
– local server, remote servers
– local users

Query
– list of desired words
– satisfied (max nr of results reached)
– query processing (retrieve and intersect the lists for each query word)

Download
– library enrichment
– notification, index update
– server notification when a remove/logoff occurs
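The query processing step above (retrieve and intersect the lists for each query word) can be sketched as follows. This is a minimal illustration, not the servers' actual implementation: the names build_index and run_query, the whitespace tokenization, and the sample libraries are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(libraries):
    """Map each word to the set of (user, title) entries whose title contains it."""
    index = defaultdict(set)
    for user, titles in libraries.items():
        for title in titles:
            for word in title.lower().split():
                index[word].add((user, title))
    return index

def run_query(index, words, max_results):
    """Retrieve the list for each query word, intersect them, cap at max_results."""
    lists = [index.get(w.lower(), set()) for w in words]
    if not lists:
        return []
    # Every list must contain an entry for it to satisfy the query.
    matches = set.intersection(*lists)
    return sorted(matches)[:max_results]

# Hypothetical libraries uploaded by two users at login.
libraries = {
    "alice": ["Blue Moon", "Moon River"],
    "bob": ["Blue Suede Shoes", "Blue Moon"],
}
index = build_index(libraries)
hits = run_query(index, ["blue", "moon"], max_results=10)
```

A query is considered satisfied once max_results entries are returned, which is why the result list is truncated rather than returned in full.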

Server architectures: Login policies

Batch
– Login: entire library metadata uploaded
– Logoff: entire library metadata removed
– Index = {metadata of active users}
– Advantages
• Small index
• Increased query efficiency
– Disadvantages
• Intense and expensive metadata updates

Incremental
– Metadata is kept between sessions
– Only the differences are uploaded
– Advantages
• Less effort at login/logoff
– Disadvantages
• Increased memory requirements
• Penalty on query efficiency
• Need to connect to the same server (sometimes)
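The difference between the two login policies can be made concrete with a toy index. This is a sketch under assumptions: the class ServerIndex and its methods are hypothetical, and "upload cost" is simplified to the number of metadata records sent at login.

```python
class ServerIndex:
    """Toy metadata index; policy is either "batch" or "incremental"."""

    def __init__(self, policy):
        if policy not in ("batch", "incremental"):
            raise ValueError("policy must be 'batch' or 'incremental'")
        self.policy = policy
        self.files = {}      # user -> set of file names held in the index
        self.active = set()  # users currently logged in

    def login(self, user, library):
        """Return the number of metadata records the client has to send."""
        library = set(library)
        if self.policy == "batch":
            # Batch: the whole library is uploaded at every login.
            uploaded = len(library)
        else:
            # Incremental: only additions and removals since last session.
            known = self.files.get(user, set())
            uploaded = len(library - known) + len(known - library)
        self.files[user] = library
        self.active.add(user)
        return uploaded

    def logoff(self, user):
        self.active.discard(user)
        if self.policy == "batch":
            # Batch: the index holds metadata of active users only.
            self.files.pop(user, None)

    def index_size(self):
        return sum(len(files) for files in self.files.values())

batch, incr = ServerIndex("batch"), ServerIndex("incremental")
for server in (batch, incr):
    server.login("alice", ["a", "b", "c"])
    server.logoff("alice")
sizes_after_logoff = (batch.index_size(), incr.index_size())
second_login = (batch.login("alice", ["a", "b", "c", "d"]),
                incr.login("alice", ["a", "b", "c", "d"]))
```

The incremental index keeps the metadata of logged-off users, which is the source of both its cheaper logins and its larger memory footprint.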

Server architectures

Chained Architecture
– Linked server nodes
– Login
• Metadata uploaded to the local server
• Other server nodes unaffected
– Query
• Submitted to the local server
• While (not enough results AND not all servers have received and serviced the query)
– local server contacts other servers
• End While
– Performance
• Efficient login and downloads (local server conversation only)
• Expensive query handling (query forwarding, multiple query execution, results retrieval)
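The query loop of the chained architecture can be sketched as below. It is an illustration only: each server is modeled as a plain dictionary from word to file names, and the function name chained_query and the visiting order of the remote servers are assumptions.

```python
def chained_query(local, others, words, max_results):
    """Query the local server first, then forward along the chain while needed.

    Each server is modeled as a dict: word -> set of file names.
    Returns (results, number of servers that processed the query).
    """
    def search(server):
        lists = [server.get(w, set()) for w in words]
        return set.intersection(*lists) if lists else set()

    results = sorted(search(local))[:max_results]
    servers_used = 1
    for server in others:
        if len(results) >= max_results:
            break  # enough results: stop forwarding the query
        remaining = max_results - len(results)
        new = sorted(search(server) - set(results))[:remaining]
        results.extend(new)
        servers_used += 1
    return results, servers_used

# Hypothetical chain of three servers.
local = {"blue": {"a"}}
remote1 = {"blue": {"b", "c"}}
remote2 = {"blue": {"d"}}
two = chained_query(local, [remote1, remote2], ["blue"], max_results=2)
```

The second return value corresponds to the number of servers needed to process a query, one of the quantities the query model estimates.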

Full Replication Architecture
– Intended to overcome the previous disadvantages
– Each server contains a complete index
– Advantages
• Single server queried
• Login at any server (even with the incremental policy)
– Disadvantages
• Logins sent to all servers
• Sensitive to high login/logoff frequency

Server architectures

Hash Architecture
– Login
• Metadata words are hashed to server numbers
• A given server maintains the complete lists for a subset of all words
– Query
• Addressed to only one server
• The addressed server asks the other servers for the lists of the words it doesn't have
• The addressed server merges all lists
– Advantages
• Limited nr of servers involved in processing each query
• Limited nr of servers update metadata
• No results traffic (only lists)
– Disadvantages
• High bandwidth for list manipulation
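A minimal sketch of the hash architecture follows. Several choices are assumptions made for the example and are not taken from the slides: CRC32 as the hash function, the class name HashCluster, and addressing a query to the server that owns its first word.

```python
import zlib

def server_for(word, num_servers):
    """Assign a word to a server with a stable hash (CRC32 is illustrative)."""
    return zlib.crc32(word.encode("utf-8")) % num_servers

class HashCluster:
    def __init__(self, num_servers):
        self.num_servers = num_servers
        # One inverted index per server: word -> set of file names.
        self.servers = [dict() for _ in range(num_servers)]

    def add_file(self, name):
        """Send the file's metadata to each server owning one of its words."""
        touched = set()
        for word in name.lower().split():
            sid = server_for(word, self.num_servers)
            self.servers[sid].setdefault(word, set()).add(name)
            touched.add(sid)
        return touched

    def query(self, words):
        """The addressed server gathers the list for each word and merges them."""
        words = [w.lower() for w in words]
        if not words:
            return set(), set()
        addressed = server_for(words[0], self.num_servers)
        lists, contacted = [], set()
        for word in words:
            sid = server_for(word, self.num_servers)
            if sid != addressed:
                contacted.add(sid)  # this list is transferred between servers
            lists.append(self.servers[sid].get(word, set()))
        return set.intersection(*lists), contacted

cluster = HashCluster(num_servers=4)
cluster.add_file("Blue Moon")
cluster.add_file("Blue Suede Shoes")
results, contacted = cluster.query(["blue", "moon"])
```

Only whole lists cross server boundaries here, never result records, which matches the "no results traffic" advantage and also shows where the bandwidth cost of long lists comes from.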

Unchained architecture
– Set of independent servers
– Login
• To one isolated server
• No other servers are affected
– Query
• Handled only by the server the user is logged on to
– Advantage
• Scalability
– Disadvantage
• Partial functionality
• Limited query results

Query model

• Needed for comparing systems
• Goals
– Estimate the number of query results
– Estimate the nr. of servers needed to process a query
• Initial computations for the chained architecture (more complex)
• Subsequent derived computations (relaxing or particularizing the chained architecture conditions)

Query model (continued): Chained architecture

• Assume a query universe q1, q2, …
• g = the probability function that describes query popularity, i.e. g(i) is the probability that a submitted query happens to be query i
• f = the probability density function that describes query selection power: a given file in a user's library matches query i with probability f(i)
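As a worked consequence of these two definitions, the expected result counts can be written down directly. This is a simplified sketch, not the formulas of the original article: it ignores the MaxResults cap, and the symbols n (number of files searched) and R_i (number of results of query i) are introduced here for illustration only.

\[
E[R_i] = n \, f(i), \qquad
E[R] = \sum_i g(i) \, E[R_i] = n \sum_i g(i) \, f(i)
\]

The first expression follows because each of the n files matches query i with probability f(i); the second weights each query by its popularity g(i).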

Query model (continued)

• Full replication
– ExServ=1 (all results are local)
– ExRemoteResults=0
• Unchained
– ExServ=1 (all results are local)
– ExRemoteResults=0
• Hash
– ExRemoteResults=0

Particularization

For music sharing, g and f might realistically be taken as:

Performance model

• Illustrates the way to measure the performance of a P2P system

• NumServers (LAN, WAN)

• Users (LAN, WAN)

• {LAN,WAN} X {LAN, WAN}

• Compute action costs in terms of:
– CPU cycles
– Interserver communication bandwidth
– User-server communication bandwidth

CPU consumption

CPU cost variations for the chained architecture (batch and incremental)

CPU cost variations for the other architectures (relative to the chained one)

Unchained & Full replication
– the query cost formula (batch & incremental) is the same
– …but ExServ=1 and ExRemoteResults=0

Hash
– additional cost for list transfer (for query costs)

Interpretation

Network consumption

Client-Server byte costs

Unchained

• no interserver communication

• 0 login costs

Chained

• query interserver communication

• no login interserver communication

• 0 login costs

Interserver byte costs

Full replication
• each server sees each Login, AddFile, RemoveFile
• LAN: each message is broadcast once
• WAN: each message is sent NumServers-1 times by the local server

Hash
• each selected server sees each Login, AddFile, RemoveFile
• LAN: each message is broadcast once
• WAN: AddFile is sent once to each server holding lists for words contained in the name of the file
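The interserver message counts listed above can be expressed as two small functions. This is a sketch only: the function names are hypothetical, message counts stand in for byte costs, and the word-to-server mapping is passed in as a parameter because the slides do not fix a hash function.

```python
def full_replication_messages(num_servers, network):
    """Messages the local server sends so that every server sees one update."""
    if network == "LAN":
        return 1  # a single broadcast reaches all servers
    if network == "WAN":
        return num_servers - 1  # one copy per remote server
    raise ValueError("network must be 'LAN' or 'WAN'")

def hash_addfile_messages(file_name, num_servers, network, server_for):
    """Messages for one AddFile under the hash architecture."""
    if network == "LAN":
        return 1  # a single broadcast
    if network == "WAN":
        # One message per server holding a list for a word in the file name.
        owners = {server_for(w, num_servers) for w in file_name.lower().split()}
        return len(owners)
    raise ValueError("network must be 'LAN' or 'WAN'")

# Illustrative word-to-server mapping (an assumption): word length mod servers.
by_length = lambda word, n: len(word) % n

fr_wan = full_replication_messages(5, "WAN")
hash_wan = hash_addfile_messages("Blue Suede Shoes", 4, "WAN", by_length)
```

Under this count, full replication on a WAN grows linearly with NumServers, while hash is bounded by the number of distinct words in the file name.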

Overall performance

• Hypothesis: known formulas for each action cost
• Performance metric: UsersPerServer
• How to compute a global formula for UsersPerServer? (Directly? Too complex.)
• For each resource i
– Assume infinite resources of the other 2 types
– Compute UsersPerServer for the current resource (UsersPerServer_i)
• Compute min_i(UsersPerServer_i)
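The per-resource procedure above amounts to a bottleneck computation, sketched here. The function name, the resource keys, and every number in the example are illustrative assumptions, not values from the study.

```python
def users_per_server(capacity, cost_per_user):
    """For each resource, assume the others are unlimited; then take the minimum.

    capacity:      resource name -> units available per second on one server
    cost_per_user: resource name -> units one user consumes per second
    """
    per_resource = {
        name: capacity[name] / cost_per_user[name] for name in capacity
    }
    bottleneck = min(per_resource, key=per_resource.get)
    return per_resource[bottleneck], bottleneck, per_resource

# Hypothetical capacities and per-user costs for the three resource types.
capacity = {"cpu": 1e9, "interserver_bw": 1e8, "user_bw": 1e7}
cost_per_user = {"cpu": 2e4, "interserver_bw": 1e3, "user_bw": 5e2}
users, bottleneck, detail = users_per_server(capacity, cost_per_user)
```

The resource that yields the smallest value is the one that limits the server, so improving any other resource leaves UsersPerServer unchanged.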

Experiments

Results of performance studies

Music sharing systems

Sharing systems for domains other than music

Maximum number of users (throughput, not response time)

Architectures={CHN,FR,HASH,UCH}

Login policies={batch, incremental}

Strategies=Architectures X Login policies

Music share systems behaviour

Ex: For Query/Login ratio=1:

Incremental FR=54203

Batch FR=7281

QueryLoginRatio increases, so logins/sec decrease and performance increases

1. UCH → CHN (conserves performance but increases returned results)

2. Paradox: UCH more used than CHN

3. QueryLoginRatio sensitivity

1. Incremental strategies outperform batch counterparts

2. CHN & UCH better than FR & HASH

For MaxResults=100 (quantities shown: QueryLoginRatio, nr of logins/sec, users supported, available files, expected nr of results)

Memory analysis

No previous treatment of memory implications

Batch strategies better than their incremental counterparts

Memory=f(NumServers,ActiveFrac)

Memory depends on NumServers (for FR)

Memory of incremental = (1/ActiveFrac) × memory of batch

As ActiveFrac increases, incremental strategies come closer to batch.

Memory price may eliminate worries about memory limitations
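The relation between incremental and batch memory (incremental = batch / ActiveFrac) can be checked with a short calculation. The function name and the sample values are assumptions for illustration; memory is in arbitrary units.

```python
def incremental_memory(batch_memory, active_frac):
    """Index memory under the incremental policy, given the batch figure.

    Batch indexes only active users; incremental indexes all users, so the
    ratio between the two is 1 / ActiveFrac.
    """
    if not 0 < active_frac <= 1:
        raise ValueError("active_frac must be in (0, 1]")
    return batch_memory / active_frac

quarter_active = incremental_memory(100.0, 0.25)
all_active = incremental_memory(100.0, 1.0)
```

With a quarter of the users active the incremental index is four times larger; with everyone active the two policies need the same memory.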

Small analysis

Ex1. QueryLoginRatio=.75 (incr & batch CHN comparison): (69708, 26828) vs (12268, 28828)

take batch

Ex2. QueryLoginRatio=.25 (incr & batch CHN comparison): (52088, 9190) vs (12268, 9190)

take incremental

Beyond music…
• We can generally compute
– Expected nr of results of a query
– Expected nr of servers needed to satisfy the query
• …using
– g(): distribution of query frequency
– f(): distribution of selection power
• f and g are the input for the general query model
• For music, f and g are exponential and positively correlated (the more popular a query is, the greater its selection power); all preceding results assume this
• What if we have a stock database?
– Select * from Product where price>10 (rare query) returns as many results as
– Select * from Product (common query)
– No correlation
• What about an archive-driven company?
– Rare queries (for old articles) return good results
– Frequent queries (for new articles) return few results
– Negative correlation
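The effect of the correlation between g and f can be illustrated numerically. Everything specific in this sketch is an assumption: the universe of 100 queries, the exponential shapes and their parameters, and modelling "negative correlation" by reversing f. Only the qualitative ordering is the point.

```python
import math

def normalized_exponential(n, scale):
    """Exponentially decaying weights over ranks 0..n-1 that sum to 1."""
    weights = [math.exp(-i / scale) for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_match_probability(g, f):
    """Probability that a given file matches a query drawn according to g."""
    return sum(gi * fi for gi, fi in zip(g, f))

n = 100
g = normalized_exponential(n, scale=10.0)  # query popularity by rank

# Positive correlation: popular queries also have high selection power.
f_positive = [0.05 * math.exp(-i / 10.0) for i in range(n)]
# No correlation: every query has the same (average) selection power.
f_none = [sum(f_positive) / n] * n
# Negative correlation: rare queries have the highest selection power.
f_negative = list(reversed(f_positive))

positive = expected_match_probability(g, f_positive)
none = expected_match_probability(g, f_none)
negative = expected_match_probability(g, f_negative)
```

With the same set of selection powers, the expected number of results per submitted query is largest under positive correlation and smallest under negative correlation, which is why the music results do not carry over unchanged to other domains.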

Performance variation as function of correlation

Final Conclusions

• Chained
– Best for music today
– Good login, least memory
– Poor if many servers are involved
• Full replication
– Potentially good in the future, when connections are more stable
• Hash
– Has high bandwidth requirements
– Good in the future, or in systems where servers must not exchange large amounts of metadata
• Unchained
– Not recommended
– Few results for only a small performance improvement
– Good when the nr of results is not important
• Incremental policy
– Good for systems with negative correlation

Date post:	12-Jan-2016