Comparing Hybrid Peer-to-Peer Systems
based on an article by
Hector Garcia-Molina
Beverly Yang
presented by
Tudor Balan
P2P short survey
P2P advantages
– Resources of many computers can be pooled to form large stores of information and significant computing power.
– Network bandwidth use improves significantly because computers communicate directly.
P2P drawbacks
– Stem from the decentralized nature.
• Ex. Gnutella (network flooding, no scalability)
– Improvements
• Ex. Napster (restricted server search, fractional indexing)
Goal
– Study the functionality of P2P systems in order to understand their tradeoffs
– Concentrate on data sharing and hybrid P2P systems.
Data sharing overview
Data sharing systems
Pure data sharing systems
Hybrid data sharing systems
Hybrid data sharing systems are hugely popular, but are they also well studied?
Which is the best way to organize server nodes?
Should indexes be replicated?
Which are the common queries asked by users?
How to treat disconnected users?
Problem analysis and treatment
• Present several architectures for P2P data sharing systems, either already in use or proposed.
• Give a probabilistic model for user queries and for the result size.
• Present a model for evaluating system performance.
• Based on the above, compare the architectures.
Server architecture: general concepts
Login
– User connects and uploads library metadata to the index
– Connection information (client IP, line speed) is recorded
– Actors: local server, remote servers, local users
Query
– List of desired words
– Satisfied when the maximum nr of results is reached
– Processing: retrieve and intersect the lists for each query word
Download
– The user's library is enriched with the downloaded file
– The server is notified so that the index is updated
– The server is also notified when a file is removed or the user logs off
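The query processing described above (retrieve the list for each query word, intersect the lists, stop at the maximum number of results) can be sketched roughly as follows. This is only an illustration: the class name `InvertedIndex`, its methods, and the whitespace tokenization of file names are assumptions, not part of the original system.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy server index (hypothetical): word -> set of (user, filename)."""

    def __init__(self):
        self.lists = defaultdict(set)

    def add_file(self, user, filename):
        # Index every word of the file's metadata (here: just its name).
        for word in filename.lower().split():
            self.lists[word].add((user, filename))

    def remove_user(self, user):
        # Drop all entries of a user, e.g. when the user logs off.
        for word in list(self.lists):
            self.lists[word] = {e for e in self.lists[word] if e[0] != user}
            if not self.lists[word]:
                del self.lists[word]

    def query(self, words, max_results):
        # Retrieve the list for each query word and intersect them.
        word_lists = [self.lists.get(w.lower(), set()) for w in words]
        if not word_lists:
            return []
        matches = set.intersection(*word_lists)
        # The query counts as satisfied once max_results entries are found.
        return sorted(matches)[:max_results]
```

Sorting the matches only makes the toy output deterministic; a real server would more likely rank or stream results.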
Server architectures: login policies
Batch
– Login: entire library metadata uploaded
– Logoff: entire library metadata removed
– Index = {metadata of active users}
– Advantages
• Small index
• Increased query efficiency
– Disadvantages
• Intense and expensive metadata updates
Incremental
– Metadata persists between sessions
– Only the differences are uploaded at login
– Advantages
• Less effort at login/logoff
– Disadvantages
• Increased memory requirements
• Penalty on query efficiency
• The user (sometimes) needs to connect to the same server
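To make the tradeoff between the two login policies concrete, here is a minimal sketch. The `Server` class, its method names, and the use of "records transferred" as the cost measure are illustrative assumptions.

```python
class Server:
    """Toy server (hypothetical) tracking user metadata under a login policy."""

    def __init__(self, policy):
        assert policy in ("batch", "incremental")
        self.policy = policy
        self.metadata = {}   # user -> set of filenames held in the index
        self.active = set()  # users currently logged in

    def login(self, user, library):
        """Return how many metadata records are transferred at login."""
        library = set(library)
        if self.policy == "batch":
            # Batch: the entire library is uploaded on every login.
            transferred = len(library)
        else:
            # Incremental: only additions and removals since last session.
            stored = self.metadata.get(user, set())
            transferred = len(library ^ stored)
        self.metadata[user] = library
        self.active.add(user)
        return transferred

    def logoff(self, user):
        self.active.discard(user)
        if self.policy == "batch":
            # Batch: the index keeps metadata of active users only.
            self.metadata.pop(user, None)
        # Incremental: metadata stays, so the index is larger and queries
        # must skip entries that belong to inactive users.

    def index_size(self):
        return sum(len(files) for files in self.metadata.values())
```

With a three-file library, a batch server transfers three records at every login, while an incremental server transfers three the first time and only the differences afterwards, at the price of keeping the metadata in memory after logoff.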
Server architectures
Chained Architecture
– Linked server nodes
– Login
• Metadata uploaded to the local server
• Other server nodes unaffected
– Query
• Submitted to the local server
• While (not enough results AND not all servers have received and serviced the query)
– local server contacts another server
• End While
– Performance
• Efficient login and downloads (conversation with the local server only)
• Expensive query processing (query forwarding, multiple query executions, results retrieval)
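The chained query loop can be sketched as below: the local server answers first, and the query moves on only while results are still missing and unvisited servers remain. The function name, the ring-style visiting order, and the `matches` callback are assumptions made for illustration.

```python
def chained_query(servers, local, matches, max_results):
    """Forward a query along the chain until enough results are found.

    servers:     list of per-server indexes (anything `matches` accepts)
    local:       position of the user's local server in the chain
    matches:     function(index) -> list of results found on that server
    Returns (results, number of servers that processed the query).
    """
    results = []
    visited = 0
    n = len(servers)
    # Continue only while results are missing AND servers remain.
    while len(results) < max_results and visited < n:
        server = servers[(local + visited) % n]
        results.extend(matches(server))
        visited += 1
    return results[:max_results], visited
```

The second return value corresponds to the number of servers needed to satisfy a query, which is the quantity that makes chained queries expensive.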
Full Replication Architecture
– Intended to overcome the previous disadvantages
– Each server contains a complete index
– Advantages
• Single server queried
• Login at any server (even under the incremental policy)
– Disadvantages
• Logins sent to all servers
• Sensitive to high login/logoff frequency
Server architectures
Hash Architecture
– Login
• Metadata words are hashed to server numbers
• A given server maintains the complete lists for a subset of all words
– Query
• Addressed to only one server
• The addressed server asks other servers for the lists of the words it doesn't have
• The addressed server merges all lists
– Advantages
• Limited nr of servers involved in processing each query
• Limited nr of servers update metadata
• No results traffic (only lists)
– Disadvantages
• High bandwidth for list transfers
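A rough sketch of the hash architecture follows. It assumes that words are assigned to servers with CRC32 modulo the number of servers and that "merging" the lists means intersecting them, as in the general query processing; the class and method names are hypothetical.

```python
import zlib
from collections import defaultdict


class HashCluster:
    """Toy hash architecture (hypothetical): each word lives on one server."""

    def __init__(self, num_servers):
        self.num_servers = num_servers
        # One word -> entries map per server.
        self.servers = [defaultdict(set) for _ in range(num_servers)]

    def owner(self, word):
        # Assumed hash: CRC32 is deterministic across runs, unlike hash().
        return zlib.crc32(word.lower().encode("utf-8")) % self.num_servers

    def add_file(self, user, filename):
        """Store metadata; return the set of servers that were updated."""
        touched = set()
        for word in filename.lower().split():
            s = self.owner(word)
            self.servers[s][word].add((user, filename))
            touched.add(s)
        return touched

    def query(self, words, entry_server):
        """Return (results, remote servers asked for their lists)."""
        lists, contacted = [], set()
        for word in words:
            s = self.owner(word)
            if s != entry_server:
                contacted.add(s)  # list shipped to the addressed server
            lists.append(self.servers[s].get(word.lower(), set()))
        results = set.intersection(*lists) if lists else set()
        return sorted(results), contacted
```

Only whole word lists travel between servers, which reflects both the advantage (few servers per query) and the disadvantage (list transfers can be large).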
Unchained architecture
– Set of independent servers
– Login
• To one isolated server
• No other servers are affected
– Query
• Answered only by the server the user has logged on to
– Advantage
• Scalability
– Disadvantage
• Partial functionality
• Limited query results
Query model
• Needed for systems comparison
• Goals
– Estimate the number of query results
– Estimate the nr. of servers needed to process a query
• Initial computations for the chained architecture (more complex)
• Subsequent derived computations (relaxing or particularizing the chained architecture conditions)
Query model (continued): chained architecture
• Assume a query universe q1, q2, …
• g = the probability function that describes query popularity, i.e. g(i) is the probability that a submitted query happens to be query i
• f = the probability density function that describes query selection power: a given file in a user's library matches query i with probability f(i)
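From these definitions, a collection of N files yields N·f(i) expected matches for query i, and averaging over the popularity distribution gives N·Σ g(i)·f(i) expected results per query. The sketch below computes this. The truncated exponential shapes and all parameter names (`scale`, `peak`) are assumptions chosen for illustration, not values taken from the article.

```python
import math


def exp_popularity(num_queries, scale):
    """Assumed g: exponentially decaying popularity, normalized to sum to 1."""
    weights = [math.exp(-i / scale) for i in range(num_queries)]
    total = sum(weights)
    return [w / total for w in weights]


def exp_selection(num_queries, scale, peak):
    """Assumed f: per-file match probability, at most `peak` (<= 1)."""
    return [peak * math.exp(-i / scale) for i in range(num_queries)]


def expected_results(g, f, num_files):
    """Expected matches per query: N * sum_i g(i) * f(i)."""
    return num_files * sum(gi * fi for gi, fi in zip(g, f))
```

For instance, `expected_results(exp_popularity(1000, 100.0), exp_selection(1000, 100.0, 0.01), 10000)` gives the expected result count when popular queries are also the most selective ones.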
Query model (continued)
• Full replication
– ExServ=1: all results are local
– ExRemoteResults=0
• Unchained
– ExServ=1: all results are local
– ExRemoteResults=0
• Hash
– ExRemoteResults=0
Particularization
For music sharing, g and f might realistically be taken as:
Performance model
• Illustrates the way to measure the performance of a P2P system
• NumServers (LAN, WAN)
• Users (LAN, WAN)
• {LAN,WAN} X {LAN, WAN}
• Compute action costs in terms of:
– CPU cycles
– Interserver communication bandwidth
– User-server communication bandwidth
CPU consumption
CPU cost variations for the chained architecture (batch and incremental)
CPU cost variations for the other architectures (relative to the chained one)
Unchained & Full replication
• the query cost formula (batch & incremental) is the same
• …but ExServ=1 and ExRemoteResults=0
Hash
• additional cost for list transfer (in query costs)
Interpretation
Network consumption
Client-Server byte costs
Unchained
• no interserver communication
• 0 login costs
Chained
• query interserver communication
• no login interserver communication
• 0 login costs
Interserver byte costs
Full replication
• each server sees each Login, AddFile, RemoveFile
• LAN: each message is broadcast once
• WAN: each message is sent NumServers-1 times by the local server
Hash
• each selected server sees each Login, AddFile, RemoveFile
• LAN: each message is broadcast once
• WAN: AddFile is sent once to each server holding lists for words contained in the file name
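The message counts stated above can be written down directly. In this sketch the function names and the `owner` callback (mapping a word to the server that holds its list) are hypothetical, and file names are assumed to be split on whitespace.

```python
def full_replication_messages(num_servers, network):
    """Copies of one Login/AddFile/RemoveFile message sent between servers."""
    if network == "LAN":
        return 1                  # a single broadcast reaches every server
    return num_servers - 1        # WAN: one copy per remote server


def hash_addfile_messages(filename, owner, network):
    """Copies of one AddFile message under the hash architecture.

    owner: assumed function word -> id of the server holding that word's list
    """
    if network == "LAN":
        return 1
    # WAN: one message per distinct server holding a word of the file name.
    return len({owner(word) for word in filename.lower().split()})
```

With many servers and short file names, the hash count stays bounded by the number of words, while the full replication count grows with NumServers.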
Overall performance
• Hypothesis: the formulas for each action cost are known
• Performance metric: UsersPerServer
• How to compute a global formula for UsersPerServer? (directly? ...too complex)
• For each resource i
– Assume infinite resources of the other 2 types
– Compute UsersPerServer for the current resource (UsersPerServer_i)
• Take the minimum over i of UsersPerServer_i
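The per-resource procedure can be sketched as a bottleneck computation. This assumes each resource comes with a non-decreasing cost function of the number of users and a fixed capacity; the function names, the binary search, and the `upper` bound are illustrative choices rather than the article's method.

```python
def max_users_for_resource(cost, capacity, upper=10**7):
    """Largest integer user count whose cost fits within the capacity.

    cost: assumed non-decreasing function users -> resource consumed
    """
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if cost(mid) <= capacity:
            lo = mid          # mid users still fit; try more
        else:
            hi = mid - 1      # too many users; try fewer
    return lo


def users_per_server(resources):
    """resources: dict name -> (cost function, capacity).

    Each resource is evaluated as if the others were infinite; the
    overall UsersPerServer is the minimum of the per-resource limits.
    """
    limits = {name: max_users_for_resource(cost, cap)
              for name, (cost, cap) in resources.items()}
    return min(limits.values()), limits
```

The returned dictionary also shows which resource is the bottleneck, which is useful when comparing architectures.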
Experiments
Results of performance studies
Music sharing systems
Sharing systems for domains other than music
Maximum number of users (throughput, not response time)
Architectures={CHN,FR,HASH,UCH}
Login policies={batch, incremental}
Strategies=Architectures X Login policies
Music sharing systems behaviour
Ex: for Query/Login ratio = 1:
Incremental FR = 54203
Batch FR = 7281
QueryLoginRatio increases → logins/sec decrease → performance increases
1. UCH → CHN (conserves performance but increases returned results)
2. Paradox: UCH more used than CHN
3. QueryLoginRatio sensitivity
1. Incremental strategies outperform batch counterparts
2. CHN & UCH better than FR & HASH
For MaxResults=100:
QueryLoginRatio → nr of logins/sec → users supported → available files → expected nr of results
Memory analysis
• No previous treatment of memory implications
• Batch strategies are better than their incremental counterparts
• Memory = f(NumServers, ActiveFrac)
• For FR, memory varies with NumServers
• Mem of incremental = (1/ActiveFrac) × Mem of batch
• As ActiveFrac increases, incremental strategies come closer to batch
• Memory prices may eliminate worries about memory limitations
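The relation between incremental and batch memory can be checked with a one-line computation. The function name and the units are illustrative; only the 1/ActiveFrac factor comes from the slide.

```python
def incremental_memory(batch_memory, active_frac):
    """Memory under the incremental policy: batch memory / ActiveFrac."""
    if not 0 < active_frac <= 1:
        raise ValueError("active_frac must be in (0, 1]")
    return batch_memory / active_frac
```

For example, if only a quarter of the users are active, `incremental_memory(100.0, 0.25)` is four times the batch figure, and the two coincide when every user is active.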
Small analysis
Ex1. QueryLoginRatio = .75 (incr & batch CHN comparison): (69708, 26828) vs (12268, 28828); take batch
Ex2. QueryLoginRatio = .25 (incr & batch CHN comparison): (52088, 9190) vs (12268, 9190); take incremental
Beyond music…
• We can generally compute
– Expected nr of results of a query
– Expected nr of servers needed to satisfy the query
• …using
– g(): distribution of query frequency
– f(): distribution of selection power
• f and g are the input for the general query model
• For music, f and g are exponential and positively correlated (the more popular a query is, the greater its selection power); all preceding results assume this
• What if the data is a product stock?
– Select * from Product where price>10 (rare query) returns as many results as
– Select * from Product (common query)
– No correlation
• What about an archive-driven company?
– Rare queries (for old articles) return good results
– Frequent queries (for new articles) return few results
– Negative correlation
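The effect of correlation between popularity and selection power can be illustrated with a small sketch. It models "no correlation" as every query having the same (mean) selection power, and builds the positive and negative cases by pairing the same f values with g in matching or opposite order; these modelling choices and the function names are assumptions.

```python
def expected_match_fraction(g, f):
    """Probability that a random file matches a random query: sum g(i)*f(i)."""
    return sum(gi * fi for gi, fi in zip(g, f))


def correlation_scenarios(g, f):
    """Compare positive, absent, and negative correlation of f with g."""
    g_sorted = sorted(g, reverse=True)   # most popular query first
    f_desc = sorted(f, reverse=True)     # most selective value first
    mean_f = sum(f) / len(f)
    return {
        # Popular queries also match the most files (music-like).
        "positive": expected_match_fraction(g_sorted, f_desc),
        # Every query matches the same share of files (stock-like).
        "none": expected_match_fraction(g_sorted, [mean_f] * len(f)),
        # Popular queries match the fewest files (archive-like).
        "negative": expected_match_fraction(g_sorted, f_desc[::-1]),
    }
```

With the same values of f, the expected number of results per query is highest under positive correlation and lowest under negative correlation, so fewer results come back for the queries users actually ask.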
Performance variation as function of correlation
Final Conclusions
• Chained
– Best for music today
– Good login, least memory
– Poor if many servers involved
• Full replication
– Potentially good in the future, when connections are more stable
• Hash
– Has high bandwidth requirements
– Good in the future or in systems where servers do not have to exchange large amounts of metadata
• Unchained
– Not recommended
– Few results for only a small performance improvement
– Good when nr of results is not important
• Incremental policy
– Good for systems with negative correlation