February 1, 2005
Peer-to-Peer Networks for Content Distribution
Ernst Biersack, [email protected]
http://www.eurecom.fr/~erbi/
© Institut Eurécom 2005
Overview
Motivation
Overview of distribution models
Simple example for file replication
Parallel download
BitTorrent: a real P2P file download application
How to organize peers for file distribution?
Theoretical analysis of different static organizations
Simulation of mesh-based organizations
Conclusion
Motivation: Context and Problem
A growing number of well-connected users access increasing amounts of content
But interest in content is often "Zipf" distributed (a small fraction of the content is very popular)
Servers and links are overloaded
Number of clients
Size of content
"Flash crowd" (e.g., 9/11)
Tremendous engineering (and cost!) is necessary to make server farms scalable and robust
[Figure: traditional client/server content distribution; a single source behind the Internet suffers congestion, leading to degraded or lost service]
Problem: scalable distribution of content
Zipf's law
Zipf's law: the frequency Pi of the event of rank i is a power law function:
Pi = Ω / i^α, where α ≤ 1
Observed to hold for:
Frequency of written words in English texts
Population of cities
Frequency of access of Web pages
Size of Web objects
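To make the skew concrete, here is a small sketch of my own (α = 1 and a 10,000-item catalog are made-up parameters, not from the slides) that computes the request share of the most popular items:

```python
def zipf_probabilities(n_items, alpha=1.0):
    """Return the normalized Zipf probabilities P_i = Omega / i^alpha."""
    weights = [1.0 / (i ** alpha) for i in range(1, n_items + 1)]
    omega = 1.0 / sum(weights)          # normalization constant Omega
    return [omega * w for w in weights]

probs = zipf_probabilities(10_000)
top_1_percent = sum(probs[:100])        # share of requests going to the top 100 items
print(f"Top 1% of the items attract {top_1_percent:.0%} of the requests")
```

With these parameters, roughly half of all requests go to 1% of the catalog, which is why a small set of overloaded servers carries most of the traffic.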
Motivation: Real-World Scenarios
Quick distribution of critical content
E.g., antivirus definitions
Efficient distribution of large content
E.g., the nightly update of a bank's branches, or a promotional movie from a manufacturer to all car dealers
Distribution of streaming content
E.g., live event, Internet TV
Classical approaches have a high cost:
Source over-provisioning (for peak demand)
Content Delivery Networks (CDNs)
Novel approach: cooperative networks
Illustration: File Replication
Assume you have a set of N hosts
We initially have one or a few (n) copies of the file to distribute (n << N)
Q: how to efficiently replicate the file on all N hosts?
File Replication: Simple Scenario
3 asymmetric hosts (upload rate ≠ download rate):
upload rate: u = 0.5 file_unit/time_unit
download rate: d = 1 file_unit/time_unit
Initially one copy
File of size 1 file_unit
File Replication: Simple Scenario
Tree:
Upload in parallel to each host at rate u/2 = 0.25
Download time = 1/(u/2) = 4 time_units
(Smart) full replication:
Upload to host 2 at full rate u = 0.5
Host 2 done at 1/u = 2 time_units
Replication on host 3: simultaneously,
host 1 gives one half of the file at u = 0.5 (takes 1 time_unit)
host 2 gives the other half at u = 0.5 (takes 1 time_unit)
Completion after 3 time_units
File Replication: Simple Scenario Exercise
Assume that the file is broken into 4 equal-size pieces p1 – p4
Host 1 first transmits p1 to host 2
Then, simultaneously: host 1 transmits p2 to host 2, and host 2 transmits p1 to host 3
Then, simultaneously: host 1 transmits p3 to host 2, and host 2 transmits p2 to host 3
Then, simultaneously: host 1 transmits p4 to host 2, and host 2 transmits p3 to host 3
Finally, host 2 transmits p4 to host 3
Questions:
How long does it take to distribute the file to both host 2 and host 3?
What kind of organization does this type of parallel, shifted transfer correspond to?
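One way to check the exercise is to step through the schedule (a sketch of my own; the steps are taken from the slide, the numbers from the scenario: u = 0.5, pieces of 1/4 file_unit):

```python
# One piece of 0.25 file_unit at upload rate u = 0.5 takes 0.5 time_units.
PIECE_TIME = 0.25 / 0.5

# Each step lists the transfers that run simultaneously: (sender, piece, receiver).
steps = [
    [(1, "p1", 2)],
    [(1, "p2", 2), (2, "p1", 3)],
    [(1, "p3", 2), (2, "p2", 3)],
    [(1, "p4", 2), (2, "p3", 3)],
    [(2, "p4", 3)],
]

done = {}                        # host -> time at which it holds all 4 pieces
received = {2: set(), 3: set()}
for step_no, transfers in enumerate(steps, start=1):
    now = step_no * PIECE_TIME
    for _, piece, dst in transfers:
        received[dst].add(piece)
        if len(received[dst]) == 4:
            done[dst] = now

print(done)  # {2: 2.0, 3: 2.5}
```

Host 2 completes after 2.0 time_units and host 3 after 2.5, i.e., the whole group is served in 2.5 time_units; the parallel, shifted transfer is a pipelined linear chain host 1 → host 2 → host 3.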
File Replication: Simple Scenario
Partial replication: 4 actions in parallel:
Upload f/2 from host 1 to host 2 at u/2
Upload f/2 from host 1 to host 3 at u/2
Host 2 uploads its half to host 3
Host 3 uploads its half to host 2
Total time: (f/2)/(u/2) = f/u = 1/0.5 = 2 time_units
I am cheating a bit in this example: WHY?
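The three strategies can be compared side by side with a small computation (a sketch of my own, using the slide's parameters u = 0.5 and a file of size 1):

```python
u, f = 0.5, 1.0

# Tree: the source uploads to both hosts in parallel at u/2 each.
t_tree = f / (u / 2)                     # 4.0 time_units

# Smart full replication: serve host 2 at full rate (2 time_units),
# then hosts 1 and 2 each push half the file to host 3 (1 more time_unit).
t_full = f / u + (f / 2) / u             # 3.0 time_units

# Partial replication: each host first gets a different half from the
# source at u/2, then the two hosts swap halves (idealized overlap).
t_partial = (f / 2) / (u / 2)            # 2.0 time_units

print(t_tree, t_full, t_partial)         # 4.0 3.0 2.0
```

The progression 4 → 3 → 2 time_units is what motivates letting receivers upload as early and as much as possible.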
Observations
Designing an optimal policy is difficult in practice because of:
Heterogeneity of hosts in terms of their upload/download capacity
E.g., host 1 has a 10 Mb/s access link, whereas host 2 has a modem access
Hosts come and leave at any point in time
Observations
Hosts are not homogeneous:
Different access links
Campus, corporate networks
• Firewalls might be a problem
xDSL: asymmetric upload and download capacity
Different locations in the world
US, Europe, Asia
• Different RTTs ⇒ different TCP throughputs
• Different availability of paths, hosts, …
T_TCP ≈ (const · MSS) / (RTT · √loss)   (Padhye et al., Sigcomm 98)
FastReplica
L. Cherkasova, J. Lee
Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems, 2003
FastReplica in the Small
Problem Statement:
Let N0 be a node which has an original file F of size Size(F)
Let R = {N1, …, Nn} be the replication set of nodes
The problem consists in replicating file F across the nodes N1, …, Nn while minimizing the overall replication time
We assume n ∈ [10, 30] nodes
File F is divided into n equal chunks F1, …, Fn, where Size(Fi) = Size(F)/n
FastReplica consists of two steps:
Distribution Step
Collection Step
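The two steps can be stated compactly as a transfer plan (a sketch of my own; the function name and node numbering are mine, not from the paper):

```python
def fastreplica_plan(n):
    """Return the (sender, chunk, receiver) transfers for n recipient nodes.

    Nodes are numbered 0 (the source N0) to n; chunk i is the i-th of the
    n equal pieces of the file.
    """
    distribution = [(0, i, i) for i in range(1, n + 1)]   # N0 -> Ni: chunk Fi
    collection = [(i, i, j)                               # Ni -> Nj: chunk Fi
                  for i in range(1, n + 1)
                  for j in range(1, n + 1) if j != i]
    return distribution, collection

dist, coll = fastreplica_plan(4)
# After both steps every node holds every chunk:
have = {node: {node} for node in range(1, 5)}             # from the distribution step
for sender, chunk, receiver in coll:
    have[receiver].add(chunk)
assert all(have[node] == {1, 2, 3, 4} for node in have)
```

Each of the n + n(n−1) transfers carries only Size(F)/n bytes, which is the source of the speedup analyzed below.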
FastReplica in the Small: Distribution Step
[Figure: N0 splits file F into chunks F1, …, Fn and sends chunk Fi to node Ni]
N0 sends to Ni:
• chunk Fi
• the list of nodes R \ {Ni} to which chunk Fi must be sent in the next step
FastReplica in the Small: Collection Step (View "from a Node")
[Figure: node N1 sends its chunk F1 to every other node N2, …, Nn]
After receiving Fi, node Ni opens (n−1) connections to its siblings and sends chunk Fi to them
FastReplica in the Small: Collection Step (View "to a Node")
[Figure: node Nn receives chunks F1, …, Fn−1 from its siblings]
Thus each node Ni has:
• (n−1) outgoing connections to send chunk Fi
• (n−1) incoming connections from the remaining nodes in the group to receive the other chunks F1, …, Fi−1, Fi+1, …, Fn
What Is the Main Idea of FastReplica?
Instead of the typical replication of the entire file F to n nodes using n Internet paths, FastReplica exploits (n × n) different Internet paths within the replication group, where each path is used for transferring 1/n-th of file F
Benefits:
The impact of congestion along the involved paths is limited to a transfer of 1/n-th of the file
FastReplica takes advantage of the upload and download bandwidth of the recipient nodes
Limitations:
The organization of nodes is very static and does not scale to larger numbers of nodes
Preliminary Performance Analysis of FastReplica in the Small
Two performance metrics: average and maximum replication time
Idealistic setting: all nodes and links are homogeneous, and each node can support n network connections to other nodes at B bytes/sec
Time_distr = Size(F) / (n · B)
Time_collect = Size(F) / (n · B)
FastReplica: Time_FR = Time_distr + Time_collect = 2 · Size(F) / (n · B)
Multiple Unicast: Time_MU = Size(F) / B
Replication_Time_Speedup = Time_MU / Time_FR = n / 2
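A quick numeric check of these formulas (my own, with made-up numbers: a 600 MB file, n = 20 nodes, B = 1 MB/s per connection):

```python
size_f = 600.0   # file size in MB
n, B = 20, 1.0   # number of nodes, bandwidth per connection in MB/s

time_fr = 2 * size_f / (n * B)    # FastReplica: distribution + collection
time_mu = size_f / B              # multiple unicast from the source
speedup = time_mu / time_fr

print(time_fr, time_mu, speedup)  # 60.0 600.0 10.0, i.e., a speedup of n/2
```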
Parallel Access to Replicated Information in the Internet
Note: While this idea was originally proposed by P. Rodriguez and E. Biersack for content that exists in multiple copies on the Web, it turns out that the same concept is very useful in peer-to-peer systems. We first present the scheme as originally conceived and then come back to it when discussing BitTorrent.
Overview
Motivation and Problem Statement
How it works
Chunk and Peer selection
Performance study
Implementation and Deployment Issues
Conclusion
Access to Frequently Requested Content
Replicate the same content in several places (mirrors) in the network
Clients are directly served from one of the mirror servers
How are copies created?
Web caches are widely deployed in the Internet to store copies of master documents from multiple content providers
In control of the ISP
Mirror sites fully replicate the content of a certain content provider
In control of the content provider
Content distribution networks (e.g., Akamai) replicate content from different content providers
In control of the content provider; multiple content providers share the same resources
Issues
How to redirect clients to the "best" copy?
Selecting the Best "Copy"
The copy selected should provide the client the lowest possible access time and achieve good overall load balancing between the different sites
Techniques for copy selection:
IP-administrative domain, Internet topology
Number of hops, RTTs
Application-level measurements
Problems:
High complexity and overhead (periodic polls, state information)
The selected copy may not always be the best one at this point in time
Parallel Access
Instead, do a parallel access!
Avoids non-trivial server selection
Performs load balancing
[Figure: a client downloads a popular document from several mirror servers in parallel]
How to choose the block size?
The number of blocks should be larger than the number of mirror sites accessed in parallel
Each block should be small enough to rapidly adapt to changing conditions and to ensure that the last block requested from each server terminates at about the same time
Each block should be sufficiently large to reduce the influence of the idle times and to reduce the number of negotiations (transmission time >> RTT)
[Figure: request/response timeline between client and mirror site; without pipelining, an idle time of about one RTT separates the end of block i and the request for block i+1, whereas with pipelining block i+1 is requested while block i is still being transmitted, hiding the idle time]
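The self-balancing effect of fetching block by block can be illustrated with a greedy scheduler (a sketch of my own; the rates and block count are made-up):

```python
import heapq

def parallel_access(server_rates, n_blocks, block_size=1.0):
    """Return (finish_time, blocks fetched per server) for greedy assignment.

    Whenever a server finishes a block, it is handed the next missing
    block, so faster servers automatically fetch more blocks.
    """
    free_at = [(0.0, s) for s in range(len(server_rates))]  # (free time, server)
    heapq.heapify(free_at)
    fetched = [0] * len(server_rates)
    finish = 0.0
    for _ in range(n_blocks):
        t, s = heapq.heappop(free_at)         # next server to become idle
        t += block_size / server_rates[s]     # it downloads one more block
        fetched[s] += 1
        finish = max(finish, t)
        heapq.heappush(free_at, (t, s))
    return finish, fetched

# Two mirrors, one twice as fast: it ends up serving 2/3 of the blocks,
# and both finish at the same time (no server selection needed).
print(parallel_access([2.0, 1.0], n_blocks=30))
```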
Assumptions
We consider popular documents that are identically replicated (bit-by-bit)
Documents should be large enough (several hundreds of Kbytes)
For small documents, several documents can be grouped (e.g., all in-lined objects of a Web page) and a parallel access performed on the group
The last block can be requested from several servers at the same time to avoid waiting for very slow servers
We assume that the paths from the client to the mirror servers are bottleneck-disjoint
Clients and servers are able to use range requests as specified in HTTP 1.1
Performance Evaluation
The current implementation runs as a Java client and does not perform pipelining
We considered mirror sites of the Squid software (http://squid.nlanr.net)
We evaluated the performance of a parallel access every 15 minutes during a period of 10 days, and averaged the performance over the 10-day period
For every experiment, we calculate the optimum transmission time, i.e., the transmission time obtained when all servers send useful information until the document is fully received and there are no idle times
[Map: client at Eurecom, France; mirror sites in Slovakia, Portugal, Greece, Spain, Austria, the UK, Japan, Australia, and Israel]
Parallel Access: Performance
Drastically reduces download times and smooths out bandwidth variations
The differences between the optimum and the parallel access are due to the idle times ⇒ pipelining
Small Documents
For small documents the transmission time is not very relevant
Pipelining is difficult due to the small transmission times
Even for a 10 KB file, a parallel access achieves very good performance
Parallel Access on a Modem Line
A parallel access over a modem line performs as well as the fastest server, without any server selection
Simulation of Pipelining
With pipelining, the parallel-access rate is very close to the optimum
However, pipelining is not crucial, since a parallel access without pipelining is already very good
Parallel Access: Multiple Sites vs. One Site
A parallel access to a single server may result in worse performance than a single access, since the connections compete among themselves
Current Implementation
0-yasamin$ pa -n20 http://www.auth.gr/Squid/FAQ/Squid1.2.ps.gz http://www.uniovi.es/Squid/FAQ/Squid1.2.ps.gz
Source 0 => http://www.auth.gr/Squid/FAQ/FAQ.ps.gz
Source 1 => http://www.uniovi.es/~mirror/squid/FAQ/FAQ.ps.gz
Parts Requested/Received
Requested: 01101 01010 11001 10101
Received:  01101 01010 11001 10101
It took 42 sec to download 763 KB
Parallel Rate: 18 KBps (42 sec)
Rate Source 0: 8 KBps (91 sec)
Rate Source 1: 10 KBps (74 sec)
Deployment Issues
What if everybody does the same thing?
If all clients share the same bottleneck ⇒ the speedup is reduced
If clients do not share the same bottleneck ⇒ the speedup remains high (for more information see: Christos Gkantsidis, Mostafa Ammar, Ellen Zegura, "On the Effect of Large-Scale Deployment of Parallel Downloading", IEEE Workshop on Internet Applications (WIAPP'03), 2003)
In any case:
Clients experience a performance at least equal to that of the fastest server
Load is automatically shared among all servers: more powerful servers will receive more requests, less powerful servers fewer
There is no need for a server selection algorithm
Conclusions
A parallel access for popular and large documents:
Avoids non-trivial server selection
Smooths bandwidth fluctuations
Speeds up document downloads
Even in bandwidth-limited environments, the performance obtained with a parallel access is at least as good as that offered by the best server
Parallel access is applicable in P2P environments: clients that have a copy act as mirror servers
Parallel access to files is implemented today in various tools such as Morpheus, eDonkey, or BitTorrent
Handout
Pablo Rodriguez and Ernst W. Biersack. Dynamic Parallel-Access to Replicated Content in the Internet. IEEE/ACM Transactions on Networking, 10(4):455--464, August 2002.
BitTorrent
Peer-to-peer based system for distributing a file from one or more nodes that have a complete copy to a possibly large number of peers
Overview
Motivation
How it works
Chunk and Peer selection
Performance study
Global performance (tracker log)
Client behavior and performance (client log)
Summary
Content Distribution Model: Cooperative Networking
In a cooperative network (“peer-to-peer”), all nodes are both client and server
Many nodes, but unreliable and heterogeneous
Takes advantage of distributed, shared resources (bandwidth, CPU, storage) on peer nodes
Fault-tolerant, self-organizing
Dynamic environment: frequent join and leave is the norm
[Figure: cooperative content distribution; peers exchange data with each other as well as with the source]
Cooperative Distribution: Intuition
Source server: 100 Mb/s; clients: 10 Mb/s
1. Antivirus update: 100,000 clients, 4 MB file
Client/server: 9h:52m; cooperative: 52s
2. Daily database update: 1,000 clients, 600 MB file
Client/server: 14h:48m; cooperative: 09m:54s
Cooperative Distribution in BitTorrent
Principle: capitalize on the bandwidth of edge computers
Self-scaling network: more clients ⇒ more aggregate bandwidth ⇒ more scalability
Cost-effective, robust against failures and flash crowds
How well does it work in practice?
Study of a BitTorrent tracker log covering 5 months, a 1.77 GB file, and 180,000+ clients
Tracing the behavior of a BitTorrent client
Elements of a BitTorrent Session
One session = distribution of a single (generally large) file
Elements:
An ordinary web server
A static 'metainfo' (torrent) file
A tracker
An original downloader
On the end-user side: web browser + BT client plug-in
Joining a BT Session
[Figure: a new peer, the web server hosting the torrent file, and the tracker maintaining the list of active peers IP1, IP2, IP3, …]
1. Download the torrent meta-info file
2. Launch the BT client
3. The BT client contacts the tracker
4. The tracker picks 40 peers at random for the new client
5. The BT client cooperates with the peers returned by the tracker
Session Initiation
Start running a web server that hosts a torrent file
The torrent file contains the IP address of the tracker
The tracker (often not on the web server) tracks all peers
Initially, it must know at least one peer with the complete file
A peer that has the entire file: seed
A peer still downloading the file: leecher
On the client side:
The BT client reads the tracker IP address and contacts the tracker (through HTTP or HTTPS)
The tracker provides the BT client with a set of active peers (leechers and seeds, typically 40) to cooperate with
Clients regularly report their state (% of download) to the tracker
Peer Sets
The tracker picks peers at random from its list
Once a peer is incorporated in the BT session, it can also be picked to be in the peer set of another peer
This technique allows a wide temporal diversity
A peer knows both older peers and newcomers!
Ensures the transfer of chunks between "generations"
Note: a peer communicates with its initial peer set and the other peers that contacted it, but NOT with other peer sets
[Figure: timeline of a peer p, the peers of p's initial peer set, and the later-arriving peers that have p in their initial peer set]
File Transfer Algorithm
The file is broken into chunks (typically 256 kB)
The torrent file contains a SHA1 hash for each chunk: allows checking the integrity of each chunk
Reports are sent regularly (at start-up, at shutdown, and every 30 minutes) to the tracker
Unique peer ID, IP, port, quantity of data uploaded and downloaded, status (started, completed, stopped), etc.
Peers connect with each other over TCP, full duplex (data transits in both directions)
Upon connection, peers exchange their lists of chunks
Each time a peer has downloaded a chunk and checked its integrity, it advertises it to its peer set
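The per-chunk integrity check can be sketched in a few lines (my own simplification; real clients read the digests from the torrent metainfo):

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # 256 kB, the typical BitTorrent chunk size

def chunk_digests(data):
    """SHA1 digest of every CHUNK_SIZE slice of the file."""
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).digest()
            for i in range(0, len(data), CHUNK_SIZE)]

def verify_chunk(index, chunk, digests):
    """Accept a downloaded chunk only if it matches the published hash."""
    return hashlib.sha1(chunk).digest() == digests[index]

file_data = b"x" * (3 * CHUNK_SIZE)      # stand-in for a 3-chunk file
digests = chunk_digests(file_data)       # normally taken from the torrent file
assert verify_chunk(0, file_data[:CHUNK_SIZE], digests)
assert not verify_chunk(0, b"corrupted", digests)
```

Because every chunk is verified independently, a peer can safely advertise and re-serve a chunk as soon as its hash matches, without waiting for the whole file.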
Connection States
On each side, a connection maintains 2 variables:
"Interested": you have a chunk that I want
Allows a peer to know its possible clients for upload
"Choked": I don't want to send you data at this time
Possible reasons: I have found faster peers, you did not or cannot reciprocate enough, …
Chunk Selection Algorithm
Which missing chunk should we request from other peers?
Simple strategy: random selection
Choose at random among the chunks available in the peer set
Randomness ensures diversity
Biased strategy: peers apply the rarest-first policy
Choose the least-represented missing chunk in the peer set
Rare chunks can more easily be traded with others
Maximizes the minimum number of copies of any given chunk in each peer set
BT uses the rarest-first policy, except for newcomers, which use random selection to quickly obtain a first chunk
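Rarest-first itself fits in a few lines (a sketch of my own; real clients count chunk availability from the bitfields their peers advertise):

```python
from collections import Counter

def rarest_first(my_chunks, peer_chunks):
    """peer_chunks: dict peer -> set of chunk indices that peer holds."""
    counts = Counter(c for chunks in peer_chunks.values() for c in chunks)
    candidates = [c for c in counts if c not in my_chunks]
    if not candidates:
        return None
    return min(candidates, key=lambda c: counts[c])  # least-replicated chunk

peers = {"p1": {0, 1, 2}, "p2": {0, 2}, "p3": {0}}
print(rarest_first(my_chunks={0}, peer_chunks=peers))  # 1: only p1 holds it
```

Requesting chunk 1 here both maximizes its trading value and raises the minimum replication level in the peer set.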
Peer Selection Algorithm
Serving too many peers simultaneously is not efficient: BT serves 5 hosts in parallel
Which hosts to serve?
The ones that also serve us: tit-for-tat (leechers)
The ones that offer the best download rates (seeds)
Can there be any better hosts?
Optimistically unchoke a random peer to possibly find another host that provides better service
Newcomers have less data to offer ⇒ give them "priority" in the optimistic unchoke
BT reconsiders choking/unchoking every 10 s (long enough for TCP to reach steady state)
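A leecher's unchoke round can be sketched as follows (my own simplification: 4 regular slots plus 1 optimistic slot, ignoring the newcomer priority):

```python
import random

def choose_unchoked(download_rates, n_slots=4, rng=random):
    """download_rates: dict peer -> rate at which that peer serves us.

    Keep the peers that upload to us fastest (tit-for-tat), plus one
    randomly chosen peer as the optimistic unchoke.
    """
    regular = sorted(download_rates, key=download_rates.get, reverse=True)[:n_slots]
    others = [p for p in download_rates if p not in regular]
    optimistic = [rng.choice(others)] if others else []
    return regular + optimistic

rates = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10, "f": 5}
unchoked = choose_unchoked(rates)
print(unchoked)  # the 4 fastest peers plus one random optimistic unchoke
```

The optimistic slot is what lets the client discover a peer that would reciprocate better than one of its current partners.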
BitTorrent Study
Five-month (April to August 2003) tracker log of a very popular BT session
Linux RedHat 9, 1.77 GB file
The log contains all the reports of all the clients (ID, IP, amount of bytes uploaded and downloaded)
In addition, we ran our own instrumented client on 3 different days to observe a given peer set
This log contains the blocks uploaded to and downloaded from each host (each time a host obtains a new block, it advertises it to its peer set)
It exhibits the behavior of BT during the download phase and once the client becomes a seed
Tracker Log
180,000 clients during the 5-month period
Initial flash crowd: 51,000 clients during the first 5 days
During the flash crowd: one new client every 80 s
Tracker Log: Number of Clients
Reaches 4000+ active clients during the first day (flash crowd)
Remains in the interval [100, 200] later
Tracker Log: Clients' Behavior
Clients are very altruistic:
1. When they are leechers, they have no choice due to tit-for-tat
2. Once the download is completed, since they stay on average 3 hours after the download:
The transfer is long and may complete overnight
The content is legal (the RIAA will not sue!)
The user is very kind
Tracker Log: Seeds
The presence of seeds is a key feature of BT
Over the 5 months they contributed twice as much volume as the leechers (40 Tbytes vs. 20 Tbytes)
Tracker Log: Seeds vs. Leechers
The percentage of seeds is consistently high (20+%), with a peak during the flash crowd
Thus, two factors allow BT to sustain the flash crowd:
Its ability to quickly create seeds (i.e., complete downloads)
The fact that users are altruistic and seeds remain online
Tracker Log: BT vs. Mirroring
The throughput per leecher is always above 500 kb/s
At least ADSL-class clients
The aggregate throughput of the system (the sum over all leechers at each instant) was higher than 800 Mb/s
That is more than 80 mirrors, each sustaining a 10 Mb/s service
Considering only the 20,000 hosts that completed the download in a single session (BT allows resume):
Their throughput is better than average: 1.3 Mb/s
Yet the average download time is 30,000 s (8.3 h), whereas 1.77 GB at 1.3 Mb/s takes 10,000 s (2.7 h)
Conclusion: a high variance in download throughputs!
Tracker Log: Complete Sessions
The peak of the throughput distribution is close to 400 kb/s (ADSL speed)
Some hosts have very high bandwidth, which pulls up the mean
Tracker Log: US vs. Europe
In the first 4 weeks: 45% of the clients came from the US, 15% from Europe
US clients have better access links than European clients (more high-bandwidth peers)
Tracker Log: Incomplete Sessions
Causes of abortion (no interest, crash)?
Assumption: abortions are due to experiencing bad service
Valid if users receive almost nothing while online
60% of the incomplete sessions last less than 1,000 sec (<20 min)
90% of the incomplete sessions last less than 10,000 sec (<3 h)
The throughput of incomplete sessions is smaller than that of complete sessions
Tracker Log: Incomplete Sessions
90% of the non-completing clients downloaded less than 10% of the file
Client Log
Modified client behind a 10 Mb/s campus access link
3 transfers, on 3 days of the 5th month (far from the flash crowd)
Average transfer time: 4,500 s (1.25 h, a fast client!)
We remained as a seed for another 13 hours
The number of connected clients drops after the end of the download phase
Explanation: seeds disconnect
Client Log: Tit-for-Tat
Many straight lines: continuous service
Steps: choking effect
The large step: client disconnected?
Client Log: Upload and Download
Ramp-up period (obtaining the first chunks), then the connections reach full speed
At the end of the download, the client only serves chunks
The client never gets stalled: we always find peers to serve and to download chunks from ⇒ good efficiency
We had uploaded as much as we downloaded after 10,000 sec, i.e., twice the download time
Cooperation is enforced: the download rate increases because the upload rate increases
Client Log: Tit-for-Tat
Who gave us the file, seeds or leechers?
40% from seeds and 60% from leechers
85% of the file was provided by only 25% of the peers
Most of the file was provided by peers that connected to us (not from our original peer set)
How good is the tit-for-tat policy?
Two conflicting goals:
Must enforce cooperation among peers
Must allow transfers even if the bandwidths are not perfectly balanced
Example: I don't give you anything because I can send you at 100 kb/s whereas you can only send at 80 kb/s
Client Log: Tit-for-Tat
We found that BT pays more attention to the amount of data transferred than to the balance of bandwidths, a very good property
High correlation of traffic volumes, low correlation of throughputs
[Plot legend: V: volume, T: throughput, d: download, u: upload]
Client Log: Tit-for-Tat
We received more than we gave (a ratio of 4, and still a ratio of 2 if we do not account for seed traffic)
Probably due to our good download capacity and to the tit-for-tat enforcement
BitTorrent Summary
BT is very efficient for highly popular downloads
Still, its performance might suffer if clients do not stay long enough as seeds, e.g., in the case of illegal content…
What happened to the 160,000 incomplete downloads?
BT is clearly able to sustain large flash crowds
Some open questions:
Could we do better by using different peer and chunk selection strategies?
Handout
M. Izal, G. Urvoy-Keller, E.W. Biersack, P. Felber, A. Al Hamra, and L. Garces-Erice. Dissecting BitTorrent: Five Months in a Torrent's Lifetime. In Passive and Active Measurements 2004, April 2004.
Performance Analysis of P2P Networks for File Distribution
Ernst W. Biersack, [email protected]
Institut Eurecom, France
Overview
Motivation
Model and Metrics
Three distribution techniques:
Linear chain
Tree
Parallel trees
Performance comparison
Further improvements
Lessons learned and summary
Context and Problem
A growing number of well-connected users access increasing amounts of content
But interest in content is often "Zipf" distributed (a small fraction of the content is very popular)
Servers and links are overloaded
Number of clients
Size of content
"Flash crowd" (e.g., 9/11)
Tremendous engineering (and cost!) is necessary to make server farms scalable and robust
[Figure: traditional client/server content distribution; congestion at the source leads to degraded or lost service]
Problem: scalable distribution of content
Cooperative Networking
In a cooperative network (“peer-to-peer”), all nodes are both client and server
Many nodes, but unreliable and heterogeneous
Takes advantage of distributed, shared resources (bandwidth, CPU, storage) on peer nodes
Fault-tolerant, self-organizing
Dynamic environment: frequent join and leave is the norm
[Figure: cooperative content distribution; peers exchange data with each other as well as with the source]
Cooperative Distribution
Principle: Capitalize on the bandwidth of edge computers
Self-scaling network: more clients ⇒ more aggregate bandwidth ⇒ more scalability
Self-organizing: robust against failures and flash crowds
How well does it work in practice?
Study of BitTorrent over 5 months, 1.77 GB file, 180,000+ clients, 60+ TB transferred ⇒ scales very well
What are the best cooperative distribution strategies?
Analytical models of distribution topologies
Model and Metrics
A single copy of the file is held by the source s
The server serves the file indefinitely
N peers p1, …, pN arrive at t0 and request the same file F
File F is broken into C chunks
Peers cooperate by exchanging chunks
Peers upload F a limited number of times before leaving
All peers and the source have an identical upload and download rate b
1 round (unit of time) = time needed to download F at rate b
Time needed to download a chunk = 1/C
Metrics:
T(N): number of rounds needed to fully serve N peers (≥ 1)
N(t): number of peers served within t rounds
Sequential Service
Non-cooperative approach: iteratively serve the peers, one after the other
Independent of C; scales linearly with t
[Figure: the source s serves p1, p2, p3 one after the other (C = 3)]
T_Sequential(N) = N
N_Sequential(t) = t
Linear Chain
Cooperative approach:
Every peer serves the whole file once to another peer and then disconnects
A peer can start serving as soon as it has 1 chunk
Many chunks improve scalability
[Figure: chains of peers forming over time (C = 3)]
After t rounds: t+1 chains
The oldest chain has 1 + tC peers (1 + (t−1)C of them complete)
Linear Chain
Node-to-chunk ratio N/C:
T_Linear(C,N) = [(C−2) + √((C−2)² + 8NC)] / (2C)
N/C << 1: T_Linear ≈ 1 (all peers active most of the time, 1 chain)
N/C = 1: T_Linear ≈ 2 (the first peer finishes when the last one starts, 1 chain)
N/C >> 1: T_Linear ≈ √(2N/C) (few peers active, many complete or not yet started, several chains)
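The completion time T_Linear(C,N) = ((C−2) + √((C−2)² + 8NC)) / (2C) can be cross-checked by counting finish times directly (my own simulation of the model: chain j starts at round j, and a peer finishes one round after it starts receiving):

```python
from math import sqrt

def t_linear_closed_form(C, N):
    return ((C - 2) + sqrt((C - 2) ** 2 + 8 * N * C)) / (2 * C)

def t_linear_simulated(C, N):
    """Work in time steps of 1/C: chain j starts at step j*C, and its
    i-th peer (i = 0, 1, ...) finishes at step j*C + i + C."""
    s = 0
    while True:
        done = 0
        j = 0
        while j * C + C <= s:              # chains whose first peer is done
            done += s - j * C - C + 1      # completed peers in chain j
            j += 1
        if done >= N:
            return s / C                   # back to rounds
        s += 1

C, N = 100, 10_000                          # the N/C >> 1 regime
print(t_linear_simulated(C, N), t_linear_closed_form(C, N))
```

Both values come out near 14.64 rounds, close to the √(2N/C) = √200 ≈ 14.14 approximation for this regime.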
Tree_k
The server uploads the file to k peers in parallel at rate b/k
Every peer downloads the file at rate b/k, i.e., in k rounds
Non-leaf peers upload the whole file to k peers at rate b/k
Serves the file k times
Reduces to the linear chain when k = 1
[Figure: a binary tree (k = 2, C = 3, N = 30) rooted at the source s, with each link running at rate b/2]
Tree_k
Scales exponentially with C and t; T and N depend on k
Deep trees delay the engagement of peers
Flat trees have low throughput (b/k)
T_Tree_k(C,N,k) = k + ⌊log_k(N)⌋ · k/C
(file transfer at rate b/k, plus the per-level delay k/C times the tree depth)
N_Tree_k(C,t,k) ≈ k^((t−k)·C/k + 1)
Tree_k
Optimal value for k:
k_opt = e^[(−log N + √((log N)² + 4(C−1)·log N)) / (2(C−1))]
Depends on N and C
Larger values yield more leaves ⇒ more non-cooperating peers
When N/C is large, the linear chain has few peers simultaneously active ⇒ the tree is better
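Taking the tree's service time as T_Tree(k) = k + ⌊log_k N⌋ · k/C (file transfer at rate b/k plus the per-level pipelining delay), a quick numeric scan of my own shows how the file-transfer term penalizes larger k:

```python
from math import floor, log

def t_tree(C, N, k):
    """Service time of Tree_k: k rounds to receive the file at rate b/k,
    plus a per-level delay of k/C over the depth of the tree."""
    return k + floor(log(N) / log(k)) * k / C

# Sample values of my own: C = 100 chunks, N = 10,000 peers.
print([round(t_tree(C=100, N=10_000, k=k), 2) for k in (2, 3, 4, 5)])
# [2.26, 3.24, 4.24, 5.25]: deeper but thinner trees win here
```

With many chunks the per-level delay k/C is tiny, so the k-round download at rate b/k dominates and small fan-outs are preferable.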
PTree_k
The server uploads 1/k-th of the file to each of k peers in parallel at rate b/k
Every peer downloads the file from k peers at rate b/k
Every peer uploads its 1/k-th of the file to k peers at rate b/k
Serves the file once
Creates k parallel spanning trees; reduces to the linear chain when k = 1
Each peer is an interior node of at most one tree ⇒ reliability
[Figure: two parallel spanning trees (k = 2, N = 15) rooted at the source s; every peer is an interior node in one tree and a leaf in the other]
PTree_k
Scales exponentially with C and t
Optimal value for k: k = e
Independent of N and C
Larger values provide better resilience to failures
T_PTree_k(C,N,k) = 1 + ⌊log_k(N)⌋ · k/C
(file transfer at full rate b, plus the per-level delay k/C times the tree depth)
N_PTree_k(C,t,k) ≈ k^((t−1)·C/k)
PTree_k vs. Linear Chain
When N/C << 1, peers stay engaged a long time in the chain ⇒ the benefit of PTree_k diminishes
PTree_k outperforms the linear chain when N/C > 10^−1
[Plots: completion time of PTree_k=3 vs. the linear chain, for C = 10² and C = 10⁶]
PTree_k vs. Tree_k
The download time of PTree_k is close to 1 round (independent of k)
Tree_k always needs at least k rounds
[Plots: completion time of PTree_k=2,3 vs. Tree_k=2 and Tree_k=3, for C = 10² and C = 10⁴]
Recap
PTree_k performs best; further, PTree_k is less sensitive to churn

          Download rate | Upload rate  | Copies served | Service time T(N,C)              | Clients served N(t)
Linear    b             | b (- leaves) | 1             | ((C−2) + √((C−2)² + 8NC)) / (2C) | ≈ C·t²/2
Tree_k    b/k           | b (- leaves) | k             | k + ⌊log_k(N)⌋ · k/C             | ≈ k^((t−k)·C/k + 1)
PTree_k   b/k × k = b   | b            | 1             | 1 + ⌊log_k(N)⌋ · k/C             | ≈ k^((t−1)·C/k)

(Note: peers upload as much as or more than they download, the opposite of ADSL)
89 © Institut Eurécom 2005
Can we do Better?Phase 1: Send one chunk to every node
[Figure: source s injects chunks c1, c2, c3, … into peers p1…p8, one per round; peers replicate received chunks in parallel; snapshots at t = 0, 1/C, 2/C, 3/C]
The number of copies of each chunk grows exponentially
Assuming N = 2^m − 1, after m rounds (one round = 1/C), each peer has one chunk
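The doubling argument can be verified with a tiny round-based recurrence (a sketch, not the slide's exact schedule): in each round, every chunk-holding peer serves one empty peer and the source brings in one more.

```python
def phase1_rounds(n):
    """Rounds until all n peers hold at least one chunk:
    the number of owners doubles each round (every owner uploads
    to one empty peer) and the source adds one more: o -> 2*o + 1."""
    owners, rounds = 0, 0
    while owners < n:
        owners = min(n, 2 * owners + 1)
        rounds += 1
    return rounds

# With N = 2**m - 1 peers, exactly m rounds (of 1/C each) are needed.
for m in (3, 10, 20):
    assert phase1_rounds(2**m - 1) == m
print("ok")
```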
90 © Institut Eurécom 2005
Can we do Better?
Phase 2: peers exchange chunks
At each round, each peer receives one new chunk
We need C − 1 rounds to deliver the remaining C − 1 chunks to all peers
[Figure: each round, peers exchange chunks pairwise: holdings grow from c1, c2, … to c1+c2, c1+c3, … until every peer holds all C chunks]
T(N = 2^m − 1, C) = 1 + (log_2(N+1) − 1)/C ≈ 1 + (log_2 N)/C
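Plugging in the round counts gives the same expression (a numeric sketch: m rounds for Phase 1, C − 1 rounds for Phase 2, each of length 1/C):

```python
import math

def total_time(m, c):
    """Total time for N = 2**m - 1 peers and C chunks:
    (m + C - 1) rounds of length 1/C each."""
    return (m + c - 1) / c

m, c = 10, 200              # N = 1023 peers, 200 chunks
n = 2**m - 1
t = total_time(m, c)
print(t)                    # 1.045
# Agrees with the closed form T = 1 + (log2(N+1) - 1)/C
assert abs(t - (1 + (math.log2(n + 1) - 1) / c)) < 1e-9
```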
91 © Institut Eurécom 2005
Lessons Learned
Engage peers as fast as possible and keep them engaged as long as possible
Keep the number of copies of each chunk about the same (to avoid many peers having to wait for a very rare chunk that only a few peers own)
A linear chain can perform better than a tree (for N/C << 1)
Analysis is a first step
Deterministic analysis
Homogeneous clients and homogeneous bandwidth
No peer failures
Real systems must account for heterogeneity and churn
92 © Institut Eurécom 2005
Summary
Self-scaling: more clients ⇒ more aggregate bandwidth ⇒ more scalability
PTreek is a very efficient tree-based architecture
The file should be split into many chunks
Performance scales exponentially with the number of chunks C
But not too many: coordination and connection overhead
Limit the number of simultaneous uploads to k = 3–5
Higher values provide more robustness
Self-organizing network
Degree, chunk selection strategy, peer selection strategy
93 © Institut Eurécom 2005
Handout
E. W. Biersack, P. Rodriguez, and P. Felber. Performance Analysis of Peer-to-Peer Networks for File Distribution. In Proceedings of the Fifth International Workshop on Quality of Future Internet Services (QofIS'04), Barcelona, Spain, September 2004.
94 © Institut Eurécom 2005
Mesh-based Architectures
95 © Institut Eurécom 2005
Overview
Motivation
Peer and chunk selection strategies
Model for performance evaluation
Results
Simultaneous peer arrival
Rarest block
Random block
Mesh-based topologies vs. static (tree-like) topologies
Lessons learned and summary
96 © Institut Eurécom 2005
Mesh-based Architectures: Motivation
In practice, peers come and go (referred to as churn)
The network self-organizes according to three factors:
Indegree and outdegree: number of neighbors
Peer selection strategy: which peer to serve next?
Chunk selection strategy: which chunk to serve next?
97 © Institut Eurécom 2005
Cooperative Distribution Simulator
Simulate cooperative distribution of a large file
File size: 200 chunks of 256 kB = 51.2 MB
Number of connections per peer: 5 (more if free bandwidth)
Number of origin peers: 1, with a 128 kb/s uplink
Peer capacities
Homogeneous symmetric: all peers with 128 kb/s
Homogeneous asymmetric: 100% of the peers with 512/128 kb/s
Heterogeneous asymmetric: 50% of the peers with 512/128 kb/s and 50% with 128/64 kb/s
Peer lifetime
Selfish: peers leave as soon as their download finishes
Altruistic: peers remain online 5 minutes after completion
98 © Institut Eurécom 2005
Cooperative Distribution Simulator
Peer arrivals
Simultaneous: 5000 peers arrive at t0
Continuous: average inter-arrival time of 2.5 s
Block selection policy
Random
Rarest
Peer selection policy
Random
Least missing
Most missing
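The setup above can be approximated with a toy round-based simulator. All simplifications here are ours (one upload slot per peer per round, random peer selection, unit-capacity links, no churn); it only illustrates how a chunk-selection policy plugs into such a model:

```python
import random

def simulate(n_peers=50, n_chunks=20, chunk_policy="rarest", seed=1):
    """Round-based sketch: one source holding all chunks; each
    chunk-holder uploads one chunk per round to one random peer."""
    rng = random.Random(seed)
    source = set(range(n_chunks))
    peers = [set() for _ in range(n_peers)]
    rounds = 0
    while any(len(p) < n_chunks for p in peers):
        rounds += 1
        if rounds > 10_000:
            raise RuntimeError("did not converge")
        # copies of each chunk currently in the system (incl. source)
        counts = [1 + sum(c in p for p in peers) for c in range(n_chunks)]
        for up in [source] + [p for p in peers if p]:
            target = rng.choice([p for p in peers if p is not up])
            missing = [c for c in up if c not in target]
            if not missing:
                continue  # nothing useful to send to this target
            if chunk_policy == "rarest":
                chunk = min(missing, key=lambda c: counts[c])
            else:
                chunk = rng.choice(missing)
            target.add(chunk)
    return rounds

print("rarest:", simulate(chunk_policy="rarest"), "rounds")
print("random:", simulate(chunk_policy="random"), "rounds")
```

The real simulator used for the following slides models bandwidths, arrivals, and lifetimes; this sketch only captures the policy hook.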
99 © Institut Eurécom 2005
Simultaneous Arrivals: Completion Times
Least missing performs very poorly. WHY?
Rarest chunk selection, simultaneous arrivals, homogeneous and symmetric bandwidth, selfish peers
100 © Institut Eurécom 2005
Simultaneous Arrivals: Download Duration
Rarest chunk selection, simultaneous arrivals, heterogeneous and asymmetric bandwidth, selfish peers
Two distinct classes of peers (most missing tends to even out the download durations)
[Plot: fast peers vs. slow peers]
101 © Institut Eurécom 2005
Simultaneous Arrivals: Completion Times
Random and least missing suffer from random chunk selection and peer selfishness
T_opt = 200 · 256 kB · 8 / 128 kb/s = 3200 sec
With least missing, the first peer completes earliest
Random chunk selection, simultaneous arrivals, homogeneous and symmetric bandwidth, selfish peers
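The optimum above is just file size divided by the source's uplink; as a quick check:

```python
# 200 chunks of 256 kB, served over a 128 kb/s uplink
chunks, chunk_kbytes, uplink_kbps = 200, 256, 128
t_opt = chunks * chunk_kbytes * 8 / uplink_kbps  # kbit / (kbit/s)
print(t_opt)  # 3200.0 seconds
```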
102 © Institut Eurécom 2005
Simultaneous Arrivals: Efficiency
Most missing and adaptive missing are the most efficient (they quickly engage many peers)
Random chunk selection, simultaneous arrivals, homogeneous and symmetric bandwidth, selfish peers
103 © Institut Eurécom 2005
Simultaneous Arrivals: Chunk Distribution
Random peer selection: a single chunk ends up missing from many peers (served sequentially by the source)
Random chunk selection, simultaneous arrivals, homogeneous and symmetric bandwidth, selfish peers
Random chunk selection quickly brings all peers close to completion, but stalls on the last few chunks
104 © Institut Eurécom 2005
Comparison of Tree-based and Mesh-based Approaches
We will show that
Least-missing provides performance similar to Treek
Most-missing provides performance similar to PTreek
Advantages of meshes
No need to construct trees
More flexible in case of node failures or node heterogeneity
105 © Institut Eurécom 2005
Mesh-based Approaches
Least-missing:
Chunk selection strategy = globally rarest
Peer selection strategy = closest to completion, i.e., serve with priority the peer that holds the highest number of chunks among all peers
Most-missing:
Chunk selection strategy = globally rarest
Peer selection strategy = furthest from completion, i.e., serve with priority the peer that holds the lowest number of chunks among all peers
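The two peer-selection rules differ only in the ordering; a minimal sketch (the `peers` list of chunk-sets is our illustration, not the simulator's actual data structure):

```python
def least_missing(peers, n_chunks):
    """Serve the incomplete peer closest to completion
    (the one holding the most chunks)."""
    candidates = [p for p in peers if len(p) < n_chunks]
    return max(candidates, key=len) if candidates else None

def most_missing(peers, n_chunks):
    """Serve the incomplete peer furthest from completion
    (the one holding the fewest chunks)."""
    candidates = [p for p in peers if len(p) < n_chunks]
    return min(candidates, key=len) if candidates else None

peers = [{0, 1, 2}, {0}, {0, 1}]
print(least_missing(peers, 4))  # the peer holding 3 chunks
print(most_missing(peers, 4))   # the peer holding 1 chunk
```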
106 © Institut Eurécom 2005
Least-Missing with Pin=1 and Pout=2 ≡ Treek=2
107 © Institut Eurécom 2005
Least-Missing with Pin=1 and Pout=2 ≡ Treek=2
[Figure: source s serves p1 and p2 at rate r/2; chunks C1, C2, C3 propagate level by level through the binary tree p1…p14 at t = 0, 1/C, 2/C]
108 © Institut Eurécom 2005
Least-Missing with Pin=1 and Pout=1
What organization does this correspond to?
Linear chain
Tree
Parallel trees
109 © Institut Eurécom 2005
Most-Missing
[Figure: most missing with all peers having Pin = Pout = 1: at t = 0, 1/C, 2/C, 3/C the source s hands a new chunk (C1, C2, C3, C4) to the peer with the fewest chunks, while peers p1…p8 forward their chunks in parallel]
110 © Institut Eurécom 2005
PTreek
Idea: Engage all peers as fast as possible
Make all peers work as long as possible, so that all terminate at about the same time
[Figure: PTreek=2, N = 15: source s serves the even and odd chunks at rate b/2 each, forming two parallel spanning trees over p1…p14; snapshots at t = 0, 2/C, 4/C, 6/C]
111 © Institut Eurécom 2005
Conclusion
Selection strategy is important
Peer selection: most missing performs best
Block selection: rarest block first avoids performance problems that can occur with random block selection
• When many clients need the same (last) block
112 © Institut Eurécom 2005
Overall Conclusion
Peer-to-peer systems for file distribution perform very well in practice
BitTorrent can serve thousands of clients using a mesh-based organization
Peer-to-peer systems for file distribution are
Very cost-effective and robust against failures and flash crowds
Self-scaling: the more clients, the more “resources” to serve them
What are the best cooperative distribution strategies?
Mesh-based organization is most appropriate in heterogeneous environments (in terms of up- and downlink bandwidth)
Peer and block selection strategies have a major impact on performance
113 © Institut Eurécom 2005
Handout
P. A. Felber and E. W. Biersack. Self-scaling Networks for Content Distribution, September 2004.
February 1, 2005
Questions
115 © Institut Eurécom 2005
END