1
A Measurement Study of Peer-to-Peer File Sharing Systems
by
Stefan Saroiu, P. Krishna Gummadi,
Steven D. Gribble
Presentation by
Nanda Kishore [email protected]
2
Outline
• P2P Overview
– What is a peer?
– Example applications
– Benefits of P2P
• P2P Content Sharing
– Challenges
– Group management/data placement approaches
– Measurement studies
• Conclusion
3
What is Peer-to-Peer (P2P)?
• Most people think of P2P as music sharing
Examples:
• Napster
• Gnutella
4
What is a peer?
• Contrasted with Client-Server model
• Servers are centrally maintained and administered
• Client has fewer resources than a server
5
What is a peer?
• A peer’s resources are similar to the resources of the other participants
• P2P – peers communicating directly with other peers and sharing resources
6
P2P Application Taxonomy
P2P Systems
• Distributed Computing
• File Sharing
• Collaboration
• Platforms – JXTA
7
P2P Goals/Benefits
• Cost sharing
• Resource aggregation
• Improved scalability/reliability
• Increased autonomy
• Anonymity/privacy
• Dynamism
• Ad-hoc communication
8
P2P File Sharing
• Content exchange – Gnutella
• File systems – Oceanstore
• Filtering/mining – Opencola
9
Research Areas
• Peer discovery and group management
• Data location and placement
• Reliable and efficient file exchange
• Security/privacy/anonymity/trust
10
Current Research
• Group management and data placement – Chord, CAN, Tapestry, Pastry
• Anonymity – Publius
• Performance studies – Gnutella measurement study etc.
11
Management/Placement Challenges
• Per-node state
• Bandwidth usage
• Search time
• Fault tolerance/resiliency
12
Approaches
• Centralized
• Flooding
• Document Routing
13
Centralized
• Napster model
• Benefits:
– Efficient search
– Limited bandwidth usage
– Efficient network handling
• Drawbacks:
– Central point of failure
– Limited scale
[Figure: peers Bob, Alice, Jane, and Judy]
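The centralized approach can be illustrated with a minimal sketch. The `CentralIndex` class, its method names, and the file names are hypothetical and chosen for illustration; they are not Napster's actual protocol. The sketch shows why search is cheap (one dictionary lookup) and why the index is a single point of failure (every query goes through it).

```python
from collections import defaultdict


class CentralIndex:
    """Hypothetical Napster-style directory: file name -> peers sharing it."""

    def __init__(self):
        self._peers_by_file = defaultdict(set)

    def register(self, peer, files):
        # A peer announces the files it is willing to share.
        for name in files:
            self._peers_by_file[name].add(peer)

    def unregister(self, peer):
        # Called when a peer leaves; its entries are dropped from every list.
        for holders in self._peers_by_file.values():
            holders.discard(peer)

    def search(self, name):
        # One lookup at the server; the download itself is peer to peer.
        return sorted(self._peers_by_file.get(name, ()))


index = CentralIndex()
index.register("Bob", ["song_a.mp3", "song_b.mp3"])
index.register("Alice", ["song_b.mp3"])
index.register("Jane", ["song_c.mp3"])
hits = index.search("song_b.mp3")
print(hits)
```

Only the lookup is centralized in this sketch; the peers returned by `search` would then be contacted directly for the file transfer.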
14
Flooding
• Gnutella model
• Benefits:
– No central point of failure
– Limited per-node state
• Drawbacks:
– Slow searches
– Bandwidth intensive
[Figure: peers Bob, Alice, Jane, Judy, and Carl]
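A rough sketch of flooding search with a time-to-live (TTL) follows. The overlay links between the five peers and the file placement are assumptions made for illustration, since the original figure is not recoverable. The message counter shows the bandwidth cost, and the TTL shows why a nearby search can miss a file that exists.

```python
from collections import deque


def flood_search(neighbors, files, start, wanted, ttl):
    """Breadth-first flood; returns (peers holding `wanted`, messages sent)."""
    seen = {start}
    frontier = deque([(start, ttl)])
    hits, messages = [], 0
    while frontier:
        peer, hops_left = frontier.popleft()
        if wanted in files.get(peer, ()):
            hits.append(peer)
        if hops_left == 0:
            continue  # TTL expired: this peer does not forward the query
        for nxt in neighbors[peer]:
            messages += 1        # every forwarded copy costs bandwidth
            if nxt not in seen:  # duplicate copies are dropped on arrival
                seen.add(nxt)
                frontier.append((nxt, hops_left - 1))
    return hits, messages


# Hypothetical overlay links (assumed, not taken from the slide).
neighbors = {
    "Bob": ["Alice", "Jane"],
    "Alice": ["Bob", "Judy"],
    "Jane": ["Bob", "Judy"],
    "Judy": ["Alice", "Jane", "Carl"],
    "Carl": ["Judy"],
}
files = {"Carl": {"song.mp3"}}

near_hits, near_msgs = flood_search(neighbors, files, "Bob", "song.mp3", ttl=2)
far_hits, far_msgs = flood_search(neighbors, files, "Bob", "song.mp3", ttl=3)
print(near_hits, near_msgs)
print(far_hits, far_msgs)
```

With the smaller TTL the query never reaches Carl, so the search fails even though the file is in the network; raising the TTL finds it at the cost of more messages.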
15
Document Routing
• FreeNet, Chord, CAN, Tapestry, Pastry model
• Benefits:
– More efficient searching
– Limited per-node state
• Drawbacks:
– Limited fault-tolerance vs. redundancy
[Figure: a query for ID 212 routed among nodes 001, 012, 212, 305, and 332]
16
Document Routing – CAN
• Associate with each node and item a unique id in a d-dimensional space
• Goals
– Scales to hundreds of thousands of nodes
– Handles rapid arrival and failure of nodes
• Properties
– Routing table size O(d)
– Guarantees that a file is found in at most d·n^(1/d) steps, where n is the total number of nodes
Slide modified from another presentation
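To get a feel for the trade-off between per-node state and search time, the two properties stated above can be evaluated for a few values of d. This is plain arithmetic on the slide's expressions; the function name `can_costs` and the network size used are illustrative assumptions.

```python
def can_costs(n, d):
    """Return (routing-table size ~ d, hop bound d * n**(1/d)) per the slide."""
    return d, d * n ** (1.0 / d)


# Illustrative network size; not a figure from the measurement study.
for d in (2, 4, 8):
    state, hops = can_costs(n=100_000, d=d)
    print(f"d={d}: per-node state ~{state}, hop bound ~{hops:.0f}")
```

Raising d shrinks the hop bound quickly while the routing table grows only linearly, which is the design lever CAN exposes.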
17
CAN Example: Two Dimensional Space
• Space divided between nodes
• All nodes cover the entire space
• Each node covers either a square or a rectangular area
• Example:
– Node n1:(1, 2), the first node that joins, covers the entire space
[Figure: 2-d coordinate space, both axes 0 to 7; n1 owns the whole space]
18
CAN Example: Two Dimensional Space
• Node n2:(4, 2) joins; the space is divided between n1 and n2
[Figure: 2-d coordinate space, both axes 0 to 7; zones of n1 and n2]
19
CAN Example: Two Dimensional Space
• Node n3:(3, 5) joins; the space is divided between n1 and n3
[Figure: 2-d coordinate space, both axes 0 to 7; zones of n1, n2, and n3]
20
CAN Example: Two Dimensional Space
• Nodes n4:(5, 5) and n5:(6,6) join
[Figure: 2-d coordinate space, both axes 0 to 7; zones of n1, n2, n3, n4, and n5]
21
CAN Example: Two Dimensional Space
• Nodes: n1:(1, 2); n2:(4, 2); n3:(3, 5); n4:(5, 5); n5:(6, 6)
• Items: f1:(2, 3); f2:(5, 1); f3:(2, 1); f4:(7, 5)
[Figure: 2-d coordinate space, both axes 0 to 7; zones of n1 to n5 with items f1 to f4]
22
CAN Example: Two Dimensional Space
• Each item is stored by the node that owns its mapping in the space
[Figure: 2-d coordinate space, both axes 0 to 7; zones of n1 to n5 with items f1 to f4]
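The placement rule can be sketched with the node and item coordinates from the example. The zone boundaries below are an assumption: they are one set of splits consistent with the join order on the slides, since the figure itself did not survive extraction.

```python
# Zones as (x_min, x_max, y_min, y_max), half-open on the upper edge.
# ASSUMPTION: boundaries reconstructed from the join order, not from the figure.
ZONES = {
    "n1": (0, 4, 0, 4),
    "n2": (4, 8, 0, 4),
    "n3": (0, 4, 4, 8),
    "n4": (4, 6, 4, 8),
    "n5": (6, 8, 4, 8),
}
ITEMS = {"f1": (2, 3), "f2": (5, 1), "f3": (2, 1), "f4": (7, 5)}


def owner(point):
    """Return the node whose zone contains the point."""
    x, y = point
    for node, (x0, x1, y0, y1) in ZONES.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return node
    raise ValueError(f"point {point} lies outside the coordinate space")


placement = {item: owner(p) for item, p in ITEMS.items()}
print(placement)
```

Because the zones tile the whole space, every item maps to exactly one node.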
23
CAN: Query Example
• Each node knows its neighbors in the d-space
• Forward query to the neighbor that is closest to the query id
• Can route around some failures
– some failures require local flooding
• Example: assume n1 queries f4
[Figure: 2-d coordinate space, both axes 0 to 7; zones of n1 to n5 with items f1 to f4]
24
CAN: Query Example
[Figure: the query from n1 for f4 being forwarded across the 2-d space (nodes n1 to n5, items f1 to f4)]
25
CAN: Query Example
[Figure: the query from n1 for f4 being forwarded across the 2-d space (nodes n1 to n5, items f1 to f4)]
26
CAN: Query Example
[Figure: the query from n1 for f4 being forwarded across the 2-d space (nodes n1 to n5, items f1 to f4)]
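The greedy forwarding rule from the query example can be sketched as follows. Several things here are assumptions made for illustration: the zone boundaries (reconstructed from the join order), the use of zone centers to measure closeness to the query id, and a flat space without wrap-around at the edges.

```python
import math

# Zones as (x_min, x_max, y_min, y_max), half-open on the upper edge.
# ASSUMPTION: boundaries reconstructed from the join order, not from the figure.
ZONES = {
    "n1": (0, 4, 0, 4),
    "n2": (4, 8, 0, 4),
    "n3": (0, 4, 4, 8),
    "n4": (4, 6, 4, 8),
    "n5": (6, 8, 4, 8),
}


def contains(zone, point):
    x0, x1, y0, y1 = zone
    return x0 <= point[0] < x1 and y0 <= point[1] < y1


def center(zone):
    x0, x1, y0, y1 = zone
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def are_neighbors(a, b):
    """Two zones are neighbors if they share an edge (not just a corner)."""
    ax0, ax1, ay0, ay1 = a
    bx0, bx1, by0, by1 = b
    overlap_x = min(ax1, bx1) > max(ax0, bx0)
    overlap_y = min(ay1, by1) > max(ay0, by0)
    abut_x = ax1 == bx0 or bx1 == ax0
    abut_y = ay1 == by0 or by1 == ay0
    return (abut_x and overlap_y) or (abut_y and overlap_x)


def route(start, target, max_hops=16):
    """Forward to the neighbor closest to the target until its owner is reached."""
    path, current = [start], start
    while not contains(ZONES[current], target):
        if len(path) > max_hops:
            raise RuntimeError("no route found within max_hops")
        candidates = [n for n in ZONES
                      if n != current and are_neighbors(ZONES[current], ZONES[n])]
        current = min(candidates, key=lambda n: math.dist(center(ZONES[n]), target))
        path.append(current)
    return path


path_to_f4 = route("n1", (7, 5))  # f4 is at (7, 5)
print(path_to_f4)
```

Each node only consults its own neighbor list, which is why the per-node state stays small; the hop limit is a safeguard for this sketch, not part of CAN.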
27
Node Failure Recovery
• Simple failures
– know your neighbor’s neighbors
– when a node fails, one of its neighbors takes over its zone
• More complex failure modes
– simultaneous failure of multiple adjacent nodes
– scoped flooding to discover neighbors
– hopefully, a rare event
28
Document Routing – Chord
• MIT project
• Uni-dimensional ID space
• Keep track of log N nodes
• Search through log N nodes to find desired key
[Figure: Chord ring with nodes N5, N10, N20, N32, N60, N80, N99, N110 and key K19]
29
Document Routing – Chord(2)
• Each node and key is assigned an id.
• If a node needs a key, it searches its table of nodes for the key.
• If that fails, it goes to the last node in its table and repeats until it finds the key.
• Search through log N nodes to find desired key
[Figure: Chord ring with nodes N5, N10, N20, N32, N60, N80, N99, N110 and key K19]
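A minimal sketch of a Chord-style lookup on the ring from the figure follows. It uses the finger-table rule from the Chord design (entry i of node n points to the first node at or after n + 2^i), which refines the table walk summarized above. The 7-bit identifier space and the choice of N80 as the querying node are assumptions for illustration.

```python
M = 7                     # ASSUMPTION: 7-bit identifier space (ids 0..127)
RING = 2 ** M
NODES = sorted([5, 10, 20, 32, 60, 80, 99, 110])


def successor(ident):
    """First node at or clockwise after ident."""
    ident %= RING
    for n in NODES:
        if n >= ident:
            return n
    return NODES[0]  # wrap around the ring


def in_interval(x, a, b, inclusive_right=False):
    """True if x lies in the ring interval (a, b), or (a, b] if requested."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b  # interval wraps past zero


def fingers(n):
    """Finger table: about log N entries at exponentially growing distances."""
    return [successor(n + 2 ** i) for i in range(M)]


def lookup(start, key):
    """Return (node responsible for key, path of nodes visited)."""
    node, path = start, [start]
    while True:
        succ = successor(node + 1)
        if in_interval(key, node, succ, inclusive_right=True):
            path.append(succ)
            return succ, path
        # Jump to the farthest finger that still precedes the key.
        nxt = next((f for f in reversed(fingers(node))
                    if in_interval(f, node, key)), succ)
        node = nxt
        path.append(node)


holder, path = lookup(80, 19)  # N80 looks up K19
print(holder, path)
```

Each jump at least halves the remaining ring distance to the key, which is where the log N search bound comes from.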
30
Doc Routing – Tapestry/Pastry
• Global mesh of meshes
• Suffix-based routing
• Uses underlying network distance in constructing mesh
[Figure: mesh of nodes 13FE, ABFE, 1290, 239E, 73FE, 9990, F990, 993E, 04FE, 43FE]
31
Naming in Tapestry
• Every node has a 4-digit name, similar to an IP address
• Each digit in the name can take 16 values (hexadecimal)
• Keys present at the node are in accordance with the node name.
[Figure: mesh of nodes 13FE, ABFE, 1290, 239E, 73FE, 9990, F990, 993E, 04FE, 43FE]
32
Tapestry Routing
[Figure: a message addressed to 4598 routed among nodes 6789, B4F8, 9098, 7598, 4598, and B437]
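Suffix-based routing can be sketched with the node names from the figure: each hop moves to a node that shares a longer trailing run of digits with the destination. The starting node B4F8 and the global node list are assumptions for illustration; a real Tapestry node consults only its own neighbor map.

```python
NODES = ["6789", "B4F8", "9098", "7598", "4598", "B437"]


def shared_suffix(a, b):
    """Number of trailing digits that a and b have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def route(start, dest):
    """Resolve the destination name from the last digit toward the first."""
    path, current = [start], start
    while current != dest:
        have = shared_suffix(current, dest)
        better = [n for n in NODES if shared_suffix(n, dest) > have]
        if not better:
            raise LookupError(f"no node extends the suffix match for {dest}")
        # Prefer the smallest improvement, i.e. one more digit per hop if possible.
        current = min(better, key=lambda n: shared_suffix(n, dest))
        path.append(current)
    return path


hops = route("B4F8", "4598")
print(hops)
```

With 4-digit names the route needs at most four hops, one per digit resolved.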
33
Remaining Problems?
• Hard to handle highly dynamic environments
• Methods don’t consider peer characteristics
34
Measurement Studies
Gnutella vs. Napster
35
[Figure: Napster (peers connected to the napster.com servers) vs. Gnutella (peers connected to each other). Legend: P = peer, S = server, Q = query, R = response, D = file download]
36
Methodology
2 stages:
1. periodically crawl Gnutella/Napster
• discover peers and their metadata
2. feed output from crawl into measurement tools:
• bottleneck bandwidth – SProbe
• latency – SProbe
• peer availability – LF
• degree of content sharing – Napster crawler
37
Crawling
• May 2001
• Napster crawl
– query index server and keep track of results
– query about returned peers
– don’t capture users sharing unpopular content
• Gnutella crawl
– send out ping messages with large TTL
38
39
Measurement Study
• How many peers are server-like…client-like?
• Bandwidth, latency
• Connectivity
• Who is sharing what?
40
41
Graph results
• CDF: cumulative distribution function
• From this graph, we see that 78% of the participating peers have downstream bottleneck bandwidths of at least 1000Kbps
• Only 8% of the peers have upstream bottleneck bandwidths of at least 10Mbps.
• 22% of the participating peers have upstream bottleneck bandwidths of 100Kbps or less.
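Statements such as "at least 10Mbps" and "100Kbps or less" are read directly off a CDF. A small sketch of how such fractions are computed is given here; the sample values are made up for illustration and are not data from the study.

```python
def fraction_at_most(samples, threshold):
    """CDF value: share of samples less than or equal to the threshold."""
    return sum(1 for s in samples if s <= threshold) / len(samples)


def fraction_at_least(samples, threshold):
    """Share of samples greater than or equal to the threshold."""
    return sum(1 for s in samples if s >= threshold) / len(samples)


# HYPOTHETICAL upstream bottleneck bandwidths in Kbps (not measured data).
upstream_kbps = [56, 64, 128, 384, 768, 1500, 1500, 3000, 10000, 45000]

slow_share = fraction_at_most(upstream_kbps, 100)
fast_share = fraction_at_least(upstream_kbps, 10_000)
print(f"{slow_share:.0%} at 100Kbps or less, {fast_share:.0%} at 10Mbps or more")
```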
42
43
Reported Bandwidth
44
Graph results
• The percentage of Napster users connected with modems (of 64Kbps or less) is about 25%, while the percentage of Gnutella users with similar connectivity is as low as 8%.
• 50% of the users in Napster and 60% of the users in Gnutella use broadband connections
• Only about 20% of the users in Napster and 30% of the users in Gnutella have very high bandwidth connections
• Overall, Gnutella users on average tend to have higher downstream bottleneck bandwidths than Napster users.
45
46
Graph results
• Approximately 20% of the peers have latencies of at least 280ms
• Another 20% have latencies of at most 70ms
47
48
Graph results
• This graph illustrates the presence of two clusters: a smaller one situated at (20-60Kbps, 100-1,000ms) and a larger one at over (1,000Kbps, 60-300ms).
• The horizontal bands in the graph indicate that latency also depends on the location of the peer relative to the measuring system.
49
Measured Uptime
50
51
Number of Shared Files
52
53
Correlation of Free-Riding with B/W
[Figure: CDF of Number of Downloads Per Reported Bandwidths (Napster). x-axis: Number of Downloads (0 to 100); y-axis: Percentage of Hosts (0 to 100). Series: Unknown; Modem + ISDN; Dual ISDN + Cable + DSL; T1 + T3]
[Figure: CDF of Number of Uploads Per Reported Bandwidths (Napster). x-axis: Number of Uploads (0 to 100); y-axis: Percentage of Hosts (0 to 100). Series: Unknown; T1 + T3; Dual ISDN + Cable + DSL; Modem + ISDN]
54
55
56
[Figure: CDFs of Downstream Bottleneck Bandwidths for All Napster Users and for All Users Who Reported Unknown Bandwidths. x-axis: Measured (SProbe) Downstream Bottleneck Bandwidth (Kbps), 1 to 100,000; y-axis: Percentage of Hosts (0 to 100). Series: All Napster Users; Napster Users who Reported Unknown Bandwidths]
57
Power law
• A connected cluster of peers that spans the entire network survives even in the presence of a large percentage p of random peer breakdowns, where p can be as large as:
$$p \le 1 + \left(1 - m^{\alpha-2}\, K^{3-\alpha}\, \frac{\alpha-2}{3-\alpha}\right)^{-1}$$
where m is the minimum node degree, K is the maximum node degree, and α < 3 is the power-law exponent.
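As a worked example, the sketch below evaluates the random-breakdown threshold known from the percolation analysis of power-law networks, p ≤ 1 + (1 − m^(α−2) · K^(3−α) · (α−2)/(3−α))^(−1), which is assumed here to be the bound this slide refers to. The parameter values are illustrative assumptions, and the code restricts α to the range 2 < α < 3 where this form of the expression applies.

```python
def max_random_failure_fraction(m, K, alpha):
    """Largest fraction p of random breakdowns the spanning cluster survives.

    ASSUMPTION: uses p = 1 + 1 / (1 - kappa) with
    kappa = m**(alpha-2) * K**(3-alpha) * (alpha-2) / (3-alpha), for 2 < alpha < 3.
    """
    if not 2 < alpha < 3:
        raise ValueError("this form of the bound assumes 2 < alpha < 3")
    kappa = m ** (alpha - 2) * K ** (3 - alpha) * (alpha - 2) / (3 - alpha)
    return 1 + 1 / (1 - kappa)


# Illustrative parameters: minimum degree 1, maximum degree 20, exponent 2.3.
p = max_random_failure_fraction(m=1, K=20, alpha=2.3)
print(f"tolerates roughly {p:.0%} random failures")
```

A larger maximum degree K pushes the threshold closer to 1, which matches the intuition that a few well-connected peers hold the overlay together under random failures.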
58
59
Gnutella
Fri Feb 16 05:21:52-05:23:22 PST
1771 hosts
Popular sites:
• 212.239.171.174
• adams-00-305a.Stanford.EDU
• 0.0.0.0
60
30% random failures
1771 – 471 – 294 hosts Fri Feb 16 05:21:52-05:23:22 PST
61
4% orchestrated failures
Fri Feb 16 05:21:52-05:23:22 PST
1771 - 63 hosts
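The contrast between random and orchestrated failures can be sketched on a toy overlay. The graph below (two linked hubs with four leaves each) is hypothetical and far smaller than the crawled Gnutella topology; it only illustrates why removing a few best-connected peers fragments the overlay while removing more low-degree peers does not.

```python
from collections import deque


def largest_component(adj, removed):
    """Size of the largest connected component after removing some peers."""
    alive = set(adj) - set(removed)
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v in alive and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


# HYPOTHETICAL overlay: hubs h1 and h2 are linked, each with four leaf peers.
edges = [("h1", "h2")]
edges += [("h1", f"a{i}") for i in range(4)]
edges += [("h2", f"b{i}") for i in range(4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Orchestrated failure: take out the two highest-degree peers.
hubs = sorted(adj, key=lambda n: len(adj[n]), reverse=True)[:2]
after_leaf_failures = largest_component(adj, ["a0", "b0", "a1"])
after_hub_failures = largest_component(adj, hubs)
print(after_leaf_failures, after_hub_failures)
```

Losing three leaf peers leaves the rest connected, while losing just the two hubs leaves only isolated peers.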
62
Results Overview
• Lots of heterogeneity between peers
– Systems should consider peer capabilities
• Peers lie
– Systems must be able to verify reported peer capabilities or measure true capabilities
63
Points of Discussion
• Is it all hype?
• Should P2P be a research area?
• Do P2P applications/systems have common research questions?
• What are the “killer apps” for P2P systems?
64
Conclusion
• P2P is an interesting and useful model
• There are lots of technical challenges to be solved