P2P: An Overview Dr. Tony White Carleton University.

transcript

P2P: An Overview

Dr. Tony White

Carleton University

Outline

• Introduction• Evolution of Network Computing

– Definitions– The Rise of Edge Computing– Why Peer-to-Peer? What is it?

• Applications– Cycle Sharing– Content Delivery – …

• Open Problems• Summary

Evolution of Network Computing

•Web introduced:- A common protocol: HTTP- A common document format: HTML- A universal client: the browser

• Client/server:- Introduced inequalities- Required homogeneity

P2P Definition

• Peer-to-peer computing is the location and sharing of computer resources and services by direct exchange between servents.

• A servent is a peer that can adopt the roles of both server and client when operating.

P2P Definition

“P2P is a class of applications that takes advantage of resources -- storage, cycles, content, human presence -- available at the edges of the Internet. Because accessing these decentralized resources means operating in an environment of unstable connectivity and unpredictable IP addresses, P2P nodes must operate outside the DNS system and have significant or total autonomy from central servers.” Clay Shirkey February, 2000

Definitions I

• Pure peer-to-peer is completely decentralized and characterized by lack of a central server or central entity; clients make direct contact with one another.

• Computational peer-to-peer uses P2P technology to disseminate computational tasks over multiple clients; peers do not have a direct connection to one another.

Definitions II

• Datacentric peer-to-peer is information and data residing on systems or devices that is accessible to others when users connect. It is sometimes called peer-assisted or grid-assisted delivery. Applications include distributed file and content sharing.

• Usercentric/hybrid peer-to-peer involves clients contacting others via a central server or entity to communicate, share data, or process data. Often used in collaboration applications.

What is a P2P network?

• It is an overlay network

• Peer applications know IP addresses of other peer applications.

• Link between two nodes is actually an application-level connection.

What matters?

• Topology of overlay matters

• Where content is stored matters

• Search protocol matters

• Gnutella results in:– Poor performance– Poor reliability

The Rise of Edge Computing …

• In P2P, clients also are servers, hence are peers.• Driving P2P is the abundance of:

– Computing power

– Non-volatile storage

– Network bandwidth

– (This seems to turn thin clients on their heads.)

• Sharing from the edge:– Physical Resources: cycles, disk

– Information Resources: files, database access

– Services: code mobility implied

P2P Enables Complete Access

• P2P file swapping is the obvious application– Text, audio, video, executables, …

• Searching and sharing– Resources

• Information• Information processing capacity

– Searches• More current than Google™

– Indexing web logs (blogs, klogs …)

• More focused: search within a “peer group”

P2P Enables Complete Access …

• Searching and sharing:– Instant messaging

• locate user quickly independent of service provider.

– Buyers and sellers• P2P auctions – compete with Ebay.

– Blogging• Sharing of “self”.

• Edge-based multi-media streaming:– Web radio– Web TV

• Peer shells:– Script complex P2P applications from simpler ones.– Service creation using service composition.

P2P Enables Complete Access …

• A New Style of Distributed Computing

• P2P applications tolerate peers coming/going.

• Result depends on which peers are available.

• High availability comes from probability that some peers are available.– Not on load-balancing and fail over schemes.– Must avoid “tragedy of the commons”.

Examples of Early P2P

• Some new Internet applications are different:– SETI@home– Instant messaging services (AIM, MS Messenger, …)– P2P applications – no central authority/server.

• Napster – quasi-P2P• Gnutella• Freenet

• These applications are vertically integrated:– Non-standard protocols– Closed namespaces– Stand alone

Problems I

• Topology– Bandwidth usage

– Fault tolerance

– Search efficiency

• Identity– Trust

– Anonymity

• Security– Authorization

– Privacy

Problems II

• Namespaces• Community Management

– Overlaps traditional enterprise groups– Highly dynamic, user controlled

• Firewall traversal• Political

– IT loses control of content distribution– No control of information flow!

• Legal– DRM

What is needed?

• Interoperability (common protocols & standards):– Communication protocols (e.g. JXTA, Jabber, …)

– Representation of identity (or not!)

– Semantic content (meta-data)

• Secure information exchange:– Must be able to guarantee trust within a network

– Prevent unauthorized access to network

– Policy-based control of information exchange

• Ubiquity– Buy-in from large groups of users

Securing Distributed Computationsin a Commercial Environment

Philippe Golle, Stanford University

Stuart Stubblebine, CertCo

• 580,000 active participants• 565,800 years of CPU time since 1996• 26.1 TeraFLOPs / sec

Example of a Distributed Computation

Commercialization: supply• A dozen of companies have recruited thousands of

participants

• $100 million in venture funding in 2000

www.mithral.com

www.dcypher.net

(with www.processtree.com)

www.distributed.net

www.entropia.com

www.parabon.com

www.uniteddevices.com

www.popularpower.com

www.distributedsciences.com

www.datasynapse.com

www.juno.com

Commercialization: demand

• Super-computing market: $2 billion / year• Computationally intensive parallelizable projects:

– Drug design research

– Mathematical research

– Economic simulations

– Digital entertainment

Cheaters!"Fifty percent of the project's resources have been spent dealing with security problems"

“The really hard part has to do with verifying computational results"

David Anderson, Seti@home's director.

Cycle Sharing Participants• Trusted supervisor

– Maintains a pool of registered participants

– Bids for large computations

– Divides the computation into tasks that are assigned to participants

– Collects the results and distributes payment to the participants

– Example: Distributed.net, Entropia.com, etc…

• Untrusted participants– May range from large companies to individual users

– Participants are anonymous (No “real world” leverage)

– Participants may collude. We distinguish between real-world entities (agents) and anonymous participants.

– Participants may leave the computation at any time, either temporarily or for good.

Organization

• Distribution of tasks– The unit of computation is a task

– Assumption: all tasks have the same size and can be run by any participant within the same time bounds.

– The supervisor runs a probabilistic algorithm to assign tasks to participants.

– The supervisor keeps track of who did what

Security

• Definition: a computation is secure if no rational, non-risk-seeking participant ever cheats.

• Collusion may occur only before tasks are assigned.

• A participant has 3 choices:– Request a computation and do it– Request a computation and NOT do it– Take a leave

• Assumption: all errors are malicious

Utility function of an agent

• Security condition: (α+E)P – L(1-P) < 0

where P is the probability that cheating is undetected

– L α + E

Run the computation

Cheat and “guess” the result

α: Payment received per taskE: Benefit of defecting (E = e α)L: Cost of getting caught cheating

Cheating detected Cheating undetected

Basic scheme

• Registration:– Participant performs d+1 unpaid tasks– The supervisor verifies them (at limited cost)– The participant is accepted iff all the results are correct

• Assignment of a task:– A task is given to N participants chosen uniformly independently at

random– The number N is chosen according to the probability distribution

– Payment: a constant amount α per task if all the results agree– If not, the task is re-assigned to a new set of participants

• Severance: a participant is paid an amount d.α

11 icciNQ

Properties• Computational overhead = (α+E)P – L(1-P) < 0

• Security condition:

Computational overhead

Setup time Maximum coalition size

Maximum e

10% 10 1% 1

17% 10 10% 1

46% 10 1% 10

243% 10 1% 100

e• Overhead = for “small” p

Content Delivery Networks

• Swarmcast/OnionNetworks– File is stored in multiple locations

– Idea is to retrieve portions of file from separate hosts:• File is split into small (32k) pieces

• Requests are random

• Space of packets bigger than file

• Only subset of packets required

• Technique is Forward Error Correction

• Kazaa/Morpheus• MojoNation (HiveCache)

– Distributed backup and restore system

Privacy Networks: Publius

• Publius– Publishers: want to publish anonymously– Servers: host random-looking content

• Storage– The publisher takes the key, K that is used to encrypt the file and

splits it into n shares, such that any k of them can reproduce the original K, but k-1 give no hints as to the key.

– Each server receives the encrypted Publius content and one of the shares.

• Retrieval– A retriever must get the encrypted Publius content from some server

and k of the shares. – Content is tied to URL that is used to recover the data and the shares.

Privacy Networks: Freehaven• Anonymity:

– Publishers that insert documents, – Readers that retrieve documents, – Servers that store documents. – Uses a free, low-latency, two-way mixnet for forward-anonymous

communication.

• Accountability: – Reputation and micropayment schemes, which allow us to limit

the damage done by servers that misbehave.

• Persistence: – Publisher of a document determines its lifetime.

• Flexibility: – System functions smoothly as peers dynamically join or leave

OceanStoreToward Global-Scale, Self-Repairing,

Secure and Persistent Storage

John Kubiatowicz

University of California at Berkeley

OceanStore Context: Ubiquitous Computing

• Computing everywhere:– Desktop, Laptop, Palmtop

– Cars, Cellphones

– Shoes? Clothing? Walls?

• Connectivity everywhere:– Rapid growth of bandwidth in the interior of the net

– Broadband to the home and office

– Wireless technologies such as CMDA, Satelite, laser

• Where is persistent data????

Utility-based Infrastructure?

Pac Bell

Sprint

IBMAT&T

CanadianOceanStore

• Data service provided by storage federation• Cross-administrative domain • Pay for Service

OceanStore: Everyone’s Data, One Big Utility

“The data is just out there”

• How many files in the OceanStore?– Assume 1010 people in world– Say 10,000 files/person (very conservative?)– So 1014 files in OceanStore!

– If 1 gig files (ok, a stretch), get 1 mole of bytes!

Truly impressive number of elements…… but small relative to physical constants

Aside: new results: 1.5 Exabytes/year (1.51018)

OceanStore Assumptions• Untrusted Infrastructure:

– The OceanStore is comprised of untrusted components– Individual hardware has finite lifetimes– All data encrypted within the infrastructure

• Responsible Party:– Some organization (i.e. service provider) guarantees that your

data is consistent and durable– Not trusted with content of data, merely its integrity

• Mostly Well-Connected: – Data producers and consumers are connected to a high-

bandwidth network most of the time– Exploit multicast for quicker consistency when possible

• Promiscuous Caching: – Data may be cached anywhere, anytime

The Peer-To-Peer View:Irregular Mesh of “Pools”

Key Observation:Want Automatic Maintenance

• Can’t possibly manage billions of servers by hand!• System should automatically:

– Adapt to failure – Exclude malicious elements– Repair itself – Incorporate new elements

• System should be secure and private– Encryption, authentication

• System should preserve data over the long term (accessible for 1000 years):– Geographic distribution of information– New servers added from time to time– Old servers removed from time to time– Everything just works

Attack Resistant P2P

• Content can be compromised by:– Attack by malicious agents– Censorship– Faulty nodes

• Remember:– Nodes have finite resources

Gnutella

Morpheus/Kazaa

......

super peer

Examples

• Napster shut down by attacks on central server

• Gnutella spammed by Flatplanet

• Removal of a few peers shatters Gnutella– 63 from 1800 in figures

Performance

After deletion of 2/3 of peers, 99% of remainder can still access99% of the data items

DRN design [Jared Saia]

• Topology based upon butterfly network (constant degree version of hypercube)

• Each vertex of butterfly called a supernode

• Each supernode represents a set of peers

• Each peer is in multiple supernodes

DRN Topology

N peers, n supernodesEach peer participates in Clogn randomly chosen supernodesSupernode X connected to supernode Y means all nodes in X connected to all nodes in Y

Conclusion

• P2P systems popular today– Limewire, Kazaa …

• Existing P2P systems vulnerable and inefficient

• Many challenges ahead:– Search– Resource Management– Security and Privacy

Lots of good research to be done …Lots of good research to be done …

Appendix I

Open Problems in P2P Data Sharing

Open Problems in Data Sharing Peer-To-Peer Systems

Hector Garcia-Molina

ICDT Conference, January 10, 2003

Contributors: Mayank Bawa, Brian Cooper, Arturo Crespo,

Neil Daswani, Prasanna Ganesan, Sergio Marti,

Qi Sun, Beverly Yang and others

P2P Challenges

• Search

• Resource Management

• Security & Privacy

not independent challenges!

Search

• Search Options– Query Expressiveness– Comprehensiveness– Topology– Data Placement– Message Routing

Comparison

Gnutella CAN Others?

Expressivness

Comprehensivness

Autonomy

Efficiency

Robustness

Topology pwr law

Data Placement arbitrary

Message Routing flooding

Content Addressable Network (CAN)

NodesData

A distributed hash table on Internet scales …

Comparison

Gnutella CAN Others?

Expressivness

Comprehensivness

Autonomy

Efficiency

Robustness

Topology pwr law grid

Data Placement arbitrary hashing

Message Routing flooding directed

Challenge: Exploring the Space

autonomy

robustness efficiency

gnutellacan

a lot of research

SIL model

Search Index Link (SIL) Model

• Forwarding search link (FSL)

• Non-forwarding search link (NSL)

• Forwarding index link (FIL)

• Non-forwarding index link (NIL)

SIL Model

• Forwarding search link (FSL)

• Non-forwarding search link (NSL)

• Forwarding index link (FIL)

• Non-forwarding index link (NIL)

Super-Peer Network

SIL Challenges

• Desirable graph properties

• Desirable features

• Dynamic configuration

Example Property: Redundancy

• Redundancy exists in a SIL graph if a link can be removed without reducing coverage

Example: Undesirable Feature

• One-index cycle: Node A has an index link to B, and there is a search path from B to A

Avoiding Undesirable Features

• Node D is joining the system:– what neighbors should it connect to?– what type of links should it use?

Open Problems: Security

• Availability (e.g., coping with DOS attacks)

• Authenticity

• Anonymity• Access Control (e.g., IP protection, payments,...)

Authenticity

title: origin of species

author: charles darwin

date: 1859

body: In an island far,far away ...

More than Just File Integrity

title: origin of species

author: charles darwin

date: 1859

body: In an island far,far away ...

checksum

More than Fetching One File

T=originY=1800

A=darwin

T=originY=1859

A=darwin

T=originY=1859

A=darwin

T=originY=?

A=darwinB=?

T=originY=1859

A=darwinB=abcd

Solutions

• Authenticity Function A(doc): T or F– at expert sites, at all sites?– can use signature expert sig(doc) user

• Voting Based– authentic is what majority says

• Time Based– e.g., oldest version (available) is authentic

Added Challenge: Efficiency

• Example: Current music sharing– everyone has authenticity function

– but downloading files is expensive

• Solution: Track peer behavior

bad peer

good peergood peer

How to Track Peer Behavior?

• Trust Vector [ v1, v2, v3, v4 ] a b c d

• Single value between 0 and 1?

• Pair of values: [ total downloads, good downloads ] ?

Trust Operations

[1, .9, .5, 0, 0]

[1, 0, 1, 1, .2][1, 1, 0, .3, 1]

update?

Issues

• Trust computations in dynamic system

• Overloading good nodes

• Bad nodes provide good content sometimes

• Bad nodes can build up reputation

• Bad nodes can form collectives

• ...

Sample Results

Fraction of malicious peers

P2P Challenges

• Search

• Resource Management

• Security and Privacy

Resource Management

2capacity = C1 capacity = C2

capacity = C3

• Local work: Ci

• Remote work: (1 - ) Ci

Incentives for Remote Work

Local work: Ci

Remote work: (1 - ) Ci

• What is best value for ?

• How do I get remote nodes to work for me?

Conclusion

• P2P systems popular today– Limewire, Kazaa …

• Existing P2P systems vulnerable and inefficient

• Many challenges ahead:– Search– Resource Management– Security and Privacy

Lots of good research to be done …Lots of good research to be done …

For Additional Information

• Google: – “Stanford Peers”, OceanStore, Tapestry, Chord

• http://www-db.stanford.edu/peers/

• http://www.freehaven.net/

• http://cs1.cs.nyu.edu/~waldman/publius/

• http://www.onionnetworks.com

Appendix II

P2P Architectures

Peer-to-Peer is Not Always

Decentralized…when Centralization is

Nelson Minar

<nelson@monkey.org>http://www.media.mit.edu/~nelson/

Talk Overview

• Topologies of distributed systems

• Strengths and weaknesses

• Conclusions

Warning: Broad generalizations ahead

What is P2P Anyway?

• Decentralized Systems: no– Popular Power fails test– Napster fails test– Most Instant Messaging fails test– Confuses topology with function

• Edge Resources: yes– Small computers on edges contribute back– All peers are active participants

Distributed Systems Topologies

• Get away from fundamentalism– “Pure P2P”, “True P2P”, etc

• Focus instead on system architecture– How do the pieces fit together?

• Concentrate on connection topology

• Which topology for which problem?

Centralized

• Client/server• Web servers• Databases• Napster search• Instant Messaging• Popular Power

• Fail-over clusters• Simple load balancing• Assumption

– Single owner

Hierarchical

• DNS• NTP• Usenet (sort of)

Decentralized

• Gnutella• Freenet• Hive• Internet routing

Centralized + Centralized

• N-tier apps• Database heavy systems• Web services gateways• Grand Central

Centralized + Ring

• Serious web applications

• High availability servers

Centralized + Decentralized

• Clip2 Gnutella Reflector• FastTrack

– KaZaA

– Morpheus

• Email

What about other topologies?

• Centralized + Hierarchical?– Back end tree of information

– Caching architectures

• Decentralized + Ring?– P2P network of fail-over clusters

• Decentralized + Hierarchical?• Decentralized + Centralized?

Strengths and Weaknesses

• Plenty of topologies to choose from

• What is each kind good for?

• Need a set of properties to measure

• Caution: What follows is very high level

Things to Measure• Manageability

– How hard is it to keep working?

• Information coherence– How authoritative is info? (Auditing, non-repudiation)

• Extensibility– How easy is it to grow?

• Fault tolerance– How well can it handle failures?

• Security– How hard is it to subvert?

• Resistance to legal or political intervention– How hard is it to shut down? (Can be good or bad)

• Scalability– How big can it grow?

Centralized

ManageableCoherentExtensibleFault TolerantSecureLawsuit-proofScalable

System is all in one place All information is in one placeX No one can add on to systemX Single point of failure Simply secure one hostX Easy to shut down? One machine. But in practice?

ManageableCoherent

ExtensibleFault Tolerant

SecureLawsuit-proof

Scalable

Simple rules for relationships Easy logic for stateX Only ring owner can add Fail-over to next host As long as ring has one ownerX Shut down owner Just add more hosts

Hierarchical

Manageable

Coherent

Extensible

Fault Tolerant

Secure

Lawsuit-proof

Scalable

½ Chain of authority

½ Cache consistency

½ Add more leaves, rebalance

½ Root is vulnerable

X Too easy to spoof links

X Just shut down the root Hugely scalable – DNS

Decentralized

Manageable

Coherent

Extensible

Fault Tolerant

Secure

Lawsuit-proof

Scalable

X Very difficult, many ownersX Difficult, unreliable peers Anyone can join in! RedundancyX Difficult, open research No one to sue! (…but follow $)? Theory – yes : Practice – no

Centralized + Ring

ManageableCoherent

SecureLawsuit-proof

Scalable

Just manage the ring As coherent as ringX No more than ring Ring is a huge win As secure as ringX Still single place to shut down Ring is a huge win

Common architecture for web applications

Centralized + Decentralized

ManageableCoherent

SecureLawsuit-proof

Scalable

X Same as decentralized½ Better than decentralized Anyone can still join! Plenty of redundancyX Same as decentralized Still no one to sue? Looking very hopeful

Best architecture for P2P networks?

Centralized vs. Decentralized

• Centralized is pretty good!– Manageable– Coherent– Security

• Decentralized is exciting– Extensible– Massive fault tolerance– Lawsuit-proof

• Scalability is the big question

Conclusions

• Centralized is easy to deal with– Major architecture for distributed systems– Combines well with rings

• Decentralized is good, needs research– Coherence, Manageability, Security– Scalability

• Hierarchical is overlooked

• Combining architectures is powerful

Peer-to-Peer is Not Always Decentralized

…when Centralization is GoodNelson Minar

<nelson@monkey.org>

http://www.media.mit.edu/~nelson/Thanks to Marc Hedlund, Raffi Krikorian, Tony White

Appendix III

P2P Industry

P2P Industry Outline

“There’s no peer-to-peer market any more than there’s a client/server market” – Anne Manes, Sun Microsystems

• Peer-to-peer encompasses a wide range of technologies centered around decentralizing computing

• Business and revenue models are still fuzzy

• There are clear opportunities and research excitement

Distribution of P2P CompaniesCategory Examples Industry ShareDistributed Computing Entropia

United Devices

Collaboration / Knowledge Management

Groove Networks

Engenia

Content Distribution Akamai

Proksim

Infrastructure / Platform Akavi

Xdegrees

File Sharing Kazaa

Napster

Distributed Search OpenCola

Thinkstream

(From “P2P 101: An Overview of the P2P Landscape” by Larry Cheng)

Major Features of P2P Industry

(From “P2P 101: An Overview of the P2P Landscape” by Larry Cheng)

• Lack of experienced, quality management teams• Lack of detailed business models• Skeptical investors• 150+ active companies• Estimated 95% failure rate

“The elephant in the room is the fact that most companies here are not commercially viable.” - Heard from a speaker at O’Reilly

Current P2P Business Models

• Sell P2P products to end-users– No current revenue-generating business model

– Sometimes coupled with content-sale models

• Sell content through P2P– Subscription-based – I buy content from you

– Sponsor-based – Someone pays you to give me content

– Ad-based – You give me content and sell ads

Current P2P Business Models

• Sell something which lets others profit from P2P– Solve a critical problem for decentralized applications

– Offer support and enhanced services for free tools

– Specialized packages for particular industries

– Tools and libraries for P2P infrastructure

“The people most likely to make money during a Gold Rush are the ones selling pickaxes and shovels.” Andy Oram, The O’Reilly Network

P2P: An Overview Dr. Tony White Carleton University.

Documents