[Lecture Notes in Computer Science] Database Theory — ICDT 2003 Volume 2572 || Open Problems in...

Open Problems in Data-Sharing Peer-to-Peer Systems

Neil Daswani, Hector Garcia-Molina, and Beverly Yang

Stanford University, Stanford CA 94305, USA, {daswani, hector, byang}@db.stanford.edu,

http://www-db.stanford.edu

Abstract. In a Peer-To-Peer (P2P) system, autonomous computers pool their resources (e.g., files, storage, compute cycles) in order to inexpensively handle tasks that would normally require large costly servers. The scale of these systems, their “open nature,” and the lack of centralized control pose difficult performance and security challenges. Much research has recently focused on tackling some of these challenges; in this paper, we propose future directions for research in P2P systems, and highlight problems that have not yet been studied in great depth. We focus on two particular aspects of P2P systems – search and security – and suggest several open and important research problems for the community to address.

1 Introduction

Peer-to-peer (P2P) systems have recently become a very active research area, due to the popularity and widespread use of P2P systems today, and their potential uses in future applications. Recently, P2P systems have emerged as a popular way to share huge amounts of data (e.g., [1,16,17]). In the future, the advent of large-scale ubiquitous computing makes P2P a natural model for interaction between devices (e.g., via the web services [18] framework).

P2P systems are popular because of the many benefits they offer: adaptation, self-organization, load-balancing, fault-tolerance, availability through massive replication, and the ability to pool together and harness large amounts of resources. For example, file-sharing P2P systems distribute the main cost of sharing data – bandwidth and storage – across all the peers in the network, thereby allowing them to scale without the need for powerful, expensive servers.

Despite their many strengths, however, P2P systems also present several challenges that are currently obstacles to their widespread acceptance and usage – e.g., security, efficiency, and performance guarantees like atomicity and transactional semantics. The P2P environment is particularly challenging to work in because of the scale of the network and unreliable nature of peers characterizing most P2P systems today. Many techniques previously developed for distributed systems of tens or hundreds of servers may no longer apply; new techniques are needed to meet these challenges in P2P systems.

D. Calvanese et al. (Eds.): ICDT 2003, LNCS 2572, pp. 1–15, 2003. © Springer-Verlag Berlin Heidelberg 2003

In this paper, we consider research problems associated with search and security in data-sharing P2P systems. Though data-sharing P2P systems are capable of sharing enormous amounts of data (e.g., 0.36 petabytes on the Morpheus [17] network as of October 2001), such a collection is useless without a search mechanism allowing users to quickly locate a desired piece of data (Section 2). Furthermore, to ensure proper, continued operation of the system, security measures must be in place to protect against availability attacks, unauthentic data, and illegal access (Section 3). In this paper, we highlight several important and open research issues within both of these topics.

Note that this paper is not meant to be an exhaustive survey of P2P research. First, P2P can be applied to many domains outside of data-sharing; for example, computation (e.g., [19,20]), collaboration (e.g., [21]), and infrastructure systems (e.g., [22]) are all popular applications of P2P. Each application faces its own unique challenges (e.g., job scheduling in computation systems), as well as common issues (e.g., resource discovery). In addition, within data-sharing systems there exists important research outside of search and security. Good examples include resource management issues such as fairness and administrative ease. Finally, due to space limitations, the issues we present within search and security are not comprehensive, but illustrative. Examples are also chosen with a bias towards work done at the Stanford Peers group [23], because it is the research that the authors know best.

2 Search

A good search mechanism allows users to effectively locate desired data in a resource-efficient manner. Designing such a mechanism is difficult in P2P systems for several reasons: scale of the system, unreliability of individual peers, etc. In this section, we outline the basic architecture, requirements and goals of a search mechanism for P2P systems, and then suggest several areas of open research.

2.1 Overview

In a data-sharing P2P system, users submit queries and receive results (such as data, or pointers to data) in return, via the search mechanism. Data shared in the system can be of any type. In most cases users share files, such as music files, images, news articles, web pages, etc. Other possibilities include data stored in a relational DBMS, or a queryable spreadsheet. Queries may take any form that is appropriate given the type of data shared. For example, in a file-sharing system, queries might be keywords with regular expressions, and the search may be defined over different portions of the document (e.g., header, title, metadata).

A search mechanism defines the behavior of peers in three areas:

– Topology: Defines how peers are connected to each other. In some systems (e.g., Gnutella [1]), peers may connect to whomever they wish. In other systems, peers are organized into a rigid structure, in which the number and nature of connections is dictated by the protocol. Defining a rigid topology may increase efficiency, but will restrict autonomy.

– Data placement: Defines how data or metadata is distributed across the network of peers. For example, in Gnutella, each node stores only its own collection of data. In Chord [2], data or metadata is carefully placed across nodes in a deterministic fashion. In super-peer networks [12], metadata for a small group of peers is centralized onto a single super-peer.

– Message routing: Defines how messages are propagated through the network. When a peer submits a query, the query message is sent to a number of the peer’s “neighbors” (that is, nodes to whom the peer is connected), who may in turn forward the message sequentially or in parallel to some of their neighbors, and so on. When, and to whom, messages are sent is dictated by the routing protocol. Often, the routing protocol can take advantage of known patterns in topology and data placement, in order to reduce the number of messages sent.
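To make the three areas concrete, the following is a minimal sketch of a Gnutella-style search in which the topology is an arbitrary neighbor graph, each peer holds only its own collection, and queries are flooded for a limited number of hops. It is an illustration under stated assumptions rather than the Gnutella protocol itself: the function name `flood_query`, the dictionaries `topology` and `collections`, the hop limit `ttl`, and substring matching are all hypothetical choices made for this example.

```python
from collections import deque


def flood_query(topology, collections, source, keyword, ttl):
    """Flood a keyword query breadth-first, at most `ttl` hops from `source`.

    topology:    dict peer -> list of neighbor peers (assumed overlay links)
    collections: dict peer -> set of file names stored at that peer
    Returns (hits, messages): matching (peer, file) pairs and the number
    of query messages sent.
    """
    hits = []
    messages = 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        peer, hops = frontier.popleft()
        # Data placement: a peer answers only from its own collection.
        for name in sorted(collections.get(peer, ())):
            if keyword in name:
                hits.append((peer, name))
        if hops == ttl:
            continue
        # Message routing: forward to every neighbor. A duplicate is dropped
        # by the receiver but still costs one message.
        for neighbor in topology.get(peer, ()):
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return hits, messages
```

Raising `ttl` in this sketch finds more results at the price of more messages, which is the kind of cost that a routing protocol aware of topology and data placement tries to reduce.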

In an actual system, the general model described above takes on a different form depending on the requirements of the system. Requirements are specified in several main categories:

– Expressiveness: The query language used for a system must be able to describe the desired data in sufficient detail. Key lookups are not expressive enough for IR searches over text documents, and keyword queries are not expressive enough to search structured data such as relational tables.

– Comprehensiveness: In some systems, returning any single result is sufficient (e.g., anycast), whereas in others, all results are required. The latter type of system requires a comprehensive search mechanism, in which all possible results are returned.

– Autonomy: Every search mechanism must define peer behavior with respect to topology, data placement, and message routing. However, autonomy of a peer is restricted when the mechanism limits behavior that a peer could reasonably expect to control. For example, a peer may wish to only connect to its friends or other trusted peers in the same organization, or the peer may wish to control which nodes can store its data (e.g., only nodes on the intranet), and how much of other nodes’ data it must store. Depending on the purpose and users of the system, the search mechanism may be required to meet a certain level of autonomy for peers.

In this paper, we assume the additional requirement that the search mechanism be decentralized. A P2P system may have centralized search, and indeed, such “hybrid systems” have been very useful and popular in practice (e.g., [16]). However, centralized systems have been well-studied, and it is desirable that the search mechanism share the same benefits of P2P mentioned in Section 1; hence, here we focus only on decentralized P2P solutions.

While a well-designed search mechanism must satisfy the requirements specified by the system, it should also seek to maximize the following goals:


– Efficiency: We measure efficiency in terms of absolute resources consumed – bandwidth, processing power, storage, etc. An efficient use of resources results in lighter overhead on the system, and hence, higher throughput.

– Quality of Service: We can measure quality of service (QoS) along many different metrics depending on the application – number of results, response time, etc. Note the distinction between QoS and efficiency: QoS focuses on user-perceived qualities, while efficiency focuses on the resource cost (e.g., bandwidth) to achieve a particular level of service.

– Robustness: We define robustness to mean stability in the presence of failures: quality of service and efficiency are maintained as peers in the system fail or leave. Robustness to attacks is a separate issue discussed in Section 3.

By placing current work in the framework of requirements and goals above, we can identify several areas in which research is much needed. In the following sections, we mention just a few of these areas.

2.2 Expressiveness

In order for P2P systems to be useful in a wide range of applications, they must be able to support query languages of varying levels of expressiveness. Thus far, work in P2P search has focused on answering simple queries, such as key lookups. An important area of research therefore lies in developing mechanisms for richer query languages. Here, we list a few examples of useful types of queries, and discuss the related work and challenges in supporting them.

– Key lookup: The simplest form of query is an object lookup by key or identifier. Protocols directly supporting this primitive have been widely studied, and efficient solutions exist (e.g., [2,3,4]). Ongoing research is exploring how to make these protocols more efficient and robust [5].

– Keyword: While much research has focused on search techniques for keyword queries (e.g., [11,10,6]), all of these techniques have been geared towards efficient, partial (not comprehensive) search – e.g., all music-sharing systems currently only support partial search. Partial search is acceptable in those applications where a few keywords can usually uniquely identify the desired file (e.g., music-sharing systems, as opposed to web page repositories), because the first few matches are likely to satisfy the user’s request. Techniques for partial search can always be made comprehensive simply by sending the query message to every peer in the network; however, such an approach is prohibitively expensive. Hence, designing techniques for efficient, comprehensive search remains an open problem.

– Ranked keyword: If many results are returned for comprehensive keyword search, users will need results to be ranked and filtered by relevance. While the query language for ranked keyword search remains the same, the additional information in the results (i.e., the relevance ranking) poses additional challenges and opportunities. For example, ranked search can be built on top of regular search by retrieving all results and sorting locally; however, state-of-the-art ranking functions usually require global statistics over the total collection of documents (e.g., document frequency). Collecting and maintaining these statistics in a robust, efficient, and distributed manner is a challenge. At the same time, ranked results allow the system to return “top k” results, which provides the opportunity to optimize search if k is much less than the total number of results (which is generally the case, for example, in web searches). Techniques for ranked search exist for distributed systems of moderate scale (e.g., [7]), but future research must extend these techniques to support much larger systems.

– Aggregates: A user may sometimes be interested in knowing aggregate properties of the system or data collection as a whole, rather than locating specific data. For example, to collect global statistics to support ranked keyword search mentioned earlier, a user could submit several SUM queries to sum the number of documents that contain a particular term. Ongoing research [8] addresses COUNT queries defined over a predicate – for example, counting the number of nodes that belong to the stanford.edu domain. Further research is needed to extend these techniques into more expressive aggregates like SUM, MAX, and MEDIAN.

– SQL: As a complex language defined over a rich data model, SQL is the most difficult query language to support among the examples listed. Current research on supporting SQL in P2P systems is very preliminary. For example, the PIER project [9] supports a subset of SQL over a P2P framework, but they report significant performance “hotspots” in their preliminary implementation. A great deal of additional research is needed to advance current work into a search mechanism with reasonable performance, and to investigate alternative approaches to the problem.
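The link between ranked keyword search and aggregates can be sketched in a few lines. The example below assumes that every peer can report its document count and its local document frequency for a term, sums those reports (a SUM aggregate) into a global inverse document frequency, and then merges per-peer ranked results into a global top k. The function names, the shape of `peer_stats`, and the smoothed logarithmic IDF formula are assumptions made for illustration; they are not taken from the cited systems, and the sketch ignores the harder problem of collecting the statistics robustly.

```python
import heapq
import math


def global_idf(peer_stats, term):
    """Combine per-peer statistics into a global IDF for `term`.

    peer_stats: list of (num_docs, {term: local_document_frequency}),
    one entry per peer (an assumed reporting format).
    """
    total_docs = sum(num_docs for num_docs, _ in peer_stats)
    # SUM aggregate over the peers' local document frequencies.
    doc_freq = sum(freqs.get(term, 0) for _, freqs in peer_stats)
    # Smoothed IDF; the +1 terms avoid division by zero (an assumed variant).
    return math.log((1 + total_docs) / (1 + doc_freq))


def top_k(per_peer_results, k):
    """Merge lists of (score, doc_id) from each peer and keep the k best."""
    merged = (item for results in per_peer_results for item in results)
    return heapq.nlargest(k, merged)
```

When k is much smaller than the total number of results, each peer only needs to return its own k best candidates for the merge to remain correct, which is one way the "top k" opportunity mentioned above can reduce traffic.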

2.3 Autonomy, Efficiency, and Robustness

Autonomy, efficiency and robustness are all desirable features in any system. These features conceptually define an informal space of P2P systems, as shown in Figure 1a, where a point in the space represents a system with the corresponding “values” for each feature. Note that the value of a system with respect to a feature only provides a partial order, since features can be measured along several metrics (e.g., efficiency can be measured by bandwidth, processing power, and storage). Hence, Figure 1 illustrates the space by showing just a few points for which the relative order (and not the actual coordinates) along each feature is fairly obvious.

The space defined by autonomy, efficiency and robustness is not fully explored; in particular, there appears to be some correlation between autonomy and efficiency (Figure 1b), and autonomy and robustness (Figure 1c). A partial explanation for the first correlation is that less autonomy allows the search mechanism to specify a data placement and topology such that:

– There exists a deterministic way to locate data within bounded cost (e.g., Chord)

– There is a small set of nodes that is guaranteed to hold the answer, if it exists (e.g., super-peer networks, concept clusters [13])


[Figure 1: three panels. (a) A space with axes autonomy, efficiency and robustness. (b) Autonomy versus efficiency. (c) Autonomy versus robustness. Labeled points include Gnutella, super-peer networks, Chord, Viceroy, and super-peer redundancy.]

Fig. 1. The space of systems defined by autonomy, efficiency and robustness (a). Looking at a few example systems within this space, there appears to be a relationship between autonomy and efficiency (b), and autonomy and robustness (c)

– There is an increased chance of finding results on a random node (e.g., replication [6]).

At the same time, these rigidly organized networks can be difficult or expensive to maintain, especially as peers join and leave the network at the rapid rate characteristic of many P2P systems. As a result, robustness is also correlated with autonomy.
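The first property in the list, deterministic location of data, can be illustrated with a simplified hash ring in which each key is assigned to the first peer whose identifier is equal to or follows the key's identifier. This is only a sketch in the spirit of Chord-like placement: the names `ring_id` and `Ring`, the use of SHA-1, and the 32-bit identifier space are assumptions for the example, and the sketch resolves the owner from a global view instead of routing through per-peer tables as a real protocol would.

```python
import bisect
import hashlib


def ring_id(name, bits=32):
    """Map a peer name or a key onto the identifier ring (assumed hashing)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class Ring:
    """Global view of a hash ring; a real system distributes this state."""

    def __init__(self, peers, bits=32):
        self.bits = bits
        self.points = sorted((ring_id(peer, bits), peer) for peer in peers)
        self.ids = [point for point, _ in self.points]

    def owner(self, key):
        """Return the first peer at or after hash(key), wrapping around."""
        index = bisect.bisect_left(self.ids, ring_id(key, self.bits))
        return self.points[index % len(self.points)][1]
```

The placement leaves peers no choice about which keys they store, which is the loss of autonomy discussed above. It also hints at the maintenance cost: every join or leave changes `points`, and the affected keys must move.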

One important area of research is finding techniques that push beyond the current tradeoffs between efficiency, autonomy and robustness. Decoupling efficiency from autonomy seems to be the greatest challenge, since existing techniques almost uniformly sacrifice autonomy to achieve efficiency. However, the potential gain is the greatest: a search mechanism that is efficient, robust, and preserves peer autonomy. Decoupling autonomy from robustness is also important, because it allows greater flexibility in choosing the desired properties of the mechanism. For example, a search mechanism that is robust, but has low peer autonomy, can be desirable if the lack of autonomy leads to efficiency, and peer autonomy is not a requirement of the system.

Several research projects have tackled the autonomy/robustness tradeoff. For example, the Viceroy [14] network construction maintains a low level of peer autonomy, but increases robustness and efficiency by reducing the cost of maintaining the network structure to a constant term, for each join/leave of a peer. In comparison, most distributed hash tables (DHTs) with the same functionality have logarithmic maintenance cost. As another example, super-peer redundancy [12] imposes slightly stricter rules on topology and data placement within a cluster of peers, but this decrease in autonomy results in greater robustness of the super-peer and improved efficiency in the overall network.

Another interesting area of research is providing fine-granularity tuning of the tradeoff between autonomy and efficiency within a single system. A single user may have varying needs; for example, a company may have a few sensitive files that must remain on the intranet, but the remaining files can be stored anywhere. A single system that can be tuned to support all of these needs is more desirable than requiring users to use different systems for different purposes. A good example of a tunable system is SkipNet [15]. SkipNet allows users to specify a range of peers on which a document may be stored (e.g., all peers within the stanford.edu domain). At one extreme, if the range is always limited to a single peer, then user autonomy is high, but the system ceases to be P2P and loses good properties such as load-balancing and self-organization. At the other extreme, if the range always includes all peers in the network, SkipNet functions as a traditional P2P lookup system with low autonomy, but other good properties. While SkipNet does not push beyond existing tradeoffs, its value lies in allowing users to choose the point along the tradeoff that meets their needs.

2.4 Quality of Service

In the previous discussion, we implicitly assume a fixed level of service (e.g., number of results per query) that must be maintained as other factors (e.g., autonomy) are varied. However, quality of service (QoS) can be measured with many different metrics, depending on the application, and a spectrum of acceptable performance exists along each metric. Examples of service metrics include number of results (e.g., in partial-search systems), response time, and relevance (e.g., precision and recall in ranked keyword searches). A constant challenge in designing P2P systems is achieving a desired level of QoS as efficiently as possible. Because metrics and applications differ so widely, this challenge must often be tackled on a per-case basis.

As an example, the number of results returned is an important QoS metric for partial-search systems like Gnutella. However, in systems where there is high autonomy (such as Gnutella), there is a clear and unavoidable tradeoff between number of results and cost; hence, the interesting problem is to get as close as possible to the lower bounds of the tradeoff. For example, the directed BFS technique in [11] attempts to minimize cost by sending messages to “productive” nodes (e.g., nodes with large collections). Concept-clustering networks (e.g., [13]) cluster peers together according to “interest” (e.g., music genre), and send queries to the cluster that best matches the queries’ area of interest. These techniques do improve the tradeoff between cost and number of results, but are clearly not optimal: performance of directed BFS depends on the ad-hoc topology and is therefore unpredictable, while concept-clustering only works well if queries and interests fall cleanly into single categories. Can there exist a general technique that can guarantee (with high probability) that the cost/QoS tradeoff is optimal?
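The idea of favoring "productive" nodes can be sketched as a neighbor-selection heuristic. The code below is an assumption-laden illustration, not the directed BFS algorithm of [11]: it ranks neighbors by a known collection size (one possible productivity signal), forwards to only the best `fanout` of them, and reports results per message as a crude measure of the cost/results tradeoff. All names are hypothetical.

```python
def choose_neighbors(neighbors, collection_size, fanout):
    """Return the `fanout` neighbors with the largest known collections.

    collection_size: dict peer -> number of files (assumed to be known,
    for example from earlier responses). Unknown peers count as 0 and
    ties are broken by peer name so the choice is deterministic.
    """
    ranked = sorted(neighbors, key=lambda peer: (-collection_size.get(peer, 0), peer))
    return ranked[:fanout]


def results_per_message(num_results, num_messages):
    """Crude tradeoff measure: results obtained per query message sent."""
    return num_results / num_messages if num_messages else 0.0
```

Because the ranking depends entirely on which neighbors a peer happens to have, the benefit varies with the ad-hoc topology, which is the unpredictability noted above.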

With other metrics of QoS, there is not such an obvious tradeoff between quality and cost. In these cases, the goal is to maintain the same level of service while decreasing cost. For example, consider the “satisfaction” metric, which is binary and is true when a threshold number of results is found. Satisfaction is an important metric in partial-search systems where only the first k results are displayed to the user (e.g., [16,1]). Reference [11] shows that, compared to current techniques, cost can be drastically reduced while maintaining satisfaction. Furthermore, even better performance is probably possible if we discard this work’s requirement of peer autonomy and simplicity. Additional research is required to explore this space further.
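The satisfaction metric is simple enough to state directly. The sketch below defines it as a threshold test and then, under the assumption that a search can be widened in stages, finds the cheapest stage at which a query becomes satisfied. The staged-search model, the function names, and the `(results, messages)` tuples are illustrative assumptions rather than the method evaluated in [11].

```python
def is_satisfied(num_results, threshold):
    """Binary satisfaction: true once at least `threshold` results are found."""
    return num_results >= threshold


def cheapest_satisfying_depth(cumulative, threshold):
    """Find the shallowest search depth that satisfies the query.

    cumulative: list of (results, messages) totals for depths 1, 2, ...
    (hypothetical measurements). Returns (depth, messages) or None if no
    depth reaches the threshold.
    """
    for depth, (results, messages) in enumerate(cumulative, start=1):
        if is_satisfied(results, threshold):
            return depth, messages
    return None
```

Since results beyond the threshold do not change the metric, any messages spent after the query is satisfied are pure cost, which is why cost can fall sharply without lowering satisfaction.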


3 Security

Securing P2P data sharing applications is challenging due to their open and autonomous nature. Compared to a client-server system in which servers can be relied upon or trusted to always follow protocols, peers in a P2P system may provide no such guarantee. The environment in which a peer must function is a hostile one in which any peer is welcome to join the network; these peers cannot necessarily be trusted to route queries or responses correctly, store documents when asked to, or serve documents when requested. In this part of the paper, we outline a number of security issues that are characteristic of P2P data sharing systems, discuss a few examples of research that has taken place to address some of these issues, and suggest a number of open research problems.

We organize the security requirements of P2P data sharing systems into four general areas: availability, file authenticity, anonymity, and access control. Today’s P2P systems rarely address all of the necessary requirements in any one of these areas, and developing systems that have the flexibility to support requirements in all of these areas is expected to be a research challenge for quite some time.

For each of these areas, it will be important to develop techniques that prevent, detect, manage, and are able to recover from attacks. For example, since it may be difficult to prevent a denial-of-service attack against a system’s availability, it will be important to develop techniques that are able to 1) detect when a denial-of-service attack is taking place (as opposed to there just being a high load), 2) manage an attack that is “in-progress” such that the system can continue to provide some (possibly reduced) level of service to clients, and 3) recover from the attack by disconnecting the malicious nodes.

3.1 Availability

There are a number of different node and resource availability requirements that are important to P2P file sharing systems. In particular, each node in a P2P system should be able to accept messages from other nodes, and communicate with them to offer access to the resources that it contributes to the network.

A denial-of-service (DoS) attack attempts to make a node and its resources unavailable by overloading it. The most obvious DoS attack is targeted at using up all of a node’s bandwidth. This type of attack is similar to traditional network-layer DoS attacks (e.g., [31]). If a node’s available bandwidth is used up transferring useless messages that are directly or indirectly created by a malicious node, all of the other resources that the node has to offer (including CPU and storage) will also be unavailable to the P2P network.

A specific example of a DoS attack against node availability is a chosen-victim attack in Gnutella that an adversary constructs as follows: a malicious super-node maneuvers its way into a “central” position in the network and then responds to every query that passes through it claiming that the victim node has a file that satisfies the query (even though it does not). Every node that receives one of these responses then attempts to connect to the victim to obtain the file that they were looking for, and the number of these requests overloads the bandwidth of the victim such that any other node seeking a file that the victim does have is unable to communicate with the victim.

The key aspect to note here is that in our example the attacker exploited a vulnerability of the Gnutella P2P protocol (namely, that any node can respond to any query claiming that any file could be anywhere). In the future, P2P protocols need to be designed to make it hard for adversaries to construct DoS attacks by taking advantage of loosely constrained protocol features.

Attackers that construct DoS attacks typically need to find and take advantage of an “amplification mechanism” in the network to cause significantly more damage than they could with only their own resources. In addition, if they would like to have control over how their attack is carried out, they must also find or create a back-door communication channel to communicate with “zombie” hosts that they infiltrate using manual or automatic means. It is important to design future P2P protocols such that they do not open up new opportunities for attackers to use as amplifiers and back-door communication channels.

Some research has taken place to date to specifically address DoS attacks in P2P networks. In particular, [38] addresses DoS attacks based on query-floods in the Gnutella network. However, more research is necessary to understand the effects of other types of DoS attacks in various P2P networks.

Aside from DoS attacks, node availability can also be attacked by malicious users that infiltrate victim nodes and induce their failure. These types of attacks can be modeled as fail-stop or Byzantine failures, which could potentially be dealt with using many techniques that have already been developed (e.g., [34]). However, these techniques have typically not been popular due to their inefficiency, unusually high message overhead, and complexity. In addition, these techniques often assume complete and secure pairwise connectivity between nodes, which is not the case in most P2P networks. Further research will be necessary to make these or similar techniques acceptable from a performance and security standpoint in a P2P context.

In addition, there are many proposals to provide significant levels of fault-tolerance in the face of node failure including CAN [3], Chord [2], Pastry [4], and Viceroy [14]. Security analyses of these types of proposals can be found in [43] and [36]. The IRIS [25] project seeks to continue the investigation of these types of approaches.

A malicious node can also directly attack the availability of any of the particular resources at a node. The CPU availability at a node can be attacked by sending a modest number of complex queries that bog down the node's CPU without consuming all of its bandwidth. The available storage could be attacked by malicious nodes that are allowed to submit bogus documents for storage. One approach to deal with this is to allocate storage to nodes in a manner proportional to the resources that a node contributes to the network, as proposed in [28].
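One way to read "storage proportional to contribution" is as a per-node quota. The following is a minimal sketch of that reading, not the scheme of [28]; the function names, the byte-count units, and the integer rounding are all assumptions made for illustration.

```python
def storage_quotas(contributed: dict[str, int], total_capacity: int) -> dict[str, int]:
    """Split `total_capacity` bytes among nodes in proportion to what each contributes."""
    total_contributed = sum(contributed.values())
    if total_contributed == 0:
        return {node: 0 for node in contributed}
    return {
        node: total_capacity * amount // total_contributed
        for node, amount in contributed.items()
    }


def may_store(node: str, used: dict[str, int], quotas: dict[str, int], size: int) -> bool:
    """Accept a document only if it fits within the submitting node's quota."""
    return used.get(node, 0) + size <= quotas.get(node, 0)


# Hypothetical network: "mallory" contributes nothing and so may store nothing.
quotas = storage_quotas({"a": 300, "b": 100, "mallory": 0}, total_capacity=1000)
```

A node that contributes no resources receives no quota, so it cannot fill other peers' disks with bogus documents. An attacker would first have to contribute real storage.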

We might like to ensure that all files stored in the system are always available regardless of which nodes in the network are currently online. File availability ensures that files can be perpetually preserved, regardless of factors such as the popularity of the files. Systems such as Gnutella and Freenet provide no guarantees about the preservation of files, and unpopular files tend to disappear.

Even if files can be assured to physically exist and are accessible, a DoS attack can still be made against the quality-of-service with which they are available. In this type of DoS attack, a malicious node makes a file available, but when a request to download the file is received, it serves the file so slowly that the requester will most likely lose patience and cancel the download before it completes. The malicious node could also claim that it is serving the requested file but send some other file instead. As such, techniques such as hash trees [26] could be used by the client to incrementally ensure that the server is sending the correct data, and that data is sent at a reasonable rate.
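To make the idea of incremental verification concrete, the following is a minimal sketch of a binary hash tree over file blocks. It assumes the client has obtained the root hash from a source it trusts, that SHA-256 is the hash function, and that an odd-sized level is padded by repeating its last node; these choices and all function names are ours, not prescribed by [26]. It covers only the correctness check, not the rate check.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks: list[bytes]) -> list[list[bytes]]:
    """Return every level of the hash tree, leaves first and root last."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [_h(b"\x00" + block) for block in blocks]  # leaf hashes
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by repeating the last node
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def proof_for(levels: list[list[bytes]], index: int) -> list[bytes]:
    """Sibling hashes needed to recompute the root from block `index`."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])  # sibling shares the same parent
        index //= 2
    return proof


def verify_block(block: bytes, index: int, proof: list[bytes], root: bytes) -> bool:
    """Check one received block against the trusted root hash."""
    digest = _h(b"\x00" + block)
    for sibling in proof:
        if index % 2 == 0:
            digest = _h(b"\x01" + digest + sibling)
        else:
            digest = _h(b"\x01" + sibling + digest)
        index //= 2
    return digest == root


blocks = [b"chunk-0", b"chunk-1", b"chunk-2"]
levels = build_tree(blocks)
root = levels[-1][0]
```

With such a structure the client can reject a substituted block as soon as it arrives, rather than after downloading the whole file.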

3.2 File Authenticity

File authenticity is a second key security requirement that remains largely unaddressed in P2P systems. The question that a file authenticity mechanism answers is: given a query and a set of documents that are responses to the query, which of the responses are “authentic” responses to the query? For example, if a peer issues a search for “Origin of Species” and receives three responses to the query, which of these responses are “authentic”? One of the responses may be the exact contents of the book authored by Charles Darwin. Another response may be the content of the book by Charles Darwin with several key passages altered. A third response might be a different document that advocates creationism as the theory by which species originated.

Note that the problem of file authenticity is different from the problem of file (or data) integrity. The goal of file integrity is to ensure that documents do not get inadvertently corrupted due to communication failures. Solutions to the file integrity problem usually involve adding some type of redundancy to messages in the form of a “signature.” After a file is sent from node A to node B, a signature of the file is also sent. There are many fewer bits in the signature than in the file itself, and every bit of the signature is dependent on every bit of the file. If the file arrived at node B corrupted, the signature would not match. Techniques such as CRCs (cyclic redundancy checks), hashing, MACs (message authentication codes), or digital signatures (using symmetric or asymmetric encryption) are well-understood solutions to the file integrity problem.
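The integrity "signatures" listed above can be sketched with Python's standard library. This is an illustration only: the helper names are hypothetical, SHA-256 is an assumed choice of hash, and the MAC variant assumes that nodes A and B already share a secret key.

```python
import hashlib
import hmac
import zlib


def crc32_of(data: bytes) -> int:
    """Cyclic redundancy check: detects accidental corruption only."""
    return zlib.crc32(data)


def sha256_of(data: bytes) -> str:
    """Cryptographic hash: a short digest that depends on every bit of the file."""
    return hashlib.sha256(data).hexdigest()


def mac_of(key: bytes, data: bytes) -> str:
    """Message authentication code: a keyed hash under a secret shared by A and B."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def arrived_intact(data: bytes, expected_digest: str) -> bool:
    """Node B recomputes the digest and compares it with the one sent by node A."""
    return hmac.compare_digest(sha256_of(data), expected_digest)


sent = b"On the Origin of Species"
digest = sha256_of(sent)  # transmitted alongside the file
```

Each of these confirms only that the bytes received match the bytes sent. None of them says whether the sender's copy was the authentic document in the first place.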

The problem of file authenticity, however, can be viewed as: given a query, what is (or are) the “authentic” signature(s) for the document(s) that satisfy the query? Once some file authenticity algorithm is used to determine what is (or are) the authentic signatures, a peer can inspect responses to the query by checking that each response has an authentic signature.

In our discussion until this point, we have not defined what it means for a file to be authentic. There are a number of potential options: we will outline four reasonable ones.

Oldest Document. The first definition of authenticity considers the oldest document that was submitted with a particular set of metadata to be the authentic copy of that document. For example, if Charles Darwin was the first author to ever submit a document with the title “Origin of Species,” then his document would be considered to be an authentic match for a query looking for “Origin of Species” as the title. Any documents that were submitted with the title “Origin of Species” after Charles Darwin’s submission would be considered unauthentic matches to the query even if we decided to store these documents in the system. Timestamping systems (e.g. [35]) can be helpful in constructing file authenticity systems based on this approach.

Expert-Based. In this approach, a document would be deemed authentic by an “expert” or authoritative node. For example, node G may be an expert that keeps track of signatures for all files ever authored by any user of G. If a user searching for documents authored by any of G’s users is ever concerned about the potential authenticity of a file received as a response to a query, node G can be consulted. Of course, if node G is unavailable at any particular time due to a transient or permanent failure, is infiltrated by an attacker, or is itself malicious, it may be difficult to properly verify the authenticity of files that G’s users authored. Offline digital signature schemes (e.g., RSA) can be used to verify file authenticity in the face of node failures, but are limited by the lifetime and security of public/private keys.

Voting-Based. To deal with the possible failure of G or a compromised key in our last approach, our third definition of authenticity takes into account the “votes” of many experts. The expert nodes may be nodes that are run by human experts qualified to study and assess the authenticity of particular types of files, and the majority opinion of the human experts can be used to assess the authenticity of a file. Alternatively, the expert nodes may simply be “regular” nodes that store files, and will vote that a particular file is authentic if they store a copy of it. In this scheme, users are expected to delete files that they do not believe are authentic, and a file’s authenticity is determined by the number of copies of the file that are distributed throughout the system. The key technical issues with this approach are how to prevent spoofing of votes, of nodes, and of files.
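A minimal sketch of the majority rule follows, assuming each node casts at most one vote and that a vote is simply the signature (for instance, a file hash) that the node vouches for. The function name and the `quorum` parameter are hypothetical. The sketch does nothing to prevent spoofed votes or spoofed nodes, which remain the hard part.

```python
from collections import Counter
from typing import Optional


def authentic_by_vote(votes: dict[str, str], quorum: float = 0.5) -> Optional[str]:
    """Return the signature backed by more than `quorum` of the voters, if any.

    `votes` maps a node identifier to the signature that node vouches for;
    keying by node identifier limits each node to a single vote.
    """
    if not votes:
        return None
    signature, count = Counter(votes.values()).most_common(1)[0]
    return signature if count > quorum * len(votes) else None


# Hypothetical responses to a query for "Origin of Species".
winner = authentic_by_vote({"n1": "sigA", "n2": "sigA", "n3": "sigB"})
```

In the "regular node" variant described above, a node's vote would be derived from the files it stores rather than from an explicit ballot.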

Reputation-Based. Some experts might be more trustworthy than others (as determined by past performance), and we might weight the votes of more trustworthy experts more heavily. The weights in this approach are a simple example of “reputations” that may be maintained by a reputation system. A reputation system is responsible for maintaining, updating, and propagating such weights and other reputation information [41]. Reputation systems may or may not choose to use voting in making their assessments. There has been some study of reputation systems in the context of P2P networks, but no such system has been commercially successful (e.g. [33,24]).
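Weighting votes by reputation can be sketched as follows. The sketch assumes reputations are non-negative numbers supplied by some external reputation system, and that nodes with no recorded reputation carry zero weight; the function name and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def authentic_by_reputation(
    votes: dict[str, str],
    reputation: dict[str, float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the signature holding more than `threshold` of the total vote weight."""
    weight_for: dict[str, float] = defaultdict(float)
    for node, signature in votes.items():
        weight_for[signature] += reputation.get(node, 0.0)  # unknown nodes count for nothing
    total = sum(weight_for.values())
    if total <= 0:
        return None
    best = max(weight_for, key=weight_for.get)
    return best if weight_for[best] > threshold * total else None


# Two low-reputation nodes are outvoted by one highly trusted expert.
votes = {"n1": "sigA", "n2": "sigB", "n3": "sigB"}
trusted = authentic_by_reputation(votes, {"n1": 0.9, "n2": 0.2, "n3": 0.1})
```

With equal reputations the rule reduces to a simple majority. How the weights themselves are maintained and protected is the job of the reputation system.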

3.3 Anonymity

There is much work that has taken place on anonymity in the context of the Internet both at the network-layer (e.g. [30]) as well as at the application-layer (e.g. [40]). In this section we specifically focus on application-layer anonymity in P2P data sharing systems.

Table 1. Types of Anonymity

Type of Anonymity    Difficult for Adversary to Determine:
Author               Which users created which documents?
Server               Which nodes store a given document?
Reader               Which users access which documents?
Document             Which documents are stored at a given node?

While some would suggest that many users are interested in anonymity because it allows them to illegally trade copyrighted data files in an untraceable fashion, there are many legitimate reasons for supporting anonymity in a P2P system. Anonymity can enable censorship resistance, freedom of speech without the fear of persecution, and privacy protection. Malicious parties can be prevented from deterring the creation, publication, and distribution of documents. For example, such a system may allow an Iraqi nuclear scientist to publish a document about the true state of Iraq’s nuclear weapons program to the world without the fear that Saddam Hussein’s regime could trace the document back to him or her. Users that access documents could also have their privacy protected in such a system. An FBI agent could access a company’s public information resources (e.g., web pages and databases) anonymously so as not to arouse suspicion that the company may be under investigation.

There are a number of different types of anonymity that can be provided in a P2P system. Each type makes it difficult for the adversary to determine the answer to a different question. Table 1 summarizes a few types of anonymity discussed in [39].

We would ideally like to provide anonymity while maintaining other desirable search and security features such as efficiency, decentralization, and peer discovery. Unfortunately, providing various types of anonymity often conflicts with these design goals for a P2P system.

To illustrate one of these conflicting goals, consider the natural trade-off between server anonymity and efficient search. If we are to provide server anonymity, it should be impossible to determine which nodes are responsible for storing a document. On the other hand, if we would like to be able to efficiently search for a document, we should be able to tell exactly which nodes are responsible for storing a document. A P2P system such as Free Haven that strives to provide server anonymity resorts to broadcast search, while others such as Freenet [27] provide for efficient search but do not provide for server anonymity. Freenet does, however, provide author anonymity. Nevertheless, supporting server anonymity and efficient search concurrently remains an open issue.

There exists a middle-ground: we might be able to provide some level of server anonymity by assigning pseudonyms to each server, albeit at the cost of search efficiency. If an adversary is able to determine the pseudonym for the server of a controversial document, the adversary is still unable to map the pseudonym to the publisher’s true identity or location. The document can be accessed in such a way as to preserve the server’s anonymity by requiring that a reader (a potential adversary) never directly communicate with a server. Instead, readers only communicate with a server through a chain of intermediate proxy nodes that forward requests from the reader to the server. The reader presents the server’s pseudonym to a proxy to request communication with the server (thereby hiding a server’s true identity), and never obtains a connection to the actual server for a document (thereby hiding the server’s location). Reader anonymity can also be provided using a chain of intermediate proxies, as the server does not know who the actual requester of a document is, and each proxy does not know if the previous node in the chain is the actual reader or is just another proxy. Of course, in both these cases, the anonymity is provided based on the assumption that proxies do not collude. The degradation of anonymity protocols under attacks has been studied in [44], and this study suggests that further work is necessary in this area.

Free Haven and Crowds [40] are examples of systems that use forwarding proxies to provide various types of anonymity with varying strength. Each of these systems differs in how the level of anonymity degrades as more and more potentially colluding malicious nodes take on the responsibilities of proxies. Other techniques that are commonly found in systems that provide anonymity include mix networks (e.g. [32]), and using cryptographic secret-sharing techniques to split files into many shares (e.g. [42]).
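As an illustration of secret sharing, the following is a minimal sketch of polynomial-based threshold sharing in the style of [42]: a secret is the constant term of a random polynomial of degree threshold − 1 over a prime field, each share is one point on that polynomial, and any `threshold` shares recover the secret by interpolation at zero. The choice of prime, the function names, and the encoding of the secret as a single integer are assumptions; a file would have to be encoded as one or more integers smaller than the prime.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; every secret must be smaller than this


def split_secret(secret: int, threshold: int, num_shares: int) -> list[tuple[int, int]]:
    """Create `num_shares` points such that any `threshold` of them recover `secret`."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= num_shares:
        raise ValueError("need 1 <= threshold <= num_shares")
    # Random polynomial of degree threshold - 1 whose constant term is the secret.
    coefficients = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for coefficient in reversed(coefficients):  # Horner evaluation mod PRIME
            y = (y * x + coefficient) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        numerator, denominator = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                numerator = (numerator * -xj) % PRIME
                denominator = (denominator * (xi - xj)) % PRIME
        secret = (secret + yi * numerator * pow(denominator, -1, PRIME)) % PRIME
    return secret


shares = split_secret(123456789, threshold=3, num_shares=5)
```

Shares held by different nodes reveal nothing individually, so no single node can be shown to store the document; the cost is that a reader must contact at least `threshold` nodes.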

3.4 Access Control

Intellectual property management and digital rights management issues can be cast as access control problems. We want to restrict the accessibility of documents to only those users that have paid for that access. P2P systems currently cannot be trusted to successfully enforce copyright laws or carry out any form of such digital rights management, especially since few assumptions can be made about key management infrastructure. This has led to blatant violation of copyright laws by users of P2P systems, and has also led to lawsuits against companies that build P2P systems.

The trade-offs involved in enforcing access control in a P2P data sharing system are challenging because if a system imposes restrictions over what types of data it shares (i.e., only copy-protected content), then its utility will be limited. On the other hand, if it imposes no such restrictions, then it can be used as a platform to freely distribute any content to anyone that wants it [37].

Further effort must go into exploring whether or not it is reasonable to have the P2P network enforce access control, or if the enforcement should take place at the endpoints of the network. In either case, only users that own (or have paid for) the right to download and access certain files should be able to do so, in order to legally support data sharing applications.

If the benefits of P2P systems are to be realized, we need to explore the feasibility of and the technical approaches to instrumenting them with appropriate mechanisms to allow for the management of intellectual property.

4 Conclusion

Many of the open problems in P2P data sharing systems surround search and security issues. The key research problem in providing a search mechanism is how to provide for maximum expressiveness, comprehensiveness, and autonomy with the best possible efficiency, quality-of-service, and robustness. The key to securing a P2P network lies in designing mechanisms that ensure availability, file authenticity, anonymity, and access control. In this paper, we have illustrated some of the trade-offs at the heart of search and security problems in P2P data sharing systems, and outlined several major areas of importance for future work.

References

1. Gnutella website. http://www.gnutella.com

2. Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup service for internet applications. Proc. ACM SIGCOMM (2001)

3. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A scalable content addressable network. Proc. ACM SIGCOMM (2001)

4. Rowstron, A., Druschel, P.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. Proc. of the 18th IFIP/ACM Intl. Conf. on Distributed Systems Platforms (2001)

5. Ratnasamy, S., Shenker, S., Stoica, I.: Routing Algorithms for DHTs: Some Open Questions. Proc. IPTPS (2002)

6. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in unstructured peer-to-peer networks. Proc. of Intl. Conf. on Supercomputing (2002)

7. Cuenca-Acuna, F. M., Peery, C., Martin, R. P., Nguyen, T. D.: PlanetP: using gossiping to build content addressable peer-to-peer information sharing communities. Technical Report DCS-TR-487, Dept. of Computer Science, Rutgers Univ. (2002)

8. Bawa, M., Garcia-Molina, H., Gionis, A., Motwani, R.: Estimating the size of a peer-to-peer network (2002)

9. Harren, M., Hellerstein, M., Huebsch, R., Loo, B., Shenker, S., Stoica, I.: Complex Queries in DHT-based Peer-to-Peer Networks. Proc. IPTPS (2002)

10. Crespo, A., Garcia-Molina, H.: Routing indices for peer-to-peer systems. Proc. 28th Intl. Conf. on Distributed Computing Systems (2002)

11. Yang, B., Garcia-Molina, H.: Improving search in peer-to-peer systems. Proc. 28th Intl. Conf. on Distributed Computing Systems (2002)

12. Yang, B., Garcia-Molina, H.: Designing a super-peer network. Proc. ICDE (2003)

13. Schlosser, M., Sintek, M., Decker, S., Nejdl, W.: A scalable and ontology-based P2P infrastructure for semantic web services (2002)

14. Malkhi, D., Naor, M., Ratajczak, D.: Viceroy: a scalable and dynamic emulation of the butterfly. Proc. PODC (2002)

15. Harvey, N., Jones, M., Saroiu, S., Theimer, M., Wolman, A.: SkipNet: a scalable overlay network with practical locality properties (2002)

16. Napster website. http://www.napster.com

17. Morpheus website. http://www.musiccity.com

18. W3C website on Web Services. http://www.w3.org/2002/ws

19. Seti@Home website. http://setiathome.ssl.berkely.edu

20. DataSynapse website. http://www.datasynapse.com

21. Groove Networks website. http://www.groove.net

22. Stoica, I., Adkins, D., Zhuang, S., Shenker, S., Surana, S.: Internet Indirection Infrastructure. Proc. SIGCOMM (2002)

23. Stanford Peers group website. http://www-db.stanford.edu/peers

24. Reputation research network home page. http://databases.si.umich.edu/reputations/

25. Iris: Infrastructure for resilient internet systems. http://iris.lcs.mit.edu/

26. Personal communication with Dan Boneh.

27. Clarke, I., Sandberg, O., Wiley, B., Hong, T.W.: Freenet: A distributed anonymous information storage and retrieval system. Workshop on Design Issues in Anonymity and Unobservability, pages 46–66 (2000)

28. Cooper, B., Garcia-Molina, H.: Peer to peer data trading to preserve information. ACM Transactions on Information Systems (2002)

29. Dingledine, R., Freedman, M.J., Molnar, D.: The free haven project: Distributed anonymous storage service. Workshop on Design Issues in Anonymity and Unobservability, pages 67–95 (2000)

30. Freedman, M.J., Morris, R.: Tarzan: A peer-to-peer anonymizing network layer. Proc. 9th ACM Conference on Computer and Communications Security, Washington, D.C. (2002)

31. Garber, L.: Denial-of-service attacks rip the internet. Computer, pages 12–17 (April 2000)

32. Hill, R., Hwang, A., Molnar, D.: Approaches to mixnets.

33. Lethin, R.: Chapter 17: Reputation. Peer-to-Peer: Harnessing the Power of Disruptive Technologies, ed. Andy Oram, O’Reilly and Associates (2001)

34. Lynch, N.A.: Distributed algorithms. Morgan Kaufmann (1996)

35. Maniatis, P., Baker, M.: Secure History Preservation Through Timeline Entanglement. Proc. 11th USENIX Security Symposium, SF, CA, USA (2002)

36. Ganesh, A., Rowstron, A., Castro, M., Druschel, P., Wallach, D.: Security for structured peer-to-peer overlay networks. Proc. 5th OSDI, Boston, MA (2002)

37. Peinado, M., Biddle, P., England, P., Willman, B.: The darknet and the future of content distribution. http://crypto.stanford.edu/DRM2002/darknet5.doc

38. Daswani, N., Garcia-Molina, H.: Query-flood DoS Attacks in Gnutella. Proc. Ninth ACM Conference on Computer and Communications Security, Washington, DC (2002)

39. Molnar, D., Dingledine, R., Freedman, M.: Chapter 12: Free haven. Peer-to-Peer: Harnessing the Power of Disruptive Technologies, ed. Andy Oram, O’Reilly and Associates (2001)

40. Reiter, M.K., Rubin, A.D.: Crowds: anonymity for Web transactions. ACM Transactions on Information and System Security, 1(1):66–92 (1998)

41. Resnick, P., Zeckhauser, R., Friedman, E., Kuwabara, K.: Reputation systems. Communications of the ACM, pages 45–48 (2000)

42. Shamir, A.: How to share a secret. Communications of the ACM, 22:612–613 (1979)

43. Sit, E., Morris, R.: Security considerations for peer-to-peer distributed hash tables. IPTPS ’02, http://www.cs.rice.edu/Conferences/IPTPS02/173.pdf (2002)

44. Wright, M., Adler, M., Levine, B., Shields, C.: An analysis of the degradation of anonymous protocols. Technical Report, Univ. of Massachusetts, Amherst (2001)
