“Information Retrieval in Peer-to-Peer Systems”

Demetrios Zeinalipour-Yazti

http://www.cs.ucr.edu/~csyiazti/msc.html

M.Sc. Thesis Defense

Monday, May 5, 2003, Surge 349, 12:00-1:00 PM

Thesis Committee: Dr. Dimitrios Gunopulos (Chairperson), Dr. Vana Kalogeraki, Dr. Chinya V. Ravishankar

Dept. of Computer Science & Engineering, University of California - Riverside

Presentation Outline

• Introduction & Motivation.

• Search Techniques for P2P systems

• The Intelligent Search Mechanism

• PeerWare Simulation Infrastructure

• Experimental Evaluation.

• Conclusions & Future Work.

Introduction to Peer-to-Peer

• Peer-to-Peer computing definition: “Sharing of computer resources and information through direct exchange”
• Clients (downloaders) are also servers.
• Clients may join or leave the network at any time => highly fault-tolerant, but at a cost!
• Searches are done within the virtual network, while the actual downloads are done offline (with HTTP).

Figure: The physical topology and the virtual P2P topology.

Introduction to Peer-to-Peer

• Peer-to-Peer (P2P) systems are becoming increasingly popular.
• P2P file-sharing systems such as Gnutella, Napster and Freenet realized a distributed infrastructure for sharing files.
• Traditionally, files were shared using the client-server model (e.g. HTTP), which does not scale because the service is centralized.
• P2P systems offer new advantages in simplicity of use, robustness, self-organization and scalability.

Information Retrieval in P2P

Problem: “How to efficiently retrieve information in P2P systems where each node shares a collection of documents?”

• Documents consist of keywords.
• The problem resembles Information Retrieval, but the resources are now distributed.
• Primary data structures such as global inverted indexes can’t be maintained efficiently.

Solutions for P2P Information Retrieval

1) Centralized Approaches
• Centralized indexes.
• e.g. Napster, SETI@HOME
• Steps: 1) Upload Index, 2) Query/QueryHit, 3) Download (offline)

2) Purely Distributed Approaches
• Each node has only local knowledge.
• I.R. is done using brute-force mechanisms.
• e.g. Gnutella, Fasttrack (Kazaa)
• Steps: 1) Connect, 2) Query/QueryHit, 3) Download (offline)

3) Hybrid Approaches
• One or more peers have partial indexes of the contents of others.
• e.g. Limewire's Ultrapeers
• Steps: 1) Connect, 2) Intelligent Query/QueryHit, 3) Download (offline)

Motivation

• On 1st June we crawled the Gnutella P2P network for 5 hours with 17 workstations.
• We analyzed 15,153,524 query messages.
• Observation: High locality of specific queries.
• Can we exploit this property for more efficient searches?

Search Techniques for P2P systems

1. Breadth-First Search (Gnutella)

• Idea: Each Query Message is propagated along all outgoing links of a peer, using a TTL (time-to-live).
• The TTL is decremented on each forward until it becomes 0.
• This is the technique used for I.R. in P2P systems such as Gnutella.
• Highlights
  – The physical network is brought to its knees.
  – Long delays for search results.

Figure: Peer q floods a QUERY (1) through P2P network N; Peer d returns a QUERYHIT (2).
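The flooding described on this slide can be sketched in a few lines. This is a minimal illustration, not Gnutella's actual implementation: the adjacency-list `graph`, the function name `bfs_flood` and the rule of not echoing a query back to its sender are assumptions made for the example.

```python
from collections import deque

def bfs_flood(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent).

    `graph` maps each peer to a list of its neighbors (hypothetical layout).
    """
    seen = {source}                      # peers that already handled this query
    frontier = deque([(source, None, ttl)])
    messages = 0
    while frontier:
        peer, sender, remaining = frontier.popleft()
        if remaining == 0:
            continue                     # TTL expired: do not forward further
        for neighbor in graph[peer]:
            if neighbor == sender:
                continue                 # do not send the query straight back
            messages += 1                # one Query message per outgoing link
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, peer, remaining - 1))
    return seen, messages

# A chain a - b - c - d: with TTL=2 the query stops at c.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
reached, sent = bfs_flood(chain, "a", ttl=2)
```

On denser graphs the message count grows much faster than the number of newly reached peers, which is the cost the Highlights refer to.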

Search Techniques for P2P systems

2. Modified Random BFS [V. Kalogeraki, D. Gunopulos, D. Zeinalipour-Yazti. CIKM2002]

• Idea: Each Query Message is forwarded to only a fraction of the outgoing links (e.g. ½ of them).
• The TTL is again decremented on each forward until it becomes 0.
• Highlights
  – Fewer messages, but possibly fewer results.
  – This algorithm is probabilistic.
  – Some segments may become unreachable.

Figure: Peer q sends a QUERY (1) into P2P network N and Peer d returns a QUERYHIT (2); one network segment is marked unreachable.
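A sketch of the random-subset forwarding described above, under stated assumptions: the `fraction` parameter, the rounding rule (round up) and the helper name `random_bfs` are illustrative choices, not details given on the slide.

```python
import math
import random
from collections import deque

def random_bfs(graph, source, ttl, fraction=0.5, rng=None):
    """Forward each query to a random fraction of a peer's links.

    Returns (peers reached, messages sent). `graph` maps peer -> neighbors.
    """
    rng = rng or random.Random()
    seen = {source}
    frontier = deque([(source, None, ttl)])
    messages = 0
    while frontier:
        peer, sender, remaining = frontier.popleft()
        if remaining == 0:
            continue
        candidates = [n for n in graph[peer] if n != sender]
        # Assumed rounding: forward to ceil(fraction * links) neighbors.
        how_many = math.ceil(fraction * len(candidates))
        for neighbor in rng.sample(candidates, how_many):
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, peer, remaining - 1))
    return seen, messages

# A hub with four leaves: only two of the leaves receive the query.
star = {"h": ["a", "b", "c", "d"], "a": ["h"], "b": ["h"], "c": ["h"], "d": ["h"]}
reached, sent = random_bfs(star, "h", ttl=1, fraction=0.5, rng=random.Random(0))
```

Which leaves are skipped depends on the random choice, which is why a segment holding the answer can end up unreachable for a given query.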

Search Techniques for P2P systems

3. Searching Using Random Walkers [Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. ICS2002]

• Idea: Each Query Message is forwarded to 1 neighbor.
• With k walkers after T steps we reach the same nodes as 1 walker after kT steps. (They use 16-64 walkers.)
• Highlights
  – Network traffic is reduced (compared to BFS) by 2 orders of magnitude.
  – Increases the user-perceived delay (from 2-6 hops to 4-15 hops).
  – This algorithm is probabilistic, and the likelihood of locating the objects depends on the network topology.

Figure: a 2-walker search in which Peer q sends a QUERY (1) that walks toward Peer d.
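The k-walker idea can be illustrated as follows. This is a simplified sketch: the `has_object` predicate, the step limit and the stop-at-first-hit rule are assumptions for the example, and the termination checks used in the cited paper are not modeled.

```python
import random

def k_walkers(graph, source, has_object, k=16, max_steps=15, rng=None):
    """Run k random walkers from `source`.

    Returns (peer holding the object or None, steps taken, messages sent).
    """
    rng = rng or random.Random()
    walkers = [source] * k
    messages = 0
    for step in range(1, max_steps + 1):
        next_positions = []
        for position in walkers:
            nxt = rng.choice(graph[position])   # each walker picks 1 neighbor
            messages += 1
            if has_object(nxt):
                return nxt, step, messages
            next_positions.append(nxt)
        walkers = next_positions
    return None, max_steps, messages

pair = {"a": ["b"], "b": ["a"]}
found, steps, sent = k_walkers(pair, "a", lambda peer: peer == "b", k=2)
```

Each step costs at most k messages, so traffic grows linearly with the number of steps instead of exponentially with the TTL as in flooding.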

Search Techniques for P2P systems

4. Using Randomized Gossiping to Replicate Global State [F.M. Cuenca-Acuna, Thu D. Nguyen. HPDC-12]

• Idea: PlanetP uses Bloom Filters to propagate summary indexes of the contents of a peer.
• Bloom Filters are used for membership queries.
• Highlights
  – Not scalable (the technique works well for <10000 nodes).
  – No data replication required.
  – False positives are a function of m, n, k and can be kept small.

Figure: An 8-bit Bloom filter with 4 hash functions. For a document set D = {d1, d2, ..., dn}, the hashes h1(d1), h2(d1), h3(d1), h4(d1) each set one bit to 1.
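A minimal Bloom filter sketch, to make the membership test and the dependence of false positives on m, n and k concrete. The salted SHA-256 hashing is an assumption chosen for the example, not PlanetP's hashing scheme; the rate formula is the standard approximation (1 - e^(-kn/m))^k.

```python
import hashlib
import math

class BloomFilter:
    """m-bit filter with k hash functions (derived from salted SHA-256)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

def false_positive_rate(m, n, k):
    """Approximate false-positive probability after inserting n items."""
    return (1 - math.exp(-k * n / m)) ** k

summary = BloomFilter(m=64, k=4)
summary.add("italy")
```

A peer that receives `summary` can test `"italy" in summary` locally, without contacting the peer that owns the documents.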

Search Techniques for P2P systems

5. Searching using Local Indices [Arturo Crespo and Hector Garcia-Molina, ICDCS 2002]

• Idea: Create indices which contain “statistics” that reveal the “direction” towards the documents.
• Types of proposed indices
  – Compound Routing Index (CRI): metric = number of documents.
  – Hop-Count Routing Index (HRI): maintains a CRI for k hops.
  – Exponentially Aggregated Index (ERI): applies a cost formula to the HRI to shrink its size.
• Highlights
  – Not scalable; routing updates are expensive, but cheaper than replicating data indexes.
  – Assumes a static environment, but no data replication is required.

Search Techniques for P2P systems

6. Directed BFS and the >RES Heuristic 1/2 [Beverly Yang and Hector Garcia-Molina, ICDCS 2002]

• Proposed techniques:
  – Directed BFS based on aggregate statistics (e.g. number of results a peer returned, shortest queue, forwarded the most data).
  – Iterative Deepening, until Z results are returned.
  – Local Indexes: each node maintains the actual index over the data of peers r hops away.
• Their experiments deploy the Directed BFS techniques by attaching nodes to the Gnutella network.
• The >RES Heuristic is shown to work well.

Search Techniques for P2P systems

6. Directed BFS and the >RES Heuristic 2/2

• The >RES Heuristic is optimized to find Z documents efficiently for some user-defined Z.
• >RES works well because:
  – It captures stable/large network segments.
  – It reaches potentially less overloaded peers.
• >RES is a quantitative approach.
• Drawback: >RES doesn’t route queries to the most relevant content.

Figure: peer q routing a QUERY with the >RES heuristic (neighbors labeled RES=1000 and RES=10) and the returning QUERYHIT.
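The selection step of >RES reduces to sorting neighbors by how many results they returned in the past. The sketch below assumes a per-neighbor result counter (`results_returned`) and a fixed `fanout`; both names are hypothetical, and the window over which results are counted is not modeled.

```python
def select_by_res(results_returned, fanout):
    """Pick the `fanout` neighbors that returned the most results so far.

    `results_returned` maps neighbor -> number of results it has returned.
    """
    ranked = sorted(results_returned, key=results_returned.get, reverse=True)
    return ranked[:fanout]

# Neighbor A returned 1000 results, C 250 and B only 10.
next_hops = select_by_res({"A": 1000, "B": 10, "C": 250}, fanout=2)
```

The ranking ignores what the query is about, which is the drawback noted on the slide: a neighbor with many past results is chosen even if none of them relate to the current query.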

Search Techniques for P2P systems

7. Depth-First-Search and Freenet [I. Clarke, O. Sandberg, B. Wiley, and T.W. Hong, LNCS 2009]

• Idea: Objects are hashed, and the hash of a query is routed based on “key closeness” in a DFS manner.
• Highlights
  – Uses caching of key/object pairs for future requests.
  – Data replication along the QueryHit path provides availability.
  – Anonymity of searcher and publisher.
  – Drawbacks: i) Searches are based ONLY on the object identifier. ii) The user-perceived delay is high.

Figure: Peer q issues QUERY (1) “Search: A”; the result is returned from the peer holding the original file A, and a replicated file A is left along the return path.

Search Techniques for P2P systems

8. Consistent Hashing and Chord [Ion Stoica et al. SIGCOMM 2001]

• Idea: Objects and nodes are hashed to m-bit identifiers and organized in a virtual ring. Object lookup is achieved in O(log N).
• Highlights
  – Consistent Hashing achieves: (i) good load balancing of keys, (ii) little object/key movement when a node joins or leaves.
  – Drawbacks: i) Searches are based ONLY on the object identifier. ii) Data movement may be a big overhead.

Figure: a Chord ring.
  – Nodes: h(A)=2, h(B)=3, h(C)=5, h(D)=7
  – Objects: h(file1)=7
  – Finger table of C: successor [5,7)=7, successor [7,5)=0
  – Finger table of A: successor [2,3)=3, successor [3,5)=5, successor [5,2)=0
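The key-to-node mapping behind consistent hashing can be sketched with the node identifiers listed in the figure. Note the assumptions: the sketch looks the successor up in a globally sorted list, so it shows only which node owns a key, not Chord's finger-table routing that yields the O(log N) lookup; `m=3` (an 8-position ring) is inferred from the identifiers shown.

```python
import bisect

def successor(node_ids, key, m=3):
    """Return the first node id at or after `key` on a 2**m-position ring."""
    ring = sorted(node_ids)
    key %= 2 ** m
    index = bisect.bisect_left(ring, key)
    return ring[index % len(ring)]       # wrap around past the largest id

nodes = [2, 3, 5, 7]                     # h(A), h(B), h(C), h(D) from the figure
owner_of_file1 = successor(nodes, 7)     # h(file1)=7 is stored on node 7 (D)
```

When a node joins or leaves, only the keys between it and its predecessor change owner, which is the "little object/key movement" property listed under Highlights.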

Intelligent Search Mechanism (ISM)

Introduction

• Idea: Each Query Message is forwarded intelligently, based on which queries a peer answered in the past.
• Components of ISM (for each node u)
  a) Profile Mechanism, kept for each neighbor in N(u).
  b) Peer Ranking Mechanism, for ranking peers locally and sending a search query only to the ones most likely to answer.
  c) Similarity Function, for finding similar search queries.
  d) Search Mechanism, for propagating queries based on local indexes.

Figure: Peer q consults its profiles to decide where to send the QUERY (1); Peer d returns the QUERYHIT (2).

Intelligent Search Mechanism (ISM)

Components of ISM: a) Profile Mechanism

  – Maintains a list of past queries routed through that host.
  – Every time a QueryHit is received, the table is updated.
  – The profile manager uses a Least Recently Used policy to keep the most recent queries in the repository.
  – Profiles are kept for neighbors only, so the cost of maintaining them is O(Td), where T is the size limit per profile and d is the degree of the node.

Figure: the profile table (size: T*d).
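A per-neighbor profile with Least Recently Used eviction could look like the sketch below. The class name `PeerProfile`, the hit counter and the `capacity` argument (playing the role of the limit T) are assumptions for illustration; the slide does not specify the stored fields.

```python
from collections import OrderedDict

class PeerProfile:
    """Bounded record of the queries one neighbor has answered."""

    def __init__(self, capacity):
        self.capacity = capacity          # the per-profile limit T
        self.queries = OrderedDict()      # query text -> number of QueryHits

    def record_query_hit(self, query):
        # Re-inserting moves the query to the most-recently-used end.
        hits = self.queries.pop(query, 0)
        self.queries[query] = hits + 1
        if len(self.queries) > self.capacity:
            self.queries.popitem(last=False)   # evict the least recently used

profile = PeerProfile(capacity=2)
for q in ["italy disaster", "super bowl", "italy disaster", "san diego"]:
    profile.record_query_hit(q)
```

A node keeps one such profile per neighbor, so with d neighbors the total state stays within T*d entries.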

Intelligent Search Mechanism (ISM)

Components of ISM: b) The RelevanceRank Peer Ranking Metric

  – Before forwarding a Query Message, a peer performs an on-the-fly ranking of its peers to determine the best paths.
  – We use the Aggregate Weighted Similarity of peer Pi to a query q, computed by peer Pl from the similarity of q to the past queries that Pi answered.

Example: Assume host Pl needs to forward a query q=“italy disaster” to two of its peers {P1, P2, P3}. Pl maintains queries {q1, q2, ..., q5} in its profile.

  – Similarities: Sim(q, q1) = 0.8, Sim(q, q2) = 0.6, Sim(q, q3) = 0.5, Sim(q, q4) = 0.4, Sim(q, q5) = 0.4
  – P1 answered q1 => RR(P1, q) = 0.8 x 2 = 1.6
  – P2 answered q2, q3 => RR(P2, q) = (0.6x2 + 0.5x2) = 2.2
  – P3 answered q4, q5 => RR(P3, q) = (0.4x2 + 0.3x2) = 1.4
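The ranking step can be sketched by following the arithmetic of the worked example. The formula itself is not preserved on this slide, so the sketch assumes what the example shows: each similarity is weighted by a factor (2 here) and the results are summed. The weighting used by ISM itself may differ, and the names `relevance_rank` and `top_peers` are hypothetical.

```python
def relevance_rank(similarities, weight=2):
    """Aggregate the similarities of q to the queries one peer answered."""
    return sum(weight * s for s in similarities)

def top_peers(profile_sims, how_many, weight=2):
    """Rank peers by RelevanceRank and return (best peers, all scores)."""
    scores = {peer: relevance_rank(sims, weight)
              for peer, sims in profile_sims.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:how_many], scores

# Similarities taken from the RR lines of the example above.
sims = {"P1": [0.8], "P2": [0.6, 0.5], "P3": [0.4, 0.3]}
chosen, scores = top_peers(sims, how_many=2)
```

With these inputs P2 and P1 rank highest, so the query would be forwarded to those two peers.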

Intelligent Search Mechanism (ISM)

Components of ISM: c) Similarity Function (the cosine similarity)

• Assume that L is the set of all words (in the Profile Manager), e.g. L={elections, bush, clinton, super, bowl, san, diego, … , italy, earthquake, disaster}
• We define an |L|-dimensional space where each query is a vector. If q=“italy disaster” => q (vector of q) = [0,0,0,…,1,0,1]
• Recall that we have a vector qi for each query qi stored in the Profile Manager.
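Cosine similarity between two keyword queries can be computed without materializing the full |L|-dimensional vectors, since only the words that actually occur contribute. The sketch below assumes whitespace tokenization and lowercasing, which the slide does not specify.

```python
import math
from collections import Counter

def cosine_similarity(query_a, query_b):
    """Cosine of the angle between the term vectors of two queries."""
    a = Counter(query_a.lower().split())
    b = Counter(query_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)      # missing terms count as 0
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

similar = cosine_similarity("italy disaster", "italy earthquake")
unrelated = cosine_similarity("italy disaster", "super bowl")
```

Two queries sharing one of their two words score 0.5, identical queries score 1, and queries with no word in common score 0.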

Intelligent Search Mechanism (ISM)

Components of ISM: d) Search Mechanism

• Utilizes the Peer Ranking Mechanism to forward queries to the nodes that potentially contain the information we are looking for.

Figure: Peer q uses its profiles to choose where to forward the QUERY (1) toward Peer d.

Intelligent Search Mechanism (ISM)

Breaking cycles with Random Perturbation

• Suppose that nodes answer the conjunction of the query terms.
• Suppose that query q has no answer from A, B, C or D, and that one of them answered a similar query in the past. Query q then fails to explore the segment through E.
• Random Perturbation adds one additional random message.

Figure: a QUERY cycling among nodes A, B, C and D, and the QUERYHIT obtained once the segment through E is explored.
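Random Perturbation amounts to adding one randomly chosen neighbor to the ranked selection. The sketch below assumes the scores come from the peer ranking step; the function name `choose_next_hops` and the parameter `m` (number of top-ranked neighbors) are illustrative.

```python
import random

def choose_next_hops(scores, m, rng=None):
    """Pick the m best-ranked neighbors plus one random other neighbor."""
    rng = rng or random.Random()
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = ranked[:m]
    rest = ranked[m:]
    if rest:
        chosen.append(rng.choice(rest))   # the extra random message
    return chosen

hops = choose_next_hops({"A": 3.0, "B": 2.0, "C": 1.0, "E": 0.0}, m=2)
```

Without the random pick, a neighbor such as E that never answered a similar query would never be tried, so the segment behind it would stay unexplored.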

PeerWare Simulation Infrastructure

Introduction

• PeerWare is our distributed middleware infrastructure that allows us to benchmark various Query Routing Algorithms.
• It is deployed on a network of 50 workstations.
• It uses public/private keys and SSH to connect to the networked hosts.
• It is implemented in Java and consists of approximately 10000 lines of code.

PeerWare Simulation Infrastructure

Why real middleware and not simulations?

• Many properties, such as network failures and dropped queries, may reveal interesting and unknown patterns.
• In a real middleware we are able to measure the actual time needed to satisfy queries.
• Finally, there are no assumptions (network delays etc.) of the kind that are typical in simulation environments.

The Anthill Project (Univ. of Bologna) uses a similar approach to investigate properties of the Freenet algorithm.

PeerWare Simulation Infrastructure

PeerWare Components

1. dataGen – The Dataset Generator
2. graphGen – The Network Graph Generator
3. dataPeer – The Data Node
4. searchPeer – The Search Node

Other Administrative Components
• netLaucher – Shell script that launches the network.
• netStats – Shell script that provides statistics.
• graphPlot – Shell script that plots graphs based on the generated results.

PeerWare Simulation Infrastructure

1) dataGen Component

• dataGen is the Dataset Generator, which generates documents about specific topics (each peer can have some specialized knowledge).
• It uses the REUTERS News Agency dataset (22,531 documents).
• It groups documents by various properties: {Date, Topics, Places, People, Orgs, Companies}
• In our experiments we use the Places attribute and generate 104 countries.

PeerWare Simulation Infrastructure

2) graphGen Component

• graphGen is the topology generator.
• Currently it generates random topologies given parameters such as {degree, IPs, ports}.
• It uses graphViz to generate visualizations of the generated topologies.

PeerWare Simulation Infrastructure

3) dataPeer Component

• dataPeer is a P2P client that maintains an XML repository of documents.
• It uses the PDOM-XQL engine to query its documents.
• It pre-establishes connections to other peers with persistent TCP connections.

Figure: a Data-Peer (e.g. usa) made up of a PDOM-XML Manager over XML data files, a P2P Network Module connected to the peers germany, france, mexico, argentina and greece, Routing Structures (Profiles), and usa.graph.

PeerWare Simulation Infrastructure

4) searchPeer Component

• searchPeer is a P2P client that connects to a PeerWare network and performs unstructured queries.
• Keywords are sampled from within the dataset.
• It logs statistics such as the query response time, which nodes answered a query, etc.

Figure: a Search-Peer whose P2P Network Module connects to the peers germany, france, mexico, argentina and greece, together with queries.txt and results.txt.

Experimental Evaluation

Introduction

• We create a distributed Newspaper application.
• We use a random network of 104 peers.
  – Each peer has documents for 1 country.
  – The average degree of a node is 7 ≈ log2(100) (connected graph).
• We perform two series of experiments:
  1. 10x10 sequential queries with a delay of 4 sec.
  2. 400 random queries with a delay of 4 sec.
• We compare Doc. Ratio (Recall Rate) vs. Num. of messages for:
  – BFS (Gnutella message flooding): forward to degree nodes.
  – Random BFS: randomly forward to degree/2 nodes.
  – Intelligent Search Mechanism: forward to the M=(degree/2)-1 highest RelevanceRank nodes + 1 random.
  – >RES Heuristic: forward to the degree/2 nodes that answered >RES.

Experimental Evaluation

Reducing Query Messages (10x10 Experiment)

Recall Rate vs. Num. of messages with TTL=4
• BFS uses ~1050 messages w/ recall rate 100%.
• RBFS uses ~220 (20%) msgs w/ recall rate ~50%.
• >RES uses ~400 (38%) msgs w/ recall rate ~70%.
• ISM uses ~400 (38%) msgs w/ recall rate ~90%.
• ISM improves over time since the Peer Profiles gain more knowledge.
• ISM and >RES start out slow since they use RBFS until they populate their routing structures.

Experimental Evaluation

Digging Deeper by Increasing the TTL (10x10)

Recall Rate vs. Num. of messages with TTL=5
• BFS again uses ~1050 messages w/ recall rate 100%.
• RBFS uses ~450 (43%) msgs w/ recall rate ~82%.
• >RES uses ~570 (54%) msgs w/ recall rate ~90%.
• ISM uses ~570 (54%) msgs w/ recall rate ~99%.

Experimental Evaluation

Reducing Query Response Time (QRT) (10x10 Experiment)

• BFS’s QRT is on the order of 6 seconds.
• RBFS, ISM and >RES use:
  – 30-60% of the BFS QRT for TTL=4
  – 60-80% of the BFS QRT for TTL=5
• The unnecessary messages of BFS increase the user-perceived delay.

Figure: The Query Response Time as a percentage of BFS.

Experimental Evaluation

The Discarded Message Problem (DMP)

• A query q is identified by a GUID.
• To avoid cycles, a node never forwards a query it has already forwarded.
• DMP occurs if a node has forwarded q with TTL1 and then receives q again with TTL2, where TTL2>TTL1. The second copy is discarded even though it could have travelled further.
• In our experiments approximately 30% of queries were affected by the DMP.
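The condition that defines the DMP can be made concrete with a small duplicate-detection routine. This is a sketch under assumptions: the table `seen_ttl` (GUID to the TTL of the first copy) and the three return labels are hypothetical, introduced only to show when a discarded duplicate counts as a DMP case.

```python
def handle_query(seen_ttl, guid, ttl):
    """Classify an incoming query as 'forward', 'duplicate' or 'dmp'.

    `seen_ttl` remembers the TTL carried by the first copy of each GUID.
    """
    if guid not in seen_ttl:
        seen_ttl[guid] = ttl
        return "forward"
    if ttl > seen_ttl[guid]:
        # Discarded, although this copy could have travelled further.
        return "dmp"
    return "duplicate"

seen = {}
first = handle_query(seen, "guid-1", ttl=1)    # first copy arrives via a long path
second = handle_query(seen, "guid-1", ttl=3)   # later copy with a larger TTL
```

In both duplicate cases the node drops the message; the `dmp` label only marks the drops that reduce how far the query reaches.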

Experimental Evaluation

Improving Recall Rate over Time (400 Experiment)

• The 10x10 queries experiment suited ISM well.
• In this experiment we perform 400 random queries.
• The overwhelming message volume of BFS creates two major outbreaks.
• ISM improves over time, achieving a 96% Recall Rate while again using 38% of the messages.

Conclusions

• Efficient Information Retrieval in P2P networks is not feasible with the current Search Algorithms.

• We propose an Intelligent Search Mechanism that uses local knowledge to improve Information Retrieval in P2P.

• We implement PeerWare and evaluate the performance of various Search Techniques

• The ISM achieves in some cases 100% recall rate while using only 57% of the BFS messaging.

Figure: map of hosts in London, Melbourne, Seattle, Rochester and Riverside (pc-62-30-117-83-cr.blueyonder.co.uk, sdcax6-097.dialup.optusnet.com.au, p237-165.yahoo.co.jp, roc-24-169-109-208.rochester.rr.com, 12-224-0-236.client.attbi.com, 66-215-0-xx1.oc-nod.charterpipeline.net, 66-215-0-xx2.oc-nod.charterpipeline.net), annotated with distances (1,544 Km, 3,933 Km, 8,747 Km, 8,806 Km, 12,764 Km) and two path measurements (average RTT=9ms over 4 router hops; average RTT=46ms over 13 router hops).

Future Work

• Probe different network topologies, such as ASMap with power laws.
• Deploy larger PeerWares with more queries.
• Probe different Peer-Profile maintenance policies.
• Use stemming/stop words to answer more accurately.
• Compare the performance of our method with newly proposed techniques (random gossiping, random walkers, etc.).
• 60% of Gnutella belongs to 20% of the ISPs. How can we exploit that to provide more efficient query routing schemes?


Thank You!