Exploiting locality for scalable information retrieval in peer-to-peer networks

D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios Gunopulos

Manos Moschous

November, 2004

Issues in p2p networks

Content-based / file-identifier-based information retrieval.

Dynamic networks (ad-hoc).

Scalability (global knowledge).

Query messages (flooding – network congestion).

Recall rate.

Efficiency (recall rate / query messages).

Query Response Time (QRT).

IR in pure p2p networks

BFS technique:
• Each peer forwards the query to all its neighbors.
• Simple.
• Performance.
• Network utilization.
• Use of TTL.

RBFS technique:
• Each peer forwards the query to a random subset of its neighbors.
• Reduces query messages.
• Probabilistic algorithm.
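
The two forwarding rules above can be sketched in a few lines of Python. This is only an illustration, not the presenters' implementation: the names `forward_targets` and `flood`, the `fraction` parameter, and the adjacency-dict graph representation are all assumptions introduced here.

```python
import random

def forward_targets(neighbors, sender, fraction=1.0, rng=random):
    """Pick the neighbors a peer forwards a query to.

    fraction=1.0 behaves like BFS (all neighbors); fraction<1.0 behaves
    like RBFS (a random subset). The sender is skipped so that the query
    is not echoed straight back. `fraction` is a hypothetical knob.
    """
    candidates = [n for n in neighbors if n != sender]
    if fraction >= 1.0:
        return candidates
    k = max(1, round(fraction * len(candidates))) if candidates else 0
    return rng.sample(candidates, k)

def flood(graph, source, ttl, fraction=1.0, seed=0):
    """Propagate one query hop by hop; return (reached peers, messages sent)."""
    rng = random.Random(seed)
    seen = {source}
    frontier = [(source, None)]
    messages = 0
    while frontier and ttl > 0:
        next_frontier = []
        for peer, sender in frontier:
            for target in forward_targets(graph[peer], sender, fraction, rng):
                messages += 1                 # every forward costs one message
                if target not in seen:        # duplicates are dropped on arrival
                    seen.add(target)
                    next_frontier.append((target, peer))
        frontier = next_frontier
        ttl -= 1                              # TTL bounds the search depth
    return seen, messages
```

Calling `flood(graph, source, ttl)` gives the BFS behaviour, while a smaller `fraction` trades reached peers for fewer messages.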

IR in pure p2p networks

>RES technique:
• Each peer forwards the query to some of its peers based on some aggregated statistics.
• Heuristic: The Most Results in Past (for the last 10 queries).

Explores:
• The larger network segments.
• The most stable neighbors.
• Not the nodes which contain content related to the query.

>RES is a quantitative rather than qualitative approach.
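
A minimal sketch of the "Most Results in Past" heuristic follows. The class name `MostResultsInPast`, its methods, and the choice to keep a separate 10-entry window per neighbor are assumptions made for illustration; the slide only states that the last 10 queries are considered.

```python
from collections import defaultdict, deque

class MostResultsInPast:
    """Rank neighbors by how many results they returned recently."""

    def __init__(self, window=10):
        # neighbor -> result counts for its most recent `window` answers
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, neighbor, num_results):
        """Remember how many results a neighbor returned for one query."""
        self.history[neighbor].append(num_results)

    def score(self, neighbor):
        # .get avoids creating empty entries for neighbors never seen
        return sum(self.history.get(neighbor, ()))

    def select(self, neighbors, k):
        """Return the k neighbors with the highest recent result counts."""
        return sorted(neighbors, key=self.score, reverse=True)[:k]
```

Note that the score depends only on result counts, never on the query text, which is why the approach is quantitative rather than qualitative.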

The intelligent search mechanism (ISM)

Main idea:
• Each peer estimates, for each query, which of its peers are more likely to reply to this query, and propagates the query message to those peers only.
• Exploit the locality of past queries.

Some characteristics:
• Entirely distributed (requires only local knowledge).
• Scales well with the size of the network.
• Scales well to large data sets.
• Works well in dynamic environments.
• High recall rates.
• Minimizes the communication costs.

Architecture (ISM) (1/4)

Profiling structure:
• Single queries table.
• LRU policy to keep the most recent queries.
• Table size is limited, for good performance.
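
One way to picture such a profiling structure is a bounded table with least-recently-used eviction. The sketch below is an assumption-laden illustration: the class `QueriesTable`, its fields, and the choice to key the table by the query string are hypothetical and not taken from the slides.

```python
from collections import OrderedDict

class QueriesTable:
    """Bounded table of past queries with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # query -> {peer: number of results}

    def record(self, query, peer, num_results):
        """Store that `peer` returned `num_results` results for `query`."""
        hits = self.entries.setdefault(query, {})
        hits[peer] = num_results
        self.entries.move_to_end(query)        # mark as most recently used
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used

    def queries(self):
        """Return stored queries, oldest first."""
        return list(self.entries)
```

Keeping the table small bounds both memory and the cost of scanning it on every routing decision.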

Architecture (ISM) (2/4)

Query similarity function (cosine similarity)

Assumption: A peer that has a document relevant to a given query is also likely to have other documents that are relevant to other similar queries.

Qsim : Q² → [0,1]

L: the set of all words that have appeared in queries, {1,1,1,1}

q: {1,1,0,0}

qi: {1,0,1,0}
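
As a worked example, the cosine similarity of the two vectors above can be computed directly. The function name `qsim` mirrors the slide's Qsim, but the code itself is only a sketch of the standard cosine formula, not the presenters' implementation.

```python
import math

def qsim(q, qi):
    """Cosine similarity of two query vectors over the word set L."""
    dot = sum(a * b for a, b in zip(q, qi))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in qi))
    return dot / norm if norm else 0.0   # an all-zero vector matches nothing

q = [1, 1, 0, 0]
qi = [1, 0, 1, 0]
similarity = qsim(q, qi)   # one shared word, two words each: about 0.5
```

The two queries share one word out of two words each, so the dot product is 1 and both norms are √2, which gives a similarity of about 0.5.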

Architecture (ISM) (3/4)

Peer ranking (Relevance Rank)

Pi: each peer.

Pl: the decision-maker node.

a: allows us to add more weight to the most similar queries.

S(Pi, qj): the number of results returned by Pi for query qj.
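
The ranking formula itself did not survive in this transcript. Based on the symbols defined above, the sketch below assumes the form RR(Pi, q) = Σj Qsim(qj, q)^a · S(Pi, qj), summed over the past queries qj answered by Pi. Treat both that form and the names `cosine` and `relevance_rank` as assumptions.

```python
import math

def cosine(q, qi):
    """Cosine similarity of two query vectors."""
    dot = sum(x * y for x, y in zip(q, qi))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in qi))
    return dot / norm if norm else 0.0

def relevance_rank(query, answered, a=1.0):
    """Rank one peer Pi for `query`, as seen by the decision-maker Pl.

    answered: list of (qj, S) pairs, where qj is a past query vector that
    Pi answered and S is the number of results Pi returned for it.
    a: exponent that adds weight to the most similar past queries.
    """
    return sum(cosine(query, qj) ** a * s for qj, s in answered)
```

With a larger `a`, past queries that are only loosely similar contribute less, so the rank is dominated by near-identical queries.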

Architecture (ISM) (4/4)

Search mechanism:
• Invoke the RR function.
• Forward the query to k (threshold) peers only.
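
The selection step can be sketched as a top-k choice over the computed ranks. The function name `select_peers` and the assumption that peers are taken in descending rank order are illustrative; the slide only says that k peers receive the query.

```python
import heapq

def select_peers(ranks, k):
    """Return the k peers with the highest relevance rank.

    ranks: dict mapping peer id -> relevance rank for the current query.
    """
    return heapq.nlargest(k, ranks, key=ranks.get)
```

For example, with ranks `{"A": 4.0, "B": 0.5, "C": 2.0}` and `k=2`, the query would go to `A` and `C` only.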

Experiments

Peerware: a distributed middleware infrastructure.
• GraphGen: generates network topologies.
• dataPeer: p2p client which answers boolean queries from its local XML repository (XQL).
• SearchPeer: p2p client that performs queries and harvests answers back from a Peerware network (it connects to a dataPeer and performs queries).

Experiments - DMP

If node Pk, having already received query q with TTL1, receives the same query q with some TTL2, where TTL2 > TTL1, we allow the TTL2 message to proceed.

This may allow q to reach more peers than its predecessor.

Without this fix the BFS behaviour is not predictable, and BFS is therefore not able to find the nodes that we were supposed to find.

Our experiments revealed that almost 30% of the forwarded queries were discarded because of DMP.

The experimental results presented in this work do not suffer from DMP.

This is the reason why the number of messages is slightly higher (~30%) than the expected number of messages.

For n nodes, each with degree d_i, the total number of messages should be [expression lost in extraction].
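
The fix described above (let a repeated query proceed when it arrives with a larger TTL) can be sketched as a small per-node filter. The class name `QueryFilter` and its method are hypothetical, introduced only for this illustration.

```python
class QueryFilter:
    """Decide whether an incoming copy of a query should be forwarded."""

    def __init__(self):
        self.best_ttl = {}   # query id -> highest TTL seen so far

    def should_forward(self, query_id, ttl):
        """Accept a first copy, or a repeat that carries a larger TTL."""
        previous = self.best_ttl.get(query_id)
        if previous is None or ttl > previous:
            self.best_ttl[query_id] = ttl
            return True
        return False         # a copy with equal or smaller TTL is discarded
```

A copy with a larger remaining TTL can still travel further than the one already forwarded, which is why it is allowed through and why the message count rises.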

Experiments-DMP

Query examples:
• A set of 4 keywords.
• 1 keyword >= 4 characters.

#   Query
1   AUSTRIA INTERVENE DOES DOLLAR
2   APPROVES MEDITERRANEAN FINANCIAL PACKAGES
3   AGREES PEACE NEW MOVES

Random Topology:
• Each vertex selects its d neighbors randomly.
• Simple.
• Leads to connected topologies if the degree d > log₂ n.
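
A random topology of this kind can be generated and checked for connectivity with a short sketch. The function names, the use of undirected links, and the example sizes are assumptions; this is not the GraphGen tool.

```python
import math
import random

def random_topology(n, d, seed=0):
    """Each vertex picks d distinct random neighbors; links are undirected."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        others = [u for u in range(n) if u != v]
        for u in rng.sample(others, d):
            adj[v].add(u)
            adj[u].add(v)    # assumption: connections are bidirectional
    return adj

def is_connected(adj):
    """Depth-first search from one vertex; connected if it reaches all."""
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for u in adj[stack.pop()]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return len(seen) == len(adj)

n, d = 104, 8                          # example sizes, chosen for illustration
adj = random_topology(n, d)
degree_ok = d > math.log2(n)           # the d > log2(n) condition
connected = is_connected(adj)
```

Because every vertex adds d links of its own and also receives links from others, the resulting degrees are at least d rather than exactly d.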

Experiments (Set1)

Reuters-21578 Peerware:
• Random topology of 104 nodes (static) with average degree 8 (running on a network of 75 workstations).
• Documents are categorized by their country attribute (104 country files, one for each node).
• Each country file has at least 5 articles.

Data Sets:
• Reuters 10X10: a set of 10 random queries which are repeated 10 consecutive times (high locality of similar queries); suits ISM better.
• Reuters 400: a set of 400 random queries which are uniformly sampled from the initial 104 country files (lower repetition).

Results (Set1) – Reuters 10X10 (1/4)

Reducing query messages:
• ISM finds the most documents compared to RBFS and >RES.
• ISM achieves a recall rate of almost 90% while using only 38% of BFS’s messages.
• ISM and >RES start out with a low recall rate.
• Suffer from low recall rate.

Results (Set1) – Reuters 10X10 (2/4)

Digging deeper by increasing TTL:
• Reach more nodes deeper.
• ISM achieves 100% recall rate while using only 57% of BFS’s messages with TTL=4.

Results (Set1) – Reuters 10X10 (3/4)

Reducing query response time (QRT):
• ~30-60% of BFS’s QRT for TTL=4 and ~60-80% for TTL=5.
• ISM requires more time than >RES because its decision involves some computation over the past queries.

Results (Set1) – Reuters 400 (4/4)

Improving the recall rate over time:
• ISM achieves 95% recall rate while using 38% of BFS’s messages.
• During queries 150-200 major outbreaks occur in BFS.
• ISM requires a learning period of about 100 queries before it starts competing with the performance of >RES.

Experiments (Set2)

TREC-LATimes Peerware (random topology of 1000 nodes, static):
• It contains approximately 132,000 articles.
• These articles were horizontally partitioned into 1000 documents (each document contains 132 articles).
• Each peer shares one or more of the 1000 documents (replicated articles).

Experiments (Set2)

Data Sets:
• TREC 100: a set of 100 queries out of the initial 150 topics.
• TREC 10X10: a list of 10 randomly sampled queries, out of the initial 150 topics, which are repeated 10 consecutive times.
• TREC 50X2: for which we first generated a set a = “50 randomly sampled queries out of the initial 150 topics”, merged with a generated list of another 50 queries which are randomly sampled out of a.
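
The three query sets can be sketched as below. Several details are assumptions not stated in the slides: that TREC 10X10 repeats the whole 10-query list ten times, that the second half of TREC 50X2 is sampled from a with replacement, and that "merged" means simple concatenation. The function name `build_query_sets` is hypothetical.

```python
import random

def build_query_sets(topics, seed=0):
    """Build illustrative TREC 100, TREC 10X10 and TREC 50X2 query lists."""
    rng = random.Random(seed)

    trec_100 = rng.sample(topics, 100)          # 100 distinct topics

    base = rng.sample(topics, 10)
    trec_10x10 = base * 10                      # assumption: list repeated 10 times

    a = rng.sample(topics, 50)
    resampled = [rng.choice(a) for _ in range(50)]  # assumption: with replacement
    trec_50x2 = a + resampled                   # assumption: merge = concatenate

    return trec_100, trec_10x10, trec_50x2

topics = [f"topic-{i}" for i in range(150)]     # placeholder topic names
trec_100, trec_10x10, trec_50x2 = build_query_sets(topics)
```

Under these assumptions TREC 10X10 has the most repetition, TREC 100 has none, and TREC 50X2 sits in between.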

Results (Set2) – TREC100 (1/3)

Searching in a large-scale network topology:
• For TTL=5 we reach 859 of 1000 nodes (BFS).
• For TTL=6 we reach 998 of 1000 nodes at a cost of 8500 m/q.
• For TTL=7 we reach all nodes at a cost of 10,500 m/q.
• ISM will not exhibit any learning behavior if the frequency of terms is very low.

Results (Set2) – TREC 10X10 (2/3)

The effect of high term frequency:
• The recall rate will improve dramatically if the frequency of terms is high.
• ISM achieves a higher recall rate than BFS (BFS’s TTL=5).
• After the learning phase of 20-30 queries it scores 120% of BFS’s recall rate while using 4 times fewer messages.

Results (Set2) – TREC 50X2 (3/3)

The effect of high term frequency:
• More realistic set: a few terms occur many times in queries and most terms occur less frequently.
• ISM monotonically improves its recall rate and at the 90th query it again exceeds BFS performance.
• >RES’s recall rate fluctuates and behaves as badly as RBFS if the queries don’t follow any constant pattern.

Experiments (Set3)

Searching in dynamic network topologies

Why network failures?
• Misuse at the application layer (shutting down the PC without disconnecting).
• Overwhelming amount of generated network traffic.
• Some poorly written p2p clients.

Simulating a dynamic environment:
• The total number of suspended nodes is no more than drop_rate.
• drop_rate is evaluated every k seconds against a random number r.
• If r < drop_rate, the node breaks all incoming and outgoing connections (for l seconds).

In our experiments:
• k=60,000 ms and l=60,000 ms.
• TREC-LATimes Peerware with the TREC 10X10 query set.
• drop_rate takes values in {0.0, 0.05, 0.1, 0.2}.
• r is a random number which is uniformly generated in [0.0 .. 1.0).
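
The suspension rule can be sketched as a simple discrete-time simulation. The function name `simulate_drops`, the `duration_ms` parameter, and the per-node independent draws are assumptions; the sketch also does not enforce the stated cap on the total number of suspended nodes.

```python
import random

def simulate_drops(n_nodes, drop_rate, k_ms=60_000, l_ms=60_000,
                   duration_ms=600_000, seed=0):
    """Return, for each evaluation tick, the set of suspended nodes."""
    rng = random.Random(seed)
    resume_at = {}                        # node -> time its links come back
    timeline = []
    for now in range(0, duration_ms, k_ms):
        for node in range(n_nodes):
            if resume_at.get(node, 0) > now:
                continue                  # still suspended, skip the draw
            r = rng.random()              # uniform in [0.0, 1.0)
            if r < drop_rate:
                resume_at[node] = now + l_ms   # break all links for l ms
        timeline.append({n for n, t in resume_at.items() if t > now})
    return timeline
```

With k equal to l, as in the experiments, a node suspended at one tick is re-evaluated at the next, so roughly a `drop_rate` share of nodes is down at any time.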

Results (Set3) (1/3)

BFS mechanism:
• The increase of drop_rate decreases the number of messages.
• BFS does not exhibit any learning behavior at any level of drop_rate.
• BFS is tolerant of small drop_rates (5%) because it is highly redundant.

Results (Set3) (2/3)

>RES mechanism:
• The increase of drop_rate decreases the number of messages.
• >RES does not exhibit any learning behavior at any level of drop_rate.

Results (Set3) (3/3)

ISM mechanism:
• The increase of drop_rate decreases the number of messages.
• Performs quite well at low levels of drop_rate.
• Not expected to be tolerant of large drop_rates (the information gathered by the profiling structure becomes obsolete before it gets the chance to be utilized).

Extend ISM to different environments

The ISM mechanism could easily become the query routing protocol for some hybrid p2p environments (KaZaa, Gnutella).
• Super peers form a backbone of infrastructure (long-time network connectivity).
• Regular peers are unstable and less powerful.

How could it work?
• A regular peer obtains a list of active super peers.
• It connects to one or more super peers and posts queries.
• The super peer utilizes the ISM mechanism and forwards the query to a selective subset of its super peer neighbors.

Thank you

