Query-Driven Indexing for Peer-to-Peer Text Retrieval**

WWW 2007, Banff, Canada

Contact: Gleb Skobeltsyn, gleb.skobeltsyn@epfl.ch


http://lsirpeople.epfl.ch/skobelts

* I. Podnar is currently affiliated with the University of Zagreb, Croatia

** The work presented in this paper was (partly) carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European projects BRICKS (507457) and ALVIS (002068).


G. Skobeltsyn, T. Luu, I. Podnar*, M. Rajman, K. Aberer

Experiments: retrieval quality of the query-driven index when compared to Google

Panel c) x-axis: QFmin / 3 months, from 0 to ∞

Our goal:

Scalable full text web retrieval in a structured P2P network.

Features:
- Low bandwidth during retrieval, as posting lists of bounded size are transmitted,
- The content of the index adapts to the current query popularity distribution,
- Tradeoff between retrieval quality and index size (i.e., indexing cost).

Processing the query abc with a query-driven index

More details in:
• Skobeltsyn et al: “Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval”, in Infoscale’07, Suzhou, China, 2007
• Skobeltsyn et al: “Web Text Retrieval with a P2P Query-Driven Index”, in SIGIR’07, Amsterdam, The Netherlands, 2007
• Alvis project web site: http://globalcomputing.epfl.ch/alvis

Figure (three panels). Y-axis: Quality of answer (%). Legend: 0%, (0%,25%), [25%,50%), [50%,75%), [75%,100%), 100%, Avg. overlap.

a) Log history size (days): overlap achieved for different sizes of the query log measured in number of days with QFmin=1, DFmax=600
b) DFmax (documents): overlap achieved for different values of DFmax with QFmin=1
c) QFmin / 3 months: overlap achieved for different values of QFmin/3 months with DFmax=600

Top-20 overlap measure:
• Use Google to answer a query and compare it to the union of top-DFmax Google results for each of its indexed keys,
• Keys are indexed if contained in more than QFmin queries in the global query history.

Example of resolving a query:

>id=481, q=“what did babe ruth do in the 1920”
    “1920 babe ruth”, qf=0 ----> Ov@100= 100%
    “1920 babe”, qf=0 ---------> Ov@100= 9%
  + “1920 ruth”, qf=1 ---------> Ov@100= 33%
  + “babe ruth”, qf=495 -------> Ov@100= 69%
  - “1920”, qf=716 ------------> Ov@100= 1%
  - “babe”, qf=3196 -----------> Ov@100= 2%
  - “ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%
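The overlap measure can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names `is_indexed` and `overlap_at_k`, the document identifiers, and the result lists are all hypothetical, and the reference ranking stands in for the Google answer to the full query.

```python
from typing import Dict, Iterable, List


def is_indexed(qf: int, qf_min: int) -> bool:
    """A key counts as indexed if it occurs in more than qf_min logged queries."""
    return qf > qf_min


def overlap_at_k(reference: List[str],
                 key_results: Dict[str, List[str]],
                 indexed_keys: Iterable[str],
                 k: int,
                 df_max: int) -> float:
    """Fraction of the reference top-k that also appears in the union of the
    top-df_max results of every indexed key."""
    top_k = reference[:k]
    if not top_k:
        return 0.0
    union = set()
    for key in indexed_keys:
        # Only the first df_max results of each key are available.
        union.update(key_results.get(key, [])[:df_max])
    hits = sum(1 for doc in top_k if doc in union)
    return hits / len(top_k)


# Hypothetical data: reference ranking for the full query, per-key rankings.
reference = ["d1", "d2", "d3", "d4"]
key_results = {
    "babe ruth": ["d1", "d3", "d9"],
    "1920 ruth": ["d4", "d8"],
}
score = overlap_at_k(reference, key_results, ["babe ruth", "1920 ruth"], k=4, df_max=2)
```

With these made-up lists, three of the four reference documents are covered by the two keys, so `score` is 0.75. Lowering `df_max` shrinks the union and therefore the overlap.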

The naïve approach:

• Distributed single term index: maintains global posting lists for each single term in a DHT
• To process a multi-term query abc it intersects the full posting lists of a, b and c.
• Intersections lead to unscalable retrieval traffic

Diagram labels: Querying peer, Pa, Pb, Pc
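The traffic cost of the naïve approach can be sketched as follows. This is a simplified illustration, not the poster's implementation: the function name `naive_intersection` and the posting lists are hypothetical, and the sketch assumes that every full posting list is shipped to the querying peer before the intersection is computed.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def naive_intersection(query_terms: Iterable[str],
                       posting_lists: Dict[str, Set[int]]) -> Tuple[Set[int], int]:
    """Intersect the full posting lists of all query terms.

    Returns the matching document ids and the number of postings that had to
    be transferred (assumption: each full list is sent to the querying peer).
    """
    transferred = 0
    result: Optional[Set[int]] = None
    for term in query_terms:
        postings = posting_lists.get(term, set())
        transferred += len(postings)  # the whole list travels over the network
        result = set(postings) if result is None else result & postings
    return (result or set()), transferred


# Hypothetical single-term index held in the DHT.
posting_lists = {
    "a": {1, 2, 3, 4},
    "b": {2, 3, 5},
    "c": {2, 3, 4, 9, 10},
}
docs, cost = naive_intersection(["a", "b", "c"], posting_lists)
```

Here only two documents match all three terms, yet twelve postings are transferred. The transferred volume grows with the length of the lists, not with the size of the answer, which is the scalability problem the bullets describe.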

Query lattice for the query abc (a, b, c; ab, ac, bc; abc), shown for three cases:
a) if the posting lists for b and c are truncated (only),
b) if the posting list for a is also truncated,
c) if the key bc is also indexed.

Legend:
- probed combination
- skipped combination
- popularity counter
- truncated posting list
- posting list is used to answer the query

Index items:
- no index item for the key
- candidate index item (only stat.)
- active index item (stat.+TPL)
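The probed and skipped combinations can be illustrated with a rough Python sketch. The poster does not spell out the exploration algorithm, so the rule used here is an assumption inferred from the worked query example: combinations are probed from the largest to the smallest, and a combination is skipped when it is a proper subset of a key that was already found. The function name `explore_lattice` is hypothetical, and truncation status and popularity counters are deliberately ignored.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple

Key = FrozenSet[str]


def explore_lattice(query_terms: Iterable[str],
                    index: Set[Key]) -> Tuple[List[Key], List[Key], List[Key]]:
    """Explore the lattice of term combinations top-down.

    Assumed rule (not confirmed by the poster): a combination is skipped if it
    is a proper subset of a key that has already been found in the index.
    Returns (used keys, probed combinations, skipped combinations).
    """
    terms = sorted(set(query_terms))
    used: List[Key] = []
    probed: List[Key] = []
    skipped: List[Key] = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < found for found in used):
                skipped.append(key)  # covered by a larger key already found
                continue
            probed.append(key)
            if key in index:
                used.append(key)
    return used, probed, skipped


# Index items modelled on the poster's example (pairs marked "+", plus single terms).
index = {
    frozenset({"1920", "ruth"}),
    frozenset({"babe", "ruth"}),
    frozenset({"1920"}),
    frozenset({"babe"}),
    frozenset({"ruth"}),
}
used, probed, skipped = explore_lattice(["1920", "babe", "ruth"], index)
```

Under this assumed rule the two indexed pairs are used and the three single terms are skipped, which matches "Keys used: 2" in the example. The actual system may apply further conditions that this sketch leaves out.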
