
Peer-to-Peer Database Networks

Page 1: Peer-to-Peer Database Networks

L3S Research Center, University of Hannover

Peer-to-Peer Database Networks
Part 2

Wolf-Tilo Balke and Wolf Siberski

9.1.2008

Peer-to-Peer Systems and Applications, Springer LNCS 3485

*with slides from J.M. Hellerstein (UC Berkeley), A. Halevy (U Washington), P. Raghavan (Stanford)

Overview

1. Why Peer-to-Peer Databases?
   1. Federation
   2. Information integration
   3. Sensor networks
   4. ‘New’ internet
2. Distributed Databases
3. P2P Databases
   1. Challenges
   2. Design Dimensions
4. Existing P2P Database systems
   1. Edutella: focus on expressivity
   2. PIER: focus on scalability
   3. Piazza: focus on integration
   4. HiSbase: focus on scalability for spatial data

PIER

P2P Relational Database

Foundation: any DHT

Extended hash interface
  put(namespace, key, value)
  get(namespace, key)
  namespace/key combination is used as hash value (DHT key)

Extended network capabilities
  Exploit DHT structure for broadcast (required for joins and aggregate queries)
  Spanning tree

[Figure: DHT ring of numbered peers]

Application: Phi

Phi: Public Health for the Internet
  Monitor IP network state world-wide
  Collect statistics
    Network traffic
    Latency
    Malware alerts

Storing and Indexing Tuples

Storing
  Every tuple needs a synthetic tuple key
  Choose combination of table name and tuple key as DHT key
  Insert complete tuple into DHT using this key

Indexing
  Additional attribute indexes are built by inserting attribute value/tuple key pairs into the DHT
  Choose combination of attribute name and attribute value as DHT key
  Insert tuple key as DHT value

Example

Sample Database
  Doc(Id, Title, Date, Language)
  Author(DocId, PersonId)
  Person(Id, Name, Surname)

Sample tuple: (456, ‘Critique of pure Reason’, 1781, ‘en’)

Storing

put(Doc, 456, (456, ‘Critique...’, 1781, ‘en’))

Indexing on ‘Title’ and ‘Date’ attributes

put(Doc.Title, ‘Critique...’, 456)

put(Doc.Date, ‘1781’, 456)
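The storing and indexing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyDHT` is a hypothetical in-memory stand-in for a real DHT (one dictionary per peer, SHA-1 over "namespace/key" to pick the peer), and `insert_doc` is a made-up helper name; neither is taken from PIER itself.

```python
import hashlib


class ToyDHT:
    """Hypothetical in-memory stand-in for a DHT: one dictionary per peer."""

    def __init__(self, n_peers=8):
        self.peers = [dict() for _ in range(n_peers)]

    def _peer(self, namespace, key):
        # The namespace/key combination is hashed to pick the responsible peer.
        digest = hashlib.sha1(f"{namespace}/{key}".encode()).digest()
        return self.peers[int.from_bytes(digest[:4], "big") % len(self.peers)]

    def put(self, namespace, key, value):
        self._peer(namespace, key).setdefault((namespace, key), []).append(value)

    def get(self, namespace, key):
        return self._peer(namespace, key).get((namespace, key), [])


def insert_doc(dht, doc, indexed=("Title", "Date")):
    """Store a Doc tuple and one index entry per indexed attribute."""
    row = dict(zip(("Id", "Title", "Date", "Lang"), doc))
    dht.put("Doc", row["Id"], doc)                    # complete tuple under table name + tuple key
    for attr in indexed:
        dht.put(f"Doc.{attr}", row[attr], row["Id"])  # attribute value -> tuple key


dht = ToyDHT()
insert_doc(dht, (456, "Critique of pure Reason", 1781, "en"))
```

Each indexed attribute costs one additional `put`, so inserting a tuple with k indexes takes k + 1 DHT operations in this sketch.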

PIER Query Plans

DHT-Scan
  1. Use index to retrieve tuple key(s)
  2. Use key(s) to retrieve data tuple(s)

Example
  SELECT Id, Title FROM Doc WHERE Date = ‘1781’ AND Lang = ‘en’

Each peer can create a query plan

One DHT lookup per result tuple

Filter has to be done on query originator
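A minimal sketch of such a DHT scan for the example query, assuming only a `get(namespace, key)` function. The dictionary `store`, the helper `toy_get` and the second (German) tuple are hypothetical sample data, not part of the slides.

```python
def dht_scan(dht_get, date, lang):
    """SELECT Id, Title FROM Doc WHERE Date = date AND Lang = lang."""
    keys = dht_get("Doc.Date", date)                          # 1. index lookup yields tuple keys
    rows = [t for k in keys for t in dht_get("Doc", k)]       # 2. one DHT lookup per tuple key
    # 3. Lang is not indexed, so the query originator filters the fetched tuples.
    return [(doc_id, title) for doc_id, title, _date, doc_lang in rows if doc_lang == lang]


# Hypothetical content of the DHT, flattened into one dictionary.
store = {
    ("Doc", 456): [(456, "Critique of pure Reason", 1781, "en")],
    ("Doc", 457): [(457, "Kritik der reinen Vernunft", 1781, "de")],
    ("Doc.Date", 1781): [456, 457],
}


def toy_get(namespace, key):
    return store.get((namespace, key), [])


result = dht_scan(toy_get, 1781, "en")
```

Both tuples with Date 1781 are fetched over the network, although only one survives the filter; this is the cost of filtering at the originator.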

Aggregate and Range Queries

Example
  SELECT COUNT(Id) FROM Doc WHERE Date > ‘1780’ AND Date < ‘1790’

Use spanning tree for broadcast

Aggregate on return
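The idea of aggregating on the way back can be sketched as a recursive count over the spanning tree. The tree shape, the `dates` lists and the function names below are assumptions made for illustration; a real system would send messages between peers instead of making recursive calls.

```python
def count_in_range(peer, lo, hi):
    """Count local dates with lo < date < hi and add the counts reported by all children."""
    local = sum(1 for d in peer["dates"] if lo < d < hi)
    return local + sum(count_in_range(child, lo, hi) for child in peer["children"])


def leaf(dates):
    return {"dates": dates, "children": []}


# Hypothetical spanning tree rooted at the query originator.
root = {
    "dates": [1781],
    "children": [
        {"dates": [1785, 1790], "children": [leaf([1783]), leaf([1700])]},
        leaf([1788, 1789]),
    ],
}

total = count_in_range(root, 1780, 1790)
```

Every peer forwards a single number to its parent, so the originator never sees the individual tuples.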

[Figure: partial counts are summed along the spanning tree on the way back to the query originator]

Join Queries

Example
  Assume a Person tuple (789, ‘Kant’, ‘Immanuel’)
  SELECT Id, Title FROM Doc, Author WHERE Author.DocId = Doc.Id AND Author.PersonId = 789

Approach: Hierarchical Joins
  Use spanning tree for broadcast
  Do local select on peer table fragments
  Do local join on each peer
    Improves load balancing
  Forward table fragments and partial results to parent
  Repeat until query originator has all fragments
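A rough sketch of this hierarchical join, written as a recursion over the spanning tree. The peer layout, the helper `make_peer` and the second Doc tuple are hypothetical; the sketch also forwards whole fragments without any pruning, which a real implementation would want to avoid.

```python
def hierarchical_join(node, person_id):
    """Return (docs, authors, results) collected from the subtree rooted at node."""
    docs = list(node["docs"])
    # Local select: keep only Author tuples of the requested person.
    authors = [a for a in node["authors"] if a[1] == person_id]
    results = set()
    for child in node["children"]:
        c_docs, c_authors, c_results = hierarchical_join(child, person_id)
        docs += c_docs
        authors += c_authors
        results |= c_results
    # Local join over own and forwarded fragments.
    wanted = {doc_id for doc_id, _pid in authors}
    results |= {(d[0], d[1]) for d in docs if d[0] in wanted}
    return docs, authors, results


def make_peer(docs=(), authors=(), children=()):
    return {"docs": list(docs), "authors": list(authors), "children": list(children)}


# Doc tuples: (Id, Title, Date, Lang); Author tuples: (DocId, PersonId).
tree = make_peer(
    docs=[(456, "Critique of pure Reason", 1781, "en")],
    children=[
        make_peer(authors=[(456, 789), (999, 123)]),
        make_peer(docs=[(460, "Second sample title", 1788, "en")], authors=[(460, 789)]),
    ],
)

_, _, answer = hierarchical_join(tree, 789)
```

The second child can already join its own fragments locally, while the match for document 456 is only found at the root, where the Doc fragment and the forwarded Author fragment meet.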

Hierarchical Joins

[Figure: hierarchical join along the spanning tree; nodes are labelled with fragments D1–D3 and A1–A3 and with tables T11–T33]

PIER - Discussion

Real query planning

Very efficient access to individual tuples and small result sets

Very good scalability in terms of network size

Degrades to broadcast for many types of queries
  Aggregate queries
  Joins

INSERT operation expensive (see P2P Inform. Retrieval)

No load-balancing mechanisms


Piazza

Tackles problem of “reconciling different models of the world” (A. Halevy)

Goal: provide a uniform interface to a set of autonomous data sources.

New abstraction layer over multiple sources

Introduce mappings between ‘world views’
  Mapping rules are specified manually by experts
  Don’t need to be complete

Example – Publication Databases

[Figure: publication database schemas of several peers, including UCSD]

Mapping Rules

Datalog to specify mapping rules

Mapping from UW to UCSD
  UCSD:Member(projName, member) :-
      UW:Member(_, pid, member, _),
      UW:Project(pid, _, projName)

Mapping from UPenn to UCSD
  UCSD:Member(projName, member) :-
      UPenn:Student(sid, member, _),
      UPenn:ProjMember(pid, sid),
      UPenn:Project(pid, projName, _)
  UCSD:Member(projName, member) :-
      UPenn:Faculty(sid, member, _),
      UPenn:ProjMember(pid, sid),
      UPenn:Project(pid, projName, _)
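To make the semantics of such a rule concrete, the UW to UCSD mapping can be read as a join on pid followed by a projection. The sketch below evaluates it over plain Python tuples; the function name and the sample rows are hypothetical, and the unused columns are filled with None because the slides do not name them.

```python
def ucsd_member_from_uw(uw_member, uw_project):
    """UCSD:Member(projName, member) derived from UW:Member (4 columns) and UW:Project (3 columns)."""
    return {
        (proj_name, member)
        for _a, pid, member, _b in uw_member           # UW:Member(_, pid, member, _)
        for project_pid, _c, proj_name in uw_project   # UW:Project(pid, _, projName)
        if project_pid == pid                          # shared variable pid acts as join condition
    }


# Hypothetical UW data.
uw_member = [("m1", "p1", "Alice", None), ("m2", "p2", "Bob", None), ("m3", "p9", "Carol", None)]
uw_project = [("p1", None, "Piazza"), ("p2", None, "PIER")]

ucsd_view = ucsd_member_from_uw(uw_member, uw_project)
```

Carol does not appear in the UCSD view because project p9 has no UW:Project tuple, which illustrates that a mapping only translates what both relations cover.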

Storing and Indexing

Unstructured network (Gnutella-like)

Peer keeps its database
  No exchange of data between peers

Indexing
  Only on schema level
  Each peer maintains schema catalog of its neighbors

Mappings
  Stored in central catalog (hybrid system)
    could be replaced by DHT
  Replication of mappings to all relevant peers

Query Routing

Query Flooding

Peer translates query to schema of neighbor (if possible)

Result tuples are converted on the way back

Queries answered by traversing semantic paths

[Figure: peers UCSD, UPenn, DBLP, CiteSeer, UW, UC Berkeley and Stanford, connected by mappings such as M(UCSD, UPenn), M(UW, UCSD), M(UW, Stanford) and M(Stanford, DBLP); a query Q is reformulated into Q1–Q4 along these paths]

Piazza - Discussion

Supports multiple schema world (more realistic)

Very expressive mapping mechanism

Not scalable
  Gnutella-like topology and flooding

Piazza mapping technique could be applied to other network infrastructures


HiSbase

Specialized in distributed spatial data

Application: astronomy data
  Huge amounts of data (terabyte scale)
  Region-based queries
  Skewed data distribution

Main ideas
  Distribute data on peers by region
  Use DHT for data access
  Use neighbor-preserving hash function (space-filling curve)


Load Distribution

Use Quad-Tree structure to split data space into equally loaded regions
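A small sketch of the quad-tree idea: a region is split into four quadrants until no region holds more than a given number of points. The function name, the `max_load` and `max_depth` parameters and the sample data are assumptions; the sketch only bounds the load per region and does not reproduce how HiSbase balances regions exactly.

```python
def split(points, box, max_load, depth=0, max_depth=16):
    """Recursively quarter box until every region holds at most max_load points."""
    x0, y0, x1, y1 = box
    if len(points) <= max_load or depth == max_depth:
        return [(box, points)]
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = {
        (False, False): (x0, y0, mx, my),
        (True, False): (mx, y0, x1, my),
        (False, True): (x0, my, mx, y1),
        (True, True): (mx, my, x1, y1),
    }
    buckets = {q: [] for q in quadrants}
    for x, y in points:
        buckets[(x >= mx, y >= my)].append((x, y))
    regions = []
    for q, sub_box in quadrants.items():
        regions += split(buckets[q], sub_box, max_load, depth + 1, max_depth)
    return regions


# Skewed sample: 100 points crowded into one corner of the unit square, two elsewhere.
points = [(i / 1000, (i * 7 % 100) / 1000) for i in range(100)] + [(0.9, 0.9), (0.6, 0.2)]
regions = split(points, (0.0, 0.0, 1.0, 1.0), max_load=10)
```

The crowded corner ends up covered by many small regions and the sparse rest by a few large ones, which is the effect the slide describes for skewed data.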


Data Hashing

Use Z-Linearization for hashing coordinates
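Z-linearization (also known as Z-order or Morton order) maps two coordinates to one number by interleaving their bits, so that points close in space tend to get close keys. A minimal sketch for non-negative integer grid coordinates; the function name and the 16-bit default are assumptions.

```python
def z_value(x, y, bits=16):
    """Interleave the lowest `bits` bits of x and y (x on even, y on odd bit positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# The four cells of a 2 x 2 grid are visited in a "Z" shape.
order = sorted([(0, 0), (1, 0), (0, 1), (1, 1)], key=lambda p: z_value(*p))
```

The resulting value can then serve as the key under which a region or data item is placed in the DHT.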


Insertion into DHT


Query Processing

Point query: simple DHT access

Region query
  Route to arbitrary peer in range (e.g. using upper left region boundary)
    This peer acts as coordinator
  Forward query to peer region neighbors
    Until whole area is covered
  Collect results at coordinator

HiSbase - Discussion

Very efficient for spatial queries

Not completely self-organizing
  Quad-Tree splitting needs central coordination

Only spatial queries possible

Peer-to-Peer Database Networks – Summary (1)

Challenges
  Multi-Dimensional Search Space
  Schema Heterogeneity
  Potentially large result sets

Design Dimensions
  Network Properties (Data Placement, Topology and Routing)
  Data Access (Data Model, Query Language)
  Integration Mechanism (Mapping Representation/Creation/Usage)

P2P Database Types
  Focus on high network scalability (e.g., PIER)
  Focus on high query expressivity (e.g., Edutella)
  Focus on information integration (e.g., Piazza)
  Focus on specific data structures (e.g., HiSbase)

Conclusion

P2P Databases do already work
  although immature compared to traditional database technology

One size does not fit all
  Choose P2P database approach according to application requirements

Open problems
  Load Balancing (Replication/Caching)
  How to combine DHT and filtered flooding advantages
  Reliability (probabilistic guarantees)

...

L3S Research Center, University of Hannover

Peer-to-Peer Information Retrieval
Basics

Wolf-Tilo Balke and Wolf Siberski

Peer-to-Peer Systems and Applications, Springer LNCS 3485

Overview

Goal = find documents relevant to an information need from a large document set

[Figure: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list]

First Applications

Libraries (1950s)
  ISBN: 0-201-12227-8
  Author: Salton, Gerard
  Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
  Publisher: Addison-Wesley
  Date: 1989
  Content: <Text>

External attributes and internal attribute (= content)

Search by external attributes = Search in DB

IR: search by content

IR applications vs. Databases

IR                                    DBMS
Imprecise semantics                   Precise semantics
Keyword search                        SQL
Unstructured data format              Structured data
Read-mostly, add docs occasionally    Expect reasonable number of updates
Page through top k results            Generate full answer

The IR-Cycle

[Figure: the IR cycle with the steps source selection, query formulation, search, selection, examination and delivery; the steps exchange resource, query, ranked list and documents; feedback loops: query reformulation, vocabulary learning, relevance feedback, source reselection]

Supporting the search process

[Figure: the search process (source selection, query formulation, search, selection, examination, delivery) supported on the system side by acquisition, which builds the collection, and indexing, which builds the index]

Information Hierarchy

[Figure: hierarchy with the levels Data, Information, Knowledge and Wisdom; higher levels are more refined and abstract]

Information?

Data
  The raw material of information

Information
  Data organized and presented in a particular manner

Knowledge
  “Justified true belief”
  Information that can be acted upon

Wisdom
  Distilled and integrated knowledge
  Demonstrative of high-level “understanding”

Information?

Data
  98.6º F, 99.5º F, 100.3º F, 101º F, …

Information
  Hourly body temperature: 98.6º F, 99.5º F, 100.3º F, 101º F, …

Knowledge
  If you have a temperature above 100º F, you most likely have a fever

Wisdom
  If you don’t feel well, go see a doctor

What is the Retrieval-Task?

“Fetch something” that’s been stored

Recover a stored state of knowledge

Search through stored messages to find some messages relevant to the task at hand

[Figure: communication model; the sender encodes a message into storage (indexing/writing), noise may be introduced, and the recipient decodes the stored message (retrieval/reading)]

What is IR?

Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user

What is IR?

[Figure: an IR system with input, query and output]

Sample text: “DFB-Pokal: On 21.12.2005, Hannover 96 and Werder Bremen contest the DFB-Pokal round of 16 in the AWD-Arena in Hannover.”

Query: Hannover, Fußball (football), DFB-Pokal

Result
  1. doc4554
  2. doc2374
  3. doc7652
  4. doc7642
  …

Modern History

The “information overload” problem is much older than you may think

Origins in period immediately after World War II
  Tremendous scientific progress during the war
  Rapid growth in amount of scientific publications available

The “Memex Machine”
  Conceived by Vannevar Bush, President Roosevelt's science advisor
  Outlined in 1945 Atlantic Monthly article titled “As We May Think”
  Foreshadows the development of hypertext (the Web) and information retrieval systems

Memex


Document View

[Figure: the space of all documents with the overlapping sets Relevant and Retrieved; their intersection is Relevant + Retrieved, everything outside both is Not Relevant + Not Retrieved]

What is a Model?

A model is a construct designed to help us understand a complex system
  A particular way of “looking at things”

Models inevitably make simplifying assumptions
  What are the limitations of the model?

Different types of models:
  Conceptual models
  Physical analog models
  Mathematical models

The central Problem in IR

[Figure: the information seeker turns concepts into query terms; authors turn concepts into document terms]

Do these represent the same concepts?

Representing Text

How do we represent the complexities of language?
  Keeping in mind that computers don’t “understand” documents or queries

Simple, yet effective approach: “bag of words”
  Treat all the words in a document as index terms for that document
  Assign a “weight” to each term based on its “importance”
  Disregard order, structure, meaning, etc. of the words
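A bag of words can be sketched with Python's `collections.Counter`. The tokenization rule (lower-case, letters and apostrophes only) and the use of the raw count as the weight are simplifying assumptions of this sketch.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split it into word tokens and count each token."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


bag = bag_of_words("The quick brown fox jumped over the lazy dog's back.")
```

Word order is gone after this step: any permutation of the sentence yields the same bag.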

Representing Text

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
…

“Bag of Words”
  16 × said
  14 × McDonalds
  12 × fat
  11 × fries
  8 × new
  6 × company, french, nutrition
  5 × food, oil, percent, reduce, taste, Tuesday

Bag of Words

Retrieving relevant information is hard!
  Evolving, ambiguous user needs, context, etc.
  Complexities of language

To operationalize information retrieval, we must vastly simplify the picture

Bag-of-words approach:
  Information retrieval is all (and only) about matching words in documents with words in queries
  Obviously, not true…
  But it works pretty well!

Representing Documents as Vectors

Document 1: The quick brown fox jumped over the lazy dog’s back.

Document 2: Now is the time for all good men to come to the aid of their party.

Stopword list: the, is, for, to, of

Term    Document 1  Document 2
aid         0           1
all         0           1
back        1           0
brown       1           0
come        0           1
dog         1           0
fox         1           0
good        0           1
jump        1           0
lazy        1           0
men         0           1
now         0           1
over        1           0
party       0           1
quick       1           0
their       0           1
time        0           1

Representing Text

How to compare documents and queries?

Boolean Retrieval

Weights assigned to terms are either “0” or “1”
  “0” represents “absence”: term isn’t in the document
  “1” represents “presence”: term is in the document

Build queries by combining terms with Boolean operators
  AND, OR, NOT

The system returns all documents that satisfy the query


Boolean View of a Document-Set (=Collection)

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid       0      0      0      1      0      0      0      1
all       0      1      0      1      0      1      0      0
back      1      0      1      0      0      0      1      0
brown     1      0      1      0      1      0      1      0
come      0      1      0      1      0      1      0      1
dog       0      0      1      0      1      0      0      0
fox       0      0      1      0      1      0      1      0
good      0      1      0      1      0      1      0      1
jump      0      0      1      0      0      0      0      0
lazy      1      0      1      0      1      0      1      0
men       0      1      0      1      0      0      0      1
now       0      1      0      0      0      1      0      1
over      1      0      1      0      1      0      1      1
party     0      0      0      0      0      1      0      1
quick     1      0      1      0      0      0      0      0
their     1      0      0      0      1      0      1      0
time      0      1      0      1      0      1      0      0

Each column represents the view of a particular document: What terms are contained in this document?

Each row represents the view of a particular term: What documents contain this term?

To execute a query, pick out rows corresponding to query terms and then apply logic table of corresponding Boolean operator

Sample Queries

Term       Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
dog          0      0      1      0      1      0      0      0
fox          0      0      1      0      1      0      1      0
dog ∧ fox    0      0      1      0      1      0      0      0
dog ∨ fox    0      0      1      0      1      0      1      0
dog ¬ fox    0      0      0      0      0      0      0      0
fox ¬ dog    0      0      0      0      0      0      1      0

dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog NOT fox → empty
fox NOT dog → Doc 7

Term       Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good         0      1      0      1      0      1      0      1
party        0      0      0      0      0      1      0      1
over         1      0      1      0      1      0      1      1
g ∧ p        0      0      0      0      0      1      0      1
g ∧ p ¬ o    0      0      0      0      0      1      0      0

good AND party → Doc 6, Doc 8
good AND party NOT over → Doc 6
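The sample queries can be replayed with Python sets, one set of document numbers per term. The document numbers are the ones from the rows for dog, fox, good, party and over; representing each row as a set (instead of a bit vector) is a choice made for this sketch.

```python
# Documents containing each term, read off the corresponding matrix rows.
postings = {
    "dog": {3, 5},
    "fox": {3, 5, 7},
    "good": {2, 4, 6, 8},
    "party": {6, 8},
    "over": {1, 3, 5, 7, 8},
}

dog_and_fox = postings["dog"] & postings["fox"]        # AND = intersection
dog_or_fox = postings["dog"] | postings["fox"]         # OR = union
dog_not_fox = postings["dog"] - postings["fox"]        # NOT = set difference
fox_not_dog = postings["fox"] - postings["dog"]
good_party_not_over = (postings["good"] & postings["party"]) - postings["over"]
```

NOT is treated here as a binary "AND NOT", as in the slide's examples; a standalone NOT would need the set of all documents.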


Why Boolean Retrieval works

Boolean operators approximate natural language
  Find documents about a good party that is not over

AND can discover relationships between concepts
  good party

OR can discover alternate terminology
  excellent party, wild party, etc.

NOT can discover alternate meanings
  Democratic party

The Perfect Query Paradox

Every information need has a perfect set of documents
  If not, there would be no sense doing retrieval

Every document set has a perfect query
  AND every word in a document to get a query for it
  Repeat for each document in the set
  OR every document query to get the set query

But can users realistically be expected to formulate this perfect query?
  Boolean query formulation is hard!

Why Boolean Retrieval fails

Natural language is way more complex

AND “discovers” nonexistent relationships
  Terms in different sentences, paragraphs, …

Guessing terminology for OR is hard
  good, nice, excellent, outstanding, awesome, …

Guessing terms to exclude is even harder!
  Democratic party, party to a lawsuit, …

Strengths and Weaknesses

Strengths
  Precise, if you know the right strategies
  Precise, if you have an idea of what you’re looking for
  Efficient for the computer

Weaknesses
  Users must learn Boolean logic
  Boolean logic insufficient to capture the richness of language
  No control over size of result set: either too many documents or none
  When do you stop reading? All documents in the result set are considered “equally good”
  What about partial matches? Documents that “don’t quite match” the query may be useful also

Ranked Retrieval

Order documents by how likely they are to be relevant to the information need

Present hits one screen at a time

At any point, users can continue browsing through ranked list or reformulate query

Attempts to retrieve relevant documents directly, not merely provide tools for doing so


Why Ranked Retrieval?

Arranging documents by relevance is
  Closer to how humans think: some documents are “better” than others
  Closer to user behavior: users can decide when to stop reading

Best (partial) match: documents need not have all query terms
  Although documents with more query terms should be “better”

Easier said than done!

Similarity-based Retrieval?

Let’s replace relevance with “similarity”
  Rank documents by their similarity with the query

Treat the query as if it were a document
  Create a query bag-of-words
  Find its similarity to each document
  Rank order the documents by similarity

Surprisingly, this works pretty well!

Vector Space Model

[Figure: document vectors d1–d5 in a space spanned by the terms t1, t2 and t3, with angles θ and φ between vectors]

Postulate: Documents that are “close together” in vector space “talk about” the same things

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
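One common way to measure this “closeness” is the cosine of the angle between query and document vector. The slides do not fix a particular measure here, so cosine similarity is an assumption of this sketch, as are the three-term sample vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document names ordered by decreasing similarity to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


# Hypothetical vectors over three terms (t1, t2, t3).
docs = {"d1": [1, 0, 0], "d2": [1, 1, 0], "d3": [0, 0, 1]}
order = rank([1, 1, 0], docs)
```

A document that shares no term with the query gets similarity 0, while partial matches such as d1 still receive a non-zero score and appear in the ranking.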


How to Weight Terms?

Idea: Hans Peter Luhn 1958, IBM

Here’s the intuition:
  Terms that appear often in a document should get high weights
    The more often a document contains the term “dog”, the more likely that the document is “about” dogs.
  Terms that appear in many documents should get low weights
    Words like “the”, “a”, “of” appear in (nearly) all documents.

How do we capture this mathematically?
  Term frequency
  Inverse document frequency

TFxIDF

TFxIDF [Gerard Salton, 1961]

Term Frequency (TF)
  How often a term appears in a document

Document Frequency (DF)
  Number of documents that contain a specific term

Inverse Document Frequency (IDF)
  Discriminator for the importance of a term regarding the number of occurrences in all documents
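The slides do not give a formula, so the following sketch assumes one widespread variant: weight = tf · log(N / df), with N the number of documents, tf the raw count of the term in the document and df the number of documents containing it. The function name and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: {name: list of terms}. Returns {name: {term: tf * log(N / df)}}."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur at least once?
    df = Counter(term for terms in docs.values() for term in set(terms))
    return {
        name: {t: tf * math.log(n / df[t]) for t, tf in Counter(terms).items()}
        for name, terms in docs.items()
    }


weights = tf_idf({
    "d1": ["the", "dog", "dog", "barks"],
    "d2": ["the", "fox"],
    "d3": ["the", "lazy", "dog"],
})
```

In this variant a term that occurs in every document, such as “the”, gets weight 0, while a term that is frequent in one document and rare elsewhere gets a high weight.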


Working on Indices

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid       0      0      0      1      0      0      0      1
all       0      1      0      1      0      1      0      0
back      1      0      1      0      0      0      1      0
brown     1      0      1      0      1      0      1      0
come      0      1      0      1      0      1      0      1
dog       0      0      1      0      1      0      0      0
fox       0      0      1      0      1      0      1      0
good      0      1      0      1      0      1      0      1
jump      0      0      1      0      0      0      0      0
lazy      1      0      1      0      1      0      1      0
men       0      1      0      1      0      0      0      1
now       0      1      0      0      0      1      0      1
over      1      0      1      0      1      0      1      1
party     0      0      0      0      0      1      0      1
quick     1      0      1      0      0      0      0      0
their     1      0      0      0      1      0      1      0
time      0      1      0      1      0      1      0      0

The term-document matrix has “bag of words” information about the collection

Small yet Fast?

Can we make this data structure smaller, keeping in mind the need for fast retrieval?

Observations:

The nature of the search problem requires us to quickly find which documents contain a term

The term-document matrix is very sparse

Some terms are more useful than others
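These observations suggest storing, per term, only the numbers of the documents that contain it instead of a full row of zeros and ones. A minimal sketch of building such a structure; the function name and the three sample documents are hypothetical.

```python
def build_inverted_index(docs):
    """docs: {doc_id: iterable of terms}. Returns {term: sorted list of doc ids}."""
    index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):          # each document is listed once per term
            index.setdefault(term, []).append(doc_id)
    return index


docs = {
    1: ["quick", "brown", "fox"],
    2: ["lazy", "dog"],
    3: ["brown", "dog", "dog"],
}
index = build_inverted_index(docs)
```

Looking up a term is a single dictionary access, and terms that occur in no document take no space at all.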


Postings

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8   Postings
aid       0      0      0      1      0      0      0      1     4, 8
all       0      1      0      1      0      1      0      0     2, 4, 6
back      1      0      1      0      0      0      1      0     1, 3, 7
brown     1      0      1      0      1      0      1      0     1, 3, 5, 7
come      0      1      0      1      0      1      0      1     2, 4, 6, 8
dog       0      0      1      0      1      0      0      0     3, 5
fox       0      0      1      0      1      0      1      0     3, 5, 7
good      0      1      0      1      0      1      0      1     2, 4, 6, 8
jump      0      0      1      0      0      0      0      0     3
lazy      1      0      1      0      1      0      1      0     1, 3, 5, 7
men       0      1      0      1      0      0      0      1     2, 4, 8
now       0      1      0      0      0      1      0      1     2, 6, 8
over      1      0      1      0      1      0      1      1     1, 3, 5, 7, 8
party     0      0      0      0      0      1      0      1     6, 8
quick     1      0      1      0      0      0      0      0     1, 3
their     1      0      0      0      1      0      1      0     1, 5, 7
time      0      1      0      1      0      1      0      0     2, 4, 6

Inverted Document Index

Term    Postings
aid     4, 8
all     2, 4, 6
back    1, 3, 7
brown   1, 3, 5, 7
come    2, 4, 6, 8
dog     3, 5
fox     3, 5, 7
good    2, 4, 6, 8
jump    3
lazy    1, 3, 5, 7
men     2, 4, 8
now     2, 6, 8
over    1, 3, 5, 7, 8
party   6, 8
quick   1, 3
their   1, 5, 7
time    2, 4, 6

What goes in the Postings?

Boolean retrieval
  Just the document number

Ranked Retrieval
  Document number and term weight (tf.idf, ...)

Proximity operators
  Word offsets for each occurrence of the term

Summary

Information retrieval needs techniques different from database style retrieval

Ranked query model instead of simple look-ups

Global statistics about the collection may be needed (e.g., IDFs)

Inverted indices are the main data structures

Problem: How to perform IR-style retrieval in P2P systems?

How does the distributed setting affect rankings?

How to collect global statistics over autonomous peers?


How to deal with unstable collections due to network churn?

