
Modern Information Retrieval

Chapter 2

Modeling

Modeling, Modern Information Retrieval, Addison Wesley, 2006 – p. 1/37


Introduction

IR systems usually adopt index terms to process queries

Index term:
- a keyword or group of selected words
- any word (more general)

Stemming might be used:
- connect: connecting, connection, connections

An inverted file is built for the chosen index terms
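
As a minimal sketch of how stemming and an inverted file fit together, the Python below maps each stemmed index term to the set of documents containing it. The suffix list, the `naive_stem` and `build_inverted_file` helpers, and the sample documents are all hypothetical, chosen only to reproduce the connect example; a real system would use a proper stemming algorithm.

```python
from collections import defaultdict

# Toy suffix list (an assumption for illustration), longest suffix first.
SUFFIXES = ("ions", "ing", "ion", "s")

def naive_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_inverted_file(docs):
    """Map each stemmed index term to the set of ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[naive_stem(token)].add(doc_id)
    return index

# Hypothetical three-document collection.
docs = {1: "connecting users", 2: "network connections", 3: "index terms"}
index = build_inverted_file(docs)
```

Here "connecting" and "connections" both collapse to the index term "connect", so a query on that term reaches documents 1 and 2.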


Introduction

Matching at index term level is quite imprecise

No surprise that users are frequently dissatisfied

Since most users have no training in query formation, the problem is even worse

Frequent dissatisfaction of Web users

Issue of deciding relevance is critical for IR systems: ranking


Introduction

A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query

A ranking is based on fundamental premises regarding the notion of relevance, such as:
- common sets of index terms
- sharing of weighted terms
- likelihood of relevance

Each set of premises leads to a distinct IR model


IR Models

The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system


Retrieval: Ad Hoc vs. Filtering

[Figure: Ad Hoc Retrieval]

[Figure: Filtering]


Classic IR Models - Basic Concepts

Each document is represented by a set of representative keywords or index terms

An index term is a document word useful for remembering the document's main themes

Usually, index terms are nouns because nouns have meaning by themselves

However, search engines assume that all words are index terms (full text representation)


Classic IR Models - Basic Concepts

Not all terms are equally useful for representing the document contents
- less frequent terms allow identifying a narrower set of documents

To quantify the importance of an index term, we associate a weight with it

Let
- $k_i$ be an index term
- $d_j$ be a document
- $w_{ij}$ be a weight associated with $(k_i, d_j)$, which quantifies the importance of $k_i$ for describing the contents of $d_j$


Classic IR Models - Basic Concepts

Let
- $k_i$ : $i$th index term
- $d_j$ : $j$th document
- $t$ : total number of terms in the vocabulary
- $K = \{k_1, k_2, \ldots, k_t\}$ : the set of all index terms
- $w_{ij} \geq 0$ : weight associated with $(k_i, d_j)$; if $w_{ij} = 0$ then term $k_i$ does not occur within $d_j$
- $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$ : weighted vector associated with $d_j$
- $g_i(\vec{d}_j)$ : a reference to the weight $w_{ij}$, i.e., $g_i(\vec{d}_j) = w_{ij}$


The Boolean Model

Simple model based on set theory

Queries specified as Boolean expressions
- precise semantics
- neat formalism
- $q = k_a \wedge (k_b \vee \neg k_c)$

Let
- $w_{iq}$ : weight associated with pair $(k_i, q)$
- $w_{iq} \in \{0, 1\}$ : terms either present or absent (Boolean)
- $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$ : weighted vector associated with $q$


The Boolean Model

Let
- $q = k_a \wedge (k_b \vee \neg k_c)$
- $\vec{q}$ : weighted vector associated with $q$
- $dnf(\vec{q})$ : disjunctive normal form for vector $\vec{q}$

Then,

$$dnf(\vec{q}) = (1, 1, 1) \vee (1, 1, 0) \vee (1, 0, 0)$$

- $(1,1,1)$ : conjunctive component for $(k_a, k_b, k_c)$
- $(1,1,0)$ : conjunctive component for $(k_a, k_b, \neg k_c)$
- $(1,0,0)$ : conjunctive component for $(k_a, \neg k_b, \neg k_c)$

$cc_q$ : a conjunctive component for $q$, $cc_q \in dnf(\vec{q})$


The Boolean Model

$q = k_a \wedge (k_b \vee \neg k_c)$

$$sim(q, d_j) = \begin{cases} 1 & \text{if } \exists\, cc_q \mid \forall k_i,\ g_i(\vec{d}_j) = g_i(cc_q) \\ 0 & \text{otherwise} \end{cases}$$
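
The matching rule can be sketched in Python for the example query $q = k_a \wedge (k_b \vee \neg k_c)$. This is only an illustration under stated assumptions: the function names (`query`, `dnf`, `sim`) and the term labels are hypothetical, and the disjunctive normal form is obtained by brute-force enumeration of truth assignments rather than by any method given in the slides.

```python
from itertools import product

def query(ka, kb, kc):
    """The example query q = ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)

def dnf(predicate, n_terms):
    """Enumerate all 0/1 assignments and keep those that satisfy the query.

    Each kept tuple is one conjunctive component of the query.
    """
    return {
        bits
        for bits in product((1, 0), repeat=n_terms)
        if predicate(*map(bool, bits))
    }

def sim(doc_terms, query_terms, components):
    """Return 1 if the doc's binary pattern over the query terms is a component."""
    pattern = tuple(1 if k in doc_terms else 0 for k in query_terms)
    return 1 if pattern in components else 0

terms = ("ka", "kb", "kc")
components = dnf(query, len(terms))
```

With these definitions, `components` holds the three conjunctive components (1,1,1), (1,1,0) and (1,0,0); a document containing only `ka` matches, while one containing `ka` and `kc` but not `kb` does not. The result is strictly 0 or 1, which is the lack of partial matching discussed next.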


Drawbacks of the Boolean Model

Retrieval based on binary decision criteria with no notion of partial matching

No ranking of the documents is provided (absence of a grading scale)

Information need has to be translated into a Boolean expression, which most users find awkward

The Boolean queries formulated by the users are most often too simplistic

As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query


The Vector Model

Use of binary weights is too limiting

Non-binary weights provide consideration for partial matches

These term weights are used to compute a degree of similarity between a query and each document

Ranked set of documents provides for better matching


The Vector Model

Define:
- $w_{ij} > 0$ whenever $k_i \in d_j$
- $w_{iq} \geq 0$ associated with the pair $(k_i, q)$
- $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
- $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$

To each term $k_i$ is associated a unit vector $\vec{i}$

The unit vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

The $t$ unit vectors $\vec{i}$ form an orthonormal basis for a $t$-dimensional space

In this space, queries and documents are represented as weighted vectors


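The Vector Model

Similarity

$$sim(d_j, q) = \cos(\theta) = \frac{\vec{d}_j \bullet \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

Since $w_{ij} \geq 0$ and $w_{iq} \geq 0$, then $0 \leq sim(d_j, q) \leq 1$

A document is retrieved even if it matches the query terms only partially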
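
A small Python sketch of the cosine formula may make the partial-matching behaviour concrete. The function name `cosine_sim` and the convention of returning 0.0 for an all-zero vector are assumptions made for this illustration; the slides do not say how that case should be handled.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between two weight vectors of equal length t."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        # Assumed convention: an empty document or query matches nothing.
        return 0.0
    return dot / (norm_d * norm_q)

# A document sharing one of its two terms with a single-term query
# still receives a non-zero score (partial match).
partial = cosine_sim([1, 1, 0], [1, 0, 0])
```

Vectors with no term in common score 0, and vectors pointing in the same direction score 1 regardless of their lengths.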


The Vector Model

Similarity

$$sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

How to compute the weights $w_{ij}$ and $w_{iq}$?

A good weight must take into account two effects:
- quantification of intra-document contents (similarity): tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity): idf factor, the inverse document frequency

$$w_{i,j} = tf_{i,j} \times idf_i$$


The Vector Model

Let $k_i \in d_j$, and let
- $N$ be the total number of docs in the collection
- $n_i$ be the number of docs which contain $k_i$
- $freq_{i,j}$ be the raw frequency of $k_i$ within $d_j$

A normalized tf factor is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$

where the maximum is computed over all terms which occur within the document $d_j$

The idf factor is computed as

$$idf_i = \log \frac{N}{n_i}$$

the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.
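
The two factors can be computed directly from token lists, as in the sketch below. The helper names `normalized_tf` and `idf`, the tokenised sample collection, and the use of the natural logarithm are assumptions; the slides do not fix the base of the log.

```python
import math
from collections import Counter

def normalized_tf(doc_tokens):
    """f_ij = freq_ij / max_l freq_lj for every term in one document."""
    freq = Counter(doc_tokens)
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {term: count / max_freq for term, count in freq.items()}

def idf(docs):
    """idf_i = log(N / n_i), where n_i counts the docs containing term i."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    return {term: math.log(n_docs / n_i) for term, n_i in doc_freq.items()}

# Hypothetical two-document collection, already tokenised.
docs = [["a", "b", "a"], ["b", "c"]]
tf_first = normalized_tf(docs[0])
idf_all = idf(docs)
```

In this toy collection the term "b" occurs in every document, so its idf is 0 and it contributes nothing to discriminating between documents.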


The Vector Model

The best term-weighting schemes use weights which are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$$

the strategy is called a tf-idf weighting scheme

For the query term weights, a suggestion is

$$w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i}$$

The vector model with tf-idf weights is a good ranking strategy with general collections

The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
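
The document and query weighting formulas differ only in the tf part, which the following sketch tries to show side by side. The function names, the `doc_freq` dictionary of $n_i$ values, the natural logarithm, and the choice to skip query terms that never occur in the collection (where $n_i = 0$ would make the idf undefined) are all assumptions of this example.

```python
import math
from collections import Counter

def doc_weights(doc_tokens, n_docs, doc_freq):
    """w_ij = f_ij * log(N / n_i); assumes the doc belongs to the collection."""
    freq = Counter(doc_tokens)
    max_freq = max(freq.values())
    return {
        k: (f / max_freq) * math.log(n_docs / doc_freq[k])
        for k, f in freq.items()
    }

def query_weights(query_tokens, n_docs, doc_freq):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    freq = Counter(query_tokens)
    max_freq = max(freq.values())
    return {
        k: (0.5 + 0.5 * f / max_freq) * math.log(n_docs / doc_freq[k])
        for k, f in freq.items()
        if k in doc_freq  # assumed: terms unseen in the collection are dropped
    }

# Hypothetical statistics: N = 4 docs, n_i per term.
doc_freq = {"a": 1, "b": 2, "c": 1}
w_doc = doc_weights(["a", "b", "a"], 4, doc_freq)
w_query = query_weights(["a", "z"], 4, doc_freq)
```

The 0.5 offset in the query formula keeps every query term at no less than half of its idf weight, which matters because queries are short and most terms occur only once.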


The Vector Model

Advantages:
- term-weighting improves quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- cosine ranking formula sorts documents according to degree of similarity to the query

Disadvantages:
- assumes independence of index terms; it is not clear that this is bad, though


The Vector Model

[Figures: Example 1, Example 2, Example 3]


Probabilistic Model

Objective: to capture the IR problem using a probabilistic framework

Given a user query, there is an ideal answer set

Querying as specification of the properties of this ideal answer set (clustering)

But, what are these properties?

Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)

Improve by iteration


Probabilistic Model

An initial set of documents is retrieved somehow

User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected)

IR system uses this information to refine description of ideal answer set

By repeating this process, it is expected that the description of the ideal answer set will improve

Always keep in mind the need to guess, at the very beginning, the description of the ideal answer set

Description of ideal answer set is modeled in probabilistic terms


Probabilistic Ranking Principle

Given a user query $q$ and a document $d_j$, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Ideal answer set is referred to as $R$ and should maximize the probability of relevance. Documents in the set $R$ are predicted to be relevant.

But,
- how to compute probabilities?
- what is the sample space?


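The Ranking

Similarity

$$sim(d_j, q) = \frac{P(R|\vec{d}_j)}{P(\bar{R}|\vec{d}_j)} = \frac{P(\vec{d}_j|R) \times P(R)}{P(\vec{d}_j|\bar{R}) \times P(\bar{R})} \sim \frac{P(\vec{d}_j|R)}{P(\vec{d}_j|\bar{R})}$$

where $\bar{R}$ is the complement of $R$ (the set of non-relevant documents)

$P(\vec{d}_j|R)$ : probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents

$P(R)$ : probability that a document randomly selected from the entire collection is relevant

$P(\vec{d}_j|\bar{R})$ and $P(\bar{R})$ : analogous and complementary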


The Ranking

Assuming independence of index terms

$$sim(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i|R)\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|R)\right)}{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i|\bar{R})\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|\bar{R})\right)}$$

$P(k_i|R)$ : probability that the index term $k_i$ is present in a document randomly selected from the set $R$ of relevant documents

$P(\bar{k}_i|R)$ : probability that the index term $k_i$ is not present in a document randomly selected from the set $R$

The probabilities associated with $\bar{R}$ have meanings which are analogous to the ones just described


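The Ranking

Finally

$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$

Where

$P(\bar{k}_i|R) = 1 - P(k_i|R)$

$P(\bar{k}_i|\bar{R}) = 1 - P(k_i|\bar{R})$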
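
The step from the product form to the sum of logs is not spelled out on the slides. One way to fill it in, offered here as a sketch rather than as the authors' own derivation, uses the shorthand $p_i = P(k_i|R)$ and $r_i = P(k_i|\bar{R})$ (both introduced only for this note) and assumes binary weights $w_{i,j}, w_{i,q} \in \{0,1\}$:

```latex
\begin{align*}
\log sim(d_j, q)
  &\sim \sum_{g_i(\vec{d}_j)=1} \log \frac{p_i}{r_i}
      + \sum_{g_i(\vec{d}_j)=0} \log \frac{1 - p_i}{1 - r_i} \\
  &= \sum_{g_i(\vec{d}_j)=1} \left( \log \frac{p_i}{r_i}
      - \log \frac{1 - p_i}{1 - r_i} \right)
      + \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - r_i} \\
  &\sim \sum_{g_i(\vec{d}_j)=1} \left( \log \frac{p_i}{1 - p_i}
      + \log \frac{1 - r_i}{r_i} \right)
\end{align*}
```

The last sum over all $t$ terms in the second line is the same for every document, so it does not affect the ranking and is dropped. Restricting the remaining sum to terms that also occur in the query is what the factor $w_{i,q} \times w_{i,j}$ expresses.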


The Initial Ranking

Similarity

$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$

Probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$?

Estimates based on assumptions:
- $P(k_i|R) = 0.5$
- $P(k_i|\bar{R}) = \frac{n_i}{N}$, where $n_i$ is the number of docs that contain $k_i$

Use this initial guess to retrieve an initial ranking

Improve upon this initial ranking
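
Plugging the initial estimates into the ranking formula can be sketched as follows, assuming binary term weights so that only terms shared by the query and the document contribute. The names `term_score` and `initial_sim`, the `doc_freq` dictionary of $n_i$ values, and the decision to skip terms with $n_i = 0$ or $n_i = N$ (where the logarithm is undefined) are assumptions of this example.

```python
import math

def term_score(p_rel, p_nonrel):
    """log(p / (1 - p)) + log((1 - r) / r) for one index term."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def initial_sim(doc_terms, query_terms, doc_freq, n_docs):
    """Initial score with P(k_i|R) = 0.5 and P(k_i|not R) = n_i / N."""
    score = 0.0
    for k in query_terms:
        n_i = doc_freq.get(k, 0)
        # Assumed guard: skip terms for which n_i / N is 0 or 1.
        if k in doc_terms and 0 < n_i < n_docs:
            score += term_score(0.5, n_i / n_docs)
    return score

# Hypothetical statistics: N = 10 docs; "a" is rare, "b" is in half of them.
doc_freq = {"a": 1, "b": 5}
score_ab = initial_sim({"a", "b"}, ["a", "b"], doc_freq, 10)
score_b = initial_sim({"b"}, ["a", "b"], doc_freq, 10)
```

With $P(k_i|R) = 0.5$ the first logarithm vanishes, so the initial score of a term reduces to $\log \frac{N - n_i}{n_i}$, an idf-like quantity: the rare term "a" contributes $\log 9$, while "b" contributes nothing.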


Improving the Initial Ranking

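$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$

Let
- $V$ : set of docs initially retrieved
- $V_i$ : subset of docs retrieved that contain $k_i$

In the estimates below, $V$ and $V_i$ also denote the number of elements in these sets

Reevaluate estimates:
- $P(k_i|R) = \frac{V_i}{V}$
- $P(k_i|\bar{R}) = \frac{n_i - V_i}{N - V}$

Repeat recursively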


Improving the Initial Ranking

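$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$

To avoid problems with $V = 1$ and $V_i = 0$:

$$P(k_i|R) = \frac{V_i + 0.5}{V + 1}$$

$$P(k_i|\bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}$$

Also,

$$P(k_i|R) = \frac{V_i + \frac{n_i}{N}}{V + 1}$$

$$P(k_i|\bar{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}$$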
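
The adjusted estimates are simple enough to write as one function with a pluggable adjustment factor. The name `reestimate`, its parameter names, and the sample counts below are hypothetical; the two calls correspond to the constant 0.5 and the $n_i / N$ variants above.

```python
def reestimate(v_i, v, n_i, n_docs, adjust=0.5):
    """Return (P(k_i|R), P(k_i|not R)) from the counts of the last ranking.

    v      -- number of docs in the retrieved set V
    v_i    -- number of docs in V that contain k_i
    n_i    -- number of docs in the collection that contain k_i
    adjust -- adjustment factor: 0.5, or n_i / n_docs for the second variant
    """
    p_rel = (v_i + adjust) / (v + 1)
    p_nonrel = (n_i - v_i + adjust) / (n_docs - v + 1)
    return p_rel, p_nonrel

# The problematic case V = 1, V_i = 0 no longer yields a zero probability.
p_rel, p_nonrel = reestimate(v_i=0, v=1, n_i=5, n_docs=10)

# Second variant: adjustment factor n_i / N instead of 0.5.
p_rel2, p_nonrel2 = reestimate(v_i=3, v=4, n_i=5, n_docs=10, adjust=5 / 10)
```

Without the adjustment, $V_i = 0$ would give $P(k_i|R) = 0$ and the logarithm in the ranking formula would be undefined; with it, both probabilities stay strictly between 0 and 1 in these examples.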


Pluses and Minuses

Advantages:
- Docs ranked in decreasing order of probability of relevance

Disadvantages:
- need to guess initial estimates for $P(k_i|R)$
- method does not take into account tf and idf factors


Brief Comparison of Classic Models

Boolean model does not provide for partial matches and is considered to be the weakest classic model

Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections

This seems also to be the view of the research community
