Modeling (Chap. 2) Modern Information Retrieval Spring 2000
Transcript
Page 1: Modeling (Chap. 2) Modern Information Retrieval Spring 2000.

Modeling (Chap. 2) Modern Information Retrieval

Spring 2000

Page 2: Introduction

- Traditional IR systems adopt index terms to index and retrieve documents.
- An index term is simply any word that appears in the text of a document.
- Retrieval based on index terms rests on a simple premise: the semantics of documents and of the user's information need can be expressed through sets of index terms.

Page 3: Key Question

- Semantics in the document (and in the user request) is lost when the text is replaced with a set of words.
- Matching between documents and the user request is done in the very imprecise space of index terms (low-quality retrieval).
- The problem is worsened for users with no training in properly forming queries (a cause of the frequent dissatisfaction of Web users with the answers they obtain).

Page 4: Taxonomy of IR Models

Three classic models:
- Boolean: documents and queries represented as sets of index terms.
- Vector: documents and queries represented as vectors in a t-dimensional space.
- Probabilistic: document and query representations based on probability theory.

Page 5: Basic Concepts

- The classic models consider that each document is described by index terms.
- An index term is a (document) word that helps in remembering the document's main themes; index terms are used to index and summarize document content.
- In general, index terms are nouns (because they carry meaning by themselves).
- Index terms may also be taken to be all distinct words in a document collection.

Page 6:

- Distinct index terms have varying relevance when describing document contents.
- Thus, numerical weights are assigned to each index term of a document.
- Let ki be an index term, dj a document, and wi,j ≥ 0 a weight for the pair (ki, dj).
- The weight quantifies the importance of the index term for describing the document's semantic contents.

Page 7: Definition (p. 25)

- Let t be the number of index terms in the system and ki be a generic index term.
- K = {k1, …, kt} is the set of all index terms.
- A weight wi,j > 0 is associated with each index term ki that appears in document dj; for an index term that does not appear in the document text, wi,j = 0.
- Document dj is associated with an index term vector dj = (w1,j, w2,j, …, wt,j).

Page 8: Boolean Model

- A simple retrieval model based on set theory and Boolean algebra.
- The framework is easy for users to grasp (the concept of a set is intuitive).
- Queries are specified as Boolean expressions, which have precise semantics.

Page 9: Drawbacks

- The retrieval strategy is a binary decision (a document is either relevant or non-relevant), which prevents good retrieval performance.
- It is not simple to translate an information need into a Boolean expression (difficult and awkward to express).
- Nevertheless, it is the dominant model in commercial DB systems.

Page 10: Boolean Model (Cont.)

- Considers that index terms are either present or absent in a document: index term weights are binary, i.e. wi,j ∈ {0,1}.
- A query q is composed of index terms linked by not, and, or.
- A query is a Boolean expression which can be represented in disjunctive normal form (DNF).

Page 11: Boolean Model (Cont.)

- The query [q = ka ∧ (kb ∨ ¬kc)] can be written in DNF as [q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)].
- Each component is a binary weighted vector associated with the tuple (ka, kb, kc).
- These binary weighted vectors are called the conjunctive components of q_dnf.
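The conjunctive components of this query can be checked by enumerating every binary weight tuple over (ka, kb, kc) and keeping those that satisfy the Boolean expression. A minimal sketch (the function `q` below simply encodes the slide's expression ka ∧ (kb ∨ ¬kc)):

```python
from itertools import product

# The query from the slide: q = ka AND (kb OR NOT kc)
def q(ka, kb, kc):
    return bool(ka and (kb or not kc))

# Every binary weight tuple satisfying q is a conjunctive component of q_dnf.
q_dnf = [t for t in product((1, 0), repeat=3) if q(*t)]
print(q_dnf)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
```

This recovers exactly the three components listed on the slide.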

Page 12: Boolean Model (Cont.)

- Index term weight variables are all binary, i.e. wi,j ∈ {0,1}.
- A query q is a Boolean expression. Let q_dnf be the DNF of query q, and let q_cc be any of the conjunctive components of q_dnf.
- The similarity of document dj to query q is:
  sim(dj,q) = 1 if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ki, gi(dj) = gi(q_cc)), where gi(dj) = wi,j
  sim(dj,q) = 0 otherwise
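In other words, sim(dj,q) asks whether the document's binary weight vector coincides with some conjunctive component of q_dnf on every index term. A minimal sketch (the three-term document vectors below are hypothetical examples):

```python
def sim(doc_weights, q_dnf):
    """1 if some conjunctive component of q_dnf agrees with the
    document's binary weight vector on every index term, else 0."""
    return 1 if tuple(doc_weights) in set(q_dnf) else 0

# Conjunctive components of q = ka AND (kb OR NOT kc), as on the slide:
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

print(sim((1, 0, 0), q_dnf))  # 1: matches the component (1,0,0)
print(sim((0, 1, 1), q_dnf))  # 0: no component agrees on ka
```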

Page 13: Boolean Model (Cont.)

- If sim(dj,q) = 1, the Boolean model predicts that document dj is relevant to query q (it might not be).
- Otherwise, the prediction is that the document is not relevant.
- The Boolean model predicts that each document is either relevant or non-relevant; there is no notion of a partial match.

Page 14:

Main advantages:
- clean formalism
- simplicity

Main disadvantages:
- exact matching leads to retrieval of too few or too many documents
- no index term weighting (term weighting can lead to improvement in retrieval performance)

Page 15: Vector Model

- Assigns non-binary weights to index terms in queries and documents.
- Term weights are used to compute the degree of similarity between each document and the user query.
- By sorting retrieved documents in decreasing order of degree of similarity, the vector model takes partially matching documents into account.
- The ranked document answer set is a lot more precise than the answer set of the Boolean model.

Page 16: Vector Model (Cont.)

- The weight wi,j for the pair (ki, dj) is positive and non-binary.
- Index terms in the query are also weighted: let wi,q ≥ 0 be the weight associated with the pair [ki, q].
- The query vector is defined as q = (w1,q, w2,q, …, wt,q), where t is the total number of index terms in the system.
- The vector for document dj is represented by dj = (w1,j, w2,j, …, wt,j).

Page 17: Vector Model (Cont.)

- Document dj and user query q are represented as t-dimensional vectors.
- The degree of similarity of dj with regard to q is evaluated as the correlation between the vectors dj and q.
- This correlation can be quantified by the cosine of the angle between the two vectors:
  sim(dj,q) = (dj · q) / (|dj| × |q|)
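The cosine formula can be sketched directly over plain weight vectors (a minimal illustration; the example weights are hypothetical):

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document and query weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0          # a zero vector matches nothing
    return dot / (norm_d * norm_q)

# Hypothetical 3-term weight vectors:
print(round(cosine_sim([1.0, 0.5, 0.0], [1.0, 0.0, 0.0]), 3))  # 0.894
```

Since weights are non-negative, the result lies between 0 and 1, as the next slide states.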

Page 18: Vector Model (Cont.)

- sim(dj,q) varies from 0 to +1.
- Ranks documents according to their degree of similarity to the query; a document may be retrieved even if it only partially matches the query.
- Establish a threshold on sim(dj,q) and retrieve the documents with a degree of similarity above that threshold.

Page 19: Index Term Weights

- Think of the documents as a collection C of objects and of the user query as a specification of a set A of objects.
- The IR problem is then to determine which documents are in the set A and which are not (i.e., a clustering problem).
- In a clustering problem:
  - intra-cluster similarity: which features better describe the objects in the set A;
  - inter-cluster dissimilarity: which features better distinguish the objects in the set A from the remaining objects in the collection C.

Page 20:

- In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj (the tf factor): how well the term describes the document contents.
- Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of the term ki among the documents in the collection (the idf factor): terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one.

Page 21: Definition (p. 29)

- Let N be the total number of documents in the system, and let ni be the number of documents in which the index term ki appears.
- Let freqi,j be the raw frequency of term ki in document dj, i.e. the number of times the term ki is mentioned in the text of document dj.
- The normalized frequency fi,j of term ki in dj is:
  fi,j = freqi,j / maxl freql,j

Page 22:

- The maximum is computed over all terms mentioned in the text of document dj; if the term ki does not appear in document dj, then fi,j = 0.
- Let idfi, the inverse document frequency for ki, be:
  idfi = log(N / ni)
- The best-known term weighting scheme is:
  wi,j = fi,j × log(N / ni)
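This weighting scheme can be sketched in a few lines (a minimal illustration; the term counts are hypothetical, and the natural log is used here since the slide leaves the base unspecified: the base only rescales all weights uniformly):

```python
import math

def tf_idf_weights(doc_freqs, n_i, N):
    """Compute w_{i,j} = f_{i,j} * log(N / n_i) for one document.

    doc_freqs: {term: raw frequency freq_{i,j} in document d_j}
    n_i:       {term: number of documents containing the term}
    N:         total number of documents in the system
    """
    max_freq = max(doc_freqs.values())       # max_l freq_{l,j}
    weights = {}
    for term, freq in doc_freqs.items():
        f = freq / max_freq                  # normalized tf factor
        idf = math.log(N / n_i[term])        # idf factor
        weights[term] = f * idf
    return weights

# Hypothetical counts: 'retrieval' occurs 4x here and in 10 of 1000 docs;
# 'the' occurs 8x here and in every document.
w = tf_idf_weights({'retrieval': 4, 'the': 8}, {'retrieval': 10, 'the': 1000}, 1000)
print(w)
```

Note that a term appearing in every document gets idf = log(1) = 0, so it contributes nothing to the ranking, which is exactly the idf intuition on the previous slide.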

Page 23: Advantages of the Vector Model

- The term weighting scheme improves retrieval performance.
- Retrieves documents that approximate the query conditions.
- Sorts documents according to their degree of similarity to the query.

Disadvantage:
- Assumes that index terms are mutually independent.

Page 24: Probabilistic Model

- Given a user query, there is a set of documents which contains exactly the relevant documents and no others: the ideal answer set.
- Given a description of this ideal answer set, there would be no problem in retrieving its documents.
- The querying process is thus a process of specifying the properties of the ideal answer set; the problem is that these properties are not exactly known.
- There are index terms whose semantics are used to characterize these properties.

Page 25: Probabilistic Model (Cont.)

- These properties are not known at query time, so an effort has to be made to initially guess what they are.
- The initial guess generates a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents.
- User interaction is then initiated to improve the probabilistic description of the ideal answer set.

Page 26:

- The user examines the retrieved documents and decides which ones are relevant.
- This information is used to refine the description of the ideal answer set.
- By repeating this process, the description will evolve and come closer to the ideal answer set.

Page 27: Fundamental Assumption

- Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find document dj relevant.
- It assumes that this probability of relevance depends on the query and document representations only.
- It assumes that there is a subset of all documents which the user prefers as the answer set for the query q; this ideal answer set is labeled R, and the documents in R are predicted to be relevant to the query.

Page 28:

- Given a query q, the probabilistic model assigns to each document dj, as its measure of similarity to the query, the ratio P(dj relevant-to q) / P(dj non-relevant-to q): the odds of document dj being relevant to query q.

Page 29:

- Index term weight variables are all binary, i.e. wi,j ∈ {0,1} and wi,q ∈ {0,1}.
- A query q is a subset of index terms.
- Let R be the set of documents known (or initially guessed) to be relevant, and let R̄ be the complement of R.
- Let P(R|dj) be the probability that document dj is relevant to query q, and P(R̄|dj) the probability that document dj is not relevant to query q.

Page 30:

- The similarity sim(dj,q) of document dj to query q is the ratio:
  sim(dj,q) = P(R|dj) / P(R̄|dj)
- Applying Bayes' rule and dropping factors that are the same for all documents:
  sim(dj,q) ~ P(dj|R) / P(dj|R̄)
- Assuming independence of index terms, this leads to:
  sim(dj,q) ~ Σ(i=1..t) wi,q × wi,j × ( log[ P(ki|R) / (1 − P(ki|R)) ] + log[ (1 − P(ki|R̄)) / P(ki|R̄) ] )
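The ranking formula can be sketched over binary weight vectors (a minimal illustration; the vectors and probability estimates below are hypothetical, with P(ki|R) = 0.5 taken from the initial guess on the next slide):

```python
import math

def prob_sim(doc, query, p_R, p_notR):
    """sim(d_j,q) ~ sum over terms present in both doc and query of
    log[P(ki|R)/(1-P(ki|R))] + log[(1-P(ki|R_bar))/P(ki|R_bar)].

    doc, query: binary weight vectors (w_{i,j}, w_{i,q});
    p_R[i] = P(k_i|R); p_notR[i] = P(k_i|R_bar)."""
    s = 0.0
    for wj, wq, pr, pn in zip(doc, query, p_R, p_notR):
        if wj and wq:   # w_{i,q} * w_{i,j} is 1 only when both are 1
            s += math.log(pr / (1 - pr)) + math.log((1 - pn) / pn)
    return s

# Hypothetical binary vectors over 3 index terms:
doc, query = (1, 1, 0), (1, 0, 1)
p_R = [0.5, 0.5, 0.5]            # initial guess: P(ki|R) = 0.5
p_notR = [0.1, 0.2, 0.3]
print(prob_sim(doc, query, p_R, p_notR))  # ≈ 2.197 (log 9)
```

Only term 0 occurs in both vectors here, and with P(ki|R) = 0.5 the first log vanishes, so the score reduces to log(0.9/0.1).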

Page 31:

- How can P(ki|R) and P(ki|R̄) be computed initially?
- Assume P(ki|R) is constant for all index terms ki (typically 0.5): P(ki|R) = 0.5.
- Assume the distribution of index terms among the non-relevant documents is approximated by the distribution of index terms among all the documents in the collection: P(ki|R̄) = ni/N, where ni is the number of documents containing the index term ki and N is the total number of documents.

Page 32:

- Let V be the subset of documents initially retrieved and ranked by the model, and let Vi be the subset of V composed of the documents in V which contain the index term ki.
- P(ki|R) is approximated by the distribution of the index term ki among the documents retrieved so far: P(ki|R) = Vi / V.
- P(ki|R̄) is approximated by considering that all the non-retrieved documents are not relevant: P(ki|R̄) = (ni − Vi) / (N − V).
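One re-estimation round can be sketched as follows (a minimal illustration; the documents are modeled as hypothetical sets of terms, V and Vi are treated as counts, and small-sample smoothing is ignored):

```python
def reestimate(V_docs, n_i, N, terms):
    """One round of updating P(ki|R) and P(ki|R_bar).

    V_docs: documents retrieved in the last round (each a set of terms),
            assumed relevant for the update;
    n_i:    {term: number of documents in the collection containing it};
    N:      total number of documents in the collection."""
    V = len(V_docs)
    p_R, p_notR = {}, {}
    for t in terms:
        Vi = sum(1 for d in V_docs if t in d)   # retrieved docs containing ki
        p_R[t] = Vi / V                          # P(ki|R) = Vi / V
        p_notR[t] = (n_i[t] - Vi) / (N - V)      # P(ki|R_bar) = (ni-Vi)/(N-V)
    return p_R, p_notR

# Hypothetical: 2 retrieved docs out of N = 10.
p_R, p_notR = reestimate([{'a', 'b'}, {'a'}], {'a': 5, 'b': 3}, 10, ['a', 'b'])
print(p_R, p_notR)
```

In practice Vi can be 0 (or V very small), so adjustment factors such as adding 0.5 to numerator and denominator are commonly used; that refinement is omitted here.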

Page 33:

Advantages:
- Documents are ranked in decreasing order of their probability of being relevant.

Disadvantages:
- The initial separation into relevant and non-relevant sets must be guessed.
- All index term weights are binary.
- Assumes that index terms are mutually independent.

