Modern Information Retrieval
Chapter 2
Modeling
Modeling, Modern Information Retrieval, Addison Wesley, 2006 – p. 1/37
Introduction
IR systems usually adopt index terms to process queries
Index term:
a keyword or group of selected words
any word (more general)
Stemming might be used:
connect: connecting, connection, connections
An inverted file is built for the chosen index terms
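A minimal sketch of how such an inverted file might be built (the documents are hypothetical; terms are just whitespace-delimited words, with no stemming or stop-word removal):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["information retrieval systems",
        "retrieval models",
        "boolean and vector models"]
index = build_inverted_index(docs)
print(index["retrieval"])  # [0, 1]
print(index["models"])     # [1, 2]
```

A real inverted file would also store within-document frequencies or positions in each posting, not just document ids.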
Introduction
Introduction
Matching at index term level is quite imprecise
No surprise that users frequently end up unsatisfied
Since most users have no training in query formulation, the problem is even worse
Frequent dissatisfaction of Web users
The issue of deciding relevance is critical for IR systems: ranking
Introduction
A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
A ranking is based on fundamental premises regarding the notion of relevance, such as:
common sets of index terms
sharing of weighted terms
likelihood of relevance
Each set of premises leads to a distinct IR model
IR Models
IR Models
The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system
Retrieval: Ad Hoc vs. Filtering
Ad Hoc Retrieval:
Retrieval: Ad Hoc vs. Filtering
Filtering:
Classic IR Models - Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a document word useful for remembering the document's main themes
Usually, index terms are nouns because nouns have meaning by themselves
However, search engines assume that all words are index terms (full-text representation)
Classic IR Models - Basic Concepts
Not all terms are equally useful for representing the document contents
less frequent terms allow identifying a narrower set of documents
To quantify the importance of an index term, we associate a weight with it
Let
ki be an index term
dj be a document
wij be a weight associated with (ki, dj), which quantifies the importance of ki for describing the contents of dj
Classic IR Models - Basic Concepts
Let
ki : i-th index term
dj : j-th document
t : total number of terms in the vocabulary
K = {k1, k2, ..., kt} : the set of all index terms
wij ≥ 0 : weight associated with (ki, dj)
if wij = 0, the term ki does not occur within dj
~dj = (w1j, w2j, ..., wtj) : weighted vector associated with dj
gi(~dj) = wij : a function returning the weight of term ki in the vector ~dj
The Boolean Model
Simple model based on set theory
Queries specified as Boolean expressions
precise semantics
neat formalism
q = ka ∧ (kb ∨ ¬kc)
Let
wiq ∈ {0, 1} : weight associated with the pair (ki, q); terms are either present or absent (Boolean)
~dq = (w1q, w2q, ..., wtq) : weighted vector associated with q
The Boolean Model
Let
q = ka ∧ (kb ∨ ¬kc)
~dq : weighted vector associated with q
dnf(~dq) : disjunctive normal form of the vector ~dq
Then,
dnf(~dq) = (1, 1, 1) ∨ (1, 1, 0) ∨ (1, 0, 0)
(1,1,1) : conjunctive component for (ka, kb, kc)
(1,1,0) : conjunctive component for (ka, kb, ¬kc)
(1,0,0) : conjunctive component for (ka, ¬kb, ¬kc)
ccq : a conjunctive component for q, ccq ∈ dnf(~dq)
The Boolean Model
q = ka ∧ (kb ∨ ¬kc)
sim(q, dj) = 1, if ∃ ccq ∈ dnf(~dq) | ∀ki, gi(~dj) = gi(ccq)
sim(q, dj) = 0, otherwise
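The matching rule above can be sketched as follows, using the DNF of the example query q = ka ∧ (kb ∨ ¬kc) over the vocabulary (ka, kb, kc); the document contents are hypothetical:

```python
def boolean_sim(doc_terms, dnf_components, vocab):
    """sim(q, dj) = 1 iff the document's binary term vector equals one of
    the conjunctive components of the query's disjunctive normal form."""
    doc_vector = tuple(1 if k in doc_terms else 0 for k in vocab)
    return 1 if doc_vector in dnf_components else 0

# q = ka AND (kb OR NOT kc), over the vocabulary (ka, kb, kc)
vocab = ("ka", "kb", "kc")
dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

print(boolean_sim({"ka", "kb"}, dnf, vocab))  # 1, matches (1, 1, 0)
print(boolean_sim({"kb", "kc"}, dnf, vocab))  # 0, ka is missing
```

Practical systems evaluate the Boolean expression directly over the inverted lists rather than enumerating conjunctive components, but the set-theoretic semantics is the same.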
Drawbacks of the Boolean Model
Retrieval based on binary decision criteria with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
Information need has to be translated into a Boolean expression, which most users find awkward
The Boolean queries formulated by the users are most often too simplistic
As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
The Vector Model
Use of binary weights is too limiting
Non-binary weights provide consideration for partial matches
These term weights are used to compute a degree of similarity between a query and each document
Ranked set of documents provides for better matching
The Vector Model
Define:
wij > 0 whenever ki ∈ dj
wiq > 0 associated with the pair (ki, q)
~dj = (w1j , w2j , . . . , wtj)
~dq = (w1q, w2q, . . . , wtq)
To each term ki is associated a unit vector ~i
The unit vectors ~i and ~j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
The t unit vectors ~i form an orthonormal basis for a t-dimensional space
In this space, queries and documents are represented as weighted vectors
The Vector Model
Similarity
sim(dj, q) = cos(θ) = (~dj • ~q) / (|~dj| × |~q|) = ∑_{i=1..t} wi,j × wi,q / (√(∑_{i=1..t} wi,j²) × √(∑_{i=1..t} wi,q²))
Since wi,j ≥ 0 and wi,q ≥ 0, we have 0 ≤ sim(dj, q) ≤ 1
A document is retrieved even if it matches the query terms only partially
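The cosine formula above can be sketched directly (the weight values are made up for illustration):

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between weighted vectors d and q of length t."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:   # empty doc or query: no similarity
        return 0.0
    return dot / (norm_d * norm_q)

d1 = [0.5, 0.8, 0.0]   # made-up weights over a vocabulary of t = 3 terms
q = [1.0, 0.0, 1.0]
print(round(cosine_sim(d1, q), 3))  # 0.375
```

Note that d1 matches only one of the two query terms, yet it still receives a positive score: this is the partial matching the slide refers to.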
The Vector Model
Similarity
sim(dj, q) = ∑_{i=1..t} wi,j × wi,q / (√(∑_{i=1..t} wi,j²) × √(∑_{i=1..t} wi,q²))
How to compute the weights wi,j and wi,q?
A good weight must take into account two effects:
quantification of intra-document contents (similarity)
tf factor, the term frequency within a document
quantification of inter-document separation (dissimilarity)
idf factor, the inverse document frequency
wi,j = tfi,j × idfi
The Vector Model
Let
N : total number of docs in the collection
ni : number of docs which contain ki
freqi,j : raw frequency of ki within dj
A normalized tf factor is given by
fi,j = freqi,j / maxl freql,j
where the maximum is computed over all terms which occur within the document dj
The idf factor is computed as
idfi = log(N / ni)
the log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with the term ki
The Vector Model
The best term-weighting schemes use weights which are given by
wi,j = fi,j × log(N / ni)
this strategy is called a tf-idf weighting scheme
For the query term weights, a suggestion is
wi,q = (0.5 + (0.5 × freqi,q / maxl freql,q)) × log(N / ni)
The vector model with tf-idf weights is a good ranking strategy with general collections
The vector model is usually as good as the known ranking alternatives; it is also simple and fast to compute
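A sketch of the tf-idf scheme above, with the normalized tf factor fi,j = freqi,j / maxl freql,j and idfi = log(N/ni); the documents are hypothetical:

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """wi,j = fi,j * log(N / ni), where fi,j = freqi,j / max_l freql,j."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # ni: number of documents that contain term ki
    n = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        freq = Counter(doc)
        max_freq = max(freq.values())
        weights.append({term: (f / max_freq) * math.log(N / n[term])
                        for term, f in freq.items()})
    return weights

docs = ["gold silver truck", "shipment of gold", "delivery of silver silver"]
w = tf_idf_weights(docs)
print(round(w[0]["truck"], 3))  # log(3/1), about 1.099
```

"truck" occurs in only one of the three documents, so it gets the largest idf; a term occurring in every document would get idf = log(1) = 0 and contribute nothing to the ranking.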
The Vector Model
Advantages:
term-weighting improves quality of the answer set
partial matching allows retrieval of docs that approximate the query conditions
cosine ranking formula sorts documents according to degree of similarity to the query
Disadvantages:
assumes independence of index terms (??); not clear that this is bad though
The Vector Model: Example 1
The Vector Model: Example 2
The Vector Model: Example 3
Probabilistic Model
Objective: to capture the IR problem using a probabilistic framework
Given a user query, there is an ideal answer set
Querying as specification of the properties of this ideal answer set (clustering)
But, what are these properties?
Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
Improve by iteration
Probabilistic Model
An initial set of documents is retrieved somehow
User inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
IR system uses this information to refine the description of the ideal answer set
By repeating this process, it is expected that the description of the ideal answer set will improve
Always have in mind the need to guess, at the very beginning, the description of the ideal answer set
Description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle
Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and the document representations. The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
But,
how to compute the probabilities?
what is the sample space?
The Ranking
Similarity
sim(dj, q) = P(R|~dj) / P(R̄|~dj) = (P(~dj|R) × P(R)) / (P(~dj|R̄) × P(R̄)) ∼ P(~dj|R) / P(~dj|R̄)
P(~dj|R) : probability of randomly selecting the document dj from the set R of relevant documents
P(R) : probability that a document randomly selected from the entire collection is relevant
P(~dj|R̄) and P(R̄) : analogous and complementary, defined with respect to the set R̄ of non-relevant documents
The Ranking
Assuming independence of index terms:
sim(dj, q) ∼ (∏_{gi(~dj)=1} P(ki|R) × ∏_{gi(~dj)=0} P(k̄i|R)) / (∏_{gi(~dj)=1} P(ki|R̄) × ∏_{gi(~dj)=0} P(k̄i|R̄))
P(ki|R) : probability that the index term ki is present in a document randomly selected from the set R of relevant documents
P(k̄i|R) : probability that the index term ki is not present in a document randomly selected from the set R
The probabilities associated with the set R̄ have meanings which are analogous to the ones just described
The Ranking
Finally,
sim(dj, q) ∼ ∑_{i=1..t} wi,q × wi,j × ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|R̄)) / P(ki|R̄) ] )
where
P(k̄i|R) = 1 − P(ki|R)
P(k̄i|R̄) = 1 − P(ki|R̄)
The Initial Ranking
Similarity
sim(dj, q) ∼ ∑_{i=1..t} wi,q × wi,j × ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|R̄)) / P(ki|R̄) ] )
How to obtain the probabilities P(ki|R) and P(ki|R̄)?
Estimates based on assumptions:
P(ki|R) = 0.5
P(ki|R̄) = ni / N, where ni is the number of docs that contain ki
Use this initial guess to retrieve an initial ranking
Improve upon this initial ranking
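A sketch of this initial ranking under the assumptions above: with P(ki|R) = 0.5 the first log term is log(0.5/0.5) = 0, so, with binary weights, the score reduces to summing log((N − ni)/ni) over the query terms present in the document. The documents here are hypothetical sets of terms:

```python
import math

def initial_rank(query_terms, docs):
    """Initial probabilistic ranking with P(ki|R) = 0.5 and P(ki|R_bar) = ni/N,
    using binary weights wi,j, wi,q in {0, 1}; the relevance-side log term
    vanishes, leaving log((N - ni)/ni) per matched query term."""
    N = len(docs)
    n = {k: sum(1 for d in docs if k in d) for k in query_terms}
    scores = []
    for d in docs:
        # skip terms occurring in no document or in every document (log undefined)
        s = sum(math.log((N - n[k]) / n[k])
                for k in query_terms if k in d and 0 < n[k] < N)
        scores.append(s)
    return scores

docs = [{"gold", "silver"}, {"gold"}, {"gold", "truck"}]
scores = initial_rank({"silver", "truck"}, docs)
print([round(s, 3) for s in scores])  # [0.693, 0, 0.693]
```

Rarer query terms contribute larger scores, which is an idf-like effect even though the model itself does not use tf or idf factors.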
Improving the Initial Ranking
sim(dj, q) ∼ ∑_{i=1..t} wi,q × wi,j × ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|R̄)) / P(ki|R̄) ] )
Let
V : set of docs initially retrieved
Vi : subset of the retrieved docs that contain ki
Reevaluate the estimates:
P(ki|R) = Vi / V
P(ki|R̄) = (ni − Vi) / (N − V)
Repeat recursively
Improving the Initial Ranking
sim(dj, q) ∼ ∑_{i=1..t} wi,q × wi,j × ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|R̄)) / P(ki|R̄) ] )
To avoid problems with small values of V and Vi (e.g., V = 1 and Vi = 0), add an adjustment factor of 0.5:
P(ki|R) = (Vi + 0.5) / (V + 1)
P(ki|R̄) = (ni − Vi + 0.5) / (N − V + 1)
Also, ni/N can be used as the adjustment factor:
P(ki|R) = (Vi + ni/N) / (V + 1)
P(ki|R̄) = (ni − Vi + ni/N) / (N − V + 1)
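A sketch of one re-estimation step using the 0.5 adjustment factor (the collection statistics below are made up):

```python
def reestimate(V, Vi, n, N):
    """One relevance-feedback re-estimation step with the 0.5 adjustment:
    P(ki|R)     = (Vi + 0.5) / (V + 1)
    P(ki|R_bar) = (ni - Vi + 0.5) / (N - V + 1)
    V: number of docs retrieved; Vi[k]: how many of them contain k;
    n[k]: docs in the whole collection containing k; N: collection size."""
    p_rel = {k: (Vi.get(k, 0) + 0.5) / (V + 1) for k in n}
    p_nonrel = {k: (n[k] - Vi.get(k, 0) + 0.5) / (N - V + 1) for k in n}
    return p_rel, p_nonrel

# hypothetical numbers: N = 100 docs, "gold" occurs in ni = 20 of them;
# V = 10 docs retrieved, 6 of which contain "gold"
p_rel, p_nonrel = reestimate(10, {"gold": 6}, {"gold": 20}, 100)
print(p_rel["gold"], p_nonrel["gold"])  # 6.5/11 and 14.5/91
```

The adjustment keeps both probabilities strictly between 0 and 1 even when Vi = 0 or Vi = V, so the log terms in the ranking formula remain defined.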
Pluses and Minuses
Advantages:
Docs ranked in decreasing order of probability of relevance
Disadvantages:
need to guess initial estimates for P(ki|R)
method does not take into account tf and idf factors
Brief Comparison of Classic Models
Boolean model does not provide for partial matches and is considered to be the weakest classic model
Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
This seems also to be the view of the research community