Retrieval models (week 13)
The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
Retrieval models (i)
A retrieval model is a formal (mathematical) representation of the process of matching a query and a document
Forms the basis of ranking results
[Figure: user query terms matched against documents in the collection (doc 123, doc 234, doc 257, doc 345, doc 455, doc 567, doc 678, doc 789, doc 881, doc 913, doc 972)]
Retrieval models (ii)
Goal: Retrieve exactly the documents that users want (whether they know it or not!)
A good retrieval model finds documents that are likely to be considered relevant by the user submitting the query (i.e. user relevance)
A good retrieval model also often considers topical relevance
Topical relevance
Given a query, topical relevance identifies documents judged to be on the same topic
Even though keyword-based document scores might show a lack of relevance!
[Figure: for the query “Abraham Lincoln,” topically relevant subjects include the Civil War, Tall Guys with Beards, Stovepipe Hats, and U.S. Presidents]
User relevance
User relevance is difficult to quantify because of each user’s subjectivity
Humans often have difficulty explaining why one document is more relevant than another
Humans may disagree about a given document’s relevance in relation to the same query
Boolean retrieval model (i)
In the Boolean retrieval model, there are exactly two possible outcomes for query processing:
TRUE (an exact match of the query specification)
FALSE (otherwise)
Ranking is nonexistent
Each matching document has a score of 1
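As a sketch, the Boolean model reduces to set operations over an inverted index. The toy index and document IDs below are invented for illustration:

```python
# A sketch of Boolean (AND) retrieval as set intersection over an
# inverted index. The index and document IDs are invented toy data.

inverted_index = {
    "tropical": {1, 3, 4},
    "fish":     {1, 2, 4},
    "aquarium": {2, 4},
}

def boolean_and(query_terms, index):
    """Return doc IDs matching ALL query terms; every match scores 1."""
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()

print(sorted(boolean_and(["tropical", "fish"], inverted_index)))  # [1, 4]
```

Every matching document is equally "TRUE," which is why the results come back unordered.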
Boolean retrieval model (ii)
Often the goal is to reduce the number of search results down to a manageable size
Typically called searching by numbers
Given a small enough set of results, human users can continue their search manually
Still a useful strategy, but the “best” results may be omitted
Boolean retrieval model (iii)
Advantages:
Results are predictable and explainable
Efficient and easy to implement
Disadvantages:
Query results are essentially unranked (instead ordered by date or title)
Effectiveness of query results depends entirely on the user’s ability to formulate the query
Vector space model (i)
The vector space model is a decades-old IR approach for implementing term weighting and document ranking
Documents are represented as vectors Di = (di1, di2, ..., dit) in a t-dimensional vector space
Each element dij represents the weight of term j in document i
t is the number of index terms
Vector space model (ii)
Given n documents, we can use a matrix to represent all term weights:

          term 1   term 2   ...   term t
   D1     d11      d12      ...   d1t
   D2     d21      d22      ...   d2t
   ...
   Dn     dn1      dn2      ...   dnt
Vector space model (iii)
Term weights are the term counts in each document
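The count-based weighting above can be sketched as follows; the three toy documents are invented for illustration:

```python
# A sketch of the n x t term-weight matrix where the weights are raw
# term counts. The three toy documents below are invented data.
from collections import Counter

docs = [
    "tropical fish include fish found in tropical environments",
    "fish live in water",
    "tropical rain forests",
]

# The sorted vocabulary supplies the t index terms (matrix columns).
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document: element [i][j] is the count of term j in document i.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])
```

Each row of `matrix` is one document vector Di; zero entries mark index terms the document does not contain.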
Vector space model (iv)
Query Q is represented by a t-dimensional vector of weights:

   Q = (q1, q2, ..., qt)

Each qj is the weight of term j in the query
Vector space model (v)
Given the query “tropical fish,” query vector Qa is below:

   Qa = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1)
   Qb = (1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0)
   Qc = (0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0)

What do query vectors Qb and Qc represent?
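A sketch of building such a binary query vector; the 11-term vocabulary below is an invented stand-in for the index terms (not necessarily the textbook's):

```python
# A sketch of mapping a query to a t-dimensional binary weight vector.
# This 11-term vocabulary is invented for illustration; it is NOT
# necessarily the textbook's actual index vocabulary.
vocab = ["aquarium", "bowl", "care", "fish", "freshwater", "goldfish",
         "homepage", "keep", "setup", "tank", "tropical"]

def query_vector(query, vocab):
    """qj = 1 if index term j appears in the query, else 0."""
    terms = set(query.lower().split())
    return [1 if term in terms else 0 for term in vocab]

print(query_vector("tropical fish", vocab))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
```

Only the positions of the query's terms in the vocabulary carry a nonzero weight.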
Vector space model (vi)
Conceptually, the document vector closest to the query vector is the most relevant
In reality, the distance function is not a good measure of relevance
Use a similarity measure instead (and maximize)
First, think normalization
Cosine correlation (i)
The cosine correlation measures the cosine of the angle between query and document vectors
Normalize vectors such that all documents and queries are of equal length
Cosine correlation (ii)
The cosine function is shown in blue below:
http://en.wikipedia.org/wiki/File:Sine_cosine_one_period.svg
Cosine correlation (iii)
Given document Di and query Q, the cosine measure is given by:

   Cosine(Di, Q) = Σj (dij · qj) / √( Σj dij² · Σj qj² )

normalization occurs in the denominator
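A minimal sketch of the cosine measure over plain weight vectors (toy values, invented for illustration):

```python
# A sketch of the cosine measure between a document vector and a query
# vector. The weight values below are invented toy data.
import math

def cosine(d, q):
    """Cosine of the angle between weight vectors d and q."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0

d1 = [2, 1, 0, 1]   # document term weights
q  = [1, 0, 0, 1]   # query term weights
print(round(cosine(d1, q), 3))  # 0.866
```

Because the denominator divides out both vector lengths, a long document cannot win simply by repeating terms.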
Term weighting (i)
Term weighting is often based on tf.idf:
The term frequency (tf) quantifies the importance of a term in a document

   tfik = fik / (word count of document Di)

▪ tfik is the term frequency weight of term k in document Di
▪ fik is the number of occurrences of term k in Di
▪ the denominator is the word count (of the words considered) in document Di
Term weighting (ii)
Term weighting is often based on tf.idf:
The inverse document frequency (idf) quantifies the importance of a term within the entire collection of documents

   idfk = log(N / nk)

▪ idfk is the inverse document frequency weight for term k
▪ N is the number of documents in the collection
▪ nk is the number of documents in which term k occurs
Term weighting (iii)
Obtain term weights by multiplying term frequency and inverse document frequency values together
Perform this calculation for each term
As new/updated documents are processed, the algorithm must recalculate idf
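Putting tf and idf together, a sketch over a toy collection (documents invented for illustration; log base 10 is an assumption, since the base only rescales the weights):

```python
# A sketch of tf.idf weighting over a toy collection (invented data).
# tf_ik = f_ik / |D_i|; idf_k = log(N / n_k); log base 10 is an assumption.
import math

docs = [
    ["tropical", "fish", "tropical", "aquarium"],
    ["fish", "tank", "water"],
    ["tropical", "rain"],
]

N = len(docs)
vocab = sorted({term for doc in docs for term in doc})
n = {k: sum(1 for doc in docs if k in doc) for k in vocab}   # n_k: doc frequency
idf = {k: math.log10(N / n[k]) for k in vocab}               # idf_k

def tfidf(doc):
    """Term weights d_ik = tf_ik * idf_k for one document."""
    return {k: (doc.count(k) / len(doc)) * idf[k] for k in set(doc)}

weights = tfidf(docs[0])
# "aquarium" outweighs "tropical": it is rarer in the collection, so its idf is higher.
```

Note that adding or updating documents changes N and nk, so the idf values (and therefore all term weights) must be recomputed, as the slide points out.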
What next?
Read and study Chapter 7
Do Exercises 7.1, 7.2, 7.3, and 7.4