Retrieval models {week 13}

Page 1: Retrieval models { week  13}

Retrieval models {week 13}

The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.

from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

Page 2: Retrieval models { week  13}

Retrieval models (i)

A retrieval model is a formal (mathematical) representation of the process of matching a query and a document

Forms the basis of ranking results

[Figure: a user’s query terms are matched against documents in the collection (doc 123, doc 234, doc 345, …, doc 972)]

Page 3: Retrieval models { week  13}

Retrieval models (ii)

Goal: Retrieve exactly the documents that users want (whether they know it or not!)

A good retrieval model finds documents that are likely to be considered relevant by the user submitting the query (i.e. user relevance)

A good retrieval model also often considers topical relevance

Page 4: Retrieval models { week  13}

Topical relevance

Given a query, topical relevance identifies documents judged to be on the same topic, even though keyword-based document scores might show a lack of relevance!

[Figure: the query “Abraham Lincoln” is topically related to Civil War, Tall Guys with Beards, Stovepipe Hats, and U.S. Presidents]

Page 5: Retrieval models { week  13}

User relevance

User relevance is difficult to quantify because of each user’s subjectivity

Humans often have difficulty explaining why one document is more relevant than another

Humans may disagree about a given document’s relevance in relation to the same query


Page 6: Retrieval models { week  13}

Boolean retrieval model (i)

In the Boolean retrieval model, there are exactly two possible outcomes for query processing:
TRUE (an exact match of query specification)
FALSE (otherwise)

Ranking is nonexistent
Each matching document has a score of 1
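
As an illustration of exact-match processing (not from the slides; the toy documents, index, and helper names below are made up), a minimal Python sketch of a Boolean AND query over an inverted index:

from functools import reduce

# Hypothetical toy collection: document id -> text
docs = {
    123: "tropical fish live in tropical aquariums",
    234: "freshwater fish tanks",
    345: "tropical islands and beaches",
}

# Inverted index: term -> set of document ids containing that term
index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

def boolean_and(query):
    """Return ids of documents containing ALL query terms (each an unranked 'score 1' match)."""
    postings = [index.get(term, set()) for term in query.split()]
    return reduce(set.intersection, postings) if postings else set()

print(boolean_and("tropical fish"))   # {123} -- an exact match; no ranking among results

A Boolean OR query would take the union of the postings sets instead of the intersection.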

Page 7: Retrieval models { week  13}

Boolean retrieval model (ii)

Often the goal is to reduce the number of search results down to a manageable size
Typically called searching by numbers

Given a small enough set of results, human users can continue their search manually
Still a useful strategy, but the “best” results may be omitted

Page 8: Retrieval models { week  13}

Boolean retrieval model (iii)

Advantages:
Results are predictable and explainable
Efficient and easy to implement

Disadvantages:
Query results are essentially unranked (instead ordered by date or title)
Effectiveness of query results depends entirely on the user’s ability to formulate the query

Page 9: Retrieval models { week  13}

Vector space model (i)

The vector space model is a decades-old IR approach for implementing term weighting and document ranking

Documents are represented as vectors Di in a t-dimensional vector space

Each element dij represents the weight of term j in document i

t is the number of index terms

Page 10: Retrieval models { week  13}

Vector space model (ii)

Given n documents, we can use a matrix to represent all term weights:
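
Written out, with one row per document and one column per index term, the term-weight matrix has the form:

\begin{pmatrix}
d_{11} & d_{12} & \cdots & d_{1t} \\
d_{21} & d_{22} & \cdots & d_{2t} \\
\vdots & \vdots & \ddots & \vdots \\
d_{n1} & d_{n2} & \cdots & d_{nt}
\end{pmatrix}

Row i of this matrix is the document vector Di from the previous slide.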

Page 11: Retrieval models { week  13}

Vector space model (iii)

term weights are the term counts in each document
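
For example (a made-up two-document collection and vocabulary, not the textbook’s table), raw term counts become document vectors like this:

# Hypothetical vocabulary defining the t dimensions of the vector space
vocabulary = ["aquarium", "bowl", "care", "fish", "tropical"]

docs = {
    "D1": "tropical fish include fish found in tropical environments",
    "D2": "fish bowl care",
}

def count_vector(text):
    """Term-count vector: element j is how many times vocabulary[j] occurs in the text."""
    tokens = text.split()
    return [tokens.count(term) for term in vocabulary]

for doc_id, text in docs.items():
    print(doc_id, count_vector(text))
# D1 [0, 0, 0, 2, 2]
# D2 [0, 1, 1, 1, 0]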

Page 12: Retrieval models { week  13}

Vector space model (iv)

Query Q is represented by a t-dimensional vector of weights

Each qj is the weight of term j in the query

Page 13: Retrieval models { week  13}

Vector space model (v)

Given the query “tropical fish,” query vector Qa is below:

Qa   0 0 0 1 0 0 0 0 0 0 1
Qb   1 0 1 0 0 0 0 0 1 0 0
Qc   0 0 0 0 0 1 0 0 0 1 0

what do query vectors Qb and Qc represent?

Page 14: Retrieval models { week  13}

Vector space model (vi)

Conceptually, the document vector closest to the query vector is the most relevant

In reality, the distance function is not a good measure of relevance

Use a similarity measure instead (and maximize)

First, think normalization

Page 15: Retrieval models { week  13}

Cosine correlation (i)

The cosine correlation measures the cosine of the angle between query and document vectors

Normalize vectors such that all documents and queries are of equal length

Page 16: Retrieval models { week  13}

Cosine correlation (ii)

The cosine function is shown in blue below:

http://en.wikipedia.org/wiki/File:Sine_cosine_one_period.svg

Page 17: Retrieval models { week  13}

Cosine correlation (iii)

Given document Di and query Q, the cosine measure is given by:
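
Using the same dij and qj term weights over the t index terms, this is the standard cosine similarity:

Cosine(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij} \, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2 \cdot \sum_{j=1}^{t} q_j^2}}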

normalization occurs in the denominator

Page 18: Retrieval models { week  13}

Term weighting (i)

Term weighting is often based on tf.idf:
The term frequency (tf) quantifies the importance of a term in a document

▪ tfik is the term frequency weight of term k in document Di
▪ fik is the number of occurrences of term k in Di

the denominator of the tf formula below is the word count (of words considered) in document Di
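
A commonly used normalized form, consistent with these definitions, is:

tf_{ik} = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}}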

Page 19: Retrieval models { week  13}

Term weighting (ii)

Term weighting is often based on tf.idf:
The inverse document frequency (idf) quantifies the importance of a term within the entire collection of documents

▪ idfk is inverse document frequency weight for term k
▪ N is the number of documents in the collection
▪ nk is the number of documents in which term k occurs
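
With these quantities, idf is defined as:

idf_{k} = \log \frac{N}{n_k}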

Page 20: Retrieval models { week  13}

Term weighting (iii)

Obtain term weights by multiplying the term frequency and inverse document frequency values together

Perform this calculation for each term

As new/updated documents are processed, the algorithm must recalculate idf
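
Putting the pieces together, a compact Python sketch of tf.idf weighting plus cosine ranking (the collection, tokenizer, and function names are made up for illustration):

import math

# Hypothetical toy collection
docs = {
    "D1": "tropical fish include fish found in tropical environments",
    "D2": "fish bowls are a cheap way to keep fish",
    "D3": "tropical islands have beautiful beaches",
}

def tokenize(text):
    return text.lower().split()

# Vocabulary = all index terms; N = number of documents in the collection
vocabulary = sorted({term for text in docs.values() for term in tokenize(text)})
N = len(docs)

# idf_k = log(N / n_k), where n_k is the number of documents containing term k
n_k = {term: sum(term in tokenize(text) for text in docs.values()) for term in vocabulary}
idf = {term: math.log(N / n_k[term]) for term in vocabulary}

def tfidf_vector(text):
    """tf_ik = f_ik / (word count of the text); weight = tf * idf, one entry per vocabulary term."""
    tokens = tokenize(text)
    return [(tokens.count(term) / len(tokens)) * idf[term] for term in vocabulary]

def cosine(doc_vec, query_vec):
    """Cosine similarity; normalization happens in the denominator."""
    num = sum(d * q for d, q in zip(doc_vec, query_vec))
    denom = math.sqrt(sum(d * d for d in doc_vec) * sum(q * q for q in query_vec))
    return num / denom if denom else 0.0

# Weight the query with the same scheme, then rank documents by cosine similarity
query_vec = tfidf_vector("tropical fish")
ranked = sorted(docs, key=lambda d: cosine(tfidf_vector(docs[d]), query_vec), reverse=True)
print(ranked)  # document ids ordered by decreasing cosine similarity to the query

Adding or updating documents changes N and the n_k counts, so the idf values (and hence the weights) must be recomputed, as the slide notes.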

Page 21: Retrieval models { week  13}

What next?

Read and study Chapter 7

Do Exercises 7.1, 7.2, 7.3, and 7.4

