Introduction to Information Retrieval
http://informationretrieval.org

IIR 7: Scores in a Complete Search System

Hinrich Schütze

Center for Information and Language Processing, University of Munich

2014-05-07

Overview

1 Recap

2 Why rank?

3 More on cosine

4 The complete search system

5 Implementation of ranking


Term frequency weight

The log frequency weight of term t in d is defined as follows:

$$w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
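As a minimal sketch of the definition above, the log frequency weight could be written as follows. The function name `log_tf_weight` is a hypothetical choice for illustration, not something from the lecture.

```python
import math


def log_tf_weight(tf: int) -> float:
    """Log frequency weight: 1 + log10(tf) if tf > 0, otherwise 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


# Raw counts grow much faster than the resulting weights.
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 2))
```

The logarithm dampens raw counts: a term that occurs 1000 times gets a weight only four times that of a term that occurs once.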


idf weight

The document frequency $\mathrm{df}_t$ is defined as the number of documents that t occurs in.

We define the idf weight of term t as follows:

$$\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}$$

idf is a measure of the informativeness of the term.


tf-idf weight

The tf-idf weight of a term is the product of its tf weight and its idf weight.

$$w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}$$
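The idf and tf-idf definitions can be combined in a short sketch. The function names and the collection size of one million documents are assumptions made for illustration; the slides do not fix a value for N here.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df)


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight: (1 + log10 tf) * log10(N / df), or 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * idf(n_docs, df)


N = 1_000_000  # assumed collection size, for illustration only

# A term found in every document carries no information (idf = 0);
# a rare term gets a high idf.
print(idf(N, N), idf(N, 1000))
print(round(tf_idf(tf=2, df=1000, n_docs=N), 2))
```

The weight increases with the number of occurrences within a document and with the rarity of the term in the collection.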


Cosine similarity between query and document

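$$\cos(\vec{q}, \vec{d}) = \mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \sum_{i=1}^{|V|} \frac{q_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}} \cdot \frac{d_i}{\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

$q_i$ is the tf-idf weight of term i in the query.

$d_i$ is the tf-idf weight of term i in the document.

$|\vec{q}|$ and $|\vec{d}|$ are the lengths of $\vec{q}$ and $\vec{d}$.

$\vec{q}/|\vec{q}|$ and $\vec{d}/|\vec{d}|$ are length-1 vectors (= normalized).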
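A minimal sketch of the cosine formula over sparse vectors follows. Representing each vector as a dictionary from term to weight is an assumption made for brevity, and `cosine_similarity` is a hypothetical name.

```python
import math


def cosine_similarity(q: dict, d: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    q_len = math.sqrt(sum(w * w for w in q.values()))
    d_len = math.sqrt(sum(w * w for w in d.values()))
    if q_len == 0.0 or d_len == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (q_len * d_len)


# Scaling a vector does not change its cosine with another vector.
q = {"rich": 1.0, "poor": 1.0}
d = {"rich": 3.0, "poor": 3.0}
print(cosine_similarity(q, d))
```

Because both vectors are divided by their lengths, only the direction of each vector matters, not its magnitude.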


Cosine similarity illustrated

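[Figure: the vectors $\vec{v}(q)$, $\vec{v}(d_1)$, $\vec{v}(d_2)$, $\vec{v}(d_3)$ and an angle θ, plotted on axes labeled "rich" and "poor", each running from 0 to 1.]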


tf-idf example: lnc.ltn

Query: “best car insurance”. Document: “car insurance auto insurance”.

          | query                             | document                      |
word      | tf-raw tf-wght     df  idf weight | tf-raw tf-wght weight n’lized | product
auto      |      0       0   5000  2.3      0 |      1       1      1    0.52 |       0
best      |      1       1  50000  1.3    1.3 |      0       0      0       0 |       0
car       |      1       1  10000  2.0    2.0 |      1       1      1    0.52 |    1.04
insurance |      1       1   1000  3.0    3.0 |      2     1.3    1.3    0.68 |    2.04

Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n’lized: document weights after cosine normalization, product: the product of final query weight and final document weight

$$\sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92$$

$$1/1.92 \approx 0.52 \qquad 1.3/1.92 \approx 0.68$$

Final similarity score between query and document:

$$\sum_i w_{q_i} \cdot w_{d_i} = 0 + 0 + 1.04 + 2.04 = 3.08$$
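The worked example can be replayed in a few lines, as a sketch rather than a reference implementation. The idf values are copied from the table; the variable names are hypothetical. Because the code does not round intermediate values, it arrives at roughly 3.07 rather than the 3.08 obtained on the slide from rounded weights.

```python
import math
from collections import Counter

# idf values copied from the table above (not recomputed from N and df).
table_idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}

query = "best car insurance".split()
document = "car insurance auto insurance".split()


def log_tf(tf: int) -> float:
    """Logarithmic tf weight: 1 + log10(tf) if tf > 0, otherwise 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


# Query side, ltn: log tf, idf, no normalization.
q_weights = {t: log_tf(c) * table_idf[t] for t, c in Counter(query).items()}

# Document side, lnc: log tf, no idf, cosine normalization.
d_raw = {t: log_tf(c) for t, c in Counter(document).items()}
d_len = math.sqrt(sum(w * w for w in d_raw.values()))
d_weights = {t: w / d_len for t, w in d_raw.items()}

# Final score: sum of products over the terms of the query.
score = sum(w * d_weights.get(t, 0.0) for t, w in q_weights.items())
print(round(d_len, 2), round(score, 2))
```

Only "car" and "insurance" contribute to the score: "best" does not occur in the document and "auto" does not occur in the query.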


Take-away today

The importance of ranking: User studies at Google

Length normalization: Pivot normalization

The complete search system

Implementation of ranking


Why is ranking so important?

Last lecture: Problems with unranked retrieval

Users want to look at a few results, not thousands.
It’s very hard to write queries that produce a few results.
Even for expert searchers
→ Ranking is important because it effectively reduces a large set of results to a very small one.

Next: More data on “users only look at a few results”


Empirical investigation of the effect of ranking

The following slides are from Dan Russell’s JCDL 2007 talk

Dan Russell was the “Uber Tech Lead for Search Quality & User Happiness” at Google.

How can we measure how important ranking is?

Observe what searchers do when they are searching in a controlled setting:

Videotape them
Ask them to “think aloud”
Interview them
Eye-track them
Time them
Record and count their clicks


Importance of ranking: Summary

Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).

Clicking: Distribution is even more skewed for clicking

In 1 out of 2 cases, users click on the top-ranked page.

Even if the top-ranked page is not relevant, 30% of users will click on it.

→ Getting the ranking right is very important.

→ Getting the top-ranked page right is most important.


Exercise

Ranking is also one of the high barriers to entry for competitors to established players in the search engine market.

Why?


Why distance is a bad idea

[Figure: the query and three documents plotted on axes labeled "rich" and "poor", each running from 0 to 1.]

q: [rich poor]

d1: Ranks of starving poets swell

d2: Rich poor gap grows

d3: Record baseball salaries in 2010

The Euclidean distance of $\vec{q}$ and $\vec{d}_2$ is large although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

That’s why we do length normalization or, equivalently, use cosine to compute query-document matching scores.
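The contrast between distance and cosine can be illustrated with a small sketch. The coordinates below are invented for the illustration and are not read off the figure: `d_same_mix` has the same term distribution as the query but is ten times longer, while `d_other` is short and contains only one of the two terms.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    len_a = math.sqrt(sum(x * x for x in a))
    len_b = math.sqrt(sum(y * y for y in b))
    return dot / (len_a * len_b)


# Hypothetical (rich, poor) weights, chosen only to make the point.
q = (1.0, 1.0)
d_same_mix = (10.0, 10.0)  # same distribution as q, much longer
d_other = (2.0, 0.0)       # short, but only mentions "rich"

for name, d in (("d_same_mix", d_same_mix), ("d_other", d_other)):
    print(name, round(euclidean(q, d), 2), round(cosine(q, d), 2))
```

Under Euclidean distance the long document with the matching distribution looks far from the query, while cosine ranks it first because cosine ignores vector length.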


Exercise: A problem for cosine normalization

Query q: “anti-doping rules Beijing 2008 olympics”

Compare three documents

d1: a short document on anti-doping rules at 2008 Olympics

d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping

d3: a short document on anti-doping rules at the 2004 Athens Olympics

What ranking do we expect in the vector space model?

What can we do about this?


Pivot normalization

Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).

Adjust cosine normalization by linear adjustment: “turning” the average normalization on the pivot

Effect: Similarities of short documents with query decrease; similarities of long documents with query increase.

This removes the unfair advantage that short documents have.
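The slides do not spell out the linear adjustment. One common way to write it, following the pivoted normalization scheme of Singhal and colleagues, replaces the cosine normalization factor by (1 - slope) · pivot + slope · (cosine normalization factor), with slope below 1. The sketch below assumes that form; the pivot and slope values are illustrative, not taken from the lecture.

```python
def pivoted_norm(cosine_norm: float, pivot: float, slope: float) -> float:
    """Pivoted normalization factor: the line through (pivot, pivot) with the given slope."""
    return (1.0 - slope) * pivot + slope * cosine_norm


PIVOT = 10.0  # assumed pivot, e.g. around the average cosine normalization factor
SLOPE = 0.75  # assumed slope; values below 1 tilt the line around the pivot

# A short document (small factor) is divided by a larger number than before,
# a long document (large factor) by a smaller one.
for cosine_norm in (4.0, 10.0, 40.0):
    print(cosine_norm, pivoted_norm(cosine_norm, PIVOT, SLOPE))
```

At the pivot the two normalizations agree. Below it the new factor is larger, so short-document scores drop; above it the new factor is smaller, so long-document scores rise, which matches the effect described above.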


Predicted and true probability of relevance

[Figure omitted. Source: Lillian Lee]


Pivot normalization

[Figure omitted. Source: Lillian Lee]


Pivoted normalization: Amit Singhal’s experiments

(relevant documents retrieved and (change in) average precision)

Outline

1 Recap

2 Why rank?

3 More on cosine

4 The complete search system

5 Implementation of ranking

Complete search system

Tiered indexes

Basic idea:

Create several tiers of indexes, corresponding to importance of indexing terms

During query processing, start with highest-tier index

If highest-tier index returns at least k (e.g., k = 100) results: stop and return results to user

If we’ve only found < k hits: repeat for next index in tier cascade

Example: two-tier system

Tier 1: Index of all titles

Tier 2: Index of the rest of documents

Pages containing the search words in the title are better hits than pages containing the search words in the body of the text.
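The tier cascade described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular search engine: the function names `search_tier` and `tiered_search`, the toy `title_tier` and `body_tier` data, and the use of conjunctive (AND) matching are all assumptions made for the example. The sketch also carries hits from higher tiers forward when it falls through to the next tier, which is one possible reading of "repeat for next index".

```python
from typing import Dict, List, Set

# Hypothetical toy representation: a tier maps each term to the docIDs containing it.
Tier = Dict[str, Set[int]]


def search_tier(tier: Tier, query_terms: List[str]) -> Set[int]:
    """Return docIDs that contain every query term in this tier (AND semantics, assumed)."""
    postings = [tier.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)


def tiered_search(tiers: List[Tier], query_terms: List[str], k: int) -> List[int]:
    """Walk the tiers from highest to lowest and stop as soon as k hits are found."""
    results: List[int] = []
    seen: Set[int] = set()
    for tier in tiers:
        for doc_id in sorted(search_tier(tier, query_terms)):
            if doc_id not in seen:
                seen.add(doc_id)
                results.append(doc_id)
        if len(results) >= k:
            break  # enough hits: no need to consult lower tiers
    return results[:k]


# Two-tier toy example: tier 1 indexes titles, tier 2 indexes document bodies.
title_tier: Tier = {"car": {1, 3}, "insurance": {3}}
body_tier: Tier = {"car": {1, 2, 3, 4}, "insurance": {2, 3, 4}}

print(tiered_search([title_tier, body_tier], ["car", "insurance"], k=1))
print(tiered_search([title_tier, body_tier], ["car", "insurance"], k=3))
```

With k = 1 the title tier alone supplies enough hits, so the body tier is never consulted; with k = 3 the cascade falls through to the second tier and title hits stay ahead of body hits.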

Tiered index

[Figure: a tiered index with three tiers (Tier 1, Tier 2, Tier 3). Each tier holds postings lists for the terms auto, best, car and insurance over the documents Doc1, Doc2 and Doc3.]

Tiered indexes

The use of tiered indexes is believed to be one of the reasons that Google search quality was significantly higher initially (2000/01) than that of competitors.

(along with PageRank, use of anchor text and proximity constraints)

Complete search system

Components we have introduced thus far

Document preprocessing (linguistic and otherwise)

Positional indexes

Tiered indexes

Spelling correction

k-gram indexes for wildcard queries and spelling correction

Query processing

Document scoring

Components we haven’t covered yet

Document cache: we need this for generating snippets (=dynamic summaries)

Zone indexes: They separate the indexes for different zones: the body of the document, all highlighted text in the document, anchor text, text in metadata fields, etc.

Machine-learned ranking functions

Proximity ranking (e.g., rank documents in which the query terms occur in the same local window higher than documents in which the query terms occur far from each other)

Query parser

Vector space retrieval: Interactions

How do we combine phrase retrieval with vector space retrieval?

We do not want to compute document frequency / idf for every possible phrase. Why?

How do we combine Boolean retrieval with vector space retrieval?

For example: “+”-constraints and “-”-constraints

Postfiltering is simple, but can be very inefficient – no easy answer.

How do we combine wild cards with vector space retrieval?

Again, no easy answer
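To make the postfiltering option concrete, here is a minimal sketch under stated assumptions: the ranked list is taken to come from ordinary vector space scoring, and the function name `postfilter`, the `doc_terms` mapping and the sample data are hypothetical. It also shows where the inefficiency comes from: every document in the ranked list has already been scored before the "+" and "-" constraints discard it.

```python
from typing import Dict, List, Set, Tuple


def postfilter(
    ranked: List[Tuple[int, float]],
    doc_terms: Dict[int, Set[str]],
    required: Set[str],
    forbidden: Set[str],
) -> List[Tuple[int, float]]:
    """Keep only ranked documents that satisfy the "+" and "-" constraints."""
    kept = []
    for doc_id, score in ranked:
        terms = doc_terms[doc_id]
        has_all_required = required <= terms        # every "+" term is present
        has_no_forbidden = not (forbidden & terms)  # no "-" term is present
        if has_all_required and has_no_forbidden:
            kept.append((doc_id, score))
    return kept


# Hypothetical ranked output of vector space scoring: (docID, score), best first.
ranked = [(1, 0.92), (2, 0.81), (3, 0.40)]
doc_terms = {
    1: {"car", "insurance", "cheap"},
    2: {"car", "rental"},
    3: {"car", "insurance"},
}

# Query sketch: car +insurance -cheap
print(postfilter(ranked, doc_terms, required={"insurance"}, forbidden={"cheap"}))
```

In this toy run the two highest-scoring documents are thrown away after scoring, which is the wasted work the slide alludes to.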

Exercise

Design criteria for tiered system

Each tier should be an order of magnitude smaller than the next tier.

The top 100 hits for most queries should be in tier 1, the top 100 hits for most of the remaining queries in tier 2 etc.

We need a simple test for “can I stop at this tier or do I have to go to the next one?”

There is no advantage to tiering if we have to hit most tiers for most queries anyway.

Consider a two-tier system where the first tier indexes titles and the second tier everything.

Question: Can you think of a better way of setting up a multitier system? Which “zones” of a document should be indexed in the different tiers (title, body of document, others?)? What criterion do you want to use for including a document in tier 1?

Outline

1 Recap

2 Why rank?

3 More on cosine

4 The complete search system

5 Implementation of ranking

Now we also need term frequencies in the index

Brutus −→ 1,2 7,3 83,1 87,2 . . .

Caesar −→ 1,1 5,1 13,1 17,1 . . .

Calpurnia −→ 7,1 8,2 40,1 97,3

term frequencies

We also need positions. Not shown here.

Term frequencies in the inverted index

Thus: In each posting, store tf_{t,d} in addition to docID d.

As an integer frequency, not as a (log-)weighted real number . . .

. . . because real numbers are difficult to compress.

Overall, additional space requirements are small: a byte per posting or less
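A small sketch of this layout: postings hold (docID, tf) pairs with integer term frequencies, and the real-valued weight is derived only when a posting is read. The dictionary `index` copies just the postings shown in the Brutus / Caesar / Calpurnia example (the elided entries are not filled in). The names `log_tf_weight` and `weighted_postings` are hypothetical, and the weighting 1 + log10(tf) is an assumption about which log weighting is meant.

```python
import math
from typing import Dict, List, Tuple

# term -> postings list of (docID, tf) pairs; tf is stored as a plain integer.
index: Dict[str, List[Tuple[int, int]]] = {
    "Brutus": [(1, 2), (7, 3), (83, 1), (87, 2)],
    "Caesar": [(1, 1), (5, 1), (13, 1), (17, 1)],
    "Calpurnia": [(7, 1), (8, 2), (40, 1), (97, 3)],
}


def log_tf_weight(tf: int) -> float:
    """Assumed log weighting: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def weighted_postings(term: str) -> List[Tuple[int, float]]:
    """Derive real-valued weights at query time from the stored integer frequencies."""
    return [(doc_id, log_tf_weight(tf)) for doc_id, tf in index.get(term, [])]


print(weighted_postings("Calpurnia"))
```

Keeping small integers in the postings lets them be compressed with the same integer codes used for docID gaps, while the weighting scheme can be changed later without rebuilding the index.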

How do we compute the top k in ranking?

We usually don’t need a complete ranking.

We just need the top k for a small k (e.g., k = 100).

If we don’t need a complete ranking, is there an efficient way of computing just the top k?

Naive:

Compute scores for all N documents

Sort

Return the top k

Not very efficient

Alternative: min heap

Use min heap for selecting top k out of N

A binary min heap is a binary tree in which each node’s value is less than the values of its children.

Takes O(N log k) operations to construct (where N is the number of documents) . . .

. . . then read off k winners in O(k log k) steps

Binary min heap

0.6

0.85 0.7

0.9 0.97 0.8 0.95

Selecting top k scoring documents in O(N log k)

Goal: Keep the top k documents seen so far

Use a binary min heap

To process a new document d′ with score s′:

Get current minimum h_m of heap (O(1))

If s′ ≤ h_m skip to next document

If s′ > h_m heap-delete-root (O(log k))

Heap-add d′/s′ (O(log k))
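The steps above can be sketched with Python's standard `heapq` module, which implements a binary min heap on a list. The function name `top_k` and the document labels d1 to d7 are hypothetical; the scores are borrowed from the binary min heap figure. Two details are assumptions beyond the slide: the heap is first filled with the first k documents, and `heapq.heapreplace` performs heap-delete-root and heap-add in one call.

```python
import heapq
from typing import Iterable, List, Tuple


def top_k(scored_docs: Iterable[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (doc, score) pairs, best first, in O(N log k)."""
    if k <= 0:
        return []
    heap: List[Tuple[float, str]] = []  # min heap: heap[0] is the lowest kept score
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))     # still filling the heap
        elif score > heap[0][0]:                      # s' > h_m
            heapq.heapreplace(heap, (score, doc_id))  # delete root, then add, O(log k)
        # otherwise s' <= h_m: skip to the next document
    # Read off the k winners, highest score first (O(k log k)).
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]


scores = [("d1", 0.6), ("d2", 0.85), ("d3", 0.7), ("d4", 0.9),
          ("d5", 0.97), ("d6", 0.8), ("d7", 0.95)]
print(top_k(scores, k=3))
```

Because the heap never holds more than k entries, memory stays at O(k) regardless of how many documents are scored.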

Even more efficient computation of top k?

Ranking has time complexity O(N) where N is the number of documents.

Optimizations reduce the constant factor, but they are still O(N), N > 10^10

Are there sublinear algorithms?

What we’re doing in effect: solving the k-nearest neighbor (kNN) problem for the query vector (= query point).

There are no general solutions to this problem that are sublinear.
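To make the O(N) cost concrete, here is a minimal sketch of brute-force top k: every document is scored against the query, and only then are the k best kept. The sparse dict representation, the helper names `cosine` and `brute_force_top_k`, and the toy collection are assumptions made for illustration, not part of the lecture.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def brute_force_top_k(query, docs, k):
    """Score all N documents (one cosine each), then keep the k best."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scored)


# Hypothetical toy collection: docID -> sparse term-weight vector.
toy_docs = {
    1: {"brutus": 1.0, "caesar": 1.0},
    2: {"caesar": 1.0},
    3: {"calpurnia": 1.0},
}
top = brute_force_top_k({"brutus": 1.0}, toy_docs, k=2)
```

Every document is touched once, which is exactly the linear cost that the heuristics on the following slides try to avoid.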

More efficient computation of top k: Heuristics

Idea 1: Reorder postings lists

Instead of ordering according to docID . . .
. . . order according to some measure of “expected relevance”.

Idea 2: Heuristics to prune the search space

Not guaranteed to be correct . . .
. . . but fails rarely.
In practice, close to constant time.
For this, we’ll need the concepts of document-at-a-time processing and term-at-a-time processing.

Non-docID ordering of postings lists

So far: postings lists have been ordered according to docID.

Alternative: a query-independent measure of “goodness” of a page

Example: PageRank g(d) of page d, a measure of how many “good” pages hyperlink to d (chapter 21)

Order documents in postings lists according to PageRank: g(d_1) > g(d_2) > g(d_3) > . . .

Define composite score of a document:

net-score(q, d) = g(d) + cos(q, d)

This scheme supports early termination: We do not have to process postings lists in their entirety to find top k.

Non-docID ordering of postings lists (2)

Order documents in postings lists according to PageRank: g(d_1) > g(d_2) > g(d_3) > . . .

Define composite score of a document:

net-score(q, d) = g(d) + cos(q, d)

Suppose: (i) g takes values in [0, 1]; (ii) g(d) < 0.1 for the document d we’re currently processing; (iii) smallest top k score we’ve found so far is 1.2

Then all subsequent scores will be < 1.1, because cos(q, d) ≤ 1 and every later document in the list has an even smaller g.

So we’ve already found the top k and can stop processing the remainder of postings lists.
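The stopping rule on this slide can be sketched as follows. This is a simplified illustration, assuming the candidates already arrive as one stream sorted by decreasing g(d) with their cosine attached; the function name and the toy numbers are hypothetical.

```python
import heapq


def top_k_with_early_termination(candidates, k):
    """candidates: iterable of (doc_id, g, cos) sorted by decreasing g.

    Returns (top_k, num_scored); top_k holds (net_score, doc_id) pairs
    in decreasing score order.
    """
    heap = []  # min-heap with the k best (net_score, doc_id) seen so far
    num_scored = 0
    for doc_id, g, cos in candidates:
        # cos(q, d) <= 1 and g only decreases from here on, so g + 1 is an
        # upper bound on the net-score of this and every later document.
        if len(heap) == k and g + 1.0 <= heap[0][0]:
            break
        num_scored += 1
        score = g + cos
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), num_scored


# Hypothetical data: (doc_id, g(d), cos(q, d)), sorted by decreasing g.
stream = [(1, 0.9, 0.5), (2, 0.8, 0.4), (3, 0.5, 0.3), (4, 0.09, 0.99), (5, 0.05, 1.0)]
top, scored = top_k_with_early_termination(stream, k=2)
```

With k = 2 the smallest top k score is about 1.2 once document 3 has been seen. Document 4 has g < 0.1, so its net-score is bounded by 1.1 and the loop stops without scoring documents 4 and 5.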

Questions?

Document-at-a-time processing

Both docID-ordering and PageRank-ordering impose a consistent ordering on documents in postings lists.

Computing cosines in this scheme is document-at-a-time.

We complete computation of the query-document similarity score of document d_i before starting to compute the query-document similarity score of d_{i+1}.

Alternative: term-at-a-time processing
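Document-at-a-time processing as described above can be sketched like this. It is a minimal illustration, assuming docID-ordered postings that store a precomputed weight per posting and an unnormalized dot product as the score; the function name and the toy postings (raw term frequencies used as weights) are assumptions.

```python
import heapq


def document_at_a_time(postings_lists, query_weights, k):
    """postings_lists: {term: [(doc_id, w_td), ...]} sorted by doc_id.
    query_weights: {term: w_tq}.
    Returns the k best (score, doc_id) pairs in decreasing order.
    """
    terms = [t for t in query_weights if t in postings_lists]
    positions = {t: 0 for t in terms}  # one cursor per postings list
    top = []  # min-heap of the k best (score, doc_id) so far
    while True:
        # Current head docID of every list that is not yet exhausted.
        heads = [postings_lists[t][positions[t]][0]
                 for t in terms if positions[t] < len(postings_lists[t])]
        if not heads:
            break
        doc_id = min(heads)  # next document in the shared ordering
        score = 0.0
        for t in terms:
            plist, i = postings_lists[t], positions[t]
            if i < len(plist) and plist[i][0] == doc_id:
                score += plist[i][1] * query_weights[t]
                positions[t] = i + 1
        # The score of doc_id is complete before any later document is touched.
        if len(top) < k:
            heapq.heappush(top, (score, doc_id))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, doc_id))
    return sorted(top, reverse=True)


# Toy postings (docID, weight); weights here are raw term frequencies.
toy_postings = {
    "brutus": [(1, 2), (7, 3), (83, 1), (87, 2)],
    "caesar": [(1, 1), (5, 1), (13, 1), (17, 1)],
}
result = document_at_a_time(toy_postings, {"brutus": 1.0, "caesar": 1.0}, k=3)
```

Because all lists share one ordering, the cursors only move forward and no per-document accumulator has to survive beyond the current docID.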

Weight-sorted postings lists

Idea: don’t process postings that contribute little to final score

Order documents in postings list according to weight

Simplest case: normalized tf-idf weight (rarely done: hard to compress)

Documents in the top k are likely to occur early in these ordered lists.

→ Early termination while processing postings lists is unlikely to change the top k.

But:

We no longer have a consistent ordering of documents in postings lists.
We no longer can employ document-at-a-time processing.
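One simple way to exploit weight-sorted lists is to read only a fixed-length prefix of each list. The sketch below assumes that stopping rule (the slide does not prescribe one), sums the stored weights as the score, and uses hypothetical names and toy data.

```python
def truncated_term_at_a_time(query_terms, weight_sorted_postings, prefix_len):
    """Process only the first prefix_len postings of each weight-sorted list.

    weight_sorted_postings: {term: [(doc_id, weight), ...]} sorted by
    decreasing weight. Returns (score, doc_id) pairs in decreasing order.
    """
    accumulators = {}
    for term in query_terms:
        for doc_id, weight in weight_sorted_postings.get(term, [])[:prefix_len]:
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + weight
    return sorted(((s, d) for d, s in accumulators.items()), reverse=True)


# Hypothetical lists; note that the docID order differs from list to list.
toy_lists = {
    "x": [(3, 0.9), (1, 0.5), (2, 0.1)],
    "y": [(1, 0.8), (3, 0.2), (4, 0.05)],
}
ranked = truncated_term_at_a_time(["x", "y"], toy_lists, prefix_len=2)
ranked_ids = [doc_id for _, doc_id in ranked]
```

In this toy case the two highest-scoring documents are the same as with full processing, but that is not guaranteed in general: a document whose postings all fall outside the prefixes is never seen.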

Term-at-a-time processing

Simplest case: completely process the postings list of the first query term

Create an accumulator for each docID you encounter

Then completely process the postings list of the second query term

. . . and so forth

Term-at-a-time processing

CosineScore(q)
  float Scores[N] = 0
  float Length[N]
  for each query term t
  do calculate w_{t,q} and fetch postings list for t
     for each pair (d, tf_{t,d}) in postings list
     do Scores[d] += w_{t,d} × w_{t,q}
  Read the array Length
  for each d
  do Scores[d] = Scores[d] / Length[d]
  return Top k components of Scores[]

The elements of the array “Scores” are called accumulators.
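A runnable rendering of CosineScore might look like the sketch below. It assumes the postings already store the document weight w_{t,d} (the pseudocode derives it from tf_{t,d}), that docIDs are array indices, and that zero-score documents are dropped before selecting the top k; the function name, parameters and toy data are illustrative.

```python
import heapq


def cosine_score(query_weights, postings, length, k):
    """Term-at-a-time CosineScore.

    query_weights: {term: w_tq}
    postings: {term: [(doc_id, w_td), ...]}
    length: list where length[d] is the normalization length of document d
    """
    scores = [0.0] * len(length)  # one accumulator per docID
    for term, w_tq in query_weights.items():
        # Finish the whole postings list of this term before the next term.
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_td * w_tq
    for doc_id, doc_length in enumerate(length):
        if doc_length > 0:
            scores[doc_id] /= doc_length
    # Keep only documents that received a nonzero score.
    ranked = ((s, d) for d, s in enumerate(scores) if s > 0)
    return heapq.nlargest(k, ranked)


# Hypothetical three-document collection with docIDs 0, 1, 2.
toy_postings = {"a": [(0, 2.0), (2, 1.0)], "b": [(0, 1.0), (1, 4.0)]}
toy_length = [1.0, 2.0, 0.5]
best = cosine_score({"a": 1.0, "b": 1.0}, toy_postings, toy_length, k=2)
```

The `scores` list plays the role of the accumulator array: each entry is built up term by term and is only final after the last postings list has been read.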

Accumulators

For the web (20 billion documents), an array of accumulators A in memory is infeasible.

Thus: Only create accumulators for docs occurring in postings lists

This is equivalent to: Do not create accumulators for docs with zero scores (i.e., docs that do not contain any of the query terms)

Accumulators: Example

Brutus −→ 1,2 7,3 83,1 87,2 . . .

Caesar −→ 1,1 5,1 13,1 17,1 . . .

Calpurnia −→ 7,1 8,2 40,1 97,3

For query: [Brutus Caesar]:

Only need accumulators for 1, 5, 7, 13, 17, 83, 87

Don’t need accumulators for 3, 8 etc.
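The example can be reproduced with a dictionary in place of a full array, so that an accumulator exists only for docIDs that actually occur in a query term's postings list. This sketch uses only the postings visible on the slide (entries elided by ". . ." are left out) and, as an assumption, adds up raw term frequencies as the score.

```python
from collections import defaultdict

# Postings as (docID, tf) pairs, limited to the entries shown on the slide.
postings = {
    "brutus": [(1, 2), (7, 3), (83, 1), (87, 2)],
    "caesar": [(1, 1), (5, 1), (13, 1), (17, 1)],
    "calpurnia": [(7, 1), (8, 2), (40, 1), (97, 3)],
}


def accumulate(query_terms, postings):
    """Create an accumulator lazily, on the first posting seen for a docID."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):
            accumulators[doc_id] += tf
    return dict(accumulators)


acc = accumulate(["brutus", "caesar"], postings)
acc_ids = sorted(acc)
```

For the query [Brutus Caesar], `acc_ids` is 1, 5, 7, 13, 17, 83, 87; documents 3 and 8 never get an accumulator because neither query term's list mentions them.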

Enforcing conjunctive search

We can enforce conjunctive search (à la Google): only consider documents (and create accumulators) if all terms occur.

Example: just one accumulator for [Brutus Caesar] in the example above . . .

. . . because only d_1 contains both words.
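Conjunctive search can be sketched by intersecting the docID sets of all query terms first and creating accumulators only for the survivors. The function name is hypothetical, the postings are the ones visible in the earlier example, and summing raw term frequencies is an assumed stand-in for the real weighting.

```python
def conjunctive_accumulators(query_terms, postings):
    """Accumulate scores only for documents that contain every query term."""
    doc_sets = [{doc_id for doc_id, _ in postings.get(t, [])} for t in query_terms]
    allowed = set.intersection(*doc_sets) if doc_sets else set()
    accumulators = {}
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):
            if doc_id in allowed:
                accumulators[doc_id] = accumulators.get(doc_id, 0.0) + tf
    return accumulators


# Postings as (docID, tf) pairs, limited to the entries shown in the example.
example_postings = {
    "brutus": [(1, 2), (7, 3), (83, 1), (87, 2)],
    "caesar": [(1, 1), (5, 1), (13, 1), (17, 1)],
}
conj = conjunctive_accumulators(["brutus", "caesar"], example_postings)
```

Only document 1 appears in both lists, so `conj` holds a single accumulator.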

Implementation of ranking: Summary

Ranking is very expensive in applications where we have to compute similarity scores for all documents in the collection.

In most applications, the vast majority of documents have similarity score 0 for a given query → lots of potential for speeding things up.

However, there is no fast nearest neighbor algorithm that is guaranteed to be correct even in this scenario.

In practice: use heuristics to prune search space – usually works very well.

Take-away today

The importance of ranking: User studies at Google

Length normalization: Pivot normalization

The complete search system

Implementation of ranking

Resources

Chapters 6 and 7 of IIR

Resources at http://cislmu.org

How Google tweaks its ranking function
Interview with Google search guru Udi Manber
Amit Singhal on Google ranking
SEO perspective: ranking factors
Yahoo Search BOSS: Opens up the search engine to developers. For example, you can rerank search results.
Compare Google and Yahoo ranking for a query
How Google uses eye tracking for improving search
