Introduction to Information Retrieval
Transcript
Page 1:

Introduction to Information Retrieval
http://informationretrieval.org

IIR 1: Boolean Retrieval

Hinrich Schütze

Institute for Natural Language Processing, Universität Stuttgart

2008.04.22


Page 2:

Definition of information retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).


Page 3:

Boolean retrieval

Queries are Boolean expressions, e.g., Caesar and Brutus

The search engine returns all documents that satisfy the Boolean expression.

Does Google use the Boolean model?


Page 4:

Outline

1 Introduction

2 Inverted index

3 Processing Boolean queries

4 Course overview



Page 6:

Unstructured data in 1650

Which plays of Shakespeare contain the words Brutus and Caesar, but not Calpurnia?

One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.

Why is grep not the solution?

Slow (for large collections)

"not Calpurnia" is non-trivial

Other operations (e.g., find the word Romans near countryman) not feasible

Ranked retrieval (best documents to return) – focus of later lectures, but not this one


Page 7:

Term-document incidence matrix

             Anthony &   Julius   The       Hamlet   Othello   Macbeth   . . .
             Cleopatra   Caesar   Tempest
Anthony      1           1        0         0        0         1
Brutus       1           1        0         1        0         0
Caesar       1           1        0         1        1         1
Calpurnia    0           1        0         0        0         0
Cleopatra    1           0        0         0        0         0
mercy        1           0        1         1        1         1
worser       1           0        1         1        1         0
. . .

Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar.
Entry is 0 if term doesn't occur. Example: Calpurnia doesn't occur in The Tempest.



Page 9:

Incidence vectors

So we have a 0/1 vector for each term.

To answer the query Brutus and Caesar and not Calpurnia:

Take the vectors for Brutus, Caesar, and Calpurnia

Complement the vector of Calpurnia

Do a (bitwise) AND on the three vectors:

110100 AND 110111 AND 101111 = 100100
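As an illustration (not from the slides), the query can be evaluated with Python integers standing in for the 0/1 vectors, one bit per play in the column order of the matrix above:

```python
# 0/1 vectors from the toy incidence matrix, one bit per play
# (Anthony & Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth).
BRUTUS    = 0b110100
CAESAR    = 0b110111
CALPURNIA = 0b010000

ALL_DOCS = 0b111111                       # mask for the six documents
result = BRUTUS & CAESAR & (~CALPURNIA & ALL_DOCS)
print(f"{result:06b}")                    # 100100: Anthony & Cleopatra, Hamlet
```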


Page 10:

0/1 vector for Brutus

             Anthony &   Julius   The       Hamlet   Othello   Macbeth   . . .
             Cleopatra   Caesar   Tempest
Anthony      1           1        0         0        0         1
Brutus       1           1        0         1        0         0
Caesar       1           1        0         1        1         1
Calpurnia    0           1        0         0        0         0
Cleopatra    1           0        0         0        0         0
mercy        1           0        1         1        1         1
worser       1           0        1         1        1         0
. . .

(The Brutus row, 1 1 0 1 0 0, is the 0/1 vector for Brutus.)


Page 11:

Bigger collections

Consider N = 10^6 documents, each with about 1000 tokens

On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 GB

Assume there are M = 500,000 distinct terms in the collection

(Notice that we are making a term/token distinction.)


Page 12:

Can’t build the incidence matrix

M × N = 500,000 × 10^6 = half a trillion 0s and 1s.

But the matrix has no more than one billion 1s.

Matrix is extremely sparse.

What is a better representation?

We only record the 1s.


Page 13:

Inverted Index

For each term t, we store a list of all documents that contain t.

Brutus    → 1  2  4  11  31  45  173  174

Caesar    → 1  2  4  5  6  16  57  132  . . .

Calpurnia → 2  31  54  101

...

The terms on the left make up the dictionary; each list of document IDs on the right is that term's postings list.


Page 14:

Inverted index construction

1 Collect the documents to be indexed:
  Friends, Romans, countrymen. So let it be with Caesar . . .

2 Tokenize the text, turning each document into a list of tokens:
  Friends Romans countrymen So . . .

3 Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
  friend roman countryman so . . .

4 Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.



Page 16:

Simple conjunctive query (two terms)

Consider the query: Brutus AND Calpurnia

To find all matching documents using the inverted index:

1 Locate Brutus in the dictionary
2 Retrieve its postings list from the postings file
3 Locate Calpurnia in the dictionary
4 Retrieve its postings list from the postings file
5 Intersect the two postings lists
6 Return intersection to user
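A sketch of these six steps in Python, with the intersection done by the standard linear merge of two sorted postings lists (O(m + n) comparisons); the helper names are illustrative:

```python
def intersect(p1, p2):
    """Step 5: linear merge of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def query_and(index, t1, t2):
    """Steps 1-4 and 6: look up both terms' postings and intersect them."""
    return intersect(index.get(t1, []), index.get(t2, []))

brutus    = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))   # [2, 31]
```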


Pages 17–28:

Intersecting two postings lists

Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia → 2 → 31 → 54 → 101

Intersection ⇒ 2 → 31


Page 29: (from the deck "The term vocabulary and postings lists")

Recall basic intersection algorithm

Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia → 2 → 31 → 54 → 101

Intersection ⇒ 2 → 31

Can we do better?


Page 30:

Skip lists
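The original slide is a figure of a postings list augmented with skip pointers and is not reproduced here. As a simplified sketch of the idea, assuming skip pointers spaced every ⌊√L⌋ entries and usable from any position: the merge may jump ahead whenever the skip target does not overshoot the other list's current document.

```python
import math

def intersect_with_skips(p1, p2):
    """Postings intersection with sqrt-spaced skip jumps (simplified sketch)."""
    s1 = max(1, int(math.sqrt(len(p1))))   # skip stride for p1
    s2 = max(1, int(math.sqrt(len(p2))))   # skip stride for p2
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # take the skip only if it does not overshoot p2[j]
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([1, 2, 4, 11, 31, 45, 173, 174],
                           [2, 31, 54, 101]))   # [2, 31]
```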


Page 31:

Outline

1 Recap

2 The term vocabulary

3 Skip pointers

4 Phrase queries


Page 32:

Definitions

Word – A delimited string of characters as it appears in the text.

Term – A "normalized" word (case, morphology, spelling etc.); an equivalence class of words.

Token – An instance of a word or term occurring in a document.

Type – The same as a term in most cases: an equivalence class of tokens.


Page 33:

Recall: Inverted index construction

Input:

Friends, Romans, countrymen. So let it be with Caesar . . .

Output:

friend roman countryman so . . .

Each token is a candidate for a postings entry.

What are valid tokens to emit?


Page 34:

Stop words

stop words = extremely common words which would appear to be of little value in helping select documents matching a user need

Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Stop word elimination used to be standard in older IR systems.

But you need stop words for phrase queries, e.g., "King of Denmark"

Most web search engines index stop words.


Page 35:

Lemmatization

Reduce inflectional/variant forms to base form

Example: am, are, is → be

Example: car, cars, car’s, cars’ → car

Example: the boy's cars are different colors → the boy car be different color

Lemmatization implies doing "proper" reduction to dictionary headword form (the lemma).

Inflectional morphology (cutting → cut) vs. derivational morphology (destruction → destroy)


Page 36:

Stemming

Definition of stemming: Crude heuristic process that chops off the ends of words in the hope of achieving what "principled" lemmatization attempts to do with a lot of linguistic knowledge.

Language dependent

Often inflectional and derivational

Example for derivational: automate, automatic, automation all reduce to automat


Page 37:

Porter stemmer: A few rules

Rule          Example
SSES → SS     caresses → caress
IES  → I      ponies → poni
SS   → SS     caress → caress
S    →        cats → cat
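A tiny sketch of just these four rules (the real Porter stemmer has several more steps; here, as in the stemmer, the longest matching suffix wins):

```python
RULES = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))

def porter_step_1a(word):
    """Apply the first matching rule from the table above."""
    for suffix, replacement in RULES:      # ordered longest-first
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))      # caress, poni, caress, cat
```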


Page 38:

Introduction to Information Retrieval
http://informationretrieval.org

IIR 6: Scoring, Term Weighting, The Vector Space Model

Hinrich Schütze

Institute for Natural Language Processing, Universität Stuttgart

2008.05.20


Page 39:

Outline

1 Recap

2 Term frequency

3 tf-idf weighting

4 The vector space


Page 40:

Ranked retrieval

Thus far, our queries have all been Boolean.

Documents either match or don’t.

Good for expert users with precise understanding of their needs and the collection.

Also good for applications: Applications can easily consume 1000s of results.

Not good for the majority of users.

Most users are not capable of writing Boolean queries (or they are, but they think it's too much work).

Most users don’t want to wade through 1000s of results.

This is particularly true of web search.


Page 41:

Problem with Boolean search: Feast or famine

Boolean queries often result in either too few (=0) or too many (1000s) results.

Query 1: “standard user dlink 650” → 200,000 hits

Query 2: “standard user dlink 650 no card found”: 0 hits

It takes a lot of skill to come up with a query that produces a manageable number of hits.

With a ranked list of documents it does not matter how large the retrieved set is.


Page 42:

Scoring as the basis of ranked retrieval

We wish to return in order the documents most likely to be useful to the searcher.

How can we rank-order the documents in the collection with respect to a query?

Assign a score – say in [0, 1] – to each document

This score measures how well document and query “match”.


Page 43:

Query-document matching scores

We need a way of assigning a score to a query/document pair.

Let’s start with a one-term query.

If the query term does not occur in the document: score should be 0.

The more frequent the query term in the document, the higher the score.


Page 44:

Recall: Binary incidence matrix

             Anthony &   Julius   The       Hamlet   Othello   Macbeth   . . .
             Cleopatra   Caesar   Tempest
Anthony      1           1        0         0        0         1
Brutus       1           1        0         1        0         0
Caesar       1           1        0         1        1         1
Calpurnia    0           1        0         0        0         0
Cleopatra    1           0        0         0        0         0
mercy        1           0        1         1        1         1
worser       1           0        1         1        1         0
. . .

Each document is represented by a binary vector ∈ {0,1}^|V|.


Page 45:

From now on, we will use the frequencies of terms

             Anthony &   Julius   The       Hamlet   Othello   Macbeth   . . .
             Cleopatra   Caesar   Tempest
Anthony      157         73       0         0        0         1
Brutus       4           157      0         2        0         0
Caesar       232         227      0         2        1         0
Calpurnia    0           10       0         0        0         0
Cleopatra    57          0        0         0        0         0
mercy        2           0        3         8        5         8
worser       2           0        1         1        1         5
. . .

Each document is represented by a count vector ∈ N^|V|.


Page 46:

Bag of words model

We do not consider the order of words in a document.

John is quicker than Mary and Mary is quicker than John are represented the same way.

This is called a bag of words model.



Page 48:

Term frequency tf

The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.

We want to use tf when computing query-document match scores.

But how?

Raw term frequency is not what we want.

A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term.

But not 10 times more relevant.

Relevance does not increase proportionally with term frequency.



Page 51:

Log frequency weighting

The log frequency weight of term t in d is defined as follows:

w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
w_{t,d} = 0                     otherwise

tf → weight: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Score for a document-query pair: sum over terms t in both q and d:

matching-score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

The score is 0 if none of the query terms is present in the document.
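A direct transcription of the two formulas, with hypothetical term counts for the document:

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def matching_score(query_terms, doc_tf):
    """Sum the log tf weights over terms occurring in both query and document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in set(query_terms))

doc_tf = {"brutus": 10, "caesar": 2}                    # hypothetical counts
print(matching_score(["brutus", "caesar", "calpurnia"], doc_tf))  # ≈ 3.301
```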




Page 55:

Document frequency

Rare terms are more informative than frequent terms.

Consider a term in the query that is rare in the collection (e.g., arachnocentric).

A document containing this term is very likely to be relevant.
→ We want a high weight for rare terms like arachnocentric.

Consider a term in the query that is frequent in the collection (e.g., high, increase, line).

A document containing this term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
→ For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.

We will use document frequency to factor this into computing the matching score.

The document frequency is the number of documents in the collection that the term occurs in.


Page 56:

idf weight

df_t is the document frequency, the number of documents that t occurs in.

df is an inverse measure of the informativeness of the term.

We define the idf weight of term t as follows:

idf_t = log10(N / df_t)

idf is a measure of the informativeness of the term.


Page 57:

Examples for idf

Compute idf_t using the formula idf_t = log10(1,000,000 / df_t):

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0
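The table follows directly from the formula:

```python
import math

N = 1_000_000

for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:<10} {math.log10(N / df):.0f}")
```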


Page 58:

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) · log10(N / df_t)

Best known weighting scheme in information retrieval

Note: the “-” in tf-idf is a hyphen, not a minus sign!


Page 59:

Summary: tf-idf

Assign a tf-idf weight for each term t in each document d:

w_{t,d} = (1 + log10 tf_{t,d}) · log10(N / df_t)

N: total number of documents

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection
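A one-function sketch of the scheme:

```python
import math

def tfidf_weight(tf, df, n_docs):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); 0 for absent terms."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# A term occurring 10 times in a document, in 1,000 of 1,000,000 documents:
print(tfidf_weight(10, 1_000, 1_000_000))   # (1 + 1) * 3 = 6.0
```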



Page 61:

Binary → count → weight matrix

             Anthony &   Julius   The       Hamlet   Othello   Macbeth   . . .
             Cleopatra   Caesar   Tempest
Anthony      5.25        3.18     0.0       0.0      0.0       0.35
Brutus       1.21        6.10     0.0       1.0      0.0       0.0
Caesar       8.59        2.54     0.0       1.51     0.25      0.0
Calpurnia    0.0         1.54     0.0       0.0      0.0       0.0
Cleopatra    2.85        0.0      0.0       0.0      0.0       0.0
mercy        1.51        0.0      1.90      0.12     5.25      0.88
worser       1.37        0.0      0.11      4.15     0.25      1.95
. . .

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.


Page 62:

Documents as vectors

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

So we have a |V |-dimensional real-valued vector space.

Terms are axes of the space.

Documents are points or vectors in this space.

Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine

This is a very sparse vector - most entries are zero.


Page 63:

Queries as vectors

Key idea 1: do the same for queries: represent them as vectors in the space

Key idea 2: Rank documents according to their proximity to the query



Page 65:

How do we formalize vector space similarity?

First cut: distance between two points

( = distance between the end points of the two vectors)

Euclidean distance?

Euclidean distance is a bad idea . . .

. . . because Euclidean distance is large for vectors of different lengths.


Page 66:

Why distance is a bad idea

[Figure: the query q and documents d1, d2, d3 plotted as points in a plane with axes "gossip" and "jealous".]

The Euclidean distance of q and d2 is large although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.


Page 67:

Use angle instead of distance

Rank documents according to angle with query

Thought experiment: take a document d and append it to itself. Call this document d′.

"Semantically" d and d′ have the same content.

The angle between the two documents is 0, corresponding to maximal similarity.

The Euclidean distance between the two documents can be quite large.


Page 68:

From angles to cosines

The following two notions are equivalent.

Rank documents according to the angle between query and document in increasing order.

Rank documents according to cosine(query, document) in decreasing order.

Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°].


Page 69:

Length normalization

How do we compute the cosine?

A vector can be (length-) normalized by dividing each of its components by its length – here we use the L2 norm:

||x||_2 = sqrt( Σ_i x_i^2 )

This maps vectors onto the unit sphere . . .

. . . since after normalization: ||x||_2 = sqrt( Σ_i x_i^2 ) = 1.0

As a result, longer documents and shorter documents have weights of the same order of magnitude.

Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.


Page 70:

Cosine similarity between query and document

cos(q, d) = sim(q, d) = (q · d) / (|q| |d|) = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i^2) · sqrt(Σ_{i=1}^{|V|} d_i^2) )

q_i is the tf-idf weight of term i in the query.

d_i is the tf-idf weight of term i in the document.

|q| and |d| are the lengths of q and d.

This is the cosine similarity of q and d . . . or, equivalently, the cosine of the angle between q and d.
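The same formula for sparse vectors stored as term → weight dicts (an illustrative helper, not course code):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = (q · d) / (|q| |d|) for sparse term -> weight dicts."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

print(cosine_similarity({"gossip": 1.0}, {"gossip": 2.0, "jealous": 2.0}))  # ≈ 0.707
```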


Page 71:

Cosine similarity illustrated

[Figure: unit-length vectors v(q), v(d1), v(d2), v(d3) on the "gossip"/"jealous" axes; θ marks the angle between the query vector and a document vector.]


Page 72:

Cosine: Example

How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

term frequencies (counts)

term        SaS    PaP    WH
affection   115    58     20
jealous     10     7      11
gossip      2      0      6
wuthering   0      0      38



Page 75:

Cosine: Example

log frequency weighting

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.0    1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

log frequency weighting & cosine normalization

term        SaS    PaP    WH
affection   0.789  0.832  0.524
jealous     0.515  0.555  0.465
gossip      0.335  0.0    0.405
wuthering   0.0    0.0    0.588

(To simplify this example, we don't do idf weighting.)

cos(SaS,PaP) ≈ 0.789 · 0.832 + 0.515 · 0.555 + 0.335 · 0.0 + 0.0 · 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
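A sketch that reproduces the example end to end: log tf weighting, length normalization, then a dot product of the resulting unit vectors.

```python
import math

tf = {"SaS": {"affection": 115, "jealous": 10, "gossip": 2},
      "PaP": {"affection": 58,  "jealous": 7},
      "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38}}

def log_weights(counts):
    return {t: 1 + math.log10(c) for t, c in counts.items() if c > 0}

def normalize(v):
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()}

vec = {name: normalize(log_weights(c)) for name, c in tf.items()}

def cos(a, b):
    # The vectors are already unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

print(round(cos(vec["SaS"], vec["PaP"]), 2))   # 0.94
print(round(cos(vec["SaS"], vec["WH"]), 2))    # 0.79
print(round(cos(vec["PaP"], vec["WH"]), 2))    # 0.69
```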

Page 76:

Summary: Ranked retrieval in the vector space model

Represent the query as a weighted tf-idf vector

Represent each document as a weighted tf-idf vector

Compute the cosine similarity between the query vector and each document vector

Rank documents with respect to the query

Return the top K (e.g., K = 10) to the user


