Date post: 17-Jan-2018
Information Retrieval and Web Search
IR models: Vector Space Model

Instructor: Rada Mihalcea

[Note: Some slides in this set were adapted from an IR course taught by Ray Mooney at UT Austin (who in turn adapted them from Joydeep Ghosh), and from an IR course taught by Chris Manning at Stanford.]

Ranked Retrieval

• Thus far, our queries have all been Boolean
  – Documents either match or don’t
• Good for expert users with a precise understanding of their needs and the collection
  – Also good for applications: applications can easily consume 1000s of results
• Not good for the majority of users
  – Most users are incapable of writing Boolean queries (or they are capable, but think it’s too much work)
  – Most users don’t want to wade through 1000s of results
• This is particularly true of Web search

Ch. 6

Problem with Boolean Search

• Boolean queries often result in either too few (= 0) or too many (1000s) results.
• Query 1: “standard user dlink 650” → 200,000 hits
• Query 2: “standard user dlink 650 no card found” → 0 hits
• It takes a lot of skill to come up with a query that produces a manageable number of hits.
  – AND gives too few; OR gives too many

Ch. 6

Ranked Retrieval Models

• Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query
• Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
• In principle, these are two separate choices, but in practice ranked retrieval has normally been associated with free text queries and vice versa

Not a Problem in Ranked Retrieval

• When a system produces a ranked result set, large result sets are not an issue
  – Indeed, the size of the result set is not an issue
  – We just show the top k (≈ 10) results
  – We don’t overwhelm the user
  – Premise: the ranking algorithm works

Ch. 6

Scoring as the Basis of Ranked Retrieval

• We wish to return, in order, the documents most likely to be useful to the searcher
• How can we rank-order the documents in the collection with respect to a query?
• Assign a score – say in [0, 1] – to each document
• This score measures how well document and query “match”.

Ch. 6
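As a minimal sketch of how per-document scores turn into a ranked result list, assume some scoring function has already assigned each document a score in [0, 1]. The document IDs and scores below are made up for illustration, and `top_k` is a hypothetical helper name, not something defined in the slides.

```python
import heapq

def top_k(scores, k=10):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])

# Hypothetical scores produced by some ranking function.
scores = {"d1": 0.12, "d2": 0.87, "d3": 0.40, "d4": 0.05}
print(top_k(scores, k=2))
```

Only the k best documents are returned, so the size of the full result set never reaches the user.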

Query-document Matching Scores

• We need a way of assigning a score to a query/document pair
• Let’s start with a one-term query
• If the query term does not occur in the document: score should be 0
• The more frequent the query term in the document, the higher the score (should be)

Take 1: Jaccard coefficient

• Jaccard: A commonly used measure of overlap of two sets A and B
• Jaccard(A,B) = |A ∩ B| / |A ∪ B|
• Jaccard(A,A) = 1
• Jaccard(A,B) = 0 if A ∩ B = ∅
• A and B don’t have to be the same size.
• Always assigns a number between 0 and 1.

Ch. 6

Exercise: Jaccard Coefficient

• What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
• Query: march of dimes
• Document 1: caesar died in march
• Document 2: the long march

Ch. 6
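The exercise can be checked with a short sketch. It assumes that query and documents are split on whitespace and treated as sets of terms; returning 0.0 for two empty sets is also an assumption, since the slides do not cover that case.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term collections."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets get a score of 0.0.
    return len(a & b) / len(union) if union else 0.0

query = "march of dimes".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()

print(jaccard(query, doc1))  # 1 shared term out of 6 distinct terms
print(jaccard(query, doc2))  # 1 shared term out of 5 distinct terms
```

Document 2 scores higher only because it is shorter, which hints at why a better length normalization is needed.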

Issues with Jaccard for Scoring

• It does not consider term frequency (how many times a term occurs in a document)
• Rare terms in a collection are more informative than frequent terms. Jaccard does not consider this information
• We need a more sophisticated way of normalizing for length

Ch. 6

Recall (from Boolean Retrieval): Binary Term-Document Incidence Matrix

Each document is represented by a binary vector ∈ {0,1}^|V|

Sec. 6.2

Term-document Count Matrices

• Consider the number of occurrences of a term in a document:
  – Each document is a count vector: a column below

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0

Sec. 6.2
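A count matrix like the one above can be built in a few lines. This is only a sketch: the three toy documents are hypothetical (they are not the Shakespeare plays), tokenization is plain whitespace splitting, and `count_matrix` is an illustrative name.

```python
from collections import Counter

def count_matrix(docs):
    """Map each term to its list of per-document counts (one entry per document)."""
    counts = [Counter(text.lower().split()) for text in docs]
    vocabulary = sorted(set().union(*counts))
    return {term: [c[term] for c in counts] for term in vocabulary}

# Toy documents, made up for illustration.
docs = ["caesar praised brutus", "brutus killed caesar caesar", "mercy mercy"]
matrix = count_matrix(docs)
print(matrix["caesar"])  # counts of "caesar" in each of the three documents
```

Each row of `matrix` is a term, and position j in a row belongs to document j, so reading one position across all rows gives that document's count vector.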

Vector-Space Model

• t distinct terms remain after preprocessing
  – Unique terms that form the VOCABULARY
• These “orthogonal” terms form a vector space. Dimension = t = |vocabulary|
  – 2 terms → bi-dimensional; …; t terms → t-dimensional
• Each term i in a document or query j is given a real-valued weight w_ij.
• Both documents and queries are expressed as t-dimensional vectors: d_j = (w_1j, w_2j, …, w_tj)

Vector-Space Model

Query as vector:
• We regard the query as a short document
• We return the documents ranked by the closeness of their vectors to the query, also represented as a vector.
• The vector-space model was developed in the SMART system (Salton, c. 1970) and is standardly used by TREC participants and web IR systems

Graphic Representation

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors in the three-dimensional space with axes T1, T2, T3]

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?

Document Collection Representation

• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

      T1     T2     …    Tt
D1    w11    w21    …    wt1
D2    w12    w22    …    wt2
:     :      :           :
Dn    w1n    w2n    …    wtn

Term Frequency tf

• The term frequency tf_ij of term i in document j is defined as the number of times that term i occurs in document j.
• More frequent terms in a document are more important, i.e. more indicative of the topic.
• May want to normalize term frequency (tf) by the count f_ij of the most frequent term in the document: $tf_{ij} = f_{ij} / \max_k f_{kj}$
• We want to use tf when computing query-document match scores. But how?
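The normalization above, dividing each count by the largest count in the same document, can be sketched as follows. The token list is a made-up example and `normalized_tf` is an illustrative name.

```python
from collections import Counter

def normalized_tf(tokens):
    """Divide every term count by the largest term count in the document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

# Hypothetical document with counts a: 3, b: 2, c: 1.
tf = normalized_tf(["a", "a", "a", "b", "b", "c"])
print(tf)
```

The most frequent term always gets tf = 1, so documents of different lengths become comparable on this scale.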

Term Frequency tf

• Raw term frequency is not what we want:
  – A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  – But not 10 times more relevant.
• Relevance does not increase proportionally with term frequency.

Document Frequency

• Rare terms are more informative than frequent terms
  – Recall stop words
• Consider a term in the query that is rare in the collection (e.g., arachnocentric)
• A document containing this term is very likely to be relevant to the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.

Sec. 6.2.1

Document Frequency (continued)

• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high, increase, line)
• A document containing such a term is more likely to be relevant than a document that doesn’t
• But it’s not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for words like high, increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.

Sec. 6.2.1

Idf Weight

• df_i is the document frequency of term i: the number of documents that contain term i
  – df_i is an inverse measure of the informativeness of term i
  – df_i ≤ N, where N is the number of documents in the collection
• We define the idf (inverse document frequency) of term i by
  $idf_i = \log(N / df_i)$
  – We use log(N/df_i) instead of N/df_i to “dampen” the effect of a high df (as compared to tf)

Sec. 6.2.1

idf example, suppose N = 1 million

term        df_i        idf_i
calpurnia   1
animal      100
sunday      1,000
fly         10,000
under       100,000
the         1,000,000

There is one idf value for each term i in a collection.

Sec. 6.2.1
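The table can be filled in with a short sketch of idf = log(N / df). The slides do not state the base of the logarithm here; base 10 is an assumption chosen because the document frequencies are powers of ten, and the `base` parameter is there so another base can be substituted.

```python
import math

def idf(df, n_docs, base=10):
    """Inverse document frequency log(N / df); the log base is an assumption."""
    return math.log(n_docs / df, base)

N = 1_000_000
table = [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
         ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]
for term, df in table:
    print(f"{term:10s} {df:>9,d} {idf(df, N):.1f}")
```

Whatever the base, a term that occurs in every document ("the") gets idf 0, and the rarest term gets the largest value.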

Collection vs. Document Frequency

• The collection frequency of i is the number of occurrences of i in the collection, counting multiple occurrences.
• Example:

Word        Collection frequency   Document frequency
insurance   10440                  3997
try         10422                  8760

• Which word is a better search term (and should get a higher weight)?

Sec. 6.2.1

tf-idf Weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight.
• Best known weighting scheme in information retrieval
  – Theoretically proven to work well (Papineni, NAACL 2001)
  – Note: the “-” in tf-idf is a hyphen, not a minus sign!
  – Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection

Sec. 6.2.2

Computing tf-idf: An Example

Given a document containing terms with given frequencies: A(3), B(2), C(1)
Assume the collection contains 10,000 documents and the document frequencies of these terms are: A(50), B(1300), C(250)
Then (tf normalized by the maximum frequency in the document; log is the natural logarithm):
A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
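The worked example can be recomputed with the sketch below. It assumes tf is the count divided by the largest count in the document and that the logarithm is natural, since that is the base that reproduces the idf values 5.3, 2.0, and 3.7. Without intermediate rounding, B comes out at about 1.36; the 1.3 in the example corresponds to multiplying the already rounded idf of 2.0 by 2/3.

```python
import math

N = 10_000                               # documents in the collection
freq = {"A": 3, "B": 2, "C": 1}          # term counts in the document
df = {"A": 50, "B": 1300, "C": 250}      # document frequencies

max_freq = max(freq.values())
weights = {}
for term, count in freq.items():
    tf = count / max_freq                # normalized term frequency
    idf = math.log(N / df[term])         # natural log (assumed base)
    weights[term] = tf * idf
    print(f"{term}: tf = {tf:.2f}; idf = {idf:.1f}; tf-idf = {weights[term]:.2f}")
```

Term A ends up with by far the largest weight because it is both the most frequent term in the document and the rarest in the collection.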

Binary → Count → Weight matrix

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Sec. 6.3

Documents as Vectors

• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
• These are very sparse vectors: most entries are zero.

Sec. 6.3

Query as Vector

• The query vector is typically treated as a document and also tf-idf weighted.
• An alternative is for the user to supply weights for the given query terms.

Similarity Measure

• We now have vectors for all documents in the collection and a vector for the query. How do we compute similarity?
• A similarity measure is a function that computes the degree of similarity between two vectors.
• Using a similarity measure between the query and each document:
  – It is possible to rank the retrieved documents in the order of presumed relevance.
  – It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

First Cut: Euclidean Distance

• Distance between vectors d1 and d2 is the length of the vector |d1 – d2|.
  – Euclidean distance
• Exercise: Determine the Euclidean distance between the vectors (0, 3, 2, 1, 10) and (2, 7, 1, 0, 0)
• Why is this not a great idea?
• We still haven’t dealt with the issue of length normalization
  – Long documents would be more similar to each other by virtue of length, not topic
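A minimal sketch for checking the exercise, assuming both vectors have the same number of components; `euclidean_distance` is an illustrative name.

```python
import math

def euclidean_distance(x, y):
    """Length of the difference vector |x - y|."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Squared differences: 4 + 16 + 1 + 1 + 100 = 122
print(euclidean_distance((0, 3, 2, 1, 10), (2, 7, 1, 0, 0)))  # sqrt(122), about 11.05
```

The single large component (10 versus 0) dominates the result, which illustrates how vector length can swamp topical overlap.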

Second Cut: Manhattan Distance

• Or “city block” measure
  – Based on the idea that generally in American cities you cannot follow a direct line between two points.
• Uses the formula:
  $ManhDist(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
• Exercise: Determine the Manhattan distance between the vectors (0, 3, 2, 1, 10) and (2, 7, 1, 0, 0)

[Figure with axes x and y]
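The Manhattan distance exercise can be checked with the sketch below, which sums the absolute component-wise differences and assumes vectors of equal length.

```python
def manhattan_distance(x, y):
    """Sum of absolute differences between corresponding components."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Absolute differences: 2 + 4 + 1 + 1 + 10
print(manhattan_distance((0, 3, 2, 1, 10), (2, 7, 1, 0, 0)))  # 18
```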

Third Cut: Inner Product

• Similarity between the vectors for documents d1 and d2 can be computed as the vector inner product:
  $sim(d_1, d_2) = d_1 \cdot d_2 = \sum_{i=1}^{t} w_{i1} w_{i2}$
  where w_ij is the weight of term i in document j
• For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
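Both cases can be illustrated with one small function. The binary vectors are made up for illustration; the weighted pair reuses D1 = (2, 3, 5) and Q = (0, 0, 2) from the graphic representation example.

```python
def inner_product(d1, d2):
    """Sum of the products of corresponding term weights."""
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

# Binary vectors (hypothetical): the result counts the terms present in both.
doc = (1, 1, 0, 1, 0)
query = (1, 0, 0, 1, 1)
print(inner_product(doc, query))            # 2 matched terms

# Weighted vectors: only terms with nonzero weight in both contribute.
print(inner_product((2, 3, 5), (0, 0, 2)))  # 5 * 2 = 10
```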

Properties of Inner Product

• Favors long documents with a large number of unique terms.
  – Again, the issue of normalization
• Measures how many terms matched but not how many terms are not matched.


Exercise

Cosine Similarity

• Distance between vectors d1 and d2 captured by the cosine of the angle θ between them.
• Note – this is similarity, not distance

[Figure: vectors d1 and d2 in the space with axes t1, t2, t3, separated by the angle θ]

Cosine Similarity

$CosSim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \, \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$

• w_ij – weight of term i in document j
• Cosine of angle between two vectors
• The denominator involves the lengths of the vectors
• So the cosine measure is also known as the normalized inner product

Exercise

• Documents: Sense and Sensibility (SaS), Pride and Prejudice (PaP), Wuthering Heights (WH)
• Vocabulary of four terms
• Measure cos(SaS,PaP), cos(SaS,WH)

Term counts:

term        SaS     PaP     WH
affection   115     58      20
jealous     10      7       11
gossip      2       0       6
wuthering   0       0       38

Length-normalized weights:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588
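The two cosines can be computed from the second table with the sketch below. It assumes each column of that table is a document vector over (affection, jealous, gossip, wuthering); `cosine` is an illustrative name and returns 0.0 for a zero vector by assumption.

```python
import math

def cosine(x, y):
    """Inner product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Component order: affection, jealous, gossip, wuthering
sas = (0.789, 0.515, 0.335, 0.0)
pap = (0.832, 0.555, 0.0, 0.0)
wh = (0.524, 0.465, 0.405, 0.588)

print(round(cosine(sas, pap), 2))  # 0.94
print(round(cosine(sas, wh), 2))   # 0.79
```

Because the weights in the table already have (nearly) unit length, the cosine is essentially the plain inner product of the columns.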

Exercise

• Rank the following by decreasing cosine similarity:
  – Two documents that have only frequent words (the, a, an, of) in common.
  – Two documents that have no words in common.
  – Two documents that have many rare words in common (wingspan, tailfin).

Cosine Similarity vs. Inner Product

• Cosine similarity measures the cosine of the angle between two vectors.
• Inner product normalized by the vector lengths.

$CosSim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2 \cdot \sum_{i=1}^{t} w_{iq}^2}}$

$InnerProduct(d_j, q) = d_j \cdot q$

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors in the space with axes t1, t2, t3, with the angles between Q and each document marked]

D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
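The comparison can be reproduced with a short sketch that uses the D1, D2, and Q vectors from the example; the function names are illustrative.

```python
import math

def inner_product(d, q):
    """Sum of the products of corresponding term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

def cos_sim(d, q):
    """Inner product normalized by the lengths of both vectors."""
    return inner_product(d, q) / math.sqrt(inner_product(d, d) * inner_product(q, q))

d1, d2, q = (2, 3, 5), (3, 7, 1), (0, 0, 2)

print(round(cos_sim(d1, q), 2), round(cos_sim(d2, q), 2))   # 0.81 0.13
print(inner_product(d1, q) / inner_product(d2, q))          # 5.0
print(round(cos_sim(d1, q) / cos_sim(d2, q), 1))            # 6.2
```

D2 is the longer vector, so normalizing by length lowers its score relative to D1 and widens the gap from 5 times to roughly 6 times.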

Comments on Vector Space Models

• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite obvious weaknesses.
• Allows efficient implementation for large document collections.

Problems with Vector Space Model

• Missing semantic information (e.g. word sense).
• Missing syntactic information (e.g. phrase structure, word order, proximity information).
• Assumption of term independence
• Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
  – Given a two-term query “A B”, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently.
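The two-term case can be made concrete with a sketch that scores by inner product. All weights below are hypothetical, and the choice of inner product is an assumption: a length-normalized measure such as cosine would rank these particular vectors the other way round.

```python
def inner_product(d, q):
    """Sum of the products of corresponding term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Components: (weight of A, weight of B). All values are made up.
query = (1, 1)
doc_only_a = (10, 0)   # A occurs frequently, B is absent
doc_both = (2, 2)      # both terms occur, each less frequently

print(inner_product(doc_only_a, query))  # 10
print(inner_product(doc_both, query))    # 4
```

A Boolean query "A AND B" would reject the first document outright, whereas the vector-space score ranks it first.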

