Term Weighting

CSE3201/4500
Information Retrieval Systems

Weighting Terms

• Having decided on a set of terms for indexing, we need to consider whether all terms should be given the same significance. If not, how should we decide on their significance?

Weighting Terms - tf

• Let tf_ij be the term frequency of term i in document j. The more a term appears in a document, the more likely it is to be a highly significant index term.

Weighting Terms - df & idf

• Let df_i be the document frequency of the i-th term.

• Since the significance increases with a decrease in the document frequency, we have the inverse document frequency:

  idf_i = log_e(N / df_i)

  where N is the number of documents in the database; log_e is the natural logarithm (ln on a calculator).

Weighting Terms - tf.idf

• The above two indicators are very often multiplied together to form the "tf.idf" weight:

  w_ij = tf_ij * idf_i

• or, as is now more popular:

  w_ij = log_e(1 + tf_ij) * (1 + idf_i)
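Both weighting formulas can be sketched in a few lines of Python (a minimal sketch; the function and variable names are mine, not from the slides):

```python
import math

def tf_idf(tf, df, n_docs):
    """Classic weight: w_ij = tf_ij * idf_i, with idf_i = ln(N / df_i)."""
    return tf * math.log(n_docs / df)

def tf_idf_damped(tf, df, n_docs):
    """The more popular variant: w_ij = ln(1 + tf_ij) * (1 + idf_i)."""
    return math.log(1 + tf) * (1 + math.log(n_docs / df))

# "eat" in D1 of the running example below: tf = 2, df = 2, N = 5 documents.
# 2 * ln(5/2) = 1.83 (the slides round ln(5/2) to 0.91, giving 1.82).
w_eat = tf_idf(2, 2, 5)
```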

Example

• Consider a 5-document collection:

D1 = "Dogs eat the same things that cats eat"
D2 = "No dog is a mouse"
D3 = "Mice eat little things"
D4 = "Cats often play with rats and mice"
D5 = "Cats often play, but not with other cats"

Example - Cont.

• We might generate the following index sets:

V1 = (dog, eat, cat)
V2 = (dog, mouse)
V3 = (mouse, eat)
V4 = (cat, play, rat, mouse)
V5 = (cat, play)

• System dictionary: (cat, dog, eat, mouse, play, rat)

Example - Cont.

dfcat = 3     idfcat = ln(5/3) = 0.51
dfdog = 2     idfdog = ln(5/2) = 0.91
dfeat = 2     idfeat = ln(5/2) = 0.91
dfmouse = 3   idfmouse = ln(5/3) = 0.51
dfplay = 2    idfplay = ln(5/2) = 0.91
dfrat = 1     idfrat = ln(5/1) = 1.61
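The df and idf values above can be derived mechanically from the index sets (a Python sketch; the dict names are mine):

```python
import math

# Index sets from the example; dictionary, df and idf are derived from them.
index_sets = {
    "V1": {"dog", "eat", "cat"},
    "V2": {"dog", "mouse"},
    "V3": {"mouse", "eat"},
    "V4": {"cat", "play", "rat", "mouse"},
    "V5": {"cat", "play"},
}
N = len(index_sets)  # 5 documents
dictionary = sorted(set().union(*index_sets.values()))
df = {t: sum(t in s for s in index_sets.values()) for t in dictionary}
idf = {t: math.log(N / df[t]) for t in dictionary}
# df: cat 3, dog 2, eat 2, mouse 3, play 2, rat 1
# idf: rat = ln 5 ≈ 1.61; cat/mouse = ln(5/3) ≈ 0.51;
#      dog/eat/play = ln(5/2) ≈ 0.916 (the slides show 0.91)
```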

Example - Cont.

• V1 (cat, eat, dog)
  – wcat = tfcat * idfcat = 1 * 0.51 = 0.51
  – wdog = tfdog * idfdog = 1 * 0.91 = 0.91
  – weat = tfeat * idfeat = 2 * 0.91 = 1.82

• V2 (dog, mouse)
  – wdog = tfdog * idfdog = 1 * 0.91 = 0.91
  – wmouse = tfmouse * idfmouse = 1 * 0.51 = 0.51

Example - Cont.

• V3 (mouse, eat)
  – wmouse = tfmouse * idfmouse = 1 * 0.51 = 0.51
  – weat = tfeat * idfeat = 1 * 0.91 = 0.91

• V4 (cat, mouse, play, rat)
  – wcat = tfcat * idfcat = 1 * 0.51 = 0.51
  – wplay = tfplay * idfplay = 1 * 0.91 = 0.91
  – wrat = tfrat * idfrat = 1 * 1.61 = 1.61
  – wmouse = tfmouse * idfmouse = 1 * 0.51 = 0.51

Example - Cont.

• V5 (cat, play)
  – wcat = tfcat * idfcat = 2 * 0.51 = 1.02
  – wplay = tfplay * idfplay = 1 * 0.91 = 0.91

Example - Cont.

• Dictionary: (cat, dog, eat, mouse, play, rat)
• Weights:

V1 = [cat(0.51), dog(0.91), eat(1.82), 0, 0, 0]
V2 = [0, dog(0.91), 0, mouse(0.51), 0, 0]
V3 = [0, 0, eat(0.91), mouse(0.51), 0, 0]
V4 = [cat(0.51), 0, 0, mouse(0.51), play(0.91), rat(1.61)]
V5 = [cat(1.02), 0, 0, 0, play(0.91), 0]

A Larger Example

• Doc 1: The problem of how to describe documents for retrieval is called indexing.
• Doc 2: It is possible to use a document as its own index.
• Doc 3: The problem is that a document will exactly match only one query, namely the document itself.
• Doc 4: The purpose of indexing then is to provide a description of a document so that it can be retrieved with queries that concern the same subject as the document.
• Doc 5: It must be a sufficiently specific description so that the document will not be returned for queries unrelated to the document.

A Larger Example

• Doc 6: A simple way of indexing a document is to give a single code from a predefined set.
• Doc 7: We have the task of describing how we are going to match queries against documents.
• Doc 8: The vector space model creates a space in which both documents and queries are represented by vectors.
• Doc 9: A vector is obtained for each document and query from sets of index terms with associated weights.
• Doc 10: In order to compare the similarity of these vectors, we may measure the angle between them.

A Larger Example

• If we index these documents using all words not on a stop list, we might obtain:

D1 - problem, describe, documents, retrieval, called, indexing
D2 - possible, document, own, index
D3 - problem, document (*), exactly, match, one, query, namely
D4 - purpose, indexing, provide, description, document (*), retrieved, queries, concern, subject
D5 - sufficiently, specific, description, document (*), returned, queries, unrelated

A Larger Example

• If we index these documents using all words not on a stop list, we might obtain:

D6 - simple, way, indexing, document, give, single, code, predefined, list
D7 - task, describing, going, match, queries, against, documents
D8 - vector (*), space (*), model, creates, documents, queries, represented
D9 - vector, obtained, document, query, sets, index, terms, associated, weights
D10 - order, compare, similarity, vectors, measure, angle

A Larger Example

• We may now choose to stem the terms, which may leave us with:

D1 - problem, describ, docu, retriev, call, index
D2 - possibl, docu, own, index
D3 - problem, docu (*), exact, match, on, quer, name
D4 - purpos, index, provid, descript, docu (*), retriev, quer, concern, subject
D5 - suffic, specif, descript, docu (*), return, quer, unrelat

A Larger Example

• We may now choose to stem the terms, which may leave us with:

D6 - simpl, way, index, docu, giv, singl, cod, predefin, list
D7 - task, describ, go, match, quer, against, docu
D8 - vect (*), spac (*), model, creat, docu, quer, represent
D9 - vect, obtain, docu, quer, set, index, terms, associat, weight
D10 - order, compar, similarit, vect, measur, angle

Document Frequencies

against 1     angl 1        associat 1
call 1        cod 1         compar 1
concern 1     creat 1       describ 2
descript 2    docu 8        exact 1
giv 1         go 1          index 5
list 1        match 2       measur 1
model 1       name 1        obtain 1
on 1          order 1       own 1
possibl 1     predefin 1    problem 2
provid 1      purpos 1      quer 6
represent 1   return 1      retriev 2
set 1         similarit 1   simpl 1
singl 1       spac 1        specif 1
subject 1     suffic 1      task 1
term 1        unrelat 1     vect 3

A Larger Example

• We can now calculate the weights of the terms of one of the documents. For document 8, using the tf.idf formula, we give the terms the following weights:

vect (2.41), spac (4.60), model (2.30), creat (2.30), docu (0.22), quer (0.51), represent (2.30)
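These document-8 weights follow directly from the term frequencies in Doc 8 and the document-frequency table (a Python check; the variable names are mine):

```python
import math

N = 10  # documents in the larger collection
# Term frequencies in document 8, and document frequencies from the table:
tf8 = {"vect": 2, "spac": 2, "model": 1, "creat": 1,
       "docu": 1, "quer": 1, "represent": 1}
df = {"vect": 3, "spac": 1, "model": 1, "creat": 1,
      "docu": 8, "quer": 6, "represent": 1}
w8 = {t: tf * math.log(N / df[t]) for t, tf in tf8.items()}
# vect 2 * ln(10/3) ≈ 2.41, spac 2 * ln(10) ≈ 4.61 (slides show 4.60),
# model/creat/represent ln(10) ≈ 2.30, docu ln(10/8) ≈ 0.22, quer ≈ 0.51
```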


Retrieval Model

CSE3201/4500

Information Retrieval Systems

Retrieval Process

[Diagram: information needs lead to query formulation; document collections go through indexing. Both queries and documents are represented in the keyword space, where they are matched for being SIMILAR.]

Retrieval Paradigms

• How do we match?
  – Produce non-ranked output
    • Boolean retrieval
  – Produce ranked output
    • vector space model
    • probabilistic retrieval

Advantages of Ranking

• Good control over how many documents are viewed by a user.
• Good control over the order in which documents are viewed by a user.
• The first documents that are viewed may help modify the order in which later documents are viewed.
  – The main disadvantage is computational cost.

Boolean Retrieval

• A query is a set of terms combined by the Boolean connectives "and", "or" and "not".
  – e.g. FIND (document OR information) AND retrieval AND (NOT (information AND systems))

• Each document is matched against this query and either matches (TRUE) or it doesn't (FALSE).

Systems Provide

• Most systems provide match information such as:

  – FIND (document OR information)
  – 1,000 records found
  – FIND (document OR information) AND retrieval
  – 40 records found
  – FIND (document OR information) AND retrieval AND (NOT (information AND systems))
  – 10 records found
  – SHOW

An Example

• Consider the following document collection:

D1 = "Dogs eat the same things that cats eat"
D2 = "No dog is a mouse"
D3 = "Mice eat little things"
D4 = "Cats often play with rats and mice"
D5 = "Cats often play, but not with other cats"

• indexed by:

D1 = dog, eat, cat
D2 = dog, mouse
D3 = mouse, eat
D4 = cat, play, rat, mouse
D5 = cat, play


An Example

• The Boolean query (cat AND dog) returns D1

• (cat OR (dog AND eat)) returns D1, D4, D5
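Boolean matching over the index sets reduces to set-membership tests (a Python sketch; the dict and variable names are mine):

```python
# Index sets from the example above.
index = {
    "D1": {"dog", "eat", "cat"},
    "D2": {"dog", "mouse"},
    "D3": {"mouse", "eat"},
    "D4": {"cat", "play", "rat", "mouse"},
    "D5": {"cat", "play"},
}

# (cat AND dog): both terms must be present in the document's term set.
cat_and_dog = [d for d, terms in index.items()
               if "cat" in terms and "dog" in terms]

# (cat OR (dog AND eat)):
cat_or_dog_and_eat = [d for d, terms in index.items()
                      if "cat" in terms or ("dog" in terms and "eat" in terms)]
```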

Problem with Boolean

• No ranking
  – users must fuss with retrieved set size, structural reformulation
  – users must scan the entire retrieved set

• No weights on query terms
  – users cannot give more importance to some terms --- retrieval:2 AND system:1
  – users cannot give more importance to some clauses --- retrieval:1 AND (system OR model):2

Problem with Boolean

• No weights on document terms
  – no use can be made of the importance of a term in a document --- if it occurs frequently
  – no use can be made of the importance of a term in the collection --- if it occurs rarely


Any Good News for Boolean?

• Yes.

• Advantages– conceptually simple– computationally inexpensive– commercially available

Introduction to Vectors

• A.B = |A||B| cos θ
• A = (a1, a2, a3, …, an), B = (b1, b2, b3, …, bn)
• A.B = a1b1 + a2b2 + a3b3 + … + anbn
• The magnitude of a vector A = (a1, a2, a3, …, an) is defined as

  |A| = sqrt(a1^2 + a2^2 + a3^2 + … + an^2)

Similarity Measures

• Inner product
  – A.B = a1b1 + a2b2 + a3b3 + … + anbn

• Cosine
  – cos θ = A.B / (|A| |B|)
          = (a1b1 + a2b2 + … + anbn) / (sqrt(a1^2 + a2^2 + … + an^2) × sqrt(b1^2 + b2^2 + … + bn^2))
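Both measures are a few lines of Python (a minimal sketch; the function names are mine):

```python
import math

def dot(a, b):
    """Inner product: a1*b1 + a2*b2 + ... + an*bn."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    """|A| = sqrt(a1^2 + a2^2 + ... + an^2)."""
    return math.sqrt(sum(x * x for x in a))

def cosine(a, b):
    """cos(theta) = A.B / (|A| |B|)."""
    return dot(a, b) / (magnitude(a) * magnitude(b))
```

Parallel vectors have cosine 1 and orthogonal vectors cosine 0, which is why the cosine measures direction (topic) independently of vector length (document size).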

The Vector Space Model

• Each document and query is represented by a vector. A vector is obtained for each document and query from sets of index terms with associated weights.
• The document and query representatives are considered as vectors in n-dimensional space, where n is the number of unique terms in the dictionary/document collection.
• Measuring vector similarity:
  – inner product
  – value of the cosine of the angle between the two vectors.

Vector Space

• Assume that the document is represented by vector D and the query is represented by vector Q.
• The total number of terms in the dictionary is n.
• Similarity between D and Q is measured by the angle θ between them.

Inner Product

  sim(Dj, Q) = Σ (i = 1..n) w_ij * w_iq

Cosine

• The similarity between D and Q can be written as:

  sim(Dj, Q) = cos θ = Dj.Q / (|Dj| |Q|)
             = Σ (i = 1..n) d_i q_i / (sqrt(Σ (i = 1..n) d_i^2) × sqrt(Σ (i = 1..n) q_i^2))

• Using the weights of the terms as the components of D and Q:

  sim(Dj, Q) = Dj.Q / (|Dj| |Q|)
             = Σ (i = 1..n) w_ij w_iq / (sqrt(Σ (i = 1..n) w_ij^2) × sqrt(Σ (i = 1..n) w_iq^2))

Simple Example (1)

• Assume:
  – there are 2 terms in the dictionary (t1, t2)
  – Doc-1 contains t1 and t2, with weights 0.5 and 0.3 respectively
  – Doc-2 contains t1 with weight 0.6
  – Doc-3 contains t2 with weight 0.4
  – the query contains t2 with weight 0.5

Simple Example (2)

• The vectors for the query and documents:

  Doc#   wt1   wt2
  1      0.5   0.3
  2      0.6   0
  3      0     0.4

Doc-1 = (0.5, 0.3)
Doc-2 = (0.6, 0)
Doc-3 = (0, 0.4)
Query = (0, 0.5)

Simple Example - Inner Product

• D1 = 0.5×0 + 0.3×0.5 = 0.15
• D2 = 0.6×0 + 0×0.5 = 0
• D3 = 0×0 + 0.4×0.5 = 0.2

• Ranking: D3, D1, D2

  Doc#   wt1   wt2
  1      0.5   0.3
  2      0.6   0
  3      0     0.4

Query = (0, 0.5)
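The inner-product ranking above can be reproduced in a couple of lines (a Python sketch; the dict names are mine):

```python
docs = {"D1": (0.5, 0.3), "D2": (0.6, 0.0), "D3": (0.0, 0.4)}
query = (0.0, 0.5)

# Inner product of each document vector with the query vector.
scores = {d: sum(w * q for w, q in zip(vec, query)) for d, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
# scores: D1 0.15, D2 0.0, D3 0.2 -> ranking D3, D1, D2
```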

Simple Example - Cosine

Similarity measured between the query (Q) and each document:

• Doc-1: sim(D1, Q) = (0.5×0 + 0.3×0.5) / (sqrt(0.5^2 + 0.3^2) × sqrt(0^2 + 0.5^2))
                    = 0.15 / (0.583 × 0.5) = 0.515

• Doc-2: sim(D2, Q) = (0.6×0 + 0×0.5) / (sqrt(0.6^2 + 0^2) × sqrt(0^2 + 0.5^2))
                    = 0 / (0.6 × 0.5) = 0

• Doc-3: sim(D3, Q) = (0×0 + 0.4×0.5) / (sqrt(0^2 + 0.4^2) × sqrt(0^2 + 0.5^2))
                    = 0.2 / (0.4 × 0.5) = 1

Ranked output: D3, D1, D2

Large Example (1)

• Consider the same five-document collection:

D1 = "Dogs eat the same things that cats eat"
D2 = "No dog is a mouse"
D3 = "Mice eat little things"
D4 = "Cats often play with rats and mice"
D5 = "Cats often play, but not with other cats"

• Indexed by:

V1 = (dog, eat, cat)
V2 = (dog, mouse)
V3 = (mouse, eat)
V4 = (cat, play, rat, mouse)
V5 = (cat, play)

Large Example (2)

• The set of all terms (dictionary): (cat, dog, eat, mouse, play, rat)
• Using tf.idf weights, we obtain:

V1 = (cat(0.51), eat(1.82), dog(0.91))
V2 = (dog(0.91), mouse(0.51))
V3 = (mouse(0.51), eat(0.91))
V4 = (cat(0.51), play(0.91), rat(1.61), mouse(0.51))
V5 = (cat(1.02), play(0.91))

Large Example (3)

• In the vector space model, we obtain the vectors (components ordered by the dictionary):

D1 = (0.51, 0.91, 1.82, 0.00, 0.00, 0.00)
D2 = (0.00, 0.91, 0.00, 0.51, 0.00, 0.00)
D3 = (0.00, 0.00, 0.91, 0.51, 0.00, 0.00)
D4 = (0.51, 0.00, 0.00, 0.51, 0.91, 1.61)
D5 = (1.02, 0.00, 0.00, 0.00, 0.91, 0.00)

• a 6-dimensional space for 6 terms

Inner-Product

• Query: "what do cats play with?" forms the query vector (0.51, 0.00, 0.00, 0.00, 0.91, 0.00)

• D1 = 0.51×0.51 + 0×0.91 + 0×1.82 + 0×0 + 0×0.91 + 0×0 = 0.2601
• D2 = 0×0.51 + 0.91×0 + 0×0 + 0.51×0 + 0×0.91 + 0×0 = 0
• D3 = 0×0.51 + 0×0 + 0.91×0 + 0.51×0 + 0×0.91 + 0×0 = 0
• D4 = 0.51×0.51 + 0×0 + 0×0 + 0.51×0 + 0.91×0.91 + 1.61×0 = 1.0882
• D5 = 1.02×0.51 + 0×0 + 0×0 + 0×0 + 0.91×0.91 + 0×0 = 1.3483

• Ranking: D5, D4, D1, D2, D3
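The same inner-product ranking follows from the vectors above (a Python sketch; the dict names are mine):

```python
# tf.idf vectors over the dictionary (cat, dog, eat, mouse, play, rat):
docs = {
    "D1": (0.51, 0.91, 1.82, 0.00, 0.00, 0.00),
    "D2": (0.00, 0.91, 0.00, 0.51, 0.00, 0.00),
    "D3": (0.00, 0.00, 0.91, 0.51, 0.00, 0.00),
    "D4": (0.51, 0.00, 0.00, 0.51, 0.91, 1.61),
    "D5": (1.02, 0.00, 0.00, 0.00, 0.91, 0.00),
}
query = (0.51, 0.00, 0.00, 0.00, 0.91, 0.00)  # "what do cats play with?"

scores = {d: sum(w * q for w, q in zip(v, query)) for d, v in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
# D5 1.3483, D4 1.0882, D1 0.2601, D2 and D3 0 -> D5, D4, D1, D2, D3
```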

Cosine Similarity

• Query: "what do cats play with?" forms the query vector (0.51, 0.00, 0.00, 0.00, 0.91, 0.00)

• Using the cosine measure, we obtain the following similarity values:

D1 = 0.51^2 / [(0.51^2 + 0.91^2)^0.5 × (0.51^2 + 0.91^2 + 1.82^2)^0.5]
D2 = 0.0
D3 = 0.0
D4 = (0.51^2 + 0.91^2) / [(0.51^2 + 0.91^2)^0.5 × (0.51^2 + 0.51^2 + 0.91^2 + 1.61^2)^0.5]
D5 = (0.51×1.02 + 0.91^2) / [(0.51^2 + 0.91^2)^0.5 × (1.02^2 + 0.91^2)^0.5]

• Thus we obtain the ranking: D5, D4, D1, D2, D3 (or D3, D2)

