
9/21/2000 Information Organization and Retrieval

Ranking and Relevance Feedback

Ray Larson & Marti Hearst

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

9/21/2000 Information Organization and Retrieval

Review

• Inverted files

• The Vector Space Model

• Term weighting

9/21/2000 Information Organization and Retrieval

tf x idf

$w_{ik} = tf_{ik} \cdot \log(N / n_k)$

where:

$T_k$ = term $k$ in document $D_i$

$tf_{ik}$ = frequency of term $T_k$ in document $D_i$

$idf_k$ = inverse document frequency of term $T_k$ in collection $C$

$N$ = total number of documents in the collection $C$

$n_k$ = the number of documents in $C$ that contain $T_k$

$idf_k = \log\left(\frac{N}{n_k}\right)$
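As a rough illustration of the formula above, here is a minimal Python sketch of the tf x idf weight (the function name and the toy numbers are my own; base-10 logs are used to match the IDF examples on the next slide):

```python
import math

def tf_idf_weight(tf_ik, N, n_k):
    """w_ik = tf_ik * log10(N / n_k): term frequency times inverse document frequency."""
    return tf_ik * math.log10(N / n_k)

# Toy numbers: a term occurring 3 times in document i, in a collection of
# 10000 documents where 20 documents contain the term.
print(tf_idf_weight(tf_ik=3, N=10000, n_k=20))  # 3 * 2.698... ≈ 8.1
```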

9/21/2000 Information Organization and Retrieval

Inverse Document Frequency

• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

$\log(10000 / 1) = 4$

$\log(10000 / 20) = 2.698$

$\log(10000 / 5000) = 0.301$

$\log(10000 / 10000) = 0$

9/21/2000 Information Organization and Retrieval

Similarity Measures

Simple matching (coordination level match): $|Q \cap D|$

Dice's Coefficient: $\frac{2\,|Q \cap D|}{|Q| + |D|}$

Jaccard's Coefficient: $\frac{|Q \cap D|}{|Q \cup D|}$

Cosine Coefficient: $\frac{|Q \cap D|}{|Q|^{1/2}\,|D|^{1/2}}$

Overlap Coefficient: $\frac{|Q \cap D|}{\min(|Q|,\,|D|)}$
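For concreteness, a small Python sketch of these set-based coefficients applied to query and document term sets (the function name and sample terms are illustrative, not from the slides):

```python
import math

def set_similarities(q_terms, d_terms):
    """Compute the five coefficients above for two sets of terms."""
    Q, D = set(q_terms), set(d_terms)
    common = len(Q & D)
    return {
        "simple_match": common,                          # coordination level match
        "dice": 2 * common / (len(Q) + len(D)),
        "jaccard": common / len(Q | D),
        "cosine": common / math.sqrt(len(Q) * len(D)),
        "overlap": common / min(len(Q), len(D)),
    }

print(set_similarities(["ranking", "relevance", "feedback"],
                       ["relevance", "feedback", "vector", "model"]))
```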

9/21/2000 Information Organization and Retrieval

Vector Space Visualization

9/21/2000 Information Organization and Retrieval

Text Clustering

Clustering is "The art of finding groups in data." -- Kaufman and Rousseeuw

[Figure: documents plotted in a two-dimensional space with axes Term 1 and Term 2, falling into visible groups]

9/21/2000 Information Organization and Retrieval

tf x idf normalization

• Normalize the term weights (so longer documents are not unfairly given more weight)
  – normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive.

$w_{ik} = \frac{tf_{ik}\,\log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N / n_k)]^2}}$

9/21/2000 Information Organization and Retrieval

Vector space similarity (use the weights to compare the documents)

Now, the similarity of two documents is:

$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik}\,w_{jk}$

This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)

9/21/2000 Information Organization and Retrieval

Vector Space Similarity Measure
combine tf x idf into a similarity measure

$D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$, with $w = 0$ if a term is absent

if term weights are normalized:

$sim(Q, D_i) = \sum_{j=1}^{t} w_{qj}\,w_{d_{ij}}$

otherwise normalize in the similarity comparison:

$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj}\,w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2}\;\sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$

9/21/2000 Information Organization and Retrieval

Vector Space with Term Weights and Cosine Matching

[Figure: query Q and documents D1, D2 plotted as term-weight vectors on axes Term A and Term B (scale 0 to 1.0), with the angles between Q and each document vector marked]

$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; \ldots; q_{it}, w_{q_{it}})$

$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j}\,w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2}\;\sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$

Q = (0.4, 0.8); D1 = (0.8, 0.3); D2 = (0.2, 0.7)

$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98$

$sim(Q, D_1) = \frac{0.56}{\sqrt{0.584}} \approx 0.73$
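The worked example can be checked with a few lines of Python (a sketch; cosine_sim is a helper name of my own):

```python
import math

def cosine_sim(q, d):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(qi * qi for qi in q)) *
                  math.sqrt(sum(di * di for di in d)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine_sim(Q, D1), 2))  # 0.73
print(round(cosine_sim(Q, D2), 2))  # 0.98
```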

9/21/2000 Information Organization and Retrieval

Term Weights in SMART

• In SMART weights are decomposed into three factors:

$w_{kd} = \frac{freq_{kd} \times collect_k}{norm}$

9/21/2000 Information Organization and Retrieval

SMART Freq Components

Binary: $freq_{kd} \in \{0, 1\}$

maxnorm: $\frac{freq_{kd}}{\max(freq_d)}$

augmented: $\frac{1}{2} + \frac{1}{2} \cdot \frac{freq_{kd}}{\max(freq_d)}$

log: $\ln(freq_{kd}) + 1$

9/21/2000 Information Organization and Retrieval

Collection Weighting in SMART

Inverse: $collect_k = \log\frac{NDoc}{Doc_k}$

Squared: $collect_k = \left[\log\frac{NDoc}{Doc_k}\right]^2$

Probabilistic: $collect_k = \log\frac{NDoc - Doc_k}{Doc_k}$

Frequency: $collect_k = 1$

9/21/2000 Information Organization and Retrieval

Term Normalization in SMART

sum: $norm = \sum_{j \in vector} w_j$

cosine: $norm = \sqrt{\sum_{j \in vector} w_j^2}$

fourth: $norm = \sqrt[4]{\sum_{j \in vector} w_j^4}$

max: $norm = \max_{j \in vector} w_j$
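Putting the three SMART factors together, here is a minimal sketch of one possible combination (augmented term frequency, inverse collection weight, cosine normalization); the function names and sample counts are my own, not SMART's:

```python
import math

def augmented_tf(freq, max_freq):
    """Augmented frequency component: 1/2 + 1/2 * freq / max(freq)."""
    return 0.5 + 0.5 * freq / max_freq if freq > 0 else 0.0

def inverse_collection(doc_k, n_doc):
    """Inverse collection weight: log(NDoc / Doc_k)."""
    return math.log(n_doc / doc_k)

def smart_style_weights(term_freqs, doc_counts, n_doc):
    """term_freqs: term -> frequency in this document;
       doc_counts: term -> number of documents containing the term."""
    max_freq = max(term_freqs.values())
    raw = {t: augmented_tf(f, max_freq) * inverse_collection(doc_counts[t], n_doc)
           for t, f in term_freqs.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))   # cosine normalization
    return {t: w / norm for t, w in raw.items()}

print(smart_style_weights({"ranking": 3, "feedback": 1},
                          {"ranking": 50, "feedback": 200}, 10000))
```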

9/21/2000 Information Organization and Retrieval

Probabilistic Models: Logistic Regression attributes

$X_1 = \frac{1}{M}\sum_{j=1}^{M} \log QAF_{t_j}$ -- Average Absolute Query Frequency

$X_2 = QL$ -- Query Length

$X_3 = \frac{1}{M}\sum_{j=1}^{M} \log DAF_{t_j}$ -- Average Absolute Document Frequency

$X_4 = DL$ -- Document Length

$X_5 = \frac{1}{M}\sum_{j=1}^{M} \log IDF_{t_j}$ -- Average Inverse Document Frequency, where $IDF_t = \log\frac{N}{n_t}$ is the Inverse Document Frequency

$X_6 = \log M$ -- Number of Terms in common between query and document -- logged

9/21/2000 Information Organization and Retrieval

Probabilistic Models: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained by:

$P(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$

For the 6 X attribute measures shown previously
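A minimal sketch of how such an estimate might be computed at retrieval time, assuming the fitted sum is a log-odds that is mapped to a probability with the logistic function (the coefficient and attribute values below are made up for illustration, not taken from TREC data):

```python
import math

def relevance_probability(x, c):
    """x: attribute values X1..X6; c: coefficients c0..c6 fit by logistic regression."""
    log_odds = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))   # logistic transform of the log-odds

# Illustrative values only:
X = [0.2, 3.0, 1.1, 5.2, 2.4, 0.69]
C = [-3.5, 0.4, -0.1, 0.3, -0.05, 0.5, 1.2]
print(relevance_probability(X, C))
```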

9/21/2000 Information Organization and Retrieval

Probabilistic Models

Advantages:

• Strong theoretical basis

• In principle should supply the best predictions of relevance given available information

• Can be implemented similarly to Vector

Disadvantages:

• Relevance information is required -- or is "guesstimated"

• Important indicators of relevance may not be terms -- though usually only terms are used

• Optimally requires on-going collection of relevance information

9/21/2000 Information Organization and Retrieval

Vector and Probabilistic Models

• Support "natural language" queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
  – Vector assumes relevance
  – Probabilistic relies on relevance judgments or estimates

9/21/2000 Information Organization and Retrieval

Current use of Probabilistic Models

• Virtually all the major systems in TREC now use the “Okapi BM25 formula” which incorporates the Robertson-Sparck Jones weights…

$w^{(1)} = \log\frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)}$

9/21/2000 Information Organization and Retrieval

Okapi BM25

• Where:
  – Q is a query containing terms T
  – K is $k_1((1 - b) + b \cdot dl/avdl)$
  – $k_1$, $b$ and $k_3$ are parameters, usually 1.2, 0.75 and 7-1000
  – tf is the frequency of the term in a specific document
  – qtf is the frequency of the term in a topic from which Q was derived
  – dl and avdl are the document length and the average document length measured in some convenient unit
  – $w^{(1)}$ is the Robertson-Sparck Jones weight

$\sum_{T \in Q} w^{(1)}\,\frac{(k_1 + 1)\,tf}{K + tf}\,\frac{(k_3 + 1)\,qtf}{k_3 + qtf}$
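A minimal Python sketch of the per-term BM25 contribution described above; when no relevance information is available (r = R = 0), the Robertson-Sparck Jones weight reduces to the idf-like form used here. Parameter defaults follow the slide; the sample numbers are invented. The document score is the sum of this quantity over all query terms.

```python
import math

def bm25_term(tf, qtf, n, N, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """Contribution of one query term T to the BM25 score of a document."""
    w1 = math.log((N - n + 0.5) / (n + 0.5))     # RSJ weight with r = R = 0
    K = k1 * ((1 - b) + b * dl / avdl)
    return w1 * ((k1 + 1) * tf) / (K + tf) * ((k3 + 1) * qtf) / (k3 + qtf)

# One term, occurring 3 times in a document of length 120 (average length 100),
# present in 1000 of 100000 documents, and once in the query:
print(bm25_term(tf=3, qtf=1, n=1000, N=100000, dl=120, avdl=100))
```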

9/21/2000 Information Organization and Retrieval

Today

• Logistic regression and Cheshire

• Relevance Feedback

9/21/2000 Information Organization and Retrieval

Logistic Regression and Cheshire II

• The Cheshire II system uses Logistic Regression equations estimated from TREC full-text data.

• Demo (?)

9/21/2000 Information Organization and Retrieval

Querying in an IR System

[Diagram: an Information Storage and Retrieval System. Documents & data are indexed (descriptive and subject) and stored as document representations (Store 2); interest profiles & queries are formulated in terms of descriptors and stored as profiles/search requests (Store 1). The "rules of the game" are the rules for subject indexing plus a thesaurus, which consists of a lead-in vocabulary and an indexing language. Comparison/matching of the two stores produces potentially relevant documents.]

9/21/2000 Information Organization and Retrieval

Relevance Feedback in an IR System

[Diagram: the same Information Storage and Retrieval System as on the previous slide, with one addition: selected relevant documents from the output are fed back into query formulation.]

9/21/2000 Information Organization and Retrieval

Query Modification

• Problem: how to reformulate the query?
  – Thesaurus expansion:
    • Suggest terms similar to query terms
  – Relevance feedback:
    • Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant

9/21/2000 Information Organization and Retrieval

Relevance Feedback

• Main Idea:
  – Modify existing query based on relevance judgements
    • Extract terms from relevant documents and add them to the query
    • and/or re-weight the terms already in the query
  – Two main approaches:
    • Automatic (pseudo-relevance feedback)
    • Users select relevant documents
      – Users/system select terms from an automatically-generated list

9/21/2000 Information Organization and Retrieval

Relevance Feedback

• Usually do both:
  – expand query with new terms
  – re-weight terms in query

• There are many variations
  – usually positive weights for terms from relevant docs
  – sometimes negative weights for terms from non-relevant docs
  – remove terms ONLY in non-relevant documents

9/21/2000 Information Organization and Retrieval

Rocchio Method

$Q_1 = Q_0 + \frac{\beta}{n_1}\sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2}\sum_{i=1}^{n_2} S_i$

where:

$Q_0$ = the vector for the initial query

$R_i$ = the vector for the relevant document $i$

$S_i$ = the vector for the non-relevant document $i$

$n_1$ = the number of relevant documents chosen

$n_2$ = the number of non-relevant documents chosen

$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best set to 0.75 and 0.25)


9/21/2000 Information Organization and Retrieval

Rocchio/Vector Illustration

[Figure: query vectors Q0, Q', Q'' and document vectors D1, D2 plotted on axes "Retrieval" and "Information", scale 0 to 1.0]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q' = ½ Q0 + ½ D1 = (0.45, 0.55)
Q'' = ½ Q0 + ½ D2 = (0.80, 0.20)

9/21/2000 Information Organization and Retrieval

Example Rocchio Calculation

Relevant docs:

$R_1$ = (0.020, 0.009, 0.020, 0.002, 0.050, 0.025, 0.100, 0.100, 0.120)
$R_2$ = (0.030, 0.00, 0.00, 0.025, 0.025, 0.050, 0.00, 0.00, 0.120)

Non-rel doc:

$S_1$ = (0.030, 0.010, 0.020, 0.00, 0.005, 0.025, 0.00, 0.020, 0.00)

Original Query:

$Q$ = (0.00, 0.00, 0.00, 0.00, 0.500, 0.00, 0.450, 0.00, 0.950)

Constants: $\beta = 0.75$, $\gamma = 0.25$, $n_1 = 2$, $n_2 = 1$

Rocchio Calculation:

$Q_{new} = Q + 0.75\,\frac{R_1 + R_2}{2} - 0.25\,\frac{S_1}{1}$

Resulting feedback query:

$Q_{new}$ = (0.011, 0.000875, 0.002, 0.01, 0.527, 0.022, 0.488, 0.033, 1.04)
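The calculation can be reproduced with a short NumPy sketch (the variable names mirror the slide; the result matches the slide's vector up to rounding):

```python
import numpy as np

Q  = np.array([0.000, 0.000, 0.000, 0.000, 0.500, 0.000, 0.450, 0.000, 0.950])
R1 = np.array([0.020, 0.009, 0.020, 0.002, 0.050, 0.025, 0.100, 0.100, 0.120])
R2 = np.array([0.030, 0.000, 0.000, 0.025, 0.025, 0.050, 0.000, 0.000, 0.120])
S1 = np.array([0.030, 0.010, 0.020, 0.000, 0.005, 0.025, 0.000, 0.020, 0.000])

beta, gamma, n1, n2 = 0.75, 0.25, 2, 1
Q_new = Q + beta * (R1 + R2) / n1 - gamma * S1 / n2
print(Q_new)  # ≈ (0.011, 0.0009, 0.0025, 0.010, 0.527, 0.022, 0.488, 0.033, 1.04)
```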

9/21/2000 Information Organization and Retrieval

Rocchio Method

• Rocchio automatically
  – re-weights terms
  – adds in new terms (from relevant docs)
    • have to be careful when using negative terms

• Rocchio is not a machine learning algorithm

• Most methods perform similarly
  – results heavily dependent on test collection

• Machine learning methods are proving to work better than standard IR approaches like Rocchio

9/21/2000 Information Organization and Retrieval

Probabilistic Relevance Feedback Robertson & Sparck Jones

Document indexing vs. Document Relevance, given a query term t:

                          Relevant (+)    Not relevant (-)    Total
  Indexed (+)                  r              n - r              n
  Not indexed (-)            R - r        N - n - R + r        N - n
  Total                        R              N - R              N

Where N is the number of documents seen.

9/21/2000 Information Organization and Retrieval

Robertson-Sparck Jones Weights

• Retrospective formulation --

$w_t^{new} = \log\frac{r/(R - r)}{(n - r)/(N - n - R + r)}$

9/21/2000 Information Organization and Retrieval

Robertson-Sparck Jones Weights

Predictive formulation:

$w^{(1)} = \log\frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)}$
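A direct Python transcription of the predictive weight (the function name and the example counts are my own):

```python
import math

def rsj_weight(r, R, n, N):
    """Predictive Robertson-Sparck Jones weight with the 0.5 corrections."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# A term occurring in 8 of 10 known relevant documents and in 100 of the
# 10000 documents seen:
print(rsj_weight(r=8, R=10, n=100, N=10000))
```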

9/21/2000 Information Organization and Retrieval

Using Relevance Feedback

• Known to improve results
  – in TREC-like conditions (no user involved)

• What about with a user in the loop?
  – How might you measure this?

9/21/2000 Information Organization and Retrieval

Relevance Feedback Summary

• Iterative query modification can improve precision and recall for a standing query

• In at least one study, users were able to make good choices by seeing which terms were suggested for R.F. and selecting among them

9/21/2000 Information Organization and Retrieval

Alternative Notions of Relevance Feedback

• Find people whose taste is “similar” to yours. Will you like what they like?

• Follow a user's actions in the background. Can this be used to predict what the user will want to see next?

• Track what lots of people are doing. Does this implicitly indicate what they think is good and not good?

9/21/2000 Information Organization and Retrieval

Alternative Notions of Relevance Feedback

• Several different criteria to consider:
  – Implicit vs. Explicit judgements
  – Individual vs. Group judgements
  – Standing vs. Dynamic topics
  – Similarity of the items being judged vs. similarity of the judges themselves

Information Organization and Retrieval

Collaborative Filtering (social filtering)

• If Pam liked the paper, I’ll like the paper

• If you liked Star Wars, you’ll like Independence Day

• Rating based on ratings of similar people
  – Ignores the text, so works on text, sound, pictures etc.
  – But: Initial users can bias ratings of future users

                      Sally   Bob   Chris   Lynn   Karen
  Star Wars             7      7      3       4      7
  Jurassic Park         6      4      7       4      4
  Terminator II         3      4      7       6      3
  Independence Day      7      7      2       2      ?

Information Organization and Retrieval

• Users rate musical artists from like to dislike
  – 1 = detest, 7 = can't live without, 4 = ambivalent
  – There is a normal distribution around 4
  – However, what matters are the extremes

• Nearest Neighbors Strategy: find similar users and predict a (weighted) average of their ratings

• Pearson r algorithm: weight by degree of correlation between user U and user J
  – 1 means very similar, 0 means no correlation, -1 dissimilar
  – Works better to compare against the ambivalent rating (4), rather than the individual's average score

Ringo Collaborative Filtering (Shardanand & Maes 95)

$r_{UJ} = \frac{\sum (U - \bar{U})(J - \bar{J})}{\sqrt{\sum (U - \bar{U})^2\,\sum (J - \bar{J})^2}}$
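A small sketch of a Ringo-style similarity: the Pearson formula above, but computed against the ambivalent rating of 4 rather than each user's mean, as the bullet above suggests (the helper name is my own; the example ratings come from the earlier table):

```python
import math

def ringo_similarity(u, j, pivot=4.0):
    """Pearson-style correlation between two users' ratings, centered on the
    ambivalent rating (4) instead of each user's mean."""
    num = sum((ui - pivot) * (ji - pivot) for ui, ji in zip(u, j))
    den = (math.sqrt(sum((ui - pivot) ** 2 for ui in u)) *
           math.sqrt(sum((ji - pivot) ** 2 for ji in j)))
    return num / den if den else 0.0

# Sally and Bob on Star Wars, Jurassic Park, Terminator II:
print(ringo_similarity([7, 6, 3], [7, 4, 4]))   # ≈ 0.80, fairly similar taste
```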

9/21/2000 Information Organization and Retrieval

Social Filtering

• Ignores the content, only looks at who judges things similarly

• Works well on data relating to "taste"
  – something that people are good at predicting about each other too

• Does it work for topic?
  – GroupLens results suggest otherwise (preliminary)
  – Perhaps for quality assessments
  – What about for assessing if a document is about a topic?

Information Organization and Retrieval

Learning interface agents

• Add agents in the UI, delegate tasks to them

• Use machine learning to improve performance
  – learn user behavior, preferences

• Useful when:
  – 1) past behavior is a useful predictor of the future
  – 2) wide variety of behaviors amongst users

• Examples:
  – mail clerk: sort incoming messages in right mailboxes
  – calendar manager: automatically schedule meeting times?

9/21/2000 Information Organization and Retrieval

Summary

• Relevance feedback is an effective means for user-directed query modification.

• Modification can be done with either direct or indirect user input

• Modification can be done based on an individual’s or a group’s past input.

9/21/2000 Information Organization and Retrieval