Statistical Models for Information Retrieval and Text Mining

Transcript
Page 1: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 1

Statistical Models for Information Retrieval and Text Mining

ChengXiang Zhai (翟成祥)

Department of Computer Science

Graduate School of Library & Information Science

Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign

http://www-faculty.cs.uiuc.edu/~czhai, [email protected]

Page 2: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 2

Course Overview

[Diagram: scope of the course, the intersection of Statistics, Machine Learning, Computer Vision, Natural Language Processing, and Information Retrieval, applied to multimedia data and text data.]

Page 3: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 3

Goal of the Course

• Overview of techniques for information retrieval (IR)

• Detailed explanation of a few statistical models for IR and text mining
  – Probabilistic retrieval models (for search)
  – Probabilistic topic models (for text mining)

• Potential benefit for you:
  – Some ideas working well for text retrieval may also work for computer vision
  – Techniques for computer vision may be applicable to IR
  – IR and text mining raise new challenges as well as opportunities for machine learning

Page 4: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 4

Course Plan

• Lecture 1: Overview of information retrieval

• Lecture 2: Statistical language models for IR: Part 1

• Lecture 3: Statistical language models for IR: Part 2

• Lecture 4: Formal retrieval frameworks

• Lecture 5: Probabilistic topic models for text mining

Page 5: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 5

Lecture 1: Overview of IR

• Basic Concepts in Text Retrieval (TR)

• Evaluation of TR

• Common Components of a TR system

• Overview of Retrieval Models

Page 6: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 6

Basic Concepts in TR

Page 7: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 7

What is Text Retrieval (TR)?

• There exists a collection of text documents

• User gives a query to express the information need

• A retrieval system returns relevant documents to users

• Known as “search technology” in industry

Page 8: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 8

History of TR on One Slide

• Birth of TR
  – 1945: V. Bush's article "As We May Think"
  – 1957: H. P. Luhn's idea of word counting and matching

• Indexing & evaluation methodology (1960s)
  – SMART system (G. Salton's group)
  – Cranfield test collection (C. Cleverdon's group)
  – Indexing: automatic indexing can be as good as manual indexing (controlled vocabulary)

• TR models (1970s & 1980s) …

• Large-scale evaluation & applications (1990s-present)
  – TREC (D. Harman & E. Voorhees, NIST)
  – Web search, PubMed, …
  – Boundaries with related areas are disappearing

Page 9: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 9

Short vs. Long Term Info Need

• Short-term information need (ad hoc retrieval)
  – "Temporary need", e.g., info about used cars
  – Information source is relatively static
  – User "pulls" information
  – Application examples: library search, Web search

• Long-term information need (filtering)
  – "Stable need", e.g., new data mining algorithms
  – Information source is dynamic
  – System "pushes" information to the user
  – Application example: news filter

Page 10: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 10

Importance of Ad hoc Retrieval

• Directly manages any existing large collection of information

• There are many, many "ad hoc" information needs

• A long-term information need can be satisfied through frequent ad hoc retrieval

• Basic techniques of ad hoc retrieval can be used for filtering and other "non-retrieval" tasks, such as automatic summarization.

Page 11: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 11

Formal Formulation of TR

• Vocabulary V = {w1, w2, …, wN} of the language

• Query q = q1,…,qm, where qi ∈ V

• Document di = di1,…,dimi, where dij ∈ V

• Collection C = {d1, …, dk}

• Set of relevant documents R(q) ⊆ C
  – Generally unknown and user-dependent
  – Query is a "hint" on which docs are in R(q)

• Task = compute R'(q), an approximation of R(q)

Page 12: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 12

Computing R(q)

• Strategy 1: Document selection
  – R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier
  – System must decide if a doc is relevant or not ("absolute relevance")

• Strategy 2: Document ranking
  – R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) is a relevance measure function and θ is a cutoff
  – System must decide if one doc is more likely to be relevant than another ("relative relevance")

Page 13: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 13

Document Selection vs. Ranking

[Figure: document selection applies a binary f(d,q) to split the collection into R'(q) and the rest, which may differ substantially from the true R(q); document ranking sorts documents by score, e.g., 0.98 d1 (+), 0.95 d2 (+), 0.83 d3 (-), 0.80 d4 (+), 0.76 d5 (-), 0.56 d6 (-), 0.34 d7 (-), 0.21 d8 (+), 0.21 d9 (-), and R'(q) is whatever prefix of the ranking the user chooses to examine.]

Page 14: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 14

Problems of Doc Selection

• The classifier is unlikely to be accurate
  – "Over-constrained" query (terms are too specific): no relevant documents found
  – "Under-constrained" query (terms are too general): over-delivery
  – It is extremely hard to find the right position between these two extremes

• Even if it is accurate, not all relevant documents are equally relevant

• Relevance is a matter of degree!

Page 15: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 15

Ranking is Often Preferred

• Relevance is a matter of degree

• A user can stop browsing anywhere, so the boundary is controlled by the user
  – High-recall users would view more items
  – High-precision users would view only a few

• Theoretical justification: Probability Ranking Principle [Robertson 77]

Page 16: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 16

Probability Ranking Principle [Robertson 77]

• As stated by Cooper: "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

• Robertson provides two formal justifications

• Assumptions: independent relevance and sequential browsing (not necessarily all hold in reality)

Page 17: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 17

According to the PRP, all we need is

“A relevance measure function f”

which satisfies

For all q, d1, d2: f(q,d1) > f(q,d2) iff p(Rel|q,d1) > p(Rel|q,d2)

Most IR research has focused on finding a good function f

Page 18: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 18

Evaluation in Information Retrieval

Page 19: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 19

Evaluation Criteria

• Effectiveness/Accuracy
  – Precision, Recall

• Efficiency
  – Space and time complexity

• Usability
  – How useful for real user tasks?

Page 20: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 20

Methodology: Cranfield Tradition

• Laboratory testing of system components
  – Precision, Recall
  – Comparative testing

• Test collections
  – Set of documents
  – Set of questions
  – Relevance judgments

Page 21: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 21

The Contingency Table

                   Retrieved               Not Retrieved
  Relevant         Relevant Retrieved      Relevant Rejected
  Not relevant     Irrelevant Retrieved    Irrelevant Rejected

  Precision = |Relevant ∩ Retrieved| / |Retrieved|
  Recall    = |Relevant ∩ Retrieved| / |Relevant|

Page 22: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 22

How to Measure a Ranking?

• Compute the precision at every recall point

• Plot a precision-recall (PR) curve

[Figure: two precision-recall curves (precision vs. recall), each marked at several recall points. Which is better?]

Page 23: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 23

Summarize a Ranking: MAP

• Given that n docs are retrieved
  – Compute the precision at the rank where each (new) relevant document is retrieved => p(1),…,p(k), if we have k relevant docs
  – E.g., if the first relevant doc is at the 2nd rank, then p(1) = 1/2
  – If a relevant document never gets retrieved, we assume the precision corresponding to that relevant doc to be zero

• Compute the average over all the relevant documents
  – Average precision = (p(1) + … + p(k)) / k

• This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document

• Mean Average Precision (MAP)
  – MAP = arithmetic mean of average precision over a set of topics
  – gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)
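To make the computation concrete, here is a minimal sketch in Python, assuming binary relevance judgments and a ranked list of document IDs (the function names are illustrative, not from any standard library):

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one query.

    ranked_docs: list of doc IDs in ranked order (top first)
    relevant:    set of relevant doc IDs for the query
    Relevant docs that are never retrieved contribute precision 0.
    """
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant) pairs, one per topic."""
    aps = [average_precision(r, rel) for r, rel in runs]
    return sum(aps) / len(aps)

# e.g., first relevant doc at rank 2 gives p(1) = 1/2, as in the slide
print(average_precision(["d5", "d3", "d9"], {"d3", "d7"}))  # (1/2 + 0)/2 = 0.25
```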

Page 24: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 24

Summarize a Ranking: NDCG

• What if relevance judgments are on a graded scale of [1,r], r > 2?

• Cumulative Gain (CG) at rank n
  – Let the ratings of the n documents be r1, r2, …, rn (in ranked order)
  – CG = r1 + r2 + … + rn

• Discounted Cumulative Gain (DCG) at rank n
  – DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
  – We may use any base for the logarithm, e.g., base = b
  – For rank positions above b, do not discount

• Normalized Discounted Cumulative Gain (NDCG) at rank n
  – Normalize DCG at rank n by the DCG value at rank n of the ideal ranking
  – The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, etc.

• NDCG is now quite popular in evaluating Web search
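A minimal sketch of DCG/NDCG under the convention above (no discount at rank 1, log base 2 thereafter); the graded ratings in the example are made up for illustration:

```python
import math

def dcg(ratings):
    """ratings: graded relevance values in ranked order; no discount at rank 1."""
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(ratings, start=1))

def ndcg(ratings, all_judged_ratings):
    """Normalize by the DCG of the ideal ranking over all judged documents."""
    ideal = sorted(all_judged_ratings, reverse=True)[:len(ratings)]
    ideal_dcg = dcg(ideal)
    return dcg(ratings) / ideal_dcg if ideal_dcg > 0 else 0.0

# system ranking with ratings (3, 2, 3, 0, 1), judged ratings {3, 3, 2, 1, 0}
print(ndcg([3, 2, 3, 0, 1], [3, 3, 2, 1, 0]))
```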

Page 25: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 25

When There’s only 1 Relevant Document

• Scenarios:
  – known-item search
  – navigational queries

• Search Length = rank of the answer:
  – measures a user's effort

• Mean Reciprocal Rank (MRR):
  – Reciprocal Rank = 1 / rank-of-the-answer
  – Take the average over all the queries
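A tiny sketch of MRR (illustrative names; a query whose answer is never retrieved is given reciprocal rank 0 here, one common convention):

```python
def reciprocal_rank(ranked_docs, answer):
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc == answer:
            return 1.0 / rank
    return 0.0   # answer not retrieved

def mean_reciprocal_rank(queries):
    """queries: list of (ranked_docs, answer) pairs."""
    return sum(reciprocal_rank(r, a) for r, a in queries) / len(queries)
```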

Page 26: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 26

Precision-Recall Curve (example)

Example run: the system returns 6 docs, D1 +, D2 +, D3 -, D4 -, D5 +, D6 -; total # relevant docs = 4
  Average Precision = (1/1 + 2/2 + 3/5 + 0)/4

Other summary numbers read off a ranking or its PR curve:
  – Recall = 3212/4728 (out of 4728 relevant docs, we've got 3212)
  – Precision@10docs (e.g., about 5.5 docs in the top 10 docs are relevant)
  – Breakeven point (precision = recall)
  – Mean Avg. Precision (MAP)

Page 27: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 27

What Query Averaging Hides

[Figure: precision-recall curves for individual queries spread widely over the 0-1 range; the averaged curve hides this per-query variation. Slide from Doug Oard's presentation, originally from Ellen Voorhees' presentation.]

Page 28: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 28

The Pooling Strategy

• When the test collection is very large, it's impossible to judge all the documents completely

• TREC's strategy: pooling
  – Appropriate for relative comparison of different systems
  – Given N systems, take the top K from the result of each and combine them to form a "pool"
  – Assessors judge all the documents in the pool; unjudged documents are assumed to be non-relevant

• Advantage: less human effort

• Potential problems:
  – Bias due to incomplete judgments (okay for relative comparison)
  – Favors systems contributing to the pool; when the data are reused, a new system's performance may be under-estimated

• Reuse the data set with caution!

Page 29: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 29

User Studies

• Limitations of the Cranfield evaluation strategy:
  – How do we evaluate a technique for improving the interface of a search engine?
  – How do we evaluate the overall utility of a system?

• User studies are needed

• General user-study procedure:
  – Experimental systems are developed
  – Subjects are recruited as users
  – Variation can be in the system or the users
  – Users use the system and user behavior is logged
  – User information is collected (before: background; after: experience with the system)

• Clickthrough-based real-time user studies:
  – Assume clicked documents to be relevant
  – Mix results from multiple methods and compare their clickthroughs

Page 30: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 30

Common Components in a TR System

Page 31: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 31

Typical TR System Architecture

[Diagram: docs are processed by a Tokenizer and an Indexer to build the Doc Rep (Index); the user's query is turned into a Query Rep; the Scorer matches the Query Rep against the Index to produce results; the user's judgments on the results drive a Feedback component that updates the query.]

Page 32: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 32

Text Representation/Indexing

• Making it easier to match a query with a document

• Query and document should be represented using the same units/terms

• Controlled vocabulary vs. full-text indexing

• Full-text indexing is more practically useful and has proven to be as effective as manual indexing with a controlled vocabulary

Page 33: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 33

What is a Good Indexing Term?

• Specific (phrases) or general (single words)?

• Luhn found that words with middle frequency are most useful
  – Not too specific (low utility, but still useful!)
  – Not too general (lack of discrimination; stop words)
  – Stop-word removal is common, but rare words are kept

• All words or a (controlled) subset? When term weighting is used, it is a matter of weighting rather than selecting indexing terms

Page 34: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 34

Tokenization

• Word segmentation is needed for some languages
  – Is it really needed?

• Normalize lexical units: words with similar meanings should be mapped to the same indexing term
  – Stemming: mapping all inflectional forms of a word to the same root form, e.g.
      • computer -> compute
      • computation -> compute
      • computing -> compute (but king -> k?)
  – Are we losing finer-granularity discrimination?

• Stop-word removal
  – What is a stop word? What about a query like "to be or not to be"?

Page 35: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 35

Relevance Feedback

[Diagram: the query goes to the Retrieval Engine, which scores the document collection and returns results (d1 3.5, d2 2.4, …, dk 0.5, …); the user judges the results (d1 +, d2 -, d3 +, …, dk -, …); the Feedback component uses these judgments to produce an updated query.]

Page 36: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 36

Pseudo/Blind/Automatic Feedback

[Diagram: the same loop as relevance feedback, except that the top 10 results are simply assumed to be relevant (d1 +, d2 +, d3 +, …, dk -, …) and used as the judgments for updating the query; no user is involved.]

Page 37: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 37

Implicit Feedback

[Diagram: the same loop again, but the judgments (d1 +, d2 -, d3 +, …) are inferred from user activities, e.g., clickthroughs, and then used to update the query.]

Page 38: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 38

Important Points to Remember

• PRP provides a justification for ranking, which is generally preferred to document selection

• How to compute the major evaluation measures (precision, recall, precision-recall curve, MAP, gMAP, breakeven precision, NDCG, MRR)

• What is pooling

• What is tokenization (word segmentation, stemming, stop-word removal)

• What are relevance feedback, pseudo relevance feedback, and implicit feedback

Page 39: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 39

Overview of Retrieval Models

Relevance has been modeled in three broad ways:

• Relevance ≈ similarity of representations, sim(Rep(q), Rep(d)) (different representations & similarity measures)
  – Vector space model (Salton et al., 75)
  – Prob. distr. model (Wong & Yao, 89)

• Relevance as probability of relevance, P(r=1|q,d), r ∈ {0,1}
  – Regression model (Fox 83)
  – Generative models
      • Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
      • Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  – Learning to rank (Joachims 02; Burges et al. 05)

• Relevance as probabilistic inference, P(d → q) or P(q → d) (different inference systems)
  – Prob. concept space model (Wong & Yao, 95)
  – Inference network model (Turtle & Croft, 91)

Page 40: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 40

Retrieval Models: Vector Space

Page 41: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 41

Relevance = Similarity

• Assumptions
  – Query and document are represented similarly
  – A query can be regarded as a "document"
  – Relevance(d,q) ≈ similarity(d,q)

• R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = sim(Rep(q), Rep(d))

• Key issues
  – How to represent the query/document?
  – How to define the similarity measure sim(·,·)?

Page 42: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 42

Vector Space Model

• Represent a doc/query by a term vector
  – Term: basic concept, e.g., word or phrase
  – Each term defines one dimension
  – N terms define a high-dimensional space
  – Each element of the vector corresponds to a term weight
  – E.g., d = (x1,…,xN), where xi is the "importance" of term i

• Measure relevance by the distance between the query vector and the document vector in the vector space

Page 43: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 43

VS Model: Illustration

[Figure: a query and documents D1-D11 plotted as points in a 3-dimensional term space with axes Java, Microsoft, and Starbucks; documents near the query vector (D1, D2, D3, marked with "?") are the candidates whose relevance we want to decide.]

Page 44: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 44

What the VS Model Doesn't Say

• How to define/select the "basic concept"
  – Concepts are assumed to be orthogonal

• How to assign weights
  – Weight in the query indicates the importance of the term
  – Weight in the doc indicates how well the term characterizes the doc

• How to define the similarity/distance measure

Page 45: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 45

What's a Good "Basic Concept"?

• Orthogonal
  – Linearly independent basis vectors
  – "Non-overlapping" in meaning

• No ambiguity

• Weights can be assigned automatically and hopefully accurately

• Many possibilities: words, stemmed words, phrases, "latent concepts", …

Page 46: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 46

How to Assign Weights?

• Very, very important!

• Why weighting?
  – Query side: not all terms are equally important
  – Doc side: some terms carry more information about the contents

• How?
  – Two basic heuristics
      • TF (Term Frequency) = within-doc frequency
      • IDF (Inverse Document Frequency)
  – TF normalization

Page 47: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 47

TF Weighting

• Idea: a term is more important if it occurs more frequently in a document

• Some formulas (let f(t,d) be the frequency count of term t in doc d):
  – Raw TF: TF(t,d) = f(t,d)
  – Log TF: TF(t,d) = log f(t,d)
  – Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)
  – "Okapi/BM25 TF": TF(t,d) = k * f(t,d) / (f(t,d) + k * (1 - b + b * doclen/avgdoclen))

• Normalization of TF is very important!
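The formulas above can be written directly as code; this is only an illustrative sketch (the k and b values and the helper names are assumptions, not fixed by the slide):

```python
import math

def tf_raw(f):                 # raw count
    return f

def tf_log(f):                 # log TF (0 if the term is absent)
    return math.log(f) if f > 0 else 0.0

def tf_maxnorm(f, max_freq):   # maximum-frequency normalization
    return 0.5 + 0.5 * f / max_freq if f > 0 else 0.0

def tf_bm25(f, doclen, avgdoclen, k=1.2, b=0.75):
    """'Okapi/BM25 TF': saturating in f and normalized by document length."""
    return k * f / (f + k * (1 - b + b * doclen / avgdoclen))
```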

Page 48: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 48

TF Normalization

• Why?
  – Document length variation
  – "Repeated occurrences" are less informative than the "first occurrence"

• Two views of document length
  – A doc is long because it uses more words
  – A doc is long because it has more content

• Generally penalize long docs, but avoid over-penalizing (pivoted normalization)

Page 49: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 49

TF Normalization: How?

[Figure: candidate curves mapping raw TF (x-axis) to normalized TF (y-axis).]

Which curve is more reasonable? Should normalized TF be upper-bounded?

Normalization interacts with the similarity measure.

Page 50: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 50

Regularized/"Pivoted" Length Normalization

[Figure: normalized TF vs. raw TF under pivoted normalization.]

"Pivoted normalization": use the average doc length to regularize normalization:
  normalizer = 1 - b + b * doclen/avgdoclen (b varies from 0 to 1)

What would happen if doclen is {>, <, =} avgdoclen?

Advantage: stabilizes parameter setting.

Page 51: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 51

IDF Weighting

• Idea: a term is more discriminative if it occurs in fewer documents

• Formula: IDF(t) = 1 + log(n/k)
  – n: total number of docs
  – k: # docs containing term t (doc freq)

Page 52: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 52

TF-IDF Weighting

• TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
  – Common in doc => high TF => high weight
  – Rare in collection => high IDF => high weight

• Imagine a word-count profile: what kind of terms would have high weights?

Page 53: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 53

How to Measure Similarity?

Let Q = (w_q1, …, w_qN) and D_i = (w_i1, …, w_iN), where a weight is 0 if the term is absent.

• Dot product similarity:
    sim(Q, D_i) = Σ_{j=1..N} w_qj * w_ij

• Cosine similarity (normalized dot product):
    sim(Q, D_i) = Σ_{j=1..N} w_qj * w_ij / ( sqrt(Σ_j w_qj^2) * sqrt(Σ_j w_ij^2) )

• How about Euclidean distance?
    dist(Q, D_i) = sqrt( Σ_{j=1..N} (w_qj - w_ij)^2 )
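Putting TF-IDF weighting and these similarity measures together, here is a minimal, illustrative sketch; the raw-TF weighting choice and the tiny toy collection are assumptions for demonstration only:

```python
import math
from collections import Counter

def tfidf_vector(terms, doc_freq, n_docs):
    """Sparse TF-IDF vector: weight(t) = TF(t) * (1 + log(N / df(t)))."""
    tf = Counter(terms)
    return {t: f * (1 + math.log(n_docs / doc_freq[t]))
            for t, f in tf.items() if t in doc_freq}

def dot(u, v):
    return sum(w * v[t] for t, w in u.items() if t in v)

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

docs = [["java", "coffee", "starbucks"], ["java", "microsoft"], ["microsoft", "windows"]]
df = Counter(t for d in docs for t in set(d))          # document frequencies
doc_vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
q_vec = tfidf_vector(["java", "starbucks"], df, len(docs))
print(sorted(((cosine(q_vec, dv), i) for i, dv in enumerate(doc_vecs)), reverse=True))
```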

Page 54: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 54

What Works the Best?

(Singhal 2001)

• Use single words
• Use statistical phrases
• Remove stop words
• Stemming
• Others (?)

[The weighting formula shown on the original slide did not survive extraction.]

Page 55: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 55

"Extension" of the VS Model

• Alternative similarity measures
  – Many other choices (tend not to be very effective)
  – P-norm (Extended Boolean): matching a Boolean query with a TF-IDF document vector

• Alternative representations
  – Many choices (performance varies a lot)
  – Latent Semantic Indexing (LSI) [TREC performance tends to be average]

• Generalized vector space model
  – Theoretically interesting, not seriously evaluated

Page 56: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 56

Relevance Feedback in VS

• Basic setting: learn from examples
  – Positive examples: docs known to be relevant
  – Negative examples: docs known to be non-relevant
  – How do you learn from these to improve performance?

• General method: query modification
  – Adding new (weighted) terms
  – Adjusting weights of old terms
  – Doing both

• The most well-known and effective approach is Rocchio [Rocchio 1971]

Page 57: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 57

Rocchio Feedback: Illustration

[Figure: in the vector space, the original query vector q is moved toward the centroid of the relevant (+) documents and away from the non-relevant (-) documents, yielding the modified query.]

Page 58: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 58

Rocchio Feedback: Formula

The new query is a weighted combination of the original query and the centroids of the judged documents:

  q_m = α·q + (β/|D_r|) Σ_{d ∈ D_r} d - (γ/|D_n|) Σ_{d ∈ D_n} d

where q is the original query vector, D_r the relevant docs, D_n the non-relevant docs, α, β, γ the parameters, and q_m the new query.
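A minimal sketch of the Rocchio update over sparse term-weight vectors; the parameter values α=1, β=0.75, γ=0.15 are common defaults, not prescribed by the slide:

```python
from collections import defaultdict

def rocchio(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*centroid(rel_docs) - gamma*centroid(nonrel_docs).

    All vectors are dicts mapping term -> weight.
    """
    new_q = defaultdict(float)
    for t, w in query_vec.items():
        new_q[t] += alpha * w
    for docs, coeff in ((rel_docs, beta), (nonrel_docs, -gamma)):
        if not docs:
            continue
        for d in docs:
            for t, w in d.items():
                new_q[t] += coeff * w / len(docs)
    # negative weights are usually clipped to zero in practice
    return {t: w for t, w in new_q.items() if w > 0}
```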

Page 59: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 59

Rocchio in Practice

• Negative (non-relevant) examples are not very important (why?)

• Often truncate the vector to a lower dimension (i.e., consider only a small number of words that have high weights in the centroid vector) (efficiency concern)

• Avoid overfitting by keeping a relatively high weight on the original query terms (why?)

• Can be used for relevance feedback and pseudo feedback

• Usually robust and effective

Page 60: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 60

Advantages of VS Model

• Empirically effective! (Top TREC performance)

• Intuitive

• Easy to implement

• Well-studied/Most evaluated

• The SMART system
  – Developed at Cornell: 1960-1999
  – Still available

• Warning: Many variants of TF-IDF!

Page 61: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 61

Disadvantages of VS Model

• Assumes term independence

• Assumes query and document to be the same kind of object

• Lack of "predictive adequacy"
  – Arbitrary term weighting
  – Arbitrary similarity measure

• Lots of parameter tuning!

Page 62: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 62

Probabilistic Retrieval Models

Page 63: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 63

Overview of Retrieval Models

Relevance has been modeled in three broad ways:

• Relevance ≈ similarity of representations, sim(Rep(q), Rep(d)) (different representations & similarity measures)
  – Vector space model (Salton et al., 75)
  – Prob. distr. model (Wong & Yao, 89)

• Relevance as probability of relevance, P(r=1|q,d), r ∈ {0,1}
  – Regression model (Fox 83)
  – Generative models
      • Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
      • Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  – Learning to rank (Joachims 02; Burges et al. 05)

• Relevance as probabilistic inference, P(d → q) or P(q → d) (different inference systems)
  – Prob. concept space model (Wong & Yao, 95)
  – Inference network model (Turtle & Croft, 91)

Page 64: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 64

Probability of Relevance

• Three random variables
  – Query Q
  – Document D
  – Relevance R ∈ {0,1}

• Goal: rank D based on P(R=1|Q,D)
  – Evaluate P(R=1|Q,D)
  – Actually, we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank documents

• Several different ways to refine P(R=1|Q,D)

Page 65: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 65

Refining P(R=1|Q,D), Method 1: conditional models

• Basic idea: relevance depends on how well a query matches a document
  – Define features on Q x D, e.g., # matched terms, the highest IDF of a matched term, doclen, …
  – P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D), θ)
  – Use training data (known relevance judgments) to estimate the parameters θ
  – Apply the model to rank new documents

• Early work (e.g., logistic regression [Cooper 92, Gey 94])
  – Attempted to compete with other models

• Recent work (e.g., Ranking SVM [Joachims 02], RankNet [Burges et al. 05])
  – Attempts to leverage other models
  – More features (notably PageRank, anchor text)
  – More sophisticated learning (Ranking SVM, RankNet, …)

Page 66: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 66

Logistic Regression (Cooper 92, Gey 94)

Uses 6 features X1, …, X6:

  log [ P(R=1|Q,D) / (1 - P(R=1|Q,D)) ] = β0 + Σ_{i=1..6} βi Xi

equivalently

  P(R=1|Q,D) = 1 / (1 + exp(-(β0 + Σ_{i=1..6} βi Xi)))

logit function: logit(x) = log( x / (1 - x) )
logistic (sigmoid) function: σ(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x))

[Figure: P(R=1|Q,D) as a sigmoid function of X, saturating at 1.0.]
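A minimal sketch of scoring with such a model once the coefficients have been estimated from training data; the feature values and coefficients below are placeholders for illustration, not estimates from the paper:

```python
import math

def relevance_probability(features, beta0, betas):
    """P(R=1|Q,D) = 1 / (1 + exp(-(beta0 + sum_i beta_i * x_i)))."""
    z = beta0 + sum(b * x for b, x in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))

# six features X1..X6 as in (Gey 94); coefficient values are made up
x = [0.2, 3.0, 1.1, 120.0, 2.4, 1.6]
print(relevance_probability(x, beta0=-3.5, betas=[0.4, -0.01, 0.3, -0.002, 0.5, 0.9]))
```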

Page 67: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 67

Features/Attributes

For a query-document pair sharing M terms t1, …, tM (N = total # docs, n_t = # docs containing t):

  X1 = (1/M) Σ_{j=1..M} log QAF(t_j)      Average Absolute Query Frequency
  X2 = QL                                  Query Length
  X3 = (1/M) Σ_{j=1..M} log DAF(t_j)      Average Absolute Document Frequency
  X4 = DL                                  Document Length
  X5 = (1/M) Σ_{j=1..M} log IDF(t_j)      Average Inverse Document Frequency, with IDF(t) = (N - n_t) / n_t
  X6 = log M                               Number of terms in common between query and document (logged)

Page 68: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 68

Learning to Rank

• Advantages
  – Can combine multiple features (helps improve accuracy and combat web spam)
  – Can re-use all past relevance judgments (self-improving)

• Problems
  – Doesn't learn the semantic associations between query words and document words
  – Not much guidance on feature generation (relies on traditional retrieval models)

• All current Web search engines use some kind of learning algorithm to combine many features, such as PageRank and many different representations of a page

Page 69: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 69

The PageRank Algorithm (Page et al. 98)

Random surfing model: at any page,
  – with probability α, randomly jump to a page
  – with probability (1 - α), randomly pick a link to follow

"Transition matrix" for the example graph over pages d1, d2, d3, d4:

  M = [ 0    0    1/2  1/2
        1    0    0    0
        0    1    0    0
        1/2  1/2  0    0 ]

PageRank score (a stationary "stable" distribution, so time is ignored):

  p(d_i) = Σ_{k=1..N} [ α (1/N) + (1 - α) m_{ki} ] p(d_k)

or in matrix form, with I_{ij} = 1/N:

  p = (α I + (1 - α) M)^T p

N = # pages; initial value p(d) = 1/N; iterate until convergence.
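A minimal power-iteration sketch for the example graph above (α = 0.15 as in the slide; the convergence tolerance is an arbitrary choice):

```python
import numpy as np

def pagerank(M, alpha=0.15, tol=1e-10, max_iter=1000):
    """M[k, i] = prob. of following a link from page k to page i (rows sum to 1)."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)                 # initial value p(d) = 1/N
    jump = np.full((n, n), 1.0 / n)         # I_ij = 1/N (random jump)
    T = (alpha * jump + (1 - alpha) * M).T  # p <- (alpha*I + (1-alpha)*M)^T p
    for _ in range(max_iter):
        p_next = T @ p
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

# transition matrix of the 4-page example (d1..d4) from the slide
M = np.array([[0, 0, .5, .5], [1, 0, 0, 0], [0, 1, 0, 0], [.5, .5, 0, 0]])
print(pagerank(M))
```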

Page 70: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 70

PageRank in Practice

• Interpretation of the damping factor α (≈ 0.15):
  – Probability of a random jump
  – Smoothing of the transition matrix (avoids zeros)

• Normalization doesn't affect ranking, leading to some variants:
  starting from p(d_i) = Σ_k [ α/N + (1 - α) m_{ki} ] p(d_k), i.e., p = (α I + (1 - α) M)^T p,
  scaling all scores by a constant c (p'(d_i) = c·p(d_i)) leaves the fixed-point equation satisfied up to a constant in the jump term, so the relative order of pages is unchanged.

• The zero-outlink problem: the p(d_i) don't sum to 1
  – One possible solution: a page-specific damping factor (α = 1.0 for a page with no outlink)

Page 71: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 71

HITS: Capturing Authorities & Hubs [Kleinberg 98]

• Intuitions
  – Pages that are widely cited are good authorities
  – Pages that cite many other pages are good hubs

• The key idea of HITS
  – Good authorities are cited by good hubs
  – Good hubs point to good authorities
  – Iterative reinforcement …

Page 72: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 72

The HITS Algorithm [Kleinberg 98]

"Adjacency matrix" for the example graph over pages d1, d2, d3, d4:

  A = [ 0 0 1 1
        1 0 0 0
        0 1 0 0
        1 1 0 0 ]

Hub and authority scores:

  h(d_i) = Σ_{d_j ∈ OUT(d_i)} a(d_j)        a(d_i) = Σ_{d_j ∈ IN(d_i)} h(d_j)

In matrix form:  h = A a,  a = A^T h,  hence  h = A A^T h  and  a = A^T A a.

Initial values: a(d_i) = h(d_i) = 1. Iterate, normalizing so that Σ_i a(d_i)^2 = Σ_i h(d_i)^2 = 1.
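A minimal sketch of the iteration, using the 4-page adjacency matrix from the slide (the fixed number of iterations is an arbitrary choice):

```python
import numpy as np

def hits(A, iterations=50):
    """A[i, j] = 1 if page i links to page j. Returns (authority, hub) scores."""
    n = A.shape[0]
    a = np.ones(n)          # authority scores, a(d_i) = 1 initially
    h = np.ones(n)          # hub scores,       h(d_i) = 1 initially
    for _ in range(iterations):
        a = A.T @ h                      # a = A^T h
        h = A @ a                        # h = A a
        a /= np.linalg.norm(a)           # keep sum of squares equal to 1
        h /= np.linalg.norm(h)
    return a, h

A = np.array([[0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]])
print(hits(A))
```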

Page 73: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 73

Refining P(R=1|Q,D), Method 2: generative models

• Basic idea
  – Define P(Q,D|R)
  – Compute O(R=1|Q,D) using Bayes' rule:

    O(R=1|Q,D) = P(R=1|Q,D) / P(R=0|Q,D) = [ P(Q,D|R=1) / P(Q,D|R=0) ] · [ P(R=1) / P(R=0) ]

    (the second factor does not depend on D and is ignored for ranking)

• Special cases
  – Document "generation": P(Q,D|R) = P(D|Q,R) P(Q|R)
  – Query "generation": P(Q,D|R) = P(Q|D,R) P(D|R)

Page 74: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 74

Document Generation

  P(R=1|Q,D) / P(R=0|Q,D) ∝ P(Q,D|R=1) / P(Q,D|R=0)
    = [ P(D|Q,R=1) P(Q|R=1) ] / [ P(D|Q,R=0) P(Q|R=0) ]
    ∝ P(D|Q,R=1) / P(D|Q,R=0)

  P(D|Q,R=1): model of relevant docs for Q
  P(D|Q,R=0): model of non-relevant docs for Q

Assume independent attributes A1 … Ak (why?). Let D = d1 … dk, where di ∈ {0,1} is the value of attribute Ai (similarly, Q = q1 … qk). Then

  P(R=1|Q,D) / P(R=0|Q,D)
    ∝ Π_{i=1..k} P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0)
    = Π_{i: di=1} P(Ai=1|Q,R=1)/P(Ai=1|Q,R=0) · Π_{i: di=0} P(Ai=0|Q,R=1)/P(Ai=0|Q,R=0)

Assume P(Ai=1|Q,R=1) = P(Ai=1|Q,R=0) if qi = 0, i.e., non-query terms are equally likely to appear in relevant and non-relevant docs. Then

  P(R=1|Q,D) / P(R=0|Q,D)
    ∝ Π_{i: di=qi=1} [ P(Ai=1|Q,R=1) P(Ai=0|Q,R=0) ] / [ P(Ai=1|Q,R=0) P(Ai=0|Q,R=1) ]
      · Π_{i: qi=1} P(Ai=0|Q,R=1) / P(Ai=0|Q,R=0)

where the last product does not depend on D and can be ignored for ranking.

Page 75: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 75

Robertson-Sparck Jones Model (Robertson & Sparck Jones 76)

Two parameters for each term Ai:
  pi = P(Ai=1|Q,R=1): prob. that term Ai occurs in a relevant doc
  qi = P(Ai=1|Q,R=0): prob. that term Ai occurs in a non-relevant doc

  log O(R=1|Q,D)  =(rank)  Σ_{i: di=qi=1} log [ pi (1 - qi) / (qi (1 - pi)) ]      (RSJ model)

How to estimate the parameters? Suppose we have relevance judgments:

  p̂i = ( #(rel. docs containing Ai) + 0.5 ) / ( #(rel. docs) + 1 )
  q̂i = ( #(non-rel. docs containing Ai) + 0.5 ) / ( #(non-rel. docs) + 1 )

("+0.5" and "+1" can be justified by Bayesian estimation.)
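A small sketch of the smoothed estimates and the resulting term weight; the names and example counts are illustrative, and the counts would come from actual relevance judgments:

```python
import math

def rsj_term_weight(rel_with_term, rel_total, nonrel_with_term, nonrel_total):
    """log [ p(1-q) / (q(1-p)) ] with the '+0.5 / +1' smoothing from the slide."""
    p = (rel_with_term + 0.5) / (rel_total + 1)
    q = (nonrel_with_term + 0.5) / (nonrel_total + 1)
    return math.log(p * (1 - q) / (q * (1 - p)))

# a document's score is the sum of these weights over query terms it contains
print(rsj_term_weight(rel_with_term=8, rel_total=10, nonrel_with_term=40, nonrel_total=990))
```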

Page 76: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 76

RSJ Model: No Relevance Info (Croft & Harper 79)

  log O(R=1|Q,D)  =(rank)  Σ_{i: di=qi=1} log [ pi (1 - qi) / (qi (1 - pi)) ]      (RSJ model)

How to estimate the parameters? Suppose we do not have relevance judgments:
  – assume pi to be a constant
  – estimate qi by assuming all documents to be non-relevant

  log O(R=1|Q,D)  =(rank)  Σ_{i: di=qi=1} log [ (N - ni + 0.5) / (ni + 0.5) ]

  N: # documents in the collection
  ni: # documents in which term Ai occurs

Note the connection to IDF:  IDF'(Ai) = log (N - ni) / ni.

Page 77: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 77

Improving RSJ: Adding TF

Let D = d1 … dk, where di is now the frequency count of term Ai.

Basic doc-generation model:

  P(R=1|Q,D) / P(R=0|Q,D) ∝ P(D|Q,R=1) / P(D|Q,R=0)
    = Π_{i=1..k} P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0)
    = Π_{i: di>0} P(Ai=di|Q,R=1)/P(Ai=di|Q,R=0) · Π_{i: di=0} P(Ai=0|Q,R=1)/P(Ai=0|Q,R=0)

2-Poisson mixture model for the within-document frequency of a term, with a hidden "eliteness" variable E:

  P(Ai=f | Q,R) = Σ_{E ∈ {0,1}} P(E|Q,R) · λ_E^f e^{-λ_E} / f!

Many more parameters to estimate! (How many exactly?)

Page 78: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 78

BM25/Okapi Approximation (Robertson et al. 94)

• Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties

• Observations:
  – log O(R=1|Q,D) is a sum of term weights Wi
  – Wi = 0 if TFi = 0
  – Wi increases monotonically with TFi
  – Wi has an asymptotic limit

• The simple function is

    Wi = [ (k1 + 1) TFi / (k1 + TFi) ] · log [ pi (1 - qi) / (qi (1 - pi)) ]

Page 79: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 79

Adding Doc. Length & Query TF

• Incorporating doc length
  – Motivation: the 2-Poisson model assumes equal document lengths
  – Implementation: "carefully" penalize long docs

• Incorporating query TF
  – Motivation: appears not to be well-justified
  – Implementation: a similar TF transformation

• The final formula is called BM25, achieving top TREC performance

Page 80: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 80

The BM25 Formula

The full formula combines the "Okapi TF/BM25 TF" transformation with an IDF-style term weight and a query-TF component. The slide's original formula image was lost; in its widely cited form it is

  score(Q,D) = Σ_{t ∈ Q} log [ (N - n_t + 0.5) / (n_t + 0.5) ]
               · (k1 + 1) f(t,D) / ( f(t,D) + k1 (1 - b + b·|D|/avgdl) )
               · (k3 + 1) qtf(t) / ( k3 + qtf(t) )
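A sketch of this scoring function under the standard parameterization; the default values of k1, b, and k3 are conventional choices, not taken from the slide:

```python
import math

def bm25_score(query_tf, doc_tf, doclen, avgdoclen, df, n_docs,
               k1=1.2, b=0.75, k3=1000.0):
    """Okapi BM25 with the standard TF, length-normalization, and query-TF parts."""
    score = 0.0
    for term, qtf in query_tf.items():
        f = doc_tf.get(term, 0)
        if f == 0 or term not in df:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (k1 + 1) * f / (f + k1 * (1 - b + b * doclen / avgdoclen))
        qtf_part = (k3 + 1) * qtf / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score
```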

Page 81: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 81

Extensions of “Doc Generation” Models

• Capture term dependence (Rijsbergen & Harper 78)

• Alternative ways to incorporate TF (Croft 83, Kalt96)

• Feature/term selection for feedback (Okapi’s TREC reports)

• Other Possibilities (machine learning … )

Page 82: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 82

Query Generation

  O(R=1|Q,D) ∝ P(Q,D|R=1) / P(Q,D|R=0)
    = [ P(Q|D,R=1) P(D|R=1) ] / [ P(Q|D,R=0) P(D|R=0) ]
    = P(Q|D,R=1) · P(D|R=1) / P(D|R=0)        (assuming P(Q|D,R=0) = P(Q|R=0))
    = query likelihood p(q|d) × document prior

Assuming a uniform document prior, we have

  O(R=1|Q,D) ∝ P(Q|D,R=1)

Now the question is how to compute P(Q|D,R=1). Generally this involves two steps:
  (1) estimate a language model based on D
  (2) compute the query likelihood according to the estimated model

This leads to the so-called "Language Modeling Approach" …
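A minimal sketch of these two steps using a unigram language model; smoothing the document model with the collection model (here Jelinek-Mercer with λ = 0.5) is an assumption for illustration, not something this slide prescribes:

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """log P(Q|D): (1) estimate a unigram model from D, smoothed with the
    collection model; (2) sum the log-probabilities of the query terms."""
    doc_tf, coll_tf = Counter(doc_terms), Counter(collection_terms)
    doc_len, coll_len = len(doc_terms), len(collection_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf[t] / doc_len if doc_len else 0.0
        p_coll = coll_tf[t] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll      # Jelinek-Mercer smoothing
        score += math.log(p) if p > 0 else float("-inf")  # unseen everywhere
    return score
```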

Page 83: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 83

Lecture 1: Key Points

• The Vector Space Model is a family of models, not a single model

• There are many variants of TF-IDF weighting, and some are more effective than others

• State-of-the-art retrieval performance is achieved through
  – Bag-of-words representation
  – TF-IDF weighting (BM25) + length normalization
  – Pseudo relevance feedback (mostly for recall)
  – For web search: add PageRank, anchor text, …, plus learning to rank

• Principled approaches didn't lead to good performance directly (before the "language modeling approach" was proposed); heuristic modification has been necessary

Page 84: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 84

Readings

• Amit Singhal's overview:
  – http://sifaka.cs.uiuc.edu/course/ds/mir.pdf

• My review of IR models:
  – http://sifaka.cs.uiuc.edu/course/ds/irmod.pdf

Page 85: Statistical Models for Information Retrieval and Text Mining

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 85

Discussion

• Text retrieval and image retrieval
  – Query language:
      • keywords vs. image features
      • by example
  – Content representation:
      • bag of words vs. "bag of image features"?
      • phrase indexing vs. units from image parsing?
      • sentiment analysis
  – Retrieval heuristics:
      • TF-IDF weighting vs. ?
      • passage retrieval vs. image region retrieval?
      • proximity

• Text retrieval and video retrieval

• Multimedia retrieval?

