Page 1

Term-Specific Smoothing

On a paper by D. Hiemstra

Alexandru A. Chitea <chitea@mpi-inf.mpg.de>

Universität des Saarlandes March 10, 2005

Seminar CS 555 – Language Model based Information Retrieval

Page 2

March 10, 2006 Term-Specific Smoothing 2

Introduction

Experimental approach to Information Retrieval
– A formal model specifies an exact formula, which is tried empirically
– Formulae are tried empirically because they seem plausible

Modeling approach to Information Retrieval
– A formal model specifies an exact formula that is used to prove some simple mathematical properties of the model

Page 3

Information Retrieval – Overview

A system query returns a ranked result list
– Statistical ranking on term frequencies is still standard practice

Search engines provide means to override the default ranking mechanisms
– Users can specify mandatory query terms (e.g. +term or “term” in Google)

Page 4

Information Retrieval – Practice (1)

Query: Star Wars Episode I (I is not treated as a mandatory term)

Page 5

Information Retrieval – Practice (2)

Query: Star Wars Episode +I (I is treated as a mandatory term)

Page 6

Motivation

Performance limitations in statistical ranking

Statistics-based IR models do not capture term

importance specification

User/system should be able to override the

default ranking mechanism

Objective

Mathematical model that supports the concept of query

term importance

Page 7

Language Models

A statistical model for generating text
– Probability distribution over strings in a given language

A model M generates terms t1, …, tn:

  P(t1, …, tn | M) = P(t1 | M) · P(t2 | M, t1) · … · P(tn | M, t1, …, tn−1)

Consider the Unigram Language Model (LM):

  P(t1, …, tn) = P(t1) · … · P(tn)
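The unigram model above is simple to estimate and score; a minimal sketch with MLE estimation (function names are ours):

```python
from collections import Counter

def unigram_lm(sample_text):
    """MLE unigram model: P(t) = count(t) / total number of tokens."""
    tokens = sample_text.lower().split()
    total = len(tokens)
    return {t: c / total for t, c in Counter(tokens).items()}

def sequence_prob(model, terms):
    """P(t1, ..., tn) = P(t1) * ... * P(tn) under the unigram assumption."""
    p = 1.0
    for t in terms:
        p *= model.get(t, 0.0)  # unseen terms get probability 0
    return p
```

Note that a single unseen term drives the whole product to zero, which is exactly the problem smoothing addresses later.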

Page 8

Example – Language Models

IR sample: text … 0.2, search … 0.1, …, mining … 0.1, food … 0.0001, …

  build model: P(…, text, …, food, …) = … · P(text) · … · P(food) · …

Health sample: food … 0.25, nutrition … 0.1, …, healthy … 0.05, diet … 0.02, …

  build model: P(…, food, …, diet, …) = … · P(food) · … · P(diet) · …

Page 9

Language Models in IR

Estimate a LM for each document D

Estimate the probability of generating a query Q with terms (t1, …, tn) using a given model:

  P(t1, …, tn | D)

Rank documents by the probability of generating Q:

  P(D | t1, …, tn) = P(t1, …, tn | D) · P(D) / P(t1, …, tn)

Page 10

Insufficient Data

If a term is not in the document, the query cannot be generated:

  P(t1, …, tn | D) = Π_{i=1..n} P(ti | D) = 0

Smooth probabilities
– Probabilities of observed events are decreased by a certain amount, which is credited to unobserved events

Page 11

Smoothing

Roles
– Estimation >> reevaluation of probabilities
– Query modeling >> to “explain” the common and non-informative terms in a query

Linear interpolation smoothing
– Defines a smoothing parameter necessary for query modeling
– Can be defined as a two-state Hidden Markov Model

Page 12

Smoothing Models

Mixture Model smoothing

– Define a hidden event for all query terms

Term-specific smoothing

– Define a hidden event for each query term

Page 13

Smoothing – Mixture Model

Mixes the probability from the document with the general collection probability of the term:

  P(t1, …, tn | D) = Π_{i=1..n} [ λ·P(ti | D) + (1 − λ)·P(ti | C) ]

λ can be tuned to adjust performance:
– High value >> “conjunctive-like” search, i.e., suitable for short queries
– Low value >> suitable for long queries
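The interpolation above is a one-liner per term; a hedged sketch (function and parameter names are ours):

```python
def mixture_score(query, doc_model, coll_model, lam=0.5):
    """Mixture-model (linear interpolation) smoothing:
    P(t1..tn | D) = prod_i [ lam * P(ti|D) + (1 - lam) * P(ti|C) ]."""
    score = 1.0
    for t in query:
        score *= lam * doc_model.get(t, 0.0) + (1 - lam) * coll_model.get(t, 0.0)
    return score
```

Because the collection term P(ti | C) is nonzero for any term that occurs somewhere in the collection, a document missing one query term no longer scores zero.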

Page 14

Bayesian Networks (1)

A Bayesian Network (BN) is a directed, acyclic graph G(V, E) where:
– Nodes >> Random variables (RVs)
– Edges >> Dependencies

Properties:
– Given a root R ∈ V, the BN captures the prior probability P(R = r)
– Given a node X ∈ V with parents(X) = {P1, …, Pk}, the BN captures the conditional probability P(X = x | P1, …, Pk)
– Node X is conditionally independent of a non-parent node Y given its parents:

  P(X | P1, …, Pk, Y) = P(X | P1, …, Pk)

Page 15

Bayesian Networks (2)

From the properties it holds that:

  P(X1, …, Xn) = P(X1 | X2, …, Xn) · P(X2, …, Xn)

By the chain rule:

  P(X1, …, Xn) = Π_{i=1..n} P(Xi | Xi+1, …, Xn)

By conditional independence:

  P(X1, …, Xn) = Π_{i=1..n} P(Xi | parents(Xi), other nodes) = Π_{i=1..n} P(Xi | parents(Xi))
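The factorization over parents can be exercised on a toy network; a sketch where the network (one document node D generating one term node t) and all CPT numbers are our own illustration:

```python
def bn_joint(assignment, cpts, parents):
    """Joint probability of a full assignment in a Bayesian network:
    P(X1, ..., Xn) = prod_i P(Xi | parents(Xi))."""
    p = 1.0
    for var, value in assignment.items():
        parent_vals = tuple(assignment[q] for q in parents[var])
        p *= cpts[var][parent_vals][value]
    return p

# Toy network D -> t: a document prior and a term distribution per document.
parents = {"D": (), "t": ("D",)}
cpts = {
    "D": {(): {"d1": 0.5, "d2": 0.5}},
    "t": {("d1",): {"revenue": 0.125, "other": 0.875},
          ("d2",): {"revenue": 0.125, "other": 0.875}},
}
```

For example, P(D = d1, t = revenue) = P(d1) · P(revenue | d1) = 0.5 · 0.125 = 0.0625.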

Page 16

LM as a Bayesian Network

Nodes >> random variables

Edges >> model’s conditional dependencies

Clear nodes >> hidden random variables

Shaded nodes >> observed random variables

Figure 1: The language modeling approach as a Bayesian network (node D with child nodes t1, …, tn)

Page 17

Example – Mixture Model (1)

Collection (2 documents)
– d1: IBM reports a profit but revenue is down
– d2: Siemens narrows quarter loss but revenue decreases further

Model: MLE unigram from documents; λ = 1/2
Query: revenue down

  P(t1, t2 | d1) = [½ · (1/8 + 2/16)] · [½ · (1/8 + 1/16)] = 3/256

  P(t1, t2 | d2) = [½ · (1/8 + 2/16)] · [½ · (0 + 1/16)] = 1/256

Ranking: d1 > d2
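The arithmetic above can be checked mechanically with exact fractions; a sketch (variable names are ours):

```python
from collections import Counter
from fractions import Fraction

docs = {
    "d1": "IBM reports a profit but revenue is down".lower().split(),
    "d2": "Siemens narrows quarter loss but revenue decreases further".lower().split(),
}
collection = [t for d in docs.values() for t in d]  # 16 tokens in total

def mle(tokens):
    """Exact MLE unigram model over a token list."""
    n = len(tokens)
    return {t: Fraction(c, n) for t, c in Counter(tokens).items()}

coll_model = mle(collection)
lam = Fraction(1, 2)
query = ["revenue", "down"]

scores = {}
for name, tokens in docs.items():
    doc_model = mle(tokens)
    s = Fraction(1)
    for t in query:
        s *= lam * doc_model.get(t, Fraction(0)) \
             + (1 - lam) * coll_model.get(t, Fraction(0))
    scores[name] = s
# scores: d1 -> 3/256, d2 -> 1/256, so d1 ranks above d2
```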

Page 18

Example – Mixture Model (2)

Figure 2: Bayesian network for the C(d1, d2) language model (collection node C and document node D, with child terms t1: revenue, t3: down)

Page 19

Term-Specific Smoothing

(Graphical models: the mixture model, D → t1, t2, t3, versus the term-specific model, where each ti additionally depends on an importance parameter λi, i = 1, 2, 3.)

Mixture model — one λ shared by all query terms:

  P(t1, …, tn | D) = Π_{i=1..n} [ λ·P(ti | D) + (1 − λ)·P(ti | C) ]

Term-specific smoothing — one λi per query term:

  P(t1, …, tn | D) = Π_{i=1..n} [ λi·P(ti | D) + (1 − λi)·P(ti | C) ]

Page 20

Term-Specific Smoothing – Derivation

Step 1: Assume query term independence

  P(t1, …, tn | D) = Π_{i=1..n} P(ti | D)

Step 2: For each ti introduce a binary RV Ii (i.e. the importance of a query term)

  Ii = 1 if ti is important, 0 otherwise

  P(t1, …, tn | D) = Π_{i=1..n} [ Σ_{k∈{0,1}} P(ti, Ii = k | D) ]

Page 21

Term-Specific Smoothing – Derivation

Step 3: Assume query term importance does not depend on D

  P(t1, …, tn | D) = Π_{i=1..n} [ Σ_{k∈{0,1}} P(Ii = k)·P(ti | Ii = k, D) ]

Step 4: Writing the full sum over the importance values yields:

  P(t1, …, tn | D) = Π_{i=1..n} [ P(Ii = 0)·P(ti | Ii = 0, D) + P(Ii = 1)·P(ti | Ii = 1, D) ]

Page 22

Term-Specific Smoothing – Derivation

Step 4 (contd.):

– Let
    λi = P(Ii = 1)
    1 − λi = P(Ii = 0)

– Assume
    P(ti | D) = P(ti | Ii = 1, D)
    P(ti | C) = P(ti | Ii = 0, D)

  P(t1, …, tn | D) = Π_{i=1..n} [ (1 − λi)·P(ti | C) + λi·P(ti | D) ]
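The final formula differs from the mixture model only in carrying one weight per term; a hedged sketch (names are ours):

```python
def term_specific_score(query, lambdas, doc_model, coll_model):
    """Term-specific smoothing:
    P(t1..tn | D) = prod_i [ (1 - lam_i) * P(ti|C) + lam_i * P(ti|D) ]."""
    score = 1.0
    for t, lam in zip(query, lambdas):
        score *= (1 - lam) * coll_model.get(t, 0.0) + lam * doc_model.get(t, 0.0)
    return score
```

With λi = 0 the factor reduces to the document-independent P(ti | C), so the term cannot change the ranking; with λi = 1 any document missing ti scores 0, i.e. the term is mandatory.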

Page 23

Term-Specific Smoothing – Properties

Case 1: Stop Words (‘–’)
– λi = 0 >> the query term is not important
– The factor reduces to P(ti | C) >> query term ti is ignored

Case 2: Mandatory Terms (‘+’)
– λi = 1 >> relevant documents must contain the query term
– (1 − λi)·P(ti | C) = 0 >> no smoothing by the collection model is performed, so P(ti | D) = 0 excludes the document

Case 3: Coordination level ranking
– λi → 1 for all i

Page 24

Stop Words

Setting: λi = 0

Query terms that are ignored during the search

Reasons:
– Frequent words (e.g. the, it, a, …) might not contribute significantly to the final document score, but they do require processing power
– Words are stopped if they carry little meaning (e.g. hereupon, whereafter)

Page 25

Mandatory Terms

Setting: λi = 1

A query term that should occur in every retrieved document

– The collection model can be dropped from the calculation of the document score
– Documents that do not match the query term are assigned null probabilities
– Users specify mandatory terms (e.g. by +)

Page 26

Coordination Level Ranking

Setting: λi → 1 (for all i)

A document containing n query terms will always rank higher than one containing n − 1 query terms

Most tf.idf-ranking methods do not behave like coordination level ranking

Page 27

Term-Specific Smoothing – Review

The term importance probability makes it possible to:
– Ignore query terms, which statistics alone cannot always account for
– Restrict the retrieved list to documents that match specific terms, regardless of their frequency distributions
– Enforce a coordination level ranking of the documents, regardless of the terms’ frequency distributions

Page 28

Relevance Feedback

Predict optimal values for λi

– Train on relevant documents and predict, for each term, the probability of term importance that maximizes retrieval performance
– Use the Expectation Maximization (EM) algorithm: maximize the probability of the observed data given some training data

Page 29

EM Algorithm

The algorithm iteratively maximizes the probability of the query t1, …, tn given r relevant documents D1, …, Dr

E-step:

  mi = Σ_{j=1..r} λi^(p)·P(ti | Dj) / [ (1 − λi^(p))·P(ti | C) + λi^(p)·P(ti | Dj) ]

M-step:

  λi^(p+1) = mi / r
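The two updates can be sketched for a single query term; a hedged implementation of the E-step and M-step above (names and the fixed iteration count are ours):

```python
def em_lambda(term, rel_doc_models, coll_model, iters=20, lam0=0.5):
    """Estimate the importance weight lambda_i of one query term from
    r relevant-document models via the EM updates on the slide."""
    r = len(rel_doc_models)
    lam = lam0
    p_c = coll_model.get(term, 0.0)
    for _ in range(iters):
        # E-step: expected count of relevant docs where the term was "important"
        m = 0.0
        for dm in rel_doc_models:
            p_d = dm.get(term, 0.0)
            denom = (1 - lam) * p_c + lam * p_d
            m += (lam * p_d / denom) if denom > 0 else 0.0
        # M-step: re-estimate lambda as the expected fraction
        lam = m / r
    return lam
```

If the term occurs in the relevant documents with higher probability than in the collection, λi is driven toward 1; if it never occurs in them, λi collapses to 0.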

Page 30

Generalization of Term Importance

Allow the RV Ii to have more than 2 realizations:
– Combine the unigram document model with the bigram document model

  Ii ∈ {0, 1, 2},  μi = P(Ii = 2),  and  P(ti | ti−1, D) = P(ti | ti−1, Ii = 2, D)

  P(t1, …, tn | D) = [ (1 − λ1)·P(t1 | C) + λ1·P(t1 | D) ]
      · Π_{i=2..n} [ (1 − λi − μi)·P(ti | C) + λi·P(ti | D) + μi·P(ti | ti−1, D) ]

Page 31

Example – General Model

“last will” of Alfred Nobel  (λi = 0)

+“last will” of Alfred Nobel  (λi = 1, μi = 0)

Figure 3: Graphical model of dependence relations between query terms (D with parameters λ1, λ2, λ3 over terms t1, t2, t3)

Page 32

Future Research

Define a unigram LM for a topic-specific space

Extend beyond term-matching
– Use syntax (bag of words vs. structured text) and semantics (exact terms vs. “equivalent” terms)

Page 33

Conclusions

Extension to the LM approach to IR: model the importance of a query term
– Stop Words/Phrases: trade-off between search quality and search speed
– Mandatory Terms: the user overrides the default ranking algorithm

Statistical ranking algorithms motivated by the LM approach perform well in an empirical setting

Page 34

Discussion

Is this a valid approach?

How does it differ from term weighting?

Why do we want coordination level ranking?

Is the bi-gram generalization valid and/or useful?

Page 35

References

1. D. Hiemstra. Term-Specific Smoothing for the Language Modeling Approach to Information Retrieval: The Importance of a Query Term. SIGIR ’02, August 11–15, 2002.
2. G. Weikum. Information Retrieval and Data Mining. Course slides, Universität des Saarlandes (retrieved February 15, 2006).

