Page 1

Language Models for Information Retrieval

References:
1. W. B. Croft and J. Lafferty (Editors), Language Modeling for Information Retrieval, July 2003
2. X. Liu and W. B. Croft, "Statistical Language Modeling for Information Retrieval," Annual Review of Information Science and Technology, vol. 39, 2005
3. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (Chapter 12)
4. D. A. Grossman and O. Frieder, Information Retrieval: Algorithms and Heuristics, Springer, 2004 (Chapter 2)
5. C. X. Zhai, Statistical Language Models for Information Retrieval (Synthesis Lectures on Human Language Technologies), Morgan & Claypool Publishers, 2008

Berlin Chen
Department of Computer Science & Information Engineering

National Taiwan Normal University

Page 2

Taxonomy of Classic IR Models

[Figure: taxonomy of IR models, organized by document property]
• Document Property: Text, Links, Multimedia
• Semi-structured Text: Proximal Nodes, XML-based, others
• Classic Models: Boolean, Vector, Probabilistic
• Set Theoretic: Fuzzy, Extended Boolean, Set-based
• Probabilistic: BM25, Language Models, Divergence from Randomness, Bayesian Networks
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks, Support Vector Machines
• Web: Page Rank, Hubs & Authorities
• Multimedia Retrieval: Image retrieval, Audio and Music Retrieval, Video Retrieval

Page 3

Statistical Language Models (1/2)

• A probabilistic mechanism for "generating" a piece of text $W = w_1 w_2 \cdots w_L$
  – Defines a distribution $P(W)$ over all possible word sequences
  – The LM is used to quantify the acceptability of a given word sequence
• What is an LM used for?
  – Speech recognition
  – Spelling correction
  – Handwriting recognition
  – Optical character recognition
  – Machine translation
  – Document classification and routing
  – Information retrieval, …

Page 4

Statistical Language Models (2/2)

• (Statistical) language models (LM) have been widely used for speech recognition and language (machine) translation for more than thirty years

• However, their use for information retrieval started only in 1998 [Ponte and Croft, SIGIR 1998]
  – Basically, a query is considered to be generated from an "ideal" document that satisfies the information need
  – The system's job is then to estimate the likelihood of each document in the collection being the ideal document, and rank them accordingly (in decreasing order)

Ponte and Croft. A language modeling approach to information retrieval. SIGIR 1998

Page 5

Three Ways of Developing LM Approaches for IR

(a) Query likelihood
(b) Document likelihood
(c) Model comparison

literal term matching or concept matching

Page 6

Query-Likelihood Language Models

• Criterion: documents are ranked based on Bayes (decision) rule

$$P(D|Q) = \frac{P(Q|D)\,P(D)}{P(Q)}$$

  – $P(Q)$ is the same for all documents, and can be ignored
  – $P(D)$ might have to do with authority, length, genre, etc.
    • There is no general way to estimate it
    • It can be treated as uniform across all documents
• Documents can therefore be ranked based on $P(Q|D)$, also denoted $P(Q|M_D)$, where $M_D$ is the document model
  – The user has a prototype (ideal) document in mind, and generates a query based on words that appear in this document
  – A document is treated as a model to predict (generate) the query
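The criterion above maps directly onto a few lines of code. Below is a minimal, illustrative sketch (not from the original slides): `query_likelihood` is a hypothetical stand-in for any estimator of $P(Q|M_D)$, such as the smoothed unigram model introduced later.

```python
from typing import Callable, Dict, List, Tuple

def rank_by_query_likelihood(
    query: List[str],
    docs: Dict[str, List[str]],
    query_likelihood: Callable[[List[str], List[str]], float],
) -> List[Tuple[str, float]]:
    """Rank documents by P(Q|M_D) * P(D) with a uniform prior P(D).

    P(Q) is constant across documents, and a uniform P(D) does not
    change the ordering, so ranking by P(D|Q) reduces to ranking by
    the query likelihood P(Q|M_D) alone.
    """
    prior = 1.0 / len(docs)  # uniform P(D); kept only for clarity
    scores = {name: query_likelihood(query, words) * prior
              for name, words in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```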

Page 7

Another Criterion: Maximum Mutual Information

• Documents can be ranked based on their mutual information with the query (in decreasing order)

$$MI(Q,D) = \log\frac{P(Q,D)}{P(Q)\,P(D)} = \log P(Q|D) - \log P(Q)$$

  – $\log P(Q)$ is the same for all documents, and hence can be ignored
• Document ranking by mutual information (MI) is therefore equivalent to ranking by likelihood:

$$\arg\max_D MI(Q,D) \overset{\text{rank}}{=} \arg\max_D P(Q|D)$$

Page 8

Yet Another Criterion: Minimum KL Divergence

• Documents are ranked by the Kullback-Leibler (KL) divergence between the query model and the document model (in increasing order)

$$KL(Q\,\|\,D) = \sum_w P(w|Q)\log\frac{P(w|Q)}{P(w|D)} = \sum_w P(w|Q)\log P(w|Q) \;-\; \sum_w P(w|Q)\log P(w|D)$$

  – The first term is the same for all documents and can be disregarded
  – The second term is the cross entropy between the language models of a query (query model) and a document (document model); relevant documents are deemed to have lower cross entropies
  – Ranking by increasing KL divergence is therefore equivalent to ranking in decreasing order of the query log-likelihood:

$$\sum_w c(w,Q)\log P(w|D) \;\overset{\text{rank}}{=}\; \log P(Q|D)$$
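As a concrete illustration (my sketch, not part of the slides), the score below implements the rank-equivalent quantity $\sum_w P(w|Q)\log P(w|D)$; it assumes the document model passed in has already been smoothed so that every query word has non-zero probability.

```python
import math
from collections import Counter
from typing import Dict, List

def mle_query_model(query: List[str]) -> Dict[str, float]:
    """MLE query model: P(w|Q) = c(w,Q) / |Q|."""
    counts = Counter(query)
    total = len(query)
    return {w: c / total for w, c in counts.items()}

def neg_cross_entropy(p_w_q: Dict[str, float],
                      p_w_d: Dict[str, float]) -> float:
    """Score = sum_w P(w|Q) log P(w|D).

    Ranking by this score in decreasing order is equivalent to ranking
    by KL(Q || D) in increasing order, since the query-entropy term is
    identical for every document.
    """
    return sum(p * math.log(p_w_d[w]) for w, p in p_w_q.items())
```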

Page 9

Schematic Depiction for Query-Likelihood Approach

[Figure: each document $D_1, D_2, D_3, \ldots$ in the document collection is converted into a document model $M_{D_1}, M_{D_2}, M_{D_3}, \ldots$, and the query $Q$ is scored against each of these document models.]

Page 10

Building Document Models: n-grams

• Multiplication (chain) rule
  – Decomposes the probability of a sequence of events into the probability of each successive event conditioned on the earlier events

$$P(w_1 w_2 \ldots w_L) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_1 w_2)\cdots P(w_L|w_1 \cdots w_{L-1})$$

• n-gram assumption
  – Unigram: each word occurs independently of the other words (the so-called "bag-of-words" model; e.g., how can we distinguish "street market" from "market street"?)

$$P(w_1 w_2 \ldots w_L) = P(w_1)\,P(w_2)\,P(w_3)\cdots P(w_L)$$

  – Bigram: each word depends only on the immediately preceding word

$$P(w_1 w_2 \ldots w_L) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_2)\cdots P(w_L|w_{L-1})$$

  – Most language-modeling work in IR has used unigram models (IR does not directly depend on the structure of sentences); see the sketch below
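For concreteness, here is a small sketch (mine, with hypothetical probability-table inputs) of the unigram and bigram factorizations above:

```python
from typing import Dict, List, Tuple

def unigram_prob(seq: List[str], p1: Dict[str, float]) -> float:
    """P(w_1 ... w_L) = prod_i P(w_i) under the unigram assumption."""
    prob = 1.0
    for w in seq:
        prob *= p1[w]
    return prob

def bigram_prob(seq: List[str], p1: Dict[str, float],
                p2: Dict[Tuple[str, str], float]) -> float:
    """P(w_1 ... w_L) = P(w_1) * prod_{i>1} P(w_i | w_{i-1})."""
    prob = p1[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= p2[(prev, cur)]
    return prob
```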

Page 11

Unigram Model (1/4)

• The likelihood of a query $Q = w_1 w_2 \ldots w_L$ given a document $D$:

$$P(Q|M_D) = P(w_1|M_D)\,P(w_2|M_D)\cdots P(w_L|M_D) = \prod_{i=1}^{L}P(w_i|M_D)$$

  – Words are conditionally independent of each other given the document
  – How do we estimate $P(w_i|M_D)$, the probability of a (query) word given the document?
• Assume that word counts follow a multinomial distribution given the document:

$$P\left(c(w_1),\ldots,c(w_V)\right) = \frac{\left(\sum_{i=1}^{V}c(w_i)\right)!}{\prod_{i=1}^{V}c(w_i)!}\prod_{i=1}^{V}\lambda_i^{\,c(w_i)}, \qquad \sum_{i=1}^{V}\lambda_i = 1$$

where $c(w_i)$ is the number of times word $w_i$ occurs, $\lambda_i = P(w_i|M_D)$, and permutation is considered here.

Page 12

Unigram Model (2/4)

• Use each document itself as a sample for estimating its corresponding unigram (multinomial) model
  – If Maximum Likelihood Estimation (MLE) is adopted:

$$P(w_i|M_D) = \frac{c(w_i,D)}{|D|}$$

where $c(w_i,D)$ is the number of times $w_i$ occurs in $D$, and $|D| = \sum_i c(w_i,D)$ is the length of $D$.

  – Example: Doc $D$ = {wa, wa, wa, wa, wb, wb, wb, wc, wc, wd}, so $|D| = 10$ and
    $P(w_a|M_D)=0.4$, $P(w_b|M_D)=0.3$, $P(w_c|M_D)=0.2$, $P(w_d|M_D)=0.1$, $P(w_e|M_D)=0.0$, $P(w_f|M_D)=0.0$

• The zero-probability problem: if $w_e$ and $w_f$ do not occur in $D$, then $P(w_e|M_D) = P(w_f|M_D) = 0$
  – This will cause a problem in predicting the query likelihood (see the equation for the query likelihood in the preceding slide)
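A minimal sketch of MLE estimation on the toy document above (my code, not from the slides), showing how unseen words receive zero probability:

```python
from collections import Counter
from typing import Dict, List

def mle_unigram(doc: List[str]) -> Dict[str, float]:
    """MLE document model: P(w|M_D) = c(w,D) / |D|."""
    counts = Counter(doc)
    length = len(doc)
    return {w: c / length for w, c in counts.items()}

# The toy document from the slide: |D| = 10
doc_d = ["wa"] * 4 + ["wb"] * 3 + ["wc"] * 2 + ["wd"]
model = mle_unigram(doc_d)
assert model["wa"] == 0.4 and model["wb"] == 0.3
# Unseen words get probability 0.0 -- the zero-probability problem:
print(model.get("we", 0.0))  # 0.0, which zeroes out any query containing "we"
```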

Page 13

Unigram Model (3/4)

• Smooth the document-specific unigram model with a collection model (two states, or a mixture of two multinomials); for a query $Q = w_1 w_2 \ldots w_L$:

$$P(Q|M_D) = \prod_{i=1}^{L}\left[\lambda\,P(w_i|M_D) + (1-\lambda)\,P(w_i|M_C)\right]$$

• The role of the collection unigram model $P(w_i|M_C)$:
  – Helps to solve the zero-probability problem
  – Helps to differentiate the contributions of different missing terms in a document (global information like IDF?)
• The collection unigram model can be estimated in a similar way as the document-specific unigram model:

$$P(w_i|M_C) = \frac{c(w_i,\text{Collection})}{\sum_{l}c(w_l,\text{Collection})} \qquad\text{or}\qquad P(w_i|M_C) = \frac{n_{w_i}}{\sum_{l}n_{w_l}} \quad\text{(normalized doc freq)}$$

where $n_{w_i}$ is the number of docs in the collection containing $w_i$, and $N$ is the number of docs in the collection.
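A sketch of the smoothed query likelihood (my illustration; `collection_model` is assumed to be a precomputed table of $P(w|M_C)$ values):

```python
from collections import Counter
from typing import Dict, List

def jm_query_likelihood(query: List[str], doc: List[str],
                        collection_model: Dict[str, float],
                        lam: float = 0.5) -> float:
    """P(Q|M_D) = prod_i [ lam * P(w_i|M_D) + (1-lam) * P(w_i|M_C) ].

    Jelinek-Mercer-style smoothing: the collection model covers words
    missing from the document, so the product never hits zero for any
    word seen somewhere in the collection.
    """
    counts = Counter(doc)
    length = len(doc)
    prob = 1.0
    for w in query:
        p_doc = counts.get(w, 0) / length
        p_col = collection_model.get(w, 0.0)
        prob *= lam * p_doc + (1.0 - lam) * p_col
    return prob
```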

Page 14

Unigram Model (4/4)

• An evaluation on the Topic Detection and Tracking (TDT) corpora

  – Language Model:

    mAP        | Unigram | Unigram+Bigram
    TDT2 TQ/TD | 0.6327  | 0.5427
    TDT2 TQ/SD | 0.5658  | 0.4803
    TDT3 TQ/TD | 0.6569  | 0.6141
    TDT3 TQ/SD | 0.6308  | 0.5808

  – Vector Space Model:

    mAP        | Unigram | Unigram+Bigram
    TDT2 TQ/TD | 0.5548  | 0.5623
    TDT2 TQ/SD | 0.5122  | 0.5225
    TDT3 TQ/TD | 0.6505  | 0.6531
    TDT3 TQ/SD | 0.6216  | 0.6233

  – The language models being compared:

$$P_{\text{Unigram}}(Q|M_D) = \prod_{i=1}^{L}\left[\lambda_1 P(w_i|M_D) + (1-\lambda_1)\,P(w_i|M_C)\right]$$

$$P_{\text{Unigram+Bigram}}(Q|M_D) = \prod_{i}\left[\lambda_1 P(w_i|M_D) + \lambda_2 P(w_i|w_{i-1},M_D) + \lambda_3 P(w_i|M_C) + (1-\lambda_1-\lambda_2-\lambda_3)\,P(w_i|w_{i-1},M_C)\right]$$

• Consideration of contextual information (higher-order language models, e.g., bigrams) will not always lead to improved performance

Page 15

Training Mixture Weights of LMs

• Expectation-Maximization (EM) Training
  – The weights are tied among the documents
  – E.g., $m_1$ of the Type I HMM can be trained using the following equation:

$$\hat{m}_1 = \frac{\displaystyle\sum_{Q\in\text{TrainSet}}\;\sum_{D\in R_Q}\;\sum_{q_n\in Q}\frac{m_1\,P(q_n|D)}{m_1\,P(q_n|D) + m_2\,P(q_n|\text{Corpus})}}{\displaystyle\sum_{Q\in\text{TrainSet}}|R_Q|\cdot|Q|}$$

where $\hat m_1$ is the new weight and $m_1$ the old weight, TrainSet is the set of training query exemplars, $R_Q$ is the set of docs that are relevant to a specific training query exemplar $Q$, $|Q|$ is the length of the query $Q$, and $|R_Q|$ is the total number of docs relevant to the query (819 queries, ≤ 2265 docs).
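The re-estimation formula can be coded directly. Below is an illustrative sketch under the stated weight-tying assumption; `train_set` pairs each training query with the smoothed unigram models of its relevant documents (a hypothetical data layout, not prescribed by the slides):

```python
from typing import Dict, List, Tuple

def em_update_m1(train_set: List[Tuple[List[str], List[Dict[str, float]]]],
                 corpus_model: Dict[str, float],
                 m1: float) -> float:
    """One EM re-estimation of the tied document-model weight m1.

    Implements
      m1_new = [ sum_Q sum_D sum_q  m1*P(q|D) / (m1*P(q|D) + m2*P(q|Corpus)) ]
               / [ sum_Q |R_Q| * |Q| ]
    """
    m2 = 1.0 - m1
    numer, denom = 0.0, 0.0
    for query, rel_doc_models in train_set:
        for doc_model in rel_doc_models:
            for q in query:
                p_d = doc_model.get(q, 0.0)
                p_c = corpus_model.get(q, 0.0)
                mixed = m1 * p_d + m2 * p_c
                if mixed > 0.0:
                    numer += m1 * p_d / mixed  # posterior of the doc state
            denom += len(query)  # accumulates |R_Q| * |Q| per query
    return numer / denom
```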

Page 16

Discriminative Training of LMs (1/5)

• Minimum Classification Error (MCE) Training
  – Given a query $Q$ and a desired relevant doc $D^*$, define the classification error function as:

$$E(Q,D^*) = \frac{1}{|Q|}\left[-\log P(Q|D^*) + \log \max_{D'\,:\,D'\text{ not relevant}} P(Q|D')\right]$$

    • "$E>0$" means misclassified; "$E\le 0$" means a correct decision
  – Transform the error function into the loss function:

$$L(Q,D^*) = \frac{1}{1+\exp\left(-E(Q,D^*)\right)}$$

    • The loss lies in the range between 0 and 1

B. Chen et al., “A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents,” ACM Transactions on Asian Language Information Processing, 3(2), pp. 128-145, June 2004
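A small sketch of the error and loss computations (mine; it takes precomputed log-likelihoods rather than recomputing them):

```python
import math

def mce_error(log_p_q_rel: float, log_p_q_best_nonrel: float,
              query_len: int) -> float:
    """E(Q,D*) = (1/|Q|) * [ -log P(Q|D*) + log max_{D' non-rel} P(Q|D') ].

    E > 0 means the best non-relevant document outscored D*
    (a misclassification); E <= 0 means a correct decision.
    """
    return (-log_p_q_rel + log_p_q_best_nonrel) / query_len

def mce_loss(error: float) -> float:
    """Sigmoid loss L(Q,D*) = 1 / (1 + exp(-E)): in (0, 1) and differentiable."""
    return 1.0 / (1.0 + math.exp(-error))
```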

Page 17

Discriminative Training of LMs (2/5)

• Minimum Classification Error (MCE) Training
  – Apply the loss function to the MCE procedure for iteratively updating the weighting parameters
    • Constraints: $m_1 + m_2 = 1$, $0 \le m_k \le 1$
    • Parameter transformation (e.g., Type I HMM):

$$m_1 = \frac{e^{\tilde m_1}}{e^{\tilde m_1}+e^{\tilde m_2}}, \qquad m_2 = \frac{e^{\tilde m_2}}{e^{\tilde m_1}+e^{\tilde m_2}}$$

  – Iteratively update $\tilde m_1$ by gradient descent (e.g., Type I HMM):

$$\tilde m_1(i+1) = \tilde m_1(i) - \epsilon\left.\frac{\partial L(Q,D^*)}{\partial \tilde m_1}\right|_{\tilde m_1 = \tilde m_1(i)}$$

    • where

$$\frac{\partial L(Q,D^*)}{\partial \tilde m_1} = L(Q,D^*)\left(1-L(Q,D^*)\right)\frac{\partial E(Q,D^*)}{\partial \tilde m_1}$$

Page 18

Discriminative Training of LMs (3/5)

• Minimum Classification Error (MCE) Training
  – Iteratively update $\tilde m_1$ (e.g., Type I HMM). With $P(q_n|D^*) = m_1\,P(q_n|D^*) + m_2\,P(q_n|\text{Corpus})$ as the mixture, $\partial m_1/\partial\tilde m_1 = m_1 m_2$, and $\partial m_2/\partial\tilde m_1 = -m_1 m_2$:

$$\frac{\partial E(Q,D^*)}{\partial \tilde m_1} = -\frac{1}{|Q|}\sum_{q_n\in Q}\frac{m_1 m_2\left[P(q_n|D^*) - P(q_n|\text{Corpus})\right]}{m_1\,P(q_n|D^*) + m_2\,P(q_n|\text{Corpus})}$$

Note: the derivation uses $\left(\log f(x)\right)' = f'(x)/f(x)$ and the quotient rule $\left(f/g\right)' = \left(f'g - fg'\right)/g^2$.

Page 19

Discriminative Training of LMs (4/5)

• Minimum Classification Error (MCE) Training
  – Iteratively update $m_1$ (e.g., Type I HMM): applying the gradient step to the transformed parameters and mapping back through the transformation,

$$\tilde m_k(i+1) = \tilde m_k(i) + \delta_k(i), \qquad \delta_k(i) = -\epsilon\,L(Q,D^*)\left(1-L(Q,D^*)\right)\left.\frac{\partial E(Q,D^*)}{\partial \tilde m_k}\right|_{i}$$

$$m_1(i+1) = \frac{e^{\tilde m_1(i+1)}}{e^{\tilde m_1(i+1)}+e^{\tilde m_2(i+1)}} = \frac{m_1(i)\,e^{\delta_1(i)}}{m_1(i)\,e^{\delta_1(i)} + m_2(i)\,e^{\delta_2(i)}}$$

where $m_1(i+1)$ is the new weight and $m_1(i)$ the old weight.

Page 20

Discriminative Training of LMs (5/5)

• Minimum Classification Error (MCE) Training
  – Final equations for iteratively updating $m_1$: since $\partial E/\partial\tilde m_2 = -\,\partial E/\partial\tilde m_1$, the two exponents differ only in sign, giving

$$m_1(i+1) = \frac{m_1(i)\,e^{\delta_1(i)}}{m_1(i)\,e^{\delta_1(i)} + m_2(i)\,e^{-\delta_1(i)}}$$

with

$$\delta_1(i) = \epsilon\,L(Q,D^*)\left(1-L(Q,D^*)\right)\frac{1}{|Q|}\sum_{q_n\in Q}\frac{m_1 m_2\left[P(q_n|D^*) - P(q_n|\text{Corpus})\right]}{m_1\,P(q_n|D^*) + m_2\,P(q_n|\text{Corpus})}$$

  – $m_2$ can be updated in the similar way

Page 21

Statistical Translation Model (1/2)

• A query is viewed as a translation or distillation from a document [Berger & Lafferty (1999)]
  – That is, the similarity measure is computed by estimating the probability that the query would have been generated as a translation of that document:

$$sim(Q,D) = P(Q|D) = \prod_{q\in Q}P(q|D)^{\,c(q,Q)}, \qquad P(q|D) = \sum_{w\in D}P_{\text{Trans}}(q|w)\,P(w|D)$$

where $P_{\text{Trans}}(q|w)$ is the word-to-word translation probability (estimated, e.g., with IBM Model 1).

• Assumption of context-independence (the ability to handle the ambiguity of word senses is limited)
• However, it has the capability of handling the issues of synonymy (multiple terms having similar meanings) and polysemy (the same term having multiple meanings)

A. Berger and J. Lafferty. Information retrieval as statistical translation. SIGIR 1999
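An illustrative sketch of the translation-based query likelihood (my code; `p_trans` is a hypothetical table of word-to-word translation probabilities $P_{\text{Trans}}(q|w)$, however estimated):

```python
from collections import Counter
from typing import Dict, List, Tuple

def translation_query_likelihood(
    query: List[str], doc: List[str],
    p_trans: Dict[Tuple[str, str], float],
) -> float:
    """sim(Q,D) = prod_q [ sum_{w in D} P_Trans(q|w) P(w|D) ]^c(q,Q).

    p_trans[(q, w)] holds the translation probability P_Trans(q|w);
    P(w|D) is the MLE document model.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    sim = 1.0
    for q, c_q in Counter(query).items():
        p_q = sum(p_trans.get((q, w), 0.0) * (c_w / doc_len)
                  for w, c_w in doc_counts.items())
        sim *= p_q ** c_q
    return sim
```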

Page 22

Statistical Translation Model (2/2)

• Weaknesses of the statistical translation model
  – The need for a large collection of training data for estimating translation probabilities, and inefficiency for ranking documents
• Jin et al. (2002) proposed a "Title Language Model" approach to capture the intrinsic document-to-query translation patterns
  – Queries are more like titles than documents (queries and titles both tend to be very short and concise descriptions of information, and are created through a similar generation process)
  – Train the statistical translation model based on the document-title pairs in the whole collection

R. Jin et al. Title language model for information retrieval. SIGIR 2002

$$M^* = \arg\max_{M}\prod_{j=1}^{N}P(T_j|D_j,M) = \arg\max_{M}\prod_{j=1}^{N}\prod_{t\in T_j}P(t|D_j,M)^{\,c(t,T_j)} = \arg\max_{M}\prod_{j=1}^{N}\prod_{t\in T_j}\left[\sum_{w\in D_j}P(t|w,M)\,P(w|D_j)\right]^{c(t,T_j)}$$

where $T_j$ is the title of document $D_j$ and $N$ is the number of document-title pairs in the collection.

Page 23

Probabilistic Latent Semantic Analysis (PLSA)

• Also called the Aspect Model, or Probabilistic Latent Semantic Indexing (PLSI)
  – Graphical model representation (a kind of Bayesian network)

T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 2001

Hofmann (1999)

• With a unigram document model, retrieval ranks by

$$sim(Q,D) = P(D|Q) \propto P(Q|D)\,P(D), \qquad P(Q|D) = \prod_{w\in Q}\left[\lambda\,P(w|M_D) + (1-\lambda)\,P(w|M_C)\right]^{c(w,Q)}$$

• With PLSA, the document language model is instead expressed through latent topics:

$$sim(Q,D) = P(Q|D) = \prod_{w\in Q}P(w|D)^{\,c(w,Q)}, \qquad P(w|D) = \sum_{k=1}^{K}P(w|T_k)\,P(T_k|D)$$

  – The latent variables are the unobservable class variables $T_k$ (topics or domains)

Page 24

PLSA: Formulation

• Definition
  – $P(D)$: the probability of selecting a document $D$
  – $P(T_k|D)$: the probability of picking a latent class $T_k$ for the document $D$
  – $P(w|T_k)$: the probability of generating a word $w$ from the class $T_k$

Page 25

PLSA: Assumptions

• Bag-of-words: treat docs as a memoryless source; words are generated independently
• Conditional independence: the doc $D$ and the word $w$ are independent conditioned on the state of the associated latent variable $T_k$:

$$P(w,D|T_k) = P(w|T_k)\,P(D|T_k)$$

so that

$$P(w,D) = \sum_{k=1}^{K}P(w,D,T_k) = \sum_{k=1}^{K}P(w,D|T_k)\,P(T_k) = \sum_{k=1}^{K}P(w|T_k)\,P(D|T_k)\,P(T_k) = P(D)\sum_{k=1}^{K}P(w|T_k)\,P(T_k|D)$$

and

$$sim(Q,D) = P(Q|D) = \prod_{w\in Q}P(w|D)^{\,c(w,Q)}, \qquad P(w|D) = \sum_{k=1}^{K}P(w|T_k)\,P(T_k|D)$$

Page 26

PLSA: Training (1/2)

• Probabilities are estimated by maximizing the collection likelihood using the Expectation-Maximization (EM) algorithm

$$L_C = \sum_{D}\sum_{w}c(w,D)\log P(w|D) = \sum_{D}\sum_{w}c(w,D)\log\left[\sum_{k=1}^{K}P(w|T_k)\,P(T_k|D)\right]$$

EM tutorial: Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," U.C. Berkeley TR-97-021

Page 27

PLSA: Training (2/2)

• E (expectation) step:

$$P(T_k|w,D) = \frac{P(w|T_k)\,P(T_k|D)}{\sum_{k'=1}^{K}P(w|T_{k'})\,P(T_{k'}|D)}$$

• M (maximization) step:

$$\hat P(w|T_k) = \frac{\sum_{D}c(w,D)\,P(T_k|w,D)}{\sum_{w'}\sum_{D}c(w',D)\,P(T_k|w',D)}, \qquad \hat P(T_k|D) = \frac{\sum_{w}c(w,D)\,P(T_k|w,D)}{\sum_{w}c(w,D)}$$

Page 28

PLSA: Latent Probability Space (1/2)

[Figure: example latent classes discovered by PLSA, e.g., "image sequence analysis", "medical imaging", "context of contour / boundary detection", and "phonetic segmentation"; dimensionality K = 128 (latent classes).]

• The joint word-document probabilities define an SVD-like factorization:

$$P(w_j,D_i) = \sum_{k}P(w_j|T_k)\,P(T_k)\,P(D_i|T_k)$$

i.e., $P = U\Sigma V^{T}$ with $U = \left[P(w_j|T_k)\right]_{j,k}$, $\Sigma = \text{diag}\left[P(T_k)\right]_k$, and $V = \left[P(D_i|T_k)\right]_{i,k}$.

Page 29

PLSA: Latent Probability Space (2/2)

[Figure: matrix form of the decomposition: the $m\times n$ word-document matrix $P = \left[P(w_j,D_i)\right]$, with rows $w_1,\ldots,w_m$ and columns $D_1,\ldots,D_n$, equals $U_{m\times k}\,\Sigma_{k\times k}\,V^{T}_{k\times n}$, where $U$ holds $P(w_j|T_k)$, $\Sigma$ holds $P(T_k)$ on its diagonal, $V^{T}$ holds $P(D_i|T_k)$, and the inner dimension ranges over topics $T_1,\ldots,T_K$.]

Page 30

PLSA: One more example on TDT1 dataset

[Figure: example topics extracted from the TDT1 dataset, e.g., "aviation", "space missions", "family love", "Hollywood love".]

Page 31

PLSA: Experiment Results (1/4)

• Experimental Results
  – Two ways to smooth the empirical distribution with PLSA:
    • PLSA-U*: combine the cosine score with that of the vector space model (as is also done for LSA); see the next slide
    • PLSA-Q*: combine the multinomials individually:

$$P_{\text{PLSA-Q}^*}(w|D) = \lambda\,P_{\text{Empirical}}(w|D) + (1-\lambda)\,P_{\text{PLSA}}(w|D)$$

where

$$P_{\text{Empirical}}(w|D) = \frac{c(w,D)}{|D|}, \qquad P_{\text{PLSA}}(w|D) = \sum_{k=1}^{K}P(w|T_k)\,P(T_k|D)$$

$$P_{\text{PLSA-Q}^*}(Q|D) = \prod_{w\in Q}\left[\lambda\,P_{\text{Empirical}}(w|D) + (1-\lambda)\,P_{\text{PLSA}}(w|D)\right]^{c(w,Q)}$$

  – Both provide almost identical performance
  – The performance of PLSA ($P_{\text{PLSA}}(w|D)$) is not promising when it is used alone

Page 32

PLSA: Experiment Results (2/4)

• PLSA-U*
  – Use the low-dimensional representations $P(T_k|Q)$ and $P(T_k|D)$ (viewed in a $K$-dimensional latent space) to evaluate relevance by means of the cosine measure:

$$R_{\text{PLSA-U}^*}(Q,D) = \frac{\sum_{k}P(T_k|Q)\,P(T_k|D)}{\sqrt{\sum_{k}P(T_k|Q)^2}\,\sqrt{\sum_{k}P(T_k|D)^2}}$$

where the query representation is "folded in" online:

$$P(T_k|Q) = \frac{\sum_{w\in Q}c(w,Q)\,P(T_k|w,Q)}{\sum_{w\in Q}c(w,Q)}$$

  – Combine the cosine score with that of the vector space model:

$$\tilde R_{\text{PLSA-U}^*}(Q,D) = \rho\,R_{\text{PLSA-U}^*}(Q,D) + (1-\rho)\,R_{\text{VSM}}(Q,D)$$

  – This is an ad hoc approach to re-weighting the different model components (dimensions)

Page 33

PLSA: Experiment Results (3/4)

• Why $R_{\text{PLSA-U}^*}$?
  – Recall that in LSA, the relations between any two docs can be formulated as

$$A'^{T}A' = \left(U'\Sigma'V'^{T}\right)^{T}\left(U'\Sigma'V'^{T}\right) = V'\Sigma'^{T}U'^{T}U'\Sigma'V'^{T} = \left(V'\Sigma'\right)\left(V'\Sigma'\right)^{T}$$

so $sim(\hat D_i,\hat D_s) = \text{cosine}(\hat D_i,\hat D_s)$, where $\hat D_i$ and $\hat D_s$ are row vectors of $V'\Sigma'$.

  – PLSA mimics LSA in the similarity measure: using $P(D_i|T_k)\,P(T_k) = P(T_k|D_i)\,P(D_i)$,

$$R_{\text{PLSA-U}^*}(D_i,D_s) = \frac{\sum_{k}\left[P(D_i|T_k)P(T_k)\right]\left[P(D_s|T_k)P(T_k)\right]}{\sqrt{\sum_{k}\left[P(D_i|T_k)P(T_k)\right]^2}\sqrt{\sum_{k}\left[P(D_s|T_k)P(T_k)\right]^2}} = \frac{\sum_{k}P(T_k|D_i)\,P(T_k|D_s)}{\sqrt{\sum_{k}P(T_k|D_i)^2}\,\sqrt{\sum_{k}P(T_k|D_s)^2}}$$

(the document priors $P(D_i)$ and $P(D_s)$ cancel in the cosine).

Page 34

PLSA: Experiment Results (4/4)

Page 35

PLSA vs. LSA

• Decomposition/Approximation
  – LSA: least-squares criterion measured on the L2 (Frobenius) norm of the word-doc matrix
  – PLSA: maximization of the likelihood function, based on the cross entropy (Kullback-Leibler divergence) between the empirical distribution and the model
• Computational complexity
  – LSA: SVD decomposition
  – PLSA: EM training, which can be time-consuming over the iterations
  – The model complexity of both LSA and PLSA grows linearly with the number of training documents
• There is no general way to estimate or predict the vector representation (of LSA) or the model parameters (of PLSA) for a newly observed document

Page 36

Latent Dirichlet Allocation (LDA) (1/2)

• The basic generative process of LDA closely resembles PLSA; however,
  – In PLSA, the topic mixture $P(T_k|D)$ is conditioned on each document ($P(T_k|D)$ is fixed but unknown)
  – While in LDA, the topic mixture $P(T_k|D)$ is drawn from a Dirichlet distribution, the so-called conjugate prior ($P(T_k|D)$ is unknown and follows a probability distribution)

Blei et al. (2003)

[Figure: graphical model of LDA, with Dirichlet parameters $\alpha$ (document-topic) and $\beta$ (topic-word), per-document topic mixture $\theta$, per-word topic assignment $T$, word-distribution parameters $\varphi$, and observed words $w$.]

Process of generating a corpus with LDA:
1) Pick a multinomial distribution $\varphi_T$ for each topic $T$ from a Dirichlet distribution with parameter $\beta$
2) Pick a multinomial distribution $\theta_D$ for each document $D$ from a Dirichlet distribution with parameter $\alpha$
3) Pick a topic $T\in\{1,2,\ldots,K\}$ from the multinomial distribution with parameter $\theta_D$
4) Pick a word $w$ from the multinomial distribution with parameter $\varphi_T$

Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003

Page 37

Latent Dirichlet Allocation (2/2)

[Figure: the probability simplex over three words; each point with coordinates $X = P(w_1)$, $Y = P(w_2)$, $Z = P(w_3)$ and $X+Y+Z = 1$ is a multinomial distribution.]

• The corpus likelihood integrates over the Dirichlet-distributed topic mixtures:

$$L_{\text{LDA}} = \prod_{D}p(D|\alpha,\beta) = \prod_{D}\int P(\theta_D|\alpha)\left[\prod_{i}\sum_{k=1}^{K}P(w_i|T_k)\,P(T_k|\theta_D)\right]d\theta_D$$

Page 38

Word Topic Models (WTM)

• Each word $w_j$ of the language is treated as a word topical mixture model $M_{w_j}$ for predicting the occurrences of other words:

$$P_{\text{WTM}}(w_i|M_{w_j}) = \sum_{k=1}^{K}P(w_i|T_k)\,P(T_k|M_{w_j})$$

• WTM can also be viewed as a nonnegative factorization of a "word-word" matrix consisting of probability entries
  – Each column encodes the vicinity information of all occurrences of a distinct word in the document collection

B. Chen, “Word topic models for spoken document retrieval and transcription,” ACM Transactions on Asian Language Information Processing, 8(1), pp. 2:1-2:27, March 2009
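In matrix terms, evaluating a word topical mixture model is a single product of the shared topic components with the word-specific topic weights; a minimal sketch (mine):

```python
import numpy as np

def wtm_word_prob(p_w_t: np.ndarray, p_t_mwj: np.ndarray) -> np.ndarray:
    """P_WTM(w_i | M_{w_j}) = sum_k P(w_i|T_k) P(T_k|M_{w_j}).

    p_w_t:   |V| x K matrix of P(w|T_k), the shared topic components
    p_t_mwj: length-K vector of P(T_k|M_{w_j}), the topic weights of w_j
    Returns a length-|V| vector: the distribution over all words
    predicted by the word topical mixture model of w_j.
    """
    return p_w_t @ p_t_mwj
```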

Page 39

Comparison of WTM and PLSA/LDA

• A schematic comparison for the matrix factorizations of PLSA/LDA and WTM:

$$P_{\text{PLSA}}(w_i|M_D) = \sum_{k=1}^{K}P(w_i|T_k)\,P(T_k|M_D), \qquad P_{\text{WTM}}(w_i|M_{w_j}) = \sum_{k=1}^{K}P(w_i|T_k)\,P(T_k|M_{w_j})$$

[Figure: PLSA/LDA factorize the normalized "word-document" co-occurrence matrix $A$ (words × documents) into mixture components (words × topics) and mixture weights (topics × documents); WTM analogously factorizes the normalized "word-word" co-occurrence matrix $B$ (words × vicinities of words) into mixture components and mixture weights.]

Page 40

WTM: Information Retrieval (1/4)

• The relevance measure between a query and a document can be expressed by

$$P_{\text{WTM}}(Q|D) = \prod_{w_i\in Q}\left[\sum_{w_j\in D}P(w_j|D)\sum_{k=1}^{K}P(w_i|T_k)\,P(T_k|M_{w_j})\right]^{c(w_i,Q)}$$

• Unsupervised training
  – The WTM of each word $w_j$ can be trained by concatenating those words occurring within a context window of size $N$ around each occurrence of $w_j$, which are postulated to be relevant to $w_j$; collecting these context observations $O_{w_j,1}, O_{w_j,2}, \ldots$ into $O_{w_j}$, the training objective is

$$L_{\text{WTM}} = \prod_{w_j}P(O_{w_j}|M_{w_j}) = \prod_{w_j}\prod_{w_i\in O_{w_j}}P_{\text{WTM}}(w_i|M_{w_j})^{\,c(w_i,O_{w_j})}$$

Page 41

WTM: Information Retrieval (2/4)

• Supervised training: the model parameters are trained using a training set of query exemplars and the associated query-document relevance information
  – Maximize the log-likelihood of the training set of query exemplars generated by their relevant documents:

$$L_{\text{WTM}} = \sum_{Q\in\text{TrainSet}}\sum_{D\in R_Q}\log P_{\text{WTM}}(Q|D)$$

Page 42

WTM: Information Retrieval (3/4)

• Example topic distributions of WTM


Topic 13:
  word | weight
  Vena (靜脈) | 1.202
  Resection (切除) | 0.674
  Myoma (肌瘤) | 0.668
  Cephalitis (腦炎) | 0.618
  Uterus (子宮) | 0.501
  Bronchus (支氣管) | 0.500

Topic 23:
  word | weight
  Cholera (霍亂) | 0.752
  Colorectal cancer (大腸直腸癌) | 0.681
  Salmonella enterica (沙門氏菌) | 0.471
  Aphtae epizooticae (口蹄疫) | 0.337
  Thyroid (甲狀腺) | 0.303
  Gastric cancer (胃癌) | 0.298

Topic 14:
  word | weight
  Land tax (土地稅) | 0.704
  Tobacco and alcohol tax law (菸酒稅法) | 0.489
  Tax (財稅) | 0.457
  Amend drafts (修正草案) | 0.446
  Acquisition (購併) | 0.396
  Insurance law (保險法) | 0.373

Page 43

WTM: Information Retrieval (4/4)

• Pairing of PLSA and WTM – Sharing the same set of latent topics


[Figure: pairing of PLSA and WTM: the normalized "word-document" and "word-word" co-occurrence matrices are factorized jointly, sharing the same mixture components $P(w|T)$ (topics) while keeping separate mixture weights $P(T|D)$ for documents and $P(T|M_w)$ for the vicinities of words.]

Page 44

Applying Relevance Feedback to LM Framework (1/2)

• There is still no formal mechanism to incorporate relevance feedback (judgments) into the language modeling framework (especially for the query-likelihood approach)
  – The query is a fixed sample, while the focus is on accurate estimation of the document language models $P(w|M_D)$
• Ponte (1998) proposed a limited way to incorporate blind relevance feedback into the LM framework
  – Think of the top-ranked (pseudo-relevant) documents $\tilde R$ as examples of what the query might have been, and re-sample (or expand) the query by adding the $k$ most descriptive words from these documents, scored by

$$w^* = \arg\max_{w}\sum_{D\in\tilde R}\log\frac{P(w|M_D)}{P(w|M_C)}$$

J. M. Ponte, A language modeling approach to information retrieval, Ph.D. dissertation, UMass, 1998

Page 45

Applying Relevance Feedback to LM Framework (2/2)

• Miller et al. (1999) proposed two relevance feedback approaches
  – Query expansion: add those words to the initial query that appear in two or more of the top $m$ retrieved documents
  – Document model re-estimation: use a set of outside training query exemplars to train the transition probabilities of the document models

$$P(Q|M_D) = \prod_{i=1}^{L}\left[\lambda\,P(w_i|M_D) + (1-\lambda)\,P(w_i|M_C)\right]$$

$$\hat\lambda = \frac{\displaystyle\sum_{Q\in\text{TrainSet}}\;\sum_{D\in R_Q}\;\sum_{q\in Q}\frac{\lambda\,P(q|M_D)}{\lambda\,P(q|M_D) + (1-\lambda)\,P(q|M_C)}}{\displaystyle\sum_{Q\in\text{TrainSet}}|R_Q|\cdot|Q|}$$

where $\hat\lambda$ is the new weight and $\lambda$ the old weight, TrainSet is the set of training query exemplars, $R_Q$ is the set of documents that are relevant to a specific training query exemplar $Q$, $|Q|$ is the length of the query $Q$, and $|R_Q|$ is the total number of documents relevant to the query (819 queries, ≤ 2265 docs).

Miller et al. , A hidden Markov model information retrieval system, SIGIR 1999

Page 46

Relevance Modeling (1/3)

• A schematic illustration of the information retrieval system with relevance feedback for improved query modeling


[Figure: an initial query model and the document models drive an initial round of retrieval over the document collection; the top-ranked documents serve as representative documents for various query modeling techniques, and the resulting query model drives a second round of retrieval over the document collection to produce the final retrieval result.]

K.-Y. Chen et al., "Exploring the use of unsupervised query modeling techniques for speech recognition and summarization,“Speech Communication, 80, pp. 49-59, June 2016.

Page 47

Relevance Modeling (2/3)

• Use the top-ranked pseudo-relevant documents to approximate the relevance class of each query
  – The joint probability of a query $Q$ and any word $w$ being generated by the relevance class of $Q$ is computed as follows, on the basis of the top-ranked list of $M$ pseudo-relevant documents $\{D_1,\ldots,D_M\}$ obtained from the initial round of retrieval:

$$P_{\text{RM}}(w,Q) = \sum_{m=1}^{M}P(D_m)\,P(w,q_1,q_2,\ldots,q_L|D_m)$$

  – If we further assume that words are conditionally independent given $D_m$ and that their order is of no importance:

$$P_{\text{RM}}(w,Q) = \sum_{m=1}^{M}P(D_m)\,P(w|D_m)\prod_{l=1}^{L}P(q_l|D_m)$$

Page 48

Relevance Modeling (3/3)

• The enhanced query model $P_{\text{RM}}(w|Q)$, therefore, can be expressed by

$$P_{\text{RM}}(w|Q) = \frac{P_{\text{RM}}(w,Q)}{P_{\text{RM}}(Q)} = \frac{\sum_{m=1}^{M}P(D_m)\,P(w|D_m)\prod_{l=1}^{L}P(q_l|D_m)}{\sum_{m=1}^{M}P(D_m)\prod_{l=1}^{L}P(q_l|D_m)}$$

• Incorporation of latent topic information: replacing the literal document model by its topic-smoothed counterpart

$$\tilde P(w|D_m) = \sum_{k=1}^{K}P(w|T_k)\,P(T_k|D_m)$$

gives the topic-based relevance model

$$P_{\text{TRM}}(w,Q) = \sum_{m=1}^{M}P(D_m)\sum_{k=1}^{K}P(T_k|D_m)\,P(w|T_k)\prod_{l=1}^{L}P(q_l|T_k)$$

Page 49

Leveraging Non-relevant Information

• Hypothesize that the low-ranked (pseudo-non-relevant) documents can provide useful cues as well, to boost the retrieval effectiveness for a given query
  – For this idea to work, we may estimate a non-relevance model $P(w|NR_Q)$ for each test query based on those selected pseudo-non-relevant documents
• The similarity measure between a query and a document can thus be computed as follows:

$$SIM(Q,D) = -KL(Q\,\|\,D) + KL(NR_Q\,\|\,D)$$

Page 50

Incorporating Prior Knowledge into LM Framework

• Several efforts have been made to use prior knowledge within the LM framework, especially for modeling the document prior $P(D)$ in

$$P(D|Q) = \frac{P(Q|D)\,P(D)}{P(Q)}$$

  – Document length
  – Document source
  – Average word length
  – Aging (time information/period)
  – URL
  – Page links

Page 51

Implementation Notes: Probability Manipulation

• For language modeling approaches to IR, many conditional probabilities are usually multiplied; this can result in a "floating-point underflow"
• It is better to perform the computation by "adding" logarithms of probabilities instead
  – The logarithm function is monotonic (order-preserving)
• We should also avoid the problem of "zero probabilities (or estimates)" owing to sparse data, by using appropriate probability smoothing techniques

$$P(Q|M_D) = \prod_{i=1}^{L}\left[\lambda\,P(w_i|M_D) + (1-\lambda)\,P(w_i|M_C)\right]$$

$$\log P(Q|M_D) = \sum_{i=1}^{L}\log\left[\lambda\,P(w_i|M_D) + (1-\lambda)\,P(w_i|M_C)\right]$$

Page 52

Implementation Notes: Converting to tf-idf-like Weighting

• The query likelihood retrieval model

$$
\begin{aligned}
\log P(Q|M_D) &= \sum_{i=1}^{L}\log\left[\lambda\,\frac{c(w_i,D)}{|D|} + (1-\lambda)\,\frac{c(w_i,C)}{|C|}\right] \\
&= \sum_{i:\,c(w_i,D)>0}\log\frac{\lambda\,\frac{c(w_i,D)}{|D|} + (1-\lambda)\,\frac{c(w_i,C)}{|C|}}{(1-\lambda)\,\frac{c(w_i,C)}{|C|}} + \sum_{i=1}^{L}\log\left[(1-\lambda)\,\frac{c(w_i,C)}{|C|}\right] \\
&\overset{\text{rank}}{=} \sum_{i:\,c(w_i,D)>0}\log\left[1 + \frac{\lambda}{1-\lambda}\cdot\frac{c(w_i,D)/|D|}{c(w_i,C)/|C|}\right]
\end{aligned}
$$

  – The second term is the same for all documents and can therefore be discarded
  – The logarithm is a monotonic (rank-preserving) transformation
  – The similarity score is directly proportional to the document term frequency and inversely proportional to the collection frequency; it can therefore be efficiently implemented with inverted files (to be discussed later on!)
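A sketch of the rank-equivalent scoring function (my code; it walks only the query words that occur in the document, which is exactly the access pattern an inverted file supports, and it assumes every query word occurs somewhere in the collection):

```python
import math
from typing import Dict, List

def rank_equivalent_score(query: List[str],
                          doc_tf: Dict[str, int], doc_len: int,
                          col_tf: Dict[str, int], col_len: int,
                          lam: float = 0.5) -> float:
    """Rank-equivalent form of the smoothed query log-likelihood:

        sum over query words with c(w,D) > 0 of
        log(1 + lam/(1-lam) * (c(w,D)/|D|) / (c(w,C)/|C|))

    Only words that occur in the document contribute, so the score can
    be computed by walking the document's postings in an inverted file.
    """
    score = 0.0
    for w in query:
        tf = doc_tf.get(w, 0)
        if tf == 0:
            continue  # zero contribution; skip, as an inverted file would
        p_doc = tf / doc_len
        p_col = col_tf[w] / col_len  # assumes every query word occurs in C
        score += math.log(1.0 + (lam / (1.0 - lam)) * (p_doc / p_col))
    return score
```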
