Statistical Translation Language Model

Maryam Karimzadehgan (mkarimz2@illinois.edu)

University of Illinois at Urbana-Champaign

Outline

• Motivation & Background
  – Language model (LM) for IR
  – Smoothing methods for IR
• Statistical Machine Translation – Cross-Lingual
  – Motivation
  – IBM Model 1
• Statistical Translation Language Model – Monolingual
  – Synthetic Queries
  – Mutual Information-based approach
  – Regularization of self-translation probabilities
• Smoothing in Statistical Translation Language Model

The Basic LM Approach ([Ponte & Croft 98], [Hiemstra & Kraaij 98], [Miller et al. 99])

(Figure: each document gets its own language model. The LM of a text mining paper puts high probability on words such as "text", "mining", "association", "clustering"; the LM of a food nutrition paper favors "food", "nutrition", "healthy", "diet".)

Query = "data mining algorithms"

Which model would most likely have generated this query?

Ranking Docs by Query Likelihood

(Figure: each document d1, d2, …, dN has its own document LM; the query q is scored against each one, yielding the query likelihoods p(q|d1), p(q|d2), …, p(q|dN), which are used to rank the documents.)

Retrieval as LM Estimation

• Document ranking based on query likelihood:

  $\log p(q|d) = \sum_{i=1}^{m} \log p(q_i|d) = \sum_{w \in V} c(w,q)\,\log p(w|d)$,  where $q = q_1 q_2 \ldots q_m$ and $p(w|d)$ is the document language model

• Retrieval problem → estimation of p(w|d)
• Smoothing is an important issue, and distinguishes different approaches
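The scoring rule above is straightforward to implement once a smoothed p(w|d) is available. Below is a minimal Python sketch (all names are illustrative); p_w_d stands in for whatever smoothed document LM is plugged in.

```python
from collections import Counter
from math import log

def log_query_likelihood(query_terms, p_w_d):
    """log p(q|d) = sum over query words w of c(w,q) * log p(w|d).

    `p_w_d` is any callable returning a smoothed (non-zero) estimate of p(w|d);
    smoothing is what keeps the log from blowing up on unseen query words.
    """
    return sum(c * log(p_w_d(w)) for w, c in Counter(query_terms).items())

# Documents are then ranked by this score for a fixed query.
```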


How to Estimate p(w|d)?

• Simplest solution: Maximum Likelihood Estimator
  – p(w|d) = relative frequency of word w in d
  – What if a word doesn't appear in the text? Then p(w|d) = 0

• In general, what probability should we give a word that has not been observed?

• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words

• This is what “smoothing” is about …

Language Model Smoothing

(Figure: P(w) plotted over words w, comparing the maximum likelihood estimate with a smoothed LM; the smoothed LM discounts observed words and gives the freed probability mass to unseen words.)

  Max. likelihood estimate: $p_{ML}(w) = \dfrac{\text{count of } w}{\text{count of all words}}$

Smoothing Methods for IR

• Method 1 (Linear interpolation, Jelinek-Mercer):

  $p(w|d) = (1-\lambda)\,\dfrac{c(w,d)}{|d|} + \lambda\, p(w|\mathrm{REF})$,  where $\lambda$ is the smoothing parameter and $c(w,d)/|d|$ is the ML estimate

• Method 2 (Dirichlet Prior/Bayesian):

  $p(w|d) = \dfrac{c(w,d) + \mu\, p(w|\mathrm{REF})}{|d| + \mu} = \dfrac{|d|}{|d|+\mu}\,\dfrac{c(w,d)}{|d|} + \dfrac{\mu}{|d|+\mu}\, p(w|\mathrm{REF})$,  where $\mu$ is the smoothing parameter

(Zhai & Lafferty 01)
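As a concrete illustration, here is a minimal Python sketch of the two smoothing methods, assuming the reference model p(w|REF) is supplied as a dictionary; the parameter values and the toy document are only placeholders.

```python
from collections import Counter

def jelinek_mercer(w, doc_counts, doc_len, p_ref, lam=0.1):
    """Method 1: p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|REF)."""
    return (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * p_ref[w]

def dirichlet_prior(w, doc_counts, doc_len, p_ref, mu=2000):
    """Method 2: p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)."""
    return (doc_counts.get(w, 0) + mu * p_ref[w]) / (doc_len + mu)

# Toy usage with a made-up document and a uniform reference model.
doc = "text mining algorithms for text data".split()
counts, d_len = Counter(doc), len(doc)
vocab = set(doc) | {"clustering"}
p_ref = {w: 1 / len(vocab) for w in vocab}
print(jelinek_mercer("clustering", counts, d_len, p_ref))  # non-zero despite no match
print(dirichlet_prior("clustering", counts, d_len, p_ref))
```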


Outline

• Motivation & Background
  – Language model (LM) for IR
  – Smoothing methods for IR
• Statistical Machine Translation – Cross-Lingual
  – Motivation
  – IBM Model 1
• Statistical Translation Language Model – Monolingual
  – Synthetic Queries
  – Mutual Information-based approach
  – Regularization of self-translation probabilities
• Smoothing in Statistical Translation Language Model


A Brief History

• Machine translation was one of the first applications envisioned for computers

• Warren Weaver (1949): “I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.”

• First demonstrated by IBM in 1954 with a basic word-for-word translation system


Interest in Machine Translation

• Commercial interest:
  – The U.S. has invested in MT for intelligence purposes
  – MT is popular on the web—it is the most used of Google's special features
  – The EU spends more than $1 billion on translation costs each year
  – (Semi-)automated translation could lead to huge savings


Interest in Machine Translation

• Academic interest:
  – One of the most challenging problems in NLP research
  – Requires knowledge from many NLP sub-areas, e.g., lexical semantics, parsing, morphological analysis, statistical modeling, …
  – Being able to establish links between two languages allows for transferring resources from one language to another


Word-Level Alignments

• Given a parallel sentence pair we can link (align) words or phrases that are translations of each other:

Machine Translation – Concepts

• We are trying to model P(e|f)
  – I give you a French sentence
  – You give me back English
• How are we going to model this?
  – The maximum likelihood estimate of P(e|f) is freq(e,f)/freq(f)
  – Way too specific to get any reasonable frequencies! The vast majority of unseen data will have zero counts!

Machine Translation – Alternative way

• We could use Bayes rule

• Why use Bayes' rule instead of directly estimating p(e|f)?

  $P(e|f) = \dfrac{P(f|e)\,P(e)}{P(f)}$

  $\hat{e} = \arg\max_e P(f|e)\,P(e)$

It is important that our model for p(e|f) concentrates its probability as much as possible on well-formed English sentences. But it is not important that our model for P(f|e) concentrate its probability on well-formed French sentences.

Given a French sentence f, we could do a search for an e that maximizes p(e|f).


Statistical Machine Translation

• The noisy channel model
  – Assumptions:
    • An English word can be aligned with multiple French words, while each French word is aligned with at most one English word
    • Independence of the individual word-to-word translations

(Figure: the pipeline Language Model → Translation Model → Decoder, i.e., $P(e)$, $P(f|e)$, and $\hat{e} = \arg\max_e P(e|f)$, with e: English and f: French. In the alignment illustration, French words $f_j$ and $f_k$ are linked to English words $e_{a_j}$ and $e_{a_k}$, with $|e| = l$ and $|f| = m$.)


Estimation of Probabilities – IBM Model 1

• Simplest of the IBM models (there are 5 models)
• Does not consider word order (bag-of-words approach)
• Does not model one-to-many alignments
• Computationally inexpensive
• Useful for parameter estimates that are passed on to more elaborate models


IBM Model 1

• Three important components involved
  – Language model: gives the probability p(e)
  – Translation model: estimates the translation probability p(f|e)
  – Decoder:

  $\hat{e} = \arg\max_e p(e|f) = \arg\max_e \dfrac{p(f|e)\,p(e)}{p(f)} = \arg\max_e p(f|e)\,p(e)$


IBM Model 1- Translation Model

• Joint probability of P(F=f, E=e, A=a) where A is an alignment between two sentences.

• Assume each French word has exactly one connection.


IBM Model 1 – Translation Model

• Assume |e| = l and |f| = m; then the alignment can be represented by a series a = a1 a2 … am
• Each aj is between 0 and l, such that if the word in position j of the French sentence is connected to the word in position i of the English sentence, then aj = i, and if it is not connected to any English word, then aj = 0
• The alignment is determined by specifying the values of aj for j from 1 to m, each of which can take any value from 0 to l


IBM Model 1 – Translation Model

  $P(f|e) = \sum_{a} p(f, a \mid e)$   (sum over all possible alignments)

  $p(f, a \mid e) = \dfrac{1}{(l+1)^m} \prod_{j=1}^{m} p(f_j \mid e_{a_j})$   ($e_{a_j}$ is the English word that French word $f_j$ is aligned with; $p(f_j \mid e_{a_j})$ is the translation probability)

The EM algorithm is used to estimate the translation probabilities.
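To make the estimation step concrete, here is a toy EM sketch for IBM Model 1 in Python. The function name and the two example sentence pairs are invented for illustration; real training uses large parallel corpora.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Toy EM for IBM Model 1: estimate t(f|e) from (french, english) sentence pairs.

    Each English sentence gets a NULL word (position 0) so a French word can be
    left unaligned, matching the (l+1)^m alignments in the formula above."""
    NULL = "<null>"
    f_vocab = {f for f_sent, _ in pairs for f in f_sent}
    # Uniform initialization of the translation table t[f][e] = p(f|e).
    t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(f_vocab)))

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for f_sent, e_sent in pairs:
            e_sent = [NULL] + e_sent
            for f in f_sent:
                # E-step: p(a_j = i | f, e) is proportional to t(f | e_i).
                norm = sum(t[f][e] for e in e_sent)
                for e in e_sent:
                    delta = t[f][e] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: t(f|e) = c(f, e) / c(e)
        for (f, e), c in count.items():
            t[f][e] = c / total[e]
    return t

# Two toy sentence pairs; after a few iterations p("maison"|"house") dominates,
# because "the" co-occurs with everything and explains each French word less.
pairs = [(["la", "maison"], ["the", "house"]),
         (["la", "fleur"], ["the", "flower"])]
t = ibm_model1(pairs)
print(t["maison"]["house"], t["maison"]["the"])
```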


Outline

• Motivation & Background
  – Language model (LM) for IR
  – Smoothing methods for IR
• Statistical Machine Translation – Cross-Lingual
  – Motivation
  – IBM Model 1
• Statistical Translation Language Model – Monolingual
  – Synthetic Queries
  – Mutual Information-based approach
  – Regularization of self-translation probabilities
• Smoothing in Statistical Translation Language Model

The Problem of Vocabulary Gap

Query = "auto wash"

(Figure: three documents — one containing "auto wash", one containing "car wash … vehicle", and one containing "auto buy … auto". Exact matching on P("auto") P("wash") rewards the documents that literally contain "auto" and "wash", but misses the relevant "car wash" document.)

How to support inexact matching, so that {"car", "vehicle"} can match "auto" while "buy" does not match "wash"?


Translation Language Models for IR [Berger & Lafferty 99]

Query = "auto wash"

(Figure: the same three documents. For the document d3 containing "car … vehicle", the model "translates" document words into query words: "car" → "auto" with probability Pt("auto"|"car") and "vehicle" → "auto" with probability Pt("auto"|"vehicle"), so d3 can still generate the query word "auto".)

  P("auto"|d3) = p("car"|d3) × pt("auto"|"car") + p("vehicle"|d3) × pt("auto"|"vehicle")

In general:

  $p(w|d) = \sum_{u} p_t(w \mid u)\, p_{ml}(u \mid d)$

How to estimate?


• When relevance judgments are available, (q, d) pairs serve as data to train the translation model
• Without relevance judgments, we can use synthetic data [Berger & Lafferty 99] or <title, body> pairs [Jin et al. 02]

Basic translation model:

  $p(w|d) = \sum_{u \in d} p_t(w \mid u)\, p(u \mid d)$   (translation model $p_t(w|u)$; regular doc LM $p(u|d)$)

Estimation of Translation Model: pt(w|u)

• Select words that are representative of a document
• Calculate the mutual information for each word in a document: $I(w, d) = p(w, d)\,\log \dfrac{p(w, d)}{p(w)\,p(d)}$
• Synthetic queries are sampled based on the normalized mutual information
• The resulting (d, q) pairs of documents and synthetic queries are used to estimate the probabilities with the EM algorithm (IBM Model 1), as sketched below
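A rough Python sketch of this synthetic-query procedure, reusing the ibm_model1 function above. The I(w, d)-style weighting below is a simplified stand-in for the exact computation in [Berger & Lafferty 99], and all names are illustrative.

```python
import random
from math import log

def synthetic_query_pairs(docs, p_w, queries_per_doc=5, query_len=3):
    """For each document, sample short synthetic queries with word probabilities
    proportional to a crude I(w, d)-style weight, and return (query, document)
    pairs that ibm_model1() above can use to estimate p_t(query word | doc word).

    `docs` is a list of token lists; `p_w` is a background word distribution
    assumed to cover the vocabulary."""
    pairs = []
    for tokens in docs:
        weights = {}
        for w in set(tokens):
            p_w_d = tokens.count(w) / len(tokens)
            # mutual-information-style weight, clamped to stay positive for sampling
            weights[w] = max(p_w_d * log(p_w_d / p_w[w]), 1e-12)
        words = list(weights)
        probs = [weights[w] for w in words]
        for _ in range(queries_per_doc):
            query = random.choices(words, weights=probs, k=query_len)
            pairs.append((query, tokens))   # (f, e) = (synthetic query, document)
    return pairs
```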

Estimation of Translation Model – Synthetic Queries ([Berger & Lafferty 99])

Estimation of Translation Model – Synthetic Queries Algorithm ([Berger & Lafferty 99])

(Algorithm figure: synthetic queries generated from documents serve as the training data.)

Limitations:
1. Can't translate into words not seen in the training queries
2. Computational complexity

A simpler and more efficient method for estimating pt(w|u) with higher coverage was proposed in:

M. Karimzadehgan and C. Zhai. Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval. ACM SIGIR, pages 323-330, 2010.


Estimation of Translation Model Based on Mutual Information

1. Calculate the mutual information for each pair of words in the collection (measuring co-occurrences):

   $I(X_w; X_u) = \sum_{X_w \in \{0,1\}} \sum_{X_u \in \{0,1\}} p(X_w, X_u)\,\log \dfrac{p(X_w, X_u)}{p(X_w)\,p(X_u)}$

   where $X_w$ indicates the presence/absence of word w in a document.

2. Normalize the mutual information score to obtain a translation probability:

   $p_{mi}(w \mid u) = \dfrac{I(X_w; X_u)}{\sum_{w'} I(X_{w'}; X_u)}$

Computation Detail

  $I(X_w; X_u) = \sum_{X_w \in \{0,1\}} \sum_{X_u \in \{0,1\}} p(X_w, X_u)\,\log \dfrac{p(X_w, X_u)}{p(X_w)\,p(X_u)}$

  $p(X_w = 1, X_u = 1) = \dfrac{c(X_w = 1, X_u = 1)}{N}$,   $p(X_w = 1) = \dfrac{c(X_w = 1)}{N}$,   $p(X_u = 1) = \dfrac{c(X_u = 1)}{N}$

  Doc   Xw   Xu
  D1    0    0
  D2    1    1
  D3    1    0
  …     …    …
  DN    0    0

Exploit the index to speed up the computation.
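The counting above can be written directly from document-level presence sets. Below is a toy Python sketch; the add-constant smoothing of zero counts is an assumption for the sketch, not a detail taken from the paper.

```python
from collections import defaultdict
from math import log

def mi_translation_probs(docs):
    """p_mi(w|u): I(X_w; X_u) from document-level co-occurrence, normalized over w.

    `docs` is a list of token lists.  Zero counts are smoothed with small
    add-constants (0.25 / 0.5) chosen so the smoothed joint and marginals stay
    consistent; a real implementation would iterate only over co-occurring
    pairs via an inverted index instead of the full vocabulary square."""
    N = len(docs)
    doc_sets = [set(d) for d in docs]
    df = defaultdict(int)   # c(X_w = 1): number of docs containing w
    co = defaultdict(int)   # c(X_w = 1, X_u = 1): docs containing both w and u
    for s in doc_sets:
        for w in s:
            df[w] += 1
            for u in s:
                co[(w, u)] += 1

    def mutual_info(w, u):
        total = 0.0
        for xw in (0, 1):
            for xu in (0, 1):
                if xw and xu:
                    c = co[(w, u)]
                elif xw:
                    c = df[w] - co[(w, u)]
                elif xu:
                    c = df[u] - co[(w, u)]
                else:
                    c = N - df[w] - df[u] + co[(w, u)]
                p_joint = (c + 0.25) / (N + 1)
                p_w = ((df[w] if xw else N - df[w]) + 0.5) / (N + 1)
                p_u = ((df[u] if xu else N - df[u]) + 0.5) / (N + 1)
                total += p_joint * log(p_joint / (p_w * p_u))
        return total

    vocab = list(df)
    p_mi = {}
    for u in vocab:
        scores = {w: mutual_info(w, u) for w in vocab}
        norm = sum(scores.values())
        p_mi[u] = {w: s / norm for w, s in scores.items()}
    return p_mi   # p_mi[u][w] = p_mi(w | u)

# Toy usage on four tiny "documents".
docs = [["auto", "wash"], ["car", "wash", "vehicle"],
        ["auto", "buy", "auto"], ["car", "repair"]]
p = mi_translation_probs(docs)
print(sorted(p["auto"].items(), key=lambda kv: -kv[1])[:3])
```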

Sample Translation Probabilities (AP90): p(w | "everest")

Synthetic Query                  Mutual Information
  w          p(w|"everest")        w          p(w|"everest")
  everest    0.079                 everest    0.1051
  climber    0.042                 climber    0.0423
  climb      0.0365                mount      0.0339
  mountain   0.0359                climb      0.0308
  mount      0.033                 expedit    0.0303
  reach      0.0312                peak       0.0155
  expedit    0.0314                himalaya   0.01532
  summit     0.0253                nepal      0.015
  whittak    0.016                 sherpa     0.01431
  peak       0.0149                hillari    0.01431

Regularizing Self-Translation Probability

• The self-translation probability $p_t(w|w)$ can be under-estimated: an exact match could be counted less than an inexact match
• Solution: interpolation with a "1.0 self-translation":

  $p_t(w \mid u) = \begin{cases} \alpha + (1-\alpha)\,p_{mi}(w \mid u) & w = u \\ (1-\alpha)\,p_{mi}(w \mid u) & w \neq u \end{cases}$

  α = 1: basic query likelihood model;  α = 0: original MI estimate
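In code, the regularization is a one-line interpolation over the MI estimates (a sketch, reusing the p_mi structure from the earlier example; the value of α is a tuning parameter):

```python
def regularize_self_translation(p_mi, alpha=0.5):
    """p_t(w|u) = alpha*[w == u] + (1 - alpha)*p_mi(w|u); alpha=1 reduces to exact
    matching (basic query likelihood), alpha=0 keeps the original MI estimate."""
    return {u: {w: alpha * (w == u) + (1 - alpha) * p for w, p in row.items()}
            for u, row in p_mi.items()}
```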


Query Likelihood and Translation Language Model

• Document ranking based on query likelihood:

  $\log p(q|d) = \sum_{i=1}^{m} \log p(q_i|d) = \sum_{w \in V} c(w,q)\,\log p(w|d)$,  where $q = q_1 q_2 \ldots q_m$ and $p(w|d)$ is the document language model

• Translation Language Model:

  $p(w|d) = \sum_{u \in d} p_t(w \mid u)\, p_{ml}(u \mid d)$

Do you see any problem?

Further Smoothing of Translation Model for Computing Query Likelihood

• Linear interpolation (Jelinek-Mercer):

  $p(w|d) = (1-\lambda)\Big[\sum_{u \in d} p_t(w \mid u)\, p_{ml}(u \mid d)\Big] + \lambda\, p(w \mid C)$

• Bayesian interpolation (Dirichlet prior):

  $p(w|d) = \dfrac{|d|}{|d|+\mu}\Big[\sum_{u \in d} p_t(w \mid u)\, p_{ml}(u \mid d)\Big] + \dfrac{\mu}{|d|+\mu}\, p(w \mid C)$
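Putting the pieces together, a minimal Dirichlet-smoothed translation LM scorer might look like this. It is a sketch: p_t could be the regularized MI estimates from above (indexed as p_t[u][w] = p_t(w|u)), and p_coll is an assumed collection LM dictionary.

```python
from collections import Counter
from math import log

def translation_lm_score(query_terms, doc_tokens, p_t, p_coll, mu=2000):
    """Dirichlet-smoothed translation LM:
    p(w|d) = |d|/(|d|+mu) * sum_{u in d} p_t(w|u) p_ml(u|d) + mu/(|d|+mu) * p(w|C)."""
    counts, d_len = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for w, qc in Counter(query_terms).items():
        trans = sum(p_t.get(u, {}).get(w, 0.0) * c / d_len for u, c in counts.items())
        p_w_d = d_len / (d_len + mu) * trans + mu / (d_len + mu) * p_coll.get(w, 1e-9)
        score += qc * log(p_w_d)
    return score
```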

Experiment Design

• MI vs. Synthetic query estimation
  – Data sets: Associated Press (AP90) and San Jose Mercury News (SJMN) + TREC topics 51-100
  – Relatively small data sets, in order to compare our results with the synthetic queries of [Berger & Lafferty 99]
• MI translation model vs. Basic query likelihood
  – Larger data sets: TREC7, TREC8 (plus AP90, SJMN)
  – TREC topics 351-400 for TREC7 and 401-450 for TREC8
• Additional issues
  – Regularization of self-translation?
  – Influence of smoothing on translation models?
  – Translation model + pseudo feedback?


Mutual information outperforms synthetic queries in both MAP and P@10

(Figure: MAP and P@10 bar charts comparing Syn. Query vs. MI; AP90 + queries 51-100, Dirichlet prior smoothing.)

Upper Bound Comparison of Mutual Information and Synthetic Queries

Dirichlet Prior Smoothing
  Data    MAP (Mutual Info)   MAP (Syn. Query)   P@10 (Mutual Info)   P@10 (Syn. Query)
  AP90    0.264*              0.25               0.381                0.357
  SJMN    0.197*              0.189              0.252                0.267

JM Smoothing
  Data    MAP (Mutual Info)   MAP (Syn. Query)   P@10 (Mutual Info)   P@10 (Syn. Query)
  AP90    0.272*              0.251              0.423                0.404
  SJMN    0.2*                0.195              0.28                 0.266

Mutual information translation model outperforms basic query likelihood

JM Smoothing
  Data    MAP (Basic QL)   MAP (MI Trans.)   P@10 (Basic QL)   P@10 (MI Trans.)
  AP90    0.248            0.272*            0.398             0.423
  SJMN    0.195            0.2*              0.266             0.28
  TREC7   0.183            0.187*            0.412             0.404
  TREC8   0.248            0.249             0.452             0.456

Dirichlet Prior Smoothing
  Data    MAP (Basic QL)   MAP (MI Trans.)   P@10 (Basic QL)   P@10 (MI Trans.)
  AP90    0.246            0.264*            0.357             0.381
  SJMN    0.188            0.197*            0.252             0.267
  TREC7   0.165            0.172             0.354             0.362
  TREC8   0.236            0.244*            0.428             0.436

Translation model appears to need less collection smoothing than basic QL

(Figure: retrieval performance as a function of the smoothing parameter, with one curve for the translation model and one for basic query likelihood.)

Translation model and pseudo feedback exploit word co-occurrences differently

JM Smoothing
  Data    MAP (BL)   MAP (PFB)   MAP (PFB+TM)   P@10 (BL)   P@10 (PFB)   P@10 (PFB+TM)
  AP90    0.246      0.271       0.298          0.357       0.383        0.411
  SJMN    0.188      0.229       0.234          0.252       0.316        0.313
  TREC7   0.165      0.209       0.222          0.354       0.38         0.384
  TREC8   0.236      0.240       0.281          0.428       0.4          0.452

PFB+TM combines a query model from pseudo feedback with the smoothed translation model:

  $\sum_{w} p(w \mid \hat{\theta}_q)\,\log p_t(w \mid d)$
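Continuing the sketch above, the combined PFB+TM scoring can reuse translation_lm_score; query_model here is an assumed dictionary holding the pseudo-feedback query model p(w | θ̂_q).

```python
def pfb_tm_score(query_model, doc_tokens, p_t, p_coll, mu=2000):
    """score(q, d) = sum_w p(w | feedback query model) * log p_t(w|d),
    where p_t(w|d) is the Dirichlet-smoothed translation LM defined earlier."""
    return sum(pw * translation_lm_score([w], doc_tokens, p_t, p_coll, mu)
               for w, pw in query_model.items())
```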

Regularization of self-translation is beneficial

(Figure: retrieval performance as a function of the self-translation regularization parameter; AP data set, Dirichlet prior.)

Summary

• Statistical translation language models are effective for bridging the vocabulary gap
• Mutual information is more effective and more efficient than synthetic queries for estimating translation model probabilities
• Regularization of self-translation is beneficial
• The translation model outperforms basic query likelihood on small and large collections and is more robust
• The translation model and pseudo feedback exploit word co-occurrences differently and can be combined to further improve performance


References

• [1] A. Berger and J. Lafferty. Information Retrieval as Statistical Translation. ACM SIGIR, pages 222-229, 1999.
• [2] P. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311, 1993.
• [3] M. Karimzadehgan and C. Zhai. Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval. ACM SIGIR, pages 323-330, 2010.