
Variational Decoding for Statistical Machine Translation

Zhifei Li, Jason Eisner, and Sanjeev Khudanpur
Center for Language and Speech Processing, Computer Science Department
Johns Hopkins University


Spurious Ambiguity

• Statistical models in MT exhibit spurious ambiguity

• Many different derivations (e.g., trees or segmentations) generate the same translation string

• Regular phrase-based MT systems

• phrase segmentation ambiguity

• Tree-based MT systems

• derivation tree ambiguity


Spurious Ambiguity in Phrase Segmentations

[Figure: the Chinese input 机器 翻译 软件 is segmented three ways, (机器 翻译)(软件), (机器)(翻译 软件), and (机器)(翻译)(软件), and each segmentation is translated phrase by phrase into the same output; a further derivation shown alongside yields the different string "machine transfer software".]

• Same output: "machine translation software"

• Three different phrase segmentations

Spurious Ambiguity in Derivation Trees

[Figure: three derivation trees over 机器 翻译 软件, built from the lexical rules S→⟨机器, machine⟩, S→⟨翻译, translation⟩, and S→⟨软件, software⟩, the glue rule S→⟨S0 S1, S0 S1⟩ applied with different bracketings, and the rule S→⟨S0 翻译 S1, S0 translation S1⟩.]

• Same output: "machine translation software"

• Three different derivation trees

Maximum a Posteriori (MAP) Decoding

[Figure: eight derivations with probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10, grouped into three translation strings (red, blue, green); each string's probability is the sum over its derivations, giving 0.28, 0.28, and 0.44. MAP selects the 0.44 string, even though it contains none of the three highest-scoring derivations.]

• Exact MAP decoding
  • x: foreign sentence
  • y: English sentence
  • d: derivation

y* = argmax_{y∈Trans(x)} p(y|x) = argmax_{y∈Trans(x)} Σ_{d∈D(x,y)} p(y, d|x)
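To make the marginalization concrete, here is a minimal Python sketch of exact MAP decoding on the slide's toy distribution. The grouping of the eight derivation probabilities into the three strings is an assumption chosen to reproduce the totals shown (0.28, 0.28, 0.44).

```python
# Toy sketch of exact MAP decoding under spurious ambiguity.
from collections import defaultdict

# (translation string, p(y, d | x)) for each derivation d;
# the grouping is assumed, consistent with the slide's totals.
derivations = [
    ("red translation",   0.16), ("red translation",   0.12),
    ("blue translation",  0.14), ("blue translation",  0.14),
    ("green translation", 0.13), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

# Exact MAP marginalizes out the derivation: p(y|x) = sum_d p(y, d|x).
p_string = defaultdict(float)
for y, p in derivations:
    p_string[y] += p

y_map = max(p_string, key=p_string.get)
print({y: round(p, 2) for y, p in p_string.items()})
# {'red translation': 0.28, 'blue translation': 0.28, 'green translation': 0.44}
print("MAP choice:", y_map)   # green, despite holding no top derivation
```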

Hypergraph as a search space

[Figure: a hypergraph for the input dianzi0 shang1 de2 mao3, with chart items S[0,4], X[0,4] "the · · · cat", X[0,4] "a · · · mat", X[0,2] "the · · · mat", and X[3,4] "a · · · cat", connected by hyperedges for the rules X→⟨dianzi shang, the mat⟩, X→⟨mao, a cat⟩, X→⟨X0 de X1, X0 X1⟩, X→⟨X0 de X1, X1 on X0⟩, X→⟨X0 de X1, X1 of X0⟩, X→⟨X0 de X1, X0 's X1⟩, and S→⟨X0, X0⟩.]

A hypergraph is a compact structure to encode exponentially many trees.

Probabilistic Hypergraph

The hypergraph defines a probability distribution over derivation trees, i.e. p(y, d | x), and also an (implicit) distribution over strings, i.e. p(y | x).

• Exact MAP decoding is NP-hard (Sima'an 1996), because the inner summation ranges over an exponential-size set of derivations:

y* = argmax_{y∈HG(x)} p(y|x) = argmax_{y∈HG(x)} Σ_{d∈D(x,y)} p(y, d|x)
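As an illustration, the following sketch encodes a simplified version of the figure's hypergraph (a single unweighted X[0,4] node instead of two) and enumerates the strings it packs; node names and rule templates are paraphrased from the figure.

```python
# Minimal packed-hypergraph sketch for the figure's example.
# Each entry maps a node to its incoming hyperedges:
# (english_template, antecedent_nodes); {0},{1} mark substitution slots.
from itertools import product

hypergraph = {
    "X[0,2]": [("the mat", [])],                      # X -> <dianzi shang, the mat>
    "X[3,4]": [("a cat", [])],                        # X -> <mao, a cat>
    "X[0,4]": [("{0} {1}", ["X[0,2]", "X[3,4]"]),     # X -> <X0 de X1, X0 X1>
               ("{1} on {0}", ["X[0,2]", "X[3,4]"]),  # X -> <X0 de X1, X1 on X0>
               ("{1} of {0}", ["X[0,2]", "X[3,4]"]),  # X -> <X0 de X1, X1 of X0>
               ("{0} 's {1}", ["X[0,2]", "X[3,4]"])], # X -> <X0 de X1, X0 's X1>
    "S[0,4]": [("{0}", ["X[0,4]"])],                  # S -> <X0, X0>
}

def strings(node):
    """Enumerate every English string derivable at this node."""
    for template, tails in hypergraph[node]:
        for subs in product(*(strings(t) for t in tails)):
            yield template.format(*subs)

print(list(strings("S[0,4]")))
# ['the mat a cat', 'a cat on the mat', 'a cat of the mat', "the mat 's a cat"]
```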

Decoding with spurious ambiguity?

• Maximum a posteriori (MAP) decoding

• Viterbi approximation

• N-best approximation (crunching) (May and Knight 2006)

Viterbi Approximation

• Viterbi approximation: score each string by its single best derivation (Y(d) denotes the translation string yielded by derivation d)

y* = argmax_{y∈Trans(x)} max_{d∈D(x,y)} p(y, d|x) = Y(argmax_{d∈D(x)} p(y, d|x))

[Figure: on the running example the per-string Viterbi scores are 0.16, 0.14, and 0.13, so Viterbi picks the string containing the 0.16 derivation, whose true probability is only 0.28, rather than the MAP string with probability 0.44.]
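Continuing the toy example, a sketch of the Viterbi approximation: each string is scored by its single best derivation rather than the sum (derivation grouping assumed as before).

```python
# Viterbi approximation on the same toy distribution: max, not sum.
derivations = [
    ("red translation",   0.16), ("red translation",   0.12),
    ("blue translation",  0.14), ("blue translation",  0.14),
    ("green translation", 0.13), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

viterbi_score = {}
for y, p in derivations:
    viterbi_score[y] = max(viterbi_score.get(y, 0.0), p)

print(viterbi_score)   # red 0.16, blue 0.14, green 0.13
print("Viterbi choice:", max(viterbi_score, key=viterbi_score.get))
# picks the red string (true mass 0.28) and misses the MAP string (0.44)
```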

N-best Approximation

• N-best approximation (crunching) (May and Knight 2006): sum p(y, d|x) over only the N best derivations N_D(x)

y* = argmax_{y∈Trans(x)} Σ_{d∈D(x,y)∩N_D(x)} p(y, d|x)

[Figure: with 4-best crunching, only the four most probable derivations (0.16, 0.14, 0.14, 0.13) survive, giving string scores 0.16, 0.28, and 0.13 on the running example.]
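And a sketch of 4-best crunching on the same toy distribution: marginalize, but only over the N most probable derivations.

```python
# N-best crunching: sum p(y, d|x) over the N best derivations only.
from collections import defaultdict

derivations = [
    ("red translation",   0.16), ("red translation",   0.12),
    ("blue translation",  0.14), ("blue translation",  0.14),
    ("green translation", 0.13), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

N = 4
crunched = sorted(derivations, key=lambda yd: yd[1], reverse=True)[:N]
score = defaultdict(float)
for y, p in crunched:
    score[y] += p

print(dict(score))   # red 0.16, blue 0.28, green 0.13 (matches the slide)
print("4-best choice:", max(score, key=score.get))   # the blue string
```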

MAP vs. Approximations

• Exact MAP decoding under spurious ambiguity is intractable

• Viterbi and crunching are efficient, but ignore most derivations

• Our goal: develop an approximation that considers all the derivations but still allows tractable decoding

Variational Decoding

Decoding using a variational approximation: a sentence-specific approximate distribution

Variational Decoding for MT: an Overview

Sentence-specific decoding, in three steps:

1. Generate a hypergraph: run the SMT system on the foreign sentence x to produce a hypergraph encoding p(y, d | x). MAP decoding under p is intractable.

2. Estimate a model from the hypergraph: fit q*(y | x) ≈ Σ_{d∈D(x,y)} p(y, d|x), where q* is an n-gram model over output strings.

3. Decode using q* on the hypergraph.

Variational Inference

• We want to do inference under p, but it is intractable

y* = argmax_y p(y|x)

• Instead, we derive a simpler distribution q*

q* = argmin_{q∈Q} KL(p||q)

• Then, we use q* as a surrogate for p in inference

y* = argmax_y q*(y | x)

[Figure: the true distribution p lies in the full family P; q* is the member of the tractable family Q that is closest to p.]

Variational Approximation

• q*: an approximation having minimum distance to p, chosen from Q, a family of distributions

q* = argmin_{q∈Q} KL(p||q)
   = argmin_{q∈Q} Σ_{y∈Trans(x)} p log(p/q)
   = argmin_{q∈Q} Σ_{y∈Trans(x)} (p log p − p log q)      (the p log p term is constant in q)
   = argmax_{q∈Q} Σ_{y∈Trans(x)} p log q

• Three questions
  • how to parameterize q?
  • how to estimate q*?
  • how to use q* for decoding?

Parameterization of q∈Q

• Naturally, we parameterize q as an n-gram model

• The probability of a string is the product of the probabilities of the n-grams appearing in that string

3-gram model, y: a b c d e f

q(y) = q(a) · q(b|a) · q(c|ab) · q(d|bc) · q(e|cd) · q(f|de)

Other parameterizations are possible!

How do we estimate these n-gram probabilities?
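A sketch of this parameterization: the score of a string under a trigram q is the product of its trigram probabilities. The probability table q3 below is made up purely to make the product concrete.

```python
# Trigram parameterization of q: q(y) = prod_i q(w_i | w_{i-2} w_{i-1}).
def q_string(y, q3):
    words = y.split()
    prob = 1.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - 2):i])   # shorter history at the start
        prob *= q3[(history, w)]
    return prob

q3 = {((), "a"): 0.5, (("a",), "b"): 0.9, (("a", "b"), "c"): 0.8,
      (("b", "c"), "d"): 0.7, (("c", "d"), "e"): 0.6, (("d", "e"), "f"): 0.9}
print(q_string("a b c d e f", q3))   # 0.5*0.9*0.8*0.7*0.6*0.9 ~= 0.136
```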

Estimation of q*∈Q

• Variational approximation

q* = argmax_{q∈Q} Σ_{y∈Trans(x)} p log q

• q* is a maximum likelihood estimate (MLE) in which p plays the role of the empirical distribution

But in our case, p is defined not by a corpus, but by a hypergraph for a given test sentence!

[Figure: estimate a bigram model from the hypergraph.] Two ways to do this:
• brute force
• dynamic programming

Estimating q* from a hypergraph: brute force

Bigram estimation:
‣ unpack the hypergraph into its individual derivations
‣ accumulate the soft count of each bigram
‣ normalize the counts

[Figure: the example hypergraph unpacks into four derivations, yielding "the mat a cat" (p=2/8), "a cat on the mat" (p=1/8), "a cat of the mat" (p=2/8), and "the mat 's a cat" (p=3/8).]

Pr(on | cat) = 1/8
Pr(of | cat) = 2/8
Pr(</s> | cat) = 5/8
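The brute-force recipe in code, reusing the four unpacked strings and their posteriors from the figure: accumulate expected (soft) bigram counts, then normalize by the expected count of each history.

```python
# Brute-force estimation of the variational bigram model q*.
from collections import defaultdict

posterior = {                      # p(y|x) after unpacking the hypergraph
    "the mat a cat":    2 / 8,
    "a cat on the mat": 1 / 8,
    "a cat of the mat": 2 / 8,
    "the mat 's a cat": 3 / 8,
}

count = defaultdict(float)         # expected count of (history, word)
total = defaultdict(float)         # expected count of the history word
for y, p in posterior.items():
    words = y.split() + ["</s>"]
    for prev, nxt in zip(words, words[1:]):
        count[(prev, nxt)] += p
        total[prev] += p

def q(nxt, prev):
    return count[(prev, nxt)] / total[prev]

print(q("on", "cat"), q("of", "cat"), q("</s>", "cat"))
# 0.125 0.25 0.625, i.e. 1/8, 2/8, 5/8 as on the slide
```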

Estimating q* from a hypergraph: dynamic programming

Bigram estimation:
‣ run inside-outside on the hypergraph
‣ accumulate the soft count of each bigram at each hyperedge
‣ normalize the counts
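A sketch of the dynamic-programming alternative: run inside-outside over a small acyclic hypergraph and accumulate each edge's bigrams weighted by the edge posterior. Nodes here are split by boundary word (mirroring the figure's two X[0,4] items) so that every hyperedge knows exactly which bigrams it creates; the edge weights are assumptions chosen to reproduce the toy posteriors above.

```python
# Inside-outside estimation of expected bigram counts on a hypergraph.
from collections import defaultdict
from math import prod

# edge = (head, tail_nodes, weight, bigrams this edge newly creates)
edges = [
    ("X[0,2]",    [],                   1.0,   [("the", "mat")]),
    ("X[3,4]",    [],                   1.0,   [("a", "cat")]),
    ("X[0,4]cat", ["X[0,2]", "X[3,4]"], 0.250, [("mat", "a")]),               # the mat a cat
    ("X[0,4]mat", ["X[0,2]", "X[3,4]"], 0.125, [("cat", "on"), ("on", "the")]),
    ("X[0,4]mat", ["X[0,2]", "X[3,4]"], 0.250, [("cat", "of"), ("of", "the")]),
    ("X[0,4]cat", ["X[0,2]", "X[3,4]"], 0.375, [("mat", "'s"), ("'s", "a")]),
    ("S[0,4]",    ["X[0,4]cat"],        1.0,   [("cat", "</s>")]),
    ("S[0,4]",    ["X[0,4]mat"],        1.0,   [("mat", "</s>")]),
]
root = "S[0,4]"
topo = ["X[0,2]", "X[3,4]", "X[0,4]cat", "X[0,4]mat", "S[0,4]"]   # bottom-up

beta = defaultdict(float)            # inside: mass of subderivations below v
for v in topo:
    for head, tails, w, _ in edges:
        if head == v:
            beta[v] += w * prod(beta[t] for t in tails)

alpha = defaultdict(float)           # outside: mass of contexts above v
alpha[root] = 1.0
for v in reversed(topo):
    for head, tails, w, _ in edges:
        if head == v:
            for i, t in enumerate(tails):
                rest = prod(beta[u] for j, u in enumerate(tails) if j != i)
                alpha[t] += alpha[v] * w * rest

counts = defaultdict(float)          # expected bigram counts
for head, tails, w, bigrams in edges:
    post = alpha[head] * w * prod(beta[t] for t in tails) / beta[root]
    for b in bigrams:
        counts[b] += post

print(counts[("cat", "on")], counts[("cat", "of")], counts[("cat", "</s>")])
# 0.125 0.25 0.625 -- identical to the brute-force soft counts
```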

Decoding using q*∈Q

• Rescore the hypergraph HG(x)

y* = argmax_{y∈HG(x)} q*(y|x)

• q* is an n-gram model, and we have efficient dynamic programming algorithms for scoring a hypergraph with an n-gram model

John already told you how to do this ☺

KL divergences under different variational models

q* = argmin_{q∈Q} KL(p||q), where KL(p||q) = H(p, q) − H(p)

Measure   H(p) (bits/word)   KL(p||q*1)   KL(p||q*2)   KL(p||q*3)   KL(p||q*4)
MT'04     1.36               0.97         0.32         0.21         0.17
MT'05     1.37               0.94         0.32         0.21         0.17

• The larger the order n is, the smaller the KL divergence

• The reduction in KL divergence happens mostly when switching from unigram to bigram

How to compute these quantities on a hypergraph? See (Li and Eisner, EMNLP'09).

BLEU scores when using a single variational n-gram model

Decoding scheme   MT'04   MT'05
Viterbi           35.4    32.6
1gram             25.9    24.5
2gram             36.1    33.4
3gram             36.0    33.1
4gram             35.8    32.9

• unigram performs very badly

• bigram, not a higher order, achieves the best BLEU scores; why? modeling error in p

BLEU cares about both low- and high-order n-gram matches

• Interpolating variational n-gram models for different n:

y* = argmax_{y∈HG(x)} Σ_n θ_n · log q*_n(y | x)

Viterbi and variational decoding are different ways of approximating p, so we can also interpolate in the Viterbi score:

y* = argmax_{y∈HG(x)} ( Σ_n θ_n · log q*_n(y | x) + θ_v · log p_Viterbi(y | x) )
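A sketch of the interpolated decision rule over an explicit candidate list; in the real decoder the argmax is taken over the whole hypergraph by dynamic programming. All model scores and weights below are made-up placeholders.

```python
# Interpolated rule: sum_n theta_n*log q*_n(y|x) + theta_v*log p_Viterbi(y|x).
import math

def interpolated_score(y, q_models, thetas, p_viterbi, theta_v):
    score = sum(t * math.log(q(y)) for q, t in zip(q_models, thetas))
    return score + theta_v * math.log(p_viterbi(y))

candidates = ["a cat on the mat", "the mat 's a cat"]
q1 = {"a cat on the mat": 0.30, "the mat 's a cat": 0.25}.get   # toy unigram model
q2 = {"a cat on the mat": 0.20, "the mat 's a cat": 0.35}.get   # toy bigram model
pv = {"a cat on the mat": 0.10, "the mat 's a cat": 0.15}.get   # toy Viterbi scores

best = max(candidates,
           key=lambda y: interpolated_score(y, [q1, q2], [0.5, 1.0], pv, 0.3))
print(best)
```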

Minimum Bayes Risk (MBR) decoding? (Tromble et al. 2008; DeNero et al. 2009)

Minimum Risk Decoding

• Maximum a posteriori (MAP) decoding: find the most probable translation string

y* = argmax_{y∈HG(x)} p(y|x)

• Minimum risk decoding: find the consensus translation string

Risk(y) = Σ_{y'} L(y, y') p(y'|x)

y* = argmin_{y∈HG(x)} Risk(y)
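A sketch of the minimum-risk rule on an explicit candidate list, with a toy Jaccard word-overlap loss standing in for the BLEU-derived loss of the cited work; the posterior values are made up.

```python
# Minimum Bayes risk: pick the candidate with the lowest expected loss.
def loss(y, y_ref):                      # toy stand-in for a BLEU-based loss
    a, b = set(y.split()), set(y_ref.split())
    return 1.0 - len(a & b) / len(a | b)

posterior = {"a cat on the mat": 0.5,    # made-up p(y'|x)
             "a cat of the mat": 0.3,
             "the mat 's a cat": 0.2}

def risk(y):                             # Risk(y) = sum_y' L(y, y') p(y'|x)
    return sum(loss(y, y2) * p for y2, p in posterior.items())

y_mbr = min(posterior, key=risk)
print(y_mbr, round(risk(y_mbr), 3))      # the consensus-friendly candidate
```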

Variational Decoding (VD) vs. MBR (Tromble et al. 2008)

[Figure: a cartoon with two axes, one for handling spurious ambiguity and one for seeking consensus; VD sits high on the spurious-ambiguity axis, MBR on the consensus axis, and interpolated VD covers both.]

Both the BLEU metric and our variational distributions happen to use n-gram dependencies.

• Variational decoding with interpolation

n-gram probability:  q(r(w) | h(w), x) = Σ_{y'} c_w(y') p(y'|x) / Σ_{y'} c_{h(w)}(y') p(y'|x)
n-gram model:        q_n(y | x) = Π_{w∈W_n} q(r(w) | h(w), x)^{c_w(y)}
decision rule:       y* = argmax_{y∈HG(x)} Σ_n θ_n · log q*_n(y | x)

• Minimum risk decoding (Tromble et al. 2008)

n-gram probability:  g(w | x) = Σ_{y'} δ_w(y') p(y'|x)
n-gram model:        g_n(y | x) = Σ_{w∈W_n} g(w | x) · c_w(y)
decision rule:       y* = argmax_{y∈HG(x)} Σ_n θ_n · g_n(y | x)

(Here w ranges over the n-grams W_n; c_w(y) is the count of w in y; h(w) and r(w) are the history and rightmost word of w; δ_w(y') indicates whether w occurs in y'.)

Unlike q_n, the g quantities are non-probabilistic, and they are very expensive to compute.
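To contrast the two quantities numerically, here is a sketch computing both on the earlier brute-force toy posterior: the variational conditional q(r(w) | h(w), x) is a ratio of expected counts, whereas g(w | x) sums the posterior mass of strings containing w; the two happen to coincide here because no bigram repeats within a string.

```python
# Variational q(r(w)|h(w),x) vs. Tromble-style g(w|x) on the toy posterior.
from collections import defaultdict

posterior = {"the mat a cat": 0.25, "a cat on the mat": 0.125,
             "a cat of the mat": 0.25, "the mat 's a cat": 0.375}

def bigrams(y):
    w = y.split() + ["</s>"]
    return list(zip(w, w[1:]))

num = defaultdict(float); den = defaultdict(float); g = defaultdict(float)
for y, p in posterior.items():
    bs = bigrams(y)
    for b in bs:
        num[b] += p               # expected count of bigram w
        den[b[0]] += p            # expected count of its history h(w)
    for b in set(bs):
        g[b] += p                 # delta_w(y') = 1 if w occurs in y'

w = ("cat", "</s>")
print(num[w] / den[w[0]])         # q(</s> | cat) = 0.625
print(g[w])                       # g(w|x) = 0.625 too: bigrams never repeat here
```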

BLEU Results on Chinese-English NIST MT Tasks

Decoding scheme                 MT'04   MT'05
Viterbi                         35.4    32.6
MBR (K=1000)                    35.8    32.7
Crunching (N=10000)             35.7    32.8
Crunching+MBR (N=10000)         35.8    32.7
Variational (1to4gram+wp+vt)    36.6    33.5

• variational decoding improves over Viterbi, MBR, and crunching

Conclusions

• Exact MAP decoding with spurious ambiguity is intractable

• Viterbi or N-best approximations are efficient, but ignore most derivations

• We developed a variational approximation, which considers all derivations but still allows tractable decoding

• Our variational decoding improves a state-of-the-art baseline

Future directions

• The MT pipeline is full of intractable problems
  • variational approximation is a principled way to tackle these problems

• Decoding with spurious ambiguity is a common problem in many other NLP applications
  • Models with latent variables
  • Data-oriented parsing (DOP)
  • Hidden Markov Models (HMMs)
  • ...

Thank you! 谢谢!

Joshua