Variational Decoding for Statistical Machine Translation

Zhifei Li, Jason Eisner, and Sanjeev Khudanpur
Center for Language and Speech Processing
Computer Science Department
Johns Hopkins University

Monday, August 17, 2009

(slides file: jason/papers/li+al.acl09.slides-anim.pdf)

Spurious Ambiguity

• Statistical models in MT exhibit spurious ambiguity
  • Many different derivations (e.g., trees or segmentations) generate the same translation string
• Regular phrase-based MT systems
  • phrase segmentation ambiguity
• Tree-based MT systems
  • derivation tree ambiguity

Spurious Ambiguity in Phrase Segmentations

[Figure: three phrase segmentations of the source 机器 翻译 软件, including "machine translation" + "software" and "machine" + "translation software", all producing "machine translation software". A different output string, "machine transfer software", is also shown.]

• Same output: “machine translation software”
• Three different phrase segmentations
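The segmentation ambiguity above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' system: the phrase table `PHRASE_TABLE` is hypothetical (assumed so that exactly three segmentations exist), and translation is assumed to be monotone with no reordering.

```python
# Hypothetical phrase table (assumed for illustration, not taken from the slides).
PHRASE_TABLE = {
    ("机器",): "machine",
    ("翻译",): "translation",
    ("软件",): "software",
    ("机器", "翻译"): "machine translation",
    ("翻译", "软件"): "translation software",
}

def segmentations(words):
    """Yield every split of `words` into contiguous phrases found in PHRASE_TABLE."""
    if not words:
        yield []
        return
    for end in range(1, len(words) + 1):
        head = tuple(words[:end])
        if head in PHRASE_TABLE:
            for rest in segmentations(words[end:]):
                yield [head] + rest

def translate(segmentation):
    """Concatenate phrase translations left to right (monotone, no reordering)."""
    return " ".join(PHRASE_TABLE[phrase] for phrase in segmentation)

source = ["机器", "翻译", "软件"]
by_output = {}
for seg in segmentations(source):
    by_output.setdefault(translate(seg), []).append(seg)

for output, segs in by_output.items():
    print(f"{output!r} is produced by {len(segs)} segmentations")
```

Under these assumptions every segmentation maps to the same string, so a decoder that scores segmentations one at a time splits that string's probability mass across several hypotheses.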

Spurious Ambiguity in Derivation Trees

Source: 机器 翻译 软件

Rules used:
S->(机器, machine)
S->(翻译, translation)
S->(软件, software)
S->(S0 S1, S0 S1)
S->(S0 翻译 S1, S0 translation S1)

[Figure: three derivation trees. Two combine the three lexical rules with two applications of S->(S0 S1, S0 S1), in different bracketings. The third applies S->(S0 翻译 S1, S0 translation S1) to S->(机器, machine) and S->(软件, software).]

• Same output: “machine translation software”
• Three different derivation trees
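As a rough check of the count on this slide, the sketch below counts the derivation trees that the five rules license for 机器 翻译 软件. The function name `count` and the span-based recursion are illustrative choices, not part of the slides; only the source sides of the rules matter for counting.

```python
from functools import lru_cache

SOURCE = ("机器", "翻译", "软件")

# Source sides of the lexical rules S -> (word, translation) from the slide.
LEXICAL = {"机器", "翻译", "软件"}

@lru_cache(maxsize=None)
def count(i, j):
    """Number of derivation trees rooted in S that cover SOURCE[i:j]."""
    total = 0
    # Lexical rules cover a single word.
    if j - i == 1 and SOURCE[i] in LEXICAL:
        total += 1
    # S -> (S0 S1, S0 S1): try every split point k.
    for k in range(i + 1, j):
        total += count(i, k) * count(k, j)
    # S -> (S0 翻译 S1, S0 translation S1): 翻译 sits at position m,
    # with a non-empty S span on each side.
    for m in range(i + 1, j - 1):
        if SOURCE[m] == "翻译":
            total += count(i, m) * count(m + 1, j)
    return total

print(count(0, len(SOURCE)))
```

All rules here are monotone, so every counted tree yields the same English string; the trees differ only in how they build it.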

Maximum A Posteriori (MAP) Decoding

[Figure: table with columns "translation string", "derivation", and "probability". Eight derivations with probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10 yield three strings (red, blue, and green translation). The MAP column gives the summed probability of each string: 0.28, 0.28, 0.44.]

• Exact MAP decoding
• x: Foreign sentence
• y: English sentence
• d: derivation

$$
y^* = \arg\max_{y \in \mathrm{Trans}(x)} p(y \mid x)
    = \arg\max_{y \in \mathrm{Trans}(x)} \sum_{d \in D(x, y)} p(y, d \mid x)
$$
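The exact MAP rule can be traced on the toy numbers from the table. The sketch below is illustrative only: the assignment of the eight derivation probabilities to the red, blue, and green strings is an assumption, inferred from the per-string totals 0.28, 0.28, 0.44 shown on the slide.

```python
from collections import defaultdict

# Toy p(y, d | x). The grouping of derivations by string is an assumption
# inferred from the MAP totals on the slide (0.28, 0.28, 0.44).
DERIVATIONS = [
    ("red", 0.16), ("blue", 0.14), ("blue", 0.14), ("green", 0.13),
    ("red", 0.12), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]

def map_decode(derivations):
    """Sum p(y, d | x) over the derivations of each string y, then take the argmax."""
    p_y = defaultdict(float)
    for y, p in derivations:
        p_y[y] += p
    best = max(p_y, key=p_y.get)
    return best, dict(p_y)

best, p_y = map_decode(DERIVATIONS)
print(best, {y: round(p, 2) for y, p in p_y.items()})
```

With this grouping the green string wins with total mass 0.44, even though none of its individual derivations is the most probable one.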

Hypergraph as a search space

Source sentence: dianzi0 shang1 de2 mao3

Nodes:
S 0,4
X 0,4 the · · · cat
X 0,4 a · · · mat
X 0,2 the · · · mat
X 3,4 a · · · cat

Rules on the hyperedges:
X → ⟨mao, a cat⟩
X → ⟨dianzi shang, the mat⟩
X → ⟨X0 de X1, X0 X1⟩
X → ⟨X0 de X1, X1 on X0⟩
X → ⟨X0 de X1, X1 of X0⟩
X → ⟨X0 de X1, X0 ’s X1⟩
S → ⟨X0, X0⟩ (two hyperedges)

A hypergraph is a compact structure to encode exponentially many trees.
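To make the compactness claim concrete, here is a small sketch using a hypothetical packed forest, not the hypergraph drawn on the slide: one nonterminal, one node per source span, and a hyperedge for every binary split. The helper names `num_trees` and `num_nodes` are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_trees(i, j):
    """Inside-style count of the binary trees packed under the span [i, j)."""
    if j - i == 1:
        return 1
    return sum(num_trees(i, k) * num_trees(k, j) for k in range(i + 1, j))

def num_nodes(n):
    """One hypergraph node per span [i, j) with 0 <= i < j <= n."""
    return n * (n + 1) // 2

for n in (4, 8, 16):
    print(f"n={n}: {num_nodes(n)} nodes encode {num_trees(0, n)} trees")
```

The node count grows quadratically in the sentence length while the tree count follows the Catalan numbers, which is why sums over derivations are computed by dynamic programming on the hypergraph rather than by enumerating trees.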

Probabilistic Hypergraph

[Figure: the hypergraph over the source sentence dianzi0 shang1 de2 mao3, rooted at S 0,4.]

The hypergraph defines a probability distribution over derivation trees, i.e. p(y, d | x), and also an implicit distribution over strings, i.e. p(y | x).

• Exact MAP decoding

$$
y^* = \arg\max_{y \in \mathrm{HG}(x)} p(y \mid x)
    = \arg\max_{y \in \mathrm{HG}(x)} \sum_{d \in D(x, y)} p(y, d \mid x)
$$

Slide annotations on this equation: "NP-hard (Sima’an 1996)" and "exponential size".

Decoding with spurious ambiguity?

• Maximum a posteriori (MAP) decoding
• Viterbi approximation
• N-best approximation (crunching) (May and Knight 2006)

Viterbi Approximation

[Figure: table of translation strings (red, blue, green), their derivations, and the derivation probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10. MAP column: 0.28, 0.28, 0.44. Viterbi column (best single derivation per string): 0.16, 0.14, 0.13.]

• Viterbi approximation

$$
y^* = \arg\max_{y \in \mathrm{Trans}(x)} \max_{d \in D(x, y)} p(y, d \mid x)
    = Y\Big(\arg\max_{d \in D(x)} p(y, d \mid x)\Big)
$$

Here Y(d) denotes the translation string yielded by derivation d.
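The Viterbi rule can be traced on the same toy table. This is a sketch under a stated assumption: the grouping of derivations by string is inferred from the column totals on the slides and is not given explicitly.

```python
# Toy p(y, d | x). The grouping by string is an assumption inferred from
# the slide's MAP totals (0.28, 0.28, 0.44).
DERIVATIONS = [
    ("red", 0.16), ("blue", 0.14), ("blue", 0.14), ("green", 0.13),
    ("red", 0.12), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]

def viterbi_decode(derivations):
    """Return the string (and probability) of the single best derivation."""
    return max(derivations, key=lambda pair: pair[1])

print(viterbi_decode(DERIVATIONS))
```

Viterbi picks the red string on the strength of one 0.16 derivation, whereas summing over derivations favors green at 0.44.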

N-best Approximation

[Figure: table of translation strings (red, blue, green), their derivations, and the derivation probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10. MAP column: 0.28, 0.28, 0.44. Viterbi column: 0.16, 0.14, 0.13. 4-best crunching column: 0.16, 0.28, 0.13.]

• N-best approximation (crunching) (May and Knight 2006)

$$
y^* = \arg\max_{y \in \mathrm{Trans}(x)} \sum_{d \in D(x, y) \cap \mathrm{ND}(x)} p(y, d \mid x)
$$
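Crunching can be sketched the same way: keep the N most probable derivations, then sum within each string. As before, the grouping of derivations by string is an assumption inferred from the slide's column totals, and `crunch_decode` is an illustrative name.

```python
import heapq
from collections import defaultdict

# Toy p(y, d | x). The grouping by string is an assumption inferred from
# the slide's MAP totals (0.28, 0.28, 0.44).
DERIVATIONS = [
    ("red", 0.16), ("blue", 0.14), ("blue", 0.14), ("green", 0.13),
    ("red", 0.12), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]

def crunch_decode(derivations, n):
    """Sum probabilities per string over the n best derivations only."""
    top = heapq.nlargest(n, derivations, key=lambda pair: pair[1])
    p_y = defaultdict(float)
    for y, p in top:
        p_y[y] += p
    best = max(p_y, key=p_y.get)
    return best, dict(p_y)

best4, totals4 = crunch_decode(DERIVATIONS, 4)
best8, totals8 = crunch_decode(DERIVATIONS, 8)
print(best4, {y: round(p, 2) for y, p in totals4.items()})
print(best8, {y: round(p, 2) for y, p in totals8.items()})
```

With N = 4 the blue string wins (0.28), which differs from both the Viterbi choice and the exact MAP choice; only when N covers all eight derivations does crunching recover green.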

MAP vs. Approximations

[Figure: table of translation strings (red, blue, green), their derivations, and the derivation probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10. MAP column: 0.28, 0.28, 0.44. Viterbi column: 0.16, 0.14, 0.13. 4-best crunching column: 0.16, 0.28, 0.13.]

• Exact MAP decoding under spurious ambiguity is intractable
• Viterbi and crunching are efficient, but ignore most derivations
• Our goal: develop an approximation that considers all the derivations but still allows tractable decoding

Variational Decoding

Decoding using variational approximation

Decoding using a sentence-specific approximate distribution

Page 75: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

Variational Decoding for MT: an Overview

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

Sentence-specific decoding

Foreign sentence x SMT

1 Generate a hypergraph

Three steps:

Monday, August 17, 2009

Page 76: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

Variational Decoding for MT: an Overview

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

Sentence-specific decoding

Foreign sentence x SMT

1 Generate a hypergraph

Three steps:

p(y, d | x)

Monday, August 17, 2009

Page 77: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

Variational Decoding for MT: an Overview

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

Sentence-specific decoding

Foreign sentence x SMT

p(y | x)

1 Generate a hypergraph

Three steps:

p(y, d | x)

Monday, August 17, 2009

Page 78: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

Variational Decoding for MT: an Overview

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

Sentence-specific decoding

Foreign sentence x SMT

MAP decoding under P is intractable

p(y | x)

1 Generate a hypergraph

Three steps:

p(y, d | x)

Monday, August 17, 2009

Page 79: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

1

p(y, d | x)Generate a hypergraph

Monday, August 17, 2009

Page 80: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

1

p(y, d | x)Generate a hypergraph

Monday, August 17, 2009

Page 81: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

1

p(y, d | x)

2

Generate a hypergraph

Monday, August 17, 2009

Page 82: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

1

p(y, d | x)

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

p(y, d | x)

2

Generate a hypergraph

Monday, August 17, 2009

Page 83: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

1

p(y, d | x)

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

p(y, d | x)

2 Estimate a model from the hypergraph

Generate a hypergraph

q*(y | x)

Monday, August 17, 2009

Page 84: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

q* is an n-gram model over output strings.

1

p(y, d | x)

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

p(y, d | x)

2 Estimate a model from the hypergraph

Generate a hypergraph

q*(y | x)

Monday, August 17, 2009

Page 85: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

q* is an n-gram model over output strings.

1

p(y, d | x)

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

p(y, d | x)

2 Estimate a model from the hypergraph

Generate a hypergraph

q*(y | x)

≈∑d∈D(x,y) p(y,d|x)

Monday, August 17, 2009

Page 86: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

q* is an n-gram model over output strings.

1

p(y, d | x)

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

p(y, d | x)

2

3

Estimate a model from the hypergraph

Generate a hypergraph

q*(y | x)

≈∑d∈D(x,y) p(y,d|x)

Monday, August 17, 2009

Page 87: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

q* is an n-gram model over output strings.

Decode using q*on the hypergraph

1

p(y, d | x)

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

p(y, d | x)

dianzi0 shang1 de2 mao3

S 0,4

X 0,4 the · · · cat X 0,4 a · · · mat

X 0,2 the · · · mat X 3,4 a · · · cat

X!"mao,a cat#

X!"X0 de X1,X0 X1#

X!"dianzi shang, the mat#

X!"X0 de X1,X1 on X0#

S!"X0,X0#

X!"X0 de X1,X1 of X0#

S!"X0,X0#

X!"X0 de X1,X0 ’s X1#

q*(y | x)

2

3

Estimate a model from the hypergraph

Generate a hypergraph

q*(y | x)

≈∑d∈D(x,y) p(y,d|x)

Monday, August 17, 2009

Page 88: Variational Decoding for Statistical Machine Translationjason/papers/li+al.acl09.slides-anim.pdf · Spurious Ambiguity • Statistical models in MT exhibit spurious ambiguity •

Variational Inference

• We want to do inference under p, but it is intractable:

$$y^* = \arg\max_y p(y \mid x)$$

• Instead, we derive a simpler distribution q*:

$$q^* = \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q)$$

• Then, we use q* as a surrogate for p in inference:

$$y^* = \arg\max_y q^*(y \mid x)$$

[Figure: diagram with the labels P, p, Q, and q*, showing q* as the member of Q chosen to approximate p.]

Variational Approximation

• q*: the approximation in Q with minimum distance to p, measured by KL divergence

$$
\begin{aligned}
q^* &= \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q) \qquad \text{($Q$: a family of distributions)} \\
    &= \arg\min_{q \in Q} \sum_{y \in \mathrm{Trans}(x)} p \log \frac{p}{q} \\
    &= \arg\min_{q \in Q} \sum_{y \in \mathrm{Trans}(x)} \left( p \log p - p \log q \right) \\
    &= \arg\max_{q \in Q} \sum_{y \in \mathrm{Trans}(x)} p \log q
\end{aligned}
$$

The term p log p is constant with respect to q, so it drops out of the minimization.

• Three questions

• how to parameterize q?

• how to estimate q*?

• how to use q* for decoding?
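The equivalence between minimizing KL(p||q) and maximizing ∑ p log q can be checked numerically. This is a minimal sketch: the three-outcome distribution `p` and the small candidate `family` are invented for illustration, and the function names are not from the talk.

```python
import math

def kl(p, q):
    """KL(p || q) over a shared finite support."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0)

def expected_log_q(p, q):
    """Sum over y of p(y) log q(y), the objective maximized on the slide."""
    return sum(p[y] * math.log(q[y]) for y in p if p[y] > 0)

# Hypothetical target distribution and candidate family Q.
p = {"y1": 0.5, "y2": 0.3, "y3": 0.2}
family = [
    {"y1": 1 / 3, "y2": 1 / 3, "y3": 1 / 3},
    {"y1": 0.6, "y2": 0.2, "y3": 0.2},
    {"y1": 0.45, "y2": 0.35, "y3": 0.2},
]

best_by_kl = min(range(len(family)), key=lambda i: kl(p, family[i]))
best_by_ll = max(range(len(family)), key=lambda i: expected_log_q(p, family[i]))

# The two objectives differ only by the constant sum of p log p.
const = sum(v * math.log(v) for v in p.values())
gaps = [kl(p, q) + expected_log_q(p, q) - const for q in family]
```

Both criteria pick the same member of the family, and every entry of `gaps` is zero up to floating-point error, which is the "constant" step in the derivation.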


Parameterization of q∈Q

• Naturally, we parameterize q as an n-gram model

• The probability of a string is the product of the probabilities of the n-grams appearing in that string

• Other parameterizations are possible!

Example with a 3-gram model, for y: a b c d e f

q(y) = q(a) · q(b|a) · q(c|ab) · q(d|bc) · q(e|cd) · q(f |de)
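The factorization above can be sketched in a few lines. The helper names `ngram_factors` and `string_prob`, and the constant-valued `cond_prob` used in the usage lines, are assumptions for illustration only; a real q would look its conditional probabilities up in an estimated table.

```python
def ngram_factors(tokens, n=3):
    """Pair each word with its history: the up-to-(n-1) preceding words."""
    return [(w, tuple(tokens[max(0, i - (n - 1)):i])) for i, w in enumerate(tokens)]

def string_prob(tokens, cond_prob, n=3):
    """q(y) as the product of q(word | history) over all positions."""
    prob = 1.0
    for word, history in ngram_factors(tokens, n):
        prob *= cond_prob(word, history)
    return prob

tokens = "a b c d e f".split()
factors = ngram_factors(tokens)
# factors == [('a', ()), ('b', ('a',)), ('c', ('a', 'b')),
#             ('d', ('b', 'c')), ('e', ('c', 'd')), ('f', ('d', 'e'))]

# Hypothetical model that assigns 0.5 to every conditional probability.
prob = string_prob(tokens, lambda word, history: 0.5)  # 0.5 ** 6
```

The first two factors use shortened histories, matching q(a) and q(b|a) in the slide's example.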

how to estimate these n-gram probabilities?


Estimation of q*∈Q

• Variational approximation:

$$q^* = \arg\max_{q \in Q} \sum_{y \in \mathrm{Trans}(x)} p \log q$$

• q* is a maximum likelihood estimate (MLE) where p is the empirical distribution

But in our case, p is defined not by a corpus, but by a hypergraph for a given test sentence!

[Figure: a bi-gram model is estimated from the hypergraph for dianzi0 shang1 de2 mao3.]

Two ways to estimate q* from the hypergraph:

• brute force

• dynamic programming
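For an n-gram family, this maximization has the familiar relative-frequency form of an MLE, with corpus counts replaced by counts expected under p. The following is a sketch under the standard MLE argument, not a formula quoted from the slides; the symbols $c_w(y)$ (number of times the n-gram $w$ occurs in $y$), $\bar{c}(w)$ (its expected count), $h$ (a history), and $r$ (the word that follows it) are notation introduced here.

```latex
\bar{c}(w) = \sum_{y \in \mathrm{Trans}(x)} p(y \mid x)\, c_w(y),
\qquad
q^*(r \mid h) = \frac{\bar{c}(h\,r)}{\sum_{r'} \bar{c}(h\,r')}
```

Brute force and dynamic programming then differ only in how the expected counts $\bar{c}$ are obtained from the hypergraph.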


Estimating q* from a hypergraph: brute force

[Figure: the hypergraph for dianzi0 shang1 de2 mao3 is unpacked into its four derivations, one per X → ⟨X0 de X1, ...⟩ rule, yielding the strings "the mat a cat", "the mat 's a cat", "a cat on the mat", and "a cat of the mat".]

Bi-gram estimation:

‣ unpack the hypergraph

Monday, August 17, 2009

Estimating q* from a hypergraph: brute force

Bi-gram estimation:

‣ unpack the hypergraph

[Figure: the hypergraph for "dianzi0 shang1 de2 mao3" (nodes S[0,4]; X[0,4] "the ··· cat"; X[0,4] "a ··· mat"; X[0,2] "the ··· mat"; X[3,4] "a ··· cat") unpacked into four derivation trees, one for each translation of "X0 de X1": X0 X1, X0 's X1, X1 on X0, X1 of X0.]

Translation strings of the four derivations: "the mat a cat", "a cat on the mat", "a cat of the mat", "the mat 's a cat".

Probabilities attached to the derivations: p = 2/8, p = 1/8, p = 3/8, p = 2/8.

Estimating q* from a hypergraph: brute force

Bi-gram estimation:

‣ unpack the hypergraph

‣ accumulate the soft-count of each bigram

‣ normalize the counts

Unpacked strings: "the mat a cat", "a cat on the mat", "a cat of the mat", "the mat 's a cat".

Probabilities attached to the derivations: p = 2/8, p = 1/8, p = 3/8, p = 2/8.

Resulting estimates for the history "cat":

Pr(on | cat) = 1/8

Pr(of | cat) = 2/8

Pr(</s> | cat) = 5/8
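The three brute-force steps above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the names `UNPACKED` and `bigram_model` are hypothetical, and the assignment of 2/8 and 3/8 to the two strings ending in "cat" is an assumption. The estimates for the history "cat" do not depend on that choice.

```python
from collections import defaultdict
from fractions import Fraction as F

# Unpacked translation strings with their probabilities p(y | x).
# Assumption: which of the two "... cat" strings gets 2/8 and which 3/8.
UNPACKED = {
    "the mat a cat": F(2, 8),
    "the mat 's a cat": F(3, 8),
    "a cat on the mat": F(1, 8),
    "a cat of the mat": F(2, 8),
}

def bigram_model(dist):
    """Estimate q(w | h) from a distribution over strings."""
    soft = defaultdict(F)   # expected (soft) count of each bigram (h, w)
    hist = defaultdict(F)   # expected count of each history h
    for y, p in dist.items():
        words = ["<s>"] + y.split() + ["</s>"]
        for h, w in zip(words, words[1:]):
            soft[h, w] += p      # accumulate the soft count
            hist[h] += p
    # normalize the counts
    return {(h, w): c / hist[h] for (h, w), c in soft.items()}

q = bigram_model(UNPACKED)
for w in ("on", "of", "</s>"):
    print(f"Pr({w} | cat) = {q['cat', w]}")
```

`Fraction` keeps the arithmetic exact and reduces results, so 2/8 is displayed as 1/4. Unpacking is only feasible here because the toy hypergraph has four derivations; a real hypergraph encodes far too many.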

Estimating q* from a hypergraph: dynamic programming

[Figure: hypergraph for the source sentence "dianzi0 shang1 de2 mao3". Nodes: S[0,4]; X[0,4] "the ··· cat"; X[0,4] "a ··· mat"; X[0,2] "the ··· mat"; X[3,4] "a ··· cat". Hyperedge rules: X → ⟨mao, a cat⟩; X → ⟨dianzi shang, the mat⟩; X → ⟨X0 de X1, X0 X1⟩; X → ⟨X0 de X1, X0 's X1⟩; X → ⟨X0 de X1, X1 on X0⟩; X → ⟨X0 de X1, X1 of X0⟩; S → ⟨X0, X0⟩ (twice).]

Bi-gram estimation:

‣ run inside-outside on the hypergraph

‣ accumulate the soft-count of each bigram at each hyperedge

‣ normalize the counts
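The dynamic-programming recipe above can be sketched on the same toy hypergraph. This is an illustrative sketch under stated assumptions, not the authors' code: the node names `A` to `D`, the edge weights (chosen so the four derivations have probabilities 2/8, 3/8, 1/8, 2/8), and the `<s>` / `</s>` markers on the root edges are all hypothetical. It relies on each node carrying its boundary words ("the ··· cat"), so the bigrams created at a hyperedge can be read off locally.

```python
from collections import defaultdict
from fractions import Fraction as F

# Boundary words (first, last) of every string a node can yield.
BOUNDS = {"A": ("the", "mat"), "B": ("a", "cat"),
          "C": ("the", "cat"), "D": ("a", "mat")}

# Hyperedges in bottom-up order: (head, tails, weight, target side).
# Target items are words (str) or indices (int) into `tails`.
EDGES = [
    ("A", (), F(1), ["the", "mat"]),            # X -> <dianzi shang, the mat>
    ("B", (), F(1), ["a", "cat"]),              # X -> <mao, a cat>
    ("C", ("A", "B"), F(2, 8), [0, 1]),         # X -> <X0 de X1, X0 X1>
    ("C", ("A", "B"), F(3, 8), [0, "'s", 1]),   # X -> <X0 de X1, X0 's X1>
    ("D", ("A", "B"), F(1, 8), [1, "on", 0]),   # X -> <X0 de X1, X1 on X0>
    ("D", ("A", "B"), F(2, 8), [1, "of", 0]),   # X -> <X0 de X1, X1 of X0>
    ("S", ("C",), F(1), ["<s>", 0, "</s>"]),    # S -> <X0, X0>
    ("S", ("D",), F(1), ["<s>", 0, "</s>"]),    # S -> <X0, X0>
]

def prod(values):
    out = F(1)
    for v in values:
        out *= v
    return out

def inside_outside(edges, root):
    inside = defaultdict(F)                     # F() == 0
    for head, tails, w, _ in edges:             # bottom-up pass
        inside[head] += w * prod(inside[t] for t in tails)
    outside = defaultdict(F)
    outside[root] = F(1)
    for head, tails, w, _ in reversed(edges):   # top-down pass
        for i, t in enumerate(tails):
            others = prod(inside[u] for j, u in enumerate(tails) if j != i)
            outside[t] += outside[head] * w * others
    return inside, outside

def boundary(item, tails, side):
    """First (side=0) or last (side=1) word contributed by a target item."""
    return item if isinstance(item, str) else BOUNDS[tails[item]][side]

def bigram_soft_counts(edges, root):
    inside, outside = inside_outside(edges, root)
    counts = defaultdict(F)
    for head, tails, w, target in edges:
        # posterior probability that a derivation uses this hyperedge
        post = outside[head] * w * prod(inside[t] for t in tails) / inside[root]
        # bigrams newly formed where adjacent target items meet
        for left, right in zip(target, target[1:]):
            counts[boundary(left, tails, 1), boundary(right, tails, 0)] += post
    return counts

def normalize(counts):
    totals = defaultdict(F)
    for (h, _), c in counts.items():
        totals[h] += c
    return {(h, w): c / totals[h] for (h, w), c in counts.items()}

q = normalize(bigram_soft_counts(EDGES, "S"))
```

Under these assumed weights, `q` agrees with the brute-force estimates for the history "cat" without ever enumerating the derivations.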

Decoding using q* ∈ Q

• Rescore the hypergraph HG(x)

$$y^* = \arg\max_{y \in \mathrm{HG}(x)} q^*(y \mid x)$$

q* is an n-gram model.

• have efficient dynamic programming algorithms

• score the hypergraph using an n-gram model

John already told you how to do this ☺
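The decision rule above can be illustrated on the toy example from the estimation slides. The sketch below is hypothetical in several ways: the table `Q` holds bigram probabilities worked out by hand for that example (assuming "the mat a cat" has p = 2/8 and "the mat 's a cat" has p = 3/8), and the candidate strings are listed explicitly, whereas a real decoder would rescore the hypergraph with dynamic programming.

```python
from fractions import Fraction as F

# Toy bigram model q*(w | h); values assumed from the worked example.
Q = {
    ("<s>", "the"): F(5, 8), ("<s>", "a"): F(3, 8),
    ("the", "mat"): F(1), ("a", "cat"): F(1),
    ("mat", "a"): F(2, 8), ("mat", "'s"): F(3, 8), ("mat", "</s>"): F(3, 8),
    ("'s", "a"): F(1),
    ("cat", "on"): F(1, 8), ("cat", "of"): F(2, 8), ("cat", "</s>"): F(5, 8),
    ("on", "the"): F(1), ("of", "the"): F(1),
}

# Strings encoded by the hypergraph HG(x).
CANDIDATES = [
    "the mat a cat",
    "the mat 's a cat",
    "a cat on the mat",
    "a cat of the mat",
]

def q_star(y):
    """Probability of string y under the bigram model Q."""
    words = ["<s>"] + y.split() + ["</s>"]
    score = F(1)
    for h, w in zip(words, words[1:]):
        score *= Q.get((h, w), F(0))
    return score

# y* = argmax over the strings in HG(x) of q*(y | x)
best = max(CANDIDATES, key=q_star)
```

Restricting the argmax to strings in HG(x) matters: the bigram model also gives mass to strings the hypergraph cannot produce, such as "a cat".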

KL divergences under different variational models

$$q^* = \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q), \qquad \mathrm{KL}(p \,\|\, q) = H(p, q) - H(p)$$

| Measure (bits/word) | H(p) | KL(p‖q*_1) | KL(p‖q*_2) | KL(p‖q*_3) | KL(p‖q*_4) |
|---|---|---|---|---|---|
| MT'04 | 1.36 | 0.97 | 0.32 | 0.21 | 0.17 |
| MT'05 | 1.37 | 0.94 | 0.32 | 0.21 | 0.17 |

• The larger the order n is, the smaller the KL divergence is!

• The reduction of KL divergence happens mostly when switching from unigram to bigram
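The identity KL(p‖q) = H(p, q) − H(p) used on this slide follows from the definitions. A short derivation, writing p(y) for p(y | x) and summing over translation strings y (the notation is an assumption, the slide states only the result):

```latex
\begin{align*}
\mathrm{KL}(p \,\|\, q)
  &= \sum_y p(y) \log \frac{p(y)}{q(y)} \\
  &= -\sum_y p(y) \log q(y) \;-\; \Big( -\sum_y p(y) \log p(y) \Big) \\
  &= H(p, q) - H(p).
\end{align*}
```

Because H(p) does not depend on q, minimizing KL(p‖q) over q ∈ Q is the same as minimizing the cross-entropy H(p, q).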

KL divergences under different variational models

How to compute them on a hypergraph? See (Li and Eisner, EMNLP'09).

$$q^* = \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q), \qquad \mathrm{KL}(p \,\|\, q) = H(p, q) - H(p)$$

| Measure (bits/word) | H(p) | KL(p‖q*_1) | KL(p‖q*_2) | KL(p‖q*_3) | KL(p‖q*_4) |
|---|---|---|---|---|---|
| MT'04 | 1.36 | 0.97 | 0.32 | 0.21 | 0.17 |
| MT'05 | 1.37 | 0.94 | 0.32 | 0.21 | 0.17 |

BLEU scores when using a single variational n-gram model

| Decoding scheme | MT'04 | MT'05 |
|---|---|---|
| Viterbi | 35.4 | 32.6 |
| 1gram | 25.9 | 24.5 |
| 2gram | 36.1 | 33.4 |
| 3gram | 36.0 | 33.1 |
| 4gram | 35.8 | 32.9 |

• unigram performs very badly

• bigram achieves best BLEU scores ???

‣ modeling error in p

BLEU cares about both low- and high-order n-gram matches

• Interpolate variational n-gram models of different orders n

$$y^* = \arg\max_{y \in \mathrm{HG}(x)} \sum_n \theta_n \cdot \log q^*_n(y \mid x)$$

Viterbi and variational are different ways of approximating p

$$y^* = \arg\max_{y \in \mathrm{HG}(x)} \Big[ \sum_n \theta_n \cdot \log q^*_n(y \mid x) + \theta_v \cdot \log p_{\mathrm{Viterbi}}(y \mid x) \Big]$$
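The interpolated decision rule can be sketched as a weighted sum of log-scores. Everything concrete below is hypothetical: the two candidates, their probabilities, and the weights `THETA` and `THETA_V` are made up for illustration, and candidates are listed explicitly rather than scored on a hypergraph.

```python
import math

# Hypothetical per-candidate scores: log q*_n(y | x) for n = 1, 2 and
# log p_Viterbi(y | x). The numbers are invented for illustration only.
CANDIDATES = {
    "candidate A": {"logq": {1: math.log(0.2), 2: math.log(0.5)},
                    "viterbi": math.log(0.1)},
    "candidate B": {"logq": {1: math.log(0.4), 2: math.log(0.1)},
                    "viterbi": math.log(0.3)},
}
THETA = {1: 0.5, 2: 1.0}   # hypothetical interpolation weights theta_n
THETA_V = 0.5              # hypothetical weight theta_v on the Viterbi score

def interpolated_score(feats, theta, theta_v):
    """sum_n theta_n * log q*_n(y | x) + theta_v * log p_Viterbi(y | x)."""
    variational = sum(theta[n] * lq for n, lq in feats["logq"].items())
    return variational + theta_v * feats["viterbi"]

best = max(CANDIDATES,
           key=lambda y: interpolated_score(CANDIDATES[y], THETA, THETA_V))
```

Setting `THETA_V` to zero recovers the purely variational rule, and keeping a single nonzero `THETA[n]` recovers decoding with one n-gram model.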

Minimum Bayes Risk (MBR) decoding?

(Tromble et al. 2008)

(DeNero et al. 2009)

Minimum Risk Decoding

• Minimum risk decoding

‣ find the consensus translation string

$$y^* = \arg\min_{y \in \mathrm{HG}(x)} \mathrm{Risk}(y), \qquad \mathrm{Risk}(y) = \sum_{y'} L(y, y')\, p(y' \mid x)$$

• Maximum a posteriori (MAP) decoding

‣ find the most probable translation string

$$y^* = \arg\max_{y \in \mathrm{HG}(x)} p(y \mid x)$$
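The difference between the two decision rules shows up whenever the single most probable string is an outlier. The sketch below is illustrative only: the three candidate strings, their posterior probabilities, and the loss function (one minus the Jaccard overlap of word sets) are all hypothetical stand-ins, not the loss used in the cited work.

```python
from fractions import Fraction as F

# Hypothetical posterior p(y' | x) over three candidate translations.
POSTERIOR = {
    "the cat sat": F(40, 100),
    "a cat sits here": F(35, 100),
    "a cat sits there": F(25, 100),
}

def loss(y, y_ref):
    """Hypothetical loss L(y, y'): 1 - Jaccard overlap of the word sets."""
    a, b = set(y.split()), set(y_ref.split())
    return 1 - F(len(a & b), len(a | b))

def risk(y, posterior):
    """Risk(y) = sum over y' of L(y, y') * p(y' | x)."""
    return sum(loss(y, y_ref) * p for y_ref, p in posterior.items())

# MAP decoding: the most probable string.
map_choice = max(POSTERIOR, key=POSTERIOR.get)
# Minimum risk decoding: the consensus string.
mbr_choice = min(POSTERIOR, key=lambda y: risk(y, POSTERIOR))
```

Here MAP picks "the cat sat", while minimum risk picks "a cat sits here", because that string is close to another candidate that also carries substantial probability.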

Variational Decoding (VD) vs. MBR (Tromble et al. 2008)

[Figure: VD, MBR, and interpolated VD positioned on two axes, "spurious ambiguity" and "consensus".]

Both the BLEU metric and our variational distributions happen to use n-gram dependencies.

• Variational decoding with interpolation

decision rule:

$$y^* = \arg\max_{y \in \mathrm{HG}(x)} \sum_n \theta_n \cdot \log q^*_n(y \mid x)$$

n-gram model:

$$q_n(y \mid x) = \prod_{w \in W_n} q(r(w) \mid h(w), x)^{c_w(y)}$$

n-gram probability:

$$q(r(w) \mid h(w), x) = \frac{\sum_{y'} c_w(y')\, p(y' \mid x)}{\sum_{y'} c_{h(w)}(y')\, p(y' \mid x)}$$

• Minimum risk decoding (Tromble et al. 2008)

decision rule:

$$y^* = \arg\max_{y \in \mathrm{HG}(x)} \sum_n \theta_n \cdot g_n(y \mid x)$$

n-gram model:

$$g_n(y \mid x) = \sum_{w \in W_n} g(w \mid x)\, c_w(y)$$

n-gram probability:

$$g(w \mid x) = \sum_{y'} \delta_w(y')\, p(y' \mid x)$$

• Variational decoding with interpolation

$$y^* = \arg\max_{y \in \mathrm{HG}(x)} \sum_n \theta_n \cdot \log q^*_n(y \mid x)$$

$$q_n(y \mid x) = \prod_{w \in W_n} q(r(w) \mid h(w), x)^{c_w(y)}$$

$$q(r(w) \mid h(w), x) = \frac{\sum_{y'} c_w(y')\, p(y' \mid x)}{\sum_{y'} c_{h(w)}(y')\, p(y' \mid x)}$$

• Minimum risk decoding (Tromble et al. 2008)

$$y^* = \arg\max_{y \in \mathrm{HG}(x)} \sum_n \theta_n \cdot g_n(y \mid x)$$

$$g_n(y \mid x) = \sum_{w \in W_n} g(w \mid x)\, c_w(y)$$

$$g(w \mid x) = \sum_{y'} \delta_w(y')\, p(y' \mid x)$$

‣ non-probabilistic

‣ very expensive to compute
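The minimum-risk quantities in this comparison can be computed by brute force on the toy example from the estimation slides. This is a hedged sketch: it assumes c_w(y) is the number of times n-gram w occurs in y and δ_w(y') is 1 if w occurs in y' and 0 otherwise (the slides do not spell these out), it reuses the assumed assignment of 2/8 and 3/8 to the two strings ending in "cat", and the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction as F

# Toy posterior p(y' | x); the 2/8 vs 3/8 assignment is an assumption.
POSTERIOR = {
    "the mat a cat": F(2, 8),
    "the mat 's a cat": F(3, 8),
    "a cat on the mat": F(1, 8),
    "a cat of the mat": F(2, 8),
}

def ngrams(y, n):
    words = y.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_posteriors(posterior, n):
    """g(w | x): probability that n-gram w occurs at least once in y'."""
    g = Counter()
    for y, p in posterior.items():
        for w in set(ngrams(y, n)):   # set() plays the role of delta_w
            g[w] += p
    return g

def mbr_ngram_score(y, g, n):
    """g_n(y | x) = sum over n-gram types w of g(w | x) * c_w(y)."""
    return sum(g[w] * c for w, c in Counter(ngrams(y, n)).items())

g2 = ngram_posteriors(POSTERIOR, 2)
scores = {y: mbr_ngram_score(y, g2, 2) for y in POSTERIOR}
```

The score is a sum of posteriors rather than a product of conditional probabilities, which is why `scores` can exceed 1 and does not define a distribution over strings.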

BLEU Results on Chinese-English NIST MT Tasks

| Decoding scheme | MT'04 | MT'05 |
|---|---|---|
| Viterbi | 35.4 | 32.6 |
| MBR (K=1000) | 35.8 | 32.7 |
| Crunching (N=10000) | 35.7 | 32.8 |
| Crunching+MBR (N=10000) | 35.8 | 32.7 |
| Variational (1to4gram+wp+vt) | 36.6 | 33.5 |

• variational decoding improves over Viterbi, MBR, and crunching

Conclusions

• Exact MAP decoding with spurious ambiguity is intractable

• Viterbi or N-best approximations are efficient, but ignore most derivations

• We developed a variational approximation, which considers all derivations but still allows tractable decoding

• Our variational decoding improves over a state-of-the-art baseline

Future directions

• The MT pipeline is full of intractable problems

‣ variational approximation is a principled way to tackle these problems

• Decoding with spurious ambiguity is a common problem in many other NLP applications

‣ Models with latent variables

‣ Data oriented parsing (DOP)

‣ Hidden Markov Models (HMM)

‣ ......

Thank you!

Joshua

1. Generate a hypergraph

2. Estimate a model from the hypergraph

q* is an n-gram model over output strings.

$$q^*(y \mid x) \approx \sum_{d \in D(x, y)} p(y, d \mid x)$$

3. Decode using q* on the hypergraph

[Figure: the hypergraph for "dianzi0 shang1 de2 mao3" (nodes S[0,4]; X[0,4] "the ··· cat"; X[0,4] "a ··· mat"; X[0,2] "the ··· mat"; X[3,4] "a ··· cat") shown three times, annotated with p(y, d | x), p(y, d | x), and q*(y | x).]

