
MaxForce: Max-Violation Perceptron and Forced Decoding for...

Date post: 25-Jun-2020
Page 1: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

MaxForce: Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Scalable Training for MT Finally Made Successful

Heng Yu (Chinese Acad. of Sciences)

Liang Huang (CUNY)

Haitao Mi (IBM T. J. Watson)

Kai Zhao (CUNY)

[Figure: gold derivation lattice over steps 0 to 6 for “Bush held talks with Sharon”]

Discriminative Training for SMT

• discriminative training is dominant in parsing / tagging

• can use arbitrary, overlapping, lexicalized features

• but not very successful yet in machine translation

• most efforts on MT training tune feature weights on the small dev set (~1k sentences), not the training set!

• as a result they can only use ~10 dense features (MERT)

• or ~10k rather impoverished features (MIRA/PRO)

• Liang et al. (2006) trained on the training set but failed

[Figure: data split into training set (>100k sentences), dev set (~1k sentences), and test set (~1k sentences)]

Timeline for MT Training

[Figure: timeline of MT training methods, drawn against the training set (>100k sentences), dev set (~1k sentences), and test set (~1k sentences)]

• MERT (Och ’02): dense features

• Standard Perceptron (Liang et al. 2006): a noble failure

• MIRA (Watanabe+ ’07; Chiang+ ’08-’12): pseudo sparse features

• PRO (Hopkins+May ’11)

• Regression (Bazrafshan+ ’12)

• HOLS (Flanigan+ ’13): sparse features as one dense feature

• our work (2013): violation-fixing perceptron with truly sparse features

Why previous work fails

• their learning methods are based on exact search

• MT has huge search spaces => severe search errors

• learning algorithms should fix search errors

• full updates (perceptron/MIRA/PRO) can’t fix search errors

• MT involves latent variables (derivations not annotated)

• perceptron/MIRA was not designed for latent variables

• we need better variants of the perceptron

Why our approach works

• use a variant of perceptron tailored for inexact search

• fix search errors in the middle of the search

• “partial updates” instead of “full updates”

• use forced decoding lattice as the target to update to

• use parallelized minibatch to speed up learning

• result: scaled to a large portion of the training data

• 20M sparse features => +2.0 BLEU over MERT/PRO

MT as Structured Classification

• with latent variables (hidden derivations)

[Figure: input x = 那 人 咬 了 狗 with reference y = “the man bit the dog” and the set of all gold derivations; the decoder's best derivation instead produces y = “the dog bit the man”, a wrong translation]

• update: penalize the best derivation and reward the best gold derivation
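The update on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: derivations are assumed to be represented as sparse feature dictionaries, and the names `score` and `latent_perceptron_update` as well as the toy feature names are hypothetical.

```python
def score(weights, features):
    """Dot product between a sparse weight vector and a sparse feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def latent_perceptron_update(weights, best_features, gold_feature_sets):
    """Reward the best gold derivation and penalize the decoder's best derivation.

    gold_feature_sets holds one feature dict per gold derivation (the latent
    variable); the highest-scoring one under the current weights is rewarded.
    """
    best_gold = max(gold_feature_sets, key=lambda f: score(weights, f))
    for name, value in best_gold.items():
        weights[name] = weights.get(name, 0.0) + value   # reward (++)
    for name, value in best_features.items():
        weights[name] = weights.get(name, 0.0) - value   # penalize (--)
    return best_gold


# Toy example with made-up feature names.
weights = {"a": 1.0}
gold = [{"a": 1.0, "g": 1.0}, {"g2": 1.0}]
wrong = {"bad": 1.0, "g": 1.0}
chosen = latent_perceptron_update(weights, wrong, gold)
print(chosen, weights)
```

Features shared by both derivations cancel out, so only the features that distinguish the gold derivation from the wrong one change weight.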


Outline

• Motivations

• Phrase-based Translation and Forced Decoding

• Violation-Fixing Perceptron for SMT

• Update Strategies: Early Update and Max-Violation

• Feature Design

• Experiments

Phrase-based translation

source: 布什 与 沙龙 举行 了 会谈

pinyin: Bushi yu Shalong juxing le huitan

word-by-word options: Bush / with / Sharon / held / talks (or meetings)

translation: Bush held talks with Sharon

[Figure: decoding as a path through coverage vectors over the six source words: _ _ _ _ _ _ (start), ●_ _ _ _ _ (Bush), ●_ _●●● (held talks), ●●●●●● (with Sharon); an alternative state ●●_●●● is also shown]
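The coverage vectors in the figure can be reproduced with a small sketch. It assumes a coverage vector is a tuple of booleans and that a phrase is identified by its half-open source span; the helper names `expand` and `show` and the span list are hypothetical, chosen only to match the example sentence.

```python
def expand(coverage, phrase_spans):
    """Yield (new_coverage, span) for each span that covers only untranslated words."""
    for start, end in phrase_spans:
        if not any(coverage[start:end]):
            new_coverage = coverage[:start] + (True,) * (end - start) + coverage[end:]
            yield new_coverage, (start, end)


def show(coverage):
    """Render a coverage vector in the slide's notation."""
    return "".join("●" if covered else "_" for covered in coverage)


# Source positions: 0 Bushi, 1 yu, 2 Shalong, 3 juxing, 4 le, 5 huitan.
spans = [(0, 1), (1, 3), (3, 6)]          # Bushi | yu Shalong | juxing le huitan
state = (False,) * 6
for chosen in [(0, 1), (3, 6), (1, 3)]:   # Bush, held talks, with Sharon
    state = dict((span, cov) for cov, span in expand(state, spans))[chosen]
    print(show(state))
```

Each decoding step picks one still-uncovered source phrase, so the order in which spans are chosen determines the reordering in the output.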

Language Model and Beam Search

• split each -LM state into many +LM states

[Figure: +LM states pair a coverage vector with the most recent target words: ●_ _ _ _ _ ... Bush; ●_ _●●● ... talks, ... talk, ... meeting; ●●●●●● ... Sharon, ... Shalong]

Forced Decoding

source: Bushi yu Shalong juxing le huitan

reference: Bush held talks with Sharon

• both as data selection (more literal) and oracle derivations

[Figure: the +LM beam-search states (●_ _ _ _ _ ... Bush; ●_ _●●● ... talks / talk / meeting; ●●●●●● ... Sharon / Shalong) with one gold derivation highlighted]

[Figure: gold derivation lattice over steps 0 to 6, with states _ _ _ _ _ _, ● _ _ _ _ _, ● _ _ ● ● _, ● _ _ ● ● ●, ● ● _ ● ● ●, ● ● ● ● ● ● and phrase edges Bush, held, talks, with, Sharon, held talks, with Sharon]
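Forced decoding can be illustrated as a search that only accepts phrases matching the next words of the reference. The sketch below is an assumption-laden toy, not the decoder used in the talk: it has no language model, no beam, and no distortion limit, and the function `forced_decode` and the phrase table are hypothetical.

```python
from functools import lru_cache


def forced_decode(n_source, reference, phrase_table):
    """Return every derivation, as a tuple of (span, phrase), that yields the reference."""
    ref = tuple(reference)

    @lru_cache(maxsize=None)
    def complete(coverage, pos):
        # All ways to finish the translation from this (coverage, target position) state.
        if all(coverage):
            return [()] if pos == len(ref) else []
        results = []
        for (start, end), phrases in phrase_table.items():
            if any(coverage[start:end]):
                continue  # span overlaps already-translated words
            for phrase in phrases:
                words = tuple(phrase.split())
                if ref[pos:pos + len(words)] != words:
                    continue  # phrase does not continue the reference
                new_cov = coverage[:start] + (True,) * (end - start) + coverage[end:]
                for rest in complete(new_cov, pos + len(words)):
                    results.append((((start, end), phrase),) + rest)
        return results

    return complete((False,) * n_source, 0)


# Hypothetical phrase table for: Bushi yu Shalong juxing le huitan
phrase_table = {
    (0, 1): ["Bush"],
    (1, 2): ["with"],
    (2, 3): ["Sharon", "Shalong"],
    (1, 3): ["with Sharon"],
    (3, 5): ["held"],
    (5, 6): ["talks", "talk", "meeting"],
    (3, 6): ["held talks"],
}
gold = forced_decode(6, "Bush held talks with Sharon".split(), phrase_table)
for derivation in gold:
    print(derivation)
```

With this toy table the surviving derivations differ only in how “held talks” and “with Sharon” are segmented, which mirrors the alternative paths through the gold derivation lattice. A reference that no phrase sequence can produce yields an empty list, i.e. the pair is unreachable.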

Unreachable Sentences and Prefix

source: 联合国 派遣 50 名 观察员 监督 玻利维亚 恢复 民主 政治 以来 首次 全国 大选

pinyin: Lianheguo paiqian 50 ming guanchayuan jiandu Boliweiya huifu minzhu zhengzhi yilai shouci quanguo daxuan

reference: U.N. sent 50 observers to monitor the 1st election since Bolivia restored democracy

[Figure: word alignment between the source and the reference, annotated with numbers (5, 33, 4, 1 as extracted)]

• distortion limit causes unreachability (hiero would be better)

• but we can still use reachable prefix-pairs of unreachable pairs

Sentence/Word Reachability Ratio

• how many sentence pairs pass forced decoding?

• the ratio drops dramatically as sentences get longer

• prefixes boost coverage

[Figure: ratio of complete coverage (0% to 100%) vs. sentence length (10 to 70), with curves for distortion-unlimited and distortion limits 6, 4, 2, 0]

[Figure: second plot on the same axes, with curves dist-6, dist-4, dist-2, dist-0]

Number of Gold Derivations

• exponential in sentence length (on fully reachable sentences)

• these are the “latent variables” in learning

[Figure: average number of derivations (0 to 100,000) vs. sentence length (5 to 50), with curves dist-6, dist-4, dist-2, dist-0]


Outline

• Background: Phrase-based Translation (Koehn, 2004)

• Forced Decoding

• Violation-Fixing Perceptron for MT Training

• Update strategy

• Feature design

• Experiments

Structured Perceptron (Collins 02)

• challenges in applying perceptron for MT

• the inference (decoding) is vastly inexact (beam search)

• we know standard perceptron doesn’t work for MT

• intuition: the learner should fix the search error first

[Figure: perceptron loop shown twice. Binary classification: input x, label y = +1 or y = -1 (constant # of classes), exact inference with weights w gives prediction z, update weights if y ≠ z. Structured classification: input x = 那 人 咬 了 狗, output y = “the man bit the dog” (exponential # of classes), same loop, with inference marked as exact in the standard setting and inexact for MT]
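The loop in the figure corresponds to the standard structured perceptron update, written out below for reference. The feature map \Phi and the candidate set \mathcal{Y}(x) are notation assumed here; the slides only name x, y, z, and w.

```latex
z = \operatorname*{arg\,max}_{y' \in \mathcal{Y}(x)} \mathbf{w} \cdot \Phi(x, y')
\qquad\text{and, if } z \neq y:\qquad
\mathbf{w} \leftarrow \mathbf{w} + \Phi(x, y) - \Phi(x, z)
```

The arg max is where exactness matters: with beam search, z is only the best candidate that survived pruning, which is the case the following slides address.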

Search Error: Gold Derivations Pruned

[Figure, top: real decoding with beam search over steps 0 to 6; the beam at each step holds several coverage states, such as ● _ _ _ _ _, _ _ ● _ _ _, and ● ● _ ● _ _. Callout: “should fix search errors here!”]

[Figure, bottom: gold derivation lattice over steps 0 to 6 for “Bush held talks with Sharon”, with phrase edges Bush, held, talks, with, Sharon, held talks, with Sharon]

Page 59: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Fixing Search Error 1: Early Update

17

standard update(no guarantee!)

21

Model

Page 60: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Fixing Search Error 1: Early Update

• early update (Collins/Roark’04) when the correct falls off beam

• up to this point the incorrect prefix should score higher

• that’s a “violation” which we want to fix

17

standard update(no guarantee!)

21

Model

Page 61: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Fixing Search Error 1: Early Update

• early update (Collins/Roark’04) when the correct falls off beam

• up to this point the incorrect prefix should score higher

• that’s a “violation” which we want to fix

17

correct sequencefalls off beam

(pruned)

correct

standard update(no guarantee!)

21

Model

Page 62: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Fixing Search Error 1: Early Update

• early update (Collins/Roark’04) when the correct falls off beam

• up to this point the incorrect prefix should score higher

• that’s a “violation” which we want to fix

17

correct sequencefalls off beam

(pruned)

correct

incorrect

standard update(no guarantee!)

21

Model

Page 63: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Fixing Search Error 1: Early Update

• early update (Collins/Roark’04) when the correct falls off beam

• up to this point the incorrect prefix should score higher

• that’s a “violation” which we want to fix

17

earl

y up

date

correct sequencefalls off beam

(pruned)

correct

incorrect

violation guaranteed: incorrect prefix scores higher up to this point

standard update(no guarantee!)

21

Model

Page 64: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Fixing Search Error 1: Early Update

• early update (Collins/Roark’04): update as soon as the correct sequence falls off the beam

• up to this point the incorrect prefix must score higher (otherwise the correct one would not have been pruned)

• that’s a “violation” which we want to fix

• standard perceptron does not guarantee violation

• w/ pruning, the correct seq. might score higher at the end!

• called “invalid” update b/c it doesn’t fix the search error

[Figure: beam search under the model; the correct sequence falls off the beam (pruned). Early update: violation guaranteed, incorrect prefix scores higher up to this point. Standard update: no guarantee!]
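A minimal sketch of the early-update rule described on this slide, under stated assumptions: `beams[i]` holds the prefixes of length `i` that survive pruning (best first), and `features` is a hypothetical toy feature map (one count per token). Neither name comes from the slides, and this is an illustration rather than the authors' implementation.

```python
from collections import Counter

def features(prefix):
    # Hypothetical toy feature map: one count per token in the prefix.
    return Counter(prefix)

def early_update(weights, beams, gold):
    """Perceptron update at the first step where the gold prefix is pruned.

    beams[i] lists the surviving prefixes of length i, best first.
    Returns the step of the update, or None if gold never falls off.
    """
    for i, beam in enumerate(beams):
        if gold[:i] not in beam:
            # Reward the gold prefix, penalize the best prefix in the beam.
            delta = features(gold[:i])
            delta.subtract(features(beam[0]))
            for name, value in delta.items():
                weights[name] = weights.get(name, 0.0) + value
            return i
    return None

weights = {}
beams = [[()], [("a",), ("b",)], [("a", "x"), ("a", "y")]]
step = early_update(weights, beams, gold=("b", "z"))
print(step, weights)
```

In this toy run the gold prefix survives steps 0 and 1 and is pruned at step 2, so the update uses only the length-2 prefixes.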

Page 71: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Early Update w/ Latent Variable

• the gold-standard derivations are not annotated

• we treat any reference-producing derivation as good

[Figure: model beam over coverage vectors next to the gold derivation lattice for “Bush held talks with Sharon” (positions 0 to 6). Once all correct derivations fall off the beam, stop decoding and make an early update; violation guaranteed: incorrect prefix scores higher up to this point.]
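With latent derivations, the stopping test changes from "the gold prefix is pruned" to "every reference-producing prefix is pruned". The sketch below illustrates only that test; the derivation ids and the two per-step collections are hypothetical stand-ins for the beam and the gold derivation lattice.

```python
def latent_early_update_step(beams, gold_prefixes):
    """Return the first step where no reference-producing prefix survives.

    beams[i]: derivation prefixes kept in the beam at step i.
    gold_prefixes[i]: prefixes at step i that can still produce the reference.
    Returns None if some good derivation survives at every step.
    """
    for i, (beam, gold) in enumerate(zip(beams, gold_prefixes)):
        if not set(beam) & set(gold):
            return i  # all correct derivations fell off: stop decoding here
    return None

# Toy derivation ids (assumed): g* are reference-producing, b* are not.
beams = [{"g0"}, {"g1a", "b1"}, {"b2", "b3"}]
gold_prefixes = [{"g0"}, {"g1a", "g1b"}, {"g2a", "g2b"}]
stop_step = latent_early_update_step(beams, gold_prefixes)
print(stop_step)
```

At the returned step, the update would take the best-scoring reference-producing prefix as the positive example and the best prefix in the beam as the negative one.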

Page 72: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Fixing Search Error 2: Max-Violation

• early update works but learns slowly due to partial updates

• max-violation: use the prefix where violation is maximum

• “worst-mistake” in the search space

• we call these methods “violation-fixing perceptrons” (Huang et al 2012)

[Figure: update methods under model w, between the best in the beam and the worst in the beam: early, max-violation, std, and local. Labels: $d^-_i$, $d^+_i$, $d^+_{i^*}$, $d^-_{i^*}$, $d^+_{|x|}$, $d^y_{|x|}$, $d^-_{|x|}$. The standard update is invalid.]
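A small sketch of how the max-violation step could be picked, assuming we already have, for each step, the model score of the best reference-producing prefix (`gold_scores`) and of the best prefix in the beam (`beam_scores`). Both lists and the numbers are illustrative assumptions, not values from the talk.

```python
def max_violation_step(gold_scores, beam_scores):
    """Return (step, violation) where beam score exceeds gold score the most.

    The violation at step i is beam_scores[i] - gold_scores[i]; an update is
    valid only if this amount is positive. Returns None if no step violates.
    """
    violations = [b - g for g, b in zip(gold_scores, beam_scores)]
    i_star = max(range(len(violations)), key=violations.__getitem__)
    if violations[i_star] <= 0:
        return None
    return i_star, violations[i_star]

gold_scores = [0.0, 1.0, 1.5, 3.0, 5.0]
beam_scores = [0.0, 1.5, 3.5, 4.0, 4.5]
result = max_violation_step(gold_scores, beam_scores)
print(result)
```

In this toy example the violation at the last step is negative (the gold derivation scores higher at the end), which is the case where a standard full-sequence update would be invalid, while the prefix at step 2 still gives a valid update.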

Page 82: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Early Update vs. Max-Violation

[Figure: animated beam-search example over coverage vectors (source positions 0 to 6), marking where Early-update and Max-violation make their updates. The inset repeats the plot of update methods (early, max-violation, std, local) under model w, where the standard update is invalid.]

Page 83: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Latent-Variable Perceptron

[Figure: update positions along the correct sequence, between the best in the beam and the worst in the beam: early (falls off the beam), max-violation (biggest violation), latest (last valid update), and full (standard) (invalid update!). The inset repeats the plot of update methods (early, max-violation, std, local) under model w.]

Page 88: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Roadmap of the techniques

structured perceptron (Collins, 2002)

latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009)

perceptron w/ inexact search (Collins & Roark, 2004; Huang et al 2012)

latent-variable perceptron w/ inexact search (Yu et al 2013)

hiero, syntactic parsing, semantic parsing, transliteration

Page 89: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Feature Design

• Dense features:

  • standard phrase-based features (Koehn, 2004)

• Sparse features:

  • rule-identification features (unique id for each rule)

  • word-edges features

    • lexicalized local translation context within a rule

  • non-local features

    • dependency between consecutive rules

Page 101: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

WordEdges Features (local)

[Figure: example with source 布什 与 沙龙 举行 了 会谈 and rules r1 (布什, “Bush”) and r2 (“held a few talks”), between <s> and </s>.]

• the first and last Chinese words in the rule

• the first and last English words in the rule

• the two Chinese words surrounding the rule

Combo Features: 100010=沙龙|held, 010001=举行|talks
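A hedged sketch of how such combo features could be generated. The six-slot order used here (left neighbor, first and last Chinese word in the rule, right neighbor, first and last English word) is an assumption chosen to reproduce the two examples on the slide; the slides do not spell out the order, and the rule span is likewise assumed.

```python
def word_edge_atoms(src, span, tgt_words):
    """Collect the six word-edge atoms for a rule covering src[i:j].

    Assumed slot order: [left neighbor, first zh, last zh,
                         right neighbor, first en, last en].
    """
    i, j = span
    left = src[i - 1] if i > 0 else "<s>"
    right = src[j] if j < len(src) else "</s>"
    return [left, src[i], src[j - 1], right, tgt_words[0], tgt_words[-1]]

def combo_features(atoms, masks):
    """Build one feature string per bitmask, joining the selected atoms."""
    feats = []
    for mask in masks:
        picked = [atom for bit, atom in zip(mask, atoms) if bit == "1"]
        feats.append(f"{mask}={'|'.join(picked)}")
    return feats

src = ["布什", "与", "沙龙", "举行", "了", "会谈"]
# Assumption: rule r2 covers 举行 了 会谈 and yields "held a few talks".
atoms = word_edge_atoms(src, (3, 6), ["held", "a", "few", "talks"])
combos = combo_features(atoms, ["100010", "010001"])
print(combos)
```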

Page 108: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Lexical backoffs and combos

• Lexical features are often too sparse

• 6 kinds of lexical backoffs with various budgets

• total budget can’t exceed 10 (bilexical)

[Figure: example with source 布什 与 沙龙 举行 了 会谈 and rules r1 (布什, “Bush”) and r2 (“held a few talks”), between <s> and </s>.]

Example features: P00010=NN|held, 0c0001=举|talks, 100010=沙龙|held, 010001=举行|talks
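The two backed-off examples can be reproduced with a small sketch, assuming that `P` replaces a word by its POS tag and `c` by its first character. This reading of the mask symbols, the slot order of `atoms`, and the `pos_tags` lookup are all assumptions; only two of the six backoff kinds are illustrated, and the budget constraint is not modeled.

```python
def backoff_feature(mask, atoms, pos_tags):
    """Build a feature string from a mask over the word-edge atoms.

    Mask symbols (assumed): '1' full word, 'P' POS tag,
    'c' first character, '0' slot unused.
    """
    parts = []
    for symbol, word in zip(mask, atoms):
        if symbol == "1":
            parts.append(word)
        elif symbol == "P":
            parts.append(pos_tags[word])
        elif symbol == "c":
            parts.append(word[0])
        elif symbol != "0":
            raise ValueError(f"unknown mask symbol: {symbol!r}")
    return f"{mask}={'|'.join(parts)}"

# Assumed slot order: left neighbor, first zh, last zh, right neighbor,
# first en, last en.
atoms = ["沙龙", "举行", "会谈", "</s>", "held", "talks"]
pos_tags = {"沙龙": "NN"}  # hypothetical tagger output
pos_backoff = backoff_feature("P00010", atoms, pos_tags)
char_backoff = backoff_feature("0c0001", atoms, pos_tags)
print(pos_backoff, char_backoff)
```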

Page 109: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Non-Local Features (trivial)

[Figure: example with source 布什 与 沙龙 举行 了 会谈 and rules r1 (布什, “Bush”) and r2 (“held a few talks”), between <s> and </s>.]

• two consecutive rule ids (rule bigram model)

• the last two English words and the current rule

• should explore a lot more!
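A minimal sketch of the two non-local templates listed above. The feature-name prefixes `rule_bigram` and `last2_rule` and the `<s>` padding are hypothetical choices for illustration.

```python
def non_local_features(prev_rule_id, rule_id, english_so_far):
    """Features that look beyond the current rule.

    prev_rule_id / rule_id: ids of two consecutive rules.
    english_so_far: English words produced before the current rule.
    """
    # Pad with <s> so that two context words are always available.
    last_two = (["<s>", "<s>"] + list(english_so_far))[-2:]
    return [
        f"rule_bigram={prev_rule_id}|{rule_id}",
        f"last2_rule={last_two[0]}|{last_two[1]}|{rule_id}",
    ]

feats = non_local_features("r1", "r2", ["Bush"])
print(feats)
```

Because these features depend on the previous rule and on the English words already produced, a decoder using them has to keep that context in each hypothesis.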

Page 120: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Experiments

• Data sets

Scale   Language   sent.   dev            tst
Small   Ch-En      30k     nist06 news    nist08 news
Large   Ch-En      240k    nist06 news    nist08 news
Large   Sp-En      170k    newstest2012   newstest2013

Sp-En ratio: sent. 55%, word 43.9%

Annotations: 10x dev, 120x dev, 31x dev

Page 124: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Perceptron: std, early, and max-violation

• standard perceptron (Liang et al’s “bold”) works poorly

• b/c invalid update ratio is very high (search quality is low)

• max-violation converges faster than early update

• this explains why Liang et al ’06 failed (std ~ “bold”; local ~ “local”)

[Figure: ratio of invalid updates (standard perceptron, +non-local feature) vs. beam size; ratio axis 50% to 90%, beam size 2 to 30.]

[Figure: BLEU (17 to 26) vs. number of iterations (2 to 20) for MaxForce, MERT, early, local, and standard.]

Page 125: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Parallelized Perceptron

• mini-batch perceptron (Zhao and Huang, 2013) much faster than iterative parameter mixing (McDonald et al, 2010)

• 6 CPUs => ~4x speedup; 24 CPUs => ~7x speedup

[Figure: BLEU (22 to 24) vs. time (0 to 4) for MERT, PRO-dense, minibatch (24-core), minibatch (6-core), minibatch (1 core), and single processor.]
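As a quick worked check of the speedup figures above, parallel efficiency (speedup divided by number of cores) can be computed as follows; the function name is illustrative, and the speedups are the approximate values quoted on the slide.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores

for cores, speedup in [(6, 4.0), (24, 7.0)]:
    print(f"{cores} cores: {parallel_efficiency(speedup, cores):.0%} efficiency")
```

So the roughly 4x speedup on 6 CPUs corresponds to about two thirds of linear scaling, and the roughly 7x speedup on 24 CPUs to a bit under 30%.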

Page 130: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training

Internal comparison with different features

• dense: 11 standard features for phrase-based MT

• ruleid: rule identification feature

• word-edges: word-edges features with back-offs

• non-local: non-local features with back-offs

[Figure: BLEU (y-axis) vs. number of iterations (x-axis). Curves: MERT, dense, +ruleid, +word-edges, +non-local. Annotations: dense: 11 features; ruleid: 0.1%; word-edges: 99.6%; non-local: 0.3%; +0.9 BLEU; +2.3; +0.7]

External comparison with MERT & PRO

• MERT, PRO-dense/medium/sparse all tune on dev-set

• PRO-sparse uses the same features as ours

[Figure: BLEU (y-axis) vs. number of iterations (x-axis). Curves: MaxForce, MERT, PRO-dense, PRO-medium, PRO-large]

Final Results on FBIS Data

System   Alg.       Tune on     Features   Dev    Test
Moses    MERT       dev set     11         25.5   22.5
Cubit    MERT       dev set     11         25.4   22.5
Cubit    PRO        dev set     11         25.6   22.6
Cubit    PRO        dev set     3k         26.3   23.0
Cubit    PRO        dev set     36k        17.7   14.3
Cubit    MaxForce   train set   23M        27.8   24.5

Gain of MaxForce over Moses MERT: +2.3 (Dev), +2.0 (Test)

• Moses: state-of-the-art phrase-based system in C++

• Cubit: phrase-based system (Huang and Chiang, 2007) in Python

• almost identical baseline scores with MERT

• max-violation takes ~47 hours on 24 CPUs (23M features)
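
The gains annotated on this slide can be checked with simple arithmetic on the table. The snippet below is only a worked example over the reported scores; the dictionary keys and the helper `gain` are hypothetical names chosen for illustration.

```python
# BLEU scores (dev, test) copied from the FBIS results table.
scores = {
    "Moses MERT": (25.5, 22.5),
    "Cubit MERT": (25.4, 22.5),
    "Cubit PRO (11 feats)": (25.6, 22.6),
    "Cubit PRO (3k feats)": (26.3, 23.0),
    "Cubit PRO (36k feats)": (17.7, 14.3),
    "Cubit MaxForce": (27.8, 24.5),
}


def gain(system, baseline):
    """Return (dev, test) BLEU differences, rounded to one decimal place."""
    return tuple(round(s - b, 1) for s, b in zip(scores[system], scores[baseline]))


print(gain("Cubit MaxForce", "Moses MERT"))            # (2.3, 2.0)
print(gain("Cubit MaxForce", "Cubit PRO (3k feats)"))  # (1.5, 1.5)
```

The second line compares against the best PRO row in the table (3k features); PRO with 36k features scores far below the dense baselines.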

Results on Spanish-English set

• Data-set: Europarl corpus, 170k sentences

• dev/test set: newstest2012 / newstest2013 (one reference only)

• +1 in 1-ref BLEU ~ +2 in 4-ref BLEU

• BLEU improvement is comparable to Chinese w/ 4 refs

system   algorithm   #feat.   dev    test
Moses    MERT        11       27.4   24.4
Cubit    MaxForce    21M      28.7   25.5
gain                          +1.3   +1.1

Sp-En             sent.   word
Reachable ratio   55%     43.9%

Conclusion

• a simple yet effective online learning approach for MT

• scaled to (a large portion of) the training set for the first time

• able to incorporate 20M sparse lexicalized features

• no need to define BLEU+1, or hope/fear derivations

• no learning rate or hyperparameters

• +2.3/+2.0 BLEU points better than MERT/PRO

• the three ingredients that made it work

• violation-fixing perceptron: early-update and max-violation

• forced decoding lattice helps

• minibatch parallelization scales it up to big data
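
A minimal sketch of the max-violation update named above, under stated assumptions: the decoder has already produced, for each step i, the feature vector of the best reference-reachable prefix (e.g. from the forced decoding lattice) and of the best incorrect prefix in the beam. The function names, the dictionary representation, and the choice to treat ties as violations are assumptions made for illustration, not the authors' implementation.

```python
def dot(w, feats):
    """Sparse dot product between weights and a feature dictionary."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())


def max_violation_update(w, good_prefixes, bad_prefixes):
    """Update w in place at the step where the bad prefix outscores the good
    prefix by the largest amount. Returns that step, or None if no update."""
    best_step, best_violation = None, float("-inf")
    for i, (good, bad) in enumerate(zip(good_prefixes, bad_prefixes)):
        violation = dot(w, bad) - dot(w, good)
        if violation > best_violation:
            best_step, best_violation = i, violation
    # Assumption: a tie (violation == 0) still counts as a violation.
    if best_step is None or best_violation < 0:
        return None
    for f, v in good_prefixes[best_step].items():
        w[f] = w.get(f, 0.0) + v  # reward the good prefix
    for f, v in bad_prefixes[best_step].items():
        w[f] = w.get(f, 0.0) - v  # penalize the bad prefix
    return best_step


# Hypothetical two-step example: the violation is 1.0 at step 0 and 3.0 at step 1.
weights = {"c": 1.0, "d": 2.0}
good = [{"a": 1.0}, {"a": 1.0, "b": 1.0}]
bad = [{"c": 1.0}, {"c": 1.0, "d": 1.0}]
step = max_violation_update(weights, good, bad)
print(step, weights)
```

Early update would instead stop at the first step where the good prefix falls off the beam; max-violation picks the step with the largest score gap, which gives a larger update per example.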

Roadmap of the techniques

• structured perceptron (Collins, 2002)

• latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009)

• perceptron w/ inexact search (Collins & Roark, 2004; Huang et al., 2012)

• latent-variable perceptron w/ inexact search (Yu et al., 2013)

• hiero, syntactic parsing, semantic parsing, transliteration

• replacing EM for partially-observed data

20 years of Statistical MT

• word alignment: IBM models (Brown et al 90, 93)

• translation model (choose one from below)

• SCFG (ITG: Wu 95, 97; Hiero: Chiang 05, 07) or STSG (GHKM 04, 06; Liu+ 06; Huang+ 06)

• PBMT (Och+Ney 02; Koehn et al 03)

• evaluation metric: BLEU (Papineni et al 02)

• decoding algorithm: cube pruning (Chiang 07; Huang+Chiang 07)

• training algorithm (choose one from below)

• MERT (Och 03): ~10 dense features on dev set

• MIRA (Chiang et al 08-12) or PRO (Hopkins+May 11): ~10k feats on dev set

• MaxForce: 20M+ feats on training set; +2/+1.5 BLEU over MERT/PRO

• Max-Violation Perceptron with Forced Decoding: fixes search errors

• first successful effort of online large-scale discriminative training for MT

When learning with vastly inexact search, you should use a principled method such as max-violation.

Thank you!

Max-violation

