Monolingual data in NMT
Franck Burlot & François Yvon
LIMSI, CNRS, Université Paris-Saclay
NLP Meetup Paris, November 28, 2018
Corpora in data-based machine translation
Corpora in Statistical Machine Translation: The Noisy Channel Model
Statistical MT translates French (f) into English (e) according to:

e* = argmax_e P(e | f; θ_TM)

θ_TM trained using examples of genuine (human) translations ⇒ parallel data
f: Elle partit avec son père, le visage souriant ;
e: In the gayest and happiest spirits she set forward with her father ;

f: elle n'écoutait pas toujours, mais elle acquiesçait de confiance.
e: not always listening, but always agreeing to what he said ;

f: Ils arrivèrent.
e: They arrived.

f: – C'est Frank et Mlle Fairfax, dit aussitôt Mme Weston.
e: It is Frank and Miss Fairfax, said Mrs. Weston.

f: – J'allai justement vous faire part de l'agréable surprise que nous avons eue en le voyant arriver.
e: I was just going to tell you of our agreeable surprize in seeing him arrive this morning.

f: Il reste jusqu'à demain et Mlle Fairfax a bien voulu, sur notre demande, venir passer la journée.
e: He stays till tomorrow, and Miss Fairfax has been persuaded to spend the day with us.

From Emma, by J. Austen
Statistical MT actually translates French (f) into English (e) according to the noisy channel decomposition:

e* = argmax_e P(f | e; θ_TM) × P(e; θ_LM)

θ_TM trained using examples of existing translations ⇒ parallel data
θ_LM trained using examples of existing texts ⇒ monolingual data
The noisy channel pipeline:

Parallel Corpus (French-English) → statistical processing → P(f|e)
Monolingual Corpus (English) → statistical processing → P(e)

Decoding / inference: e* = argmax_e P(f|e) × P(e)

A search problem: the channel model P(f|e) maps French to "broken English", which the language model P(e) turns into fluent English.
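To make the decomposition concrete, here is a minimal rescoring sketch in Python: given a fixed list of candidate translations, pick the one maximizing log P(f|e) + log P(e). Both scoring functions are hypothetical placeholders; a real system plugs in trained models and replaces the enumeration with beam search.

```python
def log_p_channel(f: str, e: str) -> float:
    # Placeholder for a trained channel model log P(f|e);
    # here: penalise length mismatch between source and candidate.
    return -abs(len(f.split()) - len(e.split()))

def log_p_lm(e: str) -> float:
    # Placeholder for a trained language model log P(e);
    # here: a crude per-word penalty.
    return -0.1 * len(e.split())

def noisy_channel_decode(f: str, candidates: list[str]) -> str:
    """e* = argmax_e P(f|e) x P(e), taken over a fixed n-best list.

    A real decoder searches a huge space instead of enumerating."""
    return max(candidates, key=lambda e: log_p_channel(f, e) + log_p_lm(e))

print(noisy_channel_decode("Ils arrivèrent .",
                           ["They arrived .", "They have now arrived ."]))
```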
Beauty of the Noisy Channel Model
e* = argmax_e P(f | e; θ_TM) × P(e; θ_LM)
Parallel corpora are costly, scarce, difficult to get, and restricted to specific domains: counts in millions of sentences.

Monolingual corpora are "free", massive, easy to get, for all domains: counts in billions of sentences.
Corpora in Neural Machine Translation (NMT)
NMT translates French (f) into English (e) according to:

e* = argmax_e P(e | f; θ_NN)

θ_NN trained using translation examples ⇒ parallel data
Tonight’s question
How to best leverage existing monolingual data?
NMT Primer in two slides
Recurrent Neural Networks
words: w_t ∈ {0,1}^|V| (one-hot), for t = 0, …, I
embeddings: i_t ∈ R^d
hidden states: h_t ∈ R^p

i_t = W^i w_t
h_{t+1} = f(W^{ih} i_{t+1} + W^{rh} h_t + b_r)
Recurrent Neural Networks for Language Modeling
words: w_t ∈ {0,1}^|V|; embeddings: i_t ∈ R^d; hidden states: h_t ∈ R^p (as above)
output: o_t ∈ R^|V|

P(w_{t+1} = k | w_{≤t}; θ_LM) = [softmax(o_t)]_k, with o_t = W^{ho} h_t + b_o
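As a toy illustration of these equations, the sketch below runs a few RNN LM steps in NumPy with randomly initialised parameters; all sizes and weights are illustrative stand-ins for a trained θ_LM.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, p = 1000, 64, 128          # vocabulary, embedding, hidden sizes

# Randomly initialised parameters (theta_LM); training would fit these.
W_i  = rng.normal(scale=0.1, size=(d, V))   # embedding matrix
W_ih = rng.normal(scale=0.1, size=(p, d))   # input-to-hidden
W_rh = rng.normal(scale=0.1, size=(p, p))   # recurrent hidden-to-hidden
b_r  = np.zeros(p)
W_ho = rng.normal(scale=0.1, size=(V, p))   # hidden-to-output
b_o  = np.zeros(V)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def lm_step(w_t: int, h_prev: np.ndarray):
    """One RNN LM step: embed w_t, update the state, predict w_{t+1}."""
    i_t = W_i[:, w_t]                               # i_t = W^i w_t (one-hot lookup)
    h_t = np.tanh(W_ih @ i_t + W_rh @ h_prev + b_r)
    p_next = softmax(W_ho @ h_t + b_o)              # P(w_{t+1} = k | w_{<=t})
    return h_t, p_next

h = np.zeros(p)
for w in [3, 17, 42]:                               # toy word-id sequence
    h, p_next = lm_step(w, h)
print(p_next.shape, p_next.sum())                   # (1000,) ~1.0
```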
Recurrent Neural Networks for Machine Translation
source words: f_t ∈ {0,1}^|V|; embeddings: i_t ∈ R^d
encoder states: h_t ∈ R^p, t = 0, …, I
attention / context: c_t ∈ R^p, with attention weights α_t and c_t = α_t^T h
decoder states: s_t ∈ R^p
output: o_t ∈ R^|V|

P(e_{t+1} = k | e_{≤t}, f; θ_NMT) = [softmax(o_t)]_k, with o_t = W^{so} s_t + b_o
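The context c_t is just an attention-weighted average of the encoder states. The slides do not fix the scoring function, so the sketch below uses simple dot-product scores (Bahdanau-style attention would use a small MLP instead); shapes are illustrative.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_context(s_t: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Dot-product attention: weights alpha_t over the encoder states H
    (shape I x p), context c_t = alpha_t^T H."""
    scores = H @ s_t                  # similarity of each source state to s_t
    alpha_t = softmax(scores)         # attention distribution over source words
    return alpha_t @ H                # c_t in R^p

rng = np.random.default_rng(1)
H = rng.normal(size=(7, 128))         # 7 encoder states, p = 128
s = rng.normal(size=128)              # current decoder state
print(attention_context(s, H).shape)  # (128,)
```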
To remember for now
NNLMs / NMTs predict one word at a time; inference ends with a softmax layer
They have multiple subparts: encoder (embeddings + RNN), decoder (RNN + embeddings), attention layer
Many architectural variants: GRUs / LSTMs, multiple layers, Transformers, CNNs
Back to Tonight’s question
How to best leverage existing monolingual data?
Using Language Models
The old-timer's way: combine NNLM and NMT
ex post, aka shallow fusion: combine the output layers

P(e_{t+1} = k | e_{≤t}, f; θ_LM, θ_TM) = λ_1 P_TM(e_{t+1} = k | e_{≤t}, f; θ_TM) + λ_2 P_LM(e_{t+1} = k | e_{≤t}; θ_LM)

Train θ_LM and θ_TM separately [Gulcehre et al., 2017] or one after the other [Stahlberg et al., 2018].

within decoder, aka deep fusion [Gulcehre et al., 2017, Burlot and Yvon, 2018]: combine the hidden layers

P(e_{t+1} = k | e_{≤t}, f; θ_LM, θ_TM) ∝ [W f(h^LM_t; s^TM_t; c_t; o_t)]_k    (1)

The LM state h^LM_t and the decoder state s^TM_t are concatenated, with a trained scaling (gating) factor σ(·) applied to h^LM_t.
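A sketch of the two fusion modes, with all weights as hypothetical NumPy arrays: shallow fusion mixes the two next-word distributions, while deep fusion gates the LM hidden state (the trained scaling factor above) and projects the concatenated vectors to the vocabulary, in the spirit of the deep-fusion recipe of Gulcehre et al.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def shallow_fusion(p_tm, p_lm, lam1=0.7, lam2=0.3):
    # Ex-post mixture of the next-word distributions (lam1 + lam2 = 1 keeps
    # the result normalised); a log-linear variant combines log-probs instead.
    return lam1 * p_tm + lam2 * p_lm

def deep_fusion(h_lm, s_tm, c_t, o_t, W_f, b_f, v, b_g):
    # Scalar gate g_t = sigma(v . h_lm + b_g) scales the LM state before the
    # concatenated hidden vectors are projected to the vocabulary.
    g_t = 1.0 / (1.0 + np.exp(-(v @ h_lm + b_g)))
    x = np.concatenate([g_t * h_lm, s_tm, c_t, o_t])
    return softmax(W_f @ x + b_f)

rng = np.random.default_rng(0)
V, p = 50, 8                          # illustrative sizes
# o_t is treated here as a p-dim vector (e.g. a previous-word embedding).
h_lm, s_tm, c_t, o_t = (rng.normal(size=p) for _ in range(4))
W_f, b_f = rng.normal(size=(V, 4 * p)), np.zeros(V)
v, b_g = rng.normal(size=p), 0.0
print(deep_fusion(h_lm, s_tm, c_t, o_t, W_f, b_f, v, b_g).sum())  # ~1.0
```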
Log-linear shallow fusion works better than linear [Stahlberg et al., 2018]
Deep fusion works better than shallow fusion [Stahlberg et al., 2018]
No clear result for very large data
"Back-translation" seems to be a much better recipe.
Back translation
The rich man's way: generate artificial parallel data
Parallel Corpus (French-English) → NN training → backward model P(f|e)
Monolingual Corpus (English) → backwards machine translation, decoding f* = argmax_f P(f|e) → Artificial Corpus (French-English)

A very very very old idea [Bertoldi and Federico, 2009, Bojar and Tamchyna, 2011]; a pipeline sketch follows below.
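The sketch assumes two hypothetical helpers (train and model.translate) standing in for a real NMT toolkit; the toy _CopyModel below just echoes its input.

```python
def back_translation_corpus(parallel_fr_en, mono_en, train):
    # 1. Train a *backward* English->French model on the true parallel data.
    backward = train([(e, f) for (f, e) in parallel_fr_en])
    # 2. Back-translate the monolingual English into (fake) French sources.
    synthetic = [(backward.translate(e), e) for e in mono_en]
    # 3. Mix real and artificial pairs to train the final French->English model.
    return parallel_fr_en + synthetic

class _CopyModel:
    def translate(self, sentence):    # toy stand-in for a trained decoder
        return sentence

mixed = back_translation_corpus(
    parallel_fr_en=[("Ils arrivèrent.", "They arrived.")],
    mono_en=["He stays till tomorrow."],
    train=lambda pairs: _CopyModel(),
)
print(mixed)
```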
Design choices:
back-translation engine (word-based, phrase-based, or neural MT)
data selection and weighting
training regime / data mix

Experimental validation: try approach X, evaluate MT quality
Main findings to date:
BT works very well [Sennrich et al., 2016] and many others
BT quality matters, real data is even better [Burlot and Yvon, 2018]
BT selection helps: choose monotonic sentences [Burlot and Yvon, 2018] or difficult words/phrases [Fadaee and Monz, 2018]
BT data are insufficiently diverse [Burlot and Yvon, 2018]; noising helps [Edunov et al., 2018]
Forward translation (FT) also helps [Crego and Senellart, 2016]
Training regime matters [Poncelas et al., 2018]
Large-scale experiments yield large gains [Edunov et al., 2018]
Iterative BT: an effective [Lample et al., 2018] and sound [Cotterell and Kreutzer, 2018] idea
BT quality matters, real data is even better
Back-translation setup: three automatic BT systems:
backtrans-bad: SMT (Moses) trained on 50k parallel sentences
backtrans-good: SMT (Moses) trained on all WMT data
backtrans-nmt: backward NMT systems

Quality of the backward engines (unk = rate of unknown words):

                French→English                 German→English
                test-07 test-08 nt-14  unk     test-07 test-08 nt-14  unk
backtrans-bad   18.86   19.27   20.49  3.22%   14.66   14.62   15.07  1.45%
backtrans-good  29.71   29.51   32.10  0.24%   24.19   24.19   25.75  0.73%
backtrans-nmt   31.10   31.43   31.27  0.0%    26.02   26.03   26.98  0.0%
Fine-tuning: these systems are used to back-translate the target side of Europarl in order to fine-tune the baselines.
Assessing the effectiveness of BT

English→French:

                  test-07              test-08              newstest-14
                  BLEU  BEER  CTER     BLEU  BEER  CTER     BLEU  BEER  CTER
Baseline          31.25 62.14 51.89    32.17 62.35 50.79    33.06 61.97 48.56
backtrans-bad     31.55 62.39 51.50    31.89 62.23 51.73    31.99 61.59 48.86
backtrans-good    32.99 63.43 49.58    33.25 63.08 49.29    33.52 62.62 47.23
backtrans-nmt     33.30 63.33 50.02    33.39 63.09 49.48    34.11 62.76 46.94
fwdtrans-nmt      31.93 62.55 50.84    32.62 62.66 49.83    33.56 62.44 47.65
backfwdtrans-nmt  33.09 63.19 50.08    33.70 63.25 48.83    34.00 62.76 47.22
natural           35.10 64.71 48.33    35.29 64.52 48.26    34.96 63.08 46.67

English→German:

                  test-07              test-08              newstest-14
                  BLEU  BEER  CTER     BLEU  BEER  CTER     BLEU  BEER  CTER
Baseline          21.36 57.08 63.32    21.27 57.11 60.67    22.49 57.79 55.64
backtrans-bad     21.84 57.85 61.24    21.04 57.44 59.77    22.28 57.70 55.49
backtrans-good    23.33 59.03 58.84    23.11 57.14 57.14    22.87 58.09 54.91
backtrans-nmt     23.00 59.12 58.31    23.10 58.85 56.67    22.91 58.12 54.67
fwdtrans-nmt      21.97 57.46 61.99    21.89 57.53 59.71    22.52 57.93 55.13
backfwdtrans-nmt  22.99 58.37 60.45    22.82 58.14 58.80    23.04 58.17 54.96
natural           26.74 61.14 56.19    26.16 60.64 54.76    23.84 58.64 54.23

⇒ Bad BT hardly helps.
⇒ BTs with PBMT and NMT are not so different.
⇒ Forward-translated source data can also help.
⇒ Human-translated (natural) sources are much better. Why this gap?
Properties of back-translated sentences (I-III)

[Figures omitted: distributions of sentence length, syntactic complexity, and vocabulary for English→French and English→German]

⇒ Synthetic sources contain shorter sentences.
⇒ Synthetic sources contain slightly simpler syntax.
⇒ Synthetic sources use a smaller vocabulary.
Properties of Backtranslated sentences (IV)
Monotonic translations: monotonicity is measured by the average Kendall τ distance of source-target alignments [Birch and Osborne, 2010] (a computation sketch follows below):

                 en2fr                       en2de
                 natural  backtrans-nmt     natural  backtrans-nmt
                 0.048    0.018             0.068    0.053

We selected 10M words from natural, either randomly or according to the Kendall τ distance, then fine-tuned on the result:

            test-07              test-08              newstest-14
            BLEU  BEER  CTER     BLEU  BEER  CTER     BLEU  BEER  CTER
random      32.08 62.98 50.78    32.66 62.86 49.99    23.05 55.38 58.51
monotonic   33.52 63.75 49.51    33.73 63.59 48.91    32.16 61.75 48.64

⇒ Monotonic BTs help NMT.
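For concreteness, a minimal sketch of the monotonicity measure: the Kendall τ distance counts the fraction of source-position pairs whose aligned target positions are out of order. This assumes one-to-one alignments; the exact normalisation in Birch and Osborne [2010] may differ.

```python
from itertools import combinations

def kendall_tau_distance(alignment):
    """Fraction of discordant pairs in a word alignment, given as the target
    positions a_1..a_n of source positions 1..n.
    0 = fully monotonic; larger = more reordering."""
    pairs = list(combinations(range(len(alignment)), 2))
    if not pairs:
        return 0.0
    discordant = sum(alignment[i] > alignment[j] for i, j in pairs)
    return discordant / len(pairs)

print(kendall_tau_distance([0, 1, 2, 3]))   # 0.0   (monotonic)
print(kendall_tau_distance([3, 2, 1, 0]))   # 1.0   (fully inverted)
print(kendall_tau_distance([0, 2, 1, 3]))   # ~0.17 (one local swap)
```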
Pseudo-back translations
The poor man's way: simulate parallel data
BT assumes:
a. monolingual data
b. an MT engine translating "backwards" (from target to source)
c. (lots of) compute power

What can we do with fewer resources?
4 cheap ways to generate parallel data: stupid BT

copy: recopy the target onto the source
  e (True English)  How useful are fake translations?
  f (Fake French)   How useful are fake translations?

copy+mark: copies carry a language id
  f (Fake French)   @fr@How @fr@useful @fr@are @fr@fake @fr@translations?

copy+mark+noise: add noise (deletions, swaps, etc.)
  f (Fake French)   @fr@useful @fr@How @fr@fake @fr@are @fr@translations?

copy-dummy: replace everything with a dummy symbol
  f (Fake French)   dummy dummy dummy dummy dummy?

(A generation sketch follows below.)
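A sketch of the four generators; the noise here is a single local token swap, whereas the actual recipe also uses deletions and other perturbations.

```python
import random

def make_pseudo_source(target: str, mode: str, rng=random.Random(0)) -> str:
    """Build a fake 'French' source from a true English target sentence."""
    words = target.split()
    if mode == "copy":                      # recopy the target onto the source
        return target
    if mode == "copy+mark":                 # tag each copied token with a language id
        return " ".join("@fr@" + w for w in words)
    if mode == "copy+mark+noise":           # marked copy plus a little noise
        words = ["@fr@" + w for w in words]
        if len(words) > 1:                  # one local swap stands in for the
            i = rng.randrange(len(words) - 1)   # full deletion/swap noise
            words[i], words[i + 1] = words[i + 1], words[i]
        return " ".join(words)
    if mode == "copy-dummy":                # wipe all lexical content
        return " ".join("dummy" for _ in words)
    raise ValueError(mode)

e = "How useful are fake translations?"
for m in ("copy", "copy+mark", "copy+mark+noise", "copy-dummy"):
    print(m, "->", make_pseudo_source(e, m))
```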
Stupid Backtranslation not so stupid
English→French:

                  test-07              test-08              newstest-14
                  BLEU  BEER  CTER     BLEU  BEER  CTER     BLEU  BEER  CTER
Baseline          31.25 62.14 51.89    32.17 62.35 50.79    33.06 61.97 48.56
copy-dummies      30.89 62.06 52.07    31.51 61.98 51.46    31.43 60.92 50.58
copy              31.65 62.45 52.09    32.23 62.37 52.20    32.80 61.99 49.05
copy+mark         32.01 62.66 51.57    32.31 62.52 51.46    32.33 61.55 49.44
copy+mark+noise   31.87 62.52 52.69    32.64 62.55 51.63    33.04 62.11 48.47

English→German:

                  test-07              test-08              newstest-14
                  BLEU  BEER  CTER     BLEU  BEER  CTER     BLEU  BEER  CTER
Baseline          21.36 57.08 63.32    21.27 57.11 60.67    22.49 57.79 55.64
copy-dummies      21.73 57.84 61.35    21.38 57.38 60.10    21.12 56.81 57.21
copy              22.15 57.95 61.49    21.95 57.72 59.58    22.59 57.83 55.44
copy+mark         22.58 58.23 61.10    22.47 57.97 59.24    22.53 57.54 55.85
copy+mark+noise   22.92 58.62 60.27    22.83 58.36 58.48    22.34 57.47 55.72
Stupid BT is almost as good as smart BT (for German, where BT quality is low).
Adversarial training
The smart poor man's way: simulate credible parallel data
Stupid BT is cheap and almost as good as using LMs: it mostly trains the decoder.

Can we do even better by also training the rest of the system?
Towards better fake sources
Fake sources in a GAN setup: copy-marked contains a fake source; let's make it look like a real source.

Two encoders: the MT encoder E(x) and a pseudo-source encoder G(x′). The discriminator D is optimized to distinguish the two kinds of sources:

J(D) = −(1/2) E_{x∼p_real}[log D(E(x))] − (1/2) E_{x′∼p_pseudo}[log(1 − D(G(x′)))]

G is trained to fool the discriminator D:

J(G) = −E_{x′∼p_pseudo}[log D(G(x′))]
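The two objectives in code, as a PyTorch sketch with toy linear stand-ins for the three networks; the real E, G, and D operate on sequences of encoder states rather than single vectors.

```python
import torch

def d_loss(D, E, G, x_real, x_pseudo):
    # J(D): the discriminator learns to output 1 on encoded real sources
    # and 0 on encoded pseudo-sources.
    real = D(E(x_real))
    fake = D(G(x_pseudo))
    return -0.5 * (torch.log(real).mean() + torch.log(1 - fake).mean())

def g_loss(D, G, x_pseudo):
    # J(G): the pseudo-source encoder tries to make D say "real".
    return -torch.log(D(G(x_pseudo))).mean()

# Toy stand-ins: linear encoders and a sigmoid discriminator.
p = 16
E = torch.nn.Linear(p, p)
G = torch.nn.Linear(p, p)
D = torch.nn.Sequential(torch.nn.Linear(p, 1), torch.nn.Sigmoid())

x_real, x_pseudo = torch.randn(8, p), torch.randn(8, p)
print(d_loss(D, E, G, x_real, x_pseudo).item(), g_loss(D, G, x_pseudo).item())
```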
GAN Results
English→French:

                  test-07              test-08              newstest-14
                  BLEU  BEER  CTER     BLEU  BEER  CTER     BLEU  BEER  CTER
Baseline          31.25 62.14 51.89    32.17 62.35 50.79    33.06 61.97 48.56
copy-mark         32.01 62.66 51.57    32.31 62.52 51.46    32.33 61.55 49.44
 + GANs           31.95 62.55 52.87    32.24 62.47 52.16    32.86 61.90 48.97
copy-mark+noise   31.87 62.52 52.69    32.64 62.55 51.63    33.04 62.11 48.47
 + GANs           32.41 62.78 52.25    32.79 62.72 50.92    33.01 61.98 48.37
backtrans-nmt     33.30 63.33 50.02    33.39 63.09 49.48    34.11 62.76 46.94
 + GANs           32.91 63.08 51.17    33.24 62.93 50.82    33.77 62.42 47.80
natural           35.10 64.71 48.33    35.29 64.52 48.26    34.96 63.08 46.67

English→German:

                  test-07              test-08              newstest-14
                  BLEU  BEER  CTER     BLEU  BEER  CTER     BLEU  BEER  CTER
Baseline          21.36 57.08 63.32    21.27 57.11 60.67    22.49 57.79 55.64
copy-mark         22.58 58.23 61.10    22.47 57.97 59.24    22.53 57.54 55.85
 + GANs           22.71 58.25 61.25    22.44 57.86 59.28    22.81 57.54 55.99
copy-mark+noise   22.92 58.62 60.27    22.83 58.36 58.48    22.34 57.47 55.72
 + GANs           23.01 58.66 60.22    22.53 58.16 58.65    22.64 57.70 55.48
backtrans-nmt     23.00 59.12 58.31    23.10 58.85 56.67    22.91 58.12 54.67
 + GANs           23.65 58.85 59.70    23.20 58.50 58.22    23.00 57.89 55.15
natural           26.74 61.14 56.19    26.16 60.64 54.76    23.84 58.64 54.23
GANs provide a small additional boost
Conclusions
Conclusion
BT is a very effective method to integrate monolingual data
BT simultaneously improves all the components of NMT
Artificial sources are lexically and syntactically simpler than natural sources; sampling brings diversity, and monotonicity is a facilitating factor
The quality of BT matters for NMT, and BT is only worth its cost when high-quality back-translations can be generated
GANs can help by making the pseudo-source sentences closer to natural ones.
Meta-Conclusion
NMT is improving; it yields useful translations for many language pairs
NMT research is empirical and burns a lot of CPUs / GPUs
Many conclusions are unstable: there are so many variables to control
MT is not solved: many open avenues and problems remain, architectural, theoretical, data-based, and more
Thank you for your attention!
Franck Burlot, François Yvon
References

Nicola Bertoldi and Marcello Federico. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 182-189, Athens, Greece, 2009. URL http://www.aclweb.org/anthology/W09-0432.

Alexandra Birch and Miles Osborne. LRscore for evaluating lexical and reordering quality in MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT '10, pages 327-332, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. ISBN 978-1-932432-71-8. URL http://dl.acm.org/citation.cfm?id=1868850.1868899.

Ondrej Bojar and Aleš Tamchyna. Improving translation model by monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11, pages 330-336. Association for Computational Linguistics, 2011. URL http://dl.acm.org/citation.cfm?id=2132960.2133004.

Franck Burlot and François Yvon. Using monolingual data in neural machine translation: a systematic study. In Proceedings of the Third Conference on Machine Translation, pages 144-155, Brussels, Belgium, October 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W18-64015.

Ryan Cotterell and Julia Kreutzer. Explaining and generalizing back-translation through wake-sleep, 2018.

Josep Maria Crego and Jean Senellart. Neural machine translation from simplified translations. CoRR, abs/1612.06139, 2016. URL http://arxiv.org/abs/1612.06139.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489-500. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1045.

Marzieh Fadaee and Christof Monz. Back-translation sampling by targeting difficult words in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 436-446. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1040.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. On integrating a language model into neural machine translation. Computer Speech and Language, 45(C):137-148, September 2017. ISSN 0885-2308. doi: 10.1016/j.csl.2017.01.014. URL https://doi.org/10.1016/j.csl.2017.01.014.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039-5049. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1549.

Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. Investigating backtranslation in neural machine translation. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, EAMT, Alicante, Spain, 28-30 May 2018.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86-96, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P16-1009.

Felix Stahlberg, James Cross, and Veselin Stoyanov. Simple fusion: Return of the language model. In Proceedings of the Third Conference on Machine Translation, pages 204-211, Brussels, Belgium, October 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W18-64021.