
Date post: 20-May-2020
What can Statistical Machine Translation teach Neural Text Generation about Optimization? Graham Neubig @ NAACL Workshop on Methods for Optimizing and Evaluating Neural Language Generation 6/6/2019
Transcript
Page 1:

What can Statistical Machine Translation teach Neural Text Generation about Optimization?

Graham Neubig @ NAACL Workshop on Methods for Optimizing and Evaluating Neural Language Generation

6/6/2019

Page 2:

How to Optimize your Neural Generation System towards your Evaluation Function

or

Page 3:

...

Neubig & Watanabe, Computational Linguistics (2016)

Page 4:

Then: Symbolic Translation Models

Example: "kono eiga ga kirai" → "I hate this movie"

• First step: learn component models to maximize likelihood

• Translation model P(y|x) -- e.g. P( movie | eiga )
• Language model P(Y) -- e.g. P( hate | I )
• Reordering model -- e.g. P( <swap> | eiga, ga kirai )
• Length model P(|Y|) -- e.g. a word penalty for each word added

• Second step: learning log-linear combination to maximize translation accuracy [Och 2004]

Minimum Error Rate Training in Statistical Machine Translation (Och 2004)

$\log P(Y \mid X) = \sum_i \lambda_i \phi_i(X, Y) - \log Z$
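To make the log-linear combination concrete, here is a minimal Python sketch; the feature names, weights, and candidate set are illustrative assumptions, not from the talk:

```python
import math

def score(features, weights):
    """Unnormalized log-score: sum_i lambda_i * phi_i(X, Y)."""
    return sum(weights[name] * value for name, value in features.items())

def log_prob(hyp, candidates, weights):
    """log P(Y|X) = score(Y) - log Z, normalizing over a candidate set."""
    log_z = math.log(sum(math.exp(score(c, weights)) for c in candidates))
    return score(hyp, weights) - log_z
```

MERT keeps the component scores phi_i fixed and tunes only the handful of weights lambda_i to maximize translation accuracy.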

Page 5:

Now: Auto-regressive Neural Networks

[Diagram: an encoder reads "kono eiga ga kirai" </s>; a decoder then generates "I hate this movie" </s>, feeding each predicted word back in as the next input.]

• All parameters trained end-to-end, usually to maximize likelihood (not accuracy!)

Page 6:

Standard MT System Training/Decoding

Page 7:

Decoder Structure

[Diagram: the encoder output feeds a decoder that, at each step, runs a classifier to predict the next word ("I", "hate", "this", "movie", </s>) and feeds it back in as the next input.]

$P(E \mid F) = \prod_{t=1}^{T} P(e_t \mid F, e_1, \ldots, e_{t-1})$

Page 8:

Maximum Likelihood Training

• Maximize the likelihood of predicting the next word in the reference given the previous words

$\ell(E \mid F) = -\log P(E \mid F) = -\sum_{t=1}^{T} \log P(e_t \mid F, e_1, \ldots, e_{t-1})$

• Also called "teacher forcing"
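The loss above can be sketched with a toy bigram table standing in for the model's conditional distribution (the table values are made up for illustration):

```python
import math

# Toy stand-in for P(e_t | e_{<t}); a real model would condition on F too.
BIGRAM = {
    ("<s>", "I"): 0.6, ("I", "hate"): 0.5,
    ("hate", "this"): 0.4, ("this", "movie"): 0.7,
}

def teacher_forcing_nll(reference):
    """l(E|F) = -sum_t log P(e_t | e_{<t}), always feeding the gold prefix."""
    nll, prev = 0.0, "<s>"
    for word in reference:
        nll -= math.log(BIGRAM.get((prev, word), 1e-6))
        prev = word  # the *reference* word, never the model's own prediction
    return nll
```

The key property of teacher forcing is in the last loop line: the next prediction is conditioned on the reference word, never on what the model itself produced.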

Page 9:

Problem 1: Exposure Bias

• Teacher forcing assumes we feed in the correct previous word, but at test time we may make mistakes that propagate

• Exposure bias: the model is never exposed to its own mistakes during training, and so cannot deal with them at test time

• Really important! One main source of commonly witnessed phenomena such as repeating.

[Diagram: a decoder that has drifted off the gold prefix gets stuck emitting "I" at every step, each wrong output fed back as the next input.]

Page 10:

Problem 2: Disregard for Evaluation Metrics

• In the end, we want good translations

• Good translations can be measured with metrics, e.g. BLEU or METEOR

• Really important! Causes systematic problems:

• Hypothesis-reference length mismatch

• Dropped/repeated content

Page 11:

A Clear Example

• My (winning) submission to the Workshop on Asian Translation 2016 [Neubig 16]

[Bar charts: BLEU (roughly 23-27) and length ratio (roughly 80-100) for the MLE, MLE+Length, and MinRisk systems.]

• Just training for (sentence-level) BLEU largely fixes length problems, and does much better than heuristics

Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016 (Neubig 16)

Page 12:

Error and Risk

Page 13:

Error

• Generate a translation

• Calculate its "badness" (e.g. 1-BLEU, 1-METEOR)

• We would like to minimize error

• Problem: argmax is not differentiable, and thus not conducive to gradient-based optimization

$\hat{E} = \operatorname{argmax}_E P(E \mid F)$

$\operatorname{error}(E, \hat{E}) = 1 - \operatorname{BLEU}(E, \hat{E})$

Page 14:

In Phrase-based MT: Minimum Error Rate Training

• A clever trick for gradient-free optimization of linear models

• Pick a single direction in feature space

• Exactly calculate the loss surface in this direction only (over an n-best list for every input sentence)

[Figure: MERT line search. Candidate hypotheses with feature values (φ1, φ2, φ3) and error:

F1 candidates: E1,1 = (1, 0, -1; err 0.6), E1,2 = (0, 1, 0; err 0), E1,3 = (1, 0, 1; err 1)
F2 candidates: E2,1 = (1, 0, -2; err 0.8), E2,2 = (3, 0, 1; err 0.3), E2,3 = (3, 1, 2; err 0)

Panels (a)-(d): starting from λ = (-1, 1, 0) and searching along direction d = (0, 0, 1), plot each candidate's score along the line; derive the per-sentence error surfaces for F1 and F2; sum them into the total error; and take the minimizing step α = 1.25, giving λ = (-1, 1, 1.25).]

Page 15:

A Smooth Approximation: Risk [Smith+ 2006, Shen+ 2015]

• Risk is defined as the expected error

$\operatorname{risk}(F, E, \theta) = \sum_{\hat{E}} P(\hat{E} \mid F; \theta) \operatorname{error}(E, \hat{E})$

• This includes the probability in the objective function -> differentiable!

Minimum Risk Annealing for Training Log-Linear Models (Smith and Eisner 2006) Minimum risk training for neural machine translation (Shen et al. 2015)

Page 16:

Sub-sampling

• Create a small sample of sentences (5-50), and calculate risk over that

• Samples can be created using random sampling or n-best search

• If random sampling, make sure to deduplicate

$\operatorname{risk}(F, E, S) = \sum_{\hat{E} \in S} \frac{P(\hat{E} \mid F)}{Z} \operatorname{error}(E, \hat{E})$
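The sub-sampled risk can be sketched as follows; `prob` and `error_fn` are stand-ins for the model probability P(Ê|F) and an error such as 1-BLEU:

```python
def sampled_risk(samples, prob, error_fn):
    """Expected error over a deduplicated sample, renormalized by Z."""
    uniq = list(dict.fromkeys(samples))   # deduplicate repeated random samples
    z = sum(prob(h) for h in uniq)        # Z: probability mass of the sample
    return sum(prob(h) / z * error_fn(h) for h in uniq)
```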

Page 17:

Policy Gradient/REINFORCE

• An alternative way of maximizing expected reward / minimizing risk

• Outputs that get a bigger reward will get a higher weight

• Can show this converges to minimum-risk solution

$\ell_{\text{reinforce}}(X, Y) = -R(\hat{Y}, Y) \log P(\hat{Y} \mid X)$
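For a single categorical output, the gradient of this loss with respect to softmax logits has a closed form; a small sketch, using d log p_i / d logit_j = 1[i=j] - p_j:

```python
import math

def reinforce_grad(logits, sampled, reward):
    """Gradient of -R * log softmax(logits)[sampled] w.r.t. the logits."""
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    grad = [reward * p for p in probs]  # the R * p_j term
    grad[sampled] -= reward             # minus R at the sampled index
    return grad
```

A larger reward scales the whole gradient up, which is exactly the "bigger reward gets a higher weight" behavior described above.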

Page 18:

But Wait, why is Everyone Using MLE for NMT?

Page 19:

When Training goes Bad...

Chances are, this is you 😔

Minimum risk training for neural machine translation (Shen et al. 2015)

Page 20:

It Happens to the Best of Us

• Email from a famous MT researcher: "we also re-implemented MRT, but so far, training has been very unstable, and after improving for a bit, our models develop a bias towards producing ever-shorter translations..."

Page 21:

My Current Recipe for Stabilizing MRT/Reinforcement Learning

Page 22:

Warm-start

• Start training with maximum likelihood, then switch over to REINFORCE

• Works only in scenarios where we can run MLE (not with latent variables or in standard RL settings)

• MIXER (Ranzato et al. 2016) gradually transitions from MLE to the full objective
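MIXER itself anneals on a per-token basis, training a shrinking prefix with MLE and the rest with REINFORCE; as a simplified sketch of the same idea, one can interpolate the two losses over training steps. The schedule below is a hypothetical simplification, not Ranzato et al.'s exact method:

```python
def mixed_loss(mle_loss, rl_loss, step, switch_steps):
    """Linearly shift weight from the MLE loss to the RL loss over training."""
    alpha = min(1.0, step / switch_steps)
    return (1.0 - alpha) * mle_loss + alpha * rl_loss
```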

Page 23:

Adding a Baseline

• Basic idea: we have expectations about our reward for a particular sentence

  Sentence                       Reward   Baseline   R-B
  "This is an easy sentence"     0.8      0.95       -0.15
  "Buffalo Buffalo Buffalo"      0.3      0.1        0.2

• We can instead weight our likelihood by R-B to reflect when we did better or worse than expected

$\ell_{\text{baseline}}(X) = -(R(\hat{Y}, Y) - B(Y)) \log P(\hat{Y} \mid X)$
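A minimal sketch of the baseline-corrected loss, paired with one common choice of baseline, an exponential running average of recent rewards (the decay constant is an illustrative assumption):

```python
class RunningBaseline:
    """Exponential running average of rewards, one simple choice of B."""
    def __init__(self, decay=0.95):
        self.value, self.decay = 0.0, decay

    def update(self, reward):
        self.value = self.decay * self.value + (1 - self.decay) * reward
        return self.value

def baseline_loss(log_p, reward, baseline):
    """-(R - B) * log P: samples that beat the baseline are reinforced."""
    return -(reward - baseline) * log_p
```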

Page 24:

Increasing Batch Size

• If we use a single sentence, high variance

• Solution: increase the number of examples (roll-outs) done before an update to stabilize

Page 25:

Adding Temperature

• Temperature adjusts the peakiness of the distribution

• With a small sample, setting temperature > 1 accounts for unsampled hypotheses that should be in the denominator

$\operatorname{risk}(F, E, \theta, \tau, S) = \sum_{\hat{E} \in S} \frac{P(\hat{E} \mid F; \theta)^{1/\tau}}{Z} \operatorname{error}(E, \hat{E})$

[Plots: the distribution at temperatures τ = 1, 0.5, 0.25, 0.05, growing progressively peakier as τ decreases.]
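The effect of the temperature on the distribution can be sketched with a temperature-scaled softmax:

```python
import math

def temperature_softmax(logits, tau):
    """tau > 1 flattens the distribution; tau < 1 sharpens it."""
    exps = [math.exp(l / tau) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```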

Page 26:

Contrasting Phrase-based SMT and NMT

Page 27:

Phrase-based SMT MERT and NMT MinRisk/REINFORCE

                        NMT+MinRisk      PBMT+MERT
  Model                 NMT              PBMT
  Optimized Parameters  Millions         5-30 log-linear weights (others MLE)
  Objective             Risk             Error
  Metric Granularity    Sentence-level   Corpus-level
  n-best Lists          Re-generated     Accumulated

Page 28:

Optimized Parameters

• Maybe we can optimize only some parts of the model? Freezing Subnetworks to Analyze Domain Adaptation in NMT. Thompson et al. 2018.

• Maybe we can express models as a linear combination of a few hyper-parameters? Contextualized Parameter Generation for Universal NMT. Platanios et al. 2018.

• Can we reduce the number of parameters optimized for NMT?

$W = \sum_i \alpha_i W_i$
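A sketch of the idea behind the equation: express a parameter matrix as a mixture of a few fixed basis matrices, so that only the mixing coefficients alpha_i need optimizing (the function and data layout are illustrative assumptions):

```python
def combine(alphas, bases):
    """W = sum_i alpha_i * W_i over nested-list matrices."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [[sum(a * W[r][c] for a, W in zip(alphas, bases))
             for c in range(cols)]
            for r in range(rows)]
```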

Page 29:

Objective

• Can we move closer to minimizing error, which is what we want to do in the first place?

• Maybe we can gradually anneal the temperature to move towards a peakier distribution? Minimum risk annealing for training log-linear models. Smith and Eisner 2006.

[Plots: training progression, annealing the temperature from τ = 1 through 0.5 and 0.25 to 0.05, so the distribution grows progressively peakier over training.]

Page 30:

Metric

• We have lots of metrics! BLEU, METEOR, ROUGE, CIDEr

• Depending on the metric you optimize, results differ. The Best Lexical Metric for Phrase-Based Statistical MT System Optimization. Cer et al. 2010.

• Maybe a metric that considers semantic roles? MEANT: an inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames. Lo and Wu, 2011.

• New! Optimizing towards neural semantic similarity measures improves MT: Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. Wieting et al. 2019.

Page 31:

Metric Granularity

• Two ways of measuring metrics:

• Sentence-level: measure sentence-by-sentence, then average

• Corpus-level: sum sufficient statistics, then calculate the score

• Regular BLEU is corpus-level, but mini-batch NMT optimization algorithms calculate it at the sentence level

• This causes problems, e.g. in sentence length! Optimizing for sentence-level BLEU+1 yields short translations. Nakov et al. 2012.

• Maybe we can keep a running average of the sufficient statistics to approximate corpus BLEU? Online large-margin training of syntactic and structural translation features. Chiang et al. 2008.
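The running-statistics idea can be sketched by pooling counts across updates; here unigram match counts stand in for the full set of BLEU n-gram sufficient statistics:

```python
from collections import Counter

class RunningStats:
    """Pool match/length counts so the score is corpus-level, not averaged."""
    def __init__(self):
        self.match, self.hyp_len, self.ref_len = 0, 0, 0

    def update(self, hyp, ref):
        # Multiset intersection counts clipped unigram matches.
        self.match += sum((Counter(hyp) & Counter(ref)).values())
        self.hyp_len += len(hyp)
        self.ref_len += len(ref)

    def precision(self):
        return self.match / max(self.hyp_len, 1)
```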

Page 32:

N-best Lists

• In MERT for PBMT, we would accumulate n-best lists across epochs:

  Epoch 1: n-best 1
  Epoch 2: n-best 1 + new n-best 2
  Epoch 3: n-best 1 + new n-best 2 + new n-best 3

• Greatly stabilizes training! Even if model learns horrible parameters, it still has good hypotheses from which to recover.

• Maybe we could do the same for NMT? Analogous to experience replay in RL: Self-improving reactive agents based on reinforcement learning, planning and teaching. Lin 1992. Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing. Liang et al. 2018.
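Accumulating n-best lists across epochs amounts to keeping a growing union of hypotheses, MERT-style; a minimal sketch:

```python
def merge_nbest(accumulated, new_nbest):
    """Union of all hypotheses seen so far, preserving first-seen order."""
    seen = set(accumulated)
    return accumulated + [h for h in new_nbest if h not in seen]
```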

Page 33:

Summary

Page 34:

Summary

• Neural MT has come a long way, and we can optimize for accuracy

• This is important: it fixes lots of problems that we'd otherwise use heuristic hacks for

• But no-one does it... problems of stability and speed.

• Still lots to remember from the past!

Optimization for Statistical Machine Translation, a Survey (Neubig and Watanabe 2016)

Thanks! Questions?

