Neural Machine Translation
Wei Xu (many slides from Greg Durrett and Emma Strubell)
Recap: Phrase-Based MT
Unlabeled English data
cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
Language model P(e)
Phrase table P(f|e)
P(e|f) \propto P(f|e) P(e)
Noisy channel model: combine scores from translation model + language model to translate foreign to English
“Translate faithfully but make fluent English”
Recap: HMM for Alignment
Brown et al. (1993)
[Figure: word alignment example; e = "Thank you, I shall do so gladly."; f = "Gracias, lo haré de muy buen grado."; alignment a = 0 2 6 5 7 7 7 7 8]
‣ Sequential dependence between a's to capture monotonicity
‣ Alignment distribution parameterized by jump size [figure: histogram over jump sizes -2 -1 0 1 2 3]
‣ P(f_i | e_{a_i}): word translation table
‣ Want local monotonicity: most jumps are small
‣ HMM model (Vogel, 1996)
‣ Re-estimate using the forward-backward algorithm
P(f, a | e) = \prod_{i=1}^{n} P(f_i | e_{a_i}) \, P(a_i | a_{i-1})
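A minimal Python sketch of scoring a single alignment under this model; trans_table and jump_prob are hypothetical toy parameter dictionaries (in practice they are re-estimated with the forward-backward algorithm):

import numpy as np

# Minimal sketch: score one alignment a under
# P(f, a | e) = prod_i P(f_i | e_{a_i}) * P(a_i | a_{i-1}).
# trans_table and jump_prob are hypothetical toy parameters.
def log_prob_f_a_given_e(f, e, a, trans_table, jump_prob):
    logp, prev = 0.0, 0                                  # assume a dummy start position 0
    for i, f_word in enumerate(f):
        logp += np.log(trans_table[(f_word, e[a[i]])])   # word translation table P(f_i | e_{a_i})
        logp += np.log(jump_prob[a[i] - prev])           # transition parameterized by jump size
        prev = a[i]
    return logp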
Recap: Beam Search for Decoding
Recall: Decoding
[Figure: beam search lattice over source positions 1-9, with partial hypotheses such as "Mary not" (idx=2, score -1.2), "Mary no" (idx=2, score -2.9), "…did not" (idx=2), "…not give" (idx=3), "…not slap" (idx=5, idx=6)]
‣ Scores from language model P(e) + translation model P(f|e)
Recap: Evaluating MT
‣ Fluency: does it sound good in the target language?
‣ Fidelity/adequacy: does it capture the meaning of the original?
‣ BLEU score: geometric mean of 1-, 2-, 3-, and 4-gram precision vs. a reference, multiplied by brevity penalty (see the sketch below)
‣ Typically n = 4, w_i = 1/4
‣ r = length of reference, c = length of prediction
‣ Does this capture fluency and adequacy?
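A minimal sketch of this BLEU computation (single reference, no smoothing), assuming the standard formulation BLEU = BP * exp(sum_n w_n log p_n) with BP = min(1, exp(1 - r/c)):

import math
from collections import Counter

def bleu(prediction, reference, n=4):
    weights = [1.0 / n] * n
    log_precisions = []
    for k in range(1, n + 1):
        pred_ngrams = Counter(zip(*[prediction[i:] for i in range(k)]))
        ref_ngrams = Counter(zip(*[reference[i:] for i in range(k)]))
        overlap = sum((pred_ngrams & ref_ngrams).values())   # clipped n-gram matches
        total = max(1, sum(pred_ngrams.values()))
        if overlap == 0:
            return 0.0   # any zero n-gram precision zeroes BLEU (real systems smooth this)
        log_precisions.append(math.log(overlap / total))
    c, r = len(prediction), len(reference)
    brevity_penalty = min(1.0, math.exp(1.0 - r / c))
    return brevity_penalty * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

print(bleu("the cat sat on the mat".split(), "the cat sat on a mat".split()))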
Recap: seq2seq Models
‣ Encoder: consumes sequence of tokens, produces a vector. Analogous to encoders for classification/tagging tasks
‣ Decoder: separate module, single cell. Takes two inputs: hidden state (vector h or tuple (h, c)) and previous token. Outputs token + new state
[Figure: encoder reads "the movie was great"; decoder starts from <s> and emits "le film …"]
Recap: seq2seq Models
‣ Generate next word conditioned on previous word as well as hidden state
‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary
‣ Decoder has separate parameters from encoder, so this can learn to be a language model (produce a plausible next word given current one)
[Figure: encoder over "the movie was great"; decoder hidden state h̄ at <s>]
P(y | x) = \prod_{i=1}^{n} P(y_i | x, y_1, \ldots, y_{i-1})
P(y_i | x, y_1, \ldots, y_{i-1}) = \mathrm{softmax}(W \bar{h})
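A minimal sketch of one decoder output step with hypothetical toy dimensions: the current decoder hidden state h̄ is mapped by W to a distribution over the entire vocabulary.

import numpy as np

def decoder_step(W, h_bar):
    logits = W @ h_bar                     # W: |vocab| x |hidden|, h_bar: |hidden|
    probs = np.exp(logits - logits.max())  # numerically stable softmax over the vocabulary
    return probs / probs.sum()

vocab_size, hidden = 6, 4
W = np.random.randn(vocab_size, hidden)
h_bar = np.random.randn(hidden)
print(decoder_step(W, h_bar))              # distribution over the 6-word toy vocabulary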
Recap: Beam Search for Decoding
‣ Maintain decoder state, token history in beam (see the sketch below)
[Figure: beam expansion for source "the movie was great": from <s>, candidates la (0.4), le (0.3), les (0.1); expanding "la" with film (0.4) scores log(0.4)+log(0.4); expanding "le" with film (0.8) scores log(0.3)+log(0.8); hypotheses "le film", "la film", …]
‣ Keep both "film" states! Hidden state vectors are different
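A minimal beam search sketch matching the bullet above: each beam entry keeps its log probability, its token history, and its own decoder hidden state, so two hypotheses ending in the same word still carry different states. step is a hypothetical function returning a next-token distribution and a new state.

import numpy as np

def beam_search(step, start_state, start_token, beam_size=2, max_len=10, eos=1):
    beam = [(0.0, [start_token], start_state)]
    for _ in range(max_len):
        candidates = []
        for logp, tokens, state in beam:
            if tokens[-1] == eos:
                candidates.append((logp, tokens, state))   # keep finished hypotheses
                continue
            probs, new_state = step(state, tokens[-1])      # distribution over next tokens
            for tok in np.argsort(-probs)[:beam_size]:
                candidates.append((logp + np.log(probs[tok]), tokens + [int(tok)], new_state))
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_size]
    return beam[0][1]   # best-scoring token history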
Recap: NL-to-SQL Generation
‣ Convert natural language description into a SQL query against some DB
‣ How to ensure that well-formed SQL is generated?
Zhong et al. (2017)
‣ Three components
‣ How to capture column names + constants?
‣ Pointer mechanisms
Fairseq
Neural MT Details
Encoder-Decoder MT
Sutskever et al. (2014)
‣ Sutskever seq2seq paper: first major application of LSTMs to NLP
‣ Basic encoder-decoder with beam search
‣ SOTA = 37.0; not all that competitive…
Encoder-Decoder MT
‣ Better model from seq2seq lectures: encoder-decoder with attention and copying for rare words
[Figure: encoder states h1 … h4 over "the movie was great"; decoder state h̄1 at <s> attends over them to form context c1; the output is a distribution over vocab + copying, emitting "le" …]
e_{ij} = f(\bar{h}_i, h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}
c_i = \sum_j \alpha_{ij} h_j
P(y_i | x, y_1, \ldots, y_{i-1}) = \mathrm{softmax}(W [c_i; \bar{h}_i])
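A minimal sketch of the attention computation above, using a dot-product scoring function f and toy dimensions; the copying part is omitted.

import numpy as np

def attention_context(h_bar_i, H):
    """h_bar_i: decoder state (d,); H: encoder states (n, d)."""
    e = H @ h_bar_i               # scores e_ij for each encoder position j
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()          # attention weights over source positions
    c_i = alpha @ H               # context vector: weighted sum of encoder states
    return c_i, alpha

H = np.random.randn(4, 8)         # "the movie was great" -> 4 encoder states
h_bar = np.random.randn(8)
c, alpha = attention_context(h_bar, H)
# The output layer would then apply softmax(W @ np.concatenate([c, h_bar])).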
Results: WMT English-French
‣ 12M sentence pairs
Classic phrase-based system: ~33 BLEU, uses additional target-language data
Rerank with LSTMs: 36.5 BLEU (long line of work here; Devlin+ 2014)
Sutskever+ (2014) seq2seq single: 30.6 BLEU
Sutskever+ (2014) seq2seq ensemble: 34.8 BLEU
Luong+ (2015) seq2seq ensemble with attention and rare word handling: 37.5 BLEU
‣ But English-French is a really easy language pair and there's tons of data for it! Does this approach work for anything harder?
Results: WMT English-German
‣ 4.5M sentence pairs
Classic phrase-based system: 20.7 BLEU
Luong+ (2014) seq2seq: 14 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
‣ Not nearly as good in absolute BLEU, but not really comparable across languages
‣ French, Spanish = easiest; German, Czech = harder; Japanese, Russian = hard (grammatically different, lots of morphology…)
MT Examples
Luong et al. (2015)
‣ NMT systems can hallucinate words, especially when not using attention; phrase-based doesn't do this
‣ best = with attention, base = no attention
MT Examples
Luong et al. (2015)
‣ best = with attention, base = no attention
Zhang et al. (2017)
‣ NMT can repeat itself if it gets confused (pH or pH)
‣ Phrase-based MT often gets chunks right, may have more subtle ungrammaticalities
MT Examples
Handling Rare Words
‣ Words are a difficult unit to work with: copying can be cumbersome, word vocabularies get very large
‣ Character-level models don't work well
Input:_the_ecotax_portico_in_Pont-de-Buis…
Output:_le_portique_écotaxe_de_Pont-de-Buis
‣ Solution: "word pieces" (which may be full words but may be subwords)
‣ Can help with transliteration; captures shared linguistic characteristics between languages (e.g., transliteration, shared word root, etc.)
Wu et al. (2016)
Byte Pair Encoding (BPE)
Sennrich et al. (2016)
‣ Start with every individual byte (basically character) as its own symbol
‣ Count bigram character cooccurrences
‣ Merge the most frequent pair of adjacent characters (see the sketch below)
‣ Do this either over your vocabulary (original version) or over a large corpus (more common version)
‣ Final vocabulary size is often in the 10k-30k range for each language
‣ Most SOTA NMT systems use this on both source + target
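A minimal sketch of BPE merges in the style of Sennrich et al. (2016), run over a toy vocabulary with counts; each entry is a space-separated symbol sequence and the loop repeatedly merges the most frequent adjacent pair.

import re
from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace "a b" with "ab" wherever the pair appears as whole symbols
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: words start as individual characters plus an end-of-word marker
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                       # number of merges controls the final vocabulary size
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)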
Word Pieces
‣ SentencePiece library from Google: unigram LM
Schuster and Nakajima (2012), Wu et al. (2016), Kudo and Richardson (2018)
while voc size < target voc size:
    Build a language model over your corpus
    Merge pieces that lead to highest improvement in language model perplexity
‣ Issues: what LM to use? How to make this tractable?
‣ Result: way of segmenting input appropriate for translation (see the usage sketch below)
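A usage sketch with the SentencePiece library, assuming its Python API; corpus.txt is a placeholder file and the printed segmentation is only illustrative, since the actual output depends on the trained model.

import sentencepiece as spm

# Train a unigram-LM word-piece model (hypothetical corpus file and vocab size).
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='wp', vocab_size=8000, model_type='unigram')

sp = spm.SentencePieceProcessor(model_file='wp.model')
print(sp.encode('the ecotax portico in Pont-de-Buis', out_type=str))
# e.g. ['▁the', '▁eco', 'tax', '▁por', 'tico', ...] -- segmentation depends on the corpus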
Google's NMT System
Wu et al. (2016)
‣ 8-layer LSTM encoder-decoder with attention, word piece vocabulary of 8k-32k
Google's NMT System
Wu et al. (2016)
English-French:
Google's phrase-based system: 37.0 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 37.5 BLEU
Google's 32k word pieces: 38.95 BLEU
English-German:
Google's phrase-based system: 20.7 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
Google's 32k word pieces: 24.2 BLEU
Human Evaluation (En-Es)
Wu et al. (2016)
‣ Similar to human-level performance on English-Spanish
Google's NMT System
Wu et al. (2016)
‣ Gender is correct in GNMT but not in PBMT ("sled", "walker" examples)
‣ The right-most column shows the human ratings on a scale of 0 (complete nonsense) to 6 (perfect translation)
Backtranslation
‣ Statistical MT methods (e.g., phrase-based MT) used a bilingual corpus of sentences B = (S, T) and a large monolingual corpus T' to train a language model. Can neural MT do the same?
Sennrich et al. (2015)
‣ Approach 1: force the system to generate T' as targets from null inputs
  Training pairs: (s1, t1), (s2, t2), …, ([null], t'1), ([null], t'2), …
‣ Approach 2: generate synthetic sources with a T->S machine translation system (backtranslation); see the sketch below
  Training pairs: (s1, t1), (s2, t2), …, (MT(t'1), t'1), (MT(t'2), t'2), …
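A minimal sketch of the backtranslation setup in Approach 2; translate_t_to_s stands in for a hypothetical target-to-source MT system.

def backtranslate(bilingual_pairs, monolingual_targets, translate_t_to_s):
    # pair each monolingual target sentence with a synthetic (machine-translated) source
    synthetic = [(translate_t_to_s(t), t) for t in monolingual_targets]
    return bilingual_pairs + synthetic   # train the S->T system on real + synthetic pairs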
Backtranslation
Sennrich et al. (2015)
‣ parallel synth: backtranslate training data; makes additional noisy source sentences which could be useful
‣ Gigaword: large monolingual English corpus
Transformers for MT
Recall: Self-Attention
Vaswani et al. (2017)
the movie was great
‣ Each word forms a "query" which then computes attention over each word
‣ Multiple "heads" analogous to different convolutional filters. Use parameters W_k and V_k to get different attention values + transform vectors
[Figure: attention for word x_4 produces scalar weights; the output x'_4 is a sum of scalar-weighted vectors]
\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j)
x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j
\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j)
x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j
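A minimal numpy sketch of one self-attention head under the slide's formulation (per-head parameters W_k and V_k, toy dimensions), with multiple heads concatenated:

import numpy as np

def softmax(z):
    z = np.exp(z - z.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def self_attention_head(X, W_k, V_k):
    scores = X @ W_k @ X.T        # e_{k,i,j} = x_i^T W_k x_j for every pair (i, j)
    alpha = softmax(scores)       # row i: attention of word i over all words j
    return alpha @ (X @ V_k)      # x'_{k,i} = sum_j alpha_{k,i,j} V_k x_j

n, d, heads = 7, 16, 4            # e.g. 7 words: "Nobel committee awards Strickland who advanced optics"
X = np.random.randn(n, d)
outputs = [self_attention_head(X, np.random.randn(d, d), np.random.randn(d, d))
           for _ in range(heads)] # one transformed sequence per head
X_next = np.concatenate(outputs, axis=-1)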
[Figure: multi-head self-attention over "Nobel committee awards Strickland who advanced optics". At layer p, queries/keys/values (Q, K, V) yield an attention matrix per head (M_1 … M_H); a position-wise feed-forward network is then applied to each token to produce layer p+1. The full encoder stacks layers 1 … J of multi-head self-attention + feed forward.]
Transformers
Vaswani et al. (2017)
‣ Encoder and decoder are both transformers
‣ Decoder consumes the previously generated token (and attends to input), but has no recurrent state
‣ Many other details to get it to work: residual connections, layer normalization, positional encoding, optimizer with learning rate schedule, label smoothing…
Transformers
Vaswani et al. (2017)
‣ Augment word embedding with position embeddings; each dim is a sine/cosine wave of a different frequency. Closer points = higher dot products (see the sketch below)
‣ Works essentially as well as just encoding position as a one-hot vector
[Figure: word embeddings for "the movie was great" augmented with position embeddings emb(1) … emb(4)]
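A minimal sketch of sinusoidal position embeddings, each dimension a sine/cosine wave of a different frequency (toy sizes):

import numpy as np

def position_embeddings(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # positions 0 .. max_len-1
    dim = np.arange(0, d_model, 2)[None, :]
    freq = 1.0 / (10000 ** (dim / d_model))      # one frequency per dimension pair
    emb = np.zeros((max_len, d_model))
    emb[:, 0::2] = np.sin(pos * freq)
    emb[:, 1::2] = np.cos(pos * freq)
    return emb

emb = position_embeddings(max_len=4, d_model=8)  # emb(1) ... emb(4) for a 4-word input
print(emb @ emb.T)                               # nearby positions have higher dot products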
Residual Network
[Figure: residual connection; the layer applies a non-linearity f to the input x from the previous layer and passes f(x) + x to the next layer]
Transformers
Vaswani et al. (2017)
‣ Adam optimizer with varied learning rate over the course of training
‣ Linearly increase for warmup, then decay proportionally to the inverse square root of the step number (see the sketch below)
‣ This part is very important!
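A minimal sketch of this schedule as given in Vaswani et al. (2017): the rate grows linearly for warmup_steps, then decays as the inverse square root of the step number.

def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# e.g. the rate rises until step 4000, then decays as 1/sqrt(step)
print([round(transformer_lr(s), 6) for s in (1, 1000, 4000, 16000)])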
Label Smoothing
‣ Instead of using a one-hot target distribution, create a distribution that has "confidence" of the correct word and the rest of the "smoothing" mass distributed throughout the vocabulary (see the sketch below).
‣ Implemented by minimizing KL-divergence between smoothed ground truth probabilities and the probabilities computed by the model.
I went to class and took _____
                 cats    TV      notes   took    sofa
one-hot:         0       0       1       0       0
with smoothing:  0.025   0.025   0.9     0.025   0.025
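A minimal sketch of the smoothed target distribution and the KL-divergence loss term for the toy 5-word vocabulary above; model_probs is a hypothetical model output.

import numpy as np

def smoothed_targets(correct_index, vocab_size, confidence=0.9):
    smoothing = (1.0 - confidence) / (vocab_size - 1)   # spread the remaining mass
    dist = np.full(vocab_size, smoothing)
    dist[correct_index] = confidence
    return dist

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))   # KL(p || q): the label-smoothing loss term

target = smoothed_targets(correct_index=2, vocab_size=5)   # "notes" in the example above
model_probs = np.array([0.05, 0.05, 0.7, 0.1, 0.1])
print(target, kl_divergence(target, model_probs))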
Transformers
Vaswani et al. (2017)
‣ Big = 6 layers, 1000-dim representations for each token, 16 heads; base = 6 layers + other params halved
Visualization
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Takeaways
‣ Can build MT systems with LSTM encoder-decoders, CNNs, or transformers
‣ Word piece / byte pair models are really effective and easy to use
‣ Transformer is a very strong model (when data is large enough); training can be tricky
‣ State-of-the-art systems are getting pretty good, but lots of challenges remain, especially for low-resource settings
Text Simplification
Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu. "Neural CRF Model for Sentence Alignment in Text Simplification" in ACL (2020)
Text Simplification
94k sent. pairs / 394k sent. pairs
Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu. "Neural CRF Model for Sentence Alignment in Text Simplification" in ACL (2020)