Neural Machine Translation
Wei Xu (many slides from Greg Durrett and Emma Strubell)
Recap: Phrase-Based MT
Unlabeled English data
cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
Language model P(e)
Phrase table P(f|e)
P(e|f) \propto P(f|e) P(e)
Noisy channel model: combine scores from translation model + language model to translate foreign to English
“Translate faithfully but make fluent English”
Recap: HMM for Alignment
Brown et al. (1993)
[Figure: word alignment example; e = "Thank you, I shall do so gladly."; f = "Gracias, lo haré de muy buen grado."; alignment a = 0 2 6 5 7 7 7 7 8]
‣ Sequential dependence between a's to capture monotonicity
‣ Alignment distribution parameterized by jump size [figure: histogram over jump sizes -2 -1 0 1 2 3]
‣ P(f_i | e_{a_i}): word translation table
‣ Want local monotonicity: most jumps are small
‣ HMM model (Vogel, 1996)
‣ Re-estimate using the forward-backward algorithm
P(f, a | e) = \prod_{i=1}^{n} P(f_i | e_{a_i}) \, P(a_i | a_{i-1})
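A minimal Python sketch of scoring a single alignment under this model; trans_table and jump_prob are hypothetical toy parameter dictionaries (in practice they are re-estimated with the forward-backward algorithm):

import numpy as np

# Minimal sketch: score one alignment a under
# P(f, a | e) = prod_i P(f_i | e_{a_i}) * P(a_i | a_{i-1}).
# trans_table and jump_prob are hypothetical toy parameters.
def log_prob_f_a_given_e(f, e, a, trans_table, jump_prob):
    logp, prev = 0.0, 0                                  # assume a dummy start position 0
    for i, f_word in enumerate(f):
        logp += np.log(trans_table[(f_word, e[a[i]])])   # word translation table P(f_i | e_{a_i})
        logp += np.log(jump_prob[a[i] - prev])           # transition parameterized by jump size
        prev = a[i]
    return logp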
Recap: Beam Search for Decoding
Recall: Decoding
[Figure: beam search lattice over source positions 1-9, with partial hypotheses such as "Mary not" (idx=2, score -1.2), "Mary no" (idx=2, score -2.9), "…did not" (idx=2), "…not give" (idx=3), "…not slap" (idx=5, idx=6)]
‣ Scores from language model P(e) + translation model P(f|e)
Recap: Evaluating MT
‣ Fluency: does it sound good in the target language?
‣ Fidelity/adequacy: does it capture the meaning of the original?
‣ BLEU score: geometric mean of 1-, 2-, 3-, and 4-gram precision vs. a reference, multiplied by brevity penalty (see the sketch below)
‣ Typically n = 4, w_i = 1/4
‣ r = length of reference, c = length of prediction
‣ Does this capture fluency and adequacy?
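A minimal sketch of this BLEU computation (single reference, no smoothing), assuming the standard formulation BLEU = BP * exp(sum_n w_n log p_n) with BP = min(1, exp(1 - r/c)):

import math
from collections import Counter

def bleu(prediction, reference, n=4):
    weights = [1.0 / n] * n
    log_precisions = []
    for k in range(1, n + 1):
        pred_ngrams = Counter(zip(*[prediction[i:] for i in range(k)]))
        ref_ngrams = Counter(zip(*[reference[i:] for i in range(k)]))
        overlap = sum((pred_ngrams & ref_ngrams).values())   # clipped n-gram matches
        total = max(1, sum(pred_ngrams.values()))
        if overlap == 0:
            return 0.0   # any zero n-gram precision zeroes BLEU (real systems smooth this)
        log_precisions.append(math.log(overlap / total))
    c, r = len(prediction), len(reference)
    brevity_penalty = min(1.0, math.exp(1.0 - r / c))
    return brevity_penalty * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

print(bleu("the cat sat on the mat".split(), "the cat sat on a mat".split()))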
Recap: seq2seq Models
‣ Encoder: consumes sequence of tokens, produces a vector. Analogous to encoders for classification/tagging tasks
‣ Decoder: separate module, single cell. Takes two inputs: hidden state (vector h or tuple (h, c)) and previous token. Outputs token + new state
[Figure: encoder reads "the movie was great"; decoder starts from <s> and emits "le film …"]
Recap: seq2seq Models
‣ Generate next word conditioned on previous word as well as hidden state
‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary
‣ Decoder has separate parameters from encoder, so this can learn to be a language model (produce a plausible next word given current one)
[Figure: encoder over "the movie was great"; decoder hidden state h̄ at <s>]
P(y | x) = \prod_{i=1}^{n} P(y_i | x, y_1, \ldots, y_{i-1})
P(y_i | x, y_1, \ldots, y_{i-1}) = \mathrm{softmax}(W \bar{h})
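A minimal sketch of one decoder output step with hypothetical toy dimensions: the current decoder hidden state h̄ is mapped by W to a distribution over the entire vocabulary.

import numpy as np

def decoder_step(W, h_bar):
    logits = W @ h_bar                     # W: |vocab| x |hidden|, h_bar: |hidden|
    probs = np.exp(logits - logits.max())  # numerically stable softmax over the vocabulary
    return probs / probs.sum()

vocab_size, hidden = 6, 4
W = np.random.randn(vocab_size, hidden)
h_bar = np.random.randn(hidden)
print(decoder_step(W, h_bar))              # distribution over the 6-word toy vocabulary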
Recap: Beam Search for Decoding
‣ Maintain decoder state, token history in beam (see the sketch below)
[Figure: beam expansion for source "the movie was great": from <s>, candidates la (0.4), le (0.3), les (0.1); expanding "la" with film (0.4) scores log(0.4)+log(0.4); expanding "le" with film (0.8) scores log(0.3)+log(0.8); hypotheses "le film", "la film", …]
‣ Keep both "film" states! Hidden state vectors are different
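A minimal beam search sketch matching the bullet above: each beam entry keeps its log probability, its token history, and its own decoder hidden state, so two hypotheses ending in the same word still carry different states. step is a hypothetical function returning a next-token distribution and a new state.

import numpy as np

def beam_search(step, start_state, start_token, beam_size=2, max_len=10, eos=1):
    beam = [(0.0, [start_token], start_state)]
    for _ in range(max_len):
        candidates = []
        for logp, tokens, state in beam:
            if tokens[-1] == eos:
                candidates.append((logp, tokens, state))   # keep finished hypotheses
                continue
            probs, new_state = step(state, tokens[-1])      # distribution over next tokens
            for tok in np.argsort(-probs)[:beam_size]:
                candidates.append((logp + np.log(probs[tok]), tokens + [int(tok)], new_state))
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_size]
    return beam[0][1]   # best-scoring token history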
Recap: NL-to-SQL Generation
‣ Convert natural language description into a SQL query against some DB
‣ How to ensure that well-formed SQL is generated?
Zhong et al. (2017)
‣ Three components
‣ How to capture column names + constants?
‣ Pointer mechanisms
Fairseq
Neural MT Details
Encoder-Decoder MT
Sutskever et al. (2014)
‣ Sutskever seq2seq paper: first major application of LSTMs to NLP
‣ Basic encoder-decoder with beam search
‣ SOTA = 37.0; not all that competitive…
Encoder-Decoder MT
‣ Better model from seq2seq lectures: encoder-decoder with attention and copying for rare words
[Figure: encoder states h1 … h4 over "the movie was great"; decoder state h̄1 at <s> attends over them to form context c1; the output is a distribution over vocab + copying, emitting "le" …]
e_{ij} = f(\bar{h}_i, h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}
c_i = \sum_j \alpha_{ij} h_j
P(y_i | x, y_1, \ldots, y_{i-1}) = \mathrm{softmax}(W [c_i; \bar{h}_i])
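A minimal sketch of the attention computation above, using a dot-product scoring function f and toy dimensions; the copying part is omitted.

import numpy as np

def attention_context(h_bar_i, H):
    """h_bar_i: decoder state (d,); H: encoder states (n, d)."""
    e = H @ h_bar_i               # scores e_ij for each encoder position j
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()          # attention weights over source positions
    c_i = alpha @ H               # context vector: weighted sum of encoder states
    return c_i, alpha

H = np.random.randn(4, 8)         # "the movie was great" -> 4 encoder states
h_bar = np.random.randn(8)
c, alpha = attention_context(h_bar, H)
# The output layer would then apply softmax(W @ np.concatenate([c, h_bar])).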
Results: WMT English-French
‣ 12M sentence pairs
Classic phrase-based system: ~33 BLEU, uses additional target-language data
Rerank with LSTMs: 36.5 BLEU (long line of work here; Devlin+ 2014)
Sutskever+ (2014) seq2seq single: 30.6 BLEU
Sutskever+ (2014) seq2seq ensemble: 34.8 BLEU
Luong+ (2015) seq2seq ensemble with attention and rare word handling: 37.5 BLEU
‣ But English-French is a really easy language pair and there's tons of data for it! Does this approach work for anything harder?
Results: WMT English-German
‣ 4.5M sentence pairs
Classic phrase-based system: 20.7 BLEU
Luong+ (2014) seq2seq: 14 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
‣ Not nearly as good in absolute BLEU, but not really comparable across languages
‣ French, Spanish = easiest; German, Czech = harder; Japanese, Russian = hard (grammatically different, lots of morphology…)
MT Examples
Luong et al. (2015)
‣ NMT systems can hallucinate words, especially when not using attention; phrase-based doesn't do this
‣ best = with attention, base = no attention
MT Examples
Luong et al. (2015)
‣ best = with attention, base = no attention
Zhang et al. (2017)
‣ NMT can repeat itself if it gets confused (pH or pH)
‣ Phrase-based MT often gets chunks right, may have more subtle ungrammaticalities
MT Examples
Handling Rare Words
‣ Words are a difficult unit to work with: copying can be cumbersome, word vocabularies get very large
‣ Character-level models don't work well
Input:_the_ecotax_portico_in_Pont-de-Buis…
Output:_le_portique_écotaxe_de_Pont-de-Buis
‣ Solution: "word pieces" (which may be full words but may be subwords)
‣ Can help with transliteration; captures shared linguistic characteristics between languages (e.g., transliteration, shared word root, etc.)
Wu et al. (2016)
Byte Pair Encoding (BPE)
Sennrich et al. (2016)
‣ Start with every individual byte (basically character) as its own symbol
‣ Count bigram character cooccurrences
‣ Merge the most frequent pair of adjacent characters (see the sketch below)
‣ Do this either over your vocabulary (original version) or over a large corpus (more common version)
‣ Final vocabulary size is often in the 10k-30k range for each language
‣ Most SOTA NMT systems use this on both source + target
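A minimal sketch of BPE merges in the style of Sennrich et al. (2016), run over a toy vocabulary with counts; each entry is a space-separated symbol sequence and the loop repeatedly merges the most frequent adjacent pair.

import re
from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace "a b" with "ab" wherever the pair appears as whole symbols
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: words start as individual characters plus an end-of-word marker
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                       # number of merges controls the final vocabulary size
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)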
Word Pieces
‣ SentencePiece library from Google: unigram LM
Schuster and Nakajima (2012), Wu et al. (2016), Kudo and Richardson (2018)
while voc size < target voc size:
    Build a language model over your corpus
    Merge pieces that lead to highest improvement in language model perplexity
‣ Issues: what LM to use? How to make this tractable?
‣ Result: way of segmenting input appropriate for translation (see the usage sketch below)
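A usage sketch with the SentencePiece library, assuming its Python API; corpus.txt is a placeholder file and the printed segmentation is only illustrative, since the actual output depends on the trained model.

import sentencepiece as spm

# Train a unigram-LM word-piece model (hypothetical corpus file and vocab size).
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='wp', vocab_size=8000, model_type='unigram')

sp = spm.SentencePieceProcessor(model_file='wp.model')
print(sp.encode('the ecotax portico in Pont-de-Buis', out_type=str))
# e.g. ['▁the', '▁eco', 'tax', '▁por', 'tico', ...] -- segmentation depends on the corpus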
Google's NMT System
Wu et al. (2016)
‣ 8-layer LSTM encoder-decoder with attention, word piece vocabulary of 8k-32k
Google's NMT System
Wu et al. (2016)
English-French:
Google's phrase-based system: 37.0 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 37.5 BLEU
Google's 32k word pieces: 38.95 BLEU
English-German:
Google's phrase-based system: 20.7 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
Google's 32k word pieces: 24.2 BLEU
Human Evaluation (En-Es)
Wu et al. (2016)
‣ Similar to human-level performance on English-Spanish
Google's NMT System
Wu et al. (2016)
‣ Gender is correct in GNMT but not in PBMT ("sled", "walker" examples)
‣ The right-most column shows the human ratings on a scale of 0 (complete nonsense) to 6 (perfect translation)
Backtranslation
‣ Statistical MT methods (e.g., phrase-based MT) used a bilingual corpus of sentences B = (S, T) and a large monolingual corpus T' to train a language model. Can neural MT do the same?
Sennrich et al. (2015)
‣ Approach 1: force the system to generate T' as targets from null inputs
  Training pairs: (s1, t1), (s2, t2), …, ([null], t'1), ([null], t'2), …
‣ Approach 2: generate synthetic sources with a T->S machine translation system (backtranslation); see the sketch below
  Training pairs: (s1, t1), (s2, t2), …, (MT(t'1), t'1), (MT(t'2), t'2), …
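A minimal sketch of the backtranslation setup in Approach 2; translate_t_to_s stands in for a hypothetical target-to-source MT system.

def backtranslate(bilingual_pairs, monolingual_targets, translate_t_to_s):
    # pair each monolingual target sentence with a synthetic (machine-translated) source
    synthetic = [(translate_t_to_s(t), t) for t in monolingual_targets]
    return bilingual_pairs + synthetic   # train the S->T system on real + synthetic pairs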
Backtranslation
Sennrich et al. (2015)
‣ parallel synth: backtranslate training data; makes additional noisy source sentences which could be useful
‣ Gigaword: large monolingual English corpus
Transformers for MT
Recall: Self-Attention
Vaswani et al. (2017)
the movie was great
‣ Each word forms a "query" which then computes attention over each word
‣ Multiple "heads" analogous to different convolutional filters. Use parameters W_k and V_k to get different attention values + transform vectors
[Figure: attention for word x_4 produces scalar weights; the output x'_4 is a sum of scalar-weighted vectors]
\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j)
x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j
\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j)
x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j
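A minimal numpy sketch of one self-attention head under the slide's formulation (per-head parameters W_k and V_k, toy dimensions), with multiple heads concatenated:

import numpy as np

def softmax(z):
    z = np.exp(z - z.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def self_attention_head(X, W_k, V_k):
    scores = X @ W_k @ X.T        # e_{k,i,j} = x_i^T W_k x_j for every pair (i, j)
    alpha = softmax(scores)       # row i: attention of word i over all words j
    return alpha @ (X @ V_k)      # x'_{k,i} = sum_j alpha_{k,i,j} V_k x_j

n, d, heads = 7, 16, 4            # e.g. 7 words: "Nobel committee awards Strickland who advanced optics"
X = np.random.randn(n, d)
outputs = [self_attention_head(X, np.random.randn(d, d), np.random.randn(d, d))
           for _ in range(heads)] # one transformed sequence per head
X_next = np.concatenate(outputs, axis=-1)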
[Figure: multi-head self-attention over "Nobel committee awards Strickland who advanced optics". At layer p, queries/keys/values (Q, K, V) yield an attention matrix per head (M_1 … M_H); a position-wise feed-forward network is then applied to each token to produce layer p+1. The full encoder stacks layers 1 … J of multi-head self-attention + feed forward.]
Transformers
Vaswani et al. (2017)
‣ Encoder and decoder are both transformers
‣ Decoder consumes the previously generated token (and attends to input), but has no recurrent state
‣ Many other details to get it to work: residual connections, layer normalization, positional encoding, optimizer with learning rate schedule, label smoothing…
Transformers
Vaswani et al. (2017)
‣ Augment word embedding with position embeddings; each dim is a sine/cosine wave of a different frequency. Closer points = higher dot products (see the sketch below)
‣ Works essentially as well as just encoding position as a one-hot vector
[Figure: word embeddings for "the movie was great" augmented with position embeddings emb(1) … emb(4)]
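A minimal sketch of sinusoidal position embeddings, each dimension a sine/cosine wave of a different frequency (toy sizes):

import numpy as np

def position_embeddings(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # positions 0 .. max_len-1
    dim = np.arange(0, d_model, 2)[None, :]
    freq = 1.0 / (10000 ** (dim / d_model))      # one frequency per dimension pair
    emb = np.zeros((max_len, d_model))
    emb[:, 0::2] = np.sin(pos * freq)
    emb[:, 1::2] = np.cos(pos * freq)
    return emb

emb = position_embeddings(max_len=4, d_model=8)  # emb(1) ... emb(4) for a 4-word input
print(emb @ emb.T)                               # nearby positions have higher dot products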
Residual Network
[Figure: residual connection; the layer applies a non-linearity f to the input x from the previous layer and passes f(x) + x to the next layer]
Transformers
Vaswani et al. (2017)
‣ Adam optimizer with varied learning rate over the course of training
‣ Linearly increase for warmup, then decay proportionally to the inverse square root of the step number (see the sketch below)
‣ This part is very important!
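A minimal sketch of this schedule as given in Vaswani et al. (2017): the rate grows linearly for warmup_steps, then decays as the inverse square root of the step number.

def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# e.g. the rate rises until step 4000, then decays as 1/sqrt(step)
print([round(transformer_lr(s), 6) for s in (1, 1000, 4000, 16000)])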
Label Smoothing
‣ Instead of using a one-hot target distribution, create a distribution that has "confidence" of the correct word and the rest of the "smoothing" mass distributed throughout the vocabulary (see the sketch below).
‣ Implemented by minimizing KL-divergence between smoothed ground truth probabilities and the probabilities computed by the model.
I went to class and took _____
                 cats    TV      notes   took    sofa
one-hot:         0       0       1       0       0
with smoothing:  0.025   0.025   0.9     0.025   0.025
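A minimal sketch of the smoothed target distribution and the KL-divergence loss term for the toy 5-word vocabulary above; model_probs is a hypothetical model output.

import numpy as np

def smoothed_targets(correct_index, vocab_size, confidence=0.9):
    smoothing = (1.0 - confidence) / (vocab_size - 1)   # spread the remaining mass
    dist = np.full(vocab_size, smoothing)
    dist[correct_index] = confidence
    return dist

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))   # KL(p || q): the label-smoothing loss term

target = smoothed_targets(correct_index=2, vocab_size=5)   # "notes" in the example above
model_probs = np.array([0.05, 0.05, 0.7, 0.1, 0.1])
print(target, kl_divergence(target, model_probs))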
Transformers
Vaswani et al. (2017)
‣ Big = 6 layers, 1000-dim representations for each token, 16 heads; base = 6 layers + other params halved
Visualization
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Takeaways
‣ Can build MT systems with LSTM encoder-decoders, CNNs, or transformers
‣ Word piece / byte pair models are really effective and easy to use
‣ Transformer is a very strong model (when data is large enough); training can be tricky
‣ State-of-the-art systems are getting pretty good, but lots of challenges remain, especially for low-resource settings
Text Simplification
Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu. "Neural CRF Model for Sentence Alignment in Text Simplification" in ACL (2020)
Text Simplification
94k sent. pairs / 394k sent. pairs
Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu. "Neural CRF Model for Sentence Alignment in Text Simplification" in ACL (2020)