Basics and (Modern) Methods for Natural Language Processing
Digital Product School of UnternehmerTUM, June 5, 2018
Nikolaos Pappas, Natural Language Understanding Group
Idiap Research Institute, Martigny
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: Encoders, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
Natural Language Processing
• NLP is a field at the intersection of AI and linguistics
  • Linguistics (structure of language, brain mappings, language learning)
  • Computational Linguistics (computational models of language, tools for studying language)
• Goals
  ✓ Process large amounts of natural text
  ✓ Give computers the ability to “understand” language to perform useful tasks
  ➡ Intrinsic tasks: parsing, language modeling, etc.
  ➡ Extrinsic tasks: speech recognition, translation, etc.
Levels of processing
➡ Lexical level
  • Speech: phonetic analysis
  • Vision: character recognition
➡ Morphological & Syntactic levels
  • Word structure (forms, inflections)
  • Sentence structure (grammar, syntax)
➡ Semantic & Discourse levels
  • Word and sentence meanings
  • Broad context, co-reference
➡ The ultimate goal of a system, however, is to be able to translate, assist, retrieve, classify, communicate
Intrinsic tasks: Text segmentation & Morphology
• Tokenization (split text into meaningful segments)
• Punctuation prediction
• Stemming (reduction of word forms to stems)
• Lemmatization (reduction of word forms to their base form with the intended POS and meaning)
Morphemes: smallest linguistic pieces with a grammatical function (inflectional: fish -> fishes, derivational: fish -> fishery, compounding: sky + scraper -> skyscraper)
Intrinsic tasks: Syntax & Grammar
• Part-of-speech tagging (sequence of POS tags)
• Language modeling (word sequences)
• Constituency parsing (nested phrasal structures)
• Dependency parsing (role-specific structures)
DT: determiner, NN: noun (singular), VBD: verb (past tense), NP: noun phrase, VP: verb phrase, ATT: attributive, SBJ: nominal subject, TMP: temporal modifier, PC: prepositional complement
Intrinsic tasks: Semantics & Discourse
• Lexical semantics
• Named entity recognition
• Textual entailment (directional relation between text fragments)
• Coreference resolution (find expressions referring to the same entity in a text)
Extrinsic tasks
• Machine translation
• Question answering
• Sentiment analysis
• Summarization
• Dialogue agents / chatbots
• Topic recognition
• Search and retrieval
• and more
What is special about natural language?
• Formal languages are static, explicit and non-ambiguous
  • Defined as mathematical abstractions (alphabet, rules)
  • One can explicitly enumerate all well-formed words
• Natural languages are dynamic, implicit and ambiguous
  • They exist in the real world and are spoken by their users
  • Grammar is discovered through empirical investigation
“Cathrine and John gave flowers to Mary. She said “thanks” and put them in a vase.”
“I need to talk to you asap.” (abbreviations)
“Did you download the app?” (neologisms)
What is natural language?
• A naturally evolved system used by humans to express thoughts, for (i) communicating with one another, (ii) learning from previous experiences and (iii) achieving their goals
• Essentially, it is a discrete / symbolic / categorical signaling system
  ✓ Symbols are invariant across signals (audio, visual)
  ✓ Concise and grounded on shared knowledge (entails ambiguities)
    “Did you watch the finals? Our goalkeeper was useless!”
  ✓ Unlimited expressive power (implies flexible interpretation rules, i.e. meaning cannot be exclusively expressed in the surface form)
    “All politicians lie.”
What are the main challenges in language ‘understanding’?
• Symbolic encodings require large vocabularies
  • Sparsity issues for machine learning
  • Scaling issues in real-world settings
• Brain encodings appear to be a continuous pattern of activations, distributed across neurons (Buchweitz et al. 2009)
  ➡ Continuous encodings provide a cognitively plausible way to encode thoughts
• Challenges
  • How to learn continuous encodings that generalize well?
  • Can we encode very complex thoughts in a single continuous encoding?
  • Can we create and reason over thoughts to solve any NLP task?
  • How to transfer knowledge from one domain, task or language to another?
What is Deep Learning?
• Machine learning boils down to minimizing an objective function to increase task performance
  • Mostly relies on human-crafted features
  • Tasks involve regression, classification, structured prediction, representation learning
➡ Representation Learning: learn good features or representations
➡ Deep Learning: machine learning algorithms based on multiple levels of representation or abstraction
  ✓ Biologically inspired by how the human brain works
  ✓ Neurons activate in response to certain inputs and excite other neurons
  ✓ Can handle a variety of inputs, such as vision, speech, and language
Deep Learning: Why this decade?
• What enabled deep learning techniques to start outperforming other machine learning techniques since Hinton et al. 2006?
  • Larger amounts of data
  • Faster computers and multicore CPUs and GPUs
  • New models, algorithms and improvements over “older” methods (speech, vision and language)
Deep Learning for speech: Phoneme detection
• The first breakthrough results of “deep learning” on large datasets, by Dahl et al. 2010
  • 30% reduction of error
• More recently also on speech synthesis, Oord et al. 2016
Deep Learning for vision: Object detection
• Popular topic for DL
• Breakthrough on ImageNet by Krizhevsky et al. 2012
  • 21% and 51% error reduction at top-1 and top-5
Deep Learning for language: Ongoing
• Significant improvements in recent years across different levels (phonology, morphology, syntax, semantics) and applications in NLP
  • Machine translation (most notable)
  • Question answering
  • Sentiment classification
  • Summarization
• Still a lot of work to be done… (beyond supervised learning and “basic” recognition)
Deep Learning for language: Machine Translation
• Reached the state of the art within one year: Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015
Neural network components for language ‘understanding’
• Distributed Representations (word / subword units)
  • Ability to represent ‘meaning’ efficiently
• Abstraction & Composition (word sequences)
  • Ability to compose complex ‘meanings’ from simpler ones
• Attention Mechanism
  • Ability to focus on / collect what is ‘relevant’ (input, memory)
• Memory Mechanism
  • Ability to store / retrieve important previous information / knowledge
• Reasoning Mechanism
  • Ability to reason with what is ‘relevant’
• Learning Mechanism
  • Ability to learn from past experience
…
Outline of the talk
1. Introduction and Motivation
   • Basics: Perceptron, NNs, SGD
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: Encoders, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
Basics: Perceptron
Basics: What can a perceptron do?
• Solve linearly separable problems
• …but not non-linearly separable ones.
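To make this concrete, here is a minimal NumPy sketch of the perceptron learning rule on a toy, linearly separable problem (the logical AND function); the data and number of epochs are illustrative only:

```python
import numpy as np

# Toy, linearly separable problem: the logical AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b = np.zeros(2), 0.0
for epoch in range(10):
    for x_i, y_i in zip(X, y):
        pred = 1 if x_i @ w + b > 0 else 0
        # Perceptron rule: update the weights only on mistakes.
        w += (y_i - pred) * x_i
        b += (y_i - pred)

print([1 if x @ w + b > 0 else 0 for x in X])  # -> [0, 0, 0, 1]
# With XOR targets [0, 1, 1, 0] this loop never converges: not linearly separable.
```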
Basics: From logistic regression to neural networks
Basics: Neural network
• Apply several logistic regressions to obtain a vector of outputs
• The values of the outputs are initially unknown
• No need to specify ahead of time what values the logistic regressions are trying to predict
Basics: Neural network
• The intermediate variables are learned directly based on the training objective
• This makes them do a good job at predicting the target for the next layer
• Result: able to model non-linearities in the data!
Basics: Neural network with multiple layers
Basics: Learning model parameters with gradient descent
• Given training data, find the parameters (weights and biases) that minimize the loss with respect to these parameters
• Compute the gradient with respect to the parameters and make a small step in the direction of the negative gradient
• Apply the chain rule for nested functions, e.g. y = f(g(x))
Basics: Stochastic gradient descent (SGD)
• Approximate the gradient using a mini-batch of examples instead of the entire training set
• Online SGD when the mini-batch size is one
• Most commonly used, compared to (batch) GD
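As an illustration, a minimal NumPy sketch of mini-batch SGD for a logistic regression on synthetic data (the data, learning rate and batch size are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                  # synthetic inputs
y = (X @ rng.normal(size=20) > 0).astype(float)  # synthetic binary labels

w = np.zeros(20)
lr, batch_size = 0.1, 32
for epoch in range(20):
    order = rng.permutation(len(X))              # shuffle before each pass
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # one mini-batch
        p = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))  # sigmoid predictions
        grad = X[idx].T @ (p - y[idx]) / len(idx)  # cross-entropy gradient on the batch
        w -= lr * grad                           # step along the negative gradient

acc = ((1 / (1 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(f"training accuracy: {acc:.2f}")
# batch_size = 1 would be online SGD; batch_size = len(X) would be plain GD.
```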
Basics: Choosing a stochastic optimization algorithm
• Several out-of-the-box strategies for decaying the learning rate of an objective function
• Select the best one according to validation set performance
Training neural networks with arbitrary layers: Backpropagation
• We still minimize the objective function, but this time we “backpropagate” the errors to all the hidden layers
• Chain rule: if y = f(u) and u = g(x), i.e. y = f(g(x)), then dy/dx = (dy/du)(du/dx)
• Useful basic derivatives (e.g. of the common activation functions)
• Typically, the backprop computation is already implemented in popular libraries: Theano, Torch, TensorFlow
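To make the chain rule concrete, here is a NumPy sketch of backpropagation through a single hidden layer (in practice, as the slide notes, libraries such as Theano, Torch or TensorFlow compute these gradients automatically); the architecture and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                            # toy inputs
y = (np.sin(X).sum(axis=1) > 0).astype(float)[:, None]   # a non-linear target

W1, b1 = 0.5 * rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = 0.5 * rng.normal(size=(16, 1)), np.zeros(1)
lr = 0.5
for step in range(500):
    # Forward pass: y = f(g(x))
    h = np.tanh(X @ W1 + b1)                  # hidden layer u = g(x)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # output      y = f(u)
    # Backward pass: chain rule, dL/dx = dL/du * du/dx at every layer
    d_logits = (p - y) / len(X)               # dL/d(pre-sigmoid) for cross-entropy
    dW2, db2 = h.T @ d_logits, d_logits.sum(axis=0)
    d_h = (d_logits @ W2.T) * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad                    # gradient descent step

print("train accuracy:", ((p > 0.5) == y).mean())
```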
Basics: The end
• Essentially, we now have all the basic “ingredients” we need to build deep neural networks
• However, we will also need:
  ➡ The ability to learn from different inputs (spatial, sequential, continuous vs discrete)
  ➡ To overcome optimization difficulties (exploding / vanishing gradients, information flow, convergence)
  ➡ To avoid overfitting / regularization (dropout, L2 norm)
  ➡ and more…
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: Encoders, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
*Figure from Lebret’s thesis, EPFL
Semantic similarity: How similar are two linguistic items?
• Word level
  screwdriver vs wrench: very similar
  screwdriver vs hammer: little similar
  screwdriver vs technician: related
  screwdriver vs fruit: unrelated
• Sentence level
  “The boss fired the worker” vs “The supervisor let the employee go”: very similar
  “The boss fired the worker” vs “The boss reprimanded the worker”: little similar
  “The boss fired the worker” vs “The boss promoted the worker”: related
  “The boss fired the worker” vs “The boss went for jogging today”: unrelated
Semantic similarity: How similar are two linguistic items?
• Defined at many levels
  • Words, word senses or concepts, phrases, paragraphs, documents
• Similarity is a specific type of relatedness
  • Related: topically or via a relation (heart vs surgeon, wheel vs bike)
  • Similar: synonyms and hyponyms (doctor vs surgeon, bike vs bicycle)
Semantic similarity: Numerous attempts to answer that
*Image from D. Jurgens’ NAACL 2016 tutorial.
Semantic similarity: Why do we have so many methods?
• New resources or methods
  • Datasets reveal weaknesses in previous methods
  • The state of the art is a moving target
• Task-specific similarity functions
  • Performance on new tasks is not satisfactory
➡ Semantic similarity is not the end task
  • Pick the one which yields the best results
  • Need for methods that quickly adapt the similarity
Two main sources for measuring similarity
• Massive text corpora
• Semantic resources and knowledge bases
How to Represent Word ‘Meaning’?
• Discrete: each dimension denotes a specific linguistic item
  • Interpretable dimensions
  • High dimensionality
• Continuous: dimensions are not tied to explicit concepts
  • Enable comparison between represented linguistic items
  • Low dimensionality
dog = [0, 0, 0, 1, 0, 0], cat = [0, 1, 0, 0, 0, 0], sim(dog, cat) = 0.0
How to compare two linguistic items in the vector space
• Cosine of the angle θ between vectors A and B: cos(θ) = A·B / (‖A‖ ‖B‖)
• Explicit models have a serious sparsity problem due to their discrete or “k-hot” vector representations
  france = [0, 0, 0, 1, 0, 0], england = [0, 1, 0, 0, 0, 0]
  france is near spain = [1, 0, 0, 1, 1, 1]
  • cos(france, england) = 0.0
  • cos(france, france is near spain) = 0.5
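A quick NumPy check of the cosine similarity between these “k-hot” vectors (the example vectors are the ones on the slide):

```python
import numpy as np

def cos(a, b):
    """Cosine of the angle between two vectors: a.b / (||a|| ||b||)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

france = np.array([0, 0, 0, 1, 0, 0])
england = np.array([0, 1, 0, 0, 0, 0])
france_is_near_spain = np.array([1, 0, 0, 1, 1, 1])

print(cos(france, england))               # 0.0: one-hot words never overlap
print(cos(france, france_is_near_spain))  # 0.5: non-zero only where dimensions overlap
```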
Learning Word Representations From Text
• Limitations of knowledge-based methods
  • Out of context, despite the validity of the resources
  • Most lack evaluation on practical tasks
• What if we do not know anything about the words?
  - Follow the distributional hypothesis (unsupervised): “You shall know a word by the company it keeps”, Firth 1957
“The value of the central bank increased by 10%.” (financial institution)
“She often goes to the bank to withdraw cash.” (financial institution)
“She went to the river bank to have a picnic with her child.” (geographical term)
Simple approach: Compute a word-in-context co-occurrence matrix
• Matrix of counts between words and contexts (e.g. surrounding words or documents)
• Limitations
  • All words have equal importance (imbalance)
  • Vectors are very high dimensional (storage issue)
  • Infrequent words have overly sparse vectors (makes subsequent models less robust)
The most standard approach: Dimensionality reduction
• Perform singular value decomposition (SVD) of the word co-occurrence matrix that we saw previously
• Typically, U * Σ is used as the vector space
*Image from D. Jurgens’ NAACL 2016 tutorial.
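A small NumPy sketch of the whole pipeline: count word/context co-occurrences within a window, then reduce the dimensionality with a truncated SVD (the corpus, window size and number of dimensions are toy values):

```python
import numpy as np

corpus = [
    "the value of the central bank increased",
    "she often goes to the bank to withdraw cash",
    "she went to the river bank for a picnic",
]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Word-in-context co-occurrence counts within a symmetric window of 2 words.
window = 2
M = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[idx[w], idx[sent[j]]] += 1

# Dimensionality reduction: keep the top-k singular vectors, use U * Sigma as vectors.
U, S, Vt = np.linalg.svd(M)
k = 5
word_vectors = U[:, :k] * S[:k]   # one dense k-dimensional row per word
print(word_vectors[idx["bank"]])
```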
The most standard approach: Dimensionality reduction
• Syntactically and semantically related words cluster together
*Plots from Rohde et al. 2005
Dimensionality reduction with Hellinger PCA
• Perform PCA with the Hellinger distance on the word co-occurrence matrix: Lebret and Collobert 2014
• Well suited for discrete probability distributions (P, Q)
• Neural approaches are time-consuming (tuning, data)
  • Instead, compute word vectors efficiently with PCA
  • Fine-tuning them on tasks works better than neural embeddings
• Limitations: hard to add new words, not scalable, O(mn^2)
https://github.com/rlebret/hpca
Dimensionality reduction with weighted least squares
• GloVe vectors by Pennington et al. 2014. Factorizes the log of the co-occurrence matrix.
• Fast training, scalable to huge corpora, but still hard to incorporate new words
• Reported much better results than neural embeddings; however, under equivalent tuning this is not the case: Levy and Goldberg 2015
http://nlp.stanford.edu/projects/glove/
Dimensionality reduction with neural networks
• The main idea is to directly learn low-dimensional word representations from data
  • Learning representations: Rumelhart et al. 1986
  • Neural probabilistic language model: Bengio et al. 2003
  • NLP (almost) from scratch: Collobert and Weston 2008
• Recent methods are faster and simpler
  • Continuous Bag-Of-Words (CBOW)
  • Skip-gram with Negative Sampling (SGNS)
  • word2vec toolkit: Mikolov et al. 2013
word2vec: Skip-gram with negative sampling (SGNS)
• Given the middle word, predict the surrounding ones in a fixed window of words (maximize the log likelihood)
word2vec: Skip-gram with negative sampling (SGNS)
• How is the probability P(wt | h) computed?
• The denominator (a softmax over the full vocabulary) is very costly for a big vocabulary!
• Instead it uses a more scalable objective, where log Qθ is a binary logistic regression of word w given history h (negative sampling)
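In practice, skip-gram with negative sampling is usually trained with an off-the-shelf toolkit. A minimal sketch with the gensim implementation of word2vec (assuming gensim >= 4.0, where the dimensionality argument is called vector_size; the corpus and hyper-parameters are placeholders):

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences (in practice, millions of them).
sentences = [
    ["she", "goes", "to", "the", "bank", "to", "withdraw", "cash"],
    ["the", "central", "bank", "increased", "the", "interest", "rate"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window around the middle word
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples per positive pair
    min_count=1,       # keep even rare words in this tiny example
)
print(model.wv["bank"][:5])                # dense vector for a word
print(model.wv.most_similar("bank", topn=3))
```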
word2vec: Continuous Bag-Of-Words with negative sampling (CBOW)
• More efficient, but the ordering information of the words does not influence the projection
• Factorizes a PMI word-context matrix: Levy and Goldberg 2014
• Builds upon existing methods (new decomposition)
• Improvements on a variety of intrinsic tasks such as relatedness, categorization and analogy: Baroni et al. 2014, Schnabel et al. 2015
Distributed representations: Encoded properties
• Encode general-purpose relations between words: present / past tense, singular / plural, male / female, capital / country
• Analogies between words can be efficiently computed using basic arithmetic operations between vectors (+, -)
king - man + woman ≈ queen
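The analogy itself is just vector arithmetic plus a nearest-neighbour search. A sketch assuming `vectors` is a dict mapping words to pre-trained embeddings (e.g. word2vec or GloVe vectors loaded from disk):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, vectors, topn=1):
    """Words closest to vec(b) - vec(a) + vec(c), e.g. king - man + woman."""
    target = vectors[b] - vectors[a] + vectors[c]
    scored = [(w, cos(target, v)) for w, v in vectors.items() if w not in (a, b, c)]
    return sorted(scored, key=lambda x: -x[1])[:topn]

# With good pre-trained embeddings:
# analogy("man", "king", "woman", vectors)  ->  [("queen", ...)]
```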
Summary: Learning word representations
• Neural versus count-based methods
  • neural ones implicitly do SVD over a PMI matrix
  • similar to count-based methods when using the same tricks
• Neural methods appear to have the edge (word2vec)
  • efficient and scalable objective + toolkit
  • intuitive formulation (= predict words in context)
➡ Several extensions
  • Dependency-based embeddings: Levy and Goldberg 2014
  • Embeddings retrofitted to lexicons: Faruqui et al. 2014
  • Sense-aware embeddings: Li and Jurafsky 2015
  • Visually-grounded embeddings: Lazaridou et al. 2015
  • Multilingual embeddings: Gouws et al. 2015
Summary: Learning word representations
How can we benefit from them?
• study linguistic properties of words
• inject general knowledge into downstream tasks
• transfer knowledge across languages or modalities
• build representations of word sequences
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: RNNs, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
*Figure from Colah’s blog, 2015.
Language Modeling
• Computes the joint probability of a sequence of words by employing the chain rule (“How likely is a text?”):
  p(w1, w2, …, wt) = p(w1) p(w2 | w1) p(w3 | w2, w1) … p(wt | wt-1, wt-2, …)
• Given the observed text, how likely is the new utterance? p(wt | wt-1, …, w1)
• Hence, we can compare orderings (translation): p(he likes apples) > p(apples likes he)
  or word choices (speech recognition): p(he likes apples) > p(he licks apples)
➡ The exact decomposition allows us to learn complex distributions
➡ Many NLP tasks can be structured as a (conditional) language model
Language Modeling: Markov Models
• N-gram models: the history of observed words is approximated with just the previous n words (Markov model)
  • hard to capture long-term dependencies (bounded memory)
  • does not leverage word semantics and relationships
• Neural n-gram models: embed the same fixed n-gram history in a continuous space (still a Markov model)
  • capture better correlations + smaller memory footprint
  • trained with MLE
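A tiny count-based bigram (n = 2) language model makes the Markov approximation concrete; the corpus is a toy one and there is no smoothing, so unseen bigrams simply get probability zero:

```python
from collections import Counter

corpus = ["he likes apples", "he likes oranges", "she likes apples"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    sent = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(sent[:-1])
    bigrams.update(zip(sent[:-1], sent[1:]))

def p(sentence):
    """p(w1..wn) ~ prod_i p(w_i | w_{i-1}) under the bigram Markov approximation."""
    sent = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(sent[:-1], sent[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

print(p("he likes apples"), ">", p("apples likes he"))  # the natural ordering wins
```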
Language Modeling: Recurrent Neural Networks (RNN)
• With RNN LMs we drop the fixed n-gram history and compress the entire history into a fixed-length vector
  • long-range correlations are captured (in theory)
  • can represent unbounded dependencies
  • but they are hard to learn (vanishing gradient)
Language Modeling: Deep RNNs
• Increasing the size of the hidden layer results in a quadratic increase in model size and computation
• Stacking multiple RNNs increases the memory capacity and representational ability with linear scaling
• We can also increase depth in the time dimension
Scaling: Large Vocabularies
• Much of the computational cost comes from the classification layer, because its parameters depend on the size of the vocabulary
• Several solutions exist
  • Short-lists: use the most frequent words + an n-gram LM for the rest
  • Local short-lists: subsets of the vocabulary specific to data segments
  • Gradient approximations: use Noise Contrastive Estimation (NCE), i.e. learn a binary classifier to distinguish data samples from k samples drawn from a noise distribution
Scaling: Large Vocabularies
• Change the input granularity and model text at the morpheme or character level
  • Much smaller softmax, but longer dependencies
  • Captures morphological properties of words
  • The Byte-Pair Encoding (BPE) method is the most common for neural MT (Sennrich et al. 2015)
Long Short-Term Memory (LSTM)
• Long short-term memory nets are able to learn long-term dependencies: Hochreiter and Schmidhuber 1997
• Simple RNN:
*Figure from Colah’s blog, 2015.
Long Short-Term Memory (LSTM)
• Long short-term memory nets are able to learn long-term dependencies: Hochreiter and Schmidhuber 1997
• Ability to remove or add information to the cell state, regulated by “gates” (avoids gradient vanishing)
*Figure from Colah’s blog, 2015.
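As a sketch of how an LSTM is used in practice, a small Keras sequence classifier (assuming TensorFlow 2.x / tf.keras; the vocabulary size, sequence length and data are placeholders):

```python
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 20000, 128, 200   # placeholder settings

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim, input_length=max_len),
    layers.LSTM(128),          # swap in layers.GRU(128) for a GRU, or wrap with
                               # layers.Bidirectional(...) for a bidirectional encoder
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=64, epochs=3, validation_split=0.1)
```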
Gated Recurrent Unit (GRU)
• The gated RNN by Chung et al., 2014 combines the forget and input gates into a single “update gate”
  • keeps memories to capture long-term dependencies
  • allows error messages to flow at different strengths
zt: update gate, rt: reset gate, ht: regular RNN update
*Figure from Colah’s blog, 2015.
Deep Bidirectional Models
• Shown here with an RNN, but it applies to LSTMs and GRUs too
(Irsoy and Cardie, 2014)
Convolutional Neural Network (CNN)
• Typically good for images
• Convolutional filter(s) are applied every k words
• Similar to Recursive NNs, but without constraining the composition to grammatical phrases only, as in Socher et al., 2011
  • no need for a parser (!)
  • less linguistically motivated?
(Collobert et al., 2011) (Kim, 2014)
Hierarchical Models
• Word-level and sentence-level abstractions
(Tang et al., 2015)
Attention Mechanism: Machine Translation (Bahdanau et al., 2015)
• Can we compress all the needed information into the last encoder state? Idea: use all the hidden states!
  • length proportional to the sentence length
  • weighted average of all hidden states
• Learns to assign a relevance to each input position given the current encoder state and the previous decoder state
  • a soft bilingual alignment model
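The core of the mechanism is just a softmax-weighted average of the encoder states. Here is a NumPy sketch of additive (Bahdanau-style) attention; the weight matrices W1, W2 and the vector v would normally be learned, here they are random placeholders:

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W1, W2, v):
    """decoder_state: (d,), encoder_states: (T, d) -> context vector of shape (d,)."""
    scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ v   # one score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax: relevance of each input position
    context = weights @ encoder_states          # weighted average of all hidden states
    return context, weights

T, d, a = 6, 8, 16
rng = np.random.default_rng(0)
context, weights = additive_attention(
    rng.normal(size=d), rng.normal(size=(T, d)),
    rng.normal(size=(d, a)), rng.normal(size=(d, a)), rng.normal(size=a))
print(weights.round(2), weights.sum())          # weights over positions, sum to 1
```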
Attention Mechanism: Machine Translation (Bahdanau et al., 2015)
Attention Mechanism: Document Classification
• Operates on the input word sequence or on intermediate hidden states
• Learns to focus on the relevant parts of the input with respect to each target label
  • a soft summarization model
• Can be applied at multiple language levels (Yang et al., 2016)
(Pappas and Popescu-Belis, 2014 & 2017)
Hierarchical attention networks
• Very similar hierarchical structure to Tang et al., 2015, except for the average pooling
  • attention mechanism at the word and sentence levels
(Yang et al., 2016)
Attention Mechanism: Sentiment Classification
(Yang et al., 2016)
Memory mechanism: Neural Turing Machines or Memory Networks
• Combination of a recurrent network with an external memory bank: Graves et al. 2014, Weston et al. 2014
*Diagram from Christopher Olah’s blog.
Residual connections
• Residual learning allows information to flow more easily by adding the input x of a layer F to its own output, i.e. F(x) + x
• Typically used for making connections from one layer to another
• This improves training and avoids the vanishing gradient problem
• The layer can be ignored if it is not beneficial
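In Keras functional-API terms (assuming TensorFlow 2.x / tf.keras), a residual connection is just an Add between a layer's input and its output; the input width must match the layer width so the sum is well defined:

```python
from tensorflow.keras import layers, models

def residual_block(x, units=128):
    h = layers.Dense(units, activation="relu")(x)   # F(x)
    return layers.Add()([x, h])                     # F(x) + x: the skip connection

inputs = layers.Input(shape=(128,))
x = residual_block(inputs)
x = residual_block(x)                               # stack as many blocks as needed
outputs = layers.Dense(1, activation="sigmoid")(x)
model = models.Model(inputs, outputs)
```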
Other DL tricks
• Dropout: drop units at random; used as a regularization method to avoid overfitting
  • It allows us to over-parameterize a network and still generalize well
• Layer output normalization: stabilizes the training process of a model, especially useful for self-attention architectures
• Hyper-parameter optimization: tuning well may have a very large impact on performance
• and more (e.g. initialization, label smoothing, weight decay)
Putting everything together: Flexible modeling
• Sentiment classification
• Topic detection
• Spam detection
• Named Entity Recognition
• Machine translation
• Summarization
• Image captioning
• Conversational agents
• Question answering
• Paraphrase detection
• Relation Extraction
✓ Multiple levels of abstraction (deep, hierarchical)
✓ End-to-end training with stochastic gradient descent
✓ Good basis for multi-task learning / transfer learning
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: RNNs, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
*Figure from Colah’s blog, 2015.
Paragraph vectors for Document Classification (Le et al., 2014)
• Learning vectors of paragraphs, inspired by word2vec
  • trained without supervision on a large corpus
  • preferably from a domain similar to the target
• Two methods: with or without word ordering
Paragraph vectors for Document Classification (Le et al., 2014)
• Learned paragraph vectors + logistic regression
• Outperformed previous methods on sentence-level and document-level sentiment classification
Convolutional neural network for Document Classification (Kim, 2014)
• Used multiple filter widths
• Dropout regularization (randomly dropping a portion of the hidden units during back-propagation)
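A compact tf.keras sketch in the spirit of Kim's model: parallel convolutions with several filter widths, max-pooling over time, and dropout before the classifier (assuming TensorFlow 2.x; vocabulary size, sequence length and filter settings are placeholders, not the paper's exact configuration):

```python
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 20000, 128, 200     # placeholder settings

inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)
branches = []
for width in (3, 4, 5):                               # multiple filter widths
    c = layers.Conv1D(100, kernel_size=width, activation="relu")(x)
    branches.append(layers.GlobalMaxPooling1D()(c))   # max-over-time pooling
x = layers.Concatenate()(branches)
x = layers.Dropout(0.5)(x)                            # dropout regularization
outputs = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```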
Convolutional neural network for Document Classification (Kim, 2014)
• Not all baseline methods used drop-out, though
Modeling and Summarizing Documents with a Convolutional Network (Denil et al., 2014)
• Similar to Kim, 2014, but with differences:
  • k-max pooling instead of max pooling
  • two layers of convolutions
Gated recurrent neural network for Document Classification (Tang et al., 2015)
Hierarchical attention networks for Document Classification (Yang et al., 2016)
• Very similar hierarchical structure to Tang et al., 2015, except for the average pooling
  • attention mechanism at the word and sentence levels
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: RNNs, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
*Figure from Colah’s blog, 2015.
RNN encoder-decoder for Machine Translation (Cho et al., 2014)
• GRU as the hidden layer
• Maximize the log likelihood of the target sequence given the source sequence
• WMT 2014 (EN→FR)
Sequence to sequence learning for Machine Translation (Sutskever et al., 2014)
• LSTM hidden layers instead of GRU
• 4 layers deep instead of a shallow encoder-decoder
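A minimal tf.keras sketch of such an encoder-decoder trained with teacher forcing (assuming TensorFlow 2.x; vocabulary sizes and dimensions are placeholders, and a real system would add attention, beam search, etc.):

```python
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, dim = 10000, 10000, 256          # placeholder sizes

# Encoder: read the source and keep only the final LSTM states.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, dim)(enc_in)
_, state_h, state_c = layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder: generate the target conditioned on the encoder's final states.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out = layers.LSTM(dim, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = models.Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([src_ids, tgt_ids_in], tgt_ids_out, ...)   # maximize log p(target | source)
```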
Sequence to sequence learning for Machine Translation (Sutskever et al., 2014)
• WMT 2014 (EN→FR)
• PCA projection of the hidden state of the last encoder layer
Jointly learning to align and translate for Machine Translation (Bahdanau et al., 2015)
• Limitation: can we compress all the needed information into the last encoder state?
• Idea: use all the hidden states of the encoder
  • length proportional to that of the sentence!
  • compute a weighted average of all the hidden states
Jointly learning to align and translate for Machine Translation (Bahdanau et al., 2015)
• WMT 2014 (EN→FR)
Effective approaches to attention-based NMT (Luong et al., 2015)
• Global and local attention
• Input-feeding approach
• Stacked LSTM instead of a single layer
Multi-source NMT (Zoph and Knight, 2016)
• Train a p(e | f, g) model directly on trilingual data
• Use it to decode e given any (f, g) pair
• Take the local-attention NMT model and concatenate the context from multiple sources
Multi-source NMT (Zoph and Knight, 2016)
• Multi-source training improves over the individual French→English and German→English pairs
• Best: basic concatenation with attention
Multi-target NMT (Dong et al., 2015)
• Multi-task learning framework for multiple target language translation
• Optimization for a one-to-many model
Multi-target NMT (Dong et al., 2015)
• Improves over NMT and Moses baselines on the WMT 2013 test set
  • but also on larger datasets
• Faster and better convergence in multiple-language translation
Multi-way, Multilingual NMT (Firat et al., 2016)
• Encoder-decoder model with multiple encoders and decoders shared across language pairs
  • share knowledge through a universal space
  • good for low-resource languages
• Attention is pair specific, hence expensive, O(L^2)
  • instead, share the attention across all pairs!
Figure: n-th encoder and m-th decoder at time step t. φ makes the encoder & decoder states compatible with the attention mechanism; f_adp makes the context vector compatible with the decoder. All these transformations are there to support different types of encoders/decoders for different languages!
Multi-way, Multilingual NMT (Firat et al., 2016)
• Consistent improvements for low-resource languages
  • the less training data, the bigger the improvement
• In large-scale translation it improves only translation into English
  • hypothesis: EN always appears as a source or target language for all pairs → better decoder?
Google’s Neural Machine Translation System (Wu et al., 2016)
• An encoder, a decoder and an attention network
• 8 layers deep, with residual connections
• Refinement with Reinforcement Learning
• Sub-word units… and more
Google’s Neural Machine Translation System (Wu et al., 2016)
• EN→FR training took 6 days on 96 GPUs(!), plus 3 more days for refinement…
Convolutional Encoder-Decoder (Gehring et al., 2017)
• Outperformed GNMT and was more efficient in terms of speed, but:
  • Lacks long-term memory
  • Requires meticulous initialization schemes and careful normalization
  • Requires positional embeddings
  • Requires more depth (15 layers)
Self-Attention or Transformer Networks
• Encode/decode the input without using CNNs or LSTMs
  • Lower training cost, but lacks long-term memory
  • Stacked self-attention with multiple heads
• To capture sequence information it uses positional embeddings (sinusoids)
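A NumPy sketch of the two ingredients named above: scaled dot-product self-attention (one head) and sinusoidal positional embeddings; the dimensions are illustrative, and real models add learned projections per head, residual connections and layer normalization:

```python
import numpy as np

def self_attention(X):
    """One attention head over a sequence X of shape (T, d): softmax(QK^T / sqrt(d)) V."""
    Q, K, V = X, X, X                       # real models use learned projections of X
    d = X.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # pairwise relevance between positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)      # softmax over positions
    return w @ V                            # every position attends to every other one

def positional_embeddings(T, d):
    """Sinusoids of different frequencies encode the position of each token."""
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

T, d = 6, 16
X = np.random.randn(T, d) + positional_embeddings(T, d)  # add position information
print(self_attention(X).shape)              # (6, 16)
```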
Self-Attention or Transformer Networks
(Figure: attention weights from two different heads, Head 1 and Head 2)
Best of both worlds: LSTMs & Transformer Networks (Chen & Firat & Bapna et al., 2018)
• The Transformer network uses training techniques that other models do not use
• Combine the strengths of both worlds
• CNNs fall behind in BLEU and convergence speed
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: RNNs, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
Conclusion
• Deep learning for NLP has flourished in the last couple of years
  ➡ The attention mechanism became popular even outside NLP
  ➡ Companies heavily use NLP, e.g. machine translation
• Although attention-based models can get us far:
  ➡ Linguistic structure can get us even further, by constraining models and creating useful inductive biases
  ➡ Inspecting and analyzing the learned structures can help us gain insights about language
  ➡ There is still a lot of work to be done on weakly-supervised and unsupervised learning
Future Challenges
• Transferring knowledge across domains, languages and different outputs
• Contextualizing word representations
• Generalizing to unseen examples
• Learning from very few examples
• Summarization / Entailment / Reasoning
Discussion / Coding Session
• Learning word embeddings
  https://www.tensorflow.org/tutorials/word2vec
  https://nlp.stanford.edu/projects/glove/
• Classification using pre-trained word embeddings
  https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
➡ https://www.tensorflow.org/install/
➡ https://keras.io/backend/
➡ https://keras.io/#installation
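For the coding session, a condensed sketch along the lines of the Keras blog tutorial above: load GloVe vectors into an Embedding layer and train a small classifier on top (assuming TensorFlow 2.x / tf.keras; the word_index, the training data and the file path glove.6B.100d.txt are placeholders you would produce with a Tokenizer and the GloVe download):

```python
import numpy as np
from tensorflow.keras import layers, models

embed_dim, max_len = 100, 200
word_index = {"the": 1, "bank": 2, "river": 3}   # placeholder: comes from a Tokenizer

# Build the embedding matrix from the pre-trained GloVe vectors.
embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        if word in word_index:
            embedding_matrix[word_index[word]] = np.asarray(values, dtype="float32")

model = models.Sequential([
    layers.Embedding(len(word_index) + 1, embed_dim,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False),            # keep the pre-trained vectors frozen
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=5, validation_split=0.1)
```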