Basics and (Modern) Methods for Natural Language Processing
Digital Product School of UnternehmerTUM, June 5, 2018
Nikolaos Pappas, Natural Language Understanding Group
Idiap Research Institute, Martigny
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: Encoders, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
Natural Language Processing
• NLP is a field at the intersection of AI and linguistics
  • Linguistics (structure of language, brain mappings, language learning)
  • Computational Linguistics (computational models of language, tools for studying language)
• Goals
  ✓ Process large amounts of natural text
  ✓ Give computers the ability to “understand” language to perform useful tasks
  ➡ Intrinsic tasks: parsing, language modeling, etc.
  ➡ Extrinsic tasks: speech recognition, translation, etc.
Levels of processing
➡ Lexical level
  • Speech: phonetic analysis
  • Vision: character recognition
➡ Morphological & Syntactic levels
  • Word structure (forms, inflections)
  • Sentence structure (grammar, syntax)
➡ Semantic & Discourse levels
  • Word and sentence meanings
  • Broad context, co-reference
➡ The ultimate goal of a system, however, is to be able to translate, assist, retrieve, classify, communicate
Intrinsic tasks: Text segmentation & Morphology
• Tokenization (split text into meaningful segments)
• Punctuation prediction
• Stemming (reduction of word forms to stems)
• Lemmatization (reduction of word forms to their base form with the intended POS and meaning)
Morphemes: smallest linguistic pieces with a grammatical function (inflectional: fish -> fishes, derivational: fish -> fishery, compounding: sky + scraper -> skyscraper)
Intrinsic tasks: Syntax & Grammar
• Part-of-speech tagging (sequence of POS tags)
• Language modeling (word sequences)
• Constituency parsing (nested phrasal structures)
• Dependency parsing (role-specific structures)
DT: determiner, NN: noun (singular), VBD: verb (past tense), NP: noun phrase, VP: verb phrase, ATT: attributive, SBJ: nominal subject, TMP: temporal modifier, PC: prepositional complement
Intrinsic tasks: Semantics & Discourse
• Lexical semantics
• Named entity recognition
• Textual entailment (directional relation between text fragments)
• Coreference resolution (find expressions referring to the same entity in a text)
Extrinsic tasks
• Machine translation
• Question answering
• Sentiment analysis
• Summarization
• Dialogue agents / chatbots
• Topic recognition
• Search and retrieval
• and more
What is special about natural language?
• Formal languages are static, explicit and non-ambiguous
  • Defined as mathematical abstractions (alphabet, rules)
  • One can explicitly enumerate all well-formed words
• Natural languages are dynamic, implicit and ambiguous
  • They exist in the real world and are spoken by their users
  • Grammar is discovered through empirical investigation
“Cathrine and John gave flowers to Mary. She said “thanks” and put them in a vase.”
“I need to talk to you asap.” (abbreviations)
“Did you download the app?” (neologisms)
What is natural language?
• A naturally evolved system used by humans to express thoughts, for (i) communicating with one another, (ii) learning from previous experiences and (iii) achieving their goals
• Essentially, it is a discrete / symbolic / categorical signaling system
  ✓ Symbols are invariant across signals (audio, visual)
  ✓ Concise and grounded on shared knowledge (entails ambiguities)
    “Did you watch the finals? Our goalkeeper was useless!”
  ✓ Unlimited expressive power (implies flexible interpretation rules, i.e. meaning cannot be exclusively expressed in the surface form)
    “All politicians lie.”
What are the main challenges in language ‘understanding’?
• Symbolic encodings require large vocabularies
  • Sparsity issues for machine learning
  • Scaling issues in real-world settings
• Brain encodings appear to be a continuous pattern of activations, distributed across neurons (Buchweitz et al. 2009)
  ➡ Continuous encodings provide a cognitively plausible way to encode thoughts
• Challenges
  • How to learn continuous encodings that generalize well?
  • Can we encode very complex thoughts in a single continuous encoding?
  • Can we create and reason over thoughts to solve any NLP task?
  • How to transfer knowledge from one domain, task or language to another?
What is Deep Learning?
• Machine learning boils down to minimizing an objective function to increase task performance
  • Mostly relies on human-crafted features
  • Tasks involve regression, classification, structured prediction, representation learning
➡ Representation Learning: learn good features or representations
➡ Deep Learning: machine learning algorithms based on multiple levels of representation or abstraction
  ✓ Biologically inspired by how the human brain works
  ✓ Neurons activate in response to certain inputs and excite other neurons
  ✓ Can handle a variety of inputs, such as vision, speech, and language
Deep Learning: Why this decade?
• What enabled deep learning techniques to start outperforming other machine learning techniques since Hinton et al. 2006?
  • Larger amounts of data
  • Faster computers and multicore CPUs and GPUs
  • New models, algorithms and improvements over “older” methods (speech, vision and language)
Deep Learning for speech: Phoneme detection
• The first breakthrough results of “deep learning” on large datasets, by Dahl et al. 2010
  • 30% reduction of error
• More recently also on speech synthesis, Oord et al. 2016
Deep Learning for vision: Object detection
• Popular topic for DL
• Breakthrough on ImageNet by Krizhevsky et al. 2012
  • 21% and 51% error reduction at top-1 and top-5
Deep Learning for language: Ongoing
• Significant improvements in recent years across different levels (phonology, morphology, syntax, semantics) and applications in NLP
  • Machine translation (most notable)
  • Question answering
  • Sentiment classification
  • Summarization
• Still a lot of work to be done… (beyond supervised learning and “basic” recognition)
Deep Learning for language: Machine Translation
• Reached the state of the art within one year: Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015
Neural network components for language ‘understanding’
• Distributed Representations (word / subword units)
  • Ability to represent ‘meaning’ efficiently
• Abstraction & Composition (word sequences)
  • Ability to compose complex ‘meanings’ from simpler ones
• Attention Mechanism
  • Ability to focus on / collect what is ‘relevant’ (input, memory)
• Memory Mechanism
  • Ability to store / retrieve important previous information / knowledge
• Reasoning Mechanism
  • Ability to reason with what is ‘relevant’
• Learning Mechanism
  • Ability to learn from past experience
…
Outline of the talk
1. Introduction and Motivation
   • Basics: Perceptron, NNs, SGD
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: Encoders, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
Basics: Perceptron
Basics: What can a perceptron do?
• Solve linearly separable problems
• …but not non-linearly separable ones.
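To make this concrete, here is a minimal NumPy sketch of the perceptron learning rule on a toy, linearly separable problem (the logical AND function); the data and number of epochs are illustrative only:

```python
import numpy as np

# Toy, linearly separable problem: the logical AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b = np.zeros(2), 0.0
for epoch in range(10):
    for x_i, y_i in zip(X, y):
        pred = 1 if x_i @ w + b > 0 else 0
        # Perceptron rule: update the weights only on mistakes.
        w += (y_i - pred) * x_i
        b += (y_i - pred)

print([1 if x @ w + b > 0 else 0 for x in X])  # -> [0, 0, 0, 1]
# With XOR targets [0, 1, 1, 0] this loop never converges: not linearly separable.
```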
Basics: From logistic regression to neural networks
Basics: Neural network
• Apply several logistic regressions to obtain a vector of outputs
• The values of the outputs are initially unknown
• No need to specify ahead of time what values the logistic regressions are trying to predict
Basics: Neural network
• The intermediate variables are learned directly based on the training objective
• This makes them do a good job at predicting the target for the next layer
• Result: able to model non-linearities in the data!
Basics: Neural network with multiple layers
Basics: Learning model parameters with gradient descent
• Given training data, find the parameters (weights and biases) that minimize the loss with respect to these parameters
• Compute the gradient with respect to the parameters and make a small step in the direction of the negative gradient
• Apply the chain rule for nested functions, e.g. y = f(g(x))
Basics: Stochastic gradient descent (SGD)
• Approximate the gradient using a mini-batch of examples instead of the entire training set
• Online SGD when the mini-batch size is one
• Most commonly used, compared to (batch) GD
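As an illustration, a minimal NumPy sketch of mini-batch SGD for a logistic regression on synthetic data (the data, learning rate and batch size are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                  # synthetic inputs
y = (X @ rng.normal(size=20) > 0).astype(float)  # synthetic binary labels

w = np.zeros(20)
lr, batch_size = 0.1, 32
for epoch in range(20):
    order = rng.permutation(len(X))              # shuffle before each pass
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # one mini-batch
        p = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))  # sigmoid predictions
        grad = X[idx].T @ (p - y[idx]) / len(idx)  # cross-entropy gradient on the batch
        w -= lr * grad                           # step along the negative gradient

acc = ((1 / (1 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(f"training accuracy: {acc:.2f}")
# batch_size = 1 would be online SGD; batch_size = len(X) would be plain GD.
```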
Basics: Choosing a stochastic optimization algorithm
• Several out-of-the-box strategies for decaying the learning rate of an objective function
• Select the best one according to validation set performance
Training neural networks with arbitrary layers: Backpropagation
• We still minimize the objective function, but this time we “backpropagate” the errors to all the hidden layers
• Chain rule: if y = f(u) and u = g(x), i.e. y = f(g(x)), then dy/dx = (dy/du)(du/dx)
• Useful basic derivatives (e.g. of the common activation functions)
• Typically, the backprop computation is already implemented in popular libraries: Theano, Torch, TensorFlow
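To make the chain rule concrete, here is a NumPy sketch of backpropagation through a single hidden layer (in practice, as the slide notes, libraries such as Theano, Torch or TensorFlow compute these gradients automatically); the architecture and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                            # toy inputs
y = (np.sin(X).sum(axis=1) > 0).astype(float)[:, None]   # a non-linear target

W1, b1 = 0.5 * rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = 0.5 * rng.normal(size=(16, 1)), np.zeros(1)
lr = 0.5
for step in range(500):
    # Forward pass: y = f(g(x))
    h = np.tanh(X @ W1 + b1)                  # hidden layer u = g(x)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # output      y = f(u)
    # Backward pass: chain rule, dL/dx = dL/du * du/dx at every layer
    d_logits = (p - y) / len(X)               # dL/d(pre-sigmoid) for cross-entropy
    dW2, db2 = h.T @ d_logits, d_logits.sum(axis=0)
    d_h = (d_logits @ W2.T) * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad                    # gradient descent step

print("train accuracy:", ((p > 0.5) == y).mean())
```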
Basics: The end
• Essentially, we now have all the basic “ingredients” we need to build deep neural networks
• However, we will also need:
  ➡ The ability to learn from different inputs (spatial, sequential, continuous vs discrete)
  ➡ To overcome optimization difficulties (exploding / vanishing gradients, information flow, convergence)
  ➡ To avoid overfitting / regularization (dropout, L2 norm)
  ➡ and more…
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: Encoders, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
*Figure from Lebret’s thesis, EPFL
Semantic similarity: How similar are two linguistic items?
• Word level
  screwdriver vs wrench: very similar
  screwdriver vs hammer: little similar
  screwdriver vs technician: related
  screwdriver vs fruit: unrelated
• Sentence level
  “The boss fired the worker” vs “The supervisor let the employee go”: very similar
  “The boss fired the worker” vs “The boss reprimanded the worker”: little similar
  “The boss fired the worker” vs “The boss promoted the worker”: related
  “The boss fired the worker” vs “The boss went for jogging today”: unrelated
Semantic similarity: How similar are two linguistic items?
• Defined at many levels
  • Words, word senses or concepts, phrases, paragraphs, documents
• Similarity is a specific type of relatedness
  • Related: topically or via a relation (heart vs surgeon, wheel vs bike)
  • Similar: synonyms and hyponyms (doctor vs surgeon, bike vs bicycle)
Semantic similarity: Numerous attempts to answer that
*Image from D. Jurgens’ NAACL 2016 tutorial.
Semantic similarity: Why do we have so many methods?
• New resources or methods
  • Datasets reveal weaknesses in previous methods
  • The state of the art is a moving target
• Task-specific similarity functions
  • Performance on new tasks is not satisfactory
➡ Semantic similarity is not the end task
  • Pick the one which yields the best results
  • Need for methods that quickly adapt the similarity
Two main sources for measuring similarity
• Massive text corpora
• Semantic resources and knowledge bases
How to Represent Word ‘Meaning’?
• Discrete: each dimension denotes a specific linguistic item
  • Interpretable dimensions
  • High dimensionality
• Continuous: dimensions are not tied to explicit concepts
  • Enable comparison between represented linguistic items
  • Low dimensionality
dog = [0, 0, 0, 1, 0, 0], cat = [0, 1, 0, 0, 0, 0], sim(dog, cat) = 0.0
How to compare two linguistic items in the vector space
• Cosine of the angle θ between vectors A and B: cos(θ) = A·B / (‖A‖ ‖B‖)
• Explicit models have a serious sparsity problem due to their discrete or “k-hot” vector representations
  france = [0, 0, 0, 1, 0, 0], england = [0, 1, 0, 0, 0, 0]
  france is near spain = [1, 0, 0, 1, 1, 1]
  • cos(france, england) = 0.0
  • cos(france, france is near spain) = 0.5
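A quick NumPy check of the cosine similarity between these “k-hot” vectors (the example vectors are the ones on the slide):

```python
import numpy as np

def cos(a, b):
    """Cosine of the angle between two vectors: a.b / (||a|| ||b||)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

france = np.array([0, 0, 0, 1, 0, 0])
england = np.array([0, 1, 0, 0, 0, 0])
france_is_near_spain = np.array([1, 0, 0, 1, 1, 1])

print(cos(france, england))               # 0.0: one-hot words never overlap
print(cos(france, france_is_near_spain))  # 0.5: non-zero only where dimensions overlap
```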
Learning Word Representations From Text
• Limitations of knowledge-based methods
  • Out of context, despite the validity of the resources
  • Most lack evaluation on practical tasks
• What if we do not know anything about the words?
  - Follow the distributional hypothesis (unsupervised): “You shall know a word by the company it keeps”, Firth 1957
“The value of the central bank increased by 10%.” (financial institution)
“She often goes to the bank to withdraw cash.” (financial institution)
“She went to the river bank to have a picnic with her child.” (geographical term)
Simple approach: Compute a word-in-context co-occurrence matrix
• Matrix of counts between words and contexts (e.g. surrounding words or documents)
• Limitations
  • All words have equal importance (imbalance)
  • Vectors are very high dimensional (storage issue)
  • Infrequent words have overly sparse vectors (makes subsequent models less robust)
The most standard approach: Dimensionality reduction
• Perform singular value decomposition (SVD) of the word co-occurrence matrix that we saw previously
• Typically, U * Σ is used as the vector space
*Image from D. Jurgens’ NAACL 2016 tutorial.
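A small NumPy sketch of the whole pipeline: count word/context co-occurrences within a window, then reduce the dimensionality with a truncated SVD (the corpus, window size and number of dimensions are toy values):

```python
import numpy as np

corpus = [
    "the value of the central bank increased",
    "she often goes to the bank to withdraw cash",
    "she went to the river bank for a picnic",
]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Word-in-context co-occurrence counts within a symmetric window of 2 words.
window = 2
M = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[idx[w], idx[sent[j]]] += 1

# Dimensionality reduction: keep the top-k singular vectors, use U * Sigma as vectors.
U, S, Vt = np.linalg.svd(M)
k = 5
word_vectors = U[:, :k] * S[:k]   # one dense k-dimensional row per word
print(word_vectors[idx["bank"]])
```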
The most standard approach: Dimensionality reduction
• Syntactically and semantically related words cluster together
*Plots from Rohde et al. 2005
Dimensionality reduction with Hellinger PCA
• Perform PCA with the Hellinger distance on the word co-occurrence matrix: Lebret and Collobert 2014
• Well suited for discrete probability distributions (P, Q)
• Neural approaches are time-consuming (tuning, data)
  • Instead, compute word vectors efficiently with PCA
  • Fine-tuning them on tasks works better than neural embeddings
• Limitations: hard to add new words, not scalable, O(mn^2)
https://github.com/rlebret/hpca
Dimensionality reduction with weighted least squares
• GloVe vectors by Pennington et al. 2014. Factorizes the log of the co-occurrence matrix.
• Fast training, scalable to huge corpora, but still hard to incorporate new words
• Reported much better results than neural embeddings; however, under equivalent tuning this is not the case: Levy and Goldberg 2015
http://nlp.stanford.edu/projects/glove/
Dimensionality reduction with neural networks
• The main idea is to directly learn low-dimensional word representations from data
  • Learning representations: Rumelhart et al. 1986
  • Neural probabilistic language model: Bengio et al. 2003
  • NLP (almost) from scratch: Collobert and Weston 2008
• Recent methods are faster and simpler
  • Continuous Bag-Of-Words (CBOW)
  • Skip-gram with Negative Sampling (SGNS)
  • word2vec toolkit: Mikolov et al. 2013
word2vec: Skip-gram with negative sampling (SGNS)
• Given the middle word, predict the surrounding ones in a fixed window of words (maximize the log likelihood)
word2vec: Skip-gram with negative sampling (SGNS)
• How is the probability P(wt | h) computed?
• The denominator (a softmax over the full vocabulary) is very costly for a big vocabulary!
• Instead it uses a more scalable objective, where log Qθ is a binary logistic regression of word w given history h (negative sampling)
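In practice, skip-gram with negative sampling is usually trained with an off-the-shelf toolkit. A minimal sketch with the gensim implementation of word2vec (assuming gensim >= 4.0, where the dimensionality argument is called vector_size; the corpus and hyper-parameters are placeholders):

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences (in practice, millions of them).
sentences = [
    ["she", "goes", "to", "the", "bank", "to", "withdraw", "cash"],
    ["the", "central", "bank", "increased", "the", "interest", "rate"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window around the middle word
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples per positive pair
    min_count=1,       # keep even rare words in this tiny example
)
print(model.wv["bank"][:5])                # dense vector for a word
print(model.wv.most_similar("bank", topn=3))
```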
word2vec: Continuous Bag-Of-Words with negative sampling (CBOW)
• More efficient, but the ordering information of the words does not influence the projection
• Factorizes a PMI word-context matrix: Levy and Goldberg 2014
• Builds upon existing methods (new decomposition)
• Improvements on a variety of intrinsic tasks such as relatedness, categorization and analogy: Baroni et al. 2014, Schnabel et al. 2015
Distributed representations: Encoded properties
• Encode general-purpose relations between words: present / past tense, singular / plural, male / female, capital / country
• Analogies between words can be efficiently computed using basic arithmetic operations between vectors (+, -)
king - man + woman ≈ queen
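The analogy itself is just vector arithmetic plus a nearest-neighbour search. A sketch assuming `vectors` is a dict mapping words to pre-trained embeddings (e.g. word2vec or GloVe vectors loaded from disk):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, vectors, topn=1):
    """Words closest to vec(b) - vec(a) + vec(c), e.g. king - man + woman."""
    target = vectors[b] - vectors[a] + vectors[c]
    scored = [(w, cos(target, v)) for w, v in vectors.items() if w not in (a, b, c)]
    return sorted(scored, key=lambda x: -x[1])[:topn]

# With good pre-trained embeddings:
# analogy("man", "king", "woman", vectors)  ->  [("queen", ...)]
```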
Summary: Learning word representations
• Neural versus count-based methods
  • neural ones implicitly do SVD over a PMI matrix
  • similar to count-based methods when using the same tricks
• Neural methods appear to have the edge (word2vec)
  • efficient and scalable objective + toolkit
  • intuitive formulation (= predict words in context)
➡ Several extensions
  • Dependency-based embeddings: Levy and Goldberg 2014
  • Embeddings retrofitted to lexicons: Faruqui et al. 2014
  • Sense-aware embeddings: Li and Jurafsky 2015
  • Visually-grounded embeddings: Lazaridou et al. 2015
  • Multilingual embeddings: Gouws et al. 2015
Summary: Learning word representations
How can we benefit from them?
• study linguistic properties of words
• inject general knowledge into downstream tasks
• transfer knowledge across languages or modalities
• build representations of word sequences
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: RNNs, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
*Figure from Colah’s blog, 2015.
Language Modeling
• Computes the joint probability of a sequence of words by employing the chain rule (“How likely is a text?”):
  p(w1, w2, …, wt) = p(w1) p(w2 | w1) p(w3 | w2, w1) … p(wt | wt-1, wt-2, …)
• Given the observed text, how likely is the new utterance? p(wt | wt-1, …, w1)
• Hence, we can compare orderings (translation): p(he likes apples) > p(apples likes he)
  or word choices (speech recognition): p(he likes apples) > p(he licks apples)
➡ The exact decomposition allows us to learn complex distributions
➡ Many NLP tasks can be structured as a (conditional) language model
Language Modeling: Markov Models
• N-gram models: the history of observed words is approximated with just the previous n words (Markov model)
  • hard to capture long-term dependencies (bounded memory)
  • does not leverage word semantics and relationships
• Neural n-gram models: embed the same fixed n-gram history in a continuous space (still a Markov model)
  • capture better correlations + smaller memory footprint
  • trained with MLE
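A tiny count-based bigram (n = 2) language model makes the Markov approximation concrete; the corpus is a toy one and there is no smoothing, so unseen bigrams simply get probability zero:

```python
from collections import Counter

corpus = ["he likes apples", "he likes oranges", "she likes apples"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    sent = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(sent[:-1])
    bigrams.update(zip(sent[:-1], sent[1:]))

def p(sentence):
    """p(w1..wn) ~ prod_i p(w_i | w_{i-1}) under the bigram Markov approximation."""
    sent = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(sent[:-1], sent[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

print(p("he likes apples"), ">", p("apples likes he"))  # the natural ordering wins
```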
Language Modeling: Recurrent Neural Networks (RNN)
• With RNN LMs we drop the fixed n-gram history and compress the entire history into a fixed-length vector
  • long-range correlations are captured (in theory)
  • can represent unbounded dependencies
  • but they are hard to learn (vanishing gradient)
Language Modeling: Deep RNNs
• Increasing the size of the hidden layer results in a quadratic increase in model size and computation
• Stacking multiple RNNs increases the memory capacity and representational ability with linear scaling
• We can also increase depth in the time dimension
Scaling: Large Vocabularies
• Much of the computational cost comes from the classification layer, because its parameters depend on the size of the vocabulary
• Several solutions exist
  • Short-lists: use the most frequent words + an n-gram LM for the rest
  • Local short-lists: subsets of the vocabulary specific to data segments
  • Gradient approximations: use Noise Contrastive Estimation (NCE), i.e. learn a binary classifier to distinguish data samples from k samples drawn from a noise distribution
Scaling: Large Vocabularies
• Change the input granularity and model text at the morpheme or character level
  • Much smaller softmax, but longer dependencies
  • Captures morphological properties of words
  • The Byte-Pair Encoding (BPE) method is the most common for neural MT (Sennrich et al. 2015)
Long Short-Term Memory (LSTM)
• Long short-term memory nets are able to learn long-term dependencies: Hochreiter and Schmidhuber 1997
• Simple RNN:
*Figure from Colah’s blog, 2015.
Long Short-Term Memory (LSTM)
• Long short-term memory nets are able to learn long-term dependencies: Hochreiter and Schmidhuber 1997
• Ability to remove or add information to the cell state, regulated by “gates” (avoids gradient vanishing)
*Figure from Colah’s blog, 2015.
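As a sketch of how an LSTM is used in practice, a small Keras sequence classifier (assuming TensorFlow 2.x / tf.keras; the vocabulary size, sequence length and data are placeholders):

```python
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 20000, 128, 200   # placeholder settings

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim, input_length=max_len),
    layers.LSTM(128),          # swap in layers.GRU(128) for a GRU, or wrap with
                               # layers.Bidirectional(...) for a bidirectional encoder
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=64, epochs=3, validation_split=0.1)
```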
Gated Recurrent Unit (GRU)
• The gated RNN by Chung et al., 2014 combines the forget and input gates into a single “update gate”
  • keeps memories to capture long-term dependencies
  • allows error messages to flow at different strengths
zt: update gate, rt: reset gate, ht: regular RNN update
*Figure from Colah’s blog, 2015.
Deep Bidirectional Models
• Shown here with an RNN, but it applies to LSTMs and GRUs too
(Irsoy and Cardie, 2014)
Convolutional Neural Network (CNN)
• Typically good for images
• Convolutional filter(s) are applied every k words
• Similar to Recursive NNs, but without constraining the composition to grammatical phrases only, as in Socher et al., 2011
  • no need for a parser (!)
  • less linguistically motivated?
(Collobert et al., 2011) (Kim, 2014)
Hierarchical Models
• Word-level and sentence-level abstractions
(Tang et al., 2015)
Attention Mechanism: Machine Translation (Bahdanau et al., 2015)
• Can we compress all the needed information into the last encoder state? Idea: use all the hidden states!
  • length proportional to the sentence length
  • weighted average of all hidden states
• Learns to assign a relevance to each input position given the current encoder state and the previous decoder state
  • a soft bilingual alignment model
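The core of the mechanism is just a softmax-weighted average of the encoder states. Here is a NumPy sketch of additive (Bahdanau-style) attention; the weight matrices W1, W2 and the vector v would normally be learned, here they are random placeholders:

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W1, W2, v):
    """decoder_state: (d,), encoder_states: (T, d) -> context vector of shape (d,)."""
    scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ v   # one score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax: relevance of each input position
    context = weights @ encoder_states          # weighted average of all hidden states
    return context, weights

T, d, a = 6, 8, 16
rng = np.random.default_rng(0)
context, weights = additive_attention(
    rng.normal(size=d), rng.normal(size=(T, d)),
    rng.normal(size=(d, a)), rng.normal(size=(d, a)), rng.normal(size=a))
print(weights.round(2), weights.sum())          # weights over positions, sum to 1
```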
Attention Mechanism: Machine Translation (Bahdanau et al., 2015)
Attention Mechanism: Document Classification
• Operates on the input word sequence or on intermediate hidden states
• Learns to focus on the relevant parts of the input with respect to each target label
  • a soft summarization model
• Can be applied at multiple language levels (Yang et al., 2016)
(Pappas and Popescu-Belis, 2014 & 2017)
Hierarchical attention networks
• Very similar hierarchical structure to Tang et al., 2015, except for the average pooling
  • attention mechanism at the word and sentence levels
(Yang et al., 2016)
Attention Mechanism: Sentiment Classification
(Yang et al., 2016)
Memory mechanism: Neural Turing Machines or Memory Networks
• Combination of a recurrent network with an external memory bank: Graves et al. 2014, Weston et al. 2014
*Diagram from Christopher Olah’s blog.
Residual connections
• Residual learning allows information to flow more easily by adding the input x of a layer F to its own output, i.e. F(x) + x
• Typically used for making connections from one layer to another
• This improves training and avoids the vanishing gradient problem
• The layer can be ignored if it is not beneficial
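In Keras functional-API terms (assuming TensorFlow 2.x / tf.keras), a residual connection is just an Add between a layer's input and its output; the input width must match the layer width so the sum is well defined:

```python
from tensorflow.keras import layers, models

def residual_block(x, units=128):
    h = layers.Dense(units, activation="relu")(x)   # F(x)
    return layers.Add()([x, h])                     # F(x) + x: the skip connection

inputs = layers.Input(shape=(128,))
x = residual_block(inputs)
x = residual_block(x)                               # stack as many blocks as needed
outputs = layers.Dense(1, activation="sigmoid")(x)
model = models.Model(inputs, outputs)
```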
Other DL tricks
• Dropout: drop units at random; used as a regularization method to avoid overfitting
  • It allows us to over-parameterize a network and still generalize well
• Layer output normalization: stabilizes the training process of a model, especially useful for self-attention architectures
• Hyper-parameter optimization: tuning well may have a very large impact on performance
• and more (e.g. initialization, label smoothing, weight decay)
Putting everything together: Flexible modeling
• Sentiment classification
• Topic detection
• Spam detection
• Named Entity Recognition
• Machine translation
• Summarization
• Image captioning
• Conversational agents
• Question answering
• Paraphrase detection
• Relation Extraction
✓ Multiple levels of abstraction (deep, hierarchical)
✓ End-to-end training with stochastic gradient descent
✓ Good basis for multi-task learning / transfer learning
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: RNNs, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
*Figure from Colah’s blog, 2015.
Paragraph vectors for Document Classification (Le et al., 2014)
• Learning vectors of paragraphs, inspired by word2vec
  • trained without supervision on a large corpus
  • preferably from a domain similar to the target
• Two methods: with or without word ordering
Paragraph vectors for Document Classification (Le et al., 2014)
• Learned paragraph vectors + logistic regression
• Outperformed previous methods on sentence-level and document-level sentiment classification
Convolutional neural network for Document Classification (Kim, 2014)
• Used multiple filter widths
• Dropout regularization (randomly dropping a portion of the hidden units during back-propagation)
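A compact tf.keras sketch in the spirit of Kim's model: parallel convolutions with several filter widths, max-pooling over time, and dropout before the classifier (assuming TensorFlow 2.x; vocabulary size, sequence length and filter settings are placeholders, not the paper's exact configuration):

```python
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 20000, 128, 200     # placeholder settings

inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)
branches = []
for width in (3, 4, 5):                               # multiple filter widths
    c = layers.Conv1D(100, kernel_size=width, activation="relu")(x)
    branches.append(layers.GlobalMaxPooling1D()(c))   # max-over-time pooling
x = layers.Concatenate()(branches)
x = layers.Dropout(0.5)(x)                            # dropout regularization
outputs = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```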
Convolutional neural network for Document Classification (Kim, 2014)
• Not all baseline methods used drop-out, though
Modeling and Summarizing Documents with a Convolutional Network (Denil et al., 2014)
• Similar to Kim, 2014, but with differences:
  • k-max pooling instead of max pooling
  • two layers of convolutions
Gated recurrent neural network for Document Classification (Tang et al., 2015)
Hierarchical attention networks for Document Classification (Yang et al., 2016)
• Very similar hierarchical structure to Tang et al., 2015, except for the average pooling
  • attention mechanism at the word and sentence levels
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: RNNs, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
*Figure from Colah’s blog, 2015.
RNN encoder-decoder for Machine Translation (Cho et al., 2014)
• GRU as the hidden layer
• Maximize the log likelihood of the target sequence given the source sequence
• WMT 2014 (EN→FR)
Sequence to sequence learning for Machine Translation (Sutskever et al., 2014)
• LSTM hidden layers instead of GRU
• 4 layers deep instead of a shallow encoder-decoder
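A minimal tf.keras sketch of such an encoder-decoder trained with teacher forcing (assuming TensorFlow 2.x; vocabulary sizes and dimensions are placeholders, and a real system would add attention, beam search, etc.):

```python
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, dim = 10000, 10000, 256          # placeholder sizes

# Encoder: read the source and keep only the final LSTM states.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, dim)(enc_in)
_, state_h, state_c = layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder: generate the target conditioned on the encoder's final states.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out = layers.LSTM(dim, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = models.Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([src_ids, tgt_ids_in], tgt_ids_out, ...)   # maximize log p(target | source)
```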
Sequence to sequence learning for Machine Translation (Sutskever et al., 2014)
• WMT 2014 (EN→FR)
• PCA projection of the hidden state of the last encoder layer
Jointly learning to align and translate for Machine Translation (Bahdanau et al., 2015)
• Limitation: can we compress all the needed information into the last encoder state?
• Idea: use all the hidden states of the encoder
  • length proportional to that of the sentence!
  • compute a weighted average of all the hidden states
Jointly learning to align and translate for Machine Translation (Bahdanau et al., 2015)
• WMT 2014 (EN→FR)
Effective approaches to attention-based NMT (Luong et al., 2015)
• Global and local attention
• Input-feeding approach
• Stacked LSTM instead of a single layer
Multi-source NMT (Zoph and Knight, 2016)
• Train a p(e | f, g) model directly on trilingual data
• Use it to decode e given any (f, g) pair
• Take the local-attention NMT model and concatenate the context from multiple sources
Multi-source NMT (Zoph and Knight, 2016)
• Multi-source training improves over the individual French→English and German→English pairs
• Best: basic concatenation with attention
Multi-target NMT (Dong et al., 2015)
• Multi-task learning framework for multiple target language translation
• Optimization for a one-to-many model
Multi-target NMT (Dong et al., 2015)
• Improves over NMT and Moses baselines on the WMT 2013 test set
  • but also on larger datasets
• Faster and better convergence in multiple-language translation
Multi-way, Multilingual NMT (Firat et al., 2016)
• Encoder-decoder model with multiple encoders and decoders shared across language pairs
  • share knowledge through a universal space
  • good for low-resource languages
• Attention is pair specific, hence expensive, O(L^2)
  • instead, share the attention across all pairs!
Figure: n-th encoder and m-th decoder at time step t. φ makes the encoder & decoder states compatible with the attention mechanism; f_adp makes the context vector compatible with the decoder. All these transformations are there to support different types of encoders/decoders for different languages!
Multi-way, Multilingual NMT (Firat et al., 2016)
• Consistent improvements for low-resource languages
  • the less training data, the bigger the improvement
• In large-scale translation it improves only translation into English
  • hypothesis: EN always appears as a source or target language for all pairs → better decoder?
Google’s Neural Machine Translation System (Wu et al., 2016)
• An encoder, a decoder and an attention network
• 8 layers deep, with residual connections
• Refinement with Reinforcement Learning
• Sub-word units… and more
Google’s Neural Machine Translation System (Wu et al., 2016)
• EN→FR training took 6 days on 96 GPUs(!), plus 3 more days for refinement…
Convolutional Encoder-Decoder (Gehring et al., 2017)
• Outperformed GNMT and was more efficient in terms of speed, but:
  • Lacks long-term memory
  • Requires meticulous initialization schemes and careful normalization
  • Requires positional embeddings
  • Requires more depth (15 layers)
Self-Attention or Transformer Networks
• Encode/decode the input without using CNNs or LSTMs
  • Lower training cost, but lacks long-term memory
  • Stacked self-attention with multiple heads
• To capture sequence information it uses positional embeddings (sinusoids)
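A NumPy sketch of the two ingredients named above: scaled dot-product self-attention (one head) and sinusoidal positional embeddings; the dimensions are illustrative, and real models add learned projections per head, residual connections and layer normalization:

```python
import numpy as np

def self_attention(X):
    """One attention head over a sequence X of shape (T, d): softmax(QK^T / sqrt(d)) V."""
    Q, K, V = X, X, X                       # real models use learned projections of X
    d = X.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # pairwise relevance between positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)      # softmax over positions
    return w @ V                            # every position attends to every other one

def positional_embeddings(T, d):
    """Sinusoids of different frequencies encode the position of each token."""
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

T, d = 6, 16
X = np.random.randn(T, d) + positional_embeddings(T, d)  # add position information
print(self_attention(X).shape)              # (6, 16)
```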
Self-Attention or Transformer Networks
(Figure: attention weights from two different heads, Head 1 and Head 2)
Best of both worlds: LSTMs & Transformer Networks (Chen & Firat & Bapna et al., 2018)
• The Transformer network uses training techniques that other models do not use
• Combine the strengths of both worlds
• CNNs fall behind in BLEU and convergence speed
Outline of the talk
1. Introduction and Motivation
2. Word Representation Learning
   • Semantic similarity
   • Traditional and recent approaches
   • Intrinsic and extrinsic evaluation
3. Word Sequence Modeling
   • Essentials: RNNs, Attention, DL tricks
   • Text Classification
   • Machine Translation
4. Conclusion and Discussion
Conclusion
• Deep learning for NLP has flourished in the last couple of years
  ➡ The attention mechanism became popular even outside NLP
  ➡ Companies heavily use NLP, e.g. machine translation
• Although attention-based models can get us far:
  ➡ Linguistic structure can get us even further, by constraining models and creating useful inductive biases
  ➡ Inspecting and analyzing the learned structures can help us gain insights about language
  ➡ There is still a lot of work to be done on weakly-supervised and unsupervised learning
Future Challenges
• Transferring knowledge across domains, languages and different outputs
• Contextualizing word representations
• Generalizing to unseen examples
• Learning from very few examples
• Summarization / Entailment / Reasoning
Discussion / Coding Session
• Learning word embeddings
  https://www.tensorflow.org/tutorials/word2vec
  https://nlp.stanford.edu/projects/glove/
• Classification using pre-trained word embeddings
  https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
➡ https://www.tensorflow.org/install/
➡ https://keras.io/backend/
➡ https://keras.io/#installation
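For the coding session, a condensed sketch along the lines of the Keras blog tutorial above: load GloVe vectors into an Embedding layer and train a small classifier on top (assuming TensorFlow 2.x / tf.keras; the word_index, the training data and the file path glove.6B.100d.txt are placeholders you would produce with a Tokenizer and the GloVe download):

```python
import numpy as np
from tensorflow.keras import layers, models

embed_dim, max_len = 100, 200
word_index = {"the": 1, "bank": 2, "river": 3}   # placeholder: comes from a Tokenizer

# Build the embedding matrix from the pre-trained GloVe vectors.
embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        if word in word_index:
            embedding_matrix[word_index[word]] = np.asarray(values, dtype="float32")

model = models.Sequential([
    layers.Embedding(len(word_index) + 1, embed_dim,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False),            # keep the pre-trained vectors frozen
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=5, validation_split=0.1)
```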