
Natural Language Processing with Deep Learning

CS224N/Ling284

Christopher Manning and Richard Socher

Lecture 14: Tree Recursive Neural Networks and Constituency Parsing

Lecture Plan

1. Motivation: Compositionality and Recursion
2. Structure prediction with simple TreeRNN: Parsing
3. Research highlight: Deep Reinforcement Learning for Dialogue Generation
4. Backpropagation through Structure
5. More complex units

Reminders/comments: Learn up on GPUs, Azure, Docker. Assignment 4: Get something working, using a GPU for the milestone. Final project discussions: come meet with us!

1. The spectrum of language in CS


Semantic interpretation of language: not just word vectors

How can we know when larger units are similar in meaning?

• The snowboarder is leaping over a mogul
• A person on a snowboard jumps into the air

People interpret the meaning of larger text units (entities, descriptive terms, facts, arguments, stories) by semantic composition of smaller elements

Compositionality

Language understanding (and Artificial Intelligence) requires being able to understand bigger things from knowing about smaller parts


Are languages recursive?

• Cognitively somewhat debatable
• But: recursion is natural for describing language
• [The man from [the company that you spoke with about [the project] yesterday]]
• noun phrase containing a noun phrase containing a noun phrase
• Arguments for now: 1) Helpful in disambiguation

Is recursion useful?

2) Helpful for some tasks to refer to specific phrases:
• John and Jane went to a big festival. They enjoyed the trip and the music there.
  • "they": John and Jane
  • "the trip": went to a big festival
  • "there": big festival
3) Works better for some tasks to use grammatical tree structure
• It's a powerful prior for language structure

Building on Word Vector Space Models

How can we represent the meaning of longer phrases?
By mapping them into the same vector space!
"the country of my birth" ≈ "the place where I was born"
[Figure: a 2-D word vector space (axes x1, x2) in which Monday lies near Tuesday and France lies near Germany]
How should we map phrases into a vector space?
[Figure: the word vectors of "the country of my birth" are composed bottom-up into phrase vectors]

Use the principle of compositionality: the meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.

Models in this section can jointly learn parse trees and compositional vector representations.

[Figure: the shared vector space again, now with the phrases "the country of my birth" and "the place where I was born" plotted alongside Monday, Tuesday, France, and Germany]

Constituency Sentence Parsing: What we want

[Figure: the desired constituency parse (S, VP, PP, NP nodes) over the word vectors of "The cat sat on the mat."]

Learn Structure and Representation

[Figure: the same parse tree, now with a learned vector at each internal node (NP, PP, VP, S) in addition to the word vectors of "The cat sat on the mat."]

Recursive vs. recurrent neural networks

[Figure: a recursive (tree-structured) network and a recurrent (chain-structured) network, each composing the word vectors of "the country of my birth" into a single vector]


• Recursive neural nets require a parser to get the tree structure
• Recurrent neural nets cannot capture phrases without prefix context and often capture too much of the last words in the final vector


From RNNs to CNNs


• RNN: Gets compositional vectors for grammatical phrases only
• CNN: Computes vectors for every possible phrase
• Example: "the country of my birth" computes vectors for:
  • the country, country of, of my, my birth, the country of, country of my, of my birth, the country of my, country of my birth
• Regardless of whether each is grammatical (many don't make sense)
• Doesn't need a parser
• But maybe not very linguistically or cognitively plausible

Relationship between RNNs and CNNs

[Figure: CNN vs. RNN composition over the sentence "people there speak slowly"]

2. Recursive Neural Networks for Structure Prediction

[Figure: a neural network takes the vector representations of two candidate children (e.g. for "on the mat") and outputs a candidate parent vector together with a plausibility score (e.g. 1.3)]

Inputs: two candidate children's representations.
Outputs:
1. The semantic representation if the two nodes are merged.
2. A score of how plausible the new node would be.


Recursive Neural Network Definition

score = U^T p
p = tanh(W [c1; c2] + b)
Same W parameters at all nodes of the tree
[Figure: the network combines children c1 and c2 into a parent vector p and a score]
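To make the definition concrete, here is a minimal NumPy sketch of a single composition step; the function and parameter names (compose, W, b, u) and the toy dimensionality are illustrative assumptions, not the course's reference code.

```python
import numpy as np

def compose(c1, c2, W, b, u):
    """One TreeRNN composition step.

    c1, c2 : n-dimensional child vectors
    W      : (n, 2n) composition matrix shared across all tree nodes
    b      : (n,) bias
    u      : (n,) scoring vector
    Returns the parent vector p = tanh(W [c1; c2] + b) and its score u^T p.
    """
    children = np.concatenate([c1, c2])   # stack the two children: shape (2n,)
    p = np.tanh(W @ children + b)         # parent representation
    score = float(u @ p)                  # how plausible this merge is
    return p, score

# Toy usage with random parameters (n = 4)
rng = np.random.default_rng(0)
n = 4
W, b, u = rng.normal(size=(n, 2 * n)), np.zeros(n), rng.normal(size=n)
p, score = compose(rng.normal(size=n), rng.normal(size=n), W, b, u)
```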

Parsing a sentence with an RNN

[Figure: the network scores every pair of adjacent units in "The cat sat on the mat." (e.g. 0.1, 0.4, 2.3, 3.1, 0.3); the highest-scoring pair is merged first]

Parsing a sentence
[Figure: the greedy parser repeatedly re-scores adjacent candidate pairs and merges the highest-scoring pair, until a full tree over "The cat sat on the mat." is built]
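A minimal sketch of this greedy procedure, reusing the hypothetical compose function from the earlier sketch; as noted on a later slide, practical systems replace the pure greedy search with beam search over a chart.

```python
def greedy_parse(word_vectors, W, b, u):
    """Greedily build a binary tree: repeatedly merge the adjacent pair
    whose composed parent receives the highest score."""
    nodes = list(word_vectors)          # current frontier of vectors
    total_score = 0.0
    while len(nodes) > 1:
        # Score every adjacent pair on the frontier
        candidates = [compose(nodes[i], nodes[i + 1], W, b, u)
                      for i in range(len(nodes) - 1)]
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        parent, score = candidates[best]
        total_score += score
        # Replace the two merged children with their new parent
        nodes[best:best + 2] = [parent]
    return nodes[0], total_score
```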

Max-Margin Framework: Details

• The score of a tree is computed by the sum of the parsing decision scores at each node
• x is the sentence; y is the parse tree


Max-Margin Framework: Details

• Similar to max-margin parsing (Taskar et al. 2004): a supervised max-margin objective (a standard form is sketched below)
• The loss penalizes all incorrect decisions
• Structure search for A(x) was greedy (join best nodes each time)
  • Instead: beam search with chart
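The objective itself was an image on the slide and did not survive extraction; a standard way to write the structured max-margin objective it refers to (an assumption about the exact formulation used) is

$$
J = \sum_i \Big[ s(x_i, y_i) - \max_{y \in A(x_i)} \big( s(x_i, y) + \Delta(y, y_i) \big) \Big],
$$

to be maximized, where $s(x, y)$ is the tree score above, $A(x_i)$ is the set of candidate trees for sentence $x_i$, and the margin $\Delta(y, y_i)$ grows with the number of incorrect decisions in $y$, which is how the loss penalizes all incorrect decisions.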


Backpropagation Through Structure

Introduced by Goller & Küchler (1996). Principally the same as general backpropagation. Three differences resulting from the recursion and tree structure:
1. Sum derivatives of W from all nodes (like RNN)
2. Split derivatives at each node (for tree)
3. Add error messages from parent + node itself


The second derivative in eq. 28 for output units is simply

$$
\frac{\partial a^{(n_l)}_i}{\partial W^{(n_l-1)}_{ij}}
= \frac{\partial}{\partial W^{(n_l-1)}_{ij}} z^{(n_l)}_i
= \frac{\partial}{\partial W^{(n_l-1)}_{ij}} \left( W^{(n_l-1)}_{i\cdot}\, a^{(n_l-1)} \right)
= a^{(n_l-1)}_j. \tag{46}
$$

We adopt standard notation and introduce the error $\delta$ related to an output unit:

$$
\frac{\partial E_n}{\partial W^{(n_l-1)}_{ij}}
= (y_i - t_i)\, a^{(n_l-1)}_j
= \delta^{(n_l)}_i\, a^{(n_l-1)}_j. \tag{47}
$$

So far, we only computed errors for output units; now we will derive $\delta$'s for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with the second-to-top layer weights, from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

$$
\frac{\partial E}{\partial W^{(n_l-2)}_{ij}}
= \sum_n \underbrace{\frac{\partial E_n}{\partial a^{(n_l)}}}_{\delta^{(n_l)}}
\frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}}
+ \lambda W^{(n_l-2)}_{ij}. \tag{48}
$$

Now,

$$
\begin{aligned}
(\delta^{(n_l)})^T \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}}
&= (\delta^{(n_l)})^T \frac{\partial z^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} && (49)\\
&= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, W^{(n_l-1)} a^{(n_l-1)} && (50)\\
&= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, W^{(n_l-1)}_{\cdot i}\, a^{(n_l-1)}_i && (51)\\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i}\, \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, a^{(n_l-1)}_i && (52)\\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i}\, \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, f\big(z^{(n_l-1)}_i\big) && (53)\\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i}\, \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, f\big(W^{(n_l-2)}_{i\cdot}\, a^{(n_l-2)}\big) && (54)\\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i}\, f'\big(z^{(n_l-1)}_i\big)\, a^{(n_l-2)}_j && (55)\\
&= \Big( (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \Big)\, f'\big(z^{(n_l-1)}_i\big)\, a^{(n_l-2)}_j && (56)\\
&= \underbrace{\Bigg( \sum_{j=1}^{s_{l+1}} W^{(n_l-1)}_{ji}\, \delta^{(n_l)}_j \Bigg) f'\big(z^{(n_l-1)}_i\big)}_{\delta^{(n_l-1)}_i}\; a^{(n_l-2)}_j && (57)\\
&= \delta^{(n_l-1)}_i\, a^{(n_l-2)}_j && (58)
\end{aligned}
$$

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So, we can write the $\delta$ errors of all layers $l$ (except the top layer) in vector format, using the Hadamard product $\circ$:

$$
\delta^{(l)} = \Big( (W^{(l)})^T \delta^{(l+1)} \Big) \circ f'\big(z^{(l)}\big), \tag{59}
$$

where the sigmoid derivative from eq. 14 gives $f'(z^{(l)}) = (1 - a^{(l)})\, a^{(l)}$. Using that definition, we get the hidden layer backprop derivatives:

$$
\frac{\partial}{\partial W^{(l)}_{ij}} E_R = a^{(l)}_j\, \delta^{(l+1)}_i + \lambda W^{(l)}_{ij}, \tag{60}
$$

which in one simplified vector notation becomes:

$$
\frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}. \tag{62}
$$

In summary, the backprop procedure consists of four steps:

1. Apply an input $x_n$ and forward propagate it through the network to get the hidden and output activations using eq. 18.
2. Evaluate $\delta^{(n_l)}$ for output units using eq. 42.
3. Backpropagate the $\delta$'s to obtain a $\delta^{(l)}$ for each hidden layer in the network using eq. 59.
4. Evaluate the required derivatives with eq. 62 and update all the weights using an optimization procedure such as conjugate gradient or L-BFGS. CG seems to be faster and work better when using mini-batches of training data to estimate the derivatives.

If you have any further questions or found errors, please send an email to richard@socher.org

5 Recursive Neural Networks

Same as backprop in the previous section, but splitting error derivatives and noting that the derivatives of the same W at each node can all be added up. Lastly, the deltas from the parent node and possible deltas from a softmax classifier at each node are just added.

BTS: 1) Sum derivatives of all nodes

You can actually assume it's a different W at each node. Intuition via example: if we take separate derivatives of each occurrence, we get the same result (see the identity below):
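The worked example on this slide did not survive extraction; the underlying identity (stated here in general form, as an assumption about the slide's specific example) is the multivariable chain rule for a shared parameter:

$$
\frac{d}{dW} f\big(g(W),\, h(W)\big)
= \frac{\partial f}{\partial g}\,\frac{dg}{dW} + \frac{\partial f}{\partial h}\,\frac{dh}{dW},
$$

i.e. treating each occurrence of W as if it were a separate matrix, taking the derivative with respect to each occurrence, and summing gives the same result as differentiating with respect to the shared W directly.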


BTS: 2) Split derivatives at each node

During forward prop, the parent is computed using the 2 children:
p = tanh(W [c1; c2] + b)
Hence, the errors need to be computed with respect to each of them, where each child's error is n-dimensional.
[Figure: the parent's incoming error is split between children c1 and c2]


BTS: 3) Add error messages

• At each node:
  • What came up (fprop) must come down (bprop)
  • Total error messages = error messages from parent + error message from own score

[Figure: a node's total error message combines the message from its parent and the error from its own score]

BTS Python Code: forwardProp

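The code image for this slide did not survive extraction. Below is a minimal sketch of what forward propagation through a fixed binary tree could look like, assuming a simple Node class and NumPy parameters; all names here are illustrative, not the course's released code.

```python
import numpy as np

class Node:
    """Binary tree node: a leaf holds a word vector; an internal node has two children."""
    def __init__(self, vec=None, left=None, right=None):
        self.vec = vec        # n-dim representation (given for leaves, computed for internal nodes)
        self.left = left
        self.right = right

def forward_prop(node, W, b, u):
    """Compute representations bottom-up and return the total tree score."""
    if node.left is None:                       # leaf: word vector already set
        return 0.0
    score = forward_prop(node.left, W, b, u) + forward_prop(node.right, W, b, u)
    children = np.concatenate([node.left.vec, node.right.vec])
    node.vec = np.tanh(W @ children + b)        # parent representation p = tanh(W [c1; c2] + b)
    return score + float(u @ node.vec)          # add this node's plausibility score u^T p
```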

BTS Python Code: backProp

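Likewise, the backProp code image is lost; here is a sketch of backpropagation through structure under the same assumptions, implementing the three points above (sum gradients of the shared W over all nodes, split the downward error between the two children, and add the error from the node's own score). It computes gradients of the total tree score produced by the forward_prop sketch.

```python
def back_prop(node, delta_from_parent, W, b, u, grads):
    """Accumulate gradients of the total tree score into `grads`.

    delta_from_parent : error message w.r.t. this node's vector, arriving from above
    grads             : dict with "W", "b", "u" arrays of matching shapes
    """
    if node.left is None:                        # leaf: stop (word-vector gradients omitted here)
        return
    # 3) Add error messages: message from the parent plus error from this node's own score
    delta_p = delta_from_parent + u              # d(u^T p)/dp = u
    delta_z = delta_p * (1.0 - node.vec ** 2)    # backprop through tanh
    children = np.concatenate([node.left.vec, node.right.vec])
    # 1) Sum derivatives of the shared parameters over all nodes
    grads["W"] += np.outer(delta_z, children)
    grads["b"] += delta_z
    grads["u"] += node.vec
    # 2) Split the downward error message between the two children
    delta_children = W.T @ delta_z
    n = node.left.vec.shape[0]
    back_prop(node.left, delta_children[:n], W, b, u, grads)
    back_prop(node.right, delta_children[n:], W, b, u, grads)

# Usage: grads = {"W": np.zeros_like(W), "b": np.zeros_like(b), "u": np.zeros_like(u)}
#        back_prop(root, np.zeros_like(b), W, b, u, grads)
```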


BTS: Optimization

• As before, we can plug the gradients into a standard off-the-shelf L-BFGS optimizer or SGD
• Best results with AdaGrad (Duchi et al., 2011); the update is shown below
• For a non-continuous objective, use the subgradient method (Ratliff et al. 2007)
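The AdaGrad update itself was an image on the slide; its standard form (Duchi et al., 2011), assumed here to be what the slide showed, is

$$
\theta_{t,i} \;=\; \theta_{t-1,i} \;-\; \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^{2}}}\; g_{t,i},
$$

where $g_{t,i}$ is the (sub)gradient of the objective with respect to parameter $i$ at step $t$ and $\alpha$ is a global learning rate; each parameter thus gets its own decaying effective learning rate.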


Deep Reinforcement Learning for Dialogue Generation
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao and Dan Jurafsky

Seq2Seq for Dialogue
Encode previous message(s) into a vector ("How are you")
Decode the vector into a response ("I am fine")


Train by maximizing p(response | input), where the response is produced by a human.

Problems with Seq2Seq
"How old are you?" → "I'm 16" / "16?" (reasonable, but unhelpful)
The conversation can also fall into a generic loop:
"I don't know what you're talking about" ↔ "You don't know what you're saying" (generic)
probable response != good response

What is a good response?
• Reasonable: p(response | input) is high according to the seq2seq model
• Non-repetitive: similarity between the response and previous messages is low
• Easy to answer: p("i don't know" | response) is low


Scoring function: R(response) = reasonable_score + nonrepetitive_score + easy_to_answer_score

Reinforcement Learning
Learn from rewards instead of from examples
1. Encode the input into a vector ("How are you")

2. Have the system generate a response ("I don't know")

3. Receive reward R(response) (e.g. R = -5 for "I don't know"), and train the system to maximize the reward
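The slides do not show the update rule; the standard policy-gradient (REINFORCE) estimator for maximizing expected reward, assumed here to be the training signal, is

$$
\nabla_\theta\, \mathbb{E}\big[R(\text{response})\big] \;\approx\; R(\text{response})\, \nabla_\theta \log p_\theta(\text{response} \mid \text{input}),
$$

so sampled responses that earn a high reward have their probability pushed up, and low-reward responses are pushed down.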

Quantitative Results

Qualitative Results
"How old are you?" → "I'm 16. Why are you asking?" → "I thought you were 12" → "What made you think so?"
compared with the earlier degenerate loop:
"I don't know what you're talking about" ↔ "You don't know what you're saying"

Conclusion
• Reinforcement learning is useful when we want our model to do more than produce a probable human label
• Many more applications of RL to NLP: information extraction, question answering, task-oriented dialogue, coreference resolution, and more

Discussion: Simple RNN
• Decent results with a single-matrix TreeRNN
• A single-weight-matrix TreeRNN could capture some phenomena, but is not adequate for more complex, higher-order composition and for parsing long sentences
• There is no real interaction between the input words
• The composition function is the same for all syntactic categories, punctuation, etc.
[Figure: the simple TreeRNN unit: children c1 and c2, shared matrix W, parent p, score s]

Version 2: Syntactically-Untied RNN
• A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure
• We use the discrete syntactic categories of the children to choose the composition matrix (sketched below)
• A TreeRNN can do better with a different composition matrix for different syntactic environments
• The result gives us better semantics
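A minimal sketch of the syntactically-untied composition step; the dictionary keyed by the children's category pair is an illustrative way to organize the per-environment matrices, not the paper's exact parameterization.

```python
import numpy as np

def su_compose(c1, cat1, c2, cat2, W_by_pair, b):
    """Syntactically-untied TreeRNN composition: the weight matrix is chosen
    by the discrete syntactic categories of the two children, e.g. ("DT", "NP")."""
    W = W_by_pair[(cat1, cat2)]                        # (n, 2n) matrix for this category pair
    return np.tanh(W @ np.concatenate([c1, c2]) + b)   # parent vector
```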

Compositional Vector Grammars
• Problem: Speed. Every candidate score in beam search needs a matrix-vector product.
• Solution: Compute the score only for a subset of trees coming from a simpler, faster model (PCFG)
  • Prunes very unlikely candidates for speed
  • Provides coarse syntactic categories of the children for each beam candidate
• Compositional Vector Grammar = PCFG + TreeRNN

Details: Compositional Vector Grammar
• Scores at each node are computed by a combination of the PCFG and the SU-RNN (see below)
• Interpretation: factoring discrete and continuous parsing in one model
• Socher et al. (2013)
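One standard way to write this combination, following the description in Socher et al. (2013) (the slide's own equation is not recoverable, so the notation below is an assumption), scores a node built by rule A → B C as

$$
s\big(p^{(A \to B\,C)}\big) \;=\; v_{(B,C)}^{\top}\, p \;+\; \log P(A \to B\,C),
$$

i.e. the SU-RNN score of the composed vector p plus the log probability of the PCFG rule.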

Related work for recursive neural networks
Pollack (1990): Recursive auto-associative memories
Previous Recursive Neural Network work by Goller & Küchler (1996) and Costa et al. (2003) assumed fixed tree structure and used one-hot vectors.
Hinton (1990) and Bottou (2011): Related ideas about recursive models and recursive operators as smooth versions of logic operations


Related Work for parsing
• The resulting CVG parser is related to previous work that extends PCFG parsers
• Klein and Manning (2003a): manual feature engineering
• Petrov et al. (2006): learning algorithm that splits and merges syntactic categories
• Lexicalized parsers (Collins, 2003; Charniak, 2000): describe each category with a lexical item
• Hall and Klein (2012): combine several such annotation schemes in a factored parser
• CVGs extend these ideas from discrete representations to richer continuous ones

Experiments
• Standard WSJ split, labeled F1
• Based on a simple PCFG with fewer states
• Fast pruning of search space, few matrix-vector products
• 3.8% higher F1, 20% faster than the Stanford factored parser

Parser                                                      Test, All Sentences (F1)
Stanford PCFG (Klein and Manning, 2003a)                    85.5
Stanford Factored (Klein and Manning, 2003b)                86.6
Factored PCFGs (Hall and Klein, 2012)                       89.4
Collins (Collins, 1997)                                     87.7
SSN (Henderson, 2004)                                       89.4
Berkeley Parser (Petrov and Klein, 2007)                    90.1
CVG (RNN) (Socher et al., ACL 2013)                         85.0
CVG (SU-RNN) (Socher et al., ACL 2013)                      90.4
Charniak, Self-Trained (McClosky et al. 2006)               91.0
Charniak, Self-Trained, Re-Ranked (McClosky et al. 2006)    92.1

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]
Learns a soft notion of head words. Initialization:
[Figure: learned composition matrices for category pairs such as NP-CC, NP-PP, PP-NP, PRP$-NP]

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]
[Figure: learned composition matrices for ADJP-NP, ADVP-ADJP, JJ-NP, DT-NP]

Analysis of resulting vector representations
All the figures are adjusted for seasonal variations
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns

Knight-Ridder wouldn't comment on the offer
1. Harsco declined to say what country placed the order
2. Coastal wouldn't disclose the terms

Sales grew almost 7% to $UNK m. from $UNK m.
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.

SU-RNN Analysis
• Can transfer semantic information from a single related example
• Train sentences:
  • He eats spaghetti with a fork.
  • She eats spaghetti with pork.
• Test sentences:
  • He eats spaghetti with a spoon.
  • He eats spaghetti with meat.

SU-RNN Analysis

Labeling in Recursive Neural Networks
• We can use each node's representation as features for a softmax classifier (see below):
[Figure: a softmax layer on top of a node's vector predicts its category, e.g. NP]
• Training is similar to the model in part 1, with standard cross-entropy error + scores
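In the standard formulation (the slide's own notation is not recoverable, so this is an assumption), the label distribution at a node with vector p is

$$
P(c \mid p) \;=\; \operatorname{softmax}(W_s\, p)_c,
$$

with a classifier matrix $W_s$ trained by cross-entropy alongside the structure scores.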

Version 3: Compositionality Through Recursive Matrix-Vector Spaces
One way to make the composition function more powerful was by untying the weights W.
But what if words act mostly as an operator, e.g. "very" in "very good"?
Proposal: a new composition function
Before: p = tanh(W [c1; c2] + b)

Version 3: Matrix-Vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]
Compositionality Through Recursive Matrix-Vector Recursive Neural Networks
Before: p = tanh(W [c1; c2] + b)
Now:    p = tanh(W [C2 c1; C1 c2] + b)
Each word and phrase now has both a vector (c) and a matrix (C); each child's matrix multiplies the other child's vector before the composition.

Matrix-Vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]
The parent's matrix P is likewise computed from the child matrices A and B, via a shared matrix that maps their concatenation back to matrix size.
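A minimal sketch of the matrix-vector composition, assuming each node carries an n-dimensional vector and an n×n matrix; the parameter names (W, W_M) follow common descriptions of the MV-RNN but are assumptions about the exact notation.

```python
import numpy as np

def mv_compose(c1, C1, c2, C2, W, W_M, b):
    """Matrix-Vector RNN composition step.

    c1, c2 : child vectors, shape (n,)
    C1, C2 : child matrices, shape (n, n); each acts as an operator on the other child
    W      : (n, 2n) vector-composition matrix
    W_M    : (n, 2n) matrix-composition matrix
    """
    p = np.tanh(W @ np.concatenate([C2 @ c1, C1 @ c2]) + b)  # parent vector
    P = W_M @ np.vstack([C1, C2])                            # parent matrix, shape (n, n)
    return p, P
```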

Predicting Sentiment Distributions
Good example for non-linearity in language

Classification of Semantic Relationships
• Can an MV-RNN learn how a large syntactic context conveys a semantic relationship?
• My [apartment]_e1 has a pretty large [kitchen]_e2 → component-whole relationship (e2, e1)
• Build a single compositional semantics for the minimal constituent including both terms

Classification of Semantic Relationships

Classifier   Features                                                                                      F1
SVM          POS, stemming, syntactic patterns                                                             60.1
MaxEnt       POS, WordNet, morphological features, noun compound system, thesauri, Google n-grams          77.6
SVM          POS, WordNet, prefixes, morphological features, dependency parse features, Levin classes, PropBank, FrameNet, NomLex-Plus, Google n-grams, paraphrases, TextRunner   82.2
RNN          –                                                                                             74.8
MV-RNN       –                                                                                             79.1
MV-RNN       POS, WordNet, NER                                                                             82.4

Scene Parsing
• The meaning of a scene image is also a function of smaller regions,
• how they combine as parts to form larger objects,
• and how the objects interact.
Similar principle of compositionality.

Algorithm for Parsing Images
Same Recursive Neural Network as for natural language parsing! (Socher et al. ICML 2011)
[Figure: parsing natural scene images: image segments (grass, tree, people, building) are mapped to features and composed into semantic representations]

Multi-class segmentation (Stanford Background Dataset, Gould et al. 2009)

Method                                                Accuracy
Pixel CRF (Gould et al., ICCV 2009)                   74.3
Classifier on superpixel features                     75.9
Region-based energy (Gould et al., ICCV 2009)         76.4
Local labelling (Tighe & Lazebnik, ECCV 2010)         76.9
Superpixel MRF (Tighe & Lazebnik, ECCV 2010)          77.5
Simultaneous MRF (Tighe & Lazebnik, ECCV 2010)        77.5
Recursive Neural Network                              78.1

QCD-Aware Recursive Neural Networks for Jet Physics
Gilles Louppe, Kyunghyun Cho, Cyril Becot, Kyle Cranmer