CS224d Deep NLP Lecture 4: Word Window Classification and Neural Networks Richard Socher
Page 1:

CS224d Deep NLP

Lecture 4: Word Window Classification and Neural Networks

Richard Socher

Page 2:

Overview Today:

• General classification background

• Updating word vectors for classification

• Window classification & cross entropy error derivation tips

• A single layer neural network!

• (Max-Margin loss and backprop)

Page 3:

Classification setup and notation

• Generally we have a training dataset consisting of samples

{x_i, y_i}_{i=1}^N

• x_i - inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.

• y_i - labels we try to predict, e.g.:
• other words
• class: sentiment, named entities, buy/sell decision
• later: multi-word sequences

Page 4:

Classification intuition

• Training data: {x_i, y_i}_{i=1}^N

• Simple illustration case:
• Fixed 2d word vectors to classify
• Using logistic regression
• → linear decision boundary

• General ML: assume x is fixed and only train logistic regression weights W, so we only modify the decision boundary


Visualizations with ConvNetJS by Karpathy! http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

Page 5:

Classification notation

• Cross entropy loss function over dataset {x_i, y_i}_{i=1}^N

• Where for each data pair (x_i, y_i):

• We can write f in matrix notation and index elements of it based on class:
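(The equations on this slide are images that did not survive extraction. Reconstructed, the standard softmax cross-entropy loss shown here is:

J(θ) = −(1/N) Σ_{i=1}^N log( e^{f_{y_i}} / Σ_{c=1}^C e^{f_c} )

where for each pair (x_i, y_i), f = W x_i and f_c is the score of class c.)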

Page 6:

Classification: Regularization!

• Really full loss function over any dataset includes regularization over all parameters θ:
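(Reconstructed, the regularized objective adds an L2 penalty to the loss above: J(θ) = −(1/N) Σ_i log( e^{f_{y_i}} / Σ_c e^{f_c} ) + λ Σ_k θ_k².)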

• Regularization will prevent overfitting when we have a lot of features (or later a very powerful/deep model)
• x-axis: more powerful model or more training iterations

• Blue: training error, red: test error

Page 7:

Details: General ML optimization

• For general machine learning, θ usually only consists of columns of W:

• So we only update the decision boundary


Visualizations with ConvNetJS by Karpathy

Page 8:

Classification difference with word vectors

• Common in deep learning:
• Learn both W and word vectors x


Very large!

Overfitting danger!

Page 9:

Losing generalization by re-training word vectors

• Setting: Training logistic regression for movie review sentiment, and in the training data we have the words
• "TV" and "telly"

• In the testing data we have
• "television"

• Originally they were all similar (from pre-training word vectors)

• What happens when we train the word vectors?

[Figure: word vectors for "TV", "telly", and "television" close together in vector space]

Page 10:

Losing generalization by re-training word vectors

• What happens when we train the word vectors?
• Those that are in the training data move around
• Words from pre-training that do NOT appear in training stay

• Example:
• In training data: "TV" and "telly"
• In testing data only: "television"

[Figure: "TV" and "telly" have moved; "television" is left behind :( ]

Page 11:

Losing generalization by re-training word vectors

• Take home message:

If you only have a small training dataset, don't train the word vectors.

If you have a very large dataset, it may work better to train word vectors to the task.


Page 12:

Side note on word vectors notation

• The word vector matrix L is also called lookup table
• Word vectors = word embeddings = word representations (mostly)
• Mostly from methods like word2vec or GloVe

[Figure: L is a d × |V| matrix with one column per vocabulary word: aardvark, a, …, meta, …, zebra]

• These are the word features x_word from now on

• Conceptually you get a word's vector by left-multiplying a one-hot vector e by L: x = Le, where L ∈ R^{d×|V|} and e ∈ R^{|V|×1}, so x ∈ R^{d×1}
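A minimal NumPy sketch of this lookup (dimensions and the word index are toy assumptions):

import numpy as np

d, V = 5, 10                 # toy sizes: d-dimensional vectors, |V| words
L = np.random.randn(d, V)    # lookup table, one column per vocabulary word

word_index = 3               # hypothetical index of some word
e = np.zeros((V, 1))         # one-hot vector e in R^{|V| x 1}
e[word_index] = 1.0

x = L @ e                    # x = Le, in R^{d x 1}
# In practice you just index the column; the two are identical:
assert np.allclose(x[:, 0], L[:, word_index])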

Page 13:

Window classification

• Classifying single words is rarely done.

• Interesting problems like ambiguity arise in context!

• Example: auto-antonyms:
• "To sanction" can mean "to permit" or "to punish."
• "To seed" can mean "to place seeds" or "to remove seeds."

• Example: ambiguous named entities:
• Paris → Paris, France vs. Paris Hilton
• Hathaway → Berkshire Hathaway vs. Anne Hathaway

Page 14:

Window classification

• Idea: classify a word in its context window of neighboring words.

• For example, named entity recognition into 4 classes:
• Person, location, organization, none

• Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information

Page 15:

Window classification

• Train softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it

• Example: Classify Paris in the context of this sentence with window length 2:

… museums in Paris are amazing …

x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]^T

• Resulting vector x_window = x ∈ R^{5d}, a column vector!
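A minimal NumPy sketch of this concatenation (the lookup table and vocabulary indices are toy assumptions, as above):

import numpy as np

d = 5
words = ["museums", "in", "Paris", "are", "amazing"]
idx = {w: i for i, w in enumerate(words)}     # hypothetical vocabulary indices
L = np.random.randn(d, len(words))            # toy lookup table

# Concatenate the five word vectors into one 5d-dimensional window vector:
x_window = np.concatenate([L[:, idx[w]] for w in words])
assert x_window.shape == (5 * d,)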

Page 16:

Simplest window classifier: Softmax

• With x = x_window we can use the same softmax classifier as before

• With cross entropy error as before:

• But how do you update the word vectors?
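(The formulas on this slide are images; they are the same softmax and cross-entropy as before. Reconstructed: the predicted model output probability is ŷ_y = p(y | x_window) = e^{f_y} / Σ_c e^{f_c} with f = W x_window, plugged into the cross-entropy error.)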

Page 17:

Updating concatenated word vectors

• Short answer: Just take derivatives as before

• Long answer: Let's go over the steps together (you'll have to fill in the details in PSet 1!)

• Define:
• ŷ : softmax probability output vector (see previous slide)
• t : target probability distribution (all 0's except at ground truth index of class y, where it's 1)

• and f_c = c'th element of the f vector

• Hard, the first time, hence some tips now :)

Page 18:

• Tip 1: Carefully define your variables and keep track of their dimensionality!

• Tip 2: Know thy chain rule and don't forget which variables depend on what:

• Tip 3: For the softmax part of the derivative: First take the derivative wrt f_c when c = y (the correct class), then take the derivative wrt f_c when c ≠ y (all the incorrect classes)

Updating concatenated word vectors

Page 19:

• Tip 4: When you take the derivative wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives:

• Tip 5: To later not go insane (and for the implementation!) → express the results in terms of vector operations and define single index-able vectors:

Updating concatenated word vectors

Page 20:

• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. x_i or W_ij

• Tip 7: To clean it up for even more complex functions later: Know the dimensionality of variables & simplify into matrix notation

• Tip 8: Write this out in full sums if it's not clear!

Updating concatenated word vectors

Page 21:

• What is the dimensionality of the window vector gradient?

• x is the entire window, 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality:
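(The equation is an image; reconstructed from the softmax gradient: ∂J/∂x = W^T (ŷ − t) ∈ R^{5d}.)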

Updating concatenated word vectors

Page 22:

• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:

• Let δ_window = ∂J/∂x_window
• With x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]

• We have (reconstructed): δ_window = [ ∂J/∂x_museums ; ∂J/∂x_in ; ∂J/∂x_Paris ; ∂J/∂x_are ; ∂J/∂x_amazing ]
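A hedged NumPy sketch of this split (delta_window stands in for the gradient above; all values are toy placeholders):

import numpy as np

d = 5
delta_window = np.random.randn(5 * d)   # gradient w.r.t. the whole window
grads = delta_window.reshape(5, d)      # row 0 -> x_museums, ..., row 4 -> x_amazing
# Each row is the update that gets added into that word's column of the lookup table L.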

Updating concatenated word vectors

Page 23:

• This will push word vectors into areas such that they will be helpful in determining named entities.

• For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location

Updating concatenated word vectors

Page 24:

• The gradient of J wrt the softmax weights W!

• Similar steps, write down the partial wrt W_ij first!
• Then we have the full ∂J/∂W

What's missing for training the window model?

Page 25:

A note on matrix implementations


• There are two expensive operations in the softmax:

• The matrix multiplication and the exp

• A for loop is never as efficient as a single larger matrix multiplication!

• Example code →

Page 26:

A note on matrix implementations


• Looping over word vectors, instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix

• 1000 loops, best of 3: 639 µs per loop
10000 loops, best of 3: 53.8 µs per loop
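A hedged reconstruction of the kind of comparison being timed (the slide's actual code is an image; all names and sizes here are assumptions):

import numpy as np

C, d, N = 5, 50, 500                    # classes, vector dimension, number of windows
W = np.random.randn(C, d)               # softmax weights
wordvectors_list = [np.random.randn(d, 1) for _ in range(N)]

# Slow: one small matrix-vector product per word vector (the loop being timed)
scores_loop = [W @ x for x in wordvectors_list]

# Fast: concatenate into one d x N matrix, then do a single multiplication
X = np.hstack(wordvectors_list)         # shape (d, N)
scores_matrix = W @ X                   # shape (C, N)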

Page 27:

A note on matrix implementations


• Result of the faster method is a C × N matrix:

• Each column is an f(x) in our notation (unnormalized class scores)

• Matrices are awesome!

• You should speed test your code a lot too

Page 28:

Softmax (= logistic regression) is not very powerful


• Softmax only gives linear decision boundaries in the original space.

• With little data that can be a good regularizer

• With more data it is very limiting!

Page 29:

Softmax (= logistic regression) is not very powerful


• Softmax gives only linear decision boundaries

• → Lame when the problem is complex

• Wouldn't it be cool to get these correct?

Page 30:

Neural Nets for the Win!


• Neural networks can learn much more complex functions and nonlinear decision boundaries!

Page 31:

From logistic regression to neural nets

Page 32:

Demystifying neural networks

Neural networks come with their own terminological baggage

… just like SVMs

But if you understand how softmax models work

Then you already understand the operation of a basic neural network neuron!

A single neuron: a computational unit with n (= 3) inputs and 1 output, and parameters W, b

[Figure: a single neuron with inputs, an activation function, and one output; the bias unit corresponds to the intercept term]

Page 33:

A neuron is essentially a binary logistic regression unit

h_{w,b}(x) = f(w^T x + b)

f(z) = 1 / (1 + e^{−z})

w, b are the parameters of this neuron, i.e., this logistic regression model


b: We can have an "always on" feature, which gives a class prior, or separate it out, as a bias term
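A minimal NumPy sketch of this single neuron (weights and inputs are random placeholders):

import numpy as np

def f(z):                      # logistic (sigmoid) activation
    return 1.0 / (1.0 + np.exp(-z))

w = np.random.randn(3)         # one weight per input, n = 3
b = 0.0                        # bias / intercept term
x = np.random.randn(3)         # the three inputs

h = f(w @ x + b)               # h_{w,b}(x) = f(w^T x + b), a value in (0, 1)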

Page 34:

A neural network = running several logistic regressions at the same time

If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …

But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

Page 35:

A neural network = running several logistic regressions at the same time

… which we can feed into another logistic regression function

It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

Page 36:

A neural network = running several logistic regressions at the same time

Before we know it, we have a multilayer neural network …

Page 37:

Matrix notation for a layer

We have

a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
etc.

In matrix notation:

z = Wx + b
a = f(z)

where f is applied element-wise:

f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]

[Figure: network diagram with activations a_1, a_2, a_3 and labeled weights such as W_12 and bias b_3]
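A hedged NumPy sketch of one such layer (sizes are toy assumptions):

import numpy as np

def f(z):                          # element-wise sigmoid
    return 1.0 / (1.0 + np.exp(-z))

W = np.random.randn(3, 3)          # three neurons, three inputs
b = np.random.randn(3)
x = np.random.randn(3)

z = W @ x + b                      # z = Wx + b
a = f(z)                           # a = f(z), applied element-wise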

Page 38:

Non-linearities (f): Why they're needed

• Example: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform

• Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = Wx

• With more layers, they can approximate more complex functions!
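A quick NumPy check of that collapse claim (dimensions are arbitrary):

import numpy as np

W1 = np.random.randn(4, 6)
W2 = np.random.randn(6, 5)
x = np.random.randn(5)

# Two stacked linear layers equal one linear layer with W = W1 @ W2:
assert np.allclose(W1 @ (W2 @ x), (W1 @ W2) @ x)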

Page 39:

A more powerful window classifier

• Revisiting

• x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]

Page 40:

A Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity:

• The neural activations a can then be used to compute some function

• For instance, a softmax probability or an unnormalized score:

Page 41:

Summary: Feed-forward Computation

Computing a window's score with a 3-layer neural net: s = score(museums in Paris are amazing)

x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
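A hedged NumPy sketch of this feed-forward computation (layer sizes and initializations are toy assumptions):

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

d, H = 4, 8                          # word-vector dimension, hidden layer size
x_window = np.random.randn(5 * d)    # concatenated [x_museums ... x_amazing]
W = np.random.randn(H, 5 * d)
b = np.random.randn(H)
U = np.random.randn(H)

a = f(W @ x_window + b)              # hidden activations a = f(Wx + b)
s = U @ a                            # unnormalized window score s = U^T a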

Page 42:

Next lecture:

Training a window-based neural network.

Taking deeper derivatives → Backprop

Then we have all the basic tools in place to learn about more complex models :)

Page 43:

Probably for next lecture …

Page 44:

Another output layer and loss function combo!

• So far: softmax and cross-entropy error (the exp is slow)

• We don't always need probabilities; often unnormalized scores are enough to classify correctly.

• Also: Max-margin!

• More on that in future lectures!

Page 45:

Neural Net model to classify grammatical phrases

• Idea: Train a neural network to produce high scores for grammatical phrases of a specific length and low scores for ungrammatical phrases

• s = score(cat chills on a mat)

• s_c = score(cat chills Menlo a mat)

Page 46:

Another output layer and loss function combo!

• Idea for training objective:
• Make the score of the true window larger and the corrupt window's score lower (until they're good enough): minimize

• This is continuous, so we can perform SGD
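(The objective itself is an image on the slide; the standard max-margin loss it shows is, reconstructed: J = max(0, 1 − s + s_c).)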

Page 47:

Training with Backpropagation

Assuming cost J is > 0, it is simple to see that we can compute the derivatives of s and s_c wrt all the involved variables: U, W, b, x

Page 48:

Training with Backpropagation

• Let's consider the derivative of a single weight W_ij

• This only appears inside a_i

• For example: W_23 is only used to compute a_2

[Figure: network with inputs x_1, x_2, x_3 and bias +1, hidden units a_1, a_2, and score s = U^T a; the edge W_23 into a_2 is highlighted]

Page 49:

Training with Backpropagation

Derivative of weight W_ij:


Page 50:

where for logistic f: f′(z) = f(z)(1 − f(z))

Training with Backpropagation

Derivative of single weight W_ij (reconstructed): ∂s/∂W_ij = δ_i x_j

δ_i = U_i f′(z_i) is the local error signal; x_j is the local input signal


Page 51:

• We want all combinations of i = 1, 2 and j = 1, 2, 3

• Solution: the outer product ∂s/∂W = δ x^T (reconstructed), where δ is the "responsibility" coming from each activation a; see the sketch below

Training with Backpropagation

• From single weight W_ij to full W:

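A hedged NumPy sketch of this full gradient (sigmoid activations; values are toy placeholders; shapes follow the figure above):

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(3)           # inputs x_1, x_2, x_3
W = np.random.randn(2, 3)        # weights into a_1, a_2
b = np.random.randn(2)
U = np.random.randn(2)

z = W @ x + b
a = f(z)

delta = U * a * (1 - a)          # delta_i = U_i f'(z_i), using f'(z) = f(z)(1 - f(z))
grad_W = np.outer(delta, x)      # ∂s/∂W = delta x^T, shape (2, 3)
grad_b = delta                   # bias gradient, as on the next slide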

Page 52:

Training with Backpropagation

• For biases b, we get (reconstructed): ∂s/∂b_i = δ_i


Page 53:

Training with Backpropagation

That's almost backpropagation. It's simply taking derivatives and using the chain rule!

Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers

Example: last derivatives of the model, the word vectors in x

Page 54:

Training with Backpropagation

• Take the derivative of the score with respect to a single word vector (for simplicity a 1d vector, but the same holds if it was longer)

• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above and hence x_j influences the overall score through all of these, hence:

Re-used part of the previous derivative
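(Reconstructed: ∂s/∂x_j = Σ_i ∂s/∂a_i · ∂a_i/∂x_j = Σ_i δ_i W_ij, i.e. ∂s/∂x = W^T δ, re-using the δ computed for the layer above.)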

Page 55:

Summary
