Natural Language Processing with Deep Learning
CS224N/Ling284
Lecture 4: Word Window Classification and Neural Networks
Christopher Manning and Richard Socher
Overview
Today:
• Classification background
• Updating word vectors for classification
• Window classification & cross entropy error derivation tips
• A single layer neural network!
• Max-Margin loss and backprop
This lecture will help a lot with PSet 1 :)
Classification setup and notation
• Generally we have a training dataset consisting of samples
  $\{x_i, y_i\}_{i=1}^{N}$
• $x_i$ – inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.
• $y_i$ – labels we try to predict, for example
  • classes: sentiment, named entities, buy/sell decision
  • other words
  • later: multi-word sequences
Classification intuition
• Training data: $\{x_i, y_i\}_{i=1}^{N}$
• Simple illustration case:
  • Fixed 2d word vectors to classify
  • Using logistic regression
  • → linear decision boundary
• General ML: assume x is fixed, train logistic regression weights W → only modify the decision boundary
• Goal: predict for each x the probability of its class:
  $p(y|x) = \frac{\exp(W_y \cdot x)}{\sum_{c=1}^{C} \exp(W_c \cdot x)}$
• Visualizations with ConvNetJS by Karpathy! http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Details of the softmax
• We can tease apart $p(y|x)$ into two steps:
1. Take the y'th row of W and multiply that row with x:
   $f_y = W_y \cdot x = \sum_{j=1}^{d} W_{yj} x_j$
   Compute all $f_c$ for $c = 1, \dots, C$
2. Normalize to obtain the probability with the softmax function:
   $p(y|x) = \frac{\exp(f_y)}{\sum_{c=1}^{C} \exp(f_c)} = \mathrm{softmax}(f)_y$
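A minimal NumPy sketch of these two steps (toy sizes, not from the slides):

```python
import numpy as np

def softmax(f):
    """Normalize unnormalized class scores f into probabilities.

    Subtracting the max first is a standard numerical-stability trick;
    it doesn't change the result because softmax is shift-invariant.
    """
    f = f - np.max(f)
    exp_f = np.exp(f)
    return exp_f / np.sum(exp_f)

W = np.random.randn(4, 10)   # C = 4 classes, d = 10 features (illustrative)
x = np.random.randn(10)
f = W.dot(x)                 # step 1: f_c = W_c . x for all classes
p = softmax(f)               # step 2: p[y] = p(y|x)
```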
The softmax and cross-entropy error
• For each training example {x, y}, our objective is to maximize the probability of the correct class y
• Hence, we minimize the negative log probability of that class:
  $J = -\log p(y|x) = -\log \frac{\exp(f_y)}{\sum_{c=1}^{C} \exp(f_c)}$
Background: Why "cross entropy" error
• Assume a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0]. If our computed probability is q, then the cross entropy is:
  $H(p, q) = -\sum_{c=1}^{C} p(c) \log q(c)$
• Because of the one-hot p, the only term left is the negative log probability of the true class
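A quick sketch of that reduction (toy numbers, just for illustration):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) * log q(c)."""
    return -np.sum(p * np.log(q))

q = np.array([0.1, 0.7, 0.2])   # model probabilities (e.g. a softmax output)
p = np.array([0.0, 1.0, 0.0])   # one-hot target: true class is index 1

# With one-hot p, the cross entropy reduces to -log q[true class]:
assert np.isclose(cross_entropy(p, q), -np.log(q[1]))
```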
Side note: The KL divergence
• Cross-entropy can be re-written in terms of the entropy and the Kullback-Leibler divergence between the two distributions:
  $H(p, q) = H(p) + D_{KL}(p \| q)$
• Because H(p) is zero in our case (and even if it weren't, it would be fixed and have no contribution to the gradient), minimizing this is equal to minimizing the KL divergence between p and q
• The KL divergence is not a distance but a non-symmetric measure of the difference between two probability distributions p and q:
  $D_{KL}(p \| q) = \sum_{c=1}^{C} p(c) \log \frac{p(c)}{q(c)}$
Classification over a full dataset
• Cross entropy loss function over the full dataset $\{x_i, y_i\}_{i=1}^{N}$:
  $J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^{C} \exp(f_c)}$
• Instead of $f_y = f_y(x) = W_y \cdot x = \sum_{j=1}^{d} W_{yj} x_j$
• We will write f in matrix notation: $f = Wx$
  • We can still index elements of it based on class
Classification: Regularization!
• The really full loss function over any dataset includes regularization over all parameters $\theta$:
  $J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^{C} \exp(f_c)} + \lambda \sum_{k} \theta_k^2$
• Regularization will prevent overfitting when we have a lot of features (or later a very powerful/deep model)
[Figure: training error (blue) and test error (red) as the x-axis grows: a more powerful model or more training iterations]
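A minimal sketch of the regularized loss (the helper name `regularized_loss` and all sizes are illustrative, not course code):

```python
import numpy as np

def regularized_loss(W, X, y, lam=1e-3):
    """Average cross entropy over a dataset plus an L2 penalty on W.

    X: (N, d) inputs, y: (N,) integer labels, W: (C, d) weights.
    """
    F = X.dot(W.T)                                # (N, C) scores f = Wx
    F -= F.max(axis=1, keepdims=True)             # numerical stability
    log_probs = F - np.log(np.exp(F).sum(axis=1, keepdims=True))
    data_loss = -log_probs[np.arange(len(y)), y].mean()
    reg_loss = lam * np.sum(W ** 2)               # lambda * sum_k theta_k^2
    return data_loss + reg_loss
```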
Details: General ML optimization
• For general machine learning, $\theta$ usually consists only of the columns of W:
  $\theta = W(:) \in \mathbb{R}^{Cd}$
• So we only update the decision boundary (visualizations with ConvNetJS by Karpathy)
Classification difference with word vectors
• Common in deep learning:
  • Learn both W and the word vectors x
  • Then $\theta$ also contains all the word vectors, so it is very large!
• Overfitting danger!
Losing generalization by re-training word vectors
• Setting: We train a logistic regression for movie review sentiment on single words, and in the training data we have
  • "TV" and "telly"
• In the testing data we have
  • "television"
• Originally they were all similar (from pre-training the word vectors)
• What happens when we train the word vectors?
[Figure: "TV", "telly", and "television" start out close together in vector space]
Losing generalization by re-training word vectors
• What happens when we train the word vectors?
  • Those that are in the training data move around
  • Words from pre-training that do NOT appear in training stay put
• Example:
  • In training data: "TV" and "telly"
  • Only in testing data: "television"
[Figure: "TV" and "telly" move during training while "television" stays behind :(]
Losing generalization by re-training word vectors
• Take home message: If you only have a small training dataset, don't train the word vectors. If you have a very large dataset, it may work better to train word vectors to the task.
[Figure: the same "TV" / "telly" / "television" illustration]
Side note on word vectors notation
• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Mostly from methods like word2vec or GloVe
• L is a d × V matrix with one column per vocabulary word: [aardvark … meta … zebra]
• These are the word features $x_{word}$ from now on
• New development (later in the class): character models
Window classification
• Classifying single words is rarely done.
• Interesting problems like ambiguity arise in context!
• Example: auto-antonyms:
  • "To sanction" can mean "to permit" or "to punish."
  • "To seed" can mean "to place seeds" or "to remove seeds."
• Example: ambiguous named entities:
  • Paris → Paris, France vs. Paris Hilton
  • Hathaway → Berkshire Hathaway vs. Anne Hathaway
Window classification
• Idea: classify a word in its context window of neighboring words.
• For example, named entity recognition into 4 classes:
  • Person, location, organization, none
• Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information
Window classification
• Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it
• Example: Classify "Paris" in the context of this sentence with window length 2:
  … museums in Paris are amazing …
  $x_{window} = [\,x_{museums}\;\, x_{in}\;\, x_{Paris}\;\, x_{are}\;\, x_{amazing}\,]^T$
• Resulting vector $x_{window} = x \in \mathbb{R}^{5d}$, a column vector! (see the sketch below)
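A minimal sketch of building the window vector (the random lookup table and sizes are illustrative only):

```python
import numpy as np

d = 4                                          # word vector dimensionality
words = ["museums", "in", "Paris", "are", "amazing"]
L = {w: np.random.randn(d) for w in words}     # stand-in lookup table

# Concatenate the center word with its two neighbors on each side:
x_window = np.concatenate([L[w] for w in words])
assert x_window.shape == (5 * d,)              # one long vector in R^{5d}
```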
Simplest window classifier: Softmax
• With $x = x_{window}$ we can use the same softmax classifier as before:
  $p(y|x) = \frac{\exp(W_y \cdot x)}{\sum_{c=1}^{C} \exp(W_c \cdot x)}$ (the same predicted model output probability)
• With cross entropy error as before:
  $J = -\log p(y|x)$
• But how do you update the word vectors?
Updating concatenated word vectors
• Short answer: Just take derivatives as before
• Long answer: Let's go over the steps together (helpful for PSet 1)
• Define:
  • $\hat{y}$: softmax probability output vector (see previous slide)
  • $t$: target probability distribution (all 0's except at the ground truth index of class y, where it's 1)
  • and $f_c$ = c'th element of the f vector
• Hard the first time, hence some tips now :)
• Tip 1: Carefully define your variables and keep track of their dimensionality!
• Tip 2: Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:
  $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$
  • Simple example: for $y = u^2$ and $u = 3x$, $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} = 2u \cdot 3 = 18x$
Updating concatenated word vectors
• Tip 2 continued: Know thy chain rule
  • Don't forget which variables depend on what, and that x appears inside all elements of f
• Tip 3: For the softmax part of the derivative: First take the derivative wrt $f_c$ when c = y (the correct class), then take the derivative wrt $f_c$ when c ≠ y (all the incorrect classes)
Updating concatenated word vectors
• Tip 4: When you take the derivative wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives:
  $\frac{\partial J}{\partial f} = \hat{y} - t$
• Tip 5: To later not go insane (& for implementation!) → express results in terms of vector operations and define single index-able vectors:
  $\delta = \hat{y} - t$ (sketched below)
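A small sketch of this error vector (toy sizes, illustrative only):

```python
import numpy as np

f = np.array([1.0, 2.0, 0.5])        # unnormalized scores, C = 3
y_hat = np.exp(f - f.max())
y_hat /= y_hat.sum()                  # softmax probabilities

t = np.zeros_like(y_hat)
t[1] = 1.0                            # one-hot target: true class y = 1

delta = y_hat - t                     # dJ/df for J = -log y_hat[y]
```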
Updating concatenated word vectors
• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. $x_i$ or $W_{ij}$
• Tip 7: To clean it up for even more complex functions later: Know the dimensionality of variables & simplify into matrix notation
• Tip 8: Write this out in full sums if it's not clear!
Updating concatenated word vectors
• What is the dimensionality of the window vector gradient?
• x is the entire window of 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality:
  $\frac{\partial J}{\partial x} \in \mathbb{R}^{5d}$
Updating concatenated word vectors
• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:
• Let $\delta_{window} = \frac{\partial J}{\partial x} = W^T \delta \in \mathbb{R}^{5d}$
• With $x_{window} = [\,x_{museums}\;\, x_{in}\;\, x_{Paris}\;\, x_{are}\;\, x_{amazing}\,]$
• We have $\delta_{window} = [\, \nabla_{x_{museums}};\; \nabla_{x_{in}};\; \nabla_{x_{Paris}};\; \nabla_{x_{are}};\; \nabla_{x_{amazing}} \,]$ (sketched below)
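A minimal sketch of the split (sizes illustrative; `delta` stands in for the softmax error vector $\hat{y} - t$ from before):

```python
import numpy as np

d, C = 4, 3
W = np.random.randn(C, 5 * d)
delta = np.random.randn(C)            # stand-in for y_hat - t

grad_window = W.T.dot(delta)          # dJ/dx, shape (5d,)

words = ["museums", "in", "Paris", "are", "amazing"]
per_word = np.split(grad_window, 5)   # one (d,) gradient per word
updates = dict(zip(words, per_word))  # e.g. updates["in"] updates x_in
```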
Updating concatenated word vectors
• This will push word vectors into areas such that they will be helpful in determining named entities.
• For example, the model can learn that seeing $x_{in}$ as the word just before the center word is indicative of the center word being a location
Updating concatenated word vectors
• The gradient of J wrt the softmax weights W!
• Similar steps: write down the partial wrt $W_{ij}$ first!
• Then we have the full $\frac{\partial J}{\partial W}$
What's missing for training the window model?
A note on matrix implementations
• There are two expensive operations in the softmax:
• The matrix multiplication $Wx$ and the exp
• A for loop is never as efficient as one large matrix multiplication when you implement it!
• Example code →
A note on matrix implementations
• Looping over word vectors instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:
• 1000 loops, best of 3: 639 µs per loop
  10000 loops, best of 3: 53.8 µs per loop
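A sketch of the kind of comparison those timings refer to (the exact course code isn't shown here; all sizes are illustrative):

```python
import numpy as np
import timeit

C, d, N = 500, 300, 100
W = np.random.randn(C, d)
wordvectors = [np.random.randn(d) for _ in range(N)]

def loop_version():
    # One matrix-vector product per word vector
    return [W.dot(x) for x in wordvectors]

def matrix_version():
    # Stack the word vectors into a d x N matrix, then one big multiply
    X = np.array(wordvectors).T       # (d, N)
    return W.dot(X)                   # (C, N)

print(timeit.timeit(loop_version, number=100))    # slower
print(timeit.timeit(matrix_version, number=100))  # faster
```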
A note on matrix implementations
• The result of the faster method is a C × N matrix:
• Each column is an f(x) in our notation (unnormalized class scores)
• Matrices are awesome!
• You should speed test your code a lot too
Softmax (= logistic regression) alone is not very powerful
• Softmax only gives linear decision boundaries in the original space.
• With little data that can be a good regularizer
• With more data it is very limiting!
Softmax (= logistic regression) is not very powerful
• Softmax gives only linear decision boundaries
• → Lame when the problem is complex
• Wouldn't it be cool to get these correct?
Neural Nets for the Win!
• Neural networks can learn much more complex functions and nonlinear decision boundaries!
From logistic regression to neural nets
Demystifying neural networks
Neural networks come with their own terminological baggage.
But if you understand how softmax models work,
then you already understand the operation of a basic neuron!
A single neuron
A computational unit with n (3) inputs and 1 output, and parameters W, b
[Figure: neuron diagram with labeled inputs, activation function, and output; the bias unit corresponds to the intercept term]
A neuron is essentially a binary logistic regression unit
$h_{w,b}(x) = f(w^T x + b)$
$f(z) = \frac{1}{1 + e^{-z}}$
w, b are the parameters of this neuron, i.e., this logistic regression model
b: We can have an "always on" feature, which gives a class prior, or separate it out, as a bias term
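A minimal sketch of such a neuron (toy weights, just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """h_{w,b}(x) = f(w^T x + b), a binary logistic regression unit."""
    return sigmoid(w.dot(x) + b)

w = np.array([0.5, -1.0, 2.0])   # weights for n = 3 inputs
b = 0.1                          # bias (intercept term)
x = np.array([1.0, 0.0, 1.0])
print(neuron(x, w, b))           # an output in (0, 1)
```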
A neural network = running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs…
But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!
A neural network = running several logistic regressions at the same time
…which we can feed into another logistic regression function.
It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
A neural network = running several logistic regressions at the same time
Before we know it, we have a multilayer neural network….
Matrix notation for a layer
We have
$a_1 = f(W_{11}x_1 + W_{12}x_2 + W_{13}x_3 + b_1)$
$a_2 = f(W_{21}x_1 + W_{22}x_2 + W_{23}x_3 + b_2)$
etc.
In matrix notation:
$z = Wx + b$
$a = f(z)$
where f is applied element-wise:
$f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$
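The same layer as a short NumPy sketch (toy sizes):

```python
import numpy as np

def f(z):
    """Element-wise logistic nonlinearity."""
    return 1.0 / (1.0 + np.exp(-z))

W = np.random.randn(3, 3)    # 3 neurons, each seeing 3 inputs
b = np.random.randn(3)
x = np.random.randn(3)

z = W.dot(x) + b             # pre-activations: z_i = W_i . x + b_i
a = f(z)                     # activations: a_i = f(z_i)
```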
Non-linearities (f): Why they're needed
• Example: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: $W_1 W_2 x = Wx$
• With more layers, they can approximate more complex functions!
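A quick numerical check of the collapse claim (shapes illustrative):

```python
import numpy as np

W1 = np.random.randn(4, 5)
W2 = np.random.randn(5, 6)
x = np.random.randn(6)

W = W1.dot(W2)   # two stacked linear layers compile into one 4 x 6 matrix
assert np.allclose(W1.dot(W2.dot(x)), W.dot(x))
```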
A more powerful, neural net window classifier
• Revisiting
• $X_{window} = [\,x_{museums}\;\, x_{in}\;\, x_{Paris}\;\, x_{are}\;\, x_{amazing}\,]$
• Assume we want to classify whether the center word is a location or not
A Single Layer Neural Network
• A single layer is a combination of a linear layer and a nonlinearity:
  $z = Wx + b$
  $a = f(z)$
• The neural activations a can then be used to compute some output
• For instance, a probability via softmax: $p(y \mid x) = \mathrm{softmax}(Wa)$
• Or an unnormalized score (even simpler): $s = U^T a \in \mathbb{R}$
Summary: Feed-forward Computation
Computing a window's score with a 3-layer neural net: s = score(museums in Paris are amazing)
$s = U^T f(Wx + b)$
$x_{window} = [\,x_{museums}\;\, x_{in}\;\, x_{Paris}\;\, x_{are}\;\, x_{amazing}\,]$
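A minimal sketch of this feed-forward score (the hidden size and other shapes are illustrative):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

d, H = 4, 8                       # word dimension, hidden layer size
x = np.random.randn(5 * d)        # concatenated window vector
W = np.random.randn(H, 5 * d)
b = np.random.randn(H)
U = np.random.randn(H)

a = f(W.dot(x) + b)               # hidden activations
s = U.dot(a)                      # unnormalized window score, a scalar
```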
Main intuition for extra layer
The layer learns non-linear interactions between the input word vectors.
Example: only if "museums" is the first vector should it matter that "in" is in the second position
$X_{window} = [\,x_{museums}\;\, x_{in}\;\, x_{Paris}\;\, x_{are}\;\, x_{amazing}\,]$
The max-margin loss
• s = score(museums in Paris are amazing)
• $s_c$ = score(Not all museums in Paris)
• Idea for training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough): minimize
  $J = \max(0,\, 1 - s + s_c)$
• This is continuous → we can use SGD
Max-margin Objective function
• Objective for a single window:
  $J = \max(0,\, 1 - s + s_c)$
• Each window with a location at its center should have a score +1 higher than any window without a location at its center
• xxx |←1→| ooo
• For the full objective function: Sample several corrupt windows per true one. Sum over all training windows. (A sketch follows below.)
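A minimal sketch of the loss over sampled corrupt windows (the scores here are made-up numbers; the sampling scheme is up to you):

```python
def max_margin_loss(s_true, corrupt_scores):
    """J = sum over corrupt windows of max(0, 1 - s + s_c)."""
    return sum(max(0.0, 1.0 - s_true + s_c) for s_c in corrupt_scores)

s = 2.5                             # score of a true (location-centered) window
corrupt = [0.3, 2.1, 1.8]           # scores of sampled corrupt windows
J = max_margin_loss(s, corrupt)     # only windows inside the margin contribute
```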
Training with Backpropagation
Assuming cost J is > 0, compute the derivatives of s and $s_c$ wrt all the involved variables: U, W, b, x
Training with Backpropagation
• Let's consider the derivative of a single weight $W_{ij}$
• This only appears inside $a_i$
• For example: $W_{23}$ is only used to compute $a_2$
[Figure: network with inputs $x_1, x_2, x_3, +1$, hidden units $a_1, a_2$, and output s via $U_2$; the edge $W_{23}$ and bias $b_2$ are highlighted]
Training with Backpropagation
Derivative of weight $W_{ij}$:
$\frac{\partial s}{\partial W_{ij}} = U_i f'(z_i)\, x_j$
where for logistic f: $f'(z) = f(z)(1 - f(z))$
[Figure: the same network diagram]
Training with Backpropagation
Derivative of a single weight $W_{ij}$:
$\frac{\partial s}{\partial W_{ij}} = \underbrace{U_i f'(z_i)}_{\text{local error signal}} \; \underbrace{x_j}_{\text{local input signal}}$
[Figure: the same network diagram]
• We want all combinations of i = 1, 2 and j = 1, 2, 3 → ?
• Solution: Outer product: $\frac{\partial s}{\partial W} = \delta x^T$, where $\delta_i = U_i f'(z_i)$ is the "responsibility" or error signal coming from each activation a (sketched below)
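A small sketch of the outer-product gradient (continuing the toy 3-input, 2-hidden-unit network; shapes illustrative):

```python
import numpy as np

def f(z):  return 1.0 / (1.0 + np.exp(-z))
def fp(z): return f(z) * (1.0 - f(z))    # logistic derivative

W = np.random.randn(2, 3); b = np.random.randn(2)
U = np.random.randn(2);    x = np.random.randn(3)

z = W.dot(x) + b
delta = U * fp(z)                  # error signal per hidden unit, shape (2,)
grad_W = np.outer(delta, x)        # all partials ds/dW_ij at once, shape (2, 3)
```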
Training with Backpropagation
• From single weight $W_{ij}$ to full W:
  $\frac{\partial s}{\partial W} = \delta x^T$
[Figure: the same network diagram]
Training with Backpropagation
• For the biases b, we get:
  $\frac{\partial s}{\partial b} = \delta$
[Figure: the same network diagram]
Training with Backpropagation
That's almost backpropagation: it's taking derivatives and using the chain rule.
Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers!
Example: last derivatives of the model, the word vectors in x
Training with Backpropagation
• Take the derivative of the score with respect to a single element of a word vector:
  $\frac{\partial s}{\partial x_j}$
• Now, we cannot just take into consideration one $a_i$, because each $x_j$ is connected to all the neurons above, and hence $x_j$ influences the overall score through all of these:
  $\frac{\partial s}{\partial x_j} = \sum_i U_i f'(z_i)\, W_{ij} = \sum_i \delta_i W_{ij}$
• The re-used part of the previous derivative is $\delta_i = U_i f'(z_i)$
Training with Backpropagation
• With $\frac{\partial s}{\partial x_j} = \sum_i \delta_i W_{ij}$, what is the full gradient? → $\frac{\partial s}{\partial x} = W^T \delta$
• Observations: the error message $\delta$ that arrives at a hidden layer has the same dimensionality as that hidden layer. (A checked sketch follows below.)
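A sketch of this gradient with a quick numerical check (same toy network as above):

```python
import numpy as np

def f(z): return 1.0 / (1.0 + np.exp(-z))
def score(x, W, b, U):
    return U.dot(f(W.dot(x) + b))

W = np.random.randn(2, 3); b = np.random.randn(2)
U = np.random.randn(2);    x = np.random.randn(3)

z = W.dot(x) + b
delta = U * f(z) * (1.0 - f(z))
grad_x = W.T.dot(delta)            # analytic ds/dx, shape (3,)

# Numerical check of one component via central differences:
eps = 1e-6
e0 = np.zeros(3); e0[0] = eps
num = (score(x + e0, W, b, U) - score(x - e0, W, b, U)) / (2 * eps)
assert np.isclose(num, grad_x[0])
```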
Putting all gradients together:
• Remember: the full objective function for each window was:
  $J = \max(0,\, 1 - s + s_c)$
• For example, the gradient for U is just the hidden activations: $\frac{\partial J}{\partial U} = a_c - a$ when $1 - s + s_c > 0$ (and 0 otherwise)
Summary
Congrats! Super useful basic components and a real model:
• Word vector training
• Windows
• Softmax and cross entropy error → PSet 1
• Scores and max-margin loss
• Neural network → PSet 1
One more half of a math-heavy lecture.
Then the rest will be easier and more applied :)
Next lecture:
Project advice
Taking more and deeper derivatives → Full Backprop
Then we have all the basic tools in place to learn about more complex models and have some fun :)