Natural Language Processing with Deep Learning
CS224N/Ling284
Richard Socher
Lecture 2: Word Vectors
Organization
• PSet 1 is released. Coding Session 1/22 (Monday; PA1 due Thursday)
• Some of the questions from Piazza:
  • Sharing the choose-your-own final project with another class seems fine → Yes*
  • But how about the default final project? Can that also be used as a final project for a different course? → Yes*
  • Are we allowing students to bring one sheet of notes for the midterm? → Yes
• Azure computing resources for Projects/PSet 4. Part of milestone
Lecture Plan
1. Word meaning (15 mins)
2. Word2vec introduction (20 mins)
3. Word2vec objective function gradients (25 mins)
4. Optimization refresher (10 mins)
1. How do we represent the meaning of a word?
Definition: meaning (Webster dictionary)
• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using words, signs, etc.
• the idea that is expressed in a work of writing, art, etc.
Commonest linguistic way of thinking of meaning:
signifier (symbol) ⟺ signified (idea or thing) = denotation
How do we have usable meaning in a computer?
Common solution: Use e.g. WordNet, a resource containing lists of synonym sets and hypernyms ("is a" relationships).

e.g. hypernyms of "panda":
[Synset('procyonid.n.01'), Synset('carnivore.n.01'), Synset('placental.n.01'), Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('chordate.n.01'), Synset('animal.n.01'), Synset('organism.n.01'), Synset('living_thing.n.01'), Synset('whole.n.02'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]

e.g. synonym sets containing "good":
(adj) full, good
(adj) estimable, good, honorable, respectable
(adj) beneficial, good
(adj) good, just, upright
(adj) adept, expert, good, practiced, proficient, skillful
(adj) dear, good, near
(adj) good, right, ripe
…
(adv) well, good
(adv) thoroughly, soundly, good
(n) good, goodness
(n) commodity, trade good, good
Problems with resources like WordNet
• Great as a resource but missing nuance
  • e.g. "proficient" is listed as a synonym for "good". This is only correct in some contexts.
• Missing new meanings of words
  • e.g. wicked, badass, nifty, wizard, genius, ninja, bombest
  • Impossible to keep up-to-date!
• Subjective
• Requires human labor to create and adapt
• Hard to compute accurate word similarity
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols: hotel, conference, motel
Words can be represented by one-hot vectors:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
(one 1, the rest 0s)
Vector dimension = number of words in vocabulary (e.g. 500,000)
Problem with words as discrete symbols
Example: in web search, if a user searches for "Seattle motel", we would like to match documents containing "Seattle hotel" (Sec. 9.2.2).

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal. There is no natural notion of similarity for one-hot vectors!

Solution:
• Could rely on WordNet's list of synonyms to get similarity?
• Instead: learn to encode similarity in the vectors themselves
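A minimal sketch of the orthogonality problem (the vocabulary size and word indices here are made up for illustration):

```python
import numpy as np

V = 15                              # toy vocabulary size (real: ~500,000)
idx = {"hotel": 7, "motel": 10}     # made-up word indices

def one_hot(word):
    """One 1 at the word's index, the rest 0s."""
    v = np.zeros(V)
    v[idx[word]] = 1.0
    return v

# The dot product is 0: "hotel" and "motel" look completely unrelated.
print(one_hot("hotel") @ one_hot("motel"))  # 0.0
```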
Representing words by their context
• Core idea: A word's meaning is given by the words that frequently appear close-by
  • "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
  • One of the most successful ideas of modern statistical NLP!
• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
• Use the many contexts of w to build up a representation of w

…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking
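A minimal sketch of collecting those fixed-window contexts (tokenization and window size are simplified for illustration):

```python
# Collect the context words of "banking" in a window of size 2.
text = "government debt problems turning into banking crises as happened in 2009".split()
m = 2  # window size

contexts = []
for t, word in enumerate(text):
    if word == "banking":
        contexts.extend(text[max(0, t - m):t] + text[t + 1:t + 1 + m])

print(contexts)  # ['turning', 'into', 'crises', 'as']
```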
Word vectors
We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.
Note: word vectors are sometimes called word embeddings or word representations.

linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
2. Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.
Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context ("outside") words o (sketched below)
• Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
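A rough sketch of what "go through each position" produces as training data (toy corpus and window size, purely illustrative):

```python
# Enumerate (center, outside) training pairs with window size m = 2.
corpus = "problems turning into banking crises".split()
m = 2

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

print(pairs[:4])
# [('problems', 'turning'), ('problems', 'into'),
#  ('turning', 'problems'), ('turning', 'into')]
```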
Word2Vec Overview
• Example windows and process for computing $P(w_{t+j} \mid w_t)$:

… problems turning into banking crises as …

Center word at position t: "into"; outside context words in window of size 2:
$P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, $P(w_{t+2} \mid w_t)$
Word2Vec Overview
• Example windows and process for computing $P(w_{t+j} \mid w_t)$, with the window moved along:

… problems turning into banking crises as …

Center word at position t: "banking"; outside context words in window of size 2:
$P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, $P(w_{t+2} \mid w_t)$
Word2vec: objective function
For each position $t = 1, \dots, T$, predict context words within a window of fixed size $m$, given center word $w_t$.

Likelihood:
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$
where $\theta$ is all variables to be optimized.

The objective function $J(\theta)$ (sometimes called the cost or loss function) is the (average) negative log-likelihood:
$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

Minimizing the objective function ⟺ maximizing predictive accuracy
Word2vec: objective function
• We want to minimize the objective function:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$
• Question: How to calculate $P(w_{t+j} \mid w_t; \theta)$?
• Answer: We will use two vectors per word w:
  • $v_w$ when w is a center word
  • $u_w$ when w is a context word
• Then for a center word c and a context word o:
$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
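A minimal numeric sketch of this formula (vocabulary size, dimension, and the random vectors are made up):

```python
import numpy as np

V, d = 15, 8                       # toy vocabulary size and vector dimension
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))        # u_w: context-word vectors
C = rng.normal(size=(V, d))        # v_w: center-word vectors

def p_o_given_c(o, c):
    """P(o|c): softmax over the dot products u_w^T v_c."""
    scores = U @ C[c]                    # u_w^T v_c for every word w
    e = np.exp(scores - scores.max())    # shift for numerical stability
    return e[o] / e.sum()

print(p_o_given_c(o=3, c=7))       # a probability in (0, 1)
```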
Word2Vec Overview with Vectors
• Example windows and process for computing $P(w_{t+j} \mid w_t)$
• $P(u_{\text{problems}} \mid v_{\text{into}})$ is short for $P(\text{problems} \mid \text{into};\ u_{\text{problems}}, v_{\text{into}}, \theta)$

… problems turning into banking crises as …

Center word at position t: "into"; outside context words in window of size 2:
$P(u_{\text{problems}} \mid v_{\text{into}})$, $P(u_{\text{turning}} \mid v_{\text{into}})$, $P(u_{\text{banking}} \mid v_{\text{into}})$, $P(u_{\text{crises}} \mid v_{\text{into}})$
Word2Vec Overview with Vectors
• Example windows and process for computing $P(w_{t+j} \mid w_t)$:

… problems turning into banking crises as …

Center word at position t: "banking"; outside context words in window of size 2:
$P(u_{\text{turning}} \mid v_{\text{banking}})$, $P(u_{\text{into}} \mid v_{\text{banking}})$, $P(u_{\text{crises}} \mid v_{\text{banking}})$, $P(u_{\text{as}} \mid v_{\text{banking}})$
Word2vec: prediction function

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

• The dot product compares the similarity of o and c: a larger dot product means a larger probability
• After taking the exponent, we normalize over the entire vocabulary
• This is an example of the softmax function $\mathbb{R}^n \to \mathbb{R}^n$:
$$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i$$
• The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$
  • "max" because it amplifies the probability of the largest $x_i$
  • "soft" because it still assigns some probability to smaller $x_i$
  • Frequently used in Deep Learning
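A quick numeric illustration of the "max" and "soft" behavior (the input scores are arbitrary):

```python
import numpy as np

def softmax(x):
    """Map arbitrary scores to a probability distribution."""
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

# "max": the largest score dominates; "soft": smaller scores keep some mass.
print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.66, 0.24, 0.10]
```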
To train the model: Compute all vector gradients!
• Recall: $\theta$ represents all model parameters, in one long vector
• In our case, with d-dimensional vectors and V-many words: $\theta \in \mathbb{R}^{2dV}$
• Remember: every word has two vectors
• We then optimize these parameters
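A toy sketch of that parameter layout (sizes are illustrative, and the zero initialization is just a placeholder):

```python
import numpy as np

V, d = 15, 8
center = np.zeros((V, d))    # v_w: center-word vectors
context = np.zeros((V, d))   # u_w: context-word vectors
theta = np.concatenate([center.ravel(), context.ravel()])
print(theta.shape)           # (2 * d * V,) = (240,)
```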
3. Derivations of gradient
• Whiteboard – see the video if you're not in class ;)
• The basic Lego piece
• Useful basics:
  • If in doubt: write it out with indices
• Chain rule! If $y = f(u)$ and $u = g(x)$, i.e. $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}$$
Chain Rule
• Chain rule! If $y = f(u)$ and $u = g(x)$, i.e. $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}$$
• Simple example:
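One simple stand-in worked example (not necessarily the one from the slide):

$$y = (3x + 2)^2:\quad u = 3x + 2,\ y = u^2 \;\Longrightarrow\; \frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx} = 2u \cdot 3 = 6(3x + 2)$$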
Interactive Whiteboard Session!
Let's derive the gradient for the center word together.
For one example window and one example outside word:
You then also need the gradient for context words (it's similar; left for homework). That's all of the parameters $\theta$ here.
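The derivation itself lives on the whiteboard/video; for reference, carrying it out with the softmax definition of $P(o \mid c)$ above gives:

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c)\, u_w$$

i.e. the observed context vector minus the model's expected context vector.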
Calculating all gradients!
• We went through the gradient for each center vector v in a window
• We also need gradients for the outside vectors u
  • Derive at home!
• Generally, in each window we will compute updates for all parameters that are being used in that window. For example:

… problems turning into banking crises as …

Center word at position t: "banking"; outside context words in window of size 2:
$P(u_{\text{turning}} \mid v_{\text{banking}})$, $P(u_{\text{into}} \mid v_{\text{banking}})$, $P(u_{\text{crises}} \mid v_{\text{banking}})$, $P(u_{\text{as}} \mid v_{\text{banking}})$
Word2vec: More details
Why two vectors? → Easier optimization. Average both at the end.

Two model variants:
1. Skip-grams (SG): predict context ("outside") words (position independent) given the center word
2. Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words
This lecture so far: the skip-gram model

Additional efficiency in training:
1. Negative sampling
So far: focus on naïve softmax (a simpler training method)
Gradient Descent
• We have a cost function $J(\theta)$ we want to minimize
• Gradient Descent is an algorithm to minimize $J(\theta)$
• Idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a small step in the direction of the negative gradient. Repeat.

Note: our objectives are not convex like this :(
Intuition
For a simple convex function over two parameters.
Contour lines show levels of the objective function.
Gradient Descent
• Update equation (in matrix notation):
$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_{\theta} J(\theta)$$
• Update equation (for a single parameter):
$$\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \frac{\partial}{\partial \theta_j^{\text{old}}} J(\theta)$$
($\alpha$ = step size or learning rate)
• Algorithm: see the sketch below
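A minimal sketch of the loop, where `J_grad` is a hypothetical placeholder for a function computing $\nabla_{\theta} J(\theta)$:

```python
import numpy as np

def gradient_descent(J_grad, theta, alpha=0.01, steps=1000):
    """Repeatedly step in the direction of the negative gradient."""
    for _ in range(steps):
        theta = theta - alpha * J_grad(theta)
    return theta

# Toy usage: minimize J(theta) = ||theta||^2, whose gradient is 2*theta.
print(gradient_descent(lambda th: 2 * th, np.array([1.0, -2.0])))
# converges toward the minimum at [0, 0]
```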
Stochastic Gradient Descent
• Problem: $J(\theta)$ is a function of all windows in the corpus (potentially billions!)
  • So $\nabla_{\theta} J(\theta)$ is very expensive to compute
• You would wait a very long time before making a single update!
• A very bad idea for pretty much all neural nets!
• Solution: Stochastic gradient descent (SGD)
  • Repeatedly sample windows, and update after each one.
• Algorithm: see the sketch below
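A minimal sketch, where `sample_window` and `window_grad` are hypothetical placeholders for sampling one training window and estimating the gradient from it:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_window():
    """Hypothetical placeholder: sample one window from the corpus."""
    return rng.integers(0, 10)

def window_grad(theta, window):
    """Hypothetical placeholder: gradient of J estimated from one window."""
    return 2 * theta  # toy gradient of ||theta||^2

def sgd(theta, alpha=0.05, steps=100):
    """SGD: update after each sampled window, not after the full corpus."""
    for _ in range(steps):
        theta = theta - alpha * window_grad(theta, sample_window())
    return theta

print(sgd(np.array([1.0, -2.0])))  # moves toward the minimum at [0, 0]
```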
PSet 1: The skip-gram model and negative sampling
• From the paper: "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)
• Overall objective function (they maximize):
$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J_t(\theta), \qquad J_t(\theta) = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\!\left[\log \sigma(-u_j^\top v_c)\right]$$
• The sigmoid function (we'll become good friends soon):
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
• So we maximize the probability of two words co-occurring in the first log
PSet 1: The skip-gram model and negative sampling
• Simpler notation, more similar to class and the PSet (now minimizing):
$$J(\theta) = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log \sigma(-u_{w_i}^\top v_c), \qquad w_i \sim P(w)$$
• We take k negative samples.
• Maximize the probability that the real outside word appears; minimize the probability that random words appear around the center word
• $P(w) = U(w)^{3/4}/Z$: the unigram distribution $U(w)$ raised to the 3/4 power (we provide this function in the starter code).
• The power makes less frequent words be sampled more often (see the sketch below)
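A minimal sketch of this loss (the vectors, counts, and helper names are toy placeholders, not the starter code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(u_o, v_c, U, counts, k=5):
    """-log sigma(u_o^T v_c) - sum over k samples of log sigma(-u_i^T v_c)."""
    p = counts ** 0.75
    p /= p.sum()                                 # P(w) = U(w)^{3/4} / Z
    neg = rng.choice(len(counts), size=k, p=p)   # k negative samples
    return -np.log(sigmoid(u_o @ v_c)) - np.log(sigmoid(-U[neg] @ v_c)).sum()

# Toy usage with made-up vectors and unigram counts.
V, d = 15, 8
U = rng.normal(size=(V, d))
counts = rng.integers(1, 100, size=V).astype(float)
print(neg_sample_loss(U[3], rng.normal(size=d), U, counts))
```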
PSet 1: The continuous bag of words model
• Main idea of continuous bag of words (CBOW): predict the center word from the sum of the surrounding word vectors, instead of predicting surrounding single words from the center word as in the skip-gram model (see the sketch below)
• To make the assignment slightly easier: implementation of the CBOW model is not required (you can do it for a couple of bonus points!), but you do have to do the theory problem on CBOW.
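A minimal sketch of the CBOW prediction step (shapes and names are illustrative, not the PSet starter code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 15, 8
U = rng.normal(size=(V, d))       # output vectors scoring the center word
C = rng.normal(size=(V, d))       # input vectors for the context words

def cbow_probs(context_ids):
    """Predict the center word from the sum of surrounding word vectors."""
    h = C[context_ids].sum(axis=0)        # sum of context vectors
    scores = U @ h
    e = np.exp(scores - scores.max())
    return e / e.sum()                    # softmax over the vocabulary

print(cbow_probs([2, 5, 9, 11]).argmax())  # most likely center word id
```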