Vector Semantics: Introduction
Klinton Bicknell
(borrowing from: Dan Jurafsky and Jim Martin)
Why vector models of meaning? Computing the similarity between words
• "fast" is similar to "rapid"; "tall" is similar to "height"
• Question answering: Q: "How tall is Mt. Everest?" Candidate A: "The official height of Mount Everest is 29029 feet"
Word similarity for plagiarism detection
Word similarity for historical change: semantic change over time
(Kulkarni, Al-Rfou, Perozzi, Skiena 2015; Sagi, Kaufmann, Clark 2013)
[Figure: semantic broadening of dog, deer, and hound: broadness scores (0-40) plotted across the periods <1250, Middle (1350-1500), and Modern (1500-1710)]
Problems with thesaurus-based meaning
• We don't have a thesaurus for every language
• We can't have a thesaurus for every year
  • For change detection, we need to compare word meanings in year t to year t+1
• Thesauruses have problems with recall
  • Many words and phrases are missing
  • Thesauri work less well for verbs and adjectives
Distributional models of meaning
= vector-space models of meaning
= vector semantics

Intuitions: Zellig Harris (1954):
• "oculist and eye-doctor … occur in almost the same environments"
• "If A and B have almost identical environments we say that they are synonyms."
Firth (1957):
• "You shall know a word by the company it keeps!"
Intuition of distributional word similarity
• Nida example: Suppose I asked you, what is tesgüino?
  A bottle of tesgüino is on the table.
  Everybody likes tesgüino.
  Tesgüino makes you drunk.
  We make tesgüino out of corn.
• From context words humans can guess tesgüino means an alcoholic beverage like beer
• Intuition for algorithm: two words are similar if they have similar word contexts.
Three kinds of vector models
Sparse vector representations:
1. Mutual-information-weighted word co-occurrence matrices
Dense vector representations:
2. Singular value decomposition (and Latent Semantic Analysis)
3. Neural-network-inspired models (skip-grams, CBOW)
Shared intuition
• Model the meaning of a word by "embedding" it in a vector space.
• The meaning of a word is a vector of numbers.
• Vector models are also called "embeddings".
• Contrast: word meaning is represented in many computational linguistic applications by a vocabulary index ("word number 545")
• Old philosophy joke:
  Q: What's the meaning of life?
  A: LIFE′
Vector Semantics: Words and co-occurrence vectors
Co-occurrence Matrices
• We represent how often a word occurs in a document
  • Term-document matrix
• Or how often a word occurs with another
  • Term-term matrix (or word-word co-occurrence matrix, or word-context matrix)
Term-document matrix
• Each cell: count of word w in a document d
• Each document is a count vector in ℕ^V: a column below

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                1               8            15
soldier            2                2              12            36
fool              37               58               1             5
clown              6              117               0             0
Similarity in term-document matrices
• Two documents are similar if their vectors (the columns of the matrix above) are similar.
The words in a term-document matrix
• Each word is a count vector in ℕ^D: a row in the matrix above.
• Two words are similar if their vectors are similar.
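As an aside (not from the original slides), here is a minimal numpy sketch of the Shakespeare term-document matrix, showing that documents are columns and words are rows:

```python
import numpy as np

# Term-document counts from the Shakespeare example above.
words = ["battle", "soldier", "fool", "clown"]
docs = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
X = np.array([
    [ 1,   1,  8, 15],   # battle
    [ 2,   2, 12, 36],   # soldier
    [37,  58,  1,  5],   # fool
    [ 6, 117,  0,  0],   # clown
])

# Each document is a column vector; each word is a row vector.
print(X[:, docs.index("Henry V")])   # [15 36  5  0]
print(X[words.index("fool"), :])     # [37 58  1  5]
```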
The word-word or word-context matrix
• Instead of entire documents, use smaller contexts: a paragraph, or a window of ±4 words
• A word is now defined by a vector of counts over context words
Word-word matrix
• We show only a 4 × 6 fragment below; the real matrix is |V| × |V| and very sparse (most values are 0)

            aardvark   computer   data   pinch   result   sugar   …
apricot         0          0        0      1        0       1
pineapple       0          0        0      1        0       1
digital         0          2        1      0        1       0
information     0          1        6      0        4       0
Two kinds of co-occurrence between two words (Schütze and Pedersen, 1993)
• First-order co-occurrence (syntagmatic association):
  • They are typically near each other.
  • wrote is a first-order associate of book or poem.
• Second-order co-occurrence (paradigmatic association):
  • They have similar neighbors.
  • wrote is a second-order associate of words like said or remarked.
Vector Semantics: Positive Pointwise Mutual Information (PPMI)
Problem with raw counts
• Raw word frequency is not a great measure of association between words
  • It's very skewed: "the" and "of" are very frequent, but maybe not the most discriminative
• We'd rather have a measure that asks whether a context word is particularly informative about the target word.
  • Positive Pointwise Mutual Information (PPMI)
Pointwise Mutual Information
• Do events x and y co-occur more often than if they were independent?

$$\text{PMI}(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$$
Positive Pointwise Mutual Information
• PMI ranges from −∞ to +∞, but negative values are unreliable unless our corpora are enormous, so we replace all negative PMI values with zero:

$$\text{PPMI}(x,y) = \max\left(\log_2 \frac{P(x,y)}{P(x)\,P(y)},\ 0\right)$$
Computing PPMI on a term-context matrix
• Matrix F with W rows (words) and C columns (contexts)
• f_ij is the number of times word w_i occurs in context c_j
$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$$

$$\text{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}} \qquad \text{ppmi}_{ij} = \begin{cases} \text{pmi}_{ij} & \text{if } \text{pmi}_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Example:
p(w = information, c = data) = 6/19 = .32
p(w = information) = 11/19 = .58
p(c = data) = 7/19 = .37

p(w,context)      computer   data   pinch   result   sugar    p(w)
apricot             0.00     0.00    0.05    0.00     0.05    0.11
pineapple           0.00     0.00    0.05    0.00     0.05    0.11
digital             0.11     0.05    0.00    0.05     0.00    0.21
information         0.05     0.32    0.00    0.21     0.00    0.58
p(context)          0.16     0.37    0.11    0.26     0.11
$$p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N} \qquad \text{where } N = \sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}$$
• pmi(information, data) = log₂(.32 / (.37 × .58)) = .58 (.57 using full precision)

PPMI(w,context)   computer   data   pinch   result   sugar
apricot               -        -     2.25     -       2.25
pineapple             -        -     2.25     -       2.25
digital             1.66     0.00      -     0.00       -
information         0.00     0.57      -     0.47       -
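A minimal numpy sketch of this computation (an illustration, not code from the original slides), using the toy counts above; it reproduces PPMI(information, data) ≈ 0.57:

```python
import numpy as np

def ppmi(F):
    """Positive PMI from a matrix F of word-by-context counts f_ij."""
    N = F.sum()
    p_wc = F / N                              # joint p_ij
    p_w = F.sum(axis=1, keepdims=True) / N    # row marginals p_i*
    p_c = F.sum(axis=0, keepdims=True) / N    # column marginals p_*j
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0)                 # clip negative (and -inf) PMI to 0

# Counts for apricot, pineapple, digital, information
# in contexts computer, data, pinch, result, sugar:
F = np.array([
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
], dtype=float)

print(round(ppmi(F)[3, 1], 2))  # PPMI(information, data) = 0.57
```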
Weighting PMI
• PMI is biased toward infrequent events
  • Very rare words have very high PMI values
• Two solutions:
  • Give rare words slightly higher probabilities
  • Use add-delta smoothing (which has a similar effect)
Weighting PMI: Giving rare context words slightly higher probability
• Raise the context probabilities to the power of α = 0.75:

$$P_\alpha(c) = \frac{\text{count}(c)^\alpha}{\sum_{c'} \text{count}(c')^\alpha}$$

• This increases the estimated probability of rare contexts, which lowers their PMI.
Add-2 smoothed count(w,context)
              computer   data   pinch   result   sugar
apricot           2        2      3       2        3
pineapple         2        2      3       2        3
digital           4        3      2       3        2
information       3        8      2       6        2

p(w,context) [add-2]   computer   data   pinch   result   sugar   p(w)
apricot                  0.03     0.03    0.05    0.03     0.05   0.20
pineapple                0.03     0.03    0.05    0.03     0.05   0.20
digital                  0.07     0.05    0.03    0.05     0.03   0.24
information              0.05     0.14    0.03    0.10     0.03   0.36
p(context)               0.19     0.25    0.17    0.22     0.17
PPMI versus add-2 smoothed PPMI

PPMI(w,context)   computer   data   pinch   result   sugar
apricot               -        -     2.25     -       2.25
pineapple             -        -     2.25     -       2.25
digital             1.66     0.00      -     0.00       -
information         0.00     0.57      -     0.47       -

PPMI(w,context) [add-2]   computer   data   pinch   result   sugar
apricot                     0.00     0.00    0.56    0.00     0.56
pineapple                   0.00     0.00    0.56    0.00     0.56
digital                     0.62     0.00    0.00    0.00     0.00
information                 0.00     0.58    0.00    0.37     0.00
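A small illustrative sketch of add-delta smoothing (not from the original slides): add δ to every count before computing PPMI. With δ = 2 it reproduces the smoothed values above, e.g. PPMI(digital, computer) drops from 1.66 to about 0.62:

```python
import numpy as np

def ppmi(F, delta=0.0):
    """PPMI of count matrix F, optionally add-delta smoothed."""
    F = F + delta
    N = F.sum()
    p_wc = F / N
    p_w = F.sum(axis=1, keepdims=True) / N
    p_c = F.sum(axis=0, keepdims=True) / N
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0)

F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

print(round(ppmi(F)[2, 0], 2))           # 1.66  (unsmoothed)
print(round(ppmi(F, delta=2)[2, 0], 2))  # 0.62  (add-2 smoothed)
```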
Vector Semantics: Measuring similarity: the cosine
Measuring similarity
• Given two target words v and w, we need a way to measure their similarity.
• Most measures of vector similarity are based on the dot product (inner product) from linear algebra:

$$\text{dot-product}(\vec v, \vec w) = \vec v \cdot \vec w = \sum_{i=1}^{N} v_i w_i$$

• High when two vectors have large values in the same dimensions.
• Low (in fact 0) for orthogonal vectors with zeros in complementary distribution.
Problem with dot product
• The dot product is larger if the vector is longer. Vector length:

$$|\vec v| = \sqrt{\sum_{i=1}^{N} v_i^2}$$

• Vectors are longer if they have higher values in each dimension
• That means more frequent words will have higher dot products
• That's bad: we don't want a similarity metric to be sensitive to word frequency
Solution: cosine
• Just divide the dot product by the lengths of the two vectors!
• This turns out to be the cosine of the angle between them!
Cosine for computing similarity

$$\cos(\vec v, \vec w) = \frac{\vec v \cdot \vec w}{|\vec v|\,|\vec w|} = \frac{\vec v}{|\vec v|} \cdot \frac{\vec w}{|\vec w|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\ \sqrt{\sum_{i=1}^{N} w_i^2}}$$

• v_i is the PPMI value for word v in context i; w_i is the PPMI value for word w in context i.
• cos(v, w) is the cosine similarity of v and w.
Cosine as a similarity metric
• −1: vectors point in opposite directions
• +1: vectors point in the same direction
• 0: vectors are orthogonal
• Raw frequency or PPMI values are non-negative, so the cosine ranges 0-1

              large   data   computer
apricot         2      0        0
digital         0      1        2
information     1      6        1
Which pair of words is more similar?

cosine(apricot, information) = (2·1 + 0·6 + 0·1) / (√(4+0+0) · √(1+36+1)) = 2 / (2√38) = .16
cosine(digital, information) = (0·1 + 1·6 + 2·1) / (√(0+1+4) · √(1+36+1)) = 8 / (√5 · √38) = .58
cosine(apricot, digital)     = (2·0 + 0·1 + 0·2) / (√(4+0+0) · √(0+1+4)) = 0
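A quick numpy check of these numbers (an illustrative sketch, not from the slides):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product of the two unit-length vectors."""
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

# Count vectors over contexts (large, data, computer):
apricot     = np.array([2.0, 0.0, 0.0])
digital     = np.array([0.0, 1.0, 2.0])
information = np.array([1.0, 6.0, 1.0])

print(round(cosine(apricot, information), 2))  # 0.16
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, digital), 2))      # 0.0
```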
Visualizing vectors and angles

              large   data
apricot         2      0
digital         0      1
information     1      6

[Figure: the three word vectors plotted in two dimensions, Dimension 1 ('large') vs. Dimension 2 ('data'); the angle between digital and information is small, while apricot is nearly orthogonal to both]
Clustering vectors to visualize similarity in co-occurrence matrices
[Figures from Rohde, Gonnerman, and Plaut, "Modeling Word Meaning Using Lexical Co-Occurrence" (Rohde et al. 2006): Figure 8 shows multidimensional scaling for three noun classes (body parts such as HEAD, HAND, EYE; animals such as DOG, CAT, LION; places such as AMERICA, EUROPE, CHICAGO); Figure 9 shows hierarchical clustering for the same three noun classes using distances based on vector correlations]
Other possible similarity measures
Evaluating similarity (the same as for thesaurus-based)
• Intrinsic evaluation:
  • Correlation between algorithm and human word similarity ratings
• Extrinsic (task-based, end-to-end) evaluation:
  • Spelling error detection, WSD, essay grading
  • Taking TOEFL multiple-choice vocabulary tests:
    Levied is closest in meaning to which of these: imposed, believed, requested, correlated
Alternative to PPMI for measuring association
• tf-idf (that's a hyphen, not a minus sign)
• The combination of two factors:
  • Term frequency (Luhn 1957): frequency of the word (can be logged)
  • Inverse document frequency (IDF) (Sparck Jones 1972):

$$\text{idf}_i = \log\left(\frac{N}{\text{df}_i}\right)$$

    • N is the total number of documents
    • df_i = "document frequency of word i" = # of documents containing word i
• The tf-idf weight of word i in document j:

$$w_{ij} = \text{tf}_{ij} \times \text{idf}_i$$
tf-idf is not generally used for word-word similarity
• But it is by far the most common weighting when we are considering the relationship of words to documents
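A minimal sketch of tf-idf weighting on the Shakespeare term-document matrix from earlier (illustrative; it assumes log base 10 for idf and raw counts for tf, which are common but not the only choices):

```python
import numpy as np

# Rows: battle, soldier, fool, clown. Columns: the four plays.
tf = np.array([[ 1,   1,  8, 15],
               [ 2,   2, 12, 36],
               [37,  58,  1,  5],
               [ 6, 117,  0,  0]], dtype=float)

N = tf.shape[1]                 # total number of documents
df = (tf > 0).sum(axis=1)       # number of documents containing each word
idf = np.log10(N / df)          # idf_i = log(N / df_i)

tfidf = tf * idf[:, None]       # w_ij = tf_ij * idf_i

# battle/soldier/fool occur in all 4 plays, so their idf (and weight) is 0;
# clown occurs in 2 of 4 plays, so idf = log10(2) ≈ 0.30.
print(idf.round(2))             # [0.  0.  0.  0.3]
```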
Vector Semantics: Dense Vectors
Sparse versus dense vectors
• PPMI vectors are
  • long (length |V| = 20,000 to 50,000)
  • sparse (most elements are zero)
• Alternative: learn vectors which are
  • short (length 200-1000)
  • dense (most elements are non-zero)
Sparse versus dense vectors
• Why dense vectors?
  • Short vectors may be easier to use as features in machine learning (fewer weights to tune)
  • Dense vectors may generalize better than storing explicit counts
  • They may do better at capturing synonymy:
    • car and automobile are synonyms, but are represented as distinct dimensions; this fails to capture similarity between a word with car as a neighbor and a word with automobile as a neighbor
Three methods for getting short dense vectors
• Singular Value Decomposition (SVD)
  • A special case of this is called LSA (Latent Semantic Analysis)
• "Neural language model"-inspired predictive models
  • skip-grams and CBOW
• Brown clustering
Vector Semantics: Dense Vectors via SVD
Intuition
• Approximate an N-dimensional dataset using fewer dimensions
• By first rotating the axes into a new space
  • In which the highest-order dimension captures the most variance in the original dataset
  • And the next dimension captures the next most variance, etc.
• Many such (related) methods:
  • PCA (principal components analysis)
  • Factor Analysis
  • SVD
Dimensionality reduction
Singular Value Decomposition
Any rectangular w × c matrix X equals the product of three matrices, X = W S C:
• W: w × m matrix whose rows correspond to the original rows, but whose m columns each represent a dimension in a new latent space, such that
  • the m column vectors are orthogonal to each other
  • the columns are ordered by the amount of variance in the dataset each new dimension accounts for
• S: diagonal m × m matrix of singular values expressing the importance of each dimension
• C: m × c matrix whose columns correspond to the original columns, but whose m rows correspond to the singular values
(Landauer and Dumais 1997)
SVD applied to term-document matrix: Latent Semantic Analysis (Deerwester et al. 1988)
• If instead of keeping all m dimensions, we just keep the top k singular values (say, 300),
• the result is a least-squares approximation to the original X.
LSA: more details
• 300 dimensions are commonly used
• The cells are commonly weighted by a product of two weights
  • Local weight: log term frequency
  • Global weight: either idf or an entropy measure
Let's return to PPMI word-word matrices
• Can we apply SVD to them?
SVD applied to term-term matrix
(I'm simplifying here by assuming the matrix has rank |V|)
[Figure: X (|V| × |V|) factored as W × S × C, each |V| × |V|]
Truncated SVD on term-term matrix
[Figure: keep only the top k singular values, truncating W to |V| × k, S to k × k, and C to k × |V|]
Truncated SVD produces embeddings
• Each row of the W matrix is a k-dimensional representation of each word w
• k might range from 50 to 1000
• Generally we keep the top k dimensions, but some experiments suggest that getting rid of the top 1 dimension, or even the top 50 dimensions, is helpful (Lapesa and Evert 2014).
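A minimal numpy sketch of truncated SVD over a tiny toy matrix standing in for a |V| × |V| PPMI matrix (illustrative, with an assumed k; in practice k would be around 300):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 6))           # stand-in for a |V| x |V| PPMI matrix
X = np.maximum(X, X.T)           # make it symmetric-ish, like PPMI

k = 2                            # number of dimensions to keep
U, s, Vt = np.linalg.svd(X)      # X = U @ diag(s) @ Vt, s sorted descending

embeddings = U[:, :k] * s[:k]    # each row: a k-dimensional word vector
# (Some implementations use U[:, :k] alone, or weight by s[:k]**0.5.)
print(embeddings.shape)          # (6, 2)
```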
Embeddings versus sparse vectors
• Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity
  • Denoising: low-order dimensions may represent unimportant information
  • Truncation may help the models generalize better to unseen data.
  • Having a smaller number of dimensions may make it easier for classifiers to properly weight the dimensions for the task.
  • Dense models may do better at capturing higher-order co-occurrence.
Vector Semantics: Embeddings inspired by neural language models: skip-grams and CBOW
Prediction-based models: an alternative way to get dense vectors
• Skip-gram (Mikolov et al. 2013a), CBOW (Mikolov et al. 2013b)
• Learn embeddings as part of the process of word prediction:
  • Train a neural network to predict neighboring words
  • Inspired by neural net language models
  • In so doing, learn dense embeddings for the words in the training corpus
• Advantages:
  • Fast, easy to train (much faster than SVD)
  • Available online in the word2vec package, including sets of pretrained embeddings! (see the usage sketch below)
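For example, a hedged sketch of training skip-gram embeddings with the gensim implementation of word2vec (assuming gensim 4.x; `sg=1` selects skip-gram, `sg=0` CBOW):

```python
from gensim.models import Word2Vec

# A toy corpus: one tokenized sentence per list (the tesguino example).
corpus = [
    ["we", "make", "tesguino", "out", "of", "corn"],
    ["everybody", "likes", "tesguino"],
    ["tesguino", "makes", "you", "drunk"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # d: embedding dimension
    window=2,         # C: context window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,      # keep even rare words in this tiny corpus
)

vec = model.wv["tesguino"]                     # the learned embedding
print(model.wv.most_similar("tesguino", topn=3))
```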
Skip-grams
• Predict each neighboring word
  • in a context window of 2C words
  • from the current word.
• So for C = 2, we are given word w_t and predict these 4 words: w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2}.
Skip-grams learn two embeddings for each word w
• Input embedding v, in the input matrix W
  • Column i of the input matrix W is the d-dimensional embedding v_i for word i in the vocabulary.
• Output embedding v′, in the output matrix W′
  • Row i of the output matrix W′ is the d-dimensional embedding v′_i for word i in the vocabulary.
Setup
• Walking through the corpus pointing at word w(t), whose index in the vocabulary is j, so we'll call it w_j (1 ≤ j ≤ |V|).
• Let's predict w(t+1), whose index in the vocabulary is k (1 ≤ k ≤ |V|). Hence our task is to compute P(w_k | w_j).
One-hot vectors
• A vector of length |V|, with 1 for the target word and 0 for all other words
• So if "popsicle" is vocabulary word 5, its one-hot vector is [0, 0, 0, 0, 1, 0, 0, 0, 0, …, 0]
Skip-gram
[Figure: network diagram; a one-hot input vector selects column j of the input matrix W, giving the hidden (projection) layer h, which is multiplied by the output matrix W′ to give the output layer o]
• h = v_j
• o = W′h
• o_k = v′_k · h = v′_k · v_j
Turning outputs into probabilities
• o_k = v′_k · v_j
• We use the softmax to turn these into probabilities:

$$p(w_k \mid w_j) = \frac{\exp(v'_k \cdot v_j)}{\sum_{w \in V} \exp(v'_w \cdot v_j)}$$
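A small numpy sketch of this forward pass (illustrative; tiny random matrices stand in for the learned W and W′):

```python
import numpy as np

V, d = 5, 3                          # vocabulary size, embedding dimension
rng = np.random.default_rng(1)
W = rng.normal(size=(d, V))          # input embeddings (one per column)
W_out = rng.normal(size=(V, d))      # output embeddings (one per row)

j = 2                                # index of the current word w_j
h = W[:, j]                          # h = v_j
o = W_out @ h                        # o_k = v'_k . v_j for every k

p = np.exp(o) / np.exp(o).sum()      # softmax: P(w_k | w_j)
print(p.round(3), p.sum())           # probabilities summing to 1
```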
Embeddings from W and W′
• Since we have two embeddings, v_j and v′_j, for each word w_j, we can either:
  • just use v_j,
  • sum them, or
  • concatenate them to make a double-length embedding.
But wait: how do we learn the embeddings?
• By gradient descent: walk through the training corpus and adjust W and W′ to raise the probability of the context words that actually occur (in practice with tricks such as negative sampling, to avoid computing the full softmax).
Relation between skip-grams and PMI!
• If we multiply WW′ᵀ, we get a |V| × |V| matrix M, each entry m_ij corresponding to some association between input word i and output word j.
• Levy and Goldberg (2014b) show that skip-gram (with negative sampling) reaches its optimum just when this matrix is a shifted version of PMI:

$$WW'^{\mathsf T} = M_{\text{PMI}} - \log k$$

  (where k is the number of negative samples)
• So skip-gram is implicitly factoring a shifted version of the PMI matrix into the two embedding matrices.
CBOW (Continuous Bag of Words)
• The mirror image of skip-gram: predict the current word from the (bag of) context words around it.
Properties of embeddings
• Nearest words to some embeddings (Mikolov et al. 2013)
[Table: nearest-neighbor words for sample embeddings]
Embeddings capture relational meaning!
• vector("king") − vector("man") + vector("woman") ≈ vector("queen")
• vector("Paris") − vector("France") + vector("Italy") ≈ vector("Rome")
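A sketch of the parallelogram method behind such analogies (illustrative; `emb` is a hypothetical dict mapping words to vectors, e.g. loaded from pretrained word2vec embeddings):

```python
import numpy as np

def analogy(a, b, c, emb):
    """Return the word whose vector is closest (by cosine) to b - a + c,
    excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, v in emb.items():
        if word in (a, b, c):
            continue
        score = v @ target / np.linalg.norm(v)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# analogy("man", "king", "woman", emb)  ->  ideally "queen"
```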
Vector Semantics: Evaluating similarity
Evaluating similarity
• Extrinsic (task-based, end-to-end) evaluation:
  • Question answering
  • Spell checking
  • Essay grading
• Intrinsic evaluation:
  • Correlation between algorithm and human word similarity ratings
    • WordSim353: 353 noun pairs rated 0-10. sim(plane, car) = 5.77
  • Taking TOEFL multiple-choice vocabulary tests
    • Levied is closest in meaning to: imposed, believed, requested, correlated
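Intrinsic evaluation is typically reported as a Spearman rank correlation between model cosines and the human ratings; a sketch with made-up numbers, using scipy's `spearmanr`:

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same four word pairs:
human_ratings = [5.77, 8.1, 1.3, 6.9]   # WordSim353-style 0-10 ratings
model_cosines = [0.41, 0.80, 0.05, 0.52]

rho, p_value = spearmanr(human_ratings, model_cosines)
print(rho)   # 1.0 here: the model ranks all pairs in the human order
```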
Summary
• Distributional (vector) models of meaning
  • Sparse (PPMI-weighted word-word co-occurrence matrices)
  • Dense:
    • Word-word SVD, 50-2000 dimensions
    • Skip-grams and CBOW (embeddings available in word2vec)
A great semantic vector space for documents
• Words have low-dimensional embeddings, useful for many computational linguistic applications
• Documents are a weighted combination of words
  • documents as a vector in the low-dimensional space
• This allows:
  • semantic document clustering (k-means, hierarchical, etc.)
  • search for similar documents (prior art in patents, etc.)
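One way to realize this, sketched under the assumption that a document vector is a weighted average of its words' embeddings (the helpers `emb` and `weight`, e.g. tf-idf weights, are hypothetical):

```python
import numpy as np

def doc_vector(tokens, emb, weight):
    """Weighted average of word embeddings: one dense vector per document.
    `emb` maps words to vectors; `weight` maps words to scalar weights."""
    vecs = [weight[w] * emb[w] for w in tokens if w in emb]
    return np.mean(vecs, axis=0)

# Documents close in this space can then be clustered (k-means, etc.)
# or retrieved by cosine similarity to a query document's vector.
```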