Vector Semantics Introduction Klinton Bicknell (borrowing from: Dan Jurafsky and Jim Martin)
Transcript
Page 1:

Vector Semantics: Introduction

Klinton Bicknell

(borrowing from: Dan Jurafsky and Jim Martin)

Page 2:

Why vector models of meaning? Computing the similarity between words

“fast” is similar to “rapid”; “tall” is similar to “height”

Question answering: Q: “How tall is Mt. Everest?” Candidate A: “The official height of Mount Everest is 29029 feet”

2

Page 3:

Word similarity for plagiarism detection

Page 4:

Word similarity for historical change: semantic change over time

4

Kulkarni, Al-Rfou, Perozzi, Skiena 2015; Sagi, Kaufmann, Clark 2013

[Figure: “Semantic Broadening” panel showing co-occurrence-based meaning change for dog, deer, and hound across the periods <1250, Middle (1350-1500), and Modern (1500-1710)]

Page 5:

Problems with thesaurus-based meaning

• We don't have a thesaurus for every language
• We can't have a thesaurus for every year
  • For change detection, we need to compare word meanings in year t to year t+1
• Thesauruses have problems with recall
  • Many words and phrases are missing
  • Thesauri work less well for verbs, adjectives

Page 6:

Distributional models of meaning
= vector-space models of meaning
= vector semantics

Intuitions: Zellig Harris (1954):
• “oculist and eye-doctor … occur in almost the same environments”
• “If A and B have almost identical environments we say that they are synonyms.”

Firth (1957):
• “You shall know a word by the company it keeps!”

6

Page 7:

Intuition of distributional word similarity

• Nida example: Suppose I asked you what is tesgüino?
  A bottle of tesgüino is on the table
  Everybody likes tesgüino
  Tesgüino makes you drunk
  We make tesgüino out of corn.
• From context words humans can guess tesgüino means
  • an alcoholic beverage like beer
• Intuition for algorithm:
  • Two words are similar if they have similar word contexts.

Page 8:

Three kinds of vector models

Sparse vector representations
1. Mutual-information weighted word co-occurrence matrices

Dense vector representations:
2. Singular value decomposition (and Latent Semantic Analysis)
3. Neural-network-inspired models (skip-grams, CBOW)

8

Page 9:

Shared intuition

• Model the meaning of a word by “embedding” it in a vector space.
• The meaning of a word is a vector of numbers
  • Vector models are also called “embeddings”.
• Contrast: word meaning is represented in many computational linguistic applications by a vocabulary index (“word number 545”)
• Old philosophy joke:
  Q: What's the meaning of life?
  A: LIFE'

9

Page 10:

Vector Semantics: Words and co-occurrence vectors

Page 11:

Co-occurrence Matrices

• We represent how often a word occurs in a document
  • Term-document matrix
• Or how often a word occurs with another
  • Term-term matrix (or word-word co-occurrence matrix or word-context matrix)

11

Page 12:

Term-document matrix

• Each cell: count of word w in a document d
• Each document is a count vector in ℕ^|V|: a column below

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               1               8            15
soldier             2               2              12            36
fool               37              58               1             5
clown               6             117               0             0

12
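To make the term-document idea concrete, here is a minimal Python sketch (not from the slides) of building such count vectors; the toy documents are hypothetical stand-ins for the plays:

```python
from collections import Counter

# Toy corpus: document name -> text (hypothetical stand-ins for the plays)
docs = {
    "As You Like It": "battle soldier soldier fool fool clown",
    "Twelfth Night":  "fool fool fool clown clown soldier",
}

vocab = ["battle", "soldier", "fool", "clown"]

# Each column of the term-document matrix is one document's count vector.
term_doc = {
    name: Counter(text.split())          # count every token in the document
    for name, text in docs.items()
}

for word in vocab:                        # one row per vocabulary word
    row = [term_doc[name][word] for name in docs]
    print(word, row)
```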

Page 13:

Similarity in term-document matrices

Two documents are similar if their vectors are similar

13

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               1               8            15
soldier             2               2              12            36
fool               37              58               1             5
clown               6             117               0             0

Page 14:

The words in a term-document matrix

• Each word is a count vector in ℕ^D: a row below

14

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               1               8            15
soldier             2               2              12            36
fool               37              58               1             5
clown               6             117               0             0

Page 15:

The words in a term-document matrix

• Two words are similar if their vectors are similar

15

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               1               8            15
soldier             2               2              12            36
fool               37              58               1             5
clown               6             117               0             0

Page 16:

The word-word or word-context matrix

16

Page 17:

17

             aardvark   computer   data   pinch   result   sugar   …
apricot          0          0        0      1       0        1
pineapple        0          0        0      1       0        1
digital          0          2        1      0       1        0
information      0          1        6      0       4        0
…

Page 18:

Word-word matrix

18
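A word-word matrix counts how often context words appear within a window around each target word. A minimal sketch (a simple ±window token count, not necessarily the slides' exact setup):

```python
from collections import defaultdict

def cooccurrence(tokens, window=4):
    """Count how often each context word appears within +/-window of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the target position itself
                counts[target][tokens[j]] += 1
    return counts

tokens = "a bottle of tesguino is on the table".split()
print(dict(cooccurrence(tokens, window=2)["tesguino"]))
```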

Page 19:

Two kinds of co-occurrence between two words

• First-order co-occurrence (syntagmatic association):
  • They are typically nearby each other.
  • wrote is a first-order associate of book or poem.
• Second-order co-occurrence (paradigmatic association):
  • They have similar neighbors.
  • wrote is a second-order associate of words like said or remarked.

19

(Schütze and Pedersen, 1993)

Page 20:

Vector Semantics: Positive Pointwise Mutual Information (PPMI)

Page 21:

Problem with raw counts

• Raw word frequency is not a great measure of association between words
  • It's very skewed: “the” and “of” are very frequent, but maybe not the most discriminative
• We'd rather have a measure that asks whether a context word is particularly informative about the target word.
  • Positive Pointwise Mutual Information (PPMI)

21

Page 22:

Pointwise Mutual Information

$$\mathrm{PMI}(x, y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$$

Page 23:

Positive Pointwise Mutual Information

$$\mathrm{PPMI}(x, y) = \max\!\left(\log_2 \frac{P(x,y)}{P(x)\,P(y)},\; 0\right)$$

Page 24:

Computing PPMI on a term-context matrix

• Matrix F with W rows (words) and C columns (contexts)
• f_ij is the number of times w_i occurs in context c_j

24

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad
p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad
p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$$

$$\mathrm{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}} \qquad
\mathrm{ppmi}_{ij} =
\begin{cases}
\mathrm{pmi}_{ij} & \text{if } \mathrm{pmi}_{ij} > 0 \\
0 & \text{otherwise}
\end{cases}$$
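A small numpy sketch of these formulas, applied to the word-word counts shown earlier (computer, data, pinch, result, sugar columns); the function name and layout are my own, not the slides':

```python
import numpy as np

def ppmi(F):
    """PPMI from a word-by-context count matrix F (rows = words, cols = contexts)."""
    total = F.sum()
    p_ij = F / total                       # joint probabilities
    p_i = p_ij.sum(axis=1, keepdims=True)  # row (word) marginals
    p_j = p_ij.sum(axis=0, keepdims=True)  # column (context) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_ij / (p_i * p_j))
    return np.maximum(pmi, 0)              # clip negatives (and -inf from zero counts) to 0

# Counts for apricot, pineapple, digital, information from the matrix above
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)
print(ppmi(F).round(2))   # information/data comes out around 0.57
```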

Page 25:

25

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad
p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad
p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$$

p(w,context)      computer   data   pinch   result   sugar     p(w)
apricot             0.00     0.00    0.05    0.00     0.05     0.11
pineapple           0.00     0.00    0.05    0.00     0.05     0.11
digital             0.11     0.05    0.00    0.05     0.00     0.21
information         0.05     0.32    0.00    0.21     0.00     0.58
p(context)          0.16     0.37    0.11    0.26     0.11

p(w = information, c = data) = 6/19 = .32
p(w = information) = 11/19 = .58
p(c = data) = 7/19 = .37

Page 26:

26

$$\mathrm{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}$$

• pmi(information, data) = log2( .32 / (.37 * .58) ) = .58  (.57 using full precision)

p(w,context)      computer   data   pinch   result   sugar     p(w)
apricot             0.00     0.00    0.05    0.00     0.05     0.11
pineapple           0.00     0.00    0.05    0.00     0.05     0.11
digital             0.11     0.05    0.00    0.05     0.00     0.21
information         0.05     0.32    0.00    0.21     0.00     0.58
p(context)          0.16     0.37    0.11    0.26     0.11

PPMI(w,context)   computer   data   pinch   result   sugar
apricot              -        -     2.25      -      2.25
pineapple            -        -     2.25      -      2.25
digital             1.66     0.00     -      0.00      -
information         0.00     0.57     -      0.47      -

Page 27:

Weighting PMI

• PMI is biased toward infrequent events
  • Very rare words have very high PMI values
• Two solutions:
  • Give rare words slightly higher probabilities
  • Use add-delta smoothing (which has a similar effect)

27

Page 28:

Weighting PMI: Giving rare context words slightly higher probability

28
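The formula on this slide did not survive extraction; the weighting described in Jurafsky & Martin (following Levy et al. 2015) raises context counts to a power α, typically 0.75, so rare contexts get slightly higher probability. It is presumably what the slide showed:

$$\mathrm{PPMI}_\alpha(w, c) = \max\!\left(\log_2 \frac{P(w,c)}{P(w)\,P_\alpha(c)},\; 0\right), \qquad
P_\alpha(c) = \frac{\mathrm{count}(c)^\alpha}{\sum_{c'} \mathrm{count}(c')^\alpha}, \quad \alpha = 0.75$$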

Page 29:

29

Add-2 smoothed count(w,context)
                  computer   data   pinch   result   sugar
apricot               2        2       3       2       3
pineapple             2        2       3       2       3
digital               4        3       2       3       2
information           3        8       2       6       2

p(w,context) [add-2]
                  computer   data   pinch   result   sugar     p(w)
apricot             0.03     0.03    0.05    0.03     0.05     0.20
pineapple           0.03     0.03    0.05    0.03     0.05     0.20
digital             0.07     0.05    0.03    0.05     0.03     0.24
information         0.05     0.14    0.03    0.10     0.03     0.36
p(context)          0.19     0.25    0.17    0.22     0.17

Page 30:

PPMI versus add-2 smoothed PPMI

30

PPMI(w,context) [add-2]
                  computer   data   pinch   result   sugar
apricot             0.00     0.00    0.56    0.00     0.56
pineapple           0.00     0.00    0.56    0.00     0.56
digital             0.62     0.00    0.00    0.00     0.00
information         0.00     0.58    0.00    0.37     0.00

PPMI(w,context)
                  computer   data   pinch   result   sugar
apricot              -        -     2.25      -      2.25
pineapple            -        -     2.25      -      2.25
digital             1.66     0.00     -      0.00      -
information         0.00     0.57     -      0.47      -

Page 31:

Vector Semantics: Measuring similarity: the cosine

Page 32:

Measuring similarity

• Given 2 target words v and w
• We'll need a way to measure their similarity.
• Most measures of vector similarity are based on the:
  • Dot product or inner product from linear algebra
  • High when two vectors have large values in the same dimensions.
  • Low (in fact 0) for orthogonal vectors with zeros in complementary distribution

32

Page 33:

Problem with dot product

• Dot product is larger if the vector is longer. Vector length: $|\vec{v}| = \sqrt{\sum_{i=1}^{N} v_i^2}$
• Vectors are longer if they have higher values in each dimension
  • That means more frequent words will have higher dot products
  • That's bad: we don't want a similarity metric to be sensitive to word frequency

33

Page 34:

Solution: cosine

• Just divide the dot product by the length of the two vectors!
• This turns out to be the cosine of the angle between them!

34

Page 35:

Cosine for computing similarity

$$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|}
= \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|}
= \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\; \sqrt{\sum_{i=1}^{N} w_i^2}}$$

(The numerator is the dot product; the middle expression rewrites it in terms of unit vectors.)

v_i is the PPMI value for word v in context i; w_i is the PPMI value for word w in context i.
cos(v, w) is the cosine similarity of v and w

Sec. 6.3

Page 36:

Cosine as a similarity metric

• -1: vectors point in opposite directions
• +1: vectors point in same directions
• 0: vectors are orthogonal
• Raw frequency or PPMI are non-negative, so the cosine ranges 0-1

36

Page 37:

                large   data   computer
apricot           2       0        0
digital           0       1        2
information       1       6        1

37

Which pair of words is more similar?

$$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|}
= \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\; \sqrt{\sum_{i=1}^{N} w_i^2}}$$

cosine(apricot, information) = (2 + 0 + 0) / (√(4+0+0) · √(1+36+1)) = 2 / (2·√38) ≈ .16
cosine(digital, information) = (0 + 6 + 2) / (√(0+1+4) · √(1+36+1)) = 8 / (√5·√38) ≈ .58
cosine(apricot, digital)     = (0 + 0 + 0) / (√(4+0+0) · √(0+1+4)) = 0
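A quick numpy check of this arithmetic (a sketch; the helper name is my own):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product of v and w divided by the product of their lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Count vectors over the contexts (large, data, computer) from the table above
apricot     = np.array([2.0, 0.0, 0.0])
digital     = np.array([0.0, 1.0, 2.0])
information = np.array([1.0, 6.0, 1.0])

print(round(cosine(apricot, information), 2))   # ~0.16
print(round(cosine(digital, information), 2))   # ~0.58
print(round(cosine(apricot, digital), 2))       # 0.0
```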


Page 38:

Visualizing vectors and angles

[Figure: the vectors for digital, apricot, and information plotted in two dimensions; Dimension 1: 'large', Dimension 2: 'data']

38

                large   data
apricot           2       0
digital           0       1
information       1       6

Page 39:

Clustering vectors to visualize similarity in co-occurrence matrices

Rohde, Gonnerman, Plaut, "Modeling Word Meaning Using Lexical Co-Occurrence"

[Figure 8: Multidimensional scaling for three noun classes (body parts such as HAND, FOOT, EYE; animals such as DOG, CAT, PUPPY; places such as CHICAGO, FRANCE, ASIA).]

[Figure 9: Hierarchical clustering for three noun classes using distances based on vector correlations.]

39   Rohde et al. (2006)

Page 40:

Other possible similarity measures

Page 41:

Evaluating similarity (the same as for thesaurus-based)

• Intrinsic Evaluation:
  • Correlation between algorithm and human word similarity ratings
• Extrinsic (task-based, end-to-end) Evaluation:
  • Spelling error detection, WSD, essay grading
  • Taking TOEFL multiple-choice vocabulary tests

Levied is closest in meaning to which of these: imposed, believed, requested, correlated

Page 42:

Alternative to PPMI for measuring association

• tf-idf (that's a hyphen, not a minus sign)
• The combination of two factors
  • Term frequency (Luhn 1957): frequency of the word (can be logged)
  • Inverse document frequency (IDF) (Sparck Jones 1972)
    • N is the total number of documents
    • df_i = "document frequency of word i" = # of documents containing word i
• w_ij = weight of word i in document j

$$w_{ij} = \mathrm{tf}_{ij} \times \mathrm{idf}_i \qquad
\mathrm{idf}_i = \log\!\left(\frac{N}{\mathrm{df}_i}\right)$$

42
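A small sketch of tf-idf weighting applied to the Shakespeare counts above (the log base and the particular logged term frequency are common choices, not something the slide specifies):

```python
import numpy as np

def tf_idf(counts):
    """counts: term-document count matrix (rows = words, cols = documents)."""
    N = counts.shape[1]                        # total number of documents
    df = (counts > 0).sum(axis=1)              # document frequency of each word
    idf = np.log10(N / df)                     # idf_i = log(N / df_i)
    tf = np.log10(counts + 1)                  # logged term frequency (one common choice)
    return tf * idf[:, None]                   # w_ij = tf_ij * idf_i

# battle, soldier, fool, clown counts from the term-document matrix above
counts = np.array([[1, 1, 8, 15],
                   [2, 2, 12, 36],
                   [37, 58, 1, 5],
                   [6, 117, 0, 0]], dtype=float)

# Words that occur in every play get idf = 0, so only the clown row keeps weight here.
print(tf_idf(counts).round(2))
```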

Page 43:

tf-idf not generally used for word-word similarity

• But it is by far the most common weighting when we are considering the relationship of words to documents

43

Page 44:

Vector Semantics: Dense Vectors

Page 45:

Sparse versus dense vectors

• PPMI vectors are
  • long (length |V| = 20,000 to 50,000)
  • sparse (most elements are zero)
• Alternative: learn vectors which are
  • short (length 200-1000)
  • dense (most elements are non-zero)

45

Page 46:

Sparse versus dense vectors

• Why dense vectors?
  • Short vectors may be easier to use as features in machine learning (fewer weights to tune)
  • Dense vectors may generalize better than storing explicit counts
  • They may do better at capturing synonymy:
    • car and automobile are synonyms, but they are represented as distinct dimensions; this fails to capture similarity between a word with car as a neighbor and a word with automobile as a neighbor

46

Page 47:

Three methods for getting short dense vectors

• Singular Value Decomposition (SVD)
  • A special case of this is called LSA: Latent Semantic Analysis
• "Neural Language Model"-inspired predictive models
  • skip-grams and CBOW
• Brown clustering

47

Page 48:

Vector Semantics: Dense Vectors via SVD

Page 49:

Intuition

• Approximate an N-dimensional dataset using fewer dimensions
• By first rotating the axes into a new space
• In which the highest order dimension captures the most variance in the original dataset
• And the next dimension captures the next most variance, etc.
• Many such (related) methods:
  • PCA: principal components analysis
  • Factor Analysis
  • SVD

49

Page 50:

50

Dimensionality reduction

Page 51:

Singular Value Decomposition

51

Any rectangular w × c matrix X equals the product of 3 matrices:
• W: rows corresponding to the original rows, but each of its m columns represents a dimension in a new latent space, such that
  • the m column vectors are orthogonal to each other
  • columns are ordered by the amount of variance in the dataset each new dimension accounts for
• S: diagonal m × m matrix of singular values expressing the importance of each dimension
• C: columns corresponding to the original columns, but with m rows corresponding to the singular values

Page 52:

Singular Value Decomposition

52   Landauer and Dumais (1997)

Page 53:

SVD applied to term-document matrix: Latent Semantic Analysis

• Instead of keeping all m dimensions, we just keep the top k singular values. Let's say 300.
• The result is a least-squares approximation to the original X

53

[Diagram: the W, S, and C matrices truncated to keep only the top k dimensions]

Deerwester et al. (1988)
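A minimal numpy sketch of truncated SVD (a random matrix just to show the shapes; whether to scale the word rows by the singular values varies across implementations):

```python
import numpy as np

# X: any word-by-document (or word-by-word PPMI) matrix; random here just to show shapes
rng = np.random.default_rng(0)
X = rng.random((1000, 500))

W, s, Ct = np.linalg.svd(X, full_matrices=False)   # X = W @ diag(s) @ Ct

k = 300                                            # keep only the top k singular values
W_k, s_k, Ct_k = W[:, :k], s[:k], Ct[:k, :]
X_k = W_k @ np.diag(s_k) @ Ct_k                    # least-squares rank-k approximation of X

word_embeddings = W_k * s_k                        # one k-dimensional row per word
print(word_embeddings.shape)                       # (1000, 300)
```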

Page 54:

LSA: more details

• 300 dimensions are commonly used
• The cells are commonly weighted by a product of two weights
  • Local weight: log term frequency
  • Global weight: either idf or an entropy measure

54

Page 55:

Let's return to PPMI word-word matrices

• Can we apply SVD to them?

55

Page 56:

SVD applied to term-term matrix

56   (I'm simplifying here by assuming the matrix has rank |V|)

Page 57:

Truncated SVD on term-term matrix

57

Page 58:

Truncated SVD produces embeddings

58

• Each row of the W matrix is a k-dimensional representation of each word w
• k might range from 50 to 1000
• Generally we keep the top k dimensions, but some experiments suggest that getting rid of the top 1 dimension or even the top 50 dimensions is helpful (Lapesa and Evert 2014).

Page 59:

Embeddings versus sparse vectors

• Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity
  • Denoising: low-order dimensions may represent unimportant information
  • Truncation may help the models generalize better to unseen data.
  • Having a smaller number of dimensions may make it easier for classifiers to properly weight the dimensions for the task.
  • Dense models may do better at capturing higher order co-occurrence.

59

Page 60:

Vector Semantics: Embeddings inspired by neural language models: skip-grams and CBOW

Page 61:

Prediction-based models: An alternative way to get dense vectors

• Skip-gram (Mikolov et al. 2013a), CBOW (Mikolov et al. 2013b)
• Learn embeddings as part of the process of word prediction.
• Train a neural network to predict neighboring words
  • Inspired by neural net language models.
  • In so doing, learn dense embeddings for the words in the training corpus.
• Advantages:
  • Fast, easy to train (much faster than SVD)
  • Available online in the word2vec package
    • Including sets of pretrained embeddings!

61
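As an illustration only (not the slides' own code), the gensim package reimplements skip-gram and CBOW; the parameter names below assume gensim 4.x, and the toy corpus is far too small for real training:

```python
from gensim.models import Word2Vec

# Tiny toy corpus: a list of tokenized sentences
sentences = [
    ["we", "make", "tesguino", "out", "of", "corn"],
    ["everybody", "likes", "tesguino"],
    ["tesguino", "makes", "you", "drunk"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the dense embeddings
    window=2,          # context window of +/- 2 words
    sg=1,              # 1 = skip-gram, 0 = CBOW
    min_count=1,       # keep even rare words in this toy example
)

print(model.wv["tesguino"].shape)          # (100,)
print(model.wv.most_similar("tesguino"))   # nearest neighbours in the learned space
```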

Page 62:

Skip-grams

• Predict each neighboring word
  • in a context window of 2C words
  • from the current word.
• So for C = 2, we are given word w_t and predicting these 4 words: w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}

62

Page 63:

Skip-grams learn 2 embeddings for each w

• input embedding v, in the input matrix W
  • Column i of the input matrix W is the 1 × d embedding v_i for word i in the vocabulary.
• output embedding v′, in the output matrix W′
  • Row i of the output matrix W′ is a d × 1 vector embedding v′_i for word i in the vocabulary.

63

Page 64:

Setup

• Walking through the corpus pointing at word w(t), whose index in the vocabulary is j, so we'll call it w_j (1 < j < |V|).
• Let's predict w(t+1), whose index in the vocabulary is k (1 < k < |V|). Hence our task is to compute P(w_k | w_j).

64

Page 65:

One-hot vectors

• A vector of length |V|
• 1 for the target word and 0 for other words
• So if "popsicle" is vocabulary word 5
• The one-hot vector is
  • [0, 0, 0, 0, 1, 0, 0, 0, 0, ......., 0]

65

Page 66:

66

Skip-gram

Page 67:

67

Skip-gram

h = v_j
o = W′h

Page 68:

68

Skip-gram

h = v_j
o = W′h
o_k = v′_k · h
o_k = v′_k · v_j

Page 69:

Turning outputs into probabilities

• o_k = v′_k · v_j
• We use softmax to turn the outputs into probabilities

69
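A toy numpy sketch of this forward pass and softmax (the dimensions and the random W, W′ are made up for illustration; in practice both matrices are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 5                      # toy vocabulary size and embedding dimension
W  = rng.standard_normal((d, V))  # input embeddings: column j is v_j
Wp = rng.standard_normal((V, d))  # output embeddings: row k is v'_k

j = 3                             # index of the current (input) word w_j
h = W[:, j]                       # hidden layer is just the input embedding v_j
o = Wp @ h                        # o_k = v'_k . v_j for every word k

p = np.exp(o) / np.exp(o).sum()   # softmax turns the scores into P(w_k | w_j)
print(p.round(3), p.sum())        # probabilities over the vocabulary, summing to 1
```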

Page 70:

Embeddings from W and W′

• Since we have two embeddings, v_j and v′_j, for each word w_j
• We can either:
  • Just use v_j
  • Sum them
  • Concatenate them to make a double-length embedding

70

Page 71:

But wait; how do we learn the embeddings?

71

Page 72:

Relation between skip-grams and PMI!

• If we multiply WW′^T
• We get a |V| × |V| matrix M, each entry m_ij corresponding to some association between input word i and output word j
• Levy and Goldberg (2014b) show that skip-gram reaches its optimum just when this matrix is a shifted version of PMI:

  WW′^T = M_PMI − log k

• So skip-gram is implicitly factoring a shifted version of the PMI matrix into the two embedding matrices.

72

Page 73:

CBOW (Continuous Bag of Words)

73

Page 74:

Properties of embeddings

74

• Nearest words to some embeddings (Mikolov et al. 2013)

Page 75:

Embeddings capture relational meaning!

75
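For instance (my example, not necessarily the slide's own demo), the well-known king - man + woman ≈ queen offset can be checked with pretrained vectors loaded through gensim's downloader; this assumes the gensim downloader and the glove-wiki-gigaword-100 vectors are available (a sizeable download):

```python
import gensim.downloader as api

# Pretrained embeddings; any pretrained vector set would do
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= queen: offsets in the space encode relational meaning
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Nearest words to an embedding, as on the previous slide
print(wv.most_similar("frog", topn=5))
```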

Page 76:

Vector Semantics: Evaluating similarity

Page 77:

Evaluating similarity

• Extrinsic (task-based, end-to-end) Evaluation:
  • Question Answering
  • Spell Checking
  • Essay grading
• Intrinsic Evaluation:
  • Correlation between algorithm and human word similarity ratings
    • Wordsim353: 353 noun pairs rated 0-10. sim(plane, car) = 5.77
  • Taking TOEFL multiple-choice vocabulary tests
    • Levied is closest in meaning to: imposed, believed, requested, correlated

Page 78:

Summary

• Distributional (vector) models of meaning
  • Sparse (PPMI-weighted word-word co-occurrence matrices)
  • Dense:
    • Word-word SVD, 50-2000 dimensions
    • Skip-grams and CBOW (embeddings available in word2vec)

78

Page 79:

A great semantic vector space for documents

• words have low-dimensional embeddings, useful for many computational linguistic applications
• documents are a weighted combination of words
• documents as a vector in the low-dimensional space
• this allows
  • semantic document clustering (k-means, hierarchical, etc.)
  • search for similar documents (prior art in patents, etc.)

79

