Word embeddings theory


Tengyu Ma

Joint work with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej Risteski

Princeton University

Embedding: $x \in \mathcal{X}$ (complicated space) $\mapsto$ $v_x \in \mathbb{R}^d$ (Euclidean space with meaningful inner products)
- Kernel methods: the embedded data is linearly separable
- Neural nets: the embedding feeds a multi-class linear classifier

Vocabulary = { 60k most frequent words } $\to$ $\mathbb{R}^{300}$

Goal: the embedding captures semantic information (via linear algebraic operations)
- Inner products characterize similarity: similar words have large inner products
- Differences characterize relationships: analogous pairs have similar differences
- More? (picture: Chris Olah's blog)

The meaning of a word is determined by the words it co-occurs with.
(Distributional hypothesis of meaning, [Harris'54], [Firth'57])

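Co-occurrence matrix $\Pr[\cdot,\cdot]$: rows indexed by word $x$ (row $\to v_x$), columns indexed by word $y$
- $\Pr[x, y] \triangleq$ prob. of co-occurrence of $x, y$ in a window of size 5
- $\langle v_x, v_y \rangle$: a good measure of similarity of $(x, y)$ [Lund-Burgess'96]
- $v_x$ = row of the entry-wise square root of the co-occurrence matrix [Rohde et al'05]
- $v_x$ = row of the PMI matrix, $\mathrm{PMI}(x, y) = \log \frac{\Pr[x, y]}{\Pr[x]\,\Pr[y]}$ [Church-Hanks'90]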

- "Linear structure" in the found $v_x$'s:
  $v_{woman} - v_{man} \approx v_{queen} - v_{king} \approx v_{aunt} - v_{uncle} \approx \cdots$

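Algorithm [Levy-Goldberg]: (dimension-reduction version of [Church-Hanks'90])
- Compute $\mathrm{PMI}(x, y) = \log \frac{\Pr[x, y]}{\Pr[x]\,\Pr[y]}$
- Take the rank-300 SVD (best rank-300 approximation) of PMI
- $\Leftrightarrow$ Fit $\mathrm{PMI}(x, y) \approx \langle v_x, v_y \rangle$ (with squared loss), where $v_x \in \mathbb{R}^{300}$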
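The two steps of this algorithm can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the 4x4 count matrix is made up, all counts are assumed strictly positive (so the logarithm is defined), and the function names `pmi_matrix` and `low_rank_fit` are hypothetical. A rank of 2 stands in for the rank 300 on the slide.

```python
import numpy as np

def pmi_matrix(counts):
    """PMI(x, y) = log(Pr[x, y] / (Pr[x] Pr[y])) from a symmetric count matrix.

    Assumes every count is strictly positive.
    """
    joint = counts / counts.sum()          # empirical Pr[x, y]
    marginal = joint.sum(axis=1)           # empirical Pr[x]
    return np.log(joint / np.outer(marginal, marginal))

def low_rank_fit(pmi, k):
    """Best rank-k approximation of `pmi` in squared loss, via truncated SVD."""
    u, s, vt = np.linalg.svd(pmi)
    left = u[:, :k] * np.sqrt(s[:k])       # one k-dim vector per row word
    right = vt[:k].T * np.sqrt(s[:k])      # one k-dim vector per column word
    return left, right

# Made-up co-occurrence counts for a 4-word vocabulary.
counts = np.array([[10.0, 8.0, 1.0, 1.0],
                   [8.0, 12.0, 1.0, 2.0],
                   [1.0, 1.0, 9.0, 7.0],
                   [1.0, 2.0, 7.0, 11.0]])
pmi = pmi_matrix(counts)
left, right = low_rank_fit(pmi, k=2)
approx = left @ right.T                    # rank-2 approximation of the PMI matrix
```

The sketch returns separate left and right factors because a symmetric matrix with negative eigenvalues cannot be written exactly as $\langle v_x, v_y \rangle$ with a single set of vectors; for a positive semidefinite matrix the two factors coincide up to the choice of SVD basis.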

- Questions: woman : man :: queen : ? , aunt : ?
- Answers:
  $king = \arg\min_w \| v_{queen} - v_w - (v_{woman} - v_{man}) \|$
  $uncle = \arg\min_w \| v_{aunt} - v_w - (v_{woman} - v_{man}) \|$
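The argmin rule for answering analogy questions can be sketched as follows. The 3-dimensional vectors are hypothetical toy values chosen so that the linear structure holds exactly, and excluding the three query words from the candidates is an assumption of this sketch (a common convention), not something stated on the slide.

```python
import numpy as np

def solve_analogy(vectors, a, b, c):
    """Answer a : b :: c : ? by minimizing ||v_c - v_w - (v_a - v_b)|| over w."""
    target = vectors[c] - (vectors[a] - vectors[b])
    # Assumption: the query words themselves are not allowed as answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(target - vectors[w]))

# Hypothetical toy embedding: axis 0 = gender, axis 1 = royalty, axis 2 = family.
vectors = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([-1.0, 0.0, 0.0]),
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([-1.0, 1.0, 0.0]),
    "uncle": np.array([1.0, 0.0, 1.0]),
    "aunt": np.array([-1.0, 0.0, 1.0]),
}

answer_royal = solve_analogy(vectors, "woman", "man", "queen")
answer_family = solve_analogy(vectors, "woman", "man", "aunt")
```

With real embeddings the differences only agree approximately, so the winner has to beat every other word in the vocabulary.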

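- Recurrent neural network based model [Mikolov et al'12]
- word2vec [Mikolov et al'13]:
  $\Pr[x_{i+6} \mid x_{i+1}, \dots, x_{i+5}] \propto \exp\langle v_{x_{i+6}}, \tfrac{1}{5}(v_{x_{i+1}} + \cdots + v_{x_{i+5}}) \rangle$
- GloVe [Pennington et al'14]:
  $\log \Pr[x, y] \approx \langle v_x, v_y \rangle + s_x + s_y + C$
- [Levy-Goldberg'14] (previous slide):
  $\mathrm{PMI}(x, y) = \log \frac{\Pr[x, y]}{\Pr[x]\,\Pr[y]} \approx \langle v_x, v_y \rangle + C$

The logarithm (or exponential) seems to exclude linear algebra!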

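Why co-occurrence statistics + log $\to$ linear structure [Levy-Goldberg'13, Pennington et al'14, rephrased]
- For most of the words $\chi$:
  $\frac{\Pr[\chi \mid king]}{\Pr[\chi \mid queen]} \approx \frac{\Pr[\chi \mid man]}{\Pr[\chi \mid woman]}$
  - for $\chi$ unrelated to gender: LHS, RHS $\approx 1$
  - for $\chi$ = dress: LHS, RHS $\ll 1$; for $\chi$ = John: LHS, RHS $\gg 1$
- It suggests
  $\sum_\chi \left( \log \frac{\Pr[\chi \mid king]}{\Pr[\chi \mid queen]} - \log \frac{\Pr[\chi \mid man]}{\Pr[\chi \mid woman]} \right)^2$
  $= \sum_\chi \big( \mathrm{PMI}(\chi, king) - \mathrm{PMI}(\chi, queen) - (\mathrm{PMI}(\chi, man) - \mathrm{PMI}(\chi, woman)) \big)^2 \approx 0$
- Rows of the PMI matrix have "linear structure"
- Empirically one can find $v_w$'s s.t. $\mathrm{PMI}(\chi, w) \approx \langle v_\chi, v_w \rangle$
- Suggestion: the $v_w$'s also have linear structure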

M1: Why do low-dim vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dim fit of the PMI matrix even possible?
  $\mathrm{PMI}(x, y) \approx \langle v_x, v_y \rangle$ (∗)
- The empirical fit of (∗) has 17% error
- NB: the PMI matrix is not necessarily PSD.

M2: Why do low-dim vectors solve analogy when (∗) is only roughly true?
- NB: solving the analogy task requires inner products of 6 pairs of word vectors, and that "king" survives against all other words; noise is potentially an issue!
  $king = \arg\min_w \| v_{queen} - v_w - (v_{woman} - v_{man}) \|^2$
- Fact: low-dim word vectors have more accurate linear structure than the rows of PMI (therefore better analogy task performance).

M1: Why do low-dim vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dim fit of the PMI matrix even possible?
  $\mathrm{PMI}(x, y) \approx \langle v_x, v_y \rangle$ (∗)
A1: Under a generative model (named RAND-WALK), (∗) provably holds

M2: Why do low-dim vectors solve analogy when (∗) is only roughly true?
A2: (∗) + isotropy of word vectors $\Rightarrow$ low-dim fitting reduces noise
(Quite intuitive, though it doesn't follow from Occam's bound for PAC-learning)

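- Hidden Markov Model:
  - discourse vector $c_t \in \mathbb{R}^d$ governs the discourse/theme/context of time $t$
  - words $w_t$ (observable); embedding $v_{w_t} \in \mathbb{R}^d$ (parameters to learn)
  - log-linear observation model: $\Pr[w_t \mid c_t] \propto \exp\langle v_{w_t}, c_t \rangle$
- Closely related to [Mnih-Hinton'07]

[Figure: hidden chain of discourse vectors $c_t, c_{t+1}, c_{t+2}, c_{t+3}, \dots$, each emitting the observed word $w_t, w_{t+1}, w_{t+2}, w_{t+3}, \dots$]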
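A minimal sampler for this kind of model is sketched below. Only the log-linear observation rule comes from the slide; the transition rule (a small Gaussian step followed by renormalization to the unit sphere), the step size, and the function name `sample_corpus` are assumptions made for illustration.

```python
import numpy as np

def sample_corpus(vectors, length, step=0.05, seed=0):
    """Sample word ids from a log-linear hidden-discourse model.

    vectors: (n_words, d) array of word embeddings v_w.
    The discourse vector c_t takes small random steps on the unit sphere
    (a hypothetical choice of slow transition).
    """
    rng = np.random.default_rng(seed)
    n_words, d = vectors.shape
    c = rng.standard_normal(d)
    c /= np.linalg.norm(c)
    words = np.empty(length, dtype=int)
    for t in range(length):
        logits = vectors @ c                       # <v_w, c_t> for every word w
        probs = np.exp(logits - logits.max())      # subtract max for stability
        probs /= probs.sum()                       # Pr[w | c_t] = exp<v_w, c_t> / Z_c
        words[t] = rng.choice(n_words, p=probs)
        c = c + step * rng.standard_normal(d) / np.sqrt(d)   # small drift
        c /= np.linalg.norm(c)
    return words

rng = np.random.default_rng(1)
vectors = rng.standard_normal((50, 10))            # made-up embeddings
corpus = sample_corpus(vectors, 200)
```

Counting co-occurrences in a long sample from such a sampler is one way to compare empirical PMI values against inner products of the generating vectors.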

- Ideally, $c_t, v_w \in \mathbb{R}^d$ should contain semantic information in their coordinates
  - E.g. (0.5, -0.3, …) could mean "0.5 gender, -0.3 age, ..."
- But the whole system is rotationally invariant: $\langle c_t, v_w \rangle = \langle R c_t, R v_w \rangle$
- There should exist a rotation so that the coordinates are meaningful (back to this later)
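The invariance is easy to check numerically. In this sketch the orthogonal matrix is obtained from a QR decomposition of a random Gaussian matrix, which is one convenient way to get an orthogonal $R$ (it may be a rotation or a reflection); the dimension and the random vectors are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
c = rng.standard_normal(d)                 # stand-in for a discourse vector
v = rng.standard_normal(d)                 # stand-in for a word vector

# Q from a QR decomposition is orthogonal: Q.T @ Q = I.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

before = v @ c
after = (R @ v) @ (R @ c)                  # equals v.T R.T R c = v.T c
```

Since every probability in the model depends on the vectors only through such inner products, the data cannot distinguish the original coordinates from the rotated ones.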


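- Assumptions:
  - $\{v_w\}$ consists of vectors drawn from $s \cdot \mathcal{N}(0, \mathrm{Id})$; $s$ is a bounded scalar r.v.
  - $c_t$ does a slow random walk (doesn't change much in a window of 5)
  - log-linear observation model: $\Pr[w_t \mid c_t] \propto \exp\langle v_{w_t}, c_t \rangle$
- Main Theorem:
  (1) $\log \Pr[w, w'] = \|v_w + v_{w'}\|^2 / (2d) - 2 \log Z \pm \epsilon$
  (2) $\log \Pr[w] = \|v_w\|^2 / (2d) - \log Z \pm \epsilon$
  (3) $\mathrm{PMI}(w, w') = \langle v_w, v_{w'} \rangle / d \pm \epsilon$
- Norm determines frequency; spatial orientation determines "meaning"
- Fact: (2) implies that the words have a power law distribution.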
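Statement (3) follows from (1) and (2) by expanding the squared norm. The short derivation below takes (1) and (2) with the normalization $\|\cdot\|^2/(2d)$, which is the normalization under which the three statements agree; the constant 3 in front of $\epsilon$ is just the crude sum of the three error terms.

```latex
\begin{align*}
\mathrm{PMI}(w, w')
  &= \log \Pr[w, w'] - \log \Pr[w] - \log \Pr[w'] \\
  &= \frac{\|v_w + v_{w'}\|^2 - \|v_w\|^2 - \|v_{w'}\|^2}{2d}
     - 2\log Z + \log Z + \log Z \pm 3\epsilon \\
  &= \frac{2\langle v_w, v_{w'} \rangle}{2d} \pm 3\epsilon
   = \frac{\langle v_w, v_{w'} \rangle}{d} \pm 3\epsilon .
\end{align*}
```

The partition function $Z$ cancels exactly, which is why PMI, rather than the raw log co-occurrence probability, is the quantity that matches an inner product.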

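- word2vec [Mikolov et al'13]:
  $\Pr[w_{i+6} \mid w_{i+1}, \dots, w_{i+5}] \propto \exp\langle v_{w_{i+6}}, \tfrac{1}{5}(v_{w_{i+1}} + \cdots + v_{w_{i+5}}) \rangle$
- GloVe [Pennington et al'14]:
  $\log \Pr[w, w'] \approx \langle v_w, v_{w'} \rangle + s_w + s_{w'} + C$
  Eq. (1): $\log \Pr[w, w'] = \|v_w + v_{w'}\|^2 / (2d) - 2 \log Z \pm \epsilon$
- [Levy-Goldberg'14]:
  $\mathrm{PMI}(w, w') \approx \langle v_w, v_{w'} \rangle + C$
  Eq. (3): $\mathrm{PMI}(w, w') = \langle v_w, v_{w'} \rangle / d \pm \epsilon$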

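word2vec: $\Pr[w_{i+6} \mid w_{i+1}, \dots, w_{i+5}] \propto \exp\langle v_{w_{i+6}}, \tfrac{1}{5}(v_{w_{i+1}} + \cdots + v_{w_{i+5}}) \rangle$
- Under our model,
  - The random walk is slow: $c_{i+1} \approx c_{i+2} \approx \cdots \approx c_{i+6} \approx c$
  - Best estimate for the current discourse $c_{i+6}$ (max-likelihood estimate of $c_{i+6}$):
    $\arg\max_{c,\ \|c\| = 1} \Pr[c \mid w_{i+1}, \dots, w_{i+5}] = \alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}})$
  - Prob. distribution of the next word given the best guess $c$:
    $\Pr[w_{i+6} \mid c_{i+6} = \alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}})] \propto \exp\langle v_{w_{i+6}}, \alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}}) \rangle$

[Figure: hidden chain $\dots, c_{i+5}, c_{i+6}$ emitting the words $\dots, w_{i+5}, w_{i+6}$]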

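This talk: window of size 2

[Figure: discourse vectors $c$, $c'$ emit words $w$, $w'$, with $\Pr[w \mid c] \propto \exp\langle v_w, c \rangle$ and $\Pr[w' \mid c'] \propto \exp\langle v_{w'}, c' \rangle$]
- $\Pr[w \mid c] = \frac{1}{Z_c} \cdot \exp\langle v_w, c \rangle$
- $Z_c = \sum_w \exp\langle v_w, c \rangle$: partition function

$\Pr[w, w'] = \int \Pr[w \mid c]\, \Pr[w' \mid c']\, p(c, c')\, dc\, dc' = \int \frac{1}{Z_c Z_{c'}} \cdot \exp\langle v_w, c \rangle \exp\langle v_{w'}, c' \rangle\, p(c, c')\, dc\, dc'$
- Assume $c = c'$ with probability 1 and, for now, set the partition functions aside:
  $\int \exp\langle v_w + v_{w'}, c \rangle\, p(c)\, dc = \exp\left( \|v_w + v_{w'}\|^2 / (2d) \right)$
- For a spherical Gaussian vector $c$: $\mathbb{E}[\exp\langle v, c \rangle] = \exp\left( \|v\|^2 / (2d) \right)$

Eq. (1): $\log \Pr[w, w'] = \|v_w + v_{w'}\|^2 / (2d) - 2 \log Z \pm \epsilon$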
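The Gaussian expectation can be checked by simulation. This sketch assumes $c \sim \mathcal{N}(0, \mathrm{Id}/d)$, so that $\|c\| \approx 1$; under that scaling $\langle v, c \rangle$ is a scalar Gaussian with variance $\|v\|^2/d$ and the moment generating function gives $\exp(\|v\|^2/(2d))$. The dimension, the norm of $v$, and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
v = rng.standard_normal(d)
v *= np.sqrt(0.5 * d) / np.linalg.norm(v)       # rescale so that ||v||^2 = 0.5 * d

# Assumption: c ~ N(0, Id / d), a spherical Gaussian with ||c|| close to 1.
samples = rng.standard_normal((100_000, d)) / np.sqrt(d)

monte_carlo = np.exp(samples @ v).mean()        # empirical E[exp<v, c>]
closed_form = np.exp(v @ v / (2 * d))           # exp(||v||^2 / (2d))
```

Applying the same identity with $v = v_w + v_{w'}$ gives the integral on the slide.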

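[Figure: $c$, $c'$ emit $w$, $w'$; $\Pr[w \mid c] = \frac{1}{Z_c} \cdot \exp\langle v_w, c \rangle$, $\Pr[w' \mid c'] \propto \exp\langle v_{w'}, c' \rangle$]

Lemma 1: for almost all $c$ and almost all $\{v_w\}$, $Z_c = (1 + o(1))\, Z$
- Proof (sketch):
  - for most $c$, $Z_c$ concentrates around its mean
  - the mean of $Z_c$ is determined by $\|c\|$, which in turn concentrates
  - caveat: $\exp\langle v, c \rangle$ for $v \sim \mathcal{N}(0, \mathrm{Id})$ is not subgaussian, nor sub-exponential. (The $\alpha$-Orlicz norm is not bounded for any $\alpha > 0$.)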

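- Proof sketch:
- Fixing $c$, to show with high probability over the choices of the $v_w$'s:
  $Z_c = \sum_w \exp\langle v_w, c \rangle = (1 + o(1))\, \mathbb{E}[Z_c]$
- $z_w = \langle v_w, c \rangle$ is a scalar Gaussian random variable
- $\|c\|$ governs the mean and variance of $z_w$.
- $\|c\|$ in turn is concentrated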

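- Question: $z_1, \dots, z_n \sim \mathcal{N}(0, 1)$ and $Z = \sum_{i=1}^n \exp(z_i)$. How is $Z$ concentrated?
- $\mathbb{E}[Z] = \Theta(n)$, and $\mathrm{Var}[Z] = O(n)$
- The tail of $\exp(z_i)$ is bad!
- $\Pr[\exp z_i > t] \approx t^{-\frac{\log t}{2}}$
- Claim: $\Pr[Z > \mathbb{E} Z + C \sqrt{n} \cdot \log n] \le \exp(-\log^2 n)$
- Trick: truncate $z_i$ at $\log n$ and deal with the tail by a union bound
- (sub)-Gaussian tail: $\Pr[X > t] \le \exp(-t^2/2)$
- (sub)-exponential tail: $\Pr[X > t] \le \exp(-t/2)$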
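A quick simulation illustrates the concentration in the simplified question, using the exact mean $\mathbb{E}[\exp z] = e^{1/2}$ for $z \sim \mathcal{N}(0, 1)$. It is only a numerical illustration with arbitrary sample sizes and seed, not a substitute for the truncation argument.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_per_term = np.exp(0.5)                 # E[exp(z)] for z ~ N(0, 1)

ratios = {}
for n in (10**3, 10**4, 10**5, 10**6):
    z = rng.standard_normal(n)
    Z = np.exp(z).sum()
    ratios[n] = Z / (n * mean_per_term)     # Z / E[Z], expected to approach 1
    print(f"n={n:>8d}   Z / E[Z] = {ratios[n]:.4f}")
```

The fluctuation of the ratio shrinks roughly like $1/\sqrt{n}$, in line with $\mathrm{Var}[Z] = O(n)$, even though each individual term has a heavy (log-normal) tail.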

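- Proof sketch:
- Fixing $c$, we have with high probability over the choices of the $v_w$'s:
  $Z_c = \sum_w \exp\langle v_w, c \rangle = (1 + o(1))\, \mathbb{E}[Z_c]$
- $z_w = \langle v_w, c \rangle$ is a scalar Gaussian random variable
- $\|c\|$ governs the mean and variance of $z_w$.
- $\|c\|$ in turn is concentrated

$\Pr[w, w'] = \int \frac{1}{Z_c Z_{c'}} \cdot \exp\langle v_w + v_{w'}, c \rangle\, p(c)\, dc$
$= (1 \pm o(1)) \frac{1}{Z^2} \int \exp\langle v_w + v_{w'}, c \rangle\, p(c)\, dc$
$= (1 \pm o(1)) \frac{1}{Z^2} \exp\left( \|v_w + v_{w'}\|^2 / (2d) \right)$

[Figure: $c$, $c'$ emit $w$, $w'$; $\Pr[w \mid c] = \frac{1}{Z_c} \cdot \exp\langle v_w, c \rangle$, $\Pr[w' \mid c'] \propto \exp\langle v_{w'}, c' \rangle$]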

- Our theory predicts
  Eq. (1): $\log \Pr[w, w'] = \|v_w + v_{w'}\|^2 / (2d) - 2 \log Z \pm \epsilon$
- (Approximate) maximum likelihood objective (SN):
  $\min_{\{v_w\},\, Y} \sum_{w, w'} \widehat{\Pr}[w, w'] \left( \log \widehat{\Pr}[w, w'] - \|v_w + v_{w'}\|^2 - Y \right)^2$

Simplest word embedding method yet (fewest "knobs" to turn). Comparable performance on the analogy test.
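The SN objective is short enough to write out directly. In this sketch `cooc` plays the role of the empirical co-occurrence probabilities, pairs with zero co-occurrence are skipped (an assumption, since their logarithm is undefined), and the synthetic data is built so that the objective vanishes at the generating parameters. The function name `sn_objective` and all numeric values are hypothetical.

```python
import numpy as np

def sn_objective(vectors, offset, cooc):
    """sum_{w,w'} cooc[w,w'] * (log cooc[w,w'] - ||v_w + v_w'||^2 - offset)^2."""
    sums = vectors[:, None, :] + vectors[None, :, :]    # v_w + v_w' for all pairs
    sq_norms = (sums ** 2).sum(axis=-1)
    mask = cooc > 0                                     # skip unseen pairs
    residual = np.log(cooc[mask]) - sq_norms[mask] - offset
    return float((cooc[mask] * residual ** 2).sum())

# Synthetic check: build co-occurrences that satisfy the model exactly.
rng = np.random.default_rng(0)
true_vectors = 0.3 * rng.standard_normal((6, 3))
true_offset = -2.0
pair_norms = ((true_vectors[:, None, :] + true_vectors[None, :, :]) ** 2).sum(axis=-1)
cooc = np.exp(pair_norms + true_offset)                 # not normalized; illustration only

loss_at_truth = sn_objective(true_vectors, true_offset, cooc)
loss_perturbed = sn_objective(true_vectors + 0.1, true_offset, cooc)
```

In practice the vectors and the scalar offset would be fitted with a gradient-based optimizer; the only tunable pieces are the dimension and the optimizer settings, which matches the "fewest knobs" remark.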

Eq. (2): $\log \Pr[w] = \|v_w\|^2 / (2d) - \log Z \pm \epsilon$

$Z_c = (1 \pm o(1))\, Z$

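- Under the generative model RAND-WALK:
  For most of the words $\chi$: $\frac{\Pr[\chi \mid a]}{\Pr[\chi \mid b]} \approx \frac{\Pr[\chi \mid c]}{\Pr[\chi \mid d]}$ (semantic def. of analogy)
  $\Longleftrightarrow\ v_a - v_b \approx v_c - v_d$ (algebraic def. of analogy)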

- Beyond only solving the analogy task?
- Extracting more information from analogy/embeddings?

Some recent work: extracting different meanings from word embeddings (same team: Arora, Li, Liang, M., Risteski)
- "Tie" can mean an article of clothing, or a physical act
- Tie represents unrelated words tie1, tie2, etc.

Quick experiment: Take two random/unrelated words $w_1$, $w_2$ where $w_1$ is ~100 times more frequent than $w_2$. Declare these to be a single word and compute its embedding in our model.

Result: close to something like $0.8\, v_{w_1} + 0.2\, v_{w_2}$

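- Mathematical explanation
- Merge $w_1$, $w_2$ as $w$. Let $r = \frac{\Pr[w_1]}{\Pr[w_2]}$
- Then $v_w \approx \alpha v_{w_1} + \beta v_{w_2}$, where
  - $\alpha = 1 - c_1 \log\left(1 + \frac{1}{r}\right) \approx 1$
  - $\beta = 1 - c_2 \log r$
- $\beta > 0.1$ even if $r = 100$!
- The rare meaning is not swamped, thanks to the log!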

$v_{tie} = 0.8\, a_1 + 0.2\, a_2 + \text{noise}$, where $a_1$ (discourse for $tie_1$) and $a_2$ (discourse for $tie_2$) correspond to different representative "discourses"

- "Tie" can mean an article of clothing, or a physical act
- Tie represents unrelated words tie1, tie2, etc.
- Sparse coding for extracting different meanings:
  - Find $m = 2000$ "discourses" $a_1, a_2, \dots \in \mathbb{R}^d$ such that each word vector is expressed as a weighted sum of at most 5 of them, plus a "noise vector."
    $v_w = x_{w,1} a_1 + x_{w,2} a_2 + \dots + \text{noise}$, where $x_w$ has only 5 non-zeros
- Training objective:
  $\min_{A = [a_1, \dots, a_m],\ \text{sparse } x_w\text{'s}} \sum_w \|v_w - A x_w\|^2$
- local search algo. [EAB'05], provable algo. [SWW'12, AGM'14, AGMM'15..]
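For a fixed set of atoms, the sparse-coding step alone can be sketched with orthogonal matching pursuit, a standard greedy method. This is a swapped-in illustration, not the local-search or provable algorithms cited above: it does not learn the dictionary $A$, and the tiny orthonormal dictionary and the 0.8 / 0.2 example are made up.

```python
import numpy as np

def omp(dictionary, v, k):
    """Greedy k-sparse code x with v ~ dictionary @ x (orthogonal matching pursuit).

    dictionary: (d, m) array whose columns are the atoms a_1, ..., a_m.
    """
    residual = v.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        scores = np.abs(dictionary.T @ residual)
        if support:
            scores[support] = -np.inf          # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # Refit all chosen coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(dictionary[:, support], v, rcond=None)
        residual = v - dictionary[:, support] @ coef
    x = np.zeros(dictionary.shape[1])
    x[support] = coef
    return x

# Made-up example: 6 orthonormal atoms, and a "word vector" mixing two of them.
dictionary = np.eye(6)
v = 0.8 * dictionary[:, 1] + 0.2 * dictionary[:, 4]
x = omp(dictionary, v, k=2)
```

A full dictionary-learning procedure would alternate a coding step like this one with updates to the atoms; with an overcomplete, non-orthogonal dictionary the greedy choice is no longer guaranteed to find the best support.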

other. Thus combining the bases while merging duplicates yielded a basis of about the same size. Some atoms were found to be semantically meaningless but are easily identified and filtered out by checking if their closest words tend to have low pairwise inner products amongst themselves (16).

The significant discourses represented by the basis vectors will henceforth be referred to as atoms of discourse. The "meaning" of an atom can be discerned by looking at the set of words whose embeddings are close to it. Table 1 contains some examples of the discourse atoms.

Atoms of discourse may be reminiscent of the results of other automated methods for obtaining a thematic understanding of text, such as topic modeling, described in the survey (17). Indeed the model (1) used to compute the word embeddings is related to a log-linear topic model from (18). However, the discourses here are computed via sparse coding on word embeddings, which has no analog in topic modeling. There is also a long tradition of detecting coherent clusters of word vectors using Brown clustering, or even sparse coding (19). The novelty in the current paper is a clear interpretation of the basis (in terms of discourses) yielded by the sparse coding, as well as its use to capture different senses of words.

Atom 1978      825        231           616           1638       149            330
drowning       instagram  stakes        membrane      slapping   orchestra      conferences
suicides       twitter    thoroughbred  mitochondria  pulling    philharmonic   meetings
overdose       facebook   guineas       cytosol       plucking   philharmonia   seminars
murder         tumblr     preakness     cytoplasm     squeezing  conductor      workshops
poisoning      vimeo      filly         membranes     twisting   symphony       exhibitions
commits        linkedin   fillies       organelles    bowing     orchestras     organizes
stabbing       reddit     epsom         endoplasmic   slamming   toscanini      concerts
strangulation  myspace    racecourse    proteins      tossing    concertgebouw  lectures
gunshot        tweets     sired         vesicles      grabbing   solti          presentations

Table 1: Some discourse atoms and their nearest 9 words. By (1), the words most likely to appear in a discourse are those nearest to it.

Representative subset of the 2000 discourses (represented using their nearest words)

[Figure: closest words to $a_{231}$; the 5 atoms that express $v_{tie}$]
- The atoms of discourse found are fairly fine-grained
- Maybe $a_{biochemistry} = \alpha \cdot b_{biology} + \beta \cdot b_{chemistry}$?
- Another layer:
  $\min_{B,\ \text{sparse } Y} \|A - BY\|^2$

- Part I: a new generative model that captures semantics.
- Provable guarantees:
  - the log of the co-occurrence matrix has low rank structure
  - semantic analogy $\Leftrightarrow$ linear algebraic structure for word vectors
- Simplistic assumptions, but a good fit to reality
- Part II: an automatic way of detecting word meanings
  - Hierarchical basis in the embedding space
- Other applications of our model/method?

- Each coordinate of $v_w$ means something (displayed coordinates, left to right: currency, country, American, Chinese):
  $v_{USA} = [\dots, 0, \dots, 1, \dots, 1, \dots, 0, \dots]$
  $v_{dollar} = [\dots, 1, \dots, 0, \dots, 1, \dots, 0, \dots]$
  $v_{China} = [\dots, 0, \dots, 1, \dots, 0, \dots, 1, \dots]$
  $v_{RMB} = [\dots, 1, \dots, 0, \dots, 0, \dots, 1, \dots]$

  $v_{USA} - v_{dollar} = [\dots, -1, \dots, 1, \dots, 0, \dots, 0, \dots]$
  $v_{China} - v_{RMB} = [\dots, -1, \dots, 1, \dots, 0, \dots, 0, \dots]$
- On other coordinates, the values are either very small or the supports are non-overlapping

- Problem: rotational invariance; a rotation of the word vectors doesn't change the model.

Word vectors = (sparse coefficients) $\cdot$ (basis vectors), i.e. $V = X \cdot R$, where the rows of $X$ are sparse (displayed coordinates, left to right: currency, country, American, Chinese):
  USA: $[\dots, 0, \dots, 1, \dots, 1, \dots, 0, \dots]$
  dollar: $[\dots, 1, \dots, 0, \dots, 1, \dots, 0, \dots]$
  China: $[\dots, 0, \dots, 1, \dots, 0, \dots, 1, \dots]$
  RMB: $[\dots, 1, \dots, 0, \dots, 0, \dots, 1, \dots]$
- With sparsity, the model is identifiable; allows an overcomplete basis; is tractable under mild assumptions. [SWW'12][AGM'13][AAJNT'13][AGMM'14]
  $\min_{\text{sparse } X,\ R} \|V - X \cdot R\|_F^2$
- $V$ contains the word vectors as rows (obtained from any embedding method)
- The sparsity of the rows of $X$ is chosen to be 5
- $R$ contains 2000 basis vectors (as rows), each of which is 300-dim

Assuming M1 was answered,
  $\mathrm{PMI}(w, w') = \langle v_w, v_{w'} \rangle + \xi$ (∗) with large $\xi$

M2: Why do low-dim vectors solve analogy when (∗) is only roughly true?
- Our theory assumes that $c_t$ does a slow random walk
- red dot: the estimated hidden variable $c_t$ at time $t$
- sentence at top: the window of size 10 at time $t$
