Post on 16-Oct-2020
transcript
Tengyu Ma
Joint work with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej Risteski
Princeton University
$x \in \mathcal{X}$ (complicated space) → $v_x \in \mathbb{R}^d$ (Euclidean space with meaningful inner products)
Ø Kernel methods → linearly separable
Ø Neural nets → multi-class linear classifier
Vocabulary = { 60k most frequent words } → $\mathbb{R}^{300}$
Goal: embedding captures semantic information (via linear algebraic operations)
Ø inner products characterize similarity
  § similar words have large inner products
Ø differences characterize relationship
  § analogous pairs have similar differences
Ø more?
(picture: Chris Olah’s blog)
Meaning of a word is determined by the words it co-occurs with.
(Distributional hypothesis of meaning, [Harris’54], [Firth’57])
[Figure: co-occurrence matrix $\Pr[\cdot,\cdot]$; the row of word $x$ gives $v_x$, the columns are indexed by words $y$]
Ø $\Pr[x, y] \triangleq$ prob. of co-occurrence of $x, y$ in a window of size 5
Ø $\langle v_x, v_y \rangle$ is a good measure of similarity of $(x, y)$ [Lund-Burgess’96]
Ø $v_x$ = row of entry-wise square root of co-occurrence matrix [Rohde et al ’05]
Ø $v_x$ = row of PMI matrix, $\mathrm{PMI}(x, y) = \log \frac{\Pr[x, y]}{\Pr[x]\,\Pr[y]}$ [Church-Hanks’90]
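As a minimal sketch of the PMI definition, the snippet below turns a symmetric matrix of co-occurrence counts into a PMI matrix. The function name `pmi_matrix`, the smoothing constant `eps`, and the toy counts are assumptions made for illustration; they are not from the talk.

```python
import numpy as np

def pmi_matrix(counts, eps=1e-12):
    """PMI(x, y) = log( Pr[x, y] / (Pr[x] Pr[y]) ) from a symmetric count matrix."""
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()             # Pr[x, y]
    marginal = joint.sum(axis=1)              # Pr[x]
    expected = np.outer(marginal, marginal)   # Pr[x] Pr[y]
    # eps (an assumption of this sketch) keeps the log finite for zero counts
    return np.log((joint + eps) / (expected + eps))

# toy symmetric co-occurrence counts for three hypothetical words
counts = np.array([[10, 8, 1],
                   [8, 10, 1],
                   [1, 1, 20]])
pmi = pmi_matrix(counts)
print(np.round(pmi, 2))
```

Words 0 and 1 co-occur more often than their frequencies alone would predict, so their PMI is positive; word 2 rarely appears with either, so those entries are negative.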
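Ø “Linear structure” in the found $v_x$’s:
$v_{\text{woman}} - v_{\text{man}} \approx v_{\text{queen}} - v_{\text{king}} \approx v_{\text{aunt}} - v_{\text{uncle}} \approx \cdots$
[Figure: word vectors for man, woman, king, queen, uncle, aunt]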
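Algorithm [Levy-Goldberg] (dimension-reduction version of [Church-Hanks’90]):
Ø Compute $\mathrm{PMI}(x, y) = \log \frac{\Pr[x, y]}{\Pr[x]\,\Pr[y]}$
Ø Take rank-300 SVD (best rank-300 approximation) of PMI
  § ⇔ Fit $\mathrm{PMI}(x, y) \approx \langle v_x, v_y \rangle$ (with squared loss), where $v_x \in \mathbb{R}^{300}$
Ø Questions: woman : man = queen : ?, aunt : ?
Ø Answers:
  $\text{king} = \arg\min_w \| v_{\text{queen}} - v_w - (v_{\text{woman}} - v_{\text{man}}) \|$
  $\text{uncle} = \arg\min_w \| v_{\text{aunt}} - v_w - (v_{\text{woman}} - v_{\text{man}}) \|$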
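The two steps of this slide can be sketched as follows: a rank-k factorization of a (PMI-like) matrix via truncated SVD, and an analogy solver that minimizes the distance in the argmin formula. The function names, the choice to split the singular values evenly between word and context vectors, the exclusion of the three query words from the candidates, and the 2-d toy vectors are all assumptions of this sketch.

```python
import numpy as np

def svd_embeddings(M, k):
    """Rank-k factorization M ≈ W @ C.T from the truncated SVD of M."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(S[:k])
    W = U[:, :k] * root      # word vectors (one per row)
    C = Vt[:k].T * root      # context vectors (one per row)
    return W, C

def solve_analogy(vecs, a, b, c):
    """Answer a : b = c : ? by minimizing ||v_c - v_w - (v_a - v_b)|| over w."""
    target = vecs[c] - (vecs[a] - vecs[b])
    # assumption: the three query words themselves are not allowed as answers
    candidates = [w for w in vecs if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(target - vecs[w]))

# hypothetical 2-d vectors: first coordinate "role", second coordinate "gender"
vecs = {
    "man": np.array([1.0, 0.0]), "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]), "queen": np.array([3.0, 1.0]),
    "uncle": np.array([5.0, 0.0]), "aunt": np.array([5.0, 1.0]),
}
print(solve_analogy(vecs, "woman", "man", "queen"))  # king
print(solve_analogy(vecs, "woman", "man", "aunt"))   # uncle
```

In real use `vecs` would hold the rows of `W` returned by `svd_embeddings` on the PMI matrix with k = 300; the toy dictionary only makes the geometry easy to check by hand.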
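Ø recurrent neural network based model [Mikolov et al ’12]
Ø word2vec [Mikolov et al ’13]:
  $\Pr[x_{i+6} \mid x_{i+1}, \dots, x_{i+5}] \propto \exp\langle v_{x_{i+6}}, \tfrac{1}{5}(v_{x_{i+1}} + \cdots + v_{x_{i+5}}) \rangle$
Ø GloVe [Pennington et al ’14]:
  $\log \Pr[x, y] \approx \langle v_x, v_y \rangle + s_x + s_y + C$
Ø [Levy-Goldberg’14] (previous slide):
  $\mathrm{PMI}(x, y) = \log \frac{\Pr[x, y]}{\Pr[x]\,\Pr[y]} \approx \langle v_x, v_y \rangle + C$
Logarithm (or exponential) seems to exclude linear algebra!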
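Why co-occurrence statistics + log → linear structure [Levy-Goldberg’13, Pennington et al ’14, rephrased]
Ø For most of the words $\chi$:
  $\frac{\Pr[\chi \mid \text{king}]}{\Pr[\chi \mid \text{queen}]} \approx \frac{\Pr[\chi \mid \text{man}]}{\Pr[\chi \mid \text{woman}]}$
  § for $\chi$ unrelated to gender: LHS, RHS ≈ 1
  § for $\chi$ = dress: LHS, RHS ≪ 1; for $\chi$ = John: LHS, RHS ≫ 1
Ø It suggests
  $\sum_\chi \left( \log\frac{\Pr[\chi \mid \text{king}]}{\Pr[\chi \mid \text{queen}]} - \log\frac{\Pr[\chi \mid \text{man}]}{\Pr[\chi \mid \text{woman}]} \right)^2 \approx 0$,
  or equivalently
  $\sum_\chi \left( \mathrm{PMI}(\chi, \text{king}) - \mathrm{PMI}(\chi, \text{queen}) - \left( \mathrm{PMI}(\chi, \text{man}) - \mathrm{PMI}(\chi, \text{woman}) \right) \right)^2 \approx 0$
Ø Rows of the PMI matrix have “linear structure”
Ø Empirically one can find $v_w$’s s.t. $\mathrm{PMI}(\chi, w) \approx \langle v_\chi, v_w \rangle$
Ø Suggestion: the $v_w$’s also have linear structure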
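The conditional-probability form and the PMI form of this sum agree term by term. The step connecting them, written out using only the definition of PMI and $\Pr[\chi, w] = \Pr[\chi \mid w]\,\Pr[w]$:

```latex
\begin{align*}
\mathrm{PMI}(\chi,\text{king}) - \mathrm{PMI}(\chi,\text{queen})
  &= \log\frac{\Pr[\chi,\text{king}]}{\Pr[\chi]\,\Pr[\text{king}]}
   - \log\frac{\Pr[\chi,\text{queen}]}{\Pr[\chi]\,\Pr[\text{queen}]} \\
  &= \log\Pr[\chi \mid \text{king}] - \log\Pr[\chi \mid \text{queen}]
   = \log\frac{\Pr[\chi \mid \text{king}]}{\Pr[\chi \mid \text{queen}]}
\end{align*}
```

The $\Pr[\chi]$ factors cancel, and the same identity holds for the man/woman pair.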
M1: Why do low-dim vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dim fit of the PMI matrix even possible?
  $\mathrm{PMI}(x, y) \approx \langle v_x, v_y \rangle$ (∗)
  ↑ empirical fit has 17% error
Ø NB: the PMI matrix is not necessarily PSD.
M2: Why do low-dim vectors solve analogies when (∗) is only roughly true?
Ø NB: solving the analogy task requires inner products of 6 pairs of word vectors, and that “king” survives against all other words: noise is potentially an issue!
  $\text{king} = \arg\min_w \| v_{\text{queen}} - v_w - (v_{\text{woman}} - v_{\text{man}}) \|^2$
Ø Fact: low-dim word vectors have more accurate linear structure than the rows of PMI (therefore better analogy task performance).
M1: Why is a low-dim fit of the PMI matrix, $\mathrm{PMI}(x, y) \approx \langle v_x, v_y \rangle$ (∗), even possible?
A1: Under a generative model (named RAND-WALK), (∗) provably holds.
M2: Why do low-dim vectors solve analogies when (∗) is only roughly true?
A2: (∗) + isotropy of word vectors ⇒ low-dim fitting reduces noise
(Quite intuitive, though it doesn’t follow from Occam’s bound for PAC-learning)
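Ø Hidden Markov Model:
  § discourse vector $c_t \in \mathbb{R}^d$ governs the discourse/theme/context of time $t$
  § words $w_t$ (observable); embedding $v_{w_t} \in \mathbb{R}^d$ (parameters to learn)
  § log-linear observation model: $\Pr[w_t \mid c_t] \propto \exp\langle v_{w_t}, c_t \rangle$
Ø Closely related to [Mnih-Hinton’07]
[Figure: chain of hidden discourse vectors $c_t, c_{t+1}, c_{t+2}, c_{t+3}, c_{t+4}$, each emitting the observed word $w_t, w_{t+1}, \dots, w_{t+4}$]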
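To make the generative process concrete, here is a minimal sampling sketch: a discourse vector drifts slowly and each word is drawn from the log-linear observation model. The function name `sample_rand_walk`, the step size, the choice to keep $c_t$ on the unit sphere, and the toy embedding matrix are assumptions of this sketch, not details given in the talk.

```python
import numpy as np

def sample_rand_walk(V, T, step=0.05, seed=0):
    """Sample T word indices from a log-linear model driven by a slow random walk."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    c = rng.standard_normal(d)
    c /= np.linalg.norm(c)                      # assumption: c_t is kept on the unit sphere
    words = np.empty(T, dtype=int)
    for t in range(T):
        logits = V @ c                          # <v_w, c_t> for every word w
        p = np.exp(logits - logits.max())       # softmax, shifted for numerical stability
        p /= p.sum()                            # Pr[w | c_t]
        words[t] = rng.choice(n, p=p)
        c = c + step * rng.standard_normal(d) / np.sqrt(d)   # small drift of the discourse
        c /= np.linalg.norm(c)
    return words

rng = np.random.default_rng(1)
V = rng.standard_normal((100, 10))              # hypothetical embeddings: 100 words, d = 10
words = sample_rand_walk(V, T=50)
print(words[:10])
```

Because the drift per step is small, nearby positions share almost the same discourse vector, which is what makes co-occurrence within a short window informative.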
Ø Ideally, $c_t, v_w \in \mathbb{R}^d$ should contain semantic information in their coordinates
  § E.g. (0.5, -0.3, …) could mean “0.5 gender, -0.3 age, ...”
Ø But the whole system is rotationally invariant: $\langle c_t, v_w \rangle = \langle R c_t, R v_w \rangle$
Ø There should exist a rotation so that the coordinates are meaningful (back to this later)
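The invariance is easy to check numerically. In the sketch below, the orthogonal matrix `R` is obtained from a QR decomposition of a random matrix (it may be a rotation or a reflection; either preserves inner products); the dimension and the random vectors are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
c = rng.standard_normal(d)                          # a discourse vector
v = rng.standard_normal(d)                          # a word vector
R, _ = np.linalg.qr(rng.standard_normal((d, d)))    # orthogonal: R.T @ R = I

before = c @ v
after = (R @ c) @ (R @ v)
print(before, after)   # equal up to floating-point error
```

Since only inner products enter $\Pr[w_t \mid c_t]$, the data cannot distinguish $\{v_w\}$ from $\{R v_w\}$, so individual coordinates carry no meaning without an extra assumption.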
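Ø Assumptions:
  § $\{v_w\}$ consists of vectors drawn from $s \cdot \mathcal{N}(0, \mathrm{Id})$; $s$ is a bounded scalar r.v.
  § $c_t$ does a slow random walk (doesn’t change much in a window of 5)
  § log-linear observation model: $\Pr[w_t \mid c_t] \propto \exp\langle v_{w_t}, c_t \rangle$
Ø Main Theorem:
  (1) $\log \Pr[w, w'] = \| v_w + v_{w'} \|^2 / d - 2 \log Z \pm \epsilon$
  (2) $\log \Pr[w] = \| v_w \|^2 / d - \log Z \pm \epsilon$
  (3) $\mathrm{PMI}(w, w') = \langle v_w, v_{w'} \rangle / d \pm \epsilon$
Ø Norm determines frequency; spatial orientation determines “meaning”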
Fact: (2) implies that the words have a power law dist.
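Ø word2vec [Mikolov et al ’13]:
  $\Pr[w_{i+6} \mid w_{i+1}, \dots, w_{i+5}] \propto \exp\langle v_{w_{i+6}}, \tfrac{1}{5}(v_{w_{i+1}} + \cdots + v_{w_{i+5}}) \rangle$
Ø GloVe [Pennington et al ’14]:
  $\log \Pr[w, w'] \approx \langle v_w, v_{w'} \rangle + s_w + s_{w'} + C$
  Eq. (1): $\log \Pr[w, w'] = \| v_w + v_{w'} \|^2 / d - 2 \log Z \pm \epsilon$
Ø [Levy-Goldberg’14]:
  $\mathrm{PMI}(w, w') \approx \langle v_w, v_{w'} \rangle + C$
  Eq. (3): $\mathrm{PMI}(w, w') = \langle v_w, v_{w'} \rangle / d \pm \epsilon$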
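Ø word2vec [Mikolov et al ’13]:
  $\Pr[w_{i+6} \mid w_{i+1}, \dots, w_{i+5}] \propto \exp\langle v_{w_{i+6}}, \tfrac{1}{5}(v_{w_{i+1}} + \cdots + v_{w_{i+5}}) \rangle$
Ø Under our model,
  § Random walk is slow: $c_{i+1} \approx c_{i+2} \approx \cdots \approx c_{i+6} \approx c$
  § Best estimate for current discourse $c_{i+6}$:
    $\arg\max_{c,\ \|c\| = 1} \Pr[c \mid w_{i+1}, \dots, w_{i+5}] = \alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}})$
  § Prob. distribution of next word given the best guess $c$:
    $\Pr[w_{i+6} \mid c_{i+6} = \alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}})] \propto \exp\langle v_{w_{i+6}}, \alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}}) \rangle$
    ↑ $\alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}})$ is the max-likelihood estimate of $c_{i+6}$
[Figure: discourse vectors …, $c_{i+5}$, $c_{i+6}$ emitting the words …, $w_{i+5}$, $w_{i+6}$]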
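A minimal sketch of the resulting predictor: estimate the discourse as a scaled sum of the five context vectors, then apply the log-linear model to get a distribution over the next word. The function name, the default `alpha = 1/5` (the word2vec averaging constant; the talk leaves $\alpha$ unspecified), and the toy embedding matrix are assumptions made here.

```python
import numpy as np

def next_word_distribution(V, context, alpha=1.0 / 5):
    """Pr[w | c_hat] ∝ exp<v_w, c_hat>, with c_hat = alpha * (sum of context vectors)."""
    c_hat = alpha * V[context].sum(axis=0)   # estimate of the current discourse
    logits = V @ c_hat                       # <v_w, c_hat> for every word w
    p = np.exp(logits - logits.max())        # softmax, shifted for numerical stability
    return p / p.sum()

# hypothetical vocabulary of 4 words with orthogonal embeddings
V = 3.0 * np.eye(4)
p = next_word_distribution(V, context=[0, 0, 0, 0, 1])
print(np.round(p, 3))
```

With this toy input the context is dominated by word 0, so word 0 gets the highest probability, word 1 the second highest, and the two words absent from the context share the remainder.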
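$\Pr[w, w'] = \int \Pr[w \mid c]\, \Pr[w' \mid c']\, p(c, c')\, dc\, dc'$
$= \int \frac{1}{Z_c Z_{c'}} \cdot \exp\langle v_w, c \rangle \exp\langle v_{w'}, c' \rangle\, p(c, c')\, dc\, dc'$
Ø Assume $c = c'$ with probability 1,
$= \int \exp\langle v_w + v_{w'}, c \rangle\, p(c)\, dc = \exp\left( \| v_w + v_{w'} \|^2 / d \right)$
?? (this last step drops the factor $\frac{1}{Z_c Z_{c'}}$)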
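This talk: window of size 2
[Figure: discourse vectors $c, c'$ emit the words $w, w'$, with $\Pr[w \mid c] \propto \exp\langle v_w, c \rangle$ and $\Pr[w' \mid c'] \propto \exp\langle v_{w'}, c' \rangle$]
Ø $\Pr[w \mid c] = \frac{1}{Z_c} \cdot \exp\langle v_w, c \rangle$
Ø $Z_c = \sum_w \exp\langle v_w, c \rangle$ is the partition function
Ø For a spherical Gaussian vector $c$: $\mathbb{E}[\exp\langle v, c \rangle] = \exp(\| v \|^2 / d)$
Eq. (1): $\log \Pr[w, w'] = \| v_w + v_{w'} \|^2 / d - 2 \log Z \pm \epsilon$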
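Lemma 1: for almost all $c$ and almost all $\{v_w\}$, $Z_c = (1 + o(1)) Z$
Ø Proof (sketch):
  § for most $c$, $Z_c$ concentrates around its mean
  § the mean of $Z_c$ is determined by $\|c\|$, which in turn concentrates
  § caveat: $\exp\langle v, c \rangle$ for $v \sim \mathcal{N}(0, \mathrm{Id})$ is neither sub-Gaussian nor sub-exponential (the $\alpha$-Orlicz norm is not bounded for any $\alpha > 0$)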
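Lemma 1 can be probed with a small simulation: fix random word vectors, draw several discourse directions, and compare the resulting partition functions. The sizes, the choice $s = 1$, and the restriction to unit-norm $c$ are assumptions of this sketch. For unit-norm $c$ each $\langle v_w, c \rangle$ is a standard Gaussian, so $\mathbb{E}[Z_c] = n\, e^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 50
V = rng.standard_normal((n, d))                   # v_w ~ N(0, Id), taking s = 1
C = rng.standard_normal((20, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)     # 20 unit-norm discourse vectors

Z = np.exp(V @ C.T).sum(axis=0)                   # Z_c for each of the 20 directions
rel_spread = Z.std() / Z.mean()                   # how much Z_c varies with c
ratio_to_mean = Z.mean() / (n * np.exp(0.5))      # compare with E[Z_c] = n * e^{1/2}
print(rel_spread, ratio_to_mean)
```

The relative spread shrinks as the vocabulary size `n` grows, which is the sense in which $Z_c$ can be treated as a constant $Z$.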
Ø Proof sketch:
Ø Fixing $c$, to show: with high probability over the choices of $v_w$’s,
  $Z_c = \sum_w \exp\langle v_w, c \rangle = (1 + o(1))\, \mathbb{E}[Z_c]$
Ø $z_w = \langle v_w, c \rangle$ is a scalar Gaussian random variable
Ø $\|c\|$ governs the mean and variance of $z_w$
Ø $\|c\|$ in turn is concentrated
Ø Question: $z_1, \dots, z_n \sim \mathcal{N}(0, 1)$ and $Z = \sum_{i=1}^{n} \exp(z_i)$. How is $Z$ concentrated?
Ø $\mathbb{E}[Z] = \Theta(n)$, and $\mathrm{Var}[Z] = O(n)$
Ø The tail of $\exp(z_i)$ is bad!
  § $\Pr[\exp z_i > t] \approx t^{-\log t}$
  § (sub)-Gaussian tail: $\Pr[X > t] \le \exp(-t^2 / 2)$
  § (sub)-exponential tail: $\Pr[X > t] \le \exp(-t / 2)$
Ø Claim: $\Pr[Z > \mathbb{E} Z + C \sqrt{n} \cdot \log n] \le \exp(-\log^2 n)$
Ø Trick: truncate $z_i$ at $\log n$ and deal with the tail by union bound
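To see how heavy the tail of $\exp(z_i)$ is, the sketch below evaluates the exact tail $\Pr[\exp z > t] = \Pr[z > \log t]$ for a standard Gaussian $z$ via the complementary error function and prints it next to the sub-exponential reference $\exp(-t/2)$. The function name and the sample values of $t$ are illustrative choices.

```python
import math

def lognormal_tail(t):
    """Exact Pr[exp(z) > t] for z ~ N(0, 1) and t > 0."""
    return 0.5 * math.erfc(math.log(t) / math.sqrt(2.0))

for t in (10.0, 50.0, 100.0):
    # exact tail of exp(z) versus the sub-exponential reference exp(-t/2)
    print(t, lognormal_tail(t), math.exp(-t / 2))
```

For each of these $t$ the exact tail is far above $\exp(-t/2)$, which is why standard sub-exponential concentration bounds do not apply directly and a truncation argument is used instead.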
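$\Pr[w, w'] = \int \frac{1}{Z_c Z_{c'}} \cdot \exp\langle v_w + v_{w'}, c \rangle\, p(c)\, dc$
$= (1 \pm o(1)) \frac{1}{Z^2} \int \exp\langle v_w + v_{w'}, c \rangle\, p(c)\, dc$
$= (1 \pm o(1)) \frac{1}{Z^2} \exp\left( \| v_w + v_{w'} \|^2 / d \right)$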
Ø Our theory predicts
  Eq. (1): $\log \Pr[w, w'] = \| v_w + v_{w'} \|^2 / d - 2 \log Z \pm \epsilon$
Ø (Approximate) maximum likelihood objective (SN):
  $\min_{\{v_w\},\, Y} \sum_{w, w'} \widehat{\Pr}[w, w'] \left( \log \widehat{\Pr}[w, w'] - \| v_w + v_{w'} \|^2 - Y \right)^2$
Ø Simplest word embedding method yet (fewest “knobs” to turn)
Ø Comparable performance on analogy test
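As a rough illustration, the SN objective can be evaluated directly for a given set of vectors. This sketch only computes the loss; it does not perform the minimization. The function name, the handling of zero entries (skipped), and the synthetic data are assumptions made here.

```python
import numpy as np

def sn_objective(P_hat, V, Y):
    """sum_{w,w'} P_hat[w,w'] * (log P_hat[w,w'] - ||v_w + v_w'||^2 - Y)^2."""
    sums = V[:, None, :] + V[None, :, :]          # v_w + v_w' for all pairs
    sq_norm = (sums ** 2).sum(axis=-1)            # ||v_w + v_w'||^2
    mask = P_hat > 0                              # assumption: unseen pairs are skipped
    resid = np.zeros_like(P_hat, dtype=float)
    resid[mask] = np.log(P_hat[mask]) - sq_norm[mask] - Y
    return float((P_hat * resid ** 2).sum())

rng = np.random.default_rng(0)
V = 0.3 * rng.standard_normal((6, 3))             # hypothetical vectors: 6 words, d = 3
Y = -8.0
sq = ((V[:, None, :] + V[None, :, :]) ** 2).sum(axis=-1)
P_hat = np.exp(sq + Y)                            # synthetic weights that fit the model exactly

loss_true = sn_objective(P_hat, V, Y)             # essentially zero by construction
loss_perturbed = sn_objective(P_hat, V + 0.1, Y)  # larger once the vectors are moved
print(loss_true, loss_perturbed)
```

The only free quantities are the vectors and the scalar offset, which is what the slide means by having few “knobs”.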
Ø Our theory predicts
  Eq. (2): $\log \Pr[w] = \| v_w \|^2 / d - \log Z \pm \epsilon$
Ø Our theory predicts
  $Z_c = (1 \pm o(1)) Z$
Ø Under the generative model RAND-WALK, for most of the words $\chi$:
  $\frac{\Pr[\chi \mid a]}{\Pr[\chi \mid b]} \approx \frac{\Pr[\chi \mid c]}{\Pr[\chi \mid d]}$ (↑ semantic def. of analogy)
  ⟺ $v_a - v_b \approx v_c - v_d$ (↑ algebraic def. of analogy)
Ø Beyond only solving the analogy task?
Ø Extracting more information from analogies/embeddings?
Some recent work: extracting different meanings from word embeddings (same team: Arora, Li, Liang, M., Risteski)
Ø “Tie” can mean an article of clothing, or a physical act
Ø Tie represents unrelated words tie1, tie2, etc., which correspond to different representative “discourses”
Quick experiment: Take two random/unrelated words $w_1, w_2$ where $w_1$ is ~100 times more frequent than $w_2$. Declare these to be a single word and compute its embedding in our model.
Result: close to something like $0.8\, v_{w_1} + 0.2\, v_{w_2}$
Ø Mathematical explanation
  § Merge $w_1, w_2$ as $w$. Let $r = \frac{\Pr[w_1]}{\Pr[w_2]} > 1$
  § Then $v_w \approx \alpha v_{w_1} + \beta v_{w_2}$, where $\alpha = 1 - c_1 \log(1 + \tfrac{1}{r}) \approx 1$ and $\beta = 1 - c_2 \log r$
  § $\beta > .1$ even if $r = 100$!
  § Rare meaning is not swamped, thanks to the log!
$v_{\text{tie}} = 0.8\, a_1 + 0.2\, a_2 + \text{noise}$, where $a_1$ is the discourse for tie1 and $a_2$ is the discourse for tie2
Ø Sparse coding for extracting different meanings:
  § Find $m = 2000$ “discourses” $a_1, a_2, \dots \in \mathbb{R}^d$ such that each word vector is expressed as a weighted sum of at most 5 of them, plus a “noise vector”:
    $v_w = x_{w,1} a_1 + x_{w,2} a_2 + \dots + \text{noise}$, where $x_w$ has only 5 non-zeros
Ø Training objective:
  $\min_{A = [a_1, \dots, a_m],\ \text{sparse } x_w\text{’s}} \sum_w \| v_w - A x_w \|^2$
Ø local search algo. [EAB’05], provable algo. [SWW’12, AGM’14, AGMM’15, ...]
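For a fixed dictionary $A$, the inner problem (find a code with at most 5 non-zeros) can be approximated greedily. The sketch below uses orthogonal matching pursuit, which is a swapped-in standard technique and not necessarily what the cited algorithms do; learning $A$ itself is not shown. The function name, tolerance, dictionary size, and synthetic “word vector” are assumptions of this sketch.

```python
import numpy as np

def omp(A, v, k=5, tol=1e-10):
    """Greedy k-sparse x with v ≈ A @ x; columns of A are unit-norm atoms."""
    support, coef = [], np.zeros(0)
    residual = v.astype(float).copy()
    for _ in range(k):
        if np.linalg.norm(residual) < tol:
            break                                    # already explained exactly
        j = int(np.argmax(np.abs(A.T @ residual)))   # atom most correlated with residual
        if j in support:
            break
        support.append(j)
        # refit all chosen coefficients by least squares
        coef, *_ = np.linalg.lstsq(A[:, support], v, rcond=None)
        residual = v - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    if support:
        x[support] = coef
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 200))                  # hypothetical overcomplete dictionary
A /= np.linalg.norm(A, axis=0)                       # normalize the atoms
v = 0.8 * A[:, 3] + 0.2 * A[:, 17] + 0.01 * rng.standard_normal(100)
x = omp(A, v)
print(np.nonzero(x)[0], np.round(x[np.nonzero(x)], 2))
```

A full sparse-coding procedure would alternate a coding step like this with an update of the atoms; the sketch covers the coding step only.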
other. Thus combining the bases while merging duplicates yielded a basis of about the same size. Some atoms were found to be semantically meaningless but are easily identified and filtered out by checking if their closest words tend to have low pairwise inner products amongst themselves (16).
The significant discourses represented by the basis vectors will henceforth be referred to as atoms of discourse. The “meaning” of an atom can be discerned by looking at the set of words whose embeddings are close to it. Table 1 contains some examples of the discourse atoms.
Atoms of discourse may be reminiscent of the results of other automated methods for obtaining a thematic understanding of text, such as topic modeling, described in the survey (17). Indeed the model (1) used to compute the word embeddings is related to a log-linear topic model from (18). However, the discourses here are computed via sparse coding on word embeddings, which has no analog in topic modeling. There is also a long tradition of detecting coherent clusters of word vectors using Brown clustering, or even sparse coding (19). The novelty in the current paper is a clear interpretation of the basis yielded by the sparse coding (in terms of discourses), as well as its use to capture different senses of words.
Atom 1978      825        231           616           1638       149            330
drowning       instagram  stakes        membrane      slapping   orchestra      conferences
suicides       twitter    thoroughbred  mitochondria  pulling    philharmonic   meetings
overdose       facebook   guineas       cytosol       plucking   philharmonia   seminars
murder         tumblr     preakness     cytoplasm     squeezing  conductor      workshops
poisoning      vimeo      filly         membranes     twisting   symphony       exhibitions
commits        linkedin   fillies       organelles    bowing     orchestras     organizes
stabbing       reddit     epsom         endoplasmic   slamming   toscanini      concerts
strangulation  myspace    racecourse    proteins      tossing    concertgebouw  lectures
gunshot        tweets     sired         vesicles      grabbing   solti          presentations
Table 1: Some discourse atoms and their nearest 9 words. By (1), the words most likely to appear in a discourse are those nearest to it.
Representative subset of 2000 discourses (represented using their nearest words)
↑ closest words to $a_{231}$
5 atoms that express $v_{\text{tie}}$
Ø Atoms of discourse found are fairly fine-grained
Ø Maybe $a_{\text{biochemistry}} = \alpha \cdot b_{\text{biology}} + \beta \cdot b_{\text{chemistry}}$?
Ø Another layer: $\min_{B,\ \text{sparse } Y} \| A - B Y \|^2$
Ø Part I: new generative model that captures semantics.
Ø Provable guarantee:
  § log of co-occurrence matrix has low rank structure
  § semantic analogy ⇔ linear algebraic structure for word vectors
Ø Simplistic assumptions, but good fit to reality
Ø Part II: automatic way of detecting word meanings
  § Hierarchical basis in the embedding space
Ø Other applications of our model/method?
Ø Each coordinate of $v_w$ means something (coordinates shown, in order: currency, country, American, Chinese):
  $v_{\text{USA}} = [\dots, 0, \dots, 1, \dots, 1, \dots, 0, \dots]$
  $v_{\text{dollar}} = [\dots, 1, \dots, 0, \dots, 1, \dots, 0, \dots]$
  $v_{\text{China}} = [\dots, 0, \dots, 1, \dots, 0, \dots, 1, \dots]$
  $v_{\text{RMB}} = [\dots, 1, \dots, 0, \dots, 0, \dots, 1, \dots]$
  $v_{\text{USA}} - v_{\text{dollar}} = [\dots, -1, \dots, 1, \dots, 0, \dots, 0, \dots]$
  $v_{\text{China}} - v_{\text{RMB}} = [\dots, -1, \dots, 1, \dots, 0, \dots, 0, \dots]$
Ø On other coordinates, the values are either very small or the supports are non-overlapping
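The example can be replayed with the four named coordinates only (all other coordinates are omitted in this sketch, and the dictionary `v` is just an illustrative container):

```python
import numpy as np

# coordinates, in order: [currency, country, American, Chinese]
v = {
    "USA":    np.array([0, 1, 1, 0]),
    "dollar": np.array([1, 0, 1, 0]),
    "China":  np.array([0, 1, 0, 1]),
    "RMB":    np.array([1, 0, 0, 1]),
}

diff_us = v["USA"] - v["dollar"]     # country minus currency; nationality cancels
diff_cn = v["China"] - v["RMB"]
print(diff_us, diff_cn)              # both are [-1  1  0  0]
```

The nationality coordinates cancel inside each pair, leaving the same “country minus currency” direction, which is the algebraic form of the analogy USA : dollar = China : RMB.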
Ø Problem: rotational invariance. A rotation of the word vectors doesn’t change the model.
[Figure: the matrix with rows $v_{\text{USA}}, v_{\text{dollar}}, v_{\text{China}}, v_{\text{RMB}}$ written as (sparse coefficients over the coordinates currency, country, American, Chinese) $\cdot\, R$ (basis vectors)]
Ø With sparsity, the model is identifiable; allows overcomplete basis; is tractable under mild assumptions. [SWW’12][AGM’13][AAJNT’13][AGMM’14]
  $\min_{\text{sparse } X,\ R} \| V - X \cdot R \|_F^2$
Ø $V$ contains word vectors as rows (obtained from any embedding method)
Ø Sparsity of rows of $X$ is chosen to be 5
Ø $R$ contains 2000 basis vectors (as rows), each of which is 300-dim
Assuming M1 was answered,
  $\mathrm{PMI}(w, w') = \langle v_w, v_{w'} \rangle + \xi$ (∗) with large $\xi$
M2: Why do low-dim vectors solve analogies when (∗) is only roughly true?
A2: (∗) + isotropy of word vectors ⇒ low-dim fitting reduces noise
(Quite intuitive, though it doesn’t follow from Occam’s bound for PAC-learning)
Ø Our theory assumes that $c_t$ does a slow random walk
Ø red dot: the estimated hidden variable $c_t$ at time $t$
Ø sentence at top: the window of size 10 at time $t$