Discovering Topics from Unstructured Text
Deep Tech Summit, NPC
C. Bhattacharyya
Machine Learning Lab, Department of CSA, IISc
26th Oct, 2016
Information Retrieval from Unstructured Text

What is IR? (Manning, Raghavan, Schütze 2008)
Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

(Image from the Internet)
Challenges in Handling Unstructured Text Corpora

How do we build automatic indexing for large corpora?
NLP-based methodologies will not scale.
What are Topics

Sport     Cooking      Healthcare   Computers
run       cup          patient      computer
inning    minutes      drug         software
hit       add          doctor       system
season    tablespoon   cancer       microsoft
game      oil          medical      company
Models for discovering themes
Topic models attempt to discover themes in document collections.
Themes can be used for annotating documents.
Useful for organizing and searching large document corpora.
Do not require supervision.
Visualizing Topics: Browsing Wikipedia
Wikipedia topics visualized with TMVE (Allison Chaney)
https://github.com/ajbc/tmve-original
Outline
What are topics
Latent Semantic Indexing
Probabilistic Topic Models: LDA
Learning Topics from a finite number of samples
Information Retrieval
Corpus: a collection of documents. Document: a collection of words.

IR revisited: given a document, find similar documents in a corpus.
Corpus is a matrix
SMART Information retrieval system
Pioneered by G. Salton¹ in 1975. Given a query q, find the closest documents:

$\mathrm{score}(q, d) = \frac{A_q \cdot A_d}{\|A_q\|\,\|A_d\|}$

Representations and scoring systems were developed.

¹ G. Salton, A. Wong, and C. S. Yang (1975), A Vector Space Model for Automatic Indexing, Communications of the ACM.
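A minimal sketch of the vector-space scoring above, using a toy term-count matrix (the data and helper name are illustrative, not from the talk):

```python
import numpy as np

def cosine_score(A_q: np.ndarray, A_d: np.ndarray) -> float:
    """Cosine similarity between a query vector and a document vector."""
    denom = np.linalg.norm(A_q) * np.linalg.norm(A_d)
    return float(A_q @ A_d / denom) if denom > 0 else 0.0

# Toy word-document matrix A (rows: words, columns: documents).
A = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0]], dtype=float)
q = np.array([1, 0, 1], dtype=float)           # query as a bag-of-words vector
scores = [cosine_score(q, A[:, j]) for j in range(A.shape[1])]
best = int(np.argmax(scores))                  # index of the closest document
```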
Keyword search
Query: "Who won the Turing award in 2015?"

Retrieval given query q:

$A_{ij} = \begin{cases} 1 & \text{if word } i \text{ is present in document } j \\ 0 & \text{otherwise} \end{cases}$

Return document $j$ if $A_j^\top q$ is high.
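A small sketch of this binary keyword retrieval (the tiny vocabulary, documents, and query encoding are made up for illustration):

```python
import numpy as np

vocab = ["turing", "award", "won", "cricket"]          # illustrative vocabulary
docs = [
    "turing award won",      # document 0
    "cricket won",           # document 1
]
# Binary incidence matrix: A[i, j] = 1 iff word i occurs in document j.
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for i, w in enumerate(vocab):
        A[i, j] = 1.0 if w in doc.split() else 0.0

q = np.array([1, 1, 1, 0], dtype=float)                # query contains "turing", "award", "won"
match_counts = A.T @ q                                  # A_j^T q for every document j
ranked = np.argsort(-match_counts)                      # documents with most query words first
```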
Polysemy, Synonymy, Term Dependence

Polysemy: words that have more than one meaning (e.g., "cricket"). Polysemous words in queries can reduce precision.

Synonymy: different words have the same meaning (e.g., "automobile" and "car"). Queries with synonymous words can be a problem.

Term dependence: terms are not orthogonal; certain groups of words often occur together. Keyword search misses themes.
Latent Semantic Indexing¹

d: number of words, n: number of documents

SVD: Singular Value Decomposition

$A = [A_1, \ldots, A_n]$, $\quad A_{d \times n} = M_{d \times r}\, D_{r \times r}\, S_{r \times n}$

¹ Deerwester, S., et al. (1988), Improving Information Retrieval with Latent Semantic Indexing, Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, pp. 36-40.
Retrieving Documents with LSI

$A_i = M D S_i$

Project the query: $\hat{q} = D^{-1} M^\top q$

$\mathrm{sim}_{LSI}(q, A_i) = \hat{q} \cdot S_i$

LSI outperformed keyword search: $\mathrm{sim}_{LSI}(q, A_i)$ outperformed $q^\top A_i$.
Retrieving documents
Project the query onto the columns of M and find documents closest to the projected query.

It works well, but why? Perhaps M encodes semantics.
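A minimal sketch of LSI via a truncated SVD with query folding, under the decomposition above (toy data; not the original experimental setup):

```python
import numpy as np

def lsi_fit(A: np.ndarray, r: int):
    """Rank-r SVD of the word-document matrix: A ~ M D S."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    M = U[:, :r]                    # d x r directions over words
    D = np.diag(sigma[:r])          # r x r singular values
    S = Vt[:r, :]                   # r x n document representations
    return M, D, S

def lsi_query(q: np.ndarray, M: np.ndarray, D: np.ndarray) -> np.ndarray:
    """Fold a query into the latent space: q_hat = D^{-1} M^T q."""
    return np.linalg.inv(D) @ M.T @ q

# Toy corpus: 5 words x 4 documents.
A = np.random.default_rng(0).poisson(1.0, size=(5, 4)).astype(float)
M, D, S = lsi_fit(A, r=2)
q = np.array([1, 0, 1, 0, 0], dtype=float)
q_hat = lsi_query(q, M, D)
sims = S.T @ q_hat                  # sim_LSI(q, A_i) for each document i
best_doc = int(np.argmax(sims))
```

Ranking documents by `sims` mirrors the $\mathrm{sim}_{LSI}$ scoring on the previous slide.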
LSI: A Probabilistic Analysis (Papadimitriou et al. 2000)

Each document has only one topic.
Each topic (a column of M) has some primary words.
The probability mass on the primary words is very high.
This could mathematically explain the superior performance of LSI.
What are Topics
A topic is a probability distribution over words.
Each document: m i.i.d. draws from a topic.
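A small sketch of this view, with a topic as a pmf over a toy vocabulary and a document as m i.i.d. draws from it (the words and probabilities are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["run", "inning", "hit", "season", "game", "cup", "oil"]   # illustrative
topic = np.array([0.3, 0.2, 0.2, 0.15, 0.1, 0.03, 0.02])           # pmf over words, sums to 1

def sample_document(topic: np.ndarray, m: int) -> list[str]:
    """Draw m words i.i.d. from the topic distribution."""
    idx = rng.choice(len(vocab), size=m, p=topic)
    return [vocab[i] for i in idx]

doc = sample_document(topic, m=20)   # e.g. ['run', 'hit', 'inning', ...]
```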
LSI and Information retrieval
When does LSI work? If the corpus is pure and each topic has some primary words, then
$S_i \cdot S_j \ge c$ whenever $S_i$ and $S_j$ share the same topic.

Primary words of a topic: a group of words with a significant fraction of the probability mass within the topic. The primary word sets of different topics should be disjoint (e.g., run, inning, hit, season, game).
LSI is not the answer
Topics: Computer Science, Arts
Outline
What are topics
Latent Semantic Indexing
Probabilistic Topic Models: LDA
Learning Topics from a finite number of samples
Probabilistic Topic Models: Latent Dirichlet Allocation (LDA)

Unsupervised corpus analysis
Generative model for documents
Topic defined by a pmf over words
Learn topics inherent in the corpus
LDA: Generative model
Document: a p.m.f. over topics, $\theta$
Topic: a p.m.f. over words; $z$ denotes a word's topic
Process: pick $\theta$, pick $z$, then choose the word from topic $z$

Generative model example: $\theta$ over {sports, cooking, movies}; $z$ = sport; $w$ = "runs"
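A minimal sketch of the generative process just described (the Dirichlet prior, the number of topics, and the tiny topic matrix are illustrative assumptions, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

k, d, m = 3, 6, 10                 # topics, vocabulary size, words per document
alpha = np.full(k, 0.5)            # illustrative Dirichlet prior over topic weights
# M[:, l] is topic l: a pmf over the d words (columns sum to 1).
M = rng.dirichlet(np.ones(d), size=k).T

def generate_document():
    theta = rng.dirichlet(alpha)               # document-level topic proportions
    words = []
    for _ in range(m):
        z = rng.choice(k, p=theta)             # pick a topic for this word
        w = rng.choice(d, p=M[:, z])           # pick a word from that topic
        words.append(w)
    return theta, words

theta, doc = generate_document()   # doc is a list of word indices
```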
Example: Dynamic Topic Model of Science
A 75-topic dynamic topic model of the journal Science (1880-2002).

Words in topics evolve over time.
Source: http://topics.cs.princeton.edu/Science/
Topic Model of Science: Example Topics
Dynamic Topic Model of Science: Example I
Dynamic Topic Model of Science: Example II
Resource-Scarce Languages: Multilingual Topics

Training: English-Hindi-Bengali Wikipedia, 3.3K document triplets
Test: EN-HI-BN news from FIRE (EN 14K, HI 15K, BN 12K articles)

English: film, films, award, disney, awards, hitchcock, simpsons, chaplin, movie, academy
Hindi (transliterated): chaplin, film, the, jerry, tom, film, pitt, best, actor, and
Bengali (transliterated): film, prize, do, film, one, the, r, him, cyrus, film
Insights into LDA
LDA works well for large documents and big corpora.
Applies to dyadic data: videos, software code, cross-lingual retrieval, ...
When does LDA work?
Theorem (Tang et al. 2014¹). W.h.p., if $\log n \le m$ then $d(G, \hat G) \le C\,\epsilon$, where

$\epsilon = \frac{\log n}{n} + \frac{\log m}{m} + \frac{\log m}{n}$

n = number of documents, m = number of words in a document

If n is very large, $\epsilon \approx \frac{\log m}{m}$: not good for short messages.
If m is very large, $\epsilon \approx \frac{\log n}{n}$: not good for a small corpus.

¹ Jian Tang et al. (2014), Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis, ICML 2014.
LDA: Observations
Inference is NP-hard.
Learning parameters from a corpus is also NP-hard.
Requires MCMC or variational techniques.
Topic Simplex

Three topics can be viewed as a triangle in 2-D.
Documents put weights on the vertices of the triangle.
Documents are points inside the triangle!
Outline
What are topics
Latent Semantic Indexing
Probabilistic Topic Models: LDA
Learning Topics from a finite number of samples
General model for probabilistic topic models
Each column of M is a topic: a probability distribution over words.
Randomly choose weights $w_l$ over the topics $l$; they should sum to 1.
Sample m words from $\sum_{l=1}^{k} M_{\cdot,l}\, w_l$ to create a document.
Fits multiple topics to a single document, provably.

Question: how many documents do I need to recover M from A?

A recent breakthrough (Arora et al. 2012)¹ gave guarantees: polynomial-time algorithms.
Separability

Each topic $t$ has an anchor word $w$: $M_{w,t} = p_0 > 0$ and $M_{w,t'} = 0$ for every other topic $t'$.

¹ Arora, Ge, Moitra, Learning Topic Models: Going Beyond SVD, FOCS 2012.
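A small sketch of what separability means operationally: given a topic matrix M, an anchor word for topic l has all of its probability mass in column l (toy matrix; the function name is hypothetical):

```python
import numpy as np

# Toy topic matrix: rows are words, columns are topics; columns sum to 1.
M = np.array([
    [0.5, 0.0],    # word 0: anchor word for topic 0 (zero mass elsewhere)
    [0.0, 0.4],    # word 1: anchor word for topic 1
    [0.3, 0.3],    # word 2: shared between both topics
    [0.2, 0.3],    # word 3: shared between both topics
])

def anchor_words(M: np.ndarray, p0: float = 0.0) -> dict[int, list[int]]:
    """For each topic l, list words with mass > p0 in topic l and zero mass in every other topic."""
    d, k = M.shape
    anchors = {l: [] for l in range(k)}
    for i in range(d):
        support = np.nonzero(M[i] > 0)[0]
        if len(support) == 1 and M[i, support[0]] > p0:
            anchors[int(support[0])].append(i)
    return anchors

print(anchor_words(M))    # {0: [0], 1: [1]}
```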
Fitting topics using separability
Theorem. If all topics have anchor words, there is a polynomial-time algorithm that returns an $\hat M$ such that, with high probability,

$\sum_{l=1}^{k} \sum_{i=1}^{d} |\hat M_{il} - M_{il}| \le \epsilon$, provided

$s \ge \max\!\left\{ O\!\left(\frac{d^2 k^6 \log d}{a^4 \gamma^2 p_0^6 \epsilon^2 m}\right),\; O\!\left(\frac{k^4}{\gamma^2 a^2 \epsilon^2}\right) \right\}$,

where $\gamma$ is the condition number of $E(WW^\top)$, $a$ is the minimum expected weight of a topic, and $m$ is the number of words in each document.
Fitting topics using separability
$s \ge \max\!\left\{ O\!\left(\frac{d^2 k^6 \log d}{a^4 \gamma^2 p_0^6 \epsilon^2 m}\right),\; O\!\left(\frac{k^4}{\gamma^2 a^2 \epsilon^2}\right) \right\}$

The dependence of s on the parameter $p_0$ is $1/p_0^6$.
For the topic "baseball", the word "run" may be an anchor word with $p_0 = 0.1$.
Then the requirement is that every 10th word in a document on this topic is "run" (too strong).
It is more realistic to ask that a set of words like run, hit, score together has frequency 0.1.
Our Assumptions: Dominant Topics
Dominant admixture assumption:
Every document has a dominant topic: one topic has weight significantly higher than the others.
For every topic, there is a small fraction of documents which are nearly purely on that topic.

Formally, let $\alpha, \beta, \rho, \delta, \epsilon_0$ be non-negative reals satisfying mild constraints (small constant upper bounds, e.g. 0.5 and 0.08, on combinations of these parameters):

For $j \in [s]$, document $j$ has a dominant topic $l(j)$ such that $W_{l(j),j} \ge \alpha$ and $W_{l',j} \le \beta$ for all $l' \ne l(j)$.
For each topic $l$, there are at least $\epsilon_0 w_0 s$ documents for which topic $l$ has weight at least $1 - \delta$.
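A small sketch of checking the dominant-topic part of this assumption on a topic-weight matrix W, as measured empirically later in the talk (the matrix and thresholds here are illustrative):

```python
import numpy as np

def dominant_topic_fraction(W: np.ndarray, alpha: float) -> float:
    """Fraction of documents (columns of W) whose largest topic weight is at least alpha."""
    return float(np.mean(W.max(axis=0) >= alpha))

# Toy topic-weight matrix: 3 topics x 4 documents, columns sum to 1.
W = np.array([[0.9, 0.5, 0.1, 0.34],
              [0.05, 0.3, 0.8, 0.33],
              [0.05, 0.2, 0.1, 0.33]])
for alpha in (0.4, 0.8, 0.9):
    print(alpha, dominant_topic_fraction(W, alpha))
```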
Our Assumptions: Catchwords
Catchwords of a topic: a group of words such that
each word occurs strictly more frequently in the topic than in the other topics,
together they have high frequency.

Formally: there are disjoint sets $S_l$, $l \in \{1, \ldots, k\}$, such that for all $i \in S_l$ and all $l' \ne l$:

$M_{il} \ge \rho\, M_{il'}$, $\quad \sum_{i \in S_l} M_{il} \ge p_0$, $\quad \frac{m}{2}\, M_{il} \ge 8 \ln\!\left(\frac{20}{\epsilon_0 w_0}\right)$.
Our Results (Bansal et al., NIPS 2014)
Under the assumptions, the TSVD algorithm succeeds with high probability in finding an $\hat M$ so that

$\sum_{i,l} |\hat M_{il} - M_{il}| \le O(k\epsilon)$, provided

$s \ge \Omega\!\left(\frac{1}{\epsilon_0 w_0}\left(\frac{k^6 m^2}{\epsilon^2 p_0^2} + \frac{m^2 k^2 \epsilon_0}{\epsilon^2 p_0} + \frac{d}{\epsilon_0 \epsilon^2}\right)\right)$.
The dependence of s on $w_0$, namely $\Omega(1/w_0)$, is optimal.
The dependence of s on $d$, namely $\Omega\!\left(\frac{d}{\epsilon_0 w_0 \epsilon^2}\right)$, is optimal.
For Arora et al., to get a comparable error we need a quadratic dependence on $d$.
Thresholded SVD-based k-means (TSVD)

Randomly partition the columns of A into $A^{(1)}$ and $A^{(2)}$.

Thresholding
- Compute thresholds on $A^{(1)}$: for each word $i$, let $\zeta_i$ be the highest value in $\{0, 1, 2, \ldots, m\}$ such that
  $|\{j : A^{(1)}_{ij} > \zeta_i/m\}| \ge w_0 s/2$ and $|\{j : A^{(1)}_{ij} = \zeta_i/m\}| \le 3 w_0 s$.
- Apply the thresholding to $A^{(2)}$:
  $B_{ij} = \zeta_i$ if $A^{(2)}_{ij} > \zeta_i/m$ and $\zeta_i \ge 8 \ln(20/w_0)$; $B_{ij} = 0$ otherwise.

SVD: find the best rank-$k$ approximation $B^{(k)}$ to $B$.

Identify Dominant Topics
- Project and cluster: find an (approximately) optimal k-means clustering of the columns of $B^{(k)}$.
- Lloyd's algorithm: using the clustering found in the previous step as the starting clustering, apply Lloyd's k-means algorithm to the columns of $B$ ($B$, not $B^{(k)}$).
- Let $R_1, R_2, \ldots, R_k$ be the corresponding $k$-partition of $[s]$.
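A rough sketch of the thresholding, SVD, and clustering stages under the notation above. The threshold rule is simplified (the tie-breaking condition is omitted) and the constants follow the slide only approximately, so treat this as an illustration rather than the authors' reference implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def tsvd_cluster(A, k, w0, m, seed=0):
    """Thresholded SVD + k-means: assign a dominant-topic cluster to each document."""
    rng = np.random.default_rng(seed)
    d, s = A.shape
    cols = rng.permutation(s)
    half = s // 2
    A1, A2 = A[:, cols[:half]], A[:, cols[half:]]        # random split of the documents

    # Thresholding: per-word thresholds zeta_i computed on A1, applied to A2.
    B = np.zeros_like(A2)
    for i in range(d):
        zeta = 0
        for t in range(m, -1, -1):                        # highest t with enough large entries
            if np.sum(A1[i] > t / m) >= w0 * half / 2:
                zeta = t
                break
        if zeta >= 8 * np.log(20 / w0):                    # keep only sufficiently frequent words
            B[i, A2[i] > zeta / m] = zeta

    # SVD: best rank-k approximation B_k of the thresholded matrix B.
    U, sig, Vt = np.linalg.svd(B, full_matrices=False)
    Bk = U[:, :k] @ np.diag(sig[:k]) @ Vt[:k, :]

    # k-means on the columns of B_k, then Lloyd's iterations on the columns of B itself.
    init_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Bk.T)
    centers = np.vstack([B[:, init_labels == l].mean(axis=1) for l in range(k)])
    labels = KMeans(n_clusters=k, init=centers, n_init=1).fit_predict(B.T)
    return labels, A2
```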
Thresholded SVD-based k-means (TSVD)

Identify Catchwords
- For each $i, l$, compute $g(i, l)$ = the $(\epsilon_0 w_0 s/2)$-th highest element of $\{A^{(2)}_{ij} : j \in R_l\}$.
- Let $J_l = \left\{ i : g(i, l) > \max\!\left( \frac{4}{m\,\delta^2} \ln\!\frac{20}{w_0},\; \nu \max_{l' \ne l} g(i, l') \right) \right\}$,
  where $\nu = \frac{1}{2}(1 + \delta)(\alpha + \beta)$.

Find Topic Vectors
- Find the $\epsilon_0 w_0 s/2$ columns $j \in [s]$ with the highest $\sum_{i \in J_l} A^{(2)}_{ij}$.
- Return the average of these columns $A^{(2)}_{\cdot, j}$ as our approximation $\hat M_{\cdot, l}$ to $M_{\cdot, l}$.
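Continuing the sketch: given the clusters R_l from the previous stage, catchword identification and topic-vector estimation could look roughly like this (`eps0`, `w0`, `nu` stand for the ε₀, w₀, ν parameters above; the thresholds are simplified):

```python
import numpy as np

def estimate_topics(A2, labels, k, eps0, w0, nu, m):
    """Identify catchwords per cluster and average the most catchword-heavy documents."""
    d, s2 = A2.shape
    t = max(1, int(eps0 * w0 * s2 / 2))                 # epsilon_0 * w_0 * s / 2 documents

    # g[i, l]: t-th highest frequency of word i among documents in cluster l.
    g = np.zeros((d, k))
    for l in range(k):
        cols = A2[:, labels == l]
        if cols.shape[1] == 0:
            continue
        g[:, l] = -np.sort(-cols, axis=1)[:, min(t, cols.shape[1]) - 1]

    # Catchwords J_l: words clearly more frequent in cluster l than in any other cluster.
    M_hat = np.zeros((d, k))
    floor = 4.0 / m * np.log(20.0 / w0)                 # simplified absolute floor
    for l in range(k):
        others = np.max(np.delete(g, l, axis=1), axis=1)
        J = np.where(g[:, l] > np.maximum(floor, nu * others))[0]
        # Score every document by its total mass on the catchwords of topic l,
        # keep the top t documents and average them as the topic estimate.
        scores = A2[J, :].sum(axis=0) if len(J) > 0 else A2.sum(axis=0)
        top = np.argsort(-scores)[:t]
        M_hat[:, l] = A2[:, top].mean(axis=1)
    return M_hat
```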
Why TSVD Works
Data matrix A (left) and thresholded matrix B (right). Black: non-catchwords; blue: catchwords.
Empirical Results: Datasets

NIPS: 1,500 NIPS full papers
NYT: random subset of 30,000 documents from the New York Times dataset
Pubmed: random subset of 30,000 documents from the PubMed abstracts dataset
20NG: 13,389 documents from 20NewsGroups

Baselines
Recover (Arora et al., 2013): state-of-the-art provable algorithm based on the separability assumption
Tensor (Anandkumar et al., 2012): state-of-the-art provable algorithm using tensor decomposition
Empirical Results: Assumptions

Corpus   Documents   K    Fraction of documents (dominant topic weight >= α)
                          α = 0.4    α = 0.8    α = 0.9
NIPS     1,500       50   56.6%      10.7%      4.8%
NYT      30,000      50   63.7%      20.9%      12.7%
Pubmed   30,000      50   62.2%      20.3%      10.7%
20NG     13,389      20   74.1%      54.4%      44.3%

Table: Fraction of documents satisfying the dominant topic assumption.

Corpus   K    Mean per-topic frequency of CW   % Topics with CW
NIPS     50   0.05                             95%
NYT      50   0.11                             100%
Pubmed   50   0.05                             90%
20NG     20   0.06                             100%

Table: Catchwords (CW) assumption with ρ = 1.1, p₀ = 0.25.
Empirical Results: L1 Reconstruction Error
Average improvement over the best of R-KL & Tensor: 30.7%

Corpus   Documents   Tensor   R-L2    R-KL    TSVD    % Improvement
NIPS     40,000      0.298    0.342   0.308   0.094   68.5%
NIPS     60,000      0.296    0.346   0.311   0.089   69.9%
NIPS     80,000      0.285    0.335   0.303   0.087   69.4%
NIPS     100,000     0.280    0.344   0.306   0.086   69.3%
NIPS     150,000     0.320    0.336   0.302   0.084   72.2%
NIPS     200,000     0.322    0.335   0.301   0.113   62.5%
Pubmed   40,000      0.379    0.388   0.332   0.326   1.8%
Pubmed   60,000      0.317    0.372   0.328   0.287   9.5%
Pubmed   80,000      0.321    0.358   0.320   0.276   13.8%
Pubmed   100,000     0.304    0.350   0.315   0.276   9.2%
Pubmed   150,000     0.355    0.344   0.313   0.239   23.6%
Pubmed   200,000     0.322    0.334   0.309   0.225   27.3%
20NG     40,000      0.174    0.126   0.120   0.124   -3.3%
20NG     60,000      0.207    0.114   0.110   0.106   3.6%
20NG     80,000      0.203    0.110   0.108   0.095   12.0%
20NG     100,000     0.151    0.103   0.102   0.087   14.7%
20NG     200,000     0.162    0.096   0.097   0.072   25.8%
NYT      40,000      0.316    0.214   0.208   0.174   16.3%
NYT      60,000      0.330    0.205   0.200   0.156   22.0%
NYT      80,000      0.330    0.198   0.196   0.168   14.3%
NYT      100,000     0.353    0.198   0.196   0.163   16.8%
NYT      150,000     0.310    0.192   0.192   0.156   18.8%
NYT      200,000     0.292    0.189   0.189   0.173   8.5%
Empirical Results: L1 Reconstruction Error
Histogram of L1 error across topics for 40K synthetic documents. On a majority of the topics (> 90%), the recovery error for TSVD is significantly smaller.
[Figure: histograms of L1 reconstruction error per topic (x-axis: L1 error, y-axis: number of topics) for NIPS, NYT, Pubmed, and 20NG, comparing R-KL, Tensor, and TSVD.]
Empirical Results on Real Data: Perplexity & Topic Coherence

[Figure: perplexity and topic coherence on 20NG, NIPS, NYT, and Pubmed for TSVD, Tensor, R-L2, and R-KL.]
Check out!
Paper: A provable SVD-based algorithm for learning topics in dominant admixture corpus (NIPS 2014)
Code: http://mllab.csa.iisc.ernet.in/tsvd/
Thank you