A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings

A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings
Weihua Hu (UTokyo) and Jun'ichi Tsujii (AIST)

Motivation
・Document-level word co-occurrence is scarce when texts are short and the vocabulary is diverse (e.g., blogs, SNS, newsgroups).
・Probabilistic topic models (e.g., LDA, pLSI) infer topics based on document-level word co-occurrence, so conventional topic models are not effective.
→ We propose a novel topic model based on co-occurrence statistics of latent concepts to resolve the data sparsity.

Proposal: Latent Concept Topic Model (LCTM)
・Use neural word embeddings (e.g., word2vec, GloVe vectors; Pennington+ 2014) to capture the conceptual similarity of words; each cluster corresponds to one latent concept.
・Add another layer of latent variables (latent concepts) to mediate data sparsity; the Gaussian variance parameter σ² controls the range of the emission.
・Define topics as distributions over latent concepts → resolve data sparsity in short texts.
・Model the generative process of word embeddings → LCTM can naturally handle out-of-vocabulary (OOV) words.

Graphical models: (a) LDA, (b) LCTM.
Notations: α, β (Dirichlet prior parameters); µ, σ₀², σ² (Gaussian prior parameters); θ_d (document-topic distribution); φ_k (topic-concept distribution); w_{d,i} (word type); v_{d,i} (word vector); z_{d,i} (latent topic); c_{d,i} (latent concept); µ_c (concept vector).

Overview of topic inference
・Collapsed Gibbs sampler for approximate inference; sample latent concepts in addition to topics.
・Sampling of a topic assignment: (prop. of topic k in the same doc) × (prob. of topic k generating concept c).
・Sampling of a concept assignment: (prob. of topic k generating concept c) × (prob. of concept c generating word vector v), where N(· | µ_c, σ²_c I) is the Gaussian distribution corresponding to latent concept c.

Experimental results (dataset: short posts, fewer than 50 words, of 20Newsgroups)
1. Document clustering, measured by Adjusted Mutual Information (AMI): a Gaussian variance of σ² = 0.5 consistently performs well; LCTM outperforms topic models without word embeddings and performs comparably to topic models with word embeddings.
2. Handling a high degree of OOV words, measured by classification accuracy on held-out documents (setting: infer topics on the training dataset, predict topic proportions of held-out documents, classify by topic proportion):

Training Set   400short   800short   1561short
OOV prop       0.348      0.253      0.181
Method         Classification accuracy
LCTM           0.302      0.367      0.416
LCTM-UNK       0.262      0.340      0.406
LFLDA          0.253      0.333      0.410
nI-cLDA        0.261      0.333      0.412
LDA            0.215      0.293      0.382
GLDA           0.0527     0.0529     0.0529
Chance Rate    0.0539     0.0539     0.0539

LCTM-UNK (LCTM that ignores OOV words) outperforms the other topic models; LCTM further improves on LCTM-UNK, showing that LCTM effectively incorporates OOV words in held-out documents.

Conclusion
・Introduced LCTM, which infers topics based on document-level co-occurrence of latent concepts.
・Showed that LCTM can effectively handle OOV words in held-out documents.
・The same method can be readily applied to topic models that extend LDA.
Transcript
Page 1: A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings (web.stanford.edu/~weihuahu/presentation/acl_poster_final.pdf)

・Use neural word embeddings (e.g., word2vec, GloVe) to capture the conceptual similarity of words. → Each cluster corresponds to one latent concept.
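As a hypothetical illustration of this clustering view (not the poster's own procedure), one can run k-means over pre-trained embeddings and read each cluster as a latent concept. The word list and the random placeholder vectors below are assumptions made so the sketch runs on its own; real GloVe vectors would be loaded from disk instead, and LCTM itself infers concept assignments probabilistically rather than by hard clustering.

```python
# Illustrative sketch: cluster word embeddings so that each cluster can be
# read as one "latent concept". Random placeholders stand in for GloVe vectors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
words = ["neural", "networks", "learning", "hockey", "baseball", "car", "engine", "wheel"]
emb = {w: rng.standard_normal(50) for w in words}     # placeholder 50-d "GloVe" vectors

X = np.stack([emb[w] for w in words])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for c in range(3):
    print("concept", c, ":", [w for w, l in zip(words, labels) if l == c])
```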

A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings

Motivation
・Document-level word co-occurrence is scarce when texts are short and the vocabulary is diverse (e.g., blogs, SNS, newsgroups).
・Probabilistic topic models (e.g., LDA, pLSI) infer topics based on document-level word co-occurrence.
→ Conventional topic models are not effective.
→ We propose a novel topic model based on co-occurrence statistics of latent concepts to resolve the data sparsity.

GloVe  vectors  (Pennington+  2014)

Proposal

Gaussian variance parameter σ² controls the range of the emission.

Graphical  Models

Latent  Concept  Topic  Model  (LCTM)

Weihua Hu (UTokyo) and Jun'ichi Tsujii (AIST)

Figure 1: Projected latent concepts on the word-embedding space. Concept vectors are annotated with their representative concepts in parentheses.

words, we expect topically-related latent concepts to co-occur many times, even in short texts with diverse usage of words. This in turn promotes topic inference in LCTM.

LCTM further has the advantage of using continuous word embeddings. Traditional LDA assumes a fixed vocabulary of word types. This modeling assumption prevents LDA from handling out-of-vocabulary (OOV) words in held-out documents. On the other hand, since our topic model operates on the continuous vector space, it can naturally handle OOV words once their vector representations are provided.

The main contributions of our paper are as follows: We propose LCTM, which infers topics via document-level co-occurrence patterns of latent concepts, and derive a collapsed Gibbs sampler for approximate inference. We show that LCTM can accurately represent short texts by outperforming conventional topic models in a clustering task. By means of a classification task, we furthermore demonstrate that LCTM achieves superior performance to other state-of-the-art topic models in handling documents with a high degree of OOV words.

The remainder of the paper is organized as follows: related work is summarized in Section 2, LCTM and its inference algorithm are presented in Section 3, experiments on 20Newsgroups are presented in Section 4, and a conclusion is presented in Section 5.

2 Related Work

There have been a number of previous studies on topic models that incorporate word embeddings. The closest model to LCTM is Gaussian LDA (Das et al., 2015), which models each topic as a Gaussian distribution over the word embedding space. However, the assumption that topics are unimodal in the embedding space is not appropriate, since topically related words such as 'neural' and 'networks' can occur distantly from each other in the embedding space. Nguyen et al. (2015) proposed topic models that incorporate information from word vectors in modeling topic-word distributions. Similarly, Petterson et al. (2010) exploit external word features to improve the Dirichlet prior of the topic-word distributions. However, neither of these models can handle OOV words, because they assume fixed word types.

Latent concepts in LCTM are closely related to 'constraints' in interactive topic models (ITM) (Hu et al., 2014). Both latent concepts and constraints are designed to group conceptually similar words using external knowledge, in an attempt to aid topic inference. The difference lies in their modeling assumptions: latent concepts in LCTM are modeled as Gaussian distributions over the embedding space, while constraints in ITM are sets of conceptually similar words that are interactively identified by humans for each topic. Each constraint for each topic is then modeled as a multinomial distribution over the constrained set of words that were identified as mutually related by humans. In Section 4, we consider a variant of ITM whose constraints are instead inferred using external word embeddings.

As regards short texts, a well-known topic model is the Biterm Topic Model (BTM) (Yan et al., 2013). BTM directly models the generation of biterms (pairs of words) in the whole corpus. However, the assumption that pairs of co-occurring words should be assigned to the same topic might be too strong (Chen et al., 2015).

3 Latent Concept Topic Model

3.1 Generative Model

The primary difference between LCTM and conventional topic models is that LCTM describes the generative process of word vectors in documents, rather than of the words themselves.

Suppose α and β are parameters for the Dirichlet priors, and let v_{d,i} denote the word embedding for a word type w_{d,i}. The generative model of LCTM is as follows.

1. For each topic k
   (a) Draw a topic-concept distribution φ_k ∼ Dirichlet(β).

Figure 2: Graphical representation of (a) LDA and (b) LCTM.

2. For each latent concept c
   (a) Draw a concept vector µ_c ∼ N(µ, σ₀² I).
3. For each document d
   (a) Draw a document-topic distribution θ_d ∼ Dirichlet(α).
   (b) For the i-th word w_{d,i} in document d:
       i. Draw its topic assignment z_{d,i} ∼ Categorical(θ_d).
       ii. Draw its latent concept assignment c_{d,i} ∼ Categorical(φ_{z_{d,i}}).
       iii. Draw a word vector v_{d,i} ∼ N(µ_{c_{d,i}}, σ² I).

The graphical models for LDA and LCTM are shown in Figure 2. Compared to LDA, LCTM adds another layer of latent variables to indicate the conceptual similarity of words.
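To make the generative story above concrete, the following sketch samples a tiny synthetic corpus of word vectors from the stated process. The corpus sizes, embedding dimension and hyperparameter values are invented for illustration and are not the authors' settings.

```python
# Toy simulation of LCTM's generative process (Section 3.1).
import numpy as np

rng = np.random.default_rng(0)
K, S, D = 4, 20, 50                      # topics, latent concepts, embedding dim
alpha, beta = 0.1, 0.1                   # Dirichlet hyperparameters
mu0, sigma0_sq, sigma_sq = np.zeros(D), 1.0, 0.5

phi = rng.dirichlet(np.full(S, beta), size=K)                # 1. phi_k ~ Dirichlet(beta)
mu = mu0 + np.sqrt(sigma0_sq) * rng.standard_normal((S, D))  # 2. mu_c ~ N(mu0, sigma0^2 I)

docs = []
for d in range(3):                                     # 3. for each document d
    theta_d = rng.dirichlet(np.full(K, alpha))         #    theta_d ~ Dirichlet(alpha)
    tokens = []
    for i in range(30):                                #    30 toy tokens per document
        z = rng.choice(K, p=theta_d)                   #    z_{d,i} ~ Categorical(theta_d)
        c = rng.choice(S, p=phi[z])                    #    c_{d,i} ~ Categorical(phi_z)
        v = mu[c] + np.sqrt(sigma_sq) * rng.standard_normal(D)  # v_{d,i} ~ N(mu_c, sigma^2 I)
        tokens.append((z, c, v))
    docs.append(tokens)
print(len(docs), "documents of", len(docs[0]), "word vectors each")
```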

3.2 Posterior Inference

In our application, we observe documents consisting of word vectors and wish to infer posterior distributions over all the hidden variables. Since there is no analytical solution to the posterior, we derive a collapsed Gibbs sampler to perform approximate inference. During the inference, we sample a latent concept assignment as well as a topic assignment for each word in each document as follows:

p(z_{d,i} = k \mid c_{d,i} = c, \mathbf{z}_{-d,i}, \mathbf{c}_{-d,i}, \mathbf{v}) \propto \left( n_{d,k}^{-d,i} + \alpha_k \right) \cdot \frac{n_{k,c}^{-d,i} + \beta_c}{n_{k,\cdot}^{-d,i} + \sum_{c'} \beta_{c'}} , \quad (1)

p(c_{d,i} = c \mid z_{d,i} = k, v_{d,i}, \mathbf{z}_{-d,i}, \mathbf{c}_{-d,i}, \mathbf{v}_{-d,i}) \propto \left( n_{k,c}^{-d,i} + \beta_c \right) \cdot \mathcal{N}(v_{d,i} \mid \mu_c, \sigma_c^2 I) , \quad (2)

where n_{d,k} is the number of words assigned to topic k in document d, and n_{k,c} is the number of words assigned to both topic k and latent concept c. When an index is replaced by '·', the number is obtained by summing over that index. The superscript −d,i indicates that the current assignments of z_{d,i} and c_{d,i} are ignored. N(· | µ, Σ) is a multivariate Gaussian density function with mean µ and covariance matrix Σ. µ_c and σ²_c in Eq. (2) are parameters associated with the latent concept c and are defined as follows:

\mu_c = \frac{1}{\sigma^2 + n_{\cdot,c}^{-d,i}\,\sigma_0^2} \left( \sigma^2 \mu + \sigma_0^2 \sum_{(d',i') \in A_c^{-d,i}} v_{d',i'} \right) , \quad (3)

\sigma_c^2 = \left( 1 + \frac{\sigma_0^2}{n_{\cdot,c}^{-d,i}\,\sigma_0^2 + \sigma^2} \right) \sigma^2 , \quad (4)

where A_c^{−d,i} ≡ {(d′, i′) | c_{d′,i′} = c ∧ (d′, i′) ≠ (d, i)} (Murphy, 2012). Eq. (1) is similar to the collapsed Gibbs sampler of LDA (Griffiths and Steyvers, 2004), except that the second term of Eq. (1) is concerned with topic-concept distributions. Eq. (2), which samples latent concepts, has an intuitive interpretation: the first term encourages concept assignments that are consistent with the current topic assignment, while the second term encourages concept assignments that are consistent with the observed word. The Gaussian variance parameter σ² acts as a trade-off parameter between the two terms via σ²_c. In Section 4.2, we study the effect of σ² on document representation.
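A minimal sketch of the per-token update in Eqs. (1)-(4) is given below. It assumes the count arrays and current assignments are maintained by the caller; the array layout, the spherical-Gaussian log-density helper and the variable names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def log_gauss_iso(v, mean, var):
    """Log density of a spherical Gaussian N(v | mean, var * I)."""
    d = v.shape[0]
    return -0.5 * (d * np.log(2.0 * np.pi * var) + np.sum((v - mean) ** 2) / var)

def resample_token(d, v, z, c, n_dk, n_kc, n_c, sum_vc,
                   alpha, beta, mu0, sigma0_sq, sigma_sq, rng):
    """One collapsed Gibbs update (Eqs. 1-4) for a token with vector v in doc d.

    n_dk[d, k] : tokens in document d assigned to topic k
    n_kc[k, c] : tokens assigned to topic k and concept c
    n_c[c]     : tokens assigned to concept c;  sum_vc[c] : sum of their vectors
    alpha      : scalar or (K,) Dirichlet parameter;  beta : (S,) Dirichlet parameter
    """
    K, S = n_kc.shape
    # Remove the token's current assignments (the "-d,i" counts).
    n_dk[d, z] -= 1
    n_kc[z, c] -= 1
    n_c[c] -= 1
    sum_vc[c] -= v

    # Eq. (1): resample the topic given the current concept assignment c.
    p_z = (n_dk[d] + alpha) * (n_kc[:, c] + beta[c]) / (n_kc.sum(axis=1) + beta.sum())
    z = rng.choice(K, p=p_z / p_z.sum())

    # Eqs. (3)-(4): posterior mean and variance of every concept.
    denom = sigma_sq + n_c * sigma0_sq                       # shape (S,)
    mu_c = (sigma_sq * mu0 + sigma0_sq * sum_vc) / denom[:, None]
    var_c = (1.0 + sigma0_sq / denom) * sigma_sq

    # Eq. (2): resample the concept given the new topic (log-space for stability).
    log_p_c = np.log(n_kc[z] + beta) + np.array(
        [log_gauss_iso(v, mu_c[s], var_c[s]) for s in range(S)])
    p_c = np.exp(log_p_c - log_p_c.max())
    c = rng.choice(S, p=p_c / p_c.sum())

    # Add the token back with its new assignments.
    n_dk[d, z] += 1
    n_kc[z, c] += 1
    n_c[c] += 1
    sum_vc[c] += v
    return z, c
```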

3.3 Prediction of Topic Proportions

After the posterior inference, the posterior means of {θ_d} and {φ_k} are straightforward to calculate:

\theta_{d,k} = \frac{n_{d,k} + \alpha_k}{n_{d,\cdot} + \sum_{k'} \alpha_{k'}} , \qquad \phi_{k,c} = \frac{n_{k,c} + \beta_c}{n_{k,\cdot} + \sum_{c'} \beta_{c'}} . \quad (5)

The posterior means for {µ_c} are also given by Eq. (3). We can then use these values to predict the topic proportions θ_{d_new} of an unseen document d_new using collapsed Gibbs sampling as follows:

p(z_{d_{\mathrm{new}},i} = k \mid v_{d_{\mathrm{new}},i}, \mathbf{v}_{-d_{\mathrm{new}},i}, \mathbf{z}_{-d_{\mathrm{new}},i}, \boldsymbol{\phi}, \boldsymbol{\mu}) \propto \left( n_{d_{\mathrm{new}},k}^{-d_{\mathrm{new}},i} + \alpha_k \right) \cdot \sum_c \phi_{k,c} \, \frac{\mathcal{N}(v_{d_{\mathrm{new}},i} \mid \mu_c, \sigma_c^2 I)}{\sum_{c'} \mathcal{N}(v_{d_{\mathrm{new}},i} \mid \mu_{c'}, \sigma_{c'}^2 I)} . \quad (6)

The second term of Eq. (6) is a weighted average of φ_{k,c} with respect to the latent concepts. We see that more weight is given to concepts whose corresponding vectors µ_c are closer to the word vector v_{d_new,i}. This is to be expected, because the statistics of nearby concepts should give more information about the word.
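The prediction step of Eqs. (5)-(6) for an unseen document can be sketched as follows, reusing the posterior means φ and µ_c, σ²_c from training. The simple Gibbs loop, the symmetric scalar α and the function signature are assumptions for illustration.

```python
import numpy as np

def predict_topic_proportions(V_new, phi, mu_c, var_c, alpha=0.1, n_iters=50, rng=None):
    """Predict theta for one unseen document via Eq. (6).

    V_new : (N, D) word vectors of the new document (OOV words included, as long
            as an embedding exists);  phi : (K, S) posterior means from Eq. (5);
    mu_c : (S, D) and var_c : (S,) concept parameters;  alpha : scalar prior.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    K, S = phi.shape
    N, D = V_new.shape

    # Normalized concept densities per token: the weighting inside Eq. (6).
    log_dens = np.stack(
        [-0.5 * (D * np.log(2 * np.pi * var_c[s])
                 + np.sum((V_new - mu_c[s]) ** 2, axis=1) / var_c[s])
         for s in range(S)], axis=1)                    # (N, S)
    resp = np.exp(log_dens - log_dens.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    weight = resp @ phi.T                               # (N, K): sum_c phi[k,c] * resp

    z = rng.integers(K, size=N)                         # random initial topics
    n_k = np.bincount(z, minlength=K).astype(float)
    for _ in range(n_iters):
        for i in range(N):
            n_k[z[i]] -= 1
            p = (n_k + alpha) * weight[i]               # Eq. (6), up to normalization
            z[i] = rng.choice(K, p=p / p.sum())
            n_k[z[i]] += 1

    return (n_k + alpha) / (n_k.sum() + K * alpha)      # Eq. (5) for the new document
```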

Notations:
α, β : Dirichlet prior parameters
µ, σ₀², σ² : Gaussian prior parameters
θ_d : document-topic distribution
φ_k : topic-concept (word) distribution
w_{d,i} : word type
v_{d,i} : word vector
z_{d,i} : latent topic
c_{d,i} : latent concept
µ_c : concept vector


Add  another  layer  of  latent  variables  (latent  concepts)  to  mediate  data  sparsity.


・Define topics as distributions over latent concepts. → Resolve data sparsity in short texts.
・Model the generative process of word embeddings. → LCTM can naturally handle out-of-vocabulary (OOV) words.

Overview of topic inference
・Collapsed Gibbs sampler for approximate inference.
・Sample latent concepts in addition to topics.

・Sampling of a topic assignment: (prop. of topic k in the same doc) × (prob. of topic k generating concept c):

p(z_{d,i} = k \mid c_{d,i} = c, \mathbf{z}_{-d,i}, \mathbf{c}_{-d,i}, \mathbf{v}) \propto \left( n_{d,k}^{-d,i} + \alpha_k \right) \cdot \frac{n_{k,c}^{-d,i} + \beta_c}{n_{k,\cdot}^{-d,i} + \sum_{c'} \beta_{c'}}

・Sampling of a concept assignment: (prob. of topic k generating concept c) × (prob. of concept c generating word vector v):

p(c_{d,i} = c \mid z_{d,i} = k, v_{d,i}, \mathbf{z}_{-d,i}, \mathbf{c}_{-d,i}, \mathbf{v}_{-d,i}) \propto \left( n_{k,c}^{-d,i} + \beta_c \right) \cdot \mathcal{N}(v_{d,i} \mid \mu_c, \sigma_c^2 I)

where N(· | µ_c, σ²_c I) is the Gaussian distribution corresponding to latent concept c.

Experimental  Results

1.  Performance  on  document  clustering  

2.  Performance  on  handling  a  high  degree  of  OOV  words  

Dataset: short posts (fewer than 50 words) of 20Newsgroups.

Figure 3: Relationship between σ² and AMI.

Figure 4: Comparison of the clustering performance of the topic models.

better handle OOV words in held-out documents than LFLDA and nI-cLDA do.

4.3 Representation of Held-out Documents with OOV Words

To show that our model can predict the topic proportions of documents containing OOV words better than other topic models can, we conducted an experiment on a classification task. In particular, we inferred topics from the training dataset and predicted the topic proportions of the held-out documents using the collapsed Gibbs sampler. With the inferred topic proportions on both the training dataset and the held-out documents, we then trained a multi-class classifier (multi-class logistic regression implemented in the sklearn² Python module) on the training dataset and predicted the newsgroup labels of the held-out documents.
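Once the topic proportions are inferred, this classification protocol amounts to a few lines of scikit-learn. The randomly generated proportions and labels below are placeholders standing in for the topic model's output and the 20Newsgroups labels.

```python
# Sketch: train multi-class logistic regression on topic proportions of
# training documents and evaluate on held-out documents.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
K, n_classes = 20, 20
theta_train = rng.dirichlet(np.ones(K), size=400)     # placeholder topic proportions
y_train = rng.integers(0, n_classes, size=400)        # placeholder newsgroup labels
theta_heldout = rng.dirichlet(np.ones(K), size=100)
y_heldout = rng.integers(0, n_classes, size=100)

clf = LogisticRegression(max_iter=1000).fit(theta_train, y_train)
print("classification accuracy:", accuracy_score(y_heldout, clf.predict(theta_heldout)))
```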

We compared the classification accuracy of LFLDA, nI-cLDA, LDA, GLDA, LCTM and a variant of LCTM (LCTM-UNK) that ignores OOV words in the held-out documents. A higher classification accuracy indicates a better representation of unseen documents. Table 2 shows the proportion of OOV words and the classification accuracy on the held-out documents.

² See http://scikit-learn.org/stable/.

Training Set   400short   800short   1561short
OOV prop       0.348      0.253      0.181
Method         Classification Accuracy
LCTM           0.302      0.367      0.416
LCTM-UNK       0.262      0.340      0.406
LFLDA          0.253      0.333      0.410
nI-cLDA        0.261      0.333      0.412
LDA            0.215      0.293      0.382
GLDA           0.0527     0.0529     0.0529
Chance Rate    0.0539     0.0539     0.0539

Table 2: Proportions of OOV words and classification accuracy on the held-out documents.

We see that LCTM-UNK outperforms the other topic models in almost every setting, demonstrating the superiority of our method even when OOV words are ignored. However, the fact that LCTM outperforms LCTM-UNK in all cases clearly illustrates that LCTM can effectively make use of information about OOV words to further improve the representation of unseen documents. The results also show that the improvement of LCTM over LCTM-UNK increases as the proportion of OOV words becomes greater.

5 Conclusion

In this paper, we have proposed LCTM, which is well suited for application to short texts with diverse vocabulary. LCTM infers topics according to document-level co-occurrence patterns of latent concepts, and is thus robust to diverse vocabulary usage and data sparsity in short texts. We showed experimentally that LCTM can produce a superior representation of short documents compared to conventional topic models. We additionally demonstrated that LCTM can exploit OOV words to improve the representation of unseen documents. Although our paper has focused on improving the performance of LDA by introducing a latent concept for each word, the same idea can readily be applied to other topic models that extend LDA.

Acknowledgments

We thank the anonymous reviewers for their constructive feedback. We also thank Hideki Mima for helpful discussions and Paul Thompson for insightful reviews of the paper. This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).


topic assignment of a word is determined by its embedding, instead of its word type. Therefore, LCTM can naturally handle OOV words once their embeddings are provided.
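As a minimal sketch of this point (variable names such as concept_means, sigma2 and phi_k are illustrative assumptions, not the authors' released code), the concept assignment of a single word vector can be scored from its embedding alone, so an OOV word can participate in inference as soon as a pre-trained vector for it is available:

    import numpy as np

    def concept_scores(v, concept_means, sigma2, phi_k):
        """Normalized concept probabilities for one word vector v under the
        current topic k, combining phi_k (the topic's concept weights) with
        the isotropic Gaussian emission N(mu_c, sigma2 * I)."""
        sq_dist = np.sum((concept_means - v) ** 2, axis=1)
        log_score = np.log(phi_k) - 0.5 * sq_dist / sigma2
        log_score -= log_score.max()        # stabilize before exponentiating
        score = np.exp(log_score)
        return score / score.sum()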

3.4 Reducing the Computational Complexity

From Eqs. (1) and (2), we see that the computational complexity of sampling per word is O(K + SD), where K, S and D are the numbers of topics, latent concepts and embedding dimensions, respectively. Since K ≪ S holds in usual settings, the dominant computation is the sampling of the latent concept, which costs O(SD) per word.

However, since LCTM assumes that the Gaussian variance σ² is relatively small, the chance of a word being assigned to a distant concept is negligible. Thus, we can reasonably assume that each word is assigned to one of its M ≪ S nearest concepts, which reduces the computational complexity to O(MD). Since concept vectors can move slightly in the embedding space during inference, we periodically update the nearest concepts of each word type.
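A minimal sketch of this restriction, assuming word-type vectors and concept vectors are held in NumPy arrays (names are illustrative):

    import numpy as np

    def nearest_concepts(word_vecs, concept_means, M=300):
        """Indices of the M nearest concept vectors for every word type; the
        lists are refreshed periodically as the concept vectors move."""
        sq_dist = (
            np.sum(word_vecs ** 2, axis=1, keepdims=True)
            - 2.0 * word_vecs @ concept_means.T
            + np.sum(concept_means ** 2, axis=1)
        )
        # argpartition keeps the M smallest distances per row without a full sort.
        return np.argpartition(sq_dist, M, axis=1)[:, :M]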

To further reduce the computational complexity, we can apply dimensionality-reduction algorithms such as PCA and t-SNE (Van der Maaten and Hinton, 2008) to the word embeddings to make D smaller. We leave this to future work.
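For illustration, one plausible way to do this with scikit-learn (the target dimensionality of 20 is an arbitrary assumption, and the random matrix is only a stand-in for the real embeddings):

    import numpy as np
    from sklearn.decomposition import PCA

    # Stand-in for the (vocab_size, 50) GloVe matrix; replace with the real embeddings.
    word_vecs = np.random.default_rng(0).normal(size=(10644, 50))
    reduced_vecs = PCA(n_components=20).fit_transform(word_vecs)   # D: 50 -> 20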

4 Experiments

4.1 Datasets and Models Description

In this section, we study the empirical performance of LCTM on short texts. We used the 20Newsgroups corpus, which consists of discussion posts about various news subjects authored by diverse readers. Each document in the corpus is tagged with one of twenty newsgroups. Only posts with fewer than 50 words are extracted for the training datasets. For external word embeddings, we used 50-dimensional GloVe vectors [1] that were pre-trained on Wikipedia. The datasets are summarized in Table 1. See Appendix A for the details of the dataset preprocessing.
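A rough sketch of the kind of preprocessing described above (the exact rules are in Appendix A, which is not reproduced here; the tokenization and the GloVe file name are assumptions):

    import numpy as np
    from sklearn.datasets import fetch_20newsgroups

    # Keep only short posts (fewer than 50 tokens).
    raw = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
    short_docs, short_labels = [], []
    for text, label in zip(raw.data, raw.target):
        tokens = text.lower().split()
        if 0 < len(tokens) < 50:
            short_docs.append(tokens)
            short_labels.append(label)

    # Load 50-dimensional GloVe vectors pre-trained on Wikipedia (file name assumed).
    glove = {}
    with open("glove.6B.50d.txt", encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)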

We compare the performance of LCTM to the following six baselines:

• LFLDA (Nguyen et al., 2015), an extension of Latent Dirichlet Allocation that incorporates information from word embeddings.

[1] Downloaded at http://nlp.stanford.edu/projects/glove/

Dataset     Doc size   Vocab size   Avg len
400short    400        4729         31.87
800short    800        7329         31.78
1561short   1561       10644        31.83
held-out    7235       37944        140.15

Table 1: Summary of datasets.

• LFDMM (Nguyen et al., 2015), an extension of Dirichlet Multinomial Mixtures that incorporates information from word embeddings.

• nI-cLDA, non-interactive constrained Latent Dirichlet Allocation, a variant of ITM (Hu et al., 2014), in which constraints are inferred by applying k-means to external word embeddings; each resulting word cluster is then regarded as a constraint (a minimal sketch of this construction follows the list below). See Appendix B for the details of the model.

• GLDA (Das et al., 2015), Gaussian LDA.

• BTM (Yan et al., 2013), Biterm Topic Model.

• LDA (Blei et al., 2003).
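The constraint construction mentioned in the nI-cLDA entry above can be sketched as follows (variable names are illustrative; the sketch only assumes that each k-means cluster of word types becomes one constraint):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_constraints(vocab, glove, n_constraints=1000, seed=0):
        """Cluster word embeddings with k-means; each resulting cluster of
        word types is treated as one constraint."""
        words = [w for w in vocab if w in glove]
        X = np.stack([glove[w] for w in words])
        labels = KMeans(n_clusters=n_constraints, random_state=seed, n_init=10).fit_predict(X)
        constraints = {}
        for word, c in zip(words, labels):
            constraints.setdefault(c, []).append(word)
        return list(constraints.values())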

In all of the models, we set the number of topics to 20. For LCTM (resp. nI-cLDA), we set the number of latent concepts (resp. constraints) to 1000. See Appendix C for the details of the hyperparameter settings.

4.2 Document Clustering

To demonstrate that LCTM results in a superior representation of short documents compared to the baselines, we evaluated the performance of each model on a document clustering task. We used the learned topic proportions as features for each document and applied k-means to cluster the documents. We then compared the resulting clusters to the actual newsgroup labels. Clustering performance is measured by Adjusted Mutual Information (AMI) (Manning et al., 2008); higher AMI indicates better clustering performance. Figure 3 illustrates the quality of clustering as a function of the Gaussian variance parameter σ². We see that setting σ² = 0.5 consistently obtains good clustering performance for all the datasets of varying sizes, and we therefore set σ² = 0.5 in the later evaluations. Figure 4 compares AMI across four topic models. We see that LCTM outperforms the topic models that do not use word embeddings. We also see that LCTM performs comparably to LFLDA and nI-cLDA, both of which incorporate information from word embeddings to aid topic inference.
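The evaluation protocol of this subsection can be summarized with the following minimal sketch (the topic-proportion matrix theta is assumed to come from whichever model is being evaluated):

    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_mutual_info_score

    def clustering_ami(theta, newsgroup_labels, n_clusters=20, seed=0):
        """theta: (n_docs, n_topics) learned topic proportions used as features."""
        pred = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(theta)
        return adjusted_mutual_info_score(newsgroup_labels, pred)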

Figure 3: Relationship between σ² and AMI.

Figure 4: Comparison of clustering performance across the topic models.

However, as we will see in the next section, LCTM can better handle OOV words in held-out documents than LFLDA and nI-cLDA do.

4.3 Representation of Held-out Documents with OOV Words

To show that our model can better predict the topic proportions of documents containing OOV words than other topic models, we conducted an experiment on a classification task. In particular, we inferred topics from the training dataset and predicted the topic proportions of the held-out documents using the collapsed Gibbs sampler. With the inferred topic proportions on both the training dataset and the held-out documents, we then trained a multi-class classifier (multi-class logistic regression implemented in the sklearn [2] Python module) on the training dataset and predicted the newsgroup labels of the held-out documents.
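A minimal sketch of this classification step (theta_train and theta_heldout denote the inferred topic proportions; the classifier hyperparameters shown are assumptions, not taken from the paper):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def heldout_accuracy(theta_train, y_train, theta_heldout, y_heldout):
        """Multi-class logistic regression on topic proportions."""
        clf = LogisticRegression(max_iter=1000)
        clf.fit(theta_train, y_train)                   # train on training-set proportions
        return accuracy_score(y_heldout, clf.predict(theta_heldout))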

We compared classification accuracy using LFLDA, nI-cLDA, LDA, GLDA, LCTM and a variant of LCTM (LCTM-UNK) that ignores OOV words in the held-out documents. A higher classification accuracy indicates a better representation of unseen documents. Table 2 shows the proportion of OOV words and the classification accuracy on the held-out documents.

[2] See http://scikit-learn.org/stable/.

Training Set   400short   800short   1561short
OOV prop       0.348      0.253      0.181
---------------------------------------------
Method                Classification Accuracy
LCTM           0.302      0.367      0.416
LCTM-UNK       0.262      0.340      0.406
LFLDA          0.253      0.333      0.410
nI-cLDA        0.261      0.333      0.412
LDA            0.215      0.293      0.382
GLDA           0.0527     0.0529     0.0529
Chance Rate    0.0539     0.0539     0.0539

Table 2: Proportions of OOV words and classification accuracy on the held-out documents.

We see that LCTM-UNK outperforms the other topic models in almost every setting, demonstrating the superiority of our method even when OOV words are ignored. Moreover, the fact that LCTM outperforms LCTM-UNK in all cases clearly illustrates that LCTM can effectively make use of information about OOV words to further improve the representation of unseen documents. The results also show that the improvement of LCTM over LCTM-UNK increases as the proportion of OOV words grows.

5 Conclusion

In this paper, we have proposed LCTM, which is well suited for application to short texts with diverse vocabulary. LCTM infers topics according to document-level co-occurrence patterns of latent concepts, and thus is robust to diverse vocabulary usage and data sparsity in short texts. We showed experimentally that LCTM can produce a superior representation of short documents compared to conventional topic models. We additionally demonstrated that LCTM can exploit OOV words to improve the representation of unseen documents. Although our paper has focused on improving the performance of LDA by introducing a latent concept for each word, the same idea can readily be applied to other topic models that extend LDA.

Acknowledgments

We thank the anonymous reviewers for their constructive feedback. We also thank Hideki Mima for helpful discussions and Paul Thompson for insightful reviews of the paper. This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).


Figure 1: Projected latent concepts in the word embedding space. Concept vectors are annotated with their representative concepts in parentheses.

words, we expect topically related latent concepts to co-occur many times, even in short texts with diverse usage of words. This in turn promotes topic inference in LCTM.

LCTM has the further advantage of using continuous word embeddings. Traditional LDA assumes a fixed vocabulary of word types. This modeling assumption prevents LDA from handling out-of-vocabulary (OOV) words in held-out documents. On the other hand, since our topic model operates in the continuous vector space, it can naturally handle OOV words once their vector representations are provided.

The main contributions of our paper are as follows: we propose LCTM, which infers topics via document-level co-occurrence patterns of latent concepts, and derive a collapsed Gibbs sampler for approximate inference. We show that LCTM can accurately represent short texts by outperforming conventional topic models in a clustering task. By means of a classification task, we furthermore demonstrate that LCTM achieves superior performance to other state-of-the-art topic models in handling documents with a high degree of OOV words.

The remainder of the paper is organized as follows: related work is summarized in Section 2, while LCTM and its inference algorithm are presented in Section 3. Experiments on the 20Newsgroups corpus are presented in Section 4, and the conclusion is presented in Section 5.

2 Related Work

There have been a number of previous studies on topic models that incorporate word embeddings. The closest model to LCTM is Gaussian LDA

(Das et al., 2015), which models each topic as a Gaussian distribution over the word embedding space. However, the assumption that topics are unimodal in the embedding space is not appropriate, since topically related words such as 'neural' and 'networks' can occur far from each other in the embedding space. Nguyen et al. (2015) proposed topic models that incorporate information from word vectors in modeling topic-word distributions. Similarly, Petterson et al. (2010) exploit external word features to improve the Dirichlet prior of the topic-word distributions. However, neither of these models can handle OOV words, because they assume fixed word types.

Latent concepts in LCTM are closely related to 'constraints' in interactive topic models (ITM) (Hu et al., 2014). Both latent concepts and constraints are designed to group conceptually similar words using external knowledge in an attempt to aid topic inference. The difference lies in their modeling assumptions: latent concepts in LCTM are modeled as Gaussian distributions over the embedding space, while constraints in ITM are sets of conceptually similar words that are interactively identified by humans for each topic. Each constraint for each topic is then modeled as a multinomial distribution over the constrained set of words that were identified as mutually related by humans. In Section 4, we consider a variant of ITM whose constraints are instead inferred using external word embeddings.

As regards short texts, a well-known topic model is the Biterm Topic Model (BTM) (Yan et al., 2013). BTM directly models the generation of biterms (pairs of words) in the whole corpus. However, the assumption that pairs of co-occurring words should be assigned to the same topic might be too strong (Chen et al., 2015).

3 Latent Concept Topic Model

3.1 Generative Model

The primary difference between LCTM and conventional topic models is that LCTM describes the generative process of the word vectors in documents, rather than of the words themselves.

Suppose α and β are parameters for the Dirichlet priors, and let v_{d,i} denote the word embedding for word type w_{d,i}. The generative model for LCTM is as follows.

1. For each topic k

(a) Draw a topic-concept distribution φ_k ∼ Dirichlet(β).
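The remaining steps of the generative process are not reproduced in this copy. The following sketch gives one plausible reading of the full process, reconstructed from the notation above (Dirichlet priors over document-topic and topic-concept distributions, Gaussian concept vectors, and an isotropic Gaussian emission of word vectors); it should be treated as a paraphrase under these assumptions rather than the authors' exact specification.

    import numpy as np

    def generate_corpus(K, S, D, doc_lengths, alpha, beta, sigma0=1.0, sigma2=0.5, seed=0):
        """Forward simulation of the assumed LCTM generative process."""
        rng = np.random.default_rng(seed)
        phi = rng.dirichlet(np.full(S, beta), size=K)      # topic-concept distributions
        mu = rng.normal(0.0, sigma0, size=(S, D))          # latent concept vectors
        docs = []
        for N_d in doc_lengths:
            theta_d = rng.dirichlet(np.full(K, alpha))     # document-topic proportions
            words = []
            for _ in range(N_d):
                z = rng.choice(K, p=theta_d)               # latent topic z_{d,i}
                c = rng.choice(S, p=phi[z])                # latent concept c_{d,i}
                v = rng.normal(mu[c], np.sqrt(sigma2))     # observed word vector v_{d,i}
                words.append((z, c, v))
            docs.append(words)
        return docs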
