Download - K S 導入 ALatent!Concept!Topic!Models!for!! SD K S D O SD X ...web.stanford.edu/~weihuahu/presentation/acl_poster_final.pdf2010) exploits external word features to improve the Dirichlet

・Use Neural word embedding (e.g., word2vec, Glove) to capture conceptual similarity of words. à Each cluster corresponds to one latent concept.

A Latent Concept Topic Models for Robust Topic Inference Using Word Embeddings

MoAvaAon ・Document-‐level word co-‐occurrence is scarce when texts are short and vocabulary is diverse (e.g. blog, SNS, newsgroup). ・ProbabilisAc topic models (e.g., LDA, pLSI) infers topics based on document-‐level word co-‐occurrence. à ConvenFonal topic models are not effecFve. à Propose a novel topic model based on co-‐occurrence staAsAcs of latent concepts to resolve the data sparsity.

GloVe vectors (Pennington+ 2014)

Proposal

Gaussian variance parameter controls the range of the emission.

Graphical Models

Latent Concept Topic Model (LCTM)

Weihua Hu (Utokyo) and Jun’ichi Tsujii (AIST)

Figure 1: Projected latent concepts on the wordembedding space. Concept vectors are annotatedwith their representative concepts in parentheses.

words, we expect topically-related latent conceptsto co-occur many times, even in short texts withdiverse usage of words. This in turn promotestopic inference in LCTM.

LCTM further has the advantage of using con-tinuous word embedding. Traditional LDA as-sumes a fixed vocabulary of word types. Thismodeling assumption prevents LDA from han-dling out of vocabulary (OOV) words in held-outdocuments. On the other hands, since our topicmodel operates on the continuous vector space, itcan naturally handle OOV words once their vectorrepresentation is provided.

The main contributions of our paper are as fol-lows: We propose LCTM that infers topics viadocument-level co-occurrence patterns of latentconcepts, and derive a collapsed Gibbs samplerfor approximate inference. We show that LCTMcan accurately represent short texts by outperform-ing conventional topic models in a clustering task.By means of a classification task, we furthermoredemonstrate that LCTM achieves superior perfor-mance to other state-of-the-art topic models inhandling documents with a high degree of OOVwords.

The remainder of the paper is organized as fol-lows: related work is summarized in Section 2,while LCTM and its inference algorithm are pre-sented in Section 3. Experiments on the 20News-groups are presented in Section 4, and a conclu-sion is presented in Section 5.

2 Related Work

There have been a number of previous studies ontopic models that incorporate word embeddings.The closest model to LCTM is Gaussian LDA

(Das et al., 2015), which models each topic asa Gaussian distribution over the word embeddingspace. However, the assumption that topics areunimodal in the embedding space is not appropri-ate, since topically related words such as ‘neural’and ‘networks’ can occur distantly from each otherin the embedding space. Nguyen et al. (2015) pro-posed topic models that incorporate informationof word vectors in modeling topic-word distribu-tions. Similarly, Petterson et al. (Petterson et al.,2010) exploits external word features to improvethe Dirichlet prior of the topic-word distributions.However, both of the models cannot handle OOVwords, because they assume fixed word types.

Latent concepts in LCTM are closely relatedto ‘constraints’ in interactive topic models (ITM)(Hu et al., 2014). Both latent concepts and con-straints are designed to group conceptually simi-lar words using external knowledge in an attemptto aid topic inference. The difference lies in theirmodeling assumptions: latent concepts in LCTMare modeled as Gaussian distributions over theembedding space, while constraints in ITM aresets of conceptually similar words that are interac-tively identified by humans for each topic. Eachconstraint for each topic is then modeled as amultinomial distribution over the constrained setof words that were identified as mutually relatedby humans. In Section 4, we consider a variant ofITM, whose constraints are instead inferred usingexternal word embeddings.

As regards short texts, a well-known topicmodel is Biterm Topic Model (BTM) (Yan etal., 2013). BTM directly models the genera-tion of biterms (pairs of words) in the whole cor-pus. However, the assumption that pairs of co-occurring words should be assigned to the sametopic might be too strong (Chen et al., 2015).

3 Latent Concept Topic Model

3.1 Generative ModelThe primary difference between LCTM and theconventional topic models is that LCTM describesthe generative process of word vectors in docu-ments, rather than words themselves.

Suppose α and β are parameters for the Dirich-let priors and let vd,i denote the word embeddingfor a word type wd,i. The generative model forLCTM is as follows.

1. For each topic k

(a) Draw a topic concept distribution φk ∼Dirichlet(β).

(a) LDA. (b) LCTM.

Figure 2: Graphical representation.

2. For each latent concept c

(a) Draw a concept vector µc ∼N (µ,σ2

0I).3. For each document d

(a) Draw a document topic distributionθd ∼ Dirichlet(α).

(b) For the i-th word wd,i in document di. Draw its topic assignment zd,i ∼Categorical(θd).

ii. Draw its latent concept assignmentcd,i ∼ Categorical(φzd,i).

iii. Draw a word vector vd,i ∼N (µcd,i ,σ

2I).

The graphical models for LDA and LCTM areshown in Figure 2. Compared to LDA, LCTMadds another layer of latent variables to indicatethe conceptual similarity of words.

3.2 Posterior InferenceIn our application, we observe documents consist-ing of word vectors and wish to infer posterior dis-tributions over all the hidden variables. Since thereis no analytical solution to the posterior, we derivea collapsed Gibbs sampler to perform approximateinference. During the inference, we sample a la-tent concept assignment as well as a topic assign-ment for each word in each document as follows:

p(zd,i = k | cd,i = c,z−d,i, c−d,i,v)

∝(n−d,id,k + αk

)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)

P (cd,i = c | zd,i = k,vd,i, z−d,i, c−d,i,v−d,i)

∝(n−d,ik,c + βc

)· N (vd,i|µc,σ

2cI), (2)

where nd,k is the number of words assigned totopic k in document d, and nk,c is the number ofwords assigned to both topic k and latent conceptc. When an index is replaced by ‘·’, the number is

obtained by summing over the index. The super-script −d,i indicates that the current assignmentsof zd,i and cd,i are ignored. N (·|µ,Σ) is a mul-tivariate Gaussian density function with mean µand covariance matrix Σ. µc and σ2

c in Eq. (2)are parameters associated with the latent conceptc and are defined as follows:

µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑

(d′,i′)∈A−d,ic

vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)

where A−d,ic ≡ {(d′, i′) | cd′,i′ = c ∧ (d′, i′) ̸=

(d, i)} (Murphy, 2012). Eq. (1) is similar to thecollapsed Gibbs sampler of LDA (Griffiths andSteyvers, 2004) except that the second term ofEq. (1) is concerned with topic-concept distribu-tions. Eq. (2) of sampling latent concepts has anintuitive interpretation: the first term encouragesconcept assignments that are consistent with thecurrent topic assignment, while the second termencourages concept assignments that are consis-tent with the observed word. The Gaussian vari-ance parameter σ2 acts as a trade-off parameterbetween the two terms via σ2

c . In Section 4.2, westudy the effect of σ2 on document representation.

3.3 Prediction of Topic Proportions

After the posterior inference, the posterior meansof {θd}, {φk} are straightforward to calculate:

θd,k =nd,k + αk

nd,· +∑

k′ αk′, φk,c =

nk,c + βc

nk,· +∑

c′ βc′. (5)

Also posterior means for {µc} are given byEq. (3). We can then use these values to predicta topic proportion θdnew of an unseen documentdnew using collapsed Gibbs sampling as follows:

p(zdnew,i = k | vdnew,i,v−dnew,i,z−dnew,i,φ,µ)

∝(n−dnew,idnew,k + αk

)·∑

c

φk,c

N (vdnew,i|µc,σ2c )∑

c′ N (vdnew,i|µc′ ,σ2c′)

.

(6)

The second term of Eq. (6) is a weighted averageof φk,c with respect to latent concepts. We see thatmore weight is given to the concepts whose corre-sponding vectors µc are closer to the word vec-tor vdnew,i. This to be expected because statisticsof nearby concepts should give more informationabout the word. We also see from Eq. (6) that the

NotaFons : Dirichlet prior parameters : Gaussian prior parameters : document-‐topic distribuFon : topic-‐concept (word) distribuFon : word type : word vector : latent topic : latent concept : concept vector (a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)



)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


Add another layer of latent variables (latent concepts) to mediate data sparsity.

(a) LDA. (b) LCTM.









2I).





)·

n−d,ik,c + βc

n−d,ik,· +

∑c′ βc′

, (1)

p(cd,i = c | zd,i = k,vd,i, z−d,i, c−d,i,v−d,i)


)· N (vd,i|µc,σ

2cI), (2)




µc =1

σ2 + n−d,i·,c σ2

0

⎛

⎝σ2µ+ σ20 ·

∑


vd′,i′

⎞

⎠ ,

(3)

σ2c =

(1 +

σ20

n−d,i·,c σ2

0 + σ2

)σ2, (4)






θd,k =nd,k + αk

nd,· +∑


nk,c + βc

nk,· +∑

c′ βc′. (5)




)·∑

c

φk,c



.

(6)


・Define topics as distribuAons over latent concepts. à Resolve data sparsity in short texts. ・Model the generaAve process of word embeddings. à LCTM can naturally handle Out of Vocabulary (OOV) words.

Overview of topic inference ・Collapsed Gibbs sampler for the approximate inference. ・Sample latent concepts in addiFon to topics.

Prop of topic k in the same doc

Chapter 1

導入

1.1 概要x ∼ p(x|y = 1)

x ∼ p(x) = p(y = 1)p(x|y = 1) + p(y = −1)p(x|y = −1) (1.1)

= πp(x|y = 1) + (1− π)p(x|y = −1) (1.2)

x ∈ Rd,y ∈ {0, 1}m, s ∈ {0, 1}m: X → Rm

Y = {0, 1}準瞬時 FV符号は，復号遅れを定数シンボルだけ許すことにより，特定の情

報源に対してはハフマン符号より良い圧縮性能を示すことが知られている FV符号である．しかし，一般の情報源に対して，準瞬時 FV符号の圧縮限界がハフマン符号よりも良くなるのかどうかについては知られていない．本論文では，情報源の最頻出シンボルの生起確率が与えられたときの，Binary準瞬時 FV符号のtightな圧縮限界を示すとともに，一般の情報源に対する Binary準瞬時 FV符号の tightな圧縮限界を示す．

1.2 符号

p(zd,i = k | cd,i = c, z−d,i, c−d,i,v) ∝!n−d,id,k + αk

"·

n−d,ik,c + βc

n−d,ik,· +

#c′ βc′

, (1.3)

P (cd,i = c | zd,i = k,vd,i, z−d,i, c−d,i,v−d,i) ∝

!n−d,ik,c + βc

"· N (vd,i|µc,σ

2cI),

(1.4)

D-ary符号とは，情報源シンボル集合 X から有限長の符号語 D∗ への写像として定義される．符号語シンボル集合Dは，{0, 1, . . . , D− 1}として一般性を失

3

Chapter 1

導入

1.1 概要x ∼ p(x|y = 1)

x ∼ p(x) = p(y = 1)p(x|y = 1) + p(y = −1)p(x|y = −1) (1.1)

= πp(x|y = 1) + (1− π)p(x|y = −1) (1.2)

x ∈ Rd,y ∈ {0, 1}m, s ∈ {0, 1}m: X → Rm



1.2 符号


"·

n−d,ik,c + βc

n−d,ik,· +

#c′ βc′

, (1.3)

p(cd,i = c | zd,i = k,vd,i, z−d,i, c−d,i,v−d,i) ∝

!n−d,ik,c + βc

"· N (vd,i|µc,σ

2cI),

(1.4)

D-ary符号とは，情報源シンボル集合 X から有限長の符号語 D∗ への写像として定義される．符号語シンボル集合Dは，{0, 1, . . . , D− 1}として一般性を失

3

・Sampling of a topic assignment

・Sampling of a concept assignment

Prob of topic k generaFng concept c

Prob of topic k generaFng concept c

Prob of concept c generaFng word vec v

: Gaussian distribuFon corresponding to latent concept c

Chapter 1

導入

1.1 概要x ∼ p(x|y = 1)

x ∼ p(x) = p(y = 1)p(x|y = 1) + p(y = −1)p(x|y = −1) (1.1)

= πp(x|y = 1) + (1− π)p(x|y = −1) (1.2)

x ∈ Rd,y ∈ {0, 1}m, s ∈ {0, 1}m: X → Rm



1.2 符号


"·

n−d,ik,c + βc

n−d,ik,· +

#c′ βc′

,

(1.3)

p(cd,i = c | zd,i = k,vd,i, z−d,i, c−d,i,v−d,i) ∝

!n−d,ik,c + βc

"· N (vd,i|µc,σ

2cI),

(1.4)

N (·|µc,σ2cI) (1.5)

3

Experimental Results

1. Performance on document clustering

2. Performance on handling a high degree of OOV words

Dataset: Short posts (less than 50 words) of 20Newsgroup.

Figure 3: Relationship between σ2 and AMI.

Figure 4: Comparisons on clustering performanceof the topic models.

better handle OOV words in held-out documentsthan LFLDA and nl-cLDA do.

4.3 Representation of Held-out Documentswith OOV words

To show that our model can better predict topicproportions of documents containing OOV wordsthan other topic models, we conducted an exper-iment on a classification task. In particular, weinfer topics from the training dataset and predictedtopic proportions of held-out documents using col-lapsed Gibbs sampler. With the inferred topicproportions on both training dataset and held-outdocuments, we then trained a multi-class classi-fier (multi-class logistic regression implementedin sklearn2 python module) on the training datasetand predicted newsgroup labels of the held-outdocuments.

We compared classification accuracy usingLFLDA, nI-cLDA, LDA, GLDA, LCTM and avariant of LCTM (LCTM-UNK) that ignores OOVin the held-out documents. A higher classifica-tion accuracy indicates a better representation ofunseen documents. Table 2 shows the propor-tion of OOV words and classification accuracyof the held-out documents. We see that LCTM-UNK outperforms other topic models in almost

2See http://scikit-learn.org/stable/.

Training Set 400short 800short 1561shortOOV prop 0.348 0.253 0.181Method Classification AccuracyLCTM 0.302 0.367 0.416LCTM-UNK 0.262 0.340 0.406LFLDA 0.253 0.333 0.410nI-cLDA 0.261 0.333 0.412LDA 0.215 0.293 0.382GLDA 0.0527 0.0529 0.0529Chance Rate 0.0539 0.0539 0.0539

Table 2: Proportions of OOV words and classifi-cation accuracy in the held-out documents.

every setting, demonstrating the superiority ofour method, even when OOV words are ignored.However, the fact that LCTM outperforms LCTM-UNK in all cases clearly illustrates that LCTM caneffectively make use of information about OOV tofurther improve the representation of unseen docu-ments. The results show that the level of improve-ment of LCTM over LCTM-UNK increases as theproportion of OOV becomes greater.

5 Conclusion

In this paper, we have proposed LCTM that iswell suited for application to short texts with di-verse vocabulary. LCTM infers topics accordingto document-level co-occurrence patterns of la-tent concepts, and thus is robust to diverse vocab-ulary usage and data sparsity in short texts. Weshowed experimentally that LCTM can produce asuperior representation of short documents, com-pared to conventional topic models. We addition-ally demonstrated that LCTM can exploit OOV toimprove the representation of unseen documents.Although our paper has focused on improving per-formance of LDA by introducing the latent con-cept for each word, the same idea can be readilyapplied to other topic models that extend LDA.

Acknowledgments

We thank anonymous reviewers for their construc-tive feedback. We also thank Hideki Mima forhelpful discussions and Paul Thompson for in-sightful reviews on the paper. This paper is basedon results obtained from a project commissionedby the New Energy and Industrial Technology De-velopment Organization (NEDO).











5 Conclusion


Acknowledgments


・Gaussian variance with consistently performs well. ・LCTM outperforms TM w/o word embeddings. ・LCTM performs comparable to TM w/ word embeddings.

topic assignment of a word is determined by itsembedding, instead of its word type. Therefore,LCTM can naturally handle OOV words once theirembeddings are provided.

3.4 Reducing the Computational ComplexityFrom Eqs. (1) and (2), we see that the computa-tional complexity of sampling per word is O(K +SD), where K, S and D are numbers of topics, la-tent concepts and embedding dimensions, respec-tively. Since K ≪ S holds in usual settings, thedominant computation involves the sampling oflatent concept, which costs O(SD) computationper word.

However, since LCTM assumes that Gaussianvariance σ2 is relatively small, the chance of aword being assigned to distant concepts is negli-gible. Thus, we can reasonably assume that eachword is assigned to one of M ≪ S nearest con-cepts. Hence, the computational complexity isreduced to O(MD). Since concept vectors canmove slightly in the embedding space during theinference, we periodically update the nearest con-cepts for each word type.

To further reduce the computational complexity,we can apply dimensional reduction algorithmssuch as PCA and t-SNE (Van der Maaten and Hin-ton, 2008) to word embeddings to make D smaller.We leave this to future work.

4 Experiments

4.1 Datasets and Models DescriptionIn this section, we study the empirical perfor-mance of LCTM on short texts. We used the20Newsgroups corpus, which consists of discus-sion posts about various news subjects authoredby diverse readers. Each document in the corpus istagged with one of twenty newsgroups. Only postswith less than 50 words are extracted for trainingdatasets. For external word embeddings, we used50-dimensional GloVe1 that were pre-trained onWikipedia. The datasets are summarized in Ta-ble 1. See appendix A for the detail of the datasetpreprocessing.

We compare the performance of the LCTM tothe following six baselines:

• LFLDA (Nguyen et al., 2015), an extensionof Latent Dirichlet Allocation that incorpo-rates word embeddings information.

1Downloaded athttp://nlp.stanford.edu/projects/glove/

Dataset Doc size Vocab size Avg len400short 400 4729 31.87800short 800 7329 31.781561short 1561 10644 31.83held-out 7235 37944 140.15

Table 1: Summary of datasets.

• LFDMM (Nguyen et al., 2015), an extensionof Dirichlet Multinomial Mixtures that incor-porates word embeddings information.

• nI-cLDA, non-interactive constrained LatentDirichlet Allocatoin, a variant of ITM (Hu etal., 2014), where constraints are inferred byapplying k-means to external word embed-dings. Each resulting word cluster is then re-garded as a constraint. See appendix B forthe detail of the model.

• GLDA (Das et al., 2015), Gaussian LDA.

• BTM (Yan et al., 2013), Biterm Topic Model.

• LDA (Blei et al., 2003).

In all the models, we set the number of topicsto be 20. For LCTM (resp. nI-ITM), we set thenumber of latent concepts (resp. constraints) tobe 1000. See appendix C for the detail of hyper-parameter settings.

4.2 Document ClusteringTo demonstrate that LCTM results in a superiorrepresentation of short documents compared to thebaselines, we evaluated the performance of eachmodel on a document clustering task. We useda learned topic proportion as a feature for eachdocument and applied k-means to cluster the doc-uments. We then compared the resulting clus-ters to the actual newsgroup labels. Clusteringperformance is measured by Adjusted Mutual In-formation (AMI) (Manning et al., 2008). HigherAMI indicates better clustering performance. Fig-ure 3 illustrates the quality of clustering in termsof Gaussian variance parameter σ2. We see thatsetting σ2 = 0.5 consistently obtains good clus-tering performance for all the datasets with vary-ing sizes. We therefore set σ2 = 0.5 in the laterevaluation. Figure 4 compares AMI on four topicmodels. We see that LCTM outperforms the topicmodels without word embeddings. Also, we seethat LCTM performs comparable to LFLDA andnl-cLDA, both of which incorporate informationof word embeddings to aid topic inference. How-ever, as we will see in the next section, LCTM can











5 Conclusion


Acknowledgments


ClassificaFon accuracy on held-‐out documents

Clustering performance measured by Adjusted Mutual InformaFon (AMI)

・LCTM-‐UNK (LCTM that ignores OOV) outperforms other TMs. ・LCTM further improves performance of LCTM-‐UNK. à LCTM effecAvely incorporates OOV words in held-‐out documents.

Infer topics on training dataset

Predict topic-‐prop of held-‐out documents

Classify by topic proporAon

Experimental se\ng

Conclusion Introduced LCTM that infers topics based on document-‐level co-‐occurrence of latent concepts. Showed that LCTM can effecFvely handle OOV words in held-‐out documents. The same method can be readily applied to topic models that extend LDA.

・

・

・

Figure 1: Projected latent concepts on the wordembedding space. Concept vectors are annotatedwith their representative concepts in parentheses.

words, we expect topically-related latent conceptsto co-occur many times, even in short texts withdiverse usage of words. This in turn promotestopic inference in LCTM.

LCTM further has the advantage of using con-tinuous word embedding. Traditional LDA as-sumes a fixed vocabulary of word types. Thismodeling assumption prevents LDA from han-dling out of vocabulary (OOV) words in held-outdocuments. On the other hands, since our topicmodel operates on the continuous vector space, itcan naturally handle OOV words once their vectorrepresentation is provided.

The main contributions of our paper are as fol-lows: We propose LCTM that infers topics viadocument-level co-occurrence patterns of latentconcepts, and derive a collapsed Gibbs samplerfor approximate inference. We show that LCTMcan accurately represent short texts by outperform-ing conventional topic models in a clustering task.By means of a classification task, we furthermoredemonstrate that LCTM achieves superior perfor-mance to other state-of-the-art topic models inhandling documents with a high degree of OOVwords.

The remainder of the paper is organized as fol-lows: related work is summarized in Section 2,while LCTM and its inference algorithm are pre-sented in Section 3. Experiments on the 20News-groups are presented in Section 4, and a conclu-sion is presented in Section 5.

2 Related Work

There have been a number of previous studies ontopic models that incorporate word embeddings.The closest model to LCTM is Gaussian LDA

(Das et al., 2015), which models each topic asa Gaussian distribution over the word embeddingspace. However, the assumption that topics areunimodal in the embedding space is not appropri-ate, since topically related words such as ‘neural’and ‘networks’ can occur distantly from each otherin the embedding space. Nguyen et al. (2015) pro-posed topic models that incorporate informationof word vectors in modeling topic-word distribu-tions. Similarly, Petterson et al. (Petterson et al.,2010) exploits external word features to improvethe Dirichlet prior of the topic-word distributions.However, both of the models cannot handle OOVwords, because they assume fixed word types.

Latent concepts in LCTM are closely relatedto ‘constraints’ in interactive topic models (ITM)(Hu et al., 2014). Both latent concepts and con-straints are designed to group conceptually simi-lar words using external knowledge in an attemptto aid topic inference. The difference lies in theirmodeling assumptions: latent concepts in LCTMare modeled as Gaussian distributions over theembedding space, while constraints in ITM aresets of conceptually similar words that are interac-tively identified by humans for each topic. Eachconstraint for each topic is then modeled as amultinomial distribution over the constrained setof words that were identified as mutually relatedby humans. In Section 4, we consider a variant ofITM, whose constraints are instead inferred usingexternal word embeddings.

As regards short texts, a well-known topicmodel is Biterm Topic Model (BTM) (Yan etal., 2013). BTM directly models the genera-tion of biterms (pairs of words) in the whole cor-pus. However, the assumption that pairs of co-occurring words should be assigned to the sametopic might be too strong (Chen et al., 2015).

3 Latent Concept Topic Model

3.1 Generative ModelThe primary difference between LCTM and theconventional topic models is that LCTM describesthe generative process of word vectors in docu-ments, rather than words themselves.

Suppose α and β are parameters for the Dirich-let priors and let vd,i denote the word embeddingfor a word type wd,i. The generative model forLCTM is as follows.

1. For each topic k

(a) Draw a topic concept distribution φk ∼Dirichlet(β).