Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 1243–1248, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics

A Multilingual Topic Model for Learning Weighted Topic Links Across Corpora with Low Comparability

Weiwei Yang∗
Computer Science
University of Maryland
[email protected]

Jordan Boyd-Graber†
Computer Science, iSchool, Language Science, UMIACS
University of Maryland
[email protected]

Philip Resnik
Linguistics and UMIACS
University of Maryland
[email protected]

Abstract

Multilingual topic models (MTMs) learn topics on documents in multiple languages. Past models align topics across languages by implicitly assuming the documents in different languages are highly comparable, often a false assumption. We introduce a new model that does not rely on this assumption, particularly useful in important low-resource language scenarios. Our MTM learns weighted topic links and connects cross-lingual topics only when the dominant words defining them are similar, outperforming LDA and previous MTMs in classification tasks using documents' topic posteriors as features. It also learns coherent topics on documents with low comparability.

1 Introduction

Topic models explain document collections at a high level (Boyd-Graber et al., 2017). Multilingual topic models (MTMs) uncover latent topics across languages and reveal commonalities and differences across languages and cultures (Ni et al., 2009; Shi et al., 2016; Gutiérrez et al., 2016). Existing models extend latent Dirichlet allocation (Blei et al., 2003, LDA) and learn aligned topics across languages (Mimno et al., 2009).

Prior models work well because they implicitly assume—even if not part of the model—parallel or highly comparable data with well-aligned topics. However, this assumption does not always comport with reality. Even documents from the same place and time can discuss very different things across languages: in multicultural London, Hindi tweets focus on a Bollywood actor's BBC appearance, French blogs fret about Brexit, and English articles focus on Tottenham's lineup. Generally, corpora have a range of "nonparallelness" (Fung, 2000).

∗ Now at Facebook. † Now at Google AI Zürich.

[Figure 1 omitted: four English topics (EN-1–EN-4) and four Chinese topics (ZH-1–ZH-4), each shown with its top words (education, economics, technology, and sports words in English; technology, music, sports, and economics words in Chinese, with English glosses), connected by weighted links.]

Figure 1: Topic pairs with many word translation pairs have high link weights, e.g., (EN-1, ZH-3) and (EN-2, ZH-4); topic pairs with partial overlap receive lower weights, e.g., (EN-4, ZH-1); a topic is unlinked if there is no corresponding topic in the other language (ZH-2).

In less comparable settings, while some topics are shared, languages' emphasis may diverge and some topics may lack analogs.

We therefore introduce a new multilingual topic model that assumes each language has its own topic sets and jointly learns all topics, but does not force one-to-one alignment across languages. Instead, our MTM learns weighted topic links across languages and only assigns a high link weight to a topic pair whose top words have many direct translation pairs (Figure 1). Moreover, it allows unlinked topics if there is no matching topic in the other language. This makes the model robust for (more common) less-comparable data with topic misalignment. Joint inference also allows insights from high-resource languages to uncover low-resource language patterns. It is particularly useful in scenarios that involve modeling topics on low-resource languages in humanitarian assistance, peacekeeping, and/or infectious disease response, while limiting the additional cost to other steps that will also need to be taken, such as finding or creating a word translation dictionary.

We validate the MTM in two classification tasks using inferred topic posteriors as features.

Our MTM has higher F1 than other models in both intra- and cross-lingual evaluations, while discovering coherent topics and meaningful topic links.

2 Multilingual Topic Model for Connecting Cross-Lingual Topics

Yang et al. (2015) present a flexible framework for adding regularization to topic models. We extend this model to the multilingual setting by adding a potential function that links topics across languages. For simplicity of exposition, we focus on the bilingual case with languages S and T.

Unlike Yang et al. (2015), which encodes monolingual information only, our potential function encodes multilingual knowledge parameterized by two matrices, ρ_{S→T} and ρ_{T→S}, that transform topics between the two languages. Cells' values lie between 0 and 1, and a cell ρ_{S→T,k_T,k_S} close to one indicates a strong connection between topic k_T in language T and topic k_S in language S. The transformations ρ are learned from translation pairs' topic distributions.

These topic distributions come from the assignments of Gibbs sampling (Griffiths and Steyvers, 2004). Fortunately, adding the potential function is equivalent to adding an additional term to Gibbs sampling for topic models (Yang et al., 2015). During sampling, each token is assigned to a topic, so we can compute a post hoc word distribution over topics. The probability of a topic k given a word w is Pr(k | w) ≡ Ω_{w,k} ≡ N_{k,w}/N_w, where N_{k,w} is the number of times that word w is assigned to topic k and N_w is w's term frequency.
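
As a concrete illustration (not part of the paper; the array name N_kw and the guard against unseen words are our own), these word-topic distributions can be read off the sampler's count matrix:

    import numpy as np

    def word_topic_distributions(N_kw):
        """Omega[w, k] = Pr(k | w) = N_{k,w} / N_w from the Gibbs sampler's counts.

        N_kw: (K, V) array; N_kw[k, w] is how often word w is currently
        assigned to topic k across the corpus.
        """
        N_w = np.maximum(N_kw.sum(axis=0), 1)  # N_w: term frequency of each word
        return (N_kw / N_w).T                  # (V, K): row w is word w's topic distribution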

To find good topic links ρ_{S→T}, we use a dictionary. For instance, given the translation pair of "sports" and "运动 (yun dong)", the two words should have similar topic distributions, so we want ρ_{EN→ZH} Ω_{sports} to be close to Ω_{运动}, and vice versa. Moreover, the transformations should be symmetric: ρ_{S→T} Ω_{w_S} close to Ω_{w_T}, and vice versa. We encode this cross-lingual knowledge of topic transformations into the potential function Ψ, which measures the difference of translation pairs' topic distributions after transformation:

\Psi = \left( \prod_{c=1}^{C} \|\Omega_{S,c} - \rho_{T \to S}\, \Omega_{T,c}\|_2^{\eta_c} \; \|\rho_{S \to T}\, \Omega_{S,c} - \Omega_{T,c}\|_2^{\eta_c} \right)^{-1},    (1)

where η_c is the statistical importance of the c-th translation pair to the corpus (Figure 2; full details in the Supplement).
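
To make Eq. (1) concrete, here is a minimal sketch that evaluates −log Ψ for all C translation pairs at once (an illustrative implementation with our own variable names and epsilon guard; the actual sampler uses Ψ as an extra factor in the Gibbs updates):

    import numpy as np

    def neg_log_potential(Omega_S, Omega_T, rho_ST, rho_TS, eta, eps=1e-12):
        """-log Psi from Eq. (1).

        Omega_S: (C, K_S), Omega_T: (C, K_T) topic distributions of the C translation pairs.
        rho_ST:  (K_T, K_S) transformation from S-topics to T-topics.
        rho_TS:  (K_S, K_T) transformation from T-topics to S-topics.
        eta:     (C,) importance weight of each pair.
        """
        d_S = np.linalg.norm(Omega_S - Omega_T @ rho_TS.T, axis=1)  # ||Omega_{S,c} - rho_{T->S} Omega_{T,c}||_2
        d_T = np.linalg.norm(Omega_S @ rho_ST.T - Omega_T, axis=1)  # ||rho_{S->T} Omega_{S,c} - Omega_{T,c}||_2
        return float(np.sum(eta * (np.log(d_S + eps) + np.log(d_T + eps))))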

[Figure 2 (plate diagram of the graphical model) omitted.]

Figure 2: The graphical model of our multilingual topic model. The topic links ρ, as instantiated by the function Ψ, encourage word translations to have consistent topics.

While Yang et al. (2015) provide a blueprint for Gibbs sampling with potential functions that introduce no additional parameters, our model has the additional parameters ρ_{S→T} and ρ_{T→S}, which we need to optimize. Thus, we use stochastic EM (Celeux, 1985). The E-step updates tokens' topic assignments using Gibbs sampling while holding the topic link weight matrices ρ fixed. The M-step optimizes ρ while holding the topic assignments fixed. We optimize Ψ in log space using the objective function

J(\rho_{S \to T}) = \sum_{c=1}^{C} \eta_c \log \|\Omega_{T,c} - \rho_{S \to T}\, \Omega_{S,c}\|_2,    (2)

which we minimize with L-BFGS (Liu and Nocedal, 1989), using the partial derivative with respect to ρ_{S→T,k_T,k_S}

\frac{\partial J}{\partial \rho_{S \to T, k_T, k_S}} = -\sum_{c=1}^{C} \frac{\eta_c\, \Omega_{S,c,k_S} \,(\Omega_{T,c,k_T} - \rho_{S \to T, k_T}\, \Omega_{S,c})}{\|\Omega_{T,c} - \rho_{S \to T}\, \Omega_{S,c}\|_2^2},    (3)

where ρ_{S→T,k_T} denotes the k_T-th row of ρ_{S→T}.
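
A sketch of the resulting M-step (our own code; we use SciPy's L-BFGS-B so the cells of ρ can be kept in [0, 1], whereas the paper only specifies L-BFGS, and we presume the update for ρ_{T→S} is symmetric):

    import numpy as np
    from scipy.optimize import minimize

    def m_step(Omega_S, Omega_T, eta, K_T, K_S):
        """Optimize rho_{S->T} by minimizing Eq. (2); Eq. (3) supplies the gradient."""
        def objective(flat_rho):
            rho = flat_rho.reshape(K_T, K_S)
            resid = Omega_T - Omega_S @ rho.T                 # (C, K_T): Omega_{T,c} - rho Omega_{S,c}
            dist = np.linalg.norm(resid, axis=1) + 1e-12      # ||.||_2 for each translation pair
            J = np.sum(eta * np.log(dist))                    # Eq. (2)
            # Eq. (3): dJ/drho[k_T, k_S] = -sum_c eta_c Omega_S[c, k_S] resid[c, k_T] / dist[c]^2
            grad = -((eta / dist ** 2)[:, None, None]
                     * resid[:, :, None] * Omega_S[:, None, :]).sum(axis=0)
            return J, grad.ravel()

        rho0 = np.full(K_T * K_S, 1.0 / K_S)                  # start from a uniform transformation
        res = minimize(objective, rho0, jac=True, method="L-BFGS-B",
                       bounds=[(0.0, 1.0)] * (K_T * K_S))
        return res.x.reshape(K_T, K_S)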

3 Experiments

We evaluate our model extrinsically on classification tasks, followed by intrinsic topic coherence.

3.1 Classification with Topic Posteriors

We use two datasets for classification: Wikipedia documents in English (EN) and Chinese (ZH) (Yuan et al., 2018) and an English–Sinhalese (SI) disaster response dataset (Strassel and Tracey, 2016).1 Each dataset provides labeled documents and a dictionary. Yuan et al. (2018) extract the EN-ZH dictionary from MDBG, while Strassel and Tracey (2016) construct the EN-SI dictionary from online resources and manual annotation.2

1 More dataset details in the Supplement.
2 MDBG: https://www.mdbg.net/chinese/dictionary?page=cc-cedict

[Figure 3 omitted: bar charts of F1 scores for MCTA, MTAnchor, LDA, ptLDA, MTM, MTM+TOP, MTM+TF-IDF, and MTM+TF-IDF+TOP, with intra- and cross-lingual panels for English and Sinhalese (disaster response) and for English and Chinese (Wikipedia).]

Figure 3: The F1 scores on the disaster response (upper) and Wikipedia (lower) datasets. Our MTM outperforms all the baselines in intra- and cross-lingual evaluations.

Each Wikipedia document is labeled with one of the topics of film, music, animals, politics, religion, and food. A portion of the disaster response documents are labeled with one of eight types of needed rescue resources: evacuation, food supply, search/rescue, utilities, infrastructure, medical assistance, shelter, and water supply.

We follow Yuan et al. (2018) for preprocessing (such as lemmatization for English and segmentation for Chinese) and use a linear SVM for classification. For the Wikipedia dataset, we report micro-F1 scores on a six-way classification. For the disaster response dataset, our goal is binary classification of the need for evacuation versus other assistance. The classification uses topic posteriors as features: Pr(k | d) ≡ N_{d,k}/N_d, which is the proportion of the tokens assigned to topic k in document d.
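
For concreteness, a sketch of the feature extraction and evaluation (illustrative only; the function names are ours, and scikit-learn's LinearSVC stands in for whatever linear SVM implementation was used):

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    def topic_posterior_features(N_dk):
        """Pr(k | d) = N_{d,k} / N_d: proportion of document d's tokens assigned to topic k.

        N_dk: (D, K) per-document topic assignment counts from the sampler.
        """
        return N_dk / np.maximum(N_dk.sum(axis=1, keepdims=True), 1)

    def classify_with_topics(N_dk_train, y_train, N_dk_test, y_test):
        """Train a linear SVM on topic posteriors and report micro-F1 on held-out documents."""
        clf = LinearSVC().fit(topic_posterior_features(N_dk_train), y_train)
        preds = clf.predict(topic_posterior_features(N_dk_test))
        return f1_score(y_test, preds, average="micro")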

The baselines include polylingual tree LDA (Hu et al., 2014, ptLDA), which encodes the dictionary as a tree prior (Andrzejewski et al., 2009), Multilingual Topic Anchoring (Yuan et al., 2018, MTAnchor), and Multilingual Cultural-common Topic Analysis (Shi et al., 2016, MCTA). We also include LDA, which runs monolingually in each language. We use 20 topics and set hyperparameters α = 0.1 and β = 0.01 (if applicable).

Our evaluations are both intra- and cross-lingual. The intra-lingual evaluation trains and tests classifiers on the same language, while the cross-lingual evaluation trains classifiers on one language and tests on another. In cross-lingual evaluations, MTAnchor, MCTA, and ptLDA align topic spaces, so topic posterior transformation is not necessary. LDA cannot transform topic spaces, so we do not apply any transformation. For our MTM, we explore two transformation methods with ρ. The first multiplies ρ with a language's document topic distributions, i.e., ρ_{ZH→EN} θ_ZH and vice versa. The second (TOP) transfers each document topic's probability mass to the topic in the other language with the highest link weight.3
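
The two transformation options could look like the following sketch (hypothetical helper names; whether and how the transformed vector is renormalized is our assumption):

    import numpy as np

    def transform_weighted(theta_src, rho_src_to_tgt):
        """Option 1: map document-topic distributions through the link-weight matrix, rho theta."""
        out = theta_src @ rho_src_to_tgt.T                              # (D, K_tgt)
        return out / np.maximum(out.sum(axis=1, keepdims=True), 1e-12)  # renormalize per document

    def transform_top(theta_src, rho_src_to_tgt):
        """Option 2 (TOP): move each source topic's whole mass to its highest-weighted target topic."""
        D, K_src = theta_src.shape
        out = np.zeros((D, rho_src_to_tgt.shape[0]))
        best = rho_src_to_tgt.argmax(axis=0)       # best target topic for each source topic
        for k_src in range(K_src):
            out[:, best[k_src]] += theta_src[:, k_src]
        return out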

Our MTM has higher F1 both intra- and cross-lingually (Figure 3). TF-IDF weighting on translation pairs sometimes improves the intra-lingual F1, although it hurts the cross-lingual F1. Connecting the top linked topics (TOP) is better than directly using the topic link weight matrices. This indicates that ρ's values have some noise.

3.2 Looking at Learned Topics

Past MTMs align topics across languages but our MTM does not, so we compare the topics across models to see how they differ. We look at the Movies topics from the Wikipedia dataset (Table 1). For the Chinese MTM topics, we show the three English topics with the highest link weights.

The topics are about Movies, but the MCTA and MTAnchor topics do not rank "movie" or "电影 (dian ying)" at the top. The ptLDA topics, although aligned well, incorrectly align some Chinese words. "胶片 (jiao pian)" means "photographic film", while "释放 (shi fang)" means release as in "let something go", not movie distribution.

3 An example of TOP is available in the Supplement.
4 In Tables 1 and 2, "[Q]" denotes that the Chinese word is a counter for the following English word.

MCTA
  ZH: 主演 (starring), 改编 (adapt), 本 (this), 小说 (novel), 拍摄 (shoot), 角色 (role)
  EN: dog, san, movie, mexican, fighter, novel
MTAnchor
  ZH: 主演 (starring), 改编 (adapt), 饰演 (act), 本片 (the movie), 演员 (actor), 编剧 (playwright)
  EN: kong, hong, movie, official, martial, box
LDA
  ZH: 电影 (movie), 部 ([Q] movie),4 美国 (USA), 上映 (release), 英语 (English), 剧情 (plot)
  EN: film, star, direct, release, action, plot
ptLDA
  ZH: 电影 (movie), 胶片 (film), 星 (star), 动作 (action), 释放 (release), 影片 (movie)
  EN: film, star, direct, action, release, plot
MTM
  ZH: 电影 (movie), 部 ([Q] movie), 上映 (release), 动画 (animation), 故事 (story), 作品 (works)
  EN (.20): film, direct, star, release, action, plot
  EN (.12): kill, find, death, attack, escape, return
  EN (.11): shrine, japanese, temple, japan, shinto, kami
MTM + TF-IDF
  ZH: 电影 (movie), 部 ([Q] movie), 上映 (release), 美国 (USA), 英语 (English), 导演 (director)
  EN (.32): film, direct, star, action, release, plot
  EN (.24): film, kill, find, escape, attack, return
  EN (.09): character, series, star, game, trek, create

Table 1: The Movies topics given by the models. For the Chinese (ZH) topics given by the MTM, the top three English (EN) topics and their link weights are also given.

ptLDA links words based on translations without looking at the context, which causes problems with multiple-sense words. The LDA and MTM topics are generally coherent.

The MTM's unique joint modeling of weighted topic links also recovers additional topical structure: after linking the respective EN-ZH Movies topics, the next linked topics are Action Movies ("kill", "death", "attack", and "escape"). Further, the models capture a degree of connection between Movies and Computer Games (MTM + TF-IDF) and Japanese Animations (MTM).

3.3 Looking at Learned Topic Links

We give more examples of weighted MTM topic links in Table 2. High-weighted Biology (ZH-0, EN-12, and EN-19) and Music topics (EN-10, ZH-9, and ZH-17) are characterized by cross-lingual words in common. The model can also infer topic links beyond words, linking topics when the topical words have few direct translations but are related in senses.

ZH-0: 学名 (scientific name), 它们 (they), 呈 (show), 白色 (white), 长 (long), 黑色 (black)
  EN-12 (.57): specie, bird, eagle, genus, white, owl
  EN-19 (.13): breed, chicken, white, goose, bird, black
ZH-14: 主义 (-ism), 组织 (organization), 美国 (USA), 革命 (revolution), 运动 (campaign), 政府 (government)
  EN-16 (.32): sex, law, act, sexual, marriage, court
  EN-11 (.17): traffic, victim, government, trafficking, child, force
EN-10: album, release, record, music, song, single
  ZH-9 (.30): 专辑 (album), 张 ([Q] album), 发行 (release), 音乐 (music), 首 ([Q] song), 唱片 (record)
  ZH-17 (.20): 音乐 (music), 乐团 (musical group), 艺术 (art), 创作 (create), 奖 (award), 演出 (perform)

Table 2: Topics are linked because they have overlapping topical words. Our MTM can also infer topic relations beyond words, e.g., ZH-14 and EN-16.

[Figure 4 omitted: line plots of topic coherence for LDA, ptLDA, MTM, and MTM-TFIDF on INCO and PACO, with one panel per language (Arabic, Chinese, Farsi, Russian, Spanish) and the number of top words (up to 100) on the x-axis.]

Figure 4: Topic coherence on the INCO and PACO datasets with the number of top words in each topic.

For instance, ZH-14 is about the "campaigns" against "government". Only "government" overlaps with EN-16 and EN-11, but the MTM identifies these two English topics as the top linked topics for ZH-14: EN-16 is about the "campaign" in Sexual Rights, while EN-11 is about the Crime of human trafficking. This shows that our MTM can incorporate word translations and infer more cross-lingual word and topic relationships.

3.4 Evaluating Topic Coherence

We intrinsically evaluate models' topic coherence on two sets of preprocessed bilingual Wikipedia corpora (Hao and Paul, 2018) that vary in "nonparallelness".

Both pair English with Arabic, Chinese, Spanish, Farsi, and Russian. In PACO, 30% of documents have direct translations across languages, and in INCO none has direct translations. Dictionaries are extracted from Wiktionary.5 Standard preprocessing has already been applied to the datasets, including stemming, stop word removal, and high-frequency word removal.

We use an intra-lingual metric to evaluate topic coherence (Lau et al., 2014): for every topic, we compute its top N words' average pairwise PMI score on a disjoint subset of Wikipedia documents (Hao and Paul, 2018). We report average coherence with N from 10 to 100 with a step size of 10 (five-fold cross-validation). We use the same translation pair weighting options as in our classification tasks and also compare against monolingual LDA and ptLDA (Hu et al., 2014).
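
A sketch of this coherence metric as described (our own code; counting co-occurrence at the document level and smoothing with a small constant are assumptions about details the paper leaves to Hao and Paul (2018)):

    import numpy as np
    from itertools import combinations

    def topic_coherence_pmi(top_words, reference_docs, eps=1e-12):
        """Average pairwise PMI of a topic's top-N words over a held-out reference corpus.

        top_words:      list of the topic's top-N words.
        reference_docs: list of reference documents, each given as a set of its word types.
        """
        D = len(reference_docs)
        p_w = {w: sum(w in d for d in reference_docs) / D for w in top_words}  # Pr(w)
        scores = []
        for w1, w2 in combinations(top_words, 2):
            p_12 = sum((w1 in d) and (w2 in d) for d in reference_docs) / D    # Pr(w1, w2)
            scores.append(np.log((p_12 + eps) / (p_w[w1] * p_w[w2] + eps)))    # PMI(w1, w2)
        return float(np.mean(scores))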

MTM is no worse than LDA and sometimes slightly better (Figure 4). TF-IDF weighting on translation pairs sometimes further improves coherence (e.g., Arabic, Farsi, Russian, and Spanish on INCO) but occasionally hurts (e.g., Chinese). ptLDA mostly works poorly, except on Farsi with a high number of top words. ptLDA aligns topic spaces, which is hard for low-comparability data, thus sacrificing coherence for alignment; in contrast, the MTM only connects topics when they align well in senses.

3.5 Topic Coherence vs. Target Language Corpora Sizes

We next vary the size of the target language (non-English languages in PACO and INCO) corpora: how much can the MTM help topic coherence for low-resource languages? We start from 10% of randomly selected documents in the target languages and incrementally add more target language documents in steps of 10% until reaching 100%. Meanwhile, we always use 100% of the English documents. We train monolingual LDA, ptLDA, and MTMs with and without TF-IDF weighting on translation pairs in each setting and evaluate the topic coherence on the same reference corpora using the top thirty words of each topic (Figure 5).

In most cases, the topic coherence improves with larger target corpora, except Arabic and Russian on PACO. This confirms our intuition that more data yield a better topic model.

5 https://dumps.wikimedia.org/enwiktionary/

[Figure 5 omitted: line plots of topic coherence for LDA, ptLDA, MTM, and MTM-TFIDF on INCO and PACO, with one panel per language (Arabic, Chinese, Farsi, Russian, Spanish) and the target language corpora size ratio (10% to 100%) on the x-axis.]

Figure 5: The models' topic coherence on the INCO and PACO datasets when the sizes of the target language corpora grow from 10% to 100%, with a step size of 10%.

The MTM is helpful in cases where the target language corpora are small, e.g., Chinese and Russian with 10% or 20% of the corpora. TF-IDF weighting is not consistently better or worse than equal weights.

The ptLDA with tree priors based on dictionaries performs poorly in topic coherence, except for Farsi in INCO. In most cases, its topic coherence is substantially below the others' and improves little when the target corpora grow.

4 Conclusions and Future Work

We introduce a novel multilingual topic model (MTM) that learns weighted topic links across languages by minimizing the Euclidean distances of translation pairs' (transformed) topic distributions, where translation pairs can be weighted, e.g., by TF-IDF. This connects topics in different languages only when necessary and is more robust on low-comparability corpora. The MTM outperforms baselines substantially in both intra- and cross-lingual classification tasks, while achieving no worse or slightly better topic coherence than monolingual LDA on low-comparability data.

We plan to explore weighting methods to better evaluate the importance of translation pairs. We will also study how to improve topic transformation with the topic link weight matrices.

Acknowledgements

We thank Shudong Hao and Michelle Yuan for providing their datasets. We thank the anonymous reviewers for their insightful and constructive comments. This research has been supported under subcontract to Raytheon BBN Technologies, by DARPA award HR0011-15-C-0113. Boyd-Graber is also supported by NSF grant IIS-1409287. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.

References

David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In Proceedings of the International Conference of Machine Learning.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, pages 993–1022.

Jordan Boyd-Graber, Yuening Hu, and David Mimno. 2017. Applications of Topic Models, volume 11 of Foundations and Trends in Information Retrieval. NOW Publishers.

Gilles Celeux. 1985. The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, pages 73–82.

Pascale Fung. 2000. A statistical view on bilingual lexicon extraction. In Parallel Text Processing, pages 219–236.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228–5235.

E. Dario Gutiérrez, Ekaterina Shutova, Patricia Lichtenstein, Gerard de Melo, and Luca Gilardi. 2016. Detecting cross-cultural differences using a multilingual topic model. Transactions of the Association for Computational Linguistics, pages 47–60.

Shudong Hao and Michael J. Paul. 2018. Learning multilingual topics from incomparable corpora. In Proceedings of the International Conference on Computational Linguistics.

Yuening Hu, Ke Zhai, Vlad Eidelman, and Jordan Boyd-Graber. 2014. Polylingual tree-based topic models for translation domain adaptation. In Proceedings of the Association for Computational Linguistics.

Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the Association for Computational Linguistics.

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, pages 503–528.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of Empirical Methods in Natural Language Processing.

Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng Chen. 2009. Mining multilingual topics from Wikipedia. In Proceedings of the World Wide Web Conference.

Bei Shi, Wai Lam, Lidong Bing, and Yinqing Xu. 2016. Detecting common discussion topics across culture from news reader comments. In Proceedings of the Association for Computational Linguistics.

Stephanie M. Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Language Resources and Evaluation Conference.

Yi Yang, Doug Downey, and Jordan Boyd-Graber. 2015. Efficient methods for incorporating knowledge into topic models. In Proceedings of Empirical Methods in Natural Language Processing.

Michelle Yuan, Benjamin Van Durme, and Jordan Boyd-Graber. 2018. Multilingual anchoring: Interactive topic modeling and alignment across languages. In Proceedings of Advances in Neural Information Processing Systems.

