Date post: 26-Dec-2015
© 2009 IBM Corporation

Hetero-Labeled LDA: A partially supervised topic model with heterogeneous label information

Dongyeop Kang (1), Youngja Park (2), Suresh Chari (2)
1. IT Convergence Laboratory, KAIST Institute, Korea
2. IBM T.J. Watson Research Center, NY, USA

Topic Discovery – Supervised

Topic classification
– Learn decision boundaries of classes by learning from data with labels
– Accurate topic classification for general domains

Very hard to build a model for business applications due to the data bottleneck

Topic Discovery – Unsupervised

Probabilistic topic modeling
– Learn a topic distribution for each class from data without label information, and assign new data the topic with the most similar topic distribution
– e.g., Latent Dirichlet Allocation (LDA)

Not sufficiently accurate or interpretable
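
As a point of reference for the unsupervised setting, the standard LDA generative story can be sketched in a few lines. This is a minimal illustration of plain LDA only, not of the model proposed in these slides. The function names (`sample_dirichlet`, `generate_corpus`) and the symmetric hyperparameter values are assumptions chosen for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate word-id documents with the standard LDA generative process.

    alpha and beta are symmetric Dirichlet hyperparameters; the default
    values are illustrative assumptions, not taken from the slides.
    """
    rng = random.Random(seed)
    # phi[k]: word distribution of topic k, drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * vocab_size, rng)
           for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # theta: topic distribution of this document, from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            # Pick a topic z for this position, then a word w from topic z.
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            words.append(w)
        docs.append(words)
    return docs, phi
```

Calling `generate_corpus(3, 5, 2, 10)` yields three documents of five word ids each. No label information enters the process anywhere, which is why the learned topics carry no guaranteed meaning.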

Topic Discovery – Semi-supervised

Supervised topic modeling methods
– Supervised LDA [Blei&McAuliffe,2007], Labeled LDA [Ramage,2009]: document labels provided

Semi-supervised topic modeling methods
– Seeded LDA [Jagarlamudi,2012], zLDA [Andrzejewski,2009]: word labels/constraints provided

Limitations
1. Only one kind of domain knowledge is supported
2. The labels should cover the entire topic space, |L| = |T|
3. All documents should be labeled in training data, Dunlabeled = ∅
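
To make the document-label idea concrete, here is a minimal sketch of the restriction used in Labeled LDA, where a document's topic distribution is supported only on the topics named by its labels. The function name `restricted_theta` and the value of `alpha` are assumptions for illustration. The fallback for an unlabeled document (all topics allowed, as in plain LDA) is also an assumption of this sketch, not something the slides specify.

```python
import random


def restricted_theta(doc_labels, num_topics, alpha=0.5, rng=None):
    """Draw a topic distribution that is zero outside the document's labels.

    doc_labels: topic indices attached to the document as labels.
    Assumption of this sketch: an empty label set falls back to all topics.
    """
    rng = rng or random.Random(0)
    if any(k < 0 or k >= num_topics for k in doc_labels):
        raise ValueError("label index outside the topic space")
    allowed = sorted(set(doc_labels)) or list(range(num_topics))
    # Dirichlet draw over the allowed topics only (normalized Gamma draws).
    draws = {k: rng.gammavariate(alpha, 1.0) for k in allowed}
    total = sum(draws.values())
    # Topics that are not allowed receive exactly zero probability.
    return [draws.get(k, 0.0) / total for k in range(num_topics)]
```

With `restricted_theta([1, 3], 5)`, only topics 1 and 3 receive probability mass. This shows why such a model needs the labels to cover the whole topic space: a topic that never appears as a label can never be assigned to any word.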

Partially Semi-supervised Topic Modeling with Heterogeneous Labels

Generation of labeled training samples is much more challenging for real-world applications

In most large companies, data are generated and managed independently by many different divisions

Different types of domain knowledge are available in different divisions

Can we discover accurate and meaningful topics with small amounts of various types of domain knowledge?

Hetero-Labeled LDA: Main Contributions

Heterogeneity
– Domain knowledge (labels) comes in different forms, e.g., document labels, topic-indicative features, a partial taxonomy

Partialness
– Only a small amount of labels is given
– We address two kinds of partialness
  • Partially labeled documents: |L| << |T|
  • Partially labeled corpus: |Dlabeled| << |Dunlabeled|

Three levels of domain information
– Group Information:
– Label Information:
– Topic Distribution:

Challenges

Document labels (Ld)

Feature labels (Lw)
{trade, billion, dollar, export, bank, finance}
{grain, wheat, corn, oil, oilseed, sugar, tonn}
{game, team, player, hit, dont, run, pitch}
{god, christian, jesus, bible, church, christ}
?????
?????

Hetero-Labeled LDA: Heterogeneity

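[Figure: plate diagram of the graphical model with document labels and word labels. Recoverable node labels: w, z, θ, α, φ, β, Λd, γ, Λw, δ; plates: K, D, Wd.]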

Hetero-Labeled LDA: Partialness

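[Figure: plate diagram illustrating partialness. Recoverable node labels: w, z, θ, α, φ, β, Λw, δ, Λd, γ; plates: K, Kw, D, Wd. Annotations: Kd << K, Kw << K, and a relation between Kd ∩ Kw and ∅ whose operator is missing.]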

Hetero-Labeled LDA: Heterogeneity+Partialness

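[Figure: plate diagram combining heterogeneity and partialness. Recoverable node labels: w, z, θ, α, φ, β, Λd, Ψ, Λw, δ, γ; plates: K, Kd, Kw, D, Wd; annotation: Hybrid Constraint.]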

Hetero-Labeled LDA: Generative Process

Experiments

Datasets:

Data set       N        V         T
Reuters        21,073   32,848    20
News20         19,997   82,780    20
Delicious.5K   5,000    890,429   20

Algorithms:
– Baseline: LDA, LLDA, zLDA
– Proposed: HLLDA (L=T), HLLDA (L<T)

Evaluation metric:
– Prediction Accuracy: the higher the better
– Clustering F-measure: the higher the better
– Variation of Information: the lower the better
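
The third evaluation metric, usually called variation of information (VI), compares two clusterings of the same items and equals H(A|B) + H(B|A). The sketch below uses the standard textbook definition with natural logarithms. The function name and the choice of log base are assumptions, and the slides do not say how the authors computed the metric.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A|B) + H(B|A) in nats for two clusterings.

    labels_a[i] and labels_b[i] are the cluster ids of item i under the
    two clusterings. Lower is better; 0 means identical partitions.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("clusterings must be non-empty and equally long")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # -p(a,b) * [log p(a,b)/p(a) + log p(a,b)/p(b)]
        vi -= p_ab * (math.log(p_ab / p_a) + math.log(p_ab / p_b))
    return vi
```

VI depends only on the partitions, not on the cluster ids, so `[0, 0, 1, 1]` and `[1, 1, 0, 0]` are at distance 0. Two independent balanced splits of four items, such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]`, are at distance 2 ln 2.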

Experiment: Questions

Q1. How does a mixture of heterogeneous label information improve classification and clustering performance?

[Figure panels: Multi-class Prediction Accuracy; Clustering F-Measure]

Experiment: Questions

Q2. How does HLLDA improve performance on partially labeled documents?

– Partially labeled corpus: |Dlabeled| << |Dunlabeled|
– Partially labeled document: |L| << |T|

For a document, the provided label set covers only a subset of the topics the document belongs to. Our goal is to predict the full set of topics for each document.

Partially Labeled Documents: |L| << |T|

Partially Labeled Corpus: |Dlabeled| << |Dunlabeled|

Experiment: Questions

Q3. How interpretable are the generated topics?
– Comparison between LLDA and HLLDA
– User study for topic quality

News-20: LLDA (10) vs HLLDA (10)

<LLDA(10) with 10 Document labels>

<LLDA(10) with another 10 Document labels>

<HLLDA(10) with 10 Document labels >

Delicious.5K: LLDA (10) vs HLLDA (10)

<LLDA(10) with another 10 Document labels>

<LLDA(10) with 10 Document labels>

<HLLDA(10) with 10 Document labels >

User Study for Topic Quality

Number of topically irrelevant (red) and relevant (blue) words. The more blue words a topic has, the higher its quality; the more red words, the lower.

Conclusions

Proposed a novel algorithm for partially semi-supervised topic modeling
– Incorporates multiple heterogeneous kinds of domain knowledge that can be easily obtained in real life
– Supports two types of partialness: |L| << |T| and |Dlabeled| << |Dunlabeled|
– A unified graphical model

Experimental results confirm that learning from multiple kinds of domain information is beneficial (mutually reinforcing)

HLLDA outperforms existing semi-supervised methods on classification and clustering tasks

THANK YOU

contact: [email protected]

