
Generative Topic Models for Community Analysis

Ramesh Nallapati

10-802: Guest Lecture, 9/18/2007

Objectives

• Provide an overview of topic models and their learning techniques
  – Mixture models, PLSA, LDA
  – EM, variational EM, Gibbs sampling

• Convince you that topic models are an attractive framework for community analysis
  – 5 definitive papers

Outline

• Part I: Introduction to Topic Models
  – Naive Bayes model
  – Mixture models
    • Expectation Maximization
  – PLSA
  – LDA
    • Variational EM
    • Gibbs sampling

• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author-Topic model
  – Author-Topic-Recipient model
  – Modeling influence of citations
  – Mixed-membership Stochastic Block model

Introduction to Topic Models

• Multinomial Naïve Bayes

[Plate diagram: class variable C generating words W1 … WN, with the plate repeated over M documents]

• For each document d = 1, …, M
  • Generate C_d ~ Mult(· | π)
  • For each position n = 1, …, N_d
    • Generate w_n ~ Mult(· | β_{C_d})
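
A minimal generative sketch of this process in Python (the class prior pi and per-class word distributions beta are illustrative parameters, not the slide's notation):

    import numpy as np

    def generate_nb_corpus(M, doc_lengths, pi, beta, seed=0):
        """Sample documents from the multinomial naive Bayes generative process.

        pi:   length-C vector of class probabilities
        beta: C x V matrix; row c is the word distribution of class c
        """
        rng = np.random.default_rng(seed)
        C, V = beta.shape
        docs, classes = [], []
        for d in range(M):
            c = rng.choice(C, p=pi)                                # C_d ~ Mult(pi)
            words = rng.choice(V, size=doc_lengths[d], p=beta[c])  # w_n ~ Mult(beta_c)
            classes.append(c)
            docs.append(words.tolist())
        return docs, classes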

Introduction to Topic Models

• Naïve Bayes model: compact representation

[Plate diagrams: the expanded model with class node C and word nodes W1 … WN per document, and the equivalent plate notation with a single W node inside a plate of size N, nested in a document plate of size M]

Introduction to Topic Models

• Multinomial naïve Bayes: learning
  – Maximize the log-likelihood of observed variables w.r.t. the parameters
  – Convex function: global optimum
  – Solution:
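
As a sketch, the standard closed-form maximum-likelihood solution (with n_{d,w} the count of word w in document d, N_d the length of document d, pi the class prior and beta the class-conditional word distributions):

    \hat{\pi}_c = \frac{1}{M} \sum_{d=1}^{M} \mathbb{1}[C_d = c]
    \qquad
    \hat{\beta}_{c,w} = \frac{\sum_{d : C_d = c} n_{d,w}}{\sum_{d : C_d = c} N_d}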

Introduction to Topic Models

• Mixture model: unsupervised naïve Bayes model

[Plate diagram: latent class C generating word W, inside nested plates of size N and M]

• Joint probability of words and classes
• But classes are not visible: the observed-data likelihood sums over the latent class z

Introduction to Topic Models

• Mixture model: learning

  – Not a convex function
    • No global optimum solution
  – Solution: Expectation Maximization
    • Iterative algorithm
    • Finds a local optimum
    • Guaranteed to maximize a lower bound on the log-likelihood of the observed data

Introduction to Topic Models

• Quick summary of EM:
  – log is a concave function
  – The lower bound is convex!
  – Optimize this lower bound w.r.t. each variable instead

[Figure: Jensen's inequality for the log function, log(0.5·x1 + 0.5·x2) ≥ 0.5·log(x1) + 0.5·log(x2); the lower bound includes an entropy term H(·)]

Introduction to Topic Models

• Mixture model: EM solution

E-step:

M-step:
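
A sketch of the standard EM updates for this mixture of multinomials, writing gamma_{dz} for the E-step posterior over the latent class of document d:

    \text{E-step:}\quad
    \gamma_{dz} = P(z \mid \mathbf{w}_d) =
      \frac{\pi_z \prod_{n=1}^{N_d} \beta_{z, w_{dn}}}{\sum_{z'} \pi_{z'} \prod_{n=1}^{N_d} \beta_{z', w_{dn}}}

    \text{M-step:}\quad
    \pi_z = \frac{1}{M} \sum_{d=1}^{M} \gamma_{dz}
    \qquad
    \beta_{z,w} = \frac{\sum_{d} \gamma_{dz}\, n_{d,w}}{\sum_{w'} \sum_{d} \gamma_{dz}\, n_{d,w'}}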



Introduction to Topic Models

• Probabilistic Latent Semantic Analysis Model

[Plate diagram: observed document index d, latent topic z, word w, inside plates of size N and M; θ_d is the per-document topic distribution]

• Select document d ~ Mult(· | π)
• For each position n = 1, …, N_d
  • Generate z_n ~ Mult(· | θ_d)
  • Generate w_n ~ Mult(· | β_{z_n})

Introduction to Topic Models

• Probabilistic Latent Semantic Analysis model
  – Learning using EM
  – Not a complete generative model
    • Has a distribution over the training set of documents: no new document can be generated!
  – Nevertheless, more realistic than the mixture model
    • Documents can discuss multiple topics!

Introduction to Topic Models

• PLSA topics (TDT-1 corpus)



Introduction to Topic Models

• Latent Dirichlet Allocation

[Plate diagram: per-document topic distribution θ, latent topic z, word w, inside plates of size N and M]

• For each document d = 1, …, M
  • Generate θ_d ~ Dir(· | α)
  • For each position n = 1, …, N_d
    • Generate z_n ~ Mult(· | θ_d)
    • Generate w_n ~ Mult(· | β_{z_n})
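
A minimal sketch of this generative process in Python, assuming a fixed topic-word matrix beta and Dirichlet hyperparameter alpha (names are illustrative):

    import numpy as np

    def generate_lda_corpus(M, doc_lengths, alpha, beta, seed=0):
        """Sample a toy corpus from the LDA generative process.

        alpha: length-K Dirichlet hyperparameter
        beta:  K x V matrix; row k is the word distribution of topic k
        """
        rng = np.random.default_rng(seed)
        K, V = beta.shape
        corpus = []
        for d in range(M):
            theta = rng.dirichlet(alpha)      # per-document topic mixture
            doc = []
            for _ in range(doc_lengths[d]):
                z = rng.choice(K, p=theta)    # topic for this position
                w = rng.choice(V, p=beta[z])  # word drawn from that topic
                doc.append(w)
            corpus.append(doc)
        return corpus

    # Example: 5 documents of length 20, 3 topics over a 10-word vocabulary
    beta = np.random.default_rng(1).dirichlet(np.ones(10), size=3)
    docs = generate_lda_corpus(5, [20] * 5, alpha=np.ones(3) * 0.1, beta=beta)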

Introduction to Topic Models

• Latent Dirichlet Allocation
  – Overcomes the issues with PLSA
    • Can generate any random document
  – Parameter learning:
    • Variational EM
      – Numerical approximation using lower bounds
      – Results in biased solutions
      – Convergence has numerical guarantees
    • Gibbs sampling
      – Stochastic simulation
      – Unbiased solutions
      – Stochastic convergence

Introduction to Topic Models

• Variational EM for LDA
  – Approximate the posterior by a simpler distribution
    • A convex function in each parameter!
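
A sketch of the standard mean-field family used for LDA (Blei, Ng, Jordan 2003), with per-document variational parameters gamma (Dirichlet) and phi_n (multinomial):

    q(\theta, \mathbf{z} \mid \gamma, \phi) =
      \mathrm{Dir}(\theta \mid \gamma) \prod_{n=1}^{N} \mathrm{Mult}(z_n \mid \phi_n)

The variational parameters are fit by maximizing the resulting lower bound on log p(w | alpha, beta).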

Introduction to Topic Models

• Gibbs sampling
  – Applicable when the joint distribution is hard to evaluate but the conditional distributions are known (sketched below for LDA)
  – The sequence of samples comprises a Markov chain
  – The stationary distribution of the chain is the joint distribution
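
A minimal sketch of collapsed Gibbs sampling for LDA (an illustration of the idea, not the lecture's implementation), where each token's topic is resampled from its conditional given all other assignments:

    import numpy as np

    def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
        """Collapsed Gibbs sampling for LDA on docs given as lists of word ids."""
        rng = np.random.default_rng(seed)
        ndk = np.zeros((len(docs), K))   # document-topic counts
        nkw = np.zeros((K, V))           # topic-word counts
        nk = np.zeros(K)                 # topic totals
        z = []                           # topic assignment for every token
        for d, doc in enumerate(docs):   # random initialization
            zd = rng.integers(K, size=len(doc))
            z.append(zd)
            for w, t in zip(doc, zd):
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]          # remove the token's current assignment
                    ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                    # full conditional: p(z=k) ∝ (n_dk + alpha)(n_kw + eta)/(n_k + V*eta)
                    p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                    t = rng.choice(K, p=p / p.sum())
                    z[d][i] = t
                    ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
        return ndk, nkw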

Introduction to Topic Models

• LDA topics


Introduction to Topic Models

• LDA’s view of a document


Introduction to Topic Models

• Perplexity comparison of various models

[Figure: perplexity curves for the unigram model, mixture model, PLSA, and LDA; lower is better]
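
For reference, the held-out perplexity used in such comparisons:

    \mathrm{perplexity}(D_{\mathrm{test}}) =
      \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)

so a model that assigns higher likelihood to held-out text gets lower perplexity.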

Introduction to Topic Models

• Summary
  – Generative models for exchangeable data
  – Unsupervised models
  – Automatically discover topics
  – Well-developed approximate techniques available for inference and learning

Outline (recap)

• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author-Topic model
  – Author-Topic-Recipient model
  – Modeling influence of citations
  – Mixed-membership Stochastic Block model

Hyperlink modeling using PLSA


Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

[Plate diagram: document index d, with latent topics z generating words w in a plate of size N and latent topics z generating citations c in a plate of size L, all inside the document plate M]

• Select document d ~ Mult(· | π)
• For each position n = 1, …, N_d
  • Generate z_n ~ Mult(· | θ_d)
  • Generate w_n ~ Mult(· | β_{z_n})
• For each citation j = 1, …, L_d
  • Generate z_j ~ Mult(· | θ_d)
  • Generate c_j ~ Mult(· | γ_{z_j})

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

[Plate diagram: as on the previous slide, with words w (plate N) and citations c (plate L) inside the document plate M]

PLSA likelihood:

New likelihood:

Learning using EM
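
A sketch of the two likelihoods labeled above, writing p(z | d) for the document's topic weights, p(w | z) for word emission, and p(c | z) for citation emission (symbols are illustrative, not necessarily the slide's):

    \text{PLSA:}\quad
    \mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \sum_{z} p(w \mid z)\, p(z \mid d)

    \text{With citations:}\quad
    \mathcal{L} = \sum_{d} \Big[ \sum_{w} n(d, w) \log \sum_{z} p(w \mid z)\, p(z \mid d)
      + \sum_{c} n(d, c) \log \sum_{z} p(c \mid z)\, p(z \mid d) \Big]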

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

Heuristic: weight the two likelihood terms by α and (1 − α), where 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks.
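
A sketch of the resulting blended objective (same symbols as above; the exact per-term normalization in the paper may differ):

    \mathcal{L}_\alpha = \alpha \sum_{d, w} n(d, w) \log \sum_{z} p(w \mid z)\, p(z \mid d)
      + (1 - \alpha) \sum_{d, c} n(d, c) \log \sum_{z} p(c \mid z)\, p(z \mid d)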

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

• Experiments: text classification
• Datasets:
  – WebKB
    • 6000 CS department web pages with hyperlinks
    • 6 classes: faculty, course, student, staff, etc.
  – Cora
    • 2000 machine learning abstracts with citations
    • 7 classes: sub-areas of machine learning
• Methodology:
  – Learn the model on the complete data and obtain θ_d for each document
  – Test documents are classified with the label of the nearest neighbor in the training set
  – Distance measured as cosine similarity in the θ space
  – Measure the performance as a function of α (see the sketch after this list)
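
A minimal sketch of the nearest-neighbor step, assuming the per-document topic mixtures theta have already been estimated (function and variable names are illustrative):

    import numpy as np

    def classify_1nn_cosine(theta_train, labels_train, theta_test):
        """Label each test document with the label of its nearest training
        document, where similarity is cosine similarity in topic space."""
        # L2-normalize rows so the dot product equals cosine similarity
        tr = theta_train / np.linalg.norm(theta_train, axis=1, keepdims=True)
        te = theta_test / np.linalg.norm(theta_test, axis=1, keepdims=True)
        sims = te @ tr.T               # (num_test, num_train) similarity matrix
        nearest = sims.argmax(axis=1)  # index of the most similar training doc
        return labels_train[nearest]

    # Usage: labels_train is a numpy array aligned with the rows of theta_train
    # predictions = classify_1nn_cosine(theta_train, labels_train, theta_test)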

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

• Classification performance

[Figure: classification performance as a function of α between the hyperlink-only and content-only extremes, one plot per dataset]

Hyperlink modeling using LDA


Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS 2004]

[Plate diagram: per-document topic distribution θ; topics z and words w in a plate of size N; topics z and citations c in a plate of size L; all inside the document plate M]

• For each document d = 1, …, M
  • Generate θ_d ~ Dir(· | α)
  • For each position n = 1, …, N_d
    • Generate z_n ~ Mult(· | θ_d)
    • Generate w_n ~ Mult(· | β_{z_n})
  • For each citation j = 1, …, L_d
    • Generate z_j ~ Mult(· | θ_d)
    • Generate c_j ~ Mult(· | γ_{z_j})

Learning using variational EM


Author-Topic Model for Scientific Literature


Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

[Plate diagram: observed author set a_d, chosen author x, topic z, word w inside plates N and M; per-author topic distributions θ in a plate of size A; topic-word distributions β in a plate of size K]

• For each author a = 1, …, A
  • Generate θ_a ~ Dir(· | α)
• For each topic k = 1, …, K
  • Generate β_k ~ Dir(· | η)
• For each document d = 1, …, M
  • For each position n = 1, …, N_d
    • Generate author x ~ Unif(· | a_d)
    • Generate z_n ~ Mult(· | θ_x)
    • Generate w_n ~ Mult(· | β_{z_n})
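
A minimal Python sketch of this generative story, assuming per-author topic distributions theta (A x K) and topic-word distributions beta (K x V); names are illustrative:

    import numpy as np

    def generate_author_topic_doc(authors_d, theta, beta, n_words, seed=0):
        """Sample one document under the Author-Topic generative process."""
        rng = np.random.default_rng(seed)
        K, V = beta.shape
        words = []
        for _ in range(n_words):
            x = rng.choice(authors_d)         # pick one of the document's authors
            z = rng.choice(K, p=theta[x])     # topic from that author's distribution
            w = rng.choice(V, p=beta[z])      # word from that topic
            words.append(w)
        return words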

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

Learning: Gibbs sampling

[Plate diagram as on the previous slide]

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

• Perplexity results

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

• Topic-Author visualization

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

• Application 1: Author similarity

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

• Application 2: Author entropy

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]

Gibbs sampling

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]

• Datasets
  – Enron email data
    • 23,488 messages between 147 users
  – McCallum's personal email
    • 23,488(?) messages with 128 authors

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]

• Topic Visualization: Enron set

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]

• Topic Visualization: McCallum's data


Modeling Citation Influences


Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Copycat model

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Citation influence model

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Citation influence graph for the LDA paper

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Words in the LDA paper assigned to citations

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Performance evaluation
  – Data:
    • 22 seed papers and 132 cited papers
    • Users labeled citations on a scale of 1-4
  – Models considered:
    • Citation influence model
    • Copycat model
    • LDA-JS-divergence (symmetric divergence in topic space)
    • LDA-post
    • PageRank
    • TF-IDF
  – Evaluation measure:
    • Area under the ROC curve

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Results

Mixed-membership Stochastic Block models [Work in Progress]

• A complete generative model for text and citations
• Can model the topicality of citations
  – Topic-specific PageRank
• Can also predict citations between unseen documents

Summary

• Topic modeling is an interesting new framework for community analysis
  – Sound theoretical basis
  – Completely unsupervised
  – Simultaneous modeling of multiple fields
  – Discovers "soft" communities and clusters in terms of "topic" membership
  – Can also be used for predictive purposes