An Introduction to Topic Modeling
Daniel W. Peterson
Department of Computer Science, University of Colorado at Boulder
April 24, 2013
Latent Semantic Analysis
Documents × Terms matrix: large and sparse
Use SVD to decompose it into three matrices
Keep only the “important” dimensions
Assumptions:
Word order doesn’t matter
Words are orthogonal dimensions in a high-dimensional space
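To make the decomposition concrete, here is a minimal sketch of LSA on a toy count matrix, assuming a small hand-built Documents × Terms matrix; a real corpus would be far larger and stored sparsely.

```python
# A minimal sketch of LSA, assuming a tiny made-up
# document-term count matrix.
import numpy as np

# Rows are documents, columns are terms (word counts).
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 3, 1],
], dtype=float)

# SVD decomposes X into three matrices: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k "important" dimensions.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document is now a point in a k-dimensional latent space.
doc_embeddings = U[:, :k] * s[:k]
print(doc_embeddings)
```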
Probabilistic Latent Semantic Analysis
Documents are generated by a probabilistic process
Structure based on topics
Different topics make different words more likely
Assumptions:
Word order doesn’t matter
Each word is chosen as the result of exactly one topic
Probabilistic Latent Semantic Analysis
N documents
A document is L words long
Each entry has an assignment to one of K topics
Probabilistic Latent Semantic Analysis
How do we choose a topic?
We sample from a distribution over topics.
How do we choose a word?
We sample from a distribution over words.
Multinomial Distribution
Select one of several possible outcomes
Outcomes may be equally likely (like dice)
OR: some outcomes may be more likely than others (load the dice)
Looks like: a 1 × n vector of probabilities
[x1, x2, ..., xn] with x1 + x2 + ... + xn = 1 and every xi > 0
A sample looks like: a number
The outcome of rolling the dice; the probability we get i is given by xi
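As a quick illustration, here is a minimal sketch of drawing one sample from a multinomial, assuming a made-up probability vector for a loaded six-sided die.

```python
# A minimal sketch of sampling from a multinomial; the
# probability vector is made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# A 1 x n vector of probabilities that sums to 1.
x = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])  # the die is loaded toward 6

# One sample is a single outcome i, drawn with probability x[i].
roll = rng.choice(len(x), p=x)
print(roll)
```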
Probabilistic Latent Semantic Analysis
θ is a distribution over topics in a document
One θ for each document
θ is a 1 × K vector
Sum of θ is 1
φ is a distribution over words in a topic
One φ for each topic
φ is a 1 × W vector
Sum of φ is 1
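Putting θ and φ together, here is a minimal sketch of the pLSA generative story, assuming made-up θ and φ with K = 2 topics and W = 4 words.

```python
# A minimal sketch of generating one document under pLSA,
# with made-up theta and phi.
import numpy as np

rng = np.random.default_rng(1)

theta = np.array([0.7, 0.3])              # per-document topic distribution
phi = np.array([[0.5, 0.4, 0.05, 0.05],   # topic 0: distribution over words
                [0.05, 0.05, 0.5, 0.4]])  # topic 1: distribution over words

L = 10  # document length
doc = []
for _ in range(L):
    k = rng.choice(2, p=theta)   # choose a topic from theta
    w = rng.choice(4, p=phi[k])  # choose a word from that topic's phi
    doc.append(w)
print(doc)
```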
Probabilistic Latent Semantic Analysis
Fold θ into the graphical model
Where do θ and φ come from?
Topic Modeling
Sample θ and φ from an appropriate distribution
Dirichlet: a distribution over distributions
Incorporating a Dirichlet prior provides smoothing
Dirichlet Distribution
Takes n parameters α1, α2, . . . , αn
Distribution over 1 × n vectors that sum to 1
αi are called concentration parameters
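Here is a minimal sketch of what sampling from a Dirichlet looks like, assuming n = 3 and made-up concentration parameters; each draw is itself a probability vector.

```python
# A minimal sketch of drawing from a Dirichlet distribution.
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([2.0, 2.0, 2.0])  # concentration parameters

# Each sample is itself a 1 x n probability vector (a multinomial).
sample = rng.dirichlet(alpha)
print(sample, sample.sum())  # entries are positive and sum to 1

# Small alpha favors sparse vectors; large alpha concentrates
# samples near the uniform distribution.
print(rng.dirichlet([0.1, 0.1, 0.1]))
print(rng.dirichlet([100, 100, 100]))
```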
Dirichlet Distribution with 2 Parameters
Figure: Image source: Wikipedia
Dirichlet Distribution with 3 Parameters
Figure: Image source: Yee Whye Teh
A Sample from a Dirichlet
A particular 1 × n vector that sums to 1
[x1, x2, ..., xn] such that x1 + x2 + ... + xn = 1
every xi > 0
A multinomial distribution
Topic Modeling
Sample θ and φ from a Dirichlet distribution
This is important for when we turn the model around:
The Dirichlet distribution is the conjugate prior of the multinomial:
Given a Dirichlet prior and counts of topic assignments, the posterior is also Dirichlet
β and γ are smoothing parameters
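Conjugacy makes the posterior update trivial: the observed counts are simply added to the prior parameters. Here is a minimal sketch, assuming made-up prior parameters and counts.

```python
# A minimal sketch of Dirichlet-multinomial conjugacy. Given a
# Dirichlet prior and counts of topic assignments, the posterior
# is also Dirichlet, with counts added to the prior.
import numpy as np

rng = np.random.default_rng(3)

beta = np.array([0.5, 0.5, 0.5])  # symmetric Dirichlet prior, K = 3 topics
counts = np.array([10, 2, 0])     # topic assignment counts in one document

# The posterior over theta is Dirichlet(beta + counts).
theta = rng.dirichlet(beta + counts)
print(theta)  # topic 2 still gets some mass: the prior smooths
```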
Inference
Generative model explains how the data was created
Inference: working backward from the data to guess the model parameters
Gibbs Sampling
Hard to determine the most likely model parameters
Hard even to find relatively likely parameters
Can’t sample from the overall distribution: sample a single variable at a time instead
Take a walk through the distribution:
One step (one parameter) at a time
Spend more time walking around more likely areas
We can get to likely areas from anywhere
It doesn’t matter where we start!
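To see the walk in action, here is a minimal sketch of Gibbs sampling on a toy target, assuming a bivariate normal with correlation ρ; each conditional is a one-dimensional normal we can sample directly.

```python
# A minimal sketch of Gibbs sampling on a toy bivariate normal.
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8

x, y = 10.0, -10.0  # it doesn't matter where we start
samples = []
for _ in range(5000):
    # Conditional of x given y is N(rho * y, 1 - rho^2), and vice versa.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

samples = np.array(samples[500:])  # discard burn-in
print(samples.mean(axis=0))             # near (0, 0)
print(np.corrcoef(samples.T)[0, 1])     # near rho
```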
Gibbs Sampling in a Topic Model
Start with a random assignment of topics
For each <word, document> pair:
Sample θ based on counts and prior
Sample φ based on counts and prior
Choose k based on θ, φ, and w
Repeat the above many times
Smoothing (β and γ) very important
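Here is a minimal sketch of one such per-token update, assuming made-up count tables; β and γ are the smoothing parameters from the slides.

```python
# A minimal sketch of one per-token update in the (uncollapsed)
# Gibbs sampler, with toy counts.
import numpy as np

rng = np.random.default_rng(5)
K, W = 3, 5
beta, gamma = 0.1, 0.1

doc_topic = rng.integers(0, 5, size=K).astype(float)        # counts for this document
topic_word = rng.integers(0, 5, size=(K, W)).astype(float)  # counts for each topic

w = 2  # the word at this position

# Sample theta and phi from their Dirichlet posteriors (counts + prior).
theta = rng.dirichlet(doc_topic + beta)
phi = np.array([rng.dirichlet(topic_word[k] + gamma) for k in range(K)])

# Choose k in proportion to theta[k] * phi[k, w].
p = theta * phi[:, w]
k = rng.choice(K, p=p / p.sum())
print(k)
```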
Bayes Rule
P(k | β, X) ∝ P(k | β) P(X | k)
Sampling from a conditional distribution can be broken down into sampling based on the parent nodes (the prior, β) and the children (the likelihood, X)
Blocked Gibbs Sampling in a Topic Model
Start with a random assignment of topics
Repeat many times:
Sample all θ and φ from counts and prior
Choose k for a number of <word, document> pairs
More sampling, less counting
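A minimal sketch of one blocked sweep, assuming the same kind of toy counts as before: θ and φ are sampled once and then reused to reassign every token in the block.

```python
# A minimal sketch of one blocked Gibbs sweep with toy counts.
import numpy as np

rng = np.random.default_rng(6)
K, W = 3, 5
beta, gamma = 0.1, 0.1

doc_topic = rng.integers(0, 5, size=K).astype(float)
topic_word = rng.integers(0, 5, size=(K, W)).astype(float)
words = [0, 2, 2, 4, 1]  # the words of one toy document

# One block: sample all theta and phi from counts + prior...
theta = rng.dirichlet(doc_topic + beta)
phi = np.array([rng.dirichlet(topic_word[k] + gamma) for k in range(K)])

# ...then choose k for a number of <word, document> pairs at once.
for w in words:
    p = theta * phi[:, w]
    print(w, rng.choice(K, p=p / p.sum()))
```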
Collapsed Gibbs Sampling in a Topic Model
Integrate out θ and φ
Start with random assignment of topics
For each <word, document> pair:
Sample k directly from counts
Repeat many times
$$P(z_i = k \mid z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,k} + \gamma}{n^{(\cdot)}_{-i,k} + W\gamma} \cdot \frac{n^{(d_i)}_{-i,k} + \beta}{n^{(d_i)}_{-i,\cdot} + K\beta}$$
Here $n^{(w_i)}_{-i,k}$ counts how often word $w_i$ is assigned to topic $k$, and $n^{(d_i)}_{-i,k}$ counts how many words in document $d_i$ are assigned to topic $k$, both excluding position $i$; a dot means a sum over that index.
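To make the update concrete, here is a minimal sketch of a collapsed Gibbs sampler on a toy corpus, assuming tiny made-up documents of word ids; θ and φ never appear, and k is sampled directly from the count tables via the formula above.

```python
# A minimal sketch of collapsed Gibbs sampling for a topic model.
import numpy as np

rng = np.random.default_rng(7)
K, W = 2, 4
beta, gamma = 0.1, 0.1

docs = [[0, 1, 1, 2], [2, 3, 3, 0]]  # toy documents of word ids

# Random initial topic assignments, plus the two count tables.
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
n_wk = np.zeros((W, K))          # word-topic counts
n_dk = np.zeros((len(docs), K))  # document-topic counts
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        n_wk[w, z[d][i]] += 1
        n_dk[d, z[d][i]] += 1

for _ in range(100):  # repeat many times
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_wk[w, k] -= 1  # remove position i from the counts
            n_dk[d, k] -= 1
            # Sample k directly from counts; the K*beta denominator is
            # constant in k, so it drops out after normalization.
            p = ((n_wk[w] + gamma) / (n_wk.sum(axis=0) + W * gamma)
                 * (n_dk[d] + beta))
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            n_wk[w, k] += 1
            n_dk[d, k] += 1
print(z)
```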