An Introduction to Topic Modeling
Daniel W. Peterson
Department of Computer Science, University of Colorado at Boulder
April 24, 2013
Latent Semantic Analysis
Documents × Terms matrix: large and sparse
Use SVD to decompose it into three matrices
Keep only the “important” dimensions
Assumptions:
Word order doesn’t matter
Words are orthogonal dimensions in a high-dimensional space
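To make the decomposition concrete, here is a minimal sketch of LSA on a toy count matrix, assuming a small hand-built Documents × Terms matrix; a real corpus would be far larger and stored sparsely.

```python
# A minimal sketch of LSA, assuming a tiny made-up
# document-term count matrix.
import numpy as np

# Rows are documents, columns are terms (word counts).
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 3, 1],
], dtype=float)

# SVD decomposes X into three matrices: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k "important" dimensions.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document is now a point in a k-dimensional latent space.
doc_embeddings = U[:, :k] * s[:k]
print(doc_embeddings)
```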
Probabilistic Latent Semantic Analysis
Documents are generated by a probabilistic process
Structure based on topics
Different topics make different words more likely
Assumptions:
Word order doesn’t matter
Each word is chosen as the result of exactly one topic
Probabilistic Latent Semantic Analysis
N documents
A document is L words long
Each entry has an assignment to one of K topics
Probabilistic Latent Semantic Analysis
How do we choose a topic?
We sample from a distribution over topics.
How do we choose a word?
We sample from a distribution over words.
Multinomial Distribution
Select one of several possible outcomes
Outcomes may be equally likely (like dice)
OR: some outcomes may be more likely than others (load the dice)
Looks like: a 1 × n vector of probabilities
[x1, x2, ..., xn] with x1 + x2 + ... + xn = 1 and every xi > 0
A sample looks like: a number
The outcome of rolling the dice; the probability we get i is given by xi
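As a quick illustration, here is a minimal sketch of drawing one sample from a multinomial, assuming a made-up probability vector for a loaded six-sided die.

```python
# A minimal sketch of sampling from a multinomial; the
# probability vector is made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# A 1 x n vector of probabilities that sums to 1.
x = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])  # the die is loaded toward 6

# One sample is a single outcome i, drawn with probability x[i].
roll = rng.choice(len(x), p=x)
print(roll)
```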
Probabilistic Latent Semantic Analysis
θ is a distribution over topics in a document
One θ for each document
θ is a 1 × K vector
Sum of θ is 1
φ is a distribution over words in a topic
One φ for each topic
φ is a 1 × W vector
Sum of φ is 1
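Putting θ and φ together, here is a minimal sketch of the pLSA generative story, assuming made-up θ and φ with K = 2 topics and W = 4 words.

```python
# A minimal sketch of generating one document under pLSA,
# with made-up theta and phi.
import numpy as np

rng = np.random.default_rng(1)

theta = np.array([0.7, 0.3])              # per-document topic distribution
phi = np.array([[0.5, 0.4, 0.05, 0.05],   # topic 0: distribution over words
                [0.05, 0.05, 0.5, 0.4]])  # topic 1: distribution over words

L = 10  # document length
doc = []
for _ in range(L):
    k = rng.choice(2, p=theta)   # choose a topic from theta
    w = rng.choice(4, p=phi[k])  # choose a word from that topic's phi
    doc.append(w)
print(doc)
```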
Probabilistic Latent Semantic Analysis
Fold θ into the graphical model
Where do θ and φ come from?
Topic Modeling
Sample θ and φ from an appropriate distribution
Dirichlet: a distribution over distributions
Incorporating a Dirichlet prior provides smoothing
Dirichlet Distribution
Takes n parameters α1, α2, . . . , αn
Distribution over 1 × n vectors that sum to 1
αi are called concentration parameters
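Here is a minimal sketch of what sampling from a Dirichlet looks like, assuming n = 3 and made-up concentration parameters; each draw is itself a probability vector.

```python
# A minimal sketch of drawing from a Dirichlet distribution.
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([2.0, 2.0, 2.0])  # concentration parameters

# Each sample is itself a 1 x n probability vector (a multinomial).
sample = rng.dirichlet(alpha)
print(sample, sample.sum())  # entries are positive and sum to 1

# Small alpha favors sparse vectors; large alpha concentrates
# samples near the uniform distribution.
print(rng.dirichlet([0.1, 0.1, 0.1]))
print(rng.dirichlet([100, 100, 100]))
```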
Dirichlet Distribution with 2 Parameters
Figure: Image source: Wikipedia
Dirichlet Distribution with 3 Parameters
Figure: Image source: Yee Whye Teh
A Sample from a Dirichlet
A particular 1 × n vector that sums to 1
[x1, x2, ..., xn] such that x1 + x2 + ... + xn = 1
every xi > 0
A multinomial distribution
Topic Modeling
Sample θ and φ from a Dirichlet distribution
This is important for when we turn the model around:
The Dirichlet distribution is the conjugate prior of the multinomial:
Given a Dirichlet prior and counts of topic assignments, the posterior is also Dirichlet
β and γ are smoothing parameters
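Conjugacy makes the posterior update trivial: the observed counts are simply added to the prior parameters. Here is a minimal sketch, assuming made-up prior parameters and counts.

```python
# A minimal sketch of Dirichlet-multinomial conjugacy. Given a
# Dirichlet prior and counts of topic assignments, the posterior
# is also Dirichlet, with counts added to the prior.
import numpy as np

rng = np.random.default_rng(3)

beta = np.array([0.5, 0.5, 0.5])  # symmetric Dirichlet prior, K = 3 topics
counts = np.array([10, 2, 0])     # topic assignment counts in one document

# The posterior over theta is Dirichlet(beta + counts).
theta = rng.dirichlet(beta + counts)
print(theta)  # topic 2 still gets some mass: the prior smooths
```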
Inference
Generative model explains how the data was created
Inference: working backward from the data to guess the model parameters
Gibbs Sampling
Hard to determine the most likely model parameters
Hard even to find relatively likely parameters
Can’t sample from the overall distribution: sample a single variable at a time instead
Take a walk through the distribution:
One step (one parameter) at a time
Spend more time walking around more likely areas
We can get to likely areas from anywhere
It doesn’t matter where we start!
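To see the walk in action, here is a minimal sketch of Gibbs sampling on a toy target, assuming a bivariate normal with correlation ρ; each conditional is a one-dimensional normal we can sample directly.

```python
# A minimal sketch of Gibbs sampling on a toy bivariate normal.
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8

x, y = 10.0, -10.0  # it doesn't matter where we start
samples = []
for _ in range(5000):
    # Conditional of x given y is N(rho * y, 1 - rho^2), and vice versa.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

samples = np.array(samples[500:])  # discard burn-in
print(samples.mean(axis=0))             # near (0, 0)
print(np.corrcoef(samples.T)[0, 1])     # near rho
```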
Gibbs Sampling in a Topic Model
Start with a random assignment of topics
For each <word, document> pair:
Sample θ based on counts and prior
Sample φ based on counts and prior
Choose k based on θ, φ, and w
Repeat the above many times
Smoothing (β and γ) very important
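Here is a minimal sketch of one such per-token update, assuming made-up count tables; β and γ are the smoothing parameters from the slides.

```python
# A minimal sketch of one per-token update in the (uncollapsed)
# Gibbs sampler, with toy counts.
import numpy as np

rng = np.random.default_rng(5)
K, W = 3, 5
beta, gamma = 0.1, 0.1

doc_topic = rng.integers(0, 5, size=K).astype(float)        # counts for this document
topic_word = rng.integers(0, 5, size=(K, W)).astype(float)  # counts for each topic

w = 2  # the word at this position

# Sample theta and phi from their Dirichlet posteriors (counts + prior).
theta = rng.dirichlet(doc_topic + beta)
phi = np.array([rng.dirichlet(topic_word[k] + gamma) for k in range(K)])

# Choose k in proportion to theta[k] * phi[k, w].
p = theta * phi[:, w]
k = rng.choice(K, p=p / p.sum())
print(k)
```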
Bayes Rule
P(k | β, X) ∝ P(k | β) P(X | k)
Sampling from a conditional distribution can be broken down into sampling based on the parent nodes (the prior, β) and the children (the likelihood, X)
Blocked Gibbs Sampling in a Topic Model
Start with a random assignment of topics
Repeat many times:
Sample all θ and φ from counts and prior
Choose k for a number of <word, document> pairs
More sampling, less counting
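A minimal sketch of one blocked sweep, assuming the same kind of toy counts as before: θ and φ are sampled once and then reused to reassign every token in the block.

```python
# A minimal sketch of one blocked Gibbs sweep with toy counts.
import numpy as np

rng = np.random.default_rng(6)
K, W = 3, 5
beta, gamma = 0.1, 0.1

doc_topic = rng.integers(0, 5, size=K).astype(float)
topic_word = rng.integers(0, 5, size=(K, W)).astype(float)
words = [0, 2, 2, 4, 1]  # the words of one toy document

# One block: sample all theta and phi from counts + prior...
theta = rng.dirichlet(doc_topic + beta)
phi = np.array([rng.dirichlet(topic_word[k] + gamma) for k in range(K)])

# ...then choose k for a number of <word, document> pairs at once.
for w in words:
    p = theta * phi[:, w]
    print(w, rng.choice(K, p=p / p.sum()))
```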
Collapsed Gibbs Sampling in a Topic Model
Integrate out θ and φ
Start with random assignment of topics
For each <word, document> pair:
Sample k directly from counts
Repeat many times
$$P(z_i = k \mid z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,k} + \gamma}{n^{(\cdot)}_{-i,k} + W\gamma} \cdot \frac{n^{(d_i)}_{-i,k} + \beta}{n^{(d_i)}_{-i,\cdot} + K\beta}$$
Here $n^{(w_i)}_{-i,k}$ counts how often word $w_i$ is assigned to topic $k$, and $n^{(d_i)}_{-i,k}$ counts how many words in document $d_i$ are assigned to topic $k$, both excluding position $i$; a dot means a sum over that index.
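To make the update concrete, here is a minimal sketch of a collapsed Gibbs sampler on a toy corpus, assuming tiny made-up documents of word ids; θ and φ never appear, and k is sampled directly from the count tables via the formula above.

```python
# A minimal sketch of collapsed Gibbs sampling for a topic model.
import numpy as np

rng = np.random.default_rng(7)
K, W = 2, 4
beta, gamma = 0.1, 0.1

docs = [[0, 1, 1, 2], [2, 3, 3, 0]]  # toy documents of word ids

# Random initial topic assignments, plus the two count tables.
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
n_wk = np.zeros((W, K))          # word-topic counts
n_dk = np.zeros((len(docs), K))  # document-topic counts
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        n_wk[w, z[d][i]] += 1
        n_dk[d, z[d][i]] += 1

for _ in range(100):  # repeat many times
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_wk[w, k] -= 1  # remove position i from the counts
            n_dk[d, k] -= 1
            # Sample k directly from counts; the K*beta denominator is
            # constant in k, so it drops out after normalization.
            p = ((n_wk[w] + gamma) / (n_wk.sum(axis=0) + W * gamma)
                 * (n_dk[d] + beta))
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            n_wk[w, k] += 1
            n_dk[d, k] += 1
print(z)
```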