
An Introduction to Topic Modeling

Daniel W. Peterson

Department of Computer Science, University of Colorado at Boulder

daniel.w.peterson@colorado.edu

April 24, 2013


Latent Semantic Analysis

Documents x Terms matrix: large and sparse

Use SVD to decompose it into three matrices

Keep only the “important” dimensions

Assumptions:

Word order doesn’t matter

Words are orthogonal dimensions in a high-dimensional space

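To make this concrete, here is a minimal numpy sketch of LSA; the toy document-term matrix and the choice of k = 2 are illustrative assumptions, not from the talk:

```python
import numpy as np

# Toy documents x terms count matrix (rows: documents, columns: terms)
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 1, 1, 2, 0],
], dtype=float)

# SVD decomposes X into three matrices: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k "important" dimensions (largest singular values)
k = 2
doc_vectors = U[:, :k] * s[:k]  # documents in the reduced semantic space
print(doc_vectors)
```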

Probabilistic Latent Semantic Analysis

Documents are generated by a probabilistic process

Structure based on topics

Different topics make different words more likely

Assumptions:

Word order doesn’t matter

Each word is chosen as the result of exactly one topic


Probabilistic Latent Semantic Analysis

N documents

A document is L words long

Each entry has an assignment to one of K topics


Probabilistic Latent Semantic Analysis

How do we choose a topic? We sample from a distribution over topics.

How do we choose a word? We sample from a distribution over words.


Multinomial Distribution

Select one of several possible outcomes

Outcomes may be equally likely (like dice)

OR: some outcomes may be more likely than others (load the dice)

Looks like: a 1 × n vector of probabilities

[x_1, x_2, ..., x_n] such that x_1 + x_2 + ... + x_n = 1, and every x_i > 0

A sample looks like: a number

The outcome of rolling the dice

The probability we get i is given by x_i

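A quick numpy illustration; the probability vector below is made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# A multinomial over 6 outcomes: a loaded die (illustrative probabilities)
x = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
assert np.isclose(x.sum(), 1.0) and (x > 0).all()

# A sample is a single number: the outcome of rolling the die once.
# The probability of getting outcome i is x[i].
outcome = rng.choice(len(x), p=x)
print(outcome)
```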

Probabilistic Latent Semantic Analysis

θ is a distribution over topics in a document

One θ for each document

θ is a 1 × K vector

Sum of θ is 1

φ is a distribution over words in a topic

One φ for each topic

φ is a 1 × W vector

Sum of φ is 1

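As a sketch of the generative story so far, assuming θ and φ are simply given; the sizes and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

K, W, L = 3, 8, 10                     # topics, vocabulary size, document length
theta = np.array([0.5, 0.3, 0.2])      # 1 x K: topic distribution for one document
phi = rng.random((K, W))               # K x W: one word distribution per topic
phi /= phi.sum(axis=1, keepdims=True)  # each row sums to 1

doc = []
for _ in range(L):
    z = rng.choice(K, p=theta)   # sample a topic from theta
    w = rng.choice(W, p=phi[z])  # sample a word from that topic's phi
    doc.append(w)
print(doc)
```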

Probabilistic Latent Semantic Analysis

Fold θ into the graphical model

Where do θ and φ come from?


Topic Modeling

Sample θ and φ from an appropriate distribution

Dirichlet: a distribution over distributions

Incorporating a Dirichlet prior provides smoothing


Dirichlet Distribution

Takes n parameters α_1, α_2, ..., α_n

Distribution over 1 × n vectors with sum of 1

The α_i are called concentration parameters


Dirichlet Distribution with 2 Parameters

Figure: Image source: Wikipedia


Dirichlet Distribution with 3 Parameters

Figure: Image source: Yee Whye Teh


A Sample from a Dirichlet

A particular 1 × n vector with sum of 1

[x_1, x_2, ..., x_n] such that x_1 + x_2 + ... + x_n = 1

every x_i > 0

A multinomial distribution

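A minimal check with numpy (the concentration parameters are chosen arbitrarily): each draw from a Dirichlet is itself a probability vector, i.e. a multinomial distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([2.0, 1.0, 0.5])  # arbitrary concentration parameters
sample = rng.dirichlet(alpha)      # one 1 x n vector

print(sample)        # e.g. [0.62 0.30 0.08]
print(sample.sum())  # 1.0: a valid multinomial distribution
```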

Topic Modeling

Sample θ and φ from a Dirichlet distribution

This is important when we turn the model around:

The Dirichlet distribution is the conjugate prior of the multinomial:

Given a Dirichlet prior and counts of topic assignments, the posterior is also Dirichlet

β and γ are smoothing parameters

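Putting the pieces together, a hedged sketch of the full generative model. Following the slides' notation, β smooths the per-document topic distribution θ and γ smooths the per-topic word distribution φ; all sizes and hyperparameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

N, L, K, W = 5, 12, 3, 10  # documents, words per document, topics, vocabulary
beta, gamma = 0.5, 0.1     # smoothing (concentration) parameters

# One word distribution per topic, drawn from a Dirichlet prior
phi = rng.dirichlet(gamma * np.ones(W), size=K)

corpus = []
for _ in range(N):
    theta = rng.dirichlet(beta * np.ones(K))  # per-document topic distribution
    doc = [rng.choice(W, p=phi[rng.choice(K, p=theta)]) for _ in range(L)]
    corpus.append(doc)
print(corpus[0])
```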

Inference

Generative model explains how the data was created

Inference: trying to guess model parameters


Gibbs Sampling

Hard to determine most likely model parameters

Even relatively likely parameters are hard to find

Can’t sample from the overall distribution: sample a single variable instead

Take a walk through the distribution:

One step (parameter) at a time

Spend more time walking around more likely areas

We can get to likely areas from anywhere

It doesn’t matter where we start!

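The slides don't include one, but a toy example may help: a Gibbs sampler for a 2D Gaussian with correlation ρ (a made-up target, not from the talk), walking one variable at a time using its exact conditional. Note how the deliberately bad starting point stops mattering after burn-in.

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8         # correlation of a toy standard bivariate normal
x, y = 5.0, -5.0  # a deliberately bad starting point

samples = []
for step in range(5000):
    # Each conditional of a bivariate normal is a 1D normal:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))  # one step: resample x
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))  # one step: resample y
    samples.append((x, y))

kept = np.array(samples[1000:])  # discard burn-in from the bad start
print(kept.mean(axis=0))         # close to (0, 0) despite starting at (5, -5)
```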

Gibbs Sampling in a Topic Model

Start with random assignment of topics

For each <word, document> pair:

Sample θ based on counts and prior

Sample φ based on counts and prior

Choose k based on θ, φ, and w

Repeat the above many times

Smoothing (β and γ) is very important

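A hedged sketch of this loop, using the conjugacy fact from earlier: θ and φ are resampled from their Dirichlet posteriors (prior plus current counts), then each topic assignment k is redrawn. For brevity this version resamples θ and φ once per document rather than per <word, document> pair; the corpus and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy corpus: token ids per document; K topics over a W-word vocabulary
docs = [[0, 1, 1, 2, 0], [3, 4, 5, 3, 4]]
K, W = 2, 6
beta, gamma = 0.5, 0.1  # smoothing parameters, as in the slides

# Start with a random assignment of topics
z = [rng.integers(K, size=len(d)) for d in docs]

for sweep in range(200):  # repeat many times
    # Counts of the current assignment
    n_dk = np.zeros((len(docs), K))  # topic counts per document
    n_kw = np.zeros((K, W))          # word counts per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_dk[d, z[d][i]] += 1
            n_kw[z[d][i], w] += 1
    for d, doc in enumerate(docs):
        # Sample theta and phi from Dirichlet posteriors: counts + prior
        theta = rng.dirichlet(n_dk[d] + beta)
        phi = np.array([rng.dirichlet(n_kw[k] + gamma) for k in range(K)])
        for i, w in enumerate(doc):
            p = theta * phi[:, w]  # choose k based on theta, phi, and w
            z[d][i] = rng.choice(K, p=p / p.sum())

print([list(map(int, zi)) for zi in z])
```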

Bayes Rule

P(k | β, X) ∝ P(k | β) P(X | k)

Sampling from a conditional distribution can be broken down into sampling based on the parent nodes (prior, β) and the children (likelihood, X)


Blocked Gibbs Sampling in a Topic Model

Start with random assignment of topics

Repeat many times:

Sample all θ and φ from counts and prior

Choose k for a number of <word, document> pairs

More sampling, less counting


Collapsed Gibbs Sampling in a Topic Model

Integrate out θ and φ

Start with random assignment of topics

For each <word, document> pair:

Sample k directly from counts

Repeat many times

P(z_i = k | z_{-i}, w) ∝ [(n^{(w_i)}_{-i,k} + γ) / (n^{(·)}_{-i,k} + Wγ)] × [(n^{(d_i)}_{-i,k} + β) / (n^{(d_i)}_{-i,·} + Kβ)]

Here n^{(w_i)}_{-i,k} counts assignments of word w_i to topic k, n^{(d_i)}_{-i,k} counts tokens in document d_i assigned to topic k, a dot (·) sums over that index, and the subscript -i means the current token is excluded from the counts.

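A sketch of the collapsed sampler implementing the formula above; only count tables are maintained, since θ and φ have been integrated out. The toy corpus and hyperparameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy corpus: token ids per document; K topics over a W-word vocabulary
docs = [[0, 1, 1, 2, 0], [3, 4, 5, 3, 4], [0, 2, 2, 1, 5]]
K, W = 2, 6
beta, gamma = 0.5, 0.1

# Random initial assignment, plus the count tables the formula needs
z = [list(rng.integers(K, size=len(d))) for d in docs]
n_kw = np.zeros((K, W))          # n^{(w)}_k: times word w is assigned to topic k
n_k = np.zeros(K)                # n^{(.)}_k: total assignments to topic k
n_dk = np.zeros((len(docs), K))  # n^{(d)}_k: count of topic k in document d
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_kw[k, w] += 1; n_k[k] += 1; n_dk[d, k] += 1

for sweep in range(500):  # repeat many times
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove this token from the counts: the "-i" in the formula
            n_kw[k, w] -= 1; n_k[k] -= 1; n_dk[d, k] -= 1
            # Word term x document term; the document denominator
            # (len(doc) - 1 + K*beta) is constant in k, so it cancels
            p = (n_kw[:, w] + gamma) / (n_k + W * gamma) * (n_dk[d] + beta)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            n_kw[k, w] += 1; n_k[k] += 1; n_dk[d, k] += 1

print(z)
```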