Latent Dirichlet Allocation


1

Latent Dirichlet Allocation

Presenter: Hsuan-Sheng Chiu

2

Reference

• D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet allocation”, Journal of Machine Learning Research, vol. 3, no. 5, pp. 993-1022, 2003.

3

Outline

• Introduction

• Notation and terminology

• Latent Dirichlet allocation

• Relationship with other latent variable models

• Inference and parameter estimation

• Discussion

4

Introduction

• We consider the problem of modeling text corpora and other collections of discrete data
– Goal: to find short descriptions of the members of a collection

• Significant progress in IR:
– tf-idf scheme (Salton and McGill, 1983)
– Latent Semantic Indexing (LSI, Deerwester et al., 1990)
– Probabilistic LSI (pLSI, aspect model, Hofmann, 1999)

5

Introduction (cont.)

• Problems of pLSI:
– Incomplete: provides no probabilistic model at the level of documents
– The number of parameters in the model grows linearly with the size of the corpus
– It is not clear how to assign probability to a document outside of the training data

• Exchangeability: bag of words

6

Notation and terminology

• A word is the basic unit of discrete data, from a vocabulary indexed by {1,…,V}. The vth word is represented by a V-vector w such that wv = 1 and wu = 0 for u ≠ v

• A document is a sequence of N words denoted by w = (w1,w2,…,wN)

• A corpus is a collection of M documents denoted by D = {w1,w2,…,wM}
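As a small illustration of this notation, here is a numpy sketch of the one-hot word representation and of a document as a sequence of such vectors; the vocabulary size and word indices are made up for the example.

```python
import numpy as np

V = 5  # vocabulary size (illustrative)

def one_hot(v, V):
    """Represent the v-th vocabulary word as a V-vector with w_v = 1 and w_u = 0 for u != v."""
    w = np.zeros(V, dtype=int)
    w[v] = 1
    return w

# A document is a sequence of N such word vectors.
doc = np.stack([one_hot(v, V) for v in [0, 3, 3, 1]])  # N = 4 words
print(doc)
```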

7

Latent Dirichlet allocation

• Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus.

• Generative process for each document w in a corpus D (a minimal simulation is sketched below):
– 1. Choose N ~ Poisson(ξ)
– 2. Choose θ ~ Dir(α)
– 3. For each of the N words wn:
(a) Choose a topic zn ~ Multinomial(θ)
(b) Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn

where βij is an element of the k×V matrix β, with βij = p(wj = 1 | zi = 1)
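The following is a minimal numpy sketch of this generative process; k, V, ξ, α, and β are made-up illustrative values, not estimates from any corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V, xi = 3, 10, 8                       # topics, vocabulary size, Poisson mean (illustrative)
alpha = np.full(k, 0.5)                   # Dirichlet parameter (illustrative)
beta = rng.dirichlet(np.ones(V), size=k)  # k x V topic-word matrix, beta[i, j] = p(w_j | z_i)

def generate_document():
    N = rng.poisson(xi)                   # 1. choose the document length
    theta = rng.dirichlet(alpha)          # 2. choose topic proportions for this document
    words = []
    for _ in range(N):                    # 3. for each word
        z = rng.choice(k, p=theta)        #    (a) choose a topic
        w = rng.choice(V, p=beta[z])      #    (b) choose a word from that topic
        words.append(w)
    return words

print(generate_document())
```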

8

Latent Dirichlet allocation (cont.)

• Representation of a document generation:

[Diagram: N ~ Poisson(ξ); θ ~ Dir(α); topics z1, z2, …, zN drawn from θ; words w1, w2, …, wN drawn from β conditioned on the topics]

9

Latent Dirichlet allocation (cont.)

• Several simplifying assumptions:
– 1. The dimensionality k of the Dirichlet distribution is known and fixed
– 2. The word probabilities β are a fixed quantity that is to be estimated
– 3. Document length N is independent of all the other data-generating variables θ and z

• A k-dimensional Dirichlet random variable θ can take values in the (k-1)-simplex

p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}

http://www.answers.com/topic/dirichlet-distribution
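A small numpy/scipy sketch of sampling a point on the simplex and evaluating this density; the α values are illustrative.

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([1.0, 2.0, 3.0])          # illustrative k = 3 Dirichlet parameter
theta = np.random.default_rng(0).dirichlet(alpha)

print(theta, theta.sum())                  # a point on the (k-1)-simplex, sums to 1
print(dirichlet(alpha).pdf(theta))         # density p(theta | alpha) at that point
```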

10

Latent Dirichlet allocation (cont.)

• Simplex:

[Figure: graphs of the n-simplexes for n = 2 to 7, from MathWorld, http://mathworld.wolfram.com/Simplex.html]

11

Latent Dirichlet allocation (cont.)

• The joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w:

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)

• Marginal distribution of a document (illustrated numerically below):

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta

• Probability of a corpus:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d

12

Latent Dirichlet allocation (cont.)

• There are three levels to the LDA representation:
– α and β are corpus-level parameters
– θd are document-level variables

– zdn, wdn are word-level variables

[Graphical model with corpus-level and document-level plates]

Such models are often referred to as hierarchical models, conditionally independent hierarchical models, or parametric empirical Bayes models

13

Latent Dirichlet allocation (cont.)

• LDA and exchangeability
– A finite set of random variables {z1,…,zN} is said to be exchangeable if the joint distribution is invariant to permutation (π is a permutation):

– An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable

– De Finetti’s representation theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and then the random variables in question were independent and identically distributed, conditioned on that parameter

– http://en.wikipedia.org/wiki/De_Finetti's_theorem

p(z_1, \ldots, z_N) = p(z_{\pi(1)}, \ldots, z_{\pi(N)})

14

Latent Dirichlet allocation (cont.)

• In LDA, we assume that words are generated by topics (by fixed conditional distributions) and that those topics are infinitely exchangeable within a document

p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n) \right) d\theta

15

Latent Dirichlet allocation (cont.)

• A continuous mixture of unigrams
– By marginalizing over the hidden topic variable z, we can understand LDA as a two-level model

• Generative process for a document w:
– 1. Choose θ ~ Dir(α)
– 2. For each of the N words wn:
(a) Choose a word wn from p(wn | θ, β)
– Marginal distribution of a document:

p(w \mid \theta, \beta) = \sum_{z} p(w \mid z, \beta) \, p(z \mid \theta)

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} p(w_n \mid \theta, \beta) \right) d\theta
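The per-word mixture p(w | θ, β) = Σ_z p(w | z, β) p(z | θ) is just a matrix-vector product; here is a one-line numpy illustration with made-up θ and β.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 3, 10
theta = rng.dirichlet(np.full(k, 0.5))    # document's topic proportions (illustrative)
beta = rng.dirichlet(np.ones(V), size=k)  # k x V topic-word probabilities (illustrative)

p_w = theta @ beta                         # V-vector: p(w | theta, beta) for every word w
print(p_w, p_w.sum())                      # a distribution over the vocabulary, sums to 1
```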

16

Latent Dirichlet allocation (cont.)

• The distribution on the (V-1)-simplex is attained with only k+kV parameters.

17

Relationship with other latent variable models

• Unigram model:

p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

• Mixture of unigrams:
– Each document is generated by first choosing a topic z and then generating N words independently from the conditional multinomial
– k − 1 parameters

p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)
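For comparison, a toy numpy sketch of the two document likelihoods above; the distributions and the document are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 3, 10
doc = [0, 3, 3, 7]                              # word indices of one document (illustrative)

p_w = rng.dirichlet(np.ones(V))                 # unigram model: a single word distribution
p_z = rng.dirichlet(np.ones(k))                 # mixture of unigrams: topic prior p(z)
p_w_given_z = rng.dirichlet(np.ones(V), size=k) # conditional multinomials p(w | z)

unigram_lik = np.prod(p_w[doc])                                      # prod_n p(w_n)
mixture_lik = np.sum(p_z * np.prod(p_w_given_z[:, doc], axis=1))     # sum_z p(z) prod_n p(w_n|z)
print(unigram_lik, mixture_lik)
```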

18

Relationship with other latent variable models (cont.)

• Probabilistic latent semantic indexing (pLSI):
– Attempts to relax the simplifying assumption made in the mixture of unigrams model
– In a sense, it does capture the possibility that a document may contain multiple topics
– kV + kM parameters and linear growth in M

p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)

19

Relationship with other latent variable models (cont.)

• Problems of pLSI:
– There is no natural way to use it to assign probability to a previously unseen document
– The linear growth in parameters suggests that the model is prone to overfitting, and empirically, overfitting is indeed a serious problem

• LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable

• The k + kV parameters in a k-topic LDA model do not grow with the size of the training corpus (see the worked count below).
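As a rough worked count (the corpus sizes here are hypothetical): with k = 100 topics, V = 10,000 vocabulary terms, and M = 1,000,000 documents, pLSI needs kV + kM = 10^6 + 10^8 ≈ 1.01 × 10^8 parameters, while LDA needs k + kV = 100 + 10^6 ≈ 10^6, independent of M.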

20

Relationship with other latent variable models (cont.)

• A geometric interpretation: three topics and three words

21

Relationship with other latent variable models (cont.)

• The unigram model finds a single point on the word simplex and posits that all words in the corpus come from the corresponding distribution.

• The mixture of unigrams model posits that for each document, one of the k points on the word simplex is chosen randomly and all the words of the document are drawn from the corresponding distribution.

• The pLSI model posits that each word of a training document comes from a randomly chosen topic. The topics are themselves drawn from a document-specific distribution over topics.

• LDA posits that each word of both observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution with a randomly chosen parameter.

22

Inference and parameter estimation

• The key inferential problem is that of computing the posterior distribution of the hidden variables given a document:

p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\!\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta

Unfortunately, this distribution is intractable to compute in general: the normalizing constant p(w | α, β) is intractable due to the coupling between θ and β in the summation over latent topics.

23

Inference and parameter estimation (cont.)

• The basic idea of convexity-based variational inference is to make use of Jensen’s inequality to obtain an adjustable lower bound on the log likelihood.

• Essentially, one considers a family of lower bounds, indexed by a set of variational parameters.

• A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed.
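A quick numeric check of the Jensen's inequality direction being used here (log is concave, so E[log X] ≤ log E[X]); the sample values and weights below are arbitrary.

```python
import numpy as np

x = np.array([0.2, 1.0, 5.0])             # arbitrary positive values
w = np.array([0.3, 0.4, 0.3])             # arbitrary probability weights

lhs = np.sum(w * np.log(x))               # E[log X]
rhs = np.log(np.sum(w * x))               # log E[X]
print(lhs, "<=", rhs, lhs <= rhs)         # Jensen's inequality: E[log X] <= log E[X]
```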

24

Inference and parameter estimation (cont.)

• Drop some edges and the w nodes

q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

which approximates the true posterior

p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

25

Inference and parameter estimation (cont.)

• Variational distribution:
– Lower bound on the log likelihood:

\log p(\mathbf{w} \mid \alpha, \beta) = \log \int \sum_{\mathbf{z}} p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \, d\theta
= \log \int \sum_{\mathbf{z}} \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \, q(\theta, \mathbf{z} \mid \gamma, \phi)}{q(\theta, \mathbf{z} \mid \gamma, \phi)} \, d\theta
\geq \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \, d\theta - \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log q(\theta, \mathbf{z} \mid \gamma, \phi) \, d\theta
= E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)]

– KL divergence between the variational posterior and the true posterior:

D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big) = E_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)] - E_q[\log p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)]
= E_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)] - E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] + \log p(\mathbf{w} \mid \alpha, \beta)

26

Inference and parameter estimation (cont.)

• Finding a tight lower bound on the log likelihood

• Maximizing the lower bound with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior probability and the true posterior probability

\log p(\mathbf{w} \mid \alpha, \beta) = E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})] + D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)

(\gamma^*, \phi^*) = \arg\min_{(\gamma, \phi)} D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)

27

Inference and parameter estimation (cont.)

• Expand the lower bound:

L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)]
= E_q[\log p(\theta \mid \alpha)] + E_q[\log p(\mathbf{z} \mid \theta)] + E_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)] - E_q[\log q(\theta)] - E_q[\log q(\mathbf{z})]

28

Inference and parameter estimation (cont.)

• Then

L(\gamma, \phi; \alpha, \beta) =
\log \Gamma\!\left(\sum_{j=1}^{k} \alpha_j\right) - \sum_{i=1}^{k} \log \Gamma(\alpha_i) + \sum_{i=1}^{k} (\alpha_i - 1)\left(\Psi(\gamma_i) - \Psi\!\left(\sum_{j=1}^{k} \gamma_j\right)\right)
+ \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni}\left(\Psi(\gamma_i) - \Psi\!\left(\sum_{j=1}^{k} \gamma_j\right)\right)
+ \sum_{n=1}^{N} \sum_{i=1}^{k} \sum_{j=1}^{V} \phi_{ni} \, w_n^j \log \beta_{ij}
- \log \Gamma\!\left(\sum_{j=1}^{k} \gamma_j\right) + \sum_{i=1}^{k} \log \Gamma(\gamma_i) - \sum_{i=1}^{k} (\gamma_i - 1)\left(\Psi(\gamma_i) - \Psi\!\left(\sum_{j=1}^{k} \gamma_j\right)\right)
- \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \log \phi_{ni}

29

Inference and parameter estimation (cont.)

• We can get the variational parameters by adding Lagrange multipliers and setting the derivatives to zero (a minimal E-step sketch follows):

\phi_{ni} \propto \beta_{i v} \exp\!\left( \Psi(\gamma_i) - \Psi\!\left( \sum_{j=1}^{k} \gamma_j \right) \right) \quad \text{(where } w_n^v = 1\text{)}

\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}
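A minimal sketch of these fixed-point updates for a single document, using scipy's digamma for Ψ; the α, β, document, and iteration count are illustrative assumptions, not the paper's full algorithm.

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(doc, alpha, beta, iters=50):
    """Coordinate ascent for phi (N x k) and gamma (k,) for a single document."""
    k = len(alpha)
    N = len(doc)
    phi = np.full((N, k), 1.0 / k)             # initialize phi_ni = 1/k
    gamma = alpha + N / k                      # initialize gamma_i = alpha_i + N/k
    for _ in range(iters):
        # phi_ni proportional to beta_{i,w_n} * exp(Psi(gamma_i) - Psi(sum_j gamma_j))
        exp_E_log_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi = beta[:, doc].T * exp_E_log_theta  # N x k, unnormalized
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = alpha + phi.sum(axis=0)
    return phi, gamma

# illustrative toy values
rng = np.random.default_rng(0)
k, V = 3, 10
alpha = np.full(k, 0.5)
beta = rng.dirichlet(np.ones(V), size=k)
phi, gamma = variational_e_step([0, 3, 3, 7], alpha, beta)
print(gamma)
```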

30

Inference and parameter estimation (cont.)

• Parameter estimation
– Maximize the (marginal) log likelihood of the data:

\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)

– Variational inference provides us with a tractable lower bound on the log likelihood, a bound which we can maximize with respect to α and β

• Variational EM procedure (a minimal skeleton is sketched below):
– 1. (E-step) For each document, find the optimizing values of the variational parameters {γ, φ}
– 2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β
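A minimal variational EM skeleton that reuses the variational_e_step sketch above; holding α fixed and re-estimating β from expected word-topic counts are simplifying choices for illustration (the paper also updates α, e.g. via a Newton method).

```python
import numpy as np

def variational_em(docs, k, V, alpha0=0.5, em_iters=20):
    """Sketch of variational EM for LDA: per-document E-steps, then an M-step for beta.
    alpha is held fixed here for simplicity."""
    rng = np.random.default_rng(0)
    alpha = np.full(k, alpha0)
    beta = rng.dirichlet(np.ones(V), size=k)               # k x V initial topic-word matrix
    for _ in range(em_iters):
        counts = np.zeros((k, V))
        for doc in docs:
            phi, gamma = variational_e_step(doc, alpha, beta)   # E-step (defined above)
            for n, w in enumerate(doc):
                counts[:, w] += phi[n]                      # accumulate expected word-topic counts
        beta = counts / counts.sum(axis=1, keepdims=True)   # M-step for beta
    return alpha, beta

# illustrative toy corpus of word-index lists
docs = [[0, 1, 1, 2], [7, 8, 8, 9], [0, 2, 8, 9]]
alpha, beta = variational_em(docs, k=2, V=10)
print(beta.round(2))
```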

31

Inference and parameter estimation (cont.)

• Smoothed LDA model:

32

Discussion

• LDA is a flexible generative probabilistic model for collections of discrete data.

• Exact inference is intractable for LDA, but any of a large suite of approximate inference algorithms can be used for inference and parameter estimation within the LDA framework.

• LDA is a simple model and is readily extended to continuous data or other non-multinomial data.