Latent Dirichlet Allocation

1

Latent Dirichlet Allocation

Presenter: Hsuan-Sheng Chiu

2

Reference

• D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet allocation”, Journal of Machine Learning Research, vol. 3, no. 5, pp. 993-1022, 2003.

3

Outline

• Introduction

• Notation and terminology

• Latent Dirichlet allocation

• Relationship with other latent variable models

• Inference and parameter estimation

• Discussion

4

Introduction

• We consider the problem of modeling text corpora and other collections of discrete data
– The goal is to find short descriptions of the members of a collection

• Significant progress in IR:
– tf-idf scheme (Salton and McGill, 1983)
– Latent Semantic Indexing (LSI; Deerwester et al., 1990)
– Probabilistic LSI (pLSI, the aspect model; Hofmann, 1999)

5

Introduction (cont.)

• Problems of pLSI:
– Incomplete: it provides no probabilistic model at the level of documents
– The number of parameters in the model grows linearly with the size of the corpus
– It is not clear how to assign probability to a document outside of the training data

• Exchangeability: the bag-of-words assumption (the order of words in a document is ignored)

6

Notation and terminology

• A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1,…,V}. The v-th word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v

• A document is a sequence of N words denoted by w = (w1,w2,…,wN)

• A corpus is a collection of M documents denoted by D = {w1,w2,…,wM}
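
The notation above can be made concrete with a small sketch. The vocabulary and the helper name `one_hot` are hypothetical, chosen only for illustration; indices are 1-based to match the slide.

```python
def one_hot(v, V):
    """Return the V-vector for the v-th vocabulary word (1-indexed, as on the slide)."""
    if not 1 <= v <= V:
        raise ValueError("word index must lie in {1, ..., V}")
    # w^u = 1 when u == v and 0 otherwise
    return [1 if u == v else 0 for u in range(1, V + 1)]


# Hypothetical toy vocabulary with V = 4 words.
vocabulary = ["topic", "model", "word", "corpus"]
V = len(vocabulary)

# A document is a sequence of N words; here N = 3 (word indices 3, 1, 3).
document = [one_hot(v, V) for v in (3, 1, 3)]

# A corpus is a collection of M documents; here M = 1.
corpus = [document]
```

In practice a document is usually stored as a list of word indices rather than one-hot vectors; the one-hot form is mainly convenient for writing the equations.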

7

Latent Dirichlet allocation

• Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus.

• Generative process for each document w in a corpus D:
– 1. Choose N ~ Poisson(ξ)
– 2. Choose θ ~ Dir(α)
– 3. For each of the N words wn:
(a) Choose a topic zn ~ Multinomial(θ)
(b) Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn

• β is a k×V matrix whose element βij = p(w^j = 1 | z^i = 1)
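
The generative process can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the authors' code: the function names, the toy values of α, β and ξ, and the choice of Knuth's method for Poisson sampling are all assumptions made here.

```python
import math
import random


def sample_poisson(xi, rng):
    """Draw N ~ Poisson(xi) with Knuth's multiplication method (suitable for small xi)."""
    limit, n, prod = math.exp(-xi), 0, rng.random()
    while prod > limit:
        n += 1
        prod *= rng.random()
    return n


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(alpha, beta, xi, rng):
    """Follow steps 1 to 3 of the generative process; words and topics are 0-indexed."""
    k, V = len(beta), len(beta[0])
    N = sample_poisson(xi, rng)              # step 1
    theta = sample_dirichlet(alpha, rng)     # step 2
    topics, words = [], []
    for _ in range(N):                       # step 3
        z = rng.choices(range(k), weights=theta)[0]     # (a) z_n ~ Multinomial(theta)
        w = rng.choices(range(V), weights=beta[z])[0]   # (b) w_n ~ p(w_n | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical toy model: k = 2 topics over a vocabulary of V = 3 words.
rng = random.Random(0)
alpha = [1.0, 1.0]
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
theta, topics, words = generate_document(alpha, beta, xi=8.0, rng=rng)
```

Each row of `beta` is one topic's distribution over the vocabulary, so every row must sum to 1.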

8

Latent Dirichlet allocation (cont.)

• Representation of a document generation:

N ~ Poisson(ξ)

θ ~ Dir(α) → topics {z1, z2, …, zN}

β(zn) → words {w1, w2, …, wN}

9


• Several simplifying assumptions:
– 1. The dimensionality k of the Dirichlet distribution is known and fixed
– 2. The word probabilities β are a fixed quantity that is to be estimated
– 3. The document length N is independent of all the other data-generating variables θ and z

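• A k-dimensional Dirichlet random variable θ can take values in the (k-1)-simplex, with density

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$$

http://www.answers.com/topic/dirichlet-distribution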
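
As a quick check of the Dirichlet density, the sketch below evaluates its logarithm with the standard library's `math.lgamma`. The function name `dirichlet_log_pdf` and the tolerance used for the simplex check are assumptions made for illustration.

```python
import math


def dirichlet_log_pdf(theta, alpha):
    """Log density of Dir(alpha) at a point theta in the interior of the (k-1)-simplex."""
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length k")
    if any(t <= 0.0 for t in theta) or abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("theta must be strictly positive and sum to 1")
    # log of the normalizer: log Gamma(sum_i alpha_i) - sum_i log Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # log of the product term: sum_i (alpha_i - 1) * log theta_i
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


# With alpha = (1, 1, 1) the density is uniform on the 2-simplex.
uniform_value = dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
```

For α = (1, 1, 1) the density is constant and equal to Γ(3) = 2, so `uniform_value` equals log 2 wherever θ lies in the interior of the simplex.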




10


• Simplex:

[Figure: graphs of the n-simplexes for n = 2 to 7, from MathWorld, http://mathworld.wolfram.com/Simplex.html]

11


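• The joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

• Marginal distribution of a document:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$

• Probability of a corpus:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$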
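
The joint distribution factorizes into terms that are easy to evaluate once θ, z and w are all given. The sketch below computes its logarithm; the function name `log_joint` and the toy parameter values are assumptions, and words and topics are 0-indexed.

```python
import math


def log_joint(theta, topics, words, alpha, beta):
    """log p(theta, z, w | alpha, beta) for one document.

    theta  : topic mixture, a point in the interior of the simplex
    topics : list of topic indices z_n
    words  : list of word indices w_n (same length as topics)
    beta   : k x V matrix, beta[i][j] = p(word j | topic i)
    """
    if len(topics) != len(words):
        raise ValueError("each word needs exactly one topic assignment")
    # log p(theta | alpha): the Dirichlet log density
    log_p = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_p += sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    # sum over n of log p(z_n | theta) + log p(w_n | z_n, beta)
    for z, w in zip(topics, words):
        log_p += math.log(theta[z]) + math.log(beta[z][w])
    return log_p


# Hypothetical toy model with k = 2 topics and V = 2 words.
alpha = [1.0, 1.0]
beta = [[0.5, 0.5],
        [0.1, 0.9]]
value = log_joint([0.5, 0.5], topics=[0], words=[1], alpha=alpha, beta=beta)
```

The marginal distribution of a document is harder: it requires summing this joint over all topic assignments and integrating over θ, which is the source of the intractability discussed later in the slides.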

12


• There are three levels to the LDA representation
– α and β are corpus-level parameters
– θd are document-level variables
– zdn, wdn are word-level variables

[Figure labels: corpus, document]

• Such models are referred to as hierarchical models, conditionally independent hierarchical models, or parametric empirical Bayes models

13


• LDA and exchangeability
– A finite set of random variables {z1,…,zN} is said to be exchangeable if the joint distribution is invariant to permutation (π is a permutation of the integers from 1 to N):

$$p(z_1, \ldots, z_N) = p(z_{\pi(1)}, \ldots, z_{\pi(N)})$$

– An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable
– De Finetti’s representation theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and then the random variables in question were independent and identically distributed, conditioned on that parameter
– http://en.wikipedia.org/wiki/De_Finetti's_theorem

14


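• In LDA, we assume that words are generated by topics (by fixed conditional distributions) and that those topics are infinitely exchangeable within a document. By de Finetti’s theorem, the probability of a sequence of words and topics then has the form:

$$p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta$$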

15


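• A continuous mixture of unigrams
– By marginalizing over the hidden topic variable z, we can understand LDA as a two-level model:

$$p(w \mid \theta, \beta) = \sum_{z} p(w \mid z, \beta)\, p(z \mid \theta)$$

• Generative process for a document w
– 1. Choose θ ~ Dir(α)
– 2. For each of the N words wn:
(a) Choose a word wn from p(wn | θ, β)
– Marginal distribution of a document:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} p(w_n \mid \theta, \beta) \right) d\theta$$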
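
The two-level view can be checked numerically for a fixed θ: summing the joint over every topic assignment gives the same value as multiplying the per-word mixtures p(wn | θ, β). The sketch below does this by brute force on a toy model. All names and values are assumptions, and the outer integral over θ is deliberately left out.

```python
import itertools
import math


def word_prob_given_theta(w, theta, beta):
    """p(w | theta, beta) = sum over z of p(w | z, beta) * p(z | theta)."""
    return sum(theta[i] * beta[i][w] for i in range(len(theta)))


def doc_prob_given_theta(words, theta, beta):
    """Product over words of the per-word mixture (topics already marginalized)."""
    return math.prod(word_prob_given_theta(w, theta, beta) for w in words)


def doc_prob_by_enumeration(words, theta, beta):
    """Sum of prod_n p(z_n | theta) p(w_n | z_n, beta) over all k**N topic assignments."""
    k = len(theta)
    total = 0.0
    for topics in itertools.product(range(k), repeat=len(words)):
        total += math.prod(theta[z] * beta[z][w] for z, w in zip(topics, words))
    return total


# Hypothetical toy model: k = 2 topics, V = 2 words, a document of N = 3 words.
theta = [0.3, 0.7]
beta = [[0.5, 0.5],
        [0.2, 0.8]]
words = [0, 1, 1]
marginalized = doc_prob_given_theta(words, theta, beta)
enumerated = doc_prob_by_enumeration(words, theta, beta)
```

The enumeration costs k**N terms while the factored form costs only N·k, which is why the sum over z is pushed inside the product.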

16


• The distribution on the (V-1)-simplex is attained with only k+kV parameters.

17

Relationship with other latent variable models

• Unigram model

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$

• Mixture of unigrams
– Each document is generated by first choosing a topic z and then generating N words independently from the conditional multinomial p(w | z)
– k-1 parameters

$$p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$$
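
The two document probabilities above can be compared directly in code. This is a minimal sketch with hypothetical function names and toy parameters; the mixture is evaluated in log space with the usual log-sum-exp trick to avoid underflow on long documents.

```python
import math


def unigram_log_prob(words, p_word):
    """log p(w) = sum over n of log p(w_n) under a single word distribution."""
    return sum(math.log(p_word[w]) for w in words)


def mixture_of_unigrams_log_prob(words, p_topic, p_word_given_topic):
    """log p(w) = log sum_z p(z) prod_n p(w_n | z); one topic per document."""
    per_topic = [
        math.log(pz) + sum(math.log(row[w]) for w in words)
        for pz, row in zip(p_topic, p_word_given_topic)
    ]
    # log-sum-exp over topics
    peak = max(per_topic)
    return peak + math.log(sum(math.exp(t - peak) for t in per_topic))


# Hypothetical toy parameters: V = 2 words, k = 2 topics.
words = [0, 1, 1]
unigram = unigram_log_prob(words, [0.4, 0.6])
mixture = mixture_of_unigrams_log_prob(
    words,
    p_topic=[0.5, 0.5],
    p_word_given_topic=[[0.9, 0.1], [0.2, 0.8]],
)
```

When every topic has the same word distribution, the mixture collapses to the unigram model, which gives a simple sanity check.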

18

Relationship with other latent variable models (cont.)

• Probabilistic latent semantic indexing
– Attempts to relax the simplifying assumption made in the mixture of unigrams model
– In a sense, it does capture the possibility that a document may contain multiple topics
– kV+kM parameters, with linear growth in M

$$p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$$

19


• Problems of pLSI
– There is no natural way to use it to assign probability to a previously unseen document
– The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem

• LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable

• The k+kV parameters in a k-topic LDA model do not grow with the size of the training corpus.
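
The difference in parameter growth is easy to tabulate. The sketch below uses the counts quoted on the slides (kV + kM for pLSI, k + kV for LDA); the function names and the example sizes are hypothetical.

```python
def plsi_parameter_count(k, V, M):
    """kV word-given-topic parameters plus kM document-specific topic weights."""
    return k * V + k * M


def lda_parameter_count(k, V):
    """k Dirichlet parameters (alpha) plus the k x V matrix beta."""
    return k + k * V


# Hypothetical sizes: 50 topics, a 10,000-word vocabulary, growing corpus size M.
k, V = 50, 10_000
for M in (1_000, 10_000, 100_000):
    print(M, plsi_parameter_count(k, V, M), lda_parameter_count(k, V))
```

Only the pLSI column changes as M grows; the LDA count depends on k and V alone.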

20


• A geometric interpretation: three topics and three words

21


• The unigram model finds a single point on the word simplex and posits that all words in the corpus come from the corresponding distribution.

• The mixture of unigrams model posits that for each document, one of the k points on the word simplex is chosen randomly and all the words of the document are drawn from the corresponding distribution.

• The pLSI model posits that each word of a training document comes from a randomly chosen topic. The topics are themselves drawn from a document-specific distribution over topics.

• LDA posits that each word of both the observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution with a randomly chosen parameter.

22

Inference and parameter estimation

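• The key inferential problem is that of computing the posterior distribution of the hidden variables given a document:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$

$$p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$$

• Unfortunately, this distribution is intractable to compute in general. The normalizing term p(w | α, β) is intractable due to the coupling between θ and β in the summation over latent topics.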

23

Inference and parameter estimation (cont.)

• The basic idea of convexity-based variational inference is to make use of Jensen’s inequality to obtain an adjustable lower bound on the log likelihood.

• Essentially, one considers a family of lower bounds, indexed by a set of variational parameters.

• A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed.

24


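• Drop some edges and the w nodes to obtain a family of distributions on the latent variables:

$$q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$$

• This family is used to approximate the true posterior:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$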

25


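• Variational distribution:
– Lower bound on the log likelihood (by Jensen’s inequality):

$$\begin{aligned}
\log p(\mathbf{w} \mid \alpha, \beta) &= \log \int \sum_{\mathbf{z}} p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, d\theta \\
&= \log \int \sum_{\mathbf{z}} \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, q(\theta, \mathbf{z} \mid \gamma, \phi)}{q(\theta, \mathbf{z} \mid \gamma, \phi)}\, d\theta \\
&\geq \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, d\theta - \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log q(\theta, \mathbf{z} \mid \gamma, \phi)\, d\theta \\
&= E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})]
\end{aligned}$$

– KL divergence between the variational posterior and the true posterior:

$$\begin{aligned}
D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big) &= \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log q(\theta, \mathbf{z} \mid \gamma, \phi)\, d\theta - \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\, d\theta \\
&= \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log q(\theta, \mathbf{z} \mid \gamma, \phi)\, d\theta - \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}\, d\theta \\
&= E_q[\log q(\theta, \mathbf{z})] - E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] + \log p(\mathbf{w} \mid \alpha, \beta)
\end{aligned}$$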

26


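• Finding a tight lower bound on the log likelihood:

$$\log p(\mathbf{w} \mid \alpha, \beta) = E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})] + D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)$$

• Maximizing the lower bound with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior probability and the true posterior probability:

$$(\gamma^*, \phi^*) = \arg\min_{(\gamma, \phi)} D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)$$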

27


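• Expand the lower bound:

$$\begin{aligned}
L(\gamma, \phi; \alpha, \beta) &= E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})] \\
&= E_q[\log p(\theta \mid \alpha)] + E_q[\log p(\mathbf{z} \mid \theta)] + E_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)] - E_q[\log q(\theta)] - E_q[\log q(\mathbf{z})]
\end{aligned}$$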

28


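• Then, with Ψ denoting the digamma function:

$$\begin{aligned}
L(\gamma, \phi; \alpha, \beta) ={}& \log \Gamma\left(\sum_{j=1}^{k} \alpha_j\right) - \sum_{i=1}^{k} \log \Gamma(\alpha_i) + \sum_{i=1}^{k} (\alpha_i - 1)\left(\Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)\right) \\
&+ \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \left(\Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)\right) \\
&+ \sum_{n=1}^{N} \sum_{i=1}^{k} \sum_{j=1}^{V} \phi_{ni}\, w_n^j \log \beta_{ij} \\
&- \log \Gamma\left(\sum_{j=1}^{k} \gamma_j\right) + \sum_{i=1}^{k} \log \Gamma(\gamma_i) - \sum_{i=1}^{k} (\gamma_i - 1)\left(\Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)\right) \\
&- \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \log \phi_{ni}
\end{aligned}$$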

29


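• We can obtain the variational parameters by adding Lagrange multipliers and setting the derivatives of L to zero:

$$\phi_{ni} \propto \beta_{iv} \exp\left(\Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)\right)$$

$$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$

Here v is the vocabulary index of the word wn and Ψ is the digamma function.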
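
The two update equations are applied alternately until they stop changing. The sketch below is one possible implementation for a single document, using only the standard library. Several choices are assumptions made here and are not taken from the slides: the series approximation of the digamma function, the initialization (φni = 1/k, γi = αi + N/k), the fixed iteration count in place of a convergence test, and all function names.

```python
import math


def digamma(x):
    """Approximate Psi(x) for x > 0: shift x up to >= 6, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x        # recurrence Psi(x) = Psi(x + 1) - 1/x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def variational_e_step(words, alpha, beta, iterations=50):
    """Coordinate ascent on (gamma, phi) for one document; words are 0-indexed."""
    k, N = len(alpha), len(words)
    phi = [[1.0 / k] * k for _ in range(N)]
    gamma = [a + N / k for a in alpha]
    for _ in range(iterations):
        psi_total = digamma(sum(gamma))
        weights = [math.exp(digamma(g) - psi_total) for g in gamma]
        for n, v in enumerate(words):
            # phi_ni proportional to beta_iv * exp(Psi(gamma_i) - Psi(sum_j gamma_j))
            unnorm = [beta[i][v] * weights[i] for i in range(k)]
            total = sum(unnorm)
            phi[n] = [u / total for u in unnorm]
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(k)]
    return gamma, phi


# Hypothetical toy model: topic 0 favors word 0, topic 1 favors word 1.
alpha = [0.5, 0.5]
beta = [[0.9, 0.1],
        [0.1, 0.9]]
gamma, phi = variational_e_step([0, 0, 0, 1], alpha, beta)
```

Two invariants are useful for debugging: every row of `phi` sums to 1, and the entries of `gamma` sum to the sum of `alpha` plus the document length N.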

30


• Parameter estimation
– Maximize the log likelihood of the data:

$$\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)$$

– Variational inference provides us with a tractable lower bound on the log likelihood, a bound which we can maximize with respect to α and β

• Variational EM procedure
– 1. (E-step) For each document, find the optimizing values of the variational parameters {γ, φ}
– 2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β
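
As an illustration of the M-step, the sketch below re-estimates β from the per-document φ values. The closed-form update it uses, βij ∝ Σd Σn φdni wdn^j, comes from the Blei et al. paper and is not shown on the slides. The update for α requires an iterative method and is omitted, so α is treated as fixed here. The function name and the small `smoothing` constant, added to avoid all-zero rows, are assumptions.

```python
def m_step_beta(corpus, phis, k, V, smoothing=1e-12):
    """Re-estimate beta from expected topic assignments.

    corpus : list of documents, each a list of 0-indexed word ids
    phis   : phis[d][n][i] is the variational probability that word n
             of document d was generated by topic i
    """
    counts = [[smoothing] * V for _ in range(k)]
    for words, phi in zip(corpus, phis):
        for n, v in enumerate(words):
            for i in range(k):
                counts[i][v] += phi[n][i]    # expected count of word v under topic i
    # normalize each topic's row so that it sums to 1 over the vocabulary
    return [[c / sum(row) for c in row] for row in counts]


# Hypothetical toy input: one document with two words and hard assignments.
corpus = [[0, 1]]
phis = [[[1.0, 0.0],
         [0.0, 1.0]]]
beta = m_step_beta(corpus, phis, k=2, V=2)
```

A full variational EM loop would alternate the per-document E-step with this update until the lower bound stops improving.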

31


• Smoothed LDA model:

32

Discussion

• LDA is a flexible generative probabilistic model for collections of discrete data.

• Exact inference is intractable for LDA, but any of a large suite of approximate inference algorithms can be used for inference and parameter estimation within the LDA framework.

• LDA is a simple model and is readily extended to continuous data or other non-multinomial data.
