Latent Dirichlet Allocation
Kuan-Yu Menphis Chen
SLP Laboratory, [email protected]
Main References:
1. D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, 2003.
2. D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," in Proc. NIPS, 2002.
3. M. Steyvers and T. Griffiths, "Probabilistic topic models," in T. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch (Eds.), Handbook of Latent Semantic Analysis. Hillsdale, NJ: Erlbaum, 2007.
4. J. Huang, "Maximum likelihood estimation of Dirichlet distribution parameters," Manuscript, 2006.
Outline
• Introduction
• Latent Dirichlet Allocation
• Relationship with Other Latent Variable Models
• Inference and Parameter Estimation
• Discussion
Introduction
• The tf-idf reduction has some appealing features:
  – Notably, its basic identification of sets of words that are discriminative for documents in the collection
  – However, the approach provides only a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure
• LSI uses a singular value decomposition of the word-document matrix to identify a linear subspace in the space of tf-idf features:
  – LSI captures most of the variance in the collection
  – It can achieve significant compression in large collections
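The tf-idf reduction above can be sketched in a few lines of pure Python. This is a minimal illustration, not the canonical definition: the function name `tfidf` and the particular weighting (length-normalized term frequency times an unsmoothed log idf) are choices for this sketch, and real systems often use smoothed or log-scaled variants:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one dict {term: tf-idf weight} per document."""
    M = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(M / df[t]) for t, c in tf.items()})
    return weights
```

Note the "discriminative words" property mentioned above: a term that occurs in every document gets idf = log(M/M) = 0 and is weighted out entirely.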
Introduction
• In PLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers
• This leads to several problems:
  – The number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting
  – It is not clear how to assign probability to a document outside of the training set
Introduction
• All of these methods (tf-idf, LSI, PLSI) are based on the “bag-of-words” assumption:
  – The order of words in a document can be neglected
• In the language of probability theory, this is an assumption of exchangeability for the words in a document
• Moreover, although less often stated formally, these methods also assume that documents are exchangeable:
  – The specific ordering of the documents in a corpus can also be neglected
Latent Dirichlet Allocation - Notation
• A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1, \dots, V\}$
  – Using superscripts to denote components, the $v$-th word in the vocabulary is represented by a $V$-vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u \neq v$
• A document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \dots, w_N)$, where $w_n$ is the $n$-th word in the sequence
• A corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M\}$
Latent Dirichlet Allocation
• The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words
• LDA assumes the following generative process for each document $\mathbf{w}$ in a corpus $D$:
  1. Choose $N \sim \mathrm{Poisson}(\xi)$
  2. Choose $\theta \sim \mathrm{Dir}(\alpha)$
  3. For each of the $N$ words $w_n$:
     a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$
     b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
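The generative process above can be sketched directly in pure Python. A minimal sketch, assuming a $k$-vector `alpha`, a $k \times V$ row-stochastic matrix `beta`, and a Poisson mean `xi`; the helper names are illustrative, not from the paper:

```python
import math
import random

def sample_dirichlet(alpha, rng):
    # Dirichlet draw via normalized Gamma variates
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(probs, rng):
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(alpha, beta, xi, rng):
    # 1. Choose N ~ Poisson(xi) (Knuth's inversion sampler)
    threshold, N, p = math.exp(-xi), 0, rng.random()
    while p > threshold:
        p *= rng.random()
        N += 1
    # 2. Choose theta ~ Dir(alpha)
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(N):
        z = sample_categorical(theta, rng)              # 3a. topic z_n ~ Multinomial(theta)
        words.append(sample_categorical(beta[z], rng))  # 3b. word w_n ~ p(w | z_n, beta)
    return words
```

Each document gets its own topic mixture `theta`, which is exactly what lets documents exhibit multiple topics.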
Latent Dirichlet Allocation
• Several simplifying assumptions are made:
  – The dimensionality $k$ of the Dirichlet distribution is assumed known and fixed
  – The word probabilities are parameterized by a $k \times V$ matrix $\beta$, which we treat as a fixed quantity that is to be estimated
  – The Poisson assumption is not critical to anything. Note that the document length $N$ is independent of all the other data generating variables ($\theta$ and $\mathbf{z}$)
• A $k$-dimensional Dirichlet random variable $\theta$ can take values in the $(k-1)$-simplex, and has the following probability density:

$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$$
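The density above is easiest to evaluate in log space with `math.lgamma`. A small sketch (the function name is illustrative):

```python
import math

def dirichlet_log_pdf(theta, alpha):
    """Log of p(theta | alpha) for theta on the (k-1)-simplex."""
    assert abs(sum(theta) - 1.0) < 1e-9, "theta must lie on the simplex"
    # log of the normalizing constant Gamma(sum alpha_i) / prod Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
```

For the symmetric case $\alpha = (1, \dots, 1)$ the density is the constant $\Gamma(k) = (k-1)!$, i.e. the uniform distribution on the simplex.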
Latent Dirichlet Allocation
• By placing a Dirichlet prior on the topic distribution, the result is a smoothed topic distribution, with the amount of smoothing determined by the parameter $\alpha$
• The Dirichlet prior can be interpreted as forces on the topic combinations, with higher $\alpha_j$ moving the topics away from the corners of the simplex, leading to more smoothing
Latent Dirichlet Allocation
• Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $\mathbf{z}$, and a set of $N$ words $\mathbf{w}$ is given by:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

• Integrating over $\theta$ and summing over $\mathbf{z}$, we obtain the marginal distribution of a document:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$

• Taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$
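Since the marginal $p(\mathbf{w} \mid \alpha, \beta)$ is an expectation over $\theta \sim \mathrm{Dir}(\alpha)$ of a product of simple sums, it can be approximated by Monte Carlo for tiny examples. A sketch (the function name and sampling scheme are choices for this illustration, not something from the paper; the naive average is only feasible for very short documents):

```python
import math
import random

def log_marginal_mc(word_ids, alpha, beta, n_samples=20000, seed=0):
    """Monte Carlo estimate of log p(w | alpha, beta):
    draw theta ~ Dir(alpha), average prod_n sum_i theta_i * beta[i][w_n]."""
    rng = random.Random(seed)
    k = len(alpha)
    total = 0.0
    for _ in range(n_samples):
        g = [rng.gammavariate(a, 1.0) for a in alpha]  # Dirichlet via Gammas
        s = sum(g)
        theta = [x / s for x in g]
        p = 1.0
        for w in word_ids:
            p *= sum(theta[i] * beta[i][w] for i in range(k))
        total += p
    return math.log(total / n_samples)
```

For a one-word document the exact value is $\sum_i (\alpha_i / \sum_j \alpha_j)\, \beta_{i,w}$, which gives a quick sanity check on the estimator.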
Latent Dirichlet Allocation
• The parameters $\alpha$ and $\beta$ are corpus-level parameters
• The variables $\theta_d$ are document-level variables
• The variables $z_{dn}$ and $w_{dn}$ are word-level variables
Latent Dirichlet Allocation - Exchangeability
• A finite set of random variables $\{z_1, \dots, z_N\}$ is said to be exchangeable if the joint distribution is invariant to permutation. If $\pi$ is a permutation of the integers from 1 to $N$:

$$p(z_1, \dots, z_N) = p(z_{\pi(1)}, \dots, z_{\pi(N)})$$

• It is important to emphasize that an assumption of exchangeability is not equivalent to an assumption that the random variables are independent and identically distributed
• Rather, exchangeability essentially can be interpreted as meaning “conditionally independent and identically distributed,” where the conditioning is with respect to an underlying latent parameter of a probability distribution
(c.f. De Finetti’s representation theorem: http://en.wikipedia.org/wiki/De_Finetti's_theorem)
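A tiny numerical example of "exchangeable but not independent": two coin flips whose shared bias is itself random (a uniform mixture of two biases, chosen here purely for illustration). Given the bias the flips are iid; marginally they are exchangeable but correlated:

```python
import math

biases = [0.9, 0.1]   # latent parameter, each value chosen with probability 1/2

def joint(seq):
    # Marginalize over the latent bias: flips are conditionally iid given it
    return sum(0.5 * math.prod(b if s == "H" else 1.0 - b for s in seq)
               for b in biases)

p_ht, p_th = joint("HT"), joint("TH")   # equal: order does not matter
p_h, p_hh = joint("H"), joint("HH")     # but P(H,H) != P(H)^2
```

Here $P(H,T) = P(T,H) = 0.09$ while $P(H,H) = 0.41 \neq P(H)^2 = 0.25$: observing the first flip tells us about the latent bias, hence about the second flip.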
Latent Dirichlet Allocation - Exchangeability
• In LDA, we assume that words are generated by topics (by fixed conditional distributions) and that those topics are infinitely exchangeable within a document
• By de Finetti’s theorem, the probability of a sequence of words and topics must therefore have the form:

$$p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta$$
Latent Dirichlet Allocation
• An example density on unigram distributions $p(w \mid \theta, \beta)$ under LDA for three words and four topics:
  – The triangle embedded in the x-y plane is the 2-D simplex representing all possible multinomial distributions over three words
  – The four points marked with an x are the locations of the multinomial distributions $p(w \mid z)$ for each of the four topics
  – The surface shown on top of the simplex is an example of a density over the $(V-1)$-simplex (multinomial distributions of words) given by LDA
Relationship with Other Latent Variable Models
• Unigram model
  – Under the unigram model, the words of every document are drawn independently from a single multinomial distribution:

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$

• Mixture of unigrams
  – Under this mixture model, each document is generated by first choosing a topic $z$ and then generating $N$ words independently from the conditional multinomial:

$$p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$$

  – The LDA model allows documents to exhibit multiple topics
  – There are $k - 1$ parameters associated with $p(z)$ in the mixture of unigrams, versus the $k$ parameters associated with $p(\theta \mid \alpha)$ in LDA
Relationship with Other Latent Variable Models
• Probabilistic latent semantic indexing (PLSI/PLSA)
  – The PLSI model posits that a document label $d$ and a word $w_n$ are conditionally independent given an unobserved topic $z$:

$$p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$$

  – It is important to note that $d$ is a dummy index into the list of documents in the training set
    • The model learns the topic mixtures $p(z \mid d)$ only for those documents on which it is trained, so PLSI is not a well-defined generative model
  – A further difficulty is that the number of parameters which must be estimated grows linearly with the number of training documents
    • The model has $kV + kM$ parameters and therefore exhibits linear growth in $M$
Relationship with Other Latent Variable Models
• LDA overcomes both of PLSI’s problems by:
  – Treating the topic mixture weights as a $k$-parameter hidden random variable, rather than a large set of individual parameters which are explicitly linked to the training set
  – The $k + kV$ parameters in a $k$-topic LDA model do not grow with the size of the training corpus
Relationship with Other Latent Variable Models
• A geometric interpretation:
  – The mixture of unigrams places each document at one of the corners of the topic simplex
  – The PLSI model induces an empirical distribution on the topic simplex, denoted by an x
  – LDA places a smooth distribution on the topic simplex, denoted by the contour lines
Inference and Parameter Estimation - Inference
• The key inferential problem is that of computing the posterior distribution of the hidden variables given a document:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$

  – Unfortunately, this distribution is intractable to compute in general
    • The normalizing constant is intractable due to the coupling between $\theta$ and $\beta$ in the summation over latent topics:

$$p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$$

• Although the posterior distribution is intractable for exact inference, a wide variety of approximate inference algorithms can be considered for LDA
Inference and Parameter Estimation - Variational Inference
• The basic idea of convexity-based variational inference is to make use of Jensen’s inequality to obtain an adjustable lower bound on the log likelihood
• The variational parameters are chosen by an optimization procedure that attempts to find the tightest possible lower bound
• A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed
Inference and Parameter Estimation - Variational Inference
• This family is characterized by the following variational distribution:

$$q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$$

• The desideratum of finding a tight lower bound on the log likelihood translates directly into the following optimization problem:

$$(\gamma^*, \phi^*) = \arg\min_{(\gamma, \phi)} \mathrm{D}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)$$

  – That is, by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior
Inference and Parameter Estimation - Variational Inference (1/3)
• Show that $\log p(\mathbf{w} \mid \alpha, \beta) \geq L(\gamma, \phi; \alpha, \beta) = \mathrm{E}_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - \mathrm{E}_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)]$
• Proof: introduce the variational distribution $q$ into the marginal likelihood,

$$\log p(\mathbf{w} \mid \alpha, \beta) = \log \int \sum_{\mathbf{z}} p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, d\theta = \log \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi)\, \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{q(\theta, \mathbf{z} \mid \gamma, \phi)}\, d\theta$$

• Since $\log$ is a concave function, Jensen’s inequality ($\log \mathrm{E}_q[X] \geq \mathrm{E}_q[\log X]$) gives:

$$\log p(\mathbf{w} \mid \alpha, \beta) \geq \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{q(\theta, \mathbf{z} \mid \gamma, \phi)}\, d\theta = \mathrm{E}_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - \mathrm{E}_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)] = L(\gamma, \phi; \alpha, \beta) \quad \text{(lower bound)}$$
Inference and Parameter Estimation - Variational Inference (2/3)
• Show that $\log p(\mathbf{w} \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + \mathrm{D}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)$
• Proof: starting from the definition of the KL divergence and using $p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) / p(\mathbf{w} \mid \alpha, \beta)$:

$$\begin{aligned} \mathrm{D}(q \,\|\, p) &= \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi) \log \frac{q(\theta, \mathbf{z} \mid \gamma, \phi)}{p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)}\, d\theta \\ &= \mathrm{E}_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)] - \mathrm{E}_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] + \log p(\mathbf{w} \mid \alpha, \beta) \\ &= -L(\gamma, \phi; \alpha, \beta) + \log p(\mathbf{w} \mid \alpha, \beta) \end{aligned}$$

  – The term $\log p(\mathbf{w} \mid \alpha, \beta)$ factors out of the expectation because it does not depend on $\theta$ or $\mathbf{z}$
• Therefore maximizing the lower bound $L(\gamma, \phi; \alpha, \beta)$ with respect to $(\gamma, \phi)$ is equivalent to minimizing the KL divergence between the variational distribution and the true posterior
Inference and Parameter Estimation - Variational Inference (3/3)
• We can expand the lower bound by using the factorizations of p and q:
$$\begin{aligned} L(\gamma, \phi; \alpha, \beta) ={}& \mathrm{E}_q[\log p(\theta \mid \alpha)] + \mathrm{E}_q[\log p(\mathbf{z} \mid \theta)] + \mathrm{E}_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)] - \mathrm{E}_q[\log q(\theta)] - \mathrm{E}_q[\log q(\mathbf{z})] \\ ={}& \log \Gamma\Big(\textstyle\sum_{j=1}^{k} \alpha_j\Big) - \sum_{i=1}^{k} \log \Gamma(\alpha_i) + \sum_{i=1}^{k} (\alpha_i - 1)\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_j\big) \Big) \\ &+ \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni}\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_j\big) \Big) + \sum_{n=1}^{N} \sum_{i=1}^{k} \sum_{j=1}^{V} \phi_{ni}\, w_n^j \log \beta_{ij} \\ &- \log \Gamma\Big(\textstyle\sum_{j=1}^{k} \gamma_j\Big) + \sum_{i=1}^{k} \log \Gamma(\gamma_i) - \sum_{i=1}^{k} (\gamma_i - 1)\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_j\big) \Big) \\ &- \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \log \phi_{ni} \end{aligned}$$

where $\Psi$ denotes the digamma function (the first derivative of $\log \Gamma$).
Inference and Parameter Estimation - Variational Inference
• By computing the derivatives of the KL divergence and setting them equal to zero, we obtain the following pair of update equations:

$$\phi_{ni} \propto \beta_{i w_n} \exp\big( \mathrm{E}_q[\log \theta_i \mid \gamma] \big)$$

$$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$

• It is important to note that the variational distribution is actually a conditional distribution, varying as a function of $\mathbf{w}$, because the optimization problem

$$(\gamma^*, \phi^*) = \arg\min_{(\gamma, \phi)} \mathrm{D}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)$$

  is conducted for fixed $\mathbf{w}$
• We can write the resulting variational distribution as $q(\theta, \mathbf{z} \mid \gamma^*(\mathbf{w}), \phi^*(\mathbf{w}))$, so the variational distribution can be viewed as an approximation to the posterior distribution $p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)$
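The two coupled updates above can be iterated to a fixed point for a single document. A minimal pure-Python sketch (the function names and the simple digamma approximation are choices for this illustration; note that $\mathrm{E}_q[\log \theta_i \mid \gamma] = \Psi(\gamma_i) - \Psi(\sum_j \gamma_j)$, and the second term is shared across $i$, so it cancels when $\phi_n$ is normalized):

```python
import math

def digamma(x):
    # Psi(x) via recurrence plus the standard asymptotic series
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def variational_e_step(word_ids, alpha, beta, n_iter=50):
    """Coordinate ascent for one document; returns (gamma, phi)."""
    k, N = len(alpha), len(word_ids)
    phi = [[1.0 / k] * k for _ in range(N)]        # initialize phi_ni = 1/k
    gamma = [alpha[i] + N / k for i in range(k)]   # initialize gamma_i = alpha_i + N/k
    for _ in range(n_iter):
        dg = [digamma(g) for g in gamma]
        for n, w in enumerate(word_ids):
            # phi_ni proportional to beta_{i,w_n} * exp(Psi(gamma_i))
            raw = [beta[i][w] * math.exp(dg[i]) for i in range(k)]
            s = sum(raw)
            phi[n] = [x / s for x in raw]
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(k)]
    return gamma, phi
```

Note that $\sum_i \gamma_i = \sum_i \alpha_i + N$ after every update, and $\gamma$ concentrates on the topics whose rows of $\beta$ best explain the observed words.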
Inference and Parameter Estimation - Variational Inference (1/3)
• Computing $\mathrm{E}[\log \theta_i \mid \alpha]$:
  – A distribution is in the exponential family if it can be written in the form:

$$p(x \mid \eta) = h(x) \exp\big\{ \eta^T T(x) - A(\eta) \big\}$$

    where $\eta$ is the natural (canonical) parameter, $T(x)$ is the sufficient statistic, and $A(\eta)$ is the log normalizer
  – The Dirichlet can be written in this form:

$$p(\theta \mid \alpha) = \exp\left\{ \sum_{i=1}^{k} (\alpha_i - 1) \log \theta_i + \log \Gamma\Big(\sum_{i=1}^{k} \alpha_i\Big) - \sum_{i=1}^{k} \log \Gamma(\alpha_i) \right\}$$

  – The noteworthy point is that $A(\eta)$ is the cumulant generating function for the sufficient statistic, so in particular $A'(\eta)$ is the expectation and $A''(\eta)$ is the variance:

$$\mathrm{E}[T(x)] = A'(\eta) \;\Rightarrow\; \mathrm{E}[\log \theta_i \mid \alpha] = \Psi(\alpha_i) - \Psi\Big(\sum_{j=1}^{k} \alpha_j\Big)$$
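The identity $\mathrm{E}[\log \theta_i \mid \alpha] = \Psi(\alpha_i) - \Psi(\sum_j \alpha_j)$ can be checked numerically by sampling $\theta \sim \mathrm{Dir}(\alpha)$ via normalized Gamma draws and averaging $\log \theta_i$. A sketch, using a simple digamma approximation (recurrence plus asymptotic series) since the standard library has no `digamma`:

```python
import math
import random

def digamma(x):
    # Psi(x) via recurrence plus the standard asymptotic series
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

alpha = [2.0, 3.0, 4.0]
rng = random.Random(0)
M = 50000
acc = [0.0] * len(alpha)
for _ in range(M):
    g = [rng.gammavariate(a, 1.0) for a in alpha]  # Dirichlet via Gammas
    s = sum(g)
    for i, gi in enumerate(g):
        acc[i] += math.log(gi / s)
mc_estimate = [a / M for a in acc]
exact = [digamma(a) - digamma(sum(alpha)) for a in alpha]
```

With 50,000 samples the Monte Carlo averages agree with the closed form to within a few thousandths, which is what the variational updates rely on.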
Inference and Parameter Estimation - Variational Inference (2/3)
• We form the Lagrangian by isolating the terms which contain $\phi_{ni}$ and adding the appropriate Lagrange multiplier (here $\beta_{iv}$ denotes the word probability for the vocabulary entry $v$ with $w_n^v = 1$):

$$L_{[\phi_{ni}]} = \phi_{ni}\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k}\gamma_j\big) \Big) + \phi_{ni} \log \beta_{iv} - \phi_{ni} \log \phi_{ni} + \lambda_n\Big( \textstyle\sum_{j=1}^{k} \phi_{nj} - 1 \Big)$$

• Taking the derivative with respect to $\phi_{ni}$:

$$\frac{\partial L}{\partial \phi_{ni}} = \Psi(\gamma_i) - \Psi\Big(\sum_{j=1}^{k}\gamma_j\Big) + \log \beta_{iv} - \log \phi_{ni} - 1 + \lambda_n$$

• Setting this derivative to zero yields the maximizing value of the variational parameter $\phi_{ni}$:

$$\phi_{ni} \propto \beta_{iv} \exp\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k}\gamma_j\big) \Big) = \beta_{iv} \exp\big( \mathrm{E}[\log \theta_i \mid \gamma] \big)$$
Inference and Parameter Estimation - Variational Inference (3/3)
• Maximize with respect to $\gamma_i$, the $i$-th component of the posterior Dirichlet parameter. The terms containing $\gamma_i$ are:

$$\begin{aligned} L_{[\gamma]} &= \sum_{i=1}^{k} (\alpha_i - 1)\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k}\gamma_j\big) \Big) + \sum_{n=1}^{N}\sum_{i=1}^{k} \phi_{ni}\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k}\gamma_j\big) \Big) \\ &\quad - \log \Gamma\Big(\textstyle\sum_{j=1}^{k}\gamma_j\Big) + \sum_{i=1}^{k} \log \Gamma(\gamma_i) - \sum_{i=1}^{k} (\gamma_i - 1)\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k}\gamma_j\big) \Big) \\ &= \sum_{i=1}^{k} \Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k}\gamma_j\big) \Big)\Big( \alpha_i + \sum_{n=1}^{N} \phi_{ni} - \gamma_i \Big) - \log \Gamma\Big(\textstyle\sum_{j=1}^{k}\gamma_j\Big) + \sum_{i=1}^{k} \log \Gamma(\gamma_i) \end{aligned}$$

• Taking the derivative with respect to $\gamma_i$:

$$\frac{\partial L}{\partial \gamma_i} = \Psi'(\gamma_i)\Big( \alpha_i + \sum_{n=1}^{N}\phi_{ni} - \gamma_i \Big) - \Psi'\Big(\sum_{j=1}^{k}\gamma_j\Big) \sum_{j=1}^{k}\Big( \alpha_j + \sum_{n=1}^{N}\phi_{nj} - \gamma_j \Big)$$

• Setting this derivative to zero yields the maximizing value of the variational parameter $\gamma_i$:

$$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$
Inference and Parameter Estimation - Parameter Estimation
• In particular, given a corpus of documents $D = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M\}$, we wish to find parameters $\alpha$ and $\beta$ that maximize the (marginal) log likelihood of the data:

$$\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)$$

• We can thus find approximate empirical Bayes estimates for the LDA model via an alternating variational EM procedure
• The derivation yields the following iterative algorithm:
  – (E-step) For each document, find the optimizing values of the variational parameters $\{\gamma_d^*, \phi_d^* : d \in D\}$
  – (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters $\alpha$ and $\beta$
Inference and Parameter Estimation - Parameter Estimation
• The M-step update for the conditional multinomial parameter $\beta$ can be written out analytically:

$$\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^*_{dni}\, w_{dn}^j$$

• The update for the Dirichlet parameter $\alpha$ can be implemented using an efficient Newton-Raphson method:
  – The general update rule can be written as:

$$\alpha_{\text{new}} = \alpha_{\text{old}} - H(\alpha_{\text{old}})^{-1} \nabla(\alpha_{\text{old}})$$

    where $H(\alpha)$ and $\nabla(\alpha)$ are the Hessian matrix and the gradient, respectively, at the point $\alpha$
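The M-step update for $\beta$ is just a $\phi$-weighted count of word occurrences, normalized per topic. A minimal sketch (the function name is illustrative; `phis` holds the per-document $\phi^*$ from the E-step, and the Newton-Raphson update for $\alpha$ is omitted here):

```python
def m_step_beta(docs, phis, k, V):
    """beta_ij proportional to sum_d sum_n phi*_{dni} w_{dn}^j.
    docs: list of documents, each a list of vocabulary indices
    phis: phis[d][n][i] = phi for word n of document d, topic i"""
    beta = [[0.0] * V for _ in range(k)]
    for doc, phi in zip(docs, phis):
        for n, w in enumerate(doc):
            for i in range(k):
                beta[i][w] += phi[n][i]
    for i in range(k):                 # normalize each row over the vocabulary
        s = sum(beta[i])
        beta[i] = [x / s for x in beta[i]] if s > 0 else [1.0 / V] * V
    return beta
```

The guard for an empty row (a topic that received no responsibility) simply falls back to a uniform row; in practice this is where the smoothing of the next slide becomes important.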
Inference and Parameter Estimation - Smoothing
• We treat $\beta$ as a random $k \times V$ matrix (one row for each mixture component), where we assume that each row is independently drawn from an exchangeable Dirichlet distribution
• We consider a variational approach to Bayesian inference that places a separable distribution on the random variables $\beta$, $\theta$, and $\mathbf{z}$:

$$q(\beta_{1:k}, \theta_{1:M}, \mathbf{z}_{1:M} \mid \lambda, \phi, \gamma) = \prod_{i=1}^{k} \mathrm{Dir}(\beta_i \mid \lambda_i) \prod_{d=1}^{M} q_d(\theta_d, \mathbf{z}_d \mid \phi_d, \gamma_d)$$

• An additional update is required for the new variational parameter $\lambda$:

$$\lambda_{ij} = \eta + \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^*_{dni}\, w_{dn}^j$$
Discussion
• We can view LDA as a dimensionality reduction technique, in the spirit of LSI:
  – But LDA possesses proper underlying generative probabilistic semantics that make sense for the type of data that it models
• Exact inference is intractable for LDA, but any of a large suite of approximate inference algorithms can be used:
  – Laplace approximation, higher-order variational techniques, and Monte Carlo methods
• A variety of extensions of LDA can be considered in which the distributions on the topic variables are elaborated:
  – We could arrange the topics in a time series, essentially relaxing the full exchangeability assumption to one of partial exchangeability