Generative Process of LDA Exponential Family Newton Method Variational Inference Conclusion
Variational Inference for LDA
Zhao Zhou
The Hong Kong University of Science and Technology
Outline
1 Generative Process of LDA
2 Exponential Family
3 Newton Method
4 Variational Inference
   E-Step
   M-Step
5 Conclusion
Notations and terminology
A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, . . . , V}.
A document is a sequence of N words denoted by w = (w_1, w_2, . . . , w_N), where w_n is the nth word in the sequence.
A corpus is a collection of M documents denoted by D = {w_1, w_2, . . . , w_M}.
Latent Dirichlet allocation
LDA assumes the following generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ).
2. Choose θ ∼ Dir(α).
3. For each of the N words w_n:
   Choose a topic z_n ∼ Mult(θ).
   Choose a word w_n from p(w_n | z_n, β).
Note that
The dimension k of the Dirichlet distribution (the number of topics) is known and fixed.
The word probabilities are parameterized by a k × V matrix β, where β_ij = p(w^j = 1 | z^i = 1).
The randomness of N is ignored in subsequent slides.
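The generative process above can be simulated directly. A minimal sketch in Python, assuming hypothetical sizes (K topics, a V-word vocabulary) and a uniform placeholder for β; only the structure of the process is faithful to the slides:

```python
import math
import random

random.seed(0)

K, V, xi = 3, 8, 10.0        # assumed sizes: K topics, V-word vocabulary
alpha = [0.5] * K            # symmetric Dirichlet prior (illustrative choice)
beta = [[1.0 / V] * V for _ in range(K)]   # beta[i][j] = p(word j | topic i), uniform placeholder

def poisson(lam):
    # Knuth's method for sampling N ~ Poisson(lam)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def sample_document():
    N = poisson(xi)                                       # choose document length
    g = [random.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]                       # theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = random.choices(range(K), weights=theta)[0]    # z_n ~ Mult(theta)
        w = random.choices(range(V), weights=beta[z])[0]  # w_n from p(w | z, beta)
        words.append(w)
    return words
```

Sampling θ as normalized Gamma variates is the standard way to draw from a Dirichlet without external libraries.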
Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is:

p(θ, z, w | α, β) = p(θ|α) ∏_{n=1}^N p(z_n|θ) p(w_n|z_n, β)

The marginal distribution of a document is:

p(w | α, β) = ∫ p(θ|α) (∏_{n=1}^N ∑_{z_n} p(z_n|θ) p(w_n|z_n, β)) dθ.
Exponential family
An exponential family distribution has the form

p(x | η) = h(x) exp{η^T t(x) − a(η)}

The different parts of this equation are:
the natural parameter η,
the sufficient statistic t(x),
the underlying measure h(x),
the log normalizer a(η):

a(η) = log ∫ h(x) exp{η^T t(x)} dx
First Moment
The derivative of the log normalizer gives the first moment of the sufficient statistic:

d/dη a(η) = d/dη (log ∫ exp{η^T t(x)} h(x) dx)
          = ∫ t(x) exp{η^T t(x)} h(x) dx / ∫ exp{η^T t(x)} h(x) dx
          = ∫ t(x) exp{η^T t(x) − a(η)} h(x) dx
          = E[t(X)]
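As a quick sanity check of this identity, consider the Bernoulli distribution in exponential-family form, where t(x) = x, h(x) = 1, and a(η) = log(1 + e^η); the derivative of a should then equal the Bernoulli mean, the sigmoid of η. A minimal numerical sketch:

```python
import math

# Bernoulli in exponential-family form: t(x) = x, h(x) = 1,
# a(eta) = log(1 + e^eta), and E[t(X)] is the sigmoid of eta.
def a(eta):
    return math.log(1.0 + math.exp(eta))

eta, eps = 0.7, 1e-6
finite_diff = (a(eta + eps) - a(eta - eps)) / (2 * eps)  # numerical a'(eta)
mean = 1.0 / (1.0 + math.exp(-eta))                      # E[X] = sigmoid(eta)
```

The central finite difference of a(η) and the analytic mean agree to high precision, as the identity predicts.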
Computing E[log θ_i | α]
The Dirichlet distribution p(θ|α):

p(θ|α) = (Γ(∑_{i=1}^K α_i) / ∏_{i=1}^K Γ(α_i)) ∏_{i=1}^K θ_i^{α_i − 1}
       = exp{ ∑_{i=1}^K (α_i − 1) log θ_i + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i) }

Sufficient statistics: log θ_i.
Log normalizer: ∑_{i=1}^K log Γ(α_i) − log Γ(∑_{i=1}^K α_i)
The expectation E[log θ_i | α] is the derivative of the log normalizer a(α) with respect to α_i:

E[log θ_i | α] = ∂/∂α_i (∑_{i=1}^K log Γ(α_i) − log Γ(∑_{i=1}^K α_i))
             = ψ(α_i) − ψ(∑_{j=1}^K α_j),

where ψ is the digamma function, the first derivative of the log Gamma function.
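This identity can be verified by Monte Carlo: sample θ ∼ Dir(α) as normalized Gamma variates and compare the empirical mean of log θ_1 with ψ(α_1) − ψ(∑_j α_j). A sketch using only the standard library, with a hand-rolled digamma (recurrence plus asymptotic expansion) and made-up parameters:

```python
import math
import random

random.seed(1)

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x+1) - 1/x plus an asymptotic expansion
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x)
            + 1.0 / (120 * x ** 4) - 1.0 / (252 * x ** 6))

alpha = [2.0, 3.0, 5.0]                      # illustrative Dirichlet parameters
analytic = digamma(alpha[0]) - digamma(sum(alpha))

# Monte Carlo estimate of E[log theta_1] under Dir(alpha)
total, n_samples = 0.0, 100000
for _ in range(n_samples):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total += math.log(g[0] / sum(g))
empirical = total / n_samples
```

The empirical average converges to the analytic value at the usual O(1/√n) Monte Carlo rate.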
Unconstrained minimization
Suppose f is convex and twice continuously differentiable.
Assume the optimal value p* = inf_x f(x) is attained.
Newton's method can be interpreted as an iterative method for solving the optimality condition

∇f(x*) = 0.
Newton Step
Newton step:

Δx_nt = −∇²f(x)⁻¹ ∇f(x).

Interpretations:
x + Δx_nt minimizes the second-order approximation

f̂(x + v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v.

x + Δx_nt solves the linearized optimality condition

∇f(x + v) ≈ ∇f̂(x + v) = ∇f(x) + ∇²f(x) v = 0.
Newton decrement
A measure of the proximity of x to x*:

λ(x) = (∇f(x)^T ∇²f(x)⁻¹ ∇f(x))^{1/2},

equal to the norm of the Newton step in the quadratic norm defined by the Hessian:

λ(x) = (Δx_nt^T ∇²f(x) Δx_nt)^{1/2}
Backtracking Line Search
Exact line search:

t = arg min_{t>0} f(x + tΔx)

Backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1)):
Starting at t = 1, repeat t := βt until

f(x + tΔx) < f(x) + αt ∇f(x)^T Δx.
Newton Method
Repeat:
1. Compute the Newton step and decrement:
   Δx_nt = −∇²f(x)⁻¹ ∇f(x)
   λ² = ∇f(x)^T ∇²f(x)⁻¹ ∇f(x)
2. Stopping criterion: quit if λ²/2 ≤ ε.
3. Line search: choose step size t by backtracking line search.
4. Update: x := x + t Δx_nt.
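In the scalar case (where the Hessian inverse is just a division) the loop above can be sketched as follows; the objective f(x) = e^x + e^{-x} is an illustrative convex choice with minimizer x* = 0, not anything from the slides:

```python
import math

def newton(f, grad, hess, x0, eps=1e-10, alpha=0.25, beta=0.5):
    # Newton's method with backtracking line search, scalar case for clarity
    x = x0
    for _ in range(100):                       # safety cap on iterations
        g, h = grad(x), hess(x)
        dx = -g / h                            # Newton step (Hessian "inverse" is 1/h)
        lam2 = g * g / h                       # squared Newton decrement
        if lam2 / 2.0 <= eps:                  # stopping criterion: quit if lam2/2 <= eps
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g * dx:
            t *= beta                          # backtracking line search
        x = x + t * dx
    return x

# Illustrative strictly convex function with minimizer x* = 0
f = lambda x: math.exp(x) + math.exp(-x)
fp = lambda x: math.exp(x) - math.exp(-x)
fpp = lambda x: math.exp(x) + math.exp(-x)
x_star = newton(f, fp, fpp, x0=2.0)
```

Once the iterates enter the region of quadratic convergence, the backtracking search accepts the full step t = 1 and the error is roughly squared each iteration.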
Inference
The posterior distribution of the hidden variables:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

This distribution is intractable to compute, since

p(w | α, β) = (Γ(∑_j α_j) / ∏_j Γ(α_j)) ∫ (∏_{i=1}^k θ_i^{α_i−1}) (∏_{n=1}^N ∑_{i=1}^k ∏_{j=1}^V (θ_i β_ij)^{w_n^j}) dθ

due to the coupling between θ and β.
Variational distribution
The variational distribution on the latent variables:

q(θ, z | γ, φ) = q(θ|γ) ∏_{n=1}^N q(z_n | φ_n).

An optimization problem determines the values of γ and φ by minimizing the KL-divergence D:

(γ*, φ*) = arg min_{γ,φ} D(q(θ, z | γ, φ) || p(θ, z | w, α, β))
KL-Divergence
Now, we denote q(θ, z | γ, φ) by q.
The KL-divergence between q and p(θ, z | w, α, β) is

D(q || p) = E_q[log q] − E_q[log p(θ, z | w, α, β)]
         = E_q[log q] − E_q[log p(θ, z, w | α, β)] + log p(w | α, β)

Using Jensen's inequality, we bound log p(w | α, β) by

log p(w | α, β) = log ∫ ∑_z p(θ, z, w | α, β) dθ
               = log ∫ ∑_z p(θ, z, w | α, β) q(θ, z) / q(θ, z) dθ
               ≥ ∫ ∑_z q(θ, z) log (p(θ, z, w | α, β) / q(θ, z)) dθ
               = E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z)].
We denote E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z)] by L(γ, φ; α, β).
Then we have

log p(w | α, β) = L(γ, φ; α, β) + D(q(θ, z | γ, φ) || p(θ, z | w, α, β)).

Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is therefore equivalent to minimizing the KL-divergence between the variational posterior and the true posterior.
Variational Inference
Expand L(γ, φ; α, β) using the factorizations of p and q:

L(γ, φ; α, β) = E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z)]
             = E_q[log p(θ|α)] + E_q[log p(z|θ)] + E_q[log p(w|z, β)]
               − E_q[log q(θ)] − E_q[log q(z)]

We compute the five terms in turn.
Computing E_q[log p(θ|α)]
E_q[log p(θ|α)] is given by

E_q[log p(θ|α)] = ∑_{i=1}^K (α_i − 1) E_q[log θ_i] + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i).

Since θ is generated by Dir(θ|γ), we have E_q[log θ_i] = ψ(γ_i) − ψ(∑_{j=1}^K γ_j). Then:

E_q[log p(θ|α)] = ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i).
Computing E_q[log p(z|θ)]
E_q[log p(z|θ)] is given by

E_q[log p(z|θ)] = E_q[∑_{n=1}^N ∑_{i=1}^K z_n^i log θ_i]
              = ∑_{n=1}^N ∑_{i=1}^K E_q[z_n^i] E_q[log θ_i]
              = ∑_{n=1}^N ∑_{i=1}^K φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)),

where z is generated from Mult(z|φ) and θ is generated from Dir(θ|γ).
Computing E_q[log p(w|z, β)]
E_q[log p(w|z, β)] is given by

E_q[log p(w|z, β)] = E_q[∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V z_n^i w_n^j log β_ij]
                  = ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V E_q[z_n^i] w_n^j log β_ij
                  = ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V φ_ni w_n^j log β_ij
Computing E_q[log q(θ|γ)]
E_q[log q(θ|γ)] is given by

E_q[log q(θ|γ)] = ∑_{i=1}^K (γ_i − 1) E_q[log θ_i] + log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i)

Then, we have

E_q[log q(θ|γ)] = log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i) + ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
Computing E_q[log q(z|φ)]
E_q[log q(z|φ)] is given by

E_q[log q(z|φ)] = E_q[∑_{n=1}^N ∑_{i=1}^K z_n^i log φ_ni]
              = ∑_{n=1}^N ∑_{i=1}^K E_q[z_n^i] log φ_ni
              = ∑_{n=1}^N ∑_{i=1}^K φ_ni log φ_ni
Variational Inference
Finally, L(γ, φ; α, β) is

L(γ, φ; α, β) = log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  + ∑_{n=1}^N ∑_{i=1}^K φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  + ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V φ_ni w_n^j log β_ij
  − (log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i) + ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)))
  − ∑_{n=1}^N ∑_{i=1}^K φ_ni log φ_ni.
Variational Multinomial
Maximize L(γ, φ; α, β) with respect to φ_ni. Collecting the terms containing φ_ni (with v the vocabulary index of word w_n, i.e., w_n^v = 1) and adding a Lagrange multiplier for the constraint ∑_j φ_nj = 1:

L_[φni] = φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) + φ_ni log β_iv − φ_ni log φ_ni + λ(∑_{j=1}^K φ_nj − 1).
Taking the derivative with respect to φ_ni:

∂L/∂φ_ni = ψ(γ_i) − ψ(∑_{j=1}^K γ_j) + log β_iv − log φ_ni − 1 + λ.

Setting this derivative to zero yields

φ_ni ∝ β_iv exp(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)).
Variational Dirichlet
Maximize L(γ, φ; α, β) with respect to γ_i. The terms containing γ are

L_[γ] = ∑_{i=1}^K (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))(α_i + ∑_{n=1}^N φ_ni − γ_i) − log Γ(∑_{j=1}^K γ_j) + ∑_{i=1}^K log Γ(γ_i)
Taking the derivative with respect to γ_i:

∂L/∂γ_i = ψ′(γ_i)(α_i + ∑_{n=1}^N φ_ni − γ_i) − ψ′(∑_{j=1}^K γ_j) ∑_{j=1}^K (α_j + ∑_{n=1}^N φ_nj − γ_j)

Setting this derivative to zero yields:

γ_i = α_i + ∑_{n=1}^N φ_ni.
Variational Inference Algorithm
1 initialize φ_ni^0 = 1/K for all i and n
2 initialize γ_i = α_i + N/K for all i
3 repeat
4   for n = 1 to N
5     for i = 1 to K
        (a) φ_ni^{t+1} = β_{i w_n} exp(ψ(γ_i^t))
        (b) normalize φ_n^{t+1} to sum to 1
6   γ^{t+1} = α + ∑_{n=1}^N φ_n^{t+1}
7 until convergence
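A direct transcription of this algorithm for a single document might look like the sketch below; the hand-rolled digamma helper, the document encoding (a list of word indices), the fixed iteration count standing in for the convergence test, and the example numbers are all assumptions for illustration:

```python
import math

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x+1) - 1/x plus an asymptotic expansion
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x)
            + 1.0 / (120 * x ** 4) - 1.0 / (252 * x ** 6))

def e_step(doc, alpha, beta, iters=50):
    # Coordinate ascent from the algorithm above, for one document.
    # doc: word indices; alpha: length-K prior; beta[i][j] = p(word j | topic i)
    K, N = len(alpha), len(doc)
    phi = [[1.0 / K] * K for _ in range(N)]            # step 1
    gamma = [a + N / K for a in alpha]                 # step 2
    for _ in range(iters):                             # step 3: repeat
        dig = [digamma(g) for g in gamma]
        for n, w in enumerate(doc):
            raw = [beta[i][w] * math.exp(dig[i]) for i in range(K)]
            s = sum(raw)
            phi[n] = [r / s for r in raw]              # update phi and normalize
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N))
                 for i in range(K)]                    # gamma update
    return gamma, phi

# Tiny made-up example: K = 2 topics, V = 3 words, document (0, 0, 2)
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
gamma, phi = e_step([0, 0, 2], alpha, beta)
```

Because each φ_n is normalized, ψ(∑_j γ_j) cancels out of the update, which is why the algorithm can use exp(ψ(γ_i^t)) alone; note also that ∑_i γ_i = ∑_i α_i + N after every pass.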
Parameter Estimation
In the variational E-step, maximize the lower bound L(γ, φ; α, β) with respect to the variational parameters γ and φ.
In the M-step, maximize the bound with respect to the model parameters α and β.
Conditional Multinomials
Maximize L(γ, φ; α, β) with respect to β, with Lagrange multipliers enforcing ∑_j β_ij = 1:

L_[β] = ∑_{d=1}^M ∑_{n=1}^{N_d} ∑_{i=1}^K ∑_{j=1}^V φ_dni w_dn^j log β_ij + ∑_{i=1}^K λ_i (∑_{j=1}^V β_ij − 1).

Taking the derivative with respect to β_ij and setting it to zero:

β_ij ∝ ∑_{d=1}^M ∑_{n=1}^{N_d} φ_dni w_dn^j.
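The β update is just an accumulation of expected word counts followed by row normalization. A minimal sketch (documents as lists of word indices, φ as produced by the E-step; the example numbers are made up, and a real implementation would smooth topics with zero expected count):

```python
def m_step_beta(docs, phis, K, V):
    # beta_ij ∝ sum_d sum_n phi_dni * w_dn^j: accumulate expected word counts
    # per topic, then normalize each topic's row into a vocabulary distribution.
    # docs: list of word-index lists; phis[d][n][i]: E-step responsibilities.
    beta = [[0.0] * V for _ in range(K)]
    for doc, phi in zip(docs, phis):
        for n, w in enumerate(doc):
            for i in range(K):
                beta[i][w] += phi[n][i]        # expected count of word w under topic i
    for i in range(K):
        s = sum(beta[i])                       # a real implementation would smooth s == 0
        beta[i] = [c / s for c in beta[i]]
    return beta

docs = [[0, 1], [2]]                                   # two tiny documents
phis = [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 0.0]]]        # made-up E-step output
beta_hat = m_step_beta(docs, phis, K=2, V=3)
```

The indicator w_dn^j selects the vocabulary entry of the nth word of document d, which is why indexing `beta[i][doc word]` implements the double sum directly.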
Dirichlet
Maximize L(γ, φ; α, β) with respect to α. The terms containing α are

L_[α] = ∑_{d=1}^M (log Γ(∑_{j=1}^K α_j) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_di) − ψ(∑_{j=1}^K γ_dj)))

Taking the derivative with respect to α_i:

∂L/∂α_i = M(ψ(∑_{j=1}^K α_j) − ψ(α_i)) + ∑_{d=1}^M (ψ(γ_di) − ψ(∑_{j=1}^K γ_dj)).

It is difficult to solve for α_i by setting the derivative to zero, so we use Newton's method.
Newton Method
Compute the Hessian matrix:

∂²L/∂α_i∂α_j = M(ψ′(∑_{k=1}^K α_k) − δ(i, j) ψ′(α_i)).

Input this Hessian matrix and the derivative into Newton's method.
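Since this Hessian is a diagonal matrix plus a constant times the all-ones outer product, the Newton direction can be obtained in linear time via the Sherman–Morrison identity rather than a full matrix inversion. A sketch of that solve (the hand-rolled trigamma helper and the structure-exploiting solver are illustrative assumptions, not part of the slides):

```python
import math

def trigamma(x):
    # psi'(x) via the recurrence psi'(x) = psi'(x+1) + 1/x^2 plus an asymptotic expansion
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    return (r + 1.0 / x + 1.0 / (2 * x * x) + 1.0 / (6 * x ** 3)
            - 1.0 / (30 * x ** 5) + 1.0 / (42 * x ** 7))

def newton_step_alpha(grad, alpha, M):
    # Solve H d = grad for H_ij = M(psi'(sum_k alpha_k) - delta(i,j) psi'(alpha_i)).
    # H = diag(q) + c * 11^T with c = M psi'(sum alpha) and q_i = -M psi'(alpha_i),
    # so Sherman-Morrison gives the solution with O(K) work.
    c = M * trigamma(sum(alpha))
    q = [-M * trigamma(a) for a in alpha]
    b = sum(g / qi for g, qi in zip(grad, q)) / (1.0 / c + sum(1.0 / qi for qi in q))
    return [(g - b) / qi for g, qi in zip(grad, q)]
```

Multiplying the returned direction back through H (which is just M(ψ′(∑α)·∑_j d_j − ψ′(α_i) d_i) per coordinate) recovers the input gradient, confirming the solve.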
Conclusion
Variational inference is used to approximate intractable integrals arising in Bayesian networks.
Variational inference can be seen as an extension of the EM algorithm that computes the entire posterior distribution of the latent variables.
Usually, the derived "best" variational distribution lies in the same family as the corresponding prior distribution over the variable.
The derivation above serves as a good template for variational inference on other topic models.