Generative Process of LDA Exponential Family Newton Method Variational Inference Conclusion
Variational Inference for LDA
Zhao Zhou
The Hong Kong University of Science and Technology
Outline
1 Generative Process of LDA
2 Exponential Family
3 Newton Method
4 Variational Inference
   E-Step
   M-Step
5 Conclusion
Notations and terminology
A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, . . . , V}.
A document is a sequence of N words denoted by w = (w_1, w_2, . . . , w_N), where w_n is the nth word in the sequence.
A corpus is a collection of M documents denoted by D = {w_1, w_2, . . . , w_M}.
Latent Dirichlet allocation
LDA assumes the following generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ).
2. Choose θ ∼ Dir(α).
3. For each of the N words w_n:
   Choose a topic z_n ∼ Mult(θ).
   Choose a word w_n from p(w_n | z_n, β).
Note that
The dimension k of the Dirichlet distribution (the number of topics) is known and fixed.
The word probabilities are parameterized by a k × V matrix β, where β_ij = p(w^j = 1 | z^i = 1).
The randomness of N is ignored in subsequent slides.
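The generative process above can be simulated directly. A minimal sketch in Python, assuming hypothetical sizes (K topics, a V-word vocabulary) and a uniform placeholder for β; only the structure of the process is faithful to the slides:

```python
import math
import random

random.seed(0)

K, V, xi = 3, 8, 10.0        # assumed sizes: K topics, V-word vocabulary
alpha = [0.5] * K            # symmetric Dirichlet prior (illustrative choice)
beta = [[1.0 / V] * V for _ in range(K)]   # beta[i][j] = p(word j | topic i), uniform placeholder

def poisson(lam):
    # Knuth's method for sampling N ~ Poisson(lam)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def sample_document():
    N = poisson(xi)                                       # choose document length
    g = [random.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]                       # theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = random.choices(range(K), weights=theta)[0]    # z_n ~ Mult(theta)
        w = random.choices(range(V), weights=beta[z])[0]  # w_n from p(w | z, beta)
        words.append(w)
    return words
```

Sampling θ as normalized Gamma variates is the standard way to draw from a Dirichlet without external libraries.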
Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is:

p(θ, z, w | α, β) = p(θ|α) ∏_{n=1}^N p(z_n|θ) p(w_n|z_n, β)

The marginal distribution of a document is:

p(w | α, β) = ∫ p(θ|α) (∏_{n=1}^N ∑_{z_n} p(z_n|θ) p(w_n|z_n, β)) dθ.
Exponential family
An exponential family distribution has the form

p(x | η) = h(x) exp{η^T t(x) − a(η)}

The different parts of this equation are:
the natural parameter η,
the sufficient statistic t(x),
the underlying measure h(x),
the log normalizer a(η):

a(η) = log ∫ h(x) exp{η^T t(x)} dx
First Moment
The derivative of the log normalizer gives the first moment of the sufficient statistic:

d/dη a(η) = d/dη (log ∫ exp{η^T t(x)} h(x) dx)
          = ∫ t(x) exp{η^T t(x)} h(x) dx / ∫ exp{η^T t(x)} h(x) dx
          = ∫ t(x) exp{η^T t(x) − a(η)} h(x) dx
          = E[t(X)]
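As a quick sanity check of this identity, consider the Bernoulli distribution in exponential-family form, where t(x) = x, h(x) = 1, and a(η) = log(1 + e^η); the derivative of a should then equal the Bernoulli mean, the sigmoid of η. A minimal numerical sketch:

```python
import math

# Bernoulli in exponential-family form: t(x) = x, h(x) = 1,
# a(eta) = log(1 + e^eta), and E[t(X)] is the sigmoid of eta.
def a(eta):
    return math.log(1.0 + math.exp(eta))

eta, eps = 0.7, 1e-6
finite_diff = (a(eta + eps) - a(eta - eps)) / (2 * eps)  # numerical a'(eta)
mean = 1.0 / (1.0 + math.exp(-eta))                      # E[X] = sigmoid(eta)
```

The central finite difference of a(η) and the analytic mean agree to high precision, as the identity predicts.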
Computing E[log θ_i | α]
The Dirichlet distribution p(θ|α):

p(θ|α) = (Γ(∑_{i=1}^K α_i) / ∏_{i=1}^K Γ(α_i)) ∏_{i=1}^K θ_i^{α_i − 1}
       = exp{ ∑_{i=1}^K (α_i − 1) log θ_i + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i) }

Sufficient statistics: log θ_i.
Log normalizer: ∑_{i=1}^K log Γ(α_i) − log Γ(∑_{i=1}^K α_i)
The expectation E[log θ_i | α] is the derivative of the log normalizer a(α) with respect to α_i:

E[log θ_i | α] = ∂/∂α_i (∑_{i=1}^K log Γ(α_i) − log Γ(∑_{i=1}^K α_i))
             = ψ(α_i) − ψ(∑_{j=1}^K α_j),

where ψ is the digamma function, the first derivative of the log Gamma function.
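This identity can be verified by Monte Carlo: sample θ ∼ Dir(α) as normalized Gamma variates and compare the empirical mean of log θ_1 with ψ(α_1) − ψ(∑_j α_j). A sketch using only the standard library, with a hand-rolled digamma (recurrence plus asymptotic expansion) and made-up parameters:

```python
import math
import random

random.seed(1)

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x+1) - 1/x plus an asymptotic expansion
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x)
            + 1.0 / (120 * x ** 4) - 1.0 / (252 * x ** 6))

alpha = [2.0, 3.0, 5.0]                      # illustrative Dirichlet parameters
analytic = digamma(alpha[0]) - digamma(sum(alpha))

# Monte Carlo estimate of E[log theta_1] under Dir(alpha)
total, n_samples = 0.0, 100000
for _ in range(n_samples):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total += math.log(g[0] / sum(g))
empirical = total / n_samples
```

The empirical average converges to the analytic value at the usual O(1/√n) Monte Carlo rate.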
Unconstrained minimization
Suppose f is convex and twice continuously differentiable.
Assume the optimal value p* = inf_x f(x) is attained.
Newton's method can be interpreted as an iterative method for solving the optimality condition

∇f(x*) = 0.
Newton Step
Newton step:

Δx_nt = −∇²f(x)⁻¹ ∇f(x).

Interpretations:
x + Δx_nt minimizes the second-order approximation

f̂(x + v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v.

x + Δx_nt solves the linearized optimality condition

∇f(x + v) ≈ ∇f̂(x + v) = ∇f(x) + ∇²f(x) v = 0.
Newton decrement
A measure of the proximity of x to x*:

λ(x) = (∇f(x)^T ∇²f(x)⁻¹ ∇f(x))^{1/2},

equal to the norm of the Newton step in the quadratic norm defined by the Hessian:

λ(x) = (Δx_nt^T ∇²f(x) Δx_nt)^{1/2}
Backtracking Line Search
Exact line search:

t = arg min_{t>0} f(x + tΔx)

Backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1)):
Starting at t = 1, repeat t := βt until

f(x + tΔx) < f(x) + αt ∇f(x)^T Δx.
Newton Method
Repeat:
1. Compute the Newton step and decrement:
   Δx_nt = −∇²f(x)⁻¹ ∇f(x)
   λ² = ∇f(x)^T ∇²f(x)⁻¹ ∇f(x)
2. Stopping criterion: quit if λ²/2 ≤ ε.
3. Line search: choose step size t by backtracking line search.
4. Update: x := x + t Δx_nt.
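In the scalar case (where the Hessian inverse is just a division) the loop above can be sketched as follows; the objective f(x) = e^x + e^{-x} is an illustrative convex choice with minimizer x* = 0, not anything from the slides:

```python
import math

def newton(f, grad, hess, x0, eps=1e-10, alpha=0.25, beta=0.5):
    # Newton's method with backtracking line search, scalar case for clarity
    x = x0
    for _ in range(100):                       # safety cap on iterations
        g, h = grad(x), hess(x)
        dx = -g / h                            # Newton step (Hessian "inverse" is 1/h)
        lam2 = g * g / h                       # squared Newton decrement
        if lam2 / 2.0 <= eps:                  # stopping criterion: quit if lam2/2 <= eps
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g * dx:
            t *= beta                          # backtracking line search
        x = x + t * dx
    return x

# Illustrative strictly convex function with minimizer x* = 0
f = lambda x: math.exp(x) + math.exp(-x)
fp = lambda x: math.exp(x) - math.exp(-x)
fpp = lambda x: math.exp(x) + math.exp(-x)
x_star = newton(f, fp, fpp, x0=2.0)
```

Once the iterates enter the region of quadratic convergence, the backtracking search accepts the full step t = 1 and the error is roughly squared each iteration.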
Inference
The posterior distribution of the hidden variables:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

This distribution is intractable to compute, since

p(w | α, β) = (Γ(∑_j α_j) / ∏_j Γ(α_j)) ∫ (∏_{i=1}^k θ_i^{α_i−1}) (∏_{n=1}^N ∑_{i=1}^k ∏_{j=1}^V (θ_i β_ij)^{w_n^j}) dθ

due to the coupling between θ and β.
Variational distribution
The variational distribution on the latent variables:

q(θ, z | γ, φ) = q(θ|γ) ∏_{n=1}^N q(z_n | φ_n).

An optimization problem determines the values of γ and φ by minimizing the KL-divergence D:

(γ*, φ*) = arg min_{γ,φ} D(q(θ, z | γ, φ) || p(θ, z | w, α, β))
KL-Divergence
Now, we denote q(θ, z | γ, φ) by q.
The KL-divergence between q and p(θ, z | w, α, β) is

D(q || p) = E_q[log q] − E_q[log p(θ, z | w, α, β)]
         = E_q[log q] − E_q[log p(θ, z, w | α, β)] + log p(w | α, β)

Using Jensen's inequality, we bound log p(w | α, β) by

log p(w | α, β) = log ∫ ∑_z p(θ, z, w | α, β) dθ
               = log ∫ ∑_z p(θ, z, w | α, β) q(θ, z) / q(θ, z) dθ
               ≥ ∫ ∑_z q(θ, z) log (p(θ, z, w | α, β) / q(θ, z)) dθ
               = E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z)].
We denote E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z)] by L(γ, φ; α, β).
Then we have

log p(w | α, β) = L(γ, φ; α, β) + D(q(θ, z | γ, φ) || p(θ, z | w, α, β)).

Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is therefore equivalent to minimizing the KL-divergence between the variational posterior and the true posterior.
Variational Inference
Expand L(γ, φ; α, β) using the factorizations of p and q:

L(γ, φ; α, β) = E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z)]
             = E_q[log p(θ|α)] + E_q[log p(z|θ)] + E_q[log p(w|z, β)]
               − E_q[log q(θ)] − E_q[log q(z)]

We compute the five terms in turn.
Computing E_q[log p(θ|α)]
E_q[log p(θ|α)] is given by

E_q[log p(θ|α)] = ∑_{i=1}^K (α_i − 1) E_q[log θ_i] + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i).

Since θ is generated by Dir(θ|γ), we have E_q[log θ_i] = ψ(γ_i) − ψ(∑_{j=1}^K γ_j). Then:

E_q[log p(θ|α)] = ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i).
Computing E_q[log p(z|θ)]
E_q[log p(z|θ)] is given by

E_q[log p(z|θ)] = E_q[∑_{n=1}^N ∑_{i=1}^K z_n^i log θ_i]
              = ∑_{n=1}^N ∑_{i=1}^K E_q[z_n^i] E_q[log θ_i]
              = ∑_{n=1}^N ∑_{i=1}^K φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)),

where z is generated from Mult(z|φ) and θ is generated from Dir(θ|γ).
Computing E_q[log p(w|z, β)]
E_q[log p(w|z, β)] is given by

E_q[log p(w|z, β)] = E_q[∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V z_n^i w_n^j log β_ij]
                  = ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V E_q[z_n^i] w_n^j log β_ij
                  = ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V φ_ni w_n^j log β_ij
Computing E_q[log q(θ|γ)]
E_q[log q(θ|γ)] is given by

E_q[log q(θ|γ)] = ∑_{i=1}^K (γ_i − 1) E_q[log θ_i] + log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i)

Then, we have

E_q[log q(θ|γ)] = log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i) + ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
Computing E_q[log q(z|φ)]
E_q[log q(z|φ)] is given by

E_q[log q(z|φ)] = E_q[∑_{n=1}^N ∑_{i=1}^K z_n^i log φ_ni]
              = ∑_{n=1}^N ∑_{i=1}^K E_q[z_n^i] log φ_ni
              = ∑_{n=1}^N ∑_{i=1}^K φ_ni log φ_ni
Variational Inference
Finally, L(γ, φ; α, β) is

L(γ, φ; α, β) = log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  + ∑_{n=1}^N ∑_{i=1}^K φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  + ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V φ_ni w_n^j log β_ij
  − (log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i) + ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)))
  − ∑_{n=1}^N ∑_{i=1}^K φ_ni log φ_ni.
Variational Multinomial
Maximize L(γ, φ; α, β) with respect to φ_ni. Collecting the terms containing φ_ni (with v the vocabulary index of word w_n, i.e., w_n^v = 1) and adding a Lagrange multiplier for the constraint ∑_j φ_nj = 1:

L_[φni] = φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) + φ_ni log β_iv − φ_ni log φ_ni + λ(∑_{j=1}^K φ_nj − 1).
Taking the derivative with respect to φ_ni:

∂L/∂φ_ni = ψ(γ_i) − ψ(∑_{j=1}^K γ_j) + log β_iv − log φ_ni − 1 + λ.

Setting this derivative to zero yields

φ_ni ∝ β_iv exp(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)).
Variational Dirichlet
Maximize L(γ, φ; α, β) with respect to γ_i. The terms containing γ are

L_[γ] = ∑_{i=1}^K (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))(α_i + ∑_{n=1}^N φ_ni − γ_i) − log Γ(∑_{j=1}^K γ_j) + ∑_{i=1}^K log Γ(γ_i)
Taking the derivative with respect to γ_i:

∂L/∂γ_i = ψ′(γ_i)(α_i + ∑_{n=1}^N φ_ni − γ_i) − ψ′(∑_{j=1}^K γ_j) ∑_{j=1}^K (α_j + ∑_{n=1}^N φ_nj − γ_j)

Setting this derivative to zero yields:

γ_i = α_i + ∑_{n=1}^N φ_ni.
Variational Inference Algorithm
1 initialize φ_ni^0 = 1/K for all i and n
2 initialize γ_i = α_i + N/K for all i
3 repeat
4   for n = 1 to N
5     for i = 1 to K
        (a) φ_ni^{t+1} = β_{i w_n} exp(ψ(γ_i^t))
        (b) normalize φ_n^{t+1} to sum to 1
6   γ^{t+1} = α + ∑_{n=1}^N φ_n^{t+1}
7 until convergence
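A direct transcription of this algorithm for a single document might look like the sketch below; the hand-rolled digamma helper, the document encoding (a list of word indices), the fixed iteration count standing in for the convergence test, and the example numbers are all assumptions for illustration:

```python
import math

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x+1) - 1/x plus an asymptotic expansion
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x)
            + 1.0 / (120 * x ** 4) - 1.0 / (252 * x ** 6))

def e_step(doc, alpha, beta, iters=50):
    # Coordinate ascent from the algorithm above, for one document.
    # doc: word indices; alpha: length-K prior; beta[i][j] = p(word j | topic i)
    K, N = len(alpha), len(doc)
    phi = [[1.0 / K] * K for _ in range(N)]            # step 1
    gamma = [a + N / K for a in alpha]                 # step 2
    for _ in range(iters):                             # step 3: repeat
        dig = [digamma(g) for g in gamma]
        for n, w in enumerate(doc):
            raw = [beta[i][w] * math.exp(dig[i]) for i in range(K)]
            s = sum(raw)
            phi[n] = [r / s for r in raw]              # update phi and normalize
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N))
                 for i in range(K)]                    # gamma update
    return gamma, phi

# Tiny made-up example: K = 2 topics, V = 3 words, document (0, 0, 2)
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
gamma, phi = e_step([0, 0, 2], alpha, beta)
```

Because each φ_n is normalized, ψ(∑_j γ_j) cancels out of the update, which is why the algorithm can use exp(ψ(γ_i^t)) alone; note also that ∑_i γ_i = ∑_i α_i + N after every pass.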
Parameter Estimation
In the variational E-step, maximize the lower bound L(γ, φ; α, β) with respect to the variational parameters γ and φ.
In the M-step, maximize the bound with respect to the model parameters α and β.
Conditional Multinomials
Maximize L(γ, φ; α, β) with respect to β, with Lagrange multipliers enforcing ∑_j β_ij = 1:

L_[β] = ∑_{d=1}^M ∑_{n=1}^{N_d} ∑_{i=1}^K ∑_{j=1}^V φ_dni w_dn^j log β_ij + ∑_{i=1}^K λ_i (∑_{j=1}^V β_ij − 1).

Taking the derivative with respect to β_ij and setting it to zero:

β_ij ∝ ∑_{d=1}^M ∑_{n=1}^{N_d} φ_dni w_dn^j.
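The β update is just an accumulation of expected word counts followed by row normalization. A minimal sketch (documents as lists of word indices, φ as produced by the E-step; the example numbers are made up, and a real implementation would smooth topics with zero expected count):

```python
def m_step_beta(docs, phis, K, V):
    # beta_ij ∝ sum_d sum_n phi_dni * w_dn^j: accumulate expected word counts
    # per topic, then normalize each topic's row into a vocabulary distribution.
    # docs: list of word-index lists; phis[d][n][i]: E-step responsibilities.
    beta = [[0.0] * V for _ in range(K)]
    for doc, phi in zip(docs, phis):
        for n, w in enumerate(doc):
            for i in range(K):
                beta[i][w] += phi[n][i]        # expected count of word w under topic i
    for i in range(K):
        s = sum(beta[i])                       # a real implementation would smooth s == 0
        beta[i] = [c / s for c in beta[i]]
    return beta

docs = [[0, 1], [2]]                                   # two tiny documents
phis = [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 0.0]]]        # made-up E-step output
beta_hat = m_step_beta(docs, phis, K=2, V=3)
```

The indicator w_dn^j selects the vocabulary entry of the nth word of document d, which is why indexing `beta[i][doc word]` implements the double sum directly.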
Dirichlet
Maximize L(γ, φ; α, β) with respect to α. The terms containing α are

L_[α] = ∑_{d=1}^M (log Γ(∑_{j=1}^K α_j) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_di) − ψ(∑_{j=1}^K γ_dj)))

Taking the derivative with respect to α_i:

∂L/∂α_i = M(ψ(∑_{j=1}^K α_j) − ψ(α_i)) + ∑_{d=1}^M (ψ(γ_di) − ψ(∑_{j=1}^K γ_dj)).

It is difficult to solve for α_i by setting the derivative to zero, so we use Newton's method.
Newton Method
Compute the Hessian matrix:

∂²L/∂α_i∂α_j = M(ψ′(∑_{k=1}^K α_k) − δ(i, j) ψ′(α_i)).

Input this Hessian matrix and the derivative into Newton's method.
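Since this Hessian is a diagonal matrix plus a constant times the all-ones outer product, the Newton direction can be obtained in linear time via the Sherman–Morrison identity rather than a full matrix inversion. A sketch of that solve (the hand-rolled trigamma helper and the structure-exploiting solver are illustrative assumptions, not part of the slides):

```python
import math

def trigamma(x):
    # psi'(x) via the recurrence psi'(x) = psi'(x+1) + 1/x^2 plus an asymptotic expansion
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    return (r + 1.0 / x + 1.0 / (2 * x * x) + 1.0 / (6 * x ** 3)
            - 1.0 / (30 * x ** 5) + 1.0 / (42 * x ** 7))

def newton_step_alpha(grad, alpha, M):
    # Solve H d = grad for H_ij = M(psi'(sum_k alpha_k) - delta(i,j) psi'(alpha_i)).
    # H = diag(q) + c * 11^T with c = M psi'(sum alpha) and q_i = -M psi'(alpha_i),
    # so Sherman-Morrison gives the solution with O(K) work.
    c = M * trigamma(sum(alpha))
    q = [-M * trigamma(a) for a in alpha]
    b = sum(g / qi for g, qi in zip(grad, q)) / (1.0 / c + sum(1.0 / qi for qi in q))
    return [(g - b) / qi for g, qi in zip(grad, q)]
```

Multiplying the returned direction back through H (which is just M(ψ′(∑α)·∑_j d_j − ψ′(α_i) d_i) per coordinate) recovers the input gradient, confirming the solve.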
Conclusion
Variational inference is used to approximate intractable integrals arising in Bayesian networks.
Variational inference can be seen as an extension of the EM algorithm that computes the entire posterior distribution of the latent variables.
Usually, the derived "best" variational distribution lies in the same family as the corresponding prior distribution over the variable.
The derivation above serves as a good template for variational inference on other topic models.