Date post: 13-Jul-2020
Optimization perspective on approximate Bayesian inference Juho Kim December 6, 2016
Page 1: Optimization perspective in approximate posterior inferenceniaohe.ise.illinois.edu/...Presentation10_Juho_Kim.pdf · Inference problem Given a dataset y ={𝑦1,…,𝑦𝑛}: Bayes

Optimization perspective on approximate Bayesian inference

Juho Kim

December 6, 2016

Project goals

• Solve an approximate Bayesian inference problem from the perspective of optimization.

• Consider variational Bayesian inference based on various divergence measures.

• Analyze convergence of each optimization empirically and theoretically (if possible).

Inference problem

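Given a dataset $y = \{y_1, \dots, y_n\}$ and parameters $\theta$ with prior $p_0(\theta)$:

Bayes rule:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p_0(\theta)}{p(y)}$$

Computing the posterior distribution $p(\theta \mid y)$ is known as the inference problem.

But:

$$p(y) = \int p(y, \theta)\, d\theta$$

This integral can be very high-dimensional and difficult to compute.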
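To make the difficulty concrete, here is a minimal sketch under assumed toy settings that are not from the slides: a coin-flip likelihood with a uniform prior on θ. In one dimension the evidence integral is easy to approximate on a grid and can be checked against its closed form, but the number of grid points grows as m^d with the dimension d of θ.

```python
import math

def evidence_grid(k, n, m=10_000):
    """Midpoint-rule estimate of p(y), the integral of theta^k (1-theta)^(n-k) over [0, 1]."""
    h = 1.0 / m
    return h * sum(((i + 0.5) * h) ** k * (1.0 - (i + 0.5) * h) ** (n - k)
                   for i in range(m))

def evidence_exact(k, n):
    """Closed form under the uniform prior: the Beta function B(k+1, n-k+1)."""
    return math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)

k, n = 7, 10                      # hypothetical data: 7 heads in 10 flips
approx = evidence_grid(k, n)
exact = evidence_exact(k, n)

# Grid points needed to keep 100 points per axis when theta has d dimensions.
points_needed = {d: 100 ** d for d in (1, 2, 5, 10)}
```

With 100 points per axis, a 10-dimensional θ would already need 10^20 evaluations of the joint density, which is why approximate inference is used instead of direct quadrature.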

Approximate Bayesian inference

There are two approaches to approximate inference. They have complementary strengths and weaknesses.

Variational Bayesian inference

In variational Bayesian inference,

• Find an approximate and tractable density that is maximally similar to the true posterior distribution.

• Formulate a density estimation problem as an optimization problem.

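We can use the Kullback-Leibler (KL) divergence as the measure:

$$\mathrm{KL}(q \,\|\, p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid y)}\, d\theta$$

Then we minimize the KL divergence over $q$.

But we still cannot compute $p(\theta \mid y)$, because it depends on the evidence $p(y)$.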

Variational lower-bound

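We can solve the equivalent optimization problem:

$$\min_q \mathrm{KL}(q \,\|\, p) = \min_q \Big\{ \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(y, \theta)] + \log p(y) \Big\}$$

We now remove the intractable terms. Since $\log p(y)$ does not depend on $q$, it can be dropped, which leaves

$$\max_q \mathcal{L}(q), \qquad \mathcal{L}(q) = \mathbb{E}_q[\log p(y, \theta)] - \mathbb{E}_q[\log q(\theta)]$$

$\mathcal{L}(q)$ is the variational lower bound, or evidence lower bound (ELBO): $\mathcal{L}(q) \le \log p(y)$.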
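As an illustration, the sketch below estimates the ELBO, E_q[log p(y, θ)] − E_q[log q(θ)], by Monte Carlo. The model is an assumption chosen for convenience and is not from the slides: y_i ~ N(θ, 1) with prior θ ~ N(0, prior_std²), for which the exact posterior and log p(y) are available as a reference. All function and variable names are hypothetical.

```python
import math
import random

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def log_joint(theta, y, prior_std):
    """log p(y, theta) for y_i ~ N(theta, 1) and theta ~ N(0, prior_std^2)."""
    return log_normal(theta, 0.0, prior_std) + sum(log_normal(yi, theta, 1.0) for yi in y)

def elbo_mc(y, prior_std, q_mean, q_std, num_samples, rng):
    """Average of log p(y, theta) - log q(theta) over samples theta ~ q."""
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(q_mean, q_std)
        total += log_joint(theta, y, prior_std) - log_normal(theta, q_mean, q_std)
    return total / num_samples

rng = random.Random(0)
prior_std = 10.0
y = [rng.gauss(1.0, 1.0) for _ in range(20)]   # synthetic data, for illustration only

# Exact posterior of this conjugate model, used only as a reference.
post_prec = len(y) + 1.0 / prior_std ** 2
post_mean = sum(y) / post_prec
post_std = post_prec ** -0.5

# log p(y) = log p(y, theta) - log p(theta | y) holds for any theta.
log_evidence = (log_joint(post_mean, y, prior_std)
                - log_normal(post_mean, post_mean, post_std))

elbo_at_posterior = elbo_mc(y, prior_std, post_mean, post_std, 2000, rng)
elbo_shifted = elbo_mc(y, prior_std, post_mean + 1.0, post_std, 2000, rng)
```

When q equals the exact posterior the bound is tight and the estimate matches log p(y). Shifting the mean of q lowers the ELBO by the KL divergence between q and the posterior.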

Stochastic variational inference

Suppose the joint distribution factorizes into a prior and one likelihood term per data point:

$$p(\theta, D) = p_0(\theta) \prod_{n} p(y_n \mid \theta)$$

We can run stochastic (natural) gradient descent on this optimization problem, i.e. stochastic variational inference.
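A minimal sketch of the stochastic natural-gradient idea, assuming a conjugate toy model that is not from the slides: y_n ~ N(θ, 1) with prior θ ~ N(0, prior_var) and a Gaussian q(θ). In this setting a natural-gradient step on the ELBO reduces to a convex combination of the current natural parameters and a minibatch estimate of the optimal ones. The step-size schedule and all names are assumptions.

```python
import random

rng = random.Random(0)
N, prior_var, batch_size = 1000, 100.0, 50
y = [rng.gauss(2.0, 1.0) for _ in range(N)]   # synthetic data, for illustration only

# Natural parameters of q(theta) = N(m, s^2): eta1 = m / s^2 and the precision 1 / s^2.
eta1, prec = 0.0, 1.0

for t in range(2000):
    rho = (t + 1) ** -0.7                      # decaying Robbins-Monro step size
    batch = [y[rng.randrange(N)] for _ in range(batch_size)]
    # Noisy estimate of the optimal natural parameters. The minibatch sum is
    # rescaled by N / batch_size so that it is unbiased for the full-data sum.
    eta1_hat = (N / batch_size) * sum(batch)
    prec_hat = 1.0 / prior_var + N
    # Natural-gradient step: move toward the noisy estimate by a fraction rho.
    eta1 = (1 - rho) * eta1 + rho * eta1_hat
    prec = (1 - rho) * prec + rho * prec_hat

q_mean, q_var = eta1 / prec, 1.0 / prec

# Exact posterior, for comparison.
exact_var = 1.0 / (1.0 / prior_var + N)
exact_mean = exact_var * sum(y)
```

Each iteration touches only batch_size of the N data points, yet q_mean approaches the exact posterior mean as the step size decays.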

When KL divergence does not work well

• Variational inference does not work well for non-smooth potentials.

• KL divergence tends to underestimate the support due to its zero-forcing behavior.

→ $\mathrm{KL}(q \,\|\, p)$ is infinite whenever $p(\theta \mid y) = 0$ and $q(\theta) > 0$, so the optimal variational distribution $q$ must be zero wherever $p(\theta \mid y) = 0$.

• In this example, the result of variational inference will fit a delta function.

When KL divergence does not work well

One possible solution to this issue:

→ Use a different optimization formulation based on another divergence measure: Expectation Propagation (EP).

EP minimizes $\mathrm{KL}(p \,\|\, q)$ instead of $\mathrm{KL}(q \,\|\, p)$.
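The difference between the two directions of the KL divergence can be seen on a toy problem. The sketch below assumes a bimodal target (an equal mixture of N(-3, 1) and N(3, 1), not taken from the slides) and fits a single Gaussian q in two ways: a brute-force search minimizing KL(q‖p), and moment matching, which is the minimizer of KL(p‖q) for a Gaussian q. It is an illustration of the two objectives, not an implementation of EP itself.

```python
import math

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

# Grid over theta and the assumed bimodal target density p.
step = 0.05
grid = [-15.0 + step * i for i in range(601)]
p = [0.5 * math.exp(log_normal(t, -3.0, 1.0)) + 0.5 * math.exp(log_normal(t, 3.0, 1.0))
     for t in grid]
log_p = [math.log(pi) for pi in p]

def reverse_kl(mean, std):
    """KL(q || p) for q = N(mean, std^2), by a Riemann sum on the grid."""
    total = 0.0
    for t, lp in zip(grid, log_p):
        lq = log_normal(t, mean, std)
        total += math.exp(lq) * (lq - lp) * step
    return total

# KL(q || p): brute-force search over candidate means and standard deviations.
candidates = [(-4.0 + 0.5 * i, 0.4 + 0.2 * j) for i in range(17) for j in range(19)]
vi_mean, vi_std = min(candidates, key=lambda c: reverse_kl(*c))

# KL(p || q): for a Gaussian q the minimizer matches the mean and variance of p.
mm_mean = sum(t * pi for t, pi in zip(grid, p)) * step
mm_std = math.sqrt(sum((t - mm_mean) ** 2 * pi for t, pi in zip(grid, p)) * step)
```

The KL(q‖p) fit locks onto one mode with a standard deviation near 1, underestimating the support, while the moment-matched fit is centered between the modes with a standard deviation above 3, covering both modes and the low-density region between them.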

EP also has issues

EP tends to overestimate the support of the original distribution.

→ Try other divergence measures, such as Rényi's alpha divergence, f-divergences, and other operator-based divergences.

Alternative 1 – alpha divergence

• The two forms of KL divergence are members of the alpha-divergence family:
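Several parameterizations of the alpha divergence exist. One that agrees with the α = 0 and α = 1 cases labelled in the experiment is Amari's form, $D_\alpha(p \,\|\, q) = \frac{1}{\alpha(1-\alpha)} \int \big[\alpha p + (1-\alpha) q - p^{\alpha} q^{1-\alpha}\big]\, d\theta$. That this is the convention used on the slide is an assumption. The sketch below checks numerically, for two assumed Gaussians standing in for p and q, that α → 0 recovers KL(q‖p) and α → 1 recovers KL(p‖q).

```python
import math

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

step = 0.01
grid = [-20.0 + step * i for i in range(4001)]
log_p = [log_normal(t, 0.0, 1.0) for t in grid]   # stand-in for the posterior
log_q = [log_normal(t, 1.0, 1.5) for t in grid]   # stand-in for the approximation

def alpha_divergence(alpha):
    """Amari alpha divergence D_alpha(p || q), for alpha other than 0 and 1."""
    total = 0.0
    for lp, lq in zip(log_p, log_q):
        mix = alpha * math.exp(lp) + (1 - alpha) * math.exp(lq)
        total += (mix - math.exp(alpha * lp + (1 - alpha) * lq)) * step
    return total / (alpha * (1 - alpha))

def kl(log_a, log_b):
    """KL(a || b) by a Riemann sum on the grid."""
    return sum(math.exp(la) * (la - lb) for la, lb in zip(log_a, log_b)) * step

near_zero = alpha_divergence(1e-3)       # expected to be close to KL(q || p)
near_one = alpha_divergence(1 - 1e-3)    # expected to be close to KL(p || q)
kl_qp, kl_pq = kl(log_q, log_p), kl(log_p, log_q)
```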

Inference based on alpha divergence

We can solve the equivalent optimization problem following the idea of variational inference:

We can derive the lower bound like ELBO for 𝛼 ≠ 1:

Inference based on alpha divergence

• Unfortunately, the lower bound is less tractable than ELBO.

• Apply Monte Carlo methods to estimate the lower bound:

Draw $\theta_k \sim q(\theta)$, $k = 1, \dots, K$:

• Future work: Find a stable gradient-based optimization method.
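The Monte Carlo step can be sketched as follows, assuming the bound has the Rényi form $\frac{1}{1-\alpha} \log \mathbb{E}_q\big[(p(y, \theta)/q(\theta))^{1-\alpha}\big]$ for α ≠ 1. This form is an assumption, and in it α → 1 (rather than α = 0) recovers the ELBO, so its α may be labelled differently from the experiment. The conjugate Gaussian model and all names are also hypothetical.

```python
import math
import random

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def log_joint(theta, y, prior_std):
    """log p(y, theta) for y_i ~ N(theta, 1) and theta ~ N(0, prior_std^2)."""
    return log_normal(theta, 0.0, prior_std) + sum(log_normal(yi, theta, 1.0) for yi in y)

def renyi_bound_mc(alpha, y, prior_std, q_mean, q_std, num_samples, rng):
    """Estimate 1/(1-alpha) * log(mean_k w_k^(1-alpha)), w_k = p(y, theta_k) / q(theta_k)."""
    scaled = []
    for _ in range(num_samples):
        theta = rng.gauss(q_mean, q_std)
        log_w = log_joint(theta, y, prior_std) - log_normal(theta, q_mean, q_std)
        scaled.append((1 - alpha) * log_w)
    # Log-sum-exp keeps the average of very small weights from underflowing.
    top = max(scaled)
    log_mean = top + math.log(sum(math.exp(s - top) for s in scaled) / num_samples)
    return log_mean / (1 - alpha)

data_rng = random.Random(0)
prior_std = 10.0
y = [data_rng.gauss(1.0, 1.0) for _ in range(10)]   # synthetic data

post_prec = len(y) + 1.0 / prior_std ** 2
post_mean, post_std = sum(y) / post_prec, post_prec ** -0.5
log_evidence = (log_joint(post_mean, y, prior_std)
                - log_normal(post_mean, post_mean, post_std))

# With q equal to the exact posterior every weight equals p(y), for any alpha.
exact_q_estimates = [
    renyi_bound_mc(a, y, prior_std, post_mean, post_std, 500, random.Random(1))
    for a in (-1.0, 0.0, 0.5)
]

# With a wider q, reusing the same samples, a smaller alpha gives a larger estimate.
wide_std = 1.5 * post_std
est_alpha_0 = renyi_bound_mc(0.0, y, prior_std, post_mean, wide_std, 2000, random.Random(1))
est_alpha_half = renyi_bound_mc(0.5, y, prior_std, post_mean, wide_std, 2000, random.Random(1))
```

The estimator takes the logarithm of a sample average, so it is biased for finite K, and its gradients can have high variance.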

Simple experiment

1. Estimate a polynomial function below.

2. Estimate a 2D Gaussian distribution.

Results for each value of alpha (plots not recovered):

• Alpha = -1

• Alpha = -0.5

• Alpha = 0 (the same as KL divergence minimization)

• Alpha = 0.5

• Alpha = 1 (the same as expectation propagation)

Alternative 2 – chi-square divergence

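Minimizing the chi-square divergence $\chi^2(p \,\|\, q)$ is equivalent to minimizing

$$\mathcal{L}_{\chi^2}(q) = \tfrac{1}{2} \log \mathbb{E}_q\!\left[\left(\frac{p(y, \theta)}{q(\theta)}\right)^{2}\right]$$

This quantity is an upper bound on the log model evidence:

$$\log p(y) \le \mathcal{L}_{\chi^2}(q)$$

By maximizing the ELBO and minimizing the chi-square bound together, we might estimate the distribution more accurately.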
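The sandwich idea can be checked numerically. The sketch below assumes the chi-square bound takes the form ½ log E_q[(p(y, θ)/q(θ))²], which upper-bounds log p(y) by Jensen's inequality, and uses a hypothetical conjugate Gaussian model so that log p(y) is known. The choice of a q wider than the posterior is deliberate: it keeps the squared weights bounded, so the Monte Carlo estimate is stable.

```python
import math
import random

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def log_joint(theta, y, prior_std):
    """log p(y, theta) for y_i ~ N(theta, 1) and theta ~ N(0, prior_std^2)."""
    return log_normal(theta, 0.0, prior_std) + sum(log_normal(yi, theta, 1.0) for yi in y)

def bounds_mc(y, prior_std, q_mean, q_std, num_samples, rng):
    """Monte Carlo estimates of the ELBO and of the chi-square upper bound."""
    log_w = []
    for _ in range(num_samples):
        theta = rng.gauss(q_mean, q_std)
        log_w.append(log_joint(theta, y, prior_std) - log_normal(theta, q_mean, q_std))
    elbo = sum(log_w) / num_samples
    # 0.5 * log(mean of w^2), computed with log-sum-exp for numerical stability.
    top = max(log_w)
    cubo = top + 0.5 * math.log(sum(math.exp(2 * (lw - top)) for lw in log_w) / num_samples)
    return elbo, cubo

rng = random.Random(0)
prior_std = 10.0
y = [rng.gauss(1.0, 1.0) for _ in range(5)]   # synthetic data, for illustration only

post_prec = len(y) + 1.0 / prior_std ** 2
post_mean, post_std = sum(y) / post_prec, post_prec ** -0.5
log_evidence = (log_joint(post_mean, y, prior_std)
                - log_normal(post_mean, post_mean, post_std))

# Approximation with the right mean but a standard deviation 1.5 times too large.
elbo, cubo = bounds_mc(y, prior_std, post_mean, 1.5 * post_std, 20000, rng)
```

The gap between the two estimates brackets log p(y), and it shrinks to zero as q approaches the exact posterior.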

Alternative 3 – f-divergence

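$$D_f(p \,\|\, q) = \int q(\theta)\, f\!\left(\frac{p(\theta \mid y)}{q(\theta)}\right) d\theta$$

where $f: \mathbb{R}_+ \to \mathbb{R}$ is a convex, lower-semicontinuous function with $f(1) = 0$.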
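A short sketch of how particular choices of f recover familiar divergences, assuming the standard definition D_f(p‖q) = ∫ q(θ) f(p(θ)/q(θ)) dθ and two hypothetical Gaussians standing in for p and q. With f(t) = t log t it gives KL(p‖q), with f(t) = −log t it gives KL(q‖p), and with f(t) = (t − 1)² it gives the chi-square divergence.

```python
import math

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

step = 0.01
grid = [-20.0 + step * i for i in range(4001)]
p = [math.exp(log_normal(t, 0.0, 1.0)) for t in grid]   # stand-in for the posterior
q = [math.exp(log_normal(t, 1.0, 1.5)) for t in grid]   # stand-in for the approximation

def f_divergence(f):
    """D_f(p || q): Riemann sum of q * f(p / q) over the grid."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q)) * step

kl_pq = f_divergence(lambda t: t * math.log(t))    # KL(p || q)
kl_qp = f_divergence(lambda t: -math.log(t))       # KL(q || p)
chi2 = f_divergence(lambda t: (t - 1.0) ** 2)      # chi-square divergence

# Closed-form KL divergences between the two Gaussians, for comparison.
kl_pq_exact = math.log(1.5) + (1.0 + 1.0) / (2 * 1.5 ** 2) - 0.5
kl_qp_exact = math.log(1 / 1.5) + (1.5 ** 2 + 1.0) / 2 - 0.5
```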

Conclusion

• Consider optimization-based variational Bayesian inference methods based on statistical divergences different from KL divergence.

• Observe the behavior of inference methods based on alpha divergence and chi-square divergence.

Future work

• Suggest more stable gradient-based optimization methods by reducing the variance of gradients.

• Consider more general forms of divergence.

• Analyze convergence of each optimization theoretically (if possible).

Questions?

