Optimization perspective on approximate Bayesian inference
Juho Kim
December 6, 2016
Project goals
• Solve an approximate Bayesian inference problem from the perspective of optimization.
• Consider variational Bayesian inference based on various divergence measures.
• Analyze convergence of each optimization empirically and theoretically (if possible).
Inference problem
Given a dataset y = {𝑦1, … , 𝑦𝑛}:
Bayes' rule:

    p(θ | y) = p(y | θ) p(θ) / p(y)

Computing the posterior distribution is known as the inference problem.
But the evidence

    p(y) = ∫ p(y, θ) dθ

can be a very high-dimensional integral and difficult to compute.
Approximate Bayesian inference
There are two main approaches to approximate inference: sampling-based methods (e.g., Markov chain Monte Carlo) and optimization-based variational methods. They have complementary strengths and weaknesses.
Variational Bayesian inference
In variational Bayesian inference,
• Find an approximate and tractable density that is maximally similar to the true posterior distribution.
• Formulate a density estimation problem as an optimization problem.
We can use the Kullback–Leibler (KL) divergence as the similarity measure:

    KL(q(θ) ‖ p(θ | y)) = ∫ q(θ) log [q(θ) / p(θ | y)] dθ

Then we minimize this KL divergence.
But we still cannot compute it directly, because it involves the intractable posterior p(θ | y).
Variational lower-bound
We can solve an equivalent optimization problem. Rewriting the KL divergence,

    KL(q(θ) ‖ p(θ | y)) = E_q[log q(θ)] − E_q[log p(θ, y)] + log p(y),

and since log p(y) does not depend on q, minimizing the KL divergence is equivalent to maximizing

    L(q) = E_q[log p(θ, y)] − E_q[log q(θ)],

which removes the intractable term. Since log p(y) = L(q) + KL(q ‖ p) ≥ L(q), L(q) is called the variational lower bound, or evidence lower bound (ELBO).
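As a concrete illustration of the ELBO, the following sketch estimates it by Monte Carlo for a made-up conjugate Gaussian model (not from the slides), where the exact posterior and evidence are known. When q equals the exact posterior, the ELBO equals log p(y); for any other q it falls short by exactly KL(q ‖ p).

```python
import math
import random

random.seed(0)

# Toy conjugate model (illustrative assumption, not from the slides):
#   prior:      theta ~ N(0, 1)
#   likelihood: y | theta ~ N(theta, 1), one observation y = 1.0
# Exact posterior: N(y/2, 1/2); exact evidence: p(y) = N(y; 0, 2).
y = 1.0
log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - y * y / (2 * 2.0)

def log_norm(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def elbo(m, v, num_samples=50_000):
    """Monte Carlo estimate of E_q[log p(theta, y) - log q(theta)] for q = N(m, v)."""
    total = 0.0
    for _ in range(num_samples):
        theta = random.gauss(m, math.sqrt(v))
        log_joint = log_norm(theta, 0.0, 1.0) + log_norm(y, theta, 1.0)
        total += log_joint - log_norm(theta, m, v)
    return total / num_samples

# At the exact posterior q = N(y/2, 1/2) the ELBO equals log p(y);
# for any other q it is strictly smaller (the gap is KL(q || p)).
elbo_opt = elbo(y / 2, 0.5)
elbo_bad = elbo(0.0, 2.0)
```

Note that at the exact posterior the Monte Carlo integrand is constant (log p(y) for every sample), so the estimate has zero variance.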
Stochastic variational inference
Suppose the joint distribution factorizes over the data points:

    p(θ, D) = p₀(θ) ∏ₙ p(yₙ | θ)

We can then run stochastic (natural) gradient ascent on the ELBO, i.e., stochastic variational inference.
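A minimal sketch of this idea, on a made-up conjugate Gaussian model (the model, step size, and batch size are illustrative assumptions). It uses plain stochastic gradients with the reparameterization trick rather than natural gradients, and learns only the variational mean for simplicity:

```python
import math
import random

random.seed(1)

# Toy model (illustrative assumption): theta ~ N(0, 1), y_n | theta ~ N(theta, 1).
N = 500
true_theta = 2.0
data = [random.gauss(true_theta, 1.0) for _ in range(N)]

# Exact posterior for this conjugate model: N(sum(y)/(N+1), 1/(N+1)).
post_mean = sum(data) / (N + 1)

# Variational family q(theta) = N(m, s^2); for simplicity we fix s at the
# exact posterior scale and learn only m with plain (not natural) gradients.
m = 0.0
s = math.sqrt(1.0 / (N + 1))
batch_size = 10
step = 1e-4

for t in range(3000):
    batch = random.sample(data, batch_size)
    eps = random.gauss(0.0, 1.0)
    theta = m + s * eps                      # reparameterization trick
    # Unbiased minibatch estimate of d(ELBO)/dm:
    #   grad of log-prior + (N / |B|) * grad of minibatch log-likelihood
    grad = -theta + (N / batch_size) * sum(y - theta for y in batch)
    m += step * grad                         # stochastic gradient ascent
```

The (N / |B|) factor rescales the minibatch likelihood gradient so that each update is an unbiased estimate of the full-data gradient.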
When KL divergence does not work well
• Variational inference does not work well for non-smooth potentials.
• The KL divergence tends to underestimate the support of the posterior due to its zero-forcing behavior:
→ the optimal variational distribution q must be zero wherever p(θ | y) = 0,
because KL(q ‖ p) becomes infinite when p(θ | y) = 0 and q(∙) > 0.
• In such cases, the result of variational inference can collapse toward a delta function.
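The zero-forcing effect can be illustrated numerically. In this sketch (made-up example densities, not the slides' experiment), a fixed-width Gaussian q is fit to a bimodal mixture p by grid search over its mean: minimizing KL(q ‖ p) locks onto a single mode, while the inclusive direction KL(p ‖ q) sits between the modes to cover both.

```python
import math

def npdf(x, mean, sd):
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

# Bimodal target: equal mixture of N(-3, 1) and N(3, 1) (made-up example).
xs = [x * 0.02 for x in range(-600, 601)]
dx = 0.02
p = [0.5 * npdf(x, -3, 1) + 0.5 * npdf(x, 3, 1) for x in xs]

def kl(a, b):
    """Numeric KL(a || b) on the grid."""
    return sum(ai * math.log(ai / bi) * dx for ai, bi in zip(a, b) if ai > 0)

def best_mean(direction):
    """Grid-search the mean of q = N(m, 1) minimizing the given KL direction."""
    best = None
    for m10 in range(-50, 51):
        m = m10 / 10.0
        q = [npdf(x, m, 1) for x in xs]
        val = kl(q, p) if direction == "exclusive" else kl(p, q)
        if best is None or val < best[0]:
            best = (val, m)
    return best[1]

m_excl = best_mean("exclusive")   # KL(q || p): zero-forcing, picks one mode
m_incl = best_mean("inclusive")   # KL(p || q): mass-covering, sits between modes
```

Here m_excl lands near one of the modes at ±3, while m_incl lands near 0, matching the underestimation and overestimation of support discussed above.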
When KL divergence does not work well
One possible solution to this issue:
→ use a different optimization formulation based on another divergence measure: expectation propagation (EP),
which minimizes KL(p ‖ q) instead of KL(q ‖ p).
EP also has issues
EP tends to overestimate the support of the original distribution.
→ Try other divergence measures such as Rényi's alpha divergence, the f-divergence, other operator-based divergences, etc.
Alternative 1 – alpha divergence
• The two forms of KL divergence are members of the alpha-divergence family:

    D_α(p ‖ q) = (1 / (α(1 − α))) (1 − ∫ p(θ)^α q(θ)^(1−α) dθ),

  which recovers KL(q ‖ p) as α → 0 and KL(p ‖ q) as α → 1.
Inference based on alpha divergence
We can solve the equivalent optimization problem following the idea of variational inference:
For α ≠ 1, we can derive a lower bound analogous to the ELBO.
Inference based on alpha divergence
• Unfortunately, the lower bound is less tractable than ELBO.
• Apply Monte Carlo methods to estimate the lower bound:
Draw θ_k ~ q(θ), k = 1, …, K, and average over the samples.
• Future work: Find a stable gradient-based optimization method.
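The slides do not show the bound itself; one common choice is the Rényi variational bound of Li and Turner, L_α = (1/(1−α)) log E_q[(p(θ, y)/q(θ))^(1−α)] (whose α convention may differ from the slides'). The sketch below estimates it by Monte Carlo on the same made-up conjugate Gaussian toy model, using a log-sum-exp for numerical stability:

```python
import math
import random

random.seed(2)

# Toy conjugate model (illustrative assumption):
#   theta ~ N(0, 1), y | theta ~ N(theta, 1), y = 1.0.
y = 1.0
log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - y * y / 4.0

def log_norm(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def renyi_bound(alpha, m, v, num_samples=20_000):
    """Monte Carlo estimate of (1/(1-alpha)) log E_q[(p(theta,y)/q(theta))^(1-alpha)].

    Computed with a log-sum-exp for numerical stability. For alpha in (0, 1)
    this is a lower bound on log p(y); as alpha -> 1 it approaches the ELBO,
    and alpha = 0 recovers the importance-sampling evidence estimate.
    """
    log_ws = []
    for _ in range(num_samples):
        theta = random.gauss(m, math.sqrt(v))
        log_w = (log_norm(theta, 0, 1) + log_norm(y, theta, 1)
                 - log_norm(theta, m, v))
        log_ws.append((1 - alpha) * log_w)
    mx = max(log_ws)
    log_mean = mx + math.log(sum(math.exp(l - mx) for l in log_ws) / num_samples)
    return log_mean / (1 - alpha)

# With a deliberately mismatched q = N(0, 1), the bound stays below log p(y).
b_half = renyi_bound(0.5, 0.0, 1.0)
```

The raw Monte Carlo estimator of this bound can have high variance when q is far from the posterior, which is one motivation for the "stable gradient-based optimization" future work above.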
Simple experiment
1. Estimate a distribution defined by a polynomial function.
2. Estimate a 2D Gaussian distribution.
Results for alpha = −1, −0.5, 0, 0.5, and 1 (figures omitted):
• alpha = 0 is the same as KL-divergence minimization.
• alpha = 1 is the same as expectation propagation.
Alternative 2 – chi-square divergence
Minimizing the chi-square divergence χ²(p ‖ q) is equivalent to minimizing the quantity

    CUBO = (1/2) log E_q[(p(θ, y) / q(θ))²].

This quantity is an upper bound on the model evidence: log p(y) ≤ CUBO.
By maximizing the ELBO and minimizing the chi-square upper bound together, we might sandwich log p(y) and estimate the distribution more accurately.
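The sandwich can be demonstrated on the same made-up conjugate Gaussian toy model: for a mismatched q, a Monte Carlo ELBO falls below the exact log evidence while a Monte Carlo chi-square upper bound (the CUBO form above, an assumption about the intended bound) lies above it.

```python
import math
import random

random.seed(3)

# Toy conjugate model (illustrative assumption):
#   theta ~ N(0, 1), y | theta ~ N(theta, 1), y = 1.0, so p(y) = N(1; 0, 2).
y = 1.0
log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - y * y / 4.0

def log_norm(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def bounds(m, v, num_samples=50_000):
    """Monte Carlo ELBO (lower bound) and CUBO (upper bound) for q = N(m, v)."""
    elbo_sum, log_sq_ws = 0.0, []
    for _ in range(num_samples):
        theta = random.gauss(m, math.sqrt(v))
        log_w = (log_norm(theta, 0, 1) + log_norm(y, theta, 1)
                 - log_norm(theta, m, v))
        elbo_sum += log_w
        log_sq_ws.append(2 * log_w)
    mx = max(log_sq_ws)
    cubo = 0.5 * (mx + math.log(sum(math.exp(l - mx) for l in log_sq_ws)
                                / num_samples))
    return elbo_sum / num_samples, cubo

# A mismatched q = N(0, 1) sandwiches the evidence: ELBO <= log p(y) <= CUBO.
elbo_est, cubo_est = bounds(0.0, 1.0)
```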
Alternative 3 – f-divergence
    D_f(p ‖ q) = ∫ q(θ) f(p(θ)/q(θ)) dθ,

where f : ℝ₊ → ℝ is a convex, lower-semicontinuous function with f(1) = 0.
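The divergences discussed so far are special cases of this family, which can be verified numerically. A sketch with two example Gaussian densities and the standard definition D_f(p ‖ q) = E_q[f(p/q)]:

```python
import math

def npdf(x, mean, sd):
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

# Numeric f-divergence D_f(p || q) = integral of q(x) f(p(x)/q(x)) dx,
# evaluated on a grid for two example Gaussians.
xs = [x * 0.01 for x in range(-1500, 1501)]
dx = 0.01
p = [npdf(x, 0.0, 1.0) for x in xs]
q = [npdf(x, 0.5, 1.2) for x in xs]

def f_div(f):
    return sum(qi * f(pi / qi) * dx for pi, qi in zip(p, q))

def kl(a, b):
    """Numeric KL(a || b) on the grid, for comparison."""
    return sum(ai * math.log(ai / bi) * dx for ai, bi in zip(a, b) if ai > 0)

# Special cases of f recover the familiar divergences:
#   f(t) = t log t      ->  KL(p || q)
#   f(t) = -log t       ->  KL(q || p)
#   f(t) = (t - 1)^2    ->  chi-square divergence chi^2(p || q)
kl_pq = f_div(lambda t: t * math.log(t))
kl_qp = f_div(lambda t: -math.log(t))
chi2 = f_div(lambda t: (t - 1) ** 2)
```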
Conclusion
• We considered optimization-based variational Bayesian inference methods built on statistical divergences other than the KL divergence.
• We observed the behavior of inference methods based on the alpha divergence and the chi-square divergence.
Future work
• Suggest more stable gradient-based optimization methods by reducing the variance of gradients.
• Consider more general forms of divergence.
• Analyze convergence of each optimization theoretically (if possible).
Questions?