Approximate Bayesian Inference I: Structural Approximations
Pattern Recognition and Machine Learning, Chapter 10
Falk Lieder, December 2, 2010
Statistical Inference

[Diagram: hidden states $Z$ generate observations $X$; statistical inference inverts this to obtain the posterior belief $P(Z\mid X)$ and expectations $\mathbb{E}[f(Z)\mid X]$.]
When Do You Need Approximations?

The problem with Bayes' theorem is that it often leads to integrals that you don't know how to solve:
1. No analytic solution for the evidence $p(X)$.
2. No analytic solution for expectations $\mathbb{E}[f(Z)\mid X]$.
3. In the discrete case, computing $p(X)$ by summation has complexity exponential in the number of hidden variables.
4. Sequential learning for non-conjugate priors.
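In symbols, the quantities involved are (standard Bayes' theorem, stated here for reference):

$$p(Z\mid X) = \frac{p(X\mid Z)\,p(Z)}{p(X)}, \qquad p(X) = \int p(X\mid Z)\,p(Z)\,dZ, \qquad \mathbb{E}[f(Z)\mid X] = \int f(Z)\,p(Z\mid X)\,dZ.$$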
How to Approximate?

1. Stochastic approximation (samples): approximate the density by a histogram of samples; approximate expectations by sample averages.
2. Structural approximation: approximate the posterior by a density of a given (tractable) form, whose evidence and expectations are easy to compute.
3. Numerical integration: approximate the integrals numerically, (a) the evidence $p(X)$ and (b) expectations. Infeasible if $Z$ is high-dimensional (a grid with $N$ points per dimension requires $N^D$ evaluations).
How to Approximate? Trade-Offs

Structural approximations (variational inference):
+ fast to compute
+ efficient representation
+ learning rules give insight
− systematic error
− application often requires mathematical derivations

Stochastic approximations (Monte Carlo methods, sampling):
+ asymptotically exact
+ easily applicable general-purpose algorithms
− time-intensive
− storage-intensive
Variational Inference: An Intuition

[Figure: the space of probability distributions, with a tractable target family as a subset; the VB approximation is the member of the target family closest to the true posterior, where closeness is measured by the KL divergence.]
What Does Closest Mean?
Intuition: Closest means minimal additional surprise on average.
Kullback-Leibler (KL) divergence measures average additional surprise.
KL[p||q] measures how much less accurate the belief q is than p, if p is the true belief.
Equivalently, KL[p||q] is the largest reduction in average surprise you can achieve (namely by replacing q with the true belief p).
KL-Divergence Illustration
$$\mathrm{KL}\big[\,p(\cdot\mid X)\;\big\|\;q\,\big] := \int p(Z\mid X)\,\ln\!\left(\frac{p(Z\mid X)}{q(Z)}\right) dZ$$
Properties of the KL-Divergence

1. Zero iff both arguments are identical: $\mathrm{KL}[p\|q] = 0 \Leftrightarrow p = q$.
2. Greater than zero if they differ: $p \neq q \Rightarrow \mathrm{KL}[p\|q] > 0$.

Disadvantage: the KL divergence is not a metric (distance function), because
a) it is not symmetric, $\mathrm{KL}[p\|q] \neq \mathrm{KL}[q\|p]$ in general, and
b) it does not satisfy the triangle inequality.
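A quick numeric check of these properties, using the closed-form KL divergence between two univariate Gaussians (a standard identity; the example values are arbitrary):

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """KL[N(m1, s1^2) || N(m2, s2^2)], closed form."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0, 1, 0, 1))  # 0.0  -> zero iff the arguments are identical
print(kl_gauss(0, 1, 2, 3))  # ~0.88 -> positive for different distributions
print(kl_gauss(2, 3, 0, 1))  # ~4.90 -> differs from the line above: not symmetric
```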
How to Find the Closest Target Density?

• Intuition: minimize the distance to the true posterior.
• Implementations:
  – Variational Bayes: minimize $\mathrm{KL}[q\|p]$.
  – Expectation Propagation: minimize $\mathrm{KL}[p\|q]$.
• Arbitrariness:
  – Different measures → different algorithms & different results.
  – Alternative schemes are being developed, e.g. the Jaakkola-Jordan variational method and Kikuchi approximations.
Minimizing Functionals

• The KL divergence is a functional: it maps a function (a density $q$) to a real number.

Minimizing functions (calculus) vs. minimizing functionals (variational calculus):
• Functions map vectors to real numbers; functionals map functions to real numbers.
• Derivative: change of $f(x)$ for infinitesimal changes in $x$. Minimize by finding the root of the derivative.
• Functional derivative: change of $F(q)$ for infinitesimal changes in the function $q$. Minimize by finding the root of the functional derivative.
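A standard example of a functional derivative (not from the slides), for the negative-entropy functional that appears inside the free energy:

$$F[q] = \int q(z)\,\ln q(z)\,dz \quad\Longrightarrow\quad \frac{\delta F}{\delta q(z)} = \ln q(z) + 1.$$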
VB and the Free-Energy

• Variational Bayes: $\hat{q} = \arg\min_q \mathrm{KL}[q \,\|\, p(\cdot\mid X)]$.
• Problem: you can't evaluate the KL divergence, because you can't evaluate the posterior.
• Solution: rewrite it in terms of the free energy,
$$\mathrm{KL}[q \,\|\, p(\cdot\mid X)] = \ln p(X) - \mathcal{F}(q) = \mathrm{const} - \mathcal{F}(q).$$
• Conclusion: you can maximize the free energy $\mathcal{F}(q)$ (Bishop's lower bound $\mathcal{L}(q)$) instead.
VB: Minimizing the KL-Divergence is Equivalent to Maximizing the Free-Energy

$$\ln p(X) = \mathcal{F}(q) + \mathrm{KL}\big[q \,\big\|\, p(\cdot\mid X)\big]$$

[Figure: the fixed quantity $\ln p(X)$ decomposed into the free energy $\mathcal{F}(q)$ and the KL divergence; raising $\mathcal{F}(q)$ necessarily lowers $\mathrm{KL}[q\|p]$.]
Constrained Free-Energy Maximization

Intuition:
• Maximize a lower bound on the log model evidence.
• Maximization is restricted to tractable target densities.

Definition:
$$\mathcal{F}(q) := \int q(Z)\,\ln\!\left(\frac{p(X,Z)}{q(Z)}\right) dZ$$

Properties:
• The free energy is maximal for the true posterior: $\mathcal{F}(q) \le \ln p(X)$, with equality iff $q = p(\cdot\mid X)$.
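The decomposition above follows in one line from the definition (as in PRML, Section 10.1), using $p(X,Z) = p(Z\mid X)\,p(X)$:

$$\mathcal{F}(q) = \int q(Z)\,\ln\!\left(\frac{p(Z\mid X)\,p(X)}{q(Z)}\right) dZ = \ln p(X) - \mathrm{KL}\big[q\,\big\|\,p(\cdot\mid X)\big] \;\le\; \ln p(X).$$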
Variational Approximations

1. Factorial approximations (meanfield):
   – independence assumption $q(Z) = \prod_i q_i(Z_i)$
   – optimization with respect to the factor densities
   – no restriction on the functional form of the factors
2. Approximation by parametric distributions:
   – optimization w.r.t. the parameters
3. Variational approximations for model comparison:
   – variational approximation of the log model evidence
Meanfield Approximation

Goal:
1. Rewrite $\mathcal{F}(q)$ as a functional of a single factor $q_j$ and optimize.
2. Optimize separately for each factor $q_j$.

Step 1: With $q(Z) = \prod_i q_i(z_i)$, the first term of the free energy becomes

$$\int q_j(z_j) \left( \int \prod_{i\neq j} q_i(z_i)\,\ln p(X,Z)\; dz_1 \cdots dz_{j-1}\, dz_{j+1} \cdots dz_K \right) dz_j = \int q_j(z_j)\,\ln \tilde{p}(X, z_j)\, dz_j + \mathrm{const},$$

where $\ln \tilde{p}(X, z_j) := \mathbb{E}_{i\neq j}[\ln p(X,Z)] + \mathrm{const}$.
Meanfield Approximation, Step 1 (continued)

The second (entropy) term decomposes as

$$\int q(Z)\,\sum_i \ln q_i(z_i)\; dz_1 \cdots dz_K = \int q_j(z_j)\,\ln q_j(z_j)\, dz_j + \mathrm{const},$$

since the terms with $i \neq j$ do not depend on $q_j$. Combining both terms:

$$\mathcal{F}(q_j) = \int q_j(z_j)\,\ln \tilde{p}(X, z_j)\, dz_j - \int q_j(z_j)\,\ln q_j(z_j)\, dz_j + \mathrm{const}.$$
Meanfield Approximation, Step 2

Notice that $\mathcal{F}(q_j) = -\mathrm{KL}\big[q_j\,\big\|\,\tilde{p}(X,\cdot)\big] + \mathrm{const}$, which is maximal when $q_j = \tilde{p}$. Hence

$$\hat{q}_j = \arg\max_{q_j} \mathcal{F}(q_j) = \tilde{p}(X, z_j) = \exp\big(\mathbb{E}_{i\neq j}[\ln p(X,Z)] + \mathrm{const}\big).$$

The constant is fixed by normalization, because $\hat{q}_j$ has to integrate to one:

$$\hat{q}_j(z_j) = \frac{\exp\big(\mathbb{E}_{i\neq j}[\ln p(X,Z)]\big)}{\int \exp\big(\mathbb{E}_{i\neq j}[\ln p(X,Z)]\big)\, dz_j}.$$
Meanfield Example

True distribution: $p(z) = \mathcal{N}(z \mid \mu, \Lambda^{-1})$ with $z = (z_1, z_2)$.
Target family: $q(z) = q_1(z_1)\, q_2(z_2)$.
VB meanfield solution:
1. $\ln \hat{q}_1(z_1) = \mathbb{E}_{z_2}[\ln p(z)] = -\tfrac{1}{2}\Lambda_{11} z_1^2 + z_1\big(\Lambda_{11}\mu_1 - \Lambda_{12}(\mathbb{E}[z_2] - \mu_2)\big) + \mathrm{const}$.
2. Hence $\hat{q}_1(z_1) = \mathcal{N}(z_1 \mid m_1, \Lambda_{11}^{-1})$ with $m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(\mathbb{E}[z_2] - \mu_2)$.
3. By symmetry, $\hat{q}_2(z_2) = \mathcal{N}(z_2 \mid m_2, \Lambda_{22}^{-1})$ with $m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}(\mathbb{E}[z_1] - \mu_1)$.
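A minimal numeric sketch of this example (the correlation value 0.9 is an arbitrary choice for illustration): iterating the two mean updates and comparing the factor variances with the true marginal variances.

```python
import numpy as np

# True posterior: a strongly correlated bivariate Gaussian N(mu, inv(Lambda)).
mu = np.array([0.0, 0.0])
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lam = np.linalg.inv(Sigma)  # precision matrix Lambda

# Meanfield factors q_j(z_j) = N(m_j, 1/Lam[j, j]); only the means are coupled.
m = np.array([1.0, -1.0])   # arbitrary initialization of the factor means
for _ in range(50):         # coordinate ascent on the free energy
    m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
    m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]

print("factor means:", m)                          # converge to the true mean (0, 0)
print("factor variances:", 1 / np.diag(Lam))       # 1 - rho^2 = 0.19 each
print("true marginal variances:", np.diag(Sigma))  # 1.0 each -> q is too compact
```

The printed variances illustrate the observation on the next slide: the meanfield factors capture the true means but underestimate the marginal variances.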
Meanfield Example

Observation: the VB approximation is more compact than the true density.

Reason: $\mathrm{KL}[q\|p]$ does not penalize deviations where $q$ is close to 0.

[Figure: contours of the correlated true density and of the narrower, axis-aligned meanfield approximation.]

Unreasonable independence assumptions → poor approximation.
KL[q||p] vs. KL[p||q]

Variational Bayes (minimizes KL[q||p]):
• analytically easier
• approximation is more compact

Expectation Propagation (minimizes KL[p||q]):
• more involved
• approximation is wider
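A small numeric illustration of the two behaviors (the bimodal target and the grid search are my own construction, not from the slides): fitting a single Gaussian to a two-mode density under each KL direction.

```python
import numpy as np

# Bimodal "true posterior": equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
z = np.linspace(-8, 8, 4001)
dz = z[1] - z[0]

def gauss(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p = 0.5 * gauss(z, -2, 0.5) + 0.5 * gauss(z, 2, 0.5)

def kl(a, b):
    """Discretized KL[a||b] on the grid (b clipped to avoid log(0))."""
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dz

# Grid search over single-Gaussian candidates q = N(m, s^2).
best_qp = best_pq = (np.inf, 0.0, 1.0)
for m in np.linspace(-3, 3, 121):
    for s in np.linspace(0.2, 4, 96):
        q = gauss(z, m, s)
        d = kl(q, p)                        # VB direction
        if d < best_qp[0]: best_qp = (d, m, s)
        d = kl(p, q)                        # EP direction
        if d < best_pq[0]: best_pq = (d, m, s)

print("argmin KL[q||p] (VB): m=%.2f, s=%.2f" % best_qp[1:])  # locks onto one mode
print("argmin KL[p||q] (EP): m=%.2f, s=%.2f" % best_pq[1:])  # covers both modes
```

Minimizing KL[q||p] yields a compact Gaussian sitting on one mode (zero-forcing), while minimizing KL[p||q] yields a wide Gaussian spanning both modes (moment matching).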
2. Parametric Approximations

• Problem: you don't know how to integrate prior times likelihood.
• Solution (see the sketch below):
  – Approximate $p(Z\mid X)$ by a parametric density $q(Z; \theta)$.
  – The KL divergence and the free energy become functions of the parameters $\theta$.
  – Apply standard optimization techniques.
  – Setting derivatives to zero → one equation per parameter.
  – Solve the system of equations by iterative updating.
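In symbols, the recipe amounts to maximizing the free energy as an ordinary function of $\theta$:

$$\mathcal{F}(\theta) = \mathbb{E}_{q(\cdot;\theta)}\big[\ln p(X,Z)\big] - \mathbb{E}_{q(\cdot;\theta)}\big[\ln q(Z;\theta)\big], \qquad \frac{\partial \mathcal{F}}{\partial \theta_k} = 0 \;\text{ for each } k.$$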
Parametric Approximation Example

Goal: learn the reward probability $p$.
• Likelihood: Bernoulli observations $X \in \{0, 1\}$ (reward / no reward) driven by a latent variable $Z \in \mathbb{R}$.
• Prior: a density over $Z$, not conjugate to the likelihood.
• Posterior: $\propto$ likelihood $\times$ prior, with no closed form.

Problem: you cannot derive a learning rule for the expected reward and its variance, because
a) there is no analytic formula for the expected reward probability, and
b) the form of the prior changes with every observation.

Solution: approximate the posterior by a Gaussian.
Solution

Solve $\partial \mathcal{F}(m, s)/\partial m = 0$ and $\partial \mathcal{F}(m, s)/\partial s = 0$ for the mean $m$ and standard deviation $s$ of the Gaussian approximation $q(Z) = \mathcal{N}(Z \mid m, s^2)$.
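The slides do not spell out the exact likelihood and prior, so the sketch below assumes $p(X{=}1\mid Z) = \sigma(Z)$ with a Gaussian prior on $Z$ (consistent with $Z \in \mathbb{R}$, $X \in \{0,1\}$), and maximizes the free energy numerically instead of solving the stationarity equations analytically:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed model (the slides leave the exact forms open): latent Z in R,
#   prior       p(Z)       = N(Z | m0, s0^2)
#   likelihood  p(X=1 | Z) = sigmoid(Z),  rewards X in {0, 1}
# Each trial's posterior is approximated by q(Z) = N(Z | m, s^2), found by
# maximizing the free energy F(m, s) = E_q[ln p(x, Z)] + H[q].

nodes, weights = np.polynomial.hermite.hermgauss(40)  # Gauss-Hermite rule

def neg_free_energy(params, x, m0, s0):
    m, log_s = params
    s = np.exp(log_s)                 # optimize log(s) so that s stays positive
    z = m + np.sqrt(2.0) * s * nodes  # quadrature points for E_{N(m, s^2)}[.]
    log_prior = -0.5 * ((z - m0) / s0) ** 2 - np.log(s0 * np.sqrt(2 * np.pi))
    log_lik = x * z - np.logaddexp(0.0, z)  # Bernoulli log-likelihood, sigmoid link
    e_log_joint = np.sum(weights * (log_prior + log_lik)) / np.sqrt(np.pi)
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_s  # entropy of N(m, s^2)
    return -(e_log_joint + entropy)

# Sequential learning: today's Gaussian posterior is tomorrow's prior.
m, s = 0.0, 2.0
for x in [1, 1, 0, 1, 1, 1]:
    res = minimize(neg_free_energy, x0=[m, np.log(s)], args=(x, m, s))
    m, s = res.x[0], np.exp(res.x[1])
    print(f"x={x}: E[p_reward] ~ sigmoid(m) = {1 / (1 + np.exp(-m)):.3f}, s = {s:.3f}")
```

Because the approximate posterior has the same Gaussian form as the prior, it can be fed back in as the prior for the next trial, which is exactly the sequential learning scheme the next slide refers to.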
Result: A Global Approximation

[Figure: the true posterior together with its Laplace and Variational Bayes approximations; VB fits the posterior globally rather than only locally around the mode.]

• Learning rules for the expected reward probability and the uncertainty about it.
• Sequential learning algorithm.
VB for Bayesian Model Selection

• Model posterior: $p(m\mid X) \propto p(X\mid m)\,p(m)$. Hence, if $p(m)$ is uniform, $p(m\mid X) \propto p(X\mid m)$.
• Problem: the model evidence $p(X\mid m)$ is "intractable".
• Solution: approximate $\ln p(X\mid m)$ by the maximized free energy $\mathcal{F}_m(\hat{q})$.
• Justification: if $\mathrm{KL}[\hat{q} \,\|\, p(\cdot\mid X, m)] \approx 0$, then $\mathcal{F}_m(\hat{q}) \approx \ln p(X\mid m)$.
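This reuses the earlier free-energy decomposition, now applied per model $m$:

$$\ln p(X\mid m) = \mathcal{F}_m(q) + \mathrm{KL}\big[q\,\big\|\,p(\cdot\mid X, m)\big] \quad\Longrightarrow\quad p(m\mid X) \propto p(m)\,e^{\ln p(X\mid m)} \approx p(m)\,e^{\mathcal{F}_m(\hat{q})}.$$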
Summary

Approximate Bayesian inference via structural approximations:
• Variational Bayes (ensemble learning)
  – Meanfield approximations
  – Parametric approximations
• Applications: learning rules, model selection