Inference for misspecified models
Richard Wilkinson
University of Sheffield
Mechanistic models
Models describe hypothesised relationships between variables.
Mechanistic model
e.g. ODE/PDE models
explains how/why the variables interact the way they do
parameters may have a physical meaning
often imperfect representations of reality, but may be the only link between the quantity of interest and the data
e.g. Atrial fibrillation
http://staffwww.dcs.shef.ac.uk/people/R.Clayton/MoviesAndPictures/fk3d-3panel.mpg
UQ in Patient Specific Cardiac Models
With Sam Coveney, Richard Clayton, Steve Niederer, Jeremy Oakley, ...
Atrial fibrillation (AF): rapid and uncoordinated electrical activation (arrhythmia) leading to poor mechanical function.
Affects around 610,000 people in the UK.
Catheter ablation removes/isolates the pathological tissue that sustains/initiates AF.
40% of patients subsequently experience atrial tachycardia (AT).
Aim: predict which AF patients will develop AT following ablation, and then treat for both in a single procedure.
We use complex electrophysiology simulations, combined with sparse and noisy clinical data, to
Infer tissue properties, including regions of fibrotic material
Predict AT pathways
Aid clinical decision making (accounting for uncertainty)
However, our simulator is imperfect. How should we proceed?
Inference under discrepancy
How should we do inference if the model is imperfect?
Data generating process
y ∼ G
Model (complex simulator, finite-dimensional parameter)
F = {Fθ : θ ∈ Θ}
If G = Fθ0 ∈ F, then we know what to do¹.
How should we proceed if G ∉ F?
Note: interest lies in inference of θ,
θ̂ ± σ or π(θ | y),
not calibrated prediction:
π(y′ | y) = ∫ Fθ(y′) π(θ | y) dθ
¹ Even if we can't agree about it!
Maximum likelihood
Maximum likelihood estimator
θ̂n = arg maxθ log π(y | θ)
If G = Fθ0 ∈ F, then (under some conditions)
θ̂n → θ0 almost surely as n → ∞
√n (θ̂n − θ0) ⇒ N(0, I⁻¹(θ0))
Asymptotic consistency, efficiency, normality.
If G ∉ F, then
θ̂n → θ∗ = arg minθ DKL(G, Fθ) = arg minθ ∫ log(dG/dFθ) dG   a.s.
√n (θ̂n − θ∗) ⇒ N(0, V⁻¹)
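The pseudo-true value can be seen in a small simulation. This is an illustrative sketch (my choice of G and F, not from the slides): data come from a lognormal G, we fit a misspecified exponential family, and the MLE settles on the KL minimiser θ∗ rather than any "true" θ.

```python
import numpy as np

# Illustrative sketch (assumed example): data come from G = Lognormal(0, 1),
# but we fit the misspecified family F = {Exponential(theta)}. The MLE
# theta_hat = 1 / mean(y) converges not to a "true" theta (there is none)
# but to the pseudo-true value
# theta* = argmin_theta KL(G, F_theta) = 1 / E_G[Y] = exp(-1/2).
rng = np.random.default_rng(0)

theta_star = np.exp(-0.5)              # KL minimiser for this G
for n in [100, 10_000, 1_000_000]:
    y = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    theta_hat = 1.0 / y.mean()         # exponential-model MLE
    print(n, theta_hat, abs(theta_hat - theta_star))
```

The gap |θ̂n − θ∗| shrinks as n grows, even though neither value has anything to do with a correctly specified model.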
Bayes
Bayesian posterior
π(θ | y) ∝ π(y | θ) π(θ)
If G = Fθ0 ∈ F,
π(θ | y) ⇒ N(θ0, I⁻¹(θ0)) as n → ∞
Bernstein–von Mises theorem: we forget the prior, and get asymptotic concentration and normality.
This also requires (a long list of) identifiability conditions to hold.
If G ∉ F, we still get asymptotic concentration (and possibly normality), but to θ∗ (the pseudo-true value).
There is no obvious meaning for the Bayesian analysis in this case.
Often with non-parametric models (e.g. GPs), we don't even get this convergence to the pseudo-true value, due to lack of identifiability.
An appealing idea: model the discrepancy
Kennedy and O'Hagan 2001
Can we model our way out of trouble by expanding F into a non-parametric world?
Grey-box models
One way to expand the class of models is by adding a Gaussian process (GP) to the simulator.
If fθ(x) is our simulator and y the observation, then perhaps we can correct f using the model
y = fθ∗(x) + δ(x), where δ(·) ∼ GP,
and jointly infer θ∗ and δ(·).
An appealing, but flawed, idea
Kennedy and O'Hagan 2001, Brynjarsdottir and O'Hagan 2014
Simulator: fθ(x) = θx    Reality: g(x) = θx / (1 + x/a)
θ = 0.65, a = 20
[Figure: solid = model with true theta, dashed = truth, with noisy observations, plotted over x ∈ [0, 4].]
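A minimal numerical sketch of this example (the design points and noise level are my assumptions): calibrating fθ(x) = θx by least squares, with no discrepancy term, against data generated from g(x) = θx/(1 + x/a) with θ = 0.65, a = 20, gives an estimate well below 0.65.

```python
import numpy as np

# Numerical sketch of the toy problem above (design points and noise level
# assumed): reality g(x) = theta*x / (1 + x/a) with theta = 0.65, a = 20,
# but we calibrate the simulator f_theta(x) = theta*x with no discrepancy.
rng = np.random.default_rng(1)

theta_true, a = 0.65, 20.0
x = np.linspace(0.5, 4.0, 50)
y = theta_true * x / (1.0 + x / a) + rng.normal(0.0, 0.01, size=x.size)

# Least squares for f_theta(x) = theta*x has the closed form
# theta_hat = sum(x*y) / sum(x*x).
theta_hat = (x * y).sum() / (x * x).sum()
print(theta_hat)   # well below the true 0.65: model error biases the fit
```

The fit absorbs the structural error into the parameter, which is the bias the slides warn about.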
An appealing, but flawed, idea
Bolting on a GP can correct your predictions², but won't necessarily fix your inference:
No discrepancy:
y = fθ(x) + N(0, σ²), θ ∼ N(0, 100), σ² ∼ Γ⁻¹(0.001, 0.001)
GP discrepancy:
y = fθ(x) + δ(x) + N(0, σ²), δ(·) ∼ GP(·, ·) with objective priors
[Figure: posterior histograms of θ over [0.2, 1.0] under four treatments of the model discrepancy (MD): no MD; GP prior on MD; uniform MD on [−1, 1]; uniform MD on [−0.5, 0.5].]
² As long as you are not extrapolating.
Dynamic discrepancy
Time-structured problems give us many more opportunities to learn the model discrepancy.
Consider the state space model:
xt+1 = fθ(xt) + et ,   yt = g(xt) + εt
Can we correct errors in f or g? e.g. xt+1 = fθ(xt) + δ(xt) + et
[Figure 6.6, from a thesis chapter on Gaussian process models of simulator discrepancy: the learnt discrepancy (solid black line) and the true discrepancy function (red line), plotted as xt+1 − f(xt, ut) against xt, for incorrect simulators of the form f(xt, ut) = m(xt) + 8ut + GP(0, K) with m(xt) ∈ {0.5xt, 25xt/(1 + xt²), 0.5, 0}. Data are generated with true parameters (q², r²) equal to (0.1, 1) (top row), (1, 0.1) (middle row), and (1, 100) (bottom row).]
Fitting a GP here is challenging: PGAS works but is expensive; reduced-rank methods are better. Variational approaches (for parametric models) look promising...
Dangers of non-parametric model extensions
There are (at least) two problems with this approach:
We may still find G ∉ F
Identifiability
A GP is an incredibly complex infinite-dimensional model, which is not necessarily identified even asymptotically. The posterior can concentrate not on a point, but on some sub-manifold of parameter space, and the projection of the prior onto this space continues to impact the posterior even as more and more data are collected.
i.e. we never forget the prior, but the prior is too complex to understand.
Brynjarsdottir and O'Hagan 2014 try to model their way out of trouble with prior information:
δ(0) = 0, δ′(x) ≥ 0
Great if you have this information.
Inferential approaches
Instead of trying to model our way out of trouble, can we modify the inferential approach instead?
Common approaches to inference:
Maximum likelihood/minimum-distance
Bayes(ish)
History matching (HM)/ABC-type methods (thresholding)
How do these approaches behave for well-specified and mis-specified models?
Try to understand why (at least anecdotally) HM and ABC seem to work well in mis-specified cases.
The big question³ is what properties we would like our inferential approach to possess.
³ To which I have no answer.
ABC: approximate Bayesian computation
Rejection Algorithm
Draw θ from the prior π(·)
Accept θ with probability π(D | θ)
Accepted θ are independent draws from the posterior distribution, π(θ | D).
If the likelihood, π(D | θ), is unknown:
'Mechanical' Rejection Algorithm
Draw θ from π(·)
Simulate X ∼ f(θ) from the computer model
Accept θ if D = X, i.e., if the computer output equals the observation
The acceptance rate is ∫ P(D | θ) π(θ) dθ = P(D).
Rejection ABC
If P(D) is small (or D is continuous), we will rarely accept any θ. Instead, there is an approximate version:
Uniform Rejection Algorithm
Draw θ from π(θ)
Simulate X ∼ f(θ)
Accept θ if ρ(D, X) ≤ ε
ε reflects the tension between computability and accuracy.
As ε → ∞, we get observations from the prior, π(θ).
If ε = 0, we generate observations from π(θ | D).
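The uniform rejection algorithm is only a few lines of code. This sketch uses the running example from the following slides (θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²), ρ(D, X) = |D − X|, D = 2); the vectorised form is my own.

```python
import numpy as np

# Uniform rejection ABC for the running example on the following slides:
# theta ~ U[-10, 10], X | theta ~ N(2(theta+2)theta(theta-2), 0.1 + theta^2),
# rho(D, X) = |D - X|, observed D = 2.
rng = np.random.default_rng(2)

def rejection_abc(D, eps, n_draws):
    theta = rng.uniform(-10.0, 10.0, size=n_draws)        # prior draws
    mean = 2.0 * (theta + 2.0) * theta * (theta - 2.0)
    x = rng.normal(mean, np.sqrt(0.1 + theta**2))         # one simulation each
    return theta[np.abs(D - x) <= eps]                    # keep the close ones

counts = {}
for eps in [10.0, 5.0, 1.0]:
    accepted = rejection_abc(D=2.0, eps=eps, n_draws=20_000)
    counts[eps] = accepted.size
    print(eps, accepted.size)   # smaller eps: fewer, but better, draws
```

Shrinking ε trades acceptance rate for accuracy, exactly the tension described above.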
ε = 10
θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)
ρ(D, X) = |D − X|, D = 2
[Figure: left, scatter of simulated (θ, X) pairs with the acceptance band D ± ε marked; right, the resulting ABC density of θ against the true posterior.]
ε = 7.5
θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)
ρ(D, X) = |D − X|, D = 2
[Figure: left, scatter of simulated (θ, X) pairs with the acceptance band D ± ε marked; right, the resulting ABC density of θ against the true posterior.]
ε = 5
θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)
ρ(D, X) = |D − X|, D = 2
[Figure: left, scatter of simulated (θ, X) pairs with the acceptance band D ± ε marked; right, the resulting ABC density of θ against the true posterior.]
ε = 2.5
θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)
ρ(D, X) = |D − X|, D = 2
[Figure: left, scatter of simulated (θ, X) pairs with the acceptance band D ± ε marked; right, the resulting ABC density of θ against the true posterior.]
ε = 1
θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)
ρ(D, X) = |D − X|, D = 2
[Figure: left, scatter of simulated (θ, X) pairs with the acceptance band D ± ε marked; right, the resulting ABC density of θ against the true posterior. As ε shrinks, the ABC density approaches the true posterior.]
Rejection ABC
If the data are too high dimensional, we never observe simulations that are 'close' to the field data: the curse of dimensionality.
Reduce the dimension using summary statistics, S(D).
Approximate Rejection Algorithm With Summaries
Draw θ from π(θ)
Simulate X ∼ f(θ)
Accept θ if ρ(S(D), S(X)) < ε
If S is sufficient, this is equivalent to the previous algorithm.
Simple → popular with non-statisticians.
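The algorithm above can be written generically; here is a sketch with a hypothetical example (Normal mean, sample mean as the summary statistic) standing in for a real simulator:

```python
import numpy as np

def abc_rejection(data, prior_draw, simulate, summary, dist, eps,
                  n=10_000, seed=1):
    """Rejection ABC with summary statistics (generic sketch)."""
    rng = np.random.default_rng(seed)
    s_obs = summary(data)
    kept = []
    for _ in range(n):
        theta = prior_draw(rng)             # draw theta from pi(theta)
        x = simulate(theta, rng)            # simulate X ~ f(theta)
        if dist(s_obs, summary(x)) < eps:   # accept if rho(S(D), S(X)) < eps
            kept.append(theta)
    return np.array(kept)

# Hypothetical use: infer a Normal mean, summarising 50 observations by their mean
data = np.random.default_rng(0).normal(3.0, 1.0, 50)
post = abc_rejection(
    data,
    prior_draw=lambda rng: rng.uniform(-10, 10),
    simulate=lambda th, rng: rng.normal(th, 1.0, 50),
    summary=np.mean,
    dist=lambda a, b: abs(a - b),
    eps=0.2,
)
print(post.size, post.mean())
```

Here the sample mean is a sufficient statistic for the Normal mean, so the accepted draws approximate the true posterior as ε → 0.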
History matching and ABC
History matching seeks to find a NROY (not ruled out yet) set
𝒫_θ = {θ : S_HM(F̂_θ, y) ≤ 3}
where
S_HM(F_θ) = |E_{F_θ}(Y) − y| / √(Var_{F_θ}(Y))
ABC approximates the posterior as
π_ε(θ) ∝ π(θ) E(𝕀_{S(F̂_θ, y) ≤ ε})
for some choice of S (typically S(F̂_θ, y) = ρ(η(y), η(y′)) where y′ ∼ F_θ) and ε.
They have thresholding of a score in common and are algorithmically comparable.
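The history-matching side of this can be sketched in a few lines. The toy simulator summaries below (a quadratic mean, constant variance) are purely hypothetical, standing in for emulator output:

```python
import numpy as np

y = 2.0                           # observed data
theta = np.linspace(-5, 5, 1001)  # candidate parameter grid

# Hypothetical simulator summaries per theta: E_{F_theta}(Y) and Var_{F_theta}(Y)
mean_f = theta**2 - 1.0           # illustrative expected output
var_f = np.full_like(theta, 0.5)  # illustrative output variance

# Implausibility S_HM(F_theta) = |E(Y) - y| / sqrt(Var(Y))
s_hm = np.abs(mean_f - y) / np.sqrt(var_f)

# Not Ruled Out Yet: threshold the implausibility at 3
nroy = theta[s_hm <= 3]
print(nroy.min(), nroy.max(), nroy.size)
```

Note the NROY set need not be an interval (here it is two disjoint pieces, one for each sign of θ); history matching makes no attempt to weight points within it.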
History matching and ABC
These methods (anecdotally) seem to work better in mis-specified situations.
Why?
They differ from likelihood based approaches in that:
They only use some aspect of the simulator output
  Typically we hand-pick which simulator outputs to compare, and weight them on a case by case basis.
They can potentially use generalised scores/loss-functions.
Their thresholding nature potentially makes them somewhat conservative.
What makes a good inferential approach?
Do any of these approaches have favourable properties/characteristics for inference under discrepancy? Particularly when the discrepancy model is crude?
Consistency?
  I don't want inconsistency.
Asymptotic concentration or normality?
Frequency properties?
  I wouldn't object, but this seems impossible for subjective priors.
Coherence?
Robustness to small mis-specifications?
Generalized scores
Likelihood based methods are notoriously sensitive to mis-specification.
A single outlier can make our inference arbitrarily bad.
The likelihood can pick up on unintended aspects of the data (e.g. tail behaviour).
Consider scoring rules instead. If we forecast F and observe y, then we receive score
S(F, y)
S is a proper score if
G = arg min_F E_{Y∼G} S(F, Y)
i.e. predicting G gives the best possible score.
Encourages honest reporting.
Examples:
Log-likelihood: S(F, y) = −log f(y)
Tsallis score: S(F, y) = (γ − 1)∫f(x)^γ dx − γ f(y)^{γ−1}
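The contrast between the two example scores is easy to see numerically. A sketch for a Normal forecast F = N(μ, σ²), writing γ for the Tsallis index and evaluating the ∫f^γ integral by the trapezoid rule:

```python
import numpy as np

def log_score(mu, sigma, y):
    """S(F, y) = -log f(y) for the Normal forecast F = N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

def tsallis_score(mu, sigma, y, gamma=2.0):
    """S(F, y) = (gamma - 1) int f(x)^gamma dx - gamma f(y)^(gamma - 1),
    with the integral evaluated numerically on a grid."""
    x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20_001)
    f = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    fg = f**gamma
    integral = (x[1] - x[0]) * (fg.sum() - 0.5 * (fg[0] + fg[-1]))
    f_y = np.exp(-(y - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return (gamma - 1) * integral - gamma * f_y**(gamma - 1)

# An extreme outlier makes the log score explode ...
print(log_score(0, 1, 100))
# ... but the Tsallis score stays bounded: f(y) -> 0, so the score tends to
# (gamma - 1) int f^gamma dx, a constant of the forecast alone.
print(tsallis_score(0, 1, 100))
```

This bounded tail behaviour is exactly why such scores are less sensitive to outliers than the likelihood.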
Minimum scoring rule estimation (Dawid et al. 2014 etc.) uses
θ̂ = arg min_θ S(F_θ, y)
For proper scores
E_{θ₀}( ∂/∂θ S(F_θ, y) |_{θ=θ₀} ) = ∂/∂θ E_{θ₀} S(F_θ, y) |_{θ=θ₀} = 0
so we have an unbiased estimating equation, and hence get asymptotic consistency for well-specified models. We also get asymptotic normality.
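A small numerical sketch of minimum-score estimation for a contaminated Normal location model: the log score gives the MLE (the sample mean here), while the Tsallis score with γ = 2 downweights the outliers. A grid search stands in for proper optimisation:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, 200)  # data from N(1.5, 1)
y[:5] = 50.0                   # contaminate with a few gross outliers

def log_score_total(mu):
    """Summed log score -sum_i log f(y_i) for F = N(mu, 1)."""
    return np.sum(0.5 * np.log(2 * np.pi) + (y - mu)**2 / 2)

def tsallis_total(mu):
    """Summed Tsallis score with gamma = 2 for F = N(mu, 1);
    int f^2 dx = 1 / (2 sqrt(pi)), which is constant in mu."""
    f_y = np.exp(-(y - mu)**2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(1 / (2 * np.sqrt(np.pi)) - 2 * f_y)

grid = np.linspace(-5, 60, 6_501)
mle = grid[np.argmin([log_score_total(m) for m in grid])]
msr = grid[np.argmin([tsallis_total(m) for m in grid])]
print(mle, msr)  # MLE dragged toward the outliers; Tsallis estimator stays near 1.5
```

Both estimating equations are unbiased under the well-specified model; they differ only in how mis-specification (here, contamination) propagates into the estimate.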
Dawid et al. 2014 show that if
∇_θ f_θ(x) is bounded in x for all θ
the Bregman gauge of the scoring rule is locally bounded
then the minimum scoring rule estimator θ̂ is B-robust,
i.e. it has bounded influence function
IF(x; θ̂, F_θ) = lim_{ε→0} [θ̂(ε δ_x + (1 − ε)F_θ) − θ̂(F_θ)] / ε
i.e. if F_θ is infected by an outlier at x, this doesn't unduly affect the inference.
Note both ABC and HM are B-robust in this sense, but using the log-likelihood is not.
What type of robustness do we want here?
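The influence function can be approximated with a finite ε. The sketch below contrasts the mean (unbounded influence) with the median (bounded influence) purely as an illustration of B-robustness; it is not the ABC or HM estimators themselves:

```python
import numpy as np

def influence(estimator, sample, x, eps=1e-3):
    """Finite-eps approximation of the influence function
    IF(x; that, F) = lim_{eps->0} [that(eps delta_x + (1-eps)F) - that(F)] / eps,
    with F represented by a large sample."""
    n = len(sample)
    k = max(1, int(round(eps * n)))  # place mass ~eps at the point x
    contaminated = np.concatenate([sample, np.full(k, x)])
    mass = k / (n + k)
    return (estimator(contaminated) - estimator(sample)) / mass

rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, 100_000)  # stands in for F = N(0, 1)

for x in (2.0, 20.0, 200.0):
    print(x, influence(np.mean, base, x), influence(np.median, base, x))
# The mean's influence grows linearly in x; the median's stays bounded.
```

For the mean, the approximation recovers IF(x) = x − μ exactly; for the median at N(0, 1) it approaches sign(x)/(2φ(0)) ≈ ±1.25 as |x| grows.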
Bayes-like approaches
What about Bayesian-like approaches with generalized scores?
[Slide reproduces the first page of Bissiri, Holmes and Walker (2016), 'A general framework for updating belief distributions', J. R. Statist. Soc. B, 78(5), 1103–1130.]
Bissiri et al. 2016 consider updating prior beliefs when the parameter θ is connected to observations via a loss function L(θ, y). They argue the update must be of the form
π(θ|x) ∝ exp(−L(θ, x))π(θ)
via coherency arguments. Note using log-likelihood as the loss funct