From model uncertainty to ABC
Christian P. Robert
Universite Paris-Dauphine, University of Warwick, & [email protected]
BIPM Workshop on Measurement Uncertainty, Paris, June 12, 2015
The ABC of [Bayesian] statistics
In a classical (Fisher, 1921) perspective, a statistical model is defined by the law of the observations, also called the likelihood,

L(θ|y1, . . . , yn) = L(θ|y) = ∏_{i=1}^{n} f(y_i|θ)   [e.g., in the iid case]

Parameters θ are estimated based on this function L(θ|y) and on the probabilistic properties of the distribution of the data.

Comparison of models via likelihoods requires penalization and asymptotics
The ABC of Bayesian statistics

In the Bayesian approach (Bayes, 1763; Laplace, 1773), the parameter is endowed with a probability distribution as well, called the prior distribution, and the likelihood becomes a conditional distribution of the data given the parameter, understood as a random variable.

Inference is based on the posterior distribution, with density

π(θ|y) ∝ π(θ)L(θ|y)   [Bayes’ Theorem]

(also called the posterior) and model comparison on the marginal likelihood

m(y) = ∫ π(θ)L(θ|y) dθ
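As a minimal numerical sketch of these two formulas (the Binomial model, Beta(2, 2) prior and data below are illustrative assumptions, not from the talk), the posterior density and marginal likelihood m(y) can be computed by quadrature, with conjugacy providing an exact check:

```python
# Numerical sketch of Bayes' theorem: posterior density pi(theta|y) and
# marginal likelihood m(y) for a Binomial likelihood with a Beta(2,2) prior.
import numpy as np
from scipy.stats import beta, binom

n, y = 20, 14                              # observed: 14 successes out of 20
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)

prior = beta.pdf(theta, 2, 2)              # pi(theta)
like = binom.pmf(y, n, theta)              # L(theta | y)
m_y = np.sum(prior * like) * (theta[1] - theta[0])   # m(y) by Riemann sum
posterior = prior * like / m_y             # pi(theta | y)

# conjugacy check: the exact posterior is Beta(2 + y, 2 + n - y)
exact = beta.pdf(theta, 2 + y, 2 + n - y)
print(np.max(np.abs(posterior - exact)))   # small quadrature error
```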
A few more details

• The parameter θ does not become a random variable (instead of an unknown constant) in the Bayesian paradigm. Probability calculus is used to quantify the uncertainty about θ as a calibrated quantity.
• The prior density π(·) is to be understood as a reference measure which, in informative situations, may summarise the available prior information.
• The impact of the prior density π(·) on the resulting inference is real but (mostly) vanishes when the number of observations grows. The only exception is the area of hypothesis testing, where both approaches remain irreconcilable.
Getting approximative

Case of a well-defined statistical model where the likelihood function

L(θ|y) = f(y1, . . . , yn|θ)

is out of reach

Empirical approximation to the original Bayesian problem

• Degrading the data precision down to a tolerance level ε
• Replacing the likelihood with a non-parametric approximation
• Summarising/replacing the data with insufficient statistics

[Marin & al., 2011]
Approximate Bayesian computation
1 Introduction
2 Approximate Bayesian computation
  ABC basics
  Genesis of ABC
  The ABC method
  Alphabet soup
3 ABC model choice
Regular Bayesian computation issues
When faced with a non-standard posterior distribution
π(θ|y) ∝ π(θ)L(θ|y)
the standard solution is to use simulation (Monte Carlo) to produce a sample

θ1, . . . , θT

from π(θ|y) (or approximately, by Markov chain Monte Carlo methods)
[Robert & Casella, 2004]
Intractable likelihoods

Cases when the likelihood function f(y|θ) is unavailable and when the completion step

f(y|θ) = ∫_Z f(y, z|θ) dz

is impossible or too costly because of the dimension of z
⇒ MCMC cannot be implemented!
Illustrations
Example
Stochastic volatility model: for t = 1, . . . , T,

y_t = exp(z_t) ε_t ,   z_t = a + b z_{t−1} + σ η_t ,

T very large makes it difficult to include z within the simulated parameters
[Figure: a simulated series over t = 0, . . . , 1000, with the highest weight trajectories]
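The stochastic volatility model above is easy to simulate forward; the parameter values (a, b, σ) and T below are illustrative assumptions:

```python
# Simulating the stochastic volatility model of the slide:
#   y_t = exp(z_t) * eps_t,   z_t = a + b * z_{t-1} + sigma * eta_t
import numpy as np

rng = np.random.default_rng(0)
T, a, b, sigma = 1000, -0.02, 0.95, 0.2    # illustrative parameter values

z = np.empty(T)
z[0] = a / (1 - b)                         # start at the stationary mean
for t in range(1, T):
    z[t] = a + b * z[t - 1] + sigma * rng.normal()
y = np.exp(z) * rng.normal(size=T)         # observed series

# z is latent: with T = 1000 (or much larger) it is impractical to simulate
# z jointly with the parameters inside an MCMC sampler.
print(y[:5])
```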
Illustrations
Example
Potts model: if y takes values on a grid Y of size k^n and

f(y|θ) ∝ exp{ θ ∑_{l∼i} I_{y_l = y_i} }

where l∼i denotes a neighbourhood relation, even moderately large n prohibits the computation of the normalising constant

Z_θ = ∑_{y∈Y} exp{θ S(y)}

with too many terms and poor numerical approximations
[Cucala & al., 2009]
Illustrations
Example
Dynamic mixture model

(1 − w_{µ,τ}(x)) f_{β,λ}(x) + w_{µ,τ}(x) g_{ε,σ}(x) ,   x > 0 ,

where f_{β,λ} is a Weibull density, g_{ε,σ} a generalised Pareto density, and w_{µ,τ} is the cdf of a Cauchy distribution

Crucially missing the normalising constant

∫_0^∞ {(1 − w_{µ,τ}(x)) f_{β,λ}(x) + w_{µ,τ}(x) g_{ε,σ}(x)} dx

[Frigessi, Haug & Rue, 2002]
Illustrations
Example
Coalescence tree: in population genetics, reconstitution of a common ancestor from a sample of genes via a phylogenetic tree that is close to impossible to integrate out [100 processor days with 4 parameters]
[Cornuet et al., 2009, Bioinformatics]
Genetic background of ABC
ABC is a recent computational technique that only requires being able to sample from the likelihood f(·|θ)

This technique stemmed from population genetics models, about 15 years ago, and population geneticists still significantly contribute to methodological developments of ABC.
[Griffith & al., 1997; Tavare & al., 1999]
Kingman’s coalescent
Kingman’s genealogy: when the time axis is normalized, T(k) ∼ Exp(k(k − 1)/2)

Mutations according to the Simple stepwise Mutation Model (SMM)
• dates of the mutations ∼ Poisson process with intensity θ/2 over the branches
• MRCA = 100
• independent mutations: ±1 with pr. 1/2
Observations: leaves of the tree; θ = ?
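The two ingredients on this slide, exponential inter-coalescence times T(k) and Poisson mutations along the branches, can be sketched as follows (the sample size n and θ are arbitrary illustrative choices):

```python
# Sketch of Kingman's coalescent for a sample of n genes:
# inter-coalescence times T(k) ~ Exp(k(k-1)/2), mutations dropped as a
# Poisson process with intensity theta/2 along the branches.
import numpy as np

rng = np.random.default_rng(1)
n, theta = 10, 2.0                         # illustrative assumptions

total_branch_length = 0.0
for k in range(n, 1, -1):                  # k lineages coalesce down to 1
    t_k = rng.exponential(scale=2.0 / (k * (k - 1)))   # T(k)
    total_branch_length += k * t_k         # k branches alive during T(k)

# number of mutations over the whole tree
n_mutations = rng.poisson(theta / 2 * total_branch_length)
print(total_branch_length, n_mutations)
```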
Instance of ecological question
• How did the Asian ladybird beetle arrive in Europe?
• Why do they swarm right now?
• What are the routes of invasion?
• How to get rid of them?
[Lombaert & al., 2010, PLoS ONE]
⇒ Intractable likelihood

Missing (too much missing!) data structure:

f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG

cannot be computed in a manageable way...
[Stephens & Donnelly, 2000]

The genealogies are considered as nuisance parameters
Econom’ections
Similar exploration of simulation-based and approximation techniques in Econometrics

• Simulated method of moments
• Method of simulated moments
• Simulated pseudo-maximum-likelihood
• Indirect inference

[Gourieroux & Monfort, 1996]

even though the motivation is partly-defined models rather than complex likelihoods
A?B?C?
• A stands for approximate [wrong likelihood / picture]
• B stands for Bayesian
• C stands for computation [producing a parameter sample]
The ABC method
Bayesian setting: target is π(θ)f(x|θ)

When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique:

ABC algorithm

For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating

θ′ ∼ π(θ) ,   z ∼ f(z|θ′) ,

until the auxiliary variable z is equal to the observed value, z = y.
[Tavare et al., 1997]
Why does it work?!
The proof is trivial:

f(θ_i) ∝ ∑_{z∈D} π(θ_i) f(z|θ_i) I_y(z)
       ∝ π(θ_i) f(y|θ_i)
       = π(θ_i|y) .

[Accept–Reject 101]
Earlier occurrence
‘Bayesian statistics and Monte Carlo methods are ideally suited to the task of passing many models over one dataset’
[Don Rubin, Annals of Statistics, 1984]
Note that Rubin (1984) does not promote this algorithm for likelihood-free simulation but as frequentist intuition on posterior distributions: parameters from posteriors are more likely to be those that could have generated the data.
A as A...pproximative
When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone

ρ(y, z) ≤ ε

where ρ is a distance

Output distributed from

π(θ) P_θ{ρ(y, z) < ε} ∝ π(θ | ρ(y, z) < ε)
[Pritchard et al., 1999]
ABC algorithm
Algorithm 1 Likelihood-free rejection sampler

  for i = 1 to N do
    repeat
      generate θ′ from the prior distribution π(·)
      generate z from the likelihood f(·|θ′)
    until ρ{η(z), η(y)} ≤ ε
    set θ_i = θ′
  end for

where η(y) defines a (maybe insufficient) statistic
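Algorithm 1 can be sketched on a toy model where the exact posterior is known, so the output can be checked. The model y_i ∼ N(θ, 1) with a N(0, 10) prior and η = sample mean is an illustrative assumption, not an example from the talk:

```python
# Minimal likelihood-free rejection sampler (Algorithm 1) on a toy model:
# y_i ~ N(theta, 1), prior theta ~ N(0, 10), summary eta = sample mean.
import numpy as np

rng = np.random.default_rng(2)
n_obs = 50
y = rng.normal(1.5, 1.0, n_obs)            # "observed" data
eta_y = y.mean()

def abc_rejection(N=1000, eps=0.05):
    sample = []
    while len(sample) < N:
        theta = rng.normal(0.0, np.sqrt(10.0))   # theta' ~ pi(.)
        z = rng.normal(theta, 1.0, n_obs)        # z ~ f(.|theta')
        if abs(z.mean() - eta_y) <= eps:         # rho{eta(z), eta(y)} <= eps
            sample.append(theta)
    return np.array(sample)

post = abc_rejection()
# conjugate check: the exact posterior mean is n * ybar / (n + 1/10)
print(post.mean(), 50 * eta_y / 50.1)
```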
Output

The likelihood-free algorithm samples from the marginal in z of:

π_ε(θ, z|y) = π(θ) f(z|θ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ,

where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}.

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:

π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y) .
In fact, the approximation is only to the posterior restricted to the summary:

π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|η(y)) .
Not so good..!
Pima Indian benchmark
Figure: Comparison between density estimates of the marginals on β1 (left), β2 (center) and β3 (right) from ABC rejection samples (red) and MCMC samples (black).
Which summary?
Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]

• Loss of statistical information balanced against gain in data roughening
• Approximation error and information loss remain unknown
• Choice of statistics induces choice of distance function towards standardisation
  • may be imposed for external/practical reasons (e.g., DIYABC)
  • may gather several non-B point estimates [the more the merrier]
  • can [machine-]learn about efficient combination
MA example

Consider the MA(q) model

x_t = ε_t + ∑_{i=1}^{q} ϑ_i ε_{t−i}

Simple prior: uniform prior over the identifiability zone, e.g., the triangle for MA(2)
MA example (2)

ABC algorithm thus made of

1 picking a new value (ϑ1, ϑ2) in the triangle
2 generating an iid sequence (ε_t)_{−q<t≤T}
3 producing a simulated series (x′_t)_{1≤t≤T}

Distance: basic distance between the series

ρ((x′_t)_{1≤t≤T}, (x_t)_{1≤t≤T}) = ∑_{t=1}^{T} (x_t − x′_t)²

or between summary statistics like the first q autocorrelations

τ_j = ∑_{t=j+1}^{T} x_t x_{t−j}
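The three steps and the two distances can be sketched as follows; the true parameter values, the series length and the rejection scheme over the MA(2) invertibility triangle are illustrative assumptions:

```python
# Sketch of the MA(2) ABC steps: draw (v1, v2) uniformly over the
# identifiability triangle, simulate a series, and compare either the raw
# series or the summaries tau_j = sum_{t>j} x_t x_{t-j}, j = 1, 2.
import numpy as np

rng = np.random.default_rng(3)
T, q = 200, 2

def sample_prior():
    # uniform on the MA(2) triangle: -2 < v1 < 2, v1 + v2 > -1, v1 - v2 < 1
    while True:
        v1, v2 = rng.uniform(-2, 2), rng.uniform(-1, 1)
        if v1 + v2 > -1 and v1 - v2 < 1:
            return v1, v2

def simulate(v1, v2):
    eps = rng.normal(size=T + q)               # (eps_t), t = -q+1, ..., T
    return eps[q:] + v1 * eps[q - 1:-1] + v2 * eps[:-2]

def summaries(x):                              # tau_1, tau_2
    return np.array([np.sum(x[j:] * x[:T - j]) for j in (1, 2)])

x_obs = simulate(0.6, 0.2)                     # "observed" series
v = sample_prior()
x = simulate(*v)
d_raw = np.sum((x - x_obs) ** 2)               # distance on the series
d_sum = np.sum((summaries(x) - summaries(x_obs)) ** 2)   # on the summaries
print(v, d_raw, d_sum)
```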
Comparison of distance impact
Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
[Figure: density estimates of θ1 (left) and θ2 (right) from the ABC samples, under both distances and decreasing tolerances]
ABC advances
Simulating from the prior is often poor in efficiency

Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y...
[Marjoram et al., 2003; Beaumont et al., 2009]
[Toni & al., 2009; Fearnhead and Prangle, 2012]

...or view the problem as conditional density estimation and develop techniques to allow for a larger ε
[Beaumont et al., 2002; Blum & Francois, 2010]

...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
ABC-NP

Better usage of [prior] simulations by adjustment: instead of throwing away θ′ such that ρ(η(z), η(y)) > ε, replace the θ’s with locally regressed

θ* = θ − {η(z) − η(y)}ᵀ β
[Csillery et al., TEE, 2010]

where β is obtained by [NP] weighted least squares regression of θ on (η(z) − η(y)) with weights

K_δ {ρ(η(z), η(y))}

[Beaumont et al., 2002, Genetics]
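A toy sketch of this weighted regression adjustment, with an Epanechnikov kernel standing in for K_δ and simulated one-dimensional summaries standing in for accepted ABC output (all data below are illustrative assumptions):

```python
# Local linear adjustment a la Beaumont et al. (2002): regress the accepted
# theta's on eta(z) - eta(y) with kernel weights, then shift each draw to
# theta* = theta - (eta(z) - eta(y)) * beta.
import numpy as np

rng = np.random.default_rng(4)
N = 500
theta = rng.normal(size=N)                    # accepted parameter draws (toy)
eta_diff = theta + 0.3 * rng.normal(size=N)   # eta(z) - eta(y), toy link
delta = np.max(np.abs(eta_diff))

w = 1 - (eta_diff / delta) ** 2               # Epanechnikov kernel weights
X = np.column_stack([np.ones(N), eta_diff])   # intercept + regressor
W = np.diag(w)
alpha, beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ theta)

theta_star = theta - eta_diff * beta          # regression-adjusted draws
print(beta, theta_star.std() < theta.std())   # adjustment shrinks the spread
```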
attempts at summaries
How to choose the set of summary statistics?
• Joyce and Marjoram (2008, SAGMB)
• Nunes and Balding (2010, SAGMB)
• Fearnhead and Prangle (2012, JRSS B)
• Ratmann et al. (2012, PLOS Comput. Biol)
• Blum et al. (2013, Statistical Science)
• EP-ABC of Barthelme & Chopin (2013, JASA)
• LDA selection of Estoup & al. (2012, Mol. Ecol. Res.)
Semi-automatic ABC

Fearnhead and Prangle (2012) study ABC and the selection of summary statistics for parameter estimation

• ABC considered as an inferential method and calibrated as such
• randomised (or ‘noisy’) version of the summary statistics

η̃(y) = η(y) + τε

• optimality of the posterior expectation

E[θ|y]

of the parameter of interest as summary statistic η(y)!
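The Fearnhead–Prangle construction can be sketched by estimating E[θ|y] from a pilot run: regress prior draws of θ on features of the simulated data and use the fitted linear predictor as the summary η(y). The toy normal-mean model and the feature set are illustrative assumptions:

```python
# Sketch of semi-automatic ABC: learn eta(y) ~= E[theta | y] by linear
# regression of pilot theta draws on data features.
import numpy as np

rng = np.random.default_rng(7)
n, N = 20, 2000

theta = rng.normal(0, 2, N)                   # pilot draws from the prior
z = rng.normal(theta[:, None], 1.0, (N, n))   # one simulated data set per draw
feats = np.column_stack([np.ones(N), z.mean(1), np.median(z, 1)])

# least-squares fit of E[theta | y] on the features
coef, *_ = np.linalg.lstsq(feats, theta, rcond=None)

def eta(y):                                   # learned summary statistic
    return np.array([1.0, y.mean(), np.median(y)]) @ coef

y_obs = rng.normal(1.0, 1.0, n)
print(eta(y_obs))
```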
LDA summaries for model choice
In parallel to F&P’s semi-automatic ABC, selection of the most discriminant subvector out of a collection of summary statistics can be based on Linear Discriminant Analysis (LDA)
[Estoup & al., 2012, Mol. Ecol. Res.]

Solution now implemented in DIYABC.2
[Cornuet & al., 2008, Bioinf.; Estoup & al., 2013]
LDA advantages
• much faster computation of scenario probabilities via polychotomous regression
• a (much) lower number of explanatory variables improves the accuracy of the ABC approximation, reduces the tolerance ε and avoids extra costs in constructing the reference table
• allows for a large collection of initial summaries
• ability to evaluate Type I and Type II errors on more complex models
• LDA reduces correlation among explanatory variables

When available, using both simulated and real data sets, posterior probabilities of scenarios computed from LDA-transformed and raw summaries are strongly correlated
Bayesian model choice
BMC Principle

Several models M1, M2, . . . are considered simultaneously for dataset y, with the model index M central to inference.

Use of
• a prior π(M = m), plus
• a prior distribution on the parameter conditional on the value m of the model index, π_m(θ_m)
Goal is to derive the posterior distribution of M,
π(M = m|data)
a challenging computational target when models are complex.
Generic ABC for model choice
Algorithm 2 Likelihood-free model choice sampler (ABC-MC)

  for t = 1 to T do
    repeat
      Generate m from the prior π(M = m)
      Generate θ_m from the prior π_m(θ_m)
      Generate z from the model f_m(z|θ_m)
    until ρ{η(z), η(y)} < ε
    Set m(t) = m and θ(t) = θ_m
  end for
[Grelaud & al., 2009; Toni & al., 2009]
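A small instance of Algorithm 2 for the Poisson-versus-geometric pair that reappears later in the slides, with η(y) the sample mean; the priors, data and tolerance are illustrative assumptions:

```python
# ABC-MC sketch: model 1 is Poisson(lambda), lambda ~ Exp(1);
# model 2 is geometric(p) on {0, 1, 2, ...}, p ~ U(0, 1);
# summary eta(y) = sample mean.
import numpy as np

rng = np.random.default_rng(5)
n = 30
y = rng.poisson(2.0, n)                       # data generated from model 1
eta_y = y.mean()

def abc_mc(T=1000, eps=0.2):
    models = []
    while len(models) < T:
        m = rng.integers(1, 3)                # m ~ pi(M = m), uniform on {1, 2}
        if m == 1:
            z = rng.poisson(rng.exponential(1.0), n)
        else:
            z = rng.geometric(rng.uniform(), n) - 1   # shift support to 0, 1, ...
        if abs(z.mean() - eta_y) <= eps:      # rho{eta(z), eta(y)} <= eps
            models.append(m)
    return np.array(models)

m_post = abc_mc()
print("pi(M = 1 | eta(y)) ~=", np.mean(m_post == 1))
```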
About sufficiency
‘Sufficient statistics for individual models are unlikely tobe very informative for the model probability.’
[Scott Sisson, Jan. 31, 2011, X.’Og]
If η1(x) is a sufficient statistic for model m = 1 and parameter θ1, and η2(x) is a sufficient statistic for model m = 2 and parameter θ2, then (η1(x), η2(x)) is not always sufficient for (m, θ_m)

⇒ Potential loss of information at the testing level
Limiting behaviour of B^{ABC}_{12}

When ε goes to zero,

B^η_{12}(y) = ∫ π1(θ1) f^η_1(η(y)|θ1) dθ1 / ∫ π2(θ2) f^η_2(η(y)|θ2) dθ2 ,

⇒ Bayes factor based on the sole observation of η(y)
Meaning of the ABC-Bayes factor
‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’

[Scott Sisson, Jan. 31, 2011, X.’Og]

In the Poisson/geometric case, if E[y_i] = θ0 > 0 and η(y) = ȳ,

lim_{n→∞} B^η_{12}(y) = [(θ0 + 1)² / θ0] e^{−θ0}
The only safe cases
Besides specific models like Gibbs random fields,

using distances over the data itself escapes the discrepancy...
[Toni & Stumpf, 2010; Sousa et al., 2009]

...but asymptotic consistency of Bayes factors for some summary statistics ensures convergent model choice
[Marin & al., 2014]
Leaning towards machine learning
Main notions:

• ABC-MC seen as learning which model is most appropriate from a huge (reference) table
• exploiting a large number of summary statistics is not an issue for machine learning methods intended to estimate efficient combinations
• abandoning (temporarily?) the idea of estimating posterior probabilities of the models, poorly approximated by machine learning methods, and replacing them by a posterior predictive expected loss
Machine learning shift
ABC model choice
A) Generate a large set of (m, θ, z)’s from the Bayesian predictive, π(m) π_m(θ) f_m(z|θ)
B) Use machine learning techniques to infer on π(m|η(y))

In this perspective:
• the (iid) “data set” is the reference table simulated during stage A)
• the observed y becomes a new data point

Note that:
• predicting m is a classification problem ⟺ select the best model based on a maximal a posteriori rule, e.g., through random forests
• computing π(m|η(y)) is a regression problem ⟺ confidence in each model

classification is much simpler than regression (e.g., in the dimension of the objects we try to learn)
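Stages A) and B) can be sketched with a random forest trained on a simulated reference table; the Poisson/geometric pair and the particular summaries are illustrative assumptions:

```python
# Stage A): simulate a reference table of (m, eta(z)) pairs.
# Stage B): train a random forest to predict the model index from eta(z).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
n, N = 30, 5000

def simulate_row():
    if rng.integers(2) == 0:
        z, m = rng.poisson(rng.exponential(1.0), n), 1        # Poisson model
    else:
        z, m = rng.geometric(rng.uniform(), n) - 1, 2         # geometric model
    return m, [z.mean(), z.var(), (z == 0).mean()]            # several summaries

table = [simulate_row() for _ in range(N)]
labels = np.array([m for m, _ in table])
summaries = np.array([s for _, s in table])

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(summaries, labels)                  # learn m from eta(z)

y = rng.poisson(2.0, n)                    # observed data (here from model 1)
eta_y = [[y.mean(), y.var(), (y == 0).mean()]]
print(rf.predict(eta_y))                   # MAP-style model selection
```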
Conclusion
• ABC is part of a wider picture to handle complex/Big Data models, able to start from rudimentary machine learning summaries
• many formats of empirical [likelihood] Bayes methods available
• lack of comparative tools and of an assessment of information loss
• full Bayesian picture untrustworthy [yet]
Conclusion

Key ideas (for model choice)

• π(m|η(y)) ≠ π(m|y)
• Rather than approximating π(m|η(y)), focus on selecting the best model (classification vs. regression)
• Use a seasoned machine learning technique to select from the ABC simulations: minimising the 0–1 loss mimics the MAP rule
• Assess confidence in the selection with posterior predictive expected losses
Consequences on ABC-PopGen

• Often, RF ≫ k-NN (less sensitive to high correlation among summaries)
• RF requires far fewer prior simulations
• RF selects the relevant summaries automatically
• Hence can handle much more complex models