From model uncertainty to ABC
Christian P. Robert
Universite Paris-Dauphine, University of Warwick, & [email protected]
BIPM Workshop on Measurement Uncertainty, Paris, June 12, 2015
The ABC of [Bayesian] statistics
In a classical (Fisher, 1921) perspective, a statistical model is defined by the law of the observations, also called the likelihood,

L(θ|y1, . . . , yn) = L(θ|y) = ∏_{i=1}^{n} f(y_i|θ)   [e.g., in the iid case]

Parameters θ are estimated based on this function L(θ|y) and on the probabilistic properties of the distribution of the data.

Comparison of models via likelihoods requires penalization and asymptotics
The ABC of Bayesian statistics

In the Bayesian approach (Bayes, 1763; Laplace, 1773), the parameter is endowed with a probability distribution as well, called the prior distribution, and the likelihood becomes a conditional distribution of the data given the parameter, understood as a random variable.

Inference is based on the posterior distribution, with density

π(θ|y) ∝ π(θ)L(θ|y)   [Bayes’ Theorem]

(also called the posterior) and model comparison on the marginal likelihood

m(y) = ∫ π(θ)L(θ|y) dθ
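As a minimal numerical sketch of these two formulas (the Binomial model, Beta(2, 2) prior and data below are illustrative assumptions, not from the talk), the posterior density and marginal likelihood m(y) can be computed by quadrature, with conjugacy providing an exact check:

```python
# Numerical sketch of Bayes' theorem: posterior density pi(theta|y) and
# marginal likelihood m(y) for a Binomial likelihood with a Beta(2,2) prior.
import numpy as np
from scipy.stats import beta, binom

n, y = 20, 14                              # observed: 14 successes out of 20
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)

prior = beta.pdf(theta, 2, 2)              # pi(theta)
like = binom.pmf(y, n, theta)              # L(theta | y)
m_y = np.sum(prior * like) * (theta[1] - theta[0])   # m(y) by Riemann sum
posterior = prior * like / m_y             # pi(theta | y)

# conjugacy check: the exact posterior is Beta(2 + y, 2 + n - y)
exact = beta.pdf(theta, 2 + y, 2 + n - y)
print(np.max(np.abs(posterior - exact)))   # small quadrature error
```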
A few more details

• The parameter θ does not become a random variable (instead of an unknown constant) in the Bayesian paradigm. Probability calculus is used to quantify the uncertainty about θ as a calibrated quantity.
• The prior density π(·) is to be understood as a reference measure which, in informative situations, may summarise the available prior information.
• The impact of the prior density π(·) on the resulting inference is real but (mostly) vanishes when the number of observations grows. The only exception is the area of hypothesis testing, where both approaches remain irreconcilable.
Getting approximative

Case of a well-defined statistical model where the likelihood function

L(θ|y) = f(y1, . . . , yn|θ)

is out of reach

Empirical approximation to the original Bayesian problem

• Degrading the data precision down to a tolerance level ε
• Replacing the likelihood with a non-parametric approximation
• Summarising/replacing the data with insufficient statistics

[Marin & al., 2011]
Approximate Bayesian computation
1 Introduction
2 Approximate Bayesian computation
  ABC basics
  Genesis of ABC
  The ABC method
  Alphabet soup
3 ABC model choice
Regular Bayesian computation issues
When faced with a non-standard posterior distribution
π(θ|y) ∝ π(θ)L(θ|y)
the standard solution is to use simulation (Monte Carlo) to produce a sample

θ1, . . . , θT

from π(θ|y) (or approximately, by Markov chain Monte Carlo methods)
[Robert & Casella, 2004]
Intractable likelihoods

Cases when the likelihood function f(y|θ) is unavailable and when the completion step

f(y|θ) = ∫_Z f(y, z|θ) dz

is impossible or too costly because of the dimension of z
⇒ MCMC cannot be implemented!
Illustrations
Example
Stochastic volatility model: for t = 1, . . . , T,

y_t = exp(z_t) ε_t ,   z_t = a + b z_{t−1} + σ η_t ,

T very large makes it difficult to include z within the simulated parameters
[Figure: a simulated series over t = 0, . . . , 1000, with the highest weight trajectories]
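The stochastic volatility model above is easy to simulate forward; the parameter values (a, b, σ) and T below are illustrative assumptions:

```python
# Simulating the stochastic volatility model of the slide:
#   y_t = exp(z_t) * eps_t,   z_t = a + b * z_{t-1} + sigma * eta_t
import numpy as np

rng = np.random.default_rng(0)
T, a, b, sigma = 1000, -0.02, 0.95, 0.2    # illustrative parameter values

z = np.empty(T)
z[0] = a / (1 - b)                         # start at the stationary mean
for t in range(1, T):
    z[t] = a + b * z[t - 1] + sigma * rng.normal()
y = np.exp(z) * rng.normal(size=T)         # observed series

# z is latent: with T = 1000 (or much larger) it is impractical to simulate
# z jointly with the parameters inside an MCMC sampler.
print(y[:5])
```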
Illustrations
Example
Potts model: if y takes values on a grid Y of size k^n and

f(y|θ) ∝ exp{ θ ∑_{l∼i} I_{y_l = y_i} }

where l∼i denotes a neighbourhood relation, even moderately large n prohibits the computation of the normalising constant

Z_θ = ∑_{y∈Y} exp{θ S(y)}

with too many terms and poor numerical approximations
[Cucala & al., 2009]
Illustrations
Example
Dynamic mixture model

(1 − w_{µ,τ}(x)) f_{β,λ}(x) + w_{µ,τ}(x) g_{ε,σ}(x) ,   x > 0 ,

where f_{β,λ} is a Weibull density, g_{ε,σ} a generalised Pareto density, and w_{µ,τ} is the cdf of a Cauchy distribution

Crucially missing the normalising constant

∫_0^∞ {(1 − w_{µ,τ}(x)) f_{β,λ}(x) + w_{µ,τ}(x) g_{ε,σ}(x)} dx

[Frigessi, Haug & Rue, 2002]
Illustrations
Example
Coalescence tree: in population genetics, reconstitution of a common ancestor from a sample of genes via a phylogenetic tree that is close to impossible to integrate out [100 processor days with 4 parameters]
[Cornuet et al., 2009, Bioinformatics]
Genetic background of ABC
ABC is a recent computational technique that only requires being able to sample from the likelihood f(·|θ)

This technique stemmed from population genetics models, about 15 years ago, and population geneticists still significantly contribute to methodological developments of ABC.
[Griffith & al., 1997; Tavare & al., 1999]
Kingman’s coalescent
Kingman’s genealogy: when the time axis is normalized, T(k) ∼ Exp(k(k − 1)/2)

Mutations according to the Simple stepwise Mutation Model (SMM)
• dates of the mutations ∼ Poisson process with intensity θ/2 over the branches
• MRCA = 100
• independent mutations: ±1 with pr. 1/2
Observations: leaves of the tree; θ = ?
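The two ingredients on this slide, exponential inter-coalescence times T(k) and Poisson mutations along the branches, can be sketched as follows (the sample size n and θ are arbitrary illustrative choices):

```python
# Sketch of Kingman's coalescent for a sample of n genes:
# inter-coalescence times T(k) ~ Exp(k(k-1)/2), mutations dropped as a
# Poisson process with intensity theta/2 along the branches.
import numpy as np

rng = np.random.default_rng(1)
n, theta = 10, 2.0                         # illustrative assumptions

total_branch_length = 0.0
for k in range(n, 1, -1):                  # k lineages coalesce down to 1
    t_k = rng.exponential(scale=2.0 / (k * (k - 1)))   # T(k)
    total_branch_length += k * t_k         # k branches alive during T(k)

# number of mutations over the whole tree
n_mutations = rng.poisson(theta / 2 * total_branch_length)
print(total_branch_length, n_mutations)
```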
Instance of ecological question
• How did the Asian ladybird beetle arrive in Europe?
• Why do they swarm right now?
• What are the routes of invasion?
• How to get rid of them?
[Lombaert & al., 2010, PLoS ONE]
⇒ Intractable likelihood

Missing (too much missing!) data structure:

f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG

cannot be computed in a manageable way...
[Stephens & Donnelly, 2000]

The genealogies are considered as nuisance parameters
Econom’ections
Similar exploration of simulation-based and approximation techniques in Econometrics

• Simulated method of moments
• Method of simulated moments
• Simulated pseudo-maximum-likelihood
• Indirect inference

[Gourieroux & Monfort, 1996]

even though the motivation is partly-defined models rather than complex likelihoods
A?B?C?
• A stands for approximate [wrong likelihood / picture]
• B stands for Bayesian
• C stands for computation [producing a parameter sample]
The ABC method
Bayesian setting: target is π(θ)f(x|θ)

When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique:

ABC algorithm

For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating

θ′ ∼ π(θ) ,   z ∼ f(z|θ′) ,

until the auxiliary variable z is equal to the observed value, z = y.
[Tavare et al., 1997]
Why does it work?!
The proof is trivial:

f(θ_i) ∝ ∑_{z∈D} π(θ_i) f(z|θ_i) I_y(z)
       ∝ π(θ_i) f(y|θ_i)
       = π(θ_i|y) .

[Accept–Reject 101]
Earlier occurrence
‘Bayesian statistics and Monte Carlo methods are ideally suited to the task of passing many models over one dataset’
[Don Rubin, Annals of Statistics, 1984]
Note that Rubin (1984) does not promote this algorithm for likelihood-free simulation but as frequentist intuition on posterior distributions: parameters from posteriors are more likely to be those that could have generated the data.
A as A...pproximative
When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone

ρ(y, z) ≤ ε

where ρ is a distance

Output distributed from

π(θ) P_θ{ρ(y, z) < ε} ∝ π(θ | ρ(y, z) < ε)
[Pritchard et al., 1999]
ABC algorithm
Algorithm 1 Likelihood-free rejection sampler

  for i = 1 to N do
    repeat
      generate θ′ from the prior distribution π(·)
      generate z from the likelihood f(·|θ′)
    until ρ{η(z), η(y)} ≤ ε
    set θ_i = θ′
  end for

where η(y) defines a (maybe insufficient) statistic
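Algorithm 1 can be sketched on a toy model where the exact posterior is known, so the output can be checked. The model y_i ∼ N(θ, 1) with a N(0, 10) prior and η = sample mean is an illustrative assumption, not an example from the talk:

```python
# Minimal likelihood-free rejection sampler (Algorithm 1) on a toy model:
# y_i ~ N(theta, 1), prior theta ~ N(0, 10), summary eta = sample mean.
import numpy as np

rng = np.random.default_rng(2)
n_obs = 50
y = rng.normal(1.5, 1.0, n_obs)            # "observed" data
eta_y = y.mean()

def abc_rejection(N=1000, eps=0.05):
    sample = []
    while len(sample) < N:
        theta = rng.normal(0.0, np.sqrt(10.0))   # theta' ~ pi(.)
        z = rng.normal(theta, 1.0, n_obs)        # z ~ f(.|theta')
        if abs(z.mean() - eta_y) <= eps:         # rho{eta(z), eta(y)} <= eps
            sample.append(theta)
    return np.array(sample)

post = abc_rejection()
# conjugate check: the exact posterior mean is n * ybar / (n + 1/10)
print(post.mean(), 50 * eta_y / 50.1)
```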
Output

The likelihood-free algorithm samples from the marginal in z of:

π_ε(θ, z|y) = π(θ) f(z|θ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ,

where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}.

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:

π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y) .
In fact, the approximation is only to the posterior restricted to the summary:

π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|η(y)) .
Not so good..!
Pima Indian benchmark
Figure: Comparison between density estimates of the marginals on β1 (left), β2 (center) and β3 (right) from ABC rejection samples (red) and MCMC samples (black).
Which summary?
Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]

• Loss of statistical information balanced against gain in data roughening
• Approximation error and information loss remain unknown
• Choice of statistics induces choice of distance function towards standardisation
  • may be imposed for external/practical reasons (e.g., DIYABC)
  • may gather several non-B point estimates [the more the merrier]
  • can [machine-]learn about efficient combination
MA example

Consider the MA(q) model

x_t = ε_t + ∑_{i=1}^{q} ϑ_i ε_{t−i}

Simple prior: uniform prior over the identifiability zone, e.g., the triangle for MA(2)
MA example (2)

ABC algorithm thus made of

1 picking a new value (ϑ1, ϑ2) in the triangle
2 generating an iid sequence (ε_t)_{−q<t≤T}
3 producing a simulated series (x′_t)_{1≤t≤T}

Distance: basic distance between the series

ρ((x′_t)_{1≤t≤T}, (x_t)_{1≤t≤T}) = ∑_{t=1}^{T} (x_t − x′_t)²

or between summary statistics like the first q autocorrelations

τ_j = ∑_{t=j+1}^{T} x_t x_{t−j}
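The three steps and the two distances can be sketched as follows; the true parameter values, the series length and the rejection scheme over the MA(2) invertibility triangle are illustrative assumptions:

```python
# Sketch of the MA(2) ABC steps: draw (v1, v2) uniformly over the
# identifiability triangle, simulate a series, and compare either the raw
# series or the summaries tau_j = sum_{t>j} x_t x_{t-j}, j = 1, 2.
import numpy as np

rng = np.random.default_rng(3)
T, q = 200, 2

def sample_prior():
    # uniform on the MA(2) triangle: -2 < v1 < 2, v1 + v2 > -1, v1 - v2 < 1
    while True:
        v1, v2 = rng.uniform(-2, 2), rng.uniform(-1, 1)
        if v1 + v2 > -1 and v1 - v2 < 1:
            return v1, v2

def simulate(v1, v2):
    eps = rng.normal(size=T + q)               # (eps_t), t = -q+1, ..., T
    return eps[q:] + v1 * eps[q - 1:-1] + v2 * eps[:-2]

def summaries(x):                              # tau_1, tau_2
    return np.array([np.sum(x[j:] * x[:T - j]) for j in (1, 2)])

x_obs = simulate(0.6, 0.2)                     # "observed" series
v = sample_prior()
x = simulate(*v)
d_raw = np.sum((x - x_obs) ** 2)               # distance on the series
d_sum = np.sum((summaries(x) - summaries(x_obs)) ** 2)   # on the summaries
print(v, d_raw, d_sum)
```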
Comparison of distance impact
Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
[Figure: density estimates of θ1 (left) and θ2 (right) from the ABC samples, under both distances and decreasing tolerances]
ABC advances
Simulating from the prior is often poor in efficiency

Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y...
[Marjoram et al., 2003; Beaumont et al., 2009]
[Toni & al., 2009; Fearnhead and Prangle, 2012]

...or view the problem as conditional density estimation and develop techniques to allow for a larger ε
[Beaumont et al., 2002; Blum & Francois, 2010]

...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
ABC-NP

Better usage of [prior] simulations by adjustment: instead of throwing away θ′ such that ρ(η(z), η(y)) > ε, replace the θ’s with locally regressed

θ* = θ − {η(z) − η(y)}ᵀ β
[Csillery et al., TEE, 2010]

where β is obtained by [NP] weighted least squares regression of θ on (η(z) − η(y)) with weights

K_δ {ρ(η(z), η(y))}

[Beaumont et al., 2002, Genetics]
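A toy sketch of this weighted regression adjustment, with an Epanechnikov kernel standing in for K_δ and simulated one-dimensional summaries standing in for accepted ABC output (all data below are illustrative assumptions):

```python
# Local linear adjustment a la Beaumont et al. (2002): regress the accepted
# theta's on eta(z) - eta(y) with kernel weights, then shift each draw to
# theta* = theta - (eta(z) - eta(y)) * beta.
import numpy as np

rng = np.random.default_rng(4)
N = 500
theta = rng.normal(size=N)                    # accepted parameter draws (toy)
eta_diff = theta + 0.3 * rng.normal(size=N)   # eta(z) - eta(y), toy link
delta = np.max(np.abs(eta_diff))

w = 1 - (eta_diff / delta) ** 2               # Epanechnikov kernel weights
X = np.column_stack([np.ones(N), eta_diff])   # intercept + regressor
W = np.diag(w)
alpha, beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ theta)

theta_star = theta - eta_diff * beta          # regression-adjusted draws
print(beta, theta_star.std() < theta.std())   # adjustment shrinks the spread
```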
attempts at summaries
How to choose the set of summary statistics?
• Joyce and Marjoram (2008, SAGMB)
• Nunes and Balding (2010, SAGMB)
• Fearnhead and Prangle (2012, JRSS B)
• Ratmann et al. (2012, PLOS Comput. Biol)
• Blum et al. (2013, Statistical Science)
• EP-ABC of Barthelme & Chopin (2013, JASA)
• LDA selection of Estoup & al. (2012, Mol. Ecol. Res.)
Semi-automatic ABC

Fearnhead and Prangle (2012) study ABC and the selection of summary statistics for parameter estimation

• ABC considered as an inferential method and calibrated as such
• randomised (or ‘noisy’) version of the summary statistics

η̃(y) = η(y) + τε

• optimality of the posterior expectation

E[θ|y]

of the parameter of interest as summary statistic η(y)!
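The Fearnhead–Prangle construction can be sketched by estimating E[θ|y] from a pilot run: regress prior draws of θ on features of the simulated data and use the fitted linear predictor as the summary η(y). The toy normal-mean model and the feature set are illustrative assumptions:

```python
# Sketch of semi-automatic ABC: learn eta(y) ~= E[theta | y] by linear
# regression of pilot theta draws on data features.
import numpy as np

rng = np.random.default_rng(7)
n, N = 20, 2000

theta = rng.normal(0, 2, N)                   # pilot draws from the prior
z = rng.normal(theta[:, None], 1.0, (N, n))   # one simulated data set per draw
feats = np.column_stack([np.ones(N), z.mean(1), np.median(z, 1)])

# least-squares fit of E[theta | y] on the features
coef, *_ = np.linalg.lstsq(feats, theta, rcond=None)

def eta(y):                                   # learned summary statistic
    return np.array([1.0, y.mean(), np.median(y)]) @ coef

y_obs = rng.normal(1.0, 1.0, n)
print(eta(y_obs))
```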
LDA summaries for model choice
In parallel to F&P’s semi-automatic ABC, selection of the most discriminant subvector out of a collection of summary statistics can be based on Linear Discriminant Analysis (LDA)
[Estoup & al., 2012, Mol. Ecol. Res.]

Solution now implemented in DIYABC.2
[Cornuet & al., 2008, Bioinf.; Estoup & al., 2013]
LDA advantages
• much faster computation of scenario probabilities via polychotomous regression
• a (much) lower number of explanatory variables improves the accuracy of the ABC approximation, reduces the tolerance ε and avoids extra costs in constructing the reference table
• allows for a large collection of initial summaries
• ability to evaluate Type I and Type II errors on more complex models
• LDA reduces correlation among explanatory variables

When available, using both simulated and real data sets, posterior probabilities of scenarios computed from LDA-transformed and raw summaries are strongly correlated
Bayesian model choice
BMC Principle

Several models M1, M2, . . . are considered simultaneously for dataset y, with the model index M central to inference.

Use of
• a prior π(M = m), plus
• a prior distribution on the parameter conditional on the value m of the model index, π_m(θ_m)
Goal is to derive the posterior distribution of M,
π(M = m|data)
a challenging computational target when models are complex.
Generic ABC for model choice
Algorithm 2 Likelihood-free model choice sampler (ABC-MC)

  for t = 1 to T do
    repeat
      Generate m from the prior π(M = m)
      Generate θ_m from the prior π_m(θ_m)
      Generate z from the model f_m(z|θ_m)
    until ρ{η(z), η(y)} < ε
    Set m(t) = m and θ(t) = θ_m
  end for
[Grelaud & al., 2009; Toni & al., 2009]
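A small instance of Algorithm 2 for the Poisson-versus-geometric pair that reappears later in the slides, with η(y) the sample mean; the priors, data and tolerance are illustrative assumptions:

```python
# ABC-MC sketch: model 1 is Poisson(lambda), lambda ~ Exp(1);
# model 2 is geometric(p) on {0, 1, 2, ...}, p ~ U(0, 1);
# summary eta(y) = sample mean.
import numpy as np

rng = np.random.default_rng(5)
n = 30
y = rng.poisson(2.0, n)                       # data generated from model 1
eta_y = y.mean()

def abc_mc(T=1000, eps=0.2):
    models = []
    while len(models) < T:
        m = rng.integers(1, 3)                # m ~ pi(M = m), uniform on {1, 2}
        if m == 1:
            z = rng.poisson(rng.exponential(1.0), n)
        else:
            z = rng.geometric(rng.uniform(), n) - 1   # shift support to 0, 1, ...
        if abs(z.mean() - eta_y) <= eps:      # rho{eta(z), eta(y)} <= eps
            models.append(m)
    return np.array(models)

m_post = abc_mc()
print("pi(M = 1 | eta(y)) ~=", np.mean(m_post == 1))
```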
About sufficiency
‘Sufficient statistics for individual models are unlikely tobe very informative for the model probability.’
[Scott Sisson, Jan. 31, 2011, X.’Og]
If η1(x) is a sufficient statistic for model m = 1 and parameter θ1, and η2(x) is a sufficient statistic for model m = 2 and parameter θ2, then (η1(x), η2(x)) is not always sufficient for (m, θ_m)

⇒ Potential loss of information at the testing level
Limiting behaviour of B^{ABC}_{12}

When ε goes to zero,

B^η_{12}(y) = ∫ π1(θ1) f^η_1(η(y)|θ1) dθ1 / ∫ π2(θ2) f^η_2(η(y)|θ2) dθ2 ,

⇒ Bayes factor based on the sole observation of η(y)
Meaning of the ABC-Bayes factor
‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’

[Scott Sisson, Jan. 31, 2011, X.’Og]

In the Poisson/geometric case, if E[y_i] = θ0 > 0 and η(y) = ȳ,

lim_{n→∞} B^η_{12}(y) = [(θ0 + 1)² / θ0] e^{−θ0}
The only safe cases
Besides specific models like Gibbs random fields,

using distances over the data itself escapes the discrepancy...
[Toni & Stumpf, 2010; Sousa et al., 2009]

...but asymptotic consistency of Bayes factors for some summary statistics ensures convergent model choice
[Marin & al., 2014]
Leaning towards machine learning
Main notions:

• ABC-MC seen as learning which model is most appropriate from a huge (reference) table
• exploiting a large number of summary statistics is not an issue for machine learning methods intended to estimate efficient combinations
• abandoning (temporarily?) the idea of estimating posterior probabilities of the models, poorly approximated by machine learning methods, and replacing them by a posterior predictive expected loss
Machine learning shift
ABC model choice
A) Generate a large set of (m, θ, z)’s from the Bayesian predictive, π(m) π_m(θ) f_m(z|θ)
B) Use machine learning techniques to infer on π(m|η(y))

In this perspective:
• the (iid) “data set” is the reference table simulated during stage A)
• the observed y becomes a new data point

Note that:
• predicting m is a classification problem ⟺ select the best model based on a maximal a posteriori rule, e.g., through random forests
• computing π(m|η(y)) is a regression problem ⟺ confidence in each model

classification is much simpler than regression (e.g., in the dimension of the objects we try to learn)
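Stages A) and B) can be sketched with a random forest trained on a simulated reference table; the Poisson/geometric pair and the particular summaries are illustrative assumptions:

```python
# Stage A): simulate a reference table of (m, eta(z)) pairs.
# Stage B): train a random forest to predict the model index from eta(z).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
n, N = 30, 5000

def simulate_row():
    if rng.integers(2) == 0:
        z, m = rng.poisson(rng.exponential(1.0), n), 1        # Poisson model
    else:
        z, m = rng.geometric(rng.uniform(), n) - 1, 2         # geometric model
    return m, [z.mean(), z.var(), (z == 0).mean()]            # several summaries

table = [simulate_row() for _ in range(N)]
labels = np.array([m for m, _ in table])
summaries = np.array([s for _, s in table])

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(summaries, labels)                  # learn m from eta(z)

y = rng.poisson(2.0, n)                    # observed data (here from model 1)
eta_y = [[y.mean(), y.var(), (y == 0).mean()]]
print(rf.predict(eta_y))                   # MAP-style model selection
```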
Conclusion
• ABC is part of a wider picture to handle complex/Big Data models, able to start from rudimentary machine learning summaries
• many formats of empirical [likelihood] Bayes methods available
• lack of comparative tools and of an assessment of information loss
• full Bayesian picture untrustworthy [yet]
Conclusion

Key ideas (for model choice)

• π(m|η(y)) ≠ π(m|y)
• Rather than approximating π(m|η(y)), focus on selecting the best model (classification vs. regression)
• Use a seasoned machine learning technique to select from the ABC simulations: minimising the 0–1 loss mimics the MAP rule
• Assess confidence in the selection with posterior predictive expected losses
Consequences on ABC-PopGen

• Often, RF ≫ k-NN (less sensitive to high correlation among summaries)
• RF requires far fewer prior simulations
• RF selects the relevant summaries automatically
• Hence can handle much more complex models