ABC short course: final chapters

Page 1: ABC short course: final chapters

ABC model choice via random forests

1 simulation-based methods in Econometrics

2 Genetics of ABC

3 Approximate Bayesian computation

4 ABC for model choice

5 ABC model choice via random forests: Random forests, ABC with random forests, Illustrations

6 ABC estimation via random forests

7 [some] asymptotics of ABC

Page 2: ABC short course: final chapters

Leaning towards machine learning

Main notions:

• ABC-MC seen as learning about which model is most appropriate from a huge (reference) table

• exploiting a large number of summary statistics not an issue for machine learning methods intended to estimate efficient combinations

• abandoning (temporarily?) the idea of estimating posterior probabilities of the models, poorly approximated by machine learning methods, and replacing those by posterior predictive expected loss

[Cornuet et al., 2014, in progress]

Page 3: ABC short course: final chapters

Random forests

Technique that stemmed from Leo Breiman’s bagging (or bootstrap aggregating) machine learning algorithm for both classification and regression

[Breiman, 1996]

Improved classification performances by averaging over classification schemes of randomly generated training sets, creating a “forest” of (CART) decision trees, inspired by Amit and Geman (1997) ensemble learning

[Breiman, 2001]

Page 4: ABC short course: final chapters

Growing the forest

Breiman’s solution for inducing random features in the trees of the forest:

• bootstrap resampling of the dataset and

• random sub-setting [of size √t] of the covariates driving the classification at every node of each tree

Covariate xτ that drives the node separation

xτ ≷ cτ

and the separation bound cτ chosen by minimising entropy or Gini index

Page 5: ABC short course: final chapters

Breiman and Cutler’s algorithm

Algorithm 5 Random forests

for t = 1 to T do                        //* T is the number of trees *//
    Draw a bootstrap sample of size nboot
    Grow an unpruned decision tree:
    for b = 1 to B do                    //* B is the number of nodes *//
        Select ntry of the predictors at random
        Determine the best split from among those predictors
    end for
end for
Predict new data by aggregating the predictions of the T trees
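For concreteness, a minimal R sketch (an illustration only, not the course's own code) of how these tuning parameters map onto the randomForest package, assuming a data frame ref whose column m is the factor to classify and whose remaining p columns are the covariates, and a data frame new_data of new covariate values:

library(randomForest)

p      <- ncol(ref) - 1            # number of covariates
n_tree <- 500                      # T, the number of trees
n_boot <- floor(nrow(ref) / 2)     # bootstrap sample size drawn for each tree (arbitrary here)
n_try  <- floor(sqrt(p))           # predictors sampled at random at each node

rf <- randomForest(m ~ ., data = ref,     # m must be a factor for classification
                   ntree    = n_tree,
                   sampsize = n_boot,     # size of each bootstrap sample
                   mtry     = n_try)      # random subset of predictors per split

pred <- predict(rf, newdata = new_data)   # aggregate the predictions of the T trees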

Page 6: ABC short course: final chapters

Subsampling

Due both to large datasets [practical] and to theoretical recommendations from Gérard Biau [private communication], ranging from independence between trees to convergence issues, bootstrap samples of much smaller size than the original data size

N = o(n)

Each CART tree stops when the number of observations per node is 1: no culling of the branches

Page 9: ABC short course: final chapters

ABC with random forests

Idea: Starting with

• possibly large collection of summary statistics (s1i , . . . , spi ) (from scientific theory input to available statistical software, to machine-learning alternatives)

• ABC reference table involving model index, parameter values and summary statistics for the associated simulated pseudo-data

run R randomForest to infer M from (s1i , . . . , spi )

at each step O(√p) indices sampled at random and most discriminating statistic selected, by minimising entropy or Gini loss

Page 10: ABC short course: final chapters

ABC with random forests

Idea: Starting with

• possibly large collection of summary statistics (s1i , . . . , spi ) (from scientific theory input to available statistical software, to machine-learning alternatives)

• ABC reference table involving model index, parameter values and summary statistics for the associated simulated pseudo-data

run R randomForest to infer M from (s1i , . . . , spi )

The average of the trees is the resulting summary statistic, a highly non-linear predictor of the model index
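A hedged sketch of this step with the R randomForest package, assuming a reference table ref whose column M holds the model index and whose remaining columns hold the summaries (s1i , . . . , spi ), and a one-row data frame obs_stats of observed summaries:

library(randomForest)

ref$M <- as.factor(ref$M)                    # model index treated as a class label

rf_mc <- randomForest(M ~ ., data = ref, ntree = 500)

predict(rf_mc, newdata = obs_stats)          # forest (MAP-like) prediction of the model index

predict(rf_mc, newdata = obs_stats,          # fraction of trees voting for each model --
        type = "vote", norm.votes = TRUE)    # not a posterior probability (see next slide)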

Page 11: ABC short course: final chapters

Outcome of ABC-RF

Random forest predicts a (MAP) model index, from the observed dataset: the predictor provided by the forest is “sufficient” to select the most likely model but not to derive the associated posterior probability

• exploit entire forest by computing how many trees lead to picking each of the models under comparison, but variability too high to be trusted

• frequency of trees associated with majority model is no proper substitute to the true posterior probability

• and usual ABC-MC approximation equally highly variable and hard to assess

Page 13: ABC short course: final chapters

Posterior predictive expected losses

We suggest replacing unstable approximation of

P(M = m|xo)

with xo the observed sample and m the model index, by the average of the selection errors across all models given the data xo,

P(M(X) ≠ M | xo)

where the pair (M, X) is generated from the predictive

∫ f(x | θ) π(θ, M | xo) dθ

and M(x) denotes the random forest model (MAP) predictor

Page 14: ABC short course: final chapters

Posterior predictive expected losses

Arguments:

• Bayesian estimate of the posterior error

• integrates error over most likely part of the parameter space

• gives an averaged error rather than the posterior probability of the null hypothesis

• easily computed: given an ABC subsample of parameters from the reference table, simulate pseudo-samples associated with those and derive the error frequency (see the sketch below)
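A hedged sketch of that computation, with hypothetical names: abc_sample is the ABC subsample (column M plus parameter columns), simulate_data() and summaries() are user-supplied, and rf_mc is the trained forest:

library(randomForest)

post_pred_error <- function(abc_sample, rf_mc, simulate_data, summaries) {
  miss <- vapply(seq_len(nrow(abc_sample)), function(i) {
    m     <- abc_sample$M[i]                          # model index drawn from pi(theta, M | xo)
    theta <- abc_sample[i, setdiff(names(abc_sample), "M")]
    x     <- simulate_data(m, theta)                  # pseudo-sample from the predictive
    pred  <- predict(rf_mc, newdata = summaries(x))   # random forest (MAP) predictor
    as.character(pred) != as.character(m)             # selection error for this draw
  }, logical(1))
  mean(miss)                                          # estimated P(M(X) != M | xo)
}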

Page 15: ABC short course: final chapters

toy: MA(1) vs. MA(2)

Comparing MA(1) and MA(2) models:

xt = εt − ϑ1 εt−1 [− ϑ2 εt−2]

Earlier illustration using first two autocorrelations as S(x) [Marin et al., Stat. & Comp., 2011]

Result #1: values of p(m|x) [obtained by numerical integration] and p(m|S(x)) [obtained by mixing ABC outcome and density estimation] highly differ!
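A minimal sketch of the simulation setup in R, with arbitrary parameter values chosen only for illustration (arima.sim uses the xt = εt + ϑ1 εt−1 + ϑ2 εt−2 sign convention, hence the minus signs):

set.seed(1)
n <- 100                                                      # arbitrary series length

x_ma1 <- arima.sim(model = list(ma = -0.6),          n = n)   # MA(1)
x_ma2 <- arima.sim(model = list(ma = c(-0.6, -0.2)), n = n)   # MA(2)

S <- function(x) acf(x, lag.max = 2, plot = FALSE)$acf[2:3]   # first two sample autocorrelations

S(x_ma1)
S(x_ma2)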

Page 16: ABC short course: final chapters

toy: MA(1) vs. MA(2)

Difference between the posterior probability of MA(2) given either x or S(x). Blue stands for data from MA(1), orange for data from MA(2)

Page 17: ABC short course: final chapters

toy: MA(1) vs. MA(2)

Comparing MA(1) and MA(2) models:

xt = εt − ϑ1 εt−1 [− ϑ2 εt−2]

Earlier illustration using two autocorrelations as S(x) [Marin et al., Stat. & Comp., 2011]

Result #2: embedded models, with simulations from MA(1) within those from MA(2), hence linear classification poor

Page 18: ABC short course: final chapters

toy: MA(1) vs. MA(2)

Simulations of S(x) under MA(1) (blue) and MA(2) (orange)

Page 19: ABC short course: final chapters

toy: MA(1) vs. MA(2)

Comparing MA(1) and MA(2) models:

xt = εt − ϑ1 εt−1 [− ϑ2 εt−2]

Earlier illustration using two autocorrelations as S(x) [Marin et al., Stat. & Comp., 2011]

Result #3: on such a small-dimension problem, random forests should come second to k-nn or kernel discriminant analyses

Page 20: ABC short course: final chapters

toy: MA(1) vs. MA(2)

classification method                prior error rate (in %)

LDA                                  27.43
Logist. reg.                         28.34
SVM (library e1071)                  17.17
"naïve" Bayes (with G marg.)         19.52
"naïve" Bayes (with NP marg.)        18.25
ABC k-nn (k = 100)                   17.23
ABC k-nn (k = 50)                    16.97
Local log. reg. (k = 1000)           16.82
Random Forest                        17.04
Kernel disc. ana. (KDA)              16.95
True MAP                             12.36

Page 21: ABC short course: final chapters

Evolution scenarios based on SNPs

Three scenarios for the evolution of three populations from their most common ancestor

Page 22: ABC short course: final chapters

Evolution scenarios based on SNPs

DIYABC header (!)

7 parameters and 48 summary statistics

3 scenarios: 7 7 7

scenario 1 [0.33333] (6)

N1 N2 N3

0 sample 1

0 sample 2

0 sample 3

ta merge 1 3

ts merge 1 2

ts varne 1 N4

scenario 2 [0.33333] (6)

N1 N2 N3

..........

ts varne 1 N4

scenario 3 [0.33333] (7)

N1 N2 N3

........

historical parameters priors (7,1)

N1 N UN[100.0,30000.0,0.0,0.0]

N2 N UN[100.0,30000.0,0.0,0.0]

N3 N UN[100.0,30000.0,0.0,0.0]

ta T UN[10.0,30000.0,0.0,0.0]

ts T UN[10.0,30000.0,0.0,0.0]

N4 N UN[100.0,30000.0,0.0,0.0]

r A UN[0.05,0.95,0.0,0.0]

ts>ta

DRAW UNTIL

Page 23: ABC short course: final chapters

Evolution scenarios based on SNPs

Model 1 with 6 parameters:

• four effective sample sizes: N1 for population 1, N2 for population 2, N3 for population 3 and, finally, N4 for the native population;

• the time of divergence ta between populations 1 and 3;

• the time of divergence ts between populations 1 and 2.

• effective sample sizes with independent uniform priors on [100, 30000]

• vector of divergence times (ta, ts) with uniform prior on {(a, s) ∈ [10, 30000] ⊗ [10, 30000] | a < s}

Page 24: ABC short course: final chapters

Evolution scenarios based on SNPs

Model 2 with the same parameters as model 1, but the divergence time ta corresponds to a divergence between populations 2 and 3; prior distributions identical to those of model 1

Model 3 with an extra seventh parameter, the admixture rate r. For that scenario, at time ta admixture between populations 1 and 2 from which population 3 emerges. Prior distribution on r uniform on [0.05, 0.95]. In that case models 1 and 2 are not embedded in model 3. Prior distributions for the other parameters the same as in model 1

Page 25: ABC short course: final chapters

Evolution scenarios based on SNPs

Set of 48 summary statistics: single sample statistics

• proportion of loci with null gene diversity (= proportion of monomorphic loci)

• mean gene diversity across polymorphic loci [Nei, 1987]

• variance of gene diversity across polymorphic loci

• mean gene diversity across all loci

Page 26: ABC short course: final chapters

Evolution scenarios based on SNPs

Set of 48 summary statistics: two sample statistics

• proportion of loci with null FST distance between both samples [Weir and Cockerham, 1984]

• mean across loci of non null FST distances between both samples

• variance across loci of non null FST distances between both samples

• mean across loci of FST distances between both samples

• proportion of loci with null Nei’s distance between both samples [Nei, 1972]

• mean across loci of non null Nei’s distances between both samples

• variance across loci of non null Nei’s distances between both samples

• mean across loci of Nei’s distances between the two samples

Page 27: ABC short course: final chapters

Evolution scenarios based on SNPs

Set of 48 summary statistics: three sample statistics

• proportion of loci with null admixture estimate

• mean across loci of non null admixture estimate

• variance across loci of non null admixture estimates

• mean across all locus admixture estimates

Page 28: ABC short course: final chapters

Evolution scenarios based on SNPs

For a sample of 1000 SNPs measured on 25 biallelic individuals per population, learning ABC reference table with 20,000 simulations, prior predictive error rates:

• "naïve Bayes" classifier 33.3%

• raw LDA classifier 23.27%

• ABC k-nn [Euclidean dist. on summaries normalised by MAD] 25.93%

• ABC k-nn [unnormalised Euclidean dist. on LDA components] 22.12%

• local logistic classifier based on LDA components with k = 500 neighbours 22.61%, k = 1000 neighbours 22.46%, k = 5000 neighbours 22.43%

• random forest on summaries 21.03%

• random forest on LDA components only 23.1%

• random forest on summaries and LDA components 19.03%

(Error rates computed on a prior sample of size 10^4)

Page 34: ABC short course: final chapters

Evolution scenarios based on SNPs

Posterior predictive error rates

favourable: 0.010 error – unfavourable: 0.104 error

Page 35: ABC short course: final chapters

Evolution scenarios based on microsatellites

Same setting as previously

Sample of 25 diploid individuals per population, on 20 loci (roughly corresponds to 1/5th of the previous information)

Page 36: ABC short course: final chapters

Evolution scenarios based on microsatellites

One sample statistics

• mean number of alleles across loci

• mean gene diversity across loci (Nei, 1987)

• mean allele size variance across loci

• mean M index across loci (Garza and Williamson, 2001; Excoffier et al., 2005)

Page 37: ABC short course: final chapters

Evolution scenarios based on microsatellites

Two sample statistics

• mean number of alleles across loci (two samples)

• mean gene diversity across loci (two samples)

• mean allele size variance across loci (two samples)

• FST between two samples (Weir and Cockerham, 1984)

• mean index of classification (two samples) (Rannala and Mountain, 1997; Pascual et al., 2007)

• shared allele distance between two samples (Chakraborty and Jin, 1993)

• (δµ)2 distance between two samples (Goldstein et al., 1995)

Three sample statistics

• Maximum likelihood coefficient of admixture (Choisy et al., 2004)

Page 38: ABC short course: final chapters

Evolution scenarios based on microsatellites

classification method                    prior error* rate (in %)

raw LDA                                  35.64
"naïve" Bayes (with G marginals)         40.02
k-nn (MAD normalised sum stat)           37.47
k-nn (unnormalised LDA)                  35.14
RF without LDA components                35.14
RF with LDA components                   33.62
RF with only LDA components              37.25

*estimated on pseudo-samples of 10^4 items drawn from the prior

Page 40: ABC short course: final chapters

Evolution scenarios based on microsatellites

Posterior predictive error rates

favourable: 0.183 error – unfavourable: 0.435 error

Page 41: ABC short course: final chapters

Back to Asian Ladybirds [message in a beetle]

Comparing 10 scenarios of Asian beetle invasion beetle moves

Page 42: ABC short course: final chapters

Back to Asian Ladybirds [message in a beetle]

Comparing 10 scenarios of Asian beetle invasion beetle moves

classification method                    prior error† rate (in %)

raw LDA                                  38.94
"naïve" Bayes (with G margins)           54.02
k-nn (MAD normalised sum stat)           58.47
RF without LDA components                38.84
RF with LDA components                   35.32

†estimated on pseudo-samples of 10^4 items drawn from the prior

Page 43: ABC short course: final chapters

Back to Asian Ladybirds [message in a beetle]

Comparing 10 scenarios of Asian beetle invasion beetle moves

Random forest allocation frequencies

scenario     1      2      3      4      5      6      7      8      9      10
frequency    0.168  0.1    0.008  0.066  0.296  0.016  0.092  0.04   0.014  0.2

Posterior predictive error based on 20,000 prior simulations and keeping 500 neighbours (or 100 neighbours and 10 pseudo-datasets per parameter)

0.3682

Page 44: ABC short course: final chapters

Back to Asian Ladybirds [message in a beetle]

Comparing 10 scenarios of Asian beetle invasion

Page 46: ABC short course: final chapters

Back to Asian Ladybirds [message in a beetle]

Comparing 10 scenarios of Asian beetle invasion

[Plot of the prior error rate (roughly between 0.40 and 0.65) against the number of neighbours k (0 to 2000), with the posterior predictive error 0.368 indicated]

Page 47: ABC short course: final chapters

conclusion on random forests

• unlimited aggregation of arbitrary summary statistics

• recovery of discriminant statistics when available

• automated implementation with reduced calibration

• self-evaluation by posterior predictive error

• soon to appear in DIYABC

Page 48: ABC short course: final chapters

ABC estimation via random forests

1 simulation-based methods in Econometrics

2 Genetics of ABC

3 Approximate Bayesian computation

4 ABC for model choice

5 ABC model choice via random forests

6 ABC estimation via random forests

7 [some] asymptotics of ABC

Page 49: ABC short course: final chapters

Two basic issues with ABC

ABC compares numerous simulated datasets to the observed one

Two major difficulties:

• to decrease the approximation error (or tolerance ε) and hence ensure reliability of ABC, the total number of simulations must be very large;

• calibration of ABC (tolerance, distance, summary statistics, post-processing, &tc) critical and hard to automatise

Page 50: ABC short course: final chapters

classification of summaries by random forests

Given a large collection of summary statistics, rather than selecting a subset and excluding the others, estimate each parameter of interest by a machine learning tool like random forests

• RF can handle thousands of predictors

• ignore useless components

• fast estimation method with good local properties

• automatised method with few calibration steps

• substitute to Fearnhead and Prangle (2012) preliminary estimation of θ(y^obs)

• includes a natural (classification) distance measure that avoids choice of both distance and tolerance

[Marin et al., 2016]

Page 51: ABC short course: final chapters

random forests as non-parametric regression

CART means Classification and Regression Trees

For regression purposes, i.e., to predict y as f(x), similar binary trees in random forests (a single-tree sketch follows the list):

1 at each tree node, split data into two daughter nodes

2 split variable and bound chosen to minimise heterogeneity criterion

3 stop splitting when enough homogeneity in current branch

4 predicted values at terminal nodes (or leaves) are the average response variable y for all observations in the final leaf
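As an illustration of these four steps (simulated data, not the slides' own example), a single CART regression tree fitted with the rpart package:

library(rpart)

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)            # toy regression data

# binary splits chosen to reduce within-node heterogeneity; each leaf predicts
# the average response of the observations falling into it
tree <- rpart(y ~ x, data = data.frame(x, y),
              control = rpart.control(minsplit = 5, cp = 0.001))

plot(x, y)
lines(sort(x), predict(tree)[order(x)], col = 2, lwd = 2)   # piecewise-constant fit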

Page 52: ABC short course: final chapters

Illustration

conditional expectation f (x) and well-specified dataset

Page 53: ABC short course: final chapters

Illustration

single regression tree

Page 54: ABC short course: final chapters

Illustration

ten regression trees obtained by bagging (Bootstrap AGGregatING)

Page 55: ABC short course: final chapters

Illustration

average of 100 regression trees

Page 56: ABC short course: final chapters

bagging reduces learning variance

When growing forest with many trees,

• grow each tree on an independent bootstrap sample

• at each node, select m variables at random out of all M possible variables

• Find the best dichotomous split on the selected m variables

• predictor function estimated by averaging trees

Improve on CART with respect to accuracy and stability

Page 58: ABC short course: final chapters

prediction error

A given simulation (y^sim, x^sim) in the training table is not used in about 1/3 of the trees (the “out-of-bag” case)

Average the predictions F^oob(x^sim) of these trees to give the out-of-bag predictor of y^sim
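With the randomForest package this out-of-bag predictor is returned directly; a minimal sketch, assuming x_sim and y_sim hold the training covariates and responses:

library(randomForest)

rf       <- randomForest(x = x_sim, y = y_sim, ntree = 500)
oob_pred <- rf$predicted                 # F^oob(x^sim): averages only the trees whose
                                         # bootstrap sample excluded that simulation
oob_mse  <- mean((y_sim - oob_pred)^2)   # out-of-bag prediction error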

Page 59: ABC short course: final chapters

Related methods

• adjusted local linear: Beaumont et al. (2002) Approximate Bayesian computation in population genetics, Genetics

• ridge regression: Blum et al. (2013) A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation, Statistical Science

• linear discriminant analysis: Estoup et al. (2012) Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics, Molecular Ecology Resources

• adjusted neural networks: Blum and François (2010) Non-linear regression models for Approximate Bayesian Computation, Statistics and Computing

Page 60: ABC short course: final chapters

ABC parameter estimation (ODOF)

One dimension = one forest (ODOF) methodology

parametric statistical model:

{f (y ; θ) : y ∈ Y, θ ∈ Θ}, Y ⊆ Rn, Θ ⊆ Rp

with intractable density f (·; θ)

plus prior distribution π(θ)

Inference on quantity of interest

ψ(θ) ∈ R

(posterior means, variances, quantiles or covariances)

Page 62: ABC short course: final chapters

common reference table

Given η : Y → Rk a collection of summary statistics

• produce reference table (RT) used as learning dataset for multiple random forests

• meaning, for 1 ≤ t ≤ N (sketched in code below):

1 simulate θ(t) ∼ π(θ)
2 simulate yt = (y1,t , . . . , yn,t) ∼ f(y; θ(t))
3 compute η(yt) = {η1(yt), . . . , ηk(yt)}
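A hedged sketch of these three steps, with user-supplied (hypothetical) functions rprior(), simulate_data() and eta():

build_reference_table <- function(N, rprior, simulate_data, eta) {
  rows <- lapply(seq_len(N), function(t) {
    theta <- rprior()                    # 1. theta(t) ~ pi(theta)
    y     <- simulate_data(theta)        # 2. y_t ~ f(y; theta(t))
    c(theta = theta, eta(y))             # 3. store theta(t) together with eta(y_t)
  })
  as.data.frame(do.call(rbind, rows))
}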

Page 63: ABC short course: final chapters

ABC posterior expectations

Recall that θ = (θ1, . . . , θd) ∈ Rd

For each θj, construct a separate RF regression with predictor variables equal to the summary statistics η(y) = {η1(y), . . . , ηk(y)}

If Lb(η(y∗)) denotes the leaf of the b-th tree associated with η(y∗) (the leaf reached through the path of binary choices in the tree), with |Lb(η(y∗))| response variables, then

Ê(θj | η(y∗)) = (1/B) ∑_{b=1}^{B} (1/|Lb(η(y∗))|) ∑_{t: η(yt) ∈ Lb(η(y∗))} θj^(t)

is our ABC estimate
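In practice this leaf-averaging estimator is what predict() returns for a regression forest (up to each leaf mean being computed over the in-bag observations); a sketch under assumed names, with ref the reference table, summary_cols its summary columns and eta_obs the observed summaries:

library(randomForest)

rf_j <- randomForest(x = ref[, summary_cols], y = ref$theta_j, ntree = 500)

# average over the B trees of the mean response in the leaf reached by eta(y*)
E_theta_j <- predict(rf_j, newdata = eta_obs)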

Page 65: ABC short course: final chapters

ABC posterior quantile estimate

Random forests are also available for quantile regression [Meinshausen, 2006, JMLR]

Since

Ê(θj | η(y∗)) = ∑_{t=1}^{N} wt(η(y∗)) θj^(t)

with

wt(η(y∗)) = (1/B) ∑_{b=1}^{B} I_{Lb(η(y∗))}(η(yt)) / |Lb(η(y∗))|

a natural estimate of the cdf of θj is

F̂(u | η(y∗)) = ∑_{t=1}^{N} wt(η(y∗)) I{θj^(t) ≤ u}

ABC posterior quantiles and credible intervals are then given by F̂^{-1}
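Continuing the previous sketch, the weights wt(η(y∗)) can be recovered from the terminal nodes of each tree (randomForest exposes them through predict(..., nodes = TRUE)), giving the weighted cdf and hence posterior quantiles:

nodes_ref <- attr(predict(rf_j, newdata = ref[, summary_cols], nodes = TRUE), "nodes")  # N x B
nodes_obs <- attr(predict(rf_j, newdata = eta_obs,             nodes = TRUE), "nodes")  # 1 x B

same_leaf <- sweep(nodes_ref, 2, as.vector(nodes_obs), FUN = "==")  # is eta(y_t) in L_b(eta(y*))?
leaf_size <- colSums(same_leaf)                                     # |L_b(eta(y*))|
w <- rowSums(sweep(same_leaf, 2, leaf_size, FUN = "/")) / ncol(nodes_ref)   # w_t(eta(y*))

theta_j <- ref$theta_j
Fhat <- function(u) sum(w * (theta_j <= u))                         # weighted cdf estimate
wq   <- function(p) { o <- order(theta_j); theta_j[o][which(cumsum(w[o]) >= p)[1]] }
ci95 <- c(wq(0.025), wq(0.975))                                     # credible interval via Fhat^{-1}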

Page 67: ABC short course: final chapters

ABC variances

Even though an approximation of Var(θj | η(y∗)) is available based on F̂, choice of an alternative and slightly more involved version

In a given tree b of a random forest, existence of out-of-bag entries, i.e., entries not sampled in the associated bootstrap subsample

Use of out-of-bag simulations to produce an estimate θ̂j^(t) of E{θj | η(yt)}

Apply weights ωt(η(y∗)) to the out-of-bag residuals:

Var(θj | η(y∗)) = ∑_{t=1}^{N} ωt(η(y∗)) {θj^(t) − θ̂j^(t)}²
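Continuing the same sketch, the out-of-bag estimates are stored in rf_j$predicted, so this variance approximation takes one line (w are the weights computed earlier):

theta_oob   <- rf_j$predicted                          # out-of-bag estimates of E{theta_j | eta(y_t)}
var_theta_j <- sum(w * (ref$theta_j - theta_oob)^2)    # weighted squared out-of-bag residuals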

Page 69: ABC short course: final chapters

ABC covariances

For estimating Cov(θj, θℓ | η(y∗)), construction of a specific random forest

product of out-of-bag errors for θj and θℓ,

{θj^(t) − θ̂j^(t)} {θℓ^(t) − θ̂ℓ^(t)}

with again predictor variables the summary statistics η(y) = {η1(y), . . . , ηk(y)}

Page 70: ABC short course: final chapters

Gaussian toy example

Take

(y1, . . . , yn) | θ1, θ2 ∼ iid N(θ1, θ2), n = 10

θ1 | θ2 ∼ N(0, θ2)

θ2 ∼ IG(4, 3)

θ1 | y ∼ T(n + 8, n ȳ/(n + 1), (s² + 6)/((n + 1)(n + 8)))

θ2 | y ∼ IG(n/2 + 4, s²/2 + 3)

Closed-form theoretical values like ψ1(y) = E(θ1 | y), ψ2(y) = E(θ2 | y), ψ3(y) = Var(θ1 | y) and ψ4(y) = Var(θ2 | y)

Page 71: ABC short course: final chapters

Gaussian toy example

Reference table of N = 10,000 Gaussian replicates

Independent Gaussian test set of size Npred = 100

k = 53 summary statistics: the sample mean, the sample variance and the sample median absolute deviation, plus 50 independent pure-noise variables (uniform on [0, 1])
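A minimal sketch of this reference table, under the (assumed) conventions that N(θ1, θ2) is parameterised by mean and variance and IG(4, 3) by shape and rate:

set.seed(1)
N <- 10000; n <- 10

theta2 <- 1 / rgamma(N, shape = 4, rate = 3)       # theta_2 ~ IG(4, 3)
theta1 <- rnorm(N, mean = 0, sd = sqrt(theta2))    # theta_1 | theta_2 ~ N(0, theta_2)
y      <- matrix(rnorm(N * n, mean = theta1, sd = sqrt(theta2)), nrow = N)   # one sample per row

summaries <- cbind(rowMeans(y),                       # sample mean
                   apply(y, 1, var),                  # sample variance
                   apply(y, 1, mad),                  # sample median absolute deviation
                   matrix(runif(N * 50), nrow = N))   # 50 pure-noise U(0, 1) variables
ref <- data.frame(theta1, theta2, summaries)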

Page 72: ABC short course: final chapters

Gaussian toy example

[Four scatterplots comparing the theoretical values ψ1, ψ2, ψ3, ψ4 with their estimates ψ̃1, ψ̃2, ψ̃3, ψ̃4]

Scatterplot of the theoretical values with their corresponding estimates

Page 73: ABC short course: final chapters

Gaussian toy example

[Scatterplots comparing Q0.025(θ1 | y), Q0.975(θ1 | y), Q0.025(θ2 | y) and Q0.975(θ2 | y) with their estimates]

Scatterplot of the theoretical values of 2.5% and 97.5% posterior quantiles for θ1 and θ2 with their corresponding estimates

Page 74: ABC short course: final chapters

Gaussian toy example

                        ODOF   adj local linear   adj ridge   adj neural net
ψ1(y) = E(θ1 | y)       0.21   0.42               0.38        0.42
ψ2(y) = E(θ2 | y)       0.11   0.20               0.26        0.22
ψ3(y) = Var(θ1 | y)     0.47   0.66               0.75        0.48
ψ4(y) = Var(θ2 | y)     0.46   0.85               0.73        0.98
Q0.025(θ1 | y)          0.69   0.55               0.78        0.53
Q0.025(θ2 | y)          0.06   0.45               0.68        1.02
Q0.975(θ1 | y)          0.48   0.55               0.79        0.50
Q0.975(θ2 | y)          0.18   0.23               0.23        0.38

Comparison of normalized mean absolute errors

Page 75: ABC short course: final chapters

Gaussian toy example

[Boxplots of the estimates of Var(θ1 | y) and Var(θ2 | y) for the true values, ODOF, local linear, ridge and neural net]

Boxplot comparison of Var(θ1 | y), Var(θ2 | y) with the true values, ODOF and usual ABC methods

Page 76: ABC short course: final chapters

Comments

ABC-RF methods mostly insensitive both to strong correlations between the summary statistics and to the presence of noisy variables.

This implies fewer simulations and no calibration

Next steps: adaptive schemes, deep learning, inclusion in DIYABC

Page 77: ABC short course: final chapters

[some] asymptotics of ABC

1 simulation-based methods in Econometrics

2 Genetics of ABC

3 Approximate Bayesian computation

4 ABC for model choice

5 ABC model choice via random forests

6 ABC estimation via random forests

7 [some] asymptotics of ABC

Page 78: ABC short course: final chapters

consistency of ABC posteriors

Asymptotic study of the ABC-posterior z = z(n)

• ABC posterior consistency and convergence rate (in n)

• Asymptotic shape of πε(·|y(n))

• Asymptotic behaviour of θε = EABC[θ|y(n)]

[Frazier et al., 2016]

Page 79: ABC short course: final chapters

consistency of ABC posteriors

• Concentration around the true value and Bayesian consistency: less stringent conditions on the convergence speed of the tolerance εn to zero, when compared with asymptotic normality of the ABC posterior

• asymptotic normality of the ABC posterior mean does not require asymptotic normality of the ABC posterior

Page 80: ABC short course: final chapters

ABC posterior consistency

For a sample y = y(n) and a tolerance ε = εn, when n → +∞, assuming a parametric model θ ∈ Rk, k fixed

• Concentration of summary η(z): there exists b(θ) such that

η(z)− b(θ) = oPθ(1)

• Consistency:

Πεn (‖θ − θ0‖ ≤ δ|y) = 1 + op(1)

• Convergence rate: there exists δn = o(1) such that

Πεn (‖θ − θ0‖ ≤ δn|y) = 1 + op(1)

Page 81: ABC short course: final chapters

Related results

existing studies on the large sample properties of ABC, in which the asymptotic properties of point estimators derived from ABC have been the primary focus

[Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2015]

Page 82: ABC short course: final chapters

Convergence when εn ≳ σn

Under assumptions

(A1) there exists σn → 0 such that

Pθ(σn^{-1} ‖η(z) − b(θ)‖ > u) ≤ c(θ) h(u),   lim_{u→+∞} h(u) = 0

(A2) Π(‖b(θ) − b(θ0)‖ ≤ u) ≍ u^D for u ≈ 0

posterior consistency and posterior concentration rate λT that depends on the deviation control of d2{η(z), b(θ)}; posterior concentration rate for b(θ) bounded from below by O(εT)

Page 83: ABC short course: final chapters

Convergence when εn ≳ σn

Under assumptions

(A1) there exists σn → 0 such that

Pθ(σn^{-1} ‖η(z) − b(θ)‖ > u) ≤ c(θ) h(u),   lim_{u→+∞} h(u) = 0

(A2) Π(‖b(θ) − b(θ0)‖ ≤ u) ≍ u^D for u ≈ 0

then

Πεn(‖b(θ) − b(θ0)‖ ≲ εn + σn h^{-1}(εn^D) | y) = 1 + op0(1)

If also ‖θ − θ0‖ ≤ L‖b(θ) − b(θ0)‖^α locally and θ → b(θ) is one-to-one, then

Πεn(‖θ − θ0‖ ≲ εn^α + σn^α (h^{-1}(εn^D))^α | y) = 1 + op0(1),   with δn := εn^α + σn^α (h^{-1}(εn^D))^α

Page 84: ABC short course: final chapters

Comments

• if Pθ(σn^{-1} ‖η(z) − b(θ)‖ > u) ≤ c(θ) h(u), two cases:

1 Polynomial tail: h(u) ≲ u^{−κ}, then δn = εn + σn εn^{−D/κ}

2 Exponential tail: h(u) ≲ e^{−cu}, then δn = εn + σn log(1/εn)

• E.g., η(y) = n^{−1} ∑_i g(yi) with moments on g (case 1) or Laplace transform (case 2)

Page 86: ABC short course: final chapters

Comments

• Π(‖b(θ) − b(θ0)‖ ≤ u) ≍ u^D: if Π is regular enough then D = dim(θ)

• no need to approximate the density f(η(y) | θ).

• Same result holds when εn = o(σn) if (A2) is replaced with

inf_{|x| ≤ M} Pθ(|σn^{-1}(η(z) − b(θ)) − x| ≤ u) ≳ u^D,   u ≈ 0

Page 87: ABC short course: final chapters

proof

Simple enough proof: assume σn ≤ δ εn and

|η(y) − b(θ0)| ≲ σn,   ‖η(y) − η(z)‖ ≤ εn

Hence

‖b(θ) − b(θ0)‖ > δn ⇒ |η(z) − b(θ)| > δn − εn − σn := tn

Also, if ‖b(θ) − b(θ0)‖ ≤ εn/3,

‖η(y) − η(z)‖ ≤ |η(z) − b(θ)| + σn + εn/3,   with σn ≤ εn/3

and

Πεn(‖b(θ) − b(θ0)‖ > δn | y) ≤ [∫_{‖b(θ)−b(θ0)‖>δn} Pθ(|η(z) − b(θ)| > tn) dΠ(θ)] / [∫_{|b(θ)−b(θ0)|≤εn/3} Pθ(|η(z) − b(θ)| ≤ εn/3) dΠ(θ)]

≲ εn^{−D} h(tn σn^{−1}) ∫_Θ c(θ) dΠ(θ)

Page 89: ABC short course: final chapters

Summary statistic and (in)consistency

Consider the moving average MA(2) model

yt = et + θ1 et−1 + θ2 et−2,   et ∼ i.i.d. N(0, 1)

and −2 ≤ θ1 ≤ 2, θ1 + θ2 ≥ −1, θ1 − θ2 ≤ 1.

summary statistics equal to sample autocovariances

ηj(y) = T^{−1} ∑_{t=1+j}^{T} yt yt−j,   j = 0, 1

with

η0(y) →P E[yt²] = 1 + (θ1^0)² + (θ2^0)²   and   η1(y) →P E[yt yt−1] = θ1^0 (1 + θ2^0)

For the ABC target pε(θ | η(y)) to be degenerate at θ0,

0 = b(θ0) − b(θ) = ( 1 + (θ1^0)² + (θ2^0)² − [1 + θ1² + θ2²],  θ1^0(1 + θ2^0) − θ1(1 + θ2) )

must have the unique solution θ = θ0

Take θ01 = .6, θ02 = .2: equation has 2 solutions

θ1 = .6, θ2 = .2 and θ1 ≈ .5453, θ2 ≈ .3204
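A two-line numerical check of this non-uniqueness, using only the map b displayed above:

b <- function(theta) c(1 + theta[1]^2 + theta[2]^2, theta[1] * (1 + theta[2]))

b(c(0.6, 0.2))          # limiting summaries at theta0
b(c(0.5453, 0.3204))    # second root: numerically the same limiting summaries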

Page 91: ABC short course: final chapters

Asymptotic shape of posterior distribution

Three different regimes:

1 σn = o(εn) −→ Uniform limit

2 σn ≍ εn −→ perturbed Gaussian limit

3 σn ≫ εn −→ Gaussian limit

Page 92: ABC short course: final chapters

Assumptions

• (B1) Concentration of the summary η: Σn(θ) ∈ Rk1×k1 is o(1) and

Σn(θ)^{-1}{η(z) − b(θ)} ⇒ Nk1(0, Id),   (Σn(θ) Σn(θ0)^{-1})n = Co

• (B2) b(θ) is C1 and

‖θ − θ0‖ ≲ ‖b(θ) − b(θ0)‖

• (B3) Dominated convergence and

lim_n Pθ(Σn(θ)^{-1}{η(z) − b(θ)} ∈ u + B(0, un)) / ∏_j un(j) → ϕ(u)

Page 93: ABC short course: final chapters

main result

Set Σn(θ) = σn D(θ) for θ ≈ θ0 and Z^o = Σn(θ0)^{-1}(η(y) − b(θ0)), then under (B1) and (B2)

• when εn σn^{-1} → +∞

Πεn[εn^{-1}(θ − θ0) ∈ A | y] ⇒ U_{B0}(A),   B0 = {x ∈ Rk; ‖b′(θ0)^T x‖ ≤ 1}

• when εn σn^{-1} → c

Πεn[Σn(θ0)^{-1}(θ − θ0) − Z^o ∈ A | y] ⇒ Qc(A),   Qc ≠ N

• when εn σn^{-1} → 0 and (B3) holds, set Vn = [b′(θ0)]^T Σn(θ0) b′(θ0), then

Πεn[Vn^{-1}(θ − θ0) − Z^o ∈ A | y] ⇒ Φ(A)

Page 94: ABC short course: final chapters

intuition

Set x(θ) = σn^{-1}(θ − θ0) − Z^o (k = 1)

πn := Πεn[εn^{-1}(θ − θ0) ∈ A | y]

   = [∫_{|θ−θ0|≤un} 1l{x(θ) ∈ A} Pθ(‖σn^{-1}(η(z) − b(θ)) + x(θ)‖ ≤ σn^{-1}εn) p(θ) dθ]
     / [∫_{|θ−θ0|≤un} Pθ(‖σn^{-1}(η(z) − b(θ)) + x(θ)‖ ≤ σn^{-1}εn) p(θ) dθ]  + op(1)

• If εn/σn ≫ 1:

Pθ(|σn^{-1}(η(z) − b(θ)) + x(θ)| ≤ σn^{-1}εn) = 1 + o(1)   iff |x| ≤ σn^{-1}εn + o(1)

• If εn/σn = o(1):

Pθ(|σn^{-1}(η(z) − b(θ)) + x| ≤ σn^{-1}εn) = ϕ(x) σn (1 + o(1))

Page 95: ABC short course: final chapters

more comments

• Surprising: U(−εn, εn) limit when εn ≫ σn, but not so much, since εn = o(1) means concentration around θ0 and σn = o(εn) implies that b(θ) − b(θ0) ≈ η(z) − η(y)

• again, there is no true need to control the approximation of f(η(y) | θ) by a Gaussian density: merely a control of the distribution

• Z^o/b′(θ0) behaves like the asymptotic score

• generalisation to the case where the eigenvalues of Σn are dn,1 ≠ · · · ≠ dn,k

• behaviour of EABC (θ|y) as in Li & Fearnhead (2016)

Page 97: ABC short course: final chapters

even more comments

If (also) p(θ) is Hölder β

EABC(θ | y) − θ0 = σn Z^o / b′(θ0)                         [score for f(η(y) | θ)]
                 + ∑_{j=1}^{⌊β/2⌋} εn^{2j} Hj(θ0, p, b)    [bias from threshold approximation]
                 + o(σn) + O(εn^{β+1})

• if εn² = o(σn): efficiency,

EABC(θ | y) − θ0 = σn Z^o / b′(θ0) + o(σn)

• the Hj(θ0, p, b)’s are deterministic

we gain nothing by getting a first crude θ(y) = EABC(θ | y) for some η(y) and then rerunning ABC with θ(y)

Page 98: ABC short course: final chapters

impact of the dimension of η

dimension of η(·) does not impact the above result, but impacts the acceptance probability

• if εn = o(σn), k1 = dim(η(y)), k = dim(θ) and k1 ≥ k:

αn := Pr(‖y − z‖ ≤ εn) ≍ εn^{k1} σn^{−k1+k}

• if εn ≳ σn:

αn := Pr(‖y − z‖ ≤ εn) ≍ εn^{k}

• If we choose αn:
  • αn = o(σn^k) leads to εn = σn (αn σn^{−k})^{1/k1} = o(σn)
  • αn ≳ σn^k leads to εn ≍ αn^{1/k}

Page 99: ABC short course: final chapters

conclusion on ABC consistency

• asymptotic description of ABC: different regimes depending on εn versus σn

• no point in choosing εn arbitrarily small: just εn = o(σn)

• no gain in iterative ABC

• results under weak conditions by not studying g(η(z)|θ)

the end
