Adaptive ABC model choice and geometric summary statistics ...

HAL Id: hal-00942797https://hal.archives-ouvertes.fr/hal-00942797v2

Submitted on 16 Jul 2014

HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, estdestinée au dépôt et à la diffusion de documentsscientifiques de niveau recherche, publiés ou non,émanant des établissements d’enseignement et derecherche français ou étrangers, des laboratoirespublics ou privés.

Adaptive ABC model choice and geometric summarystatistics for hidden Gibbs random fields

Julien Stoehr, Pierre Pudlo, Lionel Cucala

To cite this version:Julien Stoehr, Pierre Pudlo, Lionel Cucala. Adaptive ABC model choice and geometric summarystatistics for hidden Gibbs random fields. Statistics and Computing, Springer Verlag (Germany),2014, 25 (1), pp.129 - 141. �10.1007/s11222-014-9514-9�. �hal-00942797v2�

https://hal.archives-ouvertes.fr/hal-00942797v2

https://hal.archives-ouvertes.fr

Adaptive ABC model choice and geometric summary

statistics for hidden Gibbs random fields

Julien Stoehr∗, Pierre Pudlo∗ Lionel Cucala∗

July 16, 2014

Abstract

Selecting between different dependency structures ofhidden Markov random field can be very challeng-ing, due to the intractable normalizing constant inthe likelihood. We answer this question with approx-imate Bayesian computation (ABC) which provides amodel choice method in the Bayesian paradigm. Thiscomes after the work of Grelaud et al. (2009) who ex-hibited sufficient statistics on directly observed Gibbsrandom fields. But when the random field is latent,the sufficiency falls and we complement the set withgeometric summary statistics. The general approachto construct these intuitive statistics relies on a clus-tering analysis of the sites based on the observedcolors and plausible latent graphs. The efficiency ofABC model choice based on these statistics is evalu-ated via a local error rate which may be of indepen-dent interest. As a byproduct we derived an ABCalgorithm that adapts the dimension of the summarystatistics to the dataset without distorting the modelselection.

Keywords: Approximate Bayesian Computation,model choice, hidden Gibbs random fields, summarystatistics, misclassification rate, k-nearest neighbors

1 Introduction

Gibbs random fields are polymorphous statisticalmodels, that are useful to analyse different types ofspatially correlated data, with a wide range of appli-cations, including image analysis (Hurn et al., 2003),disease mapping (Green and Richardson, 2002), ge-netic analysis (Francois et al., 2006) among others.The autobinomial model (Besag, 1974) which encom-passes the Potts model, is used to describe the spatialdependency of discrete random variables (e.g., shades

∗I3M – UMR CNRS 5149, Universite Montpellier II, France

of grey or colors) on the vertices of an undirectedgraph (e.g., a regular grid of pixels). See for exam-ple Alfo et al. (2008) and Moores, Hargrave, Harden,and Mengersen (2014) who performed image segmen-tation with the help of the above modeling. Despitetheir popularity, these models present major difficul-ties from the point of view of either parameter esti-mation (Friel et al., 2009, Friel, 2012, Everitt, 2012)or model choice (Grelaud et al., 2009, Friel, 2013, Cu-cala and Marin, 2013), due to an intractable normal-izing constant. Remark the exception of small lat-ices on which we can apply the recursive algorithmof Reeves and Pettitt (2004), Friel and Rue (2007)and obtain a reliable approximation of the normal-izing constant. However, the complexity in time ofthe above algorithm grows exponentially and is thushelpless on large lattices.

The present paper deals with the challenging prob-lem of selecting a dependency structure of an hid-den Potts model in the Bayesian paradigm and ex-plores the opportunity of approximate Bayesian com-putation (ABC, Tavare et al., 1997, Pritchard et al.,1999, Marin et al., 2012, Baragatti and Pudlo, 2014)to answer the question. Up to our knowledge, thisimportant question has not yet been addressed inthe Bayesian literature. Alternatively we could havetried to set up a reversible jump Markov chain MonteCarlo, but follows an important work for the statisti-cian to adapt the general scheme, as shown by Caimoand Friel (2011, 2013) in the context of exponentialrandom graph models where the observed data is agraph. Cucala and Marin (2013) addressed the ques-tion of inferring the number of latent colors with anICL criterion but their complex algorithm cannot beextended easily to choose the dependency structure.Other approximate methods have also been tack-led in the literature such as pseudo-likelihoods (Be-sag, 1975), mean field approximations (Forbes andPeyrard, 2003) but lacks theoretical support.

1

Approximate Bayesian computation (ABC) is asimulation based approach that can addresses themodel choice issue in the Bayesian paradigm. The al-gorithm compares the observed data yobs with numer-ous simulations y through summary statistics S(y)in order to supply a Monte Carlo approximation ofthe posterior probabilities of each model. The choiceof such summary statistics presents major difficul-ties that have been especially highlighted for modelchoice (Robert et al., 2011, Didelot et al., 2011). Be-yond the seldom situations where sufficient statisticsexist and are explicitly known (Gibbs random fieldsare surprising examples, see Grelaud et al., 2009),Marin et al. (2013) provide conditions which ensurethe consistency of ABC model choice. The presentwork has thus to answer the absence of available suf-ficient statistics for hidden Potts fields as well as thedifficulty (if not the impossibility) to check the abovetheoretical conditions in practice.

Recent articles have proposed automatic schemesto construct theses statistics (rarely from scratch butbased on a large set of candidates) for Bayesian pa-rameter inference and are meticulously reviewed byBlum et al. (2013) who compare their performancesin concrete examples. But very few has been accom-plished in the context of ABC model choice apartfrom the work of Prangle et al. (2013). The statisticsS(y) reconstructed by Prangle et al. (2013) have goodtheoritical properties (those are the posterior proba-bilities of the models in competition) but are poorlyapproximated with a pilot ABC run (Robert et al.,2011), which is also time consuming.

The paper is organized as follows: Section 2presents ABC model choice as a k-nearest neighborclassifier, and defines a local error rate which is thefirst contribution of the paper. We also provide anadaptive ABC algorithm based on the local error toselect automatically the dimension of the summarystatistics. The second contribution is the introduc-tion of a general and intuitive approach to producegeometric summary statistics for hidden Potts modelin Section 3. We end the paper with numerical resultsin that framework.

2 Local error rates and adpa-

tive ABC model choice

When dealing with models whose likelihood cannotbe computed analytically, Bayesian model choice be-comes challenging since the evidence of each model

writes as the integral of the likelihood over the priordistribution of the model parameter. ABC provides amethod to escape from the intractability problem andrelies on many simulated datasets from each model ei-ther to learn the model that fits the observed datayobs or to approximate the posterior probabilities.We refer the reader to reviews on ABC (Marin et al.,2012, Baragatti and Pudlo, 2014) to get a wider pre-sentation and will focus here on the model choice pro-cedure.

2.1 Background on Approximate

Bayesian computation for model

choice

Assume we are given M stochastic models with re-spective paramater spaces embedded into Euclideanspaces of various dimensions. The joint Bayesian dis-tribution sets

(i) a prior on the model space, π(1), . . . , π(M),

(ii) for each model, a prior on its parameter space,whose density with respect to a reference mea-sure (often the Lebesgue measure of the Eu-clidean space) is πm(θm) and

(iii) the likelihood of the data y within each model,namely fm(y|θm).

The evidence of model m is then defined as

e(m, y) :=

∫fm(y|θm)πm(θm)dθm

and the posterior probability of model m as

π(m|y) =π(m)e(m, y)∑

m′ π(m′)e(m′, y). (1)

When the goal of the Bayesian analysis is the selec-tion of the model that best fits the observed datayobs, it is performed through the maximum a poste-riori (MAP) defined by

mMAP(yobs) = argmaxm π(m|yobs). (2)

The latter can be seen as a classification problem pre-dicting the model number given the observation of y.From this standpoint, mMAP is the Bayes classifier,well known to minimize the 0-1 loss (Devroye et al.,1996). One might argue that mMAP is an estima-tor defined as the mode of the posterior probabilitieswhich form the density of the posterior with respectto the counting measure. But the counting measure,

2

namely δ1+· · ·+δM , is a canonical reference measure,since it is invariant to any permutation of {1, . . . ,M}whereas no such canonical reference measure (invari-ant to one-to-one transformation) exists on compactsubset of the real line. Thus (2) does not suffer fromthe drawbacks of posterior mode estimators (Druilhetand Marin, 2007).

To approximate mMAP, ABC starts by simulatingnumerous triplets (m, θm, y) from the joint Bayesianmodel. Afterwards, the algorithm mimics the Bayesclassifier (2): it approximates the posterior probabil-ities by the frequency of each model number associ-ated with simulated y’s in a neighborhood of yobs. Ifrequired, we can eventually predict the best modelwith the most frequent model in the neighborhood,or, in other words, take the final decision by pluggingin (2) the approximations of the posterior probabili-ties.

If directly applied, this first, naive algorithm facesthe curse of dimensionality, as simulated datasets ycan be complex objects and lie in a space of highdimension (e.g., numerical images). Indeed, find-ing a simulated dataset in the vicinity of yobs is al-most impossible when the ambient dimension is high.The ABC algorithm performs therefore a (non linear)projection of the observed and simulated datasetsonto some Euclidean space of reasonable dimensionvia a function S, composed of summary statistics.Moreover, due to obvious reasons regarding computermemory, instead of keeping track of the whole sim-ulated datasets, one commonly saves only the simu-lated vectors of summary statistics, which leads to atable composed of iid replicates (m, θm, S(y)), oftencalled the reference table in the ABC literature, seeAlgorithm 1.

Algorithm 1: Simulation of the ABC referencetableOutput: A reference table of size nREF

for j ← 1 to nREF do

draw m from the prior π;draw θ from the prior πm;draw y from the likelihood fm(·|θ);compute S(y);save (mj , θj , S(yj))← (m, θ, S(y));

end

return the table of (mj , θj , S(yj)),j = 1, . . . , nREF

From the standpoint of machine learning, the ref-erence table serves as a training database composedof iid replicates drawn from the distribution of in-terest, namely the joint Bayesian model. The re-gression problem of estimating the posterior prob-abilities or the classification problem of predictinga model number are both solved with nonparamet-ric methods. The neighborhood of yobs is thus de-fined as simulations whose distances to the observa-tion measured in terms of summary statistics, i.e.,ρ(S(y), S(yobs)), fall below a threshold ε commonlynamed the tolerance level. The calibration of ε isdelicate, but had been partly neglected in the papersdealing with ABC that first focused on decreasingthe total number of simulations via the recourse toMarkov chain Monte Carlo (Marjoram et al., 2003)or sequential Monte Carlo methods (Beaumont et al.,2009, Del Moral et al., 2012) whose common tar-get is the joint Bayesian distribution conditioned byρ(S(y), S(yobs)) ≤ ε for a given ε. By contrast, thesimple setting we adopt here reveals the calibrationquestion. In accepting the machine learning view-point, we can consider the ABC algorithm as a k-nearest neighbor (knn) method, see Biau et al. (2013);the calibration of ε is thus transformed into the cal-ibration of k. The Algorithm we have to calibrate isgiven in Algorithm 2.

Algorithm 2: Uncalibrated ABC model choice

Output: A sample of size k distributedaccording to the ABC approximationof the posterior

simulate the reference table T according toAlgorithm 1;sort the replicates of T according toρ(S(yj), S(y

obs));keep the k first replicates;return the relative frequencies of each modelamong the k first replicates and the mostfrequent model;

Before entering into the tuning of k, we highlightthat the projection via the summary statistics gen-erates a difference with the standard knn methods.Under mild conditions, knn are consistent nonpara-metric methods. Consequently, as the size of the ref-erence table tends to infinity, the relative frequencyof model m returned by Algorithm 2 converges to

π(m|S(yobs)).

3

Unfortunately, when the summary statistics are notsufficient for the model choice problem, Didelot et al.(2011) and Robert et al. (2011) found that theabove probability can greatly differ from the genuineπ(m|yobs). Afterwards Marin et al. (2013) providenecessary and sufficient conditions on S(·) for theconsistency of the MAP based on π(m|S(yobs)) whenthe information included in the dataset yobs increases,i.e. when the dimension of yobs tends to infinity.Consequently, the problem that ABC addresses withreliability is classification, and the mentioned theo-retical results requires a shift from the approximationof posterior probabilities. Practically the frequenciesreturned by Algorithm 2 should solely be used to or-der the models with respect to their fits to yobs andconstruct a knn classifier m that predicts the modelnumber.It becomes therefore obvious that the calibration of

k should be done by minimizing the misclassificationerror rate of the resulting classifier m. This indicatoris the expected value of the 0-1 loss function, namely1{m(y) 6= m}, over a random (m, y) distributed ac-cording to the marginal (integrated in θm) of the jointBayesian distribution, whose density in (m, y) writes

π(m)

∫fm(y|θm)πm(θm)dθm. (3)

Ingenious solutions have been already proposed andare now well established to fullfil this minimizationgoal and bypass the overfitting problem, based oncross-validation on the learning database. But, forthe sake of clarity, particularly in the following sec-tions, we decided to take advantage of the fact thatABC aims at learning on simulated databases, and wewill use a validation reference table, simulated alsowith Algorithm 1, but independently of the train-ing reference table, to evaluate the misclassificationrate with the averaged number of differences betweenthe true model numbers mj and the predicted val-ues m(yj) by knn (i.e. by ABC) on the validationreference table.

2.2 Local error rates

The misclassification rate τ of the knn classifier m atthe core of Algorithm 2 provides consistant evidenceof its global accuracy. It supplies indeed a well-knownsupport to calibrate k in Algorithm 2. The purpose ofABC model choice methods though is the analyse ofan observed dataset yobs and this first indicator is ir-relevant to assess the accuracy of the classifier at thisprecise point of the data space, since it is by nature

a prior gauge. We propose here to disintegrate thisindicator, and to rely on conditional expected valueof the misclassification loss 1{m(y) 6= m} knowing yas an evaluation of the efficiency of the classifier aty. We recall the following proposition whose proof iseasy, but might help clarifying matters when appliedto the joint distribution (3).

Proposition 1. Consider a classifier m that aimsat predicting m given y on data drawn from the jointdistribution f(m, y). Let τ be the misclassificationrate of m, defined by P(m(Y ) 6= M ), where (M , Y )is a random pair with distribution f under the proba-bility measure P. Then, (i) the expectation of the lossfunction is

τ =∑

m

∫

y

1{m(y) 6= m}f(m, y) dy.

Additionally, (ii), the conditional expectation knowingy, namely τ(y) = P

(m(Y ) 6= M

∣∣Y = y), is

τ(y) =∑

m

1{m(y) 6= m}f(m|y) (4)

and τ =∫yf(y)τ(y) dy, where f(y) denotes the

marginal distribution of f (integrated over m) andf(m|y) = f(m, y)/f(y) the conditional probability ofm given y. Furthermore, we have

τ(y) = 1− f(m(y) | y). (5)

The last result (5) suggests that a conditional ex-pected value of the misclassification loss is a valuableindicator of the error at y since it is admitted thatthe posterior probability of the predicted model re-veals the accuracy of the decision at y. Neverthe-less, the whole simulated datasets are not saved intothe ABC reference table but solely some numericalsummaries S(y) per simulated dataset y, as explainedabove. Thus the disintegration process of τ is practi-cally limited to the conditional expectation of the lossknowing some non one-to-one function of y. Its def-inition becomes thereforemuch more subtle than thebasic (4). Actually, the ABC classifier can be trainedon a subset S1(y) of the summaries S(y) saved inthe training reference table, or on some deterministicfunction (we still write S1(y)) of S(y) that reducesthe dimension, such as the projection on the LDAaxes proposed by Estoup et al. (2012). To highlightthis fact, the ABC classifier is denoted by m(S1(y)) inwhat follows. It is worth noting here that the abovesetting encompasses any dimension reduction tech-nique presented in the review of Blum et al. (2013),

4

though the review in oriented on parameter inference.Furthermore we might want to disintegrate the mis-classification rate with respect to another projectionS2(y) of the simulated data that can or cannot be re-lated to the summaries S1(y) used to train the ABCclassifier, albeit S2(y) is also limited to be a deter-ministic function of S(y). This yields the following.

Definition 2. The local error rate of the m(S1(y))classifier with respect to S2(y) is

τS1(S2(y)) := P

(m(S1(Y )) 6= M

∣∣S2(Y ) = S2(y)),

where (M , Y ) is a random variable with distributiongiven in (3).

The purpose of the local misclassification rate inthe present paper is twofold and requires to play withthe distinction between S1 and S2, as the last partwill show on numerical examples. The first goal isthe construction of a prospective tool that aims atchecking whether a new statistic S′(y) carries addi-tional information regarding the model choice, be-yond a first set of statistics S1(y). In the latter case,it can be useful to localize the misclassification errorof m(S1(y)) with respect to the concatenated vec-tor S2(y) = (S1(y), S

′(y)). Indeed, this local errorrate can reveal concentrated areas of the data space,characterized in terms of S2(y), in which the local er-ror rate rises above (M − 1)/M , the averaged (local)amount of errors of the random classifier among Mmodels, so as to approach 1. The interpretation ofthe phenomenon is as follows: errors committed bym(S1(y)), that are mostly spread on the S1(y)-space,might gather in particular areas of subspaces of thesupport of S2(y) = (S1(y), S

′(y)). This peculiarityis due to the dimension reduction of the summarystatistics in ABC before the training of the classifierand represents a concrete proof of the difficulty ofABC model choice already raised by Didelot et al.(2011) and Robert et al. (2011).The second goal of the local error rate given in Def-

inition 2 is the evaluation of the confidence we mayconcede in the model predicted at yobs by m(S1(y)),in which case we set S2(y) = S1(y). And, when bothsets of summaries agree, the results of Proposition 1extend to

τS1(S1(y)) =

∑

m

π(m|S1(y))1{m(S1(y)) = m}

= 1− π(m(S1(y)) |S1(y)). (6)

Besides the local error rate we propose in Defini-tion 2 is an upper bound of the Bayes classifier if we

admit the loss of information committed by replacingy with the summaries.

Proposition 3. Consider any classifier m(S1(y)).The local error rate of this classifier satisfies

τS1(s2) = P (m(S1(Y )) 6= M |S2(Y ) = s2)

≥ P (mMAP(Y ) 6= M |S2(Y ) = s2) , (7)

where mMAP is the Bayes classifier defined in (2) ands2 any value in the support of S2(Y ). Consequently,

P (m(S1(Y )) 6= M ) ≥ P (mMAP(Y ) 6= M ) . (8)

Proof. Proposition 1, in particular (5), implies thatmMAP(y) is the ideal classifier that minimizes theconditional 0-1 loss knowing y. Hence, we have

P (m(S1(Y )) 6= M |Y = y)

≥ P (mMAP(Y ) 6= M |Y = y) .

Integrating the above with respect to the distributionof Y knowing S2(Y ) leads to (7), and a last integralto (8).

Proposition 3 shows that the introduction of newsummary statistics cannot distort the model selectioninsofar as the risk of the resulting classifier cannotdecrease below the risk of the Bayes classifier mMAP.We give here a last flavor of the results of Marin et al.(2013) and mention that, if S1(y) = S2(y) = S(y) andif the classifiers are perfect (i.e., trained on infinitereference tables), we can rephrase part of their resultsas providing mild conditions on S under which thelocal error τS(S(y)) tends to 0 when the size of thedataset y tends to infinity.

2.3 Estimation algorithm of the local

error rates

The numerical estimation of the local error rateτS1

(S2(y)), as a surface depending on S2(y), is there-fore paramount to assess the difficulty of the classifi-cation problem at any s2 = SS(y), and the local accu-racy of the classifier. Naturally, when S1(y) = S2(y)for all y, the local error can be evaluated at S2(y

obs)by plugging in (6) the ABC estimates of the posteriorprobabilities (the relative frequencies of each modelamong the particles returned by Algorithm 2) as sub-stitute for π(m|S(yobs)). This estimation procedureis restricted to the above mentioned case where theset of statistics used to localize the error rate agreeswith the set of statistics used to train the classifier.

5

Moreover, the approximation of the posterior proba-bilities returned by Algorithm 2, i.e., a knn method,might not be trustworthy: the calibration of k per-formed by minimizing the prior error rate τ doesnot provide any certainty on the estimated posteriorprobabilities beyond a ranking of these probabilitiesthat yields the best classifier in terms of misclassi-fication. In other words, the knn method calibratedto answer the classification problem of discriminatingamong models does not produce a reliable answer tothe regression problem of estimating posterior prob-abilities. Certainly, the value of k must be increasedto face this second kind of issue, at the price of alarger bias that might even swap the model ranking(otherwise, the empirical prior error rate would notdepend on k, see the numerical result section).For all these reasons, we propose here an alterna-

tive estimate of the local error. The core idea of ourproposal is the recourse to a nonparametric methodto estimate conditional expected values based on thecalls to the classifier m on a validation reference ta-ble, already simulated to estimate the global errorrate τ . Nadaraya-Watson kernel estimators of theconditional expected values

τS1(S2(y)) =

E (1{m(S1(Y )) 6= M } |S2(Y ) = S2(y)) (9)

rely explicitly on the regularity of this indicator, asa function of s2 = S2(y), which contrasts with theABC plug-in estimate described above. We thus hopeimprovements in the accuracy of error estimate anda more reliable approximation of the whole functionτS1

(S2(y)). Additionally, we are not limited anymoreto the special case where S1(y) = S2(y) for all y.It is worth stressing here that the bandwidth of thekernels must be calibrated by minimizing the L2-loss,since the target is a conditional expected value.Practically, this leads to Algorithm 3 which re-

quires a validation or test reference table independentof the training database that constitutes the ABC ref-erence table. We can bypass the requirement by re-sorting to cross validation methods, as for the compu-tation of the global prior misclassification rate τ . Butthe ensued algorithm is complex and it induces morecalls to the classifier (consider, e.g,. a ten-fold crossvalidation algorithm computed on more than one ran-dom grouping of the reference table) than the basicAlgorithm 3, whereas the training database can al-ways be supplemented by a validation database sinceABC, by its very nature, is a learning problem onsimulated databases. Moreover, to display the whole

Algorithm 3: Estimation of τS1(S2(y)) given an

classifier m(S1(y)) on a validation or test refer-ence tableInput: A validation or test reference table and a

classifier m(S1(y)) fitted with a firstreference table

Output: Estimations of (9) at each point of thesecond reference table

for each (mj , yj) in the test table do

compute δj = {m(S1(yj)) 6= mj};end

calibrate the bandwidth h of theNadaraya-Watson estimator predicting δjknowing S2(yj) via cross-validation on the testtable;for each (mj , yj) in the test table do

evaluate the Nadaraya-Watson estimatorwith bandwidth h at S2(yj);

end

surface τS1(S2(y)), we can interpolate values of the

local error between points S2(y) of the second ref-erence table with the help of a Kriging algorithm.We performed numerical experiments (not detailedhere) concluding that the resort to a Kriging algo-rithm provides results comparable to the evaluationof Nadaraya-Watson estimator at any point of thesupport of S2(y), and can reduce computation times.

2.4 Adaptive ABC

The local error rate can also represent a valuable wayto adjust the summary statistics to the data pointy and to build an adaptive ABC algorithm achiev-ing a local trade off that increases the dimension ofthe summary statistics at y only when the additionalcoordinates add information regarding the classifica-tion problem. Assume that we have at our disposal acollection of ABC classifiers, mλ(y) := mλ(Sλ(y)),λ = 1, . . . ,Λ, trained on various projections of y,namely the Sλ(y)’s, and that all these vectors, sortedwith respect to their dimension, depend only on thesummary statistics registered in the reference tables.Sometimes low dimensional statistics may suffice forthe classification (of models) at y, whereas othertimes we may need to examine statistics of larger di-mension. The local adaptation of the classifier is ac-complished through the disintegration of the misclas-sification rates of the initial classifiers with respect toa common statistic S0(y). Denoting τλ(S0(y)) the lo-

6

cal error rate of mλ(y) knowing S0(y), this reasoningyields the adaptive classifier defined by

m(S(y)) := mλ(y)(y),

where λ(y) := argminλ=1,...,Λ τλ(S0(y)). (10)

This last classifier attempts to avoid bearing the costof the potential curse of dimensionality from which allknn classifiers suffer and can help reduce the error ofthe initial classifiers, although the error of the idealclassifier (2) remains an absolute lower bound, seeProposition 3. From a different perspective, (10) rep-resents a way to tune the similarity ρ(S(y), S(yobs))of Algorithm 2 that locally includes or excludes com-ponents of S(y) to assess the proximity between S(y)and S(yobs). Practically, we rely on the followingalgorithm to produce the adaptive classifier, that re-quires a validation reference table independent of thereference table used to fit the initial classifiers.

Algorithm 4: Adaptive ABC model choice

Input: A collection of classifiers mλ(y),λ = 1, . . . ,Λ and a validation referencetable

Output: An adaptive classifier m(y)

for each λ ∈ {1, . . . ,Λ} doestimate the local error of mλ(y) knowingS0(y) with the help of Algorithm 3;

end

return the adaptive classifier m as a functioncomputing (10);

The local error surface estimated within the loopof Algorithm 4 must contrast the errors of the col-lection of classifiers. Our advice is thus to build aprojection S0(y) of the summaries S(y) registered inthe reference tables as follow. Add to the validationreference table a qualitative trait which groups thereplicates of the table according to their differencesbetween the predicted numbers by the initial clas-sifiers and the model numbers mj registered in thedatabase. For instance, when the collection is com-posed of Λ = 2 classifiers, the qualitative trait takesthree values: value 0 when both classifiers mλ(yj)agree (whatever the value of mj), value 1 when thefirst classifier only returns the correct number, i.e.,m1(yj) = mj 6= m2(yj), and value 2 when the sec-ond classifier only returns the correct number, i.e.,m1(yj) 6= mj = m2(yj). The axes of the linear dis-criminant analysis (LDA) predicting the qualitative

trait knowing S(y) provide a projection S0(y) whichcontrasts the errors of the initial collection of classi-fiers.

Finally it is important to note that the local errorrates are evaluated in Algorithm 4 with the help of avalidation reference table. Therefore, a reliable esti-mation of the accuracy of the adaptive classifier can-not be based on the same validation database becauseof the optimism bias of the training error. Evaluatingthe accuracy requires the simulation of a test refer-ence table independently of the two first databasesused to train and adapt the predictor, as is usuallyperformed in the machine learning community.

3 Hidden random fields

Our primary intent with the ABC methodology ex-posed in Section 2 was the study of new summarystatistics to discriminate between hidden randomfields models. The following materials numericallyillustrate how ABC can choose the dependency struc-ture of latent Potts models among two possible neigh-borhood systems, both described with undirectedgraphs, whilst highlighting the generality of the ap-proach.

3.1 Hidden Potts model

This numerical part of the paper focuses on hiddenPotts models, that are representative of the generallevel of difficulty while at the same time being widelyused in practice (see for example Hurn et al., 2003,Alfo et al., 2008, Francois et al., 2006, Moores et al.,2014). we recall that the latent random field x is afamily of random variables xi indexed by a finite setS , whose elements are called sites, and taking valuesin a finite state space X := {0, . . . ,K − 1}, inter-preted as colors. When modeling a digital image, thesites are lying on a regular 2D-grid of pixels, and theirdependency is given by an undirected graph G whichdefines an adjacency relationship on the set of sitesS : by definition, both sites i and j are adjacent ifand only if the graph G includes an edge that links di-rectly i and j. A Potts model sets a probability distri-bution on x, parametrized by a scalar β that adjuststhe level of dependency between adjacent sites. Thelatter class of models differs from the auto-modelsof Besag (1974), that allow variations on the levelof dependencies between edges and introduce poten-tial anisotropy on the graph. But the difficulty of allthese models arises from the intractable normalizing

7

(a)

(b)

Figure 1: Neighborhood graphs G of hidden Pottsmodel. (a) The four closest neighbour graph G4

defining model HPM(G4, α, β). (b) The eight closestneighbour graph G8 defining model HPM(G8, α, β)

constant, called the partition function, as illustratedin the distribution of Potts models defined by

π(x|G , β) =1

Z(G , β)exp

β

∑

iG∼j

1{xi = xj}

.

The above sum iG∼ j ranges the set of edges of the

graph G and the normalizing constant Z(G , β) writesas

Z(G , β) =∑

x∈X

exp

β

∑

iG∼j

1{xi = xj}

, (11)

namely a summation over the numerous possible re-alizations of the random field x, that cannot be com-puted directly (except for small grids and small num-ber of colors K). In the statistical physic literature,β is interpreted as the inverse of a temperature, andwhen the temperature drops below a fixed threshold,values xi of a typical realization of the field are almostall equal (due to important dependency between allsites). These peculiarities of Potts models are calledphase transitions.In hidden Markov random fields, the latent process

is observed indirectly through another field; this per-mits the modeling of a noise that may be encountered

in many concrete situations. Precisely, given the real-ization x of the latent field, the observation y is a fam-ily of random variables indexed by the set of sites, andtaking values in a set Y , i.e., y = (yi; i ∈ S ), and arecommonly assumed as independent draws that forma noisy version of the hidden fields. Consequently,we set the conditional distribution of y knowing xas the product π(y|x, α) =

∏i∈S

P (yi|xi, α), whereP is the marginal noise distribution parametrized bysome scalar α. Hence the likelihood of the hiddenPotts model with parameter β on the graph G andnoise distribution Pα, denoted HPM(G , α, β), is givenby

f(y|α, β,G ) =∑

x∈X

π (x|G , β)πα(y|x)

and faces a double intractable issue as neither thelikelihood of the latent field, nor the above sum canbe computed directly: the cardinality of the range ofthe sum is of combinatorial complexity. The followingnumerical experiments are based on two classes ofnoises, producing either observations in {0, 1, . . . ,K−1}, the set of latent colors, or continuous observationsthat take values in R.

The common point of our examples is to select thehidden Gibbs model that better fits a given yobs com-posed of N = 100× 100 pixels within different neigh-borhood systems represented as undirected graphs G .We considered the two widely used adjacency struc-tures in our simulations, namely the graph G4 (re-spectively G8) in which the neighborhood of a siteis composed of the four (respectively eight) closestsites on the two-dimensional lattice, except on theboundaries of the lattice, see Fig. 1. The prior prob-abilities of both models were set to 1/2 in all exper-iments. The Bayesian analysis of the model choicequestion adds another integral beyond the two abovementioned sums that cannot be calculated explic-itly or numerically either and the problem we illus-trate are said triple intractable. Up to our knowl-edge the choice of the latent neighborhood structurehas never been seriously tackled in the Bayesian lit-erature. We mentioned here the mean field approxi-mation of Forbes and Peyrard (2003) whose softwarecan estimate paramaters of such models, and com-pare models fitness via a BIC criterion. But the firstresults we try to obtain with this tool were worsethan with ABC, except for very low values of β. Thismeans that either we did not manage to run the soft-ware properly, or that the mean field approximationis not appropriate to discriminate between neighbor-hood structures. The detailed settings of our three

8

experiments are as follows.

First experiment. We considered Potts modelswith K = 2 colors and a noise process that switcheseach pixel independently with probability

exp(−α)/(exp(α) + exp(−α)),

following the proposal of Everitt (2012). The prior onα was uniform over (0.42; 2.3), where the bounds ofthe interval were determined to switch a pixel witha probability less than 30%. Regarding the depen-dency parameter β, we set prior distributions belowthe phase transition which occurs at different levelsdepending on the neighborhood structure. Preciselywe used a uniform distribution over (0; 1) when theadjacency is given by G4 and a uniform distributionover (0; 0.35) with G8.

Second experiment. We increased the number ofcolors in the Potts models and set K = 16. Like-wise, we set a noise that changes the color of eachpixel with a given probability parametrized by α, andconditionally on a change at site i, we rely on theleast favorable distribution, which is a uniform drawwithin all colors except the latent one. To extendthe parametrization of Everitt (2012), the marginaldistribution of the noise is defined by

Pα(yi|xi) =exp

{α(21{xi = yi} − 1

)}

exp(α) + (K − 1) exp(−α)

and a uniform prior on α over the interval (1.78; 4.8)ensures that the probability of changing a pixel withthe noise process is at most 30%. The uniform prioron the Potts parameter β was also tuned to stay be-low the phase transition. Hence β ranges the interval(0; 2.4) with a G4 structure and the interval (0; 1) witha G8 structure.

Third experiment. We introduced a homoscedas-tic Gaussian noise whose marginal distribution ischaracterized by

yi | xi = c ∼ N (c, σ2) c ∈ {0; 1}

over bicolor Potts models. And both prior distribu-tions on parameter β are similar to the ones on thelatent fields of the first experiment. The standard de-viation σ = 0.39 was set so that the probability of awrong prediction of the latent color with a marginalMAP rule on the Gaussian model is about 15%.

3.2 Geometric summary statistics

Performing a Bayesian model choice via ABC algo-rithms requires summary statistics that capture therelevant information from the observation yobs todiscriminate among the competing models. Whenthe observation is noise-free, Grelaud et al. (2009)noted that the joint distribution resulting from theBayesian modeling falls into the exponential family,and they obtained consecutively a small set of sum-mary statistics, depending on the collection of consid-ered models, that were sufficient. In front of noise,the situation differs substantially as the joint distri-bution lies now outside the exponential family, andthe above mentioned statistics are not sufficient any-more, whence the urge to bring forward other con-crete and workable statistics. The general approachwe developed reveals geometric features of a discretefield y via the recourse to colored graphs attachedto y and their connected components. Consider anundirected graph G whose set of vertices coincideswith S , the set of sites of y.

Definition 4. The graph induced by G on the fieldy, denoted Γ(G , y), is the undirected graph whose setof edges gathers the edges of G between sites of y thatshare the same color, i.e.,

iΓ(G ,y)∼ j ⇐⇒ i

G∼ j and yi = yj .

We believe that the connected components of suchinduced graphs capture major parts of the geometryof y. Recall that a connected component of an undi-rected graph Γ is a subgraph of Γ in which any twovertices are connected to each other by a path, andwhich is connected to no other vertices of Γ. And theconnected components form a partition of the ver-tices. Since ABC relies on the computation of thesummary statistics on many simulated datasets, itis also worth noting that the connected componentscan be computed efficiently with the help of famousgraph algorithms in linear time based on a breadth-first search or depth-first search over the graph. Theempirical distribution of the sizes of the connectedcomponents represents an important source of geo-metric informations, but cannot be used as a statis-tic in ABC because of the curse of dimensionality.The definition of a low dimensional summary statis-tic derived from these connect components should beguided by the intuition on the model choice we face.Our numerical experiments discriminate between

a G4- and a G8-neighborhood structure and we con-sidered two induced graphs on each simulated y,

9

Γ(G4, y) Γ(G8, y)

Figure 2: The induced graph Γ(G4, y) and Γ(G8, y) on a given bicolor image y of size 5×5. The six summarystatistics on y are thus R(G4, y) = 22, T (G4, y) = 7, U(G4, y) = 12, R(G8, y) = 39, T (G8, y) = 4 andU(G8, y) = 16

namely Γ(G4, y) and Γ(G8, y). Remark that the two-dimensional statistics proposed by Grelaud et al.(2009), which are sufficient in the noise-free con-text, are the total numbers of edges in both inducedgraphs. After very few trials without success, wefixed ourselves on four additional summary statistics,namely the size of the largest component of each in-duced graph, as well as the total number of connectcomponents in each graph. See Fig. 2 for an exampleon a bicolor picture y. To fix the notations, for anyinduced graph Γ(G , y), we define

• R(G , y) as the total number of edges in Γ(G , y),

• T (G , y) as the number of connected componentsin Γ(G , y) and

• U(G , y) as the size of the largest connected com-ponent of Γ(G , y).

And to sum up the above, the set of summary statis-tics that where registered in the reference tables foreach simulated field y is

S(y) =(R(G4, y);R(G8, y);T (G4, y);

T (G8, y);U(G4, y);U(G8, y))

in the first and second experiments.In the third experiment, the observed field y takes

values in R and we cannot apply directly the approachbased on induced graphes because no two pixels sharethe same color. All of the above statistics are mean-ingless, including the statistics R(G , y) used by Gre-laud et al. (2009) in the noise-free case. We rely ona quantization preprocessing performed via a kmeansalgorithm on the observed colors that forgets the spa-tial structure of the field. The algorithm was tuned

to uncover the same number of groups of colors as thenumber of latent colors, namely K = 2. If q2(y) de-notes the resulting field, the set of summary statisticsbecomes

S(y) =(R(G4, q2(y)

);R

(G8, q2(y)

);T

(G4, q2(y)

);

T(G8, q2(y)

);U

(G4, q2(y)

);U

(G8, q2(y)

)).

We have assumed here that the number of latent col-ors is known to keep the same purpose of selectingthe correct neighborhood structure. Indeed Cucalaand Marin (2013) have already proposed a (complex)Bayesian method to infer the appropriate number ofhidden colors. But more generally, we can add statis-tics based on various quantizations qk(y) of y with kgroups.

3.3 Numerical results

In all three experiments, we compare three nested setsof summary statistics S2D(y), S4D(y) and S6D(y) ofdimension 2, 4 and 6 respectively. They are definedas the projection onto the first two (respectively four

Table 1: Evaluation of the prior error rate on a testreference table of size 30, 000 in the first experiment.

Prior error ratesTrain size 5,000 100,0002D statistics 8.8% 7.9%4D statistics 6.5% 6.1%6D statistics 7.1% 7.1%

Adaptive ABC 6.2% 5.5%

10

0 100 200 300 400 500

0.00

0.02

0.04

0.06

0.08

0.10

dim 2

dim 4

dim 6

0 100 200 300 400 500

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

dim 2

dim 4

dim 6

(a) (b)

0.0

0.2

0.4

0.6

0.8

1.0

(c)

Figure 3: First experiment results. (a) Prior error rates (vertical axis) of ABC with respect to thenumber of nearest neighbors (horizontal axis) trained on a reference table of size 100, 000 (solid lines) or50, 000 (dashed lines), based on the 2D, 4D and 6D summary statistics. (b) Prior error rates of ABC basedon the 2D summary statistic compared with 4D and 6D summary statistics including additional ancillarystatistics. (c) Evaluation of the local error on a 2D surface

11

and six) axes of S(y) described in the previous sec-tion. We stress here that S2D(y), which is composedof the summaries given by Grelaud et al. (2009), areused beyond the noise-free setting where they are suf-ficient for model choice. In order to study the infor-mation carried by the connected components, we addprogressively our geometric summary statistics to thefirst set, beginning by the T (G , y)-type of statistics inS4D(y). Finally, remark that, before evaluating theEuclidean distance in ABC algorithms, we normal-ize the statistics in each reference tables with respectto an estimation of their standard deviation sinceall these summaries take values on axis of differentscales. Simulated images have been drawn thanks tothe Swendsen and Wang (1987) algorithm. In theleast favorable experiment, simulations of one hun-dred pictures (on pixel grid of size 100 × 100) via20, 000 iterations of this Markovian algorithm whenparameters drawn from our prior requires about onehour of computation on a single CPU with our opti-mized C++ code. Hence the amount of time requiredby ABC is dominated by the simulations of y via theSwedsen-Wang algorithm. This motivated Moores,Mengersen, and Robert (2014) to propose a cut downon the cost of running an ABC experiment by remov-ing the simulation of an image from hidden Pottsmodel, and replacing it by an approximate simula-tion of the summary statistics. Another alternativeis the clever sampler of Mira et al. (2001) that pro-vides exact simulations of Ising models and can beextended to Potts models.

First experiment. Fig. 3(a) illustrates the cali-bration of the number of nearest neighbors (parame-ter k of Algorithm 2) by showing the evolution of theprior error rates (evaluated on a validation referencetable including 20, 000 simulations) when k increases.We compared the errors of six classifiers to inspect thedifferences between the three sets of summary statis-tics (in yellow, green and magenta) and the impact ofthe size of the training reference table (100, 000 sim-ulations in solid lines; 50, 000 simulations in dashedlines). The numerical results exhibit that a good cal-ibration of k can reduce the prior misclassification er-ror. Thus, without really degrading the performanceof the classifiers, we can reduce the amount of simu-lations required in the training reference table, whosecomputation cost (in time) represents the main ob-stacle of ABC methods, see also Table 1. Moreover,as can be guessed from Fig. 3(a), the sizes of thelargest connected components of induced graphs (in-

Table 2: Evaluation of the prior error rate on a testreference table of size 20, 000 in the second experi-ment.


cluded only in S6D(y)) do not carry additional infor-mation regarding the model choice and Table 1 con-firms this results through evaluations of the errors ona test reference table of 30, 000 simulations drawn in-dependently of both training and validation referencetables.

One can argue that the curse of dimensionality doesnot occur with such low dimensional statistics andsizes of the training set, but this intuition is wrong, asshown in Fig. 3(b). The latter plot shows indeed theprior misclassification rate as a function of k when wereplace the last four summaries by ancillary statisticsdrawn independently of m and y. We can concludethat, although the three sets of summary statisticscarry then the same information in this artificial set-ting, the prior error rates increase substantially withthe dimension (classifiers are not trained on infinitereference tables!). This conclusion shed new light onthe results of Fig. 3(a): the U(G , y)-type summaries,based on the size of the largest component, are notconcretely able to help discriminate among models,but are either highly correlated with the first fourstatistics; or the resolution (in terms of size of thetraining reference table) does not permit the exploita-tion of the possible information they add.

Fig. 3(c) displays the local error rate with respectto a projection of the image space on a plan. We havetaken here S1(y) = S2D(y) in Definition 2. And S2(y)ranges a plan given by a projection of the full set ofsummaries that has been tuned empirically in orderto gather the errors committed by calls of m(S2D(y))on the validation reference table. The most strik-ing fact is that the local error rises above 0.9 in theoval, reddish area of Fig. 3(c). Other reddish areasof Fig. 3(c) in the bottom of the plot correspond toparts of the space with very low probability, and maybe a dubious extrapolation of the Kriging algorithm.We can thus conclude that the information of the newgeometric summaries depends highly on the positionof y in the image space and have confidence in the in-terest of Algorithm 4 (adaptive ABC) in this frame-

12

work. As exhibited in Table 1(d), this last classifierdoes not decrease dramatically the prior misclassi-fication rates. But the errors of the non-adaptiveclassifiers are already low and the error of any clas-sifier is bounded from below, as explained in Propo-sition 3. Interestingly though, the adaptive classifierrelies on m(S2D(y)) (instead of the most informativem(S6D(y))) to take the final decision at about 60% ofthe images of our test reference table of size 30, 000.

Second experiment. The framework was de-signed here to study the limitations of our ap-proach based on the connected components of in-duced graphs. The number of latent colors is indeedrelatively high and the noise process do not rely onany ordering of the colors to perturbate the pixels.Table 2 indicates the difficulty of capturing relevantinformation with the geometric summaries we pro-pose. Only the sharpness introduced by a trainingreference table composed of 100, 000 simulations dis-tinguishes m(S4D(y)) and m(S6D(y)) from the basicclassifier m(S2D(y)). This conclusion is reinforced bythe low value of number of neighbors after the calibra-tion process, namely k = 16, 5 and 5 for m(S2D(y)),m(S4D(y)) and m(S6D(y)) respectively. Hence we donot display in the paper other diagnosis plots basedon the prior error rates or the conditional error rates,which led us to the same conclusion. The adaptiveABC algorithm did not improve any of these results.

Third experiment. The framework here includesa continuous noise process as described at the endof Section 3.1. We reproduced the entire diagnosisprocess performed in the first experiment and we ob-tained the results given in Fig. 4 and Table 3. Themost noticeable difference is the extra informationcarried by the U(G , y)-statistics, representing the sizeof the largest connected component, and the adaptiveABC relie on the simplest m(S2D(y)) in about 30%of the data space (measured with the prior marginal

Table 3: Evaluation of the prior error rate on a testreference table of size 30, 000 in the third experiment.


Adaptive ABC 8.2% 6.7%

distribution in y). Likewise, the gain in misclassifica-tion errors is not spectacular, albeit positive.

4 Conclusion and perspective

In the present article, we considered ABC modelchoice as a classification problem in the frameworkof the Bayesian paradigm (Section 2.1) and provideda local error in order to assess the accuracy of theclassifier at yobs (Sections 2.2 and 2.3). We derivedthen an adaptive classifier (Section 2.4) which is anattempt to fight against the curse of dimensionalitylocally around yobs. This method contrasts with mostprojection methods which are focused on parameterestimation (Blum et al., 2013). Additionally, mostof them perform a global trade off between the di-mension and the information of the summary statis-tics over the whole prior domain, while our proposaladapts the dimension with respect to yobs (see alsothe discussion about the posterior loss approach inBlum et al., 2013). Besides the inequalities of Propo-sition 3 complement modestly the analysis of Marinet al. (2013) on ABC model choice. Principles of ourproposal are well founded by avoiding the well-knownoptimism of the training error rates and by resortingto validation and test reference tables in order to eval-uate the error practically. And, finally, the machinelearning viewpoint gives an efficient way to calibratethe threshold of ABC (Section 2.1).Regarding latent Markov random fields, the pro-

posed method of constructing summary statisticsbased on the induced graphs (Section 3.2) yieldsa promising route to construct relevant summarystatistics in this framework. This approach is veryintuitive and can be reproduced in other settings.For instance, if the goal of the Bayesian analysisis to select between isotropic latent Gibbs modelsand anisotropic models, the averaged ratio betweenthe width and the length of the connect compo-nents or the ratio of the width and the length of thelargest connected components can be relevant numer-ical summaries. We have also explained how to adaptthe method to a continuous noise by performing aquantization of the observed values at each site ofthe fields (Section 3.2). And the detailed analysis ofthe numerical results demonstrates that the approachis promising. However the results on the 16 color ex-ample with a completely disordered noise indicate thelimitation of the induced graph approach. We believethat there exists a road we did not explore above withan induced graph that add weights on the edges of the

13

0 100 200 300 400 500

0.00

0.05

0.10

0.15

dim 2

dim 4

dim 6

0 100 200 300 400 500

0.0

0.1

0.2

0.3

0.4

dim 2

dim 4

dim 6

(a) (b)

−0.2

0.0

0.2

0.4

0.6

0.8

(c)

Figure 4: Third experiment results. (a) Prior error rates (vertical axis) of ABC with respect to thenumber of nearest neighbors (horizontal axis) trained on a reference table of size 100, 000 (solid lines) or50, 000 (dashed lines), based on the 2D, 4D and 6D summary statistics. (b) Prior error rates of ABC basedon the 2D summary statistics compared with 4D and 6D summary statistics including additional ancillarystatistics. (c) Evaluation of the local error on a 2D surface

14

graph according to the proximity of the colors, butthe grouping of sites on such weighted graph is nottrivial.The numerical results (Section 3.3) highlighted

that the calibration of the number of neighbors inABC provides better results (in terms of misclassi-fication) than a threshold set as a fixed quantile ofthe distances between the simulated and the observeddatasets (as proposed in Marin et al., 2012). Conse-quently, we can reduce significantly the number ofsimulations in the reference table without increas-ing the misclassification error rates. This representsan important conclusion since the simulation of a la-tent Markov random field requires a non-negligibleamount of time. The gain in misclassification ratesof the new summaries is real but not spectacular andthe adaptive ABC algorithm was able to select themost performant classifier.

Acknowledgments

The three author were financially supported by theLabex NUMEV. We are grateful to Jean-MichelMarin for his constant feedback and support. Somepart of the present work was presented at MCMSki4 in January 2014 and benefited greatly from discus-sions with the participants during the poster session.We would like to thank the anonymous referrees andthe Editors whose valuable comments and insightfulsuggestions led to an improved version of the paper.

References

M. Alfo, L. Nieddu, and D. Vicari. A finite mix-ture model for image segmentation. Statistics andComputing, 18(2):137–150, 2008.

M. Baragatti and P. Pudlo. An overview on Approx-imate Bayesian Computation. ESAIM: Proc., 44:291–299, 2014.

M. A. Beaumont, J.-M. Cornuet, J.-M. Marin, andC. P. Robert. Adaptive approximate Bayesian com-putation. Biometrika, page asp052, 2009.

J. Besag. Spatial interaction and the statistical anal-ysis of lattice systems (with Discussion). Journalof the Royal Statistical Society. Series B (Method-ological), 36(2):192–236, 1974.

J. Besag. Statistical Analysis of Non-Lattice Data.The Statistician, 24:179–195, 1975.

G. Biau, F. Cerou, and A. Guyader. New insights intoApproximate Bayesian Computation. Annales del’Institut Henri Poincare (B) Probabilits et Statis-tiques, in press, 2013.

M. G. B. Blum, M. A. Nunes, D. Prangle, and S. A.Sisson. A Comparative Review of Dimension Re-duction Methods in Approximate Bayesian Com-putation. Statistical Science, 28(2):189–208, 2013.

A. Caimo and N. Friel. Bayesian inference for expo-nential random graph models. Social Networks, 33(1):41–55, 2011.

A. Caimo and N. Friel. Bayesian model selectionfor exponential random graph models. Social Net-works, 35(1):11 – 24, 2013.

L. Cucala and J.-M. Marin. Bayesian Inference on aMixture Model With Spatial Dependence. Journalof Computational and Graphical Statistics, 22(3):584–597, 2013.

P. Del Moral, A. Doucet, and A. Jasra. An adap-tive sequential monte carlo method for approxi-mate bayesian computation. Statistics and Com-puting, 22(5):1009–1020, 2012.

L. Devroye, L. Gyorfi, and G. Lugosi. A probabilistictheory of pattern recognition, volume 31 of Applica-tions of Mathematics (New York). Springer-Verlag,New York, 1996.

X. Didelot, R. G. Everitt, A. M. Johansen, and D. J.Lawson. Likelihood-free estimation of model evi-dence. Bayesian Analysis, 6(1):49–76, 2011.

P. Druilhet and J.-M. Marin. Invariant HPD crediblesets and MAP estimators. Bayesian Analysis, 2(4):681–691, 2007.

A. Estoup, E. Lombaert, J.-M. Marin, C. Robert,T. Guillemaud, P. Pudlo, and J.-M. Cornuet.Estimation of demo-genetic model probabilitieswith Approximate Bayesian Computation usinglinear discriminant analysis on summary statis-tics. Molecular Ecology Ressources, 12(5):846–855,2012.

R. G. Everitt. Bayesian Parameter Estimation for La-tent Markov Random Fields and Social Networks.Journal of Computational and Graphical Statistics,21(4):940–960, 2012.

15

F. Forbes and N. Peyrard. Hidden Markov randomfield model selection criteria based on mean field-like approximations. Pattern Analysis and Ma-chine Intelligence, IEEE Transactions on, 25(9):1089–1101, 2003.

O. Francois, S. Ancelet, and G. Guillot. BayesianClustering Using Hidden Markov Random Fieldsin Spatial Population Genetics. Genetics, 174(2):805–816, 2006.

N. Friel. Bayesian inference for Gibbs random fieldsusing composite likelihoods. In Simulation Confer-ence (WSC), Proceedings of the 2012 Winter, pages1–8, 2012.

N. Friel. Evidence and Bayes Factor Estimation forGibbs Random Fields. Journal of Computationaland Graphical Statistics, 22(3):518–532, 2013.

N. Friel and H. Rue. Recursive computing andsimulation-free inference for general factorizablemodels. Biometrika, 94(3):661–672, 2007.

N. Friel, A. N. Pettitt, R. Reeves, and E. Wit.Bayesian Inference in Hidden Markov RandomFields for Binary Data Defined on Large Lattices.Journal of Computational and Graphical Statistics,18(2):243–261, 2009.

P. J. Green and S. Richardson. Hidden Markov Mod-els and Disease Mapping. Journal of the AmericanStatistical Association, 97(460):1055–1070, 2002.

A. Grelaud, C. P. Robert, J.-M. Marin, F. Rodolphe,and J.-F. Taly. ABC likelihood-free methods formodel choice in Gibbs random fields. BayesianAnalysis, 4(2):317–336, 2009.

M. A. Hurn, O. K. Husby, and H. Rue. A Tutorial onImage Analysis. In Spatial Statistics and Compu-tational Methods, volume 173 of Lecture Notes inStatistics, pages 87–141. Springer New York, 2003.ISBN 978-0-387-00136-4.

J.-M. Marin, P. Pudlo, C. P. Robert, and R. J. Ry-der. Approximate Bayesian Computational meth-ods. Statistics and Computing, 22(6):1167–1180,2012.

J.-M. Marin, N. S. Pillai, C. P. Robert, andJ. Rousseau. Relevant statistics for Bayesian modelchoice. Journal of the Royal Statistical Society: Se-ries B (Statistical Methodology), 2013.

P. Marjoram, J. Molitor, V. Plagnol, and S. Tavare.Markov chain Monte Carlo without likelihoods.Proceedings of the National Academy of Sciences,100(26):15324–15328, 2003.

A. Mira, J. Møller, and G. O. Roberts. Perfect slicesamplers. Journal of the Royal Statistical Society:Series B (Statistical Methodology), 63(3):593–606,2001.

M. T. Moores, C. E. Hargrave, F. Harden, andK. Mengersen. Segmentation of cone-beam CT us-ing a hidden Markov random field with informativepriors. Journal of Physics : Conference Series, 489,2014.

M. T. Moores, K. Mengersen, and C. P. Robert. Pre-processing for approximate Bayesian computationin image analysis. ArXiv e-prints, March 2014.

D. Prangle, P. Fearnhead, M. P. Cox, P. J. Biggs, andN. P. French. Semi-automatic selection of summarystatistics for ABC model choice. Statistical Appli-cations in Genetics and Molecular Biology, pages1–16, 2013.

J. K. Pritchard, M. T. Seielstad, A. Perez-Lezaun,and M. W. Feldman. Population growth of hu-man Y chromosomes: a study of Y chromosomemicrosatellites. Molecular Biology and Evolution,16(12):1791–1798, 1999.

R. Reeves and A. N. Pettitt. Efficient recursionsfor general factorisable models. Biometrika, 91(3):751–757, 2004.

C. P. Robert, J.-M. Cornuet, J.-M. Marin, and N. S.Pillai. Lack of confidence in approximate Bayesiancomputation model choice. Proceedings of the Na-tional Academy of Sciences, 108(37):15112–15117,2011.

R. H. Swendsen and J.-S. Wang. Nonuniversal criti-cal dynamics in Monte Carlo simulations. PhysicalReview Letters, 58(2):86–88, 1987.

S. Tavare, D. J. Balding, R. C. Griffiths, and P. Don-nelly. Inferring Coalescence Times From DNA Se-quence Data. Genetics, 145(2):505–518, 1997.

16

Date post:	24-Oct-2021
Category:	Documents
Upload:	others
View:	2 times
Download:	0 times

Adaptive ABC model choice and geometric summary statistics ...

Documents