
arXiv:1509.04069v1 [stat.AP] 14 Sep 2015

The Annals of Applied Statistics
2015, Vol. 9, No. 2, 687–713
DOI: 10.1214/15-AOAS818
© Institute of Mathematical Statistics, 2015

SPATIAL BAYESIAN VARIABLE SELECTION AND GROUPING FOR HIGH-DIMENSIONAL SCALAR-ON-IMAGE REGRESSION

By Fan Li∗,1,4, Tingting Zhang†,2,4, Quanli Wang∗, Marlen Z. Gonzalez†, Erin L. Maresh† and James A. Coan†,3

Duke University∗ and University of Virginia†

Multi-subject functional magnetic resonance imaging (fMRI) data has been increasingly used to study the population-wide relationship between human brain activity and individual biological or behavioral traits. A common method is to regress the scalar individual response on imaging predictors, known as a scalar-on-image (SI) regression. Analysis and computation of such massive and noisy data with complex spatio-temporal correlation structure is challenging. In this article, motivated by a psychological study on human affective feelings using fMRI, we propose a joint Ising and Dirichlet Process (Ising-DP) prior within the framework of Bayesian stochastic search variable selection for selecting brain voxels in high-dimensional SI regressions. The Ising component of the prior makes use of the spatial information between voxels, and the DP component groups the coefficients of the large number of voxels to a small set of values and thus greatly reduces the posterior computational burden. To address the phase transition phenomenon of the Ising prior, we propose a new analytic approach to derive bounds for the hyperparameters, illustrated on 2- and 3-dimensional lattices. The proposed method is compared with several alternative methods via simulations, and is applied to the fMRI data collected from the KLIFF hand-holding experiment.

Received October 2014; revised February 2015.
1 Supported in part by the U.S. NSF-DMS Grant 1208983.
2 Supported in part by the U.S. NSF-DMS Grants 1209118 and 1120756.
3 Supported in part by the National Institute of Mental Health (NIMH) Grant R01MH080725.
4 Equally contributing authors.
Key words and phrases. Bayesian, Dirichlet Process, fMRI, Ising model, phase transition, scalar-on-image regression, stochastic search, variable selection.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2015, Vol. 9, No. 2, 687–713. This reprint differs from the original in pagination and typographic detail.

1. Introduction. Positive social contact is known to enhance human health and well-being, possibly because it helps to regulate humans' emotional reactivity when facing negative stressors in daily life [Coan, Schaefer and


Davidson (2006); Coan, Beckes and Allen (2013); Coan (2010, 2011)]. Conventional studies of social contact primarily focus on its aggregated effect on an entire population. With the common belief that human behavior is controlled by individual mental decisions, which are affected by the immediate environment, it is desirable to investigate the emotion regulation activity of the individual brain under different social interaction conditions. Toward this aim, the KLIFF hand-holding psychological experiment [Coan, Schaefer and Davidson (2006)] was conducted. In this experiment, 104 pairs of mentally and physically healthy young adults, each pair consisting of a male and a female in one of various close relationships, including friendships and marriages, were recruited from a larger representative longitudinal community sample [Allen et al. (2007)]. One participant of each pair was threatened with mild electric shock during a functional magnetic resonance imaging (fMRI) session while either holding the hand of a friend, holding the hand of a stranger, or holding no hand at all, in three separate sessions, representing three different types of social interactions: positive and supportive social interaction with friends, general social interaction with strangers, and no social interaction, respectively. At the end of each session, the subjects were asked to rate the feelings of arousal and valence [Russell (1980); Lang et al. (1993)] they experienced during the experiment. Arousal and valence are the two dimensions in the framework of emotion fields, representing the extent of excitement and of pleasure experienced, respectively [see Bradley and Lang (1994) for a more detailed explanation].

To investigate which areas of the brain are predictive of an individual's affective feelings in the KLIFF study, we can construct a regression model using the subjects' emotion (arousal and valence) measurements as the response, and summaries of the fMRI images in the regions of interest (ROIs) as predictors. This type of regression is often referred to as a scalar-on-image (SI) regression in the literature [Reiss et al. (2011); Huang et al. (2013); Goldsmith, Huang and Crainiceanu (2014)]. SI regressions with predictors from other imaging modalities, such as diffusion tensor imaging (DTI), have also been used in medical and scientific studies [e.g., Reiss et al. (2015)].

The SI regression model in the KLIFF study has several unique characteristics due to the features of fMRI data. First, the sample size is much smaller than the number of predictors, that is, the number of brain voxels (3D cubic volumes in the brain) in the ROIs, which is over 6000 in the KLIFF study. This is known as the "large p, small n" paradigm [West (2003)]. Second, there is rich spatial information between the predictors. Third, neighboring predictors are highly correlated and often have similar but weak effects on the response. Finally, as each voxel accounts for only a tiny area in the brain, it is very likely that the number of significant voxels is much larger than the sample size. The last two characteristics imply that even with all the true voxels correctly selected, standard regression methods may still not


be applicable due to multicollinearity. It is therefore desirable to impose a certain degree of shrinkage or grouping on the regression coefficients so that predictors with similar values can be grouped together, and thus the effective number of selected predictors is smaller than the sample size. Motivated by these considerations, in this article we propose a Bayesian SI regression model that achieves simultaneous grouping and spatial selection of voxels that are predictive of individual responses. The key to our proposal is to define a joint Ising and Dirichlet Process (Ising-DP) prior for the regression parameters, within the framework of Bayesian stochastic search variable selection [SSVS; George and McCulloch (1993, 1997)]. The Ising component of the prior utilizes the spatial information between voxels to smooth the selection indicators of neighboring voxels, and the DP component groups the coefficients of voxels with similar effects to improve prediction power and also reduce the posterior computational burden. This method has scientific, statistical and computational advantages over several existing alternative priors.

Bayesian inference has become increasingly popular in fMRI data analysis due to several attractive properties: first, the posterior inference offers a direct probabilistic interpretation of the estimates; second, it eschews the multiple-comparison problem faced by classical inference; third, incorporating prior information is straightforward within the Bayesian framework. In particular, Markov Random Field priors, such as the Ising prior and the Potts prior, have been widely used to account for the spatial information between voxels [e.g., Gossl, Auer and Fahrmeir (2001); Woolrich et al. (2004); Penny, Trujillo-Barreto and Friston (2005); Bowman (2007); Bowman et al. (2008); Derado, Bowman and Kilts (2010); Ge et al. (2014)] and for meta-analysis [e.g., Kang et al. (2011); Yue, Lindquist and Loh (2012)]. Johnson et al. (2013) used a joint Dirichlet Process mixture and Potts prior to achieve simultaneous clustering and selection. Within the SSVS framework, Smith et al. (2003) and Smith and Fahrmeir (2007) used the Ising prior in the context of massive univariate general linear models [GLM; Friston et al. (1995)] to identify brain regions activated by a stimulus. It is important to stress that the setting in Smith and colleagues is fundamentally different from the SI regression in this paper: the former involves only fMRI time series, without an individual scalar outcome, and it deals with selecting and smoothing the coefficients from p one-dimensional regressions (one for each voxel), a setting broadly belonging to multiple testing; our paper deals with variable selection in one p-dimensional regression, a much more challenging task.

Within SSVS but outside the fMRI literature, there is a stream of recent work on using the Ising prior to incorporate existing structural information between variables under the "large p, small n" paradigm [e.g., Li and Zhang (2010); Stingo et al. (2011); Vannucci and Stingo (2011)]. Moreover, simultaneous selection and clustering in multiple regression was


discussed in Tadesse, Sha and Vannucci (2005), Kim, Tadesse and Vannucci (2006) and Dunson, Herring and Engel (2008), but none of those incorporated existing structure between covariates. Another important but under-investigated issue is phase transition in the Ising model [for a review, see Stanley (1987)], which, in the context of variable selection, leads to a drastic change (from nearly none to nearly all) in the number of variables selected given an infinitesimal change in the hyperparameters. Moreover, the difficulty and sensitivity of hyperparameter selection increase substantially as the degree of the underlying graph increases. Since fMRI voxels naturally overlay a 3-dimensional lattice, it is crucial to select hyperparameters that avoid phase transition for valid inference and feasible computation. However, despite being intensively explored in statistical physics, phase transition and the consequent issue of hyperparameter selection have received relatively little attention in the variable selection literature. Li and Zhang (2010) derived a ballpark estimate of the phase transition boundary for the Ising prior using mean field theory. But their derivation is based solely on the prior distribution and does not take into account the data or any prior knowledge of the predictors, and thus the resulting range of possible hyperparameters is often very wide. In this article we develop a new analytic approach to derive a tighter boundary for the hyperparameters based on the data and the posterior distribution, and illustrate it on 2- and 3-dimensional lattices.

The rest of the article is organized as follows. Section 2 introduces the new Bayesian model and Section 3 develops an analytic approach to hyperparameter selection. Posterior computation of the model is discussed in Section 4. Section 5 compares the proposed method with several existing methods through simulations. In Section 6 we apply the proposed method to the KLIFF study to investigate the social regulation of human emotion. Section 7 concludes.

2. The model. We formulate the problem via a standard multiple regression

Y = Xη + ε,    (1)

where Y is the n × 1 response vector, for example, the scalar arousal or valence measurements in the KLIFF study; X = (X1, . . . , Xp) is the n × p (p ≫ n) matrix of spatially correlated neuroimaging covariates, for example, the magnitudes of the estimated hemodynamic response function (HRF) of the voxels in the two ROIs in the study; and ε is the error term with ε ∼ N(0, σ²In). To focus on the main message, we do not consider design variables, such as age and sex, which can be easily added to the regression.

To select the voxels that are predictive of the response, we adopt the Bayesian SSVS approach, which assumes a "spike-and-slab" type of mixture prior for the regression coefficients [Mitchell and Beauchamp (1988);


George and McCulloch (1993, 1997); Smith and Kohn (1996)]. Specifically, we define a latent indicator γj ∈ {0, 1} for each covariate that indicates whether this covariate is included in the model (i.e., whether a voxel is significantly predictive of the response). We let

ηj = γj · βj and βj ∼ G,

where βj represents the regression coefficient of predictor j once it is selected, and G is a prespecified probability distribution. Given γj and G, the ηj are independent, following a spike-and-slab prior

ηj | (γj, G) ∼ (1 − γj) δ0 + γj G,    (2)

where δ0 is a point mass at 0. Our goal is to propose a new joint Ising and DP (Ising-DP) prior, where an Ising prior is imposed on γ = (γ1, . . . , γp)′ to incorporate spatial information between voxels, and, in parallel, a Bayesian nonparametric DP prior is imposed on G to achieve grouping of the regression coefficients, as elaborated below.

We represent the spatial structure among the fMRI voxels via a graph. Let i ∼ j denote that i and j are neighboring voxels. Let E = {(j1, j2) : 1 ≤ j1 ∼ j2 ≤ p} be the set of all neighboring pairs of voxels, that is, the edge set of the underlying graph. Given E, let a = (a1, . . . , ap)′ be a vector and B = (b_{j1,j2})_{p×p} be a symmetric matrix of real numbers with b_{j1,j2} = 0 for all (j1, j2) ∉ E. To incorporate the prior structural information into the model building process, we assume an Ising prior distribution for γ [Li and Zhang (2010)] as the first component of the proposed prior:

Pr(γ) = exp{a′γ + γ′Bγ − ψ(a,B)},    (3)

where ψ(a,B) is the normalizing constant: ψ(a,B) = log ∑_{γ ∈ {0,1}^p} exp(a′γ + γ′Bγ). If B = 0, then ψ(a,B) = ∑_{j=1}^p log(1 + e^{aj}), but in general there is no closed form for ψ. The Ising model is a binary Markov Random Field model and encourages the formation of clusters of like-valued binary variables.

The hyperparameters a control the sparsity of γ. Since we are focused on 2D and 3D lattices, which are regular graphs (i.e., each vertex has the same degree), we do not want to favor a priori the inclusion of any voxel. This is achieved by letting a = a·1p, where 1p = (1, 1, . . . , 1)′ ∈ ℜ^p. The hyperparameters b_{j1,j2} represent the prior belief in the strength of coupling between the pairs of neighbors (j1, j2), and thus control the smoothness of γ over E given a, with larger b_{j1,j2} leading to tighter coupling. When B = 0, the prior is the standard i.i.d. Bernoulli prior for each predictor [George and McCulloch (1993)]. Without specific prior information on the strength of connection between each pair of neighbors, it is natural to assume the b_{j1,j2}'s equal a constant b. Then (a,B) reduce to two hyperparameters (a, b), which can be either pre-fixed or assigned hyperprior distributions.
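To make the prior concrete, the following minimal Python sketch evaluates the unnormalized Ising log-prior a′γ + γ′Bγ of (3) for a given selection vector on a 2D lattice with an 8-neighborhood; the function and variable names are ours, not the authors', and the normalizing constant ψ(a,B) is deliberately omitted since it has no closed form in general.

```python
import numpy as np
from itertools import product

def lattice_edges_2d(n1, n2):
    """Edge set E of an n1 x n2 lattice under the 8-neighborhood (interior degree 8).
    Each undirected neighboring pair (j1, j2) is listed exactly once."""
    idx = lambda i, j: i * n2 + j
    edges = []
    for i, j in product(range(n1), range(n2)):
        for di, dj in [(0, 1), (1, -1), (1, 0), (1, 1)]:  # half of the 8 offsets
            ni, nj = i + di, j + dj
            if 0 <= ni < n1 and 0 <= nj < n2:
                edges.append((idx(i, j), idx(ni, nj)))
    return edges

def ising_log_prior_unnorm(gamma, a, b, edges):
    """Unnormalized log of prior (3): a'gamma + gamma' B gamma, with a = a * 1_p and
    B symmetric with entry b on every edge, so each unordered pair contributes 2b."""
    return a * gamma.sum() + 2.0 * b * sum(gamma[j1] * gamma[j2] for j1, j2 in edges)

gamma = np.zeros(25, dtype=int)
gamma[[6, 7, 11, 12]] = 1   # a 2x2 block of selected voxels on a 5x5 lattice
print(ising_log_prior_unnorm(gamma, a=-2.0, b=0.2, edges=lattice_edges_2d(5, 5)))
```

The 2x2 block has 6 neighboring pairs under the 8-neighborhood, so the printed value is -2.0 * 4 + 2 * 0.2 * 6 = -5.6, illustrating how b rewards spatially clustered configurations.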


The Ising prior smooths the binary selection indicators, but not the regression coefficients. In structured high-dimensional settings like fMRI, neighboring covariates, often highly correlated, tend to have similar effects on the outcome. Intuitively, a certain degree of smoothing or grouping of the coefficients should improve the model fit, especially when the effects of individual predictors are very weak. We achieve this by imposing a DP prior on G, G ∼ DP(α, G0), with precision parameter α and base measure G0 [Ferguson (1973, 1974); Antoniak (1974)]. Following the stick-breaking (SB) representation [Sethuraman (1994)], G can be written as a weighted sum of an infinite number of point masses (atoms):

G(·) = ∑_{h=1}^∞ wh δ_{θh}(·),    θh ∼ G0 i.i.d.,
wh = w′h ∏_{k<h} (1 − w′k),    w′h ∼ Beta(1, α) i.i.d.,    (4)

where δθ is a point mass at θ. It is clear from (4) that samples from a DP are discrete and that the component weights wh decrease exponentially in expectation. The spike-and-slab prior (2) for each ηj can then be written as a mixture of an infinite number of point masses (at 0 and at atoms randomly drawn from the base measure G0):

ηj | (γj, w, θ) ∼ (1 − γj) δ0 + γj ∑_{h=1}^∞ wh δ_{θh}(·),    (5)

where θ = (θ1, . . . , θh, . . .) and w = (w1, . . . , wh, . . .). The clustering nature of the DP prior can be seen immediately from (5): it classifies the voxels into one cluster of voxels that have no effect on the response, and several clusters of the remaining voxels, where the regression coefficients within each cluster are shrunk to be identical. The number of clusters increases automatically as the number of voxels under consideration, p, increases. The precision parameter α governs the number of active components and is assumed to follow a Gamma(1,1) hyperprior. We assume the base measure G0 = N(0, v²) with hyperparameter v. In this article, clustering per se is not the primary interest; rather, clustering is a means of grouping similar coefficients. There is a clear scientific justification for grouping regression coefficients in this manner, as each predictive brain region usually contains a number of voxels with similar (and usually weak) effects on the outcome. Clustering also brings a substantial improvement in posterior computation because, instead of sampling the coefficient of each voxel, one only needs to sample the common coefficient of each cluster.

Jointly, equations (3), (4) and (5) define the new Ising-DP spike-and-slab prior.
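To illustrate the prior generatively, the sketch below draws η from (5) given the indicators γ, using an H-component truncation of the stick-breaking construction (4) with G0 = N(0, v²). The defaults H = 20, α = 1 and v = 10 follow settings stated later in the paper; the function itself is a hypothetical illustration, not the authors' code.

```python
import numpy as np

def draw_eta(gamma, H=20, alpha=1.0, v=10.0, rng=None):
    """One draw of eta from the Ising-DP spike-and-slab prior (5), given gamma,
    using an H-component truncation of the stick-breaking construction (4)."""
    rng = rng or np.random.default_rng()
    w_prime = rng.beta(1.0, alpha, size=H)
    # stick-breaking: w_h = w'_h * prod_{k<h} (1 - w'_k)
    w = w_prime * np.concatenate(([1.0], np.cumprod(1.0 - w_prime[:-1])))
    theta = rng.normal(0.0, v, size=H)                  # atoms theta_h ~ G0 = N(0, v^2)
    z = rng.choice(H, size=gamma.size, p=w / w.sum())   # renormalize the truncated weights
    return gamma * theta[z]                             # eta_j = 0 unless gamma_j = 1

eta = draw_eta(np.array([1, 1, 0, 1, 0]))
print(eta)   # selected voxels share atoms; unselected voxels sit exactly at 0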


3. Selection of hyperparameters. Selection of the hyperparameters a, b in the Ising prior is crucial for both inference and computational feasibility with high-dimensional data. A challenging feature of the Ising prior in the "large p" paradigm is its phase transition behavior on graphs of dimension higher than 1: certain combinations of the hyperparameters a, b lead to the selection of almost all variables and thus induce a critical slowdown of the MCMC for posterior computation. This issue cannot be mitigated by simply replacing a and b with a hyperprior, because for a regular graph with even a modest degree (say, 3), the range of hyperparameters that does not incur phase transition is narrow. If the domain of the prior is not carefully chosen, it is very likely that little weight is assigned to appropriate hyperparameters, leading to poor posterior results, especially for data with a low signal-to-noise ratio (SNR), such as fMRI data. Smith and Fahrmeir (2007) suggested co-estimating the hyperparameters and the binary indicators in the posterior computation. Their method relies on specifying a uniform prior between zero and a prespecified maximum for the smoothing parameter b. However, if the maximum is specified outside the phase transition bounds, the resulting MCMC will still suffer from the critical slowdown. Therefore, finding these phase transition bounds is central to the correct specification of hyperparameters for the Ising prior.

Solely based on the prior distribution, Li and Zhang (2010), page 1205, used mean field approximations to derive a ballpark estimate of the phase transition boundary for the Ising prior defined on regular graphs, and illustrated it on a hypertube with degree 6. However, because this approach does not take into account the data or any prior knowledge of the selection rate, it often results in a very wide range of hyperparameters. The problem becomes even more pronounced as the degree of the graph increases. Below we develop a new method to tighten the bounds on a and b based on the posterior distribution.

The posterior conditional density of γ given the rest of the parameters is proportional to

C(γ) = exp( a′γ + γ′Bγ − ∑_{i=1}^n (Yi − Xi(β · γ))²/(2σ²) ).

In high-dimensional settings it is usually reasonable to assume sparsity a priori, that is, that the proportion π of true predictors among the p candidates is much smaller than 1. Intuitively, in order to have only a small proportion of predictors selected, the mode of C(γ) should be larger than C(0p) and attained at a γ whose number of nonzero entries is around π·p, beyond which C(γ) should decrease fast as the number of nonzero γj's increases. Below we form inequalities for a and b based on this intuition.

When all the candidate voxels lie on a lattice, selected voxels give rise to the largest number of neighboring pairs when they form a square in two dimensions or a cube in three dimensions. Therefore, we use squares (2D) or cubes (3D) to approximate the locations of the selected π·p voxels on the lattice. Let V = [(π·p)^{1/d}], where [c] denotes the largest integer no larger than c, and d is the dimension of the lattice, which equals either 2 or 3. For a square containing V² voxels, there are 4V² − 6V + 2 neighboring pairs; for a cube containing V³ voxels, there are 13(V − 2)³ + 51(V − 2)² + 66(V − 2) + 28 neighboring pairs (derivations are given in Appendix A).
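These closed-form pair counts can be checked by brute force. The sketch below, using hypothetical helper names of our own, enumerates the neighboring pairs of a cube of V^d voxels under the 8-neighborhood (2D) and 26-neighborhood (3D) and compares the counts with the formulas above.

```python
from itertools import product

def count_pairs(V, d):
    """Count neighboring pairs among the V^d voxels of a d-dimensional cube,
    under the 8-neighborhood (d = 2) or 26-neighborhood (d = 3)."""
    offsets = [o for o in product((-1, 0, 1), repeat=d) if any(o)]
    voxels = set(product(range(V), repeat=d))
    hits = sum(tuple(v[k] + o[k] for k in range(d)) in voxels
               for v in voxels for o in offsets)
    return hits // 2   # every unordered pair was counted twice

for V in range(2, 8):
    assert count_pairs(V, 2) == 4 * V**2 - 6 * V + 2
    assert count_pairs(V, 3) == 13 * (V - 2)**3 + 51 * (V - 2)**2 + 66 * (V - 2) + 28
print("closed forms verified")
```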

3.1. Selection on two-dimensional lattice. We first discuss the two-dimensional lattice. For V² selected voxels on a square,

a′γ + γ′Bγ = (a + 8b)V² − 12bV + 4b.

To achieve sparsity, this value needs to decrease fast as V increases; thus we must have a + 8b < 0. We also need the conditional density of selecting V² voxels to be larger than that of the null model with zero voxels, that is,

−∑_{i=1}^n (Yi − Ȳ)²/(2σ²) ≤ (a + 8b)V² − 12bV + 4b − ∑_{i=1}^n (Yi − Xi(β · γ))²/(2σ²).    (6)

Since ∑_{i=1}^n (Yi − Ȳ)² is the total variation of the observed Y, ∑_{i=1}^n (Yi − Xi(β · γ))² is the sum of squared errors, and E ∑_{i=1}^n (Yi − Xi(β · γ))² ≈ nσ², we have ∑_{i=1}^n (Yi − Ȳ)²/(2σ²) − ∑_{i=1}^n (Yi − Xi(β · γ))²/(2σ²) ≈ n·R²/[2(1 − R²)], where R² is the coefficient of determination in the linear regression of Y versus X. Inequality (6) then reduces to

(a + 8b)V² − 12bV + 4b > −n·R²/[2(1 − R²)].

We now propose two ways to determine R² to further tighten the inequality. In the first method, we prespecify the R² value that we expect to achieve; then, given V from prior knowledge, we obtain bounds on the parameters a and b. For example, if we want at least 50% of the variation of Y to be explained by the regression, and at most 5% of 1000 voxels to be selected, we may let R² = 50% and V = [√50] = 7; the inequality then becomes 49(a + 8b) − 84b + 4b > −n/2, that is, 312b + 49a > −n/2. Consequently, the range of a and b is determined by two inequalities: −8b > a > (−n/2 − 312b)/49 and b < n/160. The second method is to approximate R² by a lower bound obtained from the data: the maximum R² among all simple linear regressions of Y versus each single predictor Xj. We believe such a lower bound is an effective approximation for the problem under study for two reasons. First, by using the DP prior, usually most of the selected voxels should have identical β, effectively converting the multiple regression to a simple linear regression. Second, for fMRI data, spatially close voxels typically have very similar X values, and thus the R² value from regressing Y on multiple spatially close predictors is expected to be very similar to that from regressing Y on a single predictor.
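The first method is easy to mechanize. The sketch below, a hypothetical helper rather than the authors' code, reproduces the worked example: V = 7, the joint constraints −8b > a > (−n/2 − 312b)/49, and the bound b < n/160.

```python
import math

def bounds_2d(n, p, pi_max, R2):
    """(a, b) ranges on a 2D lattice from Section 3.1:
    a + 8b < 0 and (a + 8b) V^2 - 12 b V + 4 b > -n R^2 / (2 (1 - R^2))."""
    V = math.floor(math.sqrt(pi_max * p))
    rhs = -n * R2 / (2.0 * (1.0 - R2))
    b_max = rhs / (4.0 - 12.0 * V)   # a nonempty a-interval requires b below this
    a_low = lambda b: (rhs - b * (8 * V**2 - 12 * V + 4)) / V**2
    return V, b_max, a_low

# the paper's worked example: p = 1000 voxels, at most 5% selected, target R^2 = 50%
V, b_max, a_low = bounds_2d(n=104, p=1000, pi_max=0.05, R2=0.5)
print(V, round(b_max, 3))          # V = 7, b_max = n/160 = 0.65
b = 0.3
print(round(a_low(b), 3), -8 * b)  # for this b, a must lie in (a_low(b), -8b)
```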

3.2. Selection on three-dimensional lattice. Analogously, we can derive the range of a and b for a three-dimensional lattice. For V³ voxels forming a cube and V > 1,

a′γ + γ′Bγ = (a + 26b)(V − 2)³ + 6(a + 17b)(V − 2)² + 12(a + 11b)(V − 2) + 8a + 56b.    (7)

In order to avoid all predictors being selected, we need C(γ) to decrease fast as V increases beyond a certain threshold; in view of (7), this requires the exponent a′γ + γ′Bγ to be negative. For simplicity, we only require it to be negative at the maximum possible V, that is, V = [p^{1/3}]. For example, in the KLIFF data, p is around 6600 in both ROIs, so V = 18 and, consequently, a < −23b. In addition, in order to avoid the null model, that is, no voxel being selected, we have

a′γ + γ′Bγ ≥ −n·R²/[2(1 − R²)].    (8)

Given the prespecified R² and V, we can obtain the range of a and b satisfying this inequality. Again taking the KLIFF data, for example, n = 104, we want at most 1% of voxels selected, and the expected R² is 0.5. Then V = [66.7^{1/3}] = 4; plugging this value and R² = 0.5 into inequality (8) gives a > −14.6b − 0.81. Combining this with the previously obtained inequality a < −23b, it must be the case that −23b > −14.6b − 0.81, so that b < 0.1. Therefore, for the KLIFF data analysis, we will choose a and b such that b ≤ 0.1 and −23b > a > −14.6b − 0.81.

One potential problem of using (7) to evaluate a′γ + γ′Bγ in (8) is the overestimation of the number of neighboring pairs of selected voxels, especially when the selected V is larger than 3, which can lead to a very tight range for b and a. We instead note that as long as there is one predictor whose posterior probability of being selected is larger than that of not being selected, the posterior simulation will not be stuck at the null model. Therefore, we can simply let a ≥ −n·R²/[2(1 − R²)], implying b < n·R²/[2·23·(1 − R²)] so that −23b > −n·R²/[2(1 − R²)]. For one of the real data sets under study, the maximum R² across all simple linear regressions is 0.10, so we have −23b > a > −5.8 and b < 0.25. Given the derived range of hyperparameters, and with the belief that all the true predictors are tightly clustered together, we first choose the largest possible b to induce the most spatial clustering effect; then, given the value of b, we choose the smallest a within the phase transition boundary to induce sparsity. Such a choice of a also brings a computational advantage, because the computational cost of obtaining the regression coefficients decreases with the number of selected predictors in each MCMC iteration. Here, we choose b = 0.2 and a = −4.5 as the hyperparameters for the Ising prior.
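The arithmetic behind these bounds can be reproduced directly from (7). The sketch below, assuming the KLIFF values n = 104, p ≈ 6600 and R² = 0.10 quoted above, recovers the coefficient ratio behind a < −23b and the relaxed bounds a > −5.8 and b < 0.25.

```python
def exponent_3d(a, b, V):
    """a'gamma + gamma' B gamma for V^3 voxels forming a cube on a 3D lattice, eq. (7)."""
    m = V - 2
    return ((a + 26 * b) * m**3 + 6 * (a + 17 * b) * m**2
            + 12 * (a + 11 * b) * m + 8 * a + 56 * b)

# sparsity: the exponent must be negative at the largest cube, V = [6600**(1/3)] = 18;
# since it is linear in (a, b), extracting the two coefficients recovers a < -23b:
coef_a, coef_b = exponent_3d(1, 0, 18), exponent_3d(0, 1, 18)
print(coef_a, coef_b, round(coef_b / coef_a, 2))   # 5832, 134776, 23.11

# relaxed null-model bounds with n = 104 and R^2 = 0.10:
n, R2 = 104, 0.10
print(round(-n * R2 / (2 * (1 - R2)), 2))          # a > -5.78 (reported as -5.8)
print(round(n * R2 / (2 * 23 * (1 - R2)), 2))      # b < 0.25
```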

3.3. Remarks. The above derivation suggests the following: first, the larger R² and the sample size n, and the smaller the degree of the underlying graph (i.e., the average number of neighbors of each candidate predictor), the wider the range of b; second, the range of a depends on both b and the degree of the graph. Generally, for an Ising model built on a regular graph, given b, a larger degree of the graph leads to a smaller a. These observations are consistent with a general understanding of the effect of prior distributions in Bayesian inference: when R² and n are large, indicating a strong SNR and abundant data information, the choice of prior is less crucial. On the other hand, if each predictor has many neighbors, then the positive part γ′Bγ in the prior will give a strong preference to models with many spatially close predictors. Therefore, we need to use a smaller b in order not to impose too strong a prior. This also explains why, for fixed b, the larger the degree of the graph, the smaller the a required to induce a small prior odds of selecting a large number of predictors.

The degrees of a 2D and a 3D lattice are 8 and 26, respectively. Consequently, the range of hyperparameters a and b that avoids phase transition is much tighter in the latter case than in the former. Indeed, in the real application, when we assume the Ising prior on a 3D lattice, the results are much more sensitive to the choice of a and b. In general, we find that a larger degree of the underlying graph corresponds to substantially more difficult hyperparameter selection and inference, consistent with the observation made in Li and Zhang (2010). Also, it is crucial to examine γ′Bγ. Nevertheless, when choosing the underlying graph, concern about the degree of the graph should not outweigh the true physical structure. For example, for fMRI data we prefer an Ising prior on a 3D lattice to one on a 2D lattice, as the latter only accounts for the structure within one slice and ignores the true 3D structure between voxels.

In Bayesian variable selection problems, the choice of hyperparameters affects not only posterior selection probabilities, but also computational time, convergence rate and the required number of MCMC iterations. We found that if very few predictors are selected in each iteration, the DP prior tends to shrink the β's of all predictors to one identical value, leading to a very sticky MCMC, which offsets the per-iteration computational advantage offered by the shrinkage effect of the DP prior. Therefore, besides avoiding the two extremes of full selection and zero selection, the trade-off between computation per iteration and convergence rate should be taken into consideration when choosing the hyperparameters.

4. Posterior computation. We use a Gibbs sampler with data augmentation to carry out the posterior inference of the proposed model: γ|−, β|−, σ|−, where "−" denotes all the remaining parameters. Below we outline the Gibbs sampler but relegate the computational details to Appendix B.

The procedures to update the variance σ and the indicators γ, which we update one at a time in a random order in each sweep, are standard. To draw posterior samples of β, we use an approximate blocked Gibbs sampler based on the truncated stick-breaking process [Ishwaran and Zarepour (2000); Ishwaran and James (2001)]. First choose a conservative upper bound H < ∞ on the number of mixture components potentially occupied by the βj's in the sample. Then introduce a latent class indicator for each predictor, Zj ∈ {1, . . . , H}, with a multinomial distribution, Zj ∼ MN(w), where w = (w1, . . . , wH). This associates each predictor in the current iteration with a cluster h in the DP. In the Gibbs sampler, we first augment the cluster membership Zj and then sample βj conditional on Zj.

The main computational gain, especially when p is large, is due to the clustering nature of the DP: because all the predictors in one cluster share the same coefficient, we only need to update one β for each cluster within each iteration. It is easy to show that the computational order of one MCMC iteration under the DP prior for β is O(n × p × psel), where psel is the number of selected predictors (the model size) in that iteration. For comparison, we present the corresponding computational order under the standard spike-and-slab prior with a Gaussian prior for β, for which there are two general schemes for posterior computation: (i) sample all parameters, β, σ and γ; and (ii) integrate out β and σ under the conjugate setup and sample only γ. In both schemes, the main computational burden is the inversion of the covariance matrix, which, even using fast low-rank update algorithms, is of order O(n × p²) and O(n × p × psel²), respectively. When p is very large, as in this application, the computational order of the first scheme is prohibitive, which is why the vast majority of the SSVS literature in high-dimensional settings adopts the second scheme; that scheme, however, does not provide posterior samples of the coefficients β or the variance σ. Moreover, because of the squared term in psel, even when the average model size is modest (e.g., between 50 and 100), the second scheme can still incur overwhelming computational cost. In contrast, as shown in the details of the Gibbs sampler in Appendices A and B, the DP prior does not require matrix inversion, yet still provides posterior samples of the β's at much lower computational cost.
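For concreteness, the following sketch shows one sweep of the two DP-related updates described above, the cluster labels Zj and the cluster-level coefficients θh, under the truncated representation with normalized weights w. It is a minimal illustration under the model Y = X(γ·β) + ε with G0 = N(0, v²), written in our own notation rather than the authors' implementation; the updates of γ, σ and w are omitted.

```python
import numpy as np

def gibbs_update_clusters(Y, X, gamma, Z, theta, w, sigma2, v2, rng):
    """One sweep over the cluster coefficients theta_h and cluster labels Z_j.
    gamma: 0/1 vector; Z: labels in {0,...,H-1}; w: length-H probability vector."""
    n, p = X.shape
    H = theta.size
    resid = Y - X @ (gamma * theta[Z])
    # --- update each cluster-level coefficient theta_h given the labels ---
    for h in range(H):
        members = (Z == h) & (gamma == 1)
        s_h = X[:, members].sum(axis=1)            # summed design column of cluster h
        r_h = resid + s_h * theta[h]               # residual with cluster h removed
        prec = s_h @ s_h / sigma2 + 1.0 / v2       # normal posterior under G0 = N(0, v2)
        theta[h] = rng.normal((s_h @ r_h) / sigma2 / prec, np.sqrt(1.0 / prec))
        resid = r_h - s_h * theta[h]
    # --- update each label Z_j given theta ---
    for j in range(p):
        if gamma[j] == 0:
            Z[j] = rng.choice(H, p=w)              # likelihood does not involve eta_j
            continue
        r_j = resid + X[:, j] * theta[Z[j]]        # residual with predictor j removed
        logpost = np.log(w) - 0.5 * ((r_j[:, None] - np.outer(X[:, j], theta)) ** 2).sum(axis=0) / sigma2
        prob = np.exp(logpost - logpost.max())
        Z[j] = rng.choice(H, p=prob / prob.sum())
        resid = r_j - X[:, j] * theta[Z[j]]
    return Z, theta
```

Note that no matrix inversion appears: each θh update needs only inner products with the summed design column of its cluster, which is the source of the low per-iteration cost discussed above.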


5. Simulations.

5.1. Simulation design. We conduct simulations to examine the performance of the Ising-DP prior and compare it with several alternative methods. We simulate data of n = 104 subjects (the number of subjects in the real application), each having p = 1000 candidate predictors overlaying a 10 × 10 × 10 3D grid. Each predictor j (1 ≤ j ≤ 1000) is spatially indexed by dj = (dj¹, dj², dj³) with 1 ≤ dj¹, dj², dj³ ≤ 10. To mimic the real data, we let the predictors be strongly correlated: the design matrix of the ith subject, Xi = (Xi1, . . . , Xip), in all the following simulations follows a multivariate normal MVNp(µ, Σ), where µ = (µ1, . . . , µp) ∼ Unif(3, 6) i.i.d. and Σ_{j1j2} = 0.8^{|dj1 − dj2|}, with |dj1 − dj2| = ∑_{i=1}^3 |dj1^i − dj2^i|. We consider the following four simulation scenarios.

Scenario 1: One cluster of true predictors, with identical β's. There is a cluster of 5 × 5 × 5 (125) true predictors (γj = 1) with spatial indices 4 ≤ dj¹, dj², dj³ ≤ 8, located in the center of the 3D cube. The coefficients β of the true predictors are set to 0.6. The response is generated from Yi = ∑_j Xi,j βj γj + εi with εi ∼ N(0, 200²) for i = 1, . . . , n, creating a data set with a low SNR of 5%, defined as V(Xβ)/V(ε) (a data-generation sketch for this scenario is given after the list of scenarios). The following scenarios all share this low SNR, which is the norm in real fMRI data.

Scenario 2: One cluster of true predictors, with varying but strongly correlated β's. We let the coefficients of the true predictors, located on the same grid as those in scenario 1, vary and follow MVNp(0.6 × 1p, Ω), where Ω_{j1j2} = 0.1 × 0.95^{|dj1 − dj2|}. Therefore, both the observed values and the underlying coefficients of neighboring predictors are strongly correlated.

Scenario 3: Two clusters of true predictors, with identical β's within each cluster. A more challenging scenario is when there are multiple spatially separated clusters of true predictors. Specifically, we let the true predictors form two clusters: one overlays the grid 3 ≤ dj¹ ≤ 4, 3 ≤ dj² ≤ 4, 3 ≤ dj³ ≤ 4, and the other overlays the grid 6 ≤ dj¹ ≤ 9, 6 ≤ dj² ≤ 9, 6 ≤ dj³ ≤ 9. We set the coefficients β of the predictors in the two clusters to 0.4 and 1, respectively.

Scenario 4: Two clusters of true predictors, with varying β's within each cluster. The true predictors are located on the same grid as those in scenario 3; the β's in one cluster were generated from MVNp(0.4 × 1p, Ω1) with Ω1,j1j2 = 0.1 × 0.95^{|dj1 − dj2|}, and those in the second cluster from MVNp(1 × 1p, Ω2) with Ω2,j1j2 = 0.1 × 0.95^{|dj1 − dj2|}. Variable selection under two-cluster scenarios is challenging: the strong correlation between the predictors outside and inside the clusters makes it difficult to differentiate nonsignificant predictors, especially those located between the two clusters, from the true ones.

For each simulated data set, we fit the regression model (1) with four different priors: (i) an i.i.d. Bernoulli prior for the γj's with a Gaussian prior for the βj's (the standard spike-and-slab prior), referred to as the i.i.d.-Gaussian prior; (ii) an Ising prior for the γj's with a Gaussian prior for the βj's, referred to as the Ising-Gaussian prior; (iii) an i.i.d. Bernoulli prior for the γj's with a DP prior for the βj's, referred to as the i.i.d.-DP prior; and (iv) the Ising-DP prior. The hyperparameters (a, b) for the Ising priors are chosen by the approach proposed in Section 3, with a = −5 and b = 0.25. For the DP priors, we set H = 20, α = 1 and v = 10 so that G0 is very flat over a wide domain. For each simulated data set, we run 10 parallel Gibbs samplers with random starts in γ, each with 20,000 iterations, the first 10,000 discarded as burn-in. Posterior computation with the i.i.d.-Gaussian and Ising-Gaussian priors is carried out using the software of Li and Zhang (2010). The main summary statistic, the posterior inclusion probability, is deemed convergent upon inspecting the Gelman–Rubin statistic [Gelman and Rubin (1992)]. In all of our experiments, the 10 parallel runs lead to highly similar posterior summary statistics.

5.2. Simulation results. We calculate the posterior inclusion probabilities Pr(γj = 1|Y) as the posterior summary statistics, obtained by dividing the number of iterations with γj = 1 by the total number of iterations excluding the burn-in period. To summarize these marginal probabilities, we compute the ROC curve as follows: only those covariates j with Pr(γj = 1|Y) greater than a threshold are deemed positives, and those below the threshold are deemed negatives; the ROC curve traces the pairs of true positive rate and false positive rate achieved by varying the calling threshold. The bigger the area under the ROC curve (maximum 1), the better the discriminating power of the model.
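A minimal sketch of this summary, with hypothetical helper names, is given below; truth marks the true γ.

```python
import numpy as np

def roc_curve(incl_prob, truth):
    """TPR/FPR pairs swept over thresholds on the posterior inclusion probabilities."""
    order = np.argsort(-incl_prob)   # decreasing threshold
    tpr = np.cumsum(truth[order] == 1) / max((truth == 1).sum(), 1)
    fpr = np.cumsum(truth[order] == 0) / max((truth == 0).sum(), 1)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve (trapezoid rule); 1 means perfect discrimination."""
    return np.trapz(tpr, fpr)
```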

The ROC curves from the simulations under scenarios 1–2 (one cluster) and 3–4 (two clusters) are presented in the top and bottom panels of Figure 1, respectively. We also calculated the root mean squared error (RMSE) per variable, (∑j (β̂j − βj)²/p)^{1/2}, for each prior, summarized in Table 1. In all four simulations, the Ising-DP prior resulted in the best ROC, closely followed by the i.i.d.-DP prior, beating both the i.i.d.-Gaussian and the Ising-Gaussian priors. This pattern is consistent with the RMSEs. Overall, the ROC curves suggest relatively low discriminating power in these simulations, even for the best-performing Ising-DP prior. This is not surprising because variable selection under all four scenarios is very challenging due to the low SNR, the strong correlation between variables and the small-n, large-p nature of the data. Indeed, our experience from further simulations suggests that as the SNR and/or the sample size decreases, the performance of all the priors drops, but the Ising-DP prior is the least affected, demonstrating the benefit of introducing additional shrinkage to the coefficients when the signal is weak. In summary, it is evident from these simulations that the Ising-DP prior outperforms the existing alternatives in data with characteristics similar to those of the fMRI data under study.

[Figure 1 here. Top panel (one cluster): (a) identical β of true predictors; (b) varying β of true predictors. Bottom panel (two clusters): (c) identical β of true predictors within a cluster; (d) varying β of true predictors.]

Fig. 1. ROC curves based on the posterior selection probability Pr(γj = 1|Y) obtained from the i.i.d.-Gaussian, Ising-Gaussian, i.i.d.-DP and Ising-DP priors, respectively, under the four simulation scenarios.

It is worth noting that in these simulations the DP component appears to have a stronger effect on performance than the Ising component. One reason is that, as shown in Section 3, when the degree of the graph is large, as in the 3D fMRI analysis, the hyperparameter b in the Ising prior used to control the clustering effect has to be set small to avoid phase transition, which consequently limits its clustering effect. Nevertheless, the simulation results suggest that incorporating the spatial information into Bayesian variable selection via the Ising prior still leads to improved selection accuracy.

Table 1
Root mean squared error (RMSE) per variable, (∑j (β̂j − βj)²/p)^{1/2}, by different priors

Scenario                      I.i.d.-Gaussian   Ising-Gaussian   I.i.d.-DP   Ising-DP
1. One-cluster identical β    0.623             0.599            0.190       0.190
2. One-cluster varying β      0.284             0.283            0.181       0.179
3. Two-cluster identical β    0.311             0.315            0.256       0.250
4. Two-cluster varying β      0.368             0.251            0.235       0.233

6. Application to the KLIFF study.

6.1. The data. We now provide more information on the design of the KLIFF study and the preprocessing procedure. For each of the 104 pairs of participants in a close relationship (referred to as partners hereafter), one participant was randomly selected to be threatened with electric shocks while his/her brain activity was measured by fMRI in three separate sessions: in one session holding hands with his/her partner, in the second holding hands with a stranger, and in the third alone, holding hands with nobody at all. The three hand-holding conditions mimic three types of social interactions. Each of the three sessions, randomized within each pair of partners, contains 24 trials in random order, half of which are threat cues (a red "X" on a black background) indicating a 20% likelihood of receiving an electric shock to the ankle, and the other half safety cues (a blue "O" against a black background) indicating no chance of shock. A 3D fMRI scan of the subject's brain was acquired every 2 seconds during the experiment, which lasted 400 s. Overall, the fMRI data collected from the KLIFF experiment consist of 104 subjects in 3 sessions at 200 time points for over 100,000 spatially distributed voxels. At the end of each session, the subjects facing the threat were asked to score the arousal and valence feelings they experienced during the experiment. Both the arousal and valence measurements range from 1 to 9, encoding feelings from calming/soothing to alert/agitated, and from highly negative/miserable to highly positive/pleased, respectively.

Preprocessing of the fMRI data was carried out via the FMRIB Software Library (FSL) [Version 5.98; Smith et al. (2004)]. Registration of the images in FLIRT [Jenkinson et al. (2002)] was based on Montreal Neurological Institute (MNI) space. More details of the preprocessing can be found in Zhang et al. (2013). ROIs were determined structurally using the Harvard subcortical brain atlas, and were chosen for their likely involvement in affective processing based on previous studies [Maresh, Beckes and Coan (2013)]. In particular, our analysis focuses on two emotion-related regions, the dorsal anterior cingulate cortex (dACC) and the insula, which are commonly implicated in negative affect and threat responding, and whose numbers of voxels are similar, 6666 and 6591, respectively. To obtain the predictors, we conducted a massive univariate analysis using the GLM to get scalar summaries of the fMRI time series. Specifically, for every voxel in each ROI, we used the semi-parametric GLM approach in Zhang et al. (2013) to estimate the hemodynamic response functions (HRF) corresponding to the threat and safety cues (stimuli), and extracted the height of the HRF estimates, interpreted as the magnitude of that voxel's brain response to the stimuli. We then computed, for each voxel, the difference between the estimated magnitudes under the threat cue and the safety cue (baseline) as the predictor. In total, for each ROI, we obtained six sets of regression data: two response variables (the valence and arousal scores of the subjects) under each of the three hand-holding conditions, with the associated magnitude estimates of each voxel in the ROI, collected in the same session, as the predictors.

6.2. Results. We applied the proposed Bayesian model to the 12 sets of data (6 for each ROI) using the Ising-DP prior on a 3D lattice with hyperparameters a = −4.5 and b = 0.2 obtained from the method in Section 3. For comparison, we also fit the model with the i.i.d.-Gaussian and the Ising-Gaussian priors. For each regression, 25,000 MCMC iterations were performed, with the first 5000 discarded as burn-in. Convergence of the marginal inclusion probabilities is assessed via the Gelman–Rubin statistic.

Though the number of selected predictors is larger than the sample size in each MCMC iteration, the clustering effect of the DP prior leads to a small number of distinct β values (fewer than 10) in most iterations. Among the 12 sets of regressions, we focused on those with (i) reasonably high R-squared values and (ii) a high proportion of nonzero coefficients with the same sign among the top 10% selected voxels. The R-squared value for iteration t is given by

R²t = 1 − Var(Y − X(γt · βt))/Var(Y),

where γt and βt are the posterior draws of γ and β at the tth iteration. The first criterion requires that a significant proportion of the variation in subjects' emotion measurements can be explained by their brain response magnitudes, and the second requires that the majority of the top selected predictors have similar and significant effects on the response, matching the substantive knowledge from the existing psychology literature. We found three sets of regressions fitting these two criteria: the regressions with the arousal measurement under the alone condition as the response, in dACC and in insula, respectively, and the regression with the valence measurement under the hand-holding-with-partner condition as the response, in insula.
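For reference, a one-line computation of R²t under the definition above (a sketch with our own naming):

```python
import numpy as np

def r_squared_t(Y, X, gamma_t, beta_t):
    """Per-iteration R-squared: R2_t = 1 - Var(Y - X (gamma_t * beta_t)) / Var(Y)."""
    return 1.0 - np.var(Y - X @ (gamma_t * beta_t)) / np.var(Y)
```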

[Figure 2 here: (a) dACC, alone, arousal; (b) insula, alone, arousal; (c) insula, partner, valence.]

Fig. 2. R-squared values of the regressions.

Histograms of the R-squared values and of the coefficients of the top 10% selected voxels in these three regressions are displayed in Figures 2 and 3, respectively. We can see that in the regression with arousal under the alone condition as the response in dACC, the R-squared value is larger than 20% in more than 20% of the MCMC draws [Figure 2(a)], and almost all (>99.5%) of the top 10% selected voxels' coefficients are positive in more than 90% of the posterior draws [Figure 3(a)]. The same regression in insula led to similar results [R-squared in Figure 2(b) and coefficients in Figure 3(b)]. The significant positive association between the arousal measurement and brain response magnitudes under the alone condition is consistent with related findings in the literature. First, in a previous study of the KLIFF data [Zhang et al. (2013)], we found that the brain response to the threat stimulus is most active when subjects are alone. This phenomenon can be explained through the social baseline theory [Beckes and Coan (2011); Coan, Beckes and Allen (2013); Coan and Maresh (2014)], which suggests that the human brain assumes proximity to other human beings and perceives the environment as less threatening in the presence of other people in a close relationship, their presence thus serving as a default, or baseline, strategy of emotion regulation. This reduces the need to rely on effortful self-regulation in response to threat. When the subjects are alone without any social support, on the other hand, their brains have to use their own energy for emotion regulation; consequently, their emotional response is strong, and its association with subjects' emotion measurements is easier to detect in the two emotion-related ROIs. Second, the positive association between brain response and excitement level agrees with literature showing a role for dACC and insula in both cognitively and physically induced arousal [Critchley et al. (2000); Lewis et al. (2007)]. Since the use of electric shock as a threat stimulus causes physical pain, and particularly induces subjects' internal awareness of upcoming pain during anticipation of a shock, it is natural that the more actively the emotion-related ROIs process the stimulus, the more intense and agitated the feelings the subjects experience.

[Figure 3 here: (a) 10th percentile, dACC, arousal; (b) 10th percentile, insula, arousal; (c) 90th percentile, insula, valence.]

Fig. 3. Histograms of the 10th or 90th percentile of the coefficients (on the scale 10⁻⁴) of the top 10% selected voxels in dACC and insula when regressing subjects' arousal (the first two panels) or valence (the third panel) scores versus the magnitude of brain response to threat under the alone or hand-holding-with-partner condition.

We also found a significant association between valence and brain response magnitude in insula under the hand-holding-with-partner condition [R-squared values shown in Figure 2(c) and coefficients shown in Figure 3(c)]. The negative association has two possible explanations. First, the threat stimulus induces subjects' negative feelings, and the valence and arousal measures are negatively correlated; therefore, the more actively the brain responds to the stimulus, the less pleasant the subjects' feelings. Second, according to the social baseline theory, humans feel less threatened under the hand-holding-with-partner condition, so subjects' emotion variation is more likely to occur in the valence dimension. We indeed found that the variance of subjects' valence is larger than that of arousal. Moreover, the insula is thought to mediate the awareness of internal bodily and emotional states [Craig (2009)] and is related to pain anticipation and intensity [Wiech, Ploner and Tracey (2008)]. Results of the regression under the hand-holding-with-stranger condition are not as stable as those of the other two regressions, possibly due to individual differences in the cognitive and affective perception of strangers.

In all three regressions, the largest posterior selection probabilities of voxels are around 0.1, and the majority of the probabilities are below 0.05. This is as expected given the very low SNR common in fMRI data. In these situations, arguably, the ranks rather than the absolute values of the probabilities are more informative about the selection results. Figures 4, 5 and 6 show heatmaps of the posterior selection probabilities of the voxels in three slices, based on their ranks, under the Ising-DP prior (top panels) in these regressions, respectively, in comparison to the corresponding heatmaps under the i.i.d.-Gaussian (middle panels) and the Ising-Gaussian (bottom panels) priors. The color scale is arbitrary, with dark red representing the lowest-ranked selection probabilities and light yellow the highest. The most striking pattern in these graphs is that the areas with the highest selection probabilities identified by the Ising-DP prior are located smoothly across the ROIs, matching the scientific understanding of human brain functions, in contrast to those identified by the i.i.d.-Gaussian or the Ising-Gaussian prior, which are diffuse and scattered across the entire region.

[Figure 4 here.]

Fig. 4. Heatmaps of voxels according to the ranks of their posterior inclusion probabilities obtained from the Ising-DP, Ising-Gaussian and i.i.d.-Gaussian priors, respectively, in the Bayesian regression of subjects' arousal scores versus the magnitude of brain response to threat of voxels in dACC and insula when subjects are alone.

[Figure 5 here.]

Fig. 5. Heatmaps of voxels according to the ranks of their posterior inclusion probabilities obtained from the Ising-DP, Ising-Gaussian and i.i.d.-Gaussian priors, respectively, in the Bayesian regression of subjects' arousal scores versus the magnitude of brain response to threat of voxels in insula when subjects are alone.

[Figure 6 here.]

Fig. 6. Heatmaps of voxels according to the ranks of their posterior inclusion probabilities obtained from the Ising-DP, Ising-Gaussian and i.i.d.-Gaussian priors, respectively, in the Bayesian regression of subjects' valence scores versus the magnitude of brain response to threat of voxels in insula when subjects are holding hands with their partners.

Since the underlying truth is unknown, we use a simulation-based pro-cedure to obtain the sampling distribution of the R-squared values of anull model. Specifically, we simulated, independently of the covariates, anormally distributed response variable with similar variance and range as

Page 22: FOR HIGH-DIMENSIONAL SCALAR-ON-IMAGE … · FOR HIGH-DIMENSIONAL SCALAR-ON-IMAGE REGRESSION ... mean field theory. ... we define a latent indicator ...

22 F. LI ET AL.

(a) Histogram of R-squared (b) Heatmap of selection (c) Voxels with highest top 10%probabilities of voxels selection probabilities

Fig. 7. Regression of simulated response versus brain activity measurements in dACCunder alone condition.

We then applied the Bayesian model to regress the simulated outcome on the observed covariates in dACC under the alone condition. The histogram of the positive R-squared values in the posterior draws of this null model, shown in Figure 7(a), centers around zero and is distinct from the histograms of the three regressions discussed above, each of which has a much higher proportion of large R-squared values. In contrast, the histogram of the null model is very similar to those of the remaining nine regressions. As such, we deem that there is no statistically significant association between the covariates and the responses in those nine regressions.
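A toy Python sketch of this calibration idea follows (all sizes and moments are hypothetical, and a plain OLS R-squared stands in for the posterior draws of our model):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 34, 120                       # hypothetical sample size and voxel count
    X = rng.normal(size=(n, p))          # stand-in for the observed voxel covariates
    emotion_mean, emotion_sd = 4.0, 1.5  # hypothetical moments of the emotion scores

    # Null response: matches the variance/range of the scores, independent of X.
    y_null = rng.normal(emotion_mean, emotion_sd, size=n)

    # Fit any model to (y_null, X) and record its R-squared; a plain OLS fit
    # on a few voxels illustrates the reference distribution being built.
    Xs = np.column_stack([np.ones(n), X[:, :3]])
    beta_hat = np.linalg.lstsq(Xs, y_null, rcond=None)[0]
    resid = y_null - Xs @ beta_hat
    print("null R-squared:", 1 - resid.var() / y_null.var())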

7. Discussion. Motivated by the KLIFF hand-holding experiment, in this article we propose a joint Ising-DP prior within the Bayesian SSVS framework to achieve selection and grouping of spatially correlated variables in high-dimensional SI regression models. We developed an analytic approach for deriving the bounds of the hyperparameters that avoid phase transition, a main challenge for methods involving the Ising prior. Though the bounds provided by our method are tighter than the previous mean-field bounds, they are still only ballpark estimates and may be wide in graphs with high degrees. A focus of our future research is therefore to improve the method of hyperparameter selection for more complex graphical structures.

A major challenge to MCMC-based Bayesian methods in high-dimensional settings is computation. Though the DP prior in our model partially reduces the computational load by clustering the coefficients, computational scalability remains a challenge given the large p. Indeed, we are currently not able to perform a whole-brain analysis with p ≈ 100,000. Moreover, the standard SSVS strategy of updating one variable at a time may lead to slow MCMC mixing, especially when the DP prior is involved. An attractive direction is to design a block-update Gibbs sampling scheme that updates multiple variables at a time, and to parallelize the computation within each block using graphics processing unit (GPU)-based programming [Suchard et al. (2010); Ge et al. (2014)].


The procedure can be further sped up by carefully selecting the blocks so that they match the underlying block structure.

The Ising prior is a special case of a Markov random field. Kalus, Samann and Fahrmeir (2014) proposed latent Gaussian Markov random fields (GMRFs) via a probit model. The probit-GMRF prior simplifies the calculation of the hyperparameters and does not suffer from phase transition. However, the main computational hurdle remains: inverting a matrix whose dimension equals the number of selected variables. Nevertheless, it is possible to combine the DP prior with the probit-GMRF prior to reduce the computation.

Extension to binary and categorical responses is, in principle, straightforward using generalized linear models. Computation then becomes a greater concern, as closed-form posterior conditional distributions are no longer available. The same problem applies to censored survival models. Laplace approximations [Raftery (1996)] are useful, but they usually require gradient methods for iterative computation of posterior modes for each sweep of covariates. A possible improvement can be obtained by exploiting the majorization-minimization/maximization (MM) algorithm [Lange (2008)], a generalized version of the EM algorithm, for within-model mode computations.

The proposed Ising-DP prior inherently assumes sparsity, that is, that only a small portion of the voxels in the ROIs are associated with the individual scalar outcome. This is achieved via a point-mass (spike-and-slab) prior for the regression coefficients, resulting in a "hard thresholding" of the β's. However, in our real application, the posterior inclusion probabilities of nearly all voxels are relatively small, which suggests that an alternative "soft thresholding" without sparsity, achieved by (spatial adaptation of) LASSO-type priors [Park and Casella (2008)], may be desirable and is a worthwhile direction for future investigation.

Though we have focused on fMRI, the proposed model is applicable to other imaging modalities where detailed spatial information between covariates is available, such as DTI or MRI.

Matlab code that implements the method is available at http://faculty.virginia.edu/tingtingzhang/Software.html.

APPENDIX A: CALCULATION OF a′γ + γ′Bγ

1. Two-dimensional square. For V² (V > 1) voxels on a square, the (V − 2)² voxels in the center each have 8 neighbors, the 4 vertex voxels each have 3 neighbors, and the 4(V − 2) voxels on the edges but not at the vertexes each have 5 neighbors. Then, given a and B as defined in Section 2, we have

a′γ + γ′Bγ = a · V² + b · (8(V − 2)² + 4 · 3 + 5 · 4(V − 2))
= (a + 8b)V² − 12bV + 4b.
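The count can be verified by brute force; the following Python sketch (illustrative, with arbitrary values of a and b) enumerates the 8-neighborhood on a V × V grid and checks the closed form:

    import numpy as np

    V, a, b = 5, -1.0, 0.5               # arbitrary illustrative values
    deg = 0
    for i in range(V):
        for j in range(V):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if (di, dj) != (0, 0) and 0 <= i + di < V and 0 <= j + dj < V:
                        deg += 1         # one ordered neighbor pair
    lhs = a * V**2 + b * deg             # a'gamma + gamma' B gamma with gamma = 1
    rhs = (a + 8 * b) * V**2 - 12 * b * V + 4 * b
    assert np.isclose(lhs, rhs)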


2. Three-dimensional cube. For V³ (V > 1) voxels in a cube, the (V − 2)³ voxels in the center each have 26 neighbors, the 8 vertex voxels each have 7 neighbors, the 12(V − 2) voxels on the edges but not at the vertexes each have 11 neighbors, and the 6(V − 2)² voxels on the 6 outside faces of the cube but not on the edges each have 17 neighbors. Then, given a and B as defined in Section 2, we have

a′γ + γ′Bγ = a · V³ + b · (26(V − 2)³ + 8 · 7 + 12(V − 2) · 11 + 6(V − 2)² · 17)
= (a + 26b)(V − 2)³ + 6(a + 17b)(V − 2)² + 12(a + 11b)(V − 2) + 8a + 56b.
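An analogous sketch checks the three-dimensional count over the 26-neighborhood:

    import itertools
    import numpy as np

    V, a, b = 5, -1.0, 0.5
    deg = 0
    for x, y, z in itertools.product(range(V), repeat=3):
        for dx, dy, dz in itertools.product((-1, 0, 1), repeat=3):
            if (dx, dy, dz) != (0, 0, 0) and all(0 <= c < V for c in (x + dx, y + dy, z + dz)):
                deg += 1
    lhs = a * V**3 + b * deg
    rhs = ((a + 26*b) * (V - 2)**3 + 6 * (a + 17*b) * (V - 2)**2
           + 12 * (a + 11*b) * (V - 2) + 8*a + 56*b)
    assert np.isclose(lhs, rhs)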

APPENDIX B: POSTERIOR DISTRIBUTIONS IN THE GIBBS SAMPLER

1. Update γ. We update the indicator of one voxel, γ_j, at a time. Let γ_(−j) = {γ_l : l ≠ j}, let I_(−j) = {l ≠ j : γ_l = 1}, let β_(−j) = {β_l : l ≠ j}, and let X_(−j) be the design matrix corresponding to β_(−j). The prior probability of γ_j = 1 is

Pr(γ_j = 1 | γ_(−j)) = exp(a + b Σ_{l∈I_(−j)} γ_l) / {1 + exp(a + b Σ_{l∈I_(−j)} γ_l)}.

By the Bayes rule, the posterior probability of γ_j = 1 given the data and the other parameters is

Pr(γ_j = 1 | γ_(−j), β, σ, Y) = Pr(γ_j = 1 | γ_(−j)) / {Pr(γ_j = 1 | γ_(−j)) + F(j | γ_(−j))⁻¹ · Pr(γ_j = 0 | γ_(−j))},

where β · γ denotes the elementwise product of β and γ, and F(j | γ_(−j)) is the Bayes factor

F(j | γ_(−j)) = Pr(Y | γ_j = 1, γ_(−j), β, σ) / Pr(Y | γ_j = 0, γ_(−j), β, σ)
= exp{−Σ_{i=1}^n (Y_i − X_i β · γ)² / (2σ²)} / exp{−Σ_{i=1}^n (Y_i − X_{i,(−j)} β_(−j) · γ_(−j))² / (2σ²)},

where X_{i,(−j)} is the ith row of the matrix X_(−j).
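A toy Python rendering of this step is given below (not the released Matlab implementation; the array names and the random-number generator argument are our own conventions):

    import numpy as np

    def update_gamma_j(j, Y, X, beta, gamma, sigma2, a, b, rng):
        """One sweep of the gamma_j update; X is n x p, beta and gamma are length p."""
        prior_sum = gamma.sum() - gamma[j]         # sum over l != j with gamma_l = 1
        eta = a + b * prior_sum
        p1 = np.exp(eta) / (1.0 + np.exp(eta))     # prior Pr(gamma_j = 1 | gamma_(-j))

        g1, g0 = gamma.copy(), gamma.copy()
        g1[j], g0[j] = 1, 0
        rss1 = np.sum((Y - X @ (beta * g1)) ** 2)  # residual SS with gamma_j = 1
        rss0 = np.sum((Y - X @ (beta * g0)) ** 2)  # residual SS with gamma_j = 0
        log_F = (rss0 - rss1) / (2.0 * sigma2)     # log Bayes factor F(j | gamma_(-j))

        post1 = p1 / (p1 + np.exp(-log_F) * (1.0 - p1))
        return int(rng.random() < post1)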

2. Update σ². σ² | − ∼ Inv-Gamma(n/2, μ_σ), where μ_σ = Σ_{i=1}^n (Y_i − X_i β · γ)² / 2.
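In code, this draw is immediate, using the fact that the reciprocal of an Inv-Gamma(n/2, μ_σ) variable is Gamma(n/2) with rate μ_σ (same toy conventions as above):

    import numpy as np

    def update_sigma2(Y, X, beta, gamma, rng):
        """sigma^2 | - ~ Inv-Gamma(n/2, mu_sigma), drawn as the reciprocal of a Gamma."""
        mu_sigma = np.sum((Y - X @ (beta * gamma)) ** 2) / 2.0
        return 1.0 / rng.gamma(shape=len(Y) / 2.0, scale=1.0 / mu_sigma)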

3. Update β. Denote the common value of the β_j's in {j : Z_j = h} by β_h, and let X_i^h = Σ_{j : γ_j=1, Z_j=h} X_{ij}. Note that X_i^h = 0 if {j : γ_j = 1, Z_j = h} = ∅. Also, let β_(−h) = {β_j : Z_j ≠ h}, γ_(−h) = {γ_j : Z_j ≠ h} and X_(−h) = {X_j : Z_j ≠ h}, respectively, denote the collection of the β's, the γ's, and the design matrix of the covariates not in cluster h. Then, for h = 1, ..., H,

β_h | − ∼ N(μ_h, 1/S_h),

with S_h = Σ_{i=1}^n (X_i^h)² / σ² + 1/v² and μ_h = Σ_{i=1}^n (Y_i − X_i^(−h) β_(−h) · γ_(−h)) X_i^h / (σ² S_h), where X_i^(−h) is the ith row of X_(−h). This part can be parallelized (across h).
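A toy sketch of the β_h draw, under the same conventions (Z holds the cluster labels and v2 is the prior variance of the cluster-level coefficients):

    import numpy as np

    def update_beta_h(h, Y, X, beta, gamma, Z, sigma2, v2, rng):
        """Draw the common coefficient of cluster h from its Gaussian full conditional."""
        in_h = (gamma == 1) & (Z == h)
        Xh = X[:, in_h].sum(axis=1)                    # X_i^h; zero vector if the set is empty
        out = Z != h                                   # covariates not in cluster h
        fit_out = X[:, out] @ (beta[out] * gamma[out])
        S_h = np.sum(Xh ** 2) / sigma2 + 1.0 / v2
        mu_h = np.sum((Y - fit_out) * Xh) / (sigma2 * S_h)
        return rng.normal(mu_h, np.sqrt(1.0 / S_h))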

The posterior cluster membership Z is drawn from a multinomial distribution with

Pr(Z_j = h | γ_j = 1, −) = w_h exp{−Σ_{i=1}^n (Y_i − X_i β_(jh) · γ_(jh))² / (2σ²)} / Σ_{k=1}^H w_k exp{−Σ_{i=1}^n (Y_i − X_i β_(jk) · γ_(jk))² / (2σ²)},

Pr(Z_j = h | γ_j = 0, −) = w_h,

where β_(jh) = (β_1, ..., β_{j−1}, β_h, β_{j+1}, ..., β_p) and γ_(jh) = (γ_1, ..., γ_{j−1}, 1, γ_{j+1}, ..., γ_p) for h = 1, ..., H and j = 1, ..., p. To update the associated weights w, first set w′_H = 1 and draw w′_h from Beta(1 + Σ_{j : Z_j=h} 1, α + Σ_{j : Z_j>h} 1) for each h ∈ {1, ..., H − 1}; then update w_h = w′_h Π_{k<h} (1 − w′_k).
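The Z and w draws can be sketched as follows (toy code; beta_clusters denotes the length-H vector of cluster-level coefficients, and Z is an integer array of labels):

    import numpy as np

    def update_Z_j(j, Y, X, beta_clusters, gamma, Z, w, sigma2, rng):
        """Draw the cluster label of voxel j from its multinomial full conditional."""
        H = len(w)
        if gamma[j] == 0:
            return rng.choice(H, p=w)                  # Pr(Z_j = h | gamma_j = 0) = w_h
        logp = np.empty(H)
        for h in range(H):                             # likelihood with beta_j set to beta_h
            beta_jh = beta_clusters[Z].astype(float)
            beta_jh[j] = beta_clusters[h]
            logp[h] = np.log(w[h]) - np.sum((Y - X @ (beta_jh * gamma)) ** 2) / (2 * sigma2)
        p = np.exp(logp - logp.max())
        return rng.choice(H, p=p / p.sum())

    def update_w(Z, H, alpha, rng):
        """Stick-breaking update: w'_H = 1, w_h = w'_h * prod_{k<h} (1 - w'_k)."""
        wp = np.ones(H)
        for h in range(H - 1):
            wp[h] = rng.beta(1 + np.sum(Z == h), alpha + np.sum(Z > h))
        return wp * np.concatenate(([1.0], np.cumprod(1.0 - wp[:-1])))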

Acknowledgments. The authors are grateful to the Associate Editor and four reviewers for constructive comments that helped improve the exposition and clarity of the paper, and to Nancy Zhang for insightful discussions. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIMH, the National Institutes of Health or SAMSI.

Part of the project was conducted when Fan Li and Tingting Zhang were research fellows in the Object Data Analysis program of the U.S. Statistical and Applied Mathematical Sciences Institute (SAMSI).

SUPPLEMENTARY MATERIAL

Heatmaps (DOI: 10.1214/15-AOAS818SUPP; .pdf). We provide the heatmaps of the voxels with the top 10% highest posterior selection probabilities obtained from the Ising-DP, Ising-Gaussian and i.i.d.-Gaussian priors, respectively, in three regressions [Li et al. (2015)].

REFERENCES

Allen, J. P., Porter, M., McFarland, F. C., McElhaney, K. B. and Marsh, P. (2007). The relation of attachment security to adolescents' paternal and peer relationships, depression, and externalizing behavior. Child Development 78 1222–1239.
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist. 2 1152–1174. MR0365969
Beckes, L. and Coan, J. A. (2011). Social baseline theory: The role of social proximity in emotion and economy of action. Social and Personality Psychology Compass 5 976–988.
Bowman, F. D. (2007). Spatiotemporal models for region of interest analyses of functional neuroimaging data. J. Amer. Statist. Assoc. 102 442–453. MR2370845
Bowman, F. D., Caffo, B., Bassett, S. S. and Kilts, C. (2008). A Bayesian hierarchical framework for spatial modeling of fMRI data. NeuroImage 39 146–156.
Bradley, M. M. and Lang, P. J. (1994). Measuring emotion: The self-assessment manikin and the semantic differential. J. Behav. Ther. Exp. Psychiatry 25 49–59.
Coan, J. A. (2010). Adult attachment and the brain. J. Soc. Pers. Relatsh. 27 210–217.
Coan, J. A. (2011). The social regulation of emotion. In Oxford Handbook of Social Neuroscience 614–623. Oxford Univ. Press, New York.
Coan, J. A., Beckes, L. and Allen, J. P. (2013). Childhood maternal support and social capital moderate the regulatory impact of social relationships in adulthood. Int. J. Psychophysiol. 88 224–231.
Coan, J. A. and Maresh, E. L. (2014). Social baseline theory and the social regulation of emotion. In The Handbook of Emotion Regulation, 2nd ed. (J. Gross, ed.) 221–236. The Guilford Press, New York.
Coan, J. A., Schaefer, H. S. and Davidson, R. J. (2006). Lending a hand: Social regulation of the neural response to threat. Psychol. Sci. 17 1032–1039.
Craig, A. D. (2009). How do you feel now? The anterior insula and human awareness. Nat. Rev. Neurosci. 10 59–70.
Critchley, H. D., Corfield, D. R., Chandler, M. P., Mathias, C. J. and Dolan, R. J. (2000). Cerebral correlates of autonomic cardiovascular arousal: A functional neuroimaging investigation in humans. J. Physiol. (Lond.) 523 259–270.
Derado, G., Bowman, F. D. and Kilts, C. D. (2010). Modeling the spatial and temporal dependence in fMRI data. Biometrics 66 949–957. MR2758231
Dunson, D. B., Herring, A. H. and Engel, S. M. (2008). Bayesian selection and clustering of polymorphisms in functionally related genes. J. Amer. Statist. Assoc. 103 534–546. MR2523991
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230. MR0350949
Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. Ann. Statist. 2 615–629. MR0438568
Friston, K. J., Holmes, A. P., Worsley, K., Poline, P. J., Frith, C. and Frackowiak, R. (1995). Statistical parametric maps in functional imaging: A general linear approach. Hum. Brain Mapp. 2 189–210.
Ge, T., Muller-Lenke, N., Bendfeldt, K., Nichols, T. E. and Johnson, T. D. (2014). Analysis of multiple sclerosis lesions via spatially varying coefficients. Ann. Appl. Stat. 8 1095–1118. MR3262547
Gelman, A. E. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statist. Sci. 7 457–472.
George, E. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881–889.
George, E. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statist. Sinica 7 339–373.
Goldsmith, J., Huang, L. and Crainiceanu, C. M. (2014). Smooth scalar-on-image regression via spatial Bayesian variable selection. J. Comput. Graph. Statist. 23 46–64. MR3173760
Gossl, C., Auer, D. P. and Fahrmeir, L. (2001). Bayesian spatiotemporal inference in functional magnetic resonance imaging. Biometrics 57 554–562. MR1855691
Huang, L., Goldsmith, J., Reiss, P. T., Reich, D. S. and Crainiceanu, C. M. (2013). Bayesian scalar-on-image regression with application to association between intracranial DTI and cognitive outcomes. NeuroImage 83 210–223.
Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc. 96 161–173. MR1952729
Ishwaran, H. and Zarepour, M. (2000). Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models. Biometrika 87 371–390. MR1782485
Jenkinson, M., Bannister, P., Brady, M. and Smith, S. (2002). Improved optimization for the robust and accurate linear registration and motion correction of brain images. NeuroImage 17 825–841.
Johnson, T. D., Liu, Z., Bartsch, A. J. and Nichols, T. E. (2013). A Bayesian non-parametric Potts model with application to pre-surgical FMRI data. Stat. Methods Med. Res. 22 364–381. MR3190664
Kalus, S., Samann, P. G. and Fahrmeir, L. (2014). Classification of brain activation via spatial Bayesian variable selection in fMRI regression. Adv. Data Anal. Classif. 8 63–83. MR3168680
Kang, J., Johnson, T. D., Nichols, T. E. and Wager, T. D. (2011). Meta analysis of functional neuroimaging data via Bayesian spatial point processes. J. Amer. Statist. Assoc. 106 124–134. MR2816707
Kim, S., Tadesse, M. G. and Vannucci, M. (2006). Variable selection in clustering via Dirichlet process mixture models. Biometrika 93 877–893. MR2285077
Lang, P. J., Greenwald, M. K., Bradley, M. M. and Hamm, A. O. (1993). Looking at pictures: Affective, facial, visceral, and behavioral reactions. Psychophysiology 30 261–273.
Lange, K. (2008). Optimization. Springer Texts in Statistics 95. Springer, New York.
Lewis, P. A., Critchley, H. D., Rotshtein, P. and Dolan, R. J. (2007). Neural correlates of processing valence and arousal in affective words. Cereb. Cortex 17 742–748.
Li, F. and Zhang, N. R. (2010). Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J. Amer. Statist. Assoc. 105 1202–1214. MR2752615
Li, F., Zhang, T., Wang, Q., Gonzalez, M., Maresh, E. L. and Coan, J. A. (2015). Supplement to "Spatial Bayesian variable selection and grouping for high-dimensional scalar-on-image regression." DOI:10.1214/15-AOAS818SUPP.
Maresh, E. L., Beckes, L. and Coan, J. A. (2013). The social regulation of threat-related attentional disengagement in highly anxious individuals. Front. Human Neurosci. 7 515.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J. Amer. Statist. Assoc. 83 1023–1036. MR0997578
Park, T. and Casella, G. (2008). The Bayesian lasso. J. Amer. Statist. Assoc. 103 681–686. MR2524001
Penny, W. D., Trujillo-Barreto, N. J. and Friston, K. J. (2005). Bayesian fMRI time series analysis with spatial priors. NeuroImage 24 350–362.
Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83 251–266. MR1439782
Reiss, P. T., Mennes, M., Petkova, E., Huang, L., Hoptman, M. J., Biswal, B. B., Colcombe, S. J., Zuo, X.-N. and Milham, M. P. (2011). Extracting information from functional connectivity maps via function-on-scalar regression. NeuroImage 56 140–148.
Reiss, P. T., Huo, L., Zhao, Y., Kelly, C. and Ogden, R. T. (2015). Wavelet-domain regression and predictive inference in psychiatric neuroimaging. Ann. Appl. Stat. 9 1076–1101.
Russell, J. (1980). A circumplex model of affect. J. Pers. Soc. Psychol. 39 1161–1178.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica 4 639–650. MR1309433
Smith, M. and Fahrmeir, L. (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. J. Amer. Statist. Assoc. 102 417–431. MR2370843
Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. J. Econometrics 75 317–343.
Smith, M., Putz, B., Auer, D. and Fahrmeir, L. (2003). Assessing brain activity through spatial Bayesian variable selection. NeuroImage 20 802–815.
Smith, S. M., Jenkinson, M., Woolrich, M. W., Beckmann, C. F., Behrens, T. E. J., Johansen-Berg, H., Bannister, P. R., De Luca, M., Drobnjak, I., Flitney, D. E., Niazy, R., Saunders, J., Vickers, J., Zhang, Y., De Stefano, N., Brady, J. M. and Matthews, P. M. (2004). Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage 23(S1) 208–219.
Stanley, H. E. (1987). Introduction to Phase Transitions and Critical Phenomena. Oxford Univ. Press, New York.
Stingo, F. C., Chen, Y. A., Tadesse, M. G. and Vannucci, M. (2011). Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes. Ann. Appl. Stat. 5 1978–2002. MR2884929
Suchard, M. A., Wang, Q., Chan, C., Frelinger, J., Cron, A. and West, M. (2010). Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. J. Comput. Graph. Statist. 19 419–438. MR2758309
Tadesse, M. G., Sha, N. and Vannucci, M. (2005). Bayesian variable selection in clustering high-dimensional data. J. Amer. Statist. Assoc. 100 602–617. MR2160563
Vannucci, M. and Stingo, F. C. (2011). Bayesian models for variable selection that incorporate biological information. In Bayesian Statistics 9 (J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith and M. West, eds.) 659–678. Oxford Univ. Press, Oxford. MR3204022
West, M. (2003). Bayesian factor regression models in the "large p, small n" paradigm. In Bayesian Statistics 7 (Tenerife, 2002) (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 733–742. Oxford Univ. Press, New York. MR2003537
Wiech, K., Ploner, M. and Tracey, I. (2008). Neurocognitive aspects of pain perception. Trends Cogn. Sci. 12 306–313.
Woolrich, M. W., Jenkinson, M., Brady, J. M. and Smith, S. M. (2004). Fully Bayesian spatio-temporal modeling of fMRI data. IEEE Trans. Med. Imag. 23 213–231.
Yue, Y. R., Lindquist, M. A. and Loh, J. M. (2012). Meta-analysis of functional neuroimaging data using Bayesian nonparametric binary regression. Ann. Appl. Stat. 6 697–718. MR2976488
Zhang, T., Li, F., Beckes, L. and Coan, J. A. (2013). A semi-parametric model of the hemodynamic response for multi-subject fMRI data. NeuroImage 75 136–145.


F. Li
Q. Wang
Department of Statistical Science
Duke University
Durham, North Carolina 27708-0251
USA
E-mail: [email protected]@stat.duke.edu

T. Zhang
Department of Statistics
University of Virginia
Charlottesville, Virginia 22904
USA
E-mail: [email protected]

M. Z. Gonzalez
E. L. Maresh
J. A. Coan
Department of Psychology
University of Virginia
Charlottesville, Virginia 22904
USA
E-mail: [email protected]@[email protected]

