Bayesian Structure Learning for Functional Neuroimaging

Mijung Park∗,1, Oluwasanmi Koyejo∗,1, Joydeep Ghosh1, Russell A. Poldrack2, Jonathan W. Pillow2

1Electrical and Computer Engineering, 2Psychology and Neurobiology, The University of Texas at Austin

Abstract

Predictive modeling of functional neuroimaging data has become an important tool for analyzing cognitive structures in the brain. Brain images are high-dimensional and exhibit large correlations, and imaging experiments provide a limited number of samples. Therefore, capturing the inherent statistical properties of the imaging data is critical for robust inference. Previous methods tackle this problem by exploiting either spatial sparsity or smoothness, which does not fully exploit the structure in the data. Here we develop a flexible, hierarchical model designed to simultaneously capture spatial block sparsity and smoothness in neuroimaging data. We exploit a function domain representation for the high-dimensional small-sample data and develop efficient inference, parameter estimation, and prediction procedures. Empirical results with simulated and real neuroimaging data suggest that simultaneously capturing the block sparsity and smoothness properties can significantly improve structure recovery and predictive modeling performance.

1 Introduction

Functional magnetic resonance imaging (fMRI) is an important tool for non-invasive study of brain activity. Most fMRI studies involve measurements of blood oxygenation (which is sensitive to the amount of local neuronal activity) while the participant is presented with a stimulus or cognitive task. Neuroimaging signals are then analyzed to identify the brain regions that exhibit a systematic response to the stimulation. This can be used to infer the functional properties of those brain regions. Estimating statistically consistent models for fMRI data is a challenging task. Typical experimental data consist of brain volumes represented by tens of thousands of noisy and highly correlated voxels, yet practical constraints generally limit the number of participants to fewer than 100 per experiment.

∗ M Park and O Koyejo contributed equally to this work.

Appearing in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.

Predictive modeling (also known as “brain reading” or “reverse inference”) has become an increasingly popular approach for studying fMRI data (Norman et al., 2006; Pereira et al., 2009; Poldrack, 2011). This approach involves decoding of the stimulus or task using features extracted from the neuroimaging data. Many different machine learning techniques have been applied to predictive modeling of fMRI data, including support vector machines (Cox, 2003), Gaussian naive Bayes (Mitchell et al., 2004) and neural networks (Hanson et al., 2004; Poldrack et al., 2009). The learned model parameters can also be used to infer associations between groups of voxels conditioned on the stimulus (Poldrack et al., 2009). Linear models are the preferred approach in this case, as the model weights are directly related to the image features (voxels). Interpretability and structure estimation are further simplified when the linear model returns sparse weights.

Various sparse regularizers have been applied to functional neuroimaging data to improve structure recovery (Carroll et al., 2009; Varoquaux et al., 2012). These models have had limited success due to the small number of samples and the high dimensions of the data. In particular, L1 regularized models typically select only a few features (voxels), and the selected subset of voxels can vary widely based on small changes in the hyperparameters or the data (Carroll et al., 2009). The high degree of correlation leads to further degeneration of the structure recovery and predictive performance. Similar empirical properties have been observed with other sparse modeling techniques (Varoquaux et al., 2012). This observed behavior is consistent with the theoretical conditions for L1 regularized structure recovery (Zhao and Yu, 2006; Wainwright, 2009).

Here we show that statistical regularities in brain images can be exploited to improve estimation performance. Two properties are of particular interest: spatial block sparsity and spatial smoothness. Spatial sparsity results from the fact that the brain responds selectively, so that only small regions are activated during a particular task. Spatial smoothness, on the other hand, results from the fact that the brain regions activated extend across many (usually tens to hundreds of) voxels. Sparse blocks may not be located in close spatial proximity, as different tasks or stimuli may be processed in very different brain regions (Poldrack, 2011). Sparse blocks may also be separated due to bilateral activation patterns for certain tasks. Much of the prior work in the domain of predictive modeling has focused on the sparse structure, whereas the spatial smoothness properties have mostly been ignored.

This paper introduces a novel prior distribution to simultaneously capture the spatial block sparsity and spatial smoothness structure of fMRI data. Our approach follows methods for structured predictive modeling using empirical Bayes (or maximum marginal likelihood) inference (e.g., Wipf and Nagarajan (2008); Sahani and Linden (2002); Park and Pillow (2011)). Our method builds directly on Automatic Locality Determination (ALD), which has a prior distribution that simultaneously captures sparsity and smoothness (Park and Pillow, 2011).

Our work differs from ALD in several respects: (i) we model several spatial clusters instead of a single spatial cluster; (ii) we apply the proposed prior model to both regression and classification problems; (iii) we propose an efficient representation to scale the model to high dimensional functional neuroimaging data.

The contributions of this paper are as follows:

• We propose a novel prior that simultaneously captures spatial block sparsity and smoothness.

• We develop efficient inference, parameter estimation, and prediction procedures for high-dimensional small-sample data.

• We present empirical results on simulated and real functional neuroimaging data. Our experiments show the effectiveness of our approach for predictive modeling and structure estimation.

We begin the discussion with an overview of the generative modeling approach in Section 2 and introduce the novel prior in Section 3. We discuss inference and parameter estimation applied to regression in Section 4 and classification in Section 5. Experimental results on synthetic and real brain data are presented in Section 6.

Notation: N(µ, σ²) represents a Gaussian distribution with mean µ and variance σ². We represent matrices by boldface capital letters and vectors by boldface small letters, e.g. M and m respectively. M = diag(m) returns a diagonal matrix M with diagonal elements given by M_{i,i} = m_i. The determinant of a matrix M is given by |M|, and tr(M) represents the trace of the matrix M.

2 Generative model

We study whole brain images collected from subjects engaged in a controlled experiment. Let x ∈ R^D be a feature vector representing the whole brain voxel activation levels collected into a D dimensional vector. The stimulus is represented by a variable y. This paper will focus on cases where y is real valued (regression) or y is discrete (classification). With N training examples, let X = [x_1 | x_2 | … | x_N]^⊤ ∈ R^{N×D} represent the concatenated feature matrix, and let Θ represent the model hyperparameters. Predictive modeling involves estimating the conditional distribution p(y | x, D, Θ), where the data is denoted by D.

We assume that the stimuli are generated from a hierarchical Bayesian model. Let the distribution of the stimuli be given by p(y | w, x, ξ), where ξ are the likelihood model hyperparameters and w ∈ R^D is a weight vector. The functional relationship between the voxel activations and the stimuli is assumed to be linear. The weights of this linear function are generated from a zero mean multivariate Gaussian distribution with covariance matrix C ∈ R^{D×D}. The linear model and prior are given by:

f(x) = w^⊤x,   p(w | θ) = N(0, C),   (1)

where θ represents the hyperparameters that determine the covariance structure. We have suppressed the dependence of C on θ to simplify the notation. Our objective is to parametrize this covariance matrix to capture prior smoothness and sparsity assumptions. We refer to this method as Bayesian structure learning (BSL). Our approach consists of three main tasks:

1. Hyperparameter estimation using the parametric empirical Bayes approach.

2. Stimulus prediction for held-out images.

3. Structure estimation using a point estimate of the weight vector.


Hyperparameter estimation: The set of model hyperparameters Θ = {ξ, θ} is learned using the parametric evidence optimization approach (Casella, 1985; Morris, 1983; Bishop, 2006). Evidence optimization (also known as type-II maximum likelihood) is a general procedure for estimating the parameters of the prior distribution in a hierarchical Bayesian model by maximizing the marginal likelihood of the observed data. We can compute the evidence by integrating out the model parameters w as:

p(y | X, Θ) = ∫ p(y | w, X, ξ) p(w | θ) dw.

The resulting maximizer is the maximum likelihood estimate Θ_ml.

Stimulus prediction: The accuracy of the predictive model is estimated by computing predictions for held-out brain images. We estimate the predictive distribution of the target stimuli, given by:

p(y_* | x_*, D, Θ) = ∫ p(y_* | x_*, w, ξ) p(w | θ, D) dw,   (2)

where p(w | θ, D) is the posterior distribution of the parameters given the training data D = {y, X}. The predictive distribution is applied to held-out brain images. Prediction performance provides evidence of accurate modeling and is generally useful for model validation.

Structure estimation: In addition to an accurate prediction of the stimuli, the weights of the linear mapping may be analyzed to infer stimulus dependent functional associations. This requires an appropriate point estimate. We compute the maximum a posteriori (MAP) estimate of the weight vector w by maximizing its (unnormalized) log posterior distribution conditioned on the estimated hyperparameters Θ_ml. The estimated model parameter also specifies the recovered support. Ignoring constants independent of w, the optimal parameter w_map is computed as the solution of:

arg min_w [ −log p(y | w, X, ξ) + (1/2) w^⊤ C^{-1} w ].   (3)

3 Prior covariance design

A smooth signal is characterized by its frequency content. In particular, the power of a smooth signal is concentrated near the zero frequency. We apply this intuition by designing a prior distribution that encourages low frequency weight vectors. Let x ∈ R^D be the three dimensional tensor containing the brain volume, where D = Dx × Dy × Dz. Each voxel is sampled on a regular three dimensional grid. Hence, we can measure the frequency content of w ∈ R^D using the discrete Fourier transform (DFT) (Oppenheim and Schafer, 1989). Let w̃ = DFT(w) represent the three dimensional discrete Fourier transform of w, with the resulting discrete frequency spectrum w̃ ∈ R^D. The weight vector w is considered smooth if the signal power of w̃ = DFT(w) is concentrated near zero.

Let e_l ∈ R^3 represent the index locations in the frequency domain corresponding to the DFT of a three dimensional spatial signal, i.e. e_l = 0 corresponds to the zero frequency. As the signal is regularly sampled, the e_l lie on a regular three dimensional grid. We encourage smooth weights with the use of a prior distribution w̃ ∼ N(0, G). The prior covariance matrix G ∈ R^{D×D} is diagonal with entries:

G_{l,l} = exp( −(1/2) e_l^⊤ Ψ^{-1} e_l − ρ ),

where Ψ ∈ R^{3×3} is a diagonal scaling matrix and ρ ∈ R is a scaling parameter. The discrete Fourier transform of a real signal is symmetric around the origin (Oppenheim and Schafer, 1989). We use a diagonal scaling to ensure that this condition is satisfied. The result is a dimension-wise independent, symmetric prior distribution for w̃ where the prior variance decreases exponentially in proportion to the Mahalanobis distance of the frequency index from the zero frequency.

The prior assumptions on the frequency domain signal w̃ correspond to prior assumptions on the spatial weight vector w, which can be recovered in closed form. Recall that the DFT is a linear operator (Oppenheim and Schafer, 1989). Let B ∈ R^{D×D} be the matrix representation of the 3-dimensional discrete Fourier transform, so w̃ = Bw = DFT(w). Similarly, the inverse 3-dimensional discrete Fourier transform (IDFT) operator is given by the Hermitian transpose of B, so we may compute w = B^⊤w̃ = IDFT(w̃). We can compute the marginal distribution of w by integrating out the prior w̃ ∼ N(0, G). The resulting prior distribution on the spatial weight vector is given by:

w ∼ N(0, B^⊤ G B).
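As an illustration, the following Python sketch (ours, not from the paper) builds the diagonal of the frequency-domain covariance G on a 3D DFT grid; the volume shape, Ψ diagonal, and ρ in the example are assumed placeholder values.

```python
import numpy as np

def smoothness_prior_diag(dims, psi_diag, rho):
    """Diagonal of G with G_ll = exp(-0.5 * e_l^T Psi^{-1} e_l - rho).

    dims     : (Dx, Dy, Dz) voxel grid shape (placeholder sizes)
    psi_diag : length-3 array, diagonal of the scaling matrix Psi
    rho      : scalar scale parameter
    """
    # Integer frequency indices e_l of the 3D DFT, symmetric about zero frequency.
    freqs = [np.fft.fftfreq(n, d=1.0 / n) for n in dims]
    ex, ey, ez = np.meshgrid(*freqs, indexing="ij")
    # Mahalanobis distance of each frequency index from the zero frequency.
    quad = ex**2 / psi_diag[0] + ey**2 / psi_diag[1] + ez**2 / psi_diag[2]
    # Prior variance decays exponentially with distance from zero frequency.
    return np.exp(-0.5 * quad - rho).ravel()

# Example on a 22 x 27 x 22 grid (the fMRI resolution used in Section 6); Psi and rho are made up.
g_diag = smoothness_prior_diag((22, 27, 22), psi_diag=np.array([9.0, 9.0, 9.0]), rho=0.0)
```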

Next, we augment the prior covariance matrix to capture the block spatial sparsity properties of the signal. Spatial blocks are modeled using a sum of C spatial clusters, where each cluster measures spatial locality. Let z_d represent the three dimensional sampling grid, so each location d is associated with the corresponding voxel. Each cluster is defined by proximity to a central vector κ_c ∈ R^3. The intuition is that voxels near κ_c are considered active in the cluster c, while voxels far away are considered inactive. The sparsity promoting function for each cluster c at location d is given by:

s_c(d) = γ_c exp( −(1/2) (z_d − κ_c)^⊤ Ω_c^{-1} (z_d − κ_c) ),


where Ω_c ∈ R^{3×3} is a symmetric positive definite matrix and γ_c ∈ R_+ is a positive weight. The sparsity promoting functions are collected into a vector s_c ∈ R^D.

Figure 1: Visualization of spatial block sparsity. An example 1-dimensional weight vector w with estimated spatial prior clusters s_c. Four clusters were used in the prior: two of them, s_c1 and s_c2, specify the support of w, and the rest, s_c3 and s_c4, were pruned out.

The clusters are accumulated into a single spatial sparsity promoting function:

s(d) = Σ_{c=1}^C s_c(d) = Σ_{c=1}^C γ_c exp( −(1/2) (z_d − κ_c)^⊤ Ω_c^{-1} (z_d − κ_c) ),

and collected into a vector s ∈ R^D. This modeling approach allows us to capture arbitrarily shaped blocks as a weighted sum of the elliptical clusters. Blocks that are not utilized can be identified as blocks with γ_c = 0 and pruned. Hence C is an upper bound on the number of spatial blocks explicitly captured by the prior. Collectively, the s_c select the support of w (see Fig. 1). The spatial cluster centers κ_c are constrained by the boundaries of the cuboid. We also set s(d) = 0 for all voxels outside the brain volume. This will ensure that the estimated weight vector corresponding to these voxels remains zero.
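A small sketch of this sum-of-clusters construction (illustrative only; the grid size, cluster centers, Ω_c, and γ_c below are our assumptions, not values from the paper):

```python
import numpy as np

def sparsity_profile(grid, centers, omegas, gammas):
    """s(d) = sum_c gamma_c * exp(-0.5 (z_d - kappa_c)^T Omega_c^{-1} (z_d - kappa_c)).

    grid    : (D, 3) voxel coordinates z_d
    centers : (C, 3) cluster centers kappa_c
    omegas  : (C, 3, 3) symmetric positive definite matrices Omega_c
    gammas  : (C,) nonnegative weights gamma_c (gamma_c = 0 prunes a cluster)
    """
    s = np.zeros(grid.shape[0])
    for kappa, omega, gamma in zip(centers, omegas, gammas):
        diff = grid - kappa
        quad = np.einsum("di,ij,dj->d", diff, np.linalg.inv(omega), diff)
        s += gamma * np.exp(-0.5 * quad)
    return s

# Toy example: two elliptical clusters on a 20 x 20 x 1 grid.
xs, ys, zs = np.meshgrid(np.arange(20), np.arange(20), np.arange(1), indexing="ij")
grid = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3).astype(float)
s = sparsity_profile(grid,
                     centers=np.array([[5.0, 5.0, 0.0], [14.0, 14.0, 0.0]]),
                     omegas=np.stack([np.eye(3) * 4.0, np.eye(3) * 4.0]),
                     gammas=np.array([1.0, 0.5]))
```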

We now combine the spatial sparsity promoting functions with the prior covariance matrix for spatial smoothness. We define a diagonal matrix S = diag(s^{1/2}) ∈ R^{D×D} that imposes locality in space on the prior covariance. The modified prior covariance matrix C ∈ R^{D×D} is now given by:

C = S B^⊤ G B S.   (4)

Our proposed design combines the notions of spatial block sparsity and spatial smoothness into a single prior covariance matrix. With a fixed number of clusters C, the covariance matrix is defined by the hyperparameters θ = {Ψ, ρ, {γ_c, κ_c, Ω_c}_{c=1}^C}.

Figure 2: Combining spatial block sparsity and spatial smoothness. Top: the true 2-dimensional weight vector w (left) and the estimated spatial block sparsity matrix S (right). Bottom: w̃ = DFT(w) in Fourier space (left) and the estimated frequency sparse prior variance G (right). Right: the estimated weight vector w_map using the proposed prior covariance.

The support of the weight vector w is determined by the structure of the covariance matrix C through the sparsity of the matrix S. Elements of the weight vector w with zero prior covariance will remain sparse. To illustrate this effect, suppose the diagonal of S contains t non-zero elements; then the rows and columns of C corresponding to the u = D − t sparse indexes are zero. Without loss of generality, there exists a permutation matrix P ∈ R^{D×D} such that the covariance matrix can be partitioned as:

P^⊤ C P = [ C̃  0_{t×u} ;  0_{u×t}  0_{u×u} ],

where C̃ ∈ R^{t×t} is the non-zero sub-matrix of C, and the 0 blocks are all-zero matrices of the appropriate size. Hence, the Gaussian prior term on w′ = P^⊤w used to compute the MAP estimate Eq. 3 is given as:

w′^⊤ (P^⊤ C P)^{-1} w′ = w′^⊤ [ C̃^{-1}  0_{t×u} ;  0_{u×t}  0_{u×u}^{-1} ] w′.

The prior evaluates to infinity unless w′ = 0 for all indexes corresponding to the zero entries in the diagonal of S. In practice, this can be implemented by pruning all the dimensions of w corresponding to zero spatial weights before the MAP estimation procedure.
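In code, this pruning step is simple; the sketch below is an illustration (not the authors' implementation) that drops voxels whose spatial prior weight is numerically zero and keeps a mask for re-embedding the estimated weights:

```python
import numpy as np

def prune_support(X, s_diag, tol=1e-12):
    """Drop columns (voxels) whose spatial prior weight s(d) is effectively zero.

    X      : (N, D) design matrix of brain images
    s_diag : (D,) spatial sparsity profile s; zero entries have zero prior variance
    tol    : numerical threshold treated as zero
    """
    keep = s_diag > tol
    return X[:, keep], keep  # reduced design matrix and a mask to re-embed w_map later

# Usage: w_full = np.zeros(D); w_full[keep] = w_map_reduced
```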

Efficient implementation of the B operator: Although the discrete Fourier transform matrix B is a structured matrix, its storage cost is of order O(D²), and transformation to the frequency domain using a matrix-vector product has a computational cost of O(D²). This cost may be prohibitive, as the DFT transformation is utilized in the inner loop of the evidence optimization and point estimation of the weight vector. These costs can be significantly reduced by exploiting the equivalence between B and the three dimensional discrete Fourier transform DFT(·). Using this approach, B incurs no storage cost, and transformation to the frequency domain requires computation costs of O(D log D). For instance, the covariance matrix is involved in hyperparameter estimation via quadratic terms of the form U^⊤ C V = U^⊤ S B^⊤ G B S V. This can be implemented by (i) spatial scaling: u = diag(S) U and v = diag(S) V, (ii) discrete Fourier transform: ũ = DFT(u) and ṽ = DFT(v), and (iii) weighted inner product: ũ^⊤ G ṽ, where diag(S) U = SU corresponds to the product of each element of diag(S) with the corresponding row of U. The result can be computed even more efficiently by using recent algorithms for sparse fast Fourier transforms (Hassanieh et al., 2012), exploiting the frequency sparsity recovered by the covariance matrix. This further extension is left for future work.
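A sketch of this three-step computation with NumPy's FFT; this is our illustration, under the assumptions that conjugation handles the Hermitian transpose of B and that DFT normalization constants are absorbed into G:

```python
import numpy as np

def quad_form(U, V, s_diag, g_diag, dims):
    """Compute U^T C V with C = S B^T G B S without forming B or C.

    U, V   : (D, k) matrices whose columns are contracted against C
    s_diag : (D,) diagonal of S (square root of the sparsity profile s)
    g_diag : (D,) diagonal of G, ordered like the flattened DFT frequency grid
    dims   : (Dx, Dy, Dz) with D = Dx * Dy * Dz
    """
    def scale_then_fft(M):
        M = s_diag[:, None] * M                    # (i) spatial scaling: S M
        M = M.reshape(*dims, -1)
        M = np.fft.fftn(M, axes=(0, 1, 2))         # (ii) 3D DFT applied per column
        return M.reshape(-1, M.shape[-1])
    Uf, Vf = scale_then_fft(U), scale_then_fft(V)
    # (iii) weighted inner product in the frequency domain.
    return np.real(Uf.conj().T @ (g_diag[:, None] * Vf))
```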

3.1 Efficient representation with high dimensions

In this high dimensional scenario, the size of w and C renders naïve implementation computationally infeasible. On the other hand, typical neuroimaging datasets contain a relatively small number of samples. We exploit this small sample property to improve the computational efficiency of representation and evidence optimization. Recall that the relationship between the voxel response and the stimulus is given by the linear function f(x) = w^⊤x, and that the weight vector is drawn from a Gaussian distribution (Eq. 1). Let f = [f(x_1), f(x_2), …, f(x_N)]^⊤ ∈ R^N. The prior distribution of f can be recovered in closed form by integrating out the weight vector. This results in an equivalent representation of the generative model in the function space:

f ∼ N(0, K),   K = X C X^⊤.   (5)

The reader may notice the similarity to the Gaussian process prior (cf. chapter 2.1 of Rasmussen and Williams (2005)). In fact, Eq. 5 is equivalent to a Gaussian process prior over linear functions with mean 0 and covariance function K(x_i, x_j) = x_i^⊤ C x_j. This function space representation significantly reduces the complexity of inference when N ≪ D. For instance, the computational complexity of inference in regression is reduced from O(D³) to O(N³), and storage requirements for the covariance can be reduced from O(D²) to O(N²).
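Concretely, the N × N kernel can be built from an operator that applies C to a matrix (such as the FFT-based routine sketched above), so the D × D covariance is never stored; a minimal sketch, where C_apply is a hypothetical callable:

```python
import numpy as np

def function_space_kernel(X, C_apply):
    """K = X C X^T using only products with C, never the full D x D matrix.

    X       : (N, D) design matrix of brain images
    C_apply : callable mapping a (D, k) matrix M to C @ M
    """
    CXt = C_apply(X.T)       # (D, N) block, i.e. C X^T
    return X @ CXt           # (N, N) kernel: storage O(N^2) instead of O(D^2)
```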

Figure 3: Equivalent representations of the generative model in the weight space (left) and the dual function space (right). θ are the parameters of the prior distribution, and ξ are the likelihood model parameters.

4 BSL for regression

The continuous valued stimuli are modeled as independent Gaussian distributed variables, p(y | x, ξ) = N(f(x), σ²). Without loss of generality, we will assume that the data is normalized so the stimuli are zero mean. Hence, the likelihood hyperparameters ξ represent the observed noise variance σ². Let y = [y_1, y_2, …, y_N]^⊤ ∈ R^N represent the N training stimuli collected into a vector. In this section, we summarize the procedures for evidence optimization, stimuli prediction and weight estimation.

4.1 Hyperparameter estimation

As the prior distribution and the likelihood are both Gaussian, the prior Eq. 5 can be integrated out in closed form. The result is the evidence:

p(y | X, Θ) = N(0, K_y),

where K_y = K + σ²I. We estimate the model hyperparameters by maximizing the corresponding marginal log likelihood, which is given by:

log p(y | X, Θ) = −(1/2) y^⊤ K_y^{-1} y − (1/2) log |K_y| − (N/2) log 2π.

The log evidence can be optimized efficiently using gradient based direct optimization techniques (Rasmussen and Williams, 2005).
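For reference, the log marginal likelihood above can be evaluated stably with a Cholesky factorization; a minimal sketch (function and variable names are ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_evidence(K, y, sigma2):
    """log p(y | X, Theta) = -0.5 y^T Ky^{-1} y - 0.5 log|Ky| - 0.5 N log(2 pi)."""
    N = y.shape[0]
    Ky = K + sigma2 * np.eye(N)
    L, lower = cho_factor(Ky, lower=True)          # Ky = L L^T
    alpha = cho_solve((L, lower), y)               # Ky^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log |Ky|
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2.0 * np.pi)
```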

4.2 Predictive distribution

The predictive distribution is computed by marginalizing out the model parameters with respect to their posterior distribution. The posterior distribution of the noise free response f_* = f(x_*) can be computed in closed form (Rasmussen and Williams, 2005) as:

p(f_* | x_*, D) = N(µ, Σ),   (6)
µ = x_*^⊤ C X^⊤ (K + σ²I)^{-1} y,
Σ = x_*^⊤ C x_* − x_*^⊤ C X^⊤ (K + σ²I)^{-1} X C x_*,

where D = {y, X} represents the training data. We use the mean of the posterior distribution as a point estimate of the model prediction.
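A small sketch of this predictive computation, again treating C as an operator (names are ours, not from the paper):

```python
import numpy as np

def predict(x_star, X, y, C_apply, K, sigma2):
    """Posterior mean and variance of the noise-free response f(x_star)."""
    N = y.shape[0]
    Ky = K + sigma2 * np.eye(N)
    Cx = C_apply(x_star[:, None]).ravel()   # C x_*
    k_star = X @ Cx                         # X C x_*, the train-test covariances
    mu = k_star @ np.linalg.solve(Ky, y)    # x_*^T C X^T (K + sigma^2 I)^{-1} y
    var = x_star @ Cx - k_star @ np.linalg.solve(Ky, k_star)
    return mu, var
```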

4.3 Point estimate of weight vector

Given the trained hyperparameters, the point estimate that maximizes the posterior distribution is equal to the posterior mean of the weight vector. The posterior distribution of the weight vector can be computed in closed form as:

p(w | D, Θ) = N(σ^{-2} Σ X^⊤ y, Σ),   (7)

where Σ = (C^{-1} + σ^{-2} X^⊤ X)^{-1}. Note that only the mean of the posterior distribution is required for the point estimate. Yet this closed form may be computationally infeasible with high dimensional data. A scalable alternative approach is direct maximization of the (unnormalized) posterior distribution as described in Eq. 3. Ignoring constant terms, the resulting optimization is given by:

w_map = arg min_w [ ‖y − Xw‖₂² + σ² w^⊤ C^{-1} w ].   (8)

This is a regularized least squares problem and can be solved efficiently using standard optimization techniques.
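One convenient route to this estimate in the small-sample regime is the dual (function-space) form of the posterior mean, w_map = C X^⊤ (X C X^⊤ + σ²I)^{-1} y, which only requires solving an N × N system; a sketch under the same operator assumptions as above:

```python
import numpy as np

def map_weights(X, y, C_apply, K, sigma2):
    """w_map = C X^T (X C X^T + sigma^2 I)^{-1} y, avoiding any D x D inverse."""
    N = y.shape[0]
    alpha = np.linalg.solve(K + sigma2 * np.eye(N), y)  # (K + sigma^2 I)^{-1} y
    return C_apply(X.T) @ alpha                         # (D,) weight map
```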

5 BSL for classification

We employ a classification approach when the target stimuli consist of a set of discrete items. Let J be the total number of stimulus classes, and let y_n^j be an indicator variable with y_n^j = 1 if the nth image is from class j, and y_n^j = 0 otherwise. These are collected into the vector y^j = [y_1^j, …, y_N^j]^⊤, and the combined stimuli are given by y = [(y^1)^⊤, …, (y^J)^⊤]^⊤. We use a separate linear function for each class so the resulting weights can be interpreted directly as a discriminative stimulus signature.

The linear function response for each class is computed as f^j = [f_1^j, …, f_N^j]^⊤ ∈ R^N, where f_n^j = f^j(x_n), and the combined function vector is given by f = [(f^1)^⊤, …, (f^J)^⊤]^⊤. Each class function is drawn from a multivariate Gaussian prior f^j ∼ N(0, K^j) with a class specific covariance K^j = X C^j X^⊤ as described in Eq. 5. We assume that the prior distributions of the class functions are uncorrelated. Hence, we can define the prior distribution of the combined vector as f ∼ N(0, K), where K is a block diagonal matrix with blocks K^j.

The probability of the nth stimulus belonging to the jth class is defined by the softmax:

p(y_n^j | {f_n^j}_{j=1}^J) = π_n^j = exp(f_n^j) / Σ_{l=1}^J exp(f_n^l).   (9)

Assuming that each of the N targets {y_n^j}_{j=1}^J is conditionally independent, the log-likelihood of the data is given by:

L(f) = log p(y | f) = y^⊤ f − Σ_{n=1}^N log( Σ_{l=1}^J exp(f_n^l) ).   (10)

5.1 Hyperparameter estimation

The evidence function is not available in closed form. We employ an approximate evidence approach based on an approximate posterior. The posterior distribution of the latent functions does not have a closed form expression. We estimate an approximate posterior distribution using the Laplace approximation (Rasmussen and Williams, 2005; Park et al., 2011), based on a Gaussian approximation to the posterior distribution at the mode. The approximate posterior takes the form:

p(f | D) ≈ N(f_map, Λ^{-1}),   (11)

where f_map is the MAP parameter estimate. The estimate f_map is computed by maximizing the unnormalized log-posterior:

Φ(f) = L(f) − (1/2) f^⊤ K^{-1} f − (1/2) log |2πK|.   (12)

The posterior precision is given by the negative Hessian at f_map, computed as:

Λ = −∇∇Φ(f) = K^{-1} + H,

where H = −∂²L(f)/∂f² = diag(π) − ΠΠ^⊤, and Π is the matrix of size JN × N formed by vertically stacking the matrices diag(π^j).
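The softmax probabilities and the matrix H can be formed directly; the following sketch (our own helper) uses the class-major stacking of f described above and builds a dense H, which is only practical for small JN:

```python
import numpy as np

def softmax_neg_hessian(F):
    """H = diag(pi) - Pi Pi^T for the softmax log-likelihood.

    F : (N, J) array of latent values f_n^j; returns a (J*N, J*N) matrix
        using the class-major stacking f = [(f^1)^T, ..., (f^J)^T]^T.
    """
    N, J = F.shape
    P = np.exp(F - F.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                      # pi_n^j
    pi = P.T.ravel()                                       # length J*N, class-major
    Pi = np.vstack([np.diag(P[:, j]) for j in range(J)])   # (J*N, N)
    return np.diag(pi) - Pi @ Pi.T
```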

Finally, we optimize the evidence evaluated at f = f_map. This is given by:

p(y | θ) = p(y | f) p(f | θ) / p(f | y, θ) ≈ exp(L(f)) N(f | 0, K) / ( c N(f | f_map, Λ^{-1}) ).

The resulting evidence optimization follows the approach outlined in Rasmussen and Williams (2005).


Figure 4: 2D simulated example for 3-class classification. A: Prediction accuracy, and correlation between the true weight vectors and the estimates obtained by L1, L2, elastic net regularization methods, and BSL, as a function of the number of training datapoints (50, 100, 200, 400). B: True weight vectors for each class, and the estimates obtained by each method using 100 and 400 training data points. BSL outperforms L1, L2 and elastic net regularized models both in terms of classification accuracy and support recovery.

5.2 Predictive distribution

The posterior predictive distribution is analytically intractable, so we employ the approximate Gaussian posterior Eq. 11. We approximate the posterior predictive distribution for class j as:

p(f_*^j | x_*, D) = N(µ, Σ),   (13)
µ = x_*^⊤ C^j X^⊤ (K^j)^{-1} f_map^j,
Σ = diag(k(x_*, x_*)) − Q_*^⊤ (K + H^{-1})^{-1} Q_*,

where k(x_*, x_*) is the vector of covariances with the jth element given by x_*^⊤ C^j x_*, and Q_* is the (JN × J) block diagonal matrix:

Q_* = blockdiag( X C^1 x_*, X C^2 x_*, …, X C^J x_* ).

Given the predictive distribution Eq. 13, the predictive class probabilities can be computed using the Monte Carlo sampling approach shown in Rasmussen and Williams (2005).

5.3 Point estimate of weight vector

The point estimate of the weight vector is computed as the vector that maximizes the unnormalized log posterior distribution log p(w | D, Θ). Ignoring constant terms, the resulting w_map ∈ R^{DJ} is given by:

arg min_w [ −y^⊤ f + Σ_{n=1}^N log( Σ_{l=1}^J exp(f_n^l) ) + (1/2) w^⊤ C^{-1} w ].   (14)

We have retained the linear function representation of the likelihood for compactness; however, note that the optimization problem is posed in the weight space. This optimization corresponds to a regularized generalized linear model and can be solved efficiently using standard optimization techniques.

6 Experimental results

We present experimental results comparing the proposed Bayesian structure learning (BSL) model to regularized generalized linear models for predictive modeling.

Simulated data

We first tested our method on simulated data in a 3-class classification setting. We generated N random 2-dimensional images where each pixel was generated independently from the standard normal distribution N(0, 1). We also generated a set of weight vectors as shown in Fig. 4B (first column). The stimulus responses were generated using the multinomial distribution Eq. 9 and hard thresholded into one of three classes. The dimensionality of each weight vector was 20 by 20, resulting in a D = 1200 parameter space. We first examined the prediction accuracy of estimates obtained by L1, L2, elastic net regularization (Zou and Hastie, 2005), and our method (BSL). The average prediction accuracy (from 10 independent repetitions) is shown in Fig. 4A (left) as a function of the number of training samples. The estimated weight vectors from each method are shown in Fig. 4B (right) using 100 and 400 data points, respectively. We computed the correlation coefficients between the true weight vector and the estimates obtained by each method to test the support recovery performance. These are shown in Fig. 4A (right). As shown in the presented results, our method outperforms the other methods in terms of prediction accuracy as well as support recovery.


Figure 5: Support (in red) of the estimated weights from each method using real fMRI data: BSL (MSE: 0.90), elastic net (0.96), L1 (0.98), and randomized ward lasso (1.19). Each row shows slices of the brain from the top of the skull; the magnitude of the weight vector is not shown. First row: estimate obtained by BSL. Second row: estimate obtained by elastic net regularization. Third row: estimate obtained by L1 regularization. Fourth row: estimate obtained by randomized ward lasso. Numbers in parentheses are the 10-fold cross validation average mean squared error for each method. BSL outperforms the other methods in terms of mean squared error and recovers an interpretable support.


Functional neuroimaging data

fMRI data were collected from 126 participants while the subjects performed a stop-signal task (Aron and Poldrack, 2006). For each subject, contrast images were computed for “go” trials and successful “stop” trials using a general linear model with the FMRIB Software Library (FSL), and these contrast images were used for regression against estimated stop-signal reaction times. The fMRI data was down-sampled to 22 × 27 × 22 voxels using the flirt applyXfm tool (Alpert et al., 1996).

Fig. 5 shows the recovered support from the proposed BSL, L1 regularized regression, elastic net regularized regression, and randomized ward lasso using hierarchical spatial clustering (Varoquaux et al., 2012). We tested each method using 10-fold cross-validation and computed the mean squared error (MSE) performance averaged over the 10 folds. The hyperparameters for L1, L2, elastic net, and randomized ward lasso were computed using an inner cross-validation loop. In BSL, we initialized the hyperparameters from the L2 estimate to avoid some of the issues with local minima. For spatial sparsity, 20 clusters were assumed based on domain expertise, and unused blocks were pruned out automatically during the hyperparameter estimation.

The results from L2 regularized regression are not shown, as the returned weights had full support and hence direct interpretation of the weight vector was infeasible. In addition to the presented results, we tested the relevance vector machine (Tipping, 2001) (6.6%) and stability selection lasso (Meinshausen and Bühlmann, 2010) (6.7%); the relative increases in MSE compared to BSL are given in parentheses. The corresponding images are not shown due to space constraints. We also tested the special cases of BSL with block sparsity alone (2.6%) and spatial correlation alone (3.2%).

The regions identified by BSL encompass a set of regions (including right prefrontal cortex, anterior insula, basal ganglia, and lateral temporal cortex) that have been commonly identified as being involved in the stop signal task using univariate analyses. In particular, the right prefrontal region that is detected by BSL but missed by the other methods has been widely noted to be involved in this task (Aron et al., 2004).

7 Conclusion

We develop a novel Bayesian model for structured predictive modeling of functional neuroimaging data, designed to jointly capture the block spatial sparsity and spatial smoothness properties of the neural signal. We also propose an efficient model representation for the small sample, high dimensional domain and develop efficient inference, parameter estimation and prediction procedures. BSL is applied to simulated data and real fMRI data, and it is shown to outperform alternative models that focus on spatial sparsity alone.

Acknowledgments

We thank Gael Varoquaux for helpful discussions andPython code for randomized ward lasso.


References

N. M. Alpert, D. Berdichevsky, Z. Levin, E. D. Morris, and A. J. Fischman. Improved methods for image registration. NeuroImage, 3(1):10–18, 1996. ISSN 1053-8119.

A. R. Aron and R. A. Poldrack. Cortical and subcortical contributions to stop signal response inhibition: role of the subthalamic nucleus. J. Neurosci., 26(9):2424–2433, March 2006.

A. R. Aron, T. W. Robbins, and R. A. Poldrack. Inhibition and the right inferior frontal cortex. Trends in Cognitive Sciences, 8(4):170–177, 2004.

Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738.

Melissa K. Carroll, Guillermo A. Cecchi, Irina Rish, Rahul Garg, and A. Ravishankar Rao. Prediction and interpretation of distributed neural activity with sparse models. NeuroImage, 44(1):112–122, 2009.

G. Casella. An introduction to empirical Bayes data analysis. American Statistician, pages 83–87, 1985.

D. D. Cox and R. L. Savoy. Functional magnetic resonance imaging (fMRI) “brain reading”: detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage, 19(2):261–270, June 2003.

S. J. Hanson, T. Matsuka, and J. V. Haxby. Combinatorial codes in ventral temporal lobe for object recognition: Haxby (2001) revisited: is there a “face” area? NeuroImage, 23(1):156–166, 2004.

Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Simple and practical algorithm for sparse Fourier transform. In SODA, pages 1183–1194, 2012.

Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010. ISSN 1467-9868.

Tom M. Mitchell, Rebecca Hutchinson, Radu S. Niculescu, Francisco Pereira, Xuerui Wang, Marcel Just, and Sharlene Newman. Learning to decode cognitive states from brain images. Mach. Learn., 57(1-2):145–175, October 2004.

C. N. Morris. Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association, pages 47–55, 1983.

Kenneth A. Norman, Sean M. Polyn, Greg J. Detre, and James V. Haxby. Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, 10(9):424–430, September 2006.

A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Prentice-Hall signal processing series. Prentice Hall, 1989. ISBN 9780132162920.

Mijung Park and Jonathan W. Pillow. Receptive field inference with localized priors. PLoS Comput Biol, 7(10):e1002219, October 2011.

Mijung Park, Greg Horwitz, and Jonathan W. Pillow. Active learning of neural response functions with Gaussian processes. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2043–2051. NIPS, 2011.

Francisco Pereira, Tom Mitchell, and Matthew Botvinick. Machine learning classifiers and fMRI: A tutorial overview. NeuroImage, 45(1, Supplement 1):S199–S209, 2009. Mathematics in Brain Imaging.

Russell A. Poldrack, Yaroslav O. Halchenko, and Stephen Jose Hanson. Decoding the large-scale structure of brain function by classifying mental states across individuals. Psychological Science, 20(11):1364–1372, 2009.

Russell A. Poldrack. Inferring mental states from neuroimaging data: From reverse inference to large-scale decoding. Neuron, 72(5):692–697, 2011.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning series). The MIT Press, November 2005. ISBN 026218253X.

Maneesh Sahani and Jennifer F. Linden. Evidence optimization techniques for estimating stimulus-response functions. In NIPS, pages 301–308, 2002.

Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, September 2001.

Gaël Varoquaux, Alexandre Gramfort, and Bertrand Thirion. Small-sample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering. In John Langford and Joelle Pineau, editors, International Conference on Machine Learning, June 2012.

Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theor., 55(5):2183–2202, May 2009. ISSN 0018-9448. doi: 10.1109/TIT.2009.2016018.

David Wipf and Srikantan Nagarajan. A new view of automatic relevance determination. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1625–1632. MIT Press, Cambridge, MA, 2008.

Peng Zhao and Bin Yu. On model selection consistency of Lasso. J. Mach. Learn. Res., 7:2541–2563, December 2006. ISSN 1532-4435.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.

