
Nonparametric Permutation Tests for Functional Neuroimaging: A Primer with Examples

Thomas E. Nichols1 and Andrew P. Holmes2,3*

1 Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
2 Robertson Centre for Biostatistics, Department of Statistics, University of Glasgow, Scotland, United Kingdom
3 Wellcome Department of Cognitive Neurology, Institute of Neurology, London, United Kingdom


Abstract: Requiring only minimal assumptions for validity, nonparametric permutation testing provides a flexible and intuitive methodology for the statistical analysis of data from functional neuroimaging experiments, at some computational expense. Introduced into the functional neuroimaging literature by Holmes et al. ([1996]: J Cereb Blood Flow Metab 16:7–22), the permutation approach readily accounts for the multiple comparisons problem implicit in the standard voxel-by-voxel hypothesis testing framework. When the appropriate assumptions hold, the nonparametric permutation approach gives results similar to those obtained from a comparable Statistical Parametric Mapping approach using a general linear model with multiple comparisons corrections derived from random field theory. For analyses with low degrees of freedom, such as single subject PET/SPECT experiments or multi-subject PET/SPECT or fMRI designs assessed for population effects, the nonparametric approach employing a locally pooled (smoothed) variance estimate can outperform the comparable Statistical Parametric Mapping approach. Thus, these nonparametric techniques can be used to verify the validity of less computationally expensive parametric approaches. Although the theory and relative advantages of permutation approaches have been discussed by various authors, there has been no accessible explication of the method, and no freely distributed software implementing it. Consequently, there have been few practical applications of the technique. This article, and the accompanying MATLAB software, attempts to address these issues. The standard nonparametric randomization and permutation testing ideas are developed at an accessible level, using practical examples from functional neuroimaging, and the extensions for multiple comparisons described. Three worked examples from PET and fMRI are presented, with discussion, and comparisons with standard parametric approaches made where appropriate. Practical considerations are given throughout, and relevant statistical concepts are expounded in appendices. Hum. Brain Mapping 15:1–25, 2001. © 2001 Wiley-Liss, Inc.

Key words: hypothesis test; multiple comparisons; statistic image; nonparametric; permutation test; randomization test; SPM; general linear model


INTRODUCTION

The statistical analysis of functional mapping experiments usually proceeds at the voxel level, involving the formation and assessment of a statistic image: at each voxel a statistic indicating evidence of the experimental effect of interest, at that voxel, is computed, giving an image of statistics, a statistic image or Statistical Parametric Map (SPM).

Contract grant sponsor: Wellcome Trust; Contract grant sponsor: Center for the Neural Basis of Cognition.
*Correspondence to: Dr. A.P. Holmes, Robertson Centre for Biostatistics, Department of Statistics, University of Glasgow, Glasgow, UK G12 8QQ. E-mail: [email protected]
Received for publication 20 August 1999; accepted 10 July 2001


In the absence of a priori anatomical hypotheses, the entire statistic image must be assessed for significant experimental effects, using a method that accounts for the inherent multiplicity involved in testing at all voxels simultaneously.

Traditionally, this has been accomplished in a classical parametric statistical framework. The most commonly used methods are, or are similar to, those originally expounded by Friston et al. (1995b) and Worsley et al. (1992). In this framework, the data are assumed to be normally distributed, with mean parameterized by a general linear model (this flexible framework encompasses t-tests, F-tests, paired t-tests, ANOVA, correlation, linear regression, multiple regression, and ANCOVA, among others). The estimated parameters of this model are contrasted to produce a test statistic at each voxel, which has a Student's t-distribution under the null hypothesis. The resulting t-statistic image is then assessed for statistical significance, using distributional results for continuous random fields to identify voxels or regions where there is significant evidence against the null hypothesis (Friston et al., 1994, 1996; Worsley et al., 1995; Worsley, 1996; Poline et al., 1997) [see Appendix B for a glossary of statistical terms].

Holmes et al. (1996) introduced a nonparametric alternative based on permutation test theory. This method is conceptually simple, relies only on minimal assumptions, deals with the multiple comparisons issue, and can be applied when the assumptions of a parametric approach are untenable. Further, in some circumstances, the permutation method outperforms parametric approaches. Arndt (1996), working independently, also discussed the advantages of similar approaches. Subsequently, Grabowski et al. (1996) demonstrated empirically the potential power of the approach in comparison with other methods. Halber et al. (1997), discussed further by Holmes et al. (1998), also favour the permutation approach. Applications of permutation testing methods to single subject fMRI require modelling the temporal autocorrelation in the time series. Bullmore et al. (1996) develop permutation based procedures for periodic fMRI activation designs using a simple ARMA model for temporal autocorrelations, though they eschew the problem of multiple comparisons. Locascio et al. (1997) describe an application to fMRI combining the general linear model (Friston et al., 1995b), ARMA modelling (Bullmore et al., 1996), and a multiple comparisons permutation procedure (Holmes et al., 1996). Liu et al. (1998) consider an alternative approach, permuting labels. Bullmore et al. (1999) apply nonparametric methods to compare groups of structural MR images. Applications of these techniques, however, have been relatively scarce (Andreasen et al., 1996; Noll et al., 1996; Locascio et al., 1997).

The aim of this study is to make the multiple comparisons nonparametric permutation approach of Holmes et al. (1996) more accessible, complement the earlier formal exposition with more practical considerations, and illustrate the potential power and flexibility of the approach through worked examples.

We begin with an introduction to nonparametric permutation testing, reviewing experimental design and hypothesis testing issues, and illustrating the theory by considering testing a functional neuroimaging dataset at a single voxel. The problem of searching the brain volume for significant activations is then considered, and the extension of the permutation method to the multiple comparisons problem of simultaneously testing at all voxels is described. With appropriate methodology in place, we conclude with three annotated examples illustrating the approach. Software implementing the approach is available as an extension of the MATLAB based SPM package (see Appendix A for details).

PERMUTATION TESTS

Permutation tests are one type of nonparametric test. They were proposed in the early twentieth century, but have only recently become popular with the availability of inexpensive, powerful computers to perform the computations involved.

The essential concept of a permutation test is relatively intuitive. For example, consider a simple single subject PET activation experiment, where a single subject is scanned repeatedly under "rest" and "activation" conditions. Considering the data at a particular voxel, if there is really no difference between the two conditions, then we would be fairly surprised if most of the "activation" observations were larger than the "rest" observations, and would be inclined to conclude that there was evidence of some activation at that voxel. Permutation tests simply provide a formal mechanism for quantifying this "surprise" in terms of probability, thereby leading to significance tests and P-values.

If there is no experimental effect, then the labelling of observations by the corresponding experimental condition is arbitrary, because the same data would have arisen whatever the condition. These labels can be any relevant attribute: condition "tags," such as "rest" or "active"; a covariate, such as task difficulty or response time; or a label, indicating group membership. Given the null hypothesis that the labellings are arbitrary, the significance of a statistic expressing the experimental effect can then be assessed by comparison with the distribution of values obtained when the labels are permuted.

The justification for exchanging the labels comes from either weak distributional assumptions, or by appeal to the randomization scheme used in designing the experiment. Tests justified by the initial randomization of conditions to experimental units (e.g., subjects or scans) are sometimes referred to as randomization tests, or re-randomization tests. Whatever the theoretical justification, the mechanics of the tests are the same. Many authors refer to both generically as permutation tests, a policy we shall adopt unless a distinction is necessary.

In this section, we describe the theoretical underpinning for randomization and permutation tests. Beginning with simple univariate tests at a single voxel, we first present randomization tests, describing the key concepts at length, before turning to permutation tests. These two approaches lead to exactly the same test, which we illustrate with a simple worked example, before describing how the theory can be applied to assess an entire statistic image. For simplicity of exposition, the methodology is developed using the example of a simple single subject PET activation experiment. The approach, however, is not limited to activation experiments, nor to PET.

Randomization Test

First, we consider randomization tests, using a single subject activation experiment to illustrate the thinking: Suppose we are to conduct a simple single subject PET activation experiment, with the regional cerebral blood flow (rCBF) in "active" (A) condition scans to be compared to that in scans acquired under an appropriate "baseline" (B) condition. The fundamental concepts are of experimental randomization, the null hypothesis, exchangeability, and the randomization distribution.

Randomization

To avoid unexpected confounding effects, suppose we randomize the allocation of conditions to scans before conducting the experiment. Using an appropriate scheme, we label the scans as A or B according to the conditions under which they will be acquired, and hence specify the condition presentation order. This allocation of condition labels to scans is randomly chosen according to the randomization scheme, and any other possible labeling of this scheme was equally likely to have been chosen (see Appendix C for a discussion of the fundamentals of randomization).

Null hypothesis

In the randomization test, the null hypothesis is explicitly about the acquired data. For example, H0: "Each scan would have been the same whatever the condition, A or B." The hypothesis is that the experimental conditions did not affect the data differentially, such that had we run the experiment with a different condition presentation order, we would have observed exactly the same data. In this sense we regard the data as fixed, and the experimental design as random (in contrast to regarding the design as fixed, and the data as a realization of a random process). Under this null hypothesis, the labeling of the scans as A or B is arbitrary, because this labeling arose from the initial random allocation of conditions to scans, and any initial allocation would have given the same data. Thus, we may re-randomize the labels on the data, effectively permuting the labels, subject to the restriction that each permutation could have arisen from the initial randomization scheme. The observed data is equally likely to have arisen from any of these permuted labelings.

Exchangeability

This leads to the notion of exchangeability. Consider the situation before the data is collected, but after the condition labels have been assigned to scans. Formally, a set of labels on the data (still to be collected) is exchangeable if the distribution of the statistic (still to be evaluated) is the same whatever the labeling (Good, 1994). For our activation example, we would use a statistic expressing the difference between the "active" and "baseline" scans. Thus, under the null hypothesis of no difference between the A and B conditions, the labels are exchangeable, provided the permuted labeling could have arisen from the initial randomization scheme. The initial randomization scheme gives us the probabilistic justification for permuting the labels; the null hypothesis asserts that the data would have been the same.

With a randomization test, the randomization scheme prescribes the possible labelings, and the null hypothesis asserts that the labels are exchangeable within the constraints of this scheme. Thus we define an exchangeability block (EB) as a block of scans within which the labels are exchangeable, a definition that mirrors that of randomization blocks (see Appendix C).


Randomization distribution

Consider some statistic expressing the experimental effect of interest at a particular voxel. For the current example of a PET single subject activation, this could be the mean difference between the A and the B condition scans, a two-sample t-statistic, a t-statistic from an ANCOVA, or any appropriate statistic. We are not restricted to the common statistics of classical parametric hypothesis testing, whose null distributions are known under specific assumptions, because the appropriate distribution will be derived from the data.

The computation of the statistic depends on the labeling of the data. For example, with a two-sample t-statistic, the labels A and B specify the groupings. Thus, permuting the labels leads to an alternative value of the statistic.

Given exchangeability under the null hypothesis, the observed data is equally likely to have arisen from any possible labeling. Hence, the statistics associated with each of the possible labelings are also equally likely. Thus, we have the permutation (or randomization) distribution of our statistic: the permutation distribution is the sampling distribution of the statistic under the null hypothesis, given the data observed. Under the null hypothesis, the observed statistic is randomly chosen from the set of statistics corresponding to all possible relabelings. This gives us a way to formalize our "surprise" at an outcome: the probability of an outcome as or more extreme than the one observed, the P-value, is the proportion of statistic values in the permutation distribution greater than or equal to that observed. The actual labeling used in the experiment is one of the possible labelings, so if the observed statistic is the largest of the permutation distribution, the P-value is 1/N, where N is the number of possible labelings of the initial randomization scheme. Because we are considering a test at a single voxel, these would be uncorrected P-values in the language of multiple comparisons (Appendix E).

Randomization test summary

To summarize, the null hypothesis asserts that the scans would have been the same whatever the experimental condition, A or B. Under this null hypothesis the initial randomization scheme can be regarded as arbitrarily labeling scans as A or B, under which the experiment would have given the same data, and the labels are exchangeable. The statistic corresponding to any labeling from the initial randomization scheme is as likely as any other, because the permuted labeling could equally well have arisen in the initial randomization. The sampling distribution of the statistic (given the data) is the set of statistic values corresponding to all the possible relabelings of the initial randomization scheme, each value being equally likely.

Randomization test mechanics

Let N denote the number of possible relabelings, and let ti be the statistic corresponding to relabeling i. The set of ti for all possible relabelings constitutes the permutation distribution. Let T denote the value of the statistic for the actual labeling of the experiment. As usual in statistics, we use a capital letter for a random variable. T is random, because under H0 it is chosen from the permutation distribution according to the initial randomization.

Under H0, all of the ti are equally likely, so we determine the significance of our observed statistic T by counting the proportion of the permutation distribution as or more extreme than T, giving us our P-value. We reject the null hypothesis at significance level α if the P-value is less than α. Equivalently, T must be greater than or equal to the 100(1 − α)th percentile of the permutation distribution. Thus, the critical value is the (c + 1) largest member of the permutation distribution, where c = ⌊αN⌋, αN rounded down. If T exceeds this critical value then the test is significant at level α.
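For concreteness, the sketch below (in Python with NumPy; illustrative only, not the accompanying MATLAB software) computes a P-value and critical value from a permutation distribution following exactly the rules just stated. The function and variable names are assumptions introduced for illustration.

    import numpy as np

    # Illustrative sketch: P-value and critical value from a permutation
    # distribution, as described in the text above.
    def permutation_p_value(perm_dist, T):
        perm_dist = np.asarray(perm_dist)
        return np.mean(perm_dist >= T)          # proportion as or more extreme than T

    def critical_value(perm_dist, alpha):
        ordered = np.sort(np.asarray(perm_dist))[::-1]   # sort descending
        c = int(np.floor(alpha * len(ordered)))          # c = floor(alpha * N)
        return ordered[c]                                # the (c + 1) largest member

    # With N = 20 and alpha = 0.05, c = 1, so the critical value is the second
    # largest member of the permutation distribution.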

Permutation Test

In many situations it is impractical to randomly allocate experimental conditions, or perhaps we are presented with data from an experiment that was not randomized. For instance, we cannot randomly assign subjects to be patients or normal controls. Or, for example, consider a single subject PET design where a covariate is measured for each scan, and we seek brain regions whose activity appears to be related to the covariate value.

In the absence of an explicit randomization of conditions to scans, we must make weak distributional assumptions to justify permuting the labels on the data. Typically, all that is required is that the distributions have the same shape, or are symmetric. The actual permutations that are performed depend on the degree of exchangeability, which in turn depends on the actual assumptions made. With the randomization test, the experimenter designs the initial randomization scheme carefully to avoid confounds. The randomization scheme reflects an implicitly assumed degree of exchangeability (see Appendix C for randomization considerations). With the permutation test, the degree of exchangeability must be assumed post hoc. The reasoning that would have led to a particular randomization scheme can usually be applied post hoc to an experiment, leading to a permutation test with the same degree of exchangeability. Given exchangeability, computation proceeds as for the randomization test.

Permutation test summary

Weak distributional assumptions are made, which embody the degree of exchangeability. The exact form of these assumptions depends on the experiment at hand, as illustrated in the following section and in the examples section.

For a simple single subject activation experiment, we might typically assume the following: for a particular voxel, "active" and "baseline" scans within a given block have a distribution with the same shape, though possibly different means. The null hypothesis asserts that the distributions for the "baseline" and "active" scans have the same mean, and hence are the same. Then the labeling of scans is arbitrary within the chosen blocks, which are thus the exchangeability blocks. Any permutation of the labels within the exchangeability blocks leads to an equally likely statistic.

The mechanics are then the same as with the randomization test. For each of the possible relabelings, compute the statistic of interest; for relabeling i, call this statistic ti. Under the null hypothesis each of the ti is equally likely, so the P-value is the proportion of the ti greater than or equal to the statistic T corresponding to the correctly labeled data.

Single Voxel Example

To make these concepts concrete, consider assessing the evidence of an activation effect at a single voxel of a single subject PET activation experiment consisting of six scans, three in each of the "active" (A) and "baseline" (B) conditions. Suppose that the conditions were presented alternately, starting with rest, and that the observed data at this voxel are {90.48, 103.00, 87.83, 99.93, 96.06, 99.76} to two decimal places (these data are from a voxel in the primary visual cortex of the second subject in the visual activation experiment presented in the examples section).

As mentioned before, any statistic can be used, so for simplicity of illustration we use the "mean difference," i.e., T = (1/3) Σ_{j=1}^{3} (A_j − B_j), where B_j and A_j indicate the value of the jth scan at the particular voxel of interest, under the baseline and active conditions respectively. Thus, we observe statistic T = +9.45.

Randomization test

Suppose that the condition presentation order was randomized, the actual ordering of BABABA having been randomly selected from all allocations of three A's and three B's to the six available scans, a simple balanced randomization within a single randomization block of size six. Combinatorial theory, or some counting, tells us that this randomization scheme has twenty (6C3 = 20) possible outcomes (see Appendix D for an introduction to combinatorics).

Then we can justify permuting the labels on the basis of this initial randomization. Under the null hypothesis H0: "The scans would have been the same whatever the experimental condition, A or B," the labels are exchangeable, and the statistics corresponding to the 20 possible labelings are equally likely. The 20 possible labelings are:

 1. AAABBB    2. AABABB    3. AABBAB    4. AABBBA    5. ABAABB
 6. ABABAB    7. ABABBA    8. ABBAAB    9. ABBABA   10. ABBBAA
11. BAAABB   12. BAABAB   13. BAABBA   14. BABAAB   15. BABABA
16. BABBAA   17. BBAAAB   18. BBAABA   19. BBABAA   20. BBBAAA
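For illustration, the following Python sketch (not part of the accompanying software) enumerates these 20 labelings; itertools.combinations happens to generate them in the same order as the list above.

    from itertools import combinations

    # Illustrative sketch: enumerate the 6C3 = 20 labelings that assign
    # three A's and three B's to six scans.
    def all_labelings(n_scans=6, n_active=3):
        labelings = []
        for active in combinations(range(n_scans), n_active):
            labels = ['B'] * n_scans
            for i in active:
                labels[i] = 'A'
            labelings.append(''.join(labels))
        return labelings

    labelings = all_labelings()
    print(len(labelings))    # 20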

Permutation test

Suppose there was no initial randomization of conditions to scans, and that the condition presentation order BABABA was simply chosen. With no randomization, we must make weak distributional assumptions to justify permuting the labels, effectively prescribing the degree of exchangeability.

For this example, consider permuting the labels freely amongst the six scans. This corresponds to full exchangeability, a single exchangeability block of size six. For this to be tenable, we must either assume the absence of any temporal or similar confounds, or model their effect such that they do not affect the statistic under permutations of the labels. Consider the former. This gives 20 possible permutations of the labels, precisely those enumerated for the randomization justification above. Formally, we're assuming that the voxel values for the "baseline" and "active" scans come from distributions that are the same except for a possible difference in location, or mean. Our null hypothesis is that these distributions have the same mean, and therefore are the same.

Clearly the mean difference statistic under consideration in the current example is confounded with time for labelings such as AAABBB (no. 1) and BBBAAA (no. 20), where a time effect will result in a large mean difference between the A-labeled and the B-labeled scans. The test remains valid, but possibly conservative. The actual condition presentation order of BABABA is relatively unconfounded with time, but the contribution of confounds to the statistics for alternative labelings such as no. 1 and no. 20 will potentially increase the number of statistics greater than the observed statistic.

Computation

Let ti be the mean difference for labeling i, as enumerated above. Computing for each of the 20 relabelings:

 t1 = −4.82    t2 = +3.25    t3 = +0.67    t4 = +3.15    t5 = −6.86
 t6 = −9.45    t7 = −6.97    t8 = −1.38    t9 = +1.10   t10 = −1.48
t11 = +1.48   t12 = −1.10   t13 = +1.38   t14 = +6.97   t15 = +9.45
t16 = +6.86   t17 = −3.15   t18 = −0.67   t19 = −3.25   t20 = +4.82

This is our permutation distribution for this analysis, summarized as a histogram in Figure 1. Each possible labeling was equally likely. Under the null hypothesis the statistics corresponding to these labelings are equally likely. The P-value is the proportion of the permutation distribution greater than or equal to T. Here the actual labeling (no. 15, with t15 = +9.45) gives the largest mean difference of all the possible labelings, so the P-value is 1/20 = 0.05. For a test at a given level α, we reject the null hypothesis if the P-value is less than α, so we conclude that there is significant evidence against the null hypothesis of no activation at this voxel at level α = 0.05.
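The following Python sketch (illustrative only, not the accompanying software) reproduces this single voxel example from the data and the BABABA labeling given above; with the rounded data shown it returns a mean difference of 9.44 and a P-value of 1/20 = 0.05.

    import numpy as np
    from itertools import combinations

    # Illustrative sketch of the single voxel example. The data and the actual
    # labeling (BABABA, i.e., scans 2, 4 and 6 active) are taken from the text.
    data = np.array([90.48, 103.00, 87.83, 99.93, 96.06, 99.76])
    actual_active = (1, 3, 5)                  # 0-based indices of the A scans

    def mean_difference(active_idx):
        mask = np.zeros(data.size, dtype=bool)
        mask[list(active_idx)] = True
        return data[mask].mean() - data[~mask].mean()

    perm_dist = np.array([mean_difference(idx)
                          for idx in combinations(range(data.size), 3)])
    T = mean_difference(actual_active)         # 9.44 with these rounded data
    p_value = np.mean(perm_dist >= T)          # 1/20 = 0.05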

Multiple Comparisons Permutation Tests

Thus far we have considered using a permutation test at a single voxel. For each voxel we can produce a P-value, pk, for the null hypothesis H0k, where the superscript k indexes the voxels. If we have an a priori anatomical hypothesis concerning the experimentally induced effect at a single voxel, then we can simply test at that voxel using an appropriate α level test. If we don't have such precise anatomical hypotheses, evidence for an experimental effect must be assessed at each and every voxel. We must take account of the multiplicity of testing. Clearly 5% of voxels are expected to have P-values less than α = 0.05. This is the essence of the multiple comparisons problem. In the language of multiple comparisons (Appendix E), these P-values are uncorrected P-values. Type I errors must be controlled overall, such that the probability of falsely declaring any region as significant is less than the nominal test level α. Formally, we require a test procedure maintaining strong control over image-wise Type I error, giving adjusted P-values, P-values corrected for multiple comparisons.

The construction of suitable multiple comparisons procedures for the problem of assessing statistic images from functional mapping experiments within parametric frameworks has occupied many authors (Friston et al., 1991; Worsley et al., 1992, 1995; Poline and Mazoyer, 1993; Roland et al., 1993; Forman et al., 1995; Friston et al., 1994, 1996; Worsley, 1994; Poline et al., 1997; Cao, 1999). In contrast to these parametric and simulation based methods, a nonparametric resampling based approach provides an intuitive and easily implemented solution (Westfall and Young, 1993). The key realization is that the reasoning presented above for permutation tests at a single voxel relies on relabeling entire images, so the arguments can be extended to image level inference by considering an appropriate maximal statistic. If, under the omnibus null hypothesis, the labels are exchangeable with respect to the voxel statistic under consideration, then the labels are exchangeable with respect to any statistic summarizing the voxel statistics, such as their maxima.

We consider two popular types of test, single threshold and suprathreshold cluster size tests, but note again the flexibility of these methods to consider any statistic.

Figure 1. Histogram of the permutation distribution for the single voxel example, using a mean difference statistic. Note the symmetry of the histogram about the y-axis. This occurs because for each possible labeling, the opposite labeling is also possible, and yields the same mean difference but in the opposite direction. This trick can be used in many cases to halve the computational burden.

Single threshold test

With a single threshold test, the statistic image is thresholded at a given critical threshold, and voxels with statistic values exceeding this threshold have their null hypotheses rejected. Rejection of the omnibus hypothesis (that all the voxel hypotheses are true) occurs if any voxel value exceeds the threshold, a situation clearly determined by the maximum value of the statistic image over the volume of interest. Thus, consideration of the maximum voxel statistic deals with the multiple comparisons problem. For a valid omnibus test, the critical threshold is such that the probability that it is exceeded by the maximal statistic is less than α. Thus, we require the distribution of the maxima of the null statistic image. Approximate parametric derivations based on the theory of strictly stationary continuous random fields are given by Friston et al. (1991), Worsley (1994), and Worsley et al. (1992, 1995).

The permutation approach can yield the distribution of the maximal statistic in a straightforward manner: rather than compute the permutation distribution of the statistic at a particular voxel, we compute the permutation distribution of the maximal voxel statistic over the volume of interest. We reject the omnibus hypothesis at level α if the maximal statistic for the actual labeling of the experiment is in the top 100α% of the permutation distribution for the maximal statistic. The critical value is the (c + 1) largest member of the permutation distribution, where c = ⌊αN⌋, αN rounded down. Furthermore, we can reject the null hypothesis at any voxel with a statistic value exceeding this threshold: the critical value for the maximal statistic is the critical threshold for a single threshold test over the same volume of interest. This test can be shown to have strong control over experiment-wise Type I error. A formal proof is given by Holmes et al. (1996).

The mechanics of the test are as follows. For each possible relabeling i = 1,…,N, note the maximal statistic timax, the maximum of the voxel statistics for labeling i. This gives the permutation distribution for Tmax, the maximal statistic. The critical threshold is the (c + 1) largest member of the permutation distribution for Tmax, where c = ⌊αN⌋, αN rounded down. Voxels with statistics exceeding this threshold exhibit evidence against the corresponding voxel hypotheses at level α. The corresponding corrected P-value for each voxel is the proportion of the permutation distribution for the maximal statistic that is greater than or equal to the voxel statistic.
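As a concrete illustration, the sketch below (Python; not the accompanying software) implements the single threshold test just described. The helper statistic_image() is a hypothetical function returning the voxel statistic image for a given labeling, and the first entry of labelings is assumed to be the actual labeling used in the experiment.

    import numpy as np

    # Illustrative sketch of the single threshold test via the maximal statistic.
    def single_threshold_test(labelings, statistic_image, alpha=0.05):
        # Permutation distribution of the maximal voxel statistic, Tmax.
        max_dist = np.array([statistic_image(lab).max() for lab in labelings])
        c = int(np.floor(alpha * len(labelings)))
        critical = np.sort(max_dist)[::-1][c]        # (c + 1) largest of Tmax
        observed = statistic_image(labelings[0])     # actual labeling assumed first
        corrected_p = np.array([np.mean(max_dist >= v)
                                for v in observed.ravel()]).reshape(observed.shape)
        return critical, corrected_p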

Suprathreshold cluster tests

Suprathreshold cluster tests threshold the statistic image at a predetermined primary threshold, and assess the resulting pattern of suprathreshold activity. Suprathreshold cluster size tests assess the size of connected suprathreshold regions for significance, declaring regions greater than a critical size as activated. Thus, the distribution of the maximal suprathreshold cluster size (for the given primary threshold) is required. Simulation approaches have been presented by Poline and Mazoyer (1993) and Roland et al. (1993) for PET, and Forman et al. (1995) for fMRI. Friston et al. (1994) give a theoretical parametric derivation for Gaussian statistic images based on the theory of continuous Gaussian random fields; Cao (1999) gives results for χ2, t, and F fields.

Again, as noted by Holmes et al. (1996), a nonparametric permutation approach is simple to derive. Simply construct the permutation distribution of the maximal suprathreshold cluster size: for the statistic image corresponding to each possible relabeling, note the size of the largest suprathreshold cluster above the primary threshold. The critical suprathreshold cluster size for this primary threshold is the (⌊αN⌋ + 1) largest member of this permutation distribution. Corrected P-values for each suprathreshold cluster in the observed statistic image are obtained by comparing their size to the permutation distribution.
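A sketch of the cluster-size computation is given below (Python, assuming SciPy is available; not the accompanying software); connected suprathreshold regions are found with scipy.ndimage.label.

    import numpy as np
    from scipy import ndimage

    # Illustrative sketch: maximal suprathreshold cluster size for one statistic
    # image, given a chosen primary threshold.
    def max_cluster_size(stat_img, primary_threshold):
        labels, n_clusters = ndimage.label(stat_img > primary_threshold)
        if n_clusters == 0:
            return 0
        return int(np.bincount(labels.ravel())[1:].max())   # drop background label 0

    # Collect max_cluster_size() over all relabelings to build the permutation
    # distribution of max STCS, then compare each observed cluster's size to it.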

In general, such suprathreshold cluster tests are more powerful for functional neuroimaging data than the single threshold approach (see Friston et al., 1995b for a fuller discussion). It must be remembered, however, that this additional power comes at the price of reduced localizing power: the null hypotheses for voxels within a significant cluster are not tested, so individual voxels cannot be declared significant. Only the omnibus null hypothesis for the cluster can be rejected. Further, the choice of primary threshold dictates the power of the test in detecting different types of deviation from the omnibus null hypothesis. With a low threshold, large suprathreshold clusters are to be expected, so intense focal "signals" will be missed. At higher thresholds these focal activations will be detected, but lower intensity diffuse "signals" may go undetected below the primary threshold.

Poline et al. (1997) addressed these issues within a parametric framework by considering the suprathreshold cluster size and height jointly. A nonparametric variation could be to consider the exceedance mass, the excess mass of the suprathreshold cluster, defined as the integral of the statistic image above the primary threshold within the suprathreshold cluster (Holmes, 1994; Bullmore et al., 1999). Calculation of the permutation distribution and P-values proceeds exactly as before.

Considerations

Before turning to example applications of the nonparametric permutation tests described above, we note some relevant theoretical issues. The statistical literature (referenced below) should be consulted for additional theoretical discussion. For issues related to the current application to functional neuroimaging, see also Holmes (1994), Holmes et al. (1996), and Arndt et al. (1996).

Nonparametric statistics

First, it should be noted that these methods are neither new nor contentious. Originally expounded by Fisher (1935), Pitman (1937a–c), and later Edgington (1964, 1969a,b), these approaches are enjoying a renaissance as computing technology makes the requisite computations feasible for practical applications. Had R.A. Fisher and his peers had access to similar resources, it is possible that large areas of parametric statistics would have gone undeveloped! Modern texts on the subject include Good's Permutation Tests (Good, 1994), Edgington's Randomization Tests (Edgington, 1995), and Manly's Randomization, Bootstrap and Monte-Carlo Methods in Biology (Manly, 1997). Recent interest in more general resampling methods, such as the bootstrap, has further contributed to the field. For a treatise on resampling based multiple comparisons procedures, see Westfall and Young (1993).

Many standard statistical tests are essentially permutation tests. The "classic" nonparametric tests, such as the Wilcoxon and Mann-Whitney tests, are permutation tests with the data replaced by appropriate ranks, such that the critical values are only a function of sample size and can therefore be tabulated. Fisher's exact test (Fisher and Bennett, 1990), and tests of Spearman and Kendall correlations (Kendall and Gibbons, 1990), are all permutation/randomization based.

Assumptions

For a valid permutation test the only assumptions required are those needed to justify permuting the labels. Clearly the experimental design, model, statistic and permutations must also be appropriate for the question of interest. For a randomization test the probabilistic justification follows directly from the initial randomization of condition labels to scans. In the absence of an initial randomization, permutation of the labels can be justified via weak distributional assumptions. Thus, only minimal assumptions are required for a valid test.

In contrast to parametric approaches, where the statistic must have a known null distributional form, the permutation approach is free to consider any statistic summarizing evidence for the effect of interest at each voxel. The consideration of the maximal statistic over the volume of interest then deals with the multiple comparisons problem.

There are, however, additional considerations when using the nonparametric approach with a maximal statistic to account for multiple comparisons. For the single threshold test to be equally sensitive at all voxels, the (null) sampling distribution of the chosen statistic should be similar across voxels. For instance, the simple mean difference statistic used in the single voxel example could be considered as a voxel statistic, but areas where the mean difference is highly variable will dominate the permutation distribution for the maximal statistic. The test will still be valid, but will be less sensitive at those voxels with lower variability. So, although for an individual voxel a permutation test on group mean differences is equivalent to one using a two-sample t-statistic (Edgington, 1995), this is not true in the multiple comparisons setting using a maximal statistic.

One approach to this problem is to consider multi-step tests, which iteratively identify activated areas, cut them out, and continue assessing the remaining volume. These are described below, but are additionally computationally intensive. Preferable is to use a voxel statistic with an approximately homogeneous null permutation distribution across the volume of interest, such as an appropriate t-statistic. A t-statistic is essentially a mean difference normalized by a variance estimate, effectively measuring the reliability of an effect. Thus, we consider the same voxel statistics for a nonparametric approach as we would for a comparable parametric approach.

Pseudo t-statistics

Nonetheless, we can still do a little better than a straight t-statistic, particularly at low degrees of freedom. In essence, a t-statistic is a change divided by the square root of the estimated variance of that change. When there are few degrees of freedom available for variance estimation, this variance is estimated poorly.


Errors in estimation of the variance from voxel to voxel appear as high (spatial) frequency noise in images of the estimated variance, or as near-zero variance estimates, which in either case cause noisy t-statistic images. Given that PET and fMRI measure (indicators of) blood flow, physiological considerations would suggest that the variance be roughly constant over small localities. This suggests pooling the variance estimate at a voxel with those of its neighbors to give a locally pooled variance estimate as a better estimate of the actual variance. Because the model is of the same form at all voxels, the voxel variance estimates have the same degrees of freedom, and the locally pooled variance estimate is simply the average of the variance estimates in the neighborhood of the voxel in question. More generally, weighted locally pooled voxel variance estimates can be obtained by smoothing the raw variance image. The filter kernel then specifies the weights and neighborhood for the local pooling. The pseudo t-statistic images formed with smoothed variance estimators are smooth. In essence the noise (from the variance image) has been smoothed, but not the signal. A derivation of the parametric distribution of the pseudo t requires knowledge of the variance-covariance of the voxel-level variances, and has so far proved elusive. This precludes parametric analyses using a pseudo t-statistic, but poses no problems for a nonparametric approach.
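As an illustration of this construction, the following sketch (Python with SciPy; not the accompanying software) forms a pseudo t-statistic by smoothing the variance image with a Gaussian kernel. The FWHM and voxel size shown are illustrative assumptions, and edge effects are ignored.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    # Illustrative sketch of a pseudo t-statistic: the effect estimate divided by
    # the square root of a smoothed (locally pooled) variance image.
    def pseudo_t(effect, variance, fwhm_mm=8.0, voxel_mm=2.0):
        # Convert FWHM in mm to a Gaussian standard deviation in voxels.
        sigma_vox = (fwhm_mm / voxel_mm) / np.sqrt(8.0 * np.log(2.0))
        smoothed_variance = gaussian_filter(variance, sigma=sigma_vox)
        return effect / np.sqrt(smoothed_variance)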

Number of relabelings and test size

A constraint on the permutation test is the number of possible relabelings. Because the observed labeling is always one of the N possible relabelings, the smallest P-value attainable is 1/N. Thus, for a level α = 0.05 test to potentially reject the null hypothesis, there must be at least 20 possible relabelings.

More generally, the permutation distribution is discrete, consisting of a finite set of possibilities corresponding to the N possible relabelings. Hence, any P-values produced will be multiples of 1/N. Further, the 100(1 − α)th percentile of the permutation distribution, the critical threshold for a level α test, may lie between two values. Equivalently, α may not be a multiple of 1/N, such that a P-value of exactly α cannot be attained. In these cases, an exact test with size exactly α is not possible. It is for this reason that the critical threshold is computed as the (c + 1) largest member of the permutation distribution, where c = ⌊αN⌋, αN rounded down. The test can be described as almost exact, because the size is at most 1/N less than α.

Approximate tests

A large number of possible relabelings is also problematic, due to the computations involved. In situations where it is not feasible to compute the statistic images for all the relabelings, a subsample of relabelings can be used (Dwass, 1957; Edgington, 1969a). The set of N possible relabelings is reduced to a more manageable N′ consisting of the true labeling and N′ − 1 randomly chosen from the set of N − 1 possible relabelings. The test then proceeds as before.

Such a test is sometimes known as an approximate permutation test, because the permutation distribution is approximated by a subsample, leading to approximate P-values and critical thresholds (these tests are also known as Monte-Carlo permutation tests or random permutation tests, reflecting the random selection of permutations to consider).
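A minimal sketch of such an approximate test is given below (Python; not the accompanying software), assuming a single exchangeability block and a user-supplied statistic function; both names are illustrative.

    import numpy as np

    # Illustrative sketch of an approximate (Monte-Carlo) permutation test:
    # keep the true labeling and add N' - 1 randomly drawn relabelings.
    rng = np.random.default_rng(seed=0)

    def approximate_p_value(labels, statistic, n_extra=999):
        T = statistic(labels)
        stats = [T]                              # the true labeling is always included
        for _ in range(n_extra):
            stats.append(statistic(rng.permutation(labels)))
        return np.mean(np.asarray(stats) >= T)   # smallest attainable P-value: 1/(n_extra + 1)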

Despite the name, the resulting test remains exact. As might be expected from the previous section, however, using an approximate permutation distribution results in a test that is more conservative and less powerful than one using the full permutation distribution.

Fortunately, as few as 1,000 permutations can yield an effective approximate permutation test (Edgington, 1969a). For an approximate test with minimal loss of power in comparison to the full test (i.e., with high efficiency), however, one should consider rather more permutations (Joel, 1986).

Power

Frequently, nonparametric approaches are less powerful than equivalent parametric approaches when the assumptions of the latter are true. The assumptions provide the parametric approach with additional information that the nonparametric approach must "discover." The more labelings, the better the power of the nonparametric approach relative to the parametric approach. In a sense the method has more information from more labelings, and "discovers" the null distribution assumed in the parametric approach. If the assumptions required for a parametric analysis are not credible, however, a nonparametric approach provides the only valid method of analysis.

In the current context of assessing statistic images from functional neuroimaging experiments, the prevalent Statistical Parametric Mapping techniques require a number of assumptions and involve some approximations. Experience suggests that the permutation methods described here do at least as well as the parametric methods on real (PET) data (Arndt et al., 1996). For noisy statistic images, such as t-statistic images with low degrees of freedom, the ability to consider pseudo t-statistics constructed with locally pooled (smoothed) variance estimates affords the permutation approach additional power (Holmes, 1994; Holmes et al., 1996).

Multi-step tests

The potential for confounds to affect the permutation distribution via the consideration of unsuitable relabelings has already been considered. Recall the above comments regarding the potential for the multiple comparisons permutation tests to be differentially sensitive across the volume of interest if the null permutation distribution varies dramatically from voxel to voxel. In addition, there is also the prospect that departures from the null hypothesis influence the permutation distribution. Thus far, our nonparametric multiple comparisons permutation testing technique has consisted of a single step. The null sampling distribution (given the data) is the permutation distribution of the maximal statistic computed over all voxels in the volume of interest, potentially including voxels where the null hypothesis is not true. A large departure from the null hypothesis will give a large statistic, not only in the actual labeling of the experiment, but also in other labelings, particularly those close to the true labeling. This does not affect the overall validity of the test, but may make it more conservative for voxels other than that with the maximum observed statistic.

One possibility is to consider step-down tests, where significant regions are iteratively identified, cut out, and the remaining volume reassessed. The resulting procedure still maintains strong control over family-wise Type I error, our criterion for a test with localizing power, but will be more powerful (at voxels other than that with the maximal statistic). The iterative nature of the procedure, however, multiplies the computational burden of an already intensive procedure. Holmes et al. (1996) give a discussion and efficient algorithms, developed further in Holmes (1994), but find that the additional power gained was negligible for the cases studied.

Recall also the motivations for using a normalized voxel statistic, such as the t-statistic. An inappropriately normalized voxel statistic will yield a test differentially sensitive across the image. In these situations the step-down procedures may be more beneficial.

Further investigation of step-down methods and sequential tests more generally is certainly warranted, but is unfortunately beyond the scope of this primer.

WORKED EXAMPLES

The following sections illustrate the application of the techniques described above to three common experimental designs: single subject PET "parametric," multi-subject PET activation, and multi-subject fMRI activation. In each example we will illustrate the key steps in performing a permutation analysis:

1. Null Hypothesis: Specify the null hypothesis.
2. Exchangeability: Specify exchangeability of observations under the null hypothesis.
3. Statistic: Specify the statistic of interest, usually broken down into specifying a voxel-level statistic and a summary statistic.
4. Relabeling: Determine all possible relabelings given the exchangeability scheme under the null hypothesis.
5. Permutation Distribution: Calculate the value of the statistic for each relabeling, building the permutation distribution.
6. Significance: Use the permutation distribution to determine the significance of the correct labeling and the threshold for the statistic image.

The first three items follow from the experimental design and must be specified by the user; the last three are computed by the software, though we will still address them here. When comparable parametric analyses are available (within SPM) we will compare the permutation results to the parametric results.

Single Subject PET: Parametric Design

The first study will illustrate how covariate analyses are implemented and how the suprathreshold cluster size statistic is used. This example also shows how randomization in the experimental design dictates the exchangeability of the observations.

Study description

The data come from a study of Silbersweig et al. (1994). The aim of the study was to validate a novel PET methodology for imaging transient, randomly occurring events, specifically events that were shorter than the duration of a scan. This work was the foundation for later work imaging hallucinations in schizophrenics (Silbersweig et al., 1995). We consider one subject from the study, who was scanned 12 times. During each scan the subject was presented with brief auditory stimuli. The proportion of each scan over which stimuli were delivered was chosen randomly, within three randomization blocks of size four. A score was computed for each scan, indicating the proportion of activity infused into the brain during stimulation. This scan activity score is our covariate of interest, which we shall refer to as DURATION. This is a type of parametric design, though in this context parametric refers not to a set of distributional assumptions, but rather to an experimental design where an experimental parameter is varied continuously. This is in contradistinction to a factorial design where the experimental probe is varied over a small number of discrete levels.

We also have to consider the global cerebral blood flow (gCBF), which we account for here by including it as a nuisance covariate in our model. This gives a multiple regression, with the slope of the DURATION effect being of interest. Note that regressing out gCBF like this requires an assumption that there is no interaction between the score and global activity; examination of a scatter plot and a correlation coefficient of 0.09 confirmed this as a tenable assumption.

Null hypothesis

Because this is a randomized experiment, the test will be a randomization test, the null hypothesis pertains directly to the data, and no weak distributional assumptions are required:

H0: "The data would be the same whatever the DURATION."

Exchangeability

Because this experiment was randomized, our choice of EB matches the randomization blocks of the experimental design, which was chosen with temporal effects in mind. The values of DURATION were grouped into 3 blocks of four, such that each block had the same mean and similar variability, and then randomized within block. Thus we have three EBs of size four.

Statistic

We decompose our statistic of interest into two statistics: one voxel-level statistic that generates a statistic image, and a maximal statistic that summarizes that statistic image in a single number. An important consideration will be the degrees of freedom. The degrees of freedom is the number of observations minus the number of parameters estimated. We have one parameter for the grand mean, one parameter for the slope with DURATION, and one parameter for the confounding covariate gCBF. Hence 12 observations less three parameters leaves just 9 degrees of freedom to estimate the error variance at each voxel.

Voxel-level statistic

For a voxel-level statistic we always use some type of t-statistic. Although the nonparametric nature of the permutation tests allows the use of any statistic at a single voxel (e.g., the slope of rCBF with DURATION), we use the t because it is a standardized measure. It reflects the reliability of a change.

Analyses with fewer than about 20 degrees of freedom tend to have poor variance estimates, variance estimates that are themselves highly variable. In images of variance estimates this variability shows up as "sharpness," or high frequency noise. This study has just 9 degrees of freedom and shows the characteristic noisy variance image (Fig. 2). The problem is that this high frequency noise propagates into the t-statistic image, when one would expect an image of evidence against H0 to be smooth (as is the case for studies with greater degrees of freedom) because the raw images are smooth.

We can address this situation by smoothing the variance images (see the section on pseudo t-statistics, above), replacing the variance estimate at each voxel with a weighted average of its neighbors. Here we use weights from an 8 mm FWHM spherical Gaussian smoothing kernel. The statistic image consisting of the ratio of the slope and the square root of the smoothed variance estimate is smoother than that computed with the raw variance. At the voxel level the resulting statistic does not have a Student's t-distribution under the null hypothesis, so we refer to it as a pseudo t-statistic.

Figure 3 shows the effect of variance smoothing. The smoothed variance image creates a smoother statistic image, the pseudo t-statistic image. The key here is that the parametric t-statistic introduces high spatial frequency noise via the poorly estimated standard deviation. By smoothing the variance image we are making the statistic image more like the "signal."

Summary statistic

We have a statistic image, but we need a single value that can summarize evidence against H0 for each labeling. For the reasons given in the methods section, we use a maximum statistic, and in this example consider the maximum suprathreshold cluster size (max STCS).

Clusters are defined by connected suprathreshold voxels. Under H0, the statistic image should be random with no features or structure, hence large clusters are unusual and indicate the presence of an activation. A primary threshold is used to define the clusters. The selection of the primary threshold is crucial. If set too high there will be no clusters of any size; if set too low the clusters will be too large to be useful.

Relabeling enumeration

Each of the three previous sections corresponds to a choice that a user of the permutation test has to make. Those choices and the data are sufficient for an algorithm to complete the permutation test. This and the next two sections describe the ensuing computational steps.

To create the labeling used in the experiment, the labels were divided into three blocks of four, and randomly ordered within blocks. Taking the division of the labels into the three blocks as given (it is not random), we need to count how many ways the labels can be randomly permuted within blocks. There are 4! = 4 × 3 × 2 × 1 = 24 ways to permute four labels, and because each block is independently randomized, there are a total of 4!³ = 13,824 permutations of the labels (see Appendix D formulae).
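For illustration, the complete set of within-block relabelings could be enumerated as follows (a sketch, not the SnPM code; scans are indexed 0–11 and each relabeling is an ordering of those indices to which the DURATION values would be reassigned):

    import itertools

    # The three randomization blocks of four scans each.
    blocks = [range(0, 4), range(4, 8), range(8, 12)]

    # All orderings within each block (4! = 24 per block), combined
    # independently across blocks: 24**3 = 13,824 relabelings in total.
    per_block = [list(itertools.permutations(b)) for b in blocks]
    all_relabelings = [sum((list(order) for order in combo), [])
                       for combo in itertools.product(*per_block)]
    assert len(all_relabelings) == 24 ** 3 == 13824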

Computations for 13,824 permutations would take a long time, so we consider an approximate test. The significance is calculated by comparing our observed statistic to the permutation distribution. With enough relabelings, a good approximation to the permutation distribution can be made; here we use 1,000 relabelings. So, instead of 13,824 relabelings, we randomly select 999 relabelings to compute the statistic, giving 1,000 labelings including the actual labeling used in the experiment. The P-values will be approximate, but the test remains exact.

Figure 2. Mesh plots of parametric analysis, z = 0 mm. Upper left: slope estimate. Lower left: standard deviation of slope estimate. Right: t image for DURATION. Note how the standard deviation image is much less smooth than the slope image, and how the t image is correspondingly less smooth than the slope image.

Permutation distribution

For each of the 1,000 relabelings, the statistic image is computed and thresholded, and the maximal suprathreshold cluster size is recorded. For each relabeling this involves fitting the model at each voxel, smoothing the variance image, and creating the pseudo t-statistic image. This is the most computationally intensive part of the analysis, but is not onerous on modern computing hardware. See discussion of examples for run times.
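The loop might be sketched as below, reusing the pseudo_t and max_suprathreshold_cluster_size helpers and the all_relabelings list from the earlier sketches. The arrays scans (12 scans by voxels), duration, and gcbf, and the simple least-squares fit shown here, are assumptions for illustration only; the actual analysis is carried out by the SnPM toolbox.

    import numpy as np

    def fit_duration_slope(scans, duration, gcbf):
        """Fit mean + DURATION + gCBF at every voxel by least squares;
        return the DURATION slope image and its estimated variance image."""
        n = scans.shape[0]                                   # 12 scans
        X = np.column_stack([np.ones(n), duration, gcbf])
        Y = scans.reshape(n, -1)                             # flatten voxels
        beta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
        resid = Y - X @ beta
        sigma2 = (resid ** 2).sum(axis=0) / (n - X.shape[1])  # 9 df
        slope_var = sigma2 * np.linalg.inv(X.T @ X)[1, 1]
        shape = scans.shape[1:]
        return beta[1].reshape(shape), slope_var.reshape(shape)

    rng = np.random.default_rng(0)
    chosen = rng.choice(len(all_relabelings), size=999, replace=False)
    labelings = [np.arange(12)] + [np.array(all_relabelings[i]) for i in chosen]

    max_stcs = []
    for perm in labelings:                                   # actual labeling first
        slope, slope_var = fit_duration_slope(scans, duration[perm], gcbf)
        stat = pseudo_t(slope, slope_var)
        max_stcs.append(max_suprathreshold_cluster_size(stat, 3.0))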

Selection of the primary threshold is not easy. For the results to be valid we need to pick the threshold before the analysis is performed. With a parametric voxel-level statistic we could use its null distribution to specify a threshold by uncorrected P-value (e.g., by using a t table). Here we cannot take this approach because we are using a nonparametric voxel-level statistic whose null distribution is not known a priori. Picking several thresholds is not valid, as this introduces a new multiple comparisons problem. We suggest gaining experience with similar datasets from post hoc analyses: apply different thresholds to get a feel for an appropriate range and then apply such a threshold to the data on hand. Using data from other subjects in this study we found 3.0 to be a reasonable primary threshold.

Significance threshold

We use the distribution of max STCS to assess the overall significance of the experiment and the significance of individual clusters: the significance is the proportion of labelings that had max STCS greater than or equal to the maximum of the correct labeling. Put another way, if max STCS of the correct labeling is at or above the 95th percentile of the max STCS permutation distribution, the experiment is significant at α = 0.05. Also, any cluster in the observed image with size greater than the 95th percentile is significant at α = 0.05. Because we have 1,000 labelings, 1,000 × 0.95 = 950, so the 950th largest max STCS will be our significance threshold.

Figure 3. Mesh plots of permutation analysis, z = 0 mm. Upper left: slope estimate. Lower left: square root of smoothed variance of slope estimate. Right: pseudo t image for DURATION. Note that the smoothness of the pseudo t image is similar to that of the slope image (cf. Figure 2).
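Continuing the sketch above, the corrected P-value and the 0.05 cluster-size threshold can be read directly from the hypothetical max_stcs list (max_stcs[0] corresponds to the actual labeling):

    import numpy as np

    max_stcs = np.asarray(max_stcs)
    observed = max_stcs[0]
    # Proportion of labelings with max STCS at least as large as observed.
    p_corrected = (max_stcs >= observed).mean()
    # 95th percentile: with 1,000 labelings, the 950th value in ascending order.
    critical_size = np.sort(max_stcs)[int(0.95 * len(max_stcs)) - 1]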

Results

The permutation distribution of max STCS under ℋ0 is shown in Figure 4a. Most labelings have max STCS less than 250 voxels. The vertical dotted line indicates the 95th percentile. The top 5% are spread from about 500 to 3,000 voxels.

For the correctly labeled data the max STCS is 3,101 voxels. This is unusually large in comparison to the permutation distribution. Only five labelings yield a max equal to or larger than 3,101, so the P-value for the experiment is 5/1,000 = 0.005. The 95th percentile is 462, so any suprathreshold clusters with size greater than 462 voxels can be declared significant at level 0.05, accounting for the multiple comparisons implicit in searching over the brain.

Figure 4b is a maximum intensity projection (MIP) of the significant suprathreshold clusters. Only these two clusters are significant, that is, there are no other suprathreshold clusters larger than 462 voxels. These two clusters cover the bilateral auditory (primary and associative) and language cortices. They are 3,101 and 1,716 voxels in size, with P-values of 0.005 and 0.015, respectively. Because the test concerns suprathreshold clusters it has no localizing power: significantly large suprathreshold clusters contain voxels with a significant experimental effect, but the test does not identify them.

Discussion

The nonparametric analysis presented here uses maximum STCS for a pseudo t-statistic image. Because the distribution of the pseudo t-statistic is not known, the corresponding primary threshold for a parametric analysis using a standard t-statistic cannot be computed. This precludes a straightforward comparison of this nonparametric analysis with a corresponding parametric analysis such as that of Friston et al. (1994).

Although the necessity to choose the primary threshold for suprathreshold cluster identification is a problem, the same is true for parametric approaches. The only additional difficulty occurs with pseudo t-statistic images, when specification of primary thresholds in terms of upper tail probabilities from a Student’s t-distribution is impossible. Further, parametric suprathreshold cluster size methods (Friston et al., 1994; Poline et al., 1997) utilize asymptotic distributional results, and therefore require high primary thresholds. The nonparametric technique is free of this constraint, giving exact P-values for any primary threshold (although very low thresholds are undesirable due to the large suprathreshold clusters expected and consequent poor localization of an effect).

Figure 4. A: Distribution of maximum suprathreshold cluster size, threshold of 3. Dotted line shows 95th percentile. The count axis is truncated at 100 to show the low-count tail; the first two bars have counts 579 and 221. B: Maximum intensity projection image of significantly large clusters.

Although only suprathreshold cluster size has been considered, any statistic summarizing a suprathreshold cluster could be considered. In particular an exceedance mass statistic could be employed.

Multi-Subject PET: Activation

For the second example consider a multi-subject, two condition activation experiment. We will use a standard t-statistic with a single threshold test, enabling a direct comparison with the standard parametric random field approach.

Study description

Watson et al. (1993) localized the region of visual cortex sensitive to motion, area MT/V5, using high resolution 3D PET imaging of 12 subjects. These data were analyzed by Holmes et al. (1996), using proportional scaling global flow normalization and a repeated measures pseudo t-statistic. We consider the same data here, but use a standard repeated measures t-statistic, allowing direct comparison of parametric and nonparametric approaches.

The visual stimulus consisted of randomly placed rectangles. During the baseline condition the pattern was stationary, whereas during the active condition the rectangles smoothly moved in independent directions. Before the experiment, the 12 subjects were randomly allocated to one of two scan condition presentation orders in a balanced randomization. Thus six subjects had scan conditions ABABABABABAB, the remaining six having BABABABABABA, which we’ll refer to as AB and BA orders, respectively.

Null hypothesis

In this example the labels of the scans as A and B are allocated by the initial randomization, so we have a randomization test, and the null hypothesis concerns the data directly:

ℋ0: For each subject, the experiment would have yielded the same data were the conditions reversed.

Note that it is not that the data itself is exchangeable, as the data is fixed. Rather, the labels are the observed random process and, under the null hypothesis, the distribution of any statistic is unaltered by permutations of the labels.

Exchangeability

Given the null hypothesis, exchangeability follows directly from the initial randomization scheme. The experiment was randomized at the subject level, with six AB and six BA labels randomly assigned to the 12 subjects. Correspondingly, the labels are exchangeable subject to the constraint that they could have arisen from the initial randomization scheme. Thus we consider all permutations of the labels that result in six subjects having scans labeled AB, and the remaining six BA. The initial randomization could have resulted in any six subjects having the AB condition presentation order (the remainder being BA), and under the null hypothesis the data would have been the same, hence exchangeability.

Statistic

Note that the permutations arrived at above permute across subjects, such that subject-to-subject differences in activation (expressed through the as yet unspecified statistic) will be represented in the permutation distribution. Because subject-to-subject differences in activation will be present in the permutation distribution, we must consider a voxel statistic that accounts for such inter-subject variability, as well as the usual intra-subject (residual) error variance. Thus we must use a random effects model incorporating a random subject by condition interaction term (many published analyses of multi-subject and group comparison experiments have not accounted for variability in activation from subject to subject, and used fixed effects analyses).

Voxel-level statistic

Fortunately, a random effects analysis can be easily effected by collapsing the data within subject and computing the statistic across subjects (Worsley et al., 1991; Holmes and Friston, 1999). In this case the result is a repeated measures t-statistic after proportional scaling global flow normalization: each scan is proportionally scaled to a common global mean of 50; each subject’s data is collapsed into two average images, one for each condition; and a paired t-statistic is computed across the subjects’ “rest”–“active” pairs of average images. By computing this paired t-statistic on the collapsed data, both the inter-subject and intra-subject (error) components of variance are accounted for appropriately. Because there are 12 subjects there are 12 pairs of average condition images, and the t-statistic has 11 degrees of freedom. With just 11 degrees of freedom we anticipate the same problems with noisy variance images as in the previous example, but to make direct comparisons with a parametric approach, we will not consider variance smoothing and pseudo t-statistics for this example.
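A sketch of this collapsed, random effects computation in NumPy (the array names and layout, per-subject condition-mean images stacked as (12, x, y, z), are assumptions for illustration):

    import numpy as np

    def paired_t_image(active_means, baseline_means):
        """Repeated measures (paired) t image computed across subjects from
        per-subject condition-mean images of shape (n_subjects, x, y, z)."""
        diff = active_means - baseline_means          # one difference image per subject
        n = diff.shape[0]                             # 12 subjects
        mean_diff = diff.mean(axis=0)
        sd_diff = diff.std(axis=0, ddof=1)            # inter-subject standard deviation
        return mean_diff / (sd_diff / np.sqrt(n))     # t with n - 1 = 11 df

Because the statistic is formed from one pair of averages per subject, the subject-to-subject variability appears in the denominator, which is what makes this a random effects analysis.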

Summary statistic

To consider a single threshold test over the entire brain, the appropriate summary statistic is the maximum t-statistic.

Relabeling enumeration

This example is different from the previous one in that we permute across subjects instead of across replications of conditions. Here our EB is not in units of scans, but subjects. The EB size here is 12 subjects, because the six AB and six BA labels can be permuted freely amongst the 12 subjects. There are (12 choose 6) = 12!/(6!(12 − 6)!) = 924 ways of choosing six of the 12 subjects to have the AB labeling. This is a sufficiently small number of permutations to consider a complete enumeration.
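The complete enumeration is easy to generate; a minimal sketch (subjects indexed 0–11, each labeling recording which six subjects are treated as having the AB order):

    import itertools

    # Each relabeling: the set of six subjects assigned the AB order;
    # the remaining six subjects take the BA order.
    ab_assignments = list(itertools.combinations(range(12), 6))
    assert len(ab_assignments) == 924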

Note that although it might be tempting to consider permuting labels within subjects, particularly in the permutation setting when there is no initial randomization dictating the exchangeability, the bulk of the permutation distribution is specified by these between-subject permutations. Any within-subject permutations just flesh out this framework, yielding little practical improvement in the test at considerable computational cost.

Permutation distribution

For each of the 924 labelings we calculate the maximum repeated measures t-statistic, resulting in the permutation distribution shown in Figure 5a. Note that for each possible labeling and t-statistic image, the opposite labeling is also possible, and gives the negative of the t-statistic image. Thus, it is only necessary to compute t-statistic images for half of the labelings, and retain their maxima and minima. The permutation distribution is then that of the maxima for half the relabelings concatenated with the negative of the corresponding minima.
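This shortcut might be coded as follows (a sketch continuing the enumeration above; paired_t_for_labeling is a hypothetical helper that collapses each subject's scans according to one AB/BA labeling and then applies the paired_t_image function sketched earlier):

    # Exactly half the labelings (462) place subject 0 in the AB group;
    # the other half are their mirror images and give negated t images.
    half = [lab for lab in ab_assignments if 0 in lab]
    maxima, minima = [], []
    for lab in half:
        t_img = paired_t_for_labeling(lab)    # hypothetical helper (see lead-in)
        maxima.append(t_img.max())
        minima.append(t_img.min())
    # Permutation distribution of the maximum statistic over all 924 labelings.
    perm_dist = maxima + [-m for m in minima]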

Significance threshold

As before, the 95th percentile of the maximum t distribution provides both a threshold for omnibus experimental significance and a voxel-level significance threshold appropriate for the multiple comparisons problem. With 924 permutations, the 95th percentile is at 924 × 0.05 = 46.2, so the critical threshold is the 47th largest member of the permutation distribution. Any voxel with intensity greater than this threshold can be declared significant at the 0.05 level.

Figure 5. A: Permutation distribution of maximum repeated measures t-statistic. Dotted line indicates the 5% level corrected threshold. B: Maximum intensity projection of t-statistic image, thresholded at the critical threshold of 8.401 for the 5% level permutation test analysis.

Results

Figure 5a shows the permutation distribution of the maximum repeated measures t-statistic. Most maxima lie between about 4 and 9, though the distribution is skewed in the positive direction.

The outlier at 29.30 corresponds to the observed t-statistic, computed with correctly labeled data. Because no other labelings are higher, the P-value is 1/924 = 0.0011. The 47th largest member of the permutation distribution is 8.40, the critical threshold (marked with a dotted vertical line on the permutation distribution). The t-statistic image thresholded at this critical value is shown in Figure 5b. There is a primary region of 1,424 significant voxels covering the V1/V2 region, flanked by two secondary regions of 23 and 25 voxels corresponding to area V5, plus six other regions of 1 or 2 voxels.

For a t-statistic image of 43,724 voxels of size 2 × 2 × 4 mm, with an estimated smoothness of 7.8 × 8.7 × 8.7 mm, the parametric theory gives a 5% level critical threshold of 11.07, substantially higher than the corresponding 8.40 of the nonparametric result. The thresholded image is shown in Figure 6b. The image is very similar to the nonparametric image (Fig. 5b), with the primary region having 617 voxels, with two secondary regions of 7 and 2 voxels. Another parametric result is the well-known, but conservative, Bonferroni correction; here it specifies an α = 0.05 threshold of 8.92 that yields a primary region of 1,212 voxels and 5 secondary regions with a total of 48 voxels. In Figure 6a we compare these three approaches by plotting the significance level vs. the threshold. The critical threshold based on the expected Euler characteristic (Worsley et al., 1995) for a t-statistic image is shown as a dot-dash line and the critical values for the permutation test are shown as a solid line. For a given test level (a horizontal line), the test with the smaller threshold has the greater power. At all thresholds in this plot the nonparametric threshold is below the random field threshold, though it closely tracks the Bonferroni threshold below the 0.05 level. Thus the random field theory appears to be quite conservative here.

Discussion

This example again demonstrates the role of the permutation test as a reference for evaluating other procedures, here the parametric analysis of Friston et al. (1995b). The t field results are conservative for low degrees of freedom and low smoothness (Holmes, 1994; Stoeckl et al., 2001); the striking difference between the nonparametric and random field thresholds makes this clear.

Figure 6. A: Test significance (α) levels plotted against critical thresholds, for nonparametric and parametric analyses. B: Maximum intensity projection of t image, thresholded at the parametric 5% level critical threshold of 11.07.

Figure 6a provides a very informative comparison between the two methods. For all typical test sizes (α ≤ 0.05), the nonparametric method specifies a lower threshold than the parametric method. For these data, this is exposing the conservativeness of the t field results. For lower thresholds the difference between the methods is even greater, though this is anticipated because the parametric results are based on high threshold approximations.

Multi-Subject fMRI: Activation

For this third and final example, consider a multi-subject fMRI activation experiment. We will perform a permutation test, in contrast to a randomization test, so that we can make inference on a population. We will use a smoothed variance t-statistic with a single threshold test and will make qualitative and quantitative comparisons with the corresponding parametric results.

Before discussing the details of this example, we note that fMRI data presents a special challenge for nonparametric methods. Because fMRI data exhibits temporal autocorrelation (Smith et al., 1999), an assumption of exchangeability of scans within subject is not tenable. To analyze a group of subjects for population inference, however, we need only assume exchangeability of subjects. Therefore, although intra-subject fMRI analyses are not straightforward with the permutation test, multi-subject analyses are.

Study description

Marshuetz et al. (2000) studied order effects in working memory using fMRI. The data were analyzed using a random effects procedure (Holmes and Friston, 1999), as in the last example. For fMRI, this procedure amounts to a generalization of the repeated measures t-statistic.

There were 12 subjects, each participating in eight fMRI acquisitions. There were two possible presentation orders for each block, and there was randomization across blocks and subjects. The TR was two seconds, with a total of 528 scans collected per condition. Of the study’s three conditions we only consider two, item recognition and control. For item recognition, the subject was presented with five letters and, after a two second interval, presented with a probe letter. They were to respond “yes” if the probe letter was among the five letters and “no” if it was not. In the control condition they were presented with five X’s and, two seconds later, presented with either a “y” or an “n”; they were to press “yes” for y and “no” for n.

Each subject’s data was analyzed, creating a difference image between the item recognition and control effects. These images were analyzed with a one-sample t-test, yielding a random effects analysis that accounts for intersubject differences.

Null hypothesis

This study used randomization within and across subject and hence permits the use of a randomization test. Although randomization tests require no distributional assumptions, they only make a statement about the data at hand. To generalize to a population we need to use a permutation test.

The permutation test considers the data to be a random realization from some distribution, which is the same approach used in a parametric test (except that a particular parametric distribution, usually a normal, is specified). This is in distinction to the randomization test used in the last two examples, where the data are fixed and we use the randomness of the experimental design to perform the test. Although the machinery of the permutation and randomization tests is the same, the assumptions and scope of inference differ.

Each subject has an image expressing the item recognition effect, the difference of the item and control condition estimates. We make the weak distributional assumption that the values of the subject difference images at any given voxel (across subjects) are drawn from a symmetric distribution (the distribution may be different at different voxels, provided it is symmetric). The null hypothesis is that these distributions are centered on zero:

ℋ0: The symmetric distributions of the (voxel values of the) subjects’ difference images have zero mean.

Exchangeability

The conventional assumption of independent subjects implies exchangeability, and hence a single EB consisting of all subjects.

We consider subject labels of “+1” and “−1,” indicating an unflipped or flipped sign of the data. Under the null hypothesis, we have data symmetric about zero, and hence for a particular subject the sign of the observed data can be flipped without altering its distribution. With exchangeable subjects, we can flip the signs of any or all subjects’ data and the joint distribution of all of the data will remain unchanged.


Statistic

In this example we use a single threshold test.

Voxel-level statistic

As noted above, this analysis amounts to a one-sample t-test on the first level images, testing for a zero-mean effect across subjects. Because we will have only 11 degrees of freedom we will use a pseudo t-test. We used a variance smoothing of 4 mm FWHM, comparable to the original within subject smoothing. In our experience, the use of any variance smoothing is more important than the particular magnitude (FWHM) of the smoothing.
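A sketch of this one-sample smoothed-variance statistic, in the same spirit as the earlier pseudo_t helper (the function name, the 2 mm voxel size, and the array layout, with the 12 difference images stacked along the first axis, are illustrative assumptions):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def pseudo_t_onesample(diff_images, voxel_size_mm=2.0, fwhm_mm=4.0):
        """One-sample pseudo t across subjects: the mean difference image
        divided by its standard error, with the variance image smoothed."""
        n = diff_images.shape[0]                      # 12 subjects
        mean_diff = diff_images.mean(axis=0)
        var = diff_images.var(axis=0, ddof=1)         # 11 df
        sigma_vox = fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0))) / voxel_size_mm
        smoothed_var = gaussian_filter(var, sigma=sigma_vox)
        return mean_diff / np.sqrt(smoothed_var / n)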

Summary statistic

Again we are interested in searching over the whole brain for significant changes, hence we use the maximum pseudo t.

Relabeling enumeration

Based on our exchangeability under the null hypothesis, we can flip the sign on some or all of our subjects’ data. There are 2¹² = 4,096 possible ways of assigning either “+1” or “−1” to each subject.

Permutation distribution

For each of the 4,096 relabelings, we computed a pseudo t-statistic image and noted the maximum over the image, yielding the distribution in Figure 7a. As in the last example, we have a symmetry in these labels; we need only compute 2,048 statistic images and save both the maxima and minima.
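A sketch of this loop over sign flips, reusing the pseudo_t_onesample helper above (diff_images is the assumed (12, x, y, z) array of subject difference images; for brevity the symmetry shortcut of computing only 2,048 images is noted in a comment rather than implemented):

    import itertools
    import numpy as np

    signs = list(itertools.product([1, -1], repeat=12))   # 2**12 = 4,096 labelings
    max_t = []
    for s in signs:                                        # could be halved by symmetry
        flip = np.asarray(s).reshape((-1,) + (1,) * (diff_images.ndim - 1))
        max_t.append(pseudo_t_onesample(flip * diff_images).max())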

Significance threshold

With 4,096 permutations the 95th percentile is 4,096 × 0.05 = 204.8, and hence the 205th largest maximum defines the 0.05 level corrected significance threshold.

Figure 7. A: Permutation distribution of maximum repeated measures t-statistic. Dotted line indicates the 5% level corrected threshold. B: Maximum intensity projection of pseudo t-statistic image thresholded at the 5% level, as determined by the permutation distribution. C: Maximum intensity projection of t-statistic image thresholded at the 5% level, as determined by the permutation distribution. D: Maximum intensity projection of t-statistic image thresholded at the 5% level, as determined by random field theory.

Results

The permutation distribution of the maximum pseudo t-statistic under ℋ0 is shown in Figure 7a. It is centered around 4.5 and is slightly skewed positive; all maxima are found between about 3 and 8.

The correctly labeled data yielded the largest maximum, 8.471. Hence the overall significance of the experiment is 1/4,096 = 0.0002. The dotted line indicates the 0.05 corrected threshold, 5.763. Figure 7b shows the thresholded MIP of significant voxels. There are 312 voxels in 8 distinct regions; in particular there is a pair of bilateral posterior parietal regions, a left thalamus region and an anterior cingulate region; these are typical of working memory studies (Marshuetz et al., 2000).

It is informative to compare this result to the traditional t-statistic, using both a nonparametric and parametric approach to obtain corrected thresholds. We reran this nonparametric analysis using no variance smoothing. The resulting thresholded data is shown in Figure 7c; there are only 58 voxels in 3 regions that exceeded the corrected threshold of 7.667. Using standard parametric random field methods produced the result in Figure 7d. For 110,776 voxels of size 2 × 2 × 2 mm, with an estimated smoothness of 5.1 × 5.8 × 6.9 mm, the parametric theory finds a threshold of 9.870; there are only 5 voxels in 3 regions above this threshold. Note that only the pseudo t-statistic detects the bilateral parietal regions. Table I summarizes the three analyses along with the Bonferroni result.

Discussion

In this example we have demonstrated the utility of the nonparametric method for intersubject fMRI analyses. Based solely on independence of the subjects and symmetric distribution of difference images under the null hypothesis, we can create a permutation test that yields inferences on a population.

Intersubject fMRI studies typically have few subjects, many fewer than 20 subjects. By using the smoothed variance t-statistic we have gained sensitivity relative to the standard t-statistic. Even with the standard t-statistic, the nonparametric test proved more powerful, detecting 5 times as many voxels as active. Although the smoothed variance t can increase sensitivity, it does not overcome any limitations of the face validity of an analysis based on only 12 subjects.

We note that this relative ranking of sensitivity (nonparametric pseudo t, nonparametric t, parametric t) is consistent with the other second level datasets we have analyzed. We believe this is due to a conservativeness of the random field method under low degrees of freedom, not just to low smoothness.

Discussion of Examples

These examples have demonstrated the nonparametric permutation test for PET and fMRI with a variety of experimental designs and analyses. We have addressed each of the steps in sufficient detail to follow the algorithmic steps that the software performs. We have shown how the ability to utilize smoothed variances via a pseudo t-statistic can offer an approach with increased power over a corresponding standard t-statistic image. Using standard t-statistics, we have seen how the permutation test can be used as a reference against which parametric random field results can be validated.

Note, however, that the comparison between parametric and nonparametric results must be made very carefully. Comparable models and statistics must be used, and multiple comparisons procedures with the same degree of control over image-wise Type I error used. Further, because the permutation distributions are derived from the data, critical thresholds are specific to the data set under consideration. Although the examples presented above are compelling, it should be remembered that these are only a few specific examples and further experience with many data sets is required before generalizations can be made. The points noted for these specific examples, however, are indicative of our experience with these methods thus far.

TABLE I. Comparison of four inference methods for the item recognition fMRI data*

Statistic    Inference method    Minimum corrected P-value    Number of significant voxels    Corrected threshold
t            Random field        0.0062646                    5                               9.870
t            Bonferroni          0.0025082                    5                               9.802
t            Permutation         0.0002441                    58                              7.667
Pseudo-t     Permutation         0.0002441                    312                             5.763

* The minimum corrected P-value and number of significant voxels give an overall measure of sensitivity; corrected thresholds can only be compared within statistic type. For this data, the Bonferroni and random field results are very similar, and the nonparametric methods are more powerful. The nonparametric t method detects 10 times as many voxels as the parametric method, and the nonparametric pseudo-t detects 60 times as many.

Finally, although we have noted that the nonparametric method has greater computational demands than parametric methods, they are reasonable on modern hardware. The PET examples took 35 min and 20 min, respectively, on a 176 MHz Sparc Ultra 1. The fMRI example took 2 hr on a 440 MHz Sparc Ultra 10. The fMRI data took longer due to more permutations (2,048 vs. 500) and larger images.

CONCLUSIONS

In this paper, the theory and practicalities of multiple comparisons nonparametric randomization and permutation tests for functional neuroimaging experiments have been presented, and illustrated with worked examples.

As has been demonstrated, the permutation approach offers various advantages. The methodology is intuitive and accessible. By consideration of suitable maximal summary statistics, the multiple comparisons problem can easily be accounted for; only minimal assumptions are required for valid inference, and the resulting tests are almost exact, with size at most 1/N less than the nominal test level α, where N is the number of relabelings.

The nonparametric permutation approaches described give results similar to those obtained from a comparable Statistical Parametric Mapping approach using a general linear model with multiple comparisons corrections derived from random field theory. In this respect these nonparametric techniques can be used to verify the validity of less computationally expensive parametric approaches (but not prove them invalid). When the assumptions required for a parametric approach are not met, the nonparametric approach described provides a viable alternative analysis method.

In addition, the approach is flexible. Choice of voxel and summary statistic are not limited to those whose null distributions can be derived from parametric assumptions. This is particularly advantageous at low degrees of freedom, when noisy variance images lead to noisy statistic images and multiple comparisons procedures based on the theory of continuous random fields are conservative. By assuming a smooth variance structure, and using a pseudo t-statistic computed with a smoothed variance image as the voxel statistic, the permutation approach gains considerable power.

Therefore we propose that the nonparametric permutation approach is preferable for experimental designs implying low degrees of freedom, including small sample size problems, such as single subject PET/SPECT, but also PET/SPECT and fMRI multi-subject and between group analyses involving small numbers of subjects, where analysis must be conducted at the subject level to account for inter-subject variability. It is our hope that this paper, and the accompanying software, will encourage appropriate application of these nonparametric techniques.

ACKNOWLEDGMENTS

We thank the authors of the three example data sets analyzed for permission to use their data; Karl Friston and Ian Ford for statistical advice; the two anonymous reviewers for their insightful comments; and Mark Mintun, David Townsend, and Terry Morris for practical support and encouragement in this collaborative work. Andrew Holmes was funded by the Wellcome Trust for part of this work. Thomas Nichols was supported by the Center for the Neural Basis of Cognition.

REFERENCES

Andreasen NC, O’Leary DS, Cizadlo T, Arndt S, Rezai K, Ponto LL, Watkins GL, Hichwa RD (1996): Schizophrenia and cognitive dysmetria: a positron-emission tomography study of dysfunctional prefrontal-thalamic-cerebellar circuitry. Proc Natl Acad Sci USA 93:9985–9990.
Arndt S, Cizadlo T, Andreasen NC, Heckel D, Gold S, O’Leary DS (1996): Tests for comparing images based on randomization and permutation methods. J Cereb Blood Flow Metab 16:1271–1279.
Bullmore E, Brammer M, Williams SCR, Rabe-Hesketh S, Janot N, David A, Mellers J, Howard R, Sham P (1996): Statistical methods of estimation and inference for functional MR image analysis. Magn Reson Med 35:261–277.
Bullmore E, Suckling J, Overmeyer S, Rabe-Hesketh S, Taylor E, Brammer MJ (1999): Global, voxel, and cluster tests, by theory and permutation, for difference between two groups of structural MR images of the brain. IEEE Trans Med Imaging 18:32–42.
Cao J (1999): The size of the connected components of the excursion sets of χ², t, and F fields. Adv Appl Probability 51:579–595.
Dwass M (1957): Modified randomization tests for nonparametric hypotheses. Ann Math Stat 28:181–187.
Edgington ES (1964): Randomization tests. J Psychol 57:445–449.
Edgington ES (1969a): Approximate randomization tests. J Psychol 72:143–149.
Edgington ES (1969b): Statistical inference: the distribution free approach. New York: McGraw-Hill.


Edgington ES (1995): Randomization tests, 3rd ed. New York: Marcel Dekker.
Fisher RA (1990): Statistical methods, experimental design, and scientific inference. In: Bennett JH. ??: Oxford University Press.
Fisher RA (1935): The design of experiments. Edinburgh: Oliver Boyd.
Forman SD, Cohen JD, Fitzgerald M, Eddy WF, Mintun MA, Noll DC (1995): Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): use of a cluster-size threshold. Magn Reson Med 33:636–647.
Frackowiak RSJ, Friston KJ, Frith CD, Dolan RJ, Mazziotta JC (1997): Human brain function. San Diego: Academic Press.
Friston KJ, Frith CD, Liddle PF, Frackowiak RSJ (1991): Comparing functional (PET) images: the assessment of significant change. J Cereb Blood Flow Metab 11:690–699.
Friston KJ, Worsley KJ, Frackowiak RSJ, Mazziotta JC, Evans AC (1994): Assessing the significance of focal activations using their spatial extent. Hum Brain Mapp 1:214–220.
Friston KJ, Holmes AP, Poline JB, Grasby PJ, Williams SCR, Frackowiak RSJ, Turner R (1995a): Analysis of fMRI time series revisited. Neuroimage 2:45–53.
Friston KJ, Holmes AP, Worsley KJ, Poline J-B, Frackowiak RSJ (1995b): Statistical parametric maps in functional imaging: a general linear approach. Hum Brain Mapp 2:189–210.
Friston KJ, Holmes AP, Poline J-B, Price CJ, Frith CD (1996): Detecting activations in PET and fMRI: levels of inference and power. Neuroimage 4:223–235.
Good P (1994): Permutation tests. A practical guide to resampling methods for testing hypotheses. ??: Springer-Verlag.
Grabowski TJ, Frank RJ, Brown CK, Damasio H, Boles Ponto LL, Watkins GL, Hichwa RD (1996): Reliability of PET activation across statistical methods, subject groups, and sample sizes. Hum Brain Mapp 4:23–46.
Halber M, Herholz K, Wienhard K, Pawlik G, Heiss W-D (1997): Performance of randomization test for single-subject 15-O-water PET activation studies. J Cereb Blood Flow Metab 17:1033–1039.
Hochberg Y, Tamhane AC (1987): Multiple comparison procedures. New York: Wiley.
Holmes AP, Watson JDG, Nichols TE (1998): Holmes and Watson on ‘Sherlock’. J Cereb Blood Flow Metab 18:S697.
Holmes AP (1994): Statistical issues in functional brain mapping, PhD thesis. University of Glasgow. http://www.fil.ion.ucl.ac.uk/spm/papers/APH_thesis
Holmes AP, Friston KJ (1999): Generalizability, random effects, and population inference. Proceedings of the Fourth International Conference on Functional Mapping of the Human Brain, June 7–12, 1998, Montreal, Canada. Neuroimage 7:S754.

Holmes AP, Blair RC, Watson JDG, Ford I (1996): Nonparametric analysis of statistic images from functional mapping experiments. J Cereb Blood Flow Metab 16:7–22.
Jockel K-H (1986): Finite sample properties and asymptotic efficiency of Monte-Carlo tests. Ann Stat 14:336–347.
Kendall M, Gibbons JD (1990): Rank correlation methods, 5th ed. ??: Edward Arnold.
Liu C, Raz J, Turetsky B (1998): An estimator and permutation test for single-trial fMRI data. In: Abstracts of ENAR meeting of the International Biometric Society. International Biometric Society.
Locascio JJ, Jennings PJ, Moore CI, Corkin S (1997): Time series analysis in the time domain and resampling methods for studies of functional magnetic resonance brain imaging. Hum Brain Mapp 5:168–193.
Manly BFJ (1997): Randomization, bootstrap, and Monte-Carlo methods in biology. London: Chapman and Hall.
Marshuetz C, Smith EE, Jonides J, DeGutis J, Chenevert TL (2000): Order information in working memory: fMRI evidence for parietal and prefrontal mechanisms. J Cogn Neurosci 12:130–144.
Noll DC, Kinahan PE, Mintun MA, Thulborn KR, Townsend DW (1996): Comparison of activation response using functional PET and MRI. Proceedings of the Second International Conference on Functional Mapping of the Human Brain, June 17–21, 1996, Boston, MA. Neuroimage 3:S34.
Pitman EJG (1937a): Significance tests which may be applied to samples from any population. J R Stat Soc 4(Suppl):119–130.
Pitman EJG (1937b): Significance tests which may be applied to samples from any population. II. The correlation coefficient test. J R Stat Soc 4(Suppl):224–232.
Pitman EJG (1937c): Significance tests which may be applied to samples from any population. III. The analysis of variance test. Biometrika 29:322–335.
Poline JB, Mazoyer BM (1993): Analysis of individual positron emission tomography activation maps by detection of high signal-to-noise-ratio pixel clusters. J Cereb Blood Flow Metab 13:425–437.
Poline JB, Worsley KJ, Evans AC, Friston KJ (1997): Combining spatial extent and peak intensity to test for activations in functional imaging. Neuroimage 5:83–96.
Roland PE, Levin B, Kawashima R, Akerman S (1993): Three-dimensional analysis of clustered voxels in 15-O-butanol brain activation images. Hum Brain Mapp 1:3–19.

Silbersweig DA, Stern E, Schnorr L, Frith CD, Ashburner J, Cahill C, Frackowiak RSJ, Jones T (1994): Imaging transient, randomly occurring neuropsychological events in single subjects with positron emission tomography: an event-related count rate correlational analysis. J Cereb Blood Flow Metab 14:771–782.
Silbersweig DA, Stern E, Frith C, Cahill C, Holmes A, Grootoonk S, Seaward J, McKenna P, Chua SE, Schnorr L, Jones T, Frackowiak RSJ (1995): A functional neuroanatomy of hallucinations in schizophrenia. Nature 378:169–176.
Smith AM, Lewis BK, Ruttimann UE, Ye FQ, Sinnwell TM, Yang Y, Duyn JH, Frank JA (1999): Investigation of low frequency drift in fMRI signal. Neuroimage 9:526–533.
Stoeckl J, Poline J-B, Malandain G, Ayache N, Darcourt J (2001): Smoothness and degrees of freedom restrictions when using SPM99. Neuroimage 13:S259.
Watson JDG, Myers R, Frackowiak RSJ, Hajnal JV, Woods RP, Mazziotta JC, Shipp S, Zeki S (1993): Area V5 of the human brain: evidence from a combined study using positron emission tomography and magnetic resonance imaging. Cereb Cortex 3:79–94.
Westfall PH, Young SS (1993): Resampling-based multiple testing: examples and methods for P-value adjustment. New York: Wiley.
Worsley KJ (1994): Local maxima and the expected Euler characteristic of excursion sets of χ², F, and t fields. Adv Appl Prob 26:13–42.
Worsley KJ, Evans AC, Strother SC, Tyler JL (1991): A linear spatial correlation model, with applications to positron emission tomography. J Am Stat Assoc 86:55–67.
Worsley KJ (1996): The geometry of random images. Chance 9:27–40.
Worsley KJ, Friston KJ (1995): Analysis of fMRI time-series revisited—again. Neuroimage 2:173–181.
Worsley KJ, Evans AC, Marrett S, Neelin P (1992): A three-dimensional statistical analysis for CBF activation studies in human brain. J Cereb Blood Flow Metab 12:1040–1042.
Worsley KJ, Marrett S, Neelin P, Vandal AC, Friston KJ, Evans AC (1995): A unified statistical approach for determining significant signals in images of cerebral activation. Hum Brain Mapp 4:58–73.


APPENDIX A: STATISTICAL NONPARAMETRIC MAPPING

Statistical Parametric Mapping refers to the conceptual and theoretical framework that combines the general linear model and Gaussian random field (GRF) theory to construct, and make inferences about, statistic maps, respectively. The approach depends on various image processing techniques for coregistering, smoothing and spatially normalizing neuroimages. As the name suggests, the statistical technology employed is parametric. The data are modeled using the general linear model, and the resulting statistic images are assessed using approximate results from the theory of continuous random fields. The methodologies are presented in the peer reviewed literature (Friston et al., 1995a,b; Worsley and Friston, 1995). A complete and more accessible exposition, together with examples, is presented in Human Brain Function (Frackowiak et al., 1997).

The Statistical Parametric Mapping approach is implemented in a software package known as SPM, which runs within the commercial MATLAB (http://www.mathworks.com/) numerical computing environment. The SPM software is freely available to the functional neuroimaging community, and may be retrieved from the SPM web site at http://www.fil.ion.ucl.ac.uk/spm.

The nonparametric permutation methods described in this study build on the Statistical Parametric Mapping framework and may be referred to as Statistical nonParametric Mapping. The computer programs used in this article are available as a “toolbox” built on top of SPM, hence SnPM. SnPM is available from the SPM website, at http://www.fil.ion.ucl.ac.uk/spm/snpm, where additional resources complementing this article can be found. These include an example data set and a step-by-step description of the analysis of these data using the SnPM toolbox.

APPENDIX B: STATISTICAL HYPOTHESIS TESTING

Statistical hypothesis testing formulates the experimental question in terms of a null hypothesis, written ℋ0, hypothesizing that there is no experimental effect. The test rejects this null hypothesis in favor of an alternative hypothesis, written ℋA, if it is unlikely that the observed data could have arisen under the null hypothesis, where the data are summarized by a statistic, and appropriate assumptions are made. Thus, the probability of falsely rejecting a true null hypothesis, a Type I error, is controlled. The test level, usually denoted by α, is the accepted “risk” of the test, the probability of committing a Type I error. Formally, we compute the P-value as the probability that the statistic would exceed (or equal) that observed under the null hypothesis, given the assumptions. If the P-value is smaller than the chosen test level α, then the null hypothesis is rejected. Rigorously we say “there is evidence against the null hypothesis at level α,” or “at the 100α% level.” Hence, the P-value is the smallest test level α at which the null hypothesis would be rejected. The value of the statistic with P-value equal to α is the critical value, because more extreme values lead to rejection of the null hypothesis. Commonly α is chosen to be 0.05, corresponding to an expected false positive rate of one in every 20 applications of the test (5%).

Frequently the computation of the P-value involves approximations, either direct mathematical approximations, or indirectly via reliance on results or assumptions that are only approximately true. The size of a test is the actual probability of a Type I error. A test is valid if the size is at most the specified test level α, that is, the true probability of a Type I error is less than α. If approximate P-values are under-estimated (overestimating the significance), the size exceeds α, and the test is invalid. If the approximate P-values are over-estimated (underestimating the significance), then the test is said to be conservative, because the size of the test is less than the allowed α. A test with size equal to the specified level α is said to be exact.

A Type II error is a false negative, the error of not rejecting a false null hypothesis. The probability of a Type II error, obviously dependent on the degree of departure from the null hypothesis, is often denoted by β. The power of a test, for a given departure from ℋ0, is given by (1 − β). Frequently power is discussed generically. A conservative test is usually, but not necessarily, a less powerful test than an exact test. In functional neuroimaging the emphasis has been on avoiding false positives at all costs, concentrating on Type I errors, frequently at the expense of power. This has led to testing procedures with a high probability of Type II error, for even a fairly robust departure from the null hypothesis. In this study, we shall consider the traditional testing framework, focusing on Type I error.

Lastly, hypothesis tests may be two-sided, in which the alternative hypothesis ℋA specifies any departure from the null; or one-sided, in which the alternative hypothesis is directional. For instance a two-sided two sample t-test would assume normality and equal variance of the two groups, and assess the null hypothesis ℋ0: “equal group means” against the alternative ℋA: “group means differ.” A one-sided test would have alternative ℋA: “Group 1 mean is greater than Group 2 mean,” or vice-versa.

APPENDIX C: EXPERIMENTAL DESIGN AND RANDOMIZATION

Randomization is a crucial aspect of experimental design. The basic idea is to randomly allocate subjects to treatments, or in our case conditions to scans, so that any unforeseen confounding factors are randomly distributed across all treatments/conditions, and are thereby accounted for as error. In the absence of random allocation, unforeseen factors may bias the results.

For instance, consider the example of a simple PET activation experiment, where a single subject is to be scanned under two conditions, A and B, with six replications of each condition. We must choose a condition presentation order for the 12 scans. Clearly BBBBBBAAAAAA is unsatisfactory, because comparing the A’s with the B’s will reveal changes over time as well as those due to condition. The condition effect is confounded with time. Even the relatively innocuous and widely employed ABABABABABAB paradigm, however, may be confounded with time. Indeed, principal component analysis of datasets often indicates that time is a serious confound, whose effect may not be easy to model, and temporal effects are only one example of possible confounds. Thus, some form of randomization is almost always required.

The simplest scheme would be to decide the condition for each scan on the toss of a fair coin. This unrestricted randomization, however, may not result in six scans for each condition, and is therefore unsatisfactory. We need a restricted randomization scheme that allocates equal A’s and B’s across the 12 scans. A simple balanced randomization would allocate the six A’s and six B’s freely amongst the 12 scans. This is obviously unsatisfactory, because BBBBBBAAAAAA and ABABABABABAB are possible outcomes, unacceptable due to temporal confounding. A block randomization is required.

In a block randomization scheme, the scans are split up into blocks, usually contiguous in time, and usually of the same size. Conditions are then randomly allocated to scans within these randomization blocks, using a simple restricted randomization scheme. For instance, consider our simple PET activation experiment example. The 12 scans can be split up into equally sized randomization blocks in various ways: two blocks of six scans; three blocks of four scans; or six blocks of two scans. The size of the randomization blocks in each case is a multiple of the number of conditions (two), and a divisor of the number of scans (12). Within randomization blocks, we assign equal numbers of A’s and B’s at random. So, a randomization block of size two could be allocated in two ways, as AB or BA; blocks of size four in six ways, as AABB, ABAB, ABBA, BAAB, BABA, or BBAA; and for randomization blocks of size six there are 20 possible allocations. The implicit assumption is that the randomization blocks are sufficiently short that confounding effects within blocks can be ignored. That is, the different allocations within each block are all assumed to be free from confound biases, such that the distribution of a statistic comparing the A’s and B’s will be unaffected by the within-block allocation. This parallels the properties of the exchangeability blocks.

APPENDIX D: COMBINATORICS

Combinatorics is the study of permutations and combinations, usually expressed generically in terms of “drawing colored balls from urns.” Fortunately we only need a few results:

• There are n! ways of ordering n distinguishable objects. Read “n-factorial,” n! is the product of the first n natural numbers: n! = 1 × 2 × … × (n − 1) × n. Example: In the current context of functional neuroimaging, a parametric design provides an example. Suppose we have 12 scans on a single individual, each with a unique covariate. There are 12! ways of permuting the 12 covariate values amongst the 12 scans.

• There are nCr ways of drawing r objects (without replacement) from a pool of n distinguishable objects, where the order of selection is unimportant. Read “n-choose-r,” these are the Binomial coefficients, also written with n above r in parentheses. nCr is a fraction of factorials: nCr = n!/(r!(n − r)!). Example: Consider a balanced randomization of conditions A and B to scans within a randomization block of size four. Once we choose two of the four scans to be condition A, the remainder must be B, so there are 4C2 = 6 ways of ordering two A’s and two B’s.

• There are nʳ ways of drawing r objects from a pool of n distinguishable objects, when the order is important and each drawn object is replaced before the next selection. Example: Suppose we have a simple single subject activation experiment with two conditions, A and B, to be randomly allocated to 12 scans using a balanced randomization within blocks of size four. From above, we have that there are 4C2 = 6 possibilities within each randomization block. Because there are three such blocks, the total number of possible labelings for this randomization scheme is 6³ = 216.
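These counting rules, and the counts quoted in the examples above, are easy to verify numerically; a minimal sketch using only the Python standard library:

    from math import comb, factorial

    assert factorial(4) == 24             # orderings of four labels within one block
    assert factorial(4) ** 3 == 13824     # three independently randomized blocks
    assert comb(12, 6) == 924             # choose six of 12 subjects for the AB order
    assert comb(4, 2) == 6                # two A's and two B's within a block of four
    assert comb(4, 2) ** 3 == 216         # balanced randomization over three such blocks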

APPENDIX E: MULTIPLE COMPARISONS

For each voxel k in the volume of interest W, k ∈ W, we have a voxel level null hypothesis ℋ0ᵏ, and a test at each voxel. In the language of multiple comparisons (Hochberg and Tamhane, 1987), we have a family of tests, one for each voxel, a “collection of tests for which it is meaningful to take into account some combined measure of errors.” The probability of falsely rejecting any voxel hypothesis is formally known as the family-wise or experiment-wise Type I error rate. For the current simultaneous testing problem of assessing statistic images, experiment-wise error is better described as image-wise error.

If the voxel hypotheses are true for all voxels in the volume of interest W, then we say the omnibus hypothesis ℋ0ᵂ is true. The omnibus hypothesis is the intersection of the voxel hypotheses, a hypothesis of “no experimental effect anywhere” within the volume of interest. Rejecting any voxel hypothesis implies rejecting the omnibus hypothesis. Rejecting the omnibus hypothesis implies rejecting some (possibly unspecified) voxel hypotheses. Image-wise error is then the error of falsely rejecting the omnibus hypothesis.

Clearly a valid test must control the probability of image-wise error. Formally, a test procedure has weak control over experiment-wise Type I error if the probability of falsely rejecting the omnibus hypothesis is less than the nominal level α:

Pr(“reject ℋ0ᵂ” | ℋ0ᵂ) ≤ α

Such a test is known as an omnibus test. A significant test result indicates evidence against the omnibus null hypothesis, but because the Type I error for individual voxels is not controlled the test has no localizing power to identify specific voxels. We can only declare “some experimental effect, somewhere.”

For a test with localizing power we must consider a further possibility for Type I error, namely that of attributing a real departure from the omnibus null hypothesis to the wrong voxels. If we are to reject individual voxel hypotheses, then in addition to controlling for image-wise Type I error, we must also control the probability of Type I error at the voxel level. This control must be maintained for any given voxel even if the null hypothesis is not true for voxels elsewhere. A test procedure has strong control over experiment-wise Type I error if the tests are valid for any set of voxels where the null hypothesis is true, regardless of the veracity of the null hypothesis elsewhere. Formally, for any subset U of voxels in the volume of interest, U ⊆ W, where the corresponding omnibus hypothesis ℋ0ᵁ is true, strong control over experiment-wise Type I error is maintained if and only if

Pr(“reject ℋ0ᵁ” | ℋ0ᵁ) ≤ α

In other words, the validity of a test in one region is unaffected by the veracity of the null hypothesis elsewhere. Such a test has localizing power: a departure from the null hypothesis in one region will not cause the test to pick out voxels in another region where the null hypothesis is true. Clearly strong control implies weak control.

A multiple comparisons procedure with strong control over experiment-wise Type I error can yield corrected or adjusted P-values. Considering a test at a single voxel, the P-value is the smallest test level α at which the null hypothesis is rejected. In the context of the multiple comparisons problem of assessing the statistic image, these are uncorrected P-values, because they do not take into account the multiplicity of testing. By analogy, a corrected P-value for the null hypothesis at a voxel is the smallest test level α at which an appropriate multiple comparisons procedure with strong control over experiment-wise Type I error rejects the null hypothesis at that voxel. Thus, corrected P-values, denoted P̃, account for the multiplicity of testing.
