
fMRI Analysis with emphasis on the General Linear Model

http://www.fmri4newbies.com/

Last Update: January 18, 2012; Last Course: Psychology 9223, W2010, University of Western Ontario

Jody Culham, Brain and Mind Institute

Department of Psychology, University of Western Ontario

Part 1: Statistical Intuitions

What data do we start with?

• 12 slices × 64 voxels × 64 voxels = 49,152 voxels
• Each voxel has 136 time points (volumes)
• Therefore, for each run, we have ~6.7 million data points
• We often have several runs for each experiment

These #s are from an obsolete scanner. With a modern 3T, we can get 3X the slices

Why do we need stats?
• We could, in principle, analyze data by voxel surfing: move the cursor over different areas and see if any of the time courses look interesting

Slice 9, Voxel 1, 0; Slice 9, Voxel 0, 0

Even where there’s no brain, there’s noise

Slice 9, Voxel 9, 27: Here's a voxel that responds well whenever there's visual stimulation

Slice 9, Voxel 13, 41: Here's one that responds well whenever there are intact objects

Slice 9, Voxel 14, 42; Slice 9, Voxel 18, 36: Here are a couple that sort of show the right pattern, but is it "real"?

Slice 9, Voxel 22, 7: The signal is much higher where there is brain, but there's still noise

Why do we need stats?

• Clearly voxel surfing isn’t a viable option. We’d have to do it 49,152 times in this data set and it would require a lot of subjective decisions about whether activation was real

• This is why we need statistics

• Statistics:
  – tell us where to look for activation that is related to our paradigm
  – help us decide how likely it is that activation is "real"

The lies and damned lies come in when you write the manuscript

Predicted Responses
• fMRI is based on the Blood Oxygenation Level Dependent (BOLD) response
• It takes about 5 sec for the blood to catch up with the brain
• We can model the predicted activation in one of two ways (see the sketch after this list):

1. shift the boxcar by approximately 5 seconds (2 images x 2 seconds/image = 4 sec, close enough)

2. convolve the boxcar with the hemodynamic response to model the shape of the true function as well as the delay
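As a sketch of those two options, assuming numpy/scipy (the double-gamma HRF parameters here are common illustrative values, not taken from the slides):

```python
import numpy as np
from scipy.stats import gamma

TR = 2.0                 # seconds per volume
n_vols = 136             # volumes per run (from the example above)

# Boxcar: condition changes every 8 volumes (16 s), as in the simple experiment
boxcar = ((np.arange(n_vols) // 8) % 2).astype(float)

# Option 1: shift the boxcar by 2 volumes (2 x 2 s = 4 s, close enough to 5 s)
shifted = np.roll(boxcar, 2)
shifted[:2] = 0.0

# Option 2: convolve the boxcar with a canonical double-gamma HRF
t = np.arange(0, 32, TR)                          # 32 s of HRF
hrf = gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 12)   # peak ~5 s, undershoot ~12 s
hrf /= hrf.sum()
convolved = np.convolve(boxcar, hrf)[:n_vols]
```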

[Figure: boxcar, shifted, and HRF-convolved predictors for the predicted activation in a visual area and in an object area]

Types of Errors

Slide modified from Duke course

                                       Is the region truly active?
                                           Yes                 No
Does our stat test indicate     Yes        HIT                 Type I Error
that the region is active?      No         Type II Error       Correct Rejection

p value: probability of a Type I error

e.g., p <.05

"There is less than a 5% probability that a voxel our stats have declared 'active' is in reality NOT active."

Statistical Approaches in a Nutshell

t-tests

• compare activation levels between two conditions

• use a time-shift to account for hemodynamic lag

correlations

• model activation and see whether any areas show a similar pattern

Fourier analysis

• Do a Fourier analysis to see if there is energy at your paradigm frequency

Fourier analysis images from Huettel, Song & McCarthy, 2004, Functional Magnetic Resonance Imaging
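As an illustration of the Fourier approach, a minimal numpy sketch (the voxel time course here is synthetic, with a response at the paradigm frequency buried in noise; frequencies follow the example experiment's 16 s on / 16 s off design):

```python
import numpy as np

TR, n_vols = 2.0, 136
t = np.arange(n_vols) * TR
paradigm_freq = 1 / 32.0   # one 16 s on / 16 s off cycle every 32 s, in Hz

rng = np.random.default_rng(3)
# Toy voxel: some signal at the paradigm frequency plus random noise
voxel_ts = np.sin(2 * np.pi * paradigm_freq * t) + rng.standard_normal(n_vols)

freqs = np.fft.rfftfreq(n_vols, d=TR)            # frequency axis in Hz
power = np.abs(np.fft.rfft(voxel_ts)) ** 2

# Energy at the bin closest to the paradigm frequency
bin_idx = np.argmin(np.abs(freqs - paradigm_freq))
print(f"power at {freqs[bin_idx]:.4f} Hz: {power[bin_idx]:.1f}")
```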

Effect of Thresholds

r = 0 (0% of variance), p < 1

r = .24 (6% of variance), p < .05

r = .50 (25% of variance), p < .000001

r = .40 (16% of variance), p < .000001

r = .80 (64% of variance), p < 10^-33

Complications

• Not only is it hard to determine what’s real, but there are all sorts of statistical problems

r = .24 (6% of variance), p < .05

What’s wrong with these data?

Potential problems

1. data may be contaminated by artifacts (e.g., head motion, breathing artifacts)

2. .05 × 49,152 ≈ 2,458 "significant" voxels by chance alone

3. many assumptions of statistics (that adjacent voxels are uncorrelated with each other; that adjacent time points are uncorrelated with one another) are false

The General Linear Model (GLM)
GLM definition from Huettel et al.:
• a class of statistical tests that assume that the experimental data are composed of the linear combination of different model factors, along with uncorrelated noise

• Model– statistical model

• Linear– things add up sensibly (1+1 = 2)

• note that linearity refers to the predictors in the model and not necessarily the BOLD signal

• General
  – many simpler statistical procedures such as correlations, t-tests and ANOVAs are subsumed by the GLM

Benefits of the GLM
• GLM is an overarching tool that can do anything that the simpler tests do
• allows any combination of contrasts (e.g., intact − scrambled, scrambled − baseline), unlike simpler methods (correlations, t-tests, Fourier analyses)
• allows more complex designs (e.g., factorial designs)
• allows much greater flexibility for combining data within subjects and between subjects
• allows comparisons between groups
• allows counterbalancing orders within and between subjects
• allows modelling of known sources of noise in the data (e.g., error trials, head motion)

Part 2: Composition of a Voxel Time Course

A Simple Experiment

Intact Objects

Scrambled Objects

Blank Screen

TIME

One volume (12 slices) every 2 seconds for 272 seconds (4 minutes, 32 seconds)

Condition changes every 16 seconds (8 volumes)

Lateral Occipital Complex
• responds when subject views objects

What’s real?

[Figure: four candidate time courses, labeled A, B, C, and D]

What's real?
• I created each of those time courses by taking the predictor function and adding a variable amount of random noise

data = signal + noise

What’s real?

Which of the data sets below is more convincing?

Formal Statistics

• Formal statistics are just doing what your eyeball test of significance did– Estimate how likely it is that the signal is real given how noisy the data is

• confidence: how likely is it that the results could occur purely due to chance?

• “p value” = probability value

– If "p = .03", that means there is a 3 in 100 (3%) chance that the results are bogus

• By convention, if the probability that a result could be due to chance is less than 5% (p < .05), we say that result is statistically significant

• Significance depends on
  – signal (differences between conditions)
  – noise (other variability)
  – sample size (more time points are more convincing)

Let’s create a time course for one LO voxel

We’ll begin with activation

Response to Intact Objects is 4X greater than Scrambled Objects

Then we'll assume that our modelled activation is off because of a transient component

Our modelled activation could be off for other reasons

All of the following could lead to inaccurate models:
• different shape of function
• different width of function
• different latency of function

Reminder: Variability of HRF

Intersubject variability of HRF in M1
Handwerker et al., 2004, NeuroImage

Now let’s add some variability due to head motion

…though really motion is more complex

• Head motion can be quantified with 6 parameters given by any motion correction algorithm:
  – x translation
  – y translation
  – z translation
  – xy rotation
  – xz rotation
  – yz rotation

• For simplicity, I’ve only included parameter one in our model

• Head motion can lead to other problems not predictable by these parameters

Now let’s throw in a pinch of linear drift

• linear drift could arise from magnet noise (e.g., parts warm up) or physiological noise (e.g., subject’s head sinks)

and then we’ll add a dash of low frequency noise

• low frequency noise can arise from magnet noise or physiological noise (e.g., subject's cycles of alertness/drowsiness)
• low frequency noise would occur over a range of frequencies, but for simplicity, I've only included one frequency (1 cycle per run) here
  – Linear drift is really just very low frequency noise

and our last ingredient… some high frequency noise

• high frequency noise can arise from magnet noise or physiological noise (e.g., subject's breathing rate and heart rate)

When we add these all together, we get a realistic time course
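Putting the ingredients together, a minimal numpy sketch of composing such a synthetic time course (all component shapes and amplitudes are illustrative, not the values used in the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
TR, n_vols = 2.0, 136
t = np.arange(n_vols) * TR

# Activation: boxcar shifted by 2 volumes (a stand-in for an HRF-convolved predictor)
activation = 2.0 * np.roll((np.arange(n_vols) // 8) % 2, 2)

# Head motion proxy: a step partway through the run (translation parameter 1)
motion = 0.8 * (t > 150)

# Linear drift (e.g., scanner parts warm up, subject's head sinks)
drift = 0.005 * t

# Low-frequency noise: 1 cycle per run
low_freq = 0.5 * np.sin(2 * np.pi * t / t[-1])

# High-frequency noise: breathing, heart rate, thermal noise
high_freq = 0.3 * rng.standard_normal(n_vols)

time_course = activation + motion + drift + low_freq + high_freq
```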

Part 3: General Linear Model

Now let's be the experimenter
• First, we take our time course and normalize it using z scores
• z = (x − mean)/SD
• normalization leads to data where mean = 0 and SD = 1

Alternative: You can transform the data into % BOLD signal change. This is usually a better approach because it's not dependent on variance
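Both normalizations are one-liners; a minimal sketch assuming numpy (the helper names are mine, not from the lecture):

```python
import numpy as np

def z_score(ts):
    """Normalize a time course to mean 0, SD 1."""
    ts = np.asarray(ts, dtype=float)
    return (ts - ts.mean()) / ts.std()

def percent_signal_change(ts, baseline=None):
    """Express a time course as % BOLD signal change from baseline.
    If no baseline indices are given, use the run mean as the baseline."""
    ts = np.asarray(ts, dtype=float)
    b = ts[baseline].mean() if baseline is not None else ts.mean()
    return 100.0 * (ts - b) / b
```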

If you only pay attention to one slide in this lecture, it should be the next one!!!

We create a GLM with 2 predictors

fMRI Signal = (predictor 1 × β1) + (predictor 2 × β2) + Residuals

In matrix form: y = Xβ + ε

"our data" (y) = "what we CAN explain" (the design matrix X) × "how much of it we CAN explain" (the beta weights β) + "what we CANNOT explain" (the residuals ε)

Statistical significance is basically a ratio of explained to unexplained variance
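The slide's equation translates directly into a few lines of linear algebra. A minimal sketch, assuming numpy and toy boxcar predictors (the variable names and amplitudes are illustrative, not the course's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_vols = 136

# Toy predictors: alternating boxcars for intact and scrambled epochs
intact_pred = ((np.arange(n_vols) // 8) % 4 == 1).astype(float)
scrambled_pred = ((np.arange(n_vols) // 8) % 4 == 3).astype(float)

# Toy data: intact drives the voxel 4x more than scrambled, plus noise
y = 2.0 * intact_pred + 0.5 * scrambled_pred + rng.standard_normal(n_vols)

# Design matrix with a constant column (soaks up the mean activation level)
X = np.column_stack([intact_pred, scrambled_pred, np.ones(n_vols)])

# Least squares solves y = X @ betas + residuals
betas, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ betas

# Statistical significance is basically explained vs. unexplained variance
r_squared = 1 - residuals.var() / y.var()
print(betas, r_squared)
```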

Implementation of GLM in SPM

• SPM represents time as going down
• SPM represents predictors within the design matrix as grayscale plots (where black = low, white = high) over time
• GLM includes a constant to take care of the average activation level throughout each run
  – SPM shows this explicitly (BV may not)

[Figure: SPM design matrix, with time running down and the Intact and Scrambled predictors shown as grayscale columns. Many thanks to Øystein Bech Gadmar for creating this figure in SPM]

Effect of Beta Weights

• Adjustments to the beta weights have the effect of raising or lowering the height of the predictor while keeping the shape constant

Dynamic Example

The beta weight is NOT a correlation

• correlations measure goodness of fit regardless of scale
• beta weights are a measure of scale

[Figure: four cases: small β with large r; large β with large r; small β with small r; large β with small r]
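A minimal numpy sketch of the distinction (illustrative data; the slope here is the beta, the correlation the goodness of fit):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sin(np.linspace(0, 8 * np.pi, 200))          # predictor

# Small beta, large r: small scale, but a very tight fit
y1 = 0.1 * x + 0.01 * rng.standard_normal(200)

# Large beta, small r: big scale, but a sloppy fit
y2 = 5.0 * x + 10.0 * rng.standard_normal(200)

for y in (y1, y2):
    beta = np.dot(x, y) / np.dot(x, x)              # least-squares slope (scale)
    r = np.corrcoef(x, y)[0, 1]                     # goodness of fit
    print(f"beta = {beta:6.2f}   r = {r:5.2f}")
```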

We create a GLM with 2 predictors (revisited with fitted values)

"our data" (the fMRI signal) = "what we CAN explain" (design matrix × betas, i.e., "how much of it we CAN explain") + "what we CANNOT explain" (residuals)

e.g., when β1 = 2 and β2 = 0.5

Statistical significance is basically a ratio of explained to unexplained variance

Correlated Predictors

• Where possible, avoid predictors that are highly correlated with one another
• This is why we NEVER include a baseline predictor
  – the baseline predictor is almost completely correlated with the sum of the existing predictors

Two stimulus predictors and a baseline predictor

[Figure: the two stimulus predictors and the baseline predictor sum together; correlations shown: r = −.95, r = −.53, r = −.53]

Which model accounts for this data?

[Figure: the same data fit equally well by two beta combinations, e.g., one predictor × β = 1 with the others × β = 0, OR a different combination that puts β = −1 on the baseline predictor]

• Because the predictors are highly correlated, the model is underdetermined and you can't tell which beta combo is best

Orthogonalizing Regressors

Contrasts in the GLM
• We can examine whether a single predictor is significant (compared to the baseline)
• We can also examine whether a single predictor is significantly greater than another predictor

Contrasts

A "balanced" contrast has weights that sum to zero (e.g., +1 −1 0)

Conjunction of contrasts
• e.g., (+1 −1 0) AND (+1 0 −1)
• (Bio motion > Nonbio motion) AND (Bio motion > control)
• more rigorous than a balanced contrast
• hypothetical (but not actual) conjunction p value = product of independent p values
• e.g., .01 × .01 = .0001

A Real Voxel
• Here's the time course from a voxel that was significant in the +Intact −Scrambled comparison

Maximizing Your Power

As we saw earlier, the GLM is basically comparing the amount of signal to the amount of noise

How can we improve our stats?
• increase signal
• decrease noise
• increase sample size (keep subject in longer)

data = signal + noise

How to Reduce Noise

• If you can’t get rid of an artifact, you can include it as a “predictor of no interest” to soak up variance

Example: Some people include predictors derived from the output of motion correction algorithms

Corollary: Never leave out predictors for conditions that will affect your data (e.g., error trials)

This works best when the motion is uncorrelated with your paradigm (predictors of interest)

Reducing Residuals

Part 4: Deconvolution of Event-Related Designs

Using the GLM

Convolution of Single Trials

[Figure: neuronal activity, convolved with the haemodynamic function, yields the BOLD signal over time. Slide from Matt Brown]

Fast fMRI Detection

[Figure: A) BOLD Signal; B) Individual Haemodynamic Components; C) 2 Predictor Curves for use with GLM (summation of B). Slide from Matt Brown]

DEconvolution of Single Trials

[Figure: the same schematic run in reverse: recovering the neuronal activity from the BOLD signal via the haemodynamic function. Slide from Matt Brown]

Deconvolution Example
• time course from 4 trials of two types (pink, blue) in a "jittered" design

Summed Activation

Single Stick Predictor

• single predictor for first volume of pink trial type

Predictors for Pink Trial Type

• set of 12 predictors for subsequent volumes of pink trial type

• need enough predictors to cover unfolding of HRF (depends on TR)

Predictor Matrix

• diagonal filled with 1's
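To make the predictor-matrix slide concrete, here is a minimal sketch of building such a deconvolution (finite impulse response) design matrix for one trial type, assuming numpy (the helper name and onset volumes are mine, for illustration; 12 lags per the slide above):

```python
import numpy as np

def fir_design_matrix(onsets, n_vols, n_lags=12):
    """Build stick predictors for one trial type.

    Column k is 1 at volume (onset + k) for every trial, so the fitted
    betas trace out the average response (including the HRF) over
    n_lags volumes after trial onset.
    """
    X = np.zeros((n_vols, n_lags))
    for onset in onsets:
        for k in range(n_lags):
            if onset + k < n_vols:
                X[onset + k, k] = 1.0
    return X

# e.g., pink trials starting at volumes 5, 30, 61, 90 in a jittered design
X_pink = fir_design_matrix([5, 30, 61, 90], n_vols=136)
```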

Predictors for the Blue Trial Type

• set of 12 predictors for subsequent volumes of blue trial type

Predictor × Beta Weights for Pink Trial Type
• sequence of beta weights for one trial type yields an estimate of the average activation (including HRF)

Predictor × Beta Weights for Blue Trial Type
• height of beta weights indicates amplitude of response (higher betas = larger response)

Linear Deconvolution

• Jittering ITI also preserves linear independence among the hemodynamic components comprising the BOLD signal.

Miezin et al., 2000

Fast fMRI: Estimation

Pros:
• Produces time course
• Does not assume specific shape for hemodynamic function
• Robust against trial history biases (though not immune to them)
• Compound trial types possible

Cons:
• Complicated
• Unrealistic assumptions about linearity if trials are too close in time

– BOLD is non-linear with inter-event intervals < 6 sec.– Nonlinearity becomes severe under 2 sec.

• Sensitive to noise

Part 5: Dealing with Faulty Assumptions

What’s this #*%&ing reviewer complaining about?!

• Particularly if you do voxelwise stats, you have to be careful to follow the accepted standards of the field. In the past few years the following approaches have been recommended by the stats mavens:

1. Correction for multiple comparisons

2. Random effects analyses

3. Correction for serial correlations

Dead Salmon

• 130,000 voxels
• no correction for multiple comparisons

poster at Human Brain Mapping conference, 2009

Fishy Headlines

Correction for Multiple Comparisons

1) Bonferroni correction
• divide desired p value by number of comparisons

Example:
  desired p value: p < .05
  number of voxels: 50,000
  required p value: p < .05 / 50,000 = p < .000001

• quite conservative
• can use less stringent values
  • e.g., Brain Voyager can use the number of voxels in the cortical surface
  • small volume correction: use more liberal thresholds in areas of the brain which you expected to be active
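The arithmetic above in two lines (a trivial sketch):

```python
n_voxels = 50_000
alpha = 0.05
print(alpha / n_voxels)   # 1e-06, i.e., require p < .000001 per voxel
```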

With conventional probability levels (e.g., p < .05) and a huge number of comparisons (e.g., 64 × 64 × 12 = 49,152), a lot of voxels will be significant purely by chance

e.g., .05 × 49,152 ≈ 2,458 voxels significant due to chance

How can we avoid this?

Correction for Multiple Comparisons

2) Gaussian random field theory
• Fundamental to SPM
• If data are very smooth, then the chance of noise points passing threshold is reduced
• Can correct for the number of "resolvable elements" ("resels") rather than the number of voxels

Slide modified from Duke course

3) Cluster correction

• falsely activated voxels should be randomly dispersed
• set minimum cluster size to be large enough to make it unlikely that a cluster of that size would occur by chance
• some algorithms assume that data from adjacent voxels are uncorrelated (not true)
• some algorithms (e.g., Brain Voyager) estimate and factor in the spatial smoothness of the maps
• cluster threshold may differ for different contrasts

4) Test-retest reliability

• Perform statistical tests on each half of the data
• The probability of a given voxel appearing in both purely by chance is the square of the p value used in each half, e.g., .001 × .001 = .000001
• Alternatively, use the first half to select an ROI and evaluate your hypothesis in the second half.

5) False discovery rate (Genovese et al., 2002, NeuroImage)
• "controls the proportion of rejected hypotheses that are falsely rejected"
• a standard p value (e.g., p < .01) means that a certain proportion of ALL voxels will be significant by chance (1%)
• FDR uses a q value (e.g., q < .01), meaning that a certain proportion of the "activated" (colored) voxels will be significant by chance (1%)
• works in theory, though in practice, my lab hasn't been that satisfied
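For illustration, a minimal sketch of the standard Benjamini-Hochberg procedure on which FDR control is based, assuming numpy (the helper name is mine; this is not Brain Voyager's implementation):

```python
import numpy as np

def fdr_bh(p_values, q=0.01):
    """Benjamini-Hochberg FDR: boolean mask of voxels that survive.

    Finds the largest k such that p_(k) <= (k / m) * q among the sorted
    p values, then declares the k smallest p values significant.
    """
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    mask = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank passing the line
        mask[order[:k + 1]] = True
    return mask
```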

Reminder: the Types of Errors table

                                       Is the region truly active?
                                           Yes                 No
Does our stat test indicate     Yes        HIT                 Type I Error
that the region is active?      No         Type II Error       Correct Rejection

6) Poor man’s Bonferroni

• Jack up the threshold till you get rid of the schmutz (especially in air, ventricles, white matter)

• If you have a comparison where one condition is expected to produce much more activity than the other, turn on both tails of the comparison

• Jody’s rule of thumb: “If ya can’t trust the negatives, can ya trust the positives?”

Example: MT localizer data

Moving rings > stationary rings (orange); Stationary rings > moving rings (blue)

Correction for Temporal Correlations

Statistical methods assume that each of our time points is independent.

In the case of fMRI, this assumption is false.

Even in a "screen saver scan", activation in a voxel at one time is correlated with its activation within ~6 sec

This fact can artificially inflate your statistical significance.

Autocorrelation function

To calculate the magnitude of the problem, we can compute the autocorrelation function

For a voxel or ROI, correlate its time course with itself shifted in time

Plot these correlations by the degree of shift

[Figure: the original time course, and copies shifted by 1 and 2 volumes]

If there’s no autocorrelation, function should drop from 1 to 0 abruptly – pink line

The points circled in yellow suggest there is some autocorrelation, especially at a shift of 1, called AR(1)
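A minimal sketch of computing the autocorrelation function for one voxel's time course, assuming numpy (the helper name is illustrative):

```python
import numpy as np

def autocorrelation(ts, max_lag=5):
    """Correlate a time course with itself shifted by 0..max_lag volumes.
    A large value at lag 1 is the AR(1) autocorrelation."""
    ts = np.asarray(ts, dtype=float)
    ts = (ts - ts.mean()) / ts.std()
    n = len(ts)
    return [np.corrcoef(ts[:n - lag], ts[lag:])[0, 1]
            for lag in range(max_lag + 1)]
```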


BV can correct for the autocorrelation to yield revised (usually lower) p values

[Figure: statistical maps BEFORE and AFTER the correction]

BV Preprocessing Options

Temporal Smoothing of Data

• We have the option in our software to temporally smooth our data (i.e., remove high temporal frequencies)

• However, I recommend that you not use this option

• Now do you understand why?

Clarification

• correction for temporal correlations is NOT necessary with random effects analyses, only for fixed effects and individual-subject analyses

Collapsed Fixed Effects Models
• assume that the experimental manipulation has the same effect in each subject
• treats all data as one concatenated set with one beta per predictor (collapsed across all subjects)
• e.g., Intact = 2, Scrambled = .5

• strong effect in one subject can lead to significance even when others show weak or no effects

• you can say that effect was significant in your group of subjects but cannot generalize to other subjects that you didn’t test

Separate Subjects Models
• one beta per predictor per subject
  – e.g., JC: Intact = 2.1, Scrambled = 0.2
          DQ: Intact = 1.5, Scrambled = 1.0
          KV: Intact = 1.2, Scrambled = 1.3
• weights each subject equally
• makes data less susceptible to effects of one rogue subject

Random Effects Analysis
• Typical fMRI stats test whether the differences between conditions are significant in the sample of subjects we have tested

• Often, we want to be able to generalize to the population as a whole including all potential subjects, not just the ones we tested

• Random effects analyses allow you to generalize to the population from which your subjects were sampled (which, in practice, is often underpaid graduate students in need of a few bucks!)

• Brain Voyager recommends you don’t even toy with random effects unless you’ve got 10 or more subjects (and 50+ is best)

• Random effects analyses can really squash your data, especially if you don’t have many subjects. Sometimes we refer to the random effects button as the “make my activation go away” button.

• Though standards were lower in the early days of fMRI, today it’s virtually impossible to publish any group voxelwise data without random effects analysis

• You don’t have to worry about it if you’re using the ROI approach because (1) presumably the ROI has already been well-established across multiple labs; and (2) posthoc analyses of results in an ROI approach allow you to generalize to the population (assuming you include individual variance)


Fixed vs. Random Effects GLM

• A Fixed Effects GLM cannot tell the difference between these data sets because (Intact sum − Scram sum) is the same in both cases

• In a Random Effects GLM, Sample Data #2 would be more likely to be significant because all 3 subjects show a trend in the same direction (intact > scrambled), whereas in Sample Data #1, only 2 of 3 subjects show a difference in that direction

Sample Data #1
Subject   Intact beta   Scram beta   Diff
1         4             3             1
2         2             3            -1
3         4             1             3
SUM       10            7             3

Sample Data #2
Subject   Intact beta   Scram beta   Diff
1         4             3             1
2         2             1             1
3         4             3             1
SUM       10            7             3
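A random effects analysis here boils down to a paired t-test across subjects. A minimal sketch with scipy, using the betas from the tables above:

```python
import numpy as np
from scipy import stats

# Per-subject betas from Sample Data #1 (diffs: 1, -1, 3)
intact = np.array([4, 2, 4])
scram = np.array([3, 3, 1])

# Random effects: paired t-test, i.e., test per-subject diffs against zero
t, p = stats.ttest_rel(intact, scram)
print(f"Sample Data #1: t = {t:.2f}, p = {p:.3f}")   # not significant

# Sample Data #2 has diffs of exactly [1, 1, 1]: zero variance across
# subjects, so the t statistic is unbounded; a perfectly consistent
# direction is exactly what random effects rewards
```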

Strategies for Exploration vs. Publication

• Deductive approach
  – Have a specific hypothesis/contrast planned
  – Run all your subjects
  – Run the stats as planned
  – Publish

• Inductive approach
  – Run a few subjects to see if you're on the right track
  – Spend a lot of time exploring the pilot data for interesting patterns
  – "Find the story" in the data
  – You may even change the experiment, run additional subjects, or run a follow-up experiment to chase the story

• While you need to use rigorous corrections for publication, do not be overly conservative when exploring pilot data or you might miss interesting trends
• Random effects analyses can be quite conservative, so you may want to do exploratory analyses with fixed effects (and then run more subjects if needed so you can publish random effects)

Part 6: To Localize or Not to Localise?

Neuroimagers can't even agree how to SPELL localiser/localizer!

Methodological Fundamentalism

The latest review I received…

Approach #1: Voxelwise Statistics

1. You don't necessarily need a priori hypotheses (though sometimes you can use less conservative stats if you have them)
2. Average all of your data together in Talairach space
3. Compare two (or more) conditions using precise statistical procedures within every voxel of the brain. Any area that passes a carefully determined threshold is considered real. (This is the tricky part!)
4. Make a list of these areas and publish it.

Voxelwise Approach: Example
• Malach et al., 1995, PNAS
• Question: Are there areas of the human brain that are more responsive to objects than scrambled objects?
• You will recognize this as what we now call an LO localizer, but Malach was the first to identify LO

LO activation is shown in red, behind MT+ activation in green

LO (red) responds more to objects, abstract sculptures and faces than to textures, unlike visual cortex (blue) which responds well to all stimuli

The Danger of Voxelwise Approaches

Source: Decety et al., 1994, Nature

• This is one of two tables from a paper
• Some papers publish tables of activation two pages long
• How can anyone make sense of so many areas?

Approach #2: Region of interest (ROI) analysis

• If you are looking at a well-established area (such as visual cortex, motor cortex, or the lateral occipital complex), it’s fairly easy to activate and identify the area

1. Do the stats and play with the threshold till you get something believable in the right vicinity based on anatomical location (e.g., sulcal landmarks) or functional location (e.g., Talairach coordinates from prior studies)
2. Once you have found the ROI, do independent experiments, extract the time course information and determine whether activation differences between conditions are significant
   – Because the runs that are used to generate the area are independent from those used to test the hypothesis, liberal statistics (p < .05) can be used

Example of ROI Approach
Culham et al., 2003, Experimental Brain Research

Does the Lateral Occipital Complex compute object shape for grasping?

Step 1: Localize LOC (Intact Objects vs. Scrambled Objects)

Example of ROI Approach
Culham et al., 2003, Experimental Brain Research

Does the Lateral Occipital Complex compute object shape for grasping?

Step 2: Extract LOC data from experimental runs (Grasping vs. Reaching)

NS, p = .35; NS, p = .31

Example of ROI Approach: Very Simple Stats

% BOLD Signal Change, Left Hem. LOC

Subject   Grasping   Reaching
1         0.02       0.03
2         0.19       0.08
3         0.04       0.01
4         0.10       0.32
5         1.01       -0.27
6         0.16       0.09
7         0.19       0.12

Extract the average peak from each subject for each condition, then simply do a paired t-test to see whether the peaks are significantly different between conditions
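The "very simple stats" in code, using scipy and the values in the table above:

```python
import numpy as np
from scipy import stats

# Peak % BOLD signal change per subject (left-hemisphere LOC, table above)
grasping = np.array([0.02, 0.19, 0.04, 0.10, 1.01, 0.16, 0.19])
reaching = np.array([0.03, 0.08, 0.01, 0.32, -0.27, 0.09, 0.12])

# Paired t-test across the 7 subjects
t, p = stats.ttest_rel(grasping, reaching)
print(f"paired t(6) = {t:.2f}, p = {p:.3f}")
```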

• Instead of using % BOLD Signal Change, you can use beta weights
• You can also do a planned contrast in Brain Voyager using a module called the ROI GLM

Utility of Doing Both Approaches

• We also verified the result with a voxelwise approach

Verification of no LOC activation for grasping > reaching even at moderate threshold (p < .001, uncorrected)

Example: The Danger of ROI Approaches

• Example 1: LOC may be a heterogeneous area with subdivisions; ROI analyses gloss over this

• Example 2: Some experiments miss important areas (e.g., Kanwisher et al., 1997, identified one important face processing area, the fusiform face area (FFA), but did not report a second area that is a very important part of the face processing network, the occipital face area (OFA), because it was less robust and consistent than the FFA)

Comparing the two approaches

Voxelwise Analyses
• Require no prior hypotheses about areas involved
• Include entire brain
• Often neglect individual differences
• Can lose spatial resolution with intersubject averaging
• Can produce meaningless "laundry lists of areas" that are difficult to interpret
• You have to be fairly stats-savvy and include all the appropriate statistical corrections to be certain your activation is really significant
• Popular in Europe

Comparing the two approaches

Region of Interest (ROI) Analyses
• Extraction of ROI data can be subjected to simple stats (no need for multiple comparisons, autocorrelation or random effects corrections)
• Gives you more statistical power (e.g., p < .05)
• Hypothesis-driven
• Useful when hypotheses are motivated by other techniques (e.g., electrophysiology) in specific brain regions
• ROI is not smeared due to intersubject averaging
• Important for discriminating abutting areas (e.g., V1/V2)
• Easy to analyze and interpret
• Neglects other areas which may play a fundamental role
• If multiple ROIs need to be considered, you can spend a lot of scan time collecting localizer data (thus limiting the time available for experimental runs)
• Works best for reliable and robust areas with unambiguous definitions
• Popular in North America

A Proposed Resolution

• There is no reason not to do BOTH ROI analyses and voxelwise analyses
  – ROI analyses for well-defined key regions
  – Voxelwise analyses to see if other regions are also involved
• Ideally, the conclusions will not differ
• If the conclusions do differ, there may be sensible reasons
  – Effect in ROI but not voxelwise
    • perhaps the region is highly variable in stereotaxic location between subjects
    • perhaps the voxelwise approach is not powerful enough
  – Effect in voxelwise but not ROI
    • perhaps the ROI is not homogeneous or is context-specific

Part 7: The War of Non-Independence

Finding the Obvious

Non-independence error
• occurs when statistical tests performed are not independent from the means used to select the brain region

Arguments from Vul & Kanwisher, book chapter in press

A priori probability of getting a JQKA sequence = (1/13)^4 = 1/28,561

A posteriori probability of getting a JQKA sequence = 1/1 = 100%

Non-independence Error
Egregious example
• Identify Area X with contrast of A > B
• Do post hoc stats showing that A is statistically higher than B
• Act surprised

More subtle example of selection bias
• Identify Area X with contrast of A > B
• Do post hoc stats showing that A is statistically higher than C and C is statistically greater than B

Arguments from Vul & Kanwisher, book chapter in press

Figure from Kriegeskorte et al., 2009, Nature Neuroscience

Double Dipping & How to Avoid It
• Kriegeskorte et al., 2009, Nature Neuroscience
• surveyed 134 papers in prestigious journals
• 42% showed at least one example of the non-independence error

Correlations Between Individual Subjects’ Brain Activity and Behavioral Measures

Sample of Critiqued Papers:
Eisenberger, Lieberman & Williams, 2003, Science
• measured fMRI activity during social rejection
• correlated self-reported distress with brain activity
• found r = .88 in anterior cingulate cortex, an area implicated in physical pain perception
• concluded "rejection hurts"

social exclusion > inclusion

“Voodoo Correlations”

• reliability of personality and emotion measures: r ~ .8
• reliability of activation in a given voxel: r ~ .7
• highest expected behavior: fMRI correlation is √(.8 × .7) ≈ .74
• so how can we have behavior: fMRI correlations of r ~ .9?!

The paper's original title, "Voodoo Correlations in Social Neuroscience" (2009), was not well-received by reviewers, so it was changed, even though some people still use the term "voodoo correlations".

"Notably, 53% of the surveyed studies selected voxels based on a correlation with the behavioral individual-differences measure and then used those same data to compute a correlation within that subset of voxels."

Vul et al., 2009, Perspectives on Psychological Science

Avoiding “Voodoo”

• Use independent means to select the region and then evaluate the correlation
• Do a split-half reliability test
  – WARNING: This is reassuring that the result can be replicated in your sample, but it does not demonstrate that the result generalizes to the population

Is the "voodoo" problem all that bad?
• High correlations can occur in legitimately analyzed data
• Did the voxelwise analyses use appropriate correction for multiple comparisons?
  – then the result is statistically significant regardless of the specific correlation
• Is additional data being used for:
  1. inference purposes?
     – if it pretends to provide independent support, that's bad
  2. presentation purposes?
     – alternative formats can be useful in demonstrating that the data are clean (e.g., time courses look sensible; correlations are not driven by outliers)

