Learning to distinguish cognitive subprocesses based on fMRI
Tom M. Mitchell
Center for Automated Learning and Discovery
Carnegie Mellon University
Collaborators: Luis Barrios, Rebecca Hutchinson, Marcel Just, Francisco Pereira, Jay Pujara, John
Ramish, Indra Rustandi
Can we distinguish brief cognitive processes using fMRI?
e.g., the moment a subject finds a sentence ambiguous or not, or decides whether a sentence and picture are consistent
Can we classify/track multiple overlapping processes?
[Figure: observed fMRI and observed button press over time, with underlying "read sentence" and "view picture" processes]
Mental Algebra Task
[Anderson, Qin, & Sohn, 2002]
Activity Predicted by ACT-R Model
Typical ACT-R rule:
IF “_ op a = b”
THEN “ _ = <b <inv op> a>”
[Anderson, Qin, & Sohn, 2002]
Outline
• Training classifiers for short cognitive processes
  – Examples
  – Classifier learning algorithms
  – Feature selection
  – Training across multiple subjects
• Simultaneously classifying multiple overlapping processes
  – Linear model and classification
  – Hidden processes and EM
Training “Virtual Sensors” of Cognitive Processes
Train classifiers of the form: fMRI(t, t+Δ) → CognitiveProcess
e.g., fMRI(t, t+Δ) → {ReadSentence, ViewPicture}
• Fixed set of cognitive processes
• Fixed time interval [t, t+Δ]
Study 1: Pictures and Sentences
• Subject answers whether sentence describes picture by pressing button.
• 13 subjects, TR=500msec
Trial structure: at t=0, View Picture (or Read Sentence) for 4 sec.; then Read Sentence (or View Picture) for 8 sec.; Press Button; then Fixation/Rest.
Data from [Keller et al., 2001]
Example stimulus: "It is not true that the star is above the plus."
[Picture: a "+" above a "*"]
• Learn fMRI(t, t+8) → {Picture, Sentence}, for t=0, 8
Trial timeline as above; classify "picture or sentence?" from the 8-sec. window following each stimulus onset (t=0 and t=8).
Difficulties:
• only 8 seconds of very noisy data
• overlapping hemodynamic responses
• additional cognitive processes occurring simultaneously
Learning task formulation:
• Learn fMRI(t, …, t+8) → {Picture, Sentence}
  – 40 trials (40 pictures and 40 sentences)
  – fMRI(t, …, t+8) = voxels × time (~32,000 features)
  – Train separate classifier for each of 13 subjects
  – Evaluate cross-validated prediction accuracy
• Learning algorithms:
  – Gaussian Naïve Bayes
  – Linear Support Vector Machine (SVM)
  – k-Nearest Neighbor
  – Artificial Neural Networks
• Feature selection/abstraction:
  – Select subset of voxels (by signal, by anatomy)
  – Select subinterval of time
  – Summarize by averaging voxel activities over space, time
  – …
Learning a Gaussian Naïve Bayes (GNB) classifier for <f1, …, fn> → C

For each class value ci:
1. Estimate P(C = ci)
2. For each feature fj, estimate P(fj | C = ci), modeling the distribution for each ci, fj as a Gaussian N(μij, σij)

Applying the GNB classifier to a new instance <f1, …, fn>:
C ← argmax over ci of P(C = ci) ∏j P(fj | C = ci)

[Graphical model: class node C with an arrow to each feature node f1, f2, …, fn]
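The two estimation steps and the classification rule above can be sketched as a minimal pure-Python GNB. The function names are hypothetical, and features are real-valued, as with voxel activities:

```python
import math

def train_gnb(examples):
    """examples: list of (feature_vector, label). Returns, per class,
    the prior P(ci) and per-feature Gaussian parameters (mean, std)."""
    by_class = {}
    for x, y in examples:
        by_class.setdefault(y, []).append(x)
    model = {}
    for c, rows in by_class.items():
        prior = len(rows) / len(examples)
        params = []
        for j in range(len(rows[0])):
            vals = [r[j] for r in rows]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals)
            params.append((mu, math.sqrt(var) or 1e-6))  # guard zero variance
        model[c] = (prior, params)
    return model

def classify_gnb(model, x):
    """Pick argmax_ci log P(ci) + sum_j log N(x_j; mu_ij, sigma_ij)."""
    def log_gauss(v, mu, sd):
        return -math.log(sd * math.sqrt(2 * math.pi)) - (v - mu) ** 2 / (2 * sd ** 2)
    def score(c):
        prior, params = model[c]
        return math.log(prior) + sum(log_gauss(v, mu, sd)
                                     for v, (mu, sd) in zip(x, params))
    return max(model, key=score)
```

Working in log space avoids underflow when multiplying many per-voxel likelihoods, which matters with tens of thousands of features.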
Support Vector Machines [Vapnik et al. 1992]
• Method for learning classifiers corresponding to linear decision surface in high dimensional spaces
• Chooses maximum margin decision surface
• Useful in many high-dimensional domains– Text classification– Character recognition– Microarray analysis
Support Vector Machines (SVM)
Linear SVM
Non-linear Support Vector Machines
• Based on applying kernel functions to data points
– Equivalent to projecting data into higher dimensional space, then finding linear decision surface
– Select kernel complexity (H) to minimize 'structural risk':
  true error rate ≤ error on training data + variance term related to kernel complexity H and number of training examples m
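The kernel idea above can be verified directly: a polynomial kernel computes the same inner product as an explicit projection into the higher-dimensional monomial space, without ever building that space. A small sketch with hypothetical function names:

```python
from itertools import product

def poly2_features(x):
    """Explicit degree-2 feature map: all monomials x_i * x_j."""
    return [xi * xj for xi, xj in product(x, repeat=2)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(x, z):
    """Computes <phi(x), phi(z)> without constructing phi: (x . z)^2."""
    return dot(x, z) ** 2
```

For an n-dimensional input, phi has n² components, yet the kernel needs only the original n-dimensional dot product; this is what makes high-dimensional decision surfaces tractable.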
Generative vs. Discriminative Classifiers
Goal: learn f: X → C, equivalently P(C | X)
Discriminative classifier:
• Learn P(C | X) directly
Generative classifier:
• Learn P(X | C) and P(C)
• Classify using Bayes rule: P(C | X) ∝ P(X | C) P(C)
Generative vs. Discriminative Classifiers

                                 Discriminative             Generative
What they estimate:              P(C|data)                  P(data|C)
Examples:                        SVMs, Artificial           Naïve Bayes,
                                 Neural Nets                Bayesian networks
Robustness to modeling errors:   Typically more robust      Less robust
Criterion for estimating         Minimize classification    Maximize data
parameters:                      error                      likelihood
GNB vs. Logistic regression [Ng, Jordan NIPS03]
Gaussian naïve Bayes
• Model P(X|C) as a class-conditional Gaussian
• Decision surface: hyperplane
• Learning converges in O(log(n)) examples, where n is number of data attributes
Logistic regression
• Model P(C|X) as a logistic function
• Decision surface: hyperplane
• Learning converges in O(n) examples
• Asymptotic error less or same as GNB
Accuracy of Trained Pict/Sent Classifier
• Results (leave-one-out cross-validation):
  – Guessing: 50% accuracy
  – SVM: 91% mean accuracy
    • Single-subject accuracies ranged from 75% to 98%
  – GNB: 84% mean accuracy
  – Feature selection step important for both
    • ~10,000 voxels × 16 time samples = 160,000 features
    • Selected only 240 voxels × 16 time samples
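The leave-one-out protocol behind these accuracies can be sketched generically; `loocv_accuracy` is an illustrative stand-in that works with any train/classify pair, not the study's code:

```python
def loocv_accuracy(examples, train_fn, classify_fn):
    """Leave-one-out cross-validation: for each example, train on all the
    others, test on the held-out one, and report the fraction correct."""
    correct = 0
    for i, (x, y) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        model = train_fn(rest)
        correct += (classify_fn(model, x) == y)
    return correct / len(examples)
```

Because each test example is never seen during training, the averaged accuracy is an (almost) unbiased estimate of performance on new trials, at the cost of one training run per example.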
Can We Train Subject-Independent Classifiers?
Training Cross-Subject Classifiers for Picture/Sentence
• Approach 1: define "supervoxels" based on anatomically defined brain regions
  – Abstract to seven brain-region supervoxels
  – Each supervoxel contains 100's to 1000's of voxels
• Train on n−1 subjects, test on nth subject
• Result: 75% prediction accuracy on subjects outside the training set
  – Compared to 91% avg. single-subject accuracy
  – Significantly better than 50% guessing accuracy
[Wang, Hutchinson, Mitchell. NIPS03]
Study 2: Semantic Word Categories
Word categories:
• Fish
• Trees
• Vegetables
• Tools
• Dwellings
• Building parts

[Francisco Pereira]

Experimental setup:
• Block design
• Two blocks per category
• Each block begins by presenting the category name, then 20 words
• Subject indicates whether each word fits the category

Learning task formulation:
• Learn fMRI(t, …, t+32) → WordCategory
  – fMRI(t, …, t+32) represented by mean fMRI image
  – Train on presentation 1, test on presentation 2 (and vice versa)
• Learning algorithm:
  – 1-Nearest Neighbor, based on spatial correlation [after Haxby]
• Feature selection/abstraction:
  – Select most 'object selective' voxels, based on multiple regression on boxcars convolved with a gamma function
  – 300 voxels in ventral temporal cortex produced greatest accuracy
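The 1-Nearest-Neighbor-by-spatial-correlation rule can be sketched as follows, assuming each category is represented by a stored mean fMRI image; the function names are hypothetical:

```python
import math

def correlation(a, b):
    """Pearson correlation between two equal-length activity vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db)

def classify_1nn(train, test_image):
    """train: list of (mean_image, category). Returns the category whose
    stored mean image correlates best with the test image."""
    return max(train, key=lambda pair: correlation(pair[0], test_image))[1]
```

Correlation, unlike Euclidean distance, is invariant to per-image shifts and scalings of overall signal level, which is why it is a natural similarity measure for fMRI activity patterns.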
Results predicting word semantic category
Mean pairwise prediction accuracy averaged over 8 subjects:
• Ventral temporal: 77% (low: 57%, high: 88%)
• Parietal: 70%
• Frontal: 67%
Random guess: 50%
Mean Activation per Voxel for Word Categories
[Figure: mean activation images for Tools, Dwellings, and Vegetables; one horizontal slice through ventral temporal cortex]
[Pereira, et al., 2004]
P(fMRI | WordCategory)
Plot of single-voxel classification accuracies, Gaussian Naïve Bayes classifier (yellow and red are most predictive). Images from three different subjects (Subject 1, Subject 2, Subject 3) show similar regions with highly informative voxels.
Single-voxel GNB classification error vs. p-value from T-statistic:
N=10^6, P < 0.0001, Error = 0.51; N=10^3, P < 0.0001, Error = 0.01
(a highly significant p-value does not imply low classification error)
Cross-validated prediction error is an unbiased estimate of the Bayes optimal error: the area under the intersection of the class-conditional distributions.
Question:
Do different people’s brains ‘encode’ semantic categories
using the same spatial patterns?
No.
But, there are cross-subject regularities in “distances” between categories, as measured by classifier error rates.
Six-Category Study: Pairwise Classification Errors (ventral temporal cortex)

        Fish    Vegetables  Tools   Dwellings  Trees   Bldg Parts
Subj1   .20     .55 *       .20     .15        .15     .05 *
Subj2   .10 *   .55 *       .35     .20        .10 *   .30
Subj3   .20     .35 *       .15 *   .20        .20     .20
Subj4   .15     .45 *       .15     .15        .25     .05 *
Subj5   .60 *   .55         .25     .20        .15 *   .15 *
Subj6   .20     .25         .00 *   .30 *      .30 *   .05
Subj7   .15     .55 *       .15     .25        .15     .05 *
Mean    .23     .46         .18     .21        .19     .12

(* marks each subject's worst and best categories)
LDA classification of semantic categories of photographs.
[Carlson, et al., J. Cog. Neurosci, 2003]
Cox & Savoy, Neuroimage 2003
Trained SVM and LDA classifiers for semantic photo categories.
Classifiers applied to the same subject a week later were equally accurate.
Lessons Learned
Yes, one can train machine learning classifiers to distinguish a variety of cognitive processes:
– Comprehend Picture vs. Sentence
– Read ambiguous vs. unambiguous sentence
– Read Noun vs. Verb
– Read nouns about "tools" vs. "building parts"
Failures too:
– True vs. false sentences
– Negative vs. affirmative sentences
Which Machine Learning Method Works Best?
• GNB and SVM tend to outperform kNN
• Feature selection important
[Bar chart: average per-subject classification error for each method, with ("Yes") and without ("No") feature selection]
Which Feature Selection Works Best?
• Conventional wisdom: pick features xi that best distinguish between classes A and B
  – E.g., sort xi by mutual information, choose the top n
• Surprise: an alternative strategy worked much better
Wish to learn F: <x1, x2, …, xn> → {A, B}
The learning setting
[Figure: distributions of voxel activity for Class A, Class B, and Rest/Fixation, illustrating voxel discriminability]
GNB Classifier Errors: Feature Selection
(rows: feature selection method; columns: fMRI study; entries: classification error)

                              Word        Nouns      Syntactic   Picture/
Feature selection method      Categories  vs. Verbs  Ambiguity   Sentence
ROI Active Average            NA          .23        .27         .21
ROI Active                    .09         .31        .27         .18
Active                        .08         .34        .25         .16
Discriminate target classes   .10         .36        .34         .26
All features                  .10         .36        .43         .29
"Zero Signal" learning setting

Goal: learn f: X → Y or P(Y|X)
Given:
1. Training examples <Xi, Yi> where Xi = Si + Ni, signal Si ~ P(S | Y = Yi), noise Ni ~ Pnoise
2. Observed noise with zero signal: Z = N0, N0 ~ Pnoise
[Figure: zero-signal (fixation) observations, Class 1 observations, Class 2 observations]
Select features based on discrim(X1, X2) or discrim(Z, Xi)?
“Zero Signal” learning setting
Conjecture: feature selection using discrim(Z,Xi) will improve relative to discrim(X1,X2) as:
• # of features increases
• # of training examples decreases
• signal/noise ratio decreases
• fraction of relevant features decreases
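The two competing selection strategies can be sketched with a simple t-like per-feature discriminability score; `discrim` and `select_features` are illustrative stand-ins, not the study's exact statistic:

```python
import math

def discrim(a, b):
    """Absolute two-sample t-like statistic between value lists a and b."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / len(a)
    vb = sum((x - mb) ** 2 for x in b) / len(b)
    return abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b) + 1e-12)

def select_features(X1, X2, Z, n, use_baseline):
    """X1, X2: per-class example lists; Z: fixation (zero-signal) examples.
    Score feature j by discrim(Z, X1+X2) if use_baseline, else by the
    conventional discrim(X1, X2); return indices of the top-n features."""
    d = len(X1[0])
    def score(j):
        col = lambda X: [x[j] for x in X]
        if use_baseline:
            return discrim(col(Z), col(X1 + X2))
        return discrim(col(X1), col(X2))
    return sorted(range(d), key=score, reverse=True)[:n]
```

Scoring against the fixation baseline asks only "is this voxel active at all?", which is a lower-variance question than "does this voxel separate A from B?", consistent with the conjecture above.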
2. Can we classify/track multiple overlapping processes?
[Figure: observed fMRI and observed button press over time; hidden processes: read sentence, view picture, decide whether consistent; input stimuli marked "?"]
Bayes Net-related state-space models (HMMs, DBNs, etc.), e.g., [Ghahramani, 2001]
[Figure: cognitive subprocesses / state variables generating the observed fMRI over time]
see [Hojen-Sorensen et al., NIPS99]
Hidden Process Model. Each process defined by:
– ProcessID: <comprehend sentence>
– Maximum HDR duration: R
– EmissionDistribution: [ W(v,t) ]
Interpretation Z of data: a set of process instances
– Desire maximum-likelihood { <ProcessIDi, StartTimei> }
– Where the data likelihood treats the signal at each voxel v and time t as Gaussian noise around the sum of the active processes' responses:
  P(data | Z) = ∏ over v,t of N( y(v,t) ; Σi Wi(v, t − StartTimei), σ² )
Generative model for classifying overlapping hidden processes
[with Rebecca Hutchinson]
Classifying Processes with HPMs
Start time known:
Start time unknown: consider candidate times S
GNB classifier is a special case of HPM classifier
Trial timeline as before: stimulus 1 (4 sec.), stimulus 2 (8 sec.), press button, rest; 16 sec. total.

GNB: two separate "picture or sentence?" classifications, one per 8-sec. stimulus window.
HPM: a single joint interpretation over the full 16-sec. trial.
Learning HPMs
• Known start times: least squares regression, e.g., see Dale [HBM, 1999]
• Unknown start times: EM algorithm
  – Repeat:
    • E step: estimate P(S | Y, W)
    • M step: W' ← argmax over W of E over P(S|Y,W) of [ log P(Y, S | W) ]
Currently implement the M step with gradient ascent.
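For known start times, the least-squares step can be sketched as ordinary regression against a lagged 0/1 design matrix, as in the single-voxel simulations below; `ols_deconvolve` is a hypothetical name for this sketch:

```python
import numpy as np

def ols_deconvolve(y, starts, R):
    """y: observed single-voxel time series (sum of overlapping responses).
    starts: list of (process_id, start_time); R: response length in samples.
    Builds a 0/1 design matrix with one column per (process, lag) pair and
    solves least squares for each process's R-sample impulse response."""
    procs = sorted({p for p, _ in starts})
    T = len(y)
    X = np.zeros((T, len(procs) * R))
    for p, s in starts:
        k = procs.index(p)
        for lag in range(R):
            if s + lag < T:
                X[s + lag, k * R + lag] = 1.0  # process p contributes its
                                               # lag-th sample at time s+lag
    W, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
    return {p: W[i * R:(i + 1) * R] for i, p in enumerate(procs)}
```

The responses are identifiable only when start times vary across trials; if two processes always begin with the same offset, their overlapping samples cannot be separated, which is why varied trial timing matters.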
OLS learns 2 processes, overlapping in time, 1 voxel, zero noise, start times known, 10 trials.
Estimates:
Process 1: -0, 0.25, 0.5, 0.75, 1, 0.75, 0.5, 0.25, 3.5108e-17
Process 2: -4.7535e-17, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5
[Indra Rustandi]
[Figure: observed data, reconstructed data, learned process 1, learned process 2]
OLS learns 2 processes, overlapping in time, 1 voxel, noise 0.2, start times known, 10 trials.
Estimates:
Process 1: 0.0054956, 0.32446, 0.48847, 0.83317, 0.99872, 0.86555, 0.55624, 0.23633, -0.050592
Process 2: -0.017376, 0.36435, 0.36134, 0.4856, 0.60143, 0.46168, 0.54137, 0.47466, 0.52419
[Indra Rustandi]
[Figure: observed data, reconstructed data, learned process 1, learned process 2]
Phase II, words every 3 seconds. Mean LFEF, subj 08179.
Estimate Noun and Verb impulse responses.
[Figure: Verb impulse response estimated from overlapping data vs. Verb impulse response "ground truth" from non-overlapping stimuli]
[Indra Rustandi]
Can we classify/track multiple overlapping processes?
[Figure as before: observed fMRI and button press; processes: read sentence, view picture, decide whether consistent]
Learned HPM with 3 processes (S, P, D), and R=13 sec. (TR=500 msec.).
[Figure: learned response models for S, P, D; observed vs. reconstructed data; trial timelines showing P, S instances and the hidden D process ("D?")]
D start time picked to be trialStart+18.
Initial results: HPMs on PictSent
• EM chooses start time = 18 for the hidden D process
• Classification accuracy for held-out PS/SP trials = 15/20 = 0.75
• Held-out classification accuracy is the same for the 2-process (P,S) and 3-process (P,S,D) models
• Data likelihood on held-out data is slightly better for the 3-process (P,S,D) model
Further reading
• Carlson, et al., J. Cog. Neurosci., 2003.
• Cox, D.D. and R.L. Savoy. Functional magnetic resonance imaging (fMRI) "brain reading": detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage, 19:261-270, 2003.
• Kjems, U., L. Hansen, J. Anderson, S. Frutiger, S. Muley, J. Sidtis, D. Rottenberg, and S. C. Strother. The quantitative evaluation of functional neuroimaging experiments: mutual information learning curves. NeuroImage, 15:772-786, 2002.
• Mitchell, T.M., R. Hutchinson, M. Just, S. R. Niculescu, F. Pereira, X. Wang. Classifying Instantaneous Cognitive States from fMRI Data. Proceedings of the 2003 American Medical Informatics Association Annual Symposium, Washington D.C., November 2003.
• Mitchell, T.M., R. Hutchinson, S. R. Niculescu, F. Pereira, X. Wang, M. Just, S. Newman. Learning to Decode Cognitive States from Brain Images. Machine Learning, 2004.
• Strother, S.C., J. Anderson, L. Hansen, U. Kjems, R. Kustra, J. Sidtis, S. Frutiger, S. Muley, S. LaConte, and D. Rottenberg. The quantitative evaluation of functional neuroimaging experiments: The NPAIRS data analysis framework. NeuroImage, 15:747-771, 2002.
• Wang, X., R. Hutchinson, and T. M. Mitchell. Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects. Proceedings of the 2003 Conference on Neural Information Processing Systems, Vancouver, December 2003.