November 18, 2002 Stanford Statistics
Supervised Learning from Micro-Array Data:
Data Mining with Care

Trevor Hastie, Stanford University
November 18, 2002

joint work with Robert Tibshirani, Balasubramanian
Narasimhan, Gil Chu, Pat Brown and David Botstein
DNA microarrays
• Exciting new technology for measuring gene expression of tens of thousands of genes SIMULTANEOUSLY in a single sample of cells
• first multivariate, quantitative way of measuring gene expression
• a key idea: to find genes, follow around messenger RNA
• also known as “gene chips” — there are a number of different technologies: Affymetrix, Incyte, Brown Lab, ...
• techniques for analysis of microarray data are also applicable to SNP data, protein arrays, etc.
DNA microarray process
The entire Yeast genome on a chip
Statistical challenges
• Typically have ∼ 5,000–40,000 genes measured over ∼ 50–100 samples.
• Goal: understand patterns in data, and look for genes that explain known features in the samples.
• Biologists don’t want to miss anything (low Type II error). Statisticians have to help them to appreciate Type I error, and find ways to get a handle on it.
Types of problems
• Preprocessing: Li, Wong (dChip), Speed
• The analysis of expression arrays can be separated into: unsupervised — “class discovery” — and supervised — “class prediction”.
• In unsupervised problems, only the expression data is available. Clustering techniques are popular: hierarchical (Eisen’s TreeView — next slide), K-means, SOMs, block-clustering, gene-shaving (H&T), plaid models (Owen & Lazzeroni). SVD is also useful.
• In supervised problems, a response measurement is available for each sample, for example a survival time or cancer class.
Two-way hierarchical clustering
Molecular portraits of Breast Cancer — Perou et al., Nature 2000
Some editorial comments
• Most statistical work in this area is being done by non-statisticians.
• Journals are filled with papers of the form “Application of <machine-learning method> to Microarray Data”.
• Many are a waste of time. P ≫ N, i.e. many more variables (genes) than samples. Data-mining research has produced exotic enhancements of standard statistical models for the N ≫ P situation (neural networks, boosting, ...). Here we need to restrict the standard models; we cannot even do linear regression.
• Simple is better: a complicated method is only worthwhile if it works significantly better than the simple one.
• Give scientists good statistical software, with methods they can understand. They know their science better than you. With your help, they can do a better job analyzing their data than you can do alone.
• Software should be easy to install (e.g. R) and easy to use (e.g. SAM is an Excel add-in).
SAM
How to do 7000 t-tests all at once!
Significance Analysis of Microarrays (Tusher, Tibshirani and Chu, 2001).
[Figure: histogram of the t statistics across all genes, ranging from about −10 to 10]
• At what threshold should we call a gene significant?
• How many false positives can we expect?
SAM plot
[SAM plot: observed d(i) versus expected d(i), with a band of half-width ∆ around the 45° line]
∆     Ave # falsely significant   # Called significant   False discovery rate
0.3   75.1                        294                    .255
0.4   33.6                        196                    .171
0.5   19.8                        160                    .123
0.7   10.1                         94                    .107
1.0    4.0                         46                    .086

∆ is the half-width of the band around the 45° line.
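The permutation logic behind SAM can be sketched as follows. This is a minimal illustration, not the published procedure: the fudge factor s0 is fixed rather than estimated, and genes are called by a simple ±∆ band around the expected order statistics rather than SAM's exact cut-point search. All function names and the simulated data are illustrative.

```python
import numpy as np

def d_stat(x, y, s0=0.05):
    """Modified two-sample t statistic per gene (rows): difference of
    group means over (pooled standard error + fudge factor s0)."""
    n1, n2 = x.shape[1], y.shape[1]
    m1, m2 = x.mean(axis=1), y.mean(axis=1)
    ss = ((x - m1[:, None]) ** 2).sum(axis=1) + ((y - m2[:, None]) ** 2).sum(axis=1)
    se = np.sqrt(ss / (n1 + n2 - 2)) * np.sqrt(1.0 / n1 + 1.0 / n2)
    return (m2 - m1) / (se + s0)

def sam_call(data, labels, delta, n_perm=100, s0=0.05, seed=0):
    """Call genes whose sorted d(i) falls more than delta outside the
    expected (permutation-averaged) order statistics, and estimate the
    average number of false calls from the permutations themselves."""
    rng = np.random.default_rng(seed)
    d_obs = np.sort(d_stat(data[:, labels == 0], data[:, labels == 1], s0))
    perm = np.empty((n_perm, data.shape[0]))
    for b in range(n_perm):
        lab = rng.permutation(labels)
        perm[b] = np.sort(d_stat(data[:, lab == 0], data[:, lab == 1], s0))
    d_exp = perm.mean(axis=0)                     # expected d(i) under the null
    n_called = int((np.abs(d_obs - d_exp) > delta).sum())
    n_false = float((np.abs(perm - d_exp) > delta).sum(axis=1).mean())
    fdr = n_false / max(n_called, 1)
    return n_called, n_false, fdr
```

Larger ∆ calls fewer genes but lowers the estimated false discovery rate, which is the trade-off shown in the table above.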
• SAM is already popular in genetics and biochemistry labs at Stanford and worldwide
• SAM is freely available for academic and non-profit use. The SAM site is: www-stat.stanford.edu/∼tibs/SAM
• For commercial use, software is available for licensing from the Stanford Office of Technology Licensing: [email protected]
Nearest Prototype Classification
An extremely simple classifier that
• performs well on test data, and
• produces subsets of informative genes.
Tibshirani, Hastie, Narasimhan and Chu (2002),
“Diagnosis of multiple cancer types by shrunken
centroids of gene expression”, PNAS 99:6567–6572
(May 14, 2002).
Classification of Samples
Example: small round blue cell tumors; Khan et al., Nature Medicine, 2001
• Tumors classified as BL (Burkitt lymphoma), EWS (Ewing sarcoma), NB (neuroblastoma) and RMS (rhabdomyosarcoma).
• There are 63 training samples and 25 test samples, although five of the latter were not SRBCTs. 2308 genes.
• Khan et al. report zero training and test errors, using a complex neural network model. They decided that 96 genes were “important”.
• Upon close examination, the network is linear. It is essentially extracting linear principal components, and classifying in their subspace.
• But even principal components is unnecessarily complicated for this problem!
Khan data
[Figure: expression heatmap of the Khan data, with samples grouped by class: BL, EWS, NB, RMS]
Khan’s neural network
Nearest Shrunken Centroids
[Figure: four panels (BL, EWS, NB, RMS) showing the shrunken class centroids: average expression (−1.0 to 1.0) plotted against gene index (0 to 2000+)]
Centroids are shrunk towards the overall centroid using soft-thresholding. Classification is to the nearest shrunken centroid.
Nearest Shrunken Centroids
• Simple; includes the nearest centroid classifier as a special case.
• Thresholding denoises large effects, and sets small ones to zero, thereby selecting genes.
• With more than two classes, the method can select different genes, and different numbers of genes, for each class.
• Still very simple. In statistical parlance, this is a restricted version of a naive Bayes classifier (also called idiot’s Bayes!)
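The shrink-and-classify step can be sketched as follows. This is a simplified sketch of the method in Tibshirani et al. (2002), not the exact published algorithm: it uses a plain per-gene standard deviation in place of the exact pooled within-class scale (with its m_k and s_0 terms) and omits class prior probabilities. All names are illustrative.

```python
import numpy as np

def soft_threshold(d, delta):
    """Shrink each value toward 0 by delta; values within ±delta become exactly 0."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

class ShrunkenCentroids:
    """Minimal nearest-shrunken-centroid classifier (sketch)."""

    def fit(self, X, y, delta):
        # X: samples x genes; y: integer class labels
        self.classes = np.unique(y)
        self.overall = X.mean(axis=0)
        self.s = X.std(axis=0) + 1e-8  # per-gene scale (simplified)
        centroids = []
        for k in self.classes:
            # standardized difference of class centroid from overall centroid
            d = (X[y == k].mean(axis=0) - self.overall) / self.s
            # shrink toward the overall centroid; small differences vanish,
            # so those genes drop out of the classifier entirely
            centroids.append(self.overall + self.s * soft_threshold(d, delta))
        self.centroids = np.array(centroids)
        return self

    def predict(self, X):
        # classify each sample to the nearest shrunken centroid,
        # measuring distance in standardized coordinates
        dist = (((X[:, None, :] - self.centroids[None]) / self.s) ** 2).sum(axis=2)
        return self.classes[dist.argmin(axis=1)]
```

Genes whose shrunken centroids all coincide with the overall centroid contribute nothing to the distances, which is how the method selects genes.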
Results on Khan data
At the optimal point chosen by 10-fold CV, there are 27 active genes, and 0 training and 0 test errors.
[Figure: training (tr), 10-fold CV (cv) and test (te) error versus the amount of shrinkage ∆, with the corresponding number of active genes (2308 down to 0) along the top axis]
Predictions and Probabilities
[Figure: estimated class probabilities (0 to 1) for each training sample (top panel, 63 samples) and each test sample (bottom panel, 25 samples), grouped by class: BL, EWS, NB, RMS]
The genes that matter
[Figure: the 27 selected genes, labeled by clone ID, for the classes BL, EWS, NB and RMS]
Heatmap of selected genes
[Heatmap of the 27 selected genes across samples, grouped by class: BL, EWS, NB, RMS]
PAM — “Prediction Analysis for Microarrays”
R package, available at
http://www-stat.stanford.edu/∼tibs/PAM
Leukemia classification
Golub et al. 1999, Science. They use a “voting” procedure for each gene, where votes are based on a t-like statistic.

Method              CV err   Test err
Golub               3/38     4/34
Nearest Prototype   1/38     2/34
Breast Cancer classification
Hedenfalk et al. 2001, NEJM. They use a “compound predictor” ∑j wj xj, where the weights wj are t-statistics.

Method              BRCA1+   BRCA1-   BRCA2+   BRCA2-
Hedenfalk et al.    3/7      2/15     3/8      1/14
Nearest Prototype   2/7      1/15     2/8      0/14
Summary
• With large numbers of genes as variables (P ≫ N), we have to learn how to tame even our simplest supervised learning techniques. Even linear models are way too aggressive.
• We need to talk with the biologists to learn their priors; i.e. genes work in groups.
• We need to provide them with easy-to-use software to try out our ideas, and involve them in the design. They are much better at understanding their data than we are.