November 18, 2002 Stanford Statistics
Supervised Learning from Micro-Array Data:
Data Mining with Care

Trevor Hastie, Stanford University
November 18, 2002

joint work with Robert Tibshirani, Balasubramanian
Narasimhan, Gil Chu, Pat Brown and David Botstein
DNA microarrays
• Exciting new technology for measuring gene expression of tens of thousands of genes SIMULTANEOUSLY in a single sample of cells
• first multivariate, quantitative way of measuring gene expression
• a key idea: to find genes, follow around messenger RNA
• also known as “gene chips” — there are a number of different technologies: Affymetrix, Incyte, Brown Lab, ...
• techniques for analysis of microarray data are also applicable to SNP data, protein arrays, etc.
DNA microarray process
The entire Yeast genome on a chip
Statistical challenges
• Typically have ∼ 5,000–40,000 genes measured over ∼ 50–100 samples.
• Goal: understand patterns in data, and look for genes that explain known features in the samples.
• Biologists don’t want to miss anything (low Type II error). Statisticians have to help them to appreciate Type I error, and find ways to get a handle on it.
Types of problems
• Preprocessing: Li, Wong (dChip), Speed
• The analysis of expression arrays can be separated into: unsupervised — “class discovery” — and supervised — “class prediction”.
• In unsupervised problems, only the expression data is available. Clustering techniques are popular: hierarchical (Eisen’s TreeView — next slide), K-means, SOMs, block-clustering, gene-shaving (H&T), plaid models (Owen & Lazzeroni). SVD is also useful.
• In supervised problems, a response measurement is available for each sample, for example a survival time or cancer class.
Two-way hierarchical clustering
Molecular portraits of Breast Cancer — Perou et al., Nature 2000
Some editorial comments
• Most statistical work in this area is being done by non-statisticians.
• Journals are filled with papers of the form “Application of <machine-learning method> to Microarray Data”.
• Many are a waste of time. P ≫ N, i.e. many more variables (genes) than samples. Data-mining research has produced exotic enhancements of standard statistical models for the N ≫ P situation (neural networks, boosting, ...). Here we need to restrict the standard models; we cannot even do linear regression.
• Simple is better: a complicated method is only worthwhile if it works significantly better than the simple one.
• Give scientists good statistical software, with methods they can understand. They know their science better than you. With your help, they can do a better job analyzing their data than you can do alone.
• Software should be easy to install (e.g. R) and easy to use (e.g. SAM is an Excel add-in).
SAM
How to do 7000 t-tests all at once!
Significance Analysis of Microarrays (Tusher, Tibshirani and Chu, 2001).
[Figure: histogram of the t statistics across all genes, ranging from about −10 to 10]
• At what threshold should we call a gene significant?
• How many false positives can we expect?
SAM plot
[SAM plot: observed d(i) versus expected d(i), with a band of half-width ∆ around the 45° line]
∆     Ave # falsely significant   # Called significant   False discovery rate
0.3   75.1                        294                    .255
0.4   33.6                        196                    .171
0.5   19.8                        160                    .123
0.7   10.1                         94                    .107
1.0    4.0                         46                    .086

∆ is the half-width of the band around the 45° line.
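The permutation logic behind SAM can be sketched as follows. This is a minimal illustration, not the published procedure: the fudge factor s0 is fixed rather than estimated, and genes are called by a simple ±∆ band around the expected order statistics rather than SAM's exact cut-point search. All function names and the simulated data are illustrative.

```python
import numpy as np

def d_stat(x, y, s0=0.05):
    """Modified two-sample t statistic per gene (rows): difference of
    group means over (pooled standard error + fudge factor s0)."""
    n1, n2 = x.shape[1], y.shape[1]
    m1, m2 = x.mean(axis=1), y.mean(axis=1)
    ss = ((x - m1[:, None]) ** 2).sum(axis=1) + ((y - m2[:, None]) ** 2).sum(axis=1)
    se = np.sqrt(ss / (n1 + n2 - 2)) * np.sqrt(1.0 / n1 + 1.0 / n2)
    return (m2 - m1) / (se + s0)

def sam_call(data, labels, delta, n_perm=100, s0=0.05, seed=0):
    """Call genes whose sorted d(i) falls more than delta outside the
    expected (permutation-averaged) order statistics, and estimate the
    average number of false calls from the permutations themselves."""
    rng = np.random.default_rng(seed)
    d_obs = np.sort(d_stat(data[:, labels == 0], data[:, labels == 1], s0))
    perm = np.empty((n_perm, data.shape[0]))
    for b in range(n_perm):
        lab = rng.permutation(labels)
        perm[b] = np.sort(d_stat(data[:, lab == 0], data[:, lab == 1], s0))
    d_exp = perm.mean(axis=0)                     # expected d(i) under the null
    n_called = int((np.abs(d_obs - d_exp) > delta).sum())
    n_false = float((np.abs(perm - d_exp) > delta).sum(axis=1).mean())
    fdr = n_false / max(n_called, 1)
    return n_called, n_false, fdr
```

Larger ∆ calls fewer genes but lowers the estimated false discovery rate, which is the trade-off shown in the table above.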
• SAM is already popular in genetics and biochemistry labs at Stanford and worldwide
• SAM is freely available for academic and non-profit use. The SAM site is: www-stat.stanford.edu/∼tibs/SAM
• For commercial use, software is available for licensing from the Stanford Office of Technology Licensing: [email protected]
Nearest Prototype Classification
An extremely simple classifier that
• performs well on test data, and
• produces subsets of informative genes.
Tibshirani, Hastie, Narasimhan and Chu (2002),
“Diagnosis of multiple cancer types by shrunken
centroids of gene expression”, PNAS 99:6567–6572
(May 14, 2002).
Classification of Samples
Example: small round blue cell tumors; Khan et al., Nature Medicine, 2001
• Tumors classified as BL (Burkitt lymphoma), EWS (Ewing sarcoma), NB (neuroblastoma) and RMS (rhabdomyosarcoma).
• There are 63 training samples and 25 test samples, although five of the latter were not SRBCTs. 2308 genes.
• Khan et al. report zero training and test errors, using a complex neural network model. They decided that 96 genes were “important”.
• Upon close examination, the network is linear. It is essentially extracting linear principal components, and classifying in their subspace.
• But even principal components is unnecessarily complicated for this problem!
Khan data
[Figure: expression heatmap of the Khan data, with samples grouped by class: BL, EWS, NB, RMS]
Khan’s neural network
Nearest Shrunken Centroids
[Figure: four panels (BL, EWS, NB, RMS) showing the shrunken class centroids: average expression (−1.0 to 1.0) plotted against gene index (0 to 2000+)]
Centroids are shrunk towards the overall centroid using soft-thresholding. Classification is to the nearest shrunken centroid.
Nearest Shrunken Centroids
• Simple; includes the nearest centroid classifier as a special case.
• Thresholding denoises large effects, and sets small ones to zero, thereby selecting genes.
• With more than two classes, the method can select different genes, and different numbers of genes, for each class.
• Still very simple. In statistical parlance, this is a restricted version of a naive Bayes classifier (also called idiot’s Bayes!)
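The shrink-and-classify step can be sketched as follows. This is a simplified sketch of the method in Tibshirani et al. (2002), not the exact published algorithm: it uses a plain per-gene standard deviation in place of the exact pooled within-class scale (with its m_k and s_0 terms) and omits class prior probabilities. All names are illustrative.

```python
import numpy as np

def soft_threshold(d, delta):
    """Shrink each value toward 0 by delta; values within ±delta become exactly 0."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

class ShrunkenCentroids:
    """Minimal nearest-shrunken-centroid classifier (sketch)."""

    def fit(self, X, y, delta):
        # X: samples x genes; y: integer class labels
        self.classes = np.unique(y)
        self.overall = X.mean(axis=0)
        self.s = X.std(axis=0) + 1e-8  # per-gene scale (simplified)
        centroids = []
        for k in self.classes:
            # standardized difference of class centroid from overall centroid
            d = (X[y == k].mean(axis=0) - self.overall) / self.s
            # shrink toward the overall centroid; small differences vanish,
            # so those genes drop out of the classifier entirely
            centroids.append(self.overall + self.s * soft_threshold(d, delta))
        self.centroids = np.array(centroids)
        return self

    def predict(self, X):
        # classify each sample to the nearest shrunken centroid,
        # measuring distance in standardized coordinates
        dist = (((X[:, None, :] - self.centroids[None]) / self.s) ** 2).sum(axis=2)
        return self.classes[dist.argmin(axis=1)]
```

Genes whose shrunken centroids all coincide with the overall centroid contribute nothing to the distances, which is how the method selects genes.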
Results on Khan data
At the optimal point chosen by 10-fold CV, there are 27 active genes, and 0 training and 0 test errors.
[Figure: training (tr), 10-fold CV (cv) and test (te) error versus the amount of shrinkage ∆, with the corresponding number of active genes (2308 down to 0) along the top axis]
Predictions and Probabilities
[Figure: estimated class probabilities (0 to 1) for each training sample (top panel, 63 samples) and each test sample (bottom panel, 25 samples), grouped by class: BL, EWS, NB, RMS]
The genes that matter
[Figure: the 27 selected genes, labeled by clone ID, for the classes BL, EWS, NB and RMS]
Heatmap of selected genes
[Heatmap of the 27 selected genes across samples, grouped by class: BL, EWS, NB, RMS]
PAM — “Prediction Analysis for Microarrays”
R package, available at
http://www-stat.stanford.edu/∼tibs/PAM
Leukemia classification
Golub et al. 1999, Science. They use a “voting” procedure for each gene, where votes are based on a t-like statistic.

Method              CV err   Test err
Golub               3/38     4/34
Nearest Prototype   1/38     2/34
Breast Cancer classification
Hedenfalk et al. 2001, NEJM. They use a “compound predictor” ∑j wj xj, where the weights wj are t-statistics.

Method              BRCA1+   BRCA1-   BRCA2+   BRCA2-
Hedenfalk et al.    3/7      2/15     3/8      1/14
Nearest Prototype   2/7      1/15     2/8      0/14
Summary
• With large numbers of genes as variables (P ≫ N), we have to learn how to tame even our simplest supervised learning techniques. Even linear models are way too aggressive.
• We need to talk with the biologists to learn their priors; i.e. genes work in groups.
• We need to provide them with easy-to-use software to try out our ideas, and involve them in the design. They are much better at understanding their data than we are.