SLDM II Trevor Hastie and Rob Tibshirani, Stanford University 1
A case study in genomics:
Biomarker discovery: fact or artefact?
Outline
• a recent controversy in cancer genomics
• some general suggestions for dealing with this problem
A recent experience
• Dave et al. published a high-profile study in NEJM, reporting
that they had found two sets of genes whose expression was
highly predictive of survival in patients with Follicular
Lymphoma.
• The paper got a lot of attention at the recent ASH meeting,
because the genes in the clusters were largely expressed in
non-tumor cells, suggesting that the host response was the
important factor.
• One of my medical collaborators, Ron Levy, asked me to look
over their paper; he wanted to apply their model to the
Stanford FL patient population.
[First page of the embedded article:]

The New England Journal of Medicine, established in 1812. Vol. 351, No. 21, November 18, 2004, p. 2159.

Prediction of Survival in Follicular Lymphoma Based on Molecular Features of Tumor-Infiltrating Immune Cells

Sandeep S. Dave, George Wright, Bruce Tan, Andreas Rosenwald, Randy D. Gascoyne, Wing C. Chan, Richard I. Fisher, Rita M. Braziel, Lisa M. Rimsza, Thomas M. Grogan, Thomas P. Miller, Michael LeBlanc, Timothy C. Greiner, Dennis D. Weisenburger, James C. Lynch, Julie Vose, James O. Armitage, Erlend B. Smeland, Stein Kvaloy, Harald Holte, Jan Delabie, Joseph M. Connors, Peter M. Lansdorp, Qin Ouyang, T. Andrew Lister, Andrew J. Davies, Andrew J. Norton, H. Konrad Muller-Hermelink, German Ott, Elias Campo, Emilio Montserrat, Wyndham H. Wilson, Elaine S. Jaffe, Richard Simon, Liming Yang, John Powell, Hong Zhao, Neta Goldschmidt, Michael Chiorazzi, and Louis M. Staudt.

N Engl J Med 2004;351:2159-69.
Background: Patients with follicular lymphoma may survive for periods of less than 1 year to more than 20 years after diagnosis. We used gene-expression profiles of tumor-biopsy specimens obtained at diagnosis to develop a molecular predictor of the length of survival.

Methods: Gene-expression profiling was performed on 191 biopsy specimens obtained from patients with untreated follicular lymphoma. Supervised methods were used to discover expression patterns associated with the length of survival in a training set of 95 specimens. A molecular predictor of survival was constructed from these genes and validated in an independent test set of 96 specimens.

Results: Individual genes that predicted the length of survival were grouped into gene-expression signatures on the basis of their expression in the training set, and two such signatures were used to construct a survival predictor. The two signatures allowed patients with specimens in the test set to be divided into four quartiles with widely disparate median lengths of survival (13.6, 11.1, 10.8, and 3.9 years), independently of clinical prognostic variables. Flow cytometry showed that these signatures reflected gene expression by nonmalignant tumor-infiltrating immune cells.

Conclusions: The length of survival among patients with follicular lymphoma correlates with the molecular features of nonmalignant immune cells present in the tumor at diagnosis.
Summary of their findings
• They started with the expression of approximately 49,000
genes, measured by DNA microarrays on 189 patient samples. A
survival time (possibly censored) was available for each patient.
• They randomly split the data into a training set of 89 patients
and a test set of 90 patients.
• Using a multi-step procedure (described below), they extracted
two clusters of genes, called IR1 (immune response 1) and IR2
(immune response 2).
• They averaged the gene expression of the genes in each cluster
to create two “super-genes”.
... continued
• They then fit these super-genes together in a Cox model for
survival, and applied it to the training and test sets. The
p-value was less than 10⁻⁷ in the training set and 0.003 in the
test set. IR1 correlates with good prognosis; IR2 with poor
prognosis.
• In the remainder of the paper they interpret the genes in their
model.
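Spelled out, the final model is an ordinary two-covariate Cox proportional hazards model on the supergene averages (the notation below is mine, not the paper's):

```latex
h(t \mid \mathrm{IR1}, \mathrm{IR2}) \;=\; h_0(t)\,
\exp\!\big(\beta_1\,\mathrm{IR1} + \beta_2\,\mathrm{IR2}\big)
```

Here $h_0(t)$ is an unspecified baseline hazard and $\beta_1, \beta_2$ are estimated by maximizing the Cox partial likelihood; the reported p-values test whether the two supergenes carry predictive information. The direction of effect sits in the signs: $\hat\beta_1 < 0$ for the good-prognosis IR1 (lower hazard), $\hat\beta_2 > 0$ for the poor-prognosis IR2.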
What happened next...
• I downloaded the data
• Applied some familiar statistical tools, e.g. SAM (significance
analysis of microarrays), some less familiar ones, e.g. supervised
principal components, and also gave the data to Brad Efron. Our
initial finding: no significant correlation between gene expression
and survival.
• I spent 2-3 weeks emailing back and forth with their
statistician (George Wright) and programming in R to recreate
their analysis.
• Confession: it was fun being a “forensic statistician”.
[Two plots for the downloaded data: a SAM plot of observed score versus expected score, and the ordered p-values for all genes plotted against rank.]
Details of their analysis
1) Divide the data randomly into training and test sets of
approximately equal numbers of patients. Apply the following
recipe [steps 2–6] to the training set.
2) Choose all genes with univariate Cox score > 1.5 in absolute
value. This reduced the number of genes from roughly 49,000
to roughly 3,000, with about a 50-50 split between good
prognosis genes (negative scores) and poor prognosis genes
(positive scores).
3) Do separate hierarchical clusterings (correlation metric, average
linkage) of the good and poor prognosis genes.
Details continued...
4) Find all clusters in the dendrograms (clustering trees)
containing between 25 and 50 genes, with internal correlation
at least 0.5. Represent each cluster by the average expression
of all genes in the cluster – a “supergene”. Try every pair of
supergenes as predictors in Cox models for predicting survival.
5) Choose the most significant pair from this process. The authors
call the resulting pair of clusters IR1 (good prognosis) and IR2
(poor prognosis).
6) Finally, use the pair (IR1, IR2) in a Cox model to predict
survival in the test set.
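A quick sanity check on the step-2 screen (my sketch, not the authors' computation): if the univariate Cox scores behaved roughly like N(0, 1) under a global null — an idealizing assumption — the |score| > 1.5 cutoff alone would still pass thousands of genes purely by chance.

```python
# Sketch: how many of ~49,000 genes pass the step-2 screen
# |Cox score| > 1.5 under a global null, treating each score as
# approximately N(0, 1)?  (An illustrative assumption, not a claim
# about the actual distribution of scores in their data.)
import random

random.seed(1)
n_genes = 49_000
passed = sum(abs(random.gauss(0.0, 1.0)) > 1.5 for _ in range(n_genes))
print(passed, round(passed / n_genes, 3))
# roughly 13% pass: thousands of "prognostic" genes with no signal at all
```

So a screen at this threshold is weak protection when 49,000 candidates are in play; everything downstream (clustering, supergene pairs) is then free to search within noise.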
[Flow chart of the analysis: expression data (49,000 genes, 89 training patients) with survival times; compute univariate Cox scores and keep genes with score > 1.5 (~1,500 positive genes) or < −1.5 (~1,500 negative genes); hierarchical clustering of each group; extract clusters with between 25 and 50 genes and correlation > 0.5; average each cluster into a supergene; try all possible pairs of supergenes in a Cox survival model; apply the best model to the test set of 90 patients.]
[Embedded page from the article (p. 2163): panels showing the genes associated with favorable and with poor prognosis in the immune-response 1 and immune-response 2 signatures (heat maps of relative expression about the median), survival curves, and a flow diagram: 191 follicular-lymphoma biopsy specimens split into a training set (N = 95) and a test set (N = 96); genes with expression patterns associated with favorable and with poor prognosis identified; hierarchical clustering used to identify survival signatures; gene-expression levels averaged for genes in each survival signature; an optimal multivariate model created from the signature averages; the survival predictor validated in the test set.]
[Embedded page from the article (p. 2165). Figure 2: Development of a Molecular Predictor of Survival in Follicular Lymphoma. Panel A shows overall survival among the patients with biopsy specimens in the test set, according to the quartile of the survival-predictor score (SPS); median survival in the four quartiles was 13.6, 11.1, 10.8, and 3.9 years. Panel B shows overall survival according to the International Prognostic Index (IPI) risk group for all the patients for whom these data were available. Panels C and D show overall survival among the patients with specimens in the test set for the indicated IPI risk group (0 or 1; 2 or 3), stratified according to the quartile of the SPS. P < 0.001 in each panel.]
p-values of all cluster pairs
[Scatterplot: test-set p-value versus training-set p-value for all cluster pairs.]

The total number of points (cluster pairs) with test-set p-values less
than 0.05 (239) is far fewer than we’d expect to see by chance (735).
Univariate p-values
[Two scatterplots: test-set p-value versus training-set p-value for the individual poor-prognosis genes (left) and good-prognosis genes (right).]
Swapping train and test sets
[Scatterplot: test-set p-value versus training-set p-value for all cluster pairs, with the roles of the training and test sets exchanged.]

There are only 85 pairs out of 11,628 that are significant in the test
set at the 0.05 level, while we would expect 11,628 × 0.05 ≈ 581 pairs
just by chance.
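The arithmetic behind that expectation is worth making explicit; the binomial yardstick below assumes independent tests, which these are not (the pairs share genes), so treat it only as a rough scale.

```python
# Expected number of test-set p-values below 0.05 among 11,628 cluster
# pairs if none of the pairs carried real signal:
import math

n_pairs, alpha = 11_628, 0.05
expected = n_pairs * alpha
print(round(expected, 1))            # 581.4

# If the tests were independent (they are not), the count of
# significant pairs would be Binomial(11628, 0.05):
sd = math.sqrt(n_pairs * alpha * (1 - alpha))
print(round(sd, 1))                  # ~23.5; 85 observed is far below 581
```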
Swapping train and test sets, ctd...
[Two scatterplots: test-set p-value versus training-set p-value for the individual poor-prognosis genes (left) and good-prognosis genes (right), with the roles of the training and test sets exchanged.]
Cluster size ranges (30,60) rather than (25,50)
[Scatterplot: test-set p-value versus training-set p-value for all cluster pairs, using cluster size range (30, 60).]
The aftermath
• I published a short letter in NEJM in March 2005; full details
of my re-analysis appear on my website.
• The authors published a rebuttal in the same issue:
“Nothing in Tibshirani’s analysis calls into dispute the fact that
we discovered and validated a strong association between gene
expression in follicular lymphoma and overall survival.”
Their arguments:
– (1) we followed standard statistical procedures and found a
small p-value on the test set; therefore our finding is correct;
– (2) our method found an interaction, which SAM can’t find;
– (3) we get small p-values if we apply our model to random
halves of the data (????!!!!!!)
[Embedded correspondence page, N Engl J Med 2005;352:1497, April 7, 2005:]
Our predictor could not have been discovered with the SAM method, which relies solely on univariate associations with survival. Rather, our predictor derives its strength from the synergistic combination of two gene-expression signatures in a multivariate model. Tibshirani confuses the ability of our method to discover a survival association with the fact that we actually found one that validated the association. When he exchanged the training set for the test set, he was unable to rediscover our gene-expression predictor because some genes in our predictor fell below the P value threshold for association with survival in the test set. This does not negate the fact that our model is highly associated with survival in the test set.

Hong et al. have made three errors. First, in our sorted subpopulations, the CD19− fraction contained, on average, 12.6 percent contamination with follicular-lymphoma cells, not 25 percent, as they claim. To believe that the higher expression of the immune-response signatures in the CD19− fraction is due to this 12.6 percent contamination requires that the lymphoma cells in the CD19− fraction have expression of the immune-response signatures that was more than eight times as high as that in their counterparts in the CD19+ fraction. Second, Hong et al. incorrectly discount the immune-response 1 signature, which contributes significantly to the survival model in the test set (P<0.001). Third, many of the immune-response signature genes are selectively expressed in T cells, monocytes, or dendritic cells, or in more than one of these, but not in B cells, making the contention of Hong et al. even more implausible.

Louis M. Staudt, M.D., Ph.D., George Wright, Ph.D., and Sandeep Dave, M.D.
National Cancer Institute, Bethesda, MD; [email protected]
[Figure 1 of the rebuttal: Results of Samplings of Half-Sets and Association between Gene Expression and Survival. Histogram of the number of half-sets, out of 2000 samplings, against the P value (10⁻¹¹ to 10⁰) for association of the gene-expression-based model with survival.]
General comments
• Their finding is fragile. I don’t believe that it is real or
reproducible.
• This experience uncovers a problem that is of general
importance to our field:
– with many predictors, it is too easy to overfit the data and
find spurious results;
– we can inadvertently mislead the reader, and mislead
ourselves. I have been guilty of this too.
Some recommendations
• encourage authors to publish not only the raw data, but a
script of their analysis
• encourage authors to use “canned” methods/packages, with
built-in cross-validation to validate the model search process -
see “supervised principal components” later today
• develop crude, global measures of the “information content” in
a dataset: see “tail strength” later today
• develop measures of the fragility of an analysis
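The cross-validation point deserves emphasis: if the screening and model search happen outside the loop, cross-validation validates nothing. A self-contained toy sketch (my construction — correlation screening plus a 1-nearest-neighbor classifier on pure-noise data, not the methods in the paper or in any particular package):

```python
# Sketch: "wrong" vs "right" cross-validation of a model search,
# on data that is pure noise (labels independent of features).
import math
import random

random.seed(0)
n, p, k = 40, 1000, 10          # samples, features, features kept by the screen

X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.randrange(2) for _ in range(n)]   # random labels: no signal

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((z - mb) ** 2 for z in b))
    return num / (da * db) if da * db else 0.0

def top_features(rows, labels):
    """Indices of the k features most correlated with the labels."""
    scores = [(abs(corr([r[j] for r in rows], labels)), j) for j in range(p)]
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def nn_errors(train_idx, test_idx, feats):
    """1-nearest-neighbor error count on the given feature subset."""
    errs = 0
    for i in test_idx:
        best, pred = None, None
        for t in train_idx:
            d = sum((X[i][j] - X[t][j]) ** 2 for j in feats)
            if best is None or d < best:
                best, pred = d, y[t]
        errs += (pred != y[i])
    return errs

folds = [list(range(f, n, 5)) for f in range(5)]

# WRONG: screen features on ALL the data, then cross-validate the classifier.
feats_all = top_features(X, y)
wrong = sum(nn_errors([i for i in range(n) if i not in fold], fold, feats_all)
            for fold in folds) / n

# RIGHT: redo the screening from scratch inside every fold.
right = 0
for fold in folds:
    tr = [i for i in range(n) if i not in fold]
    feats = top_features([X[i] for i in tr], [y[i] for i in tr])
    right += nn_errors(tr, fold, feats)
right /= n

print(wrong, right)   # wrong-way error looks impressive; right-way hovers near 0.5
```

On noise, the honest procedure reports error near 50%, while screening on the full data first makes the very same classifier look predictive. The identical logic applies to the Cox-score screening and cluster-pair search in the paper.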
A sobering point of view
[First page of the embedded essay:]

Why Most Published Research Findings Are False
John P. A. Ioannidis

Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124. Open access, August 2005, Volume 2, Issue 8. DOI: 10.1371/journal.pmed.0020124.

Summary: There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.

Modeling the Framework for False Positive Findings

Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. “Negative” research is also very useful. “Negative” is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.

As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R⁄(R − βR + α). A research finding is thus more likely true than false if (1 − β)R > α.
• Let R = the ratio of “true relationships” to “no relationships”
among those tested in a field.
• We carry out a study with type I error α and power 1 − β,
probing c relationships. Then, given a positive finding from a
study, the author derives its PPV (positive predictive value).
• He models the effect of bias u = the proportion of analyses that
would not have been “research findings” if the study had been
carried out in an unbiased way.
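The PPV he derives is simple enough to compute directly. A sketch (the function and its bias-adjusted form follow the essay's derivation as I read it; the code itself is mine) that reproduces the numbers in the essay's worked example of a 100,000-polymorphism scan:

```python
# Sketch: Ioannidis's positive predictive value of a claimed research
# finding, with and without bias u.
def ppv(R, alpha, beta, u=0.0):
    """Post-study probability that a claimed finding is true.

    R: pre-study odds of a true relationship; alpha: type I error;
    beta: type II error (power = 1 - beta); u: bias proportion."""
    num = (1 - beta) * R + u * beta * R
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

# The essay's worked example: 10 true associations among 100,000
# polymorphisms tested, power 0.60, alpha = 0.05.
R = 10 / 100_000
print(ppv(R, 0.05, 0.40))          # ~12 x 10^-4, as in the essay
print(ppv(R, 0.05, 0.40, u=0.10))  # ~4.4 x 10^-4 with bias u = 0.10
```

With u = 0, the expression reduces to the familiar PPV = (1 − β)R⁄(R − βR + α).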
Main results
Research findings are less likely to be true:
• the smaller the studies conducted in a scientific field,
• the smaller the effect sizes in a scientific field,
• the greater the number and the lesser the selection of tested
relationships in a scientific field,
• the greater the flexibility in designs, definitions, outcomes, and
analytical modes in a scientific field,
• the greater the financial and other interests and prejudices in a
scientific field,
• the hotter a scientific field (with more scientific teams involved).
An example
[Embedded page from the essay (p. 0699):]

…alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].

These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.

Most Research Findings Are False for Most Research Designs and for Most Fields

In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to “correct” the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance of being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable…

Box 1. An Example: Science at Low Pre-Study Odds

Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10⁻⁴, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10⁻⁴. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10⁻⁴.

Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available “data mining” packages actually are proud of their ability to yield statistically significant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only 4.4 × 10⁻⁴. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10⁻⁴, hardly any higher than the probability we had before any of this extensive research was undertaken!

Figure 2. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Numbers of Conducted Studies, n. Panels correspond to power of 0.20, 0.50, and 0.80. (DOI: 10.1371/journal.pmed.0020124.g002)