EPI293: Design and analysis of gene association studies
Winter Term 2008
Lecture 7: Genome-wide association scans
Peter Kraft
[email protected] 2 Rm 206
2-4271
[Figure: timeline of milestones; axis marks 1900, 1920, 1940, 1960, 1980, 1990, 2000, 2005, 2006, 2007]
• Rediscovery of Mendel’s laws
• Principles of linkage analysis discovered
• Association between blood groups and malignant disease published
• Association between blood groups and malignant disease fails to replicate
• RFLPs available for linkage analysis
• Microsatellite maps for genome-wide linkage analysis developed
• Human Genome Project launched
• Risch and Merikangas paper
• Human Genome Project working draft completed; beginnings of SNP map
• HapMap launched
• First genome-wide association study
• HapMap Phase I completed (draft Phase II available)
• Genome-wide SNP panels developed
5 December 2007 [email protected]
[Figure: schematic of individual genotypes (GG, Gg, gg) among cases and controls, with a table of counts (columns G, g; Control: 1, 4; Case: 4, 1)]
Linkage vs. Association
• Linkage studies
  – Pro: can scan genome with fewer markers
  – Cons: can only detect alleles with large effect; limited resolution (identify broad region, not individual genes); requires data on multiple family members
• Association studies
  – Pros: can detect subtle effects; very fine resolution
  – Cons: requires 0.5 to 1 million markers to cover whole genome; requires large sample size
Risch and Merikangas (1996) Science 273:1516-7
Schloterer C. Nat Rev Genet. 2004;5:63-9.
Published Genome-Wide Association Scans
• Ozaki K. Myocardial infarction. Nat Genet 2002;32:650–4.
• Klein RJ. Age-related macular degeneration. Science 2005;308:385–9.
• Maraganore DM. Parkinson disease. Am J Hum Genet 2005;77:685–93.
• Shiffman D. Myocardial infarction. Am J Hum Genet 2005;77:596–605.
• Cheung VG. Gene expression. Nature 2005;437:1365-9.
• Stranger BE. Gene expression. PLoS Genet 2005;1:695-704.
• Mah S. Schizophrenia. Mol Psychiatry 2006;11:471-8.
• Herbert A. Obesity. Science 2006;312:279-83.
Reviews
• Hirschhorn J. Nat Reviews Genet 2005;6:95-108.
• Wang WY. Nat Reviews Genet 2005;6:109-18.
• Thomas DC. Am J Hum Genet 2005;77:337-45.
• Thomas DC. Cancer Epidemiol Biomarkers Prev 2006;15:595-8.
• Evans DM. Trends in Genetics 2006 (epub)
OLD SLIDE!!!!
Klein RJ. Science 2005;308:385–9
96 cases, 50 controls
103,611 SNPs
rs380390: recessive OR 7.4 (2.9-19); PAR (70%)
• Genotyping errors
• Functionality
• Replication: Science 2005;308:421–4; Science 2005;308:419–21
            Tier 1           Tier 2
Subjects    443 sib pairs    332 matched unrelated case-control pairs
SNPs        198,000          3,148
No SNPs pass the Bonferroni-corrected significance threshold (2.5 × 10^-7).
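The threshold quoted on this slide can be checked with a one-line calculation. The sketch below assumes a family-wise error rate of 0.05 spread over the 198,000 Tier 1 SNPs; the 0.05 is an assumption inferred from the quoted threshold, and the function name is illustrative.

```python
def bonferroni_threshold(family_wise_alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at family_wise_alpha."""
    return family_wise_alpha / n_tests

# Assumed inputs: alpha = 0.05 (not stated on the slide), 198,000 Tier 1 SNPs.
tier1_threshold = bonferroni_threshold(0.05, 198_000)
print(f"Tier 1 per-SNP threshold: {tier1_threshold:.2e}")
```

This gives roughly 2.5 × 10^-7, in line with the slide.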
Maraganore Am J Hum Genet 2005 77:685-93
Known Breast Cancer Genes, November 2006
Known Prostate Cancer Genes, November 2006
Known Breast Cancer Genes, Fall 2007
Known Prostate Cancer Genes, Fall 2007
Kraft and Cox 2008 in: Rao and Gu, eds.
Outline
• Power issues
  – Tagging efficiency of genome-wide panels
  – Multi-stage design and analysis
• Design issues
• Analytic issues
  – Imputation
• CGEMS examples
[Figure: known (typed) and unknown (untyped) variants, linked by r2]
Barrett JC. Nat Genet 2006;38:659-62.
Pe’er I. Nat Genet 2006;38:663-7.
International HapMap Consortium. Nature. 2007 Oct 18;449(7164):851-61
Distribution of max r2 with tag panel as a function of MAF
[Figure: stacked bars, one per MAF bin (< 5%, 5-12.5%, 12.5-25%, 25-37.5%, 37.5-50%); y-axis from 0.00 to 1.00; legend: max r2 of 0, .01-.30, .32-.60, .61-.80, .81-.90, .90-1.00]
Tags chosen from a “pseudo Phase II HapMap” and evaluated against ENCODE SNPs
The fundamental theorem of the HapMap
A study that genotypes N cases and N controls at a marker that has a correlation of r2 with a disease susceptibility locus has the same power as a study that genotypes N* = r2 N cases and N* controls at the disease susceptibility locus itself.
Power adjusting for tagging efficiency
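$\overline{\mathrm{Pow}}(N) = \int \mathrm{Pow}(r^2 N)\, f(r^2)\, dr^2$, where $f$ is the distribution of max $r^2$ with the tag panel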
Pritchard JK. Am J Hum Genet 2001;69:1-14.
Jorgenson Am J Hum Genet 2006;78:884-8.
Terwilliger JD Eur J Hum Genet 2006;14:426-37.
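As a rough illustration of the r2 N rule and of averaging power over the distribution of r2, here is a minimal sketch. It assumes a two-sided allelic test with a normal approximation, a per-allele odds ratio, and an invented discrete r2 distribution (`hypothetical_r2_dist`); none of these choices come from the slides, and the numbers are for illustration only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def case_allele_freq(p0, odds_ratio):
    """Allele frequency in cases implied by control frequency p0 and a per-allele odds ratio."""
    odds = odds_ratio * p0 / (1.0 - p0)
    return odds / (1.0 + odds)

def allelic_power(n_cases, p0, odds_ratio, alpha, r2=1.0):
    """Approximate power with n_cases cases and n_cases controls.

    Indirect association is handled by shrinking the sample size to r2 * n_cases.
    """
    n_eff = r2 * n_cases
    if n_eff <= 0.0:
        return alpha  # marker carries no information about the causal locus
    p1 = case_allele_freq(p0, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    # 2 * n_eff alleles per group, so var(difference) = p_bar * (1 - p_bar) / n_eff
    se = sqrt(p_bar * (1.0 - p_bar) / n_eff)
    shift = abs(p1 - p0) / se
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)

def averaged_power(n_cases, p0, odds_ratio, alpha, r2_dist):
    """Power averaged over a discrete distribution {r2: probability}."""
    return sum(w * allelic_power(n_cases, p0, odds_ratio, alpha, r2)
               for r2, w in r2_dist.items())

# Hypothetical r2 distribution for a tag panel (made up for illustration).
hypothetical_r2_dist = {0.0: 0.05, 0.3: 0.05, 0.6: 0.10, 0.8: 0.20, 1.0: 0.60}

direct = allelic_power(5000, 0.10, 1.3, 1e-7)
fixed_80 = allelic_power(5000, 0.10, 1.3, 1e-7, r2=0.8)
averaged = averaged_power(5000, 0.10, 1.3, 1e-7, hypothetical_r2_dist)
print(f"direct={direct:.2f}  r2 fixed at 0.8={fixed_80:.2f}  averaged={averaged:.2f}")
```

The three quantities correspond to the "direct", "indirect (r2 fixed at 80%)" and "indirect (averaged over r2)" curves; how far apart they sit depends entirely on the assumed r2 distribution.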
[Figure: 3 × 3 grid of power curves. Columns: OR=1.3, OR=1.5, OR=1.8; rows: MAF=.01, MAF=.05, MAF=.10; x-axis: sample size (cases); y-axis: power (0.0 to 1.0); curves: direct, indirect (averaged over r2), indirect (r2 fixed at 80%)]
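Power of multi-stage designs
[Figure: SNPs × subjects grid divided into stages, with stage-wise test statistics T1, T2, T3]

Replication analysis
$\mathrm{Power} = \Pr(T_1 > k_1, \ldots, T_S > k_S) = \Pr(T_1 > k_1) \cdots \Pr(T_S > k_S)$
$k_s = \mathrm{Quantile}(1 - m_{s+1}/m_s)$

Joint analysis
$\mathrm{Power} = \Pr(T_1^* > k_1^*, \ldots, T_S^* > k_S^*)$
$T_s^* = \sum_{j=1}^{s} T_j$
$k_s^*$ chosen s.t. the expected number of markers (under the null) taken to the $(s+1)$st stage is $m_{s+1}$

$m_{S+1}$ is the expected number of false leads (under the null) at the end of the $S$th stage
(e.g. $m_{S+1} = .05$ is strong control of FWER at α=.05)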
Skol. Nat Genet 2006;38:209-13; Wang Genet Epidemiol 2006;30:356-68; Kraft (in prep)
Multistage Design and Analysis
• It is (or should be) well known that “replication analysis” is statistically inefficient [cf. Thomas DC et al (1985) AJE; Skol (2006) Nat Genet]
• Usually you can find a multistage design that has almost the same power as a single-stage design but is much cheaper
• Multi-stage design is NOT a way of finessing the multiple testing issue. If genotypes were free, you would genotype everybody for every SNP and test all SNPs at a very, very small alpha level.
• Multi-stage design IS a way of saving big $s, ₤s, €s, etc.
Amount of savings and cheapest design depend on prices—which are very fluid!
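The inefficiency of replication analysis relative to joint analysis can be illustrated numerically. The sketch below is a simplified two-stage setup, not the calculation behind these slides: it assumes one-sided normal test statistics, a fraction `pi` of subjects in stage 1, and a noncentrality `mu` for the pooled statistic; all names and parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def pass_both(c1, cj, pi, mu, steps=2000):
    """Pr(z1 > c1 and z_joint > cj), integrated over z1 with the trapezoid rule.

    z1 ~ N(mu*sqrt(pi), 1) and z2 ~ N(mu*sqrt(1-pi), 1) are the stage-wise
    statistics; z_joint = sqrt(pi)*z1 + sqrt(1-pi)*z2 pools both stages.
    """
    m1, m2 = mu * sqrt(pi), mu * sqrt(1.0 - pi)
    lo, hi = c1, max(c1, m1) + 10.0
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z1 = lo + i * h
        z2_needed = (cj - sqrt(pi) * z1) / sqrt(1.0 - pi)
        value = Z.pdf(z1 - m1) * (1.0 - Z.cdf(z2_needed - m2))
        total += value if 0 < i < steps else 0.5 * value
    return total * h

def joint_threshold(c1, pi, target):
    """Bisection for the joint cutoff giving per-marker type I error `target`."""
    lo, hi = -10.0, 10.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if pass_both(c1, mid, pi, 0.0) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def two_stage_powers(pi, alpha1, alpha2, mu):
    """Power of replication analysis, joint analysis, and a one-stage design."""
    c1 = Z.inv_cdf(1.0 - alpha1)
    c2 = Z.inv_cdf(1.0 - alpha2)
    replication = ((1.0 - Z.cdf(c1 - mu * sqrt(pi)))
                   * (1.0 - Z.cdf(c2 - mu * sqrt(1.0 - pi))))
    cj = joint_threshold(c1, pi, alpha1 * alpha2)
    joint = pass_both(c1, cj, pi, mu)
    single = 1.0 - Z.cdf(Z.inv_cdf(1.0 - alpha1 * alpha2) - mu)
    return replication, joint, single

# Hypothetical design: half the subjects in stage 1, 1% of markers carried forward.
replication, joint, single = two_stage_powers(pi=0.5, alpha1=0.01, alpha2=0.001, mu=5.0)
print(f"replication={replication:.3f}  joint={joint:.3f}  one-stage={single:.3f}")
```

All three analyses are held to the same per-marker false-positive rate (alpha1 × alpha2), so the differences in power reflect only how the stage 1 and stage 2 data are combined.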
Calculating power for “replication analysis”
Stage     Number of markers   Number of subjects   Effective level    Power
1         M1                  N1                   α1 = M2/M1         P1 = 1 - β1
2         M2                  N2                   α2 = M3/M2         P2 = 1 - β2
…
k         Mk                  Nk                   αk = Mk+1/Mk       Pk = 1 - βk
Overall                                                               Πi=1..k Pi

βs is the stage-s type II error rate, a function of q, the effect size, r, Ns and αs
Mk+1 is “number of significant tests expected under the null”
E.g. Mk+1 = .05 is Bonferroni-corrected threshold for M1 tests
Calculating power for “replication analysis”
q=10%; dominant OR=1.4; M3=5

Stage     Number of markers   Number of subjects         Effective level   Power
1         500,000             2,400 (1:1 case:control)   α1 = .003         .883
2         1,500               6,000                      α2 = .003         .999
Overall                                                                    .882

Cost: ca. USD 700 × 2,400 + USD 60 × 6,000 = USD 2.04 million
Calculating power for “replication analysis”
q=10%; dominant OR=1.4; M4=5

Stage     Number of markers   Number of subjects         Effective level   Power
1         500,000             2,400 (1:1 case:control)   α1 = .04          .999
2         20,000              3,000                      α2 = .075         .998
3         1,500               3,000                      α3 = .003         .950
Overall                                                                    .946

Cost: ca. USD 700 × 2,400 + USD 200 × 3,000 + USD 60 × 3,000 = USD 2.46 million
Two-stage study with equivalent power costs > USD 2.8 million
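The bookkeeping in the two worked examples (effective levels, overall power as a product of stage powers, and total cost) can be reproduced with a short sketch. The stage powers are taken from the slides rather than computed, and the function and key names are illustrative assumptions.

```python
def staged_design(stages, final_expected_false_leads):
    """Effective levels, overall power and cost for a replication-analysis design.

    stages: list of dicts with keys 'markers', 'subjects', 'cost_per_subject', 'power'.
    """
    markers = [s["markers"] for s in stages] + [final_expected_false_leads]
    # Effective level at stage s is M_{s+1} / M_s.
    alphas = [markers[i + 1] / markers[i] for i in range(len(stages))]
    overall_power = 1.0
    for s in stages:
        overall_power *= s["power"]  # stage powers multiply under replication analysis
    cost = sum(s["subjects"] * s["cost_per_subject"] for s in stages)
    return alphas, overall_power, cost

two_stage = [
    {"markers": 500_000, "subjects": 2_400, "cost_per_subject": 700, "power": 0.883},
    {"markers": 1_500, "subjects": 6_000, "cost_per_subject": 60, "power": 0.999},
]
three_stage = [
    {"markers": 500_000, "subjects": 2_400, "cost_per_subject": 700, "power": 0.999},
    {"markers": 20_000, "subjects": 3_000, "cost_per_subject": 200, "power": 0.998},
    {"markers": 1_500, "subjects": 3_000, "cost_per_subject": 60, "power": 0.950},
]

alphas2, power2, cost2 = staged_design(two_stage, 5)
alphas3, power3, cost3 = staged_design(three_stage, 5)
print(alphas2, round(power2, 3), cost2)
print(alphas3, round(power3, 3), cost3)
```

Because the stage powers on the slides are already rounded, the product for the three-stage design can differ from the quoted overall power in the third decimal place.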
Three different per-SNP pricing scenarios considered
Prices relative to per-SNP costs for whole-genome platform

Nsnp      A       B       C
20,000    5.9     17.7    1
1,500     35.8    107.4   1
20        91.7    275.1   1
Pricing scheme A; cost relative to single stage study using 7,000 subjects
Relative costs for studies with 65% power
Power for single stage studies, accounting for tagging efficiency
[Figure: power vs. relative cost; panels: Illumina 550, Affy 500, Affy 1,000]
Power for three stage studies, accounting for tagging efficiency
[Figure: panels: Illumina 550, Affy 500, Affy 1,000]
(Simulated) tagging properties of three panels
How to select SNPs for 2nd Stage?
• Rank by increasing p-value
  – But recall, prob. of being false positive depends not only on p-value, but also on power and prior
    • Hence Bayesian alternatives [WTCCC, Wakefield 2007 Am J Hum Genet]
    • Quasi-Bayesian FPRP [Wacholder et al 2004; Samani 2007 NEJM]
    • Prior-weighted analyses [Roeder 2007 Genet Epidemiol, Lewinger 2007 Genet Epidemiol]
    • Pragmatist: meh, no big difference in practice
• What about multiple SNPs in high LD?
  – Cull so as to interrogate as many regions as possible (“broad” follow-up), or retain to try and distinguish causal variant (“deep” follow-up)?
• Can I improve coverage by genotyping more SNPs around “hits”?
  – Again: “deep” coverage
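To make the dependence on power and prior concrete, here is a minimal sketch of the false positive report probability in its usual form, alpha(1 - prior) / (alpha(1 - prior) + power × prior). The input values are invented for illustration and are not taken from any of the cited studies.

```python
def fprp(alpha, power, prior):
    """False positive report probability for a finding significant at level alpha.

    alpha: significance level (or observed p-value used as the level)
    power: power to detect the alternative of interest at that level
    prior: prior probability that the association is real
    """
    false_positive = alpha * (1.0 - prior)
    true_positive = power * prior
    return false_positive / (false_positive + true_positive)

# Same evidence, very different conclusions depending on the prior (illustrative values).
print(fprp(alpha=1e-4, power=0.8, prior=0.5))    # strong candidate
print(fprp(alpha=1e-4, power=0.8, prior=1e-4))   # anonymous SNP in a genome-wide scan
```

With a genome-wide prior of the order used here, a p-value of 10^-4 still leaves better-than-even odds that the finding is a false positive, which is why ranking by p-value alone is only a starting point.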
“broad” / “deep” defined
[Figure: schematic contrasting “broad” follow-up with “deep” follow-up]
Thought Experiment
• Two kinds of GWAS products
  – Tagging: captures HapMap II at r2>80%
  – Random: has density of Affy 500k
• Choose additional SNPs in 2nd stage so that you tag region spanning “hit” in HapMap II at >95%
• Does this increase your power over simply genotyping the top hit?
[Figure: r2 with the denser map vs. r2 with the initial map, by MAF bin (< 5%, 5-12.5%, 12.5-25%, 25-37.5%, > 37.5%)]
[Figure: distribution of max r2 (0, .01-.30, .31-.60, .61-.80, .81-.90, .90-1.0) by MAF bin]
[Figure: r2 with the denser map vs. r2 with the initial map for the Tagging Panel and the Random Panel; # markers per region: 3.22 × markers, 1.46 × markers]
[Figure: maximum power for budget vs. cost for the Tagging Panel and the Random Panel; curves: Broad, Deep]
[Figure: power of one-stage design vs. two-stage designs with “deep” follow-up and “broad” follow-up; OR=1.3, MAF=.10; 7,000 cases/controls]
Am J Hum Genet 2007
Very small gain in power from fine mapping (“deep” follow-up). Is it worth the opportunity cost? Genotyping a lot of extra markers to “fine map” null loci means you will miss the chance to replicate the true signals that happened to be lower on your list.
Power calculations
http://www.sph.umich.edu/csg/abecasis/CaTS/
http://www.hsph.harvard.edu/faculty/kraft/soft.htm
The design of genome-wide association studies is an art of the possible.
• Subject selection
• Flexible but simple analysis
  – (Multistage design may limit analysis options)
• Sample heterogeneity across stages
• Data QC
• Population stratification
• Bioinformatics
• Data sharing, scientific replication, and validation
Analytic issues
• Multiple comparisons
• Phenotypic / Genetic heterogeneity
• Epistasis
• Incorporating external information
• Imputation
BPC3-1 A T A A CBPC3-2 A T G A TBPC3-3 C T A TBPC3-4 C C CBPC3-5 A T G A CBPC3-6 C G A TBPC3-7 A T A TBPC3-8 C C A C BPC3-9 A T A A TBPC3-10 A T G CBPC3-11 T A A BPC3-12 A C G C TBPC3-13 A T A A C
HapM-1 AACGTTTGAACT CCATTGCACHapM-2 AAGGTTTGAACT CTATTGCATHapM-3 CAGGTTTGAACT CTATTGCATBPC3-1 AACGTTTGAACTACTATTGCACBPC3-2 AACGTTTGAACTGCTATTGCATBPC3-3 CAGGTTTGAACT CTATTGCATBPC3-4 CAGGTTCGAACT CTCTTGCACBPC3-5 AATGTTTGAACTGCTATTGCACBPC3-6 CATGTTCGAACTGCTATTGCATBPC3-7 AATGTTTGAACT CTATTGCATBPC3-8 CAGGTTCGAACTACTCTTGCACBPC3-9 AATGTTTGAACTACTATTGCATBPC3-10 AATGTTTGAACTGCT TTGCACBPC3-11 AATGTTTGAACTACTATTGCAC BPC3-12 AATGTTCGAACTGCTCTTGCATBPC3-13 AATGTTTGAACTACTATTGCAC
BPC3-1 AACGTTTGAACTACTATTGCACBPC3-2 AACGTTTGAACTGCTATTGCATBPC3-3 CAGGTTTGAACTACTATTGCATBPC3-4 CAGGTTCGAACTACTCTTGCACBPC3-5 AATGTTTGAACTGCTATTGCACBPC3-6 CATGTTCGAACTGCTATTGCATBPC3-7 AATGTTTGAACTGCTATTGCATBPC3-8 CAGGTTCGAACTACTCTTGCACBPC3-9 AATGTTTGAACTACTATTGCATBPC3-10 AATGTTTGAACTGCT TTGCACBPC3-11 AATGTTTGAACTACTATTGCAC BPC3-12 AATGTTCGAACTGCTCTTGCATBPC3-13 AATGTTTGAACTACTATTGCAC
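The idea behind the panels above, filling untyped positions by borrowing from similar reference haplotypes, can be sketched with a deliberately naive toy. This is not how MACH, IMPUTE or Bim-Bam work (those use probabilistic models of the haplotypes); the nearest-haplotype rule, the function name and the short made-up sequences below are all assumptions for illustration.

```python
def impute_nearest(observed, reference_haplotypes, missing="?"):
    """Fill missing positions from the reference haplotype that best matches the typed ones.

    Ties are broken in favour of the first reference haplotype in the list.
    """
    def mismatches(ref):
        return sum(1 for o, r in zip(observed, ref) if o != missing and o != r)

    best = min(reference_haplotypes, key=mismatches)
    return "".join(r if o == missing else o for o, r in zip(observed, best))

# Hypothetical reference haplotypes and a sparsely typed study haplotype.
reference = ["AACGT", "CAGGT", "AATGC"]
imputed = impute_nearest("A?T?C", reference)
print(imputed)
```

A real implementation would work with unphased genotypes, allow recombination between reference haplotypes, and report a probability for each imputed allele rather than a single best guess.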
Accuracy?
Marchini et al. (2007)
Accuracy?
Li et al.
Power Gains?
Marchini et al. (2007)
Implementation
• MACH 1.0 (Li Y et al. submitted)
• IMPUTE (Marchini et al. Nat Genet 2007)
• Bim-Bam (Servin and Stephens, PLoS Genet 2007)
Reference panel    (Sub) cohorts
Cosmopolitan       MEC-B, MEC-H, MEC-L, PLCO-B
CHB+JPT            MEC-J
CEU                ACS, ATBC, EPIC, HPFS, MEC-W, PHS, PLCO-W
de Bakker et al. Nat Genet 2007
The design of genome-wide association studies is an art of the possible.
NCI’s CGEMS project: parallel GWA scans for breast and prostate cancer susceptibility loci
For prostate: PLCO

Initial Study: 1200 cases / 1200 controls; ~500,000 Tag SNPs
Replication Study #1: 3000 cases / 3000 controls; ~15,000 SNPs
Replication Study #2: 3000 cases / 3000 controls; ~1,500 SNPs
Replication Study #3: 1200 cases / 1200 controls; ca. 200 + new ht-SNPs
Ca. 15-20 loci
Control Type I error at 5 × 10^-5
Yeager et al. 2007 Nat Genet
“Fast Track” Partial Replication
Not shown: ca. 100 other “top SNPs” that did not replicate convincingly.
Multi-locus modeling provides evidence for independent effects!
Characterization
Model | Name | Nparms | -2 log L | p-value | AIC | BIC | BIC Weight
0 | NULL: Intercept only | 1 | 11691.71 | ref | 11693.71 | 11700.75 | 0.000
1 | SNP1 - Dominant Model | 2 | 11636.95 | 1.36E-13 | 11640.95 | 11655.03 | 0.000
2 | SNP1 - Recessive Model | 2 | 11653.34 | 5.86E-10 | 11657.34 | 11671.42 | 0.000
3 | SNP1 - Additive (log odds) Model | 2 | 11622.22 | 7.68E-17 | 11626.22 | 11640.30 | 0.000
4 | SNP1 - Codominant Model | 3 | 11621.62 | 6.02E-16 | 11627.62 | 11648.74 | 0.000
5 | SNP2 - Dominant Model | 2 | 11614.80 | 1.79E-18 | 11618.80 | 11632.88 | 0.000
6 | SNP2 - Recessive Model | 2 | 11674.28 | 2.98E-05 | 11678.28 | 11692.36 | 0.000
7 | SNP2 - Additive (log odds) Model | 2 | 11610.08 | 1.64E-19 | 11614.08 | 11628.16 | 0.000
8 | Two additive (log odds) SNPs, additive (log odds) interaction | 3 | 11548.83 | 9.43E-32 | 11554.83 | 11575.95 | 0.747
9 | Two additive (log odds) SNPs, additive (risk scale) interaction | 3 | 11551.00 | 2.79E-31 | 11557.00 | 11578.12 | 0.253
10 | Two codominant SNPs, general interaction | 9 | 11541.41 | 1.70E-28 | 11559.41 | 11622.77 | 0.000
Say we know two SNPs are associated with risk. Next step is to ask: How? Do they each contribute to disease risk (i.e. conditional on the other SNP, does
adding a SNP improve model fit)? How do they “interact”?
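The AIC and BIC weight columns of the model table can be recomputed from the other columns. The sketch below assumes the usual definitions, AIC = -2 log L + 2 × Nparms and BIC weight proportional to exp(-ΔBIC / 2); the BIC values are copied from the table rather than recomputed, since the sample size is not given on the slide.

```python
from math import exp

# Model number -> BIC, copied from the table.
bic_by_model = {
    0: 11700.75, 1: 11655.03, 2: 11671.42, 3: 11640.30, 4: 11648.74,
    5: 11632.88, 6: 11692.36, 7: 11628.16, 8: 11575.95, 9: 11578.12,
    10: 11622.77,
}

def aic(minus_two_log_lik, n_params):
    """Akaike information criterion from -2 log L and the parameter count."""
    return minus_two_log_lik + 2 * n_params

def bic_weights(bics):
    """Normalised weights exp(-dBIC/2), with dBIC measured from the best (lowest) BIC."""
    best = min(bics.values())
    raw = {model: exp(-(bic - best) / 2.0) for model, bic in bics.items()}
    total = sum(raw.values())
    return {model: value / total for model, value in raw.items()}

weights = bic_weights(bic_by_model)
print({model: round(w, 3) for model, w in weights.items()})
print(aic(11691.71, 1))
```

Models 8 and 9 carry essentially all of the weight, which is the table's way of saying that both SNPs matter and that the data cannot sharply distinguish the two interaction scales.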
[Figure: odds ratio (relative to '00') by genotype at locus A (aa, Aa, AA) and locus B (bb, Bb, BB); y-axis from 0.00 to 2.50. Panel 1: Additive (log odds) SNPs, additive (log odds) interaction, a.k.a. “Main effects only”. Panel 2: Unrestricted model]
Although the saturated model (with 8 unrestricted log odds ratio parameters) is “closest to the data,” the BIC suggests it is “too close.” The exceptional pattern for odds across the A locus in the BB stratum is probably just noise (small cells), not “gene-gene interaction.”
Pooled Phase I and II Results
           Pooled Phase I+II    Initial Scan
Region     p-value              Rank      p-value
8q24       3.07E-19             116       1.12E-04
8q24       6.58E-12             300       3.92E-04
HNF1B      9.58E-10             384       5.21E-04
MSMB       7.31E-13             24,223    0.042
11q13      1.76E-09             2,439     0.004
CTBP2      1.70E-07             319       4.09E-04
JAZF1      2.14E-06             24,407    0.042
Thomas et al, in press
Population Attributable Risk (PAR)

Locus     Freq.   ORmul   PAR
8q24-a    0.10    1.43    8%
8q24-c    0.50    1.26    22%
HNF1B     0.49    1.22    19%
MSMB      0.40    1.22    16%
11q13     0.48    1.23    20%
CTBP2     0.27    1.17    9%
JAZF1     0.23    1.10    14%

Joint PAR ~ 60%
Thomas et al, in press
PARs do not add!
[Figure: overlapping sets E, G1 and G2 within “All Cases”]
Marginal PAR for exposure E is 100%
Marginal PAR for gene G1 is 100%
Marginal PAR for gene G2 is 20%
A joint PAR of 60% for top seven loci does not mean there are no other risk loci, nor does it mean modifiable environmental factors do not influence prostate cancer risk.
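A small sketch can make the point about non-additivity concrete. It assumes Hardy-Weinberg proportions, a multiplicative per-allele effect, and the odds ratio standing in for the relative risk; this is one common way to compute a single-locus PAR and may not be the exact calculation behind the PAR table (it reproduces most rows to within a percentage point, but not all).

```python
def par_multiplicative(freq, or_per_allele):
    """Single-locus PAR under Hardy-Weinberg and a multiplicative per-allele effect.

    Genotype relative risks are 1, OR, OR^2, so the population mean relative
    risk is (1 - freq + freq * OR) ** 2.
    """
    mean_relative_risk = (1.0 - freq + freq * or_per_allele) ** 2
    return 1.0 - 1.0 / mean_relative_risk

par_8q24a = par_multiplicative(0.10, 1.43)
print(f"8q24-a PAR: {par_8q24a:.1%}")

# PAR values as listed in the table, in percent.
listed_pars = [8, 22, 19, 16, 20, 9, 14]
print(f"Sum of the seven single-locus PARs: {sum(listed_pars)}%")
```

The seven listed PARs sum to more than 100%, which is impossible for a quantity that partitions cases; the joint PAR has to be computed from the joint model instead.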
Individual Risk Prediction
Odds ratio comparing 90th percentile to 10th percentile ~ 2.5
Thomas et al, submitted
Based on allele frequencies in controls and multi-locus model assuming codominant effects at each locus and multiplicative effects across loci
Probability that a man in the top 10th percentile of risk according to seven-SNP model develops prostate cancer: 45%
Positive predictive value for a screening test that predicts prostate cancer for men whose genetic risk profile is above a given threshold; recall PPV involves test sensitivity and specificity AS WELL AS incidence rates (here: age-specific rates from the ACS website)
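The dependence of PPV on incidence follows directly from Bayes' rule. The sketch below uses made-up sensitivity, specificity and incidence values purely for illustration; they are not the values behind the 45% figure above.

```python
def positive_predictive_value(sensitivity, specificity, incidence):
    """Probability of disease given a positive test, by Bayes' rule."""
    true_positives = sensitivity * incidence
    false_positives = (1.0 - specificity) * (1.0 - incidence)
    return true_positives / (true_positives + false_positives)

# Same hypothetical test applied at two different incidence levels.
ppv_low_incidence = positive_predictive_value(0.20, 0.90, 0.02)
ppv_high_incidence = positive_predictive_value(0.20, 0.90, 0.15)
print(f"{ppv_low_incidence:.2f} vs {ppv_high_incidence:.2f}")
```

The same sensitivity and specificity give very different PPVs as incidence changes, which is why the age-specific rates matter as much as the genetic model.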
Novel Risk Loci
• 8q24
  – Three independent loci with no known function, associated with risks of prostate and colorectal cancer
• HNF1B (TCF2)
  – Prostate cancer risk alleles associated with decreased risk of T2D
• MSMB
  – Encodes beta-microseminoprotein, a proposed prostate-cancer biomarker
• CTBP2
  – Has anti-apoptotic activity
• JAZF1
  – Fused by translocation with SUZ12 in endometrial cancer
Where to from Here?
These results open up new and often unexpected avenues for research (cf. 8q24 region). They may also point to etiologic pathways as targets for treatment.
Despite large PARs, individually these variants are not good predictors of an individual's risk. But taken together they may (MAY) be useful for prediction: either for screening or prognosis. The performance of any screening panel will need to be evaluated in independent studies, and its ultimate efficacy will depend on its discriminative power and on the availability of an intervention proven to reduce risk.
In the next 3-5 years we'll see many more discoveries using the simple, brute force approach illustrated here. The new challenge will be making sense out of it all: characterizing effects in different populations, looking for gene-environment interactions, developing new treatments and sound & ethical prevention strategies to reduce cancer morbidity and mortality.
Acknowledgements
NCI Core Genotyping Facility
Stephen Chanock, Gilles Thomas, Meredith Yeager, Kevin Jacobs
NCI Division of Cancer Epidemiology and Genetics
Bob Hoover, Richard Hayes, Sholom Wacholder, Nilanjan Chatterjee, Kai Yu
Harvard School of Public Health
David Hunter, Jiali Han, Connie Chen
And all the subjects and support staff from the participating studies!
Further Reading
New England Journal of Medicine, 2 August 2007