
EPI293: Design and analysis of gene association studies

Winter Term 2008

Lecture 7: Genome-wide association scans

Peter Kraft

[email protected], Bldg. 2 Rm 206

2-4271

[Figure: timeline with axis marks at 1900, 1920, 1940, 1960, 1980, 1990, 2000, 2005, 2006 and 2007. Events shown:]

• Rediscovery of Mendel’s laws
• Principles of Linkage Analysis discovered
• Association between Blood Groups and malignant disease published
• Association between Blood Groups and malignant disease fails to replicate
• RFLPs for linkage analysis developed
• Microsatellite maps for genome-wide linkage analysis developed
• Human Genome Project launched
• Risch and Merikangas paper
• Human Genome Project working draft completed; beginnings of SNP map
• HapMap launched
• First Genome-Wide Association Study
• HapMap Phase I completed (draft Phase II available)
• Genome-wide SNP panels developed

5 December 2007 [email protected]

[Figure: Linkage Analysis. Pedigree with marker genotypes at alleles A, B and C.]

[Figure: Association analysis. Genotypes (GG, Gg, gg) of individual cases and controls, with a table of Case and Control counts.]

Linkage vs. Association

• Linkage studies
– Pros: can scan genome with fewer markers
– Cons: can only detect alleles with large effect; limited resolution (identify broad region, not individual genes); requires data on multiple family members

• Association studies
– Pros: can detect subtle effects; very fine resolution
– Cons: requires 0.5 to 1 million markers to cover whole genome; requires large sample size

Risch and Merikangas (1996) Science 273:1516-7

Schlötterer C. Nat Rev Genet. 2004;5:63-9.

Published Genome-Wide Association Scans

• Ozaki K. Myocardial infarction. Nat Genet 2002;32:650–4.
• Klein RJ. Age-related macular degeneration. Science 2005;308:385–9.
• Maraganore DM. Parkinson disease. Am J Hum Genet 2005;77:685–93.
• Shiffman D. Myocardial infarction. Am J Hum Genet 2005;77:596–605.
• Cheung VG. Gene expression. Nature 2005;437:1365-9.
• Stranger BE. Gene expression. PLoS Genet 2005;1:695-704.
• Mah S. Schizophrenia. Mol Psychiatry 2006;11:471-8.
• Herbert A. Obesity. Science 2006;312:279-83.

Reviews

• Hirschhorn J. Nat Reviews Genet 2005;6:95-108.
• Wang WY. Nat Reviews Genet 2005;6:109-18.
• Thomas DC. Am J Hum Genet 2005;77:337-45.
• Thomas DC. Cancer Epidemiol Biomarkers Prev 2006;15:595-8.
• Evans DM. Trends in Genetics 2006 (epub)

OLD SLIDE!!!!

Klein RJ. Science 2005;308:385–9

96 cases, 50 controls; 103,611 SNPs

rs380390: recessive OR 7.4 (2.9-19); PAR 70%

Genotyping errors

Functionality

Replication: Science 2005;308:421–4; Science 2005;308:419–21

Tier 1: 443 sib pairs; 198,000 SNPs

Tier 2: 332 matched unrelated case-control pairs; 3,148 SNPs

No SNPs pass the Bonferroni-corrected significance threshold ($2.5 \times 10^{-7}$).

Maraganore Am J Hum Genet 2005 77:685-93
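The Bonferroni threshold quoted for Tier 1 can be checked with one division. The sketch below assumes a family-wise level of 0.05, which the slide does not state but which is consistent with the quoted 2.5 × 10⁻⁷ for 198,000 SNPs; the function name is illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Tier 1 tested 198,000 SNPs; alpha = 0.05 is an assumed family-wise level.
tier1_threshold = bonferroni_threshold(0.05, 198_000)
print(f"Tier 1 per-SNP threshold: {tier1_threshold:.2e}")
```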

Known Breast Cancer Genes, November 2006

Known Prostate Cancer Genes, November 2006

Known Breast Cancer Genes, Fall 2007

Known Prostate Cancer Genes, Fall 2007

Kraft and Cox 2008 in: Rao and Gu, eds.

Outline

• Power issues
– Tagging efficiency of genome-wide panels
– Multi-stage design and analysis
• Design issues
• Analytic issues
– Imputation
• CGEMS examples




[Figure: Known and Unknown SNPs, linked by r².]

Barrett JC. Nat Genet 2006;38:659-62.
Pe’er I. Nat Genet 2006;38:663-7.

International HapMap Consortium. Nature. 2007 Oct 18;449(7164):851-61

[Figure: Distribution of max r² with tag panel as a function of MAF. Stacked bars (0.00 to 1.00) for MAF < 5%, MAF 5-12.5%, MAF 12.5-25%, MAF 25-37.5% and MAF 37.5-50%; max r² categories: 0, .01-.30, .32-.60, .61-.80, .81-.90, .90-1.00.]

Tags chosen from a “pseudo Phase II HapMap” and evaluated against ENCODE SNPs

The fundamental theorem of the HapMap

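A study that genotypes N cases and N controls at a marker that has a correlation of r² with a disease susceptibility locus has the same power as a study that genotypes N* = r²N cases and N* controls at the disease susceptibility locus itself.

Power adjusting for tagging efficiency

$\mathrm{Power} = \sum_{r^2} \mathrm{Pow}(r^2 N)\, f(r^2)$, where $f(r^2)$ is the distribution of the maximum $r^2$ between the susceptibility locus and the genotyped panel.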

Pritchard JK. Am J Hum Genet 2001;69:1-14.Jorgenson Am J Hum Genet 2006;78:884-8.

Terwilliger JD Eur J Hum Genet 2006;14:426-37.
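A minimal sketch of the sample-size rule and of averaging power over r², under loudly stated assumptions: the power function is a simple normal approximation to a two-sided test comparing allele frequencies (not necessarily the calculation behind the slides), the odds ratio is treated as a per-allele odds ratio, and the r² distribution `R2_WEIGHTS` is made up for illustration. All function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def case_allele_frequency(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def allelic_test_power(n_cases, control_freq, odds_ratio, alpha, r2=1.0):
    """Approximate power of a two-sided test comparing allele frequencies.

    Assumes n_cases cases and n_cases controls (two alleles each) and
    applies the rule N* = r2 * N by shrinking the sample size.
    """
    n_effective = r2 * n_cases
    if n_effective <= 0:
        return alpha  # an uncorrelated marker rejects only at the null rate
    p0 = control_freq
    p1 = case_allele_frequency(control_freq, odds_ratio)
    std_err = ((p1 * (1 - p1) + p0 * (1 - p0)) / (2 * n_effective)) ** 0.5
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = (p1 - p0) / std_err
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def tagging_adjusted_power(n_cases, control_freq, odds_ratio, alpha, r2_weights):
    """Average the indirect power over a discrete distribution of r2."""
    return sum(
        weight * allelic_test_power(n_cases, control_freq, odds_ratio, alpha, r2)
        for r2, weight in r2_weights.items()
    )


# Hypothetical distribution of max r2 for one MAF bin (weights sum to 1).
R2_WEIGHTS = {1.0: 0.5, 0.8: 0.3, 0.5: 0.15, 0.0: 0.05}

direct = allelic_test_power(5000, 0.10, 1.3, 1e-7)
fixed_r2 = allelic_test_power(5000, 0.10, 1.3, 1e-7, r2=0.8)
averaged = tagging_adjusted_power(5000, 0.10, 1.3, 1e-7, R2_WEIGHTS)
print(f"direct={direct:.3f}  r2 fixed at 0.8={fixed_r2:.3f}  averaged={averaged:.3f}")
```

By construction, 5,000 cases at r² = 0.8 give the same power as 4,000 cases genotyped at the susceptibility locus itself.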

[Figure: power versus sample size (cases). Panels: OR=1.3, OR=1.5, OR=1.8 by MAF=.01, MAF=.05, MAF=.10. Curves: direct; indirect (averaged over r²); indirect (r² fixed at 80%).]




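Power of multi-stage designs

[Figure: SNPs-by-subjects grid divided into stages, with stage test statistics $T_1$, $T_2$, $T_3$.]

Replication analysis: $\mathrm{Power} = \Pr(T_1 > k_1, \ldots, T_S > k_S) = \Pr(T_1 > k_1) \cdots \Pr(T_S > k_S)$, with $k_s = \mathrm{Quantile}(1 - m_{s+1}/m_s)$.

Joint analysis: $\mathrm{Power} = \Pr(T_1^* > k_1^*, \ldots, T_S^* > k_S^*)$, with $T_s^* = \sum_{i=1}^{s} T_i$ and $k_s^*$ chosen s.t. the expected number of markers (under the null) taken to stage $s+1$ is $m_{s+1}$.

$m_{S+1}$ is the expected number of false leads (under the null) at the end of stage $S$ (e.g. $m_{S+1} = .05$ is strong control of FWER at $\alpha = .05$).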

Skol. Nat Genet 2006;38:209-13; Wang Genet Epidemiol 2006;30:356-68; Kraft (in prep)

Multistage Design and Analysis

• It is (or should be) well known that “replication analysis” is statistically inefficient [cf. Thomas DC et al (1985) AJE, Skol (2006) Nat Genet]

• Usually you can find a multistage design that has almost the same power as a single-stage design but is much cheaper

• Multi-stage design is NOT a way of finessing the multiple testing issue. If genotypes were free, you would genotype everybody for every SNP and test all SNPs at a very small alpha level.

• Multi-stage design IS a way of saving big $s, ₤s, €s, etc.

Amount of savings and cheapest design depend on prices—which are very fluid!

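Calculating power for “replication analysis”

Stage | Number of Markers | Number of subjects | Effective level | Power
1 | $M_1$ | $N_1$ | $\alpha_1 = M_2/M_1$ | $P_1 = 1 - \beta_{q,\mathrm{OR},r,N_1,\alpha_1}$
2 | $M_2$ | $N_2$ | $\alpha_2 = M_3/M_2$ | $P_2 = 1 - \beta_{q,\mathrm{OR},r,N_2,\alpha_2}$
… | … | … | … | …
k | $M_k$ | $N_k$ | $\alpha_k = M_{k+1}/M_k$ | $P_k = 1 - \beta_{q,\mathrm{OR},r,N_k,\alpha_k}$
Overall | | | | $\prod_{i=1}^{k} P_i$

$M_{k+1}$ is the “number of significant tests expected under the null”.

E.g. $M_{k+1} = .05$ is the Bonferroni-corrected threshold for $M_1$ tests.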


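q=10%; dominant OR=1.4; M4=5

Stage | Number of Markers | Number of subjects | Effective level | Power
1 | 500,000 | 2,400 (1:1 case:control) | $\alpha_1 = .003$ | .883
2 | 1,500 | 6,000 | $\alpha_2 = .003$ | .999
Overall | | | | .882

Cost: ca. USD 700 × 2,400 + USD 60 × 6,000 = USD 2.04 million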


q=10%; dominant OR=1.4; M4=5

Stage | Number of Markers | Number of subjects | Effective level | Power
1 | 500,000 | 2,400 (1:1 case:control) | $\alpha_1 = .04$ | .950
2 | 20,000 | 3,000 | $\alpha_2 = .075$ | .999
3 | 1,500 | 3,000 | $\alpha_3 = .003$ | .998
Overall | | | | .946

Cost: ca. USD 700 × 2,400 + USD 200 × 3,000 + USD 60 × 3,000 = USD 2.46 million

Two-stage study with equivalent power costs > 2.8 million
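The bookkeeping in the two worked designs (effective level as the ratio of marker counts, overall power as the product of stage powers, cost as subjects times per-subject price) can be sketched as follows. The stage powers are copied from the slides rather than recomputed; the helper names, and the reading of the USD figures as per-subject genotyping prices, are assumptions.

```python
from math import prod


def stage_levels(marker_counts):
    """Effective level per stage: alpha_s = M_(s+1) / M_s."""
    return [nxt / cur for cur, nxt in zip(marker_counts, marker_counts[1:])]


def replication_power(stage_powers):
    """Overall power of a replication analysis: product of the stage powers."""
    return prod(stage_powers)


def genotyping_cost(subjects, price_per_subject):
    """Total cost in USD: subjects times per-subject price, summed over stages."""
    return sum(n * price for n, price in zip(subjects, price_per_subject))


# Two-stage design: 500,000 -> 1,500 markers, 5 expected false leads at the end.
two_levels = stage_levels([500_000, 1_500, 5])
two_power = replication_power([0.883, 0.999])
two_cost = genotyping_cost([2_400, 6_000], [700, 60])

# Three-stage design: 500,000 -> 20,000 -> 1,500 markers.
three_levels = stage_levels([500_000, 20_000, 1_500, 5])
three_power = replication_power([0.950, 0.999, 0.998])
three_cost = genotyping_cost([2_400, 3_000, 3_000], [700, 200, 60])

print(two_levels, round(two_power, 3), two_cost)
print(three_levels, round(three_power, 3), three_cost)
```

The three-stage product comes out near .947 rather than the .946 on the slide, which is what one would expect if the slide multiplied unrounded stage powers.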

Three different per-SNP pricing scenarios considered

Prices relative to per-SNP costs for whole-genome platform

Nsnp | A | B | C
20000 | 5.9 | 17.7 | 1
1500 | 35.8 | 107.4 | 1
20 | 91.7 | 275.1 | 1

[Figure: Relative costs for studies with 65% power. Pricing scheme A; cost relative to single stage study using 7,000 subjects.]

[Figure: Power for single stage studies, accounting for tagging efficiency. Power versus relative cost; panels: Illumina 550, Affy 500, Affy 1,000.]

[Figure: Power for three stage studies, accounting for tagging efficiency. Panels: Illumina 550, Affy 500, Affy 1,000.]

[Figure: (Simulated) tagging properties of three panels.]

How to select SNPs for 2nd Stage?

• Rank by increasing p-value
– But recall, prob. of being false positive depends not only on p-value, but also on power and prior
• Hence Bayesian alternatives [WTCCC, Wakefield 2007 Am J Hum Genet]
• Quasi-Bayesian FPRP [Wacholder et al 2004; Samani 2007 NEJM]
• Prior-weighted analyses [Roeder 2007 Genet Epidemiol, Lewinger 2007 Genet Epidemiol]
• Pragmatist: meh, no big difference in practice
• What about multiple SNPs in high LD?
– Cull so as to interrogate as many regions as possible (“broad” follow up), or retain to try and distinguish causal variant (“deep” follow up)?
• Can I improve coverage by genotyping more SNPs around “hits”?
– Again: “deep” coverage
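The point that the probability of being a false positive depends on power and prior as well as on the p-value can be made concrete with the false positive report probability. The sketch below uses the usual form of that quantity, α(1 − π) / (α(1 − π) + (1 − β)π); the input values are hypothetical, and the function name is illustrative.

```python
def false_positive_report_probability(alpha, power, prior):
    """Probability that a finding significant at level alpha is a false positive.

    alpha: significance level; power: 1 - beta; prior: prior probability
    that the SNP is truly associated.
    """
    if not (0 < alpha < 1 and 0 < power <= 1 and 0 < prior < 1):
        raise ValueError("alpha and prior must be in (0, 1); power in (0, 1]")
    false_positives = alpha * (1 - prior)
    true_positives = power * prior
    return false_positives / (false_positives + true_positives)


# Same significance level, hypothetical priors and powers.
weak_prior = false_positive_report_probability(alpha=1e-4, power=0.5, prior=1e-5)
strong_prior = false_positive_report_probability(alpha=1e-4, power=0.5, prior=1e-3)
low_power = false_positive_report_probability(alpha=1e-4, power=0.1, prior=1e-3)
print(f"{weak_prior:.3f} {strong_prior:.3f} {low_power:.3f}")
```

Two SNPs with the same p-value can therefore deserve different places in a follow-up ranking if their priors or the study's power to detect them differ.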

[Figure: “broad” / “deep” defined. Panels: “broad” follow-up; “deep” follow-up.]

Thought Experiment

• Two kinds of GWAS products
– Tagging: captures HapMap II at r² > 80%
– Random: has density of Affy 500k

• Choose additional SNPs in 2nd stage so that you tag region spanning “hit” in HapMap II at > 95%

• Does this increase your power over simply genotyping the top hit?

[Figure: r² with denser map versus r² with initial map, by MAF bin (< 5%, 5-12.5%, 12.5-25%, 25-37.5%, > 37.5%), for the Tagging Panel and the Random Panel.]

[Figure: Distribution of max r² by MAF bin for the Tagging Panel and the Random Panel. # markers per region: 3.22 X markers; 1.46 X markers.]

[Figure: maximum power for budget versus cost, for the Tagging Panel and the Random Panel, comparing Broad and Deep follow-up.]

[Figure: Power of one-stage design (OR=1.3, MAF=.10; 7,000 cases/controls) compared with two-stage designs using “deep” follow-up and “broad” follow-up. Am J Hum Genet 2007]

Very small gain in power from fine mapping (“deep” follow-up). Is it worth the opportunity cost? Genotyping a lot of extra markers to “fine map” null loci means you will miss the chance to replicate the true signals that happened to be lower on your list.

Power calculations

http://www.sph.umich.edu/csg/abecasis/CaTS/

http://www.hsph.harvard.edu/faculty/kraft/soft.htm




Design issues

• Subject selection
• Flexible but simple analysis
– (Multistage design may limit analysis options)
• Sample heterogeneity across stages
• Data QC
• Population stratification
• Bioinformatics
• Data sharing, scientific replication, and validation

The design of genome-wide association studies is an art of the possible.




Analytic issues

• Multiple comparisons
• Phenotypic / Genetic heterogeneity
• Epistasis
• Incorporating external information
• Imputation

BPC3-1 A T A A CBPC3-2 A T G A TBPC3-3 C T A TBPC3-4 C C CBPC3-5 A T G A CBPC3-6 C G A TBPC3-7 A T A TBPC3-8 C C A C BPC3-9 A T A A TBPC3-10 A T G CBPC3-11 T A A BPC3-12 A C G C TBPC3-13 A T A A C

HapM-1 AACGTTTGAACT CCATTGCACHapM-2 AAGGTTTGAACT CTATTGCATHapM-3 CAGGTTTGAACT CTATTGCATBPC3-1 AACGTTTGAACTACTATTGCACBPC3-2 AACGTTTGAACTGCTATTGCATBPC3-3 CAGGTTTGAACT CTATTGCATBPC3-4 CAGGTTCGAACT CTCTTGCACBPC3-5 AATGTTTGAACTGCTATTGCACBPC3-6 CATGTTCGAACTGCTATTGCATBPC3-7 AATGTTTGAACT CTATTGCATBPC3-8 CAGGTTCGAACTACTCTTGCACBPC3-9 AATGTTTGAACTACTATTGCATBPC3-10 AATGTTTGAACTGCT TTGCACBPC3-11 AATGTTTGAACTACTATTGCAC BPC3-12 AATGTTCGAACTGCTCTTGCATBPC3-13 AATGTTTGAACTACTATTGCAC

BPC3-1 AACGTTTGAACTACTATTGCACBPC3-2 AACGTTTGAACTGCTATTGCATBPC3-3 CAGGTTTGAACTACTATTGCATBPC3-4 CAGGTTCGAACTACTCTTGCACBPC3-5 AATGTTTGAACTGCTATTGCACBPC3-6 CATGTTCGAACTGCTATTGCATBPC3-7 AATGTTTGAACTGCTATTGCATBPC3-8 CAGGTTCGAACTACTCTTGCACBPC3-9 AATGTTTGAACTACTATTGCATBPC3-10 AATGTTTGAACTGCT TTGCACBPC3-11 AATGTTTGAACTACTATTGCAC BPC3-12 AATGTTCGAACTGCTCTTGCATBPC3-13 AATGTTTGAACTACTATTGCAC
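The idea behind the haplotype panels above (fill untyped positions in a study haplotype by borrowing from a matching reference haplotype) can be illustrated with a deliberately naive sketch. This is a nearest-match toy, not the method of any real imputation program, which would use a probabilistic haplotype model; the reference haplotypes, the `?` missing-data marker and the function name are all hypothetical.

```python
def impute_from_reference(observed, reference_haplotypes, missing="?"):
    """Fill missing alleles from the reference haplotype with fewest mismatches.

    observed: string of alleles with `missing` at untyped positions.
    reference_haplotypes: fully typed strings of the same length.
    Ties are broken in favour of the first reference haplotype listed.
    """
    if any(len(ref) != len(observed) for ref in reference_haplotypes):
        raise ValueError("all haplotypes must have the same length")

    def mismatches(ref):
        return sum(1 for o, r in zip(observed, ref) if o != missing and o != r)

    best = min(reference_haplotypes, key=mismatches)
    return "".join(r if o == missing else o for o, r in zip(observed, best))


# Hypothetical reference panel and a study haplotype typed at three of five SNPs.
REFERENCE = ["AACGT", "AAGGT", "CAGGT"]
imputed = impute_from_reference("A?C?T", REFERENCE)
print(imputed)
```

A real analysis would also carry forward the uncertainty of each imputed allele rather than a single best guess.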

Accuracy?

Marchini et al. (2007)

Accuracy?

Li et al.

Power Gains?

Marchini et al. (2007)

Implementation

• MACH 1.0 (Li Y et al. submitted)

• IMPUTE (Marchini et al. Nat Genet 2007)

• Bim-Bam (Servin and Stephens, PLoS Genet 2007)


Reference panel | (Sub) cohorts
Cosmopolitan | MEC-B, MEC-H, MEC-L, PLCO-B
CHB+JPT | MEC-J
CEU | ACS, ATBC, EPIC, HPFS, MEC-W, PHS, PLCO-W

de Bakker et al. Nat Genet 2007




The design of genome-wide association studies is an art of the possible.

NCI’s CGEMS project: Parallel GWA scans for breast and prostate cancer susceptibility loci

Initial Study (for prostate: PLCO): 1200 cases / 1200 controls; ~500,000 Tag SNPs

Replication Study #1: 3000 cases / 3000 controls; ~15,000 SNPs

Later stages: ~1,500 SNPs, then Ca. 200 + New ht-SNPs, leading to Ca. 15-20 Loci

Control Type I error at $5 \times 10^{-5}$

Yeager et al. 2007 Nat Genet

“Fast Track” Partial Replication

Not shown: ca. 100 other “top SNPs” that did not replicate convincingly.

Multi-locus modeling provides evidence for independent effects!

Characterization

Model | name | Nparms | -2 log L | p-value | AIC | BIC | BIC Weight
0 | NULL: Intercept only | 1 | 11691.71 | ref | 11693.71 | 11700.75 | 0.000
1 | SNP1 - Dominant Model | 2 | 11636.95 | 1.36E-13 | 11640.95 | 11655.03 | 0.000
2 | SNP1 - Recessive Model | 2 | 11653.34 | 5.86E-10 | 11657.34 | 11671.42 | 0.000
3 | SNP1 - Additive (log odds) Model | 2 | 11622.22 | 7.68E-17 | 11626.22 | 11640.30 | 0.000
4 | SNP1 - Codominant Model | 3 | 11621.62 | 6.02E-16 | 11627.62 | 11648.74 | 0.000
5 | SNP2 - Dominant Model | 2 | 11614.80 | 1.79E-18 | 11618.80 | 11632.88 | 0.000
6 | SNP2 - Recessive Model | 2 | 11674.28 | 2.98E-05 | 11678.28 | 11692.36 | 0.000
7 | SNP2 - Additive (log odds) Model | 2 | 11610.08 | 1.64E-19 | 11614.08 | 11628.16 | 0.000
8 | Two additive (log odds) SNPs, additive (log odds) interaction | 3 | 11548.83 | 9.43E-32 | 11554.83 | 11575.95 | 0.747
9 | Two additive (log odds) SNPs, additive (risk scale) interaction | 3 | 11551.00 | 2.79E-31 | 11557.00 | 11578.12 | 0.253
10 | Two codominant SNPs, general interaction | 9 | 11541.41 | 1.70E-28 | 11559.41 | 11622.77 | 0.000

Say we know two SNPs are associated with risk. The next step is to ask: How? Do they each contribute to disease risk (i.e. conditional on the other SNP, does adding a SNP improve model fit)? How do they “interact”?
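The BIC Weight column of the model-comparison table can be recovered from the BIC values alone. The sketch below assumes the usual definition, weight proportional to exp(−ΔBIC/2) with ΔBIC measured from the best model; the slides do not state the formula, but it reproduces the 0.747 and 0.253 shown for models 8 and 9.

```python
from math import exp

# BIC values keyed by model number, copied from the table.
BIC_BY_MODEL = {
    0: 11700.75, 1: 11655.03, 2: 11671.42, 3: 11640.30, 4: 11648.74,
    5: 11632.88, 6: 11692.36, 7: 11628.16, 8: 11575.95, 9: 11578.12,
    10: 11622.77,
}


def bic_weights(bic_by_model):
    """Normalised weights proportional to exp(-0.5 * (BIC - min BIC))."""
    best = min(bic_by_model.values())
    # Subtracting the minimum first keeps the exponentials well scaled.
    relative = {m: exp(-0.5 * (bic - best)) for m, bic in bic_by_model.items()}
    total = sum(relative.values())
    return {m: value / total for m, value in relative.items()}


weights = bic_weights(BIC_BY_MODEL)
for model, weight in sorted(weights.items()):
    print(f"model {model:2d}: BIC weight {weight:.3f}")
```

The saturated model 10 has the smallest −2 log L but a negligible weight, because its six extra parameters cost more under BIC than the fit they buy.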

[Figure: Odds Ratio (relative to '00') by genotype at locus A (aa, Aa, AA) and locus B (bb, Bb, BB), axis 0.00 to 2.50. Panels: Additive (log odds) SNPs, additive (log odds) interaction, a.k.a. “Main effects only”; Unrestricted model.]

Although the saturated model (with 8 unrestricted log odds ratio parameters) is “closest to the data,” the BIC suggests it is “too close.” The exceptional pattern for odds across the A locus in the BB stratum is probably just noise (small cells), not “gene-gene interaction”.

Pooled Phase I and II Results

Region | Pooled Phase I+II p-value | Initial Scan Rank | Initial Scan p-value
8q24 | 3.07E-19 | 116 | 1.12E-04
8q24 | 6.58E-12 | 300 | 3.92E-04
HNF1B | 9.58E-10 | 384 | 5.21E-04
MSMB | 7.31E-13 | 24,223 | 0.042
11q13 | 1.76E-09 | 2,439 | 0.004
CTBP2 | 1.70E-07 | 319 | 4.09E-04
JAZF1 | 2.14E-06 | 24,407 | 0.042

Thomas et al, in press

Population Attributable Risk (PAR)

Locus | Freq. | ORmul | PAR
8q24-a | 0.10 | 1.43 | 8%
8q24-c | 0.50 | 1.26 | 22%
HNF1B | 0.49 | 1.22 | 19%
MSMB | 0.40 | 1.22 | 16%
11q13 | 0.48 | 1.23 | 20%
CTBP2 | 0.27 | 1.17 | 9%
JAZF1 | 0.23 | 1.10 | 14%

Joint PAR ~ 60%

Thomas et al, in press

PARs do not add!

[Figure: diagram of All Cases partitioned by exposure E and genes G1 and G2.]

Marginal PAR for exposure E is 100%

Marginal PAR for gene G1 is 100%

Marginal PAR for gene G2 is 20%

A joint PAR of 60% for the top seven loci does not mean there are no other risk loci, nor does it mean modifiable environmental factors do not influence prostate cancer risk.
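A toy model shows how marginal PARs of 100%, 100% and 20% can coexist. In the sketch below disease occurs only when E and G1 are both present together with either G2 or a background factor B; B, all frequencies and all function names are hypothetical, chosen only so that the marginal PARs match those quoted above.

```python
from itertools import product

# Hypothetical, independent factor frequencies. B is an invented background
# cause that can stand in for G2.
FREQ = {"E": 0.5, "G1": 0.3, "G2": 0.25, "B": 0.5}


def diseased(state):
    """Disease requires E and G1, plus at least one of G2 or B."""
    return bool(state["E"] and state["G1"] and (state["G2"] or state["B"]))


def disease_probability(freq, removed=None):
    """P(disease), optionally with one factor removed from the population."""
    names = list(freq)
    total = 0.0
    for values in product([0, 1], repeat=len(names)):
        state = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= freq[name] if state[name] else 1.0 - freq[name]
        if removed is not None:
            state[removed] = 0  # counterfactual: nobody carries this factor
        if diseased(state):
            total += weight
    return total


def attributable_risk(freq, factor):
    """Fraction of cases that would be prevented by removing the factor."""
    return 1.0 - disease_probability(freq, removed=factor) / disease_probability(freq)


par = {name: attributable_risk(FREQ, name) for name in ("E", "G1", "G2")}
print({name: round(value, 3) for name, value in par.items()})
print("sum of marginal PARs:", round(sum(par.values()), 3))
```

The three marginal PARs sum to 220%, because every case is attributable to more than one factor at once.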

Individual Risk Prediction

Odds ratio comparing 90th percentile to 10th percentile ~ 2.5

Thomas et al, submitted

Based on allele frequencies in controls and a multi-locus model assuming codominant effects at each locus and multiplicative effects across loci.

Probability that a man in the top 10th percentile of risk according to the seven-SNP model develops prostate cancer: 45%

Positive predictive value for a screening test that predicts prostate cancer for men with a genetic risk profile above a given threshold; recall PPV involves test sensitivity and specificity AS WELL AS incidence rates (here: age-specific rates from the ACS website).
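The dependence of PPV on incidence as well as on sensitivity and specificity follows from Bayes' rule. The numbers in this sketch are hypothetical and are not the values behind the slide; the function name is illustrative.

```python
def positive_predictive_value(sensitivity, specificity, incidence):
    """P(disease | positive test) from sensitivity, specificity and incidence."""
    true_positives = sensitivity * incidence
    false_positives = (1.0 - specificity) * (1.0 - incidence)
    return true_positives / (true_positives + false_positives)


# Same hypothetical test applied at two different incidence rates.
ppv_common = positive_predictive_value(0.9, 0.9, 0.10)
ppv_rare = positive_predictive_value(0.9, 0.9, 0.01)
print(f"incidence 10%: PPV={ppv_common:.2f}; incidence 1%: PPV={ppv_rare:.2f}")
```

With the test held fixed, lowering the incidence alone lowers the PPV.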

Novel Risk Loci

• 8q24
– Three independent loci with no known function, associated with risks of prostate and colorectal cancer
• HNF1B (TCF2)
– Prostate cancer risk alleles associated with decreased risk of T2D
• MSMB
– Encodes beta-microseminoprotein, a proposed prostate-cancer biomarker
• CTBP2
– Has anti-apoptotic activity
• JAZF1
– Fused by translocation with SUZ12 in endometrial cancer

Where to from Here?

These results open up new and often unexpected avenues for research (cf. the 8q24 region). They may also point to etiologic pathways as targets for treatment.

Despite large PARs, individually these variants are not good predictors of an individual's risk. But taken together they may (MAY) be useful for prediction: either for screening or prognosis. The performance of any screening panel will need to be evaluated in independent studies, and its ultimate efficacy will depend on its discriminative power and on the availability of an intervention proven to reduce risk.

In the next 3-5 years we'll see many more discoveries using the simple, brute force approach illustrated here. The new challenge will be making sense out of it all: characterizing effects in different populations, looking for gene-environment interactions, and developing new treatments and sound & ethical prevention strategies to reduce cancer morbidity and mortality.

Acknowledgements

NCI Core Genotyping Facility: Stephen Chanock, Gilles Thomas, Meredith Yeager, Kevin Jacobs

NCI Division of Cancer Epidemiology and Genetics: Bob Hoover, Richard Hayes, Sholom Wacholder, Nilanjan Chatterjee, Kai Yu

Harvard School of Public Health: David Hunter, Jiali Han, Connie Chen

And all the subjects and support staff from the participating studies!

Further Reading

New England Journal of Medicine, 2 August 2007