(Genome-wide) association analysis
Peter M. Visscher
Queensland Institute of Medical Research
Brisbane, Australia
1
Outline
• Association vs linkage
• Linkage disequilibrium
• Analysis: single SNP
• GWAS: design, power
• GWAS: analysis
2
Linkage: families
Association: populations
3
Linkage disequilibrium around an ancestral mutation
[Ardlie et al. 2002]
4
m = functional mutation
[Cardon & Bell 2001]
Linkage and association
5
Why mapping by LD?
• Use association across families
  – no pedigree information needed
• Higher resolution of mapping
• Dense SNP maps: GWAS
6
Linkage vs. Association

                      Linkage            Association/LD
genealogy             known              unknown
marker sharing        by descent         by state
# meioses             small              large
shared DNA segments   large              small
markers               microsatellites    SNP
marker density        low                high
7
LD
• Non-random association between alleles at different loci
• Many possible causes
  – mutation
  – drift / inbreeding / founder effects
  – population stratification
  – selection
• Broken down by recombination
8
Measures of pair-wise LD
Definition             Symbol
Covariance             D
Scaled covariance      D’
Association            ρ
Correlation            r
Frequency difference   f
Delta                  δ
Yule                   y

[Morton et al. 2001, PNAS]
9
Measures
• All estimates of LD are functions of pairwise haplotype frequencies
• ‘Best’ measure depends on purpose of LD estimation
10
Properties of ‘good’ measures
• Simple biological interpretation
• Allow statistical tests
• Directly related to evolutionary forces (recombination, selection, drift, etc.)
• Standardised to allow comparisons across loci & populations
[Hedrick 1987]
11
Utility of LD measures
• Population dynamics
• Estimating population size
• Gene/QTL mapping
12
Definition of D
• 2 bi-allelic loci
  – Locus 1, alleles A & a, with freq. p and (1-p)
  – Locus 2, alleles B & b, with freq. q and (1-q)
  – Haplotype frequencies p_AB, p_Ab, p_aB, p_ab
D = p_AB - pq
13
Alternative expressions for D
D = p_AB - pq = D_AB
  = p_ab - (1-p)(1-q) = D_ab
  = -(p_Ab - p(1-q)) = -D_Ab
  = -(p_aB - (1-p)q) = -D_aB
  = p_AB p_ab - p_Ab p_aB
14
Related measures
D_max = smaller of pq and (1-p)(1-q) [D < 0]
      = smaller of p(1-q) and (1-p)q [D > 0]
• D’ = D / D_max
  -1 ≤ D’ ≤ 1
  Can compare pairs of loci across populations
15
|D’|
• |D’| = |D| / D_max
  0 ≤ |D’| ≤ 1
  – Can compare different pairs of loci in genome
  – |D’| = 1 if one of the 4 haplotype frequencies = 0
  – E(|D’|) and var(|D’|) not known
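The definitions of D, D_max and D’ can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `ld_d_and_dprime` and the example haplotype frequencies are made up for the purpose.

```python
def ld_d_and_dprime(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D_max, D') from the four haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p = p_AB + p_Ab  # frequency of allele A
    q = p_AB + p_aB  # frequency of allele B
    d = p_AB - p * q
    # D_max depends on the sign of D, as on the 'Related measures' slide
    if d < 0:
        d_max = min(p * q, (1 - p) * (1 - q))
    else:
        d_max = min(p * (1 - q), (1 - p) * q)
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_max, d_prime


# Hypothetical haplotype frequencies (not from the slides)
d, d_max, d_prime = ld_d_and_dprime(0.5, 0.1, 0.2, 0.2)

# One haplotype (Ab) absent: |D'| reaches 1
_, _, d_prime_missing = ld_d_and_dprime(0.5, 0.0, 0.2, 0.3)
```

With the second set of frequencies one haplotype is missing, so |D’| is 1 (up to floating-point rounding) even though the two loci are far from perfectly correlated.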
16
r^2
r^2 = D^2 / [pq(1-p)(1-q)]
• Squared correlation between presence and absence of the alleles in the population
• ‘Nice’ statistical properties
[Hill and Robertson 1968]
17
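To make the "squared correlation" interpretation concrete, the sketch below computes r^2 in two ways: from the formula D^2 / [pq(1-p)(1-q)], and as the squared Pearson correlation between 0/1 indicators for alleles A and B across a sample of haplotypes. The function names and the sample of 100 haplotypes are illustrative assumptions.

```python
def r2_from_frequencies(p_AB, p_Ab, p_aB, p_ab):
    """r^2 = D^2 / [pq(1-p)(1-q)] from haplotype frequencies."""
    p = p_AB + p_Ab
    q = p_AB + p_aB
    d = p_AB - p * q
    return d * d / (p * q * (1 - p) * (1 - q))


def r2_from_haplotypes(haplotypes):
    """Squared correlation between allele indicators.

    haplotypes is a list of (is_A, is_B) pairs coded 0/1.
    """
    n = len(haplotypes)
    mean_a = sum(a for a, _ in haplotypes) / n
    mean_b = sum(b for _, b in haplotypes) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in haplotypes) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in haplotypes) / n
    var_b = sum((b - mean_b) ** 2 for _, b in haplotypes) / n
    return cov * cov / (var_a * var_b)


# Hypothetical sample: 50 AB, 10 Ab, 20 aB, 20 ab haplotypes
sample = [(1, 1)] * 50 + [(1, 0)] * 10 + [(0, 1)] * 20 + [(0, 0)] * 20
r2_formula = r2_from_frequencies(0.5, 0.1, 0.2, 0.2)
r2_correlation = r2_from_haplotypes(sample)
```

The two values agree because the covariance between the indicators is D and their variances are p(1-p) and q(1-q).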
Properties of r and r^2
• Population in ‘equilibrium’
  E(r) = 0
  E(r^2) = var(r) ≈ 1/[1 + 4Nc] + 1/n
  N = effective population size
  n = sample size (haplotypes)
  c = recombination rate
• n r^2 ~ χ^2 (1 df)
[Sved 1971; Weir and Hill 1980]
18
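As a rough illustration of the approximation E(r^2) ≈ 1/[1 + 4Nc] + 1/n, the sketch below evaluates it for a few recombination rates. The values of N, c and n are hypothetical and chosen only to show how the population term and the sampling term compare.

```python
def expected_r2(N, c, n):
    """Approximate E(r^2) = 1/(1 + 4Nc) + 1/n.

    N: effective population size
    c: recombination rate between the two loci
    n: sample size in haplotypes
    """
    return 1.0 / (1.0 + 4.0 * N * c) + 1.0 / n


# Hypothetical inputs, not taken from the slides
for c in (0.0001, 0.001, 0.01):
    print(f"c = {c}: E(r^2) ~ {expected_r2(N=10_000, c=c, n=200):.4f}")
```

For loosely linked loci the 1/n term dominates, so a non-zero r^2 in a small sample is expected even without any LD in the population.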
Measures of LD are very variable!
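• If n r^2 ~ χ^2 (1 df), then
  [CV(r^2)]^2 ≈ 2
  A χ^2 variable with 1 df has mean 1 and variance 2, so E(r^2) = 1/n and var(r^2) = 2/n^2:
  CV = σ(r^2) / E(r^2) = (2/n^2)^0.5 / (1/n) = 2^0.5 ≈ 1.41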
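The coefficient of variation can be checked numerically by drawing from a χ^2 distribution with 1 df (the square of a standard normal). This is a small simulation sketch; the number of draws and the seed are arbitrary choices.

```python
import random


def simulate_cv_of_chi2_1df(num_draws=100_000, seed=1):
    """Estimate sd/mean of a chi-square(1 df) variable by simulation."""
    rng = random.Random(seed)
    # The square of a standard normal draw is chi-square with 1 df
    draws = [rng.gauss(0.0, 1.0) ** 2 for _ in range(num_draws)]
    mean = sum(draws) / num_draws
    variance = sum((x - mean) ** 2 for x in draws) / num_draws
    return variance ** 0.5 / mean


cv = simulate_cv_of_chi2_1df()
print(f"simulated CV = {cv:.3f}, theory = {2 ** 0.5:.3f}")
```

Because n r^2 and r^2 differ only by the constant n, the CV of r^2 is the same as the CV of the χ^2 variable itself.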
19
r^2 decay under different scenarios (Pritchard & Przeworski, 2001)
y-axis: (r^2)^0.5
20
r^2 decay in real data (Pritchard & Przeworski, 2001)
21
Population stratification
          Allele frequency    Haplotype frequency
          p_A1    p_B1        p_A1B1   p_A1B2   p_A2B1   p_A2B2
Pop. 1    0.9     0.9         0.81     0.09     0.09     0.01
Pop. 2    0.1     0.1         0.01     0.09     0.09     0.81
Average   0.5     0.5         0.41     0.09     0.09     0.41

Both populations are in linkage equilibrium
Combined population: D = 0.16 and D’ = 0.64
r^2 = 0.4096
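The numbers in the stratification table can be reproduced with a short calculation. The sketch below builds each population's haplotype frequencies under linkage equilibrium, averages them with equal weights (as the "Average" row does), and computes D, D’ and r^2 for the mixture. The function names are illustrative.

```python
def equilibrium_haplotypes(p_a1, p_b1):
    """Haplotype frequencies (A1B1, A1B2, A2B1, A2B2) with D = 0."""
    return (p_a1 * p_b1, p_a1 * (1 - p_b1),
            (1 - p_a1) * p_b1, (1 - p_a1) * (1 - p_b1))


def ld_measures(h):
    """Return (D, D', r^2) for haplotype frequencies h."""
    p = h[0] + h[1]  # frequency of A1
    q = h[0] + h[2]  # frequency of B1
    d = h[0] - p * q
    if d < 0:
        d_max = min(p * q, (1 - p) * (1 - q))
    else:
        d_max = min(p * (1 - q), (1 - p) * q)
    return d, d / d_max, d * d / (p * q * (1 - p) * (1 - q))


pop1 = equilibrium_haplotypes(0.9, 0.9)
pop2 = equilibrium_haplotypes(0.1, 0.1)
# Equal-weight mixture of the two populations
mixed = tuple((x + y) / 2 for x, y in zip(pop1, pop2))

d1, _, _ = ld_measures(pop1)
d_mix, d_prime_mix, r2_mix = ld_measures(mixed)
```

Within each population D is zero, yet the mixture shows substantial LD purely because allele frequencies differ between the populations.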
22
23
Decay of LD (bi-allelic loci)
D(0) = p_AB(0) - pq
p_AB(1) = (1-c) p_AB(0) + c pq
D(1) = (1-c) p_AB(0) + c pq - pq
     = (1-c) D(0)
D(t) = (1-c)^t D(0) ≈ e^(-ct) D(0)
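The recursion D(t) = (1-c)^t D(0) and its exponential approximation are easy to tabulate. The sketch below does this for the three recombination rates plotted on the following slide; the half-life helper is an extra illustration derived from the same formula, not something stated in the slides.

```python
import math


def ld_exact(t, c, d0=1.0):
    """D(t) = (1 - c)^t * D(0)."""
    return (1 - c) ** t * d0


def ld_approx(t, c, d0=1.0):
    """D(t) ~ exp(-c t) * D(0), accurate for small c."""
    return math.exp(-c * t) * d0


def ld_half_life(c):
    """Generations until D falls to half its starting value."""
    return math.log(0.5) / math.log(1 - c)


for c in (0.10, 0.01, 0.001):
    print(f"c = {c}: D(100)/D(0) = {ld_exact(100, c):.4f} "
          f"(approx {ld_approx(100, c):.4f}), "
          f"half-life = {ld_half_life(c):.1f} generations")
```

The approximation is close for c = 0.01 and c = 0.001 but noticeably off for c = 0.10, where (1-c) and e^(-c) differ more.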
24
Decay of LD
[Figure: LD (y-axis, 0.00 to 1.00) against generation (x-axis, 0 to 100) for c = 0.10, c = 0.01 and c = 0.001]
25
Approaches for quantitative traits
• Association (single locus)
• TDT
• GWAS
26
Association
• Random sample from the population
• Associate allelic or haplotype variant(s) with trait values
• H0: no association
• Analysis
  – Linear (mixed) model
  – Allele (haplotype) as fixed effect
27
Falconer model for single biallelic QTL
Var(X) = Regression Variance + Residual Variance
       = Additive Variance + Dominance Variance
[Figure: genotypic values on a scale centred on the midpoint m: bb = -a, Bb = d, BB = +a]
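The split of genotypic variance into a regression (additive) part and a residual (dominance) part can be verified by direct enumeration. The sketch below assumes the usual reading of the figure (genotypic values -a, d and +a for bb, Bb and BB, as deviations from the midpoint m), Hardy-Weinberg genotype frequencies, and p as the frequency of allele B; the function name and the example values are hypothetical.

```python
def falconer_variances(p, a, d):
    """Return (additive, dominance) variance for one bi-allelic QTL.

    Regresses genotypic value on the number of B alleles (0, 1, 2);
    the regression variance is the additive variance and the
    residual variance is the dominance variance.
    """
    q = 1 - p
    freqs = {0: q * q, 1: 2 * p * q, 2: p * p}  # Hardy-Weinberg
    values = {0: -a, 1: d, 2: a}
    mean_x = sum(f * x for x, f in freqs.items())
    mean_y = sum(f * values[x] for x, f in freqs.items())
    var_x = sum(f * (x - mean_x) ** 2 for x, f in freqs.items())
    cov_xy = sum(f * (x - mean_x) * (values[x] - mean_y)
                 for x, f in freqs.items())
    total = sum(f * (values[x] - mean_y) ** 2 for x, f in freqs.items())
    slope = cov_xy / var_x          # average effect of allele substitution
    additive = slope ** 2 * var_x   # regression variance
    dominance = total - additive    # residual variance
    return additive, dominance


# Hypothetical values
v_a, v_d = falconer_variances(p=0.3, a=1.0, d=0.5)
```

The results match the closed forms 2p(1-p)[a + d(1-2p)]^2 and [2p(1-p)d]^2 that appear in the numerator of the power formula later in the deck.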
28
29
TDT = transmission disequilibrium test
• Association with family-based controls
• Original TDT (disease mapping)
  – trios: two parents, one affected progeny
  – test for transmission of allele to affected progeny from heterozygous parents
    • non-transmitted allele is the control
• Quantitative traits
  – Test for association between trait value and allele within parental mating type
30
Classical TDT
Table 4.1: Transmission data for a bi-allelic marker.

               Not transmitted
Transmitted    Allele 1     Allele 2     Total
Allele 1       n11          n12          n11 + n12
Allele 2       n21          n22          n21 + n22
Total          n11 + n21    n12 + n22    n

The TDT statistic is
TDT = (n21 - n12)^2 / (n21 + n12) ~ χ^2 (1 df)
Test for both linkage and association
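The TDT statistic uses only the off-diagonal cells of the transmission table, i.e. transmissions from heterozygous parents. A minimal sketch follows; the function name and the counts are hypothetical, and the p-value uses the identity P(χ^2 with 1 df > x) = erfc(sqrt(x/2)).

```python
import math


def tdt(n12, n21):
    """Return (statistic, p_value) for the classical TDT.

    n12: allele 1 transmitted, allele 2 not transmitted
    n21: allele 2 transmitted, allele 1 not transmitted
    """
    if n12 + n21 == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    stat = (n21 - n12) ** 2 / (n21 + n12)
    # Upper tail of a chi-square with 1 df
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts
stat, p_value = tdt(n12=30, n21=50)
print(f"TDT = {stat:.2f}, p = {p_value:.4f}")
```

The diagonal cells n11 and n22 come from homozygous parents, whose transmissions carry no information about preferential transmission, which is why they do not enter the statistic.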
31
Quantitative traits
• With family data we can separate association into between and within family components
• Advantage
  – Within component is robust to stratification
• Disadvantage
  – Unrelated design is more powerful for same sample size
32
TDT for quantitative traits (regression model)
33
Information for Within Test
• Families are only informative for the within family component when the offspring can have different genotypes
  – AA x AA ?
  – AA x aa ?
  – AA x Aa ?
  – Aa x Aa ?
• At least one parent must be heterozygous
34
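The four mating types listed above can be checked by enumerating the possible offspring genotypes. This is a small sketch with made-up function names; a mating is counted as informative when more than one offspring genotype is possible.

```python
from itertools import product


def offspring_genotypes(parent1, parent2):
    """Set of possible offspring genotypes, e.g. 'AA' x 'Aa'."""
    # One allele from each parent; sort so 'aA' and 'Aa' are the same
    return {"".join(sorted(a + b)) for a, b in product(parent1, parent2)}


def is_informative(parent1, parent2):
    """True if offspring can differ in genotype within the family."""
    return len(offspring_genotypes(parent1, parent2)) > 1


for mating in [("AA", "AA"), ("AA", "aa"), ("AA", "Aa"), ("Aa", "Aa")]:
    kids = sorted(offspring_genotypes(*mating))
    print(f"{mating[0]} x {mating[1]}: offspring {kids}, "
          f"informative = {is_informative(*mating)}")
```

Only the two matings with a heterozygous parent produce variation among offspring, which matches the last bullet.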
Population Stratification
• When there is no population stratification the slopes for the between and within test should be equal
35
Power (bi-allelic locus)
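q^2 = {2p(1-p)[a + d(1-2p)]^2 + [2p(1-p)d]^2} / σ_p^2

ANOVA: fit 3 genotypes (2 df)
  λ_ANOVA = n q^2 / (1 - q^2)
Regression: y = µ + βx + e (x = 0, 1, 2) (1 df)
  n = [(1 - q^2) / q^2] (z_(1-α/2) + z_(1-β))^2
  (in the sample size formula β is the type II error rate, so 1 - β is the power; it is not the regression slope)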
Genetic Power Calculator (GPC)
http://pngu.mgh.harvard.edu/~purcell/gpc/
37
Power (n=1000)
38
GWAS
• Same principle as single locus association, but additional information
  – QC
    • Duplications, sample swaps, contamination
  – Power of multi-locus data
    • Unbiased genome-wide association
    • Relatedness
    • Population structure
    • Ancestry
39
Detection of susceptibility variants for common diseases
[Figure: penetrance (low, modest, intermediate, high) against allele frequency (very rare, rare, uncommon, common; ticks at 0.001, 0.01, 0.1). Labelled regions: linkage (requires same locus in many families or large pedigrees), association, CNV studies, sequencing, not yet detectable, atypical of common diseases. Modified from McCarthy et al.]
40
Detection of susceptibility variants for common diseases: the new era
[Figure: penetrance (low, modest, intermediate, high) against allele frequency (very rare, rare, uncommon, common; ticks at 0.001, 0.01, 0.1). Labelled regions: variants typically identified by GWAS, CNV studies, sequencing, CNV/sequencing, not yet detectable, atypical of common diseases.]
41
Sequence - SNPs
LD
Technology
42
GWAS in humans
Advances in genotyping technology
Sample collections of adequate size
Better understanding of patterns of human sequence variation
Genome-wide association scans
3,000,000,000 bases in human genome
~10,000,000 positions commonly variant in Europeans
80% of these captured by typing ~500k
Samples of interest: test for evidence of association
43
• Categorical traits
  – disease susceptibility genes
• Continuous traits
  – quantitative trait loci, QTL
44
Age-related macular degeneration
45
GWAS analysis
Challenges
• most obviously, multiple testing burden
• computation
Opportunities
• simple methods can work well with more data
• novel analyses permitted
46
The multiple testing burden
[Figure: P(at least 1 false positive) (y-axis, 0 to 1) against number of independent tests performed (x-axis, 0 to 50), for a per test false positive rate of 0.05 and a per test false positive rate of 0.001 = 0.05/50]
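The two curves in this figure follow from P(at least 1 false positive) = 1 - (1 - α)^k for k independent tests at per-test rate α. A minimal sketch, with an illustrative function name:

```python
def prob_at_least_one_false_positive(alpha, num_tests):
    """Family-wise error rate for independent tests at level alpha."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 10, 50):
    uncorrected = prob_at_least_one_false_positive(0.05, k)
    corrected = prob_at_least_one_false_positive(0.05 / 50, k)
    print(f"{k:2d} tests: alpha = 0.05 -> {uncorrected:.3f}, "
          f"alpha = 0.001 -> {corrected:.3f}")
```

With 50 tests at α = 0.05 a false positive is almost certain, whereas dividing α by the number of tests keeps the overall rate just under 0.05.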
47
Genomic control
[Figure: χ^2 at the test locus and at unlinked ‘null’ markers, relative to E(χ^2), shown for two cases: no stratification, and stratification]
Stratification → adjust test statistic
48
Genomic control
Simple estimate of inflation factor:
λ̂ = median{χ^2_1, χ^2_2, ..., χ^2_N} / 0.456
• median protects from outliers
  – i.e. true effects
• bounded at minimum of 1
  – i.e. should never increase test statistic
• extends to multiple alleles, haplotypes, quantitative traits, different tests, etc.
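The inflation-factor estimate can be sketched as follows, assuming the adjustment is made by dividing each 1 df χ^2 statistic by the estimated λ (the slide says only "adjust test statistic"). The function name and the example statistics are hypothetical; the constant 0.456 is the one given on the slide.

```python
from statistics import median


def genomic_control(chi2_stats, null_median=0.456):
    """Return (lambda_hat, adjusted statistics).

    lambda_hat = median of the observed chi-square (1 df) statistics
    divided by the null median, bounded below at 1 so that no test
    statistic is ever increased.
    """
    lambda_hat = max(1.0, median(chi2_stats) / null_median)
    adjusted = [x / lambda_hat for x in chi2_stats]
    return lambda_hat, adjusted


# Hypothetical statistics: inflated overall, with one large true effect
inflated = [0.1, 0.3, 0.912, 2.0, 25.0]
lam, adjusted = genomic_control(inflated)

# Hypothetical statistics with no inflation: lambda is held at 1
lam_null, unchanged = genomic_control([0.05, 0.2, 0.3, 1.1, 3.0])
```

The single large value of 25 does not move the median, which is the sense in which the median protects the estimate from true effects.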
49
Empirical assessment of ancestry
~2K SNPs
CEPH/European, Yoruba, Han Chinese, Japanese
51
Entire Phase I HapMap
Empirical assessment of ancestry
52
Han Chinese, Japanese
~10K SNPs
Empirical assessment of ancestry
53