1
[Figure: concept map of association mapping. Main branches: LD, Methods, Germplasm. Topics shown: definition; causes; extent of LD; haplotype blocks; recombination hotspots; marker density; regression; model-based or PCA?; multiple testing vs. shrinkage; genomic selection; candidate loci or whole genome?; gene identification or marker-assisted selection?; species; breeding system; panel diversity; sub-population structure; signatures of selection; confounded structure and polymorphism]
2
Outline
• Association mapping is regression
• Accounting for structure
  – Estimating structure using markers
  – Truly multi-factorial models
• Miscellaneous topics:
  – Genomic control; TDT; Confounding with structure; Haplotype predictors; Genetic heterogeneity; Missing heritability; NAM; Validation
3
Association Mapping
• It’s the same thing as linkage mapping in a bi-parental population but in a population that has not been carefully designed and generated experimentally
• Because the experiment has not been designed, it is messy. Statistical methods are needed to deal with the mess
4
Regression
• x_i is the allelic state of individual i at a marker
• Consider the total genotypic effect of individual i
• q_i is the allelic state at a QTL with which the marker is (hopefully) in LD
• Now estimate β
5
Estimate of Beta
[Equation annotations: one term is the part having to do with LD; the other is the part due to the multi-factorial trait / structure]
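The equation itself did not survive on this slide. One way to write the decomposition that the two annotations describe is sketched here, under assumed notation: y_i is the phenotype, α is the effect of the QTL allele q_i, g_i collects all remaining genetic effects, and e_i is residual noise independent of x. The symbols α, g_i and e_i are assumptions for illustration, not taken from the slide.

```latex
\begin{aligned}
y_i &= \mu + \alpha q_i + g_i + e_i \\
\hat{\beta} &= \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}
  = \underbrace{\alpha\,\frac{\operatorname{cov}(x, q)}{\operatorname{var}(x)}}_{\text{part having to do with LD}}
  + \underbrace{\frac{\operatorname{cov}(x, g)}{\operatorname{var}(x)}}_{\text{multi-factorial trait / structure}}
\end{aligned}
```

Under this reading, the first term is what association mapping is after, and the second term is a bias that vanishes only when cov(x, g) = 0.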
6
When is cov(x, g) non-zero?
• Differences in allele frequencies at the marker between subpopulations AND difference in phenotypic mean between subpopulations
  – The difference in mean can be due to a single or many loci
• Difference in the frequency of alleles between families AND difference in family phenotypic means within a (sub)population
7
Structure possibilities
[Figure: structure possibilities, with population structure on the vertical axis and familial relatedness on the horizontal axis]
Yu, J., Pressoir, G., et al. 2006. Nat Genet 38:203-208
8
Controlling for structure
• Basic quantitative genetics:
  – Two individuals who share many alleles should resemble each other phenotypically
  – Use markers to figure out how many alleles individuals share and then use that to adjust statistically for their phenotypic resemblance
9
Controlling for structure
• The “mixed model”
Yu, J., Pressoir, G., et al. 2006. Nat Genet 38:203-208
10
Controlling for structure
• Structure => large differences in allele frequencies across many markers
[Figure: scatter plot of average marker Set2 score (0 to 1) against average marker Set1 score (0 to 1); annotations: "First PCA axis" and "Potential Phenotypic Gradient"]
Regression coefficients of the phenotype on the PCA values
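A minimal sketch of the idea on this slide, assuming simulated data: principal component scores are computed from the marker matrix and included as covariates when a marker is tested. The function names, the two-subpopulation simulation, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def pca_scores(markers, n_pc):
    """First n_pc principal component scores of a lines x markers matrix."""
    centered = markers - markers.mean(axis=0)
    # SVD of the centered matrix: the columns of u * s are the PC scores
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pc] * s[:n_pc]

def marker_effect(y, x, covariates):
    """Least-squares estimate of the marker effect, adjusting for covariates."""
    design = np.column_stack([np.ones(len(y)), x, covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # coefficient of the tested marker

rng = np.random.default_rng(0)
n, m = 200, 300
subpop = np.repeat([0, 1], n // 2)
# Allele frequencies differ between the two hypothetical subpopulations
freq = np.where(subpop[:, None] == 0, 0.2, 0.8)
markers = rng.binomial(1, freq, size=(n, m)).astype(float)
# The phenotype differs between subpopulations; no marker has a true effect
y = 2.0 * subpop + rng.normal(size=n)

x = markers[:, 0]
naive = marker_effect(y, x, np.empty((n, 0)))               # no adjustment
adjusted = marker_effect(y, x, pca_scores(markers[:, 1:], 1))  # adjust for PC1
print(naive, adjusted)
```

In this setup the unadjusted estimate picks up the subpopulation difference, while the estimate adjusted for the first PC should be much closer to zero.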
11
Use of PCA
• Results are not sensitive to the number of PCs, provided you have enough
  – Price, A.L. et al. 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904-909
• The number of significant PCs can be determined
  – Patterson, N. et al. 2006. Population Structure and Eigenanalysis. PLoS Genetics 2:e190
• Use a “Screeplot”
12
Historical footnote
• PCA achieves what the Pritchard program Structure does
• PCA is faster and more robust
Pritchard, J.K. et al. 2000. Genetics 155:945-959
Price, A.L. et al. 2006. Nat Genet 38:904-909
Patterson, N. et al. 2006. PLoS Genetics 2:e190
13
Kinship
• We are all a little bit related:
  – Two unrelated people: go back 1 generation, all four parents must be different people.
  – Go back 2 generations, all eight grand-parents must be different people.
  – Go back 30 generations, all 2.1 billion ancestors would need to be different people: Impossible!
14
Identity by Descent
• Two alleles that are copies (through reproduction) of the same ancestral allele
Coefficient of Coancestry
• Choose a locus
• Pick an allele from Ed and one from Peter
• Probability that the alleles are IBD = Ed and Peter’s Coefficient of Coancestry, θ_EP
15
Coef. of Coancestry –> A matrix
• A is the additive relationship or kinship matrix
[Figure: kinship matrix with groups labeled Winter, Two-Row, Six-Row, and “Bison”]
16
A constrains u
• Two individuals who share many alleles should resemble each other phenotypically
• u is the polygenic effect
• Its covariance matrix is Var(u) = Aσ²_u
• If a_ij has a high value, u_i and u_j should have similar values (they have high covariance)
• A constrains the values that are possible for u
17
Single locus, additive model: cov(ui, uj)
18
A matrix from the pedigree
• The cells in the A matrix are a_ij = 2θ_ij, the additive relationship coefficients between i in the row and j in the column
• Coefficient of coancestry θ_ij: the probability that two alleles, one drawn at random from i and one from j, are IBD
• Calculate from the pedigree by recursion:
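The recursion formula is not shown in the extracted slide. A common way to do this calculation is the tabular method, sketched below under the assumption that parents are listed before their offspring and that founders are unrelated and non-inbred. The function name and the pedigree format are hypothetical; the five individuals are those of the Mixed Model Example slides.

```python
import numpy as np

def additive_relationship(pedigree):
    """Tabular method for the A matrix.

    pedigree: list of (id, parent1, parent2), parents listed before offspring,
    with None for an unknown (founder) parent.
    """
    ids = [row[0] for row in pedigree]
    index = {name: k for k, name in enumerate(ids)}
    n = len(ids)
    a = np.zeros((n, n))
    for j, (_, p1, p2) in enumerate(pedigree):
        s, d = index.get(p1), index.get(p2)
        # Off-diagonal: average of the relationships of i with j's two parents
        for i in range(j):
            value = 0.0
            if s is not None:
                value += 0.5 * a[i, s]
            if d is not None:
                value += 0.5 * a[i, d]
            a[i, j] = a[j, i] = value
        # Diagonal: 1 + inbreeding coefficient (half the parents' relationship)
        both_known = s is not None and d is not None
        a[j, j] = 1.0 + (0.5 * a[s, d] if both_known else 0.0)
    return ids, a

# a, b, c, d unrelated founders; e is the offspring of c and d
pedigree = [("a", None, None), ("b", None, None), ("c", None, None),
            ("d", None, None), ("e", "c", "d")]
ids, A = additive_relationship(pedigree)
print(A)
```

Since a_ij = 2θ_ij, the coancestry between e and each of its parents is half the corresponding entry of A.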
19
A matrix from marker data
, the homozygosities over all markers and alleles
20
With inbreeding, parental contributions NOT 50:50
• Maize intermated population
• Drift during intermating and inbreeding
• Markers can give more accurate θ than pedigree
[Figure: histogram; horizontal axis from 0 to 1 in steps of 0.05, vertical axis from 0 to 90; axis titles not recovered]
21
Mixed Model Example
• Five individuals: a, b, c, d, and e
• a and b are in subpop 1; c, d, and e are in subpop 2
• a, b, c, and d are unrelated; e is the offspring of c and d
• a and d carry the 0 allele; b, c, and e carry the 1 allele
y = μ + Xβ + Qv
23
Mixed Model Example
• Same five individuals as above; the model now includes the polygenic and residual terms
y = μ + Xβ + Qv + Zu + e
24
Mixed Model Example
• Same five individuals as above
• The Zu term: var(u) = Aσ²_u, with A = [matrix not recovered]
25
Mixed Model Example
• There is a polygenic effect u for each individual => overdetermined model?
• NO: u is a random effect, constrained by Aσ²_u
26
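Mixed Model Example
y = μ + Xβ + Qv + Zu + e
[Matrix equation not recovered: the solutions are obtained as (coefficient matrix)⁻¹ × (right-hand side)]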
Control false positives from structure
[Figure, panels a-c: cumulative P against observed P (both 0 to 0.5) for flowering time (high population structure), ear height (moderate population structure), and ear diameter (low population structure); models compared: Simple, Q, K, Q + K, GC]
A straight diagonal line indicates an appropriate control of false positives.
Q + K model has best Type I error control, most important when trait is related to population structure (e.g., flowering time).
28
Statistical power
[Figure, panels d-f: adjusted average power (0 to 1) against genetic effect, given as phenotypic variation explained in % (0, 0.8, 3.3, 7.1, 11.9, 17.4), for flowering time (high population structure), ear height (moderate population structure), and ear diameter (low population structure); models compared: Simple, Q, K, Q + K, GC]
Q + K model had highest power to detect SNPs with true effects.
29
Controlling for Structure
[Figure panels: Original; P Matrix; K Matrix]
30
FDR vs. Power for 300 lines, 10 QTL
31
Effect of line number, P-only
10 QTL, 0.75 heritability
32
Effect of Reduced Population Diversity
33
Take homes on diversity
• At equal population size
  – A less diverse population can increase power because, relative to the extent of LD, the average marker distance is lower
  – Given that you are testing fewer markers, the multiple testing problem is reduced
• Avoid as much as possible reducing population size for the sake of obtaining a more homogeneous population
34
Guidelines
• More lines and more markers are better
• For a diverse population, 800+ lines
• For a narrower population, 300+ (?)
• FDR is a reasonable method of determining significance, but probably conservative
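The slides do not name a specific FDR procedure. As an illustration, the sketch below assumes the Benjamini-Hochberg step-up procedure; the function name and the example p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues, fdr=0.05):
    """Boolean mask of tests declared significant at the given FDR level."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the k-th smallest p-value with fdr * k / m
    thresholds = fdr * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if passed.any():
        # Step-up: reject everything up to the largest k that passes
        k = np.nonzero(passed)[0].max()
        significant[order[:k + 1]] = True
    return significant

pvals = [0.001, 0.025, 0.026, 0.2, 0.9]
print(benjamini_hochberg(pvals))
```

In this example the second p-value (0.025) misses its own threshold (0.02) but is still declared significant, because the third p-value (0.026) passes its threshold (0.03) and the procedure rejects everything ranked at or below it.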
35
Q constant, K estimated
• Q estimated with all markers, K estimated with varying fraction of markers available
[Figure, panels d-f: variance ratio (0 to 0.80) against marker number (0% to 100%) for SSR and SNP markers; traits: Flowering time, Ear height, Ear diameter]
36
Q estimated, K constant
• Q estimated with varying fraction of markers available, K estimated with all markers
[Figure, panels d-f: variance ratio (0 to 0.80) against marker number (0% to 100%) for SSR and SNP markers; traits: Flowering time, Ear height, Ear diameter]
37
History / future of controlling for structure
[Equation annotations: one term is the part having to do with LD; the other is the part due to the multi-factorial trait / structure]
38
Single locus: model mis-specification
• “the problem is better thought of as model mis-specification: when we carry out GWA analysis using a single SNP at a time, we are in effect modeling a multifactorial trait as if it were due to a single locus”
  – Atwell S. et al. 2010. Nature 465:627-631
39
History: Candidate locus studies
• AM started out with candidate locus studies where the effects of few loci could be fitted
• The biotechnology was not there to type more than a few loci
• The genetic background needed to be accounted for somehow (see above)
• In any event, the computational power was not there to fit all 10^6 loci simultaneously
40
Future: GWAS fitting all loci
• These methods could displace mixed models accounting for structure
Logsdon, B. et al. 2010. BMC Bioinformatics 11:58.
41
Sundry topics
• Other methods to control structure
• QTL confounded with structure
• Single markers or haplotypes?
• Genetic heterogeneity
• Missing heritability
• Linkage disequilibrium / Linkage analysis
• Validation
42
Genomic Control
• Calculate bias in distribution of test statistic using “neutral” loci, then account for bias
  – Devlin, B. and Roeder, K. 1999. Genomic Control for Association Studies. Biometrics 55:997-1004.
• Works best for candidate genes: test loci can be distinguished from neutral control loci. Works less well for whole genome scans
  – Marchini, J. et al. 2004. Nat. Genet. 36:512-517
  – Devlin, B. et al. 2004. Nat. Genet. 36:1129-1131.
  – Marchini, J. et al. 2004. Nat. Genet. 36:1131-1131
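A minimal sketch of the genomic control idea, assuming 1-df chi-square test statistics and the median-based inflation factor: the inflation factor λ is estimated from the "neutral" loci and the test statistics are divided by it. The function name and the simulated statistics are hypothetical.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def genomic_control(test_stats, neutral_stats):
    """Deflate 1-df chi-square statistics by the inflation seen at neutral loci."""
    lam = np.median(neutral_stats) / CHI2_1DF_MEDIAN
    lam = max(1.0, lam)  # do not inflate statistics when there is no excess
    return np.asarray(test_stats, dtype=float) / lam, lam

rng = np.random.default_rng(0)
# Hypothetical neutral loci whose statistics are inflated 1.5-fold by structure
neutral = 1.5 * rng.chisquare(df=1, size=5000)
candidates = np.array([3.0, 8.0, 15.0])
corrected, lam = genomic_control(candidates, neutral)
print(lam, corrected)
```

This makes the candidate-gene setting on the slide concrete: the correction relies on having a set of loci that can be assumed neutral for the trait.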
43
Transmission Disequilibrium Test
• Experimental rather than statistical control of effects of structure
• Originally conceived for dichotomous (e.g., disease / no disease) traits
• Affected offspring and both parents, of which one must be heterozygous
• Test whether a putative causal allele is transmitted more often than 50% of the time
• Spielman, R.S. et al. 1993. Am. J. Hum. Genet. 52:506-516
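A minimal sketch of the TDT statistic in its usual McNemar form, assuming a biallelic marker: among heterozygous parents of affected offspring, b is the number of times the putative causal allele was transmitted and c the number of times it was not. The function name and the counts are hypothetical.

```python
def tdt_statistic(transmitted, not_transmitted):
    """Chi-square statistic (1 df) testing for a departure from 50:50 transmission."""
    b, c = transmitted, not_transmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)

# Hypothetical counts from 100 heterozygous parents
stat = tdt_statistic(60, 40)
print(stat)  # 4.0, which exceeds 3.84, the 5% critical value for 1 df
```

Only heterozygous parents contribute, which is why the design controls for structure experimentally: each transmission is compared with the alternative transmission from the same parent.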
44
TDT
• Extensions for quantitative traits
  – Allison, D.B. 1997. Am. J. Hum. Genet. 60:676-690
• Extensions for larger-than-trio pedigrees
  – Monks, S.A., and N.L. Kaplan. 2000. Am J Hum Genet 66:576-92
• Using for populations under artificial selection
  – Bink, M.C.A.M. et al. 2000. Genetical Res. 75:115-121
45
QTL confounded with structure
• Particularly important for QTL affecting adaptation, e.g., flowering time
Camus-Kulandaivelu, L. et al. 2006. Genetics 172:2449–2463
46
Also in rice…
[Figure legend: Ghd7-0a, non-functional; Ghd7-2, weak allele; Ghd7-0, deleted; Ghd7-1 and Ghd7-3, functional]
Given geographic distribution and role in adaptation, selection using this locus will have marginal utility
Xue, W. et al. 2008. Nat Genet 40:761-767
47
Confounded QTL with structure
• Association analysis will have difficulty identifying such QTL: the QTL needs to be polymorphic within subpopulations
• Traditional linkage studies of crosses between members of different subpopulations should be very effective in this case
• e.g., Xue, W. et al. 2008. Nat Genet 40:761-767
• Multi-factorial methods will have difficulty identifying loci under strong structure
48
Dwarf8: Confounded with structure
• Thornsberry, J.M. et al. 2001. Nat. Genet. 28:286-289
  – First structured association test applied to plants
Camus-Kulandaivelu, L. et al. 2006. Genetics 172:2449–2463
49
Single markers or haplotypes?
• The jury is still out
• Infinite ways to simulate and analyze
  – Ne, QTL MAF, QTL effect, quantitative vs. binary, age of mutation
• Ex. 1: Dramatically more power for haplotypes vs single markers
  – Durrant, C. et al. 2004. Am J Hum Genet 75:35-43
50
Single markers or haplotypes?
• Ex. 2: Similar or lower power for haplotype method relative to single marker method
• Zhao, H.H. et al. 2007. Genetics 175:1975-1986
• Which method is most appropriate under which conditions still has to be sorted out
51
Exploiting Haplotype Blocks
• Objective: reduce the genotyping cost while capturing polymorphism at all (most) loci
• Haplotype: series of alleles at adjacent polymorphic loci
• Blocks: majority of diversity in few haplotypes
  – => Strong LD between loci within blocks; weak LD between loci across blocks
• Knowledge of the allele at one locus provides much information on the alleles at other loci
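To make "strong LD" and "weak LD" concrete, here is a sketch of the standard pairwise LD measures D, D' and r² for two biallelic loci, computed from phased haplotypes (or inbred lines) coded 0/1. The function name and the toy haplotypes are hypothetical, and both loci are assumed to be polymorphic.

```python
import numpy as np

def ld_measures(locus_a, locus_b):
    """Return (D, |D'|, r^2) for two polymorphic biallelic loci coded 0/1."""
    a = np.asarray(locus_a, dtype=float)
    b = np.asarray(locus_b, dtype=float)
    pa, pb = a.mean(), b.mean()
    # D: frequency of the 1-1 haplotype minus its expectation under independence
    d = (a * b).mean() - pa * pb
    # Largest |D| possible given the allele frequencies and the sign of D
    if d >= 0:
        d_max = min(pa * (1 - pb), (1 - pa) * pb)
    else:
        d_max = min(pa * pb, (1 - pa) * (1 - pb))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (pa * (1 - pa) * pb * (1 - pb))
    return d, d_prime, r2

# Two loci in complete LD (only two haplotypes present)
print(ld_measures([0, 0, 1, 1], [0, 0, 1, 1]))
# Two loci in equilibrium (all four haplotypes equally frequent)
print(ld_measures([0, 0, 1, 1], [0, 1, 0, 1]))
```

When r² between two loci is near 1, the allele at one locus predicts the allele at the other, which is the property that tagging relies on.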
52
What causes blocks?
• Recombination heterogeneity: coldspots within blocks, hotspots between blocks
• Random sampling of alleles and timing of mutation relative to recombination events
53
Evidence for mechanisms
• Humans: High marker density resources
  – Observed LD structure reproduced best if recombination hotspots every ~ 100 kbp
    • Reich, D.E. et al. 2002. Nat Genet 32:135-142.
    • Wall, J.D., and J.K. Pritchard. 2003. Am J Hum Genet 73:502-15.
  – Block boundaries correspond with positions of high current recombination
    • Jeffreys, A.J. et al. 2005. Nat Genet 37:601-606
  – Block boundaries consistent across different human populations
    • De La Vega, F.M. et al. 2005. Genome Res. 15:454-462
54
Hotspots exist in plants (Arabidopsis) too
Kim, S. et al. 2007. Nat Genet 39:1151-1155
70 kbp
55
Blocks also arise randomly
• No relation between historic recombination (histogram) and block boundaries (dark bars)
  – Verhoeven, K.J.F. and K.L. Simonsen. 2005. Mol. Biol. Evol. 22:735-740
56
Blocks in barley
40 ?? Mbp
57
Block cause matters
• If blocks arise from a recombination process, they will be consistent across populations
• Markers that tag blocks identified in one population will therefore be useful in others
• If blocks arise from random processes, tags useful in one population will not be so in another
58
Haplotypes for discovery in barley
• 2198 mapped SNP in 1807 lines across barley
• Five methods
  – Traditional single SNP
  – Four gamete: use D’ to determine boundaries
  – Tree scan: single df contrasts based on parsimony
  – HapBlock: group to capture diversity
  – Sliding window of 3 SNP
• Simulation: mask a SNP and pretend it’s a QTL
• Real data: heading date on 1040 lines
59
Results
• Simulation: single SNP best in 5 / 8 cases and never worse than the best haplotype method
• CAUTION: the QTL had the same properties as the SNP: ideal for single SNP discovery
• IF QTL simulated as recent mutations on blocks with haplotype properties THEN haplotype methods had higher power
  – Even then, single SNP did pretty well
60
Real data (heading date)
[Figure: chromosomes 1H, 2H, 3H, 4H, 5H, 6H, 7H; legend: All methods, Only TreeScan, Only 4gamete]
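4gamete success
[Figure: haplotype network; nodes labeled with a 3-SNP haplotype, a value, and a count: 000, -0.53, 64; 011, 0.16, 489; 110, -0.48, 386; 010, -0.44, 82; 111, -0.69, 12; 001, -0.99, 3; one node marked *]
• Rare recombinants split off early-heading lines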
62
Take-homes
• Simulations don’t support use of haplotype blocks
  – But we don’t know how to simulate the true nature of QTL
• With real data, a diversity of approaches might produce the most useful candidate list
63
Block vs tag identification
• Blocks require position, tags do not
• General tag marker approach:
  – Identify markers in high LD with each other
  – Retain only one
  – Aggressive: among tags, see if combinations can be used instead of single tags
de Bakker, P.I.W. et al. 2005. Nat Genet 37:1217-1223
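The general tag marker approach can be sketched as a greedy selection on pairwise r². This is a simplified illustration, not the algorithm of de Bakker et al.: it handles single-marker tags only, assumes all markers are polymorphic, and uses a hypothetical r² threshold of 0.8 and a hypothetical function name.

```python
import numpy as np

def greedy_tags(markers, r2_threshold=0.8):
    """Pick tag markers so that every marker has r^2 >= threshold with a tag.

    markers: lines x markers matrix of 0/1 alleles, all columns polymorphic.
    """
    r2 = np.corrcoef(markers, rowvar=False) ** 2
    untagged = set(range(markers.shape[1]))
    tags = []
    while untagged:
        # Choose the marker in high LD with the most still-untagged markers
        best = max(sorted(untagged),
                   key=lambda j: sum(r2[j, k] >= r2_threshold for k in untagged))
        tags.append(best)
        covered = {k for k in untagged if r2[best, k] >= r2_threshold}
        untagged -= covered
        untagged.discard(best)
    return tags

# Toy data: markers 0, 1 and 2 are in complete LD; marker 3 is independent
markers = np.array([[0, 0, 1, 0],
                    [0, 0, 1, 1],
                    [1, 1, 0, 0],
                    [1, 1, 0, 1]], dtype=float)
print(greedy_tags(markers))  # [0, 3]
```

Note that no map positions are used, in line with the first bullet: tags require only pairwise LD, whereas blocks require position.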
64
Reducing marker numbers
• Tag SNP and Imputation:
• Tag marker approach
  – Identify markers in high LD with each other
  – Retain only one
  – Aggressive: among tags, see if combinations can be used instead of single tags
65
Tagging works
• Power is maintained; genotyping is reduced
[Figure: power against average marker spacing (kbp) for the strategies Greedy, Best N, Random Tags, and No LD]
de Bakker, P.I.W. et al. 2005
66
Tags serve as a base for imputation
• Model-based imputation using fastPHASE
Scheet and Stephens. 2006
67
Imputation on tag markers works
Jannink et al. 2007
68
Imputation can increase power
Marchini et al. 2007
[Figure: horizontal axis is Chromosomal Position (Mb)]
69
Imputation can increase power
Guan and Stephens 2008
70
Genome scans with low LD
• Numerous species have too low LD to perform (as of yet) whole genome scans
  – You would need too many SNP on too many genotypes
• “Nested Association Mapping”
• Known as “Linkage disequilibrium linkage analysis” (LDLA) in animal genetics
Yu, J. et al. 2008. Genetics 178:539-551
Meuwissen, T.H.E. et al. 2002. Genetics 161:373-379.
71
NAM Design
• B73 is the reference parent, crossed to 26 other inbred lines, representing a large part of maize diversity
[Figure: crossing scheme. B73 × CML52 → F1 → RIL1, RIL2, …, RIL199, RIL200; likewise B73 × P39 → F1 → RIL1, RIL2, …, RIL199, RIL200; labeled "26 Times"]
72
[Figure: B73 × 25 DL → F1s → SSD → RIL 1, 2, …, 200. The 25 DL: B97, CML103, CML228, CML247, CML277, CML322, CML333, CML52, CML69, Hp301, Il14H, Ki11, Ki3, Ky21, M162W, M37W, Mo18W, MS71, NC350, NC358, Oh43, Oh7B, P39, Tx303, Tzi8]
73
NAM Genotyping
• Type parents at high density (2.5 M SNP…)
• Type RIL at low density (10 k SNP): know, on a sub-cM scale, which parental allele was inherited
74
NAM linear models
• P: matrix indicating which parent contributed the allele to each offspring. α: vector of effects of parental alleles.
• Eq. 1
• A linear model for the parental allele effects:
• Eq. 2
• This latter model is what Yu et al. call “Projecting parental SNP on to the progeny.”
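Eq. 1 and Eq. 2 did not survive extraction. The sketch below shows one plausible reading, stated as an assumption rather than as the equations on the slide: Eq. 1 as y = Pα + e, and Eq. 2 as a regression of the parental allele effects α on the parental SNP genotypes. The simulated data, the effect size, and the omission of family effects are all simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_parents, rils_per_family = 26, 200   # B73 (index 0) plus 25 other parents
# Hypothetical parental alleles at one SNP inside the QTL interval
snp = np.zeros(n_parents)
snp[1::2] = 1.0
true_effect = 2.0                       # hypothetical effect of the 1 allele

# Each RIL carries, at this interval, either the B73 allele or the allele
# of the other parent of its family
donors = np.concatenate([rng.choice([0, fam], size=rils_per_family)
                         for fam in range(1, n_parents)])
P = np.zeros((donors.size, n_parents))
P[np.arange(donors.size), donors] = 1.0
y = P @ (true_effect * snp) + rng.normal(size=donors.size)

# Assumed form of Eq. 1: y = P alpha + e (effects of parental alleles)
alpha_hat, *_ = np.linalg.lstsq(P, y, rcond=None)

# Assumed form of Eq. 2: alpha = mu + snp * beta + residual
design = np.column_stack([np.ones(n_parents), snp])
(mu_hat, beta_hat), *_ = np.linalg.lstsq(design, alpha_hat, rcond=None)
print(beta_hat)
```

Under this reading, the second step has only as many data points as there are parents, but each α is estimated from many RILs.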
75
NAM / LDLA on Maize
• Consider:
  – 2.5 Gbp genome with LD extending to 1000 bp
  – Requires 2.5 M SNP…
• Apply Eq. 1 to identify QTL with, say, 3 cM C.I.
  – Within interval there are ~ 3000 parental SNP
• Apply Eq. 2 to dissect the QTL to its causal SNP
  – Feasible with 25 genotypes (apparently)
  – Note that α will be accurately estimated
76
Advantages of NAM / LDLA
• Adds power without adding huge genotyping burden
• Reduces / eliminates problems related to structure: the linkage part of the analysis removes long-distance LD
77
Sugary1: Genetic heterogeneity
Tracy, W.F. et al. 2006. Crop Sci 46:S-49-54
78
Genetic heterogeneity hinders AM
• Distinct mutations at the B locus
• If B2 and B3 cause a phenotype (e.g. loss of function at isoamylase), it will be associated with A1 in one case and A2 in the other case.
• B2 and B3 can be identified by linkage mapping
B2 associated with A1
A1B1 Exists A2B1 Exists
A1B2 New A2B2 --
B3 associated with A2
A1B1 Exists A2B1 Exists
A1B3 -- A2B3 New
79
How prevalent is heterogeneity?
Buckler et al. 2009. Science 325:714-718
80
Multiple hits => Allelic series
Buckler et al. 2009. Science 325:714-718
81
Heterogeneity and Pop. History
• If a population has gone through a severe bottleneck, polymorphic loci are unlikely to have > 2 alleles…
• Heterogeneity is less likely in domesticated populations with low Ne
82
83
Missing Heritability
• Heritability for height in humans is ~ 0.80
• Very large GWAS studies find ~ 50 SNP together accounting for 5% of that heritability
• Where’s the rest?
  – Infinitesimal effects
  – Low frequency SNP in same causal genes
  – Epigenetics
  – Genotype x environment interaction
  – Epistasis
Maher, B. 2008. Nature 456:18-21
Manolio, T.A. et al. 2009. Nature 461:747-753.
84
Plants are not like humans
• Atwell et al. 2010. Nature 465:627-631
  – Just 192 lines!
  – Some large effect variants (intermediate frequency and explain 20% of variation…)
  – Inbred lines enable noise reduction
  – Extended association peaks because of low Ne
• Less evidence of missing heritability
85
Mouse composite not like humans
• Valdar et al. 2006. Nature Genetics 38:879-887
• QTL account for 73% of observed heritability
86
Humans are not like humans
• Yang et al. 2010. Nat. Genet. 10.1038/ng.608
  – Common SNP accounted for 45% of variation if all SNP included in the model
    • i. Many very small QTL effects
    • ii. QTL generally have lower MAF than arrayed SNP
• Dickson S.P. et al. 2010. PLoS Biol 8:e1000294
  – Several rare variants can combine to produce an association with a common SNP
87
Validation
• All genome-wide studies raise the question of validation
• In candidate studies, independent evidence from biological reasoning for candidate choice
• In Zhao et al. 2007, used previous linkage analyses of parents in the association panel
88
Real data (heading date)
[Figure: chromosomes 1H, 2H, 3H, 4H, 5H, 6H, 7H; association results shown alongside linkage studies; VRN3 labeled]
89
Arabidopsis: Residual structure
Residual Confounding: no bi-parental QTL found despite it segregating in the cross
Low Power: No association found despite large effect in the cross
90
Recap
• Model has focused on one locus at a time
• The locus has been treated as a fixed effect
  – Makes sense in the candidate locus context
• We have dealt with residual “polygenic” effects that, through structure, wreak havoc
• Going forward, statistical models will be multi-factorial
• Linkage mapping needed to find loci associated with structure
• LD exhibits block-like structure: what to do with that?
• Potential for genetic heterogeneity depends on population history
• GWAS can miss substantial heritability
• If you have very low LD, nested association, or LDLA, is a good idea