Association Mapping

1

[Concept map linking Association Mapping, LD, Methods, and Germplasm. Nodes: Definition; Causes; Haplotype Blocks; Marker Density; Recombination Hotspots; Model-based or PCA?; Candidate loci or whole genome?; Sub-population structure; Extent of LD; Breeding System; Gene identification or Marker-assisted selection?; Regression; Genomic selection; Multiple testing vs. Shrinkage; Signatures of selection; Species; Panel diversity; Confounded structure and polymorphism]

2

Outline

• Association mapping is regression
• Accounting for structure
– Estimating structure using markers
– Truly multi-factorial models
• Miscellaneous topics:
– Genomic control; TDT; Confounding with structure; Haplotype predictors; Genetic heterogeneity; Missing heritability; NAM; Validation

3

Association Mapping

• It’s the same thing as linkage mapping in a bi-parental population but in a population that has not been carefully designed and generated experimentally

• Because the experiment has not been designed, it is messy. Statistical methods are needed to deal with the mess

4

Regression

• x_i is the allelic state at a marker
• Consider the total genotypic effect of individual i
• q_i is the allelic state at a QTL with which the marker is (hopefully) in LD

• Now estimate β

5

Estimate of Beta

[Equation annotations: one term is the part having to do with LD; the other term reflects the multi-factorial trait / structure]
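The two annotations on this slide can be read against a decomposition of the regression estimate. What follows is a sketch, not the slide's own equation. It assumes a phenotype y_i = μ + g_i + e_i regressed on the marker score x_i, writes the genotypic value as g_i = αq_i + g*_i (α and g*_i are symbols introduced here for the QTL effect and the rest of the genotypic value), and assumes the residual e is uncorrelated with x.

```latex
\hat{\beta}
  = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}
  = \frac{\operatorname{cov}(x, g)}{\operatorname{var}(x)}
  = \underbrace{\alpha \, \frac{\operatorname{cov}(x, q)}{\operatorname{var}(x)}}_{\text{part having to do with LD}}
  + \underbrace{\frac{\operatorname{cov}(x, g^{*})}{\operatorname{var}(x)}}_{\text{multi-factorial trait / structure}}
```

Under these assumptions the second term is zero only when the marker is uncorrelated with the rest of the genotype, which is the condition on cov(x, g) that the next slide examines.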

6

When is cov(x, g) non-zero?

• Differences in allele frequencies at the marker between subpopulations AND difference in phenotypic mean between subpopulations
– The difference in mean can be due to a single or many loci
• Difference in the frequency of alleles between families AND difference in family phenotypic means within a (sub)population

7

Structure possibilities

[Figure: structure possibilities arranged along two axes, population structure and familial relatedness]

Yu, J., Pressoir, G., et al. 2006. Nat Genet 38:203-208

8

Controlling for structure

• Basic quantitative genetics:
– Two individuals who share many alleles should resemble each other phenotypically
– Use markers to figure out how many alleles individuals share and then use that to adjust statistically for their phenotypic resemblance

9


• The “mixed model”

Yu, J., Pressoir, G., et al. 2006. Nat Genet 38:203-208

10


• Structure => large differences in allele frequencies across many markers

[Figure: average marker Set2 score versus average marker Set1 score, both on a 0 to 1 scale, with the first PCA axis drawn along a potential phenotypic gradient]

Regression coefficients of the phenotype on the PCA values

11

Use of PCA

• Results are not sensitive to the number of principal components, provided you have enough
– Price, A.L. et al. 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904-909
• The number of significant PCs can be determined
– Patterson, N. et al. 2006. Population Structure and Eigenanalysis. PLoS Genetics 2:e190
• Use a “Screeplot”

12

Historical footnote

• PCA achieves what the Pritchard program Structure does

• PCA is faster and more robust

Pritchard, J.K. et al. 2000. Genetics 155:945-959
Price, A.L. et al. 2006. Nat Genet 38:904-909
Patterson, N. et al. 2006. PLoS Genetics 2:e190

13

Kinship

• We are all a little bit related:
– Two unrelated people: go back 1 generation, all four parents must be different people.
– Go back 2 generations, all eight grand-parents must be different people.
– Go back 30 generations, all 2.1 billion ancestors would need to be different people: Impossible!
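The counts in this argument follow from doubling the number of ancestors each generation. A minimal sketch of the arithmetic (the function name is made up for illustration):

```python
def distinct_ancestors_needed(generations, people=2):
    """Ancestors that must all be different for `people` to be unrelated.

    Each person has 2**generations ancestors that many generations back.
    """
    return people * 2 ** generations


for g in (1, 2, 30):
    print(g, distinct_ancestors_needed(g))
# 1 4
# 2 8
# 30 2147483648
```

The 30-generation figure is the slide's 2.1 billion.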

14

Identity by Descent

• Two alleles that are copies (through reproduction) of the same ancestral allele

Coefficient of Coancestry

• Choose a locus
• Pick an allele from Ed and one from Peter
• Probability that the alleles are IBD = Ed and Peter’s Coefficient of Coancestry, θEP

15

Coef. of Coancestry –> A matrix

• A is the additive relationship or kinship matrix

[Figure: A matrix with groups labeled Winter, Two-Row, Six-Row, and “Bison”]

16

A constrains u

• Two individuals who share many alleles should resemble each other phenotypically

• u is the polygenic effect
• Its covariance matrix is Var(u) = Aσ_u^2

• If a_ij has a high value, then u_i and u_j should have similar values (they have high covariance)

• A constrains the values that are possible for u

17

Single locus, additive model: cov(ui, uj)

18

A matrix from the pedigree

• The cells in the A matrix are aij = 2θij, the additive relationship coefficients between i in the row and j in the column

• Coefficient of coancestry θij: the probability that two random alleles, one from i and one from j, are IBD

• Calculate from the pedigree by recursion:
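The recursion itself did not survive on this slide. The sketch below uses the standard tabular method, which may differ in notation from what the slide showed. It assumes founders are unrelated and non-inbred, and that the pedigree lists parents before their offspring. The function name and the five-individual pedigree (e is the offspring of c and d) are illustrative.

```python
def coancestry(pedigree):
    """Coefficients of coancestry by the tabular method.

    pedigree: list of (name, parent1, parent2), parents listed before
    offspring; use None for the unknown parents of a founder.
    Assumes founders are unrelated and non-inbred.
    """
    names = [name for name, _, _ in pedigree]
    theta = {}

    def get(i, j):
        # An unknown parent contributes no identity by descent.
        if i is None or j is None:
            return 0.0
        return theta[(i, j)]

    for k, (ind, p1, p2) in enumerate(pedigree):
        # Coancestry with each older individual: average over ind's parents.
        for other in names[:k]:
            value = 0.5 * (get(other, p1) + get(other, p2))
            theta[(ind, other)] = theta[(other, ind)] = value
        # Coancestry with itself: (1 + coancestry of its parents) / 2.
        theta[(ind, ind)] = 0.5 * (1.0 + get(p1, p2))
    return names, theta


pedigree = [
    ("a", None, None),
    ("b", None, None),
    ("c", None, None),
    ("d", None, None),
    ("e", "c", "d"),
]
names, theta = coancestry(pedigree)
# Additive relationship matrix: a_ij = 2 * theta_ij
A = [[2 * theta[(i, j)] for j in names] for i in names]
for name, row in zip(names, A):
    print(name, row)
```

With this pedigree the only non-zero off-diagonal entries of A are those between e and its parents c and d, each equal to 0.5.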

19

A matrix from marker data

, the homozygosities over all markers and alleles

20

With inbreeding, parental contributions NOT 50:50

• Maize intermated population

• Drift during intermating and inbreeding

• Markers can give more accurate θ than pedigree

[Figure: histogram; x-axis from 0 to 1 in steps of 0.05, y-axis from 0 to 90]

21

Mixed Model Example

• Five individuals, a, b, c, d, and e.
• a and b in subpop 1; c, d, and e in subpop 2.
• a, b, c, and d unrelated; e is offspring of c and d.
• a and d carry the 0 allele; b, c, and e carry the 1 allele

y = μ + Qv + Xβ

22

Mixed Model Example

y = μ + Qv + Xβ


23

Mixed Model Example


y = μ + Qv + Xβ + Zu + e

24

Mixed Model Example


A =

var(u) = Aσ_u^2

Zu

25

Mixed Model Example

• There is a polygenic effect u for each individual => overdetermined model?

• NO: u is a random effect, constrained by Aσ_u^2

26

Mixed Model Example

y = μ + Qv + Xβ + Zu + e

[Equation not recovered: a matrix inverse multiplied by a vector, giving the solutions]

27

Control false positives from structure

[Figure: cumulative P versus observed P (0 to 0.5) for the Simple, Q, K, Q + K, and GC models. Panels: a. Flowering time (high population structure); b. Ear height (moderate population structure); c. Ear diameter (low population structure)]

A straight diagonal line indicates an appropriate control of false positives.

Q + K model has best Type I error control, most important when trait is related to population structure (e.g., flowering time).

28

Statistical power

[Figure: adjusted average power versus genetic effect (phenotypic variation explained in %: 0, 0.8, 3.3, 7.1, 11.9, 17.4) for the Simple, Q, K, Q + K, and GC models. Panels: d. Flowering time (high population structure); e. Ear height (moderate population structure); f. Ear diameter (low population structure)]

Q + K model had highest power to detect SNPs with true effects.

29

Controlling for Structure

[Figure panels: Original; P Matrix; K Matrix]

30

FDR vs. Power for 300 lines, 10 QTL

31

Effect of line number, P-only
10 QTL, 0.75 heritability

32

Effect of Reduced Population Diversity

33

Take homes on diversity

• At equal population size
– A less diverse population can increase power because, relative to the extent of LD, the average marker distance is lower
– Given that you are testing fewer markers, the multiple testing problem is reduced

• Avoid as much as possible reducing population size for the sake of obtaining a more homogeneous population

34

Guidelines

• More lines and more markers are better
• For a diverse population, 800+ lines
• For a narrower population, 300+ (?)
• FDR is a reasonable method of determining significance, but probably conservative
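The slides do not say which FDR procedure was used. As one common choice, the sketch below assumes the Benjamini-Hochberg step-up procedure. The function name, the target level `q`, and the p-values are illustrative.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the test is declared significant.

    Benjamini-Hochberg step-up: find the largest rank k such that
    p_(k) <= k * q / m, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical marker p-values from a scan
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(pvals, q=0.05)
print(sum(significant), "markers declared significant")
```

In this made-up example only the two smallest p-values pass, even though five are below 0.05 individually.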

35

Q constant K estimated

• Q estimated with all markers, K estimated with varying fraction of markers available

[Figure: variance ratio (0 to 0.80) versus marker number (0% to 100%) for SSR and SNP markers. Panels d, e, f: Flowering time, Ear height, Ear diameter]

36

Q estimated K constant

• Q estimated with varying fraction of markers available, K estimated with all markers

[Figure: variance ratio (0 to 0.80) versus marker number (0% to 100%) for SSR and SNP markers. Panels d, e, f: Flowering time, Ear height, Ear diameter]

37

History / future of controlling for structure

[Equation annotations: one term is the part having to do with LD; the other term reflects the multi-factorial trait / structure]

38

Single locus: model mis-specification

• “the problem is better thought of as model mis-specification: when we carry out GWA analysis using a single SNP at a time, we are in effect modeling a multifactorial trait as if it were due to a single locus”
– Atwell S. et al. 2010. Nature 465:627-631

39

History: Candidate locus studies

• AM started out with candidate locus studies where the effects of few loci could be fitted

• The biotechnology was not there to type more than a few loci

• The genetic background needed to be accounted for somehow (see above)

• In any event, the computational power was not there to fit all 10^6 loci simultaneously

40

Future: GWAS fitting all loci

• These methods could displace mixed models accounting for structure

Logsdon B. et al. 2010. BMC Bioinformatics 11:58.

41

Sundry topics

• Other methods to control structure
• QTL confounded with structure
• Single markers or haplotypes?
• Genetic heterogeneity
• Missing heritability
• Linkage disequilibrium / Linkage analysis
• Validation

42

Genomic Control

• Calculate bias in distribution of test statistic using “neutral” loci, then account for bias

• Devlin, B. and Roeder, K. 1999. Genomic Control for Association Studies. Biometrics 55:997-1004.

• Works best for candidate genes: test loci can be distinguished from neutral control loci. Works less well for whole genome scans
– Marchini, J. et al. 2004. Nat. Genet. 36:512-517
– Devlin, B. et al. 2004. Nat. Genet. 36:1129-1131.
– Marchini, J. et al. 2004. Nat. Genet. 36:1131-1131
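A minimal sketch of the genomic control idea, under stated assumptions: the test statistics are 1-df chi-square statistics, the bias is summarized by an inflation factor λ (median of the neutral-locus statistics divided by the median of a 1-df chi-square, about 0.4549), and λ is floored at 1 so that statistics are never inflated. The function name and the numbers are illustrative.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549  # approximate median of a chi-square with 1 df


def genomic_control(test_stats, neutral_stats):
    """Estimate the inflation factor from neutral loci and deflate test statistics."""
    lam = max(1.0, median(neutral_stats) / CHI2_1DF_MEDIAN)
    return lam, [stat / lam for stat in test_stats]


# Hypothetical chi-square statistics
neutral = [0.2, 0.9098, 3.0]   # "neutral" control loci
candidates = [9.098, 1.0]      # candidate loci to be tested
lam, adjusted = genomic_control(candidates, neutral)
print(lam, adjusted)
```

Here the neutral loci have twice the expected median, so every candidate statistic is halved before it is compared with the chi-square distribution.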

43

Transmission Disequilibrium Test

• Experimental rather than statistical control of effects of structure

• Originally conceived for dichotomous (e.g., disease / no disease) traits

• Affected offspring and both parents, of which one must be heterozygous

• Test whether a putative causal allele is transmitted more often than 50% of the time

• Spielman, R.S. et al. 1993. Am. J. Hum. Genet. 52:506-516
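For the original dichotomous-trait case, the test reduces to a comparison of two counts taken over heterozygous parents of affected offspring. A minimal sketch follows; the function name and the counts are illustrative, and the statistic is compared with a chi-square distribution with 1 df.

```python
def tdt_statistic(transmitted, not_transmitted):
    """TDT chi-square: (b - c)^2 / (b + c).

    b = times heterozygous parents transmitted the putative causal allele
        to an affected offspring; c = times they transmitted the other allele.
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - not_transmitted) ** 2 / total


# Hypothetical counts: 60 transmissions versus 40 non-transmissions
chi2 = tdt_statistic(60, 40)
print(chi2)  # 4.0, above the 5% critical value of 3.84 for 1 df
```

Because the comparison is made within families, differences in allele frequency between subpopulations do not enter the statistic, which is why the slide calls this experimental rather than statistical control.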

44

TDT

• Extensions for quantitative traits
– Allison, D.B. 1997. Am. J. Hum. Genet. 60:676-690
• Extensions for larger-than-trio pedigrees
– Monks, S.A., and N.L. Kaplan. 2000. Am J Hum Genet 66:576-92
• Use in populations under artificial selection
– Bink, M.C.A.M. et al. 2000. Genetical Res. 75:115-121

45

QTL confounded with structure

• Particularly important for QTL affecting adaptation, e.g., flowering time

Camus-Kulandaivelu, L. et al. 2006. Genetics 172:2449–2463

46

Also in rice…

Ghd7-0a: Non-functional
Ghd7-2: Weak allele
Ghd7-0: Deleted
Ghd7-1, Ghd7-3: Functional

Given geographic distribution and role in adaptation, selection using this locus will have marginal utility

Xue, W. et al. 2008. Nat Genet 40:761-767

47

Confounded QTL with structure

• Association analysis will have difficulty identifying such QTL: the QTL needs to be polymorphic within subpopulations

• Traditional linkage studies of crosses between members of different subpopulations should be very effective in this case

• e.g., Xue, W. et al. 2008. Nat Genet 40:761-767

• Multi-factorial methods will have difficulty identifying loci under strong structure

48

Dwarf8: Confounded with structure

• Thornsberry, J.M. et al. 2001. Nat. Genet. 28:286-289
– First structured association test applied to plants

Camus-Kulandaivelu, L. et al. 2006. Genetics 172:2449–2463

49

Single markers or haplotypes?

• The jury is still out
• Infinite ways to simulate and analyze
– Ne, QTL MAF, QTL effect, quantitative vs. binary, age of mutation

• Ex. 1: Dramatically more power for haplotypes vs single markers

• Durrant, C. et al. 2004. Am J Hum Genet 75:35-43

50

Single markers or haplotypes?

• Ex. 2: Similar or lower power for haplotype method relative to single marker method

• Zhao, H.H. et al. 2007. Genetics 175:1975-1986

• Sorting out which method is most appropriate in which situation still has to happen

51

Exploiting Haplotype Blocks

• Objective: reduce the genotyping cost while capturing polymorphism at all (most) loci

• Haplotype: series of alleles at adjacent polymorphic loci

• Blocks: majority of diversity in few haplotypes
• => Strong LD between loci within blocks; weak LD between loci across blocks
• Knowledge of the allele at one locus provides much information on the alleles at other loci

52

What causes blocks?

• Recombination heterogeneity: coldspots within blocks, hotspots between blocks

• Random sampling of alleles and timing of mutation relative to recombination events

53

Evidence for mechanisms

• Humans: High marker density resources
– Observed LD structure reproduced best if recombination hotspots every ~ 100 kbp
• Reich, D.E. et al. 2002. Nat Genet 32:135-142.
• Wall, J.D., and J.K. Pritchard. 2003. Am J Hum Genet 73:502-15.
– Block boundaries correspond with positions of high current recombination
• Jeffreys, A.J. et al. 2005. Nat Genet 37:601-606
– Block boundaries consistent across different human populations
• De La Vega, F.M. et al. 2005. Genome Res. 15:454-462

54

Hotspots exist in plants (Arabidopsis) too

Kim, S. et al. 2007. Nat Genet 39:1151-1155

70 kbp

55

Blocks also arise randomly

• No relation between historic recombination (histogram) and block boundaries (dark bars)
– Verhoeven, K.J.F. and K.L. Simonsen. 2005. Mol. Biol. Evol. 22:735-740

56

Blocks in barley

40 ?? Mbp

57

Block cause matters

• If blocks arise from a recombination process, they will be consistent across populations

• Markers that tag blocks identified in one population will therefore be useful in others

• If blocks arise from random processes, tags useful in one population will not be so in another

58

Haplotypes for discovery in barley

• 2198 mapped SNP in 1807 lines across barley
• Five methods
– Traditional single SNP
– Four gamete: use D’ to determine boundaries
– Tree scan: single df contrasts based on parsimony
– HapBlock: group to capture diversity
– Sliding window of 3 SNP
• Simulation: mask a SNP and pretend it’s a QTL
• Real data: heading date on 1040 lines
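The four-gamete method above relies on D’ between pairs of SNP. A minimal sketch of the usual D’ calculation for two biallelic loci follows, assuming phased haplotypes (reasonable for inbred lines) coded 0/1. The function name and data are illustrative, and the block-boundary rule itself is not shown.

```python
def d_prime(haplotypes):
    """Lewontin's D' for two biallelic loci.

    haplotypes: list of (allele_at_locus_A, allele_at_locus_B), alleles coded 0/1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic")
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    # Scale D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max


three_gametes = [(0, 0), (0, 0), (0, 1), (1, 1)]  # gamete (1, 0) is absent
four_gametes = [(0, 0), (0, 1), (1, 0), (1, 1)]   # all four gametes, equally frequent
print(d_prime(three_gametes), d_prime(four_gametes))
```

D’ stays at 1 as long as at most three of the four possible gametes are observed, which is why it can be used to look for evidence of historical recombination between adjacent SNP.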

59

Results

• Simulation: single SNP best in 5 / 8 cases and never worse than the best haplotype method

• CAUTION: the QTL had the same properties as the SNP: ideal for single SNP discovery

• IF QTL simulated as recent mutations on blocks with haplotype properties THEN haplotype methods had higher power
– Even then, single SNP did pretty well

60

[Figure: real data (heading date); associations along chromosomes 1H to 7H, distinguishing hits found by all methods, only by TreeScan, and only by 4gamete]

61

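[Figure: three-SNP haplotypes, each labeled with a value and a count: 000 (-0.53, 64); 011 (0.16, 489); 110 (-0.48, 386); 010 (-0.44, 82); 111 (-0.69, 12); 001 (-0.99, 3); one haplotype is marked *]

4gamete success

• Rare recombinants split off early-heading lines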

62

Take-homes

• Simulations don’t support use of haplotype blocks
– But we don’t know how to simulate the true nature of QTL
• With real data, a diversity of approaches might produce the most useful candidate list

63

Block vs tag identification

• Blocks require position, tags do not
• General tag marker approach:
– Identify markers in high LD with each other
– Retain only one
– Aggressive: among tags, see if combinations can be used instead of single tags

de Bakker, P.I.W. et al. 2005. Nat Genet 37:1217-1223
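The first two steps of the tag approach can be sketched as a single pass over the markers. This is a simple illustration, not the algorithm of de Bakker et al.: it assumes phased 0/1 alleles, uses pairwise r² as the LD measure, and uses an arbitrary threshold of 0.8. All names and data are hypothetical.

```python
def r_squared(a, b):
    """Pairwise r^2 between two biallelic markers scored 0/1 on the same haplotypes."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic marker carries no LD information
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b
    return d * d / denom


def select_tags(markers, threshold=0.8):
    """Keep a marker only if it is not in high LD with an already retained tag."""
    tags = []
    for name, alleles in markers.items():
        if all(r_squared(alleles, markers[tag]) < threshold for tag in tags):
            tags.append(name)
    return tags


markers = {
    "m1": [0, 0, 1, 1, 0, 1],
    "m2": [0, 0, 1, 1, 0, 1],  # identical to m1
    "m3": [0, 1, 0, 1, 0, 1],  # weakly correlated with m1
    "m4": [1, 1, 0, 0, 1, 0],  # complement of m1, so r^2 = 1
}
print(select_tags(markers))  # ['m1', 'm3']
```

Note that nothing here uses marker positions, in line with the point that tags, unlike blocks, do not require them.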

64

Reducing marker numbers

• Tag SNP and Imputation:
• Tag marker approach
– Identify markers in high LD with each other
– Retain only one
– Aggressive: among tags, see if combinations can be used instead of single tags

Tagging works

• Power is maintained; genotyping is reduced

65

[Figure: power versus average marker spacing (kbp) for four marker sets: Greedy, Best N, Random Tags, No LD]

de Bakker, P.I.W. et al. 2005

66

Tags serve as a base for imputation

• Model-based imputation using fastPHASE

Scheet and Stephens. 2006

67

Imputation on tag markers works

Jannink et al. 2007

68

Imputation can increase power

[Figure: x-axis is Chromosomal Position (Mb)]

Marchini et al. 2007

69

Imputation can increase power

Guan and Stephens 2008

70

Genome scans with low LD

• Numerous species have too low LD to perform (as of yet) whole genome scans
– You would need too many SNP on too many genotypes
• “Nested Association Mapping”
• Known as “Linkage disequilibrium linkage analysis” (LDLA) in animal genetics

Yu, J. et al. 2008. Genetics 178:539-551

Meuwissen, T.H.E. et al. 2002. Genetics 161:373-379.

71

NAM Design

• B73 is the reference parent, crossed to 25 other inbred lines, representing a large part of maize diversity

[Diagram: B73 is crossed to each diverse parent (CML52, …, P39); each F1 gives rise to RIL1, RIL2, …, RIL199, RIL200]

72

[Diagram: B73 × 25 diverse lines (DL) → F1s → single-seed descent (SSD) → RIL 1, 2, …, 200 per cross]

Diverse lines: B97, CML103, CML228, CML247, CML277, CML322, CML333, CML52, CML69, Hp301, Il14H, Ki11, Ki3, Ky21, M162W, M37W, Mo18W, MS71, NC350, NC358, Oh43, Oh7B, P39, Tx303, Tzi8

73

NAM Genotyping

• Type parents at high density (2.5 M SNP…)
• Type RIL at low density (10 k SNP): know, on a sub-cM scale, which parental allele inherited

74

NAM linear models

• P: matrix indicating which parent contributed the allele to each offspring. α: vector of effects of parental alleles.
• Eq. 1
• A linear model for the parental allele effects:
• Eq. 2
• This latter model is what Yu et al. call “Projecting parental SNP on to the progeny.”
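The two equations themselves did not survive on this slide. One plausible reading, given the definitions of P and α, is sketched below. It is an assumption, not the slide's notation: μ, e, μ_α, S (a matrix of parental SNP genotypes in the QTL interval), γ (SNP effects), and ε are symbols introduced here for illustration.

```latex
% Eq. 1 (assumed form): linkage model for the RIL phenotypes
y = \mathbf{1}\mu + P\alpha + e

% Eq. 2 (assumed form): linear model for the parental allele effects,
% regressed on the parental SNP genotypes within the interval
\alpha = \mathbf{1}\mu_{\alpha} + S\gamma + \varepsilon
```

Under this reading, the first model uses the progeny to estimate one effect per parental allele, and the second asks which parental SNP explain the differences among those effects.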

75

NAM / LDLA on Maize

• Consider:
– 2.5 Gbp genome with LD extending to 1000 bp
– Requires 2.5 M SNP…
• Apply Eq. 1 to identify QTL with, say, 3 cM C.I.
– Within interval there are ~ 3000 parental SNP
• Apply Eq. 2 to dissect the QTL to its causal SNP
– Feasible with 25 genotypes (apparently)
– Note that α will be accurately estimated

76

Advantages of NAM / LDLA

• Adds power without adding huge genotyping burden

• Reduces / eliminates problems related to structure: the linkage part of the analysis removes long-distance LD

77

Sugary1: Genetic heterogeneity

Tracy, W.F. et al. 2006. Crop Sci 46:S-49-54

78

Genetic heterogeneity hinders AM

• Distinct mutations at the B locus

• If B2 and B3 cause a phenotype (e.g. loss of function at isoamylase), it will be associated with A1 in one case and A2 in the other case.

• B2 and B3 can be identified by linkage mapping

B2 associated with A1
A1B1: Exists    A2B1: Exists
A1B2: New       A2B2: --

B3 associated with A2
A1B1: Exists    A2B1: Exists
A1B3: --        A2B3: New

79

How prevalent is heterogeneity?

Buckler et al. 2009. Science 325:714-718

80

Multiple hits => Allelic series

Buckler et al. 2009. Science 325:714-718

81

Heterogeneity and Pop. History

• If a population has gone through a severe bottleneck, polymorphic loci are unlikely to have > 2 alleles…

• Heterogeneity is less likely in domesticated populations with low Ne

82

83

Missing Heritability

• Heritability for height in humans is ~ 0.80.
• Very large GWAS find ~ 50 SNP together accounting for 5% of that heritability
• Where’s the rest?
– Infinitesimal effects
– Low frequency SNP in same causal genes
– Epigenetics
– Genotype x environment interaction
– Epistasis

Maher, B. 2008. Nature 456:18-21
Manolio, T.A. et al. 2009. Nature 461:747-753.

84

Plants are not like humans

• Atwell et al. 2010. Nature 465:627-631
– Just 192 lines!
– Some large effect variants (intermediate frequency and explain 20% of variation…)
– Inbred lines enable noise reduction
– Extended association peaks because of low Ne
• Less evidence of missing heritability

85

Mouse composite not like humans

• Valdar et al. 2006. Nature Genetics 38:879-887

• QTL account for 73% of observed heritability

86

Humans are not like humans

• Yang et al. 2010. Nat. Genet. 10.1038/ng.608
– Common SNP accounted for 45% of variation if all SNP included in the model
• i. Many very small QTL effects
• ii. QTL generally have lower MAF than arrayed SNP
• Dickson S.P. et al. 2010. PLoS Biol 8:e1000294
– Several rare variants can combine to produce an association with a common SNP

87

Validation

• All genome-wide studies raise the question of validation

• In candidate studies, independent evidence from biological reasoning for candidate choice

• In Zhao et al. 2007, used previous linkage analyses of parents in the association panel

88

[Figure: real data (heading date) compared with linkage studies along chromosomes 1H to 7H; VRN3 is labeled]

89

Arabidopsis: Residual structure

Residual Confounding: no bi-parental QTL found despite it segregating in the cross

Low Power: No association found despite large effect in the cross

90

Recap

• Model has focused on one locus at a time
• The locus has been treated as a fixed effect
– Makes sense in the candidate locus context
• We have dealt with residual “polygenic” effects that, through structure, wreak havoc
• Going forward, statistical models will be multi-factorial
• Linkage mapping needed to find loci associated with structure
• LD exhibits block-like structure: what to do with that?
• Potential for genetic heterogeneity depends on population history
• GWAS can miss substantial heritability
• If you have very low LD, nested association, or LDLA, is a good idea

Date post:	24-Feb-2016
Upload:	shauna