Source: nitro.biosci.arizona.edu/workshops/GIGA/pdfs/L3-GWAS.pdf

(Genome-wide) association analysis

Peter M. Visscher
Queensland Institute of Medical Research
Brisbane, Australia

Outline

• Association vs linkage
• Linkage disequilibrium
• Analysis: single SNP
• GWAS: design, power
• GWAS: analysis

Linkage: families
Association: populations

Linkage disequilibrium around an ancestral mutation

[Ardlie et al. 2002]

Linkage and association

m = functional mutation

[Cardon & Bell 2001]

Why mapping by LD?

• Use association across families
  – no pedigree information needed
• Higher resolution of mapping
• Dense SNP maps: GWAS

Linkage vs. Association

                      Linkage           Association/LD
genealogy             known             unknown
marker sharing        by descent        by state
# meioses             small             large
shared DNA segments   large             small
markers               microsatellites   SNP
marker density        low               high

LD

• Non-random association between alleles at different loci

• Many possible causes
  – mutation
  – drift / inbreeding / founder effects
  – population stratification
  – selection

• Broken down by recombination

Measures of pair-wise LD

Definition             Symbol
Covariance             D
Scaled covariance      D'
Association            ρ
Correlation            r
Frequency difference   f
Delta                  δ
Yule                   y

[Morton et al. 2001, PNAS]

Measures

• All estimates of LD are functions of pairwise haplotype frequencies

• ‘Best’ measure depends on purpose of LD estimation

Properties of ‘good’ measures

• Simple biological interpretation
• Allow statistical tests
• Directly related to evolutionary forces (recombination, selection, drift, etc.)
• Standardised to allow comparisons across loci & populations

[Hedrick 1987]

Utility of LD measures

• Population dynamics
• Estimating population size
• Gene/QTL mapping

Definition of D

• 2 bi-allelic loci
  – Locus 1, alleles A & a, with freq. p and (1-p)
  – Locus 2, alleles B & b, with freq. q and (1-q)
  – Haplotype frequencies $p_{AB}$, $p_{Ab}$, $p_{aB}$, $p_{ab}$

$D = p_{AB} - pq$

Alternative expressions for D

$$
\begin{aligned}
D &= p_{AB} - pq = D_{AB} \\
  &= p_{ab} - (1-p)(1-q) = D_{ab} \\
  &= -(p_{Ab} - p(1-q)) = -D_{Ab} \\
  &= -(p_{aB} - (1-p)q) = -D_{aB} \\
  &= p_{AB}\,p_{ab} - p_{Ab}\,p_{aB}
\end{aligned}
$$
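The equivalence of these expressions can be checked numerically. The following is a minimal sketch, assuming four made-up haplotype frequencies that sum to 1; the function name `ld_d_expressions` and the example numbers are illustrative and not part of the slides.

```python
def ld_d_expressions(p_AB, p_Ab, p_aB, p_ab):
    """Return D computed from each of the five expressions for the same haplotype frequencies."""
    p = p_AB + p_Ab   # frequency of allele A
    q = p_AB + p_aB   # frequency of allele B
    return {
        "D_AB": p_AB - p * q,
        "D_ab": p_ab - (1 - p) * (1 - q),
        "-D_Ab": -(p_Ab - p * (1 - q)),
        "-D_aB": -(p_aB - (1 - p) * q),
        "cross product": p_AB * p_ab - p_Ab * p_aB,
    }

# Hypothetical haplotype frequencies (p = 0.6, q = 0.7)
values = ld_d_expressions(0.5, 0.1, 0.2, 0.2)
for name, value in values.items():
    print(f"{name:>13}: {value:.4f}")
```

All five entries should agree up to floating-point rounding (here D = 0.08).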

Related measures

$D_{max}$ = smaller of $pq$ and $(1-p)(1-q)$   [D < 0]
          = smaller of $p(1-q)$ and $(1-p)q$   [D > 0]

• $D' = D / D_{max}$
  – $-1 \le D' \le 1$
  – Can compare pairs of loci across populations

|D’|

• $|D'| = |D| / D_{max}$
  – $0 \le |D'| \le 1$
  – Can compare different pairs of loci in genome
  – $|D'| = 1$ if one of the 4 haplotype frequencies = 0
  – $E(|D'|)$ and $\mathrm{var}(|D'|)$ not known

$r^2$

$r^2 = D^2 / [pq(1-p)(1-q)]$

• Squared correlation between presence and absence of the alleles in the population

• ‘Nice’ statistical properties

[Hill and Robertson 1968]
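To see how D' and r² can disagree, the sketch below computes both from the definitions on the preceding slides for a case in which one haplotype is absent. The function name `ld_measures` and the example frequencies are hypothetical.

```python
def ld_measures(p_AB, p, q):
    """Return (D, D', r^2) from the AB haplotype frequency and allele frequencies p (A) and q (B)."""
    D = p_AB - p * q
    # D_max depends on the sign of D
    if D < 0:
        d_max = min(p * q, (1 - p) * (1 - q))
    else:
        d_max = min(p * (1 - q), (1 - p) * q)
    d_prime = D / d_max if d_max > 0 else 0.0
    r2 = D ** 2 / (p * q * (1 - p) * (1 - q))
    return D, d_prime, r2

# Hypothetical example: haplotypes AB = 0.3, Ab = 0.0, aB = 0.2, ab = 0.5,
# so p = 0.3 and q = 0.5, and the Ab haplotype is missing.
D, d_prime, r2 = ld_measures(0.3, 0.3, 0.5)
print(f"D = {D:.3f}, D' = {d_prime:.3f}, r2 = {r2:.3f}")
```

With a missing haplotype |D'| reaches 1, while r² stays well below 1 because the allele frequencies at the two loci differ.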

Properties of r and r2

• Population in 'equilibrium'
  – $E(r) = 0$
  – $E(r^2) = \mathrm{var}(r) \approx 1/[1 + 4Nc] + 1/n$

  N = effective population size
  n = sample size (haplotypes)
  c = recombination rate

• $nr^2 \sim \chi^2_{(1)}$

[Sved 1971; Weir and Hill 1980]
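As a small worked example of the approximation for E(r²), the sketch below evaluates 1/(1 + 4Nc) + 1/n. The function name `expected_r2` and the parameter values are assumptions chosen only for illustration.

```python
def expected_r2(N, c, n):
    """Approximate E(r^2) at equilibrium: 1/(1 + 4Nc) plus the sampling term 1/n."""
    return 1.0 / (1.0 + 4.0 * N * c) + 1.0 / n

# Hypothetical values: N = 10,000, c = 0.0001 (so 4Nc = 4), n = 200 haplotypes
value = expected_r2(10_000, 0.0001, 200)
print(round(value, 3))
```

The first term (0.2 here) reflects the population, the second (0.005 here) is the contribution of finite sample size.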

Measures of LD are very variable!

• If $nr^2 \sim \chi^2$ (1 df), then $E(r^2) = 1/n$ and $\mathrm{var}(r^2) = 2/n^2$, so

$[CV(r^2)]^2 \approx 2$

$CV = \sigma(r^2) / E(r^2) = (2/n^2)^{0.5} / (1/n) = 2^{0.5}$

$r^2$ decay under different scenarios (Pritchard & Przeworski, 2001)

y-axis: $(r^2)^{0.5}$

$r^2$ decay in real data (Pritchard & Przeworski, 2001)

Population stratification

           Allele frequency    Haplotype frequency
           pA1     pB1         pA1B1   pA1B2   pA2B1   pA2B2
Pop. 1     0.9     0.9         0.81    0.09    0.09    0.01
Pop. 2     0.1     0.1         0.01    0.09    0.09    0.81
Average    0.5     0.5         0.41    0.09    0.09    0.41

Both populations are in linkage equilibrium

Combined population: D = 0.16, D' = 0.64 and $r^2$ = 0.4096
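The stratification table can be reproduced in a few lines. This is a minimal sketch, assuming the two populations are mixed in equal proportions (as the "Average" row implies); the helper name `haplotype_freqs` is hypothetical.

```python
def haplotype_freqs(p_a1, p_b1):
    """Haplotype frequencies within one population under linkage equilibrium (D = 0)."""
    return {
        "A1B1": p_a1 * p_b1,
        "A1B2": p_a1 * (1 - p_b1),
        "A2B1": (1 - p_a1) * p_b1,
        "A2B2": (1 - p_a1) * (1 - p_b1),
    }

pop1 = haplotype_freqs(0.9, 0.9)
pop2 = haplotype_freqs(0.1, 0.1)

# Equal-proportion mixture of the two populations (assumption)
pooled = {h: 0.5 * (pop1[h] + pop2[h]) for h in pop1}

p = pooled["A1B1"] + pooled["A1B2"]   # pooled frequency of A1
q = pooled["A1B1"] + pooled["A2B1"]   # pooled frequency of B1
D = pooled["A1B1"] - p * q
d_max = min(p * (1 - q), (1 - p) * q) if D > 0 else min(p * q, (1 - p) * (1 - q))
d_prime = D / d_max
r2 = D ** 2 / (p * q * (1 - p) * (1 - q))

print(f"D = {D:.4f}, D' = {d_prime:.4f}, r2 = {r2:.4f}")
```

Although D = 0 within each population, the pooled sample shows the LD values quoted on the slide, which is why stratification can generate spurious association.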

Decay of LD (bi-allelic loci)

$D(0) = p_{AB}(0) - pq$

$p_{AB}(1) = (1-c)\,p_{AB}(0) + c\,pq$

$D(1) = (1-c)\,p_{AB}(0) + c\,pq - pq = (1-c)\,D(0)$

$D(t) = (1-c)^t D(0) \approx e^{-ct} D(0)$

Decay of LD

[Figure: LD (y-axis, 0.00 to 1.00) against generation (x-axis, 0 to 100) for c = 0.10, c = 0.01 and c = 0.001]
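The decay curves follow directly from D(t) = (1-c)^t D(0). The sketch below tabulates a few points for the three recombination rates on the slide, starting from D(0) = 1 as an assumed scaling; the function names are illustrative.

```python
import math

def ld_after(t, c, d0=1.0):
    """Exact recursion: D(t) = (1 - c)^t * D(0)."""
    return (1 - c) ** t * d0

def ld_after_approx(t, c, d0=1.0):
    """Exponential approximation: D(t) ~ exp(-c * t) * D(0), good for small c."""
    return math.exp(-c * t) * d0

for c in (0.10, 0.01, 0.001):
    row = [round(ld_after(t, c), 3) for t in (0, 10, 50, 100)]
    print(f"c = {c}: {row}")
```

For tightly linked loci (c = 0.001) most of the initial LD is still present after 100 generations, whereas for c = 0.10 it has essentially vanished.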

Approaches for quantitative traits

• Association (single locus)
• TDT
• GWAS

Association

• Random sample from the population
• Associate allelic or haplotype variant(s) with trait values
• H0: no association
• Analysis
  – Linear (mixed) model
  – Allele (haplotype) as fixed effect
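The simplest version of such an analysis is a least-squares regression of the trait on allele count. The sketch below is one possible illustration in plain Python, not the lecturer's own code; the function name and the six data points are made up, and a real analysis would use a statistics package and a (mixed) model with covariates.

```python
import math

def regress_on_allele_count(x, y):
    """Least-squares fit of y = mu + beta * x + e, with x the allele count (0, 1 or 2).

    Returns (mu, beta, t), where t = beta / SE(beta) tests H0: beta = 0.
    Assumes the fit is not perfect (residual sum of squares > 0).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    mu = mean_y - beta * mean_x
    sse = sum((yi - mu - beta * xi) ** 2 for xi, yi in zip(x, y))
    se_beta = math.sqrt(sse / (n - 2) / sxx)
    return mu, beta, beta / se_beta

# Hypothetical genotypes (allele counts) and trait values
genotypes = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]
mu, beta, t_stat = regress_on_allele_count(genotypes, trait)
print(f"mu = {mu:.2f}, beta = {beta:.2f}, t = {t_stat:.2f}")
```

The slope beta estimates the average effect of substituting one allele; the t statistic has n - 2 degrees of freedom under H0.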

Falconer model for single biallelic QTL

Var(X) = Regression Variance + Residual Variance
       = Additive Variance + Dominance Variance

[Figure: genotypic values of the genotypes bb, Bb and BB, with m, -a, a and d marked on the value scale]

TDT = transmission disequilibrium test

• Association with family-based controls
• Original TDT (disease mapping)
  – trios, two parents one affected progeny
  – test for transmission of allele to affected progeny from heterozygous parents
    • non-transmitted allele is the control
• Quantitative traits
  – Test for association between trait value and allele within parental mating type

Classical TDT

Table 4.1: Transmission data for a bi-allelic marker.

                  Not transmitted
Transmitted    Allele 1     Allele 2     Total
Allele 1       n11          n12          n11 + n12
Allele 2       n21          n22          n21 + n22
Total          n11 + n21    n12 + n22    n

The TDT statistic is

$TDT = (n_{21} - n_{12})^2 / (n_{21} + n_{12}) \sim \chi^2_1$

Test for both linkage and association
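The TDT statistic uses only the off-diagonal counts of the table. Below is a minimal sketch with invented counts; the function name `tdt` is hypothetical, and the p-value uses the identity that the upper tail of a 1-df chi-square at x equals erfc(sqrt(x/2)).

```python
import math

def tdt(n12, n21):
    """TDT statistic and 1-df chi-square p-value from the two discordant cell counts."""
    if n12 + n21 == 0:
        raise ValueError("no informative (heterozygous-parent) transmissions")
    stat = (n21 - n12) ** 2 / (n21 + n12)
    p_value = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value

# Hypothetical counts: n12 = 10, n21 = 30
stat, p_value = tdt(10, 30)
print(f"TDT = {stat:.2f}, p = {p_value:.4f}")
```

The diagonal cells n11 and n22 do not enter the statistic, because for those transmissions the transmitted and non-transmitted alleles are the same and carry no information.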

Quantitative traits

• With family data we can separate association into between and within family components

• Advantage
  – Within component is robust to stratification

• Disadvantage
  – Unrelated design is more powerful for same sample size

TDT for quantitative traits (regression model)

Information for Within Test

• Families are only informative for the within family component when the offspring can have different genotypes
  – AA x AA ?
  – AA x aa ?
  – AA x Aa ?
  – Aa x Aa ?

• At least one parent must be heterozygous

Population Stratification

• When there is no population stratification the slopes for the between and within test should be equal

Power (bi-allelic locus)

$q^2 = \{2p(1-p)[a + d(1-2p)]^2 + [2p(1-p)d]^2\} / \sigma_p^2$

           ANOVA              Regression
Model      Fit 3 genotypes    y = µ + βx + e (x = 0, 1, 2)
df         2                  1

$\lambda_{ANOVA} = nq^2/(1-q^2)$

$n = [(1-q^2)/q^2]\,(z_{(1-\alpha/2)} + z_{(1-\beta)})^2$
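The two formulas can be combined into a quick sample-size calculation. This is a sketch under stated assumptions: the function names and the example values (p = 0.5, a = 0.2, d = 0, phenotypic variance 1, α = 0.05, power 0.8) are invented for illustration, and a genome-wide study would use a far smaller α.

```python
from statistics import NormalDist

def qtl_variance_fraction(p, a, d, var_p):
    """q^2: proportion of phenotypic variance explained by a bi-allelic QTL."""
    additive = 2 * p * (1 - p) * (a + d * (1 - 2 * p)) ** 2
    dominance = (2 * p * (1 - p) * d) ** 2
    return (additive + dominance) / var_p

def required_n(q2, alpha, power):
    """Sample size n = [(1 - q^2) / q^2] * (z_(1-alpha/2) + z_(1-beta))^2, with power = 1 - beta."""
    z = NormalDist().inv_cdf
    return (1 - q2) / q2 * (z(1 - alpha / 2) + z(power)) ** 2

# Hypothetical additive QTL explaining 2% of the phenotypic variance
q2 = qtl_variance_fraction(p=0.5, a=0.2, d=0.0, var_p=1.0)
n = required_n(q2, alpha=0.05, power=0.80)
print(f"q2 = {q2:.3f}, required n = {n:.0f}")
```

Because n scales with (1 - q²)/q², halving the variance explained roughly doubles the required sample size.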

Genetic Power Calculator (GPC)
http://pngu.mgh.harvard.edu/~purcell/gpc/

Power (n=1000)

GWAS

• Same principle as single locus association, but additional information
  – QC
    • Duplications, sample swaps, contamination
  – Power of multi-locus data
    • Unbiased genome-wide association
    • Relatedness
    • Population structure
    • Ancestry

Detection of susceptibility variants for common diseases

[Figure, modified from McCarthy et al.: penetrance (low, modest, intermediate, high) against allele frequency (very rare, rare, uncommon, common; ticks at 0.001, 0.01, 0.1). Regions labelled: Linkage (requires same locus in many families or large pedigrees), Association, CNV studies, Sequencing, Not yet detectable, Atypical of common diseases]

Detection of susceptibility variants for common diseases: the new era

[Figure: penetrance (low, modest, intermediate, high) against allele frequency (very rare, rare, uncommon, common; ticks at 0.001, 0.01, 0.1). Regions labelled: Variants typically identified by GWAS, CNV/sequencing, CNV studies, Sequencing, Not yet detectable, Atypical of common diseases]

Sequence - SNPs

LD

Technology

GWAS in humans

• Advances in genotyping technology
• Sample collections of adequate size
• Better understanding of patterns of human sequence variation
→ Genome-wide association scans

• 3,000,000,000 bases in human genome
• ~10,000,000 positions commonly variant in Europeans
• 80% of these captured by typing ~500k
• Samples of interest: test for evidence of association

• Categorical traits
  – disease susceptibility genes

• Continuous traits
  – quantitative trait loci, QTL

Age-related macular degeneration

GWAS analysis

• Challenges
  – most obviously, multiple testing burden
  – computation

• Opportunities
  – simple methods can work well with ↑ data
  – novel analyses permitted

The multiple testing burden

[Figure: P(at least 1 false positive) (y-axis, 0 to 1) against number of independent tests performed (x-axis, 0 to 50), for a per-test false positive rate of 0.05 and for a per-test false positive rate of 0.001 = 0.05/50]
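The two curves in this figure follow from P(at least one false positive) = 1 - (1 - α)^m for m independent tests. A minimal sketch, with a hypothetical function name:

```python
def p_any_false_positive(alpha, n_tests):
    """Probability of at least one false positive among n_tests independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for m in (1, 10, 50):
    uncorrected = p_any_false_positive(0.05, m)
    corrected = p_any_false_positive(0.05 / 50, m)  # per-test rate 0.001 = 0.05/50
    print(f"{m:>2} tests: {uncorrected:.3f} at 0.05, {corrected:.3f} at 0.001")
```

At a per-test rate of 0.05 the chance of at least one false positive exceeds 0.9 by 50 tests, while dividing the rate by the number of tests keeps the overall rate just under 0.05.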

Genomic control

[Figure: distribution of χ² at the test locus and at unlinked 'null' markers, with E(χ²) marked. Top panel: no stratification. Bottom panel: stratification → adjust test statistic]

Genomic control

Simple estimate of inflation factor

$\hat{\lambda} = \mathrm{median}\{\chi^2_1, \chi^2_2, \ldots, \chi^2_N\} / 0.456$

• median protects from outliers
  – i.e. true effects
• bounded at minimum of 1
  – i.e. should never increase test statistic
• extends to multiple alleles, haplotypes, quantitative traits, different tests, etc.
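A minimal sketch of the genomic control adjustment, assuming a list of 1-df χ² statistics from null markers; the function name and the example statistics are invented, and 0.456 is the divisor given on the slide (approximately the median of a 1-df χ² distribution).

```python
from statistics import median

def genomic_control(chi2_stats):
    """Estimate the inflation factor lambda and return (lambda, adjusted statistics)."""
    lam = median(chi2_stats) / 0.456
    lam = max(1.0, lam)  # bounded at 1: the adjustment never increases a test statistic
    return lam, [x / lam for x in chi2_stats]

# Hypothetical inflated statistics: the median is 0.912, so lambda is about 2
inflated = [0.2, 0.5, 0.912, 1.8, 25.0]
lam, adjusted = genomic_control(inflated)
print(f"lambda = {lam:.2f}, adjusted = {[round(x, 2) for x in adjusted]}")
```

Using the median rather than the mean means the single large statistic (25.0, standing in for a true effect) has no influence on the estimate of λ.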

Empirical assessment of ancestry

~2K SNPs

CEPH/European, Yoruba, Han Chinese, Japanese

Entire Phase I HapMap


Han Chinese, Japanese

~10K SNPs
