Genome-wide association mapping

Introduction to theory and methodology

Aaron Lorenz

Department of Agronomy and Horticulture

GWAS – Genome-wide Association Study
• Big subject
• Lots of methods and software packages
• Lots of considerations for handling data
• We have some data to analyze

• 75 minutes

Slide credit: Mike Gore

Find genes contributing to variation in phenotypes of interest

Approaches to mapping genes

Yu and Buckler, 2006

Germplasm

Biometris

Germplasm
• Any genetically diverse natural or artificial population can be used
– Examples

• 71 elite European maize inbred lines (Andersen et al., 2005)

• Diverse panel of 288 maize lines (Harjes et al., 2008)

• Diverse panel of 191 Arabidopsis lines (Stock center accessions and individuals sampled from the wild; Atwell et al. 2010)

• 915 dogs from 80 domestic breeds, 83 wild canids, 10 outbred African shelter dogs.

Linkage disequilibrium (LD)

$$D = p_{AB} - p_A p_B$$

$$r^2 = \frac{D^2}{p_A p_a p_B p_b}$$

r² is a common statistic used to quantify LD; it is a normalized value of D.

• The non-random association of alleles between loci.

• Extent of LD over physical distance determines marker density needed.
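The two LD statistics can be computed directly from haplotype counts. The following is a minimal sketch in Python (standard library only); the function name `ld_stats` and the example counts are illustrative assumptions, not taken from the slides.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) for two biallelic loci from haplotype counts.

    Hypothetical helper for illustration; counts are numbers of
    observed AB, Ab, aB and ab haplotypes (gametes).
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n              # frequency of the AB haplotype
    p_A = (n_AB + n_Ab) / n      # frequency of allele A
    p_B = (n_AB + n_aB) / n      # frequency of allele B
    D = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    # r2 is undefined when either locus is monomorphic
    r2 = D * D / denom if denom > 0 else None
    return D, r2


# Made-up counts: AB and ab haplotypes are over-represented
D, r2 = ld_stats(40, 10, 10, 40)
print(round(D, 4), round(r2, 4))
```

With these counts p_AB = 0.4 and p_A = p_B = 0.5, so D = 0.15 and r² = 0.36.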

LD decay in bi-parental linkage mapping populations

Slide credit: Peter Bradbury

Plots of LD across the Maize d3 Gene (Remington et al., 2001).

Gaut B. S., Long A. D. Plant Cell 2003;15:1502-1506

r² above the diagonal, D′ below the diagonal

Note that LD drops to nearly 0 within 500 base pairs

Extensive LD in barley of the Upper Midwest

Toy example
• 500 random individuals from a population were phenotyped and genotyped
– Genotypes were scored for one marker linked to a candidate gene
– Individuals were scored as A1A1 = 0, A1A2 = 1, A2A2 = 2.

R: lm function
• Fits a linear model with normal errors and constant variance; generally this is used for regression analysis using continuous explanatory variables.
• Simple linear regression
– lm(y ~ x)

• See riceGwasEmma.r
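For readers without R at hand, the single-marker regression behind lm(y ~ x) can be sketched in plain Python. This is a stand-in for illustration, not the contents of riceGwasEmma.r: the function name `simple_lm` and the six data points are made up, and R's lm additionally reports standard errors and p-values, which are omitted here.

```python
def simple_lm(x, y):
    """Least-squares intercept and slope for y = a + b*x.

    Hypothetical helper mirroring the point estimates of R's lm(y ~ x).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                    # estimated effect per A2 allele
    intercept = mean_y - slope * mean_x  # expected phenotype of A1A1
    return intercept, slope


# Made-up data: marker scored as A1A1 = 0, A1A2 = 1, A2A2 = 2
dosage = [0, 0, 1, 1, 2, 2]
phenotype = [4.0, 4.4, 5.1, 4.9, 6.2, 5.8]
intercept, slope = simple_lm(dosage, phenotype)
print(round(intercept, 3), round(slope, 3))
```

The slope is the estimated allele substitution effect: the change in phenotype for each additional copy of A2.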

Population structure
• Nearly always present in association mapping panels
• Causes spurious associations if not accounted for.

Extreme example

Within each of these subpopulations, the Ab and aB gametes never occur. When the subpopulations are combined into one population and LD is calculated, D = freq(AB) – freq(A)*freq(B) = 0.25, and the two loci are in complete LD regardless of their physical linkage.
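The arithmetic of this example can be checked with a short Python sketch. The slide's figure is not reproduced in this transcript, so the setup is an assumption: one subpopulation is fixed for the AB haplotype, the other for ab, and both contribute equally to the pooled population. The helper name `ld_D` is hypothetical.

```python
def ld_D(haplotypes):
    """D = freq(AB) - freq(A)*freq(B) for a list of two-letter haplotypes."""
    n = len(haplotypes)
    p_AB = sum(h == "AB" for h in haplotypes) / n
    p_A = sum(h[0] == "A" for h in haplotypes) / n
    p_B = sum(h[1] == "B" for h in haplotypes) / n
    return p_AB - p_A * p_B


# Assumed setup: each subpopulation is fixed for a single haplotype
subpop1 = ["AB"] * 50
subpop2 = ["ab"] * 50

D_sub1 = ld_D(subpop1)                 # 1 - 1*1 = 0
D_sub2 = ld_D(subpop2)                 # 0 - 0*0 = 0
D_combined = ld_D(subpop1 + subpop2)   # 0.5 - 0.5*0.5 = 0.25
print(D_sub1, D_sub2, D_combined)      # prints: 0.0 0.0 0.25
```

Neither subpopulation shows any LD on its own; the association appears only after pooling, which is why it says nothing about physical linkage.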

Model population structure

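$$y = \mu + vq + bw + e$$

vq: subpop membership (q) and effect (v)

bw: marker allele dosage (w) and effect (b)

Matrix notation:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Q}\mathbf{v} + \mathbf{W}\mathbf{b} + \mathbf{e}$$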

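Illustration: 3 subpopulations, 2 markers, 10 individuals

Columns are y, the intercept column 1, the three columns of Q (q1 to q3), and the two columns of W (w1, w2). Entries shown as … did not survive in this transcript.

y     1    q1     q2     q3     w1   w2
4.4   1    0.75   0.25   0.00   …    …
4.6   1    0.65   0.30   0.05   …    …
5.3   1    0.50   0.40   0.10   …    …
5.0   1    0.75   0.05   0.20   …    …
5.8   1    0.80   0.00   0.20   …    …
5.7   1    0.20   0.60   0.20   …    …
4.3   1    0.20   0.80   0.00   …    …
4.6   1    0.30   0.70   0.00   0    1
…     …    0.10   0.00   0.90   0    0
…     …    0.10   0.00   0.90   1    1

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Q}\mathbf{v} + \mathbf{W}\mathbf{b} + \mathbf{e}$$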

Population structure and differential relatedness (or family structure)

Yu and Buckler, 2006

Mixed-linear model to account for family structure

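$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Q}\mathbf{v} + \mathbf{W}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}$$

$$\mathbf{u} \sim MVN(\mathbf{0}, \mathbf{K}\sigma_u^2)$$

u: polygenic effect (random)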

K = kinship matrix. Normally calculated with genome-wide markers

Efficient Mixed-Model Association (EMMA)

• Uses eigenvalue decomposition to solve the mixed-model equations more efficiently

• Directly inverting the covariance matrix is computationally intensive, so it should be avoided in GWAS.

Options for modeling structure and kinship [see Price et al. (2010)]

Inferring and modeling structure
• Use knowledge on subpop membership directly
• Subpopulation clustering (explicitly infer ancestry)
– STRUCTURE
– ADMIXTURE
• Principal component analysis
– Use top PCs as covariates to correct for pop structure
– Related approach is multi-dimensional scaling (MDS)

Inferring kinship
• Marker similarity matrix
• Realized genomic additive relationship matrix
• Pedigree additive relationship matrix
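As an illustration of the first kinship option, a marker similarity matrix can be built by averaging allele sharing across markers. The sketch below uses one simple choice, identity-by-state similarity on 0/1/2 dosage scores; it is not necessarily the estimator used in EMMA or in riceGwasEmma.r, and the function name `ibs_kinship` and the genotypes are made up.

```python
def ibs_kinship(genotypes):
    """Pairwise identity-by-state similarity from 0/1/2 dosage scores.

    genotypes: list of individuals, each a list of marker dosages.
    Per-marker similarity is 1 - |difference| / 2, averaged over markers.
    Hypothetical helper; one of several possible similarity measures.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 - abs(a - b) / 2
                         for a, b in zip(genotypes[i], genotypes[j]))
            K[i][j] = K[j][i] = shared / m   # matrix is symmetric
    return K


# Made-up genotypes: 3 individuals scored at 4 markers
geno = [
    [0, 2, 1, 0],
    [0, 2, 2, 0],
    [2, 0, 1, 2],
]
K = ibs_kinship(geno)
for row in K:
    print(row)
```

Individuals 1 and 2 differ by one allele at a single marker, so their similarity (0.875) is much higher than either one's similarity to individual 3.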

Efficient Mixed-Model Association (EMMA)

See riceGwasEmma.r

Manhattan plot

See riceGwasEmma.r

Statistical threshold: Correcting for multiple testing
• Bonferroni correction
– alphaC ≈ alphaE / (number of tests)
– Assumes independent tests
– Too conservative
• Permutation testing
– Good for linkage mapping
– Generally not valid for GWAS because family structure is not preserved
• False-discovery rate (Benjamini and Hochberg, 1995)
– Calculate the expected proportion of declared QTL that are false positives.
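The difference between the Bonferroni threshold and the Benjamini-Hochberg procedure can be seen on a handful of p-values. This is a minimal Python sketch; the function names and the eight p-values are illustrative assumptions, and a real GWAS would apply the same logic to one p-value per marker.

```python
def bonferroni_threshold(alpha_e, n_tests):
    """Per-test threshold: alphaE divided by the number of tests."""
    return alpha_e / n_tests


def benjamini_hochberg(pvalues, fdr=0.05):
    """Indices of tests declared significant by the BH step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # keep the largest rank k with p_(k) <= (k / m) * fdr
        if pvalues[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Made-up p-values for 8 markers
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

threshold = bonferroni_threshold(0.05, len(pvals))
bonf_hits = [i for i, p in enumerate(pvals) if p <= threshold]
bh_hits = benjamini_hochberg(pvals, fdr=0.05)
print(bonf_hits)  # prints: [0]
print(bh_hits)    # prints: [0, 1]
```

Bonferroni keeps only the first marker (p ≤ 0.00625), while the FDR procedure also declares the second, illustrating why Bonferroni is described as conservative.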

Calculate effective number of tests

Other software packages to implement linear models for GWAS
• TASSEL: www.maizegenetics.net
• PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/
• EIGENSTRAT: http://www.hsph.harvard.edu/alkes-price/software/
• EMMAX: http://genetics.cs.ucla.edu/emmax/
• GAPIT: http://www.maizegenetics.net/gapit
• GenABEL: http://www.genabel.org/packages/GenABEL
• GWASTools: http://www.bioconductor.org/packages/2.11/bioc/html/GWASTools.html
• FaST-LMM: http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/
