NHGRI Current Topics in Genome Analysis 2016 Week 9: Genomic Approaches to the Study of Complex Genetic Diseases
April 20, 2016 Karen Mohlke, Ph.D.
Genomic Approaches to the Study of Complex Genetic Diseases
Karen Mohlke, PhD
Department of Genetics, University of North Carolina
April 20, 2016
Current Topics in Genome Analysis 2016
Karen Mohlke
No Relevant Financial Relationships with Commercial Interests
Complex diseases & traits
Gene mapping in populations
Altshuler and Clark (2005) Science 307:1052
Genome-wide association studies
• Test a large portion of the common single nucleotide genetic variation in the genome for association with a disease or variation in a quantitative trait
• Find disease/quantitative trait-related variants without a prior hypothesis of gene function
Genetic architecture
Manolio (2009) Nature 461:747
Genome-wide association studies identify loci
Welter (2014) Nuc Acids Res 42: D1001
https://www.ebi.ac.uk/gwas/diagram
Outline
• Genome-wide association study design
– Samples/study participants
– Genotyping
– Tests of association
– Imputation and meta-analysis
• Interpretation of results
– Effect size and significance
– Example locus characteristics
• Sequencing/rare variant studies
Study designs
• Case-control: identify/enroll cases and controls; ask what happened prior to disease onset
• Prospective cohort: enroll subjects; measure X, Y, Z over time; wait for disease onset
• Population-based cohort: enroll subjects regardless of health or disease
Matching of cases and controls
Cases and controls should be comparable in all respects (e.g. age, sex, demographics) except disease status
Selection of cases
• Potential criteria to enrich genetic effect size
– More severely affected individuals
– Require other family member to have disease
– Younger age of disease onset
Selection of controls
• Potential criterion to enrich genetic effect size
– Low risk of disease rather than population-based samples
[Figure: cases and controls with comparable ancestry vs. cases and controls with ancestry differences]
May have inadequate ancestry information prior to genotyping
Confounding and population stratification
Cancer Epidemiol Biomarkers Prev 11: 513
[Diagram: Confounding. Nodes: exposure of interest, true risk factor, disease. Edge labels: causal; correlation, not causal; confounded association.]
[Diagram: Population stratification. Nodes: genotype of interest, ethnicity, true risk factor, disease. Edge labels: causal; correlation, not causal; confounded association.]
Population stratification
• Systematic differences in allele frequencies between subpopulations that may be due to different ancestry
• Oversampling individuals from one subpopulation as cases in a case-control genetic association study can produce spurious associations
Account for or avoid population stratification
• Match cases with controls
• Restrict to one subgroup
• Adjust for genetic background
– E.g. use principal components (PCs) to infer ancestry from genotype data and adjust for PCs in association analysis
• Family-based study design: genotype relatives and analyze transmission of alleles from heterozygous parents to offspring
– Transmission disequilibrium test (TDT), family-based association test (FBAT)
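The transmission disequilibrium test named on this slide can be illustrated with a minimal sketch. It assumes the common McNemar-style form of the TDT, in which heterozygous parents are scored by whether they transmitted or did not transmit a given allele to an affected child. The function name `tdt` and the counts are hypothetical and not from the lecture.

```python
import math

def tdt(transmitted, untransmitted):
    """McNemar-style TDT statistic from heterozygous-parent transmission counts."""
    chi2 = (transmitted - untransmitted) ** 2 / (transmitted + untransmitted)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

# Hypothetical counts: the allele was transmitted 60 times and not transmitted 40 times.
chi2, p_value = tdt(60, 40)
print(chi2, round(p_value, 4))
```

Because each parent serves as their own control, the comparison is made within families, which is why this design avoids population stratification.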
Genome-wide genotyping panels
• 10,000 - 5 million variants
• Affymetrix, Illumina
• Random SNPs
• Selected haplotype tag variants
• Copy number probes
• More lower frequency variants
• Exome variants
• Some arrays allow variants to be added
Selecting ‘haplotype tag’ SNPs
International HapMap Consortium (2003) Nature 426:789
Illumina Infinium Assay
Illumina.com and adapted from Gunderson (2005) NatGen 37:549
• Whole genome amplification
• Fragment DNA
• Hybridize unlabeled DNA to specific arrays of 50-mers
• Allele-specific primer extension with labeled nucleotides
• Dual-color fluorescent staining
• Detect fluorescent color and intensity
Affymetrix Axiom Array
Affymetrix.com
25-125 bp
30-mers
Cocktail of labeled 9-mer oligos
Ligase closes the gap between capture and label probe if complete complementarity; wash off others
Stain to detect label
GeneTitan platform
Global genomic coverage
Li (2008) EJHG 16:625
Percent of SNPs present on the chip or tagged at r² > 0.8 by at least one SNP on the chip within 250 kb
Quality control: Identify and remove bad samples
• Poor quality samples
– Sample success rate < 95%
– Excess heterozygous genotypes
• Sample switches
– Wrong sex
• Unexpected related individuals
– Pair-wise comparisons of genotype similarity
– Duplicates
• Ancestry different from the rest of sample
Quality control: Identify and remove bad SNPs
• Genotyping success rate < 95%
• Different genotypes in duplicate samples
• Observed genotype proportions are not consistent with those expected from the allele frequencies (Hardy-Weinberg equilibrium)
• Non-Mendelian inheritance in trios
• Differential missingness in cases and controls
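One of the SNP checks listed for this slide compares genotype proportions with those expected from the allele frequencies. A minimal sketch of that check follows, assuming a 1-degree-of-freedom chi-square test of Hardy-Weinberg proportions. The function name `hwe_chi2_test` and the genotype counts are hypothetical, and any exclusion threshold is a study-specific choice not given in the lecture.

```python
import math

def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions for one biallelic SNP.

    Assumes both alleles are observed, so no expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

balanced = hwe_chi2_test(360, 480, 160)  # counts match the expected proportions
skewed = hwe_chi2_test(450, 300, 250)    # too few heterozygotes
print(balanced, skewed)
```

A very small p-value, as in the second call, would flag the SNP for inspection of its genotype clusters.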
Quality control: Identify and remove bad SNPs
[Figure panels: ideal genotyping plot; clusters mis-called; clusters overlap]
McCarthy (2008) Nat Rev Gen 9:356
Statistical analysis: linear regression
[Plot: toe size (y-axis) by genotype GG, AG, AA (x-axis)]
y = β₀ + β₁x
Trait = β₀ + β₁·SNP1
Toe size = β₀ + β₁·rs123456
Two main parameters: p-value and effect size
Statistical analysis: linear regression
Two main parameters: p-value and effect size
y = β₀ + β₁x
Trait = β₀ + β₁·SNP1
Toe size = β₀ + β₁·rs123456
• Assumptions
– Trait is normally distributed for each genotype, with a common variance
– Subjects independent (e.g. unrelated)
• Covariates
Toe size = β₀ + β₁·rs123456 + β₂·sex + β₃·age + β₄·age² + β₅·BMI
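The single-SNP model Trait = β0 + β1·SNP can be fit by ordinary least squares. The sketch below is illustrative only: it assumes genotypes are coded as 0, 1 or 2 copies of one allele (an additive coding the slide does not state explicitly), uses made-up trait values, and estimates only the effect size β1, not its p-value. The function name `additive_regression` is hypothetical.

```python
def additive_regression(genotypes, trait):
    """Closed-form OLS fit of trait = b0 + b1 * genotype; returns (b0, b1)."""
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, trait))
    b1 = sxy / sxx           # effect size: trait change per allele copy
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data: allele counts and a quantitative trait for six people.
genotypes = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 1.4, 1.6, 1.9, 2.1]
b0, b1 = additive_regression(genotypes, trait)
print(round(b0, 3), round(b1, 3))
```

Covariates such as sex and age would add further columns to the model, which requires a multiple-regression solver rather than this closed form.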
Odds ratio
• Surrogate measure of effect of allele on risk of developing disease
Allele     A       C       Total
Case       860     1140    2000
Control    1000    1000    2000
Total      1860    2140    4000
Odds of C allele given case status = Case C / Case A
Odds of C allele given control status = Control C / Control A
Odds ratio = (Case C / Case A) / (Control C / Control A) = (1140 / 860) / (1000 / 1000) = 1.33
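The odds ratio arithmetic on this slide can be reproduced in a few lines. The function name `allelic_odds_ratio` is a hypothetical helper; the allele counts are the ones given in the slide's table.

```python
def allelic_odds_ratio(case_c, case_a, control_c, control_a):
    """Odds ratio for allele C from allele counts in cases and controls."""
    case_odds = case_c / case_a           # odds of C allele given case status
    control_odds = control_c / control_a  # odds of C allele given control status
    return case_odds / control_odds

odds_ratio = allelic_odds_ratio(case_c=1140, case_a=860,
                                control_c=1000, control_a=1000)
print(round(odds_ratio, 2))  # 1.33
```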
Association study odds ratio plot
Relationship between GWAS sample size and power
Paria (2014) J Bone Joint Surg Am 96:e38
[Plot: statistical power (y-axis) vs. odds ratio (x-axis)]
Adjust for population structure: genomic control
Devlin & Roeder (1999) Biometrics 55:997; Pearson (2008) JAMA 299:1335
[Quantile-quantile (Q-Q) plot, annotated: True associations? Population outliers and/or structure?]
• With population structure, the distribution of Cochran-Armitage trend tests, genome-wide, is inflated by a constant multiplicative factor λ.
• That factor can be estimated from the association results: λ = median(Xᵢ²)/0.456.
• Inflation factor λ > 1 indicates population structure, unknown relatives or other errors.
• The tests of association can be adjusted by this factor: Xᵢ²(adjusted) = Xᵢ²/λ
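The genomic control adjustment described in these bullets amounts to one division by the median-based inflation factor. A minimal sketch, using the slide's constant 0.456 and made-up test statistics (the function name `genomic_control` is hypothetical):

```python
import statistics

def genomic_control(chi2_stats, null_median=0.456):
    """Return (lambda, adjusted statistics) for a list of 1-df chi-square statistics."""
    lam = statistics.median(chi2_stats) / null_median
    adjusted = [x / lam for x in chi2_stats]
    return lam, adjusted

# Hypothetical test statistics, for illustration only.
chi2_stats = [0.10, 0.30, 0.57, 0.90, 2.40]
lam, adjusted = genomic_control(chi2_stats)
print(round(lam, 2))  # 1.25
```

After adjustment the median statistic equals the null median, so the bulk of the Q-Q plot falls back onto the diagonal.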
‘Manhattan plot’ for HDL-cholesterol
GLGC (2013) Nat Gen 45:1274
[Plot: -log10 p-value (y-axis) by chromosome (x-axis); legend: novel HDL loci; GWS HDL loci; GWAS, new for HDL]
Global Lipids Genetics Consortium: 188,577 individuals from 60 studies, GWAS + metabochip variants
Multiple testing
• Genotype and test > 300K – 5M SNPs
• Correct for the multiple tests: P-value threshold = 0.05 / ~1 million common SNPs = 5 × 10⁻⁸
• Need large effect or large sample size
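The genome-wide significance threshold on this slide is a Bonferroni-style division of 0.05 by roughly one million common SNPs. As a short check of that arithmetic (the helper name `is_genome_wide_significant` is hypothetical):

```python
alpha = 0.05
n_tests = 1_000_000            # ~1 million common SNPs, as on the slide
threshold = alpha / n_tests
print(f"{threshold:.0e}")      # 5e-08

def is_genome_wide_significant(p_value, cutoff=threshold):
    """True if a single-SNP p-value passes the corrected threshold."""
    return p_value < cutoff

print(is_genome_wide_significant(3e-9), is_genome_wide_significant(1e-6))
```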
Li (2009) Ann Rev Genomics Hum Genet 10:387
Imputation of ungenotyped variants
Imputation: Observed genotypes
Study Sample: Observed Genotypes
. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .
Reference Haplotypes (HapMap or 1000 Genomes or …)
C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G G
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T T C T G T G C
Gonçalo Abecasis; Li (2009) Ann Rev Gen Hum Genet 10:387
Identify match among reference
Phase chromosomes, impute missing genotypes
Observed Genotypes (genotyped alleles in uppercase, imputed alleles in lowercase)
c g a g A t c t c c c g A c c t c A t g g
c g a a G c t c t t t t C t t t c A t g g
Reference Haplotypes
C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G G
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T T C T G T G C
Gonçalo Abecasis; Li (2009) Ann Rev Genomics Hum Genet 10:387
Combining GWAS by meta-analysis
• Combine studies giving more weight to studies with greater precision
• Increase power vs individual studies
• Can investigate consistency of effects across studies
• Potential sources of heterogeneity:
– Phenotype definitions are different
– Different genotyping and analysis strategies
– Environmental effects may differ
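Weighting studies by precision can be made concrete with a fixed-effect, inverse-variance meta-analysis. The slide does not name a specific method, so this scheme is an assumption chosen for illustration; the function name `fixed_effect_meta` and the per-study effect sizes and standard errors are made up.

```python
import math

def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance weighted combination of per-study effect estimates."""
    weights = [1 / se ** 2 for se in standard_errors]  # more precise studies weigh more
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return beta, se, p_value

# Hypothetical per-study results for one SNP.
betas = [0.10, 0.14, 0.08]
standard_errors = [0.05, 0.10, 0.04]
beta, se, p_value = fixed_effect_meta(betas, standard_errors)
print(round(beta, 4), round(se, 4), p_value)
```

The combined standard error is smaller than that of any single study, which is the gain in power the bullets describe. Heterogeneity across studies would need a separate test that this sketch omits.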
Combining GWAS by meta-analysis
Zeggini (2009) Pharmacogenomics 10:191
‘Manhattan plot’ for HDL-cholesterol
GLGC (2013) Nat Gen 45:1274
[Plot: -log10 p-value (y-axis) by chromosome (x-axis); legend: novel HDL loci; GWS HDL loci; GWAS, new for HDL]
Global Lipids Genetics Consortium: 188,577 individuals from 60 studies, GWAS + metabochip variants
GLGC (2013) Nat Gen 45:1274
Single good candidate gene
Teslovich (2010) Nature 466:707
Signal outside of genes
Voight (2010) Nat Gen 42:579
Many candidate genes
Interpret GWA locus names with caution; many are merely the nearest gene to the signal
Teslovich (2010) Nature 466:707
Interpret plausible candidate genes
GLGC (2013) Nat Gen 45:1274
Nearby independent signals
CEU: D′ = 0.07, r² < 0.01; p-values remain unchanged with the other SNP as covariate
p(1+2) = 2e-15
p(1+2) = 3e-20
Cristen Willer
Conditional analysis
Tests independence of SNP effects
y = β₀ + β₁x
Trait = β₀ + β₁·SNP1 + β₂·SNP2
[HDL] = β₀ + β₁·rs261332 + β₂·rs4775041
[HDL] = β₀ + β₁·rs261332 + β₂·rs4775041 + β₃·sex + β₄·age + β₅·age²
• If β₁ changes when SNP2 (β₂) is included in the model, then SNP1 is sometimes inherited with SNP2 (the two SNPs are in linkage disequilibrium)
• If neither β changes in reciprocal tests, then the two SNPs independently affect the trait
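The conditional test compares the coefficient of SNP1 with and without SNP2 in the model. The sketch below fits both models by least squares on made-up data in which the two SNPs are uncorrelated by construction, so β1 should not change. The helper `ols` is a hypothetical, minimal normal-equations solver written for this example, not a library routine.

```python
def ols(columns, y):
    """Least-squares coefficients for y ~ intercept + columns (intercept first)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical allele counts for two uncorrelated SNPs and a noise-free trait.
snp1 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
snp2 = [0, 1, 2, 0, 1, 2, 0, 1, 2]
trait = [1 + 0.5 * a + 0.3 * b for a, b in zip(snp1, snp2)]

marginal = ols([snp1], trait)     # Trait = b0 + b1*SNP1
joint = ols([snp1, snp2], trait)  # Trait = b0 + b1*SNP1 + b2*SNP2
print(round(marginal[1], 3), round(joint[1], 3), round(joint[2], 3))
```

If the two SNPs were correlated, the SNP1 coefficient would shift between the two fits, which is the signal that the effects are not independent.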
Fine-mapping across populations
[Figure panels: Europeans; African Americans]
HDL-C locus near PPP1R3B
Wu (2013) PLoS Gen 9:e1003379
Panoutsopoulou (2013) Hum Mol Gen 22:R16
Some sequencing study designs for complex traits
• Sequence selected individuals
– Extreme trait values (>95% vs <5% level)
– Cases and controls
• Increase the number of individuals
– By decreasing sequencing coverage ($)
– By collecting rare variants onto a less expensive genotyping array
• Sequence population isolates, where rare variants may have drifted to higher frequencies and LD may be longer
Sequenced coding regions and splice junctions of 58 genes in 379 obese individuals with mean BMI 49 and 378 lean individuals with mean BMI 19
Found >1000 variants, including 8 in MC4R that were subsequently tested for function
Variant discovery at GWAS locus
• Sequence ‘positional candidate’ genes in cases & controls or individuals with extreme trait values
• Identify variants in cases (one extreme) that are absent from controls (other extreme)
• Hypothesize that occasional ‘smoking gun’ variants with strong effect will be identified
• Use evidence that variants affect gene function and lead to the same disease/trait to implicate that gene at the association signal
• Does not require finding the variant(s) responsible for association signal that may have a weaker effect
Resequenced exons and splice sites of 10 candidate genes in pools of DNA from 480 patients & 480 controls
Tested variants for association in >30,000 subjects
Rare variants confirmed to be associated with T1D in more samples
Establishes the role of IFIH1 in T1D and demonstrates that resequencing studies can pinpoint disease-causing genes in regions initially identified by GWASs.
Identify an increased ‘burden’ of variants in a single gene or locus
• Many individually important variants will be too rare to detect the association with the trait; however, there will often be more than one important variant in a gene
• Gene-based tests combine information from multiple variants into a single test statistic to be used as predictor in genetic association tests
Raychaudhuri (2011) Cell 147:57
Rare variant burden (gene-based) tests
• Collapse information from multiple variants into single test (e.g. count risk alleles across a set of variants)
• Some tests allow the direction of effect of each variant to be different (gain of function versus loss of function)
• Choice of variants to include has a large impact on the test. Including too many neutral variants reduces statistical power, but so does leaving out the relevant ones
• Filter missense variants on minor allele frequency and predicted function
• Restrict tests to obvious functional variants (nonsense, frameshift indels, splice errors)
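The simplest form of the collapsing idea, counting rare alleles per person across a chosen variant set, can be sketched as follows. The genotype data, the variant filter and the function name `burden_scores` are all hypothetical; a real analysis would use the score as a predictor in a regression or another association test rather than simply counting carriers.

```python
def burden_scores(genotypes, variant_indices):
    """Count minor alleles per person across the chosen set of variants."""
    return [sum(person[i] for i in variant_indices) for person in genotypes]

# Hypothetical data: rows are people, columns are rare variants in one gene,
# entries are minor-allele counts (0, 1, or 2).
cases = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1]]
controls = [[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]]
included = [0, 1, 2]  # variants passing the frequency/annotation filters

case_scores = burden_scores(cases, included)
control_scores = burden_scores(controls, included)
case_carriers = sum(score > 0 for score in case_scores)
control_carriers = sum(score > 0 for score in control_scores)
print(case_carriers, control_carriers)  # 3 1
```

No single variant here is seen more than twice, yet the gene-level carrier counts differ between groups. Note that this simple count assumes all included variants act in the same direction.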
Gene-based rare variant association methods
Moutsianas (2015) PLoS Genet 11: e1005165
An example of a gene-based test
Flannick (2014) NatGen 46:357
• Initially sequenced 352 young lean T2D cases, 406 elderly obese euglycemic controls
• Then tested variants in 6,388 cases and 7,496 controls
• Found a nonsense variant in 7 cases and 21 controls, odds ratio (OR) = 0.38, P = 0.05
• Added this variant to the exome array and tested more individuals (N = 48,115, P = 0.0067)
• Difficult to increase sample size because variant mostly restricted to western Finland
• Expanded to look at more variants in the gene in other populations…
SLC30A8 variants in ~150,000 individuals
UK10K sequencing study
UK10K (2015) Nature 526:82
UK10K association results
UK10K (2015) Nature 526:82
Clinical translation
McCarthy (2008) Nat Rev Gen 9:356
Future of complex trait analyses
• More and more loci identified
• Larger meta-analyses
• Deeper follow-up of signals
• More diverse populations
• Gene-based results from rare variants
• Gene-gene and gene-environment interactions
• Molecular and biological mechanisms