Genotype Phasing and Imputation in 1x Sequencing Data


Warren W. Kretzschmar

DPhil Genomic Medicine and Statistics

Wellcome Trust Centre for Human Genetics, Oxford, UK

Supervisor: Jonathan Marchini

• Commonest psychiatric disorder and the second ranking cause of morbidity world-wide.

• Affects 1 in 10 people in their lifetime.

• Estimates of heritability range from 30% to 40%.

Major Depression

[Figure: Top ten causes of DALYs. Categories shown: major depressive disorders, violence, ischaemic heart disease, alcohol use disorders, road traffic accidents, diabetes mellitus, cerebrovascular disease, other unintentional injuries, lower respiratory infections, chronic obstructive pulmonary disease.]

DALY : Disability-adjusted life year : number of years lost due to ill-health, disability or early death

Genetics of Major Depression

Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium (2012). A mega-analysis of genome-wide association studies for major depressive disorder. Molecular Psychiatry 18.4:497-511.

Study Design

• Unrelated Europeans

• 9240 cases

• 9519 controls

• 1.2 million SNPs

Hypotheses

• Depression has heterogeneous environmental and genetic causes

• Depression is a complex trait with genetic components of small effect size

CONVERGE (China, Oxford and VCU Experimental Research on Genetic Epidemiology)

Genetically Homogeneous : All subjects are female and their grandparents are Han Chinese

6,000 cases : typically severely affected; 85% qualify for a diagnosis of melancholia by DSM-IV. >25% reported a family history of MD in one or more first-degree relatives

6,000 controls : patients undergoing minor surgical procedures.

Extensive Phenotyping : primary disorder of major depression, common comorbid disorders (e.g. generalized anxiety disorder, panic disorder), within disorder symptoms (e.g. suicidal ideation), disorder subtypes (e.g. melancholia, dysthymia), possible endophenotypes (e.g. neuroticism) and a range of risk factors (e.g. child abuse, stressful life events, social and marital relationships, parenting, post-natal depression, demographics).

Sequencing : mean depth 1.7X using Illumina HiSeq at Beijing Genomics Institute

Current status : Sequencing finished. We have data on 12,000 samples. For now we have only considered ~13M sites that are polymorphic in 1000 Genomes Asian samples. Analysis ongoing…

59 hospitals, 45 cities, 21 provinces.

Sequence analysis pipeline

Phase 1: genotype likelihood estimation (one sample at a time)

Raw reads
→ Mapping (Stampy)
→ Duplicate marking (Picard)
→ Base quality recalibration (GATK)
→ Genotype likelihood estimation (SNPTools)
→ Genotype likelihoods

Phase 2: phasing and imputation (all samples together). My focus!

Genotype likelihoods
→ Phasing and imputation
→ Genotype probabilities

Resource figures shown on the slide: 650 GB, 4.6 CPU; 350 GB, 2.7 CPU; 5 CPU years

GENOTYPE PHASING AND IMPUTATION

Genotype Phasing

Example SNP chip data

Unphased: G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C

After Phasing

Hap 1: G A A T T T T A G C

Hap 2: G T A T G A T A G G

Phase-informative Sites
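The relationship between the unphased genotypes and the two haplotypes above can be checked mechanically. The following is a minimal sketch in Python; the function names are illustrative and not from the talk. It confirms that Hap 1 and Hap 2 reproduce the unphased genotypes, lists the heterozygous sites (the only sites where phase is ambiguous), and counts how many distinct haplotype pairs would be consistent with the genotypes.

```python
# Unphased genotypes and phased haplotypes copied from the example above.
unphased = "G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C".split()
hap1 = "G A A T T T T A G C".split()
hap2 = "G T A T G A T A G G".split()


def is_consistent(genotypes, h1, h2):
    """True if, at every site, the two haplotype alleles match the unordered genotype."""
    return all(
        sorted(g.split("/")) == sorted([a, b])
        for g, a, b in zip(genotypes, h1, h2)
    )


def heterozygous_sites(genotypes):
    """0-based indices of heterozygous sites, where the phase is not determined."""
    return [i for i, g in enumerate(genotypes) if len(set(g.split("/"))) == 2]


het = heterozygous_sites(unphased)
# With k heterozygous sites there are 2**(k - 1) distinct unordered haplotype pairs.
n_phasings = 2 ** (len(het) - 1) if het else 1

print(is_consistent(unphased, hap1, hap2))
print(het)
print(n_phasings)
```

Homozygous sites contribute nothing to the count, which is why only heterozygous sites matter when phasing genotype data.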

Genotype Imputation from Haplotypes

J Marchini and B Howie. Nature Rev. Genet. 2010

GENOTYPE LIKELIHOODS

What is a Genotype Likelihood?

Genotype Likelihood = Pr( R | G )

R = Reads; also known as the “observed data”

G = Genotype; usually one of ref/ref, ref/alt, alt/alt

Genotype likelihoods (aka GL) are defined on a site by site basis.

GLs are conditional probabilities.
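To make Pr( R | G ) concrete, here is a minimal sketch in Python of one common textbook way to compute a GL at a single site. It is not the SNPTools model used in this project. It assumes that reads are independent, that each read base comes from either chromosome with probability 1/2, and that a sequencing error turns the true base into each of the other three bases with equal probability. The function name and the example pileups are hypothetical.

```python
import math


def genotype_likelihoods(bases, error_probs, ref, alt):
    """Return Pr(R | G) for G in ref/ref, ref/alt, alt/alt at one site.

    Assumed model (not SNPTools): independent reads, each read drawn from
    either chromosome with probability 1/2, errors uniform over the other bases.
    """
    def p_base(base, allele, err):
        # Probability of observing `base` when the true allele is `allele`.
        return 1.0 - err if base == allele else err / 3.0

    genotypes = {
        "ref/ref": (ref, ref),
        "ref/alt": (ref, alt),
        "alt/alt": (alt, alt),
    }
    return {
        name: math.prod(
            0.5 * p_base(b, a1, e) + 0.5 * p_base(b, a2, e)
            for b, e in zip(bases, error_probs)
        )
        for name, (a1, a2) in genotypes.items()
    }


# Hypothetical pileup at a G/T site: two reads, one showing each allele.
gls_two_reads = genotype_likelihoods(["G", "T"], [0.01, 0.01], ref="G", alt="T")

# Hypothetical site with no reads at all, which is common at ~1x depth.
gls_no_reads = genotype_likelihoods([], [], ref="G", alt="T")

print(gls_two_reads)
print(gls_no_reads)
```

Under this model a site with no reads has a GL of 1 for every genotype, so the reads alone cannot distinguish the genotypes there. This is one way to see why low-depth data needs information from other samples.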

How are Genotype Likelihoods Useful?

Genotype Probability = Pr( G | R ), which is proportional to Pr( R | G ) * Pr( G )

Genotype likelihoods allow us to quantify how much the reads support each possible genotype independent of other information.

To determine the most likely genotype call, we need a genotype probability.

Pr( G ) = prior probability of G. It may be determined through haplotype phasing and imputation approaches.
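The proportionality above becomes a genotype probability once it is normalised over the three genotypes. The sketch below, in Python, illustrates this with the three example GL values quoted later in this talk, read literally as floating-point literals (so 10e-6 is 1e-5; the slide may have intended 10^-6). The Hardy-Weinberg prior and the allele frequency of 0.9 are assumptions made only for illustration; in this project the prior would come from phasing and imputation instead.

```python
def genotype_posteriors(gls, prior):
    """Pr(G | R) for each genotype: normalise Pr(R | G) * Pr(G) so the values sum to 1."""
    unnormalised = {g: gls[g] * prior[g] for g in gls}
    total = sum(unnormalised.values())
    return {g: p / total for g, p in unnormalised.items()}


def hardy_weinberg_prior(alt_freq):
    """Illustrative prior from an assumed alt allele frequency (not the talk's method)."""
    ref_freq = 1.0 - alt_freq
    return {
        "ref/ref": ref_freq ** 2,
        "ref/alt": 2.0 * ref_freq * alt_freq,
        "alt/alt": alt_freq ** 2,
    }


# Example GLs from the talk, copied literally.
gls = {"ref/ref": 0.06, "ref/alt": 10e-3, "alt/alt": 10e-6}

flat_prior = {g: 1.0 / 3.0 for g in gls}
post_flat = genotype_posteriors(gls, flat_prior)
post_hw = genotype_posteriors(gls, hardy_weinberg_prior(0.9))

print(max(post_flat, key=post_flat.get), post_flat)
print(max(post_hw, key=post_hw.get), post_hw)
```

With a flat prior the most probable genotype follows the GLs (ref/ref). With the assumed alt allele frequency of 0.9, the prior moves the call to ref/alt, which shows how much the prior matters when the reads are only weakly informative.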

Genotype Likelihood Creation with SNPTools

Y Wang, J Lu, J Yu, RA Gibbs, FL Yu. Genome Research. 2013

observed reads

Three distributions

Pr(R|G = alt/alt) = 10e-6

Pr(R|G = ref/alt) = 10e-3

Pr(R|G = ref/ref) = 0.06

Genotype Phasing using Genotype Likelihoods

Example GL data

Pr(ref/ref): G/G A/A A/A T/T G/G A/A T/T A/A G/G G/G

Pr(ref/alt): G/A A/T A/G T/A G/T A/T T/C A/G G/C G/C

Pr(alt/alt): A/A T/T G/G A/A T/T T/T C/C G/G C/C C/C

Reference Haplotypes

Hap 1: G A A T T A C A G G

Hap 2: G T A T T A T A G G

Hap 3: G T A T G A C A G G

Hap 4: G T A T G A T A G C

Plausible Haplotypes after Phasing

Hap 5: G A A T T A T A G C

Hap 6: G T A T T A T A G G

General MCMC Scheme for Phasing from GLs

When using GLs, haplotype estimation is currently done in an iterative Markov chain Monte Carlo (MCMC) scheme:

1. Initialize haplotypes for each sample randomly
2. For a predetermined number of iterations
    1. For each sample
        1. Find a plausible haplotype pair using its GLs and all other haplotypes as a reference panel
        2. Update that sample’s haplotypes with the plausible haplotype pair
3. Return each sample’s current pair of haplotypes
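The scheme above can be sketched as a toy Python program. This is not the software used in this project, and it is heavily simplified: alleles are coded 0 (ref) and 1 (alt), each new haplotype copies one whole panel haplotype with a small per-site change probability `MU` (a hypothetical value), and recombination is ignored. Real methods use a hidden Markov model that lets a haplotype switch between panel haplotypes along the chromosome. All names here are illustrative.

```python
import itertools
import math
import random

MU = 0.01  # hypothetical per-site probability that a copied allele differs from its template


def site_weights(t1, t2, gl):
    """Unnormalised Pr(a, b | template alleles t1, t2, reads) at one site."""
    return {
        (a, b): (1 - MU if a == t1 else MU) * (1 - MU if b == t2 else MU) * gl[a + b]
        for a in (0, 1)
        for b in (0, 1)
    }


def sample_pair(panel, sample_gls, rng):
    """Sample a plausible haplotype pair from a sample's GLs and a reference panel."""
    templates = list(itertools.product(panel, repeat=2))
    # Weight of a template pair: read likelihood summed over all allele choices per site.
    weights = [
        math.prod(sum(site_weights(a, b, gl).values()) for a, b, gl in zip(t1, t2, sample_gls))
        for t1, t2 in templates
    ]
    t1, t2 = rng.choices(templates, weights=weights, k=1)[0]
    h1, h2 = [], []
    for a_t, b_t, gl in zip(t1, t2, sample_gls):
        w = site_weights(a_t, b_t, gl)
        a, b = rng.choices(list(w), weights=list(w.values()), k=1)[0]
        h1.append(a)
        h2.append(b)
    return h1, h2


def phase_from_gls(gls, n_iterations=20, seed=0):
    """gls[s][j] = (Pr(R | 0 alt), Pr(R | 1 alt), Pr(R | 2 alt)) for sample s, site j."""
    rng = random.Random(seed)
    n_sites = len(gls[0])
    # 1. Initialize haplotypes for each sample randomly.
    haps = [
        tuple([rng.randint(0, 1) for _ in range(n_sites)] for _ in range(2))
        for _ in gls
    ]
    # 2. For a predetermined number of iterations, update each sample in turn.
    for _ in range(n_iterations):
        for s, sample_gls in enumerate(gls):
            # Reference panel: every current haplotype except this sample's own.
            panel = [h for t, pair in enumerate(haps) if t != s for h in pair]
            haps[s] = sample_pair(panel, sample_gls, rng)
    # 3. Return each sample's current pair of haplotypes.
    return haps


# Hypothetical GLs for 3 samples at 4 sites: certain hom-ref, hom-alt and het samples.
toy_gls = [
    [(1.0, 0.0, 0.0)] * 4,
    [(0.0, 0.0, 1.0)] * 4,
    [(0.0, 1.0, 0.0)] * 4,
]
haps = phase_from_gls(toy_gls)
print(haps)
```

The products over sites would underflow for realistic numbers of sites, so a practical implementation would work in log space and over short windows. The sketch needs at least two samples so that the panel is never empty.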

The Tools/Languages I use

Coding : Emacs

Scripting : Perl with DistributedMake for pipelines

Statistical Methods : C++

Figure Generation : R

Statistical Analysis & Report Writing : LaTeX with Sweave

Presentations : PowerPoint or LaTeX

A Bioinformatician’s Best Practices

- Understand your goals and choose appropriate methods
- Be suspicious and trust nobody
- Set traps for your own scripts and other people’s
- Be a detective
- You're a scientist, not a programmer
- Use version control software
- Pipelineitis is a nasty disease
- An Obama frame of mind
- Someone has already done this. Find them!

according to Nick Loman & Mick Watson. Nature Biotechnology. 2013

see also: W. S. Noble. PLoS Computational Biology. 2009

Good Directory Structure

according to W. S. Noble. PLoS Computational Biology. 2009

Thank you. Questions?
