+ All Categories
Home > Documents > Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology

Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology

Date post: 11-Feb-2016
Category:
Upload: gypsy
View: 24 times
Download: 1 times
Share this document with a friend
Description:
Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology. Ion Mandoiu University of Connecticut. Outline. HMM model of haplotype diversity Applications Phasing Error detection Imputation Genotype calling from low-coverage sequencing data Conclusions. - PowerPoint PPT Presentation
Popular Tags:
35
Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology Ion Mandoiu University of Connecticut
Transcript
Page 1: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Hidden Markov Models of Haplotype

Diversity and Applications in

Genetic Epidemiology

Ion MandoiuUniversity of Connecticut

Page 2: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

HMM model of haplotype diversityApplications

- Phasing- Error detection- Imputation- Genotype calling from low-coverage

sequencing dataConclusions

Outline

Page 3: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)

High density in the human genome: 1 107 SNPs out of total 3 109 base pairs

Single Nucleotide Polymorphisms

… ataggtccCtatttcgcgcCgtatacacgggActata …… ataggtccGtatttcgcgcCgtatacacgggTctata …… ataggtccCtatttcgcgcCgtatacacgggTctata …

Page 4: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Haplotypes and Genotypes

Diploids: two homologous copies of each autosomal chromosome One inherited from mother and one from father

Haplotype: description of SNP alleles on a chromosome 0/1 vector: 0 for major allele, 1 for minor

Genotype: description of alleles on both chromosomes 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor)

allele; 2 - the chromosomes contain different alleles

011100110001000010021200210

+two haplotypes per individual

genotype

Page 5: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Sources of Haplotype Diversity: Mutation

The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.

Page 6: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Sources of Haplotype Diversity: Recombination

Page 7: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Haplotype Structure in Human Populations

Page 8: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Fi = founder haplotype at locus i, Hi = observed allele at locus i

P(Fi), P(Fi | Fi-1) and P(Hi | Fi) estimated from reference genotype or haplotype data

For given haplotype h, P(H=h|M) can be computed in O(nK2) using forward algorithm

Similar models proposed in [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]

HMM Model of Haplotype Frequencies

F1 F2 Fn…

H1 H2 Hn

Page 9: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

HMM model of haplotype diversityApplications

- Phasing- Error detection- Imputation- Genotype calling from low-coverage

sequencing dataConclusions

Outline

Page 10: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Genotype Phasing

g: 0010212 ?

h1:0010111

h2:0010010

h3:0010011

h4:0010110

Page 11: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Maximum Likelihood Genotype Phasing

Maximum likelihood genotype phasing: given g, find (h1,h2) = argmaxh1+h2=g P(h1|M)P(h2|M)

F1 F2 Fn…

H1 H2 Hn

G1 G2 Gn

F'1 F'2 F'n…

H'1 H'2 H'n

Page 12: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Computational Complexity• [KMP08] Cannot approximate maxh1+h2=g P(h1|M)P(h2|M) within a factor of O(n1/2 -), unless ZPP=NP• [Rastas et al.] give Viterbi and randam sampling based heuristics that yield phasing accuracy comparable to best existing methods (PHASE)

Page 13: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

HMM model of haplotype diversityApplications

- Phasing- Error detection- Imputation- Genotype calling from low-coverage

sequencing dataConclusions

Outline

Page 14: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Genotyping Errors A real problem despite advances in technology &

typing algorithms 1.1% of 20 million dbSNP genotypes typed multiple times are

inconsistent [Zaitlen et al. 2005]

Systematic errors (e.g., assay failure) typically detected by departure from HWE [Hosking et al. 2004]

In pedigrees, some errors detected as Mendelian Inconsistencies (MIs)

Many errors remain undetected As much as 70% of errors are Mendelian consistent for

mother/father/child trios [Gordon et al. 1999]

Page 15: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

Likelihood of best phasing for original trio T

0 1 1 1 0 0 h1

0 0 0 1 0 1 h3

0 1 1 1 0 0 h1

0 1 0 1 0 1 h2

0 0 0 1 0 1 h3

0 1 1 1 0 0 h4

)()()()( MAX)( 4321 hphphphpTL

Likelihood Sensitivity Approach to Error Detection in Trios

Page 16: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child

Likelihood of best phasing for original trio T

)()()()( MAX)( 4321 hphphphpTL

? 0 1 0 1 0 1 h’ 1 0 0 0 1 0 0 h’ 3

0 1 0 1 0 1 h’1

0 1 1 1 0 0 h’2

0 0 0 1 0 0 h’ 3

0 1 1 1 0 1 h’ 4

Likelihood of best phasing for modified trio T’

)'()'()'()'( MAX)'( 4321 hphphphpTL

Likelihood Sensitivity Approach to Error Detection in Trios

Page 17: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

0 1 2 1 0 2

0 2 2 1 0 2

0 2 2 1 0 2

Mother Father

Child?

Large change in likelihood suggests likely error Flag genotype as an error if L(T’)/L(T) > R, where R is the detection threshold (e.g., R=104)

Likelihood Sensitivity Approach to Error Detection in Trios

Page 18: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Alternate Likelihood Functions

• Efficiently Computable Likelihood Functions- Viterbi probability - Probability of Viterbi Haplotypes - Total Trio Probability

• [KMP08] Cannot approximate L(T) within O(n1/4 -), unless ZPP=NP

Page 19: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Comparison with FAMHAP (Children)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

FP rate

Sens

itivi

ty

TotalProb-UNO

TotalProb-DUO

TotalProb-TRIO

TotalProb-COMBINED

FAMHAP-1

FAMHAP-3

Page 20: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Comparison with FAMHAP (Parents)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.005 0.01 0.015

FP rate

Sens

itivi

ty

TotalProb-UNO

TotalProb-DUO

TotalProb-TRIO

TotalProb-COMBINED

FAMHAP-1

FAMHAP-3

Page 21: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

HMM model of haplotype diversityApplications

- Phasing- Error detection- Imputation- Genotype calling from low-coverage

sequencing dataConclusions

Outline

Page 22: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Genome-Wide Association Studies Powerful method for finding genes associated with

complex human diseases Large number of markers (SNPs) typed in cases and

controls Disease causal SNPs unlikely to be typed directly Significant statistical power gained by performing

imputation of untyped Hapmap genotypes [WTCCC’07]

Page 23: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

HMM Based Genotype Imputation Train HMM using the haplotypes from related

Hapmap or small cohor typed at high density

Probability of missing genotypes given the typed genotype data

gi is imputed as )|,(argmax }2,1,0{ MxggPx iix

)|()|,(),|(

MgPMxggPMgxgP

i

iiii

Page 24: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Experimental Results Estimates of the allele 0 frequency based on

Imputation vs. Illumina 15k

Page 25: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Experimental Results Accuracy and missing data rate for imputed

genotypes at different thresholds

Page 26: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

HMM model of haplotype diversityApplications

- Phasing- Error detection- Imputation- Genotype calling from low-coverage

sequencing dataConclusions

Outline

Page 27: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Illumina / Solexa Genetic Analyzer 1G1000 Mb/run, 35bp reads

Roche / 454 Genome Sequencer FLX100 Mb/run, 400bp reads

Applied BiosystemsSOLiD3000 Mb/run, 25-35bp reads

New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing

Ultra-High Throughput Sequencing

Page 28: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

F1 F2 Fn…

H1 H2 Hn

G1 G2 Gn

…R1,1 R2,1

F'1 F'2 F'n…

H'1 H'2 H'n

R1,c … R2,c …Rn,1 Rn,c1 2 n

Probabilistic Model

Page 29: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0 otherwise

where is the probability that read r has an error at locus I

Conditional probabilities for sets of reads are given by:

Model Training

)(1)(

)()(

)()(

)(1)(, 1

221

2)|( ir

irir

iriir

irir

iri

iijigggGrRP

1)(r

)(

0)(r

)( )1()0|r(ir

rir

irr

iriiii

GP

0)(r

)(

1)(r

)( )1()2|r(ir

rir

irr

iriiii

GP ic

ii GP

21)1|r(

)(ir

Page 30: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Multilocus Genotyping ProblemGIVEN:

• Shotgun read sets r=(r1, r2, … , rn)• Base quality scores• HMMs for populations of origin for mother/father

FIND:• Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum

posterior probability, i.e., g*=argmaxg P(g | r)

Page 31: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Joint probabilities can be computed using a forward-backward algorithm:

Direct implementation gives O(m+nK4) time, where m = number of reads n = number of SNPs K = number of founder haplotypes in HMMs

Runtime reduced to O(m+nK3) using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]

)()|r()r,( '' ''1 ,1 ,, i

iff

K

fi

ffi

ff

K

fiii ggPgPiii iiiii

Posterior Decoding Algorithm1. For each i = 1..n, compute2. Return *)*,...,(* 1 nggg

)r,(maxarg)r|(maxarg* igigi gPgPgii

Page 32: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Homozygous Watson SNPs (Affy 500k)

30.1

93.5

50.5

95.6

73.7

97.490.5

98.5 96.3 99.1

0

20

40

60

80

100

120

0.35

x Bi

nom

ial

0.35

x Po

ster

ior

0.70

x Bi

nom

ial

0.70

x Po

ster

ior

1.41

x Bi

nom

ial

1.41

x Po

ster

ior

2.82

x Bi

nom

ial

2.82

x Po

ster

ior

5.64

x Bi

nom

ial

5.64

x Po

ster

ior

Heterozygous Watson SNPs (Affy 500k)

2.8

67.8

9.2

80.0

25.7

88.5

54.6

93.782.5

96.8

0

20

40

60

80

100

120

0.35

x Bi

nom

ial

0.35

x Po

ster

ior

0.70

x B

inom

ial

0.70

x Po

ster

ior

1.41

x B

inom

ial

1.41

x Po

ster

ior

2.82

x Bi

nom

ial

2.82

x Po

ster

ior

5.64

x Bi

nom

ial

5.64

x P

oste

rior

Genotyping Accuracy on Watson Reads

Page 33: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

HMM model of haplotype diversityApplications

- Phasing- Error detection- Imputation- Genotype calling from low-coverage

sequencing dataConclusions

Outline

Page 34: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Conclusions HMM model of haplotype diversity provides a powerful

framework for addressing central problems in population genetics & genetic epidemiology

Enables significant improvements in accuracy by exploiting the high amount of linkage disequilibrium in human populations

Despite hardness results, heuristics such as posterior or Viterbi decoding perform well in practice

Highly scalable runtime (linear in #SNPs and #individuals/reads)

Software available at http://www.engr.uconn.edu/~ion/SOFT/

Page 35: Hidden Markov Models of Haplotype Diversity and Applications in  Genetic Epidemiology

Acknowledgements

Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin Kennedy, Bogdan Pasaniuc

NSF funding (awards IIS-0546457 and DBI-0543365)


Recommended