Post on 05-Jan-2016
1
Haplotyping Algorithms
Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621
Computational Statistical Genetics
Mar. 29, 2012
https://dsgweb.wustl.edu/qunyuan/presentations/Haplotyping_GEMS_2012.ppt
2
Questions
WHAT is a haplotype?
WHY study haplotypes?
WHY use algorithms for haplotyping?
HOW? (Data, Hypotheses, Algorithms)
3
WHAT is Haplotype?
A haplotype (Greek haploos = simple) is a combination of alleles at multiple linked loci that are transmitted together. Haplotype may refer to as few as two loci or to an entire chromosome depending on the number of recombination events that have occurred between a given set of loci. The term haplotype is a portmanteau of "haploid genotype."
In a second meaning, haplotype is a set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated. It is thought that these associations, and the identification of a few alleles of a haplotype block, can unambiguously identify all other polymorphic sites in its region. Such information is very valuable for investigating the genetics behind common diseases, and is collected by the International HapMap Project.
From http://en.wikipedia.org/wiki/Haplotype
4
Haplotype = Genotype of Haploid
Haplotype: C G
Haplotype: T A
Genotype: CT GA
Genotype: Aa Bb; Haplotypes: Ab//aB
Genotype: Aa Bb; Haplotypes: AB//ab
5
WHY Study Haplotype?
An efficient way of representing genetic variation/polymorphism, useful in genomics, population genetics, and genetic epidemiology
Population evolution
LD analysis
Missing genotype imputation
IBD estimation
Tag marker (SNP) selection
Multi-locus linkage & association
…
6
WHY use algorithms in haplotyping?
Most current molecular genotyping techniques mix DNA pieces from the two homologous chromosomes and only provide diploid genotypes (a mixture of haplotypes)
genotype (AaBb) => haplotype (Ab//aB or AB//ab) ?
Some molecular techniques can measure haplotypes directly, but they are expensive (money, labor, time …), especially for genome-wide studies.
So, at least for now, we need algorithms …
7
Ambiguity of Haplotype
Haplotypic ambiguity/uncertainty arises when ≥2 markers/loci are heterozygous and their genetic phase is unknown
Genotype: Haplotypes
AA BB: AB//AB
Aa bb: Ab//ab
Aa Bb: Ab//aB or AB//ab
Aa Bb Cc: ABC//abc, ABc//abC, AbC//aBc or Abc//aBC
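The table above can be reproduced by enumeration. The following is a minimal sketch, not part of the original slides; the function name `haplotype_pairs` and the genotype encoding (one two-character string per locus) are assumptions made for illustration. A genotype with n heterozygous loci has 2^(n-1) distinct haplotype pairs when n ≥ 1.

```python
from itertools import product

def haplotype_pairs(genotype):
    """List the distinct haplotype pairs compatible with a genotype.

    genotype: list of 2-character strings, one per locus, e.g. ["Aa", "Bb"].
    (Hypothetical encoding chosen for this sketch.)
    """
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    # Fix the phase of the first heterozygous locus so that each
    # unordered pair is listed exactly once.
    n_free = max(len(het) - 1, 0)
    pairs = []
    for flips in product([False, True], repeat=n_free):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (x, y) in enumerate(genotype):
            if flip_at.get(i, False):
                x, y = y, x
            h1.append(x)
            h2.append(y)
        pairs.append("".join(h1) + "//" + "".join(h2))
    return pairs

print(haplotype_pairs(["AA", "BB"]))        # one pair, no ambiguity
print(haplotype_pairs(["Aa", "Bb"]))        # two pairs
print(haplotype_pairs(["Aa", "Bb", "Cc"]))  # four pairs
```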
8
Rule-based Approaches (Parsimony & Phylogeny)
Search for an optimal set of haplotypes that satisfies some specific rules
9
Parsimony Approaches
Clark’s Algorithm
Clark, 1990, Mol. Biol. Evol., 7(2): 111-122
1. List all unambiguous haplotypes
2. Resolve ambiguous individuals one by one using listed haplotypes
3. If only half-resolved, add the new haplotype to the list
4. Repeat 2 & 3
5. Stop when no one else can be resolved
Example: listed haplotypes ABC, abc, abC
AaBbCC => ABC//abC
AABbCc => ABC//Abc (new haplotype Abc is added to the list)
Continue … until no one else can be resolved
Parsimony rules: Maximum resolution of genotypes and/or Minimum set of haplotypes
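The steps of this slide can be sketched in a few lines. This is an illustrative reading of the procedure, not Clark's published implementation; the function names and the genotype encoding (a string of allele pairs such as "AaBbCC") are assumptions. Note that the result can depend on the order in which individuals and listed haplotypes are tried.

```python
def split_loci(genotype):
    """'AaBbCC' -> ['Aa', 'Bb', 'CC'] (hypothetical encoding for this sketch)."""
    return [genotype[i:i + 2] for i in range(0, len(genotype), 2)]

def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if incompatible."""
    other = []
    for (x, y), h in zip(split_loci(genotype), hap):
        if h == x:
            other.append(y)
        elif h == y:
            other.append(x)
        else:
            return None
    return "".join(other)

def clark(genotypes):
    known = []      # haplotype list, in order of discovery
    resolved = {}   # genotype -> (haplotype 1, haplotype 2)
    # Step 1: genotypes with at most one heterozygous locus are unambiguous.
    for g in genotypes:
        loci = split_loci(g)
        if sum(x != y for x, y in loci) <= 1:
            h1 = "".join(x for x, _ in loci)
            h2 = "".join(y for _, y in loci)
            resolved[g] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Steps 2-5: keep resolving until a full pass makes no progress.
    progress = True
    while progress:
        progress = False
        for g in genotypes:
            if g in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    resolved[g] = (h, other)
                    if other not in known:
                        known.append(other)  # step 3: add the new haplotype
                    progress = True
                    break
    return resolved, known

resolved, known = clark(["AABBCC", "aabbCc", "AaBbCC", "AABbCc"])
print(resolved["AaBbCC"], resolved["AABbCc"], known)
```

With these four genotypes the unambiguous individuals supply ABC, abC and abc, which reproduces the slide's example: AaBbCC resolves to ABC//abC, and AABbCc to ABC//Abc with Abc added to the list.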
10
Phylogeny Approaches
D. Gusfield. 2002. Proc. of the 6th Annual Inter. Conf. on Res. in Comput. Mol. Biology, p166–175.
Given a set of genotypes, find a set of explaining haplotypes that defines a perfect phylogeny.
Perfect Phylogeny Haplotype (PPH) rule: coalescent rule (no recombination; infinite-sites mutation, i.e. each site mutates only once)
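One way to check whether a candidate haplotype set is consistent with the PPH rule is the four-gamete test: for binary (two-allele) sites, a set of haplotypes fits an unrooted perfect phylogeny exactly when no pair of sites shows all four allele combinations. The sketch below only verifies a given set; it is not Gusfield's algorithm for finding one. The 0/1 string encoding and the function name are assumptions.

```python
from itertools import combinations

def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test on equal-length haplotype strings over {'0', '1'}."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:   # 00, 01, 10 and 11 all observed
            return False
    return True

print(admits_perfect_phylogeny(["00", "01", "10"]))
print(admits_perfect_phylogeny(["00", "01", "10", "11"]))
```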
11
Probability-based Approaches (EM & MCMC)
Calculate the probability of a haplotype, conditional on genotypes. Pr(H|G)=?
12
Data Structure for Haplotyping
Genotypes: Subjects (1, 2, 3 …) by Loci (A, B, C …)
G1,A G1,B G1,C …
G2,A G2,B G2,C …
G3,A G3,B G3,C …
… … … …
Among subjects: Genetic Relationship
Among loci (A, B, C): Linkage
Gene/haplotype frequencies: HWE, LD
=> Haplotypes
13
HWE & LD
Hardy-Weinberg Equilibrium (HWE) vs. Hardy-Weinberg Disequilibrium (HWD)
HWE: random combination of alleles from the same locus
Under HWE, allele freq. determines genotype freq.
HWE => Pr(AA)=Pr(A)*Pr(A), Pr(aa)=Pr(a)*Pr(a), Pr(Aa)=2*Pr(A)*Pr(a)
Linkage Equilibrium (LE) vs. Linkage Disequilibrium (LD)
LE: random combination of alleles from different loci
LD: association between alleles from different loci
Under LE, allele freq. determines haplotype freq.
LE => Pr(ABC)=Pr(A)*Pr(B)*Pr(C)
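The HWE and LE product rules can be checked numerically. This is a small sketch with made-up allele frequencies; the function names are hypothetical. The coefficient D = Pr(AB) - Pr(A)*Pr(B) is the usual pairwise LD measure, added here as an assumption since the slide does not define one.

```python
def hwe_genotype_freqs(p_A):
    """Genotype frequencies at one biallelic locus under HWE."""
    p_a = 1.0 - p_A
    return {"AA": p_A * p_A, "Aa": 2 * p_A * p_a, "aa": p_a * p_a}

def le_haplotype_freq(allele_freqs):
    """Haplotype frequency under LE: product of the allele frequencies."""
    prob = 1.0
    for f in allele_freqs:
        prob *= f
    return prob

def ld_coefficient(p_AB, p_A, p_B):
    """D = Pr(AB) - Pr(A)*Pr(B); D = 0 under LE."""
    return p_AB - p_A * p_B

print(hwe_genotype_freqs(0.6))
print(le_haplotype_freq([0.6, 0.5, 0.2]))   # Pr(ABC) under LE
print(ld_coefficient(0.25, 0.5, 0.5))       # LE
print(ld_coefficient(0.01, 0.5, 0.5))       # strong LD
```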
14
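Genetic Relationship (R) & Linkage (r)
Recombination rate (r)
r = 0, complete linkage
0 < r < 0.5, incomplete linkage
r = 0.5, no linkage
[Figure: pedigree diagrams for an AaBb individual, with these labels]
AaBb alone: AB//ab or aB//Ab
AaBb with an AABB parent: AB//ab
AaBb whose AaBb parent descends from AABB and aabb: (if r=0) AB//ab; (if r>0) AB//ab, Ab//aB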
15
Haplotyping & Conditional Probability
AaBB: Pr(AB//aB)=1
AABb: Pr(AB//Ab)=1
AaBb: Pr(AB//ab)=0.5, Pr(Ab//aB)=0.5
Population sample:
AABB, aabb, AABB, aabb, AABB, AABb, aabb
AaBB, aabb, AABB, AABB, AABB, AABB, aabb
aabb, AABB, AABB, AABB, AaBb, AABB, aabb
aabb, AABB, AABB, aabb, AABB, aabb, AABB …
Is Pr(AB//ab)=Pr(Ab//aB)=0.5 still true?
HWE or HWD? LD or LE?
P(H|G)=?
P(H|G, R, r)=?
16
EM Algorithm for unrelated individuals
Pr(H|G,F)=?
Excoffier et al., 1995, Mol. Biol. Evol., 12(5): 921-927
Hawley et al., 1995, J Hered., 86:409-411 (software: HAPLO)
Pr(AB)=0.25, Pr(Ab)=0.25, Pr(aB)=0.25, Pr(ab)=0.25
OR
Pr(AB)=0.01, Pr(Ab)=0.49, Pr(aB)=0.49, Pr(ab)=0.01
AaBb: Pr(AB//ab)=? Pr(Ab//aB)=?
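The question on this slide can be answered numerically if HWE is assumed, so that the probability of a haplotype pair is proportional to the product of its two haplotype frequencies. A minimal sketch (the function name `phase_posterior` is hypothetical):

```python
def phase_posterior(freq):
    """Posterior of the two phases of a double heterozygote AaBb, assuming HWE."""
    w_cis = freq["AB"] * freq["ab"]     # weight of AB//ab
    w_trans = freq["Ab"] * freq["aB"]   # weight of Ab//aB
    total = w_cis + w_trans
    return {"AB//ab": w_cis / total, "Ab//aB": w_trans / total}

equal = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}
skewed = {"AB": 0.01, "Ab": 0.49, "aB": 0.49, "ab": 0.01}
print(phase_posterior(equal))
print(phase_posterior(skewed))
```

With equal frequencies both phases have probability 0.5; with the skewed set almost all of the probability goes to Ab//aB, which is why haplotype frequencies F matter for Pr(H|G,F).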
17
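Likelihood: L(G|F)
Haplotypes: $H = (H_1, H_2, \ldots, H_i, \ldots, H_h)$
Haplotype Frequencies: $F = (f_1, f_2, \ldots, f_i, \ldots, f_h)$, with constraint $\sum_{i=1}^{h} f_i = 1$
Genotypes: $G = (G_1, G_2, \ldots, G_k, \ldots, G_g)$
Joint Likelihood of G given F:
$$L(G|F) = \Pr(G|F) = \prod_{k=1}^{g} \Pr(G_k|F)$$
Probability of the k-th individual’s G given F & HWE:
$$\Pr(G_k|F) = \sum_{a=1}^{h} \sum_{b=1}^{h} c_{kab} f_a f_b$$
Haplotype-Genotype compatibility index of the k-th individual:
$$c_{kab} = \begin{cases} 1 & \text{if } (H_a//H_b) \Rightarrow G_k \\ 0 & \text{if } (H_a//H_b) \nRightarrow G_k \end{cases}$$
Therefore
$$L(G|F) = \prod_{k=1}^{g} \left( \sum_{a=1}^{h} \sum_{b=1}^{h} c_{kab} f_a f_b \right)$$
F=? => Max. L(G|F)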
18
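EM Algorithm: Maximum Likelihood Estimation of Haplotype Freq.
Lagrange multiplier: max{q(x)}, x=? subject to g(x)=c
$$Q(x,\lambda) = q(x) + \lambda\,(g(x) - c), \qquad \frac{\partial Q}{\partial x} = 0, \qquad \frac{\partial Q}{\partial \lambda} = 0$$
Applied to haplotype frequencies:
$$q(F) = \log(L(G|F)) = \sum_{k=1}^{g} \log\left( \sum_{a=1}^{h} \sum_{b=1}^{h} c_{kab} f_a f_b \right)$$
$$g(F) = \sum_{i=1}^{h} f_i - 1 = 0$$
$$Q(F,\lambda) = \sum_{k=1}^{g} \log\left( \sum_{a=1}^{h} \sum_{b=1}^{h} c_{kab} f_a f_b \right) + \lambda \left( \sum_{i=1}^{h} f_i - 1 \right)$$
Partial Derivative Equations:
$$\frac{\partial Q}{\partial f_i} = 0, \qquad \frac{\partial Q}{\partial \lambda} = 0 \quad\Rightarrow\quad f_i = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z_{iab}\, c_{kab} f_a f_b}{\sum_{a=1}^{h} \sum_{b=1}^{h} c_{kab} f_a f_b}$$
EM Recursion:
$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z_{iab}\, c_{kab} f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h} \sum_{b=1}^{h} c_{kab} f_a^{(t)} f_b^{(t)}}$$
z=1 if i in (a,b), or z=0 (a pair with a=b=i carries two copies of i and counts twice); c=1 if (a,b)=>G, or c=0
Prior, Expectation, Maximization, E … M, E, M …
$$F^{(0)} \to \Pr(H_{a,b}|G,F^{(0)}) \to F^{(1)} \to \Pr(H_{a,b}|G,F^{(1)}) \to \ldots \to F^{(t)} \to \Pr(H_{a,b}|G,F^{(t)}) \to F^{(t+1)} \to \ldots$$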
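The EM recursion can be sketched as a gene-counting loop. This is a simplified illustration under HWE for unrelated individuals, not the HAPLO implementation; the function names, the genotype encoding (e.g. "AaBb") and the uniform starting frequencies are assumptions. Unordered pairs are enumerated, so heterozygous pairs carry the weight 2*f_a*f_b that the double sum over (a,b) would give.

```python
from collections import defaultdict
from itertools import product

def phase_pairs(genotype):
    """Unordered haplotype pairs compatible with a genotype such as 'AaBb'."""
    loci = [genotype[i:i + 2] for i in range(0, len(genotype), 2)]
    pairs = set()
    for choice in product([0, 1], repeat=len(loci)):
        h1 = "".join(l[c] for l, c in zip(loci, choice))
        h2 = "".join(l[1 - c] for l, c in zip(loci, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_freqs(genotypes, n_iter=100, tol=1e-10):
    pairs_per_ind = [phase_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pairs_per_ind for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}   # F(0): uniform start (assumption)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pairs_per_ind:
            # E step: posterior weight of each compatible pair under current F
            weights = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M step: expected haplotype counts divided by 2g
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        change = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if change < tol:
            break
    return freq

sample = ["AABB"] * 4 + ["aabb"] * 4 + ["AaBb"] * 2
print(em_haplotype_freqs(sample))
```

In this toy sample the unambiguous homozygotes pull the estimate toward AB and ab, so the two AaBb individuals are resolved as AB//ab with high posterior probability. As noted later in the slides, EM can stop at a local maximum, so the starting F may matter on real data.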
19
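Posterior Probability of Haplotype
Prior Prob.:
$$\Pr(H_{a,b}) = \Pr(H_a) * \Pr(H_b) = f_a * f_b$$
Posterior Prob.:
$$\Pr(H_{a,b}|G_k,F) = \frac{\Pr(G_k|H_{a,b}) * \Pr(H_{a,b})}{\sum_{(a,b)} \Pr(G_k|H_{a,b}) * \Pr(H_{a,b})}$$
Example
G: DdEe
H: $H_1$ = DE, $H_2$ = De, $H_3$ = dE, $H_4$ = de
$\Pr(G|H_{1,4}) = \Pr(DdEe \mid DE//de) = 0.5$
$\Pr(G|H_{2,3}) = \Pr(DdEe \mid De//dE) = 0.5$
F: $f_1 = 0.4,\ f_2 = 0.1,\ f_3 = 0.1,\ f_4 = 0.4$
$$\Pr(H_{1,4}|G,F) = \frac{\Pr(G|H_{1,4}) * f_1 * f_4}{\Pr(G|H_{1,4}) * f_1 * f_4 + \Pr(G|H_{2,3}) * f_2 * f_3} = \frac{0.5*0.4*0.4}{0.5*0.4*0.4 + 0.5*0.1*0.1} = \frac{0.08}{0.08 + 0.005} = 0.9412$$
$$\Pr(H_{2,3}|G,F) = 0.0588$$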
20
Limitation of EM Algorithm
For a diploid (2n) organism, a genotype with L heterozygous markers may involve 2^L possible haplotypes, so EM is impractical for large L
Only suitable for a small number of loci, 2~12
When L=20, 2^L=1,048,576 … a large space of F
Subsetting approaches (partition-ligation, block partitioning, etc.) have been used to reduce the computational burden …
21
MCMC
Markov Chain Monte Carlo Algorithm for unrelated individuals
by sampling from Pr(H|G,F)
Stephens et al., 2001, Am. J. Hum. Genet., 68:978-989 (software: PHASE)
22
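Markov Chain
$$\begin{aligned}
H^{(0)} &= (H_{G_1}^{(0)}, H_{G_2}^{(0)}, \ldots, H_{G_{k-1}}^{(0)}, H_{G_k}^{(0)}, \ldots, H_{G_g}^{(0)}) \\
&\xrightarrow{\Pr(H_{G_1}|G_1, H_{-G_1})} (H_{G_1}^{(1)}, H_{G_2}^{(0)}, \ldots, H_{G_{k-1}}^{(0)}, H_{G_k}^{(0)}, \ldots, H_{G_g}^{(0)}) \\
&\xrightarrow{\Pr(H_{G_2}|G_2, H_{-G_2})} (H_{G_1}^{(1)}, H_{G_2}^{(1)}, \ldots, H_{G_{k-1}}^{(0)}, H_{G_k}^{(0)}, \ldots, H_{G_g}^{(0)}) \\
&\;\;\vdots \\
&\xrightarrow{\Pr(H_{G_g}|G_g, H_{-G_g})} (H_{G_1}^{(1)}, H_{G_2}^{(1)}, \ldots, H_{G_{k-1}}^{(1)}, H_{G_k}^{(1)}, \ldots, H_{G_g}^{(1)}) = H^{(1)} \\
&\xrightarrow{\Pr(H_{G_1}|G_1, H_{-G_1})} (H_{G_1}^{(2)}, H_{G_2}^{(1)}, \ldots, H_{G_{k-1}}^{(1)}, H_{G_k}^{(1)}, \ldots, H_{G_g}^{(1)}) \\
&\;\;\vdots \\
H^{(t)} &= (H_{G_1}^{(t)}, H_{G_2}^{(t)}, \ldots, H_{G_{k-1}}^{(t)}, H_{G_k}^{(t)}, \ldots, H_{G_g}^{(t)}) \\
&\;\;\vdots \\
H^{(t+N)} &= (H_{G_1}^{(t+N)}, H_{G_2}^{(t+N)}, \ldots, H_{G_k}^{(t+N)}, \ldots, H_{G_g}^{(t+N)})
\end{aligned}$$
MCMC Estimation
Random sampling based on $\Pr(H_{G_k}|G_k, H_{-G_k})$, where $H_{-G_k}$ is the current set of haplotypes of all individuals other than k
Repeat many times
After getting close to the stationary distribution P(H|G):
Collect samples
Average over samples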
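The sampling scheme above can be illustrated with a toy Gibbs sampler. This sketch is not PHASE: in place of the coalescent-based transition probability it uses a simple pseudo-count (Dirichlet-style) weight, which is a swapped-in assumption, and the names `gibbs_phase`, `alpha`, `burn_in` are hypothetical. It keeps the structure of the slide: update one individual at a time conditional on everyone else's haplotypes, discard early sweeps, then average.

```python
import random
from collections import Counter
from itertools import product

def phase_pairs(genotype):
    """Unordered haplotype pairs compatible with a genotype such as 'AaBb'."""
    loci = [genotype[i:i + 2] for i in range(0, len(genotype), 2)]
    pairs = set()
    for choice in product([0, 1], repeat=len(loci)):
        h1 = "".join(l[c] for l, c in zip(loci, choice))
        h2 = "".join(l[1 - c] for l, c in zip(loci, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def gibbs_phase(genotypes, n_sweeps=200, burn_in=50, alpha=0.1, seed=0):
    rng = random.Random(seed)
    options = [phase_pairs(g) for g in genotypes]
    state = [rng.choice(opts) for opts in options]    # H(0): random start
    counts = Counter(h for pair in state for h in pair)
    tallies = [Counter() for _ in genotypes]
    for sweep in range(n_sweeps):
        for k, opts in enumerate(options):
            # Remove individual k's current pair: condition on H_{-G_k}.
            for h in state[k]:
                counts[h] -= 1
            # Pseudo-count weights: an assumed stand-in for the PHASE prior.
            weights = [(counts[a] + alpha) * (counts[b] + alpha + (a == b))
                       for a, b in opts]
            state[k] = rng.choices(opts, weights=weights)[0]
            for h in state[k]:
                counts[h] += 1
            if sweep >= burn_in:                       # collect after burn-in
                tallies[k][state[k]] += 1
    kept = n_sweeps - burn_in
    # Average over samples: empirical posterior of each pair per individual.
    return [{pair: n / kept for pair, n in t.items()} for t in tallies]

sample = ["AABB"] * 10 + ["aabb"] * 10 + ["AaBb"]
posterior = gibbs_phase(sample)
print(posterior[-1])   # empirical Pr(H|G) for the AaBb individual
```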
23
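Transition Probability $\Pr(H_{G_k}|G_k, H_{-G_k})$
Coalescent hypothesis, mutation rate, M haplotypes
Given H of L loci for all G:
list $H = (H_1, H_2, \ldots, H_m)$, count $n = (n_1, n_2, \ldots, n_m)$
Pick $G_k$ and remove its current pair $H_{a,b}^{G_k,(t)}$ from H
For each candidate haplotype $H_i$:
if $G_k \nRightarrow H_i$, then $p_i = 0$
if $G_k \Rightarrow H_i$, then $G_k \Rightarrow H_i//H_{i'}$, and check $H_{i'}$:
if $H_{i'} \in H$ (say $H_{i'} = H_j$), then $p_i$ is computed from $n_i/M$ and $n_j/M$
if $H_{i'} \notin H$, then $p_i$ is computed from $n_i/M$
Finally get $p = (p_1, p_2, \ldots, p_{2^L})$
For haplotypes in the list H, construct the haplotype pair with probability proportional to p
For haplotypes not in the list H, choose the phase randomly
Add the newly constructed haplotype pair $H_{a,b}^{G_k,(t+1)}$ to list H, pick $G_{k+1}$ …
Subsetting loci reduces computing time
[Equation residue: the exact expressions for $p_i$ in terms of $n_i/M$, $n_j/M$ and the mutation rate are not recoverable from this transcript]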
24
EM vs. MCMC
EM:
Search F, Max. L(G|F)
Haplo. freq. => Haplo. construction
Maximum likelihood approach
“Analytical” posterior distribution
Fewer loci
Convergence: local maximum
MCMC:
Sample from Pr(H|G,F)
Haplo. construction => Haplo. freq.
Sampling approach
“Empirical” posterior distribution
More loci
Better convergence: whole parameter space (more computer time)
25
EM Algorithm for family data
(no recombination, r=0)
Pr(H{fam.}|G,R,F)=?
Rohde et al., 2001, Human Mutation, 17: 289-295 (software: HAPLO)
Becher et al., 2004, Genetic Epidemiology, 27:21-32 (software: FAMHAP)
O’Connell, 2000, Genetic Epidemiology, 19(Suppl 1):S64-S70 (software: ZAPLO)
26
Haplotype Configuration of Family
Genotypes: parents AaBb and AaBb; child AaBb
Possible Haplotype Configurations:
parents AB//ab and AB//ab; child AB//ab
parents Ab//aB and Ab//aB; child Ab//aB
parents AB//ab and AB//ab; child Ab//aB (recombinant; as r=0 or nearly 0, impossible or very low prob., ignored)
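The non-recombinant configurations of a trio can be enumerated directly. A minimal sketch, assuming r=0 so that the child carries one intact haplotype from each parent; the function names and the genotype encoding (e.g. "AaBb") are hypothetical.

```python
from itertools import product

def phase_pairs(genotype):
    """Unordered haplotype pairs compatible with a genotype such as 'AaBb'."""
    loci = [genotype[i:i + 2] for i in range(0, len(genotype), 2)]
    pairs = set()
    for choice in product([0, 1], repeat=len(loci)):
        h1 = "".join(l[c] for l, c in zip(loci, choice))
        h2 = "".join(l[1 - c] for l, c in zip(loci, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def trio_configurations(father, mother, child):
    """Haplotype configurations (father, mother, child) possible when r = 0."""
    configs = []
    for f_pair, m_pair in product(phase_pairs(father), phase_pairs(mother)):
        for c1, c2 in phase_pairs(child):
            # One child haplotype must come unchanged from each parent.
            if (c1 in f_pair and c2 in m_pair) or (c2 in f_pair and c1 in m_pair):
                configs.append((f_pair, m_pair, (c1, c2)))
    return configs

for config in trio_configurations("AaBb", "AaBb", "AaBb"):
    print(config)
```

For the family on this slide the function returns the two non-recombinant configurations; the recombinant one is excluded by construction.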
27
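EM Algorithm: Haplotype Freq. Estimation using Nuclear Families
Unrelated Indv.:
$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z_{iab}\, c_{kab} f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h} \sum_{b=1}^{h} c_{kab} f_a^{(t)} f_b^{(t)}}$$
Nuclear Families (parent 1 carries $H_{a_1}//H_{b_1}$, parent 2 carries $H_{a_2}//H_{b_2}$):
$$f_i^{(t+1)} = \frac{1}{4 N_{fam.}} \sum_{fam.=1}^{N_{fam.}} \frac{\sum_{a_1=1}^{h} \sum_{b_1=1}^{h} \sum_{a_2=1}^{h} \sum_{b_2=1}^{h} z_{i,a_1b_1a_2b_2}\, c_{fam.,a_1b_1a_2b_2}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}{\sum_{a_1=1}^{h} \sum_{b_1=1}^{h} \sum_{a_2=1}^{h} \sum_{b_2=1}^{h} c_{fam.,a_1b_1a_2b_2}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}$$
Tips:
Only use parents to calculate haplotype freq. (f)
Use parents’ + children’s info to determine compatibility (c)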
28
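EM Algorithm: Haplotype Freq. Estimation for General Pedigrees
For a family with n founders carrying $H_{a_1}//H_{b_1}, H_{a_2}//H_{b_2}, \ldots, H_{a_n}//H_{b_n}$:
$$f_i^{(t+1)} = \frac{1}{2n'} \sum_{fam.=1}^{N_{fam.}} \frac{\sum_{a_1,b_1,a_2,b_2,\ldots,a_n,b_n} z_{i,a_1b_1a_2b_2 \ldots a_nb_n}\, c_{fam.,a_1b_1a_2b_2 \ldots a_nb_n}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}{\sum_{a_1,b_1,a_2,b_2,\ldots,a_n,b_n} c_{fam.,a_1b_1a_2b_2 \ldots a_nb_n}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}$$
where $n'$ is the total number of founders over all $N_{fam.}$ families
Tips:
Only use founders to calculate haplotype freq. (f)
Use all members (founders & non-founders) to determine compatibility (c)
Discard the cases with too small probabilities to save time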
29
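Posterior Probability of Haplotype Configuration
Nuclear Family:
$$\Pr(H_{a,b}^{fam.}|G_{fam.},F) = \frac{\Pr(G_{fam.}|H_{a,b}^{fam.}) * \Pr(H_{parents})}{\sum_{all\ configs.} \Pr(G_{fam.}|H_{a,b}^{fam.}) * \Pr(H_{parents})}$$
$$\Pr(H_{parents}) = \Pr(H_{a_1}) * \Pr(H_{b_1}) * \Pr(H_{a_2}) * \Pr(H_{b_2}) = f_{a_1} * f_{b_1} * f_{a_2} * f_{b_2}$$
(Dad: $a_1, b_1$; Mom: $a_2, b_2$)
General Family:
$$\Pr(H_{a,b}^{fam.}|G_{fam.},F) = \frac{\Pr(G_{fam.}|H_{a,b}^{fam.}) * \Pr(H_{founders})}{\sum_{all\ configs.} \Pr(G_{fam.}|H_{a,b}^{fam.}) * \Pr(H_{founders})}$$
$$\Pr(H_{founders}) = \prod_{j=1}^{N_{founders}} \Pr(H_{a_j}) * \Pr(H_{b_j}) = \prod_{j=1}^{N_{founders}} f_{a_j} * f_{b_j}$$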
30
A Middle Summary … Subject-oriented Algorithms
Joint Prob. / Likelihood over loci A, B, C is computed:
indiv. by indiv. (unrelated)
family by family (r=0)
Large/General Pedigree & Allowing Recombination (r>0) ?
31
Next … Locus-oriented Algorithm (Lander-Green)
For Large/General Pedigree Data & Allowing Recombination (r>0)
A Pedigree, loci A, B, C, …
Joint Prob. / Likelihood is computed Locus by Locus
32
Inheritance Vector (V) of a pedigree
Lander & Green, 1987, Proc. Natl. Acad. Sci., 84: 2363-2367
Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER)
Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN)
Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2)
33
Inheritance Vector & Haplotype
5: AaBb
1101 AB//ab 1101
1101 Ab//aB 1111
34
Lander-Green Algorithm
One pedigree; Loci A, B, C, …
Hidden status (inheritance vectors): VA, VB, VC, …
Transition Prob. = f(r): Pr(VB|VA), Pr(VC|VB), …, Pr(Vt+1|Vt)
Observations (genotypes): GA, GB, GC, …
Emission Prob.: Pr(GA|VA), Pr(GB|VB), Pr(GC|VC), …
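The transition probability between inheritance vectors at adjacent loci has a simple form: each bit (one meiosis) flips with probability r and stays with probability 1-r. A minimal sketch, assuming the same r for every meiosis and inheritance vectors written as bit strings; the function name is hypothetical.

```python
from itertools import product

def transition_prob(v_from, v_to, r):
    """Pr(V_{t+1} | V_t) = r^d * (1-r)^(n-d), where d is the number of bits that differ."""
    d = sum(x != y for x, y in zip(v_from, v_to))
    return (r ** d) * ((1 - r) ** (len(v_from) - d))

print(transition_prob("1101", "1101", 0.1))   # no recombination in any meiosis
print(transition_prob("1101", "1111", 0.1))   # one recombinant meiosis

# The probabilities over all possible next vectors sum to 1.
all_vectors = ["".join(bits) for bits in product("01", repeat=4)]
print(sum(transition_prob("1101", v, 0.1) for v in all_vectors))
```

With r=0 the only reachable vector is the current one, which is the no-recombination case used in the family-based EM slides.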
35
Lander-Green Algorithm Based (or Similar) Approaches
Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER): Viterbi algorithm, the best haplotype configuration
Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2): MCMC, Annealing & Metropolis Process
Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN): Allowing LD & Marker Cluster/Block
Haplotyping
based on sequencing data
(can be done for individual subject with no population data)
36
Rationale
37
Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.
Data Structure
38
Algorithms
39
ML, or MCMC when the H space is huge
Prob(sequence | haplotype)
40
Indicator = 1 if observed sequence X matches the assumed haplotype, = 0 otherwise (for the j-th variant site of the i-th fragment)
Sequencing/mapping error
observed sequence
haplotype
Markov Chain
41
Sampling H from .
42
Practices
(1) If a child’s genotype at 4 loci is AaBbCcDD, list all possible haplotype pairs of the child and calculate the probability of each pair, given no extra information.
(2) If you know that his/her father’s genotype is also AaBbCcDD and the mother’s is AaBbCCDD, list all possible haplotype configurations of the family and calculate the probability of each configuration. (Assume recombination rate r=0.)
(3) If you know the population haplotype frequencies below, calculate the posterior probabilities in (1):
ABCD (0.2), ABcD (0.1), AbcD (0.1), aBCD (0.1), aBcD (0.2), abcD (0.3)
Within a week, send your answers to (E-mail: qunyuan@wustl.edu)