Lecture 10: Linkage Analysis III
Date: 9/26/02
Topics: Revisit segregation ratio distortion. Haplotype coding. Three-point analysis. Multipoint analysis.
Additive Segregation Ratio Distortion
Systematic genotype classification error occurs.
Power and estimates of recombination fraction are unaffected by additive distortion in the backcross configuration.
Estimates of recombination fraction are not affected for F2, but the false positive rate increases.
Additive Segregation - Backcross
Suppose the frequency of genotype Aa is increased because a fraction u of aa genotypes are misclassified.
Similarly, assume the frequency of genotype Bb is independently increased by fraction v.
We need to recalculate the expected frequencies under the new model with additional parameters u and v.
Additive Segregation – Backcross (contd)
Genotype     Expected frequency    Expected frequency with distortion
AaBb         0.5(1-θ)              0.5(1-θ) + u/2 + v/2
Aabb         0.5θ                  0.5θ + u/2 - v/2
aaBb         0.5θ                  0.5θ - u/2 + v/2
aabb         0.5(1-θ)              0.5(1-θ) - u/2 - v/2
Total: Aa    0.5                   0.5 + u
Total: aa    0.5                   0.5 - u
Total: Bb    0.5                   0.5 + v
Total: bb    0.5                   0.5 - v
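The table can be checked numerically. The following is a minimal sketch, assuming θ denotes the recombination fraction and using illustrative parameter values that are not from the lecture; the function name is hypothetical.

```python
def additive_backcross(theta, u, v):
    """Expected backcross genotype frequencies under additive distortion."""
    return {
        "AaBb": 0.5 * (1 - theta) + u / 2 + v / 2,
        "Aabb": 0.5 * theta + u / 2 - v / 2,
        "aaBb": 0.5 * theta - u / 2 + v / 2,
        "aabb": 0.5 * (1 - theta) - u / 2 - v / 2,
    }

# Illustrative values only (assumed, not taken from the slides).
freq = additive_backcross(theta=0.2, u=0.05, v=0.02)

total = sum(freq.values())                # frequencies sum to 1
Aa_total = freq["AaBb"] + freq["Aabb"]    # marginal for Aa: 0.5 + u
Bb_total = freq["AaBb"] + freq["aaBb"]    # marginal for Bb: 0.5 + v
recombinants = freq["Aabb"] + freq["aaBb"]  # u and v cancel, leaving theta
```

The recombinant classes sum to θ regardless of u and v, which is why the backcross estimate of the recombination fraction is unaffected by additive distortion.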
Additive Segregation – Backcross (contd)
The number of unknown parameters equals the number of degrees of freedom.
Use Bailey’s method to find the MLEs of the parameters (θ, u, v).
Bailey’s Method
Set the expected frequencies equal to the observed proportions and solve the system of equations for the unknown parameters. These are the MLEs.
Example: Suppose you observe 5 successes from a Binomial(10, p) distribution. Then
$\hat{p}_{\mathrm{MLE}} = 5/10 = 0.5$
Additive Segregation – Backcross (contd)
What do you notice about the MLE for the recombination fraction?
Is the MLE for the recombination fraction biased?
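Let $f_{11}$, $f_{12}$, $f_{21}$, $f_{22}$ be the observed counts of AaBb, Aabb, aaBb, and aabb, with $N = f_{11} + f_{12} + f_{21} + f_{22}$. Since the Aa and aa totals are 0.5 + u and 0.5 - u, their difference has expectation 2u (and likewise 2v for Bb and bb):
$\hat{v} = \frac{f_{11} - f_{12} + f_{21} - f_{22}}{2N}$
$\hat{u} = \frac{f_{11} + f_{12} - f_{21} - f_{22}}{2N}$
$\hat{\theta} = \frac{f_{12} + f_{21}}{N}$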
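Bailey's method for the additive backcross model can be sketched as follows. This assumes the expected frequencies 0.5(1-θ) + u/2 + v/2, 0.5θ + u/2 - v/2, 0.5θ - u/2 + v/2, and 0.5(1-θ) - u/2 - v/2 for AaBb, Aabb, aaBb, and aabb; the function name and the counts are hypothetical.

```python
def bailey_additive_backcross(f11, f12, f21, f22):
    """Solve observed proportion = expected frequency for (theta, u, v).

    f11, f12, f21, f22 are counts of AaBb, Aabb, aaBb, aabb.
    """
    n = f11 + f12 + f21 + f22
    theta_hat = (f12 + f21) / n                  # recombinant classes sum to theta
    u_hat = (f11 + f12 - f21 - f22) / (2 * n)    # (Aa total - aa total) = 2u
    v_hat = (f11 - f12 + f21 - f22) / (2 * n)    # (Bb total - bb total) = 2v
    return theta_hat, u_hat, v_hat

# Hypothetical counts for illustration.
theta_hat, u_hat, v_hat = bailey_additive_backcross(47, 13, 10, 30)
```

Because there are three free parameters and three degrees of freedom, the fitted frequencies reproduce the observed proportions exactly.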
Additive Segregation – F2-CC
Genotype    Expected frequency    Additive distortion
AABB        0.25(1-θ)^2           u/3 + v/3
AABb        0.5θ(1-θ)             u/3 - v/3
AAbb        0.25θ^2               u/3
AaBB        0.5θ(1-θ)             -u/3 + v/3
AaBb        0.5(1-2θ+2θ^2)        -u/3 - v/3
Aabb        0.5θ(1-θ)             -u/3
aaBB        0.25θ^2               v/3
aaBb        0.5θ(1-θ)             -v/3
aabb        0.25(1-θ)^2           0
Penetrance Distortion - Backcross
Selection, incomplete penetrance, and linkage to selected markers can all result in penetrance distortion, so it is quite common.
Suppose 100u% of aa genotypes are misclassified as Aa. Similarly, assume that, independently, 100v% of bb genotypes are misclassified as Bb.
Penetrance Distortion - Backcross
Genotype    Expected frequency
AaBb        P(AaBb) + P(scored as Aa | aaBb) P(aaBb) + P(scored as Bb | Aabb) P(Aabb) + P(scored as AaBb | aabb) P(aabb)
            = 0.5(1-θ) + 0.5θu + 0.5θv + 0.5uv(1-θ)
            = 0.5[θ(u+v) + (1-θ)(1+uv)]
Aabb
aaBb
aabb
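The remaining rows follow from the same argument as the AaBb row. The sketch below enumerates how each true genotype can be scored, assuming aa is scored as Aa with probability u, bb is scored as Bb with probability v, the two events are independent, and Aa and Bb are never misclassified. The function name and parameter values are illustrative.

```python
def penetrance_backcross(theta, u, v):
    """Expected frequencies of SCORED backcross genotypes under penetrance distortion."""
    true_freq = {
        ("Aa", "Bb"): 0.5 * (1 - theta),
        ("Aa", "bb"): 0.5 * theta,
        ("aa", "Bb"): 0.5 * theta,
        ("aa", "bb"): 0.5 * (1 - theta),
    }
    scored = {key: 0.0 for key in true_freq}
    for (a, b), p in true_freq.items():
        # Possible scores at each locus with their probabilities.
        a_scores = [("Aa", 1.0)] if a == "Aa" else [("Aa", u), ("aa", 1 - u)]
        b_scores = [("Bb", 1.0)] if b == "Bb" else [("Bb", v), ("bb", 1 - v)]
        for sa, pa in a_scores:
            for sb, pb in b_scores:
                scored[(sa, sb)] += p * pa * pb
    return scored

scored = penetrance_backcross(theta=0.2, u=0.1, v=0.05)
```

Under these assumptions the aabb class has frequency 0.5(1-θ)(1-u)(1-v), since an aabb individual is scored correctly only when neither locus is misclassified.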
Penetrance Distortion - Backcross
Is the estimate for recombination fraction biased?
The power to detect linkage is decreased.
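With $f_{11}$, $f_{12}$, $f_{21}$, $f_{22}$ the observed counts of AaBb, Aabb, aaBb, and aabb, and $N$ their total:
$\hat{v} = \frac{f_{11} - f_{12} + f_{21} - f_{22}}{N}$
$\hat{u} = \frac{f_{11} + f_{12} - f_{21} - f_{22}}{N}$
$\hat{\theta} = 1 - \frac{2 f_{22}}{N (1 - \hat{u})(1 - \hat{v})}$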
Cost of Assuming Non-Distortion Model
The estimate for recombination fraction is biased. By how much?
$\mathrm{Bias} = E(\hat{\theta}) - \theta$
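One way to put a number on the bias is sketched below. It rests on two assumptions that are not spelled out on the slide: the data follow the penetrance distortion model (aa scored as Aa with probability u, bb scored as Bb with probability v, independently), and the analyst ignores distortion and estimates θ by the proportion of scored recombinants (Aabb plus aaBb). The function name is hypothetical.

```python
def naive_bias(theta, u, v):
    """E(theta_hat) - theta when theta_hat is the scored recombinant proportion."""
    # Scored Aabb: true Aabb with b scored correctly, or true aabb with a misscored.
    p_Aabb = 0.5 * (1 - v) * (theta + u * (1 - theta))
    # Scored aaBb: the mirror-image case.
    p_aaBb = 0.5 * (1 - u) * (theta + v * (1 - theta))
    return p_Aabb + p_aaBb - theta

# Bias for equal distortion u = v at several recombination fractions.
bias_table = {
    theta: [naive_bias(theta, d, d) for d in (0.0, 0.1, 0.2, 0.3, 0.4)]
    for theta in (0.1, 0.2, 0.3)
}
```

With u = v the expression reduces to u(1-2θ) - u^2(1-θ), so under these assumptions the bias vanishes when there is no distortion and grows with u for tightly linked loci.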
Overall Impact of Segregation Distortion
[Figure: bias of the recombination fraction estimate (vertical axis) plotted against distortion with u = v (horizontal axis, -0.4 to 0.4), with one curve each for recombination fractions 0.1, 0.2, and 0.3.]
First Project
This slide marks the end of the material that will be needed to complete the first project.
Linkage Analysis for Multiple Loci
The haplotype is the sequence of alleles along one of the chromosomes in an individual.
In multipoint linkage analysis we are not concerned with which allele is present at each locus, but rather with its parental origin.
Recoding Haplotypes
Suppose there are k loci. Recode each haplotype as a string of k-1 0's and 1's.
If the ith position is 0, it indicates the (i+1)th locus is not recombinant with respect to the ith locus.
If the ith position is 1, it indicates the (i+1)th locus is recombinant with respect to the ith locus.
Recoding Haplotypes (contd)
Haplotypes for loci A, B, C (x marks a recombination between adjacent loci):

Code    Recombinant on interval:    Picture
        AB     BC     AC
00      no     no     no            A-B-C
01      no     yes    yes           A-BxC
10      yes    no     yes           AxB-C
11      yes    yes    no            AxBxC
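The recoding rule can be sketched in a few lines. The representation below is an assumption made for illustration: each haplotype is given as a string of parental-origin labels, one per locus (for example "P" and "M"), and the function name is hypothetical.

```python
def recode(origins):
    """Recode a haplotype of k parental-origin labels as a string of k-1 bits.

    Position i is '1' when locus i+1 has a different parental origin than
    locus i (a recombination in that interval), and '0' otherwise.
    """
    return "".join("1" if a != b else "0" for a, b in zip(origins, origins[1:]))

# All four three-locus classes, using assumed labels P and M.
codes = {h: recode(h) for h in ("PPP", "PPM", "PMM", "PMP")}
```

A code such as 11 marks a double recombinant: A and C share a parental origin even though both adjacent intervals are recombinant.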
Recoding Haplotypes (contd)
Haplotype    Code
ABxCxD       ____
____         101
____         000110
Recoded Haplotypes and Recombination Fractions
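Let $p_{ij}$ denote the probability of the haplotype class with code ij. Then
$\theta_{AB} = p_{10} + p_{11}$
$\theta_{BC} = p_{01} + p_{11}$
$\theta_{AC} = p_{01} + p_{10}$
$1 = p_{00} + p_{01} + p_{10} + p_{11}$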
Sample Problem
Calculate the probabilities of the four haplotype classes (i.e. 00, 10, 01, 11) when θ_AB = 0.1 and θ_BC = 0.2 and θ_AC is unknown. Assume the Sturt map function with L = 1.
Plan of Attack
1. Transform recombination fractions to genetic map units using the inverse map function.
2. Sum the genetic map units to obtain length of AC interval.
3. Calculate the recombination fraction between AC using the map function.
4. Solve the set of simultaneous equations for the haplotype frequencies.
Step 1
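The Sturt map function is
$\theta = \frac{1}{2}\left[1 - \left(1 - \frac{m}{L}\right) e^{-m(2L-1)/L}\right]$
Inverting it numerically with L = 1 gives
$m_{AB} = 0.108$
$m_{BC} = 0.238$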
Step 2
$m_{AC} = m_{AB} + m_{BC} = 0.108 + 0.238 = 0.346$
Step 3
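$\theta_{AC} = \frac{1}{2}\left[1 - \left(1 - \frac{m_{AC}}{L}\right) e^{-m_{AC}(2L-1)/L}\right] = \frac{1}{2}\left[1 - (1 - 0.346)\, e^{-0.346}\right] = 0.269$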
Step 4
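With $p_{ij}$ the probability of the haplotype class with code ij, solve
$p_{10} + p_{11} = 0.1$
$p_{01} + p_{11} = 0.2$
$p_{01} + p_{10} = 0.269$
$p_{00} + p_{01} + p_{10} + p_{11} = 1$
to obtain
$p_{00} = 0.7155$
$p_{10} = 0.0845$
$p_{01} = 0.1845$
$p_{11} = 0.0155$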
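The four steps can be reproduced with a short script. This is a sketch, assuming the Sturt map function θ = 0.5[1 - (1 - m/L) exp(-m(2L-1)/L)] and inverting it by bisection; the function names are hypothetical. Results differ slightly in the last digit from the rounded slide values because no intermediate rounding is applied.

```python
import math

def sturt(m, L=1.0):
    """Sturt map function: map distance m (Morgans) to recombination fraction."""
    return 0.5 * (1 - (1 - m / L) * math.exp(-m * (2 * L - 1) / L))

def sturt_inverse(theta, L=1.0, tol=1e-12):
    """Invert the Sturt map function by bisection on 0 <= m <= L."""
    lo, hi = 0.0, L
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sturt(mid, L) < theta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

theta_ab, theta_bc = 0.1, 0.2
m_ab = sturt_inverse(theta_ab)           # Step 1
m_bc = sturt_inverse(theta_bc)
m_ac = m_ab + m_bc                       # Step 2
theta_ac = sturt(m_ac)                   # Step 3
# Step 4: solve the four linear equations in closed form.
p11 = (theta_ab + theta_bc - theta_ac) / 2
p10 = theta_ab - p11
p01 = theta_bc - p11
p00 = 1 - p01 - p10 - p11
```

Bisection is used here only because the Sturt function has no closed-form inverse; any root finder would do, since the function is increasing on 0 <= m <= L.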
Phase Known Three Point Analysis
When all gametes in the sample are fully informative, the likelihood is simple.
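With $f_i$ the observed count and $p_i$ the probability of haplotype class i, the log-likelihood is
$l = \sum_{i=1}^{4} f_i \log p_i$
The likelihood can be written in terms of the coefficient of coincidence c:
$c = \frac{\theta_{AB} + \theta_{BC} - \theta_{AC}}{2\,\theta_{AB}\,\theta_{BC}}$
$l(\theta_{AB}, \theta_{BC}, \theta_{AC}) = l(\theta_{AB}, \theta_{BC}, c)$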
How would you test for interference?
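One possible answer is a likelihood ratio test of c = 1 (no interference) against an unrestricted c. The sketch below is offered under that assumption and is not necessarily the approach intended in the lecture. It relies on the fact that without interference the four class probabilities are products of the marginal recombination probabilities, whose MLEs are the marginal proportions. The function name and counts are hypothetical.

```python
import math

def interference_lrt(f00, f01, f10, f11):
    """Return (LRT statistic, estimated coincidence) for phase-known gametes.

    Counts are indexed by haplotype code: first bit = AB interval,
    second bit = BC interval. Requires at least one recombinant in each interval.
    """
    counts = {"00": f00, "01": f01, "10": f10, "11": f11}
    n = sum(counts.values())

    def loglik(p):
        return sum(f * math.log(p[h]) for h, f in counts.items() if f > 0)

    p_full = {h: f / n for h, f in counts.items()}   # unrestricted MLEs
    t_ab = (f10 + f11) / n                           # MLE of theta_AB
    t_bc = (f01 + f11) / n                           # MLE of theta_BC
    p_null = {                                       # c = 1: intervals independent
        "00": (1 - t_ab) * (1 - t_bc),
        "01": (1 - t_ab) * t_bc,
        "10": t_ab * (1 - t_bc),
        "11": t_ab * t_bc,
    }
    c_hat = p_full["11"] / (t_ab * t_bc)
    return 2 * (loglik(p_full) - loglik(p_null)), c_hat

stat, c_hat = interference_lrt(715, 185, 85, 15)     # hypothetical counts
```

The statistic would be compared with a chi-square distribution with one degree of freedom, since the unrestricted model has one more free parameter than the no-interference model.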
Multipoint Analysis – A Difficulty
Suppose there are k loci.
How many haplotypes are possible?
How many recombination fractions are there?
Recombination Value
Definition: The recombination value of a set of intervals is the probability of an odd number of crossovers occurring in the intervals.
How many sets of intervals are there?
Sample Problem – Four Point Analysis
Suppose loci A, B, C, and D are in syntenic order and θ_AB = 0.1, θ_BC = 0.2, and θ_CD = 0.3.
What are the probabilities of the haplotype classes given the Kosambi map function?
$\theta = \frac{1}{2} \cdot \frac{e^{4m} - 1}{e^{4m} + 1}$
The Linear Equations
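Let $p_{ijk}$ denote the probability of the haplotype class with code ijk (intervals AB, BC, CD). Each recombination value is the sum of the classes with an odd number of recombinations in its set of intervals:
$1 = p_{000} + p_{001} + p_{010} + p_{100} + p_{110} + p_{011} + p_{101} + p_{111}$
$\theta_{CD} = p_{001} + p_{011} + p_{101} + p_{111}$
$\theta_{BC} = p_{010} + p_{011} + p_{110} + p_{111}$
$\theta_{AD} = p_{001} + p_{010} + p_{100} + p_{111}$
$\theta_{AB} = p_{100} + p_{101} + p_{110} + p_{111}$
$\theta_{AB,CD} = p_{001} + p_{011} + p_{100} + p_{110}$
$\theta_{AC} = p_{010} + p_{011} + p_{100} + p_{101}$
$\theta_{BD} = p_{001} + p_{010} + p_{101} + p_{110}$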
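The system can be solved without matrix software. The sketch below makes an assumption that should be checked against the course text: the recombination value of any set of intervals, including the non-adjacent set {AB, CD}, is obtained by applying the map function to the total map length of the set. Writing psi_S = 1 - 2 r_S for each set S, the class probabilities follow from a sign-weighted average of the psi values. All function names are hypothetical.

```python
import math
from itertools import product

def kosambi(m):
    """Kosambi map function, equal to 0.5 * (e^{4m} - 1) / (e^{4m} + 1)."""
    return 0.5 * math.tanh(2 * m)

def kosambi_inverse(theta):
    return 0.5 * math.atanh(2 * theta)

def haplotype_probs(thetas, map_fn, inv_fn):
    """Class probabilities from adjacent-interval recombination fractions."""
    lengths = [inv_fn(t) for t in thetas]
    classes = list(product((0, 1), repeat=len(thetas)))

    def psi(subset):
        # 1 - 2 * (recombination value of the set), from its total map length.
        m = sum(length for length, s in zip(lengths, subset) if s)
        return 1 - 2 * map_fn(m)

    probs = {}
    for h in classes:
        total = 0.0
        for s in classes:
            odd = sum(a * b for a, b in zip(h, s)) % 2
            total += -psi(s) if odd else psi(s)
        probs["".join(map(str, h))] = total / len(classes)
    return probs

probs = haplotype_probs([0.1, 0.2, 0.3], kosambi, kosambi_inverse)
theta_ab_check = sum(p for h, p in probs.items() if h[0] == "1")
```

Under this assumption the triple-recombinant class 111 comes out slightly negative for these inputs, which illustrates how a map function can fail to yield valid gametic frequencies for several loci at once.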
Multipoint Likelihood
Can be written in terms of the $2^{k-1} - 1$ recombination values or haplotype frequencies.
Can be reparameterized as $k - 1$ recombination fractions and $2^{k-1} - k$ interference parameters. Then tests for interference are possible.
An alternative is to assume a map function, possibly with unknown parameters, which constrains the gamete probabilities to be functions of the $k - 1$ recombination fractions.
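The parameter counts grow quickly with the number of loci. A small sketch, with a hypothetical function name, tabulates them for a given k:

```python
def multipoint_counts(k):
    """Parameter counts for a multipoint analysis of k loci (k >= 2)."""
    classes = 2 ** (k - 1)                 # recoded haplotype classes
    return {
        "haplotype_classes": classes,
        "recombination_values": classes - 1,       # free parameters in total
        "adjacent_recombination_fractions": k - 1,
        "interference_parameters": classes - k,
    }

counts_by_k = {k: multipoint_counts(k) for k in (3, 4, 10)}
```

For three loci there is a single interference parameter, matching the coefficient of coincidence; for ten loci there are 502, which motivates constraining the model with a map function.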
Multilocus-Infeasible Map Functions
Kosambi, Carter-Falconer, and Felsenstein map functions are multilocus-infeasible because they can produce negative gametic frequencies.
The Morgan, Haldane, Sturt and generalized map functions are multilocus-feasible.
Haldane is most often used for its simplicity except when linkage is tight, e.g. m << 0.5.
Map Building
How many possible orders are there for k loci?
10 loci can be ordered in over 1 million ways.
The solution is to generate a small number of probable orders and then analyze these few in depth.
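The count behind the "over 1 million" figure can be checked directly. This sketch assumes that an order and its reverse describe the same map, so k loci have k!/2 distinct orders; the function name is hypothetical.

```python
import math

def n_orders(k):
    """Number of distinct locus orders for k >= 2 loci, treating an order
    and its reverse as the same map."""
    return math.factorial(k) // 2

orders = {k: n_orders(k) for k in (3, 5, 10)}
```

For k = 10 this gives 1,814,400 orders, each of which would need its own multipoint likelihood in an exhaustive search.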
Stepwise Approximate Ordering
Use likelihood analysis to order a few markers, say l.
Add each additional marker one at a time by considering all l+1 positions for it (the l-1 intervals between ordered markers plus the two ends). Choose the location that results in the highest likelihood.
Number of likelihood evaluations: 3+4+5...+k = (k-2)(k+3)/2.
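The evaluation count can be verified against its closed form. The sketch assumes the procedure starts from three markers (three possible orders) and that placing a marker into an order of j-1 markers requires j likelihood evaluations; function names are hypothetical.

```python
def stepwise_evaluations(k):
    """Likelihood evaluations for stepwise ordering of k >= 3 loci."""
    return sum(range(3, k + 1))          # 3 + 4 + ... + k

def closed_form(k):
    return (k - 2) * (k + 3) // 2

evaluations_10 = stepwise_evaluations(10)
```

For ten loci the stepwise approach needs 52 evaluations, compared with more than a million for an exhaustive search over all orders.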
Pairwise Approximate Ordering
Two point linkage analysis on all pairs of loci to obtain a recombination fraction estimate.
Multidimensional scaling analyses (multivariate exploratory analysis) to find approximate orders.
Final Step – Perfecting Order
Test the likelihood of various reorderings of neighboring groups of loci.
If a tested order has a higher likelihood, keep it.
etc...
Disease Mapping
Condition on an ordering of all markers except disease locus.
Calculate a multilocus likelihood for each possible position of the disease locus, call this lx.
Calculate the location score 2(lx - l) at point x, where l is the log-likelihood with disease locus unlinked to other markers.
Disease Mapping
Can also calculate multipoint LOD scores by dividing location scores by 2ln(10).
Plot the location score or multipoint LOD score against position x. The peak is the likely position of the disease locus, and if the peak exceeds some cutoff criterion, linkage to that region is significant.
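The conversion from log-likelihoods to location scores and multipoint LOD scores can be sketched as follows. The log-likelihood profile is made-up illustrative data, natural logarithms are assumed, and the function names are hypothetical.

```python
import math

def location_score(lx, l_unlinked):
    """Location score 2(lx - l) from natural-log likelihoods."""
    return 2.0 * (lx - l_unlinked)

def multipoint_lod(lx, l_unlinked):
    """Multipoint LOD score: location score divided by 2 ln(10)."""
    return location_score(lx, l_unlinked) / (2.0 * math.log(10))

# Hypothetical profile: map position x -> log-likelihood with the disease locus at x.
profile = {0.00: -104.2, 0.05: -99.1, 0.10: -97.3, 0.15: -98.8}
l_unlinked = -105.0                      # hypothetical unlinked log-likelihood

lods = {x: multipoint_lod(lx, l_unlinked) for x, lx in profile.items()}
peak = max(lods, key=lods.get)           # position with the highest LOD
```

Dividing by 2 ln(10) simply converts twice the natural-log likelihood difference into a base-10 log likelihood ratio.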
Multipoint vs. Single Point Disease Mapping
Multipoint analysis uses information from every sampled individual, even those who are homozygous at a single marker.
A single marker can only provide information about crossovers on one side of the disease gene.
The more markers, the sharper the peak.
The disease gene is ultimately mapped to the smallest interval in which no crossover between marker and disease gene is observed in the entire sample.
Sample Size
Assuming no interference, the distance between crossovers is exponentially distributed, with a mean of 1 crossover per Morgan.
Sample n individuals and the mean rate is n crossovers per Morgan.
Therefore, the expected distance to the nearest crossover on either side of the disease locus is 1/n Morgans.
The interval containing the disease gene has a gamma-distributed length with mean 2/n.
Example: You want to localize the disease gene to 1 cM = 1/100 M. Setting 2/n equal to 1/100 gives n = 200, so you need about 200 individuals or more.
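The sample size calculation generalizes to any target resolution. The sketch below assumes the mean interval length of 2/n Morgans stated above and treats the target as a bound on that mean, not as a guarantee for any single study; function names are hypothetical.

```python
def expected_interval_morgans(n):
    """Mean length (Morgans) of the crossover-free interval around the
    disease locus in a sample of n individuals."""
    return 2.0 / n

def required_sample_size(target_cm):
    """Sample size at which the mean interval length equals target_cm centimorgans."""
    return 200.0 / target_cm             # 2 / (target_cm / 100)

n_for_1cm = required_sample_size(1.0)
n_for_half_cm = required_sample_size(0.5)
```

Halving the target interval doubles the required sample size, since the mean interval length is inversely proportional to n.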
Summary
Modeling of segregation distortion and the impact on linkage analysis.
Haplotype coding.
The use of map functions.
Overview of likelihood formulation for multipoint analysis.