Computational Problems in Genetic Linkage Analysis
Dan Geiger, CS, Technion
This talk is based mainly on work with Ma'ayan Fishelson. See bioinfo.cs.technion.ac.il/superlink/ for more details. Some slides are due to Gideon Greenspan.
Course homepage: http://www.cs.technion.ac.il/~fmaayan/cs236524/
Requirements
•One homework assignment (10%).
•Mid term progress report.
•Submission in pairs.
•Well documented, tested, useful program.
•Self initiative (by reading and thinking) will be rewarded.
•One or two excelling projects may be selected for continuation next semester under a special projects course, if desired.
Outline
Part I: Quick look at relevant genetics
Part II: Case study: Werner’s syndrome
Part III: Relevant mathematics
Part IV: Software description / algorithms
Gene hunting: find genes responsible for a given disease.
Main idea: If a disease is statistically linked with a marker on a chromosome, then tentatively infer that a gene causing the disease is located near that marker.
Human Genome
Most human cells contain 46 chromosomes:
• 2 sex chromosomes (X, Y): XY in males, XX in females.
• 22 pairs of chromosomes named autosomes.
Chromosome Logical Structure
Marker – Genes, Single Nucleotide Polymorphisms, Tandem repeats, etc.
Locus – the location of markers on the chromosome.
Allele – one variant form (or state) of a gene/marker at a particular locus.
Locus 1. Possible alleles: A1, A2
Locus 2. Possible alleles: B1, B2, B3
Alleles
b - dominant allele. Namely, (b,b) and (b,w) are Black.
w - recessive allele. Namely, only (w,w) is White.
This is an example of an X-linked trait. For males, b alone is Black and w alone is White.
genotype
phenotype
Genotypes versus Phenotypes
At each locus (except for sex chromosomes) there are 2 genes. These constitute the individual’s genotype at the locus.
The expression of a genotype is termed a phenotype. For example, hair color, weight, or the presence or absence of a disease.
Recombination Phenomenon
A recombination between 2 genes occurred if the haplotype of the individual contains 2 alleles that resided in different haplotypes in the individual's parent.
(Haplotype – the alleles at different loci that are received by an individual from one parent).
Male or female
Gametes (sex cells): egg or sperm
An example - the ABO locus
The ABO locus determines detectable antigens on the surface of red blood cells.
The 3 major alleles (A,B,O) interact to determine the various ABO blood types.
O is recessive to A and B. Alleles A and B are codominant.
Phenotype Genotype
A A/A, A/O
B B/B, B/O
AB A/B
O O/O
Note: genotypes are unordered.
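The phenotype table above can be written as a small lookup. This is a minimal sketch in Python; the function name `abo_phenotype` is an illustrative choice, not part of the lecture material.

```python
# Dominance rules from the slide: O is recessive to A and B; A and B are codominant.
def abo_phenotype(genotype):
    """Map an unordered ABO genotype such as ("A", "O") to its blood type."""
    alleles = set(genotype)  # a set ignores order, matching "genotypes are unordered"
    if alleles == {"O"}:
        return "O"
    if alleles == {"A", "B"}:
        return "AB"
    return "A" if "A" in alleles else "B"

for g in [("A", "A"), ("A", "O"), ("O", "A"), ("B", "B"), ("B", "O"), ("A", "B"), ("O", "O")]:
    print(g, "->", abo_phenotype(g))
```

Because the genotype is turned into a set, ("A", "O") and ("O", "A") give the same phenotype.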
Example: ABO, AK1 on Chromosome 9
The male recombination fraction is 12/100 and the female is 20/100. These numbers are measured in centi-morgans: one centi-morgan means one recombination every 100 meioses.
One centi-morgan corresponds to approximately 1M nucleotides (with large variance), depending on the location and sex.
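[Figure: pedigree of individuals 1-5 showing ABO phenotypes (A, O, A, O, A) and AK1 marker genotypes (A1/A1, A2/A2, A1/A2, A1/A2, A2/A2), with the two-locus haplotypes of each individual. The phase is inferred and one haplotype is marked as recombinant.]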
Example for Finding Disease Genes
We use a marker with codominant alleles A1/A2.
We hypothesize a disease locus with alleles H (healthy) / D (affected).
If the expected number of recombinants is low (close to zero), then the hypothesized locus and the marker are tentatively physically close.
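[Figure: pedigree of individuals 1-5 showing disease-locus phenotypes (H, D, H, D, H) and marker genotypes (A1/A1, A2/A2, A1/A2, A1/A2, A2/A2), with the two-locus haplotypes of each individual. The phase is inferred and one haplotype is marked as recombinant.]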
Recombination cannot be simply counted
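[Figure: pedigree of individuals 1-5 showing disease-locus phenotypes (H, H, H, D, H) and marker genotypes (A1/A1, A2/A2, A1/A2, A1/A2, A2/A2). The phase is unknown ("Phase ???"), so the marked haplotype is only a possible recombinant.]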
One can compute the probability that a recombination occurred and use this number as if this is the real count.
Comments about the example
Often:
• Pedigrees are larger and more complex.
• Not every individual is typed.
• Recombinants cannot always be determined.
• There are more markers and they are polymorphic (have more than two alleles).
Genetic Linkage Analysis
The method just described is called genetic linkage analysis. It uses the phenomenon of recombination in families of affected individuals to locate the vicinity of a disease gene.
Recombination fraction is measured in centi-morgans and can differ between males and females.
Next step: Once a suspected area is found, further studies check the 20-50 candidate genes in that area.
(Linkage) $0 \le P(\text{Recombination}) \le 0.5$ (No Linkage)
Part II: Case study
Werner’s Syndrome
A successful application of genetic linkage analysis
The Disease
• First references in 1960s
• Causes premature ageing
• Autosomal recessive
• Linkage studies from 1992
• WRN gene cloned in 1996
• Subsequent discovery of mechanisms involved in wild-type and mutant proteins
One Pedigree’s Data (out of 14)
1 115 0 0 2 1 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0 1 126 0 0 1 1 0 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 1 0 1 111 0 0 1 1 0 1 0 1 2 0 2 0 3 1 2 1 1 1 3 1 2 1 0 0 1 0 1 0 0 0 1 122 111 115 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 125 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 121 111 115 1 1 2 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0 1 1 135 126 122 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 131 121 125 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 141 131 135 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Pedigree number
Individual ID
Mother’s ID
Father’s ID
Sex: 1=male 2=female
Status: 1=healthy 2=diseased
Unknown marker alleles
Known marker alleles
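One row of the pedigree data can be split into the fields named in the callouts. The following is a minimal sketch in Python; the class `PedigreeRow` and function `parse_row` are hypothetical names. It assumes the third and fourth columns are father then mother, which is inferred from the data (individual 122 lists parents 111 and 115, and 111 has sex 1 while 115 has sex 2), and that the remaining fields are allele pairs, one pair per marker, with 0 meaning unknown.

```python
from typing import List, NamedTuple, Tuple

class PedigreeRow(NamedTuple):
    pedigree: int
    individual: int
    father: int      # assumed column order: father first, then mother
    mother: int
    sex: int         # 1 = male, 2 = female
    status: int      # 1 = healthy, 2 = diseased
    markers: List[Tuple[int, int]]  # one allele pair per marker, 0 = unknown

def parse_row(line: str) -> PedigreeRow:
    fields = [int(tok) for tok in line.split()]
    head, alleles = fields[:6], fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected an even number of allele fields")
    markers = list(zip(alleles[0::2], alleles[1::2]))
    return PedigreeRow(*head, markers)

# First row of the pedigree shown on the slide.
row = parse_row("1 115 0 0 2 1 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0")
print(row.individual, row.sex, row.status, len(row.markers), row.markers[:3])
```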
Marker File Input
14 0 0 5
0 0.0 0.0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 2
0.995 0.005
1
0 0 1
3 6 # D8S133
0.0200 0.3700 0.4050 0.0050 0.0500 0.0750
...[other 12 markers skipped]...
0 0
10 7.6 7.4 0.9 6.7 1.6 2.5 2.8 2.1 2.8 11.4 1 43.8
1 0.1 0.45
1 disease locus + 13 markers
Recessive disease requires 2 mutant genes
First marker has 6 alleles
First marker founder allele frequencies
First marker’s name
Recombination distances between markers
Genehunter Output
position  LOD_score   information
0.00      -1.254417   0.224384
1.52      2.836135    0.226379
...[other data skipped]...
18.58     13.688599   0.384088
19.92     14.238474   0.401992
21.26     14.718037   0.426818
22.60     15.159389   0.462284
22.92     15.056713   0.462510
23.24     14.928614   0.463208
23.56     14.754848   0.464387
...[other data skipped]...
81.84     1.939215    0.059748
90.60     -11.930449  0.087869
Putative distance of disease gene from first marker in recombination units
Log likelihood of placing disease gene at distance, relative to it being unlinked.
Maximum log likelihood score
Most ‘likely’ position
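Reading off the most likely position from such output amounts to taking the row with the largest LOD score. A minimal sketch in Python follows; it uses only the rows visible on the slide (the skipped rows are not available), and the function name `parse_lod_table` is hypothetical.

```python
# Rows copied from the Genehunter output shown on the slide; elided rows are omitted.
genehunter_output = """\
position LOD_score information
0.00 -1.254417 0.224384
1.52 2.836135 0.226379
18.58 13.688599 0.384088
19.92 14.238474 0.401992
21.26 14.718037 0.426818
22.60 15.159389 0.462284
22.92 15.056713 0.462510
23.24 14.928614 0.463208
23.56 14.754848 0.464387
81.84 1.939215 0.059748
90.60 -11.930449 0.087869
"""

def parse_lod_table(text):
    """Return a list of (position, lod_score, information) tuples."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        position, lod, information = (float(tok) for tok in line.split())
        rows.append((position, lod, information))
    return rows

rows = parse_lod_table(genehunter_output)
best_position, best_lod, _ = max(rows, key=lambda r: r[1])
print(best_position, best_lod)  # 22.6 15.159389
```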
Final Location
Marker D8S131
Marker D8S259
location of marker D8S339
WRN Gene final location
Error in location by genetic linkage of about 1.25M base pairs.
Part III: Relevant Mathematics
The Maximum Likelihood Approach
The probability of pedigree data Pr(data | θ) is a function of the known and unknown recombination fractions denoted collectively by θ.
How can we construct this likelihood function?
The maximum likelihood approach is to seek the value of θ which maximizes the likelihood function Pr(data | θ). This value is called the ML estimate.
The main computational difficulty is to compute Pr(data | θ) for a specific value of θ.
Constructing the Likelihood function
First, we need to determine the variables that describe the problem. There are several possible choices. Some variables we can observe and some we cannot.
Lijm (Lijf) = Maternal (paternal) allele at locus i of person j. The values of these variables are the possible alleles li at locus i.
Xij = Unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (li,l’i).
Constructing the Likelihood function
[Figure: Bayesian network fragment with nodes L11m, L11f, X11, S13m and L13m.]
Selector of maternal allele at locus 1 of person 3
Maternal allele at locus 1 of person 3 (offspring)
Selector variables Sijm are 0 or 1 depending on which of the mother's two alleles at locus i is transmitted to offspring j.
P(s13m) = ½
P(l13m | l11m, l11f, s13m=0) = 1 if l13m = l11m
P(l13m | l11m, l11f, s13m=1) = 1 if l13m = l11f
P(l13m | l11m, l11f, s13m) = 0 otherwise
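The local probability tables for the selector and the transmitted allele can be sketched directly. The Python below is a minimal illustration, not the lecture's implementation; the function names are hypothetical, and the allele labels "a1" and "a2" are placeholders.

```python
def p_selector(s):
    """P(s13m): each of the mother's two alleles is transmitted with probability 1/2."""
    return 0.5

def p_transmitted(l_child, l_mother_m, l_mother_f, s):
    """P(l13m | l11m, l11f, s13m): deterministic copy of the selected allele."""
    source = l_mother_m if s == 0 else l_mother_f
    return 1.0 if l_child == source else 0.0

def p_child_allele(l_child, l_mother_m, l_mother_f):
    """Sum out the hidden selector to get P(l13m | l11m, l11f)."""
    return sum(p_selector(s) * p_transmitted(l_child, l_mother_m, l_mother_f, s)
               for s in (0, 1))

print(p_child_allele("a1", "a1", "a2"))  # heterozygous mother
print(p_child_allele("a1", "a1", "a1"))  # homozygous mother
```

Summing out the selector recovers the usual Mendelian rule: a heterozygous mother transmits each allele with probability 1/2.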
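Probabilistic model for two loci
[Figure: two copies of the single-locus Bayesian network. "Model for locus 1" has nodes L11m, L11f, X11, L12m, L12f, X12, S13m, S13f, L13m, L13f, X13. "Model for locus 2" has nodes L21m, L21f, X21, L22m, L22f, X22, S23m, S23f, L23m, L23f, X23.]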
Probabilistic model for Recombination
[Figure: the locus 1 and locus 2 networks drawn together, with the selector variables S13m, S13f of locus 1 connected to the selector variables S23m, S23f of locus 2.]
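$$P(s_{23t} \mid s_{13t}, \theta_2) = \begin{pmatrix} 1-\theta_2 & \theta_2 \\ \theta_2 & 1-\theta_2 \end{pmatrix}, \quad \text{where } t \in \{m, f\}$$
θ2 is called the recombination fraction between loci 2 & 1.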
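The selector transition table is a 2x2 matrix indexed by the selector at locus 1 (row) and at locus 2 (column). The sketch below, in Python with a hypothetical function name, builds it and checks two properties that follow from the definition: each row sums to 1, and a recombination fraction of 0.5 makes the two selectors independent.

```python
def selector_transition(theta):
    """Return T with T[s1][s2] = P(s2 | s1, theta) for selectors in {0, 1}."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return [[1.0 - theta, theta],
            [theta, 1.0 - theta]]

# 0.12 is the male recombination fraction quoted earlier for ABO and AK1.
T = selector_transition(0.12)
print(T)
print([sum(row) for row in T])      # each row sums to 1
print(selector_transition(0.5))     # all entries 0.5: no linkage
```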
females males
The Data
The data consists of an assignment to a subset of the variables {Xij}. In other words some (or all) persons are genotyped at some (or all) loci.
Constructing the likelihood function
Joint probability (product over all local probability tables):

$$P(l_{11m}, l_{11f}, x_{11}, l_{12m}, l_{12f}, x_{12}, l_{13m}, l_{13f}, x_{13}, l_{21m}, l_{21f}, x_{21}, l_{22m}, l_{22f}, x_{22}, l_{23m}, l_{23f}, x_{23}, s_{13m}, s_{13f}, s_{23m}, s_{23f} \mid \theta_2) = P(l_{11m})\, P(l_{11f})\, P(x_{11} \mid l_{11m}, l_{11f}) \cdots P(s_{13m})\, P(s_{13f})\, P(s_{23m} \mid s_{13m}, \theta_2)\, P(s_{23f} \mid s_{13f}, \theta_2)$$

Probability of data (sum over all states of all hidden variables):

$$\text{Prob}(\text{data} \mid \theta_2) = P(x_{11}, x_{12}, x_{13}, x_{21}, x_{22}, x_{23}) = \sum_{l_{11m},\, l_{11f},\, \ldots,\, s_{23f}} \Big[ P(l_{11m})\, P(l_{11f})\, P(x_{11} \mid l_{11m}, l_{11f}) \cdots P(s_{13m})\, P(s_{13f})\, P(s_{23m} \mid s_{13m}, \theta_2)\, P(s_{23f} \mid s_{13f}, \theta_2) \Big]$$

The result is a function of the recombination fraction. The ML estimate is the θ2 value that maximizes this function.
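To make the "sum over hidden variables, then maximize" recipe concrete, here is a deliberately simplified Python sketch. It is not the full pedigree model: it assumes a single parent whose two haplotypes are known (phase-known), assumes the alleles each child received from that parent are observed directly, and sums only over the hidden selectors at the two loci. The allele labels, the ten hypothetical children, and all function names are illustrative assumptions.

```python
from itertools import product

def transition(s_prev, s_next, theta):
    """P(s_next | s_prev, theta): the selector switches with probability theta."""
    return 1.0 - theta if s_prev == s_next else theta

def meiosis_likelihood(parent_haps, transmitted, theta):
    """Sum over hidden selectors (s1, s2) for one meiosis at two loci.

    parent_haps: two haplotypes, each (allele at locus 1, allele at locus 2).
    transmitted: the alleles (locus 1, locus 2) the child received from this parent.
    """
    total = 0.0
    for s1, s2 in product((0, 1), repeat=2):
        consistent = (parent_haps[s1][0] == transmitted[0]
                      and parent_haps[s2][1] == transmitted[1])
        if consistent:
            total += 0.5 * transition(s1, s2, theta)  # P(s1) * P(s2 | s1, theta)
    return total

def likelihood(parent_haps, children, theta):
    """Children are independent given the parent, so their likelihoods multiply."""
    result = 1.0
    for transmitted in children:
        result *= meiosis_likelihood(parent_haps, transmitted, theta)
    return result

# Hypothetical data: 10 children, exactly one of them a recombinant (A1 with B2).
parent = (("A1", "B1"), ("A2", "B2"))
children = [("A1", "B1")] * 5 + [("A2", "B2")] * 4 + [("A1", "B2")]

# Grid search for the ML estimate over theta in [0, 0.5].
grid = [i / 100 for i in range(51)]
theta_hat = max(grid, key=lambda th: likelihood(parent, children, th))
print(theta_hat)
```

In this phase-known setting the grid search returns the observed recombinant fraction, 1 out of 10. In real pedigrees the phase and many genotypes are hidden, so the sum runs over far more variables, which is the computational difficulty the slides go on to address.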
Modeling Phenotypes
[Figure: the network fragment with nodes L11m, L11f, X11, S13m and L13m, together with a phenotype node Y11.]
Phenotype variables Yij are 0 or 1 depending on whether a phenotypic trait associated with locus i of person j is observed, e.g., sick versus healthy. For example, a model of a perfect recessive disease yields the penetrance probabilities:
P(y11 = sick | X11 = (a,a)) = 1
P(y11 = sick | X11 = (A,a)) = 0
P(y11 = sick | X11 = (A,A)) = 0
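The penetrance probabilities of the perfect recessive model can be tabulated in a few lines. This Python sketch uses a hypothetical function name, with "a" as the disease allele and "A" as the normal allele as in the slide.

```python
def p_sick(genotype):
    """P(y = sick | genotype) for a fully penetrant recessive disease.

    The genotype is an unordered pair, so ("A", "a") and ("a", "A") are equivalent.
    """
    return 1.0 if sorted(genotype) == ["a", "a"] else 0.0

penetrance = {g: p_sick(g) for g in [("a", "a"), ("A", "a"), ("A", "A")]}
print(penetrance)
```

A model with incomplete penetrance would replace the 1 and 0 entries with intermediate probabilities; the values for such a model are not given in the slides.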
Standard usage of linkage
There are usually 5-15 markers. 20-30% of the persons in large pedigrees are genotyped (namely, their xij is measured). For each genotyped person about 90% of the loci are measured correctly. The recombination fraction between every two loci is known from previous studies (available genetic maps).
The user adds a locus called the “disease locus” and places it between two markers i and i+1. The recombination fractions θ’ between the disease locus and marker i and θ” between the disease locus and marker i+1 are the unknown parameters being estimated using the likelihood function.
This computation is done for every gap between the given markers on the map. The MLE hints at the whereabouts of a single gene causing the disease (if a single one exists).
Part IV: Software and Algorithms
• Fastlink v4.1 (each person’s genotype is one variable)
• Vitesse v1, v2 (only loopless Bayesian networks allowed)
• GeneHunter/Allegro (exponential in number of persons)
• Many more specific packages (e.g., affected siblings)
• Superlink: why is it better?
For a list, see http://linkage.rockefeller.edu/soft/list.html
SUPERLINK
Stage 1: each pedigree is translated into a Bayesian network.
Stage 2: value elimination is performed on each pedigree (i.e., some of the impossible values of the variables of the network are eliminated).
Stage 3: an elimination order for the variables is determined, according to some heuristic.
Stage 4: the likelihood of the pedigrees given the θ values is calculated using variable elimination according to the elimination order determined in stage 3.
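Stage 4 relies on variable elimination. The Python sketch below shows the generic technique on a toy two-selector network; it is not Superlink's code, and the `Factor` class, the function names, the evidence factors and the value θ = 0.1 are all assumptions made for illustration. Each step multiplies the factors that mention a variable and then sums that variable out.

```python
from itertools import product

class Factor:
    """A table over a tuple of variables; keys are value tuples in the same order."""
    def __init__(self, variables, table):
        self.variables = tuple(variables)
        self.table = dict(table)

def multiply(f, g, domains):
    variables = f.variables + tuple(v for v in g.variables if v not in f.variables)
    table = {}
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        fv = f.table[tuple(assignment[v] for v in f.variables)]
        gv = g.table[tuple(assignment[v] for v in g.variables)]
        table[values] = fv * gv
    return Factor(variables, table)

def sum_out(f, var):
    idx = f.variables.index(var)
    table = {}
    for values, p in f.table.items():
        key = values[:idx] + values[idx + 1:]
        table[key] = table.get(key, 0.0) + p
    return Factor(f.variables[:idx] + f.variables[idx + 1:], table)

def eliminate_all(factors, order, domains):
    """Eliminate every variable in `order`; assumes `order` covers all variables."""
    factors = list(factors)
    for var in order:
        related = [f for f in factors if var in f.variables]
        if not related:
            continue
        factors = [f for f in factors if var not in f.variables]
        combined = related[0]
        for f in related[1:]:
            combined = multiply(combined, f, domains)
        factors.append(sum_out(combined, var))
    result = 1.0
    for f in factors:          # only constant factors (no variables) remain
        result *= f.table[()]
    return result

theta = 0.1  # assumed recombination fraction for the toy example
domains = {"S1": (0, 1), "S2": (0, 1)}
prior = Factor(("S1",), {(0,): 0.5, (1,): 0.5})
trans = Factor(("S1", "S2"), {(a, b): (1 - theta if a == b else theta)
                              for a in (0, 1) for b in (0, 1)})
evidence1 = Factor(("S1",), {(0,): 1.0, (1,): 0.0})  # data consistent only with S1 = 0
evidence2 = Factor(("S2",), {(0,): 0.0, (1,): 1.0})  # data consistent only with S2 = 1
print(eliminate_all([prior, trans, evidence1, evidence2], ["S1", "S2"], domains))
```

The result is 0.5 · θ, the probability of a recombinant meiosis. The elimination order does not change the answer, only the size of the intermediate tables, which is why stage 3 spends effort on choosing a good order.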
Experiment A
• Same topology (57 people, no loops)
• Increasing number of loci (each one with 4-5 alleles)
• Run time is in seconds.

File  No. of loci  Superlink  Fastlink  Vitesse  Genehunter
A0    2            0.03       0.12      0.27
A1    5            0.1        3.77      0.31
A2    6            0.14       79.32     0.39
A3    7            0.42                 0.69
A4    8            0.36                 2.81
A5    10           1.19                 84.66
A6    12           4.65
A7    14           3.01
A8    18           20.98
A9    37           8510.15
A10   38           10446.27
A11   40

over 100 hours
Out-of-memory
Pedigree size too big for Genehunter.
Experiment C
• Same topology (5 people, no loops)
• Increasing number of loci (each one with 3-6 alleles)
• Run time is in seconds.

Files  No. of Loci  Run Time Superlink  Run Time Fastlink  Run Time Vitesse  Run Time Genehunter
D0     100   0.16 (2 l.e.)   0.41 (99 l.e.)
D1     110   0.2 (2 l.e.)    0.45 (109 l.e.)
D2     120   0.21 (2 l.e.)   0.48 (119 l.e.)
D3     130   0.22 (2 l.e.)   0.49 (129 l.e.)
D4     140   0.24 (2 l.e.)   0.51 (139 l.e.)
D5     150   0.25 (2 l.e.)   0.53 (149 l.e.)
D6     160   0.27 (2 l.e.)   0.54 (159 l.e.)
D7     170   0.3 (2 l.e.)    0.6 (169 l.e.)
D8     180   0.3 (2 l.e.)    0.59 (179 l.e.)
D9     190   0.32 (2 l.e.)   0.61 (189 l.e.)
D10    200   0.34 (2 l.e.)   0.66 (199 l.e.)
D11    210   0.37 (2 l.e.)   0.67 (209 l.e.)

Out-of-memory
Bus error
The computational task at hand
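$$P(\text{data} \mid \theta) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{n} P(x_i \mid pa_i)$$

Multidimensional multiplication/summation:

$$Y_{ij} = \sum_{k,l,m,n} A_{ikl}\, B_{kjm}\, C_{lmn}$$

Example: Matrix multiplication:

$$C_{ij} = \sum_{k} A_{ik} B_{kj}$$

$$(A_{50 \times 50}\, B_{50 \times 1})\, C_{1 \times 50} \quad \text{versus} \quad A_{50 \times 50}\, (B_{50 \times 1}\, C_{1 \times 50})$$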
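The point of the matrix example is that the order of multiplication changes the amount of work even though the result is the same. A small Python sketch, assuming the standard schoolbook cost of p·q·r scalar multiplications for a (p×q) by (q×r) product, counts the work for both orders; the function name is hypothetical.

```python
def multiply_cost(shape_a, shape_b):
    """Return (scalar multiplications, result shape) for a schoolbook matrix product."""
    rows, inner = shape_a
    inner_b, cols = shape_b
    if inner != inner_b:
        raise ValueError("inner dimensions do not match")
    return rows * inner * cols, (rows, cols)

A, B, C = (50, 50), (50, 1), (1, 50)

# (A B) C: the intermediate result is only 50 x 1.
cost_ab, shape_ab = multiply_cost(A, B)
cost_ab_c, _ = multiply_cost(shape_ab, C)
left_first = cost_ab + cost_ab_c

# A (B C): the intermediate result is a full 50 x 50 matrix.
cost_bc, shape_bc = multiply_cost(B, C)
cost_a_bc, _ = multiply_cost(A, shape_bc)
right_first = cost_bc + cost_a_bc

print(left_first, right_first)  # 5000 127500
```

Choosing an elimination order in a Bayesian network plays the same role as choosing the parenthesization here: it controls the size of the intermediate tables.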
Some options for improving efficiency
1. Performing approximate calculations of the likelihood.
2. Multiplying special probability matrices efficiently.
3. Grouping alleles together and removing inconsistent alleles.
4. Optimizing the elimination order of variables in a Bayesian network.
$$P(\text{data} \mid \theta) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{n} P(x_i \mid pa_i)$$
Projects
Project No.  Project Subject
1  Performing approximate likelihood computations by using the method Iterative Join-Graph Propagation. (ps)
2  Performing haplotyping on the input data, i.e., inferring the most likely haplotypes for the individuals in the input pedigrees. (ps)
3  Performing approximate likelihood computations by using a heuristic which ignores extreme markers in the likelihood computation. (ps)