7/31/2019 Bioinformatics Algorithms
BIOINFORMATICS ALGORITHMS

BASED ON THE 2009 TEACHING OF THE CAMBRIDGE COMPUTER SCIENCE PART II BIOINFORMATICS COURSE BY PIETRO LIÒ

Vaughan Eveleigh, Jesus College, Cambridge University
CONTENTS

1 DNA and Protein Sequences
  1.1 Preparation
    1.1.1 Manhattan Tourist
      1.1.1.1 Naive Algorithm
      1.1.1.2 Dynamic Algorithm
  1.2 Strings
    1.2.1 Longest Common Subsequence
    1.2.2 Needleman-Wunsch (Global Alignment)
    1.2.3 Smith-Waterman (Local Alignment)
    1.2.4 Affine Gaps
    1.2.5 Banded Dynamic Programming
    1.2.6 Computing Path with Linear Space
    1.2.7 Block Alignment
    1.2.8 Four Russians Block Alignment Speedup
    1.2.9 Four Russians Technique - Longest Common Sub-Expression
    1.2.10 Nussinov Algorithm
    1.2.11 BLAST (Multiple Alignment)
    1.2.12 Pattern Hunter (Multiple Alignment)
    1.2.13 BLAT (Multiple Alignment)
  1.3 Trees
    1.3.1 Parsimony
      1.3.1.1 Sankoff Algorithm
      1.3.1.2 Fitch's Algorithm
    1.3.2 Large Parsimony Problem
    1.3.3 Distance
      1.3.3.1 UPGMA
      1.3.3.2 Neighbour Joining
    1.3.4 Likelihood
    1.3.5 Bootstrapping Algorithm
    1.3.6 Prim's Algorithm
  1.4 Information Theory and DNA
    1.4.1 Information Content of a DNA Motif
    1.4.2 Entropy of Multiple Alignment
    1.4.3 Information Content of a String
    1.4.4 Motifs
    1.4.5 Exhaustive Search
    1.4.6 Gibbs Sampling
  1.5 Hidden Markov Models
    1.5.1 Forward Algorithm
    1.5.2 Viterbi Algorithm
    1.5.3 Backward Algorithm
2 Working with Microarray
  2.1 Clustering
    2.1.1 Lloyd Algorithm (k-means)
    2.1.2 Greedy Algorithm (k-means)
    2.1.3 CAST (Cluster Affinity Search Technique)
    2.1.4 QT Clustering
    2.1.5 Markov Clustering Algorithm
  2.2 Genetic Networks Analysis
  2.3 Systems Biology
    2.3.1 Gillespie Algorithm
Complexity Summary
1 DNA AND PROTEIN SEQUENCES

1.1 PREPARATION

DNA (deoxyribonucleic acid) uses a 4-letter alphabet (A, T, C, G).
RNA (ribonucleic acid) also uses a 4-letter alphabet (A, U, C, G).
Proteins use a 20-letter amino acid alphabet (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y).
1.1.1 MANHATTAN TOURIST

The problem: given a weighted grid G, travel from the source (top left) to the sink (bottom right) along the highest-scoring path, travelling only south and east.

The solution: the problem can be generalised to finding the longest path from the source to an arbitrary destination.

1.1.1.1 Naive Algorithm

Start at the destination node and calculate which of the immediately adjacent nodes has the highest path score from the source. For each of these edges, recurse.
path(i,j)
  if (i = 0 or j = 0)
    return 0
  else
    X = path(i-1, j) + edge (i-1,j) to (i,j)
    Y = path(i, j-1) + edge (i,j-1) to (i,j)
    return max(X,Y)

Time = O(2^(n+m)), Space = O(n + m) for the recursion stack
Although this exhaustive algorithm produces accurate results it is not efficient. Many
path values are repeatedly computed.
1.1.1.2 Dynamic Algorithm
Dynamic programming improves the naive algorithm by storing the results of previous
computations and reusing them when required at a later stage. The idea behind a
dynamic algorithm is that unnecessary calculations are not re-computed. Although this
significantly improves time complexity, in many cases the space complexity can be
quite demanding.
In the case of the Manhattan tourist problem we only need to store the values of 1 row
and 1 column at any time.
DynamicPath(i,j)
  S0,0 = 0
  for x=1 to i
    Sx,0 = Sx-1,0 + edge (x-1,0) to (x,0)
  for y=1 to j
    S0,y = S0,y-1 + edge (0,y-1) to (0,y)
  for x=1 to i
    for y=1 to j
      A = Sx,y-1 + edge (x,y-1) to (x,y)
      B = Sx-1,y + edge (x-1,y) to (x,y)
      Sx,y = max (A,B)
  Return Si,j

Where Sx,y are the stored values. Time = O(nm), Space = O(n + m).

If our DAG representing the city were to also contain diagonal paths, we would require a third condition in the final for loop.
DynamicDiagonalPath(i,j)
  S0,0 = 0
  for x=1 to i
    Sx,0 = Sx-1,0 + edge (x-1,0) to (x,0)
  for y=1 to j
    S0,y = S0,y-1 + edge (0,y-1) to (0,y)
  for x=1 to i
    for y=1 to j
      A = Sx,y-1 + edge (x,y-1) to (x,y)
      B = Sx-1,y + edge (x-1,y) to (x,y)
      C = Sx-1,y-1 + edge (x-1,y-1) to (x,y)
      Sx,y = max (A,B,C)
  Return Si,j

Time = O(nm), Space = O(n + m). Many of the future algorithms will resemble the Manhattan tourist problem.
1.2 STRINGS
There are several ways by which we can compare the similarity of strings.

- Edit distance (non-trivial): the minimum number of operations (insertions, deletions and substitutions) required to transform one string into another.
- Hamming distance (trivial): the number of positions at which the corresponding characters of two equal-length strings differ.

Consider the two strings v of length 7 and w of length 6:
v : ATCTGAT
w : TGCATA
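To make the two measures concrete, here is a minimal Python sketch (the function names are mine, not from the course):

```python
def hamming(v, w):
    # defined only for strings of equal length
    assert len(v) == len(w)
    return sum(a != b for a, b in zip(v, w))

def edit_distance(v, w):
    # minimum insertions, deletions and substitutions turning v into w
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                    # delete the whole prefix of v
    for j in range(m + 1):
        d[0][j] = j                    # insert the whole prefix of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (v[i - 1] != w[j - 1]))  # substitution
    return d[n][m]
```

Note how the edit-distance table has exactly the Manhattan-grid shape used throughout this section.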
After comparison of string v against w we can count the number of matches, insertions and deletions.
1.2.1 LONGEST COMMON SUBSEQUENCE

Although the Hamming distance is commonly used in computer science, the edit distance is of greater use in biology. By aligning two strings by their longest common subsequence, the minimal distance can be found.

The longest common subsequence is similar to edit distance but only uses insertions and deletions, not substitutions.

This problem can be represented as a hybrid of the Manhattan tourist problem (right) where diagonal paths represent matched characters and horizontal or vertical edges represent edits. Each of the edges is assigned a weighting:

- vertical edges: 0
- horizontal edges: 0
- diagonal edges: 1 where v_x = w_y, 0 otherwise

LongestCommonSubsequence(i,j)
  S0,0 = 0
  for x=1 to i
    Sx,0 = 0
  for y=1 to j
    S0,y = 0
  for x=1 to i
    for y=1 to j
      A = Sx,y-1 + 0
      B = Sx-1,y + 0
      C = Sx-1,y-1 + edge (x-1,y-1) to (x,y)
      Sx,y = max (A,B,C)
  Return Si,j

Time = O(nm)

If we are only concerned with returning the optimal score (and
not the path) the LCS can be calculated with linear space. When
computing the score of any cell in our adjacency matrix only the
scores of cells immediately above, immediately left and
immediately diagonal are required. As a result, historic values
from other computations are not required.
Space = O(n + m)

The algorithm can also be modified to remember the route when deciding which of the paths (A, B or C) to take. This requires allocation of an adjacency matrix of size n × m populated with the score and direction of each cell. This information allows us to backtrack to find which sequence of insertions and deletions generated the score.

Time = O(nm), Space = O(nm)

This is the simplest form of alignment as only insertions and deletions are allowed. This algorithm is rather restrictive, awarding 1 for matches and not penalising indels (abbreviation for insertions and deletions). We will now consider ways in which mismatches can be penalised.
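The LCS recurrence plus the backtracking stage can be sketched as follows. This is an illustrative Python version, not the course's pseudocode; it recovers the route by re-examining scores rather than storing explicit direction flags:

```python
def lcs(v, w):
    """Length of the longest common subsequence of v and w, plus one
    optimal subsequence recovered by backtracking through the matrix."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            if v[x - 1] == w[y - 1]:
                s[x][y] = s[x - 1][y - 1] + 1    # diagonal edge, weight 1
            else:
                s[x][y] = max(s[x - 1][y], s[x][y - 1])
    # backtrack from (n, m) to recover one optimal subsequence
    out, x, y = [], n, m
    while x > 0 and y > 0:
        if v[x - 1] == w[y - 1]:
            out.append(v[x - 1])
            x -= 1
            y -= 1
        elif s[x - 1][y] >= s[x][y - 1]:
            x -= 1                               # deletion from v
        else:
            y -= 1                               # insertion into v
    return s[n][m], "".join(reversed(out))
```

Dropping the backtrack (and keeping only two rows of `s`) gives the linear-space, score-only variant discussed above.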
1.2.2 NEEDLEMAN-WUNSCH (GLOBAL ALIGNMENT)

Global alignment assumes that the two proteins are basically similar over their entire length. The alignment attempts to match them to each other from end to end, even though parts of the alignment may not be very convincing, e.g.

Global alignment penalises insertions or deletions by decreasing the overall alignment score by the value d. We first need to initialise our scoring matrix as we did for the longest common subsequence algorithm.

Weighting:
- vertical edges: -d
- horizontal edges: -d
- diagonal edges: 1 where v_i = w_j, a mismatch penalty otherwise
aaagcggaagtcacag
||.||.||||| |.||
aaggctgaagt-atag
Needleman-Wunsch (i,j)
  S0,0 = 0
  for x=1 to i
    Sx,0 = -x*d
  for y=1 to j
    S0,y = -y*d
  for x=1 to i
    for y=1 to j
      A = Sx,y-1 - d
      B = Sx-1,y - d
      C = Sx-1,y-1 + edge (x-1,y-1) to (x,y)
      Sx,y = max (A,B,C)
  Return Si,j

Time = O(nm), Space = O(n + m) for the score alone, or O(nm) if the path is stored. Again this algorithm could be modified to store the path taken through the matrix.
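As a Python sketch of the recurrence (the scoring parameters `match`, `mismatch` and `d` are assumed values, not fixed by the notes):

```python
def needleman_wunsch(v, w, match=1, mismatch=-1, d=2):
    """Global alignment score of v and w with linear gap penalty d."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):
        s[x][0] = -x * d                  # aligning a prefix of v to gaps
    for y in range(1, m + 1):
        s[0][y] = -y * d                  # aligning a prefix of w to gaps
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            diag = match if v[x - 1] == w[y - 1] else mismatch
            s[x][y] = max(s[x][y - 1] - d,          # gap in v
                          s[x - 1][y] - d,          # gap in w
                          s[x - 1][y - 1] + diag)   # match or mismatch
    return s[n][m]
```

Note that, unlike LCS, the first row and column are initialised to negative values: a global alignment must pay for every leading gap.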
1.2.3 SMITH-WATERMAN (LOCAL ALIGNMENT)

Local alignment searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. E.g.
The Smith Waterman algorithm is based on the Needleman-Wunsch algorithm but
ignores badly aligning regions. It does this by assigning 0 to any cells that would have
been allocated a negative value using Needleman-Wunsch.
Smith-Waterman (i,j)
S0,0=0
for x=1 to i
Sx,0 = 0
for y=1 to j
S0,y = 0
for x=1 to i
for y=1 to j
A = Sx,y-1 - d
B = Sx-1,y - d
C = Sx-1,y-1+edge (x-1,y-1) to (x,y)
D = 0
Sx,y = max (A,B,C,D)
Return Si,j
Time = O(nm), Space = O(nm)
On termination we can use our alignment matrix to find string alignments.
aaagcggaagtcacag
......||||| ....
aaggctgaagt-atag
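The local-alignment recurrence can be sketched in Python; as with the global version, the scoring parameters here are assumptions:

```python
def smith_waterman(v, w, match=2, mismatch=-1, d=1):
    """Best local alignment score: negative-scoring prefixes reset to 0."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            diag = match if v[x - 1] == w[y - 1] else mismatch
            s[x][y] = max(0,                          # restart the alignment here
                          s[x][y - 1] - d,            # gap in v
                          s[x - 1][y] - d,            # gap in w
                          s[x - 1][y - 1] + diag)     # match or mismatch
            best = max(best, s[x][y])                 # answer may end anywhere
    return best
```

The two differences from Needleman-Wunsch are exactly those described in the text: the extra 0 in the max, and taking the answer from the best cell anywhere in the matrix rather than from the bottom-right corner.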
- To find the longest match we are just required to find the cell with the highest value.
- To find alignments greater than a given length and score threshold, valid scores can be found in O(nm) time complexity.
- The alignment can be found from the scores in O(n + m) time complexity.

1.2.4 AFFINE GAPS

Needleman-Wunsch can be optimised further to score gaps more accurately. Gaps are currently being scored uniformly, however they usually occur in clusters. The uniform gap penalty is changed to a function that considers the length of a gap. We implement this using affine gaps.
γ = gap penalty function, d = initial (gap-opening) penalty, e = successive (gap-extension) penalty. Here the first symbol of a gap incurs a penalty of d and each subsequent symbol incurs a penalty of e.

Two alignment matrices are used to record scores:

- F contains scores assuming v_i aligns to w_j
- G contains scores assuming v_i or w_j aligns to a gap

The example below also includes the function s(x,y) that returns a constant if the x-th value of string v aligns with the y-th value of string w. This is just a shorthand notation for the edge function explained earlier.

γ(0) = 0
γ(n) = d + (n - 1)e for a gap of length n ≥ 1
Needleman-Wunsch-Affine (i,j)
  F0,0 = 0
  for x=1 to i
    Fx,0 = -(d+(x-1)*e)
  for y=1 to j
    F0,y = -(d+(y-1)*e)
  for x=1 to i
    for y=1 to j
      A = Fx-1,y-1 + s(x,y)
      B = Gx-1,y-1 + s(x,y)
      Fx,y = max (A,B)
      L = Fx-1,y - d
      M = Fx,y-1 - d
      N = Gx-1,y - e
      O = Gx,y-1 - e
      Gx,y = max(L,M,N,O)
  Return max(Gi,j, Fi,j)
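Affine gaps are often implemented with three matrices (a match state plus one gap state per string), the Gotoh formulation. This Python sketch follows that shape rather than the two-matrix F/G notation above, and the scoring parameters are my own assumptions:

```python
NEG = float("-inf")

def affine_align(v, w, match=1, mismatch=-1, d=4, e=1):
    """Global alignment with affine gaps: open costs d, each extension costs e."""
    n, m = len(v), len(w)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]  # v[x] aligned to w[y]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in w (v[x] vs -)
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in v (- vs w[y])
    M[0][0] = 0
    for x in range(1, n + 1):
        X[x][0] = -d - (x - 1) * e               # leading gap of length x
    for y in range(1, m + 1):
        Y[0][y] = -d - (y - 1) * e
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            s = match if v[x - 1] == w[y - 1] else mismatch
            M[x][y] = max(M[x-1][y-1], X[x-1][y-1], Y[x-1][y-1]) + s
            X[x][y] = max(M[x-1][y] - d,         # open a new gap
                          X[x-1][y] - e)         # extend an existing gap
            Y[x][y] = max(M[x][y-1] - d,
                          Y[x][y-1] - e)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these parameters a single gap of length 2 costs d + e = 5, whereas two separate gaps of length 1 would cost 2d = 8, which is exactly the clustering behaviour the affine penalty is meant to encourage.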
Time = O(nm), Space = O(nm)

1.2.5 BANDED DYNAMIC PROGRAMMING

Provided we know that the strings are similar, the majority of computations can be ignored. For example, if we were comparing two DNA sequences from the same species, the optimal alignment will not deviate far from the perfect diagonal line. As a result we can just exclude computations outside of a set boundary.
Although this reduces the real run time speed, it does not
greatly affect the asymptotic complexity.
Time = O(kn), where k is the width of the band.

1.2.6 COMPUTING PATH WITH LINEAR SPACE
When running dynamic algorithms with quadratic space complexity and quadratic time
complexity, memory resources usually limit computation before processor cycles.
As explained in 1.2.1 it is possible to calculate the optimal score using dynamic programming in linear space. It is possible to modify the algorithm to also return the path in linear space, but at the expense of roughly doubling the required computations.

- Score only: Time = O(nm), Space = O(n + m)
- Path without optimisation: Time = O(nm), Space = O(nm)
- Path with optimisation: Time = O(2nm) = O(nm), Space = O(n + m)

The desired space complexity is achieved by finding where the longest path crosses the middle line before recursively subdividing the problem.
Method:
1. Split the matrix into 2.
2. Run the algorithm on the first half of the matrix, remembering the values of the final column (prefix values).
3. Run the algorithm in reverse on the second half of the matrix, remembering the values of the final column (suffix values).
4. Find the greatest length where the path crosses the middle line. This is the middle vertex of the optimal path.
   Length(i) = Prefix(i) + Suffix(i)
5. Now that we have our mid-point, recurse on two subsections of the matrix: the upper-left and lower-right regions.
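Steps 2 to 4 can be sketched as follows, assuming a simple match/mismatch/indel scoring (the function names and parameters are mine). `nw_score` is the linear-space, score-only alignment from 1.2.1; `middle_vertex` combines a forward and a reversed pass to locate where the optimal path crosses the middle line:

```python
def nw_score(v, w, match=1, mismatch=-1, d=1):
    """Last row of the Needleman-Wunsch matrix, computed in linear space."""
    prev = [-y * d for y in range(len(w) + 1)]
    for x in range(1, len(v) + 1):
        cur = [-x * d] + [0] * len(w)
        for y in range(1, len(w) + 1):
            s = match if v[x - 1] == w[y - 1] else mismatch
            cur[y] = max(prev[y] - d, cur[y - 1] - d, prev[y - 1] + s)
        prev = cur
    return prev

def middle_vertex(v, w):
    """Where the optimal global alignment crosses row len(v) // 2."""
    mid = len(v) // 2
    pre = nw_score(v[:mid], w)                     # prefix values
    suf = nw_score(v[mid:][::-1], w[::-1])[::-1]   # suffix values, re-reversed
    total = [p + s for p, s in zip(pre, suf)]      # Length(i) = Prefix(i) + Suffix(i)
    j = max(range(len(total)), key=total.__getitem__)
    return mid, j
```

A full Hirschberg implementation would now recurse on `v[:mid], w[:j]` and `v[mid:], w[j:]` (step 5), concatenating the two half-paths.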
1.2.7 BLOCK ALIGNMENT

So far all of the algorithms have required O(n²) time to align two sequences of length n. The idealistic algorithm for aligning two sequences would be O(n) in time, but this has yet to be achieved and the lower bounds of the Global Alignment Problem remain unknown.
To reduce the required computation time, blocks can be compared instead of individual letters. This will only provide an approximation to the longest common substring algorithm, as only the corners of blocks are considered. The block alignment algorithm only allows the longest sub-expression path to enter a block through its corners. Accuracy is lost in favour of speed.
This is achieved by splitting two DNA sequences, u and v say, into blocks of length t, so

u = |u_1 … u_t| |u_{t+1} … u_{2t}| … |u_{n-t+1} … u_n|
v = |v_1 … v_t| |v_{t+1} … v_{2t}| … |v_{n-t+1} … v_n|
We create our alignment matrix of size n/t × n/t and populate the edge values. The horizontal and vertical edges will represent insertions or deletions of whole blocks and will have the usual penalty constant, d. The value of the diagonal edges will be equal to the alignment score of the two sub-blocks.

Weighting:
- vertical edges: -d
- horizontal edges: -d
- diagonal edge into (i,j): Needleman-Wunsch(u_i, v_j)
Block-Alignment (u,v,n)
for x=1 to n/t
for y=1 to n/t
edge(x-1,y-1) to (x,y) = Needleman-Wunsch (ux, vy)
S0,0=0
for x=1 to n/tSx,0 = -x*d
for y=1 to n/t
S0,y = -y*d
for x=1 to n/t
for y=1 to n/t
A = Sx,y-1 - d
B = Sx-1,y - d
C = Sx-1,y-1+edge (x-1,y-1) to (x,y)
Sx,y = max (A,B,C)
Return Si,j
Where u and v are DNA sequences, n is the length of the DNA sequences, t is the length of a block, u_x is the x-th block of string u, and S is our scoring matrix.

If computations resulting from the first nested for-loop (which calculates the alignment of blocks) are ignored, then the time complexity is significantly improved. This is only possible when sub-block alignment values have already been computed. This may well be the case if we are examining the same DNA strings but with different penalties for insertions and deletions.

Time = O((n/t)²) = O(n²/t²)

In the case where penalties are not being adjusted, the alignment score may already be stored in a pre-computed lookup table. Access to such a table will take O(log n).

Time = O((n²/t²) log n)

It may not always be the case that we have pre-computed block alignment values. In this case, the cost of running the initialisation step that calculates the diagonal edge scores makes no improvement on our initial algorithm.
Time = O((n/t)² · t²) = O(n²). This can be improved using the Four Russians technique.
1.2.8 FOUR RUSSIANS BLOCK ALIGNMENT SPEEDUP

The Four Russians technique is very similar to the block alignment algorithm, achieving a significant reduction in the time complexity.

To achieve this goal, the block length is chosen as t = (log n)/4, where n is the sequence length and 4 is the number of letters in the alphabet of DNA.

Also, instead of calculating a lookup table of alignment values of size (n/t)², a lookup table over all pairs of possible blocks is computed, of size 4^t × 4^t = n. Computing the initial values of the lookup table now only takes O(n) when t is bound to (log n)/4.

Time = O(n + (n²/t²) log n) = O(n²/log n)
1.2.9 FOUR RUSSIANS TECHNIQUE - LONGEST COMMON SUB-EXPRESSION

Recall that the block alignment algorithm only allows the longest sub-expression path to enter blocks through their corners. When alignment scores for sub-blocks are calculated, only the corner values are stored as points of interest.

The longest common sub-expression algorithm can take any path through the matrix. By extending the Four Russians block alignment speedup, making every point along the side of a block a point of interest, unrestricted entry and exit between blocks is possible. Instead of performing dynamic programming on the corner vertices of blocks, dynamic programming is used on all edge vertices, ignoring internal vertices. This totals O(n²/t) vertices.

Again, the Four Russians technique is used to create a lookup table that stores all of these values. In essence we are interested in the following problem:
given the alignment scores in the first column and first row of a block and the two strings, compute thealignment scores in the last row and last column.
This poses a problem. What are the scores of the first row and column? This clearly
varies depending on the path taken through the matrix. As a result the values of all
possible combinations of first row and column values are calculated for all
combinations of strings. This could clearly be an enormous lookup table if there were
a large number of possible first row and column initial value combinations.
By careful observation of the LCS problem, we can see that the initial values of the first row or column of any block are not entirely arbitrary. Recall that a match scores 1 and an insertion or deletion scores 0. The alignment scores in LCS are monotonically increasing and adjacent elements cannot differ by more than 1. Therefore there are only 2 possible values for each successive entry of an initial row or column. There are also only 4^t possible strings (due to the DNA alphabet size of 4). Therefore we can very efficiently compute the lookup values:

2^t × 2^t × 4^t × 4^t = 2^(6t)

Given that t = (log n)/4 due to the Four Russians technique:

2^(6t) = 2^((6/4) log n) = n^1.5

Our initialisation step is now sub-quadratic. As a result the overall time complexity is dominated by the dynamic programming algorithm.

Time = O(n²/log n)

1.2.10 NUSSINOV ALGORITHM
The Nussinov algorithm finds the optimal secondary structure of RNA (right). This is essentially its 3D representation.¹

The Nussinov algorithm has two stages. The first fills the dynamic array with scores. The second uses these scores to trace the secondary structure of the RNA.

The first fill stage is based on the LCS dynamic algorithm. One noticeable difference is that it is not necessary to fill the entire table (see left).

The biological side of the algorithm specifies rules for

¹ http://bioinf.kvl.dk/~gorodkin/teach/bioinf2004/talk4_nov5.pdf
what is a valid pairing of letters. This is based on the Watson-Crick base pairs.

δ(i,j) = 1 if the i-th and j-th letters of the string are a valid pairing, 0 otherwise

Nussinov Fill

Given a subsequence with L letters (s_1, …, s_L)
for i=2 to L
  S(i,i-1) = 0
for i=1 to L
  S(i,i) = 0
for all subsequences of length 2 to length L
  A = S(i+1,j)
  B = S(i,j-1)
  C = S(i+1,j-1) + δ(i,j)
  D = max over i < k < j of S(i,k) + S(k+1,j)
  S(i,j) = max(A,B,C,D)
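The fill stage maps to Python as below. The pairing set is restricted to Watson-Crick pairs as in the text (real RNA folding usually also admits G-U wobble pairs), and the zero-based indexing is my own:

```python
def nussinov(seq, pairs=frozenset({"AU", "UA", "GC", "CG"})):
    """Maximum number of nested valid base pairs in an RNA string."""
    L = len(seq)
    S = [[0] * L for _ in range(L)]
    for span in range(1, L):                    # fill by subsequence length
        for i in range(L - span):
            j = i + span
            best = max(S[i + 1][j],             # case A: i is unpaired
                       S[i][j - 1])             # case B: j is unpaired
            if seq[i] + seq[j] in pairs:        # case C: i pairs with j
                inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            for k in range(i + 1, j):           # case D: bifurcation at k
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][L - 1] if L else 0
```

Only the upper triangle of S is filled, matching the remark above that the entire table is not needed; the traceback stage would follow the same four cases in reverse from S(1, L).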
1.2.11 BLAST (MULTIPLE ALIGNMENT)

It is often the case that we wish to align far more than just 2 strings. If we were to use the previous dynamic algorithms for the alignment of k strings, we would just require the allocation and analysis of a k-dimensional matrix.

For k sequences of length n, each cell has 2^k - 1 possible predecessors. If we recall the time complexity of the LCS algorithm but extend it into k dimensions we get:

Time Complexity (k-dimension LCS) = O((2^k - 1) n^k) = O(2^k n^k)

Unfortunately, due to the exponential running time this becomes impractical and unusable very quickly.
This problem has led to the development of algorithms such as BLAST (basic local
alignment search tool). BLAST greatly improves the speed at which strings can be
compared but at the expense of approximated results.
A BLAST search enables a researcher to compare a query sequence against a library or
database of sequences, and identify library sequences that resemble the query sequence
above a certain threshold. For example, following the discovery of a previously
unknown mouse gene, a scientist will typically perform a BLAST search of the human
genome to see if humans carry a similar gene; BLAST will identify sequences in the
human genome that resemble the mouse gene based on similarity of sequence. BLAST
has become one of the most widely used bioinformatics algorithms due to its emphasis
on speed over sensitivity.
where n is the query string, m is the length of the substrings (words), D is the database of known substrings and k is the threshold value.
There are many variations of the basic BLAST algorithm. The original algorithm found each substring w of length m in the dictionary D, then performed basic local alignment around the string until the threshold was exceeded. In the basic case, the local alignment only acknowledged matches and mismatches. The algorithm
BLAST (n,m,D,k)
{
  For all words, w, of length m from the query string n
  {
    Match w in database D
    IF match was found
    {
      Perform local alignment around w until our score
      falls below threshold k
    }
  }
}
terminated when either the alignment score fell below the threshold k, or the ratio of matches to mismatches fell below a second tolerance.
An alternative and newer approach is gapped BLAST. This has the same basic structure, but the local alignment stage also allows for some insertions and deletions. We score the local alignment in the usual way and terminate when the score becomes less than the threshold value. This results in some deviation from the perfect diagonal line in
our matrix.
There are many more variations on this algorithm tailored to different kinds of pattern
matching and different biological applications. The effectiveness of the algorithm
varies drastically with the choice of input variables. Some prior knowledge about the
strings being compared can greatly improve BLAST's results.
1.2.12 PATTERN HUNTER (MULTIPLE ALIGNMENT)

PatternHunter is a variation on BLAST which provides increased sensitivity and increased speed. BLAST only matches consecutive sequences of length m during the dictionary lookup. PatternHunter introduces the concept of a spaced seed, providing greater flexibility.
Consider the BLAST seed mask of length 11
11111111111
Here the seed represents the fact that we want to match all 11 consecutive characters
in the search string with 11 consecutive characters in the dictionary. If there had been
a single mutation of any character in the search string BLAST would not pick up anymatch in the dictionary.
Now consider the PatternHunter spaced seed of weight 11

110111001110111

Here a 0 represents "don't care". This still matches 11 characters in the search string against 11 characters in the dictionary, but the matched positions are not consecutive. The spaced seed models are defined before the algorithm is run.
This algorithm provides a higher hit probability and a lower expected number of
random hits.
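A naive illustration of spaced-seed matching follows. Real implementations hash the sampled positions into an index; this quadratic scan (with names of my choosing) only demonstrates the "don't care" semantics:

```python
def seed_positions(seed, query, database):
    """All (query_pos, db_pos) pairs where the spaced seed matches.

    Every '1' position of the seed must agree between the two strings;
    '0' positions are "don't care".
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    hits = []
    for q in range(len(query) - span + 1):
        for d in range(len(database) - span + 1):
            if all(query[q + i] == database[d + i] for i in care):
                hits.append((q, d))
    return hits
```

With the seed "101", the strings "ACA" and "AGA" hit despite differing in the middle position; the consecutive seed "111" would miss them, which is exactly the sensitivity gain described above.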
1.2.13 BLAT (MULTIPLE ALIGNMENT)

BLAT (BLAST-like alignment tool) is just an inversion of the BLAST algorithm.
Instead of building an index from the query string and scanning through the database,
we build an index from the database and scan linearly through the query sequence.
This results in a significant runtime performance increase because the index fits in RAM, which has much faster access.
1.3 TREES
Phylogenetic trees are often used in bioinformatics to represent how species or
sequences are related. They can represent how many organisms have evolved or how
strings have mutated.
There are two main types of tree. In rooted trees the root position is the common
ancestor of all sequences (A - E in the figure). Branch lengths can be used to indicate
the amount of divergence/change. Un-rooted trees contain no information about a
hypothetical common ancestor. Branch lengths still reflect degree of divergence.
1.3.1 PARSIMONY

Trees can be constructed in many ways. One of the most common ways to build a tree
is using minimum parsimony. Parsimony is a measure of how complex the tree is. In
our case trees with fewer mutations between parent and child have a lower parsimony
score. Minimum parsimony is the simplest set of assumptions or mutations that can
explain an observation.
The simplest way of generating a parsimony score is to use the Hamming distance
(explained earlier). This is known as small parsimony. In many cases it may be desirable to create a lookup matrix which assigns different scores to different kinds of mutations, known as weighted parsimony. This way, common mutations can
have lower penalties than unusual mutations. Small parsimony is just a special case of
the weighted parsimony lookup table where the diagonal values are 0 and all others are
1.
1.3.1.1Sankoff Algorithm
The Sankoff algorithm is a way of evaluating the weighted small parsimony problem.

Given a tree T with each leaf labelled with a letter from a k-letter alphabet and a scoring matrix δ, output a labelling of the internal vertices of T that minimises the weighted parsimony score.
Sankoff Algorithm

Given a tree T = (V,E) with leaf nodes labelled with letters from an alphabet A, assign to each node x an integer cost c_x(t) for each letter t in A, recursively, starting with the leaf nodes, as follows:

- at each leaf node x, let c_x(t) = 0 if x is labelled t, and ∞ otherwise
- at each internal node x, let
  c_x(t) = Σ over children y of x of ( min over t' in A of ( c_y(t') + δ(t, t') ) )

where δ(t, t') is the cost of mutating from letter t to letter t' based on the parsimony scoring matrix.

Once all costs have been assigned, we then label each internal node with a letter as follows (backtracking stage):

- For the root node r, let its label t_r be the t in A minimising c_r(t).
- Then, for every already labelled node x with letter t_x, label each child y of x with the t in A minimising c_y(t) + δ(t_x, t).

Time = O(nk²) for n nodes and an alphabet of size k.

If we were to use a small parsimony scoring matrix, then min over t of c_r(t), where r is the root node, would be the minimum number of mutations that would explain the tree. If we use weighted parsimony, then min over t of c_r(t) would be the minimum parsimony score possible explaining the tree. The assigned labels are one possible assignment that exhibits this minimum parsimony score. There may be many label variations that produce the same parsimony score.
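The forward (cost-filling) pass of Sankoff can be sketched in Python. The dictionary-based tree encoding and function signature are my own assumptions:

```python
INF = float("inf")

def sankoff(tree, labels, alphabet, cost):
    """Weighted small parsimony on a rooted binary tree.

    tree:     internal node -> (left child, right child)
    labels:   leaf node -> observed letter
    cost:     cost[a][b] = penalty for mutating letter a into letter b
    Returns (minimum parsimony score, per-node letter-cost tables).
    """
    c = {}

    def fill(v):
        if v in labels:                       # leaf: 0 for its letter, inf otherwise
            c[v] = {t: (0.0 if t == labels[v] else INF) for t in alphabet}
            return
        left, right = tree[v]
        fill(left)
        fill(right)
        c[v] = {t: min(c[left][u] + cost[t][u] for u in alphabet)
                 + min(c[right][u] + cost[t][u] for u in alphabet)
                for t in alphabet}

    # the root is the internal node that is nobody's child
    children = {x for kids in tree.values() for x in kids}
    root = next(iter(set(tree) - children))
    fill(root)
    return min(c[root].values()), c
```

For the four-leaf tree ((A,A),(T,T)) with unit mutation costs, the minimum score is 1: a single mutation on the edge between the two internal nodes explains all the leaves.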
1.3.1.2 Fitch's Algorithm

The Fitch algorithm is very similar to Sankoff. It also finds the minimum parsimony of a tree, but only for small parsimony. Fitch produces an identical node labelling to Sankoff given that a non-weighted small parsimony scoring matrix is used.
Given two strings v and w, we can find the most likely way that v mutated into w using small parsimony. We begin by creating a binary tree where only the leaf nodes are labelled.

Given a binary tree T = (V,E), with leaf nodes labelled with letters from an alphabet A, assign to each node x in V a set of letters S_x ⊆ A recursively, starting with the leaf nodes, as follows:

- For a leaf node x with label t, let S_x = {t}
- For an internal node x with children y and z, let
  S_x = S_y ∩ S_z if the intersection is non-empty, and S_y ∪ S_z otherwise

Now label each internal node with a single letter (backtracking stage):

- Label the root node r with any t in S_r.
- Then, for every already labelled node x with letter t_x, label each child y of x with t_x if t_x is in S_y, and with any letter from S_y otherwise.

Time = O(nk) for n nodes and an alphabet of size k.

The labelling produced exhibits the minimum number of mutations that can explain the original tree. There may be label combinations that produce the same score.
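Both passes of Fitch fit in a few lines of Python (the tree encoding mirrors the Sankoff sketch above and is my own; ties in the backtracking stage are broken by `min` for determinism). Each time the children's sets fail to intersect, one mutation is forced, so counting those unions gives the parsimony score:

```python
def fitch(tree, labels, root):
    """Small parsimony by Fitch's algorithm on a rooted binary tree.

    tree:   internal node -> (left child, right child)
    labels: leaf node -> its letter
    Returns (number of mutations, a letter assignment for every node).
    """
    sets, assign = {}, {}
    count = 0

    def up(v):                                  # forward (leaf-to-root) pass
        nonlocal count
        if v in labels:
            sets[v] = {labels[v]}
            return
        left, right = tree[v]
        up(left)
        up(right)
        inter = sets[left] & sets[right]
        if inter:
            sets[v] = inter                     # children agree
        else:
            sets[v] = sets[left] | sets[right]  # disagreement: mutation forced
            count += 1

    def down(v, parent_letter):                 # backtracking (root-to-leaf) pass
        assign[v] = parent_letter if parent_letter in sets[v] else min(sets[v])
        if v in tree:
            for child in tree[v]:
                down(child, assign[v])

    up(root)
    down(root, min(sets[root]))
    return count, assign
```

On the same ((A,A),(T,T)) example the count is 1, matching the Sankoff result with a unit scoring matrix.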
1.3.2 LARGE PARSIMONY PROBLEM

This is very similar to the previous small parsimony problem, but instead of calculating the parsimony score of a single tree we want to find, over all tree topologies, the one with the minimum parsimony score for a multiple string alignment.

Given a multiple alignment M = {s_1, …, s_k}, its parsimony score is the minimum, over all trees T whose leaves are labelled with the strings of M, of the sum of δ(u,v) over all edges (u,v) in T. The Large Parsimony Problem is to compute this score.

Potentially, we need to consider all (2n - 5)!! possible un-rooted trees or (2n - 3)!! possible rooted trees. Unfortunately, in general this can't be avoided and the maximum parsimony problem is known to be NP-hard.
Exhaustive enumeration of all possible tree topologies will only work for n ≤ 10, say.
Thus, we need more efficient strategies that either solve the problem exactly or return
good approximations with heuristic searches. These algorithms are not covered here.
1.3.3 DISTANCE

The distance between two nodes in a tree (tree distance) is the sum of all the edge weights along the shortest path between the nodes, and is always greater than or equal to the edit distance between the sequences at those nodes.

We use the notation d(v,w) to represent the edit distance between nodes v and w, and d_T(v,w) to represent the tree distance between nodes v and w.

Given n strings we can easily create an n × n matrix, D, representing the edit distance between strings i and j. Our distance-based algorithms will produce distance trees that best fit the distance matrix.

In an ideal (and optimal) case:

D_ij = d_T(i,j)

Such optimality is always possible when we have trees with no more than 3 leaves, but this is rarely the case when the number of leaves is > 3. Matrices for which such a tree exists are said to be additive and yield a simple solution; for three leaves i, j and k, the edge length from leaf i to the internal vertex c is

d_ic = (D_ij + D_ik - D_jk) / 2
A special case of tree distances is a degenerate triple. A degenerate triple is a set of
three distinct elements i, j, k such that D_ij + D_jk = D_ik. If a distance matrix D
has a degenerate triple i, j, k then j can be removed from D, thus reducing the size
of the problem. If distance matrix D does not have a degenerate triple, one can
create a degenerate triple in D by shortening all hanging edges (in the tree).
1.3.3.1 UPGMA
UPGMA (Un-weighted Pair Group Method using Arithmetic averages) is the first of
our best-fit-tree distance algorithms; it uses iterative clustering to create a
hierarchical phylogenetic tree.
It uses an n × n pair-wise distance matrix D, where n is the number of sequences and D_ij is the distance between sequences i and j. As explained earlier, there are many ways of calculating the distance between strings but UPGMA usually uses the edit
distance.
On each iteration the algorithm combines sequences to create clusters. When
calculating the distance to or from a cluster we average (mean) the distance values of
all possible combinations of sequence pairs, (p, q) say, where p is a sequence from the first cluster and q is a sequence from the other:
    D(C1, C2) = (1 / (|C1| |C2|)) × sum of d(p, q) over p in C1, q in C2
where |C| is the number of elements in cluster C.

UPGMA Algorithm
1. Begin with n sequences and populate an n × n distance matrix D where for all i, j: D_ij = d(i, j)
2. Find the smallest value of D_ij
3. Combine sequences i and j creating a cluster C
4. Create a new node c in our tree with child nodes i and j at height D_ij / 2
5. Repeat steps 2 to 4 with n − 1 sequences as input until only 2 sequences remain. This new input will be the same as our original with sequences i and j removed and cluster C added (the distance to a cluster is the average distance to the sequences in the cluster)
6. Place the root midway between the two remaining clusters
T = O(n²)
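A minimal UPGMA sketch in Python, assuming pairwise distances are already given as a dict keyed by sequence-name pairs; the merge-event output format is my own choice, not the course's:

```python
def upgma(dist):
    """UPGMA sketch: repeatedly merge the two closest clusters, scoring
    cluster pairs by the average of all pairwise sequence distances.
    dist: {(name_a, name_b): distance}. Returns merge events
    (new_name, left, right, height)."""
    def d(a, b):  # look up a distance in either key order
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    clusters = {name: [name] for pair in dist for name in pair}
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best, ci, cj = float('inf'), None, None
        for x in range(len(names)):
            for y in range(x + 1, len(names)):
                i, j = names[x], names[y]
                # average linkage between cluster i and cluster j
                avg = (sum(d(p, q) for p in clusters[i] for q in clusters[j])
                       / (len(clusters[i]) * len(clusters[j])))
                if avg < best:
                    best, ci, cj = avg, i, j
        new = ci + cj
        merges.append((new, ci, cj, best / 2))  # new node at height D_ij / 2
        clusters[new] = clusters.pop(ci) + clusters.pop(cj)
    return merges
```

With distances A–B = 2, A–C = 4, B–C = 4, the sketch merges A and B at height 1, then joins C at height 2, i.e. all leaves sit at the same level as the assumptions below require.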
UPGMA assumes that:
1. all leaves are placed at the same level;
2. when two sequences are joined in a cluster, the common ancestor is equidistant from each sequence;
3. the "molecular clock" rate of evolution is the same on each branch of the tree.
Unfortunately these assumptions can cause many inaccuracies. Trees that should look
like (c) might end up like (d). It is the final assumption (3) that causes the erroneous
results. This anomaly can be solved using the neighbour joining algorithm.
1.3.3.2 Neighbour Joining
The Neighbour Joining algorithm gets around the pitfalls of UPGMA as no
assumption is made about the mutation rate. Like UPGMA it is a bottom-up clustering
algorithm that produces phylogenetic trees and calculates the length of branches. Its
results differ in that the trees produced are non-hierarchical.
The algorithm starts with a star tree, where each node represents a sequence. One
chooses the two nodes with shortest distance and connects them with a new internal
node. The distance could be the difference in percent between the two sequences.
When this is done two new nodes with the smallest distance are picked out and
connected with another new node. This will continue until the whole star is resolved.
On each iteration, Neighbour Joining populates a second matrix, D*, based on the distance matrix. The new distance values are calculated such that:
    D*_ij = D_ij − (u_i + u_j) / (n − 2),   where u_i = sum over k of D_ik
and k ranges over all nodes that remain in the star.
Neighbour Joining
1. Begin with n sequences and populate our n × n distance matrix D where for all i, j:
   D_ij = d(i, j)
2. Populate a matrix D* such that
   D*_ij = D_ij − (u_i + u_j) / (n − 2), where u_i = sum over k of D_ik
3. Find the smallest value of D*_ij
4. Combine sequences i and j creating a cluster C
5. Create a new node c in our tree with child nodes i and j.
6. Repeat steps 2 to 5 with n − 1 sequences as input until only 3 sequences or clusters remain. This new input will be the same as our previous with nodes i and j removed and node c added.
T = O(n⁵)
Every time we create a cluster we are required to recompute the whole of our D* matrix to account for fast-evolving edges. Calculating the position of the cluster is very costly, yielding O(n⁵) complexity: for each stage we are required to calculate n² entries, there are n stages, and each entry requires summing over all the elements of the matrix, O(n²). Therefore, T = O(n⁵).
The complexity can be reduced by introducing a new parameter:
    u_i = sum from k = 1 to n of D_ik
We are only required to calculate u_i and u_j once each round, after which each D*_ij is obtained in O(1). Therefore, by using dynamic programming it is not necessary to sum over all the elements of the matrix for each entry. This reduces the complexity to O(n³).
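The optimised scheme can be sketched as follows, using the equivalent selection criterion (n − 2)·D_ij − u_i − u_j (same argmin as the D* above, scaled by n − 2); the join-event output format is my own:

```python
def neighbour_joining(D, names):
    """Neighbour-joining sketch on a symmetric distance matrix D (list of
    lists). Row sums u_i make each round O(n^2), O(n^3) overall.
    Returns (new_node, child_i, branch_i, child_j, branch_j) join events."""
    D = [row[:] for row in D]
    names = list(names)
    joins = []
    while len(names) > 3:
        n = len(names)
        u = [sum(row) for row in D]             # u_i = sum_k D_ik
        best, bi, bj = float('inf'), 0, 1
        for i in range(n):                      # minimise (n-2)D_ij - u_i - u_j
            for j in range(i + 1, n):
                q = (n - 2) * D[i][j] - u[i] - u[j]
                if q < best:
                    best, bi, bj = q, i, j
        # branch lengths from the joined pair to the new node c
        di = D[bi][bj] / 2 + (u[bi] - u[bj]) / (2 * (n - 2))
        dj = D[bi][bj] - di
        new = '(' + names[bi] + ',' + names[bj] + ')'
        joins.append((new, names[bi], di, names[bj], dj))
        # distances from every remaining node k to the new node
        row = [(D[bi][k] + D[bj][k] - D[bi][bj]) / 2
               for k in range(n) if k not in (bi, bj)]
        keep = [k for k in range(n) if k not in (bi, bj)]
        D = [[D[a][b] for b in keep] for a in keep]
        for r, extra in zip(D, row):
            r.append(extra)
        D.append(row + [0.0])
        names = [names[k] for k in keep] + [new]
    return joins
```

On the additive 4-leaf matrix used in the test, the first (and only) join correctly pairs the cherry A, B with branch lengths 1 and 2.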
Neighbour Joining (Optimised)
1. Begin with n sequences and populate our n × n distance matrix D where for all i, j:
   D_ij = d(i, j)
2. Find the smallest value of D*_ij, where
   D*_ij = D_ij − (u_i + u_j) / (n − 2)  and  u_i = sum from k = 1 to n of D_ik
3. Combine sequences i and j creating a cluster C
4. Create a new node c in our tree with child nodes i and j.
5. Calculate the branch lengths of i and j to c such that:
   d_ic = D_ij / 2 + (u_i − u_j) / (2(n − 2))
   d_jc = D_ij / 2 + (u_j − u_i) / (2(n − 2))
6. Calculate the distance between the new internal node c and each node k in the remaining star such that:
   D_kc = (D_ik + D_jk − D_ij) / 2
   remembering the u values.
7. Repeat steps 2 to 6 with n − 1 sequences as input until only 3 sequences or clusters remain.
T = O(n³)

1.3.4 LIKELIHOOD
Maximum Likelihood evaluates a phylogenetic tree in terms of the probability that the
proposed model of the evolutionary process would give rise to the observed data.
Often it is the case that certain mutations are more common than others. Once a local
phylogeny is constructed, it can be scored according to how well the tree helps explain
the evolutionary path. A minimum distance based tree is only guaranteed to have
maximal likelihood when all mutations have equal probability.
Given a sequence, s, of length n, I will use j to denote an individual nucleotide position.
position j = 1
Sequence s1 = ACTGTCGATCGCGCGCGCGATCG
         s2 = ACTCGATTZCGCAATCGCGATCG
         s3 = ACTGTCACTCCAGATCGCGCGCG
e.g. Given a tree:
We root it at an arbitrary internal node:
For each position j in the sequence we calculate the likelihood L_j of the tree by summing the probabilities of all the possible ancestral-state combinations a:
    L_j = sum over a of P(tree, a)
The overall likelihood of the tree representing the full sequence can be calculated by
taking the product of all the individual cognate-set likelihoods:
    L = L_1 × L_2 × ... × L_n = product from j = 1 to n of L_j
As this is costly to compute and the probability of any individual observation is small,
we take the natural log of the likelihood:
    ln L = ln L_1 + ln L_2 + ... + ln L_n = sum from j = 1 to n of ln L_j
If a tree has a relatively high likelihood score this means that, given the tree and the
model of evolution, the data is a relatively likely outcome. The maximum likelihood
(ML) tree is the tree or trees making the data most likely. There may be many trees
with the same likelihood score.
When scoring a tree we assume that all mutations are independent, allowing us to
calculate the likelihood, L_j, for each position individually.
This can be summarised with the following algorithm:
Calculate Likelihood
1. Root the tree at any internal node (models are time reversible)
2. Calculate L_j for each position j:
   L_j = sum over ancestral states a of P(tree, a)
3. Combine the L_j values to calculate L for the whole tree:
   L = L_1 × L_2 × ... × L_n = product from j = 1 to n of L_j
T = O(n³)

1.3.5 BOOTSTRAPPING ALGORITHM
Due to the quadratic nature of the tree comparison algorithms it is often the case that
they are very expensive and time consuming to run. The bootstrapping algorithm
increases the speed of best-fit multiple alignment algorithms by repeating the construction many times
on small samples and outputting the most frequent result.
Bootstrapping Algorithm
1.Select random columns from a multiple alignment (one
column may appear several times)
2.Build a phylogenetic tree based on the random sample
3.Repeat stages 1 and 2 many times
4.Output the tree that is constructed most frequently
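The four steps above can be sketched as follows; `build_tree` stands for any tree-construction routine returning a hashable tree description, and the toy `cherry` helper (pairing the two closest sequences) is purely illustrative, not from the course:

```python
import random

def cherry(aln):
    """Toy build_tree: return the pair of sequence indices with the smallest
    Hamming distance -- a stand-in for a real tree builder."""
    pairs = [(i, j) for i in range(len(aln)) for j in range(i + 1, len(aln))]
    ham = lambda a, b: sum(x != y for x, y in zip(a, b))
    return min(pairs, key=lambda p: ham(aln[p[0]], aln[p[1]]))

def bootstrap_trees(alignment, build_tree, rounds=100, seed=0):
    """Bootstrapping sketch: resample alignment columns with replacement,
    rebuild a tree each round, and return the most frequent result."""
    rng = random.Random(seed)
    n = len(alignment[0])
    counts = {}
    for _ in range(rounds):
        cols = [rng.randrange(n) for _ in range(n)]   # columns may repeat
        sample = [''.join(seq[c] for c in cols) for seq in alignment]
        tree = build_tree(sample)
        counts[tree] = counts.get(tree, 0) + 1
    return max(counts, key=counts.get)                # most frequent tree
```

Run on three sequences where the first two are near-identical, the most frequent "tree" is the (0, 1) cherry, as expected.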
1.3.6 PRIM'S ALGORITHM
Prim's algorithm finds the minimum spanning tree for a connected weighted graph.
This means it finds a subset of the edges that forms a tree that includes every vertex,
where the total weight of all the edges in the tree is minimized.
Prim's Algorithm
1. Start from a random vertex and make this the root of our tree, T
2. Add the shortest edge connecting a vertex in T to a vertex not in T
3. Repeat step 2 until all nodes are in our tree T
T = O(E log V)
(where E is the number of edges and V the number of vertices)
Whilst the time complexity of this algorithm is highly desirable we cannot construct
meaningful phylogenetic trees with it. When constructing typical phylogenetic trees the input
is made up of leaf nodes. The internal nodes are unknown and have to be generated
during construction of the tree.
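The three steps above can be sketched with a binary heap of candidate edges, which gives the O(E log V) bound quoted; the adjacency-list input format `{vertex: [(weight, neighbour), ...]}` is an assumption of this sketch:

```python
import heapq

def prim_mst(graph, start):
    """Prim's algorithm sketch. graph: {vertex: [(weight, neighbour), ...]}.
    Returns (total_weight, edges) where edges are (weight, vertex) picks."""
    visited = {start}
    heap = list(graph[start])          # candidate edges leaving the tree
    heapq.heapify(heap)
    total, edges = 0, []
    while heap:
        w, v = heapq.heappop(heap)     # cheapest candidate edge
        if v in visited:
            continue                   # both ends already in the tree
        visited.add(v)
        total += w
        edges.append((w, v))
        for edge in graph[v]:          # new candidates from the new vertex
            if edge[1] not in visited:
                heapq.heappush(heap, edge)
    return total, edges
```

On a small 4-vertex graph the sketch returns the expected spanning tree of total weight 4.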
1.4 INFORMATION THEORY AND DNA
Information theory in the context of DNA expresses the amount of information
encoded in a string or observation. The measure of information has unit bits. We usethree types of comparisons:
Entropy: The entropy H(X) of a discrete random variable X is a measure of the
    uncertainty associated with the value of X. For example, if someone
    were to select a random letter from a given alphabet, how many
    binary questions would it take to identify the letter? The entropy of a
    random letter from a 32-letter alphabet can be expressed as H(X) = 5 bits.
Conditional Entropy: The conditional entropy of a variable X is the uncertainty
    associated with the value of X given a random variable Y. If X is an
    independent random variable then the conditional entropy of X given Y,
    H(X|Y), will be the same as the entropy of X, H(X). The conditional
    entropy of a variable will always be less than or equal to the entropy
    of that variable.
Mutual Information: Mutual information measures the amount of information that
    can be obtained about one random variable by observing another.
    This is essentially the difference between learning the value of X
    given Y and learning the value of X without Y:
    I(X; Y) = H(X) − H(X|Y)
1.4.1 INFORMATION CONTENT OF A DNA MOTIF
The information encoded at position j in the motif can be expressed as
    I_j = sum from i = 1 to k of q_ij log2 q_ij − sum from i = 1 to k of q_ij log2 p_i
where, given an alphabet of k possible characters, p_i is the background probability of character i and q_ij is the motif probability of character i at position j. If all characters are equiprobable, the background probability for any letter will be 1/k. In the case of DNA, with a 4-letter alphabet, p_i = 1/4, giving a background contribution of 2 bits.
The information content encoded by a DNA motif of length l is the sum of all the information encoded by its individual positions:
    I = sum from j = 1 to l of I_j
1.4.2 ENTROPY OF MULTIPLE ALIGNMENT
The entropy of a multiple alignment is a measure of the uncertainty of a single column.

Entropy of Multiple Alignment
1. Given an alignment column, we calculate the frequency of occurrence p_i of every possible letter i
2. The entropy of the column is then
    H = − sum over i of p_i log2 p_i
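The two-step calculation above, sketched in Python:

```python
import math

def column_entropy(column):
    """Entropy of one alignment column: H = -sum_i p_i log2 p_i, where p_i
    is the observed frequency of each letter in the column."""
    n = len(column)
    freqs = {c: column.count(c) / n for c in set(column)}
    return -sum(p * math.log2(p) for p in freqs.values())
```

A fully conserved column has entropy 0 bits; a 50/50 column gives 1 bit, and a uniform spread over all four nucleotides gives the maximum 2 bits.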
1.4.3 INFORMATION CONTENT OF A STRING
The information content of a string is the sum of the information content of every position, j, in the string:
    I = sum over j of sum over i of W_ij log2 (W_ij / b_i)
where W_ij is the probability of letter i at position j in our positional weight matrix and b_i is the frequency of i by chance.
1.4.4 MOTIFS
A sequence motif is a nucleotide or amino-acid sequence pattern. This pattern may be
present in many different positions in many different strings. Finding the same motif
in multiple strings often suggests a regulatory relationship between those genes.
However, motif occurrences may not always be exactly the same, as genes may have been turned on and off by regulatory proteins or mutated at non-important bases. A
motif logo graphically illustrates the importance of letters within a motif at each
position. Larger letters are more important than smaller letters. This represents
conserved and variable regions of the motif.
The cumulative size of the letters at a given position is the information content of the alignment column: the background entropy minus the entropy of the column,
    R_j = H_background − H_j
The individual letter size is the information content at its position multiplied by the letter's fraction of occurrence:
    height(i, j) = R_j × f_ij
It is intuitive that larger letters have less variation than smaller letters.
Given a set of motifs it is possible to find the motif consensus. The consensus can be thought of as the ancestor from which all the mutated motifs emerged.
First, align all patterns by their start index and
construct a matrix containing the frequency of
each nucleotide in each column, known as the
profile.
The consensus nucleotide in each position is the
nucleotide with the highest score in each column.
The distance between a real motif and the consensus sequence is generally less than
that for two real motifs.
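Profile construction and consensus extraction, as described above, can be sketched as:

```python
def profile_and_consensus(motifs):
    """Build the profile (per-column nucleotide counts) of aligned motifs
    and read off the consensus: the highest-scoring nucleotide per column."""
    length = len(motifs[0])
    profile = [{c: 0 for c in 'ACGT'} for _ in range(length)]
    for motif in motifs:
        for j, c in enumerate(motif):
            profile[j][c] += 1          # count nucleotide c in column j
    consensus = ''.join(max(col, key=col.get) for col in profile)
    return profile, consensus
```

For three aligned 4-mers the consensus is simply the column-wise majority vote.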
The number of motif occurrences (sites) and how relations extend can be estimated
using Markov Chain theory, e.g. for a three-letter word W = ACG:
    P1(W) = p(A) p(C) p(G)
    P2(W) = p(AC) p(CG) / p(C)
    P3(W) = p(ACG)
All Markov chains estimate the frequency of a word from base composition alone, with
increasing orders producing more accurate results. A Markov chain of order k supposes that the base present at a certain position in a sequence depends only on the
bases present at the k previous positions.
1.4.5 EXHAUSTIVE SEARCH
An exhaustive search generates a motif, w, that best matches a set of sequences, S.
Given a set of sequences S = {s1, ..., st} and a motif defined as w = w1 ... wl, find w such that the match with s1, ..., st is optimal. Whilst there are several ways to define an optimal match, Hamming distance is usually used:
    d(w, s_i) = minimum over start positions p of HammingDistance(w, the l-mer of s_i at p)
    d(w, S) = sum from i = 1 to t of d(w, s_i)
In the case of DNA and RNA, which use 4-letter alphabets, the number of possible
motifs with length l is 4^l. Whilst this approach always finds the best motif it is very costly, with running time:
    T = O(4^l × n × t × l), where n = |s_i|
It is possible to speed up the basic exhaustive search algorithm at the expense of
accuracy. Instead of searching through all alphabet permutations of length l, only
words of length l that occur in some s_i are considered. This only requires O(n² t² l)
but does not always yield an accurate answer: if the motif is weak and doesn't occur exactly in S then
a random motif may have a higher score.
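The full 4^l enumeration can be sketched directly (practical only for small l, as the running time above makes clear):

```python
from itertools import product

def best_motif(seqs, l):
    """Exhaustive motif search sketch: try all 4^l candidate l-mers and keep
    the one minimising d(w, S) = sum_i (best Hamming match of w in s_i)."""
    def d(w, s):  # best match of w against a single sequence
        return min(sum(a != b for a, b in zip(w, s[p:p + l]))
                   for p in range(len(s) - l + 1))

    best, score = None, float('inf')
    for cand in product('ACGT', repeat=l):
        w = ''.join(cand)
        total = sum(d(w, s) for s in seqs)
        if total < score:
            best, score = w, total
    return best, score
```

When the 3-mer AAG occurs exactly in every input sequence, the search finds it with total distance 0.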
1.4.6 GIBBS SAMPLING
Gibbs Motif Sampling identifies motifs (conserved regions) in DNA or protein
sequences, solving the Motif Finding Problem.
Motif Finding Problem: given a set of t DNA sequences each of length n, find the motif with optimal match.
The algorithm uses an iterative random sampling method, increasing the odds that it
will converge to the correct solution.
Gibbs Sampling Algorithm
Method Example
Given the length l of the motif and a set of t sequences:
1.Randomly choose starting positions s = (s1, . . . , st) and form the set of l-mers associated with these starting positions.
2.Randomly choose one of the t
sequences.Sequence 2
3.Create a profile P from the other t − 1 sequences.
4.For each position in the removed sequence, calculate the probability that the l-mer x starting at that position was generated by the profile P:
P(x|P) = product from i = 1 to l of p_{x_i, i}
AAAATTTACCTTAGAAGG 0.000732
AAAATTTACCTTAGAAGG 0.000122
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0.000183
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
5.Create a distribution of probabilities of l-mers, P(x|P), and randomly select a new starting position based on this distribution.
a.To create this distribution,
divide each probability
(|) by the lowestprobability:Position 1: prob(AAAATTTA | P ) = .000732/.000122 = 6Position 2: prob(AAATTTAC | P ) = .000122/.000122 = 1Position 8: prob(ACCTTAGA | P ) = .000183/.000122 = 1.5Ratio = 6 : 1 : 1.5
b.Define probabilities of
starting positions according
to computed ratios
Probability (Position 1): 6/(6+1+1.5)= 0.706Probability (Position 2): 1/(6+1+1.5)= 0.118Probability (Position 8): 1.5/(6+1+1.5)=0.176
c.Select the start position
according to computed ratios:
P(Selecting Starting Position 1): .706P(Selecting Starting Position 2): .118P(Selecting Starting Position 8): .176
6.Repeat steps 2–5 until there is no improvement
1.5 HIDDEN MARKOV MODELS
A Hidden Markov Model (HMM) is a statistical model in which the system being
modelled is assumed to be a Markov process with unobserved state. An HMM is
memory-less: the only thing affecting the next step is the current state.
Definition
Given an alphabet Σ = (b1, b2, ..., bM) and a set of states (1, ..., K), the transition probability from state k to state l is written a_kl.
1. The sum of the transition probabilities out of any state equals 1:
    a_k1 + ... + a_kK = 1, for all states k
2. The sum of all starting state probabilities equals 1:
    a_01 + ... + a_0K = 1
3. Each state has emission probabilities (the probability that the hidden state emits the observable symbol):
    e_k(b) = P(x_i = b | π_i = k), with e_k(b1) + ... + e_k(bM) = 1
Given a sequence x = x1 ... xn and a parse π = π1 ... πn (a sequence of states), we can calculate the parse likelihood:
    P(x, π) = a_{0,π1} × product from i = 1 to n of e_{πi}(x_i) a_{πi,πi+1}
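The parse likelihood above is a straight product of a start probability, emissions, and transitions; a sketch (the fair/biased-coin parameters in the usage below are illustrative only):

```python
def parse_likelihood(x, pi, a0, a, e):
    """P(x, pi) = a0[pi_1] * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}.
    a0: start probabilities, a: transition dict, e: emission dict."""
    p = a0[pi[0]]
    for i in range(len(x)):
        p *= e[pi[i]][x[i]]               # emission at step i
        if i + 1 < len(x):
            p *= a[pi[i]][pi[i + 1]]      # transition to the next state
    return p
```

For a two-state fair/loaded coin model, P('HH', 'FF') = 0.5 × 0.5 × 0.9 × 0.5.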
Hidden Markov models are used to solve three main questions:
1. Evaluation
Given an HMM M and a sequence x,
find P(x|M), the probability that sequence x was generated by the model:
    P(x) = sum over all parses π of P(x, π)
This can be calculated using the Forward dynamic programming algorithm.
2. Decoding
Given an HMM M and a sequence x,
find the sequence π of states that maximises P(x, π). Described algebraically as
    π* = argmax over π of P(x, π)
This is also known as the Viterbi path and is solved using the Viterbi
dynamic programming algorithm.
3. Learning
Given an HMM M with unspecified transition/emission probabilities
and a sequence x,
find the HMM parameters θ = (a_kl, e_k(·)) that maximise P(x|θ).
Solutions for this problem will be omitted.
1.5.1 FORWARD ALGORITHM
The Forward algorithm solves the evaluation problem: the probability of x given the
HMM. This is calculated by summing over all the possible ways of generating x:
    P(x) = sum over π of P(x, π)
To avoid summing over an exponential number of paths π, the forward probability is defined as
    forward probability: f_k(i) = P(x1 ... xi, π_i = k)
Expanding over the state at position i − 1 gives the recurrence
    f_k(i) = e_k(x_i) × sum over l of f_l(i − 1) a_lk
Summing the final column gives
    P(x) = sum over k of f_k(n) a_k0
T = O(K²n)
The main issue with this algorithm is that of underflow. As the numbers become
increasingly small, they often become inaccurate. This can be solved by rescaling at
each position by multiplying by a constant.

Forward Algorithm
Given a sequence x = x1 ... xn and a Markov model M,
Initialise:   f_0(0) = 1;  f_k(0) = 0 for all k > 0
Iterate:      f_k(i) = e_k(x_i) × sum over l of f_l(i − 1) a_lk
Termination:  P(x) = sum over k of f_k(n) a_k0
where a_k0 is the probability that the terminating state is k

1.5.2 VITERBI ALGORITHM
The Viterbi algorithm is a dynamic programming algorithm that solves the decoding
problem: finding the most likely sequence of hidden states (the Viterbi path) that results
in a sequence of observed events.

Viterbi Algorithm
Given a sequence x = x1 ... xn,
Initialise:   V_0(0) = 1;  V_k(0) = 0 for all k > 0
Iterate:      V_l(i) = e_l(x_i) × max over k of [V_k(i − 1) a_kl]
              Ptr_i(l) = argmax over k of [V_k(i − 1) a_kl]
Termination:  P(x, π*) = max over k of V_k(n)
Trace back:   π*_n = argmax over k of V_k(n);  π*_{i−1} = Ptr_i(π*_i)
T = O(K²n)
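Both recurrences can be sketched in Python; for brevity this sketch omits the explicit end state (termination is a plain sum/max over the final column) and the underflow rescaling discussed in the text:

```python
def forward(x, states, a0, a, e):
    """Forward algorithm: f_k(i) = e_k(x_i) * sum_l f_l(i-1) a_lk.
    Returns P(x) as the sum over the final column (no explicit end state)."""
    f = {k: a0[k] * e[k][x[0]] for k in states}
    for sym in x[1:]:
        f = {k: e[k][sym] * sum(f[l] * a[l][k] for l in states)
             for k in states}
    return sum(f.values())

def viterbi(x, states, a0, a, e):
    """Viterbi algorithm: V_k(i) = e_k(x_i) * max_l V_l(i-1) a_lk, with
    back-pointers for the trace-back. Returns the most likely state path."""
    V = {k: a0[k] * e[k][x[0]] for k in states}
    ptrs = []
    for sym in x[1:]:
        nV, ptr = {}, {}
        for k in states:
            best = max(states, key=lambda l: V[l] * a[l][k])
            ptr[k] = best                      # remember the best predecessor
            nV[k] = e[k][sym] * V[best] * a[best][k]
        ptrs.append(ptr)
        V = nV
    path = [max(V, key=V.get)]                 # termination: best final state
    for ptr in reversed(ptrs):                 # trace back through pointers
        path.append(ptr[path[-1]])
    return ''.join(reversed(path))
```

With a two-state model whose "loaded" state strongly favours heads, a run of heads decodes to the all-loaded path, as expected.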
Again, this algorithm is subject to underflow. As the numbers become increasingly
small, they often become inaccurate. This is solved by taking the log of all values
during the iteration stage:
    V_l(i) = log e_l(x_i) + max over k of [V_k(i − 1) + log a_kl]
1.5.3 BACKWARD ALGORITHM
This final algorithm calculates P(π_i = k | x), the probability distribution of the state at position i given x. Like the forward probability, the backward probability can be derived:
    backward probability: b_k(i) = P(x_{i+1} ... x_n | π_i = k)
Conditioning on the state at position i + 1:
    b_k(i) = sum over l of P(x_{i+1} ... x_n, π_{i+1} = l | π_i = k)
           = sum over l of a_kl e_l(x_{i+1}) P(x_{i+2} ... x_n | π_{i+1} = l)
           = sum over l of a_kl e_l(x_{i+1}) b_l(i + 1)
T = O(K²n)

Backward Algorithm
Given a sequence x = x1 ... xn and a Markov model M,
Initialise:   b_k(n) = a_k0, for all k
Iterate:      b_k(i) = sum over l of a_kl e_l(x_{i+1}) b_l(i + 1)
Termination:  P(x) = sum over l of a_0l e_l(x1) b_l(1)
where a_k0 is the probability that the terminating state is k
The underflow problem associated with the backward algorithm is solved in the same
way as for the forward algorithm: at each position we rescale by multiplying by a
constant.
The probability distribution over states at position i can be derived using the forward
and backward probabilities:
    P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) b_k(i) / P(x)
An implementation of this equation combining both the forward and backward
algorithms is often referred to as the forwards-backwards algorithm.
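Combining the two passes gives a posterior-decoding sketch; as in the earlier sketches this omits the explicit end state, so b_k(n) = 1:

```python
def posterior(x, states, a0, a, e):
    """Forwards-backwards sketch: P(pi_i = k | x) = f_k(i) b_k(i) / P(x).
    Returns one {state: probability} dict per position."""
    n = len(x)
    # forward pass: f_k(i)
    f = [{k: a0[k] * e[k][x[0]] for k in states}]
    for i in range(1, n):
        f.append({k: e[k][x[i]] * sum(f[-1][l] * a[l][k] for l in states)
                  for k in states})
    # backward pass: b_k(i), with b_k(n) = 1 (no explicit end state)
    b = [{k: 1.0 for k in states}]
    for i in range(n - 2, -1, -1):
        b.insert(0, {k: sum(a[k][l] * e[l][x[i + 1]] * b[0][l]
                            for l in states) for k in states})
    px = sum(f[-1][k] for k in states)            # P(x) from the forward pass
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
```

Since f_k(i) b_k(i) summed over k equals P(x) at every position, each returned distribution sums to 1.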
2 WORKING WITH MICROARRAYS
A DNA microarray is a multiplex technology consisting of
an array containing thousands of microscopic spots of
DNA. The microarrays measure changes to the genes
under varying conditions, such as time, heat or pressure.
An expression level is estimated by measuring the
amount of mRNA (Messenger RNA) for a particulargene. mRNA is a molecule of RNA which carries
messages to the sites of protein synthesis. More mRNA
usually indicates more gene activity.
Microarray analysis can produce an activity diagram (right).
2.1 CLUSTERING
Clustering is a method of grouping functionally related genes. The genes will be related
based on some distance metric which may combine several factors. Genes are
clustered by plotting points in n-dimensional space, comparing all gene pairs using the distance metric, and grouping genes with small distances.
Given a set V consisting of n points and a parameter k, the k-means clustering problem is to find a set X consisting of k points (cluster centres) that minimises the squared error distortion d(V, X) over all possible choices of X:
    d(V, X) = (sum from i = 1 to n of d(v_i, X)²) / n
where d(v_i, X) is the distance from v_i to its closest centre in X.
2.1.1 LLOYD ALGORITHM (K-MEANS)
The most common algorithm for approximating the k-means problem is the Lloyd
algorithm, which uses an iterative refinement heuristic. The algorithm begins by
partitioning the input points into k initial sets, either at random or using some
heuristic data. It then calculates the mean point of each set. It constructs a new
partition by associating each point with the closest mean before recalculating the
mean points for the new clusters. The algorithm then repeats by alternate application of
these two steps until the results converge, which is obtained when the points no longer
switch clusters (or alternatively the means no longer change).
Lloyd's algorithm uses a heuristic for solving the k-means problem which, when used
with certain combinations of starting points, could converge to the wrong answer. The
Lloyd algorithm has remained popular because it has a very quick running time with
the number of iterations often far less than the number of points.
Lloyd Algorithm
Randomly assign the k cluster centres
While the cluster centres keep changing
    o Assign each data point to the cluster C_i corresponding to the closest cluster centre x_i, where 1 ≤ i ≤ k
    o After the assignment of all data points, compute new cluster representatives according to the centre of gravity of each cluster; that is, the new cluster representative is
      x_i = (sum of v over v in C_i) / |C_i|
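The two alternating steps can be sketched for 2-D points; the starting centres are supplied by the caller (random or heuristic, as the text says):

```python
def lloyd(points, centres):
    """Lloyd's k-means sketch for 2-D points. Alternates assignment and
    centre-of-gravity updates until no point switches cluster."""
    assignment = None
    while True:
        # assignment step: each point joins its closest centre
        new_assignment = [
            min(range(len(centres)),
                key=lambda i: (p[0] - centres[i][0]) ** 2
                            + (p[1] - centres[i][1]) ** 2)
            for p in points]
        if new_assignment == assignment:       # converged: nothing moved
            return centres, assignment
        assignment = new_assignment
        # update step: move each centre to its cluster's centre of gravity
        for i in range(len(centres)):
            cluster = [p for p, c in zip(points, assignment) if c == i]
            if cluster:
                centres[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
```

Two well-separated pairs of points converge in one refinement round.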
2.1.2 GREEDY ALGORITHM (K-MEANS)
The Lloyd algorithm is fast but moves many data points each iteration, possibly resulting in sub-optimal convergence. A more conservative method would be to move one point at a time, and only if it improves the overall clustering cost. The smaller the clustering cost of a partition of data points, the better the clustering. Whilst this may produce better clustering, it is usually much more costly.

Progressive Greedy K-Means(k)
Select an arbitrary partition P into k clusters
while forever
    bestChange ← 0
    for every cluster C
        for every element i not in C
            if moving i to cluster C reduces the clustering cost
                if cost(P) − cost(P_{i → C}) > bestChange
                    bestChange ← cost(P) − cost(P_{i → C})
                    i* ← i;  C* ← C
    if bestChange > 0
        change partition P by moving i* to C*
    else
        return P
2.1.3 CAST (CLUSTER AFFINITY SEARCH TECHNIQUE)
CAST is a clustering algorithm that groups genes based on affinity.
Affinity: A measure of similarity between a gene and all other genes ina cluster.
Threshold Affinity: A user specified criterion for retaining a gene in a
cluster defined as the percentage of the maximum affinity at
that point.
It requires more computing power than k-means, but does not require the number of clusters to be specified beforehand. The algorithm is also consistent, returning the
same result when run several times. The algorithm returns an optimal set of clusters
with diameters near the threshold affinity.
CAST Algorithm
1. Create an empty cluster
2. Set the initial affinity of all genes to 0
3. Move the two most similar genes into the new cluster
4. Update the affinities of all genes, both clustered and un-clustered:
   a(g) ← a(g) + s(g, x), where x is the newly added gene
5. Whilst there remains an unclustered gene whose affinity value is greater than the threshold:
   Add the gene with the highest affinity to the cluster
   Update the affinities of all genes
6. Whilst there remains a clustered gene whose affinity is lower than the current threshold:
   Remove the gene with the lowest affinity from the cluster
   Update the affinities of all genes
7. Repeat steps 5 and 6 until there is no further change
8. Save the cluster, removing points in the cluster from further consideration
9. Repeat steps 1 to 8 with the reduced set of points until all genes have been assigned to a cluster
T = O(n²)

2.1.4 QT CLUSTERING
QT (quality threshold) clustering is an alternative method of partitioning data. Like
CAST, it requires more computing power than k-means and does not require the
number of clusters to be specified beforehand. The algorithm is also consistent,
returning the same result when run several times.
QT Clustering Algorithm
1. Choose a maximum cluster diameter.
2. Choose a gene as the seed for a new cluster.
3. Add the closest point, the next closest, and so on, to the cluster until the diameter surpasses the threshold.
4. Repeat steps 2 and 3 for every gene.
5. Save the cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration.
6. Repeat steps 2 to 5 with the reduced set of points until the last cluster formed has fewer genes than the user-specified number. All genes that are not part of a cluster are unassigned.
T = O(n³), where n is the number of genes
The distance between a point and a group of points is computed using complete
linkage (Jack-knifed distance). This is the maximum distance between the seed and all
other genes in the cluster.
2.1.5 MARKOV CLUSTERING ALGORITHM
Finally, the Markov clustering algorithm randomly walks through the probability graph
described by the similarity matrix to identify clusters of related genes. The basic idea
underlying the algorithm is that the walk takes more probable routes more often.
Dense clusters correspond to regions with a large number of paths.
The algorithm uses three steps:

Markov Clustering Algorithm
1. Given a network with n vertices, take the corresponding adjacency matrix and normalise each column to obtain a stochastic matrix.
2. Take the e-th power of this matrix (expansion).
3. Then take the r-th power of every element (inflation).
The expansion parameter e is often taken equal to 2, while the granularity of the clustering is controlled by tuning the inflation parameter r.

2.2 GENETIC NETWORKS ANALYSIS
A genetic network is a collection of DNA segments in a cell which interact with each
NALYSISA genetic network is a collection of DNA segments in a cell which interact with each
other (indirectly through their RNA and protein expression products (mRNA)) and
with other substances in the cell. This controls the rate at which genes in the network
are transcribed into mRNA. This kind of network can often cause chain reactions.
Assume there are two related genes, A and B. Neither is expressed initially but a third
gene, X, causes A to be expressed which in turn causes B to be expressed. This kind of
reaction can often be thought of as a circuit consisting of logic gates.
Gene activity can affect some genes directly and other genes indirectly, known as the
primary and secondary targets respectively. Our aim is to represent a large genetic
network with gene perturbations in fewer than 2 steps. A perturbation static graph
model is used. This essentially perturbs a gene network one gene at a time, monitoring
the behaviour of the other genes. This identifies direct and indirect gene-gene
relationships. If this were a black-boxed circuit made up of logical elements it would
be called bit twiddling!
Perturbation static graph model
Given a gene network with genes 1 ... n:
STEP 1. For each gene i, compare the control experiment to the perturbed experiment where i is perturbed, to identify differentially expressed genes.
STEP 2. Use the most parsimonious graph that represents the behaviour observed in STEP 1.

In more detail:
STEP 1
(1) Given a gene network
(2) Find the adjacency list
(3) Use the adjacency list to find the accessibility list
STEP 2
(4) Find the most parsimonious graph representing the accessibility list (3)

Assuming the accessibility list is non-cyclic, producing the most parsimonious graph is relatively straightforward. Let Acc(G) be the accessibility list and Adj(G_p) be the adjacency list of an acyclic directed graph G, G_p its most parsimonious graph, and V the set of all nodes of G. Then the following identity holds:
    Adj(G_p) = Acc(G) \ { (i, j) | there exists k in V with (i, k) in Acc(G) and (k, j) in Acc(G) }
Non-Cyclic most parsimonious graph
for all nodes i of G
    Adj(i) ← Acc(i)
for all nodes i of G
    if node i has not been visited
        call PRUNE(i)

PRUNE(i)
    for all nodes j in Adj(i)
        if Acc(j) = ∅
            declare j as visited
        else if j has not been visited
            call PRUNE(j)
    for all nodes j in Adj(i)
        for all nodes k in Acc(j)
            if k in Adj(i)
                delete k from Adj(i)
    declare i as visited

If the accessibility list contains cycles it is not algorithmically possible to
produce a unique graph for the accessibility list. All genes within a cycle affect
all other genes. This is an experimental limitation. Where cycles do occur we
shrink each cycle into a single node and apply the same algorithm as for the
non-cyclic case. When i and j (where i ≠ j) are two nodes in a directed graph,
iff i is in Acc(j) and j is in Acc(i) then they belong to the same component.
This algorithm is limited. It is unable to resolve cyclic graphs and requires far more
data than conventional methods which use gene expression correlations. Also, for an
accessibility list there may be many consistent networks. This algorithm only
constructs the most parsimonious tree which is not necessarily an accurate model.
2.3 SYSTEMS BIOLOGY
Systems biology has become a particular field of interest in recent years (2000
onwards). It is the study of how different biological systems interact with each other
with respect to time. Knowledge about molecules, genes, cells, tissue, organs,
chemicals and much more can be combined to study how they interact.
A simple example of a biological system is our digestive system. We know that when
we eat the energy is not released into our blood stream immediately. The time taken
and the rate at which energy passes through the system has many variables (i.e. sugar
content of the food, metabolic rate, blood sugar levels etc...). Given all of the system
dependencies, we want to be able to describe the likelihood that a given event will
occur.
Ideally, we want to produce a model that represents the flow of events throughout a
system. If our system has a definite series of events we can accurately use differential
equations as our model, but often such systems have a random element making
stochastic algorithms more suitable.
2.3.1 GILLESPIE ALGORITHM
The Gillespie algorithm is a stochastic algorithm for the simulation of genetic
networks. It uses random numbers in order to generate sequences of events and inter-event times in a biological system.
It is assumed that in a genetic network, different proteins/enzymes are
related to each other: they can trigger or inhibit the production of others or
they can react with another substance to form a new substrate. Any possible
reaction i will have a reaction rate, r_i, associated with it (which determines
the probability of this reaction happening at any step). Additionally, the
number of molecules of the substances referred to by the reactions in the environment is
known.
First a random number, a say, is generated in order to decide which reaction happens,
given that the probability of reaction i occurring is
    p_i = r_i / (sum from j = 1 to N of r_j)
where r_i is the reaction rate for reaction i and the denominator normalises by the cumulative
reaction rate of all N known reactions in the system. Reaction i will occur if
    sum from j = 1 to i − 1 of p_j  ≤  a  <  sum from j = 1 to i of p_j
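One step of the method can be sketched as below; the `(rate_fn, update_fn)` reaction interface is my own framing for the sketch, not the course's:

```python
import math
import random

def gillespie_step(counts, reactions, rng):
    """One Gillespie step: pick reaction i with probability r_i / sum_j r_j,
    apply its state change, and return an exponentially distributed time
    advance with mean 1 / (sum_j r_j); None if no reaction can fire."""
    rates = [rate_fn(counts) for rate_fn, _ in reactions]
    total = sum(rates)
    if total == 0:
        return None                       # nothing left to react
    a = rng.random() * total              # the random number choosing reaction i
    acc = 0.0
    for rate, (_, update) in zip(rates, reactions):
        acc += rate                       # walk the cumulative rate intervals
        if a < acc:
            update(counts)                # apply the chosen reaction
            break
    # waiting time ~ Exponential(total); 1 - random() avoids log(0)
    return -math.log(1.0 - rng.random()) / total
```

A single decay reaction A → ∅ with rate proportional to the molecule count fires exactly ten times from ten molecules, then stalls.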