7/31/2019 Bioinformatics Algorithms
BIOINFORMATICS ALGORITHMS

BASED ON THE 2009 TEACHING OF THE CAMBRIDGE COMPUTER SCIENCE PART II BIOINFORMATICS COURSE BY PIETRO LIÒ

Vaughan Eveleigh, Jesus College, Cambridge University
CONTENTS

1 DNA and Protein Sequences
  1.1 Preparation
    1.1.1 Manhattan Tourist
      1.1.1.1 Naive Algorithm
      1.1.1.2 Dynamic Algorithm
  1.2 Strings
    1.2.1 Longest Common Subsequence
    1.2.2 Needleman-Wunsch (Global Alignment)
    1.2.3 Smith-Waterman (Local Alignment)
    1.2.4 Affine Gaps
    1.2.5 Banded Dynamic Programming
    1.2.6 Computing Path with Linear Space
    1.2.7 Block Alignment
    1.2.8 Four Russians Block Alignment Speedup
    1.2.9 Four Russians Technique - Longest Common Sub-Expression
    1.2.10 Nussinov Algorithm
    1.2.11 BLAST (Multiple Alignment)
    1.2.12 Pattern Hunter (Multiple Alignment)
    1.2.13 BLAT (Multiple Alignment)
  1.3 Trees
    1.3.1 Parsimony
      1.3.1.1 Sankoff Algorithm
      1.3.1.2 Fitch's Algorithm
    1.3.2 Large Parsimony Problem
    1.3.3 Distance
      1.3.3.1 UPGMA
      1.3.3.2 Neighbour Joining
    1.3.4 Likelihood
    1.3.5 Bootstrapping Algorithm
    1.3.6 Prim's Algorithm
  1.4 Information Theory and DNA
    1.4.1 Information Content of a DNA Motif
    1.4.2 Entropy of Multiple Alignment
    1.4.3 Information Content of a String
    1.4.4 Motifs
    1.4.5 Exhaustive Search
    1.4.6 Gibbs Sampling
  1.5 Hidden Markov Models
    1.5.1 Forward Algorithm
    1.5.2 Viterbi Algorithm
    1.5.3 Backward Algorithm
2 Working with Microarray
  2.1 Clustering
    2.1.1 Lloyd Algorithm (k-means)
    2.1.2 Greedy Algorithm (k-means)
    2.1.3 CAST (Cluster Affinity Search Technique)
    2.1.4 QT Clustering
    2.1.5 Markov Clustering Algorithm
  2.2 Genetic Networks Analysis
  2.3 Systems Biology
    2.3.1 Gillespie Algorithm
Complexity Summary
1 DNA AND PROTEIN SEQUENCES

1.1 PREPARATION

DNA (deoxyribonucleic acid) uses a 4-letter alphabet (A, T, C, G).
RNA (ribonucleic acid) also uses a 4-letter alphabet (A, U, C, G).
Proteins use a 20-letter amino acid alphabet (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y).
1.1.1 MANHATTAN TOURIST

The problem: given a weighted grid G, travel from the source (top left) to the sink (bottom right) along the highest-scoring path, travelling only south and east.

The solution: the problem can be generalised to finding the longest path from the source to an arbitrary destination.

1.1.1.1 Naive Algorithm

Start at the destination node and calculate which of the immediately adjacent nodes has the highest path score from the source. For each of these edges, recurse.
path(i,j)
  if (i = 0 or j = 0)
    return 0
  else
    X = path(i-1, j) + edge (i-1,j) to (i,j)
    Y = path(i, j-1) + edge (i,j-1) to (i,j)
    return max(X,Y)

Time = O(2^(n+m)), Space = O(n + m) for the recursion stack
Although this exhaustive algorithm produces accurate results it is not efficient. Many
path values are repeatedly computed.
1.1.1.2 Dynamic Algorithm
Dynamic programming improves the naive algorithm by storing the results of previous
computations and reusing them when required at a later stage. The idea behind a
dynamic algorithm is that unnecessary calculations are not re-computed. Although this
significantly improves time complexity, in many cases the space complexity can be
quite demanding.
In the case of the Manhattan tourist problem we only need to store the values of 1 row
and 1 column at any time.
DynamicPath(i,j)
  S0,0 = 0
  for x=1 to i
    Sx,0 = Sx-1,0 + edge (x-1,0) to (x,0)
  for y=1 to j
    S0,y = S0,y-1 + edge (0,y-1) to (0,y)
  for x=1 to i
    for y=1 to j
      A = Sx,y-1 + edge (x,y-1) to (x,y)
      B = Sx-1,y + edge (x-1,y) to (x,y)
      Sx,y = max (A,B)
  Return Si,j

Where Sx,y are the stored values. Time = O(nm), Space = O(n + m).

If our DAG representing the city were to also contain diagonal paths, we would require a third condition in the final for loop.
DynamicDiagonalPath(i,j)
  S0,0 = 0
  for x=1 to i
    Sx,0 = Sx-1,0 + edge (x-1,0) to (x,0)
  for y=1 to j
    S0,y = S0,y-1 + edge (0,y-1) to (0,y)
  for x=1 to i
    for y=1 to j
      A = Sx,y-1 + edge (x,y-1) to (x,y)
      B = Sx-1,y + edge (x-1,y) to (x,y)
      C = Sx-1,y-1 + edge (x-1,y-1) to (x,y)
      Sx,y = max (A,B,C)
  Return Si,j

Time = O(nm), Space = O(n + m). Many of the future algorithms will resemble the Manhattan tourist problem.
1.2 STRINGS
There are several ways by which we can compare the similarity of strings.

- Edit distance (non-trivial): the minimum number of operations (insertions, deletions and substitutions) required to transform one string into another.
- Hamming distance (trivial): the number of positions at which the corresponding characters of two equal-length strings differ.

Consider the two strings v of length 7 and w of length 6:
v : ATCTGAT
w : TGCATA
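To make the two measures concrete, here is a minimal Python sketch (the function names are mine, not from the course):

```python
def hamming(v, w):
    # defined only for strings of equal length
    assert len(v) == len(w)
    return sum(a != b for a, b in zip(v, w))

def edit_distance(v, w):
    # minimum insertions, deletions and substitutions turning v into w
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                    # delete the whole prefix of v
    for j in range(m + 1):
        d[0][j] = j                    # insert the whole prefix of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (v[i - 1] != w[j - 1]))  # substitution
    return d[n][m]
```

Note how the edit-distance table has exactly the Manhattan-grid shape used throughout this section.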
After comparison of string v against w we can count the number of matches, insertions and deletions.
1.2.1 LONGEST COMMON SUBSEQUENCE

Although the Hamming distance is commonly used in computer science, the edit distance is of greater use in biology. By aligning two strings by their longest common subsequence, the minimal distance can be found.

The longest common subsequence is similar to edit distance but only uses insertions and deletions, not substitutions.

This problem can be represented as a hybrid of the Manhattan tourist problem (right) where diagonal paths represent matched characters and horizontal or vertical edges represent edits. Each of the edges is assigned a weighting:

- vertical edges: 0
- horizontal edges: 0
- diagonal edges: 1 where v_x = w_y, 0 otherwise

LongestCommonSubsequence(i,j)
  S0,0 = 0
  for x=1 to i
    Sx,0 = 0
  for y=1 to j
    S0,y = 0
  for x=1 to i
    for y=1 to j
      A = Sx,y-1 + 0
      B = Sx-1,y + 0
      C = Sx-1,y-1 + edge (x-1,y-1) to (x,y)
      Sx,y = max (A,B,C)
  Return Si,j

Time = O(nm)

If we are only concerned with returning the optimal score (and
not the path) the LCS can be calculated with linear space. When
computing the score of any cell in our adjacency matrix only the
scores of cells immediately above, immediately left and
immediately diagonal are required. As a result, historic values
from other computations are not required.
Space = O(n + m)

The algorithm can also be modified to remember the route when deciding which of the paths (A, B or C) to take. This requires allocation of an adjacency matrix of size n × m populated with the score and direction of each cell. This information allows us to backtrack to find which sequence of insertions and deletions generated the score.

Time = O(nm), Space = O(nm)

This is the simplest form of alignment as only insertions and deletions are allowed. This algorithm is rather restrictive, awarding 1 for matches and not penalising indels (abbreviation for insertions and deletions). We will now consider ways in which mismatches can be penalised.
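The LCS recurrence plus the backtracking stage can be sketched as follows. This is an illustrative Python version, not the course's pseudocode; it recovers the route by re-examining scores rather than storing explicit direction flags:

```python
def lcs(v, w):
    """Length of the longest common subsequence of v and w, plus one
    optimal subsequence recovered by backtracking through the matrix."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            if v[x - 1] == w[y - 1]:
                s[x][y] = s[x - 1][y - 1] + 1    # diagonal edge, weight 1
            else:
                s[x][y] = max(s[x - 1][y], s[x][y - 1])
    # backtrack from (n, m) to recover one optimal subsequence
    out, x, y = [], n, m
    while x > 0 and y > 0:
        if v[x - 1] == w[y - 1]:
            out.append(v[x - 1])
            x -= 1
            y -= 1
        elif s[x - 1][y] >= s[x][y - 1]:
            x -= 1                               # deletion from v
        else:
            y -= 1                               # insertion into v
    return s[n][m], "".join(reversed(out))
```

Dropping the backtrack (and keeping only two rows of `s`) gives the linear-space, score-only variant discussed above.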
1.2.2 NEEDLEMAN-WUNSCH (GLOBAL ALIGNMENT)

Global alignment assumes that the two proteins are basically similar over their entire length. The alignment attempts to match them to each other from end to end, even though parts of the alignment may not be very convincing, e.g.

Global alignment penalises insertions or deletions by decreasing the overall alignment score by the value d. We first need to initialise our scoring matrix as we did for the longest common subsequence algorithm.

Weighting:
- vertical edges: -d
- horizontal edges: -d
- diagonal edges: 1 where v_i = w_j, a mismatch penalty otherwise
aaagcggaagtcacag
||.||.||||| |.||
aaggctgaagt-atag
Needleman-Wunsch (i,j)
  S0,0 = 0
  for x=1 to i
    Sx,0 = -x*d
  for y=1 to j
    S0,y = -y*d
  for x=1 to i
    for y=1 to j
      A = Sx,y-1 - d
      B = Sx-1,y - d
      C = Sx-1,y-1 + edge (x-1,y-1) to (x,y)
      Sx,y = max (A,B,C)
  Return Si,j

Time = O(nm), Space = O(n + m) for the score alone, or O(nm) if the path is stored. Again this algorithm could be modified to store the path taken through the matrix.
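As a Python sketch of the recurrence (the scoring parameters `match`, `mismatch` and `d` are assumed values, not fixed by the notes):

```python
def needleman_wunsch(v, w, match=1, mismatch=-1, d=2):
    """Global alignment score of v and w with linear gap penalty d."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):
        s[x][0] = -x * d                  # aligning a prefix of v to gaps
    for y in range(1, m + 1):
        s[0][y] = -y * d                  # aligning a prefix of w to gaps
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            diag = match if v[x - 1] == w[y - 1] else mismatch
            s[x][y] = max(s[x][y - 1] - d,          # gap in v
                          s[x - 1][y] - d,          # gap in w
                          s[x - 1][y - 1] + diag)   # match or mismatch
    return s[n][m]
```

Note that, unlike LCS, the first row and column are initialised to negative values: a global alignment must pay for every leading gap.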
1.2.3 SMITH-WATERMAN (LOCAL ALIGNMENT)

Local alignment searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. E.g.
The Smith Waterman algorithm is based on the Needleman-Wunsch algorithm but
ignores badly aligning regions. It does this by assigning 0 to any cells that would have
been allocated a negative value using Needleman-Wunsch.
Smith-Waterman (i,j)
S0,0=0
for x=1 to i
Sx,0 = 0
for y=1 to j
S0,y = 0
for x=1 to i
for y=1 to j
A = Sx,y-1 - d
B = Sx-1,y - d
C = Sx-1,y-1+edge (x-1,y-1) to (x,y)
D = 0
Sx,y = max (A,B,C,D)
Return Si,j
Time = O(nm), Space = O(nm)
On termination we can use our alignment matrix to find string alignments.
aaagcggaagtcacag
......||||| ....
aaggctgaagt-atag
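The local-alignment recurrence can be sketched in Python; as with the global version, the scoring parameters here are assumptions:

```python
def smith_waterman(v, w, match=2, mismatch=-1, d=1):
    """Best local alignment score: negative-scoring prefixes reset to 0."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            diag = match if v[x - 1] == w[y - 1] else mismatch
            s[x][y] = max(0,                          # restart the alignment here
                          s[x][y - 1] - d,            # gap in v
                          s[x - 1][y] - d,            # gap in w
                          s[x - 1][y - 1] + diag)     # match or mismatch
            best = max(best, s[x][y])                 # answer may end anywhere
    return best
```

The two differences from Needleman-Wunsch are exactly those described in the text: the extra 0 in the max, and taking the answer from the best cell anywhere in the matrix rather than from the bottom-right corner.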
- To find the longest match we are just required to find the cell with the highest value.
- To find alignments greater than a given length and score threshold, valid scores can be found in O(nm) time complexity.
- The alignment can be found from the scores in O(n + m) time complexity.

1.2.4 AFFINE GAPS

Needleman-Wunsch can be optimised further to score gaps more accurately. Gaps are currently being scored uniformly, however they usually occur in clusters. The uniform gap penalty is changed to a function that considers the length of a gap. We implement this using affine gaps.
γ = gap penalty function, d = initial (gap-opening) penalty, e = successive (gap-extension) penalty. Here the first symbol of a gap incurs a penalty of d and each subsequent symbol incurs a penalty of e.

Two alignment matrices are used to record scores:

- F contains scores assuming v_i aligns to w_j
- G contains scores assuming v_i or w_j aligns to a gap

The example below also includes the function s(x,y) that returns a constant if the x-th value of string v aligns with the y-th value of string w. This is just a shorthand notation for the edge function explained earlier.

γ(0) = 0
γ(n) = d + (n - 1)e for a gap of length n ≥ 1
Needleman-Wunsch-Affine (i,j)
  F0,0 = 0
  for x=1 to i
    Fx,0 = -(d+(x-1)*e)
  for y=1 to j
    F0,y = -(d+(y-1)*e)
  for x=1 to i
    for y=1 to j
      A = Fx-1,y-1 + s(x,y)
      B = Gx-1,y-1 + s(x,y)
      Fx,y = max (A,B)
      L = Fx-1,y - d
      M = Fx,y-1 - d
      N = Gx-1,y - e
      O = Gx,y-1 - e
      Gx,y = max(L,M,N,O)
  Return max(Gi,j, Fi,j)
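Affine gaps are often implemented with three matrices (a match state plus one gap state per string), the Gotoh formulation. This Python sketch follows that shape rather than the two-matrix F/G notation above, and the scoring parameters are my own assumptions:

```python
NEG = float("-inf")

def affine_align(v, w, match=1, mismatch=-1, d=4, e=1):
    """Global alignment with affine gaps: open costs d, each extension costs e."""
    n, m = len(v), len(w)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]  # v[x] aligned to w[y]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in w (v[x] vs -)
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in v (- vs w[y])
    M[0][0] = 0
    for x in range(1, n + 1):
        X[x][0] = -d - (x - 1) * e               # leading gap of length x
    for y in range(1, m + 1):
        Y[0][y] = -d - (y - 1) * e
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            s = match if v[x - 1] == w[y - 1] else mismatch
            M[x][y] = max(M[x-1][y-1], X[x-1][y-1], Y[x-1][y-1]) + s
            X[x][y] = max(M[x-1][y] - d,         # open a new gap
                          X[x-1][y] - e)         # extend an existing gap
            Y[x][y] = max(M[x][y-1] - d,
                          Y[x][y-1] - e)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these parameters a single gap of length 2 costs d + e = 5, whereas two separate gaps of length 1 would cost 2d = 8, which is exactly the clustering behaviour the affine penalty is meant to encourage.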
Time = O(nm), Space = O(nm)

1.2.5 BANDED DYNAMIC PROGRAMMING

Provided we know that the strings are similar, the majority of computations can be ignored. For example, if we were comparing two DNA sequences from the same species, the optimal alignment will not deviate far from the perfect diagonal line. As a result we can just exclude computations outside of a set boundary.
Although this reduces the real run time speed, it does not
greatly affect the asymptotic complexity.
Time = O(kn), where k is the width of the band.

1.2.6 COMPUTING PATH WITH LINEAR SPACE
When running dynamic algorithms with quadratic space complexity and quadratic time
complexity, memory resources usually limit computation before processor cycles.
As explained in 1.2.1 it is possible to calculate the optimal score using dynamic programming in linear space. It is possible to modify the algorithm to also return the path in linear space, but at the expense of roughly doubling the required computations.

- Score only: Time = O(nm), Space = O(n + m)
- Path without optimisation: Time = O(nm), Space = O(nm)
- Path with optimisation: Time = O(2nm) = O(nm), Space = O(n + m)

The desired space complexity is achieved by finding where the longest path crosses the middle line before recursively subdividing the problem.
Method:
1. Split the matrix into 2.
2. Run the algorithm on the first half of the matrix, remembering the values of the final column (prefix values).
3. Run the algorithm in reverse on the second half of the matrix, remembering the values of the final column (suffix values).
4. Find the greatest length where the path crosses the middle line. This is the middle vertex of the optimal path.
   Length(i) = Prefix(i) + Suffix(i)
5. Now that we have our mid-point, recurse on two subsections of the matrix: the upper-left and lower-right regions.
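Steps 2 to 4 can be sketched as follows, assuming a simple match/mismatch/indel scoring (the function names and parameters are mine). `nw_score` is the linear-space, score-only alignment from 1.2.1; `middle_vertex` combines a forward and a reversed pass to locate where the optimal path crosses the middle line:

```python
def nw_score(v, w, match=1, mismatch=-1, d=1):
    """Last row of the Needleman-Wunsch matrix, computed in linear space."""
    prev = [-y * d for y in range(len(w) + 1)]
    for x in range(1, len(v) + 1):
        cur = [-x * d] + [0] * len(w)
        for y in range(1, len(w) + 1):
            s = match if v[x - 1] == w[y - 1] else mismatch
            cur[y] = max(prev[y] - d, cur[y - 1] - d, prev[y - 1] + s)
        prev = cur
    return prev

def middle_vertex(v, w):
    """Where the optimal global alignment crosses row len(v) // 2."""
    mid = len(v) // 2
    pre = nw_score(v[:mid], w)                     # prefix values
    suf = nw_score(v[mid:][::-1], w[::-1])[::-1]   # suffix values, re-reversed
    total = [p + s for p, s in zip(pre, suf)]      # Length(i) = Prefix(i) + Suffix(i)
    j = max(range(len(total)), key=total.__getitem__)
    return mid, j
```

A full Hirschberg implementation would now recurse on `v[:mid], w[:j]` and `v[mid:], w[j:]` (step 5), concatenating the two half-paths.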
1.2.7 BLOCK ALIGNMENT

So far all of the algorithms have required O(n²) time to align two sequences of length n. The idealistic algorithm for aligning two sequences would be O(n) in time, but this has yet to be achieved and the lower bounds of the Global Alignment Problem remain unknown.
To reduce the required computation time, blocks can be compared instead of individual letters. This will only provide an approximation to the longest common substring algorithm, as only the corners of blocks are considered. The block alignment algorithm only allows the longest sub-expression path to enter a block through its corners. Accuracy is lost in favour of speed.
This is achieved by splitting two DNA sequences, u and v say, into blocks of length t, so

u = |u_1 … u_t| |u_{t+1} … u_{2t}| … |u_{n-t+1} … u_n|
v = |v_1 … v_t| |v_{t+1} … v_{2t}| … |v_{n-t+1} … v_n|
We create our alignment matrix of size n/t × n/t and populate the edge values. The horizontal and vertical edges will represent insertions or deletions of whole blocks and will have the usual penalty constant, d. The value of the diagonal edges will be equal to the alignment score of the two sub-blocks.

Weighting:
- vertical edges: -d
- horizontal edges: -d
- diagonal edge into (i,j): Needleman-Wunsch(u_i, v_j)
Block-Alignment (u,v,n)
for x=1 to n/t
for y=1 to n/t
edge(x-1,y-1) to (x,y) = Needleman-Wunsch (ux, vy)
S0,0=0
for x=1 to n/tSx,0 = -x*d
for y=1 to n/t
S0,y = -y*d
for x=1 to n/t
for y=1 to n/t
A = Sx,y-1 - d
B = Sx-1,y - d
C = Sx-1,y-1+edge (x-1,y-1) to (x,y)
Sx,y = max (A,B,C)
Return Si,j
Where u and v are DNA sequences, n is the length of the DNA sequences, t is the length of a block, u_x is the x-th block of string u, and S is our scoring matrix.

If computations resulting from the first nested for-loop (which calculates the alignment of blocks) are ignored, then the time complexity is significantly improved. This is only possible when sub-block alignment values have already been computed. This may well be the case if we are examining the same DNA strings but with different penalties for insertions and deletions.

Time = O((n/t)²) = O(n²/t²)

In the case where penalties are not being adjusted, the alignment score may already be stored in a pre-computed lookup table. Access to such a table will take O(log n).

Time = O((n²/t²) log n)

It may not always be the case that we have pre-computed block alignment values. In this case, the cost of running the initialisation step that calculates the diagonal edge scores makes no improvement on our initial algorithm.
Time = O((n/t)² · t²) = O(n²). This can be improved using the Four Russians technique.
1.2.8 FOUR RUSSIANS BLOCK ALIGNMENT SPEEDUP

The Four Russians technique is very similar to the block alignment algorithm, achieving a significant reduction in the time complexity.

To achieve this goal, the block length is chosen as t = (log n)/4, where n is the sequence length and 4 is the number of letters in the alphabet of DNA.

Also, instead of calculating a lookup table of alignment values of size (n/t)², a lookup table over all pairs of possible blocks is computed, of size 4^t × 4^t = n. Computing the initial values of the lookup table now only takes O(n) when t is bound to (log n)/4.

Time = O(n + (n²/t²) log n) = O(n²/log n)
1.2.9 FOUR RUSSIANS TECHNIQUE - LONGEST COMMON SUB-EXPRESSION

Recall that the block alignment algorithm only allows the longest sub-expression path to enter blocks through their corners. When alignment scores for sub-blocks are calculated, only the corner values are stored as points of interest.

The longest common sub-expression algorithm can take any path through the matrix. By extending the Four Russians block alignment speedup, making every point along the side of a block a point of interest, unrestricted entry and exit between blocks is possible. Instead of performing dynamic programming on the corner vertices of blocks, dynamic programming is used on all edge vertices, ignoring internal vertices. This totals O(n²/t) vertices.

Again, the Four Russians technique is used to create a lookup table that stores all of these values. In essence we are interested in the following problem:
given the alignment scores in the first column and first row of a block and the two strings, compute thealignment scores in the last row and last column.
This poses a problem. What are the scores of the first row and column? This clearly
varies depending on the path taken through the matrix. As a result the values of all
possible combinations of first row and column values are calculated for all
combinations of strings. This could clearly be an enormous lookup table if there were
a large number of possible first row and column initial value combinations.
By careful observation of the LCS problem, we can see that the initial values of the first row or column of any block are not entirely arbitrary. Recall that a match scores 1 and an insertion or deletion scores 0. The alignment scores in LCS are monotonically increasing and adjacent elements cannot differ by more than 1. Therefore there are only 2 possible values for each successive entry of an initial row or column. There are also only 4^t possible strings (due to the DNA alphabet size of 4). Therefore we can very efficiently compute the lookup values:

2^t × 2^t × 4^t × 4^t = 2^(6t)

Given that t = (log n)/4 due to the Four Russians technique:

2^(6t) = 2^((6/4) log n) = n^1.5

Our initialisation step is now sub-quadratic. As a result the overall time complexity is dominated by the dynamic programming algorithm.

Time = O(n²/log n)

1.2.10 NUSSINOV ALGORITHM
The Nussinov algorithm finds the optimal secondary structure of RNA (right). This is essentially its 3D representation.¹

The Nussinov algorithm has two stages. The first fills the dynamic array with scores. The second uses these scores to trace the secondary structure of the RNA.

The first fill stage is based on the LCS dynamic algorithm. One noticeable difference is that it is not necessary to fill the entire table (see left).

The biological side of the algorithm specifies rules for

¹ http://bioinf.kvl.dk/~gorodkin/teach/bioinf2004/talk4_nov5.pdf
what is a valid pairing of letters. This is based on the Watson-Crick base pairs.

δ(i,j) = 1 if the i-th and j-th letters of the string are a valid pairing, 0 otherwise

Nussinov Fill

Given a subsequence with L letters (s_1, …, s_L)
for i=2 to L
  S(i,i-1) = 0
for i=1 to L
  S(i,i) = 0
for all subsequences of length 2 to length L
  A = S(i+1,j)
  B = S(i,j-1)
  C = S(i+1,j-1) + δ(i,j)
  D = max over i < k < j of S(i,k) + S(k+1,j)
  S(i,j) = max(A,B,C,D)
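The fill stage maps to Python as below. The pairing set is restricted to Watson-Crick pairs as in the text (real RNA folding usually also admits G-U wobble pairs), and the zero-based indexing is my own:

```python
def nussinov(seq, pairs=frozenset({"AU", "UA", "GC", "CG"})):
    """Maximum number of nested valid base pairs in an RNA string."""
    L = len(seq)
    S = [[0] * L for _ in range(L)]
    for span in range(1, L):                    # fill by subsequence length
        for i in range(L - span):
            j = i + span
            best = max(S[i + 1][j],             # case A: i is unpaired
                       S[i][j - 1])             # case B: j is unpaired
            if seq[i] + seq[j] in pairs:        # case C: i pairs with j
                inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            for k in range(i + 1, j):           # case D: bifurcation at k
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][L - 1] if L else 0
```

Only the upper triangle of S is filled, matching the remark above that the entire table is not needed; the traceback stage would follow the same four cases in reverse from S(1, L).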
1.2.11 BLAST (MULTIPLE ALIGNMENT)

It is often the case that we wish to align far more than just 2 strings. If we were to use the previous dynamic algorithms for the alignment of k strings, we would just require the allocation and analysis of a k-dimensional matrix.

For k sequences of length n, each cell has 2^k - 1 possible predecessors. If we recall the time complexity of the LCS algorithm but extend it into k dimensions we get:

Time Complexity (k-dimension LCS) = O((2^k - 1) n^k) = O(2^k n^k)

Unfortunately, due to the exponential running time this becomes impractical and unusable very quickly.
This problem has led to the development of algorithms such as BLAST (basic local
alignment search tool). BLAST greatly improves the speed at which strings can be
compared but at the expense of approximated results.
A BLAST search enables a researcher to compare a query sequence against a library or
database of sequences, and identify library sequences that resemble the query sequence
above a certain threshold. For example, following the discovery of a previously
unknown mouse gene, a scientist will typically perform a BLAST search of the human
genome to see if humans carry a similar gene; BLAST will identify sequences in the
human genome that resemble the mouse gene based on similarity of sequence. BLAST
has become one of the most widely used bioinformatics algorithms due to its emphasis
on speed over sensitivity.
where n is the query string, m is the length of the substrings (words), D is the database of known substrings and k is the threshold value.
There are many variations of the basic BLAST algorithm. The original algorithm found each substring w of length m in the dictionary D, then performed basic local alignment around the string until the threshold was exceeded. In the basic case, the local alignment only acknowledged matches and mismatches. The algorithm
BLAST (n,m,D,k)
{
  For all words, w, of length m from the query string n
  {
    Match w in database D
    IF match was found
    {
      Perform local alignment around w until our score
      falls below threshold k
    }
  }
}
terminated when either the alignment score fell below the threshold k, or the ratio of matches to mismatches fell below a second tolerance.
An alternative and newer approach is gapped BLAST. This has the same basic structure, but the local alignment stage also allows for some insertions and deletions. We score the local alignment in the usual way and terminate when the score becomes less than the threshold value. This results in some deviation from the perfect diagonal line in
our matrix.
There are many more variations on this algorithm tailored to different kinds of pattern
matching and different biological applications. The effectiveness of the algorithm
varies drastically with the choice of input variables. Some prior knowledge about the
strings being compared can greatly improve BLAST's results.
1.2.12 PATTERN HUNTER (MULTIPLE ALIGNMENT)

PatternHunter is a variation on BLAST which provides increased sensitivity and increased speed. BLAST only matches consecutive sequences of length m during the dictionary lookup. PatternHunter introduces the concept of a spaced seed, providing greater flexibility.
Consider the BLAST seed mask of length 11
11111111111
Here the seed represents the fact that we want to match all 11 consecutive characters
in the search string with 11 consecutive characters in the dictionary. If there had been
a single mutation of any character in the search string BLAST would not pick up anymatch in the dictionary.
Now consider the PatternHunter spaced seed of weight 11

110111001110111

Here a 0 represents "don't care". This still matches 11 characters in the search string against 11 characters in the dictionary, but the matched positions are not consecutive. The spaced seed models are defined before the algorithm is run.
This algorithm provides a higher hit probability and a lower expected number of
random hits.
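A naive illustration of spaced-seed matching follows. Real implementations hash the sampled positions into an index; this quadratic scan (with names of my choosing) only demonstrates the "don't care" semantics:

```python
def seed_positions(seed, query, database):
    """All (query_pos, db_pos) pairs where the spaced seed matches.

    Every '1' position of the seed must agree between the two strings;
    '0' positions are "don't care".
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    hits = []
    for q in range(len(query) - span + 1):
        for d in range(len(database) - span + 1):
            if all(query[q + i] == database[d + i] for i in care):
                hits.append((q, d))
    return hits
```

With the seed "101", the strings "ACA" and "AGA" hit despite differing in the middle position; the consecutive seed "111" would miss them, which is exactly the sensitivity gain described above.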
1.2.13 BLAT (MULTIPLE ALIGNMENT)

BLAT (BLAST-like alignment tool) is just an inversion of the BLAST algorithm.
Instead of building an index from the query string and scanning through the database,
we build an index from the database and scan linearly through the query sequence.
This results in a significant runtime performance increase because the index fits in RAM, which has much faster access.
1.3 TREES
Phylogenetic trees are often used in bioinformatics to represent how species or
sequences are related. They can represent how many organisms have evolved or how
strings have mutated.
There are two main types of tree. In rooted trees the root position is the common
ancestor of all sequences (A - E in the figure). Branch lengths can be used to indicate
the amount of divergence/change. Un-rooted trees contain no information about a
hypothetical common ancestor. Branch lengths still reflect degree of divergence.
1.3.1 PARSIMONY

Trees can be constructed in many ways. One of the most common ways to build a tree
is using minimum parsimony. Parsimony is a measure of how complex the tree is. In
our case trees with fewer mutations between parent and child have a lower parsimony
score. Minimum parsimony is the simplest set of assumptions or mutations that can
explain an observation.
The simplest way of generating a parsimony score is to use the Hamming distance
(explained earlier). This is known as small parsimony. In many cases it may be desirable to create a lookup matrix which assigns different scores to different kinds of mutations, known as weighted parsimony. This way, common mutations can
have lower penalties than unusual mutations. Small parsimony is just a special case of
the weighted parsimony lookup table where the diagonal values are 0 and all others are
1.
1.3.1.1Sankoff Algorithm
The Sankoff algorithm is a way of evaluating the weighted small parsimony problem.

Given a tree T with each leaf labelled with a letter from a k-letter alphabet and a scoring matrix δ, output a labelling of the internal vertices of T that minimises the weighted parsimony score.
Sankoff Algorithm

Given a tree T = (V,E) with leaf nodes labelled with letters from an alphabet A, assign to each node x an integer cost c_x(t) for each letter t in A, recursively, starting with the leaf nodes, as follows:

- at each leaf node x, let c_x(t) = 0 if x is labelled t, and ∞ otherwise
- at each internal node x, let
  c_x(t) = Σ over children y of x of ( min over t' in A of ( c_y(t') + δ(t, t') ) )

where δ(t, t') is the cost of mutating from letter t to letter t' based on the parsimony scoring matrix.

Once all costs have been assigned, we then label each internal node with a letter as follows (backtracking stage):

- For the root node r, let its label t_r be the t in A minimising c_r(t).
- Then, for every already labelled node x with letter t_x, label each child y of x with the t in A minimising c_y(t) + δ(t_x, t).

Time = O(nk²) for n nodes and an alphabet of size k.

If we were to use a small parsimony scoring matrix, then min over t of c_r(t), where r is the root node, would be the minimum number of mutations that would explain the tree. If we use weighted parsimony, then min over t of c_r(t) would be the minimum parsimony score possible explaining the tree. The assigned labels are one possible assignment that exhibits this minimum parsimony score. There may be many label variations that produce the same parsimony score.
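The forward (cost-filling) pass of Sankoff can be sketched in Python. The dictionary-based tree encoding and function signature are my own assumptions:

```python
INF = float("inf")

def sankoff(tree, labels, alphabet, cost):
    """Weighted small parsimony on a rooted binary tree.

    tree:     internal node -> (left child, right child)
    labels:   leaf node -> observed letter
    cost:     cost[a][b] = penalty for mutating letter a into letter b
    Returns (minimum parsimony score, per-node letter-cost tables).
    """
    c = {}

    def fill(v):
        if v in labels:                       # leaf: 0 for its letter, inf otherwise
            c[v] = {t: (0.0 if t == labels[v] else INF) for t in alphabet}
            return
        left, right = tree[v]
        fill(left)
        fill(right)
        c[v] = {t: min(c[left][u] + cost[t][u] for u in alphabet)
                 + min(c[right][u] + cost[t][u] for u in alphabet)
                for t in alphabet}

    # the root is the internal node that is nobody's child
    children = {x for kids in tree.values() for x in kids}
    root = next(iter(set(tree) - children))
    fill(root)
    return min(c[root].values()), c
```

For the four-leaf tree ((A,A),(T,T)) with unit mutation costs, the minimum score is 1: a single mutation on the edge between the two internal nodes explains all the leaves.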
1.3.1.2 Fitch's Algorithm

The Fitch algorithm is very similar to Sankoff. It also finds the minimum parsimony of a tree, but only for small parsimony. Fitch produces an identical node labelling to Sankoff given that a non-weighted small parsimony scoring matrix is used.
Given two strings v and w, we can find the most likely way that v mutated into w using small parsimony. We begin by creating a binary tree where only the leaf nodes are labelled.

Given a binary tree T = (V,E), with leaf nodes labelled with letters from an alphabet A, assign to each node x in V a set of letters S_x ⊆ A recursively, starting with the leaf nodes, as follows:

- For a leaf node x with label t, let S_x = {t}
- For an internal node x with children y and z, let
  S_x = S_y ∩ S_z if the intersection is non-empty, and S_y ∪ S_z otherwise

Now label each internal node with a single letter (backtracking stage):

- Label the root node r with any t in S_r.
- Then, for every already labelled node x with letter t_x, label each child y of x with t_x if t_x is in S_y, and with any letter from S_y otherwise.

Time = O(nk) for n nodes and an alphabet of size k.

The labelling produced exhibits the minimum number of mutations that can explain the original tree. There may be label combinations that produce the same score.
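Both passes of Fitch fit in a few lines of Python (the tree encoding mirrors the Sankoff sketch above and is my own; ties in the backtracking stage are broken by `min` for determinism). Each time the children's sets fail to intersect, one mutation is forced, so counting those unions gives the parsimony score:

```python
def fitch(tree, labels, root):
    """Small parsimony by Fitch's algorithm on a rooted binary tree.

    tree:   internal node -> (left child, right child)
    labels: leaf node -> its letter
    Returns (number of mutations, a letter assignment for every node).
    """
    sets, assign = {}, {}
    count = 0

    def up(v):                                  # forward (leaf-to-root) pass
        nonlocal count
        if v in labels:
            sets[v] = {labels[v]}
            return
        left, right = tree[v]
        up(left)
        up(right)
        inter = sets[left] & sets[right]
        if inter:
            sets[v] = inter                     # children agree
        else:
            sets[v] = sets[left] | sets[right]  # disagreement: mutation forced
            count += 1

    def down(v, parent_letter):                 # backtracking (root-to-leaf) pass
        assign[v] = parent_letter if parent_letter in sets[v] else min(sets[v])
        if v in tree:
            for child in tree[v]:
                down(child, assign[v])

    up(root)
    down(root, min(sets[root]))
    return count, assign
```

On the same ((A,A),(T,T)) example the count is 1, matching the Sankoff result with a unit scoring matrix.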
1.3.2 LARGE PARSIMONY PROBLEM

This is very similar to the previous small parsimony problem, but instead of calculating the parsimony score of a single tree we want to find, over all tree topologies, the one with the minimum parsimony score for a multiple string alignment.

Given a multiple alignment M = {s_1, …, s_k}, its parsimony score is the minimum, over all trees T whose leaves are labelled with the strings of M, of the sum of δ(u,v) over all edges (u,v) in T. The Large Parsimony Problem is to compute this score.

Potentially, we need to consider all (2n - 5)!! possible un-rooted trees or (2n - 3)!! possible rooted trees. Unfortunately, in general this can't be avoided and the maximum parsimony problem is known to be NP-hard.
Exhaustive enumeration of all possible tree topologies will only work for n ≤ 10, say.
Thus, we need more efficient strategies that either solve the problem exactly or return
good approximations with heuristic searches. These algorithms are not covered here.
1.3.3 DISTANCE

The distance between two nodes in a tree (tree distance) is the sum of all the edge weights along the shortest path between the nodes, and is always greater than or equal to the edit distance between the sequences at those nodes.

We use the notation d(v,w) to represent the edit distance between nodes v and w, and d_T(v,w) to represent the tree distance between nodes v and w.

Given n strings we can easily create an n × n matrix, D, representing the edit distance between strings i and j. Our distance-based algorithms will produce distance trees that best fit the distance matrix.

In an ideal (and optimal) case:

D_ij = d_T(i,j)

Such optimality is always possible when we have trees with no more than 3 leaves, but this is rarely the case when the number of leaves is > 3. Matrices for which such a tree exists are said to be additive and yield a simple solution; for three leaves i, j and k, the edge length from leaf i to the internal vertex c is

d_ic = (D_ij + D_ik - D_jk) / 2
A special case of tree distances is a degenerate triple. A degenerate triple is a set of
three distinct elements i, j, k such that D_ij + D_jk = D_ik. If a distance matrix D
has a degenerate triple i, j, k then j can be removed from D, thus reducing the size
of the problem. If distance matrix D does not have a degenerate triple, one can
create a degenerate triple in D by shortening all hanging edges (in the tree).
1.3.3.1 UPGMA
UPGMA (Un-weighted Pair Group Method using Arithmetic averages) is the first of
our best-fit-tree distance algorithms; it uses iterative clustering to create a
hierarchical phylogenetic tree.
It uses an n × n pair-wise distance matrix D, where n is the number of sequences and D_ij is the distance between sequences i and j. As explained earlier, there are many ways of calculating the distance between strings but UPGMA usually uses the edit
distance.
On each iteration the algorithm combines sequences to create clusters. When
calculating the distance to or from a cluster we average (mean) the distance values of
all possible combinations of sequence pairs, (p, q) say, where p is a sequence from the first cluster and q is a sequence from the other:
    D(C1, C2) = (1 / (|C1| |C2|)) × sum of d(p, q) over p in C1, q in C2
where |C| is the number of elements in cluster C.

UPGMA Algorithm
1. Begin with n sequences and populate an n × n distance matrix D where for all i, j: D_ij = d(i, j)
2. Find the smallest value of D_ij
3. Combine sequences i and j creating a cluster C
4. Create a new node c in our tree with child nodes i and j at height D_ij / 2
5. Repeat steps 2 to 4 with n − 1 sequences as input until only 2 sequences remain. This new input will be the same as our original with sequences i and j removed and cluster C added (the distance to a cluster is the average distance to the sequences in the cluster)
6. Place the root midway between the two remaining clusters
T = O(n²)
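A minimal UPGMA sketch in Python, assuming pairwise distances are already given as a dict keyed by sequence-name pairs; the merge-event output format is my own choice, not the course's:

```python
def upgma(dist):
    """UPGMA sketch: repeatedly merge the two closest clusters, scoring
    cluster pairs by the average of all pairwise sequence distances.
    dist: {(name_a, name_b): distance}. Returns merge events
    (new_name, left, right, height)."""
    def d(a, b):  # look up a distance in either key order
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    clusters = {name: [name] for pair in dist for name in pair}
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best, ci, cj = float('inf'), None, None
        for x in range(len(names)):
            for y in range(x + 1, len(names)):
                i, j = names[x], names[y]
                # average linkage between cluster i and cluster j
                avg = (sum(d(p, q) for p in clusters[i] for q in clusters[j])
                       / (len(clusters[i]) * len(clusters[j])))
                if avg < best:
                    best, ci, cj = avg, i, j
        new = ci + cj
        merges.append((new, ci, cj, best / 2))  # new node at height D_ij / 2
        clusters[new] = clusters.pop(ci) + clusters.pop(cj)
    return merges
```

With distances A–B = 2, A–C = 4, B–C = 4, the sketch merges A and B at height 1, then joins C at height 2, i.e. all leaves sit at the same level as the assumptions below require.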
UPGMA assumes that:
1. all leaves are placed at the same level;
2. when two sequences are joined in a cluster, the common ancestor is equidistant from each sequence;
3. the "molecular clock" rate of evolution is the same on each branch of the tree.
Unfortunately these assumptions can cause many inaccuracies. Trees that should look
like (c) might end up like (d). It is the final assumption (3) that causes the erroneous
results. This anomaly can be solved using the neighbour joining algorithm.
1.3.3.2 Neighbour Joining
The Neighbour Joining algorithm gets around the pitfalls of UPGMA as no
assumption is made about the mutation rate. Like UPGMA it is a bottom-up clustering
algorithm that produces phylogenetic trees and calculates the length of branches. Its
results differ in that the trees produced are non-hierarchical.
The algorithm starts with a star tree, where each node represents a sequence. One
chooses the two nodes with shortest distance and connects them with a new internal
node. The distance could be the difference in percent between the two sequences.
When this is done two new nodes with the smallest distance are picked out and
connected with another new node. This will continue until the whole star is resolved.
On each iteration, Neighbour Joining populates a second matrix, D*, based on the distance matrix. The new distance values are calculated such that:
    D*_ij = D_ij − (u_i + u_j) / (n − 2),   where u_i = sum over k of D_ik
and k ranges over all nodes that remain in the star.
Neighbour Joining
1. Begin with n sequences and populate our n × n distance matrix D where for all i, j:
   D_ij = d(i, j)
2. Populate a matrix D* such that
   D*_ij = D_ij − (u_i + u_j) / (n − 2), where u_i = sum over k of D_ik
3. Find the smallest value of D*_ij
4. Combine sequences i and j creating a cluster C
5. Create a new node c in our tree with child nodes i and j.
6. Repeat steps 2 to 5 with n − 1 sequences as input until only 3 sequences or clusters remain. This new input will be the same as our previous with nodes i and j removed and node c added.
T = O(n⁵)
Every time we create a cluster we are required to recompute the whole of our D* matrix to account for fast-evolving edges. Calculating the position of the cluster is very costly, yielding O(n⁵) complexity: for each stage we are required to calculate n² entries, there are n stages, and each entry requires summing over all the elements of the matrix, O(n²). Therefore, T = O(n⁵).
The complexity can be reduced by introducing a new parameter:
    u_i = sum from k = 1 to n of D_ik
We are only required to calculate u_i and u_j once each round, after which each D*_ij is obtained in O(1). Therefore, by using dynamic programming it is not necessary to sum over all the elements of the matrix for each entry. This reduces the complexity to O(n³).
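The optimised scheme can be sketched as follows, using the equivalent selection criterion (n − 2)·D_ij − u_i − u_j (same argmin as the D* above, scaled by n − 2); the join-event output format is my own:

```python
def neighbour_joining(D, names):
    """Neighbour-joining sketch on a symmetric distance matrix D (list of
    lists). Row sums u_i make each round O(n^2), O(n^3) overall.
    Returns (new_node, child_i, branch_i, child_j, branch_j) join events."""
    D = [row[:] for row in D]
    names = list(names)
    joins = []
    while len(names) > 3:
        n = len(names)
        u = [sum(row) for row in D]             # u_i = sum_k D_ik
        best, bi, bj = float('inf'), 0, 1
        for i in range(n):                      # minimise (n-2)D_ij - u_i - u_j
            for j in range(i + 1, n):
                q = (n - 2) * D[i][j] - u[i] - u[j]
                if q < best:
                    best, bi, bj = q, i, j
        # branch lengths from the joined pair to the new node c
        di = D[bi][bj] / 2 + (u[bi] - u[bj]) / (2 * (n - 2))
        dj = D[bi][bj] - di
        new = '(' + names[bi] + ',' + names[bj] + ')'
        joins.append((new, names[bi], di, names[bj], dj))
        # distances from every remaining node k to the new node
        row = [(D[bi][k] + D[bj][k] - D[bi][bj]) / 2
               for k in range(n) if k not in (bi, bj)]
        keep = [k for k in range(n) if k not in (bi, bj)]
        D = [[D[a][b] for b in keep] for a in keep]
        for r, extra in zip(D, row):
            r.append(extra)
        D.append(row + [0.0])
        names = [names[k] for k in keep] + [new]
    return joins
```

On the additive 4-leaf matrix used in the test, the first (and only) join correctly pairs the cherry A, B with branch lengths 1 and 2.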
Neighbour Joining (Optimised)
1. Begin with n sequences and populate our n × n distance matrix D where for all i, j:
   D_ij = d(i, j)
2. Find the smallest value of D*_ij, where
   D*_ij = D_ij − (u_i + u_j) / (n − 2)  and  u_i = sum from k = 1 to n of D_ik
3. Combine sequences i and j creating a cluster C
4. Create a new node c in our tree with child nodes i and j.
5. Calculate the branch lengths of i and j to c such that:
   d_ic = D_ij / 2 + (u_i − u_j) / (2(n − 2))
   d_jc = D_ij / 2 + (u_j − u_i) / (2(n − 2))
6. Calculate the distance between the new internal node c and each node k in the remaining star such that:
   D_kc = (D_ik + D_jk − D_ij) / 2
   remembering the u values.
7. Repeat steps 2 to 6 with n − 1 sequences as input until only 3 sequences or clusters remain.
T = O(n³)

1.3.4 LIKELIHOOD
Maximum Likelihood evaluates a phylogenetic tree in terms of the probability that the
proposed model of the evolutionary process would give rise to the observed data.
Often it is the case that certain mutations are more common than others. Once a local
phylogeny is constructed, it can be scored according to how well the tree helps explain
the evolutionary path. A minimum distance based tree is only guaranteed to have
maximal likelihood when all mutations have equal probability.
Given a sequence, s, of length n, I will use j to denote an individual nucleotide position.
position j = 1
Sequence s1 = ACTGTCGATCGCGCGCGCGATCG
         s2 = ACTCGATTZCGCAATCGCGATCG
         s3 = ACTGTCACTCCAGATCGCGCGCG
e.g. Given a tree:
We root it at an arbitrary internal node:
For each position j in the sequence we calculate the likelihood L_j of the tree by summing the probabilities of all the possible ancestral-state combinations a:
    L_j = sum over a of P(tree, a)
The overall likelihood of the tree representing the full sequence can be calculated by
taking the product of all the individual cognate-set likelihoods:
    L = L_1 × L_2 × ... × L_n = product from j = 1 to n of L_j
As this is costly to compute and the probability of any individual observation is small,
we take the natural log of the likelihood:
    ln L = ln L_1 + ln L_2 + ... + ln L_n = sum from j = 1 to n of ln L_j
If a tree has a relatively high likelihood score this means that, given the tree and the
model of evolution, the data is a relatively likely outcome. The maximum likelihood
(ML) tree is the tree or trees making the data most likely. There may be many trees
with the same likelihood score.
When scoring a tree we assume that all mutations are independent, allowing us to
calculate the likelihood, L_j, for each position individually.
This can be summarised with the following algorithm:
Calculate Likelihood
1. Root the tree at any internal node (models are time reversible)
2. Calculate L_j for each position j:
   L_j = sum over ancestral states a of P(tree, a)
3. Combine the L_j values to calculate L for the whole tree:
   L = L_1 × L_2 × ... × L_n = product from j = 1 to n of L_j
T = O(n³)

1.3.5 BOOTSTRAPPING ALGORITHM
Due to the quadratic nature of the tree comparison algorithms it is often the case that
they are very expensive and time consuming to run. The bootstrapping algorithm
increases the speed of best-fit multiple alignment algorithms by repeating the construction many times
on small samples and outputting the most frequent result.
Bootstrapping Algorithm
1.Select random columns from a multiple alignment (one
column may appear several times)
2.Build a phylogenetic tree based on the random sample
3.Repeat stages 1 and 2 many times
4.Output the tree that is constructed most frequently
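The four steps above can be sketched as follows; `build_tree` stands for any tree-construction routine returning a hashable tree description, and the toy `cherry` helper (pairing the two closest sequences) is purely illustrative, not from the course:

```python
import random

def cherry(aln):
    """Toy build_tree: return the pair of sequence indices with the smallest
    Hamming distance -- a stand-in for a real tree builder."""
    pairs = [(i, j) for i in range(len(aln)) for j in range(i + 1, len(aln))]
    ham = lambda a, b: sum(x != y for x, y in zip(a, b))
    return min(pairs, key=lambda p: ham(aln[p[0]], aln[p[1]]))

def bootstrap_trees(alignment, build_tree, rounds=100, seed=0):
    """Bootstrapping sketch: resample alignment columns with replacement,
    rebuild a tree each round, and return the most frequent result."""
    rng = random.Random(seed)
    n = len(alignment[0])
    counts = {}
    for _ in range(rounds):
        cols = [rng.randrange(n) for _ in range(n)]   # columns may repeat
        sample = [''.join(seq[c] for c in cols) for seq in alignment]
        tree = build_tree(sample)
        counts[tree] = counts.get(tree, 0) + 1
    return max(counts, key=counts.get)                # most frequent tree
```

Run on three sequences where the first two are near-identical, the most frequent "tree" is the (0, 1) cherry, as expected.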
1.3.6 PRIM'S ALGORITHM
Prim's algorithm finds the minimum spanning tree for a connected weighted graph.
This means it finds a subset of the edges that forms a tree that includes every vertex,
where the total weight of all the edges in the tree is minimized.
Prim's Algorithm
1. Start from a random vertex and make this the root of our tree, T
2. Add the shortest edge connecting a vertex in T to a vertex not in T
3. Repeat step 2 until all nodes are in our tree T
T = O(E log V)
(where E is the number of edges and V the number of vertices)
Whilst the time complexity of this algorithm is highly desirable we cannot construct
meaningful phylogenetic trees with it. When constructing typical phylogenetic trees the input
is made up of leaf nodes. The internal nodes are unknown and have to be generated
during construction of the tree.
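The three steps above can be sketched with a binary heap of candidate edges, which gives the O(E log V) bound quoted; the adjacency-list input format `{vertex: [(weight, neighbour), ...]}` is an assumption of this sketch:

```python
import heapq

def prim_mst(graph, start):
    """Prim's algorithm sketch. graph: {vertex: [(weight, neighbour), ...]}.
    Returns (total_weight, edges) where edges are (weight, vertex) picks."""
    visited = {start}
    heap = list(graph[start])          # candidate edges leaving the tree
    heapq.heapify(heap)
    total, edges = 0, []
    while heap:
        w, v = heapq.heappop(heap)     # cheapest candidate edge
        if v in visited:
            continue                   # both ends already in the tree
        visited.add(v)
        total += w
        edges.append((w, v))
        for edge in graph[v]:          # new candidates from the new vertex
            if edge[1] not in visited:
                heapq.heappush(heap, edge)
    return total, edges
```

On a small 4-vertex graph the sketch returns the expected spanning tree of total weight 4.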
1.4 INFORMATION THEORY AND DNA
Information theory in the context of DNA expresses the amount of information
encoded in a string or observation. The measure of information has unit bits. We usethree types of comparisons:
Entropy: The entropy H(X) of a discrete random variable X is a measure of the
    uncertainty associated with the value of X. For example, if someone
    were to select a random letter from a given alphabet, how many
    binary questions would it take to identify the letter? The entropy of a
    random letter from a 32-letter alphabet can be expressed as H(X) = 5 bits.
Conditional Entropy: The conditional entropy of a variable X is the uncertainty
    associated with the value of X given a random variable Y. If X is an
    independent random variable then the conditional entropy of X given Y,
    H(X|Y), will be the same as the entropy of X, H(X). The conditional
    entropy of a variable will always be less than or equal to the entropy
    of that variable.
Mutual Information: Mutual information measures the amount of information that
    can be obtained about one random variable by observing another.
    This is essentially the difference between learning the value of X
    given Y and learning the value of X without Y:
    I(X; Y) = H(X) − H(X|Y)
1.4.1 INFORMATION CONTENT OF A DNA MOTIF
The information encoded at position j in the motif can be expressed as
    I_j = sum from i = 1 to k of q_ij log2 q_ij − sum from i = 1 to k of q_ij log2 p_i
where, given an alphabet of k possible characters, p_i is the background probability of character i and q_ij is the motif probability of character i at position j. If all characters are equiprobable, the background probability for any letter will be 1/k. In the case of DNA, with a 4-letter alphabet, p_i = 1/4, giving a background contribution of 2 bits.
The information content encoded by a DNA motif of length l is the sum of all the information encoded by its individual positions:
    I = sum from j = 1 to l of I_j
1.4.2 ENTROPY OF MULTIPLE ALIGNMENT
The entropy of a multiple alignment is a measure of the uncertainty of a single column.

Entropy of Multiple Alignment
1. Given an alignment column, we calculate the frequency of occurrence p_i of every possible letter i
2. The entropy of the column is then
    H = − sum over i of p_i log2 p_i
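The two-step calculation above, sketched in Python:

```python
import math

def column_entropy(column):
    """Entropy of one alignment column: H = -sum_i p_i log2 p_i, where p_i
    is the observed frequency of each letter in the column."""
    n = len(column)
    freqs = {c: column.count(c) / n for c in set(column)}
    return -sum(p * math.log2(p) for p in freqs.values())
```

A fully conserved column has entropy 0 bits; a 50/50 column gives 1 bit, and a uniform spread over all four nucleotides gives the maximum 2 bits.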
1.4.3 INFORMATION CONTENT OF A STRING
The information content of a string is the sum of the information content of every position, j, in the string:
    I = sum over j of sum over i of W_ij log2 (W_ij / b_i)
where W_ij is the probability of letter i at position j in our positional weight matrix and b_i is the frequency of i by chance.
1.4.4 MOTIFS
A sequence motif is a nucleotide or amino-acid sequence pattern. This pattern may be
present in many different positions in many different strings. Finding the same motif
in multiple strings often suggests a regulatory relationship between those genes.
However, motif occurrences may not always be exactly the same, as genes may have been turned on and off by regulatory proteins or mutated at non-important bases. A
motif logo graphically illustrates the importance of letters within a motif at each
position. Larger letters are more important than smaller letters. This represents
conserved and variable regions of the motif.
The cumulative size of the letters at a given position is the information content of the alignment column: the background entropy minus the entropy of the column,
    R_j = H_background − H_j
The individual letter size is the information content at its position multiplied by the letter's fraction of occurrence:
    height(i, j) = R_j × f_ij
It is intuitive that larger letters have less variation than smaller letters.
Given a set of motifs it is possible to find the motif consensus. The consensus can be thought of as the ancestor from which all the mutated motifs emerged.
First, align all patterns by their start index and
construct a matrix containing the frequency of
each nucleotide in each column, known as the
profile.
The consensus nucleotide in each position is the
nucleotide with the highest score in each column.
The distance between a real motif and the consensus sequence is generally less than
that for two real motifs.
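Profile construction and consensus extraction, as described above, can be sketched as:

```python
def profile_and_consensus(motifs):
    """Build the profile (per-column nucleotide counts) of aligned motifs
    and read off the consensus: the highest-scoring nucleotide per column."""
    length = len(motifs[0])
    profile = [{c: 0 for c in 'ACGT'} for _ in range(length)]
    for motif in motifs:
        for j, c in enumerate(motif):
            profile[j][c] += 1          # count nucleotide c in column j
    consensus = ''.join(max(col, key=col.get) for col in profile)
    return profile, consensus
```

For three aligned 4-mers the consensus is simply the column-wise majority vote.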
The number of motif occurrences (sites) and how relations extend can be estimated
using Markov Chain theory, e.g. for a three-letter word W = ACG:
    P1(W) = p(A) p(C) p(G)
    P2(W) = p(AC) p(CG) / p(C)
    P3(W) = p(ACG)
All Markov chains estimate the frequency of a word from base composition alone, with
increasing orders producing more accurate results. A Markov chain of order k supposes that the base present at a certain position in a sequence depends only on the
bases present at the k previous positions.
1.4.5 EXHAUSTIVE SEARCH
An exhaustive search generates a motif, w, that best matches a set of sequences, S.
Given a set of sequences S = {s1, ..., st} and a motif defined as w = w1 ... wl, find w such that the match with s1, ..., st is optimal. Whilst there are several ways to define an optimal match, Hamming distance is usually used:
    d(w, s_i) = minimum over start positions p of HammingDistance(w, the l-mer of s_i at p)
    d(w, S) = sum from i = 1 to t of d(w, s_i)
In the case of DNA and RNA, which use 4-letter alphabets, the number of possible
motifs with length l is 4^l. Whilst this approach always finds the best motif it is very costly, with running time:
    T = O(4^l × n × t × l), where n = |s_i|
It is possible to speed up the basic exhaustive search algorithm at the expense of
accuracy. Instead of searching through all alphabet permutations of length l, only
words of length l that occur in some s_i are considered. This only requires O(n² t² l)
but does not always yield an accurate answer: if the motif is weak and doesn't occur exactly in S then
a random motif may have a higher score.
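The full 4^l enumeration can be sketched directly (practical only for small l, as the running time above makes clear):

```python
from itertools import product

def best_motif(seqs, l):
    """Exhaustive motif search sketch: try all 4^l candidate l-mers and keep
    the one minimising d(w, S) = sum_i (best Hamming match of w in s_i)."""
    def d(w, s):  # best match of w against a single sequence
        return min(sum(a != b for a, b in zip(w, s[p:p + l]))
                   for p in range(len(s) - l + 1))

    best, score = None, float('inf')
    for cand in product('ACGT', repeat=l):
        w = ''.join(cand)
        total = sum(d(w, s) for s in seqs)
        if total < score:
            best, score = w, total
    return best, score
```

When the 3-mer AAG occurs exactly in every input sequence, the search finds it with total distance 0.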
1.4.6 GIBBS SAMPLING
Gibbs Motif Sampling identifies motifs (conserved regions) in DNA or protein
sequences, solving the Motif Finding Problem.
Motif Finding Problem: given a set of t DNA sequences each of length n, find the motif with optimal match.
The algorithm uses an iterative random sampling method, increasing the odds that it
will converge to the correct solution.
Gibbs Sampling Algorithm
Method Example
Given the length l of the motif and a set of t sequences:
1.Randomly choose starting positions s = (s1, . . . , st) and form the set of l-mers associated with these starting positions.
2.Randomly choose one of the t
sequences.Sequence 2
3.Create a profile P from the other t − 1 sequences.
4.For each position in the removed sequence, calculate the probability that the l-mer x starting at that position was generated by the profile P:
P(x|P) = product from i = 1 to l of p_{x_i, i}
AAAATTTACCTTAGAAGG 0.000732
AAAATTTACCTTAGAAGG 0.000122
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0.000183
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
5.Create a distribution of probabilities of l-mers, P(x|P), and randomly select a new starting position based on this distribution.
a.To create this distribution,
divide each probability
(|) by the lowestprobability:Position 1: prob(AAAATTTA | P ) = .000732/.000122 = 6Position 2: prob(AAATTTAC | P ) = .000122/.000122 = 1Position 8: prob(ACCTTAGA | P ) = .000183/.000122 = 1.5Ratio = 6 : 1 : 1.5
b.Define probabilities of
starting positions according
to computed ratios
Probability (Position 1): 6/(6+1+1.5)= 0.706Probability (Position 2): 1/(6+1+1.5)= 0.118Probability (Position 8): 1.5/(6+1+1.5)=0.176
c.Select the start position
according to computed ratios:
P(Selecting Starting Position 1): .706P(Selecting Starting Position 2): .118P(Selecting Starting Position 8): .176
6.Repeat steps 2–5 until there is no improvement
1.5 HIDDEN MARKOV MODELS
A Hidden Markov Model (HMM) is a statistical model in which the system being
modelled is assumed to be a Markov process with unobserved state. An HMM is
memory-less: the only thing affecting the next step is the current state.
Definition
Given an alphabet Σ = (b1, b2, ..., bM) and a set of states (1, ..., K), the transition probability from state k to state l is written a_kl.
1. The sum of the transition probabilities out of any state equals 1:
    a_k1 + ... + a_kK = 1, for all states k
2. The sum of all starting state probabilities equals 1:
    a_01 + ... + a_0K = 1
3. Each state has emission probabilities (the probability that the hidden state emits the observable symbol):
    e_k(b) = P(x_i = b | π_i = k), with e_k(b1) + ... + e_k(bM) = 1
Given a sequence x = x1 ... xn and a parse π = π1 ... πn (a sequence of states), we can calculate the parse likelihood:
    P(x, π) = a_{0,π1} × product from i = 1 to n of e_{πi}(x_i) a_{πi,πi+1}
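The parse likelihood above is a straight product of a start probability, emissions, and transitions; a sketch (the fair/biased-coin parameters in the usage below are illustrative only):

```python
def parse_likelihood(x, pi, a0, a, e):
    """P(x, pi) = a0[pi_1] * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}.
    a0: start probabilities, a: transition dict, e: emission dict."""
    p = a0[pi[0]]
    for i in range(len(x)):
        p *= e[pi[i]][x[i]]               # emission at step i
        if i + 1 < len(x):
            p *= a[pi[i]][pi[i + 1]]      # transition to the next state
    return p
```

For a two-state fair/loaded coin model, P('HH', 'FF') = 0.5 × 0.5 × 0.9 × 0.5.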
Hidden Markov models are used to solve three main questions:
1. Evaluation
Given an HMM M and a sequence x,
find P(x|M), the probability that sequence x was generated by the model:
    P(x) = sum over all parses π of P(x, π)
This can be calculated using the Forward dynamic programming algorithm.
2. Decoding
Given an HMM M and a sequence x,
find the sequence π of states that maximises P(x, π). Described algebraically as
    π* = argmax over π of P(x, π)
This is also known as the Viterbi path and is solved using the Viterbi
dynamic programming algorithm.
3. Learning
Given an HMM M with unspecified transition/emission probabilities
and a sequence x,
find the HMM parameters θ = (a_kl, e_k(·)) that maximise P(x|θ).
Solutions for this problem will be omitted.
1.5.1 FORWARD ALGORITHM
The Forward algorithm solves the evaluation problem: the probability of x given the
HMM. This is calculated by summing over all the possible ways of generating x:
    P(x) = sum over π of P(x, π)
To avoid summing over an exponential number of paths π, the forward probability is defined as
    forward probability: f_k(i) = P(x1 ... xi, π_i = k)
Expanding over the state at position i − 1 gives the recurrence
    f_k(i) = e_k(x_i) × sum over l of f_l(i − 1) a_lk
Summing the final column gives
    P(x) = sum over k of f_k(n) a_k0
T = O(K²n)
The main issue with this algorithm is that of underflow. As the numbers become
increasingly small, they often become inaccurate. This can be solved by rescaling at
each position by multiplying by a constant.

Forward Algorithm
Given a sequence x = x1 ... xn and a Markov model M,
Initialise:   f_0(0) = 1;  f_k(0) = 0 for all k > 0
Iterate:      f_k(i) = e_k(x_i) × sum over l of f_l(i − 1) a_lk
Termination:  P(x) = sum over k of f_k(n) a_k0
where a_k0 is the probability that the terminating state is k

1.5.2 VITERBI ALGORITHM
The Viterbi algorithm is a dynamic programming algorithm that solves the decoding
problem: finding the most likely sequence of hidden states (the Viterbi path) that results
in a sequence of observed events.

Viterbi Algorithm
Given a sequence x = x1 ... xn,
Initialise:   V_0(0) = 1;  V_k(0) = 0 for all k > 0
Iterate:      V_l(i) = e_l(x_i) × max over k of [V_k(i − 1) a_kl]
              Ptr_i(l) = argmax over k of [V_k(i − 1) a_kl]
Termination:  P(x, π*) = max over k of V_k(n)
Trace back:   π*_n = argmax over k of V_k(n);  π*_{i−1} = Ptr_i(π*_i)
T = O(K²n)
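Both recurrences can be sketched in Python; for brevity this sketch omits the explicit end state (termination is a plain sum/max over the final column) and the underflow rescaling discussed in the text:

```python
def forward(x, states, a0, a, e):
    """Forward algorithm: f_k(i) = e_k(x_i) * sum_l f_l(i-1) a_lk.
    Returns P(x) as the sum over the final column (no explicit end state)."""
    f = {k: a0[k] * e[k][x[0]] for k in states}
    for sym in x[1:]:
        f = {k: e[k][sym] * sum(f[l] * a[l][k] for l in states)
             for k in states}
    return sum(f.values())

def viterbi(x, states, a0, a, e):
    """Viterbi algorithm: V_k(i) = e_k(x_i) * max_l V_l(i-1) a_lk, with
    back-pointers for the trace-back. Returns the most likely state path."""
    V = {k: a0[k] * e[k][x[0]] for k in states}
    ptrs = []
    for sym in x[1:]:
        nV, ptr = {}, {}
        for k in states:
            best = max(states, key=lambda l: V[l] * a[l][k])
            ptr[k] = best                      # remember the best predecessor
            nV[k] = e[k][sym] * V[best] * a[best][k]
        ptrs.append(ptr)
        V = nV
    path = [max(V, key=V.get)]                 # termination: best final state
    for ptr in reversed(ptrs):                 # trace back through pointers
        path.append(ptr[path[-1]])
    return ''.join(reversed(path))
```

With a two-state model whose "loaded" state strongly favours heads, a run of heads decodes to the all-loaded path, as expected.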
Again, this algorithm is subject to underflow. As the numbers become increasingly
small, they often become inaccurate. This is solved by taking the log of all values
during the iteration stage:
    V_l(i) = log e_l(x_i) + max over k of [V_k(i − 1) + log a_kl]
1.5.3 BACKWARD ALGORITHM
This final algorithm calculates P(π_i = k | x), the probability distribution of the state at position i given x. Like the forward probability, the backward probability can be derived:
    backward probability: b_k(i) = P(x_{i+1} ... x_n | π_i = k)
Conditioning on the state at position i + 1:
    b_k(i) = sum over l of P(x_{i+1} ... x_n, π_{i+1} = l | π_i = k)
           = sum over l of a_kl e_l(x_{i+1}) P(x_{i+2} ... x_n | π_{i+1} = l)
           = sum over l of a_kl e_l(x_{i+1}) b_l(i + 1)
T = O(K²n)

Backward Algorithm
Given a sequence x = x1 ... xn and a Markov model M,
Initialise:   b_k(n) = a_k0, for all k
Iterate:      b_k(i) = sum over l of a_kl e_l(x_{i+1}) b_l(i + 1)
Termination:  P(x) = sum over l of a_0l e_l(x1) b_l(1)
where a_k0 is the probability that the terminating state is k
The underflow problem associated with the backward algorithm is solved in the same
way as for the forward algorithm: at each position we rescale by multiplying by a
constant.
The probability distribution over states at position i can be derived using the forward
and backward probabilities:
    P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) b_k(i) / P(x)
An implementation of this equation combining both the forward and backward
algorithms is often referred to as the forwards-backwards algorithm.
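Combining the two passes gives a posterior-decoding sketch; as in the earlier sketches this omits the explicit end state, so b_k(n) = 1:

```python
def posterior(x, states, a0, a, e):
    """Forwards-backwards sketch: P(pi_i = k | x) = f_k(i) b_k(i) / P(x).
    Returns one {state: probability} dict per position."""
    n = len(x)
    # forward pass: f_k(i)
    f = [{k: a0[k] * e[k][x[0]] for k in states}]
    for i in range(1, n):
        f.append({k: e[k][x[i]] * sum(f[-1][l] * a[l][k] for l in states)
                  for k in states})
    # backward pass: b_k(i), with b_k(n) = 1 (no explicit end state)
    b = [{k: 1.0 for k in states}]
    for i in range(n - 2, -1, -1):
        b.insert(0, {k: sum(a[k][l] * e[l][x[i + 1]] * b[0][l]
                            for l in states) for k in states})
    px = sum(f[-1][k] for k in states)            # P(x) from the forward pass
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
```

Since f_k(i) b_k(i) summed over k equals P(x) at every position, each returned distribution sums to 1.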
2 WORKING WITH MICROARRAYS
A DNA microarray is a multiplex technology consisting of
an array containing thousands of microscopic spots of
DNA. The microarrays measure changes to the genes
under varying conditions, such as time, heat or pressure.
An expression level is estimated by measuring the
amount of mRNA (Messenger RNA) for a particulargene. mRNA is a molecule of RNA which carries
messages to the sites of protein synthesis. More mRNA
usually indicates more gene activity.
Microarray analysis can produce an activity diagram (right).
2.1 CLUSTERING
Clustering is a method of grouping functionally related genes. The genes will be related
based on some distance metric which may combine several factors. Genes are
clustered by plotting points in n-dimensional space, comparing all gene pairs using the distance metric, and grouping genes with small distances.
Given a set V consisting of n points and a parameter k, the k-means clustering problem is to find a set X consisting of k points (cluster centres) that minimises the squared error distortion d(V, X) over all possible choices of X:
    d(V, X) = (sum from i = 1 to n of d(v_i, X)²) / n
where d(v_i, X) is the distance from v_i to its closest centre in X.
2.1.1 LLOYD ALGORITHM (K-MEANS)
The most common algorithm for approximating the k-means problem is the Lloyd
algorithm, which uses an iterative refinement heuristic. The algorithm begins by
partitioning the input points into k initial sets, either at random or using some
heuristic data. It then calculates the mean point of each set. It constructs a new
partition by associating each point with the closest mean before recalculating the
mean points for the new clusters. The algorithm then repeats by alternate application of
these two steps until the results converge, which is obtained when the points no longer
switch clusters (or alternatively the means no longer change).
Lloyd's algorithm uses a heuristic for solving the k-means problem which, when used
with certain combinations of starting points, could converge to the wrong answer. The
Lloyd algorithm has remained popular because it has a very quick running time with
the number of iterations often far less than the number of points.
Lloyd Algorithm
Randomly assign the k cluster centres
While the cluster centres keep changing
    o Assign each data point to the cluster C_i corresponding to the closest cluster centre x_i, where 1 ≤ i ≤ k
    o After the assignment of all data points, compute new cluster representatives according to the centre of gravity of each cluster; that is, the new cluster representative is
      x_i = (sum of v over v in C_i) / |C_i|
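The two alternating steps can be sketched for 2-D points; the starting centres are supplied by the caller (random or heuristic, as the text says):

```python
def lloyd(points, centres):
    """Lloyd's k-means sketch for 2-D points. Alternates assignment and
    centre-of-gravity updates until no point switches cluster."""
    assignment = None
    while True:
        # assignment step: each point joins its closest centre
        new_assignment = [
            min(range(len(centres)),
                key=lambda i: (p[0] - centres[i][0]) ** 2
                            + (p[1] - centres[i][1]) ** 2)
            for p in points]
        if new_assignment == assignment:       # converged: nothing moved
            return centres, assignment
        assignment = new_assignment
        # update step: move each centre to its cluster's centre of gravity
        for i in range(len(centres)):
            cluster = [p for p, c in zip(points, assignment) if c == i]
            if cluster:
                centres[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
```

Two well-separated pairs of points converge in one refinement round.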
2.1.2 GREEDY ALGORITHM (K-MEANS)
The Lloyd algorithm is fast but moves many data points each iteration, possibly resulting in sub-optimal convergence. A more conservative method would be to move one point at a time, and only if it improves the overall clustering cost. The smaller the clustering cost of a partition of data points, the better the clustering. Whilst this may produce better clustering, it is usually much more costly.

Progressive Greedy K-Means(k)
Select an arbitrary partition P into k clusters
while forever
    bestChange ← 0
    for every cluster C
        for every element i not in C
            if moving i to cluster C reduces the clustering cost
                if cost(P) − cost(P_{i → C}) > bestChange
                    bestChange ← cost(P) − cost(P_{i → C})
                    i* ← i;  C* ← C
    if bestChange > 0
        change partition P by moving i* to C*
    else
        return P
2.1.3 CAST (CLUSTER AFFINITY SEARCH TECHNIQUE)
CAST is a clustering algorithm that groups genes based on affinity.
Affinity: A measure of similarity between a gene and all other genes ina cluster.
Threshold Affinity: A user specified criterion for retaining a gene in a
cluster defined as the percentage of the maximum affinity at
that point.
It requires more computing power than k-means, but does not require the number of clusters to be specified beforehand. The algorithm is also consistent, returning the
same result when run several times. The algorithm returns an optimal set of clusters
with diameters near the threshold affinity.
CAST Algorithm
1. Create an empty cluster
2. Set the initial affinity of all genes to 0
3. Move the two most similar genes into the new cluster
4. Update the affinities of all genes, both clustered and un-clustered:
   a(g) ← a(g) + s(g, x), where x is the newly added gene
5. Whilst there remains an unclustered gene whose affinity value is greater than the threshold:
   Add the gene with the highest affinity to the cluster
   Update the affinities of all genes
6. Whilst there remains a clustered gene whose affinity is lower than the current threshold:
   Remove the gene with the lowest affinity from the cluster
   Update the affinities of all genes
7. Repeat steps 5 and 6 until there is no further change
8. Save the cluster, removing points in the cluster from further consideration
9. Repeat steps 1 to 8 with the reduced set of points until all genes have been assigned to a cluster
T = O(n²)

2.1.4 QT CLUSTERING
QT (quality threshold) clustering is an alternative method of partitioning data. Like
CAST, it requires more computing power than k-means and does not require the
number of clusters to be specified beforehand. The algorithm is also consistent,
returning the same result when run several times.
QT Clustering Algorithm
1. Choose a maximum cluster diameter.
2. Choose a gene as the seed for a new cluster.
3. Add the closest point, the next closest, and so on, to the cluster until the diameter surpasses the threshold.
4. Repeat steps 2 and 3 for every gene.
5. Save the cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration.
6. Repeat steps 2 to 5 with the reduced set of points until the last cluster formed has fewer genes than the user-specified number. All genes that are not part of a cluster are unassigned.
T = O(n³), where n is the number of genes
The distance between a point and a group of points is computed using complete
linkage (Jack-knifed distance). This is the maximum distance between the seed and all
other genes in the cluster.
2.1.5 MARKOV CLUSTERING ALGORITHM
Finally, the Markov clustering algorithm randomly walks through the probability graph
described by the similarity matrix to identify clusters of related genes. The basic idea
underlying the algorithm is that the walk takes more probable routes more often.
Dense clusters correspond to regions with a large number of paths.
The algorithm uses three steps:

Markov Clustering Algorithm
1. Given a network with n vertices, take the corresponding adjacency matrix and normalise each column to obtain a stochastic matrix.
2. Take the e-th power of this matrix (expansion).
3. Then take the r-th power of every element (inflation).
The expansion parameter e is often taken equal to 2, while the granularity of the clustering is controlled by tuning the inflation parameter r.

2.2 GENETIC NETWORKS ANALYSIS
A genetic network is a collection of DNA segments in a cell which interact with each
NALYSISA genetic network is a collection of DNA segments in a cell which interact with each
other (indirectly through their RNA and protein expression products (mRNA)) and
with other substances in the cell. This controls the rate at which genes in the network
are transcribed into mRNA. This kind of network can often cause chain reactions.
Assume there are two related genes, A and B. Neither is expressed initially but a third
gene, X, causes A to be expressed which in turn causes B to be expressed. This kind of
reaction can often be thought of as a circuit consisting of logic gates.
Gene activity can affect some genes directly and other genes indirectly, known as the
primary and secondary targets respectively. Our aim is to represent a large genetic
network with gene perturbations in fewer than 2 steps. A perturbation static graph
model is used. This essentially perturbs a gene network one gene at a time, monitoring
the behaviour of the other genes. This identifies direct and indirect gene-gene
relationships. If this were a black-boxed circuit made up of logical elements it would
be called bit twiddling!
Perturbation static graph model
Given a gene network with genes 1 ... n:
STEP 1. For each gene i, compare the control experiment to the perturbed experiment where i is perturbed, to identify differentially expressed genes.
STEP 2. Use the most parsimonious graph that represents the behaviour observed in STEP 1.

In more detail:
STEP 1
(1) Given a gene network
(2) Find the adjacency list
(3) Use the adjacency list to find the accessibility list
STEP 2
(4) Find the most parsimonious graph representing the accessibility list (3)

Assuming the accessibility list is non-cyclic, producing the most parsimonious graph is relatively straightforward. Let Acc(G) be the accessibility list and Adj(G_p) be the adjacency list of an acyclic directed graph G, G_p its most parsimonious graph, and V the set of all nodes of G. Then the following identity holds:
    Adj(G_p) = Acc(G) \ { (i, j) | there exists k in V with (i, k) in Acc(G) and (k, j) in Acc(G) }
Non-Cyclic most parsimonious graph
for all nodes i of G
    Adj(i) ← Acc(i)
for all nodes i of G
    if node i has not been visited
        call PRUNE(i)

PRUNE(i)
    for all nodes j in Adj(i)
        if Acc(j) = ∅
            declare j as visited
        else if j has not been visited
            call PRUNE(j)
    for all nodes j in Adj(i)
        for all nodes k in Acc(j)
            if k in Adj(i)
                delete k from Adj(i)
    declare i as visited

If the accessibility list contains cycles it is not algorithmically possible to
produce a unique graph for the accessibility list. All genes within a cycle affect
all other genes. This is an experimental limitation. Where cycles do occur we
shrink each cycle into a single node and apply the same algorithm as for the
non-cyclic case. When i and j (where i ≠ j) are two nodes in a directed graph,
iff i is in Acc(j) and j is in Acc(i) then they belong to the same component.
This algorithm is limited. It is unable to resolve cyclic graphs and requires far more
data than conventional methods which use gene expression correlations. Also, for an
accessibility list there may be many consistent networks. This algorithm only
constructs the most parsimonious tree which is not necessarily an accurate model.
2.3 SYSTEMS BIOLOGY
Systems biology has become a particular field of interest in recent years (2000
onwards). It is the study of how different biological systems interact with each other
with respect to time. Knowledge about molecules, genes, cells, tissue, organs,
chemicals and much more can be combined to study how they interact.
A simple example of a biological system is our digestive system. We know that when
we eat the energy is not released into our blood stream immediately. The time taken
and the rate at which energy passes through the system has many variables (i.e. sugar
content of the food, metabolic rate, blood sugar levels etc...). Given all of the system
dependencies, we want to be able to describe the likelihood that a given event will
occur.
Ideally, we want to produce a model that represents the flow of events throughout a
system. If our system has a definite series of events we can accurately use differential
equations as our model, but often such systems have a random element making
stochastic algorithms more suitable.
2.3.1 GILLESPIE ALGORITHM
The Gillespie algorithm is a stochastic algorithm for the simulation of genetic
networks. It uses random numbers in order to generate sequences of events and inter-event times in a biological system.
It is assumed that in a genetic network, different proteins/enzymes are
related to each other: they can trigger or inhibit the production of others or
they can react with another substance to form a new substrate. Any possible
reaction i will have a reaction rate, r_i, associated with it (which determines
the probability of this reaction happening at any step). Additionally, the
number of molecules of the substances referred to by the reactions in the environment is
known.
First a random number, a say, is generated in order to decide which reaction happens,
given that the probability of reaction i occurring is
    p_i = r_i / (sum from j = 1 to N of r_j)
where r_i is the reaction rate for reaction i and the denominator normalises by the cumulative
reaction rate of all N known reactions in the system. Reaction i will occur if
    sum from j = 1 to i − 1 of p_j  ≤  a  <  sum from j = 1 to i of p_j
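One step of the method can be sketched as below; the `(rate_fn, update_fn)` reaction interface is my own framing for the sketch, not the course's:

```python
import math
import random

def gillespie_step(counts, reactions, rng):
    """One Gillespie step: pick reaction i with probability r_i / sum_j r_j,
    apply its state change, and return an exponentially distributed time
    advance with mean 1 / (sum_j r_j); None if no reaction can fire."""
    rates = [rate_fn(counts) for rate_fn, _ in reactions]
    total = sum(rates)
    if total == 0:
        return None                       # nothing left to react
    a = rng.random() * total              # the random number choosing reaction i
    acc = 0.0
    for rate, (_, update) in zip(rates, reactions):
        acc += rate                       # walk the cumulative rate intervals
        if a < acc:
            update(counts)                # apply the chosen reaction
            break
    # waiting time ~ Exponential(total); 1 - random() avoids log(0)
    return -math.log(1.0 - rng.random()) / total
```

A single decay reaction A → ∅ with rate proportional to the molecule count fires exactly ten times from ten molecules, then stalls.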