Post on 30-Jan-2016
Michael Schroeder, BioTechnological Center, TU Dresden, ms@biotec.tu-dresden.de, http://biotec.tu-dresden.de
Multiple Global Alignment and Phylogenetic tree
By Michael Schroeder, Biotec 2
Outline
Multiple sequence alignment (MSA): motivation, the sum of pairs method (SP)
Phylogenetic tree: clustering, neighbour joining
ClustalW
What is a Multiple Sequence Alignment
MSA is the alignment of more than two sequences
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
    *               *
An example of a multiple sequence alignment
Dynamic Programming in 3D
QUESTION: Which alignment would be generated for DQLF, DNVQ, QGL?
D--Q-LF
DNVQ---
---QGL-
By Michael Schroeder, Biotec 6
How many cases do we need to consider?
In standard dynamic programming we considered 3 cases, namely match/mismatch, insert, and delete
For three sequences s1, s2, s3 there are 7 possibilities (one per column, "-" is a gap):
s1:  s1_i  -     s1_i  s1_i  -     -     s1_i
s2:  s2_j  s2_j  -     s2_j  -     s2_j  -
s3:  s3_k  s3_k  s3_k  -     s3_k  -     -
For m sequences there are 2^m - 1 possibilities
QUESTION: Why is it "2"?
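The count of cases above can be checked with a small sketch. It assumes a move is encoded as a tuple of 0/1 flags, one per sequence (1 = take the next residue, 0 = insert a gap); the function name `dp_moves` is made up for this illustration.

```python
from itertools import product

def dp_moves(m):
    """All ways to extend an alignment of m sequences by one column.

    Each move is a tuple of 0/1 flags: 1 consumes the next residue of that
    sequence, 0 inserts a gap. The all-gap column is excluded.
    """
    return [move for move in product((0, 1), repeat=m) if any(move)]

for m in (2, 3, 4):
    # The number of moves is 2**m - 1 in each case.
    print(m, len(dp_moves(m)))
```

For m = 2 this gives the three familiar cases of pairwise dynamic programming (match/mismatch, insert, delete).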
Complexity
For m sequences, each of length n, the matrix has n^m cells, and for each cell we must check 2^m - 1 possibilities. That's prohibitive!
Solution: Use pruning techniques (cut-offs) and heuristics to guide the search for the best solution
A little excursion to Romania:
A* Search
Further reading: Russell/Norvig, Artificial Intelligence, Chapter 4, Prentice-Hall
Problem: Find the shortest path from Arad to Bucharest
[Figure: road map of Romania. Towns: Arad, Bucharest, Oradea, Zerind, Fagaras, Neamt, Iasi, Vaslui, Hirsova, Eforie, Urziceni, Giurgiu, Pitesti, Sibiu, Dobreta, Craiova, Rimnicu, Mehadia, Timisoara, Lugoj. Edges are labelled with road distances in miles.]
Optimal route: Arad, Sibiu, Rimnicu, Pitesti, Bucharest (140 + 80 + 97 + 101 = 418 miles)
Straight Line Distances to Bucharest
Town        SLD
Arad        366
Bucharest     0
Craiova     160
Dobreta     242
Eforie      161
Fagaras     178
Giurgiu      77
Hirsova     151
Iasi        226
Lugoj       244
Mehadia     241
Neamt       234
Oradea      380
Pitesti      98
Rimnicu     193
Sibiu       253
Timisoara   329
Urziceni     80
Vaslui      199
Zerind      374
Greedy search
Go to the neighbouring city v that minimizes the distance Fv to the goal
QUESTION: Any problems? Why is it called "greedy" search?
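The greedy rule can be tried out with a minimal sketch. It assumes a partial road map containing only the roads whose lengths appear in the A* walkthrough of these slides (for example Sibiu to Fagaras = 239 - 140 = 99); the names `ROADS`, `SLD`, `neighbours` and `greedy_route` are made up for this illustration.

```python
# Partial road map (assumption: only roads whose lengths can be read off the slides).
ROADS = {
    ("Arad", "Zerind"): 75, ("Arad", "Sibiu"): 140, ("Arad", "Timisoara"): 118,
    ("Sibiu", "Fagaras"): 99, ("Sibiu", "Oradea"): 151, ("Sibiu", "Rimnicu"): 80,
    ("Rimnicu", "Pitesti"): 97, ("Rimnicu", "Craiova"): 146,
    ("Pitesti", "Bucharest"): 101, ("Fagaras", "Bucharest"): 211,
}
# Straight line distances to Bucharest, taken from the SLD table.
SLD = {"Arad": 366, "Bucharest": 0, "Craiova": 160, "Fagaras": 178, "Oradea": 380,
       "Pitesti": 98, "Rimnicu": 193, "Sibiu": 253, "Timisoara": 329, "Zerind": 374}

def neighbours(town):
    """Yield (neighbour, road length) for every road touching town."""
    for (a, b), dist in ROADS.items():
        if a == town:
            yield b, dist
        elif b == town:
            yield a, dist

def greedy_route(start, goal):
    """Always move to the unvisited neighbour with the smallest SLD to the goal."""
    path, cost, town = [start], 0, start
    while town != goal:
        options = [(SLD[n], n, d) for n, d in neighbours(town) if n not in path]
        if not options:
            return None, None  # dead end on this partial map
        _, town, dist = min(options)
        path.append(town)
        cost += dist
    return path, cost

print(greedy_route("Arad", "Bucharest"))
# (['Arad', 'Sibiu', 'Fagaras', 'Bucharest'], 450)
```

On this map the greedy route costs 450 miles, which is longer than the 418 mile route via Rimnicu and Pitesti.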
Problems of greedy search
Not optimal: greedy search goes from Arad to Bucharest via Fagaras, but the optimum is via Rimnicu
Problem: the greedy algorithm does not take into account the distance already covered
A*: Pursue the best node first, with a scoring function of the distance so far plus an under-estimate of the distance to the goal (e.g. the straight line distance)
v: a node
Sv: best score to go from start to node v
Fv: estimate for going from v to goal
Tv = Sv + Fv: total score
Organize the nodes to be visited sorted by total score (TODO list in the next slides)
A* search of the Romanian map featured in the previous slide. Note: Nodes are labelled with Tv = Sv + Fv. However, we will be using the abbreviations T, S and F to make the notation simpler.
[Figure: search tree with root Arad, T = 0 + 366 = 366]
We begin with the initial state of Arad. The cost of reaching Arad from Arad (the S value) is 0 miles. The straight line distance from Arad to Bucharest (the F value) is 366 miles. This gives a total value of T = S + F = 366 miles. Expand the initial state of Arad.
Arad: T = 0 + 366 = 366
DONE = []
TODO = [Arad/366]
Once Arad is expanded we look for the node with the lowest cost. Sibiu has the lowest value for T. (The cost to reach Sibiu from Arad is 140 miles, and the straight line distance from Sibiu to the goal state is 253 miles. This gives a total of 393 miles.)
Zerind: T = 75 + 374 = 449
Sibiu: T = 140 + 253 = 393
Timisoara: T = 118 + 329 = 447
DONE = [Arad]
TODO = [Sibiu/393, Timisoara/447, Zerind/449]
We now expand Sibiu (that is, we expand the node with the lowest value of T).
Fagaras: T = 239 + 178 = 417
Oradea: T = 291 + 380 = 671
Rimnicu: T = 220 + 193 = 413
DONE = [Arad, Sibiu]
TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671]
We now expand Rimnicu (that is, we expand the node with the lowest value of T).
DONE = [Arad, Sibiu]
TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671]
Once Rimnicu is expanded we look for the node with the lowest cost. As you can see, Pitesti has the lowest value for T. (The cost to reach Pitesti from Arad is 317 miles, and the straight line distance from Pitesti to the goal state is 98 miles. This gives a total of 415 miles.)
Pitesti: T = 317 + 98 = 415
Craiova: T = 366 + 160 = 526
DONE = [Arad, Sibiu, Rimnicu]
TODO = [Pitesti/415, Fagaras/417, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
We now expand Pitesti (that is, we expand the node with the lowest value of T).
Bucharest: T = 418 + 0 = 418
DONE = [Arad, Sibiu, Rimnicu, Pitesti]
TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
In actual fact, the algorithm will not really recognise that we have found Bucharest. It just keeps expanding the lowest cost nodes (based on T) until it finds a goal state AND it has the lowest value of T. So, we must now move to Fagaras and expand it.
DONE = [Arad, Sibiu, Rimnicu, Pitesti]
TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
We have just expanded a node (Pitesti) that revealed Bucharest, but it has a cost of 418. If there is any other lower cost node (and in this case there is one cheaper node, Fagaras, with a cost of 417) then we need to expand it in case it leads to a better solution to Bucharest than the 418 solution we have already found.
DONE = [Arad, Sibiu, Rimnicu, Pitesti]
TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
We now expand Fagaras (that is, we expand the node with the lowest value of T).
Bucharest(2): T = 450 + 0 = 450
DONE = [Arad, Sibiu, Rimnicu, Pitesti]
TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
Once Fagaras is expanded we look for the lowest cost node. As you can see, we now have two Bucharest nodes. One of these nodes (Arad, Sibiu, Rimnicu, Pitesti, Bucharest) has a T value of 418. The other node (Arad, Sibiu, Fagaras, Bucharest(2)) has a T value of 450. We therefore move to the first Bucharest node and expand it.
DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras]
TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671]
We have now arrived at Bucharest. As this is the lowest cost node AND the goal state we can terminate the search. If you look back over the slides you will see that the solution returned by the A* search (Arad, Sibiu, Rimnicu, Pitesti, Bucharest) is in fact the optimal solution.
DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras]
TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671]
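The walkthrough above can be reproduced with a short sketch. It assumes a partial road map built from the S values shown in the slides and uses a heap as the sorted TODO list; the names `ROADS`, `SLD`, `neighbours` and `a_star` are made up for this illustration, and the DONE set is an implementation choice rather than something the slides prescribe.

```python
import heapq

# Partial road map (assumption: only roads whose lengths appear in the walkthrough).
ROADS = {
    ("Arad", "Zerind"): 75, ("Arad", "Sibiu"): 140, ("Arad", "Timisoara"): 118,
    ("Sibiu", "Fagaras"): 99, ("Sibiu", "Oradea"): 151, ("Sibiu", "Rimnicu"): 80,
    ("Rimnicu", "Pitesti"): 97, ("Rimnicu", "Craiova"): 146,
    ("Pitesti", "Bucharest"): 101, ("Fagaras", "Bucharest"): 211,
}
# F values: straight line distances to Bucharest.
SLD = {"Arad": 366, "Bucharest": 0, "Craiova": 160, "Fagaras": 178, "Oradea": 380,
       "Pitesti": 98, "Rimnicu": 193, "Sibiu": 253, "Timisoara": 329, "Zerind": 374}

def neighbours(town):
    """Yield (neighbour, road length) for every road touching town."""
    for (a, b), dist in ROADS.items():
        if a == town:
            yield b, dist
        elif b == town:
            yield a, dist

def a_star(start, goal):
    """Expand the TODO entry with the lowest T = S + F until the goal is popped."""
    todo = [(SLD[start], 0, start, [start])]  # entries are (T, S, town, path)
    done = set()
    while todo:
        t, s, town, path = heapq.heappop(todo)
        if town == goal:
            return path, s  # goal has the lowest T, so the search can stop
        if town in done:
            continue
        done.add(town)
        for nxt, dist in neighbours(town):
            if nxt not in done:
                s_next = s + dist
                heapq.heappush(todo, (s_next + SLD[nxt], s_next, nxt, path + [nxt]))
    return None, None

print(a_star("Arad", "Bucharest"))
# (['Arad', 'Sibiu', 'Rimnicu', 'Pitesti', 'Bucharest'], 418)
```

The nodes are popped in the same order as in the slides: Arad, Sibiu, Rimnicu, Pitesti, Fagaras, and finally Bucharest with T = 418.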
Additional optimization
Let's assume we have an (over-)estimate K for the best solution, i.e. the optimal solution will be no worse than K
Do not consider any node with total score Tv worse than K
If Tv > K then remove v from the TODO list
Additional optimization
Assume K = 430, then we can remove the nodes Zerind, Oradea, Timisoara, Craiova
QUESTION: What if K is equal to the optimum? What if K is poorly chosen? What if the rule is "If Tv >= K then remove v"? Problem?
F must be under-estimate
For the algorithm to work, F must be an under-estimate
Example: the direct distance is never longer than the road
QUESTION: What happens if F is not an under-estimate?
Then it cannot be guaranteed that the optimal solution is found
E.g. what if F_Rimnicu = 10,000 in the example?
Then T_Rimnicu = 10,220 > K = 450, so Rimnicu would be removed, and the optimal solution would not be found
From Romania to Dresden
So, what does that mean for multiple sequence alignment?
QUESTIONS:
What does a node (city) correspond to?
What does an edge between nodes correspond to?
What does the cost between two nodes correspond to?
How could we define S?
How could we define F?
How could we define K?
The Sum of Pairs Method
As in the pairwise case, not all MSAs are equally good. We need a scoring method to determine when one MSA is better than another one
The Sum of Pairs (SP) method: For each column in the alignment, sum up the score of each pair of residues.
M: an MSA of the sequences (s1, s2, ..., sm)
s'i: the projection of si, i.e. the sequence si with gaps
S(s'i, s'j): the score of the projections
The final score is
SP(M) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(s'_i, s'_j)
An Example of Using the SP Method
s1 = AVP      s'1: A-VP-
s2 = AVT      s'2: A-V-T
s3 = PSVPT    s'3: PSVPT
Scores: match = 1; mismatch, insertion, deletion = -1; S(-, -) = 0 to prevent the double counting of gaps.
QUESTION: What is the score of the alignment?
Then the SP score is
S(s'1, s'2) + S(s'1, s'3) + S(s'2, s'3)
= 0 + (-1) + (-1)
= -2
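The SP score of the example can be recomputed with a minimal sketch. It assumes the alignment is given as equally long strings with "-" for gaps; the function names `pair_score` and `sp_score` are made up for this illustration.

```python
from itertools import combinations

def pair_score(a, b, match=1, other=-1):
    """Score one pair of symbols; two gaps score 0 to avoid double counting."""
    if a == "-" and b == "-":
        return 0
    return match if a == b else other

def sp_score(rows, match=1, other=-1):
    """Sum of pairs: add the pairwise scores of all pairs of rows, column by column."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(pair_score(a, b, match, other)
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))

print(sp_score(["A-VP-", "A-V-T", "PSVPT"]))  # -2
```

The same function with match = 3 can be used for the DQLF, DNVQ, QGL example later in the slides.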
1 MSA vs. n SA
What is the difference between making one multiple sequence alignment and making many pairwise sequence comparisons?
The score S(s'i, s'j) for the alignment of s'i, s'j within a multiple sequence alignment is at most the score S(si, sj) for aligning si, sj directly
S(s'i, s'j) <= S(si, sj)
Pruning the search space
Computing all cells in the dynamic programming solution is expensive, therefore we want to avoid computing as many cells as possible
Can we rule out any cells?
Let us assume that we already know an alignment of score K
Let v = (i1, i2, ..., im) be a cell of the DP matrix for which we want to determine whether we need to consider it (and its neighbours) or not
Let Sv be the score of the best path from the start cell to cell v
Let Fv be an upper bound for the highest-scoring alignment from v to the end of the DP matrix, i.e. any path from v to the end scores at most Fv
Then we know the following: If Sv + Fv < K, then v cannot lie on the path of the best alignment
[Figure: path from the start cell to v with score Sv, and from v to the end cell with score <= Fv]
Dynamic Pruning with Forward Recursion
D(v,w) is the score to be added when moving from v to its forward (east, southeast, south) neighbour w.
I.e. the overall score Sv + D(v,w) is sent to w.
The value of Sw is the maximum of all values sent to w from its backward (west, north, northwest) neighbour cells.
From cell v values are sent to all its neighbour cells
[Figure: cell v sends Sv - g along the two gap moves and Sv + R(s1_i, s2_j) along the diagonal move]
One more thing: A queue
We need a data structure before we list the algorithm
A queue is a list of elements with two special operators:
Push: to add an element at the end of the queue
Pop: to remove an element from the front of the queue
Algorithm: Forward-recursion with pruning
const
  h0       the start cell of the DP matrix (H0,0,...,0)
  hN       the end cell of the DP matrix (Hn1,n2,...,nm)
  K        a lower bound for the score of the whole alignment
var
  u, v, w  denote cells
  S(u)     the best score of an alignment from h0 to u
  P(u)     the score of the best alignment from h0 to u found so far
  D(u, v)  the score for extending the alignment from cell u to cell v
  Q        a queue of the cells u for which a value for P(u) is found but u is not visited yet
proc F(v, hN)   a procedure which finds an upper bound of the score of the alignment from a cell v to the end cell hN
begin
  v = h0; P(v) = 0; push(v, Q)              push start cell on queue
  while Q is not empty do
    pop(v, Q); S(v) = P(v)                  v has got all values from its neighbours
    if S(v) + F(v, hN) >= K then
      for all forward neighbours w of v do
        if w doesn't belong to Q then
          push(w, Q); P(w) = S(v) + D(v, w)
        else
          P(w) = max( P(w), S(v) + D(v, w) )
      end for
  end while
end
Finding upper limits for scores
For any multiple sequence alignment M of the sequences {s^1, s^2, ..., s^m} we know that the score S(M) of the multiple sequence alignment is at most the sum of the scores of the pairwise alignments of the sequences {s^1, s^2, ..., s^m}:
S(M) \le \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S(s^k, s^l)
The procedure F should find an upper bound for the alignment of the subsequences s^1_{i_1+1 \ldots n_1}, s^2_{i_2+1 \ldots n_2}, ..., s^m_{i_m+1 \ldots n_m}. This can be done as follows:
F = \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S(s^k_{i_k+1 \ldots n_k}, s^l_{i_l+1 \ldots n_l})    (4.6)
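The forward recursion with pruning and the upper bound F can be put together in a sketch. It makes several assumptions that are not in the slides: cells are visited in lexicographic order via a heap (so that every cell has received all values from its backward neighbours when it is popped), the scoring is match = 3 and everything else = -1 with S(-, -) = 0, and the function names are made up for this illustration.

```python
import heapq
from itertools import combinations, product

MATCH, OTHER = 3, -1  # assumed scores: match = 3; mismatch, insertion, deletion = -1

def pair_score(a, b):
    if a == "-" and b == "-":
        return 0  # S(-, -) = 0
    return MATCH if a == b else OTHER

def pairwise_best(s, t):
    """Optimal global pairwise alignment score (standard dynamic programming)."""
    prev = [j * OTHER for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * OTHER]
        for j in range(1, len(t) + 1):
            cur.append(max(prev[j - 1] + pair_score(s[i - 1], t[j - 1]),
                           prev[j] + OTHER, cur[j - 1] + OTHER))
        prev = cur
    return prev[-1]

def upper_bound(seqs, v):
    """F: sum of the optimal pairwise scores of the remaining suffixes (eq. 4.6)."""
    return sum(pairwise_best(seqs[k][v[k]:], seqs[l][v[l]:])
               for k, l in combinations(range(len(seqs)), 2))

def forward_recursion(seqs, K):
    """Return (S(hN), number of cells visited); S(hN) is None if hN is never reached."""
    m = len(seqs)
    end = tuple(len(s) for s in seqs)
    moves = [d for d in product((0, 1), repeat=m) if any(d)]
    start = (0,) * m
    P = {start: 0}
    queue = [start]  # heap of cells, popped in lexicographic order
    visited = 0
    while queue:
        v = heapq.heappop(queue)
        S_v = P[v]  # v has received all values from its backward neighbours
        visited += 1
        if v == end:
            return S_v, visited
        if S_v + upper_bound(seqs, v) < K:
            continue  # pruned: v cannot lie on the best path
        for d in moves:
            w = tuple(i + di for i, di in zip(v, d))
            if any(wi > n for wi, n in zip(w, end)):
                continue  # move would run past the end of a sequence
            column = [seqs[k][v[k]] if d[k] else "-" for k in range(m)]
            score = S_v + sum(pair_score(a, b) for a, b in combinations(column, 2))
            if w not in P:
                P[w] = score
                heapq.heappush(queue, w)
            else:
                P[w] = max(P[w], score)
    return None, visited

seqs = ["DQLF", "DNVQ", "QGL"]
print(forward_recursion(seqs, K=-2))             # with pruning
print(forward_recursion(seqs, K=float("-inf")))  # without pruning
```

Both calls return the same score; the first visits no more cells than the second, because cells with S + F < K are not expanded.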
Questions
QUESTION: What is the score of the multiple sequence alignment when the algorithm is done?
QUESTION: How can we get the alignment from the algorithm?
Answers
The score for the multiple sequence alignment is S(hN)
How can we get an alignment from the algorithm?
We need another variable Dir to store the direction from which we were coming
Let's assume we are at node v and its neighbour w is not pruned
If w is new in the queue then Dir(w) = {v}
If w is already in the queue and S(v)+D(v,w) > P(w) then P(w) = S(v)+D(v,w) and Dir(w) = {v}
If w is already in the queue and S(v)+D(v,w) = P(w) then add v to Dir(w)
Algorithm: Forward-recursion with pruning
const
  h0       the start cell of the DP matrix (H0,0,...,0)
  hN       the end cell of the DP matrix (Hn1,n2,...,nm)
  K        a lower bound for the score of the whole alignment
var
  u, v, w  denote cells
  S(u)     the best score of an alignment from h0 to u
  P(u)     the score of the best alignment from h0 to u found so far
  D(u, v)  the score for extending the alignment from cell u to cell v
  Q        a queue of the cells u for which a value for P(u) is found but u is not visited yet
  Dir(w)   stores the nodes v from which the best scores were obtained
proc F(v, hN)   a procedure which finds an upper bound of the score of the alignment from a cell v to the end cell hN
begin
  v = h0; P(v) = 0; push(v, Q)              push start cell on queue
  while Q is not empty do
    pop(v, Q); S(v) = P(v)                  v has got all values from its neighbours
    if S(v) + F(v, hN) >= K then
      for all forward neighbours w of v do
        if w doesn't belong to Q then
          push(w, Q); P(w) = S(v) + D(v, w); Dir(w) = {v}
        else if S(v) + D(v, w) > P(w) then
          P(w) = S(v) + D(v, w); Dir(w) = {v}
        else if S(v) + D(v, w) = P(w) then
          add v to Dir(w)
      end for
  end while
end
Printing the alignment: printMSA(hN, 0)
printMSA is a recursive function, which takes a node v and a position k in the alignment to be generated as input
B is a matrix, which contains the alignment
printMSA(v, k):
  If v = h0 then
    print B
  Else
    Let i1, ..., im be the indices of v
    For all u in Dir(v) do
      Let i'1, ..., i'm be the indices of u
      For j from 1 to m do
        If ij = i'j then Bk,j = "-"
        Else Bk,j = sequence j at position ij
      printMSA(u, k+1)
Questions
QUESTION: Why is Dir a set and not a single node?
QUESTION: Does printMSA print one multiple sequence alignment or all possible ones?
Example
Let's align DQLF, DNVQ, QGL with match = 3 and insertion, deletion, mismatch = -1
[Figure: 3D dynamic programming matrix from the start cell <0,0,0> to the end cell <3,3,2>]
Example
We need a lower bound for the overall result. Let's assume we have already got the following alignment
DQ-LF
DNVQ-
-QGL-
What is K, the sum of pairs for this alignment?
K = -1 -4 + 3 = -2
Example
Upper bound for the score from <0,0,0> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1)
D--QLF    DQ-LF    DNVQ--
DNVQ--    -QGL-    ---QGL
  +2        +3       -2
F( <0,0,0>, <3,3,2> ) = +2 +3 -2 = +3
Example
Q: <0,0,0>
P( <0,0,0> ) = 0
S( <0,0,0> ) = 0
Example
S( <0,0,0> ) + F( <0,0,0>, <3,3,2> ) = 0 + 3 >= -2
Q: <0,0,1>, <0,1,0>, <0,1,1>, ..., <1,1,1>
P( <0,0,1> ) = 0 + -2    (column --Q)
P( <0,1,0> ) = 0 + -2    (column -D-)
P( <0,1,1> ) = 0 + -3    (column -DQ)
P( <1,0,0> ) = 0 + -2    (column D--)
P( <1,0,1> ) = 0 + -3    (column D-Q)
P( <1,1,0> ) = 0 + 1     (column DD-)
P( <1,1,1> ) = 0 + 1     (column DDQ)
Example
v = <0,0,1>, Q: <0,1,0>, <0,1,1>, ..., <1,1,1>
S( <0,0,1> ) = P( <0,0,1> ) = -2
Example
Upper bound for the score from <0,0,1> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1)
D--QLF    DQLF    DNVQ
DNVQ--    -GL-    GL--
  +2       +0      -4
F( <0,0,1>, <3,3,2> ) = +2 +0 -4 = -2
Example
v = <0,0,1>
S( <0,0,1> ) = -2
S( <0,0,1> ) + F( <0,0,1>, <3,3,2> ) = -2 - 2 = -4 < -2, so the test S(v) + F(v, hN) >= K fails
Q: <0,1,0>, <0,1,1>, ..., <1,1,1>
v = <0,0,1> is not further pursued as the pruning rule determines that it cannot be part of the best alignment
From MSA to phylogenetic trees
[Figure: the aligned sequences AR-L, ARTL, ARSI, ARSL, AWTL, AWT- are split in three steps (1, 2, 3) into groups, starting with {AR-L, ARTL, ARSI, ARSL} and {AWTL, AWT-}, giving a tree with one leaf per sequence]
Phylogenetic tree
Introduction
Definition
Tree construction methods
- Clustering (UPGMA)
- Neighbour Joining
Darwin: "On the Origin of Species"
Find the evolutionary history of the species existing today and how they are related.
Unrooted and Rooted Trees
[Figure: the unrooted tree for A, B, C and the rooted trees for A, B, C obtained by adding a root]
Unrooted and Rooted Trees
[Figure: all the topologies for four original sequences A, B, C, D: (a) unrooted and (b) rooted]
How many different trees are there?
The number of unrooted topologies for m≥3 original sequences is
T_{unroot}(m) = \frac{(2m-5)!}{2^{m-3} (m-3)!}    (4.7)
The number of rooted topologies for m≥2 original sequences is
T_{root}(m) = \frac{(2m-3)!}{2^{m-2} (m-2)!}    (4.8)
Example: For m=10 there are 2,027,025 unrooted trees and 34,459,425 rooted trees
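The two counting formulas are easy to evaluate directly. The following is a minimal sketch; the function names are made up for this illustration.

```python
from math import factorial

def unrooted_topologies(m):
    """Number of unrooted tree topologies for m >= 3 sequences."""
    if m < 3:
        raise ValueError("need at least 3 sequences")
    return factorial(2 * m - 5) // (2 ** (m - 3) * factorial(m - 3))

def rooted_topologies(m):
    """Number of rooted tree topologies for m >= 2 sequences."""
    if m < 2:
        raise ValueError("need at least 2 sequences")
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))

print(unrooted_topologies(4), rooted_topologies(4))    # 3 15
print(unrooted_topologies(10), rooted_topologies(10))  # 2027025 34459425
```

The numbers grow so quickly that enumerating all trees is only feasible for a handful of sequences.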
Distances between Nodes
Degree of sequence similarity should be reflected in the distances between nodes
Additive tree: The distance between any two nodes is the sum of the distances over the edges connecting the nodes
Additive Trees
A tree is additive if and only if the distance between any two nodes is the sum of the distances over the edges connecting the nodes
(a) An additive tree constructed from the sequences with the distances in (b). r shows where a root is placed.
[Figure (a): additive tree with the leaves A to F, edge lengths and root position r]
(b)
     B   C   D   E   F
A   27  24  22  31  30
B       11  21  12  11
C           18  15  14
D               25  24
E                    5
Additive Trees
If, for any four nodes (labelled i, j, k, l in a suitable order), the distances satisfy the equation below, then an additive tree can be constructed
D_{i,j} + D_{k,l} = D_{i,k} + D_{j,l} \ge D_{i,l} + D_{j,k}
This means that there are often distance matrices for which we cannot compute an additive tree
[Figure: tree with the four leaves i, l, j, k]
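The condition can be tested for a given distance matrix with a small sketch. It assumes the usual reading of the equation: for every four nodes, the two largest of the three pairwise sums must be equal. The function name `is_additive` and the dictionary layout are made up for this illustration; the distances are those of table (b) on the additive tree slide.

```python
from itertools import combinations

def is_additive(names, dist, tol=1e-9):
    """Check the four-point condition for every quadruple of nodes."""
    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]
    for i, j, k, l in combinations(names, 4):
        sums = sorted([d(i, j) + d(k, l), d(i, k) + d(j, l), d(i, l) + d(j, k)])
        if abs(sums[2] - sums[1]) > tol:
            return False  # the two largest sums differ
    return True

# Distances from table (b) of the additive tree slide.
DIST = {
    ("A", "B"): 27, ("A", "C"): 24, ("A", "D"): 22, ("A", "E"): 31, ("A", "F"): 30,
    ("B", "C"): 11, ("B", "D"): 21, ("B", "E"): 12, ("B", "F"): 11,
    ("C", "D"): 18, ("C", "E"): 15, ("C", "F"): 14,
    ("D", "E"): 25, ("D", "F"): 24, ("E", "F"): 5,
}
print(is_additive("ABCDEF", DIST))  # True
```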
Distance-based Approach
Single alignment:
atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
|||||||||||||||  |||| ||||||||   |||| |||||||||||||||
atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata
Score: 46 matches, 3 mismatches, 1 gap, 3 gap extensions, e.g. Score = 46x1 - 3x1 - 1x2 - 3x1 = 38
Approach:
Define a distance between two sequences, e.g. the percentage of mismatches in their alignment
Construct a tree, which iteratively groups together the sequences with minimal distances
Distance based
[Figure: from the alignment to the distance matrix to the tree, with nodes numbered 1 to 7]
Hierarchical Clustering (Single linkage)
Original distances:
     1   2   3   4   5
1    0   2   6  10   9
2        0   5   9   8
3            0   4   5
4                0   3
5                    0
After joining 1 and 2:
        (1,2)   3   4   5
(1,2)     0     5   9   8
3               0   4   5
4                   0   3
5                       0
After joining 4 and 5:
        (1,2)   3   (4,5)
(1,2)     0     5     8
3               0     4
(4,5)                 0
After joining 3 and (4,5):
            (1,2)   (3,(4,5))
(1,2)         0         5
(3,(4,5))               0
[Figure: dendrogram over 1, 2, 3, 4, 5 with a height axis from 0 to 5]
Hierarchical clustering
const
  m   number of original sequences
var
  U   a set of current trees, initially one tree for each original sequence
  D   the distance between the trees in U
begin
  U = the set of one tree (each of one node) for each original sequence
  while |U| > 1 do
    (u, v) = the roots of two trees in U with the least distance in D
    make a new tree with root w and with u and v as children
    calculate the length of the edges (v, w) and (u, w)
    for each root x of the trees in U - {u, v} do
      D(x, w) = calculate the distance between x and the new node w
    end
    U = (U - {u, v}) ∪ {w}            update U
  end
end
Hierarchical Clustering
How to define the distance between clusters? Distance to the new cluster w = (u,v):
Single linkage: D(x,w) = min { D(x,u), D(x,v) }. Example: Distance (A,B) to C is 1
Complete linkage: D(x,w) = max { D(x,u), D(x,v) }. Example: Distance (A,B) to C is 2
Average linkage (also called WPGMA, weighted pair group method with arithmetic mean): D(x,w) = ( D(x,u) + D(x,v) ) / 2. Example: Distance (A,B) to C is 1.5
More general (also called UPGMA, unweighted pair group method using arithmetic mean): D(x,w) = ( mu D(x,u) + mv D(x,v) ) / (mu + mv), where mu is the number of nodes in the subtree u
Question: Are dendrograms always the same independent of the method?
Question: What's the difference between UPGMA and WPGMA?
Note: u and v may have different numbers of nodes. The plain average of WPGMA then gives the sequences in the smaller subtree more weight, hence "weighted"; UPGMA weights every original sequence equally.
Hierarchical Clustering
     A   B   C
A    0   1   2
B        0   1
C            0
[Figure: dendrograms over A, B, C for single linkage and complete linkage]
Average linkage: D(x,w) = ( D(x,u) + D(x,v) ) / 2. Example: Distance (A,B) to C is 1.5
More general: D(x,w) = ( mu D(x,u) + mv D(x,v) ) / (mu + mv), where mu is the number of nodes in the subtree u
     D   E   F
D    0   1   2
E        0  10
F            0
Consider that subtree D contains 100 nodes (mD = 100) and E only 1 (mE = 1)
Average linkage: D( (D,E), F ) = (2+10)/2 = 6
Weighted average: D( (D,E), F ) = (100*2 + 1*10)/(100+1) = 2.08
UPGMA-example
(a)
     B   C   D   E
A    3   7   8  10
B        6   8   7
C            4   5
D                6
(b)
         C     D     E
(A,B)   6.5    8    8.5
C              4     5
D                    6
(c)
        (C,D)    E
(A,B)    7.25   8.5
(C,D)           5.5
(d)
        ((C,D),E)
(A,B)     7.67
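The clustering loop and the linkage rules can be combined in a compact sketch that reproduces the UPGMA tables (a) to (d). It assumes distances are given as a dictionary keyed by pairs of names and represents a joined cluster as a nested tuple; the function name `cluster` and its `linkage` parameter are made up for this illustration.

```python
from itertools import combinations

def cluster(names, dist, linkage="upgma"):
    """Repeatedly join the two closest clusters; return [(new cluster, distance), ...]."""
    D = {frozenset(pair): d for pair, d in dist.items()}
    size = {n: 1 for n in names}
    active = list(names)
    merges = []
    while len(active) > 1:
        # pick the pair of current roots with the least distance
        u, v = min(combinations(active, 2), key=lambda p: D[frozenset(p)])
        w = (u, v)
        for x in active:
            if x == u or x == v:
                continue
            dxu, dxv = D[frozenset((x, u))], D[frozenset((x, v))]
            if linkage == "single":
                d = min(dxu, dxv)
            elif linkage == "complete":
                d = max(dxu, dxv)
            elif linkage == "wpgma":
                d = (dxu + dxv) / 2
            else:  # upgma: weight by the number of leaves in each subtree
                d = (size[u] * dxu + size[v] * dxv) / (size[u] + size[v])
            D[frozenset((x, w))] = d
        merges.append((w, D[frozenset((u, v))]))
        size[w] = size[u] + size[v]
        active = [x for x in active if x != u and x != v] + [w]
    return merges

# Distances from table (a).
DIST = {("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8, ("A", "E"): 10,
        ("B", "C"): 6, ("B", "D"): 8, ("B", "E"): 7,
        ("C", "D"): 4, ("C", "E"): 5, ("D", "E"): 6}
for node, d in cluster("ABCDE", DIST):
    print(node, round(d, 2))
```

The joins happen at the distances 3, 4, 5.5 and 7.67, matching the smallest entries of tables (a) to (d).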
Constructing the Edges of the Tree
Let's assume we want to join the subtrees u and v under the new root w
Then the edge from v to w has to have the following length
L_{v,w} = 0.5 D_{u,v} - L_{v,y_v}
where y_v is a leaf below v (L_{v,y_v} = 0 if v is itself a leaf)
Example: Joining C and D: L_{C,(C,D)} = 0.5 x 4 - 0 = 2
Joining (C,D) and E: L_{(C,D),((C,D),E)} = 0.5 x 5.5 - 2 = 0.75
[Figure: subtrees u and v joined under w, with the edge lengths L_{v,w} and L_{v,y_v}]
UPGMA-Tree
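[Figure: UPGMA tree with the leaves A, B, C, D, E and the inner nodes (A,B), (C,D), ((C,D),E). Edge lengths: A and B to (A,B) 1.5 each; C and D to (C,D) 2 each; E to ((C,D),E) 2.75; (C,D) to ((C,D),E) 0.75; (A,B) to the root 2.33; ((C,D),E) to the root 1.08]
Distances in tree
     B     C     D     E
A    3    7.66  7.66  7.66
B         7.66  7.66  7.66
C                4    5.5
D                     5.5
Original distances
     B   C   D   E
A    3   7   8  10
B        6   8   7
C            4   5
D                6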
Neighbour Joining (NJ)
Does not assume a constant molecular clock
Starts with a star tree where all nodes are linked to a central node:
[Figure: star tree with the central node x and the leaves A, B, C, D, E, F]
Neighbour Joining (NJ)
Each pair of nodes is evaluated for being clustered together
For each pair the sum of all branch lengths in the resulting tree is calculated
The pair giving the lowest sum is chosen; from then on the pair is considered as one node
This is repeated
Neighbour Joining (NJ)
[Figure: panels (a) to (d) showing successive joining steps on the leaves A to F, with the internal nodes x and Y]
Rooting an Unrooted Tree
Choose the mid-point between all nodes and introduce a new root node there
[Figure: unrooted tree with the leaves A to F and the internal nodes x and Y; the mid-point becomes the root]
Rooting an Unrooted Tree
Alternative: Use an outgroup, which has a large distance to all nodes
Example: Let's assume D is the outgroup, then the root is added to the edge from D
[Figure: tree with the leaves A, B, C, D; D is the outgroup, so the root goes on the edge leading to D]
NJ vs Hierarchical clustering
In Neighbour Joining the pair of nodes is chosen that gives the lowest sum of branch lengths in the resulting tree.
In Hierarchical clustering the pair of closest nodes are chosen not taking into account the rest of the tree.
Hierarchical clustering does not allow for rate variation among branches.
Assessing Quality: Bootstrapping
Given a tree obtained from one of the methods above
Generate the multiple alignment
For a number of iterations:
  Generate new sequences by selecting columns (possibly the same column more than once) from the multiple alignment
  Generate a tree for the new sequences
  Compare this new tree with the given tree
  For each cluster in the given tree which also appears in the new tree, the bootstrap value is increased
Bootstrap value = percentage of trees containing the same cluster
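The resampling step and the final percentage can be sketched as follows. The tree construction itself is left out; the sketch assumes that each replicate tree is summarised as a set of clusters (each cluster a frozenset of sequence names), and the function names `bootstrap_alignment` and `bootstrap_value` are made up for this illustration.

```python
import random

def bootstrap_alignment(rows, rng=random):
    """Draw as many columns as the alignment has, with replacement."""
    n_cols = len(rows[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picks) for row in rows]

def bootstrap_value(cluster, replicate_trees):
    """Percentage of replicate trees (sets of clusters) that contain the cluster."""
    hits = sum(1 for tree in replicate_trees if cluster in tree)
    return 100.0 * hits / len(replicate_trees)

rows = ["AR-L", "ARTL", "AWTL"]
print(bootstrap_alignment(rows, random.Random(0)))

# Two hypothetical replicate trees, given by their clusters.
trees = [{frozenset({"s1", "s2"})}, {frozenset({"s1", "s3"})}]
print(bootstrap_value(frozenset({"s1", "s2"}), trees))  # 50.0
```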
From Phylogenetic Trees to MSA
Use a phylogenetic tree to guide the construction of the multiple sequence alignment
[Figure: dendrogram over the sequences 1 to 5 (height axis from 0 to 5) guiding the MSA, with the groups (1,2), (4,5) and 3]
Progressive Alignment
Algorithm: Progressive alignment of the sequences {s1, s2, ..., sm}
var
  C   current set of alignments
begin
  C = { }
  for i = 1 to m do C = C ∪ {{ si }} end        one alignment of each sequence
  for i = 1 to m-1 do
    choose two alignments Ap, Aq from C
    C = C - { Ap, Aq }
    Ar = align(Ap, Aq)
    C = C ∪ { Ar }
  end
  C now contains the (single) final alignment
end
Aligning two subset alignments
Two subset alignments Ap, Aq with the sequences {sp1, ..., spn} and {sq1, ..., sqm}
Complete alignment method for aligning pairs of subset alignments
The SP score will be
[Equation: weighted sum of the scores R over all pairs of sequences, one from Ap and one from Aq, with the weights w and the normalisation factor 1/(n m)]
Clustering
The progressive alignment should be guided by a true phylogenetic tree
Methods:
Average linkage
Maximum (single) linkage
Minimum (complete) linkage
Clustering example
Three alignments: A1 = {s1, s2}, A2 = {s3, s4} and A3 = {s5}, with pairwise scores:
     s2  s3  s4  s5
s1    -   7   5   3
s2        6   4   8
s3            -   7
s4                6
Average linkage:
  S(A1,A2) = (7+5+6+4)/4 = 5.5
  S(A1,A3) = 5.5
  S(A2,A3) = 6.5  best
Maximum linkage:
  S(A1,A2) = max (7,5,6,4) = 7
  S(A1,A3) = 8  best
  S(A2,A3) = 7
Minimum linkage:
  S(A1,A2) = min (7,5,6,4) = 4
  S(A1,A3) = 3
  S(A2,A3) = 6  best
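The three linkage scores of the example can be recomputed with a few lines. This is a minimal sketch; the names `SCORES`, `group_score` and `average` are made up for this illustration.

```python
from itertools import product

# Pairwise similarity scores from the table above.
SCORES = {("s1", "s3"): 7, ("s1", "s4"): 5, ("s1", "s5"): 3,
          ("s2", "s3"): 6, ("s2", "s4"): 4, ("s2", "s5"): 8,
          ("s3", "s5"): 7, ("s4", "s5"): 6}

def group_score(a, b, combine):
    """Combine the pairwise scores of all sequence pairs between alignments a and b."""
    values = [SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]
              for x, y in product(a, b)]
    return combine(values)

def average(values):
    return sum(values) / len(values)

A1, A2, A3 = ["s1", "s2"], ["s3", "s4"], ["s5"]
for name, combine in (("average", average), ("maximum", max), ("minimum", min)):
    print(name, group_score(A1, A2, combine), group_score(A1, A3, combine),
          group_score(A2, A3, combine))
```

Since these are similarity scores, the pair with the highest value is joined next, which is why a different pair is "best" under each linkage.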
Linear clustering
Algorithm: Basic linear clustering for aligning the sequences {s1, s2, ..., sn}
var
  U   the set of sequences not aligned
  A   the current alignment
begin
  U = {s1, s2, ..., sn}
  choose two sequences (the most similar) (s, t) from U
  A = Align(s, t)
  U = U - {s, t}
  for i = 1 to n-2 do
    choose a sequence s from U
    U = U - {s}
    A = Align(A, s)
  end
end
The CLUSTALW Algorithm
CLUSTALW: one of the most popular global MSA programs
1. Calculate the (static) pairwise similarity scores for the sequences
2. Construct a guide tree by use of the pairwise scores (NJ method)
3. Calculate sequence weights, using the guide tree
4. Perform a progressive alignment, guided by the tree