Multiple Global Alignment and Phylogenetic tree


Michael Schroeder, BioTechnological Center, TU Dresden, ms@biotec.tu-dresden.de, http://biotec.tu-dresden.de


Outline

Multiple sequence alignment (MSA)
– Motivation
– The sum of pairs method (SP)

Phylogenetic tree
– Clustering
– Neighbour joining

Clustalw

What is a Multiple Sequence Alignment

MSA is the alignment of more than two sequences

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
    *               *

An example of MSA alignment

Dynamic Programming in 3D

QUESTION: Which alignment would be generated for DQLF, DNVQ, QGL?

Dynamic Programming in 3D

D--Q-LF

DNVQ---

---QGL-

How many cases do we need to consider?

In standard dynamic programming we considered 3 cases, namely match/mismatch, insert, and delete

For three sequences s1, s2, s3 there are 7 possibilities for the next column (residue or gap in each sequence):

s1_i   -      s1_i   s1_i   -      -      s1_i
s2_j   s2_j   -      s2_j   -      s2_j   -
s3_k   s3_k   s3_k   -      s3_k   -      -

For m sequences there are $2^m - 1$ possibilities

QUESTION: Why is it “2”?
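The count of possibilities can be checked by enumerating them. The following is a minimal sketch, assuming each possibility is encoded as a tuple of 0/1 flags (1 = take the next residue of that sequence, 0 = gap); the function name `column_moves` is made up for this illustration.

```python
from itertools import product

def column_moves(m):
    """All ways to extend an alignment of m sequences by one column.

    A 1 means "consume the next residue of that sequence", a 0 means "gap".
    The all-gap column is excluded, which leaves 2**m - 1 moves.
    """
    return [move for move in product((1, 0), repeat=m) if any(move)]

moves = column_moves(3)
print(len(moves))  # 7 for three sequences
```

Each sequence independently contributes either a residue or a gap, which gives two choices per sequence; only the column consisting of gaps alone is dropped.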

Complexity

For m sequences each of length n the matrix has $n^m$ cells and for each we must check $2^m - 1$ possibilities: That’s prohibitive!

Solution: Use pruning techniques (cut-offs) and heuristics to guide the search for the best solution

A little excursion to Romania:

A* Search

Further reading: Russell/Norvig, Artificial Intelligence, Chapter 4. Prentice-Hall

Problem: Find the shortest path from Arad to Bucharest

[Figure: road map of Romania with the road distances between neighbouring towns]

Optimal route is (140+80+97+101) = 418 miles

Straight Line Distances to Bucharest

Town        SLD
Arad        366
Bucharest     0
Craiova     160
Dobreta     242
Eforie      161
Fagaras     178
Giurgiu      77
Hirsova     151
Iasi        226
Lugoj       244
Mehadia     241
Neamt       234
Oradea      380
Pitesti      98
Rimnicu     193
Sibiu       253
Timisoara   329
Urziceni     80
Vaslui      199
Zerind      374

Greedy search

[Figure: road map of Romania next to the table of straight-line distances to Bucharest given above]

Go to the neighbouring city v which minimizes the distance Fv to the goal


QUESTION: Any problems? Why is it called “greedy” search?

Problems of greedy search

Not optimal: greedy search goes from Arad to Bucharest via Fagaras (450 miles), whereas the optimum goes via Rimnicu (418 miles)

Problem: The greedy algorithm does not take the distance already covered into account
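The greedy route can be reproduced with a few lines of code. This is a sketch under assumptions: the road distances cover only the part of the map that appears in the search slides and are derived from the S values shown there (for example Sibiu to Fagaras = 239 - 140 = 99), and the names `ROADS`, `SLD`, `neighbours` and `greedy` are chosen for this illustration.

```python
# Road distances in miles for the towns used in the search slides
# (derived from the S values on those slides; an assumption of this sketch).
ROADS = {
    ("Arad", "Zerind"): 75, ("Arad", "Sibiu"): 140, ("Arad", "Timisoara"): 118,
    ("Sibiu", "Fagaras"): 99, ("Sibiu", "Oradea"): 151, ("Sibiu", "Rimnicu"): 80,
    ("Rimnicu", "Pitesti"): 97, ("Rimnicu", "Craiova"): 146,
    ("Pitesti", "Bucharest"): 101, ("Fagaras", "Bucharest"): 211,
}
# Straight-line distances to Bucharest (the F values) from the table.
SLD = {"Arad": 366, "Zerind": 374, "Sibiu": 253, "Timisoara": 329,
       "Fagaras": 178, "Oradea": 380, "Rimnicu": 193, "Pitesti": 98,
       "Craiova": 160, "Bucharest": 0}

def neighbours(city):
    """Yield (neighbour, road distance) for every road touching city."""
    for (a, b), miles in ROADS.items():
        if a == city:
            yield b, miles
        elif b == city:
            yield a, miles

def greedy(start, goal):
    """Always move to the unvisited neighbour with the smallest F value."""
    path, cost, city = [start], 0, start
    while city != goal:
        options = [(SLD[n], n, miles)
                   for n, miles in neighbours(city) if n not in path]
        if not options:
            raise ValueError("dead end at " + city)
        _, city, miles = min(options)
        path.append(city)
        cost += miles
    return path, cost

print(greedy("Arad", "Bucharest"))
```

On this map fragment the sketch picks Sibiu, then Fagaras, then Bucharest, for a total of 450 miles, which is longer than the 418-mile optimum via Rimnicu and Pitesti.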

A*: Pursue the best node first, with a scoring function of the distance so far plus an under-estimate of the distance to the goal (e.g. straight-line distance)

– v is a node
– Sv: best score to go from the start to node v
– Fv: estimate for going from v to the goal
– Tv = Sv + Fv: total score

Organize the nodes to be visited sorted by total score (TODO list in the next slides)

A* search of the Romanian map featured in the previous slide. Note: Nodes are labelled with Tv = Sv + Fv. However, we will be using the abbreviations T, S and F to make the notation simpler.

[Figure: search tree with root Arad, T = 0 + 366 = 366]

We begin with the initial state of Arad. The cost of reaching Arad from Arad (or S value) is 0 miles. The straight line distance from Arad to Bucharest (or F value) is 366 miles. This gives us a total value of ( T = S + F ) 366 miles. Expand the initial state of Arad.

DONE = []

TODO = [Arad/366]

[Figure: search tree after expanding Arad. Zerind: T = 75 + 374 = 449; Sibiu: T = 140 + 253 = 393; Timisoara: T = 118 + 329 = 447]

Once Arad is expanded we look for the node with the lowest cost. Sibiu has the lowest value for T. (The cost to reach Sibiu from Arad is 140 miles, and the straight line distance from Sibiu to the goal state is 253 miles. This gives a total of 393 miles).

DONE = [Arad]

TODO = [Sibiu/393, Timisoara/447, Zerind/449]


We now expand Sibiu (that is, we expand the node with the lowest value of T).

DONE = [Arad, Sibiu]

TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671]

[Figure: search tree after expanding Sibiu. Fagaras: T = 239 + 178 = 417; Oradea: T = 291 + 380 = 671; Rimnicu: T = 220 + 193 = 413]

We now expand Rimnicu (that is, we expand the node with the lowest value of T ).

DONE = [Arad, Sibiu]

TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671]

[Figure: search tree after expanding Rimnicu. Pitesti: T = 317 + 98 = 415; Craiova: T = 366 + 160 = 526]

Once Rimnicu is expanded we look for the node with the lowest cost. As you can see, Pitesti has the lowest value for T. (The cost to reach Pitesti from Arad is 317 miles, and the straight line distance from Pitesti to the goal state is 98 miles. This gives a total of 415 miles.)

DONE = [Arad, Sibiu, Rimnicu]

TODO = [Pitesti/415, Fagaras/417, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]


We now expand Pitesti (that is, we expand the node with the lowest value of T).

DONE = [Arad, Sibiu, Rimnicu, Pitesti]

TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]

[Figure: search tree after expanding Pitesti. Bucharest: T = 418 + 0 = 418]

In actual fact, the algorithm will not really recognise that we have found Bucharest. It just keeps expanding the lowest cost nodes (based on T ) until it finds a goal state AND it has the lowest value of T. So, we must now move to Fagaras and expand it.


We have just expanded a node (Pitesti) that revealed Bucharest, but it has a cost of 418. If there is any other lower cost node (and in this case there is one cheaper node, Fagaras, with a cost of 417) then we need to expand it in case it leads to a better solution to Bucharest than the 418 solution we have already found.


We now expand Fagaras (that is, we expand the node with the lowest value of T ).

[Figure: search tree after expanding Fagaras. Bucharest(2): T = 450 + 0 = 450]

Once Fagaras is expanded we look for the lowest cost node. As you can see, we now have two Bucharest nodes. One of these nodes (Arad – Sibiu – Rimnicu – Pitesti – Bucharest) has a T value of 418. The other node (Arad – Sibiu – Fagaras – Bucharest(2)) has a T value of 450. We therefore move to the first Bucharest node and expand it.

DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras]

TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671]


We have now arrived at Bucharest. As this is the lowest cost node AND the goal state we can terminate the search. If you look back over the slides you will see that the solution returned by the A* search pattern ( Arad – Sibiu – Rimnicu – Pitesti – Bucharest ), is in fact the optimal solution.

DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras]

TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671]
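The walk-through above can be sketched in code. This is a minimal illustration, not a reference implementation: the TODO list is kept as a heap ordered by T, the road distances are derived from the S values on the slides and cover only the towns that appear in the search tree, and all names (`ROADS`, `SLD`, `neighbours`, `a_star`) are assumptions of this sketch.

```python
import heapq

# Road distances in miles, derived from the S values on the slides
# (only the towns that appear in the search tree; an assumption of this sketch).
ROADS = {
    ("Arad", "Zerind"): 75, ("Arad", "Sibiu"): 140, ("Arad", "Timisoara"): 118,
    ("Sibiu", "Fagaras"): 99, ("Sibiu", "Oradea"): 151, ("Sibiu", "Rimnicu"): 80,
    ("Rimnicu", "Pitesti"): 97, ("Rimnicu", "Craiova"): 146,
    ("Pitesti", "Bucharest"): 101, ("Fagaras", "Bucharest"): 211,
}
# Straight-line distances to Bucharest: the under-estimate F.
SLD = {"Arad": 366, "Zerind": 374, "Sibiu": 253, "Timisoara": 329,
       "Fagaras": 178, "Oradea": 380, "Rimnicu": 193, "Pitesti": 98,
       "Craiova": 160, "Bucharest": 0}

def neighbours(city):
    """Yield (neighbour, road distance) for every road touching city."""
    for (a, b), miles in ROADS.items():
        if a == city:
            yield b, miles
        elif b == city:
            yield a, miles

def a_star(start, goal):
    """Return (path, cost, expansion order) using T = S + F."""
    todo = [(SLD[start], 0, [start])]      # heap of (T, S, path so far)
    done = []                              # expanded nodes, in order
    while todo:
        t, s, path = heapq.heappop(todo)   # node with the lowest T
        city = path[-1]
        if city == goal:                   # goal AND lowest T: finished
            return path, s, done
        if city in done:
            continue
        done.append(city)
        for nxt, miles in neighbours(city):
            if nxt not in done:
                s_next = s + miles
                heapq.heappush(todo, (s_next + SLD[nxt], s_next, path + [nxt]))
    return None

path, cost, done = a_star("Arad", "Bucharest")
print(path, cost, done)
```

With these inputs the sketch expands Arad, Sibiu, Rimnicu, Pitesti and Fagaras in that order and returns the 418-mile route, matching the DONE list of the slides. Note that the goal test happens when a node is taken from the TODO list, not when it is first generated, which is why Fagaras (T = 417) is still expanded after Bucharest (T = 418) has been discovered.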

Additional optimization

Let’s assume we have an (over-)estimate K for the best solution, i.e. the optimal solution will be better than K

Do not consider any node with a total score Tv worse than K

If Tv > K then remove v from the TODO list


Additional optimization

Assume K = 430, then we can remove the nodes Zerind, Oradea, Timisoara, Craiova

QUESTION: What if K is equal to the optimum? What if K is poorly chosen? What if the rule is “If Tv >= K then remove v”? Problem?

F must be under-estimate

For algorithm to work F must be an under-estimate

Example: Direct distance is always shorter than road

QUESTION: What happens if F is not an under-estimate?


Then it cannot be guaranteed that the optimal solution is found. E.g. what if F = 10,000 for Rimnicu in the example?

Then T for Rimnicu = 10,220 > K = 450, so Rimnicu would be removed, and the optimal solution would not be found


From Romania to Dresden

So, what does that mean for multiple sequence alignment?

QUESTIONS:
– What does a node (city) correspond to?
– What does an edge between nodes correspond to?
– What does the cost between two nodes correspond to?
– How could we define S?
– How could we define F?
– How could we define K?

The Sum of Pairs Method

As in the pairwise case, not all MSA’s are equally good. We need a scoring method to determine when one MSA is better than another one

The Sum of Pairs (SP) method: For each column in the alignment, sum up the score of each pair of residues.

– M: an MSA of the sequences (s1, s2, ..., sm)
– s’i is the projection of si, i.e. the sequence si with gaps
– S(s’i,s’j): the score of the projections
– The final score is

$$SP(M) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(s'_i, s'_j)$$

QUESTION: What is the score of the alignment?

An Example of Using the SP Method

Example

s1 = AVP      s’1: A-VP-
s2 = AVT      s’2: A-V-T
s3 = PSVPT    s’3: PSVPT

Scores:
– Match = 1
– Mismatch, insertion, deletion = -1
– S(-, -) = 0 to prevent the double counting of gaps


Then the SP score is

S(s’1,s’2) + S(s’1,s’3) + S(s’2,s’3)

= 0 + (-1) + (-1) = -2
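The SP score of this example can be recomputed column by column. This is a minimal sketch using the scores stated above (match = 1; mismatch, insertion, deletion = -1; gap against gap = 0); the function names `pair_score` and `sp_score` are chosen for this illustration.

```python
from itertools import combinations

def pair_score(a, b):
    """Score of one aligned pair of symbols, with the scores of the example."""
    if a == "-" and b == "-":
        return 0              # gap against gap is not counted
    return 1 if a == b else -1  # a residue against a gap also scores -1

def sp_score(rows):
    """Sum of pairs: add pair_score over all row pairs in every column."""
    return sum(pair_score(a, b)
               for column in zip(*rows)
               for a, b in combinations(column, 2))

msa = ["A-VP-", "A-V-T", "PSVPT"]
print(sp_score(msa[:2]), sp_score(msa))
```

Summing per column over all pairs gives the same total as summing the three pairwise projection scores, because both just visit every pair of symbols in every column once.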

1 MSA vs. n SA

What is the difference between making one multiple sequence alignment and making many pairwise sequence comparisons?

The score S(s’i,s’j) for the projection s’i,s’j in a multiple sequence alignment is less than or equal to the score S(si,sj) for aligning si,sj directly

S(s’i,s’j) <= S(si,sj)

Pruning the search space

Computing all cells in the dynamic programming solution is expensive, therefore we want to avoid computing as many cells as possible

Can we rule out any cells?

– Let us assume that we already know an alignment of score K
– Let v = (i1,i2,…,im) be a cell of the DP matrix for which we want to determine whether we need to consider it (and its neighbours) or not
– Let Sv be the score of the best path from the start cell to cell v
– Let Fv be an upper bound for the highest-scoring alignment from v to the end of the DP matrix, i.e. any path from v to the end scores at most Fv

Then we know the following: If Sv + Fv < K, then v cannot lie on the path of the best alignment

Dynamic Pruning with Forward Recursion

D(v,w) is the score to be added when moving from v to its forward (east, southeast, south) neighbor w.

I.e. the overall score Sv+D(v,w) is sent to w.

The value of Sw is the maximum of all values sent to w from its backward (west, north, northwest) neighbor cells.

From cell v values are sent to all its neighbor cells

[Figure: example values sent from v, e.g. SV - gv and SV + R(s1_i, s2_j)]

One more thing: A queue

We need a data structure before we list the algorithm

A queue is a list of elements with two special operators
– Push: to add an element at the end of the queue
– Pop: to remove an element from the front of the queue

Algorithm: Forward-recursion with pruning

const
  h0   the start cell of the DP matrix (H0,0…0)
  hN   the end cell of the DP matrix (Hn1,n2…
  K    a lower bound for the score of the whole alignment
var
  u, v, w   denote cells
  S(u)      the best score of an alignment from h0 to u
  P(u)      the score of the best alignment from h0 to u found so far
  D(u, v)   the score for extending the alignment from cell u to cell v
  Q         a queue of the cells u for which a value for P(u) is found but u is not visited yet
proc
  F(v, hN)  a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN
begin
  v = h0; P(v) = 0; push(v,Q)                (push start cell on queue)
  while Q is not empty do
    pop(v,Q); S(v) = P(v)                    (v has got all values from its neighbours)
    if S(v) + F(v, hN) >= K then
      for all forward neighbours w of v do
        if w doesn’t belong to Q then
          push(w,Q); P(w) = S(v) + D(v,w)
        else
          P(w) = max( P(w), S(v)+D(v,w) )
      end for
  end while
end

Finding upper limits for scores

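For any multiple sequence alignment M of sequences {s1,s2,…,sm} we know that the score S(M) of the multiple sequence alignment is less than or equal to the sum of the pairwise comparisons of the sequences {s1,s2,…,sm}

$$S(M) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(s'_i, s'_j) \le \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(s_i, s_j)$$

The procedure F should find an upper bound for the alignment of the subsequences $s^1_{i_1+1 \ldots n_1}, s^2_{i_2+1 \ldots n_2}, \ldots, s^m_{i_m+1 \ldots n_m}$. This can be done as follows:

$$F(v, h_N) = \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S(s^k_{i_k+1 \ldots n_k}, s^l_{i_l+1 \ldots n_l}) \qquad (4.6)$$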

Questions

QUESTION: What is the score of the multiple sequence alignment when the algorithm is done?

QUESTION: How can we get the alignment from the algorithm?

Answers

The score for the multiple sequence alignment is S(hN)

How can we get an alignment from the algorithm? We need another variable Dir to store the direction from which we were coming

Let’s assume we are at node v and its neighbour w is not pruned
– If w is new in the queue then Dir(w) = {v}
– If w is already in the queue and S(v)+D(v,w) > P(w) then P(w) = S(v)+D(v,w) and Dir(w) = {v}
– If w is already in the queue and S(v)+D(v,w) = P(w) then add v to Dir(w)

Algorithm: Forward-recursion with pruning (with back-pointers)

const
  h0   the start cell of the DP matrix (H0,0…0)
  hN   the end cell of the DP matrix (Hn1,n2…
  K    a lower bound for the score of the whole alignment
var
  u, v, w   denote cells
  S(u)      the best score of an alignment from h0 to u
  P(u)      the score of the best alignment from h0 to u found so far
  D(u, v)   the score for extending the alignment from cell u to cell v
  Q         a queue of the cells u for which a value for P(u) is found but u is not visited yet
  Dir(w)    stores the nodes v from which the best scores were obtained
proc
  F(v, hN)  a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN
begin
  v = h0; P(v) = 0; push(v,Q)                (push start cell on queue)
  while Q is not empty do
    pop(v,Q); S(v) = P(v)                    (v has got all values from its neighbours)
    if S(v) + F(v, hN) >= K then
      for all forward neighbours w of v do
        if w doesn’t belong to Q then
          push(w,Q); P(w) = S(v) + D(v,w); Dir(w) = {v}
        else if S(v)+D(v,w) > P(w) then
          P(w) = S(v)+D(v,w); Dir(w) = {v}
        else if S(v)+D(v,w) = P(w) then
          add v to Dir(w)
      end for
  end while
end

Printing the alignment: printMSA(hN,0)

printMSA is a recursive function, which takes a node v and a position k in the alignment to be generated as input

B is a matrix, which contains the alignment

printMSA(v,k):
  If v = h0 then print B
  Else
    Let i1,…,im be the indices of v
    For all u in Dir(v) do
      Let i’1,…,i’m be the indices of u
      For j from 1 to m do
        If ij = i’j then Bk,j = “-”
        Else Bk,j = sequence j at position ij
      printMSA(u,k+1)

Questions

QUESTION: Why is Dir a set and not a single node?

QUESTION: Does printMSA print one multiple sequence alignment or all possible ones?

Example

Let’s align DQLF, DNVQ, QGL with match = 3 and insertion, deletion, mismatch = -1

[Figure: DP matrix with start cell <0,0,0> and end cell <3,3,2>]

Example

We need a lower bound for the overall result. Let’s assume we have already got the following alignment

[Figure: the known alignment]

What is K, the sum of pairs for this alignment?

K = -1 - 4 + 3 = -2

Example

Upper bound for the score from <0,0,0> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1)

F( <0,0,0>, <3,3,2> ) = +2 +3 -2 = +3

D--QLF DQ-LF DNVQ--

DNVQ-- -QGL- ---QGL

+2 +3 -2

Example (trace of the algorithm)

Q: <0,0,0>
P( <0,0,0> ) = 0
S( <0,0,0> ) = 0

S( <0,0,0> ) + F( <0,0,0>, <3,3,2> ) = 0 + 3 >= -2
Q: <0,0,1>, <0,1,0>, <0,1,1>, … , <1,1,1>

P( <0,0,1> ) = 0 + -2    column --Q
P( <0,1,0> ) = 0 + -2    column -D-
P( <0,1,1> ) = 0 + -3    column -DQ
P( <1,0,0> ) = 0 + -2    column D--
P( <1,0,1> ) = 0 + -3    column D-Q
P( <1,1,0> ) = 0 + 1     column DD-
P( <1,1,1> ) = 0 + 1     column DDQ

v = <0,0,1>, Q: <0,1,0>, <0,1,1>, … , <1,1,1>
S( <0,0,1> ) = P( <0,0,1> ) = -2

Example

Upper bound for the score from <0,0,1> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1)

F( <0,0,1>, <3,3,2> ) = +2 +0 -4 = -2

D--QLF DQLF DNVQ

DNVQ-- -GL- GL--

+2 +0 -4

v = <0,0,1>
S( <0,0,1> ) = -2
S( <0,0,1> ) + F( <0,0,1>, <3,3,2> ) = -2 - 2 = -4 < -2

Q: <0,1,0>, <0,1,1>, … , <1,1,1>

v = <0,0,1> is not pursued further, as the pruning rule determines that it cannot be part of the best alignment
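The forward recursion with pruning can be sketched for this example as follows. Two points are assumptions of this sketch rather than part of the slides: cells are taken from a heap in lexicographic order instead of a plain queue, which guarantees that a cell has received the values of all its backward neighbours before it is visited, and the end cell is written as the tuple of sequence lengths. The upper bound F is computed as the sum of the optimal pairwise scores of the remaining suffixes; all function names are made up for this illustration.

```python
import heapq
from itertools import combinations, product

MATCH, OTHER = 3, -1   # scores of the example: match = 3, everything else = -1

def pair_score(a, b):
    if a == "-" and b == "-":
        return 0                      # gap against gap is not counted
    return MATCH if a == b else OTHER

def suffix_scores(s, t):
    """best[i][j] = optimal pairwise alignment score of s[i:] against t[j:]."""
    n, k = len(s), len(t)
    best = [[0] * (k + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(k, -1, -1):
            options = []
            if i < n and j < k:
                options.append(best[i + 1][j + 1] + pair_score(s[i], t[j]))
            if i < n:
                options.append(best[i + 1][j] + OTHER)
            if j < k:
                options.append(best[i][j + 1] + OTHER)
            if options:
                best[i][j] = max(options)
    return best

def forward_recursion(seqs, K):
    """Return (SP score of the best alignment, number of visited cells)."""
    m = len(seqs)
    pairs = list(combinations(range(m), 2))
    tables = {(a, b): suffix_scores(seqs[a], seqs[b]) for a, b in pairs}
    end = tuple(len(s) for s in seqs)
    moves = [e for e in product((0, 1), repeat=m) if any(e)]

    def F(v):   # upper bound: sum of optimal pairwise scores of the suffixes
        return sum(tables[a, b][v[a]][v[b]] for a, b in pairs)

    start = (0,) * m
    P = {start: 0}
    todo = [start]
    visited = 0
    while todo:
        v = heapq.heappop(todo)       # lexicographically smallest open cell
        S_v = P[v]
        visited += 1
        if v == end:
            return S_v, visited
        if S_v + F(v) < K:
            continue                  # pruned: v cannot be on the best path
        for e in moves:
            w = tuple(i + d for i, d in zip(v, e))
            if any(i > n for i, n in zip(w, end)):
                continue
            column = [seqs[x][v[x]] if e[x] else "-" for x in range(m)]
            d = sum(pair_score(column[a], column[b]) for a, b in pairs)
            if w not in P:
                P[w] = S_v + d
                heapq.heappush(todo, w)
            else:
                P[w] = max(P[w], S_v + d)
    return None, visited

print(forward_recursion(["DQLF", "DNVQ", "QGL"], K=-2))
```

With K = -2 the cell <0,0,1> is visited but not expanded, as in the trace above, so cells that can only be reached through it are never touched. Passing a very small K disables pruning and visits every cell of the matrix.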

From MSA to phylogenetic trees

AR-L
ARTL
ARSI
ARSL
AWTL
AWT-

[Figure: the aligned sequences are grouped step by step into a tree]

Phylogenetic tree

– Introduction
– Definition
– Tree construction method
  – Clustering (UPGMA)
  – Neighbour Joining

Darwin: “Origin of the species”

Find the evolutionary history of species existing today and how they are related.

Unrooted and Rooted Trees

[Figure] All the topologies for four original sequences A, B, C, D: (a) unrooted and (b) rooted

How many different trees are there?

The number of unrooted topologies for m ≥ 3 original sequences is

$$T_{unroot}(m) = \frac{(2m-5)!}{2^{m-3}\,(m-3)!}$$

The number of rooted topologies for m ≥ 2 original sequences is

$$T_{root}(m) = \frac{(2m-3)!}{2^{m-2}\,(m-2)!}$$

Example: For m=10 there are 2,027,025 unrooted trees and 34,459,425 rooted trees
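The two counts can be evaluated directly from the factorial formulas. This is a small sketch; the function names are chosen for this illustration.

```python
from math import factorial

def unrooted_trees(m):
    """Number of unrooted topologies for m >= 3 sequences."""
    if m < 3:
        raise ValueError("need at least 3 sequences")
    return factorial(2 * m - 5) // (2 ** (m - 3) * factorial(m - 3))

def rooted_trees(m):
    """Number of rooted topologies for m >= 2 sequences."""
    if m < 2:
        raise ValueError("need at least 2 sequences")
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))

for m in (4, 10):
    print(m, unrooted_trees(m), rooted_trees(m))
```

For four sequences this gives 3 unrooted and 15 rooted topologies, and for ten sequences it reproduces the numbers of the example, which shows how quickly exhaustive tree search becomes infeasible.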

Distances between Nodes

Degree of sequence similarity should be reflected in the distances between nodes

Additive tree: The distance between any two nodes is the sum of the distances over the edges connecting the nodes

Additive Trees

A tree is additive if and only if the distance between any two nodes is the sum of the distances over the edges connecting the nodes

(a) An additive tree constructed from the sequences with the distances in (b). r shows where a root is placed.

[Figure (a): additive tree; visible edge labels include 1.5 and 4.5]

(b)
      B    C    D    E    F
A    27   24   22   31   30
B         11   21   12   11
C              18   15   14
D                   25   24
E                         5

Additive Trees

If for every four nodes i, j, k, l (labelled suitably) the distances satisfy the equation below, i.e. the two largest of the three pairwise sums are equal, then an additive tree can be constructed

Di,j + Dk,l = Di,k + Dj,l ≥ Di,l + Dj,k

This means that there are often distance matrices for which we cannot compute an additive tree
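The condition can be tested mechanically on a distance matrix. The sketch below checks, for every group of four nodes, that the two largest of the three pairwise sums are equal; it uses the distance table of the additive-tree example above. The helper names `make_dist` and `is_additive` are assumptions of this sketch.

```python
from itertools import combinations

def make_dist(names, rows):
    """Build a symmetric lookup from the rows of an upper-triangular table."""
    d = {}
    for i, row in enumerate(rows):
        for j, value in enumerate(row, start=i + 1):
            d[frozenset((names[i], names[j]))] = value
    return d

def is_additive(names, d, tol=1e-9):
    """Four-point condition: the two largest pairwise sums must be equal."""
    for i, j, k, l in combinations(names, 4):
        sums = sorted([
            d[frozenset((i, j))] + d[frozenset((k, l))],
            d[frozenset((i, k))] + d[frozenset((j, l))],
            d[frozenset((i, l))] + d[frozenset((j, k))],
        ])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

NAMES = "ABCDEF"
ROWS = [[27, 24, 22, 31, 30], [11, 21, 12, 11], [18, 15, 14], [25, 24], [5]]
print(is_additive(NAMES, make_dist(NAMES, ROWS)))
```

The table from the example passes the check for all 15 groups of four nodes. Changing a single entry, for example the distance between A and B, is usually enough to break the condition.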

Distance-based Approach Single Alignment

Score: 46 matches, 3 mismatches, 1 gap, 3 gap extensions, e.g. Score = 46x1 - 3x1 - 1x2 - 3x1 = 38

Approach:
– Define a distance between two sequences, e.g. the percentage of mismatches in their alignment
– Construct a tree, which iteratively groups sequences with minimal distances together

atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
|||||||||||||||  |||| ||||||||   |||| |||||||||||||||
atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata

[Figure: distance-based approach: alignment, then distance matrix]

Hierarchical Clustering (Single linkage)

        1    2    3    4    5
1       0    2    6   10    9
2            0    5    9    8
3                 0    4    5

        (1,2)   3    4    5
(1,2)     0     5    9    8
3               0    4    5

        (1,2)   3   (4,5)
(1,2)     0     5     8
(4,5)                 0

            (1,2)   (3,(4,5))
(1,2)         0         5
(3,(4,5))               0

[Figure: dendrogram over the leaves 1 2 3 4 5]

Hierarchical clustering

const
  m   number of original sequences
var
  U   a set of current trees, initially one tree for each original sequence
  D   the distance between the trees in U
begin
  U = the set of one tree (each of one node) for each original sequence
  while |U| > 1 do
    (u,v) = the roots of the two trees in U with the least distance in D
    Make a new tree with root w and with u and v as children
    Calculate the length of the edges (v, w) and (u, w)
    for each root x of the trees in U - {u, v} do
      D(x, w) = calculate the distance between x and the new node (w)
    end
    U = (U - {u,v}) ∪ {w}      (update U)
  end
end

Hierarchical Clustering

How to define the distance between clusters? Distance to the new cluster w = (u,v):

– Single linkage: D(x,w) = min { D(x,u), D(x,v) }. Example: Distance (A,B) to C is 1
– Complete linkage: D(x,w) = max { D(x,u), D(x,v) }. Example: Distance (A,B) to C is 2
– Average linkage (also called WPGMA, weighted pair group method with arithmetic mean): D(x,w) = ( D(x,u) + D(x,v) ) / 2. Example: Distance (A,B) to C is 1.5
– More general (also called UPGMA, unweighted pair group method using arithmetic mean): D(x,w) = ( mu D(x,u) + mv D(x,v) ) / (mu + mv), where mu is the number of nodes in the subtree u

Question: Are dendrograms always the same independent of the method?

Question: What’s the difference between UPGMA and WPGMA?

Note: WPGMA is called “weighted” because u and v may contain different numbers of nodes but count equally in the average, hence the individual nodes are weighted unequally.

Hierarchical Clustering

Consider that subtree D contains 100 nodes (mD = 100) and E only 1 (mE = 1), with D(D,F) = 2 and D(E,F) = 10

Average linkage: D( (D,E), F ) = (2+10)/2 = 6

Weighted average: D( (D,E), F ) = (100*2 + 1*10)/(100+1) = 2.08

UPGMA-example

        B     C     D     E
A       3     7     8    10
B             6     8     7

        C     D     E
(A,B)   6.5   8     8.5
D                   6

        (C,D)   E
(A,B)   7.25    8.5
(C,D)           5.5

        ((C,D),E)
(A,B)   7.67

Constructing the Edges of the Tree

Let’s assume we want to join the subtrees u and v under the new root w

Then the edge from v to w has to have the following length, where yv is a leaf below v (so Lv,yv is the height of v, and 0 if v is itself a leaf):

Lv,w = 0.5 Du,v - Lv,yv

Example:
– Joining C and D: LC,(C,D) = 0.5 x 4 - 0 = 2
– Joining (C,D) and E: L(C,D),((C,D),E) = 0.5 x 5.5 - 2 = 0.75
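The UPGMA example, including the edge lengths, can be reproduced with a short script. This is a sketch under assumptions: the distance between C and D (4) is taken from the edge-length example, the distance between C and E (5) is inferred from the averaged value 5.5 = (5 + 6)/2 because that row is not legible in the table, and the function name `upgma` and its return format are made up for this illustration.

```python
from itertools import combinations

def upgma(names, dist):
    """dist maps frozenset({a, b}) to the distance between leaves a and b.

    Returns one tuple per merge:
    (new node, merge distance, edge length to first child, edge length to second child).
    """
    D = dict(dist)
    size = {n: 1 for n in names}       # number of leaves below each node
    height = {n: 0.0 for n in names}   # distance from a node down to its leaves
    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        u, v = min(combinations(clusters, 2), key=lambda p: D[frozenset(p)])
        d_uv = D[frozenset((u, v))]
        w = (u, v)
        height[w] = d_uv / 2
        size[w] = size[u] + size[v]
        merges.append((w, d_uv, height[w] - height[u], height[w] - height[v]))
        clusters = [c for c in clusters if c not in (u, v)]
        for x in clusters:             # size-weighted average (UPGMA rule)
            D[frozenset((x, w))] = (size[u] * D[frozenset((x, u))]
                                    + size[v] * D[frozenset((x, v))]) / size[w]
        clusters.append(w)
    return merges

pairs = {"AB": 3, "AC": 7, "AD": 8, "AE": 10, "BC": 6, "BD": 8, "BE": 7,
         "CD": 4, "CE": 5, "DE": 6}   # CE = 5 is inferred, see the lead-in
dist = {frozenset(k): v for k, v in pairs.items()}
for step in upgma("ABCDE", dist):
    print(step)
```

The merge distances come out as 3, 4, 5.5 and about 7.67, and the edge from (C,D) to its parent has length 0.75, in agreement with the tables and the edge-length example above.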

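UPGMA-Tree

[Figure: UPGMA tree over the leaves A, B, C, D, E with the inner nodes (A,B), (C,D) and ((C,D),E); the edges from A and B to (A,B) have length 1.5]

Distances in tree
      B      C      D      E
A     3    7.67   7.67   7.67
B          7.67   7.67   7.67
C                 4      5.5
D                        5.5

Original distances
      B     C     D     E
A     3     7     8    10
B           6     8     7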

Neighbour Joining (NJ)

– Does not assume a constant molecular clock
– Starts with a star tree where all nodes are linked to a central node
– Each pair of nodes is evaluated for being clustered together
– For each pair the sum of all branch lengths in the resulting tree is calculated
– The pair giving the lowest sum is chosen; from then on the pair is considered as one node
– This is repeated

[Figure: panels (a) to (d)]

Rooting an Unrooted Tree

Choose mid-point between all nodes and introduce new root node there

Mid-point = root

Rooting an Unrooted Tree

Alternative: Use an outgroup, which has large distance to all nodes

Example: Let’s assume D is outgroup, then the root is added to the edge from D

D = outgroup, so root goes here

NJ vs Hierarchical clustering

In Neighbour Joining the pair of nodes is chosen that gives the lowest sum of branch lengths in the resulting tree.

In Hierarchical clustering the pair of closest nodes is chosen without taking the rest of the tree into account.

Hierarchical clustering does not allow for rate variation among branches.

Assessing Quality: Bootstrapping

Given a tree obtained from one of the methods above:
– Generate the multiple alignment
– For a number of iterations:
  – Generate new sequences by selecting columns (possibly the same column more than once) from the multiple alignment
  – Generate a tree for the new sequences
  – Compare this new tree with the given tree
  – For each cluster in the given tree which also appears in the new tree, the bootstrap value is increased
– Bootstrap value = percentage of trees containing the same cluster
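The resampling step of the procedure above can be sketched as follows. The tree construction itself is left open: `build_clusters` is a hypothetical callable supplied by the caller that takes an alignment and returns the set of clusters (as frozensets of row indices) of the tree it builds. All names here are assumptions of this sketch.

```python
import random

def bootstrap_alignment(rows, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_cols = len(rows[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picks) for row in rows]

def bootstrap_values(rows, clusters, build_clusters, iterations=100, seed=0):
    """Percentage of resampled trees that contain each given cluster."""
    rng = random.Random(seed)
    hits = {c: 0 for c in clusters}
    for _ in range(iterations):
        found = build_clusters(bootstrap_alignment(rows, rng))
        for c in clusters:
            if c in found:
                hits[c] += 1
    return {c: 100 * hits[c] / iterations for c in clusters}

rows = ["AR-L", "ARTL", "ARSI", "ARSL", "AWTL", "AWT-"]
print(bootstrap_alignment(rows, random.Random(1)))
```

Whole columns are resampled, so every resampled alignment keeps the rows aligned with each other; only the column frequencies change from one iteration to the next.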

From Phylogenetic Trees to MSA

Use a phylogenetic tree to guide the construction of the multiple sequence alignment


Progressive Alignment

Algorithm: Progressive alignment of the sequences {s1, s2, …, sm}
var
  C   current set of alignments
begin
  C = { }
  for i = 1 to m do C = C ∪ {{ si }} end      (one alignment of each sequence)
  for i = 1 to m-1 do
    choose two alignments Ap, Aq from C
    C = C - { Ap, Aq }
    Ar = align( Ap, Aq )
    C = C ∪ { Ar }
  end                                          (C now contains the (single) final alignment)
end

Aligning two subset alignments

Two subset alignments Ap, Aq with the sequences {sp1 … spm} and {sq1 … sqm}

Complete alignment method for aligning pairs of subset alignments

The SP score will be

$$S(A_p, A_q) = \sum_{s'_k \in \{s_{p1} \ldots s_{pm}\}} \; \sum_{s'_l \in \{s_{q1} \ldots s_{qm}\}} S(s'_k, s'_l)$$

Clustering

The progressive alignment should be guided by a true phylogenetic tree

Methods
– Average linkage
– Maximum (single) linkage
– Minimum (complete) linkage

Clustering example

Three alignments: A1 = {s1, s2}, A2 = {s3, s4} and A3 = {s5}, with pairwise scores:

      s2   s3   s4   s5
s1    -    7    5    3
s2         6    4    8
s3              -    7

Average linkage
  S(A1,A2) = (7+5+6+4)/4 = 5.5
  S(A1,A3) = 5.5
  S(A2,A3) = 6.5   best

Maximum linkage
  S(A1,A2) = max (7,5,6,4) = 7
  S(A1,A3) = 8   best
  S(A2,A3) = 7

Minimum linkage
  S(A1,A2) = min (7,5,6,4) = 4
  S(A1,A3) = 3
  S(A2,A3) = 6   best
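The three linkage variants of this example differ only in how the cross scores between two alignments are combined. The sketch below recomputes the values; the score between s4 and s5 is not legible in the table and is taken as 6 here, which is inferred from S(A2,A3) = 6.5 under average linkage. The names `SCORES`, `cross_scores` and `linkage` are chosen for this illustration.

```python
from statistics import mean

# Pairwise scores from the table; ("s4", "s5") = 6 is inferred (see lead-in).
SCORES = {("s1", "s3"): 7, ("s1", "s4"): 5, ("s1", "s5"): 3,
          ("s2", "s3"): 6, ("s2", "s4"): 4, ("s2", "s5"): 8,
          ("s3", "s5"): 7, ("s4", "s5"): 6}

def cross_scores(a, b):
    """Scores of all sequence pairs with one member in a and one in b."""
    return [SCORES[tuple(sorted((x, y)))] for x in a for y in b]

def linkage(a, b, combine):
    """combine is mean, max or min for average, maximum or minimum linkage."""
    return combine(cross_scores(a, b))

A1, A2, A3 = ("s1", "s2"), ("s3", "s4"), ("s5",)
for name, combine in (("average", mean), ("maximum", max), ("minimum", min)):
    print(name, linkage(A1, A2, combine), linkage(A1, A3, combine),
          linkage(A2, A3, combine))
```

Since the values are similarity scores, the pair with the highest linkage value is merged next: A2 with A3 under average and minimum linkage, A1 with A3 under maximum linkage.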

Linear clustering

Algorithm: Basic linear clustering for aligning the sequences {s1, s2, …, sn}
var
  U   the set of sequences not aligned
  A   the current alignment
begin
  U = {s1, s2, …, sn}
  choose two sequences (the most similar) (s, t) from U
  A = Align(s, t)
  U = U - {s, t}
  for i = 1 to n-2 do
    choose a sequence s from U
    U = U - {s}
    A = Align(A, s)
  end
end

The CLUSTALW Algorithm

CLUSTALW: one of the most popular global MSA programs

1. Calculate the (static) pairwise similarity scores for the sequences
2. Construct a guide tree by use of the pairwise scores (NJ method)
3. Calculate sequence weights, using the guide tree
4. Perform a progressive alignment, guided by the tree
