Molecular Evolution Distance Methods Biol. Luis Delaye Facultad de Ciencias, UNAM.

Post on 18-Jan-2018

a) Models of sequence evolution b) Sequence similarity c) Estimating the number of substitutions between two sequences d) Phylogenetic reconstruction

Molecular Evolution Distance Methods

Biol. Luis Delaye

Facultad de Ciencias, UNAM

Mainly a STATISTICAL problem!

a) Models of sequence evolution

b) Sequence similarity

c) Estimating the number of substitutions between two sequences

d) Phylogenetic reconstruction

Evolution at the molecular level is the substitution of one allele by another

[Figure: allele frequency (axis from 0 to 1, with a truncated mark "1/") plotted against time.]

The basic forces are: mutation, genetic drift and natural selection

Allele A Allele B Allele C

By this process, a DNA sequence accumulates substitutions through time

ATCGCATCC

ATTGCGTAC

TAGCGTAGG

TAACCCATG

t

In the study of molecular evolution, these changes in a DNA sequence are used for both:

Estimating the rate of molecular evolution

Reconstructing the evolutionary history

Models of sequence evolution

Models of DNA evolution

To study the dynamics of nucleotide substitution, we must make assumptions regarding the probability (p) of substitution of one nucleotide by another at the end of a time interval t.

[Diagram: A → C, labelled $p_{AC}$ over the interval t.]

For instance, $p_{AC}$ represents the probability that a site that started with nucleotide i (A in this case) has changed to nucleotide j (C in this case) at the end of the interval t.

Models of DNA evolution using matrix theory

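Substitution probability matrix:

$$P_t = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}$$

Base composition of sequences:

$$f = [f_A \;\; f_C \;\; f_G \;\; f_T]$$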

The Jukes and Cantor’s One-Parameter Model

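[Diagram: the four nucleotides A, G, C and T, with every substitution between them occurring at the same rate $\alpha$.]

Substitution probability matrix (rows and columns in the order A, C, G, T):

$$P_t = \begin{pmatrix} * & \alpha & \alpha & \alpha \\ \alpha & * & \alpha & \alpha \\ \alpha & \alpha & * & \alpha \\ \alpha & \alpha & \alpha & * \end{pmatrix}$$

where each diagonal element * is $p_{ii} = 1 - \sum_{j \neq i} p_{ij}$.

Base composition of sequences:

$$f = [\tfrac{1}{4} \;\; \tfrac{1}{4} \;\; \tfrac{1}{4} \;\; \tfrac{1}{4}]$$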

What is the probability of having an A at a site in a DNA sequence at time t = 1, at a site that started with an A at time t = 0?

$p_A(0) = 1$, since we started with A.

$p_A(1) = 1 - 3\alpha$, the probability that the nucleotide has remained unchanged.

The Jukes and Cantor’s One-Parameter Model

What is the probability of having an A in a site in a DNA sequence at time t = 2?

[Figure (after Li, 1997): two scenarios for having A at a site at time t = 2, starting from A at t = 0.

Scenario 1 (no substitution in either interval): A at t = 0, A at t = 1, A at t = 2. The site is A at t = 1 with probability $p_A(1)$ and remains A until t = 2 with probability $(1 - 3\alpha)$.

Scenario 2 (a substitution in each interval): A at t = 0, not A at t = 1, A at t = 2. The site is not A at t = 1 with probability $[1 - p_A(1)]$ and changes back to A by t = 2 with probability $\alpha$.

The probabilities of the two scenarios are added.]

Adding the two scenarios gives the probability of having an A at a site at time t = 2:

$$p_A(2) = (1 - 3\alpha)\,p_A(1) + \alpha\,[1 - p_A(1)]$$

The first term is the probability of no change: $p_A(1)$ is the probability of not having a substitution from t = 0 to t = 1, and $(1 - 3\alpha)$ is the probability of not having a substitution from t = 1 to t = 2.

The second term is the probability of reversible change: $[1 - p_A(1)]$ is the probability of having a substitution from A to not A in t = 0 to t = 1, and $\alpha$ is the probability of having a substitution from not A to A from t = 1 to t = 2.

The Jukes and Cantor’s One-Parameter Model

The following recurrence equation holds for any t:

$$p_A(t + 1) = (1 - 3\alpha)\,p_A(t) + \alpha\,[1 - p_A(t)]$$

Rewriting this equation in terms of the amount of change:

$$p_A(t + 1) - p_A(t) = (1 - 3\alpha)\,p_A(t) + \alpha\,[1 - p_A(t)] - p_A(t)$$

Doing some algebra:

$$p_A(t + 1) - p_A(t) = p_A(t) - 3\alpha\,p_A(t) + \alpha\,[1 - p_A(t)] - p_A(t)$$

$$\Delta p_A(t) = -3\alpha\,p_A(t) + \alpha\,[1 - p_A(t)]$$

$$\Delta p_A(t) = -4\alpha\,p_A(t) + \alpha$$

Rewriting this equation for a continuous time model:

$$\frac{d\,p_A(t)}{d\,t} = -4\alpha\,p_A(t) + \alpha$$

The solution is given by:

$$p_A(t) = \tfrac{1}{4} + \left(p_A(0) - \tfrac{1}{4}\right) e^{-4\alpha t}$$

Since we started with A, $p_A(0) = 1$:

$$p_A(t) = \tfrac{1}{4} + \left(1 - \tfrac{1}{4}\right) e^{-4\alpha t} = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}$$

And if we start with non A, $p_A(0) = 0$:

$$p_A(t) = \tfrac{1}{4} + \left(0 - \tfrac{1}{4}\right) e^{-4\alpha t} = \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t}$$

We can write the equations in a more explicit form.

The probability of initially having A, and still having A at time t, is:

$$p_{AA}(t) = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}$$

The probability of initially having G, and then having A at time t, is:

$$p_{GA}(t) = \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t}$$

And since all nucleotides are equivalent under the JC model, $p_{GA}(t) = p_{CA}(t) = p_{TA}(t)$. In general:

$$p_{ii}(t) = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}$$

$$p_{ij}(t) = \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t} \quad \text{where } i \neq j$$
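The Jukes and Cantor probabilities can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the rate used in the example (alpha = 1e-3 per time step) is an illustrative assumption chosen so that the discrete recurrence can be iterated quickly. It compares the discrete recurrence with the continuous-time solution and confirms that the four probabilities for a site sum to 1.

```python
import math


def jc_p_same(alpha, t):
    """p_ii(t): probability that a site shows its starting nucleotide at time t."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * t)


def jc_p_diff(alpha, t):
    """p_ij(t), i != j: probability of one specific different nucleotide at time t."""
    return 0.25 - 0.25 * math.exp(-4.0 * alpha * t)


def jc_recurrence(alpha, steps, p0=1.0):
    """Iterate p_A(t+1) = (1 - 3*alpha) * p_A(t) + alpha * [1 - p_A(t)]."""
    p = p0
    for _ in range(steps):
        p = (1.0 - 3.0 * alpha) * p + alpha * (1.0 - p)
    return p


if __name__ == "__main__":
    alpha = 1e-3  # illustrative rate per time step (assumption, not from the slides)
    for t in (0, 100, 500, 2000):
        print(t, jc_recurrence(alpha, t), jc_p_same(alpha, t), jc_p_diff(alpha, t))
```

For small alpha the discrete and continuous values stay close, and both approach ¼ as t grows.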

$p_A(t)$ can also be interpreted as the frequency of A in a DNA sequence. For example, if we start with a sequence made of A's only, then $p_A(0) = 1$, and $p_A(t)$ is the expected frequency of A in the sequence at time t.

[Figure: The Jukes and Cantor’s One-Parameter Model. Temporal changes in the probability of having a certain nucleotide at a given nucleotide site ($\alpha = 5 \times 10^{-9}$ substitutions/site/year). x-axis: time (million years), 0 to 200; y-axis: probability, 0 to 1. The curves $p_{ii}$ and $p_{ij}$ both approach ¼.]

Other models of sequence evolution

The Kimura two-Parameter Model

[Diagram: the four nucleotides A, G, C and T. Transitions are the changes A ↔ G and C ↔ T; all other changes are transversions.]

[Figure: Number of transitions and transversions between pairs of bovid mammal mitochondrial sequences (684 base pairs from the COII gene) against the estimated time of divergence. x-axis: time since divergence (Myr), 0 to 25; y-axis: base pair differences, 0 to 100; separate curves for transitions and transversions.]

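Substitution probability matrix (rows and columns in the order A, C, G, T; $\alpha$ for transitions, $\beta$ for transversions):

$$P_t = \begin{pmatrix} * & \beta & \alpha & \beta \\ \beta & * & \beta & \alpha \\ \alpha & \beta & * & \beta \\ \beta & \alpha & \beta & * \end{pmatrix}$$

where each diagonal element * is $p_{ii} = 1 - \sum_{j \neq i} p_{ij}$.

Base composition of sequences:

$$f = [\tfrac{1}{4} \;\; \tfrac{1}{4} \;\; \tfrac{1}{4} \;\; \tfrac{1}{4}]$$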

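The Felsenstein (1981) Model

This model assumes that there is variation in base composition.

Substitution probability matrix:

$$P_t = \begin{pmatrix} * & \pi_C & \pi_G & \pi_T \\ \pi_A & * & \pi_G & \pi_T \\ \pi_A & \pi_C & * & \pi_T \\ \pi_A & \pi_C & \pi_G & * \end{pmatrix}$$

where each diagonal element * is $p_{ii} = 1 - \sum_{j \neq i} p_{ij}$.

Base composition of sequences:

$$f = [\pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T]$$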

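The Hasegawa, Kishino and Yano (1985) Model

This model assumes that there is variation in base composition and that transitions and transversions occur at different rates.

Substitution probability matrix ($\alpha$ for transitions, $\beta$ for transversions):

$$P_t = \begin{pmatrix} * & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\ \beta\pi_A & * & \beta\pi_G & \alpha\pi_T \\ \alpha\pi_A & \beta\pi_C & * & \beta\pi_T \\ \beta\pi_A & \alpha\pi_C & \beta\pi_G & * \end{pmatrix}$$

where each diagonal element * is $p_{ii} = 1 - \sum_{j \neq i} p_{ij}$.

Base composition of sequences:

$$f = [\pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T]$$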

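The General Reversible (REV) Model

This model assumes that there is variation in base composition and that each substitution has its own probability.

Substitution probability matrix:

$$P_t = \begin{pmatrix} * & \pi_C a & \pi_G b & \pi_T c \\ \pi_A a & * & \pi_G d & \pi_T e \\ \pi_A b & \pi_C d & * & \pi_T f \\ \pi_A c & \pi_C e & \pi_G f & * \end{pmatrix}$$

where each diagonal element * is $p_{ii} = 1 - \sum_{j \neq i} p_{ij}$.

Base composition of sequences:

$$f = [\pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T]$$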

Comparing the Models

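[Diagram, from Page and Holmes (1998), relating the models:]

Jukes-Cantor → allow for α/β (transition/transversion) bias → Kimura 2 parameter

Jukes-Cantor → allow for base frequency to vary → Felsenstein (1981)

Kimura 2 parameter → allow for base frequency to vary → Hasegawa, Kishino and Yano (1985)

Felsenstein (1981) → allow for α/β (transition/transversion) bias → Hasegawa, Kishino and Yano (1985)

Hasegawa, Kishino and Yano (1985) → allow all six pairs of substitutions to have different rates → General Reversible (REV)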

Among site rate variation

For protein coding sequences not all sites have the same probability of change (there is among site rate variation). If this effect is not taken into account, the number of substitutions per site between two sequences can be underestimated (Li and Graur, 1991).

Effect of among site rate variation in sequence divergence (Page and Holmes, 1998):

(A) Substitution rate of 0.5 % per Myr and 80 % of the sites free to vary

(B) Substitution rate of 2 % per Myr and 50 % of the sites free to vary

Gamma distribution

$$f(r) = \frac{b^{a}}{\Gamma(a)}\, e^{-br}\, r^{a-1}$$

where:

$$\Gamma(a) = \int_0^{\infty} e^{-t}\, t^{a-1}\, dt$$

and a is the shape parameter.

Time reversibility

Time reversibility in the Jukes and Cantor’s One-Parameter Model

[Figure: an ancestral A gives rise to two descendant A's along two branches of length t, each with probability $p_{AA}(t)$, for a joint probability of $p_{AA}(t)^2$. Equivalently, a single lineage A → A → A through t = 0, t = 1 and t = 2 has the same probability, $p_{AA}(t)\,p_{AA}(t) = p_{AA}(t)^2$.]

A substitution process is said to be time reversible if the probability of starting from nucleotide i and changing to nucleotide j in a time interval t is the same as the probability of starting from j and going backward to i in the same time duration.

$$p_{ij}(t) = p_{ji}(t)$$

Sequence similarity between two sequences

Divergence Between DNA sequences

[Figure: an ancestral sequence diverging into Sequence 1 and Sequence 2, each along a branch of length t; I(t) is measured between the two descendant sequences.]

The expected value of the proportion of identical nucleotides between the two sequences under study is equal to the probability, I(t), that the nucleotide at a given site at time t is the same in both sequences.

Sequence Similarity

[Figure: an ancestral site with A and two descendant lineages, each of length t.]

If both descendants retain A, each with probability $p_{AA}(t)$, the site is identical in the two sequences with probability $p_{AA}(t)^2$.

The site is also identical in the two sequences after parallel substitutions: both descendants have C with probability $p_{AC}(t)^2$, G with probability $p_{AG}(t)^2$, or T with probability $p_{AT}(t)^2$.

Sequence Similarity in the JC Model

Therefore,

$$I(t) = p_{AA}(t)^2 + p_{AT}(t)^2 + p_{AC}(t)^2 + p_{AG}(t)^2$$

And from the JC model,

$$I(t) = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-8\alpha t}$$

This equation also holds if the initial nucleotide was different from A, and represents the expected proportion of identical nucleotides between two sequences that diverged t time units ago.
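The step from the sum of squares to the closed form for I(t) can be written out. This is a sketch that uses the JC probabilities $p_{ii}(t)$ and $p_{ij}(t)$ derived earlier; the shorthand $x = e^{-4\alpha t}$ is introduced here for convenience and does not appear in the slides.

```latex
\begin{align*}
I(t) &= p_{AA}(t)^2 + 3\,p_{AC}(t)^2
        && \text{since } p_{AC} = p_{AG} = p_{AT} \\
     &= \left(\tfrac{1}{4} + \tfrac{3}{4}x\right)^2
        + 3\left(\tfrac{1}{4} - \tfrac{1}{4}x\right)^2 \\
     &= \tfrac{1}{16} + \tfrac{6}{16}x + \tfrac{9}{16}x^2
        + \tfrac{3}{16} - \tfrac{6}{16}x + \tfrac{3}{16}x^2 \\
     &= \tfrac{1}{4} + \tfrac{3}{4}x^2
      = \tfrac{1}{4} + \tfrac{3}{4}\,e^{-8\alpha t}
\end{align*}
```

The terms linear in x cancel, which is why the exponent doubles from $-4\alpha t$ to $-8\alpha t$: the two sequences are separated by a total time of 2t.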

[Figure: Sequence similarity in the Jukes and Cantor’s One-Parameter Model. Temporal changes in the expected proportion of identical nucleotides between two sequences that diverged t years ago ($\alpha = 5 \times 10^{-9}$ substitutions/site/year). x-axis: time (million years), 0 to 200; y-axis: proportion of identical nucleotides, 0 to 1, approaching ¼.]

Estimating the number of nucleotide substitutions between two sequences

Number of nucleotide substitutions between two sequences

$$K = N / L$$

where K is the number of substitutions per nucleotide site, N is the total number of substitutions, and L is the number of sites compared between the two sequences.

A simple measure of genetic distance between two sequences is p:

$$p = n_d / n$$

where p is the proportion of different sites, $n_d$ is the total number of differences, and n is the number of sites compared between the two sequences.
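The proportion of different sites is simple to compute. The following is a minimal sketch, assuming two aligned sequences of equal length with no gaps; the function name `p_distance` is hypothetical, and the two strings in the example are sequences that appear in the slides' divergence figure, used here only as sample input.

```python
def p_distance(seq1, seq2):
    """Proportion of sites that differ between two aligned, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    n = len(seq1)                                         # sites compared
    n_d = sum(1 for x, y in zip(seq1, seq2) if x != y)    # differing sites
    return n_d / n


if __name__ == "__main__":
    # 2 of the 14 sites differ, so p = 2/14
    print(p_distance("ACTGAACGTAACGC", "ACTGAACGAATCGC"))
```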

Divergence Between DNA sequences

[Figure: an ancestral sequence diverging into Sequence 1 and Sequence 2 over a time t along each lineage (sequences shown in the figure: ACTGAACGTAACGC and ACTGAACGAATCGC). The figure illustrates the kinds of substitution: single substitution, multiple substitutions, coincidental substitutions, parallel substitutions, convergent substitutions and back substitutions.]

Although there have been 12 mutations, only 3 can be detected.

Sequence dissimilarity

$$D = 1 - I(t)$$

Due to multiple substitutions, the observed number of differences between two sequences is less than the true number of substitutions.

[Figure: proportion of observed differences and proportion of actual differences plotted against time (y-axis from 0 to 1). The gap between the two curves is the distance correction.]

Models of sequence evolution can be used to “correct” for multiple hits.

Estimating the number of nucleotide substitutions under the Jukes and Cantor’s One-Parameter Model

As we have seen, the expected proportion of identical nucleotides between two sequences that diverged t time units ago is given by:

$$I(t) = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-8\alpha t}$$

And the probability that the two sequences are different at a site at time t is:

$$p = 1 - I(t)$$

Doing some algebra:

$$p = 1 - \left(\tfrac{1}{4} + \tfrac{3}{4}\, e^{-8\alpha t}\right)$$

$$p = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right)$$

$$8\alpha t = -\ln\left(1 - \tfrac{4}{3}p\right)$$

And since in the JC model $K = 2(3\alpha t)$ between two sequences:

$$K = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3}p\right)$$
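The Jukes and Cantor correction can be applied directly to an observed proportion of differences. The following is a minimal sketch; the function name `jc_distance` is hypothetical, and the choice to raise an error when p is ¾ or more (where the logarithm is undefined) is an assumption about how to handle saturated sequences.

```python
import math


def jc_distance(p):
    """Jukes-Cantor estimate K = -(3/4) ln(1 - (4/3) p) from a proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 3/4 for the JC correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3, 0.6):
        # K is always at least p, and the gap grows as p approaches 3/4
        print(p, jc_distance(p))
```

For small p the correction is negligible, while for larger p the estimated number of substitutions per site exceeds the observed proportion of differences, as the multiple-hits argument predicts.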

Estimating the number of nucleotide substitutions under the Kimura two-Parameter Model

$$K = \tfrac{1}{2} \ln(a) + \tfrac{1}{4} \ln(b)$$

where:

$$a = \frac{1}{1 - 2P - Q}$$

$$b = \frac{1}{1 - 2Q}$$

and P and Q are the proportions of transitional and transversional differences between the two sequences.
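The Kimura two-parameter distance can be sketched in the same way. This is a minimal example, assuming two aligned, gap-free DNA sequences in upper case; the names `count_P_Q` and `k2p_distance` are hypothetical, and the sample sequences are made up for illustration.

```python
import math

# Transitions are purine <-> purine (A, G) or pyrimidine <-> pyrimidine (C, T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def count_P_Q(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(seq1)
    ts = sum(1 for pair in zip(seq1, seq2) if pair in TRANSITIONS)
    tv = sum(1 for x, y in zip(seq1, seq2) if x != y and (x, y) not in TRANSITIONS)
    return ts / n, tv / n


def k2p_distance(seq1, seq2):
    """K = (1/2) ln(a) + (1/4) ln(b), a = 1/(1 - 2P - Q), b = 1/(1 - 2Q)."""
    P, Q = count_P_Q(seq1, seq2)
    if 1.0 - 2.0 * P - Q <= 0.0 or 1.0 - 2.0 * Q <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    a = 1.0 / (1.0 - 2.0 * P - Q)
    b = 1.0 / (1.0 - 2.0 * Q)
    return 0.5 * math.log(a) + 0.25 * math.log(b)


if __name__ == "__main__":
    # Hypothetical sequences: one transition (A/G) and one transversion (A/C)
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAC"))
```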

Estimating the number of amino acid substitutions using the Poisson Correction for protein sequences

[Figure: a protein sequence, M C A N T P L …]

With r the substitution rate per site, the probability of k substitutions at a site during a time t is:

$$P(k) = \frac{e^{-rt}\,(rt)^k}{k!}$$

so that:

$$P(0) = e^{-rt}, \quad P(1) = e^{-rt}\,rt, \quad P(2) = \frac{e^{-rt}\,(rt)^2}{2!}, \quad \ldots, \quad P(n) = \frac{e^{-rt}\,(rt)^n}{n!}$$

[Figure: an ancestral sequence SecA diverging into Sec1 and Sec2; along each lineage the probability of no substitution at a site is $e^{-rt}$.]

The probability that neither of the two sequences has suffered a substitution at a site is:

$$q = (e^{-rt})^2 = e^{-2rt} = 1 - p$$

where p is the proportion of different sites. Since $K = 2rt$, doing a little algebra:

$$e^{-K} = 1 - p$$

$$K = -\ln(1 - p)$$

This is the genetic distance using the Poisson Correction.
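The Poisson correction can be sketched as follows. This is a minimal example, assuming two aligned, gap-free protein sequences; the function name `poisson_distance` is hypothetical, the first sample sequence is the fragment shown in the slides, and the second is made up for illustration.

```python
import math


def poisson_distance(prot1, prot2):
    """Poisson-corrected distance K = -ln(1 - p) between two aligned proteins."""
    if len(prot1) != len(prot2) or not prot1:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(1 for x, y in zip(prot1, prot2) if x != y) / len(prot1)
    if p >= 1.0:
        raise ValueError("every site differs; the correction is undefined")
    return -math.log(1.0 - p)


if __name__ == "__main__":
    # One of seven sites differs, so p = 1/7 and K = -ln(6/7)
    print(poisson_distance("MCANTPL", "MCANTPV"))
```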

Trees

A phylogeny and the three basic kinds of tree used to depict that phylogeny

After Page and Holmes (1998)

[Figure: a phylogeny of A, B and C (drawn against time and character change) and three kinds of tree depicting it: a cladogram, an additive tree, and an ultrametric tree with a time scale.]

Distance Methods for Phylogenetic Inference

Distance Matrix

         1      2      3      4      5      6      7      8      9     10
[ 1]
[ 2]  0.009
[ 3]  0.000  0.009
[ 4]  0.000  0.009  0.000
[ 5]  0.000  0.009  0.000  0.000
[ 6]  0.009  0.019  0.009  0.009  0.009
[ 7]  0.009  0.019  0.009  0.009  0.009  0.000
[ 8]  0.098  0.108  0.098  0.098  0.098  0.108  0.108
[ 9]  0.098  0.108  0.098  0.098  0.098  0.108  0.108  0.000
[10]  0.088  0.098  0.088  0.088  0.088  0.098  0.098  0.009  0.009

In order for a distance measure to be used to build phylogenies, it must satisfy some basic requirements:

It must be metric

It must be additive

Metric distances

A distance d(a,b) between two sequences a and b is metric if:

1 d(a,b) ≥ 0 (non-negativity)

2 d(a,b) = d(b,a) (symmetry)

3 d(a,c) ≤ d(a,b) + d(b,c) (triangle inequality)

4 d(a,b) = 0 if and only if a = b (distinctness)

Ultrametric distances

A distance is ultrametric if:

5 d(a,b) ≤ maximum [d(a,c), d(b,c)]

[Figure: three sequences a, b and c with pairwise distances 4, 6 and 6.]

An ultrametric distance has the property of implying a constant evolutionary rate.
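The metric and ultrametric conditions can be checked mechanically on a small distance table. The following is a minimal sketch: the helper names are hypothetical, labels are assumed to denote distinct sequences, and the assignment of the values 4, 6 and 6 to particular pairs of a, b and c is an assumption made for illustration.

```python
from itertools import permutations


def symmetric(pairs):
    """Build a full lookup d[x][y] from {(x, y): distance}, with d[x][x] = 0."""
    labels = sorted({x for pair in pairs for x in pair})
    d = {x: {y: 0.0 for y in labels} for x in labels}
    for (x, y), value in pairs.items():
        d[x][y] = d[y][x] = float(value)
    return d


def is_metric(d, tol=1e-9):
    """Check non-negativity, symmetry, distinctness and the triangle inequality."""
    labels = list(d)
    for a in labels:
        for b in labels:
            if d[a][b] < -tol or abs(d[a][b] - d[b][a]) > tol:
                return False
            if (a == b) != (abs(d[a][b]) <= tol):  # d = 0 only for a = b
                return False
    return all(d[a][c] <= d[a][b] + d[b][c] + tol
               for a, b, c in permutations(labels, 3))


def is_ultrametric(d, tol=1e-9):
    """Check d(a,b) <= max[d(a,c), d(b,c)] for every triple."""
    return all(d[a][b] <= max(d[a][c], d[b][c]) + tol
               for a, b, c in permutations(list(d), 3))


if __name__ == "__main__":
    d = symmetric({("a", "b"): 4, ("a", "c"): 6, ("b", "c"): 6})
    print(is_metric(d), is_ultrametric(d))
```

A table can be metric without being ultrametric; for instance, distances 3, 6 and 7 between three sequences satisfy the triangle inequality but not condition 5.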

Additive distances

Four point condition:

d(a,b) + d(c,d) ≤ maximum [d(a,c) + d(b,d), d(a,d) + d(b,c)]
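The four point condition can be tested on a set of pairwise distances. This sketch relies on the fact that requiring the inequality for every labelling of the four sequences is equivalent to requiring that the two largest of the three pairwise sums be equal. The function names are hypothetical, and the assignment of distance values to specific pairs in the examples is an assumption made for illustration.

```python
from itertools import combinations


def dist(d, x, y):
    """Look up a distance stored under an unordered pair of labels."""
    return d[frozenset((x, y))]


def four_point_holds(d, a, b, c, e, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal."""
    sums = sorted([
        dist(d, a, b) + dist(d, c, e),
        dist(d, a, c) + dist(d, b, e),
        dist(d, a, e) + dist(d, b, c),
    ])
    return sums[2] - sums[1] <= tol


def is_additive(d, labels, tol=1e-9):
    """Check the four point condition for every set of four labels."""
    return all(four_point_holds(d, *quartet, tol=tol)
               for quartet in combinations(labels, 4))


if __name__ == "__main__":
    # Hypothetical labelling of the distances 3, 6, 7, 9, 10 and 14
    additive = {frozenset("ab"): 6, frozenset("cd"): 9, frozenset("ac"): 3,
                frozenset("bd"): 14, frozenset("ad"): 10, frozenset("bc"): 7}
    print(is_additive(additive, "abcd"))  # sums are 15, 17 and 17
```

An ultrametric table also passes this check, since ultrametric distances are a special case of additive distances.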

[Figure: an ultrametric distance matrix between four sequences (a, b, c and d), with pairwise distances 2, 6, 6, 10, 10 and 10, and the corresponding ultrametric tree, with branch lengths 1, 1, 2, 2, 3 and 5.]

[Figure: an additive distance matrix between four sequences (a, b, c and d), with pairwise distances 3, 6, 7, 9, 10 and 14, and the corresponding additive tree, with branch lengths 5, 1, 1, 2, 1 and 6.]

Unweighted Pair-group Method using Arithmetic averages (UPGMA)

The method starts from the matrix of distances between the OTUs:

OTU   A      B      C
B     dAB
C     dAC    dBC
D     dAD    dBD    dCD

The two OTUs with the smallest distance (here A and B) are clustered first, and the branching point is placed at dAB / 2.

A and B are then treated as a single composite OTU (AB), and a new distance matrix is computed:

OTU   (AB)      C
C     d(AB)C
D     d(AB)D    dCD

d(AB)C = (dAC + dBC) / 2

d(AB)D = (dAD + dBD) / 2

If d(AB)C is now the smallest distance, C is joined to (AB), with the branching point at d(AB)C / 2.

Finally, D is joined to (ABC), with the branching point at:

d(ABC)D / 2 = [(dAD + dBD + dCD) / 3] / 2

In general, the distance between two clusters X and Y is the average of the distances between their members:

$$d_{XY} = \frac{\sum_{i \in X,\, j \in Y} d_{ij}}{n_X\, n_Y}$$

where $n_X$ and $n_Y$ are the numbers of OTUs in X and Y.

Assumes a constant molecular clock

Estimates tree topology and branch length
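The clustering steps can be sketched in code. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the nested-tuple tree format `(left, right, height)` and the example distances (an ultrametric table with a hypothetical labelling) are all assumptions. Cluster distances are updated with a size-weighted average, which gives the same value as averaging over all pairs of members.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster OTUs by UPGMA.

    labels: list of OTU names; dist: {(x, y): distance} for every pair.
    Returns nested tuples (left, right, height), height = node depth.
    """
    index = {name: i for i, name in enumerate(labels)}
    # active cluster id -> (subtree, number of OTUs in the cluster)
    clusters = {i: (name, 1) for i, name in enumerate(labels)}
    d = {frozenset((index[x], index[y])): float(v) for (x, y), v in dist.items()}
    next_id = len(labels)
    while len(clusters) > 1:
        # join the closest pair of active clusters
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        height = d[frozenset((i, j))] / 2.0   # branching point at d / 2
        for k in clusters:
            # size-weighted average = mean over all member pairs
            d[frozenset((next_id, k))] = (
                n_i * d[frozenset((i, k))] + n_j * d[frozenset((j, k))]
            ) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j, height), n_i + n_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


if __name__ == "__main__":
    example = {("a", "b"): 2, ("a", "c"): 6, ("b", "c"): 6,
               ("a", "d"): 10, ("b", "d"): 10, ("c", "d"): 10}
    print(upgma(["a", "b", "c", "d"], example))
```

On this example a and b join at height 1, c joins at height 3 and d joins at height 5. Because UPGMA assumes a constant molecular clock, it is only expected to recover the correct tree when the distances are close to ultrametric.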

Minimum Evolution Method

In this method, the sum (S) of all branch length estimates is computed for all or all plausible topologies and the topology that has the smallest S value is chosen as the best tree.

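$$S = \sum_{i=1}^{T} b_i$$

where $b_i$ is the estimated length of branch i and T is the total number of branches.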

Neighbor-Joining Method

The principle of the N-J method is to find neighbors sequentially that may minimize the total length of the tree.

This method starts with a starlike tree:

[Figure: a star tree in which OTUs 1 to 8 are all joined at a single central node X.]

The first step is to separate a pair of OTUs from all others:

[Figure: OTUs 1 and 2 joined at a new node Y, which is connected to node X; the remaining OTUs 3 to 8 stay joined at X.]

Among all the possible pairs of OTUs, the one with the smallest sum of branch lengths is chosen. This procedure is repeated until all interior branches are found.

[Figure: the resulting tree for OTUs 1 to 8.]
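The choice of the first pair of neighbors can be sketched in code. This example does not compute the sum of branch lengths for each candidate tree directly. Instead it uses the rate-corrected Q criterion that is commonly used to implement neighbor joining, on the understanding that minimizing Q selects the same pair as minimizing the total branch length. The function name `nj_pick_pair` and the five-OTU distances are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def nj_pick_pair(labels, d):
    """Return the pair of OTUs to join first under the Q criterion.

    d maps frozenset({x, y}) to the distance between OTUs x and y.
    Q(x, y) = (n - 2) * d(x, y) - r(x) - r(y), where r(x) is the sum of
    the distances from x to all other OTUs; the pair with minimum Q is joined.
    """
    n = len(labels)
    r = {x: sum(d[frozenset((x, y))] for y in labels if y != x) for x in labels}

    def q(pair):
        x, y = pair
        return (n - 2) * d[frozenset(pair)] - r[x] - r[y]

    return min(combinations(labels, 2), key=q)


if __name__ == "__main__":
    # Hypothetical distances between five OTUs
    d = {frozenset("ab"): 5, frozenset("ac"): 9, frozenset("ad"): 9,
         frozenset("ae"): 8, frozenset("bc"): 10, frozenset("bd"): 10,
         frozenset("be"): 9, frozenset("cd"): 8, frozenset("ce"): 7,
         frozenset("de"): 3}
    print(nj_pick_pair(list("abcde"), d))
```

In this example the pair (a, b) is selected even though d and e have the smallest raw distance, which illustrates that the method does not simply join the closest pair, unlike UPGMA.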