Date post: | 19-Dec-2015 |
Phylogenetics
“Inferring Phylogenies”
Joseph Felsenstein
Excellent reference
What is a phylogeny?
Different Representations
Cladogram - branching pattern only
Phylogram - branch lengths are estimated and drawn proportional to the amount of change along the branch
Rooted - implies directionality of change
Unrooted - does not
How do you root a tree?
What is a phylogeny used for?
$\pi = \frac{2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \pi_{ij}}{n(n-1)}$
$\theta = 4 N_e \mu$
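A minimal sketch of the π formula, assuming π_ij is the proportion of aligned sites at which sequences i and j differ (the slide does not say whether it is a proportion or a raw count). The input is the five-sequence alignment from the "Estimate a Phylogeny" slides; the function names are illustrative only.

```python
from itertools import combinations

def pairwise_diff(a, b):
    """Proportion of aligned sites at which two sequences differ (assumed meaning of pi_ij)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def nucleotide_diversity(seqs):
    """pi = 2 * (sum of pi_ij over all pairs i < j) / (n * (n - 1))."""
    n = len(seqs)
    total = sum(pairwise_diff(a, b) for a, b in combinations(seqs, 2))
    return 2 * total / (n * (n - 1))

seqs = [
    "ACCGTCTTGTTA",  # Sp1
    "AGCGTCATCAAA",  # Sp2
    "AGCGTCATCAAA",  # Sp3
    "ACCGTCTTGATA",  # Sp4
    "AGCCTCTTCATA",  # Sp5
]
pi = nucleotide_diversity(seqs)
print(round(pi, 4))
```

The ten pairwise comparisons differ at 32 sites in total over 12 columns, so π = 2(32/12)/(5·4) = 4/15.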
Estimate a Phylogeny
Sp1 ACCGTCTTGTTA
Sp2 AGCGTCATCAAA
Sp3 AGCGTCATCAAA
Sp4 ACCGTCTTGATA
Sp5 AGCCTCTTCATA
Working Tree
[Figure sequence: the working tree for sp1, sp4, sp2, sp3 and sp5 is built up one slide at a time, with characters c2, c4, c7, c9 and c10 mapped onto it in turn.]
Final Tree
[Figure: the final tree for sp1, sp4, sp2, sp3 and sp5, carrying characters c2, c4, c7, c9, c10 and c11.]
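The labels c2, c4, c7, c9, c10 and c11 on the working tree appear to be the variable columns of the alignment. A short sketch to check this, and to list which taxa share a state at each such column; the function name is illustrative, not from the slides.

```python
alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp3": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}

def variable_columns(aln):
    """Map each variable column (1-based) to {state: [taxa showing that state]}."""
    names = list(aln)
    length = len(aln[names[0]])
    result = {}
    for pos in range(length):
        groups = {}
        for name in names:
            groups.setdefault(aln[name][pos], []).append(name)
        if len(groups) > 1:          # more than one state: the column is variable
            result[pos + 1] = groups
    return result

cols = variable_columns(alignment)
for col, groups in cols.items():
    print(f"c{col}: {groups}")
```

Column 2, for instance, groups Sp1 with Sp4 (state C) against Sp2, Sp3 and Sp5 (state G), while column 10 only sets Sp1 apart.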
What optimality criteria do we use then?
Parsimony
Likelihood
Bayesian
Distance methods?
Parsimony
Why should we choose a specific grouping?
Maximum parsimony: we should accept the hypothesis that explains the data most simply and efficiently
“Parsimony is simply the most robust criterion for choosing between competing scientific hypotheses. It is not a statement about how evolution may or may not have taken place”1
1 Kitching, I. J.; Forey, P. L.; Humphries, J. & Williams, D. M. 1998. Cladistics: the theory and practice of parsimony analysis. The Systematics Association Publication No. 11.
Parsimony
Optimality criterion that chooses the topology with the fewest character-state transformations
Optimizes one component: tree topology (pattern based)
Most parsimonious tree: the one (or several) with the minimum number of evolutionary changes (shortest tree length)
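Reconstructing trees via sequence data
Taxon  1  2  3  4  5  6
O      T  G  T  A  A  T
A      A  A  T  G  A  G
B      A  G  C  C  -  G
C      A  A  T  G  A  T
D      A  G  C  C  -  T
[Figure: tree of taxa O, A, B, C and D with the character changes mapped onto its branches]
Site 1: T=>A
Site 2: G=>A
Site 3: T=>C
Site 4: A=>G and A=>C
Site 5: A=>GAP
Site 6: T=>G and T=>G
Tree length = 8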
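The tree length can be checked with Fitch's small-parsimony count. This is a sketch under two assumptions that the extracted slide does not confirm: the topology is (O,((A,C),(B,D))), and the gap is scored as a fifth character state. The second topology is included only for comparison.

```python
def fitch(tree, states):
    """Return (possible state set, number of changes) for one site on a nested-tuple tree."""
    if isinstance(tree, str):                 # leaf: observed state, no change yet
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, states)
    rset, rcost = fitch(right, states)
    common = lset & rset
    if common:                                # children agree: no extra change
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1     # children disagree: one change

def tree_length(tree, matrix):
    """Sum the Fitch counts over all sites; returns (total, per-site list)."""
    n_sites = len(next(iter(matrix.values())))
    per_site = [fitch(tree, {t: s[i] for t, s in matrix.items()})[1]
                for i in range(n_sites)]
    return sum(per_site), per_site

matrix = {"O": "TGTAAT", "A": "AATGAG", "B": "AGCC-G", "C": "AATGAT", "D": "AGCC-T"}
assumed_tree = ("O", (("A", "C"), ("B", "D")))   # assumed topology of the slide's tree
other_tree = ("O", (("A", "B"), ("C", "D")))     # alternative grouping, for comparison

length, per_site = tree_length(assumed_tree, matrix)
other_length, _ = tree_length(other_tree, matrix)
print(length, per_site, other_length)
```

Under these assumptions the per-site counts are 1, 1, 1, 2, 1 and 2, which matches the eight changes listed on the slide; the alternative grouping needs 11.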
Neighbor-joining Method
NJ distance matrices
Finished NJ tree
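The distance matrices on these slides did not survive extraction, so the sketch below runs a single neighbor-joining step on a made-up five-taxon matrix (not from the slides). It picks the pair minimising Q(i,j) = (n-2)d(i,j) - r_i - r_j, where r_i is the row sum, then computes the two new branch lengths and the distances from the new node to the remaining taxa.

```python
def nj_step(names, d):
    """One neighbor-joining step: choose a pair, return branch lengths and reduced distances."""
    n = len(names)
    r = [sum(row) for row in d]                       # total distance from each taxon
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    i, j = min(pairs, key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
    len_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    # distance from the new internal node to every other taxon
    to_new = {names[k]: (d[i][k] + d[j][k] - d[i][j]) / 2
              for k in range(n) if k not in (i, j)}
    return (names[i], names[j]), (len_i, len_j), to_new

# Illustrative distances only; these values are not taken from the presentation.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, lengths, reduced = nj_step(names, dist)
print(pair, lengths, reduced)
```

Repeating the step on the reduced matrix until three nodes remain gives the finished tree.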
Models of Evolution
Pyrimidines: T, C
Purines: A, G
Transitions: changes within a class (A <-> G, C <-> T)
Transversions: changes between classes (purine <-> pyrimidine)
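A small sketch of the distinction: a substitution is a transition when both bases are purines or both are pyrimidines, and a transversion otherwise. The helper names are illustrative; the example pair is Sp1 and Sp2 from the alignment used earlier.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(a, b):
    """Classify a pair of aligned bases."""
    if a == b:
        return "identical"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def count_ts_tv(s1, s2):
    """Return (number of transitions, number of transversions) between two aligned sequences."""
    kinds = [classify(a, b) for a, b in zip(s1, s2)]
    return kinds.count("transition"), kinds.count("transversion")

ts, tv = count_ts_tv("ACCGTCTTGTTA", "AGCGTCATCAAA")   # Sp1 vs Sp2
print(ts, tv)
```

Models that give transitions and transversions separate rates need exactly this kind of count as input.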
Maximum Likelihood
Base frequencies: fA + fG + fC + fT = 1
Base exchange: fs + fv = 1
R-matrix: the six relative rate parameters sum to 1
Gamma shape parameter
Number of discrete gamma-distribution categories
Pinvar: fvar + finv = 1
Likelihood: $L = \prod_i l_i$, where i indexes the characters (sites)
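Maximum Likelihood
$L = \Pr(D \mid H)$
$L^{(i)} = \sum_{\text{all } x, y, z, w} \Pr(AGGCG \text{ via } x, y, z, w)$
$= \sum_{w} \sum_{z} \sum_{y} \sum_{x} \Pr(w)\, \Pr(z \mid w, t_8)\, \Pr(x \mid w, t_7)\, \Pr(y \mid z, t_6)\, \Pr(A \mid x, t_1)\, \Pr(G \mid x, t_2)\, \Pr(G \mid z, t_3)\, \Pr(C \mid y, t_4)\, \Pr(G \mid y, t_5)$
[Figure: tree with root w and internal nodes x, y and z. Node x joins tips A (t1) and G (t2); node y joins tips C (t4) and G (t5); node z joins y (t6) and tip G (t3); root w joins x (t7) and z (t8).]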
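The sum over x, y, z and w can be evaluated directly, or far more cheaply by working from the tips toward the root (Felsenstein's pruning algorithm, which these slides do not show). The sketch below does both for the single site AGGCG and checks that they agree. Assumptions: a bifurcating tree in which x joins the tips on t1 and t2, Jukes-Cantor transition probabilities with rate 1, a root prior of 1/4 per base, and arbitrary branch lengths.

```python
import math
from itertools import product

BASES = "ACGT"

def p(i, j, t):
    """Jukes-Cantor probability of base j after time t, starting from base i (rate 1)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

# Hypothetical branch lengths t1..t8; the slides give no values.
t = {k: 0.1 * k for k in range(1, 9)}

def brute_force():
    """Sum the joint probability over all 4^4 assignments to w, z, y, x."""
    total = 0.0
    for w, z, y, x in product(BASES, repeat=4):
        total += (0.25 * p(w, z, t[8]) * p(w, x, t[7]) * p(z, y, t[6])
                  * p(x, "A", t[1]) * p(x, "G", t[2]) * p(z, "G", t[3])
                  * p(y, "C", t[4]) * p(y, "G", t[5]))
    return total

def pruning():
    """Same quantity, computed node by node from the tips to the root."""
    lx = {x: p(x, "A", t[1]) * p(x, "G", t[2]) for x in BASES}
    ly = {y: p(y, "C", t[4]) * p(y, "G", t[5]) for y in BASES}
    lz = {z: p(z, "G", t[3]) * sum(p(z, y, t[6]) * ly[y] for y in BASES)
          for z in BASES}
    return sum(0.25
               * sum(p(w, x, t[7]) * lx[x] for x in BASES)
               * sum(p(w, z, t[8]) * lz[z] for z in BASES)
               for w in BASES)

site_likelihood = pruning()
print(site_likelihood, brute_force())
```

The brute-force sum grows as 4 to the power of the number of internal nodes, while the pruning version grows only linearly with the number of nodes, which is why likelihood programs use the latter.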
ML cont.
$L = \prod_{i=1}^{n} L^{(i)}$
the probability that the nucleotide at time t is i, given that it was i at time 0, is given by
$P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4 \lambda t / 3}$
the probability that the nucleotide at time t is j, j ≠ i, is given by
$P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4 \lambda t / 3}$
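These two expressions are the Jukes-Cantor transition probabilities. A quick numerical sketch, with an arbitrary λ, of their basic properties: the four outcomes sum to 1, nothing has changed at t = 0, and all bases become equally likely as t grows. The last helper takes per-site likelihoods and sums their logarithms, the usual numerically safer form of the product over sites; the input values are made up.

```python
import math

def p_same(t, lam=1.0):
    """P_ii(t): probability the base is unchanged after time t."""
    return 0.25 + 0.75 * math.exp(-4.0 * lam * t / 3.0)

def p_diff(t, lam=1.0):
    """P_ij(t), j != i: probability of one specific different base after time t."""
    return 0.25 - 0.25 * math.exp(-4.0 * lam * t / 3.0)

def log_likelihood(site_likelihoods):
    """log L = sum of log L(i), equivalent to the product over sites."""
    return sum(math.log(x) for x in site_likelihoods)

for t in (0.0, 0.1, 1.0, 100.0):
    print(t, p_same(t), p_diff(t), p_same(t) + 3 * p_diff(t))

print(log_likelihood([0.5, 0.5]))   # illustrative per-site values only
```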
Bayes Theorem
$\mathrm{Prob}(H \mid D) = \frac{\mathrm{Prob}(H)\, \mathrm{Prob}(D \mid H)}{\mathrm{Prob}(D)}$
H = Hypothesis, D = Data
Prob(H): prior probability or marginal probability of H
Prob(H|D): the conditional probability of H given D: posterior probability
Prob(D|H): likelihood function
Prob(D): prior probability or marginal probability of D, $\sum_H P(H)\, P(D \mid H)$
Normalizing constant: ensures $\sum_H P(H \mid D) = 1$
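A minimal sketch of the theorem for a discrete set of hypotheses. The three hypotheses, their priors and their likelihoods are made-up numbers chosen only to show how the normalizing constant works.

```python
import math

def posterior(priors, likelihoods):
    """Return P(H|D) for each hypothesis from P(H) and P(D|H)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    prob_d = sum(joint.values())          # P(D) = sum over H of P(H) P(D|H)
    return {h: joint[h] / prob_d for h in joint}

# Illustrative values only.
priors = {"H1": 1 / 3, "H2": 1 / 3, "H3": 1 / 3}
likelihoods = {"H1": 0.02, "H2": 0.01, "H3": 0.01}

post = posterior(priors, likelihoods)
print(post)
```

With equal priors the posterior is just the likelihoods rescaled to sum to 1, here 0.5, 0.25 and 0.25.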
Take Home Message
Likelihood: represents the probability of the data given the hypothesis => difficult to interpret
Bayes approach: estimates the probability of the hypothesis given the data => estimates the probability of the hypothesis of interest
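Bayesian Inference of Phylogeny
Calculating the posterior probability (pP) of a tree involves a summation over all possible trees and, for each tree, integration over all combinations of branch lengths (bl) and substitution-model parameter values
$f(\tau_i \mid X) = \frac{f(\tau_i)\, f(X \mid \tau_i)}{\sum_{j=1}^{B(s)} f(\tau_j)\, f(X \mid \tau_j)}$
$f(\tau_i, \upsilon_i, \theta \mid X) = \frac{f(\tau_i, \upsilon_i, \theta)\, f(X \mid \tau_i, \upsilon_i, \theta)}{\sum_{j=1}^{B(s)} \int_{\upsilon, \theta} f(\tau_j, \upsilon_j, \theta)\, f(X \mid \tau_j, \upsilon_j, \theta)\, d\upsilon\, d\theta}$
$f(\tau_i \mid X) = \frac{\int_{\upsilon, \theta} f(\tau_i, \upsilon_i, \theta)\, f(X \mid \tau_i, \upsilon_i, \theta)\, d\upsilon\, d\theta}{\sum_{j=1}^{B(s)} \int_{\upsilon, \theta} f(\tau_j, \upsilon_j, \theta)\, f(X \mid \tau_j, \upsilon_j, \theta)\, d\upsilon\, d\theta}$
Here $\tau_i$ is the i-th of the B(s) possible topologies for s taxa, $\upsilon_i$ its branch lengths, $\theta$ the substitution-model parameters and X the data.
Inferences of any single parameter are based on the marginal distribution of the parameter
The marginal probability distribution of the topology, for example, integrates out all the other parameters
Advantage: the power of the analysis is focused on the parameter of interest (i.e., the topology of the tree)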
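To illustrate what "integrating out" means, the sketch below replaces the integral over branch lengths and model parameters with a sum over a small grid of one nuisance parameter. Each entry is prior times likelihood for one tree at one grid value; every number is made up, and real analyses approximate these integrals by MCMC rather than on a grid.

```python
import math

def marginal_tree_posterior(joint):
    """Sum each tree's unnormalized posterior over the nuisance grid, then normalize over trees."""
    per_tree = {tree: sum(values) for tree, values in joint.items()}
    total = sum(per_tree.values())       # normalizing constant over all trees
    return {tree: mass / total for tree, mass in per_tree.items()}

# Illustrative prior * likelihood values on a three-point grid of one nuisance parameter.
joint = {
    "tree1": [0.2, 0.4, 0.2],
    "tree2": [0.1, 0.1, 0.0],
}
marginal = marginal_tree_posterior(joint)
print(marginal)
```

Because the grid spacing is the same for every tree it cancels in the ratio, so the marginal posterior of tree1 is 0.8/(0.8 + 0.2).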
Estimating phylogenies
Exhaustive Searches
Branch and bound methods
Rise in computational time versus rise in solution space
How many topologies are there?
$T = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$ (rooted bifurcating trees for n taxa)
The Phylogenetic Problem
Number of Seqs    Number of Trees
10                2 x 10^6
100               2 x 10^182
1,000             2 x 10^2,860
10,000            8 x 10^38,658
100,000           1 x 10^486,663
1,000,000         1 x 10^5,866,723
$B(T) = \prod_{i=3}^{T} (2i - 5)$
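A short sketch of both counts using exact integer arithmetic. `unrooted_trees` implements the product B(T) above; `rooted_trees` uses the standard closed form (2n-3)!/(2^(n-2)(n-2)!) for rooted bifurcating trees. The function names are illustrative.

```python
import math

def unrooted_trees(taxa):
    """B(T) = product over i = 3..T of (2i - 5): unrooted bifurcating trees."""
    return math.prod(2 * i - 5 for i in range(3, taxa + 1))

def rooted_trees(n):
    """(2n - 3)! / (2^(n-2) * (n-2)!): rooted bifurcating trees, n >= 2."""
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))

for n in (4, 5, 10):
    print(n, unrooted_trees(n), rooted_trees(n))

# Order of magnitude for 100 sequences, read off the number of decimal digits.
print(len(str(unrooted_trees(100))) - 1)
```

For 10 sequences the unrooted count is 2,027,025, the 2 x 10^6 of the table; a rooted tree on n taxa corresponds to an unrooted tree on n + 1, so `rooted_trees(n)` equals `unrooted_trees(n + 1)`.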
HIV-1 Whole Genomes 1993 - 15
HIV-1 Whole Genomes 2003 (JAN) - 397
Tree Space - the final frontier
Heuristic Searches
Nearest-neighbor interchanges (NNI) - swap two adjacent branches on the tree
Subtree pruning and regrafting (SPR) - remove a branch from the tree (either an interior or an exterior branch) with a subtree attached to it. The subtree is then reinserted into the remaining tree in all possible places
Tree bisection and reconnection (TBR) - an interior branch is broken, and the two resulting fragments of the tree are considered as separate trees. All possible connections are made between a branch of one and a branch of the other.
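Of the three rearrangements, NNI is the simplest to write down: an interior branch has two subtrees on each side, and swapping one subtree across the branch gives exactly two alternative trees. A minimal sketch, representing the tree around one interior branch as nested tuples; the representation and the example grouping of the taxa are assumptions for illustration.

```python
def nni_neighbors(a, b, c, d):
    """For an interior branch separating (a, b) from (c, d), return the two NNI rearrangements."""
    return [
        ((a, c), (b, d)),   # swap b with c across the branch
        ((a, d), (b, c)),   # swap b with d across the branch
    ]

# Subtrees around one interior branch; ("sp2", "sp3") is itself a subtree.
start = (("sp1", "sp4"), (("sp2", "sp3"), "sp5"))
(a, b), (c, d) = start
neighbors = nni_neighbors(a, b, c, d)
for tree in neighbors:
    print(tree)
```

A heuristic search would score each neighbor under the chosen optimality criterion and move to the better tree, repeating over every interior branch until no swap improves the score.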
Other approaches
Tree-fusing - find two near optimal trees and exchange subgroups between the two trees
Genetic Algorithms - a simulation of evolution with a genotype that describes the tree and a fitness function that reflects the optimality of the tree
Disc Covering - upcoming paper
Phylogenetic Accuracy?
Consistency - A phylogenetic method is consistent for a given evolutionary model if the method converges on the correct tree as the data available to the method become infinite.
Efficiency - Statistical efficiency is a measure of how quickly a method converges on the correct solution as more data are applied to the problem.
Robustness - Robustness refers to the degree to which violations of assumptions will affect performance of phylogenetic methods.
How reliable is MY phylogeny?
Bootstrap Analysis
Jackknife Analysis
Posterior Probabilities (Bayesian Approaches)
Decay Indices
Bootstrap
Pseudoreplicates
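A bootstrap pseudoreplicate is built by sampling alignment columns with replacement until the new alignment is as long as the original. A minimal sketch using the five-sequence alignment from earlier; the seed and the number of replicates are arbitrary choices.

```python
import random

alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp3": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}

def bootstrap_alignment(aln, rng):
    """Resample columns with replacement to build one pseudoreplicate of the same length."""
    length = len(next(iter(aln.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in aln.items()}

rng = random.Random(0)                      # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each pseudoreplicate is then analysed with the same tree-building method as the original data, and the support for a group is the fraction of replicate trees that contain it.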