BINF6201/8201
Molecular phylogenetic methods 4
11-10-2011
Maximum likelihood methods
So far we have only considered a single site (configuration). If all the sites evolve independently, the likelihood for all sites is the product of the likelihoods for each site.
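Suppose there are s homologous sequences, each with N nucleotides. Let $D_n$ be the n-th column of the multiple alignment,
$$D_n = (d_{1n}, d_{2n}, \ldots, d_{sn})^{T}.$$
For a tree T, let $f(D_n \mid \theta_1, \theta_2, \ldots, \theta_m, T)$ be the likelihood of tree T for the n-th site, where $\theta_1, \theta_2, \ldots, \theta_m$ are the unknown parameters, such as the branch lengths. Using the previous case as an example, with $D_n = (i, j, k, l)^{T}$ and $\theta_i = v_i$, we have
$$f(D_n \mid \theta_1, \theta_2, \ldots, \theta_m, T) = h(i, j, k, l \mid v_1, v_2, v_3, v_4, v_5, T) = \sum_{x} g_x P_{xl}(v_4) P_{xk}(v_3) \Big\{ \sum_{y} P_{xy}(v_5) P_{yi}(v_1) P_{yj}(v_2) \Big\}.$$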
For simplicity, let’s assume the sequences are homogeneous, i.e., all sites evolve at the same rate. The likelihood function of the entire alignment D for the tree T is then
$$L(\theta_1, \theta_2, \ldots, \theta_m \mid D, T) = \prod_{n=1}^{N} f(D_n \mid \theta_1, \theta_2, \ldots, \theta_m, T).$$
Here, we treat L as a function of the parameters. We then search for the values of $\theta_1, \theta_2, \ldots, \theta_m$ that maximize L given the topology of the tree T; this maximum of L is called the ML value of the tree T.
Finding the ML value can be a slow process.
We do this for all possible tree topologies, and identify the one that has the largest ML value as the inferred phylogenetic tree of the s sequences.
Clearly, different substitution models may result in different trees.
When the number of OTUs is large, a heuristic tree search algorithm should be used to evaluate the alternative trees.
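As a rough illustration of the product over sites, the following sketch computes the likelihood of a four-sequence tree. Several things here are assumptions made for the example rather than part of the lecture: the Jukes-Cantor transition probabilities with branch lengths measured in expected substitutions per site, the tree shape (sequences i and j joined at internal node y, sequences k and l joined at internal node x, internal branch v5), the function names, and the toy alignment.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, v):
    """Jukes-Cantor probability P_ab(v); v is in expected substitutions per site."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(column, v):
    """Likelihood of one alignment column (i, j, k, l) for branch lengths v1..v5."""
    i, j, k, l = column
    v1, v2, v3, v4, v5 = v
    total = 0.0
    for x, y in product(BASES, repeat=2):    # sum over the unknown internal states
        total += (0.25                        # g_x: equal base frequencies under JC
                  * jc_prob(x, l, v4) * jc_prob(x, k, v3)
                  * jc_prob(x, y, v5)
                  * jc_prob(y, i, v1) * jc_prob(y, j, v2))
    return total

def log_likelihood(alignment, v):
    """Sum of per-site log-likelihoods, i.e. the log of the product over sites."""
    n_sites = len(alignment[0])
    return sum(math.log(site_likelihood([seq[n] for seq in alignment], v))
               for n in range(n_sites))

alignment = ["ACGT", "ACGA", "ATGT", "ATGA"]  # toy data: four sequences, four sites
print(log_likelihood(alignment, (0.1, 0.1, 0.1, 0.1, 0.05)))
```

Working with the sum of logarithms instead of the raw product avoids numerical underflow for long alignments. An ML search would wrap `log_likelihood` in a numerical optimizer over the branch lengths.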
Heuristic tree search using predefined clusters
Although the tree space can be very large, the majority of trees have extremely low likelihood values for a given set of OTUs. So we can safely ignore these unpromising trees, and focus on the promising ones.
To reduce the search space, we can predefine clusters as input if their relationships are already known.
The problem then becomes one of examining the 105 possible trees generated by connecting these predefined groups, instead of an astronomically large number of unrooted trees:
$$U_N = (2N-5)!! = (2 \times 23 - 5)!! = 41!!$$
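To get a feel for how fast the number of unrooted trees grows, the double factorial (2N-5)!! can be evaluated directly. This is a minimal sketch; the function name is made up for the example.

```python
def num_unrooted_trees(n_otus):
    """Number of unrooted bifurcating trees, (2N-5)!!, for N >= 3 OTUs."""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs")
    count = 1
    for odd in range(3, 2 * n_otus - 4, 2):   # 3 * 5 * ... * (2N-5)
        count *= odd
    return count

print(num_unrooted_trees(6))    # 7!! = 105
print(num_unrooted_trees(23))   # 41!!, a 26-digit number
```

Python integers have arbitrary precision, so the count stays exact even when it becomes astronomically large.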
The ML value is computed for each tree, and the one with the largest ML value is returned as the inferred tree.
Because this algorithm examines all possible trees connecting the predefined groups, the global optimum is guaranteed if the predefined groups are correct.
Maximum likelihood trees for primates
When the simple J-C model was used and a homogeneous substitution rate was assumed, the resulting ML tree was similar to the NJ and parsimony trees, with the same problem of misplacing the tree shrews inside the primate group.
However, when the more sophisticated HKY substitution model, plus six Γ-distribution rate categories and invariant sites, was used, the tree constructed by the ML method placed the tree shrews outside of the primate group.
Nevertheless, there are three trifurcations on this tree, indicating that at each trifurcation point any of the three clusters can be the outgroup of the other two, and the three resolved trees have the same ML value.
Comparison of parsimony and maximum likelihood methods
Parsimony methods make only one assumption, that all changes on the branches are equally likely; however, this assumption may not hold.
Because parsimony methods make so few assumptions, their proponents believe that these methods can be applied to any sequence data.
Parsimony methods are also relatively fast, so they can be applied to larger data sets.
ML methods make assumptions about the evolutionary model. They need to optimize all of the model parameters to find the ML value, so they are computationally intensive and very slow.
When evolutionary models are properly selected, ML methods tend to achieve better results than parsimony methods.
Heuristic tree search using quartet puzzling
The quartet puzzling algorithm is a very fast heuristic algorithm for exploring the promising trees.
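Step 1: Compute the ML values of the three possible trees for every set of four sequences (quartet). For n sequences there are $3\binom{n}{4}$ such trees.
[Figure: for each quartet drawn from sequences 1 to 6, for example 1, 2, 3, 4, the three possible trees are (1,2 | 3,4), (1,3 | 2,4), and (1,4 | 2,3); the one with the highest ML value is the best ML tree for that quartet.]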
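Step 1 of quartet puzzling can be sketched as a loop over all quartets and their three topologies. In this sketch the scoring function `ml_value` is a hypothetical placeholder supplied by the caller; a real implementation would optimize branch lengths under a substitution model to obtain each ML value.

```python
from itertools import combinations

def quartet_topologies(a, b, c, d):
    """The three unrooted trees for four sequences, each as two sister pairs."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def best_quartet_trees(names, ml_value):
    """Map each quartet to its highest-scoring topology.

    ml_value is a caller-supplied (hypothetical) function that returns the
    ML value of a single quartet topology.
    """
    best = {}
    for quartet in combinations(names, 4):
        best[quartet] = max(quartet_topologies(*quartet), key=ml_value)
    return best

names = [1, 2, 3, 4, 5, 6]
n_trees = 3 * sum(1 for _ in combinations(names, 4))
print(n_trees)   # 3 * C(6, 4) = 45 quartet trees to evaluate
```

The number of quartet trees grows as n^4, which is far slower than the double-factorial growth of the full tree space.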
Step 2: Randomly pick four sequences, and place them in the tree according to their best ML quartet tree.
[Figure: the starting quartet tree for sequences 1, 4, 2, and 3.]
Step 3: Randomly pick one of the remaining sequences and add it to the tree, such that the growing tree contains the maximum number of best ML quartet trees. Repeat this process until all sequences have been added to the tree.
For example, if sequence 5 is randomly picked, and if one or both of the following trees are the best ML quartet trees involving 1, 2, 3, 4, and 5:
[Figure: the best ML quartet trees that contain sequence 5.]
then the resulting tree will be:
[Figure: the five-sequence tree obtained by adding sequence 5 to the quartet tree of sequences 1, 2, 3, and 4.]
The last sequence, 6, is then added to the tree. If the following tree has the best ML value among all quartet trees containing sequence 6,
[Figure: quartet tree of sequences 6, 1, 3, and 4.]
then the resulting tree will be:
[Figure: the six-sequence tree obtained by adding sequence 6 to the tree of sequences 1 to 5.]
The whole process is repeated many times with the sequences selected in different orders, because the resulting tree depends on the order in which the sequences are added.
The tree that occurs most frequently is chosen as the inferred tree.
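Bayesian phylogenetic methods
Bayes' theorem: if A and B are two events, then
$$P(AB) = P(B \mid A) P(A) = P(A \mid B) P(B), \qquad P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}.$$
[Figure: a sample space partitioned into events T1, T2, ..., T12, with an event D overlapping them.]
If $T_1, T_2, \ldots, T_n$ are events that partition the sample space, and D is an event from the sample space, then
$$P(D) = P(D \mid T_1) P(T_1) + P(D \mid T_2) P(T_2) + \cdots + P(D \mid T_n) P(T_n) = \sum_{i=1}^{n} P(D \mid T_i) P(T_i),$$
$$P(T_j \mid D) = \frac{P(D \mid T_j) P(T_j)}{P(D)} = \frac{P(D \mid T_j) P(T_j)}{\sum_{i=1}^{n} P(D \mid T_i) P(T_i)}.$$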
For N OTUs, there are n = (2N-5)!! possible unrooted trees, which form a partition of the tree space. Let D be the alignment of the N OTUs; we do not know which tree is most likely to account for D.
[Figure: the tree space partitioned into tree1, tree2, ..., treen.]
In the ML method, we compute the probability (likelihood) that D is generated by each tree:
$$L(\mathrm{tree}_i) = P(D \mid \mathrm{tree}_i).$$
We find the maximum likelihood, $ML = \max [P(D \mid \mathrm{tree}_i)]$, by varying the parameters (branch lengths or substitution rates) on each tree i, and return the tree that has the largest ML value.
In Bayesian methods, we compute the probability of each tree given the observed alignment of the N OTUs, $P(\mathrm{tree}_j \mid D)$, which is called the posterior probability.
Using Bayes' theorem, we have
$$P(\mathrm{tree}_j \mid D) = \frac{P(D \mid \mathrm{tree}_j) P(\mathrm{tree}_j)}{\sum_{i=1}^{n} P(D \mid \mathrm{tree}_i) P(\mathrm{tree}_i)},$$
where $P(\mathrm{tree}_i)$ is called the prior probability.
Calculating the denominator of the posterior probability can be difficult, because we have to enumerate all possible trees together with their branch lengths and substitution rates.
However, the denominator is the same constant for all possible trees, so the posterior probability of each tree is proportional to the likelihood of the tree multiplied by its prior probability.
If we can generate a large sample of trees such that the frequency of each tree is proportional to its likelihood multiplied by its prior probability, then the posterior probability can be easily estimated by
$$P(\mathrm{tree}_j \mid D) \propto P(D \mid \mathrm{tree}_j) P(\mathrm{tree}_j), \qquad P(\mathrm{tree}_j \mid D) \approx \frac{\text{number of trees with the same topology as } \mathrm{tree}_j}{\text{total number of trees in the sample}}.$$
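When the set of candidate trees is small enough to enumerate, the posterior can be computed directly from Bayes' theorem. The sketch below treats each tree's likelihood as a single number, ignoring the integration over branch lengths and substitution rates; the function name and the log-likelihood values are made up for illustration.

```python
import math

def posterior_probabilities(log_likelihoods, priors=None):
    """Posterior P(tree_j | D) from log P(D | tree_j) and prior P(tree_j)."""
    n = len(log_likelihoods)
    if priors is None:
        priors = [1.0 / n] * n               # uniform prior (an assumption)
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    shift = max(log_joint)                   # avoids underflow when exponentiating
    weights = [math.exp(x - shift) for x in log_joint]
    total = sum(weights)                     # plays the role of the denominator
    return [w / total for w in weights]

log_l = [-1002.3, -1000.0, -1005.1]          # made-up log-likelihoods of three trees
post = posterior_probabilities(log_l)
print(post)
```

Likelihoods of real alignments are far too small to represent as floating-point numbers, which is why the sketch works with log-likelihoods and subtracts the maximum before exponentiating.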
The Markov chain Monte Carlo method for sampling
Markov chain Monte Carlo (MCMC) is a method for generating a sample from the entire sample space, such that the frequency of each individual in the sample is proportional to its likelihood of generating the observed data.
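If we have no preference for any tree before seeing the data, we can use a non-informative uniform prior probability. Therefore,
$$P(\mathrm{tree}_j \mid D) = \frac{P(D \mid \mathrm{tree}_j) P(\mathrm{tree}_j)}{\sum_{i=1}^{n} P(D \mid \mathrm{tree}_i) P(\mathrm{tree}_i)} = \frac{P(D \mid \mathrm{tree}_j)}{\sum_{i=1}^{n} P(D \mid \mathrm{tree}_i)} \propto P(D \mid \mathrm{tree}_j).$$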
The MCMC method begins with a trial tree T1 and computes its likelihood, L1. A move is then made that changes this tree by a small amount in any of the following:
1. Branch length;
2. Rate of substitution;
3. Topology, by a nearest-neighbor interchange tree move.
The likelihood L2 of the new tree T2 is computed, which is usually slightly different from L1.
If L2 > L1, then T2 is accepted, and it becomes an element of the sample.
If L2 < L1, then T2 is accepted with probability L2 / L1; otherwise T1 is kept and counted again in the sample.
This selection rule is called the Metropolis algorithm.
Therefore the MCMC method favors hill-climbing moves, but also allows downhill moves with a certain probability.
As a result, the equilibrium probabilities of observing the different trees in the sample are proportional to the likelihoods of the trees.
To see this, suppose that we have only two trees, so that MCMC moves back and forth between them with transition probabilities r12 and r21.
[Figure: two trees, T1 and T2, connected by transitions with probabilities r12 (T1 to T2) and r21 (T2 to T1).]
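The acceptance rule can be tried out on a toy problem in which the "trees" are just indices with fixed likelihoods. This is a minimal sketch under stated assumptions: a real sampler would propose changes to branch lengths, rates, and topology, whereas here each step proposes one of the other trees uniformly at random. The function name and the likelihood values are made up for the example.

```python
import random
from collections import Counter

def metropolis_sample(likelihoods, n_steps, seed=0):
    """Sample tree indices (at least two trees) with the Metropolis rule."""
    rng = random.Random(seed)
    n = len(likelihoods)
    current = rng.randrange(n)
    counts = Counter()
    for _ in range(n_steps):
        # Propose one of the other trees uniformly (a symmetric proposal).
        proposal = rng.randrange(n - 1)
        if proposal >= current:
            proposal += 1
        ratio = likelihoods[proposal] / likelihoods[current]
        # Uphill moves are always accepted, downhill moves with probability L2 / L1.
        if ratio >= 1 or rng.random() < ratio:
            current = proposal
        counts[current] += 1     # after a rejection the current tree is counted again
    return [counts[i] / n_steps for i in range(n)]

freqs = metropolis_sample([1.0, 2.0, 5.0], n_steps=200_000)
print(freqs)   # should be close to [0.125, 0.25, 0.625]
```

Only likelihood ratios enter the acceptance step, so the normalizing constant in the denominator of the posterior never has to be computed.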
Let p1 and p2 be the equilibrium probabilities of these trees in the sample. Then at equilibrium, the probabilities of observing these trees during the sampling process should be constant,
$$p_1 r_{12} = p_2 r_{21}, \quad \text{or} \quad \frac{r_{21}}{r_{12}} = \frac{p_1}{p_2}.$$
This property is called detailed balance. For the trees in the sample to be proportional to their likelihoods, we need to set
$$\frac{p_1}{p_2} = \frac{L_1}{L_2}.$$
Therefore, we have
$$\frac{r_{21}}{r_{12}} = \frac{L_1}{L_2}.$$
This means that to generate the desired sample, we should set the ratio of the transition probabilities equal to the ratio of the likelihoods.
The MCMC algorithm does just this, because:
if L2 > L1, we set r12 = 1 and r21 = L1 / L2; therefore, r21 / r12 = L1 / L2;
if L2 < L1, we set r12 = L2 / L1 and r21 = 1; therefore, r21 / r12 = L1 / L2.
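The two-tree argument can be checked numerically. The sketch below builds r12 and r21 from the Metropolis rule, solves the balance condition p1 r12 = p2 r21 together with p1 + p2 = 1, and compares the ratio of equilibrium probabilities with the ratio of likelihoods. The function names and likelihood values are made up for the example.

```python
def metropolis_rates(l1, l2):
    """Transition probabilities (r12, r21) between two trees under the Metropolis rule."""
    r12 = min(1.0, l2 / l1)
    r21 = min(1.0, l1 / l2)
    return r12, r21

def equilibrium(r12, r21):
    """Equilibrium probabilities (p1, p2) with p1 * r12 = p2 * r21 and p1 + p2 = 1."""
    return r21 / (r12 + r21), r12 / (r12 + r21)

l1, l2 = 2.0, 5.0                    # made-up likelihoods of T1 and T2
r12, r21 = metropolis_rates(l1, l2)
p1, p2 = equilibrium(r12, r21)
print(p1 / p2, l1 / l2)              # the two ratios agree
```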
The top four trees for the Platyrrhini group by MCMC
To compute likelihoods, the HKY substitution model, plus six Γ-distribution rate categories and invariant sites, is used.
Most parts of the tree are well defined, except for the following groups.
[Figure: the top four trees. Annotations: "The same as in the tree constructed by NJ and parsimony methods"; "The position of the capuchin varies".]
The top seven trees for the principal groups by MCMC
[Figure: the top seven trees. Annotation: "The same as by NJ and parsimony".]
The uncertainty in these trees indicates that more sequences are needed to solve the problem.
Popular phylogenetic tree construction programs
PHYLIP
• Developed by Joseph Felsenstein;
• Implements most known distance methods, such as UPGMA and NJ, as well as maximum parsimony and ML methods;
• The most recent release is version 3.69, which contains more than 50 programs;
• Command line interface;
• The package can be freely downloaded at http://evolution.genetics.washington.edu/phylip.html
PAUP (Phylogenetic Analysis Using Parsimony)
• Written by David Swofford;
• Includes parsimony, distance matrix, invariants, and maximum likelihood methods, and many indices and statistical tests;
• Described at http://paup.csit.fsu.edu/
• Unfortunately, it is now commercialized by Sinauer Associates, selling for $85-150/package.
MEGA (Molecular Evolutionary Genetics Analysis)
• Developed by Sudhir Kumar and colleagues;
• Contains parsimony, distance, and likelihood methods for molecular data (nucleic acid sequences and protein sequences);
• Can do bootstrapping, consensus trees, and a variety of data editing tasks;
• Has a sequence alignment function using an implementation of ClustalW;
• A GUI-based program;
• Contains tree display functions.
TREE-PUZZLE
• Written by Korbinian Strimmer;
• A program for maximum likelihood analysis of nucleotide and amino acid alignments;
• Infers phylogenies by quartet puzzling;
• Supports all popular models of sequence evolution for nucleotides and proteins, and can take rate heterogeneity among sites into account;
• Compatible with PHYLIP files;
• The current version also has features for parallel computation using the MPI message-passing interface, if this is available;
• Freely available at http://www.tree-puzzle.de/.
MrBayes
• A program for the Bayesian estimation of phylogenetic trees;
• Able to analyze nucleotide, amino acid, restriction site, and morphological data;
• Freely available at http://mrbayes.csit.fsu.edu/
TreeView
• A program for visualizing and printing trees;
• Free at http://taxonomy.zoology.gla.ac.uk/rod/treeview.html