Bayesian Phylogenetic Analysis Fredrick Nindo
Computational Biology Group University of Cape Town
Outline
• Introduction to Bayesian phylogenetic analysis
• Markov chain Monte Carlo sampling
• Comparison of MrBayes and BEAST
• Strict molecular clocks and relaxed molecular clocks
• Calibrating estimates of rates and divergence times
History of Bayesian theory
• The Reverend Thomas Bayes was born in London in 1702
• He was the son of one of the first Nonconformist ministers to be ordained in England
• He became a Presbyterian minister in the late 1720s, but was well known for his studies of mathematics
• He was elected a Fellow of the Royal Society of London in 1742
• He died in 1761, before his works were published
Bayes Theorem
• Bayes' Theorem explains how to calculate inverse probabilities
• For example, suppose that three boxes contain colored balls as shown below
• Given a box, a ball is chosen uniformly at random
• For example, if a ball is chosen from Box B1, there is a 3/4 chance that it is red
• The inverse problem asks: if a red ball is drawn, how likely is it that it came from Box B1?

[Figure: three boxes B1, B2, B3, each holding four colored balls; B1 contains three red balls]
Bayesian Priors
• If a red ball is drawn, how likely is it that it came from Box B1?
• To answer this question, we need a prior distribution for the selection of the box
• The answer will be different if we believe a priori that Box B1 is 10% likely to be the chosen box than if we believe that all three boxes are equally likely
Bayes formulation
• Bayes' Theorem states that if a complete list of mutually exclusive events B1, B2, ... have prior probabilities Pr(B1), Pr(B2), ..., and the likelihood of event A given event Bi is Pr(A|Bi) for each i, then the posterior probability of Bi given A, written Pr(Bi|A), is proportional to the product of the likelihood Pr(A|Bi) and the prior probability Pr(Bi):

  Pr(Bi|A) = Pr(A|Bi) Pr(Bi) / Σj Pr(A|Bj) Pr(Bj)

  where the normalizing constant Pr(A) = Σj Pr(A|Bj) Pr(Bj) is the prior probability of A
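The box example can be worked through numerically. Only B1's composition (three red balls out of four) is stated on the earlier slide; the red-ball counts assumed for B2 and B3 below, and the uniform prior, are hypothetical choices for illustration.

```python
# Hypothetical setup: B1's 3/4 red chance is from the slides; the
# compositions of B2 and B3 are assumed for illustration only.
likelihood_red = {"B1": 3/4, "B2": 1/4, "B3": 2/4}  # Pr(red | box)
prior = {"B1": 1/3, "B2": 1/3, "B3": 1/3}           # uniform prior over boxes

# Normalizing constant: Pr(red) = sum_j Pr(red | Bj) Pr(Bj)
marginal = sum(likelihood_red[b] * prior[b] for b in prior)

# Posterior via Bayes' theorem: Pr(Bj | red)
posterior = {b: likelihood_red[b] * prior[b] / marginal for b in prior}
# Under these assumptions Pr(B1 | red) = 0.5, so B1 is the most
# probable source of a red ball even though the prior was uniform.
```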
Bayesian theory in Phylogenetics
• In a Bayesian approach to phylogenetics, the boxes are like different tree topologies
• The colored balls are like site patterns, except:
  • there are many more than two colors; and
  • we observe multiple draws from each box
• Additional parameters such as branch lengths and substitution model parameters affect the likelihood, are unknown, and add to the complexity
• A prior distribution is a probability distribution on parameters before any data are observed
• A posterior distribution is a probability distribution on parameters after data are observed
Bayes Rule
• Bayes' rule: Pr(H|D) = Pr(D|H) Pr(H) / Pr(D), where H represents a specific hypothesis
• Pr(H) is called the prior probability of H: the probability assumed before the new data, D, became available
• Pr(D|H) is called the conditional probability of seeing D if H is true; it is also called a likelihood function when it is considered as a function of H for fixed D
• Pr(D) is called the marginal probability of D: the a priori probability of witnessing D under all possible hypotheses. For any complete set of mutually exclusive hypotheses Hi, it can be calculated as Pr(D) = Σi Pr(D|Hi) Pr(Hi)
• Pr(H|D) is called the posterior probability of H given D
Markov Chain Monte Carlo (MCMC)
• Markov chain Monte Carlo (MCMC) is a method for taking (dependent) samples from a distribution
• The distribution need only be known up to a constant of proportionality
• MCMC is especially useful for computing Bayesian posterior probabilities
• Simple summary statistics of the sample converge to posterior probabilities
• Metropolis-Hastings is a form of MCMC that uses a Markov chain to propose the next item to sample, accepting or rejecting each proposal with a specified probability
MCMC ALGORITHM
• We want to make inferences on the basis of a posterior distribution p(θ|x)
• We cannot calculate the desired quantities analytically, so instead we wish to sample from p(θ|x) and use sample statistics as estimates of the true posterior values; for example, a sample mean is an estimate of an expected value
• But we also may not be able to take a simple random sample of values from the posterior distribution
• A computational method called Markov chain Monte Carlo has proven remarkably successful for obtaining dependent samples from probability distributions
• If this is done carefully, sample statistics will converge to the desired posterior values
MCMC Tree search strategy
• Markov chain Monte Carlo (MCMC) takes (dependent) samples from a distribution
• The distribution need only be known up to a constant of proportionality, as the algorithm depends only on ratios
• A proposal method is needed that describes a probability distribution for proposing new parameter values given current ones
• In theory, just about any proposal distribution is correct (given an infinite sample size); the art is in designing (and correctly implementing) a method so that feasible sample sizes are adequate
• If a proposal is not accepted, the current value is sampled again
MCMC proposal steps
1. Start at θ0; set i = 0
2. Propose θ* from the current θi
3. Calculate the acceptance probability
4. Generate a uniform random number to decide acceptance
5. If accepted, set θi+1 = θ*; if rejected, set θi+1 = θi
6. Increment i to i + 1
7. Repeat steps 2 through 6 many times
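The steps above can be sketched as a minimal random-walk Metropolis sampler. A standard normal density (known here only up to its normalizing constant) stands in for a real phylogenetic posterior p(θ|x); the step size, seed, and sample count are illustrative choices.

```python
import math
import random

def target_unnorm(theta):
    # Unnormalized target density: a standard normal up to a constant,
    # standing in for the posterior p(theta | x).
    return math.exp(-0.5 * theta * theta)

def metropolis(n_samples, step=1.0, seed=42):
    rng = random.Random(seed)
    theta = 0.0                                  # step 1: start at theta_0
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)  # step 2: propose theta*
        # step 3: the acceptance probability uses only a ratio of densities,
        # so the unknown normalizing constant cancels
        accept_prob = min(1.0, target_unnorm(proposal) / target_unnorm(theta))
        if rng.random() < accept_prob:           # steps 4-5: accept or reject
            theta = proposal
        # on rejection the current value is sampled again
        samples.append(theta)
    return samples

samples = metropolis(20000)
# The sample mean estimates the posterior mean (0 for this target)
mean = sum(samples) / len(samples)
```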
MCMC sampling
• The parameter space includes:
  • The tree topology
  • The branch lengths
  • Substitution model parameters
• In practice, we use several MCMC proposals that leave some parameters fixed while changing others
• The result of an MCMC analysis is a sample from the posterior distribution
• Sample statistics are estimates of the corresponding posterior quantities:
  • The sample proportion of a given tree topology converges to the posterior probability of that tree topology
  • The proportion of trees with a given clade converges to the posterior probability of that clade
  • The ends of the middle 95% of the sample for the transition/transversion bias κ form an interval estimate for κ
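The interval estimate mentioned above (an equal-tailed 95% credible interval) can be read directly off a sorted sample. A sketch follows; the κ values below are placeholder numbers, not output of a real analysis.

```python
def credible_interval(samples, mass=0.95):
    # Equal-tailed credible interval: trim an equal number of sampled
    # values from each tail and report the remaining endpoints.
    s = sorted(samples)
    n = len(s)
    k = int(round((1.0 - mass) / 2.0 * n))  # values trimmed from each tail
    return s[k], s[n - 1 - k]

# Placeholder stand-in for sampled kappa values from an MCMC log
kappa_samples = [2.0 + 0.001 * i for i in range(1000)]
lo, hi = credible_interval(kappa_samples)
# (lo, hi) is the 95% interval estimate for kappa from this sample
```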
Summarizing MCMC results
• Trees (.trees) and continuous parameters (.log) are logged for each sample
• The proportion of a given tree topology (after burn-in) in these logs is an approximation of the posterior probability of that tree topology
• Trees in the file can be analyzed for tree partitions, from which a consensus tree can be made. The proportion of a given tree partition among the sampled trees estimates the posterior probability of that partition
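The counting described above can be sketched as follows, with toy Newick strings standing in for a parsed .trees log; a real analysis would parse the file and canonicalize each topology before counting.

```python
from collections import Counter

# Toy sampled topologies standing in for a .trees log (hypothetical data)
sampled_trees = (["((A,B),C);"] * 700
                 + ["((A,C),B);"] * 200
                 + ["((B,C),A);"] * 100)

# Discard the first 10% of samples as burn-in
burn_in = int(0.1 * len(sampled_trees))
post_burnin = sampled_trees[burn_in:]

# Sample proportion of each topology approximates its posterior probability
counts = Counter(post_burnin)
posterior = {tree: n / len(post_burnin) for tree, n in counts.items()}
```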
MCMC output
• A consensus tree from an MCMC sample is simply a summary of the posterior distribution of the topology
• Other summaries are possible
• This consensus tree is not an optimal tree according to some criterion such as maximum likelihood or parsimony
MCMC Diagnostics
• Measure the behavior of a single chain
• Visually inspect output traces in Tracer
• Measure autocorrelation within a chain: the effective sample size (ESS); e.g. ESS = 1689 in the example trace
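A simple way to compute an ESS from a trace is N divided by (1 + 2 × the sum of autocorrelations), truncating the sum at the first non-positive autocorrelation. This is one common rule of thumb; tools such as Tracer use more refined estimators.

```python
import random

def effective_sample_size(x):
    # ESS = N / (1 + 2 * sum of lag autocorrelations), with the sum
    # truncated at the first non-positive autocorrelation.
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return float(n)
    acf_sum = 0.0
    for lag in range(1, n // 2):
        acov = sum((x[i] - mean) * (x[i + lag] - mean)
                   for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0:
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)

rng = random.Random(0)
iid = [rng.gauss(0, 1) for _ in range(500)]        # nearly independent trace
ess_iid = effective_sample_size(iid)

# A strongly autocorrelated trace: each value repeated 10 times
correlated = [v for v in iid[:50] for _ in range(10)]
ess_corr = effective_sample_size(correlated)       # much smaller ESS
```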
Limitations of MCMC sampling
• MCMC sometimes fails to converge
• Always run several chains with different random seeds and compare the answers
• If the true tree has some very short internal edges, Bayesian inference can mislead
• Different likelihood models can lead to different results
Bayesian Phylogenetic Tree Reconstruction
• A search for a set of plausible trees (weighted by their probability) instead of a single best tree
• The "space" that you search in is limited by prior information and the data
• The posterior distribution of trees can be translated into a probability of any branching event, which allows uncertainty to be estimated!
• BUT it incorporates prior beliefs
Bayesian phylogenetics: the one 'true' tree?
• Maximum likelihood, maximum parsimony, and distance methods try to find a single tree that best describes the data
• However, these methods do not search everywhere (they are heuristic), and it is difficult to find the "best" tree
• Are we doing a good job by reporting a single tree?
Maximum likelihood vs Bayesian approaches

Maximum Likelihood
• Probability: only defined in the context of long-run relative frequencies
• Parameters: fixed and unknown
• Nuisance parameters: optimize them
• Testing: p-values
• Nature of the method: objective

Bayesian
• Probability: describes everything that is uncertain
• Parameters: random
• Nuisance parameters: average over them
• Testing: Bayes factors
• Nature of the method: subjective
Advantages of Bayesian Inference of Phylogeny
• We have almost no prior knowledge for the parameters of interest... so why bother doing Bayesian inference?
• A Bayesian analysis expresses its results as the probability of the hypothesis given the data
• MCMC is a stochastic algorithm and thus is able to avoid getting stuck in a local suboptimal solution
• By sampling a set of plausible trees, MCMC allows estimation of the uncertainty of any branching event
Bayesian Inference of Phylogeny
• Development of Bayesian methods has led to continual improvement in our ability to model and learn about molecular evolution
• Bayesian inference uses likelihood, but requires a prior distribution
• Bayesian inference is computationally intensive, but can be less so than ML plus bootstrapping
• Bayesian inference directly measures the items of interest on an easily interpretable probability scale
• Some folks dislike the requirement of specifying a prior distribution
Bayesian analysis software
• A number of computer programs have been developed that implement Bayesian phylogenetic analysis:
  • MrBayes (http://mrbayes.sourceforge.net/), which includes BEST (http://www.stat.osu.edu/~dkp/BEST/help/manual_BEST2.3.pdf)
  • BEAST (Bayesian Evolutionary Analysis Sampling Trees) (http://beast.bio.ed.ac.uk/)
  • Coevol (Lartillot and Poujol, 2011)
  • PhyloBayes (Lartillot et al., 2009)
  • RevBayes (http://revbayes.com)
MrBayes
• A Bayesian inference program for phylogenetic inference and model selection
• Developed by John Huelsenbeck and Fredrik Ronquist
• Assisted by Maxim Teslenko, Paul van der Mark, Daniel Ayres, Aaron Darling, Sebastian Höhna, Bret Larget, Liang Liu, Marc Suchard
MRBAYES
• Phylogenetic inference under a wide range of models; two output files:
  • Parameter (.p) files
  • Tree (.t) files
• Unrooted trees: joint estimation of topology, branch lengths, and model parameters
• Rooted, time-calibrated trees: joint estimation of topology, branch rates, branch times, and model parameters, plus gene-tree/species-tree inference in BEST
• Data types:
  • discrete characters: binary (0,1) or multi-state (0,1,...,9)
  • DNA: 4-state nucleotide, doublet, or codons
  • amino acid
Comparison of MrBayes and BEAST
Method/Model/Feature MrBayes BEAST
Unrooted trees ✓ ✗
Joint est. topology & times ✓ ✓
Gene-tree/species-tree ✓ ✓
Dataset partitioning ✓ ✓
Bayes factors ✓ ✓
Morphological data ✓ ✓
Demography/phylogeography ✗ ✓
Continuous traits ✗ ✓
Graphical user interface (GUI) ✗ ✓
Final tree summarisation .con .mcc
Acknowledgements
Assoc Prof Darren Martin
Dr. Aderito Louis Monjane
Brejnev Muhire
Rebone Meraba
UCT CBIO Crew