Bayesian Phylogenetic Analysis Fredrick Nindo
Computational Biology Group University of Cape Town
Outline
• Introduction to Bayesian phylogenetic analysis
• Markov chain Monte Carlo sampling
• Comparison of MrBayes and BEAST
• Strict molecular clocks and relaxed molecular clocks
• Calibrating estimates of rates and divergence times
History of Bayesian theory
• The Reverend Thomas Bayes was born in London in 1702
• He was the son of one of the first Nonconformist ministers to be ordained in England
• He became a Presbyterian minister in the late 1720s, but was well known for his studies of mathematics
• He was elected a Fellow of the Royal Society of London in 1742
• He died in 1761, before his works were published
Bayes Theorem
• Bayes' Theorem explains how to calculate inverse probabilities
• For example, suppose that three boxes contain colored balls as shown below
• Given a box, a ball is chosen uniformly at random
• For example, if a ball is chosen from Box B1, there is a 3/4 chance that it is red
• The inverse problem asks: if a red ball is drawn, how likely is it that it came from Box B1?

[Figure: three boxes B1, B2, B3, each holding four colored balls; B1 contains three red balls]
Bayesian Priors
• If a red ball is drawn, how likely is it that it came from Box B1?
• To answer this question, we need a prior distribution for the selection of the box
• The answer will be different if we believe a priori that Box B1 is 10% likely to be the chosen box than if we believe that all three boxes are equally likely
Bayes formulation
• Bayes' Theorem states that if a complete list of mutually exclusive events B1, B2, ... have prior probabilities Pr(B1), Pr(B2), ..., and the likelihood of event A given event Bi is Pr(A|Bi) for each i, then the posterior probability of Bi given A, written Pr(Bi|A), is proportional to the product of the likelihood Pr(A|Bi) and the prior probability Pr(Bi):

  Pr(Bi|A) = Pr(A|Bi) Pr(Bi) / Σj Pr(A|Bj) Pr(Bj)

  where the normalizing constant Pr(A) = Σj Pr(A|Bj) Pr(Bj) is the prior probability of A
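The box example can be worked through numerically. Only B1's composition (three red balls out of four) is stated on the earlier slide; the red-ball counts assumed for B2 and B3 below, and the uniform prior, are hypothetical choices for illustration.

```python
# Hypothetical setup: B1's 3/4 red chance is from the slides; the
# compositions of B2 and B3 are assumed for illustration only.
likelihood_red = {"B1": 3/4, "B2": 1/4, "B3": 2/4}  # Pr(red | box)
prior = {"B1": 1/3, "B2": 1/3, "B3": 1/3}           # uniform prior over boxes

# Normalizing constant: Pr(red) = sum_j Pr(red | Bj) Pr(Bj)
marginal = sum(likelihood_red[b] * prior[b] for b in prior)

# Posterior via Bayes' theorem: Pr(Bj | red)
posterior = {b: likelihood_red[b] * prior[b] / marginal for b in prior}
# Under these assumptions Pr(B1 | red) = 0.5, so B1 is the most
# probable source of a red ball even though the prior was uniform.
```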
Bayesian theory in Phylogenetics
• In a Bayesian approach to phylogenetics, the boxes are like different tree topologies
• The colored balls are like site patterns, except:
  • there are many more than two colors; and
  • we observe multiple draws from each box
• Additional parameters such as branch lengths and substitution model parameters affect the likelihood, are unknown, and add to the complexity
• A prior distribution is a probability distribution on parameters before any data are observed
• A posterior distribution is a probability distribution on parameters after data are observed
Bayes Rule
• Bayes' rule: Pr(H|D) = Pr(D|H) Pr(H) / Pr(D), where H represents a specific hypothesis
• Pr(H) is called the prior probability of H: the probability assumed before the new data, D, became available
• Pr(D|H) is called the conditional probability of seeing D if H is true; it is also called a likelihood function when it is considered as a function of H for fixed D
• Pr(D) is called the marginal probability of D: the a priori probability of witnessing D under all possible hypotheses. For any complete set of mutually exclusive hypotheses Hi, it can be calculated as Pr(D) = Σi Pr(D|Hi) Pr(Hi)
• Pr(H|D) is called the posterior probability of H given D
Markov Chain Monte Carlo (MCMC)
• Markov chain Monte Carlo (MCMC) is a method for taking (dependent) samples from a distribution
• The distribution need only be known up to a constant of proportionality
• MCMC is especially useful for computing Bayesian posterior probabilities
• Simple summary statistics of the sample converge to posterior probabilities
• Metropolis-Hastings is a form of MCMC that uses a Markov chain to propose the next item to sample, accepting or rejecting each proposal with a specified probability
MCMC ALGORITHM
• We want to make inferences on the basis of a posterior distribution p(θ|x)
• We cannot calculate the desired quantities analytically, so instead we wish to sample from p(θ|x) and use sample statistics as estimates of the true posterior values; for example, a sample mean is an estimate of an expected value
• But we also may not be able to take a simple random sample of values from the posterior distribution
• A computational method called Markov chain Monte Carlo has proven remarkably successful for obtaining dependent samples from probability distributions
• If this is done carefully, sample statistics will converge to the desired posterior values
MCMC Tree search strategy
• Markov chain Monte Carlo (MCMC) takes (dependent) samples from a distribution
• The distribution need only be known up to a constant of proportionality, as the algorithm depends only on ratios
• A proposal method is needed that describes a probability distribution for proposing new parameter values given current ones
• In theory, just about any proposal distribution is correct (given an infinite sample size); the art is in designing (and correctly implementing) a method so that feasible sample sizes are adequate
• If a proposal is not accepted, the current value is sampled again
MCMC proposal steps
1. Start at θ0; set i = 0
2. Propose θ* from the current θi
3. Calculate the acceptance probability
4. Generate a uniform random number to decide acceptance
5. If accepted, set θi+1 = θ*; if rejected, set θi+1 = θi
6. Increment i to i + 1
7. Repeat steps 2 through 6 many times
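The steps above can be sketched as a minimal random-walk Metropolis sampler. A standard normal density (known here only up to its normalizing constant) stands in for a real phylogenetic posterior p(θ|x); the step size, seed, and sample count are illustrative choices.

```python
import math
import random

def target_unnorm(theta):
    # Unnormalized target density: a standard normal up to a constant,
    # standing in for the posterior p(theta | x).
    return math.exp(-0.5 * theta * theta)

def metropolis(n_samples, step=1.0, seed=42):
    rng = random.Random(seed)
    theta = 0.0                                  # step 1: start at theta_0
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)  # step 2: propose theta*
        # step 3: the acceptance probability uses only a ratio of densities,
        # so the unknown normalizing constant cancels
        accept_prob = min(1.0, target_unnorm(proposal) / target_unnorm(theta))
        if rng.random() < accept_prob:           # steps 4-5: accept or reject
            theta = proposal
        # on rejection the current value is sampled again
        samples.append(theta)
    return samples

samples = metropolis(20000)
# The sample mean estimates the posterior mean (0 for this target)
mean = sum(samples) / len(samples)
```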
MCMC sampling
• The parameter space includes:
  • The tree topology
  • The branch lengths
  • Substitution model parameters
• In practice, we use several MCMC proposals that leave some parameters fixed while changing others
• The result of an MCMC analysis is a sample from the posterior distribution
• Sample statistics are estimates of the corresponding posterior quantities:
  • The sample proportion of a given tree topology converges to the posterior probability of that tree topology
  • The proportion of trees with a given clade converges to the posterior probability of that clade
  • The ends of the middle 95% of the sample for the transition/transversion bias κ form an interval estimate for κ
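The interval estimate mentioned above (an equal-tailed 95% credible interval) can be read directly off a sorted sample. A sketch follows; the κ values below are placeholder numbers, not output of a real analysis.

```python
def credible_interval(samples, mass=0.95):
    # Equal-tailed credible interval: trim an equal number of sampled
    # values from each tail and report the remaining endpoints.
    s = sorted(samples)
    n = len(s)
    k = int(round((1.0 - mass) / 2.0 * n))  # values trimmed from each tail
    return s[k], s[n - 1 - k]

# Placeholder stand-in for sampled kappa values from an MCMC log
kappa_samples = [2.0 + 0.001 * i for i in range(1000)]
lo, hi = credible_interval(kappa_samples)
# (lo, hi) is the 95% interval estimate for kappa from this sample
```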
Summarizing MCMC results
• Trees (.trees) and continuous parameters (.log) are logged for each sample
• The proportion of a given tree topology (after burn-in) in these logs is an approximation of the posterior probability of that tree topology
• Trees in the file can be analyzed for tree partitions, from which a consensus tree can be made. The proportion of a given tree partition among the sampled trees estimates the posterior probability of that partition
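The counting described above can be sketched as follows, with toy Newick strings standing in for a parsed .trees log; a real analysis would parse the file and canonicalize each topology before counting.

```python
from collections import Counter

# Toy sampled topologies standing in for a .trees log (hypothetical data)
sampled_trees = (["((A,B),C);"] * 700
                 + ["((A,C),B);"] * 200
                 + ["((B,C),A);"] * 100)

# Discard the first 10% of samples as burn-in
burn_in = int(0.1 * len(sampled_trees))
post_burnin = sampled_trees[burn_in:]

# Sample proportion of each topology approximates its posterior probability
counts = Counter(post_burnin)
posterior = {tree: n / len(post_burnin) for tree, n in counts.items()}
```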
MCMC output
• A consensus tree from an MCMC sample is simply a summary of the posterior distribution of the topology
• Other summaries are possible
• This consensus tree is not an optimal tree according to some criterion such as maximum likelihood or parsimony
MCMC Diagnostics
• Measure the behavior of a single chain
• Visually inspect output traces in Tracer
• Measure autocorrelation within a chain: the effective sample size (ESS); e.g. ESS = 1689 in the example trace
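A simple way to compute an ESS from a trace is N divided by (1 + 2 × the sum of autocorrelations), truncating the sum at the first non-positive autocorrelation. This is one common rule of thumb; tools such as Tracer use more refined estimators.

```python
import random

def effective_sample_size(x):
    # ESS = N / (1 + 2 * sum of lag autocorrelations), with the sum
    # truncated at the first non-positive autocorrelation.
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return float(n)
    acf_sum = 0.0
    for lag in range(1, n // 2):
        acov = sum((x[i] - mean) * (x[i + lag] - mean)
                   for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0:
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)

rng = random.Random(0)
iid = [rng.gauss(0, 1) for _ in range(500)]        # nearly independent trace
ess_iid = effective_sample_size(iid)

# A strongly autocorrelated trace: each value repeated 10 times
correlated = [v for v in iid[:50] for _ in range(10)]
ess_corr = effective_sample_size(correlated)       # much smaller ESS
```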
Limitations of MCMC sampling
• MCMC sometimes fails to converge
• Always run several chains with different random seeds and compare the answers
• If the true tree has some very short internal edges, Bayesian inference can mislead
• Different likelihood models can lead to different results
Bayesian Phylogenetic Tree Reconstruction
• A search for a set of plausible trees (weighted by their probability) instead of a single best tree
• The "space" that you search in is limited by prior information and the data
• The posterior distribution of trees can be translated into a probability of any branching event, which allows uncertainty to be estimated!
• BUT it incorporates prior beliefs
Bayesian phylogenetics: the one 'true' tree?
• Maximum likelihood, maximum parsimony, and distance methods try to find a single tree that best describes the data
• However, these methods do not search everywhere (they are heuristic), and it is difficult to find the "best" tree
• Are we doing a good job by reporting a single tree?
Maximum likelihood vs Bayesian approaches

Maximum Likelihood
• Probability: only defined in the context of long-run relative frequencies
• Parameters: fixed and unknown
• Nuisance parameters: optimize them
• Testing: p-values
• Nature of the method: objective

Bayesian
• Probability: describes everything that is uncertain
• Parameters: random
• Nuisance parameters: average over them
• Testing: Bayes factors
• Nature of the method: subjective
Advantages of Bayesian Inference of Phylogeny
• We have almost no prior knowledge for the parameters of interest... so why bother doing Bayesian inference?
• A Bayesian analysis expresses its results as the probability of the hypothesis given the data
• MCMC is a stochastic algorithm and thus is able to avoid getting stuck in a local suboptimal solution
• By sampling a set of plausible trees, MCMC allows estimation of the uncertainty of any branching event
Bayesian Inference of Phylogeny
• Development of Bayesian methods has led to continual improvement in our ability to model and learn about molecular evolution
• Bayesian inference uses likelihood, but requires a prior distribution
• Bayesian inference is computationally intensive, but can be less so than ML plus bootstrapping
• Bayesian inference directly measures the items of interest on an easily interpretable probability scale
• Some folks dislike the requirement of specifying a prior distribution
Bayesian analysis software
• A number of computer programs have been developed that implement Bayesian phylogenetic analysis:
  • MrBayes (http://mrbayes.sourceforge.net/), which includes BEST (http://www.stat.osu.edu/~dkp/BEST/help/manual_BEST2.3.pdf)
  • BEAST (Bayesian Evolutionary Analysis Sampling Trees) (http://beast.bio.ed.ac.uk/)
  • Coevol (Lartillot and Poujol, 2011)
  • PhyloBayes (Lartillot et al., 2009)
  • RevBayes (http://revbayes.com)
MrBayes
• A Bayesian inference program for phylogenetic inference and model selection
• Developed by John Huelsenbeck and Fredrik Ronquist
• Assisted by Maxim Teslenko, Paul van der Mark, Daniel Ayres, Aaron Darling, Sebastian Höhna, Bret Larget, Liang Liu, Marc Suchard
MRBAYES
• Phylogenetic inference under a wide range of models; two output files:
  • Parameter (.p) files
  • Tree (.t) files
• Unrooted trees: joint estimation of topology, branch lengths, and model parameters
• Rooted, time-calibrated trees: joint estimation of topology, branch rates, branch times, and model parameters, plus gene-tree/species-tree inference in BEST
• Data types:
  • discrete characters: binary (0,1) or multi-state (0,1,...,9)
  • DNA: 4-state nucleotide, doublet, or codons
  • amino acid
Comparison of MrBayes and BEAST
Method/Model/Feature MrBayes BEAST
Unrooted trees ✓ ✗
Joint est. topology & times ✓ ✓
Gene-tree/species-tree ✓ ✓
Dataset partitioning ✓ ✓
Bayes factors ✓ ✓
Morphological data ✓ ✓
Demography/phylogeography ✗ ✓
Continuous traits ✗ ✓
Graphical user interface (GUI) ✗ ✓
Final tree summarisation .con .mcc
Acknowledgements
Assoc Prof Darren Martin
Dr. Aderito Louis Monjane
Brejnev Muhire
Rebone Meraba
UCT CBIO Crew