Post on 20-Dec-2015
Phylogeny
Reconstructing a phylogeny
The phylogenetic tree (phylogeny) describes the evolutionary relationships between the studied data
The data must consist of homologous types
In molecular evolution, the studied data are homologous DNA/AA sequences
Phylogeny reconstruction explicitly assumes that the sequences are aligned
INPUT = MSA
Reminder: MSA and phylogeny are dependent
[Figure: diagram relating unaligned sequences, sequence alignment, an inaccurate guide tree, the MSA and phylogeny reconstruction]
Phylogeny representation
Visual representation
[Figure: tree with leaves A, C, B and D]
Textual representation (Newick format)
((A,C),(B,D));
• Each pair of parentheses () encloses a clade in the tree
• A comma “,” separates the members of the corresponding clade
• A semicolon “;” is always the last character
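As a minimal sketch of how a program can read the Newick rules listed here, the Python function below parses a topology-only Newick string into nested tuples. The name `parse_newick` is hypothetical, and the sketch assumes plain leaf names with no branch lengths, support values or quoted labels.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "(" that opens a clade
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip "," between clade members
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ")" that closes the clade
            return tuple(children)
        # Otherwise this is a leaf: read its name up to the next delimiter.
        start = pos
        while text[pos] not in "(),;":
            pos += 1
        return text[start:pos].strip()

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text after the tree")
    return tree


print(parse_newick("((A,C),(B,D));"))  # (('A', 'C'), ('B', 'D'))
```

Nested tuples keep the clade structure explicit: each tuple is one pair of parentheses, and each string is a leaf.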
Some terminology
[Figure: tree annotated with the following terms]
• Root
• Internal nodes
• External nodes (leaves)
• Internal branches (splits)
• External branches
• Monophyletic group (clade)
• Neighbors
Swapping neighbors is meaningless
[Figure: four drawings of the same tree of Human, Chimp and Gorilla that differ only in the order of neighbors]
(Gorilla,(Human,Chimp)) = (Gorilla,(Chimp,Human)) = ((Human,Chimp),Gorilla) = ((Chimp,Human),Gorilla)
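One way to check that two trees differ only by swapped neighbors is to compare them in a form that ignores the order of children. The sketch below assumes trees are given as nested tuples with unique leaf names; `canonical` and `same_topology` are hypothetical helper names.

```python
def canonical(tree):
    """Return a representation of the tree that ignores the order of children."""
    if isinstance(tree, str):
        return tree
    # A frozenset has no order, so swapped neighbors give the same value.
    return frozenset(canonical(child) for child in tree)


def same_topology(tree_a, tree_b):
    """True if the two rooted trees differ only by swapping neighbors."""
    return canonical(tree_a) == canonical(tree_b)


t1 = ("Gorilla", ("Human", "Chimp"))
t2 = (("Chimp", "Human"), "Gorilla")
t3 = ("Human", ("Gorilla", "Chimp"))

print(same_topology(t1, t2))  # True
print(same_topology(t1, t3))  # False
```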
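Rooted vs. unrooted
[Figure: an unrooted tree of A, B and C with its branches numbered 1, 2 and 3, and the three different rooted trees obtained by placing the root on branch 1, 2 or 3]
In Newick format
Unrooted: (A,B,C)
Rooted: ((C,B),A) ≠ ((A,C),B) ≠ ((A,B),C)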
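The three-leaf case is the smallest instance of a general count. The sketch below uses the standard formulas for strictly bifurcating trees, which are not stated in the slides: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n leaves. The function names are hypothetical.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n_leaves):
    """Number of unrooted bifurcating topologies, valid for n_leaves >= 3."""
    return double_factorial(2 * n_leaves - 5)


def count_rooted(n_leaves):
    """Number of rooted bifurcating topologies, valid for n_leaves >= 2."""
    return double_factorial(2 * n_leaves - 3)


print(count_unrooted(3), count_rooted(3))  # 1 3
print(count_unrooted(4), count_rooted(4))  # 3 15
```

For three leaves this gives one unrooted tree and three rooted trees, one per branch on which the root can be placed.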
How can we root a tree?
Rooting the tree based on a priori knowledge: using an outgroup
[Figure: an unrooted tree of Human, Chimp, Gorilla and Chicken, and the same tree rooted with Chicken as the OUTGROUP and Human, Chimp and Gorilla as the INGROUP]
The outgroup should be close enough for detecting sequence homology, but far enough to be a clear outgroup
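Outgroup rooting can be sketched as placing the root on the branch that leads to the outgroup. The example below assumes the unrooted tree is stored as an adjacency dictionary; the internal node names `n1` and `n2`, the ingroup topology and the function name `root_with_outgroup` are illustrative assumptions.

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf.

    Returns the rooted topology as a Newick string.
    """
    (attach,) = adjacency[outgroup]  # a leaf has exactly one neighbor

    def subtree(node, parent):
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node  # leaf
        return "(" + ",".join(subtree(c, node) for c in children) + ")"

    return "(" + outgroup + "," + subtree(attach, outgroup) + ");"


# Hypothetical unrooted tree: Human and Chimp are neighbors.
adjacency = {
    "Human": ["n1"],
    "Chimp": ["n1"],
    "Gorilla": ["n2"],
    "Chicken": ["n2"],
    "n1": ["Human", "Chimp", "n2"],
    "n2": ["n1", "Gorilla", "Chicken"],
}

print(root_with_outgroup(adjacency, "Chicken"))
# (Chicken,((Human,Chimp),Gorilla));
```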
The gene tree is not always identical to the species tree
[Figure: a gene tree and a species tree of Human, Chimp, Gorilla and Chicken with different topologies (gene tree ≠ species tree)]
Phylogeny reconstruction approaches
Distance based methods: Neighbor Joining
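[Figure: a star tree of A, B, C, D and E, and the tree after A and B are joined into the cluster A,B]

Distance matrix:
      A     B     C     D     E
A     0     2     3     4     4
B           0     3     4     5
C                 0     3     4
D                       0     5
E                             0

Distance matrix after joining A and B:
      A,B   C     D     E
A,B   0     2.5   4.5   3.5
C           0     3     4
D                 0     5
E                       0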
The Minimum Evolution (ME) criterion: in each iteration we join the two sequences whose separation from the rest results in the minimal sum of branch lengths
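The slides do not give a formula for choosing the pair. The sketch below assumes the standard Neighbor Joining Q-criterion, Q(i,j) = (n-2)·d(i,j) - Σd(i,k) - Σd(j,k), and applies it to the five-taxon distance matrix from the slide. The function name `pick_pair_to_join` is hypothetical.

```python
from itertools import combinations


def pick_pair_to_join(names, dist):
    """Return the pair with the smallest Q value (standard NJ criterion)."""
    n = len(names)
    # Sum of distances from each taxon to all the others.
    row_sum = {
        a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names
    }

    def q(a, b):
        return (n - 2) * dist[frozenset((a, b))] - row_sum[a] - row_sum[b]

    return min(combinations(names, 2), key=lambda pair: q(*pair))


names = ["A", "B", "C", "D", "E"]
upper = {
    ("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4, ("A", "E"): 4,
    ("B", "C"): 3, ("B", "D"): 4, ("B", "E"): 5,
    ("C", "D"): 3, ("C", "E"): 4,
    ("D", "E"): 5,
}
dist = {frozenset(pair): d for pair, d in upper.items()}

print(pick_pair_to_join(names, dist))  # ('A', 'B')
```

With this matrix the criterion selects A and B, the pair joined first in the slide.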
Maximum Parsimony: finds the most parsimonious topology
[Figure: four aligned sequences (Seq 1 to Seq 4) and the three candidate topologies, which group the sequences as 1,3 | 2,4; 1,4 | 2,3; and 1,2 | 3,4]
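To make the parsimony criterion concrete, the sketch below scores one alignment column on each of the three four-sequence topologies with Fitch's algorithm, which counts the minimum number of changes a topology requires. The column contents are hypothetical because the slide's sequences are not in the transcript. Each topology is given an arbitrary root for the computation, which does not affect the score.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this node's branches.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical alignment column for Seq 1 to Seq 4.
site = {"1": "A", "2": "A", "3": "G", "4": "G"}

topologies = [
    (("1", "3"), ("2", "4")),
    (("1", "4"), ("2", "3")),
    (("1", "2"), ("3", "4")),
]

scores = {tree: fitch(tree, site)[1] for tree in topologies}
best = min(scores, key=scores.get)
print(best, scores[best])  # (('1', '2'), ('3', '4')) 1
```

For a full alignment the per-column scores would be summed, and the topology with the lowest total is the most parsimonious.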
Maximum Likelihood: finds the most likely topology, the tree T with the highest P(Data|T)
[Figure: the same three candidate topologies (1,3 | 2,4; 1,4 | 2,3; 1,2 | 3,4), each scored by P(Data|T)]
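As a rough sketch of what P(Data|T) means for a single alignment column, the code below computes the likelihood with Felsenstein's pruning algorithm. The slides do not name a substitution model, so the Jukes-Cantor model is an assumption here, as are the branch lengths of 0.1, the alignment column and all function names.

```python
import math

BASES = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of observing base y after time t, starting at x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partial(node, site):
    """Likelihood of the data below `node` for each possible base at `node`."""
    if isinstance(node, str):
        return {b: 1.0 if b == site[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:  # an internal node is a tuple of (child, branch length)
        child_l = partial(child, site)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, t) * child_l[y] for y in BASES)
    return result


def site_likelihood(tree, site):
    """P(column | tree) with equal base frequencies at the root."""
    root = partial(tree, site)
    return sum(0.25 * root[b] for b in BASES)


def quartet(a, b, c, d, t=0.1):
    """Hypothetical helper: tree ((a,b),(c,d)) with all branch lengths set to t."""
    return ((((a, t), (b, t)), t), (((c, t), (d, t)), t))


site = {"1": "A", "2": "A", "3": "G", "4": "G"}  # hypothetical column
for tree in (quartet("1", "2", "3", "4"), quartet("1", "3", "2", "4")):
    print(site_likelihood(tree, site))
```

Under these assumptions the topology that groups sequences 1 and 2 gets the higher likelihood. A real analysis multiplies the likelihoods of all columns and also optimizes the branch lengths.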
Topology search methods: MP, ML
Phylogeny reconstruction approaches: summary
Distance based methods
  Neighbor Joining (e.g., using ClustalX): fast, inaccurate
Topology search methods
  Maximum parsimony (e.g., using MEGA): crude, questionable statistical basis
  Maximum likelihood (e.g., using RAxML, PhyML): accurate, slow
Bayesian methods
  Markov chain Monte Carlo (MCMC) (e.g., using MrBayes): most accurate, very slow
How robust is our tree?
[Figure: tree of Human, Chimp and Gorilla]
We need some statistical way to estimate the confidence in the tree topology
But we don’t know anything about the distribution of tree topologies
The only data source we have is our data (MSA)
So, we must rely on our own resources: “pull up by your own bootstraps”
Bootstrap for estimating robustness
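Bootstrap 1. Create n (100-1000) new MSAs (pseudo-MSAs) by randomly sampling K positions from our original MSA with replacement

Original MSA (positions 1 2 3 4 5 … K)
1 : ATCTG…A
2 : ATCTG…C
3 : ACTTA…C
4 : ACCTA…T

Pseudo-MSA (sampled positions 1 1 2 4 4 … 3)
1 : AATTT…C
2 : AATTT…C
3 : AACTT…T
4 : AACTT…C

Pseudo-MSA (sampled positions 9 7 4 7 8 … 10)
1 : TTTTA…T
2 : CATAC…A
3 : CATAC…T
4 : AGTGG…A

Pseudo-MSA (sampled positions 5 1 5 7 8 … 12)
1 : GAGTA…T
2 : GAGAC…G
3 : AAAAC…A
4 : AAAGG…C

[Figure: the original tree of Sp1, Sp2, Sp3 and Sp4]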
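The resampling step can be sketched as drawing column indices with replacement and applying the same indices to every sequence, so that each sampled column stays intact. The example uses only the first five columns shown in the slide; the function names and the fixed seed are assumptions made for illustration.

```python
import random


def pseudo_msa(msa, rng):
    """Sample K columns with replacement, where K is the alignment length."""
    length = len(next(iter(msa.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same sampled columns are used for every sequence.
    return {name: "".join(seq[i] for i in columns) for name, seq in msa.items()}


def bootstrap_msas(msa, n_replicates, seed=0):
    """Create n_replicates pseudo-MSAs from the original MSA."""
    rng = random.Random(seed)
    return [pseudo_msa(msa, rng) for _ in range(n_replicates)]


# First five columns of the original MSA from the slide.
msa = {"1": "ATCTG", "2": "ATCTG", "3": "ACTTA", "4": "ACCTA"}
replicates = bootstrap_msas(msa, 100)
print(replicates[0])
```

Every pseudo-MSA has the same number of columns as the original, but some columns appear several times and others not at all.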
Bootstrap 2. Reconstruct a pseudo-tree from each pseudo-MSA with the same method used for reconstructing the original tree
[Figure: the three pseudo-MSAs, each shown with the pseudo-tree of Sp1, Sp2, Sp3 and Sp4 reconstructed from it]
Bootstrap 3. For each split in our original tree, we count the number of times it appeared in the pseudo-trees
[Figure: the three pseudo-trees and the original tree of Sp1, Sp2, Sp3 and Sp4, with bootstrap support values of 67% and 100% marked on the original tree]
In 67% of the pseudo-trees, the split between Sp1+Sp2 and the rest of the tree was found
In general, bootstrap support < 80% is considered low
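The counting step can be sketched by reducing every tree to its set of splits (bipartitions of the leaves) and counting how many pseudo-trees contain each split of the original tree. The trees below are hypothetical nested tuples chosen so that two of the three pseudo-trees contain the Sp1+Sp2 split; the function names are also assumptions.

```python
def leaves(tree):
    """Set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial splits of the tree, each stored as an unordered pair of leaf sets."""
    all_leaves = leaves(tree)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:  # skip splits that isolate one leaf
            found.add(frozenset([side, other]))
        for child in node:
            walk(child)

    walk(tree)
    return found


def bootstrap_support(original, pseudo_trees):
    """Percentage of pseudo-trees that contain each split of the original tree."""
    pseudo_splits = [splits(t) for t in pseudo_trees]
    return {
        s: 100.0 * sum(s in ps for ps in pseudo_splits) / len(pseudo_trees)
        for s in splits(original)
    }


original = (("Sp1", "Sp2"), ("Sp3", "Sp4"))
pseudo_trees = [
    (("Sp1", "Sp2"), ("Sp3", "Sp4")),
    (("Sp1", "Sp3"), ("Sp2", "Sp4")),
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
]
support = bootstrap_support(original, pseudo_trees)
key = frozenset([frozenset(["Sp1", "Sp2"]), frozenset(["Sp3", "Sp4"])])
print(round(support[key]))  # 67
```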
ClustalX: NJ phylogeny reconstruction
http://phylobench.vital-it.ch/raxml-bb//
Viewing the tree with njPlot
Note: unrooted tree
Defining an outgroup
Swapping nodes
Bootstrap support
FigTree: tree visualization and figure creation
http://tree.bio.ed.ac.uk/software/figtree/
Reconstructing the tree of life
Darwin’s vision of the tree of life from the Origin of Species
The three-domain tree of life based on SSU rRNA MSA
But the branching of several kingdoms remains in dispute
Lateral Gene Transfer (LGT) challenges the conceptual basis of phylogenetic classification
Methodology
Started with 36 genes universally present in 191 species (spanning all 3 domains of life), for which orthologs could be unambiguously identified
Eliminated 5 genes that are LGT suspects (mostly tRNA synthetases)
Constructed an MSA for each of the 31 orthogroups
Concatenated all 31 MSAs to a super-MSA of 8090 columns
The phylogeny was reconstructed based on the super-MSA using the maximum likelihood approach
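The concatenation step can be sketched as joining, for every species, its aligned sequence from each gene MSA in a fixed gene order. The toy alignments and the function name `concatenate_msas` are hypothetical. Filling a missing species with gaps is an added assumption that the study did not need, because its genes were universally present.

```python
def concatenate_msas(msas, species):
    """Join per-gene MSAs into one super-MSA, one row per species."""
    parts = {name: [] for name in species}
    for msa in msas:
        width = len(next(iter(msa.values())))
        for name in species:
            # Assumption: a species missing from a gene is padded with gaps.
            parts[name].append(msa.get(name, "-" * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Two toy gene alignments for three hypothetical species.
gene1 = {"Sp1": "ATG", "Sp2": "ATC", "Sp3": "TTG"}
gene2 = {"Sp1": "CC", "Sp2": "CA", "Sp3": "GA"}

super_msa = concatenate_msas([gene1, gene2], ["Sp1", "Sp2", "Sp3"])
print(super_msa)  # {'Sp1': 'ATGCC', 'Sp2': 'ATCCA', 'Sp3': 'TTGGA'}
```

The width of the super-MSA is the sum of the widths of the individual MSAs, which is how 31 alignments add up to 8090 columns.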
[Figure: the reconstructed tree of life with the three domains Archaea, Eukaryota and Bacteria labeled]
http://itol.embl.de
Tree support
81.7% of the splits show bootstrap support of over 80%
65% of the splits show bootstrap support of 100%
However, several deep splits show low support
Still, the debate goes on
“Tree of one percent of life”
On the one hand, Ciccarelli et al. favor the claim that bacteria adhere to a bifurcating tree of life, provided that the small number of LGT genes is filtered out
On the other hand, their filtering process left only 31 proteins, which represent ~1% of an average prokaryotic proteome and ~0.1% of a large eukaryotic proteome
“If throwing out all non-universally distributed genes and all LGT suspects leaves a 1% tree, then we should probably abandon the tree as a working hypothesis”