Date post: | 16-Jan-2016 |
Upload: | michael-mclaughlin |
Phylogenetic Trees: Lecture 11
Sections 6.1, 6.2 in Setubal et al.; 7.1, 7.1 in Durbin et al.
© Shlomo Moran, based on Nir Friedman, Danny Geiger, Ilan Gronau

Evolution
Evolution of new organisms is driven by:
• Diversity: different individuals carry different variants of the same basic blueprint.
• Mutations: the DNA sequence can be changed by single-base changes, deletion/insertion of DNA segments, etc.
• Selection bias

Theory of Evolution
Basic idea: speciation events lead to the creation of different species (speciation: physical separation into groups where different genetic variants become dominant).
Any two species share a (possibly distant) common ancestor.
This is described by a rooted tree: the tree of life.

Tree of Life
Any two species share a (possibly distant) common ancestor.
The process of evolution consists of:
• speciation events
• mutations along evolutionary branches
Source: Alberts et al.

Often only a subtree is studied.
Definition: A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of present-day species; it is also called a phylogenetic tree.

Components of Phylogenetic Trees
• Leaves: current-day species (or taxa, plural of taxon)
• Internal vertices: hypothetical common ancestors
• Edge lengths: “time” from one speciation to the next
• The tree topology: the tree structure, ignoring edge lengths. Usually the goal is to find the topology.
[Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

Historical Note
Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria).
Since then, the focus has been on objective criteria for constructing phylogenetic trees.
Thousands of articles have been published in the last decades.
Important for many aspects of biology:
• Classification
• Understanding biological mechanisms

Outline
A. Introduction (this lecture)
1. The phylogenetic reconstruction problem: from sequences to trees
2. Morphological vs. molecular sequences
3. Possible pitfalls
4. Directed and undirected trees
5. The “big” problem, the “small” problem

Outline (cont.)
B. Character based methods (this + next lectures)
1. Perfect Phylogeny
2. Maximum Parsimony
3. Maximum Likelihood (not studied in this course)
• These methods consider the evolution of each character separately.
• They try to find the tree which gives the “best” evolutionary explanation:
- least number of observed mutations (1 & 2), or most probable tree (3).
• These optimization problems are typically NP-hard.
• We’ll discuss ways of solving simplified versions of the problems.

Outline (cont.)
C. Distance based methods (last 1-2 lectures)
- Run in polynomial time
- Compute distances between all taxon pairs
- Find an edge-weighted tree that best describes the distances
Distance matrix D (lower triangle):
 0
 3   0
 9   8   0
15  14  18   0
17  16  20  22   0
16  15  19  21   9   0
[Figure: an edge-weighted tree for D, with edge weights 4, 5, 7, 2, 1, 2, 10, 6, 1]
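
A distance matrix can be described exactly by an edge-weighted tree only if it passes the four-point condition: for every four taxa, the two largest of the three pairwise sums are equal. The slide does not name this test; it is a standard criterion used here as an illustration. The sketch below applies it to the matrix D from this slide, as read back from the extracted figure (the function name `is_additive` is hypothetical).

```python
from itertools import combinations

# Lower triangle of D, as read from the slide's figure (an assumption
# about how the extracted digits split into entries).
D = [
    [0],
    [3, 0],
    [9, 8, 0],
    [15, 14, 18, 0],
    [17, 16, 20, 22, 0],
    [16, 15, 19, 21, 9, 0],
]

def is_additive(lower):
    """Check the four-point condition on a lower-triangular distance matrix."""
    n = len(lower)

    def d(i, j):
        # Symmetric lookup into the lower triangle.
        return lower[i][j] if i >= j else lower[j][i]

    for a, b, c, e in combinations(range(n), 4):
        sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
        # The two largest sums must coincide for a tree metric.
        if sums[1] != sums[2]:
            return False
    return True

print(is_additive(D))
```

With noisy distances the equality fails for some quadruples, which is the situation the later "noisy distances" lectures address.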

Outline (end)
Distance Methods (cont.)
1. Efficient reconstruction (O(n^2) time) from accurate distances
2. Reconstruction from noisy distances: can we reconstruct accurate trees from approximate distances?
• Worst-case noise model
• More realistic noise models: inter-species distances derived from probabilistic models of mutations

Evolution as a Tree
[Figure: a tree in which every vertex is labeled by a 7-letter DNA sequence (e.g. AATCCTG, ATAGCTG, AATGGGC)]

Phylogenetic Reconstruction
[Figure: the ten sequences AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT, from which the tree is to be reconstructed]

Phylogenetic Reconstruction
B : AATCCTG
C : ATAGCTG
A : AATGGGC
D : GAACGTA
E : AAACCGA
J : ACCGTTG
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
F : GGGGATT
Goal: reconstruct the ‘true’ tree as accurately as possible
[Figure: a reconstructed tree with leaves A to J]

What are the sequences? Morphological vs. Molecular
• Classical methods use morphological features: number of legs, lengths of legs, etc.
• Modern methods use molecular features: gene (DNA) sequences, protein sequences.
Analysis is based on homologous sequences (e.g., globins) in different species.

Possible pitfall in reconstruction: Misleading selection of sequences
Gene/protein sequences can be homologous for several different reasons:
• Orthologs: sequences diverged after a speciation event
• Paralogs: sequences diverged after a duplication event (next slides)
• Xenologs: sequences diverged after a horizontal transfer (e.g., by a virus)

Misleading selection of sequences: Using paralogs instead of orthologs
Consider an evolutionary tree of three taxa (1, 2, 3),
and assume that at some point in the past a gene duplication event occurred.
[Figure: tree on taxa 1, 2, 3 with the gene duplication marked]

Paralogs instead of Orthologs
[Figure: gene tree with leaves 1A 2A 3A 3B 2B 1B; labels: Gene Duplication, Speciation events, Copy A, Copy B]
The gene evolution is described by this tree (1, 2, 3 are species; A, B are the copies of the same gene).

Paralogs instead of Orthologs
[Figure: the same gene tree with leaves 1A 2A 3A 3B 2B 1B; labels: Gene Duplication, Speciation events]
If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree.
In the sequel we assume all given sequences are orthologs, created from a common ancestor by speciation events.

Rooted vs. Undirected Trees
A natural representation of a phylogeny is a rooted tree.
[Figure: a rooted tree whose root is labeled “Common Ancestor”]
Types of treesUnrooted tree represents the same phylogeny without
the root node
Most known tree-reconstruction techniques do not distinguish between different placements of the root.
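
Rooted versus unrooted trees
[Figure: an unrooted tree with positions a, b, c marked, alongside rooted trees labeled Tree a, Tree b, Tree c]
The unrooted tree represents the three rooted trees.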

Positioning Roots in Unrooted Trees
We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest.
[Figure: tree on Aardvark, Bison, Chimp, Dog, Elephant with Falcon added as the outgroup and the proposed root marked]

Two phylogenetic trees of the same species: do these trees represent the same evolutionary history?
[Figure: two drawings of a tree on Aardvark, Bison, Chimp, Dog, Elephant]

When are two unrooted phylogenetic trees considered different?
Trees T1 and T2 on the same set of species are considered identical if they represent the same evolutionary history, i.e., they have the same topology.
Formally, this is equivalent to:
There is a tree isomorphism h: T1 → T2 such that for each species x, h(x) = x.
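
One way to test this condition in code, without searching for h directly, is to compare the leaf bipartitions ("splits") that the edges of each tree induce. The sketch below assumes every internal vertex has degree at least 3, in which case equal split sets correspond to a leaf-preserving isomorphism. The slide does not prescribe this method; the helper names and the internal vertex names (x, y, z, p, q, r) are hypothetical.

```python
def from_edges(edges):
    """Build an adjacency dict from a list of undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def splits(adj, leaves):
    """Nontrivial leaf bipartitions induced by the edges of an unrooted tree."""
    all_leaves = frozenset(leaves)

    def leaves_beyond(v, parent):
        # Leaves reachable from v without passing back through parent.
        if v in all_leaves:
            return frozenset([v])
        found = frozenset()
        for w in adj[v]:
            if w != parent:
                found |= leaves_beyond(w, v)
        return found

    result = set()
    for u in adj:
        for v in adj[u]:
            side = leaves_beyond(v, u)
            other = all_leaves - side
            if len(side) > 1 and len(other) > 1:
                result.add(frozenset([side, other]))
    return result

def same_topology(adj1, adj2, leaves):
    """True if both trees induce the same set of splits on the leaves."""
    return splits(adj1, leaves) == splits(adj2, leaves)

LEAVES = {"Aardvark", "Bison", "Chimp", "Dog", "Elephant"}
t1 = from_edges([("Aardvark", "x"), ("Bison", "x"), ("x", "y"), ("Chimp", "y"),
                 ("y", "z"), ("Dog", "z"), ("Elephant", "z")])
# Same tree, drawn "from the other side" with different internal names.
t2 = from_edges([("Elephant", "p"), ("Dog", "p"), ("p", "q"), ("q", "Chimp"),
                 ("q", "r"), ("r", "Bison"), ("r", "Aardvark")])
# Bison and Chimp exchanged: a different topology.
t3 = from_edges([("Aardvark", "x"), ("Chimp", "x"), ("x", "y"), ("Bison", "y"),
                 ("y", "z"), ("Dog", "z"), ("Elephant", "z")])

print(same_topology(t1, t2, LEAVES), same_topology(t1, t3, LEAVES))
```

The names of internal vertices never enter the comparison, which mirrors the requirement that h fixes only the species.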

The two trees represent the same evolution
[Figure: the two trees on Aardvark, Bison, Chimp, Dog, Elephant; the internal vertices u, v, w of one tree correspond to h(u), h(v), h(w) in the other]

The “Big” reconstruction problem, the “Small” problem
The “big” problem: compute the whole phylogenetic tree from the n input sequences.
The “small” problem: assume the tree topology and the identities of the leaf species are known. Reconstruct the sequences at the internal vertices, and give a score to the resulting phylogeny.
Connection between the problems: to solve the big problem, solve the small problem on all possible trees with n leaves, and output the tree(s) with the highest “score”.
This is impossible in practice for more than a few taxa.
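
To see why exhaustive search breaks down, one can count the candidate trees. The number of unrooted binary trees on n labeled leaves is (2n-5)!! = 1·3·5·…·(2n-5) for n ≥ 3; this is a standard counting fact, not stated on the slide. A minimal sketch (the function name is hypothetical):

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary tree topologies on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Already at 10 taxa there are about two million topologies, and the count grows faster than exponentially in n.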

Input for the “big” problem
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
Our task: find an evolutionary tree with leaves corresponding to the 5 sequences, which best explains the evolution of the strings.

Input for the “small” problem
[Figure: a tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
The tree and the assignment of strings to the leaves are given, and we need only to assign strings to internal vertices.

Character-based methods for constructing phylogenies
In this approach, trees are constructed by comparing the characters of the corresponding sequences. Characters may be morphological (teeth structures) or molecular (nucleotides in homologous DNA sequences). We will present two methods: “Perfect Phylogeny” and “Maximum Parsimony”.
Basic assumption in these methods:
The best tree is one with a minimal number of observed mutations (character changes along the edges, a.k.a. substitutions).

Character based methods: Input data
species  C1 C2 C3 C4 … Cm
dog      A A C A G G T C T T C G A G G C C C
horse    A A C A G G C C T A T G A G A C C C
frog     A A C A G G T C T T T G A G T C C C
human    A A C A G G T C T T T G A T G A C C
pig      A A C A G T T C T T C G A T G G C C
         * * * * *     * *     * *       * *
(* marks a column in which all five species agree.)
• Each character (column) is processed independently.
• The green character (column 14) separates the human and pig from the frog, horse and dog.
• The red character (column 11) separates the dog and pig from the frog, horse and human.
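
The way a single character splits the species can be made concrete by grouping the rows of the table by their state in one column. The sketch below uses the five sequences from the table above; the function names are hypothetical and column indices are 0-based, so the slide's column 14 is index 13.

```python
ROWS = {
    "dog":   "AACAGGTCTTCGAGGCCC",
    "horse": "AACAGGCCTATGAGACCC",
    "frog":  "AACAGGTCTTTGAGTCCC",
    "human": "AACAGGTCTTTGATGACC",
    "pig":   "AACAGTTCTTCGATGGCC",
}

def column_partition(rows, j):
    """Group species by their state in column j (0-based)."""
    groups = {}
    for species, seq in rows.items():
        groups.setdefault(seq[j], set()).add(species)
    return groups

def conserved_columns(rows):
    """Columns in which every species carries the same state."""
    length = len(next(iter(rows.values())))
    return [j for j in range(length) if len(column_partition(rows, j)) == 1]

print(column_partition(ROWS, 13))   # human, pig vs. dog, horse, frog
print(column_partition(ROWS, 10))   # dog, pig vs. horse, frog, human
print(len(conserved_columns(ROWS)))
```

Conserved columns carry no information about the topology; only columns with at least two states constrain the tree.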

The perfect phylogeny problem
A character is assumed to be a significant property which distinguishes between species (e.g., dental structure, number of legs/limbs).
A character state is a value of the character (e.g., human dental structure).
Assumption: it is unlikely that a given state will be created twice in the evolutionary tree. Characters satisfying this are called “homoplasy free”, and are detailed next.

Homoplasy-free characters 1
Homoplasy-free characters should avoid:
reversal transitions
A species regains a state its direct ancestor has lost.
Well-known exceptions: teeth in birds, legs in snakes.

Homoplasy-free characters 2
…and also avoid convergence transitions
Two species possess the same state while their least common ancestor possesses a different state.
Well-known exceptions: the marsupials.

The Perfect Phylogeny Problem
Input:
1. A set of species
2. A set of characters
3. For each character, an assignment of states to the species
Problem: Is there a phylogenetic tree T=(V,E) such that the evolution of all characters is “homoplasy free” (no reversal, no convergence)?
First, we define the problem using graph-theoretic terms.

Characters = Colorings
A coloring of a tree T=(V,E) is a mapping C: V → [set of colors].
A partial coloring of T is a coloring of a subset of the vertices U ⊆ V:
C: U → [set of colors]
[Figure: a tree with the colored vertex subset U highlighted]

Characters as Colorings
Each character defines a (partial) coloring of the corresponding phylogenetic tree:
Species ≡ Vertices
States ≡ Colors

Convex Colorings (and Characters)
Let T=(V,E) be a partially colored tree, and let d be a color. The d-carrier is the minimal subtree of T containing all vertices colored d.
Definition: A (partial/total) coloring of a tree is convex iff the d-carriers of distinct colors are pairwise disjoint.

Convexity ⇔ Homoplasy Freedom
A character is homoplasy free (avoids reversal and convergence transitions)
↕
The corresponding (partial) coloring is convex
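
Convexity can be checked directly from the definition: compute the carrier of each color and test that no vertex lies in two carriers. The sketch below is one possible implementation, not the lecture's; it obtains a carrier by repeatedly pruning uncolored leaves. The tree, the vertex names x, y, z, and the function names are illustrative assumptions.

```python
def from_edges(edges):
    """Build an adjacency dict (vertex -> set of neighbours)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def carrier(adj, colored):
    """Vertices of the minimal subtree containing every vertex in `colored`."""
    remaining = {v: set(ns) for v, ns in adj.items()}
    # Repeatedly strip vertices of degree <= 1 that do not carry the color.
    stack = [v for v in remaining if len(remaining[v]) <= 1 and v not in colored]
    while stack:
        v = stack.pop()
        if v not in remaining:
            continue
        for w in remaining.pop(v):
            remaining[w].discard(v)
            if len(remaining[w]) <= 1 and w not in colored:
                stack.append(w)
    return set(remaining)

def is_convex(adj, coloring):
    """True if the carriers of distinct colors are pairwise disjoint."""
    seen = set()
    for d in set(coloring.values()):
        c = carrier(adj, {v for v, col in coloring.items() if col == d})
        if seen & c:
            return False
        seen |= c
    return True

tree = from_edges([("Aardvark", "x"), ("Bison", "x"), ("x", "y"), ("Chimp", "y"),
                   ("y", "z"), ("Dog", "z"), ("Elephant", "z")])
good = {"Aardvark": "red", "Bison": "red", "Chimp": "blue", "Dog": "blue", "Elephant": "blue"}
bad = {"Aardvark": "red", "Bison": "blue", "Chimp": "red", "Dog": "blue"}
print(is_convex(tree, good), is_convex(tree, bad))
```

In the second coloring the red and blue carriers both pass through the internal vertices x and y, so the character cannot evolve on this tree without a reversal or convergence.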

The Perfect Phylogeny Problem (pure graph theoretic setting)
Input: Partial colorings (C1,…,Ck) of a set of vertices U (in the example: 3 total colorings: left, center, right, each by two colors).
Problem: Is there a tree T=(V,E) such that U ⊆ V and, for i=1,…,k, Ci is a convex (partial) coloring of T?
[Figure: the vertices of U, each labeled by its colors under the three colorings]
PP is NP-hard in general. In the tutorial you will see a special case solvable in polynomial time.

Maximum Parsimony
Perfect Phylogeny is not only hard to compute, but in many cases it doesn’t exist.
Next we discuss a more common approach, called “Maximum Parsimony”, which looks for a tree which minimizes the number of mutations.

Maximum Parsimony
A character-based method
Input:
h sequences (one per species), all of length k.
Goal:
Find a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.

Example
Input: four nucleotide sequences: AAG, AAA, GGA, AGA, taken from four species.
By the parsimony principle, we seek a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal vertices, with the minimum total number of mutations (i.e., letter changes) along the tree edges. Here is one possible tree + sequence assignment.
[Figure: tree with leaves AGA, AAA, GGA, AAG; all three internal vertices are labeled AAA; the edges with substitutions carry the counts 2, 1, 1]
Total #substitutions = 4
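
The total for a fully labeled tree is just the sum, over all edges, of the number of positions at which the two endpoint sequences differ. The sketch below recomputes the total for the example in which every internal vertex is labeled AAA. The vertex names and the pairing of leaves under each internal vertex are assumptions about the lost figure; with all internal labels equal to AAA the total does not depend on that pairing.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t))

def total_substitutions(edges, label):
    """Sum of per-edge substitution counts for a fully labeled tree."""
    return sum(hamming(label[u], label[v]) for u, v in edges)

# Hypothetical vertex names: l1..l4 are leaves, u and v are internal, r is the root.
label = {"l1": "AGA", "l2": "AAA", "l3": "GGA", "l4": "AAG",
         "u": "AAA", "v": "AAA", "r": "AAA"}
edges = [("l1", "u"), ("l2", "u"), ("l3", "v"), ("l4", "v"), ("u", "r"), ("v", "r")]

print(total_substitutions(edges, label))
```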

Example Continued
Here are two other trees + sequence assignments:
[Figure, left: tree with leaves AGA, GGA, AAA, AAG; internal vertices AAA, AGA, AAA; edge substitution counts 1, 1, 1]
Total #substitutions = 3
[Figure, right: tree with leaves GGA, AAA, AGA, AAG; internal vertices AAA, AAA, AAA; edge substitution counts 1, 1, 2]
Total #substitutions = 4
The left solution is preferred over the right one.
A solution has two parts: first, select a tree and label its leaves by the input sequences; then, assign sequences to the internal vertices.
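
Parsimony score
The parsimony score of a leaf-labeled tree T is the minimum possible number of mutations over all assignments of sequences to internal vertices of T.
[Figure, left: tree with leaves AGA, GGA, AAA, AAG; internal vertices AAA, AGA, AAA; edge substitution counts 1, 1, 1]
Parsimony score = 3
[Figure, right: tree with leaves GGA, AAA, AGA, AAG; internal vertices AAA, AAA, AAA; edge substitution counts 1, 1, 2]
Parsimony score = 4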

Parsimony Based Reconstruction
We have here both the small and big problems:
1. The small problem: find the parsimony score for a given leaf-labeled tree.
2. The big problem: find a tree whose leaves are labeled by the input sequences, with the minimum possible parsimony score.
We will see efficient algorithms for (1). (2) is hard.
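
Both problems can be tried on the four-sequence example. For the small problem, the sketch below uses Fitch's algorithm, a standard method for computing the parsimony score of a binary leaf-labeled tree; the slides do not say which algorithm the course presents, so treat this choice as an assumption. For the big problem it simply tries all three topologies on four leaves, which only works because the input is tiny. Trees are written as nested pairs, and the leaf pairings are an assumed reading of the figures.

```python
def fitch_site(node, i):
    """Return (candidate states, substitutions) for site i in the subtree at node."""
    if isinstance(node, str):          # a leaf holds its sequence
        return {node[i]}, 0
    left, right = node
    left_states, left_cost = fitch_site(left, i)
    right_states, right_cost = fitch_site(right, i)
    common = left_states & right_states
    if common:                         # children can agree: no new substitution
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree, k):
    """Small problem: sum the per-site Fitch costs over all k sites."""
    return sum(fitch_site(tree, i)[1] for i in range(k))

def four_leaf_trees(a, b, c, d):
    """The three ways to pair up four leaves."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

# Big problem by brute force on the example input.
scores = {t: parsimony_score(t, 3) for t in four_leaf_trees("AAG", "AAA", "GGA", "AGA")}
best = min(scores, key=scores.get)
print(best, scores[best])
```

The tree pairing AAG with AAA and GGA with AGA scores 3 and the other two score 4, in agreement with the totals on the example slides.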