Phylogeny – data mining by biologists
• Molecular phylogenetics uses clustering techniques to discern relationships between different biological sequences
Understanding our relationships
Trees are like mobiles
The language of trees
Changes can occur
The why and what of natural selection
• Variation exists at the DNA level: alleles
• This variation is inexhaustible (something important to remember when looking at new genome sequences)
• These differences are subjected to selection:
– Changes in protein structure are typically unfavorable and, as a result, selected against
– However, some changes in structure/function are selected for: sickle cell anemia/malaria
Neutral Theory of Evolution - Kimura
• The third position of a codon, or a nucleotide in a non-coding, non-regulatory region, is expected to be invisible to natural selection
• Compare Fugu with humans: the most conserved sequences are the genes
– http://www.sciencemag.org/cgi/content/full/297/5585/1301
• Synonymous substitutions and substitutions in pseudogenes (nonfunctional gene copies) are thought to reflect the actual mutation rate operating within a genome (no selection)
• Is this accurate?
Genetic drift
• Random genetic drift is a stochastic process (by definition).
• One aspect of genetic drift is the random nature of transmitting alleles from one generation to the next given that only a fraction of all possible zygotes become mature adults.
• Example: begin with equal frequencies of C and T at a given position. If the next generation happens to show 60/40 in favor of C, then C has a greater chance of making it into the following generation
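The C/T example above can be sketched as a small simulation. This is a minimal illustration that assumes a fixed population size, non-overlapping generations, and no selection (a Wright-Fisher style model, which the slides do not name); `simulate_drift` and its parameters are hypothetical names chosen for this sketch.

```python
import random


def simulate_drift(pop_size, start_freq, generations, seed=None):
    """Track the frequency of allele C over time under pure drift.

    Assumption: every gene copy in the next generation is drawn
    independently from the current allele pool (no selection).
    """
    rng = random.Random(seed)
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        # Count how many of the pop_size copies happen to inherit C.
        c_copies = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = c_copies / pop_size
        history.append(freq)
    return history


# Start at 50/50 C versus T, as in the example, and follow 50 generations.
trajectory = simulate_drift(pop_size=20, start_freq=0.5, generations=50, seed=1)
print(trajectory[:5])
```

Once the frequency reaches 0 or 1 it stays there, which is how drift alone can fix or lose an allele in a small population.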
Neutralist vs. Selectionist
Where do substitutions occur?
• Non-coding regions exhibit a substitution rate 2X greater than coding regions
• Coding regions are more “functionally constrained”
• The higher the degeneracy of a codon position, the higher the observed substitution rate
• A thought: Coding sequences – sequence constraint; Non-coding sequence – structure constraint???
Natural variants
• Site-directed mutagenesis studies of a single gene will give way to comparative genomic studies derived from the abundance of sequence data
• As a result, it is important to understand molecular evolution and models describing this process
The relationship between time and substitutions is non-linear
Observing differences in nucleotides
• The simplest measure of distance between two sequences is the proportion of sites at which the two sequences differ, called the p-distance
• If sites are not all equally likely to change, the same site may undergo repeated substitutions
• As time goes by, the number of differences between two sequences becomes a less and less accurate estimator of the actual number of substitutions that have occurred
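A minimal sketch of the p-distance, assuming two pre-aligned sequences of equal length with no gaps. The function name and the toy sequences are hypothetical, chosen only for illustration.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


# Toy sequences that differ at 2 of 8 sites.
print(p_distance("ACGTACGT", "ACGTTCGA"))  # 0.25
```

Because a site that changed twice still counts as at most one difference, this value can only underestimate the true number of substitutions per site.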
So what is phylogenetics good for?
Phylogenetics has direct applications to:
• Conservation: test wood, ivory, meat products for poaching
• Agriculture: analyze specific differences between cultivars
• Forensics: DNA fingerprinting
• Medicine: determine specific biochemical function of cancer-causing genes
Phylogenetic concepts: Interpreting a Phylogeny
[Figure: rooted tree with tips Sequence A, Sequence B, Sequence C, Sequence D, and Sequence E, drawn along a time axis.]
Which sequence is most closely related to B?
A, because B diverged from A more recently than from any other sequence.
Physical position in tree is not meaningful! Only tree structure matters.
Rooted vs. unrooted
• Root – ancestor of all taxa considered
• Unrooted – relationship without consideration of ancestry
• Often specify root with outgroup
– Outgroup – distantly related species (e.g., mammals and an archaeal species)
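Phylogenetic concepts: Rooted and Unrooted Trees
[Figure: a rooted tree of A, B, C, and D drawn along a time axis, the corresponding unrooted tree, and an unrooted tree on which the candidate root positions are marked with "?".]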
How Many Trees?
                                       Unrooted trees                  Rooted trees
# sequences   # pairwise distances   # trees        # branches/tree   # trees        # branches/tree
3             3                      1              3                 3              4
4             6                      3              5                 15             6
5             10                     15             7                 105            8
6             15                     105            9                 945            10
10            45                     2,027,025      17                34,459,425     18
30            435                    8.69 × 10^36   57                4.95 × 10^38   58
For N sequences:
# pairwise distances: $\frac{N(N-1)}{2}$
Unrooted: $\frac{(2N-5)!}{2^{N-3}\,(N-3)!}$ trees, $2N-3$ branches per tree
Rooted: $\frac{(2N-3)!}{2^{N-2}\,(N-2)!}$ trees, $2N-2$ branches per tree
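The counts in this table can be checked with a short script. This is a sketch using the standard closed-form counts for strictly bifurcating trees; the function names are hypothetical.

```python
from math import factorial


def pairwise_distances(n):
    """Number of distinct sequence pairs among n sequences."""
    return n * (n - 1) // 2


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 4, 5, 6, 10):
    print(n, pairwise_distances(n), unrooted_trees(n), rooted_trees(n))
```

The rapid growth of these counts is why exhaustive search over all trees is only feasible for small numbers of sequences.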
Tree Types
[Figure: rooted tree of sharks, seahorses, frogs, owls, crocodiles, armadillos, and bats; scale bar = 50 million years.]
Evolutionary trees measure time.
[Figure: rooted tree of the same taxa drawn as a phylogram; scale bar = 5% change.]
Phylograms measure change.
Tree Properties
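Ultrametricity: All tips are an equal distance from the root.
[Figure: rooted tree with tips X and Y and branches labeled a, b, c, d, e.]
a = b + c + d + e
Additivity: Distance between any two tips equals the total branch length between them.
[Figure: rooted tree with tips X and Y and branches labeled a, b, c, d, e.]
XY = a + b + c + d + e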
In simple scenarios, evolutionary trees are ultrametric and phylograms are additive.
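One way to test whether a set of pairwise distances could come from an ultrametric tree is the three-point condition: in every triple of tips, the two largest distances are equal. This check is a standard technique brought in here for illustration, not something the slides state; the nested-dictionary layout and the function name are assumptions.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Three-point condition on a symmetric nested dict of distances."""
    for x, y, z in combinations(list(dist), 3):
        # Sort the three pairwise distances; the two largest must match.
        _, middle, largest = sorted([dist[x][y], dist[x][z], dist[y][z]])
        if abs(largest - middle) > tol:
            return False
    return True


# A and B are 2 apart and both are 6 from C: consistent with equal tip depths.
clock_like = {
    "A": {"B": 2, "C": 6},
    "B": {"A": 2, "C": 6},
    "C": {"A": 6, "B": 6},
}
print(is_ultrametric(clock_like))
```

Distances taken from a phylogram with unequal rates would typically fail this check while still being additive.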
Tree building
• Get protein/RNA/DNA sequences
• Construct multiple sequence alignment
• Compute pairwise distances (if necessary)
• Build tree – topology and distances
• Estimate reliability
• Visualize
Tree summary
Various models have been generated to more accurately estimate distance and evolution
• All use the following framework:
Probability matrix
p_AC is the probability that a site starting as an A is a C at the end of time interval t, and so on for the other base pairs.
Base composition of sequence; fa = frequency of A
Phylogenetic Methods
Many different procedures exist. Three of the most popular:
Neighbor-joining
• Minimizes distance between nearest neighbors
Maximum parsimony
• Minimizes total evolutionary change
Maximum likelihood
• Maximizes likelihood of observed data
Comparison of Methods
Neighbor-joining
• Uses only pairwise distances
• Minimizes distance between nearest neighbors
• Very fast
• Easily trapped in local optima
• Good for generating tentative tree, or choosing among multiple trees
Maximum parsimony
• Uses only shared derived characters
• Minimizes total distance
• Slow
• Assumptions fail when evolution is rapid
• Best option when tractable (<30 taxa, homoplasy rare)
Maximum likelihood
• Uses all data
• Maximizes tree likelihood given specific parameter values
• Very slow
• Highly dependent on assumed evolution model
• Good for very small data sets and for testing trees built using other methods
Which procedure should we use?
Neighbor-joining, maximum parsimony, or maximum likelihood?
All that we can!
• Each method has its own strengths
• Use multiple methods for cross-validation
• In some cases, none of the three gives the correct phylogeny!
Jukes-Cantor Model
• Distance between any two sequences is given by: $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$
• p is the proportion of nucleotides that are different in the two sequences
• All substitutions are equally probable
– Each off-diagonal position in the matrix = α; each diagonal position = 1 − 3α
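A minimal sketch of the Jukes-Cantor correction applied to an observed p-distance. The function name is hypothetical, and the guard reflects that the logarithm is only defined for p below 0.75.

```python
from math import log


def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -3/4 ln(1 - 4/3 p) for observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * log(1 - (4 / 3) * p)


for p in (0.05, 0.25, 0.50):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is always at least as large as p, and the gap widens as p grows, which reflects the hidden repeated substitutions at the same site.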
Kimura’s two parameter model
• $d = \frac{1}{2}\ln\frac{1}{1-2P-Q} + \frac{1}{4}\ln\frac{1}{1-2Q}$
• P and Q are the proportions of sites that differ between the two sequences due to transitions and transversions, respectively.
• Accounts for transition bias in sequences (transversions are rarer)
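A sketch of the two-parameter distance, assuming aligned, gap-free, uppercase DNA sequences containing only A, C, G, and T. The helper names are hypothetical.

```python
from math import log

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_proportions(seq1, seq2):
    """Return (P, Q): proportions of sites differing by transition / transversion."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions / len(seq1), transversions / len(seq1)


def kimura_2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    p, q = transition_transversion_proportions(seq1, seq2)
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences are too divergent for this formula")
    return 0.5 * log(1 / (1 - 2 * p - q)) + 0.25 * log(1 / (1 - 2 * q))


print(kimura_2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```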
Distances in Amino acid sequences
• Account for synonymous and non-synonymous changes in respective codons
• Pathways to double mutations
Dealing with multiple substitutions
• Unweighted method – pathways are equally likely
• Weighted – favor synonymous changes
• Degeneracy classifications
– Nondegenerate (0) – First two positions of TTT (Phe)
– Two-fold degenerate (2) – Third position of TTT (Phe)
– Four-fold degenerate (4) – Third position of GTT (Val)
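The degeneracy classes above can be computed directly from the standard genetic code. This is a sketch that assumes the standard nuclear code and DNA codons written with T; `site_degeneracy` is a hypothetical name, and positions are numbered 0, 1, 2.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def site_degeneracy(codon, position):
    """Return 0, 2, 3, or 4 for the given codon position (0-based).

    The class is the number of bases at that position that keep the same
    amino acid, with 0 meaning every change alters the amino acid. A value
    of 3 arises only at the third position of the isoleucine codons.
    """
    amino_acid = CODON_TABLE[codon]
    same = sum(
        1
        for base in BASES
        if CODON_TABLE[codon[:position] + base + codon[position + 1:]] == amino_acid
    )
    return 0 if same == 1 else same


print([site_degeneracy("TTT", i) for i in range(3)])
print(site_degeneracy("GTT", 2))
```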
Evolutionary models
Implementing models and building trees
Comparing models
Trees are hypotheses about evolutionary history
So far, we’ve looked at understanding and formulating these hypotheses. Now, let’s turn our attention to testing them.
Testing the reliability of trees
• Interior branch test or bootstrap analysis
• Bootstrap analysis – resample the alignment (sites deleted or drawn with replacement), re-draw the tree each time, and count how often a given branching recurs. Bootstrap values of 70 (or, more stringently, 95) or greater are normally considered reliable
Tree Testing: Split Decomposition
Split decomposition is one method for testing a tree.
[Figure: the possible unrooted tree topologies for taxa A, B, C, and D.]
Under this procedure, we choose exactly four taxa (A, B, C, D) and examine the topologies of all possible unrooted trees. How many such trees are there?
Only one of these topologies is right. How can we quantitatively assess the support for each tree?
Tree Testing: Split Decomposition
The correct tree should be approximately additive; the others usually will not. For each tree, we calculate split indices that estimate the length of the internal branch:
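[Equation figure: each split index is half of a signed combination of pairwise distance sums among A, B, C, and D; it equals the internal branch length if the tree being scored is the right phylogeny.]
Large split indices → Long internal branch → Topology strongly supported
Small split indices → Short internal branch → Topology weakly supported
Negative split indices → Biologically impossible → Topology probably wrong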
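The idea of scoring each four-taxon topology by its estimated internal branch can be sketched as follows. The estimator here is an assumption based on the additivity (four-point) condition and is not necessarily the exact formula on the slide: for the tree grouping A with B, the internal branch is estimated by averaging the two additive estimates, (d_AC + d_BD − d_AB − d_CD)/2 and (d_AD + d_BC − d_AB − d_CD)/2. The dictionary layout and function names are hypothetical.

```python
def pair_sum(dist, pair1, pair2):
    """Sum of two pairwise distances, e.g. d(A,B) + d(C,D)."""
    return dist[frozenset(pair1)] + dist[frozenset(pair2)]


def split_indices(dist):
    """Estimated internal branch length for each of the three topologies."""
    s_ab_cd = pair_sum(dist, "AB", "CD")
    s_ac_bd = pair_sum(dist, "AC", "BD")
    s_ad_bc = pair_sum(dist, "AD", "BC")
    return {
        "AB|CD": (s_ac_bd + s_ad_bc - 2 * s_ab_cd) / 4,
        "AC|BD": (s_ab_cd + s_ad_bc - 2 * s_ac_bd) / 4,
        "AD|BC": (s_ab_cd + s_ac_bd - 2 * s_ad_bc) / 4,
    }


# Distances from a toy additive tree: tip branches A=1, B=2, C=3, D=4,
# with A and B on one side of an internal branch of length 5.
toy = {
    frozenset("AB"): 3, frozenset("CD"): 7,
    frozenset("AC"): 9, frozenset("BD"): 11,
    frozenset("AD"): 10, frozenset("BC"): 10,
}
print(split_indices(toy))
```

On these toy distances the grouping of A with B recovers the internal branch length of 5, while the two competing topologies receive negative values.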
Tree Testing: Bootstrapping
Used to assess the support for individual branches
Randomly resample characters, with replacement
How often does a specific branch appear?
Repeat many times (1000 or more)
[Figure: tree of rat, human, turtle, fruit fly, oak, and duckweed, with bootstrap values of 100, 98, and 73 on its internal branches.]
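The resampling step can be sketched as below. The alignment is assumed to be a dictionary of equal-length sequences; `build_tree` and `has_branch` are hypothetical callbacks standing in for whatever tree-building method and branch test are in use, and are not part of any particular library.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, same columns for every sequence."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, has_branch, replicates=1000, seed=0):
    """Percentage of bootstrap replicates whose tree contains the branch of interest."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        tree = build_tree(bootstrap_alignment(alignment, rng))
        if has_branch(tree):
            hits += 1
    return 100 * hits / replicates


# Toy alignment: resampling keeps each column intact across sequences.
toy_alignment = {"seq1": "ACGTACGT", "seq2": "ACGTTCGA"}
print(bootstrap_alignment(toy_alignment, random.Random(42)))
```

Sampling whole columns, rather than individual characters, preserves the site patterns that the tree-building method relies on.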
Rates of nucleotide substitutions between human and mouse or rat
• Synonymous rate = 2-10 substitutions per site per 10^9 years in coding regions
• Nonsynonymous rate = 0-3 substitutions per site per 10^9 years in coding regions (more variable among genes)
• Synonymous rate exceeds nonsynonymous rate
Molecular Clocks
• Do homologous proteins evolve at the same substitution rate?
• Estimate relative rates using an outgroup
• But, what about effects of generation time, metabolic specialization, etc?
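Estimating relative rates with an outgroup can be sketched as follows. This assumes the three pairwise distances are additive, so the branch from the common ancestor of A and B to each tip can be solved for directly; the function and argument names are hypothetical.

```python
def relative_rate(d_ao, d_bo, d_ab):
    """Compare lineages A and B using outgroup O.

    Returns (branch_a, branch_b, difference), where branch_a and branch_b are
    the estimated distances from the A/B common ancestor to each tip.
    """
    branch_a = (d_ab + d_ao - d_bo) / 2
    branch_b = (d_ab + d_bo - d_ao) / 2
    return branch_a, branch_b, branch_a - branch_b


# Toy distances: A is farther from the outgroup than B is.
print(relative_rate(d_ao=10, d_bo=8, d_ab=6))  # (4.0, 2.0, 2.0)
```

Under a strict molecular clock the two branches would be equal, so a difference that is large relative to its sampling error suggests the lineages evolve at different rates.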
Darwin’s theory reinterpreted homology as common ancestry.
Ancestral sequence:
ATCGGCCACTTTCGCGATCA
Lineage 1:
ATAGGCCACTTTCGCGATCA
ATAGGCCACTTTCGCGATTA
ATAGGGCAGTTTCGCGATTA
ATAGGGCAGTTTTGCGATTA
ATAGGGCAGTTTCGCGATTA
ATAGGGCAGTCTCGCGATTA
Lineage 2:
ATCGGCCACTTTCGCGATCG
ATCGGCCACTTTCGTGATCG
ATCGGCCACGTTCGTGATCG
ATCGGCCACGTTCGCGATCG
ATCGGCCACCTTCGCGATCG
ACCGGCCACCTTCGCGATCG
Homologous sequences:
ACCGGCCACCTTCGCGATCG ATAGGGCAGTCTCGCGATTA
Orthologs arise by speciation
ATCGGCCACTTTCGCGATCA
ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG
Sequence in ancestral organism
Orthologous sequences
Speciation event
Modern species A Modern species B
Orthologs are “evolutionary counterparts” – Koonin (2001)
Paralogs arise by duplications
ATCGGCCACTTTCGCGATCA
ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG
Sequence in ancestral organism
Paralogous sequences
Duplication event
Modern duplicate A Modern duplicate B
Hardison PNAS 2001 98 :1327-1329
We have different types of hemoglobins
The major adult hemoglobin is composed of 2 α chains and 2 β chains. The major fetal hemoglobin is composed of 2 α chains and 2 γ chains.
“There may thus exist a Molecular Evolutionary Clock” – Zuckerkandl & Pauling (1965)
A model of sequence divergence can be used to estimate the duplication dates of the different hemoglobin chains
Duplication event
Primordial hemoglobin
Human Human Cow Cow
Speciation event
Note: This model explains why the distance between Human and Cow for the same chain is shorter than the distance between the two Human chains.
PBS Evolution Library (http://www.pbs.org/wgbh/evolution/library/)
Different clocks keep different times
Between horse and man
The clock varies for different regions of the protein
For example, locations on the exterior of the protein may change at a different rate than those on the interior.
Ayala, F. Bioessays 1999 Jan;21(1):71-5
No universal clocks found!
Two terrible clocks
Ayala, F. Bioessays 1999 Jan;21(1):71-5
The common estimate is 1,100 My
What causes deviations from the clock?
1. Generation time: Shorter generation time will accelerate the clock because it shortens the time to fix new mutations.
2. Mutation rate: Species-characteristic differences in polymerases or other biological properties that affect the fidelity of DNA replication, and hence the incidence of mutations.
3. Gene function: Changes in the function of a protein as evolutionary time proceeds. This might particularly be expected in the case of gene duplication.
4. Natural selection: Organisms are continually adapting to the physical and biotic environments, which change endlessly in patterns that are unpredictable and differently significant to different species.
Ayala, F. Bioessays 1999 Jan;21(1):71-5
HIV Example 1: Florida dentist case
• 1990 case: Did a patient’s HIV infection result from an invasive dental procedure performed by an HIV+ dentist?
• HIV evolves so fast that transmission patterns can be reconstructed from viral sequence (molecular forensics).
• Compared viral sequence from the dentist, three of his HIV+ patients, and two HIV+ local controls.
Florida dentist case
So what do the results mean?
• 2 of 3 patients closer to dentist than to local controls. Statistical significance? More powerful analyses?
• Do we have enough data to be confident in our conclusions? What additional data would help?
• If we determine that the dentist’s virus is linked to those of patients E and G, what are possible interpretations of this pattern? How could we test between them?