Molecular Evolution
Justin Fay
Center for Genome Sciences
Department of Genetics
4515 McKinley Ave. Rm 4305
Molecular evolution is the study of the causes and effects of evolutionary changes in molecules.
Phylogenetics
Divergence times
Comparative Genomics (mutation and selection)
Species 1 GGCAGTGACATTTTCTAACGCGAAGGTACTT
Species 2 GGCAGCGCCATTTTCTAATGCGAGGGTACTT
Species 3 GGCAGCGCCATTGTCTAATGCGAGGGTACTT
          ***** * **** ***** **** *******
Archaea
Human-chimp-Neanderthal
Ultraconserved sequences
ENCODE
FOXP2
Phylogenetics Methods
Table 1. Number of possible rooted and unrooted trees.
Number of sequences   Number of rooted trees   Number of unrooted trees
2                     1                        1
3                     3                        1
4                     15                       3
5                     105                      15
6                     945                      105
10                    34,459,425               2,027,025
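The counts in Table 1 follow from the standard double-factorial formulas for bifurcating trees: (2n - 3)!! rooted trees and (2n - 5)!! unrooted trees for n sequences. A minimal sketch, with function names chosen here for illustration:

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 sequences: (2n - 3)!!"""
    return double_factorial(2 * n - 3)


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 2 sequences: (2n - 5)!!"""
    return double_factorial(2 * n - 5)


for n in (2, 3, 4, 5, 6, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

The rapid growth is why exhaustive tree search is only feasible for a handful of sequences.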
Taxonomists have long debated phylogenetic methods.
There are many types of methods:
Character state methods (also called cladistic methods), like parsimony.
Distance or similarity based methods (also called phenetic methods), like UPGMA.
Maximum likelihood and Bayesian Methods.
Parsimony (non-parametric) and Maximum likelihood (parametric) are both used when phylogeny is critical.
Software:
PAUP
PHYLIP
MEGA
MrBayes
[Figure: tree of sequences A, B, C and D]
Table 2. Distance matrix.
Sequence   A       B       C
A
B          d(AB)
C          d(AC)   d(BC)
D          d(AD)   d(BD)   d(CD)
Each d is the distance (substitution rate) between a pair of sequences.
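As a rough illustration of how the entries of a distance matrix like Table 2 can be filled in, the sketch below computes the uncorrected proportion of differing sites (p-distance) for the three example sequences shown at the start of these notes. The choice of p-distance is an assumption made here; the slides do not say which distance measure was used.

```python
from itertools import combinations

# The three aligned example sequences from the opening slide.
seqs = {
    "Species 1": "GGCAGTGACATTTTCTAACGCGAAGGTACTT",
    "Species 2": "GGCAGCGCCATTTTCTAATGCGAGGGTACTT",
    "Species 3": "GGCAGCGCCATTGTCTAATGCGAGGGTACTT",
}


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Lower-triangle entries of the distance matrix, keyed by sequence pair.
distances = {
    (s, t): p_distance(seqs[s], seqs[t]) for s, t in combinations(seqs, 2)
}

for (s, t), d in distances.items():
    print(f"d({s}, {t}) = {d:.3f}")
```

Distance methods such as UPGMA take a matrix like this as their only input.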
Gene trees vs Species trees
1. Orthology
2. Independence (no concerted evolution or horizontal transfer)
Orthologs are genes created by speciation events.
Paralogs are genes created by duplication events.
Homologs are genes that are similar because of shared ancestry.
[Figure: gene tree with speciation and duplication events labeled, for Species 1 and Species 2]
Orthologs and paralogs can be distinguished by i) synteny or ii) phylogeny.
Gene Conversion and Horizontal Gene Transfer
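[Figure: gene conversion between Locus 1 (Chr02) and Locus 2 (Chr14), involving HHT1, HHF1, HHT2 and HHF2. Panels: no conversion (true phylogeny); gene conversion.]
[Figure: horizontal gene transfer compared with the species tree. Panels: vertebrate to bacteria; bacteria to vertebrate.]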
Molecular Evolution (Comparative Genomics)
1. Conservation
Annotation of genes, regulatory sequences and other functional elements
Functional sequences will remain conserved across distantly related species whereas non-functional sequences will accumulate changes
2. Divergence
Evolution of genes, regulatory sequences and other functional elements
Species-specific functional sequences
Functional sequences with new or modified functions
Origins of Molecular Evolution
Insulin was the first protein to be sequenced (1955), work for which Fred Sanger received the Nobel Prize. The cytochrome c protein sequence followed (Margoliash et al. 1961).
The sequencing of the same proteins from different species established a number of key principles of molecular evolution:
1. Most proteins are highly conserved, and the changes that do occur tend to fall outside functionally important sites. For example, human diabetics were treated with insulin purified from pigs and cows.
2. The rate of amino acid substitution is constant across phylogenetic lineages.
Molecular clock - the rate of amino acid or nucleotide substitution is constant per year across phylogenetic lineages (Zuckerkandl and Pauling 1962). Controversial but revolutionized phylogenetics and set the stage for the neutral theory.
Neutral theory or neutral mutation random drift hypothesis - the vast majority of mutations that become polymorphic in a population and fixed between species are not driven by Darwinian selection but are neutral or nearly neutral with respect to fitness (Kimura 1968; King and Jukes 1969). The neutral theory is dead; long live the neutral theory.
Difference between mutation rate and substitution rate.
[Figure: population frequency of mutations plotted against time]
Mutation rate - the chance of a mutation occurring in each generation or cell division (does NOT depend on selection).
Substitution rate - the rate at which mutations become fixed within a population (depends on selection).
Substitution rate = mutation rate * fixation probability * time
Fixation probability depends on selection.
Nucleotide Substitution Models
Nucleotide substitution models correct for multiple hits
Purines: A, G
Pyrimidines: C, T
Jukes and Cantor (JC69) Model (1969)
Assumptions of the JC model:
1) Equal base frequencies
2) Equal mutation rates between the bases
3) Constant mutation rate
4) No selection
Jukes Cantor Model
p = 3/31 = 0.097
K = 0.104 substitutions per site
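The step from p to K uses the standard Jukes-Cantor correction, K = -(3/4) ln(1 - 4p/3), which converts the observed proportion of differing sites into an estimated number of substitutions per site. A minimal sketch reproducing the numbers above (the function name is illustrative):

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor distance from the observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


p = 3 / 31  # 3 differences observed over 31 aligned sites
print(round(p, 3), round(jc69_distance(p), 3))
```

K exceeds p because some sites have been hit more than once, and the gap widens as p approaches the saturation value of 0.75.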
Other nucleotide substitution models
Model   Assumption        Free parameters   Reference
JC69    A=G=C=T, ts=tv    1                 Jukes & Cantor 1969
K80     A=G=C=T           2                 Kimura 1980
F81     ts=tv             4                 Felsenstein 1981
HKY85                     5                 Hasegawa, Kishino & Yano 1985
GTR     unequal rates     9                 Tavaré 1986
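To show what relaxing ts=tv looks like in practice, here is a hedged sketch of the Kimura two-parameter (K80) distance, K = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the observed proportions of transitions and transversions. It is applied to Species 1 and Species 2 from the opening alignment; that pairing is chosen here for illustration and is not an example worked in the slides.

```python
import math

PURINES = {"A", "G"}


def count_ts_tv(a: str, b: str) -> tuple[int, int]:
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (x in PURINES) == (y in PURINES):
            ts += 1
        else:
            tv += 1
    return ts, tv


def k80_distance(a: str, b: str) -> float:
    """Kimura two-parameter distance in substitutions per site."""
    ts, tv = count_ts_tv(a, b)
    P, Q = ts / len(a), tv / len(a)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


s1 = "GGCAGTGACATTTTCTAACGCGAAGGTACTT"
s2 = "GGCAGCGCCATTTTCTAATGCGAGGGTACTT"
print(count_ts_tv(s1, s2), round(k80_distance(s1, s2), 4))
```

With equal transition and transversion rates the K80 formula reduces to the JC69 correction, which is why JC69 is listed with one fewer free parameter.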
Substitution Rates with Selection
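No selection: The substitution rate between two species is K = 2µt.
[Figure: S. cerevisiae and S. paradoxus separated by divergence time t]
Selection: the fixation probability P of a mutation with selection coefficient s and initial frequency q in a population of effective size Ne is
$P = \frac{1 - e^{-4 N_e s q}}{1 - e^{-4 N_e s}}$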
Substitution rate = mutation rate * fixation probability * time
The substitution rate for neutral mutations = 2Nµ * 1/2N * t = µt
The substitution rate for adaptive mutations = 2Nµ * 2s * t = 4Nsµt for 4Ns > 1
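The two limits quoted above can be checked numerically from the fixation probability P = (1 - e^(-4 Ne s q)) / (1 - e^(-4 Ne s)) with a new mutation starting at q = 1/2N. The sketch below assumes N = Ne, and the values of N, µ, t and s are hypothetical numbers picked only to illustrate the arithmetic.

```python
import math


def fixation_probability(s: float, n_e: float, q: float) -> float:
    """Fixation probability for selection coefficient s, effective size n_e
    and initial frequency q. The neutral limit (s = 0) equals q."""
    if s == 0:
        return q
    # expm1(x) = e^x - 1 keeps precision when 4*n_e*s*q is tiny.
    return math.expm1(-4 * n_e * s * q) / math.expm1(-4 * n_e * s)


N = 10_000        # hypothetical population size, assuming N = Ne
mu = 1e-8         # hypothetical mutation rate per site per generation
t = 1_000_000     # hypothetical number of generations
q0 = 1 / (2 * N)  # frequency of a single new mutation in a diploid population

neutral = 2 * N * mu * fixation_probability(0.0, N, q0) * t
adaptive = 2 * N * mu * fixation_probability(0.01, N, q0) * t

print(neutral)   # close to mu * t
print(adaptive)  # close to 4 * N * s * mu * t when 4Ns is large
```

The neutral result does not depend on N at all, whereas the adaptive rate scales with both N and s.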
Conserved sequences
Human-Mouse conservation
Species   Conserved*   Conserved noncoding (non-repetitive aligned)   Reference
Humans    3-8%         21%                                            Waterston et al. (2002)
Worms     18-37%       18%                                            Shabalina & Kondrashov (1999)
Flies     37-53%       40-70%                                         Andolfatto (2005)
Yeast     47-68%       30-40%                                         Chin et al. (2005), Doniger et al. (2005)
*Siepel et al. (2005)
Deletion and expression assays of conserved noncoding sequences
Pennacchio et al. 2006; Yun et al. 2012
Rapidly Evolving Genes (dN/dS)
Detecting selection using the nucleotide substitution rate
Synonymous change - mutation that does not change the amino acid sequence of a protein.
Nonsynonymous change - mutation that changes the amino acid sequence of a protein.
Table 1. The genetic code.
Codon AA    Codon AA    Codon AA    Codon AA
TTT   Phe   TCT   Ser   TAT   Tyr   TGT   Cys
TTC   Phe   TCC   Ser   TAC   Tyr   TGC   Cys
TTA   Leu   TCA   Ser   TAA   Stop  TGA   Stop
TTG   Leu   TCG   Ser   TAG   Stop  TGG   Trp

CTT   Leu   CCT   Pro   CAT   His   CGT   Arg
CTC   Leu   CCC   Pro   CAC   His   CGC   Arg
CTA   Leu   CCA   Pro   CAA   Gln   CGA   Arg
CTG   Leu   CCG   Pro   CAG   Gln   CGG   Arg

ATT   Ile   ACT   Thr   AAT   Asn   AGT   Ser
ATC   Ile   ACC   Thr   AAC   Asn   AGC   Ser
ATA   Ile   ACA   Thr   AAA   Lys   AGA   Arg
ATG   Met   ACG   Thr   AAG   Lys   AGG   Arg

GTT   Val   GCT   Ala   GAT   Asp   GGT   Gly
GTC   Val   GCC   Ala   GAC   Asp   GGC   Gly
GTA   Val   GCA   Ala   GAA   Glu   GGA   Gly
GTG   Val   GCG   Ala   GAG   Glu   GGG   Gly
dN or Ka = the nonsynonymous substitution rate = # nonsynonymous changes / # nonsynonymous sites.
dS or Ks = the synonymous substitution rate = # synonymous changes / # synonymous sites.
Interpretation of dN/dS ratios (assuming synonymous sites are neutral):
dN/dS = 1: No constraint on protein sequence, i.e. nonsynonymous changes are neutral.
dN/dS < 1: Functional constraint on the protein sequence, i.e. nonsynonymous mutations are deleterious.
dN/dS > 1: Change in the function of the protein sequence, i.e. nonsynonymous mutations are adaptive.
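A simplified sketch of how changes and sites can be counted for two aligned coding sequences. It follows the general idea of counting each codon position as a fraction of a synonymous site, but makes several assumptions that are not from the slides: changes to stop codons are counted as nonsynonymous, codons differing at more than one position are rejected, and no multiple-hit correction is applied, so the outputs are uncorrected proportions (pN, pS) rather than finished dN and dS estimates. The example sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon: str) -> float:
    """Number of synonymous sites in a codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODE[mutant] == CODE[codon]:
                total += 1 / 3  # one of three possible changes at this position
    return total


def pn_ps(seq1: str, seq2: str) -> tuple[float, float]:
    """Uncorrected nonsynonymous and synonymous differences per site."""
    codons1 = [seq1[i:i + 3] for i in range(0, len(seq1), 3)]
    codons2 = [seq2[i:i + 3] for i in range(0, len(seq2), 3)]
    syn_sites = sum(synonymous_sites(c) for c in codons1 + codons2) / 2
    nonsyn_sites = 3 * len(codons1) - syn_sites
    syn_diffs = nonsyn_diffs = 0
    for c1, c2 in zip(codons1, codons2):
        n_diff = sum(x != y for x, y in zip(c1, c2))
        if n_diff == 0:
            continue
        if n_diff > 1:
            raise ValueError("multi-change codons are not handled in this sketch")
        if CODE[c1] == CODE[c2]:
            syn_diffs += 1
        else:
            nonsyn_diffs += 1
    return nonsyn_diffs / nonsyn_sites, syn_diffs / syn_sites


# Hypothetical pair: TTT->TTC is synonymous (Phe), AAA->AGA is nonsynonymous (Lys->Arg).
pN, pS = pn_ps("ATGTTTGGAAAA", "ATGTTCGGAAGA")
print(round(pN, 3), round(pS, 3), round(pN / pS, 3))
```

Programs such as PAML and HyPhy estimate dN/dS by maximum likelihood instead of by counting, which handles multiple changes and unequal rates more gracefully.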
Rapidly Evolving Genes
Nayak et al. 2005
dN increased by positive selection
dN decreased by negative selection
Problem: dN may be influenced by both and still be less than dS
Branch Model (dN/dS) (rate heterogeneity)
15 copies in human
Copy number varies in other primates
Johnson et al. 2001
Site Model (dN/dS)
● Positive selection on the egg receptor (VERL) for abalone sperm lysin.
● VERL and lysin act as a lock and key for fertilization.
● Co-evolution by sexual selection, conflict or microbial attack.
Galindo et al. 2003
Sites – methods
Maximum Parsimony (Suzuki)
Maximum Likelihood (PAML, HyPhy)
Models of molecular evolution
Key Assumptions:
➔ Alignments are correct
➔ Sites are independent
➔ Mutational & selection parameters
Alignment Accuracy & Coverage
Pollard et al. 2004
[Figure: alignment accuracy and coverage, with and without indels, under no constraint and under constraint]
Alignment differences gp120 HIV/SIV
ClustalW alignment
PRANK alignment (phylogeny aware)
Detection of positive selection depends on the alignment
Markova-Raina and Petrov (2011)
Mutation rate variation
● Transitions vs. Transversions – transitions occur twice as often as transversions
● CpG - Spontaneous deamination of 5-methylcytosine yields thymine and ammonia, giving a 20x higher rate of transition at CpG sites
● 28% of mutations are transitions at CpG sites but only 3.5% of sites are CpG
● Genomic position (5-10%)
● Age, sex (2 – 10 fold)
● Repeats (polynucleotides, microsatellites)
Types of Mutations - WGS
Single nucleotide
Transpositions
Duplications
Insertion/Deletion
Rearrangement
G/C to A/T 2.9-fold higher than reverse! Predicts 74% AT content
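The 74% figure follows from a simple two-state equilibrium: if G/C to A/T changes occur at rate u and A/T to G/C changes at rate v, AT content stops changing when (1 - f) * u = f * v, giving f = u / (u + v). A minimal sketch, assuming mutation is the only force acting on base composition:

```python
def equilibrium_at_content(gc_to_at_rate: float, at_to_gc_rate: float) -> float:
    """Equilibrium AT fraction under mutation pressure alone."""
    return gc_to_at_rate / (gc_to_at_rate + at_to_gc_rate)


# Only the ratio of the two rates matters, so the reverse rate is set to 1.
print(round(equilibrium_at_content(2.9, 1.0), 2))
```

A genome whose AT content is well below this prediction implies that something other than mutation, such as selection or biased gene conversion, is favoring GC.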
Substitution rate as a function of GC content
BRCA1 sliding window Ka/Ks analysis
Codon Bias
Measures of Codon Bias
CAI – codon adaptation index, based on the usage of each codon relative to the most abundant codon for the same amino acid
Fop – frequency of the optimal codon
ENC – effective number of codons based on the deviation from equal usage
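As an illustration of the CAI idea, the sketch below derives a relative weight for each codon from a set of reference genes (its count divided by the count of the most used synonymous codon) and scores a gene by the geometric mean of those weights. The reference sequence is a hypothetical toy example, and two simplifications are assumed here: codons unseen in the reference are skipped rather than given a small pseudo-weight, and single-codon amino acids (Met, Trp) are excluded.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq: str) -> list[str]:
    """Split an in-frame coding sequence into codons, dropping any partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs: list[str]) -> dict[str, float]:
    """Weight of each codon relative to the most used synonymous codon."""
    counts = Counter(c for seq in reference_seqs for c in split_codons(seq))
    weights = {}
    for codon, aa in CODE.items():
        family = [c for c, a in CODE.items() if a == aa]
        if aa == "*" or len(family) < 2:
            continue  # skip stop codons and single-codon amino acids
        best = max(counts[c] for c in family)
        if counts[codon] > 0:
            weights[codon] = counts[codon] / best
    return weights


def cai(seq: str, weights: dict[str, float]) -> float:
    """Geometric mean of the weights over the scorable codons of seq."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference: Ala uses GCT 3x and GCC 1x; Lys uses AAG 2x and AAA 1x.
weights = relative_adaptiveness(["GCTGCTGCTGCCAAGAAGAAA"])
print(cai("GCTAAG", weights))  # uses only the preferred codons
print(cai("GCCAAA", weights))  # uses only the less preferred codons
```

In practice the reference set is a group of highly expressed genes, so a high CAI means a gene resembles them in codon usage.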
Explanation of Codon Bias
Bias towards GC-ending codons that is not found in adjacent noncoding regions
Correlates with highly expressed genes
Correlates with tRNA abundance
Explanations: translational accuracy/speed, protein misfolding
Codon Bias is correlated with Synonymous Substitution Rate
Codon Bias correlation depends on distance
Codon models
αs = synonymous rate
βs = nonsynonymous rate
R = tv/ts
πny = frequency of target nucleotide n in codon y
Binding site models
● Sequence ~ binding affinity (Schneider et al. 1986, Berg and von Hippel 1987)
● Binding affinity ~ fitness (Gerland and Hwa 2002, Sengupta et al. 2002)
● Fitness ~ substitution rate (Moses et al. 2004)
Kimura 1962
Bulmer 1991
Moses et al. 2004
Biased Gene Conversion
AT to GC bias
Recombination occurs in hotspots
Recombination hotspots evolve rapidly
Biased gene conversion occurs in bursts (non-equilibrium)
Recombination and predicted equilibrium GC frequency
Correlomics
r (Interaction ~ Fitness) = 0.15, P = 3.4 x 10^-13
r (Fitness ~ Evolutionary rate) = -0.13, P = 4.3 x 10^-7
r (Interactions ~ Evolutionary rate) = -0.24, P = 0.002
Spurious (strong) correlations
Significance and effect size
Statistical significance (a low P value) measures how certain we are that a given effect exists.
Effect size measures the magnitude of an effect.
r = 0.10, P < 1e-16
A squared correlation coefficient below 0.1 (r < 0.3) means the effect is pretty much non-existent, regardless of how low the P value is.
Claus Wilke, UT-Austin (Blog 2013)
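To make the effect-size point concrete, the squared correlation coefficient gives the fraction of variance in one variable that is accounted for by the other. The short sketch below applies this to the three correlations listed under Correlomics; the dictionary keys simply repeat the labels from that slide.

```python
# Correlation coefficients as reported on the Correlomics slide.
correlations = {
    "Interaction ~ Fitness": 0.15,
    "Fitness ~ Evolutionary rate": -0.13,
    "Interactions ~ Evolutionary rate": -0.24,
}

# r squared: the fraction of variance explained by a linear relationship.
variance_explained = {name: r ** 2 for name, r in correlations.items()}

for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}, r^2 = {variance_explained[name]:.4f}")
```

Each of these highly significant correlations accounts for less than 6% of the variance.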
Gene expression predicts the rate of evolution
Polymorphisms vs Divergence
P ( SNP | conserved amino acid )
P ( SNP | conserved transcription factor binding site )
Methods for Predicting Human Disease Mutations
SIFT: Ng P, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11: 863-874.
PolyPhen: Sunyaev S, Ramensky V, Koch I, Lathe W, Kondrashov A et al. (2001) Prediction of deleterious human alleles. Hum Mol Genet 10: 591-597.
Method     True Positives   False Positives
SIFT       69%              20%
PolyPhen   69%              9%
Disease mutations
Conserved sites
2.2% of human disease alleles are WT in mouse
Likelihood Ratio Test
human        GYCF G AQEQ
chimp        GYCF G AQEQ
orangutan    GYCF G AQEQ
rhesus       GYCF G AQEQ
bushbaby     GYCF G VQEQ
treeshrew    GYCF G VQEQ
rat          GYCF G VQEQ
mouse        GYCF G VQEQ
squirrel     GYCF G VQEQ
guineapig    GYCF G VQEQ
dog          GYCF G IQEQ
cat          GYCF G VQEQ
horse        GYCF G VQEQ
cow          GYCF G VQEQ
microbat     GYCF G VQEQ
armadillo    GYCF G VQEQ
opossum      GYCF G VAEQ
platypus     GYGF G EQEQ
frog         GFCF G ETKQ
tetraodon    GCCF G NLEE
stickleback  GYCF G DGEE
medaka       GYCF G DLEE
zebrafish    GYCF G DLEE
[Figure: species grouped as placentals, non-placental mammals, chicken, frog and fish]
32 vertebrate species
18,993 alignments
dS = 12.2 subs/site
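For orientation, a generic likelihood ratio test compares the log-likelihood of a null model with that of a nested alternative, and refers twice the difference to a chi-square distribution. The sketch below assumes the two models differ by one free parameter (1 degree of freedom); the log-likelihood values are hypothetical, and the slides do not state how the models in this analysis were parameterized.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float) -> tuple[float, float]:
    """Likelihood ratio statistic and its P value for 1 degree of freedom."""
    stat = max(2 * (lnl_alt - lnl_null), 0.0)
    # For 1 df the chi-square tail probability equals erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical log-likelihoods for one codon under the null and alternative models.
stat, p = lrt_p_value(lnl_null=-10.0, lnl_alt=-8.0)
print(stat, round(p, 4))
```

With 18,993 alignments and many codons tested in each, a multiple-testing correction would be needed before interpreting any single P value.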
Tons of Deleterious Mutations
Chun and Fay (2009)
Most Deleterious SNPs are Rare
Three Methods Applied to Venter
Method     Tested (%)     Deleterious (%)
SIFT       5,401 (72%)    890 (16%)
PolyPhen   6,746 (90%)    555 (8.2%) probably, 768 (11%) possibly
LRT        5,645 (75%)    796 (14%)
7,534 High Quality NSN SNPs in Venter Genome
Disturbing Overlap Among Three Methods
[Figure: Venn diagram of deleterious predictions by LRT, PolyPhen and SIFT, with regions of 28%, 5%, 6%, 10%, 30%, 3% and 18%]
7,534 NSN SNPs in Venter Genome
1,735 SNPs predicted deleterious by any one of the three methods
Human disease associated SNPs
Chen et al. 2010
21,429 disease-associated SNPs (2,113 publications)
5,270 in HapMap3
Conservation of GWAS SNPs
Dudley et al. (2012)
High-confidence GWAS SNPs
OR vs. Conservation
Dudley et al. (2012)
Phylomedicine
Kumar et al. (2011) Phylomedicine: an evolutionary telescope to explore and diagnose the universe of disease mutations. Trends in Genetics 27:377-386