Date post: | 15-Jan-2016 |
Upload: | corey-lewis |
EVOLUTIONARY CHANGE IN DNA SEQUENCES
- usually too slow to monitor directly…
… so use comparative analysis of 2 sequences which share a common ancestor
- determine number and nature of nt substitutions that have occurred (i.e. measure degree of divergence)
spontaneous mutation rates? p. 35-37
~ 4 x 10^-9 nt sub per site per year
for mammalian nuclear DNA (regions not under functional constraint)
... much higher for viruses
eg. 10^-6 to 10^-3 nt sub per site per generation
Potential pitfalls
1. Are all evolutionary changes being monitored?
- if closely-related, high probability only one change at any given site…
but if distant, may have been multiple substitutions (“hits”) at a site
- can use algorithms to correct for this
2. If indels between two sequences, can they be aligned with confidence?
- algorithms with gap penalties
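The correction for multiple hits mentioned under pitfall 1 can be sketched as follows. The slides do not say which algorithm is meant; this example assumes the Jukes-Cantor one-parameter model, which is only one of several possible corrections. The function names are hypothetical.

```python
import math

def p_distance(seq1, seq2):
    """Observed proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    diffs = sum(1 for a, b in pairs if a != b)
    return diffs / len(pairs)

def jukes_cantor(p):
    """Estimated substitutions per site under the Jukes-Cantor model.

    Assumes all substitution types are equally likely (an assumption of
    this sketch, not something stated in the slides).
    """
    if p >= 0.75:
        raise ValueError("p >= 0.75: sites are saturated, distance undefined")
    return -0.75 * math.log(1 - 4 * p / 3)

if __name__ == "__main__":
    p = p_distance("ACGTACGT", "ACGTACGA")
    print(p, jukes_cantor(p))
```

For closely related sequences the corrected value is close to the observed one; as the observed difference grows, the estimate rises faster because more of the changes are hidden by repeated hits at the same site.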
Ancestral sequence
Present day sequences
Fig. 3.6
Page & Holmes Fig. 5.9
(If comparing long stretches, highly unlikely they would have converged to the same sequence)
Homoplasy: same nt, but not directly inherited from ancestral sequence
Nucleotide substitutions within protein-coding sequences
1. Synonymous vs. non-synonymous
Single step:
Multiple steps:
AAT ACT
Is one pathway more likely than another?
p.82
2. Nomenclature related to “degeneracy”:
Non-degenerate
- all possible changes at site are non-synonymous
2-fold degenerate
- one of the 3 possible changes is synonymous
4-fold degenerate
- all possible changes at site are synonymous
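As a rough illustration of the degeneracy nomenclature above, the sketch below classifies a codon position by counting how many of its 3 possible substitutions are synonymous under the standard genetic code. The function names are hypothetical. Note that the third position of the isoleucine codons (ATT, ATC, ATA) has 2 synonymous changes, a case that falls outside the three classes listed on the slide; it is labelled "3-fold degenerate" here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_changes(codon, pos):
    """Count how many of the 3 possible substitutions at pos (0-2) are synonymous."""
    codon = codon.upper()
    count = 0
    for base in BASES:
        if base == codon[pos]:
            continue
        mutant = codon[:pos] + base + codon[pos + 1:]
        if CODE[mutant] == CODE[codon]:
            count += 1
    return count

def degeneracy(codon, pos):
    """Name the degeneracy class of one codon position."""
    labels = {0: "non-degenerate", 1: "2-fold degenerate",
              2: "3-fold degenerate", 3: "4-fold degenerate"}
    return labels[synonymous_changes(codon, pos)]

if __name__ == "__main__":
    for pos in range(3):
        print("AAT", pos + 1, degeneracy("AAT", pos))
```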
ALIGNMENT OF SEQUENCES FOR COMPARATIVE ANALYSIS
1. By manual inspection - if sequences very similar and no (or few) gaps
2. By sequence distance methods
(often followed by “correction by visual inspection”)
- use algorithms which minimize mismatches and gaps
- gap penalty > mismatch penalty
Fig. 3.12
Alignment of human and chicken pancreatic hormone proteins
no gap penalty
with gap penalty
alignment as in (b), with biochemically similar aa
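To illustrate an algorithm that minimizes mismatches and gaps, with gaps penalized more heavily than mismatches, here is a minimal sketch of global alignment by dynamic programming. This is the Needleman-Wunsch approach, which the slides do not name. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration only.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a simple global alignment.

    Uses a linear gap cost: every gap position costs `gap`.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    print(global_alignment("ACGT", "AGT"))
```

Calling the function with `gap=0` mimics the "no gap penalty" case: the algorithm is then free to open as many gaps as it likes to maximize matches.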
ArabAAG52143 FIVDEADLLLDLGFRRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 539
ArabAAC26676 FIVDEADLLLDLGFKRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 586
yeast        -VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK 323
              ::**** **::**: *:*.* . * .:.       ::******: .:*:::: ::: *:
CLUSTAL W (1.81) Multiple Sequence Alignments
Sequence 1: ArabidopsisAAG52143 798 aa
Sequence 2: ArabidopsisAAC26676 845 aa
Sequence 3: yeast 664 aa
Sequences (2:3) Aligned. Score: 23
Sequences (1:2) Aligned. Score: 93
Sequences (1:3) Aligned. Score: 22
Multiple sequence alignments - CLUSTALW
www.ebi.ac.uk/clustalw (European Bioinformatics Institute)
Symbols used? * : .
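In CLUSTAL W output, `*` marks a column of identical residues, `:` a column whose residues all fall within one "strongly" conserved group, and `.` a column within one "weakly" conserved group. The sketch below derives the symbol line for a protein alignment. The group lists are written from memory of the CLUSTAL W documentation and should be treated as an assumption to verify; the function names are hypothetical.

```python
# Residue groups as recalled from the CLUSTAL W documentation (assumption)
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "              # any gap in the column: no symbol
    if len(residues) == 1:
        return "*"              # fully conserved
    if any(residues <= set(group) for group in STRONG):
        return ":"              # all residues within one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."              # all residues within one weak group
    return " "

def consensus(seqs):
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(column_symbol(col) for col in zip(*(s.upper() for s in seqs)))

if __name__ == "__main__":
    # First 21 columns of the Arabidopsis/yeast alignment shown above
    aligned = ["FIVDEADLLLDLGFRRDVEKI",
               "FIVDEADLLLDLGFKRDVEKI",
               "-VLDEADRLLEIGFRDDLETI"]
    print(repr(consensus(aligned)))
```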
Avers Fig. 3.23
α-globin β-globin
Human α-globin = 141 aa
Human β-globin = 146 aa
Was D-helix loss neutral or adaptive mutation? (Nature 352:349-51, 1991)
Alignment of human α-globin and β-globin proteins
In sequence comparisons, refer to nt (or aa) sequence relatedness as
“… % identity” or “...% similarity”
BUT NOT “ … % homology”
because “homology” means “shares a common ancestor”
“Non-evolutionary biologists”
Petsko Genome Biol. 2:1002,2001
Reminder about definition of the word “homology”
“Normalized alignment score”
NAS = (# identities x 10) + (# Cys identities x 20) - (# gaps x 25)
Doolittle, R. “URFs & ORFs” p.14
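The NAS expression as written on the slide can be computed from a pairwise alignment roughly as below. Two points are assumptions of this sketch, since the slide does not settle them: a Cys identity is counted in both the identity term and the Cys term, and each uninterrupted run of gap characters counts as one gap. The function name is hypothetical.

```python
def nas(aligned_a, aligned_b):
    """Score = identities x 10 + Cys identities x 20 - gaps x 25.

    Assumptions (not from the slide): Cys identities also count as
    ordinary identities, and a run of consecutive gap columns is one gap.
    """
    identities = cys_identities = gaps = 0
    in_gap = False
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            if not in_gap:
                gaps += 1       # count the run once, when it opens
            in_gap = True
            continue
        in_gap = False
        if x == y:
            identities += 1
            if x == "C":
                cys_identities += 1
    return identities * 10 + cys_identities * 20 - gaps * 25

if __name__ == "__main__":
    print(nas("AC-DE", "ACQDE"))
```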
BLAST searches www.ncbi.nlm.nih.gov/BLAST/
- to detect similarity between “sequence of interest” & databank entries
Query = yeast mt ribosomal protein L8 gene (1275 nt)
Example of high score “hit” (red)
Score = 383 bits (193), Expect = 1e-102 Identities = 196/197 (99%), Gaps = 0/197 (0%)
Query AGCGTCAGGATAGCTCGCTCGATGTGGTCAGGCTAACACAATGAACAACGAGACTAGTG
      |||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||
Sbjct AGCGTCAGGATAGCTCGCTCGATGTGATCAGGCTAACACAATGAACAACGAGACTAGTG
Example of low score “hit” (blue or black)
Score = 40.1 bits (20), Expect = 3.6 Identities = 23/24 (95%), Gaps = 0/24 (0%)
Query GTTTTCTTAATATTTATTTAAAAA
      |||||||||||||||| |||||||
Sbjct GTTTTCTTAATATTTAATTAAAAA
“low complexity sequence”
E-values: statistical measure of likelihood that sequences with this degree of similarity occur randomly, i.e. reflects number of hits expected by chance
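The relationship between bit score and E-value can be sketched with the standard approximation E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the total database length. This ignores the edge-effect corrections that BLAST applies, and the database size used below is a made-up figure, so the numbers will not reproduce the values reported above.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate number of chance hits with at least this bit score.

    Uses E = m * n * 2**(-S'), without BLAST's length corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

if __name__ == "__main__":
    QUERY_LEN = 1275          # query length from the slide (nt)
    DB_LEN = 10 ** 9          # hypothetical database size (nt), an assumption
    for bits in (383, 40.1):
        print(bits, expect_value(bits, QUERY_LEN, DB_LEN))
```

Each additional bit of score halves the expected number of chance hits, which is why the 383-bit hit has an E-value many orders of magnitude smaller than the 40.1-bit hit, even though both show over 95% identity.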
Why is “sequence complexity” important when judging whether two sequences are homologous?
Human DNA
Chimp DNA
Pu-rich region #1
Pu-rich region #2 (not homologous to #1)
Region of unbiased base composition
G=C=A=T
AAGAGGAG
How frequently is AAGAGGAG (8-nt sequence) expected to occur by chance in a DNA sequence?
If sequence A is of low complexity (or short length), high % identity with sequence B may not reflect shared evolutionary origin
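One way to approach the question of how often an 8-nt sequence such as AAGAGGAG is expected by chance: under independent sites, the probability of a match at any one position is the product of the base frequencies, and the expected count is that probability times the number of possible start positions. The sketch below considers one strand only, and the purine-rich frequencies are made-up values for illustration.

```python
def expected_occurrences(motif, seq_len, base_freqs=None):
    """Expected number of chance occurrences of motif on one strand.

    Assumes independent sites with the given base frequencies
    (default: unbiased, G = C = A = T = 0.25).
    """
    if base_freqs is None:
        base_freqs = {base: 0.25 for base in "ACGT"}
    p = 1.0
    for base in motif.upper():
        p *= base_freqs[base]
    positions = max(seq_len - len(motif) + 1, 0)
    return p * positions

if __name__ == "__main__":
    print(expected_occurrences("AAGAGGAG", 1_000_000))
    # Hypothetical purine-rich region: A and G at 0.4 each
    pu_rich = {"A": 0.4, "G": 0.4, "C": 0.1, "T": 0.1}
    print(expected_occurrences("AAGAGGAG", 1_000_000, pu_rich))
```

With unbiased composition the per-position probability is (1/4)^8 = 1/65,536. In a purine-rich region the same motif is expected far more often, which is why a match between two low-complexity regions is weak evidence of homology.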
AAGAGGAG
Advantages of using aa (rather than nt) sequences for identifying homologous genes among organisms?
- 20 amino acids vs. 4 nucleotides
- lower chance of “spurious” matches
- unrelated nt sequences (non-homologous) expected to show 25% identity by random chance (if unbiased base composition)
- for distantly related sequences – “saturation” of synonymous sites within codons (multiple hits)
- degeneracy of genetic code & different codon usage patterns (and G+C% of genomes) among organisms
But… for certain phylogenetic analyses, number of informative characters may be higher at DNA than protein level
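The 25% figure for unrelated nt sequences follows from summing, over all characters, the probability that both sequences carry that character at a site. A minimal sketch, assuming independent sites (the function name and the biased composition are hypothetical):

```python
def chance_identity(freqs_a, freqs_b=None):
    """Expected fraction of identical sites between two unrelated sequences.

    freqs_a and freqs_b map each character to its frequency; if freqs_b
    is omitted, both sequences are assumed to share the same composition.
    """
    if freqs_b is None:
        freqs_b = freqs_a
    return sum(freqs_a[k] * freqs_b.get(k, 0.0) for k in freqs_a)

if __name__ == "__main__":
    nt_unbiased = {base: 0.25 for base in "ACGT"}
    aa_unbiased = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
    at_rich = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}   # made-up composition
    print(chance_identity(nt_unbiased))
    print(chance_identity(aa_unbiased))
    print(chance_identity(at_rich))
```

With 20 equally frequent amino acids the chance identity drops to about 5%, and with biased base composition the nt figure rises above 25%, both of which favour protein-level comparison for detecting distant homologs.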
What if BLAST search were done at protein (instead of nt) level?
Query = yeast mitochondrial ribosomal protein L8 (238 aa)
Fungal
Bacterial
Dot matrix method for aligning sequences
- 2 sequences to be compared along X and Y axis of matrix
- dots put in matrix when nts in the 2 sequences are identical
mismatch = “gap” (or break) in line
Fig. 3.7
indel = shift in diagonal
Fig. 3.7
Dot matrix method
- normally compare blocks rather than individual nts
- spurious matches (background noise) influenced by
1. window size – overlapping fixed-length windows whereby sequence 1 compared with seq 2
2. stringency – minimum threshold value (% identity) at each step to score as hit
- for coding regions, could use aa instead of nt sequences to reduce “noise”
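A minimal sketch of the dot matrix method with a window size and stringency threshold, as described above. The function names, the default parameter values, and the text rendering are assumptions made for illustration.

```python
def dot_matrix(seq1, seq2, window=1, stringency=1.0):
    """Return a matrix of booleans: rows follow seq2 (Y axis), columns seq1 (X axis).

    A dot is scored at (row j, column i) when the windows starting at
    seq1[i] and seq2[j] match at no fewer than stringency * window positions.
    """
    seq1, seq2 = seq1.upper(), seq2.upper()
    n_cols = len(seq1) - window + 1
    n_rows = len(seq2) - window + 1
    needed = stringency * window
    matrix = []
    for j in range(n_rows):
        row = []
        for i in range(n_cols):
            matches = sum(1 for k in range(window)
                          if seq1[i + k] == seq2[j + k])
            row.append(matches >= needed)
        matrix.append(row)
    return matrix

def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)

if __name__ == "__main__":
    print(render(dot_matrix("AAGAGGAG", "AAGAGGAG")))
    print()
    print(render(dot_matrix("AAGAGGAG", "AAGAGGAG", window=4)))
```

Comparing the two printouts shows the effect of window size: with single-nt comparison the low-complexity sequence fills the plot with background dots, while a 4-nt window at 100% stringency leaves only the main diagonal.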
Comparison of human chromosome 7 “draft” sequence (2001) with “near-complete” sequence (2004)
Nature 431:935, 2004
How do you interpret the data in this figure?
Axes: 2004 sequence (fewer errors), 2001 sequence
Blowup of 500 kb region