EVOLUTIONARY CHANGE IN DNA SEQUENCES

- usually too slow to monitor directly…

… so use comparative analysis of 2 sequences which share a common ancestor

- determine number and nature of nt substitutions that have occurred (i.e. measure degree of divergence)

spontaneous mutation rates? p. 35-37

for mammalian nuclear DNA (regions not under functional constraint)

~ 4 x 10^-9 nt sub per site per year

... much higher for viruses

e.g. 10^-6 to 10^-3 nt sub per site per generation
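A rough worked example of what the mammalian rate implies: if two lineages each accumulate substitutions at rate r after splitting from a common ancestor T years ago, the expected divergence between them is K = 2rT substitutions per site. The sketch below assumes a constant rate and no multiple hits; the function name and the 5-million-year split time are illustrative choices, not values from the slides.

```python
def expected_divergence(rate_per_site_per_year, years_since_split):
    """Expected substitutions per site between two lineages (K = 2rT)."""
    # Factor of 2: both lineages accumulate changes independently
    # after the split from the common ancestor.
    return 2 * rate_per_site_per_year * years_since_split


mammal_rate = 4e-9  # nt substitutions per site per year (from the slide)
split_time = 5_000_000  # hypothetical divergence time in years (assumption)

k = expected_divergence(mammal_rate, split_time)
print(f"Expected divergence: {k:.3f} substitutions per site")
```

At this rate, only a few percent of unconstrained sites are expected to differ after several million years, which is why direct monitoring is impractical.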

Potential pitfalls

1. Are all evolutionary changes being monitored?

- if closely-related, high probability only one change at any given site…

but if distant, may have been multiple substitutions (“hits”) at a site

- can use algorithms to correct for this

2. If indels between two sequences, can they be aligned with confidence?

- algorithms with gap penalties
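The slides do not name a specific correction for multiple hits. One standard example is the Jukes-Cantor model, which assumes all substitutions are equally likely and converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below is a minimal illustration of that model only; the function name is hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Estimate substitutions per site from the observed proportion
    of differing sites p, under the Jukes-Cantor model."""
    # Under this model p approaches 0.75 as sequences become random
    # with respect to each other, so the correction is undefined there.
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor model")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


for observed in (0.01, 0.10, 0.40):
    corrected = jukes_cantor_distance(observed)
    print(f"observed p = {observed:.2f}  corrected d = {corrected:.4f}")
```

For closely related sequences the corrected and observed values are nearly equal; the gap widens as divergence increases, reflecting hidden multiple hits.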

Ancestral sequence

Present day sequences

Fig. 3.6

Page & Holmes Fig. 5.9

(If comparing long stretches, highly unlikely they would have converged to the same sequence)

Homoplasy: same nt, but not directly inherited from ancestral sequence

Nucleotide substitutions within protein-coding sequences

1. Synonymous vs. non-synonymous

Single step:

Multiple steps:

AAT ACT

Is one pathway more likely than another?

p.82

2. Nomenclature related to “degeneracy”:

Non-degenerate

- all possible changes at site are non-synonymous

2-fold degenerate

- one of the 3 possible changes is synonymous

4-fold degenerate

- all possible changes at site are synonymous
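The degeneracy classes above can be checked mechanically: for a given codon position, count how many of the 3 possible nucleotide changes leave the amino acid unchanged. The sketch below assumes the standard genetic code; the helper names are hypothetical. It also reports the 3-fold case (two of three changes synonymous, as at the third position of the Ile codons), which the list above does not cover.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

LABELS = {0: "non-degenerate", 1: "2-fold degenerate",
          2: "3-fold degenerate", 3: "4-fold degenerate"}


def synonymous_changes(codon, pos):
    """Count how many of the 3 possible substitutions at codon[pos]
    (pos is 0, 1 or 2) leave the encoded amino acid unchanged."""
    original = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        if base == codon[pos]:
            continue
        mutant = codon[:pos] + base + codon[pos + 1:]
        if CODON_TABLE[mutant] == original:
            count += 1
    return count


for codon in ("AAT", "GGT"):
    for pos in range(3):
        label = LABELS[synonymous_changes(codon, pos)]
        print(f"{codon} position {pos + 1}: {label}")
```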

ALIGNMENT OF SEQUENCES FOR COMPARATIVE ANALYSIS

1. By manual inspection - if sequences are very similar, with no (or few) gaps

2. By sequence distance methods

(often followed by “correction by visual inspection”)

- use algorithms which minimize mismatches and gaps

- gap penalty > mismatch penalty
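The slides do not specify an alignment algorithm. As one illustration of scoring that minimizes mismatches and gaps, the sketch below uses Needleman-Wunsch global alignment with a simple linear gap penalty. The scores (match +1, mismatch -1, gap -2) are assumptions chosen only so that a gap costs more than a mismatch; the function name is hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.
    Returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


aligned_a, aligned_b, total = global_align("ACGT", "ACT")
print(aligned_a)
print(aligned_b)
print("score:", total)
```

Raising the gap penalty relative to the mismatch penalty makes the algorithm prefer mismatched columns over opening gaps, which is the trade-off described above.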

Fig. 3.12

Alignment of human and chicken pancreatic hormone proteins

no gap penalty

with gap penalty

alignment as in (b), with biochemically similar aa

Multiple sequence alignments - CLUSTALW

www.ebi.ac.uk/clustalw (European Bioinformatics Institute)

CLUSTAL W (1.81) Multiple Sequence Alignments

Sequence 1: ArabidopsisAAG52143 798 aa
Sequence 2: ArabidopsisAAC26676 845 aa
Sequence 3: yeast 664 aa

Sequences (2:3) Aligned. Score: 23
Sequences (1:2) Aligned. Score: 93
Sequences (1:3) Aligned. Score: 22

ArabAAG52143 FIVDEADLLLDLGFRRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 539
ArabAAC26676 FIVDEADLLLDLGFKRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 586
yeast        -VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK 323
              ::**** **::**: *:*.* . * .:.       ::******: .:*:::: ::: *:

Symbols used? * : .
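On the symbols question: in CLUSTAL W output, `*` marks a fully conserved column, `:` a column whose residues all fall within one "strong" conservation group, and `.` one "weak" group. The sketch below recomputes the symbol line for the three-sequence block shown above. The group lists are reproduced from memory of the CLUSTAL W documentation and should be treated as an assumption to verify; the function names are hypothetical.

```python
# Conservation groups as recalled from the CLUSTAL W documentation (assumption).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def consensus_symbol(column):
    """Return the CLUSTAL-style symbol for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # any gap in the column: no symbol
    if len(residues) == 1:
        return "*"  # identical in all sequences
    if any(residues <= set(group) for group in STRONG):
        return ":"  # conservative substitution (strong group)
    if any(residues <= set(group) for group in WEAK):
        return "."  # semi-conservative substitution (weak group)
    return " "


def consensus_line(seqs):
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(consensus_symbol(col) for col in zip(*seqs))


seqs = [
    "FIVDEADLLLDLGFRRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR",
    "FIVDEADLLLDLGFKRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR",
    "-VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK",
]
for s in seqs:
    print(s)
print(consensus_line(seqs))
```

With these groups, the computed line matches the symbol row in the CLUSTAL W output for this block.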

Alignment of human α-globin and β-globin proteins

Avers Fig. 3.23

Human α-globin = 141 aa

Human β-globin = 146 aa

Was D-helix loss neutral or adaptive mutation? (Nature 352:349-51, 1991)

Reminder about definition of the word “homology”

In sequence comparisons, refer to nt (or aa) sequence relatedness as

“… % identity” or “… % similarity”

BUT NOT “… % homology”

because “homology” means “shares a common ancestor”

“Non-evolutionary biologists”

Petsko Genome Biol. 2:1002, 2001

“Normalized alignment score”

NAS = (# identities x 10) + (# Cys identities x 20) – (# gaps x 25)

Doolittle, R. “URFs & ORFs” p.14

BLAST searches www.ncbi.nlm.nih.gov/BLAST/

- to detect similarity between “sequence of interest” & databank entries

Query = yeast mt ribosomal protein L8 gene (1275 nt)

Example of high score “hit” (red)

Score = 383 bits (193), Expect = 1e-102
Identities = 196/197 (99%), Gaps = 0/197 (0%)

Query  AGCGTCAGGATAGCTCGCTCGATGTGGTCAGGCTAACACAATGAACAACGAGACTAGTG
       |||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||
Sbjct  AGCGTCAGGATAGCTCGCTCGATGTGATCAGGCTAACACAATGAACAACGAGACTAGTG

Example of low score “hit” (blue or black)

“low complexity sequence”

Score = 40.1 bits (20), Expect = 3.6
Identities = 23/24 (95%), Gaps = 0/24 (0%)

Query  GTTTTCTTAATATTTATTTAAAAA
       |||||||||||||||| |||||||
Sbjct  GTTTTCTTAATATTTAATTAAAAA

E-values: statistical measure of the likelihood that sequences with this degree of similarity occur randomly, i.e. reflects the number of hits expected by chance
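To see how the bit score and the E-value in the two hits relate, the usual approximation is E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below applies that formula to the two scores shown. The database size is a made-up placeholder, and real BLAST uses effective (edge-corrected) lengths, so the printed values will not reproduce the E-values reported in the hits.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate number of chance hits with at least this bit score:
    E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


query_len = 1275      # nt, from the slide
db_len = 10 ** 9      # hypothetical database size in nt (assumption)

for bits in (383, 40.1):
    e = expect_value(bits, query_len, db_len)
    print(f"{bits} bits -> E ~ {e:.2g}")
```

Each additional bit of score halves the expected number of chance hits, which is why a short, low-complexity match scoring about 40 bits is expected by chance while a 383-bit match essentially is not.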

Why is “sequence complexity” important when judging whether two sequences are homologous?

Human DNA

Chimp DNA

Pu-rich region #1

Pu-rich region #2 (not homologous to #1)

Region of unbiased base composition

G=C=A=T

AAGAGGAG

How frequently is AAGAGGAG (8-nt sequence) expected to occur by chance in a DNA sequence?

If sequence A is of low complexity (or short length), high % identity with sequence B may not reflect shared evolutionary origin
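One way to approach the frequency question: if each base is independent and equally likely (G=C=A=T), a specific 8-nt sequence such as AAGAGGAG is expected at any given position with probability (1/4)^8 = 1/65,536. In a purine-rich region, modelled here (as a simplifying assumption) as only A and G with equal frequency, the probability rises to (1/2)^8 = 1/256. The function name and the 1 Mb sequence length below are illustrative.

```python
def expected_occurrences(motif_len, seq_len, alphabet_size=4):
    """Expected number of exact matches to one specific motif on one strand,
    assuming independent, equally frequent symbols."""
    per_position = (1 / alphabet_size) ** motif_len
    positions = max(seq_len - motif_len + 1, 0)  # possible start sites
    return positions * per_position


seq_len = 1_000_000  # hypothetical 1 Mb region (assumption)

unbiased = expected_occurrences(8, seq_len, alphabet_size=4)
purine_only = expected_occurrences(8, seq_len, alphabet_size=2)

print(f"unbiased composition: {unbiased:.1f} expected matches")
print(f"purine-only region:   {purine_only:.1f} expected matches")
```

The same 8-nt match is far more likely by chance in the low-complexity region, so it is much weaker evidence of homology there.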

AAGAGGAG

Advantages of using aa (rather than nt) sequences for identifying homologous genes among organisms?

- 20 amino acids vs. 4 nucleotides

- lower chance of “spurious” matches

- unrelated nt sequences (non-homologous) expected to show 25% identity by random chance (if unbiased base composition)

- for distantly related sequences – “saturation” of synonymous sites within codons (multiple hits)

- degeneracy of genetic code & different codon usage patterns (and G+C% of genomes) among organisms

But… for certain phylogenetic analyses, number of informative characters may be higher at DNA than protein level

What if BLAST search were done at protein (instead of nt) level?

Query = yeast mitochondrial ribosomal protein L8 (238 aa)

Fungal

Bacterial

Dot matrix method for aligning sequences

- 2 sequences to be compared along X and Y axis of matrix

- dots put in matrix when nts in the 2 sequences are identical

mismatch = “gap” (or break) in line

Fig. 3.7

indel = shift in diagonal

Fig. 3.7

Dot matrix method

- normally compare blocks rather than individual nts

- spurious matches (background noise) influenced by

1. window size – overlapping fixed-length windows whereby sequence 1 is compared with sequence 2

2. stringency – minimum threshold value (% identity) at each step to score as hit

- for coding regions, could use aa instead of nt sequences to reduce “noise”
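The effect of window size and stringency on background noise can be sketched as follows. This is a minimal illustration, not any particular program's implementation: the function names, the parameter names `window` and `min_identity`, and the test sequence are all hypothetical choices.

```python
def dot_matrix(seq1, seq2, window=1, min_identity=1.0):
    """Return the set of (i, j) positions where the window starting at
    seq1[i] and the window starting at seq2[j] reach the identity threshold."""
    hits = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq1[i:i + window], seq2[j:j + window]))
            if matches / window >= min_identity:
                hits.add((i, j))
    return hits


def render(seq1, seq2, hits):
    """Draw the matrix as text: seq1 along X (columns), seq2 along Y (rows)."""
    rows = []
    for j in range(len(seq2)):
        rows.append("".join("*" if (i, j) in hits else "."
                            for i in range(len(seq1))))
    return "\n".join(rows)


seq = "AAGAGGAG"  # low-complexity example compared against itself
for w in (1, 4):
    hits = dot_matrix(seq, seq, window=w, min_identity=1.0)
    print(f"window = {w}: {len(hits)} dots")
    print(render(seq, seq, hits))
```

With single-nucleotide comparison, the purine-only sequence produces many off-diagonal dots; with a 4-nt window at 100% identity, only the main diagonal remains.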

Comparison of human chromosome 7 “draft” sequence (2001) with “near-complete” sequence (2004)

Nature 431:935, 2004

How do you interpret the data in this figure?

[Figure axes: 2001 sequence vs. 2004 sequence (fewer errors); inset: blowup of 500 kb region]

Date post:	15-Jan-2016
Upload:	corey-lewis