Genomes are large systems with small-system statistics:
Genome Growth by Duplication
National Tsinghua University, February 19, 2003
Institute of Physics, Academia Sinica, March 20, 2003
HC Lee, Dept Physics & Dept Life Sciences
National Central University
Plan of Presentation
• Introduction
• Frequency of words in genomes
• Large system & small-system statistics
• Model for genome growth & evolution
• Some results
• Discussion - The RNA world, spandrels, codons, punctuated equilibrium, the Universal Ancestor, etc.
• Outlook
The Book of Life
Many completed genomes
1995-2002 – Bacteria (about 80 organisms); 0.5-5 Mb; hundreds to 2000 genes
1996 April – Yeast (Saccharomyces cerevisiae) 12 Mb, 5,500 genes
1998 Dec. – Worm (Caenorhabditis elegans) 97 Mb, 19,000 genes
2000 March – Fly (Drosophila melanogaster) 137 Mb, 13,500 genes
2000 Dec. – Mustard (Arabidopsis thaliana) 125 Mb, 25,498 genes
2001 Feb. – Human (Homo sapiens) 3000 Mb, 35,000~40,000 genes
CBL@NCU
New way to do Life Science Research
• in vivo (in the living organism)
• in vitro (in the test tube)
• in silico (in the computer)
“It is much easier to teach biology to people from a math, physics or computer-science background than to teach a biologist how to code well.” - Nature, February 15, 2001, p963
Life Science in silico = [biology] + [computer science] + [math & physics] + [sequence data]
• Local - “Biology”
– Individual, specificity, uniqueness
• Global - “Physics”
– Class, generality, universality
Today’s talk:
– Global treatment of microbial genomes
– Identify universality
– Hypothesis for early growth of universal ancestral genome
Two approaches to Life Science
Structure of genome is complex
• Many levels – genes, intergenic regions, regulatory sections
• Gene – network of introns and exons
• Genome – network of genes
• Random mutation
• Genes are products of the “blind watchmaker”
• Once made, a gene is repeatedly copied
– paralogues, orthologues and pseudogenes
• Genes are protected against rapid mutation
Genome as text
• Genome is a text of four letters – A, C, G, T
• Frequencies of k-mers characterize the whole genome
– E.g. counting frequencies of 7-mers with a “sliding window”: N(GTTACCC) = N(GTTACCC) + 1
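The sliding-window count can be sketched in a few lines of Python. This is only an illustration: the function name `count_kmers` and the toy sequence are assumptions made for the example, not part of the original analysis.

```python
from collections import Counter

def count_kmers(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers with a window that slides one base at a time."""
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        word = sequence[start:start + k]
        counts[word] += 1          # N(word) = N(word) + 1
    return counts

# Made-up toy sequence (hypothetical, not genomic data)
toy_sequence = "GTTACCCGTTACCC"
toy_counts = count_kmers(toy_sequence, 7)
print(toy_counts["GTTACCC"])
```

A sequence of length L contributes L - k + 1 overlapping windows, so for a 1 Mb genome the total count is very nearly 1,000,000 for any small k.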
Textual statistics of genome almost random but NOT TRIVIALLY so
• Looks like a random text to the casual observer
• We know parts of it are coded
– Coded text also appears random but occupies almost no volume in the space of all texts
• Very hard to construct a dictionary
• Distribution of frequencies of k-mers
– Characterizes whole genomes
– Similar in coding and non-coding regions
– For short oligos, width of distribution many times (up to 80) wider than normal
– Disparity greater for smaller k
• Similar for other kinds of distributions
21st-century random text generator - Courtesy PY Lai
Genomes violently disobey rule of large systems
• Large systems have sharply defined averages
• Genomes are large texts with very fuzzy averages
– There are 64 3-letter words (3-mers); each should appear 15,625 +/- 125 times in a 1 Mb long genome
– In a random sequence, the chance that one 3-mer would appear more (less) than 24,000 (8,000) times is $10^{-830}$ ($10^{-980}$)
– In Treponema pallidum (syphilis; 1 Mb long), 6 3-mers (CGC, GCG, AAA, TTT, GCA, TGC) occur more than 24,000 times and 2 (CTA, TAG) appear less than 8,000 times
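The figures 15,625 +/- 125 follow from treating each 3-mer count as Poisson-distributed in a uniformly random 1 Mb text. The sketch below redoes that arithmetic and expresses the 24,000 and 8,000 thresholds in units of the standard deviation. The helper name `random_text_stats` is an assumption for illustration, and the Poisson approximation (s.d. = sqrt(mean)) is the one the slides themselves use.

```python
import math

def random_text_stats(length: int, k: int) -> tuple[float, float]:
    """Mean and Poisson s.d. of a k-mer count in a uniformly random text."""
    mean = length / 4 ** k      # 4**k equally likely words
    return mean, math.sqrt(mean)

mean_3mer, sd_3mer = random_text_stats(1_000_000, 3)
z_high = (24_000 - mean_3mer) / sd_3mer   # how many s.d. above the mean
z_low = (mean_3mer - 8_000) / sd_3mer     # how many s.d. below the mean
print(mean_3mer, sd_3mer, z_high, z_low)
```

Thresholds that lie more than sixty standard deviations from the mean are why the quoted random-sequence probabilities are so extremely small.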
Bacterial genomes are UNLIKE random sequences
E. coli, 50% A+T
B. subtilis, 57% A+T
M. jannaschii, 70% A+T
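If genome grows randomly by single nucleotide then distribution is Poisson
Poisson: $P(f=k) = \lambda^k e^{-\lambda}/k!$; $\langle f \rangle = \lambda$, D (stand. dev.) $= \langle f \rangle^{1/2}$
Gamma: $G(f) = f^{\alpha-1} e^{-f/\beta} / (\beta^{\alpha}\,\Gamma(\alpha))$; $\langle f \rangle = \alpha\beta$, $D = \alpha^{1/2}\beta$
Random single nucleotide; D = 15.5
E. coli: $\alpha = 3.05$, $\beta = 80.0$; D = 140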
Non-uniform nucleotide composition breaks the n-mer Poisson distribution into n+1 peaks
Given [at]/[cg] = 70/30. If the mean frequency is 244, then the mean frequency of 6-mers with k a or t’s and 6-k c or g’s is
$f_k = 244\,(0.7)^k (0.3)^{6-k} / (0.5)^6$

k    f_k
0    11.4
1    26.6
2    62.0
3    144
4    337
5    787
6    1837

[Figure: frequency distribution of 6-mers (x: frequency of 6-mers; y: number of 6-mers), M. jannaschii vs. random single-nucleotide sequence, with peaks at the f_k values listed above]
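The peak positions for a 70/30 composition can be recomputed directly from the slide's formula, f_k = 244 (0.7)^k (0.3)^(6-k) / (0.5)^6. The function name `peak_positions` is hypothetical; the numbers 244, 0.7 and 6 are the ones quoted on the slide.

```python
def peak_positions(mean: float, at_fraction: float, n: int) -> list[float]:
    """Mean count of n-mers containing k A/T letters, for k = 0..n."""
    gc_fraction = 1.0 - at_fraction
    return [
        mean * at_fraction ** k * gc_fraction ** (n - k) / 0.5 ** n
        for k in range(n + 1)
    ]

peaks = peak_positions(244, 0.7, 6)
for k, f_k in enumerate(peaks):
    print(k, round(f_k, 1))
```

The computed values agree with the listed f_k to within rounding, which suggests the listed values are a rounded or truncated version of the same formula.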
Similar discrepancy in other genomes and for other word lengths

Word length (k)   Average word count/1 Mb   Genomic deviation (error)   Poisson deviation
2                 62,500                    10,580 (2,040)              250
3                 15,625                    4,080 (630)                 125
4                 3,906                     1,490 (210)                 62.5
5                 977                       469 (66)                    31.2
6                 244                       141 (21)                    15.6
7                 61                        41.9 (6.7)                  7.8
8                 15.3                      12.4 (2.3)                  3.9
9                 3.81                      3.84 (0.84)                 1.9
10                0.95                      1.33 (0.34)                 0.98

rms deviation of word count in genomes
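To see how the disparity shrinks as k grows, one can divide each genomic deviation by the Poisson value sqrt(mean). The sketch below does this with the tabulated numbers; the dictionary name `genomic_table` and the function `width_ratio` are illustrative choices, not names from the original work.

```python
import math

# k -> (average word count per 1 Mb, genomic rms deviation), copied from the table
genomic_table = {
    2: (62500, 10580), 3: (15625, 4080), 4: (3906, 1490),
    5: (977, 469), 6: (244, 141), 7: (61, 41.9),
    8: (15.3, 12.4), 9: (3.81, 3.84), 10: (0.95, 1.33),
}

def width_ratio(mean: float, genomic_sd: float) -> float:
    """Genomic deviation in units of the Poisson deviation sqrt(mean)."""
    return genomic_sd / math.sqrt(mean)

ratios = {k: width_ratio(mean, sd) for k, (mean, sd) in genomic_table.items()}
for k, ratio in ratios.items():
    print(k, round(ratio, 1))
```

For these averaged values the ratio falls steadily from about 42 at k = 2 to a little above 1 at k = 10.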
Effective length: length of a sequence with Poisson distribution having the same mean to s.d. ratio as the genome sequence.
Recall for Poisson, s.d. = sqrt(mean)
$L_{eff} = \left((\mathrm{mean}/\mathrm{s.d.})_{gen}\right)^2 \times 4^k$
Statistically genomes resemble random sequences of much shorter lengths
k     Mean      s.d.      Effective genome length (kb)
2     62,500    10,580    0.56
3     15,625    4,080     0.94
4     3,906     1,490     1.8
5     977       469       4.4
6     244       141       12
7     61        41.9      35
8     15.3      12.4      100
9     3.81      3.84      260
10    0.95      1.33      540
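The effective lengths in the table follow from the definition: for a Poisson sequence of length L the mean k-mer count is L / 4^k and the s.d. is its square root, so (mean/s.d.)^2 = L / 4^k. A minimal check, with the hypothetical helper name `effective_length`:

```python
def effective_length(mean: float, sd: float, k: int) -> float:
    """Length (in bases) of a Poisson text with the same mean/s.d. ratio."""
    return (mean / sd) ** 2 * 4 ** k

# (k, mean, genomic s.d.) rows taken from the table
rows = [(2, 62500, 10580), (6, 244, 141), (10, 0.95, 1.33)]
for k, mean, sd in rows:
    print(k, round(effective_length(mean, sd, k) / 1000, 2), "kb")
```

The results agree with the tabulated 0.56 kb, 12 kb and 540 kb to within rounding.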
How does a genome evolve and grow?
• Evolve by random mutation – replacement, insertion, deletion
• Plus natural selection
– Fitness acts only on phenotype, not directly on genome
– Selection is made on genome generated randomly
• Genome cannot grow through random mutation alone
– Otherwise Poisson distribution
• Must grow to long length while retaining statistical characteristics of SHORT genome
The genome is a self plagiarizer
• Genomes have many homologous genes
• 50%, probably much more, of the human genome is composed of recent repeats
– Many traces of repeats obliterated by mutation
– Lower organisms may have longer genomes
• Five types of repeats
– transposable elements; processed pseudogenes; simple k-mer repeats; segmental duplications (10-300 kb); (large) blocks of tandemly repeated sequences
A Hypothesis for Genome Growth
• Random early growth
• Followed by
1. random duplication and 2. random mutation
Self-copying – a strategy for retaining and reusing hard-to-come-by coded sequences (i.e. genes)
The Model
• The genome grows by random single base addition from nothing to an initial length much shorter than final length
• Thereafter the genome evolves by random mutation and random duplication, with a fixed frequency ratio
The Model (continued)
• Mutation is standard single-point replacement (no insertion and deletion)
• Segmental duplication involves three stochastic steps
– random selection of site of copied segment
– weighted random selection of length of copied segment
– random selection of insertion site of copied segment
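The two-phase model can be sketched as a small simulation. Everything below is an illustrative reading of the slides rather than the authors' code: the function name `grow_genome`, the use of a whole number of mutations per duplication to realize the "fixed frequency ratio", the exponential draw for segment length, and the final truncation to the target length are all assumptions.

```python
import random

BASES = "ACGT"

def grow_genome(initial_length: int, final_length: int,
                mutations_per_duplication: int, mean_segment: float,
                seed: int = 0) -> str:
    rng = random.Random(seed)
    # Phase 1: random single-base addition up to the initial length
    genome = [rng.choice(BASES) for _ in range(initial_length)]
    # Phase 2: point mutations and segmental duplications at a fixed ratio
    while len(genome) < final_length:
        for _ in range(mutations_per_duplication):
            site = rng.randrange(len(genome))
            # single-point replacement by a different base
            genome[site] = rng.choice([b for b in BASES if b != genome[site]])
        # assumed length distribution: exponential with the given mean
        length = round(rng.expovariate(1.0 / mean_segment))
        length = max(1, min(len(genome), length))
        start = rng.randrange(len(genome) - length + 1)   # site of copied segment
        segment = genome[start:start + length]
        insert_at = rng.randrange(len(genome) + 1)        # insertion site
        genome[insert_at:insert_at] = segment
    return "".join(genome[:final_length])   # truncate any overshoot (assumption)

model_sequence = grow_genome(100, 2000, 5, 50.0, seed=1)
print(len(model_sequence), model_sequence[:40])
```

The toy sizes keep the run fast. The slides quote a starting length of 1000 and a final length of 1 million, which this list-based sketch would handle only slowly.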
Stochastic selection of the length of self-copied segments
• Use Erlang density distribution function for segment length l
$f(l) = \frac{1}{\sigma\, m!}\,(l/\sigma)^m \exp(-l/\sigma)$
(m! becomes the gamma function $\Gamma(m+1)$ when m is real)
• Mean $\langle l \rangle = (m+1)\sigma$; standard deviation $= (m+1)^{1/2}\sigma$
Nothing special about this particular function, but mean and s.d. are important
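An Erlang density with integer m and length scale σ is a gamma distribution with shape m + 1 and scale σ, so segment lengths can be drawn with the standard library's `random.gammavariate`. The sketch below checks the mean (m+1)σ and standard deviation sqrt(m+1)σ numerically; the function name `draw_segment_length` and the parameter values m = 2, σ = 5000 are arbitrary choices for illustration.

```python
import random
import statistics

def draw_segment_length(m: int, sigma: float, rng: random.Random) -> float:
    """Draw one length from an Erlang density (gamma with shape m + 1, scale sigma)."""
    return rng.gammavariate(m + 1, sigma)

rng = random.Random(0)
lengths = [draw_segment_length(2, 5000.0, rng) for _ in range(20000)]
sample_mean = statistics.fmean(lengths)
sample_sd = statistics.pstdev(lengths)
print(round(sample_mean), round(sample_sd))   # compare with 15000 and about 8660
```

In a base-level simulation the drawn value would still need rounding to an integer number of bases.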
First generation result
• Distribution of 6-mer frequency
• Starting genome length 1000
• Final genome length 1 million
• Mutation to duplication event ratio 100 < η < 4000
• Length scale for copied segments 2500 < σ < 100 K
• Compared with E. coli (4.5 Mbp), B. subtilis (4.2 Mbp), M. jannaschii (1.7 Mbp) (all normalized to 1 Mbp)
LS Hsieh, LF Luo, FM Ji and HCL, PRL 90 (2003) 18101
[Figures: frequency distribution of 6-mers (x: frequency of 6-mers; y: number of 6-mers)]
E. coli, [at]/[cg] = 50/50
– E. coli vs random: D = 140, 15.5
– E. coli vs mutation + repeat (ratio 500:1, sigma = 15k): D = 140, 144
B. subtilis, [at]/[cg] = 60/40
– B. subtilis vs random: D = 167, 79
– B. subtilis vs mutation + repeat (ratio 600:1, sigma = 15k): D = 167, 169
M. jannaschii, [at]/[cg] = 70/30
– M. jannaschii vs random: D = 320, 265
– M. jannaschii vs mutation + repeat (ratio 600:1, sigma = 15k): D = 320, 321
Organism / sequence                   [at]/[gc]   α      β      (2)    (3)    (4)    (5)
E. coli                               50/50                     140    147    213    252
  gamma distribution                              3.05   80.0   140    146    208    243
  random w/o self-copy (Poisson)                                15.6   3.6    20.7   10
  w/ self-copy (η = 500)                                        144    148    212    247
B. subtilis                           60/40                     168    223    316    400
  gamma distribution                              2.12   115    168    186    261    310
  random w/o self-copy (Poisson/7)                              79     68     109    117
  w/ self-copy (η = 600)                                        169    194    266    311
M. jannaschii                         70/30                     320    465    650    810
  gamma distribution                              0.58   418    320    439    609    767
  random w/o self-copy (Poisson/7)                              264    369    500    603
  w/ self-copy (η = 600)                                        321    462    635    783
Gamma-distribution rows list the fitted shape α and scale β; columns (2)-(5) are the n-th root moments.
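Gamma distribution reproduces higher moments
Gamma distribution: $D(x) = x^{\alpha-1} \exp(-x/\beta) / (\beta^{\alpha}\,\Gamma(\alpha))$
(n) = $\langle (x - \langle x \rangle)^n \rangle^{1/n}$; $\langle x \rangle = 244 = \alpha\beta$; (2) = $\alpha^{1/2}\beta$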
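Assuming the standard shape/scale parameterization of the gamma distribution (shape α, scale β, mean αβ, standard deviation α^{1/2}β), the two parameters can be recovered from a mean and a standard deviation by the method of moments. The helper name `gamma_parameters` is hypothetical; the inputs are the mean 244 and the genomic (2) values from the table.

```python
def gamma_parameters(mean: float, sd: float) -> tuple[float, float]:
    """Method-of-moments shape and scale of a gamma distribution."""
    alpha = (mean / sd) ** 2     # from mean = alpha*beta, sd = sqrt(alpha)*beta
    beta = sd ** 2 / mean
    return alpha, beta

for name, sd in [("E. coli", 140), ("B. subtilis", 168), ("M. jannaschii", 320)]:
    alpha, beta = gamma_parameters(244, sd)
    print(name, round(alpha, 2), round(beta, 1))
```

The recovered values are close to the tabulated (3.05, 80.0), (2.12, 115) and (0.58, 418), with small differences that may reflect rounding of the inputs.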
What about other k’s?
• Initial model good for k=6 but not so good for other k’s. Over-compensation (too broad) when k>6 and under-compensation (too narrow) when k<6.
• Good result for k=6 (length = 1 Mb) constrains the two parameters jointly. In the limit of very small mutation to duplication event ratio, η ~ 1, the duplication length scale is ~ 25 b.
• New model with short duplication length, ~ 25 b, and without mutation.
Density function for duplication segment length
• Recall Erlang density distribution function has mean and rms deviation
$\langle l \rangle = (m+1)\sigma$, $\Delta l = \sqrt{m+1}\,\sigma$
• For $\langle l \rangle$ = 25, have:
m    σ     Δl
0    25    25
2    8     14
4    5     11    Good!
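The three rows above follow from fixing the mean at 25 and solving for the scale: σ = ⟨l⟩/(m+1) and Δl = sqrt(m+1) σ. A short check, with the hypothetical helper name `erlang_scale_and_spread`:

```python
import math

def erlang_scale_and_spread(mean_length: float, m: int) -> tuple[float, float]:
    """Scale sigma and rms deviation of an Erlang density with a given mean."""
    sigma = mean_length / (m + 1)
    spread = math.sqrt(m + 1) * sigma
    return sigma, spread

for m in (0, 2, 4):
    sigma, spread = erlang_scale_and_spread(25, m)
    print(m, round(sigma, 1), round(spread, 1))
```

Larger m narrows the distribution at fixed mean, which is why m = 4 gives the smallest spread of the three.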
Comparison of k-mer distributions, k=5-9, for model sequence D and genome Treponema
Length of duplicated segments: 25 +/- 12 bp
Model sequence almost reproduces shape of genomic distributions

Word length (k)   T. pallidum   Genomic average (error)   Poisson   Present model
2                 8260          10,580 (2,040)            250       8207
3                 3870          4,080 (630)               125       3415
4                 1380          1,490 (210)               62.5      1202
5                 432           469 (66)                  31.2      402
6                 129           141 (21)                  15.6      134
7                 37.5          41.9 (6.7)                7.8       45.3
8                 11.0          12.4 (2.3)                3.9       15.9
9                 3.4           3.84 (0.84)               1.9       5.9
10                1.3           1.33 (0.34)               0.98      2.3

rms deviation of word count in genomes
[Figure: counts of dinucleotides (k=2); random sequence at 62500 +/- 250]
[Figure: counts of trinucleotides (k=3); random sequence at 15625 +/- 125]
[Figure: counts of tetranucleotides (k=4); random sequence at 3906 +/- 63]
[Figure: Methanococcus jannaschii and model sequence compared with random sequence]
Methanococcus jannaschii: 70% A+T, 30% C+G. Model sequence generated exactly as before, except 70% A+T in initial random sequence.
Result sensitive to parameters
• Parameter values for “good” model sequence:
- Initial random sequence length L0 ~ 1 kb
- Mean copied segment length <l> ~ 25 b
- rms Δl ~ 12 b
If L0 > 10 kb, no good results
If <l> = 15 b, sequence too random for k<5
If <l> = 40 b, sequence too choppy for k>6
If <l> = 25 b but Δl differs from ~ 12 b, agreement worsens
Discussion: The RNA World
• RNA was discovered in the early 1980s to have enzymatic activity – ribozymes can splice and replicate DNA sequences (Cech et al. 1981; Guerrier-Takada et al. 1983)
• The RNA world conjecture – early life had no proteins, only RNAs, which played the dual roles of genotype and phenotype
• Some present-day ribozymes are very small; the smallest hammerhead ribozyme has only 31 nucleotides; ribozymes in early life need not have been much larger
• In our model the small initial size of the genome necessarily implies an early RNA world
• A genome ~ 1K nt long is long enough to code the many small ribozymes (but not proteins) needed to propagate life
• Origin of this initial genome is not addressed in the model. It (or its precursor) could have arisen spontaneously - artificial ribozymes have been successfully isolated from pools of random RNA sequences (Ekland et al. 1995)
RNA World & size of early genome
• Recall that present-day ribozyme can be as small as 31 nt
• The average duplicated segment length of 25 nt in the model is very short compared to present-day genes that code for proteins, but likely represents a good portion of the length of a typical ribozyme encoded in the early universal genome of the RNA world
RNA World & length of duplicated segments
• Spandrels
– In architecture – the roughly triangular space between an arch, a wall and the ceiling
– In evolution – major category of important evolutionary features that were originally side effects and did not arise as adaptations (Gould and Lewontin 1979)
• Wide 3-mer/codon distribution or natural selection, which came first?
Are codons “spandrels”?
Are codons “spandrels”? (cont’d)
• Frequency distribution of 3-mers in genomes is about 40 x wider than Poisson. Was the widening caused by
– Uneven codon usage + natural selection? Or,
– Genome growth by segmental duplication?
• In RNA world, codons came after RNA and existence of replication machinery. Hence the following scenario: RNA + recombination > genome growth by stochastic duplication > extreme bias in 3-mer population > rise of codons
• In our model, codons are most likely spandrels
More spandrels
• The same goes for other oligonucleotides
Many oligonucleotides that are grossly over- or under-represented have biological functions. Evolution being an opportunistic process, these oligonucleotides could have been drafted to serve special biological purposes because they had already been made very copious or very rare by stochastic genome growth
• In bacterial genomes typically about 12% of genes represent recent duplication events
– Average gene is about 1000 bases long. Suggests about 12% of genome was generated by duplications of ~ 1000 b segments. Not yet incorporated into the model.
• In higher organisms a large number of repeat sequences with lengths ranging from 1 base to many kilobases are believed to have resulted from at least five modes of duplication
Duplication continued and expanded after the rise of proteins
• How have genes been duplicated at the high rate of about 1% per gene per million years? (Lynch 2000)
• Why are there so many duplicate genes in all life forms? (Maynard 1998, Otto & Yong 2001)
• Were duplicate genes selected because they contribute to genetic robustness (by protecting the genome against harmful mutations)? (Gu et al. 2003)
– Likely not; most likely the high frequency of occurrence of duplicate genes is a spandrel
Growth by duplication (of gene-size segments) may explain:
• Great debate in palaeontology and evolution - Dawkins & others vs. (the late) Gould & Eldredge: evolution went gradually and evenly vs. by stochastic bursts with intervals of stasis
Our model provides genetic basis for both. Mutation and small duplication induce gradual change; occasional large duplication can induce abrupt and seemingly discontinuous change
Classical Darwinian Gradualism or Punctuated equilibrium?
• Phylogeny and the Universal Ancestor
– If extremely frequent and extremely rare oligos (EFEROs) are the remnants of a much shorter early sequence, then there should exist such a short sequence during some stage of the genome growth.
– Then we may be able to use the set of EFEROs in whole genomes to construct phylogenetic trees of whole genomes.
– At each node of the tree would be an ancestor sequence characterized by a set of EFEROs.
– The ancestor of Life would be characterized by the minimum set of EFEROs.
Discussion (cont’d)
• Distribution of frequency of k-mers in bacterial genomes hugely wider than Poisson – larger for smaller k
• Can be explained by simple two-phase genome growth model:
– first grow to short (~1 kb) random sequence
– then grow by random duplications of segments 25 +/- 12 b long
• Reproduces genomic statistics for k=2-8
• Universal ancestral genome lived in an RNA world
– Replication carried out by ribozymes ~ 30 nt
– Codons and many signal sequences are spandrels
Summary
Outlook
• Need to understand distribution for ALL k’s
– There are repeated k-mers of k up to ~1000
• Other oddities
– E.g. distribution of entropies of k-mers
• Empirical verification
– Can duplication growth be independently verified?
• Time scale
– When did growth happen? At what rate? How did growth stabilize? Has it stabilized?
• Phylogeny
– Can we build a good tree based on the model? Can we learn anything about the Universal Ancestor? Is there a Universal Ancestor?
CBL Lab @NCU, Phys. & Life Sci.
*# L.C. Hsieh, # J.L. Lo, # T.Y. Chen, # J.P. Yiu, # Z.Y. Guo, # Z.R. Chiu, # H.Y. Bai, # C.H. Chang, # H.D. Chen, # W.L. Fan
Collaborators
J.Z. Horng, # F.M. Lin – Horng Lab, NCU, Comp. Sci.
* L.F. Luo – Univ. Outer Mongolia
* F.M. Ji – Beijing Jiaotong Univ.
Rosie Redfield – Zoology, UBC
* This work; # students
*# All simulation in this work done by L.C. Hsieh
Thank you for your attention
Genomes are large systems with small-system statistics:
Genome Growth by Duplication
Lecture II
Winter School on Modern Biophysics, National Taiwan University
December 16-18, 2002
HC Lee, Dept Physics & Dept Life Science
National Central University
Result sensitive to values of two parameters
• Mutation to duplication event ratio η
– η ~ 500 for bacterial genomes
– If η is too large
  • too many mutations
  • gets long genome with Poisson distribution
– If η is too small
  • too much duplication
  • too few mutations
  • gets multiple copies of random short (initial) genome (distribution too wide)
[Figure panels: η = 100, 250, 500, 2000, 4000]
Mutation/self-copy = η; scale of repeat length = σ; P(l)/P(l’) = exp{-(l-l’)/σ}; [at]:[cg] = 70:30
Mutation to self-copy ratio is 500 +/- 100
(genome-like)
• Length scale for copied segments σ
– σ ~ 10 K to 25 K for bacterial genomes
– If σ is too small
  • genome grows too slowly
  • too many mutations
  • gets long genome with Poisson distribution
– If σ is too large
  • genome grows too quickly
  • too few mutations
  • gets multiple copies of random short (initial) genome (distribution too wide)
Result sensitive to values of two parameters (cont’d)
[Figure panels: σ = 0.5K, 2.5K, 15K, 50K, 1000K]
Scale of repeat length = σ; P(l)/P(l’) = exp{-(l-l’)/σ}; mutation/self-copy = η; [at]:[cg] = 70:30
Scale of repeat length cannot be too short
(genome-like)
[Figure: frequency distribution of 6-mers (x: frequency of oligo; y: number of oligos)]