
Genomes are large systems with small-system statistics:

Genome Growth by Duplication

National Tsinghua University, February 19, 2003

Institute of Physics, Academia Sinica, March 20, 2003

HC Lee, Dept. of Physics & Dept. of Life Sciences

National Central University

Plan of Presentation

• Introduction
• Frequency of words in genomes
• Large system & small-system statistics
• Model for genome growth & evolution
• Some results
• Discussion: the RNA world, spandrels, codons, punctuated equilibrium, the Universal Ancestor, etc.
• Outlook

The Book of Life

Many completed genomes

• 1995-2002: Bacteria (about 80 organisms); 0.5-5 Mb; hundreds to 2,000 genes
• April 1996: Yeast (Saccharomyces cerevisiae); 12 Mb; 5,500 genes
• Dec. 1998: Worm (Caenorhabditis elegans); 97 Mb; 19,000 genes
• March 2000: Fly (Drosophila melanogaster); 137 Mb; 13,500 genes
• Dec. 2000: Mustard (Arabidopsis thaliana); 125 Mb; 25,498 genes
• Feb. 2001: Human (Homo sapiens); 3,000 Mb; 35,000-40,000 genes

CBL@NCU

New way to do Life Science Research

• in vivo (in the living organism)
• in vitro (in the test tube)
• in silico (in the computer)


“It is much easier to teach biology to people from a math, physics or computer-science background than to teach a biologist how to code well.” - Nature, February 15, 2001, p963


Life Science in silico = [biology] + [computer-science] + [math & physics] + [sequence data]

• Local, “Biology”: individual, specificity, uniqueness
• Global, “Physics”: class, generality, universality

Today’s talk:
• Global treatment of microbial genomes
• Identify universality
• Hypothesis for early growth of the universal ancestral genome

Two approaches to Life Science

Structure of genome is complex

• Many levels: genes, intergenic regions, regulatory sections
• Gene: network of introns and exons
• Genome: network of genes
• Random mutation
• Genes are products of the “blind watchmaker”
• Once made, a gene is repeatedly copied: paralogues, orthologues and pseudogenes
• Genes are protected against rapid mutation

Genome as text

• Genome is a text of four letters: A, C, G, T
• Frequencies of k-mers characterize the whole genome
  – E.g. counting frequencies of 7-mers with a “sliding window”: each time the window shows GTTACCC, N(GTTACCC) = N(GTTACCC) + 1
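The sliding-window count described above can be sketched in a few lines of Python. This is only an illustration: the function name `count_kmers` and the choice to skip windows containing letters other than A, C, G, T are assumptions of the sketch, not something specified in the talk.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every overlapping k-mer by sliding a window of width k one base at a time."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Assumption of this sketch: windows with ambiguous bases (e.g. N) are skipped.
        if set(word) <= set("ACGT"):
            counts[word] += 1  # N(word) = N(word) + 1
    return counts

# Toy sequence containing GTTACCC twice, with 8 windows of width 7 in total.
counts = count_kmers("GTTACCCGTTACCC", 7)
print(counts["GTTACCC"])
```

For a whole genome the same loop applies unchanged; only the input string is longer.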

Textual statistics of genome almost random, but NOT TRIVIALLY so

• Looks like a random text to a casual observer
• We know parts of it are coded
  – Coded text also appears random but occupies almost no volume in the space of all texts
• Very hard to construct a dictionary
• Distribution of frequencies of k-mers
  – Characterizes whole genomes
  – Similar in coding and non-coding regions
  – For short oligos, width of distribution is many times (up to 80) wider than normal
  – Disparity greater for smaller k
• Similar for other kinds of distributions

21st-century random text generator (courtesy PY Lai)

Genomes violently disobey rule of large systems

• Large systems have sharply defined averages
• Genomes are large texts with very fuzzy averages
  – There are 64 3-letter words (3-mers); each should appear 15,625 +/- 125 times in a 1 Mb long genome
  – In a random sequence, the chance that one 3-mer would appear more (less) than 24,000 (8,000) times is $10^{-830}$ ($10^{-980}$)
  – In Treponema pallidum (syphilis; 1 Mb long), 6 3-mers (CGC, GCG, AAA, TTT, GCA, TGC) occur more than 24,000 times and 2 (CTA, TAG) appear fewer than 8,000 times
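The 15,625 +/- 125 figure follows from treating each 3-mer count as a Poisson variable in a random sequence with equal base frequencies. The sketch below reproduces that arithmetic and expresses the 24,000 and 8,000 thresholds in units of the standard deviation. The helper name `random_kmer_stats` is hypothetical, and the small edge correction (a sequence of length L has L - k + 1 windows) is ignored.

```python
import math

def random_kmer_stats(length, k):
    """Mean and Poisson standard deviation of a k-mer count in a random,
    equal-composition sequence of the given length (edge effects ignored)."""
    mean = length / 4**k
    return mean, math.sqrt(mean)

mean, sd = random_kmer_stats(1_000_000, 3)

# How far the thresholds quoted on the slide lie from the mean, in standard deviations.
z_high = (24_000 - mean) / sd
z_low = (8_000 - mean) / sd
print(mean, sd, z_high, z_low)
```

Deviations of more than 60 standard deviations are what make the quoted tail probabilities so small.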

Bacterial genomes are UNLIKE random sequences

E. coli, 50% A+T

B. subtilis, 57% A+T

M. jannaschii, 70% A+T

If genome grows randomly by single nucleotide then distribution is Poisson

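Poisson: $P(f=k) = \lambda^k e^{-\lambda}/k!$, with $\langle f\rangle = \lambda$ and $\Delta$ (standard deviation) $= \langle f\rangle^{1/2}$

Gamma: $G(f) = f^{\alpha-1} e^{-f/\beta} / (\beta^{\alpha}\,\Gamma(\alpha))$, with $\langle f\rangle = \alpha\beta$ and $\Delta = \alpha^{1/2}\beta$

Random single nucleotide: Δ = 15.5

E. coli: α = 3.05, β = 80.0; Δ = 140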

Non-uniform nucleotide composition breaks the n-mer Poisson distribution into n+1 peaks

Given [at]/[cg] = 70/30 and a mean frequency of 244, the mean frequency of 6-mers with k A's or T's and 6-k C's or G's is

$f_k = 244\,(0.7)^k\,(0.3)^{6-k} / (0.5)^6$

k    f_k
0    11.4
1    26.6
2    62.0
3    144
4    337
5    787
6    1837

[Figure: number of 6-mers vs. frequency of 6-mers; M. jannaschii compared with a random single-nucleotide sequence, whose peaks sit at the f_k values above]
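As a check on the peak positions, the formula for f_k can be evaluated directly. The sketch below is illustrative only: the function name `composition_peaks` is hypothetical, and it assumes that A and T are equally likely, as are C and G.

```python
import math

def composition_peaks(mean_count, at_fraction, n):
    """Mean count of n-mers containing k A/T letters and n-k C/G letters,
    for k = 0..n, relative to the equal-composition mean count."""
    cg_fraction = 1.0 - at_fraction
    return [mean_count * at_fraction**k * cg_fraction**(n - k) / 0.5**n
            for k in range(n + 1)]

n = 6
peaks = composition_peaks(244, 0.7, n)
for k, f_k in enumerate(peaks):
    print(k, round(f_k, 1))

# There are C(n, k) * 2**n words in class k; averaging over all 4**n words
# recovers the overall mean count of 244.
average = sum(math.comb(n, k) * 2**n * f for k, f in enumerate(peaks)) / 4**n
```

The seven values agree with the k / f_k table to within rounding.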

Similar discrepancy in other genomes and for other word lengths

Word length (k)   Average word count / 1 Mb   Genomic deviation (error)   Poisson deviation
 2                62,500                      10,580 (2,040)              250
 3                15,625                      4,080 (630)                 125
 4                3,906                       1,490 (210)                 62.5
 5                977                         469 (66)                    31.2
 6                244                         141 (21)                    15.6
 7                61                          41.9 (6.7)                  7.8
 8                15.3                        12.4 (2.3)                  3.9
 9                3.81                        3.84 (0.84)                 1.9
10                0.95                        1.33 (0.34)                 0.98

rms deviation of word count in genomes

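Effective length: length of a random sequence with Poisson distribution having the same mean to s.d. ratio as the genome sequence.

Recall that for Poisson, s.d. = sqrt(mean), so a random sequence of length L has (mean/s.d.)^2 = mean = L/4^k. Hence

$L_{\rm eff} = \left( (\mathrm{mean}/\mathrm{s.d.})_{\rm gen} \right)^2 \, 4^k$

Statistically, genomes resemble random sequences of much shorter lengths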

k    Mean      s.d.     Effective genome length (kb)
 2   62,500    10,580   0.56
 3   15,625    4,080    0.94
 4   3,906     1,490    1.8
 5   977       469      4.4
 6   244       141      12
 7   61        41.9     35
 8   15.3      12.4     100
 9   3.81      3.84     260
10   0.95      1.33     540
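The effective lengths in the table can be recomputed from the mean and s.d. columns with the formula L_eff = ((mean/s.d.)_gen)^2 4^k. The sketch below does so; the function name `effective_length` is hypothetical, and the input numbers are simply those of the table.

```python
import math

def effective_length(mean, sd, k):
    """Length of a random (Poisson) sequence with the same mean/s.d. ratio for k-mers."""
    return (mean / sd) ** 2 * 4**k

# (k, mean word count per 1 Mb, genomic s.d.) as listed in the table.
table = [(2, 62500, 10580), (3, 15625, 4080), (4, 3906, 1490),
         (5, 977, 469), (6, 244, 141), (7, 61, 41.9),
         (8, 15.3, 12.4), (9, 3.81, 3.84), (10, 0.95, 1.33)]

for k, mean, sd in table:
    print(k, round(effective_length(mean, sd, k) / 1000, 2), "kb")

# Sanity check: a truly random 1 Mb sequence has s.d. = sqrt(mean) and L_eff = 1 Mb.
random_mean = 1_000_000 / 4**6
random_leff = effective_length(random_mean, math.sqrt(random_mean), 6)
```

The recomputed values match the last column to the precision quoted there.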

How does a genome evolve and grow?

• Evolve by random mutation: replacement, insertion, deletion
• Plus natural selection
  – Fitness acts only on phenotype, not directly on genome
  – Selection is made on genome generated randomly
• Genome cannot grow through random mutation alone
  – Otherwise Poisson distribution
• Must grow to long length while retaining statistical characteristics of SHORT genome

The genome is a self plagiarizer

• Genomes have many homologous genes
• 50%, probably much more, of the human genome is composed of recent repeats
  – Many traces of repeats obliterated by mutation
  – Lower organisms may have longer genomes
• Five types of repeats
  – transposable elements; processed pseudogenes; simple k-mer repeats; segmental duplications (10-300 kb); (large) blocks of tandemly repeated sequences

A Hypothesis for Genome Growth

• Random early growth
• Followed by
  1. random duplication and
  2. random mutation

Self-copying: a strategy for retaining, and making multiple use of, hard-to-come-by coded sequences (i.e. genes)

The Model

• The genome grows by random single base addition from nothing to an initial length much shorter than final length

• Thereafter the genome evolves by random mutation and random duplication, with a fixed frequency ratio

The Model (continued)

• Mutation is standard single-point replacement (no insertion or deletion)
• Segmental duplication involves three stochastic steps
  – random selection of the site of the copied segment
  – weighted random selection of the length of the copied segment
  – random selection of the insertion site of the copied segment

Stochastic selection of the length of self-copied segments

• Use Erlang density distribution function for segment length l

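  $f(l) = \frac{1}{\sigma\, m!} \left(\frac{l}{\sigma}\right)^{m} \exp(-l/\sigma)$

  (m! becomes the gamma function $\Gamma(m+1)$ when m is real)

• Mean $\langle l\rangle = (m+1)\,\sigma$; standard deviation $= (m+1)^{1/2}\,\sigma$

Nothing special about this particular function, but the mean and s.d. are important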
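The model described on the preceding slides (random initial sequence, then point mutations and segmental duplications in a fixed ratio, with Erlang-distributed segment lengths) can be sketched as follows. This is a minimal illustration, not the simulation code used for the talk. The function name `grow_genome`, the parameter names, the rule that a mutated site is redrawn from the base composition, and the truncation to the final length are all assumptions of the sketch. Python's `random.gammavariate(m + 1, sigma)` samples the Erlang density with shape m + 1 and scale σ.

```python
import random

def grow_genome(initial_length, final_length, eta, sigma, m=0,
                at_fraction=0.5, seed=0):
    """Grow a sequence by eta point mutations per segmental duplication."""
    rng = random.Random(seed)
    # Base weights for A, C, G, T given the A+T fraction (assumed symmetric).
    weights = [at_fraction / 2, (1 - at_fraction) / 2,
               (1 - at_fraction) / 2, at_fraction / 2]
    # Phase 1: random single-base growth up to the initial length.
    genome = rng.choices("ACGT", weights=weights, k=initial_length)
    # Phase 2: mutation and duplication events in the ratio eta : 1.
    while len(genome) < final_length:
        for _ in range(eta):
            pos = rng.randrange(len(genome))
            # Assumption: the replacement base is drawn from the same composition.
            genome[pos] = rng.choices("ACGT", weights=weights)[0]
        # Step 1 of duplication: Erlang-distributed segment length (at least 1 base).
        length = max(1, round(rng.gammavariate(m + 1, sigma)))
        length = min(length, len(genome))
        # Step 2: random site of the copied segment.
        start = rng.randrange(len(genome) - length + 1)
        segment = genome[start:start + length]
        # Step 3: random insertion site.
        insert_at = rng.randrange(len(genome) + 1)
        genome[insert_at:insert_at] = segment
    return "".join(genome[:final_length])

# Small demonstration run; the slides use a final length of 1 million.
genome = grow_genome(1000, 20000, eta=5, sigma=5, m=4)
print(len(genome))
```

List insertion is slow for megabase sequences, so a production run would want a more efficient data structure; the sketch favours readability.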

First generation result

• Distribution of 6-mer frequency
• Starting genome length 1000
• Final genome length 1 million
• Mutation to duplication event ratio: 100 < η < 4000
• Length scale for copied segments: 2500 < σ < 100 K
• Compared with E. coli (4.5 Mbp), B. subtilis (4.2 Mbp), M. jannaschii (1.7 Mbp) (all normalized to 1 Mbp)

LC Hsieh, LF Luo, FM Ji and HC Lee, PRL 90 (2003) 018101

[Figure: number of 6-mers vs. frequency of 6-mers; E. coli, [at]/[cg] = 50/50]
• E. coli vs random: Δ = 140, 15.5
• E. coli vs mutation + repeat (ratio 500:1, σ = 15 k): Δ = 140, 144

[Figure: number of 6-mers vs. frequency of 6-mers; B. subtilis, [at]/[cg] = 60/40]
• B. subtilis vs random: Δ = 167, 79
• B. subtilis vs mutation + repeat (ratio 600:1, σ = 15 k): Δ = 167, 169

[Figure: number of 6-mers vs. frequency of 6-mers; M. jannaschii, [at]/[cg] = 70/30]
• M. jannaschii vs random: Δ = 320, 265
• M. jannaschii vs mutation + repeat (ratio 600:1, σ = 15 k): Δ = 320, 321

Organism / model                           [at]/[gc]   Δ(2)   Δ(3)   Δ(4)   Δ(5)
E. coli                                    50/50       140    147    213    252
  gamma distribution (α = 3.05, β = 80.0)              140    146    208    243
  random w/o self-copy (Poisson)                       15.6   3.6    20.7   10
  w/ self-copy (η = 500)                               144    148    212    247
B. subtilis                                60/40       168    223    316    400
  gamma distribution (α = 2.12, β = 115)               168    186    261    310
  random w/o self-copy (Poisson/7)                     79     68     109    117
  w/ self-copy (η = 600)                               169    194    266    311
M. jannaschii                              70/30       320    465    650    810
  gamma distribution (α = 0.58, β = 418)               320    439    609    767
  random w/o self-copy (Poisson/7)                     264    369    500    603
  w/ self-copy (η = 600)                               321    462    635    783

Gamma distribution reproduces higher moments

Gamma distribution: $D(x) = x^{\alpha-1} \exp(-x/\beta) / (\beta^{\alpha}\,\Gamma(\alpha))$

$\Delta^{(n)} = \langle (x - \langle x\rangle)^n \rangle^{1/n}$; $\langle x\rangle = \alpha\beta = 244$; $\Delta^{(2)} = \alpha^{1/2}\beta$
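The gamma rows of the table can be cross-checked analytically. For a gamma distribution with shape α and scale β the cumulants are κ_n = (n-1)! α β^n, and the central moments up to order 5 follow from the standard cumulant relations. The sketch below, with the hypothetical helper `gamma_deltas`, computes Δ(n) = (n-th central moment)^(1/n) for the E. coli parameters. The shape/scale reading of the two fitted numbers is an assumption of this sketch, chosen because it reproduces the tabulated mean of 244.

```python
import math

def gamma_deltas(alpha, beta):
    """Return {n: n-th root of the n-th central moment} for n = 2..5
    of a gamma distribution with shape alpha and scale beta."""
    # Cumulants of the gamma distribution: kappa_n = (n-1)! * alpha * beta**n.
    k2, k3, k4, k5 = (math.factorial(n - 1) * alpha * beta**n for n in (2, 3, 4, 5))
    # Central moments expressed through cumulants.
    mu = {2: k2, 3: k3, 4: k4 + 3 * k2**2, 5: k5 + 10 * k3 * k2}
    return {n: mu[n] ** (1.0 / n) for n in mu}

ecoli = gamma_deltas(3.05, 80.0)
for n in sorted(ecoli):
    print(n, round(ecoli[n], 1))
```

The results agree with the E. coli gamma row (140, 146, 208, 243) to within the rounding of α and β.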

What about other k’s?

• Initial model is good for k=6 but not so good for other k’s: over-compensation (too broad) when k>6 and under-compensation (too narrow) when k<6.

• Good result for k=6 (length = 1 Mb) requires […]. In the limit of very small mutation to duplication event ratio, or η ~ 1, σ ~ 25 b.

• New model with short duplication length, ~ 25 b, and without mutation.

Density function for duplication segment length

• Recall Erlang density distribution function has mean and rms deviation

  $\langle l\rangle = (m+1)\,\sigma$, $\Delta l = \sqrt{m+1}\,\sigma$

• For < l > = 25, have:

  m    σ     Δl
  0    25    25
  2    8     14
  4    5     11    Good!
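The small table follows directly from the Erlang mean and rms formulas once the mean is fixed at 25. A short sketch (the helper name `erlang_scale_and_sd` is hypothetical):

```python
import math

def erlang_scale_and_sd(mean_length, m):
    """Scale sigma and rms deviation of an Erlang density with index m and the given mean."""
    sigma = mean_length / (m + 1)          # from <l> = (m + 1) * sigma
    sd = math.sqrt(m + 1) * sigma          # from delta_l = sqrt(m + 1) * sigma
    return sigma, sd

rows = {m: erlang_scale_and_sd(25, m) for m in (0, 2, 4)}
for m, (sigma, sd) in rows.items():
    print(m, round(sigma), round(sd))
```

Larger m narrows the length distribution at fixed mean, which is why m = 4 gives the smallest spread of the three.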

Comparison of k-mer distributions, k=5-9, for model sequence and genome of Treponema

Length of duplicated segments: 25 +/- 12 bp

Word length (k)   T. pallidum   Present model   Genomic average (error)   Poisson
 2                8,260         8,207           10,580 (2,040)            250
 3                3,870         3,415           4,080 (630)               125
 4                1,380         1,202           1,490 (210)               62.5
 5                432           402             469 (66)                  31.2
 6                129           134             141 (21)                  15.6
 7                37.5          45.3            41.9 (6.7)                7.8
 8                11.0          15.9            12.4 (2.3)                3.9
 9                3.4           5.9             3.84 (0.84)               1.9
10                1.3           2.3             1.33 (0.34)               0.98

rms deviation of word count in genomes

Model sequence almost reproduces shape of genomic distributions

[Figures: counts of dinucleotides (k=2), trinucleotides (k=3) and tetranucleotides (k=4), each compared with a random sequence; for k=2 the random sequence lies at 62,500 +/- 250]

Methanococcus jannaschii: 70% A+T, 30% C+G. Model sequence generated exactly as before, except 70% A+T in the initial random sequence.

Result sensitive to parameters

• Parameter values for “good” model sequence:
  - Initial random sequence length L0 ~ 1 kb
  - Mean copied segment length <l> ~ 25 b
  - rms Δl ~ 12 b

If L0 > 10 kb, no good results

If <l> = 15 b, sequence too random for k<5
If <l> = 40 b, sequence too choppy for k>6

If <l> = 25 b, Δl […] b; agreement worsens

Discussion: The RNA World

• RNA was discovered in the early 80’s to have enzymatic activity: ribozymes can splice and replicate DNA sequences (Cech et al. 1981; Guerrier-Takada et al. 1983)

• The RNA world conjecture: early life had no proteins, only RNAs, which played the dual roles of genotype and phenotype

• Some present-day ribozymes are very small; smallest hammerhead ribozyme only 31 nucleotides; ribozymes in early life need not be much larger

• In our model the small initial size of the genome necessarily implies an early RNA world

• A genome ~ 1K nt long is long enough to code the many small ribozymes (but not proteins) needed to propagate life

• Origin of this initial genome is not addressed in the model. It (or its precursor) could have arisen spontaneously: artificial ribozymes have been successfully isolated from pools of random RNA sequences (Ekland et al. 1995)

RNA World & size of early genome

• Recall that present-day ribozyme can be as small as 31 nt

• The average duplicated segment length of 25 nt in the model is very short compared to present-day genes that code for proteins, but likely represents a good portion of the length of a typical ribozyme encoded in the early universal genome of the RNA world

RNA World & length of duplicated segments

• Spandrels
  – In architecture: the roughly triangular space between an arch, a wall and the ceiling
  – In evolution: a major category of important evolutionary features that were originally side effects and did not arise as adaptations (Gould and Lewontin 1979)

• Wide 3-mer/codon distribution or natural selection, which came first?

Are codons “spandrels”?

Are codons “spandrels”? (cont’d)

• Frequency distribution of 3-mers in genomes is about 40 times wider than Poisson. Was the widening caused by
  – Uneven codon usage + natural selection? Or,
  – Genome growth by segmental duplication?

• In the RNA world, codons came after RNA and the existence of replication machinery. Hence the following scenario: RNA + recombination → genome growth by stochastic duplication → extreme bias in 3-mer population → rise of codons

• In our model, codons are most likely spandrels

More spandrels

• The same goes for other oligonucleotides

Many oligonucleotides that are grossly over- or under-represented have biological functions. Evolution being an opportunistic process, these oligonucleotides could have been drafted to serve special biological purposes because they had already been made very copious or very rare by stochastic genome growth

• In bacterial genomes typically about 12% of genes represent recent duplication events
  – The average gene is about 1000 bases long, which suggests that about 12% of the genome was generated by duplications of ~1000 b segments. Not yet incorporated into the model.

• In higher organisms a large number of repeat sequences with lengths ranging from 1 base to many kilobases are believed to have resulted from at least five modes of duplication

Duplication continued and expanded after the rise of proteins

• How have genes been duplicated at the high rate of about 1% per gene per million years? (Lynch 2000)

• Why are there so many duplicate genes in all life forms? (Maynard 1998, Otto & Yong 2001)

• Were duplicate genes selected because they contribute to genetic robustness (by protecting the genome against harmful mutations)? (Gu et al. 2003)
  – Likely not; most likely the high frequency of occurrence of duplicate genes is a spandrel

Growth by duplication (of gene-size segments) may explain:

• Great debate in palaeontology and evolution, Dawkins & others vs. (the late) Gould & Eldredge: evolution went gradually and evenly vs. by stochastic bursts with intervals of stasis

Our model provides genetic basis for both. Mutation and small duplication induce gradual change; occasional large duplication can induce abrupt and seemingly discontinuous change

Classical Darwinian Gradualism or Punctuated equilibrium?

• Phylogeny and the Universal Ancestor
  – If extremely frequent and extremely rare oligos (EFEROs) are the remnants of a much shorter early sequence, then such a short sequence should have existed during some stage of the genome growth.

– Then we may be able to use the set of EFEROs in whole genomes to construct phylogenetic trees of whole genomes.

– At each node of the tree would be an ancestor sequence characterized by a set of EFEROs.

– The ancestor of Life would be characterized by the minimum set of EFEROs.

Discussion (cont’d)

• Distribution of frequency of k-mers in bacterial genomes hugely wider than Poisson – larger for smaller k

• Can be explained by a simple two-phase genome growth model:
  – first grow to a short (~1 kb) random sequence
  – then grow by random duplications of segments of length 25 +/- 12 b

• Reproduces genomic statistics for k=2-8
• Universal ancestral genome lived in an RNA world
  – Replication carried out by ribozymes ~ 30 nt
  – Codons and many signal sequences are spandrels

Summary

Outlook

• Need to understand distribution for ALL k’s
  – There are repeated k-mers of k up to ~1000
• Other oddities
  – E.g. distribution of entropies of k-mers
• Empirical verification
  – Can duplication growth be independently verified?
• Time scale
  – When did growth happen? At what rate? How did growth stabilize? Has it stabilized?
• Phylogeny
  – Can we build a good tree based on the model? Can we learn anything about the Universal Ancestor? Is there a Universal Ancestor?

CBL Lab @ NCU, Phys. & Life Sci.: *# L.C. Hsieh, # J.L. Lo, # T.Y. Chen, # J.P. Yiu, # Z.Y. Guo, # Z.R. Chiu, # H.Y. Bai, # C.H. Chang, # H.D. Chen, # W.L. Fan

Collaborators

• J.Z. Horng, # F.M. Lin (Horng Lab, NCU, Comp. Sci.)
• * L.F. Luo (Univ. Outer Mongolia)
• * F.M. Ji (Beijing Jiaotong Univ.)
• Rosie Redfield (Zoology, UBC)

* This work; # students. All simulations in this work were done by L.C. Hsieh (* #).

Thank you for your attention

Genomes are large systems with small-system statistics:

Genome Growth by Duplication

Lecture II

Winter School on Modern Biophysics, National Taiwan University

December 16-18, 2002

HC Lee, Dept. of Physics & Dept. of Life Science

National Central University

Result sensitive to values of two parameters

• Mutation to duplication event ratio η
  – η ~ 500 for bacterial genomes
  – If η is too large
    • too many mutations
    • gets long genome with Poisson distribution
  – If η is too small
    • too much duplication
    • too few mutations
    • gets multiple copies of random short (initial) genome (distribution too wide)

[Figure: panels for η = 100, 250, 500, 2000, 4000. Mutation/self-copy ratio = η; scale of repeat length = σ, with P(l)/P(l’) = exp{-(l-l’)/σ}; [at]:[cg] = 70:30]

Mutation to self-copy ratio is 500 +/- 100 (genome-like)

• Length scale for copied segments σ
  – σ ~ 10 K to 25 K for bacterial genomes
  – If σ is too small
    • genome grows too slowly
    • too many mutations
    • gets long genome with Poisson distribution
  – If σ is too large
    • genome grows too quickly
    • too few mutations
    • gets multiple copies of random short (initial) genome (distribution too wide)

Result sensitive to values of two parameters (cont’d)

[Figure: panels for σ = 0.5K, 2.5K, 15K, 50K, 1000K. Scale of repeat length = σ, with P(l)/P(l’) = exp{-(l-l’)/σ}; mutation/self-copy ratio = η; [at]:[cg] = 70:30. Axes: number of oligos vs. frequency of oligo (frequency distribution of 6-mers)]

Scale of repeat length cannot be too short (genome-like)
