Global Similarities in Nucleotide Base Composition Among Disparate Functional Classes of Single-Stranded RNA
Erik Schultes, Peter T. Hraber, Thomas H. LaBean
SFI WORKING PAPER: 1996-12-090
SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant.
© NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder.
www.santafe.edu
SANTA FE INSTITUTE
Global similarities in nucleotide base composition among disparate functional classes of single-stranded RNA
Erik Schultes1,2, Peter T. Hraber3, Thomas H. LaBean2,4
1Department of Earth and Space Sciences, University of California at Los Angeles, Los Angeles, California, 90024, USA
2Molecular Diversity Sciences Center, Duke University Medical Center, Durham, North Carolina, 27710, USA
3Department of Biology, University of New Mexico, Albuquerque, New Mexico, 87131, USA
4Department of Biochemistry, Duke University Medical Center, Durham, North Carolina, 27710, USA
The general research program of the Santa Fe Institute is supported by core funding
from the John D. and Catherine T. MacArthur Foundation, the National Science
Foundation (PHY 9021437), and the U.S. Department of Energy (DE-FG03-94ER61951),
and by gifts and grants from individuals and members of the Institute's
Business Network for Complex Systems research. Additional support, if any, for
specific research projects is listed in the acknowledgments section of the paper.
Correspondence:
Erik Schultes
Center for the Study of the Evolution and Origin of Life
University of California, Los Angeles
Los Angeles, CA 90024
phone: 310 825-1769
fax: 310 825-0097
e-mail:
1 Schultes, Hraber, LaBean
ABSTRACT
The number of distinct functional classes of single-stranded RNAs (ssRNAs) and the number of
sequences representing them are substantial and continue to increase. Organizing this data in an
evolutionary context is essential, yet traditional comparative sequence analyses require that
homologous sites can be identified. This prevents comparative analysis between sequences of
different functional classes that share no site-to-site sequence similarity. Analysis within a single
evolutionary lineage also limits evolutionary inference because shared ancestry confounds
properties of molecular structure and function that are historically contingent with those that are
imposed for biophysical reasons. Here, we apply a method of comparative analysis to ssRNAs that
is not restricted to homologous sequences, and therefore enables comparison between unrelated or
distantly related sequences, minimizing the effects of shared ancestry. This method is based on
statistical similarities in nucleotide base composition among different functional classes of
ssRNAs. In order to denote base composition unambiguously, we have calculated the fraction
G+A and G+U content in addition to the more commonly used fraction G+C content. These three
parameters define RNA composition space which we have visualized using interactive graphics
software. We have examined the distribution of nucleotide composition from 15 distinct functional
classes of ssRNAs from organisms spanning the universal phylogenetic tree. Surprisingly, these
distributions are highly constrained and consistently biased in G+A and G+U content both within
and between functional classes regardless of the more variable G+C content. That RNA sequences
sharing little or no sequence similarity should nonetheless exhibit similar patterns of base
composition indicates that severe constraints, adaptive or otherwise, act to localize the evolutionary
divergence of ssRNAs in ways that have not been obvious at the sequence level.
Keywords: composition space; RNA evolution; RNA simplex; evolutionary trajectory;
evolutionary attractor
INTRODUCTION
The remarkably successful comparative sequence analyses of ribosomal RNA have had as their
main goals the elucidation of rRNA structure and the construction of phylogenetic relationships
between divergent sequences (Woese et al., 1990; Woese & Pace, 1993). However, this requires the
existence of well conserved, homologous sites amenable to comparison (Pace et al., 1989) and
therefore imposes two important limitations on evolutionary inference. First, it is impossible to
define the relationships between sequences having no sequence similarity (e.g., 23S rRNA and P
RNA) (Gould, 1991). Second, restricting sequence analyses to single evolutionary lineages (i.e.,
homologous sequences), makes impossible the characterization of molecular properties that are
universal to functional ssRNAs. This is because shared genealogies confound similarities due to
biophysical constraints with similarities that are the result of historically contingent factors
(Kauffman, 1993). Though comparative sequence analyses can effectively resolve secondary
structure within a functional class, they severely limit generalized inferences about structural
properties and biophysical mechanisms (e.g., the folding problem) that may be common to RNA
polymers in general. Hence, research in molecular evolution has been largely directed toward
historical reconstructions of the branching order of sequence diversification and the analysis of
structural adaptations specific to individual lineages. Studies addressing the generic properties of
evolved RNA polymers and the origin of new function are relatively rare (Bloch et al., 1983; Bloch
et al., 1985; Tomizawa, 1993; Fontana et al., 1993; Huynen & Hogeweg, 1994). In an attempt to
avoid these problems, we have developed a method of comparative analysis of ssRNA utilizing
statistical distributions of attributes that are common to any RNA sequence, even those sharing no
evolutionary history (i.e., having no common ancestor). The main goal of this statistical approach
is not the construction of phylogenetic relationships but establishing and explaining global
similarities and differences between sequences that are known to be unrelated or distantly related
(Schultes et al., submitted).
In this contribution, we compare the statistical distributions of nucleotide base composition of
different functional classes of ssRNAs. Analyses of nucleic acid base composition have typically
focused on G+C content data derived from the thermal denaturation/hybridization or density
gradient sedimentation experiments using large genomic DNA samples (Chargaff et al., 1949;
Chargaff, 1951; Sabeur et al., 1993). This study differs in two respects. First, we examine the full
range of nucleotide composition calculated as the fraction of A, C, G, and U residues in a sequence
of N residues: (A/N, C/N, G/N, U/N). This quantity is referred to as a composition vector, and
can be calculated for any RNA sequence. Though commonly used by itself, the percent G+C
content is a one-dimensional projection of (4-1)-dimensional data (i.e., knowing A/N, C/N, and
G/N implies the value of U/N because A/N + C/N + G/N + U/N = 1.0). The compression of
complex composition information onto the single dimension of G+C content results in the loss of
information. Because these four fractions must sum to one, all possible composition vectors can be
visualized as points within the volume of a tetrahedron. This geometric representation is called a
unit simplex and constitutes the entire composition space of all possible RNA sequences. We have
visualized this space (Fig. 1) using interactive graphics software developed by Richardson and
Richardson (1992). The composition vectors describing the four homopolymers (poly-A, poly-C,
etc.) are represented by the vertices while the point located at the center, equidistant from the four
homopolymers, represents all those sequences having a uniform distribution of nucleotides, i.e.,
(0.25, 0.25, 0.25, 0.25). We refer to this special class of heteropolymers as the
isoheteropolymers. In general, an arbitrary composition class "contains" a large number of
possible sequences, many sharing no sequence similarity. Though composition vectors specify
base composition exactly, they can be cumbersome to work with analytically. Taking the fraction
G+C content as a convention, we invoke a more convenient notation using two additional
measures of base composition: the fraction G+A and fraction G+U content. Together, the G+C, G+A,
and G+U measures act as a coordinate system, uniquely locating position in RNA composition
space.
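As an illustration of the notation just described, the following minimal Python sketch computes a composition vector and the three measures (G+C, G+A, G+U) for a sequence, and inverts the measures back to the vector. It is not the software used in this study; the function names are hypothetical, and the handling of T and of ambiguous symbols (T is read as U, other symbols are ignored) is an assumption made for the example. The inversion follows from the constraint that the four fractions sum to one, so that (G+C) + (G+A) + (G+U) = 2G + 1.

```python
from collections import Counter


def composition_vector(seq):
    """Return the composition vector (A/N, C/N, G/N, U/N) of an RNA sequence.

    Assumption for this sketch: T is read as U, and any symbol other than
    A, C, G, U is ignored when counting N.
    """
    counts = Counter(seq.upper().replace("T", "U"))
    n = sum(counts[b] for b in "ACGU")
    if n == 0:
        raise ValueError("sequence contains no A, C, G, or U residues")
    return tuple(counts[b] / n for b in "ACGU")


def simplex_coordinates(vec):
    """Map (A/N, C/N, G/N, U/N) to the three measures (G+C, G+A, G+U)."""
    a, c, g, u = vec
    return (g + c, g + a, g + u)


def vector_from_coordinates(gc, ga, gu):
    """Invert the three measures back to (A/N, C/N, G/N, U/N).

    Because the four fractions sum to one, gc + ga + gu = 2*G/N + 1.
    """
    g = (gc + ga + gu - 1.0) / 2.0
    return (ga - g, gc - g, g, gu - g)


if __name__ == "__main__":
    vec = composition_vector("GGGAACUU")
    print(vec, simplex_coordinates(vec))
```

The round trip through `vector_from_coordinates` is what is meant by the three measures "uniquely locating" a composition class: no two distinct composition vectors share the same (G+C, G+A, G+U) triple.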
This study also differs from previous base composition analyses in that we narrow our focus from
bulk genomic samples (containing a wide variety of regulatory, coding, and non-coding DNA
sequences) to distinct ssRNA sequences that are known to play specific metabolic roles in the cell.
Similar to proteins, ssRNA molecules fold in a sequence specific manner into complex
conformations that determine their metabolic function (Draper, 1992; Draper, 1996; Price & Nagai,
1996; Zarrinkar & Williamson, 1996). By focusing our analysis on disparate classes of ssRNA
sequences, it may be possible to correlate properties in base composition to biochemical properties
that are characteristic of functional ssRNAs, such as intramolecular folding and structure. Also, the
lack of sequence similarity between functional classes implies either independent originations
(Joyce, 1989; Ekland et al., 1995) or radical sequence divergence early in evolutionary history,
before the latest common ancestor of contemporary life (Lewin, 1985; Woese & Pace, 1993).
Comparing disparate sequences minimizes the confounding influences of genealogy on the
interpretation of molecular attributes as either historical accident or as generic properties of RNA
polymers. We have compiled, and plotted within the RNA simplex, the composition vectors of 2800
complete sequences from 15 distinct functional classes of ssRNA molecules. Within each
functional class, except tRNA, each genus is represented by a single randomly chosen species
making it representative of the universal phylogenetic tree. This database represents a diverse cross
section of RNA composition (G+C content ranging from 0.088, mitochondria GTC-tRNA
Drosophila yakuba, to 0.748, 5S rRNA, Thermomicrobium roseum); structure (sequence length
varies from 13 nt, Hepatitis Delta virus ribozyme a, to 5182 nt, 23S rRNA Homo sapiens);
function (including various ribozymes); organisms and ecological settings (including
extremophiles such as Halobacterium, Thermus, and Pyrodictium) (Table 1). Because the
complete sequences of full length messenger RNAs are rarely obtained, they have been excluded
from this analysis.
Relative to the well known variability in G+C content, we have discovered an unexpected,
universal localization of base composition in G+A and G+U content among the empirical distributions.
We present three general observations: (i) the empirical distributions form a set of parallel axes in
composition space that are themselves parallel to the gradient in G+C content, (ii) these axes are
displaced in a magnitude and direction indicating persistent and relatively constrained G+A and
G+U biases and (iii) the magnitude of the G+A bias and the variance of the G+A and G+U biases
are dependent on sequence length. The universal nature of these base composition biases between
sequences unrelated by structure or function is evidence for the existence of universal evolutionary
constraints acting to confine the base composition of ssRNA despite enormous diversification at
the sequence level. Whether or not these constraints are adaptive remains unknown.
RESULTS
Comprehensive Overview
Taken as a whole, the mean G+C, G+A, and G+U contents of the 2800 ssRNA
sequences investigated here are each slightly greater than 0.5 (Table 1). The mean composition
vector of these data is (0.242, 0.235, 0.274, 0.249), with guanosine residues predominating. G+A is
the least variable of the composition measures, followed by G+U. The variability of the base
composition is greatest in G+C, being 6.5 times more variable than G+A and 4.3 times more
variable than G+U. Below we partition these data by functional class and phylogenetic Domain.
Empirical distributions parallel Chargaff's Axis
Chargaff (1949, 1951) established that, for genomic DNA, the mole-fraction of A is equal to T and
C is equal to G. In the RNA simplex, those composition vectors fulfilling the equivalent of
Chargaff's Rule (A = U and simultaneously C = G) form a line joining the midpoints of the AU
and CG edges and contain the isoheteropolymer composition vector. We refer to this locus of
composition vectors as Chargaff's Axis (Fig. 1). In almost every case studied here (exceptions
being 4 classes of small nuclear RNA), the distributions of ssRNA composition vectors are
noticeably protracted into axes that lie parallel to Chargaff's Axis (Fig. 2, panels labeled i). These
empirical axes, extending from G+C-rich to G+C-poor compositions are another way of
visualizing the well known variability in G+C content between species. For the majority of the
distributions, the mean G+C content is greater than 0.5, and is exceptionally high among the
Archaea. Particularly striking are the distinct G+C biases between mitochondrial and chloroplast
tRNA sequences (Fig. 2I). Chloroplast sequences are slightly G+C-rich while mitochondrial
sequences show remarkably low G+C content with 91% of the 742 mitochondrial tRNAs having
G+C values less than 0.5.
A standard model for the diversification of RNA sequences is the compensatory change among
paired nucleotides. This is the basis of structural inference from sequence comparison studies.
Thus, AU pairs can often replace CG pairs and vice versa with only neutral effects (Pace et al.,
1989). Compensatory mutations allow ssRNAs to drift neutrally and therefore relatively rapidly
within functional extremes of low to high G+C content (Wada, 1992), without affecting overall
G+A and G+U content (Fig. 3). Hence, evolutionary diversification driven by compensatory
mutations will result in axis-like distributions lying parallel to but displaced from Chargaff's Axis
as seen in panels labeled ii of Fig. 2. G+C biases can be driven by mechanisms that are external to
the organism (e.g., extreme environmental parameters such as temperature or salinity tending to
increase G+C content (Brown et al., 1993)) or internal to the organism (e.g., cytosine deamination
mutation pressures or polymerase error biases tending to decrease G+C content (Pearl & Savva,
1996)). These axes of neutral diversification, or "neutral ridges" (Schuster et al., 1994), spanning
composition space from G+C-rich to G+C-poor composition classes account well for the axis-like
nature of the empirical distributions.
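The bookkeeping behind this argument can be checked with a small Python sketch. It is an illustration only: the 12-nt hairpin and its pairing pattern are hypothetical, invented for the example, and the helper names are not from the original work. Replacing an A-U pair by a G-C (or C-G) pair raises G+C while leaving G+A and G+U untouched, because the gain of one G is offset by the loss of one A in the G+A count and by the loss of one U in the G+U count.

```python
def measures(seq):
    """Return (G+C, G+A, G+U) as fractions of sequence length."""
    n = len(seq)
    g, c, a, u = (seq.count(b) for b in "GCAU")
    return ((g + c) / n, (g + a) / n, (g + u) / n)


def swap_pair(seq, i, j, new_i, new_j):
    """Replace the bases at paired positions i and j (a compensatory change)."""
    bases = list(seq)
    bases[i], bases[j] = new_i, new_j
    return "".join(bases)


if __name__ == "__main__":
    # Hypothetical hairpin: stem AAGG / CCUU closed by a GAAA loop, so that
    # position 0 (A) is assumed to pair with position 11 (U).
    hairpin = "AAGGGAAACCUU"
    print("original :", measures(hairpin))
    print("A-U > G-C:", measures(swap_pair(hairpin, 0, 11, "G", "C")))
    print("A-U > C-G:", measures(swap_pair(hairpin, 0, 11, "C", "G")))
```

In composition space, each such replacement is a step parallel to Chargaff's Axis, which is why repeated compensatory changes would be expected to spread a lineage along an axis without moving it toward or away from Chargaff's Axis.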
Universal G+A and G+U Biases
If one views the RNA simplex from the C = G = 0.5 endpoint of Chargaff's Axis, one sees
Chargaff's Axis and the empirical axes mentioned above, end-on (Fig. 2, panels labeled ii). In this
perspective it is clear that the distributions are relatively constrained in directions perpendicular to the G+C
gradient and, with the exception of 5 classes of snRNA, consistently
G+A-rich. Unlike G+C content which is variable, the G+A and G+U content is relatively constant
within and between functional classes, particularly for the longer sequences. The location of these
axes with respect to Chargaff's Axis can be quantified by calculating the means and standard
deviations in G+A and G+U (Table 1).
For functional classes containing longer sequences (23S, 18S, 16S, P RNA) common to each
phylogenetic Domain (Archaea, Bacteria, and Eucarya), we observe that composition vectors from
each Domain cluster into individual axes, each being distinct from Chargaff's Axis. Intriguingly,
the phylogenetic arrangement of the Archaea, Bacteria, and Eucarya distributions is remarkably
similar between functional classes, having nearly identical mean G+A and G+U values (compare
for example mean G+A values for Archaea 23S rRNA (0.567), 16S rRNA (0.562), and P RNA
(0.564) in Table 1). These Domain-specific G+A and G+U biases between different functional
classes of ssRNAs suggest that characteristic constraints act on the evolution of base composition
within Domains. The G+A and G+U content of shorter sequences, such as 5S rRNA and tRNA,
are less biased than longer sequences and are more variable. We describe this length dependence in
detail in the next section.
Dependence of G+A and G+U biases on sequence length
The magnitude and variance of G+A and G+U biases in the empirical distributions are dependent
on the length of the RNA sequences. As populations of RNA sequences diversify through
mutation and selection, they map a system of branching trajectories through sequence space.
Changes in sequence composition along these trajectories can be observed in the RNA simplex.
The simplest trajectories are those of totally neutral, random walks, where sequences accept single
point mutations at some constant rate, uniformly over the four nucleotides. This (ergodic) process
tends to drive trajectories toward the isoheteropolymer composition. However, for sequences of
biologically relevant length, random trajectories can be quite variable. This is because single base
substitutions in short sequences have a larger relative effect on nucleotide frequency than in longer
sequences. Hence, the distributions of shorter RNA sequences (tRNA, snRNA, 5S rRNA) are
more easily "buffeted" by point mutations than longer sequences (e.g., 23S rRNA, P RNA). This
length effect can be seen in distributions of randomly generated sequences which form spherical
clouds centered on the isoheteropolymers (mean (G+C) = (G+A) = (G+U) = 0.5) (Fig. 2J). As
sequence length increases, the variance about the mean values of G+A, G+C, and G+U content
decreases isotropically. The variability in the composition of these random-sequence distributions
accounts well for the observed changes in variability in G+A and G+U with mean sequence length
for ssRNA (Fig. 4). G+C content, however, remains variable and independent of length. Hence,
the ranges and variability of the distributions decrease with increasing sequence length, tending to
sharpen inherent phylogenetic distinctions between the Archaea, Bacteria, and Eucarya axes.
However, unlike the random distributions, the mean G+A and G+U biases do not converge to 0.5
with increasing sequence length (Fig. 5). Indeed, the overall G+A bias appears to increase as
sequence length increases. This gives the impression that longer sequences better reflect an
intrinsic G+A bias than shorter sequences.
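The length effect for random sequences can be reproduced with a short simulation. The sketch below is not the code used in this study; it assumes sequences are drawn with each base at frequency 0.25, and the sample size, seed, and the two example lengths (75 and 1500 nt) are arbitrary choices for illustration. Under this model the G+A fraction of a sequence of length N is a binomial proportion, so its standard deviation should be close to sqrt(0.25/N) while its mean stays at 0.5.

```python
import random
import statistics


def random_rna(length, rng):
    """Draw a random 'RNA' sequence with each base at frequency 0.25."""
    return "".join(rng.choices("ACGU", k=length))


def ga_fraction(seq):
    """Fraction G+A content of a sequence."""
    return (seq.count("G") + seq.count("A")) / len(seq)


def ga_spread(length, samples=500, seed=0):
    """Return (mean, standard deviation) of G+A over random sequences."""
    rng = random.Random(seed)
    values = [ga_fraction(random_rna(length, rng)) for _ in range(samples)]
    return statistics.mean(values), statistics.pstdev(values)


if __name__ == "__main__":
    for length in (75, 1500):
        mean, sd = ga_spread(length)
        expected_sd = (0.25 / length) ** 0.5
        print(f"N={length}: mean={mean:.3f} sd={sd:.4f} expected sd={expected_sd:.4f}")
```

The same functions apply unchanged to G+C or G+U, which is the sense in which the random clouds shrink isotropically. The contrast drawn in the text is that the empirical G+A means do not sit at 0.5 as the simulated means do.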
DISCUSSION
Factors affecting the evolution of base composition
The observed localization of disparate ssRNAs in G+A and G+U content suggests (i) that G+A and
G+U content is subject to severe evolutionary constraints and (ii) that these constraints are not
functionally or phylogenetically specific. We refer to this confined volume of composition space
occupied by ssRNA as an evolutionary attractor (Holm & Sander, 1996) in the sense that
evolutionary trajectories of unrelated sequences have remained invariant in base composition
despite drastic evolutionary change at the sequence level. Three factors affect the evolution of
molecular sequences and therefore base composition: mutation, selection, and historical constraints
imposed by ancestral sequences (Sueoka, 1992). These factors are outlined below.
First, the ultimate impetus behind evolutionary change is the creation of new variation via
mutation. Mutational mechanisms in DNA replication and repair that are not compositionally
uniform might be expected to impose biases on the coding regions of ssRNA (Pearl & Savva,
1996). The effects of compensatory mutation described above account well for the high variability
in G+C content between species. However, without mechanisms analogous to neutral,
compensatory mutations acting in directions of G+A and G+U content, diversification in these
directions proceeds at relatively slower rates.
As noted previously, the sensitivity of base composition to change arising from point mutations is
dependent on sequence length. Gene duplication events, though significantly increasing the length
of ssRNA genes, will have little or no effect on composition. Bloch et al. (1983; 1985) have
described possible duplication events relating tRNA and rRNA subsequences, suggesting that
these two classes may have similar sequence (and hence base composition) characteristics due to
common origins, despite their radical divergence in structure and function. In contrast, insertion,
deletion and recombination of sizable elements, can potentially alter the base composition of
ssRNA coding regions drastically. These events may account for the disparity in composition
between prokaryotic and eukaryotic sequences.
Second, the current location of a ssRNA sequence is determined in part by the location of its
ancestors. Since G+C content evolves relatively rapidly, this measure is probably least sensitive to
historical constraints. G+A and G+U content however are more restricted, and presumably carry
more historical information. For example, the G+A and G+U content of Eucarya 16S rRNA are
well constrained after billions of years of evolutionary history. The latest common ancestor of the
Eucarya 16S rRNA probably had G+A and G+U values similar to the calculated mean values of
contemporary sequences (0.524 and 0.527 respectively). It is not clear, however, where the latest
common ancestor of extant life would have been located in the RNA simplex. Did other lineages
from this time, which have since become extinct, also have G+A-rich compositions? Going back
still further in evolutionary history to the time of the RNA world (Joyce, 1989), did nascent RNA
polymers have consistently biased compositions or were they more randomly distributed? Are the
observed G+A and G+U biases a result of adaptive convergence or an historical accident that has
become fixed among existing, descendant sequences?
Finally, it may be that nucleotide base composition probabilistically influences the folding
properties of ssRNA polymers in ways that have heretofore gone unappreciated. Noting the
variability in G+C content, Cantor and Schimmel (1980) state, "There is no evidence that the
overall base composition of RNA or DNA correlates in any significant way with biological
function". However, it is reasonable to assume that the physical and chemical differences among
the four nucleotide bases impose differences on the biophysical properties of polymers having
different base compositions. In other words, the average biophysical properties of arbitrary RNA
polymers will differ from place to place within the RNA simplex. Hence, specific provinces of
composition space may contain sequences that are more likely to support intramolecular folding to
unique and stable structures. These provinces would have a high "density" of sequences with
properties that are general to and necessary for biological function. Selection for functional
sequences would therefore have the effect of driving sequence composition toward provinces
where these properties are more commonly found. Though adaptive evolutionary convergence
(Doolittle, 1994) of G+A and G+U content is currently speculative, it is the accepted explanation
for an increase in G+C content in thermophilic organisms (Brown et al., 1993). It may be the case
that the observed universality in base composition among ssRNA reflects an adaptive
evolutionary convergence.
The U1 - U5 snRNAs, named for their U-rich base composition, are the only sequences that have
mean G+A values less than 0.5. Apparently, these sequences have specialized composition
requirements within the spliceosome. However, the U6 snRNA displays mean G+A and G+U
values that are typical of other ssRNAs including group II ribozymes. Intriguingly, U6 appears to
be the catalytic core of the spliceosome, carrying out a two-step splicing reaction analogous to
self-splicing group II introns (Wise, 1993). It may be the case that this biased composition is
advantageous for maintenance of the catalytic activity mediated by the U6 moiety.
Other comparative statistical analyses
Similarities in base composition that arise from common ancestry can be controlled for by
comparing independent lineages of molecular sequences. Comparative statistical analyses of this
sort will begin to define characteristics of ssRNAs that are common to any well evolved,
functional RNA sequence in contrast to conventional comparative sequence analyses that define
characteristics unique to individual lineages. Comparative statistical analyses, however, need not
be restricted to base composition. The primary, secondary (Fontana et al., 1993), and tertiary
structures as well as experimentally obtained biophysical and functional properties (Kuo & Cech,
1996) are all amenable to comparison. Though relatively new to ssRNA, there is in fact precedent
for comparative statistical analysis in the protein literature. Most relevant to our study, Chou
(1995), using a representative compilation of distantly related sequences, has shown that the amino
acid composition of proteins is an excellent predictor (95.3% accuracy) of protein structural class.
Intriguingly, the interpretation is that the different structural classes of proteins are localized in
distinct provinces of protein composition space (i.e., the (20-1)-dimensional protein simplex).
Eisenhaber et al. (1996a; 1996b) have critically reviewed Chou's work, but nonetheless conclude
that "secondary structural content of a protein is determined mainly by the amino acid
composition."
As more sequence data becomes available, comparative statistical analyses between sequences
sharing little if any evolutionary history will become increasingly germane. Currently, large
sequence data sets are being generated via in vitro selection and evolution from diverse, random or
partially-random synthetic RNA libraries. It will be informative to compare statistical distributions
of natural RNA sequences to those of artificial ssRNAs evolved under various conditions (e.g.,
Ekland et al., 1995; Schultes et al., submitted). Only in this case can the confounding influence of
genealogy be completely eliminated from evolutionary inference.
MATERIALS AND METHODS
The RNA simplex
The number of possible RNA sequences of length N is given by $4^N$ (this is the size of the so-called
sequence space; Smith, 1970; Hamming, 1980; Eigen, 1992). Each of these sequences can be
classified into a compositional class denoted by its composition vector. The space of all possible
RNA composition vectors is constrained to the volume of a tetrahedron. This is a 3-dimensional
projection of high-dimensional sequence space in that all $4^N$ possible sequences are projected onto

$$C = \binom{N+3}{N}$$

compositional classes. For example, for N = 20, all $1.1 \times 10^{12}$ sequences are partitioned among
C = 1771 compositional classes. The "density" of a composition class c (the number of sequences
belonging to the class c) is given by the multinomial coefficient,

$$P_c = \frac{N!}{A!\,C!\,G!\,U!},$$

where A, C, G, and U specify the number of each residue within a sequence of N total residues.
$P_c$ can be summarized by calculating the Shannon entropy (Shannon, 1948) of a composition
vector c as

$$H_c = -\left[\frac{A}{N}\log_2\frac{A}{N} + \frac{C}{N}\log_2\frac{C}{N} + \frac{G}{N}\log_2\frac{G}{N} + \frac{U}{N}\log_2\frac{U}{N}\right].$$
Note that the density of sequences increases enormously toward the center of the simplex. It is for
this reason that a random walk in sequence space, where each sequence is equally likely to be
visited, will be confined near the high-entropy isoheteropolymers.
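The counting arguments of this section can be verified numerically. The following Python sketch is an illustration rather than part of the original analysis, and its function names are hypothetical. It evaluates the number of compositional classes, the density of a class, and the Shannon entropy of a composition vector, and compares the central class with a homopolymer class for N = 20. Terms with a zero count are skipped in the entropy sum, following the usual convention that 0 log 0 = 0.

```python
from math import comb, factorial, log2


def n_classes(n):
    """Number of compositional classes for sequences of length n."""
    return comb(n + 3, n)


def density(a, c, g, u):
    """Number of distinct sequences with exactly these residue counts."""
    n = a + c + g + u
    return factorial(n) // (
        factorial(a) * factorial(c) * factorial(g) * factorial(u)
    )


def entropy(a, c, g, u):
    """Shannon entropy (bits) of the composition vector (a, c, g, u)/N."""
    n = a + c + g + u
    return -sum((k / n) * log2(k / n) for k in (a, c, g, u) if k > 0)


if __name__ == "__main__":
    n = 20
    print("sequences:", 4 ** n, "classes:", n_classes(n))
    for counts in [(5, 5, 5, 5), (20, 0, 0, 0)]:
        print(counts, "density:", density(*counts), "entropy:", entropy(*counts))
```

For N = 20 the isoheteropolymer class (5, 5, 5, 5) holds roughly 1% of all sequences in a single one of the 1771 classes, while each homopolymer class holds exactly one sequence.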
Visualization software
The unit simplex was visualized with Mage 4.4, an interactive molecular graphics software
package. Composition data are visualized by specifying the calculated composition vectors as the
fraction of A, C, and G (the fraction U being implicit) in the Kinemage data file. The latest version
of Mage and the Kinemage data files used in this work can be obtained via anonymous ftp, at:
ftp://santafe.edu/pub/pth/rna.
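For readers without access to Mage, a composition vector can be placed inside a tetrahedron with a few lines of Python. This sketch does not reproduce the Kinemage file format or the orientation used in the figures; the vertex coordinates below are one convenient assumption (alternate corners of a cube), chosen so that Chargaff's Axis coincides with the z axis. With this choice the three Cartesian coordinates reduce to x = 1 - 2(G+U), y = 2(G+A) - 1, and z = 1 - 2(G+C), which makes explicit that G+C measures position along Chargaff's Axis while G+A and G+U measure displacement perpendicular to it.

```python
# Assumed vertex placement (not taken from the Kinemage files): the four
# homopolymers sit on alternate corners of a cube, forming a regular tetrahedron.
VERTICES = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "U": (-1, -1, 1),
}


def to_cartesian(vec):
    """Convert (A/N, C/N, G/N, U/N) to a 3-D point inside the tetrahedron."""
    weights = dict(zip("ACGU", vec))
    return tuple(
        sum(weights[b] * VERTICES[b][axis] for b in "ACGU") for axis in range(3)
    )


if __name__ == "__main__":
    print("isoheteropolymer:", to_cartesian((0.25, 0.25, 0.25, 0.25)))
    print("AU edge midpoint:", to_cartesian((0.5, 0.0, 0.0, 0.5)))
    print("CG edge midpoint:", to_cartesian((0.0, 0.5, 0.5, 0.0)))
    # Mean composition vector reported in the Results section.
    print("mean of data set:", to_cartesian((0.242, 0.235, 0.274, 0.249)))
```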
Data base construction
Composition vectors and base composition statistics of ssRNA sequences (Table 1) were
calculated using a combination of standard Unix based utilities and Microsoft Excel 4.0.
Composition vectors were calculated from full-length sequence data compiled from various Internet
sources: 23S and 16S rRNA sequences were obtained from http://rrna.uia.ac.be; 5S rRNA
sequences were obtained from http://cammsg3.caos.kun.nl; snRNA sequences were obtained from
http://pegasus.uthct.edu/uRNADB/uRNADB.html; tRNA sequences were obtained from
ftp://ftp.ebi.ac.uk/pub/databases/trna; P RNA sequences were obtained courtesy of Jim Brown; 18S
rRNA sequences were obtained courtesy of Kevin Peterson; Group I, Group II, and Hammerhead
ribozymes were obtained through GenBank searches; additional Group I and Group II sequences
were obtained from Green & Szostak, 1992; Schmelzer & Schweyen, 1986; and Schmidt et al., 1990.
Only sequences greater than 90% complete are included. In an attempt to produce phylogenetically
representative samples of molecular diversity, each genus is represented by a single, randomly
chosen species (except in the tRNA data set where all available sequences are plotted). Random
"RNA" sequences were computationally generated by choosing "bases" from a uniform
distribution (each base occurs with a frequency of 0.25). These random sequence data files were
then treated like biological sequences in order to calculate their base composition statistics.
Complete data sets with accession numbers are available via anonymous ftp, at:
ftp://santafe.edu/pub/pth/rna.
ACKNOWLEDGMENTS
We thank Francois Michel for directing us to group II data, Kevin J. Peterson for 18S rRNA data
and Jim Brown for P RNA data sets. We thank D. Richardson, and J. Richardson for computing
resources, D. Kenan, J. Keene, M. Geysen, and J. W. Schopf for helpful discussions and advice.
This work was supported by the Center for the Study of the Evolution and Origin of Life,
Diversity Biotechnology Consortium, and the Molecular Diversity Sciences Center at Duke
University. ES was also supported through core funding while in residence at the Santa Fe
Institute (1995). PTH was supported by a Santa Fe Institute graduate fellowship with funds from
Grant No. N00014-95-1-1000 from the Office of Naval Research, acting in cooperation with the
Defense Advanced Research Projects Agency.
REFERENCES
Bloch DP, McArthur B, Widdowson R, Spector D, Guimaraes RC, Smith J. 1983. tRNA-rRNA
sequence homologies: evidence for a common evolutionary origin. J Mol Evol 19: 420-428.
Bloch DP, McArthur B, Mirrop S. 1985. tRNA-rRNA sequence homologies: evidence for an
ancient modular format shared by tRNAs and rRNAs. Biosystems 17: 209-225.
Brown JW, Haas ES, Pace NR. 1993. Characterization of ribonuclease P RNAs from
thermophilic bacteria. Nucleic Acids Res 21: 671-679.
Cantor C, Schimmel P. 1980. Biophysical Chemistry, Part III. New York. W. H. Freeman and
Company. p 162.
Chargaff E, Vischer E, Doniger R, Green C, Misani F. 1949. The composition of the
desoxypentose nucleic acids of the thymus and spleen. J Biol Chem 177: 405-416.
Chargaff E. 1951. Structure and function of nucleic acids as cell constituents. Fed Proc 10: 654-659.
Chou K. 1995. A novel approach to predicting protein structural classes in a (20-1)-D amino acid
composition space. Proteins 21: 319-344.
Doolittle RF. 1994. Convergent evolution: the need to be explicit. Trends Biochem Sci 19: 15-18.
Draper DE. 1992. The RNA-folding problem. Acc Chem Res 25: 201-207.
Draper DE. 1996. Strategies for RNA folding. Trends Biochem Sci 21: 145-149.
Eigen M. 1992. Steps Towards Life. Oxford. Oxford University Press. pp 92-100.
Eisenhaber F, Imperiale F, Argos P, Frommel C. 1996a. Prediction of secondary structural
content of proteins from their amino acid composition alone. I. New analytical vector
decomposition methods. Proteins 25: 157-168.
Eisenhaber F, Frommel C, Argos P. 1996b. Prediction of secondary structural content of proteins
from their amino acid composition alone. II. The paradox with secondary structural class.
Proteins 25: 169-179.
Ekland EH, Szostak JW, Bartel DP. 1995. Structurally complex and highly active RNA ligases
derived from random RNA sequences. Science 269: 364-370.
Fontana W, Konings DAM, Stadler PF, Schuster P. 1993. Statistics of RNA secondary
structures. Biopolymers 33: 1389-1404.
Gould SJ. 1991. The disparity of the Burgess Shale arthropod fauna and the limits of cladistic
analysis: why we must strive to quantify morphospace. Paleobiology 17: 411-423.
Green R, Szostak JW. 1992. Selection of a ribozyme that functions as a superior template in a
self-copying reaction. Science 258: 1910-1915.
Hamming RW. 1980. Coding and information theory. New Jersey. Prentice-Hall. pp 44-47,
176-190.
Holm L, Sander C. 1996. Mapping the protein universe. Science 273: 595-602.
Huynen M, Hogeweg P. 1994. Pattern generation in molecular evolution: exploitation of the
variation in RNA landscapes. J Mol Evol 39: 71-79.
Joyce GF. 1989. RNA evolution and the origins of life. Nature 338: 217-224.
Kauffman SA. 1993. The Origins of Order: Self-Organization and Selection in Evolution. New
York. Oxford University Press. pp 22-25.
Kuo L, Cech TR. 1996. Conserved thermochemistry of guanosine nucleophile binding for
structurally distinct group I ribozymes. Nucleic Acids Res 24: 3722-3727.
Lewin R. 1985. Basic modular format in tRNAs and rRNAs. Science 229: 1254.
Neefs JM, Van de Peer Y, De Rijk P, Chapelle S, De Wachter R. 1993. Compilation of small
ribosomal subunit RNA structures. Nucleic Acids Res 21: 3025-3049.
Pace NR, Smith K, Olsen GJ, James BD. 1989. Phylogenetic comparative analysis and the
secondary structure of ribonuclease P RNA - a review. Gene 82: 65-75.
Pearl LH, Savva R. 1996. The problem with pyrimidines. Nat Struct Biol 3: 485-487.
Price S, Nagai K. 1996. Secrets of RNA folding revealed. Structure 4: 1129-1132.
Richardson DC, Richardson JS. 1992. The kinemage: A tool for scientific communication. Protein
Science 1: 3-9.
Sabeur G, Macaya G, Kadi F, Bernardi G. 1993. The isochore patterns of mammalian genomes
and their phylogenetic implications. J Mol Evol 37: 93-108.
Schmelzer C, Schweyen RJ. 1986. Self-splicing of group II introns in vitro: mapping of the branch
point and mutational inhibition of lariat formation. Cell 46: 557-565.
Schmidt D, Riederer B, Morl M. 1990. Self-splicing of the mobile group II intron of the
filamentous fungus Podospora anserina (COI I1) in vitro. EMBO J 9: 2289-2298.
Schultes E, Hraber PT, LaBean TH. Evidence for adaptive evolutionary convergence in the base
composition of single-stranded RNA. Submitted.
Schuster P, Fontana W, Stadler PF, Hofacker IL. 1994. From sequences to shapes and back: a
case study in RNA secondary structures. Proc R Soc Lond B 255: 279-284.
Shannon CE. 1948. A mathematical theory of communication. The Bell System Technical
Journal, Vol XXVII, No. 3. pp 379-423.
Smith JM. 1970. Natural selection and the concept of a protein space. Nature 225: 563-564.
Sueoka N. 1992. Directional mutation pressure, selective constraints, and genetic equilibria. J Mol
Evol 34: 95-114.
Tomizawa J. 1993. Evolution of functional structures of RNA. In: Gesteland RF, Atkins JK, eds.
The RNA World. Plainview, New York. Cold Spring Harbor Press. pp 419-445.
Wada A. 1992. Compliance of genetic code with base-composition deflecting pressure. Adv
Biophys 28: 135-158.
Wise JA. 1993. Guides to the heart of the spliceosome. Science 262: 1978-1979.
Woese CR, Kandler O, Wheelis ML. 1990. Towards a natural system of organisms: proposal for
the domains archaea, bacteria, and eukarya. Proc Natl Acad Sci USA 87: 4576-4579.
Woese CR, Pace NR. 1993. Probing RNA structure, function, and history by comparative
analysis. In: Gesteland RF, Atkins JK, eds. The RNA World. Plainview, New York. Cold
Spring Harbor Press. pp 91-117.
Zarrinkar PP, Williamson JR. 1996. The kinetic folding pathway of the Tetrahymena ribozyme
reveals possible similarities between RNA and protein folding. Nat Struct Biol 3: 432-438.
FIGURE LEGENDS
Note: Figures 1, 2, and 3B were submitted for publication as color images. To view these figures
in color, see the original data files at http://www.santafe.edu/~pth/simplex.html.
FIGURE 1. The RNA simplex represents the space of all possible composition vectors and has
been visualized using molecular graphics software. Three composition vectors indicating the
midpoints of the GA, GC, and GU edges are depicted. The green line represents Chargaff's Axis,
indicating the direction of the gradient in G+C content. The two red lines represent gradients in
G+A and G+U content. These lines are mutually perpendicular, intersecting at the
isoheteropolymers. Position within the simplex can be unambiguously located by specifying the
G+C, G+A, and G+U content of an RNA sequence. These values can be easily calculated from
molecular sequence data and plotted within the RNA simplex, resolving patterns in base
composition that might otherwise be lost in simple G+C projections.
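The claim that G+C, G+A, and G+U content together locate a sequence unambiguously can be checked with a short calculation. Because the four base fractions sum to 1, the three sums add up to 2G + 1, so G and then the other three fractions follow directly. The sketch below illustrates this; the function names and the 10-nt test sequence are hypothetical and are not taken from the authors' software.

```python
def sums_from_sequence(seq):
    """Return (G+C, G+A, G+U) as fractions of the sequence length."""
    n = len(seq)
    g = seq.count("G") / n
    return (g + seq.count("C") / n,
            g + seq.count("A") / n,
            g + seq.count("U") / n)

def base_frequencies(gc, ga, gu):
    """Recover the (G, A, C, U) fractions from G+C, G+A and G+U content.

    Since G + A + C + U = 1, the three sums add up to 2*G + 1.
    """
    g = (gc + ga + gu - 1.0) / 2.0
    return g, ga - g, gc - g, gu - g

seq = "GGGAAACCUU"  # hypothetical 10-nt sequence, for illustration only
gc, ga, gu = sums_from_sequence(seq)
print("G+C, G+A, G+U:", gc, ga, gu)
print("recovered G, A, C, U:", base_frequencies(gc, ga, gu))

# The isoheteropolymer point, where all four bases are equally frequent.
print(base_frequencies(0.5, 0.5, 0.5))
```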
FIGURE 2. The empirical distributions of nucleotide base composition of functionally distinct
ssRNA sequences in the RNA simplex. These data are summarized in Table 1. Each distribution is
shown in the simplex in the same oblique perspective as in Fig. 1 (panels labeled i) and also from
the vantage point looking along Chargaff's Axis with the CG edge toward the observer (panels
labeled ii). In the panels labeled ii, the red lines indicate the directions of compositional gradients in
G+A and G+U content. These gradients are increasing in the direction of the arrows shown. Also,
for panels labeled ii, the G-homopolymer is to the upper-left and the C-homopolymer is to the
lower-right. The AU edge is behind the empirical distributions with the A-homopolymer at the
lower-left and the U-homopolymer at the upper-right. Sequences belonging to different Domains
are indicated by different colors: Archaea (red), Bacteria (blue), Eucarya (yellow). A: 5S rRNA.
B: 16S rRNA. C: 18S rRNA, metazoan phyla only. Taxonomic groups of rank lower than
Domain, such as metazoa or vascular plants, tend also to cluster into axis-like distributions. D: 23S
rRNA. E: RNase P RNA ribozymes. F: Group I self-splicing introns (green), group II
self-splicing introns (orange), and hammerhead ribozymes (white). G: Small nuclear RNA (snRNA);
U1 (magenta), U2 (white), U3 (red), U4 (blue), U5 (green), U6 (yellow). In general, the snRNA
distribution, and U1, U2, U3 and U4 in particular, show comparatively little organization in
composition space and are the most G+A-poor sequences examined. H: Cytoplasmic tRNA
(colored by Domain, viral sequences depicted in orange). The tRNA are G+U biased and, like the
5S rRNA, are more variable. I: Chloroplast tRNA sequences (green) are slightly G+C biased
while mitochondrial tRNA sequences (orange) show a remarkable AU bias. J: Plotted are the
calculated base compositions of 500 computer-generated random sequences of A, C, G, and U.
The sequence lengths span biologically relevant values: 74 positions (red), 120
positions (magenta), 400 positions (yellow), 1500 positions (green), and 3000 positions (blue). The
shorter sequences are more variable.
FIGURE 3. Evolutionary diversification of ssRNA via compensatory mutations results in the
observed axes lying parallel to but displaced from Chargaff's Axis. A, The rate of nucleotide
substitution in stems and loops of E. coli 16S rRNA. From phylogenetic comparisons, Neefs et
al. (1993) calculated the rate of base substitution at each site in the 16S rRNA sequence and
grouped the sites into 6 categories from low to high variability. Using the inferred secondary
structure for this molecule, we calculated the number of sites in both paired and unpaired regions
for each category of substitution rate. Loop regions are dominated by sites having low to moderate
substitution rates, while the stem regions are dominated by sites having high rates of substitution.
These data indicate that stems have nearly four times higher rates of evolution than loops. This
high rate of mutation in stem regions is the result of compensatory changes in Watson-Crick
partners, which are thought to have relatively small effects on structure and function and are therefore
relatively neutral. B, Hypothetical "neutral ridges" in RNA composition space. The ridge (blue) is
derived by calculating the base composition of E. coli RNase P RNA as if the stems had sustained
compensatory mutations of increasing or decreasing G+C contents. The wild type P RNA is
depicted in yellow. The ridge is an axis, parallel to Chargaff's Axis, and displaced by a magnitude
and direction stipulated primarily by the composition of the loop regions.
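The geometry of such a ridge can be illustrated with a simplified calculation. The sketch below is not the authors' procedure: it assumes stems contain only G-C and A-U Watson-Crick pairs (G-U wobble pairs are ignored), and the loop composition and the number of pairs are hypothetical values chosen for illustration. Under these assumptions each pair contributes exactly one base to the G+A count and one to the G+U count, so only G+C changes as pairs are exchanged.

```python
def ridge_point(loop_counts, n_pairs, gc_pair_fraction):
    """Composition (G+C, G+A, G+U) of a molecule whose stems hold `n_pairs`
    Watson-Crick pairs, a fraction `gc_pair_fraction` of which are G-C and
    the rest A-U. `loop_counts` maps each base to its count in unpaired regions.
    """
    gc_pairs = gc_pair_fraction * n_pairs
    au_pairs = n_pairs - gc_pairs
    g = loop_counts["G"] + gc_pairs
    c = loop_counts["C"] + gc_pairs
    a = loop_counts["A"] + au_pairs
    u = loop_counts["U"] + au_pairs
    total = g + c + a + u
    return (g + c) / total, (g + a) / total, (g + u) / total

# Hypothetical loop composition and stem size (not measured P RNA values).
loops = {"G": 40, "A": 60, "C": 30, "U": 30}
ridge = [ridge_point(loops, 110, f / 10) for f in range(11)]

for gc, ga, gu in ridge:
    print(f"G+C={gc:.3f}  G+A={ga:.3f}  G+U={gu:.3f}")
```

In this toy model G+A and G+U stay fixed along the ridge at values set by the loops, while G+C sweeps over a wide range, which is the sense in which the ridge runs parallel to Chargaff's Axis and is displaced by the loop composition.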
FIGURE 4. The variability in G+A and G+U among ssRNA is dependent on the length of the
sequence. Plotted are the standard deviations in G+C, G+A, and G+U content for the 15
functional classes and phylogenetic Domains listed in Table 1, as a function of their mean length (in
nucleotides). Plotted as a control is the standard deviation of simulated, random sequences of
various lengths (denoted G+X since the four simulated "monomers" are equivalent). Construction
of these random sequences is described in the Materials and Methods section. With the exception
of G+C content, which remains variable regardless of sequence length, these distributions are fitted
with power functions (r = 0.658 for G+A; r = 0.697 for G+U; r = 0.999 for G+X). For the longer
sequences, the variability of both G+A and G+U content decreases with increasing sequence length,
similar to the random sequences. For sequences shorter than 200 nt, however, the variability of the
ssRNA is smaller than that of the random sequences.
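For the random control, the dependence on length has a simple expectation: with independent, equiprobable bases, any two-base sum such as G+X is a binomial proportion with p = 0.5, so its standard deviation is sqrt(0.25/N). The sketch below compares that expectation with the random-sequence G+C values listed in Table 1 and fits a power function on log-log axes. The fitting routine is an ordinary least-squares sketch and is an assumption; the legend does not state how the authors fitted their curves.

```python
import math

# Random-sequence rows of Table 1: length N and standard deviation of G+C.
lengths = [25, 74, 120, 400, 1500, 3000, 5000]
sd_gc = [0.104, 0.0560, 0.0471, 0.0260, 0.0134, 0.00924, 0.00692]

def binomial_sd(n, p=0.5):
    """Expected standard deviation of a two-base fraction in a random sequence."""
    return math.sqrt(p * (1.0 - p) / n)

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b on log-log axes; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    return math.exp(my - b * mx), b

a, b = fit_power_law(lengths, sd_gc)
print(f"fitted power function: SD = {a:.3f} * N^{b:.3f}")
for n, observed in zip(lengths, sd_gc):
    print(f"N={n:5d}  observed={observed:.5f}  expected={binomial_sd(n):.5f}")
```

A fitted exponent close to -0.5 is what the binomial expectation predicts for the control.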
FIGURE 5. The compositional bias of ssRNA is partially dependent on sequence length. The
scatter plots are fitted with linear regressions (solid lines) with correlation coefficients indicated in
the upper right-hand corner. The broken line marks the composition value 0.5. A: G+C content
remains variable, though there may be a slight tendency to increase G+C content with sequence
length. B: G+A content is constrained and of the three measures shows the strongest tendency to
increase with sequence length. C: G+U content remains roughly constant with sequence length. D:
G+X content of randomly generated sequences of various lengths is plotted as a control. Note the
regression line has zero slope and the variability decreases with sequence length. This describes the
distribution of random sequences as concentric spheres centered on the isoheteropolymers and
having radii that decrease as sequence length increases.
Table 1. Phylogenetically and functionally representative compilation of ssRNA nucleotide composition.

RNA                     Taxon          n(a)   <N>(b)            <G+C>(c)          <G+A>(c)           <G+U>(c)
Comprehensive                          2800   287.0 ± 662.0     0.509 ± 0.200     0.516 ± 0.0310     0.523 ± 0.0462
23S rRNA                Archaea        15     2968.9 ± 65.8     0.588 ± 0.0525    0.567 ± 0.00597    0.509 ± 0.0212
                        Bacteria       39     2915.6 ± 86.3     0.526 ± 0.0383    0.570 ± 0.00720    0.517 ± 0.0102
                        Eucarya        33     3615.0 ± 470.3    0.530 ± 0.0832    0.540 ± 0.0139     0.520 ± 0.0143
18S rRNA                Metazoa        20     1821.0 ± 43.2     0.494 ± 0.0297    0.517 ± 0.0123     0.535 ± 0.0113
16S rRNA                Archaea        15     1530.7 ± 184.9    0.611 ± 0.0427    0.562 ± 0.00500    0.507 ± 0.00681
                        Bacteria       85     1511.8 ± 30.9     0.550 ± 0.0387    0.568 ± 0.00770    0.520 ± 0.0118
                        Eucarya        47     1823.6 ± 57.8     0.486 ± 0.0255    0.524 ± 0.00671    0.527 ± 0.00955
5S rRNA                 Archaea        26     124.5 ± 3.9       0.598 ± 0.0726    0.508 ± 0.0272     0.498 ± 0.0409
                        Bacteria       123    117.6 ± 3.9       0.575 ± 0.0574    0.520 ± 0.0308     0.495 ± 0.0344
                        Eucarya        234    119.3 ± 1.4       0.557 ± 0.0369    0.517 ± 0.0119     0.505 ± 0.0176
P RNA                   Archaea        7      400.9 ± 62.0      0.644 ± 0.0966    0.564 ± 0.0140     0.469 ± 0.0296
                        Bacteria       37     389.0 ± 39.7      0.569 ± 0.114     0.570 ± 0.0168     0.496 ± 0.0147
Group I Ribozymes                      13     757.15 ± 517.5    0.434 ± 0.105     0.561 ± 0.0302     0.492 ± 0.0268
Group II Ribozymes                     6      734.8 ± 853.8     0.355 ± 0.0933    0.566 ± 0.0259     0.495 ± 0.0276
Hammerhead Ribozymes                   2      43.5 ± 43.1       0.619 ± 0.00212   0.512 ± 0.0368     0.519 ± 0.0269
snRNA, U1               Eucarya        24     162.4 ± 4.2       0.557 ± 0.0228    0.487 ± 0.00957    0.548 ± 0.0152
snRNA, U2               Eucarya        16     187.8 ± 11.6      0.455 ± 0.0357    0.456 ± 0.0310     0.543 ± 0.0257
snRNA, U3               Eucarya        8      220.6 ± 14.6      0.468 ± 0.0684    0.477 ± 0.0417     0.561 ± 0.0207
snRNA, U4               Eucarya        11     143.6 ± 12.3      0.484 ± 0.0384    0.493 ± 0.0192     0.536 ± 0.0303
snRNA, U5               Eucarya        11     125.6 ± 29.7      0.411 ± 0.0345    0.448 ± 0.0279     0.531 ± 0.0220
snRNA, U6               Eucarya        14     101.5 ± 7.7       0.451 ± 0.0233    0.544 ± 0.0304     0.469 ± 0.0352
tRNA                    Comprehensive  2011   74.3 ± 6.3        0.496 ± 0.120     0.510 ± 0.0280     0.528 ± 0.0509
                        Archaea        121    77.0 ± 4.2        0.633 ± 0.0457    0.503 ± 0.0215     0.516 ± 0.0351
                        Bacteria       371    78.3 ± 5.2        0.580 ± 0.0522    0.506 ± 0.0222     0.522 ± 0.0317
                        Eucarya        436    75.8 ± 4.2        0.572 ± 0.0438    0.510 ± 0.0282     0.544 ± 0.0365
                        Chloroplast    291    75.6 ± 5.0        0.526 ± 0.0531    0.510 ± 0.0212     0.540 ± 0.0341
                        Mitochondria   742    70.3 ± 6.4        0.372 ± 0.0939    0.514 ± 0.0328     0.519 ± 0.0685
Random                                 500    25                0.498 ± 0.104     0.495 ± 0.0979     0.494 ± 0.130
                                       500    74                0.500 ± 0.0560    0.502 ± 0.0587     0.498 ± 0.0548
                                       500    120               0.500 ± 0.0471    0.501 ± 0.0457     0.500 ± 0.0446
                                       500    400               0.501 ± 0.0260    0.501 ± 0.0247     0.500 ± 0.0254
                                       500    1500              0.500 ± 0.0134    0.500 ± 0.0132     0.500 ± 0.0129
                                       500    3000              0.500 ± 0.00924   0.500 ± 0.00954    0.500 ± 0.00888
                                       500    5000              0.500 ± 0.00692   0.500 ± 0.00758    0.500 ± 0.00703

a n is the number of individual sequences in the data set.
b N is the length of the sequences in nucleotide residues. < > notation indicates mean value.
c G+C, G+A, and G+U were calculated as the fraction of these nucleotides in individual sequences. < > notation indicates
mean value. Variability is specified as ± one standard deviation.
FIGURE 1
[Image not reproduced.]
FIGURE 2
[Image not reproduced.]
FIGURE 3
A: [Plot not reproduced. Number of sites (y-axis, 50 to 250) against nucleotide substitution rate (x-axis, low to high); curves for Stems and Loops.]
B: [Image not reproduced.]
FIGURE 4
[Plot not reproduced. Standard deviation (y-axis, 0 to 0.14) against mean sequence length N (x-axis, 0 to 4000); series: G+C, G+A, G+U, and G+X (random sequences).]
FIGURE 5
[Scatter plots not reproduced. Panels A to D: composition (y-axis, 0 to 1) against sequence length N (x-axis, 0 to 6000). Correlation coefficients: A, r = 0.0697; B, r = 0.367; C, r = 0.0311; D, r = 0.00440.]