Anatomy of Escherichia coli Ribosome Binding Sites

Ryan K. Shultzaberger1,2, R. Elaine Bucheimer3, Kenneth E. Rudd4 and Thomas D. Schneider2*

1 University of Maryland, College Park, MD 20742, USA
2 National Cancer Institute at Frederick, Laboratory of Experimental and Computational Biology, P.O. Box B, Frederick MD 21702-1201, USA
3 University of Virginia School of Medicine, Charlottesville VA 22908, USA
4 Department of Biochemistry and Molecular Biology (R-629), University of Miami School of Medicine, P. O. Box 016129, Miami, FL 33101-6129, USA

J. Mol. Biol. (2001) 313, 215–228. doi:10.1006/jmbi.2001.5040, available online at http://www.idealibrary.com. 0022-2836/01/010215–14 $35.00/0. © 2001 Academic Press

*Corresponding author. E-mail address of the corresponding author: [email protected] and see http://www.lecb.ncif.gov/~toms/

Abbreviations used: SD, Shine-Dalgarno; IR, initiation region; GS, gap surprisal.

During translational initiation in prokaryotes, the 3′ end of the 16S rRNA binds to a region just upstream of the initiation codon. The relationship between this Shine-Dalgarno (SD) region and the binding of ribosomes to translation start-points has been well studied, but a unified mathematical connection between the SD, the initiation codon and the spacing between them has been lacking. Using information theory, we constructed a model that treats these three components uniformly by assigning to the SD and the initiation region (IR) conservations in bits of information, and by assigning to the spacing an uncertainty, also in bits. To build the model, we first aligned the SD region by maximizing the information content there. The ease of this process confirmed the existence of the SD pattern within a set of 4122 reviewed and revised Escherichia coli gene starts. This large data set allowed us to show graphically, by sequence logos, that the spacing between the SD and the initiation region affects both the SD site conservation and its pattern. We used the aligned SD, the spacing, and the initiation region to model ribosome binding and to identify gene starts that do not conform to the ribosome binding site model. A total of 569 experimentally proven starts are more conserved (have higher information content) than the full set of revised starts, which probably reflects an experimental bias against the detection of gene products that have inefficient ribosome binding sites. Models were refined cyclically by removing non-conforming weak sites. After this procedure, models derived from either the original or the revised gene start annotation were similar. Therefore, this information theory-based technique provides a method for easily constructing biologically sensible ribosome binding site models. Such models should be useful for refining gene-start predictions of any sequenced bacterial genome.

Keywords: ribosome; Shine-Dalgarno; information theory; sequence logo; sequence walker

Introduction

Ribosomes play a central role in cells by reading mRNA and synthesizing proteins.1 The entire high-resolution atomic structures of the 50S 2 and 30S 3 ribosomal subunits have been determined recently, but a full understanding of translation will also require quantitative mathematical descriptions. Because codons are three bases long, translational initiation must be directed to within one base on the mRNA. This requires a pattern in the mRNA known as a ribosome binding site, which includes the initiation codon. The completion of entire genome sequences, and the identification of likely genes within them, now allows for the inspection of most ribosome binding sites and allows for the statistics of the patterns to be determined in greater detail than was possible previously.4–7

In eukaryotes, ribosomes recognize the 7-methyl guanine cap to help identify the translation initiation codon.8 Prokaryotes, however, lack this marker and instead have a contact between the 3′ end of the 16S rRNA in the 30S ribosomal subunit and a region upstream of the initiation codon, referred to as the Shine-Dalgarno region (SD).9,10

The ribosome protects RNA further downstream than just the initiation codon;11 therefore, the downstream region should be accounted for when modeling ribosome binding. The region around the initiation codon is referred to as the initiation region (IR).

The SD has strong effects on translation,10,12,13 and one of its most intriguing features is the variable spacing between it and the initiation region. Preferential binding of the 16S rRNA at certain spacings has been shown.11,14–17 We investigated how this spacing affects the sequence conservation of the SD and IR, and the patterns being bound for the majority of ribosome binding sites in Escherichia coli.

Nucleic acid and protein sequences can be analyzed by information theory, an approach that was originally applied to quantify the movement of data in communication systems.18,19 Unlike statistical measures of significance, information measured in bits defines the minimum number of binary choices needed to represent some data. The advantage of this measure, over all other measures, is that information from independent sources can be added together, and bits provide a universal scale. In molecular biology, the amount of information indicates the degree of sequence conservation among a set of aligned sequences. It is a quantitative measure that has proven to be more useful than consensus sequences for understanding a variety of genetic systems.6,20–23 The average information computed from a set of related sequences6 describes the overall conservation at each position in the alignment, and this can be shown with a sequence logo graphic (e.g. see Figure 1(a)).24 The individual information present in a single sequence25 measures how that sequence contributes to the average sequence conservation of the sequence family, and this can be shown with sequence walker graphics (e.g. see Figure 4).26

Individual information is calculated as the sum of the conservation at each base position. Some bases are not favored and can have a negative value.25 A site with overall negative information content should, according to the second law of thermodynamics, have a positive ΔG, and therefore should not be bound.25 The theory, therefore, naturally provides a way to detect anomalous sites. Such sites can be removed to refine the dataset and thereby produce a more consistent model. Anomalous sites can be investigated to determine whether they represent sequencing errors, database errors, or novel biological phenomena.

Although the spacing between the SD contact and the IR is variable, a "rigid" ribosome model functioned reasonably well.6,7,27 However, as more sites were added to the model, the information content of the SD region dropped, suggesting that the model was not sufficient to explain the variation among sites. We therefore investigated a ribosome binding model where the spacing between the SD and the IR was allowed to vary. This flexible model provides a better representation than a rigid model does.

We describe four main results. First, multiple alignment of the regions upstream of E. coli genes by maximizing the information content identified the SD pattern without reference to the 16S rRNA sequence. Secondly, ribosome binding could be modelled using a unified mathematical representation for the aligned SD, the IR, and the distribution of spacings. Thirdly, the second law of thermodynamics sets zero as the theoretical lower bound for the information of binding sites,25,28 so we could iteratively remove sites with negative information to heighten the model's predictive capabilities. Finally, further characterization of the Shine-Dalgarno model allowed us to observe how the SD pattern varies with distance from the initiation region.

Theory

Since early ribosome binding site models did not account for variable spacing between binding components,7,27 a new method for analyzing flexible sites was developed. First, the individual information of a binding site is computed from a rigid weight matrix defined as:25

$$R_{iw}(b,l) = 2 - \left(-\log_2 f(b,l) + e(n(l))\right) \quad \text{(bits per base)} \tag{1}$$

where f(b,l) is the frequency of each base (b) at position (l) in the aligned binding site sequences and e(n(l)) is a sample size correction factor for the (n) sequences at position (l) used to create f(b,l).6
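Equation (1) can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the authors' software: the function names are hypothetical, the correction e(n) is approximated here as 3/(2 ln(2) n) for a four-letter alphabet (the exact correction used in the paper may differ), and a base that never occurs at a position is assigned negative infinity purely for simplicity.

```python
import math
from collections import Counter

BASES = "ACGT"


def small_sample_correction(n: int) -> float:
    """Approximate e(n) for four bases: (4 - 1) / (2 ln(2) n).

    Assumption: this large-sample approximation stands in for the
    correction factor cited in the text.
    """
    return 3.0 / (2.0 * math.log(2) * n)


def weight_matrix(sites: list[str]) -> list[dict[str, float]]:
    """Build R_iw(b, l) = 2 - (-log2 f(b, l) + e(n(l))) from aligned sites."""
    n = len(sites)
    e_n = small_sample_correction(n)
    matrix = []
    for column in zip(*sites):  # one column per aligned position l
        counts = Counter(column)
        row = {}
        for base in BASES:
            f = counts[base] / n
            # Unobserved bases get -inf here (an assumption of this sketch).
            row[base] = 2.0 - (-math.log2(f) + e_n) if f > 0 else float("-inf")
        matrix.append(row)
    return matrix


def individual_information(matrix: list[dict[str, float]], site: str) -> float:
    """Sum the weights of the bases actually present in one site (bits)."""
    return sum(row[base] for row, base in zip(matrix, site))
```

Summing the weights along a sequence, as in `individual_information`, gives the individual information of that site in bits; a site built from disfavored bases can come out negative.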

Then, to evaluate the individual information of a ribosome binding site using a flexible model, we calculated three values:

$$\text{Flexible Site Information} = R_i(\mathrm{SD}) + R_i(\mathrm{IR}) - GS(d) \quad \text{(bits/site)} \tag{2}$$

Ri(SD) is the individual information of the aligned SD, Ri(IR) is the individual information of the aligned IR, and GS(d) is the gap surprisal, which accounts for the variable spacing d.
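As a rough illustration of equation (2), the sketch below scores one initiation region against several candidate SD positions and keeps the spacing with the highest flexible site information. Choosing the best-scoring spacing is an assumption made for this example, as are the function name and the numbers in the usage note; the text does not describe how the authors' programs select among spacings.

```python
def best_flexible_site(
    ri_sd_by_spacing: dict[int, float],
    ri_ir: float,
    gap_surprisal: dict[int, float],
) -> tuple[int, float]:
    """Return (spacing d, Ri(SD) + Ri(IR) - GS(d)) for the best spacing.

    ri_sd_by_spacing maps each candidate spacing d to the SD individual
    information found at that spacing; gap_surprisal maps d to GS(d).
    Spacings without a GS value are skipped (they were never observed).
    """
    scored = {
        d: ri_sd + ri_ir - gap_surprisal[d]
        for d, ri_sd in ri_sd_by_spacing.items()
        if d in gap_surprisal
    }
    if not scored:
        raise ValueError("no candidate spacing has a gap surprisal value")
    best = max(scored, key=scored.get)
    return best, scored[best]
```

With hypothetical values, an SD of 6.0 bits at a rare spacing (GS = 6.0 bits) loses to an SD of 5.0 bits at a common spacing (GS = 2.5 bits), which shows how the gap surprisal penalizes unusual spacings.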

The SD was aligned in two steps. First, the sequences upstream of the initiation codon were embedded into random sequence so as not to trigger alignment by the well-conserved initiation codon. Second, the sequences were shuffled to maximize the information content.29 By aligning the SD, we obtained the distribution of distances from the IR. Any probability distribution has an uncertainty measured in bits:

$$H = -\sum_d p_d \log_2 p_d = \sum_d p_d \left(-\log_2 p_d\right) \quad \text{(bits)} \tag{3}$$

where p_d is the probability of the distance d.18,19 Rewriting the uncertainty as shown on the right-hand side shows that it can be expressed as an average of the surprisal function:30

$$u_d = -\log_2 p_d \quad \text{(bits/spacing)} \tag{4}$$

n(d), the number of sites with a binding distance d, is divided by the total number of sites, n, to obtain the frequency of binding at each distance. The GS equation is therefore:

$$GS(d) = -\log_2 \frac{n(d)}{n} + e(n) \quad \text{(bits/spacing)} \tag{5}$$

where e(n) is a small-sample correction for GS, required because we have substituted a frequency for the probability p_d.6,25

Figure 1. Rigid and flexible ribosome binding site sequence logos. (a) The rigid model of the entire EcoGene12 set;34 (b) the EcoGene12 unrefined flexible model; (c) the EcoGene12 refined flexible model; (d) the Verified refined flexible model. For all logos, the height of each stack of letters corresponds to the total sequence conservation at that position, measured in bits.6 The height of each letter corresponds to the relative frequency of that base at that position.24 The sine wave represents the 11 base twist of A-form RNA.53 The histogram between each pair of flexible logos represents the distribution of distances between the Shine-Dalgarno and the initiation region zero coordinates. A Gaussian distribution with the same mean and standard deviation is shown for comparison. All logos on the left represent the Shine-Dalgarno alignment and all logos on the right represent the initiation region alignment. The sequence shown under each SD logo is the anti-Shine-Dalgarno sequence found on the 3′ end of the 16S rRNA.

GS(d) is positive when there is more than one spacing possibility and it has the same units (bits) as Ri(SD) and Ri(IR). We assume that the spacing is independent of the SD and IR,16 so we subtracted GS(d) from the SD and IR individual information to obtain equation (2). Other similar methods31,32 cannot be used to compare models from different datasets, because they use consensus sequences, which are sensitive to small changes in the sequences. In contrast, the individual information method allows comparison between matrices from different recognizers (SD and IR in this case) and evaluations converge to a single value as the data set size increases.25
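The gap surprisal and the spacing uncertainty described above can be sketched in a few lines of Python. The function names are hypothetical, and the small-sample correction e(n) is left as a caller-supplied parameter (defaulting to zero) because its exact form is not given in this section.

```python
import math
from collections import Counter


def gap_surprisal(distances: list[int], e_n: float = 0.0) -> dict[int, float]:
    """GS(d) = -log2(n(d) / n) + e(n) for each observed spacing d.

    e_n is supplied by the caller; 0.0 means no small-sample correction.
    """
    n = len(distances)
    return {d: -math.log2(count / n) + e_n
            for d, count in Counter(distances).items()}


def spacing_uncertainty(distances: list[int]) -> float:
    """H = sum over d of p_d * (-log2 p_d), using frequencies for p_d."""
    n = len(distances)
    return sum((count / n) * -math.log2(count / n)
               for count in Counter(distances).values())
```

Without the correction term, the uncertainty H is simply the frequency-weighted average of the GS values, which is the relationship between the uncertainty and the surprisal function stated in the text.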


Results

Characteristics of flexible ribosome binding site models

Several different E. coli ribosome binding site models were used for various purposes. Models may be rigid, in which case all parts are fixed relative to a zero coordinate, or flexible, in which case the model contains two rigid parts (SD and IR) separated by a variable distance. These models are further characterized as being either unrefined or refined. Refinement refers to a cyclic process in which an individual information model is made from the current set of binding sites and then sites that have negative information content are removed from the set. This process is repeated until only positive sites remain (see Materials and Methods). In order for the information to be calculated for a flexible model, one must take into account the statistics of the spacing between binding components. The effect of the spacing, called the gap surprisal (GS) (Theory, equation (5)), is given in bits and is subtracted from the sum of the information present in the SD and IR binding components, also measured in bits (Theory, equation (2)).
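The cyclic refinement described above can be sketched generically. This is a schematic outline under stated assumptions rather than the procedure from Materials and Methods: `build_model` and `score` are hypothetical callables standing in for model construction and individual information evaluation, and sites scoring exactly zero are kept because the text only speaks of removing negative sites.

```python
from typing import Callable, Sequence, TypeVar

Site = TypeVar("Site")
Model = TypeVar("Model")


def refine(
    sites: Sequence[Site],
    build_model: Callable[[list[Site]], Model],
    score: Callable[[Model, Site], float],
) -> tuple[Model, list[Site]]:
    """Rebuild the model and drop negative sites until none are removed."""
    current = list(sites)
    while current:
        model = build_model(current)
        # Keep only sites that the current model does not score as negative.
        kept = [site for site in current if score(model, site) >= 0]
        if len(kept) == len(current):
            return model, current  # stable: no negative sites remain
        current = kept
    raise ValueError("refinement removed every site")
```

Because the model is rebuilt after each removal, a site that scored positive in one cycle can become negative in the next, which is why the loop runs until the set stops changing.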

We used three databases in this work. The protein-coding feature locations in the complete-genome GenBank entry U00096 have not been updated since the original publication,33 so our first database was the alternative set of gene intervals present in EcoGene12.34 This revised database, which contains 4122 known and putative translation start sites in E. coli, is the result of an intense and continuous effort to improve the annotation and prediction of E. coli genes. Second, we used the Verified subset of this database, which is composed of protein start sites confirmed by N-terminal protein sequencing. The third database is the original E. coli annotation from Blattner (Blattner et al.,33 GenBank U00096). To create a reliable baseline model, we refined the Verified set. In contrast, a ribosome model built from the refined EcoGene12 database is probably the most representative of all genes. We also refined the Blattner database to determine if we could automatically derive a model comparable to EcoGene12.

The Verified model is derived from ribosome binding sites for proteins that have been well studied and/or detected as spots on 2D gels and it probably lacks many sites that show lower binding affinity. Despite this bias, the Verified model is useful, since it is composed only of sites proven to be actual ribosome binding sites. For example, the range of SD to IR spacing from −18 to −4 was established by observing spacings utilized within the Verified set. The EcoGene12 model is based on the full set of proven and predicted gene starts, and thus is representative of all ribosome binding sites, including weak sites responsible for low-level protein expression. Although EcoGene12 may contain a few predicted sites that turn out to be incorrect, we consider it to be the most accurate model, and therefore we used it as our benchmark model.

The rigid model sequence logo made from all EcoGene12 translation start sites (Figure 1(a)) shows the expected strong conservation for the initiation region at bases 0, +1 and +2 and a low region of conservation from bases −12 to −6 for the SD. When the SD was re-aligned to maximize the information,29 its information present rose from 1.53(±0.03) to 4.96(±0.04) bits. (We report here the mean Rsequence and standard error of this mean from the individual information distribution.25) The range of re-alignment (−18 to −4) was selected to allow for all spacings observed in the Verified model. The SD was realigned using only sequences from translation start sites, and this was done independently of the 16S rRNA sequence, yet the sequence logo closely complements the 16S rRNA 3′ end. This flexible model has an SD-IR spacing of −18 to −4 bases, with a peak of occurrence at −9 (Figure 1(b)). When the model was tightened by using the exclusionary refinement process (Figure 1(c)), there was again an increase in the information present in the SD logo to 5.23(±0.04) bits. In contrast, the refined Verified SD has 5.77(±0.10) bits with an SD-IR spacing of −18 to −4 bases, with a peak of occurrence at −10 (Figure 1(d)).

Logos were made in the same fashion as Figure 1(a)-(c) for the Blattner sites. Since the logos looked similar to the EcoGene12 logos, they are not shown. When the SD region was re-aligned to maximize information in the Blattner model, the SD information rose from 0.91(±0.03) to 3.87(±0.05) bits. This model has an SD-IR spacing of −16 to −2 bases, with a peak of occurrence at −8. Refinement of the Blattner model also showed a further increase in the SD information to 5.01(±0.04) bits. The most noticeable difference between the unrefined and the refined Blattner model is at the zero position of the aligned SD. This position went from a partially conserved G at approximately 1.5 bits to a fully conserved G at 2 bits, while the rest of the positions increased proportionally. Interestingly, upon refinement the SD-IR spacing shifted to −18 to −4 and the peak shifted to −9 bases. This is the same range as that seen in the well-characterized Verified model (Figure 1(d)).

For all models, a Gaussian distribution with the same mean and standard deviation as the respective SD-IR spacing distributions was plotted along with the spacing histogram. In all cases, the histogram did not match the Gaussian plot.

Using the individual information method,25 all sites in the Verified set were evaluated by the rigid and flexible EcoGene12 refined models over the range of 30 bases upstream to 14 bases downstream of the first base of the initiation codon. This is the range required to identify a site with the maximum SD-IR spacing of 18 bp. Previous information theory-based ribosome evaluations with a rigid model have been reasonably accurate,7 but since that model does not take into account variable spacing, it is limited in its analysis of ribosome binding. The rigid EcoGene12 model (range −21 to +14) picked up about the same number of upstream non-sites (sites with more than zero bits of information other than those annotated in the data set) as the flexible model (92 versus 89, respectively). The two models identified nearly the same number of Verified start points (565 versus 567). The average site strength assessed by the rigid model was 9.50(±0.13) bits, and with the flexible model it was 10.17(±0.14) bits, indicating that the flexible model generally assessed the Verified sites more strongly.

Shine-Dalgarno as a function of spacing

To understand better the function of the Shine-Dalgarno, we examined SD sequence logos at every SD-IR spacing in the EcoGene12 set (Figure 2). The shape and pattern of the SD remained fairly constant, but the information present fluctuated. There was a constant increase in the information as the spacing was increased from −4 to −9, and a decrease in information for −9 to −18. This is reflected in the change of the size of the bases surrounding the central G. The information present in the SD at each alignment relative to the IR is only weakly related to the conservation of information in the IR (r = −0.17) (Figure 3). When the total flexible site information, as calculated from equation (2) (see Theory), was examined for all positions, an increase and decrease in information was observed similar to that observed with the SD region alone. When the refined Blattner sites were split into spacing classes, similar results were obtained (data not shown).

For each spacing class, the program diana35 was used to determine if there was any correlation between bases in the SD and IR. None was observed at any spacing (data not shown), suggesting independence between the SD and the IR. In addition, no correlation is observed between parts of the refined EcoGene12 SD (Figure 1(c)) when all classes are combined.

For all spacings of −4 to −11 there is an A with low conservation at position −3,5,6 and it is present also from −16 to −18, indicating that conservation at this position is an effect of the initiation contact and not the SD (Figure 2).

The minimum SD-IR spacing of −4 has been observed in nadB36 but appeared infrequently in EcoGene12. Binding of regions with spacings more than 18 bases is known, but these are rare and due to RNA structural effects such as hairpins that bring the SD closer to the IR.12

Figure 2. The Shine-Dalgarno as a function of spacing. Sequence logos were constructed for all distances between the SD and IR zero coordinates observed in the EcoGene12 refined set. The black circle falls under the central G of the Shine-Dalgarno, which is the zero coordinate of the SD in the variable model.

Figure 3. Quantification of ribosome binding site components as a function of spacing. The information present in the Shine-Dalgarno regions of Figure 2 (shown in green boxes) was plotted at the respective distances. The information content was measured over the region 12 bases prior to and 4 bases after the central G of the Shine-Dalgarno, except for the spacing of −4, whose information is measured over the range of −12 to +3, because of interference with the IR at position 0. The information present in the IR for the range of −3 to +14 at each distance is shown in black (with small filled circles). The gap surprisal GS, computed by equation (5) from the distance distribution in Figure 1(c), is plotted as open blue circles. The red curve with no symbols shows the total flexible information at each spacing, as calculated by equation (2). For all cases, error bars are plotted with black I symbols (the error for GS is smaller than the circle).

Correlation between the refined Blattner and EcoGene12 models

To test whether the refined EcoGene12 model is accurate and can be used to correct sequence annotations, we scanned the model across several proven ribosome binding sites and across several sites predicted by Blattner that have been corrected in EcoGene12 (Figure 4). When we applied our model to the well-studied lacZ and lacI initiation regions, it concurred with Blattner's locations (Figure 4(a) and (b)). In the case of the 8.1 bit lacI ribosome binding site, which starts at a GTG in the context atGTGa, a second weaker 5.5 bit site is seen at the out-of-frame ATG just upstream. Interestingly, ribosomes binding to this site should terminate immediately at the TGA. Using two of Blattner's sites that have been corrected based on N-terminal protein sequencing, we tested whether our model locates the correct binding sites. In mhpD (Figure 4(c)), we saw a 12.8 bit site at the correct location six bases downstream from Blattner's prediction. Our model did not predict any site (Ri > 0) at the Blattner location. In yhbL (Figure 4(d)) there is a predicted weak 4.5 bit site at the incorrect position, but experimentally the start site was proven to be nine bases downstream and our model favored this location (13.7 bits). As expected, in both cases the correct site was found in the same reading frame as the predicted site. When the refined Blattner model was scanned over these same sites, the same predictions were made, indicating that the refined Blattner model is comparable to the refined EcoGene12 model.

To investigate further the effect of refinement, we scanned both the Blattner unrefined and refined models over all of the EcoGene12 sites for regions 100 bases upstream and 100 bases downstream of each of the 3900 refined EcoGene12 start points. The unrefined Blattner model found 21,464 non-sites and the refined model found considerably fewer non-sites (12,018). This large number of sites detected may represent weak ribosome binding sites, untranscribed regions or may be false positive artifacts of this model. Alternatively, some of these sites may be occluded by RNA secondary structure. Since the unrefined Blattner model contains many non-sites, it has a lower information content and therefore picks up more non-sites than the refined Blattner model. Both the unrefined and the refined Blattner models identified all of the EcoGene12 sites.

To generalize Figure 4(c) and (d), we scanned both the refined Blattner model and the refined EcoGene12 model over the 26 sites in the Blattner annotation that have been corrected in the EcoGene12 dataset based on experimental verification. The Blattner model identified the experimentally reported start site as the strongest site in 21 of the 26 cases. In four of the five other cases, the model assessed the Blattner annotation more strongly, but also found a site at the confirmed start point. For one gene (gntK), the model did not match either Blattner's annotation or the experimentally proven TTG start. When this same analysis was done using the refined EcoGene12 model, the correct site was predicted in 22 of the 26 sites and three of Blattner's annotations were favored. As with the refined Blattner model, no site was predicted for the gntK gene at either the Blattner or EcoGene12 locations. As exemplified by Figure 4(d), in approximately half of the 26 corrected sites both models predicted strong sites at the verified locations and these were accompanied by weaker sites at Blattner's locations. In the three (EcoGene12) or four (Blattner) cases where the verified start was weaker than the Blattner site for either model, the difference in site strength between the site at the Blattner location and the site at the Verified location was generally only 1 to 2 bits (except for one case, where the difference was around 5 bits). These results show that refined flexible information models can be used to improve ribosome binding site predictions.

Figure 4. Lister maps with sequence walkers for four ribosome binding sites. Blattner's sequence, GenBank accession number U00096, is annotated with 4290 gene starts,33 four of which are illustrated; (a) lacZ, (b) lacI, (c) mhpD and (d) yhbL. The EcoGene12 flexible model (Figure 1(c)) was scanned across each sequence. The sites that are found (Ri > 0) are shown by two part sequence walkers. A walker is a graphic consisting of several adjacent letters with varying heights.26 Vertical green rectangles indicate the zero coordinate of each sequence walker and provide a scale from −3 to +2 bits. Braces ({ and }) connected by a broken line are used to link SD and IR walkers. This feature, created by the program biscan, also reports the distance of separation, the coordinate of the IR and the flexible site information value according to equation (2). All correct translation start points, based on experimental data, are identified by a filled black arrow starting at the initiation start point. The broken-line arrow shows Blattner's predicted gene start. The color bar above the sequence cycles through three colors to illustrate the reading frames. In cases (c) and (d), it is obvious that the predicted (boxed and broken-line arrow) and corrected sites (boxed and full arrow) fall in the same reading frame because the adenine bases lie under the same color. The sine waves represent the 11 base twist of A-form RNA.53 The asterisks and numbers above the sequence indicate positions on the Escherichia coli genome.33

Can we create a valid ribosome model from the large lists of gene start-points determined from open reading frames that are presented as annotations for complete genome sequences? To test for relatedness, we compared various models using the Euclidean distance between Riw(b,l) matrices (Materials and Methods, equation (6)). The distance between the unrefined Blattner SD matrix and the refined EcoGene12 SD matrix was 13.0 bits and the distance between the corresponding IR matrices was 2.2 bits. In contrast, when the refined Blattner was compared to the refined EcoGene12 model, there was a much smaller difference: for the SD matrix there was a distance of 1.1 bits and for the IR matrix there was a distance of 0.9 bit. Refining the Blattner model brought it closer to the refined EcoGene12 model, which is representative of the bulk of E. coli ribosome binding sites.

When the individual information distributions for all models were compared, there was a general increase in the strength of sites from the unrefined to the refined to the Verified model (Figure 5 and Table 1). This effect may occur not only because the refinement process removes negative sites, but also because the well-characterized sites in the Verified model may tend to neglect weaker sites, as these may often be harder to characterize biochemically. The sets overlap reasonably well, since 507 of the 569 Verified sites are found in the refined Blattner set, and all but six of the Verified sites are found in the refined EcoGene12 model.


Figure 5. Individual information distributions for five ribosome binding site models. The ordinate is the individual information and the abscissa is the frequency of occurrence. (a) The information distributions for the EcoGene12 unrefined (red circles), the EcoGene12 refined (green boxes) and the Verified model (black, with no symbols). (b) The information distributions for the Blattner unrefined (red circles), the Blattner refined (green boxes) and the Verified model (black, with no symbols).


Discussion

Unlike gene-finding programs,37,38 ribosomes do not use open reading frames or other global factors to recognize translational start points, so our philosophy is to model the ribosome explicitly. A pure model has the advantage that it can identify ribosome binding sites in the center of genes, such as the out-of-frame one at E. coli coordinate 3919396 in atpB (formerly uncB)39 and those of short polypeptides as in transcription attenuation.40

To create a flexible ribosome model, we first removed the initiation codon and downstream open reading frame by embedding the SD region into random sequence. This allowed us to use multiple alignment to focus the SD region by maximizing its information content.29 The SD emerged easily (Figure 1(b)), demonstrating mathematically the existence of this feature in the majority of E. coli ribosome binding sites. Furthermore, the general pattern matches the 3′ end of the 16S rRNA well, confirming independently that these are complementary to each other.37

Table 1. Comparing individual information distribution values

| Model | Mean (bits) | Std Dev. (bits) | SEM (bits) | n |
|---|---|---|---|---|
| Blattner unrefined | 6.83 | 4.84 | 0.07 | 4290 |
| Blattner refined | 8.82 | 3.63 | 0.06 | 3509 |
| EcoGene12 unrefined | 8.81 | 3.99 | 0.06 | 4122 |
| EcoGene12 refined | 9.28 | 3.58 | 0.06 | 3900 |
| Verified | 10.35 | 3.73 | 0.16 | 569 |

We report the mean, standard deviation (Std Dev.), standard error of the mean and number of sites for each model. These values correspond to the distributions in Figure 5.

In contrast with the notion of an SD consensus sequence, sequence logos show that the SD is variable and its pattern depends on how far the sequence is from the IR (Figure 2). Despite this variability, the information content of the SD is relatively constant at various spacings, smoothly increasing and decreasing in a range of only 2 bits from −18 to −5 (Figure 3). Surprisingly, the SD-IR spacing contributes more variation to the total information than either the SD or the IR. Furthermore, the variation of the SD information works in the same direction as the gap surprisal; they do not compensate for each other but instead work together. This sets up a maximal range of variation for efficiency of translational initiation. These observations are consistent with the SD-anti-SD helix formed between the rRNA and the mRNA as being a reasonably consistent "object", whose placement relative to the IR is important.

In all cases (Figure 1) the spacing distribution between the SD and the initiation region was similar to but differed from a Gaussian distribution. There is a predominance of −9 and −8 spacings (and −10 for the refined Verified set). A spring (simple harmonic oscillator41) moving under the influence of random thermal noise should produce a Gaussian spacing distribution.42 Since there is a non-Gaussian distribution, the SD-mRNA helix appears to have more constraints than a freely oscillating spring. What these bounds are may become apparent only when crystal structures of initiating ribosomes have been determined, but a clue that the meaning is related to the placement of the SD-anti-SD helix comes from the shape of the SD sequence logos.

Unlike the rectangular block that a consensus sequence would make on an information graph, the SD sequence logo rises smoothly and declines with position (Figure 1(c)). This is consistent with the idea that mismatches at the center of an RNA-RNA hybrid should be more disruptive than mismatches towards the ends. However, the situation may be more complicated. Sequence logos for duplex DNA binding proteins also rise and decline with position.21,43-45 One intriguing explanation is that the formation of the mRNA-rRNA hybrid is followed by binding of a ribosomal protein or RNA3,46 into the resulting major or minor groove as a step during translational initiation. Such a model accounts for the shape of the SD sequence logo because proteins tend to evolve contacts on one face of a helix, and such contacts become progressively more difficult to form when they are close to the back face.44 The proximity of protein S1 to the SD11,47,48 suggests it as a candidate for this process, but other proteins such as S7, S18 and S21,49 and various 16S rRNA positions50,51 that crosslink to the mRNA52 could be involved. To allow us to judge the validity of this model, we added a dashed sine wave to the sequence logos. The peaks of this wave are separated by 11 bp, which is the distance between two major grooves of A-form RNA.53 Preferred spacings of the SD (Figure 2) are consistent with this model, but there is clearly a greater degree of flexibility than in DNA-protein interactions. However, tight packing is observed throughout the 70S subunit2,54 and there is close packing in the 30S,3 so it is likely that the fully assembled initiation region is also tightly packed. This suggests a mechanism for initiation in which the binding of the 16S 3′ end to the mRNA SD allows the resulting helix to be smaller than unpaired single strands would be. The smaller helix could pack against other components of the ribosome, reducing the volume further and completing initiation, perhaps by creating sufficient space in the A site for the next tRNA. Even a non-specific RNA phosphate backbone binding into the minor groove between the SD and the mRNA3,46 could account for the sinusoidal shape of the aligned SD sequence logo. IF3 appears to recognize codon-anticodon complementarity at the initiation codon rather than direct recognition of the codon itself.55 Because complementarity usually creates a more compact structure than a mismatch, this effect is also consistent with a packing model for initiation. Finally, this tight-packing model may account for why the SD-IR spacing is narrower than a Gaussian distribution: the simple harmonic oscillator is confined in a box.

The concept of individual information25 allows us to consistently apply an information measure not only to the SD and IR but also to the gap distance between individual sequences, thereby creating a flexible search tool. The problem of how to compute the information content of flexible binding sites was recognized previously.6 If two sequence elements have a variable distance between them, then the uncertainty in position decreases the overall information content. For example, GC, with an information content of 2 bases × 2 bits per base = 4 bits, is found every 16 bases in equiprobable DNA, while GNC is found at the same frequency. A shorthand notation for the set containing both of these is G1EC, in which 1E means to search for G followed by C with an extendable spacing of 1 or 0 bases.5 Because it contains the search for both GC and GNC, G1EC occurs approximately every 8 bases in equiprobable random DNA. So, although the G and C contribute 4 bits of information, the variable spacing removes 1 bit and the site is therefore effectively only 3 bits. With G3EC there are four possible search patterns, GC, GNC, GNNC and GNNNC; this removes log2 4 = 2 bits. Interestingly, G15EC has 16 search patterns and this removes 4 bits giving the, at first sight, odd result that the information content is 0 bits. However, in a sequence M bases long, G15EC will occur roughly M times because of overlapping cases, so the result is consistent. It is interesting to note that there can be sites with negative information by this method: in a sequence of length M, G31EC will occur roughly 2M times, giving an apparent information of −1 bit. The reason for this odd effect is that there are many overlapping sites. We interpret zero or negative information to mean that the two components are independent.
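The arithmetic of the extendable-gap examples above can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (equiprobable DNA, every allowed spacing equally likely); the function name `flexible_pattern_bits` is hypothetical and is not one of the Delila programs.

```python
import math

def flexible_pattern_bits(fixed_bases: int, max_gap: int) -> float:
    """Information (bits) of a pattern with `fixed_bases` fully specified
    bases and an extendable gap of 0..max_gap bases.

    Assumes equiprobable DNA (2 bits per specified base) and that each
    of the max_gap + 1 spacings is equally likely.
    """
    rigid_bits = 2.0 * fixed_bases         # e.g. G and C contribute 4 bits
    gap_penalty = math.log2(max_gap + 1)   # log2(number of search patterns)
    return rigid_bits - gap_penalty

# GC, G1EC, G3EC, G15EC and G31EC from the text
for max_gap in (0, 1, 3, 15, 31):
    print(f"max gap {max_gap:2d}: {flexible_pattern_bits(2, max_gap):+.1f} bits")
```

Under these assumptions the expected spacing between occurrences is 2 raised to the returned value, which reproduces the "every 16 bases" and "every 8 bases" figures for GC and G1EC.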

We have extended these computations by using Shannon's uncertainty measure to consistently assess the contribution when different spacings occur with different frequencies. Because frequencies are not probabilities, a small sample correction was applied.6

Fortunately, the negative information effect does not occur for ribosome binding sites. In the refined EcoGene12 model (Figure 1(c)), the SD contains 5.80(±0.04) bits, the IR contains 6.72(±0.04) bits, and the uncertainty of the distance between them (gap uncertainty, Hgap) is 3.25(±0.02) bits, giving a total information content6 of Rs(SD) + Rs(IR) − Hgap = 9.28(±0.06) bits. This is similar to the refined rigid EcoGene12 model, which has 8.92(±0.05) bits, but quite different from the flexible Verified model with 10.35(±0.16) bits. We suggest that the difference occurs because strong sites tend to be experimentally identified first and some non-functional sites may still contaminate the refined EcoGene12 model. The latter effect can be observed in Figure 5, where the unrefined EcoGene12 has examples of sites below zero, while the refined EcoGene12 set does not have any sites below zero bits, by definition. While there are no sites below zero in the refined Verified set (because we removed the 13 that we found), the lower end of the distribution curve is smaller than that for the refined EcoGene12. Further, the shape of the Verified distribution is a more Gaussian-like curve, trailing smoothly down to nearly zero at zero bits,25 while the refined EcoGene12 distribution still has members near zero bits and is therefore discontinuous. It is not known if these very weak sites are functional.
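As a rough illustration of how the gap uncertainty enters the total, the following Python sketch computes Shannon's uncertainty for a spacing histogram and combines it with the SD and IR values quoted above. The histogram is a made-up toy, the function names are hypothetical, and the small sample correction used in the paper is deliberately left out.

```python
import math

def gap_uncertainty(counts):
    """Shannon uncertainty H = -sum p*log2(p), in bits, of a spacing
    histogram given as {spacing: number_of_sites}.
    Assumption: no small-sample correction is applied in this sketch."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def total_information(rs_sd, rs_ir, h_gap):
    """Rs(SD) + Rs(IR) - Hgap, in bits."""
    return rs_sd + rs_ir - h_gap

# Toy histogram (hypothetical): four spacings used equally often -> 2 bits
toy_histogram = {-10: 25, -9: 25, -8: 25, -7: 25}
print(gap_uncertainty(toy_histogram))

# Rounded values reported for the refined EcoGene12 model
print(round(total_information(5.80, 6.72, 3.25), 2))
```

With the rounded inputs the sum comes to 9.27 bits, which agrees with the reported 9.28 bits to within rounding of the three terms.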

The Verified sites that we removed during refinement (gene at U00096 coordinates and orientations: uppS 194903+, gsk 499349+, fes 612038+, dbpA 1407535+, topB 1844984−, guaB 2632090−, xseA 2632252+, trmD 2743359−, pcm 2867542−, cysI 2888122−, dnaN 3879949−, aceK 4216175+ and arcA 4637875−) presumably initiate differently from the majority of sites. Surprisingly, this set does not contain infC, which codes for IF3. In the absence of this initiation factor the ribosome can use the AUU start of infC (1798662−) for initiation, forming a regulatory feedback loop.55 By the Verified model, the AUU-containing IR is −4.8 bits, but this is compensated by an SD of 9.6 bits at the optimal spacing of −9 bases, where the gap surprisal (GS) is only 2.3 bits, to give a total of 9.6 − 4.8 − 2.3 = 2.5 bits. This anomalous site was removed automatically during refinement of EcoGene12, because the G at the third base of start codons is otherwise invariant and rare bases are weighted against more heavily in larger datasets.25

By the EcoGene12 model, the infC IR is −8.4 bits with an SD at a −9 spacing of 8.7 bits for a total of −2.0 bits. This model predicts that mutating the start codon from AUU to AUG should bring the IR up to 5.5 bits to give a strong 11.9 bit site.
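The infC numbers can be checked with a short Python sketch. It assumes that the flexible site information is the SD and IR individual informations minus the gap surprisal, which is the form consistent with the totals quoted in the text. The function names are hypothetical, and the 2.3 bit gap surprisal used for the EcoGene12 model is inferred from the quoted totals rather than stated directly.

```python
import math

def gap_surprisal(p_spacing: float) -> float:
    """Surprisal -log2(p), in bits, of observing a spacing that occurs
    with probability p_spacing in the spacing histogram."""
    return -math.log2(p_spacing)

def flexible_site_bits(ri_sd: float, ri_ir: float, gs: float) -> float:
    """Ri(SD) + Ri(IR) - GS, in bits (form assumed from the text's totals)."""
    return ri_sd + ri_ir - gs

# infC by the Verified model: SD 9.6, IR -4.8, GS 2.3 at spacing -9
print(round(flexible_site_bits(9.6, -4.8, 2.3), 1))

# infC by the EcoGene12 model, natural AUU start (GS of 2.3 bits inferred)
print(round(flexible_site_bits(8.7, -8.4, 2.3), 1))

# Predicted AUG mutant: IR rises to 5.5 bits
print(round(flexible_site_bits(8.7, 5.5, 2.3), 1))

# A spacing used by about one site in five costs roughly 2.3 bits
print(round(gap_surprisal(0.2), 2))
```

The three site totals reproduce the 2.5, −2.0 and 11.9 bit values given above.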

Other mechanisms may be needed to explain the anomalous Verified sites. Only two excluded cases in the Verified set have GTG starts, which are known to be weaker than ATG starts.13 With fewer examples in the data set, marginal GTG starts could have been removed because of statistical noise.

Another way to explain the Verified site anomalies is that RNA secondary structures might bring an SD closer to the IR12 and so influence translation.11,56 As shown in Figure 6, this mechanism might be involved in fes (612038) in which a four base helix (−4.5 kcal/mol)57 may bring a 3.4 bit SD to position −11 with respect to the IR and pcm (2867542) in which a five base helix (−7.8 kcal/mol) brings a 3.5 bit SD to position −9 with respect to the IR. The other sites do not appear to use this mechanism.

The relatively large number of initiation regions that do not conform to the majority model (i.e. the rejected Verified sites) suggests the possibility that there are even more alternative mode(s) of translation initiation. We are left with a number of likely and proven genes that fail to have ribosome binding sites that conform to our model. A combination of computational and experimental approaches will be needed to identify alternative models among the rejected sites. Of course, one simple possibility is that apparent anomalies can be caused by sequencing errors.26

An empirical observation for human splice junctions is that, in addition to the thermodynamic bound at zero bits,25 sites with less than 2.4 bits are non-functional.58 We suspect that such a fuzzy non-zero bound may also apply to the majority of ribosome binding sites. However, unlike the case with splice junctions, experimental data are not available to suggest what a natural bound may be that delineates a functional from a non-functional ribosome site. For this reason we used the zero bound, which is based on the second law of thermodynamics, for cyclic refining.

The process of subtracting the gap uncertainty is similar to the accounting of gaps provided by hidden Markov models (HMM),38,59 except that the gap size we use is variable and the frequencies of different gap distances are accounted for.37 It may be possible to extend the model given here to a full information-theory based HMM, but this was not attempted.

The information theory approach allowed us to build models that represent the vast majority of ribosome sites without having to assume that some sequences were not sites. In contrast, training with a neural net27 requires this assumption, because data on where ribosomes do not bind are sparse. Because the data set is so large, it could be split into spacing classes (Figure 2), effectively dissecting the ribosome binding sites. The resulting models revealed that the weakly conserved "A at −3"5,6 correlates with the IR and not with the SD. This is consistent with the presence of an A at −3 relative to the translational start in eukaryotic mRNAs, which do not have an SD.8 The function of this conservation is unknown but crosslink experiments place it close to U1381 on the 16S rRNA51 and to the S7 protein.3,49

The effort required to generate a data set as exemplified by EcoGene12 is enormous. The refining process described here gives results comparable to EcoGene12, so we believe it will be useful for gene analysis in other species. Ideally, all organisms would have models consisting of biochemically supported sites, rather than sites that were chosen by computer algorithms that do not model the SD. However, as shown here, it is possible to use information theory methods to help produce a reasonably clean identification of gene starts. This technique will be useful to better characterize medically important disease organisms.

Figure 6. mRNA folding may rescue fes and pcm translation. Structure base-pairings are indicated by parentheses. Start sites were predicted by sequence walkers as in Figure 4.

Materials and Methods

Databases

To create our models, we drew from three databases. One database that we used was the 4122 sites in the EcoGene12 database, which represent the majority of E. coli genes.34

The second database was a carefully compiled list of 569 experimentally supported sites, referred to as the Verified database.34 This database is a subset of the EcoGene12 database and provided us with an initial comparison model that was used to determine the allowed SD-IR spacings and the general pattern of the Shine-Dalgarno. Rudd34 has catalogued from the biomedical literature 717 E. coli proteins whose N termini have been determined directly by protein sequencing. The Verified proteins that have cleaved signal peptides were omitted, since these N-terminal protein sequences do not verify the translation start codons as definitively as the 569 Verified proteins that are uncleaved or have only the initiator methionine residue cleaved. This dataset can be obtained from the internet at: http://www.lecb.ncifcrf.gov/~toms/papers/flexrbs/

The third database was the 4290 gene starts presented by Blattner et al.33 and extracted from their complete E. coli GenBank entry, U00096 (version M52, September 02, 1997); this is referred to as the Blattner database. The refining method described below was performed on these databases.

Creating a ribosome model

Our ribosome models have two rigid binding elements connected by a flexible bond that allows the spacing between the elements to vary. If both elements do not find a suitable contact at an appropriate spacing, then the model will not bind. The first binding element is represented by a sequence logo made from translation start codons, which we refer to as the initiation region (IR) (Figure 1(a)). The range of this model is from −3 to +14, representing the predominantly A conservation at −3 through the downstream mRNA protected by the ribosome.6,11,17 To create this logo, standard Delila tools were used.24,44

For the SD there is only a low sequence conservation over the range of −11 to −7 in the rigid logo (Figure 1(a)), so we used multiple alignment to build on the rigid model to create a flexible ribosome model. Random sequences were generated by the markov program and the −20 to −4 range of the ribosome binding sites was embedded into the random sequence using the embed program. By replacing the IR with random sequence we avoided alignment by the IR in the next step. This step was to use the malign program to realign the SD region to maximize the information present.29

The resulting alignment was then represented by a logo using the previously described method. This realigned set displays the complement of the anti-Shine-Dalgarno found on the 16S rRNA (Figure 1(b)) and was used as our Shine-Dalgarno model. The zero coordinate of this model was shifted to the coordinate of the large central G using instshift. Since this base position contains the maximum information, presumably it is the most stable position to use and it will appear as the most significant base in a sequence walker (Figure 4). By this definition, our +4 base corresponds to the SDref reference point defined by Chen et al.,14 and our spacing measures are "aligned spacing" according to these authors. This measures the distance from a fixed point on the 16S rRNA to the initiation codon, as they advocate.

Once we had both an SD and an IR model, we made a histogram of the distances between their zero coordinates for all sites in the database. This was done using the diffinst program, which calculates the distance between coordinates in a pair of Delila instruction sets. These distances were then presented in a histogram using the genhis program and graphed in postscript with genpic.

The program used to compute equation (1) is ri,25 which generated the Riw(b,l) weight matrix for both the SD and IR for further analysis of the individual information conserved in ribosome binding sites (see Theory). The SD sites were assessed for the region −12 to +4 and the IR sites were assessed for the region −3 to +14 because this is the region covered by footprints.7 The program used to compute equation (2) is biscan (see Theory). Biscan finds pairs of SD and IR that fall within the range of the spacing histogram, and then the flexible site information is calculated for each coupling using the distribution of distances from the genhis histogram.

Further information about the programs is available at http://www.lecb.ncifcrf.gov/~toms/ and a web-based server with guest-access is available at http://www.lecb.ncifcrf.gov/~toms/delilaserver.html

Cyclic refinement

The 5′ ends of genes are often placed incorrectly in sequence database feature tables. To obtain a reliable ribosome model containing a minimum number of misplaced sites, a cyclic refinement method was used. To do this, we computed the flexible site information for all sites in the set. We removed all sites whose information content was less than zero (this is the theoretical boundary for binding because of the second law of thermodynamics25) and we rebuilt the model with the corrected set. Following every round of refining, the SD was realigned by malign as previously described.29 Refinement used 1000 realignments, maximized the information in a window from −20 to −4 and allowed the sequences to shift from −8 to +6. This range was chosen to match the known binding range of the Verified model. If more than one Shine-Dalgarno site was found upstream of an initiation start site, then the SD that gave the strongest flexible site information was used in the model. The alignment with the highest information content was chosen from the 1000 realignments. This process was repeated until no site remained in the data set with a negative flexible site information. The EcoGene12 set required ten rounds of refining and lost 222 sites; the Verified set required two rounds and we dropped 13 sites, and Blattner's set required 20 rounds of refining and we dropped 781 sites. Each round took approximately two hours on a 450 MHz Sun Ultra60 Sparc workstation.
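The control flow of cyclic refinement can be sketched with a much simplified rigid model. This toy Python version is not the Delila pipeline: it uses a single ungapped weight matrix, omits the malign realignment step, the flexible SD-IR spacing and the small sample correction e(n), and all function names are hypothetical. It only illustrates the loop of building a model, scoring every site, dropping sites below zero bits and rebuilding.

```python
import math
from collections import Counter

def build_weights(sites):
    """Weight matrix Riw(b,l) = 2 + log2 f(b,l) from aligned,
    equal-length sites. Assumption: the small-sample correction
    e(n) is omitted in this sketch."""
    n = len(sites)
    return [{base: 2.0 + math.log2(count / n)
             for base, count in Counter(column).items()}
            for column in zip(*sites)]

def site_bits(site, weights):
    """Individual information of one site; a base never seen at a
    position scores minus infinity."""
    return sum(w.get(base, float("-inf")) for base, w in zip(site, weights))

def cyclic_refine(sites):
    """Rebuild the model and drop sites below zero bits until none remain.
    Returns (kept_sites, removed_sites, rounds)."""
    current, removed, rounds = list(sites), [], 0
    while current:
        rounds += 1
        weights = build_weights(current)
        scores = [site_bits(s, weights) for s in current]
        dropped = [s for s, bits in zip(current, scores) if bits < 0.0]
        if not dropped:
            break
        removed.extend(dropped)
        current = [s for s, bits in zip(current, scores) if bits >= 0.0]
    return current, removed, rounds

# Toy data (hypothetical): eight consistent sites and one misplaced start
toy_sites = ["AGGA"] * 8 + ["TCCT"]
kept, removed, rounds = cyclic_refine(toy_sites)
print(len(kept), removed, rounds)
```

In the toy data the outlier scores about −4.7 bits against the first model and is removed; the second round finds no negative sites, so the loop stops.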

Dissecting the SD

To generate the SD as a function of spacing (Figure 2), a logo was made for each of the 15 observed spacing groups. For example, six sites were observed to have an SD-IR spacing of −4 bases, so a logo was made for those sites (upper left corner of the Figure). This was repeated for the range of −18 to −4 for the refined EcoGene12 model. The graph of information present in the SD versus spacing (Figure 3), is the Rsequence for the range of −12 to +4 around the central G in the SD portion of the logo. The range −12 to +3 was used for the spacing of −4, because of interference with the zero position of the initiation region. The range for the IR information curve was −3 to +14. To examine relatedness between nucleotides for an SD-IR spacing, correlations between nucleotides were computed using the program diana.35
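Grouping sites by spacing and computing an information value per group can be sketched as follows. This is an assumption-laden toy, not the procedure used to draw Figures 2 and 3: Rsequence is taken as the sum over positions of 2 − H(l) bits with no small sample correction, the input records are invented, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def rsequence(sites):
    """Sum over positions of 2 - H(l) bits for aligned, equal-length sites.
    Assumption: uncorrected for sample size, so small groups are
    biased upward."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        h = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
        total += 2.0 - h
    return total

def rsequence_by_spacing(records):
    """records: iterable of (sd_ir_spacing, aligned_sd_sequence) pairs.
    Returns {spacing: Rsequence of the sites with that spacing}."""
    groups = defaultdict(list)
    for spacing, sequence in records:
        groups[spacing].append(sequence)
    return {spacing: rsequence(seqs) for spacing, seqs in sorted(groups.items())}

# Hypothetical records: two sites at spacing -9, two at spacing -8
toy_records = [(-9, "GGAG"), (-9, "GGAG"), (-8, "GGAG"), (-8, "GAAG")]
print(rsequence_by_spacing(toy_records))
```

For groups as small as the six sites at a spacing of −4, the uncorrected value would be a substantial overestimate, which is one reason a small sample correction matters in practice.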

Sequence walkers

To make flexible sequence walkers26 (Figure 4), biscan generated features that were then mapped by the programs live, mergemarks and lister. Live created a color bar that changed hue every three bases to mark reading frames.22 Mergemarks combined marks from various sources and lister generated the sequence walker graphics.26 The refined EcoGene12 model was used for this analysis. For Figure 6, mfold 3.1 (ref. 57) was used to fold RNA sequences and the structures were displayed along with the walkers using mfoldseq, which generates sequence files for mfold, and mfoldfea, which uses the output from mfold to create features for lister.

Comparing matrices

To compare two weight matrices, we calculated the Euclidean distance between them using the following equation:

$$\mathrm{Distance} = \sqrt{\sum_{l}\sum_{b}\left[R_{i1}(b,l) - R_{i2}(b,l)\right]^{2}} \quad \text{(bits)} \qquad (6)$$

Here, the difference is taken between the individual information (Ri) of each base (b) at each position (l) between matrices 1 and 2. The difference is then squared and summed over all bases and positions, and the square-root of this value is taken, giving the distance in bits. The program used to do this was diffribl.
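Equation (6) translates directly into code. The sketch below is an independent illustration rather than the diffribl program; it assumes each weight matrix is stored as a list of per-position dictionaries mapping the four bases to Ri(b,l) values, and the name `matrix_distance` is hypothetical.

```python
import math

BASES = "ACGT"

def matrix_distance(matrix1, matrix2):
    """Euclidean distance (bits) between two weight matrices.

    Each matrix is a list with one dict per position l, mapping every
    base b in "ACGT" to its Ri(b,l) value in bits (assumed layout).
    """
    if len(matrix1) != len(matrix2):
        raise ValueError("matrices must cover the same positions")
    squared_sum = 0.0
    for column1, column2 in zip(matrix1, matrix2):
        for base in BASES:
            squared_sum += (column1[base] - column2[base]) ** 2
    return math.sqrt(squared_sum)

# Toy one-position matrices (hypothetical values)
m1 = [{"A": 2.0, "C": 0.0, "G": 0.0, "T": 0.0}]
m2 = [{"A": -1.0, "C": 0.0, "G": 4.0, "T": 0.0}]
print(matrix_distance(m1, m2))
```

The two matrices must cover the same coordinate range for the comparison to be meaningful, which is why the sketch rejects matrices of different length.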

Acknowledgments

We thank Peter Rogan for reporting to us the results of cyclic rigid information-theory based refining of human splice junctions, and Karen Lewis, Xiao Ma, Shu Ouyang, Denise Rubens, Brent Jewett, Ilya Lyakhov, Peter Rogan, Frank Boellmann, Eric Miller, and Jim Ellis for their comments. K.E.R. was supported by NIH grant GM58560.

References

1. Green, R. & Noller, H. F. (1997). Ribosomes and translation. Annu. Rev. Biochem. 66, 679-716.

2. Ban, N., Nissen, P., Hansen, J., Moore, P. B. & Steitz, T. A. (2000). The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science, 289, 905-920.

3. Wimberly, B. T., Brodersen, D. E., Clemons, W. M., Jr, Morgan-Warren, R. J., Carter, A. P., Vonrhein, C. et al. (2000). Structure of the 30S ribosomal subunit. Nature, 407, 327-339.

4. Gold, L., Pribnow, D., Schneider, T., Shinedling, S., Singer, B. S. & Stormo, G. (1981). Translational initiation in prokaryotes. Annu. Rev. Microbiol. 35, 365-403.

5. Stormo, G. D., Schneider, T. D. & Gold, L. M. (1982). Characterization of translational initiation sites in E. coli. Nucl. Acids Res. 10, 2971-2996.

6. Schneider, T. D., Stormo, G. D., Gold, L. & Ehrenfeucht, A. (1986). Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188, 415-431.

7. Rudd, K. E. & Schneider, T. D. (1992). Compilation of E. coli ribosome binding sites. In A Short Course in Bacterial Genetics: A Laboratory Manual and Handbook for Escherichia coli and Related Bacteria (Miller, J. H., ed.), pp. 17.19-17.45, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.

8. Kozak, M. (1999). Initiation of translation in prokaryotes and eukaryotes. Gene, 234, 187-208.

9. Shine, J. & Dalgarno, L. (1974). The 3′-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc. Natl Acad. Sci. USA, 71, 1342-1346.

10. Calogero, R. A., Pon, C. L., Canonaco, M. A. & Gualerzi, C. O. (1988). Selection of the mRNA translation initiation region by Escherichia coli ribosomes. Proc. Natl Acad. Sci. USA, 85, 6427-6431.

11. Hartz, D., McPheeters, D. S. & Gold, L. (1991). Influence of mRNA determinants on translational initiation in Escherichia coli. J. Mol. Biol. 218, 83-97.

12. Gold, L. (1988). Posttranscriptional regulatory mechanisms in Escherichia coli. Annu. Rev. Biochem. 57, 199-233.

13. Barrick, D., Villanueva, K., Childs, J., Kalil, R., Schneider, T. D., Lawrence, C. E. et al. (1994). Quantitative analysis of ribosome binding sites in E. coli. Nucl. Acids Res. 22, 1287-1295.

14. Chen, H., Bjerknes, M., Kumar, R. & Jay, E. (1994). Determination of the optimal aligned spacing between the Shine-Dalgarno sequence and the translation initiation codon of Escherichia coli mRNAs. Nucl. Acids Res. 22, 4953-4957.

15. Rinke-Appel, J., Junke, N., Brimacombe, R., Lavrik, I., Dokudovskaya, S., Dontsova, O. & Bogdanov, A. (1994). Contacts between 16S ribosomal RNA and mRNA, within the spacer region separating the AUG initiator codon and the Shine-Dalgarno sequence; a site-directed cross-linking study. Nucl. Acids Res. 22, 3018-3025.

16. Ringquist, S., Shinedling, S., Barrick, D., Green, L., Binkley, J., Stormo, G. D. & Gold, L. (1992). Translation initiation in Escherichia coli: sequences within the ribosome binding site. Mol. Microbiol. 6, 1219-1229.

17. Hartz, D., McPheeters, D. S., Green, L. & Gold, L. (1991). Detection of Escherichia coli ribosome binding at translational initiation sites in the absence of tRNA. J. Mol. Biol. 218, 99-105.

18. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423, 623-656.

19. Pierce, J. R. (1980). An Introduction to Information Theory: Symbols, Signals and Noise, 2nd edit., Dover Publications, Inc., New York.

20. Schneider, T. D. (1994). Sequence logos, machine/channel capacity, Maxwell's demon, and molecular computers: a review of the theory of molecular machines. Nanotechnology, 5, 1-18.


21. Hengen, P. N., Bartram, S. L., Stewart, L. E. &Schneider, T. D. (1997). Information analysis of Fisbinding sites. Nucl. Acids Res. 25, 4994-5002.

22. Shultzaberger, R. K. & Schneider, T. D. (1999).Using sequence logos and information analysis ofLrp DNA binding sites to investigate discrepanciesbetween natural selection and SELEX. Nucl. AcidsRes. 27, 882-887.

23. Zheng, M., Doan, B., Schneider, T. D. & Storz, G.(1999). OxyR and SoxRS regulation of fur. J. Bacteriol.181, 4639-4643.

24. Schneider, T. D. & Stephens, R. M. (1990). Sequencelogos: a new way to display consensus sequences.Nucl. Acids Res. 18, 6097-6100.

25. Schneider, T. D. (1997). Information content of indi-vidual genetic sequences. J. Theor. Biol. 189, 427-441.

26. Schneider, T. D. (1997). Sequence walkers: a graphi-cal method to display how binding proteins interactwith DNA or RNA sequences. Nucl. Acids Res. 25,4408-4415.

27. Stormo, G. D., Schneider, T. D., Gold, L. &Ehrenfeucht, A. (1982). Use of the `Perceptron' algor-ithm to distinguish translational initiation sites inE. coli. Nucl. Acids Res. 10, 2997-3011.

28. Schneider, T. D. (1991). Theory of molecularmachines. II. Energy dissipation from molecularmachines. J. Theor. Biol. 148, 125-137.

29. Schneider, T. D. & Mastronarde, D. (1996). Fast mul-tiple alignment of ungapped DNA sequences usinginformation theory and a relaxation method. Discr.Appl. Math. 71, 259-268.

30. Tribus, M. (1961). Thermostatics and Thermodynamics,D. van Nostrand Company, Inc., Princeton, NJ.

31. Frishman, D., Mironov, A., Mewes, H. W. &Gelfand, M. (1998). Combining diverse evidence forgene recognition in completely sequenced bacterialgenomes. Nucl. Acids Res. 26, 2941-2947.

32. Frishman, D., Mironov, A. & Gelfand, M. (1999).Starts of bacterial genes: estimating the reliability ofcomputer predictions. Gene, 234, 257-265.

33. Blattner, F. R., Plunkett, G., III, Bloch, C. A., Perna,N. T., Burland, V.,Riley, M. et al. (1997). The com-plete genome sequence of Escherichia coli K-12.Science, 277, 1453-1474.

34. Rudd, K. E. (2000). EcoGene: a genome sequencedatabase for Escherichia coli K-12. Nucl. Acids Res. 28,60-64.

35. Stephens, R. M. & Schneider, T. D. (1992). Featuresof spliceosome evolution and function inferred froman analysis of the information at human splice sites.J. Mol. Biol. 228, 1124-1136.

36. Flachmann, R., Kunz, N., Seifert, J., Gutlich, M.,Wientjes, F. J., Laufer, A. & Gassen, H. G. (1988).Molecular biology of pyridine nucleotide biosyn-thesis in Escherichia coli. Cloning and characteriz-ation of quinolinate synthesis genes nadA and nadB.Eur. J. Biochem. 175, 221-228.

37. Besemer, J., Lomsadze, A. & Borodovsky, M. (2001).GeneMarkS: a self-training method for prediction ofgene starts in microbial genomes. Implications for®nding sequence motifs in regulatory regions. Nucl.Acids Res. 29, 2607-2618.

38. Krogh, A., Brown, M., Mian, I. S., SjoÈ lander, K. &Haussler, D. (1994). Hidden Markov models in com-putational biology, applications to protein modeling.J. Mol. Biol. 235, 1501-1531.

39. Matten, S. R., Schneider, T. D., Ringquist, S. &Brusilow, W. S. A. (1998). Identi®cation of an intra-genic ribosome binding site that affects expression

of the B gene of the Escherichia coli proton-translocat-ing ATPase (unc) operon. J. Bacteriol. 180, 3940-3945.

40. Landick, R., Turnbough, C. L., Jr & Yanofsky, C.(1996). Transcription attenuation. In Escherichia coliand Salmonella: Cellular and Molecular Biology(Neidhardt, F. C., Curtiss, R., III, Ingraham, J. L.,Lin, E. C. C., Low, K. B.,Magasanik, B. et al., eds),vol. 1, pp. 1263-1286, American Society forMicrobiology, Washington, DC.

41. Keller, J. M. (1983). Harmonic motion; harmonicoscillator. In McGraw-Hill Encyclopedia of Physics(Parker, S. P., ed.), pp. 419-422, McGraw-Hill BookCompany, Inc., New York.

42. Schneider, T. D. (1991). Theory of molecular machines. I. Channel capacity of molecular machines. J. Theor. Biol. 148, 83-123.

43. Papp, P. P., Chattoraj, D. K. & Schneider, T. D. (1993). Information analysis of sequences that bind the replication initiator RepA. J. Mol. Biol. 233, 219-230.

44. Schneider, T. D. (1996). Reading of DNA sequence logos: prediction of major groove binding by information theory. Methods Enzymol. 274, 445-455.

45. Wood, T. I., Griffith, K. L., Fawcett, W. P., Jair, K.-W., Schneider, T. D. & Wolf, R. E. (1999). Interdependence of the position and orientation of SoxS binding sites in the transcriptional activation of the class I subset of Escherichia coli superoxide-inducible promoters. Mol. Microbiol. 34, 414-430.

46. Clemons, W. M., Jr, May, J. L., Wimberly, B. T., McCutcheon, J. P., Capel, M. S. & Ramakrishnan, V. (1999). Structure of a bacterial 30S ribosomal subunit at 5.5 Å resolution. Nature, 400, 833-840.

47. Boni, I. V., Isaeva, D. M., Musychenko, M. L. & Tzareva, N. V. (1991). Ribosome-messenger recognition: mRNA target sites for ribosomal protein S1. Nucl. Acids Res. 19, 155-162.

48. Sorensen, M. A., Fricke, J. & Pedersen, S. (1998). Ribosomal protein S1 is required for translation of most, if not all, natural mRNAs in Escherichia coli in vivo. J. Mol. Biol. 280, 561-569.

49. Dontsova, O., Kopylov, A. & Brimacombe, R. (1991). The location of mRNA in the ribosomal 30S initiation complex; site-directed cross-linking of mRNA analogues carrying several photo-reactive labels simultaneously on either side of the AUG start codon. EMBO J. 10, 2613-2620.

50. Greuer, B., Thiede, B. & Brimacombe, R. (1999). The cross-link from the upstream region of mRNA to ribosomal protein S7 is located in the C-terminal peptide: experimental verification of a prediction from modeling studies. RNA, 5, 1521-1525.

51. Bhangu, R., Juzumiene, D. & Wollenzien, P. (1994). Arrangement of messenger RNA on Escherichia coli ribosomes with respect to 10 16S rRNA cross-linking sites. Biochemistry, 33, 3063-3070.

52. Baranov, P. V., Kubarenko, A. V., Gurvich, O. L., Shamolina, T. A. & Brimacombe, R. (1999). The database of ribosomal cross-links: an update. Nucl. Acids Res. 27, 184-185.

53. Arnott, S., Hukins, D. W. & Dover, S. D. (1972). Optimised parameters for RNA double-helices. Biochem. Biophys. Res. Commun. 48, 1392-1399.

54. Nissen, P., Hansen, J., Ban, N., Moore, P. B. & Steitz, T. A. (2000). The structural basis of ribosome activity in peptide bond synthesis. Science, 289, 920-930.

55. Meinnel, T., Sacerdot, C., Graffe, M., Blanquet, S. & Springer, M. (1999). Discrimination by Escherichia coli initiation factor IF3 against initiation on non-canonical codons relies on complementarity rules. J. Mol. Biol. 290, 825-837.

56. de Smit, M. H. & van Duin, J. (1990). Secondary structure of the ribosome binding site determines translational efficiency: a quantitative analysis. Proc. Natl Acad. Sci. USA, 87, 7668-7672.

57. Mathews, D. H., Sabina, J., Zuker, M. & Turner, D. H. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288, 911-940.

58. Rogan, P. K., Faux, B. M. & Schneider, T. D. (1998). Information analysis of human splice site mutations. Hum. Mutat. 12, 153-171.

59. Baldi, P., Brunak, S., Chauvin, Y. & Krogh, A. (1996). Naturally occurring nucleosome positioning signals in human exons and introns. J. Mol. Biol. 263, 503-510.

Edited by D. Draper

(Received 30 January 2001; received in revised form 24 August 2001; accepted 27 August 2001)

