J. Mol. Biol. (1998) 277, 419–448

Fold Assembly of Small Proteins Using Monte Carlo Simulations Driven by Restraints Derived from Multiple Sequence Alignments

Angel R. Ortiz¹, Andrzej Kolinski¹,² and Jeffrey Skolnick¹*

¹Department of Molecular Biology, TPC-5, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA
²Department of Chemistry, University of Warsaw, ul. Pasteura 1, 02-093 Warsaw, Poland

Abbreviations used: RMSD, root-mean-square deviation; cRMSD, coordinate RMSD; PDB, Protein Data Bank.

0022–2836/98/120419–30 $25.00/0/mb971595

The feasibility of predicting the global fold of small proteins by incorporating predicted secondary and tertiary restraints into ab initio folding simulations has been demonstrated on a test set comprised of 20 non-homologous proteins, of which one was a blind prediction of target 42 in the recent CASP2 contest. These proteins contain from 37 to 100 residues and represent all secondary structural classes and a representative variety of global topologies. Secondary structure restraints are provided by the PHD secondary structure prediction algorithm that incorporates multiple sequence information. Predicted tertiary restraints are derived from multiple sequence alignments via a two-step process. First, seed side-chain contacts are identified from correlated mutation analysis, and then a threading-based algorithm is used to expand the number of these seed contacts. A lattice-based reduced protein model and a folding algorithm designed to incorporate these predicted restraints are described. Depending upon fold complexity, it is possible to assemble native-like topologies whose coordinate root-mean-square deviation from native is between 3.0 Å and 6.5 Å. The requisite level of accuracy in side-chain contact map prediction can be roughly 25% on average, provided that about 60% of the contact predictions are correct within ±1 residue and 95% of the predictions are correct within ±4 residues. Precision in tertiary contact prediction is more critical than absolute accuracy. Furthermore, only a subset of the tertiary contacts, on the order of 25% of the total, is sufficient for successful topology assembly. Overall, this study suggests that the use of restraints derived from multiple sequence alignments combined with a fold assembly algorithm holds considerable promise for the prediction of the global topology of small proteins.

© 1998 Academic Press Limited

Keywords: protein structure prediction; correlated mutations; side-chain contact prediction; lattice protein models; secondary structure prediction; threading

*Corresponding author

Introduction

At present, the prediction of protein structure from amino acid sequence remains one of the major unsolved problems in molecular biology. The solution to this problem demands the development of effective conformational search algorithms and the formulation of potentials capable of recognizing the native state from the manifold of misfolded structures. Reduced protein models having one or a few interaction centers per residue and statistical potentials extracted from protein structure databases offer a reasonable way to address both issues (Godzik et al., 1994; Kolinski et al., 1995a,b; Kolinski & Skolnick, 1994a; Park & Levitt, 1995; Skolnick et al., 1993). Using such approaches, several successful ab initio predictions of simple topologies have been reported. For example, the conformations of short peptides such as melittin (Ripoll & Scheraga, 1990), pancreatic polypeptide inhibitor (Sun, 1993; Wallqvist & Ullner, 1994), apamin (Sun, 1993), and PthrP (Wallqvist & Ullner, 1994) have been predicted with a backbone RMSD ranging from 1.7 Å to 4.5 Å. Lattice folding simulations (Kolinski & Skolnick, 1994b) of the B domain of Protein A (Gouda et al., 1992) have yielded structures with a Cα RMSD from native on the order of 3.0 Å. The most accurate predictions to date are those of the GCN4 leucine zipper, whose final predicted Cα RMSD is 0.8 Å from the native crystal structure (O'Shea et al., 1991; Vieth et al., 1994). In general, these methods have been mainly successful on helical proteins having simple global folds. For more complex structures of natural proteins, ab initio folding has failed thus far. Simplified protein representations and inaccuracies in the attendant potentials can conspire to yield an inability to identify the native topology from alternative non-native structures, especially those with a substantial similarity to the native fold. Furthermore, as protein size and topological complexity increase, conformational sampling becomes exponentially more problematic. Thus, alternative approaches are required that can surmount these difficulties.

Introduction of secondary structure restraints obtained from secondary structure prediction algorithms is a natural extension of pure, restraint-free ab initio folding. This is a particularly appealing idea since knowledge of the native secondary structure elements enormously reduces the conformational space that must be searched. Recently, the accuracy of secondary structure prediction has improved from about 65% to 72%, on average (Rost & Sander, 1996b). This development lends credence to the idea that secondary structural elements can be identified with reasonable accuracy. Some progress is also being made in algorithms that can predict regions where the chain reverses global direction, viz., U-turns (Kolinski et al., 1997). The feasibility of biasing search algorithms with secondary structure knowledge was first explored using "exact" secondary structure, as observed in the experimental native conformation. In this regard, using an off-lattice model and exact knowledge of the native secondary structure, Friesner et al. (1996) have obtained very encouraging results, successfully folding two four-helix bundles (cytochrome b562 (256B) and myohemerythrin (2MHR)), a large α-helical protein (myoglobin (1MBO)), and a relatively complicated α/β fold, the C-terminal domain of the L7/L12 50 S ribosomal protein (1CTF) (Friesner & Gunn, 1996; Gunn et al., 1994). Similarly, Dandekar & Argos (1994, 1996), using a genetic algorithm to search conformational space, obtained encouraging results for a test set of 19 small proteins, including all-α and some α/β proteins. In their studies, they have succeeded in predicting a significant proportion of these small proteins at ~5 Å resolution. However, Dandekar & Argos have also observed that use of predicted secondary structure information produces a substantial deterioration in the performance of their prediction algorithm. As a practical matter, however, any successful tertiary structure assembly algorithm must be able to handle predicted secondary structural information whose accuracy is at the current state-of-the-art, and tests using predicted secondary structure should be made. Along these lines, an early example of where predicted secondary structure was incorporated as restraint information in a subsequent topology assembly algorithm is due to Kolinski & Skolnick (1994b). The resulting predicted structure of crambin had a backbone Cα RMSD of 3.2 Å. A more recent example is due to Simons et al. (1997), who instead of secondary structure predictions used an interesting technique to derive short-range conformational preferences from multiple sequence alignments. Some α-helical proteins could be assembled using this approach, although the potential function used by the authors did not allow the native-like topology to be discriminated from alternative answers. However, it was apparent from these and other studies that knowledge of secondary structure preferences alone does not entirely eliminate competing misfolded states. Furthermore, secondary structure bias is local in nature and, therefore, does not provide a gradient in the conformational energy landscape that can funnel the conformation towards the native state. Thus, problems with both potentials and conformational search protocols still remain.

Funneling can be efficiently obtained through the use of long-range (in sequence) distance restraints. A number of workers have begun to examine the feasibility of such an approach. For example, Smith-Brown et al. (1993) have attempted to predict several protein folds by assuming exact knowledge of the secondary structure and a subset of interresidue distance restraints encoded as a biharmonic potential. They find that a considerable number of restraints per residue is required to assemble the fold, making the approach impractical for most prediction purposes. Another interesting study is due to Aszodi & Taylor (1996), who assumed correct native secondary structure and a set of simulated tertiary restraints (Aszodi et al., 1995). Here, in an attempt to build the protein core, restraints were supplemented by a set of interresidue distances based on patterns of conserved hydrophobic residues obtained from a multiple sequence alignment. Folds were then assembled using distance geometry with a simplified protein chain model. Aszodi & Taylor (1996) were able to assemble structures below 5 Å RMSD when at least N/4 restraints are used, where N is the number of protein residues. However, with their force field, they have excessive difficulties selecting the correct fold from competing alternatives. Along similar lines, Mumenthaler & Braun (1995) developed an interesting self-correcting distance geometry method that tries to automatically eliminate wrongly predicted contacts derived from multiple sequence alignments. Again, the correct secondary structure is assumed, but now totally predicted tertiary restraints, based on the conservation patterns of hydrophobic residues in multiple sequence alignments, are used to assemble the fold. With this method, encouraging results have been obtained with the successful folding to the native topology in six out of eight helical proteins studied. Still, a significant number of restraints are required in all these approaches. This poses a problem because no prediction technique is available that can provide both the requisite number and accuracy of secondary and tertiary restraints that these approaches demand for more complex folds.

One step towards addressing these problems was made recently by Skolnick et al. (1997b). They developed a new program, called MONSSTER (MOdeling of New Structures from Secondary and TErtiary Restraints), that is able to successfully fold small proteins using a considerably smaller number of distance restraints than previous approaches demanded. It was found that when "exact" restraints are available, helical proteins can be folded with roughly N/7 restraints, while all-β and α/β proteins require about N/4 restraints, where N is the number of residues in the protein chain. Of course, for any particular case, the accuracy depends on restraint distribution and fold complexity. MONSSTER employs a lattice-based reduced representation of the protein chain. In addition to secondary and tertiary restraints, there is a potential that incorporates statistical preferences for secondary structure, side-chain burial and pair interactions, together with a hydrogen bond potential. We term these non-restraint contributions inherent interactions. The resulting fold accuracy is substantially degraded when these inherent contributions to the potential are eliminated. Thus, these simulations indicated that there is a complementarity between the inherent contributions to the potential and the supplementary but crucial information provided by secondary and tertiary restraints.

The encouraging results obtained with this model prompted us to attempt the next logical step. Namely, we explored the possibility of assembling global protein topologies using entirely predicted secondary and tertiary restraints. Predicted restraints are noisy in nature and it is unclear whether an algorithm that works within the limit of a very small number of correct restraints is robust enough to successfully handle the unavoidable presence of incorrect predictions. Extant secondary structure prediction schemes provide a logical jumping-off point for the incorporation of predicted secondary structure information. The protocol for predicting tertiary restraints is less obvious. Following on the ideas of Göbel et al. (1994), predicted tertiary contacts are extracted on the basis of evolutionary information contained in multiple sequence alignments, complemented with threading calculations (our unpublished results). In sequence alignments, since some pairs of positions appear to exhibit a covariation in their mutational behavior consistent with their physical and chemical properties, it has been suggested that spatially close neighbors might be more likely to exhibit such behavior. Using statistical techniques, this effect has been quantified by a method known as correlated mutation analysis (Göbel et al., 1994). It has been shown that, by applying a stringent significance cut-off in the prediction of contacts by correlated mutations, a small number of contacts can be predicted that are a factor of 1.4 to 5.1 times better than random. Previously, the number of correct contacts obtained this way has been either too small to permit successful tertiary structure assembly if a high significance cut-off was used to avoid false positives, or too noisy if the number of contacts selected was that demanded by existing assembly algorithms (Rost & Sander, 1996a). Here, it is shown that, for a representative set of proteins, a modification of the correlated mutation analysis approach (our unpublished results), when coupled to secondary structure prediction and followed by structure assembly using a version of MONSSTER updated to handle incorrect predictions, is able to bridge the gap between sequence analysis and folding simulations. This permits the ab initio folding of some complex topologies.

The outline of the remainder of the paper is as follows. In Methods, we describe the approach followed in this work, which can be logically divided into two parts: secondary and tertiary restraint derivation, and fold assembly/refinement using MONSSTER. In the Results, we describe its application to a set of 20 representative single domain proteins. This is followed by the Discussion, which examines possible reasons why the current approach can be successful and delineates the improvements required, in both the restraint derivation procedure and in the fold assembly/refinement protocol, to make this approach generally applicable. Finally, in the Conclusions, we summarize the current state-of-the-art of protein structure prediction as provided by MONSSTER.

Methods

Overview

A flow chart of the tertiary structure prediction protocol is depicted in Figure 1. The procedure presented in this work can be logically divided into two parts: restraint derivation and structure assembly/refinement using the MONSSTER algorithm. With respect to restraint derivation, the first objective is to predict the number, location, and identity of the dominant secondary structural elements that will comprise the protein. These consist of helices and β-strands, termed here the core topological elements of the molecule. In addition, U-turns between these secondary structure elements are predicted (Kolinski et al., 1997). Next, we try to predict the secondary structure elements in contact. This is attempted by obtaining the most reliable set of predicted contacts between core elements using correlated mutation analysis. We denote the contacts obtained in this way as "seeds". This protocol allows us to maximize the signal-to-noise ratio of predicted contacts. However, the number of seeds is still insufficient to allow successful fold assembly to occur. Hence, we exploit the fact that packing patterns between secondary structure elements are degenerate. Seed contacts are "enriched" by finding the most compatible contact map in the structural database given the predicted seed and the secondary structure elements involved by using a fragment clustering/inverse folding protocol (Godzik et al., 1992). Full details of this procedure will be given in a forthcoming publication. The key point is that the resultant restraints are not particularly accurate and that approaches incorporating such restraints must be adjusted to account for their ambiguity. This is accomplished in an updated version of MONSSTER designed to accommodate the inherent inaccuracies of such restraints, as is described below. The general features of the force field and representation are the same as those described earlier (Skolnick et al., 1997b). Thus, we focus here only on those aspects that differ from the previous implementation in order to adapt the procedure to incorrect and/or ambiguous restraints.

Figure 1. Flow chart of the protein fold prediction method.

Restraint derivation method

Secondary structure prediction

Multiple sequence alignments for each of the proteins studied were obtained from the HSSP database (Sander & Schneider, 1991). This alignment is used as input for the PHD (Rost & Sander, 1993) secondary structure prediction method. For the purposes of hydrogen bond assignment, all predicted strand elements are assumed to correspond to a strand in the real secondary structure. For helices, only those elements with a reliability index higher than three are used. Chain reversals are predicted by the U-turn prediction algorithm LINKER developed by Kolinski et al. (1997). Because of their reliability, elements predicted as U-turns override PHD predictions (Kolinski et al., 1997). In practice, each residue can be assigned to one of five conformational states: a predicted extended state, a predicted helix, a predicted U-turn, a β-(strand) state or a non-predicted state. The set of predicted helices and strands comprises the putative core elements of the protein.
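The assignment rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-letter encodings of the PHD and LINKER outputs, the function name assign_states, the per-residue (rather than per-element) application of the reliability filter, and the rule that a LINKER extended state fills positions left unassigned by PHD are all choices made here, not details given in the text.

```python
from typing import Sequence

# One-letter codes for the five conformational states (hypothetical encoding).
HELIX, STRAND, EXTENDED, U_TURN, NONE = "H", "B", "E", "U", "-"


def assign_states(phd: str, reliability: Sequence[int], linker: str) -> str:
    """Merge per-residue predictions into one of five states.

    phd:         'H' (helix), 'E' (strand) or '-' per residue (assumed encoding)
    reliability: PHD reliability index per residue
    linker:      'U' (U-turn), 'E' (extended) or '-' per residue (assumed encoding)
    """
    if not (len(phd) == len(reliability) == len(linker)):
        raise ValueError("all inputs must have one entry per residue")
    states = []
    for ss, rel, turn in zip(phd, reliability, linker):
        if turn == "U":                # U-turn predictions override PHD
            states.append(U_TURN)
        elif ss == "E":                # every predicted strand is kept
            states.append(STRAND)
        elif ss == "H" and rel > 3:    # helices need a reliability index above three
            states.append(HELIX)
        elif turn == "E":              # assumption: extended fills unassigned positions
            states.append(EXTENDED)
        else:
            states.append(NONE)
    return "".join(states)
```

For example, assign_states("HHHE-", [5, 2, 9, 1, 0], "--U-E") gives "H-UBE": the low-reliability helix residue is dropped and the U-turn replaces the helix prediction at the third position.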

Side-chain contact prediction

The prediction of residue contacts is performed in two stages. First, a correlated mutation analysis (Göbel et al., 1994) of the multiple sequence alignment is done to identify the seed contacts. The procedure is based on defining an exchange matrix or other similarity measure at each sequence position in a multiple sequence alignment. One then calculates the correlation coefficient between exchange matrices at any two positions. In the calculation of the covariance matrix, regions containing deletions and insertions are not considered. Here, residue comparison is carried out using the McLachlan (1971) matrix. In this work, the same multiple sequence alignment is used for secondary structure and correlated mutation analysis. Only correlations between elements predicted to be in core regions (and not U-turns) are considered. The rationale is that by restricting the predictions to rigid elements of the putative core, the assumption of closeness in space for positions showing covariance in their mutational behavior might be more valid. Correlation is measured by a Pearson-type correlation coefficient:

$$ r_{ij} = \frac{1}{N^2} \sum_{kl} \frac{(s_{ikl} - \langle s_i \rangle)(s_{jkl} - \langle s_j \rangle)}{\sigma_i \sigma_j} \qquad (1) $$

Here, $i$ and $j$ are two different positions in a multiple sequence alignment, and the indices $k$ and $l$ run from 1 to the number $N$ of sequences in the family. The parameter $s_{ikl}$ ($s_{jkl}$) is the comparison score (according to the McLachlan (1971) mutation matrix) of the amino acids of sequences $k$ and $l$ at position $i$ ($j$) of the alignment. Average values over all aligned sequences at positions $i$ and $j$ are given by $\langle s_i \rangle$ and $\langle s_j \rangle$. The parameters $\sigma_i$ and $\sigma_j$ correspond to the standard deviations of the scores at positions $i$ and $j$, respectively. A correlation coefficient cut-off threshold of 0.5 is used for contact prediction. At most one contact per pair of core secondary structure blocks is used. Thus, this analysis delineates predicted secondary structure elements in contact. In addition, the maximum number of seeds allowed is equal to the maximum number of expected contacts ($n_s$) between the $L$ predicted secondary structure elements. This value is obtained from a representative database of small proteins and is roughly given by $n_s = L(L - 1)/4$ (our unpublished results). However, correlated mutation analysis only provides a few seed side-chain contacts; their number is generally insufficient to assemble a protein from the unfolded state using MONSSTER. Thus, the number of side-chain contacts needs to be increased.
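A minimal Python sketch of equation (1) and of the seed selection rules may make the procedure concrete. Several things here are assumptions rather than part of the published protocol: toy_similarity is a placeholder identity score standing in for the McLachlan (1971) matrix, columns are assumed to be gap-free, $n_s$ is rounded down to an integer, and contacts within a single block are skipped. The function and argument names are hypothetical.

```python
from itertools import product
from math import sqrt


def toy_similarity(a, b):
    """Placeholder score: 1 for identical residues, 0 otherwise (NOT the McLachlan matrix)."""
    return 1.0 if a == b else 0.0


def pair_scores(column, similarity):
    """Scores s_kl for one alignment column over all N*N ordered sequence pairs (k, l)."""
    return [similarity(a, b) for a, b in product(column, repeat=2)]


def correlated_mutation(col_i, col_j, similarity=toy_similarity):
    """Correlation r_ij of equation (1); returns 0.0 if either column shows no variation."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must come from the same alignment")
    s_i = pair_scores(col_i, similarity)
    s_j = pair_scores(col_j, similarity)
    n_pairs = len(s_i)  # N^2 terms
    mean_i, mean_j = sum(s_i) / n_pairs, sum(s_j) / n_pairs
    sd_i = sqrt(sum((s - mean_i) ** 2 for s in s_i) / n_pairs)
    sd_j = sqrt(sum((s - mean_j) ** 2 for s in s_j) / n_pairs)
    if sd_i == 0.0 or sd_j == 0.0:
        return 0.0
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(s_i, s_j)) / n_pairs
    return cov / (sd_i * sd_j)


def select_seeds(scores, block_of, n_blocks, cutoff=0.5):
    """Pick seed contacts from {(i, j): r_ij}.

    block_of maps a residue index to its core secondary structure block;
    residues absent from block_of (e.g. U-turns) are ignored.
    """
    max_seeds = n_blocks * (n_blocks - 1) // 4  # n_s = L(L - 1)/4, rounded down
    best = {}
    for (i, j), r in scores.items():
        if r < cutoff or i not in block_of or j not in block_of:
            continue
        pair = tuple(sorted((block_of[i], block_of[j])))
        if pair[0] == pair[1]:                  # assumption: skip intra-block pairs
            continue
        if pair not in best or r > best[pair][0]:
            best[pair] = (r, i, j)              # at most one contact per block pair
    ranked = sorted(best.values(), reverse=True)[:max_seeds]
    return [(i, j) for _, i, j in ranked]
```

With the toy score, two columns that share the same pattern of identities across sequences (for instance "AAVV" and "LLII") are perfectly correlated, whereas "AAVV" and "LILI" are uncorrelated.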

The set of seed restraints is enriched by a combined structural fragment search and threading folding procedure (Godzik et al., 1992). All pairs of secondary structure elements compatible with the predicted secondary structure types and predicted contacts are extracted from a structural database. This structural database is the same used by Hu et al. (1997). For each tested sequence, homologues to the full target sequence found in the database were removed prior to the application of the growing procedure. To account for the inaccuracies in the correlated mutation analysis, a tolerance of a one-residue shift in each member of the contacting residue pair is allowed. Fragments are then scored by a statistical potential that considers local conformational propensities and the burial energy within the pair of fragments (Godzik et al., 1992). Pair and higher order interactions are ignored (Godzik et al., 1992; Hu et al., 1997) to avoid the imbalance between intra and extra fragment interactions, which would result if such contributions were included. The top ten scoring fragments are superimposed in space by minimizing their coordinate RMSD, and then clustered on the basis of their pair-wise RMSD. If they do not show a clear clustering (with an upper limit of 5.5 Å for the most divergent fragment pair), then additional side-chain contact restraints are not derived. Conversely, if the fragments spatially cluster, then the fragment within this cluster whose RMSD is smallest, with respect to all other members, is selected and its side-chain contact map is projected onto the query sequence. Following this procedure, the number of predicted contacts usually increases by about a factor of five with respect to that predicted from correlated mutation analysis alone. Full details of this procedure will be given in a forthcoming publication (A.R.O. & J.S., unpublished results).
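The final selection step can be illustrated with a short Python sketch. It is a simplification made for illustration: the ten top-scoring fragments are treated as a single candidate cluster, the pairwise RMSD matrix is assumed to be precomputed after superposition, and "smallest RMSD with respect to all other members" is read as the smallest summed RMSD. The function name pick_representative is hypothetical.

```python
def pick_representative(pairwise_rmsd, max_spread=5.5):
    """Return the index of the central fragment, or None if the set is too divergent.

    pairwise_rmsd: square matrix (list of lists) of RMSD values in Angstrom
    max_spread:    upper limit allowed for the most divergent fragment pair
    """
    n = len(pairwise_rmsd)
    if any(len(row) != n for row in pairwise_rmsd):
        raise ValueError("pairwise_rmsd must be a square matrix")
    if n == 0:
        return None
    # No clear clustering: do not derive additional restraints.
    if max(max(row) for row in pairwise_rmsd) > max_spread:
        return None
    # Otherwise take the fragment closest, in total, to all the others.
    totals = [sum(row) for row in pairwise_rmsd]
    return totals.index(min(totals))
```

The contact map of the returned fragment would then be projected onto the query sequence; a return value of None corresponds to the case in which no additional restraints are derived.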

Assembly and refinement protocol

Protein model

Geometric properties. The Cα coordinates of the protein backbone are confined to a set of lattice points located on an underlying cubic lattice whose lattice spacing is a = 1.22 Å (Kolinski & Skolnick, 1994a). Successive Cα atoms are connected by a set of 90 virtual bond vectors a·v, with {v} = {(±3, ±1, ±1), ... (±3, ±1, 0), ... (±3, 0, 0), ... (±2, ±2, ±1), ... (±2, ±2, 0), ...}. The distance a is chosen so that the mean Cα virtual bond length is 3.8 Å. Side-chains are represented by a set of rotamers, each located at the side-chain center of mass. They are not restricted to lattice points. With the exception of Gly, Pro and Ala, there are multiple rotamers for each amino acid, chosen so that the center of mass of a side-chain in real proteins will be no farther than 1 Å from some member of the rotamer library. For more details, see Kolinski & Skolnick (1994a, 1997b).
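The set of 90 bond vectors can be checked by direct enumeration. The Python sketch below generates every signed permutation of the five generator triples listed above; the unweighted average over the 90 vectors is a choice made here for illustration and need not be the average the authors had in mind.

```python
from itertools import permutations, product
from math import sqrt

LATTICE_SPACING = 1.22  # Angstrom


def bond_vectors():
    """All distinct signed permutations of the five generator triples."""
    generators = [(3, 1, 1), (3, 1, 0), (3, 0, 0), (2, 2, 1), (2, 2, 0)]
    vectors = set()
    for g in generators:
        for perm in permutations(g):
            for signs in product((1, -1), repeat=3):
                vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(vectors)


vectors = bond_vectors()
# Unweighted mean bond length in Angstrom over the whole vector set.
mean_length = LATTICE_SPACING * sum(
    sqrt(x * x + y * y + z * z) for x, y, z in vectors
) / len(vectors)
print(len(vectors), round(mean_length, 2))  # prints: 90 3.79
```

The count comes out as 24 + 24 + 6 + 24 + 12 = 90, and the unweighted mean length of about 3.79 Å is consistent with the stated 3.8 Å.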

Interaction scheme

Inherent contributions. This class of terms is independent of the restraint predictions and is designed to capture both generic (sequence independent) and sequence-specific protein-like features. Many of these contributions are identical to those described in MONSSTER (Skolnick et al., 1997b). Such terms include an amino acid pair specific potential that describes the intrinsic secondary structural preferences, $E_{14}$, and a one-body centrosymmetric burial potential, $E_1$. Here, to avoid non-physical segregation of the subunits, we have added a packing density regularizer, $E_{density}$ (Kolinski & Skolnick, 1997). This term is designed to ensure that the average overall density distribution of residues in native proteins is reproduced. This term provides a strong compressive force in the unfolded state, but contributes negligibly in compact states. A more sensitive side-chain pair contact potential, $E_{pair}$, that has been derived by a more careful analysis of the appropriate reference state is now used (Skolnick et al., 1997a). Hydrogen bonds are Cα-based and very much in the spirit of Levitt & Greer (1977).

Restraint contributions. Predicted secondary structures and tertiary contacts are implemented into the model in the form of restraint contributions to the conformational energy. Furthermore, a set of somewhat refined knowledge-based restraints designed to reproduce the packing of supersecondary structural elements is used. The implementation of each type of restraint is discussed in turn.

Secondary structure dependent restraint contributions

Local secondary structure bias. Secondary structure bias is incorporated into the local secondary structure dependent terms with magnitude $E_{target,sec}$. As indicated above, a given residue can be in one of five conformational states assigned on the basis of the local chain geometry. For those residues having a predicted secondary structural type, energetic biases for the various allowed conformational states are assigned. Turns are encoded on a generic basis, i.e. their chirality is not specified. Rather, they behave as flexible joints between regular secondary structural elements (see Skolnick et al. (1997b) for additional details).

U-turn surface bias. Regions predicted as U-turns are assumed to lie at the protein surface. Thus, for these residues, a penalty of 0.5 kT (with k Boltzmann's constant and T the absolute temperature) per residue is added when they lie at or below the radius of gyration. This term of total magnitude, $E_{U-turn}$, acts to reduce kinetic traps by segregating the different parts of the protein into its corresponding layers. Similarly, N- and C-terminal residues are penalized by 4 kT if they are buried (i.e. at or below the radius of gyration) to account for their charged ends.
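A minimal Python sketch of this burial penalty follows. The text does not say which atom defines the position of a residue or how many residues count as terminal, so the sketch assumes that burial is measured by the distance of the Cα atom from the chain centroid, that only the first and last residues are treated as termini, and that the two penalties are additive. The function name surface_penalty is hypothetical.

```python
from math import sqrt


def surface_penalty(ca_coords, u_turn_residues, u_turn_cost=0.5, terminal_cost=4.0):
    """Burial penalty in kT for predicted U-turn residues and chain termini.

    ca_coords:       list of (x, y, z) positions, one per residue
    u_turn_residues: 0-based indices of residues predicted as U-turns
    """
    n = len(ca_coords)
    center = [sum(p[d] for p in ca_coords) / n for d in range(3)]
    dist = [sqrt(sum((p[d] - center[d]) ** 2 for d in range(3))) for p in ca_coords]
    r_gyr = sqrt(sum(r * r for r in dist) / n)   # radius of gyration
    buried = [r <= r_gyr for r in dist]          # "at or below the radius of gyration"
    energy = u_turn_cost * sum(1 for i in u_turn_residues if buried[i])
    energy += terminal_cost * (buried[0] + buried[-1])
    return energy
```

For a straight five-residue chain along the x axis, the three central residues are buried by this criterion, so marking residues 2 and 3 as U-turns costs 1.0 kT while the exposed termini cost nothing.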

Hydrogen bond mixing rules. The hydrogen bond potential is modified for those residues assigned to a predicted type of secondary structure so that the resulting hydrogen bond pattern is compatible with the secondary structural prediction. The magnitude of this term is $E_{H-bond}$. More specifically: (1) continuous stretches of strands and extended states or their combinations cannot form intra-element hydrogen bonds. Strands can form hydrogen bonds only with other strands, extended states or non-assigned states. On the other hand, extended states can form hydrogen bonds with all states except helices. (2) For those residues assigned to be helical, hydrogen bonds beyond the fifth neighbor along the chain are not allowed.
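The mixing rules can be expressed as a single predicate. The Python sketch below is an illustration only: the state codes ('H' helix, 'B' strand, 'E' extended, 'U' U-turn, '-' non-assigned), the element identifiers and the function name hbond_allowed are hypothetical, and rule (2) is read as applying to any hydrogen bond that involves at least one helical residue.

```python
def hbond_allowed(i, j, state, element):
    """Check whether residues i and j may form a hydrogen bond.

    state[k]:   one of 'H', 'B', 'E', 'U', '-' (assumed encoding)
    element[k]: id of the predicted secondary structure stretch containing
                residue k, or None if the residue belongs to no stretch
    """
    a, b = state[i], state[j]
    beta_like = {"B", "E"}
    # Rule 1: no hydrogen bonds inside one continuous strand/extended stretch.
    if a in beta_like and b in beta_like:
        if element[i] is not None and element[i] == element[j]:
            return False
    # Rule 1: strands pair only with strands, extended or non-assigned residues.
    if "B" in (a, b):
        other = b if a == "B" else a
        if other not in {"B", "E", "-"}:
            return False
    # Rule 1: extended states pair with every state except helices.
    if "E" in (a, b) and "H" in (a, b):
        return False
    # Rule 2: helical residues bond no further than the fifth neighbor.
    if "H" in (a, b) and abs(i - j) > 5:
        return False
    return True
```

In a simulation such a predicate would be consulted before a candidate hydrogen bond is scored, so that forbidden pairings never contribute to $E_{H-bond}$.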

β-Strand cooperativity term. In trial calculations, it was observed that predicted β-strands had considerable difficulty forming β-sheets. The same observation has been made by other authors (Dandekar & Argos, 1996; Friesner & Gunn, 1996; Simons et al., 1997). In our case, this behavior appears to result from a combination of the excessive conformational entropy of the backbone and the highly permissive hydrogen bond scheme. To correct for these effects, a cooperativity term that stabilizes and propagates the formation of β-sheets ($E_{\beta-prop}$) has been included. For each predicted strand, the hydrogen bond state of each residue in the putative strand is scanned. If the residue of interest participates in two hydrogen bonds belonging to two different β-strands, then a stabilization energy equal to that of the hydrogen bond cooperativity term is added. Strand residues can both nucleate and participate in the cooperativity. In other words, blocks of secondary structure predicted to be strands can be located either in the core or at the edges of the β-sheet. Extended state residues can serve as cooperative hydrogen bond partners, but cannot nucleate cooperativity; therefore, their location in the β-sheet core is energetically penalized, but not forbidden. There is no directionality in this cooperativity term. Thus, it cannot distinguish parallel from antiparallel arrangements of the strands; rather, the final arrangement is dictated by the connectivity of the chain and the predicted restraints. This hydrogen bond cooperativity term has the effect of propagating the β-sheet. It also helps to bury strands predicted by PHD into the core, and to locate extended states predicted by LINKER at the surface and at the edges of a β-sheet. Trial calculations indicated that a small bias is adequate to successfully build β-sheets. In the all-β and α/β proteins studied here, the total magnitude of this term has a value of roughly −5 kT in successfully folded structures.

Tertiary restraints

Restraint function

The restraint function used in this work consists of a simple flat-bottom harmonic potential. Let $r_{ij}$ be the actual distance between two restrained residues. In practice, the restraint could operate between side-chain centers of mass or between the projection of the residue pair onto the principal axes of their respective secondary structural elements. This situation is discussed in greater detail in the next section, which describes restraint splinning. Thus, the restraint function is as follows:

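$$
E_{\mathrm{res}} =
\begin{cases}
800 & \text{if } E_{\mathrm{res}} > 800 \\
k_{\mathrm{rep}}\,(r_{ij} - r^{0}_{ij})^{2} & \text{if } r_{ij} > r^{0}_{ij} \text{ and } E_{\mathrm{res}} < 800 \\
0 & \text{if } r_{ij} < r^{0}_{ij}
\end{cases}
\tag{2}
$$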

$k_{\mathrm{rep}} = 4.0$ kT/(lu)$^2$ with lu equal to one lattice unit (1.22 Å). In the case of side-chain centers of mass, $r^{0}_{ij} = (\langle r_{AB}\rangle + \sigma_{AB})(1 + \omega)$. For a pair of residues, A and B, $\langle r_{AB}\rangle$ and $\sigma_{AB}$ are the average separation distance and standard deviation of this distance observed in a structural database. The value of $\omega = 0.5$ is used by default. In the case of restraint splinning (see below), the value $(\langle r_{AB}\rangle + \sigma_{AB})$ is substituted by the average separation distance observed in a structural database for the packing of secondary structure elements. The values used are: 10 Å for α-α, 8 Å for α-β and 6 Å for β-β super-secondary elements, respectively.
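As an illustration of the flat-bottom restraint, the sketch below evaluates the penalty for a single residue pair. It assumes that distances are expressed in lattice units (so that $k_{\mathrm{rep}}$ applies directly) and that the target distance is the database mean plus one standard deviation, scaled by $(1 + \omega)$. The function and constant names are hypothetical.

```python
LATTICE_UNIT = 1.22  # Angstrom per lattice unit (value quoted in the text)
K_REP = 4.0          # kT per squared lattice unit (value quoted in the text)
E_CAP = 800.0        # ceiling of the restraint energy, in kT


def target_distance(mean_ab, sd_ab, omega=0.5):
    """Target distance r0 = (<r_AB> + sigma_AB) * (1 + omega); units follow the inputs."""
    return (mean_ab + sd_ab) * (1.0 + omega)


def restraint_energy(r_ij, r0_ij, k_rep=K_REP, cap=E_CAP):
    """Flat-bottom harmonic penalty in kT; both distances in lattice units."""
    if r_ij <= r0_ij:
        return 0.0  # inside the flat bottom: restraint satisfied
    # Harmonic growth beyond r0, truncated at the ceiling value.
    return min(k_rep * (r_ij - r0_ij) ** 2, cap)
```

A database distance given in Å would first be divided by `LATTICE_UNIT`. The ceiling keeps a single badly violated restraint from dominating the total energy of an extended starting chain.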

Restraint splinning. Most predicted seeds are shifted by at least one residue with respect to the experimentally observed contact. Moreover, after growth, the different patches of contacts can have different phases. For example, suppose that one helix is predicted to contact two other secondary structure elements, i.e. it has two seed contacts. Because each seed is obtained and grown independently, the overall predicted contact pattern of the helix with the two other elements could be impossible. These seed contacts can lie on opposite sides of the helix, but in fact in the folded structure their secondary structure partners can be on the same side of the helix face. This effect could either preclude successful assembly or distort the folded conformation to the point that it cannot be distinguished from misfolded alternatives using energy criteria. One way to eliminate these artifacts is to apply the restraints between the axes of the secondary structure elements. This is done by smoothing the local Cα chain using the method described in the Appendix of Skolnick et al. (1997b). This list of smoothed coordinates is continuously updated during the simulation.

Figure 2. Scheme of the geometric definitions used to define the knowledge-based rules in βαβ supersecondary structures. The secondary structure elements are represented as blue cylinders; the strands are represented as two thin cylinders and the helix as a thick cylinder. The axes of the secondary structure elements are represented by the blue cylinders and correspond to the vectors k − 2, k − 1 and k, respectively. The β-sheet to which the two strands belong is represented as a red membrane. The vector a connects the beginning of the last β-strand with the end of the first β-strand. The vector b connects the middle of the vector a with the middle of the k − 1 element; it is shown in green in the Figure. Note that it is shifted from its original position for display purposes. Vectors d and e, shown in magenta, are perpendicular to the planes defined by vectors k − 2 and a, and k and a, respectively. Figure generated with MOLMOL (Koradi et al., 1996).

Fold Prediction of Small Proteins 425

Knowledge-based restraints. Knowledge-based information about the general features of protein topology is also used (Skolnick et al., 1997b). This knowledge-based information acts to reduce the number of misfolded structures. Two types of knowledge-based rules are considered, namely the chirality of βαβ units and the angle formed in ββα supersecondary structure units (Chothia & Finkelstein, 1990). The implementation used here differs in some important aspects from that described by Skolnick et al. (1997b). First of all, because the secondary structure prediction scheme can miss an intervening element, the number of successive residues between secondary structure regions is counted. If the number of loop residues is greater than 15 residues, it is assumed that an intervening secondary structure element has been missed by the secondary structure prediction algorithm, and the knowledge-based rules are not applied at all. The knowledge-based rules themselves are also implemented in a different way than that described in our previous work. First of all, in the ββα rule, the chirality requirement is eliminated and only the angle between the elements is restricted. When predicted secondary structure is used, this rule is not sufficiently robust because strands, particularly at the edges of the fold in α/β proteins, can be missed. This results in the inappropriate application of the chirality requirement of the rule. In the case of the βαβ rule, the vector definitions and restraint potential are the same as in our previous work (Skolnick et al., 1997b). However, the definition of the angles between elements is different from that previously employed because the previous implementation made implicit assumptions about the chain geometry that can be violated. In particular, in the method described in our previous work, the first strand of the βαβ element was not demanded to be in the plane formed by the vector describing the orientation of the second strand, and the vector connecting the beginning of the second strand with the end of the first strand. Therefore, unusual geometries were not penalized. These geometries did not appear in our previous work because the sparse set of restraints used in the folding simulations was exact and always involved some contacts between β-sheet forming strands. However, the use of incorrect, clustered restraints in the present work permits the appearance of such conformations. The new vector definitions of the βαβ element are shown in Figure 2. The energy penalty for the βαβ rule is then given by:

$$
E_{\beta\alpha\beta} = \sum_{k} \left[ V_{p-1}(k) + V_{p-2}(k) \right] \eta(k) \tag{3}
$$

with $\eta(k) = 1$ if element k − 2 is predicted to be β, k − 1 is α and k is β, and $\eta(k) = 0$ otherwise. The sum is taken over the number of secondary structure elements $N_{\mathrm{sec}}$. The potentials $V_{p-1}$ and $V_{p-2}$ are given by:

$$
V_{p-1}(k) = \eta(k)\, K_{\beta\alpha\beta}\, (\mathbf{b} \cdot \mathbf{e})^{2} \tag{4}
$$

$$
V_{p-2}(k) = \lambda(k)\, K_{\beta\alpha\beta}\, \left(0.7 - (\mathbf{e} \cdot \mathbf{d})\right)^{2} \tag{5}
$$

where the vectors b, d and e are given in Figure 2. These vectors are obtained as follows: Let us denote $\mathbf{x}^{\alpha}_{\beta}$ as the splinned coordinates (Skolnick et al., 1997b) of the starting (α = s) and ending (α = e) points of the secondary structure elements β = k − 2, k − 1 or k, respectively (see Figure 2). The unit vectors describing the direction of the secondary structures of the β elements can be found as: $\mathbf{v}_{\beta} = (\mathbf{x}^{e}_{\beta} - \mathbf{x}^{s}_{\beta}) / \|\mathbf{x}^{e}_{\beta} - \mathbf{x}^{s}_{\beta}\|$. We can also define the vector connecting the two strands as $\mathbf{a} = (\mathbf{x}^{e}_{k-2} - \mathbf{x}^{s}_{k})$. The vectors d and e can then be obtained as the following cross-products: $\mathbf{d} = (\mathbf{v}_{k-2} \times \mathbf{a})$ and $\mathbf{e} = (\mathbf{v}_{k} \times \mathbf{a})$. For the derivation of b, we refer to the derivation of equation (A8a) in our previous work (Skolnick et al., 1997b). The value of $\eta(k) = 1$ if $(\mathbf{b} \cdot \mathbf{e}) < 0$, as the connection is then left-handed, and it is $\eta(k) = 0$ otherwise. Also, if $\|\mathbf{e} \cdot \mathbf{d}\| > 0.7$, then $\lambda(k) = 0$, otherwise $\lambda(k) = 1$. This allows an angle of up to 45° between the strands, therefore taking into account the possible β-sheet twist (see Figure 2). The typical value of $K_{\beta\alpha\beta}$ = 20 kT.
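The geometric part of the βαβ rule can be sketched for a single β-α-β unit as follows. This is an illustration under assumptions rather than the authors' implementation: the vectors d and e are normalized here so that their dot products behave as cosines, the vector b is supplied by the caller as a unit vector (its derivation is given in the earlier work cited above), and all function names are hypothetical.

```python
def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def unit(u):
    norm = dot(u, u) ** 0.5
    return tuple(a / norm for a in u)


def bab_penalty(xs_first, xe_first, xs_last, xe_last, b, k_bab=20.0):
    """Penalty (kT) for one beta-alpha-beta unit, following equations (4) and (5).

    xs_*/xe_* are the smoothed start/end points of strands k-2 (first) and k (last).
    """
    v_first = unit(sub(xe_first, xs_first))   # direction of strand k-2
    v_last = unit(sub(xe_last, xs_last))      # direction of strand k
    a = sub(xe_first, xs_last)                # end of first strand minus start of last
    d = unit(cross(v_first, a))               # normalization is an assumption
    e = unit(cross(v_last, a))
    handed = 1.0 if dot(b, e) < 0 else 0.0    # left-handed connection is penalized
    lam = 0.0 if abs(dot(e, d)) > 0.7 else 1.0  # within about 45 degrees: no penalty
    v1 = handed * k_bab * dot(b, e) ** 2
    v2 = lam * k_bab * (0.7 - dot(e, d)) ** 2
    return v1 + v2
```

For two parallel strands lying along z and separated along x, the sketch returns 20 kT when b points to the side opposite to e and zero when it points to the same side, which matches the handedness switch described in the text.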

Relative weighting of the various contributions. The total energy of a given conformation is given by:

$$
E = 0.5E_{14} + 1.5E_{1} + E_{\mathrm{density}} + 2.75E_{\mathrm{pair}} + E_{\mathrm{target,sec}} + E_{\text{U-turn}} + E_{\text{H-bond}} + E_{\beta\text{-prop}} + E_{\mathrm{res}} + E_{\mathrm{know}} \tag{6}
$$

Conformational sampling

Sampling of conformational space occurs via a standard asymmetric Monte Carlo Metropolis scheme (Metropolis et al., 1953). Several types of local conformational micromodifications of the chain backbone and rare, small distance motions of larger chain fragments, together with side group equilibration cycles, are used. From 10 to 40 independent assembly simulations for each protein are carried out, each from a fully extended initial conformation. Each simulation starts at a reduced temperature of 5.0, and then the temperature is slowly lowered to 1.0. Low energy structures are then subject to isothermal refinement. The predicted fold is the one exhibiting the lowest average energy during the isothermal calculation. (In correctly folded structures, this energy is roughly 5 kT per residue.)
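The annealing protocol can be illustrated with a generic Metropolis loop. The sketch below is not the MONSSTER move set: it uses the plain (symmetric) Metropolis acceptance rule, a linear cooling schedule from 5.0 to 1.0 (the text only says the temperature is lowered slowly), and caller-supplied `energy` and `propose` functions, all of which are assumptions made for illustration.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always, uphill moves with probability exp(-dE/T)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def anneal(state, energy, propose, n_steps, t_start=5.0, t_end=1.0, seed=0):
    """Simulated annealing from t_start to t_end; returns the final state and energy."""
    rng = random.Random(seed)
    e = energy(state)
    for step in range(n_steps):
        # Linear cooling (assumed schedule).
        t = t_start + (t_end - t_start) * step / max(n_steps - 1, 1)
        trial = propose(state, rng)
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e, t, rng):
            state, e = trial, e_trial
    return state, e
```

Running `anneal` several times with different seeds mirrors the use of independent assembly runs; the isothermal refinement stage would correspond to calling it with `t_start` equal to `t_end`.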

Computational details

The typical computational time of assembling a 100-residue protein with MONSSTER, consisting of a run of about 5 × 10^6 Monte Carlo steps, is 5.0 hours on a single SGI MIPS R10000 processor running at 180 MHz clock speed and using a cache size of 32 kb. Each isothermal calculation needs an additional 5.0 hour run.

Structural analysis

The folded structures were compared with the experimental conformations using two sets of measurements, describing the global similarity of the structures and the local matching of the secondary structure elements. For the global similarity, the total coordinate root-mean-square deviation (cRMSD) of the two structures was calculated after computing the best superposition of the predicted structure with the experimental structure, using the McLachlan algorithm. In all the cRMSD calculations, all residues of both structures were included.

The comparison of the secondary structure prediction of the predicted models with that of the experimental structure and that coming from secondary structure prediction algorithms is complicated, and we feel that no satisfactory method can be used at the moment. The standard method for assigning secondary structure is due to Kabsch & Sander (1983) and is implemented in the DSSP program. The method relies heavily on the hydrogen bond pattern of the structure as computed from the peptide plates. The Cα models of the predicted structures lack these peptide plates. An all-atom reconstruction of these models is possible, and has actually been carried out (see below), but the DSSP method is very sensitive to small shifts in the coordinates in the secondary structure assignment; thus, the direct use of the DSSP assignments in the predicted models is prone to introduce considerable artifacts. For this reason, we adopted the Richards & Kundrot (1988) definition of secondary structure, based on the pseudodihedral angles of consecutive Cα atoms. Still, this approach is not entirely satisfactory, as the PHD method has been developed using the DSSP definition of secondary structure. However, trial calculations indicated that artifacts introduced by this approach are smaller than when the DSSP assignment is used. Computation of the Richards-Kundrot secondary structure was carried out on the basis of reconstructed all-atom models. All-atom reconstruction of each of the predicted structures was carried out using the MODELLER program (Sali & Blundell, 1993) using default parameter settings.

Proteins tested

The above protocol has been applied to the set of 20 small proteins listed in Table 1, sorted by size and named according to the entry in the Brookhaven database (Bernstein et al., 1977). The size of the proteins ranges from the 29 amino acids of 3cti to the 100 amino acids of 1ife. They were chosen to examine how well the methodology performs on the following different structural motifs: small disulfide-rich proteins; all-α proteins; all-β proteins and α/β proteins. The fold description of each of the proteins is presented in Table 1 according to the SCOP database (Murzin et al., 1995). In all cases, the structures were chosen at random, with the only constraints being that no global misassignment of the secondary structure prediction took place, i.e. no long helical segment was assigned as extended or vice versa, and that a sufficient number of homologous sequences, at least ten, was available in the HSSP database (Sander & Schneider, 1991). For the two smallest disulfide-rich proteins, 3cti and 1ixa, which are substantially devoid of secondary structure, the same protocol was applied. However, here we assumed knowledge of the identity of the disulfide bridges, as the general protocol would presumably fail due to the small predicted content of secondary structure. Such disulfide bridges are used as seeds in the tertiary restraint derivation protocol for contact prediction. Here, the objective is to study whether the knowledge of disulfide bridges, together with the protocol we now present, can produce low resolution models of small disulfide-rich proteins, which, in general, are considerably less regular than larger proteins. For T0042, we also used the known disulfide pattern, as it was available to the prediction teams in the CASP2 contest. In the case of the other 17 proteins, the native disulfide bridges were not used as seeds in those proteins that had disulfide bridges in their native structure. Rather, cross-links were chosen so as to be compatible with the predicted contact map. To this effect, the following algorithm was used: first, the set of all possible pairs of disulfide bridges is computed. From this set, the first disulfide bridge is assigned as that with the closest contact map distance to any of the predicted contacts. The selected pair of cysteine residues is removed from the list of free cysteine residues, and the list of possible disulfide bridges is recomputed. The algorithm continues until all pairs are assigned.
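The greedy cross-link assignment just described can be sketched as below. The text does not define the contact map distance, so the sketch assumes the sum of the sequence offsets between a candidate cysteine pair and a predicted contact; the function names are hypothetical and a non-empty list of predicted contacts is assumed.

```python
from itertools import combinations


def map_distance(pair, contact):
    """Assumed contact-map distance: |i - k| + |j - l| for pairs (i, j) and (k, l)."""
    (i, j), (k, l) = sorted(pair), sorted(contact)
    return abs(i - k) + abs(j - l)


def assign_disulfides(cysteines, predicted_contacts):
    """Greedily pair cysteines so that each bridge lies close to a predicted contact."""
    free = sorted(cysteines)
    bridges = []
    while len(free) >= 2:
        # Recompute every possible bridge among the cysteines that are still free.
        candidates = combinations(free, 2)
        best = min(
            candidates,
            key=lambda p: min(map_distance(p, c) for c in predicted_contacts),
        )
        bridges.append(best)
        # Remove the selected pair from the list of free cysteines.
        free = [c for c in free if c not in best]
    return bridges
```

Ties are broken by the enumeration order of `combinations`, which is an arbitrary choice; with an odd number of cysteines, one residue is left unpaired.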

All additional information in this study, such as the derived set of predicted contacts for all proteins, and structural predictions, is available via the WWW at the URL http://www.scripps.edu/skolnick/ORTIZ/ortiz.html

Results

Derivation of restraints

Secondary structure restraints

The secondary structure bias for each residue used in this series of calculations has been obtained by mixing the PHD and LINKER predictions, as described in Methods. The accuracy of the PHD predictions is shown in Table 2, and the final states assigned to each residue in a representative subset of the proteins studied are shown in Figure 3. The average secondary structure prediction accuracy for the set of proteins used here is higher than the expected accuracy of the method, but it is still within one standard deviation of the current accuracy of PHD. Combination of the LINKER algorithm with the PHD method seems to improve, on average, the quality of the predictions. This improvement comes from two sources: (1) the ability of LINKER to break long PHD-predicted helices by inserting loops in unphysically long helices assessed on the basis of the expected protein size, and (2) the ability of LINKER to predict extended states of local hydrophilic stretches, usually missed by PHD as a result of the fitting of PHD to the DSSP assignment of secondary structure, which demands a hydrogen-bond network in the assignment (Kabsch & Sander, 1983). Thus, PHD sometimes misses stretches of sequence with poor β-sheet propensities, but which populate extended states. For example, the second strand in 1gpt was missed by PHD, but localized as an extended state by LINKER (observe in Figure 3 the difference between 1 and 4 states). A similar situation is observed in the first strand of 1tfi. For 1hmd, helices two and three are merged, but LINKER successfully corrects this overprediction (Figure 3). A similar situation was observed in 3icb (not shown). However, because of the limited accuracy in loop positioning of the algorithm at the residue level, the final outcome of the secondary structure assignment is usually more shifted with respect to the experimental secondary structure than in the original PHD prediction. Although the overall secondary structure assignment of the proteins tested here can be considered to be quite good by state-of-the-art standards, it is worth pointing out that there are still some proteins for which entire elements of secondary structure are missed. Thus, in the case of 1ftz, the third helix is missed, and an additional helix is predicted in the C terminus. Similarly, the third strand of 1shg is missed, as are the second and third helix and the fourth strand of 1ego (Figure 3), as well as the third strand of 1poh at the edge of the fold (not shown). There are also considerable discrepancies between the predicted and observed lengths and locations of the secondary structure elements (Figure 3).

Contact prediction

The results of side-chain contact prediction are compiled in Tables 2 and 3. Not considering cysteine-rich proteins where the disulfide contact pattern has been assumed to be known, the overall accuracy of the procedure of contact prediction is similar to that reported by other authors (Göbel et al., 1994; Olmea & Valencia, 1997), on the order of 25%. Turning to the issue of precision, allowing an error of one residue in the assignment of partners yields an accuracy on the order of 60%; within two residues, 77%; within three residues, 85%; and within four residues, 95%. The average prediction of the contact map coverage is on the order of 25%, which corresponds to about N/4 restraints per N protein residues and is consistent with our previous findings (Skolnick et al., 1997b). Thus, in general, the number of contacts predicted is a small portion of the whole protein contact map and usually contains a significant amount of noise. In some cases, wrong pairings of secondary structure elements are obtained, as is the case with 1poh, 1ife and 1hmd. In general, as expected, the bigger the protein, the higher the chance to assign a wrong pairing of secondary structure elements. The structures used for "growing" the contacts predicted by correlated mutations for each of the proteins studied are compiled in Table 3. The structures selected are entirely unrelated to the target sequence, i.e. no remote sequence homologues were used. With the contact derivation procedure used in this work, the final predicted contacts are highly clustered in structure; therefore, the effective number of contacts that can be considered to be independent is lower than the number of predicted contacts. Some other effects are also worth noting. Comparison of the predicted and observed contact maps (see Figure 4) shows that, as a result of the independent growth of each one of the restraints, frequent phase shifts for the different restraint subsets are observed. This problem is particularly important for restraints involving α-helices. This can produce helix unwrapping and other structural distortions. As explained in Methods, we have tried to avoid this problem by introducing the "splinning" procedure.
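The distinction between accuracy and precision drawn above can be made concrete with a small helper that counts predicted contacts lying within δ residues of a native contact. The sketch assumes that a predicted pair (i, j) is within δ of a native pair (k, l) when both |i − k| and |j − l| are at most δ; the text does not spell out this criterion, and the function name is hypothetical.

```python
def fraction_within(predicted, native, delta):
    """Percentage of predicted contacts within `delta` residues of some native contact.

    Contacts are (i, j) tuples with i < j; the predicted list must not be empty.
    """
    native = list(native)

    def near(p):
        # Both partners may be shifted by at most `delta` positions (assumed criterion).
        return any(
            abs(p[0] - q[0]) <= delta and abs(p[1] - q[1]) <= delta
            for q in native
        )

    hits = sum(1 for p in predicted if near(p))
    return 100.0 * hits / len(predicted)
```

Evaluating the function for δ = 0, 1, ..., 5 gives one row of per-δ percentages of the kind reported in Table 2, where δ = 0 corresponds to the strict accuracy.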

Table 1. Classification of the proteins folded in this study according to the SCOP (URL http://scop.mrc-lmb.cam.ac.uk/scop/) database

Protein | Nres | Class | Fold description | Name
3cti | 29 | Small | Disulfide-bound fold, beta hairpin with adjacent disulfide | Trypsin inhibitor from squash (Cucurbita maxima)
1ixa | 39 | Small | EGF-like (disulfide-rich fold; nearly all-beta) | Factor IX from human (Homo sapiens)
protA | 47 | α | Three-helix bundle | Protein A
1gpt | 47 | Small | Disulfide-bound fold, beta hairpin with adjacent disulfide | Gamma-thionin from barley (Hordeum vulgare)
1tfi | 50 | Small | Rubredoxin-like (metal bound fold, with 2 CXXC motifs) | Transcriptional factor SII from human (Homo sapiens)
6pti | 58 | Small | BPTI-like (disulfide rich α+β fold) | Pancreatic trypsin inhibitor from bovine (Bos taurus)
1fas | 61 | Small | Snake toxin like (disulfide rich; nearly all beta) | Fasciculin from green mamba (Dendroaspis angusticeps)
1shg | 62 | β | SH3-like barrel (partly opened; n* = 4, S* = 8; meander) | alpha-Spectrin, SH3 domain from chicken (Gallus gallus)
1cis | 66 | α+β | CI-2 family (α+β sandwich; loop across free side of β) | Hybrid protein from barley (Hordeum vulgare) hiproly strain
1ftz | 70 | α | DNA-binding 3-helix bundle (right-handed twist; up-down) | Fushi Tarazu protein from fruit fly (Drosophila melanogaster)
1pou | 71 | α | DNA-binding domain (4 helices, folded leaf, closed) | Oct-1 POU-specific domain from human (Homo sapiens)
1c5a | 73 | α | Anaphylotoxins (4 helices; irreg. array, disulfide linked) | C5a anaphylotoxin from pig (Sus scrofa domestica)
3icb | 75 | α | EF-hand (2 EF-hand connected with Ca bind loop) | Calbindin D9K from bovine (Bos taurus)
1ubi | 76 | α+β | β-Grasp (single helix packs against β-sheet) | Ubiquitin from human (Homo sapiens)
T0042 | 78 | α | Five-helix bundle | NK-lysin from pig (Sus scrofa)
1lea | 84 | α | DNA-binding 3-helix bundle (right-handed twist; up-down) | LexA repressor, DNA-binding domain (Escherichia coli)
1ego | 85 | α/β | Thioredoxine-like (3 α/β/α layers; β-sheet order 4312) | Glutaredoxin from bacteriophage T4
1hmd | 85 | α | Four helical up-and-down bundle (left-handed twist) | Hemerythrin from sipunculid worm (Themiste dyscrita)
1poh | 85 | α+β | α+β sandwich | Histidine-containing phosphocarrier proteins (E. coli)
1ife | 100 | α+β | IF3-like (β-α-β-α-β(2); 2 layers; mixed sheet 1243) | Translation initiation factor IF3 from Escherichia coli

The protein name is assigned according to the Brookhaven Protein Data Base entry, except for Protein A (not in the PDB database) and T0042, corresponding to the sequence target 42 of the recent CASP2 meeting (URL http://PredictionCenter.llnl.gov/); see the text for details.

Table 2. Statistics of the restraint derivation procedure

Protein | Nseq(a) | Nseed(b) | Np(c) | δ=0(d) | δ=1(d) | δ=2(d) | δ=3(d) | δ=4(d) | δ=5(d) | Nw(e) | Ncon(f) | %PC(g) | Q3(h)
3cti(i) | 19 | 3 | 6 | 83.3 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 0 | 39 | 15.3 | 82.4
1ixa(j) | 70 | 2 | 5 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 0 | 48 | 10.4 | 97.4
protA | 25 | 3 | 17 | 0.0 | 35.2 | 70.5 | 82.3 | 94.1 | 100.0 | 0 | 91 | 18.6 | 83.0
1gpt | 8 | 3 | 13 | 46.1 | 76.9 | 100.0 | 100.0 | 100.0 | 100.0 | 0 | 70 | 18.5 | 72.3
1tfi | 10 | 8 | 37 | 21.6 | 54.0 | 88.8 | 93.3 | 97.7 | 100.0 | 0 | 84 | 44.0 | 78.0
6pti | 45 | 6 | 19 | 68.4 | 94.7 | 100.0 | 100.0 | 100.0 | 100.0 | 0 | 92 | 20.6 | 80.4
1fas | 44 | 1 | 19 | 26.3 | 57.8 | 78.9 | 89.4 | 100.0 | 100.0 | 0 | 98 | 19.3 | 90.2
1shg | 20 | 6 | 39 | 28.2 | 89.7 | 100.0 | 100.0 | 100.0 | 100.0 | 0 | 109 | 35.7 | 64.9
1cis | 17 | 5 | 23 | 8.6 | 65.2 | 78.2 | 95.6 | 100.0 | 100.0 | 0 | 144 | 15.9 | 86.4
1ftz | 312 | 2 | 12 | 25.0 | 33.3 | 58.3 | 58.3 | 75.0 | 91.6 | 1 | 149 | 8.0 | 71.4
1pou | 47 | 5 | 48 | 28.6 | 77.5 | 89.8 | 95.9 | 100.0 | 100.0 | 0 | 122 | 39.3 | 84.5
1c5a | 20 | 9 | 45 | 24.4 | 62.2 | 73.3 | 82.2 | 95.5 | 95.5 | 2 | 105 | 42.8 | 93.8
3icb | 67 | 3 | 25 | 28.0 | 68.0 | 68.0 | 76.0 | 100.0 | 100.0 | 0 | 154 | 16.2 | 89.3
1ubi | 33 | 7 | 17 | 23.5 | 88.2 | 94.1 | 94.1 | 100.0 | 100.0 | 0 | 153 | 11.1 | 77.6
T0042 | 18 | 3 | 24 | 29.1 | 45.8 | 58.3 | 70.8 | 91.6 | 95.8 | 1 | 150 | 16.0 | 80.8
1lea | 16 | 10 | 41 | 9.7 | 34.1 | 75.6 | 90.2 | 95.1 | 95.1 | 2 | 131 | 31.2 | 87.5
1ego | 10 | 7 | 33 | 15.1 | 84.8 | 93.9 | 96.9 | 100.0 | 100.0 | 0 | 223 | 14.7 | 71.8
1hmd | 11 | 5 | 20 | 10.0 | 45.0 | 65.0 | 70.0 | 90.0 | 90.0 | 2 | 157 | 12.7 | 85.0
1poh | 19 | 12 | 36 | 8.3 | 33.3 | 55.5 | 80.5 | 91.6 | 91.6 | 3 | 162 | 29.6 | 74.1
1ife | 13 | 6 | 21 | 14.2 | 23.8 | 38.0 | 61.9 | 80.9 | 85.7 | 3 | 148 | 14.1 | 70.0

a Number of sequences contained in the multiple sequence alignment.
b Number of predicted contacts obtained from the multiple sequence alignment (see the text for details).
c Number of predicted contacts used as restraints in the simulations.
d Percentage of predicted contacts within δ residues of a native contact.
e Number of contacts that are incorrect when δ = 5.
f Number of contacts found in the experimental structure. A contact between two amino acids occurs when any of their side-chain heavy atoms are within 5.0 Å from each other.
g Percentage of the contact map of the experimental structure predicted by the contact prediction method.
h Per residue percentage accuracy of secondary structure prediction, based on a three-state model, obtained from the PHD method.
i In the case of 3cti, the restraints come from the known disulfide bridges (contacts: 3-20; 10-22; 16-28), plus three additional contacts predicted by the correlated mutations method (contacts: 8-17; 7-27; 14-21).
j Restraints come from the known disulfide pattern (contacts: 6-17; 11-26; 28-37) and predicted contacts from the correlated mutation analysis (contacts: 10-24; 13-32).

Figure 3. Secondary structure assignment for a representative subset of the proteins used in this study. The amino acid sequence is given for each protein. The observed secondary structure in the experimental conformation according to the DSSP assignment (Kabsch & Sander, 1983) of three states is also shown, as is the assigned secondary structure state in the folding simulations, according to the prediction results. 1 stands for coil assignment; 2 for helix assignment; 3 for a U-turn assignment; 4 for strand assignment, and 5 corresponds to no assignment of secondary structure.


Table 3. For each of the predicted proteins, the source structures used for contact map growth are shown

Protein | Source structures used for contact growth(a)
3cti | -
1ixa | -
protA | 1bbhA 4ts1A 256bA
1gpt | 1lhm 2gbp 1ald
1tfi | 1pk4 1fdlH 1hoe 1vaaB 1sarA 1pk4
6pti | 2fb4 1er8 2gb1
1fas | 1fxa 1dtx 1atx
1shg | 4tms 1atx 1gp1 6taa 3ebx
1cis | 1pcy 2hlaA 1ppd
1ftz | 1s01 1c5a
1pou | 1prc 1cdp
1c5a | 1pbxA 3adk 1avr 8catA 3wrp 1colA
1ubi | 1ovaA 3enl 1paz
T0042 | 1mba 2utgA
1lea | 2liv 1akeA 2lhb 2timA 2liv
1ego | 2azaA 2fx2 5rubA 1rnh
1hmd | 6taa 1lig
1poh | 5rubA 3blm 1abp 1rbp
1ife | 2fcr 1lh1 1ald

a In the case of 3cti and 1ixa, the contact map growth step was not used, as a result of the insufficient secondary structure content in these structures. Here, restraints are given by the predicted contacts plus the known disulfide bridges. See also Table 2.


Fold assembly and discrimination

Fold assembly is carried out starting from an extended chain; therefore, the initial restraint energy is very large in the first cycles of the algorithm where the first motions of the chain mainly decrease the restraint energy. Thus, compact states are generated very quickly. Finally, the secondary structure forms, and the adjustments of secondary structure elements take place. In some cases, during the first annealing run, the structures are trapped in misfolded states. Then, during subsequent annealing runs, the correct registration of elements takes place. This effect is particularly observed in α/β proteins. In the final folds, typically the restraint energy is close to zero as a result of the soft implementation of restraints (Table 4), although on average the number of satisfied predicted contacts is about half the number of predicted contacts. Based on restraint satisfaction, it is not possible to discriminate among alternative answers (Table 4). Because the energy landscape is rugged, individual structures obtained from the assembly runs are not able to provide a reliable energy for the particular fold they represent. As a result, to rank order the folds, it is necessary to carry out isothermal calculations for at least the lower part of the energy spectrum of the created folds (Table 4).

The superimposed predicted and experimental conformations of a representative subset of the proteins tested in this work can be seen in Figure 5. Details of the resolution achieved for each particular protein can be found in Table 4. The average cRMSD is about 5 Å. Thus, in comparison with the use of "exact" restraints, a price in resolution of between 1.0 and 2.0 Å has to be paid. When the different protein classes are considered, the average cRMSD of the lowest energy set of structures ranges from about 4 Å for helical proteins to roughly 6 Å for β and mixed motif proteins. In all cases, the global topology is recovered either as the best energy (in 17 out of 20 cases) or as the next best energy alternative fold. Of the three that failed (1ixa, 1hmd and 1ife), the misfolded state of 1ixa results from the misplacement of a few residues in the C-terminal region; in the case of 1hmd, it is not possible to distinguish between the two topological mirror images, which are essentially isoenergetic. In the last case, 1ife, the selected fold is actually correct in spite of the unacceptably high cRMSD. Here, a coil region shifts from the edge of the fold to the back of the protein. These numbers could be compared with the expected cRMSD obtained at random for protein chains of the length of the sequences used in these folding studies (Table 4), using the expression given by Cohen & Sternberg (1980). The average value that could have been obtained at random is around 12 Å.

Turning again to the assembly process, it is interesting to note that the different secondary structure elements do not simply pack as rigid bodies; that is, as shown in Figure 6, changes in secondary structure status produced by long-range interactions are common in order to assemble the fold. This can be quantified by the comparison of the secondary structure predictions and the secondary structure assignment of the predicted models (Table 5). In some cases, the secondary structure elements extend in length from the original predictions, as in the case of the α-helices of 6pti or the β-strands in 1gpt. In some other cases, they need to shorten to accommodate themselves into the protein fold, as in the case of 1lea. And in a number of cases, additional secondary structure elements form. The most striking case is that of 1ubi, as an example of α-helix formation, and 1shg for β-strands. Overall, our results suggest that fixing the length of secondary structure elements and treating them as rigid bodies can have a deleterious effect in fold assembly. For example, as a result of "fraying ends", some hydrophobic residues can be exposed and some wrong contacts between elements can be formed, reducing or even eliminating the energy gap with alternative folds. Some flexibility is required in order to correctly pack the secondary structure elements. Usually their predicted length is incorrect, and there are shifts in registration with respect to the tertiary restraints. Therefore, if a rigid model is used to define the secondary structure, the correct fold can be missed. However, on average, the correctness of the secondary structure in the predicted models does not improve when compared with the original secondary structure predictions used as restraints during the simulations (see Table 5 and Figure 6).


The case of T0042 as a blind prediction

The second meeting on the Critical Assessment of Techniques for Protein Structure Prediction (CASP2) was held in Asilomar, California, recently. Several protein targets were available to researchers as blind predictions covering different aspects of protein structure prediction: docking, homology modeling, threading and ab initio folding (URL http://iris4.carb.nist.gov/casp2/). At the time of the meeting, the present work was being carried out, and we felt that the method was too immature for us to participate. However, once we obtained enough experience with the approach presented here, and in order to compare our results with those of other groups using different methods under similar conditions, we pursued the blind prediction of target 42 (T0042) of the meeting. It must be stressed that the prediction was made without knowledge of the target conformation, as this structure has been released only recently. (All of the numerical assessments for this and the rest of the targets, as well as the experimental structure of some of the targets, are available through the World Wide Web URL http://PredictionCenter.llnl.gov/.) T0042 was chosen because it was the most popular target sequence for most groups doing ab initio folding. Here, we will describe in detail the results of the prediction of this protein. It will be used as an example to illustrate the prediction process following the present approach.

T0042 is a protein of 78 amino acids. The correct pairing of the three disulfide bridges of the protein was made available to the prediction teams, and it was used by us as well. A multiple sequence alignment was obtained for this sequence by scanning the EMBL/SWISSPROT database with FASTA (Pearson & Lipman, 1988) and filtering the sequences found using MAXHOM (Sander & Schneider, 1991). However, it was necessary to manually edit this alignment in order to remove short sequences because the initial alignment did not provide any predicted contacts using our method. After filtering the alignment by hand, the final multiple sequence alignment contained 15 homologous sequences plus the target sequence (Table 6). Secondary structure predictions were carried out combining PHD and LINKER, as described in Methods (Table 7). The experimental structure contains five α-helices, but the PHD prediction merges helices III and IV and partially misses helix V. When the PHD predictions are combined with the LINKER predictions, the resulting secondary structure assignment is actually worse than the PHD prediction alone: helices III and IV remain merged, helix V is totally missed and helix II is considerably shortened (Table 7). Using the correlated mutation analysis, three contact seeds could be predicted from the multiple sequence alignment (see Table 8). One of them involves a disulfide bridge observed in this protein, which made us feel more confident about the quality of the predicted seeds. These seeds, together with the known disulfide bridges, were used in the inverse folding calculation with the objective of "expanding" the predicted contacts. Only two of the three seeds could "grow". Thus, the final number of restraints was 23 (Table 9). Interestingly, one of the fragments selected for the enrichment process involved 2utg, which has been found by other researchers during the CASP2 contest to be a popular template when global threading of the sequence is done. The obtained secondary and tertiary restraints were used as input for MONSSTER. Ten simulations were performed. After the isothermal calculations, the lowest average energy fold was selected as the predicted structure.

Figure 4. Contact maps of a representative set of proteins used in this work. The experimental contact map is shown in blue, the predicted seeds are shown in red, and the expanded contacts obtained by inverse folding are shown in green (see the text for details). A contact distance cut-off of 4.5 Å between side-chain heavy atoms is used. Contact maps for the following proteins are shown: (a) 1ego; (b) 1tfi; (c) 1pou.
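The contact definition quoted for the contact maps (a 4.5 Å cut-off between side-chain heavy atoms) can be illustrated with a short Python sketch. The input layout (one list of heavy-atom coordinates per residue), the strict "less than" comparison, the `min_separation` parameter and the toy coordinates are assumptions made for illustration only; they do not describe the authors' implementation.

```python
from math import dist  # Euclidean distance, Python 3.8+


def side_chain_contacts(side_chains, cutoff=4.5, min_separation=1):
    """Return residue index pairs (i, j), i < j, that are in contact.

    side_chains: one list of (x, y, z) heavy-atom coordinates per residue
    (an empty list for a residue without side-chain heavy atoms).
    Two residues are in contact when any pair of their side-chain heavy
    atoms is closer than `cutoff` (in Angstrom).
    """
    contacts = []
    n = len(side_chains)
    for i in range(n):
        for j in range(i + min_separation, n):
            if any(dist(a, b) < cutoff
                   for a in side_chains[i] for b in side_chains[j]):
                contacts.append((i, j))
    return contacts


# Toy coordinates (hypothetical): residues 0 and 1 touch, residue 2 is
# far away, residue 3 has no side-chain heavy atoms.
toy = [
    [(0.0, 0.0, 0.0)],
    [(3.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(20.0, 0.0, 0.0)],
    [],
]
contact_list = side_chain_contacts(toy)
print(contact_list)
```

A list of this kind for the experimental structure is the reference against which predicted seeds and expanded contacts would be compared.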

Table 4. Results of the folding simulations

Protein  RMSD[a]  <E>[b]   σ[c]   rs[d]  Epen[e]  RMSD[f]  <E>[g]   σ[h]   rs[i]  Epen[j]  RMSD(r)[k]
3cti     3.8      -106.9   7.4    6      0.0      6.7      -103.1   7.8    6      0.0      10.60
1ixa     7.7      -131.2   7.0    2      13.2     5.6      -130.2   8.0    5      10.0     11.07
protA    3.1      -246.2   6.6    2      0.2      9.4      -240.0   5.5    1      0.3      11.45
1gpt     5.9      -276.1   12.4   9      4.3      6.6      -142.3   7.0    10     2.8      11.45
1tfi     5.9      -201.6   7.3    28     8.2      7.0      -191.2   9.4    31     2.2      11.59
6pti     4.7      -410.0   10.0   19     0.0      9.7      -397.0   10.0   18     0.00     11.96
1fas     6.2      -330.0   6.3    19     1.5      9.3      -284.0   7.6    20     10.7     12.10
1shg     4.5      -420.0   4.5    11     14.2     6.7      -397.0   5.5    17     21.1     12.15
1cis     6.4      -240.0   8.2    7      2.7      7.6      -232.0   6.6    7      0.1      12.34
1ftz     5.1      -276.9   8.0    11     0.7      10.13    -270.0   7.9    15     0.5      12.53
1pou     3.5      -418.0   3.4    18     31.8     11.9     -364.0   4.0    22     23.5     12.57
1c5a     4.2      -194.0   4.0    20     9.4      9.8      -182.0   5.2    26     5.3      12.66
3icb     4.5      -406.0   7.0    21     17.6     12.6     -342.0   3.9    11     15.0     12.76
1ubi     6.1      -238.0   6.3    9      0.0      11.5     -203.0   5.7    8      2.8      12.80
T0042    5.6      -362.2   8.2    15     10.4     11.7     -359.8   9.6    8      11.9     12.90
1lea     6.1      -136.0   7.7    26     8.8      9.4      -115.0   7.5    27     7.3      13.18
1ego     5.7      -417.2   8.9    20     1.3      9.0      -396.4   15.0   14     1.19     13.22
1hmd     9.3      -459.7   5.4    13     0.3      4.6      -458.0   7.2    3      0.15     13.22
1poh     6.5      -336.0   9.5    42     24.6     11.7     -299.0   8.6    23     16.1     13.22
1ife     8.2      -481.8   7.3    16     5.8      6.7      -419.0   10.8   15     11.8     13.93

After the column corresponding to the protein name, the next five columns correspond to parameters describing the lowest energy fold obtained during the simulations. The following five columns correspond to these same parameters describing the alternative fold of lowest energy found during the simulations. The numbers in bold correspond to the lowest cRMSD among the competing folds. Note: Both folds of 1ife correspond to the same topology; however, the selected conformation has a strongly distorted strand at the edge of the fold. The final column describes the cRMSD expected at random for a protein chain of the length of the corresponding sequence, according to the Cohen & Sternberg (1980) model.

[a] Average coordinate RMSD of the lowest energy fold found in the folding simulations.
[b] Average energy (in kT units) of the fold obtained from the isothermal calculation (T = 1.0).
[c] Standard deviation of the energy during the isothermal calculation (T = 1.0) for the lowest energy fold.
[d] Number of predicted contacts satisfied in the final predicted fold.
[e] Residual restraint energy (in kT units) in the predicted fold.
[f] Average coordinate RMSD of the lowest energy alternative fold found in the folding simulations.
[g] Average energy (in kT units) of the lowest energy alternative fold obtained from the isothermal calculation (T = 1.0).
[h] Standard deviation of the energy during the isothermal calculation (T = 1.0) for the lowest energy alternative fold.
[i] Number of predicted contacts satisfied in the alternative fold.
[j] Residual restraint energy (in kT units) in the alternative fold.
[k] Expected coordinate RMSD for a random chain of the length of the sequence under consideration according to the Cohen & Sternberg (1980) model.

After these calculations were completed, the experimental structure of T0042 became available to us. The RMSD between all Cα atoms of the experimental and computed structures is 5.6 Å (Table 4). A superimposition of the predicted and the experimental structure can be seen in Figure 5. Two striking features of the predicted fold are worth noting. First, helix III of the predicted secondary structure needs to break around residues 55 and 56 to assemble the fold, forming two independent helices in the predicted fold, as observed in the experimental structure. Thus, helix IV in the predicted structure extends from residues 59 to 62, as compared to residues 57 to 61 observed in the experimental conformation. The second point to note is the partial formation of the last C-terminal helix, helix V in the experimental structure, missed by the secondary structure predictions, even though a soft bias was used towards extended states. Both observations highlight the fact that proteins are frustrated systems from the energetic point of view, and that any prediction scheme must consider this frustration. The local secondary structure biases provided by the secondary restraints were in both cases overridden by tertiary interactions. It must be emphasized that many predictions submitted to CASP2 failed to provide the correct answer because the secondary structure was assumed to be completely correct. In our case, helix IV of the predicted structure is shorter and slightly shifted when compared to the experimental structure.
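For readers who want to reproduce a coordinate RMSD of the kind quoted here, the following Python sketch computes the RMSD between two equally sized sets of Cα coordinates after optimal rigid-body superposition. It uses the standard Kabsch algorithm with NumPy, which is an assumption on our part; the text does not state which superposition procedure was used, and the function name and toy coordinates are hypothetical.

```python
import numpy as np


def crmsd(P, Q):
    """Coordinate RMSD between point sets P and Q (N x 3, same residue
    order) after optimal superposition by a proper rotation (Kabsch)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove the translation by centring both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = (R @ Pc.T).T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))


# Toy check (hypothetical coordinates): a rotated and translated copy
# superimposes exactly, whereas a mirror image does not.
P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = (Rz @ P.T).T + np.array([5.0, -1.0, 2.0])
mirrored = P * np.array([-1.0, 1.0, 1.0])
print(crmsd(P, moved), crmsd(P, mirrored))
```

Excluding reflections matters in this context: a mirror-image arrangement is a different structure and should not score as a perfect match.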

Figure 5. Predicted and experimental structures of: (a) 1ubi; (b) 1ego; (c) 1shg; (d) 1tfi; (e) 1pou; (f) T0042. The experimental structures are shown in blue, while the predicted structures are shown in cyan. Figure generated with MOLMOL (Koradi et al., 1996).

On the other hand, the prediction of T0042 also illustrates some of the shortcomings of the method. The last helix observed in the experimental structure was predicted as an extended state, and an extended state partially persists in the C-terminal region of the final structure, being one of the main errors in the predicted conformation. Another problem is related to the energy discrimination of the fold. As seen in Figure 7, the energy difference between the lowest average energy structure and that of an alternative fold is about 4 kT. Figure 7 also demonstrates that most of the noise in the energy evaluation comes from the sequence-independent terms. Thus, when only the pair potential energy is considered, the energy differences increase to about 10 to 20 kT. It is of interest to note that the restraint energy, by itself, favors an alternative topology by 10 kT. This again illustrates the need to incorporate the restraint function as a soft bias to provide a manifold of topologies, among which the energy function must select the correct one. Thus, the restraint energy bias cannot be too large.
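One simple way to judge how well two folds are discriminated is to compare their energy gap with the size of the energy fluctuations. The Python sketch below does this for three rows transcribed from Table 4 (average energies and standard deviations in kT). Using the root-mean-square of the two standard deviations as the yardstick is an assumption made here for illustration; it is not a selection criterion stated by the authors.

```python
from math import sqrt


def gap_in_sigma(e_low, s_low, e_alt, s_alt):
    """Return (gap, gap / fluctuation) for two competing folds.

    gap is the alternative-fold energy minus the lowest-fold energy (kT);
    the fluctuation is the RMS of the two standard deviations
    (an illustrative choice, see the text above).
    """
    gap = e_alt - e_low
    fluctuation = sqrt((s_low ** 2 + s_alt ** 2) / 2.0)
    return gap, gap / fluctuation


# (<E> lowest, sigma lowest, <E> alternative, sigma alternative),
# values transcribed from Table 4.
table4_rows = {
    "1pou": (-418.0, 3.4, -364.0, 4.0),
    "1hmd": (-459.7, 5.4, -458.0, 7.2),
    "3cti": (-106.9, 7.4, -103.1, 7.8),
}
results = {name: gap_in_sigma(*row) for name, row in table4_rows.items()}
for name, (gap, ratio) in results.items():
    print(f"{name}: gap = {gap:.1f} kT, gap/fluctuation = {ratio:.2f}")
```

A ratio well above one suggests a clearly separated fold, whereas a ratio below one means the gap is smaller than the fluctuations within a single isothermal run.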

Our results compare favorably with those obtained by other groups in the CASP2 contest. In ab initio folding, the best result was obtained by Jones's group, who were able to obtain predictions of 6.2 Å RMSD with respect to the target protein, although their predicted four-helix bundle topology was incorrect. It is interesting that Jones's relatively good results (see results of ab initio folding at URL: http://PredictionCenter.llnl.gov/) were due to the possibility, in his simulation algorithm, of introducing kinks in the secondary structure elements; that is, the secondary structure elements were not considered fixed during the simulations, as was assumed by the other researchers (Dunbrack et al., 1997).

Discussion

Factors affecting the performance of the approach

Given the myriad of problems that any ab initio folding algorithm must face, and the difficulties encountered so far, it is important to ascertain why the approach described here is reasonably successful. In our view, the ability to assemble low to moderate resolution structures of small proteins is related to the following features. First and foremost, MONSSTER does not require a precise description of secondary structure nor a large number of tertiary restraints to assemble the global topology. This is made possible by including generic protein-like features into the model, using sequence-specific terms and adjusting the restraint implementation to the expected accuracy and precision of the predicted restraints. As a result, a substantial number of prediction errors are tolerated. Thus, one can employ existing secondary structure prediction algorithms and can focus the tertiary restraint derivation process on the generation of a relatively small number of tertiary contacts of enhanced reliability.

Turning explicitly to the latter, another important aspect of this approach is the increased signal-to-noise ratio in the predicted contacts obtained by restricting the analysis to residue covariation in the predicted core secondary structural elements. Such contacts are extracted through the use of a two-step procedure that minimizes the appearance of wrong pairings of secondary structure elements and generates a locally self-consistent set of restraints. We stress that prediction of contacts based only on the local threading of all possible pairs of secondary structure elements is unreliable (Hu et al., 1997; A.R.O. & J.S., unpublished results). For example, by using only local threading, almost all β-strands would appear compatible with each other and it would be, in general, impossible to discriminate the β-sheet pattern in α/β and β proteins. There is not enough specificity in the potential (local secondary structure plus burial) to discriminate reliably among the preferred pairings of β-strands. These limitations of local threading are well known and have been reported by us (Hu et al., 1997) and others (Hubbard & Park, 1995). It is the use of a few contacts derived from correlated mutations obtained with high reliability together with the fragment threading/clustering protocol of restraint growing that yields restraints of sufficient quality for successful structure assembly.

Figure 6. Observed and predicted secondary structures for all proteins studied in this work. Calculation of the secondary structure using the three-dimensional coordinates of the protein models was done using the Richards & Kundrot (1988) method. RK refers to the secondary structure assignment of the experimental structure; PHD to the predictions of secondary structure using the PHD method; and PRD is the secondary structure assignment based on the coordinates of the predicted structure.

Table 5. Accuracy of secondary structure prediction

Protein  Q3 PRD   Q3 PHD
3cti     61.176   50.588
1ixa     57.647   70.588
protA    69.412   77.647
1gpt     80.000   70.588
1tfi     77.647   60.000
6pti     69.412   58.824
1fas     77.647   67.059
1shg     70.588   67.059
1cis     64.706   64.706
1ftz     54.118   63.529
1pou     71.765   78.824
1c5a     62.353   85.882
3icb     68.235   82.353
1ubi     75.294   62.353
T0042    42.353   61.176
1lea     44.706   63.529
1ego     57.647   60.000
1hmd     77.647   90.588
1poh     58.824   65.882
1ife     67.059   75.294

AVER     65.412   68.824
SDEV     10.366   9.875

Q3 PRD refers to the Q3 value obtained from the predicted tertiary structure model, while Q3 PHD refers to the Q3 value obtained using the PHD method. Secondary structure assignment of the three-dimensional structures is made according to the Richards & Kundrot (1988) definition.

An interesting observation made here is that it is better not to implement a restraint than to include grossly wrong information. The appearance of false positives is one of the main factors affecting the performance of the approach. Another interesting outcome is that the average overall precision of the predicted restraints is more important than the average overall accuracy in low resolution folding simulations. This effect can be appreciated in Figure 8(A), where a correlation can be observed between precision at δ = 2 and cRMSD for α-helical proteins on the one hand and β-containing proteins on the other. For two of the proteins, 1ixa and 6pti, that do not lie on either of the correlation lines, the discrepancy can be explained by the good accuracy of the contact predictions in both cases (Figure 8(B)). However, no clear correlation with the cRMSD was found in the case of the accuracy (Figure 8(B)). Moreover, it is interesting to note that the dependency of the quality of the predicted fold on the precision of the restraints is higher for α-helices than for β-containing proteins, something probably related to the difficulties of modeling β-strands.
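The distinction between accuracy (exact contacts, δ = 0) and precision (contacts correct within ±δ residues) can be made concrete with a small Python sketch. Here a predicted contact (i, j) is counted as correct at tolerance δ if some native contact (k, l) satisfies |i - k| ≤ δ and |j - l| ≤ δ; applying the tolerance to both partners in this way is our reading of the definition and should be treated as an assumption, as are the function name and the toy contact lists.

```python
def fraction_within(predicted, native, delta=0):
    """Percentage of predicted contacts lying within +/- delta residues
    of some native contact (delta = 0 gives the exact-match accuracy)."""
    if not predicted:
        raise ValueError("no predicted contacts")
    native_pairs = {tuple(sorted(c)) for c in native}
    hits = 0
    for contact in predicted:
        i, j = sorted(contact)
        if any(abs(i - k) <= delta and abs(j - l) <= delta
               for k, l in native_pairs):
            hits += 1
    return 100.0 * hits / len(predicted)


# Toy contact lists (hypothetical residue numbers).
native = [(5, 20), (8, 30), (12, 40)]
predicted = [(5, 20), (9, 31), (14, 44), (50, 60)]
for delta in (0, 2, 4):
    print(delta, fraction_within(predicted, native, delta))
```

In this toy case only one prediction in four is exact, yet three in four fall within four residues of a true contact, which is the kind of "inaccurate but precise" restraint set discussed above.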

Comparison with previous studies

Different authors have now started to investigate the use of multiple sequence information and/or predicted secondary structure using different fold prediction techniques. Perhaps the field that has most strongly pursued this approach is threading. There, as recently reported by different authors, the use of variability patterns in multiple sequence alignments (Defay & Cohen, 1996; Taylor, 1997) or predicted secondary structure (Rice & Eisenberg, 1997; Rost et al., 1997; Russell et al., 1996) provides considerable improvement over the single sequence approach. However, to the best of our knowledge, the only reported application of predicted restraints in threading is due to Russell et al. (1996). In their study, putative contacts derived from biochemical arguments were used as filters in a fold recognition technique that makes use of predicted secondary structure. However, no distance restraint information coming from multiple sequence alignments has been reported in threading. The reasons for this might be that the predicted contact information has been regarded as not being reliable, and certainly it is not without post-processing, and that introduction of distance restraints requires incorporation of double dynamic programming techniques (Taylor, 1997) or Monte Carlo approaches, making the threading procedure rather computationally expensive.

On the other hand, the combined effort of the Cohen and Benner groups has recently produced approaches to fold prediction based on hierarchical building procedures that are similar in spirit to the one presented here (Gerloff et al., 1997a,b). In their case, multiple sequence alignments are built and secondary structure is predicted. Later, an analysis of compensatory variations found in the multiple sequence alignment between pairs of positions is carried out. This allows the authors to derive relationships of closeness in space between the secondary structure elements. In the final stage of their prediction method, folds are searched in the database that meet at least a fraction of the predicted structural features. Bona fide predictions of the C-terminal domain of the β and γ chains of fibrinogen have been published making use of this approach (Gerloff et al., 1997a).

Perhaps the most similar approach to fold prediction to that presented here has been suggested by Aszodi & Taylor (1995). Their model studies using simulated restraints were very similar to our previous studies on the determination of the requisite number of exact restraints necessary for successful fold assembly (Skolnick et al., 1997b). However, later studies by Aszodi & Taylor (1996) have addressed the problem of remote homology modeling rather than fold prediction, and also follow an approach similar in spirit to the one presented here. In their case, multiple sequence alignments are used to define conserved regions that are assumed to form part of the protein core. A fitting function derived from the protein database is used to map restraint distances from these conservation patterns. Aszodi & Taylor (1996) have shown convincingly that this is a promising technique for remote homology modeling.

Limitations of the current methodology

While the results described above are encouraging, there are problems with this approach that must be addressed. First and foremost, the yield of native topologies is only about 10 to 20%, and extraction of correct from incorrect topologies requires a long series of isothermal simulations. The fact that different assembly runs produce a wide dispersion in the final energies of the protein model (even for the same overall global fold) is a signature of sampling problems. While sampling is probably adequate for 50 to 60-residue proteins, sampling problems become acute as the size of the protein (or more precisely, the number of topological elements) increases. Thus, the development of better sampling approaches should permit us to extend the treatment to larger proteins having more complicated topologies. Another difficulty is related to fold selection. Typically, one of two situations arises. Either one must differentiate the native topology from its topological mirror image (a fold where the chirality of the secondary structural elements is the same, but the chirality of the turns is reversed; Pastore et al., 1991), or there are a handful of distinct folds, some having a subset of their structures in common. Thus, the resulting energy differences between the different low energy topologies are small and on the order of the standard deviation of the mean energy per fold obtained from independent runs. We hope that the specificity for the native topology could be accentuated by the development of better energy functions.

Table 6. Multiple sequence alignment used in the contact prediction of target T0042

NR  ID         %IDE  ACCNUM  PROTEIN SEQUENCE
0   NK-lys     1.00  -       GYFCESCRKIIQKLEDMVGPQPNEDTVTQAASQVCDKLKILRGLCKKIMRSFLRRISWDILTGKKPQAICVDIKICKE
1   nkg5human  0.33  P22749  GRDYRTCLTIVQKLKKMVD.KPTQRSVSNAATRVCRTGRswRDVCRNFMRRYQSRVIQGLVAGETAQQICEDLRLC
2   yog2caeel  0.33  P34611  GQFTEPSGVAVNGQGDIVVADTNNHRI.....QVFDkfKFQFGECGKRDgqFLRKFGANILQ..HPRGVCVDSK
3   pp1bdrome  0.33  P48462  GDFDLNVDSLIQRLLEMRSCRTGK........QVQMTEAEVRGLCLKSREIFLQQPI..LLELEAPLIICGDIH
4   pspbrat    0.28  P22355  LCQECEDIVHLLTKMTKEDAFQDTIRKFLEQECdpLKLLVPRCRQVLDVYLPLVIDYFQGQIKPKAICSHVGLC
5   xylathene  0.26  P45687  FDAKVRRASYKVEDLFIGHIAGMDTFALGFKVAYKlgVLDKFIEEKYRSFREGIGRDIVEGKVDFEKLEEYIIDKE
6   fbn1bovin  0.26  P98133  GTPCELCPPVNTSEYKILCprPNPITVILEDIDECQELPGlgGKCINTFGSFQCRCPTGYYL.NEDTRVCDDVNECET
7   fbn1human  0.26  P35555  GTPCEMCPAVNTSEYKILCprPNPITVILEDIDECQELPGlgGKCINTFGSFQCRCPTGYYL.NEDTRVCDDVNECET
8   vnsminsv   0.26  Q01268  FCDSPRADLDKSCMIIPINRAIRAKSQAFIEAC.KLIIPKGNSEKQIRRQLAELSANLEKSVEEEENVTDNKI
9   rpb1eupoc  0.25  P28364  CSTCQGDSKECPGHFGHIELAQPVFHIgdLVKKILKCVCFNCNKLlySALKRVKDPKLKLNKVYKVCKDIKVCGK
10  pu92scico  0.26  P22312  KECQKNTENLKETIEQLKKELAEAQKALEKCKKEL...ADCKKENAKLLNKIecQLDECKKKLNICNNELI
11  glneecoli  0.24  P30870  GYFEEDDRKQVLTLIADFRKELDKRTIGPRGRQVLDHlhLLSDVCAREDAAVLSRItyLELLSEFPAALKHLISLCAA
12  ynh4caeel  0.24  P32742  RYVCSSHDVTIHGLAAMLRDRYPEYDVPQRFPGIQDDLQPVRFSSKK.....LQDLGFTFRYKTLEDMFDAAIRTCQE
13  imdhbacsu  0.24  P21879  RYFQEENKKFVP..EGIEGRTPYKGPVEETVYQLVGGLRSGMGYCGSKDLRALrrMTGAGLRESHPHDVQITVHRN
14  pspbpig    0.24  P15782  FCWLCRTLIKRIQAVVP....KGVLLKAVAQVCHVVPlvGGICQCLAERYIVICLNMLLDRTLPQLVCGLVLR
15  dfrahorvu  0.23  P51106  RYICSSHDATIHGLARMLQDRFPEYDIPQKFAGVDDNLQPIHFSSKKLlhGFSFRYTTEDMFD.AAIHTCRDKGL

NR is the sequence number in the alignment. The number 0 stands for the target sequence. ID is the identification number according to the EMBL/SWISSPROT database (except the target sequence). %IDE is the sequence identity between the corresponding sequence and the target sequence, given as a fraction. ACCNUM corresponds to the entry number in the EMBL/SWISSPROT database (except the target sequence). PROTEIN SEQUENCE is the sequence alignment used for contact prediction. Lower case letters indicate that in that position an insertion (not shown) is found in the corresponding sequence. The alignment was created by scanning the SWISSPROT database with FASTA, filtering the sequences with MAXHOM, and then finally selecting the sequences by hand.

Table 7. Assignment of secondary structure for T0042 from the CASP2 meeting

SEQUENCE                  GYFCESCRKIIQKLEDMVGPQPNEDTVTQAASQVCDKLKILRGLCKKIMRSFLRRISWDILTGKKPQAICVDIKICKE
OBSEC                     H(15) H(14) H(10) H(5) H(7)
PDSEC (PHD)               H(14) H(11) H(19) HHEHE
PDSEC (PHD + LINKER)      H(13) H(8) H(18)
Assigned local structure  555552222222222222111333332222222255555333322222222222222222255533355111111555

The sequence of the predicted protein is shown (SEQUENCE), together with the observed secondary structure in the experimental structure (OBSEC); the predicted secondary structure using the PHD method (PDSEC (PHD)); the predicted secondary structure when the PHD and LINKER predictions are combined (PDSEC (PHD + LINKER)); and the local secondary structure assigned in the simulations (see Methods for bin description). For the three secondary structure rows, segments are listed in sequence order, H(n) denoting a run of n helical residues; their alignment against the residue ruler (10 to 70) is not reproduced here.

Table 8. Predicted seeds for target T0042 using correlated mutation analysis

Residue number A  Residue number B  Residue name A  Residue name B  Correlation coefficient
34                61                V               L               0.662
6                 56                S               I               0.588
7                 70                C               C               0.535

Three contacts were predicted, each one of them being an entry in the Table. Residues A and B refer to the first and second partner of the predicted contact, respectively. The correlation coefficient of the mutational behavior of the corresponding positions in the multiple sequence alignment is also shown (see Methods for details).

A major effort is required to devise better methods of tertiary restraint derivation so that false positives in contact map prediction are minimized. If restraints could be predicted with higher reliability, then a tighter restraint function could be used that would eliminate some of the misfolded states currently encountered. For example, to account for the possibility of wrong restraints between non-interacting β-strands in β and α/β proteins, the restraint potential has to be rather permissive, resulting in a higher population of competing misfolded states. This work suggests that the accuracy of the current methods of contact prediction is already more or less adequate, but improvement in precision is still required.

Finally, secondary restraint derivation also needs improvement. While the PHD and LINKER algorithms have been combined in an ad hoc manner, a more consistent protocol would be desirable. Overall, this study indicates that to predict low to moderate resolution structures, the prediction of the exact secondary structure element boundaries is not an important factor. However, missing a secondary structure element can have a strong negative impact on the results, particularly if the element belongs to the protein core. In this regard, our studies are in agreement with the recent findings of Dandekar & Argos (1994, 1996) and Simons et al. (1997).

Table 9. List of contacts used in the structure prediction of T0042

Res. A (T0042)[a]  Res. B (T0042)[b]  Template structure[c]  Res. A (template)[d]  Res. B (template)[e]
7                  55                 1mba                   89                    137
7                  56                 1mba                   89                    138
7                  59                 1mba                   89                    141
10                 56                 1mba                   92                    138
10                 60                 1mba                   92                    142
11                 59                 1mba                   93                    141
11                 60                 1mba                   93                    142
14                 60                 1mba                   96                    142
15                 60                 1mba                   97                    142
30                 34                 2utg                   22                    26
30                 43                 2utg                   22                    35
30                 47                 2utg                   22                    39
33                 38                 2utg                   25                    30
33                 43                 2utg                   25                    35
33                 47                 2utg                   25                    39
34                 30                 2utg                   26                    22
34                 43                 2utg                   26                    35
35                 31                 2utg                   27                    23
38                 43                 2utg                   30                    35
34                 61                 SEED                   -                     -
6                  56                 SEED                   -                     -
4                  76                 Disulfide bridge       -                     -
35                 45                 Disulfide bridge       -                     -

[a] First residue in the predicted contact in the target sequence of unknown structure.
[b] Second residue in the predicted contact in the target sequence of unknown structure.
[c] Template structure from which the predicted contact is extracted. The protein name corresponds to the PDB entry. The last four contacts are either from the correlated mutation analysis or from the known disulfide bridge partner of the structure.
[d] Residue number in the template structure of the first partner of the predicted contact.
[e] Residue number in the template structure of the second partner of the predicted contact.

Figure 7. Plots of energy versus coordinate RMSD for the different folds obtained for T0042. (A) Total average energy of the isothermal run versus final coordinate RMSD of the fold. (B) Pairwise energy (Elr) versus cRMSD. (C) Restraint energy (Eres) versus final coordinate RMSD of the fold. (D) Burial energy (Ebur) versus cRMSD.

Figure 8. (A) Precision versus cRMSD for the set of proteins used in this work. The cRMSD values shown in bold in Table 3 are used in this plot. The δ = 2 level was used (i.e. percentage of correctly predicted contacts allowing for a ±2 residue error in prediction with respect to a correct contact), although similar results are obtained for δ = 1 and δ = 3. (B) The same plot, but using accuracy criteria (i.e. percentage of correctly predicted contacts without any residue error with respect to the correct contact, so that δ = 0).

Conclusions

This paper has addressed the feasibility of deriving restraints from multiple sequence alignments for use in restraint assisted simulations designed to predict the native conformation of small proteins. Initial application has been made to a set of 20 different proteins, representing all secondary structural classes and having a wide variety of topologies. From the experience gained so far, these conclusions can be drawn.

(1) At the level of secondary structure elements, the accuracy of existing secondary structure prediction algorithms appears to be acceptable for successful fold assembly, provided that some tertiary restraints are included in the assembly algorithm. Problems encountered with current secondary structure prediction algorithms arise either from entirely missing a secondary structure element or mistaking the secondary structural class (helix or β) of some elements. Depending on the position of the missed element in the global fold, this might or might not prohibit fold assembly. In some cases, when the missed elements lie at the edge of the fold, the ability to assemble the native topology is not affected. In fact, sometimes such missing elements are induced by tertiary interactions. If, however, the element missed corresponds to a crucial core element (e.g. a helix between two β-strands), it can exert a strong influence on the quality of the final predicted structure and, in the worst case, could result in a grossly incorrect predicted structure.

(2) Low resolution models of small proteins can be assembled from rather inaccurate predictions of a subset of the total number of tertiary side-chain contacts. These predicted side-chain contacts need not span the entire structure, but can be strongly clustered. On average, the required level of accuracy in contact map prediction is on the order of 25%, provided that the predictions are reasonably precise, i.e. 95% lie within ±4 residues of correct contacts. About 25% of the total number of side-chain contacts need to be identified. This study strongly suggests that precision is a more important factor than accuracy in determining the likelihood of successful fold assembly. Although additional investigation is necessary to assess its generality, our results suggest that the required restraints can be reliably derived from multiple sequence alignments using a combination of correlated mutations followed by structural filtration/inverse folding (additional details will be given in a forthcoming publication).

(3) The strategy presented here yields predictions for all fold types whose RMSD from native ranges from 3.0 to 6.5 Å, depending upon fold complexity. Such structures are at the level of accuracy that can be obtained when threading techniques are applied to match a sequence to a structure whose sequence homology lies in or below the twilight zone of sequence identity. In the folding simulations, the accuracy and yield of correctly assembled structures is different for the different protein classes. In general, all-α proteins are predicted with higher accuracy than α/β proteins, and these, in turn, have better accuracy than all-β proteins.

(4) Because of errors in the tertiary restraints, the resulting structures exhibit shifts in registration and distorted mutual orientations of pairs of secondary structural elements. Such structural distortions also reduce the energy gap between the putative native conformation and alternative folds, as compared to our previous studies using "exact" restraints. Thus, the present protocol allows for the prediction of a small number of possible native conformations. However, selection of the specific fold based on the force field energy is more uncertain as a result of problems with the current energy function.

This study successfully demonstrates that, for a set of small proteins, the use of restraints derived from multiple sequence alignments incorporated into a tertiary structure prediction algorithm allows for assembly of native-like structures. The approach has been shown to be capable of assembling low resolution tertiary structures of greater complexity than was previously possible. However, the difficulties encountered in the assembly of these topologies, even when low resolution restraints are employed, paint a cautionary picture for the likelihood of ab initio assembly of complex folds without restraints. Given contemporary computer resources, sampling algorithms and existing force fields, the folding of more complex topologies is likely to be problematic. Probably, over the short term, the only way to progress is to combine insights gained from restraint-free folding studies on simple folds with strategies that reduce the conformational search space.

Acknowledgments

This work is supported by grant GM-37408 from the National Institutes of Health. A.K. also acknowledges support from the University of Warsaw (grant BST-34/97) and is an International Scholar of the Howard Hughes Medical Institute. A.R.O. also acknowledges support from the Spanish Ministry of Education, as well as access to the computational facilities of EMBL during this work. A.R.O. also thanks Dr Wei-Ping Hu for sharing some of his source codes at the beginning of this project, as well as Drs Li Zhang, Leszek Rychlewski and Adam Godzik for useful discussions. Finally, we thank both reviewers for their careful reading of the manuscript and useful suggestions that have significantly improved the quality of the text.

References

Aszodi, A. & Taylor, W. R. (1996). Homology modelling by distance geometry. Folding Design, 1, 325–334.

Aszodi, A., Gradwell, M. J. & Taylor, W. R. (1995). Global fold determination from a small number of distance restraints. J. Mol. Biol. 251, 308–326.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542.

Chothia, C. & Finkelstein, A. (1990). The classification and origins of protein folding patterns. Annu. Rev. Biochem. 59, 1007–1039.

Cohen, F. E. & Sternberg, M. J. E. (1980). On the prediction of protein structure: the significance of root-mean-square deviation. J. Mol. Biol. 138, 321–333.

Dandekar, T. & Argos, P. (1994). Folding the main-chain of small proteins with the genetic algorithm. J. Mol. Biol. 236, 844–861.

Dandekar, T. & Argos, P. (1996). Identifying the tertiary fold of small proteins with different topologies from sequence and secondary structure using the genetic algorithm and extended criteria specific for strand regions. J. Mol. Biol. 256, 645–660.

Defay, T. R. & Cohen, F. E. (1996). Multiple sequence information for threading algorithms. J. Mol. Biol. 262, 314–323.

Dunbrack, R. L., Gerloff, D. L., Bower, M., Chen, X., Lichtarge, O. & Cohen, F. E. (1997). Meeting review: the second meeting on the critical assessment of techniques for protein structure prediction (CASP2), Asilomar, California, December 13–16, 1996. Folding Design, 2, R27–R42.

Friesner, R. A. & Gunn, J. R. (1996). Computational studies of protein folding. Annu. Rev. Biophys. Biomol. Struct. 25, 315–342.

Gerloff, D. L., Cohen, F. E. & Benner, S. A. (1997a). A predicted consensus structure for the C-terminus of the beta and gamma chains of fibrinogen. Proteins: Struct. Funct. Genet. 27, 279–289.

Gerloff, D. L., Cohen, F. E., Korostensky, C., Turcotte, M., Gonnet, G. H. & Benner, S. A. (1997b). A predicted consensus structure for the N-terminal fragment of the Heat Shock Protein HSP90 family. Proteins: Struct. Funct. Genet. 27, 450–458.

Godzik, A., Skolnick, J. & Kolinski, A. (1992). A topology fingerprint approach to the inverse protein folding problem. J. Mol. Biol. 227, 227–238.

Godzik, A., Kolinski, A. & Skolnick, J. (1994). Lattice representation of globular proteins: how good are they? J. Comput. Chem. 14, 1194–1202.

Göbel, U., Sander, C., Schneider, R. & Valencia, A. (1994). Correlated mutations and residue contacts in proteins. Proteins: Struct. Funct. Genet. 18, 309–317.

Gouda, H., Torigoe, H., Saito, A., Sato, M., Arata, Y. & Shimada, I. (1992). Three-dimensional solution structure of the B-domain of staphylococcal Protein A: comparisons of the solution and crystal structures. Biochemistry, 31, 9665–9672.

Gunn, G. J. R., Monge, A. & Friesner, R. A. (1994). Hierarchical algorithm for computer modeling of protein tertiary structure: folding of myoglobin to 6.2 Å resolution. J. Phys. Chem. 98, 702–711.

Hu, W.-P., Godzik, A. & Skolnick, J. (1997). Sequence-structure specificity: how does an inverse folding approach work? Protein Eng. 10, 317–331.

Hubbard, T. J. & Park, J. (1995). Fold recognition and ab initio structure predictions using hidden Markov models and beta-strand pair potentials. Proteins: Struct. Funct. Genet. 23, 398–402.

Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.

Kolinski, A. & Skolnick, J. (1994a). Monte Carlo simulations of protein folding: I. Lattice model and interaction scheme. Proteins: Struct. Funct. Genet. 18, 338–352.

Kolinski, A. & Skolnick, J. (1994b). Monte Carlo simulations of protein folding: II. Application to protein A, ROP, and crambin. Proteins: Struct. Funct. Genet. 18, 353–366.

Kolinski, A. & Skolnick, J. (1997). Determinants of secondary structure of polypeptide chains: interplay between short range and burial interactions. J. Chem. Phys. 107, 953–964.

Kolinski, A., Galazka, W. & Skolnick, J. (1995a). Computer design of idealized β-motifs. J. Chem. Phys. 103, 10286–10297.

Kolinski, A., Milik, M., Rycombel, J. & Skolnick, J. (1995b). A reduced model of short range interactions in polypeptide chains. J. Chem. Phys. 103, 4312–4323.

Kolinski, A., Skolnick, J., Godzik, A. & Hu, W.-P. (1997). A method for the prediction of surface "U"-turns and transglobular connections in small proteins. Proteins: Struct. Funct. Genet. 27, 290–308.

Koradi, R., Billeter, M. & Wuethrich, K. (1996). MOLMOL: a program for display and analysis of macromolecular structures. J. Mol. Graph. 14, 51–55.

Levitt, M. & Greer, J. (1977). Automatic identification of secondary structure in globular proteins. J. Mol. Biol. 114, 181–239.

McLachlan, A. D. (1971). Test for comparing related amino acid sequences. J. Mol. Biol. 61, 409–424.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092.

Mumenthaler, C. & Braun, W. (1995). Predicting the helix packing of globular proteins by self-correcting distance geometry. Protein Sci. 4, 863–871.

Murzin, A., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of protein database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.

Olmea, O. & Valencia, A. (1997). Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Folding Design, 2, S25–S32.

O'Shea, E. K., Klemm, J. D., Kim, P. S. & Alber, T. (1991). X-ray structure of the GCN4 leucine zipper, a two-stranded, parallel coiled coil. Science, 254, 539–544.

Park, B. H. & Levitt, M. (1995). The complexity and accuracy of discrete state models of protein structure. J. Mol. Biol. 249, 493–507.

Pastore, A., Atkinson, R. A., Saudek, V. & Williams, R. J. P. (1991). Topological mirror images in protein structure computation: an underestimated problem. Proteins: Struct. Funct. Genet. 10, 22–32.

Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

Rice, D. W. & Eisenberg, D. (1997). A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J. Mol. Biol. 267, 1026–1038.

Richards, F. M. & Kundrot, C. E. (1988). Identification of structural motifs from protein coordinate data: secondary structure and first level super-secondary structure. Proteins: Struct. Funct. Genet. 3, 71–84.

Ripoll, D. R. & Scheraga, H. A. (1990). On the multiple minima problem in the conformational analysis of polypeptides. IV. Application of electrostatically driven Monte Carlo method to the 20-residue membrane-bound portion of melittin. Biopolymers, 30, 165–176.

Rost, B. & Sander, C. (1993). Prediction of secondary structure at better than 70% accuracy. J. Mol. Biol. 232, 584–599.

Rost, B. & Sander, C. (1996a). Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct. 25, 113–136.

Rost, B. & Sander, C. (1996b). Progress of 1D protein structure prediction at last. Proteins: Struct. Funct. Genet. 23, 295–300.

Rost, B., Schneider, R. & Sander, C. (1997). Protein fold recognition by prediction-based threading. J. Mol. Biol. 270, 471–480.

Russell, R. B., Copley, R. C. & Barton, G. J. (1996). Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol. 259, 349–365.

Sali, A. & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815.

Sander, C. & Schneider, R. (1991). Database of homology derived protein structures and the structural meaning of sequence alignment. Proteins: Struct. Funct. Genet. 9, 56–68.

Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209–225.

Skolnick, J., Kolinski, A., Brooks, C., III, Godzik, A. & Rey, A. (1993). A method for prediction of protein structure from sequence. Curr. Biol. 3, 414–423.

Skolnick, J., Jaroszewski, L., Kolinski, A. & Godzik, A. (1997a). Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Protein Sci. 6, 676–688.

Skolnick, J., Kolinski, A. & Ortiz, A. R. (1997b). MONSSTER: A method for folding globular proteins with a small number of distance restraints. J. Mol. Biol. 265, 217–241.

Smith-Brown, M. J., Kominos, D. & Levy, R. M. (1993). Global folding of proteins using a limited number of distance restraints. Protein Eng. 6, 605–614.

Sun, S. (1993). Reduced representation model of protein structure prediction: statistical potential and genetic algorithms. Protein Sci. 2, 762–785.

Taylor, W. R. (1997). Multiple sequence threading: an analysis of alignment quality and stability. J. Mol. Biol. 269, 902–943.

Vieth, M., Kolinski, A., Brooks, C. L., III & Skolnick, J. (1994). Prediction of the folding pathways and structure of the GCN4 "leucine zipper". J. Mol. Biol. 237, 361–367.

Wallqvist, A. & Ullner, M. (1994). A simplified amino acid potential for use in structure prediction of proteins. Proteins: Struct. Funct. Genet. 18, 267–289.

Edited by F. Cohen

(Received 19 August 1997; received in revised form 5 December 1997; accepted 5 December 1997)

