Folding of small helical proteins assisted by small-angle X-ray scattering profiles

Structure, Vol. 13, 1587–1597, November, 2005, ©2005 Elsevier Ltd All rights reserved. DOI 10.1016/j.str.2005.07.023


Yinghao Wu,1 Xia Tian,2 Mingyang Lu,2 Mingzhi Chen,3 Qinghua Wang,2 and Jianpeng Ma1,2,3,*

1 Department of Bioengineering, Rice University, Houston, Texas 77005
2 Verna and Marrs McLean Department of Biochemistry and Molecular Biology
3 Graduate Program of Structural and Computational Biology and Molecular Biophysics
Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030

Summary

This paper reports a computational method for folding small helical proteins. The goal was to determine the overall topology of proteins given secondary structure assignment on sequence. In doing so, a Monte Carlo protocol, which combines coarse-grained normal modes and a Hamiltonian at a different scale, was developed to enhance sampling. In addition to the knowledge-based potential functions, a small-angle X-ray scattering (SAXS) profile was also used as a weak constraint for guiding the folding. The algorithm can deliver structural models with overall correct topology, which makes them similar to those of 5–6 Å cryo-EM density maps. The success could contribute to making the SAXS technique a fast and inexpensive solution-phase experimental method for determining the overall topology of small, soluble, but noncrystallizable, helical proteins.

Introduction

One of the major goals of protein folding is to obtain the overall three-dimensional structure from a one-dimensional primary sequence of amino acids. Through decades of extensive studies, substantial progress has been made toward understanding the folding mechanisms and the prediction of three-dimensional structures (Blundell et al., 1987; Boczko and Brooks, 1995; Bowie et al., 1991; Cardenas and Elber, 2003; Dill and Chan, 1997; Dinner et al., 2000; Duan and Kollman, 1998; Elofsson et al., 1996; Guo and Thirumalai, 1995; Jones et al., 1992; Liwo et al., 2005; Onuchic et al., 1997; Pande et al., 2003; Pedersen and Moult, 1997; Sali et al., 1994; Simons et al., 1997; Srinivasan et al., 2004; Wedemeyer et al., 2000; Wolynes et al., 1995; Zhang and Skolnick, 2004; Zhou and Karplus, 1999). At the current stage, however, reliably predicting the overall topology is still a challenging problem for most proteins.

In this paper, we report on the effort of determining the tertiary topology of small helical proteins or domains by using several constraints. The proteins were modeled based on Cα traces and the secondary structures as standard helices. The starting conformations were randomly chosen and had extended tertiary structures. Knowledge-based potential functions and a small-angle X-ray scattering (SAXS) profile were used to guide the simulations. To efficiently sample conformational space, we also developed a novel, to our knowledge, Monte Carlo (MC) protocol, which was based on observations made from previous studies that found that the large-scale conformation changes are dominated by a small number of low-frequency normal modes (Ma, 2004, 2005). A distinct feature in this protocol is the separation of scale between the Hamiltonian for calculating normal modes and the Hamiltonian for updating the Metropolis criterion in the MC simulation. A substantially improved sampling efficiency was observed in our study.

*Correspondence: [email protected]

Multiple folding trajectories that usually resulted in different final structures with fairly diverse topologies were simulated. A major issue, therefore, was to distinguish the most native-like topology among them. To do so, we adopted an approach developed in our recent study (Wu et al., 2005), which was originally designed to determine protein topology from the skeletons of secondary elements derived from low- to intermediate-resolution density maps. The approach was based on an essential observation that the average energy of an ensemble of structures with slightly different configurations around the original structure was a much more robust parameter for marking native topology than the energy of any individual structure in the ensemble. The underlying hypothesis is that native topology was chosen by evolution as the one that can accommodate the largest structural variations, not the one rigidly trapped in a deep, but narrow, conformational energy well. In this study, for each final folded structure, we separately generated an ensemble of structures around it and evaluated the average energy as the criterion for identifying the native topology. Our results showed that, in two-thirds of the testing cases with simulated scattering profiles, the native topology was successfully identified. Tests on real experimental scattering data also yielded a satisfactory outcome.

Since we used Cα-based models, an approach that could not deliver atomic resolution structure, our focus was on getting the overall topology correct. It was shown that the approach was particularly effective for small, globular helical proteins or protein domains. Such a success could help increase the effective resolution of the SAXS method. This is particularly important since SAXS is a relatively low-cost and fast method compared with other solution-phase methods such as NMR.

Results

Test of 12 All-Helical Globular Proteins

The simulation protocol was first tested on a total of 12 all-helical globular proteins that can be roughly split into four groups, which contain three, four, five, and six helices, respectively. They have an architecture of an orthogonal bundle or an up-down bundle (Orengo et al., 1997). The SAXS profiles for all of the proteins were computationally calculated. The native topologies of 8 out of 12 proteins were successfully identified with reasonable root mean square deviations (rmsd) from known crystal structures (Table 1). The errors in the final folded structures inevitably increased with the size of the proteins, but such magnitudes of rmsd were within the overall range of precision of the current structural prediction by other methods (Moult et al., 2003).

Table 1. Results for 12 Testing Proteins

| Helix Number | PDB Code | Protein Size (Number of Residues) | Protein Architecture | Lowest Rmsd Topology | Rmsd with Loop (Å) | Rmsd without Loop (Å) |
|---|---|---|---|---|---|---|
| Three helices | 1mbg | 40 | orthogonal | 1st | 3.98 | 4.10 |
| | 1prb | 45 | up-down | 1st | 3.76 | 3.39 |
| | 1erc | 40 | up-down | 5th | 3.78 | 3.07 |
| Four helices | 1eo0 | 76 | up-down | 1st | 5.77 | 5.57 |
| | 1i2t | 61 | orthogonal | 1st | 5.52 | 5.24 |
| | 1eoq | 64 | orthogonal | 3rd | 5.53 | 5.12 |
| Five helices | 1nkl | 78 | orthogonal | 1st | 5.47 | 5.37 |
| | 2cro | 65 | orthogonal | 1st | 6.85 | 6.02 |
| | 1ow5 | 60 | orthogonal | 7th | 5.10 | 4.45 |
| Six helices | 1a1w | 91 | orthogonal | 1st | 7.02 | 6.19 |
| | 1ngr | 74 | orthogonal | 1st | 5.35 | 5.38 |
| | 1ich | 87 | orthogonal | 4th | 6.09 | 5.82 |

The number in the column "Lowest Rmsd Topology" describes the energetics ranking of the topology of the lowest rmsd in relation to the native topology. Therefore, "1st" means the energetics-based screening successfully identified the most native-like topology as the lowest one in energy. The values of rmsd calculated with and without loops are both listed.

In Figure 1, one successful case from each group, together with the corresponding crystal structure, was displayed. Clearly, for all groups, the overall topology of the proteins was correctly reached. Note that for clarity the loops were not shown. All major helices were in the vicinity of their correct positions. Shown in Figure 2 is one of the folding trajectories of 1nkl from a conformation near the beginning of simulation (Figure 2A) to the end of simulation (Figure 2F). It is apparent that the overall sampling of the conformational space by the elastic normal mode-based MC protocol is efficient. Each trajectory for the largest protein we tested took about 48 hr on a single processor PC machine. Moreover, the speed of computing also depends on the total number of blocks in BNM analysis that was related to the length of loops. However, our protocol was very fast for folding the small- to medium-sized helical proteins.

Figure 1. Comparison of the Most Native-like Topology with the Corresponding Native Topology
(A–D) For clarity, only the helices are drawn as cylinders. The dark color indicates native structures, and the light color indicates simulated structures. The values of rmsd without loops are 4.1 Å for 1mbg, 5.1 Å for 1eo0, 5.4 Å for 1nkl, and 6.2 Å for 1a1w.

Figure 2. The Snapshots Taken from One of the Folding Trajectories of Protein 1nkl
(A–F) (A) shows a conformation near the beginning of simulation, and (F) shows the final folded conformation that has an rmsd of 5.5 Å to the native structure.

Four typical final folded structures were shown for the three-helical protein 1mbg (Figure 3A) and for the five-helical protein 1nkl (Figure 3B). For each protein, the first folded structure had the most native-like topology, which was successfully identified in our final screening. The results clearly demonstrated that the final folded structures from different folding trajectories had quite different topologies, and that some were very far away from the native topology. This fact necessitated the application of an additional energetics-based screening method in our protocol (see Experimental Procedures). The convergence of energy values during MC simulations was quick for all trajectories (Figure 4A), even though the rmsds of the final folded structures were fairly different (Figure 4B). Finally, as we had noted previously (Wu et al., 2005), the most native-like topology of 1nkl had the lowest average total energy in the perturbed ensemble of structures, the smallest standard deviation (SD) for the energy distribution, and the largest number of perturbed structures (Table 2). These results further support the finding that the native topology was chosen by evolution as the one that can accommodate the largest structural variations, not the one rigidly trapped in a deep, but narrow, conformational energy well.
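The final screening step can be illustrated with the numbers reported for 1nkl in Table 2. The sketch below only performs the last ranking by ensemble-average total energy; it does not generate the perturbed ensembles. The energies are entered as negative values on the assumption that the leading "2" in the extracted table is a garbled minus sign, and the names `table2` and `rank_by_average_energy` are illustrative, not taken from the paper.

```python
# Rows: (trajectory index, average rmsd in angstroms,
#        average total energy in RT, SD of total energy in RT, ensemble size).
# Assumption: energies are negative (minus signs restored from the extracted table).
table2 = [
    (1, 10.7, -378.21, 50.74, 253),
    (2, 6.6, -399.08, 45.99, 353),
    (3, 10.8, -378.59, 61.71, 322),
    (4, 8.9, -396.36, 44.87, 373),
    (5, 10.5, -378.70, 50.91, 472),
    (6, 7.2, -399.85, 42.71, 445),
    (7, 5.4, -418.47, 37.75, 563),
    (8, 10.5, -403.07, 38.95, 510),
    (9, 11.7, -371.95, 45.91, 416),
    (10, 10.6, -388.36, 48.5, 412),
]


def rank_by_average_energy(rows):
    """Sort trajectories from lowest to highest ensemble-average total energy."""
    return sorted(rows, key=lambda row: row[2])


ranking = rank_by_average_energy(table2)
best = ranking[0]
print("lowest-energy trajectory:", best[0], "average rmsd (A):", best[1])
```

Under these assumptions the lowest-energy entry is also the one with the lowest average rmsd, the smallest SD, and the largest ensemble, which is the pattern the text describes.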

Analysis of Some Failed Cases

To better understand its overall performance, here we present two typical cases in which our simulation protocol failed: myoglobin (PDB code 1bvc) and cytochrome c (PDB code 1akk).

The structure of myoglobin has 8 helices (6 major ones) and 153 residues. Out of the ten trajectories, none of them had a native-like topology. The closest had an rmsd of 10.52 Å. If only the six major helices were compared, the rmsd was 8.0 Å. A detailed comparison of the six major helices in the lowest-rmsd model with those of the native structure is shown in Figure 5A. The dark color indicates the native helices, and the light color indicates our modeled helices. Obviously, the lowest-rmsd model has a very similar shape compared with the native structure. The largest discrepancy from the native structure was in the last two major helices, which were swapped in the modeled structure. Note that since the native topology was not in the final ten folded structures by visual inspection, we didn't apply energetics-based screening.

Cytochrome c is a smaller four-helical protein (Figure 5B). The folding simulation also failed to produce the native-like topology out of the multiple trajectories. One possible reason for this failure was that the folding of the native structure of cytochrome c heavily involves its cofactor heme. Since we didn't include heme in our simulation, this could contribute to the substantial error in the final folded structures. Also, there is a very long loop from residues 17 to 48 that extensively interacts with the heme group. This loop also contributed to the frustration of the folding simulation.

Cases with Mismatch between Predicted and Native Secondary Structures

In all applications discussed so far, we utilized the native secondary structure assignments taken from the crystal structures. For the purpose of practical applications, we also tested cases with mismatches between the predicted and native assignments. By using the consensus secondary structure prediction server (http://ibivu.cs.vu.nl/programs/sympredwww/), 3 out of 12 proteins had secondary structure mismatches. They were 1erc, 1prb, and 1ow5. For 1erc and 1prb, although they are all-helical proteins, the secondary structure prediction indicated that they had strands. In these cases, our folding simulations totally failed. In the case of 1ow5, which is a five-helical bundle, secondary structure prediction indicated four helical regions (two small helical regions were predicted as a single long helix). If we constructed the initial protein model based on the predicted secondary structures with the scattering profile still computed from the native crystal structures, the folding simulation successfully identified the most native-like topology with the lowest rmsd of 6.86 Å from the native structure with loop regions, and 6.78 Å for the structure without loop regions (Figure 6A). (Note that in Table 1, when there was no mismatch, this protein has an rmsd of 5.10 Å with loop regions and an rmsd of 4.45 Å without loop regions.) The global arrangement of all three helices is similar to the native structure, while the single long helix, predicted over a sequence region that should have two smaller helices in the native structure, lies at the average position between the two native small helices (Figure 6A, (3)).

Figure 3. Four Final Folded Structures from Different Trajectories for 1mbg and 1nkl
(A and B) For each protein ([A] 1mbg; [B] 1nkl), the first structure has the lowest rmsd value (most native-like). All folded structures have compact conformations and similar radii of gyration, but with quite different topologies.

Another example that can illustrate the effects of errors in secondary structure assignment is a protein of PDB code 3icb (a five-helical orthogonal bundle protein). As in Figure 6B, the consensus prediction for the secondary structure gave a correct assignment, and, consequently, the folded topology was also correct (lowest-rmsd structure without loop regions was 4.93 Å out of the ten folding trajectories [Figure 6B, (1)]). If we used the assignment by the YASPIN method (Lin et al., 2005), in which a small helix (residues 37–41) was merged into a preceding longer helix, the topology was more or less correct (lowest rmsd was 5.85 Å [Figure 6B, (2)]). If, however, we used the assignment by the PSIPRED method (Jones, 1999), which mistakenly predicted the same small helix as a loop, the folded topology became much worse (lowest rmsd became 11.36 Å [Figure 6B, (3)]).

Figure 4. Convergence of MC Simulation for Protein 1nkl, for Energy and for Rmsd
(A and B) (A) indicates convergence of energy, and (B) indicates convergence of rmsd. Although the overall convergences for different trajectories were similar, the final structures differ substantially with regard to rmsd.

Table 2. Detailed Results for the Final Energetics-Based Ranking of Ten Folding Trajectories for 1nkl

| Trajectory Index | Average Rmsd (Å) | Average Total Energy (RT) | SD of Total Energy (RT) | Average Long-Range Energy (RT) | SD of Long-Range Energy (RT) | Number of Randomized Structures in the Ensemble |
|---|---|---|---|---|---|---|
| 1 | 10.7 | -378.21 | 50.74 | -54.42 | 26.93 | 253 |
| 2 | 6.6 | -399.08 | 45.99 | -54.91 | 22.08 | 353 |
| 3 | 10.8 | -378.59 | 61.71 | -42.19 | 33.91 | 322 |
| 4 | 8.9 | -396.36 | 44.87 | -49.77 | 33.35 | 373 |
| 5 | 10.5 | -378.70 | 50.91 | -31.08 | 40.08 | 472 |
| 6 | 7.2 | -399.85 | 42.71 | -53.26 | 26.14 | 445 |
| 7 | 5.4 | -418.47 | 37.75 | -64.92 | 25.59 | 563 |
| 8 | 10.5 | -403.07 | 38.95 | -52.19 | 24.25 | 510 |
| 9 | 11.7 | -371.95 | 45.91 | -34.79 | 35.62 | 416 |
| 10 | 10.6 | -388.36 | 48.5 | -40.16 | 36.13 | 412 |

These results indicate that, in general, if the mismatches between the predicted and native secondary structures are not too severe, our method is still capable of delivering reasonable overall topology. But failures did occur when the mismatches were too large for our method to handle.

Test of System with Experimental Scattering Data

In order to test our simulation protocol by using real experimentally measured SAXS data, we performed the test on a system that had published scattering data available (Mehboob et al., 2003) (the maximal scattering vector s was set to 0.05 Å⁻¹). The N-terminal region of spectrin is a relatively large three-helical bundle protein (149 residues), and the secondary structure prediction gave the same assignment as the native structure. Although the absolute size of this three-helical bundle was even larger than the largest six-helical bundle protein we tested, our method successfully found the most native-like topology (Figure 7) with an rmsd of 6.5 Å.

Figure 5. Examination of Failed Cases
(A) Comparison of the lowest-rmsd structure with the native structure for myoglobin 1bvc. Six major long helices are illustrated one-by-one. Their amino acid sequence ranges are 3–18, 20–36, 59–78, 85–96, 100–119, and 124–149. The last two were swapped in the folded structure, but the overall architecture of the folded structure is not far away from that of the native structure.
(B) Ribbon diagram for cytochrome c. The long loop that extensively interacts with the heme group is highlighted in a darker color.

Figure 6. Comparison of the Lowest-Rmsd Structures with the Native Structure for 1ow5 and 3icb with Mismatches between the Predicted and Native Secondary Structures
The darker color is for the native structure, and the lighter color is for the folded structure.
(A) For 1ow5, a particular helix is highlighted by ribbon representation in each panel (1–4). Note the mismatch that occurs over the sequence region of two small, dark helices highlighted in (3).
(B) For 3icb, results for different secondary structure assignments are shown.

Convergence of Elastic Normal Mode-Based MC Simulation

To demonstrate the convergence of the MC protocol we developed, we compared this protocol with a more traditional rigid-body MC simulation, in which each block was chosen randomly at each time step. After each block was chosen, the chosen block took a random move. If the chosen block contained only one Cα atom, as in the loops, the random move was chosen on its three translational degrees of freedom. If the chosen block contained more than one Cα atom, as in helices, three additional rotational degrees of freedom were changed randomly. For comparison purposes, we used the same initial condition, the same energy potential, and the same weight value; the moving amplitude was also adjusted close to the normal mode-based MC simulation. The convergences of two simulation methods are shown in Figure 8. The normal mode-based MC simulation converged much faster and to a lower energy, or a more compact conformation, than the traditional rigid-body MC simulation. Although the calculation of elastic normal modes in each step took some additional time, the overall performance of normal mode-based MC is significantly better, especially in terms of the energy of the final folded structure. As a direct comparison, ten independent trajectories were run by the rigid-body MC simulation for 1nkl. The final average rmsd to the native structure was more than 13 Å, and the lowest rmsd was 11.2 Å, compared with the average rmsd of 9.3 Å and the lowest rmsd of 5.4 Å obtained with the elastic normal mode-based MC simulation. More importantly, with the traditional rigid-body MC, the lowest-rmsd final structure out of the ten folding trajectories had a noncompact structure with a nonnative topology. These results suggest the better performance of the normal mode-based MC protocol (detailed analysis of performance of the normal mode-based MC will be given in a separate paper).

Discussion

In this paper, we reported a new, to our knowledge, computational protocol for determining tertiary topology of small, globular helical proteins, or protein domains, from sequence. Knowledge-based potential functions and coarse-grained protein models (Cα traces) were used in the folding simulations. A novel, to our knowledge, protocol of MC simulation was developed in which the scale of the Hamiltonian used for the conformational random walk was separated from that used for the Metropolis criterion. The random walk was based on elastic normal modes, and the Metropolis criterion was based on more accurate potential functions. Such a separation of Hamiltonian allows an effective sampling of the conformational space. It is important to point out that the potential functions used for the Metropolis criterion had no restriction and that they can be of any kind. Therefore, the protocol can be generalized to any case. Another distinct feature in this study was the utilization of SAXS data as soft constraints to assist folding. For a set of testing proteins we studied with both simulated and experimentally measured SAXS profiles, a satisfactory success rate was achieved.

Figure 7. Comparison of the Most Native-like Topology with Native Structure for Spectrin
The rmsd is 6.5 Å. This protein was folded by using published experimental SAXS data (Mehboob et al., 2003) as constraints.

We wish to emphasize that the purpose of making random configurational moves during MC simulation along the eigenvectors of elastic normal modes is to achieve the maximal sampling efficiency. Each normal mode is a single, but collective, degree of freedom. Moving along the low-frequency mode has the effect of achieving the largest structural change collectively with the smallest energetic cost. Our fundamental assumption is that, at any given instant, the direction of the conformational movement in the immediate future can be approximated by the normal modes calculated at that instant. Although the conventional concept of normal mode, as a harmonic approximation, does not seem to allow the system to cross the energy barrier along the eigenvector, in this case, the separation of scale of Hamiltonian overcomes this problem. As shown in our study, the structural movement along the coarse-grained elastic normal modes can efficiently cross energy barriers in the landscape defined by the more sophisticated Hamiltonian later used for the Metropolis criterion. It was shown in this study that the method was very effective for folding simulation because the folding process involves large-scale conformational rearrangement that drastically changes the shape of the molecule. Normal modes are good parameters to describe the conformational changes that substantially alter the overall shape of the molecule (Lu and Ma, 2005; Ma, 2004, 2005).

Figure 8. Comparison of the Convergence of Two MC Methods on Protein 1nkl
The dotted line is for traditional rigid-body MC, and the solid line is for elastic normal mode-based MC. It is clear that the normal mode-based MC converges much faster and to a lower energy, i.e., a more compact conformation.

From our study, it seems that the success of folding depends more on the complexity of the structure and less on the absolute size of the structure. The three-helical protein that we successfully tested with experimental scattering data was relatively simple in its topology, but it was large in size compared with other systems we tested. Myoglobin was a good example in which the complexity led to the failure of simulation. However, we wish to point out that, even for myoglobin, in which the native topology was not found, the final folded structure had four major helices correctly positioned. The remaining two helices had their positions swapped; the overall angles of them were still close enough to the native structures. In other words, if only the skeletons of secondary structure elements were concerned, all of the rough positions of helices would have been considered as correctly predicted. The results would be of a similar level of information as to what can be obtained from intermediate-resolution cryo-EM density maps (Jiang et al., 2001; Kong and Ma, 2003; Kong et al., 2004).

In its current form, the protocol we used relies on the knowledge of secondary structure assignment on sequence. In a real application, the secondary structure assignment would have to be taken from computational prediction, which has about 80% accuracy in determining major secondary structures. Our method should even be able to handle cases in which small helices are missing in the sequence-based prediction. Of course, there are other factors that could affect the results of folding. For example, proteins containing large cofactors would probably be harder to fold by the current method than the ones without cofactors.

In conclusion, we believe that, for small, globular helical proteins or protein domains, it is feasible to utilize SAXS data to assist structure prediction to a level of accuracy that is roughly equivalent to, or better than, intermediate-resolution density maps from other structural biology experiments. However, SAXS has the unique advantage of being able to quickly collect data on small, soluble proteins. It also provides a useful alternative technique for fields such as structural genomics, since a SAXS experiment would be much easier to automate than X-ray crystallography. Together with the recent developments of other related computational methods (Costenaro et al., 2005; Davies et al., 2005; Kojima et al., 2004; Petoukhov et al., 2002; Svergun and Koch, 2002; Svergun et al., 2001; Walther et al., 2000; Zheng and Doniach, 2002, 2005) that also utilize SAXS data to model protein structures, we believe that these methods will eventually enable SAXS to become a mainstream experimental technique in the field of structural biology.

Experimental Procedures

Overall Procedure of the Simulation

To fold the proteins, we assumed that the positions of secondary structures (helices in this case) in the sequence are known. The simulations were based on the Cα traces of proteins. Each run started from a random conformation in which the loops were modeled as extended and helices were kept in a standard form, a model similar to the diffusion-collision model (Karplus and Weaver, 1979, 1994). The conformational space was then sampled by a novel, to our knowledge, MC protocol, which, in essence, separates the Hamiltonians between the random walk and Metropolis criterion. In each step of the simulation, a small set of low-frequency normal modes was first computed based on the Cα trace by elastic normal mode analysis (eNMA) (Atilgan et al., 2001) with the block normal mode (BNM) (Durand et al., 1994; Li and Cui, 2002; Tama et al., 2000) scheme for constructing the Hessian matrix (the helical regions were kept as rigid bodies during MC simulation). Then, the structure was changed along a randomly chosen mode by a random displacement. Different from what was used to compute normal modes, the energy function used to evaluate the acceptance included a short-range bonded term, a long-range nonbonded term, a hydrophobicity term, and a constraint derived from the SAXS profile (detailed below). Multiple trajectories were used to generate a set of folded structures, usually with quite different topologies. These structures were then used as candidates from which the topology that is the most similar to native topology (most native-like) was identified by using an effective procedure recently developed by us (Wu et al., 2005).

Protein Model and Initial Condition

Each amino acid was represented by a single Cα atom. Two adjacent Cα atoms were assumed to be connected by a pseudo bond with an equilibrium length of 3.8 Å. The conformation of the Cα trace for a protein of N residues was thus defined by 3N − 6 parameters: N − 1 virtual bonds {l_i}, N − 2 virtual bond angles {θ_i}, and N − 3 dihedral angles {φ_i}. The Cartesian coordinates of the Cα trace of a protein were constructed by a method developed by Levitt and colleagues (Park and Levitt, 1995). The positions of the first three Cα atoms were given by:

$$
\begin{aligned}
x_1 &= 0, \quad y_1 = 0, \quad z_1 = 0 \\
x_2 &= 3.8, \quad y_2 = 0, \quad z_2 = 0 \\
x_3 &= x_2 + 3.8 \cos(\pi - \theta_2) \\
y_3 &= y_2 + 3.8 \sin(\pi - \theta_2) \\
z_3 &= 0.
\end{aligned}
\tag{1}
$$

The coordinates of the ith Cα atom r_i (i ≥ 4) were calculated by:

$$
\mathbf{r}_i = \mathbf{r}_{i-1} + 3.8 \cos(\pi - \theta_{i-1})\,\mathbf{u} + 3.8 \sin(\pi - \theta_{i-1}) \cos\phi_{i-2}\,\mathbf{v} + 3.8 \sin(\pi - \theta_{i-1}) \sin\phi_{i-2}\,\mathbf{w},
\tag{2}
$$

where three orthogonal unit vectors, u, v, and w, were defined as:

$$
\begin{aligned}
\mathbf{u} &= \frac{\mathbf{r}_{i-1} - \mathbf{r}_{i-2}}{|\mathbf{r}_{i-1} - \mathbf{r}_{i-2}|} \\
\mathbf{v} &= \frac{(\mathbf{r}_{i-3} - \mathbf{r}_{i-2}) - [(\mathbf{r}_{i-3} - \mathbf{r}_{i-2}) \cdot \mathbf{u}]\,\mathbf{u}}{|(\mathbf{r}_{i-3} - \mathbf{r}_{i-2}) - [(\mathbf{r}_{i-3} - \mathbf{r}_{i-2}) \cdot \mathbf{u}]\,\mathbf{u}|} \\
\mathbf{w} &= \mathbf{u} \times \mathbf{v}.
\end{aligned}
\tag{3}
$$

Since we assumed that the secondary structures were known, the helical regions were fixed at the ideal angles (θ = 90°, φ = 50°) (Bahar et al., 1997). For residues in the loop regions, the bond angles and dihedrals were set as completely extended (θ = 120°, φ = 180°). By starting our simulations from a maximally extended conformation, we expected to avoid bias.
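The construction in Equations 1–3 can be written as a short routine. The following is a minimal sketch, assuming NumPy and angles given in radians; the function name `build_ca_trace` and the array layout (`theta[k]` holds θ_{k+2}, `phi[k]` holds φ_{k+2}) are illustrative choices, not part of the original method description.

```python
import numpy as np

BOND = 3.8  # pseudo bond length between adjacent C-alpha atoms (angstroms)


def build_ca_trace(theta, phi):
    """Build C-alpha coordinates from virtual bond angles and dihedrals.

    theta: N-2 bond angles in radians, theta[k] is theta_{k+2} of Equations 1-2.
    phi:   N-3 dihedral angles in radians, phi[k] is phi_{k+2} of Equation 2.
    Returns an (N, 3) array. Angles of exactly 180 degrees are not handled,
    because v in Equation 3 is then undefined.
    """
    theta = np.asarray(theta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    n = len(theta) + 2
    if len(phi) != n - 3:
        raise ValueError("expected N-2 bond angles and N-3 dihedrals")

    r = np.zeros((n, 3))
    # Equation 1: first three atoms placed in the xy plane.
    r[1] = [BOND, 0.0, 0.0]
    r[2] = r[1] + BOND * np.array(
        [np.cos(np.pi - theta[0]), np.sin(np.pi - theta[0]), 0.0]
    )

    for a in range(3, n):  # a is the 0-based index of atom i = a + 1
        # Equation 3: local orthonormal frame from the three previous atoms.
        u = r[a - 1] - r[a - 2]
        u = u / np.linalg.norm(u)
        d = r[a - 3] - r[a - 2]
        v = d - np.dot(d, u) * u
        v = v / np.linalg.norm(v)
        w = np.cross(u, v)
        # Equation 2: place atom i using theta_{i-1} and phi_{i-2}.
        t, p = theta[a - 2], phi[a - 3]
        r[a] = r[a - 1] + BOND * (
            np.cos(np.pi - t) * u
            + np.sin(np.pi - t) * (np.cos(p) * v + np.sin(p) * w)
        )
    return r


# Example: a 10-residue ideal helix (theta = 90 degrees, phi = 50 degrees).
helix = build_ca_trace(np.deg2rad([90.0] * 8), np.deg2rad([50.0] * 7))
```

By construction every consecutive pair of atoms is 3.8 Å apart and every interior bond angle equals the corresponding input θ, so a mixed helix/loop chain only requires concatenating the helix and loop angle values quoted above.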

Monte Carlo Simulation Protocol Based on Elastic Normal Modes

The sampling of conformational space during folding simulations was carried out by an MC protocol (Allen and Tildesley, 1980), which usually involves two major steps: (1) a random walk of the structure in configurational space, and (2) a Metropolis criterion for acceptance of the walk based on the Boltzmann distribution. In order to increase the sampling efficiency for conformational collapse during folding, we developed a novel, to our knowledge, protocol for implementing the MC simulation. The essence of this protocol is the separation of Hamiltonian in scale for the two steps. In the first step, each random walk of the structure was along a low-frequency eigenvector of elastic normal mode calculated from a coarse-grained Hamiltonian for the elastic network. Then, in the second Metropolis step, the Hamiltonian used to evaluate the energy change was based on more accurate potential functions.

Random Walk Based on Elastic Normal Modes

In the first random walk step, the low-frequency normal modes of the structure were calculated by using a simplified potential function,

$$
V = \frac{\gamma}{2} \sum_{ij} \sigma_{ij} \left( |\mathbf{r}_{ij}| - |\mathbf{r}^0_{ij}| \right)^2, \qquad
\sigma_{ij} =
\begin{cases}
1 & |\mathbf{r}^0_{ij}| \le r_c \\
0 & |\mathbf{r}^0_{ij}| > r_c
\end{cases}.
\tag{4}
$$

Here, |r_ij| and |r⁰_ij| were the instantaneous and current values of distance, respectively, between Cα atom pair i and j. The force constant γ was uniformly set to one, and the cutoff distance r_c was set to 13 Å. A distinct feature of elastic normal modes is that they are calculated from a potential function that regards the current coordinates as the equilibrium coordinates. Consequently, initial energy minimization was not required; thus, no additional structural distortion resulting from energy minimization would occur, which made it possible to carry out this normal mode-based Monte Carlo protocol step by step to propagate the simulation.
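To make Equation 4 concrete, the sketch below builds the Hessian of the elastic network at the current coordinates and extracts a few low-frequency modes with NumPy. It is a simplified illustration: every Cα atom is treated as an independent node, so the block (BNM) scheme with rigid helices used in the paper is not reproduced, each pair is counted once (a different counting convention only rescales the eigenvalues), and the function name and the `n_modes` parameter are assumptions made for this example.

```python
import numpy as np


def elastic_network_modes(coords, cutoff=13.0, gamma=1.0, n_modes=5):
    """Return the n_modes lowest nonzero-frequency modes of a C-alpha elastic network.

    coords: (N, 3) current coordinates, taken as the equilibrium structure.
    Returns (eigenvalues, eigenvectors), eigenvectors of shape (3N, n_modes).
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = float(d @ d)
            if dist2 > cutoff ** 2:
                continue  # sigma_ij = 0 beyond the cutoff
            # Second derivative of the harmonic pair term at equilibrium.
            block = -gamma * np.outer(d, d) / dist2
            hess[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            hess[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            hess[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            hess[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    vals, vecs = np.linalg.eigh(hess)
    # The first six eigenvalues are (numerically) zero: rigid translations and
    # rotations of a connected, non-collinear network. Skip them.
    return vals[6:6 + n_modes], vecs[:, 6:6 + n_modes]


# Example on a small synthetic point set (not a real protein).
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 7.0, size=(12, 3))
freqs, modes = elastic_network_modes(points)
```

Because the current coordinates define the minimum of the potential, no minimization is needed before diagonalization, which is the property the text relies on to recompute modes at every MC step.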

Furthermore, since the helices were treated as rigid bodies, we used the well-established BNM approach (Durand et al., 1994; Li and Cui, 2002; Tama et al., 2000) to change the conformation of the protein. The BNM method first partitions a structure into n blocks. In our model, one block was chosen for each helix or each residue in the loop regions.

For a particular conformation of a protein, a subset of low-frequency modes was calculated based on the current Cα network by using BNM and the elastic harmonic potential (Equation 4). The structure was then moved along a randomly chosen mode with a random displacement. The SHAKE algorithm (Ryckaert et al., 1977) was applied to keep the pseudo bond length unchanged. The newly updated structure was either accepted or rejected based on the Metropolis criterion according to a different Hamiltonian described in the next section. If it was accepted, the newly updated structure was set as the current structure of the protein for the next cycle.
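The cycle described above, with one Hamiltonian proposing the move and another judging it, can be sketched as a generic loop. This is an illustration under stated assumptions rather than the authors' implementation: the mode calculator and the acceptance energy are passed in as callables (`modes_fn`, `energy_fn`), the SHAKE bond-length correction is omitted, and `max_amp`, `kT`, and `seed` are hypothetical parameters.

```python
import numpy as np


def nm_monte_carlo(coords, modes_fn, energy_fn, n_steps, max_amp=1.0, kT=1.0, seed=0):
    """Metropolis MC in which trial moves follow normal modes of the current structure.

    modes_fn(coords)  -> (3N, k) array of mode vectors (coarse-grained Hamiltonian).
    energy_fn(coords) -> float energy used only in the acceptance test.
    Note: no SHAKE step is applied here, so bond lengths are not constrained.
    """
    rng = np.random.default_rng(seed)
    current = np.array(coords, dtype=float)
    e_cur = energy_fn(current)
    for _ in range(n_steps):
        modes = modes_fn(current)              # recomputed at every step
        k = rng.integers(modes.shape[1])       # pick one mode at random
        amp = rng.uniform(-max_amp, max_amp)   # random displacement along it
        trial = current + amp * modes[:, k].reshape(-1, 3)
        e_new = energy_fn(trial)
        # Metropolis criterion with the second, more detailed Hamiltonian.
        if e_new <= e_cur or rng.random() < np.exp(-(e_new - e_cur) / kT):
            current, e_cur = trial, e_new
    return current, e_cur


# Toy usage: Cartesian unit vectors stand in for modes, a quadratic well for the energy.
def toy_modes(x):
    return np.eye(x.size)[:, :4]


def toy_energy(x):
    return float(np.sum(x ** 2))


start = np.ones((3, 3))
final, e_final = nm_monte_carlo(start, toy_modes, toy_energy, n_steps=200, kT=1e-9)
```

Keeping the two Hamiltonians behind separate callables mirrors the point made in the text: the acceptance energy can be replaced by any potential without touching the move generator.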

Hamiltonian Used in Metropolis Criterion

The energy function used in the Metropolis update can be expressed as:

$$U_{\mathrm{Total}} = U_{\mathrm{Angle}} + U_{\mathrm{vdW}} + U_{\mathrm{Nonbonded}} + U_{\mathrm{Hydrophobic}} + U_{\mathrm{Scattering}} \tag{5}$$

The first term on the right is for the pseudo bond angle constraint,

which can be maintained by a harmonic potential,

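$$U_{\mathrm{Angle}} = \sum_{i} \frac{1}{2} k_{\theta} \left( \frac{\theta - \theta_{0}}{\theta_{v}} \right)^{2} \tag{6}$$

The equilibrium bond angle $\theta_{0}$ was set to 105°, the constant $\theta_{v}$ was set to 15°, and the force constant $k_{\theta}$ was 2.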
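As an illustration of Equation 6, the sketch below evaluates the angle term for a Cα trace. It assumes the pseudo bond angle is the angle at residue i formed by residues i-1, i, and i+1, which the text does not spell out; the function names are hypothetical.

```python
import math

def pseudo_bond_angle(a, b, c):
    """Angle in degrees at point b formed by the points a, b, c."""
    v1 = [a[k] - b[k] for k in range(3)]
    v2 = [c[k] - b[k] for k in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    cos_t = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos_t))

def angle_energy(trace, theta0=105.0, theta_v=15.0, k_theta=2.0):
    """Equation 6 summed over every interior residue of a C-alpha trace."""
    energy = 0.0
    for i in range(1, len(trace) - 1):
        theta = pseudo_bond_angle(trace[i - 1], trace[i], trace[i + 1])
        energy += 0.5 * k_theta * ((theta - theta0) / theta_v) ** 2
    return energy
```

With the stated parameters, a 90° pseudo bond angle deviates from equilibrium by exactly one unit of $\theta_{v}$ and contributes an energy of 1.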

In order to prevent the protein model from interresidue collapse,

a pseudo van der Waals potential, UvdW, was added with the form

of (Berriz and Shakhnovich, 2001):

$$U_{\mathrm{vdW}} = u_{0} \left[ \left( \frac{d_{0}}{d_{ij}} \right)^{12} - 2 \left( \frac{d_{0}}{d_{ij}} \right)^{6} \right] \tag{7}$$

where $d_{ij}$ was the instantaneous distance between residues i and j, $d_{0}$ was a constant distance set to 5 Å, which was the most favorable distance between two adjacent Cα atoms, and $u_{0}$ was set to 1.
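The 12-6 form of Equation 7 has its minimum, of depth $u_{0}$, exactly at $d_{ij} = d_{0}$. A short sketch shows this; summing the pair term over all residue pairs i < j is an assumption of the example, since the text gives only the pair form, and the function names are hypothetical.

```python
import math

def vdw_pair(d, d0=5.0, u0=1.0):
    """Equation 7 for a single pair at distance d (Angstrom)."""
    ratio6 = (d0 / d) ** 6
    return u0 * (ratio6 * ratio6 - 2.0 * ratio6)

def vdw_energy(coords, d0=5.0, u0=1.0):
    """Sum of the pair term over all pairs i < j (assumed summation range)."""
    n = len(coords)
    return sum(
        vdw_pair(math.dist(coords[i], coords[j]), d0, u0)
        for i in range(n)
        for j in range(i + 1, n)
    )
```

Distances shorter than $d_{0}$ are penalized steeply by the twelfth-power term, which is what keeps residues from collapsing onto each other.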

The long-range nonbonded interactions were modeled by a

residue-specific, distance-dependent potential extracted from the

structural database by Bahar and Jernigan (Bahar and Jernigan,

1997). The knowledge-based potential can be written as:

$$U_{\mathrm{Nonbonded}} = \sum_{i<j} u(i, j, r) \tag{8}$$

in which i and j were residue indices, r was the distance between the two residues, and $u(i, j, r)$ was the energy parameter.

Since hydrophobic residues tend to be buried in the proteins, we

constructed a crude penalty term to mimic this hydrophobic effect.

A group of hydrophobic residues on the helices with the highest hydrophobicity indices (Kyte and Doolittle, 1982), such as Ile, Leu, Val, or Phe, were chosen. Then, the sum of squared distances between all pairs of the chosen hydrophobic residues was calculated as:

$$U_{\mathrm{Hydrophobic}} = \frac{1}{2} \sum_{i,j \in HC} \left| \vec{r}_{i} - \vec{r}_{j} \right|^{2} \tag{9}$$

Here, HC stands for the chosen hydrophobic core. This term has the effect of "pulling" the chosen hydrophobic core residues toward a more globular packing geometry. Our experience indicates that it was better to choose a set of hydrophobic residues with very high hydrophobicity indices than to include all of the hydrophobic residues in the sequence. This is perhaps because those residues tend to be deeply buried in tertiary structures.
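A minimal sketch of the hydrophobic penalty in Equation 9 is given below. It assumes the double sum runs over ordered pairs, so that the factor 1/2 removes the double counting; the selection rule (the four residue types named in the text, restricted to helical positions) and all names are illustrative. The hydropathy values are the Kyte-Doolittle indices for those four residues.

```python
# Kyte-Doolittle hydropathy indices of the residue types named in the text
KD_INDEX = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8}

def hydrophobic_energy(sequence, coords, in_helix, chosen=KD_INDEX):
    """Equation 9 over the chosen hydrophobic core (HC).

    sequence -- one-letter amino acid codes
    coords   -- list of (x, y, z) C-alpha positions, same length as sequence
    in_helix -- list of booleans, True where the residue belongs to a helix
    """
    core = [i for i, aa in enumerate(sequence) if aa in chosen and in_helix[i]]
    total = 0.0
    for i in core:
        for j in core:  # ordered pairs; the factor 1/2 below undoes the double count
            total += sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
    return 0.5 * total
```

Because the penalty grows with the squared separation, it is lowest when the selected residues cluster together, which is the "pulling" effect described above.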

Finally, the SAXS profile, which carries information on the size and shape of a protein, was incorporated as a constraint on the folding process through a term in the potential function, $U_{\mathrm{Scattering}}$, defined as:

$$U_{\mathrm{Scattering}} = w \sum_{i=1}^{n} \left| I_{T}(s_{i}) - I_{M}(s_{i}) \right|^{2} \tag{10}$$

where $s_{i}$ is the scattering vector, and $I_{T}(s_{i})$ and $I_{M}(s_{i})$ are the scattering intensities of the target structure and the modeled structure, respectively. The constant $w$ is a weight for balancing this contribution against the other energy terms. Similar to what has been reported in the literature (Walther et al., 2000; Zheng and Doniach, 2002), the scattering intensity $I(s)$ of a model formed by N beads can be calculated by the Debye equation:

$$I(s) = N + 2 \sum_{i<j}^{N} \frac{\sin(2\pi s r_{ij})}{2\pi s r_{ij}} \tag{11}$$

where $r_{ij}$ is the distance between a pair of beads, with each pair counted once. For larger systems, the above N × N calculation can be replaced by an alternative Debye equation in its pair-distance histogram form:

$$I(s) = N + 2 \sum_{i=1}^{n_{\mathrm{bins}}} g(r_{i}) \frac{\sin(2\pi |r_{i}| s)}{2\pi |r_{i}| s} \tag{12}$$

where $g(r_{i})$ is the pair-distance histogram of all singly counted pairwise distances, and the number of bins is $n_{\mathrm{bins}}$. To numerically represent the scattering profile $I(s)$, the scattering vector s was discretized with the interval ds = 0.001 Å⁻¹. So, if the maximal s was set to 0.1 Å⁻¹, then the value of n in Equation 10 would be 100. The weight factor, w, was adjusted to a value at which the scattering term roughly made up one-fifth of the contribution to the total energy function in Equation 5.
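The scattering term can be sketched directly from Equations 10 to 12. The code below is an illustration, not the original implementation: it counts each bead pair once, uses bin centers as the representative distances $r_{i}$ in the histogram form, and starts the s grid at ds rather than at zero. Those choices, and all function names, are assumptions.

```python
import math

def _sinc(x):
    """sin(x)/x with the limiting value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def debye_intensity(coords, s):
    """Equation 11: direct sum over all bead pairs, each counted once."""
    n = len(coords)
    total = float(n)
    for i in range(n):
        for j in range(i + 1, n):
            total += 2.0 * _sinc(2.0 * math.pi * s * math.dist(coords[i], coords[j]))
    return total

def pair_histogram(coords, bin_width):
    """g(r): number of singly counted pair distances per bin, keyed by bin index."""
    hist = {}
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            k = int(math.dist(coords[i], coords[j]) / bin_width)
            hist[k] = hist.get(k, 0) + 1
    return hist

def debye_intensity_hist(hist, bin_width, n_beads, s):
    """Equation 12, evaluated at the bin centers (an assumption of this sketch)."""
    total = float(n_beads)
    for k, count in hist.items():
        r = (k + 0.5) * bin_width
        total += 2.0 * count * _sinc(2.0 * math.pi * s * r)
    return total

def scattering_energy(target_profile, model_profile, weight=1.0):
    """Equation 10: weighted sum of squared intensity differences."""
    return weight * sum((t - m) ** 2 for t, m in zip(target_profile, model_profile))

# 100 points with ds = 0.001 up to s = 0.1 (inverse Angstrom), as in the text
S_GRID = [0.001 * k for k in range(1, 101)]
```

A model profile would be obtained as `[debye_intensity(coords, s) for s in S_GRID]` and compared with the target profile through `scattering_energy`. The histogram form trades a small binning error for a cost per s value that scales with the number of bins instead of the number of pairs.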

After the new structure was generated by a random walk along a normal mode, the move was accepted or rejected according to the Metropolis criterion, with the acceptance probability p given by:

$$p \sim \exp\left[ -\frac{U_{\mathrm{Total}} - U_{\mathrm{Total}}^{0}}{C} \right] \tag{13}$$

where $U_{\mathrm{Total}}$ was the energy of the new conformation, $U_{\mathrm{Total}}^{0}$ was the energy of the old conformation, and C was an empirically chosen constant related to the Monte Carlo temperature. The initial value of C was set between 1.5 and 3 depending on the size of the protein; generally, larger proteins were assigned a larger value of C. Additionally, C was decreased in the final steps of the simulation so that the structure could converge to the lower-energy state. Correspondingly, the maximal displacement along the normal mode was also decreased in the final steps so that the conformational space could be finely sampled before the simulation converged.
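The acceptance rule of Equation 13 and the late-stage reduction of C can be sketched as below. The text does not give the cooling schedule, so the linear ramp over the last fifth of the run, the end values, and the function names are purely illustrative assumptions.

```python
import math
import random

def metropolis_accept(u_new, u_old, c, rng=random):
    """Accept downhill moves always; accept uphill moves with probability exp(-dU/C)."""
    delta = u_new - u_old
    if delta <= 0.0:
        return True
    return rng.random() < math.exp(-delta / c)

def annealed(value_start, value_end, step, n_steps, final_fraction=0.2):
    """Hold value_start, then ramp linearly to value_end over the final steps.

    The schedule shape and final_fraction are hypothetical; the same helper could
    serve both for C and for the maximal displacement along a mode.
    """
    start_cool = int(n_steps * (1.0 - final_fraction))
    if step < start_cool:
        return value_start
    frac = (step - start_cool) / max(1, n_steps - 1 - start_cool)
    return value_start + frac * (value_end - value_start)
```

Lowering C makes uphill moves progressively less likely, and shrinking the maximal displacement at the same time keeps the acceptance rate from collapsing as the structure settles.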

Final Identification of Native Topology

Using the coarse-grained potential for residue interactions and other simple constraints, one naturally cannot expect every simulation trajectory to converge to exactly the same native topology. The problem of deriving a three-dimensional model from a one-dimensional scattering profile is fundamentally underdetermined, especially when using solution scattering information. Therefore, for each case, multiple trajectories were generated independently, which led to a number of compact structures that were similar in shape and radius of gyration, but different in tertiary packing of helices. A major task then was to identify the most native-like topology among all of the folded structures. To do so, we adopted an energy-based screening procedure developed in our previous study (Wu et al., 2005) that is briefly outlined below.

For each final folded structure, only the axes and directions of helices (as vectors) and their corresponding sequence identities were extracted from Cα coordinates. These vectors represented the packing geometry of a particular structure. Then, an ensemble of slightly perturbed structures was constructed to represent the conformational variations around the given topology. All of the structures in the ensemble were generated as a Cα-based trace representation. Each structure in the ensemble was constructed in two steps. The first step was helix placement: the standard helical segments were placed based on their extracted helical axes. To allow for enough structural deviations, the x, y, and z coordinates of the centers of mass of the standard helical segments were perturbed in a sphere of radius of 1.0 Å. The three angles of rotation around the centers of mass were also randomized within a range of 5°. The second step, carried out once the helices were all in place, was loop construction by an off-lattice MC procedure. First, the loops were built up to their full length in an extended conformation. Then, they were allowed to relax based on MC movement to connect two neighboring helices.
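The helix-placement perturbation can be illustrated with a rigid-body sketch. It assumes that "within a range of 5°" means each rotation angle is drawn uniformly from -5° to +5°, that the three rotations are applied about the x, y, and z axes through the centroid, and that the shift is uniform inside the 1.0 Å sphere; the centroid of the Cα positions stands in for the center of mass. None of these details are stated in the text, and the names are hypothetical.

```python
import math
import random

def random_shift_in_sphere(radius, rng):
    """Uniform random vector inside a sphere, by rejection sampling from a cube."""
    while True:
        v = [rng.uniform(-radius, radius) for _ in range(3)]
        if sum(x * x for x in v) <= radius * radius:
            return v

def rotate_xyz(p, ax, ay, az):
    """Rotate point p about the x, then y, then z axis (angles in radians)."""
    x, y, z = p
    y, z = y * math.cos(ax) - z * math.sin(ax), y * math.sin(ax) + z * math.cos(ax)
    x, z = x * math.cos(ay) + z * math.sin(ay), -x * math.sin(ay) + z * math.cos(ay)
    x, y = x * math.cos(az) - y * math.sin(az), x * math.sin(az) + y * math.cos(az)
    return (x, y, z)

def perturb_helix(coords, max_shift=1.0, max_angle_deg=5.0, rng=random):
    """Rigidly rotate a helix about its centroid and translate it slightly."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    shift = random_shift_in_sphere(max_shift, rng)
    angles = [math.radians(rng.uniform(-max_angle_deg, max_angle_deg)) for _ in range(3)]
    perturbed = []
    for c in coords:
        local = [c[k] - center[k] for k in range(3)]
        r = rotate_xyz(local, *angles)
        perturbed.append(tuple(r[k] + center[k] + shift[k] for k in range(3)))
    return perturbed
```

Since the helix moves as a rigid body, all internal distances are preserved and only its position and orientation relative to the other helices change.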

After the construction of all perturbed structures in the ensemble, all members of the ensemble were separately energy refined by global optimization, which included genetic algorithm (GA) optimization for rotations of the helices and MC relaxation for loop regions. By iteratively applying the procedure, all structures were guided globally to energetically more favorable states. After global optimization of all members of the structural ensemble, the average energy for the whole ensemble was evaluated. The average energies of all folded structures were then compared to locate the one with the lowest energy. Our previous study (Wu et al., 2005) indicated that, even with severe errors inherent in the constructed structures and limited accuracy in the coarse-grained potential functions, the average energy of a slightly perturbed ensemble of structures around a given topology was a much more robust evaluator for the topology than the energy of any individual member in the ensemble, no matter how extensively the structures in the ensemble were optimized.
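The final selection step reduces to comparing ensemble-averaged energies. A minimal sketch, assuming the optimized energies of each ensemble are already available in a dictionary keyed by a topology label (a hypothetical data layout):

```python
from statistics import mean

def rank_topologies(ensemble_energies):
    """Return the topology with the lowest ensemble-averaged energy.

    ensemble_energies -- dict mapping a topology label to the list of optimized
                         energies of its perturbed ensemble members
    """
    averages = {name: mean(values) for name, values in ensemble_energies.items()}
    best = min(averages, key=averages.get)
    return best, averages
```

For example, with `{"A": [-10, -12, -11], "B": [-20, 5, 3]}` the function selects "A", even though the single lowest-energy structure belongs to "B". This is the sense in which the average is a more robust evaluator than any individual member.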

Acknowledgments

We wish to thank Tom Irving for helpful discussion in the early stage

of the project. The authors gratefully acknowledge the support from

the National Institutes of Health (R01-GM067801). M.C. and M.L. are

partially supported by predoctoral fellowships from the W.M. Keck

Foundation of the Gulf Coast Consortia through the Keck Center

for Computational and Structural Biology.

Received: May 25, 2005

Revised: July 21, 2005

Accepted: July 22, 2005

Published: November 8, 2005

References

Allen, M.P., and Tildesley, D.J. (1980). Computer Simulation of

Liquids (Oxford, UK: Clarendon Press).

Atilgan, A.R., Durell, S.R., Jernigan, R.L., Demirel, M.C., Keskin, O.,

and Bahar, I. (2001). Anisotropy of fluctuation dynamics of proteins

with an elastic network model. Biophys. J. 80, 505–515.

Bahar, I., and Jernigan, R.L. (1997). Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. 266, 195–214.

Bahar, I., Kaplan, M., and Jernigan, R.L. (1997). Short-range conformational energies, secondary structure propensities, and recognition of correct sequence-structure matches. Proteins 29, 292–308.

Berriz, G.F., and Shakhnovich, E.I. (2001). Characterization of the folding kinetics of a three-helix bundle protein via a minimalist Langevin model. J. Mol. Biol. 310, 673–685.

Blundell, T.L., Sibanda, B.L., Sternberg, M.J., and Thornton, J.M. (1987). Knowledge-based prediction of protein structures and the design of novel molecules. Nature 326, 347–352.

Boczko, E.M., and Brooks, C.L., 3rd. (1995). First-principles calculation of the folding free energy of a three-helix bundle protein. Science 269, 393–396.

Bowie, J.U., Luthy, R., and Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science 253, 164–170.

Cardenas, A.E., and Elber, R. (2003). Kinetics of cytochrome C folding: atomically detailed simulations. Proteins 51, 245–257.

Costenaro, L., Grossmann, J.G., Ebel, C., and Maxwell, A. (2005).

Small-angle X-ray scattering reveals the solution structure of the

full-length DNA gyrase a subunit. Structure 13, 287–296.

Davies, J.M., Tsuruta, H., May, A.P., and Weis, W.I. (2005). Conformational changes of p97 during nucleotide hydrolysis determined by small-angle X-ray scattering. Structure 13, 183–195.

Dill, K.A., and Chan, H.S. (1997). From Levinthal to pathways to funnels. Nat. Struct. Biol. 4, 10–19.

Dinner, A.R., Sali, A., Smith, L.J., Dobson, C.M., and Karplus, M. (2000). Understanding protein folding via free-energy surfaces from theory and experiment. Trends Biochem. Sci. 25, 331–339.

Duan, Y., and Kollman, P.A. (1998). Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 282, 740–744.

Durand, P., Trinquier, G., and Sanejouand, Y.H. (1994). New approach for determining low-frequency normal-modes in macromolecules. Biopolymers 34, 759–771.

Elofsson, A., Fischer, D., Rice, D.W., Le Grand, S.M., and Eisenberg, D. (1996). A study of combined structure/sequence profiles. Fold Des. 1, 451–461.

Guo, Z., and Thirumalai, D. (1995). Kinetics of protein folding: nucleation mechanism, time scales, and pathways. Biopolymers 36, 83–102.

Jiang, W., Baker, M.L., Ludtke, S.J., and Chiu, W. (2001). Bridging

the information gap: computational tools for intermediate resolution

structure interpretation. J. Mol. Biol. 308, 1033–1044.

Jones, D.T. (1999). Protein secondary structure prediction based on

position-specific scoring matrices. J. Mol. Biol. 292, 195–202.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). A new approach to protein fold recognition. Nature 358, 86–89.

Karplus, M., and Weaver, D.L. (1979). Diffusion-collision model for protein folding. Biopolymers 18, 1421–1437.

Karplus, M., and Weaver, D.L. (1994). Protein folding dynamics: the diffusion-collision model and experimental data. Protein Sci. 3, 650–668.

Kojima, K., Timchenko, A.A., Higo, J., Ito, K., Kihara, H., and Takahashi, K. (2004). Structural refinement by restrained molecular-dynamics algorithm with small-angle X-ray scattering constraints for a biomolecule. J. Appl. Crystallogr. 37, 103–109.

Kong, Y., and Ma, J. (2003). A structural-informatics approach for mining β-sheets: locating sheets in intermediate-resolution density maps. J. Mol. Biol. 332, 399–413.

Kong, Y., Zhang, X., Baker, T.S., and Ma, J. (2004). A structural-informatics approach for tracing β-sheets: building pseudo-Cα traces for β-strands in intermediate-resolution density maps. J. Mol. Biol. 339, 117–130.

Kyte, J., and Doolittle, R.F. (1982). A simple method for displaying

the hydropathic character of a protein. J. Mol. Biol. 157, 105–132.

Li, G., and Cui, Q. (2002). A coarse-grained normal mode approach

for macromolecules: an efficient implementation and application

to Ca(2+)-ATPase. Biophys. J. 83, 2457–2474.

Lin, K., Simossis, V.A., Taylor, W.R., and Heringa, J. (2005). A simple

and fast secondary structure prediction method using hidden neural

networks. Bioinformatics 21, 152–159.

Liwo, A., Khalili, M., and Scheraga, H.A. (2005). Ab initio simulations

of protein-folding pathways by molecular dynamics with the united-

residue model of polypeptide chains. Proc. Natl. Acad. Sci. USA 102,

2362–2367.


Lu, M., and Ma, J. (2005). The role of shape in determining molecular motion. Biophys. J. 89, 2395–2401.

Ma, J. (2004). New advances in normal mode analysis of supermolecular complexes and applications to structural refinement. Curr. Protein Pept. Sci. 5, 119–123.

Ma, J. (2005). Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes. Structure 13, 373–380.

Mehboob, S., Jacob, J., May, M., Kotula, L., Thiyagarajan, P., Johnson, M.E., and Fung, L.W. (2003). Structural analysis of the α N-terminal region of erythroid and nonerythroid spectrins by small-angle X-ray scattering. Biochemistry 42, 14702–14710.

Moult, J., Fidelis, K., Zemla, A., and Hubbard, T. (2003). Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins 53 (Suppl 6), 334–339.

Onuchic, J.N., Luthey-Schulten, Z., and Wolynes, P.G. (1997). Theory of protein folding: the energy landscape perspective. Annu. Rev. Phys. Chem. 48, 545–600.

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997). CATH–a hierarchic classification of protein domain structures. Structure 5, 1093–1109.

Pande, V.S., Baker, I., Chapman, J., Elmer, S.P., Khaliq, S., Larson, S.M., Rhee, Y.M., Shirts, M.R., Snow, C.D., Sorin, E.J., and Zagrovic, B. (2003). Atomistic protein folding simulations on the submillisecond time scale using worldwide distributed computing. Biopolymers 68, 91–109.

Park, B.H., and Levitt, M. (1995). The complexity and accuracy of

discrete state models of protein structure. J. Mol. Biol. 249, 493–507.

Pedersen, J.T., and Moult, J. (1997). Protein folding simulations with

genetic algorithms and a detailed molecular description. J. Mol. Biol.

269, 240–259.

Petoukhov, M.V., Eady, N.A., Brown, K.A., and Svergun, D.I. (2002).

Addition of missing loops and domains to protein models by x-ray

solution scattering. Biophys. J. 83, 3113–3125.

Ryckaert, J.P., Ciccotti, G., and Berendsen, H.J.C. (1977). Numerical

integration of the cartesian equations of motion of a system with

constraints: molecular dynamics of n-alkanes. J. Comput. Phys.

23, 327–341.

Sali, A., Shakhnovich, E., and Karplus, M. (1994). How does a protein

fold? Nature 369, 248–251.

Simons, K.T., Kooperberg, C., Huang, E., and Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209–225.

Srinivasan, R., Fleming, P.J., and Rose, G.D. (2004). Ab initio protein folding using LINUS. Methods Enzymol. 383, 48–66.

Svergun, D.I., and Koch, M.H. (2002). Advances in structure analysis using small-angle scattering in solution. Curr. Opin. Struct. Biol. 12, 654–660.

Svergun, D.I., Petoukhov, M.V., and Koch, M.H. (2001). Determination of domain structure of proteins from X-ray solution scattering. Biophys. J. 80, 2946–2953.

Tama, F., Gadea, F.X., Marques, O., and Sanejouand, Y.H. (2000). Building-block approach for determining low-frequency normal modes of macromolecules. Proteins 41, 1–7.

Walther, D., Cohen, F.E., and Doniach, S. (2000). Reconstruction of low-resolution three-dimensional density maps from one-dimensional small-angle x-ray solution scattering data for biomolecules. J. Appl. Crystallogr. 33, 350–363.

Wedemeyer, W.J., Welker, E., Narayan, M., and Scheraga, H.A.

(2000). Disulfide bonds and protein folding. Biochemistry 39,

4207–4216.

Wolynes, P.G., Onuchic, J.N., and Thirumalai, D. (1995). Navigating

the folding routes. Science 267, 1619–1620.

Wu, Y., Chen, M., Lu, M., Wang, Q., and Ma, J. (2005). Determining

protein topology from skeletons of secondary structures. J. Mol.

Biol. 350, 571–586.

Zhang, Y., and Skolnick, J. (2004). Tertiary structure predictions on a comprehensive benchmark of medium to large size proteins. Biophys. J. 87, 2647–2655.

Zheng, W., and Doniach, S. (2002). Protein structure prediction constrained by solution X-ray scattering data and structural homology identification. J. Mol. Biol. 316, 173–187.

Zheng, W., and Doniach, S. (2005). Fold recognition aided by constraints from small angle X-ray scattering data. Protein Eng. Des. Sel. 18, 209–219.

Zhou, Y., and Karplus, M. (1999). Interpreting the folding kinetics of

helical proteins. Nature 401, 400–403.
