Structure, Vol. 13, 1587–1597, November, 2005, ©2005 Elsevier Ltd. All rights reserved. DOI 10.1016/j.str.2005.07.023
Folding of Small Helical Proteins Assisted by Small-Angle X-Ray Scattering Profiles
Yinghao Wu,1 Xia Tian,2 Mingyang Lu,2 Mingzhi Chen,3 Qinghua Wang,2 and Jianpeng Ma1,2,3,*
1 Department of Bioengineering, Rice University, Houston, Texas 77005
2 Verna and Marrs McLean Department of Biochemistry and Molecular Biology
3 Graduate Program of Structural and Computational Biology and Molecular Biophysics
Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030
Summary
This paper reports a computational method for folding small helical proteins. The goal was to determine the overall topology of proteins given the secondary structure assignment on the sequence. To this end, a Monte Carlo protocol, which combines coarse-grained normal modes and a Hamiltonian at a different scale, was developed to enhance sampling. In addition to the knowledge-based potential functions, a small-angle X-ray scattering (SAXS) profile was also used as a weak constraint for guiding the folding. The algorithm can deliver structural models with overall correct topology, which makes them similar to those of 5–6 Å cryo-EM density maps. The success could help make the SAXS technique a fast and inexpensive solution-phase experimental method for determining the overall topology of small, soluble, but noncrystallizable, helical proteins.
Introduction
One of the major goals of protein folding is to obtain the overall three-dimensional structure from a one-dimensional primary sequence of amino acids. Through decades of extensive studies, substantial progress has been made toward understanding the folding mechanisms and the prediction of three-dimensional structures (Blundell et al., 1987; Boczko and Brooks, 1995; Bowie et al., 1991; Cardenas and Elber, 2003; Dill and Chan, 1997; Dinner et al., 2000; Duan and Kollman, 1998; Elofsson et al., 1996; Guo and Thirumalai, 1995; Jones et al., 1992; Liwo et al., 2005; Onuchic et al., 1997; Pande et al., 2003; Pedersen and Moult, 1997; Sali et al., 1994; Simons et al., 1997; Srinivasan et al., 2004; Wedemeyer et al., 2000; Wolynes et al., 1995; Zhang and Skolnick, 2004; Zhou and Karplus, 1999). At the current stage, however, reliably predicting the overall topology is still a challenging problem for most proteins.
In this paper, we report on the effort of determining the tertiary topology of small helical proteins or domains by using several constraints. The proteins were modeled based on Cα traces and the secondary structures as standard helices. The starting conformations were randomly chosen and had extended tertiary structures. Knowledge-based potential functions and a small-angle X-ray scattering (SAXS) profile were used to guide the simulations. To efficiently sample conformational space, we also developed a novel, to our knowledge, Monte Carlo (MC) protocol, which was based on observations from previous studies that large-scale conformational changes are dominated by a small number of low-frequency normal modes (Ma, 2004, 2005). A distinct feature of this protocol is the separation of scale between the Hamiltonian used for calculating normal modes and the one used for the Metropolis criterion in the MC simulation. A substantially improved sampling efficiency was observed in our study.
*Correspondence: [email protected]
Multiple folding trajectories that usually resulted in different final structures with fairly diverse topologies were simulated. A major issue, therefore, was to distinguish the most native-like topology among them. To do so, we adopted an approach developed in our recent study (Wu et al., 2005), which was originally designed to determine protein topology from the skeletons of secondary elements derived from low- to intermediate-resolution density maps. The approach was based on an essential observation that the average energy of an ensemble of structures with slightly different configurations around the original structure was a much more robust parameter for marking native topology than the energy of any of the individual structures in the ensemble. The underlying hypothesis is that native topology was chosen by evolution as the one that can accommodate the largest structural variations, not the one rigidly trapped in a deep, but narrow, conformational energy well. In this study, for each final folded structure, we separately generated an ensemble of structures around it and evaluated the average energy as the criterion for identifying the native topology. Our results showed that, in two-thirds of the testing cases with simulated scattering profiles, the native topology was successfully identified. Tests on real experimental scattering data also yielded a satisfactory outcome.
Since we used Cα-based models, an approach that could not deliver atomic resolution structure, our focus was on getting the overall topology correct. It was shown that the approach was particularly effective for small, globular helical proteins or protein domains. Such a success could help increase the effective resolution of the SAXS method. This is particularly important since SAXS is a relatively low-cost and fast method compared with other solution-phase methods such as NMR.
Results
Test of 12 All-Helical Globular Proteins
The simulation protocol was first tested on a total of 12 all-helical globular proteins that can be roughly split into four groups, which contain three, four, five, and six helices, respectively. They have an architecture of an orthogonal bundle or an up-down bundle (Orengo et al., 1997). The SAXS profiles for all of the proteins were
Table 1. Results for 12 Testing Proteins

| Helix Number | PDB Code | Protein Size (Number of Residues) | Protein Architecture | Lowest Rmsd Topology | Rmsd with Loop (Å) | Rmsd without Loop (Å) |
|---|---|---|---|---|---|---|
| Three helices | 1mbg | 40 | orthogonal | 1st | 3.98 | 4.10 |
| | 1prb | 45 | up-down | 1st | 3.76 | 3.39 |
| | 1erc | 40 | up-down | 5th | 3.78 | 3.07 |
| Four helices | 1eo0 | 76 | up-down | 1st | 5.77 | 5.57 |
| | 1i2t | 61 | orthogonal | 1st | 5.52 | 5.24 |
| | 1eoq | 64 | orthogonal | 3rd | 5.53 | 5.12 |
| Five helices | 1nkl | 78 | orthogonal | 1st | 5.47 | 5.37 |
| | 2cro | 65 | orthogonal | 1st | 6.85 | 6.02 |
| | 1ow5 | 60 | orthogonal | 7th | 5.10 | 4.45 |
| Six helices | 1a1w | 91 | orthogonal | 1st | 7.02 | 6.19 |
| | 1ngr | 74 | orthogonal | 1st | 5.35 | 5.38 |
| | 1ich | 87 | orthogonal | 4th | 6.09 | 5.82 |

The number in the column "Lowest Rmsd Topology" describes the energetics ranking of the topology of the lowest rmsd in relation to the native topology. Therefore, "1st" means the energetics-based screening successfully identified the most native-like topology as the lowest one in energy. The values of rmsd calculated with and without loops are both listed.
computationally calculated. The native topologies of 8 out of 12 proteins were successfully identified with reasonable root mean square deviations (rmsd) from known crystal structures (Table 1). The errors in the final folded structures inevitably increased with the size of the proteins, but such magnitudes of rmsd were within the overall range of precision of the current structural prediction by other methods (Moult et al., 2003).
In Figure 1, one successful case from each group is displayed together with the corresponding crystal structure. Clearly, for all groups, the overall topology of the proteins was correctly reached. Note that for clarity the loops are not shown. All major helices were in the vicinity of their correct positions. Shown in Figure 2 is one of the folding trajectories of 1nkl from a conformation near the beginning of simulation (Figure 2A) to the end of simulation (Figure 2F). It is apparent that the overall sampling of the conformational space by the elastic normal mode-based MC protocol is efficient. Each trajectory for the largest protein we tested took about 48 hr on a single processor PC machine. Moreover, the speed of computing also depends on the total number
Figure 1. Comparison of the Most Native-like Topology with the Corresponding Native Topology
(A–D) For clarity, only the helices are drawn as cylinders. The dark color indicates native structures, and the light color indicates simulated structures. The values of rmsd without loops are 4.1 Å for 1mbg, 5.1 Å for 1eo0, 5.4 Å for 1nkl, and 6.2 Å for 1a1w.
Figure 2. The Snapshots Taken from One of the Folding Trajectories of Protein 1nkl
(A–F) (A) shows a conformation near the beginning of simulation, and (F) shows the final folded conformation that has an rmsd of 5.5 Å to the native structure.
of blocks in BNM analysis that was related to the length of loops. However, our protocol was very fast for folding the small- to medium-sized helical proteins.
Four typical final folded structures were shown for the three-helical protein 1mbg (Figure 3A) and for the five-helical protein 1nkl (Figure 3B). For each protein, the first folded structure had the most native-like topology, which was successfully identified in our final screening. The results clearly demonstrated that the final folded structures from different folding trajectories had quite different topologies, and that some were very far away from the native topology. This fact necessitated the application of an additional energetics-based screening method in our protocol (see Experimental Procedures). The convergence of energy values during MC simulations was quick for all trajectories (Figure 4A), even though the rmsds of the final folded structures were fairly different (Figure 4B). Finally, as we had noted previously (Wu et al., 2005), the most native-like topology of 1nkl had the lowest average total energy in the perturbed ensemble of structures, the smallest standard deviation (SD) for the energy distribution, and the largest number of perturbed structures (Table 2). These results further support the finding that the native topology was chosen by evolution as the one that can accommodate the largest structural variations, not the one rigidly trapped in a deep, but narrow, conformational energy well.
Analysis of Some Failed Cases
To better understand its overall performance, here we present two typical cases in which our simulation protocol failed: myoglobin (PDB code 1bvc) and cytochrome c (PDB code 1akk).
The structure of myoglobin has 8 helices (6 major ones) and 153 residues. None of the ten trajectories had a native-like topology. The closest had an rmsd of 10.52 Å. If only the six major helices were compared, the rmsd was 8.0 Å. A detailed comparison of the six major helices in the lowest-rmsd model with those of the native structure is shown in Figure 5A. The dark color indicates the native helices, and the light color indicates our modeled helices. The lowest-rmsd model has a very similar overall shape to the native structure. The largest discrepancy from the native structure was in the last two major helices, which were swapped in the modeled structure. Note that since visual inspection showed that the native topology was not among the final ten folded structures, we did not apply energetics-based screening.
Cytochrome c is a smaller four-helical protein (Figure 5B). The folding simulation also failed to produce the native-like topology out of the multiple trajectories. One possible reason for this failure was that the folding of the native structure of cytochrome c heavily involves its cofactor heme. Since we did not include heme in our simulation, this could contribute to the substantial error in the final folded structures. Also, there is a very long loop from residues 17 to 48 that extensively interacts with the heme group. This loop also contributed to the frustration of the folding simulation.
Cases with Mismatch between Predicted and Native Secondary Structures
In all applications discussed so far, we utilized the native secondary structure assignments taken from the crystal structures. For the purpose of practical applications, we also tested cases with mismatches between the predicted and native assignments. By using the consensus secondary structure prediction server (http://ibivu.cs.vu.nl/programs/sympredwww/), 3 out of 12 proteins
had secondary structure mismatches. They were 1erc, 1prb, and 1ow5. For 1erc and 1prb, although they are all-helical proteins, the secondary structure prediction indicated that they had strands. In these cases, our folding simulations failed completely. In the case of 1ow5, which is a five-helical bundle, secondary structure prediction indicated four helical regions (two small helical regions were predicted as a single long helix). If we constructed the initial protein model based on the predicted secondary structures with the scattering profile still computed from the native crystal structures, the folding simulation successfully identified the most native-like topology with the lowest rmsd of 6.86 Å from the native structure
Figure 3. Four Final Folded Structures from Different Trajectories
for 1mbg and 1nkl
(A and B) For each protein ([A] 1mbg; [B] 1nkl), the first structure
has the lowest rmsd value (most native-like). All folded structures
have compact conformations and similar radii of gyration, but
with quite different topologies.
with loop regions, and 6.78 Å for the structure without loop regions (Figure 6A). (Note that in Table 1, when there was no mismatch, this protein has an rmsd of 5.10 Å with loop regions and an rmsd of 4.45 Å without loop regions.) The global arrangement of all three helices is similar to the native structure, while the single long helix, predicted over a sequence region that should have two smaller helices in the native structure, lies at the average position between the two native small helices (Figure 6A, (3)).
Another example that can illustrate the effects of errors in secondary structure assignment is a protein of PDB code 3icb (a five-helical orthogonal bundle protein). As in Figure 6B, the consensus prediction for the secondary structure gave a correct assignment, and, consequently, the folded topology was also correct (lowest-rmsd structure without loop regions was 4.93 Å out of the ten folding trajectories [Figure 6B, (1)]). If we used the assignment by the YASPIN method (Lin et al., 2005), in which a small helix (residues 37–41) was merged into a preceding longer helix, the topology was more or less correct (lowest rmsd was 5.85 Å [Figure 6B, (2)]). If,
Figure 4. Convergence of MC Simulation for Protein 1nkl, for Energy and for Rmsd
(A and B) (A) indicates convergence of energy, and (B) indicates convergence of rmsd. Although the overall convergences for different trajectories were similar, the final structures differ substantially with regard to rmsd.
Table 2. Detailed Results for the Final Energetics-Based Ranking of Ten Folding Trajectories for 1nkl

| Trajectory Index | Average Rmsd (Å) | Average Total Energy (RT) | SD of Total Energy (RT) | Average Long-Range Energy (RT) | SD of Long-Range Energy (RT) | Number of Randomized Structures in the Ensemble |
|---|---|---|---|---|---|---|
| 1 | 10.7 | −378.21 | 50.74 | −54.42 | 26.93 | 253 |
| 2 | 6.6 | −399.08 | 45.99 | −54.91 | 22.08 | 353 |
| 3 | 10.8 | −378.59 | 61.71 | −42.19 | 33.91 | 322 |
| 4 | 8.9 | −396.36 | 44.87 | −49.77 | 33.35 | 373 |
| 5 | 10.5 | −378.70 | 50.91 | −31.08 | 40.08 | 472 |
| 6 | 7.2 | −399.85 | 42.71 | −53.26 | 26.14 | 445 |
| 7 | 5.4 | −418.47 | 37.75 | −64.92 | 25.59 | 563 |
| 8 | 10.5 | −403.07 | 38.95 | −52.19 | 24.25 | 510 |
| 9 | 11.7 | −371.95 | 45.91 | −34.79 | 35.62 | 416 |
| 10 | 10.6 | −388.36 | 48.5 | −40.16 | 36.13 | 412 |
however, we used the assignment by the PSIPRED method (Jones, 1999), which mistakenly predicted the same small helix as a loop, the folded topology became much worse (lowest rmsd became 11.36 Å [Figure 6B, (3)]).
These results indicate that, in general, if the mismatches between the predicted and native secondary structures are not too severe, our method is still capable of delivering reasonable overall topology. But failures did occur when the mismatches were too large for our method to handle.
Test of System with Experimental Scattering Data
In order to test our simulation protocol by using real experimentally measured SAXS data, we performed the test on a system that had published scattering data
Figure 5. Examination of Failed Cases
(A) Comparison of the lowest-rmsd structure with the native structure for myoglobin 1bvc. Six major long helices are illustrated one-by-one.
Their amino acid sequence ranges are 3–18, 20–36, 59–78, 85–96, 100–119, and 124–149. The last two were swapped in the folded structure,
but the overall architecture of the folded structure is not far away from that of the native structure.
(B) Ribbon diagram for cytochrome c. The long loop that extensively interacts with the heme group is highlighted in a darker color.
Figure 6. Comparison of the Lowest-Rmsd Structures with the Native Structure for 1ow5 and 3icb with Mismatches between the Predicted and Native Secondary Structures
The darker color is for the native structure, and the lighter color is for the folded structure.
(A) For 1ow5, a particular helix is highlighted by ribbon representation in each panel (1–4). Note the mismatch that occurs over the sequence region of two small, dark helices highlighted in (3).
(B) For 3icb, results for different secondary structure assignments are shown.
available (Mehboob et al., 2003) (the maximal scattering vector s was set to 0.05 Å⁻¹). The N-terminal region of spectrin is a relatively large three-helical bundle protein (149 residues), and the secondary structure prediction gave the same assignment as the native structure. Although the absolute size of this three-helical bundle was even larger than the largest six-helical bundle protein we tested, our method successfully found the most native-like topology (Figure 7) with an rmsd of 6.5 Å.
Convergence of Elastic Normal Mode-Based MC Simulation
To demonstrate the convergence of the MC protocol we developed, we compared this protocol with a more traditional rigid-body MC simulation, in which one block was chosen randomly at each time step and then took a random move. If the chosen block contained only one Cα atom, as in the loops, the random move was chosen on its three translational degrees of freedom. If the chosen block contained more than one Cα atom, as in helices, three additional rotational degrees of freedom were changed randomly. For comparison purposes, we used the same initial condition, the same energy potential, and the same weight value; the moving amplitude was also adjusted to be close to that of the normal mode-based MC simulation. The convergences of the two simulation methods are shown in Figure 8. The normal mode-based MC simulation converged much faster and to a lower energy, or a more compact conformation, than the traditional rigid-body MC simulation. Although the calculation of elastic normal modes in each step took some additional time, the overall performance of normal mode-based MC is significantly better, especially in terms of the energy of the final folded structure. As a direct comparison, ten independent trajectories were run by the rigid-body MC
simulation for 1nkl. The final average rmsd to the native structure was more than 13 Å, and the lowest rmsd was 11.2 Å, compared with the average rmsd of 9.3 Å and the lowest rmsd of 5.4 Å obtained with the elastic normal mode-based MC simulation. More importantly, with the traditional rigid-body MC, the lowest-rmsd final structure out of the ten folding trajectories had a noncompact structure with a nonnative topology. These results suggest the better performance of the normal mode-based MC protocol (detailed analysis of performance of the normal mode-based MC will be given in a separate paper).
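The rigid-body move used in this comparison can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the uniform amplitudes `max_shift` and `max_angle`, and the choice of a random axis through the block centroid for the rotation are all assumptions made here for concreteness.

```python
import math
import random

def rotate_about_axis(p, axis, angle):
    """Rodrigues rotation of vector p about a unit axis by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    dot = sum(a * b for a, b in zip(axis, p))
    cross = (axis[1] * p[2] - axis[2] * p[1],
             axis[2] * p[0] - axis[0] * p[2],
             axis[0] * p[1] - axis[1] * p[0])
    return tuple(p[k] * c + cross[k] * s + axis[k] * dot * (1 - c) for k in range(3))

def random_block_move(coords, blocks, rng, max_shift=0.5, max_angle=0.1):
    """Move one randomly chosen block; all other atoms are left untouched.

    coords: list of (x, y, z) tuples; blocks: list of lists of atom indices.
    Single-atom blocks (loop residues) are translated only; multi-atom
    blocks (helices) are translated and rotated as rigid bodies.
    """
    block = rng.choice(blocks)
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    new = list(coords)
    if len(block) == 1:
        i = block[0]
        new[i] = tuple(coords[i][k] + shift[k] for k in range(3))
        return new
    # Assumed rotation scheme: random axis through the block centroid.
    center = tuple(sum(coords[i][k] for i in block) / len(block) for k in range(3))
    raw = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(x * x for x in raw))
    axis = tuple(x / norm for x in raw)
    angle = rng.uniform(-max_angle, max_angle)
    for i in block:
        rel = tuple(coords[i][k] - center[k] for k in range(3))
        rot = rotate_about_axis(rel, axis, angle)
        new[i] = tuple(center[k] + rot[k] + shift[k] for k in range(3))
    return new
```

Each trial structure produced this way would then be passed to the same Metropolis test as in the normal mode-based protocol, so that only the proposal step differs between the two simulations.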
Discussion
In this paper, we reported a new, to our knowledge, computational protocol for determining the tertiary topology of small, globular helical proteins, or protein domains, from sequence. Knowledge-based potential functions and coarse-grained protein models (Cα traces) were used in the folding simulations. A novel, to our knowledge, protocol of MC simulation was developed in which the scale of the Hamiltonian used for the conformational random walk was separated from that of the Hamiltonian used for the Metropolis criterion. The random walk was based on elastic normal modes, and the Metropolis criterion was based on more accurate potential functions. Such a separation of Hamiltonians allows an effective sampling of the conformational space. It is important to point out that the potential functions used for the Metropolis criterion are not restricted and can be of any kind. Therefore, the protocol can be generalized to any case. Another distinct feature of this study was the utilization of SAXS data as soft constraints to assist folding. For the set of testing proteins we studied with both simulated and experimentally measured SAXS profiles, a satisfactory success rate was achieved.
Figure 7. Comparison of the Most Native-like Topology with Native Structure for Spectrin
The rmsd is 6.5 Å. This protein was folded by using published experimental SAXS data (Mehboob et al., 2003) as constraints.
We wish to emphasize that the purpose of making random configurational moves during MC simulation along the eigenvectors of elastic normal modes is to achieve the maximal sampling efficiency. Each normal mode is a single, but collective, degree of freedom. Moving along the low-frequency mode has the effect of achieving the largest structural change collectively with the smallest energetic cost. Our fundamental assumption is that, at any given instant, the direction of the conformational movement in the immediate future can be approximated by the normal modes calculated at that instant. Although the conventional concept of normal mode, as a harmonic approximation, does not seem to allow the system to cross the energy barrier along the eigenvector, in this case, the separation of scale of Hamiltonian overcomes this problem. As shown in our study, the structural movement along the coarse-grained elastic normal modes can efficiently cross energy barriers in the landscape defined by the more sophisticated Hamiltonian later used for the Metropolis criterion. It was shown in this study that the method was very effective for folding simulation because the folding process involves large-scale conformational rearrangement that drastically changes the shape of the molecule. Normal modes are good parameters to describe the conformational changes that
Figure 8. Comparison of the Convergence of Two MC Methods on
Protein 1nkl
The dotted line is for traditional rigid-body MC, and the solid line is
for elastic normal mode-based MC. It is clear that the normal
mode-based MC converges much faster and to a lower energy,
i.e., a more compact conformation.
substantially alter the overall shape of the molecule (Lu and Ma, 2005; Ma, 2004, 2005).
From our study, it seems that the success of folding depends more on the complexity of the structure and less on the absolute size of the structure. The three-helical protein that we successfully tested with experimental scattering data was relatively simple in its topology, but it was large in size compared with other systems we tested. Myoglobin was a good example in which the complexity led to the failure of simulation. However, we wish to point out that, even for myoglobin, in which the native topology was not found, the final folded structure had four major helices correctly positioned. The remaining two helices had their positions swapped, but their overall angles were still close to those in the native structure. In other words, if only the skeletons of secondary structure elements were concerned, all of the rough positions of helices would have been considered as correctly predicted. The results would provide a level of information similar to what can be obtained from intermediate-resolution cryo-EM density maps (Jiang et al., 2001; Kong and Ma, 2003; Kong et al., 2004).
In its current form, the protocol we used relies on the knowledge of secondary structure assignment on sequence. In a real application, the secondary structure assignment would have to be taken from computational prediction, which has about 80% accuracy in determining major secondary structures. Our method should even be able to handle cases in which small helices are missing in the sequence-based prediction. Of course, there are other factors that could affect the results of folding. For example, proteins containing large cofactors would probably be harder to fold by the current method than the ones without cofactors.
In conclusion, we believe that, for small, globular helical proteins or protein domains, it is feasible to utilize SAXS data to assist structure prediction to a level of accuracy that is roughly equivalent to, or better than, intermediate-resolution density maps from other structural biology experiments. However, SAXS has the unique advantage of being able to quickly collect data on small, soluble proteins. It also provides a useful alternative technique for fields such as structural genomics, since a SAXS experiment would be much easier to automate than X-ray crystallography. Together with the recent developments of other related computational methods (Costenaro et al., 2005; Davies et al., 2005; Kojima et al., 2004; Petoukhov et al., 2002; Svergun and Koch, 2002; Svergun et al., 2001; Walther et al., 2000; Zheng and Doniach, 2002, 2005) that also utilize SAXS data to model protein structures, we believe that these methods will eventually enable SAXS to become a mainstream experimental technique in the field of structural biology.
Experimental Procedures
Overall Procedure of the Simulation
To fold the proteins, we assumed that the positions of secondary structures (helices in this case) in the sequence are known. The simulations were based on the Cα traces of proteins. Each run started from a random conformation in which the loops were modeled as extended and helices were kept in a standard form, a model similar to the diffusion-collision model (Karplus and Weaver, 1979, 1994). The conformational space was then sampled by a novel, to our knowledge, MC protocol, which, in essence, separates the Hamiltonians between the random walk and Metropolis criterion. In each step of the simulation, a small set of low-frequency normal modes was first computed based on the Cα trace by elastic normal mode analysis (eNMA) (Atilgan et al., 2001) with the block normal mode (BNM) (Durand et al., 1994; Li and Cui, 2002; Tama et al., 2000) scheme for constructing the Hessian matrix (the helical regions were kept as rigid bodies during MC simulation). Then, the structure was changed along a randomly chosen mode by a random displacement. Different from what was used to compute normal modes, the energy function used to evaluate the acceptance included a short-range bonded term, a long-range nonbonded term, a hydrophobicity term, and a constraint derived from the SAXS profile (detailed below). Multiple trajectories were used to generate a set of folded structures, usually with quite different topologies. These structures were then used as candidates from which the topology that is the most similar to native topology (most native-like) was identified by using an effective procedure recently developed by us (Wu et al., 2005).
Protein Model and Initial Condition
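Each amino acid was represented by a single Cα atom. Two adjacent Cα atoms were assumed to be connected by a pseudo bond with an equilibrium length of 3.8 Å. The conformation of the Cα trace for a protein of $N$ residues was thus defined by $3N - 6$ parameters: $N - 1$ virtual bonds $\{l_i\}$, $N - 2$ virtual bond angles $\{\theta_i\}$, and $N - 3$ dihedral angles $\{\phi_i\}$. The Cartesian coordinates of the Cα trace of a protein were constructed by a method developed by Levitt and colleagues (Park and Levitt, 1995). The positions of the first three Cα atoms were given by:

$$
\begin{aligned}
x_1 &= 0, \quad y_1 = 0, \quad z_1 = 0 \\
x_2 &= 3.8, \quad y_2 = 0, \quad z_2 = 0 \\
x_3 &= x_2 + 3.8\cos(\pi - \theta_2) \\
y_3 &= y_2 + 3.8\sin(\pi - \theta_2) \\
z_3 &= 0.
\end{aligned}
\tag{1}
$$

The coordinates of the $i$th Cα atom $\mathbf{r}_i$ ($i \ge 4$) were calculated by:

$$
\mathbf{r}_i = \mathbf{r}_{i-1} + 3.8\cos(\pi - \theta_{i-1})\,\mathbf{u} + 3.8\sin(\pi - \theta_{i-1})\cos\phi_{i-2}\,\mathbf{v} + 3.8\sin(\pi - \theta_{i-1})\sin\phi_{i-2}\,\mathbf{w},
\tag{2}
$$

where three orthogonal unit vectors, $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{w}$, were defined as:

$$
\begin{aligned}
\mathbf{u} &= \frac{\mathbf{r}_{i-1} - \mathbf{r}_{i-2}}{|\mathbf{r}_{i-1} - \mathbf{r}_{i-2}|} \\
\mathbf{v} &= \frac{(\mathbf{r}_{i-3} - \mathbf{r}_{i-2}) - [(\mathbf{r}_{i-3} - \mathbf{r}_{i-2}) \cdot \mathbf{u}]\,\mathbf{u}}{|(\mathbf{r}_{i-3} - \mathbf{r}_{i-2}) - [(\mathbf{r}_{i-3} - \mathbf{r}_{i-2}) \cdot \mathbf{u}]\,\mathbf{u}|} \\
\mathbf{w} &= \mathbf{u} \times \mathbf{v}.
\end{aligned}
\tag{3}
$$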
Since we assumed that the secondary structures were known, the helical regions were fixed at the ideal angles (θ = 90°, φ = 50°) (Bahar et al., 1997). For residues in the loop regions, the bond angles and dihedrals were set as completely extended (θ = 120°, φ = 180°). By starting our simulations from a maximally extended conformation, we expected to avoid bias.
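The chain construction of Equations 1–3 can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than the authors' code: the function name `build_trace` and the zero-based list layout (`thetas[k]` is the bond angle at atom k + 2 in the one-based numbering of the text, `phis[k]` the dihedral that places atom k + 4) are choices made here. Angles are in radians, and a bond angle of exactly 180° is not handled because the vector v is then undefined.

```python
import math

BOND = 3.8  # pseudo bond length between adjacent C-alpha atoms, in angstroms

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return tuple(x / n for x in a)

def build_trace(thetas, phis):
    """Return C-alpha coordinates from N-2 bond angles and N-3 dihedrals (N >= 3)."""
    n_atoms = len(thetas) + 2
    assert len(phis) == n_atoms - 3
    t = math.pi - thetas[0]
    # Equation 1: the first three atoms lie in the xy plane.
    coords = [(0.0, 0.0, 0.0),
              (BOND, 0.0, 0.0),
              (BOND + BOND * math.cos(t), BOND * math.sin(t), 0.0)]
    for n in range(3, n_atoms):
        r1, r2, r3 = coords[n - 1], coords[n - 2], coords[n - 3]
        # Equation 3: local orthonormal frame (u, v, w).
        u = _unit(_sub(r1, r2))
        d = _sub(r3, r2)
        proj = _dot(d, u)
        v = _unit(tuple(dk - proj * uk for dk, uk in zip(d, u)))
        w = _cross(u, v)
        # Equation 2: place the next atom one bond length away.
        t = math.pi - thetas[n - 2]
        p = phis[n - 3]
        a = BOND * math.cos(t)
        b = BOND * math.sin(t) * math.cos(p)
        c = BOND * math.sin(t) * math.sin(p)
        coords.append(tuple(r1[k] + a * u[k] + b * v[k] + c * w[k] for k in range(3)))
    return coords
```

With θ = 120° and φ = 180° for every residue, the function returns a planar zigzag, which corresponds to the fully extended loop geometry described above; with θ = 90° and φ = 50° it returns the idealized helical geometry.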
Monte Carlo Simulation Protocol Based on Elastic Normal Modes
The sampling of conformational space during folding simulations was carried out by an MC protocol (Allen and Tildesley, 1980), which usually involves two major steps: (1) a random walk of the structure in configurational space, and (2) a Metropolis criterion for acceptance of the walk based on the Boltzmann distribution. In order to increase the sampling efficiency for conformational collapse during folding, we developed a novel, to our knowledge, protocol for implementing the MC simulation. The essence of this protocol is the separation of Hamiltonian in scale for the two steps. In the first step, each random walk of the structure was along a low-frequency eigenvector of elastic normal mode calculated from a coarse-grained Hamiltonian for the elastic network. Then, in the second Metropolis step, the Hamiltonian used to evaluate the energy change was based on more accurate potential functions.
Random Walk Based on Elastic Normal Modes
In the first random walk step, the low-frequency normal modes of the structure were calculated by using a simplified potential function,

$$
V = \frac{\gamma}{2} \sum_{ij} \sigma_{ij} \left( |\mathbf{r}_{ij}| - |\mathbf{r}^0_{ij}| \right)^2, \qquad
\sigma_{ij} =
\begin{cases}
1 & |\mathbf{r}^0_{ij}| \le r_c \\
0 & |\mathbf{r}^0_{ij}| > r_c
\end{cases}.
\tag{4}
$$

Here, $|\mathbf{r}_{ij}|$ and $|\mathbf{r}^0_{ij}|$ were the instantaneous and current values of distance, respectively, between Cα atom pair $i$ and $j$. The force constant $\gamma$ was uniformly set to one, and the cutoff distance $r_c$ was set to 13 Å. A distinct feature of elastic normal modes is that they are calculated from a potential function that regards the current coordinates as the equilibrium coordinates. Consequently, initial energy minimization was not required; thus, no additional structural distortion resulting from energy minimization would occur, which made it possible to carry out this normal mode-based Monte Carlo protocol step by step to propagate the simulation.
Furthermore, since the helices were treated as rigid bodies, we
used the well-established BNM approach (Durand et al., 1994; Li
and Cui, 2002; Tama et al., 2000) to change the conformation of
the protein. The BNM method first partitions a structure into n
blocks. In our model, one block was chosen for each helix or each
residue in the loop regions.
For a particular conformation of a protein, a subset of low-frequency modes was calculated based on the current Cα network by using BNM and the elastic harmonic potential (Equation 4). The structure was then moved along a randomly chosen mode with a random displacement. The SHAKE algorithm (Ryckaert et al., 1977) was applied to keep the pseudo bond length unchanged. The newly updated structure was either accepted or rejected based on the Metropolis criterion according to a different Hamiltonian described in the next section. If it was accepted, the newly updated structure was set as the current structure of the protein for the next cycle.
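One cycle of this scheme can be sketched with NumPy as below. The sketch is deliberately simplified and should not be read as the authors' implementation: it builds a full residue-level elastic network Hessian instead of using BNM, it omits the SHAKE bond-length correction, and the names `anm_hessian`, `low_frequency_modes`, `mc_step`, `max_amp`, and `n_modes` are hypothetical. The energy used in the acceptance test is passed in as `energy_fn`, which reflects the separation between the Hamiltonian that generates the moves and the one that judges them.

```python
import numpy as np

def anm_hessian(coords, cutoff=13.0, gamma=1.0):
    """Hessian of the elastic network (Equation 4) at the current coordinates."""
    n = len(coords)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = d @ d
            if dist2 > cutoff ** 2:
                continue  # sigma_ij = 0 beyond the cutoff
            block = -gamma * np.outer(d, d) / dist2
            hess[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            hess[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            hess[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            hess[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    return hess

def low_frequency_modes(coords, n_modes=5, cutoff=13.0):
    """Lowest non-trivial modes as an array of shape (n_modes, n_atoms, 3).

    Assumes a connected, non-collinear network, so that exactly six
    zero-eigenvalue rigid-body modes come first and can be skipped.
    """
    _, vecs = np.linalg.eigh(anm_hessian(coords, cutoff))
    return vecs[:, 6:6 + n_modes].T.reshape(n_modes, -1, 3)

def mc_step(coords, energy_fn, rng, kT=1.0, max_amp=1.0, n_modes=5):
    """Propose a move along a random low-frequency mode; accept by Metropolis."""
    modes = low_frequency_modes(coords, n_modes)
    mode = modes[rng.integers(len(modes))]
    trial = coords + rng.uniform(-max_amp, max_amp) * mode
    d_e = energy_fn(trial) - energy_fn(coords)
    if d_e <= 0 or rng.random() < np.exp(-d_e / kT):
        return trial, True
    return coords, False
```

Because the modes are recomputed from whatever structure is current, no energy minimization is needed between steps, which is the property of elastic normal modes that the text relies on.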
Hamiltonian Used in Metropolis Criterion
The energy function we used in the Metropolis update can be expressed as:

$$
U_{\mathrm{Total}} = U_{\mathrm{Angle}} + U_{\mathrm{vdW}} + U_{\mathrm{Nonbonded}} + U_{\mathrm{Hydrophobic}} + U_{\mathrm{Scattering}}.
\tag{5}
$$

The first term on the right is for the pseudo bond angle constraint, which can be maintained by a harmonic potential,

$$
U_{\mathrm{Angle}} = \sum_i \frac{1}{2} k_\theta \left( \frac{\theta_i - \theta_0}{\theta_v} \right)^2.
\tag{6}
$$

The equilibrium bond angle $\theta_0$ was set to 105°, the constant $\theta_v$ was set to 15°, and the force constant $k_\theta$ was 2.
In order to prevent the protein model from interresidue collapse,
a pseudo van der Waals potential, UvdW, was added with the form
of (Berriz and Shakhnovich, 2001):
$$U_{\mathrm{vdW}} = u_0 \left[ \left( \frac{d_0}{d_{ij}} \right)^{12} - 2 \left( \frac{d_0}{d_{ij}} \right)^{6} \right], \qquad (7)$$
where $d_{ij}$ was the instantaneous distance between residues i and j, $d_0$
was a constant distance set to 5 Å, which was the most favorable
distance between two adjacent Cα atoms, and $u_0$ was set to 1.
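The shape of Equation 7 is easy to check numerically. The sketch below evaluates the pair term only; the function name is hypothetical, and which residue pairs enter the total is not specified in the text, so no summation is shown.

```python
def vdw_pair_energy(d_ij, d0=5.0, u0=1.0):
    """Pseudo van der Waals energy of one residue pair (distances in angstroms)."""
    ratio6 = (d0 / d_ij) ** 6
    # (d0/d)^12 - 2 (d0/d)^6: minimum of -u0 at d = d0, steep wall below d0.
    return u0 * (ratio6 ** 2 - 2.0 * ratio6)
```

The term reaches its minimum of $-u_0$ at $d_{ij} = d_0$ and rises steeply at shorter distances, which is what penalizes interresidue collapse.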
The long-range nonbonded interactions were modeled by a
residue-specific, distance-dependent potential extracted from the
structural database by Bahar and Jernigan (Bahar and Jernigan,
1997). The knowledge-based potential can be written as:
$$U_{\mathrm{Nonbonded}} = \sum_{i<j} \bar{u}(i, j, r), \qquad (8)$$
in which i and j were residue indices, r was the distance between the
two residues, and $\bar{u}(i, j, r)$ was the energy parameter.
Since hydrophobic residues tend to be buried in the proteins, we
constructed a crude penalty term to mimic this hydrophobic effect.
A group of hydrophobic residues on the helices with the highest hy-
drophobicity indices (Kyte and Doolittle, 1982), such as Ile, Leu, Val,
or Phe, were chosen. Then, the sum of squared distances between all
pairs of the chosen hydrophobic residues was calculated as:
$$U_{\mathrm{Hydrophobic}} = \frac{1}{2} \sum_{i,j \in HC} \left\| \vec{r}_i - \vec{r}_j \right\|^2. \qquad (9)$$
Here, HC stands for the chosen hydrophobic core. This term has
the effect of "pulling" the chosen hydrophobic core toward
a more globular packing geometry. Our experience indicates that it
was better to choose a set of hydrophobic residues with very high
hydrophobicity indices than to include all of the hydrophobic
residues in the sequence. This is perhaps because those residues
tend to be deeply buried in tertiary structures.
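A minimal Python sketch of the hydrophobic penalty follows. The names `HYDROPHOBIC_CORE_TYPES`, `select_core`, and `hydrophobic_energy` are hypothetical, and the residue set simply reuses the four examples named in the text. The double sum is read as running over ordered pairs, so that the factor 1/2 counts each pair once; that reading is an assumption.

```python
# Residue types taken from the examples in the text (three-letter codes).
HYDROPHOBIC_CORE_TYPES = {"ILE", "LEU", "VAL", "PHE"}


def select_core(residue_names, in_helix):
    """Indices of helix residues whose type is in the chosen hydrophobic set."""
    return [
        i
        for i, (name, helical) in enumerate(zip(residue_names, in_helix))
        if helical and name in HYDROPHOBIC_CORE_TYPES
    ]


def hydrophobic_energy(coords, core):
    """Half the sum of squared distances over ordered pairs in the core."""
    total = 0.0
    for i in core:
        for j in core:
            total += sum((p - q) ** 2 for p, q in zip(coords[i], coords[j]))
    return 0.5 * total
```

Because the term grows with the squared separation of core residues, lowering it draws the selected residues together.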
Finally, the SAXS profile of proteins that gives information on their
size and shape was incorporated as a constraint for the folding pro-
cess by adopting a term in the potential function, UScattering, con-
cerning the scattering profile as:
$$U_{\mathrm{Scattering}} = w \sum_{i=1}^{n} \left| I_T(s_i) - I_M(s_i) \right|^2, \qquad (10)$$
where $s_i$ is the scattering vector, and $I_T(s_i)$ and $I_M(s_i)$ are the
scattering intensities of the target structure and the modeled structure,
respectively.
The constant w is a weight for balancing the contribution with re-
spect to other energy terms. Similar to what has been reported in
the literature (Walther et al., 2000; Zheng and Doniach, 2002), the
scattering intensity I(s) of a model formed by N beads can be calcu-
lated by the Debye equation:
$$I(s) = N + 2 \sum_{i<j}^{N} \frac{\sin(2\pi s r_{ij})}{2\pi s r_{ij}}, \qquad (11)$$
where $r_{ij}$ is the distance between a pair of beads, each pair being
counted once. For larger systems, the above N × N calculation can be
replaced by an alternative Debye equation in its pair-distance
histogram form:
$$I(s) = N + 2 \sum_{i=1}^{n_{\mathrm{bins}}} g(r_i) \frac{\sin(2\pi |r_i| s)}{2\pi |r_i| s}, \qquad (12)$$
where $g(r_i)$ is the pair-distance histogram of all singly counted
pairwise distances, and the number of bins is $n_{\mathrm{bins}}$. To numerically
represent the scattering profile I(s), the scattering vector s was
discretized with the interval ds = 0.001 Å⁻¹. So, if the maximal s was
set to 0.1 Å⁻¹, then the value of n in Equation 10 would be 100. The
weight factor, w, was adjusted to a value at which the scattering term
roughly made up one-fifth of the contribution to the total energy
function in Equation 5.
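The scattering term can be illustrated with a short Python sketch of Equations 10 to 12. It is a sketch under stated assumptions rather than the authors' code: all function names, the bin width of 0.5 Å, the use of bin centers as representative distances, and the choice to start the s grid at 0.001 Å⁻¹ are hypothetical.

```python
import math


def sinc(x):
    """sin(x)/x with the limiting value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def pair_distances(coords):
    """All singly counted pairwise distances between beads."""
    return [
        math.dist(a, b)
        for idx, a in enumerate(coords)
        for b in coords[idx + 1:]
    ]


def debye_intensity(coords, s):
    """Direct Debye sum over bead pairs (Equation 11)."""
    n = len(coords)
    return n + 2.0 * sum(sinc(2.0 * math.pi * s * r) for r in pair_distances(coords))


def debye_intensity_hist(coords, s, bin_width=0.5):
    """Histogram form (Equation 12); the bin width is an assumed value."""
    counts = {}
    for r in pair_distances(coords):
        b = int(r / bin_width)
        counts[b] = counts.get(b, 0) + 1
    n = len(coords)
    # Each bin is represented by its center distance (an assumption).
    return n + 2.0 * sum(
        g * sinc(2.0 * math.pi * s * (b + 0.5) * bin_width)
        for b, g in counts.items()
    )


def scattering_energy(target_profile, model_coords, s_values, w=1.0):
    """Weighted squared difference between target and model profiles (Equation 10)."""
    return w * sum(
        abs(i_t - debye_intensity(model_coords, s)) ** 2
        for i_t, s in zip(target_profile, s_values)
    )


# Grid with interval 0.001 and maximum 0.1 (inverse angstroms): n = 100 points.
S_VALUES = [0.001 * k for k in range(1, 101)]
```

In the histogram form, the distances are binned once per structure and each value of s then costs one pass over the bins rather than over all pairs. The price is a small discretization error that depends on the bin width.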
After the new structure was generated by a random walk along a
normal mode, the acceptance of this move was based on the Metropolis
criterion: the new conformation was accepted with a probability p of:
$$p \sim \exp\left[ -\frac{U_{\mathrm{Total}} - U^{0}_{\mathrm{Total}}}{C} \right], \qquad (13)$$
where $U_{\mathrm{Total}}$ was the energy of the new conformation, $U^{0}_{\mathrm{Total}}$ was the
energy of the old conformation, and C was a constant related to the
Monte Carlo temperature that was empirically chosen. The initial
value of C ranged from 1.5 to 3 depending on the size of the protein;
generally, larger proteins were given a larger value of C.
Additionally, C was decreased in the final steps of the simulation so
that the structure could converge to a lower-energy state.
Correspondingly, the maximal displacement along the normal mode was
also decreased in the final steps so that the conformational space
could be finely sampled before the simulation converged.
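The acceptance test of Equation 13 can be written compactly. The sketch below uses the standard Metropolis convention in which downhill moves are always accepted; the function name and the optional `u01` argument (a uniform random number supplied by the caller, useful for reproducible checks) are hypothetical. The schedule for lowering C is not given in the text and is therefore not shown.

```python
import math
import random


def metropolis_accept(u_new, u_old, c, u01=None):
    """Return True if the trial conformation should be accepted.

    u_new, u_old: total energies of the new and old conformations.
    c: constant related to the Monte Carlo temperature.
    u01: optional uniform random number in [0, 1); drawn if omitted.
    """
    if u_new <= u_old:
        return True  # downhill moves are always accepted
    if u01 is None:
        u01 = random.random()
    return u01 < math.exp(-(u_new - u_old) / c)
```

For an uphill move of one energy unit, the acceptance probability is about 0.37 at C = 1 and about 0.72 at C = 3, so a larger C accepts uphill moves more readily.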
Final Identification of Native Topology
Using the coarse-grained potential for inter-residue interactions and
other simple constraints, one naturally cannot expect every simulation
trajectory to converge to exactly the same native topology. The
problem of deriving a three-dimensional model from a one-dimensional
scattering profile is fundamentally underdetermined, especially
when using solution scattering information. Therefore, for each
case, multiple trajectories were generated independently, which led
to a number of compact structures that were similar in shape and ra-
dius of gyration, but different in tertiary packing of helices. A major
task then was to identify the most native-like topology among all
of the folded structures. To do so, we adopted an energy-based
screening procedure developed in our previous study (Wu et al.,
2005) that is briefly outlined below.
For each final folded structure, only the axes and directions of
helices (as vectors) and their corresponding sequence identities were
extracted from the Cα coordinates. These vectors represented the
packing geometry of a particular structure. Then, an ensemble of
slightly perturbed structures was constructed to represent the
conformational variations around the given topology. All of the
structures in the ensemble were generated as a Cα-based trace
representation. Each structure in the ensemble was constructed in two
steps. The first step was helix placement: the standard helical
segments were placed based on their extracted helical axes. To allow
for enough structural deviations, the x, y, and z coordinates of the
centers of mass of the standard helical segments were perturbed
within a sphere of radius 1.0 Å. The three angles of rotation around
the centers of mass were also randomized within a range of 5°.
The second step was loop construction once the helices were all
in place, which was done by an off-lattice MC procedure. First, the
loops were built up to their full length in an extended conformation.
Then, they were allowed to relax based on MC movement to connect
two neighboring helices.
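The helix-placement perturbation can be sketched as follows. The function name is hypothetical, and two details are assumptions: the shift is drawn uniformly inside the sphere by rejection sampling, and "within a range of 5°" is read as each angle lying between -5° and +5°. Building the helix coordinates and the loops is not shown.

```python
import math
import random


def perturb_helix_placement(center, max_shift=1.0, max_angle_deg=5.0, rng=random):
    """Return a perturbed center of mass and three rotation angles (degrees)."""
    # Rejection sampling: draw from the enclosing cube until the shift
    # vector falls inside the sphere of radius max_shift.
    while True:
        shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
        if math.sqrt(sum(v * v for v in shift)) <= max_shift:
            break
    new_center = tuple(c + v for c, v in zip(center, shift))
    # Assumed interpretation: each angle is uniform in [-max_angle_deg, +max_angle_deg].
    angles = tuple(rng.uniform(-max_angle_deg, max_angle_deg) for _ in range(3))
    return new_center, angles
```

Calling this once per helix and per ensemble member would give the slightly perturbed starting placements around one topology.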
After the construction of all perturbed structures in the ensemble,
all members of the ensemble were separately energy refined by
global optimization, which included genetic algorithm (GA)
optimization for rotations of the helices and MC relaxation for loop
regions. By iteratively applying the procedure, all structures were
guided globally to energetically more favorable states. After global
optimization of all members of the structural ensemble, the average
energy for the whole ensemble was evaluated. The average energies
of all folded structures were then compared to locate the one with the
lowest energy. Our previous study (Wu et al., 2005) indicated that,
even with severe errors inherent in the constructed structures and
limited accuracy in the coarse-grained potential functions, the
average energy of a slightly perturbed ensemble of structures around
a given topology was a much more robust evaluator for the topology
than any individual member in the ensemble, no matter how
extensively the structures in the ensemble were optimized.
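The final selection step reduces to comparing ensemble averages. A minimal sketch, assuming the refined energies of each ensemble are already available, is shown below; the function name, the dictionary layout, and the energy values in the usage example are hypothetical and purely illustrative.

```python
def rank_topologies(ensemble_energies):
    """Pick the topology whose perturbed ensemble has the lowest mean energy.

    ensemble_energies: dict mapping a topology label to the list of refined
    energies of its ensemble members.
    """
    averages = {
        label: sum(energies) / len(energies)
        for label, energies in ensemble_energies.items()
    }
    best = min(averages, key=averages.get)
    return best, averages


# Made-up numbers: topology "B" holds the single lowest-energy member,
# but topology "A" has the lower ensemble average and is selected.
example = {"A": [-10.0, -9.0, -11.0], "B": [-15.0, -2.0, -4.0]}
best_label, mean_energies = rank_topologies(example)
```

The example shows why the two criteria can disagree: ranking by the best individual member would choose "B", whereas ranking by the ensemble average chooses "A".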
Acknowledgments
We wish to thank Tom Irving for helpful discussion in the early stage
of the project. The authors gratefully acknowledge the support from
the National Institutes of Health (R01-GM067801). M.C. and M.L. are
partially supported by predoctoral fellowships from the W.M. Keck
Foundation of the Gulf Coast Consortia through the Keck Center
for Computational and Structural Biology.
Received: May 25, 2005
Revised: July 21, 2005
Accepted: July 22, 2005
Published: November 8, 2005
References
Allen, M.P., and Tildesley, D.J. (1980). Computer Simulation of
Liquids (Oxford, UK: Clarendon Press).
Atilgan, A.R., Durell, S.R., Jernigan, R.L., Demirel, M.C., Keskin, O.,
and Bahar, I. (2001). Anisotropy of fluctuation dynamics of proteins
with an elastic network model. Biophys. J. 80, 505–515.
Bahar, I., and Jernigan, R.L. (1997). Inter-residue potentials in glob-
ular proteins and the dominance of highly specific hydrophilic inter-
actions at close separation. J. Mol. Biol. 266, 195–214.
Bahar, I., Kaplan, M., and Jernigan, R.L. (1997). Short-range confor-
mational energies, secondary structure propensities, and recogni-
tion of correct sequence-structure matches. Proteins 29, 292–308.
Berriz, G.F., and Shakhnovich, E.I. (2001). Characterization of the
folding kinetics of a three-helix bundle protein via a minimalist Lan-
gevin model. J. Mol. Biol. 310, 673–685.
Blundell, T.L., Sibanda, B.L., Sternberg, M.J., and Thornton, J.M.
(1987). Knowledge-based prediction of protein structures and the
design of novel molecules. Nature 326, 347–352.
Boczko, E.M., and Brooks, C.L., 3rd. (1995). First-principles calcula-
tion of the folding free energy of a three-helix bundle protein. Sci-
ence 269, 393–396.
Bowie, J.U., Luthy, R., and Eisenberg, D. (1991). A method to identify
protein sequences that fold into a known three-dimensional struc-
ture. Science 253, 164–170.
Cardenas, A.E., and Elber, R. (2003). Kinetics of cytochrome C fold-
ing: atomically detailed simulations. Proteins 51, 245–257.
Costenaro, L., Grossmann, J.G., Ebel, C., and Maxwell, A. (2005).
Small-angle X-ray scattering reveals the solution structure of the
full-length DNA gyrase A subunit. Structure 13, 287–296.
Davies, J.M., Tsuruta, H., May, A.P., and Weis, W.I. (2005). Confor-
mational changes of p97 during nucleotide hydrolysis determined
by small-angle X-ray scattering. Structure 13, 183–195.
Dill, K.A., and Chan, H.S. (1997). From Levinthal to pathways to fun-
nels. Nat. Struct. Biol. 4, 10–19.
Dinner, A.R., Sali, A., Smith, L.J., Dobson, C.M., and Karplus, M.
(2000). Understanding protein folding via free-energy surfaces
from theory and experiment. Trends Biochem. Sci. 25, 331–339.
Duan, Y., and Kollman, P.A. (1998). Pathways to a protein folding in-
termediate observed in a 1-microsecond simulation in aqueous so-
lution. Science 282, 740–744.
Durand, P., Trinquier, G., and Sanejouand, Y.H. (1994). New ap-
proach for determining low-frequency normal-modes in macromole-
cules. Biopolymers 34, 759–771.
Elofsson, A., Fischer, D., Rice, D.W., Le Grand, S.M., and Eisenberg,
D. (1996). A study of combined structure/sequence profiles. Fold
Des. 1, 451–461.
Guo, Z., and Thirumalai, D. (1995). Kinetics of protein folding: nucle-
ation mechanism, time scales, and pathways. Biopolymers 36,
83–102.
Jiang, W., Baker, M.L., Ludtke, S.J., and Chiu, W. (2001). Bridging
the information gap: computational tools for intermediate resolution
structure interpretation. J. Mol. Biol. 308, 1033–1044.
Jones, D.T. (1999). Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292, 195–202.
Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). A new ap-
proach to protein fold recognition. Nature 358, 86–89.
Karplus, M., and Weaver, D.L. (1979). Diffusion-collision model for
protein folding. Biopolymers 18, 1421–1437.
Karplus, M., and Weaver, D.L. (1994). Protein folding dynamics: the
diffusion-collision model and experimental data. Protein Sci. 3, 650–
668.
Kojima, K., Timchenko, A.A., Higo, J., Ito, K., Kihara, H., and Takaha-
shi, K. (2004). Structural refinement by restrained molecular-dynam-
ics algorithm with small-angle X-ray scattering constraints for a bio-
molecule. J. Appl. Crystallogr. 37, 103–109.
Kong, Y., and Ma, J. (2003). A structural-informatics approach for
mining β-sheets: locating sheets in intermediate-resolution density
maps. J. Mol. Biol. 332, 399–413.
Kong, Y., Zhang, X., Baker, T.S., and Ma, J. (2004). A structural-informatics
approach for tracing β-sheets: building pseudo-Cα traces for
β-strands in intermediate-resolution density maps. J. Mol. Biol. 339,
117–130.
Kyte, J., and Doolittle, R.F. (1982). A simple method for displaying
the hydropathic character of a protein. J. Mol. Biol. 157, 105–132.
Li, G., and Cui, Q. (2002). A coarse-grained normal mode approach
for macromolecules: an efficient implementation and application
to Ca(2+)-ATPase. Biophys. J. 83, 2457–2474.
Lin, K., Simossis, V.A., Taylor, W.R., and Heringa, J. (2005). A simple
and fast secondary structure prediction method using hidden neural
networks. Bioinformatics 21, 152–159.
Liwo, A., Khalili, M., and Scheraga, H.A. (2005). Ab initio simulations
of protein-folding pathways by molecular dynamics with the united-
residue model of polypeptide chains. Proc. Natl. Acad. Sci. USA 102,
2362–2367.
Lu, M., and Ma, J. (2005). The role of shape in determining molecular
motion. Biophys. J. 4, 2395–2401.
Ma, J. (2004). New advances in normal mode analysis of supermo-
lecular complexes and applications to structural refinement. Curr.
Protein Pept. Sci. 5, 119–123.
Ma, J. (2005). Usefulness and limitations of normal mode analysis in
modeling dynamics of biomolecular complexes. Structure 13, 373–
380.
Mehboob, S., Jacob, J., May, M., Kotula, L., Thiyagarajan, P.,
Johnson, M.E., and Fung, L.W. (2003). Structural analysis of the α
N-terminal region of erythroid and nonerythroid spectrins by
small-angle X-ray scattering. Biochemistry 42, 14702–14710.
Moult, J., Fidelis, K., Zemla, A., and Hubbard, T. (2003). Critical as-
sessment of methods of protein structure prediction (CASP)-round
V. Proteins 53 (Suppl 6), 334–339.
Onuchic, J.N., Luthey-Schulten, Z., and Wolynes, P.G. (1997). The-
ory of protein folding: the energy landscape perspective. Annu.
Rev. Phys. Chem. 48, 545–600.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B.,
and Thornton, J.M. (1997). CATH–a hierarchic classification of pro-
tein domain structures. Structure 5, 1093–1109.
Pande, V.S., Baker, I., Chapman, J., Elmer, S.P., Khaliq, S., Larson,
S.M., Rhee, Y.M., Shirts, M.R., Snow, C.D., Sorin, E.J., and Zagrovic,
B. (2003). Atomistic protein folding simulations on the submillisec-
ond time scale using worldwide distributed computing. Biopolymers
68, 91–109.
Park, B.H., and Levitt, M. (1995). The complexity and accuracy of
discrete state models of protein structure. J. Mol. Biol. 249, 493–507.
Pedersen, J.T., and Moult, J. (1997). Protein folding simulations with
genetic algorithms and a detailed molecular description. J. Mol. Biol.
269, 240–259.
Petoukhov, M.V., Eady, N.A., Brown, K.A., and Svergun, D.I. (2002).
Addition of missing loops and domains to protein models by x-ray
solution scattering. Biophys. J. 83, 3113–3125.
Ryckaert, J.P., Ciccotti, G., and Berendsen, H.J.C. (1977). Numerical
integration of the cartesian equations of motion of a system with
constraints: molecular dynamics of n-alkanes. J. Comput. Phys.
23, 327–341.
Sali, A., Shakhnovich, E., and Karplus, M. (1994). How does a protein
fold? Nature 369, 248–251.
Simons, K.T., Kooperberg, C., Huang, E., and Baker, D. (1997). As-
sembly of protein tertiary structures from fragments with similar lo-
cal sequences using simulated annealing and Bayesian scoring
functions. J. Mol. Biol. 268, 209–225.
Srinivasan, R., Fleming, P.J., and Rose, G.D. (2004). Ab initio protein
folding using LINUS. Methods Enzymol. 383, 48–66.
Svergun, D.I., and Koch, M.H. (2002). Advances in structure analysis
using small-angle scattering in solution. Curr. Opin. Struct. Biol. 12,
654–660.
Svergun, D.I., Petoukhov, M.V., and Koch, M.H. (2001). Determina-
tion of domain structure of proteins from X-ray solution scattering.
Biophys. J. 80, 2946–2953.
Tama, F., Gadea, F.X., Marques, O., and Sanejouand, Y.H. (2000).
Building-block approach for determining low-frequency normal
modes of macromolecules. Proteins 41, 1–7.
Walther, D., Cohen, F.E., and Doniach, S. (2000). Reconstruction
of low-resolution three-dimensional density maps from one-
dimensional small-angle x-ray solution scattering data for biomole-
cules. J. Appl. Crystallogr. 33, 350–363.
Wedemeyer, W.J., Welker, E., Narayan, M., and Scheraga, H.A.
(2000). Disulfide bonds and protein folding. Biochemistry 39,
4207–4216.
Wolynes, P.G., Onuchic, J.N., and Thirumalai, D. (1995). Navigating
the folding routes. Science 267, 1619–1620.
Wu, Y., Chen, M., Lu, M., Wang, Q., and Ma, J. (2005). Determining
protein topology from skeletons of secondary structures. J. Mol.
Biol. 350, 571–586.
Zhang, Y., and Skolnick, J. (2004). Tertiary structure predictions on
a comprehensive benchmark of medium to large size proteins. Bio-
phys. J. 87, 2647–2655.
Zheng, W., and Doniach, S. (2002). Protein structure prediction con-
strained by solution X-ray scattering data and structural homology
identification. J. Mol. Biol. 316, 173–187.
Zheng, W., and Doniach, S. (2005). Fold recognition aided by con-
straints from small angle X-ray scattering data. Protein Eng. Des.
Sel. 18, 209–219.
Zhou, Y., and Karplus, M. (1999). Interpreting the folding kinetics of
helical proteins. Nature 401, 400–403.