Citation: Joseph D. Bryngelson, J. Chem. Phys. 100, 6038 (1994); doi: 10.1063/1.467114. Published by the American Institute of Physics.


When is a potential accurate enough for structure prediction? Theory and application to a random heteropolymer model of protein folding

Joseph D. Bryngelson Physical Sciences Laboratory, Division of Computer Research and Technology, National Institutes of Health, Bethesda, Maryland 20892

(Received 31 December 1992; accepted 7 January 1994)

Attempts to predict molecular structure often try to minimize some potential function over a set of structures. Much effort has gone into creating potential functions and into creating algorithms for minimizing these potential functions. This paper develops a formalism that addresses a complementary question: What are the accuracy requirements for a potential function that predicts molecular structure? The formalism is applied to a simple model of a protein structure potential. The results of this calculation suggest that high accuracy predictions (~1 Å rms deviation in α-carbon positions) of protein structures require monomer-monomer interaction energies accurate to within 5% to 15%. The paper closes with a discussion of the implications of these results for practical structure prediction.

I. INTRODUCTION

The theoretical prediction of the structure of a molecule or an assembly of molecules, such as a cluster, frequently involves the minimization of a potential function. Examples of this activity range from using sophisticated techniques of modern quantum chemistry to obtain high accuracy predictions of the structures of small molecules in the gas phase, to using semiempirical potentials of mean force to predict the structures of macromolecules in solution. Particularly for large molecules, much effort has gone into developing accurate potentials that require tractable amounts of computer time for their evaluation, and into developing efficient algorithms for finding the deepest minimum of these potentials. This paper addresses another aspect of structure prediction: the accuracy required of a potential that predicts molecular structure. If the potential function is not sufficiently accurate, then even the best minimization algorithm is useless for predicting structure. However, if the accuracy requirements are known, then definite goals for potential construction exist, and researchers can concentrate on problems that are solvable with the present potentials. The formalism derived here is general, and applicable to any structure prediction problem. An important unsolved problem in molecular structure prediction is the prediction of protein structure from amino acid sequence. This paper applies the formalism to a simple protein model to estimate the accuracy needed for a potential that predicts protein structure.

This paper discusses the problem of predicting the full three-dimensional or tertiary structure of a protein. Most attempts to predict protein tertiary structure are, at least implicitly, based on the thermodynamic hypothesis, which states that a protein in solution folds to the configuration that minimizes the free energy of the protein-solvent system.1 Typically the solvent is water. This hypothesis suggests a general strategy for predicting protein tertiary structure from sequence. First, one develops a semiempirical potential function which approximates the free energy of the protein-solvent system as a function of the three-dimensional configuration of the protein. Next, one attempts to solve the problem by finding the configuration that minimizes the approximate potential function. This configuration is the predicted protein structure. Although the above general strategy has successfully predicted the structure of small polypeptides,2 it has failed to predict the structure of globular proteins. This failure has typically been attributed to the difficulty of finding the minimum of the potential functions; much effort has gone into algorithms for optimizing these potential functions. The analysis presented here is a first step toward understanding the possible role of inaccuracies in potential functions in the failure to predict protein structure.

The protein calculation presented here has a forerunner, a paper by Shaknovitch and Gutin on the probability of a neutral mutation in a protein.3 Their calculation relates to a special case of the calculation presented here. The present paper also makes explicit the mathematical approximations and the notion of structure implicit in the mutation paper. I use some of the same notation as the mutation paper so the reader can easily compare the two papers afterwards.

In the following section, I define how the term "structure" is used in this paper, and pose the potential accuracy question in a mathematically precise manner. In Secs. III and IV the potential accuracy problem is formally solved for some simple cases. A simple model of a protein potential function is described in Sec. V and the results of Secs. III and IV are applied to this model in Sec. VI. The paper concludes with a critical discussion of these results, their implications for real protein structure prediction, and a look at some future directions. Readers not interested in technical details should skip Sec. VI.

II. POSING THE PROBLEM

Consider a representation of the arrangement of atoms in three-dimensional space. Coarse grain this configuration space so there are a countable number of discrete states. If the representation of the arrangement of atoms was discrete to begin with, then there is no need to coarse-grain. When I refer to a "structure" or (for emphasis) "discrete structure" in this paper, I mean one of these discrete states. Label each structure with an integer. I denote the value of the potential function for structure $i$ by $E_i$, and for the sake of brevity I refer to $E_i$ as the "energy" of structure $i$, even though it is really a free energy or the value of a potential of mean force.

J. Chem. Phys. 100 (8), 15 April 1994. 0021-9606/94/100(8)/6038/8/$6.00 © 1994 American Institute of Physics

The notion of coarse graining and discrete structure may become more clear with a simple example. Consider a small molecule with only one flexible bond, so the spatial arrangement of atoms can be specified by denoting a bond angle, $\theta$. To coarse grain the bond angle, I could consider the molecule to be in structure one if $0^\circ \le \theta < 10^\circ$, structure two if $10^\circ \le \theta < 20^\circ$, and so on. The value of a potential function for each of these discrete structures must be determined by some fixed algorithm. To continue this example, if I am given a continuous potential $V^{\mathrm{cont.}}_{\mathrm{approx.}}(\theta)$, then I could define the value of this potential for structure one to be $E_1 = V^{\mathrm{cont.}}_{\mathrm{approx.}}(5^\circ)$, for structure two to be $E_2 = V^{\mathrm{cont.}}_{\mathrm{approx.}}(15^\circ)$, and so on. I can specify the spatial arrangement of atoms more or less accurately by making the discretization finer or coarser. The formalism discussed in this paper is independent of the manner of representing the arrangement of atoms and the details of the coarse-graining procedure, which may be chosen to suit the application at hand.
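The bond-angle example can be sketched in a few lines of Python. This is only an illustration: the function names and the single-well `toy_potential` are hypothetical and do not come from the paper; the 10° bins and the use of bin midpoints (5°, 15°, ...) follow the example in the text.

```python
import math

def coarse_grain(theta_deg, bin_width=10.0):
    """Map a bond angle in degrees to a 1-based discrete structure label.

    Structure 1 covers 0 <= theta < 10, structure 2 covers 10 <= theta < 20, ...
    """
    return int((theta_deg % 360.0) // bin_width) + 1

def discrete_energies(v_cont, bin_width=10.0):
    """Assign each discrete structure the continuous potential at its bin midpoint."""
    n_bins = int(round(360.0 / bin_width))
    return [v_cont((k + 0.5) * bin_width) for k in range(n_bins)]

def toy_potential(theta_deg):
    """Hypothetical single-well potential (arbitrary units), minimum at 65 degrees."""
    return 1.0 - math.cos(math.radians(theta_deg - 65.0))

energies = discrete_energies(toy_potential)
# The predicted structure is the discrete state with the lowest energy.
predicted = min(range(len(energies)), key=energies.__getitem__) + 1
print("predicted structure:", predicted)
```

Changing `bin_width` makes the discretization finer or coarser without changing anything else, which is the sense in which the formalism does not depend on the details of the coarse-graining procedure.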

The approximate potential $V_{\mathrm{approx.}}(q)$ assigns a real number to each structure $q$. Denote by $q^{\mathrm{approx.}}_{\min}$ the discrete structure with the lowest $V_{\mathrm{approx.}}(q)$. This is the predicted structure. There is also some real, exact energy of the molecule or the molecule-solvent system, given by the potential $V_{\mathrm{real}}(q)$. The potential $V_{\mathrm{real}}(q)$ also assigns a real number to each of the structures. Denote by $q^{\mathrm{real}}_{\min}$ the discrete structure with the lowest $V_{\mathrm{real}}(q)$. This is the structure found in nature. Therefore, $V_{\mathrm{approx.}}(q)$ predicts the correct structure when $q^{\mathrm{real}}_{\min} = q^{\mathrm{approx.}}_{\min}$. The approximate potential function has inaccuracies due to inaccuracies in parameters, neglect of physical effects, and so on. Denote the errors in $V_{\mathrm{approx.}}(q)$ by

$$\delta E(q) = V_{\mathrm{approx.}}(q) - V_{\mathrm{real}}(q), \tag{1}$$

which may be thought of as noise added to the true potential. Notice that the value of $\delta E(q)$ depends on the discrete state $q$. Since the real potential $V_{\mathrm{real}}(q)$ is unknown, $\delta E(q)$ is also unknown. However, the knowledge used to construct the approximate potential can also be used to find a probability distribution for $\delta E$. The expected size of the inaccuracies in the approximate potential can be estimated. For example, one can estimate the size of the ignored physical effects, the inaccuracies in parameters, and so on. These estimates of inaccuracy can be interpreted to mean that the best one can do is to give a probability distribution for the energy, parameter, etc., with a width given by the estimate of inaccuracy. The sum of all of these inaccuracies is the value of $\delta E$. If the sum has a large number of terms, then the central limit theorem implies that $p(\delta E)$, the probability distribution of $\delta E$, is a Gaussian with a mean and a standard deviation that are calculable from the inaccuracy estimates. Now the potential accuracy question can be put into a precise, mathematical form: Given $p(\delta E)$, what is the probability that $q^{\mathrm{real}}_{\min}$ and $q^{\mathrm{approx.}}_{\min}$ are the same structure?

III. DETERMINISTIC CASE

To illustrate the formal solutions to the potential accuracy question, I will start with the simplest examples and proceed to examples of greater complexity. After developing the formalism sufficiently, I will apply it to a simple model of a protein potential function.

The simplest problem is that of two structures. Suppose an approximate potential $V_{\mathrm{approx.}}(q)$ is used to calculate the energies, $E'_0$ and $E'_1$, for structures 0 and 1, respectively. Without loss of generality, I assume $E'_0 < E'_1$, so $q^{\mathrm{approx.}}_{\min} = 0$. For concreteness, one could think of a spin in a solid that could point in one of only two directions. These structures also have real energies $E_0$ and $E_1$ related to the approximate energies by

$$E'_0 = E_0 + \delta E_0, \qquad E'_1 = E_1 + \delta E_1. \tag{2}$$

The $\delta E_i$ represent errors in $V_{\mathrm{approx.}}(q)$. I denote the distribution of the $\delta E_i$ by $p(\delta E_i)$. For this simple case the potential accuracy question becomes: What is the probability, $R$, that $E_0 < E_1$? Note that $E_0 < E_1$ implies

$$E'_1 - E'_0 \ge \Delta E, \tag{3}$$

where I have used the natural variable

$$\Delta E \equiv \delta E_1 - \delta E_0. \tag{4}$$

If I denote the distribution of $\Delta E$ by $P(\Delta E)$, then $R$, the probability of predicting the right structure, is given by

$$R(E'_0, E'_1) = \int_{-\infty}^{E'_1 - E'_0} P(\Delta E)\, d(\Delta E). \tag{5}$$

To complete the formal solution to the two structure problem, I must express $P(\Delta E)$ in terms of $p(\delta E)$. No general connection exists between these two distributions. For example, the errors in the potential function may be so closely correlated that $\delta E_0 = \delta E_1$, in which case $P(\Delta E)$ becomes, to a good approximation, a Dirac delta function at zero! Here I will only consider the simplest case, which must be solved before others. The simplest assumption is that the error in the energy of each structure is independent of the errors in the energies of all other structures. With this assumption, the distribution of $\Delta E$ becomes

$$P(\Delta E) = \int_{-\infty}^{+\infty} p(\delta E_0)\, p(\Delta E + \delta E_0)\, d(\delta E_0). \tag{6}$$

Equation (5) for $R(E'_0, E'_1)$ and Eq. (6) for $P(\Delta E)$ are the desired solution of the two structure problem with the independent error assumption.
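As an illustration of Eqs. (5) and (6), the following minimal Python sketch assumes that each error $\delta E_i$ is Gaussian with zero mean and standard deviation `sigma`. The zero mean and the function names are assumptions made for this example only. Under that assumption, Eq. (6) gives a Gaussian $P(\Delta E)$ with variance $2\sigma^2$, and Eq. (5) reduces to a Gaussian cumulative distribution.

```python
import math

def gaussian_cdf(x, std):
    """Cumulative distribution of a zero-mean Gaussian with standard deviation std."""
    return 0.5 * (1.0 + math.erf(x / (std * math.sqrt(2.0))))

def prob_right_two(e0_calc, e1_calc, sigma):
    """Eq. (5) for two structures with independent zero-mean Gaussian errors.

    e0_calc, e1_calc: energies from the approximate potential.
    sigma: standard deviation of each error delta E_i (an assumption of this sketch).
    """
    lo, hi = sorted((e0_calc, e1_calc))   # label the calculated minimum as structure 0
    std_diff = math.sqrt(2.0) * sigma     # Eq. (6): variances of independent errors add
    return gaussian_cdf(hi - lo, std_diff)

# A calculated gap equal to the error scale leaves a sizeable chance of a wrong prediction.
print(prob_right_two(0.0, 1.0, 1.0))
```

When the calculated gap is zero the function returns 0.5, and it approaches one as the gap grows large compared with `sigma`.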

In practice, the independent error assumption is probably a worst case, because correlations in the inaccuracies of potential functions tend to narrow the distribution $P(\Delta E)$. For example, the most important correlation in errors results from correlation in structures. Similar structures have many interactions in common, e.g., the same hydrogen bonds. These common interactions are calculated with the same (inaccurate) terms in the potential function. The inaccuracies in these common interactions are the same, so they do not contribute to the difference between the errors in the energies of these two structures, hence they do not contribute to $\Delta E$. Only interactions that the structures do not have in common contribute to this difference. Therefore, these structure-related correlations tend to narrow $P(\Delta E)$.

The solution of the many structure problem is a straightforward generalization of the solution of the two structure problem presented above. Consider $\Omega$ different structures, labeled $0, 1, 2, \ldots, \Omega-1$. As before, the energies calculated from the approximate potential, $E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}$, are related to the real energies $E_0, E_1, E_2, \ldots, E_{\Omega-1}$ by

$$E'_0 = E_0 + \delta E_0, \quad E'_1 = E_1 + \delta E_1, \quad E'_2 = E_2 + \delta E_2, \quad \ldots, \quad E'_{\Omega-1} = E_{\Omega-1} + \delta E_{\Omega-1}, \tag{7}$$

where the $\delta E_i$ are errors in the potential $V_{\mathrm{approx.}}(q)$ and the distribution of the $\delta E_i$ is $p(\delta E_i)$. Once again, I may assume, without loss of generality, that $E'_0 < E'_1, E'_0 < E'_2, \ldots, E'_0 < E'_{\Omega-1}$. What is the probability, $R$, that $E_0 < E_1, E_0 < E_2, \ldots, E_0 < E_{\Omega-1}$, i.e., that structure 0 has the lowest energy for the real potential? Since $E_0 < E_i$ implies

$$E'_i - E'_0 \ge \Delta E_i, \tag{8}$$

where in analogy with the two structure problem I have defined

$$\Delta E_i \equiv \delta E_i - \delta E_0, \tag{9}$$

the probability of $E_i > E_0$ is $g(E'_i - E'_0)$, where

$$g(x) \equiv \int_{-\infty}^{x} P(\Delta E)\, d(\Delta E) \tag{10}$$

and $P(\Delta E)$ is the distribution of the $\Delta E_i$ and is given by Eq. (6). With the independent error assumption, the probability of simultaneously satisfying all of the inequalities in Eq. (8) is the product of the probabilities of satisfying each of them separately, so the probability of predicting the right structure is

$$R(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}) = \prod_{i=1}^{\Omega-1} g(E'_i - E'_0). \tag{11}$$

The desired solution of the many structure problem with the independent error assumption follows from combining Eq. (6) for $P(\Delta E)$, Eq. (10) for $g(x)$, and Eq. (11) for $R(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$.
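The product formula of Eq. (11) can be evaluated directly once a form for $P(\Delta E)$ is chosen. The sketch below again assumes zero-mean Gaussian errors with standard deviation `sigma`, so that $g(x)$ in Eq. (10) is a Gaussian cumulative distribution with variance $2\sigma^2$. The function names and the sample energies are hypothetical.

```python
import math

def g(x, sigma):
    """Eq. (10) for Gaussian errors: P(Delta E) has variance 2 * sigma**2."""
    return 0.5 * (1.0 + math.erf(x / (2.0 * sigma)))

def prob_right_many(calc_energies, sigma):
    """Eq. (11): product of g(E'_i - E'_0) over all structures above the calculated minimum."""
    ordered = sorted(calc_energies)
    e0 = ordered[0]                 # structure 0 is the calculated lowest-energy structure
    result = 1.0
    for ei in ordered[1:]:
        result *= g(ei - e0, sigma)
    return result

# Hypothetical calculated energies (arbitrary units) and error scale.
print(prob_right_many([-5.0, -3.5, -3.0, 0.0, 1.0], sigma=0.5))
```

Structures far above the calculated minimum contribute factors very close to one, which is why only the few lowest calculated energies matter in practice.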

IV. STOCHASTIC CASE

The work in the previous section solves the potential accuracy problem for the case when the calculated energies of all of the structures are known. This is an unrealistic assumption for problems where one wishes to predict the structure of a large molecule. Similarly, one may wish to investigate the accuracy requirements for predicting the structure of a large class of molecules, for example, globular proteins.

For these two cases, a reasonable alternative to the previous formulation treats the energies of the structures as stochastic variables. In the stochastic formulation one defines an ensemble of molecules and studies the properties of this ensemble. Consider an ensemble of molecules, each with $\Omega$ discrete structures labeled $0, 1, 2, \ldots, \Omega-1$. The most useful stochastic property for calculating accuracy requirements is the probability density $p_\Omega(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ that a molecule, which has been randomly selected from the ensemble, has structures with calculated energies $E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}$. For example, in the next section $p_\Omega(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ will represent the probability that a sequence, drawn at random from the ensemble of all possible sequences of $N$ amino acids, has structures with calculated energies $E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}$. The function $p_\Omega(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ can be calculated from an approximate stochastic model of the potential, and may also include information drawn from simulation or experiment. In applications $p_\Omega(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ is typically well approximated by a special form. In this paper I will consider the simplest such special form,

$$p_\Omega(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}) = \prod_{i=0}^{\Omega-1} p(E'_i). \tag{12}$$

Equation (12) holds when the calculated energies of the structures are random, independent variables distributed with probability density $p(E')$. Henceforth, I will refer to Eq. (12) as the independent energy assumption. When only a probability density is known, one can calculate $\bar{R}$, the average of $R$. For a probability density of the form (12), this averaging yields

$$\bar{R} = \Omega \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} \left[\prod_{i=0}^{\Omega-1} p(E'_i)\right] \left[\prod_{i=1}^{\Omega-1} g(E'_i - E'_0)\, \theta(E'_i - E'_0)\right] dE'_0 \cdots dE'_{\Omega-1}, \tag{13}$$

where

$$\theta(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{if } x < 0, \end{cases} \tag{14}$$

and $g(x)$ is the integral over $P(\Delta E)$ defined in Eq. (10). Notice that the energies in the argument of $p_\Omega$ are not arranged in any special order, and in particular, there is no requirement that $E'_0 < E'_i$ for $i > 0$. Therefore, in Eq. (13) the product of $\theta$ functions ensures that the structure labeled 0 has the lowest energy. The factor of $\Omega$ in front of the integral occurs because the selection of the 0-labeled structure is arbitrary, since any of the $\Omega$ structures could have the lowest energy. Equation (13) for $\bar{R}$ simplifies to

$$\bar{R} = \Omega \int_{-\infty}^{+\infty} p(E'_0) \left[\int_{E'_0}^{+\infty} p(E')\, g(E' - E'_0)\, dE'\right]^{\Omega-1} dE'_0. \tag{15}$$

Equation (15) above can be approximated with a technique related to steepest descent. Define

$$\mathcal{G}(E'_0) \equiv \int_{E'_0}^{+\infty} p(E')\, g(E' - E'_0)\, dE' \tag{16}$$

and

$$\mathcal{P}(E'_0) \equiv (\Omega/\bar{R})\, p(E'_0)\, \mathcal{G}(E'_0)^{\Omega-1}. \tag{17}$$

Note that I have normalized $\mathcal{P}(E'_0)$ so

$$\int_{-\infty}^{+\infty} \mathcal{P}(E'_0)\, dE'_0 = 1. \tag{18}$$

The identity

$$-\int_{-\infty}^{+\infty} \frac{d}{dE'_0}\left[\mathcal{G}(E'_0)^{\Omega}\right] dE'_0 = 1 \tag{19}$$

can be rewritten as

$$-\bar{R} \int_{-\infty}^{+\infty} \frac{\mathcal{G}'(E'_0)}{p(E'_0)}\, \mathcal{P}(E'_0)\, dE'_0 = 1. \tag{20}$$

In most relevant cases $\mathcal{P}(E'_0)$ has a maximum which grows sharper as $\Omega$ becomes larger. Thus, for large $\Omega$, $\mathcal{P}(E'_0)$ is well approximated by a Dirac delta function at the value of $E'_0$ that maximizes $\mathcal{P}(E'_0)$. I denote this value of $E'_0$ by $E^*_0$. In this approximation, Eq. (20) for $\bar{R}$ becomes

$$\bar{R} = -\frac{p(E^*_0)}{\mathcal{G}'(E^*_0)}. \tag{21}$$

A useful expression for $\mathcal{G}'(E'_0)$ follows from differentiating Eqs. (16) and (10) for $\mathcal{G}$ and $g$, respectively, and changing the variable of integration in Eq. (16) to $\Delta \equiv E' - E'_0$ to find

$$\mathcal{G}'(E'_0) = -p(E'_0) \int_{-\infty}^{0} P(\Delta)\, d\Delta - \int_{0}^{+\infty} p(E'_0 + \Delta)\, P(\Delta)\, d\Delta. \tag{22}$$

After substituting this expression for $\mathcal{G}'$, expression (21) for $\bar{R}$ can be put in the suggestive form

$$\bar{R} = \frac{1}{1 + \epsilon(E^*_0)}, \tag{23}$$

where

$$\epsilon(E'_0) \equiv \int_{0}^{+\infty} \left[\frac{p(E'_0 + \Delta)}{p(E'_0)} - 1\right] P(\Delta)\, d\Delta. \tag{24}$$

There are four requirements for using Eqs. (23) and (24) to calculate the average probability of predicting the right structure. First, one must define an appropriate ensemble of molecules. Second, one must provide the function $p(E')$, the density of structure energies for molecules in the ensemble. Third, one must provide the function $P(\Delta)$, the distribution of differences in energetic errors. Fourth, one must calculate $E^*_0$, the energy that maximizes $\mathcal{P}(E'_0)$. I have made three assumptions in deriving Eqs. (23) and (24): an assumption that the total number of possible discrete structures, $\Omega$, is large, an independent error assumption, and an independent energy assumption. Finally, the reader should remember that $\bar{R}$ in Eq. (23) represents the average probability of predicting the right structure, which should not be confused with the probability of predicting the right structure for a specific member of the ensemble of molecules. In the following section I will discuss a simple model of protein folding where the large $\Omega$ assumption is obviously valid and the other assumptions are valid in a mean field approximation. I will also present expressions for $p(E')$ and $P(\Delta)$ which are valid in the same approximation. In Sec. VI I will calculate $E^*_0$ for this model, and I will use Eqs. (23) and (24) to calculate the average probability of predicting the right structure.
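Equations (23) and (24) are straightforward to evaluate numerically once $p(E')$, $P(\Delta)$, and $E^*_0$ are supplied. The following sketch assumes zero-mean Gaussian forms for both densities, with variances `var_e` and `var_d`, and uses a plain trapezoidal rule. The function names, the integration cutoff, and the sample numbers are choices made for this illustration, not values from the paper.

```python
import math

def epsilon(e0, var_e, var_d, n_steps=20000):
    """Eq. (24) by trapezoidal integration, assuming zero-mean Gaussian p and P.

    e0: the energy at which epsilon is evaluated (E*_0 in Eq. (23)).
    var_e: variance of the structure-energy density p(E').
    var_d: variance of the error-difference density P(Delta).
    """
    upper = 12.0 * math.sqrt(var_d)   # P(Delta) is negligible beyond this cutoff
    h = upper / n_steps

    def integrand(d):
        ratio = math.exp(-(2.0 * e0 * d + d * d) / (2.0 * var_e))   # p(e0 + d) / p(e0)
        weight = math.exp(-d * d / (2.0 * var_d)) / math.sqrt(2.0 * math.pi * var_d)
        return (ratio - 1.0) * weight

    total = 0.5 * (integrand(0.0) + integrand(upper))
    for k in range(1, n_steps):
        total += integrand(k * h)
    return total * h

def average_prob_right(e0, var_e, var_d):
    """Eq. (23): average probability of predicting the right structure."""
    return 1.0 / (1.0 + epsilon(e0, var_e, var_d))

# Hypothetical inputs in arbitrary energy units.
print(average_prob_right(e0=-116.0, var_e=200.0, var_d=1.0))
```

For a negative `e0` in the low-energy tail of $p$, the ratio $p(E'_0+\Delta)/p(E'_0)$ exceeds one for small positive $\Delta$, so $\epsilon$ is positive and the average probability falls below one as the error variance grows.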

V. A RANDOM HETEROPOLYMER MODEL OF A PROTEIN POTENTIAL FUNCTION

The random heteropolymer model is the simplest model of a protein potential function. This model, with some significant extensions, was first proposed as a model of protein folding by Bryngelson and Wolynes,4,5 who solved it within a random energy approximation. Later, and independently, Garel and Orland, and Shaknovitch and Gutin, also proposed this model and solved it within a mean field approximation.6-8 They also demonstrated the equivalence of their mean field approximation to the random energy approximation of Bryngelson and Wolynes, and were able to obtain further information. In the random heteropolymer model the energy, $E$, of a configuration is determined by the contacts between the amino acids. If I define

$$\Delta(i,j) = \begin{cases} 1 & \text{if amino acids } i \text{ and } j \text{ are in contact,} \\ 0 & \text{otherwise,} \end{cases} \tag{25}$$

and define $N$ to be the number of amino acid residues in the protein, then the energy of state $q$ with the set of contacts $\{\Delta(i,j)\}$ is given by

$$E = \mathcal{H}(\{\Delta(i,j)\}) = \sum_{i<j}^{N} B_{i,j}\, \Delta(i,j). \tag{26}$$

In Eq. (26) for the energy, $B_{i,j}$ is the energy of contact between amino acids $i$ and $j$. In the random heteropolymer model, the $B_{i,j}$ are random, with probability distribution

$$P_{\mathrm{contact}}(B_{i,j}) = \frac{1}{\sqrt{2\pi B^2}} \exp\left(-\frac{B_{i,j}^2}{2B^2}\right). \tag{27}$$

The quantity $B$ in Eq. (27) sets the energy scale of the amino acid contact energies.9 Notice that energies of structures in the random heteropolymer model are stochastic variables. Equations (26) and (27) define the ensemble of molecules generated by this model.

Model (26) distinguishes between configurations based on their contacts. Therefore, this model represents structures by their amino acid contacts. This representation is often referred to as the contact or distance map representation10 of protein structure, and is useful in many applications. Notice that the contact map representation automatically produces a countable number of discrete structures, as required by the formalism developed in this paper.

Havel, Crippen, and Kuntz studied the precision of this representation.11 Their investigation found that contact maps, when combined with excluded volume, constrain the α-carbon positions to about 1 Å or less. To establish these results, Havel et al. defined a contact to occur when the positions of the α-carbons of two amino acids were less than 10 Å apart, and they used this definition to produce contact maps for BPTI and for Carp Myogen. For each of the two resulting contact maps, they generated ten structures that met the constraints imposed by the maps and by excluded volume, but were otherwise random. They compared these constrained random structures to each other and to the original protein structures. For Carp Myogen, they found a 0.4 Å root mean square (rms) distance deviation between the positions of equivalent α-carbons in the constrained random structures; between the original protein structure and the random structures the corresponding rms distance deviation was 0.6 Å. For BPTI both of these rms distance deviations were 1.0 Å. Havel, Crippen, and Kuntz also argue that contact maps constrain the structure of large proteins more effectively than they constrain the structures of small proteins. Since BPTI is perhaps the smallest soluble polypeptide designated a protein, a conservative estimate places the rms deviation of the α-carbon positions of structures specified by a complete contact map at 1 Å or less.

For compact structures, the solution of the mean field theory for random heteropolymers has four properties important for the present calculation. These properties are a product of both the model and the mean field approximation, and therefore may not be entirely physical. In the conclusion I will discuss possible ways to check and improve the model and the approximations. First, the total number of compact structures is

$$\Omega = \nu^N, \tag{28}$$

where $\nu$ is the average number of conformations per amino acid residue in the compact phase. (Excluded volume effects are included in this counting.) Second, the energies are random, independent variables, so Eq. (12) for $p_\Omega(E_0, E_1, E_2, \ldots, E_{\Omega-1})$ holds. Third, the probability density that a structure has energy $E$ is

$$p(E) = \frac{1}{\sqrt{\pi N z B^2}} \exp\left(-\frac{E^2}{N z B^2}\right), \tag{29}$$

where $z$ is the average number of contacts each amino acid residue has with other amino acid residues. Fourth, the low energy structures have few contacts, hence few interactions, in common. Only these low energy structures have a nonnegligible probability of having the lowest energy after noise has been added to the original potential function, so their properties determine the probability of a correct prediction with an inaccurate potential. Hence, as noted in Sec. III, the independent error approximation is valid for these structures and therefore also for the purposes of this paper.

Inaccuracies in the pair interactions between amino acids are modeled by adding random noise to the contact energies, so the approximate potential is

$$E' = \mathcal{H}'(\{\Delta(i,j)\}) = \sum_{i<j}^{N} B'_{i,j}\, \Delta(i,j), \tag{30}$$

where

$$B'_{i,j} = B_{i,j} + \eta_{i,j} \tag{31}$$

and the $\eta_{i,j}$ are random variables distributed with mean zero and standard deviation $\eta$. Since $\mathcal{H}$ and $\mathcal{H}'$ have the same form, they have the same properties. An approximate potential function $\mathcal{H}'$ used to calculate the energy of the state $\{\Delta(i,j)\}$ errs by an energy

$$\delta E(\{\Delta(i,j)\}) = \sum_{i<j}^{N} \eta_{i,j}\, \Delta(i,j). \tag{32}$$

The quantity $\delta E$ is a sum of $\sim N$ independent random variables, so by the central limit theorem, $\delta E$ is a random variable with probability density

$$p(\delta E) = \frac{1}{\sqrt{\pi N z \eta^2}} \exp\left(-\frac{\delta E^2}{N z \eta^2}\right). \tag{33}$$

Substituting Eq. (33) into Eq. (6) for the probability density of $\Delta E$, the error in energy with respect to the predicted lowest energy structure within the independent error approximation, yields

$$P(\Delta E) = \frac{1}{\sqrt{2\pi N z \eta^2}} \exp\left(-\frac{\Delta E^2}{2 N z \eta^2}\right). \tag{34}$$
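The variances implied by Eqs. (33) and (34) can be checked with a small Monte Carlo experiment. In this sketch each structure has $Nz/2$ contacts, the contact errors are drawn as Gaussians of standard deviation $\eta$, and the two structures share no contacts, which is the independent error assumption. The values `n_res=50`, `z=4`, and `eta=0.1` are hypothetical choices for illustration, and the function names are not from the paper.

```python
import random
import statistics

def sample_error(n_contacts, eta, rng):
    """One draw of delta E, Eq. (32): a sum of independent contact errors."""
    return sum(rng.gauss(0.0, eta) for _ in range(n_contacts))

def error_variances(n_res, z, eta, n_samples=2000, seed=0):
    """Estimate the variances of delta E and of Delta E = delta E_1 - delta E_0."""
    rng = random.Random(seed)
    n_contacts = n_res * z // 2          # each contact is shared by two residues
    single, diff = [], []
    for _ in range(n_samples):
        de0 = sample_error(n_contacts, eta, rng)
        de1 = sample_error(n_contacts, eta, rng)   # disjoint contacts: independent errors
        single.append(de0)
        diff.append(de1 - de0)
    return statistics.pvariance(single), statistics.pvariance(diff)

var_single, var_diff = error_variances(n_res=50, z=4, eta=0.1)
print(var_single, var_diff)   # compare with N*z*eta**2/2 and N*z*eta**2
```

Equation (33) corresponds to a variance of $Nz\eta^2/2$ for $\delta E$, and Eq. (34) to a variance of $Nz\eta^2$ for $\Delta E$; the sampled values should scatter around those numbers.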

VI. ACCURACY REQUIREMENTS FOR THE RANDOM HETEROPOLYMER MODEL

In the previous section, I noted that structure energies in the random heteropolymer model are stochastic variables generated from a well-defined ensemble of molecules. The independent error assumption and the random energy assumption are both valid in a mean field theory of the random heteropolymer model. The large $\Omega$ assumption is obviously valid for realistically large values of $N$, i.e., $N > 50$. I also gave expressions for $p(E')$, the density of structure energies, and for $P(\Delta)$, the distribution of differences in energetic errors. Therefore, once I have calculated $E^*_0$, I can use Eqs. (23) and (24) to find the average probability of predicting the right structure in the random heteropolymer model. For $E'_0 = E^*_0$,

$$\frac{d\mathcal{P}}{dE'_0} = 0, \tag{35}$$

so substituting Eq. (29) for $p(E')$ in Eq. (17) for $\mathcal{P}$ and differentiating yields

$$\frac{2E^*_0}{N z B^2}\, \mathcal{G}(E^*_0) = (\Omega - 1)\, \mathcal{G}'(E^*_0). \tag{36}$$

For all values of $E'_0$, $\mathcal{G}'(E'_0) < 0$ and $\mathcal{G}(E'_0) > 0$, hence $E^*_0 < 0$. Substituting Eqs. (29) and (34) for $p(E')$ and $P(\Delta E)$ into Eqs. (16) and (22) for $\mathcal{G}(E'_0)$ and $\mathcal{G}'(E'_0)$ yields




1/Z (I E~ v'2 1JXI) + (2Nz) 1JxJerfc (NZ)l12B +~ dx

1 (IE~I) -Z erfc (2Nz)1/21J (37)

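and

$$\mathcal{G}'(E'_0) = -\frac{1}{2}\, p(E'_0) \left\{ 1 + \left[1 + 2\left(\frac{\eta}{B}\right)^2\right]^{-1/2} \exp\left[\frac{2\eta^2 E_0'^2}{N z B^2 (B^2 + 2\eta^2)}\right] \left[1 + \mathrm{erf}\left(\frac{\sqrt{2}\, \eta\, |E'_0|}{(Nz)^{1/2} B\, (B^2 + 2\eta^2)^{1/2}}\right)\right] \right\} \tag{38}$$

for $E'_0 < 0$. Henceforth, I will assume that $N \gg 1$, which is true for proteins. Define $\alpha$ by

$$\alpha \equiv -\frac{E^*_0}{N z^{1/2} B}. \tag{39}$$

I have shown that $\alpha > 0$. I assume that $\alpha$ is of order one and will show that this assumption is self-consistent. I will also assume that $\eta$ is at most the same order of magnitude as $B$, and quite possibly much smaller. This assumption covers all interesting cases, because if $\eta$ is much larger than $B$, then the "signal" (the $B_{i,j}$) is swamped by the "noise" (the $\eta_{i,j}$) so the probability of predicting the correct structure is essentially zero. With these assumptions, the asymptotic expansion for the error function complement for large argument gives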

(40)

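where $\gamma$ is a constant of order one. Therefore, the equation for $\alpha$ becomes

$$4\pi^{1/2} N^{1/2} \alpha = \nu^N \exp(-N\alpha^2) \left\{ 1 + \left[1 + 2\left(\frac{\eta}{B}\right)^2\right]^{-1/2} \exp\left(\frac{2N\alpha^2\eta^2}{B^2 + 2\eta^2}\right) \left[1 + \mathrm{erf}\left(\frac{\sqrt{2N}\, \alpha\, \eta}{\sqrt{B^2 + 2\eta^2}}\right)\right] \right\}. \tag{41}$$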

Equation (41) for $\alpha$ simplifies in two limits, $\sqrt{N}(\eta/B)$ small and $\sqrt{N}(\eta/B)$ large. The expansion of Eq. (41) for small $\sqrt{N}(\eta/B)$ to first order in $\sqrt{N}(\eta/B)$ is

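$$2\pi^{1/2} N^{1/2} \alpha = \nu^N \exp(-N\alpha^2) \left[1 + \left(\frac{2N}{\pi}\right)^{1/2} \frac{\alpha\eta}{B}\right], \tag{42}$$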

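and similarly for large $\sqrt{N}(\eta/B)$ the leading term in the asymptotic expansion yields

$$2\pi^{1/2} N^{1/2} \alpha \left[1 + 2\left(\frac{\eta}{B}\right)^2\right]^{1/2} = \exp\left[N\left(\log\nu - \frac{B^2\alpha^2}{B^2 + 2\eta^2}\right)\right]. \tag{43}$$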

Approximate solutions of the above equations can be found by writing $\alpha$ as

$$\alpha = \alpha_0 + \left(\frac{\log N}{N}\right)\alpha_1 + \left(\frac{1}{N}\right)\alpha_2 + \cdots \tag{44}$$

and substituting this expression into the above equations for $\alpha$ to yield

(log N)

=0 N 3/Z (45)

- (~){ (BZ~:1JZ) aoaz+~ 10g[ 4'1Tao( 1 + ~~Z) ]}

=Oe~7~) (46)

for large $\sqrt{N}(\eta/B)$. Solving the above equations, order by order in $N$, by first requiring the order one term to vanish, then the order $\log N / N$ term to vanish, and so on up to and including the order $1/N$ term, yields expressions for $\alpha$ for leading orders in $N$,

_ 1/2 log N a-(log v) 4N(log V)l12

1 1 + 4N(log V)1/2- log(4'1T log v)- 2N(log V)1/2

x 10g[ 1 + eo~ Vf/Z N~Z1J'] (47)

for small $\sqrt{N}(\eta/B)$ and

_(B2+21J2

')112 __ 1_ (4'1TNB210g v) a- B2 log v 4N log 'B2+21J2

(B2+21JZ ,)'-112

X B2 log v - (48)

for large $\sqrt{N}(\eta/B)$. The leading order terms in both Eqs. (47) and (48) for $\alpha$ are of order $\sqrt{\log\nu}$. A typical estimate gives $\nu = 1.4$,12 which makes $\alpha$ of order one, as promised. Notice that this conclusion still holds if the value of $\nu - 1$ changes by a factor of 10.

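For $p(E')$ and $P(\Delta E)$ given by Eqs. (29) and (34), respectively, and $E^*_0 = -N\alpha\sqrt{z}B$, Eq. (24) for $\epsilon(E^*_0)$ becomes13

$$\epsilon = \frac{1}{2}\left[1 + 2\left(\frac{\eta}{B}\right)^2\right]^{-1/2} \exp\left(\frac{2N\alpha^2\eta^2}{B^2 + 2\eta^2}\right) \left[1 + \mathrm{erf}\left(\frac{\sqrt{2N}\, \alpha\, \eta}{\sqrt{B^2 + 2\eta^2}}\right)\right] - \frac{1}{2}. \tag{49}$$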

By inspection of Eq. (49), $\epsilon$ is small, and hence the average probability of predicting the correct structure, $\bar{R} = 1/(1 + \epsilon)$, is close to one, only if $\sqrt{N}(\eta/B)$ is small. Substituting Eq. (47), which gives the value of $\alpha$ when $\sqrt{N}(\eta/B)$ is small, into the above expression for $\epsilon(\alpha)$ and expanding to first order in $\sqrt{N}(\eta/B)$ gives

( 2 ) 112{

E= -; log v 1 10g[(4'7T log v)/N] } (Nl/27J)

4N log v B

(N7J

2)

+0 B2 . (50)

Therefore, if $\eta$ is small compared to $B/\sqrt{N}$, then, to first order in $\sqrt{N}(\eta/B)$, the average probability of predicting the right structure is

_ (2 ) 112{ 10g[(4'7T log v)/N] } (Nl/2 7J) R=1- - log v 1- --'7T 4N log v B

(51)

which, for large $N$, is well approximated by

$$\bar{R} = 1 - \left(\frac{2}{\pi}\log\nu\right)^{1/2} \left(\frac{N^{1/2}\eta}{B}\right). \tag{52}$$

For large $\sqrt{N}\,\eta/B$, $\epsilon \gg 1$ so $\bar{R} \approx 1/\epsilon$; therefore, to the leading term in the asymptotic expansion,

R= 1 +2 - v- 2N1/1B2 _ [ (7J)2]1/2(4'7TNB2 log v) 1/1(B2+21/,)

B B2+27J2 . (53)

VII. CONCLUSIONS

This paper has two principal purposes. First, the general formalism developed in Secs. II, III, and IV is applicable to a wide variety of problems in chemical physics. The prerequisites for applying the formalism are fourfold. First, a suitable coarse-graining procedure for obtaining a finite number of discrete structures must exist. As I previously mentioned, this coarse-graining procedure can be tailored to the problem at hand, and the structure can be specified as accurately as desired by making the discretization finer. Second, the solution of the structure prediction problem must be put into the form of a solution to a minimization problem, typically not a stringent requirement. Third, the size of the inaccuracies must be estimated. Then the central limit theorem can be used to obtain the probability density for the total inaccuracy in energy. Fourth, there must be some information about the distribution of the calculated energies of the structures. In practice, only the calculated energies of the few structures with lowest calculated energies are necessary because all of the other structures have a negligible probability of being the real lowest energy structure. For large molecules, where even these calculations are infeasible, one can estimate the distribution of energies of structures and use the methods of Sec. IV. Techniques for estimating this distribution include using a simple model, for example the model described in Sec. V, and using data obtained from simulations, for example, the spectrum of local energy minima introduced by Honeycutt and Anderson14 in a study of Lennard-Jones clusters and used by Honeycutt and Thirumalai15,16 in studies of protein folding. The main assumption in deriving the equations for the probability of predicting the correct structure, Eqs. (5), (11), (13), and (23), is the independent error assumption, which is a worst case for most problems. The derivation of Eq. (23) also used the independent energy assumption. Extensions of this formalism that lessen these assumptions are under way.

The second purpose of this paper is to report a result for the required accuracy for a potential that predicts protein structure. The major result of this investigation is Eq. (52), which states that the average probability of predicting the correct structure is given by

$$\text{probability} = 1 - k\,\frac{N^{1/2}\eta}{B}, \tag{54}$$

where $B$ is the scale of the monomer-monomer interaction energies, $\eta$ is the scale of the inaccuracy of these interaction energies, $N$ is the number of monomers, and $k$ is a constant of order one. Equation (54) was derived from Eq. (23) for R and therefore is based on the independent error and independent energy assumptions. In Sec. V, I described a simple random heteropolymer model of protein folding, and I noted that these assumptions are correct for a mean field theory of this model. Equation (54) implies that, if a potential function is to predict the correct structure, then the monomer-monomer interaction energies must have a proportional error, $\eta/B$, that is small compared with $1/\sqrt{N}$. For a globular protein $N$ will typically be between 50 and 400, so $1/\sqrt{N}$ lies between about 0.05 and 0.14, and the required accuracy in monomer-monomer interactions is about 5% to 15% or less. It is important to note that this result is for the accuracy required for getting all of the monomer-monomer contacts right, that is, predicting the entire contact map with perfect accuracy. Perfect prediction of the contact map is a stringent requirement for a potential function. Proteins with 60% or more of correct contacts are usually considered structurally homologous. Therefore, the protein calculation should be extended to calculate the probability of predicting a structure with a specified fraction of correct contacts. This extension will require improvements in the formalism and in the model. The formalism must be extended to calculate the probability of predicting one of many, rather than just one, low energy state. The model must be extended so that the low energy states of the model have contacts in common. The model can be extended in two ways. First, the statistical mechanics of the model potential function could be solved in an approximation that is more accurate than the mean field theory used here. Some progress has already been made in this direction (Refs. 17 and 18). Second, the model potential function could be extended by incorporating new effects that are alleged to be important in protein folding, such as the principle of minimal frustration (Refs. 4, 5, 19, and 20).
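The arithmetic behind the 5% to 15% figure can be checked directly. The short sketch below evaluates $1/\sqrt{N}$ and the right-hand side of Eq. (54) for a few chain lengths. The helper names are hypothetical, and `k = 1.0` is an assumed placeholder, because the text specifies only that k is of order one. The expression is meaningful only while the correction term is small compared with one.

```python
import math

def accuracy_scale(n_monomers):
    """Proportional error scale 1/sqrt(N) that eta/B must stay well below."""
    return 1.0 / math.sqrt(n_monomers)

def probability_correct(n_monomers, eta_over_b, k=1.0):
    """Right-hand side of Eq. (54): 1 - k * sqrt(N) * (eta / B).

    k = 1.0 is an assumed placeholder; the source only says k is of order one.
    """
    return 1.0 - k * math.sqrt(n_monomers) * eta_over_b

for n in (50, 100, 400):
    scale = accuracy_scale(n)
    # Evaluate Eq. (54) at one tenth of the scale, where the correction is small
    ratio = 0.1 * scale
    p = probability_correct(n, eta_over_b=ratio)
    print(f"N = {n:3d}: 1/sqrt(N) = {scale:.3f}, eta/B = {ratio:.4f}, probability = {p:.2f}")
```

Because the correction term grows as $\sqrt{N}$, a fixed proportional error in the interaction energies costs more for longer chains, which is why the upper end of the size range sets the tighter 5% requirement.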




ACKNOWLEDGMENTS

Most of the work described in this paper was done under the auspices of the U.S. Department of Energy while I was a member of the Complex Systems Group (T-13) in the Theoretical Division of the Los Alamos National Laboratory. I wish to thank the Los Alamos National Laboratory and the Department of Energy for generous support of this work. I also wish to thank the Center for Non-Linear Studies and the Santa Fe Institute for their generous hospitality. The work described in this paper owes much to the help and encouragement of many people. In particular I would like to thank Dr. Henrik Bohr, Dr. Ken Dill, Dr. Walter Fontana, Dr. Josh Deutsch, Dr. Silvio Franz, Dr. John Hopfield, Dr. Giulia Iori, Dr. Alan Lapedes, Dr. Jiri Novotny, Dr. Jose Nelson Onuchic, Dr. Peter Leopold, Dr. Lawrence Pratt, Dr. Sri Sastry, Dr. Jeff Skolnick, Dr. Paul Stolorz, Dr. James Theiler, Dr. Miguel Virasoro, Dr. David Wolpert, and Dr. Peter Wolynes, and Miss Anne Keegan for listening to my ideas and for good advice concerning this work.

1 C. B. Anfinsen, Science 181, 223 (1973).
2 Z. Li and H. A. Scheraga, Proc. Natl. Acad. Sci. USA 84, 6611 (1987).
3 E. I. Shaknovitch and A. M. Gutin, J. Theor. Biol. 149, 537 (1991).
4 J. D. Bryngelson and P. G. Wolynes, Proc. Natl. Acad. Sci. USA 84, 7524 (1987).

5 J. D. Bryngelson and P. G. Wolynes, Biopolymers 30, 177 (1990).
6 T. Garel and H. Orland, Europhys. Lett. 6, 307 (1988).
7 E. I. Shaknovitch and A. M. Gutin, Biophys. Chem. 34, 187 (1989).
8 E. I. Shaknovitch and A. M. Gutin, J. Phys. A 22, 1647 (1989).
9 The erudite reader may recall that in the Shaknovitch-Gutin paper the Gaussian for the $B_{i,j}$ was centered about a mean energy $B_0$, and the potential function included a repulsive three-body interaction term C2:A(ri ,rj)A(rk ,r) which prevented the model protein from collapsing to a single point. The values of $B_0$ and $C$ determine whether or not the protein molecule is collapsed. In this paper I am only interested in the relative energies of the different collapsed conformations. Changes in $B_0$ and $C$ would only add a constant energy to each collapsed conformation, and therefore can be ignored.

10 T. E. Creighton, Proteins (Freeman, New York, 1984), p. 231.
11 T. F. Havel, G. M. Crippen, and I. D. Kuntz, Biopolymers 18, 73 (1979).
12 K. A. Dill, Biochemistry 24, 1501 (1984).
13 At this juncture, the interested reader may verify that setting $\eta = 2B/\sqrt{N}$ in Eq. (49) for E, substituting the result into Eq. (23) for the average probability of a right prediction, and taking the large N limit, reproduces Eq. (12) of Ref. 3 for the probability of a neutral mutation in the random heteropolymer model of protein folding.

14 J. D. Honeycutt and H. C. Andersen, J. Phys. Chem. 91, 4950 (1987).
15 J. D. Honeycutt and D. Thirmalai, Proc. Natl. Acad. Sci. USA 87, 3526 (1990).
16 J. D. Honeycutt and D. Thirmalai, Biopolymers 32, 695 (1992).
17 M. Mezard and G. Parisi, J. Phys. (Paris) I 1, 809 (1992).
18 Silvio Franz (personal communication).
19 J. D. Bryngelson and P. G. Wolynes, J. Phys. Chem. 93, 6902 (1989).
20 P. E. Leopold, M. Montal, and J. N. Onuchic, Proc. Natl. Acad. Sci. USA 89, 8721 (1992).


