Locating landmarks on high-dimensional free energy surfaces

Ming Chen^a, Tang-Qing Yu^b, and Mark E. Tuckerman^a,b,c,1

^a Department of Chemistry and ^b Courant Institute of Mathematical Sciences, New York University (NYU), New York, NY 10003; and ^c New York University–East China Normal University Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China

Edited by Michael L. Klein, Temple University, Philadelphia, PA, and approved January 28, 2015 (received for review September 25, 2014)

Coarse graining of complex systems possessing many degrees of freedom can often be a useful approach for analyzing and understanding key features of these systems in terms of just a few variables. The relevant energy landscape in a coarse-grained description is the free energy surface as a function of the coarse-grained variables, which, despite the dimensional reduction, can still be an object of high dimension. Consequently, navigating and exploring this high-dimensional free energy surface is a nontrivial task. In this paper, we use techniques from multiscale modeling, stochastic optimization, and machine learning to devise a strategy for locating minima and saddle points (termed “landmarks”) on a high-dimensional free energy surface “on the fly” and without requiring prior knowledge of or an explicit form for the surface. In addition, we propose a compact graph representation of the landmarks and connections between them, and we show that the graph nodes can be subsequently analyzed and clustered based on key attributes that elucidate important properties of the system. Finally, we show that knowledge of landmark locations allows for the efficient determination of their relative free energies via enhanced sampling techniques.

free energy surface | stochastic optimization | activation–relaxation | machine learning | network representation

Significance

The problem of generating and navigating high-dimensional free energy surfaces is a significant challenge in the study of complex systems. The approach introduced represents an advance in this area, and its ability to generate and organize the key features of a high-dimensional free energy surface, i.e., its landmarks, with high efficiency impacts numerous problems in the materials and biomolecular sciences for which prediction of optimal structures is key. These include polypeptide and nucleic acid structure and crystal design and structure prediction. Moreover, as the algorithm targets the free energy surface, candidate structures can be ranked based on their relative free energies, which is not possible with algorithms that target only the bare potential energy surface.

Author contributions: M.C., T.-Q.Y., and M.E.T. designed research; M.C. and T.-Q.Y. performed research; M.C. and T.-Q.Y. analyzed data; and M.C. and M.E.T. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

1 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1418241112/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1418241112 PNAS | March 17, 2015 | vol. 112 | no. 11 | 3235–3240

Understanding the conformational equilibria of complex systems remains a significant challenge in the computational molecular sciences. Whether one is interested in predicting biomolecular structure, generating and thermodynamically ranking the polymorphs of molecular crystals, or studying the phase behavior of complex materials, the very large number of degrees of freedom renders such problems highly nontrivial. Often the most important conformational states in a system can be characterized in terms of a subset of collective degrees of freedom or “collective variables” (CVs), and the problem of mapping out the conformational equilibria amounts to generating the marginal probability distribution in these CVs, from which the associated free energy surface (FES) can be generated. Unfortunately, due to the existence of many minima on the potential energy surface separated by high barriers, transitions from basin to basin on this surface are rare, so that the FES cannot be generated on any reasonable timescale using standard molecular dynamics (MD) or Monte Carlo (MC) methods. Various enhanced sampling approaches have been devised to accelerate the exploration of such “rough” or “frustrated” energy landscapes to generate such FESs, either by elevating the temperature in the subspace of the CVs (1–3) or by applying a bias potential on the FES (4–7) as in the popular metadynamics approach (4, 5). Recently, we have shown that these two classes of methods can be effectively combined (8), and others have shown that various types of free energy dynamics are possible (9).

Although it is often claimed that a low-dimensional manifold involving the selected CVs and embedded in the full phase space could capture the most important processes in complex systems (10, 11), the problem of identifying the relevant CVs that characterize this low-dimensional surface remains unsolved in general. In particular, the chosen CVs must describe a manifold of slow motions in the system. Although considerable effort has been devoted to this subject (11–13), most of the methods proposed require prior knowledge of the system, which necessitates sufficient sampling of the global configurational distribution. Moreover, quantifying different conformational equilibria might require different numbers of CVs, indicating that assigning a constant dimensionality to the low-dimensional manifold is inadequate (14). One simple solution is to use a large number of CVs for the system, which may contain some redundancy yet are sufficient to incorporate the important slow modes. When this is done, one is necessarily faced with the problem of describing a high-dimensional free energy surface (HDFES). Despite the advent of robust enhanced sampling techniques capable of exploring HDFESs (8, 15, 16), significant difficulties remain in the study of such surfaces, including the characterization and representation of a function of many variables. In fact, the low free energy regions on an HDFES possess a “spider’s web” structure that occupies a relatively small fraction of the surface (16). Thus, enhanced sampling methods designed to sample the surface uniformly could spend too much time in uninteresting, relatively high free energy regions. Even if a well-sampled HDFES is available, fitting it to some model functional form (8) and/or visualizing the samples in a low-dimensional space (16) are nontrivial problems. Interestingly, the spider’s web structure indicates that the most interesting parts of the HDFES are the minima and the transition paths connecting them. As long as minima and saddles are located on an HDFES, relative free energy differences can be calculated and transition paths can be constructed. Therefore, the minima and saddle points are important “landmarks” on an HDFES, and locating these landmarks becomes a threshold for further analysis of the surface.

An obvious way to locate landmarks is to apply a searching algorithm on an analytically represented HDFES. However, difficulties in complete sampling and high-dimensional fitting become obstacles in such an approach. In this paper, we develop a strategy for locating landmarks on an HDFES “on the fly” by direct optimization. Various optimization approaches, such as the dimer method (17–19), the activation–relaxation technique (ART) (20, 21), and discrete path sampling (22), have achieved considerable success in locating minima and saddles on potential energy surfaces. However, the problem of locating such landmarks on an HDFES has received significantly less attention. It is worth noting that, in general, the complexity of a frustrated system described by a particular potential energy function can be reduced by integrating out uninteresting degrees of freedom and obtaining the FES, a surface that encodes information about important conformational states. Because the FES is determined by the expectation of an indicator function, the optimization algorithm proposed here belongs to the general class of stochastic approximations, which is a family of optimization algorithms for objective functions that can be estimated only via noisy observations. Such algorithms have been successful in the field of electrical engineering, specifically in adaptive signal processing (23), are the basis for stochastic gradient descent methods that are widely used in machine learning (24), and have recently been used to optimize a biasing potential (25) in metadynamics. As our algorithm is designed to navigate HDFESs that are multiscale in nature, the algorithm falls within the general framework of multiscale modeling, specifically heterogeneous multiscale modeling (HMM) (26) or the “equation-free” approach (27). We show that the optimization algorithm can be tailored to suit the features of the FES, which are generated on the fly using machine-learning methods; this constitutes a significant difference from algorithms that operate on potential energy surfaces. The use of ART leads to a global search algorithm on the HDFES. We term the resulting approach the stochastic activation–relaxation technique (START). Once the landmarks are located, we show that they may be subsequently used as inputs to an enhanced sampling calculation to obtain, quantitatively, their associated free energies, a procedure that is of considerably greater efficiency than attempting to converge the full HDFES from scratch via enhanced sampling (with no a priori knowledge of the location of the landmarks). Finally, we address the problem of representing the HDFES explored by START by introducing a graph-based approach in which all of the landmarks obtained are colored vertices and connections between them are the edges. We then show that the graph nodes can be analyzed and clustered based on suitably chosen attributes to reveal archetypical configurations of the system.

Stochastic Approximation on an HDFES

Consider a system of $N$ particles with coordinates $\mathbf{r}_1, \ldots, \mathbf{r}_N \equiv \mathbf{r} \in \mathbb{R}^{3N}$ interacting through a potential $U(\mathbf{r})$. We introduce a set of $n$ CVs given by functions $\mathbf{q}(\mathbf{r}) = \{q_1(\mathbf{r}), \ldots, q_n(\mathbf{r})\}$, which we assume capture the slow, collective motions of the system. We take the CVs as inputs and do not discuss here how they are generally chosen. The marginal probability distribution $P(s_1, \ldots, s_n) \equiv P(\mathbf{s})$, where $\mathbf{s} \in \mathbb{R}^n$, gives the probability that $q_1(\mathbf{r}) = s_1$, $q_2(\mathbf{r}) = s_2$, ..., and $q_n(\mathbf{r}) = s_n$ and is defined to be $P(\mathbf{s}) = C \int d\mathbf{r}\, e^{-\beta U(\mathbf{r})} \prod_{\alpha=1}^{n} \delta(q_\alpha(\mathbf{r}) - s_\alpha)$, where $\beta = 1/k_B T$, $k_B$ is Boltzmann’s constant, and $T$ is the temperature of the system. The variables $s_1, \ldots, s_n$ are referred to as “coarse-grained” variables. Representing the product of δ-functions as the limit of a product of Gaussians (2, 3), we obtain

$$P(\mathbf{s}) = \lim_{\kappa \to \infty} C(\{\kappa\}) \int d\mathbf{r}\, \exp\left\{-\beta\left[U(\mathbf{r}) + \sum_{\alpha=1}^{n} \frac{\kappa_\alpha}{2}\,\bigl(q_\alpha(\mathbf{r}) - s_\alpha\bigr)^2\right]\right\}. \qquad [1]$$

$C$ and $C(\{\kappa\})$ are overall normalization constants. The parameters $\kappa_1, \ldots, \kappa_n \equiv \kappa$ determine the width and height of the Gaussians, and for finite $\kappa$, the approximate marginal distribution converges to the true distribution weakly as $O(1/|\kappa|)$ (28). The FES associated with this marginal distribution is $A(\mathbf{s}) = -\beta^{-1} \log P(\mathbf{s})$, whereas the gradient and Hessian are expressed, respectively, in terms of the expectation or sample mean and sample covariance matrix of the quantity $\mathbf{F}_{\rm inst} = \mathrm{diag}\{\kappa\}(\mathbf{q}(\mathbf{r}) - \mathbf{s})$, i.e., $-\nabla A = E[\mathbf{F}_{\rm inst}]$ (8), and $\nabla\nabla A = \mathrm{diag}\{\kappa\} - \beta\,\mathrm{cov}(\mathbf{F}_{\rm inst})$ (29), where $\mathrm{cov}(\mathbf{F}_{\rm inst}) = E[(\mathbf{F}_{\rm inst} - E[\mathbf{F}_{\rm inst}])(\mathbf{F}_{\rm inst} - E[\mathbf{F}_{\rm inst}])^{\rm T}]$ is the covariance matrix of $\mathbf{F}_{\rm inst}$. The probability distribution for calculating these expectations is the canonical distribution in Eq. 1 with fixed $\mathbf{s}$. The samples come from molecular dynamics or Monte Carlo simulations with a restraint on $\mathbf{s}$. These estimators, denoted by $\mathbf{F}(\mathbf{s})$ and $\mathbf{H}(\mathbf{s})$, exhibit low-amplitude noise whenever finite-time averaging is used. This is the main difference between the optimization on a potential energy surface and a FES. However, the noise is a necessary component in the START algorithm.

Starting from one point on an HDFES, neither minimum nor saddle optimization algorithms can guarantee a complete search of the minima/saddles on the surface. Here, we obtain a global search strategy by combining two protocols iteratively: A saddle optimization approach allows the system to escape a local minimum, and minimum optimization allows the system to relax from a saddle. In the latter, $\mathbf{s}$ is updated according to

$$\mathbf{s}_{k+1} = \mathbf{s}_k + \delta s\, \frac{\mathbf{F}(\mathbf{s}_k)}{\|\mathbf{F}(\mathbf{s}_k)\|}. \qquad [2]$$
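As a rough illustration of how the estimators above feed the update in Eq. 2, the sketch below replaces the restrained MD or MC run with a toy analytic sampler. Everything specific in it is an assumption made for illustration only: the harmonic potential U(r) = a|r|^2/2 with q(r) = r (for which the restrained distribution is an exact Gaussian), the function names, and the parameter values. It is not the implementation used in this work.

```python
import numpy as np

def restrained_samples(s, kappa, a=1.0, beta=1.0, n_samples=2000, rng=None):
    """Toy stand-in for a restrained MD/MC run (assumption: q(r) = r and
    U(r) = a|r|^2/2, so the restrained distribution is an exact Gaussian)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = kappa * s / (a + kappa)
    std = np.sqrt(1.0 / (beta * (a + kappa)))
    return mean + std * rng.standard_normal((n_samples, s.size))

def force_and_hessian(q, s, kappa, beta=1.0):
    """Sample estimators: F(s) = E[F_inst], H(s) = diag{kappa} - beta*cov(F_inst)."""
    f_inst = kappa * (q - s)                      # F_inst = diag{kappa}(q - s)
    force = f_inst.mean(axis=0)                   # estimate of -grad A
    hessian = np.diag(kappa) - beta * np.cov(f_inst, rowvar=False)
    return force, hessian

def minimize(s0, kappa, ds=0.02, n_steps=300, rng=None):
    """Eq. 2: constant step size along the normalized (noisy) mean force."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.array(s0, dtype=float)
    traj = [s.copy()]
    for _ in range(n_steps):
        q = restrained_samples(s, kappa, rng=rng)  # microscopic solver
        force, _ = force_and_hessian(q, s, kappa)
        s = s + ds * force / np.linalg.norm(force) # macroscopic update
        traj.append(s.copy())
    return np.array(traj)
```

In this toy model every step has length exactly `ds`, so the iterates do not settle on a single point; they are expected to hover in a small cloud around the minimum at the origin.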

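There are various ways to locate an index-1 saddle point on a potential energy surface, including ART, the dimer method, discrete path sampling, and gentlest ascent dynamics (GAD) (30, 31). As we have recently extended GAD to index-1 saddle searching on FESs (29), we now choose this scheme as the saddle optimization method in the START protocol. The optimization equations in this scheme are

$$\mathbf{s}_{k+1} = \mathbf{s}_k + \delta s\, \frac{\mathcal{F}_k}{\|\mathcal{F}_k\|} \qquad [3a]$$

$$\gamma\, \mathbf{n}_{k+1} = \gamma\, \mathbf{n}_k - \delta s \left[\mathbf{H}(\mathbf{s}_k)\mathbf{n}_k - \bigl(\mathbf{n}_k \cdot \mathbf{H}(\mathbf{s}_k)\mathbf{n}_k\bigr)\mathbf{n}_k\right]. \qquad [3b]$$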
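To make the structure of Eqs. 3a and 3b concrete, the following sketch iterates them on a two-dimensional analytic surface with artificial noise added to the force and Hessian. The surface A(x, y) = (x^2 - 1)^2 + 5y^2 (minima at (±1, 0), index-1 saddle at the origin), the noise amplitudes, the parameter values, and the renormalization of n after each step are all assumptions chosen for illustration; they are not taken from this work.

```python
import numpy as np

def toy_force_hessian(s, rng, sigma_f=0.05, sigma_h=0.2):
    """Noisy force and Hessian of the assumed surface A = (x^2-1)^2 + 5y^2."""
    x, y = s
    force = np.array([-4.0 * x * (x**2 - 1.0), -10.0 * y])   # -grad A
    hess = np.array([[12.0 * x**2 - 4.0, 0.0], [0.0, 10.0]])
    noise = sigma_h * rng.standard_normal((2, 2))
    return (force + sigma_f * rng.standard_normal(2),
            hess + 0.5 * (noise + noise.T))                  # symmetric noise

def gad_saddle_search(s0, n0, ds=0.01, gamma=0.2, n_steps=600, seed=0):
    """Iterate Eqs. 3a and 3b with a constant step size ds."""
    rng = np.random.default_rng(seed)
    s = np.array(s0, dtype=float)
    n = np.array(n0, dtype=float)
    n /= np.linalg.norm(n)
    traj = [s.copy()]
    for _ in range(n_steps):
        force, hess = toy_force_hessian(s, rng)
        # Modified force: reflect the component of F along n.
        f_mod = force - 2.0 * np.dot(force, n) * n
        # Eq. 3b: relax n toward the eigenvector with the smallest eigenvalue.
        hn = hess @ n
        n_new = n - (ds / gamma) * (hn - np.dot(n, hn) * n)
        # Eq. 3a: fixed-length step along the normalized modified force.
        s = s + ds * f_mod / np.linalg.norm(f_mod)
        # Renormalizing n is an added assumption to keep |n| = 1 numerically.
        n = n_new / np.linalg.norm(n_new)
        traj.append(s.copy())
    return np.array(traj), n
```

Starting near the minimum at (-1, 0), for example with `gad_saddle_search([-0.9, 0.1], [1.0, 0.3])`, the trajectory should climb along x toward the saddle at the origin while relaxing in y. A smaller `gamma` makes n respond faster to H but also passes more of the Hessian noise through.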

Here, $\mathcal{F}_k = \mathbf{F}(\mathbf{s}_k) - 2(\mathbf{F}(\mathbf{s}_k) \cdot \mathbf{n}_k)\mathbf{n}_k$ and $\delta s$ is the same step size as in Eq. 2. The vector $\mathcal{F}_k$ is a modification of the force $\mathbf{F}(\mathbf{s}_k)$ such that an index-1 saddle becomes an attractor of $\mathbf{s}$ if $\mathbf{n}$ is close to the eigenvector of $\mathbf{H}$ with the minimum eigenvalue. The second equation evolves $\mathbf{n}$ to the required eigenvector. The parameter $\gamma$ controls the “sensitivity” of $\mathbf{n}$ to a change in $\mathbf{H}$, thus providing a means of damping the noise in $\mathbf{H}$. This type of optimization algorithm most closely fits the HMM framework; i.e., Eqs. 2, 3a, and 3b can be viewed as macroscopic solvers for the coarse-grained variables. Information needed for these solvers can come from any microscopic approach capable of delivering the required constrained averages that produce $\mathbf{F}(\mathbf{s})$ and $\mathbf{H}(\mathbf{s})$. Although the present study employs molecular dynamics for this task, stochastic dynamics or Monte Carlo could work equally well.

Note that in Eqs. 2 and 3a, the step size $\delta s$ is constant, and the force ($\mathbf{F}$ or $\mathcal{F}$) is normalized. This contrasts with traditional stochastic approximations in which the gradient of an objective function is typically unnormalized, and a step size $\delta s_k$ in the $k$th step of the optimization is chosen to decay as $1/k$. In the present scheme, we choose $\delta s$ to be sufficiently small so that when multiplied by the normalized force ($\mathbf{F}(\mathbf{s}_k)/\|\mathbf{F}(\mathbf{s}_k)\|$ or $\mathcal{F}_k/\|\mathcal{F}_k\|$), $\mathbf{s}$ will not evolve too rapidly when $\mathbf{F}(\mathbf{s}_k)$ or $\mathcal{F}_k$ is large. In this way, we avoid large jumps in $\mathbf{s}$ and thus avoid a long equilibration phase in the restrained simulations. Note also that in traditional stochastic approximations, when $k$ is large, $\delta s_k$ will be small, and $\mathbf{s}$ will advance more slowly than desired. By contrast, START with a constant step size and normalized force guarantees the efficiency of the optimization. When the step size is constant, however, $\mathbf{s}$ does not converge to a point; rather, the trajectory of $\mathbf{s}$ generates a cluster around a minimum/saddle (32) (see Fig. 2). Such a cluster formed during a minimum/saddle optimization can be extracted by a clustering algorithm (Materials and Methods), and landmarks will be covered by clusters. The exact location of each landmark is then determined by investigating the FES at the cluster. The local FES can be constructed from $\mathbf{F}(\mathbf{s})$ sampled within the cluster. In START, we approximate this local FES by a quadratic function, $A_f(\mathbf{s}) = \mathbf{A} \cdot \mathbf{s} + (1/2)\,\mathbf{s}^{\top} \mathbf{H}_f\, \mathbf{s}$, where $\mathbf{H}_f$ is a symmetric matrix. Suppose there are $M$ points $\{\mathbf{s}_m\}$ $(1 \le m \le M)$ with $\{\mathbf{F}(\mathbf{s}_m)\}$ in a cluster. The parameters in $\mathbf{A}$ and $\mathbf{H}_f$ are estimated by minimizing the objective function

$$\sum_{m=1}^{M} \left\|\nabla A_f(\mathbf{s}_m) + \mathbf{F}(\mathbf{s}_m)\right\|^2$$

(8). This is a quadratic programming problem with a unique solution. The landmark point, either a minimum or a saddle, is a critical point with zero gradient on $A_f(\mathbf{s})$ that can be resolved by solving a set of linear equations $\nabla A_f(\mathbf{s}) = 0$. The matrix $\mathbf{H}_f$ is actually the Hessian at the landmark, estimated by this local fitting, from which the properties of the landmark, e.g., the eigenvalues and eigenvectors of the Hessian, can be determined.

A cluster can also form in a flat region (not a minimum or saddle) on the FES during an optimization. The reason for this is that in such a region, the mean force is small along several directions, and the noisy nature of $\mathbf{F}$ and $\mathbf{H}$ causes a minimum/saddle optimization trajectory to remain in the region for a considerable period, thereby generating a cluster, before it is able to diffuse out. We illustrate this phenomenon using the alanine dipeptide in vacuum. The Ramachandran map, i.e., the FES as a function of the dihedral angles Φ and Ψ, exhibits a very flat region (proved by studying the converged mean force in Fig. 1A), corresponding to M4 in Fig. 2, that does not contain any landmarks. Fig. 1B shows one minimum optimization trajectory passing through this region: It wanders in this region (left box) for a relatively long period before finally diffusing out to a minimum (right box), thus forming a cluster (our criterion for cluster formation is described in Materials and Methods). When this happens, the optimization halts prematurely, and this cluster is then fed into the minimum/saddle analysis described above. Fig. 1C shows the resulting cluster around M4 together with the fitted critical point (landmark). We thus see that the fitted critical point is separated from the cluster centroid; i.e., it can be located at the edge or even outside the cluster. Inspired by the idea of support vector machines (SVMs), an algorithm that we term the “maximum separation test” is designed to screen out these false landmarks by quantifying the degree to which a point is “at the edge,” “near the center,” or “outside” of a cluster. Let $\mathbf{w}$ be the normal vector of a hyperplane through a fitted critical point. The number of samples on one side of the hyperplane is a function $L(\mathbf{w}) = \sum_{m=1}^{M} H(\mathbf{w} \cdot (\mathbf{s}_m - \mathbf{s}_c))$, where $H(x)$ is the Heaviside step function (0–1 loss function), $\mathbf{s}_m$ is the coordinate of the $m$th point in the cluster, and $\mathbf{s}_c$ is the coordinate of the fitted critical point. If we define $\mathbf{w}_{\min} = \arg\min_{\|\mathbf{w}\| = 1.0} L(\mathbf{w})$, the hyperplane with $\mathbf{w}_{\min}$ as its normal vector, called the “plane of maximum separation,” should have a minimum number of points $(L_{\min})$ on one side and a maximum $(L_{\max})$ on the other. Numerically, $H(x)$ is replaced by the cumulative distribution $F_g(\cdot)$ of a Gaussian with SD $\sigma$, and we obtain the plane of maximum separation by minimizing $L(\mathbf{w}) = \sum_{m=1}^{M} F_g(\mathbf{w} \cdot (\mathbf{s}_m - \mathbf{s}_c))$. We then use the ratio $l = L_{\min}/L_{\max}$ as an indicator of the location of a critical point relative to the cluster: The ratio $l$ will be approximately unity if the critical point lies close to the center of the cluster, will decrease if the critical point shifts to the edge of the cluster, and will be approximately zero if the critical point lies outside the cluster. Fig. 1D shows the plane of maximum separation with all points located on one side. The $l$ score in this case is very small (Fig. 2D), and the fitted critical point should be excluded as a false landmark. The consistency of this result with our observations in Fig. 1 A and D highlights the utility of the maximum separation test.

A flowchart of the START procedure is shown in Fig. 2A. The initial shooting move before each optimization, like the initial shifting of one atom in the activation–relaxation technique (20, 21), forces $\mathbf{s}$ to leave a landmark more quickly, thus increasing the overall efficiency. Instead of shooting along a random direction, other strategies, such as shooting along the direction of the Hessian eigenvector with the smallest eigenvalue, can be used (29). The clusters from the clustering algorithm are shown in Fig. 2B for an alanine dipeptide example. Fig. 2D plots the $l$ scores of all of the clusters. Most of these clusters have large scores except for M4 and S7, indicating that these two clusters do not pass the maximum separation test and are, therefore, flat regions on the FES. After local mean force fitting and maximum separation tests, the locations of minima/saddles and the eigenvectors of the Hessians at these points are determined and are plotted in Fig. 2C. The locations of the landmarks match the shape of the FES from a 10-ns benchmark simulation using the unified free energy dynamics (UFED) method (8) (background FES of Fig. 2 B and C) and are also consistent with those in previous studies (33).
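The two post-processing steps described above, local mean force fitting and the maximum separation test, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the symmetric fit is posed as an ordinary linear least-squares problem over the independent entries of A and H_f, and the minimization over unit vectors w is replaced by a crude search over random directions (a proper constrained optimizer would be used in practice). L_max is taken as M - L_min, which follows from F_g(x) + F_g(-x) = 1.

```python
import math
import numpy as np

def fit_local_quadratic(s, f):
    """Least-squares fit of grad A_f(s) = A + H_f s to -F(s_m), H_f symmetric.
    s, f: arrays of shape (M, n). Returns (A, H_f, critical point)."""
    m, n = s.shape
    pairs = [(p, q) for p in range(n) for q in range(p, n)]
    X = np.zeros((m, n, n + len(pairs)))
    for i in range(n):
        X[:, i, i] = 1.0                       # coefficient of A_i
    for c, (p, q) in enumerate(pairs):
        X[:, p, n + c] = s[:, q]               # H_pq enters component p ...
        if p != q:
            X[:, q, n + c] = s[:, p]           # ... and, by symmetry, q
    theta, *_ = np.linalg.lstsq(X.reshape(m * n, -1), -f.reshape(m * n),
                                rcond=None)
    a = theta[:n]
    h = np.zeros((n, n))
    for c, (p, q) in enumerate(pairs):
        h[p, q] = h[q, p] = theta[n + c]
    s_crit = np.linalg.solve(h, -a)            # solves grad A_f(s) = 0
    return a, h, s_crit

def separation_score(s, s_c, sigma=0.02, n_dirs=500, seed=0):
    """l = L_min / L_max, with L(w) smoothed by a Gaussian CDF of SD sigma.
    Random unit vectors stand in for a proper minimization over w."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_dirs, s.shape[1]))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    proj = (s - s_c) @ w.T                     # shape (M, n_dirs)
    erf = np.vectorize(math.erf)
    cdf = 0.5 * (1.0 + erf(proj / (sigma * math.sqrt(2.0))))
    l_min = cdf.sum(axis=0).min()              # smallest smoothed count
    l_max = s.shape[0] - l_min                 # count on the opposite side
    return l_min / l_max
```

A fitted point near the center of its cluster should give a score of order one, whereas a point well outside the cluster should give a score near zero; the threshold separating true from false landmarks is not specified here and would have to be chosen by the user.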

[Figure 1: four panels, A–D. Panel B plots dihedral angle against optimization step; the remaining panels show regions of the Ramachandran plot.]

Fig. 1. A flat region (M4, Fig. 2) on the Ramachandran plot of the alanine dipeptide in vacuum is studied in detail. (A) A benchmark study proves that this region is not a minimum. The red arrows are the mean force at the green lattice points. (B) One minimum optimization trajectory passes through this region and ultimately forms a cluster at another minimum. (C) Samples in this flat region are the green points, and the background FES is from a 10-ns UFED (8) simulation. The red cross is the fitted critical point. (D) The starting point of the green arrow is the fitted critical point. The green line is the plane of maximum separation and the green arrow is its normal vector. The convolved samples are plotted as the colored density distribution in the background. The light yellow color indicates the largest density and the black color shows the low-density regions.

Results

Alanine Tripeptide. The alanine dipeptide illustrated in Fig. 2 serves as a simple, illustrative example for the START algorithm. However, because the FES can also be generated via numerous enhanced sampling methods, and the locations of minima and saddles can be read off directly from the FES (8, 34), we consider next the alanine tripeptide in vacuum to provide an example of a nontrivial four-dimensional FES. It is instructive to test START on such a system. The two pairs of Ramachandran angles are the most flexible CVs for this problem. To search for as many minima and saddles as possible, 600 START iterations are used, and the total simulation time is nearly 150 ns. Local mean force fitting, followed by maximum separation testing, was applied to generate the critical points, eigenvalues, eigenvectors, and $l$ scores. Eighteen minima (M1–M18) and 45 saddles (S1–S45) were selected (SI Appendix). Seventeen minima found in this study also show up in the previously reported UFED study (8), and one extra minimum, M14, is found. However, this minimum lies close to S40 and possesses a correspondingly small Hessian eigenvalue (SI Appendix), indicating that this minimum is shallow. With the information of all minima and saddles, it is possible to set up a network or graph connecting all metastable configurations. The network consisting of all possible conformational changes through various saddle points is exhibited in Fig. 3A. Blue and yellow vertices correspond to minima and saddles, respectively.

Fig. 3B shows the number of clusters obtained as a function of the number of iterations. The number of clusters visited increases rapidly at the beginning and more slowly thereafter because the probability of finding a previously visited cluster increases during the START search procedure. After roughly 300 iterations, START has identified nearly all of the minima and most of the saddles. The subsequent 300 iterations simply provide additional mean force samples, which improves the accuracy of the local mean force fitting.
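The network of minima and saddles described above can be held in a very small data structure. The sketch below is only one possible representation: the labels, the input format (each saddle mapped to the pair of minima it connects), and the function names are hypothetical, and how the connections are obtained from the START trajectories is taken as given.

```python
def build_landmark_graph(minima, saddle_links):
    """Build a colored graph: minima are blue vertices, saddles are yellow
    vertices, and each saddle shares an edge with the minima it connects.

    minima:       iterable of minimum labels, e.g. ["M1", "M2"]
    saddle_links: dict mapping a saddle label to the minima it connects,
                  e.g. {"S1": ("M1", "M2")}  (hypothetical input format)
    """
    color = {m: "blue" for m in minima}
    adjacency = {m: set() for m in minima}    # isolated minima stay visible
    for saddle, ends in saddle_links.items():
        color[saddle] = "yellow"
        adjacency.setdefault(saddle, set())
        for m in ends:
            adjacency[saddle].add(m)
            adjacency.setdefault(m, set()).add(saddle)
    return color, adjacency

def minima_reachable_in_one_hop(adjacency, minimum):
    """Minima connected to `minimum` through a single saddle."""
    return sorted({m for saddle in adjacency[minimum]
                   for m in adjacency[saddle] if m != minimum})
```

With three minima and two saddles, `build_landmark_graph(["M1", "M2", "M3"], {"S1": ("M1", "M2"), "S2": ("M2", "M3")})` gives a chain in which M2 is one saddle crossing away from both M1 and M3. Node attributes such as relative free energies or Hessian eigenvalues could be stored alongside `color` in the same way.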

Met-Enkephalin. Met-enkephalin (Tyr-Gly-Gly-Phe-Met) is a pentapeptide well known as an endogenous ligand of the opioid receptors and distributed throughout the central nervous system. For this small peptide, we used the full set of 10 Ramachandran angles as CVs. In terms of these CVs, the FES is nontrivial (15, 16). Thus, this system provides a very challenging 10-dimensional FES to test the ability of START. Our simulations used 6,905 START iterations (3.5 μs). After locally fitting the FES around a landmark to a quadratic form and testing all clusters by the maximum separation test, 1,081 minima and 1,431 saddles are located (SI Appendix). Saddles connecting to one minimum due to the periodicity of the CVs are excluded here because they provide no information on conformational changes.

Similar to Fig. 3A, a network representation with all minima and saddles as vertices can be constructed for this example. Viewing this network is clearly nontrivial. However, groups of landmarks may share structural similarities on a more coarse-grained level, and the 10 chosen CVs could exhibit structural redundancy. We therefore use the five α-carbon atoms to distinguish structural archetypes. The similarity between two different landmarks is based on the root-mean-square deviation (rmsd) of these α-carbon atoms. The network can be organized by grouping landmarks based on structural similarity, which here is achieved

[Fig. 2 graphic. (A) Flowchart with boxes: Minimum Optimization, Shoot, Saddle Optimization, Shoot, Minimum Cluster, Saddle Cluster, Local Mean Force Fitting, Maximum Separation Test. (B, C) Ramachandran plots with axes Φ and Ψ, each from -150 to 150. (D) Separation Score versus Cluster Index for clusters M1 to M4 and S1 to S7.]

Fig. 2. (A) Flowchart of the START procedure. First, an optimization is performed to drive the system to the nearest minimum from an initial configuration. Following this, a trajectory is shot along a random direction for several steps. A saddle optimization is then started and terminated once it has converged. Finally, a trajectory is shot again along another random direction, and a new minimum optimization is initiated. One such loop, including one minimum and one saddle point optimization, is considered an iteration. All clusters are then sent to a local mean force fitting and maximum separation testing for identifying and locating landmarks. (B) Samples from the START simulation are aggregated as clusters. Red samples are minima candidates and green samples are saddle candidates. (C) The eigenvectors of each minimum/saddle are two perpendicular arrows crossing at the exact location of the minimum/saddle. The length of an arrow is proportional to the magnitude of the corresponding eigenvalue, and the color indicates the sign of the eigenvalue: Red is positive, and white is negative. (D) The scores (l) of the maximum separation test clearly show that the M4 and S7 clusters are flat regions rather than true minima/saddles.

[Fig. 3 graphic. (A) Network graph with minimum vertices M1 to M18 and saddle vertices S1 to S45. (B) Number of Visited Clusters versus Number of Iterations (0 to 600), with separate curves for minima and saddles.]

Fig. 3. (A) A network representing the four-dimensional FES of an alanine tripeptide in vacuum. The CVs are the four Ramachandran dihedral angles of two residues. Each blue box represents a minimum and each yellow box represents a saddle. Generally, each saddle relaxes into two different minima, and two edges are drawn to connect the saddle with these minima, as shown. However, due to the periodicity of the CVs, some saddles, such as S5, S7, S19, S29, S37, and S44, can relax to one minimum only via two different pathways. This situation arises when dihedral angles are used as CVs and is not expected to occur for nonperiodic CVs. (B) The number of clusters after clustering in the START protocol increases with the number of iterations. Most of the minima/saddles are explored within the first 300 iterations.

3238 | www.pnas.org/cgi/doi/10.1073/pnas.1418241112 Chen et al.


using the sketch-map method (16). The sketch map is a variation of multidimensional scaling that seeks a meaningful dimensional reduction of a high-dimensional configuration space while retaining the cluster structure in the low-dimensional representation (details in Materials and Methods). As shown in Fig. 4A, similar landmarks are clustered, and three regions indicate three major classes of structures. The ability to explore various folded and unfolded structures demonstrates that START is powerful as a global search tool on an HDFES. The locations of the landmarks generated by START are now used as inputs into an enhanced sampling calculation to obtain the free energies of these landmarks. Several enhanced sampling algorithms can be used to estimate the free energies of the landmarks. Here, we use driven adiabatic free energy dynamics (d-AFED)/temperature-accelerated MD (TAMD) (2, 3) to evaluate free energies of the landmarks because of d-AFED's ability to sweep rapidly over the free energy landscape and quickly generate free energies at the START landmark locations. The same CVs as in START are used in the d-AFED run (details in Materials and Methods). The free energies of landmarks from a 500-ns d-AFED simulation are shown in Fig. 4B. In fact, the first 200 ns of the d-AFED simulation are already sufficient to select quantitatively the low free energy landmarks (<8 kcal/mol) to within 1 kcal/mol (Fig. 4C). Thus, it should be clear that having this information available a priori is considerably more efficient than searching for the landmarks from scratch via the enhanced sampling procedure.
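As a rough illustration of how landmark free energies might be read off from enhanced sampling data, the sketch below smooths CV samples with a Gaussian kernel (as Materials and Methods describes) and applies the generic relation F = -k_B T ln P at each landmark. The actual procedure is the one in ref. 3; everything here is an assumption made for illustration, including the function names, the treatment of `width` as a kernel standard deviation in radians, and the choice of which temperature to pass in.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def wrap(angle):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def kernel_density(samples, point, width):
    """Unnormalized Gaussian kernel density at `point` for periodic CVs (rad)."""
    total = 0.0
    for s in samples:
        d2 = sum(wrap(a - b) ** 2 for a, b in zip(s, point))
        total += math.exp(-0.5 * d2 / width ** 2)
    return total / len(samples)


def relative_free_energies(samples, landmarks, width, temperature):
    """-k_B T ln(density) at each landmark, shifted so the lowest value is 0."""
    raw = [-K_B * temperature * math.log(kernel_density(samples, lm, width))
           for lm in landmarks]
    lowest = min(raw)
    return [value - lowest for value in raw]


# Toy 1D data, invented for illustration: three samples near one landmark
# and one sample near another.
toy_samples = [(0.0,), (0.0,), (0.0,), (2.0,)]
toy_landmarks = [(0.0,), (2.0,)]
toy_free_energies = relative_free_energies(toy_samples, toy_landmarks,
                                           width=0.1, temperature=800.0)
```

Because only differences are reported, the kernel normalization constant cancels and is omitted. A landmark with no nearby samples would give a density of zero and a failed logarithm, so a real workflow would need to guard against that case.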

Conclusion and Perspective

A robust and efficient approach for searching minima and saddles (landmarks) on an HDFES, START, has been introduced. START provides a strategy for analyzing HDFESs and circumventing the difficulties of uniform sampling and global construction of the HDFES. We used the alanine tripeptide and met-enkephalin examples to show that START is a powerful tool for the global exploration of an HDFES and for the fast and accurate pinning down of exact locations of landmarks. Although our examples were performed in the gas phase, it is important to note that, since the START algorithm targets the FES in selected CVs, these calculations could just as well have been performed in explicit solvent, a feature that distinguishes START from methods that directly target the potential energy surface. If the main goal of a study is to explore the FES, including identifying important configurations such as folded structures in proteins and polymorphs of molecular crystals and computing free energy differences between them, the landmarks located by START are important inputs. The saddles located by START can also be used to construct a good initial guess for subsequent searching of transition paths connecting the two minima on the FES with methods such as the nudged-elastic band approach (17) or the string method (35). The relative free energies of all (or some) landmarks could be easily evaluated by other enhanced sampling methods once the landmarks are located. A network or graph containing all of the landmarks can be constructed as a representation of the HDFES. This network, which also specifies connections between saddles and minima as the edges in the graph, provides important information for studying mechanisms of conformational changes in the system. A related representation was proposed to set up a network for organic reactions (36). In the present scheme, our graph representation allows us to apply graph analysis methods on the HDFES in which the vertices are characterized by a set of key features or molecular structural similarity, so that important properties of the system can be subsequently revealed and elucidated. In a manner that is similar to building a stationary point database in discrete path sampling (22), landmarks located by START, together with their free energies and the graph representation, can also be used for further kinetic studies.

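[Fig. 4 graphic. (A) Sketch-map network with regions labeled A, B, and C and representative structures numbered 1 to 8. (B) Landmarks colored by free energy on a scale from 0 to 20 kcal/mol. (C) Second 500 ns Free Energy (kcal/mol) versus First 500 ns Free Energy (kcal/mol), both axes from 0 to 14, with an inset spanning 0 to 12.]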

Fig. 4. Met-enkephalin in vacuum has been studied by START with the 10 Ramachandran dihedral angles as CVs. (A) The network of all landmarks. Each blue dot denotes a minimum, and each red dot corresponds to a saddle. The coordinates of the vertices are generated by the sketch-map algorithm (16). Every saddle is connected with two minima via solid gray lines. Three major classes of configurations are labeled "A," "B," and "C" and divided by black dashed lines. Region A includes all landmarks associated with "unfolded" or "extended" structures. Examples are structures 6, 7, and 8, as shown. Region B corresponds to helical structures, e.g., structures 1 (3₁₀ helix) and 2. Region C mainly contains hairpin-like conformations. There are β-hairpin structures with various turns: a γ-turn for structure 3 and a β-turn for structure 4; and there are other "U"-shaped structures, e.g., structure 5. Various stable structures (folded and unfolded) indicate the diversity of metastable configurations that were sampled by START. (B) Landmark free energies are estimated from a 500-ns d-AFED/TAMD simulation. A circle denotes a minimum, and a cross corresponds to a saddle. All landmarks with free energies higher than 20 kcal/mol are shown in red. (C) The correlation between landmark free energies from two independent 500-ns d-AFED simulations. Blue dots represent minima, and red dots represent saddles. A point close to the black diagonal indicates that the two free energy values are close. Two green lines represent a ±1 kcal/mol error. The Inset shows the same correlation from two independent 250-ns d-AFED simulations.


Materials and Methods

The test systems, the alanine di- and tripeptides and met-enkephalin in vacuum, are studied by the START algorithm. Samples for $F(s_k)$ and $H(s_k)$ required by START are generated from restrained molecular dynamics simulations with fixed $s = s_k$. All simulations were performed using the PINY_MD (37) package with the CHARMM22 (38) force field. The reversible multiple time-step algorithm r-RESPA (39) was used to integrate the equations of motion, with a 1.0-fs time step for nonbonded interactions and a 0.5-fs time step for intramolecular interactions. The Nosé–Hoover chain algorithm was used with chain length 2 to maintain an average temperature of 300 K. The CVs in all three examples were taken to be the Ramachandran dihedral angles, and the coupling constant κ for the restrained simulations was 278.2 kcal·mol⁻¹·rad⁻². The simulation time for each restrained simulation was 1 ps for the alanine di- and tripeptides. In the met-enkephalin example, the simulation time for each restrained simulation was 1 ps for optimizing to minima, whereas the simulation time was 10 ps for saddle optimizations. In all three cases, the step size δs is chosen as 6.0° (0.105 rad). A minimum/saddle optimization was regarded as converged when the trajectory remained in a box with side length 23° (0.4 rad) for 20 steps for the alanine dipeptide, 30 steps for the alanine tripeptide, and 40 steps for met-enkephalin. Before minimum/saddle optimizations, the system was shot out along a random direction for 5 steps. The initial direction of n in GAD was chosen to be the same as this random shooting direction. In the saddle optimization, the parameter γ = 10.0 for the alanine dipeptide and the alanine tripeptide, and γ = 0.05 for met-enkephalin due to the very high dimensionality of the FES.
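The convergence criterion stated above (the trajectory stays inside a box of fixed side length for a set number of steps) can be sketched as follows. This is a minimal illustration rather than the PINY_MD implementation: the function names are hypothetical, and measuring the box relative to the latest point with periodic wrapping is an assumption about how dihedral CVs would be handled.

```python
import math


def wrap(angle):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def is_converged(trajectory, box_side=0.4, n_steps=20):
    """True if the last n_steps points fit in a box of side box_side (rad)."""
    if len(trajectory) < n_steps:
        return False
    recent = trajectory[-n_steps:]
    reference = recent[-1]
    for dim in range(len(reference)):
        # Offsets relative to the latest point, wrapped for periodic CVs.
        offsets = [wrap(p[dim] - reference[dim]) for p in recent]
        if max(offsets) - min(offsets) > box_side:
            return False
    return True


# Toy trajectories, invented for illustration.
# One hovers around the periodic boundary near +/- pi; one drifts steadily.
hovering = [(3.1 if k % 2 == 0 else -3.1, 0.0) for k in range(20)]
drifting = [(0.1 * k, 0.0) for k in range(20)]
```

The defaults correspond to the alanine dipeptide settings quoted above (0.4 rad, 20 steps); the tripeptide and met-enkephalin cases would pass `n_steps=30` and `n_steps=40`.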

The density-based spatial clustering of applications with noise (DBSCAN) algorithm (40) is suitable for analyzing and classifying the clusters generated in the two optimization steps of the START protocol. The DBSCAN algorithm begins with an arbitrary unvisited sample point on the FES and a count of the number of neighboring points in the sample whose Euclidean distance from the sample point is not larger than some specified value r. If the number of neighboring sample points is larger than some lower bound p, a new cluster is assigned. Otherwise, the point is considered "noise." When a new cluster is assigned, DBSCAN combines all of the neighboring points (neighborhood) within the distance r, together with their neighborhood, provided the points are not noisy points. The cluster growth continues until no points remain (that are not designated as noise) that can be assigned to the cluster. Unlike clustering methods such as k-means and Gaussian mixture models, DBSCAN requires no initial guess of the number or shape of the clusters. In this study, the radius of the neighborhood r is 0.122 rad or 7.0° and the lower bound is p = 4. In the following maximum separation test, the l score, above which a cluster passes the test, is 0.2 in the alanine tripeptide example and 0.25 (for minima) and 0.11 (for saddles) in the met-enkephalin example.
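The clustering step described above can be sketched in a few lines of plain Python. This is a minimal illustration of the DBSCAN idea and not the code used in the study. Two conventions are assumptions: a point's neighbor count excludes the point itself, and a point seeds or extends a cluster when that count is strictly larger than p. Periodicity of the CVs is ignored, since the text specifies a Euclidean distance.

```python
import math

NOISE = -1


def region_query(points, i, r):
    """Indices of all other points within Euclidean distance r of point i."""
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= r]


def dbscan(points, r=0.122, p=4):
    """Return one label per point: a cluster index, or NOISE."""
    labels = [None] * len(points)  # None marks an unvisited point
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, r)
        if len(neighbors) <= p:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, r)
            if len(j_neighbors) > p:
                queue.extend(j_neighbors)  # keep growing through dense points
        cluster += 1
    return labels


# Toy 2D samples, invented for illustration: two tight groups and one outlier.
group_a = [(0.01 * k, 0.0) for k in range(6)]
group_b = [(2.0 + 0.01 * k, 1.0) for k in range(6)]
toy_labels = dbscan(group_a + group_b + [(5.0, 5.0)])
```

The brute-force neighbor search is quadratic in the number of samples, which is acceptable for a sketch but would normally be replaced by a spatial index for large sample sets.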

In the met-enkephalin example, the objective function of the sketch map is $\chi^2 = \sum_{i \neq j} [F(R_{ij}) - f(r_{ij})]^2$ (16), where $r_{ij}$ is the Euclidean distance between two different landmarks in the projected 2D space and $R_{ij}$ is the rmsd between two averaged structures in terms of α-carbon atoms. The switching functions $F(r)$ and $f(r)$ follow the general formula $1 - (1 + (2^{a/b} - 1)(r/\sigma)^a)^{-b/a}$ (16). The values of $a$ and $b$ are 3.0 and 9.0 for $F(r)$ and 2.0 and 2.0 for $f(r)$. The parameter σ is 0.5 Å in both cases. The optimization strategy follows that in ref. 16. In the d-AFED simulation, the temperature of s is 800 K, and the mass is 168.0 Å²·rad⁻² times the mass of the proton. The other simulation parameters are the same as those prescribed before. Independent samples are taken from the trajectories every 1 ps. These samples are convolved with a homogeneous Gaussian kernel with a variance of 20° to generate a probability distribution of s, and landmark free energies can be evaluated following the method in ref. 3.
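The sketch-map objective quoted above can be evaluated directly once the two distance matrices are available. The sketch below only computes the stress for given inputs; it does not perform the optimization, which follows ref. 16. The function names and the nested-list matrix layout are assumptions made for illustration.

```python
def switching(r, sigma, a, b):
    """Sketch-map sigmoid 1 - (1 + (2**(a/b) - 1) * (r/sigma)**a) ** (-b/a)."""
    return 1.0 - (1.0 + (2.0 ** (a / b) - 1.0) * (r / sigma) ** a) ** (-b / a)


def sketch_map_stress(R, r, sigma=0.5):
    """chi^2 = sum over i != j of [F(R_ij) - f(r_ij)]^2 for square matrices."""
    n = len(R)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                high = switching(R[i][j], sigma, a=3.0, b=9.0)  # F, rmsd space
                low = switching(r[i][j], sigma, a=2.0, b=2.0)   # f, 2D space
                total += (high - low) ** 2
    return total


# Toy two-landmark example, invented for illustration.
toy_R = [[0.0, 0.5], [0.5, 0.0]]
toy_r = [[0.0, 0.5], [0.5, 0.0]]
toy_stress = sketch_map_stress(toy_R, toy_r)
```

A useful property to check is that the switching function equals 1/2 at r = σ for any a and b, because the factor $2^{a/b}$ raised to the power $-b/a$ is exactly 1/2. This is why σ sets the distance scale that separates "close" from "far" pairs.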

ACKNOWLEDGMENTS. M.E.T. and M.C. acknowledge support from the National Science Foundation Grant CHE-1301314. M.C. also acknowledges the Margaret and Herman Sokol Doctoral Fellowship in the Sciences.

1. Rosso L, et al. (2002) On the use of the adiabatic molecular dynamics technique in the calculation of free energy profiles. J Chem Phys 116:4389–4402.

2. Maragliano L, Vanden-Eijnden E (2006) A temperature accelerated method for sampling free energy and determining reaction pathways in rare events simulations. Chem Phys Lett 426(1–3):168–175.

3. Abrams JB, Tuckerman ME (2008) Efficient and direct generation of multidimensional free energy surfaces via adiabatic dynamics without coordinate transformations. J Phys Chem B 112(49):15742–15757.

4. Laio A, Parrinello M (2002) Escaping free-energy minima. Proc Natl Acad Sci USA 99(20):12562–12566.

5. Bonomi M, Parrinello M (2010) Enhanced sampling in the well-tempered ensemble. Phys Rev Lett 104(19):190601.

6. Darve E, Rodríguez-Gómez D, Pohorille A (2008) Adaptive biasing force method for scalar and vector free energy calculations. J Chem Phys 128(14):144120.

7. Hénin J, Fiorin G, Chipot C, Klein ML (2010) Exploring multidimensional free energy landscapes using time-dependent biases on collective variables. J Chem Theory Comput 6(1):35–47.

8. Chen M, Cuendet MA, Tuckerman ME (2012) Heating and flooding: A unified approach for rapid generation of free energy surfaces. J Chem Phys 137(2):024102.

9. Morishita T, Itoh SG, Okumura H, Mikami M (2012) Free-energy calculation via mean-force dynamics using a logarithmic energy landscape. Phys Rev E Stat Nonlin Soft Matter Phys 85(6 Pt 2):066702.

10. Piana S, Laio A (2008) Advillin folding takes place on a hypersurface of small dimensionality. Phys Rev Lett 101(20):208101.

11. Das P, Moll M, Stamati H, Kavraki LE, Clementi C (2006) Low-dimensional, free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction. Proc Natl Acad Sci USA 103(26):9885–9890.

12. Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323.

13. Coifman RR, et al. (2005) Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proc Natl Acad Sci USA 102(21):7426–7431.

14. Rohrdanz MA, Zheng W, Maggioni M, Clementi C (2011) Determination of reaction coordinates via locally scaled diffusion map. J Chem Phys 134(12):124116.

15. Tribello GA, Ceriotti M, Parrinello M (2010) A self-learning algorithm for biased molecular dynamics. Proc Natl Acad Sci USA 107(41):17509–17514.

16. Ceriotti M, Tribello GA, Parrinello M (2011) From the cover: Simplifying the representation of complex free-energy landscapes using sketch-map. Proc Natl Acad Sci USA 108(32):13023–13028.

17. Henkelman G, Uberuaga BP, Jónsson H (2000) A climbing image nudged elastic band method for finding saddle points and minimum energy paths. J Chem Phys 113(22):9901–9904.

18. Henkelman G, Jónsson H (1999) A dimer method for finding saddle points on high dimensional potential surfaces using only first derivatives. J Chem Phys 111(15):7010–7022.

19. Munro LJ, Wales DJ (1999) Defect migration in crystalline silicon. Phys Rev B 59:3969–3980.

20. Barkema GT, Mousseau N (1996) Event-based relaxation of continuous disordered systems. Phys Rev Lett 77(21):4358–4361.

21. Malek R, Mousseau N (2000) Dynamics of Lennard-Jones clusters: A characterization of the activation-relaxation technique. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 62(6 Pt A):7723–7728.

22. Wales DJ (2002) Discrete path sampling. Mol Phys 100(20):3285–3305.

23. Widrow B, Stearns SD (1985) Adaptive Signal Processing (Prentice-Hall, Upper Saddle River, NJ).

24. Bottou L (2004) Stochastic learning. Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, eds Bousquet O, von Luxburg U (Springer, Berlin), pp 146–168.

25. Valsson O, Parrinello M (2014) Variational approach to enhanced sampling and free energy calculations. Phys Rev Lett 113(9):090601.

26. E W, Engquist B (2002) The heterogeneous multiscale methods. Commun Math Sci 1(1):87–132.

27. Gear WC, et al. (2003) Equation-free, coarse-grained multiscale computation: Enabling macroscopic simulators to perform system-level analysis. Commun Math Sci 1(4):715–762.

28. Maragliano L, Fischer A, Vanden-Eijnden E, Ciccotti G (2006) String method in collective variables: Minimum free energy paths and isocommittor surfaces. J Chem Phys 125(2):024106.

29. Samanta A, Chen M, Yu T-Q, Tuckerman M, E W (2014) Sampling saddle points on a free energy surface. J Chem Phys 140(16):164109.

30. E W, Zhou X (2011) The gentlest ascent dynamics. Nonlinearity 24(6):1831–1842.

31. Samanta A, E W (2012) Atomistic simulations of rare events using gentlest ascent dynamics. J Chem Phys 136(12):124104.

32. Borkar VS (2008) Stochastic Approximation: A Dynamical Systems Viewpoint (Cambridge Univ Press, New York).

33. Strodel B, Wales DJ (2008) Free energy surfaces from an extended harmonic superposition approach and kinetics for alanine dipeptide. Chem Phys Lett 466(4–6):105–115.

34. Rosso L, Abrams JB, Tuckerman ME (2005) Mapping the backbone dihedral free-energy surfaces in small peptides in solution using adiabatic free-energy dynamics. J Phys Chem B 109(9):4162–4167.

35. E W, Ren W, Vanden-Eijnden E (2002) String method for the study of rare events. Phys Rev B 66:052301.

36. Rappoport D, Galvin CJ, Zubarev DYu, Aspuru-Guzik A (2014) Complex chemical reaction networks from heuristics-aided quantum chemistry. J Chem Theory Comput 10(3):897–907.

37. Tuckerman ME, et al. (2000) Exploiting multiple levels of parallelism in molecular dynamics based calculations via modern techniques and software paradigms on distributed memory computers. Comput Phys Commun 128:333–376.

38. MacKerell AD, et al. (1998) All-atom empirical potential for molecular modeling and dynamics studies of proteins. J Phys Chem B 102(18):3586–3616.

39. Tuckerman ME, Martyna GJ, Berne BJ (1992) Reversible multiple time scale molecular dynamics. J Chem Phys 97:1990–2001.

40. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), eds Simoudis E, Han J, Fayyad U (AAAI Press, Menlo Park, CA), pp 226–231.
