J. Mol. Biol. (1996) 256, 623–644

Residue–Residue Potentials with a Favorable Contact Pair Term and an Unfavorable High Packing Density Term, for Simulation and Threading

Sanzo Miyazawa 1 and Robert L. Jernigan 2*

1 Faculty of Technology, Gunma University, Kiryu, Gunma 376, Japan
2 Room B-116, Building 12B, Laboratory of Mathematical Biology, DBS, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892-5677, USA
*Corresponding author

Attractive inter-residue contact energies for proteins have been re-evaluated with the same assumptions and approximations used originally by us in 1985, but with a significantly larger set of protein crystal structures. An additional repulsive packing energy term, operative at higher densities to prevent overpacking, has also been estimated for all 20 amino acids as a function of the number of contacting residues, based on their observed distributions. The two terms of opposite sign are intended to be used together to provide an estimate of the overall energies of inter-residue interactions in simplified proteins without atomic details. To overcome the problem of how to utilize the many homologous proteins in the Protein Data Bank, a new scheme has been devised to assign different weights to each protein, based on similarities among amino acid sequences. A total of 1168 protein structures containing 1661 subunit sequences are actually used here. After the sequence weights have been applied, these correspond to an effective number of residue–residue contacts of 113,914, or about six times more than were used in the old analysis. Remarkably, the new attractive contact energies are nearly identical to the old ones, except for those with Leu and the rarer amino acids Trp and Met. The largest change found for Leu is surprising. The estimates of hydrophobicity from the contact energies for non-polar side-chains agree well with the experimental values. In an application of these contact energies, the sequences of 88 structurally distinct proteins in the Protein Data Bank are threaded at all possible positions without gaps into 189 different folds of proteins whose sequences differ from each other by at least 35% sequence identity. The native structures for 73 of 88 proteins, excluding 15 exceptional proteins such as membrane proteins, are all demonstrated to have the lowest alignment energies.

© 1996 Academic Press Limited

Keywords: hydrophobicity; contact energy; residue packing energy; sequence threading; sequence sampling weights

Introduction

The understanding of protein folding is a long-standing goal in structural biology. Although a large number of native structures of proteins are already known and more are being elucidated rapidly, usually only relatively small fluctuations near native structures have been examined in detail with molecular dynamics simulations that employ potentials with full atomic representations of proteins. Much less is known about the denatured state and the full breadth of the protein folding process. Simulating the entire protein folding process, which occurs on a time scale ranging from milliseconds to seconds, would require enormous computational power because of the high dimensional space of protein conformations and the complexity of the energy surface. The present-day capabilities of computers usually limit the time scale of molecular dynamics simulations to nanoseconds. Consequently, simplified models are required for the study of the protein folding process. Simplifications can be made to both geometry and potential functions. Here, we address the design of simplified but realistic free energy potentials that include solvent effects.
Abbreviations used: PDB, Protein Data Bank; s.d., standard deviation.

0022–2836/96/080623–22 $12.00/0 © 1996 Academic Press Limited

Previously we evaluated empirically a set of such effective inter-residue contact energies for all pairs of the 20 amino acids, under the basic assumption that the average characteristics of residue–residue contacts observed in a large number of crystal structures of globular proteins represent the actual intrinsic inter-residue interactions. For this purpose, the Bethe approximation (quasi-chemical approximation) with an approximate treatment of the effects of chain connectivity was employed with the number of residue–residue close contacts observed in protein crystal structures (Miyazawa & Jernigan, 1985). This empirical energy function included solvent effects, and provided an estimate of the long-range component of conformational energies without atomic details. A comparison of these contact energies with the Nozaki–Tanford transfer energies (Nozaki & Tanford, 1971) showed a high correlation, although on average the contact energies yielded about twice the energy gain indicated by the Nozaki–Tanford transfer energies (Miyazawa & Jernigan, 1985). However, when these contact energies are applied together with a simple assumption about the compactness of the denatured state to estimate the unfolding Gibbs free energy changes for single amino acid mutants of the tryptophan synthase α subunit (Yutani et al., 1987) and bacteriophage T4 lysozyme (Matsumura et al., 1988), they yield estimates that exhibit a strong correlation with the observed values, especially for hydrophobic amino acids. Also, the calculated energy values had the same magnitudes as the observed values for both proteins. This method can also explain the wide range of unfolding Gibbs free energy changes for single amino acid replacements at various residue positions of staphylococcal nuclease (Miyazawa & Jernigan, 1994). These facts all indicate that the inter-residue contact energies properly reflect actual inter-residue interactions, including hydrophobic effects originating in solvent effects.

It is generally thought that the native conformations of proteins correspond to the structures of lowest free energy. Thus, successful potential functions, such as those based on native structures, ought to yield the lowest free energy for the native conformations. It was demonstrated that classical semi-empirical potentials such as CHARMM (Brooks et al., 1983) cannot always identify non-native folds of proteins (Novotny et al., 1984). Our contact energies were demonstrated to discriminate successfully between native-like and incorrectly folded conformations in a lattice study of five small proteins (Covell & Jernigan, 1990). Sippl (1990) evaluated the potentials of mean force as a function of distance for two-body interactions between amino acids in protein structures from the radial distribution of amino acids in known protein native structures. The potentials of mean force for the interactions between Cβ atoms of all amino acid pairs were used to calculate the conformational energies of amino acid sequences in a number of different folds, and it was found that the conformational energy of the native state is the lowest among the alternatives (Hendlich et al., 1990; Sippl & Weitckus, 1992; Jones et al., 1992). Pairwise contact potentials depending on inter-residue distance were also estimated by Bryant & Lawrence (1993).

Table 1. Summary of protein structures used in the present analysis

Number of protein structures (a)           1168
Number of protein subunit structures       1661
Number of protein families (b)              424
Effective number of proteins (Σ_i w_i)      251

(a) Structures whose resolutions were higher than 2.5 Å and which were determined by X-ray analyses and are larger than 50 residues. See text for details.
(b) A set of proteins with less than 95% sequence identity between any pair.

It should be noted here that a two-body residue–residue potential of mean force based on the radial distribution of residues will manifest peaks and valleys as a function of distance, even for hard spheres, which are effects of close residue packing. However, these would not be present in actual interaction potentials. That is, such a potential of mean force reflects not only the actual inter-residue interactions, but also includes the average effects of other residues upon the target residue pair, including those interposed between the target pair and especially the significant effects of residue packing in protein structures. There will be an over-counting if the sum of the potential is taken over all residue pairs. Thus, if the residue–residue potential in a protein is approximated by such a potential of mean force, the sum of the potential over all residue pairs is unlikely to yield the correct value for the total residue–residue interaction energy. In addition, even though these effective potentials have the important characteristics of low values for the native folds of proteins, they are unlikely to succeed in representing the actual potential surface far from the native conformation. Therefore, such potentials of mean force may not be appropriate for application in a study of a wide range of conformations, from the denatured state to the native conformation.

On the other hand, a direct account of the requirement that the native state be lowest in energy was taken by Crippen (1991), and Maiorov & Crippen (1992), who tried to fit empirically a feasible set of parameters, which corresponded to contact energies between amino acid groups separated at certain ranges of distance, in such a way that the total contact energies of native conformations were lower than other alternatives. As with all of these potentials, it is unknown how well this potential represents the actual potential surface far from the native conformation.

Luthy et al. (1992) developed an empirical method to evaluate the correctness of protein models. Pseudo-potentials have been devised to find out which amino acid sequences fold appropriately into a known three-dimensional structure (Bowie et al.,

Table 2. Number of contacts: upper triangle (a) for random mixing and lower triangle (b) for actual counts in the protein sample

Column headings: N_i, SLV (c), Cys, Met, Phe, Ile, Leu, Val, Trp, Tyr, Ala, Gly, Thr, Ser, Asn, Gln, Asp, Glu, His, Arg, Lys, Pro.

[Table body: the numerical entries could not be recovered from the extracted text.]

Totals are: N_r = 54,356.2, N_rr = 113,913.5, 2N_r0 = 113,569.1.
(a) Scaling factors are C'_ii × 10, C'_i0 × 20, C_ii × 10, and C_ij × 20.
(b) Scaling factors are N10, N_ii × 10, N_ij × 20.
(c) SLV, effective solvent molecules.

Table 3. Contact energies in RT units; e_ij for upper half and diagonal and e'_ij for lower half

      Cys   Met   Phe   Ile   Leu   Val   Trp   Tyr   Ala   Gly   Thr   Ser   Asn   Gln   Asp   Glu   His   Arg   Lys   Pro
Cys −5.44 −4.99 −5.80 −5.50 −5.83 −4.96 −4.95 −4.16 −3.57 −3.16 −3.11 −2.86 −2.59 −2.85 −2.41 −2.27 −3.60 −2.57 −1.95 −3.07
Met  0.46 −5.46 −6.56 −6.02 −6.41 −5.32 −5.55 −4.91 −3.94 −3.39 −3.51 −3.03 −2.95 −3.30 −2.57 −2.89 −3.98 −3.12 −2.48 −3.45
Phe  0.54 −0.20 −7.26 −6.84 −7.28 −6.29 −6.16 −5.66 −4.81 −4.13 −4.28 −4.02 −3.75 −4.10 −3.48 −3.56 −4.77 −3.98 −3.36 −4.25
Ile  0.49 −0.01  0.06 −6.54 −7.04 −6.05 −5.78 −5.25 −4.58 −3.78 −4.03 −3.52 −3.24 −3.67 −3.17 −3.27 −4.14 −3.63 −3.01 −3.76
Leu  0.57  0.01  0.03 −0.08 −7.37 −6.48 −6.14 −5.67 −4.91 −4.16 −4.34 −3.92 −3.74 −4.04 −3.40 −3.59 −4.54 −4.03 −3.37 −4.20
Val  0.52  0.18  0.10 −0.01 −0.04 −5.52 −5.18 −4.62 −4.04 −3.38 −3.46 −3.05 −2.83 −3.07 −2.48 −2.67 −3.58 −3.07 −2.49 −3.32
Trp  0.30 −0.29  0.00  0.02  0.08  0.11 −5.06 −4.66 −3.82 −3.42 −3.22 −2.99 −3.07 −3.11 −2.84 −2.99 −3.98 −3.41 −2.69 −3.73
Tyr  0.64 −0.10  0.05  0.11  0.10  0.23 −0.04 −4.17 −3.36 −3.01 −3.01 −2.78 −2.76 −2.97 −2.76 −2.79 −3.52 −3.16 −2.60 −3.19
Ala  0.51  0.15  0.17  0.05  0.13  0.08  0.07  0.09 −2.72 −2.31 −2.32 −2.01 −1.84 −1.89 −1.70 −1.51 −2.41 −1.83 −1.31 −2.03
Gly  0.68  0.46  0.62  0.62  0.65  0.51  0.24  0.20  0.18 −2.24 −2.08 −1.82 −1.74 −1.66 −1.59 −1.22 −2.15 −1.72 −1.15 −1.87
Thr  0.67  0.28  0.41  0.30  0.40  0.36  0.37  0.13  0.10  0.10 −2.12 −1.96 −1.88 −1.90 −1.80 −1.74 −2.42 −1.90 −1.31 −1.90
Ser  0.69  0.53  0.44  0.59  0.60  0.55  0.38  0.14  0.18  0.14 −0.06 −1.67 −1.58 −1.49 −1.63 −1.48 −2.11 −1.62 −1.05 −1.57
Asn  0.97  0.62  0.72  0.87  0.79  0.77  0.30  0.17  0.36  0.22  0.02  0.10 −1.68 −1.71 −1.68 −1.51 −2.08 −1.64 −1.21 −1.53
Gln  0.64  0.20  0.30  0.37  0.42  0.46  0.19 −0.12  0.24  0.24 −0.08  0.11 −0.10 −1.54 −1.46 −1.42 −1.98 −1.80 −1.29 −1.73
Asp  0.91  0.77  0.75  0.71  0.89  0.89  0.30 −0.07  0.26  0.13 −0.14 −0.19 −0.24 −0.09 −1.21 −1.02 −2.32 −2.29 −1.68 −1.33
Glu  0.91  0.30  0.52  0.46  0.55  0.55  0.00 −0.25  0.30  0.36 −0.22 −0.19 −0.21 −0.19  0.05 −0.91 −2.15 −2.27 −1.80 −1.26
His  0.65  0.28  0.39  0.66  0.67  0.70  0.08  0.09  0.47  0.50  0.16  0.26  0.29  0.31 −0.19 −0.16 −3.05 −2.16 −1.35 −2.25
Arg  0.93  0.38  0.42  0.41  0.43  0.47 −0.11 −0.30  0.30  0.18 −0.07 −0.01 −0.02 −0.26 −0.91 −1.04  0.14 −1.55 −0.59 −1.70
Lys  0.83  0.31  0.33  0.32  0.37  0.33 −0.10 −0.46  0.11  0.03 −0.19 −0.15 −0.30 −0.46 −1.01 −1.28  0.23  0.24 −0.12 −0.97
Pro  0.53  0.16  0.25  0.39  0.35  0.31 −0.33 −0.23  0.20  0.13  0.04  0.14  0.18 −0.08  0.14  0.07  0.15 −0.05 −0.04 −1.75

Summary rows (values for Cys to Pro, in the column order above):

e_rr = −2.55; e_ir: −3.57 −3.92 −4.76 −4.42 −4.81 −3.89 −3.81 −3.41 −2.57 −2.19 −2.29 −1.98 −1.92 −2.00 −1.84 −1.79 −2.56 −2.11 −1.52 −2.09
e_r = −3.60; e_i: −4.29 −4.73 −5.57 −5.29 −5.71 −4.72 −4.41 −3.87 −3.17 −2.53 −2.63 −2.27 −2.14 −2.35 −2.02 −2.07 −2.94 −2.43 −1.82 −2.53
f_r = −3.60; f_i: −5.58 −6.14 −7.39 −7.09 −7.88 −6.15 −5.34 −4.60 −3.24 −2.22 −2.48 −1.92 −1.74 −1.93 −1.54 −1.49 −2.91 −2.07 −1.17 −1.97
N_rr/N_r = 2.096; N_ir/N_i: 2.723 2.722 2.780 2.811 2.893 2.728 2.537 2.493 2.143 1.840 1.973 1.771 1.699 1.720 1.598 1.508 2.075 1.787 1.343 1.629

q_i (22 values, in the order printed): 7.162 6.281 6.646 6.137 5.870 6.042 6.087 6.155 5.793 6.037 6.334 6.284 6.486 6.582 6.574 6.469 6.487 6.235 6.241 6.318 6.569 5.858
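As an illustrative check on how the two halves of Table 3 relate, the following sketch assumes that e'_ij = e_ij − (e_ii + e_jj)/2. This is the reading suggested by the description of e'_ij as the energy difference accompanying the formation of an i–j contact from i–i and j–j contacts; the formula as written here, the dictionary E and the helper e_prime are assumptions of the sketch and not definitions quoted from the text. Only a few upper-half entries are copied in.

```python
# Contact energies e_ij in RT units, copied from the upper half and
# diagonal of Table 3 (only the entries needed for this illustration).
E = {
    ("Cys", "Cys"): -5.44,
    ("Cys", "Met"): -4.99,
    ("Met", "Met"): -5.46,
    ("Leu", "Leu"): -7.37,
    ("Leu", "Val"): -6.48,
    ("Val", "Val"): -5.52,
}


def e_prime(i, j, e=E):
    """Assumed relation: e'_ij = e_ij - (e_ii + e_jj) / 2, in RT units."""
    # The table is symmetric, so look the pair up in either order.
    pair = e.get((i, j), e.get((j, i)))
    return pair - 0.5 * (e[(i, i)] + e[(j, j)])


if __name__ == "__main__":
    for a, b in [("Cys", "Met"), ("Leu", "Val")]:
        print(a, b, round(e_prime(a, b), 3))
```

Under this assumption the Cys–Met value comes out at 0.46 and the Leu–Val value at about −0.035, which agree with the lower-half entries 0.46 and −0.04 to within the two-decimal rounding of the table.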

1991; Nishikawa & Matsuo, 1993). The pseudo-potential devised by Nishikawa & Matsuo (1993) was composed of four terms, side-chain packing, hydration, hydrogen bonding and local conformational potentials, and was empirically derived from the statistical features observed in 101 known protein structures. A slightly modified form of the Sippl potential was used to take account of the effect of side-chain packing in proteins. All other terms were also evaluated as potentials of mean force. This function was also demonstrated to be an appropriate measure of the compatibility between sequences and structures of proteins. They share a number of common characteristics with empirical energy potentials, but they are not designed to be used explicitly as an empirical energy function. In the case of Nishikawa & Matsuo (1993), each of the four terms are summed with weights in the total energy.

On the other hand, the present two-body contact energies are estimated with the Bethe approximation with the basic assumption of regarding residue–residue contacts in protein structures to be the same as those in mixtures of unconnected amino acids and solvent molecules. The Bethe approximation is a well-known second-order approximation to the mean field approximation used to describe a system consisting of a mixture of multiple molecular species interacting with each other (Hill, 1960). Both approximations are usually used to calculate a partition function for such a system from a given set of interaction energies between molecules. In the mean field approximation, contacts between species would be approximated to be random, and a partition function of the system would be developed. In the Bethe approximation, the effects of interactions are taken into account to estimate the average numbers of contacts. Thus, the Bethe approximation is the lowest order approximation to be able to provide an estimate of a set of contact energies between species from a given set of the average numbers of contacts between them.

Therefore, if residue–residue contacts in protein structures can be reliably represented to be the same as those in mixtures of unconnected amino acids and solvent molecules, the Bethe approximation will give us a reasonable estimate of actual interaction energies between amino acids. Of course, it must be examined whether or not this basic approximation is appropriate to describe the contacts in real protein structures. Also, a limitation, both of this method and of methods using a potential of mean force to evaluate inter-residue interactions, lies in the fact that the effects of specific amino acid sequences on the statistical distribution of inter-residue distances are completely neglected, although the characteristics of the protein being a chain are included as a mean field.

Here, the effective inter-residue contact energies for all amino acid pairs are re-evaluated using the same method as before, but with significantly more protein structures than before. In the original work, only 42 globular proteins were used to calculate contact frequencies between amino acids. Since 1985, many additional protein structures have been reported, and now more than 1000 protein structures are available. However, one complication arises from the fact that there are many homologous proteins in the Protein Data Bank (PDB; Bernstein et al., 1977). For example, the structures of many single amino acid mutants of T4 lysozyme are included in the PDB. Therefore, to use all of this structural data, an unbiased sampling of protein structures from the PDB is required in the calculation of the contact frequencies. Here, a sampling weight for each protein has been devised based on a sequence homology matrix giving the extent of sequence identity of all pairs of sequences.

In the estimation of the attractive contact energies, the Bethe lattice is used for conformational space. However, an additional repulsive potential based on many-body residue packing is needed to properly estimate the long range energies of protein conformations. Here, the repulsive energies that result from tight packing of residues are evaluated as a function of the numbers of contacting residues based on their distributions in known protein structures.

Figure 1. A comparison of the coordination number, q_i, between the previous analysis (1985) and this work for each type of amino acid. Slv denotes solvent. The ordinate indicates the values previously reported by Miyazawa & Jernigan (1985) and the abscissa the values obtained in this work. A continuous line shows the regression line that is y = −0.19 + 1.03x; the correlation coefficient is 0.98.

Results

Sample weighting

According to the procedure described in Methods, 1168 protein structures, whose structures have been analyzed by X-ray and whose resolution has been reported to be better than 2.5 Å, are chosen, and then each of the 1661 sequences in those structures is sampled with a weight determined as described below on the basis of the sequence identity matrix. Sequences are aligned in a conventional pairwise manner with a global alignment method (Needleman & Wunsch, 1970), using the log odds matrix for 250 PAM (Dayhoff et al., 1978) as a scoring matrix for amino acid similarity. Penalty for a gap (of k residues) is taken to be 12 + 4(k − 1) with the cut-off value of 48, but no penalty is used for terminal gaps. Sequence identities are then calculated for the aligned pairs of sequences. In the original work, the numbers of contacts in protein structures were counted in complete assembly. In the present calculation, only the coordinates of subunits explicitly given in the PDB files are used. As listed in Table 1, the effective number of proteins is 251. The effective number of residues is 54,356, and the total effective number of contacts is 113,914 in the present analysis; by contrast, these values were 9040 and 18,192 in the original analysis, respectively. The total effective number of contacts is 6.3 times the previous data. Thus, on average, the standard deviations in the numbers of contacts could be expected to be reduced by a factor of 1/2.5.
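The gap penalty and the expected reduction in statistical error described above can be sketched as follows. The function name gap_penalty, its parameter names, and the reading of the cut-off value of 48 as an upper bound on the penalty are assumptions made for illustration. The error estimate additionally assumes that the standard deviation of a contact count scales as the inverse square root of the number of observations, which is one plausible reading of the factor 1/2.5 and is not stated explicitly in the text.

```python
import math


def gap_penalty(k, open_cost=12, extend_cost=4, cutoff=48):
    """Penalty for an internal gap of k residues.

    Implements 12 + 4(k - 1); the cut-off of 48 is treated here as a cap
    (an assumption). Terminal gaps carry no penalty and should not be
    passed to this function.
    """
    if k <= 0:
        return 0
    return min(open_cost + extend_cost * (k - 1), cutoff)


# Effective numbers of contacts quoted in the text.
contacts_present = 113914
contacts_original = 18192

ratio = contacts_present / contacts_original   # about 6.3
reduction = math.sqrt(ratio)                   # about 2.5, assuming 1/sqrt(N) scaling

if __name__ == "__main__":
    for k in (1, 2, 10, 20):
        print(k, gap_penalty(k))
    print(round(ratio, 1), round(reduction, 1))
```

With these parameters the penalty reaches the cap at a gap of ten residues, so longer gaps cost no more than 48.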

Estimates of contact energies

The effective numbers of contacts observed in the protein structures are listed in the lower triangle of Table 2. The entries in Table 2 have been multiplied by ten for diagonal elements and by 20 for off-diagonal elements. The numbers in the upper triangle of Table 2 correspond to the correction factors, C and C', for the estimation of contact energies; see equations (29), (30), (36) and (37) in this paper, and also equations (10) to (15) of Miyazawa & Jernigan (1985). The coordination number for each residue type has been estimated in the same way with equation (33) of Miyazawa & Jernigan (1985) from the volume of each type of residue at the center and the average volume of its surrounding residues. These newly estimated coordination numbers are listed in the last row of Table 3 and are very similar to the previous estimates (1985) as shown in Figure 1; the regression line is y = −0.19 + 1.03x and the correlation coefficient is 0.98.

Figure 2. A, A comparison of the average contact energy, e_i, in RT units for each type of amino acid between the previous analysis (1985) and this work. The ordinate values are those reported by Miyazawa & Jernigan (1985) and the abscissa the values obtained in the present analysis. The continuous line is the regression line, y = 0.06 + 0.92x; the correlation coefficient is 0.98. B, A comparison of the contact energy, e_ij, in RT units of each type of contact between the previous analysis (1985) and this work. A continuous line shows the regression line that is y = 0.03 + 0.93x; the correlation coefficient is 0.97. C, A comparison of the energy, e'_ij, in RT units, which is the energy difference accompanying the formation of a contact pair i–j from contact pairs i–i and j–j, between the previous analysis (1985) and this work. The regression line is y = −0.02 + 0.96x, and the correlation coefficient is 0.88.

Effective contact energies are estimated fromthese numbers and are listed in dimensionlessunits, units of RT, in Table 3. Figure 2 shows acomparison between the new contact energies andthose from the previous analysis. The averagecontact energies, ei , for each type of residue arecompared in Figure 2A, and each of the contactenergies, eij , is compared in Figure 2B. The ordinatevalues are those reported by Miyazawa & Jernigan(1985) and the abscissa values those obtained in thepresent work. Correlations between both estimatesare high; the correlation coefficient is 0.98 for ei

and 0.97 for eij . The difference between these twoestimates tends to be larger for hydrophobic aminoacids than for hydrophilic ones. On average, thepresent estimates of contact energies are slightlymore negative than the previous estimates, perhapsreflecting large structures present in the new data;the regression line is y = 0.06 + 0.92x for ei , andy = 0.03 + 0.93x for eij . The present estimate of themean contact energy, er , is equal to −3.6, which ismore negative than the old estimate, −3.2.

Also, the differences between individual newand old contact energies tend to be large forinfrequent amino acids such as Trp and Met. Thismight be expected. However, there is one largedifference for a common amino acid, Leu. Thecontact energy of any pair with Leu is significantlymore negative in the present analysis. All of the topten contact pairs with large differences involve pairswith Leu. The difference in the contact energy forthe Leu–Leu pair is the largest among them, and thepresent estimate, −7.37, is more negative than theprevious one, −5.79. If the coordination number forLeu were estimated to be smaller in the presentwork than previously, then the contact energy forany pair with Leu would be estimated to be morenegative. However, the two estimates of thecoordination number of Leu are almost identical(see Table 2 and Figure 1). The comparison of thedistribution of the number of contacts for Leuclearly shows that the present distribution is shiftedtoward more contacts than the previous one.Therefore, there are actually more contacts with Leuin the present data than before, but the reason isunclear.

As shown in equation (27), the estimation of contact energies, e_ij, requires the estimation of n_00, which is not so accurate. On the other hand, relative differences between contact energies do not depend on such a parameter; see equation (28). As already stated, the absolute values of the contact energies might be expected to be less reliable than their relative differences. The required scaling factor for absolute energy specification could be adjusted by making e_rr consistent with experimental estimates of the average attractive energy between residues.

Figure 3. Transfer free energies of amino acids relative to glycine between octanol and water corrected by Sharp et al. (1991) and the corresponding values of −0.6 q_i e_i/2 in this work. −RT q_i e_i/2 corresponds to the average contact energy gain of an i type residue completely surrounded by other residues in the protein crystal structures. RT = 0.6 kcal/mol has been employed to transform the dimensionless contact energies into kcal/mol units. The modified transfer free energies of 20 N-acetyl amino acid amides between octanol and water are taken from Table III of Sharp et al. (1991); these values include volume corrections for solute–solvent size differences. The line of unit slope is shown by a dotted line. Filled circles show the values for non-polar side-chains (Ala, Val, Ile, Leu, Phe) and open circles for other amino acids; Gly is located at the origin. The regression line, not shown here, is y = −0.14 + 0.96x, and the correlation coefficient is 0.98 for non-polar side-chains. Although polar amino acids are given in this Figure, the comparison is strictly meaningful only for non-polar residues.

Figure 2C shows the comparison of the values of e'_ij between the two estimates. The estimation of e'_ij, which is the energy difference accompanying the formation of the two contact pairs i–j from the contact pairs i–i and j–j, does not require estimation of n_00. Therefore, their estimates might be expected to be more accurate than the absolute values of the contact energies, e_ij. The regression line in Figure 2C almost passes through the origin with unit slope; it is y = −0.02 + 0.96x. Even though the correlation between the two estimates is clearly not as good as for the contact energies e_ij, its value is still quite good at 0.88.
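The relative energy e'_ij can be illustrated with a few lines of Python. This is a minimal sketch, assuming the per-contact convention e'_ij = e_ij − (e_ii + e_jj)/2, so that forming two i–j contacts from one i–i and one j–j contact changes the energy by 2e'_ij; the defining equation lies outside this section, so the convention is an assumption here. The function names are hypothetical, and only the Leu–Leu value is taken from the text; the other two entries are placeholders, not values from Table 3.

```python
def pair_key(i, j):
    """Order-independent dictionary key for a residue pair."""
    return tuple(sorted((i, j)))


def relative_contact_energy(e, i, j):
    """e'_ij = e_ij - (e_ii + e_jj) / 2 in RT units (assumed convention)."""
    return e[pair_key(i, j)] - 0.5 * (e[pair_key(i, i)] + e[pair_key(j, j)])


# Leu-Leu is the value quoted in the text; the other two entries are
# illustrative placeholders, NOT values from Table 3.
e = {
    pair_key("Leu", "Leu"): -7.37,
    pair_key("Lys", "Lys"): -0.10,  # placeholder
    pair_key("Leu", "Lys"): -3.00,  # placeholder
}

e_prime = relative_contact_energy(e, "Leu", "Lys")
```

With these placeholder numbers e'_ij for the mixed pair is positive even though every e_ij is negative, which is how a segregation of hydrophobic and hydrophilic residues would show up in such a table.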

The important qualitative characteristics of the contact energies observed in the previous analysis hold in the present results; the values of e'_ij clearly show: (1) a relatively large energy loss accompanying the formation of Cys–X contacts from Cys–Cys and X–X contacts, probably reflecting the loss of disulfide bonds; (2) the large favorable electrostatic interactions between positively charged (Lys, Arg) and negatively charged (Glu, Asp) amino acids; and (3) the segregation of hydrophobic and hydrophilic residues.


Figure 4. The frequency distribution of the number of contacts for each type of amino acid. The distribution for each type of amino acid is shown by order of increasing values of e_ir, which measures the hydrophobicity of the amino acid, in A to D. Total in C shows the frequency distribution of the number of contacts, irrespective of amino acid type.

Comparison with experimental transfer energies

Figure 3 shows the comparison of the contact energies with the transfer free energies, taken from Table III of Sharp et al. (1991), for amino acids relative to glycine between octanol and water. The abscissae show values of −0.6 q_i e_i/2 that correspond to the average contact energy gain of an i type of residue completely surrounded by other residues in protein crystal structures. For comparison, a value of RT = 0.6 kcal/mol has been employed in this Figure to express the dimensionless contact energies in kcal/mol units. Sharp et al. (1991) re-evaluated the transfer free energies of 20 amino acids between octanol and water taken from Fauchere & Pliska (1983) by including volume corrections for solute–solvent size differences. The filled circles show the values for non-polar side-chains, Ala, Val, Ile, Leu and Phe, and the open circles for other amino acids; Gly is located at the origin. The plots for the most hydrophobic side-chains are located nearly directly on the dotted line of the unit slope, i.e. the two values are nearly identical. Their coincidence strongly supports the present estimates of long-range interactions between side-chains. Although polar amino acids are also plotted in this Figure, the comparison is really meaningful only for non-polar residues, because organic solvents cannot represent circumstances surrounding polar residues in native protein structures; e_i for polar residues includes not only hydrophobic energies but also the average of other interaction energies with surrounding residues and water in proteins, such as hydrogen bonds and electrostatic energies. Thus, the extent of agreement is, overall, better than might be expected.
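The quantity plotted on the abscissa can be written as a small Python helper. This is only a sketch of the arithmetic −RT q_i e_i/2 with RT = 0.6 kcal/mol as stated above; the function name and the numerical inputs are hypothetical placeholders, and the real q_i and e_i would come from Tables 4 and 3.

```python
RT_KCAL_PER_MOL = 0.6  # value used in the text for the unit conversion


def burial_energy_gain(q_i, e_i, rt=RT_KCAL_PER_MOL):
    """Average contact energy gain, -RT * q_i * e_i / 2, in kcal/mol.

    q_i is the coordination number and e_i the average contact energy
    (in RT units) of residue type i.
    """
    return -rt * q_i * e_i / 2.0


def gain_relative_to(q_i, e_i, q_ref, e_ref, rt=RT_KCAL_PER_MOL):
    """Same quantity measured relative to a reference residue such as Gly."""
    return burial_energy_gain(q_i, e_i, rt) - burial_energy_gain(q_ref, e_ref, rt)


# Placeholder inputs, not values from the tables.
example = burial_energy_gain(6.0, -4.0)
```

Because the transfer free energies are given relative to glycine, the second helper subtracts the value for a reference residue before any comparison is made.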

Distributions of the number of residue contacts

The distribution of the number of contacts for each of the 20 types of amino acids in protein crystals is shown in order of increasing values of e_ir, which measure the hydrophobicity of amino acids, in Figure 4A to 4D. Amino acids whose values of e_ir are similar will show similar frequency distributions, although the coordination numbers must be taken into account. For example, the distribution for Cys is shifted somewhat toward more contacts than that for Trp, even though e_ir for Cys is more positive than that for Trp, because the coordination number for Cys is significantly larger than that for Trp. The distributions for non-polar amino acids have single peaks around n_c = 6 and those for the most polar amino acids near n_c = 2. These numbers of contacts are typical for residues buried completely inside of proteins or for residues exposed on the surfaces of proteins. This indicates that the system consists nearly of two phases, buried residues and surface residues. The distribution for Ser clearly shows the presence of a shoulder near n_c = 6, as well as a single peak at n_c = 2. For such residue types that are ambivalent in character, the distributions can readily be decomposed into two peaks with maxima near 2 and 6.

Table 4 shows only the high density portion of the distribution of the number of contacts for each of the 20 types of amino acid in protein crystals. Each number in the table is an effective number because it is the sum of the numbers of contacts weighted by the sampling weight of equation (50) to remove any sampling bias from homologous protein sequences included in the protein sample. Repulsive packing energies have been estimated directly from the values in this table by using equation (43).
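The idea of an effective count can be sketched as a weighted histogram. The sketch below assumes that each observed residue already carries a sampling weight for its protein; how that weight is obtained (equation (50)) is not reproduced here, and the function name and the sample records are hypothetical.

```python
from collections import defaultdict


def effective_contact_distribution(observations):
    """Sum sampling weights per (residue type, number of contacts).

    observations: iterable of (residue_type, n_contacts, weight) tuples,
    where weight is the sampling weight of the protein the residue is in.
    """
    counts = defaultdict(float)
    for residue_type, n_contacts, weight in observations:
        counts[(residue_type, n_contacts)] += weight
    return dict(counts)


# Hypothetical records: two homologous chains share a reduced weight.
records = [("Leu", 7, 1.0), ("Leu", 7, 0.25), ("Leu", 8, 0.5)]
distribution = effective_contact_distribution(records)
```

Non-integer entries in such a table are therefore expected, since each residue contributes its weight rather than a unit count.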

Average residue–residue energies

Replacing e_{i_p j} in equation (19) by the average contact energy e_{i_p} of the i_p type of amino acid, the average contact energy for the pth residue is represented by:

$$\langle E^{c}_{p} \rangle \equiv \frac{1}{2}\, e_{i_p}\, n^{c}_{p} \tag{1}$$

and depends linearly on the number of contacts with the pth residue, n^c_p. Figure 5 shows the dependence of the long range interaction energy on the number of contacts for each type of amino acid. The ordinate corresponds to the sum of the average attractive contact energy (equation (1)) and the repulsive packing energy (equation (43)). In the repulsive region shown in this Figure, the repulsive packing energy, the second term in equation (43), does not depend so much on the type of amino acid. This is expected, because repulsive packing energies should reflect only packing density and not depend strongly on the type of amino acid at the center; likewise, the coordination number of the amino acid does not depend much on the type of amino acid.

Total energies of individual proteins

As described in Methods, the long-range energy defined by equation (13) has been calculated for a set of 189 representative protein structures in the PDB selected by Orengo et al. (1993), which differ from each other by at least 35% sequence identity; structures with too many unknown atomic coordinates are not included.

The numbers of residues in contact with each residue along a chain are calculated by counting residues within 6.5 Å of each residue according to equation (14): the contact energies between these residue pairs are then evaluated from Table 3 and summed up according to equation (19). Also, repulsive energies are estimated for the residue according to equation (40). The hard core repulsion defined by equation (41) is not included here, i.e. e_hc = 0, because known protein structures were usually refined to remove such close contacts, and such close contacts are easily removed by structure refinement. The repulsive packing energy is estimated according to equation (43), if the number of contacts at the pth residue is above the threshold value q_{i_p}. Then, those contact energies and repulsive energies are summed over all residues in a protein to estimate the total long-range energy.
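The procedure in the preceding paragraph can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the authors' program: each residue is represented by a single center point, pairs closer along the chain than `min_separation` are skipped, and the repulsive packing term is passed in as a hypothetical callable `repulsion(residue_type, n_contacts)` because its functional form (equation (43)) is not given in this section.

```python
import math

CONTACT_CUTOFF = 6.5  # contact distance in angstroms, from the text


def find_contacts(centers, cutoff=CONTACT_CUTOFF, min_separation=2):
    """Return index pairs (p, q) of residues whose centers are in contact.

    min_separation=2 skips sequence neighbors; this is an assumption.
    """
    pairs = []
    for p in range(len(centers)):
        for q in range(p + min_separation, len(centers)):
            if math.dist(centers[p], centers[q]) < cutoff:
                pairs.append((p, q))
    return pairs


def long_range_energy(sequence, centers, e, q_threshold=None, repulsion=None):
    """Sum contact energies (RT units) plus an optional repulsive term.

    e maps sorted residue-type pairs to contact energies e_ij.
    q_threshold maps residue type to the contact number above which the
    hypothetical repulsion(residue_type, n_contacts) is added.
    """
    n_contacts = [0] * len(sequence)
    energy = 0.0
    for p, q in find_contacts(centers):
        n_contacts[p] += 1
        n_contacts[q] += 1
        energy += e[tuple(sorted((sequence[p], sequence[q])))]
    if repulsion is not None and q_threshold is not None:
        for p, residue_type in enumerate(sequence):
            if n_contacts[p] > q_threshold[residue_type]:
                energy += repulsion(residue_type, n_contacts[p])
    return energy, n_contacts
```

For a toy chain such as `["Leu", "Ala", "Leu"]` with centers 3 Å apart on a line, only the two Leu residues are counted as a contact, so the energy equals the single Leu–Leu entry of the table supplied as `e`.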

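As discussed in the previous work, the total contact energies are expected to consist of two terms, a term that is proportional to the chain length and another term that is proportional to the surface area of the proteins, that is, the two-thirds power of the chain length for proteins whose shapes are similar to each other:

$$\sum_{p} E^{c}_{p} = \sum_{i(\neq 0)} \sum_{j(\neq 0)} e_{ij}\, n_{ij} \tag{2}$$

$$= e_{r}\, n_{rr} \tag{3}$$

$$\approx e_{v} \sum_{i(\neq 0)} \frac{1}{2}\, q_{i}\, n_{i} - e_{s}\, n_{r0} \tag{4}$$

where

$$e_{v} \equiv \Big[ \sum_{i(\neq 0)} e_{i}\, q_{i}\, N_{i} \Big] \Big/ \Big[ \sum_{i(\neq 0)} q_{i}\, N_{i} \Big] = -3.26 \tag{5}$$

$$e_{s} \equiv \Big[ \sum_{i(\neq 0)} e_{i}\, N_{i0} \Big] \Big/ N_{r0} = -2.57 \tag{6}$$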

In Figure 6A, the long-range energies per residue calculated with equation (13) are plotted in dimensionless units against the inverse of the one-third power of their chain lengths. Monomeric proteins, whose shapes could be similar to each other, are shown by filled circles, and the regression line for them is shown by a continuous line that is given by E_long/n_r = −8.5 + 10.1 n_r^(−1/3). The correlation coefficient is 0.67. The value of the intercept of the regression line is more positive than expected from equation (4), probably because of repulsive packing energies included in the total long-range energies of the proteins.
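To make the size dependence concrete, the regression line quoted above can be evaluated for a few chain lengths. The following sketch only evaluates that fitted line; the function name is hypothetical, and with a correlation coefficient of 0.67 the result is a trend, not a prediction for any individual protein.

```python
def regression_energy_per_residue(n_r, intercept=-8.5, slope=10.1):
    """Evaluate E_long / n_r = intercept + slope * n_r**(-1/3) in RT units."""
    return intercept + slope * n_r ** (-1.0 / 3.0)


# The surface term shrinks as the chain grows, so longer chains lie
# closer to the intercept.
for chain_length in (125, 1000):
    print(chain_length, round(regression_energy_per_residue(chain_length), 2))
```

For chain lengths of 125 and 1000 residues the line gives about −6.48 and −7.49 RT per residue, respectively.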

Alignment energy for residues within a protein structure

In the present contact energies, any contact has a favorable contribution to protein stability, even if it is between polar residues, because contact energies between residues are all negative. Therefore,

Table 4. The high density portion in the distribution of the number of contacts for each amino acid

Rows are residue types and columns are the number of contacts, n_c; the last column gives q_i.

Residue   n_c = 5        6        7        8        9       10       11       12       13      q_i
Total      8057.8   7372.4   5334.9   2747.7    883.3    226.9     28.4      0.8      0.0     6.28
Cys         212.1    242.1    178.7     86.4     20.3      4.5      0.5      0.0      0.0     6.65
Met         167.2    209.7    194.4    107.9     42.7      6.1      0.0      0.0      0.0     6.14
Phe         415.4    461.4    368.8    213.0     69.2     23.7      2.5      0.2      0.0     5.87
Ile         527.5    619.4    549.7    318.6    110.0     28.8      4.9      0.0      0.0     6.04
Leu         720.4    947.3    921.2    553.6    205.8     59.9      5.6      0.0      0.0     6.09
Val         654.2    804.8    686.4    392.8    124.7     30.0      4.7      0.2      0.0     6.16
Trp         185.6    172.7     95.5     53.3     14.9      2.9      0.0      0.0      0.0     5.79
Tyr         425.0    402.6    258.0    121.2     36.4     11.8      0.8      0.0      0.0     6.04
Ala         754.2    780.6    494.4    213.0     57.8     11.6      2.8      0.2      0.0     6.33
Gly         677.2    499.2    288.0    115.8     35.9     16.0      1.9      0.2      0.0     6.28
Thr         485.1    366.8    235.1    116.1     45.2      9.5      1.8      0.0      0.0     6.49
Ser         470.1    378.5    237.9     88.5     27.3      7.1      0.1      0.0      0.0     6.58
Asn         333.5    201.1    119.9     47.8      6.5      1.9      0.0      0.0      0.0     6.57
Gln         243.9    172.1     90.8     52.0     13.7      1.1      0.0      0.0      0.0     6.47
Asp         364.2    204.5    143.9     70.0     16.2      4.9      0.1      0.0      0.0     6.49
Glu         306.2    204.6    108.4     40.7     12.4      0.7      0.0      0.0      0.0     6.24
His         204.6    152.0     95.4     55.4     14.2      2.9      2.0      0.0      0.0     6.24
Arg         338.7    219.3     98.5     33.8     11.4      0.6      0.0      0.0      0.0     6.32
Lys         251.8    125.4     53.1     17.3      4.9      1.7      0.0      0.0      0.0     6.57
Pro         320.8    208.4    116.8     50.4     13.8      1.1      0.8      0.0      0.0     5.86

Figure 5. The dependence of the average long-range energy on the number of contacts for each residue type. A, Ranges of packing density as reflected in n^c_p values. B, Values for the more polar residue types. C, The less polar residue types in order of decreasing values of e_ir, which measures the hydrophobicity of the amino acid. The ordinates show the sum of the average contact energy and the repulsive packing energy, in dimensionless RT units; see equation (1) for the definition of the average contact energy and equation (40) for the repulsive packing energy. The hard core repulsion energy is not included here; e_hc = 0 in equation (41).

conformations with the most contacts tend to be the most stable in the assessment of the total contact energy. However, a rigorous treatment of binding requires an estimate of the entropy loss as well as the binding energy. On the other hand, non-representative proteins whose shapes differ significantly from a globular form, or small proteins like inhibitors, are often stabilized by disulfide bonds whose effects on protein stability are not counted here. Other effects not accounted for here are effects of the backbone and details of denatured state entropies. Also, as already noted, the estimate of e_rr, which is defined by equation (34) and reflects the overall compactness of proteins, is less reliable. Consequently, an assessment of protein stability based on the total number of contacts in a protein could lead to an incorrect result. Thus, in order to measure the stability of any protein structure for a given sequence, it might be better to use the energy that does not include the homogeneous energy for protein collapse but consists only of the remaining energy for aligning residues with the contacts assumed to be present within the protein structure. The alignment energy of residues within a protein structure is therefore defined here to be the total contact energy minus the average collapse energy:

$$E^{c}_{p}(e_{ij} - e_{rr}) = E^{c}_{p}(e_{ij}) - E^{c}_{p}(e_{rr}) \tag{7}$$

where E^c_p(e_ij − e_rr) is a function of (e_ij − e_rr) that is calculated by replacing e_ij with (e_ij − e_rr) in equation (18). Here e_ij − e_rr is named the alignment energy for contacts between residues of type i and j, and represents the relative preference of the i–j contacts compared to the average attraction.

It should also be noted that E^c(e_ij − e_rr) does not depend on n_r0 but has only a linear dependence on the chain length; because e_s is almost equal to e_rr, the second term in equation (4) is negligible. The value of e_rr is estimated to be −2.55 in RT units as given in Table 3.

It is clear from equation (7) that E^c_p(e_ij − e_rr) will be positive unless hydrophobic residues are buried inside proteins and hydrophilic ones are exposed to solvent on the surfaces of proteins. In other words, the alignment energy, E^c_p(e_ij − e_rr), is consistent with the polar-out and non-polar-in rule observed in protein structures. Here it should be noted that equation (7) can be applied to proteins in water, but should not be used in a hydrophobic environment.

The total alignment energy in the form of equation (7) is appropriate for the consideration of the stabilities of protein conformations for a given sequence, but it is still inappropriate for the stabilities of a given fold for different protein sequences. In the latter case, a simple comparison of the energies of a given fold among protein sequences would be meaningless, because the ensemble of protein conformations could depend on the protein sequence. The stability of a specific conformation for a protein is determined in relation to the whole ensemble of protein conformations. Therefore, unless the whole ensemble of protein conformations is known, reference energies, each of which reflects to some extent the partition function for a

Figure 6. A, The dependence of the long-range energy per residue on the inverse of the one-third power of chain length. B, The long-range energy per residue with a collapse energy subtracted to remove the protein size dependence. See equation (13) for the definition of the long-range energy, and equation (12) for ΔE_long(e_ij − e_rr). The long-range energies include the contact energies and the repulsive packing energies, but not the hard core repulsion energies, i.e. e_hc = 0 in equation (41). All energies here are given in RT units. The representative protein structures used here are 189 protein structures that differ from each other by at least 35% sequence identity and are those selected by Orengo et al. (1993) (see their Table 1). Proteins with many unknown atomic coordinates are not included. The filled circles show the values for monomeric proteins determined by X-ray, not including membrane proteins, metal binding proteins, DNA binding proteins and inhibitors, and multimeric proteins not given in their complete assembly in the coordinate files. The open circles are proteins whose structures are given in an at least partial, if not full, assembly of subunits. A continuous line shows the regression line for the monomeric proteins. In A, the regression line is E_long(e_ij)/n_r = −8.5 + 10.1 n_r^(−1/3), and the correlation coefficient is 0.67. In B, a collapse energy has been removed, so that the regression line is almost flat, ΔE_long(e_ij − e_rr)/n_r = 0.30 − 0.01 n_r^(−1/3), and the correlation coefficient is 0.001. The entry names and sequence identifiers of the PDB files used in preparing this Figure are:

Membrane proteins: 1PRC-L 1PRC-M 1PRC-C 2POR 1SN3 1VSG-A 1HGE-A 1HGE-B 1PRC-H

Metal binding proteins: 1CY3 1PRC-C 5RXN 2HIP-A 2CDV

DNA binding proteins: 1HDD-C

Inhibitors: 1HOE 1PI2 3EBX 2OVO 5PTI

Multimeric proteins without subunit interactions: 2WRP-R 1UTG 1ROP-A 2TMV-P 2RHE 2STV 3PGM 6LDH 1PYP

Structures determined by NMR: 1C5A 1HCC 1ATX 1SH1 2SH1 1EPG 4TGF 3TRX 1EGO 1APS 1IL8-A 2GB1

Other monomeric proteins: 1MBC 1MBA 1ECD 2LH3 2LHB 1R69 4ICB 4CPV 1LE2 1YCC 1CC5 451C 1IFC 1RBP 1SGT 4PTP 2SGA 2ALP 2SNV 1CD8 1CD4 1ACX 1PAZ 1PCY 1GCR 2CNA 3PSG 1F3G 8I1B 1ALD 1PII 6XIA 2TAA-A 4ENL 5P21 4FXN 2FCR 2FX2 3CHY 5CPA 8DFR 3DFR 3ADK 1GKY 1RHD 4PFK 3PGK 2GBP 8ABP 2LIV 1TRB 1IPD 4ICD 1PGD 8ADH 2TS1 1PHH 3LZM 1LZ1 1RNH 7RSA 1CRN 1CTF 1FXD 2FXB 4FD1 1FDX 4CLA 9RNT 1RNB-A 1FKF 1SNC 1UBQ 3B5C 9PAP 3BLM 2CPP 1CSC 1ACE 1COX 1GLY 1LAP 1LFI 2CYP 8ACN 2CA2

Other multimeric proteins: 1HBB-A 2SDH-A 1ITH-A 1COL-A 1LMB-A 3SDP-A 2SCP-A 2HMZ-A 256B-A 2CCY-A 1GMF-A 1BBP-A 2FB4-H 3HLA-B 1COB-A 2AZA-A 2PAB-A 1BMV-1 1BMV-2 2PLV-1 1TNF-A 2MEV-1 2MEV-2 2MEV-3 2PLV-2 2PLV-3 2LTN-A 2RSP-A 2ER7-E 5HVP-A 1NSB-A 5TIM-A 2TRX-A 1CSE-E 1GP1-A 4DFR-A 8CAT-A 4MDH-A 1GD1-O 7AAT-A 1HRH-A 1RVE-A 2SIC-I 8ATC-B 2TSC-A 2SAR-A 1MSB-A 1BOV-A 1FXI-A 1TGS-I 1TPK-A 9WGA-A 3HLA-A 8ATC-A 2CPK-E 1GST-A 1OVA-A 7API-A 1WSY-B 2GLS-A 2PMG-A 6TMN-E 3GAP-A

sequence, are needed in order to measure the stabilities of a given fold for protein sequences. Here, we choose the total energy expected for a typical protein with the given amino acid composition and chain length to be the reference energy of the native structure for the given protein sequence, in order to compare the alignment energies of the native folds among a wide range of protein sequences; see Miyazawa & Jernigan (1994) for a reference state that is appropriate to measure stability changes due to limited amino acid changes in a protein. That is, the following difference in energy is considered:

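$$\Delta E_{\mathrm{long}} \equiv E_{\mathrm{long}} - (E_{\mathrm{long}} \text{ of a typical native structure for a given sequence}) \tag{8}$$

$$\equiv E_{\mathrm{long}} - \sum_{p} f_{i_p} \cdot (\text{the average number of contacts per residue in a typical native structure}) \tag{9}$$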

where the latter sum is the sum of the average contact energies per residue of residue type i_p over all positions in the protein. f_i as a function of e_ij is estimated by averaging over all proteins; that is:

$$f_{i}(e_{ij}) \equiv e_{i}\, \frac{N_{ir}}{N_{rr}}\, \frac{N_{r}}{N_{i}} \tag{10}$$

The values of f_i(e_ij) are given in Table 3. The average number of contacts per residue in a typical native structure for a given sequence is estimated as follows, ignoring the chain length dependence.

$$(\text{the average number of contacts per residue in a typical native structure}) \equiv N_{rr}/N_{r} = 2.096 \tag{11}$$

Because the second term of equation (9) does not depend on protein conformation but only on the amino acid composition and the length of the protein, it is a scaling factor and does not have any effect in a comparison of different conformations for the same protein. However, to a certain extent, its use will allow us to discuss how compatible a given protein sequence is with a certain structure, as well as how stable a particular conformation is for a certain sequence.

Lastly, a linear dependence on chain length is removed by comparison on a ‘‘per residue’’ basis, and the following quantity, which is appropriate for assessing the stability of one protein structure among other folds, is obtained:

$$\Delta E_{\mathrm{long}}(e_{ij} - e_{rr})/n_{r} \equiv \Big[ E_{\mathrm{long}}(e_{ij} - e_{rr}) - \sum_{p} f_{i_p}(e_{ij} - e_{rr}) \cdot (N_{rr}/N_{r}) \Big] \Big/ n_{r} \tag{12}$$

where ΔE_long(e_ij − e_rr) and f_{i_p}(e_ij − e_rr) are calculated by substituting e_ij with (e_ij − e_rr) in equations (9) and (10). ΔE_long includes both attractive and repulsive terms; the latter will be important principally for poor quality structures.
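Equation (12) amounts to a short calculation once its ingredients are available. The sketch below assumes that the long-range energy has already been computed with the shifted energies e_ij − e_rr, and that the per-type reference values f_i(e_ij − e_rr) are supplied as a dictionary; the function names and all numbers in the example are hypothetical placeholders, except e_rr = −2.55 and N_rr/N_r = 2.096, which are quoted in the text.

```python
E_RR = -2.55                  # RT units, value quoted in the text
CONTACTS_PER_RESIDUE = 2.096  # N_rr / N_r from equation (11)


def shift_energies(e, e_rr=E_RR):
    """Replace every contact energy e_ij by the alignment energy e_ij - e_rr."""
    return {pair: value - e_rr for pair, value in e.items()}


def alignment_energy_per_residue(e_long_shifted, sequence, f_shifted,
                                 contacts_per_residue=CONTACTS_PER_RESIDUE):
    """Evaluate [E_long - sum_p f_{i_p} * (N_rr / N_r)] / n_r.

    e_long_shifted: long-range energy computed with e_ij - e_rr.
    f_shifted: dict of f_i evaluated with e_ij - e_rr, keyed by residue type.
    """
    reference = sum(f_shifted[residue] for residue in sequence) * contacts_per_residue
    return (e_long_shifted - reference) / len(sequence)


# Placeholder inputs for a two-residue toy sequence.
value = alignment_energy_per_residue(-4.192, ["A", "B"], {"A": -1.0, "B": -1.0})
```

In this toy case the energy equals the reference energy, so the result is zero; values below zero would indicate a structure that is more favorable than the typical native structure assumed for that composition.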

In Figure 6B, the estimates of ΔE_long(e_ij − e_rr)/n_r for the protein representatives are plotted against n_r^(−1/3). As expected, overall there is no correlation between the two quantities for monomeric proteins; the mean for monomeric proteins is slightly more positive than zero, because the repulsive packing energy is included in ΔE_long(e_ij − e_rr)/n_r. Therefore, this alignment energy has removed the dependence on the size of the protein. Membrane proteins, which are shown as crosses, tend to have much higher values of ΔE_long(e_ij − e_rr)/n_r than the mean for monomeric proteins. This is to be expected, because membrane proteins are not stable in water; in membrane proteins, portions exposed and embedded in the membrane are highly hydrophobic, and the surface is hydrophobic rather than hydrophilic, resulting in relatively high values of ΔE_long(e_ij − e_rr). The same type of exception can be found for the metal binding proteins and the DNA binding proteins, in which metals and DNA have been treated here only as holes filled with water in the calculation of the total contact energies. Also, the multimeric cases, shown as open circles, tend to be located above the continuous line, probably because some coordinates of the inter-molecular neighbors are incompletely given in the partially assembled structures. On the other hand, the high values of energies for proteins whose structures were determined by NMR may indicate the relatively poorer resolution of these structures.

Discrimination for the native structures among other folds

The threading of sequences into other folds (Hendlich et al., 1990; Jones et al., 1992) or more generally the inverse folding (Bowie et al., 1991) is a good method to evaluate how well a given energy scale can discriminate the native structures as the lowest energy conformations among other folds. A more extensive study of inverse folding, in which deletions and additions are included, will be reported in a subsequent paper. Here, a result for simple threading of protein sequences without gaps into other folds is reported as a demonstration of the discrimination power of our alignment energy.

A total of 88 proteins determined to a resolution better than 2.5 Å by X-ray analyses, which are structurally dissimilar to each other with values smaller than 80 on the scale of Orengo et al. (1993) for structure similarity, were threaded into each of the 189 representatives of protein structures, which differ from each other by at least 35% sequence identity and were selected by Orengo et al. (1993) (see their Table 1). The 88 proteins are a subset of the 189 protein representatives whose entry names are listed in the caption of Figure 6. Coordinate files with too many unknown atomic coordinates are excluded from these data sets. Proteins classified within the multidomain group by Orengo et al. (1993) are also excluded from the set of sequences to be threaded.

ΔE_long(e_ij − e_rr) is calculated for protein sequences threaded at all possible positions in all other protein structures, and their means and standard deviations are also calculated; no gaps in either the sequences or the structures are allowed. The positions of the native energies in the distributions of all threadings are then measured in units of standard deviation (s.d.) where negative values indicate the native energies are below the mean. Table 5 lists values per residue, ΔE_long(e_ij − e_rr)/n_r, for the protein sequences threaded in their own native structures, as well as the ranks and the positions of the native energies in units of s.d. in the distributions of all threadings; proteins are sorted by the increasing order of the values, in units of standard deviation, of ΔE_long(e_ij − e_rr)/n_r. Favorable cases for proteins with more negative values than −5 s.d. are listed in Table 5A and the unfavorable higher ones in Table 5B.
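The bookkeeping behind the threading test can be sketched as follows. This is a minimal illustration, assuming the population standard deviation is used and leaving it to the caller whether the native energy is part of the threading distribution; neither detail is stated in this section, and the function names are hypothetical.

```python
import statistics


def ungapped_placements(sequence_length, structure_length):
    """Number of ways to thread a sequence into a structure without gaps."""
    return max(0, structure_length - sequence_length + 1)


def native_position_in_sd(native_energy, threading_energies):
    """Position of the native energy relative to the threading distribution.

    Negative values mean the native energy lies below the mean.
    The population standard deviation is an assumption of this sketch.
    """
    mean = statistics.fmean(threading_energies)
    sd = statistics.pstdev(threading_energies)
    return (native_energy - mean) / sd


def rank_of_native(native_energy, threading_energies):
    """Rank 1 means no threading has an energy lower than the native one."""
    return 1 + sum(1 for energy in threading_energies if energy < native_energy)
```

A native structure that is clearly discriminated would have rank 1 and a strongly negative position, which is the pattern used to separate Table 5A from Table 5B.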

For most proteins, the native structures have values of ΔE_long(e_ij − e_rr)/n_r in s.d. units that are significantly large in magnitude and are ranked at the lowest energy. However, there are some proteins for which the native structures are not the best or not significantly better than all others. Proteins with values worse (higher) than −5 s.d., listed in Table 5B, are always membrane proteins or proteins whose coordinates are given in isolated forms without their counterparts, such as small inhibitors, multimeric proteins and proteins binding metallic ions or other molecules. The relative proportions of binding regions on their surfaces may be significantly large for these small proteins.

Figure 7 shows the correlation between the values of ΔE_long(e_ij − e_rr)/n_r in RT units and in s.d. units. It should be noted that their values in s.d. units do not depend on the second term of equation (12), but their absolute values in RT units do. Since the native energies in s.d. units depend not only on the native energies but also on the means and standard deviations of the energy distributions of threadings, a good correlation is not expected, especially in the low energy region. However, the fact that insignificant native folds have relatively high energies indicates that the energy function, ΔE_long(e_ij − e_rr)/n_r, may be used not only as an energy function to evaluate the stabilities of protein structures for a given sequence, but also as a scoring function to assess the compatibilities of protein sequences to a given structure. A more detailed discussion is planned for a subsequent paper.

Table 5. Positions of native folds in the energy distributions of threadings

A. Proteins with favorable native threadings

PDB name   Length   Rank   Threadings   ΔE_long(e_ij − e_rr)/n_r (a)   In units of s.d. (b)
7AAT-A        401      1        1681        0.17      −11.1
1PGD          469      1         765        0.27      −10.9
1PII          452      1         953        0.18      −10.1
2LIV          344      1        3084        0.35       −9.9
2GBP          309      1        4402        0.32       −9.9
8ADH          374      1        2254        0.29       −9.9
1PHH          394      1        1808        0.31       −9.9
2ER7-E        330      1        3548        0.20       −9.8
4ENL          436      1        1146        0.40       −9.7
2TS1          317      1        3972        0.21       −9.6
1ALD          363      1        2538        0.38       −9.5
1GD1-O        334      1        3406        0.40       −9.5
4FXN          138      1      16,873       −0.10       −9.3
3ADK          194      1      11,413        0.12       −9.1
3PGK          415      1        1426        0.45       −9.1
5TIM-A        249      1        7560        0.18       −8.8
1GKY          186      1      12,058        0.18       −8.8
1RVE-A        244      1        7876        0.24       −8.7
4PFK          319      1        3972        0.41       −8.7
1RHD          293      1        5173        0.31       −8.6
8CAT-A        498      1         548        0.62       −8.6
1IPD          345      1        3052        0.42       −8.6
6XIA          387      1        1953        0.46       −8.5
1MBC          153      1      15,196       −0.01       −8.5
3LZM          164      1      14,113        0.08       −8.4
2FCR          173      1      13,264        0.16       −8.3
1NSB-A        390      1        1889        0.55       −8.3
2TSC-A        264      1        6713        0.16       −8.3
1COL-A        197      1      10,951        0.12       −8.2
8ATC-B        146      1      15,196        0.15       −8.2
3CHY          128      1      18,067       −0.07       −8.2
4PTP          223      1        9338        0.38       −8.1
1PAZ          120      1      19,074        0.11       −7.9
1GCR          174      1      13,171        0.13       −7.8
2TRX-A        108      1      20,669       −0.07       −7.8
4DFR-A        159      1      14,598        0.20       −7.7
2CNA          237      1        8389        0.33       −7.7
4CPV          108      1      20,530        0.14       −7.6
1F3G          151      1      15,407        0.23       −7.6
1COB-A        151      1      15,407        0.31       −7.4
1MSB-A        115      1      19,724        0.22       −7.4
1LZ1          130      1      17,821        0.28       −7.4
4CLA          213      1      10,054        0.12       −7.4
2LTN-A        181      1      12,563        0.24       −7.4
5CPA          307      1        4493        0.40       −7.3
4ICB           76      1      25,579       −0.14       −7.3
6LDH          329      1        3548        0.50       −7.2
5P21          166      1      13,922        0.29       −7.2
1RNH          148      1      15,300        0.34       −7.2
1CSE-E        274      1        6151        0.57       −7.0
1BOV-A         69      1      26,735        0.05       −6.9
1FKF          107      1      20,812        0.25       −6.9
1UBQ           76      1      25,579        0.14       −6.8
2AZA-A        129      1      17,943        0.43       −6.8
1YCC          108      1      20,669        0.31       −6.8
1RBP          175      1      13,081        0.45       −6.8
256B-A        106      1      20,957        0.31       −6.7
1ACX          108      1      20,669        0.34       −6.7
1FXD           58      1      28,625        0.31       −6.6
3B5C           86      1      23,674        0.17       −6.5
1LMB-A         87      1      23,056        0.19       −6.5
1GMF-A        119      1      19,203        0.23       −6.4
9RNT          104      1      21,250        0.37       −6.4
2RSP-A        115      1      18,565        0.28       −6.4
9WGA-A        170      1      13,451        0.59       −6.1
7RSA          124      1      18,565        0.53       −6.1
2SIC-I        107      1      20,812        0.40       −6.1
2SAR-A         96      1      22,443        0.40       −5.8
1PRC-C (c)    333      1        3441        0.79       −5.7
2PAB-A        114      1      18,692        0.45       −5.7
2HIP-A (d)     71      1      26,400        0.37       −5.5
2RHE (e)      114      1      19,857        0.52       −5.4
5RXN (f)       54      1      29,347        0.44       −5.0

B. Proteins with less favorable native threadings

PDB name   Length   Rank   Threadings   ΔE_long(e_ij − e_rr)/n_r (a)   In units of s.d. (b)   Comment
2OVO           56      1      28,983        0.56       −4.9   Ovomucoid third domain; 3 S–S bonds
2STV          184      1      11,413        0.72       −4.5   Coat protein of S.T. virus; multimeric
2WRP-R        104      1      20,812        0.60       −4.5   Trp repressor; DNA binding
1SN3           65      5      27,410        0.66       −4.1   Scorpion neurotoxin; membrane binding protein
1TPK-A         88     11      23,674        0.69       −4.1   Tissue plasminogen activator, Kringle-2 domain
3EBX           62     15      27,925        0.78       −3.8   Erabutoxin B; inhibitor to acetylcholine receptor
1UTG           70     15      26,567        0.89       −3.7   Uteroglobin; progesterone binding protein
1PI2           61     26      27,752        0.78       −3.7   Bowman–Birk proteinase inhibitor; no enzyme
5PTI           58     44      28,625        0.75       −3.6   Trypsin inhibitor; no enzyme
2POR          301      3        4778        1.05       −3.3   Porin; integral membrane protein
2CDV          107     50      20,812        1.01       −2.8   Cytochrome c3; small protein with 4 hemes
1CRN           46   >100      30,832        1.01       −2.6   Crambin
1HOE           74   >100      25,905        1.06       −2.2   α-Amylase inhibitor without enzyme
1PRC-L        273   >100        6205        0.88       −2.0   Photosynthetic reaction center; membrane protein
1CY3          118   >100      19,333        1.28       −1.4   Cytochrome c3; small protein with 4 hemes

(a) Alignment energies per residue in RT units; see equation (12) for definition.
(b) The position of the native energy in the distribution of all threadings in units of s.d., where negative values are for native energies below the mean.
(c) Photosynthetic reaction center; four protoporphyrin IX are bound.
(d) High potential iron sulfur protein; four Fe and four S are bound.
(e) Bence-Jones protein.
(f) Rubredoxin; small Fe binding protein.

Discussion

A basic assumption underlying the present estimation of contact energies is that, for a large enough sample, the effects of specific amino acid sequences will average out, and then the numbers of residue–residue contacts observed in a large number of protein crystals will represent the actual intrinsic interresidue interactions. As already noted in the original paper, this assumption is compatible with the ‘‘principle of structural consistency’’ originally proposed by Go (1983) and also called the ‘‘principle of minimal frustration’’ in the energy landscape view of proteins by Bryngelson & Wolynes (1987), because the present assumption is equivalent to the assumption that on average the intrinsic contact interactions are those consistent with the high stability of native structures. This assumption is also equivalent to the assumption that the distribution of the numbers of contacts in protein structures is a ‘‘self-averaging property’’ in the terms of Bryngelson et al. (1995), which means it is, in the present case, the property of heteropolymers rather than of the detailed amino acid order of protein sequences. Gutin et al. (1992) indicated that the Boltzmann-like statistics observed in protein structures are a general property of the stable structures of heteropolymer chains, and that the ‘‘temperature’’ in these statistics is not the usual temperature of the medium but a ‘‘selective temperature’’, at which the native structure is ‘‘frozen out’’ from an exponentially large set of other structures; see also Sali et al. (1994) for the estimation of the selective temperature or critical temperature.

We did not state explicitly in our original paper what temperature should be taken for the conversion of the contact energies from RT units to kcal/mol, but a melting temperature is implied, which is just low enough for the native structure to be marginally stable and high enough for the energy landscape of the protein to show minimum frustration and for the "self-averaging property" of contacts to be satisfied. In the analyses given in our papers (Miyazawa & Jernigan, 1985, 1994), however, room temperature was used to translate the RT units into kcal/mol. The reason is that the contact energies estimated here are free energies, which depend on temperature. In general, the conformational energy of a protein is implicitly not an energy but rather a free energy, because it is usually a potential of mean force that is obtained by integrating over solvent coordinates and therefore includes energetic and entropic solvent effects. Therefore, as long as the melting temperature is not too far from room temperature, it is preferable to use the experimental room temperature for unit conversions. The use of an average melting temperature or room temperature for the conversion differs significantly from the claim of Gutin et al. (1992) that a critical temperature estimated to be a factor of 1.5 higher than room temperature should be used for the conversion, but does not contradict the claim of Sali et al. (1994) that the critical temperature could also be less than room temperature, because rapid folding into the stable native structure occurred slightly above the critical temperature and the folding temperature could be equated with room temperature.
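The unit conversion discussed above can be illustrated with a minimal sketch. The gas constant is a standard physical value and is not taken from the paper; the default of 298.15 K for "room temperature" and the function name `rt_to_kcal_per_mol` are assumptions made only for this illustration.

```python
# Gas constant in kcal/(mol K); a standard physical constant, not from the paper.
R_KCAL_PER_MOL_K = 1.987204e-3


def rt_to_kcal_per_mol(energy_rt, temperature_k=298.15):
    """Convert an energy in RT units to kcal/mol at the given temperature.

    The default of 298.15 K is an assumed value for "room temperature".
    """
    return energy_rt * R_KCAL_PER_MOL_K * temperature_k


# One RT unit at the assumed room temperature (about 0.59 kcal/mol).
room = rt_to_kcal_per_mol(1.0)
# The same unit at a temperature 1.5 times higher, the factor attributed
# above to Gutin et al. (1992) for the critical temperature.
critical = rt_to_kcal_per_mol(1.0, 1.5 * 298.15)
print(f"{room:.3f} {critical:.3f}")
```

Because the conversion is linear in temperature, the choice between room temperature and a temperature 1.5 times higher changes every converted energy by the same factor of 1.5.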

The hydrophobic effect is a dominant force in stabilizing the native structure of a globular protein. However, there is a lack of agreement as to its precise magnitude. The hydrophobicity of a small molecule is usually measured by the transfer energies, for example, of amino acids from a non-polar solvent to water. However, it is not immediately evident that the free energy change accompanying a process in which residues are buried in the folding process of a protein is the same as that accompanying the transfer process of amino acids from water to a nonaqueous solvent, because the two processes are obviously different (Lee, 1993). In our previous work on contact energies, we pointed out that the estimates of the contact energies of amino acid side-chains were almost twice as large in magnitude as the usual estimates of hydrophobicities from the transfer energies of Nozaki & Tanford (1971). In order to measure directly the contribution of the hydrophobic effect to the stability of proteins, many experiments measuring the stability change caused by amino acid replacements have been performed (Yutani et al., 1984; Matsumura et al., 1988; Shortle et al., 1990). These experiments showed that the unfolding free energy changes upon single amino acid replacements could be significantly larger in magnitude than those expected from the transfer free energies of amino acids (Yutani et al., 1984; Shortle et al., 1990), but were also highly variable (Shortle et al., 1990).

The effect of a cavity created by such amino acid replacements was considered to be one of the factors causing this discrepancy (Shortle et al., 1990; Nicholls, 1991). Eriksson et al. (1992) reported that the unfolding free energy changes showed correlations with the increases in the sizes of cavities observed in the protein structures of mutant T4 lysozymes, and that the value of the unfolding free energy change extrapolated to zero cavity size coincided with the value expected from the transfer free energy of an amino acid. Lee (1993) estimated the free energy changes for cavities formed by replacing a bulky side-chain with a smaller side-chain, assuming the protein structure to remain completely rigid. He pointed out that most experimental values of unfolding free energy changes for such replacements fell into a range between the maximum expected for the full cavity size and the minimum expected for no cavity, because rearrangement of the protein occurs to fill the cavity.

On the other hand, Sharp et al. (1991) argued that the previous estimates of the hydrophobic effect derived from analyses of solute partition data did not fully account for changes in volume entropy, and they provided new estimates by including a volume correction term. Their overall estimates were almost twice as large as the previous ones and similar in magnitude to the estimates in this work of the contact energies for the transfer energy of side-chains from the outside to the interior of a protein. Pace (1992) also reported that the estimate of hydrophobicity with the volume correction could explain the values of unfolding free energy changes observed for 72 aliphatic side-chain mutants from four proteins in which a larger side-chain was replaced by a smaller side-chain.

As shown in Figure 3, the estimates by Sharp et al. (1991) of the transfer free energies between octanol and water for non-polar side-chains are almost identical with the present estimates of equivalent quantities for hydrophobic side-chains. Here, it should be noted that the contributions of amino acid side-chains to the contact energy are calculated to be equal to the difference of the contact energies between Gly and other amino acids, and these estimates are expected to be more reliable than the absolute values of the contact energies of amino acids.

Miyazawa & Jernigan (1994) showed, based on the previous estimates of contact energies, that the contact energy changes of protein structures due to single amino acid replacements were large enough to account for the observed values of unfolding free energy changes. Also, the large variation of unfolding free energy changes among residue positions for single amino acid replacements, observed by Shortle et al. (1990), was accounted for if the free energy changes in the denatured state were taken into account. The free energy change in the denatured state of a protein due to single amino acid replacements had not been considered and had usually been ignored. The large variation of unfolding free energy changes may be due to the different environments surrounding each residue in both the native structure and the denatured state. Miyazawa & Jernigan (1994) indicated that staphylococcal nuclease, under the experimental conditions of Shortle et al. (1990), was not fully expanded in the denatured state and also that the compact denatured state might have a substantially native-like topology.

Figure 7. The correlation between the values of $\Delta E^{\mathrm{long}}(e_{ij} - e_{rr})/n_r$ of the native structures of 88 proteins in RT units and those in standard deviation units from its distribution of threaded structures. A total of 88 proteins determined to a resolution better than 2.5 Å by X-ray analyses, which are structurally dissimilar to each other with values smaller than 80 on the scale of Orengo et al. (1993) for structure similarity, are threaded into each of the 189 representatives of protein structures, which differ from each other by at least 35% sequence identity and were those selected by Orengo et al. (1993) (see their Table 1). $\Delta E^{\mathrm{long}}(e_{ij} - e_{rr})/n_r$ is then calculated according to equation (12) for all threadings, with no gaps allowed, and its mean and standard deviation are calculated. The 189 protein representatives are the same ones used for Figure 6. The 88 proteins are a subset of these 189 protein representatives and are listed in Table 5. Coordinate files with too many unknown atomic coordinates are excluded from these data sets. The correlation coefficient is 0.75.

In the original work to evaluate the contact energies, the number of protein structures used was only 42, including 30 monomeric proteins (Miyazawa & Jernigan, 1985). These proteins were chosen with the criteria that their chain lengths were longer than 100 residues and that few atomic positions were missing. Also, in cases where coordinates are available for several closely homologous proteins, only one representative was used; the minimum difference between amino acid sequences for homologous proteins was 50%. The numbers of contacts were calculated for the complete assembly. Now, more than 1000 protein structures are available. Here, we use only the coordinates of subunits as given in the PDB files, without calculating any additional subunit intersections. That is, if a PDB file is given in a monomeric state, the numbers of contacts have been counted in the monomeric state, even though matrices to generate multiple symmetric subunits are given in the PDB file. The ratio of molecular surface contacts to the total number of contacts, $N_{r0}/(N_{rr} + N_{r0})$, is about 0.33.

Generally, subunit–subunit interfaces and enzyme–substrate interfaces are not the major portion of a molecular surface. Therefore, ignoring the molecular binding in some proteins should not greatly affect the estimated values of contact energies.

There are many homologous proteins in the PDB. Any statistical analysis of sequence–structure relations of proteins requires an unbiased sampling of proteins. One way is to choose representative sets of proteins that are sufficiently dissimilar to each other. Usually, selections of protein representatives are based on sequence dissimilarity (Hobohm et al., 1992) or structure dissimilarity (Orengo et al., 1993; Fischer et al., 1996). The alternative method used here is to assign a different sampling weight to each protein according to the extent of redundancy (sequence identity), to remove sampling bias. This has not been done previously, but we show that such a sampling weight for each protein can be based on a similarity matrix between proteins in the data set. Here, each element of the similarity matrix has been evaluated as the ratio of identical residues between aligned protein sequences, and then the sequence identity matrix has been diagonalized. The present data set contains 1661 protein sequences. To reduce computational time for the calculation of the 1661 by 1661 sequence identities among sequences and for diagonalization of the resulting sequence identity matrix, the matrix size was reduced by regarding highly similar sequences as identical, i.e. if they are more than 95% identical. If sequence identities lower than a certain value ought to be ignored, then matrix elements of sequence identity, equation (44), whose values fall below a threshold value, could be replaced by zero; 30% identity, where sequence pairs may be evolutionarily unrelated, may be a good choice for such a threshold. For statistical analyses similar to the present one, this method will certainly be better if homologous proteins are equally good structures, because all available data are used and then the statistical error becomes lower for a larger number of samples.

Remarkably, all characteristics of the contact energies found in the original analysis hold for the present results, for which the total effective number of contacts is 6.3 times more than in the original data. The new values differ from the original estimates of the contact energies to a significant extent only for the infrequent amino acids Trp and Met and the one other exception, Leu. The reason why the contacts with Leu are more frequently observed is not clear.

The contact energies are attractive energies only, and volume exclusion between residues needs to be taken into account, unless a lattice model with a coordination number small enough to prevent overpacking is used. In a lattice model, volume exclusion between residues can be satisfied readily by assuming that a lattice site cannot be occupied by multiple residues. When a lattice model is not used, a repulsive potential is needed for evaluation of the conformational packing energy of a protein. This has been one motivation for the present work. The van der Waals' surfaces of side-chains are highly anisotropic, making spherically symmetric repulsive potentials inappropriate to represent effective two-body repulsive interactions between two residues. The present many-body potential has been formulated to depend instead on packing density to represent the repulsive interactions around a central residue in a protein and to treat packing density effects more realistically than would be possible with a spherically symmetric two-body potential. However, such a repulsive potential is not in effect for small numbers of neighboring residues. In addition, there is a hard core, which cannot be penetrated, for any side-chain with any conformation. However, a preliminary investigation showed that there is relatively little residue type dependence for these hard cores, and that the minimum size of such a hard core may correspond closely to the van der Waals' size of a methyl group. Therefore, the repulsive potential between residues here is represented as the sum of two kinds of potentials, a hard core potential that is a two-body potential and a repulsive packing potential that is a many-body potential depending on the number of residues immediately surrounding a residue. The repulsive packing potentials for 20 types of amino acids have been estimated as a function of the number of contacts from the high density portion of the distributions observed in known protein structures. This tail portion of the distribution, which is defined as the region of high coordination numbers of amino acids, should reflect the effective repulsive packing energy between residues.

We have shown how this new potential for interactions among residues can be used to calculate the total energies of a set of proteins and how these energies behave roughly as expected. Also, calculations of threading sequences into other folds have demonstrated that the native structures have the lowest alignment energies of residues among all other folds for most of the protein representatives in the PDB, the exceptions being proteins such as membrane proteins and others having bound ligands. The application to threading here may not be the most demanding test of the potentials, because it only requires discrimination among relatively few conformations. However, the results are excellent and give, at least, a taste of the level of success achievable with it. This new potential is appropriate for application to a broad range of protein conformation simulations and inverse folding calculations.

Methods

Long-range inter-residue interaction energy

Long-range interaction energies are simplified for inter-residue interactions at the residue level without any atomic details of side-chains. They are approximated to consist of two terms, a short-range attractive term that becomes effective only when two residues are in close proximity, and a repulsive term that results from the overlap of residues at high packing densities. In the case of a lattice model, volume exclusion between residues is implicitly included in the model, because a site cannot be occupied by more than one residue. However, if conformational space is defined continuously rather than discretely, a repulsive energy potential is essential to account for volume exclusion among residues. In the following, the long-range inter-residue energy of a protein is represented as a sum of these two terms over all residues in a protein:

$$E^{\mathrm{long}} = \sum_p \left( E^c_p + E^r_p \right) \tag{13}$$

where p is a residue sequence index in a protein.

Contact energy

Residues are represented by single points at the centers of their side-chain atom positions; the positions of Cα atoms are used for glycine residues. Residues whose centers are closer than $R_c$ are defined to be in contact. The limiting value $R_c = 6.5$ Å for contacts was chosen on the basis of the occurrence of the first peak in the radial distribution of residues in the interior of proteins (Miyazawa & Jernigan, 1985).

Intra-residue interactions and nearest-neighbor interactions also lead to contact formation among nearest-neighbor residues, and therefore in the present evaluation of the long-range interactions, these nearest-neighbor contacts are explicitly excluded.

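Thus, a contact between the pth and qth residues is defined using:

$$\Delta^c_{pq} \equiv \begin{cases} 0 & \text{if } |p - q| \le 1 \\ H(R_c - d_{pq}) & \text{if } |p - q| > 1 \end{cases} \tag{14}$$

$$R_c \equiv 6.5\ \text{Å} \tag{15}$$

where H is the Heaviside function defined as:

$$H(x) \equiv \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} \tag{16}$$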

and $d_{pq}$ is the distance between the pth and qth residues. Maiorov & Crippen (1992) pointed out that, in the use of discrete functions for the definitions of contacts, slight changes in interatomic distances can produce significantly different lists of contacts. Since residue–residue contacts are defined here by using the distance between their side-chain centers, they are relatively insensitive to small variations in interatomic distances. However, a sigmoidal function with a transition width of about 1 Å might have been used to advantage, instead of the Heaviside function, in equation (16) to account for uncertainties in the boundary region. The number of residues of type j in contact with the pth residue, whose amino acid type is $i_p$, is:

$$n^c_{i_p j} = \sum_{q (\ne p)} \delta_{j_q j}\, \Delta^c_{pq} \tag{17}$$

where $\delta$ is the Kronecker $\delta$ function.

The contact energy for the pth residue, $E^c_p$, in which the energy of a conformation with no residue–residue contacts is defined to be zero, is represented as:

$$E^c_p(e_{ij}) = \frac{1}{2} \sum_{q (\ne p)} e_{i_p j_q}\, \Delta^c_{pq} \tag{18}$$

$$\phantom{E^c_p(e_{ij})} = \frac{1}{2} \sum_{j (\ne 0)} e_{i_p j}\, n^c_{i_p j} \tag{19}$$

where $e_{ij}$ is the energy difference accompanying the formation of contacts between i and j types of amino acids from those amino acids exposed to solvent, and is defined as follows (Miyazawa & Jernigan, 1985):

$$e_{ij} \equiv E_{ij} + E_{00} - E_{i0} - E_{j0} \tag{20}$$

where 0 means effective solvent molecules, and $E_{00}$ and $E_{i0}$ are the absolute interaction energies between a pair of solvents and between an i type of residue and an effective solvent, respectively. Here, it should be noted that a lattice model is used to take account of interactions among solvent molecules and amino acids in a protein. Residues in a protein are assumed to occupy lattice sites or cells as a linear chain. Each vacant cell is regarded as being occupied by an effective solvent molecule.
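As an illustration of equations (14) to (18), the following minimal sketch evaluates the contact function and the per-residue contact energy for a toy chain. The coordinates, residue types and energy values are hypothetical placeholders chosen for this example and are not values from the paper; the function names are likewise illustrative.

```python
import math

R_C = 6.5  # contact cutoff in angstroms, equation (15)


def contact(p, q, centers, r_c=R_C):
    """Delta^c_pq of equation (14): 1 for a non-neighbor pair within r_c, else 0."""
    if abs(p - q) <= 1:  # the residue itself and its chain neighbors are excluded
        return 0
    # Heaviside step H(R_c - d_pq), with H(0) = 1 as in equation (16).
    return 1 if math.dist(centers[p], centers[q]) <= r_c else 0


def contact_energy(p, centers, types, e):
    """E^c_p of equation (18): half the sum of e over residues in contact with p."""
    total = 0.0
    for q in range(len(centers)):
        if q != p and contact(p, q, centers):
            total += e[tuple(sorted((types[p], types[q])))]
    return 0.5 * total


# Hypothetical four-residue chain whose side-chain centers sit on a square
# of side 4 angstroms.
centers = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0), (0.0, 4.0, 0.0)]
types = ["L", "A", "L", "A"]
# Placeholder energies in RT units, keyed by alphabetically sorted type pair.
e = {("L", "L"): -7.0, ("A", "L"): -5.0, ("A", "A"): -3.0}

energies = [contact_energy(p, centers, types, e) for p in range(len(centers))]
print(energies, sum(energies))
```

The factor of one half in equation (18) means that each contact is shared equally between its two residues, so the sum of the per-residue energies equals the sum of the pair energies over all contacts.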

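$n^c_{i_p j}$ can be summed to produce the following equations:

$$n_{ij} = \frac{1}{2} \sum_p \sum_{q (\ne p)} \delta_{i_p i}\, \delta_{j_q j}\, \Delta^c_{pq} = \frac{1}{2} \sum_p n^c_{i_p j}\, \delta_{i_p i} \tag{21}$$

$$\frac{1}{2}\, q_i n_i = \sum_{j = 0} n_{ij} \tag{22}$$

$$n_{ij} = n_{ji} \tag{23}$$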

where $n_{ii}$ and $2n_{ij}$ are the total numbers of contacts between two residues of the same type, i, and between i and j types of amino acids; $n_i$ is the total number of i type amino acids in a protein and $q_i$ is the coordination number for the i type of amino acid, that is, the average number of residues that completely surround the i type of amino acid. Let us define here the following quantities for the typical residue r:

$$n_{ir} = n_{ri} \equiv \sum_{j (\ne 0)} n_{ij} \tag{24}$$

$$n_{rr} \equiv \sum_{i (\ne 0)} n_{ir} \tag{25}$$

$$n_r \equiv \sum_{i (\ne 0)} n_i \tag{26}$$

The summations above are taken over all 20 amino acid types.
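The counting convention of equations (21), (24) and (25) can be sketched as follows, for residue–residue contacts only (solvent contacts are not handled here). The contact list and residue types are hypothetical inputs, and the function names are illustrative.

```python
from collections import defaultdict


def count_contacts(contacts, types):
    """Return n_ij following equation (21).

    Each contact between different types i and j adds 1/2 to both n_ij and
    n_ji, so that 2 * n_ij is the number of i-j contacts; each contact
    between two residues of the same type i adds 1 to n_ii.
    """
    n = defaultdict(float)
    for p, q in contacts:
        i, j = types[p], types[q]
        if i == j:
            n[(i, i)] += 1.0
        else:
            n[(i, j)] += 0.5
            n[(j, i)] += 0.5
    return n


def residue_totals(n):
    """Return n_ir for every type i (equation 24) and n_rr (equation 25)."""
    n_ir = defaultdict(float)
    for (i, _j), value in n.items():
        n_ir[i] += value
    return n_ir, sum(n_ir.values())


# Hypothetical contact list (pairs of residue indices) and residue types.
types = ["L", "A", "L", "A"]
contacts = [(0, 2), (0, 3), (1, 3)]

n = count_contacts(contacts, types)
n_ir, n_rr = residue_totals(n)
print(dict(n), dict(n_ir), n_rr)
```

With this convention, $n_{rr}$ equals the total number of residue–residue contacts in the structure.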


How to estimate contact energies

The contact energies have been re-evaluated by using a newer, much larger protein data set with the same procedure described by Miyazawa & Jernigan (1985), in which the following assumptions and approximations were employed; the original notation is used here.

(1) For a large enough sample, the effects of specific amino acid sequences average out and the numbers of non-bonded residue–residue contacts observed in a large number of protein crystal structures will then represent the actual intrinsic differences of interactions among residues in proteins. Of course, nearest-neighbor contacts along chains are significantly affected by the amino acid sequences of proteins. Therefore, the contacts between nearest-neighbor residues are explicitly excluded in the counting of contacts. The effect of a protein being a chain may remain and might affect somewhat the observed number of residue–residue contacts. Therefore, the contact energies estimated here should be regarded as effective inter-residue energies.

(2) By taking account of the effects of the chain connectivity as imposing a limit to the size of the system, i.e. the total number of lattice sites, the system is then regarded as an equilibrium mixture of unconnected residues and effective solvent molecules.

(3) The Bethe approximation (quasi-chemical approximation), which gives an exact solution for the Bethe lattice, and in which contact pair formation can be regarded as a chemical reaction, is used to estimate the contact energies from the numbers of contacts observed in known protein structures. In the Bethe approximation, $e_{ij}$ satisfies the following equilibrium equation:

$$\exp(-e_{ij}) = \frac{\bar{n}_{ij}\, \bar{n}_{00}}{\bar{n}_{i0}\, \bar{n}_{j0}} \tag{27}$$

where $\bar{n}_{ij}$ represents the statistical average of $n_{ij}$. Note that all energies here are taken to be dimensionless, i.e. in units of RT. Usually this equation is used to evaluate $\bar{n}_{ij}$ from known $e_{ij}$, but it is used inversely here to estimate the contact energies from the numbers of contacts observed in known protein structures. However, the equation above includes $\bar{n}_{00}$, the number of contacts between effective solvent molecules, which is rather difficult to estimate accurately. The differences between contact energies, such as $e'_{ij}$ defined in the next paragraph, do not depend on $\bar{n}_{00}$, and so the estimates of such relative quantities ought to be more reliable than the absolute values of contact energies; e.g. dividing equation (27) for the pair i, j by the same equation for the pair k, l gives:

$$\exp(-(e_{ij} - e_{kl})) = \frac{\bar{n}_{ij}\, \bar{n}_{k0}\, \bar{n}_{l0}}{\bar{n}_{i0}\, \bar{n}_{j0}\, \bar{n}_{kl}} \tag{28}$$

which does not include $\bar{n}_{00}$.

(4) $\bar{n}_{00}$ is evaluated through an estimation of the number of effective solvent molecules. The number of effective solvent molecules for each protein is chosen to yield the total number of residue–residue contacts equal to its expected value for the hypothetical case of hard sphere interactions with $e_{ij} = 0$ among residues and effective solvent molecules, representing the effects of chain connectivity. An effective solvent molecule is taken to have the same volume (of water molecules) as an average residue.

(5) The numbers of contacts, $n_{ij}$, are counted for each protein structure and the sums of $n_{ij}$ over all protein samples, $N_{ij}$, are calculated. The numbers of contacts with effective solvent molecules, $n_{i0}$, are estimated with equation (22); the coordination number for each amino acid type is estimated from the volume of each type of amino acid at the center and the average volume of its surrounding residues. That is, incomplete coordination spheres are completed with solvent molecules. Then, the contact energies are estimated from $N_{ij}$ with a composition correction, so that if all residues and effective solvent molecules were randomly mixed, the estimated values of the contact energies would be zero:

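$$e'_{ij} = -\frac{1}{2} \ln\left( \frac{N_{ij}^2}{N_{ii} N_{jj}} \cdot \frac{C_{ii} C_{jj}}{C_{ij}^2} \right) \quad \text{for } i, j \ne 0 \tag{29}$$

$$e'_{i0} = -\frac{1}{2} \ln\left( \frac{N_{i0}^2}{N_{ii} N_{00}} \cdot \frac{C'_{ii} C'_{00}}{C'^{\,2}_{i0}} \right) \tag{30}$$

where:

$$e'_{ij} \equiv e_{ij} - (e_{ii} + e_{jj})/2 \tag{31}$$

$$e_{ij} = e'_{ij} + e'_{00} - e'_{i0} - e'_{j0} \tag{32}$$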

$C_{ij}$ and $C'_{ii}$ are correction factors and are equal to the number of contacts expected between residues of i and j types and the number of contacts expected between residues of the same type i, respectively, when residues and effective solvent molecules are randomly mixed; see equations (10) to (15) of Miyazawa & Jernigan (1985) for their detailed definitions. $e'_{ij}$ is the energy difference detailing the residue interaction specificity accompanying the formation of a contact pair i–j from contact pairs i–i and j–j.
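A minimal sketch of equation (29) follows. The correction factors $C_{ij}$ are treated as given inputs, since their definitions are in Miyazawa & Jernigan (1985) and are not reproduced in this section; all numbers used below are hypothetical and the function name is illustrative.

```python
import math


def relative_contact_energy(n_ij, n_ii, n_jj, c_ij, c_ii, c_jj):
    """e'_ij of equation (29), in RT units.

    n_* are observed contact counts summed over proteins (N_ij, N_ii, N_jj);
    c_* are the counts expected for random mixing (C_ij, C_ii, C_jj).
    """
    observed = n_ij ** 2 / (n_ii * n_jj)
    expected = c_ij ** 2 / (c_ii * c_jj)
    return -0.5 * math.log(observed / expected)


# If the observed counts equal the random-mixing expectation, e'_ij is zero.
random_mixing = relative_contact_energy(50.0, 100.0, 25.0, 50.0, 100.0, 25.0)
# If i-j contacts are twice as frequent as expected, e'_ij = -ln 2.
enriched = relative_contact_energy(100.0, 100.0, 25.0, 50.0, 100.0, 25.0)
print(random_mixing, enriched)
```

The composition correction therefore acts as a reference state: only deviations of the observed contact counts from the randomly mixed expectation contribute to $e'_{ij}$.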

According to the procedure described by Miyazawa & Jernigan (1985) and briefly summarized above, the contact energies, $e_{ij}$, between i and j types of amino acids are re-evaluated with equations (29) and (30) above, or equations (10) to (15) of Miyazawa & Jernigan (1985). In addition, further useful quantities for interactions with the average residue r are defined as:

$$\exp(-e_{ir}) \equiv \frac{\bar{n}_{ir}\, \bar{n}_{00}}{\bar{n}_{i0}\, \bar{n}_{r0}} \tag{33}$$

$$\exp(-e_{rr}) \equiv \frac{\bar{n}_{rr}\, \bar{n}_{00}}{\bar{n}_{r0}\, \bar{n}_{r0}} \tag{34}$$

These are not estimated by the procedure described in the earlier paper, but directly from the data by the following equations:

$$e_{ir} = e'_{ir} - e'_{i0} - e'_{r0} \tag{35}$$

$$e'_{ir} = -\frac{1}{2} \ln(C_{ii}/N_{ii}) \tag{36}$$

$$e_{rr} = -2e'_{r0} = \ln(C'_{00}/N_{00}) \tag{37}$$

As stated by Miyazawa & Jernigan (1985), if $e_{ir}$ is more negative than $e_{rr}$, then amino acids of the i type will tend to be buried inside a protein; otherwise, they will tend to be exposed to solvent on the surface of the protein.

Also, the average contact energy of each type of amino acid is estimated by:

$$e_i = \sum_{j (\ne 0)} e_{ij} N_{ij} / N_{ir} \tag{38}$$

$$e_r = \sum_{i (\ne 0)} e_i N_{ir} / N_{rr} \tag{39}$$

$e_i$ may be used to compare with other estimates of hydrophobic energies. However, $e_{ir}$ will be more appropriate as a one-dimensional measure of hydrophobicity. $e_i$ and $e_{ir}$ correspond to the mean energy and the effective mean energy, respectively; refer to equation (35) of Miyazawa & Jernigan (1985).
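Equations (38) and (39) are contact-weighted averages, which the following minimal sketch makes explicit for a hypothetical two-letter alphabet. The energies and contact counts are invented for the example, and $N_{ir}$ and $N_{rr}$ are computed from $N_{ij}$ as in equations (24) and (25).

```python
def average_contact_energies(e, n_counts, types):
    """Return ({i: e_i}, e_r) following equations (38) and (39).

    e and n_counts are dictionaries keyed by ordered type pairs (i, j) and
    must contain both (i, j) and (j, i).
    """
    n_ir = {i: sum(n_counts[(i, j)] for j in types) for i in types}
    e_i = {
        i: sum(e[(i, j)] * n_counts[(i, j)] for j in types) / n_ir[i]
        for i in types
    }
    n_rr = sum(n_ir.values())
    e_r = sum(e_i[i] * n_ir[i] for i in types) / n_rr
    return e_i, e_r


# Hypothetical energies (RT units) and contact counts for two residue types.
types = ["A", "B"]
e = {("A", "A"): -2.0, ("A", "B"): -4.0, ("B", "A"): -4.0, ("B", "B"): -6.0}
n_counts = {("A", "A"): 10.0, ("A", "B"): 5.0, ("B", "A"): 5.0, ("B", "B"): 20.0}

e_i, e_r = average_contact_energies(e, n_counts, types)
print(e_i, e_r)
```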

Repulsive energy

Here, each residue has been represented by the point at the center of its side-chain heavy atom positions. Generally, the van der Waals' surface of a side-chain is highly anisotropic, and its hard core would not be well represented by a sphere. However, if a repulsive potential between residues is averaged over all possible side-chain conformations, then the anisotropic character in the average repulsive potential will be weakened by the flexibility manifested in the side-chain conformations. Therefore, a symmetric repulsive potential is appropriate at the present level of simplification. However, a repulsive force resulting from van der Waals' overlaps is a very short-range force, and therefore even a soft core symmetric repulsive potential may be inappropriate. Here, the repulsive force of van der Waals' overlaps is approximated as the sum of two terms, a hard core repulsion between residues and a repulsive packing potential, depending on the packing density of residues:

$$E^r_p = e^{hc}_p + e^r_p \tag{40}$$

The hard core potential is taken, in turn, as a two-body interaction potential given by:

$$e^{hc}_p \equiv \frac{1}{2} \sum_{q (\ne p)} e^{hc}\, H(r_c - d_{pq}) \tag{41}$$

$e^{hc}$ is the parameter for the positive energy of hard core repulsion and will have a large positive value. Depending on side-chain conformation, the side-chain centers of two residues can sometimes come as close as the van der Waals' separation between two methyl groups. Therefore, the hard core radius, $r_c/2$, is independent of residue type and here is simply the van der Waals' core of a methyl group, with a value of 1.9 Å.
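Equation (41) can be sketched as below. The text states only that $e^{hc}$ is large and positive, so the value of 100 RT used here is an arbitrary assumption, as are the coordinates; $r_c = 3.8$ Å follows from the stated hard core radius $r_c/2 = 1.9$ Å.

```python
import math

R_HARD_CORE = 3.8  # r_c in angstroms, twice the 1.9 angstrom hard core radius


def hard_core_energy(p, centers, e_hc=100.0, r_c=R_HARD_CORE):
    """e^hc_p of equation (41): half of e_hc for every residue closer than r_c.

    e_hc = 100.0 (RT units) is an assumed placeholder; the paper only says
    that the parameter takes a large positive value.
    """
    overlaps = sum(
        1
        for q, center in enumerate(centers)
        if q != p and math.dist(centers[p], center) <= r_c
    )
    return 0.5 * e_hc * overlaps


# Hypothetical side-chain centers: residues 0 and 1 overlap, residue 2 does not.
centers = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
penalties = [hard_core_energy(p, centers) for p in range(len(centers))]
print(penalties)
```

As written, equation (41) sums over all other residues, so unlike the contact function of equation (14) this sketch does not exclude chain neighbors.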

The repulsive packing potential is a many-body interaction potential dependent on $n^c_p$, the number of residues in contact with the pth residue:

$$n^c_p \equiv \sum_{j (\ne 0)} n^c_{i_p j} \tag{42}$$

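This packing density energy in RT units is estimated as follows:

$$e^r_p \equiv H(n^c_p - q_{i_p}) \left\{ \left( \frac{q_{i_p}}{n^c_p} - 1 \right) E^c_p - \ln\left( \frac{N(i_p, n^c_p) + \varepsilon}{N(i_p, q_{i_p}) + \varepsilon} \right) \right\} \tag{43}$$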

$N(i, n^c)$ is the total number of residues of type i that are surrounded by $n^c$ residues in the set of protein structures. The value of $N(i, q_i)$ is obtained by interpolation. To avoid the divergence of the logarithm function in equation (43), a small positive number, $\varepsilon$, has been added to both the numerator and the denominator, so that the sum of the contact energy and repulsive energy takes on a positive value even at $N(i, n^c) = 0$ for any amino acid; here, a value of $\varepsilon = 10^{-6}$ is employed.

The first term of the repulsive packing potential in equation (43) represents the effect of exceeding the limiting number of contacts, which is equal to the coordination number $q_{i_p}$, to offset the extra contact energy; the actual value of $E^c_p$ from equation (19) is used. The second term is estimated from the ratio of the observed frequency of packing density to that for the limiting value for packing density. The repulsive energy is applied through the Heaviside function only if the number of surrounding residues exceeds a threshold value, $q_{i_p}$. The distribution of the number of contacts for each amino acid is assumed in the Bethe approximation to be determined by the contact energies, except for the high density region, which should reflect the repulsive energies among residues. It can be seen in Table 3 that the estimated coordination numbers for the 20 types of amino acids here range only from 5.79 for Trp to 6.65 for Cys.
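The two terms of equation (43) can be written out as in the following minimal sketch. The observed counts $N(i, n^c)$ and the interpolated $N(i, q_i)$ are passed in as plain numbers; every value in the example is hypothetical and the function name is illustrative.

```python
import math


def repulsive_packing_energy(n_c, q_i, e_c, count_at_n, count_at_q, eps=1e-6):
    """e^r_p of equation (43), in RT units.

    n_c        number of residues in contact with the residue (n^c_p)
    q_i        coordination number of its amino acid type (q_ip)
    e_c        its contact energy E^c_p from equation (19)
    count_at_n N(i, n_c), count_at_q N(i, q_i) (the latter by interpolation)
    eps        the small positive number added to avoid a diverging logarithm
    """
    if n_c < q_i:  # Heaviside factor H(n^c_p - q_ip)
        return 0.0
    offset = (q_i / n_c - 1.0) * e_c  # cancels the extra contact energy
    rarity = -math.log((count_at_n + eps) / (count_at_q + eps))
    return offset + rarity


# Below the coordination number the potential is switched off.
loose = repulsive_packing_energy(5, 6.0, -10.0, 900.0, 1000.0)
# Hypothetical overpacked residue: 8 contacts observed 100 times less often
# than the limiting packing density.
crowded = repulsive_packing_energy(8, 6.0, -12.0, 10.0, 1000.0)
# Packing density never observed in the data set: eps keeps the result finite.
unseen = repulsive_packing_energy(10, 6.0, -12.0, 0.0, 1000.0)
print(loose, crowded, unseen)
```

For an attractive (negative) $E^c_p$ both terms are positive whenever the packing density is rarer than at the coordination number, so the potential penalizes overpacking.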

Protein structures used in the present statistical analysis

The proteins used here are all those in the Protein Data Bank (PDB) (Bernstein et al., 1977) that satisfy the following conditions.

(1) Proteins whose structures are determined by X-ray analysis and whose resolution is equal to, or better than, 2.5 Å. All protein structures determined by NMR are excluded.

(2) Protein subunits composed of 50 or more residues.

(3) Membrane proteins are excluded, because the assumption that incomplete coordination spheres are completed with water molecules is inappropriate for them. Thus, the numbers of inter-residue contacts compiled here are those observed only in soluble proteins, in order to derive contact energies appropriate for soluble proteins. However, some characteristics of neighbor pairs might be preserved, except for the overall inversion of structure compared to soluble proteins.

The number of protein subunit structures satisfying these criteria is 1661 (see Table 1).

In this data set, there are many proteins whose sequences are similar to one another. To obtain statistically unbiased results, unbiased sampling is required. There are two ways to do this: either (i) use protein representatives that are sufficiently dissimilar to each other in their sequences; or (ii) use a different statistical weight for each protein related to its extent of similarity to other sequences. So far, most statistical analyses have used a representative set of proteins. Usually, protein representatives are chosen by specifying an upper limit for sequence identity (Hobohm et al., 1992) or structural similarity (Orengo et al., 1993; Fischer et al., 1996, among others). However, it is not clear what value is best as an upper limit of similarity in protein representatives. Also, in such a method, many good structures may be discarded. In this work, the second approach has been taken.

Sampling weight

A sequence identity matrix is defined here as:

$$I^s_{mn} \equiv \text{sequence identity between sequences } m \text{ and } n \equiv \frac{2 \times (\text{number of identical residues in the alignment})}{(\text{length of sequence } m) + (\text{length of sequence } n)} \tag{44}$$

The sequence identity is taken as the fraction of identical residues in the alignment of two sequences. The sequence identity matrix is a real symmetric, non-negative matrix. Each element of $I^s_{mn}$ is between 0 and 1, and the diagonal elements are equal to one:

$$0 \le I^s_{mn} = I^s_{nm} \le 1, \qquad I^s_{mm} = 1 \tag{45}$$
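Equation (44) can be sketched for two sequences that have already been aligned. The alignment method is not specified in this section, so the sketch simply assumes pre-aligned strings of equal length with "-" as the gap character; the example sequences are invented.

```python
def sequence_identity(aligned_m, aligned_n, gap="-"):
    """I^s_mn of equation (44) for two pre-aligned sequences.

    Both strings must have the same length; gap characters mark positions
    where one sequence has no residue.
    """
    if len(aligned_m) != len(aligned_n):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(
        1 for a, b in zip(aligned_m, aligned_n) if a == b and a != gap
    )
    length_m = sum(1 for a in aligned_m if a != gap)
    length_n = sum(1 for b in aligned_n if b != gap)
    return 2.0 * identical / (length_m + length_n)


# Hypothetical alignment: five identical residues, lengths 5 and 6.
print(sequence_identity("ACDE-G", "ACDEFG"))
```

The measure is symmetric in the two sequences and equals one for a sequence compared with itself, which gives the properties listed in equation (45).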

The sequence identity matrix is similar to a correlation matrix in the sense that the value of $I^s_{mn}$ represents the correlation between the amino acid sequences m and n. Let us define $\lambda_m$ as the mth eigenvalue and $V_m$ as the mth column eigenvector, which is orthogonal to all others and normalized to unit length. That is:

$$I^s V_m = \lambda_m V_m \tag{46}$$

$$V_m^T V_n = \delta_{mn} \tag{47}$$

$V_m^T$ means the transpose of $V_m$. The total sum of the eigenvalues is equal to the trace of the sequence identity matrix, and so it is equal to the number of sequences used:

$$\sum_m \lambda_m = \mathrm{Tr}\, I^s = \sum_m 1 = N_{\mathrm{prot}} \tag{48}$$

where $N_{\mathrm{prot}}$ is the number of sequences used. From equation (45), it follows that a sequence identity matrix must be positive semidefinite:

$$0 \le \lambda_i \le N_{\mathrm{prot}} \tag{49}$$

Let us consider some special cases. If a whole set of sequences can be divided into groups where individual sequences from different groups are completely dissimilar, each group can be handled independently. In case there is a sequence which is completely dissimilar to any other sequence, at least one eigenvalue will be equal to one. If all sequences are the same, all other eigenvalues, except one, must be equal to zero; if $I^s_{mn} = 1$ for any m and n, rank $I^s = 1$, and therefore the number of non-zero eigenvalues must be equal to one.

On the basis of these characteristics, the following procedure may be used to remove redundant information that comes from similarities among sequences. The sampling weight, $w_n$, for the nth sequence is taken to be:

$$w_n \equiv \left( \sum_m \left( \lambda_m - (\lambda_m - 1) \cdot H(\lambda_m - 1) \right) \cdot V_m V_m^T \right)_{nn} \tag{50}$$

$$0 < w_n \le 1 \tag{51}$$

This definition of sampling weight satisfies all requirements. If, and only if, a sequence has zero sequence identity with any other sequence, then the sampling weight for that sequence will be equal to one. If $N_{\mathrm{prot}}$ sequences in the data set are all identical, then the statistical weights for those sequences will be equal to $1/N_{\mathrm{prot}}$. Generally, sampling weights take a value between one and $1/N_{\mathrm{prot}}$, and are approximately equal to the inverse of the number of similar sequences. The effective number of proteins can be defined as the total sum of the sampling weights:

$$N^{\mathrm{effective}}_{\mathrm{prot}} \equiv \sum_m w_m \le N_{\mathrm{prot}} \tag{52}$$

In a similar way, the effective number of residues is defined.
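The weighting of equations (50) to (52) can be sketched in a few lines. This is an illustrative reading, not the authors' code: it takes $H$ to be the Heaviside step function, so that the factor in equation (50) reduces to min(λ, 1), and it diagonalizes the matrix with a plain cyclic Jacobi rotation routine chosen only to keep the sketch dependency-free (the text does not say which diagonalization method was used). The names `jacobi_eigh`, `sampling_weights` and `effective_number` are hypothetical.

```python
import math


def jacobi_eigh(matrix, max_sweeps=100, tol=1e-22):
    """Diagonalize a real symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, vectors); vectors[k][m] is component k of the
    m-th normalized eigenvector (eigenvectors are the columns).
    """
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle (|theta| <= pi/4) that zeroes a[p][q].
                d = a[q][q] - a[p][p]
                sign = 1.0 if d >= 0.0 else -1.0
                theta = 0.5 * math.atan2(2.0 * a[p][q] * sign, abs(d))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # V <- V J accumulates the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


def sampling_weights(identity):
    """Equation (50): w_n = sum_m (lam_m - (lam_m - 1) H(lam_m - 1)) V_m[n]^2."""
    eigenvalues, vectors = jacobi_eigh(identity)
    n = len(identity)
    # H is read as the Heaviside step, so each factor equals min(lam, 1).
    capped = [
        lam - (lam - 1.0) * (1.0 if lam > 1.0 else 0.0) for lam in eigenvalues
    ]
    return [
        sum(capped[m] * vectors[k][m] ** 2 for m in range(n)) for k in range(n)
    ]


def effective_number(weights):
    """Equation (52): the effective number of proteins is the sum of weights."""
    return sum(weights)
```

With two identical sequences and one unrelated sequence, `sampling_weights([[1, 1, 0], [1, 1, 0], [0, 0, 1]])` gives weights close to 0.5, 0.5 and 1, so the effective number is close to 2, in line with the limiting cases described above. For larger matrices a library eigensolver for symmetric matrices, such as `numpy.linalg.eigh`, would be the more usual choice.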

Please note the use of sequence identity here. Although this approach for weighting could be used equally well with any sequence similarity measure, the present problem dictates that only identities be considered; otherwise part of the data for deriving contact pairs would have been discarded.

The number of protein subunits chosen according to the criteria described in the previous section is more than 1600. To reduce the computational time for the calculation of the sequence identity matrix and its diagonalization, proteins that have more than 0.95 sequence identity are regarded as the same sequence and are removed from the sequence identity matrix. Then a statistical weight for each of the individual proteins is taken to be $w_m/m$, where $w_m$ is the weight for that protein family and the divisor $m$ is the number of members in the family. Protein representatives for this purpose are chosen by the following procedure. (1) If a protein has less than 0.95 sequence identity to every protein representative already chosen, this protein is regarded as a new protein representative. (2) The above procedure is iterated until all proteins are examined. There are 424 protein representatives remaining with less than 0.95 sequence identity to any other in the PDB, and the total effective number of proteins is found with equation (52) to be 251.
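The representative-selection procedure can be sketched as a single greedy pass. This is a minimal illustration under stated assumptions, not the authors' program: proteins are examined in input order, a protein that is not a new representative joins the first representative it matches, and an identity of exactly 0.95 is treated as a match (the text leaves these points open). The names `choose_representatives` and `member_weights` are hypothetical.

```python
def choose_representatives(identity, threshold=0.95):
    """Greedy grouping of proteins into families.

    identity[i][j] is the pairwise sequence identity of proteins i and j.
    Returns a dict mapping each representative index to its member indices.
    """
    families = {}
    for i in range(len(identity)):
        for rep in families:
            # Assumption: join the first representative reached at or above
            # the threshold; the source does not specify tie handling.
            if identity[i][rep] >= threshold:
                families[rep].append(i)
                break
        else:
            # Below the threshold for every representative chosen so far:
            # this protein becomes a new representative.
            families[i] = [i]
    return families


def member_weights(families, family_weight):
    """Split each family's sampling weight equally among its members."""
    weights = {}
    for rep, members in families.items():
        for i in members:
            weights[i] = family_weight[rep] / len(members)
    return weights
```

For instance, with identities `[[1, 0.97, 0.3], [0.97, 1, 0.3], [0.3, 0.3, 1]]` the first two proteins form one family and the third stands alone. Only the representatives then enter the sequence identity matrix, which is what shrinks the matrix to be diagonalized.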

References

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542.

Bowie, J. U., Luthy, R. & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164–170.

Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S. & Karplus, M. (1983). CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Computer Chem. 4, 187–217.

Bryant, S. H. & Lawrence, C. E. (1993). An empirical energy function for threading protein sequence through the folding motif. Proteins: Struct. Funct. Genet. 16, 92–112.

Bryngelson, J. D. & Wolynes, P. G. (1987). Spin glasses and the statistical mechanics of protein folding. Proc. Natl Acad. Sci. USA, 84, 7524–7528.

Bryngelson, J. D., Onuchic, J. N., Socci, N. D. & Wolynes, P. G. (1995). Funnels, pathways and the energy landscape of protein folding: a synthesis. Proteins: Struct. Funct. Genet. 21, 167–195.

Covell, D. G. & Jernigan, R. L. (1990). Conformations of folded proteins in restricted spaces. Biochemistry, 29, 3287–3294.

Crippen, G. M. (1991). Prediction of protein folding from amino acid sequence over discrete conformation spaces. Biochemistry, 30, 4232–4237.

Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure 1978 (Dayhoff, M. O., ed.), vol. 3. suppl. 5, pp. 345–352, National Biomedical Research Foundation, Washington, DC.

Eriksson, A. E., Baase, W. A., Zhang, X.-J., Heinz, D. W., Blaber, M., Baldwin, E. P. & Matthews, B. W. (1992). Response of a protein structure to cavity-creating mutations and its relation to the hydrophobic effect. Science, 255, 178–183.

Fauchere, V. L. & Pliska, V. (1983). Hydrophobic parameters π of amino acid side-chains from the partitioning of N-acetyl-amino-acid amides. Eur. J. Med. Chem. 18, 369–375.

Fischer, D., Tsai, C. J., Nussinov, R. & Wolfson, H. J. (1996). A 3-D sequence independent representation of the protein databank. Protein Eng., in the press.

Go, N. (1983). Theoretical studies of protein folding. Annu. Rev. Biophys. Bioeng. 12, 183–210.

Gutin, A. M., Badretdinov, A. Y. & Finkelstein, A. V. (1992). Why are the statistics of globular protein structures Boltzmann-like? Mol. Biol. (USSR), 26, 94–102.

Hendlich, M., Lackner, P., Weitckus, S., Floechner, H., Froschauer, R., Gottsbachner, K., Casari, G. & Sippl, M. J. (1990). Identification of native protein folds amongst a large number of incorrect models; the calculation of low energy conformations from potentials of mean force. J. Mol. Biol. 216, 167–180.

Hill, T. L. (1960). Statistical Mechanics. Addison-Wesley, Reading, MA.

Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of representative protein data sets. Protein Sci. 1, 409–417.

Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). A new approach to protein fold recognition. Nature, 358, 86–89.

Lee, B. (1993). Estimation of the maximum change in stability of globular proteins upon mutation of a hydrophobic residue to another of smaller size. Protein Sci. 2, 733–738.

Luthy, R., Bowie, J. U. & Eisenberg, D. (1992). Assessment of protein models with three-dimensional profiles. Nature, 356, 83–85.

Maiorov, V. N. & Crippen, G. M. (1992). Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. 227, 876–888.

Matsumura, M., Becktel, W. J. & Matthews, B. W. (1988). Hydrophobic stabilization in T4 lysozyme determined directly by multiple substitutions of Ile3. Nature, 334, 406–410.

Miyazawa, S. & Jernigan, R. L. (1985). Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules, 18, 534–552.

Miyazawa, S. & Jernigan, R. L. (1994). Protein stability for single substitution mutants and the extent of local compactness in the denatured state. Protein Eng. 7, 1209–1220.

Needleman, S. B. & Wunsch, C. B. (1970). A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 443–453.

Nicholls, A., Sharp, K. A. & Honig, B. (1991). Protein folding and association: insights from the interfacial and thermodynamic properties of hydrocarbons. Proteins: Struct. Funct. Genet. 11, 281–296.

Nishikawa, K. & Matsuo, Y. (1993). Development of pseudoenergy potentials for assessing protein 3-D–1-D compatibility and detecting weak homologies. Protein Eng. 6, 811–820.

Novotny, J., Bruccoleri, R. E. & Karplus, M. (1984). An analysis of incorrectly folded protein models; implications for structure predictions. J. Mol. Biol. 177, 787–818.

Nozaki, Y. & Tanford, C. (1971). The solubility of amino acids and two glycine peptides in aqueous ethanol and dioxine solutions; establishment of a hydrophobicity scale. J. Biol. Chem. 246, 2211–2217.

Orengo, C. A., Flores, T. P., Taylor, W. R. & Thornton, J. M. (1993). Identification and classification of protein fold families. Protein Eng. 6, 485–500.

Pace, C. N. (1992). Contribution of the hydrophobic effect to globular protein stability. J. Mol. Biol. 226, 29–35.

Sali, A., Shakhnovich, E. & Karplus, M. (1994). Kinetics of protein folding: a lattice model study of the requirements for folding to the native state. J. Mol. Biol. 235, 1614–1636.

Sharp, K. A., Nicholls, A., Friedman, R. & Honig, B. (1991). Extracting hydrophobic free energies from experimental data: relationship to protein folding and theoretical models. Biochemistry, 30, 9686–9697.

Shortle, D., Sites, W. E. & Meeker, A. K. (1990). Contributions of the large hydrophobic amino acids to the stability of staphylococcal nuclease. Biochemistry, 29, 8033–8041.

Sippl, M. J. (1990). Calculation of conformational ensembles from potentials of mean force. J. Mol. Biol. 213, 859–883.

Sippl, M. J. & Weitckus, S. (1992). Detection of native-like models for amino acid sequences of unknown three-dimensional structure in a data base of known protein conformations. Proteins: Struct. Funct. Genet. 13, 258–271.

Yutani, K., Ogasawara, K., Tsujita, T. & Sugino, Y. (1987). Dependence of conformational stability on hydrophobicity of the amino acid residue in a series of variant proteins substituted at a unique position of tryptophan synthase α subunit. Proc. Natl Acad. Sci. USA, 84, 4441–4444.

Edited by B. Honig

(Received 17 April 1995; accepted in revised form 17 November 1995)

