
14 Bulletin of Magnetic Resonance

III

REFINEMENT AND AUTOMATION OF THE MAIN CHAIN DIRECTED ASSIGNMENT PROCEDURE FOR THE ANALYSIS OF 2-D ¹H SPECTRA OF PROTEINS

Sarah J. Nelson,^ Diane M. Schneider,* Deena L. DiStefano* and A. Joshua Wand#

Department of NMR & Medical Spectroscopy^, Institute for Cancer Research*, Fox Chase Cancer Center, Philadelphia, PA 19111

INTRODUCTION

The elucidation of protein structures from NMR data requires the interpretation of large multi-dimensional spectra. These data are subjected to extensive processing and analysis in order to reveal the information of interest. J-correlated and distance-correlated relationships between pairs of protons are most often derived from a variety of 2-D (or 3-D) homonuclear spectra (1-4). More recently, labelling with ¹³C or ¹⁵N in conjunction with heteronuclear spectroscopy has provided an alternative or complementary route for distinguishing and identifying resonances (5,6). All relationships between pairs of nuclei are represented by crosspeaks in 2-D or 3-D spectra. The first task in analyzing such data is therefore the identification of crosspeak positions in individual spectra. The cornerstone of any automated or computer-assisted procedure for structure determination is to use the computer to produce a list of crosspeak positions for each spectrum being considered.

The second stage of the analysis is to assign spectral frequencies to particular protons in the molecule. One approach to this problem is the Sequential Assignment Procedure (7-9), which places initial emphasis on analyzing the side chain spin systems of each amino acid. The amino acid sequence is used to position short runs of spin systems within the protein. Secondary structure is inferred from the presence of NOEs indicating short distances between main chain protons from different residues in the molecule. A second approach is the Main Chain Directed assignment procedure (10,11), which places more initial emphasis on the main chain NH-CαH-CβH subspin systems. Characteristic patterns of short distances between these protons are used to identify different types of secondary structures. The secondary structural units are then placed within the sequence and the remaining protons assigned by a reduced side chain analysis.

Clearly, as the size of proteins studied increases, it will become necessary to make use of all available information in making assignments. It seems likely therefore that, in future studies, relationships between side chain and main chain protons will need to be considered in parallel rather than sequentially. To assist in these analyses, computer-aided pattern searches are already becoming critical parts of the analysis. Bodenhausen et al. (12) have presented attempts to automatically identify side chain spin systems directly from J-correlated spectra. We report here upon recent advances in the refinement and automation of the analysis of main chain spin systems. These include the use of an automated peak-picking algorithm (PIQABLE2) and the derivation of strategies for identifying helix, anti-parallel and parallel sheet structures using only relationships between main chain protons (MCDPAT). Our studies have been guided by simulations based upon crystal structure data and involved examination of NMR data derived from several different proteins. The detailed statistical basis is being reported elsewhere (13,14) and thus we will concentrate here on the overall procedure and its practical implementation.

PEAK-PICKING AND PIQABLE2

The PIQABLE algorithm was originally developed for analysis of 1-D low signal-to-noise spectra. It automatically separates slowly varying baseline from statistically significant peaks and random noise of constant variance (15,16). The basic algorithm has natural extensions to both 2-D and 3-D data, a discussion of its underlying assumptions being presented elsewhere (19). The 2-D version of the algorithm (PIQABLE2) can be used to separate univariate ridges from bivariate peaks

Vol. 13, No. 1/2 15


Figure 1: Comparison of peak picking procedures applied to the amide-alpha (a-d) and amide-amide (e-h) regions of a 500 MHz NOESY spectrum of ubiquitin: a) and e) simulations based on main chain proton distances from the crystal structure, b) and f) local maximum search, c) and g) manual peak detection, d) and h) PIQABLE2 analysis.

and noise which has slowly changing variance in one or both coordinate directions. As large regions of 2-D NMR spectra display these characteristics, we have used this algorithm to obtain lists of NOE crosspeak positions. Of particular interest were the accuracy of peak position estimates, whether the algorithm could reliably identify crosspeaks between main chain amide, alpha and beta protons, and how the limited precision in these estimates impacts subsequent assignment strategies. Human ubiquitin was used as the first test system for the combination of PIQABLE2 with the Main Chain Directed (MCD) assignment procedure. Crosspeak sets were obtained by the following three methods:

(i) visual peak-picking of a NOESY spectrum obtained at 500 MHz,
(ii) searching for local maxima in the 500 MHz spectrum, followed by eliminating those maxima which occurred on only one side of the diagonal, and
(iii) PIQABLE2 applied to the 500 MHz NOESY spectrum.

A crosspeak set was also generated on the basis of the crystal structure (18). For these data, each main chain proton was assigned its actual frequency (11,19). A crosspeak was assumed to exist when the crystal structure predicted a distance between two protons of less than 4.2 Å. The relevant MCD patterns were predicted by applying the MCDPAT procedure to the simulated crystal structure data assuming zero error in crosspeak position. Schematics of typical peak distributions obtained in this manner are shown in Figure 1. In order to visualize the peaks, the radii of the circles representing peaks are exaggerated from the peak picking tolerance (±6 data points). For the manual peaks, some tolerances were relatively large and, to give a reasonable representation, the values used were either the same as for the automated peaks or twice the estimated tolerance, whichever was the larger.
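The construction of the simulated crosspeak set can be sketched as follows. This is a minimal illustration, not the authors' program: the function name `predict_crosspeaks`, the dictionary layout and the example coordinates are all assumptions made for the sketch. Only the 4.2 Å cut-off comes from the text.

```python
import math
from itertools import combinations

def predict_crosspeaks(protons, cutoff=4.2):
    """Predict NOE crosspeaks from proton coordinates.

    protons: dict mapping a proton label to (frequency, (x, y, z)),
             with coordinates in angstroms (hypothetical layout).
    Returns a list of (label_a, label_b, freq_a, freq_b) for every pair
    of protons closer than `cutoff` angstroms.
    """
    peaks = []
    # Sorting makes the output order deterministic.
    for (a, (fa, ra)), (b, (fb, rb)) in combinations(sorted(protons.items()), 2):
        if math.dist(ra, rb) < cutoff:
            peaks.append((a, b, fa, fb))
    return peaks

# Illustrative values only; not taken from the ubiquitin structure.
example_protons = {
    "HN_1": (8.3, (0.0, 0.0, 0.0)),
    "HA_1": (4.5, (2.0, 0.0, 0.0)),
    "HN_2": (8.1, (10.0, 0.0, 0.0)),
}
example_peaks = predict_crosspeaks(example_protons)
```

With the illustrative coordinates above, only the HA_1/HN_1 pair lies within the cut-off, so a single crosspeak is predicted.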

From the crystal structure, 116 crosspeaks between main chain protons were involved in MCD patterns. Of these, 9 would be overlapping due to degeneracy. The local maximum search of the 500 MHz spectrum found over 30000 peaks with 4056 symmetric maxima. Only 85 of these corresponded to the crosspeaks defining MCD patterns. The missing crosspeaks defined mainly relationships between protons in the sheet structures, especially those close to the solvent track. The spatial distribution of local maxima in the spectrum appeared to be stochastic, showing no obvious pattern (see Figure 1). This suggests that a large fraction of the peaks were due to random noise.
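A local maximum search with a symmetry filter, as in method (ii), might look like the sketch below. It assumes a square spectrum stored as a list of rows, defines a maximum as a point strictly greater than its eight neighbours, and ignores points on the diagonal itself. These choices and the function names are illustrative assumptions; the text does not specify how the search was implemented.

```python
def local_maxima(spectrum):
    """Return the set of interior (row, col) points that are strictly
    greater than all eight neighbouring points."""
    n_rows, n_cols = len(spectrum), len(spectrum[0])
    maxima = set()
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            value = spectrum[i][j]
            neighbours = [
                spectrum[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)
            ]
            if all(value > n for n in neighbours):
                maxima.add((i, j))
    return maxima

def symmetric_maxima(maxima, tol=0):
    """Keep off-diagonal maxima that have a partner reflected across the
    diagonal, within `tol` data points in each dimension."""
    kept = set()
    for (i, j) in maxima:
        if i == j:
            continue  # assumption: diagonal points are not crosspeaks
        if any(abs(p - j) <= tol and abs(q - i) <= tol for (p, q) in maxima):
            kept.add((i, j))
    return kept

# Toy 7x7 spectrum: a symmetric pair at (1,4)/(4,1) and a lone maximum at (5,3).
toy = [[0.0] * 7 for _ in range(7)]
toy[1][4], toy[4][1], toy[5][3] = 5.0, 4.0, 3.0
toy_maxima = local_maxima(toy)
toy_symmetric = symmetric_maxima(toy_maxima)
```

In the toy spectrum the lone maximum at (5,3) is discarded because nothing is found at (3,5). A filter of this kind removes one-sided artifacts but cannot, by itself, distinguish real crosspeaks from symmetric noise.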

Manual peak-picking from the upper diagonal of an expanded contour plot of the spectrum gave 847 crosspeaks. In this case, the beta-beta and alpha-beta regions were not analyzed thoroughly because they were thought to have minimal influence on identifying MCD patterns. To ensure that overlapping peaks were not missed, the tolerance on peak position was set by the size of the lowest contour around the identified peak and ranged from ±1 to ±6 data points (2.8 Hz/point). It was thought that these tolerances could be relaxed but, in practice, this was not possible without losing critical peaks. When these tolerances were used to compare peaks found with those predicted from the crystal structure, 95 of the 116 MCD peaks were identified. Of the 21 missing, 8 were between alpha-beta protons and would not cause MCD patterns to be missed. The remaining 13 crosspeaks were either at the ends of sheets or involved crosspeaks close to the solvent track.
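The comparison of picked peaks with predicted peaks can be sketched as a tolerance match. The code below is an assumed, simplified form: positions are in data points, each observed peak carries its own tolerance, and a predicted peak counts as found if any observed peak lies within that tolerance in both dimensions. The names `match_peaks` and `points_to_hz` are hypothetical.

```python
HZ_PER_POINT = 2.8  # digital resolution quoted in the text

def points_to_hz(n_points, hz_per_point=HZ_PER_POINT):
    """Convert a tolerance in data points to hertz."""
    return n_points * hz_per_point

def match_peaks(predicted, observed):
    """Return indices of predicted peaks matched by an observed peak.

    predicted: list of (f1, f2) positions in data points.
    observed:  list of (f1, f2, tol) with tol in data points.
    """
    found = []
    for k, (p1, p2) in enumerate(predicted):
        if any(abs(p1 - o1) <= tol and abs(p2 - o2) <= tol
               for (o1, o2, tol) in observed):
            found.append(k)
    return found

# Illustrative positions only.
predicted_example = [(100, 200), (150, 400)]
observed_example = [(101, 198, 2), (150, 410, 6)]
found_example = match_peaks(predicted_example, observed_example)
```

At 2.8 Hz/point, the largest manual tolerance of ±6 points corresponds to roughly ±16.8 Hz, which gives a sense of how much room a loose tolerance leaves for accidental matches.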

The PIQABLE2 algorithm requires a definition of the baseline, which defines the relative rates of change of these three components (ridges, peaks and noise) in both dimensions. For this analysis a range of parameters was used. In each case, approximately 2000 crosspeaks were found, with between 90 and 99 of the 116 MCD crosspeaks being identified. In general, the same MCD crosspeaks were found as in the manual analysis, with the addition of some of the missing 8 alpha-beta crosspeaks. For some parameter values, crosspeaks close to the spectrum diagonal were missed and, as with the other peak-picking procedures, crosspeaks near the solvent track were sometimes not identified. Again, there is a clear spatial pattern of PIQABLE2 crosspeaks (see Figure 1), suggesting that most of them were real. In this case the peak position tolerances were found to be between ±1 and ±2 data points. This is consistent with the accuracy of positions found in the analysis of 1-D spectra using the PIQABLE algorithm.

Overall, both manual and PIQABLE2 peak picking found the majority of crosspeaks needed for defining MCD patterns in ubiquitin. Missing crosspeaks were due mainly to difficulties near the solvent track or for protons in residues near the ends of sheets. The latter is caused by larger distances between protons in residues spanning the transitions from classical secondary structure. The local maximum search found a larger total number of crosspeaks and fewer MCD crosspeaks. This is clearly an inferior procedure and apparently found a large number of spurious peaks. The improved accuracy of the PIQABLE2 analysis relative to manual picking can be attributed to (1) its ability to remove the influence of ridges, (2) its use of additional local smoothing to refine peak position estimates and (3) the difficulties associated with visually estimating maximum positions from a contour plot. It is possible that the accuracy of manual estimates could be improved by picking from a grey level image of the spectrum. Even in this case, it seems unlikely that significant and consistent improvements over the PIQABLE2 accuracy would be made. Hence, for both speed and reliability, PIQABLE2 would appear to be the best of the procedures studied.

THE MCDPAT PROCEDURE

The basic information which is used for the MCDPAT procedure is a list of the frequencies of main chain amide, alpha and beta protons for each residue (NAB set) and a list of crosspeaks obtained from the NOESY. The NAB sets are


identified from J-correlated spectra (COSY, RELAYED COSY and TOCSY) by relatively simple rules. In our current analyses, this step has been performed by hand. In future, we expect to automate it by making use of PIQABLE2's peak-picking abilities. The MCDPAT procedure requires a set of NOESY crosspeaks characterized by both a position and an estimated position tolerance so that they can be mapped to specific pairs of NAB protons. The key to easy automation of the search for MCD patterns is the way in which these two types of information are organized. Our approach can be summarized as follows. Let the main chain protons of interest be defined by a set E of objects under study. Information about individual protons or relationships between protons can be viewed as mappings on the set E. Joint properties are then defined by simple set algebra on the inverse images of these mappings. For example, suppose $S_1$ = {protons with frequencies in the range $(f_1, f_2)$} and $S_2$ = {alpha protons}. Then the intersection of $S_1$ and $S_2$ describes the alpha protons with frequencies in the range $(f_1, f_2)$. Note that the frequencies in the NAB list depend upon the peak-picking in J-correlated spectra and it is thus not always possible to associate each crosspeak in the NOESY with a unique pair of protons. This is partly due to the tolerances on peak positions and partly to the occurrence of degeneracy. If there is a crosspeak at $(f_1, f_2)$ in a 2-D spectrum then it is assumed that there is a short distance relationship between all pairs of protons with frequencies in the range $(f_1 - \epsilon_1, f_1 + \epsilon_1)$ and protons with frequencies in the range $(f_2 - \epsilon_2, f_2 + \epsilon_2)$, where $\epsilon_1$ and $\epsilon_2$ are the peak-picking tolerances. Clearly, the smaller the values of $\epsilon_1$ and $\epsilon_2$, the less the chance of including false relationships.
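The set-algebra view described above maps naturally onto Python sets. The sketch below is an assumed illustration, not the MCDPAT code: the `Proton` class, the helper names and the chemical shift values are hypothetical. It shows the intersection example from the text and the mapping of one crosspeak onto all proton pairs consistent with its tolerances.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proton:
    nab: int      # index of the NAB set (residue) the proton belongs to
    kind: str     # "amide", "alpha" or "beta"
    freq: float   # frequency from the NAB list

def in_range(protons, lo, hi):
    """Inverse image of a frequency window: protons with lo < freq < hi."""
    return {p for p in protons if lo < p.freq < hi}

def of_kind(protons, kind):
    """Inverse image of a proton type."""
    return {p for p in protons if p.kind == kind}

def candidate_pairs(protons, f1, f2, e1, e2):
    """All ordered proton pairs consistent with a crosspeak at (f1, f2)
    given peak-picking tolerances e1 and e2."""
    s1 = in_range(protons, f1 - e1, f1 + e1)
    s2 = in_range(protons, f2 - e2, f2 + e2)
    return {(a, b) for a in s1 for b in s2 if a != b}

# Illustrative frequencies only.
protons = {
    Proton(1, "amide", 8.30), Proton(1, "alpha", 4.50),
    Proton(2, "amide", 8.10), Proton(2, "alpha", 4.52),
}
alpha_in_window = in_range(protons, 4.4, 4.6) & of_kind(protons, "alpha")
tight = candidate_pairs(protons, 8.30, 4.50, 0.02, 0.01)
loose = candidate_pairs(protons, 8.30, 4.50, 0.02, 0.03)
```

With the tighter tolerance the crosspeak maps to a single pair; with the looser one the near-degenerate alpha proton of NAB set 2 is also admitted, giving one false relationship. This is the effect the last sentence of the paragraph describes.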

During the MCDPAT procedure, the following information for each main chain proton is kept:

(1) NAB set membership
(2) type (amide, alpha or beta)
(3) frequency from the NAB list
(4) protons with which there may be an NOE
(5) patterns or secondary structural units that the proton has currently been found to participate in.

With appropriate data structures, the MCDPAT procedure is implemented as a relatively straightforward set of logical rules. Searching for patterns and fitting the patterns together uses set algebra operations (e.g. union, intersection, membership, complementation) as basic tools. This gives a very flexible and powerful basis for studying complex relationships between main chain protons. These concepts can readily be extended to include other protons (e.g. side chain spin systems).
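One possible data structure holding the five items listed above is sketched here. The field names, the use of (NAB index, type) tuples as keys and the helper functions are assumptions for illustration; the text does not describe the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ProtonRecord:
    nab_set: int                                      # (1) NAB set membership
    kind: str                                         # (2) amide, alpha or beta
    frequency: float                                  # (3) frequency from the NAB list
    noe_partners: set = field(default_factory=set)    # (4) possible NOE partners
    patterns: set = field(default_factory=set)        # (5) patterns joined so far

def add_noe(records, a, b):
    """Record a possible NOE between the protons keyed by a and b."""
    records[a].noe_partners.add(b)
    records[b].noe_partners.add(a)

def common_partners(records, a, b):
    """Protons with a possible NOE to both a and b (set intersection)."""
    return records[a].noe_partners & records[b].noe_partners

# Illustrative records keyed by (NAB set index, proton type).
records = {
    (1, "amide"): ProtonRecord(1, "amide", 8.30),
    (1, "alpha"): ProtonRecord(1, "alpha", 4.50),
    (2, "amide"): ProtonRecord(2, "amide", 8.10),
}
add_noe(records, (1, "amide"), (1, "alpha"))
add_noe(records, (2, "amide"), (1, "alpha"))
shared = common_partners(records, (1, "amide"), (2, "amide"))
```

Keeping the NOE partners and pattern memberships as sets means that pattern searches reduce to unions, intersections and membership tests, in the spirit of the paragraph above.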

The MCDPAT procedure first searches for basic helix H4 patterns (see Figure 2). These are ordered according to the number of characteristic NOEs they exhibit (between 10 and IS). Pieces of helical structure are generated by fitting the overlapping H4 units together, beginning with the H4's of highest NOE status (seeds). If an inconsistency or degeneracy is encountered, a new seed is considered. When all the H4 units have been considered, the pieces of helix are sorted and the elements of NAB sets corresponding to unambiguous pieces of helix are eliminated from further consideration. In this way, false relationships generated during mapping of NAB frequencies onto the set of NOE crosspeaks are eliminated.
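The seed-and-extend idea for helices can be illustrated with a small greedy sketch. It rests on assumptions that are not stated in the text: each H4 unit is represented as a tuple of four NAB set indices, two units overlap when they share three consecutive NAB sets, and extension simply stops when more than one unit could be attached. The function `build_helices` is hypothetical and much simpler than the procedure described.

```python
def build_helices(h4_units):
    """Assemble helix pieces from H4 units.

    h4_units: list of (nab_ids, noe_status), where nab_ids is a tuple of
              four NAB set indices (assumed representation).
    Returns a list of helix pieces, each a list of NAB set indices.
    """
    # Seeds are tried in order of decreasing NOE status.
    ordered = sorted(h4_units, key=lambda unit: -unit[1])
    used = set()
    pieces = []
    for seed, _ in ordered:
        if seed in used:
            continue
        piece = list(seed)
        used.add(seed)
        extended = True
        while extended:
            extended = False
            # Units whose first three NAB sets match the end of the piece.
            right = [u for u, _ in ordered
                     if u not in used and u[:3] == tuple(piece[-3:])]
            if len(right) == 1:          # extend only when unambiguous
                piece.append(right[0][3])
                used.add(right[0])
                extended = True
            # Units whose last three NAB sets match the start of the piece.
            left = [u for u, _ in ordered
                    if u not in used and u[1:] == tuple(piece[:3])]
            if len(left) == 1:
                piece.insert(0, left[0][0])
                used.add(left[0])
                extended = True
        pieces.append(piece)
    return pieces

# Illustrative units: three overlapping H4's and one isolated H4.
example_units = [
    ((1, 2, 3, 4), 12), ((2, 3, 4, 5), 10),
    ((3, 4, 5, 6), 11), ((9, 10, 11, 12), 12),
]
example_helices = build_helices(example_units)
```

In this example the three overlapping units merge into one piece spanning NAB sets 1 to 6, while the isolated unit stays on its own. When two units compete for the same end of a piece, the sketch leaves them unattached so that they are revisited as separate seeds.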

The next basic MCD patterns to be found are for anti-parallel sheets (Figure 2): inner loops (I), outer loops (O) and hybrid loops (H). Fitting them together is more complex due to the multiple patterns and the potentially multi-strand nature of sheets. A second level of composite patterns between I and O subunits is first identified (IO_h, IO_v, OIO_h, where the 'h' refers to horizontal patterns connecting only two strands and the 'v' to vertical patterns connecting more than two strands). The OIO_h patterns are the seeds for building bigger pieces of sheets. They are extended out both horizontally and vertically as far as possible using overlapping composite patterns until an ambiguity is reached. At this stage any relevant hybrids can be fitted to the sheets. When all possible combinations have been made, the anti-parallel sheet structures are sorted and unambiguous ones eliminated from further search.

Figure 2: Examples of MCD patterns. [Figure not reproduced; panel labels include H4(0), H4(12), O, H and I, with a key distinguishing amide, alpha and beta protons.]

Third comes the study of single loop (P) parallel sheet patterns (Figure 2) and their combination with each other. Again, multiple strand structures are possible. When these structures have been identified they are checked to see if they are joined to pieces of anti-parallel sheets.

Finally comes a reconciliation phase where ambiguities are addressed. Any outstanding basic or composite patterns are examined to see if their role has been better defined. This may lead to an iteration of the procedure to build new pieces of structure. If there are remaining ambiguities, they must be resolved by using other criteria such as side chain analysis of the residues involved.

APPLICATION OF MCDPAT

The MCDPAT procedure has been used for both theoretical and experimental studies. A first area where it has been useful is in finding which particular patterns are valuable for the MCD analysis. The original patterns considered were based upon empirical observations (10). With MCDPAT we were able to perform a more extensive analysis of the frequency and fidelity of different patterns. Inter-proton distances were obtained from high resolution crystal structure data (11) and used to generate dummy "ideal" datasets for MCDPAT. A library of 39 different proteins, with between 26 and 287 residues, was considered. The data were ideal in the sense that there was no degeneracy (perfect peak-picking and no overlapping frequencies) and that all possible short distances were included (no missing peaks).

This analysis showed conclusively that H4 patterns were present in almost all regions defined as having a helical secondary structure. In addition, their fidelity increased with NOE status and, for cut-off detection distances of 4.2 Å, their NOE status was usually the highest possible (status 12). This high frequency and fidelity provided the motivation for putting them first in the MCDPAT procedure. At low cut-off distances (less than 3.6 Å), the hybrid subunits dominated for anti-parallel sheet patterns. When the cut-off distance was increased to 4.2 Å, the inner and outer loops increased in frequency and had higher fidelity. Nevertheless, the inner loops had a maximum fidelity of only 80% and the outer loops 90%. There are two ways of improving the fidelity of anti-parallel sheet patterns. The first is by eliminating already identified helix patterns and the second is to form composite patterns. For example, OIO_h patterns have a fidelity of 98% and OIO_v patterns of 100%.

The situation for parallel sheet single loops is much worse and the basic fidelity is as low as 10%. In this case, however, there is a considerable overlap with helix and anti-parallel sheet patterns. Eliminating these before searching for single loops does leave behind a subset of patterns with relatively higher fidelity. When the full MCDPAT procedure was applied to data for ubiquitin, ribonuclease A and T4 lysozyme, the secondary structures identified corresponded very closely to the crystal structures. The only deviations were within the residues at the end of structural units or within regions where the crystal structure information was ambiguous (13).

To investigate the effect of spectral degeneracy on the frequency of different patterns we chose to study ubiquitin in some detail: from the NMR and crystal structures it contained the three types of secondary structures detected by MCDPAT (18,19). Each main chain proton was given a frequency by sampling at random from empirical distributions for the different types of protons. Using the crystal structure to define NOE relationships between protons within 4.2 Å and assuming a range of tolerances of ±0.5 to ±2 data points on peak positions, we simulated various levels of degeneracy. The extent of degeneracy had a large effect on the number of NOEs which were identified. As the tolerance increases to ±2, approximately 75% of the identified NOEs are false. There are variations depending on the particular types of protons concerned, but even in the best case (amide-amide) 50% of NOEs are false at ±2 tolerance. Similar large increases are found in the total numbers of basic MCD patterns found. However, some patterns are relatively robust to degeneracy. These are in particular the H4 patterns with high NOE status (greater than or equal to 9). Composite anti-parallel sheet patterns are more robust than single patterns but above ±1 tolerance they exhibit a large number of false patterns. The reason for this is clear if one examines composite inner loop patterns. Unlike the helix patterns, the inner loop involves relationships between single protons from each NAB set. Hence, any degeneracy in the relevant proton will define a false pattern. There are six possible outer loops which overlap with an inner loop. Each overlapping loop requires an NOE to a second proton of one of the NAB sets. Appropriate combinations of four outer loops will therefore place strong limitations on the effect of degeneracy. For example, the existence of an OOIOO_h pattern will confer robustness to the OIO_h subpattern. Using this type of argument, OIO_h patterns can be classified according to their participation in higher level patterns. This order gives a robustness criterion for ordering the seeds in searching for anti-parallel sheet structures. Similar arguments can be applied to single loop patterns which form part of PPP_h composite patterns.
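The way degeneracy inflates the NOE list can be illustrated with a small sketch. It assumes a simplified degeneracy rule, in which any proton whose frequency lies within the tolerance of a true partner's frequency is also implicated, and uses fixed illustrative frequencies rather than random draws from empirical distributions. The function `false_noe_fraction` is hypothetical.

```python
def false_noe_fraction(freqs, true_pairs, tol):
    """Fraction of implied NOE relationships that are false.

    freqs:      dict mapping proton label to frequency in data points.
    true_pairs: set of frozensets, each holding two proton labels that
                are genuinely close in space.
    tol:        peak position tolerance in data points.
    """
    implied = set()
    for pair in true_pairs:
        a, b = tuple(pair)
        # Protons indistinguishable from a (or b) at this tolerance.
        near_a = [p for p, f in freqs.items() if abs(f - freqs[a]) <= tol]
        near_b = [p for p, f in freqs.items() if abs(f - freqs[b]) <= tol]
        for p in near_a:
            for q in near_b:
                if p != q:
                    implied.add(frozenset((p, q)))
    if not implied:
        return 0.0
    return len(implied - true_pairs) / len(implied)

# Illustrative frequencies: two amides one data point apart.
example_freqs = {"N1": 100.0, "N2": 101.0, "A1": 300.0, "A2": 350.0}
example_true = {frozenset(("N1", "A1"))}
fraction_tight = false_noe_fraction(example_freqs, example_true, 0.5)
fraction_loose = false_noe_fraction(example_freqs, example_true, 1.0)
```

In this toy case no false NOEs are implied at ±0.5 data points, while at ±1 the second amide becomes indistinguishable from the first and half of the implied NOEs are false.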

For simulated ubiquitin at zero tolerance, two helices are identified in residues 23-33 and 56-59. Up to ±1 tolerance these are still uniquely defined by MCDPAT. At ±1.5 and ±2 tolerance an additional false H4 is found. Anti-parallel sheets are also identified with a single false inner loop up to ±1.5 tolerance. At ±2.0 tolerance, many false patterns are found. In the case of parallel sheets, the ambiguity is too severe to identify pieces of sheet even at ±0.5 tolerance without using overlaps between parallel and anti-parallel sheet patterns. This is partly because the parallel sheet in ubiquitin comprises a relatively short run of two strands linking two pieces of anti-parallel sheet.

These simulations were very valuable in defining robust patterns and specifying that cut-off tolerances should ideally be ±1 to ±1.5 data points. This is substantially in the range achieved by PIQABLE2 but is well below the manual peak-picking tolerances (see above). We therefore went on to investigate the ability to define structural units in experimental NMR data for which the NOESY had been peak-picked by PIQABLE2. Our predictions concerning the effect of degeneracy were confirmed. One notable difference was that, particularly for helices, protons within a structure tended to have a higher degeneracy than unrelated protons. This meant that many of the false patterns linked NAB sets within the correct structure but with incorrect orientation. Such inconsistencies were easier to resolve than totally false patterns. Both of the helices from the simulated data were found, but the shorter 3_10 helix had a problem distinguishing between residues 57 and 63. When the residues participating in the helices were removed, the anti-parallel sheet patterns reduced to two families of composite patterns. By building out to the maximum extent possible, residues 3-7 and 13-17 are seen to participate in a sheet together with 42→45 and 68→71. It was not possible to extend further or find all the parallel sheet structure because of missing crosspeaks close to the solvent track. If we extend with patterns complete except for such crosspeaks, all but a few NOEs at the ends of the sheet are present. This highlights the practical difficulty of missing information. We are currently trying two approaches to solving this problem. Firstly, we are making a more precise definition of partial patterns and secondly, we are studying 600 MHz data, which have higher resolution and are less affected by the residual solvent track. For larger proteins, it is likely that both types of approaches will be required.

CONCLUSIONS

The combination of PIQABLE2 and MCDPAT is potentially very powerful for studying relationships between main chain protons and identifying secondary structural units based upon NMR data. Remaining difficulties are associated with peak-picking near the solvent track, which particularly affects the definition of sheet structures. This seems to be a feature of both manual and automatic peak-picking procedures. Obtaining better peak resolution by using a higher field strength, together with improving solvent suppression or collecting additional spectra at different temperatures in order to move the relevant crosspeaks further away from the solvent, will reduce the impact of this technical problem. Anticipated improvements to MCDPAT will


include the use of PIQABLE2 to peak-pick J-correlated spectra and automatic definition of NAB sets. Further refinements will be based upon experience with experimental data for other proteins. The general approach which we have developed for the analysis of multiple relationships between protons may also be of value in studying side chain spin systems or analyzing 3-D proton spectra.

ACKNOWLEDGMENTS: This work was supported by NSF Grant #DIR89-04066 (SJN), NIH Research Grant GM35490 (AJW), by an NIH Postdoctoral Fellowship GM12574 (DMS), by the Pew Memorial Trust and by NIH Grants GA-06927 and QR-0S139.

REFERENCES

1. Wuthrich K. NMR of Proteins and Nucleic Acids (Wiley, New York, 1986).
2. Ernst R.R., Bodenhausen G. and A. Wokaun. Principles of Nuclear Magnetic Resonance in One and Two Dimensions (Oxford University Press, Oxford, 1987).
3. Griesinger C., Sorensen W. and R.R. Ernst. J. Magn. Reson. 73, 574 (1987).
4. Vuister G.W., Boelens R. and R. Kaptein. J. Magn. Reson. 80, 176 (1988).
5. Bax A. and M.A. Weiss. J. Magn. Reson. 71, 571 (1987).
6. Griffey R.H., Redfield A.G., Loomis R.E. and F.W. Dahlquist. Biochemistry 24, 817 (1985).
7. Wuthrich K., Wider G., Wagner G. and W. Braun. J. Mol. Biol. 155, 311 (1982).
8. Billeter M., Braun W. and K. Wuthrich. J. Mol. Biol. 155, 321 (1982).
9. Wagner G. and K. Wuthrich. J. Mol. Biol. 155, 347 (1982).
10. Englander S.W. and A.J. Wand. Biochemistry 26, 5953 (1987).
11. Wand A.J. and S.J. Nelson. Trans. ACA 24, 131 (1988).
12. Pfandler P. and G. Bodenhausen. J. Magn. Reson. 79, 99 (1988).
13. Wand A.J. and S.J. Nelson, in preparation.
14. Nelson S.J., Schneider D.M. and A.J. Wand, in preparation.
15. Nelson S.J. and T.R. Brown. J. Magn. Reson. 75, 229 (1987).
16. Nelson S.J. and T.R. Brown. J. Magn. Reson. 84, 95 (1989).
17. Nelson S.J. and T.R. Brown. Bull. Magn. Reson. (1990).
18. Vijay-Kumar S., Bugg C.E. and W.J. Cook. J. Mol. Biol. 194, 531 (1987).
19. DiStefano D.L. and A.J. Wand. Biochemistry 26, 7272 (1987).
