IEEE 2007 Frontiers in the Convergence of Bioscience and Information Technologies, Jeju City, South Korea (2007.10.11-2007.10.13)

Resonant recognition model for prediction of ‘lead-in regions’ in proteins

Mohenish Jaiswal1, S R Mahadeva Prasanna2 and Latha Rangan1*

1Department of Biotechnology, IIT Guwahati, Assam 781 039

2Department of Electronics and Communication, IIT Guwahati, Assam 781 039

Email: {lrangan, prasanna}@iitg.ernet.in

Abstract

In this study, the resonant recognition model (RRM) was used to locate the so-called ‘lead-in regions’, i.e. the residues that contribute to the creation of the environment that enables proteins to perform their complex functions. The RRM is a physical and mathematical model that provides a significant correlation between the spectra of the numerical representation of amino acid sequences and their biological activity. We developed a preliminary model for this purpose and applied it to two well-characterized classes of enzymes, viz. human proteases and kinases. The results obtained (positions of ‘lead-in residues’ in the primary structure of the proteins) were used together with the 3D-structure data of the proteins to predict the catalytic residue of the enzyme.

Key Words: Catalytic site; Lead-in regions; Protein coding; Resonant recognition model (RRM)

1. Introduction

Proteins are the most important biological effector molecules in the cell and perform diverse functions. Function is encoded in a protein’s primary structure, and this structure can be represented as a discrete signal by assigning numerical values to the twenty amino acids that make up protein sequences [1] [2]. Digital signal processing (DSP) tools can then be applied to decode the information contained in a protein’s primary structure, which in turn determines the protein’s secondary, tertiary and quaternary structures [3]. The chief characteristic of proteins that enables them to carry out their diverse cellular functions is their ability to bind other molecules specifically and tightly. In an enzyme this binding takes place at the active site, and the environment prevailing there is crucial for the function of the enzyme. The residues that participate directly in the catalytic mechanism (catalytic residues, binding residues etc.) are able to act only because of the precise environment created by other residues in the protein which, although they do not participate directly in the activity, are nonetheless crucial for the proper functioning of the enzyme. These residue groups, located at specific positions defined by the tertiary structure or folding pattern of the protein, tune the environment by means of hydrogen bonds, charge-charge interactions and hydrophobic forces that operate between the residues and their side-chains.

The theory and methods of digital signal processing are becoming increasingly important in molecular biology [4]. Digital filtering techniques, Markov models and transform-domain methods have played important roles in gene identification, biological sequence analysis and alignment [5]. The resonant recognition model (RRM) is a physical and mathematical model that interprets protein sequence information using digital signal analysis methods. The model assumes that characteristic frequencies are responsible for resonant recognition between macromolecules at a distance. The RRM characteristic frequencies not only characterize general functions but also provide recognition between a particular protein and its target [6] [7]. The majority of the work so far has focused on the prediction of hot-spots: specific sites in the 3D-structure of a protein where certain molecules (ligands, substrates etc.) can conveniently bind [8], [9], [10], [11], [12], [13].

Frontiers in the Convergence of Bioscience and Information Technologies 2007

0-7695-2999-2/07 $25.00 © 2007 IEEE. DOI 10.1109/FBIT.2007.32


In our study we have used the RRM to attempt to locate the so-called ‘lead-in regions’, i.e. the residues that contribute to the creation of the environment that enables proteins to perform their complex functions, in two well-characterized classes of enzymes, viz. human proteases and kinases. We then go a step further to see whether the catalytic residues can be predicted from the information obtained by applying digital signal processing tools together with the 3D-structure data of the protein.

2. Research Questions

Apart from the catalytic site and the binding site, the vicinity of the active site and the rest of the protein contain residue groups that are not directly involved in binding or catalysis but create an environment without which binding and catalysis are not possible [4] [14]. These residues are expected to occur in the lead-in regions. Fourier techniques have been of great help in identifying what is common among protein molecules that share a function [5], [12], [15], but they have not provided any information about the location, in the primary structure of the protein, of the residues responsible for its biological function. Digital signal processing tools, particularly the resonant recognition model, could have a significant impact here.

Since each frequency f(x) in the RRM characterizes one biological function x, we hypothesize that the portion of the primary structure of a protein where the intensity of the characteristic frequency is significantly higher than in the rest of the sequence contributes most to the biological function. Hence the research questions addressed in the present work were to find the lead-in regions, to analyze their spatial location with respect to the catalytic site, and to predict the location of the catalytic residues from the location of the lead-in regions.

3. Methods

Amino acid sequences of human protein kinases (11 sequences) and human proteases (12 sequences) were downloaded from the Swiss-Prot database (http://expasy.org/sprot/). The database also contains information on functional features of the proteins, such as catalytic residues and binding residues. The corresponding 3D-structures of these proteins were downloaded from the RCSB Protein Data Bank (www.rcsb.org/pdb/).

4. Numerical Experiments

Calculation of characteristic frequency

The amino acid sequences are transformed into numerical sequences x(n), such that x(n) is equal to the electron-ion interaction potential (EIIP) value of the nth amino acid in a particular sequence and represents the distribution of the free electrons' energies along the protein. The frequency spectrum X(k) is obtained by applying the discrete Fourier transform (DFT) to the numerical sequence x(n) [16]:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i 2 \pi k n / N}, \qquad k = 0, 1, \ldots, N-1$$

where X(k) is the kth DFT coefficient of x(n), e is the base of the natural logarithm, i is the imaginary unit (i^2 = -1), N is the total number of points in x(n) and π is Pi. The multiple cross-spectrum (consensus spectrum) of a group of proteins having a common function (e.g. human protein kinases) is calculated by pointwise multiplication of the frequency spectra:

$$M(k) = X_1(k)\, X_2(k) \cdots X_M(k)$$

where M(k) contains the multiple cross-spectral coefficients and X_M(k) contains the DFT coefficients of the Mth sequence. Peak frequencies of the consensus spectrum with a signal-to-noise ratio SNR > 20 are recorded. These are the characteristic frequencies corresponding to the functions common within the group of proteins.
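The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MATLAB code used in the study: the EIIP table is the one commonly quoted in the RRM literature (it is not listed in this paper and should be checked against a primary source), sequences are zero-padded to a common length, the mean of x(n) is removed before the DFT, spectra are multiplied as magnitudes, and SNR is taken as the peak divided by the mean of the consensus spectrum. The function names are hypothetical.

```python
import cmath
from statistics import mean

# EIIP values (Rydbergs) as commonly tabulated in the RRM literature.
# Assumption: these are not given in the paper itself.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}

def encode(sequence):
    """Map a one-letter amino acid string to its EIIP series x(n)."""
    return [EIIP[aa] for aa in sequence]

def dft_magnitudes(x, n_points):
    """Return |X(k)| for k = 1 .. n_points // 2.

    Assumptions: the mean is removed so the k = 0 term does not dominate,
    and x is zero-padded to n_points so all spectra share one frequency grid.
    """
    avg = mean(x)
    padded = [v - avg for v in x] + [0.0] * (n_points - len(x))
    mags = []
    for k in range(1, n_points // 2 + 1):
        total = sum(v * cmath.exp(-2j * cmath.pi * k * n / n_points)
                    for n, v in enumerate(padded))
        mags.append(abs(total))
    return mags

def consensus_spectrum(sequences):
    """Pointwise product of the magnitude spectra of all sequences."""
    n_points = max(len(s) for s in sequences)
    consensus = [1.0] * (n_points // 2)
    for s in sequences:
        spectrum = dft_magnitudes(encode(s), n_points)
        consensus = [c * m for c, m in zip(consensus, spectrum)]
    freqs = [k / n_points for k in range(1, n_points // 2 + 1)]
    return freqs, consensus

def snr(consensus):
    """Peak-to-mean ratio of the consensus spectrum (assumed SNR definition)."""
    return max(consensus) / mean(consensus)
```

Frequencies are returned in cycles per residue, so they fall in the range 0 to 0.5, the same scale as the characteristic frequencies reported in Table 1.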

Correction in characteristic frequency calculation method

The peak f(x) of the consensus spectrum corresponds to the function x common to the members of the group. If a member of the group does not have the common function possessed by all the other members, the peak f(x) will decay. The resulting cross-spectrum will also contain ‘weak’ peaks that do not correspond to the common dominant function. Such anomalous sequences need to be culled so as to preserve the nature of the multiple cross-spectrum (author’s personal communication). On analysis it was observed that such anomalous sequences were much longer than the average for the group. It was therefore decided that sequences longer than the mean plus two standard deviations of the sequence lengths in the group would not be considered for calculation of the multiple cross-spectrum.
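The length-based selection rule can be written as a short filter. The sketch below is a minimal illustration, assuming the population standard deviation (the text does not say whether the population or sample form was used); `drop_oversized` is a hypothetical name.

```python
from statistics import mean, pstdev

def drop_oversized(sequences, n_sd=2.0):
    """Keep sequences whose length is at most mean + n_sd * SD of the group.

    Assumption: population standard deviation; the paper does not specify.
    """
    lengths = [len(s) for s in sequences]
    cutoff = mean(lengths) + n_sd * pstdev(lengths)
    return [s for s in sequences if len(s) <= cutoff]
```

For a group of ten sequences of about 300 residues and one of 1500 residues, only the 1500-residue sequence exceeds the cutoff and is excluded.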

The consensus spectra of human protein kinases and human proteases are shown in Fig. 1a and 1b. Of the 11 human kinases and 12 human proteases originally taken, 9 kinases and 9 proteases passed the selection process described above. As shown, one characteristic frequency for human kinases and two characteristic frequencies for human proteases with SNR > 20 were found. These frequencies and their corresponding SNR values are given in Table 1. Thus, the RRM predicts that there exists one common function in the group of kinases and two common functions in the group of proteases.

Window size and rate of translation

Having calculated the characteristic frequencies f(x), individual members of the family were analyzed to find the lead-in regions. Windows in which the intensity at the characteristic frequency was significantly higher than elsewhere (greater than the mean plus two standard deviations) were taken to enclose the lead-in regions. Window sizes of 4, 6, 8 and 10 were used to predict the lead-in regions in every case. The regions were then mapped onto the 3D-structure of the corresponding protein using the UCSF Chimera software package, and their location with respect to the catalytic and binding residues was evaluated. The window transition rate was set to half the window width to avoid false positives and negatives (Fig. 2).
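A windowed scan of this kind might look like the following sketch. It rests on assumptions the text does not settle: the intensity of a window is taken as the magnitude of the single-frequency Fourier sum over that window, the threshold statistics are computed over all windows of one protein, and the frequency is expressed in cycles per residue. The function names are hypothetical.

```python
import cmath
from statistics import mean, pstdev

def window_intensity(x, start, size, freq):
    """Magnitude of sum x(n) * exp(-i 2 pi f n) over one window."""
    return abs(sum(x[n] * cmath.exp(-2j * cmath.pi * freq * n)
                   for n in range(start, start + size)))

def lead_in_windows(x, freq, size, n_sd=2.0):
    """Return (start, end) index pairs of windows above mean + n_sd * SD.

    The window is moved by half its width, as described in the text.
    """
    step = max(1, size // 2)
    starts = list(range(0, len(x) - size + 1, step))
    if not starts:
        return []
    intensities = [window_intensity(x, s, size, freq) for s in starts]
    cutoff = mean(intensities) + n_sd * pstdev(intensities)
    return [(s, s + size) for s, v in zip(starts, intensities) if v > cutoff]
```

Here `x` would be the EIIP series of one protein and `freq` one of the characteristic frequencies from Table 1; the end index of each pair is exclusive.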

Location of lead-in regions

The lead-in regions for two kinases (Swiss-Prot Accession Nos. P08631 and P12931) and three proteases were not calculated, as these proteins were rejected from the population because of their abnormally large size. No lead-in regions were predicted for P31751 at window sizes 6, 8 and 10, as the characteristic frequency intensity was below the chosen threshold. The predicted lead-in residues were mapped onto the 3D-structure of the corresponding proteins (in blue). The catalytic residues (yellow) and the binding residues (red) were also mapped. The lead-in residues were found to lie around the catalytic residues (Fig. 3). In some cases the catalytic residue was itself included in the lead-in region. Thus the catalytic residue, apart from being directly involved in the mechanism of catalysis, can also take part in creating the environment that promotes its own catalytic activity. In many cases it was observed that the lead-in residues, apart from lying in the neighborhood of the catalytic residues, were also scattered in other parts of the protein (Fig. 3a and 3b). On the basis of our hypothesis, these residues, although far from the active site, are also crucial to its environment, but a conclusive answer can be given only when the hypothesis is tested experimentally by mutagenesis to see how crucial these residues actually are.

Some proteins were found to contain a large number (high percentage) of lead-in residues, and others had only a few. On the basis of our hypothesis, proteins containing a large percentage of lead-in residues have a tightly regulated activity, i.e. most of the residues are dedicated to creating the optimum environment and there are fewer ‘stuffer’ residues. In the group of proteases considered in our study, protease P00746 had a large percentage of its residues falling in the lead-in regions. This protease performs selective cleavage of the Arg-|-Lys bond. Conversely, proteins with a low percentage of lead-in residues (many residues being redundant with respect to a biological function f(x)) may be contributing to some other biological function f’(x) that is different from f(x). For example, protease P17538, which preferentially cleaves Tyr-|-Xaa, Trp-|-Xaa, Phe-|-Xaa and Leu-|-Xaa, had a low percentage of its residues as lead-in residues.

Prediction of catalytic residues

Because the lead-in region envelops the catalytic residues, a question arises: can the lead-in residues be employed to predict the catalytic residue(s)? To answer this question, the centroid of the lead-in residues was calculated and the distances AG and AB were defined (A = catalytic residue; G = centroid of the lead-in residues; B = amino acid nearest to G). Small values of AB and AG answer the question in the affirmative, whereas large values answer it in the negative.
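The two distances can be computed as in the sketch below. It is an illustration only and assumes each residue is represented by a single 3D coordinate in Å (for example its Cα atom, which the text does not specify); the optional weights allow a weighted centroid, such as the molecular-mass weighting mentioned in the conclusions. The function names are hypothetical.

```python
import math

def centroid(points, weights=None):
    """Weighted centroid G of a list of (x, y, z) points; unweighted by default."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    return tuple(sum(w * p[i] for w, p in zip(weights, points)) / total
                 for i in range(3))

def ag_ab(catalytic_xyz, lead_in_xyz, all_residue_xyz, weights=None):
    """Return (AG, AB): A = catalytic residue, G = centroid of the lead-in
    residues, B = residue of the protein nearest to G."""
    g = centroid(lead_in_xyz, weights)
    b = min(all_residue_xyz, key=lambda p: math.dist(p, g))
    return math.dist(catalytic_xyz, g), math.dist(catalytic_xyz, b)
```

AB = 0 means that the residue nearest to the centroid is the catalytic residue itself, which is the case counted as a correct prediction in the text.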

Distances AG and AB were calculated for the human kinases and human proteases. Of the 7 human kinases and 9 human proteases that were found fit for calculation of AG and AB (unfit members were those for which sufficient information for the calculation was not available), a correct prediction (AB = 0) was obtained for only one protein from each group (kinase O14965 and protease Q92876). For the other proteins the values lay in the range 10 to 40 Å. So although the lead-in residues lie clustered around the catalytic residues, the cluster is far from symmetric about the catalytic site. This result suggests that the environment at the active site, as well as the substrate or ligand for which it is made, is asymmetric in nature.

Double frequencies in proteases

The presence of two characteristic frequencies in human proteases indicated the presence of two common biological functions in the set of proteases considered. On examining the AB and AG values for the proteases, we find that they are correlated. This indicates that the two characteristic frequencies may correspond to the same function. Since proteolysis is itself of several types, these two frequencies may correspond to two of these cleavages. On observation, the two cleavages were found to be Arg-|-Xaa and Lys-|-Xaa.

5. Outcome and Conclusions

In the current study we tried to employ the RRM to locate lead-in regions in the primary structure of a protein. We developed a preliminary model for this purpose and applied it to two well-characterized classes of enzymes. The results obtained (positions of lead-in residues in the primary structure of the proteins) were used together with the 3D-structure data of the proteins to predict the catalytic residue of the enzyme. The results suggest that the hypothesis on which this research was based is useful, but that a more sophisticated tool is needed to exploit it properly in revealing the information stored in the protein primary structure.

The first part of the work, which was to predict the lead-in regions, produced some questionable results. Firstly, in some proteins residues at the termini were predicted to be lead-in residues, which is debatable. The terminal residues of a protein usually affect its folding pattern only to a limited extent, and their role in creating the environment at the active site should therefore be marginal. Secondly, residues located far from the active site were also predicted to be lead-in residues. To see how far these distant residues can affect the environment prevailing at the active site, experimental results from site-directed mutagenesis are needed.

The second part of the work, prediction of the catalytic residues using knowledge of the lead-in regions and the 3D-structure, was unsuccessful but provided an insight into the environment of the protein. Firstly, its failure showed that the assumption that the environment of the active site is equally affected by all the lead-in residues is flawed. Choosing molecular masses as weights for calculating the position of the catalytic residue was improper. The failure also suggests that the environment at the active site is asymmetric; a plausible explanation is that the enzyme active site, as well as the molecules on which the enzyme acts, is mostly asymmetric in structure. Further work is required to refine the model and to test its efficacy by applying it to other classes of proteins apart from enzymes.

Acknowledgements

Thanks to the Department of Biotechnology for providing computational laboratory facilities. Thanks to the ECE Department, IITG, for providing access to the MATLAB programme.

References

1. D. Anastassiou, “Genomic signal processing,” IEEE Signal Processing Magazine, pp. 8-20, July 2001.

2. S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, and R. Ramaswamy, “Prediction of probable genes by Fourier analysis of genomic sequence,” CABIOS.

3. R. Koradi, M. Billeter, and K. Wüthrich, “MOLMOL: A program for display and analysis of macromolecular structures,” J. of Mol. Graphics, vol. 14, pp. 51-55, 1996.

4. P.P. Vaidyanathan, “Genomics and Proteomics: A Signal Processor’s Tour,” IEEE Circuits and Systems Magazine, 2004.

5. I. Cosic, “Macromolecular Bioactivity: Is It Resonant Interaction Between Macromolecules?-Theory and Applications,” IEEE Transactions on Biomedical Engineering, vol. 41, pp. 1101-1114, Dec. 1994.

6. I. Cosic, “Protein design by methods of digital signal analysis,” Proc. Annu. Conf. IEEE Biomed. Eng., vol. 2. New York: IEEE Publ. Service, 901-905, 1988.

7. I. Cosic, “Resonant recognition model of protein-protein and protein-DNA interaction,” Bioinstrumentation and Biosensors, D. Wise, Ed. New York: Marcel Dekker, Inc., 475-510, 1990.

8. I. Cosic, and D. Nesic, “Prediction of ‘hot spots’ in SV40 enhancer and relation with experimental data,” Eur. J. Biochem., vol. 170, 247-252, 1988.

9. I. Cosic, M. Pavlovic, and V. Vojisavljevic, “Prediction of ‘hot spots’ in interleukin-2 based on informational spectrum characteristics of growth regulating factors,” Biochimie, vol. 71, 333-342, 1989.

10. I. Cosic, and M.T. Hearn, “‘Hot spot’ amino acid distribution in Ha-ras oncogene product p21: Relationship to guanine binding site,” J. Mol. Recognition, vol. 4, 5742, 1991.

11. I. Cosic, and M.T. Hearn, “Studies on protein-DNA interactions using the resonant recognition model. Application to repressors and transforming proteins,” European Journal of Biochem., vol. 205, 613-619, 1992a.

12. I. Cosic, and M.T. Hearn, “Studies on protein-DNA interactions using the resonant recognition model. Application to repressors and transforming proteins,” European Journal of Biochem., vol. 205 (2), 613-619, 1992b.

13. P. Ramachandran, A. Antoniou, and P. Vaidyanathan, “Identification and Location of Hot Spots in Proteins Using the Short-Time Discrete Fourier Transform,” Asilomar Conference on Signals, Systems and Computers, 38th Conf., vol. 2, 1656-1662, 2004.

14. M. Chalfie, “Green Fluorescent Protein,” Photochemistry and Photobiology, vol. 62, 651-656, 1995.

15. M.E. Pirogova, Q. Fang, M. Akay, and I. Cosic, “Investigation of the structural and functional relationships of oncogene proteins,” Proceedings of the IEEE, Vol. 90-92, 1859-1867, 2002.

16. A.V. Oppenheim, and R.W. Schafer, “Discrete Time Signal Processing”, Prentice Hall, Inc., NJ, 1999.

Table 1. Characteristic frequency(s) and corresponding SNR of human protein kinases and human proteases.

Group                    Characteristic frequency    SNR
Human protein kinases    0.098881 ± 0.0018           33.859
Human proteases          0.046053 ± 0.001613         61.728
Human proteases          0.14868 ± 0.001613          27.721

Figure Legends

Figure 1. Consensus spectrum of (A) Human protein kinases; and (B) Human proteases. The characteristic frequencies (SNR > 20) have been labeled.

Figure 2. Effect of window transition rate on the prediction of lead-in regions. Best and worst case performances are recorded when transition rate is half the window size. In both the cases it is assumed that the detection thresholds are set at 6 lead-in residues.

Figure 3. Predicted lead-in residues (blue) mapped onto the 3D-structures of the proteins, together with the catalytic residue (yellow) and binding residues (red). (A) Kinase Q00534 (cell division protein kinase 6). (B) Protease O43464 (serine protease HTRA2, mitochondrial), corresponding to freq = 0.14868 ± 0.001613. The lead-in regions were so densely clustered around the catalytic residue (yellow) that a ball-and-stick model was used to represent them instead of the regular VDW surface, so that the catalytic residue remains visible.
