Structure-based barcoding of proteins

Rahul Metri2, Gaurav Jerath1, Govind Kailas2, Nitin Gacche1, Adityabarna Pal1 & Vibin Ramakrishnan1,2

1Department of Biotechnology, Indian Institute of Technology, Guwahati – 781039. India.

2Institute of Bioinformatics & Applied Biotechnology, Bangalore – 560100, India

ABSTRACT

A reduced representation in the format of a barcode has been developed to provide an overview of the topological nature of a given protein structure from its 3D coordinate file. The molecular structure in a protein coordinate file from the Protein Data Bank (PDB) is first expressed in terms of an alpha-numero code and then converted to a barcode image. The barcode representation can be used to compare and contrast different proteins based on their structure. The utility of this method is exemplified by comparing structural barcodes of proteins that belong to the same fold family, and across different folds. In addition, we illustrate (i) the structural changes often seen in a given protein molecule upon interaction with ligands and (ii) modifications in the overall topology of a given protein during evolution. The program is fully downloadable from the website http://www.iitg.ac.in/probar/.

KEYWORDS

Barcode, protein structure comparison, fold classification

Article. Protein Science (John Wiley & Sons). DOI: 10.1002/pro.2392

This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1002/pro.2392. © 2013 The Protein Society. Received: Aug 06, 2013; Revised: Oct 15, 2013; Accepted: Oct 21, 2013

ABBREVIATIONS

SSE- Secondary Structure Elements

PDB- Protein Data Bank

DSSP- Dictionary of Protein Secondary Structure

TOPS- Topology of Protein Structure

CATH- Class Architecture Topology Homology

DHFR- Dihydrofolate Reductase

CBIR- Content-Based Image Retrieval

INTRODUCTION

The Protein Data Bank (PDB) has grown exponentially over the last three decades.1 As structural genomics initiatives gain momentum, this trend is expected to continue in the coming years, principally because of rapid advances in high-throughput structure determination techniques.2,3 The total number of structures reported in the PDB is approaching the milestone of 100,000 (one lakh). The total number of folds identified so far is 1392 and 1282 as per the SCOP4,5 and CATH6 classifications, respectively, and no additions to these numbers have been reported since 2009. Nevertheless, proteins belonging to the same fold family do exhibit variations at the sequence, structural (to some extent) and functional levels.7,8 Numerous tools are available as open source programs for protein visualization9 and structure prediction.10,11

There have also been attempts to present reduced representations of three-dimensional protein structures in 2D and 1D. TOPS diagrams12 and contact maps13 show protein secondary structure and topology in two dimensions, while DSSP presents the secondary structure information of a protein molecule sequentially from the N terminus to the C terminus as a 1D string.14 We present here a new representation of protein structure in the form of a ‘barcode’. The advantage of this type of representation is that it can encode secondary structure elements as well as their relative orientation in space. Different ‘barcodes’ can be aligned to compare and contrast the structural and topological information of the corresponding structures. Inspiration for this type of representation was drawn from the pioneering contribution of Bernard Silver and Norman Woodland in encoding information as ‘barcodes’ in 1949.15 It took three to four decades to fully operationalize the technology, with barcodes now used for cataloguing articles across a wide variety of applications. In this paper we present the design and utility of this computational tool for cataloguing proteins according to their structure. The program is fully downloadable from the website http://www.iitg.ac.in/probar/; we also provide a webserver that can display barcode images of close to 70,000 protein molecules in the PDB.

VALIDATION OF COMPUTATIONAL METHODS

The crystal structure of the B1 immunoglobulin-binding domain of streptococcal protein G1 (1PGB.pdb) is used as a model structure to illustrate the design of the protein barcode representation. This 56-residue protein, with one alpha helix and one beta sheet consisting of four beta strands, has a well-defined hydrophobic core. The total number of secondary structure elements is five: the first and second strands form an antiparallel beta sheet, followed by a helix. Another antiparallel beta sheet follows the helix, coplanar with the first, with the final beta strand parallel to the first strand. Since all four strands form one continuous sheet, they are all given the same color (blue in this case). SSEs that are not part of the same sheet are colored differently, as illustrated in figures 3 and 4. All successive secondary structures in protein G are antiparallel in their relative orientation and hence are separated by an identical space width of three units. The space width is customizable by modifying the code, and it changes according to the relative topology of successive SSEs. The protein barcode therefore conveys information about the SSEs and their relative topology with the necessary clarity. Furthermore, it is possible to derive the TOPS representation from the barcode with reasonable accuracy, and vice versa (figures 1 and 2).

Structure Comparison using Barcode Identity Index (BII): Analyzing the spatial orientations of proteins is significant for functional and evolutionary studies,16 and this objective may be achieved by comparing barcodes. To indicate the utility of the protein barcode, we examined the barcode images generated from the structure files of all PDB structures of DHFR (dihydrofolate reductase) across different species.1 Though the barcode images look more or less identical, subtle differences can be observed in structures adapted during evolution, from left to right (figure 3). A barcode identity index (BII) has also been formulated to compare structures quantitatively (figure 4), and structural adaptations at specific loci can be identified by carefully comparing two barcode images. BII is calculated from the metadata of a barcode image, which consists of numbers that correspond to the ‘barcode’, by aligning these numbers. In a typical case, a helix is represented as 0, a strand as 1, and the orientation between secondary structures as 3, 4, 5 or 6, based on the space width between two bars in the barcode representation. For example, 1A41.pdb may be represented as 03030413140304030303030. The number that represents one barcode (query) is aligned with another number (subject) using the Needleman-Wunsch algorithm.17 Further details may be found in the supplementary material, and the BII code may be downloaded from the Barcode webpage.
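As an illustration of the numeric string described above, the following Python sketch (not the authors' MATLAB code) interleaves SSE digits with orientation digits. It assumes that the orientation digits 3 to 6 are simply the space widths listed under Computational Methods (P = 6, A = 3, R = 4, L = 5); the function names are hypothetical.

```python
# Sketch of the numeric barcode string: SSE digits (0 = helix, 1 = strand)
# alternate with orientation digits. The orientation digits are assumed to
# equal the space widths P=6, A=3, R=4, L=5 given in the methods text.
SSE_DIGIT = {"H": "0", "S": "1"}
ORIENTATION_DIGIT = {"P": "6", "A": "3", "R": "4", "L": "5"}


def encode_barcode_digits(sse_types, orientations):
    """Interleave SSE digits with the orientation digit between neighbours."""
    if len(orientations) != len(sse_types) - 1:
        raise ValueError("need exactly one orientation between each SSE pair")
    digits = [SSE_DIGIT[sse_types[0]]]
    for orientation, sse in zip(orientations, sse_types[1:]):
        digits.append(ORIENTATION_DIGIT[orientation])
        digits.append(SSE_DIGIT[sse])
    return "".join(digits)


def decode_barcode_digits(digits):
    """Split a digit string back into SSE letters and orientation letters."""
    sse_letter = {v: k for k, v in SSE_DIGIT.items()}
    orientation_letter = {v: k for k, v in ORIENTATION_DIGIT.items()}
    sses = [sse_letter[d] for d in digits[0::2]]
    orientations = [orientation_letter[d] for d in digits[1::2]]
    return sses, orientations


# Protein G (1PGB): strand, strand, helix, strand, strand, all antiparallel.
protein_g = encode_barcode_digits(list("SSHSS"), list("AAAA"))
print(protein_g)

# The 1A41 string quoted in the text splits into SSEs and orientations.
sses, orientations = decode_barcode_digits("03030413140304030303030")
print(len(sses), "SSEs and", len(orientations), "orientations")
```

Read this way, the 23-digit string quoted for 1A41.pdb alternates SSE and orientation digits, which is consistent with the n + (n - 1) length term that appears in the identity score in the supplementary material.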

The protein barcode is presented as a TIFF image. If this representation is widely accepted by the scientific community, it will help in locating proteins in a ‘protein-barcode’ database by making use of content-based image retrieval (CBIR) tools.18,19 CBIR addresses the problem of searching for digital images in large databases by analyzing the content of the image rather than the metadata, descriptions or tags associated with it. The barcode representation foresees this opportunity in subsequent phases of its development, though it is beyond the scope of this manuscript. Furthermore, we tested barcode image comparison to study the possible structural alterations upon ligand binding on the same DHFR structure. The number and type of ligands bound to DHFR are given in Table 1 (Supplementary Material). The disparities between structures are pictorially represented as barcodes, and their relative similarity in overall topology may be quantified by calculating BII. For illustrative purposes, topologically similar structures are grouped together and structurally dissimilar molecules are separated in a VIBGYOR color scheme.

COMPUTATIONAL METHODS

A protein barcode is the representation of secondary structures and their orientations as a barcode image. The colored bars in the barcode image correspond to the secondary structure elements (SSEs), and the white spaces between bars represent the orientation between two successive SSEs. A three-dimensional coordinate file from the PDB is used to generate these barcodes. The DSSP program is used to obtain secondary structure information; the information about strands and the sheet they belong to is also obtained from the DSSP file.14 The orientation between secondary structures is the angle in radians calculated by the atan2 method. The first step in generating a ‘barcode’ is the generation of an alpha-numero code (ANCODE). ANCODE is a combination of a letter, H (for helix) or S (for strand/sheet), followed by a four-digit number divided into two pairs. The first pair represents the overall SSE count, and the second pair represents the count within the secondary structure type that the SSE belongs to (the sheet number for a strand, the helix number for a helix). For example, S0401 in figure 1D signifies that the given strand is the fourth SSE in the overall structure but belongs to the first sheet. Similarly, H0301 in figure 1D signifies that the helix (H) is the third SSE but is the first (01) helix in the overall structure.
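The ANCODE token layout described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' implementation: the function names and the input format (a list of SSE kinds with sheet numbers) are hypothetical, and only the letter plus two two-digit fields are taken from the description.

```python
from collections import namedtuple

# kind: 'H' or 'S'; sse_index: overall SSE count; group_index: helix number
# for a helix, sheet number for a strand.
Ancode = namedtuple("Ancode", "kind sse_index group_index")


def format_ancode(kind, sse_index, group_index):
    """Build a token such as 'S0401' from its three parts."""
    if kind not in ("H", "S"):
        raise ValueError("kind must be 'H' (helix) or 'S' (strand)")
    return f"{kind}{sse_index:02d}{group_index:02d}"


def parse_ancode(token):
    """Split a token such as 'H0301' into kind, overall index and group index."""
    kind, digits = token[0], token[1:5]
    if kind not in ("H", "S") or len(digits) != 4 or not digits.isdigit():
        raise ValueError(f"not an ANCODE token: {token!r}")
    return Ancode(kind, int(digits[:2]), int(digits[2:]))


def ancode_from_sses(sses):
    """sses: list of ('H', None) or ('S', sheet_number) in N-to-C order."""
    tokens, helix_count = [], 0
    for position, (kind, sheet) in enumerate(sses, start=1):
        if kind == "H":
            helix_count += 1
            tokens.append(format_ancode("H", position, helix_count))
        else:
            tokens.append(format_ancode("S", position, sheet))
    return tokens


# Protein G: two strands, a helix, two strands, all strands in sheet 1.
protein_g = [("S", 1), ("S", 1), ("H", None), ("S", 1), ("S", 1)]
tokens = ancode_from_sses(protein_g)
print(tokens)
```

For protein G this reproduces the two tokens quoted from figure 1D, S0401 and H0301, at the fourth and third positions respectively.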

The orientation of each SSE relative to the previous and the succeeding one is assigned based on a tableaux representation (figure 1C). If two secondary structures point within a 90° sector of each other, they are considered parallel (P), and if their relative angle lies between -135° and +135° measured through 180°, they are considered antiparallel (A). The relative orientations in between are designated as L and R in either direction, as shown in figure 1C.
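A minimal Python sketch of this classification is given below. Several details are assumptions rather than statements from the text: the sector limits are taken as ±45° for parallel and beyond ±135° for antiparallel, positive in-between angles are labelled R and negative ones L, and the angle is computed with atan2 from two 2D direction vectors purely for illustration (the text does not say which vectors enter the atan2 call).

```python
import math


def signed_angle(u, v):
    """Signed angle in radians from 2D vector u to 2D vector v, via atan2."""
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return math.atan2(cross, dot)


def classify_orientation(angle_rad):
    """Map a signed angle to P, A, R or L (sector limits are assumptions)."""
    degrees = math.degrees(angle_rad)
    if abs(degrees) <= 45.0:
        return "P"  # 90-degree sector centred on 0 degrees
    if abs(degrees) >= 135.0:
        return "A"  # sector from +135 to -135 degrees through 180
    return "R" if degrees > 0 else "L"  # sign convention is assumed


# Illustrative direction vectors for pairs of SSE axes.
print(classify_orientation(signed_angle((1, 0), (1, 0))))   # same direction
print(classify_orientation(signed_angle((1, 0), (-1, 0))))  # opposite direction
print(classify_orientation(signed_angle((1, 0), (0, 1))))   # crossed
```

If figure 1C uses the opposite sign convention for L and R, only the last line of `classify_orientation` would need to change.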

BARCODE is derived from the ANCODE generated from the PDB file. H is always colored black, and S is colored according to the corresponding sheet id, with each sheet id assigned a unique color. For example, figure 2A has seven strands, with four strands forming one sheet (green) and the remaining three forming a second sheet (blue). Orientations of successive SSEs are represented by the ‘width’ of the white space between the bars in the barcode image. The orientations and their space (pixel) widths are as follows: P = 6 units, A = 3 units, R = 4 units and L = 5 units. Orientations with respect to neighboring SSEs are denoted in ANCODE in the sixth and seventh positions, after a colon. The first letter shows the orientation with respect to the previous SSE and the second letter that with respect to the succeeding one. If the previous or the succeeding SSE is missing (as in the case of the N terminus and C terminus), it is denoted as ‘O’ (figures 1C and 1D). Thus, secondary structures and topology are encoded in the ANCODE string and further translated into a barcode image in TIFF format in MATLAB.
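The translation from ANCODE to a barcode can be sketched as a single row of pixel colors. The following Python sketch is not the authors' MATLAB routine and does not write a TIFF file. The token layout 'S0401:AA', the bar width and the color palette are assumptions made for illustration; only the black helix bars, per-sheet strand colors, the ‘O’ marker and the space widths come from the text.

```python
HELIX_COLOR = "black"
SHEET_COLORS = ["blue", "green", "red", "orange"]  # assumed palette
SPACE_WIDTH = {"P": 6, "A": 3, "R": 4, "L": 5}     # widths from the text
BAR_WIDTH = 4                                      # assumed; not given in the text


def bar_color(code):
    """Black for helices; one color per sheet id for strands."""
    if code[0] == "H":
        return HELIX_COLOR
    sheet_id = int(code[3:5])
    # The palette would need extending to keep colors unique beyond four sheets.
    return SHEET_COLORS[(sheet_id - 1) % len(SHEET_COLORS)]


def render_row(tokens):
    """Return one row of pixel colors for tokens such as 'S0401:AA'."""
    row = []
    for token in tokens:
        code, _, orient = token.partition(":")
        row.extend([bar_color(code)] * BAR_WIDTH)
        following = orient[1] if len(orient) == 2 else "O"
        if following != "O":  # 'O' marks a missing neighbor (chain terminus)
            row.extend(["white"] * SPACE_WIDTH[following])
    return row


# Protein G: every pair of successive SSEs is antiparallel (A).
protein_g = ["S0101:OA", "S0201:AA", "H0301:AA", "S0401:AA", "S0501:AO"]
row = render_row(protein_g)
print(len(row), row[:8])
```

Repeating the row vertically gives a two-dimensional image array, which any imaging library could then save in TIFF format.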

ELECTRONIC SUPPLEMENTARY MATERIAL

The supplementary information contains details of structure comparison and scoring between protein structures, and details of 18 different sets of ligands bound to the dihydrofolate reductase molecule. All structures were obtained from the PDB.

CONCLUSION

In this methodology article, we have attempted to present a new reduced representation of protein structures that allows two structures to be compared and contrasted based on their secondary structure and topology. Apart from the structural and topological information conveyed, the overall comparison can also be quantified by way of a Barcode Identity Index (BII). The two experiments described above are indicative of the utility of the tool. Addressing a specific scientific problem and comparison with other tools are not within the scope of this paper, yet the value of the method for qualitative and quantitative comparison of protein structures should not be discounted. The program is fully downloadable from the webpage http://www.iitg.ac.in/probar/.

ACKNOWLEDGEMENTS

This work is supported by the Department of Biotechnology, Govt. of India under the Innovative Young Biotechnologist Award (IYBA) Scheme. The authors acknowledge the contributions of Prof. P K Bora of Electrical Engineering at IIT Guwahati for useful suggestions, and of Rakesh Kumar of Biotechnology, IIT Guwahati, in the final formulation of this manuscript and the creation of the webpage. Govind Kailas was supported by the DIT-CoE scheme under the Department of Information Technology, Government of India.

REFERENCES

1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN,

Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235-242.

2. Pieper U, Schlessinger A, Kloppmann E, Chang GA, Chou JJ, Dumont ME, Fox BG,

Fromme P, Hendrickson WA, Malkowski MG, Rees DC, Stokes DL, Stowell MHB,

Wiener MC, Rost B, Stroud RM, Stevens RC, Sali A (2013) Coordinating the impact of

structural genomics on the human α-helical transmembrane proteome. Nat Struct

Mol Biol 20:135-138.

3. Berman HM, Bhat TN, Bourne PE, Feng Z, Gilliland G, Weissig H, Westbrook J (2000)

The Protein Data Bank and the challenge of structural genomics. Nat Struct Mol Biol

7:957-959.

4. Day R, Beck DAC, Armen RS, Daggett V (2003) A consensus view of fold space:

Combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci 12:2150-2160.

5. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJP, Chothia C, Murzin

AG (2008) Data growth and its impact on the SCOP database: new developments.

Nucleic Acids Res 36:D419-D425.

6. Sillitoe I, Cuff AL, Dessailly BH, Dawson NL, Furnham N, Lee D, Lees JG, Lewis TE,

Studer RA, Rentzsch R, Yeats C, Thornton JM, Orengo CA (2013) New functional

families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D

structures. Nucleic Acids Res 41:D490-D498.

7. Krissinel E (2007) On the relationship between sequence and structure similarities in

proteomics. Bioinformatics 23:717-723.

8. Eidhammer I, Jonassen I, Taylor WR (2000) Structure comparison and structure patterns.

J Comp Biol 7:685-716.

9. Humphrey W, Dalke A, Schulten K (1996) VMD: Visual molecular dynamics. J Mol

Graph 14:33-38.

10. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science

294:93-96.

11. Zhang Y (2009) Protein structure prediction: when is it useful? Curr Opin Struct Biol

19:145-155.

12. Michalopoulos I, Torrance GM, Gilbert DR, Westhead DR (2004) TOPS: an enhanced

database of protein structural topology. Nucleic Acids Res 32:D251-D254.

13. Yuan X, Bystroff C (2007) Protein contact map prediction. In: Xu Y, Xu D, Liang J, Eds.

Computational Methods for Protein Structure Prediction and Modeling. Springer, New

York, pp 255-277.

14. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern

recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577-2637.

15. Woodland NJ, Silver B (1952) Classifying apparatus and method. US Patent no.

2612994.

16. Shi S, Chitturi B, Grishin NV (2009) ProSMoS server: a pattern-based search using

interaction matrix representation of protein structures. Nucleic Acids Res 37:W526-W531.

17. Needleman SB, Wunsch CD (1970) A general method applicable to the search for

similarities in the amino acid sequence of two proteins. J Mol Biol 48:443-453.

18. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: State of the art and challenges. ACM Trans Multimedia Comp Commun Appl 2:1-19.

19. Datta R, Joshi D, Li J, Wang JZ (2008) Image retrieval: Ideas, influences, and trends of the new age. ACM Comput Surv 40:1-60.

FIGURE LEGENDS

Figure 1.

Generation of a ‘protein barcode’ from the 3D representation of protein G (1PGB.pdb) (A). TOPS diagram of protein G showing secondary structures and their relative orientation (B). Orientations of SSEs with respect to the previous and succeeding ones are assigned based on a tableaux representation, with the space width given in parentheses (C). ANCODE generated for protein G as explained in the validation section (D) and its corresponding barcode format (E).

Figure 2.

Barcode images of representative protein structures corresponding to the all beta, all alpha and alpha/beta folds in the SCOP database. The respective TOPS diagrams and ‘barcodes’ illustrate the utility of the ‘barcode’ representation in encoding the structure and topology of any given protein structure.

Figure 3.

Barcodes corresponding to the dihydrofolate reductase enzyme in different species. Only those species with structures available in the PDB are shown in this figure. The differences between barcodes can be attributed to differences in the secondary structures that were altered during the course of evolution. However, there is a common string of bars in the barcodes, depicting the structural conservation of DHFR in the bacterial species. Similarly, the barcodes for the vertebrates and fungi are broadly similar within their respective sets.

Figure 4.

Differences in protein structures illustrated using ‘barcode’ representation when the same DHFR

molecule is bound with different ligands. All structures are obtained from PDB.3

Supplementary Information

Barcode Identity Index (BII):

BII is calculated from the metadata of a barcode image, which consists of numbers that correspond to the ‘barcode’, by aligning these numbers. In a typical case, a helix is represented as 0, a strand as 1, and the orientation between secondary structures as 3, 4, 5 or 6, based on the space width between two bars in the barcode representation. For example, 1A41.pdb (Fig. S1) may be represented as 03030413140304030303030. The number that represents one barcode (query) is aligned with another number (subject) using the Needleman-Wunsch algorithm.

Example: 1A41.pdb - 1A40.pdb

This alignment is scored as follows:

Secondary structure element and its orientation aligned: 2

Secondary structure element aligned, its orientation not aligned: 1

Only orientation aligned: 0

All of these are added to get align_score.

Alignment score and coverage are calculated as:

Q - Query

S - Subject

% Identity Score: Score = (align_score / ((lenQ+(lenQ-1)+lenS+(lenS-1))/2))*100;

% Coverage: Cov=(min(length(S),length(Q))/max(length(S),length(Q)))*100;
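The scoring scheme and the two formulas above can be put together in a short Python sketch. It is not the downloadable BII code. The assumptions are: each SSE digit is paired with the orientation digit that follows it, the alignment is a Needleman-Wunsch global alignment over these pairs using the 2/1/0 scores as the substitution score, the gap penalty is zero (the text does not state one), lenQ and lenS are the numbers of SSEs, and coverage is computed on the lengths of the digit strings.

```python
def to_units(digits):
    """Pair each SSE digit with the orientation digit that follows it."""
    sses, orients = digits[0::2], digits[1::2]
    return [(s, orients[i] if i < len(orients) else None)
            for i, s in enumerate(sses)]


def unit_score(a, b):
    """2: SSE and orientation match; 1: SSE only; 0: otherwise."""
    if a[0] != b[0]:
        return 0
    return 2 if a[1] is not None and a[1] == b[1] else 1


def align_score(query, subject, gap=0):
    """Needleman-Wunsch global alignment score over SSE units (gap assumed 0)."""
    q, s = to_units(query), to_units(subject)
    rows, cols = len(q) + 1, len(s) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + gap
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + unit_score(q[i - 1], s[j - 1]),
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
    return table[-1][-1], len(q), len(s)


def barcode_identity(query, subject):
    """Return (identity %, coverage %) following the two formulas above."""
    score, len_q, len_s = align_score(query, subject)
    identity = score / ((len_q + (len_q - 1) + len_s + (len_s - 1)) / 2) * 100
    coverage = min(len(subject), len(query)) / max(len(subject), len(query)) * 100
    return identity, coverage


code_1a41 = "03030413140304030303030"
print(barcode_identity(code_1a41, code_1a41))
```

Under these assumptions a barcode compared with itself scores 2 for every SSE except the last (which has no following orientation and scores 1), so the total equals the denominator and the identity is 100%.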

Fig S1: 3D images of 1A41.pdb and 1A40.pdb.

Table 1: The following table contains the information listing different ligands bound to the

Dihydrofolate reductase molecule (data obtained from PDB).

PDB ID | Ligand 1 | Ligand 2 | Ligand 3 | Ligand 4
1DHF | FOLIC ACID | | |
2DHF | 5-DEAZAFOLIC ACID | | |
1OHJ | N-(4-CARBOXY-4-{4-[(2,4-DIAMINO-PTERIDIN-6-YLMETHYL)-AMINO]-BENZOYLAMINO}-BUTYL)-PHTHALAMIC ACID | NADPH DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE | |
1HFP | N-[4-[(2,4-DIAMINOFURO[2,3D]PYRIMIDIN-5-YL)METHYL]METHYLAMINO]-BENZOYL]-L-GLUTAMATE | NADP NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE | |
1HFQ | N-[4-[(2,4-DIAMINOFURO[2,3D]PYRIMIDIN-5-YL)METHYL]METHYLAMINO]-BENZOYL]-L-GLUTAMATE | NADP NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE | |
1DLS | METHOTREXATE | NADPH DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE | |
1U72 | METHOTREXATE | NADPH DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE | |
1PD9 | 2,4-DIAMINO-5-METHYL-6-[(3,4,5-TRIMETHOXY-N-METHYLANILINO)METHYL]PYRIDO[2,3-D]PYRIMIDINE | SULFATE ION | |
1PDB | | | |
1KMV | DIMETHYL SULFOXIDE | (Z)-6-(2-[2,5-DIMETHOXYPHENYL]ETHEN-1-YL)-2,4-DIAMINO-5-METHYLPYRIDO[2,3-D]PYRIMIDINE | NADPH DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE | SULFATE ION
1S3V | SULFATE ION | (2R,6S)-6-{[methyl(3,4,5-trimethoxyphenyl)amino]methyl}-1,2,5,6,7,8-hexahydroquinazoline-2,4-diamine | |
1S3W | NADP NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE | 6-(OCTAHYDRO-1H-INDOL-1-YLMETHYL)DECAHYDROQUINAZOLINE-2,4-DIAMINE | |
1S3U | SULFATE ION | (2R,6S)-6-{[methyl(3,4,5-trimethoxyphenyl)amino]methyl}-1,2,5,6,7,8-hexahydroquinazoline-2,4-diamine | |
1PD8 | 2,4-DIAMINO-5-METHYL-6-[(3,4,5-TRIMETHOXY-N-METHYLANILINO)METHYL]PYRIDO[2,3-D]PYRIMIDINE | NADPH DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE | |
2C2T | (S)-2,4-DIAMINO-5-((7,8-DICARBAUNDECABORAN-7-YL)METHYL)-6-METHYLPYRIMIDINE | (R)-2,4-DIAMINO-5-((7,8-DICARBAUNDECABORAN-7-YL)METHYL)-6-METHYLPYRIMIDINE | GLYCEROL | NADPH DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE
2C2S | 2,4-DIAMINO-5-(1-O-CARBORANYLMETHYL)-6-METHYLPYRIMIDINE | GLYCEROL | NADPH DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE |
1MVS | 2,4-DIAMINO-6-[N-(3',4',5'-TRIMETHOXYBENZYL)-N-METHYLAMINO]PYRIDO[2,3-D]PYRIMIDINE | SULFATE ION | |
1BOZ | NADPH DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE | N6-(2,5-DIMETHOXY-BENZYL)-N6-METHYL-PYRIDO[2,3-D]PYRIMIDINE-2,4,6-TRIAMINE | |

Barcode image of 1M65_A

Figure S2: Modified image to exemplify the possibility of incorporating the barcode as an additional structure representation in large databases and molecular repositories.
