
Biochimica et Biophysica Acta 1814 (2011) 664–670

Contents lists available at ScienceDirect

Biochimica et Biophysica Acta

journal homepage: www.elsevier.com/locate/bbapap

TMBHMM: A frequency profile based HMM for predicting the topology of transmembrane beta barrel proteins and the exposure status of transmembrane residues

Nitesh Kumar Singh a, Aaron Goodman b, Peter Walter a, Volkhard Helms a, Sikander Hayat a,⁎

a Center for Bioinformatics, Saarland University, Saarbrücken, Germany
b Penn Genome Frontiers Institute, University of Pennsylvania, Philadelphia, PA, USA

⁎ Corresponding author at: Lehrstuhl für Computational Biology, Universität des Saarlandes, Gebäude E2.1, Postfach 15 11 50, 66041 Saarbrücken, Germany. Tel.: +49 68130270700; fax: +49 68130270702.

E-mail addresses: [email protected] (N.K. Singh), [email protected] (A. Goodman), peter@bioinformatik.uni-saarland.de (P. Walter), [email protected] (V. Helms), [email protected] (S. Hayat).

1570-9639/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.bbapap.2011.03.004

Article info

Article history:
Received 19 November 2010
Received in revised form 16 February 2011
Accepted 7 March 2011
Available online 21 March 2011

Keywords:
Transmembrane beta barrel proteins
Hidden Markov model
Membrane protein topology
Structure prediction web server
Lipid exposure

Abstract

Transmembrane beta barrel (TMB) proteins are found in the outer membranes of bacteria, mitochondria and chloroplasts. TMBs are involved in a variety of functions such as mediating flux of metabolites and active transport of siderophores, enzymes and structural proteins, and in the translocation across or insertion into membranes. We present here TMBHMM, a computational method based on a hidden Markov model for predicting the structural topology of putative TMBs from sequence. In addition to predicting transmembrane strands, TMBHMM also predicts the exposure status (i.e., exposed to the membrane or hidden in the protein structure) of the residues in the transmembrane region, which is a novel feature of the TMBHMM method. Furthermore, TMBHMM can also predict the membrane residues that are not part of beta barrel forming strands. The training of the TMBHMM was performed on a non-redundant data set of 19 TMBs. The self-consistency test yielded Q2 accuracy of 0.87, Q3 accuracy of 0.83, Matthews correlation coefficient of 0.74 and SOV for beta strand of 0.95. In this self-consistency test the method predicted 83% of transmembrane residues with correct exposure status. On an unseen, non-redundant test data set of 10 proteins, the 2-state and 3-state TMBHMM prediction accuracies are around 73% and 72%, respectively, and are comparable to other methods from the literature. The TMBHMM web server takes an amino acid sequence or a multiple sequence alignment as an input and predicts the exposure status and the structural topology as output. The TMBHMM web server is available under the tmbhmm tab at: http://service.bioinformatik.uni-saarland.de/tmx-site/.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

TMBs are located in the outer membranes of gram-negative bacteria, chloroplasts and mitochondria [1]. Their membrane spanning regions are formed by anti-parallel beta strands, creating a channel in the form of a barrel that spans the outer membrane [2]. The first members of this class for which atomic resolution 3D structures could be determined were bacterial trimeric porins that form water-filled channels mediating passive transport of ions and small molecules through the outer membrane [2]. TMBs perform a wide variety of functions such as active ion transport, passive nutrient uptake, membrane anchoring, adhesion and catalytic activity [1–5]. They also act as structural proteins, and as mediators in the protein translocation across or insertion into bacterial [6] and mitochondrial outer membranes [7–10]. TMBs also play a role in bacterial virulence and are potential targets for the development of antimicrobial drugs and vaccines [4,11–15].

In recent years, the number of 3D structures of membrane proteins at atomic resolution has increased rapidly due to improvements in cloning and crystallization techniques [16]. Although this has mostly boosted the development of computational methods for helical membrane proteins (HMP), it has led to an increase in the number of prediction methods in the realm of TMBs as well. All known single chain TMBs can be described by a simple grammar [1,17] consisting of an N-terminal signal sequence, M repeats of (upward strand, extra-cellular loop, downward strand, periplasmic hairpin), and possibly a C-terminal region [18]. Based on the experimentally known TMB 3D structures, the number of beta strands ranges from 8 to 24. Since the TMBs are known to be slightly less hydrophobic than the HMPs, the task of identifying the membrane spanning regions is more difficult in TMBs than in HMPs [19]. This difficulty is compounded by the lack of a clear pattern in their membrane spanning strands, such as the stretch of 15–30 consecutive hydrophobic residues or the positive-inside rule [20] which is followed by the HMPs. Furthermore, discriminating between transmembrane strands and beta barrel structures of water soluble proteins is hard as well because both share some common features, such as amphipathicity [21].

The existing sequence-based computational methods in the realm of TMBs can be classified into two main categories. Computational methods of the first sort aim at identifying the TMBs in a given genome [21–24]. The computational methods in the other category focus on determining the structural topology of the identified putative TMB sequence [25]. In contrast to the extensively studied class of globular proteins following the pioneering work of Rost and Sander [26], the problem of predicting exposed/buried residues in TMB proteins was only recently addressed when Yuan et al. presented the first computational method for predicting the solvent accessibility of TMBs [45]. We believe that, in addition to predicting membrane spanning regions and structural topology, prediction of the exposure status of transmembrane residues, i.e., whether they are exposed to the lipid bilayer or buried in the protein structure, is of immediate interest due to its implied applications in channel engineering and site-specific mutational studies [27]. In a recent study, Hayat et al. thus presented the BTMX method to predict the burial status of TMB residues in the transmembrane region [28]. The BTMX method relies on the PROFtmb method to obtain the putative TM regions for a given amino acid sequence and makes a significant step further by coupling topology prediction and exposure status prediction.

In this work, we have developed a comprehensive computational method (TMBHMM) based on a hidden Markov model (HMM) to predict the structural topology of TMBs by employing only the frequency profile of the amino acids in a given sequence. The method also predicts the exposure status (i.e., whether a residue is exposed to lipid bilayer or buried inside the protein structure) of the transmembrane residues. The prediction accuracy of TMBHMM has been compared with the PRED-TMBB method and it is shown that TMBHMM is comparable to the PRED-TMBB method in terms of strand prediction accuracy. We have also established the TMBHMM web server that accepts amino acid sequence or multiple sequence alignment as input and predicts the structural topology of the given amino acid sequence annotated with the exposure status. To the best of our knowledge, TMBHMM is the first HMM based topology prediction method for TMBs that takes TM-Other residues into account and also provides the exposure status classification for residues in the TM region. Given that several computational methods exist for the genome wide identification of TMBs, TMBHMM can be used as a complementary tool to annotate those putative TMBs.

2. Materials and methods

2.1. The hidden Markov model

A Markov model is a probabilistic process over a finite set, S1, …, Sk, usually called its states. At each time step the model visits one state from this set of states. In a hidden Markov model there is a second set of observation symbols, and at each state transition the model emits a character from this time-independent alphabet of distinct observation symbols. A standard HMM was employed in this study. The initial parameters for the HMM were generated from the training data set using the Baum–Welch algorithm [29,30]. For decoding, the Viterbi algorithm [31] was employed. The “jhmm” java module, which is freely available at http://code.google.com/p/jhmm/, was used to develop the hidden Markov model for the TMBHMM method.
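To illustrate the decoding step, the following is a minimal Python sketch of Viterbi decoding for a discrete-emission HMM. It is not the jhmm code and not the TMBHMM model: TMBHMM emits 20-value frequency profiles, whereas this sketch assumes single-symbol emissions, and the function names (`viterbi`, `to_log`) and the argument layout are hypothetical choices made for illustration.

```python
import math


def to_log(table):
    """Convert a nested dict of probabilities into log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for a sequence of observed symbols."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t][s]: predecessor of state s on the best path at step t + 1
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Working in log space avoids numerical underflow on long sequences; all probabilities passed to `to_log` are assumed to be strictly positive.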

2.2. Training and test data sets

Initially a non-redundant set of TMB 3D structures was obtained from the OPM database such that the sequence identity was <30%. From this set, some sequences were further removed based on the criteria previously described by Park et al. [32]. Briefly, for a given query sequence, a maximum of 1000 homologous sequences were retrieved from the non-redundant database using BLAST. Initial MSAs were built using ClustalW. Sequences that are more than 80% identical to the query sequence were removed. The remaining sequences were realigned using ClustalW to yield a final MSA, which was used to obtain profiles. The training data set thus obtained consisted of 19 non-redundant TMBs for which we retrieved the structures from the OPM database. Unlike the PDB database, the OPM database provides membrane boundaries which were useful in classifying the amino acid residues [33]. The residues within these boundaries were classified as membrane residues and residues outside were classified as non-membrane residues. The test data set for comparing our method (TMBHMM) with the PRED-TMBB methods was taken from Bagos et al. [19]. It includes a set of 20 non-redundant proteins. Another completely unseen, non-redundant data set of 10 protein structures obtained from the OPM database was used to compare the prediction accuracy of the final TMBHMM model and the PRED-TMBB and PROFtmb methods.

2.3. Computation of rSASA

Residues in the membrane spanning region can be classified as either buried in the barrel interior or as exposed to the lipid membrane. The residues were classified as buried or exposed based on their relative solvent-accessible surface area (rSASA) value as described by Park and Helms [34]. The probe radius for calculating the rSASA value was 2.2 Å [32]. The two faces of the transmembrane region were capped to prevent the probe from entering the core and, hence, misclassifying buried residues as exposed. The VOLBL program suite [35,36] was employed for actual computation of SASA values.

2.4. Computation of frequency profile

The frequency profile is a vector of size 20 which holds the frequency of 20 amino acids in the MSA at the position of a particular residue. As described [34], the frequency profile was calculated by performing a BLAST search using the blastp program available from the NCBI website. The BLAST search was performed against a non-redundant protein sequence database with default parameters. A total of up to 1000 homologous sequences obtained after BLAST were taken into consideration while performing the MSA. The lower limit of homologous sequences for generating MSA was set at 20. Then we generated a multiple sequence alignment (MSA) using ClustalW for each training sequence and their homologues as described in [34]. In the final step, the program AL2CO [37] was used to generate a frequency profile from the MSA using a modified method of Henikoff and Henikoff [38].
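As a rough illustration of what a frequency profile contains, the sketch below computes unweighted amino acid frequencies per alignment column. This is a simplification and not the AL2CO procedure: the paper uses a modified Henikoff and Henikoff weighting, while this sketch assumes every sequence counts equally and ignores gap characters. The function names and the list-of-strings MSA representation are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(column):
    """Frequencies of the 20 amino acids in one MSA column, ignoring gaps."""
    counts = Counter(c for c in column.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Column contains only gaps or unknown characters
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def msa_profiles(msa):
    """One 20-value profile per column; msa is a list of equal-length aligned rows."""
    return [column_profile("".join(row[i] for row in msa)) for i in range(len(msa[0]))]
```

Each returned profile sums to 1 whenever the column contains at least one standard amino acid.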

3. Results and discussion

3.1. The HMM architecture and training stage

The HMM architecture employed in TMBHMM is shown at different levels of detail in Fig. 1. Fig. 1a shows the general overview of the HMM. Each box in Fig. 1a corresponds either to the membrane, the periplasmic or the cytoplasmic regions of a TMB, respectively. The “TM-Other” region consists of residues that are in the transmembrane region but do not belong to any beta strand. Each region represented by a box in Fig. 1a corresponds to a submodel depicting an array of states which share similar transition probabilities. The structure of the submodel is shown in Fig. 1b. Each region in Fig. 1a can be expanded into its constituent array of states as shown in Fig. 1b except the membrane region which lacks the self-loop state. The overall architecture of the HMM model is shown in Fig. 1c.

Fig. 1. HMM architecture. a. Simple architecture of HMM. Each box corresponds to a sub-model depicting an array of states. b. Structure of arrays of state. c. Complete architecture of HMM after expanding the sub-models.


The HMM architecture is similar to the previously established methods [18,21,23] except for the membrane region. Existing HMM based methods typically have states for the extracellular and the periplasmic loop and the membrane regions. In the case of TMBHMM, the membrane region has been further divided into two states to capture the topological signal for exposed and buried residues. Thus, unlike upward and downward strands in the membrane region [18], TMBHMM has five different arrays of states. These five arrays of states include exposed-membrane to extracellular, buried-membrane to extracellular, exposed-membrane to periplasm, buried-membrane to periplasm and TM-Other. As an example, the ‘exposed-membrane to extracellular’ state consists of residues in the membrane region that are exposed to the bilayer and the strand on which they are located traverses towards the extracellular region. Corresponding definitions apply to the other four arrays of states.

In preparation of the training set, each amino acid in a given training sequence was labelled with one of the states of the HMM by using information from the respective PDB file as deposited in the OPM database. The HMM states were defined based on the z-coordinate, calculated rSASA value and membrane boundaries obtained from the OPM database. Each amino acid position in the given training sequence was first labelled as Extracellular, Periplasmic and Membrane region based on the z-coordinate of its C-alpha atom. Then the residues in the membrane region were labelled as transmembrane based on the TM segment information obtained from the OPM database. The remaining residues in the membrane region were labelled as TM-Other. It should be noted that each state/submodel is modelled as an array. Thus each labelled residue is assigned an array index in its designated state. This is done by sequentially labelling the amino acid chain from the first amino acid position such that entry into a state array was from index 0 and exit from the state array was through the last index. Further, residues belonging to TM segments were labelled as “exposed” or “buried” based on the calculated rSASA values. The transmembrane residues were also classified as “toExtracellular” or “toPeriplasmic” to reflect whether the chain was coming from the periplasmic loop or the extracellular loop region. Here, the minimum size of the TM array was set to 3 residues to prevent the HMM from jumping across the membrane with a single amino acid residue. This restriction is valid considering that (a) phospholipid bilayers have a thickness of 3.5–4 nm, and (b) the shortest strand in all TMB barrel structures obtained from the OPM database (2omf, strand number 12 and 13) has a length of 4 residues. The directional arrows in Fig. 1 show the possible transitions from each state. Thus, the minimum and the maximum allowed lengths for the transmembrane state arrays are 3 and 15, respectively. The minimum length of the loop and the TM-Other state arrays is 1 and there is no maximum due to the presence of the self-loop state.
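The labelling procedure can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, argument names and label strings are hypothetical, the positive z direction is assumed to point towards the extracellular side, and the membrane is assumed to be symmetric around z = 0. The default rSASA threshold of 0.03 is the value selected later in the paper.

```python
def label_residue(z, half_thickness, in_tm_segment, rsasa,
                  towards_extracellular, rsasa_threshold=0.03):
    """Assign a state label to one residue from its C-alpha z-coordinate and rSASA."""
    # Outside the membrane slab: loop regions (assumes +z is the extracellular side)
    if z > half_thickness:
        return "extracellular"
    if z < -half_thickness:
        return "periplasmic"
    # Inside the slab but not part of an annotated TM strand
    if not in_tm_segment:
        return "TM-Other"
    # Strand residues carry both an exposure status and a strand direction
    exposure = "exposed" if rsasa > rsasa_threshold else "buried"
    direction = "toExtracellular" if towards_extracellular else "toPeriplasmic"
    return f"{exposure}-{direction}"
```

For example, a strand residue at z = 2.0 Å with rSASA 0.10 on a strand heading outwards would be labelled `exposed-toExtracellular` when the half-thickness is set to 11.78 Å, the training-set average reported in the paper.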

Table 1
Accuracy at different rSASA thresholds.

rSASA threshold   Q2     Q3     MCC    Core   Exposure   Exposed/Buried
0.01              0.87   0.83   0.74   0.91   0.80       1.12
0.02              0.86   0.83   0.73   0.90   0.81       1.10
0.03              0.87   0.83   0.74   0.91   0.83       1.05
0.04              0.86   0.82   0.73   0.90   0.83       1.01
0.05              0.86   0.83   0.73   0.90   0.82       0.95

Accuracy measures at different rSASA thresholds. Exposed/Buried is the ratio of observed exposed and buried residues at each rSASA threshold.

Table 3
TMBHMM prediction accuracy on the training data set.

Accuracy of training data set

                   Q2     Q3     MCC    Core   SOV    Exposure
Self-consistency   0.87   0.83   0.74   0.91   0.95   0.83
Jack-knife         0.86   0.83   0.72   0.88   0.92   0.83

Different accuracy measures of training data set for self-consistency test and jack-knife test.

For each state, we sequentially added frequency profiles indicating the relative occurrence of the 20 natural amino acids for all sequence positions that were attributed to that state. This gives a two dimensional array. For example, if 10 sequence positions belong to state X, the matrix has a size of 10 × 20 (20 is the size of the frequency profile). For each of the 20 amino acids, the mean value over all sequence positions in the matrix was calculated, and these 20 mean values were taken as the representative profile for that state. Thus the observation symbol for a state is an array of 20 floating point values. As the original jhmm module can only handle a single value for each amino acid in a given state, the source code was modified to handle frequency profiles consisting of 20 values. Each state can be considered as a submodel in our HMM as shown in the HMM architecture. Thus we have 7 submodels/states, each being an array. Training was done only once for the entire HMM model and the individual states were not trained separately. A restriction on the transition from one state to another state was made such that the entry to a state is always made from index 0 and the transition from a state must take place from the last index.
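The per-state averaging of frequency profiles can be sketched as follows. This is a minimal illustration rather than the modified jhmm code: the function names are hypothetical, and the input is assumed to be one state label and one frequency profile per sequence position.

```python
from collections import defaultdict


def mean_profile(profiles):
    """Average a list of equal-length profiles column by column."""
    n = len(profiles)
    return [sum(p[i] for p in profiles) / n for i in range(len(profiles[0]))]


def profiles_by_state(labels, profiles):
    """Group per-position profiles by state label and average each group."""
    grouped = defaultdict(list)
    for label, profile in zip(labels, profiles):
        grouped[label].append(profile)
    # An N x 20 matrix per state is reduced to a single 20-value representative profile
    return {state: mean_profile(rows) for state, rows in grouped.items()}
```

With 10 positions assigned to a state, `mean_profile` reduces the 10 × 20 matrix described in the text to one vector of 20 values.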

3.2. Prediction of the structural topology of TMBs

Several accuracy measures have been employed for evaluating the performance of TMBHMM. Q2 is defined as the 2-state accuracy measure where the amino acid residues are classified as either belonging to a strand or a non-strand region. Similarly, Q3 is the prediction accuracy when the amino acid residues are classified into three regions, namely, periplasmic, extracellular or the membrane region. We have also used Matthews correlation coefficient (MCC) [39] and Segment overlap measure (SOV) for beta strands as defined by Zemla et al. [40] as accuracy measures. Core accuracy is defined as the percentage of residues that are correctly predicted to be in the membrane region by TMBHMM. The exposure accuracy gives the percentage of correctly predicted membrane residues with correct exposure status.
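For concreteness, Q2 and MCC can be computed from per-residue labels as in the sketch below. The function name and the use of the character `S` for strand residues are assumptions for illustration; the MCC expression is the standard formula based on true/false positives and negatives, and returning 0.0 for a zero denominator is a convention chosen here, not something stated in the paper.

```python
import math


def q2_and_mcc(observed, predicted, positive="S"):
    """Two-state accuracy (Q2) and Matthews correlation coefficient for strand labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(o == positive and p == positive for o, p in pairs)
    tn = sum(o != positive and p != positive for o, p in pairs)
    fp = sum(o != positive and p == positive for o, p in pairs)
    fn = sum(o == positive and p != positive for o, p in pairs)
    q2 = (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when any marginal count is zero
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return q2, mcc
```

For an observed labelling `SSSS----` and a prediction `SSS-S---`, six of eight residues agree, giving Q2 = 0.75 and MCC = 0.5.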

Table 2
Results of alternate labelling method for transmembrane residues.

                 Q2     Q3     MCC    Core   Exposure
Exposed/Buried   0.87   0.83   0.74   0.91   0.83
Cin/Cout         0.85   0.79   0.69   0.82   0.75

Comparison of accuracy when an alternate labelling method Cin/Cout was used for transmembrane residues.

In the training data set, the relative solvent-accessible surface area (rSASA) value was used to label the transmembrane residues as exposed or buried. Since there is no consensus on which rSASA value is to be used as threshold [32], we explored all reasonable rSASA values from 0.01 to 0.05 in steps of 0.01 as threshold. While selecting an rSASA threshold we also have to consider the random probability of an amino acid being exposed or buried. Hence, an rSASA threshold giving equi-partition of the data set as buried or exposed is preferred. The results with different rSASA thresholds are shown in Table 1. As shown (Table 1), an rSASA cutoff value of 0.03 gave better accuracies for most of the measures. Hence, this value was chosen as the threshold for labelling the residues in the training data set as buried or exposed. Transmembrane residues with rSASA greater than 0.03 were labelled as exposed and the remaining ones were labelled as buried.
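The Exposed/Buried ratio used to judge how evenly a threshold partitions the data can be sketched as below. The function names and the toy rSASA values in the usage note are hypothetical; only the rule "rSASA greater than the threshold means exposed" is taken from the text.

```python
def exposure_labels(rsasa_values, threshold=0.03):
    """Label each transmembrane residue as exposed or buried by its rSASA value."""
    return ["exposed" if v > threshold else "buried" for v in rsasa_values]


def exposed_buried_ratio(rsasa_values, threshold):
    """Ratio of exposed to buried residues at a given rSASA threshold."""
    labels = exposure_labels(rsasa_values, threshold)
    exposed = labels.count("exposed")
    buried = labels.count("buried")
    return exposed / buried if buried else float("inf")


def scan_thresholds(rsasa_values, thresholds=(0.01, 0.02, 0.03, 0.04, 0.05)):
    """Exposed/Buried ratio for each candidate threshold."""
    return {t: exposed_buried_ratio(rsasa_values, t) for t in thresholds}
```

Calling `scan_thresholds` on a list of rSASA values returns one ratio per candidate threshold; a ratio close to 1 corresponds to the equi-partition that the text prefers. The paper combines this ratio with the accuracy measures when choosing 0.03, so the ratio alone does not determine the choice.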

Wimley showed that the in/out dyad repeat pattern is strictly followed by TMBs [41]. However, in a recent study, Hayat et al. [28] showed that roughly 10% of the out-pointing residues were found to be hidden in the protein structure as a result of shielding by the side chains of the neighbouring residues. This means that although the strict in/out dyad repeat pattern is maintained, the out-pointing residues are not necessarily exposed to the lipid bilayer and some in-pointing residues are also slightly exposed to the membrane. We note that most state of the art computational methods in the realm of TMBs employ the regular in/out dyad repeat pattern in determining TMB strands [24] and identifying TMBs from genomic data [41]. Here, we tried to account for such shielding effects of neighbouring residues when predicting the exposure status of TM residues. Thus, TMBHMM employs the exposure status of membrane residues instead of the fixed in/out dyad repeat pattern. Table 2 shows the TMBHMM prediction accuracy for the case when the membrane residues were labelled based on the in/out instead of the exposure status. As shown, TMBHMM prediction accuracy is higher when the exposure status labelling instead of in/out labelling is applied to the residues in the membrane region. However, a combination of exposed/buried and in/out labelling can be used to extract more information about the membrane residues in future.

The prediction accuracy measures for both the self-consistency and the jack-knife test for TMBHMM are given in Table 3. As shown, the leave-one-out Q2 accuracy of TMBHMM is 0.86, while the Q3 accuracy of determining the periplasmic, cytosolic and the membrane region is 0.83. The core accuracy of accurately predicting the residues in the membrane region is 0.88. The exposure status prediction accuracy for the residues correctly predicted to be in the membrane region is 0.83. The Q2 accuracy for strand/non-strand predictions was further analyzed as a function of the z-position relative to the membrane center. As shown in Fig. 2, there is a drop in Q2 accuracy at z-positions between −10 Å and −15 Å and between 10 Å and 15 Å. It should be noted that the average half-thickness of the bilayer for our training data set, determined based on the PDB structures obtained from the OPM database [33], is 11.78 Å. For the training of TMBHMM, this value was used as the boundary for classifying a residue as a membrane or a non-membrane residue. Hence, we can conclude that most misclassifications were observed at the boundaries, which is a common problem with the TMB prediction methods. These misclassifications can be attributed to the difference in physico-chemical properties of the membrane core and the membrane/water interface regions [42–44]. Moreover, inherent errors in the theoretical and experimental methods in determining the membrane boundary could also be responsible for the higher rate of misclassification at the membrane/water interface region.

Fig. 2. Q2 accuracy as a function of the z-position, with the z-axis aligned to the membrane normal.

Further analysis was done to check for the tendency for correct or wrong predictions to occur together. The probability of a true/false prediction was calculated for the cases in which one or both of the neighbours were predicted correctly/wrongly. Table 4 shows that the probability of a residue having a correct strand/non-strand assignment (Q2) is 0.97 when the regions of both the neighbouring residues were also correctly predicted. More interestingly, when both the neighbouring residues are assigned to wrongly predicted regions, the target residues are also misclassified with a probability of 0.95. Thus, as shown in Table 4, there is a tendency of true/false predictions to occur in clusters. We also calculated the TMBHMM prediction accuracy when the training was performed without the exposure status information about the residues in the membrane region (Table 5). As shown in Table 5, all the accuracy measures are slightly higher when the exposure status of the membrane residues was incorporated in the TMBHMM architecture. Thus, the inclusion of the exposure status of the residues in the membrane region, resulting in separate states for membrane residues, enables TMBHMM to predict the exposure status of query sequences, and this additional number of states in the TMBHMM architecture does not adversely affect the prediction accuracy.
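One way to quantify the clustering of correct and wrong predictions is sketched below for the "both neighbours" case. The paper does not spell out how chain termini or the exact conditioning were handled, so this sketch makes its own assumptions: the first and last residues are skipped because they have only one neighbour, and the function name and boolean input format are hypothetical.

```python
def neighbour_conditionals(correct):
    """Return (P(correct | both neighbours correct), P(wrong | both neighbours wrong)).

    `correct` holds one boolean per residue: True if the prediction was right.
    A value of None is returned for a probability whose condition never occurs.
    """
    both_ok = ok_given_ok = 0
    both_bad = bad_given_bad = 0
    # Terminal residues are skipped because they lack a second neighbour
    for i in range(1, len(correct) - 1):
        left, mid, right = correct[i - 1], correct[i], correct[i + 1]
        if left and right:
            both_ok += 1
            ok_given_ok += bool(mid)
        elif not left and not right:
            both_bad += 1
            bad_given_bad += not mid
    p_ok = ok_given_ok / both_ok if both_ok else None
    p_bad = bad_given_bad / both_bad if both_bad else None
    return p_ok, p_bad
```

Values close to 1 for both probabilities, as reported in Table 4 (0.97 and 0.95 for Q2), indicate that errors tend to occur in contiguous runs rather than as isolated residues.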

3.3. Analysis of state-wise prediction accuracy of TMBHMM

Table 4
Tendency for correct or wrong predictions to occur together.

                         One neighbour   Both neighbours   One neighbour   Both neighbours
                         correct         correct           wrong           wrong
Q2 prediction
  Correct prediction     0.86            0.97              NA              NA
  Wrong prediction       NA              NA                0.66            0.95
For exposure prediction
  Correct prediction     0.85            0.93              NA              NA
  Wrong prediction       NA              NA                0.53            0.90

The probability of a residue being predicted correctly or wrongly if one or both neighbours are predicted correctly or wrongly was calculated. This was done to see if there is tendency for correct or wrong predictions to occur in clusters.

The data set consists of 7470 residues. These residues were labelled as belonging to one of the five states, namely, non-beta strand transmembrane region (TM-Other), periplasmic side (periSide), beta strand residues buried and exposed in the membrane region (buriedCore and exposedCore) and the extracellular side (extraSide). Based on the OPM assignment of the membrane boundaries, the total number of residues in each state was 899, 723, 1262, 1321 and 3265, respectively. As shown in Table 6, the overall accuracy for the 5-state TMBHMM predictions is 76%. The prediction accuracy for 1thq and 1e54 at 42% and 41%, respectively, is much lower than the overall average prediction accuracy. Table 7 shows the misclassification of residues into a wrong state. The diagonal represents the correct predictions. As shown, the lowest prediction accuracy is obtained for the residues labelled as TM-Other. It should be noted that all non-beta strand residues in the TM region were labelled as “TM-Other” and the relative exposure of these residues to the space inside the beta barrel is not taken into account, which could lead to a higher misclassification rate. However, as shown in Table 7, 22% of the TM-Other residues were misclassified as belonging to the beta strands in the membrane region (12% as exposed and 10% as buried). When the buriedCore and exposedCore states were merged with the TM-Other state, a total of 3062 out of 3482 (88%) residues belonging to the TM region were correctly classified. The overall average 3-state accuracy of classification when TM-Other state was merged with buriedCore and exposedCore regions is 84%. The per amino acid prediction accuracy is shown in Table 8. At 65% the prediction accuracy for PHE is the lowest.
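The residue counts quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable and function names are hypothetical.

```python
# Residues per state as quoted in the text (OPM-based assignment)
STATE_COUNTS = {
    "TM-Other": 899,
    "periSide": 723,
    "buriedCore": 1262,
    "exposedCore": 1321,
    "extraSide": 3265,
}

# States that are merged into a single membrane (TM) region
TM_STATES = ("TM-Other", "buriedCore", "exposedCore")


def tm_region_summary(correct_tm=3062):
    """Return (total residues, TM-region residues, merged TM accuracy in percent)."""
    total = sum(STATE_COUNTS.values())
    tm_total = sum(STATE_COUNTS[s] for s in TM_STATES)
    return total, tm_total, 100.0 * correct_tm / tm_total
```

The five state counts add up to the 7470 residues of the data set, the three membrane states add up to 3482 residues, and 3062 correct assignments out of 3482 correspond to the reported 88%.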

To compare the TMBHMM method to other methods, we followed Bagos et al. [19], who assessed the performance of different methods available for topology predictions of TMBs. Their assessment showed that HMM based methods perform better than neural network or support vector machine based methods. Among HMM based methods they concluded that PRED-TMBB [25] has the best performance. It should be noted that since none of these methods predicts the exposure status of residues in the membrane region, we can only compare the methods in terms of 2-state, 3-state and strand prediction accuracies. The non-redundant data set used by Bagos et al. [19] was employed to compare the prediction accuracies of the different methods. Table 9 lists the Q2, Matthews correlation coefficient (MCC), Core and SOV prediction accuracies for TMBHMM (0.84, 0.67, 0.84 and 0.91, respectively). The corresponding accuracies for the PRED-TMBB method are 0.85, 0.69, 0.82 and 0.90, respectively. Lower prediction accuracies were obtained when raw PDB structures were employed instead of the membrane oriented protein structures obtained from the OPM database [33].
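For readers unfamiliar with the two per-residue measures used in this comparison, the following sketch computes Q2 and the Matthews correlation coefficient from an observed and a predicted 2-state labelling. It assumes the usual definitions (Q2 as the fraction of residues assigned to the correct one of two states, MCC from the 2x2 confusion counts); the label characters "M" (membrane strand) and "O" (other) and the function name `q2_and_mcc` are choices made for this example, not taken from the TMBHMM implementation.

```python
import math


def q2_and_mcc(observed, predicted, positive="M"):
    """Return (Q2, MCC) for two equally long per-residue label strings.

    `positive` is the label treated as the positive class (here: membrane strand).
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("empty input")

    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs == positive and pred == positive:
            tp += 1
        elif obs != positive and pred == positive:
            fp += 1
        elif obs != positive and pred != positive:
            tn += 1
        else:
            fn += 1

    q2 = (tp + tn) / len(observed)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a common convention, assumed here.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return q2, mcc


# Toy example: 8 residues, one false negative and one false positive.
q2, mcc = q2_and_mcc("MMMMOOOO", "MMMOOOOM")
print(q2, mcc)  # 0.75 0.5
```

Unlike Q2, the MCC is not inflated when one of the two states dominates the data set, which is why both values are usually reported together.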

The prediction accuracy of the final TMBHMM model was also compared with the methods PRED-TMBB [25] and PROFtmb [22] based on an unseen non-redundant data set of 10 protein chains with X-ray 3D structures obtained from the OPM database [33]. As shown

Table 5
Accuracy measures without exposure information.

                               Q2     Q3     MCC    Core
With exposure information      0.87   0.83   0.74   0.91
Without exposure information   0.85   0.81   0.70   0.86

For the comparison of accuracy measures when exposure information was not provided, all the strand residues were classified in the same state.

Table 6
Protein-wise prediction accuracy.

Protein   Number of strands   Correct   Wrong   ACC [%]
1thq      8                   62        85      42
1qfg      22                  560       147     79
1kmo      22                  542       119     82
1a0s      18                  282       131     68
1xkw      22                  547       108     84
2gsk      22                  465       125     79
1qjp      8                   111       26      81
2f1v      8                   157       25      86
1xkh      22                  518       169     75
2erv      8                   116       34      77
1qd6      12                  187       53      78
2mpr      18                  315       106     75
1tly      12                  185       66      74
1i78      10                  234       63      79
1qj8      8                   129       19      87
1fep      22                  554       126     81
1e54      16                  135       196     41
1t16      14                  331       96      78
2j1n      16                  261       85      75
Total                         5691      1779    76

The prediction accuracy per protein. The table shows the 5-state prediction accuracy. The prediction accuracy for the case when all TM states are merged into one (3-state prediction accuracy) is discussed in the main text. “Correct” means residues that are predicted to be in the right state.

Table 8
Amino acid residue-wise prediction accuracy.

Amino acid   Correct   Wrong   ACC [%]
CYS          6         1       86
GLN          230       91      72
ILE          219       69      76
SER          407       129     76
VAL          316       95      77
GLY          574       237     71
PRO          207       64      76
LYS          253       45      85
THR          425       119     78
PHE          208       114     65
ALA          421       134     76
HIS          82        40      67
MET          95        23      81
ASP          426       94      82
GLU          253       57      82
LEU          393       137     74
ARG          281       81      78
TRP          145       49      75
ASN          369       112     77
TYR          381       88      81

Table 9
Comparison with the PRED-TMBB method.

                   Q2     MCC    Core   SOV
TMBHMM (OPM)       0.84   0.67   0.84   0.91
PRED-TMBB (OPM)    0.85   0.69   0.82   0.90
TMBHMM (PDB)       0.76   0.53   0.69   0.86
PRED-TMBB (PDB)    0.76   0.54   0.67   0.84

The performance of the TMBHMM method was compared with the PRED-TMBB method on a test data set of 20 non-redundant proteins. The first two rows give the accuracies when strand regions were taken from the OPM database, while the last two rows give the accuracies when strand regions were taken from the PDB database.

Table 10
Prediction accuracy comparison on an unseen non-redundant test data set.

Protein     Number of   PRED-TMBB   PROFtmb     TMBHMM      TMBHMM      TMBHMM
            strands     (3-state)   (2-state)   (2-state)   (3-state)   (exposure status)
3a2sG       16          75          75          82          82          89
2k0lA       8           86          85          85          82          94
2qdzA (a)   16          57          77          36          36          87
2qomA       12          88          84          83          83          91
2vqiA       24          66          62          72          69          97

N.K. Singh et al. / Biochimica et Biophysica Acta 1814 (2011) 664–670

in Table 10, the overall 2-state prediction accuracy of the PROFtmb and TMBHMM methods is 74% and 73%, respectively. The low prediction accuracy (36%) of the TMBHMM method for 2qdzA could be due to the large structural gaps in the PDB file obtained from the OPM database. When protein 2qdzA is excluded, the average 2-state prediction accuracy of TMBHMM is 77%. The 3-state prediction accuracy of the PRED-TMBB method on this test data set is 71%. The corresponding 3-state accuracy of the TMBHMM method is 72% and 76% when 2qdzA is included and excluded, respectively. The exposure status prediction accuracy of the TMBHMM method on the unseen test data set comprising 10 protein chains is around 92%. Thus, on all data sets tested, the prediction accuracy of the TMBHMM method is comparable to that of the other prediction methods tested from the literature, and it additionally predicts the exposure status of the residues predicted to be in the transmembrane region.
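The effect of the outlier 2qdzA on the column averages can be checked directly from the per-protein values of Table 10. The sketch below simply averages the published, already rounded percentages with and without that chain; the names `TABLE_10`, `COLUMNS` and `column_means` are introduced for this illustration only. Because the inputs are rounded, an unrounded mean may differ from the published average by about one percentage point.

```python
from statistics import mean

COLUMNS = [
    "PRED-TMBB (3-state)",
    "PROFtmb (2-state)",
    "TMBHMM (2-state)",
    "TMBHMM (3-state)",
    "TMBHMM (exposure status)",
]

# Per-residue accuracies in % for the 10 chains of the unseen test set (Table 10).
TABLE_10 = {
    "3a2sG": [75, 75, 82, 82, 89],
    "2k0lA": [86, 85, 85, 82, 94],
    "2qdzA": [57, 77, 36, 36, 87],
    "2qomA": [88, 84, 83, 83, 91],
    "2vqiA": [66, 62, 72, 69, 97],
    "3bs0A": [63, 78, 71, 70, 93],
    "3cslA": [71, 75, 75, 74, 93],
    "3emnX": [57, 62, 66, 57, 90],
    "3fhhA": [71, 71, 84, 83, 93],
    "3jtyA": [74, 75, 80, 80, 95],
}


def column_means(table, exclude=()):
    """Mean of every column over all proteins not listed in `exclude`."""
    rows = [values for protein, values in table.items() if protein not in exclude]
    return {name: mean(row[i] for row in rows) for i, name in enumerate(COLUMNS)}


all_chains = column_means(TABLE_10)
without_outlier = column_means(TABLE_10, exclude=("2qdzA",))

for name in COLUMNS:
    print(f"{name:26s} all: {all_chains[name]:5.1f}   without 2qdzA: {without_outlier[name]:5.1f}")
```

Excluding a single chain shifts the TMBHMM 2-state and 3-state averages by roughly four percentage points, while the PRED-TMBB and PROFtmb averages barely move, which shows how strongly a mean over only 10 chains depends on one poorly predicted protein.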

3.4. TMBHMM web server for TMB structural topology prediction

The TMBHMM web server accepts input in the form of an uploaded FASTA file, a pasted FASTA sequence or an MSA file. The results are sent back to the user via email and include the generated MSA and the predicted topology of the putative TMB. The TMBHMM web server can be found under the tmbhmm tab at http://service.bioinformatik.uni-saarland.de/tmx-site/.
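Before submitting a sequence, it can be useful to check locally that the input is well-formed FASTA. The following is a minimal sketch of such a check. It is not part of the TMBHMM server, whose own input validation is not described here; the function names `parse_fasta` and `invalid_residues` and the restriction to the 20 standard amino acid letters are assumptions made for this example.

```python
# Minimal FASTA sanity check (hypothetical helper, independent of the server).

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted characters that are not standard amino acid letters."""
    return sorted(set(sequence) - STANDARD_AMINO_ACIDS)


example = """>demo_protein hypothetical example
MKKTAIAIAV
ALAGFATVAQA
"""
for name, seq in parse_fasta(example):
    print(name, len(seq), invalid_residues(seq))
```

Note that an aligned MSA would normally contain gap characters such as "-", which this check reports as non-standard; for alignment input the allowed alphabet would have to be extended accordingly.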

4. Conclusions

We have developed TMBHMM, a computational method based on a hidden Markov model, for the prediction of the structural topology of TMBs. In contrast to the existing methods, TMBHMM also predicts

Table 7
State-wise prediction accuracy per residue in %.

              TM-Other   periSide   buriedCore   exposedCore   extraSide
TM-Other      52         7          10           12            19
periSide      5          63         11           20            2
buriedCore    0          2          78           15            5
exposedCore   0          1          14           78            7
extraSide     8          0          4            3             84

Values on the diagonal are accurate predictions. Each row lists the predictions for one true state.

the exposure status of the residues in the membrane region. TMBHMM employs evolutionary information in the form of frequency profiles. Based on a jack-knife test, the overall Q2 and Q3 accuracies of TMBHMM were 0.86 and 0.83, respectively, and the SOV for the identified beta strands was 0.92. We showed that the accuracy of TMBHMM is comparable to two of the currently best available methods. Namely, the 2- and 3-state TMBHMM prediction accuracies on an unseen non-redundant data set were found to be 73% and 72%,

Table 10 (continued)

Protein       Number of   PRED-TMBB   PROFtmb     TMBHMM      TMBHMM      TMBHMM
              strands     (3-state)   (2-state)   (2-state)   (3-state)   (exposure status)
3bs0A         14          63          78          71          70          93
3cslA         22          71          75          75          74          93
3emnX         19          57          62          66          57          90
3fhhA         22          71          71          84          83          93
3jtyA         18          74          75          80          80          95
Average                   71          74          73          72          92
Average (b)               72          74          77          76          93

Shown is the comparison of per residue accuracies. Only the residues correctly predicted to be in the membrane region were considered for determining the exposure status prediction accuracy. In the case of PROFtmb and 2-state TMBHMM, the two states considered are membrane and non-membrane.

(a) Protein 2qdzA is discussed in the text.
(b) Average is calculated after excluding protein 2qdzA.


respectively, which is comparable to the PRED-TMBB and PROFtmb methods. The TMBHMM method has been implemented as a web service. The TMBHMM web server takes an amino acid sequence or a multiple sequence alignment as input and predicts the exposure status of the transmembrane residues and the structural topology as output. The TMBHMM method can be used as a computational tool to identify beta strands and annotate putative TMBs from proteomic data.

Author contributions

S.H. and V.H. designed the study; N.K.S., S.H. and A.G. performed the study; P.W. and S.H. implemented the web server; S.H., N.K.S. and V.H. wrote the manuscript.

Acknowledgements

S.H. in the Graduiertenkolleg 1276/1 was supported by a PhD fellowship funded by DFG. A.G. thanks the DAAD RISE program for funding his research at Saarland University, Germany.

References

[1] G.E. Schulz, beta-Barrel membrane proteins, Curr. Opin. Struct. Biol. 10 (2000) 443–447.

[2] G.E. Schulz, The structure of bacterial outer membrane proteins, Biochim. Biophys. Acta 1565 (2002) 308–317.

[3] R. Koebnik, K.P. Locher, P. Van Gelder, Structure and function of bacterial outer membrane proteins: barrels in a nutshell, Mol. Microbiol. 37 (2000) 239–253.

[4] G.E. Schulz, Porins: general to specific, native to engineered passive pores, Curr. Opin. Struct. Biol. 6 (1996) 485–490.

[5] A. Pautsch, G.E. Schulz, Structure of the outer membrane protein A transmembrane domain, Nat. Struct. Biol. 5 (1998) 1013–1017.

[6] T.J. Knowles, A. Scott-Tucker, M. Overduin, I.R. Henderson, Membrane protein architects: the role of the BAM complex in outer membrane protein assembly, Nat. Rev. Microbiol. 7 (2009) 206–214.

[7] N. Pfanner, N. Wiedemann, C. Meisinger, T. Lithgow, Assembling the mitochondrial outer membrane, Nat. Struct. Mol. Biol. 11 (2004) 1044–1048.

[8] N. Bolender, A. Sickmann, R. Wagner, C. Meisinger, N. Pfanner, Multiple pathways for sorting mitochondrial precursor proteins, EMBO Rep. 9 (2008) 42–49.

[9] T. Endo, K. Yamano, Multiple pathways for mitochondrial protein traffic, Biol. Chem. 390 (2009) 723–730.

[10] J. Tommassen, Assembly of outer-membrane proteins in bacteria and mitochondria, Microbiology 156 (2010) 2587–2596.

[11] R. Jackups Jr., J. Liang, Interstrand pairing patterns in beta-barrel membrane proteins: the positive-outside rule, aromatic rescue, and strand registration prediction, J. Mol. Biol. 354 (2005) 979–993.

[12] S. Galdiero, M. Galdiero, C. Pedone, beta-Barrel membrane bacterial proteins: structure, function, assembly and interaction with lipids, Curr. Protein Pept. Sci. 8 (2007) 63–82.

[13] R. Pajon, D. Yero, A. Lage, A. Llanes, C.J. Borroto, Computational identification of beta-barrel outer-membrane proteins in Mycobacterium tuberculosis predicted proteomes as putative vaccine candidates, Tuberculosis (Edinb) 86 (2006) 290–302.

[14] M.H. Saier Jr., Families of proteins forming transmembrane channels, J. Membr. Biol. 175 (2000) 165–180.

[15] R.J. Gilbert, Pore-forming toxins, Cell. Mol. Life Sci. 59 (2002) 832–844.

[16] M. Bannwarth, G.E. Schulz, The expression of outer membrane proteins for crystallization, Biochim. Biophys. Acta 1610 (2003) 37–45.

[17] J.E. Meyer, M. Hofnung, G.E. Schulz, Structure of maltoporin from Salmonella typhimurium ligated with a nitrophenyl-maltotrioside, J. Mol. Biol. 266 (1997) 761–775.

[18] H.R. Bigelow, D.S. Petrey, J. Liu, D. Przybylski, B. Rost, Predicting transmembrane beta-barrels in proteomes, Nucleic Acids Res. 32 (2004) 2566–2577.

[19] P.G. Bagos, T.D. Liakopoulos, S.J. Hamodrakas, Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method, BMC Bioinform. 6 (2005) 7.

[20] G. von Heijne, Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule, J. Mol. Biol. 225 (1992) 487–494.

[21] P.G. Bagos, T.D. Liakopoulos, I.C. Spyropoulos, S.J. Hamodrakas, A hidden Markov model method, capable of predicting and discriminating beta-barrel outer membrane proteins, BMC Bioinform. 5 (2004) 29.

[22] H. Bigelow, B. Rost, PROFtmb: a web server for predicting bacterial transmembrane beta barrel proteins, Nucleic Acids Res. 34 (2006) W186–W188.

[23] P.L. Martelli, P. Fariselli, A. Krogh, R. Casadio, A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins, Bioinformatics 18 (Suppl 1) (2002) S46–S53.

[24] T.C. Freeman Jr., W.C. Wimley, A highly accurate statistical approach for the prediction of transmembrane beta-barrels, Bioinformatics 26 (2010) 1965–1974.

[25] P.G. Bagos, T.D. Liakopoulos, I.C. Spyropoulos, S.J. Hamodrakas, PRED-TMBB: a web server for predicting the topology of beta-barrel outer membrane proteins, Nucleic Acids Res. 32 (2004) W400–W404.

[26] B. Rost, C. Sander, Conservation and prediction of solvent accessibility in protein families, Proteins 20 (1994) 216–226.

[27] J. Abramson, I. Smirnova, V. Kasho, G. Verner, H.R. Kaback, S. Iwata, Structure and mechanism of the lactose permease of Escherichia coli, Science 301 (2003) 610–615.

[28] S. Hayat, P. Walter, P. Yungki, V. Helms, Prediction of the exposure status of transmembrane beta barrel residues from protein sequence, J. Bioinform. Comput. Biol. 9 (2011) 43–65.

[29] L. Baum, T. Petrie, G. Soules, N. Weiss, A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Stat. 41 (1970) 164–171.

[30] L. Welch, Hidden Markov models and the Baum–Welch algorithm, IEEE Inf. Theory Soc. Newsl. 53 (2003) 1–10.

[31] G. Forney Jr., The Viterbi algorithm, Proc. IEEE 61 (1973) 268–278.

[32] Y. Park, S. Hayat, V. Helms, Prediction of the burial status of transmembrane residues of helical membrane proteins, BMC Bioinform. 8 (2007) 302.

[33] M.A. Lomize, A.L. Lomize, I.D. Pogozheva, H.I. Mosberg, OPM: orientations of proteins in membranes database, Bioinformatics 22 (2006) 623–625.

[34] Y. Park, V. Helms, On the derivation of propensity scales for predicting exposed transmembrane residues of helical membrane proteins, Bioinformatics 23 (2007) 701–708.

[35] H. Edelsbrunner, M. Liang, Measuring proteins and voids in proteins, System Sciences, 1995. Vol. V. Proceedings of the Twenty-Eighth Hawaii International Conference on 5 (1995).

[36] H. Edelsbrunner, The union of balls and its dual shape, ACM (1993) 218–231.

[37] J. Pei, N.V. Grishin, AL2CO: calculation of positional conservation in a protein sequence alignment, Bioinformatics 17 (2001) 700–712.

[38] S. Henikoff, J. Henikoff, Position-based sequence weights, J. Mol. Biol. 243 (1994) 574–578.

[39] P. Baldi, S. Brunak, Y. Chauvin, C.A. Andersen, H. Nielsen, Assessing the accuracy of prediction algorithms for classification: an overview, Bioinformatics 16 (2000) 412–424.

[40] A. Zemla, C. Venclovas, K. Fidelis, B. Rost, A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment, Proteins 34 (1999) 220–223.

[41] W.C. Wimley, Toward genomic identification of beta-barrel membrane proteins: composition and architecture of known structures, Protein Sci. 11 (2002) 301–312.

[42] W.C. Wimley, S.H. White, Experimentally determined hydrophobicity scale for proteins at membrane interfaces, Nat. Struct. Biol. 3 (1996) 842–848.

[43] E. Granseth, G. von Heijne, A. Elofsson, A study of the membrane–water interface region of membrane proteins, J. Mol. Biol. 346 (2005) 377–385.

[44] J. Liang, L. Adamian, R. Jackups Jr., The membrane–water interface region of membrane proteins: structural bias and the anti-snorkeling effect, Trends Biochem. Sci. 30 (2005) 355–357.

[45] Z. Yuan, F. Zhang, M.J. Davis, M. Boden, R.D. Teasdale, Predicting the solvent accessibility of transmembrane residues from protein sequence, J. Proteome Res. 5 (2006) 1063–1070.

