Coding of amino acids by texture descriptors


Loris Nanni *, Alessandra Lumini

Department of Electronic, Informatics and Systems (DEIS), Università di Bologna, Via Venezia 52, 47023 Cesena, Italy

Artificial Intelligence in Medicine 48 (2010) 43–50

ARTICLE INFO

Article history:

Received 12 December 2008

Received in revised form 24 September 2009

Accepted 3 October 2009

Keywords:

Protein classification

Peptide classification

Vaccine development

Local binary patterns

Discrete cosine transform

Support vector machine

ABSTRACT

Objective: In this paper we propose a new feature extractor for peptide/protein classification based on the calculation of texture descriptors. Representing a peptide/protein by a matrix descriptor, instead of a vector, makes it possible to treat the peptide/protein as an image and to use texture descriptors to represent it.

Methods and materials: A matrix descriptor, which is a square matrix whose dimension equals the length of the peptide/protein, is obtained by considering a partial ordering of the amino acids of the peptide/protein according to their value of a given physicochemical property. Each matrix descriptor is treated as a texture image, and several texture descriptors are considered in order to obtain a compact representation that is scale invariant (i.e. independent of the length of the peptide/protein). The texture descriptors tested in this work are local binary patterns (LBP), the discrete cosine transform (DCT) and Daubechies wavelets.

Results and conclusion: The experimental section reports several tests, aimed at supporting our ideas, performed on the following datasets: a vaccine dataset for the prediction of peptides that bind human leukocyte antigens; a human immunodeficiency virus (HIV-1) protease cleavage site prediction dataset; and a membrane protein type dataset.

The experimental results confirm the usefulness of the novel descriptors: the performance obtained by our system on the three difficult datasets is quite high, indicating that the proposed method is a feasible system for extracting information from peptides and proteins. The performance obtained by each of the three texture descriptors calculated from the matrix-based representation, and coupled to a support vector machine classifier, is lower than the performance obtained by other vector-based descriptors based on physicochemical properties proposed in the literature. Nevertheless, the new descriptors carry different information, and our tests show that the texture descriptors and the vector-based descriptors can be combined to improve the overall performance of the system.

In particular, the proposed approach improves the state-of-the-art results in two out of the three tested problems (the HIV-1 protease cleavage site prediction dataset and the membrane protein type dataset).

© 2009 Elsevier B.V. All rights reserved.

Contents lists available at ScienceDirect

Artificial Intelligence in Medicine

journal homepage: www.elsevier.com/locate/aiim

1. Introduction

Several applications need to extract features from peptides/proteins in order to solve a given classification problem [1]. Some examples are sub-cellular localization [2], protein–protein interactions [3] and HIV-1 protease cleavage site prediction [4,5].

Probably the most widely used feature extractor for peptides and proteins is Chou’s pseudo amino acid (PseAA) composition [6]. In the literature, several variants of this descriptor have been proposed: hydropathy scales [7,8], physicochemical distance [9], digital code [10], complexity factor [11,12], digital signal [13], Fourier low-frequency spectrum [14], cellular automata [15], and “artificial” features created by genetic programming combining one or more “original” Chou’s pseudo amino acid features [16]. The interested reader can refer to [17] and [18] for a survey of the different methods for extracting features from peptides and proteins.

* Corresponding author. Tel.: +39 0547 339121; fax: +39 0547 338890.

E-mail addresses: [email protected] (L. Nanni), [email protected] (A. Lumini).

0933-3657/$ – see front matter © 2009 Elsevier B.V. All rights reserved.

doi:10.1016/j.artmed.2009.10.001

Most of the feature extractors proposed in the literature are based on a vectorial representation of the peptide/protein. For example, in [19] a physicochemical encoding is proposed: each amino acid is represented by a 20-dimensional vector with all values set to zero except for the one corresponding to the considered amino acid, which takes the value of the measured physicochemical property. The descriptor associated with a peptide/protein is obtained by concatenating all the 20-dimensional vectors corresponding to its amino acid sequence.
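The physicochemical encoding just described can be sketched in a few lines. This is only an illustrative sketch: the alphabetical ordering of the 20 residues, the function name physicochemical_encoding and the property values in toy_property are assumptions made for the example, not details taken from [19].

```python
# The 20 standard residues as one-letter codes (alphabetical order is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def physicochemical_encoding(sequence, prop):
    """Concatenate one 20-dimensional vector per residue.

    Each vector is zero everywhere except at the index of the residue,
    where it holds that residue's value of the physicochemical property.
    """
    features = []
    for residue in sequence:
        vec = [0.0] * len(AMINO_ACIDS)
        vec[AMINO_ACIDS.index(residue)] = prop[residue]
        features.extend(vec)
    return features

# Hypothetical property values, invented for illustration only.
toy_property = {aa: float(i + 1) for i, aa in enumerate(AMINO_ACIDS)}
encoded = physicochemical_encoding("ACD", toy_property)
```

For a peptide of length l the resulting vector has 20·l entries, so its size depends on the sequence length.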

Other interesting encoding methods, reported here for completeness, are based on kernels. One of the first approaches is the Fisher kernel [20], proposed for remote homology detection. A different kernel, the mismatch string kernel, is proposed in [21]; it measures the similarity between two sequences of amino acids based on shared occurrences of subsequences. In [21] it is shown that string kernels have performance similar to the Fisher kernel at a lower computational cost. A class of new kernels that obtains good performance in predicting protein sub-cellular localization is developed in [22]: a set of kernel functions derived from k-peptide vectors, mapped by a matrix of high-scored pairs of k-peptides measured by BLOSUM62 scores, is used for training a support vector machine. Another interesting approach is the bio-basis function neural network [23]: in this method the sequences are not encoded in a feature space; instead, the distances obtained by sequence alignment are used to train the neural network.

¹ Available at www.genome.jp/dbget/aaindex.html (accessed 15 July 2009).

The aim of this paper is to propose a novel descriptor obtained from a matrix representation of peptides/proteins. As in many of the methods cited above, physicochemical properties are used to discriminate among the amino acids: each descriptor, which is a square matrix whose dimension equals the length of the peptide/protein, is obtained by considering a partial ordering of the amino acids of the peptide/protein according to a given physicochemical property. A more compact representation of this matrix descriptor is obtained by treating the matrix as an image and using a texture descriptor to obtain a scale-invariant representation, independent of the length of the peptide/protein. Several well-known texture descriptors are tested in this paper: local binary patterns (LBP), which extract a histogram describing the difference between each matrix point and its neighborhood; the discrete cosine transform (DCT); and the Daubechies wavelet, which performs a multi-resolution analysis of the image.

The experimental section reports several tests on the following datasets: a vaccine dataset for the prediction of peptides that bind human leukocyte antigens; an HIV-1 protease cleavage site prediction dataset; and a membrane protein type dataset. Our results show that the proposed descriptors obtain valuable classification accuracy and can be considered for fusion with other standard descriptors to further improve the classification performance.

The remainder of the paper is organized as follows. Section 2 briefly reviews the related work on the three applications tested in this paper; Section 3 introduces the feature extraction method proposed in this work; Section 4 reports experimental results obtained on three different classification problems; finally, Section 5 draws some conclusions.

2. Related works

2.1. HIV-1 protease cleavage site prediction

The HIV-1 protease [24–26] is essential for the replication of the AIDS virus. Protease inhibitors bind the active site of the HIV-1 protease and prevent it from functioning normally. In the literature, several methods for predicting HIV-1 protease cleavage sites in proteins have been published, most of them based on machine learning systems: in [27–29] a standard feed-forward multilayer perceptron (MLP) is proposed as an alternative that outperforms the decision tree classifier; in [24,30] a support vector machine (SVM) classifier is tested; in particular, in [24] it is shown that HIV-1 protease cleavage is a linear problem and that the best results are obtained by a linear SVM. Recently, a web server was established for predicting HIV-1 protease cleavage sites in proteins [31].

2.2. Vaccine (predictions of peptides that bind human leukocyte antigens)

In order to design useful vaccines for a large population, it is very important to predict the peptides that bind multiple human leukocyte antigen (HLA) molecules [32]. Developing automatic systems that predict whether a peptide binds multiple HLA molecules is very useful for making vaccine design more time-effective. Examples of automatic systems already proposed in the literature (see [33] for a survey) are: a system based on SVM proposed in [34]; systems based on artificial neural networks and hidden Markov models proposed in [33]; and ensembles of classifiers tested in [18,35].

2.3. Membrane proteins type

The type of a membrane protein determines its function [36,37]; for this reason, several methods for the automated classification of membrane protein types have been proposed in the literature [7,14,38,39]. Until 2007, only small datasets that had not been rigorously screened by a data-culling operation to avoid redundancy were available in the literature; a bigger and more reliable dataset was collected in [2].

In [2] a system based on an ensemble of optimized evidence-theoretic k-nearest neighbor classifiers is proposed, where the features (the pseudo amino acid composition) are extracted considering the position-specific scoring matrix.

In [16] an ensemble of SVMs is proposed, where each classifier is trained considering a different physicochemical property and a feature extraction method based on the residue couple model. This method partially fills the performance gap between the feature extraction methods based on the amino acid sequence and those based on the position-specific scoring matrix.

3. System description

The system proposed in this work is based on a matrix representation of the peptide/protein, which is treated as an image and characterized using a texture descriptor. An SVM classifier is trained on the extracted texture features to perform the classification task. A graphical schema of the proposed system is reported in Fig. 1. The following subsections describe the main steps of the approach.

3.1. Matrix representation

The aim of this step is to characterize a peptide/protein sequence by means of a matrix capable of representing both the positions of the amino acids in the sequence and their physicochemical properties. Then, a texture descriptor is extracted from the matrix, as described in Section 3.2, in order to obtain a descriptor that does not depend on the length of the sequence.

The matrix representation of the peptide/protein is constructed by considering a selected physicochemical property of the amino acids, which can be obtained from the amino acid index database¹ [40]. First, the 20 amino acids are sorted according to the value of the selected property, and then a “ranking value” is assigned to each of them, which weighs the position of the amino acid in the sorted sequence [41]. The ranking rule is the following: if no two amino acids have the same property value, the ranking values range from 1/20 to 1 in steps of 1/20; otherwise, amino acids with the same value share a ranking value and the ranks are normalized by the number of distinct values. For example, if the 20 amino acids are sorted in the following way according to a given physicochemical property P: N < K < R < Y < F = Q < S < H < M < W < G = L < V < E < I < A < D < T < P < C, there are 18 distinct values and the corresponding weights are rankP(N) = 1/18, rankP(K) = 2/18, rankP(R) = 3/18, . . ., and rankP(C) = 18/18 = 1.
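The worked example above can be reproduced with a short sketch. The function name rank_values and the numeric property values are hypothetical; only the ordering (with the ties F = Q and G = L) comes from the text, and the rule is implemented here as "position among the distinct sorted values, divided by the number of distinct values", which is one reading consistent with the example.

```python
from fractions import Fraction

def rank_values(prop):
    """Map each amino acid to position / n_distinct, where position is the
    1-based index of its property value among the sorted distinct values."""
    distinct = sorted(set(prop.values()))
    position = {v: i + 1 for i, v in enumerate(distinct)}
    return {aa: Fraction(position[v], len(distinct)) for aa, v in prop.items()}

# Groups listed in increasing order of the property; residues in the same
# string are tied. The numeric values themselves are invented.
order = ["N", "K", "R", "Y", "FQ", "S", "H", "M", "W", "GL",
         "V", "E", "I", "A", "D", "T", "P", "C"]
toy_property = {aa: float(i) for i, group in enumerate(order) for aa in group}
ranks = rank_values(toy_property)
```

Exact fractions are used only so that the values can be compared directly with those quoted in the example; floating-point ranks would serve equally well in practice.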

Fig. 1. A complete schema of the proposed approach.

Fig. 2. LBP neighbor sets for different (P, R).

The ranking relationships between all the pairs of amino acids that compose the sequence of the peptide/protein are collected into a square matrix, named OM(P), of dimensions l × l, where l is the length of the sequence. For each pair of elements s and t of the sequence, the corresponding entry OM(P)s,t of this matrix is given by [rankP(s) + rankP(t)]/2. Therefore the diagonal values of the matrix are OM(P)s,s = rankP(s).
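A minimal sketch of the matrix construction follows, assuming the ranking values are already available as a dictionary. The function name order_matrix and the three-letter toy alphabet are illustrative choices, not part of the original method description.

```python
def order_matrix(sequence, ranks):
    """Return the l x l matrix whose (s, t) entry is (rank(s) + rank(t)) / 2."""
    r = [ranks[aa] for aa in sequence]
    return [[(rs + rt) / 2.0 for rt in r] for rs in r]

# Hypothetical ranking values for a toy three-letter alphabet.
toy_ranks = {"A": 1.0, "C": 0.5, "D": 0.25}
om = order_matrix("ACDA", toy_ranks)
```

The matrix is symmetric by construction, and its diagonal reproduces the ranking values of the residues in sequence order.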

From a practical point of view, the matrix OM(P) represents the ordering relationship determined by the physicochemical property P among the amino acids contained in a peptide/protein. Fig. 1 reports an example of matrix creation: the resulting matrix is rescaled to grey levels and visualized as an image.

When features are extracted from proteins, each protein is divided into 10 parts to reduce the computational cost, and a different descriptor is extracted from each part. The descriptor of the whole protein is given by the concatenation of these 10 descriptors.
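The splitting step can be sketched as follows. The text does not say how the 10 parts are cut, so contiguous chunks of nearly equal length are assumed here; split_into_parts, protein_descriptor and the describe callback are hypothetical names introduced for the example.

```python
def split_into_parts(sequence, n_parts=10):
    """Split a sequence into n_parts contiguous chunks of nearly equal length."""
    bounds = [round(i * len(sequence) / n_parts) for i in range(n_parts + 1)]
    return [sequence[bounds[i]:bounds[i + 1]] for i in range(n_parts)]

def protein_descriptor(sequence, describe, n_parts=10):
    """Concatenate the descriptor computed independently on each part."""
    features = []
    for part in split_into_parts(sequence, n_parts):
        features.extend(describe(part))
    return features

# Stand-in descriptor (just the chunk length), used only to show the plumbing.
parts = split_into_parts("A" * 95)
lengths = protein_descriptor("ACDEFGHIKLMNPQRSTVWY" * 5, lambda p: [len(p)])
```

Working on parts keeps each matrix at roughly (l/10) × (l/10) entries instead of l × l, which is where the saving in computation comes from.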

3.2. Extraction of texture features

A large matrix representation of the protein/peptide can contain noise and correlated features, which may exacerbate overfitting and negatively affect the classification performance. Therefore a powerful texture descriptor can be adopted to reduce the dimensionality of the matrix representation and retain only the most discriminant features. In this work we test three well-known texture descriptors: LBP, DCT and Daubechies wavelets.

A local binary pattern² operator is defined by considering the binary difference between the value of a cell x of a matrix and the values of its P neighbors placed on a circle of radius R (Fig. 2) [42]. The LBP is a texture operator which is made rotation invariant by performing P − 1 bitwise shift operations and selecting the smallest value. Then a histogram is constructed by counting the number of “uniform” patterns contained in an image. A “uniform” pattern is a pattern where the number of transitions between “0” and “1” in the sequence is less than or equal to two: the LBP histogram measures the occurrence of each type of uniform pattern and the number of non-uniform patterns. In this work we have extracted and concatenated the following histograms (suggested in [42]): (P = 8; R = 1), (P = 16; R = 2).

² The Matlab code is available at http://www.ee.oulu.fi/mvg/page/lbp_matlab (accessed 15 July 2009).
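The operations named in the LBP description (thresholding against the center, minimum over circular shifts, and the uniformity test) can be illustrated on a single pattern. This is a sketch, not the Matlab code referenced in the footnote: the neighbor values are passed in directly, so the interpolation of points on the circle of radius R is left out, and the labeling in uniform_label (number of ones for uniform patterns, P + 1 for all others) follows the usual rotation-invariant uniform convention, assumed here to match [42].

```python
def lbp_pattern(center, neighbors):
    """Threshold the P neighbor values against the center value."""
    return [1 if n >= center else 0 for n in neighbors]

def transitions(bits):
    """Number of 0/1 changes when the pattern is read as a circle."""
    return sum(bits[i] != bits[(i + 1) % len(bits)] for i in range(len(bits)))

def rotation_invariant_code(bits):
    """Smallest integer obtained over all circular shifts of the pattern."""
    rotations = (bits[k:] + bits[:k] for k in range(len(bits)))
    return min(int("".join(map(str, r)), 2) for r in rotations)

def uniform_label(bits):
    """Number of ones if the pattern is uniform, P + 1 for non-uniform ones."""
    return sum(bits) if transitions(bits) <= 2 else len(bits) + 1

pattern = lbp_pattern(5, [6, 7, 1, 2, 3, 9, 5, 4])
label = uniform_label(pattern)
```

With this labeling a histogram over an image has P + 2 bins, one for each uniform type plus one shared by all non-uniform patterns, so its length does not depend on the size of the matrix.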

The discrete cosine transform [43] expresses a sequence of points as a sum of cosine functions oscillating at different frequencies. DCT is similar to the discrete Fourier transform, but uses only real numbers. Moreover, it has a good information packing ability: most DCT components are typically very small in magnitude, because most of the salient information lies in the low-frequency coefficients. Therefore, compared with other input-independent transforms, it has the advantage of packing the most useful information into the fewest coefficients.
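The information packing property can be seen in a small sketch of a two-dimensional DCT that keeps only a few low-frequency coefficients. The orthonormal DCT-II variant and the ordering of coefficients by u + v are assumptions made for the example; the function names are hypothetical.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(matrix):
    """Apply the 1-D DCT to every row, then to every column."""
    rows = [dct_1d(r) for r in matrix]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def low_frequency_coefficients(matrix, n_c):
    """Return the n_c coefficients with the smallest u + v (ties broken by u)."""
    coeffs = dct_2d(matrix)
    index = sorted(((u, v) for u in range(len(coeffs)) for v in range(len(coeffs[0]))),
                   key=lambda uv: (uv[0] + uv[1], uv[0]))
    return [coeffs[u][v] for u, v in index[:n_c]]

flat = low_frequency_coefficients([[2.0] * 4 for _ in range(4)], 5)
```

For a constant matrix all the energy ends up in the first coefficient, which is the extreme case of the packing behavior. Because a fixed number of coefficients is kept, the feature vector has the same length for any matrix size.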

In this paper, we apply the DCT directly to the matrix representation and retain only nC coefficients. The optimization of the nC parameter is studied in the experimental section. The coefficients to retain can be selected, as usual, according to their frequencies, keeping those with the lowest frequencies (we name this approach ORD), or according to a supervised feature transformation. In this work we have tested the neighborhood preserving embedding method (NPE)³ [44], a subspace learning algorithm aimed at preserving the local neighborhood structure of the data, rather than the global Euclidean structure preserved by principal component analysis. Compared with the better-known principal component analysis, NPE is less sensitive to outliers.

Fig. 3. A graphical representation of the SVM hyperplane.

Table 1
Number of binders (B) and non-binders (NB) in training and testing sets for HLA-A2.

HLA-A2    Training set       Testing set
          B       NB         B       NB
0201      224     378        440     1999
0202      619     2361       45      25
0204      641     2162       23      224
0205      648     2346       16      40
0206      621     2349       43      37

Table 2
Number of membrane proteins in each of the eight types.

Type                   Training set    Testing set
Single-pass type I     610             444
Single-pass type II    312             78
Single-pass type III   24              6
Single-pass type IV    44              12
Multipass              1316            3265
Lipid-chain-anchor     151             38
GPI-anchor             182             46
Peripheral             610             444
Overall                3249            4333

Daubechies wavelet features (DAU) are widely used texture features [45] based on a set of descriptors extracted from a multi-resolution wavelet-transformed image. In this work, 30 features are extracted from the Daubechies wavelet [46] transformed image as the average energy of the three high-frequency components calculated up to the 10th level of decomposition, using both the scaling and wavelet functions of the Daubechies 4 wavelet.
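The feature count (three high-frequency subbands per level, 10 levels, hence 30 values) can be made concrete with a sketch. Several things are assumed: "Daubechies 4" is read as the four-tap Daubechies filter, the image borders are extended periodically, and the image side is taken to be divisible by 2 raised to the number of levels. How matrices smaller than that are handled is not stated in the text, and all function names are hypothetical.

```python
import math

S3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Four-tap Daubechies scaling (low-pass) filter and its wavelet (high-pass) mirror.
LOW = [(1 + S3) / NORM, (3 + S3) / NORM, (3 - S3) / NORM, (1 - S3) / NORM]
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]

def analyze(signal):
    """One 1-D decomposition level with periodic extension: (approx, detail)."""
    n = len(signal)
    approx = [sum(LOW[k] * signal[(2 * i + k) % n] for k in range(4)) for i in range(n // 2)]
    detail = [sum(HIGH[k] * signal[(2 * i + k) % n] for k in range(4)) for i in range(n // 2)]
    return approx, detail

def transpose(m):
    return [list(c) for c in zip(*m)]

def decompose_2d(image):
    """One 2-D level: filter the rows, then the columns, giving four subbands."""
    lo_rows, hi_rows = zip(*(analyze(r) for r in image))
    ll, lh = zip(*(analyze(c) for c in transpose(lo_rows)))
    hl, hh = zip(*(analyze(c) for c in transpose(hi_rows)))
    return transpose(ll), transpose(lh), transpose(hl), transpose(hh)

def mean_energy(band):
    values = [v * v for row in band for v in row]
    return sum(values) / len(values)

def wavelet_features(image, levels):
    """Average energy of the three high-frequency subbands at each level."""
    features, current = [], image
    for _ in range(levels):
        current, lh, hl, hh = decompose_2d(current)
        features.extend(mean_energy(b) for b in (lh, hl, hh))
    return features

features = wavelet_features([[1.0] * 8 for _ in range(8)], levels=3)
```

A constant image carries no high-frequency content, so all nine features of the toy call are numerically zero; with levels=10 the same function would return the 30 values mentioned in the text.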

3.3. Classification

The classifier used in this work is a stand-alone radial basis function SVM, which has already been proven to be a good classifier for this problem [19]. SVM is a two-class classifier [47] trained to find the equation of a hyperplane that divides the training set, leaving all the points of the same class on the same side while maximizing the distance between the two classes and the hyperplane. If the training set is not linearly separable, a kernel can be used to map the input vectors into a high-dimensional feature space and construct an optimal hyperplane there, which maximizes the margin. Typical kernels are polynomial kernels and radial basis function kernels; in this paper, we use the radial basis function kernel. The basic two-class SVM formulation takes an implicit embedding Φ and a labeled training set {xi} and returns the hyperplane w^T Φ(x) + b = 0 that best separates the training samples of the two classes (see Fig. 3).
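How a radial basis function kernel stands in for the implicit embedding can be shown with the standard kernel form of the decision function, f(x) = Σ_i c_i K(x_i, x) + b, whose sign gives the predicted class. The sketch below only evaluates such a function; it does not train an SVM, and the support vectors, coefficients, bias and gamma are invented values, not parameters from this paper.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel form of w^T Phi(x) + b, summed over the support vectors."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical model: one support vector per class, coefficients of opposite sign.
svs = [[0.0, 0.0], [2.0, 2.0]]
coefs = [-1.0, 1.0]
near_negative = decision_value([0.0, 0.0], svs, coefs, 0.0, 0.5)
near_positive = decision_value([2.0, 2.0], svs, coefs, 0.0, 0.5)
midpoint = decision_value([1.0, 1.0], svs, coefs, 0.0, 0.5)
```

The embedding Φ never has to be computed explicitly: only kernel values between the input and the support vectors are needed.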

4. Experimental results

4.1. Datasets and protocols

The proposed system has been tested using the following datasets:

- HIV: This dataset [48] is the biggest dataset ever tested for the HIV-1 protease problem. It contains 1625 octamer protein sequences: 374 HIV-1 protease cleavable sites and 1251 uncleavable sites. On this dataset the ten-fold cross-validation testing protocol is used.

- Vaccine (VAC): This dataset [34] contains peptides from five HLA-A2 molecules that bind/non-bind multiple HLA. The testing protocol suggested in [34] has been adopted, which is a “five-molecule” cross-validation, where all the peptides related to a given molecule are used as the testing set and all the peptides related to the other four molecules as the training set (as detailed in Table 1).

- Membrane (MEM): This is the same dataset used in [2] to classify a protein sequence according to its membrane protein type. The protein sequences were collected from the Swiss-Prot database⁴ and, to avoid redundancy among the data, the proteins were screened strictly by a cut-off procedure. The protocol is the one suggested in [2], where the samples are divided into two separate sets for training and testing: the number of training and testing samples for each of the eight types is reported in Table 2.

³ The Matlab code is available at http://www.cs.uiuc.edu/homes/dengcai2/Data/data.html (accessed 15 July 2009).

⁴ Available at http://www.expasy.ch/sprot/ (accessed 15 July 2009).

4.2. Performance indicators

In the first two datasets (HIV and VAC), which are two-class classification problems, the performance is evaluated using the area under the receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) is a scalar performance measure which can be interpreted as the probability that the classifier will assign a higher score to a randomly picked positive sample than to a randomly picked negative sample. It has been shown [50] that AUC is empirically and theoretically better than accuracy, because accuracy does not take the scores of the classifiers into account.
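The probabilistic reading of the AUC translates directly into a pairwise count. The sketch below is a plain quadratic-time illustration with made-up scores; counting ties as one half is a common convention, assumed here.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive sample
    receives the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: three of the four positive/negative pairs are ordered correctly.
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Because only the ordering of the scores matters, the measure is unaffected by the choice of decision threshold.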

In the last dataset (MEM), which is an eight-class classification problem, simple accuracy is used, i.e. the fraction of correctly classified samples.

4.3. DCT parameter optimization

This experiment is aimed at selecting the number of DCT coefficients nC that gives the best classification performance, also varying the selection technique. Table 3 reports the performance obtained by retaining only 10 or 25 DCT coefficients (nC = 10, nC = 25) for both the selection techniques described in Section 3.2 (named ORD and NPE); moreover, some tests with the well-known sequential forward floating selection (SFFS)⁵ [49] are reported. The results show that the performance of NPE is slightly better than that obtained by ORD, but with an increased computational load, and that SFFS outperforms NPE, but with a further increase in computational load. Therefore, in the remaining experiments we use the ORD selection with nC = 25.

⁵ It is implemented as in the PRTools 3.1.7 Matlab Toolbox; due to computational issues, the selection by SFFS is not performed on the MEM dataset. For selecting the features we have performed a ten-fold cross-validation using the training data.

Table 3
Average (avg) and maximum (max) performance obtained on the pool of 494 physicochemical properties (AUC for HIV and VAC, accuracy for MEM) for the DCT descriptor.

Dataset        ORD                   NPE                   SFFS
               nC = 10   nC = 25     nC = 10   nC = 25     nC = 25
HIV    avg     0.872     0.890       0.880     0.895       0.899
       max     0.923     0.943       0.929     0.950       0.955
VAC    avg     0.775     0.800       0.785     0.810       0.820
       max     0.855     0.875       0.865     0.875       0.885
MEM    avg     0.739     0.779       0.740     0.785       –
       max     0.792     0.802       0.800     0.810       –

Fig. 6. Accuracy obtained by the three descriptors (LBP, DCT, DAU) varying the considered physicochemical properties in the MEM dataset.

4.4. Comparisons among the three novel texture descriptors

The first experiment is aimed at evaluating the three novel texture descriptors. The graphs in Figs. 4–6 show the performance of the three texture descriptors on the three datasets as the physicochemical property varies. In the three graphs, the physicochemical properties are ordered according to the performance of the DCT descriptor, which is the best one. Table 4 reports the average and maximum results obtained on the pool of 494 physicochemical properties, which show the superiority of the DCT descriptor on all three datasets.

Fig. 4. AUC obtained by the three descriptors (LBP, DCT, DAU) varying the considered physicochemical properties in the HIV dataset.

Fig. 5. AUC obtained by the three descriptors (LBP, DCT, DAU) varying the considered physicochemical properties in the VAC dataset.

Even though the performance of the new descriptors is clearly lower than that obtained by other approaches [2,51] based on the standard encodings for proteins and peptides, it is well known in the literature [52] that the performance of a classification system can be improved by combining the predictions of multiple classifiers to produce a single classification result. A noticeable performance improvement can be obtained, in particular, if the individual classifiers in the ensemble are both accurate and independent (i.e. they make errors on different regions of the feature space) and are related to different descriptors (i.e. they carry different information). In the following section we compare the performance of LBP, DCT and DAU with other state-of-the-art encodings and propose a fusion between them.

4.5. Fusion with other state-of-the-art encodings

Two well-known and very effective descriptors based on the physicochemical properties of peptides/proteins are the physicochemical encoding representation for peptides (PE) [19] and the short quasi residue couple for proteins (QR) [53]. This experiment is aimed at evaluating the classification performance of an “augmented descriptor” obtained by combining the texture descriptors proposed here with (i) the physicochemical encoding for peptide representation or (ii) the short quasi residue couple descriptor for protein representation. In both cases the same physicochemical property is considered, and two different SVMs are trained, one for each descriptor; the SVMs are finally fused by the weighted sum rule.

Notice that the novel descriptors proposed in this work are well suited for both peptides and proteins, while the encodings considered for comparison are specifically developed for peptides and proteins, respectively. Therefore we report the experimental results obtained by PE on the first two datasets (HIV, VAC) and by QR on the last one (MEM).

Table 4
Average (avg) and maximum (max) performance obtained on the pool of 494 physicochemical properties (AUC for HIV and VAC, accuracy for MEM).

Dataset        Descriptor
               LBP       DCT       DAU
HIV    avg     0.803     0.890*    0.723
       max     0.892     0.943*    0.835
VAC    avg     0.788     0.800*    0.667
       max     0.843     0.875*    0.752
MEM    avg     0.753     0.779*    0.650
       max     0.775     0.802*    0.672

Values marked with * are the highest in each row (set in bold in the original).

The physicochemical encoding representation (PE) was proposed in [19,53]; in this method each amino acid of the sequence is described by a vector x ∈ R^20, with 19 values set to zero and the value related to the position of the amino acid set to the value of the considered physicochemical property.

The short quasi residue couple encoding (QR) was proposed in [53]; this method is a variant of the 2-gram encoding that takes into account the value of a considered physicochemical property p. The 2-grams are histograms which count the occurrences of a couple of amino acids at a fixed distance m in a sequence. The QR encoding for a protein sequence is described by a vector $QR^{p}_{i,j,m} \in \mathbb{R}^{1200}$ obtained by Eq. (1):

$$QR^{p}_{i,j,m} = \frac{1}{L-m} \sum_{k=1}^{L-m} H^{p}_{i,j}(k, k+m) \qquad (1)$$

where the values of $i, j \in [1 \ldots 20]$ represent the 20 different amino acids; $m \in [1 \ldots 3]$ measures the distance between the amino acids of a couple and is called the "rank of the residue couple model"; $H^{p}_{i,j}$ is a function that returns the sum of the values of the fixed physicochemical property p for the amino acids in positions k and k + m if they are equal to i and j, respectively, and 0 otherwise.

The graphs in Figs. 7–9 and Table 5 show a comparison among the performance obtained by the simple PE and QR encodings and the fusion at score level with the NEW = {LBP, DCT or DAU} encodings. The ensembles (named PE + NEW, QR + NEW, 3 × PE + NEW or 3 × QR + NEW) are obtained by simply fusing by a weighted sum rule the scores from two SVM classifiers trained by the given descriptors. As suggested by their names, the scores of PE or QR are multiplied by 3 before the fusion with NEW in the last two ensembles. In these nine graphs, the physicochemical properties are ordered according to the performance of the PE descriptor, which is our baseline.

Fig. 7. AUC obtained in the HIV dataset by six ensembles and PE varying the considered physicochemical property.

Fig. 8. AUC obtained in the VAC dataset by six ensembles and PE varying the considered physicochemical property.

Fig. 9. Accuracy obtained in the MEM dataset by six ensembles and PE varying the considered physicochemical property.

Table 5. Average (avg) and maximum (max) performance obtained on the pool of 494 physicochemical properties (AUC for HIV and VAC, accuracy for MEM).

Dataset        Method
               PE      PE + LBP   3×PE + LBP   PE + DCT   3×PE + DCT   PE + DAU   3×PE + DAU
HIV  avg       0.985   0.983      0.986        0.982      0.986        0.984      0.985
     max       0.992   0.991      0.993        0.991      0.992        0.991      0.992
VAC  avg       0.865   0.881      0.874        0.872      0.872        0.862      0.865
     max       0.884   0.905      0.893        0.894      0.887        0.874      0.878
MEM  avg       0.901   0.890      0.903        0.910      0.907        0.892      0.904
     max       0.909   0.904      0.914        0.920      0.919        0.904      0.914

The bold values represent the highest values in each row.

The results in the above graphs show that, even if LBP gives a lower performance than DCT when considered as a stand-alone descriptor, it gives better performance when considered for the fusion with PE or QR. This result is probably due to the low correlation between the two descriptors.
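For reference, the QR encoding of Eq. (1), on which the QR-based ensembles are built, can be sketched as follows. This is a minimal illustration under stated assumptions: L is taken to be the sequence length, the residue ordering and the layout of the 1200 slots (20 × 20 couples for each of the 3 ranks, rank-major) are arbitrary choices, and `qr_encode` is a hypothetical name rather than the authors' implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
INDEX = {aa: n for n, aa in enumerate(AMINO_ACIDS)}


def qr_encode(sequence, prop, max_rank=3):
    """Quasi residue couple encoding following Eq. (1)."""
    n = len(AMINO_ACIDS)
    features = [0.0] * (n * n * max_rank)  # 20 * 20 * 3 = 1200 slots
    length = len(sequence)  # L in Eq. (1), assumed to be the sequence length
    for m in range(1, max_rank + 1):
        if length <= m:
            continue  # no couple at distance m exists in this sequence
        norm = 1.0 / (length - m)
        for k in range(length - m):
            a, b = sequence[k], sequence[k + m]
            # H is non-zero only for the couple (i, j) actually observed at
            # positions k and k + m; it returns the sum of the two values.
            slot = (m - 1) * n * n + INDEX[a] * n + INDEX[b]
            features[slot] += norm * (prop[a] + prop[b])
    return features


# Hypothetical property: every residue has value 1.0 (illustration only).
toy_property = {aa: 1.0 for aa in AMINO_ACIDS}
qr = qr_encode("ACAC", toy_property)
print(len(qr))
```

Unlike PE, the length of the resulting vector does not depend on the sequence length, which is why the method is applicable to proteins of different sizes.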

Since the state-of-the-art descriptors PE and QR obtain better classification results than NEW, it is reasonable to give them a higher weight in the fusion, as suggested by the better performance of 3 × PE + DCT with respect to PE + DCT on the HIV dataset (average AUC of 0.986 versus 0.982 in Table 5).
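The weighted sum rule used by these ensembles amounts to a few lines of code. The sketch below assumes that the two classifiers output real-valued scores on comparable scales for the same test patterns; the paper does not state here whether any score normalization is applied, and the function name `fuse_scores` is hypothetical.

```python
def fuse_scores(baseline_scores, new_scores, baseline_weight=1.0):
    """Weighted sum rule: baseline_weight * baseline score + NEW score."""
    if len(baseline_scores) != len(new_scores):
        raise ValueError("both classifiers must score the same patterns")
    return [baseline_weight * b + s
            for b, s in zip(baseline_scores, new_scores)]


# Illustrative scores only; a weight of 3 mimics the 3 x PE + NEW ensembles.
pe_scores = [0.2, -0.5]
new_scores = [0.1, 0.3]
fused = fuse_scores(pe_scores, new_scores, baseline_weight=3.0)
print(fused)
```

With `baseline_weight=1.0` the function reduces to the plain sum rule of the PE + NEW and QR + NEW ensembles.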

4.6. Selection of the best physicochemical properties

All the results reported in the above section have been obtained considering only one physicochemical property at a time. This experiment is aimed at selecting the best 10 physicochemical properties and designing a multi-classifier which combines the descriptors obtained from these 10 properties. The selection is performed according to the method proposed in [51], which uses Pudil's selection procedure. The following results have been obtained on the HIV dataset (which is the most used in the literature among the tested datasets), using the following double cross-validation testing protocol. First, the dataset is randomly divided into ten equally sized subsets $D_i$; then, ten new datasets $N_i$ are generated, each by removing one of the $D_i$ subsets from the original set. In each of the $N_i$ datasets, ten-fold cross-validation is used for finding the parameters of our method; the subset $D_i$ is then classified using $N_i$.
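The double cross-validation protocol can be sketched in plain Python. The sketch works on sample indices only; `evaluate(param, train_idx, test_idx)` is a hypothetical callback that trains with the given parameter on `train_idx` and returns a score (higher is better) on `test_idx`, and `candidates` stands for whatever parameter settings are being compared. None of these names come from the paper.

```python
import random


def split_folds(indices, n_folds, rng):
    """Shuffle the indices and deal them into n_folds near-equal subsets."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[f::n_folds] for f in range(n_folds)]


def double_cross_validation(n_samples, candidates, evaluate, n_folds=10, seed=0):
    """Return one (chosen parameter, score on D_i) pair per outer fold."""
    rng = random.Random(seed)
    outer = split_folds(range(n_samples), n_folds, rng)  # the subsets D_i
    results = []
    for i, d_i in enumerate(outer):
        # N_i is the original set with D_i removed.
        n_i = [idx for f, fold in enumerate(outer) if f != i for idx in fold]
        inner = split_folds(n_i, n_folds, rng)

        def inner_score(param):
            # Ten-fold cross-validation restricted to N_i.
            total = 0.0
            for j, held_out in enumerate(inner):
                train = [idx for f, fold in enumerate(inner)
                         if f != j for idx in fold]
                total += evaluate(param, train, held_out)
            return total / n_folds

        best = max(candidates, key=inner_score)
        # D_i is used only here, after the parameter has been chosen.
        results.append((best, evaluate(best, n_i, d_i)))
    return results
```

The point of the design is that $D_i$ never influences the choice of parameters, so the score measured on it is not optimistically biased by the selection step.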

In the following we denote as PE(10) the system obtained selecting the best 10 physicochemical properties for a physicochemical encoding representation, which is the system described in [51], and as 3 × PE + DCT(10) the system obtained selecting the best 10 physicochemical properties for the novel ensemble named 3 × PE + DCT. In both cases the 10 selected classifiers are combined by sum rule.

In Table 6 the two ensembles PE(10) and 3 × PE + DCT(10) are compared with the standard orthogonal descriptor (OR) [24], in which the amino acids of the sequence are described by a vector $x \in \mathbb{R}^{20}$, with 19 values set to zero and the value related to the position of the amino acid set to 1. The results show a performance improvement with respect to the state-of-the-art approaches in the HIV protease problem.

Table 6. AUC and its standard deviation (std) obtained by three methods in the HIV dataset.

Dataset        Method
               OR       PE(10)   3×PE + DCT(10)
HIV  AUC       0.986    0.988    0.990
     std       0.0081   0.0075   0.0042

The bold values represent the highest values in each row.

It is interesting to note that the performance of the 10 best physicochemical properties selected on a training set is lower than that of the best single physicochemical property (chosen on the whole dataset, see Table 5, max). This behavior is not due to the number of selected properties (10 instead of 1), since in our experiment the best physicochemical property selected on a training set gives a lower performance of 0.986 in both cases (PE(1) and 3 × PE + DCT(1)). Therefore, a possible explanation is that the testing protocol used for the selection of the best physicochemical properties (ten-fold cross-validation) is probably not reliable enough to discover the hidden characteristics of the training set. Unfortunately, the leave-one-out testing protocol, which could solve this problem, is unfeasible from a computational point of view.

5. Conclusion

In this paper, we have presented a novel method for describing peptides/proteins, based on the calculation of texture descriptors from a matrix representation of the peptides/proteins. The novel method is based on the selection of a physicochemical property which is used to construct a representation of the peptide/protein as a matrix (using a method based on the Hasse matrix); this matrix representation is considered as an image and several texture descriptors are extracted and used for representing the peptide/protein.

These novel features are tested as stand-alone descriptors, showing that the DCT is the best descriptor for our matrix-based representation; moreover, they have been combined with other state-of-the-art vector-based features in an "augmented feature vector" to improve their performance.

Our tests on three different datasets (HIV-1 protease cleavage site prediction; prediction of peptides that bind HLA; membrane protein types) show that our augmented feature vector outperforms the considered vector-based feature extractor (physicochemical encoding representation when the patterns are peptides and short quasi residue couple method when the patterns are proteins).

As a future research direction, we plan to study more deeply the utility of a feature selection/transformation step in the proposed approach, since our preliminary results reported in Section 4.3 show that the performance obtained using a feature transformation technique (NPE) or a feature selection technique (SFFS) is slightly better than that obtained by a simple feature selection (ORD), even if with increased computational load.

References

[1] Chou KC, Zhang CT. Review: prediction of protein structural classes. Crit Rev Biochem Mol Biol 1995;30:275–349.

[2] Chou KC, Shen HB. MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem Biophys Res Commun 2007;360:339–45.

[3] Nanni L, Lumini A. An ensemble of k-local hyperplane for predicting protein–protein interactions. BioInformatics 2006;(22):1207–10.


[4] Nanni L. Comparison among feature extraction methods for HIV-1 protease cleavage site prediction. Pattern Recognit 2006;39(April (4)):711–3.

[5] Nanni L, Lumini A. A genetic approach for building different alphabets for peptide and protein classification. BMC Bioinform 2008;9:45.

[6] Chou KC, Cai YD. Predicting protein–protein interactions from sequences in a hybridization space. J Proteome Res 2006;5:316–22.

[7] Wang M, Yang J, Liu GP, Xu ZJ, Chou KC. Weighted-support vector machines for predicting membrane protein types based on pseudo amino acid composition. Protein Eng Design Select 2004;17:509–16.

[8] Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2005;21:10–9.

[9] Chou KC. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem Biophys Res Commun 2000;278:477–83.

[10] Gao Y, Shao SH, Xiao X, Ding YS, Huang YS, Huang ZD, et al. Using pseudo amino acid composition to predict protein subcellular location: approached with Lyapunov index, Bessel function, and Chebyshev filter. Amino Acids 2005;28:373–6.

[11] Xiao X, Shao S, Ding Y, Huang Z, Huang Y, Chou KC. Using complexity measure factor to predict protein subcellular location. Amino Acids 2005;28:57–61.

[12] Xiao X, Shao SH, Huang ZD, Chou KC. Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor. J Comput Chem 2006;27:478–82.

[13] Xiao X, Chou KC. Digital coding of amino acids based on hydrophobic index. Protein Peptide Lett 2007;14:871–5.

[14] Liu H, Wang M, Chou KC. Low-frequency Fourier spectrum for predicting membrane protein types. Biochem Biophys Res Commun 2005;336:737–9.

[15] Xiao X, Shao SH, Ding YS, Huang ZD, Chou KC. Using cellular automata images and pseudo amino acid composition to predict protein subcellular location. Amino Acids 2006;30:49–54.

[16] Nanni L, Lumini A. Genetic programming for creating Chou's pseudo amino acid based features for submitochondria localization. Amino Acids 2008;34(May (4)):653–60.

[17] Chou KC, Shen HB. Review: recent progresses in protein subcellular location prediction. Anal Biochem 2007;370:1–16.

[18] Nanni L, Lumini A. Using ensemble of classifiers in bioinformatics. In: Machine learning research progress. Nova Publishers; 2009. ISBN: 978-1-60456-646-8.

[19] Nanni L, Lumini A. MppS: an ensemble of support vector machine based on multiple physicochemical properties of amino-acids. NeuroComputing 2006;69(August (13)):1688–90.

[20] Jaakkola T, Diekhans M, Haussler D. Using the Fisher kernel method to detect remote protein homologies. In: Proceedings of the seventh international conference on intelligent systems for molecular biology. AAAI Press; 1999. p. 149–58.

[21] Leslie CS, Eskin E, Cohen A, Weston J, Noble WS. Mismatch string kernels for discriminative protein classification. Bioinformatics 2004;20:467–76.

[22] Lei Z, Dai Y. An SVM-based system for predicting protein subnuclear localizations. BMC Bioinformatics 2005;6:291.

[23] Yang ZR, Thomson R. Bio-basis function neural network for prediction of protease cleavage sites in proteins. IEEE Trans Neural Netw 2005;16:263–74.

[24] Rognvaldsson T, You L. Why neural networks should not be used for HIV-1 protease cleavage site prediction. Bioinformatics 2003;1702–9.

[25] Chou KC. A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins. J Biol Chem 1993;268:16938–4.

[26] Chou KC. Review: prediction of HIV protease cleavage sites in proteins. Anal Biochem 1996;233:1–14.

[27] Thompson TB, Chou KC, Zheng C. Neural network prediction of the HIV-1 protease cleavage sites. J Theor Biol 1995;177:369–79.

[28] Cai YD, Chou KC. Artificial neural network model for predicting HIV protease cleavage sites in protein. Adv Eng Soft 1998;29:119–28.

[29] Narayanan A, Wu X, Yang Z. Mining viral protease data to extract cleavage knowledge. Bioinformatics 2002;18:S5–13.

[30] Cai YD, Liu XJ, Xu XB, Chou KC. Support vector machines for predicting HIV protease cleavage sites in protein. J Comput Chem 2002;23:267–74.

[31] Shen HB, Chou KC. HIVcleave: a web-server for predicting HIV protease cleavage sites in proteins. Anal Biochem 2008;375:388–90.

[32] Brusic V. Prediction of promiscuous peptides that bind HLA class I molecules. Immunol Cell Biol 2002;80:280–5.

[33] Brusic V, Bajic VB, Petrovsky N. Computational methods for prediction of T-cell epitopes a framework for modelling, testing, and applications. Methods 2004;(34):436–43.

[34] Bozic I, Zhang GL, Brusic V. Predictive vaccinology: optimization of predictions using support vector machine classifiers. Intell Data Eng Autom Learn 2005;2005:375–81.

[35] Lumini A, Nanni L. Machine learning multi-classifiers for peptide classification. Neural Comput Appl 2008;18(2):185–92.

[36] Chou KC, Elrod DW. Protein subcellular location prediction. Protein Eng 1999;12:107–18.

[37] Lodish H, Baltimore D, Berk A, Zipursky SL, Matsudaira P, Darnell J. Molecular cell biology, chapter 3, 3rd ed. New York: Scientific American Books; 1995.

[38] Pu X, Guo J, Leung H, Lin Y. Prediction of membrane protein types from sequences and position-specific scoring matrices. J Theor Biol 2007;247:259–65.

[39] Shen HB, Chou KC. Using ensemble classifier to identify membrane protein types. Amino Acids 2007;32:483–8.

[40] Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res 2000;20(1):374.

[41] Feng J, Wang T-M. Characterization of protein primary sequences based on partial ordering. J Theor Biol 2008. doi: 10.1016/j.jtbi.2008.07.007.

[42] Ojala T, Pietikainen M, Maeenpaa T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans Pattern Anal Mach Intell 2002;24(7):971–87.

[43] Pan Z, Rust A, Bolouri H. Image redundancy reduction for neural network classification using discrete cosine transforms. In: Proceedings of the international joint conference on neural networks, vol. 3; 2000. p. 149–54.

[44] He X, Cai D, Yan S, Zhang HJ. Neighborhood preserving embedding. In: Tenth IEEE international conference on computer vision (ICCV'2005), vol. 2; 2005. p. 1208–13.

[45] Manjunath BS, Ma WY. Texture features for browsing and retrieval of image data. IEEE Trans Pattern Anal Mach Intell 1996;8:837–42.

[46] Daubechies I. Orthonormal bases of compactly supported wavelets. Comm Pure Appl Math 1988;41:909–96.

[47] Duda RO, Hart PE, Stork D. Pattern classification, 2nd ed. Wiley; 2000.

[48] Kontijevskis A, Wikberg JES, Komorowski J. Computational proteomics analysis of HIV-1 protease interactome. Proteins Struct Funct Bioinform 2007;(1):305–12.

[49] Pudil P, Novovicova J, Kittler J. Floating search methods in feature selection. Pattern Recognit Lett 1994;15(November (11)):1119–25.

[50] Qin ZC. ROC analysis for predictions made by probabilistic classifiers. In: Proceedings of the fourth international conference on machine learning and cybernetics, vol. 5; 2006. p. 3119–312.

[51] Nanni L, Lumini A. Using ensemble of classifiers for predicting HIV protease cleavage sites in proteins. Amino Acids 2009;36(3):409–16.

[52] Kuncheva LI. Diversity in multiple classifier systems. Inform Fusion 2005;6(1):3–4.

[53] Nanni L, Lumini A. An ensemble of support vector machines for predicting the membrane proteins type directly from the amino acid sequence. Amino Acids 2008;35(3):573–80.

