Fast overlapping of protein contact maps by alignment of eigenvectors


Pietro Di Lena, Piero Fariselli, Luciano Margara, Marco Vassura, Rita Casadio

Technical Report UBLCS-2010-01

January 2010

Department of Computer Science, University of Bologna

Mura Anteo Zamboni 7, 40127 Bologna (Italy)

The University of Bologna Department of Computer Science Research Technical Reports are available in PDF and gzipped PostScript formats via anonymous FTP from the area ftp.cs.unibo.it:/pub/TR/UBLCS or via WWW at URL http://www.cs.unibo.it/. Plain-text abstracts organized by year are available in the directory ABSTRACTS.



Pietro Di Lena (1), Piero Fariselli (2), Luciano Margara (1), Marco Vassura (1), Rita Casadio (2)


Abstract

In the genomic era, fast and reliable methods for protein structure comparison are needed. The maximum contact map overlap (CMO) is a measure of protein structure similarity. Exact methods are known for the maximum CMO problem, but they are exponential in the worst case and not applicable to large-scale comparison of protein structures. In this paper we present a heuristic algorithm for the CMO problem. Our approach relies on the property that a contact map can be approximated by a fraction of its eigenvectors. We can heuristically obtain good overlaps of two contact maps by computing the optimal global alignment of just a few of their principal eigenvectors. Our algorithm is simple and fast, and its computing time does not depend on the threshold adopted to represent the contact maps. Experimental results show that it is comparable to exact CMO methods in terms of the quality of the overlap and to structural alignment methods in terms of protein structure comparison. Moreover, our algorithm is fast enough to be used for large-scale comparison of protein structures.

1. Department of Computer Science, University of Bologna, Mura Anteo Zamboni 7, 40127 Bologna, Italy.

2. Biocomputing Group, Department of Biology, University of Bologna, Via S. Giacomo 9/2, 40127 Bologna, Italy.


1 Introduction

Proteins are the most abundant macromolecules in living systems and serve important functions in essentially all biological processes. Proteins spontaneously fold into a unique characteristic three-dimensional (3D) conformation, which determines their biological role and which is crucial for their correct functioning. It is generally assumed that protein structures are more conserved than sequences through evolution and that proteins sharing similar folds are likely to have similar biological functions and common ancestors. This assumption is the basis of several heuristic methods that try to predict the structure of a protein from its sequence [14].

Protein structural comparison attempts to establish equivalence between protein molecules by measuring the similarity of their 3D structures. Structural comparison is a useful tool for the evaluation of protein equivalence in the presence of low sequence similarity, when the amount of structural similarity cannot be directly inferred by sequence alignment. There are several similarity measures for protein structures, but no general agreement on the best one has been achieved [7]. Every similarity measure relies on the choice of a scoring function and on the assumption that its optimum corresponds to the best possible match between two protein structures. The most widely adopted scoring measures are based on the root mean square deviation (RMSD) [12], distance map similarity [9] and contact map overlap (CMO) [6].

A protein contact map is a binary symmetric matrix whose entry (i, j) is 1 if and only if the Euclidean distance between the i-th and the j-th residues of the protein is below some given threshold. The CMO measure quantifies the level of similarity between two protein structures by measuring the maximum overlap of their contact maps. The maximum overlap is obtained by computing the sequence alignment that maximizes the number of corresponding contacts between pairs of aligned residues. The maximum CMO is one of the few measures for which exact algorithms are known. On the other hand, the maximum CMO problem is known to be NP-hard [8].

The first exact algorithm for the maximum CMO problem, based on integer programming (IP), was developed in [13] and improved in [4]. Later, several other methods based on the same approach were proposed [18], [20], [1]. The IP approach consists in formulating the CMO as the maximization of some integer linear function and solving it with Lagrangian Relaxation (LR) and/or Branch & Bound reduction techniques. The disadvantage of IP-based methods is that, due to the intractability of the problem, they are exponential in the worst case. In practice, the running time of these algorithms is bounded and the best solution found within the time limit is returned. On the other hand, these methods provide upper and lower bounds on the optimal solution, which makes it possible to evaluate the quality of the partial solution computed (i.e. the distance between the upper and the lower bounds) and to detect whether the best possible overlap has been found (i.e. whether the upper and lower bounds coincide). Recently, a polynomial-time approximation scheme for the protein structure alignment problem (in particular, contact map alignment) has been developed [21]. A polynomial-time approximation scheme is an approximation algorithm that, for every ε > 0, produces a solution that is within an ε factor of the optimal. The approximation algorithm described in [21] is polynomial in the protein size, but it is exponential with respect to some constant parameters and its running time increases as ε decreases.

Despite the strength of the underlying formalization, CMO-based algorithms are rarely used to compare protein structures, and structural alignment methods (i.e. methods based on RMSD and distance map similarity measures) are generally preferred. Indeed, the implementations of the exact and approximation methods are on average too slow to be used for wide-scale comparison of protein structures. Furthermore, and most importantly, there is no agreement on the most suitable contact threshold to represent a protein structure [5]. Higher-threshold contact maps are more informative, but they impose a much higher number of constraints and, again, this makes the adoption of exact/approximation methods infeasible. To our knowledge, only two heuristic methods, MSVNS [16] and SADP [11], have been proposed so far for the CMO problem. Even if not optimal, MSVNS and SADP can produce, in a reasonable time, solutions for the CMO problem that are acceptable compared to those of exact methods.


In this paper we describe a new heuristic algorithm for the CMO problem. Our approach is based on the property that a contact map can be well approximated by a fraction of its eigenvectors. Thus, an acceptable overlap of two contact maps can be heuristically obtained by performing a global alignment of just a few of their eigenvectors. Our algorithm is easy to implement, fast and, more importantly, by construction, its running time does not depend on the contact threshold. Experimental results show that it can compute good overlaps compared to exact CMO methods and that its performances in terms of protein structure recognition/classification are comparable with those of structural alignment methods. In terms of computational time, our implementation is comparable to the fastest structural alignment algorithms and it is much faster than the heuristic CMO method MSVNS (SADP is not publicly available, thus it was not possible to compare its performances as well).

The paper is organized as follows. In Section 2 we formally define the CMO problem. In Section 3 we describe our approach in detail. Sections 4 and 5 are devoted to the experimental results and the conclusions, respectively.

2 Background

2.1 Contact maps

A protein contact map is a two-dimensional approximation of the protein three-dimensional structure. For a given protein P, its contact map of threshold τ is a square binary symmetric matrix defined by

\[
M^{P}_{ij} =
\begin{cases}
1 & \text{if the distance between residues } i, j \text{ is } \le \tau\ \text{\AA} \\
0 & \text{otherwise}
\end{cases}
\]

There are several definitions of distance between residues in the literature. The particular choice of a distance is not critical, since the CMO problem is independent of the distance used to represent contacts between residues.

Following the CMO literature, we consider here the Cα distance, which defines the distance between residues i, j as the Euclidean distance between the coordinates of their respective Cα atoms. Typical threshold values for Cα contact maps vary from 6 Å to 16 Å. For this range of thresholds, consecutive residues are always in contact (consecutive residues share a peptide bond and the distance between their respective Cα atoms is about 3.7 Å). For low threshold values, typically 6-9 Å, the number of contacts (i.e. 1s) observed in the map is small compared to the number of non-contacts (i.e. 0s). Moreover, these threshold values are the ones that minimize the distance between Cα contact maps and physical contact maps [3]. On the contrary, higher-threshold contact maps, 10-16 Å, have a higher number of contacts and are more informative about the protein structure: for lower threshold values there can be several different three-dimensional structures consistent with the same contact map, and this ambiguity is reduced by increasing the threshold of the map [19]. The threshold problem was also noticed in [5], where the authors report that a threshold value smaller than 7 Å was not suitable to represent the protein structures in their benchmark set.
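The definition above can be illustrated with a minimal sketch, assuming NumPy and a list of Cα coordinates given in ångströms. The function name `contact_map` and the four-point toy chain are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def contact_map(ca_coords, threshold):
    """Binary symmetric contact map from C-alpha coordinates (in angstroms)."""
    coords = np.asarray(ca_coords, dtype=float)      # shape (n, 3)
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise difference vectors
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # Euclidean distance matrix
    return (dist <= threshold).astype(int)

# Toy chain (not a real protein): four points spaced 3.8 angstroms apart on a line.
toy = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]]
M = contact_map(toy, threshold=7.5)
```

With this toy input only identical and consecutive positions are in contact, since the next-nearest spacing (7.6 Å) exceeds the 7.5 Å threshold.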

2.2 The maximum CMO problem

Given two proteins $P_1, P_2$, whose (ordered) sets of residues are denoted respectively by $R_1 = \{1, ..., n\}$ and $R_2 = \{1, ..., m\}$, an alignment between $P_1$ and $P_2$ is a mapping $f : R_1 \to R_2$ that respects the following two conditions:

1. f is an injective partial function,

2. for each pair of residues $i, j \in R_1$ in the domain of f (i.e. $f(i) \neq \emptyset \neq f(j)$) we have that $i < j$ if and only if $f(i) < f(j)$.


Condition 1 imposes that a residue in the first (second) protein can be aligned with at most one residue, possibly none, in the second (first) protein. The non-aligned residues are assumed to be matched with gaps. Biologically, the introduction of a gap reflects an insertion/deletion event during the evolution of protein sequences. The number of gaps in an alignment f is defined by

\[
\mathrm{gap}_f = |\{ i \mid i \in R_1,\ f(i) = \emptyset \}| + |\{ i \mid i \in R_2,\ f^{-1}(i) = \emptyset \}|.
\]

Condition 2 imposes that the ordering of the residues be preserved in the alignment.

The maximum CMO for proteins $P_1, P_2$ is an alignment that maximizes the overlap between their respective contact maps $M^{P_1}, M^{P_2}$. More formally, the maximum CMO problem is defined as the problem of computing the alignment f that maximizes the quantity

\[
O(M^{P_1}, M^{P_2}) = \sum_{\substack{f(i) \neq \emptyset \neq f(j) \\ j > i+1,\ f(j) > f(i)+1}} M^{P_1}_{ij} \cdot M^{P_2}_{f(i)f(j)} \tag{1}
\]

Note that, since contacts between consecutive amino acids are always present, they are not counted in (1). Moreover, note that a match between a contact and a non-contact is not penalized in (1).

The CMO can be used as a measure of the similarity between two protein structures: the higher the overlap between two contact maps, the higher the probability that the two related protein structures are similar. The CMO measure is quite robust to perturbations and does not heavily penalize the insertion of gaps and deletions. The CMO as a measure of similarity was introduced in [6]. The problem of computing the maximum CMO was proven to be NP-hard in [8]. To quantify the level of similarity of two overlapped contact maps we use the most widely adopted scoring function, originally proposed in [20]:

\[
\frac{2 \cdot O(M^{P_1}, M^{P_2})}{C(M^{P_1}) + C(M^{P_2})} \tag{2}
\]

where $C(M) = \sum_{j > i+1} M_{ij}$ denotes the number of contacts in the contact map M.
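The definitions of this section can be made concrete with a short sketch, assuming NumPy, 0-based residue indices and an alignment stored as a Python dict that contains only the aligned residues. All function names and the toy maps are hypothetical illustrations, not the authors' code.

```python
import numpy as np

def is_valid_alignment(f, n, m):
    """Conditions 1 and 2: f is an injective, order-preserving partial map."""
    pairs = sorted(f.items())
    in_range = all(0 <= i < n and 0 <= j < m for i, j in pairs)
    # Strictly increasing images give both injectivity and order preservation.
    increasing = all(b[1] > a[1] for a, b in zip(pairs, pairs[1:]))
    return in_range and increasing

def gap_count(f, n, m):
    """Unaligned residues of P1 plus unaligned residues of P2 (f assumed valid)."""
    return (n - len(f)) + (m - len(f))

def overlap(M1, M2, f):
    """Equation (1): shared contacts between non-consecutive aligned residues."""
    pairs = sorted(f.items())
    total = 0
    for a, (i, fi) in enumerate(pairs):
        for j, fj in pairs[a + 1:]:
            if j > i + 1 and fj > fi + 1:
                total += int(M1[i][j]) * int(M2[fi][fj])
    return total

def contacts(M):
    """C(M): number of contacts with j > i + 1."""
    return int(np.triu(np.asarray(M), k=2).sum())

def cmo_score(M1, M2, f):
    """Equation (2); undefined when neither map has a non-consecutive contact."""
    return 2 * overlap(M1, M2, f) / (contacts(M1) + contacts(M2))

def toy_map(n, extra):
    """Toy contact map: diagonal, consecutive contacts and a few extra contacts."""
    M = np.eye(n, dtype=int)
    for i in range(n - 1):
        M[i, i + 1] = M[i + 1, i] = 1
    for i, j in extra:
        M[i, j] = M[j, i] = 1
    return M

M1 = toy_map(4, [(0, 2), (1, 3)])
M2 = toy_map(5, [(0, 2), (2, 4)])
f = {0: 0, 1: 1, 2: 2, 3: 4}   # residue 3 of the second protein is left unaligned
```

In this toy case only the contact (0, 2) is shared, so the overlap is 1 and the normalised score is 2 · 1 / (2 + 2) = 0.5.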

3 Materials and methods

3.1 Data sets

We considered two sets of protein domains, one for each of the two tests performed.

The first set, here referred to as the Skolnick dataset (see Table 6), was originally suggested by J. Skolnick and is used as a standard benchmark to evaluate the overlap quality of CMO algorithms. It contains 40 small domains (between 97 and 256 residues, with mean 160±43) from 33 proteins, distributed in five SCOP families.

The second set, referred to as the Proteus300 dataset (see Table 7), was originally used in [1] to evaluate the recognition/classification performances of their CMO algorithm. It contains 300 protein domains (between 64 and 465 residues, with mean 193±69), distributed in 30 distinct SCOP families (10 domains per family), 27 distinct super families and 24 distinct folds.

In order to compare the overlap quality with the results described in [1], we used a threshold of 7.5 Å. For the recognition/classification experiments we adopted several different thresholds.

3.2 CMO by alignment of eigenvectors

The maximum CMO problem involves the alignment of two-dimensional objects. While there are optimal polynomial-time algorithms for alignment in one-dimensional space, in two dimensions the problem is not tractable. Here we show a heuristic method to obtain a two-dimensional alignment of contact maps through the one-dimensional alignment of their eigenvectors. Our approach uses standard techniques such as the canonical eigendecomposition of symmetric matrices and the Needleman-Wunsch alignment algorithm.

Spectral theory provides conditions under which a matrix can be decomposed into a canonical form in terms of eigenvalues and eigenvectors. This canonical decomposition is usually called eigendecomposition or spectral decomposition. By the spectral theorem, every real n × n symmetric matrix M can be eigendecomposed as

\[
M = \sum_{i=1}^{n} \lambda_i (v_i \otimes v_i) \tag{3}
\]

where $\lambda_i$ represents the i-th eigenvalue, $v_i$ the corresponding eigenvector and $\otimes$ denotes the outer product between vectors. The ordering of the eigenvalues is not important provided that the eigenvectors are permuted accordingly, so we can always assume that the eigenvalues are sorted in decreasing order, i.e. $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_n$. Note that equation (3) defines matrix M as the sum of the n × n matrices $v_i \otimes v_i$, $1 \le i \le n$, weighted by the corresponding eigenvalues $\lambda_i$. In practice, a contact map can be approximated by considering only a fraction of its eigenvectors/eigenvalues. For instance, for $1 \le t \le n$, the approximation of order t of M can be defined as

\[
M \approx \sum_{i=1}^{t} \lambda_i (v_i \otimes v_i) \tag{4}
\]

This way of approximating a contact map is effective because the smaller the eigenvalue $\lambda_i$, the smaller the contribution of matrix $v_i \otimes v_i$ in equation (3). This is actually one of the approaches used for image data compression [2].
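The truncated sum can be sketched as follows, assuming NumPy's `numpy.linalg.eigh` for the symmetric eigendecomposition. The helper names and the random symmetric 0/1 test matrix are hypothetical; the matrix is only a stand-in for a contact map.

```python
import numpy as np

def eig_sorted(M):
    """Eigenvalues in decreasing order with the matching eigenvectors as columns."""
    lam, vec = np.linalg.eigh(np.asarray(M, dtype=float))  # eigh returns ascending order
    order = np.argsort(lam)[::-1]
    return lam[order], vec[:, order]

def approx(M, t):
    """Order-t approximation: first t terms lambda_i * (v_i outer v_i)."""
    lam, vec = eig_sorted(M)
    return (vec[:, :t] * lam[:t]) @ vec[:, :t].T

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(8, 8))
M = np.triu(A, 1)
M = M + M.T + np.eye(8, dtype=int)      # random symmetric 0/1 toy matrix

# Frobenius-norm error of the approximation for t = 1..n.
errors = [np.linalg.norm(M - approx(M, t)) for t in range(1, 9)]
```

The error cannot increase with t, because each added term removes one squared eigenvalue from the squared Frobenius error, and it vanishes (up to rounding) at t = n, where the sum is the full decomposition (3).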

Consider now two proteins $P_1, P_2$ with contact maps $M^{P_1} \in \{0, 1\}^{n \times n}$, $M^{P_2} \in \{0, 1\}^{m \times m}$, respectively. For some given $1 \le t \le \min\{n, m\}$, we can heuristically compute an overlap between $M^{P_1}$ and $M^{P_2}$ by computing an alignment that maximizes a suitable scoring function defined on their respective t eigenvectors $u_1, ..., u_t$ and $v_1, ..., v_t$. Since the scoring function (1) does not penalize the possible match of a contact with a non-contact, a global alignment of eigenvectors is preferred to a local alignment.

The Needleman-Wunsch algorithm (NW) [15] computes in polynomial time the optimal global alignment of two sequences with respect to some scoring matrix $S \in \mathbb{R}^{n \times m}$ and constant gap penalty $G \in \mathbb{R}$. The entry $S_{ij}$ of the scoring matrix denotes the level of similarity between the i-th residue of $P_1$ and the j-th residue of $P_2$. The constant value G defines the cost of introducing a gap in the alignment. The NW algorithm can be easily modified to encode non-constant gap penalties. In this work we consider constant gap penalties only. Formally, the NW algorithm computes the alignment $f : R_1 \to R_2$ that maximizes the objective function

\[
\sum_{\substack{i=1 \\ f(i) \neq \emptyset}}^{n} S_{i f(i)} + G \cdot \mathrm{gap}_f
\]

To describe our implementation of the Needleman-Wunsch algorithm we just need to describe the scoring function and the constant gap penalty used.
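For reference, a textbook dynamic-programming sketch of global alignment with a score matrix and a constant gap penalty is given below, assuming NumPy and 0-based indices. It maximizes the objective function above; it is not the authors' implementation, and the function name and the toy score matrix are hypothetical.

```python
import numpy as np

def needleman_wunsch(S, G):
    """Maximise the sum of S[i, f(i)] plus G times the number of unaligned residues.

    S is an n x m score matrix and G a constant (typically non-positive) gap penalty.
    Returns (score, f), where f maps 0-based residues of P1 to residues of P2.
    """
    S = np.asarray(S, dtype=float)
    n, m = S.shape
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = G * np.arange(n + 1)   # prefix of P1 matched only with gaps
    D[0, :] = G * np.arange(m + 1)   # prefix of P2 matched only with gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = max(D[i - 1, j - 1] + S[i - 1, j - 1],  # align i with j
                          D[i - 1, j] + G,                    # residue i of P1 unaligned
                          D[i, j - 1] + G)                    # residue j of P2 unaligned
    # Traceback from the bottom-right corner of the table.
    f, i, j = {}, n, m
    while i > 0 and j > 0:
        if D[i, j] == D[i - 1, j - 1] + S[i - 1, j - 1]:
            f[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif D[i, j] == D[i - 1, j] + G:
            i -= 1
        else:
            j -= 1
    return D[n, m], f

score, f = needleman_wunsch([[2, -1, -1], [-1, -1, 3]], G=-1)
```

Filling the table takes O(nm) time, which is the per-alignment cost used in the complexity estimate of this section.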

By equation (3), for residues i, j of protein $P_1$ the quantity

\[
\lambda_1 (u_1)_i (u_1)_j + \dots + \lambda_t (u_t)_i (u_t)_j \tag{5}
\]

tends to 1 as t increases if i, j are in contact, and to 0 otherwise. Note that, in the quantity (5), the contribution of each product $(u_k)_i (u_k)_j$ is weighted by the corresponding eigenvalue $\lambda_k$. Moreover, when $\lambda_k$ is positive, such a contribution is positive if and only if $(u_k)_i$ and $(u_k)_j$ agree in sign. We describe the i-th residue of $P_1$ by the i-th entries of eigenvectors $u_1, ..., u_t$, each weighted by the square root of the absolute value of the corresponding eigenvalue $\lambda_k$:

\[
[(u'_1)_i, \dots, (u'_t)_i] = [\sqrt{|\lambda_1|}\,(u_1)_i, \dots, \sqrt{|\lambda_t|}\,(u_t)_i]
\]


According to this representation, the i-th residue of $P_1$ should be matched with the j-th residue of $P_2$ when the pairwise entries of the two vectors highly agree both in sign and in relative magnitude. The vectors do not need to be equal to obtain a high score; for this reason, a scoring scheme based on the Euclidean distance is not very appropriate. Experimentally we found that a scoring function that provides better performances is simply

\[
S_{ij} = \sum_{k=1}^{t} (u'_k)_i (v'_k)_j \tag{6}
\]

The scoring function (6) assigns higher scores when the corresponding entries of the vectors describing residues i and j agree in sign. Moreover, the contribution to $S_{ij}$ of each product $(u'_k)_i (v'_k)_j$ is weighted by the square roots of the corresponding eigenvalues. We found experimentally that a penalty equal to

\[
G = \min\{0, \min\{ S_{ij} \mid 1 \le i \le n,\ 1 \le j \le m \}\}
\]

is in almost all cases a good choice.
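The residue descriptors, the scoring function (6) and the gap penalty can be sketched as below, assuming NumPy. The function names and the banded toy maps are hypothetical, and the sign ambiguity of the eigenvectors is deliberately ignored in this fragment.

```python
import numpy as np

def residue_descriptors(M, t):
    """Row i holds sqrt(|lambda_k|) * (u_k)_i for the t largest eigenvalues."""
    lam, vec = np.linalg.eigh(np.asarray(M, dtype=float))
    order = np.argsort(lam)[::-1][:t]            # t largest eigenvalues, decreasing
    return vec[:, order] * np.sqrt(np.abs(lam[order]))

def score_matrix(U, V):
    """Equation (6): S[i, j] = sum over k of U[i, k] * V[j, k]."""
    return U @ V.T

def gap_penalty(S):
    """G = min(0, smallest entry of S)."""
    return min(0.0, float(S.min()))

def band(n, w):
    """Toy map in which residues at most w positions apart are in contact."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= w).astype(int)

U = residue_descriptors(band(6, 2), t=3)
V = residue_descriptors(band(5, 2), t=3)
S = score_matrix(U, V)
G = gap_penalty(S)
```

Because S is a plain matrix product of the two descriptor tables, it costs O(nmt) to build and can be passed directly to a global alignment routine together with G.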

Note that the sign of the eigenvectors has no influence in equation (3), i.e. $v_i \otimes v_i = (-v_i) \otimes (-v_i)$. In fact, there is no way to standardize the sign of the eigenvectors in the eigendecomposition. This implies that, when aligning two sets of t eigenvectors, we are forced to try all possible combinations of their signs. By definition of (6), to consider all possible combinations it is sufficient to try all possible sign combinations for just one set of t eigenvectors, thus producing $2^t$ different alignments. Moreover, experimentally we found that, in some cases, increasing the number of eigenvectors slightly decreases the quality of the alignment (in terms of the number of overlapping contacts recovered). Thus, when aligning t eigenvectors, our algorithm proceeds as follows. First it computes all alignments using one eigenvector per contact map, then two, and so on up to t. This procedure evaluates a total of $\sum_{k=1}^{t} 2^k = 2^{t+1} - 2$ alignments. The best alignment in terms of overlapping contacts is chosen. One execution of the NW algorithm costs O(nm), so our algorithm costs $O(2^{t+1} \cdot nm)$, i.e. it is exponential in the order of approximation t. However, for values of t up to 7 the running time is small enough to ensure a fast computation.
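The enumeration of sign patterns can be sketched as follows, assuming NumPy and the standard `itertools.product`. The generator name is hypothetical and the random matrices are only stand-ins for the weighted eigenvector entries of the two proteins.

```python
import itertools
import numpy as np

def candidate_score_matrices(U, V, t):
    """Yield (k, signs, S) for k = 1..t and every sign pattern on the columns of U."""
    for k in range(1, t + 1):
        for signs in itertools.product((1.0, -1.0), repeat=k):
            yield k, signs, (U[:, :k] * np.array(signs)) @ V[:, :k].T

rng = np.random.default_rng(1)
U = rng.normal(size=(5, 3))   # stand-in for the descriptors of P1
V = rng.normal(size=(4, 3))   # stand-in for the descriptors of P2
count = sum(1 for _ in candidate_score_matrices(U, V, t=3))

# Flipping the same eigenvector in both sets leaves the score matrix unchanged,
# which is why only one of the two sets needs its signs enumerated.
flip = np.array([1.0, -1.0, 1.0])
same = np.allclose((U * flip) @ (V * flip).T, U @ V.T)
```

For t = 3 the generator produces 2 + 4 + 8 = 14 score matrices, matching $2^{t+1} - 2$; each of them would then be aligned once and the alignment with the largest contact overlap kept.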

There are two main differences between our approach and the exact approaches developed so far for the CMO problem.

1. The IP-based methods are exact while our algorithm is completely heuristic. With our method we have no way to detect whether the overlap found at some point of the computation is the best possible. In contrast, IP-based methods can stop the computation when the lower and upper bounds of the optimal solution meet.

2. The computing time of our algorithm depends only on the protein lengths and on the number of eigenvectors considered, and it is not affected by the threshold of the contact map. On the contrary, for all the other methods developed so far (exact, approximate and heuristic) the computing time is influenced by the number of contacts in the maps: more contacts mean more constraints to be taken into account and hence more computing time.

4 Experimental results

We compare the performances of our algorithm with the available methods by performing two different tests. There are only two publicly available methods: the heuristic algorithm MSVNS [16] and the exact algorithm CMOS [20]. Unfortunately, CMOS is only available online as a web server (footnote 3) and it has limitations on the size of the submitted problems. Thus it was not possible to include its performances in our tests.

3. http://eudoxus.cheme.cmu.edu/cmos/cmos.html



In the first test (Section 4.1) we compared the performances in terms of the total overlap quality on the Skolnick dataset with two exact IP-based algorithms, LAGR [4] and A purva [1], and with the heuristic algorithm MSVNS (footnote 4). The implementations of LAGR and A purva are not publicly available, so for the comparison we refer to the results published as supplementary material (footnote 5) of [1].

In the second test (Section 4.2) we evaluate the accuracy of our algorithm as a classifier on the Proteus300 dataset, i.e. its ability to recognize protein structural similarities. The classification results are validated on the SCOP hierarchical classification of protein structures. We compare our contact map alignment method with MSVNS, with three widely used structural alignment algorithms, CE (footnote 6) [17], TM-align (footnote 7) [22] and DaliLite (footnote 8) [10], and with A purva+sse, the variant of A purva that encodes secondary structure constraints (two residues can be aligned only if they belong to the same secondary structure class).

The experiments were run on an Intel Pentium machine with a 2.80 GHz CPU and 1 GB of RAM. Since the performances in terms of computational time are also relevant to evaluate the quality of the proposed method, we also compare the various algorithms in terms of time efficiency. The computational time of our algorithm does not include the time necessary to perform the eigendecomposition, since the contact maps can be pre-processed and their eigenvectors do not need to be recomputed every time.

4.1 Experiments on the Skolnick dataset

We compared the performances on the Skolnick set in terms of total number of overlapping contacts recovered and in terms of required computational time. The results obtained are summarized in Table 1. Here Eig k identifies our algorithm, where k is the number of eigenvectors used for the computation. The computational time of the exact algorithms LAGR and A purva has been limited to a maximum of 30 minutes per contact map pair [1]. We used MSVNS version 3 with the number of restarts equal to 5, 10, 30, 50 and 70. All algorithms have been run on the same Cα contact maps (footnote 9) of threshold 7.5 Å.

As shown in Table 1, even with a maximum time of 30 minutes per pair, exact methods need a large computational time (from 7 to 13 days) to compute the 780 alignments in the Skolnick set. However, the quality of their overlaps is very good, since it is quite close to the optimum: the lowest upper bound on the total overlap has been computed by A purva and is equal to 218316, so the total overlaps are within 1% and 4% of the optimal solution for A purva and LAGR, respectively. On the contrary, heuristic algorithms can provide a good approximation of the best overlaps in lower computational time. Note that the running time of our algorithm increases exponentially with the number of eigenvectors used to approximate the contact maps, while the running time of MSVNS increases linearly with the number of restarts. In fact, our algorithm can compute better solutions than MSVNS if the computing time is limited to 1 or 2 hours, while it has worse performances than MSVNS for higher computational times. We remark that a method that needs a few hours to solve the 780 alignments in the Skolnick set is not practical for large-scale comparisons. For this reason, a good compromise in terms of computing time and total number of overlapping contacts recovered is obtained by Eig 7.
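As a quick check of the figures quoted above, using only the totals reported in Table 1 and the upper bound 218316, the number of pairs in the 40-domain Skolnick set and the relative distances of the two exact methods from the upper bound work out as follows:

\[
\frac{40 \cdot 39}{2} = 780, \qquad
1 - \frac{216372}{218316} = \frac{1944}{218316} \approx 0.89\%, \qquad
1 - \frac{210395}{218316} = \frac{7921}{218316} \approx 3.63\%.
\]

The reported Eig k times also roughly double with each additional eigenvector (about 3, 6, 12 and 25 minutes for k = 6, ..., 9), which is consistent with the $O(2^{t+1} \cdot nm)$ cost derived in Section 3.2.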

Method         Total overlap   % wrt A purva   Time
----------------------------------------------------
A purva        216372          100%            ∼7d
LAGR           210395          97.2%           ∼13d
----------------------------------------------------
MSVNS.v3 r70   199270          92.1%           ∼15h
MSVNS.v3 r50   197777          91.4%           ∼10h
MSVNS.v3 r30   195007          90.1%           ∼6h
MSVNS.v3 r10   186776          86.3%           ∼2h
MSVNS.v3 r5    178757          82.6%           ∼1h
Eig 14         198124          91.6%           ∼18h
Eig 13         197386          91.2%           ∼9h
Eig 12         196512          90.8%           ∼4h 30m
Eig 11         195640          90.4%           ∼2h
Eig 10         194654          90.0%           ∼1h
Eig 9          193571          89.5%           ∼25m
Eig 8          192177          88.8%           ∼12m
Eig 7          190923          88.2%           ∼6m
Eig 6          189446          87.6%           ∼3m

Table 1. Overlap quality on the Skolnick dataset. A purva and LAGR are exact methods (upper box) while MSVNS and Eig k are heuristic methods (lower box). MSVNS has been run with 5, 10, 30, 50 and 70 restarts. Eig k has been run by taking k = 6, ..., 14 principal eigenvectors. The table shows the total number of recovered overlapping contacts, the percentage of contacts recovered with respect to the best performing method (A purva) and the computational times (d = days, h = hours, m = minutes).

4. http://modo.ugr.es/jrgonzalez/msvns4maxcmo

5. http://www.irisa.fr/symbiose/old/softwares/resources/proteus300

6. http://cl.sdsc.edu/

7. http://zhang.bioinformatics.ku.edu/TM-align/

8. http://www.ebi.ac.uk/DaliLite/

9. http://www.irisa.fr/symbiose/old/softwares/resources/proteus300

10. When CE returned more than one alignment for a pair, the best Z-score was chosen.

4.2 Experiments on the Proteus300 dataset

The main motivation of these tests is to analyze the capabilities of our method in detecting protein structure similarities on the Proteus300 dataset. We compare the performances of our algorithm with MSVNS, A purva+sse and three structural alignment methods, CE, DaliLite and TM-align. The scoring function for CMO methods is provided by equation (2). On the contrary, CE, DaliLite and TM-align have their own scoring functions: Z-score (for CE (footnote 10) and DaliLite) and TM-score (for TM-align). Recall that A purva+sse is an exact CMO alignment method that uses secondary structure constraints; the only pure CMO alignment methods considered for these tests are MSVNS and our algorithm. Since A purva+sse is not publicly available, we refer to the results published in [1], released as supplementary material. We performed the following two tests.

1. Family recognition: we tested the ability to detect the correct protein family. For every query-protein we selected the model-protein in the set that obtained the best similarity score and measured the fraction of query-proteins for which the chosen model-protein belongs to the same family.

2. Classification: we measured the robustness of the scoring scheme by computing a hierarchical classification of our benchmark set. To obtain a hierarchical classification we used the UPGMA (Unweighted Pair Group Method with Arithmetic mean) and WPGMA (Weighted Pair Group Method with Arithmetic mean) algorithms, which compute a binary cluster-tree given a matrix of (dis)similarity scores. Every subtree of the cluster-tree uniquely identifies a set of proteins. We validated the tree against the SCOP hierarchy at the family/super family/fold level. From our point of view, a protein family/s.family/fold is correctly classified if the tree returned by UPGMA/WPGMA contains a subtree whose elements are exactly all the members of that family/s.family/fold. Recall that the Proteus300 dataset contains 300 protein domains (see Table 7), distributed in 30 SCOP families. Some of these families belong to the same SCOP super family: b.1.1 (composed of families b.1.1.2 and b.1.1.4), c.2.1 (c.2.1.2, c.2.1.5) and c.37.1 (c.37.1.8, c.37.1.20). Some families belong to the same SCOP fold: b.1 (b.1.1.2, b.1.1.4, b.1.2.1), c.1 (c.1.8.3, c.1.10.1), c.2 (c.2.1.2, c.2.1.5), c.37 (c.37.1.20, c.37.1.8) and d.58 (d.58.7.1, d.58.17.1). Thus, in total we have 30 distinct families, 27 distinct super families and 24 distinct folds (note that the domains belonging to folds c.2 and c.37 actually belong to the super families c.2.1 and c.37.1, respectively). When testing the classification at the family/s.family/fold level we also consider the number of incorrectly classified domains: if a family/s.family/fold is not correctly classified, the error is counted as the number of missing proteins in the largest subtree containing only elements of the respective family/s.family/fold.

Method        Contact th.   Family recognition   Time
------------------------------------------------------
CE            N/A           297/300              ∼40h
DaliLite      N/A           299/300              ∼9h 30m
TM-align      N/A           300/300              ∼4h
A purva+sse   7.5 Å         300/300              ∼23h
------------------------------------------------------
MSVNS.v3      7.5 Å         297/300              ∼94h
Eig 7         13 Å          300/300              ∼6h
Eig 7         12 Å          300/300              ∼6h
Eig 7         11 Å          300/300              ∼6h
Eig 7         10 Å          300/300              ∼6h
Eig 7         9 Å           299/300              ∼6h
Eig 7         8 Å           297/300              ∼6h
Eig 7         7.5 Å         294/300              ∼6h

Table 2. Family recognition results on the Proteus300 dataset. The only pure CMO methods are MSVNS and Eig 7 (lower box). When applicable, the threshold of the contact map is reported in column 2. The best results in terms of family recognition (column 3) have been highlighted with bold fonts. Column 4 reports for each method the computational time needed to solve the entire set of 44850 alignments.
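The two tests can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the experiments: the function names are hypothetical, the clustering is a naive agglomerative implementation of the standard UPGMA/WPGMA update rules, and the four-protein score and distance matrices are invented toy data (a dissimilarity could, for instance, be taken as one minus the score of equation (2), which is also an assumption).

```python
def family_recognition(scores, family):
    """Fraction of queries whose best-scoring other protein is in the same family."""
    n = len(family)
    hits = 0
    for q in range(n):
        best = max((p for p in range(n) if p != q), key=lambda p: scores[q][p])
        hits += family[best] == family[q]
    return hits / n

def cluster_tree(dist, weighted=False):
    """Leaf sets of all subtrees built by UPGMA (default) or WPGMA (weighted=True)."""
    clusters = {i: frozenset([i]) for i in range(len(dist))}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    subtrees = list(clusters.values())
    next_id = len(dist)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                      # closest pair of clusters
        na, nb = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) | clusters.pop(b)
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # WPGMA averages the two distances; UPGMA weights them by cluster size.
            new_d[(c, next_id)] = ((dac + dbc) / 2 if weighted
                                   else (na * dac + nb * dbc) / (na + nb))
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = merged
        subtrees.append(merged)
        next_id += 1
    return subtrees

def classification_errors(subtrees, members):
    """Members missing from the largest subtree made only of members (0 if exact)."""
    members = frozenset(members)
    largest_pure = max(len(s) for s in subtrees if s <= members)
    return len(members) - largest_pure

# Invented toy data: proteins 0, 1 form one family and proteins 2, 3 another.
family = [0, 0, 1, 1]
dist = [[0, 1, 4, 5], [1, 0, 4, 5], [4, 4, 0, 2], [5, 5, 2, 0]]
scores = [[-x for x in row] for row in dist]          # higher score = more similar
upgma = cluster_tree(dist)
wpgma = cluster_tree(dist, weighted=True)
```

On this toy input both families appear as exact subtrees, so their error count is zero, whereas a hypothetical family {0, 1, 2} would be counted with one error because its largest pure subtree is {0, 1}.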

The Skolnick set is an easy benchmark with respect to family recognition and classification tests: it contains just five distinct families, which share little structural similarity. In fact, all methods considered have full accuracy on this set (data not shown). The Proteus300 benchmark is much more interesting since it also contains non-trivial super families and folds. The results obtained for the family recognition test, plus the time needed to solve the entire set of 44850 = 300 · (300 − 1)/2 alignments, are summarized in Table 2. The results of the classification tests are summarized in Tables 3 and 4.

As shown in Table 2, all methods have more or less the same performances in terms of family recognition. Notably, most of the errors are related to query-proteins that belong to non-trivial folds or non-trivial super families in the Proteus300 dataset. In particular, CE fails to correctly recognize two queries in the b.1.1.4 family (in both cases it assigns the highest score to two models in the b.1.1.2 family) and one query in the c.1.10.1 family (it assigns the highest score to a model in the c.1.8.3 family). DaliLite fails to correctly assign the highest score to one query in the b.1.1.4 family (also in this case the highest score is assigned to a model in the b.1.1.2 family). MSVNS has the same problem for two queries in the b.1.1.4 family and for one query in the c.37.1.20 family (in this case the highest score is assigned to a model in the a.123.1.1 family). It is worth noticing how the family recognition performances of our method vary with the contact map threshold: all queries are recognized correctly for thresholds of 10 Å or more, but the performances are not optimal for lower thresholds. In terms of computational time, our method and TM-align are the fastest. On the contrary, despite the fact that MSVNS is heuristic and considerably faster than exact CMO methods, it is still much slower than structural alignment algorithms and A purva+sse, whose computational times are reduced thanks to the constraints introduced on the secondary structure elements.

Method        Cont. th.   Fam. clu. (#err)   Fam. clu. (#err)
                          UPGMA              WPGMA
--------------------------------------------------------------
CE            N/A         28/30 (5/300)      29/30 (4/300)
DaliLite      N/A         30/30 (0/300)      30/30 (0/300)
TM-align      N/A         28/30 (4/300)      30/30 (0/300)
A_purva+sse   7.5 Å       30/30 (0/300)      29/30 (1/300)
--------------------------------------------------------------
MSVNS         7.5 Å       26/30 (11/300)     23/30 (18/300)
Eig_7         13 Å        28/30 (3/300)      28/30 (4/300)
Eig_7         12 Å        28/30 (4/300)      29/30 (2/300)
Eig_7         11 Å        29/30 (2/300)      30/30 (0/300)
Eig_7         10 Å        29/30 (3/300)      28/30 (6/300)
Eig_7         9 Å         25/30 (10/300)     26/30 (9/300)
Eig_7         8 Å         27/30 (10/300)     25/30 (13/300)
Eig_7         7.5 Å       27/30 (11/300)     26/30 (11/300)

Table 3. Classification results at the family level (columns 3-4) on the Proteus300 dataset. The hierarchical classifications have been computed with both the UPGMA and WPGMA algorithms. For families not correctly classified, the error (#err) is counted as the number of missing proteins in the largest subtree containing only elements of the respective family. The only pure CMO methods are MSVNS and Eig_7 (lower box). When applicable, the threshold of the contact map is reported in column 2.

Method        S.F. clu. (#err)   S.F. clu. (#err)   Fld clu. (#err)   Fld clu. (#err)
              UPGMA              WPGMA              UPGMA             WPGMA
--------------------------------------------------------------------------------------
CE            24/27 (13/300)     26/27 (10/300)     23/24 (10/300)    23/24 (10/300)
DaliLite      26/27 (10/300)     26/27 (10/300)     23/24 (10/300)    23/24 (10/300)
TM-align      23/27 (24/300)     25/27 (20/300)     22/24 (20/300)    22/24 (20/300)
A_purva+sse   25/27 (20/300)     25/27 (20/300)     22/24 (20/300)    22/24 (20/300)
--------------------------------------------------------------------------------------
MSVNS         24/27 (22/300)     21/27 (29/300)     21/24 (30/300)    19/24 (27/300)
Eig_7         24/27 (22/300)     23/27 (23/300)     22/24 (20/300)    22/24 (20/300)
Eig_7         24/27 (22/300)     25/27 (20/300)     22/24 (20/300)    22/24 (20/300)
Eig_7         24/27 (22/300)     25/27 (20/300)     22/24 (20/300)    22/24 (20/300)
Eig_7         25/27 (20/300)     25/27 (20/300)     22/24 (20/300)    22/24 (20/300)
Eig_7         22/27 (26/300)     23/27 (25/300)     21/24 (30/300)    21/24 (30/300)
Eig_7         24/27 (22/300)     24/27 (21/300)     21/24 (30/300)    20/24 (32/300)
Eig_7         24/27 (22/300)     23/27 (22/300)     21/24 (30/300)    20/24 (31/300)

Table 4. Classification results at the super family (columns 2-3) and fold (columns 4-5) level on the Proteus300 dataset. The hierarchical classifications have been computed with both the UPGMA and WPGMA algorithms. For s.families/folds not correctly classified, the error (#err) is counted as the number of missing proteins in the largest subtree containing only elements of the respective s.family/fold. The only pure CMO methods are MSVNS and Eig_7 (lower box).

Method        Contact th.   AUC         Max accuracy
----------------------------------------------------
CE            N/A           0.9944422   0.99148272
DaliLite      N/A           0.9990242   0.99440357
TM-align      N/A           0.9987262   0.99299889
A_purva+sse   7.5 Å         0.9973591   0.99286511
MSVNS.v3      7.5 Å         0.9675197   0.98301003
Eig_7         13 Å          0.9922069   0.98773690
Eig_7         12 Å          0.9916820   0.98793757
Eig_7         11 Å          0.9902748   0.98807135
Eig_7         10 Å          0.9866454   0.98802676
Eig_7         9 Å           0.9832919   0.98704571
Eig_7         8 Å           0.9801223   0.98584169
Eig_7         7.5 Å         0.9793783   0.98468227

Table 5. AUC (see Figure 1) and maximum accuracy (see Figure 2) at the family level of the Proteus300 dataset.

[Plot: false positive rate (x-axis) against true positive rate (y-axis), both from 0.0 to 1.0. Curves: MSVNS.v3 (7.5), Eig_7 (7.5), Eig_7 (11), CE, A_purva+sse (7.5), TM-align, DaliLite.]

Figure 1. ROC curves at the family level of the Proteus300 dataset. When applicable, the contact map threshold is reported in the legend.

As shown in Tables 3 and 4, the best performing method in terms of family/s.family/fold classification is DaliLite, which is also the most stable method, since its classification performance does not depend on the hierarchical clustering algorithm used (UPGMA or WPGMA). The second best performing methods are TM-align, A_purva+sse and our method (on 11 Å contact maps), which have exactly the same performance. Note that the only difference between the results obtained with DaliLite and those of these three methods is the incorrect classification of one super family. In detail, only DaliLite and CE can recognize enough similarity between families c.2.1.2 and c.2.1.5, which belong to the common super family c.2.1 and common fold c.2. None of the algorithms is able to correctly classify the two families c.37.1.8 and c.37.1.20 into a common subtree corresponding to super family c.37.1 and fold c.37. The poor performance of MSVNS is probably a consequence of the low threshold used. As can be seen in Table 2, the minimum acceptable threshold to obtain good results with our method is 10 Å. Since the computing time of MSVNS depends on the contact threshold, it is infeasible to run it with thresholds larger than 8 Å. Nonetheless, our algorithm performs slightly better than MSVNS also on 7.5 Å contact maps. It seems that the secondary structure information encoded in A_purva+sse can overcome the threshold problem, despite the fact that the total overlap found by A_purva+sse is lower than the one found by MSVNS and by our algorithm on the same maps (data not shown).
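The classification error used in Tables 3 and 4 can be sketched in Python as follows. The sketch assumes that pairwise dissimilarities between proteins are already available (how scores are turned into dissimilarities is not described in this section) and uses the textbook UPGMA and WPGMA update rules. All function names and the toy inputs are hypothetical, and this is an illustration rather than the code used for the experiments.

```python
from itertools import combinations


def cluster(dist, method="upgma"):
    """Agglomerative clustering on a symmetric dissimilarity matrix.

    Returns a nested-tuple binary tree whose leaves are item indices.
    """
    n = len(dist)
    clusters = {i: (i, 1) for i in range(n)}  # id -> (subtree, size)
    d = {frozenset(p): dist[p[0]][p[1]] for p in combinations(range(n), 2)}
    next_id = n
    while len(clusters) > 1:
        # merge the two closest active clusters
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        for k in clusters:
            dak, dbk = d[frozenset((a, k))], d[frozenset((b, k))]
            if method == "upgma":  # average weighted by cluster sizes
                new = (size_a * dak + size_b * dbk) / (size_a + size_b)
            else:  # "wpgma": plain average of the two distances
                new = (dak + dbk) / 2
            d[frozenset((next_id, k))] = new
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


def leaves(tree):
    """List the leaf indices of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def largest_pure_subtree(tree, labels, group):
    """Size of the largest subtree whose leaves all carry the label `group`."""
    if all(labels[i] == group for i in leaves(tree)):
        return len(leaves(tree))
    if not isinstance(tree, tuple):
        return 0
    return max(largest_pure_subtree(child, labels, group) for child in tree)


def classification_errors(tree, labels):
    """Per group: number of members missing from its largest pure subtree."""
    return {
        group: labels.count(group) - largest_pure_subtree(tree, labels, group)
        for group in set(labels)
    }
```

A group is classified correctly when its error is zero, that is, when all of its members form a single subtree containing no other protein.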

[Plot: score cutoff (x-axis) against accuracy (y-axis), both from 0.0 to 1.0. Curves: MSVNS.v3 (7.5), Eig_7 (7.5), Eig_7 (11), CE, A_purva+sse (7.5), TM-align, DaliLite.]

Figure 2. Accuracy curves at the family level of the Proteus300 dataset. When applicable, the contact map threshold is reported in the legend.


In Figure 1 we compare the ROC (Receiver Operating Characteristic) curves (see footnote 11) at the family level for the different methods considered (the Z-scores of DaliLite and CE have been previously normalized between 0 and 1). The corresponding AUC (Area Under Curve) values are reported in Table 5. In addition, in Figure 2 we show the accuracy of the different methods at the family level with respect to the score cutoff. The maximum accuracy per method is shown in Table 5 (the corresponding cutoffs are not shown). The ROC curves and the corresponding AUC values confirm that the performances of DaliLite, TM-align, A_purva+sse, CE and Eig_7 (on 11 Å contact maps) are comparable, with DaliLite performing slightly better than the others. On lower-threshold contact maps, even if not comparable with structural alignment algorithms, our algorithm outperforms MSVNS both in terms of computational time and quality of the results.
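For readers who want to reproduce this kind of evaluation, the two quantities in Table 5 can be sketched in plain Python. The results in this report were computed with ROCR; the code below is only an illustration and assumes that each protein pair has a score in [0, 1] and a binary label saying whether the two proteins belong to the same family. The function names and the toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive pair
    scores higher than a randomly chosen negative pair (ties count 1/2).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def max_accuracy(scores, labels):
    """Best accuracy over all cutoffs, predicting 'same family' when score >= cutoff."""
    cutoffs = sorted(set(scores)) + [float("inf")]  # inf: predict all negative
    best = 0.0
    for cutoff in cutoffs:
        correct = sum((s >= cutoff) == bool(y) for s, y in zip(scores, labels))
        best = max(best, correct / len(scores))
    return best


# Toy example (made-up values, not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
toy_acc = max_accuracy(toy_scores, toy_labels)
```

The pairwise formulation of the AUC is quadratic in the number of pairs; it is used here for clarity rather than speed.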

5 Conclusions

In this paper we described a heuristic algorithm for the contact map overlap problem. Our algorithm computes an overlap of two contact maps by performing a global alignment of a few of their eigenvectors. This approach is effective because contact maps can be well approximated by just a fraction of their eigenvectors. Our algorithm is reasonably simple and can be easily implemented.
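As a rough illustration of the idea, the sketch below aligns two real-valued vectors (standing in for one eigenvector of each contact map, one component per residue) with Needleman-Wunsch global alignment [15] and then counts the contacts preserved by the resulting alignment. The similarity function (negative absolute difference), the gap penalty and all names are assumptions made for this example; they are not the scoring scheme of our method, and the computation of the eigenvectors is omitted.

```python
def global_align(u, v, gap=-0.5):
    """Needleman-Wunsch global alignment of two real-valued vectors.

    Matching u[i] with v[j] scores -|u[i] - v[j]| (assumed for illustration);
    each unaligned position costs `gap`. Returns the aligned index pairs.
    """
    n, m = len(u), len(v)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] - abs(u[i - 1] - v[j - 1]),  # align i with j
                S[i - 1][j] + gap,                            # skip u[i-1]
                S[i][j - 1] + gap,                            # skip v[j-1]
            )
    # trace back from the bottom-right corner
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if S[i][j] == S[i - 1][j - 1] - abs(u[i - 1] - v[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def overlap(contacts_a, contacts_b, pairs):
    """Count contacts (i, j) of map A whose aligned images are in contact in map B.

    Contacts are sets of index pairs (i, j) with i < j.
    """
    image = dict(pairs)
    return sum(
        1
        for i, j in contacts_a
        if i in image and j in image and (image[i], image[j]) in contacts_b
    )


# Toy example (made-up vectors and contacts).
toy_pairs = global_align([0.1, 0.5, 0.9, 0.4], [0.1, 0.9, 0.4])
toy_overlap = overlap({(0, 1), (0, 2), (2, 3)}, {(0, 1), (1, 2)}, toy_pairs)
```

Because a global alignment is order-preserving, the image of a contact (i, j) with i < j is again an ordered pair, so a simple set lookup is enough to count the overlap.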

We experimentally validated the performance of our method by comparison with exact CMO methods (LAGR and A_purva), with a heuristic CMO method (MSVNS) and with three structural alignment methods (CE, TM-align and DaliLite). Our algorithm is fast (independently of the contact map threshold), it performs well in terms of quality of the overlap when compared with exact CMO methods, and it is competitive with widely used structural alignment methods for the task of protein structure comparison. In all the tests we performed, our method performed better than the heuristic CMO algorithm MSVNS in terms of both computational time and quality of the results.

References

[1] Andonov,R., Yanev,N., Malod-Dognin,N. (2008) An Efficient Lagrangian Relaxation for the Contact Map Overlap Problem. Lecture Notes in Bioinformatics, 5251, 162–173.

[2] Andrews,H.C., Patterson,C.L. (1976) Singular Value Decomposition (SVD) Image Coding. IEEE Transactions on Communications, 24, 425–432.

[3] Bartoli,L., Fariselli,P., Casadio,R. (2007) The effect of backbone on the small-world properties of protein contact maps. Phys Biol., 4, 1–5.

[4] Caprara,A., Lancia,G. (2002) Structural alignment of large-size proteins via Lagrangian relaxation. In Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB 2002), 100–108.

[5] Caprara,A., Carr,R., Istrail,S., Lancia,G., Walenz,B. (2004) 1001 optimal PDB structure alignments: integer programming methods for finding the maximum contact map overlap. J Comput Biol., 11, 27–52.

[6] Godzik,A., Kolinski,A., Skolnick,J. (1992) Topology fingerprint approach to the inverse protein folding problem. J Mol Biol., 227, 227–238.

[7] Godzik,A. (1996) The structural alignment between two proteins: is there a unique answer? Protein Sci., 5, 1325–1338.

[8] Goldman,D., Istrail,S., Papadimitriou,C. (1999) Algorithmic aspects of protein structure similarity. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, 512–521.

11. Computed with the ROCR package in the R environment.


[9] Holm,L., Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J Mol Biol., 233, 123–138.

[10] Holm,L., Park,J. (2000) DaliLite workbench for protein structure comparison. Bioinformatics, 16, 566–567.

[11] Jain,B.J., Lappe,M. (2007) Joining Softassign and Dynamic Programming for the Contact Map Overlap Problem. Lecture Notes in Bioinformatics, 4414, 410–423.

[12] Kabsch,W. (1976) A solution for the best rotation to relate two sets of vectors. Acta Cryst., 32, 922–923.

[13] Lancia,G., Carr,R., Walenz,B., Istrail,S. (2001) 101 Optimal PDB structure alignments: A branch-and-cut algorithm for the maximum contact map overlap problem. In Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB 2001), 193–202.

[14] Lesk,A. (2006) Introduction to Bioinformatics. Oxford University Press.

[15] Needleman,S.B., Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol., 48, 443–453.

[16] Pelta,D.A., Gonzalez,J.R., Moreno Vega,M. (2008) A simple and fast heuristic for protein structure comparison. BMC Bioinformatics, 9, 161.

[17] Shindyalov,I.N., Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.

[18] Strickland,D.M., Barnes,E., Sokol,J.S. (2005) Optimal protein structure alignment using maximum cliques. Oper. Res., 53, 389–402.

[19] Vassura,M., Margara,L., Di Lena,P., Medri,F., Fariselli,P., Casadio,R. (2008) Reconstruction of 3D structures from protein contact maps. IEEE/ACM Trans Comput Biol Bioinform., 5, 357–366.

[20] Xie,W., Sahinidis,N.V. (2007) A reduction-based exact algorithm for the contact map overlap problem. J Comput Biol., 14, 637–654.

[21] Xu,J., Jiao,F., Berger,B. (2007) A parameterized algorithm for protein structure alignment. J Comput Biol., 14, 564–577.

[22] Zhang,Y., Skolnick,J. (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res., 22, 2302–2309.


SCOP Domains                          SCOP Family                       SCOP ID
--------------------------------------------------------------------------------
d1b71a1, d1bcfa_, d1dpsa_, d1fhaa_,   Ferritin                          a.25.1.1
d1iera_, d1rcda_

d1bawa_, d1byoa_, d1byob_, d1kdia_,   Plastocyanin/azurin-like          b.6.1.1
d1nina_, d1plaa_, d2b3ia_, d2pcya_,
d2plta_

d1amka_, d1aw2a_, d1b9ba_, d1btma_,   Triosephosphate isomerase (TIM)   c.1.1.1
d1htia_, d1tmha_, d1trea_, d1tria_,
d1ydva_, d3ypia_, d8tima_

d1b00a_, d1dbwa_, d1nata_, d1ntra_,   CheY-related                      c.23.1.1
d1qmpa_, d1qmpb_, d1qmpc_, d1qmpd_,
d3chya_, 4tmya_, 4tmyb_

d1rn1a_, d1rn1b_, d1rn1c_             Fungal ribonucleases              d.1.1.4

Table 6. The Skolnick data set.

#   SCOP domains                                                                     SCOP ID
----------------------------------------------------------------------------------------------
1   d1b0ba_ d1cqxa1 d1gcva_ d1h97a_ d1irda_ d1it2a_ d1q1fa_ d1wmub_ d1x9fc_ d3sdha_  a.1.1.2
2   d1jgca_ d1ji4a_ d1jiga_ d1lb3a_ d1nf4a_ d1o9ra_ d1tjoa_ d1umna_ d1vela_ d1vlga_  a.25.1.1
3   d1eema1 d1f2ea1 d1k3ya1 d1m0ua1 d1n2aa1 d1nhya1 d1oe8a1 d1oyja1 d1r5aa1 d2gsqa1  a.45.1.1
4   d1cpta_ d1io7a_ d1izoa_ d1jipa_ d1jpza_ d1lfka_ d1n40a_ d1n97a_ d1po5a_ d1x8va_  a.104.1.1
5   d1n46a_ d1nq7a_ d1pdua_ d1pk5a_ d1pq9a_ d1pzla_ d1r1kd_ d1t7ra_ d1xpca_ d1xvpb_  a.123.1.1
6   d1fp5a1 d1k5na1 d1k5nb_ d1l6xa1 d1mjuh2 d1mjul2 d1rzfl2 d1uvqa1 d2fbjh2 d3frua1  b.1.1.2
7   d1biha3 d1ev2e2 d1gl4b_ d1gsma1 d1iray3 d1p53a2 d1p53a3 d1rhfa2 d1ucta1 d1zxqa2  b.1.1.4
8   d1axib2 d1bqua1 d1cd9b1 d1f6fb2 d1fyhb2 d1lqsr2 d1lwra_ d1n26a2 d1uc6a_ d2hfta2  b.1.2.1
9   d1g9oa_ d1gm1a_ d1ihja_ d1iu2a_ d1l6oa_ d1m5za_ d1n7ea_ d1qava_ d1r6ja_ d1ujva_  b.36.1.1
10  d1erja_ d1gxra_ d1k8kc_ d1nexb2 d1nr0a1 d1nr0a2 d1p22a2 d1pgua1 d1pgua2 d1tbga_  b.69.4.1
11  d1bhga3 d1bqca_ d1ecea_ d1foba_ d1h1na_ d1nofa2 d1qnra_ d1uhva2 d1xyza_ d7a3ha_  c.1.8.3
12  d1gqna_ d1l6wa_ d1n7ka_ d1o5ka_ d1ojxa_ d1p1xa_ d1sfla_ d1ub3a_ d1vlwa_ d1w3ia_  c.1.10.1
13  d1db3a_ d1ek6a_ d1gy8a_ d1i24a_ d1iy8a_ d1ja9a_ d1sb8a_ d1vl0a_ d1w4za_ d1xgka_  c.2.1.2
14  d1b8pa1 d1hyea1 d1hyha1 d1ldna1 d1o6za1 d1obba1 d1s6ya1 d1t2da1 d1uxja1 d2cmda1  c.2.1.5
15  d1a04a2 d1b00a_ d1krwa_ d1mb3a_ d1oxkb_ d1p6qa_ d1qkka_ d1u0sy_ d1w25a1 d1w25a2  c.23.1.1
16  d1ctqa_ d1i2ma_ d1kk1a3 d1mkya2 d1r2qa_ d1r8sa_ d1svia_ d1wb1a4 d1wf3a1 d3raba_  c.37.1.8
17  d1d2na_ d1fnna2 d1l8qa2 d1lv7a_ d1njfa_ d1ny5a2 d1r7ra3 d1sxja2 d1sxje2 d1w5sa2  c.37.1.20
18  d1bw0a_ d1gdea_ d1lc5a_ d1m6sa_ d1o4sa_ d1toia_ d1u08a_ d1uu1a_ d1v2da_ d1w7la_  c.67.1.1
19  d1byka_ d1guda_ d1jdpa_ d1jx6a_ d1jyea_ d1qo0a_ d1sxga_ d1tjya_ d2dria_ d8abpa_  c.93.1.1
20  d1amfa_ d1atga_ d1i6aa_ d1ii5a_ d1lsta_ d1pb7a_ d1sbpa_ d1ursa_ d1xvxa_ d1y4ta_  c.94.1.1
21  d1mg8a_ d1v5oa_ d1v6ea_ d1v86a_ d1wh3a_ d1wiaa_ d1wjna_ d1wjua_ d1wm3a_ d1xd3b_  d.15.1.1
22  d1ec7a2 d1jpdx2 d1jpma2 d1muca2 d1r0ma2 d1rvka2 d1sjda2 d1wuea2 d1yeya2 d2mnra2  d.54.1.1
23  d1fxla1 d1h6kx_ d1l3ka2 d1no8a_ d1oo0b_ d1sjqa_ d1wf0a_ d1wg1a_ d1wg4a_ d1whya_  d.58.7.1
24  d1aw0a_ d1cc8a_ d1cpza_ d1fe0a_ d1fvqa_ d1kqka_ d1mwya_ d1osda_ d1qupa2 d1sb6a_  d.58.17.1
25  d1ghea_ d1n71a_ d1nsla_ d1q2ya_ d1qsta_ d1s3za_ d1tiqa_ d1ufha_ d1vhsa_ d1vkca_  d.108.1.1
26  d1b77a1 d1dmla1 d1dmla2 d1iz5a1 d1iz5a2 d1plqa1 d1plqa2 d1t6la1 d1u7ba1 d1ud9a1  d.131.1.2
27  d1fvra_ d1k2pa_ d1phka_ d1rdqe_ d1s9ja_ d1tkia_ d1u46a_ d1uu3a_ d1vjya_ d1xkka_  d.144.1.7
28  d1q5qa_ d1ryp1_ d1rypa_ d1rypb_ d1rypd_ d1rypg_ d1ryph_ d1rypi_ d1rypk_ d1rypl_  d.153.1.4
29  d1b8pa2 d1ez4a2 d1gv1a2 d1hyea2 d1hyha2 d1llda2 d1ojua2 d1t2da2 d2cmda2 d7mdha2  d.162.1.1
30  d1byfa_ d1e87a_ d1h8ua_ d1jzna_ d1kg0c_ d1qo3c_ d1sl4a_ d1tdqb_ d1tn3a_ d2afpa_  d.169.1.1

Table 7. The Proteus 300 data set.
