IMPROVED EFFICIENCY IN CRYO-EM SECONDARY STRUCTURE TOPOLOGY DETERMINATION FROM INACCURATE DATA

ABHISHEK BISWAS, DONG SI, KAMAL AL NASR, DESH RANJAN, MOHAMMAD ZUBAIR and JING HE*

Department of Computer Science

Old Dominion University, Norfolk, VA 23529, USA

*[email protected]

Received 17 February 2012

Revised 7 April 2012

Accepted 7 April 2012

Published 7 June 2012

The determination of the secondary structure topology is a critical step in deriving the atomic structure from the protein density map obtained by electron cryo-microscopy. This step often relies on matching two sources of information. One source is the set of secondary structures detected from the protein density map at medium resolution, such as 5–10 Å. The other source is the set of secondary structures predicted from the amino acid sequence. Because either source can be inaccurate, a pool of possible secondary structure positions needs to be sampled. This paper studies how to reduce the computation of the mapping when the inaccuracy of the secondary structure predictions is considered. We present a method that combines the concept of a dynamic graph with our previous work on using the constrained shortest path to identify the topology of the secondary structures. We show a 34.55% reduction in run-time compared to the naïve way of handling the inaccuracies. We also show improved accuracy when the potential secondary structure errors are explicitly sampled versus the use of one consensus prediction. Our framework demonstrates the potential of developing computationally effective exact algorithms to identify the optimal topology of the secondary structures when the inaccuracy of the predicted data is considered.

Keywords: Constraint dynamic graph; topology; secondary structure; helix prediction error; electron cryo-microscopy.

*Corresponding author.

Journal of Bioinformatics and Computational Biology
Vol. 10, No. 3 (2012) 1242006 (16 pages)
© Imperial College Press
DOI: 10.1142/S0219720012420061
http://dx.doi.org/10.1142/S0219720012420061

1. Introduction

Electron cryo-microscopy (cryoEM) is a promising technique for studying the three-dimensional structure of macromolecular complexes1–3 and it is complementary to X-ray crystallography and nuclear magnetic resonance techniques. Although the backbone is not resolved in a density map at medium resolution, such as 5–10 Å, secondary structure elements (SSEs) such as helices and β-sheets can be detected.4–8 A helix detected from the density map is represented as a stick [red cylinder in Fig. 1(a)] and a β-sheet appears as a thin sheet [blue, Fig. 1(a)]. Due to the medium resolution, the strands of the β-sheet are often not distinguishable, and the connection between two SSEs is often ambiguous. The major challenge in deriving the protein structure from such cryoEM maps is that it is not known which segment of the protein sequence corresponds to which of the SSEs detected from the density map. A topology of the SSEs refers to the order of the SSEs with respect to the protein sequence and the direction of each SSE. For example, the true topology of the protein in Fig. 1 presents the true order of the SSEs as $(S_2, S_7, E_1, S_9, S_{10}, S_1, E_2, E_3, E_4, S_3, S_6, S_4, S_8, S_5)$ [Fig. 1(b)]. In principle, each stick $S_j$, $j = 1, \ldots, 10$, of the protein corresponds to a sequence segment $H_i$, $i = 1, \ldots, 10$, that forms a helix in the structure. The four sequence segments $E_1$, $E_2$, $E_3$, and $E_4$ correspond to a sheet that can be detected in the density map. Note that there are two directions in which a sequence segment $H_i$ can correspond to $S_j$ [arrows in Fig. 1(a); dot and cross in Fig. 1(b)]. Since the topology problem involving β-sheets is still challenging, our work in this paper focuses on the topology problem for α-proteins, in which no β-sheets are involved. The α-helices are generally detected more reliably than β-sheets from the density map at medium resolution and often play important roles in deriving the topology.

Fig. 1. SSEs and the topology. (a) The density map (grey) was simulated to 10 Å resolution using protein 3PBA from the Protein Data Bank (PDB) and EMAN software24. The SSEs (red: helix sticks, blue: sheet) were detected using SSE Tracer, an extended version of Helix Tracer4, and viewed by Chimera. For clear viewing, only SSEs at the front of the structure are labeled. Arrows: the direction of the protein sequence. (b) The true topology of the sticks (arrow, cross and dot for directions). (c) H1 to H10: helix segments; E1 to E4: β-strands; "...": loops longer than two amino acids.

The problem of determining the topology of the helix secondary structures, in its simplest form, can be formulated as follows. Let $H = (H_1, H_2, \ldots, H_N)$ be a tuple of sequence segments that form helices [Fig. 1(c)]. Let $S = \{S_1, S_2, \ldots, S_N\}$ be a set of accurately detected helix sticks from the density map [Fig. 1(a)]. The topology determination problem is to find a permutation $\sigma$ of $\{1, 2, \ldots, N\}$ such that assigning $H_i$ to $S_{\sigma(i)}$, $i = 1, \ldots, N$, minimizes the assignment score. In the assignment, each $H_i$ is assigned to $S_{\sigma(i)}$ in one of two opposite directions [Fig. 1(b), cross or dot]. Note that there are $N!$ permutations and two possible directions for assigning each segment $H_i$ to $S_j$, $1 \le i, j \le N$. A naïve way to derive the true topology is to go over all $N!$ permutations of $S$, each with its $2^N$ direction choices, to find the best one. Therefore, the naïve method takes $O(N!\,2^N)$ time.9–11
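The naïve enumeration described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the `score` callback are hypothetical, indices are 0-based, and the scoring function of the paper is not reproduced here.

```python
from itertools import permutations, product
from math import factorial


def count_topologies(n):
    """Number of candidate topologies in the error-free case: N! * 2^N."""
    return factorial(n) * 2 ** n


def naive_topology(n, score):
    """Exhaustively search all stick orders and directions.

    `score(perm, dirs)` is a hypothetical callback: perm[i] is the index of
    the stick assigned to helix segment i, and dirs[i] in {0, 1} is the
    direction of that assignment. Lower scores are better.
    """
    best = None
    for perm in permutations(range(n)):            # N! stick orders
        for dirs in product((0, 1), repeat=n):     # 2^N direction choices
            s = score(perm, dirs)
            if best is None or s < best[0]:
                best = (s, perm, dirs)
    return best
```

Calling `naive_topology(n, score)` evaluates the callback `count_topologies(n)` times, which is what makes this approach impractical beyond small `n`.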

The topology problem is complicated by the inaccuracy in $H$ and $S$, since either of them can have errors. $H$ is generally obtained from secondary structure prediction tools using the amino acid (AA) sequence of the protein. Although many methods are available for protein secondary structure prediction, such as PSIPRED,12 SSPRO,13 and Porter,14 most of them have a prediction accuracy of 70%–80%.14–16 Although the rough position can be predicted for most of the helices, the length of a helix may not be exactly correct and its position can be shifted as well (Fig. 2). A "shift-error" happens when the beginning or the ending position of a helix is shifted slightly from its true position (Helix 2 of Fig. 2). A "split-error" occurs when an observed helix is predicted as two shorter helices (Helix 3 of Fig. 2). Similarly, a "merge-error" corresponds to the situation in which two shorter helices are predicted as one long helix. Short helices may be missed, and a wrong helix may exist in the prediction. The errors in $S$ are similar in nature to those in $H$, due to the inexact detection of the helices from the volumetric density map. Ideally, all the possible positions near the predicted position of each helix should be sampled to find the best tuple of sequence segments that matches the constraints in the volumetric data. However, this is a computationally expensive task due to the number of possible tuples to sample. In this paper, we present an algorithm to search through all the possible tuples of helices derived from the inexact helix positions given by secondary structure prediction methods. Our dynamic graph algorithm shows a 34.55% improvement in run-time over the naïve way of evaluating all the possible tuples when dealing with the "shift-error" and "split-error" of the secondary structure prediction.

2. Related Work

Fig. 2. Prediction errors for helices. The observed positions of the helices (Helix 1, ..., Helix 4), the predicted helices (Helix 2′, Helix 3′, Helix 3″, Helix 4′, Helix 2″) from multiple prediction methods, and the consensus prediction of the helices (shaded) are illustrated.

Suppose that each predicted helix has a maximum error of $t$ amino-acid shifts to the left and $t$ shifts to the right of the true position. The total number of possible topologies is $N!\,2^N (2t+1)^N$. Two topologies in this solution space may differ from each other by the order of the sticks, the direction of a stick, or the beginning/ending position of a helix. A direct method to work around the inaccuracy is to generate a consensus secondary structure prediction from multiple prediction methods. The consensus prediction is then used for $H$ in topology determination.8 Although the consensus is the overall best guess, some of the helices in the consensus may be less accurate than the predicted helix of a particular method. Alternatively, EM-fold creates a pool of potential helix positions for each helix.17 It then uses a Monte Carlo method to sample the solution space guided by a scoring function. Although this method can be used to sample a large solution space, the inherently random nature of the Monte Carlo method means that not all of the potential positions are sampled.
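To give a feel for how quickly the solution space $N!\,2^N(2t+1)^N$ grows, the count can be evaluated directly. The function name below is an assumption made for illustration; only the formula comes from the text.

```python
from math import factorial


def solution_space_size(n_helices, t):
    """Count N! * 2^N * (2t + 1)^N candidate topologies.

    n_helices: number of helices N.
    t: maximum shift (in amino acids) allowed to the left and to the right.
    """
    return factorial(n_helices) * 2 ** n_helices * (2 * t + 1) ** n_helices
```

With `t = 0` the expression reduces to the error-free count $N!\,2^N$, so the factor $(2t+1)^N$ isolates the extra cost of sampling shifted positions.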

Although the entire solution space is huge, in practice two factors may limit it to a computationally manageable size. Firstly, the number of helices in a protein is bounded. Most medium-size proteins have fewer than 10 helices, and most large proteins (300–500 AAs) have fewer than 20 helices longer than one turn. Secondly, some helices are predicted fairly accurately and consistently among different servers. This means the number of alternative positions of a helix might be much smaller than estimated. It is possible to generate a pool for each helix that includes the representative positions.

We previously developed a dynamic programming method for the topology determination problem in an "error-free" situation. It reduced the computation from $O(N!\,2^N)$, as needed by a naïve method, to $O(N^2 2^N)$.18 In this paper we further explore the use of dynamic graph concepts in dealing with the possible errors. A naïve way of finding the tuple of helices that best matches the constraints of the density map is to construct the topology graph from scratch for each of the possible tuples. We will show in this paper that it is not necessary to construct the graph from scratch. The dynamic graph method developed in this paper reduces the computational time to about 65.45% of that needed by the naïve way of evaluating the inaccurate input, a reduction of 34.55%. The method developed in this paper is an exact algorithm rather than a stochastic method.

3. Method

3.1. Evaluation of the errors in predicted helices

In order to investigate the nature of the errors in the predicted helices, we collected secondary structure prediction results using the PREDATOR online server.19 We randomly selected 30 proteins from the CASP 9 dataset, which together contain 246 helices. Different types of errors were quantified.

3.2. Pre-processing of the predicted helix positions

We used two consensus prediction servers, SYMPRED20 and JPRED,21 to produce a consensus prediction. The consensus prediction for each helix simply contains the positions shared by the two predictions. We used three other servers, PSIPRED,12 PREDATOR,19 and SABLE,22 to extend from the consensus prediction. Since each prediction has slightly different results, the pre-processing step aims to generate a list of alternative positions for each helix. Ideally, the alternative helix positions should represent the predicted population while keeping the number of alternatives as small as possible for the sake of computation. For each end of a helix, if the predicted position is more than $x$ AAs away from an existing alternative position, it is included as a new alternative end of the helix. We used $x = 3$ for the work of Tables 2 and 3 and $x = 2$ for Table 4. In other words, we only create an alternative position for the helix if it is quite different from an existing one. For example, suppose the shaded segments represent the consensus predictions; then Helix 2′ is an alternative prediction [Fig. 2(a)]. The right end of Helix 2′ will be an alternative right end of Helix 2. Tuples of the alternative helix positions are then generated. An example of such a tuple is (Helix 1, Helix 2′, Helix 3′, Helix 3″, Helix 4′).
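The pre-processing rule can be sketched as follows. Several details are assumptions rather than statements from the paper: a predicted end is accepted only if it is more than `x` AAs away from every alternative collected so far, the alternative starts and ends of a helix are combined freely, and the data layout and function names are hypothetical.

```python
from itertools import product


def alternative_ends(consensus_end, predicted_ends, x):
    """Collect alternative positions for one end of a helix.

    A predicted position is kept only if it differs by more than x AAs
    from every alternative already collected (assumed reading of the rule).
    """
    alts = [consensus_end]
    for p in predicted_ends:
        if all(abs(p - a) > x for a in alts):
            alts.append(p)
    return alts


def helix_tuples(helices, x):
    """Enumerate tuples of alternative helix positions.

    `helices` is a list of dicts with keys "consensus" -> (start, end) and
    "predictions" -> list of (start, end) from the individual servers.
    """
    per_helix = []
    for h in helices:
        starts = alternative_ends(h["consensus"][0],
                                  [p[0] for p in h["predictions"]], x)
        ends = alternative_ends(h["consensus"][1],
                                [p[1] for p in h["predictions"]], x)
        # Assumption: every alternative start may pair with every alternative end.
        per_helix.append([(s, e) for s in starts for e in ends])
    return list(product(*per_helix))
```

The number of tuples is the product of the number of alternatives per helix, so a larger `x` directly shrinks the search.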

3.3. The topology graph

We previously developed a constraint graph to represent the secondary structure topology problem.18 The graph [Fig. 3(a)] has a two-dimensional layout. The rows represent the list of sequence segments of helices $H = (H_1, H_2, \ldots, H_M)$. The columns represent the set of helix sticks $S = \{S_1, S_2, \ldots, S_N\}$ detected from the protein density map. We use a weighted directed graph $G_{Top} = (V, E, w)$ to represent the SSE topology problem. $V$ has $M \times N \times 2$ "regular" nodes and two special nodes, START and END. The indices for the row and column of a node are $i$ and $j$, respectively. The two ends of a stick are marked by $t = 0$ or $t = 1$ to distinguish the two directions of each assignment. A node $(i, j, t)$ represents an assignment of $H_i$ to $S_j$ in direction $t$. An edge from node $(i, j, t)$ to $(i', j', t')$ represents the assignment of $H_{i'}$ to $S_{j'}$ in direction $t'$ right after the assignment of $H_i$ to $S_j$ in direction $t$. When $M = N$, each edge connects nodes in two adjacent rows. When $M \ge N$, an edge can skip a maximum of $M - N$ rows. There is an edge connecting START to each of the nodes in the first $M - N + 1$ rows. The weight of such special edges is zero. Similarly, there is an edge connecting each node of the last $M - N + 1$ rows to END. The weight of these special edges is also zero. The weighted graph $G_{Top} = (V, E, w)$ with $M \ge N$ is defined as follows.

$$V = \{(i, j, t) \mid 1 \le i \le M,\ 1 \le j \le N,\ t \in \{0, 1\}\} \cup \{START, END\}$$

$$E = \{((i, j, t), (i', j', t')) \mid 1 \le i \le M - 1,\ 1 \le j \ne j' \le N,\ i < i' \le \min(i + M - N + 1,\ M),\ t, t' \in \{0, 1\}\}$$
$$\cup\ \{(START, (i, j, t)) \mid 1 \le i \le M - N + 1,\ 1 \le j \le N,\ t \in \{0, 1\}\}$$
$$\cup\ \{((i, j, t), END) \mid N \le i \le M,\ 1 \le j \le N,\ t \in \{0, 1\}\}.$$

We assigned $\infty$ as the edge weight for two consecutive assignments that are impossible. Some of the impossible situations arise when the length of the sequence fragment differs from the length of the stick by 60%. Another impossible situation happens when the loop is too short to connect the two sticks. For example, it is impossible for a loop of length 3 to connect two helix ends that are 15 Å apart in the density map if we assume the distance between two AAs is about 3.8 Å. For any possible edge, the weight is $w((i, j, t), (i', j', t')) = |l(i, i') - d(j, t, j', t')| + b$, in which $l(i, i')$ is the number of AAs between $H_i$ and $H_{i'}$ measured on the protein sequence; $d(j, t, j', t')$ is the distance estimated from the density map between $H_i$ and $H_{i'}$ when they are assigned to $S_j$ at end $t$ and $S_{j'}$ at end $t'$, respectively; and $b$ is a penalty parameter. The edge weight measures the cost of assigning two consecutive helices in the sequence to two sticks.18 Given the graph, the topology determination problem becomes the problem of finding the shortest path from START to END that satisfies certain constraints. The most important constraint is that no column is visited more than once in a valid path and, similarly, no row is visited more than once.18
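A minimal sketch of the edge weight, covering only the loop-length feasibility test and the weight formula, is given below. The unit handling is an assumption: the density-map distance is converted to an equivalent number of AAs using the 3.8 Å spacing so that it can be compared with the loop length. The function name and the default value of `b` are likewise hypothetical, and the 60% length-mismatch rule is not modelled.

```python
import math

AA_SPACING = 3.8  # approximate distance in angstroms between two consecutive AAs (from the text)


def edge_weight(loop_len, dist, b=0.0):
    """Weight of an edge between two consecutive helix-to-stick assignments.

    loop_len: l(i, i'), number of AAs between the two helices on the sequence.
    dist:     d(j, t, j', t'), distance in angstroms between the two stick ends.
    b:        penalty parameter (value is an assumption; the text gives none).
    """
    # A fully stretched loop cannot span a gap longer than loop_len * 3.8 angstroms.
    if loop_len * AA_SPACING < dist:
        return math.inf
    # Assumption: express the distance in AA units before comparing it with l.
    return abs(loop_len - dist / AA_SPACING) + b
```

For the example in the text, `edge_weight(3, 15.0)` is infinite because a 3-AA loop spans at most 11.4 Å.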

Fig. 3. Dynamic graph update. (a) A topology graph with each edge labeled by its original weight and, if the weight changes, its new weight. The shortest paths before (bold green) and after (dashed) the update are shown. (b) Increment update (upper box) to the table at node (3,2,0). Decrement update (lower box) to the table at node (3,1,1). For clear viewing, the edge from START to (2,2,0) is not drawn.

3.4. Graph update algorithm

To find the shortest valid path, our dynamic programming method stores a table at each node.18 For each node $v = (i, j, t)$, let $U$ be a subset of columns, $U \subseteq \{1, 2, \ldots, N\}$, with $i - (M - N) \le |U| \le i$ and $j \in U$. A table containing the entries $(U, f(v, U))$ is stored at each node $v$, where $f(v, U)$ is the best score to reach $v$ using all the columns of $U$. For example, node (3,2,0) [Fig. 3(b)] can be reached by different sets of columns: {1,2}, {3,2}, or {1,2,3}. Note that the order in which the columns are visited can differ even when the same set of columns is used. The best score to reach node (3,2,0) using columns 1 and 2 is 2 [Figs. 3(a) and 3(b)].
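The table-based search can be illustrated with a small sketch restricted to the case $M = N$, where every edge joins adjacent rows. It is not the implementation of Ref. 18: indices are 0-based, the weight callback `w` is a hypothetical stand-in for the edge weights, and START/END edges are taken to cost zero as in the text.

```python
def constrained_shortest_path(n, w):
    """Shortest START-to-END path visiting every column exactly once (M = N).

    Nodes are (i, j, t): row i, column j, direction t, all 0-based.
    `w(prev, cur)` is a hypothetical callback returning the edge weight.
    table[v][U] = (best score to reach v using exactly the columns in U,
                   immediate predecessor on that best path).
    """
    table = {(0, j, t): {frozenset([j]): (0.0, "START")}
             for j in range(n) for t in (0, 1)}
    for i in range(1, n):
        for j in range(n):
            for t in (0, 1):
                node, entries = (i, j, t), {}
                for pj in range(n):
                    if pj == j:
                        continue
                    for pt in (0, 1):
                        prev = (i - 1, pj, pt)
                        for U, (score, _) in table[prev].items():
                            if j in U:          # column j already used on this path
                                continue
                            cand = score + w(prev, node)
                            key = U | {j}
                            if key not in entries or cand < entries[key][0]:
                                entries[key] = (cand, prev)
                table[node] = entries
    full = frozenset(range(n))
    best_score, node = min((table[(n - 1, j, t)][full][0], (n - 1, j, t))
                           for j in range(n) for t in (0, 1))
    # Walk the stored predecessors back to START to recover the path.
    path, U = [], full
    while node != "START":
        path.append(node)
        prev = table[node][U][1]
        U = U - {node[1]}
        node = prev
    return best_score, path[::-1]
```

Each node keeps one entry per column subset rather than one per ordering, which is what replaces the $N!$ factor of the naïve search with a $2^N$ factor.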

The method developed in Ref. 18 uses the topology graph to find the best assignment. However, if the secondary structure prediction changes, the shortest path might be different. Ideally, we should not have to re-compute the entire graph to find the new shortest path. The idea of our graph update algorithm is to update only those nodes that are affected by the change to certain edges. We adapted the general idea from Ref. 23 to update the graph accordingly. The update is handled slightly differently depending on whether an edge weight is incremented or decremented.

The pseudo-code for the increment update is shown in Fig. 4. Suppose that an incoming edge to a node $v$ [e.g. from (1,1,0) to (3,2,0)] is incremented; $v$ is then marked as a node for update [node (3,2,0) in Fig. 3(a)]. Each node $v$ in the graph has different sets of columns $\{U_1, U_2, \ldots, U_k\}$ that can be visited to reach $v$. The lowest cost of reaching $v$ using each column combination is stored along with the immediate predecessor node. For example, there are three sets of columns that can be used to reach node (3,2,0), with cost $f((3,2,0), U_i)$. Here, the weight of edge ((1,1,0),(3,2,0)) is increased from 2 to 4. The algorithm in Fig. 4 checks whether any subset in (3,2,0) has (1,1,0) as the immediate predecessor and finds the new minimum path for that subset. For example, $f((3,2,0), \{1,2\})$ has (1,1,0) as the immediate predecessor, and it is changed to $f((3,2,0), U) = 4$ because the edge weight $w((1,1,0),(3,2,0))$ was increased to 4 and there is no shorter path using the columns $U = \{1,2\}$. A change in any one of the paths to reach $v$ may affect nodes that have $v$ as an ancestor. So, all updated nodes are recorded in a set $R_{inc}$, which is used by other nodes to check for updates. The update is performed in one pass of the graph, row by row.
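The increment update described above might look roughly like the sketch below. It is not the pseudo-code of Fig. 4, which is not reproduced in this text; it is an illustration for the $M = N$ case with 0-based indices, a hypothetical weight callback `w`, and helper names chosen here. The helper `build_tables` is included only so that the update can be compared against a full recomputation.

```python
import math


def recompute_entry(table, w, node, U):
    """Best (score, predecessor) for reaching `node` using exactly the columns in U."""
    i, j, _ = node
    prev_U = U - {j}
    best = (math.inf, None)
    for pj in prev_U:
        for pt in (0, 1):
            prev = (i - 1, pj, pt)
            if prev_U in table[prev]:
                cand = table[prev][prev_U][0] + w(prev, node)
                if cand < best[0]:
                    best = (cand, prev)
    return best


def build_tables(n, w):
    """Full dynamic-programming pass: table[v][U] = (score, predecessor)."""
    table = {(0, j, t): {frozenset([j]): (0.0, "START")}
             for j in range(n) for t in (0, 1)}
    for i in range(1, n):
        for j in range(n):
            for t in (0, 1):
                node = (i, j, t)
                subsets = {U | {j}
                           for pj in range(n) if pj != j for pt in (0, 1)
                           for U in table[(i - 1, pj, pt)] if j not in U}
                table[node] = {U: recompute_entry(table, w, node, U) for U in subsets}
    return table


def increment_update(table, w, n, edge):
    """Repair the tables after the weight of `edge` = (p, v) has been increased.

    `w` must already return the new weight. Only entries whose stored
    predecessor is affected are recomputed, in one pass row by row.
    """
    p, v = edge
    r_inc = set()                                  # nodes whose table changed
    for i in range(v[0], n):
        for j in range(n):
            for t in (0, 1):
                node = (i, j, t)
                for U, (score, pred) in list(table[node].items()):
                    stale = (node == v and pred == p) or pred in r_inc
                    if stale:
                        new = recompute_entry(table, w, node, U)
                        if new != (score, pred):
                            table[node][U] = new
                            r_inc.add(node)
    return r_inc
```

Entries whose predecessor is untouched are left alone: an increment can only make competing paths worse, so their stored minimum stays valid.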

When an incoming edge to node $v$ is decremented, we first check whether the decremented edge can change the lowest cost of reaching $v$ [node (3,1,1) in Fig. 3(a)]. For example, $f((3,1,1), \{1,2\})$ has to be updated as the weight of the edge from the predecessor node (2,2,0) has decreased from 8 to 6. The algorithm searches for the new shortest path to reach (3,1,1) using $U = \{1,2\}$, finds a path through (1,1,1) with cost 5, and updates $f((3,1,1), \{1,2\})$. In the case of the decrement update, the search algorithm only searches paths from the changed parent nodes and compares them to the existing shortest path. If the decrement does not change the lowest cost of reaching $v$, there is no need to update $v$ and hence no need to update the nodes below the current row. Otherwise, we mark node $v$ and add it to $R_{dec}$. The program was written in Java. The performance tests were run on a generic desktop Dell Optiplex 980 machine with 0.8 GHz CPU and 8 GB of memory.

4. Results and Discussions

4.1. Errors in the predicted helices

Fig. 4. The pseudo-code for updating the graph with tables $(U, f(v, U))$ after an edge increment.

Protein secondary structure prediction methods generally have a Q3 accuracy between 70% and 80%.19,22 The overall accuracy does not provide an explicit evaluation in terms of the length and the position of the predicted helices. Since our topology mapping is based on the geometrical matching between the estimated loop length in the volumetric data and the loop length in the sequence, the error in the predicted helices may affect the mapping result. We investigated different kinds of errors in the predicted helices using a dataset of 30 proteins. Instead of a benchmark dataset, we randomly selected the proteins from the CASP 9 dataset, which represents the newly deposited protein structures in the PDB. Out of the 246 helices, only 39.4% were well predicted, and we refer to them as the corresponded helices [column 2 of Table 1(A)]. A corresponded helix is a predicted helix that satisfies two criteria: (1) its center position is within five AAs of the center of the corresponding observed helix in the PDB; (2) its length is within five AAs of the length of the observed helix. Some of the non-corresponded helices are short. However, about 27.6% of all the helices are non-corresponded helices longer than 8 AAs [column 4 of Table 1(A)]. This reflects the challenges in the reality of secondary structure prediction. We calculated the center-to-center shift distribution between the predicted helices and the observed helices for the corresponded helices. The 1-position shift is the most common, covering 35.05% of the corresponded helices. However, shifts of 3, 4, and 5 AAs account for a total of about 26%.
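The two criteria for a corresponded helix translate into a short check. The representation of a helix as an inclusive `(start, end)` pair of residue indices and the function name are assumptions made for this sketch.

```python
def is_corresponded(pred, obs, tol=5):
    """Return True if a predicted helix corresponds to an observed helix.

    pred, obs: (start, end) residue indices, inclusive (assumed layout).
    tol: allowed difference in AAs for both the center and the length.
    """
    pred_center = (pred[0] + pred[1]) / 2
    obs_center = (obs[0] + obs[1]) / 2
    pred_len = pred[1] - pred[0] + 1
    obs_len = obs[1] - obs[0] + 1
    center_ok = abs(pred_center - obs_center) <= tol   # criterion (1)
    length_ok = abs(pred_len - obs_len) <= tol         # criterion (2)
    return center_ok and length_ok
```

A helix that fails either test would be counted among the non-corresponded helices of Table 1(A).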

For 30.08% of the helices, one of the two ends of the helix was predicted fairly accurately (within 3 AAs of the true end), but the other end was predicted to be at least 3 AAs away from its true position. This suggests that one of the two ends can be predicted fairly accurately for these helices. For this reason, we sampled the ends of the helices in the pre-processing of the predicted helices. Our goal was to capture the ends that are actually predicted accurately. The split-error refers to the case in which an observed helix was broken into two smaller helices with at most three AAs in between. As expected, the split-error and merge-error are not common, at 3.25% and 6.5%, respectively. However, such errors might affect the topology mapping significantly since the length and the position of the predicted helix can be very different from the observed ones.

Table 1. The errors in the predicted helices from the protein sequence.

A. The percentage of different types of errors

Total no. of   Corresponded(b)   Non-corresponded   Non-corresponded   1-end(e)   Split(f)   Merge(g)
helices(a)                       (≤8 AAs)(c)        (>8 AAs)(d)
246            39.43%            32.93%             27.64%             30.08%     3.25%      6.50%

B. Shift and length error in the corresponded helices

# AAs(h)     0        1        2        3        4        5
Shift(i)     21.65%   35.05%   17.53%   12.37%   7.22%    6.19%
Length(j)    13.40%   18.56%   21.65%   13.40%   20.62%   12.37%

(a) The total number of helices. (b) The percentage of the corresponded helices. (c) The percentage of the non-corresponded helices with length no more than 8 AAs. (d) The percentage of the non-corresponded helices with length more than 8 AAs. (e) The percentage of the helices with only one-end match. (f) Split: one observed helix is split into two predicted helices with a loop (<3 AAs) in between. (g) Merge error: two observed helices with a loop (<3 AAs) are merged into one predicted helix. (h) The number of AAs that are involved in the center-to-center shift in (i) or the length difference in (j). (i) The percentage of the corresponded helices with the center-to-center shift by #AAs. (j) The percentage of the corresponded helices with length difference of #AAs.

In summary, our test of the 246 helices suggests that various errors exist in the secondary structure prediction. Although it is expected that the consensus prediction can have improved accuracy compared to the individual predictor, it is very likely that the same types of errors are in the consensus prediction as well. It is important to sample the alternative positions and lengths for the predicted helices to find the best match between the volumetric data and the sequence data. We provide an exact algorithm in this paper to improve the efficiency over a naïve way of screening all the alternative positions of the helices.

4.2. Efficiency in the topology searching with the presence of errors

For each tuple picked from the alternative helix positions, a topology graph can be built and the shortest path can be determined using our previously developed dynamic programming method.18 A naïve way of finding the tuple of helices that best matches the constraints of the density map is to construct the topology graph from scratch for each of the possible tuples. We compared our dynamic graph method with the naïve method in terms of the run-time needed to evaluate all the tuples. Seven proteins were randomly selected from the PDB with the requirements of (1) containing only helices and (2) having at least five helices. The smallest protein (1NG6) has 148 AAs, and the largest (2X79) has 501 AAs. For each protein sequence, we collected the secondary structure prediction results from five servers. The prediction results were preprocessed to generate the alternative helix positions for each predicted helix. The preprocessing step restricted the total number of tuples significantly. For example, the largest protein, 2X79, has only 32 helix tuples generated (row 5, column 7 of Table 2). The shortest valid path was searched for each $H$, and the shortest among the 32 paths was selected for the topology. We simulated the protein density map to 10 Å resolution using EMAN.24 SSETracer4 was used to detect the helices from the density map. In the case of protein 2X79, the secondary structure prediction tools were able to predict 14 helix regions out of 26 actual helices in the PDB structure, and SSETracer detected 21 helix sticks from the volumetric map. For each protein, the numbers of rows and columns in the topology graph are the number of helix regions (column 5, Table 2) obtained after preprocessing and the number of sticks (column 6, Table 2), respectively.

Table 2. Improved run-time using dynamic graph update in SSE topology determination.

No.  Protein  #Amino  #True       #Pred.      #Sticks(c)  Helix      Base     1st       All       Naive     %(i)
     ID       acids   helices(a)  helices(b)              tuples(d)  time(e)  helix(f)  comb.(g)  time(h)
1    3FYQ     199     5           6           5           12         0.07     0.05      0.51      0.84      60.7
2    2WVI     164     9           8           7           2          0.46     0.30      0.76      0.92      82.6
3    1NG6     148     9           7           8           8          0.60     0.41      2.3       4.8       47.9
4    3HFW     357     23          15          16          2          200.0    145.5     350.84    400.0     87.5
5    2X79     501     26          14          21          32         879.7    720.5     20,479.2  28,150.4  72.8
6    1TBF     347     22          16          16          24         623.1    579.2     6820.7    14,954.4  45.6
7    3L6A     364     25          18          19          16         1362.5   1124.6    13,253.6  21,800.0  60.8
Average                                                                                                      65.45

(a) The number of the observed helices in the PDB file. (b) The number of helix regions in the secondary structure predictions. (c) The number of sticks detected from the density map. (d) The number of helix tuples generated after preprocessing. (e) The time (in seconds) to build the first graph and to find the shortest path. (f) The time (in seconds) to update the graph for the first alternative helix and to find the shortest path. (g) The time (in seconds) to update the graph for all helix tuples and to find the shortest path. (h) Brute force time to re-compute the entire graph for all the tuples and to search for the shortest path, h = d × e. (i) Percentage of the total time for dynamic update, i = g/h.
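The derived columns of Table 2 follow from the footnote formulas h = d × e and i = g/h, which the short script below evaluates using the values printed in the table. The variable and function names are illustrative. Recomputed percentages may differ from the printed column by a few tenths of a percent, presumably because the printed times are rounded.

```python
# (protein ID, helix tuples d, base time e in s, all-combination update time g in s), from Table 2
TABLE2 = [
    ("3FYQ", 12, 0.07, 0.51),
    ("2WVI", 2, 0.46, 0.76),
    ("1NG6", 8, 0.60, 2.3),
    ("3HFW", 2, 200.0, 350.84),
    ("2X79", 32, 879.7, 20479.2),
    ("1TBF", 24, 623.1, 6820.7),
    ("3L6A", 16, 1362.5, 13253.6),
]


def naive_time(tuples, base_time):
    """Brute-force time h = d * e: rebuild the graph once per tuple."""
    return tuples * base_time


def percent_of_naive(update_time, tuples, base_time):
    """Dynamic-update time as a percentage of the naive time, i = g / h."""
    return 100.0 * update_time / naive_time(tuples, base_time)


percentages = {pid: percent_of_naive(g, d, e) for pid, d, e, g in TABLE2}
average = sum(percentages.values()) / len(percentages)
reduction = 100.0 - average   # average run-time saved relative to the naive method
```

The quantity `reduction` corresponds to the average run-time reduction quoted in the text, and `average` to the last row of the table.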

Although this paper does not address how the errors in the SSE prediction affect the accuracy of topology determination, we would like to use an example (2X79) to illustrate the challenge posed by the errors. We found that five helices were completely missed by the consensus servers, among which four are shorter than five AAs and one has seven AAs. In general, the shorter helices are more likely to be missed in the secondary structure predictions. We also noticed that seven helices in the PDB file (2X79) were predicted as part of longer helices. In other words, two shorter helices that are closely located on the sequence were predicted as a merged, longer helix. In this case, the 14 predicted helix regions cover 21 helices of the PDB structure.

We calculated the "base time", which is the time to create a graph and to search for the shortest path. In the work of Table 2, we used the Euclidean distance between two ends of the two helices for the estimation of $d(j, t, j', t')$. In this paper, the constrained shortest path concept was implemented in Java rather than the C++ used in our earlier work,18 and we noticed a slowdown in our current Java implementation. A naïve way to find the topology with the lowest cost among the 32 tuples for protein 2X79 involves 32 times the base time. Using the dynamic graph concept, our method found the topology with the lowest cost using 72.8% of the time that would have been spent in the naïve way. The average reduction in time is 34.55% for the seven proteins tested (Table 2). If an alternative helix happens to be near the top row of the graph, it generally takes longer to update the graph, since the nodes below it are likely to be updated. Column 9 shows the worst helix update time found for the proteins.

Although "shift-errors" are the most common errors in SSE prediction, the "split-error" is another type of error that needs to be considered. A "split-error" corresponds to the situation in which a long helix is predicted as two shorter helices. We collected a dataset containing four proteins, each of which has one split error. Table 3 shows the performance of our program in evaluating all tuples of the alternative helix positions with both "shift-error" and "split-error". The tuples include those in which two split helices were merged into an alternative helix. To handle the split error, the graph update involves merging two rows into one and updating the $(U, f(v, U))$ table at the appropriate nodes. Table 3 shows that the average reduction in time is 40.50%.

4.3. Improved topology accuracy through the sampling of the errors

The secondary structure topology determination for cryoEM density maps relies on the accuracy of the secondary structures predicted from the protein sequence and those detected from the density maps. How to find the topology from inexact prediction data is still a challenging problem. We propose, in this paper, an approach that carefully generates tuples of alternative helix positions using the predictions from multiple servers. Once the tuples are created, we apply our dynamic graph method to find the optimal solution among all the tuples. We compared the accuracy of this approach versus the use of one consensus tuple created using two consensus prediction servers. We performed this comparison because it is not clear whether the consensus tuple is accurate enough to find the true topology. We present this preliminary comparison using six proteins. These proteins were chosen because most of the helices in these proteins are predicted fairly accurately by the multiple prediction servers, although some helices shorter than two turns are missed. As an example, 13 out of the 16 helices in the PDB structure are predicted approximately correctly in 3LTJ (row 4, Table 4). We used those 13 helices in our test. Both the consensus tuple and the tuples of the alternative helix positions were created. In this case, only 6 tuples (column 6, row 4, Table 4) were generated since there is good agreement among the multiple servers. Yet the number of correctly assigned helices increased from 4 to 8 when the consensus tuple was replaced with the 6 tuples of alternative helices. The dynamic graph method finds the optimal shortest path in the graphs of the 6 tuples, and in this case one of them is a much better choice than the consensus. In the work of Table 4, we estimated $d(j, t, j', t')$ by identifying the possible traces in the density map between end $t$ of $S_j$ and end $t'$ of $S_{j'}$. This process involves clustering the voxels in the density map into critical groups and selecting the possible paths between the two end points. $d(j, t, j', t')$ was estimated as the total distance along a selected path that passes through a set of cluster centers.

Table 3. Efficiency of the graph update involving "split-errors."

No.  Protein  #Amino  #True       #Pred.      #Sticks(c)  Helix      Base     1st        All        Naive     %(i)
     ID       acids   helices(a)  helices(b)              tuples(d)  time(e)  helix(f)   comb.(g)   time(h)
1    3FYQ     199     5           6           5           12         0.06     0.05       0.55       0.72      60.7
2    1NG6     148     9           8           8           8          0.62     0.43       3.12       4.96      47.9
3    1Z1L     345     24          14          14          32         170.3    124.6      3736.5     5449.6    56.4
4    2XSI     585     33          27          19          16         1928.4   1789.3     22,548.3   30,854.4  73.0
Average                                                                                                       59.5

(a) The number of the observed helices in the PDB file. (b) The number of helix regions in the secondary
structure predictions. (c) The number of helix sticks detected from the density map. (d) The number of helix
tuples generated after preprocessing. (e) The time (in seconds) to build the first constraint graph and to
find the shortest path. (f) The time (in seconds) to update the graph for the first alternative helix. (g) The
time (in seconds) to update the graph for all the tuples and to find the shortest path. (h) Brute force time to
re-compute the entire graph for all the tuples and to find the shortest path, h = (d × e). (i) Percentage of
the total time for dynamic update, i = g/(d × e).

We compared the number of correctly assigned helices in the two experiments. A
correctly assigned helix is a helix in the tuple that was mapped to the correct
stick in the density map, as judged by comparison with its PDB structure. The only
difference between the two experiments is the input tuple. A consensus tuple was used
in the first experiment, and multiple tuples derived from the preprocessing step were
used in the second experiment. For five of the six proteins tested, the number of
correctly assigned helices is higher in the second experiment; for 1NG6 it is
unchanged (Table 4). Although the time to find the best topology in the second
experiment is much longer, the run-time is still only about 50% of the time required
by the naïve way of finding the best topology.
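The "about 50%" figure follows from the two formulas in the footnotes of Table 4, g = e × f and i = h/g × 100. The sketch below recomputes these columns from the values reported in Table 4. The variable and function names are illustrative only.

```python
# (protein, helix tuples e, base time f [s], dynamic-update time h [s]), from Table 4
TABLE4 = [
    ("1FLP", 4, 0.40, 0.71),
    ("1NG6", 8, 0.62, 2.42),
    ("2XB5", 48, 0.79, 17.8),
    ("3LTJ", 6, 13.87, 45.7),
    ("3ACW", 8, 29.45, 139.7),
    ("1Z1L", 24, 773.5, 8765.5),
]


def scratch_time(num_tuples, base_time):
    """Brute-force time g = e * f: rebuild the graph once per tuple."""
    return num_tuples * base_time


def percent_of_scratch(actual_time, num_tuples, base_time):
    """Column i = h / g * 100: dynamic-update time relative to brute force."""
    return 100.0 * actual_time / scratch_time(num_tuples, base_time)


percentages = {name: percent_of_scratch(h, e, f) for name, e, f, h in TABLE4}
average = sum(percentages.values()) / len(percentages)

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}% of the brute-force time")
print(f"average: {average:.2f}%")
```

The recomputed percentages agree with the "%" column of Table 4 to within rounding in the last digit.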

Although the accuracy was improved by sampling the multiple secondary structure
prediction results, it is still far from adequate. For example, only 8 out of 13
helices were correctly assigned in protein 3LTJ (Table 4). A number of factors may
contribute to the inaccuracy, including the accuracy of the edge weight estimation,
the accuracy of the detected sticks, and the accuracy of the sampling of the
alternative helices. The dynamic graph method developed in this paper provides an
optimal solution for any selected edge weights and any selected tuples of
alternative helices. We have noticed that the true topology is often in the top
portion of the ranked list of topologies (data not shown). It is possible that
the work in this paper needs to be extended to find the top-k ranked solutions instead
of the top-1 solution.
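The accuracy figures discussed above count a helix as correctly assigned when it is mapped to the same stick as in the PDB structure. A minimal sketch of that count is given below. It assumes a topology is represented as a dictionary from helix labels to stick labels. The representation, the labels and the function name `count_correct` are hypothetical and chosen only for illustration.

```python
def count_correct(assignment, reference):
    """Count helices mapped to the same stick as in the reference topology.

    assignment -- dict: helix label -> stick label found by the search
    reference  -- dict: helix label -> stick label derived from the PDB structure
    """
    return sum(
        1
        for helix, stick in assignment.items()
        if helix in reference and reference[helix] == stick
    )


# Toy data, not taken from the paper.
reference = {"H1": "S1", "H2": "S2", "H3": "S3", "H4": "S4"}
consensus = {"H1": "S1", "H2": "S3", "H3": "S2", "H4": "S4"}  # two helices swapped
sampled = {"H1": "S1", "H2": "S2", "H3": "S3"}                # H4 left unassigned

print(count_correct(consensus, reference), count_correct(sampled, reference))
```

Applying the same count to each of the top-k ranked topologies, rather than only the top-1 solution, would be one way to check how far down the ranked list the true topology sits.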

Table 4. Improved topology accuracy over the consensus helix prediction.

Protein  #AA(a)  #PDB        #True       #Sticks(d)  Helix      Base     Scratch   Actual   %(i)   Cons.      Best
ID               helices(b)  helices(c)              tuples(e)  time(f)  time(g)   time(h)         assign(j)  assign(k)
1FLP     142     7           7           7           4          0.40     1.6       0.71     44.37  4          5
1NG6     148     9           8           7           8          0.62     4.96      2.42     48.79  5          5
2XB5     207     13          10          9           48         0.79     37.92     17.8     46.94  3          6
3LTJ     210     16          13          12          6          13.87    83.22     45.7     54.91  4          8
3ACW     293     17          13          14          8          29.45    235.6     139.7    59.29  4          6
1Z1L     345     23          17          14          24         773.5    18564     8765.5   47.21  4          9
Average                                                                                     50.25

Note: (a) The number of AAs in the protein. (b) The number of the observed helices in the PDB file. (c) The
number of helix regions in the secondary structure predictions. (d) The number of sticks detected from the
density map. (e) The number of helix tuples generated after preprocessing. (f) The time (in seconds) to build
the first graph and to find the shortest path. (g) Brute force time to re-compute the entire graph for all the
tuples and to search for the shortest path, g = (e × f). (h) The time (in seconds) to update the graph for
all helix tuples and to find the shortest path. (i) Percentage of the total time for dynamic update,
i = h/g × 100. (j) The number of sticks correctly assigned using the consensus sequence prediction only.
(k) The number of sticks correctly assigned using helix tuples (e) considering shift and split errors.

5. Conclusions

Due to the inaccuracy in both the secondary structure prediction from the sequence
and the detection from the density map, the true topology needs to be searched with
such errors taken into account. We presented a framework to reduce the computation
in topology search using a combination of preprocessing and dynamic graph update.
Our method provides the optimal solution among the alternative helix
positions that were carefully chosen in the preprocessing step. Using a small test
dataset, the reduction in time is about 34.55% for sampling the "shift-error" compared
to the naïve way of searching. To our knowledge, this is the first computationally
effective exact algorithm to identify the optimal secondary structure topology when
the inaccuracy of the predicted data is considered.

Acknowledgments

The authors acknowledge financial support from Old Dominion University MDS
research fund.

References

1. Chiu W, Electron microscopy of frozen, hydrated biological specimens, Annu Rev Biophys Biophys Chem 15:237–257, 1986.

2. Zhou ZH, Atomic resolution cryo electron microscopy of macromolecular complexes, Adv Protein Chem Struct Biol 82:1–35, 2011.

3. Zhou ZH, Towards atomic resolution structural determination by single-particle cryo-electron microscopy, Current Opin Struct Biol 18:218–228, 2008.

4. Del Palu A, He J, Pontelli E, Lu Y, Identification of alpha-helices from low resolution protein density maps, Proc Computational Systems Bioinformatics Conference (CSB) 89–98, 2006.

5. Jiang W, Baker ML, Ludtke SJ, Chiu W, Bridging the information gap: Computational tools for intermediate resolution structure interpretation, J Mol Biol 308:1033–1044, 2001.

6. Kong Y, Ma J, A structural-informatics approach for mining beta-sheets: Locating sheets in intermediate-resolution density maps, J Mol Biol 332:399–413, 2003.

7. Baker ML, Ju T, Chiu W, Identification of secondary structure elements in intermediate-resolution density maps, Structure 15:7–19, 2007.

8. Baker ML, Abeysinghe SS, Schuh S, Coleman RA, Abrams A, Marsh MP, Hryc CF, Ruths T, Chiu W, Ju T, Modeling protein structure at near atomic resolutions with Gorgon, J Struct Biol 174:360–373, 2011.

9. Lu Y, He J, Strauss CE, Deriving topology and sequence alignment for the helix skeleton in low-resolution protein density maps, J Bioinform Comput Biol 6:183–201, 2008.

10. Sun W, He J, Native secondary structure topology has near minimum contact energy among all possible geometrically constrained topologies, Proteins 77:159–173, 2009.

11. Al Nasr K, Sun W, He J, Structure prediction for the helical skeletons detected from the low resolution protein density map, BMC Bioinformatics 11(Suppl 1):S44, 2010.

12. Bryson K, McGuffin LJ, Marsden RL, Ward JJ, Sodhi JS, Jones DT, Protein structure prediction servers at University College London, Nucleic Acids Res 33:W36–W38, 2005.

13. Pollastri G, Przybylski D, Rost B, Baldi P, Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles, Proteins 47(2):228–235, 2002.

14. Pollastri G, McLysaght A, Porter: A new, accurate server for protein secondary structure prediction, Bioinformatics 21:1719–1720, 2005.

15. McGuffin LJ, Bryson K, Jones DT, The PSIPRED protein structure prediction server, Bioinformatics 16:404–405, 2000.

16. Przybylski D, Rost B, Alignments grow, secondary structure prediction improves, Proteins 46:197–205, 2002.


17. Lindert S, Staritzbichler R, Wotzel N, Karakas M, Stewart PL, Meiler J, EM-fold: De novo folding of alpha-helical proteins guided by intermediate-resolution electron microscopy density maps, Structure 17:990–1003, 2009.

18. Al-Nasr K, Ranjan D, Zubair M, He J, Ranking valid topologies of the secondary structure elements using a constraint graph, J Bioinformatics Comput Biol 9:415–430, 2011.

19. Frishman D, Argos P, Seventy-five percent accuracy in protein secondary structure prediction, Proteins: Struct Funct Bioinform 27:329–335, 1997.

20. Simossis VA, Heringa J, The influence of gapped positions in multiple sequence alignments on secondary structure prediction methods, Comput Biol Chem 28:351–366, 2004.

21. Cuff JA, Clamp ME, Siddiqui AS, Finlay M, Barton GJ, JPred: A consensus secondary structure prediction server, Bioinformatics 14:892–893, 1998.

22. Adamczak R, Porollo A, Meller J, Combining prediction of secondary structure and solvent accessibility in proteins, Proteins: Struct Funct Bioinform 59:467–475, 2005.

23. Ramalingam G, Reps T, An incremental algorithm for a generalization of the shortest-path problem, J Algorithms 21:267–305, 1996.

24. Ludtke SJ, Baldwin PR, Chiu W, EMAN: Semi-automated software for high resolution single particle reconstructions, J Struct Biol 128:82–97, 1999.

Abhishek Biswas is a Ph.D. candidate at the Department of Computer Science at
Old Dominion University. He received his BE degree from Amrita School of
Engineering, Bangalore, India in 2007. He worked as a developer at Tata Consultancy
Services from September 2007 to December 2008 and as a research associate at PES
School of Engineering, Bangalore, India until August 2009, before joining Old
Dominion University.

Dong Si is a Ph.D. candidate at the Department of Computer Science at Old

Dominion University. He received his undergraduate degree from Nanjing University

before joining Old Dominion University.

Kamal Al Nasr is a Ph.D. candidate at the Department of Computer Science at
Old Dominion University. He received his BS and MS degrees from Yarmouk
University, Jordan, in 2003 and 2005, respectively. He worked as a part-time instructor
at Yarmouk University, Jordan. Between August 2007 and August 2009, he was a
graduate student conducting research at the CREST Center (Center for Research
Excellence in Bioinformatics and Computational Biology) at the Department of
Computer Science at New Mexico State University before joining Old Dominion
University.

Desh Ranjan received his undergraduate degree in Computer Science from Indian
Institute of Technology, Kanpur in 1987 and Masters and Ph.D. degrees in
Computer Science, with a minor in Mathematics, from Cornell University in Ithaca,
New York, in 1990 and 1992, respectively. After finishing his Ph.D., he spent one
year (1992–1993) as a post-doctoral fellow at the Max Planck Institute for
Computer Science in Saarbrucken, Germany. His research interests include Algorithms,
Bioinformatics and Computational Complexity.

Mohammad Zubair received his B.Sc. in Electrical & Electronics Engineering
from Delhi University, India, in 1981. He received his Ph.D. degree from Indian
Institute of Technology, India, in 1987. He worked at IBM T.J. Watson Research
Center for three years. Currently, he is a Professor at the Department of Computer
Science at Old Dominion University. He has conducted research in three areas:
Digital Library, Advanced Web Applications including E-Commerce, High
Performance Computing and Bioinformatics.

Jing He obtained her B.S. degree in Mathematics from Jilin University and M.S.

degree in Mathematics from New Mexico State University. She worked in the area of

three-dimensional reconstruction and analysis of virus structures at Baylor College of

Medicine from which she obtained her Ph.D. in Structural and Computational

Biology and Molecular Biophysics. Currently she is an Associate Professor at the

Department of Computer Science at Old Dominion University.
