IMPROVED EFFICIENCY IN CRYO-EM SECONDARY
STRUCTURE TOPOLOGY DETERMINATION
FROM INACCURATE DATA
ABHISHEK BISWAS, DONG SI, KAMAL AL NASR, DESH RANJAN,
MOHAMMAD ZUBAIR and JING HE*
Department of Computer Science
Old Dominion University, Norfolk, VA 23529, USA
*[email protected]
Received 17 February 2012
Revised 7 April 2012
Accepted 7 April 2012
Published 7 June 2012
The determination of the secondary structure topology is a critical step in deriving the atomic structure from the protein density map obtained from the electron cryo-microscopy technique. This step often relies on matching two sources of information. One source comes from the secondary structures detected from the protein density map at medium resolution, such as 5–10 Å. The other source comes from the secondary structures predicted from the amino acid sequence. Due to the inaccuracy in either source of information, a pool of possible secondary structure positions needs to be sampled. This paper studies how to reduce the computation of the mapping when the inaccuracy of the secondary structure predictions is considered. We present a method that combines the concept of a dynamic graph with our previous work on using a constrained shortest path to identify the topology of the secondary structures. We show a 34.55% reduction in run-time compared with the naïve way of handling the inaccuracies. We also show improved accuracy when the potential secondary structure errors are explicitly sampled versus the use of one consensus prediction. Our framework demonstrates the potential of developing computationally effective exact algorithms to identify the optimal topology of the secondary structures when the inaccuracy of the predicted data is considered.
Keywords: Constraint dynamic graph; topology; secondary structure; helix prediction error;
electron cryo-microscopy.
*Corresponding author.
Journal of Bioinformatics and Computational Biology, Vol. 10, No. 3 (2012) 1242006 (16 pages)
© Imperial College Press
DOI: 10.1142/S0219720012420061
1. Introduction
Electron cryo-microscopy (cryoEM) is a promising technique for studying the three-dimensional structure of macromolecular complexes1–3 and it is complementary to X-ray crystallography and nuclear magnetic resonance techniques. Although the backbone is not resolved in the density map at medium resolution such as 5–10 Å, secondary structure elements (SSEs) such as helices and β-sheets can be detected.4–8 A helix detected from the density map is represented as a stick [red cylinder in Fig. 1(a)] and a β-sheet appears as a thin sheet [blue, Fig. 1(a)]. Due to the medium resolution, the strands of the β-sheet are often not distinguishable, and the connection between two SSEs is often ambiguous. The major challenge in deriving the protein structure from such cryoEM maps is that it is not known which segment of the protein sequence corresponds to which of the SSEs detected from the density map. A topology of the SSEs refers to the order of the SSEs with respect to the protein sequence and the direction of each SSE. For example, the true topology of the protein in Fig. 1 presents the true order of the SSEs as $(S_2, S_7, E_1, S_9, S_{10}, S_1, E_2, E_3, E_4, S_3, S_6, S_4, S_8, S_5)$ [Fig. 1(b)]. In principle, each stick $S_j$, $j = 1, \ldots, 10$, of the protein corresponds to a sequence segment $H_i$, $i = 1, \ldots, 10$, that forms a helix in the structure. The four sequence segments $E_1$, $E_2$, $E_3$, and $E_4$ correspond to a sheet that can be detected in the density map. Note that there are two directions in which a sequence segment $H_i$ can correspond to $S_j$ [arrows of Fig. 1(a) and dot and cross in Fig. 1(b)]. Since the topology problem involving β-sheets is still challenging, our work in this paper focuses on the topology problem for α-proteins in which no β-sheets are involved. The α-helices are generally detected more reliably than β-sheets from the density map at medium resolution and often play important roles in deriving the topology.
Fig. 1. SSEs and the topology. (a) The density map (grey) was simulated to 10 Å resolution using protein 3PBA from the Protein Data Bank (PDB) and EMAN software24. The SSEs (red: helix sticks, blue: sheet) were detected using SSE Tracer, an extended version of Helix Tracer4, and viewed by Chimera. For clear viewing, only SSEs at the front of the structure are labeled. Arrows: the direction of the protein sequence; (b) The true topology of the sticks (arrow, cross and dot for directions); (c) H1 to H10: helix segments; E1 to E4: β-strands; "...": loops longer than two amino acids.
The problem of determining the topology of the helix secondary structures, in its simplest form, can be formulated as follows. Let $H = (H_1, H_2, \ldots, H_N)$ be a tuple of sequence segments that form helices [Fig. 1(c)]. Let $S = \{S_1, S_2, \ldots, S_N\}$ be a set of accurately detected helix sticks from the density map [Fig. 1(a)]. The topology determination problem can be described as the problem of finding a permutation $\sigma$ of $\{1, 2, \ldots, N\}$ such that assigning $H_i$ to $S_{\sigma(i)}$, $i = 1, \ldots, N$, minimizes the assignment score. In the assignment, each $H_i$ is assigned to $S_{\sigma(i)}$ in one of the two opposite directions [Fig. 1(b), cross or dot]. Note that there are $N!$ permutations and two possible directions to assign each segment $H_i$ to $S_j$, $1 \le i, j \le N$. A naïve way to derive the true topology is to go over all $N!$ permutations of $S$, with both directions for each stick, to find the best assignment. Therefore, the naïve method will take $O(N!\,2^N)$ time.9–11
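The naïve enumeration described above can be sketched as follows. This is only an illustration of the $N!\,2^N$ count: the function name, the `score` callback and the toy scoring function are assumptions made for the example and are not part of the original method.

```python
from itertools import permutations, product

def naive_topology_search(n, score):
    """Exhaustively score every (permutation, direction) assignment.

    `score(perm, dirs)` is a hypothetical callback: perm[i] is the stick
    index assigned to helix i, and dirs[i] in {0, 1} is its direction.
    """
    best_score, best_assignment, evaluated = float("inf"), None, 0
    for perm in permutations(range(n)):           # N! stick orders
        for dirs in product((0, 1), repeat=n):    # 2^N direction choices
            evaluated += 1
            s = score(perm, dirs)
            if s < best_score:
                best_score, best_assignment = s, (perm, dirs)
    return best_score, best_assignment, evaluated

def toy_score(perm, dirs):
    # Toy score for illustration only: zero for identity order, direction 0.
    return sum(abs(p - i) for i, p in enumerate(perm)) + sum(dirs)

best, assignment, count = naive_topology_search(3, toy_score)
```

For three helices the loop evaluates 3! x 2^3 = 48 assignments, which is why exhaustive search quickly becomes impractical as the number of helices grows.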
The topology problem is complicated by the inaccuracy in $H$ and $S$, since either of them can have errors. $H$ is generally obtained from secondary structure prediction tools using the amino acid (AA) sequence of the protein. Although many methods are available for protein secondary structure prediction, such as PSIPRED,12 SSPRO,13 and Porter,14 most of them have a prediction accuracy of 70%–80%.14–16 Although the rough position can be predicted for most of the helices, the length of a helix may not be exactly correct and its position can be shifted as well (Fig. 2). A "shift-error" happens when the beginning or the ending position of a helix is shifted slightly from its true position (Helix 2 of Fig. 2). A "split-error" occurs when an observed helix is predicted as two shorter helices (Helix 3 of Fig. 2). Similarly, a "merge-error" corresponds to the situation in which two shorter helices are predicted as one long helix. Short helices may be missed and a wrong helix may exist in the prediction. The errors in $S$ are similar in nature to those in $H$ because of the inexact detection of the helices from the volumetric density map. Ideally, all the possible positions near the predicted position of each helix should be sampled to find the best tuple of sequence segments that matches the constraints in the volumetric data. However, this is a computationally expensive task due to the number of possible tuples to sample. In this paper, we present an algorithm to search through all the possible tuples of the helices from the inexact helix positions given by the secondary structure prediction methods. Our dynamic graph algorithm shows a 34.55% improvement in run-time over the naïve way of evaluating all the possible tuples when dealing with the "shift-error" and "split-error" of the secondary structure prediction.
Fig. 2. Prediction errors for helices. The observed position of the helices (Helix 1, ..., Helix 4), the predicted helices (Helix 2′, Helix 3′, Helix 3″, Helix 4′, Helix 2″) from multiple prediction methods, and the consensus prediction of the helices (shaded) are illustrated.
2. Related Work
Suppose that each predicted helix has a maximum error of $t$ amino-acid shifts to the left and $t$ shifts to the right of the true position. The total number of possible topologies is $N!\,2^N (2t+1)^N$. Two topologies in this solution space may differ from each other by the order of the sticks, the direction of a stick, or the beginning/ending position of a helix. A direct method to work around the inaccuracy is to generate a consensus secondary structure prediction from multiple prediction methods. The consensus prediction is then used for $H$ in topology determination.8 Although the consensus is the overall best guess, some of the helices in the consensus may be less accurate than the helix predicted by a particular method. Alternatively, EM-fold creates a pool of the potential helix positions for each helix.17 It then uses a Monte Carlo method to sample the solution space guided by a scoring function. Although this method can be used to sample a large solution space, the inherent randomness of the Monte Carlo method means that not all of the potential positions are sampled.
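To give a feel for how quickly the solution space $N!\,2^N (2t+1)^N$ grows, the count can be evaluated directly. The function name below is an assumption chosen for this illustration.

```python
from math import factorial

def solution_space_size(n_helices, t):
    """Number of topologies N! * 2^N * (2t + 1)^N for N helices,
    each allowed to shift by up to t amino acids in either direction."""
    return factorial(n_helices) * 2 ** n_helices * (2 * t + 1) ** n_helices

# With no shift error the count reduces to the N! * 2^N of the error-free case.
error_free = solution_space_size(3, 0)
with_shift = solution_space_size(3, 1)
```

Even for three helices, allowing a shift of one amino acid multiplies the count by 3^3 = 27, which motivates restricting the pool of alternative positions.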
Although the entire solution space is huge, in practice two factors may limit it to a computationally manageable size. Firstly, the number of helices in a protein is bounded. Most medium-size proteins have fewer than 10 helices and most large proteins (300–500 AAs) have fewer than 20 helices longer than one turn. Secondly, some helices are predicted fairly accurately and consistently among different servers. This means the number of alternative positions of a helix might be much smaller than estimated. It is possible to generate a pool for each helix that includes the representative positions.
We previously developed a dynamic programming method for the topology determination problem in an "error-free" situation. It reduced the computation from $O(N!\,2^N)$, as needed in a naïve method, to $O(N^2 2^N)$.18 In this paper we further explore the use of dynamic graph concepts in dealing with the possible errors. A naïve way of finding the tuple of helices that best matches the constraints of the density map is to construct the topology graph from scratch for each of the possible tuples. We will show in this paper that it is not necessary to construct the graph from scratch. The dynamic graph method developed in this paper reduces the computational time to about 65.45% of the time needed by the naïve way of evaluating the inaccurate input. The method developed in this paper is an exact algorithm rather than a stochastic method.
3. Method
3.1. Evaluation of the errors in predicted helices
In order to investigate the nature of the errors in the predicted helices, we collected the secondary structure prediction results using the PREDATOR online server.19 We randomly selected 30 proteins from the CASP 9 dataset, which together contain 246 helices. Different types of errors were quantified.
3.2. Pre-processing of the predicted helix positions
We used two consensus prediction servers, SYMPRED20 and JPRED21, to produce a consensus prediction. The consensus prediction for each helix simply contains the positions shared by the two predictions. We used three other servers, PSIPRED,12 PREDATOR,19 and SABLE22, to extend from the consensus prediction. Since each prediction has slightly different results, the pre-processing step aims to generate a list of the alternative positions for each helix. Ideally, the alternative helix positions should represent the predicted population while keeping the number of alternatives as small as possible for the sake of computation. For each end of a helix, if the predicted position is more than $x$ AAs different from an existing alternative position, it is included as a new alternative end of the helix. We used $x = 3$ for the work of Tables 2 and 3 and $x = 2$ for Table 4. In other words, we only create an alternative position for the helix if it is quite different from an existing one. For example, suppose the shaded segments represent the consensus predictions; then Helix 2′ is an alternative prediction [Fig. 2(a)]. The right end of Helix 2′ will be an alternative right end of Helix 2. In this case, tuples of the alternative helix positions are generated. An example of such a tuple is (Helix 1, Helix 2′, Helix 3′, Helix 3″, Helix 4′).
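The pooling rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (a consensus `(start, end)` pair plus lists of predicted starts and ends), the function names, and the choice to combine alternative ends by a Cartesian product are all hypothetical and are not specified in the text.

```python
from itertools import product

def alternative_ends(consensus_end, predicted_ends, x):
    """Keep a predicted end only if it is more than x AAs away from
    every alternative already in the pool (consensus end comes first)."""
    pool = [consensus_end]
    for position in predicted_ends:
        if all(abs(position - kept) > x for kept in pool):
            pool.append(position)
    return pool

def helix_tuples(helices, x):
    """Generate tuples of alternative helix positions.

    Each helix is a dict (assumed layout) with keys:
      "consensus": (start, end), "starts": [...], "ends": [...].
    """
    per_helix = []
    for h in helices:
        starts = alternative_ends(h["consensus"][0], h["starts"], x)
        ends = alternative_ends(h["consensus"][1], h["ends"], x)
        # Assumption: every start is paired with every end of the same helix.
        per_helix.append([(s, e) for s in starts for e in ends if s < e])
    return list(product(*per_helix))

example = [
    {"consensus": (10, 20), "starts": [11, 15], "ends": [21, 26]},
    {"consensus": (30, 40), "starts": [31], "ends": [41]},
]
tuples = helix_tuples(example, x=3)
```

In this toy input only the start at 15 and the end at 26 differ from the consensus by more than three AAs, so the first helix contributes four alternative positions and the second contributes one.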
3.3. The topology graph
We previously developed a constraint graph to represent the secondary structure topology problem.18 The graph [Fig. 3(a)] has a two-dimensional layout. The rows represent the list of sequence segments of helices $H = (H_1, H_2, \ldots, H_M)$. The columns represent the set of helix sticks $S = \{S_1, S_2, \ldots, S_N\}$ detected from the protein density map. We use a weighted directed graph $G_{Top} = (V, E, w)$ to represent the SSE topology problem. $V$ has $M \times N \times 2$ "regular" nodes and two special nodes, START and END. The indices for the row and column of a node are $i$ and $j$, respectively. The two ends of a stick are marked by $t = 0$ or $t = 1$ to distinguish the two directions of each assignment. A node $(i, j, t)$ represents an assignment of $H_i$ to $S_j$ in direction $t$. An edge from node $(i, j, t)$ to $(i', j', t')$ represents the assignment of $H_{i'}$ to $S_{j'}$ in direction $t'$ right after the assignment of $H_i$ to $S_j$ in direction $t$. When $M = N$, each edge connects nodes in two adjacent rows. When $M \ge N$, an edge can skip a maximum of $M - N$ rows. There is an edge connecting ⟨START⟩ to each of the nodes in the first $M - N + 1$ rows. The weight of such special edges is zero. Similarly, there is an edge connecting each node of the last $M - N + 1$ rows to ⟨END⟩. The weight of these special edges is also zero. The weighted graph $G_{Top} = (V, E, w)$ with $M \ge N$ is defined as follows.
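$$V = \{(i, j, t) \mid 1 \le i \le M,\ 1 \le j \le N,\ t \in \{0, 1\}\} \cup \{\mathrm{START}, \mathrm{END}\}$$
$$E = \{((i, j, t), (i', j', t')) \mid 1 \le i \le M - 1,\ 1 \le j \ne j' \le N,\ i < i' \le \min(i + M - N + 1,\ M),\ t, t' \in \{0, 1\}\}$$
$$\qquad \cup\ \{(\mathrm{START}, (i, j, t)) \mid 1 \le i \le M - N + 1,\ 1 \le j \le N,\ t \in \{0, 1\}\}$$
$$\qquad \cup\ \{((i, j, t), \mathrm{END}) \mid N \le i \le M,\ 1 \le j \le N,\ t \in \{0, 1\}\}.$$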
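The node and edge sets defined above can be enumerated directly, which is a convenient way to check the index ranges. The sketch below only builds the unweighted structure; the function name and the use of the strings "START" and "END" as node labels are assumptions made for the example.

```python
def build_topology_graph(M, N):
    """Enumerate V and E of the topology graph for M rows and N columns (M >= N)."""
    regular = [(i, j, t)
               for i in range(1, M + 1)
               for j in range(1, N + 1)
               for t in (0, 1)]
    edges = []
    for (i, j, t) in regular:
        # i < i' <= min(i + M - N + 1, M): an edge may skip at most M - N rows.
        for ip in range(i + 1, min(i + M - N + 1, M) + 1):
            for jp in range(1, N + 1):
                if jp == j:
                    continue              # j != j': a stick is not reused
                for tp in (0, 1):
                    edges.append(((i, j, t), (ip, jp, tp)))
    # START connects to the first M - N + 1 rows, the last M - N + 1 rows to END.
    edges += [("START", v) for v in regular if v[0] <= M - N + 1]
    edges += [(v, "END") for v in regular if v[0] >= N]
    return regular + ["START", "END"], edges

nodes, edges = build_topology_graph(3, 3)
```

For $M = N = 3$ this yields 18 regular nodes plus the two special nodes, and 48 regular edges plus 6 START edges and 6 END edges.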
We assigned $\infty$ as the edge weight for two consecutive assignments that are impossible. Some of the impossible situations arise when the length of the sequence fragment differs from the length of the stick by 60%. Another impossible situation happens when the loop is too short to connect the two sticks. For example, it is impossible for a loop of length 3 to connect two ends of the helices that are 15 Å apart in the density map if we assume the distance between two AAs is about 3.8 Å, since such a loop spans at most about 11.4 Å. For any possible edge, the weight is $w((i, j, t), (i', j', t')) = |l(i, i') - d(j, t, j', t')| + b$, in which $l(i, i')$ is the number of AAs between $H_i$ and $H_{i+1}$ measured on the protein sequence; $d(j, t, j', t')$ is the distance estimated from the density map between $H_i$ and $H_{i+1}$ when they are assigned to $S_j$ at end $t$ and $S_{j'}$ at end $t'$, respectively; and $b$ is a penalty parameter. The edge weight measures the cost of assigning two consecutive helices in the sequence to two sticks.18 Given the graph, the topology determination problem becomes the problem of finding the shortest path from START to END that satisfies certain constraints. The most important constraint is that no column is visited more than once in a valid path and, similarly, no row is visited more than once.18
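The loop-length feasibility check and the weight formula can be illustrated with a small function. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the 60% length criterion is omitted, and the formula $|l - d| + b$ is applied literally as written in the text, which does not state whether $d$ is rescaled to residue units before the subtraction.

```python
import math

def edge_weight(loop_len_aa, distance_angstrom, b=0.0, aa_span=3.8):
    """Weight of an edge between two consecutive helix assignments.

    loop_len_aa       -- l(i, i'): number of AAs in the connecting loop
    distance_angstrom -- d(j, t, j', t'): distance between the stick ends
    b                 -- penalty parameter from the text
    aa_span           -- assumed distance covered by one AA (about 3.8 Å)
    """
    if loop_len_aa * aa_span < distance_angstrom:
        return math.inf                 # loop too short to bridge the two ends
    # Formula applied literally as given in the text (assumption about units).
    return abs(loop_len_aa - distance_angstrom) + b

too_short = edge_weight(3, 15.0)        # 3 AAs span only about 11.4 Å
feasible = edge_weight(5, 15.0)         # 5 AAs span about 19 Å
```

The first call reproduces the example from the text: a three-residue loop cannot connect two ends that are 15 Å apart, so the edge is marked as impossible.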
Fig. 3. Dynamic graph update. (a) A topology graph with each edge labeled by its original weight and, if the weight changes, its new weight. The shortest paths before (bold green) and after (dashed) the update are shown. (b) Increment updates (upper box) to the table at node (3,2,0). Decrement update (lower box) to the table at node (3,1,1). For clear viewing, the edge from ⟨START⟩ to (2,2,0) is not drawn.
3.4. Graph update algorithm
To find the shortest valid path, our dynamic programming method stores a table at each node.18 For each node $v = (i, j, t)$, let $U$ be a subset of columns, $U \subseteq \{1, 2, \ldots, N\}$, with $i - (M - N) \le |U| \le i$ and $j \in U$. A table containing $(U, f(v, U))$ is stored at each node $v$, where $f(v, U)$ is the best score to reach $v$ using all the columns of $U$. For example, node (3, 2, 0) [Fig. 3(b)] can be reached by different sets of columns {1,2}, {3,2}, or {1,2,3}. Note that the order in which the columns are visited can differ even when the same set of columns is used. The best score to reach node (3, 2, 0) using columns 1 and 2 is 2 [Figs. 3(a) and 3(b)].
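The per-node tables $(U, f(v, U))$ can be sketched as a dynamic program over column subsets. This is a minimal illustration and not the authors' implementation (which was written in Java): the function name, the `weight` callback and the bitmask encoding of $U$ are assumptions made for the example.

```python
import math

def constrained_shortest_path(M, N, weight):
    """Best cost of a START-to-END path that uses every column exactly once.

    Nodes are (i, j, t) with 1-based row i and column j; `weight(u, v)` is a
    hypothetical callback returning the edge cost (math.inf if impossible).
    Column sets U are stored as bitmasks.
    """
    skip = M - N                       # maximum number of rows that may be skipped
    full = (1 << N) - 1                # bitmask of U = {1, ..., N}
    table = {}                         # node -> {U: f(node, U)}
    for i in range(1, M + 1):          # rows in order, so predecessors are ready
        for j in range(1, N + 1):
            bit = 1 << (j - 1)
            for t in (0, 1):
                v = (i, j, t)
                f = {bit: 0.0} if i <= skip + 1 else {}   # zero-weight START edge
                for ip in range(max(1, i - skip - 1), i):
                    for jp in range(1, N + 1):
                        if jp == j:
                            continue
                        for tp in (0, 1):
                            u = (ip, jp, tp)
                            w = weight(u, v)
                            for mask, cost in table[u].items():
                                if mask & bit or math.isinf(w):
                                    continue   # column j already used, or impossible edge
                                new = mask | bit
                                f[new] = min(f.get(new, math.inf), cost + w)
                table[v] = f
    # Zero-weight END edges leave the last M - N + 1 rows.
    return min((f[full] for (i, _, _), f in table.items() if i >= N and full in f),
               default=math.inf)
```

Storing one entry per column subset rather than per column ordering is what removes the factorial term: two paths that reach the same node through the same set of columns compete for a single table entry.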
The method developed in Ref. 18 was able to use the topology graph to find the best assignment. However, if the secondary structure prediction changes, the shortest path might be different. Ideally, we do not have to re-compute the entire graph to find the new shortest path. The idea of our graph update algorithm is to update only those nodes that need to be updated due to the change to certain edges. We adapted the general idea from Ref. 23 to update the graph accordingly. The update is handled slightly differently depending on whether the edge weight is incremented or decremented.
The pseudo-code for the increment update is shown in Fig. 4. Suppose that the weight of an incoming edge [e.g. from (1,1,0) to (3,2,0)] to a node $v$ is incremented; $v$ is then marked as a node for update [node (3,2,0) in Fig. 3(a)]. Each node $v$ in the graph has different sets of columns $\{U_1, U_2, \ldots, U_k\}$ that can be visited to reach $v$. The lowest cost of reaching $v$ using each column combination is stored along with the immediate predecessor node. For example, there are three sets of columns that can be used to reach node (3,2,0) with cost $f((3, 2, 0), U_i)$. Here, the weight of edge ((1,1,0),(3,2,0)) is increased from 2 to 4. The algorithm in Fig. 4 checks if any subset in (3,2,0) has (1,1,0) as the immediate predecessor and finds the new minimum path for that subset. For example, $f((3, 2, 0), \{1, 2\})$ has (1,1,0) as the immediate predecessor, and it is changed to $f((3, 2, 0), U) = 4$ because the edge weight $w((1, 1, 0), (3, 2, 0))$ was increased to 4 and there is no shorter path using the columns $U = \{1, 2\}$. A change in any one of the paths to reach $v$ may affect nodes that have $v$ as an ancestor. So, all updated nodes are recorded in the set $R_{inc}$, which is used by other nodes to check for updates. The update is performed in one pass of the graph, row by row.
When the weight of an incoming edge to node $v$ is decremented, we first check if the decremented edge can change the lowest cost of reaching $v$ [node (3,1,1) in Fig. 3(a)]. For example, $f((3, 1, 1), \{1, 2\})$ has to be updated as the weight of the edge from the predecessor node (2, 2, 0) has decreased from 8 to 6. The algorithm searches for the new shortest path to reach (3,1,1) using $U = \{1, 2\}$, finds a path through (1,1,1) with cost 5, and updates $f((3, 1, 1), \{1, 2\})$. In the case of the decrement update, the search algorithm only searches paths from the changed parent nodes and compares them to the existing shortest path. If the decrement does not cause a change to the lowest cost of reaching $v$, there is no need to update $v$ and hence no need to update the nodes below the current row. Otherwise, we mark node $v$ and add it to $R_{dec}$. The program was written in Java. The performance tests were run on a generic desktop Dell Optiplex 980 machine with 0.8 GHz CPU and 8 GB of memory.
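The update idea described above can be illustrated with a deliberately simplified sketch. It is not the algorithm of Fig. 4: instead of repairing individual subsets, it recomputes the whole table of an affected node, and it treats increments and decrements alike. All names (`node_entries`, `build_tables`, `update_tables`) and the bitmask encoding are assumptions made for this example; the set `updated` plays the role of $R_{inc}$ or $R_{dec}$.

```python
import math

def node_entries(v, table, weight, M, N):
    """Table {column-set bitmask: (cost, predecessor)} of one node."""
    i, j, _ = v
    bit, skip = 1 << (j - 1), M - N
    entries = {bit: (0.0, None)} if i <= skip + 1 else {}
    for ip in range(max(1, i - skip - 1), i):
        for jp in range(1, N + 1):
            if jp == j:
                continue
            for tp in (0, 1):
                u = (ip, jp, tp)
                w = weight(u, v)
                if math.isinf(w):
                    continue
                for mask, (cost, _) in table[u].items():
                    new = mask | bit
                    if not mask & bit and cost + w < entries.get(new, (math.inf,))[0]:
                        entries[new] = (cost + w, u)
    return entries

def build_tables(M, N, weight):
    """Build every table from scratch, row by row (the naive approach)."""
    table = {}
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            for t in (0, 1):
                table[(i, j, t)] = node_entries((i, j, t), table, weight, M, N)
    return table

def update_tables(table, changed_edges, weight, M, N):
    """One row-by-row pass after some edge weights changed.

    A node is recomputed only if one of its incoming edges changed or one
    of its possible predecessors was itself updated. Returns the number of
    recomputed nodes.
    """
    targets = {v for (_, v) in changed_edges}
    updated, recomputed = set(), 0
    for i in range(1, M + 1):
        lo = max(1, i - (M - N) - 1)
        for j in range(1, N + 1):
            for t in (0, 1):
                v = (i, j, t)
                if v not in targets and not any(
                        lo <= u[0] < i and u[1] != j for u in updated):
                    continue
                recomputed += 1
                new = node_entries(v, table, weight, M, N)
                if new != table[v]:
                    table[v] = new
                    updated.add(v)
    return recomputed
```

A change near the bottom of the graph touches few nodes, whereas a change near the top can propagate to most rows below it, which is consistent with the observation in the text that alternative helices near the top row take longer to update.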
Fig. 4. The pseudo-code for updating the graph with tables $(U, f(v, U))$ after an edge increment.
4. Results and Discussions
4.1. Errors in the predicted helices
Protein secondary structure prediction methods generally have a $Q_3$ accuracy between 70% and 80%.19,22 The overall accuracy does not provide an explicit evaluation in terms of the length and the position of the predicted helices. Since our topology mapping is based on the geometrical matching between the estimated loop length in the volumetric data and the loop length in the sequence, the error in the predicted helices may affect the mapping result. We investigated different kinds of errors in the predicted helices using a dataset of 30 proteins. Instead of a benchmark dataset, we randomly selected the proteins from the CASP 9 dataset, which represents the newly deposited protein structures in the PDB. Out of the 246 helices, only 39.4% were well predicted, and we refer to them as the corresponded helices [column 2 of Table 1(A)]. A corresponded helix is a predicted helix that satisfies two criteria: (1) its center position is within five AAs of the center of the corresponding observed helix in the PDB; (2) its length is within five AAs of the length of the observed helix. Some of the non-corresponded helices are short. However, about 27.6% of all helices are non-corresponded helices longer than 8 AAs [column 4 of Table 1(A)]. This reflects the challenges in the reality of secondary structure prediction. We calculated the center-to-center shift distribution between the predicted helices and the observed helices for the corresponded helices. The 1-position shift is the most common, covering 35.05% of the corresponded helices. However, shifts of 3, 4, and 5 AAs account for a total of about 26%.
For 30.08% of the helices, one of the two ends of the helix was predicted fairly accurately (within 3 AAs of the true end), but the other end was predicted to be at least 3 AAs away from its true position. This suggests that one of the two ends can be predicted fairly accurately for these helices. In fact, we sampled the ends of the helices in the pre-processing of the predicted helices. Our goal was to capture the ends that are actually predicted accurately. The split-error refers to the case when an observed helix is broken into two smaller helices with at most three AAs in between. As expected, the split-error and merge-error are not common, accounting for 3.25% and 6.5% of the helices, respectively. However, such errors might affect the topology mapping significantly since the length and the position of the predicted helix can be very different from the observed ones.
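The two criteria for a corresponded helix translate directly into a small check. The sketch below assumes that a helix is given as an inclusive `(start, end)` pair of residue indices; the function name and that representation are assumptions made for the example.

```python
def is_corresponded(predicted, observed, tolerance=5):
    """True if a predicted helix matches an observed one on both criteria:
    center within `tolerance` AAs and length within `tolerance` AAs.

    Helices are (start, end) residue indices, inclusive (assumed layout).
    """
    (p_start, p_end), (o_start, o_end) = predicted, observed
    center_shift = abs((p_start + p_end) / 2 - (o_start + o_end) / 2)
    length_diff = abs((p_end - p_start + 1) - (o_end - o_start + 1))
    return center_shift <= tolerance and length_diff <= tolerance

shifted = is_corresponded((10, 20), (12, 24))      # center off by 3, length off by 2
too_short = is_corresponded((10, 20), (10, 30))    # center off by 5, length off by 10
```

The second example shows why both criteria are needed: a predicted helix can sit close to the right place and still be far too short to be useful for loop-length matching.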
Table 1. The errors in the predicted helices from the protein sequence.

A. The percentage of different types of errors
Total no. of   Corresponded(b)   Non-corresponded   Non-corresponded   1-end(e)   Split(f)   Merge(g)
helices(a)                       (≤8 AAs)(c)        (>8 AAs)(d)
246            39.43%            32.93%             27.64%             30.08%     3.25%      6.50%

B. Shift and length error in the corresponded helices
# AAs(h)       0         1         2         3         4         5
Shift(i)       21.65%    35.05%    17.53%    12.37%    7.22%     6.19%
Length(j)      13.40%    18.56%    21.65%    13.40%    20.62%    12.37%

(a) The total number of helices. (b) The percentage of the corresponded helices. (c) The percentage of the non-corresponded helices with length no more than 8 AAs. (d) The percentage of the non-corresponded helices with length more than 8 AAs. (e) The percentage of the helices with only one-end match. (f) Split: one observed helix is split into two predicted helices with a loop (<3 AAs) in between. (g) Merge error: two observed helices with a loop (<3 AAs) are merged into one predicted helix. (h) The number of AAs that are involved in the center-to-center shift in (i) or the length difference in (j). (i) The percentage of the corresponded helices with the center-to-center shift by #AAs. (j) The percentage of the corresponded helices with length difference of #AAs.
In summary, our test of the 246 helices suggests that various errors exist in the secondary structure prediction. Although the consensus prediction is expected to be more accurate than an individual predictor, it is very likely that the same types of errors are present in the consensus prediction as well. It is important to sample the alternative positions and lengths for the predicted helices to find the best match between the volumetric data and the sequence data. We provide an exact algorithm in this paper to improve the efficiency over a naïve way of screening all the alternative positions of the helices.
4.2. Efficiency in the topology searching with the presence of errors
For each tuple picked from the alternative helix positions, a topology graph can be built and the shortest path can be determined using our previously developed dynamic programming method.18 A naïve way of finding the tuple of helices that best matches the constraints of the density map is to construct the topology graph from scratch for each of the possible tuples. We compared our dynamic graph method with the naïve method in terms of the run-time to evaluate all the tuples. Seven proteins were randomly selected from the PDB with the requirements of (1) containing only helices and (2) having at least five helices. The smallest protein (1NG6) has 148 AAs, and the largest (2X79) has 501 AAs. For each protein sequence, we collected the secondary structure prediction results from five servers. The prediction results were preprocessed to generate the alternative helix positions for each predicted helix. The preprocessing step restricted the total number of tuples significantly. For example, the largest protein 2X79 has only 32 helix tuples generated (row 5, column 7 of Table 2). The shortest valid path was searched for each $H$ and the shortest among the 32 paths was selected for the topology. We simulated the protein density map to 10 Å resolution using EMAN.24 SSETracer4 was used to detect the helices from the density map. In the case of protein 2X79, the secondary structure prediction tools were able to predict 14 helix regions out of 26 actual helices in the PDB structure, and SSETracer detected 21 helix sticks from the volumetric map. For each protein, the numbers of rows and columns in the topology graph are the number of helix regions (column 5, Table 2) obtained after preprocessing and the number of sticks (column 6, Table 2), respectively.

Table 2. Improved run-time using dynamic graph update in SSE topology determination.

No.      Protein  #Amino  #True       #Pred.      #Sticks(c)  Helix      Base     1st       All        Naive      %(i)
         ID       acids   helices(a)  helices(b)              tuples(d)  time(e)  helix(f)  comb.(g)   time(h)
1        3FYQ     199     5           6           5           12         0.07     0.05      0.51       0.84       60.7
2        2WVI     164     9           8           7           2          0.46     0.30      0.76       0.92       82.6
3        1NG6     148     9           7           8           8          0.60     0.41      2.3        4.8        47.9
4        3HFW     357     23          15          16          2          200.0    145.5     350.84     400.0      87.5
5        2X79     501     26          14          21          32         879.7    720.5     20,479.2   28,150.4   72.8
6        1TBF     347     22          16          16          24         623.1    579.2     6820.7     14,954.4   45.6
7        3L6A     364     25          18          19          16         1362.5   1124.6    13,253.6   21,800.0   60.8
Average                                                                                                           65.45

(a) The number of the observed helices in the PDB file. (b) The number of helix regions in the secondary structure predictions. (c) The number of sticks detected from the density map. (d) The number of helix tuples generated after preprocessing. (e) The time (in seconds) to build the first graph and to find the shortest path. (f) The time (in seconds) to update the graph for the first alternative helix and to find the shortest path. (g) The time (in seconds) to update the graph for all helix tuples and to find the shortest path. (h) Brute force time to re-compute the entire graph for all the tuples and to search for the shortest path, $h = d \times e$. (i) Percentage of the total time for dynamic update, $i = g/h$.
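The derived columns of Table 2 follow from simple arithmetic: the naïve time is the number of tuples times the base time, and the percentage is the dynamic-update time divided by the naïve time. The sketch below recomputes them from the values reported in the table; the variable and function names are assumptions, and recomputed percentages may differ slightly from the printed ones because of rounding.

```python
# (protein, helix tuples d, base time e in s, all-combination time g in s), from Table 2
ROWS = [
    ("3FYQ", 12, 0.07, 0.51),
    ("2WVI", 2, 0.46, 0.76),
    ("1NG6", 8, 0.60, 2.3),
    ("3HFW", 2, 200.0, 350.84),
    ("2X79", 32, 879.7, 20479.2),
    ("1TBF", 24, 623.1, 6820.7),
    ("3L6A", 16, 1362.5, 13253.6),
]

def dynamic_vs_naive(rows):
    """Return {protein: percentage of naive time used by the dynamic update}."""
    result = {}
    for pdb_id, d, e, g in rows:
        naive_time = d * e                  # h = d x e: rebuild the graph per tuple
        result[pdb_id] = 100.0 * g / naive_time   # i = g / h
    return result

percent = dynamic_vs_naive(ROWS)
average = sum(percent.values()) / len(percent)
reduction = 100.0 - average
```

The average comes out at roughly 65% of the naïve time, that is, a reduction of roughly 35%, in line with the 34.55% quoted in the text.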
Although this paper does not address how the errors in the SSE prediction affect the accuracy of topology determination, we would like to use an example (2X79) to illustrate the challenge posed by the errors. We found that five helices were completely missed by the consensus servers, among which four are shorter than five AAs and one has seven AAs. In general, the shorter helices are more likely to be missed in the secondary structure predictions. We also noticed that seven helices in the PDB file (2X79) were predicted as part of longer helices. In other words, two shorter helices that are closely located on the sequence were predicted as a merged helix with longer length. In this case, the 14 predicted helix regions cover 21 helices of the PDB structure.
We calculated the "base time", which is the time to create a graph and to search for the shortest path. In the work of Table 2, we used the Euclidean distance between two ends of the two helices for the estimation of $d(j, t, j', t')$. In this paper, the constraint shortest path concept was implemented in Java rather than the C++ that was used in our earlier work.18 We noticed a slowdown in our current implementation using Java. A naïve way to find the topology with the lowest cost among the 32 tuples for protein 2X79 takes 32 times the base time. Using the dynamic graph concept, our method found the topology with the lowest cost using 72.8% of the time that would have been spent in the naïve way. The average reduction in time is 34.55% for the seven proteins tested (Table 2). If an alternative helix happens to be near the top row of the graph, it generally takes longer to update the graph, since the nodes below it are likely to be updated. Column 9 shows the worst helix update time found for the proteins.
Although "shift-errors" are the most common errors in the SSE prediction, the "split-error" is another type of error that needs to be considered. A "split-error" corresponds to the situation in which a long helix is predicted as two shorter helices. We collected a dataset containing four proteins, each of which has one split error. Table 3 shows the performance of our program in evaluating all tuples of the alternative helix positions with both "shift-error" and "split-error". The tuples were generated such that two split helices were merged into an alternative helix. To handle the split error, the graph update involves merging two rows into one and updating the $(U, f(v, U))$ table at the appropriate nodes. Table 3 shows that the average reduction in time is 40.50%.
4.3. Improved topology accuracy through the sampling of the errors
The secondary structure topology determination for cryoEM density maps relies on the accuracy of the secondary structures predicted from the protein sequence and those detected from the density maps. How to find the topology from inexact prediction data is still a challenging problem. We propose, in this paper, an approach to carefully generate the tuples of alternative helix positions using the predictions from multiple servers. Once the tuples are created, we apply our dynamic graph method to find the optimal solution among all the tuples. We compared the accuracy of this approach versus the use of one consensus tuple created using two consensus prediction servers. We performed this comparison because it is not clear if the consensus tuple is accurate enough to find the true topology. We present this preliminary comparison using six proteins. These proteins were chosen because most of the helices in these proteins are predicted fairly accurately by the multiple prediction servers, although some helices shorter than two turns are missed. As an example, 13 out of the 16 helices in the PDB structure are predicted approximately correctly in 3LTJ (row 4, Table 4). We used those 13 helices in our test. Both the consensus tuple and the tuples of the alternative helix positions were created. In this case, only 6 tuples (column 6, row 4, Table 4) were generated since there is good agreement among the multiple servers. Yet the number of correctly assigned helices increased from 4 to 8 when the consensus tuple was replaced with the 6 tuples of alternative helices. The dynamic graph finds the optimal shortest path in the graphs of the 6 tuples, and in this case, one of them is a much better choice than the consensus. In the work of Table 4, we estimated $d(j, t, j', t')$ by identifying the possible traces in the density map between end $t$ of $S_j$ and end $t'$ of $S_{j'}$. This process involves clustering the voxels in the density map into critical groups and selecting the possible paths between the two end points. $d(j, t, j', t')$ was estimated as the total distance along a selected path that passes a set of cluster centers.
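The last step, measuring the total distance along a path through cluster centers, amounts to summing the segment lengths of a polyline. The sketch below shows only that summation; the clustering of voxels and the selection of the path are not reproduced, and the function name and the coordinates are made up for illustration.

```python
import math

def path_length(points):
    """Total Euclidean length of a polyline through 3D points, in the
    same units as the coordinates (e.g. Å)."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Hypothetical trace: stick end, two cluster centers, next stick end.
trace = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0), (3.0, 4.0, 15.0)]
d_estimate = path_length(trace)
straight = math.dist(trace[0], trace[-1])
```

A traced path is never shorter than the straight-line distance between its end points, so this estimate is at least as large as the Euclidean estimate used for Table 2.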
We compared the number of correctly assigned helices in the two experiments. A
correctly assigned helix refers to the helix in the tuple that was mapped to the correct
stick in the density map by comparing to its PDB structure. The only difference in
the two experiments is the input tuple. A consensus tuple was used in the first
Table 3. Efficiency of the graph update involving "split-errors."

No.  Protein  #Amino  #True      #Pred.     #Sticks^c  Helix     Base    1st      All       Naive     %^i
     ID       acids   helices^a  helices^b             tuples^d  time^e  helix^f  comb.^g   time^h
1    3FYQ     199     5          6          5          12        0.06    0.05     0.55      0.72      60.7
2    1NG6     148     9          8          8          8         0.62    0.43     3.12      4.96      47.9
3    1Z1L     345     24         14         14         32        170.3   124.6    3736.5    5449.6    56.4
4    2XSI     585     33         27         19         16        1928.4  1789.3   22,548.3  30,854.4  73.0
Average                                                                                               59.5

^a The number of the observed helices in the PDB file. ^b The number of helix regions in the secondary
structure predictions. ^c The number of helix sticks detected from the density map. ^d The number of helix
tuples generated after preprocessing. ^e The time (in seconds) to build the first constraint graph and to
find the shortest path. ^f The time (in seconds) to update the graph for the first alternative helix. ^g The
time (in seconds) to update the graph for all the tuples and to find the shortest path. ^h Brute force time to
re-compute the entire graph for all the tuples and to find the shortest path, h = (d × e). ^i Percentage of
the total time for dynamic update, i = g/(d × e).
experiment and multiple tuples derived from the preprocessing step were used in the
second experiment. For five of the six proteins tested, the number of correctly assigned
helices is higher in the second experiment; for 1NG6 it is unchanged (Table 4).
Although the time to find the best topology in the second experiment is much longer,
the run-time is still about 50% of the time required by the naïve way of finding the
best topology.
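The percentages in Table 4 follow from simple arithmetic: the naïve time is the number of helix tuples multiplied by the base time, and the percentage is the dynamic-update time divided by that product. The short Python sketch below recomputes them from the table values; the variable and function names are illustrative assumptions.

```python
# (helix tuples, base time in s, dynamic-update time in s), copied from Table 4.
rows = {
    "1FLP": (4, 0.40, 0.71),
    "1NG6": (8, 0.62, 2.42),
    "2XB5": (48, 0.79, 17.8),
    "3LTJ": (6, 13.87, 45.7),
    "3ACW": (8, 29.45, 139.7),
    "1Z1L": (24, 773.5, 8765.5),
}


def dynamic_percentage(n_tuples, base_time, dynamic_time):
    """Dynamic-update time as a percentage of the naive (from-scratch) time."""
    naive_time = n_tuples * base_time  # rebuild the graph once per tuple
    return 100.0 * dynamic_time / naive_time


percentages = {pid: dynamic_percentage(*vals) for pid, vals in rows.items()}
average = sum(percentages.values()) / len(percentages)

for pid, pct in percentages.items():
    print(f"{pid}: {pct:.2f}%")
print(f"average: {average:.2f}%")
```

With the values in Table 4, the average comes out at about 50%, in line with the run-time comparison above.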
Although the accuracy was improved through the sampling of the multiple secondary
structure prediction results, the accuracy is still far from adequate. For
example, only 6 out of 13 helices were correctly assigned in protein 3LTJ. There
might be a number of factors related to the inaccuracy. These include the accuracy in
the edge weight estimation, the accuracy of the detected sticks, and the accuracy of
the sampling for the alternative helices. The dynamic graph method developed in this
paper will provide an optimal solution for any edge weights selected and any tuples of
the alternative helices selected. We have noticed that the true topology is often in the
top portion of the ranked list of the topologies (data not shown). It is possible that
the work in this paper needs to be extended to find the top-k ranked solutions instead
of the top-1 solution.
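One simple way to look at more than the top-1 solution is to rank candidate topologies by total cost and keep the k cheapest. The Python sketch below does this for an already enumerated list of candidates. It is not the constrained shortest-path search itself, and the costs, the labels, and the name top_k_topologies are hypothetical.

```python
import heapq


def top_k_topologies(candidates, k):
    """Return the k lowest-cost (cost, topology) pairs, cheapest first."""
    return heapq.nsmallest(k, candidates, key=lambda pair: pair[0])


# Hypothetical (cost, label) pairs, for illustration only.
candidates = [
    (41.2, "topology A"),
    (37.5, "topology B"),
    (39.9, "topology C"),
    (52.0, "topology D"),
]

best_three = top_k_topologies(candidates, 3)
print(best_three)
```

A ranked list of this kind would allow the true topology to be recovered when it is near, but not at, the top of the ranking.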
5. Conclusions
Due to the inaccuracy in both the secondary structure prediction from the sequence
and the detection from density map, the true topology needs to be searched with the
consideration of such errors. We presented the framework to reduce the computation
in topology search using a combination of preprocessing and dynamic graph update.
Our method provides the optimal solution among the alternative helix
positions that were carefully chosen in the preprocessing step. On a small test
dataset, sampling the "shift-error" took about 34.55% less time compared
Table 4. Improved topology accuracy over the consensus helix prediction.

Protein  #AA^a  #PDB       #True      #Sticks^d  Helix     Base    Scratch  Actual  %^i    Cons.     Best
ID              helices^b  helices^c             tuples^e  time^f  time^g   time^h         assign^j  assign^k
1FLP     142    7          7          7          4         0.40    1.6      0.71    44.37  4         5
1NG6     148    9          8          7          8         0.62    4.96     2.42    48.79  5         5
2XB5     207    13         10         9          48        0.79    37.92    17.8    46.94  3         6
3LTJ     210    16         13         12         6         13.87   83.22    45.7    54.91  4         8
3ACW     293    17         13         14         8         29.45   235.6    139.7   59.29  4         6
1Z1L     345    23         17         14         24        773.5   18564    8765.5  47.21  4         9
Average                                                                             50.25

Note: ^a The number of amino acids (AAs) in the protein. ^b The number of the observed helices in the PDB file.
^c The number of helix regions in the secondary structure predictions. ^d The number of sticks detected from the
density map. ^e The number of helix tuples generated after preprocessing. ^f The time (in seconds) to build
the first graph and to find the shortest path. ^g Brute force time to re-compute the entire graph for all the
tuples and to search for the shortest path, g = (e × f). ^h The time (in seconds) to update the graph for
all helix tuples and to find the shortest path. ^i Percentage of the total time for dynamic update,
i = h/g × 100. ^j The number of sticks correctly assigned using the consensus sequence prediction only.
^k The number of sticks correctly assigned using helix tuples (e) considering shift and split errors.
to the naïve way of searching. To our knowledge, this is the first computationally
effective exact algorithm to identify the optimal secondary structure topology when
the inaccuracy of the predicted data is considered.
Acknowledgments
The authors acknowledge financial support from Old Dominion University MDS
research fund.
References
1. Chiu W, Electron microscopy of frozen, hydrated biological specimens, Annu Rev Biophys Biophys Chem 15:237–257, 1986.
2. Zhou ZH, Atomic resolution cryo electron microscopy of macromolecular complexes, Adv Protein Chem Struct Biol 82:1–35, 2011.
3. Zhou ZH, Towards atomic resolution structural determination by single-particle cryo-electron microscopy, Current Opin Struct Biol 18:218–228, 2008.
4. Del Palu A, He J, Pontelli E, Lu Y, Identification of alpha-helices from low resolution protein density maps, Proc Computational Systems Bioinformatics Conference (CSB) 89–98, 2006.
5. Jiang W, Baker ML, Ludtke SJ, Chiu W, Bridging the information gap: Computational tools for intermediate resolution structure interpretation, J Mol Biol 308:1033–1044, 2001.
6. Kong Y, Ma J, A structural-informatics approach for mining beta-sheets: Locating sheets in intermediate-resolution density maps, J Mol Biol 332:399–413, 2003.
7. Baker ML, Ju T, Chiu W, Identification of secondary structure elements in intermediate-resolution density maps, Structure 15:7–19, 2007.
8. Baker ML, Abeysinghe SS, Schuh S, Coleman RA, Abrams A, Marsh MP, Hryc CF, Ruths T, Chiu W, Ju T, Modeling protein structure at near atomic resolutions with Gorgon, J Struct Biol 174:360–373, 2011.
9. Lu Y, He J, Strauss CE, Deriving topology and sequence alignment for the helix skeleton in low-resolution protein density maps, J Bioinform Comput Biol 6:183–201, 2008.
10. Sun W, He J, Native secondary structure topology has near minimum contact energy among all possible geometrically constrained topologies, Proteins 77:159–173, 2009.
11. Al Nasr K, Sun W, He J, Structure prediction for the helical skeletons detected from the low resolution protein density map, BMC Bioinformatics 11(Suppl 1):S44, 2010.
12. Bryson K, McGuffin LJ, Marsden RL, Ward JJ, Sodhi JS, Jones DT, Protein structure prediction servers at University College London, Nucleic Acids Res 33:W36–W38, 2005.
13. Pollastri G, Przybylski D, Rost B, Baldi P, Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles, Proteins 47(2):228–235, 2002.
14. Pollastri G, McLysaght A, Porter: A new, accurate server for protein secondary structure prediction, Bioinformatics 21:1719–1720, 2005.
15. McGuffin LJ, Bryson K, Jones DT, The PSIPRED protein structure prediction server, Bioinformatics 16:404–405, 2000.
16. Przybylski D, Rost B, Alignments grow, secondary structure prediction improves, Proteins 46:197–205, 2002.
17. Lindert S, Staritzbichler R, Wotzel N, Karakas M, Stewart PL, Meiler J, EM-fold: De novo folding of alpha-helical proteins guided by intermediate-resolution electron microscopy density maps, Structure 17:990–1003, 2009.
18. Al-Nasr K, Ranjan D, Zubair M, He J, Ranking valid topologies of the secondary structure elements using a constraint graph, J Bioinformatics Comput Biol 9:415–430, 2011.
19. Frishman D, Argos P, Seventy-five percent accuracy in protein secondary structure prediction, Proteins: Struct Funct Bioinform 27:329–335, 1997.
20. Simossis VA, Heringa J, The influence of gapped positions in multiple sequence alignments on secondary structure prediction methods, Comput Biol Chem 28:351–366, 2004.
21. Cuff JA, Clamp ME, Siddiqui AS, Finlay M, Barton GJ, JPred: A consensus secondary structure prediction server, Bioinformatics 14:892–893, 1998.
22. Adamczak R, Porollo A, Meller J, Combining prediction of secondary structure and solvent accessibility in proteins, Proteins: Struct Funct Bioinform 59:467–475, 2005.
23. Ramalingam G, Reps T, An incremental algorithm for a generalization of the shortest-path problem, J Algorithms 21:267–305, 1996.
24. Ludtke SJ, Baldwin PR, Chiu W, EMAN: Semi-automated software for high resolution single particle reconstructions, J Struct Biol 128:82–97, 1999.
Abhishek Biswas is a Ph.D. candidate at the Department of Computer Science at
Old Dominion University. He received his BE degree from Amrita School of
Engineering, Bangalore, India in 2007. He worked as a developer at Tata Consultancy
Services from September 2007 to December 2008 and as a research associate at PES
School of Engineering, Bangalore, India until August 2009, before joining Old
Dominion University.
Dong Si is a Ph.D. candidate at the Department of Computer Science at Old
Dominion University. He received his undergraduate degree from Nanjing University
before joining Old Dominion University.
Kamal Al Nasr is a Ph.D. candidate at the Department of Computer Science at
Old Dominion University. He received his BS and MS degrees from Yarmouk
University, Jordan, in 2003 and 2005, respectively. He worked as a part-time instructor
at Yarmouk University, Jordan. Between August 2007 and August 2009, he was a
graduate student conducting research at the CREST Center (Center for Research
Excellence in Bioinformatics and Computational Biology) at the Department of
Computer Science at New Mexico State University before joining Old Dominion
University.
Desh Ranjan received his undergraduate degree in Computer Science from Indian
Institute of Technology, Kanpur in 1987 and Master's and Ph.D. degrees in
Computer Science, with a minor in Mathematics, from Cornell University in Ithaca,
New York, in 1990 and 1992, respectively. After finishing his Ph.D., he spent one
year (1992–1993) as a post-doctoral fellow at the Max Planck Institute for Computer
Science in Saarbrücken, Germany. His research interests include Algorithms,
Bioinformatics and Computational Complexity.
Mohammad Zubair received his B.Sc. in Electrical & Electronics Engineering
from Delhi University, India, in 1981. He received his Ph.D. degree from Indian
Institute of Technology, India, in 1987. He worked in IBM T.J. Watson Research
Center for three years. Currently, he is a Professor at the Department of Computer
Science at Old Dominion University. He has conducted research in three areas:
Digital Library, Advanced Web Applications including E-Commerce, High
Performance Computing and Bioinformatics.
Jing He obtained her B.S. degree in Mathematics from Jilin University and M.S.
degree in Mathematics from New Mexico State University. She worked in the area of
three-dimensional reconstruction and analysis of virus structures at Baylor College of
Medicine from which she obtained her Ph.D. in Structural and Computational
Biology and Molecular Biophysics. Currently she is an Associate Professor at the
Department of Computer Science at Old Dominion University.