Sparse RNA folding: Time and space efficient algorithms


Rolf Backofen1, Dekel Tsur2, Shay Zakov2, and Michal Ziv-Ukelson2

1 Albert Ludwigs University, Freiburg, [email protected]

2 Department of Computer Science, Ben-Gurion University of the Negev, Israel, {dekelts}, {zakovs}, {michaluz}@cs.bgu.ac.il

Abstract. The classical algorithm for RNA single strand folding requires $O(nZ)$ time and $O(n^2)$ space, where $n$ denotes the length of the input sequence and $Z$ is a sparsity parameter which satisfies $n \le Z \le n^2$. We show how to reduce the space complexity of this algorithm. The space reduction is based on the observation that some solutions for subproblems are not examined after a certain stage of the algorithm, and may be discarded from memory. This yields an $O(nZ)$ time and $O(Z)$ space algorithm, which outputs both the cardinality of the optimal folding and a corresponding secondary structure. The space-efficient approach also extends to the related RNA simultaneous alignment with folding problem, and can be applied to reduce the space complexity of the fastest algorithm for this problem from $O(n^2m^2)$ down to $O(nm^2 + \tilde{Z})$, where $n$ and $m$ denote the lengths of the input sequences to be aligned (w.l.o.g. $m \le n$), and $\tilde{Z}$ is a sparsity parameter which satisfies $nm \le \tilde{Z} \le n^2m^2$.

In addition, we also show how to speed up the max base-pair variant of RNA single strand folding. The speed up is achieved by combining two independent existing techniques, which restrict the number of expressions that need to be examined in bottleneck computations of these algorithms. This yields an $O(LZ)$ time and $O(Z)$ space algorithm, where $L$ denotes the maximum cardinality of a folding of the input sequence.

1 Introduction

The structure of RNA is evolutionarily more conserved than its sequence and is thus key to its functional analysis [3]. Unfortunately, although massive amounts of sequence data are continuously generated, the number of known RNA structures is still very limited since experimental methods, such as NMR and crystallography, require expertise and long experimental time. Therefore, computational methods for predicting RNA structures are of great value [20, 8, 19].

RNA is typically produced as a single stranded molecule, which then folds upon itself to form a number of short base-paired stems. This base-paired structure is called the secondary structure of the RNA. Paired bases almost always occur in a nested fashion in RNA secondary structure. Under the assumption that the structure does not contain pseudoknots (i.e. crossing base pairs), a model was proposed by Tinoco et al. [9] to calculate the stability (in terms of free energy) of a folded RNA molecule by summing all contributions from the stabilizing, consecutive base pairs and from the loop-destabilizing terms in the secondary structure. Based on this model, dynamic programming algorithms were suggested for computing the most stable structures [16, 14, 21, 1, 17], applying various scoring criteria such as the maximal number of base pairs [14] or the minimal free energy [21]. This optimization problem is formally denoted Single RNA strand folding, and the time and space complexity of the classical algorithms for solving it is $O(n^3)$ and $O(n^2)$, respectively, where $n$ denotes the length of the input RNA sequence. Recently, these were sped up to yield $O(nZ)$ time and $O(n^2)$ space [17] algorithms, where $Z$ is a sparsity parameter which satisfies $n \le Z \le n^2$. We note that these algorithms are practical in the sense that the hidden constants are small. On the more theoretical front, Akutsu suggested an $O(D(n))$ algorithm for this problem [1], where $D(n)$ is the time for computing the distance product of two $n \times n$ matrices. The best current bound on $D(n)$ is $O(n^3 \log^3 \log n / \log^2 n)$ [2].

Another approach to RNA folding is the Sankoff simultaneous alignment with folding (SAF for short) [15, 13, 11, 12, 18]. This approach consists of finding an optimal alignment between a set of RNA sequences, where an alignment score is evaluated with respect to some common folding of the input sequences. However, as stated in [5], even for the simple case where the input consists of only two sequences, this approach requires "extreme amounts of memory and space" with complexity of $O(n^2m^2)$ space and $O(n^3m^3)$ time, where $n$ and $m$ are the lengths of the input RNA sequences to be aligned. Thus, most existing practical implementations of this algorithm [13, 11, 18] use restricted versions of the original problem. Since these restrictions introduce another source of error, it is of utmost practical importance to the research on RNA to improve both the space and time complexities of the full versions of RNA folding. A first non-heuristic speedup, which does not sacrifice the optimality of results, was recently described in [12]. This work extends the approach of [17] and yields an $O(nm\tilde{Z})$ time and $O(n^2m^2)$ space algorithm for RNA simultaneous alignment with folding, where $\tilde{Z}$ is a sparsity parameter which satisfies $nm \le \tilde{Z} \le n^2m^2$. However, experimental analysis of this algorithm indicates that the high memory requirements pose a major bottleneck in practice, both in constraining the lengths of the input sequences to be aligned, as well as in exhausting the benchmark machine's memory, which in turn results in a page-fault slowdown.

Our contribution

(1.) Reducing the space requirements of both RNA folding problems. In this work we focus on improving the space complexity of the base-pairing maximization variant of the RNA single strand folding problem [16, 14, 1]. The space requirement reduction is based on the observation that some solutions for subproblems are not examined after a certain stage of the algorithms, and may be discarded from memory. This yields an $O(nZ)$ time and $O(Z)$ space algorithm for this problem. In addition to the optimal folding cardinality computation, we show a trace-back procedure which outputs a corresponding secondary structure. Note that it is an interesting challenge on its own to recover an optimal folding within the time and space complexity bounds of the space-reduced algorithm, since due to the sparse representation only partial information is kept. The presented strategy may also be extended to the score computation of a family of RNA folding algorithms, which includes algorithms for the energy minimization variant of the single strand folding problem [21, 17] (improving the space complexity from $O(n^2)$ to $O(Z)$), as well as algorithms for SAF [15, 12] (improving the space complexity from $O(n^2m^2)$ to $O(nm^2 + \tilde{Z})$).

| Problem | Previous time | Previous space | New time | New space |
|---|---|---|---|---|
| Single strand base-pairing maximization | $O(n^3)$ [14], $O(nZ)$ [17], $O(D(n))$ [1] | $O(n^2)$ | $O(LZ)$ | $O(Z)$ |
| Single strand energy minimization | $O(n^3)$ [21], $O(nZ)$ [17] | $O(n^2)$ | $O(nZ)$ | $O(Z)$ |
| Simultaneous alignment with folding | $O(n^3m^3)$ [15], $O(nm\tilde{Z})$ [12] | $O(n^2m^2)$ | $O(nm\tilde{Z})$ | $O(nm^2 + \tilde{Z})$ |

Table 1. Time and space complexities of RNA folding algorithms.

(2.) A sparse RNA single strand folding algorithm. We also describe a fast algorithm for the max base-pair variant of RNA single-strand folding that exploits an additional sparsity parameter, based on the cardinality of the optimal folding. This is achieved by combining two independent techniques, which were previously used to reduce the number of sub-instance pairs that need to be considered by the algorithm. This combination yields the simultaneous exploitation of two key properties emerging from the formal definitions of these folding problems: the triangle inequality property, previously exploited in [17] and [12], as well as the monotonicity and unit-step properties, previously utilized in [10] for a related problem. The result is an $O(LZ)$ time and $O(Z)$ space algorithm, where $L$ denotes the maximum cardinality of a folding of the input sequence and $n \le Z \le n(L+1)$.

We note that the algorithms described here are practical in the sense that the hidden constants are small. In the context of practical contribution, we also point out that our space complexity improvements are more significant than the time complexity improvements, since while the expected value of $L$ is $\Theta(n)$ (assuming uniform character distribution), both $Z$ and $\tilde{Z}$ were experimentally shown to be significantly less than $n^2$ and $n^2m^2$, respectively [17, 12]. Furthermore, reducing the space complexity of the simultaneous alignment with folding problem is a key result in practice, as in the previous results the space complexity was typically the computational bottleneck [5, 12].

RoadMap: The rest of the paper proceeds as follows. Preliminary notation and definitions are given in Section 2. The Nussinov-Jacobson [14] recurrence for computing maximum cardinality folding of an RNA sequence, and the application of the elimination technique of Wexler et al. [17] to this recurrence are described in Section 3. The latter technique is used in Section 3.2 for designing a space efficient algorithm for the single strand RNA folding problem. In the following Section 3.3, we show how to implement a space efficient trace-back procedure in order to reconstruct an optimal folding from this sparse, space efficient representation. Then, in Section 3.4, the space reduction approach is shown to scale up to the RNA simultaneous alignment with folding problem. In Section 4.1 we apply another elimination technique, which is similar to a technique of Jansson et al. [10], to further reduce the algorithm's time complexity, without increasing its space complexity. We conclude with Section 4.2, where this technique is further improved by step encoding. Due to space constraints, some figures, pseudocode and proofs are omitted, and may be found in the supporting on-line material document at (TODO:link).

2 Preliminaries

An RNA sequence is a sequence over the alphabet $\{A, C, G, U\}$. Each letter in an RNA sequence is also called a base or a nucleotide. The bases $A$ and $U$ are called complementary bases, and so are the bases $C$ and $G$.³ For a base $\sigma \in \{A, C, G, U\}$, denote by $\bar{\sigma}$ the complementary base of $\sigma$.

Fix henceforth an RNA sequence $S = s_1 s_2 \cdots s_n$. Denote by $S_{i,j}$ the subsequence $s_i \cdots s_j$ of $S$, where $S_{i,i-1}$ is defined to be the empty sequence.

Definition 1 (RNA folding). A folding $F$ of a subsequence $S_{i,j}$ is a set of index pairs that satisfies the following:
1. For every $(k, l) \in F$, $i \le k < l \le j$, and $s_l = \bar{s}_k$.
2. There are no two distinct pairs $(k, l), (k', l') \in F$ such that $k \le k' \le l \le l'$.

A pair $(k, l) \in F$ is called a base-pair. Denote by $|F|$ the size of a folding $F$, i.e. the number of base-pairs in $F$. Say that index $k$ is paired in a folding $F$ if $k$ appears in a base-pair in $F$ (either as the first or the second base in the pair), otherwise $k$ is unpaired in $F$.
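Definition 1 can be read as a direct membership test. The following is a minimal sketch in Python, not taken from the paper: the function name `is_folding` and the `COMPLEMENT` table are illustrative assumptions, indices are 1-based as in the text, and wobble pairs are ignored.

```python
from itertools import combinations

# Watson-Crick complements only (assumption: G-U wobble pairs are disregarded).
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def is_folding(seq, pairs, i=1, j=None):
    """Check Definition 1 for a set of 1-based index pairs over S_{i,j}."""
    if j is None:
        j = len(seq)
    s = " " + seq  # pad so that s[k] is the k-th base
    # Condition 1: every pair lies inside [i, j] and joins complementary bases.
    for k, l in pairs:
        if not (i <= k < l <= j) or s[l] != COMPLEMENT[s[k]]:
            return False
    # Condition 2: no two distinct pairs share an index or cross each other.
    for a, b in combinations(sorted(set(pairs)), 2):
        for (k, l), (k2, l2) in ((a, b), (b, a)):
            if k <= k2 <= l <= l2:
                return False
    return True
```

For the sequence ACAGUUGCA used later in Fig. 1, the set {(1,5), (2,4), (6,9), (7,8)} passes both conditions, while {(1,5), (3,6)} fails condition 2 because the two pairs cross.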

The single strand base-pairing maximization problem was first addressed in [14]. The formal problem definition is given below.

Problem 1 (Single Strand Base-Pairing Maximization). Compute the maximum size of a folding of the instance sequence $S$.

Definition 2 (branch point). Let $F$ be a folding of some subsequence $S_{i,j}$. An index $q$ is a branch point with respect to $F$ if for all $(k, l) \in F$, either $l < q$ or $k \ge q$.

³ For the sake of clarity, we disregard in this work the possible "wobble" pairing between $G$ and $U$. All presented results may be easily extended to include $G$-$U$ pairing as well.

We distinguish between two kinds of foldings. For a folding $F$ of $S_{i,j}$, say that
1. $F$ is co-terminus if $F$ includes the base-pair $(i, j)$.
2. $F$ is partitionable if $F$ does not include the base-pair $(i, j)$.

Note that for $j > i$, $F$ is partitionable if and only if $F$ has a branch point $i < q \le j$.

Definition 3. For a subsequence $S_{i,j}$, denote:
1. $L(i, j)$ is the maximum size of a folding of $S_{i,j}$.
2. $L^p(i, j)$ is the maximum size of a partitionable folding of $S_{i,j}$.
3. $L^c(i, j)$ is the maximum size of a co-terminus folding of $S_{i,j}$, or $-\infty$ if there is no such folding (if $j \le i$ or $s_j \ne \bar{s}_i$).

Call a folding $F$ of $S_{i,j}$ for which $|F| = L(i, j)$ an optimal folding of $S_{i,j}$. In the rest of this paper, we will use $L$ instead of $L(1, n)$ whenever the context is clear.

3 RNA Folding via base-pairing maximization

In this section we describe the recursive solution for the single RNA strand folding base-pair maximization problem, and some of its characteristics. We then present a technique for reducing its space complexity. Note that this technique easily extends to the single-strand RNA folding algorithms that are based on a thermodynamic model [20, 8, 19]. Finally, we show how to extend the space-reduction technique and apply it to the simultaneous alignment with folding algorithms [15, 12].

3.1 A recursive solution

For a subsequence $S_{i,j}$ such that $j \le i$, the only possible folding is the empty folding, and therefore $L(i, j) = 0$. The following equations show how to recursively compute $L(i, j)$ when $j > i$:

$$L(i, j) = \max\{L^p(i, j),\ L^c(i, j)\}. \tag{3.1}$$

$$L^c(i, j) = \begin{cases} L(i+1, j-1) + 1, & s_j = \bar{s}_i, \\ -\infty, & s_j \ne \bar{s}_i. \end{cases} \tag{3.2}$$

$$L^p(i, j) = \max_{i < q \le j} \{L(i, q-1) + L(q, j)\}. \tag{3.3}$$
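Equations 3.1 to 3.3 translate almost line by line into a cubic-time table fill. The sketch below is an illustrative Python rendering under the assumptions of this section (Watson-Crick pairs only, no minimum loop length); the names `fold_table` and `max_pairs` are hypothetical and indices are 1-based.

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}
NEG_INF = float("-inf")


def fold_table(seq):
    """Fill L[i][j] = L(i, j) for 1 <= i, j <= n using Equations 3.1-3.3."""
    n = len(seq)
    s = " " + seq  # pad so that s[k] is the k-th base
    # Entries with j <= i stay 0: the only folding there is the empty one.
    L = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(n - 1, 0, -1):          # rows from bottom to top
        for j in range(i + 1, n + 1):      # each row from left to right
            # Equation 3.2: co-terminus foldings need s_j complementary to s_i.
            lc = L[i + 1][j - 1] + 1 if s[j] == COMPLEMENT[s[i]] else NEG_INF
            # Equation 3.3: try every branch point i < q <= j.
            lp = max(L[i][q - 1] + L[q][j] for q in range(i + 1, j + 1))
            L[i][j] = max(lp, lc)          # Equation 3.1
    return L


def max_pairs(seq):
    """Return L(1, n), the maximum size of a folding of seq."""
    return fold_table(seq)[1][len(seq)]
```

The inner `max` over all branch points is the $O(n)$ step that the remainder of the section works to shrink.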

Note that the time complexity bottleneck in algorithms which implement the recursive computation of Equations 3.1 to 3.3 is due to the consideration of $O(n)$ branch points $q$ in the computation of $L^p(i, j)$, according to Equation 3.3. In the rest of this section, as well as in Section 4, we describe techniques that reduce the number of branch points that need to be examined in this computation, and thus improve the time complexity of such algorithms. Due to Equations 3.1 and 3.3, the following (inverse) triangle inequality holds in the base-pairing maximization problem:

Observation 1 (triangle inequality). For every subsequence $S_{i,j}$ and for every $i < q \le j$, $L(i, j) \ge L(i, q-1) + L(q, j)$.

Based on the triangle inequality, Wexler et al. [17] observed that it is sufficient to examine only a subset of the branch points in order to compute $L^p(i, j)$. We present here a slightly different notation for the same concept.

Definition 4 (OCT). A subsequence $S_{i,j}$ is optimally co-terminus (OCT) if $i = j$, or if every optimal folding of $S_{i,j}$ is co-terminus (that is, if $L(i, j) = L^c(i, j) > L^p(i, j)$).

Call an index $q$ for which $L^p(i, j) = L(i, q-1) + L(q, j)$ an optimal branch point with respect to $S_{i,j}$.

Lemma 1 (Wexler et al. [17]). For every subsequence $S_{i,j}$, there is an optimal branch point $q$ with respect to $S_{i,j}$ such that $S_{q,j}$ is an OCT.

For completeness, we give a proof for Lemma 1 with the notation of the current work.

Proof. Let $q$ be the maximum index such that $L^p(i, j) = L(i, q-1) + L(q, j)$. If $q = j$, then $S_{q,j} = S_{j,j}$ is an OCT by definition. Otherwise, let $q'$ be any index in the range $q < q' \le j$. From the selection of $q$,

$$L(i, q-1) + L(q, j) = L^p(i, j) > L(i, q'-1) + L(q', j) \overset{\text{Obs. 1}}{\ge} L(i, q-1) + L(q, q'-1) + L(q', j).$$

Therefore, $L(q, j) > L(q, q'-1) + L(q', j)$ for every $q < q' \le j$, and

$$L^p(q, j) \overset{\text{Eq. 3.3}}{=} \max_{q < q' \le j} \{L(q, q'-1) + L(q', j)\} < L(q, j).$$

Thus, $S_{q,j}$ is an OCT. □

Define the following subset of branch points with respect to $S_{i,j}$:

$$Q_{i,j} = \{\, i < q \le j : S_{q,j} \text{ is an OCT} \,\}.$$

The following equation restates Equation 3.3, based on Lemma 1, by restricting the branch points considered by the maximization term to those in $Q_{i,j}$.

$$L^p(i, j) = \max_{q \in Q_{i,j}} \{L(i, q-1) + L(q, j)\}. \tag{3.4}$$

We define the following sparsity measure of RNA sequences.

Definition 5. For a subsequence $S_{i,j}$, $Z(i, j)$ is the number of subsequences of $S_{i,j}$ which are OCTs.

In the rest of this paper, we use $Z$ instead of $Z(1, n)$ whenever the context is clear. In the sparse case, only a small portion of the $O(n^2)$ subsequences of $S$ are OCTs. In Section 4.1 we show that, in the maximum base-pair variant of RNA single-strand folding, $Z$ is bounded by $n(L + 1)$. For the minimum free energy variant of the problem, [17] gives an estimate of the expected value of a parameter related to $Z$, based on a probabilistic model for polymer folding and measured by simulations, which shows that $Z$ is significantly smaller than $O(n^2)$.

Previous algorithms for the base-pairing maximization problem were presented by Nussinov and Jacobson [14] and by Wexler et al. [17].⁴ Both algorithms are dynamic programming algorithms that perform a bottom-up computation of the recurrence described in this section, where the Nussinov-Jacobson algorithm uses Equation 3.3 for the computation of $L^p(i, j)$, and the Wexler et al. algorithm improves it by using Equation 3.4. These algorithms compute the upper triangle of an $n \times n$ table $M$, where each cell $M[i, j]$ stores the value $L(i, j)$. The entries of $M$ are traversed in an order which guarantees that all values that are needed for the computation of $M[i, j] = L(i, j)$, according to the recurrence formula, are computed and stored in $M$ prior to the computation of $M[i, j]$. Upon termination, $M[1, n]$ holds the value $L$. The time complexity of the algorithm by Nussinov and Jacobson is $O(n^3)$, whereas that of the algorithm by Wexler et al. is $O(nZ)$. Both algorithms use $O(n^2)$ space.
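To make Definitions 4 and 5 concrete, the following Python sketch fills the tables with the plain recurrence and then reads off the OCT subsequences, the sets $Q_{i,j}$, and the sparsity measure $Z$. It is only an illustration: the helper names (`fold_tables`, `is_oct`, `oct_branch_points`, `sparsity`) are assumptions, and the tables are built in cubic time for simplicity rather than with the sparse algorithm.

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}
NEG_INF = float("-inf")


def fold_tables(seq):
    """Return 1-based tables (L, Lp, Lc) filled by Equations 3.1-3.3."""
    n = len(seq)
    s = " " + seq
    L = [[0] * (n + 2) for _ in range(n + 2)]
    Lp = [[NEG_INF] * (n + 2) for _ in range(n + 2)]
    Lc = [[NEG_INF] * (n + 2) for _ in range(n + 2)]
    for i in range(n - 1, 0, -1):
        for j in range(i + 1, n + 1):
            if s[j] == COMPLEMENT[s[i]]:
                Lc[i][j] = L[i + 1][j - 1] + 1
            Lp[i][j] = max(L[i][q - 1] + L[q][j] for q in range(i + 1, j + 1))
            L[i][j] = max(Lp[i][j], Lc[i][j])
    return L, Lp, Lc


def is_oct(tables, i, j):
    """Definition 4: i = j, or L^c(i, j) > L^p(i, j)."""
    _, Lp, Lc = tables
    return i == j or (i < j and Lc[i][j] > Lp[i][j])


def oct_branch_points(seq, i, j):
    """Return Q_{i,j} as a sorted list."""
    tables = fold_tables(seq)
    return [q for q in range(i + 1, j + 1) if is_oct(tables, q, j)]


def sparsity(seq):
    """Return Z = Z(1, n), the number of OCT subsequences of seq."""
    tables = fold_tables(seq)
    n = len(seq)
    return sum(is_oct(tables, i, j)
               for i in range(1, n + 1) for j in range(i, n + 1))
```

On the sequence of Fig. 1, `oct_branch_points("ACAGUUGCA", 2, 7)` gives `[7]` and `oct_branch_points("ACAGUUGCA", 2, 8)` gives `[7, 8]`, matching the sets $Q_{2,7}$ and $Q_{2,8}$ quoted in the caption.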

3.2 A space efficient algorithm

Our space reduction strategy is based on the observation that some of the values stored by the algorithm of Wexler et al. [17] are not necessary throughout the complete run of the algorithm. In the following lemma we characterize the values that need to be maintained in memory for the computation of $L(i, j)$.

Lemma 2. For a subsequence $S_{i,j}$, it is possible to compute $L(i, j)$ by examining only those values $L(a, b)$, where $i \le a < b \le j$ and $b - a < j - i$, which satisfy either $a = i$, $a = i + 1$, or $S_{a,b}$ is an OCT.

Proof. Immediate from Equations 3.1, 3.2 and 3.4. □

Consider a dynamic programming algorithm which fills the table $M$ by traversing its entries row by row from bottom to top, and each row from left to right. Lemma 2 implies that at the stage where $M[i, j]$ is computed, it is sufficient to keep only the values in the currently computed $i$-th row, the values in the recently computed $(i+1)$-th row, and values in entries which correspond to OCT subsequences of $S$. Thus, there is no need to maintain the complete table $M$ in memory; rather, at each stage, entries which are guaranteed not to be further examined by the algorithm may be discarded. This yields a total space complexity of $O(n + Z) = O(Z)$. Note that the computation of each entry $M[i, j]$ requires $O(|Q_{i,j}|)$ operations, due to the consideration of the branch point set $Q_{i,j}$ (these sets are maintained as lists in order to allow an efficient traversal, as explained in [17]). Since

$$\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} |Q_{i,j}| \le \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} |Q_{1,j}| \le \sum_{i=1}^{n-1} Z < nZ,$$

the running time of the algorithm is $O(nZ)$. Fig. 1 illustrates a run of the above described algorithm, and Alg. 1 in the supporting on-line material gives the pseudocode. Its time and space complexities are summarized in the following lemma.

⁴ [17] deals with the more realistic energy minimization variant of the problem. For clarity, we project their notions on the simpler base-pairing maximization variant discussed here.

Lemma 3. Given an RNA sequence $S$ of length $n$, there is an algorithm which computes $L(1, n)$ in $O(nZ)$ time and $O(Z)$ space.
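The row-by-row scheme behind Lemma 3 can be sketched as follows. This is not the paper's Alg. 1 (which is in the on-line material) but a hedged Python approximation of the description above: it keeps two rows of $M$ plus one list of OCT entries per end index. The names `sparse_max_pairs` and `oct_by_end` are assumptions.

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}
NEG_INF = float("-inf")


def sparse_max_pairs(seq):
    """Return (L(1, n), Z) while storing only two rows of M and the OCT entries."""
    n = len(seq)
    s = " " + seq
    # oct_by_end[j] holds (q, L(q, j)) for every OCT subsequence S_{q,j} found so far.
    oct_by_end = [[] for _ in range(n + 1)]
    below = [0] * (n + 1)                 # row i + 1 of M
    for i in range(n, 0, -1):             # rows from bottom to top
        row = [0] * (n + 1)               # row i of M, with M[i, i-1] = M[i, i] = 0
        oct_by_end[i].append((i, 0))      # S_{i,i} is an OCT by definition
        for j in range(i + 1, n + 1):     # each row from left to right
            # Equation 3.4: only branch points q with S_{q,j} an OCT are examined.
            lp = max(row[q - 1] + val for q, val in oct_by_end[j])
            # Equation 3.2 needs a single entry of the row below.
            lc = below[j - 1] + 1 if s[j] == COMPLEMENT[s[i]] else NEG_INF
            row[j] = max(lp, lc)
            if lc > lp:                   # S_{i,j} is an OCT: keep its value
                oct_by_end[j].append((i, row[j]))
        below = row                       # row i + 1 is no longer needed
    return below[n], sum(len(entries) for entries in oct_by_end)
```

At any moment the function holds two arrays of length $n + 1$ and the OCT lists, which is the $O(n + Z)$ footprint discussed in the text; the list for end index $j$ is never read before row $j$ has registered the singleton $S_{j,j}$, so the `max` is always over a non-empty set.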

Fig. 1. An illustration of the space reduced single strand folding algorithm. The left figure shows the complete table $M$ with respect to the RNA sequence $S = ACAGUUGCA$. Grayed entries correspond to OCT subsequences. The figure in the middle presents a snapshot of the entries maintained by the algorithm at the stage where $M[2, 7]$ is computed. At this stage, the algorithm examines the sum of $M[2, 6]$ and $M[7, 7]$ to compute $L^p(2, 7)$ (note that $Q_{2,7} = \{7\}$), and $M[3, 6] + 1$ to compute $L^c(2, 7)$ (since $s_7 = \bar{s}_2$). Since $L^c(2, 7) = 2 > 1 = L^p(2, 7)$, the subsequence $S_{2,7}$ is marked as an OCT. The right figure shows the computation of the next entry, $M[2, 8]$. Here, the evaluation of $L^p(2, 8)$ examines the sum of $M[2, 6]$ and $M[7, 8]$, and the sum of $M[2, 7]$ and $M[8, 8]$ (since $Q_{2,8} = \{7, 8\}$). $L^c(2, 8)$ is not evaluated since $s_8 \ne \bar{s}_2$.

3.3 Folding reconstruction

In addition to computing the optimal folding score of a given RNA sequence, it is often of interest to report at least one optimal folding, or a set of optimal or suboptimal foldings of the input sequence. Some well known standard techniques for reporting an optimal folding apply trace-back procedures over the folding score matrix $M$, in $O(n^2)$ time [4]. In this section we show how to reconstruct one optimal folding, without exceeding the time and space complexities of our folding algorithm. Note that this is a challenging task, as the classical trace-back algorithm requires the availability of the full table $M$, while our algorithm stores only those entries of $M$ which correspond to OCT subsequences.

Assume that the full table $M$ is given, with annotated OCT subsequences. The basic recursive folding trace-back reconstruction algorithm [4] could be modified as follows to utilize the OCT subsequences in the computation of the trace-back of $S_{i,j}$.
1. For $j \le i$, the only (optimal) folding of $S_{i,j}$ is the empty folding, and the algorithm halts without reporting any base-pair.
2. For $j > i$, if $S_{i,j}$ is an OCT, then every optimal folding of $S_{i,j}$ includes the pair $(i, j)$, and therefore the algorithm reports the pair $(i, j)$ and is called recursively on the subsequence $S_{i+1,j-1}$.
3. Otherwise, $S_{i,j}$ is partitionable, and therefore the algorithm finds an index $q \in Q_{i,j}$ for which $M[i, j] = M[i, q-1] + M[q, j]$ and then continues by recursively computing an optimal folding of $S_{i,q-1}$ and of $S_{q,j}$. An optimal folding of $S_{i,q-1}$ is obtained by calling the trace-back algorithm recursively with the sub-instance $S_{i,q-1}$. As for computing an optimal folding of $S_{q,j}$, note that $S_{q,j}$ is an OCT, and consider the two cases, where either $q = j$ or $q < j$. If $q = j$, then there is no need for another recursive call. Otherwise $q < j$, and an optimal folding of $S_{q,j}$ is obtained by first reporting the base-pair $(q, j)$ and then calling the trace-back algorithm recursively with the sub-instance $S_{q+1,j-1}$.
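The three steps above can be sketched in Python for the full-table case. The code is an assumption-laden illustration (the names `traceback_full` and `oct_starts` are hypothetical, and plain recursion is used, so it suits short sequences only); it builds the full table $M$ together with the OCT annotations and then follows steps 1 to 3 literally.

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}
NEG_INF = float("-inf")


def traceback_full(seq):
    """Return one optimal folding of seq as a sorted list of 1-based pairs."""
    n = len(seq)
    s = " " + seq
    M = [[0] * (n + 2) for _ in range(n + 2)]
    # oct_starts[j] lists every q such that S_{q,j} is an OCT (q = j included).
    oct_starts = [[j] for j in range(n + 2)]
    for i in range(n - 1, 0, -1):
        for j in range(i + 1, n + 1):
            lp = max(M[i][q - 1] + M[q][j] for q in oct_starts[j])
            lc = M[i + 1][j - 1] + 1 if s[j] == COMPLEMENT[s[i]] else NEG_INF
            M[i][j] = max(lp, lc)
            if lc > lp:
                oct_starts[j].append(i)

    pairs = []

    def walk(i, j):
        if j <= i:                         # step 1: only the empty folding
            return
        if i in oct_starts[j]:             # step 2: S_{i,j} is an OCT
            pairs.append((i, j))
            walk(i + 1, j - 1)
            return
        for q in oct_starts[j]:            # step 3: scan Q_{i,j}
            if q > i and M[i][j] == M[i][q - 1] + M[q][j]:
                walk(i, q - 1)
                if q < j:                  # S_{q,j} is a non-singleton OCT
                    pairs.append((q, j))
                    walk(q + 1, j - 1)
                return

    walk(1, n)
    return sorted(pairs)
```

Which optimal folding is returned depends on the order in which $Q_{i,j}$ is scanned; any $q$ that satisfies the equality in step 3 leads to a folding of size $L(i, j)$.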

Time complexity analysis of the trace-back algorithm on the full table $M$. Observe that, when calling the above algorithm to compute the folding trace-back of $S_{i,j}$, recursive calls with three different subsequences could be initiated at the top level: $S_{i+1,j-1}$, $S_{q+1,j-1}$ and $S_{i,q-1}$, and thus index $j$ is eliminated from further consideration as an end index. Therefore, each recursive call is performed with a different end index $j$, and altogether there are at most $n$ recursive calls in the whole computation. For a recursive call in which the end index is $j$, at most $O(|Q_{1,j}|)$ operations are performed in finding a $q \in Q_{i,j}$ for which $M[i, j] = M[i, q-1] + M[q, j]$. Since $\sum_{1 \le j \le n} |Q_{1,j}| \le Z$, the total running time is $O(Z)$.

We next turn to address the challenge of reconstructing an optimal folding from the sparse table $M$ computed in Section 3.2. The above described algorithm cannot be applied directly in this case, due to the fact that when the algorithm needs to find $q \in Q_{i,j}$ for which $M[i, j] = M[i, q-1] + M[q, j]$, the values $M[i, j]$ and $M[i, q-1]$ may have been discarded from memory (while $M[q, j]$ is maintained in memory since $S_{q,j}$ is an OCT). In order to overcome this difficulty we adopt an approach similar to that of the algorithm of Hirschberg [6], namely performing on-demand re-computation of discarded entries. Thus, it remains to show how to recover such deleted entries.

Lemma 4. Given the sparse table $M$ that contains optimal folding cardinalities for OCT subsequences, there is an algorithm which recovers the set of entries $M[i, i+1], M[i, i+2], \ldots, M[i, j]$, for a pair of given indices $i$ and $j$, in $O(Z)$ time.

Proof. The entries of the form $M[i, j']$ which have been discarded from memory correspond to partitionable subsequences, where $L(i, j') = L^p(i, j')$, and thus may be recomputed based solely on Equation 3.4. Observe that this computation examines only entries of the form $M[i, q]$ for $q < j'$, and $M[q, j']$ for OCT subsequences $S_{q,j'}$. Re-computing the entries of the $i$-th row from left to right guarantees that upon computing $M[i, j']$, all necessary values for the computation of $L^p(i, j')$ are already stored in $M$. For each $i < j' \le j$, there are $O(|Q_{1,j'}|)$ operations performed along this computation, due to the consideration of branch points in the set $Q_{i,j'}$. As before, summing this expression over all $i < j' \le j$ accumulates to $O(Z)$. □

We next show that, throughout the full run of the algorithm, the process of restoring row entries is applied to $O(L)$ distinct start indices. Consider the case where the trace-back algorithm is applied on $S_{i,j}$ and assume that the set of entries $M[i, i+1], M[i, i+2], \ldots, M[i, j]$ was already previously restored. Note that a recursive call with a sub-instance of the form $S_{i,q-1}$ does not require the restoration of the entries $M[i, i+1], M[i, i+2], \ldots, M[i, q-1]$, as (by the assumption) they have already been restored and are maintained in $M$. The other two possible recursive calls, with sub-instances of the form $S_{i+1,j-1}$ or $S_{q+1,j-1}$, do require re-computation of entries in $M$ (in rows $i+1$ or $q+1$, correspondingly). However, observe that each call of the latter kind is preceded by a detection of a base-pair. Since throughout the full run of the algorithm only $L$ base pairs are detected, we get that the row entry recovery only needs to be executed $L$ times (in addition to the recovery of $M[1, 2], M[1, 3], \ldots, M[1, n]$ during initialization). Thus, according to Lemma 4, the entry value recovery contributes an additional $O(LZ)$ factor to the total time complexity of the trace-back algorithm.

Furthermore, note that upon performing such a re-computation of an entry set, there is no need to further maintain the values in $M[i, i+1], M[i, i+2], \ldots, M[i, j]$ in the case where $S_{i,j}$ is co-terminus, nor to keep the values in $M[i, q], M[i, q+1], \ldots, M[i, j]$ in the case where $S_{i,j}$ is partitionable. This allows discarding these values from memory before the re-computation of the entry set for the corresponding sub-instance, guaranteeing that at each stage, at most $n$ recovered entries are maintained in the sparse table $M$, in addition to the already existing OCT corresponding entries. Therefore, the space complexity of the trace-back algorithm remains $O(Z + n) = O(Z)$.

Alg. 2 in the supporting on-line material implements the efficient trace-back scheme.

Lemma 5. There is an algorithm which, given the sparse table $M$ that contains folding scores for all OCT subsequences of $S$, computes an optimal folding of $S$ in $O(LZ)$ time and $O(Z)$ space.
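A possible shape of the sparse trace-back is sketched below in Python. It is not the paper's Alg. 2: the function names (`sparse_fill`, `recover_row`, `sparse_traceback`) are assumptions, and, as a design choice of this sketch, the recursion is replaced by an explicit stack of pending sub-instances so that only one recovered row is alive at a time. Row recovery follows Lemma 4 and is triggered once for the initial instance and once per reported base-pair.

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}
NEG_INF = float("-inf")


def sparse_fill(seq):
    """Return oct_[j] = {q: L(q, j)} for every OCT subsequence S_{q,j}."""
    n = len(seq)
    s = " " + seq
    oct_ = [dict() for _ in range(n + 1)]
    below = [0] * (n + 1)
    for i in range(n, 0, -1):
        row = [0] * (n + 1)
        oct_[i][i] = 0
        for j in range(i + 1, n + 1):
            lp = max(row[q - 1] + v for q, v in oct_[j].items())
            lc = below[j - 1] + 1 if s[j] == COMPLEMENT[s[i]] else NEG_INF
            row[j] = max(lp, lc)
            if lc > lp:
                oct_[j][i] = row[j]
        below = row
    return oct_


def recover_row(oct_, i, j):
    """Recompute M[i, i-1], ..., M[i, j] from the OCT entries only (Lemma 4)."""
    row = {i - 1: 0, i: 0}
    for jj in range(i + 1, j + 1):
        if i in oct_[jj]:                  # OCT entries were never discarded
            row[jj] = oct_[jj][i]
        else:                              # partitionable: Equation 3.4
            row[jj] = max(row[q - 1] + v for q, v in oct_[jj].items() if q > i)
    return row


def sparse_traceback(seq):
    """Return one optimal folding of seq using only the sparse table."""
    n = len(seq)
    oct_ = sparse_fill(seq)
    pairs = []
    stack = [(1, n)]                       # sub-instances still to be folded
    while stack:
        i, j = stack.pop()
        row = recover_row(oct_, i, j)      # one recovery per popped sub-instance
        while j > i:
            if i in oct_[j]:               # S_{i,j} is an OCT: treat it as q = i
                q = i
            else:                          # optimal branch point in Q_{i,j}
                q = next(q for q, v in oct_[j].items()
                         if q > i and row[q - 1] + v == row[j])
            if q < j:
                pairs.append((q, j))
                stack.append((q + 1, j - 1))
            j = q - 1                      # continue with S_{i,q-1}, same row
    return sorted(pairs)
```

The inner loop walks along a single row $i$ (the sub-instances $S_{i,q-1}$), and every pushed sub-instance corresponds to one reported base-pair, which mirrors the counting argument that bounds the number of row recoveries by $L$ plus the initial one.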

3.4 Extending the space reduction to Simultaneous Alignment with Folding

The goal of the SAF problem is to find a multiple sequence alignment and a common folding of the aligned sequences, which optimizes some score function. For simplicity, we assume the problem instance consists of two sequences. Similarly to single RNA strand folding algorithms, the basic dynamic programming algorithm for the SAF problem [15] computes the scores for all sub-instances of its input instance, and then combines these values to resolve the score of the full input instance. Given an instance of the problem, a pair of RNA sequences $S$ and $T$, the algorithm maintains the scores of sub-instances $(S_{i,j}, T_{i',j'})$ in a four-dimensional table $N$ (see Fig. 2). For $|S| = n$ and $|T| = m$, we depict $N$ as an $n \times n$ "super table", in which each entry $N_{i,j}$ corresponds to an internal table of size $m \times m$, where the combined alignment-with-folding score of the sub-instance $(S_{i,j}, T_{i',j'})$ is stored in the entry $N_{i,j}[i', j']$. The time complexity of the basic dynamic programming algorithm for the SAF problem is dictated by the need to compute all $O(n^2m^2)$ sub-instances, where each such computation involves the consideration of a set of $O(nm)$ competing branch point index pairs (i.e. all $(q, q')$ such that $i < q \le j$ and $i' < q' \le j'$). This yields a total time complexity of $O(n^3m^3)$.

Recently, [12] extended the approach of [17] and applied it to speed up SAF by reducing the number of branch points that need to be considered in the main recursion for the SAF score computation. Similarly to the concept of OCT sequences, it is possible to define OCT-aligned sequence pairs, where the pair $(S_{i,j}, T_{i',j'})$ is OCT-aligned if, in every optimal alignment-with-folding of $(S_{i,j}, T_{i',j'})$, the bases $s_i$ and $t_{i'}$ are aligned to each other, the bases $s_j$ and $t_{j'}$ are aligned to each other, and the common folding is co-terminus. Using this formulation to describe the results of [12], it was shown that it is sufficient to examine branch point pairs $(q, q')$ such that the sequences $S_{q,j}$ and $T_{q',j'}$ are OCT-aligned, thus reducing the number of examined branch points and improving the running time of the algorithm. This extension yields an $O(nm\tilde{Z})$ time and $O(n^2m^2)$ space algorithm for the SAF problem, where $\tilde{Z}$ is the number of OCT-aligned sub-instances, and $nm \le \tilde{Z} \le n^2m^2$ (in practice, $\tilde{Z}$ is expected to be significantly smaller than $O(n^2m^2)$ [12]).

Applying an observation similar to Lemma 2, an algorithm is suggested here which, upon the computation of entry $N_{i,j}[i', j']$, queries only those entries which correspond to OCT-aligned sub-instances, in addition to entries in rows $i$ and $i+1$ of the "super table" $N$. The space complexity of SAF is thus reduced to $O(nm^2 + \tilde{Z})$ (w.l.o.g. $m \le n$). In the extended version of this paper we describe in detail how to extend the space-reduction technique described in Section 3.2 to the four-dimensional matrix computed by the SAF algorithm [12]. An intuitive explanation can be found in Fig. 2.

Lemma 6. There is an algorithm which computes the simultaneous alignment with folding of two RNA sequences $S$ and $T$ in $O(nm\tilde{Z})$ time and $O(nm^2 + \tilde{Z})$ space, where $n = |S|$, $m = |T|$, and w.l.o.g. $m \le n$.

4 Utilizing step characterization

In this section we show how to take advantage of a step characterization of the single strand base-pairing maximization problem in order to improve the running time of the algorithms which compute it. Based on this approach, in Section 4.1 we describe an improvement to Alg. 1 which reduces its running time from $O(nZ)$ to $O(n^2 + LZ)$, and then in Section 4.2 we further reduce it to $O(LZ)$. Both algorithms have the same space complexity as Alg. 1, which is $O(Z)$.

Let $S_{i,j}$ be a subsequence of $S$, and $S_{i',j'}$ another subsequence such that either $i' = i + 1$ and $j' = j$, or $i' = i$ and $j' = j - 1$. It is straightforward to show that $L(i', j') \le L(i, j) \le L(i', j') + 1$. Therefore, we get the following observation:

Fig. 2. A schematic illustration of the data structure maintained by the dynamic programming algorithm for the SAF problem. In order to compute the entry $N_{i,j}[i', j']$, the algorithm needs to examine values in entries marked with "club" symbols. These entries include entries in the $i$-th and $(i+1)$-th rows of the super table (the entries $N_{i,j}[i', j'-1]$, $N_{i,j-1}[i', j']$, $N_{i,j}[i'+1, j']$, $N_{i+1,j}[i', j']$, $N_{i+1,j-1}[i'+1, j'-1]$, and entries of the form $N_{i,q-1}[i', q'-1]$), as well as entries of the form $N_{q,j}[q', j']$ which correspond to OCT-aligned sub-instances $(S_{q,j}, T_{q',j'})$ (illustrated by grayed entries). The algorithm maintains only entries in rows $i$ and $(i+1)$ in the super table (marked with "dot" symbols) and entries which correspond to OCT-aligned sub-instances. All entries from the $(i+2)$-th row onward in the super table which do not correspond to OCT-aligned sub-instances are discarded from memory.

Observation 2. For every $1 \le k \le n$, the sequence $L(k, k), L(k, k+1), \ldots, L(k, n)$, as well as the sequence $L(k, k), L(k-1, k), \ldots, L(1, k)$, is monotonically non-decreasing with unit steps in the range $0$ to $L$.

The above observation implies a bound on Z, as follows.

Lemma 7. The value of $Z$ satisfies $n \le Z \le n(L + 1)$.

Proof. Every OCT subsequence $S_{i,j}$ satisfies that either $i = j$, or $L(i, j) > L(i+1, j)$. Hence, according to Observation 2, there are at most $L + 1$ OCT subsequences that end with a given index $j$, and therefore there are at most $n(L + 1)$ OCT subsequences of $S$. □
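Observation 2 and the bound of Lemma 7 are easy to check numerically on small inputs. The Python sketch below is only a sanity check under the assumptions of this section; the names `fold_table_and_z` and `unit_steps_hold` are hypothetical, and the table is filled with the plain cubic recurrence.

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}
NEG_INF = float("-inf")


def fold_table_and_z(seq):
    """Return (L, Z): the 1-based table of L(i, j) and the number of OCTs."""
    n = len(seq)
    s = " " + seq
    L = [[0] * (n + 2) for _ in range(n + 2)]
    z = n                                   # the n singletons S_{i,i} are OCTs
    for i in range(n - 1, 0, -1):
        for j in range(i + 1, n + 1):
            lp = max(L[i][q - 1] + L[q][j] for q in range(i + 1, j + 1))
            lc = L[i + 1][j - 1] + 1 if s[j] == COMPLEMENT[s[i]] else NEG_INF
            L[i][j] = max(lp, lc)
            if lc > lp:
                z += 1
    return L, z


def unit_steps_hold(seq):
    """Check Observation 2: rows and columns of L grow by steps of 0 or 1."""
    n = len(seq)
    L, _ = fold_table_and_z(seq)
    for k in range(1, n + 1):
        along_row = [L[k][j] for j in range(k, n + 1)]      # L(k,k), ..., L(k,n)
        up_column = [L[i][k] for i in range(k, 0, -1)]      # L(k,k), ..., L(1,k)
        for values in (along_row, up_column):
            if any(b - a not in (0, 1) for a, b in zip(values, values[1:])):
                return False
    return True
```

For ACAGUUGCA, with $n = 9$ and $L = 4$, Lemma 7 places $Z$ between 9 and 45.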

4.1 An $O(n^2 + LZ)$ Algorithm

Similarly to the previously presented technique for restricting the set of ex-amined branch points in the computation of Lp (i, j), we next show anotherdominance relation which can be utilized to further constrain the set of branchpoints examined in Equation 3.3.

Definition 6 (step sequence). Call a subsequence Si,j a step sequence if in every optimal folding of Si,j the base i is paired.

Observe that Si,j is a step sequence if and only if q = i + 1 is not a branch point in any of the optimal foldings of Si,j, i.e. L(i, j) > L(i, i) + L(i + 1, j) = L(i + 1, j). Also note that any OCT subsequence of length greater than 1 is a step sequence, though the converse is not necessarily true. In the following lemma we further restrict the branch points which need to be examined in a recursive computation of Lp(i, j).

Lemma 8. For any subsequence Si,j such that j > i, there is an optimal branch point q with respect to Si,j such that either q = i + 1, or Si,q−1 is a step sequence and Sq,j is an OCT.

Proof. If q = i + 1 is an optimal branch point with respect to Si,j, the lemma holds. Otherwise, Lp(i, j) > L(i, i) + L(i + 1, j) = L(i + 1, j). According to Lemma 1, there is an optimal branch point i + 1 < q ≤ j such that Sq,j is an OCT. Therefore, L(i, q − 1) + L(q, j) = Lp(i, j) > L(i + 1, j) ≥ L(i + 1, q − 1) + L(q, j), where the last inequality follows from Observation 1. It follows that L(i, q − 1) > L(i + 1, q − 1), hence Si,q−1 is a step sequence. □

Define the following subset of branch points with respect to Si,j :

Pi,j = {i+ 1} ∪ {i+ 1 < q ≤ j : Si,q−1 is a step sequence and Sq,j is an OCT} .

The following equation restates Equation 3.4, based on Lemma 8.

\[ L^p(i, j) = \max_{q \in P_{i,j}} \{ L(i, q - 1) + L(q, j) \}. \qquad (4.1) \]
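As a sanity check of Equation 4.1, the restricted maximum over Pi,j can be compared with the maximum over all branch points i < q ≤ j on small sequences. The Python sketch below does this with a full cubic table; it assumes Watson-Crick pairing (A-U, C-G), and the function names are hypothetical helpers introduced for this illustration only.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}  # assumed pairing rule


def fold_table(s):
    """Cubic reference DP: returns (L, octs) with 1-based L[i][j] and the OCT set."""
    n = len(s)
    L = [[0] * (n + 2) for _ in range(n + 2)]
    octs = set()
    for i in range(n, 0, -1):
        octs.add((i, i))
        for j in range(i + 1, n + 1):
            lp = max(L[i][q - 1] + L[q][j] for q in range(i + 1, j + 1))
            lc = L[i + 1][j - 1] + 1 if (s[i - 1], s[j - 1]) in WATSON_CRICK else -1
            L[i][j] = max(lp, lc)
            if lc > lp:
                octs.add((i, j))
    return L, octs


def lp_full(L, i, j):
    """L^p(i, j) as a maximum over every branch point i < q <= j."""
    return max(L[i][q - 1] + L[q][j] for q in range(i + 1, j + 1))


def lp_restricted(L, octs, i, j):
    """Right-hand side of Equation 4.1: maximise only over q in P_{i,j}."""
    best = L[i][i] + L[i + 1][j]  # q = i + 1 always belongs to P_{i,j}
    for q in range(i + 2, j + 1):
        is_step = L[i][q - 1] > L[i + 1][q - 1]  # S_{i,q-1} is a step sequence
        if is_step and (q, j) in octs:           # S_{q,j} is an OCT
            best = max(best, L[i][q - 1] + L[q][j])
    return best


def equation_4_1_holds(s):
    L, octs = fold_table(s)
    n = len(s)
    return all(lp_full(L, i, j) == lp_restricted(L, octs, i, j)
               for i in range(1, n + 1) for j in range(i + 1, n + 1))
```

For example, `equation_4_1_holds("ACAGUUGCA")` evaluates both sides for every pair i < j of that sequence.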

We next show a bottom-up algorithm that computes L according to Equations 3.1, 3.2, and 4.1. The presented algorithm is similar to Alg. 1, except that a forward dynamic programming technique is applied in order to efficiently compute Lp(i, j) (forward dynamic programming was also applied by Jansson et al. [10] to a related problem).

The new algorithm also scans and computes the entries of M in decreasing row index and increasing column index. It maintains the following invariant: upon reaching entry M[i, j], the entry contains the value Lp(i, j). Before computing row i in M, the entries M[i, i − 1] and M[i, i] are initialized with zeros, and all entries M[i, j] for i < j ≤ n are initialized with the corresponding values M[i + 1, j]. This initialization is equivalent to examining the branch point q = i + 1 in the computation of Lp(i, j) according to Equation 4.1 for all j > i (the branching at q = i + 1 is handled separately from the other branch points in Pi,j since, unlike them, it does not follow the step-sequence-prefix, OCT-suffix rule). Note that at this stage the invariant already holds for the first entry in the row which is traversed by the algorithm, M[i, i + 1], since Pi,i+1 = {i + 1}.

Based on the invariant, upon reaching M[i, j], the entry contains the value Lp(i, j), and the value L(i, j) can be computed by taking the maximum of the current entry value and the value of Lc(i, j), which is obtained from Equation 3.2. If Lc(i, j) > Lp(i, j), Si,j is classified as an OCT. Then, if M[i, j] > M[i + 1, j], Si,j is classified as a step sequence, and the branch point q = j + 1 is considered and forward-reflected to the computation of Lp(i, j′) for all j′ > j such that Sj+1,j′ is an OCT. This is done by updating the value of M[i, j′] to be the maximum of its current value and M[i, j] + M[j + 1, j′], thus accumulating the maximum according to Equation 4.1 and guaranteeing the maintenance of the invariant.

Alg. 5 in the supporting on-line material implements the forward dynamic programming approach described above, combined with the space-efficient approach described in Section 3.2. An illustration of its run is given in Fig. 3. The speedup obtained by this algorithm is due to the fact that branch points are examined by Equation 4.1 only if the sequence prefix before the branch point is a step sequence and the suffix starting at the branch point is an OCT. Note that, for each one of the Z OCT subsequences Sq,j which are examined as suffixes by Equation 4.1, Observation 2 shows that there are at most L sequences Si,q−1 which may be corresponding step-sequence prefixes, and thus the total run-time contribution due to the computation of values of the form Lp(i, j) is O(LZ). Since the table M has O(n²) entries, and for each entry O(1) operations are performed in addition to the operations involved in the computations of Lp(i, j), the total running time is O(n² + LZ). The space complexity remains O(Z), as for Alg. 1.
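The forward scheme and its operation count can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the paper's Alg. 5: it keeps the whole table M in memory (so it does not achieve the O(Z) space bound), it assumes Watson-Crick pairing (A-U, C-G), and the names `forward_fold`, `oct_ends`, and `updates` are hypothetical. The counter `updates` records the forward-reflected branch point examinations, which the argument above bounds by LZ.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}  # assumed pairing rule


def forward_fold(s):
    """Forward DP sketch. Returns (L(1, n), Z, number of forward updates)."""
    n = len(s)
    M = [[0] * (n + 2) for _ in range(n + 2)]  # 1-based, M[i][i-1] = M[i][i] = 0
    oct_ends = [[] for _ in range(n + 2)]      # oct_ends[q]: all j with S_{q,j} an OCT
    updates = 0
    for i in range(n, 0, -1):
        for j in range(i + 1, n + 1):
            M[i][j] = M[i + 1][j]              # branch point q = i + 1
        oct_ends[i].append(i)                  # S_{i,i} is an OCT
        for j in range(i + 1, n + 1):
            # Invariant: M[i][j] holds L^p(i, j) at this point.
            if (s[i - 1], s[j - 1]) in WATSON_CRICK and M[i][j] < M[i + 1][j - 1] + 1:
                M[i][j] = M[i + 1][j - 1] + 1  # L^c(i, j) wins: S_{i,j} is an OCT
                oct_ends[i].append(j)
            if M[i][j] > M[i + 1][j]:          # S_{i,j} is a step sequence
                for j2 in oct_ends[j + 1]:     # forward-reflect branch point q = j + 1
                    M[i][j2] = max(M[i][j2], M[i][j] + M[j + 1][j2])
                    updates += 1
    Z = sum(len(ends) for ends in oct_ends)
    return M[1][n], Z, updates


best, Z, updates = forward_fold("ACAGUUGCA")
print(best, Z, updates, updates <= best * Z)
```

Keeping `oct_ends` indexed by the start position makes the inner loop touch only OCT suffixes, which is what confines the branch point work to step-sequence-prefix and OCT-suffix combinations.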

Lemma 9. Given an RNA sequence S of length n, there is an algorithm which computes L(1, n) in O(n² + LZ) time and O(Z) space.

4.2 An O(LZ) Algorithm

In this section we show how to further reduce the running time of the folding algorithm from O(n² + LZ) to O(LZ). We do so by applying a step encoding [7] to M, representing each of its rows by its O(L) steps (see Fig. 4). Hence, in what follows we give corresponding step-encoding formulations, where a typical instance is composed of a suffix Si,n of S and a folding cardinality x, and is denoted by the pair (Si,n, x). The goal is to compute the minimum index i − 1 ≤ j ≤ n such that there is a folding of Si,j whose cardinality is x. The next definition gives step-encoding equivalents of the entities L(i, j), Lp(i, j), and Lc(i, j).

Definition 7. For 1 ≤ i ≤ n, 1 ≤ x, and α ∈ {ε, p, c} (where ε denotes the empty word), define βα(i, x) to be the minimum index j such that Lα(i, j) ≥ x, or ∞ if there is no such j.

Note the relation between the step-encoding formulation and the standard formulation: L(i, j) is the maximum x such that β(i, x) ≤ j. Say that a sub-instance (Si,n, x) is a β-OCT if β(i, x) = βc(i, x) < βp(i, x). In the following definition we give the step-encoding equivalent of Pi,j.
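The two directions of this relation can be illustrated with a short Python sketch. The row used below is row 3 of M for S = ACAGUUGCA, hand-computed for this illustration under the assumption of Watson-Crick pairing (A-U, C-G); it is consistent with the value B[3, 1] = 5 quoted in the caption of Fig. 4. The function names are hypothetical.

```python
import math
from bisect import bisect_left


def beta(row, i, x):
    """row[k] = L(i, i - 1 + k) for k = 0..n - i + 1 (non-decreasing, unit steps).
    Returns the minimum j with L(i, j) >= x, or math.inf if there is none."""
    k = bisect_left(row, x)
    return i - 1 + k if k < len(row) else math.inf


def cardinality(betas, j):
    """betas[x] = beta(i, x) for x = 0, 1, ...; returns the maximum x with
    beta(i, x) <= j, which is L(i, j)."""
    return max(x for x, b in enumerate(betas) if b <= j)


# Assumed example data: L(3, j) for j = 2..9 and S = ACAGUUGCA.
row3 = [0, 0, 0, 1, 1, 1, 2, 3]
betas3 = [beta(row3, 3, x) for x in range(4)]
print(betas3)
print([cardinality(betas3, j) for j in range(2, 10)])
```

Since each row is non-decreasing, the step encoding stores one index per attainable cardinality, which is why a row needs only O(L) entries instead of O(n).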

Fig. 3. An exemplification of Alg. 5. The left figure shows the initialization of row 3 in the table M, copying the values from row 4. The figure in the middle presents the computation of the entry M[3, 5]. Upon reaching this entry, the entry value is 0, which corresponds to Lp(3, 5). Since s5 = s3, Lc(3, 5) = M[4, 4] + 1 = 1 is evaluated, the entry value is updated to be 1, and S3,5 is marked as an OCT. Since M[3, 5] > M[4, 5], S3,5 is classified as a step sequence, and the branching at q = 6 is considered for all sequences S3,j such that S6,j is an OCT. This forward computation updates the values in M[3, 6] and M[3, 9]. The right figure shows the computation of the next entry, M[3, 6]. Here there is no need to evaluate Lc(3, 6) (since s6 ≠ s3), and S3,6 is also classified as a step sequence (since M[3, 6] > M[4, 6]). Therefore, the branching at q = 7 is considered for all sequences S3,j such that S7,j is an OCT, updating the values in M[3, 7] and M[3, 8].

Definition 8.

\[ Y_{i,x} = \{i + 1\} \cup \{\, i + 1 < q \le \beta(i + 1, x - 1) : S_{i,q-1} \text{ is a step sequence, and } (S_{q,n},\ x - L(i, q - 1)) \text{ is a } \beta\text{-OCT} \,\}. \]

The auxiliary function defined below will be used in the computation of βc(i, x).

Definition 9. For σ ∈ {A, C, G, U} and 1 ≤ r ≤ n, define NextMatch(r, σ) to be the minimum index r′ > r such that sr′ = σ, or ∞ if there is no such index r′.
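One possible way to support NextMatch queries is a table filled by a single right-to-left scan, as in the following Python sketch. The table layout and the name `build_next_match` are assumptions made for this illustration (the text does not specify an implementation), and the sketch also accepts r = 0 for convenience.

```python
import math


def build_next_match(s, alphabet="ACGU"):
    """Returns nxt with nxt[c][r] = minimum 1-based index r' > r such that
    s_{r'} = c, or math.inf if there is no such index, for 0 <= r <= n."""
    n = len(s)
    nxt = {c: [math.inf] * (n + 1) for c in alphabet}
    for r in range(n - 1, -1, -1):
        for c in alphabet:
            nxt[c][r] = nxt[c][r + 1]  # inherit the answer for r + 1
        nxt[s[r]][r] = r + 1           # s[r] (0-based) is s_{r+1}, the closest candidate
    return nxt


nxt = build_next_match("ACAGUUGCA")
print(nxt["U"][3])  # NextMatch(3, U) for the example sequence of Fig. 4
```

After the O(n) scan (for a constant-size alphabet), each query is a single table lookup.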

We now convert Equations 3.1, 3.2 and 4.1 to their equivalent forms in the step encoding. For all 1 ≤ i ≤ n and 1 ≤ x:

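\[ \beta(i, x) = \min\{\beta^c(i, x),\ \beta^p(i, x)\}. \qquad (4.2) \]

\[ \beta^c(i, x) = \mathrm{NextMatch}(\beta(i + 1, x - 1),\ s_i). \qquad (4.3) \]

\[ \beta^p(i, x) = \min\Big\{ \min_{q \in Y_{i,x}} \{\beta(q,\ x - L(i, q - 1))\},\ \beta^c(i, x) + 1 \Big\}. \qquad (4.4) \]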

Intuitively, the computation of L(i, j) can be viewed as a competition between different foldings of the fixed sequence Si,j, in order to resolve the maximal folding cardinality, while the computation of β(i, x) can be viewed as a competition between different foldings of a fixed cardinality x which correspond to subsequences that start at index i, in order to resolve the minimal subsequence end index. Formal proofs of the correctness of Equations 4.2 to 4.4, as well as the pseudocode of an algorithm that implements them, are included in the supporting on-line material. This algorithm, denoted Alg. 6, adopts a forward dynamic programming approach, similarly to that of Alg. 5. This allows for efficient computation of Equation 4.4, where the number of sub-instances, as well as the dimensions of the data structure which stores solutions for these sub-instances, is O(Ln) (instead of O(n²)).

Lemma 10. Given an RNA sequence S of length n, there is an algorithm which computes L(1, n) in O(LZ) time and O(Z) space.

Fig. 4. An exemplification of Alg. 6. This figure shows the table B with respect to the RNA sequence S = ACAGUUGCA, which is the step encoding of the table M in Fig. 1. The left plot shows the complete table, where grayed entries correspond to β-OCT sub-instances. The plot in the middle presents a snapshot of the entries maintained by the algorithm at the stage where row 3 is about to be computed, after the initialization. The right plot presents the computation of the entry B[3, 1]. The initial entry value 8, which corresponds to βp(3, 1), is replaced with the value NextMatch(B[4, 0], s3) = NextMatch(3, U) = 5, and the sub-instance (S3,n, 1) is marked as a β-OCT. Then, all branch points q such that B[3, 1] = 5 ≤ q − 1 < 8 = B[4, 1] are examined for the computation of βp(3, x) for all x such that the sub-instances (Sq,n, x − L(3, q − 1)) = (Sq,n, x − 1) are β-OCTs. This computation updates the values in the entries B[3, 2] and B[3, 3].

References

1. Tatsuya Akutsu. Approximation and exact algorithms for RNA secondary structure prediction and recognition of stochastic context-free languages. Journal of Combinatorial Optimization, 3:321–336, 1999.

2. T. M. Chan. More algorithms for all-pairs shortest paths in weighted graphs. In Proc. 39th Symposium on the Theory of Computing (STOC), pages 590–598, 2007.

3. Athanasius F. Bompfunewerer Consortium, Rolf Backofen, Stephan H. Bernhart, Christoph Flamm, Claudia Fried, Guido Fritzsch, Jorg Hackermuller, Jana Hertel, Ivo L. Hofacker, Kristin Missal, Axel Mosig, Sonja J. Prohaska, Dominic Rose, Peter F. Stadler, Andrea Tanzer, Stefan Washietl, and Sebastian Will. RNAs everywhere: genome-wide annotation of structured RNAs. J Exp Zoolog B Mol Dev Evol, 308(1):1–25, 2007.

4. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.

5. Paul P. Gardner and Robert Giegerich. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5:140, 2004.

6. D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6):341–343, 1975.

7. D. S. Hirschberg. Algorithms for the longest common subsequence problem. JACM, 24:664–675, 1977.

8. Ivo L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Research, (13):3429–3431, 2003.

9. I. Tinoco, P. N. Borer, B. Dengler, M. D. Levine, O. C. Uhlenbeck, D. M. Crothers, and J. Gralla. Improved estimation of secondary structure in ribonucleic acids. Nature New Biology, 246:40–41, 1973.

10. Jesper Jansson, See-Kiong Ng, Wing-Kin Sung, and Hugo Willy. A faster and more space-efficient algorithm for inferring arc-annotations of RNA sequences through alignment. Algorithmica, 46(2):223–245, 2006.

11. J. H. Havgaard, R. B. Lyngso, G. D. Stormo, and J. Gorodkin. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, 21(9):1815–1824, 2005.

12. M. Ziv-Ukelson, I. Gat-Viks, Y. Wexler, and R. Shamir. A faster algorithm for RNA co-folding. In Proc. Algorithms in Bioinformatics, 8th International Workshop (WABI), LNBI 5251, pages 174–185, September 2008.

13. David H. Mathews and Douglas H. Turner. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology, 317(2):191–203, 2002.

14. Ruth Nussinov and Ann B. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. PNAS, 77(11):6309–6313, 1980.

15. David Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics, 45(5):810–825, 1985.

16. M. S. Waterman and T. F. Smith. RNA secondary structure: a complete mathematical analysis. Math. Biosci., 42:257–266, 1978.

17. Ydo Wexler, Chaya Zilberstein, and Michal Ziv-Ukelson. A study of accessible motifs and RNA folding complexity. Journal of Computational Biology, 14(6):856–872, 2007.

18. Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler, and Rolf Backofen. Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLOS Computational Biology, 3(4):e65, 2007.

19. M. Zuker. Computer prediction of RNA structure. Methods Enzymol., 180:262–288, 1989.

20. M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res., (13):3406–15, 2003.

21. M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133–148, 1981.

Supporting on-line material: omitted figures, pseudo-code and proofs

Fig. 5. A schematic illustration of the different recursion formulas for the single strand base-pairing maximization problem. The left plot illustrates the recursion of Equations 3.1, 3.2, and 3.4, the middle plot illustrates the recursion of Equations 3.1, 3.2, and 4.1, and the right plot illustrates the recursion of Equations 4.2, 4.3, and 4.4.

Algorithm 1 Space-efficient RNA folding

Input: An RNA sequence S = s1s2 · · · sn.
Output: L(1, n).
for i ← n down to 1 do
    set M[i, i − 1] ← 0, M[i, i] ← 0.
    mark Si,i as an OCT.
    for j ← i + 1 to n do
        M[i, j] ← max over q ∈ Qi,j of {M[i, q − 1] + M[q, j]}.  // assigning Lp(i, j) to M[i, j]
        if sj = si and M[i, j] < M[i + 1, j − 1] + 1 then
            M[i, j] ← M[i + 1, j − 1] + 1.  // assigning Lc(i, j) to M[i, j]
            mark Si,j as an OCT.
        end if
    end for
    discard from memory the values in all entries in row i + 1 of M which do not correspond to OCT subsequences.
end for
return M[1, n].
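A compact Python sketch of the idea behind Alg. 1 follows. It is an illustration under stated assumptions, not the authors' code: the pairing test is taken to be Watson-Crick complementarity (A-U, C-G), Qi,j is taken to be the set of branch points i < q ≤ j for which Sq,j is an OCT, and the names `sparse_fold`, `oct_score`, and `oct_starts` are hypothetical. Only the current row, the previous row, and the OCT scores are kept.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}  # assumed pairing rule


def sparse_fold(s):
    """Returns (L(1, n), oct_score), where oct_score maps each OCT (q, j) to L(q, j)."""
    n = len(s)
    oct_score = {}                           # retained for OCT subsequences only
    oct_starts = [[] for _ in range(n + 2)]  # oct_starts[j]: all q with S_{q,j} an OCT
    prev, cur = {}, {0: 0}                   # rows i + 1 and i of M, keyed by column
    for i in range(n, 0, -1):
        cur = {i - 1: 0, i: 0}
        oct_score[(i, i)] = 0
        oct_starts[i].append(i)
        for j in range(i + 1, n + 1):
            # L^p(i, j): every q stored in oct_starts[j] satisfies i < q <= j here.
            cur[j] = max(cur[q - 1] + oct_score[(q, j)] for q in oct_starts[j])
            if (s[i - 1], s[j - 1]) in WATSON_CRICK and cur[j] < prev[j - 1] + 1:
                cur[j] = prev[j - 1] + 1     # L^c(i, j) wins: S_{i,j} is an OCT
                oct_score[(i, j)] = cur[j]
                oct_starts[j].append(i)
        prev = cur  # the old row i + 1 is dropped; its OCT values live on in oct_score
    return cur[n], oct_score


best, octs = sparse_fold("ACAGUUGCA")
print(best, len(octs))
```

In this sketch the working memory beyond the retained OCT scores is two rows of length at most n + 1.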

Algorithm 2 Folding-Traceback (M)

Input: A sparse table M which contains folding scores of all OCT subsequences of an RNA sequence S.
Output: An optimal folding of S.
call Restore-entries (M, 1, n).
call Rec-Folding-Traceback (M, 1, n).

Algorithm 3 Rec-Folding-Traceback (M, i, j)

Input: A sparse table M which contains folding scores of all OCT subsequences of an RNA sequence S, as well as folding scores for the subsequences Si,i+1, Si,i+2, . . . , Si,j for the two given internal indices i and j.
Output: An optimal folding of Si,j.
if j ≤ i then
    halt.
else if Si,j is an OCT then
    output the pair (i, j).
    discard from memory the values in all non-OCT entries M[i, i + 1], M[i, i + 2], . . . , M[i, j].
    call Restore-entries (M, i + 1, j − 1).
    call Rec-Folding-Traceback (M, i + 1, j − 1).
else
    find q ∈ Qi,j s.t. M[i, j] = M[i, q − 1] + M[q, j].
    discard from memory the values in all non-OCT entries M[i, q], M[i, q + 1], . . . , M[i, j].
    if q < j then
        output the pair (q, j).
        call Restore-entries (M, q + 1, j − 1).
        call Rec-Folding-Traceback (M, q + 1, j − 1).
    end if
    call Rec-Folding-Traceback (M, i, q − 1).
end if

Algorithm 4 Restore-entries (M, i, j)

Input: A sparse table M which contains folding scores of all OCT subsequences of an RNA sequence S, and two internal indices i and j.
Output: The table M after restoring values in the entries M[i, i + 1], M[i, i + 2], . . . , M[i, j].
for j′ = i + 1 to j do
    if M[i, j′] is discarded then
        set M[i, j′] ← max over q ∈ Qi,j′ of {M[i, q − 1] + M[q, j′]}.
    end if
end for
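The traceback of Algs. 2 to 4 can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: instead of restoring and discarding entries of a shared table, each call receives a freshly recomputed row, the pairing rule is assumed to be Watson-Crick (A-U, C-G), Qi,j is taken to be the set of q with i < q ≤ j such that Sq,j is an OCT, and all function names are hypothetical. The recursion depth grows with n, so the sketch is meant for short sequences.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}  # assumed pairing rule


def sparse_oct_table(s):
    """Forward pass in the spirit of Alg. 1: keeps only the OCT scores."""
    n = len(s)
    oct_score, oct_starts, prev = {}, [[] for _ in range(n + 2)], {}
    for i in range(n, 0, -1):
        cur = {i - 1: 0, i: 0}
        oct_score[(i, i)] = 0
        oct_starts[i].append(i)
        for j in range(i + 1, n + 1):
            cur[j] = max(cur[q - 1] + oct_score[(q, j)] for q in oct_starts[j])
            if (s[i - 1], s[j - 1]) in WATSON_CRICK and cur[j] < prev[j - 1] + 1:
                cur[j] = prev[j - 1] + 1
                oct_score[(i, j)] = cur[j]
                oct_starts[j].append(i)
        prev = cur
    return oct_score, oct_starts


def restore_row(i, j, oct_score, oct_starts):
    """Alg. 4 sketch: recompute M[i, i-1..j] from the retained OCT scores."""
    row = {i - 1: 0, i: 0}
    for j2 in range(i + 1, j + 1):
        if (i, j2) in oct_score:
            row[j2] = oct_score[(i, j2)]
        else:
            row[j2] = max(row[q - 1] + oct_score[(q, j2)]
                          for q in oct_starts[j2] if q > i)
    return row


def traceback(i, j, row, oct_score, oct_starts, out):
    """Alg. 3 sketch: append the base pairs of an optimal folding of S_{i,j} to out."""
    if j <= i:
        return
    if (i, j) in oct_score:  # S_{i,j} is an OCT, so (i, j) is paired
        out.append((i, j))
        inner = restore_row(i + 1, j - 1, oct_score, oct_starts)
        traceback(i + 1, j - 1, inner, oct_score, oct_starts, out)
        return
    q = next(q for q in oct_starts[j]
             if q > i and row[j] == row[q - 1] + oct_score[(q, j)])
    if q < j:                # the OCT suffix S_{q,j} contributes the pair (q, j)
        out.append((q, j))
        inner = restore_row(q + 1, j - 1, oct_score, oct_starts)
        traceback(q + 1, j - 1, inner, oct_score, oct_starts, out)
    traceback(i, q - 1, row, oct_score, oct_starts, out)


def fold_and_trace(s):
    n = len(s)
    oct_score, oct_starts = sparse_oct_table(s)
    pairs = []
    if n:
        traceback(1, n, restore_row(1, n, oct_score, oct_starts),
                  oct_score, oct_starts, pairs)
    return pairs


pairs = fold_and_trace("ACAGUUGCA")
print(sorted(pairs))
```

Passing rows explicitly trades some recomputation for simplicity; the in-place restore and discard steps of Algs. 3 and 4 serve the same purpose while bounding the memory in use.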

Algorithm 5 Forward RNA folding

Input: An RNA sequence S = s1s2 · · · sn.
Output: L(1, n).
for i ← n down to 1 do
    set M[i, i − 1] ← 0, M[i, i] ← 0, and M[i, j] ← M[i + 1, j] for all i < j ≤ n.
    for j ← i + 1 to n do
        if sj = si and M[i, j] < M[i + 1, j − 1] + 1 then
            M[i, j] ← M[i + 1, j − 1] + 1.
            mark Si,j as an OCT.
        end if
        if Si,j is a step sequence then
            for all j′ s.t. Sj+1,j′ is an OCT do
                M[i, j′] ← max {M[i, j′], M[i, j] + M[j + 1, j′]}.
            end for
        end if
    end for
    discard from memory the values in all entries in row i + 1 of M which do not correspond to OCT subsequences.
end for
return M[1, n].

Algorithm 6 Step-encoded RNA folding

Input: An RNA sequence S = s1s2 · · · sn.
Output: L(1, n).
set B[n, 0] ← n − 1.
for i ← n − 1 down to 1 do
    set B[i, 0] ← i − 1, B[i, L(i + 1, n) + 1] ← ∞.
    set B[i, x] ← B[i + 1, x] for all 1 ≤ x ≤ L(i + 1, n).
    for x ← 1 to L(i + 1, n) + 1 do
        if B[i, x] > NextMatch(B[i + 1, x − 1], si) then
            B[i, x] ← NextMatch(B[i + 1, x − 1], si)
            mark (Si,n, x) as a β-OCT.
        end if
        for all indices q s.t. B[i, x] ≤ q − 1 < B[i + 1, x] do
            for all x′ s.t. (Sq,n, x′) is a β-OCT do
                B[i, x + x′] ← min {B[i, x + x′], B[q, x′]}
            end for
        end for
    end for
    discard from memory the values in all entries in row i + 1 of B which do not correspond to β-OCT sub-instances.
end for
return L(1, n), the maximum x such that B[1, x] < ∞.
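The step-encoded scheme can be exercised with the following Python sketch. It is a simplified illustration rather than a faithful implementation of Alg. 6: it keeps the whole table B, scans every index q in the examined range, and therefore does not attempt to meet the stated time and space bounds. It assumes that the second argument of NextMatch is the Watson-Crick complement of si (which reproduces NextMatch(3, U) = 5 from the caption of Fig. 4); the names `step_encoded_fold`, `next_match`, and `beta_oct` are hypothetical.

```python
import math

COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}  # assumed pairing rule


def next_match(s, r, sigma):
    """Minimum 1-based index r' > r with s_{r'} = sigma, or math.inf."""
    pos = s.find(sigma, r)  # 0-based offset r corresponds to 1-based index r + 1
    return pos + 1 if pos != -1 else math.inf


def step_encoded_fold(s):
    """Returns (L(1, n), B), where B[i][x] holds beta(i, x) and B[i][0] = i - 1."""
    n = len(s)
    width = n // 2 + 2                         # a folding has at most n // 2 pairs
    B = [[math.inf] * width for _ in range(n + 2)]
    if n == 0:
        return 0, B
    beta_oct = [[] for _ in range(n + 2)]      # beta_oct[q]: x with (S_{q,n}, x) a beta-OCT
    B[n][0] = n - 1
    for i in range(n - 1, 0, -1):
        top = max(x for x in range(width) if B[i + 1][x] < math.inf)  # L(i+1, n)
        B[i][0] = i - 1
        for x in range(1, top + 1):
            B[i][x] = B[i + 1][x]              # branch point q = i + 1
        for x in range(1, top + 2):
            cand = next_match(s, B[i + 1][x - 1], COMPLEMENT[s[i - 1]])
            if B[i][x] > cand:                 # beta^c beats beta'^p
                B[i][x] = cand
                beta_oct[i].append(x)
            if B[i][x] == math.inf:
                continue
            # Branch points q with B[i][x] <= q - 1 < B[i+1][x]
            for q in range(B[i][x] + 1, min(B[i + 1][x], n) + 1):
                for x2 in beta_oct[q]:
                    B[i][x + x2] = min(B[i][x + x2], B[q][x2])
    best = max(x for x in range(width) if B[1][x] < math.inf)
    return best, B


best, B = step_encoded_fold("ACAGUUGCA")
print(best, B[3][:4], B[4][:3])
```

For the sequence of Fig. 4, the printed rows can be compared with the values quoted in its caption (B[3, 1] = 5 and B[4, 1] = 8).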

Proof (correctness of Equation 4.2). The correctness of Equation 4.2 is immediate from Definition 7.

Proof (correctness of Equation 4.3). First, we show that βc(i, x) ≥ NextMatch(β(i + 1, x − 1), si). Let F be a co-terminus folding of Si,βc(i,x) of size x, and consider the folding F′ which is obtained by removing the base pair (i, βc(i, x)) from F. Then, F′ is a folding of Si+1,βc(i,x)−1 of size x − 1, and in particular β(i + 1, x − 1) ≤ βc(i, x) − 1. Since sβc(i,x) = si, it follows that βc(i, x) = NextMatch(βc(i, x) − 1, si) ≥ NextMatch(β(i + 1, x − 1), si).

Next, we show that βc(i, x) ≤ NextMatch(β(i + 1, x − 1), si). Any folding which is obtained by adding the base pair (i, NextMatch(β(i + 1, x − 1), si)) to an optimal folding of Si+1,β(i+1,x−1) is a co-terminus folding of size x for the sequence Si,NextMatch(β(i+1,x−1),si), and from the definition of βc(i, x) we have that βc(i, x) ≤ NextMatch(β(i + 1, x − 1), si).

Thus, βc(i, x) = NextMatch(β(i + 1, x − 1), si). □

Before proving the correctness of Equation 4.4, we first show a property of βp(i, x).

Claim 1. Let 1 ≤ i ≤ n − 1 and let 1 ≤ x be such that βp(i, x) ≤ n.
1. For every i < q ≤ β(i + 1, x − 1), βp(i, x) ≤ β(q, x − L(i, q − 1)).
2. For an optimal branch point q′ with respect to Si,βp(i,x), βp(i, x) ≥ β(q′, x − L(i, q′ − 1)).

Proof. We prove the two parts of the claim separately.
1. Let i < q ≤ β(i + 1, x − 1). Note that L(i, q − 1) ≤ x − 1 (since L(i, q − 1) ≤ L(i + 1, q − 1) + 1 ≤ L(i + 1, β(i + 1, x − 1) − 1) + 1 = x − 1), and therefore x − L(i, q − 1) ≥ 1. If β(q, x − L(i, q − 1)) = ∞, then clearly βp(i, x) ≤ β(q, x − L(i, q − 1)). Otherwise, consider a folding F of Si,β(q,x−L(i,q−1)) which is obtained from the concatenation of optimal foldings for Si,q−1 and Sq,β(q,x−L(i,q−1)). Then, |F| = L(i, q − 1) + L(q, β(q, x − L(i, q − 1))) = L(i, q − 1) + x − L(i, q − 1) = x. Thus, F is a partitionable folding of size x for Si,β(q,x−L(i,q−1)), and from the definition of βp(i, x) we have that βp(i, x) ≤ β(q, x − L(i, q − 1)).
2. Let F′ be a partitionable folding of size x for Si,βp(i,x), and let q′ be a branch point in F′. Then, x = |F′| ≤ L(i, q′ − 1) + L(q′, βp(i, x)), and therefore L(q′, βp(i, x)) ≥ x − L(i, q′ − 1). It follows that there is a folding of size x − L(i, q′ − 1) for Sq′,βp(i,x), and therefore βp(i, x) ≥ β(q′, x − L(i, q′ − 1)). □

Proof (correctness of Equation 4.4). First, it is straightforward to observe that if (Si,n, x) is a β-OCT, then βp(i, x) = βc(i, x) + 1. In addition, from Claim 1, for every q ∈ Yi,x, βp(i, x) ≤ β(q, x − L(i, q − 1)), and therefore

\[ \beta^p(i, x) = \min\Big\{ \min_{q \in Y_{i,x}} \{\beta(q,\ x - L(i, q - 1))\},\ \beta^c(i, x) + 1 \Big\}. \]

If (Si,n, x) is not a β-OCT, then βp(i, x) ≤ βc(i, x) < βc(i, x) + 1. Due to Claim 1, in order to prove that

\[ \beta^p(i, x) = \min\Big\{ \min_{q \in Y_{i,x}} \{\beta(q,\ x - L(i, q - 1))\},\ \beta^c(i, x) + 1 \Big\}, \]

it is sufficient to show that there is an optimal branch point q′ ∈ Yi,x with respect to Si,βp(i,x). If i + 1 ∈ Yi,x is an optimal branch point with respect to Si,βp(i,x), the correctness of the equation follows. Otherwise, from Lemma 8 there is an optimal branch point i + 1 < q′ ≤ βp(i, x) with respect to Si,βp(i,x), such that Si,q′−1 is a step sequence and Sq′,βp(i,x) is an OCT. From Claim 1, βp(i, x) ≥ β(q′, x − L(i, q′ − 1)). Since Si,q′−1 is a step sequence, L(i + 1, q′ − 1) < L(i, q′ − 1) < x, therefore L(i + 1, q′ − 1) < x − 1, and q′ − 1 < β(i + 1, x − 1). We then get from Claim 1 that βp(i, x) = β(q′, x − L(i, q′ − 1)), and therefore Sq′,β(q′,x−L(i,q′−1)) is an OCT, i.e. (Sq′,n, x − L(i, q′ − 1)) is a β-OCT. We have shown that q′ satisfies i + 1 < q′ ≤ β(i + 1, x − 1), that Si,q′−1 is a step sequence, and that (Sq′,n, x − L(i, q′ − 1)) is a β-OCT; thus q′ ∈ Yi,x. □

Proof (Lemma 10). We prove the lemma by showing that Alg. 6 implements the recursion in Equations 4.2 to 4.4, and by analyzing its time and space complexities. We first note the following observation:

Observation 3. Every step sequence Si,j satisfies β(i, x) ≤ j < β(i + 1, x) for x = L(i, j).

The algorithm fills a table B of size O(nL), whose entries B[i, x] store solutions β(i, x) for sub-instances (Si,n, x) (see Fig. 4). Denote by β′p(i, x) the value min over q ∈ Yi,x of {β(q, x − L(i, q − 1))}; thus, from Equation 4.4, βp(i, x) = min {β′p(i, x), βc(i, x) + 1}. The algorithm maintains the invariant that upon reaching the entry B[i, x], the entry contains the value β′p(i, x). Before computing row i in B, the entry B[i, 0] is initialized with the value i − 1, and all entries B[i, x] for 0 < x ≤ L(i + 1, n) are initialized with the corresponding values B[i + 1, x]. This initialization is equivalent to examining the branch point q = i + 1 ∈ Yi,x in the computation of β′p(i, x) for all 0 < x ≤ L(i + 1, n) (note that for q = i + 1, β(q, x − L(i, q − 1)) = β(i + 1, x − L(i, i)) = β(i + 1, x)). The entry B[i, L(i + 1, n) + 1] is set to ∞.

Based on the invariant, upon reaching B[i, x], the entry contains the value β′p(i, x), and the value β(i, x) can be computed by taking the minimum of the current entry value and the value of βc(i, x), which is obtained from Equation 4.3 (note that the expression βc(i, x) + 1, which appears in Equation 4.4, is only needed in order to cope with the formal definition of βp(i, x), yet it is redundant in the context of the recursive computation of β(i, x)). If βc(i, x) < β′p(i, x), the sub-instance (Si,n, x) is classified as a β-OCT.

Then, the algorithm examines all branch points q for which Si,q−1 is a step sequence with an optimal folding cardinality x, i.e. all q such that β(i, x) ≤ q − 1 < β(i + 1, x) (from Observation 3). For each such branch point q, all β-OCT sub-instances (Sq,n, x′) are considered and forward-reflected to the computation of β′p(i, x + x′), by updating the value of B[i, x + x′] to be the minimum of its current value and the value β(q, x′), thus accumulating the minimum according to the definition of β′p(i, x) and guaranteeing the maintenance of the invariant. Upon termination, L(1, n) is obtained by finding the maximum x such that B[1, x] ≤ n.

As was shown for the non-step-encoding algorithm, the number of operations due to the examination of branch points in the computation of Equation 4.4 throughout the whole run of the algorithm is O(LZ). Other than that, there are O(1) operations for each sub-instance (Si,n, x), so the total running time is O(LZ + Ln) = O(LZ). The space complexity remains O(Z), as for Alg. 1. □
