New tabulation and dynamic programming based techniques for sequence similarity problems

Szymon Grabowski

Sept. 2014

Lodz University of Technology, Institute of Applied Computer Science, Łódź

Agenda

1. (Naïve) dynamic programming.

2. Four Russians.

3. Main LCS results.

4. Bille & Farach-Colton technique.

5. Our improvement of the BFC alg.

6. Our LCS result with sparse DP.

7. Algorithmic apps (Lev distance, LCTS, MerLCS).

8. Concl & open problems.

Dynamic Programming (DP)

• Everybody knows…

• Quadratic cost for 2 sequences (can’t compute a cell "in the middle" before knowing the previous rows/cols).

• Speedup ideas: tabulation (aka Four Russians), bit-parallelism, sparse dynamic programming, compressing the input sequences.
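
As a baseline for the speedups listed above, here is a minimal sketch of the textbook quadratic DP for the LCS length. The function name `lcs_length` is illustrative and does not come from the slides.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b, in O(|a| * |b|) time."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]; row 0 and column 0 stay zero.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                # A match extends the best solution of the two shorter prefixes.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

Every cell depends on its upper, left and upper-left neighbors, which is why a cell cannot be computed before the preceding rows and columns are known.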

DP made (slightly) faster

If we can process blocks of b × b symbols in O(1) time, we immediately obtain O(mn / b^2) time. We can do it (Masek & Paterson, 1980), e.g., for a binary alphabet and b = log n / 4, which gives O(mn / log^2 n) time.

The idea is to precompute the answers for all possible block inputs (short enough strings are guaranteed to repeat) and to represent the DP values in a differential manner.
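
The tabulation idea can be sketched as follows. This is a simplified illustration, not the Masek & Paterson procedure itself: the lookup table is filled lazily through memoization instead of being precomputed for all inputs, and the block size `t` and the function names are assumptions made for the example. Block borders are stored as 0/1 differences between adjacent cells.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def solve_block(a_snip, b_snip, top, left):
    """Map a block input to its output borders (all borders are 0/1 differences)."""
    h, w = len(a_snip), len(b_snip)
    cell = [[0] * (w + 1) for _ in range(h + 1)]
    for j in range(w):  # rebuild the top row relative to the block corner
        cell[0][j + 1] = cell[0][j] + top[j]
    for i in range(h):  # rebuild the leftmost column relative to the corner
        cell[i + 1][0] = cell[i][0] + left[i]
    for i in range(1, h + 1):
        for j in range(1, w + 1):
            if a_snip[i - 1] == b_snip[j - 1]:
                cell[i][j] = cell[i - 1][j - 1] + 1
            else:
                cell[i][j] = max(cell[i - 1][j], cell[i][j - 1])
    bottom = tuple(cell[h][j + 1] - cell[h][j] for j in range(w))
    right = tuple(cell[i + 1][w] - cell[i][w] for i in range(h))
    return bottom, right


def lcs_tabulated(a, b, t=3):
    """LCS length of strings a and b, processed in blocks of (at most) t x t cells."""
    starts = range(0, len(b), t)
    # Row 0 of the DP matrix is all zeros, so its differences are zeros too.
    tops = [(0,) * len(b[j:j + t]) for j in starts]
    for i in range(0, len(a), t):
        a_snip = a[i:i + t]
        left = (0,) * len(a_snip)  # column 0 of the DP matrix is all zeros
        for idx, j in enumerate(starts):
            tops[idx], left = solve_block(a_snip, b[j:j + t], tops[idx], left)
    # The bottom-right cell equals the sum of the differences along the last row.
    return sum(sum(t_row) for t_row in tops)
```

Repeated block inputs hit the cache, which plays the role of the precomputed table in this sketch.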

LCS, selected results (time compl.)

Standard DP: O(mn).

Tabulation (Masek & Paterson, 1980): O(mn / log^2 n) for a constant alphabet.

Tabulation (Bille & Farach-Colton, 2008): O(mn (log log n)^2 / log^2 n) for an integer alphabet.

Bit-parallelism (Allison & Dix, 1986, …): O(mn / w), where w ≥ log n is the machine word size (in bits).

Sparse DP:

• Hunt & Szymanski, 1977: O(r log log n), where r is the # of matches,

• Eppstein, Galil, Giancarlo & Italiano, 1992: O(D log log(min{D, mn / D})), where D ≤ r is the # of dominant matches.
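
To illustrate the bit-parallel entry in the list above, here is a minimal sketch. It uses a bit-vector recurrence that is common in the later bit-parallel LCS literature; it is not claimed to be the exact Allison & Dix formulation. Python integers have arbitrary precision, so the machine word of size w is only simulated here, and the function name is illustrative.

```python
def lcs_bitparallel(a, b):
    """LCS length of a and b, updating a whole DP column with a few word operations."""
    m = len(a)
    mask = (1 << m) - 1
    # match[c] has bit i set exactly when a[i] == c.
    match = {}
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)
    v = mask  # all ones: no position of a contributes to the LCS yet
    for c in b:
        u = v & match.get(c, 0)
        # One addition, one subtraction and one OR update all m cells at once.
        v = ((v + u) | (v - u)) & mask
    # Each zero bit of v marks one unit of LCS length.
    return m - bin(v).count("1")
```

With real machine words, each symbol of b costs about m / w word operations, which matches the O(mn / w) bound quoted above.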

LCS, selected results, cont’d

Sparse DP: Sakai, 2012: O(mσ + min{Dσ, p(m − q)} + n), where σ is the alphabet size, p = LCS(A, B), q = LCS(A[1…m], B).

LZ78-compressed input: Crochemore, Landau & Ziv-Ukelson, 2003: O(hmn / log n) for a constant alphabet, where h ≤ 1 is the entropy of the inputs (for a binary alph.).

RLE-compressed input: several results, incl. Liu, Wang & Lee, 2008: O(min{nl, km}), where k, l are the RLE-compressed seq lengths.

SLP-compressed input: Gawrychowski, 2012: O(kn sqrt(log(n / k))), where k is the total length of the SLP-compressed sequences.
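
The sparse DP results listed above depend on the number of matches r rather than on mn. A minimal sketch in the spirit of Hunt & Szymanski follows; it uses binary search over a threshold array, so it runs in O(r log n + n) time rather than the O(r log log n) bound quoted above. The function name and the returned pair are illustrative choices.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_sparse(a, b):
    """Return (LCS length, number of matches r), visiting only matching cells."""
    occ = defaultdict(list)  # occ[c] = ascending positions of symbol c in b
    for j, c in enumerate(b):
        occ[c].append(j)
    # thresh[k] = smallest end position in b of a common subsequence of length k + 1
    thresh = []
    r = 0
    for c in a:
        # Descending order of j keeps two matches from the same row out of one chain.
        for j in reversed(occ.get(c, ())):
            r += 1
            k = bisect_left(thresh, j)
            if k == len(thresh):
                thresh.append(j)
            else:
                thresh[k] = j
    return len(thresh), r
```

For inputs with few matches, for example over a large alphabet, r is far below mn and only those r cells are touched.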

The technique of Bille & Farach-Colton

For an integer alphabet of size σ, the Masek & Paterson result can easily be modified to run in O(mn log^2 σ / log^2 n) time. This is fine for small σ, but not if σ = n^c, c > 0.

Bille & Farach-Colton use alphabet mapping in superblocks. Use superblocks of size, e.g., log^3 n × log^3 n and divide each superblock into blocks of size (log n / log log n) × (log n / log log n).

BFC, cont’d

That is, for the current text snippet from A of length log^3 n, extract up to log^3 n distinct symbols and encode the current snippet of A and the current snippet of B accordingly (one extra symbol for "something else" is needed in snippet B).

Clearly, O(log log n) bits per encoded symbol are enough, the mapping time is negligible overall (a BST can be used, with a log(superblock)-factor per symbol), and the total time is O(mn (log log n)^2 / log^2 n).
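
The alphabet mapping inside one superblock can be sketched as follows. A Python dict stands in for the BST mentioned above, and the function name is an assumption made for the example.

```python
def remap_superblock(snippet_a, snippet_b):
    """Re-encode two snippets over a small alphabet local to one superblock."""
    codes = {}
    for s in snippet_a:
        # Distinct symbols of A get consecutive codes 0, 1, 2, ...
        codes.setdefault(s, len(codes))
    other = len(codes)  # the single extra code for symbols of B that A lacks
    enc_a = [codes[s] for s in snippet_a]
    enc_b = [codes.get(s, other) for s in snippet_b]
    return enc_a, enc_b
```

The encoding preserves every match and mismatch between the two snippets, so the LCS of the encoded snippets equals the LCS of the originals. For example, `remap_superblock([1000, 7, 1000, 42], [7, 99, 42, 5])` yields `[0, 1, 0, 2]` and `[1, 3, 2, 3]`.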

BFC, alphabet mapping example

Blocks of size 3 × 3, superblocks of size 9 × 9.

Our technique (Alg 1)

Use the BFC alphabet mapping in superblocks.

But use many LUTs (instead of 1), yet with modified input. One LUT per horizontal stripe (of length n).

The LUT input:

• snippet of A,

• left block border (1 bit per cell),

• upper block border (1 bit per cell).

No snippet of B as part of the input, as it is fixed for a given LUT!
(Re-use LUTs for repeating snippets of B.)

Thanks to this, we work on rectangular (not square) "portrait"-oriented blocks of size (log n / log log n) × (log n).

One horizontal stripe (4 blocks of 5 × 5)

[Figure: one horizontal stripe of the DP matrix, with axes labeled seq A and seq B.]

Red arrows: explicitly stored LCS values; black arrows: diff-encoded LCS values.

05550 and 34023: text snippets encoded with reference to a superblock (not shown).

The diagonally shaded cells are the block output cells.
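
The per-stripe LUT idea of Alg 1 can be sketched as follows. This is a simplified illustration under several assumptions: LUT entries are filled on demand instead of being precomputed, snippets are used raw (without the superblock alphabet mapping), and the block dimensions `width` and `height` are small constants instead of log n / log log n and log n. All names are illustrative.

```python
def solve_block(a_snip, b_snip, top, left):
    """Compute the output borders of one block; borders are 0/1 differences."""
    h, w = len(b_snip), len(a_snip)
    cell = [[0] * (w + 1) for _ in range(h + 1)]
    for j in range(w):
        cell[0][j + 1] = cell[0][j] + top[j]
    for i in range(h):
        cell[i + 1][0] = cell[i][0] + left[i]
    for i in range(1, h + 1):
        for j in range(1, w + 1):
            if b_snip[i - 1] == a_snip[j - 1]:
                cell[i][j] = cell[i - 1][j - 1] + 1
            else:
                cell[i][j] = max(cell[i - 1][j], cell[i][j - 1])
    bottom = tuple(cell[h][j + 1] - cell[h][j] for j in range(w))
    right = tuple(cell[i + 1][w] - cell[i][w] for i in range(h))
    return bottom, right


def lcs_stripe_luts(a, b, width=2, height=4):
    """Return (LCS length, number of LUTs built); A runs horizontally, B vertically."""
    luts = {}  # one LUT per distinct snippet of B, shared by repeating snippets
    starts = range(0, len(a), width)
    tops = [(0,) * len(a[j:j + width]) for j in starts]
    for i in range(0, len(b), height):
        b_snip = b[i:i + height]
        lut = luts.setdefault(b_snip, {})
        left = (0,) * len(b_snip)
        for idx, j in enumerate(starts):
            # The key holds no snippet of B: it is fixed for the whole stripe.
            key = (a[j:j + width], tops[idx], left)
            if key not in lut:
                lut[key] = solve_block(key[0], b_snip, key[1], key[2])
            tops[idx], left = lut[key]
    return sum(sum(row) for row in tops), len(luts)
```

Because the snippet of B is not part of the key, the key is shorter, which is what allows the taller "portrait" blocks.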

LCS, first result (Alg 1)

Output-dependent algorithm

We work in blocks of (b+1) × (b+1), but divide them into sparse ones, which have ≤ K matches, and dense ones, with > K matches.

Key observation: knowing the top row and leftmost column of the block, plus the locations of all matches in it, is enough to compute this block.

That is, the text snippets are not needed!
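
The key observation can be checked with a small sketch: a block is recomputed from its top row, its leftmost column and its match locations only, and compared against the full DP matrix. The helper `full_dp` and the function names are assumptions made for this illustration.

```python
def full_dp(a, b):
    """Full LCS matrix, used here only as a reference."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp


def solve_block_from_matches(top, left, matches):
    """Fill a (b+1) x (b+1) block without looking at any text snippet.

    top     -- the b+1 cell values of the block's top row
    left    -- the b+1 cell values of the leftmost column (left[0] == top[0])
    matches -- set of (i, j), 1 <= i, j <= b, where the two symbols match
    """
    b = len(top) - 1
    cell = [[0] * (b + 1) for _ in range(b + 1)]
    cell[0] = list(top)
    for i in range(b + 1):
        cell[i][0] = left[i]
    for i in range(1, b + 1):
        for j in range(1, b + 1):
            if (i, j) in matches:
                cell[i][j] = cell[i - 1][j - 1] + 1
            else:
                cell[i][j] = max(cell[i - 1][j], cell[i][j - 1])
    return cell
```

Absolute cell values are used here for readability; the difference encoding would replace them with 0/1 bits.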

Where sparse DP meets tabulation

A sparse block input:

• top row: b bits (diff encoding),

• leftmost column: b bits (diff encoding),

• match locations: each in log(b^2) bits, totalling O(K log b) bits.

(Output: even less.)

Hence, if K log b + b = O(log n) (with a small enough constant), we can use a LUT for all sparse blocks and compute each of them in constant time.
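
The bit budget above can be made concrete with a short calculation. This is a rough sketch that reserves exactly K match slots and ignores any bits needed to store the actual number of matches; the function names are illustrative.

```python
def sparse_block_input_bits(b, K):
    """Upper bound on the bits describing a sparse block with at most K matches."""
    border_bits = 2 * b  # top row and leftmost column, 1 bit per cell
    bits_per_match = (b * b - 1).bit_length()  # index of one cell among b * b
    return border_bits + K * bits_per_match


def lut_entries(b, K):
    """Number of distinct inputs a LUT over such blocks has to cover."""
    return 2 ** sparse_block_input_bits(b, K)
```

For b = 8 and K = 3 the input takes 16 + 3 · 6 = 34 bits. The LUT stays small only if this total is at most a small constant times log n.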

Dense blocks

Dense blocks are partitioned into smaller blocks, which are then processed by our technique from Alg 1.

The smaller block sizes are: (log n / log log n) × b.

Choosing the parameters

b = O(log n) (otherwise the LUT build costs will dominate), but also b = ω(log n / sqrt(log log n)) (otherwise this alg will never beat Alg 1).

This implies K = Θ(log n / log log n), with an appropriate constant.

If the fraction of dense blocks in the matrix is 0 < f_d ≤ 1, then the total time complexity (w/o preprocessing!) is:

For a small enough r (= total # of matches in the matrix) we may have O(mn / log^2 n) from the above formula; alas, in the preprocessing we have to find and encode all matches in all sparse blocks, in O(n + r) time.

LCS, second result (Alg 2)

Alg 2 niche

Considering the results of:

• Eppstein et al., 1992,

• Sakai, 2012,

• Alg 1,

we obtain the following niche in which Alg 2 is the winner:

[Three conditions joined by "and"; the formulas are missing from this copy.]

Simple generalization of Th. 1 and 2

Longest common transposition-invariant subsequence (LCTS)

LCTS = LCS in the best key transposition (in music, transposition is shifting a sequence of notes (pitches) up or down by a constant interval).
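
The definition above can be illustrated with a naive sketch that tries every transposition producing at least one match and runs the plain quadratic LCS for each. This is only a baseline, not one of the algorithms discussed in these slides; sequences are assumed to be lists of integer pitches, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Plain quadratic LCS length, kept in two rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcts_length(a, b):
    """Return (best LCS length, transposition t) over all shifts a[i] + t."""
    best, best_t = 0, 0
    # Only shifts t = b[j] - a[i] can create a match, so other values are skipped.
    for t in sorted({y - x for x in a for y in b}):
        cur = lcs_length([x + t for x in a], b)
        if cur > best:
            best, best_t = cur, t
    return best, best_t
```

For example, `lcts_length([60, 62, 64, 65], [67, 69, 71, 72])` returns `(4, 7)`: the second sequence is the first one shifted up by 7.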

LCTS, known results and a new one

Navarro, Grabowski, Mäkinen, Deorowicz, 2005; Deorowicz, 2006

apply BFC technique for each transposition

New algorithm: let us call the transpositions with at least mn log log n / σ matches dense, and the others sparse.

Apply Alg 1 to the dense transpositions and Alg 2 to the sparse ones.

Overall time: for

Merged LCS (MerLCS)

A bioinformatics problem on 3 sequences: given sequences A, B and P, return a longest sequence T that is a subsequence of P and can be split into two subsequences T’ and T’’ such that T’ is a subsequence of A and T’’ is a subsequence of B.

|A| = n, |B| = m, |P| = u.

Known results:

• Peng, Yang, Huang, Tseng & Hor, 2010: O(lmn) time, where l ≤ n is the result length.

• Deorowicz & Danek, 2013: O(⌈u / w⌉ mn log w) time.

Our result for MerLCS

DP matrix property: Deorowicz and Danek noticed that M(i, j, k) is equal to or larger by 1 than any of the three neighbors: M(i – 1, j, k), M(i, j – 1, k), M(i, j, k – 1).

We generalize our result on 2 sequences to 3 sequences (input: 3 text snippets plus 3 two-dimensional walls instead of one-dimensional borders!) to obtain O(mnu / log^(3/2) n) for MerLCS, if u = Ω(n^c) for some c > 0.
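
The DP matrix property can be observed on a plain cubic DP for MerLCS. The sketch below is a straightforward O(nmu) recurrence written for illustration; it is not the algorithm of Peng et al. or of Deorowicz & Danek, and the function name is an assumption.

```python
def merged_lcs_table(a, b, p):
    """M[i][j][k] = MerLCS length for the prefixes a[:i], b[:j] and p[:k]."""
    n, m, u = len(a), len(b), len(p)
    M = [[[0] * (u + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(1, u + 1):
                best = M[i][j][k - 1]  # p[k-1] is not used
                if i > 0:
                    best = max(best, M[i - 1][j][k])  # a[i-1] is not used
                    if a[i - 1] == p[k - 1]:
                        # p[k-1] is the last symbol of T and comes from A
                        best = max(best, M[i - 1][j][k - 1] + 1)
                if j > 0:
                    best = max(best, M[i][j - 1][k])  # b[j-1] is not used
                    if b[j - 1] == p[k - 1]:
                        # p[k-1] is the last symbol of T and comes from B
                        best = max(best, M[i][j - 1][k - 1] + 1)
                M[i][j][k] = best
    return M
```

The answer is `M[n][m][u]`. Each cell exceeds any of its three predecessors by 0 or 1, which is what makes 1-bit difference encoding of the block walls possible.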

Conclusions

• Tabulation (= Four Russians) is a classic DP-boosting technique. Interestingly, we managed to (slightly) improve its application to the LCS / edit distance problem.

• Applying tabulation may be even better for a sparse matrix.

• These techniques also work for a few problems other than LCS and edit distance.

Open problems

• Can we improve the tabulation based result on compressible sequences?

• Can we adapt our technique(s) to problems in which the conditions from Lemma 3 (or Lemma 7, involving 3 sequences) are relaxed, that is, consecutive DP cells may (sometimes) differ by more than a constant?

Exemplary problem: SEQ-EC-LCS (Chen & Chao, 2011; Deorowicz & Grabowski, 2014).
