
New tabulation and dynamic programming based techniques for sequence similarity problems

Page 1: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

1

New tabulation and dynamic programming

based techniques for sequence similarity problems

Szymon Grabowski

Sept. 2014

Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland, [email protected]

Page 2: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

2

Agenda

1. (Naïve) dynamic programming.

2. Four Russians.

3. Main LCS results.

4. Bille & Farach-Colton technique.

5. Our improvement of the BFC alg.

6. Our LCS result with sparse DP.

7. Algorithmic apps (Lev distance, LCTS, MerLCS).

8. Concl & open problems.

Page 3: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

3

Dynamic Programming (DP)

• Everybody knows…

• Quadratic cost for 2 sequences (a cell "in the middle" cannot be computed before the previous rows/columns are known).

• Speedup ideas: tabulation (aka Four Russians), bit-parallelism, sparse dynamic programming, compressing the input sequences.
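
As a baseline for the speedup ideas listed above, here is a minimal sketch of the standard quadratic DP for the LCS length. The function name `lcs_length` is illustrative and does not come from the slides.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b in O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)          # DP row for the previous symbol of a
    for x in a:
        cur = [0]                      # column 0 is always 0
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)           # extend the diagonal
            else:
                cur.append(max(prev[j], cur[j - 1]))  # best of upper / left cell
        prev = cur
    return prev[-1]
```

Each cell depends on its upper, left and upper-left neighbors, which is why a cell cannot be filled before the preceding rows and columns are known.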

Page 4: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

4

DP made (slightly) faster

If we can process blocks of $b \times b$ symbols in $O(1)$ time, we immediately obtain $O(mn / b^2)$ time. This is possible (Masek & Paterson, 1980), e.g. for a binary alphabet and $b = \log n / 4$, which gives $O(mn / \log^2 n)$ time.

The idea is to precompute the answers for all possible block inputs (short enough strings are guaranteed to repeat) and to represent the DP values in a differential manner.
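
The differential representation works because horizontally (and vertically) adjacent LCS cells differ by 0 or 1, so a border of length b fits in b bits instead of b full integers. The following is a small sketch of that encoding only, not of the full Four Russians algorithm; the helper names are assumptions made for illustration.

```python
def lcs_table(a, b):
    """Full (len(a)+1) x (len(b)+1) LCS table; rows follow a, columns follow b."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t


def diff_encode(row):
    """Encode a DP row as (first value, list of 0/1 steps between neighbors)."""
    bits = [row[j] - row[j - 1] for j in range(1, len(row))]
    assert all(bit in (0, 1) for bit in bits)  # holds for every LCS row
    return row[0], bits


def diff_decode(first, bits):
    """Inverse of diff_encode."""
    row = [first]
    for bit in bits:
        row.append(row[-1] + bit)
    return row


table = lcs_table("ABCBDAB", "BDCABA")
first, bits = diff_encode(table[-1])   # last row is [0, 1, 2, 2, 3, 4, 4]
```

Only the step bits need to be part of a lookup-table key, because the LCS recurrence is invariant under adding a constant to all border values.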

Page 5: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

5

LCS, selected results (time compl.)

Standard DP: $O(mn)$.

Tabulation (Masek & Paterson, 1980): $O(mn / \log^2 n)$ for a constant alphabet.

Tabulation (Bille & Farach-Colton, 2008): $O(mn (\log\log n)^2 / \log^2 n)$ for an integer alphabet.

Bit-parallelism (Allison & Dix, 1986, …): $O(mn / w)$, where $w \ge \log n$ is the machine word size (in bits).

Sparse DP:
• Hunt & Szymanski, 1977: $O(r \log\log n)$, where $r$ is the number of matches,
• Eppstein, Galil, Giancarlo & Italiano, 1992: $O(D \log\log(\min\{D, mn / D\}))$, where $D \le r$ is the number of dominant matches.
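
To illustrate the bit-parallel entry in the list above, here is a hedged sketch that uses the update rule of Crochemore et al. (2001), a later and simpler formulation than the one by Allison & Dix. A Python integer stands in for the sequence of machine words, and the function name is illustrative.

```python
def lcs_length_bitparallel(a, b):
    """LCS length via word-level operations; one big integer emulates ceil(len(a)/w) words."""
    m = len(a)
    if m == 0:
        return 0
    full = (1 << m) - 1
    # masks[c] has bit i set iff a[i] == c
    masks = {}
    for i, ch in enumerate(a):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    v = full                               # all ones: LCS length is 0 so far
    for ch in b:
        mc = masks.get(ch, 0)
        # V' = (V + (V & M)) | (V & ~M), truncated to m bits
        v = ((v + (v & mc)) | (v & ~mc)) & full
    return m - bin(v).count("1")           # every zero bit marks one unit of LCS length
```

Each symbol of `b` costs a constant number of operations per machine word, which is where the $O(mn / w)$ bound comes from.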

Page 6: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

6

LCS, selected results, cont’d

Sparse DP: Sakai, 2012: $O(m + \min\{D, p(m-q)\} + n)$, where $p = LCS(A, B)$, $q = LCS(A[1 \ldots m], B)$.

LZ78-compressed input: Crochemore, Landau & Ziv-Ukelson, 2003: $O(hmn / \log n)$ for a constant alphabet, where $h \le 1$ is the entropy of the inputs (for a binary alphabet).

RLE-compressed input: several results, incl. Liu, Wang & Lee, 2008: $O(\min\{nl, km\})$, where $k$, $l$ are the RLE-compressed sequence lengths.

SLP-compressed input: Gawrychowski, 2012: $O(kn \sqrt{\log(n / k)})$, where $k$ is the total length of the SLP-compressed sequences.

Page 7: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

7

The technique of Bille & Farach-Colton

For an integer alphabet of size $\sigma$, the Masek & Paterson result can easily be modified to have $O(mn \log^2 \sigma / \log^2 n)$ time. This is fine for small $\sigma$, but not if $\sigma = n^c$, $c > 0$.

Bille & Farach-Colton use alphabet mapping in superblocks. Use superblocks of size e.g. $\log^3 n \times \log^3 n$ and divide each superblock into blocks of size $(\log n / \log\log n) \times (\log n / \log\log n)$.

Page 8: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

8

BFC, cont’d

That is, for the current text snippet from A of length $\log^3 n$, extract up to $\log^3 n$ distinct symbols and encode the current snippet of A and the current snippet of B accordingly (one extra symbol for "something else" is needed in the snippet of B).

Clearly, $O(\log\log n)$ bits per encoded symbol are enough, the mapping times are negligible overall (a BST can be used, with a log(superblock)-factor per symbol), and the total time is $O(mn (\log\log n)^2 / \log^2 n)$.
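
The alphabet mapping inside one superblock can be sketched as follows. This is an illustration under assumptions: a Python dict replaces the BST mentioned on the slide, and the function name `remap_superblock` is hypothetical.

```python
def remap_superblock(a_snip, b_snip):
    """Re-encode two snippets over a small local alphabet.

    Distinct symbols of a_snip get codes 0..d-1 in order of first occurrence.
    Every symbol of b_snip that does not occur in a_snip gets the single
    extra code d ("something else"), since it can never match within this superblock.
    """
    codes = {}
    for s in a_snip:
        codes.setdefault(s, len(codes))
    other = len(codes)
    a_enc = [codes[s] for s in a_snip]
    b_enc = [codes.get(s, other) for s in b_snip]
    return a_enc, b_enc


a_enc, b_enc = remap_superblock([1000, 7, 1000, 42], [7, 5, 42, 99])
# a_enc == [0, 1, 0, 2], b_enc == [1, 3, 2, 3]
```

The set of matching cell pairs between the two snippets is unchanged by the mapping, while each encoded symbol needs only about $\log_2(d + 1)$ bits, where $d$ is the number of distinct symbols in the snippet of A.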

Page 9: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

9

BFC, alphabet mapping example

Blocks of size $3 \times 3$, superblocks of size $9 \times 9$.

Page 10: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

10

Our technique (Alg 1)

Use the BFC alphabet mapping in superblocks.

But use many LUTs (instead of 1), yet with modified input. One LUT per horizontal stripe (of length n).

The LUT input:
• snippet of A,
• left block border (1 bit per cell),
• upper block border (1 bit per cell).

No snippet of B as part of the input, as it is fixed for a given LUT! (Re-use LUTs for repeating snippets of B.)

Thanks to this, we work on rectangular (not square), "portrait"-oriented blocks of size $(\log n / \log\log n) \times \log n$.

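The per-stripe lookup idea can be sketched as below. This is a simplified illustration, not the algorithm from the slides: the tables are filled lazily by memoization instead of being precomputed, the alphabet mapping is omitted, the inputs are assumed to be Python strings, and the block sizes `w` and `h` are arbitrary small values. All names are hypothetical.

```python
def block_step(a_snip, b_snip, top, left):
    """Compute one block from its diff-encoded upper and left borders.

    top[j] and left[i] are the 0/1 steps along the upper and left borders.
    Returns the 0/1 steps along the bottom and right borders.
    """
    prev = [0]                                   # values relative to the top-left corner
    for bit in top:
        prev.append(prev[-1] + bit)
    right = []
    for i, y in enumerate(b_snip):
        cur = [prev[0] + left[i]]
        for j, x in enumerate(a_snip, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        right.append(cur[-1] - prev[-1])
        prev = cur
    bottom = tuple(prev[j + 1] - prev[j] for j in range(len(a_snip)))
    return bottom, tuple(right)


def lcs_length_striped(a, b, w=3, h=4):
    """LCS length computed block by block, with one table per distinct snippet of b."""
    starts = range(0, len(a), w)
    tops = [(0,) * len(a[k:k + w]) for k in starts]   # row 0 of the matrix is all zeros
    luts = {}                                         # snippet of b -> its own table
    total = 0
    for r in range(0, len(b), h):
        b_snip = b[r:r + h]
        lut = luts.setdefault(b_snip, {})             # re-used for repeating snippets of b
        left = (0,) * len(b_snip)                     # column 0 of the matrix is all zeros
        for idx, k in enumerate(starts):
            key = (a[k:k + w], tops[idx], left)       # no snippet of b in the key
            if key not in lut:
                lut[key] = block_step(key[0], b_snip, key[1], key[2])
            tops[idx], left = lut[key]
        total += sum(left)                            # steps down the last column
    return total
```

Because the snippet of B is fixed per table, the key consists only of the snippet of A and the two borders, which is what allows the taller "portrait" blocks.
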
Page 11: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

11

One horizontal stripe (4 blocks of $5 \times 5$)

Red arrows: explicitly stored LCS values; black arrows: diff-encoded LCS values.

05550 and 34023: text snippets encoded with ref to a superblock (not shown).

The diagonally shaded cells are the block output cells.

Figure axes: seq A and seq B.

Page 12: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

12

LCS, first result (Alg 1)


Page 13: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

13

Output-dependent algorithm

We work in blocks of $(b+1) \times (b+1)$, but divide them into sparse ones, which have $\le K$ matches, and dense ones, with $> K$ matches.

Key observation: knowing the top row and leftmost column of a block, plus the locations of all matches in it, is enough to compute the block.

That is, the text snippets are not needed!
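
The key observation can be checked with a short sketch that fills a block from its top row, its leftmost column and the set of match positions, without ever looking at the symbols. The function name and the choice of plain (not diff-encoded) border values are assumptions made for readability; the sketch still takes time quadratic in the block side, whereas the slides replace this computation with a table lookup.

```python
def block_from_matches(top, left, matches):
    """Fill a DP block given only its borders and match positions.

    top     -- values of row 0 (length w + 1)
    left    -- values of column 0 (length h + 1), with left[0] == top[0]
    matches -- set of (i, j), 1 <= i <= h, 1 <= j <= w, where the symbols are equal
    """
    h, w = len(left) - 1, len(top) - 1
    d = [list(top)] + [[left[i]] + [0] * w for i in range(1, h + 1)]
    for i in range(1, h + 1):
        for j in range(1, w + 1):
            if (i, j) in matches:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d


x, y = "ABCBDAB", "BDCABA"
matches = {(i + 1, j + 1) for i in range(len(x)) for j in range(len(y)) if x[i] == y[j]}
block = block_from_matches([0] * (len(y) + 1), [0] * (len(x) + 1), matches)
```

Here the strings are used only to build `matches`; the block computation itself never touches them.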

Page 14: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

14

Where sparse DP meets tabulation

A sparse block input:

• top row: b bits (diff encoding),

• leftmost column: b bits (diff encoding),

• match locations: each in $\log(b^2)$ bits, totalling $O(K \log b)$ bits.

(Output: even less.)

Hence, if K log b + b = O(log n) (with a small enough constant), we can use a LUT for all sparse blocks and

compute each of them in constant time.
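
The bit budget behind this condition can be written out as follows. The symbols $\ell$ (key length in bits) and $c$ (the "small enough constant") are introduced here for illustration only.

```latex
\[
\ell \;=\; \underbrace{b}_{\text{top row}}
      \;+\; \underbrace{b}_{\text{leftmost column}}
      \;+\; \underbrace{K \lceil \log_2 b^2 \rceil}_{\text{match locations}}
      \;=\; O(b + K \log b) \ \text{bits},
\]
\[
\ell \le c \log_2 n
\quad\Longrightarrow\quad
\#\{\text{LUT entries}\} \;\le\; 2^{\ell} \;\le\; n^{c}.
\]
```

With $c < 1$ the table has a sublinear number of entries, so building it does not dominate the total cost as long as each entry is computed in time polynomial in $b$.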

Page 15: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

15

Dense blocks

Dense blocks are partitioned into smaller blocks, which are then processed by our technique from Alg 1.

The smaller block sizes are: $(\log n / \log\log n) \times b$.

Page 16: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

16

Choosing the parameters

$b = O(\log n)$ (otherwise the LUT build costs will dominate), but also $b = \omega(\log n / \sqrt{\log\log n})$ (otherwise this algorithm will never beat Alg 1).

This implies $K = \Theta(\log n / \log\log n)$, with an appropriate constant.

If the fraction of dense blocks in the matrix is $0 < f_d \le 1$, then the total time complexity (w/o preprocessing!) is:

For a small enough $r$ (= total number of matches in the matrix) we may have $O(mn / \log^2 n)$ from the above formula; alas, in the preprocessing we have to find and encode all matches in all sparse blocks, in $O(n + r)$ time.

Page 17: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

17

LCS, second result (Alg 2)

Page 18: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

18

Alg 2 niche

Considering the results of:

• Eppstein et al., 1992,

• Sakai, 2012,

• Alg 1,

we obtain the following niche in which Alg 2 is the winner:

and

and

Page 19: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

19

Simple generalization of Th. 1 and 2

Page 20: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

20

Longest common transposition-invariant subsequence (LCTS)

LCTS = LCS in the best key transposition (in music, transposition is shifting a sequence of notes

(pitches) up or down by a constant interval).
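
A direct, unoptimized reading of this definition is sketched below: run an LCS computation once for every transposition that produces at least one match and keep the best result. The sequences are assumed to be lists of integer pitches, and both function names are illustrative.

```python
def lcs_length(a, b):
    """Standard quadratic LCS length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcts_length(a, b):
    """Longest common transposition-invariant subsequence length (naive)."""
    shifts = {y - x for x in a for y in b}   # only these transpositions yield any match
    best = 0
    for t in shifts:
        best = max(best, lcs_length([x + t for x in a], b))
    return best
```

For example, `lcts_length([60, 62, 64, 65], [62, 64, 66, 67])` returns 4, because shifting the first melody up by 2 makes the two sequences identical.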

Page 21: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

21

LCTS, known results and a new one

Navarro, Grabowski, Mäkinen, Deorowicz, 2005; Deorowicz, 2006

apply BFC technique for each transposition

New algorithm: let us call the transpositions with at least mn log log n / matches dense, and the others sparse.

Apply Alg 1 to the dense transpositions and Alg 2 to the sparse ones.

Overall time: for

Page 22: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

22

Merged LCS (MerLCS)

A bioinformatics problem on 3 sequences: given sequences A, B and P, return a longest sequence T that is a subsequence of P and can be split into two subsequences T' and T'' such that T' is a subsequence of A and T'' is a subsequence of B.

|A| = n, |B| = m, |P| = u.

Known results:
• Peng, Yang, Huang, Tseng & Hor, 2010: $O(lmn)$ time, where $l \le n$ is the result length.
• Deorowicz & Danek, 2013: $O(\lceil u / w \rceil mn \log w)$ time.

Page 23: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

23

Our result for MerLCS

DP matrix property: Deorowicz and Danek noticed that M(i, j, k) is equal to or larger by 1 than any of its three neighbors: M(i – 1, j, k), M(i, j – 1, k), M(i, j, k – 1).
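
For reference, the plain cubic DP that defines this matrix can be sketched as follows. The recurrence is the standard one for MerLCS; the function name, the loop order (P outermost) and the two-layer storage are choices made here for brevity and are not taken from the slides.

```python
def merlcs_length(a, b, p):
    """Merged LCS length of a, b with respect to p in O(len(a) * len(b) * len(p)) time."""
    n, m = len(a), len(b)
    prev = [[0] * (m + 1) for _ in range(n + 1)]      # layer for the prefix p[:k-1]
    for sym in p:
        cur = [[0] * (m + 1) for _ in range(n + 1)]   # layer for the prefix p[:k]
        for i in range(n + 1):
            for j in range(m + 1):
                best = prev[i][j]                     # skip the last symbol of p
                if i > 0:
                    best = max(best, cur[i - 1][j])   # skip the last symbol of a
                    if a[i - 1] == sym:
                        best = max(best, prev[i - 1][j] + 1)
                if j > 0:
                    best = max(best, cur[i][j - 1])   # skip the last symbol of b
                    if b[j - 1] == sym:
                        best = max(best, prev[i][j - 1] + 1)
                cur[i][j] = best
        prev = cur
    return prev[n][m]
```

For example, `merlcs_length("ab", "cd", "acbd")` returns 4, while `merlcs_length("ba", "cd", "acbd")` returns 3 because the order of `a` and `b` in the first sequence conflicts with their order in P.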

We generalize our result on 2 sequences to 3 sequences (input: 3 text snippets plus 3 two-dimensional walls instead of one-dimensional borders!) to obtain $O(mnu / \log^{3/2} n)$ for MerLCS, if $u = \Omega(n^c)$ for some $c > 0$.

Page 24: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

24

Conclusions


• Tabulation (= Four Russians) is a classic DP-boosting technique. Interestingly, we managed to (slightly) improve its application to the LCS / edit distance problem.

• Applying tabulation may be even better for a sparse matrix.

• These techniques work also for a few other problems than LCS and edit distance.

Page 25: New tabulation and  dynamic programming  based techniques for  sequence similarity problems

25

Open problems

• Can we improve the tabulation based result on compressible sequences?

• Can we adapt our technique(s) to problems in which the conditions from Lemma 3 (or Lemma 7, involving 3 sequences) are relaxed, that is, consecutive DP cells may (sometimes) differ by more than a constant?

Example problem: SEQ-EC-LCS (Chen & Chao, 2011; Deorowicz & Grabowski, 2014).

