Date post: 20-Aug-2020
Lecture 8: Backwards Search and FM Indices

Johannes Fischer

Topics

[Figure: word cloud of course topics. Legible entries include "compression", "retrieval", "counting" and "compressed"; the remaining words are not recoverable from the extraction.]
Burrows-Wheeler Transform

T = CACAACCAC$

Step 1: build the cyclic rotations of T. Column j holds the rotation T^(j), read top to bottom (so column 1 is T^(1) = T, column 6 is T^(6)):

    1 2 3 4 5 6 7 8 9 10

    C A C A A C C A C $
    A C A A C C A C $ C
    C A A C C A C $ C A
    A A C C A C $ C A C
    A C C A C $ C A C A
    C C A C $ C A C A A
    C A C $ C A C A A C
    A C $ C A C A A C C
    C $ C A C A A C C A
    $ C A C A A C C A C

Step 2: sort the columns (= strings) lexicographically. The first row of the sorted matrix is F (first), the last row is L (last) = T^BWT; T^(1) ends up in column 9:

    1 2 3 4 5 6 7 8 9 10

    $ A A A A C C C C C      F (first)
    C A C C C $ A A A C
    A C $ A C C A C C A
    C C C A A A C $ A C
    A A A C C C C C A $
    A C C C $ A A A C C
    C $ A C C A C C C A
    C C A A A C $ A A C
    A A C $ C C C A C A
    C C C C A A A C $ A      L (last) = T^BWT
Burrows-Wheeler Transform

T = CACAACCAC$

Sorted rotations (one per column), with the suffix array A below:

    i =   1  2  3  4  5  6  7  8  9 10

    F =   $  A  A  A  A  C  C  C  C  C
          C  A  C  C  C  $  A  A  A  C
          A  C  $  A  C  C  A  C  C  A
          C  C  C  A  A  A  C  $  A  C
          A  A  A  C  C  C  C  C  A  $
          A  C  C  C  $  A  A  A  C  C
          C  $  A  C  C  A  C  C  C  A
          C  C  A  A  A  C  $  A  A  C
          A  A  C  $  C  C  C  A  C  A
    L =   C  C  C  C  A  A  A  C  $  A     (T^BWT = L)

    A =  10  4  8  2  5  9  3  7  1  6

L[i] = T[A[i]-1]
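The relation L[i] = T[A[i]-1] can be checked directly on the example. The sketch below is an illustration only: `suffix_array` and `bwt_from_suffix_array` are names chosen here, the suffix array is built by naive sorting, and the case A[i] = 1 is assumed to wrap around to the last character of T (the sentinel).

```python
def suffix_array(text):
    """1-based suffix array by plain sorting (fine for small examples only)."""
    return sorted(range(1, len(text) + 1), key=lambda j: text[j - 1:])


def bwt_from_suffix_array(text, A):
    """L[i] = T[A[i] - 1], where T[0] is read cyclically as T[n]."""
    n = len(text)
    # 1-based position A[i] - 1 is 0-based index A[i] - 2; the modulo handles A[i] = 1.
    return "".join(text[(a - 2) % n] for a in A)


T = "CACAACCAC$"
A = suffix_array(T)
print(A)                             # [10, 4, 8, 2, 5, 9, 3, 7, 1, 6]
print(bwt_from_suffix_array(T, A))   # CCCCAAAC$A
```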

Last to Front Mapping

LF[i] = j ⇔ A[j] = A[i]-1

T = CACAACCAC$

    i  =  1  2  3  4  5  6  7  8  9 10

    F  =  $  A  A  A  A  C  C  C  C  C
          C  A  C  C  C  $  A  A  A  C
          A  C  $  A  C  C  A  C  C  A
          C  C  C  A  A  A  C  $  A  C
          A  A  A  C  C  C  C  C  A  $
          A  C  C  C  $  A  A  A  C  C
          C  $  A  C  C  A  C  C  C  A
          C  C  A  A  A  C  $  A  A  C
          A  A  C  $  C  C  C  A  C  A
    L  =  C  C  C  C  A  A  A  C  $  A     (T^BWT = L)

    LF =  6  7  8  9  2  3  4 10  1  5
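The definition LF[i] = j ⇔ A[j] = A[i]-1 translates into a lookup in the inverse suffix array. The following sketch (function name hypothetical) assumes 1-based suffix array values and treats position 0 cyclically as position n, which is what happens for the row with A[i] = 1.

```python
def lf_from_suffix_array(A):
    """LF[i] = j  <=>  A[j] = A[i] - 1 (cyclically), with 1-based rows."""
    n = len(A)
    # Inverse suffix array: text position -> row in the sorted order.
    row_of = {a: j for j, a in enumerate(A, start=1)}
    lf = []
    for a in A:
        prev = a - 1 if a > 1 else n   # position before 1 wraps to n
        lf.append(row_of[prev])
    return lf


A = [10, 4, 8, 2, 5, 9, 3, 7, 1, 6]
print(lf_from_suffix_array(A))   # [6, 7, 8, 9, 2, 3, 4, 10, 1, 5]
```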

Last to Front Mapping

• equal chars preserve order in F and L

[Figure: rows i < j both end with the character a in L; the same two a's appear in F at rows lf(i) < lf(j), i.e. in the same relative order.]
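Because equal characters keep their relative order, the k-th occurrence of a character in L corresponds to the k-th occurrence of that character in F. A standard consequence is LF(i) = C[L[i]] + occ(L[i], i), with C and occ as used for backwards search. The sketch below (function names are illustrative) computes LF from L alone and, as a sanity check, uses it to recover T from T^BWT; it assumes the unique sentinel `$` sits in row 1 of F.

```python
from collections import Counter


def lf_from_bwt(L):
    """LF[i] = C[L[i]] + occ(L[i], i), computed in one left-to-right pass."""
    counts = Counter(L)
    C, total = {}, 0
    for ch in sorted(counts):      # C[a] = number of characters smaller than a
        C[ch] = total
        total += counts[ch]
    seen = Counter()
    lf = []
    for ch in L:
        seen[ch] += 1              # seen[ch] is occ(ch, i) for the current row i
        lf.append(C[ch] + seen[ch])
    return lf


def invert_bwt(L):
    """Recover T by walking LF backwards from row 1 (the rotation starting with '$')."""
    lf = lf_from_bwt(L)
    i, out = 1, []
    for _ in range(len(L) - 1):
        out.append(L[i - 1])       # character preceding the current suffix
        i = lf[i - 1]
    return "".join(reversed(out)) + "$"


print(lf_from_bwt("CCCCAAAC$A"))   # [6, 7, 8, 9, 2, 3, 4, 10, 1, 5]
print(invert_bwt("CCCCAAAC$A"))    # CACAACCAC$
```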

Backwards Search

• C[a] := # chars smaller than a in T, for a ∈ ∑

• OCC[a,i] := # a's in L[1,i] for a ∈ ∑

• search for interval of P[1,m] in A

[Figure: backwards search step. In A, the interval [s_{i+1}, e_{i+1}] of suffixes prefixed by P_{i+1...m} is mapped to the interval [s_i, e_i] of suffixes prefixed by P_{i...m}. The interval [s_i, e_i] lies inside the block [C(P_i) + 1, C(P_i + 1)] of rows whose F-character is P_i.]

Backwards Search

[Figure: the same step with column L added. Among the rows of [s_{i+1}, e_{i+1}], those with L = P_i are mapped into [s_i, e_i]; rows with L ≠ P_i are not. The number of rows above s_{i+1} with L = P_i is occ(P_i, s_{i+1} − 1), which gives the offset of s_i from the start C(P_i) + 1 of the P_i-block in F.]

Backwards Search

This gives rise to the following, elegant algorithm for backwards search:

Algorithm 2: function backwards-search(P_{1...m})

    s ← 1; e ← n;
    for i = m ... 1 do
        s ← C(P_i) + occ(P_i, s − 1) + 1;
        e ← C(P_i) + occ(P_i, e);
        if s > e then
            return "no match";
        end
    end
    return [s, e];

The reader should compare this to the "normal" binary search algorithm in suffix arrays. Apart from matching backwards, there are two other notable deviations:

1. The suffix array A is not accessed during the search.

2. There is no need to access the input text T.

Hence, T and A can be deleted once T^bwt has been computed. It remains to show how array C and occ are implemented. Array C is actually very small and can be stored plainly using σ log n bits.¹ Because σ = o(n / log n), |C| = o(n) bits. For occ, we have several options that are explored in the rest of this chapter. This is where the different FM-Indices deviate from each other. In fact, we will see that there is a natural trade-off between time and space: using more space leads to a faster computation of the occ-values, while using less space implies a higher query time.

Theorem 14. With backwards search, we can solve the counting problem in O(m · t_occ) time, where t_occ denotes the time to answer an occ(·)-query.
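Algorithm 2 can be transcribed almost literally. The sketch below is an illustration, not a space-efficient index: `build_C`, `occ` and `backwards_search` are names chosen here, occ is answered by scanning L, rows are 1-based as in the pseudocode, and `None` stands in for "no match".

```python
from collections import Counter


def build_C(L):
    """C[a] = number of characters in the text that are smaller than a."""
    counts = Counter(L)
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    return C


def occ(L, a, i):
    """Number of a's in L[1, i] (naive scan, O(n) per query)."""
    return L[:i].count(a)


def backwards_search(P, L, C):
    """Return the 1-based suffix array interval [s, e] of P, or None."""
    s, e = 1, len(L)
    for a in reversed(P):          # i = m ... 1
        if a not in C:             # character does not occur in the text
            return None
        s = C[a] + occ(L, a, s - 1) + 1
        e = C[a] + occ(L, a, e)
        if s > e:
            return None
    return s, e


L = "CCCCAAAC$A"                   # T^BWT of T = CACAACCAC$
C = build_C(L)
print(backwards_search("CA", L, C))    # (7, 9): three occurrences
print(backwards_search("CAC", L, C))   # (8, 9): two occurrences
print(backwards_search("AAA", L, C))   # None
```

The number of occurrences is e − s + 1, so counting never touches A or T, in line with the two deviations listed above.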

3.4 First Ideas for Implementing Occ

For answering occ(c, i), there are two simple possibilities:

1. Scan L every time an occ(·)-query has to be answered. This needs no extra space, but takes O(n) time for a single occ(·)-query, leading to a total query time of O(mn) for backwards search.

2. Store all answers to occ(c, i) in a two-dimensional table. This table occupies O(nσ log n) bits of space, but allows constant-time occ(·)-queries. The total time for backwards search is then optimal, O(m).
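The two extremes can be put side by side as follows. This is a sketch with invented names (`occ_scan`, `build_occ_table`); the table is a plain dictionary of Python lists, so it only mirrors the O(nσ) entry count, not an exact bit-level layout.

```python
def occ_scan(L, a, i):
    """Possibility 1: no extra space, O(n) time per query."""
    return sum(1 for ch in L[:i] if ch == a)


def build_occ_table(L):
    """Possibility 2: table[a][i] = occ(a, i) for every character a and 0 <= i <= n."""
    alphabet = sorted(set(L))
    table = {a: [0] * (len(L) + 1) for a in alphabet}
    for i, ch in enumerate(L, start=1):
        for a in alphabet:
            table[a][i] = table[a][i - 1] + (1 if a == ch else 0)
    return table


L = "CCCCAAAC$A"
table = build_occ_table(L)
print(occ_scan(L, "C", 5))   # 4
print(table["C"][5])         # 4, answered by a single lookup
```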

For a more practical implementation between these two extremes, let us define the following:

Definition 14. Given a bit-vector B[1, n], rank_1(B, i) counts the number of 1's in B's prefix B[1, i]. Operation rank_0(B, i) is defined similarly for 0-bits.

¹ More precisely, we should say σ⌈log n⌉ bits, but we will usually omit floors and ceilings from now on.
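To make the rank operation concrete, here is a small sketch that sits between the two extremes: it stores one running count per block and scans at most one block per query. The class name `RankBitVector` and the block size are assumptions for illustration; this one-level scheme is not the constant-time n + o(n)-bit structure, only a simplified relative of it.

```python
class RankBitVector:
    """Bit-vector with sampled prefix counts; positions are 1-based."""

    def __init__(self, bits, block=4):
        self.bits = list(bits)
        self.block = block
        # samples[q] = number of 1s in the first q complete blocks.
        self.samples = [0]
        for start in range(0, len(self.bits), block):
            self.samples.append(self.samples[-1] + sum(self.bits[start:start + block]))

    def rank1(self, i):
        """Number of 1s in B[1, i]; rank1(0) is 0."""
        q, r = divmod(i, self.block)
        start = q * self.block
        return self.samples[q] + sum(self.bits[start:start + r])

    def rank0(self, i):
        """Number of 0s in B[1, i]."""
        return i - self.rank1(i)


# B marks the positions of 'C' in L = CCCCAAAC$A.
B = RankBitVector([1, 1, 1, 1, 0, 0, 0, 1, 0, 0])
print(B.rank1(5))    # 4
print(B.rank1(8))    # 5
print(B.rank0(10))   # 5
```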


Backwards Search

• Note: no use of suffix array A

• Space:

‣ C : |∑| lg n bits (small)

‣ BWT L: n lg |∑| (same as text)

‣ OCC: ??? (Q1)

• How to output text positions? (Q2)


Implementing OCC (Q1)

• rank1(B, i) = #1's in B[1,i]

• in ADS: n+o(n) bits for rank in O(1) time

• idea 1: bitmap Ba for every character a ∈ ∑

⇒n |∑| bits, O(m) search time!

• idea 2: wavelet trees:

⇒n lg |∑| bits, O(m lg |∑|) search time
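Idea 2 can be sketched as a pointer-based wavelet tree whose rank query descends one level per alphabet split. This is a simplified illustration: the class name `WaveletTree` is chosen here, each node stores plain prefix sums in place of a succinct rank structure (so the space bound on the slide is not met), and the alphabet is halved by sorted order.

```python
class WaveletTree:
    """rank(a, i) = number of a's in seq[1, i], answered top-down."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return                                  # leaf: only one character left
        mid = len(self.alphabet) // 2
        self.left_chars = set(self.alphabet[:mid])
        # prefix[i] = number of 1-bits (characters sent right) among the first i symbols.
        self.prefix = [0]
        for ch in seq:
            self.prefix.append(self.prefix[-1] + (ch not in self.left_chars))
        self.left = WaveletTree([c for c in seq if c in self.left_chars],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.left_chars],
                                 self.alphabet[mid:])

    def rank(self, a, i):
        node = self
        if a not in node.alphabet:
            return 0
        while node.left is not None:
            ones = node.prefix[i]                   # rank1 on this node's bitmap
            if a in node.left_chars:
                i, node = i - ones, node.left       # rank0 gives the position on the left
            else:
                i, node = ones, node.right
        return i


wt = WaveletTree("CCCCAAAC$A")
print(wt.rank("C", 5))    # 4
print(wt.rank("A", 10))   # 4
```

With a balanced split, each occ query visits about lg |∑| nodes, which is where the O(m lg |∑|) search time comes from.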


Finding Positions (Q2)

• A[i] = j ⇔ A[LF(i)] = j-1

• sample every s'th position in T

⇒ O((n/s) lg n) extra bits, O(s⋅tOCC) time

• say s = log_|∑| n ⇒ n lg |∑| bits, O(lg n) time

‣ tOCC: time for evaluating OCC (e.g. O(lg |∑|))

• marking sampled positions: n+o(n) bits
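The sampling idea can be sketched as follows. Several simplifications are assumed: text positions 1, 1+s, 1+2s, ... are sampled, the sampled rows are kept in a dictionary where the slide marks them in a bit-vector, and LF is precomputed as a list where a real index would evaluate it through OCC. The names `build_sampled_index` and `locate` are hypothetical.

```python
from collections import Counter


def build_sampled_index(text, s):
    """Return (LF, samples); samples maps a row i to A[i] for sampled text positions."""
    n = len(text)
    A = sorted(range(1, n + 1), key=lambda j: text[j - 1:])   # naive suffix array
    L = [text[(a - 2) % n] for a in A]                        # L[i] = T[A[i] - 1]
    counts, C, total = Counter(L), {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    seen, LF = Counter(), []
    for ch in L:
        seen[ch] += 1
        LF.append(C[ch] + seen[ch])                           # LF[i] = C[L[i]] + occ(L[i], i)
    samples = {i: a for i, a in enumerate(A, start=1) if (a - 1) % s == 0}
    return LF, samples


def locate(i, LF, samples):
    """A[i] via A[i] = j <=> A[LF(i)] = j - 1: walk LF until a sampled row is hit."""
    steps = 0
    while i not in samples:
        i = LF[i - 1]
        steps += 1
    return samples[i] + steps


LF, samples = build_sampled_index("CACAACCAC$", s=3)
print(sorted(samples.values()))                        # [1, 4, 7, 10]
print([locate(i, LF, samples) for i in range(7, 10)])  # [3, 7, 1]
```

Since position 1 is always sampled, each walk stops after at most s − 1 LF steps, matching the O(s⋅tOCC) bound per reported occurrence.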


Summary

• FM index with

‣ O(n lg |∑|) space (using wavelet trees)

‣ O(m lg |∑|) search for # occurrences

‣ O(k lg n) for outputting k occurrences (using sampled suffix array)

• can be improved, see e.g.

‣ G. Navarro, V. Mäkinen: Compressed Full-Text Indexes. ACM Comput. Surv. 39(1), 2007.
