Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts

Sunho Lee and Kunsoo Park
Seoul National Univ.

Contents

Introduction
– Rank/select problem
– Relations to compressed full-text indices

Dynamic rank-select structure

Extensions of the structure
– For a large alphabet text
– For a run-length encoded text

Rank-select problem

For a given text T over a σ-size alphabet, our structures support:
– rankT(c, i): gives the number of occurrences of character c up to position i in T
– selectT(c, k): gives the position of the k-th c

E.g. T = acabbc
– rankT(‘a’, 5) = 2
– selectT(‘a’, 2) = 3
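The two queries can be illustrated with a naive linear-time sketch. It only pins down the 1-based semantics used on this slide and is not the structure proposed in the talk; the function names are chosen here for illustration.

```python
def rank(text: str, c: str, i: int) -> int:
    """Number of occurrences of c in text[1..i] (1-based, inclusive)."""
    return text[:i].count(c)


def select(text: str, c: str, k: int) -> int:
    """1-based position of the k-th occurrence of c, or 0 if there is none."""
    pos = -1
    for _ in range(k):
        pos = text.find(c, pos + 1)
        if pos < 0:
            return 0
    return pos + 1


T = "acabbc"
print(rank(T, "a", 5))    # 2
print(select(T, "a", 2))  # 3
```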

Rank-select problem

Our structures support additional update operations:
– insertT(c, i): inserts character c between T[i] and T[i+1]
– deleteT(i): deletes T[i] from T

E.g. T = acabbc → aababc
– rankT(‘a’, 5) = 2 → rankT(‘a’, 5) = 3
– selectT(‘a’, 2) = 3 → selectT(‘a’, 2) = 2
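A minimal sketch of the update semantics, using a plain Python list in place of the real structure. The slide does not say which updates turn acabbc into aababc; the sequence below (delete position 2, then insert ‘a’ after position 3) is one possibility, assumed here for illustration. The class name is hypothetical.

```python
class NaiveDynamicText:
    """List-backed stand-in; every operation costs O(n) here."""

    def __init__(self, text: str):
        self.chars = list(text)

    def insert(self, c: str, i: int) -> None:
        # Insert c between T[i] and T[i+1] (1-based), i.e. at list index i.
        self.chars.insert(i, c)

    def delete(self, i: int) -> None:
        # Delete T[i] (1-based).
        del self.chars[i - 1]

    def rank(self, c: str, i: int) -> int:
        return self.chars[:i].count(c)

    def select(self, c: str, k: int) -> int:
        seen = 0
        for pos, ch in enumerate(self.chars, start=1):
            seen += ch == c
            if seen == k:
                return pos
        return 0


t = NaiveDynamicText("acabbc")
t.delete(2)        # acabbc -> aabbc
t.insert("a", 3)   # aabbc  -> aababc
print("".join(t.chars), t.rank("a", 5), t.select("a", 2))
```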

Why rank-select problem?

In compressed full-text indices
– Rank-select structures are built on the Burrows-Wheeler Transform (BWT)
– Rank: backward search (Ferragina & Manzini)
– Select: Psi-function in CSA (Grossi & Vitter)

Dynamic BWT
– Index for a collection of texts (Chan, Hon & Lam)
– Add or remove a text from the collection

Example of select on BWT

T = mississippi$

 i  BWT  Psi  SA  Suffix
 1  i      6  12  $
 2  p      1  11  i$
 3  s      8   8  ippi$
 4  s     11   5  issippi$
 5  m     12   2  ississippi$
 6  $      5   1  mississippi$
 7  p      2  10  pi$
 8  i      7   9  ppi$
 9  s      3   7  sippi$
10  s      4   4  sissippi$
11  i      9   6  ssippi$
12  i     10   3  ssissippi$

Psi function
– Psi[i]: the order of the suffix that starts at the next text position, SA[i] + 1
– E.g. Psi[4] = 11, the order of ‘ssippi$’

Duality between Psi-function and BWT

(Hon, Sadakane & Sung)
– BWT[i] = T[SA[i] - 1]
– Psi[i] = selectBWT(C[i], i - F[C[i]])
  C[i]: T[SA[i]], the first character of the i-th suffix
  F[c]: the number of characters x < c in T
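The duality can be checked on the mississippi$ example with a brute-force sketch. The suffix array is built by plain sorting, which is only suitable for a toy input, and select is a naive scan; both are stand-ins chosen for illustration, not the structures from the talk.

```python
T = "mississippi$"
n = len(T)

# Suffix array of 1-based start positions; '$' sorts before letters in ASCII.
SA = sorted(range(1, n + 1), key=lambda p: T[p - 1:])
ISA = {p: i for i, p in enumerate(SA, start=1)}   # inverse suffix array

# BWT[i] = T[SA[i] - 1], wrapping around to the last character when SA[i] = 1.
BWT = "".join(T[(p - 2) % n] for p in SA)

# Psi[i] = order of the suffix starting at SA[i] + 1 (wrapping n -> 1).
Psi = [ISA[p % n + 1] for p in SA]


def select(text: str, c: str, k: int) -> int:
    """1-based position of the k-th c in text (assumes it exists)."""
    pos = -1
    for _ in range(k):
        pos = text.index(c, pos + 1)
    return pos + 1


def psi_by_select(i: int) -> int:
    c = T[SA[i - 1] - 1]                  # C[i]: first character of the i-th suffix
    F = sum(1 for x in T if x < c)        # F[c]: characters smaller than c
    return select(BWT, c, i - F)


print(BWT)               # ipssm$pissii
print(psi_by_select(4))  # 11
```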

Our results

Dynamic rank-select on texts over a small alphabet (σ < log n)
– Improve the binary-alphabet version by Makinen & Navarro
– O(log n) time and n log σ + o(n log σ) bits

Dynamic rank-select for a large alphabet (σ < n)
– Use wavelet trees to extend our small-alphabet structure
– O(log n log σ / log log n) time and n log σ + o(n log σ) bits

Application to RLE texts

Static rank-select

Dynamic rank-select

Dynamic rank-select preliminary

We assume the RAM model with:
– Word size w = Θ(log n) bits
– +, -, *, / and bitwise operations in O(1) time

We process a word-size text of Θ(log n / log σ) characters in O(1) time


Partition of text
– Blocks of sizes from ½ log n words to 2 log n words
– Bit vector representation, I
  Gives block number b and offset r for position i
  Employs binary rank-select by Makinen & Navarro: O(log n) time & O(n) bits

E.g.
– T = babc abab abca
– I = 1000 1000 1000
– b = rankI(‘1’, 10) = 3
– r = 10 - selectI(‘1’, 3) + 1 = 2
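As a small illustration of how I maps a text position to a block number and an offset, the sketch below uses naive scans on a bit string. The real structure answers these binary rank/select queries with the dynamic bit vector of Makinen & Navarro; the helper names here are hypothetical.

```python
def rank1(bits: str, i: int) -> int:
    """Number of '1's in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count("1")


def select1(bits: str, k: int) -> int:
    """1-based position of the k-th '1' (assumes it exists)."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return pos + 1


def locate(I: str, i: int) -> tuple:
    """Block number b and in-block offset r of text position i."""
    b = rank1(I, i)
    r = i - select1(I, b) + 1
    return b, r


I = "1000" "1000" "1000"   # block boundaries of T = babc abab abca
print(locate(I, 10))       # (3, 2)
```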


Over-block/in-block operation
– rankT(c, i) = rank-overT(c, b) + rankTb(c, r)
  rank-overT(c, b): the number of c’s before the b-th block
  rankTb(c, r): the number of c’s up to position r in Tb
– E.g. T = babc abab abca, I = 1000 1000 1000:
  rankT(‘a’, 10) = rank-overT(‘a’, 3) + rankT3(‘a’, 2) = 3 + 1 = 4
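The decomposition of rank into an over-block part and an in-block part can be sketched as follows. Blocks are plain Python strings and every step is a linear scan, so this shows only the arithmetic, not the time bounds; all helper names are assumptions made for the example.

```python
blocks = ["babc", "abab", "abca"]   # T = babc abab abca


def locate(i: int) -> tuple:
    """Block number b and offset r of text position i (stand-in for rank/select on I)."""
    start = 0
    for b, blk in enumerate(blocks, start=1):
        if i <= start + len(blk):
            return b, i - start
        start += len(blk)
    raise IndexError("position past the end of the text")


def rank_over(c: str, b: int) -> int:
    """Number of c's before the b-th block."""
    return sum(blk.count(c) for blk in blocks[:b - 1])


def rank_in_block(c: str, b: int, r: int) -> int:
    """Number of c's up to offset r inside block b."""
    return blocks[b - 1][:r].count(c)


def rank(c: str, i: int) -> int:
    b, r = locate(i)
    return rank_over(c, b) + rank_in_block(c, b, r)


print(rank_over("a", 3), rank_in_block("a", 3, 2), rank("a", 10))   # 3 1 4
```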


Over-block/in-block operation
– selectT(c, k):
  select-overT(c, k): the block number containing the k-th c
  selectTb(c, k’): the offset of the k’-th c in Tb
– Update operation
  In-block update: change the text itself
  Over-block update: change the statistics of the text
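Select decomposes the same way as rank. The sketch below assumes that k’ is k minus the number of c’s in earlier blocks, which is how the two sub-queries fit together; blocks are plain strings and all scans are naive, so only the decomposition itself is illustrated.

```python
blocks = ["babc", "abab", "abca"]   # T = babc abab abca


def select_over(c: str, k: int) -> tuple:
    """Block number holding the k-th c, plus k' = its rank inside that block."""
    seen = 0
    for b, blk in enumerate(blocks, start=1):
        count = blk.count(c)
        if seen + count >= k:
            return b, k - seen
        seen += count
    raise ValueError("fewer than k occurrences of c")


def select_in_block(c: str, b: int, k2: int) -> int:
    """1-based offset of the k2-th c inside block b."""
    pos = -1
    for _ in range(k2):
        pos = blocks[b - 1].index(c, pos + 1)
    return pos + 1


def select(c: str, k: int) -> int:
    b, k2 = select_over(c, k)
    block_start = sum(len(blk) for blk in blocks[:b - 1])
    return block_start + select_in_block(c, b, k2)


print(select_over("a", 4), select("a", 4))   # (3, 1) 9
```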

Over-block structures

Sorted character-block pair
– Character-block pair (T[i], b): T[i] is in the b-th block
– Sorted pairs: partially non-decreasing (Hon, Sadakane & Sung)

E.g. T = babc abab abca
– Pairs in text order: (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)
– Sorted pairs: (a,1)(a,2)(a,2)(a,3)(a,3) (b,1)(b,1)(b,2)(b,2)(b,3) (c,1)(c,3)


Differential encoding of sorted pairs
– A bit vector B of O(n) bits
– For each distinct pair:
  1s: the difference of block numbers, in unary
  0s: the number of identical pairs, in unary

E.g.
– T = ... babc abab bbbb abcc …
– … (c,5)(c,8)(c,8) … is encoded as … 11111011100 …

E.g.
– T = babc abab abca
– B = 10100100 10010010 10110 (‘a’ group, ‘b’ group, ‘c’ group)
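A sketch of how B could be produced from the blocks, reading the encoding off the examples above: within each character group there is one ‘1’ per block, followed by one ‘0’ per occurrence of that character in the block. The function names and the explicit alphabet argument are assumptions made for illustration.

```python
def sorted_pairs(blocks: list) -> list:
    """All (character, block number) pairs, sorted by character and then by block."""
    pairs = [(ch, b) for b, blk in enumerate(blocks, start=1) for ch in blk]
    return sorted(pairs)


def encode_B(blocks: list, alphabet: str) -> str:
    """Unary differential encoding of the sorted pairs, one group per character."""
    groups = []
    for c in alphabet:
        # A '1' advances to the next block; each '0' is one pair (c, block).
        groups.append("".join("1" + "0" * blk.count(c) for blk in blocks))
    return "".join(groups)


blocks = ["babc", "abab", "abca"]
print(sorted_pairs(blocks)[:5])   # the five ('a', block) pairs
print(encode_B(blocks, "abc"))    # 101001001001001010110
```

Printed without spaces, the result is the concatenation of the three groups 10100100, 10010010 and 10110.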

Over-block rank-select

rank-overT(c, b):
– Find the position of the b-th ‘1’ in the group of c
– Count the ‘0’s in the group of c up to that position

E.g.
– T = babc abab abca
– B = 10100100 10010010 10110
– rank-overT(‘b’, 3): count the ‘0’s up to the 3rd ‘1’ in the ‘b’ group, which gives 4
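The two steps can be sketched directly on B. The sketch assumes each character group contains exactly one ‘1’ per block, so the group of the j-th character starts at the (j·blocks + 1)-th ‘1’ of B; select is a naive scan and the constant names are hypothetical.

```python
B = "10100100" "10010010" "10110"   # groups for 'a', 'b', 'c'
ALPHABET = "abc"
NUM_BLOCKS = 3


def select1(bits: str, k: int) -> int:
    """1-based position of the k-th '1' (assumes it exists)."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return pos + 1


def rank_over(c: str, b: int) -> int:
    """Number of c's before the b-th block, read off B."""
    ones_before_group = ALPHABET.index(c) * NUM_BLOCKS
    start = select1(B, ones_before_group + 1)   # first '1' of c's group
    end = select1(B, ones_before_group + b)     # b-th '1' of c's group
    return B[start - 1:end].count("0")          # '0's between the two positions


print(rank_over("b", 3))   # 4
```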

Over-block updates

If the number of blocks is fixed
– Insert or delete 0s at the b-th block in I and B
– Rank-select remains correct

E.g.
– T = babc abab abca → babc aabaaabb abca
– I = 1000 1000 1000 → 1000 10000000 1000
– B = 10100100 10010010 10110 → 10100000100 100100010 10110
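The fixed-block-count case can be sketched as two single-bit insertions per inserted character: one ‘0’ after the b-th ‘1’ of I, and one ‘0’ after the b-th ‘1’ of the character’s group in B. Strings and naive select stand in for the dynamic bit vectors, and the function name is hypothetical.

```python
def select1(bits: str, k: int) -> int:
    """1-based position of the k-th '1' (assumes it exists)."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return pos + 1


def insert_over(I: str, B: str, c: str, b: int, alphabet: str, num_blocks: int) -> tuple:
    """Record that one character c was inserted somewhere in block b."""
    p = select1(I, b)                                   # start of block b in I
    I = I[:p] + "0" + I[p:]                             # block b grows by one position
    q = select1(B, alphabet.index(c) * num_blocks + b)  # b-th '1' of c's group
    B = B[:q] + "0" + B[q:]                             # one more pair (c, b)
    return I, B


I = "1000" "1000" "1000"
B = "10100100" "10010010" "10110"
# abab -> aabaaabb adds three a's and one b to block 2.
for ch in "aaab":
    I, B = insert_over(I, B, ch, 2, "abc", 3)
print(I)
print(B)
```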

Over-block updates

If the number of blocks is changing
– Split or merge the b-th block in I and B
– Call O(σ) queries on B, amortized (σ < log n)

E.g.
– T = babc aabaaabb abca → babc aaba aabb abca
– I = 1000 10000000 1000 → 1000 1000 1000 1000
– B = 10100000100 100100010 10110 → 101000100100 1001010010 101110

In-block structures

We use the same hierarchy as Makinen & Navarro: word, sub-block and block

Rank/select on word-size texts w
– Convert w to a bit vector representing the occurrences of c
– E.g. for c = b: w = abaacbab, mask = bbbbbbbb (log σ bits per character)
  w XOR mask = x0xxx0x0 (log σ bits per character) → 01000101 (binary)
  Here 0 is a character field equal to b and x is a non-zero field
– O(1) time rank-select by tables of o(n) bits
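The XOR-with-mask step can be sketched on a packed integer. The 2-bit character codes are an assumption made for this example, and the per-field loop replaces the constant-time table lookup mentioned above, so the sketch shows what is computed rather than how fast.

```python
BITS = 2                            # bits per character (assumes sigma <= 4)
CODE = {"a": 0, "b": 1, "c": 2}     # hypothetical character codes


def pack(s: str) -> int:
    """Pack characters into one integer, first character most significant."""
    w = 0
    for ch in s:
        w = (w << BITS) | CODE[ch]
    return w


def occurrences(word: int, length: int, c: str) -> int:
    """Bit vector with a 1 wherever the packed word holds character c."""
    x = word ^ pack(c * length)     # fields equal to c become all-zero
    field_mask = (1 << BITS) - 1
    bits = 0
    for j in range(length):
        field = (x >> (BITS * (length - 1 - j))) & field_mask
        bits = (bits << 1) | int(field == 0)
    return bits


w = pack("abaacbab")
occ = occurrences(w, 8, "b")
print(format(occ, "08b"))    # 01000101
print(bin(occ).count("1"))   # 3 occurrences of b in the word
```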

In-block structures

Linked list over sub-blocks
– A block contains ½ log n to 2 log n words
– A sub-block contains √log n words
– One extra sub-block is a buffer for updates

Red-black tree over blocks
– Leaf node: pointer to block, list of sub-blocks
– Internal node: the number of blocks in its subtree

In-block rank-select

rankTb(c, r) in O(log n) time
– Traverse the tree to find the b-th block
– Scan the b-th block of Θ(log n) words

[Figure: tree over the blocks ab, ba, bc, with node labels 2, 2, 3, 5]

In-block updates

– Update words in the list in O(log n) time
– Process carry characters using the extra space in a block

[Figure: tree over the blocks ab, bc, ab, c, with node labels 2, 2, 3, 5]

In-block updates

– Split or merge a block whose size is out of range
– Update tree nodes from leaf to root

[Figure, step 1: tree over the blocks ab, bc, ac, ba, bc, with node labels 2, 2, 3, 5]
[Figure, step 2: tree over the blocks ab, bc, acba, bc, with node labels 2, 2, 2, 4, 6]

Extension of our structure

Dynamic rank-select on plain texts over a large alphabet, σ < n
– Use k-ary wavelet trees
– O(log n log σ / log log n) time & n log σ + O(n log σ / log log n) bits

Application to run-length encoded texts
– Start from RLFM (Makinen & Navarro)
– Support dynamic BWT
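To illustrate how a wavelet tree reduces rank over a large alphabet to rank over small ones, here is a minimal static sketch. It uses a binary tree rather than the k-ary trees of the talk, stores each level as a Python list, and counts bits naively, so it shows the descent only, not the stated time or space bounds. The class name is hypothetical.

```python
class WaveletTree:
    """Binary wavelet tree: each node splits its alphabet in two halves."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.bits = None                       # None marks a leaf
        if len(self.alphabet) <= 1:
            return
        mid = len(self.alphabet) // 2
        left_set = set(self.alphabet[:mid])
        # 0 sends a character to the left child, 1 to the right child.
        self.bits = [0 if ch in left_set else 1 for ch in text]
        self.left = WaveletTree([ch for ch in text if ch in left_set], self.alphabet[:mid])
        self.right = WaveletTree([ch for ch in text if ch not in left_set], self.alphabet[mid:])

    def rank(self, c, i):
        """Occurrences of c among the first i characters (1-based, inclusive)."""
        if c not in self.alphabet:
            return 0
        if self.bits is None:
            return i                           # leaf: every character here is c
        mid = len(self.alphabet) // 2
        if c in self.alphabet[:mid]:
            return self.left.rank(c, self.bits[:i].count(0))
        return self.right.rank(c, self.bits[:i].count(1))


text = "mississippi$"
wt = WaveletTree(text)
print(wt.rank("s", 7), wt.rank("i", 12), wt.rank("p", 9))
```

In the dynamic setting, each node's bit sequence would be held in a dynamic rank-select structure instead of a list.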

Application to RLE

Run-Length Encoding (RLE) of T
– Characters of runs: text T’
– Lengths of runs: bit vector L
– E.g. T = aaabbaacccc: T’ = abac, L = 10010101000

RLE of BWT (Makinen & Navarro)
– Run-Length based FM-index
– The number of runs in BWT(T) ≤ min(n, nH_k) + σ^k
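The RLE representation from the example can be sketched in a few lines: T’ keeps one character per run and L marks the first position of each run with a ‘1’. The function names are chosen here for illustration.

```python
from itertools import groupby


def rle_encode(text: str) -> tuple:
    """Return (T', L): run characters and a bit vector marking run starts."""
    runs = [(ch, len(list(group))) for ch, group in groupby(text)]
    heads = "".join(ch for ch, _ in runs)
    L = "".join("1" + "0" * (length - 1) for _, length in runs)
    return heads, L


def rle_decode(heads: str, L: str) -> str:
    """Rebuild T: each '1' in L moves on to the next run character."""
    out = []
    run = -1
    for bit in L:
        if bit == "1":
            run += 1
        out.append(heads[run])
    return "".join(out)


heads, L = rle_encode("aaabbaacccc")
print(heads, L)               # abac 10010101000
print(rle_decode(heads, L))   # aaabbaacccc
```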

Application to RLE

Assume rank/select on L and T’
– Total size of structure: O(n + n’ log σ)
– Operation time: O(log n + log n log σ / log log n)

Some additional vectors
– Sorted length vector: L’
– Frequency table F’: counts characters in T’
– E.g.
  T = bb aa bbbb cc aaa      sorted runs: aa aaa bb bbbb cc
  L = 10 10 1000 10 100      L’ = 10 100 10 1000 10
  T’ = babca                 F’ = 001 001 01
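The additional vectors can be derived from the runs as sketched below. Two points are read off the example rather than stated on the slide, so treat them as assumptions: L’ lists the run lengths after a stable sort of the runs by character, and F’ writes, for each character in alphabet order, one ‘0’ per run followed by a closing ‘1’.

```python
from itertools import groupby


def rle_vectors(text: str) -> tuple:
    """Return (T', L, L', F') for the run-length encoding of text."""
    runs = [(ch, len(list(group))) for ch, group in groupby(text)]

    def unary(length: int) -> str:
        return "1" + "0" * (length - 1)

    heads = "".join(ch for ch, _ in runs)                        # T'
    L = "".join(unary(length) for _, length in runs)
    by_char = sorted(runs, key=lambda run: run[0])               # stable sort by character
    L_sorted = "".join(unary(length) for _, length in by_char)   # L'
    F = "".join("0" * heads.count(c) + "1" for c in sorted(set(heads)))  # F'
    return heads, L, L_sorted, F


heads, L, L_sorted, F = rle_vectors("bbaabbbbccaaa")
print(heads)      # babca
print(L)          # 1010100010100
print(L_sorted)   # 1010010100010
print(F)          # 00100101
```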

Conclusion

Rank-select structure is an essential ingredient of compressed full-text indices

We propose dynamic rank-select for a small alphabet and its large-alphabet version

We can apply our structures to indices that use BWT, such as RLFM and indices for text collections
