Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts
Sunho Lee and Kunsoo Park, Seoul National Univ.
Contents
Introduction
– Rank/select problem
– Relations to compressed full-text indices
Dynamic rank-select structure
Extensions of the structure
– For a large alphabet text
– For a run-length encoded text
Rank-select problem
For a given text T over an alphabet of size σ, our structures support:
– rankT(c, i): gives the number of occurrences of character c up to position i in T
– selectT(c, k): gives the position of the k-th c in T
E.g. T = acabbc
– rankT(‘a’, 5) = 2
– selectT(‘a’, 2) = 3
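As a point of reference, the two queries can be written naively in linear time. This is only a sketch of the semantics with 1-based positions, not the structure presented in these slides; the function names are chosen for illustration.

```python
def rank(text: str, c: str, i: int) -> int:
    """Number of occurrences of c in text[1..i] (1-based, inclusive)."""
    return text[:i].count(c)


def select(text: str, c: str, k: int) -> int:
    """1-based position of the k-th c in text, or 0 if there are fewer than k."""
    seen = 0
    for pos, ch in enumerate(text, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return 0


T = "acabbc"
print(rank(T, "a", 5))    # 2
print(select(T, "a", 2))  # 3
```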
Rank-select problem
Our structures support additional update operations:
– insertT(c, i): inserts character c between T[i] and T[i+1]
– deleteT(i): deletes T[i] from T
E.g. T = acabbc → aababc
– rankT(‘a’, 5) = 2 → rankT(‘a’, 5) = 3
– selectT(‘a’, 2) = 3 → selectT(‘a’, 2) = 2
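The update semantics can be sketched on a plain string. The slide does not say which updates turn acabbc into aababc; the pair used here (delete position 2, then insert ‘a’ after position 3) is an assumption that happens to produce that text. The helper names are illustrative.

```python
def insert(text: str, c: str, i: int) -> str:
    """Insert c between text[i] and text[i+1] (1-based positions)."""
    return text[:i] + c + text[i:]


def delete(text: str, i: int) -> str:
    """Delete the character at 1-based position i."""
    return text[:i - 1] + text[i:]


def rank(text: str, c: str, i: int) -> int:
    return text[:i].count(c)


def select(text: str, c: str, k: int) -> int:
    pos = -1
    for _ in range(k):
        pos = text.index(c, pos + 1)
    return pos + 1


T = "acabbc"
T = delete(T, 2)        # "aabbc"
T = insert(T, "a", 3)   # "aababc"
print(T, rank(T, "a", 5), select(T, "a", 2))  # aababc 3 2
```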
Why rank-select problem?
In compressed full-text indices
– Rank-select structures are built on the Burrows-Wheeler Transform (BWT)
– Rank: backward search (Ferragina & Manzini)
– Select: Psi-function in CSA (Grossi & Vitter)
Dynamic BWT
– Index for a collection of texts (Chan, Hon & Lam)
– Add or remove a text from the collection
Example of select on BWT
T = mississippi$
 i   BWT  Psi  SA   Suffix
 1   i     6   12   $
 2   p     1   11   i$
 3   s     8    8   ippi$
 4   s    11    5   issippi$
 5   m    12    2   ississippi$
 6   $     5    1   mississippi$
 7   p     2   10   pi$
 8   i     7    9   ppi$
 9   s     3    7   sippi$
10   s     4    4   sissippi$
11   i     9    6   ssippi$
12   i    10    3   ssissippi$
Psi function
– Order of the suffix at the next position
– E.g. Psi[4] = 11, the order of ‘ssippi$’
Duality between Psi-function and BWT (Hon, Sadakane & Sung)
– BWT[i] = T[SA[i] – 1]
– Psi[i] = selectBWT(C[i], i – F[C[i]])
– C[i]: T[SA[i]], the first character of the i-th suffix
– F[c]: the number of characters x in T with x < c
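The duality can be checked on the mississippi$ example with a naive construction. This sketch sorts suffixes directly (quadratic, for illustration only), treats the text as cyclic at the boundaries, and uses helper names that are not from the slides.

```python
def select(s: str, c: str, k: int) -> int:
    """1-based position of the k-th c in s."""
    pos = -1
    for _ in range(k):
        pos = s.index(c, pos + 1)
    return pos + 1


T = "mississippi$"
n = len(T)

# SA[i]: 1-based start of the i-th smallest suffix ('$' sorts before letters).
SA = sorted(range(1, n + 1), key=lambda p: T[p - 1:])

# BWT[i] = T[SA[i] - 1]; for SA[i] = 1 this wraps around to the final '$'.
BWT = "".join(T[p - 2] for p in SA)

# Psi by definition: the order of the suffix starting one position later.
order = {p: i for i, p in enumerate(SA, start=1)}
psi_def = [order[p % n + 1] for p in SA]


def F(c: str) -> int:
    """Number of characters in T that are smaller than c."""
    return sum(1 for x in T if x < c)


# Psi through the duality: Psi[i] = select_BWT(C[i], i - F[C[i]]).
psi_dual = [select(BWT, T[p - 1], i - F(T[p - 1]))
            for i, p in enumerate(SA, start=1)]

print(BWT)                  # ipssm$pissii
print(psi_def == psi_dual)  # True
```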
Our results
Dynamic rank-select on texts over a small alphabet (σ < log n)
– Improve the binary-alphabet version by Makinen & Navarro
– O(log n) time and n log σ + o(n log σ) bits
Dynamic rank-select for a large alphabet (σ < n)
– Use wavelet trees to extend our small-alphabet structure
– O(log n log σ / log log n) time and n log σ + o(n log σ) bits
Application to RLE texts
Static rank-select
Dynamic rank-select
Dynamic rank-select preliminary
We assume a RAM model with:
– Word size w = Θ(log n) bits
– +, -, *, / and bitwise operations in O(1) time
We process a word-size text of Θ(log n / log σ) characters in O(1) time
Dynamic rank-select preliminary
Partition of text
– Blocks of sizes from ½ log n words to 2 log n words
– Bit vector representation, I: a ‘1’ marks the first position of each block
– I gives block number b and offset r for position i
– Employ binary rank-select by Makinen & Navarro: O(log n) time & O(n) bits
E.g.
– T = babc abab abca
– I = 1000 1000 1000
– b = rankI(‘1’, 10) = 3
– r = 10 - selectI(‘1’, 3) + 1 = 2
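The mapping from a text position to a block number and offset can be sketched with a naive binary rank/select on I. The real structure would answer these two queries with the dynamic binary structure mentioned above; here I is a plain string of bits and the function names are illustrative.

```python
def rank1(bits: str, i: int) -> int:
    """Number of '1's in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count("1")


def select1(bits: str, k: int) -> int:
    """1-based position of the k-th '1' in bits."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return pos + 1


def locate(I: str, i: int) -> tuple:
    """Block number b and in-block offset r of text position i."""
    b = rank1(I, i)
    r = i - select1(I, b) + 1
    return b, r


I = "1000" "1000" "1000"   # three blocks of four characters each
print(locate(I, 10))       # (3, 2)
```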
Dynamic rank-select preliminary
Over-block/in-block operation
– rankT(c, i) = rank-overT(c, b) + rankTb(c, r)
– rank-overT(c, b): the number of c’s before the b-th block
– rankTb(c, r): the number of c’s up to position r in Tb
E.g.
– T = babc abab abca
– I = 1000 1000 1000
– rankT(‘a’, 10) = rank-overT(‘a’, 3) + rankT3(‘a’, 2) = 3 + 1 = 4
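The over-block/in-block split of rank can be sketched on a list of blocks. Both parts are computed by plain counting here, which only illustrates the decomposition and not the time bounds; the function names are illustrative.

```python
def locate(blocks, i):
    """Block number b (1-based) and offset r of text position i."""
    start = 0
    for b, blk in enumerate(blocks, start=1):
        if i <= start + len(blk):
            return b, i - start
        start += len(blk)
    raise IndexError("position beyond the end of the text")


def rank_over(blocks, c, b):
    """Number of c's in the blocks before the b-th block."""
    return sum(blk.count(c) for blk in blocks[:b - 1])


def rank_in_block(block, c, r):
    """Number of c's in the first r characters of one block."""
    return block[:r].count(c)


def rank(blocks, c, i):
    b, r = locate(blocks, i)
    return rank_over(blocks, c, b) + rank_in_block(blocks[b - 1], c, r)


blocks = ["babc", "abab", "abca"]
print(rank_over(blocks, "a", 3), rank_in_block(blocks[2], "a", 2))  # 3 1
print(rank(blocks, "a", 10))                                        # 4
```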
Dynamic rank-select preliminary
Over-block/in-block operation
– selectT(c, k):
  select-overT(c, k): the block number containing the k-th c
  selectTb(c, k’): the offset of the k’-th c in Tb
– Update operation:
  In-block update: change the text itself
  Over-block update: change the statistics of the text
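Select decomposes in the same way. In the sketch below, k’ is taken to be k minus the number of c’s before the chosen block; the slide does not spell this out, so treat it as an assumption. The function names are illustrative and everything is done by scanning.

```python
def select_over(blocks, c, k):
    """Return (b, before): the 1-based block holding the k-th c,
    and the number of c's in the blocks before it."""
    before = 0
    for b, blk in enumerate(blocks, start=1):
        count = blk.count(c)
        if before + count >= k:
            return b, before
        before += count
    raise ValueError("fewer than k occurrences of c")


def select_in_block(block, c, k):
    """1-based offset of the k-th c inside one block."""
    pos = -1
    for _ in range(k):
        pos = block.index(c, pos + 1)
    return pos + 1


def select(blocks, c, k):
    b, before = select_over(blocks, c, k)
    offset = select_in_block(blocks[b - 1], c, k - before)  # k' = k - before
    start = sum(len(blk) for blk in blocks[:b - 1])
    return start + offset


blocks = ["babc", "abab", "abca"]
print(select_over(blocks, "a", 4))  # (3, 3)
print(select(blocks, "a", 4))       # 9
```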
Over-block structures
Sorted character-block pairs
– Character-block pair (T[i], b): T[i] in the b-th block
– Sorted pairs: partially non-decreasing (Hon, Sadakane & Sung); within each character group the block numbers are non-decreasing
E.g. T = babc abab abca
– Pairs in text order: (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)
– Sorted pairs: (a,1)(a,2)(a,2)(a,3)(a,3) (b,1)(b,1)(b,2)(b,2)(b,3) (c,1)(c,3)
Over-block structures
Differential encoding of sorted pairs
– A bit vector B of O(n) bits
– For each distinct pair:
  1s: the difference of block number from the previous pair, in unary
  0s: the number of the same pairs, in unary
E.g.
– T = ... babc abab bbbb abcc …
– … (c,5)(c,8)(c,8) … → … 11111011100 …
E.g.
– T = babc abab abca
– B = 10100100 10010010 10110 (the middle group, 10010010, is the ‘b’ group)
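The differential encoding can be sketched by sorting the character-block pairs and emitting, for each distinct pair, the block-number difference as 1s followed by the multiplicity as 0s. The sketch returns one bit string per character group instead of a single concatenated vector, and assumes each group starts from block number 0; both choices are assumptions made for readability.

```python
from itertools import groupby


def encode_pairs(blocks):
    """Map each character to its group of the bit vector B."""
    pairs = sorted((ch, b) for b, blk in enumerate(blocks, start=1) for ch in blk)
    groups = {}
    for c, char_pairs in groupby(pairs, key=lambda p: p[0]):
        bits, prev = [], 0
        for b, same in groupby(char_pairs, key=lambda p: p[1]):
            count = len(list(same))
            # 1s: difference of block number; 0s: number of equal pairs
            bits.append("1" * (b - prev) + "0" * count)
            prev = b
        groups[c] = "".join(bits)
    return groups


print(encode_pairs(["babc", "abab", "abca"]))
# {'a': '10100100', 'b': '10010010', 'c': '10110'}
```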
Over-block rank-select
rank-overT(c, b):
– Find the position of the b-th ‘1’ in the group of c
– Count ‘0’s representing c up to the position
E.g.
– T = babc abab abca
– B = 10100100 10010010 10110
– rank-overT(‘b’, 3): count ‘0’s up to the 3rd ‘1’ in the ‘b’ group, which gives 4
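The two steps of rank-over can be sketched as a single scan over one character group of B. A real structure would use binary select and rank on B instead of scanning; the fallback for groups with fewer than b ‘1’s is an assumption of this sketch.

```python
def rank_over(group_bits: str, b: int) -> int:
    """Count the '0's that precede the b-th '1' in one character group of B."""
    ones = zeros = 0
    for bit in group_bits:
        if bit == "1":
            ones += 1
            if ones == b:
                return zeros
        else:
            zeros += 1
    # Fewer than b '1's: every occurrence lies before block b.
    return zeros


# Groups of B for T = babc abab abca
print(rank_over("10010010", 3))  # 4  ('b' group)
print(rank_over("10100100", 3))  # 3  ('a' group)
print(rank_over("10110", 3))     # 1  ('c' group)
```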
Over-block updates
If the number of blocks is fixed
– Insert or delete 0s at the b-th block in I and B
– Rank-select remains correct
E.g.
– T = babc abab abca → babc aabaaabb abca
– I = 1000 1000 1000 → 1000 10000000 1000
– B = 10100100 10010010 10110 → 10100000100 100100010 10110
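An over-block insertion can be sketched as adding one ‘0’ right after the b-th ‘1’, once in I and once in the group of the inserted character in B. The sketch keeps the groups of B as separate strings and assumes the b-th ‘1’ already exists in the group; both are simplifications for illustration.

```python
def add_zero_after_kth_one(bits: str, k: int) -> str:
    """Insert a '0' immediately after the k-th '1' of bits."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return bits[:pos + 1] + "0" + bits[pos + 1:]


def over_block_insert(I: str, groups: dict, c: str, b: int):
    """Record the insertion of one character c into the b-th block."""
    new_groups = dict(groups)
    new_groups[c] = add_zero_after_kth_one(groups[c], b)
    return add_zero_after_kth_one(I, b), new_groups


I = "100010001000"
B = {"a": "10100100", "b": "10010010", "c": "10110"}
# Block 2 grows from abab to aabaaabb: three more a's and one more b.
for ch in "aaab":
    I, B = over_block_insert(I, B, ch, 2)
print(I)  # 1000100000001000
print(B)  # {'a': '10100000100', 'b': '100100010', 'c': '10110'}
```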
Over-block updates
If the number of blocks is changing
– Split or merge the b-th block in I and B
– Call O(σ) queries on B, amortized (σ < log n)
E.g.
– T = babc aabaaabb abca → babc aaba aabb abca
– I = 1000 10000000 1000 → 1000 1000 1000 1000
– B = 10100000100 100100010 10110 → 101000100100 1001010010 101110
In-block structures
We use the same hierarchy as Makinen & Navarro: word, sub-block and block
Rank/select on word-size texts w
– Convert w to a bit vector representing occurrences of c
– E.g. c = b, w = abaacbab, mask = bbbbbbbb (log σ-bit characters)
  w XOR mask = x0xxx0x0 (log σ-bit characters, x: non-zero) → 01000101 (binary)
– O(1) time rank-select by tables of o(n) bits size
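The XOR-with-mask conversion can be sketched with integers as machine words. The 2-bit character codes and the per-field loop are assumptions of this sketch; the structure described above would replace the loop with constant-time table lookups.

```python
BITS = 2                            # assumed width of one character code
CODE = {"a": 0, "b": 1, "c": 2}     # assumed encoding of the alphabet


def pack(chars: str) -> int:
    """Pack characters into one integer, first character in the highest field."""
    w = 0
    for ch in chars:
        w = (w << BITS) | CODE[ch]
    return w


def occurrences(chars: str, c: str) -> str:
    """Bit vector with a '1' wherever chars equals c."""
    k = len(chars)
    x = pack(chars) ^ pack(c * k)   # a field is zero exactly where chars matches c
    field_mask = (1 << BITS) - 1
    bits = []
    for j in range(k - 1, -1, -1):  # highest field first, i.e. text order
        field = (x >> (j * BITS)) & field_mask
        bits.append("1" if field == 0 else "0")
    return "".join(bits)


v = occurrences("abaacbab", "b")
print(v)                 # 01000101
print(v[:6].count("1"))  # rank of 'b' up to position 6: 2
```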
In-block structures
Linked list over sub-blocks
– A block contains ½ log n to 2 log n words
– A sub-block contains √log n words
– One extra sub-block is a buffer for updates
Red-black tree over blocks
– Leaf node: pointer to block, list of sub-blocks
– Internal node: the number of blocks in its subtree
In-block rank-select
RankTb(c, r) in O(log n) time
– Traverse the tree to find the b-th block
– Scan the b-th block of Θ(log n) words
[Figure: tree over blocks with node labels 2, 2, 3, 5; a leaf block with sub-blocks ab, ba, bc]
In-block updates
– Update words in the list in O(log n) time
– Process carry characters using the extra space in a block
[Figure: tree over blocks with node labels 2, 2, 3, 5; a leaf block with sub-blocks ab, bc, ab, c]
In-block updates
– Split or merge a block whose size falls outside the allowed range
– Update tree nodes from leaf to root
[Figure: tree over blocks before the split, with node labels 2, 2, 3, 5 and sub-blocks ab, bc, ac, ba, bc; after the split, node labels 2, 2, 2, 4, 6]
Extension of our structure
Dynamic rank-select on plain texts over a large alphabet, σ < n
– Use k-ary wavelet trees
– O(log n log σ / log log n) time & n log σ + O(n log σ / log log n) bits
Application to run-length encoded texts
– Start from RLFM (Makinen & Navarro)
– Support dynamic BWT
Application to RLE
Run-Length Encoding (RLE) of T
– Character of runs: text T’
– Length of runs: bit vector L
– E.g. T = aaabbaacccc, T’ = abac, L = 10010101000
RLE of BWT (Makinen & Navarro)
– Run-Length based FM-index
– The number of runs in BWT(T) ≤ min(n, nH_k) + σ^k
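The run-length encoding into T’ and L can be sketched as follows: each run contributes its character to T’ and a ‘1’ followed by (length − 1) ‘0’s to L. The function name is illustrative.

```python
from itertools import groupby


def rle(text: str):
    """Return (T', L): run heads and the unary bit vector of run lengths."""
    heads, lengths = [], []
    for ch, run in groupby(text):
        n = len(list(run))
        heads.append(ch)
        lengths.append("1" + "0" * (n - 1))  # a '1' marks the start of each run
    return "".join(heads), "".join(lengths)


print(rle("aaabbaacccc"))  # ('abac', '10010101000')
```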
Application to RLE
Assume rank/select on L and T’
– Total size of structure: O(n + n’ log σ)
– Operation time: O(log n + log n log σ / log log n)
Some additional vectors
– Sorted length vector: L’
– Frequency table F’: count characters in T’
E.g.
– T = bb aa bbbb cc aaa, runs sorted by character: aa aaa bb bbbb cc
– L = 10 10 1000 10 100, L’ = 10 100 10 1000 10
– T’ = babca, F’ = 001 001 01
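The two additional vectors can be derived from the runs of T. In this sketch L’ lists the run lengths after a stable sort of the runs by character, and F’ writes, for each character in alphabet order, one ‘0’ per run followed by a ‘1’; that reading of the example is an assumption, and the function name is illustrative.

```python
from itertools import groupby


def sorted_run_vectors(text: str):
    """Return (L', F') for the run-length encoding of text."""
    runs = [(ch, len(list(g))) for ch, g in groupby(text)]
    # Stable sort: runs of the same character keep their text order.
    runs_sorted = sorted(runs, key=lambda r: r[0])
    L_sorted = "".join("1" + "0" * (n - 1) for _, n in runs_sorted)
    F = "".join("0" * len(list(g)) + "1"
                for _, g in groupby(runs_sorted, key=lambda r: r[0]))
    return L_sorted, F


print(sorted_run_vectors("bbaabbbbccaaa"))
# ('1010010100010', '00100101')
```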
Conclusion
Rank-select structures are an essential ingredient of compressed full-text indices
We propose dynamic rank-select for a small alphabet and its large-alphabet version
We can apply our structures to indices that use BWT, such as RLFM and indices for text collections