Date post: | 15-Feb-2017 |
Category: |
Science |
Upload: | nidhal-el-abbadi |
January 5, 2016 1 [email protected]
Contents
• Introduction
• Example
• Example
• Reverse Transform
• Encoding Example
• Decoding Example
• Move to Front
• Compression L
Introduction
• Burrows-Wheeler, 1994
• BW Transform creates a representation of the data which has a small
working set.
• The transformed data is compressed with move-to-front compression.
• The decoder is quite different from the encoder.
• The algorithm requires processing the entire string at once (it is not
online).
• It is a remarkably good compression method.
The Burrows-Wheeler Transform (BWT) is a way of permuting the
characters of a string T into another string BWT(T).
This permutation is reversible; one procedure exists for turning T into
BWT(T) and another exists for turning BWT(T) back into T.
The BWT has two main applications: compression and indexing.
T denotes the string we would like to transform
m = |T| (the length of T)
The BWT prepares a string of data for later compression. The compression itself is
done with the move-to-front method, perhaps in combination with RLE (run-length encoding).
The Burrows-Wheeler method works in block mode: the input stream is
read block by block, and each block is encoded separately as one string.
The BW method is general purpose: it works well on images, sound,
and text, and can achieve very high compression ratios.
Example
Take T = abaaba$, that is, the string abaaba followed by the terminator $,
which sorts before every other character.
First, we write down the rotations of T: the distinct strings we can make
from T by repeatedly taking a character from one end and sticking it on
the other.
By writing them stacked vertically, we've created an m x m matrix. Now we
sort the rows of the matrix lexicographically (i.e. alphabetically):
This is the Burrows-Wheeler Matrix (BWM(T)). The final column of BWM(T),
read from top to bottom, is BWT(T). So for T = abaaba$, BWT(T) = abba$aa.
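The construction above can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: it assumes Python's default string ordering (in which $ sorts before lowercase letters), and the function names `bwm` and `bwt` are chosen here for illustration.

```python
def bwm(t):
    """Return the Burrows-Wheeler Matrix: all rotations of t, sorted."""
    rotations = [t[i:] + t[:i] for i in range(len(t))]
    return sorted(rotations)


def bwt(t):
    """Return BWT(t): the last column of the sorted rotation matrix."""
    return "".join(row[-1] for row in bwm(t))


for row in bwm("abaaba$"):
    print(row)
print(bwt("abaaba$"))  # abba$aa
```

Building every rotation costs O(m^2) memory, which is fine for a slide-sized example but not for large blocks.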
Example
Read in the following block: this is a test.
N = 15
C0 = 't'
C1 = 'h'
…
C13 = 't'
C14 = '.'
The next step is to think of the block as a cyclic buffer. N strings
(rotations) S0 … S_{N-1} may be constructed such that:
S0 = C0, …, C_{N-1}
S1 = C1, …, C_{N-1}, C0
S2 = C2, …, C_{N-1}, C0, C1
…
S_{N-1} = C_{N-1}, C0, …, C_{N-2}
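The cyclic-buffer view can be expressed with modular indexing: symbol j of rotation k is the block symbol at position (k + j) mod N. The sketch below assumes that reading; the helper name `rotation` is hypothetical.

```python
def rotation(block, k):
    """Build rotation k by reading N symbols from a cyclic buffer, starting at index k."""
    n = len(block)
    return "".join(block[(k + j) % n] for j in range(n))


block = "this is a test."
rotations = [rotation(block, k) for k in range(len(block))]
print(rotations[1])  # his is a test.t
```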
"this is a test." yields the following rotations:
S0 = "this is a test."
S1 = "his is a test.t"
S2 = "is is a test.th"
S3 = "s is a test.thi"
S4 = " is a test.this"
S5 = "is a test.this "
S6 = "s a test.this i"
S7 = " a test.this is"
S8 = "a test.this is "
S9 = " test.this is a"
S10 = "test.this is a "
S11 = "est.this is a t"
S12 = "st.this is a te"
S13 = "t.this is a tes"
S14 = ".this is a test"
The third step of BWT is to lexicographically sort S0 … S_{N-1}.
"this is a test." yields the following sorted rotations:
S7 = " a test.this is"
S4 = " is a test.this"
S9 = " test.this is a"
S14 = ".this is a test"
S8 = "a test.this is "
S11 = "est.this is a t"
S1 = "his is a test.t"
S5 = "is a test.this "
S2 = "is is a test.th"
S6 = "s a test.this i"
S3 = "s is a test.thi"
S12 = "st.this is a te"
S13 = "t.this is a tes"
S10 = "test.this is a "
S0 = "this is a test."
The final step in the transform is to output a string L, consisting of the
last character of each rotation in sorted order, along with I, the index of
the sorted row that contains S0.
"this is a test." yields the following output:
L = "ssat tt hiies .", I = 14
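As a rough sketch of the whole forward transform, the code below sorts the rotation start positions and reads off the character that precedes each start, which is the last character of that rotation. The name `bwt_encode` and the 0-based row index are assumptions made for this illustration.

```python
def bwt_encode(block):
    """Return (L, I): the last column of the sorted rotations and the row of the original block."""
    n = len(block)
    # Sort the rotation start positions by the rotation each one denotes.
    order = sorted(range(n), key=lambda k: block[k:] + block[:k])
    # The last character of rotation k is the symbol just before position k (cyclically).
    last_column = "".join(block[k - 1] for k in order)
    # Rotation 0 is the original block; its sorted position is I.
    return last_column, order.index(0)


print(bwt_encode("this is a test."))  # ('ssat tt hiies .', 14)
```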
Reverse Transform
Reversing BWT is a little more complicated than the initial transform.
The reversal process starts with a string L composed of the last characters of
the sorted rotations (S0 … S_{N-1}) and I, the position in L of the character
contributed by S0.
The reversal process must yield S0, the original block.
It turns out there are a few ways to reverse the transform. The method
discussed here is the one that I ended up implementing.
If L is composed of the symbols V0 … V_{N-1}, the transformed string may
be parsed to determine the following pieces of additional information:
1. For each position i, the number of symbols in the substring V0 … V_{i-1} that are identical to Vi (Table 1).
2. For each unique symbol Vi in L, the number of symbols in L that are
lexicographically less than that symbol (Table 2).
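The two pieces of information described above (prior occurrences of each symbol, and counts of lexicographically smaller symbols) can be gathered in one pass plus a sort of the distinct symbols. This is a sketch under that reading; `build_tables`, `prior`, and `less` are names invented here.

```python
from collections import Counter


def build_tables(L):
    """Return (prior, less) for the transformed string L.

    prior[i]  = number of symbols in L[0:i] equal to L[i]
    less[sym] = number of symbols in L that sort before sym
    """
    seen = Counter()
    prior = []
    for v in L:
        prior.append(seen[v])  # occurrences of v before this position
        seen[v] += 1

    less = {}
    total = 0
    for symbol in sorted(seen):  # distinct symbols in lexicographic order
        less[symbol] = total
        total += seen[symbol]
    return prior, less


prior, less = build_tables("ssat tt hiies .")
print(prior[12], less["s"])  # 2 9
```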
Using Tables 1 and 2, reverse the BWT where L = "ssat tt hiies ." and I = 14.
We start with:
S0 = ???????????????
We're given that C14 is V14 = '.'.
S0 = ??????????????.
Table 1 tells us that there are 0 other '.' before V14 and Table 2 tells us that there are 3 characters < '.',
so C13 must be the symbol at index 0 + 3, that is, V3 = 't'.
S0 = ?????????????t.
Table 1 tells us that there are 0 other 't' before V3 and Table 2 tells us that there are 12 characters < 't',
so C12 must be the symbol at index 0 + 12, that is, V12 = 's'.
S0 = ????????????st.
Table 1 tells us that there are 2 other 's' before V12 and Table 2 tells us that there are 9 characters < 's',
so C11 must be the symbol at index 2 + 9, that is, V11 = 'e'.
S0 = ???????????est.
Table 1 tells us that there are 0 other 'e' before V11 and Table 2 tells us that there are 5 characters < 'e',
so C10 must be the symbol at index 0 + 5, that is, V5 = 't'.
S0 = ??????????test.
Table 1 tells us that there is 1 other 't' before V5 and Table 2 tells us that there are 12 characters < 't',
so C9 must be the symbol at index 1 + 12, that is, V13 = ' '.
S0 = ????????? test.
Table 1 tells us that there are 2 other ' ' before V13 and Table 2 tells us that there are 0 characters < ' ',
so C8 must be the symbol at index 2 + 0, that is, V2 = 'a'.
S0 = ????????a test.
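The walk-through above repeats one rule: from row i, the next row is (prior occurrences of Vi) + (number of symbols smaller than Vi), and the block is filled in from its last character backward. A minimal sketch of that loop follows; the function name `bwt_decode` and its variable names are assumptions, not taken from the slides.

```python
def bwt_decode(L, I):
    """Rebuild the original block from L and I by filling it from the end backward."""
    # First lookup: how many times L[i] occurs in L[0:i].
    counts = {}
    prior = []
    for v in L:
        prior.append(counts.get(v, 0))
        counts[v] = counts.get(v, 0) + 1

    # Second lookup: how many symbols in L sort before each distinct symbol.
    less = {}
    total = 0
    for symbol in sorted(counts):
        less[symbol] = total
        total += counts[symbol]

    block = [""] * len(L)
    row = I
    for pos in range(len(L) - 1, -1, -1):
        block[pos] = L[row]
        row = prior[row] + less[L[row]]  # row that supplies the preceding character
    return "".join(block)


print(bwt_decode("ssat tt hiies .", 14))  # this is a test.
```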
Encoding Example
• abracadabra
1. Create all cyclic shifts of the string.
0 abracadabra
1 bracadabraa
2 racadabraab
3 acadabraabr
4 cadabraabra
5 adabraabrac
6 dabraabraca
7 abraabracad
8 braabracada
9 raabracadab
10 aabracadabr
4. Transmit X, the index of the input string in A, and L (using move-to-front coding). Here A is the sorted list of cyclic shifts and L is its last column.
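For the abracadabra example, the values of X and L can be checked with a short script. It assumes that A is the lexicographically sorted list of cyclic shifts and that X is a 0-based index; the variable names follow the slides where possible.

```python
text = "abracadabra"

# All cyclic shifts, in the order listed in step 1.
shifts = [text[i:] + text[:i] for i in range(len(text))]

# A: the shifts in lexicographic order.
A = sorted(shifts)

# L: last column of A. X: position of the input string in A.
L = "".join(row[-1] for row in A)
X = A.index(text)

print(X, L)  # 2 rdarcaaaabb
```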
Decoding Example
• To decode, we first assume some extra information is available. We then show how
to compute that information.
• Let As be A with every row cyclically shifted right by 1, so that L becomes its first column.
• Assume we know the mapping T, where T[i] is the index in As of the string A[i].
• T = [2 5 6 7 8 9 10 4 1 0 3]
• Let F be the first column of A; it is just L sorted.
• Follow the pointers in T through F to recover the input, starting at index X.
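The pointer-following step can be sketched as below, using the T given above. The values L = "rdarcaaaabb" and X = 2 are computed here for abracadabra and are assumptions of this example (they do not appear on the slides); `follow_pointers` is a hypothetical name.

```python
def follow_pointers(F, T, X):
    """Emit F[X], then repeatedly jump through T, until len(F) symbols are produced."""
    out = []
    i = X
    for _ in range(len(F)):
        out.append(F[i])
        i = T[i]
    return "".join(out)


L = "rdarcaaaabb"            # assumed last column for abracadabra
F = "".join(sorted(L))       # first column: L sorted
T = [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]

print(follow_pointers(F, T, 2))  # abracadabra
```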
• Why does this work?
• The first symbol of A[T[i]] is the second symbol of A[i]
because As[T[i]] = A[i].
• How do we compute F and T from L and X?
F is just L sorted
Note that L is the first column of As, and As is in the same order as A.
If i is the k-th x in F then T[i] is the k-th x in L.
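The rule "the k-th x in F maps to the k-th x in L" is what a stable sort of the positions of L produces. The sketch below relies on Python's `sorted` being stable; the function name `build_f_and_t` is an assumption, and the sample L is the last column computed for abracadabra.

```python
def build_f_and_t(L):
    """Return (F, T) computed from L alone.

    Sorting positions of L by symbol, stably, keeps equal symbols in their
    original order, so the k-th x in F points at the k-th x in L.
    """
    T = sorted(range(len(L)), key=lambda j: L[j])
    F = "".join(L[j] for j in T)
    return F, T


F, T = build_f_and_t("rdarcaaaabb")
print(F)  # aaaaabbcdrr
print(T)  # [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]
```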
Compression L
1. Initialize A to a list containing the symbols of our alphabet.
2. For i = 0, . . . , n − 1, encode symbol Li as the number of symbols
preceding it in A, and then move symbol Li to the beginning of A.
3. Combine the codes of step 2 in a list C, which will be further
compressed using Huffman or arithmetic coding.
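Steps 1 and 2 of this procedure can be sketched as follows, together with the matching decoder. The alphabet "abcdr" and the input "rdarcaaaabb" are assumptions chosen for illustration, and the Huffman or arithmetic coding of step 3 is left out.

```python
def mtf_encode(text, alphabet):
    """Encode each symbol as its current position in the list, then move it to the front."""
    symbols = list(alphabet)
    codes = []
    for ch in text:
        idx = symbols.index(ch)   # number of symbols preceding ch
        codes.append(idx)
        symbols.insert(0, symbols.pop(idx))
    return codes


def mtf_decode(codes, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    symbols = list(alphabet)
    out = []
    for idx in codes:
        ch = symbols.pop(idx)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


codes = mtf_encode("rdarcaaaabb", "abcdr")
print(codes)  # [4, 4, 2, 2, 4, 2, 0, 0, 0, 4, 0]
print(mtf_decode(codes, "abcdr"))  # rdarcaaaabb
```

Runs of identical symbols in L turn into runs of zeros in the code list, which is what makes the later entropy coding effective.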
Move to Front
The basic idea of this method [Bentley 86] is to maintain the alphabet A of
symbols as a list where frequently occurring symbols are located near the
front.
NOTE.
The last column, L, of the sorted matrix contains concentrations of identical
characters, which is why L is easy to compress. However, the first column,
F, of the same matrix is even easier to compress, since it contains runs, not
just concentrations, of identical characters. Why select column L and not
column F? Answer. Because the original string S can be reconstructed from
L but not from F.