
A general compression algorithm that supports fast searching

Page 1: A general compression algorithm that supports fast searching

A general compression algorithm that supports fast searching

Szymon GrabowskiComputer Engineering Dept., Tech. Univ. of Łódź, Poland

[email protected]

Appeared in Information Processing Letters (IPL), 100(6):226–232, 2006.

Kimmo FredrikssonDept. of Computer Science Univ. of Joensuu, Finland

[email protected]

Page 2: A general compression algorithm that supports fast searching


Compressed pattern searching problem (Amir & Benson, 1992):

Input: text T’ available in a compressed form, pattern P.

Output: report all occurrences of P in T (i.e., decompressed T’) without decompressing the whole T’.

Of course, a compressed search algorithm can be called practical only if the search time is less than with the naïve “first decompress, then search” approach.

Basic notation: |T| = n, |T’| = n’, |P| = m, |Σ| = σ.

K.Fredriksson & Sz. Grabowski, A general compression algorithm that supports fast searching

Page 3: A general compression algorithm that supports fast searching


Pros and cons of on-line and off-line searching

On-line algorithms: immediate to use (raw text), simple, flexible – but slow.

Off-line algorithms (indexes): much faster, but the simple and fastest solutions (suffix tree, suffix array) need much space (at least 5n, incl. the text), while the more succinct ones (FM-index, CSA, and many variants) are quite complicated. Indexed searching is also much less flexible than on-line searching (hard or impossible to adapt to various approximate matching models; hard to handle a dynamic scenario).

Page 4: A general compression algorithm that supports fast searching


Compressed pattern searching – something in between

May be faster (but not dramatically) than on-line searching in uncompressed text.

Space: typically 0.5n or less.

Relatively simple.

Easier to implement approximate matching, handle dynamic text etc.

So here was our motivation...

Page 5: A general compression algorithm that supports fast searching


State-of-the-art in compressed pattern searching

Word based vs. full-text schemes.

Word based algorithms are better (faster, better compression, more flexible for advanced queries, easier...) as long as they can be applied: the text must be naturally segmented into words.

Works like a charm with English. Slightly worse with agglutinative languages (German, Finnish...).

Even worse with Polish, Russian...

Doesn’t work at all with East Asian languages (Chinese, Korean, Japanese).

Doesn’t work with DNA, proteins, MIDI...


Page 6: A general compression algorithm that supports fast searching


State-of-the-art in compressed pattern searching, cont’d

Full-text algorithms

(Approximate) searching in RLE-compressed data (Apostolico et al., 1999; Mäkinen et al., 2001, 2003) – nice theory but limited applications (fax images?).

Direct search in binary Huffman stream (Klein & Shapira, 2001; Takeda et al., 2001, 2002; Fredriksson & Tarhio, 2003) – mediocre compression ratio, but relatively simple.

Ziv-Lempel based schemes (Kida et al., 1999; Navarro & Tarhio, 2000) – quite good compression but complicated and not very fast.


Page 7: A general compression algorithm that supports fast searching


Our proposal, main traits

Full-text compression.

Based on q-grams.

Actually two search algorithms: a very fast one for “long” patterns (m ≥ 2q–1), and a somewhat slower and more complicated one for short patterns (m < 2q–1).

Compresses plain NL text to 45–50% of the original size (worse than Ziv-Lempel but better than character based Huffman).


Page 8: A general compression algorithm that supports fast searching


Our proposal, compression scheme

Choose q (a larger q gives better asymptotic compression but also a larger dictionary, and the slower “short pattern” search variant is triggered more often). Practical trade-off for human text: q = 4.

Split text T into non-overlapping q-grams, build a dictionary over those units, dump the dictionary to the output file, and encode the q-grams according to the built dictionary, using some byte-oriented code enabling pattern searching with skips (could be tagged Huffman (Moura et al., 2000), but (s,c)-DC (Brisaboa et al., 2003b) and ETDC (Brisaboa et al., 2003b) are more efficient).
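The scheme above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the function names and the space-padding of the final partial q-gram are our simplifications; only the non-overlapping q-gram split, the frequency-ranked dictionary, and the End-Tagged Dense Code byte layout follow the description above.

```python
from collections import Counter

def etdc_encode(rank):
    """End-Tagged Dense Code: 7 payload bits per byte; the top bit is
    set only in the LAST byte of a codeword, so codeword boundaries are
    recognizable anywhere in the stream, enabling search with skips."""
    out = bytearray([0x80 | (rank & 0x7F)])   # tagged last byte
    rank = (rank >> 7) - 1
    while rank >= 0:
        out.append(rank & 0x7F)               # untagged continuation bytes
        rank = (rank >> 7) - 1
    out.reverse()
    return bytes(out)

def compress(text, q=4):
    # Pad so the length is a multiple of q (tail handling is simplified).
    text += " " * ((-len(text)) % q)
    grams = [text[i:i + q] for i in range(0, len(text), q)]
    # Rank q-grams by frequency: frequent grams receive shorter codewords.
    ranks = {g: r for r, (g, _) in enumerate(Counter(grams).most_common())}
    encoded = b"".join(etdc_encode(ranks[g]) for g in grams)
    return ranks, encoded
```

Note that the dictionary must also be written to the output file, as the slide says; this sketch returns it separately and omits it from the byte count.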

Page 9: A general compression algorithm that supports fast searching


Searching for long patterns

Generate the q possible alignments of pattern P[0..m–1]. That is, the last char of P may be either the 1st symbol, or the 2nd, etc., or the qth symbol of some q-gram.

We cannot ignore any alignment, as this could result in missed matches.

Now, truncate at most q–1 characters at each pattern alignment boundary, those that belong to “broken” q-grams.

Encode each alignment according to the text dictionary.

Use any multiple string searching algorithm (we use BNDM adapted for multiple matching) to search for the q alignments in parallel; verify matches with the truncated prefix/suffix.


Page 10: A general compression algorithm that supports fast searching


Searching for long patterns, pattern preprocessing, pseudo code
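The pseudo code on this slide was an image and did not survive extraction. As a stand-in, here is a hedged Python reconstruction of the preprocessing described on the previous slide; all names are ours, `ranks` is assumed to be the q-gram dictionary built at compression time, and a minimal ETDC encoder is inlined to keep the sketch self-contained.

```python
def etdc_encode(rank):
    # Minimal End-Tagged Dense Code: tagged last byte, 7 payload bits/byte.
    out = bytearray([0x80 | (rank & 0x7F)])
    rank = (rank >> 7) - 1
    while rank >= 0:
        out.append(rank & 0x7F)
        rank = (rank >> 7) - 1
    out.reverse()
    return bytes(out)

def preprocess(pattern, q, ranks):
    """For each of the q alignments of the pattern against the q-gram
    grid (assumes a long pattern, m >= 2q-1), drop the broken q-grams
    at the ends, encode the complete q-grams with the text dictionary,
    and keep the truncated prefix/suffix for verifying candidates."""
    alignments = []
    for a in range(q):
        core = pattern[a:]
        core = core[:len(core) - len(core) % q]          # drop broken tail
        grams = [core[i:i + q] for i in range(0, len(core), q)]
        if all(g in ranks for g in grams):
            enc = b"".join(etdc_encode(ranks[g]) for g in grams)
            alignments.append((pattern[:a], enc, pattern[a + len(core):]))
        # A q-gram absent from the dictionary never occurs in T, so
        # that whole alignment can be discarded.
    # The encoded alignments are then truncated to the length of the
    # shortest one and searched in parallel (the paper uses BNDM for
    # multiple patterns); matches are verified via prefix/suffix.
    return alignments
```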


Page 11: A general compression algorithm that supports fast searching


Searching for long patterns, example

Let P = nasty_bananas and q = 3.

ETDC code.

Three alignments generated:


Page 12: A general compression algorithm that supports fast searching


Searching for long patterns, example, cont’d

We encode the 3-grams. The pattern alignments may turn into something like:


nas ty_ ban ana

ast y_b ana nas

sty _ba nan

Page 13: A general compression algorithm that supports fast searching


Searching for long patterns, example, cont’d

The shortest of those encodings (prev. slide) has 7 bytes (the 3rd one), therefore we truncate the other two sequences to 7 bytes. Those three sequences are the input for the BNDM algorithm; potential matches must be verified.


Page 14: A general compression algorithm that supports fast searching


Searching for short patterns

If m < 2q–1, at least one alignment will not contain even one “full” q-gram. As a result, the presented algorithm won’t work.

We solve this by adapting the method from (Fredriksson, 2003). The idea is to have an implicit decoding of the text, encoded into a Shift-Or (Baeza-Yates & Gonnet, 1992; Wu & Manber, 1992) automaton: the automaton makes implicit transitions using the original text symbols, while the input is the q-gram symbols of the compressed text.
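A minimal sketch of the underlying Shift-Or idea, under stated assumptions: here each compressed symbol is expanded to its q characters and the automaton is advanced character by character. In the actual method the combined effect of a whole q-gram on the bit-state is precomputed in a table, so the implicit transitions cost one lookup per compressed symbol; the inner loop below stands in for that table.

```python
def shift_or_search(pattern, grams):
    """Shift-Or over a q-gram stream: `grams` is the sequence of
    q-grams obtained by looking up each compressed symbol in the
    dictionary.  Returns the start positions of all matches in the
    decompressed text."""
    m = len(pattern)
    masks = {c: ~0 for c in set("".join(grams)) | set(pattern)}
    for i, c in enumerate(pattern):
        masks[c] &= ~(1 << i)                # bit i cleared: c matches P[i]
    state, occ, pos = ~0, [], 0
    for g in grams:
        for c in g:                          # implicit char-level transitions
            state = (state << 1) | masks.get(c, ~0)
            if state & (1 << (m - 1)) == 0:  # bit m-1 clear: full match
                occ.append(pos - m + 1)      # match ends at pos
            pos += 1
    return occ
```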


Page 15: A general compression algorithm that supports fast searching


Test methodology

All algorithms implemented in C, compiled with gcc 3.4.1.

Test machine: P4 2 GHz, 512 MB, running GNU/Linux 2.4.20.

Text files: Dickens (10.2 MB), English, plain text; Bible (~4 MB), in English, Spanish, Finnish, plain text; XML collection (5.3 MB); DNA (E. coli) (4.6 MB), σ = 4; proteins (5 MB), σ = 23.

( All test files available at szgrabowski.kis.p.lodz.pl/research/data.zip )


Page 16: A general compression algorithm that supports fast searching


Experimental results: compression ratio


[figure: compression ratios of our algorithms; chart not reproduced]

Page 17: A general compression algorithm that supports fast searching


The effect of varying q on the dictionary size and the overall compression. Dickens / ETDC coding.


q = 4 gives somewhat worse compression here than q = 5 but a much smaller dictionary, so it may be preferred

Page 18: A general compression algorithm that supports fast searching


Decompression times (excl. I/O times) [s]

On the XML file, where the word based methods can be used, the q-gram based algorithms are almost twice as fast, partly because of the better compression they provide in this case.


Page 19: A general compression algorithm that supports fast searching


Search times [s]

Short patterns used for the test: random excerpts from the text, of length 2q–2 (i.e., the longest “short” patterns).

Long patterns in the test: the minimum pattern lengths that produced compressed patterns of length at least 2.


Page 20: A general compression algorithm that supports fast searching


Conclusions

We have presented a compression algorithm for arbitrary data which enables pattern search with Boyer-Moore skips directly in the compressed representation.

The algorithm is simple and the conducted experiments validate the claim for its practicality.

For natural texts this scheme, however, cannot match, e.g., the original (s,c)-dense code in compression ratio, but this is the price we pay for removing the limitation to word based textual data.

Searching speed for long enough patterns can be higher than in uncompressed text.


Page 21: A general compression algorithm that supports fast searching


Future plans

Flexible text partitioning: apart from q-grams, also allow shorter tokens (this should give a significant compression boost on NL texts).

Succinct dictionary representation (currently a naïve approach is used).

Handling updates to T.

Adapting the scheme for approximate searching (very promising!).

Finding (quickly) appropriate q for a given text.


