  • SOFTWARE: PRACTICE AND EXPERIENCE
    Softw. Pract. Exper. 2016; 46:1201–1217
    Published online 11 November 2015 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/spe.2377

    Improving a lightweight LZ77 computation algorithm for running faster

    Wei Jun Liu 1,2, Ge Nong 1,3,*,†, Wai Hong Chan 4,*,† and Yi Wu 1

    1 Computer Science Department, Sun Yat-sen University, Guangzhou, China
    2 School of Physics and Electronic Information, Gannan Normal University, Ganzhou, China
    3 SYSU-CMU Shunde International Joint Research Institute, Shunde, China
    4 Department of Mathematics and Information Technology, The Hong Kong Institute of Education, Hong Kong

    SUMMARY

    Computing the Lempel–Ziv factorization (LZ77) of a string is a key step in many applications. However, at the same time, it constitutes a bottleneck of the entire computation. The investigation of time and space efficient computation of the LZ77 has become an important topic. In this paper, we present a lightweight linear-time algorithm called LZone for computing the LZ77, which is designed by improving the existing linear-time, space-efficient LZ77 algorithm BGone for speed. For an input string $T[1..n]$ over a constant alphabet of size $O(1)$, LZone requires only $n$ words of workspace in addition to the input string and the output factorization, with $\lceil \log n \rceil$ bits per word. This is the same space requirement as that of the algorithm BGone. LZone has two versions, LZoneT and LZoneSA, corresponding to BGoneT and BGoneSA, respectively. Our experimental results show that for computing the LZ77 from an input string $T$, LZoneT and LZoneSA run around 26% and 57% faster, respectively, than their counterparts in BGone. Moreover, for computing the LZ77 from the suffix array of $T$, the speed of LZoneSA is on average twice that of BGoneSA. Copyright © 2015 John Wiley & Sons, Ltd.

    Received 23 December 2014; Revised 4 August 2015; Accepted 19 October 2015

    KEY WORDS: Lempel–Ziv factorization; algorithm; linear time; lightweight; data compression; suffix array

    1. INTRODUCTION

    The Lempel–Ziv factorization (LZ77) [1], named after its authors Abraham Lempel and Jacob Ziv, is an important concept in computer science. Because of its high efficiency in string processing, it has been widely used in many applications such as file compression [2–6], pattern discovery, sequence alignments, and full-text indexes [7, 8]. Very recently, several powerful instances have shown that compression schemes based on LZ77 are effective on modern datasets, especially on highly repetitive collections such as genome sequences [9, 10]. However, in all those applications, computing LZ77 has been shown to be a time and space bottleneck in practice [11].

    Unless otherwise specified, in this paper, the input string is assumed to have a constant alphabet of size $O(1)$; this is commonly satisfied by realistic data such as a text or bioinformatics database. Although a variety of worst-case linear-time algorithms have appeared over the years [12, 13], much research is still being carried out on making the computation of LZ77 more time and space efficient. Until now, the fastest linear-time algorithms for LZ77 are KKP1, KKP2, and KKP3 [11], proposed by Kärkkäinen et al. in 2013. Among the three algorithms, KKP3 is the fastest; it utilizes three size-$n$ integer arrays: the suffix array $SA$ and the two arrays $PSV$ and $NSV$, which store the previous smaller values (PSVs) and next smaller values (NSVs) of the input string $T$, respectively. Given the $SA$ of an input string $T$, KKP3 first computes $PSV$ and $NSV$ simultaneously and then the LZ77. KKP2 uses one less integer array by computing only $NSV$ in the preliminary step and then computing $PSV$ on-the-fly in the parsing step, making use of the relationships between $PSV$, $NSV$, and the array $\Phi$ (first defined in [14]). In KKP1, $SA$ is stored on disk and streamed from the disk when computing $NSV$. Apart from that, there are no other differences between KKP1 and KKP2. Although KKP1 holds only a single integer array in memory, its total space requirement is still two integer arrays.

    *Correspondence to: Ge Nong, Computer Science Department, Sun Yat-sen University, Guangzhou, China; Wai Hong Chan, Department of Mathematics and Information Technology, The Hong Kong Institute of Education, Hong Kong.

    †E-mail: [email protected]; [email protected]

    Copyright © 2015 John Wiley & Sons, Ltd.

    Currently, many algorithms for computing LZ77 (including KKP1, KKP2, and KKP3) are based on the suffix array, with a workspace (excluding the space for storing the input and the output) of no less than two integer arrays. That is, the total workspace is at least $2n$ words with $\lceil \log n \rceil$ bits per word. Recently, a space-economical, linear-time LZ77 factorization algorithm called BGone [15, 16], proposed by Goto and Bannai, uses only one integer array as the workspace, that is, the workspace is $n$ words. This is achieved by constructing the array $\Phi$ by simulating the sorting process of SACA-K [17], followed by the in-place computation of $NSV$ from $\Phi$. BGone has two versions called BGoneT and BGoneSA: the former computes the LZ77 directly from $T$, while the latter computes the LZ77 from the suffix array of $T$. When the suffix array has already been obtained in an application, BGoneSA can be employed to save runtime.

    While BGone brings a lightweight solution for the LZ77 computation, the experiments in [15, 16] show that its speed is quite slow, and techniques for acceleration are desired. Motivated by this, we present here an algorithm called LZone to compute LZ77 in $O(n)$ time and $n$-word workspace. LZone is designed by improving BGone for speed; it has the same theoretical time and space complexities as BGone, but it runs much faster in practice. We achieve this by exploiting properties of the suffix array and a series of techniques for rewriting the different auxiliary integer arrays from one to another. LZone also has two versions, LZoneT and LZoneSA, which correspond to BGoneT and BGoneSA, respectively. Our experimental results in Section 5 show that, with the same space requirements, LZoneT and LZoneSA are around 26% and 57% faster than BGoneT and BGoneSA, respectively. We also compared the speed of LZone with that of KKP: LZone is slower than KKP but uses only half the total space. Thus, LZone may be a better choice when the total space (including both RAM and disk) is an important factor to consider. For example, WRT-LZ77 [18] and XML-WRT [19] are two text preprocessors that can boost the speed and compression performance of gzip; each uses a space-consuming dictionary to keep a large collection of words. In this case, LZone can be applied for LZ77 computation to save space for the dictionary.

    In summary, the contribution of our work consists of two components:

    • We design an algorithm LZone for computing the LZ77 of a size-$n$ input string over a constant alphabet. Being linear-time and using only one size-$n$ integer array, LZone provides a new choice for space-succinct applications. LZone uses a new method to compute $\Phi_\ell$ or $\Psi_s$, which makes it run much faster than BGone. This method is explained in detail to help readers see the differences between LZone and BGone.

    • An experimental study is conducted for performance evaluation of our algorithms against the others, including both linear-time and non-linear-time algorithms, such as KKP, BGone/BGtwo, LZscan, and ISA6s. Moreover, we also include the time results of gzip for comparison, to give a rough idea of how far our algorithm is from the LZ77 computation engine in the popular compression software gzip, which uses both LZ77 and Huffman coding. It is believed that gzip has been fine-tuned for speed, although its space consumption may not be optimal.

    Because LZoneSA is only a variant of LZoneT under the assumption that the suffix array is already given, in the rest of this paper, we shall present the details of LZoneT, which uses only a single integer array as workspace for computing the LZ77 directly from the input string.


    2. PRELIMINARIES

    Let $\Sigma$ be a constant alphabet of size $O(1)$, and $T = T[1..n] = T[1]T[2] \ldots T[n]$ be a string of $n$ characters from $\Sigma$, where $T[i]$ is the $i$-th character of $T$. The length of a string $X$ is denoted by $|X|$, for example, $|T| = n$. $T[i..j]$ with $1 \le i \le j \le n$ is a substring of $T$ consisting of the $i$-th to the $j$-th characters, that is, $T[i..j] = T[i]T[i+1] \ldots T[j]$. For convenience, we assume that the last character of $T$ is the sentinel \$, that is, $T[n] = \$$, which is the unique lexicographically smallest character in $T$. The size of the alphabet of an input string $T$ is denoted by $\sigma$ or $|\Sigma|$.

    2.1. Suffix and suffix array

    For $i = 1, \ldots, n$, the suffix of $T$ starting with the $i$-th character of $T$ is denoted by $T[i..n]$ or $\mathrm{suf}(T, i)$, and the prefix of $T$ ending at the $i$-th character of $T$ is denoted by $T[1..i]$ or $\mathrm{pre}(T, i)$. The length of the longest common prefix (LCP) of $\mathrm{suf}(T, i)$ and $\mathrm{suf}(T, j)$ is denoted by $\mathrm{lcp}(i, j)$, e.g., for $T = \texttt{aabbaabb}$, $\mathrm{lcp}(1, 2) = 1 = |\texttt{a}|$, and $\mathrm{lcp}(1, 5) = 4 = |\texttt{aabb}|$.

    A suffix $\mathrm{suf}(T, i)$ is S-type if $T[i] < T[i+1]$, or $T[i] = T[i+1]$ and $\mathrm{suf}(T, i+1)$ is S-type. A suffix $\mathrm{suf}(T, i)$ is L-type if $T[i] > T[i+1]$, or $T[i] = T[i+1]$ and $\mathrm{suf}(T, i+1)$ is L-type. The last suffix $\mathrm{suf}(T, n)$ is always S-type. A suffix $\mathrm{suf}(T, i)$ is leftmost S-type (LMS) if $\mathrm{suf}(T, i)$ is S-type and $\mathrm{suf}(T, i-1)$ is L-type. Symmetrically, a suffix $\mathrm{suf}(T, i)$ is leftmost L-type (LML) if $\mathrm{suf}(T, i)$ is L-type and $\mathrm{suf}(T, i-1)$ is S-type. The last suffix $\mathrm{suf}(T, n)$ is always LMS, while the first suffix $\mathrm{suf}(T, 1)$ is neither LMS nor LML. An L-type suffix is also called an L-suffix for short, and likewise for S-type, LMS, and LML suffixes. The type of a character $T[i]$ is defined as the type of $\mathrm{suf}(T, i)$.
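The type definitions above can be checked with a single right-to-left scan. The following Python sketch is illustrative only: the function name `classify_suffixes` is ours, not the paper's, and it assumes that the input ends with a unique smallest sentinel and that positions are reported 1-based as in the text.

```python
def classify_suffixes(text):
    """Return (types, lms, lml) for a sentinel-terminated string; positions are 1-based."""
    n = len(text)
    is_s = [False] * (n + 1)         # is_s[i] is True when suf(T, i) is S-type
    is_s[n] = True                   # the last suffix is always S-type
    for i in range(n - 1, 0, -1):    # right-to-left scan comparing T[i] and T[i+1]
        cur, nxt = text[i - 1], text[i]
        is_s[i] = cur < nxt or (cur == nxt and is_s[i + 1])
    types = "".join("S" if is_s[i] else "L" for i in range(1, n + 1))
    # LMS: S-type position preceded by an L-type one; LML: the symmetric case.
    lms = [i for i in range(2, n + 1) if is_s[i] and not is_s[i - 1]]
    lml = [i for i in range(2, n + 1) if not is_s[i] and is_s[i - 1]]
    return types, lms, lml
```

For `"mmiissiiss$"` this yields the type string `LLSSLLSSLLS`, LMS positions 3, 7, 11, and LML positions 5, 9.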

    The suffix array of an input string $T$ is an array of length $n$, denoted by $SA$, indicating the lexicographical order of all the suffixes of $T$. That is, for any $1 \le i < j \le n$, $\mathrm{suf}(T, SA[i]) < \mathrm{suf}(T, SA[j])$, or $T[SA[i]..n] < T[SA[j]..n]$. The inverse array of $SA$, denoted by $ISA$, is also an array of length $n$, which is the inverse permutation of $SA$ such that $ISA[SA[i]] = i$. $ISA[i] = j$ means that $\mathrm{suf}(T, i)$ is at the $j$-th position in $SA$. The arrays storing all the sorted L-type, S-type, LMS, and LML suffixes are denoted by $SA_\ell$, $SA_s$, $SA_{lms}$, and $SA_{lml}$, respectively.

    For example, given $T = \texttt{mmiissiiss\$}$, Table I shows the type of each character in $T$ (marked by 'L', 'S', and '*') and the arrays $SA$, $ISA$, and so on, where the arrays $\Phi$ and $\Psi$ will be introduced in Section 2.2.
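As a small illustration of $SA$ and $ISA$, the sketch below builds both arrays by directly sorting the suffixes. This is a naive comparison sort chosen for readability, not the linear-time SACA-K construction used in the paper. The name `suffix_array` is ours, and index 0 of each returned list is an unused placeholder so that the remaining indices are 1-based.

```python
def suffix_array(text):
    """Return (sa, isa) with 1-based contents; sa[0] and isa[0] are unused."""
    n = len(text)
    # Sort the starting positions 1..n by the suffix that begins there.
    sa = [0] + sorted(range(1, n + 1), key=lambda i: text[i - 1:])
    isa = [0] * (n + 1)
    for rank in range(1, n + 1):
        isa[sa[rank]] = rank     # ISA[SA[rank]] = rank
    return sa, isa
```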

    2.2. $\Phi$ and $\Psi$

    We define $\Phi[SA[1]] = 0$, $\Phi[0] = SA[n]$, and $\Phi[i] = SA[ISA[i] - 1]$ for $i \in [1, n] \setminus \{SA[1]\}$ to make $\Phi$ a cycle of length $|T| + 1$ that indicates the predecessor of each suffix of $T$ in the lexicographical order. Similarly, we define $\Psi[SA[n]] = 0$, $\Psi[0] = SA[1]$, and $\Psi[i] = SA[ISA[i] + 1]$ for $i \in [1, n] \setminus \{SA[n]\}$ to make $\Psi$ a cycle of length $|T| + 1$ that indicates the successor of each suffix of $T$ in the lexicographical order. Clearly, $\Psi$ is the inverse of $\Phi$. Given the array $\Phi$, starting from $\Phi[0]$, we can visit all of the suffixes of $T$ from the largest to the smallest. Analogously, given the array $\Psi$, starting from $\Psi[0]$, we can visit all of the suffixes from the smallest to the largest. So, the $\Phi$ array can be viewed as an array-based linked list, which is another way to store $SA$, and so can the $\Psi$ array.

    Table I. An example for LMS, LML, $SA$, $ISA$, $SA_\ell$, $SA_s$, $\Phi$, $\Psi$, and so on.

    Index          0   1   2   3   4   5   6   7   8   9  10  11
    $T$                m   m   i   i   s   s   i   i   s   s   $
    L/S                L   L   S   S   L   L   S   S   L   L   S
    LMS                        *               *               *
    LML                                *               *
    $SA$              11   7   3   8   4   2   1  10   6   9   5
    $ISA$              7   6   3   5  11   9   2   4  10   8   1
    $SA_s$            11   7   3   8   4
    $SA_\ell$          2   1  10   6   9   5
    $SA_{lms}$        11   7   3
    $SA_{lml}$         9   5
    $\Phi$         5   2   4   7   8   9  10  11   3   6   1   0
    $\Psi$        11  10   1   8   2   0   9   3   4   5   6   7

    LMS, leftmost S-type; LML, leftmost L-type.
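The definitions of $\Phi$ and $\Psi$ translate directly into a single pass over $SA$. The sketch below is a minimal illustration under our own conventions: the names `phi_psi` and `walk` are hypothetical, the suffix array is passed as a list whose index 0 is unused, and 0 serves as the end-of-cycle marker as in the text.

```python
def phi_psi(sa):
    """Build the predecessor list Phi and the successor list Psi from a 1-based SA."""
    n = len(sa) - 1
    phi = [0] * (n + 1)
    psi = [0] * (n + 1)
    phi[0] = sa[n]                   # entry point: lexicographically largest suffix
    psi[0] = sa[1]                   # entry point: lexicographically smallest suffix
    for r in range(1, n + 1):
        phi[sa[r]] = sa[r - 1] if r > 1 else 0
        psi[sa[r]] = sa[r + 1] if r < n else 0
    return phi, psi

def walk(links):
    """Follow an array-based linked list from links[0] until the 0 terminator."""
    out, i = [], links[0]
    while i != 0:
        out.append(i)
        i = links[i]
    return out
```

Walking `psi` reproduces $SA$ from the smallest suffix to the largest, and walking `phi` reproduces it in reverse, which is the sense in which either list is another way to store $SA$.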

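    Note that $\Phi$ and $\Psi$ are defined using $SA$. Similarly, we define $\Phi_\ell$, $\Psi_s$, and $\Psi_{lms}$ using $SA_\ell$, $SA_s$, and $SA_{lms}$, respectively, as follows. (i) $\Phi_\ell[SA_\ell[1]] = 0$, $\Phi_\ell[0] = SA_\ell[|SA_\ell|]$, and $\Phi_\ell[i]$ is the immediate lexicographical predecessor of $\mathrm{suf}(T, i)$ among the L-suffixes for every $i$ in $SA_\ell$ other than $SA_\ell[1]$, which makes $\Phi_\ell$ a cycle of length $|SA_\ell| + 1$. (ii) $\Psi_s[SA_s[|SA_s|]] = 0$, $\Psi_s[0] = SA_s[1]$, and $\Psi_s[i]$ is the immediate lexicographical successor of $\mathrm{suf}(T, i)$ among the S-suffixes for every $i$ in $SA_s$ other than $SA_s[|SA_s|]$, which makes $\Psi_s$ a cycle of length $|SA_s| + 1$. (iii) $\Psi_{lms}[SA_{lms}[|SA_{lms}|]] = 0$, $\Psi_{lms}[0] = SA_{lms}[1]$, and $\Psi_{lms}[i]$ is the immediate lexicographical successor of $\mathrm{suf}(T, i)$ among the LMS-suffixes for every $i$ in $SA_{lms}$ other than $SA_{lms}[|SA_{lms}|]$, which makes $\Psi_{lms}$ a cycle of length $|SA_{lms}| + 1$.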

    Given $\Phi_\ell$, all of the L-suffixes of $T$ can be visited from the largest to the smallest. Likewise, given $\Psi_{lms}$ or $\Psi_s$, all of the LMS-suffixes or S-suffixes, respectively, can be visited from the smallest to the largest.

    2.3. Lempel–Ziv factorization

    The LZ77 introduces the concept of the longest previous factor. The pair $(p_i, l_i)$ is the longest previous factor of position $i$ in $T$ if, for $1 \le i \le n$, $T[p_i..p_i + l_i - 1] = T[i..i + l_i - 1]$, where $p_i < i$ and $l_i > 0$ is maximized. That is, $T[i..i + l_i - 1]$ is the longest prefix of $\mathrm{suf}(T, i)$ that occurs at least once before $i$. If $T[i]$ does not occur before $i$, then $p_i = T[i]$ and $l_i = 0$.

    The LZ77 of a string $T$ is formed by a left-to-right, greedy process that parses $T$ into longest previous factors. In each parsing phrase (starting at some position $j$ of $T$), the ordered pair $(p_j, l_j)$ is acquired. Then, the next phrase starts at position $j + l_j$ if $l_j > 0$, or at position $j + 1$ otherwise.

    For example, given $T = \texttt{mmiissiiss}$, the LZ77 of $T$ is as follows:

    $$(\texttt{m}, 0),\ (1, 1),\ (\texttt{i}, 0),\ (3, 1),\ (\texttt{s}, 0),\ (5, 1),\ \text{and}\ (3, 4).$$
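The greedy parsing can be reproduced with a brute-force search that follows the definition of the longest previous factor literally. The sketch below is for illustration only and runs in quadratic time or worse. The name `lz77_naive` is ours, and ties between equally long matches are broken in favour of the leftmost source position, which is an assumption (the definition only asks for a maximal length).

```python
def lz77_naive(text):
    """Greedy LZ77 parse; returns (p, l) pairs with 1-based p, or (char, 0)."""
    n = len(text)
    factors = []
    j = 0                                    # 0-based start of the current phrase
    while j < n:
        best_p, best_l = -1, 0
        for p in range(j):                   # every candidate source p < j
            l = 0
            while j + l < n and text[p + l] == text[j + l]:
                l += 1                       # matches may overlap position j
            if l > best_l:
                best_p, best_l = p, l
        if best_l == 0:
            factors.append((text[j], 0))     # first occurrence of this character
            j += 1
        else:
            factors.append((best_p + 1, best_l))
            j += best_l
    return factors
```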

    2.4. Next and previous smaller values

    Crochemore and Ilie [20] showed that $p_i$ can be computed from the NSVs/PSVs, which are defined as

    $$PSV[SA[i]] = SA[j_1] \quad \text{and} \quad NSV[SA[i]] = SA[j_2],$$

    where $j_1 = \max\{j \in [1, i) \mid SA[j] < SA[i]\}$ and $j_2 = \min\{j \in (i, n] \mid SA[j] < SA[i]\}$. If $j_1$ (or $j_2$) does not exist, we set $j_1$ (or $j_2$) equal to 0.
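The two arrays can be computed with the textbook stack-based nearest-smaller-value scan, shown below as a sketch. This is a stand-in chosen for clarity, not the peak elimination technique of [20], and the explicit stack costs extra space. The names `nearest_smaller` and `psv_nsv` are ours, and a missing value is stored as 0.

```python
def nearest_smaller(order, n):
    """For each value v in 'order', record the closest earlier value smaller than v."""
    out, stack = [0] * (n + 1), []
    for v in order:
        while stack and stack[-1] > v:
            stack.pop()                      # larger values can never be an answer again
        out[v] = stack[-1] if stack else 0   # 0 means that no smaller value exists
        stack.append(v)
    return out

def psv_nsv(sa):
    """sa is 1-based (sa[0] unused); results are indexed by text position."""
    n = len(sa) - 1
    psv = nearest_smaller(sa[1:], n)         # scan SA from left to right
    nsv = nearest_smaller(sa[:0:-1], n)      # scan SA from right to left
    return psv, nsv
```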

    2.5. Lazy LZ factorization

    The lazy LZ factorization is essentially the lazy evaluation of LCP values. For example, $\mathrm{lcp}(i, NSV[i])$ and $\mathrm{lcp}(i, PSV[i])$ will not be computed until $i$ is the starting position of a phrase. This trick is used in the recent fast LZ factorization algorithms [16, 21, 22] and the currently fastest LZ factorization algorithm KKP3 [11]. During the process of computing the LCP values, the characters of two suffixes are compared one-by-one, and the total time complexity is $O(n)$. Our new algorithm presented in this paper also utilizes the lazy LZ factorization.

    In the rest of the paper, we assume that all LZ factors are acquired sequentially from left to right, and the space for storing the LZ factors is excluded from the workspace.

    3. PRIOR ARTS

    3.1. KKP3, KKP2 and KKP1

    The process of LZ factorization in each of KKP3, KKP2, and KKP1 is composed of two common steps, the preliminary step and the parsing step. For example, in the preliminary step of KKP3, the PSVs and NSVs for all the positions are computed by sequentially scanning the $SA$ of $T$ in $O(n)$ time, making use of the technique of peak elimination (originated in [20]). Then, in the parsing step, the LZ factors are acquired by repeatedly calling the LZ-factor function, in $O(n)$ total time. It is clear that KKP3 runs in $O(n)$ time and requires three auxiliary integer arrays ($SA$, $PSV$, and $NSV$) of length $|T|$.
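The two-step structure (preliminary step, then lazy parsing) can be sketched as follows. This is not the KKP3 implementation: the suffix array comes from a naive sort, PSV/NSV from a stack scan rather than peak elimination, and all names (`lz_factorize`, `lcp`, `nearest_smaller`) are ours. When both candidates match equally far, the sketch prefers the PSV candidate, which is an arbitrary choice.

```python
def nearest_smaller(order, n):
    """Closest earlier smaller value for each element of 'order' (0 if none)."""
    out, stack = [0] * (n + 1), []
    for v in order:
        while stack and stack[-1] > v:
            stack.pop()
        out[v] = stack[-1] if stack else 0
        stack.append(v)
    return out

def lcp(text, i, j):
    """Length of the common prefix of suf(T, i) and suf(T, j); 0 when j is 0."""
    if j == 0:
        return 0
    n, l = len(text), 0
    while i + l <= n and j + l <= n and text[i + l - 1] == text[j + l - 1]:
        l += 1
    return l

def lz_factorize(text):
    n = len(text)
    # Preliminary step: suffix array (naive sort), then PSV and NSV.
    sa = [0] + sorted(range(1, n + 1), key=lambda i: text[i - 1:])
    psv = nearest_smaller(sa[1:], n)
    nsv = nearest_smaller(sa[:0:-1], n)
    # Parsing step: LCP values are evaluated only at phrase starts (lazy evaluation).
    factors, i = [], 1
    while i <= n:
        lp, ln = lcp(text, i, psv[i]), lcp(text, i, nsv[i])
        if lp == 0 and ln == 0:
            factors.append((text[i - 1], 0))
            i += 1
        else:
            p, l = (psv[i], lp) if lp >= ln else (nsv[i], ln)
            factors.append((p, l))
            i += l
    return factors
```

On `"mmiissiiss$"` the output agrees with the example of Section 2.3, followed by one extra literal factor for the sentinel.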

    In KKP2, let $X_t = \{T[i..n] \mid i \le t\}$ for $t \in [1..n]$, so that $X_t$ contains all suffixes of $T$ starting at or before the $t$-th character. Let $\Phi_t$ be $\Phi$ restricted to $X_t$, that is, for $t \in [1..n]$, $\Phi_t[i]$ is the immediate lexicographical predecessor of $\mathrm{suf}(T, i)$ among the suffixes in $X_t$. In order to make $\Phi_t$ a complete unicyclic permutation like $\Phi$, $\Phi_t[i_{min}]$ is set to 0, where $\mathrm{suf}(T, i_{min})$ is the lexicographically smallest suffix in $X_t$. Also, $\Phi_t[0]$ is set to $i_{max}$, where $\mathrm{suf}(T, i_{max})$ is the lexicographically largest suffix in $X_t$. When $t = n$, $\Phi_t = \Phi$. Thus, only the NSVs need to be computed in the preliminary step, because the PSVs can be computed in-place on-the-fly in the parsing step by scanning and rewriting $NSV$ sequentially. That is, in the parsing step, $T$ is parsed from left to right. In the $t$-th round, $\Phi_{t-1}$ has already been obtained and stored in $NSV[1..t-1]$, and so $PSV[t]$ can be acquired as $\Phi_{t-1}[NSV[t]]$. Because $NSV[t]$ need not be kept after it has been processed, its space can then be used for storing $\Phi_t[t] = PSV[t]$. After the $t$-th round in the parsing step, $NSV[1..t] = \Phi_t$ and $NSV[t+1..n]$ remains unchanged.
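The rewriting of $NSV$ into $\Phi_t$ can be sketched as below. This is our reading of the description above, not the KKP2 source: the name `psv_on_the_fly` is hypothetical, slot 0 of the working array is assumed to hold $\Phi_t[0]$, and the PSVs are collected into a separate list only so that they can be inspected (KKP2 consumes each value immediately during parsing).

```python
def psv_on_the_fly(nsv):
    """nsv[1..n] holds NSV by text position (nsv[0] is ignored); returns (psv, phi)."""
    n = len(nsv) - 1
    a = list(nsv)              # working array: NSV on entry, Phi on exit
    a[0] = 0                   # Phi_0 is the empty cycle
    psv = [0] * (n + 1)
    for t in range(1, n + 1):
        nxt = a[t]             # NSV[t]: successor of suf(T, t) among earlier suffixes
        prv = a[nxt]           # Phi_{t-1}[NSV[t]] = PSV[t]
        psv[t] = prv
        a[t] = prv             # Phi_t[t] = PSV[t]
        a[nxt] = t             # suf(T, t) becomes the predecessor of its successor
    return psv, a
```

After the last round the working array equals $\Phi$, so for $T = \texttt{mmiissiiss\$}$ it matches the $\Phi$ row of Table I.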

    As described previously, KKP2 requires two integer arrays ($SA$ and $NSV$) in the first step and one integer array ($NSV$) in the second step. Thus, KKP2 runs in $O(n)$ time using two auxiliary size-$n$ integer arrays. In KKP1, $SA$ is stored on disk. When computing the NSVs, KKP1 streams the suffix array from the disk. Thus, only one integer array (i.e., $NSV$) is kept in main memory, but the total space requirement is still two integer arrays (i.e., $SA$ and $NSV$).

    3.2. BGone

    BGone computes the LZ77 using a single integer array. We first describe BGoneT, which computes the LZ77 directly from $T$. Let $A[1..n]$ be a size-$n$ integer array; $A$ is called the working array for BGone and is reused to store $SA_{lms}$, $\Phi$, or $NSV$ in different steps of BGone. In order to obtain the LZ factors, BGoneT conducts the following steps.

    1. Compute $SA_{lms}$.
    2. Compute $\Psi_{lms}$ from $SA_{lms}$.
    3. Compute $\Phi_\ell$ from $\Psi_{lms}$.
    4. Compute $\Phi$ from the result of $\Phi_\ell$.
    5. Compute $NSV$ from $\Phi$.
    6. Compute LZ77 from $NSV$.

    BGoneSA, which is another version of BGone, computes the LZ77 from the $SA$ of $T$. The difference between BGoneT and BGoneSA lies in how $SA_{lms}$ is obtained. BGoneT calls SACA-K to compute $SA_{lms}$ directly from the input string $T$, while BGoneSA obtains $SA_{lms}$ by scanning the $SA$ from left to right to find all of the LMS-suffixes.

    4. OUR WORK

    4.1. LZoneT

    Our algorithm LZone, which has two variations, LZoneT and LZoneSA, is described in this section. The algorithmic framework of LZoneT is shown in Algorithm 1. The workspace of LZone is identical to that of BGone, that is, the working array $A$ only (Section 3.2). The crucial difference between BGone and LZone resides in their methods for computing $\Phi$: BGone computes $\Phi$ from $SA_{lms}$, but LZone computes $\Phi$ from $SA_\ell$ or $\Psi$ from $SA_s$. LZoneT first counts the numbers of L-suffixes and S-suffixes. If the S-suffixes outnumber the L-suffixes, LZoneT computes $SA_\ell$ from $T$; otherwise, it computes $SA_s$. Computing $SA_{lms}$ is not required in LZoneT for the following reasons.

    1. If the number of L-suffixes is less than that of S-suffixes, all of the L-suffixes can be linked from the lexicographically largest one to the smallest by putting the L-suffixes in the S-type positions of the working array $A$ (the $i$-th position in $A$ is S-type or L-type depending on whether $\mathrm{suf}(T, i)$ is an S-suffix or L-suffix, respectively). Thus, we can obtain $\Phi_\ell$. When there are more L-suffixes than S-suffixes, we obtain $\Psi_s$ by putting the S-suffixes in the L-type positions of $A$.

    2. Given $\Psi_s$, $PSV$ can be computed in linear time using only one integer array plus $O(1)$ extra workspace by rewriting $\Psi$ in-place. Once $PSV$ is obtained, LZoneT can compute $NSV$ from $PSV$ on-the-fly in the parsing step, which is symmetric to the way KKP2 computes $PSV$ from $NSV$ [11]. In a similar way, given $\Phi_\ell$, $NSV$ can be computed in linear time using $O(1)$ workspace by rewriting $\Phi$ in-place. Once $NSV$ is obtained, $PSV$ can be sequentially acquired in the parsing step (see KKP2n in [11]).

    3. By directly calling SACA-K, we can easily obtain $SA_\ell$ or $SA_s$ and then compute $\Phi_\ell$ or $\Psi_s$, respectively, by step (1). Once either $\Phi_\ell$ or $\Psi_s$ is obtained, we need only one scan to compute $\Phi$ or $\Psi$, respectively. In this way, we do not need to sort the L-suffixes the same way as BGoneT does, in which two scans are performed to obtain $\Phi$: one for sorting the L-suffixes, and the other for sorting the S-suffixes.

    Figure 1 shows the difference between BGone and LZone. In LZone, depending on the numbers of L- and S-suffixes, the LZ77 factorization is computed either from $\Psi$ or from $\Phi$. Neither execution path of LZone needs to compute the $\Psi_{lms}$ required by BGone, resulting in a faster speed.

    Figure 1. The Lempel–Ziv factorization computation processes of BGone and LZone.


    4.2. Computing $SA_s$ and $SA_\ell$ from $T$ by induced sorting

    The principle behind induced sorting is to induce the lexicographic order of the unsorted suffixes from the sorted suffixes, as observed in [23]:

    Lemma 1. Given that all of the L- or S-suffixes of $T$ are sorted, all of the suffixes of $T$ can be sorted in $O(n)$ time.

    However, as further observed in [24], not every S-suffix is needed for inducing the order of all the L-suffixes; knowing the order of all the LMS-suffixes is already enough for inducing the order of all of the suffixes.

    Lemma 2. Given that all of the LMS-suffixes of $T$ are sorted, all of the suffixes of $T$ can be sorted in $O(n)$ time.

    Based on the aforementioned lemmas, given that all of the LMS-suffixes of $T$ have been sorted and stored in $A[1..k]$, SACA-K can obtain the $SA$ of the string $T$ by performing the following four-step procedure in $O(n)$ time and $O(1)$ workspace. In these steps, the range of suffixes with an identical heading character $c$ is called a bucket in $A$, denoted as $\mathrm{bucket}(c)$. Each bucket is composed of a (possibly empty) sequence of L-suffixes followed by a (possibly empty) sequence of S-suffixes. Hence, $\mathrm{bucket}(c)$ can be divided into at most two sub-buckets $\mathrm{bucket}_\ell(c)$ and $\mathrm{bucket}_s(c)$ for the L- and S-suffixes, respectively.

    1. Initialize each item of $A[k+1..n]$ as EMPTY.
    2. Scan $A[1..k]$ from right to left, and put all of the LMS-suffixes into their buckets in $A$, from the end to the head of each bucket.
    3. Scan $A$ from left to right. For each non-empty position $i$, if $\mathrm{suf}(T, A[i]-1)$ is an L-suffix, then put $\mathrm{suf}(T, A[i]-1)$ in the left-most empty position of $\mathrm{bucket}(T[A[i]-1])$.
    4. Scan $A$ from right to left. For each non-empty position $i$, if $\mathrm{suf}(T, A[i]-1)$ is an S-suffix, put $\mathrm{suf}(T, A[i]-1)$ in the right-most empty position of $\mathrm{bucket}(T[A[i]-1])$.

    After step (3), all of the L-suffixes have been sorted, so we need only perform steps (1) through (3) to compute $SA_\ell$.

    Analogously, we can obtain the following result.

    Lemma 3. Given that all of the LML-suffixes of $T$ are sorted, all of the suffixes of $T$ can be sorted in $O(n)$ time.

    Thus, if $SA_{lml}$ has been obtained and stored in $A[1..k]$, we can also obtain the $SA$ of $T$ by performing the following four-step procedure in $O(n)$ time and using $O(1)$ workspace.

    1. Initialize each item of $A[k+1..n]$ as EMPTY.
    2. Scan $A[1..k]$ from left to right, and put all of the LML-suffixes into their buckets in $A$, from the head to the end of each bucket.
    3. Scan $A$ from right to left. For each non-empty position $i$, if $\mathrm{suf}(T, A[i]-1)$ is an S-suffix, put $\mathrm{suf}(T, A[i]-1)$ into the right-most empty position of $\mathrm{bucket}(T[A[i]-1])$.
    4. Scan $A$ from left to right. For each non-empty position $i$, if $\mathrm{suf}(T, A[i]-1)$ is an L-suffix, then put $\mathrm{suf}(T, A[i]-1)$ into the left-most empty position of $\mathrm{bucket}(T[A[i]-1])$.

    Obviously, we need only perform steps (1) through (3) to compute $SA_s$.

    As described previously, both $SA_s$ and $SA_\ell$ can be correctly computed in $O(n)$ time and $O(1)$ workspace.
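Both four-step procedures of this section can be sketched in a few lines of Python. The sketch favours readability over the $O(1)$-workspace guarantee: it keeps explicit type flags and bucket pointers and uses a separate output array, which SACA-K avoids. All names (`sort_from_lms`, `sort_from_lml`, `induce`, and so on) are ours. Seeding the sentinel suffix by hand in the LML variant is our assumption, because the sentinel is an S-suffix that no other suffix can induce.

```python
EMPTY = 0  # no suffix starts at position 0, so 0 can mark an empty slot

def s_type_flags(text):
    """is_s[i] is True when suf(T, i) is S-type (1-based; index 0 unused)."""
    n = len(text)
    is_s = [False] * (n + 1)
    is_s[n] = True
    for i in range(n - 1, 0, -1):
        cur, nxt = text[i - 1], text[i]
        is_s[i] = cur < nxt or (cur == nxt and is_s[i + 1])
    return is_s

def bucket_bounds(text):
    """Map each character c to the first and last slot of bucket(c)."""
    head, tail, pos = {}, {}, 1
    for c in sorted(set(text)):
        head[c] = pos
        pos += text.count(c)
        tail[c] = pos - 1
    return head, tail

def induce(a, text, is_s, want_s):
    """One pass: S-suffixes right-to-left into bucket tails, or L-suffixes left-to-right into heads."""
    n = len(text)
    head, tail = bucket_bounds(text)
    slot, step = (tail, -1) if want_s else (head, 1)
    for i in (range(n, 0, -1) if want_s else range(1, n + 1)):
        j = a[i] - 1
        if a[i] != EMPTY and j >= 1 and is_s[j] == want_s:
            c = text[j - 1]
            a[slot[c]] = j
            slot[c] += step

def sort_from_lms(text, sorted_lms):
    """Return (SA_l, SA) given the sorted LMS-suffixes."""
    n, is_s = len(text), s_type_flags(text)
    a = [EMPTY] * (n + 1)                        # step 1
    _, tail = bucket_bounds(text)
    for s in reversed(sorted_lms):               # step 2: fill bucket ends
        c = text[s - 1]
        a[tail[c]] = s
        tail[c] -= 1
    induce(a, text, is_s, want_s=False)          # step 3
    sa_l = [x for x in a[1:] if x != EMPTY and not is_s[x]]
    induce(a, text, is_s, want_s=True)           # step 4
    return sa_l, a[1:]

def sort_from_lml(text, sorted_lml):
    """Return (SA_s, SA) given the sorted LML-suffixes."""
    n, is_s = len(text), s_type_flags(text)
    a = [EMPTY] * (n + 1)                        # step 1
    head, _ = bucket_bounds(text)
    for s in sorted_lml:                         # step 2: fill bucket heads
        c = text[s - 1]
        a[head[c]] = s
        head[c] += 1
    a[1] = n                                     # assumption: seed the sentinel suffix
    induce(a, text, is_s, want_s=True)           # step 3
    sa_s = [x for x in a[1:] if x != EMPTY and is_s[x]]
    induce(a, text, is_s, want_s=False)          # step 4
    return sa_s, a[1:]
```

For $T = \texttt{mmiissiiss\$}$, feeding in $SA_{lms} = (11, 7, 3)$ or $SA_{lml} = (9, 5)$ reproduces the $SA_\ell$, $SA_s$, and $SA$ rows of Table I.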


    4.3. Computing $\Phi_\ell$/$\Psi_s$ from $SA_\ell$/$SA_s$

    In this section, we transform $SA_\ell$/$SA_s$ into the array-based linked list representation. Suppose that $SA_\ell$/$SA_s$ is stored in $A[1..k]$. Let $snum$ and $lnum$ be the number of S-suffixes and L-suffixes, respectively.

    4.3.1. If $lnum > snum$, compute $\Psi_s$. In this case, we compute $SA_s$ first, for there are enough L-type positions to store all of the S-suffixes. All of the suffixes in $SA_s$ must be linked from the lexicographically smallest suffix to the largest, so that we can later simulate the process in SACA-K for induced sorting the L-suffixes. For each $\mathrm{suf}(T, SA_s[i])$, its lexicographically succeeding S-type suffix $\mathrm{suf}(T, SA_s[i+1])$ will be put in $A[SA_s[i]]$, that is, $A[SA_s[i]] = SA_s[i+1]$, $A[0] = SA_s[1]$, and $A[SA_s[k]] = 0$. $\Psi_s$ is then obtained, in which $A[0]$ is the lexicographically smallest suffix in $SA_s$. This can be performed by the following steps.

    1. Reverse the order of the sorted S-suffixes in $SA_s$ so that all of the S-suffixes are stored in descending order in $A[1..k]$.

    2. Put all of the S-suffixes in L-type positions in $A$. Let $p$ and $q$ point to $A[k]$ and $A[n]$, respectively, that is, $p = k$ and $q = n$. Then, scan $T$ from right to left. If $\mathrm{suf}(T, q)$ is L-type (i.e., the $q$-th position is L-type), we put $\mathrm{suf}(T, A[p])$ in $A[q]$, i.e., $A[q] = A[p]$ and $A[p] =$ EMPTY. After this, both $p$ and $q$ are decreased by one; if the $q$-th position is S-type, only $q$ is decreased. In this way, all of the S-suffixes can be put in L-type positions. To determine the type of $\mathrm{suf}(T, q)$, we use an integer variable $type_{pre}$ to record the type of the previously scanned position: 1 for S-type and 0 for L-type. Because $\mathrm{suf}(T, n)$ is always an S-suffix, the $n$-th position of $A$ is always S-type. So, the variable $type_{pre}$ is initialized as 1, and we scan $T$ from $T[n-1]$. Given the value of $type_{pre}$, when processing $\mathrm{suf}(T, q)$, the type of the $q$-th position can be immediately obtained by comparing $T[q]$ and $T[q+1]$. Therefore, this step runs in $O(n)$ time using $O(1)$ workspace.

    3. Link up all of the S-suffixes to obtain $\Psi_s$. Because all of the S-suffixes are stored in the L-type positions of $A$, all of the S-type positions are empty. We scan $T$ from right to left again to find all non-empty L-type positions in $A$. Suppose that the $j_1$-th and $j_2$-th are such positions, and position $j_2$ is the one next to $j_1$. Put $p = A[j_1]$, $q = A[j_2]$, $A[p] = q$, and then $A[j_1] =$ EMPTY. Because position $p$ of $A$ must be empty, the previous operations can be performed without losing any values in $SA_s$. In this way, for each $\mathrm{suf}(T, i)$ in $SA_s$, we can put the succeeding suffix in $A[i]$, and this process needs to scan $A$ only once. Therefore, this step runs in $O(n)$ time using $O(1)$ workspace.

    For example, given $T = \texttt{mmiissiiss\$}$, Table II shows the result of each step for computing $\Psi_s$ from $SA_s$. In this table, row 'L/S' gives the type of each character. Because there are fewer S-type characters than L-type characters, $SA_s$ is computed, and the results of the steps for computing $\Psi_s$ from $SA_s$ are shown in rows '(1)', '(2)', and '(3)'.
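The three steps for $\Psi_s$ can be traced with the sketch below, which works on one array `a` of length $n + 1$ and tracks suffix types with a single boolean, as in the $type_{pre}$ trick. The names `psi_s_in_place` and `positions_by_type` are ours. We use $-1$ for EMPTY because 0 is needed as the end-of-list marker, and we assume the caller guarantees $lnum > snum$.

```python
EMPTY = -1  # 0 is reserved as the end-of-list marker

def positions_by_type(text):
    """Yield (q, is_s) for q = n, n-1, ..., 1, remembering only the previous type."""
    n, is_s = len(text), True          # position n (the sentinel) is S-type
    yield n, True
    for q in range(n - 1, 0, -1):
        cur, nxt = text[q - 1], text[q]
        is_s = cur < nxt or (cur == nxt and is_s)
        yield q, is_s

def psi_s_in_place(text, sa_s):
    n, k = len(text), len(sa_s)
    a = [EMPTY] * (n + 1)
    a[1:k + 1] = sa_s                  # SA_s is assumed to sit in A[1..k]
    a[1:k + 1] = a[k:0:-1]             # step 1: reverse into descending order
    p = k
    for q, is_s in positions_by_type(text):
        if not is_s and p >= 1:        # step 2: move A[p] into the L-type slot q
            if q != p:
                a[q], a[p] = a[p], EMPTY
            p -= 1
    last = 0
    for q, is_s in positions_by_type(text):
        if not is_s and a[q] != EMPTY: # step 3: link neighbours, smallest suffix first
            suffix, a[q] = a[q], EMPTY
            a[last] = suffix
            last = suffix
    a[last] = 0                        # the largest S-suffix has no successor
    return a
```

For $T = \texttt{mmiissiiss\$}$ and $SA_s = (11, 7, 3, 8, 4)$, the returned array holds 11, 8, 0, 3, 4, 7 at indices 0, 3, 4, 7, 8, 11 and EMPTY elsewhere.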

    4.3.2. If lnum 6 snum, compute ˆ`. We compute SA` first. All of the suffixes in SA`should be linked from the lexicographically largest suffix to the smallest for induced sorting theS-suffixes. That is, we make all of the L-suffixes an array ˆ` so that A[0] stores the lexicograph-ically largest suffix in SA`. For each suf.T; SA`Œi �/, its lexicographically preceding L-type suffix

    Table II. An example for computing ‰s from SAs .

    Index 0 1 2 3 4 5 6 7 8 9 10 11

    T m m i i s s i i s s $L/S L L S S L L S S L L SSAs 11 7 3 8 4.1/ 4 8 3 7 11.2/ 4 8 3 7 11.3/ 11 8 0 3 4 7‰s 11 8 0 3 4 7

    Copyright © 2015 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2016; 46:1201–1217DOI: 10.1002/spe

  • IMPROVING A LIGHTWEIGHT LZ77 COMPUTATION ALGORITHM FOR RUNNING FASTER 1209

    Table III. An example for computing ˆ` from SA`.

    Index 0 1 2 3 4 5 6 7 8 9 10 11

    T e e i i s s i i s s $L/S S S S S L L S S L L SSA` 10 6 9 5.1/ 10 6 9 5.2/ 5 9 10 6 0ˆ` 5 9 10 6 0

    suf.T; SA`Œi � 1�/ will be put in A[SA`[i]], i.e., A[SA`[i]] = SA`Œi � 1�. Thus,ˆ` can be obtained.This can be carried out by the following steps.

    1. Put all of the L-suffixes in the S-type positions of A. This step is similar to putting all of the L-suffixes in the S-type positions of A. Note that it does not need to reverse the sorted L-suffixesin SA`.

    2. Link up all of the L-suffixes to obtainˆ`. Because all of the L-suffixes are stored in the S-typepositions of A, all L-type positions are empty. We scan T from right to left to find all of thenon-empty S-type positions of A. Suppose position j1 and j2 are such positions, and positionj2 is the next one to j1. Put p = AŒj1�, q = AŒj2�, AŒp� = q and then AŒj1� = EMPTY. In thisway, for each suf.T; i/ in SA`, we can put the preceding suffix in AŒi�. The process scans Tonce only. Therefore, this step runs also in O.n/ time and O.1/ workspace.

    For T D eeiissiiss$ as an example, Table III shows the result of each step for computing ˆ` fromSA` . In this table, because the number of L-type characters is less, SA` is computed, and the resultof each step for computing ˆ` from SA` is shown in rows ‘(1)’ and ‘(2)’.
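    The target of this computation can be checked with a direct, non-in-place Python sketch of the definition A[SA_ℓ[i]] = SA_ℓ[i − 1]. It deliberately ignores the two in-place steps above and only shows what Φ_ℓ must contain; the function names phi_l_from_sa_l and walk, the marker EMPTY = -1, and the end-of-list value 0 are assumptions for illustration.

```python
EMPTY = -1  # assumed marker for an unused slot of A


def phi_l_from_sa_l(sa_l, n):
    """Build Phi_l as a linked list from the largest to the smallest L-suffix.

    sa_l : start positions of the L-suffixes in ascending lexicographic order
    n    : length of T (positions are 1-based, A[0] is the list head)
    """
    A = [EMPTY] * (n + 1)
    A[0] = sa_l[-1]                      # head: lexicographically largest L-suffix
    for i in range(len(sa_l) - 1, 0, -1):
        A[sa_l[i]] = sa_l[i - 1]         # predecessor of suf(T, SA_l[i])
    A[sa_l[0]] = 0                       # assumed end-of-list marker
    return A


def walk(A):
    """Follow the links from A[0] and return the visited suffixes in order."""
    out, i = [], A[0]
    while i != 0:
        out.append(i)
        i = A[i]
    return out
```

    For T = eeiissiiss$ with SA_ℓ = (10, 6, 9, 5), walking the list visits the L-suffixes in descending order 5, 9, 6, 10.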

    4.4. Computing Ψ/Φ from Ψ_s/Φ_ℓ using O(1) workspace

    Firstly, we consider the situation in which the L-suffixes outnumber the S-suffixes, that is, lnum > snum. In this situation, we should compute Ψ from Ψ_s. Note that all of the S-suffixes are stored in A, Ψ_s is an array-based singly linked list, and Ψ_s[0] is the lexicographically smallest suffix in the list. Based on Lemma 1 in Section 4.2, we can simulate the process for sorting the L-suffixes by accessing A. Thus, Ψ will be obtained finally.

    To accomplish this task, we view this process as scanning all of the buckets in A in ascending order and simulate the method used by BGoneT to sort all of the S-suffixes from Φ_ℓ. For a character c and its bucket bucket(c), we scan bucket_ℓ(c) first, then bucket_s(c). Four integer arrays, Lbkts[c], Lbkte[c], Sbkts[c], and Sbkte[c], each of size |Σ|, are required. We use Lbkts[c] and Lbkte[c] to store the lexicographically smallest and largest suffixes, respectively, in bucket_ℓ(c), and use Sbkts[c] and Sbkte[c] to store the lexicographically smallest and largest suffixes, respectively, in bucket_s(c).

    The concrete steps for sorting the L-suffixes in LZoneT with the given Ψ_s are as follows.

    1. Initialize each item of Lbkts[c], Lbkte[c], Sbkts[c], and Sbkte[c] as EMPTY.

    2. Scan Ψ_s once to compute Sbkts[c] and Sbkte[c] for each bucket(c).

    3. Scan all of the buckets in lexicographically ascending order to sort and store all of the L-suffixes in their buckets in A. For each bucket(c), we scan from Lbkts[c] to Lbkte[c], then Sbkts[c] to Sbkte[c].

    After step (2), we can obtain the start and end suffixes for each bucket_s(c). Note that all of the suffixes in bucket_s(c) have already been linked together from the smallest to the largest before step (1).

    In step (3), the values of Lbkts[c] and Lbkte[c] are updated dynamically. Suppose that we are processing suf(T, i). Let j = i − 1. If suf(T, j) is an L-suffix, we put it in bucket(T[j]); otherwise, we do nothing. Because each bucket is divided into bucket_ℓ(c) and bucket_s(c), the type of suf(T, i − 1) can be determined in constant time by comparing T[i − 1] and T[i] while scanning either part. For example, suf(T, i − 1) is L-type if T[i − 1] ≥ T[i] when scanning bucket_ℓ(c), and suf(T, i − 1) is L-type if T[i − 1] > T[i] when scanning bucket_s(c). Then, we can put the L-type suf(T, j) in bucket(T[j]) according to the following steps.

    (1) Check whether Lbkts[T[j]] is empty or not.

    (2) If Lbkts[T[j]] is empty, suf(T, j) is the lexicographically smallest suffix in bucket_ℓ(T[j]), that is, the starting suffix of bucket(T[j]). We set Lbkts[T[j]] = Lbkte[T[j]] = j.

    (3) If Lbkts[T[j]] is non-empty, the end suffix of bucket_ℓ(T[j]) is smaller than suf(T, j). We can put suf(T, j) in A by setting A[Lbkte[T[j]]] = j and update the end of bucket_ℓ(T[j]) by setting Lbkte[T[j]] = j.

    In this way, each of the L-suffixes can be put in its corresponding position in bucket_ℓ(c) of A. All of the suffixes in the same bucket_ℓ(c) can be linked together from the lexicographically smallest to the largest. Algorithm 2 shows the pseudo code for computing Ψ from Ψ_s, and an example is given in Table IV for illustrating this algorithm on T = mmiissiiss$.

    Table IV. An example for computing Ψ from Ψ_s.

    Index    0   1   2   3   4   5   6   7   8   9  10  11
    T            m   m   i   i   s   s   i   i   s   s   $
    L/S          L   L   S   S   L   L   S   S   L   L   S
    Ψ_s     11           8   0           3   4           7
    11      11           8   0           3   4           7
    7       11           8   0           3   4       6   7
    3       11           8   0           3   4       6   7
    8       11           8   0           3   4       6   7
    4       11           8   0           3   4       6   7
            11           8   2           3   4       6   7
    2       11       1   8   2           3   4       6   7
    1       11       1   8   2           3   4       6   7
            11  10   1   8   2           3   4       6   7
    10      11  10   1   8   2       9   3   4       6   7
    6       11  10   1   8   2       9   3   4   5   6   7
    9       11  10   1   8   2       9   3   4   5   6   7
    5       11  10   1   8   2   0   9   3   4   5   6   7
    Ψ       11  10   1   8   2   0   9   3   4   5   6   7

    In this example, the work array A is reused to store the input Ψ_s, the output Ψ and all the temporary data generated during the computation process. Initially, Ψ_s is sparsely stored in the S-type positions of A. The rows between Ψ_s and Ψ give the results of scanning all the suffixes in increasing order, where each row gives the status of A after scanning the suffix indexed by the first column of this row; for example, the row '11' is the status of A after scanning suf(T, 11). Notice that the suffixes in the first column between Ψ_s and Ψ are lexicographically increasing, which is the same order as that in the SA of T. There are four buckets to be scanned in total, that is, bucket($), bucket(i), bucket(m) and bucket(s). From Ψ_s, we obtain that Sbkts[$] = Sbkte[$] = 11, Sbkts[i] = 7 and Sbkte[i] = 4. The arrays Lbkts and Lbkte are initialized as empty.

    First, the smallest bucket bucket($) is scanned. There is only one suffix belonging to this bucket, that is, suf(T, 11), which is stored in A[0]. The preceding suffix of suf(T, 11) is suf(T, 10). Given T[10] = s being L-type, suf(T, 10) is put in bucket(s). Because both Lbkts[s] and Lbkte[s] are currently empty, and the previous suffix of suf(T, 10) in SA is currently unknown, we will not save suf(T, 10) to Ψ but set Lbkts[s] = Lbkte[s] = 10 to record suf(T, 10) instead. Later on, before scanning bucket(s), suf(T, 10) is saved to Ψ. That is, at the row between '1' and '10', when the preceding suffix of suf(T, 10) is scanned and known as suf(T, 1), suf(T, 10) is saved to Ψ.

    Next, the bucket bucket(i) is scanned, by scanning bucket_ℓ(i) first and then bucket_s(i). In this case, given the empty bucket_ℓ(i), only bucket_s(i) needs to be scanned. The smallest suffix in bucket_s(i) is given by Sbkts[i] = 7 as suf(T, 7), and its preceding suffix is suf(T, 6). Because T[6] = s is L-type, suf(T, 6) is put in bucket(s). Given Lbkte[s] = 10, the previous suffix of suf(T, 6) in SA is determined as suf(T, 10), so we put suf(T, 6) in A[10] and underline it to indicate that it is the L-suffix newly inserted into Ψ in this step. Meanwhile, Lbkte[s] is updated as 6 to record that the current end suffix of bucket_ℓ(s) is suf(T, 6).

    Similarly, the same method is applied to sort the other L-suffixes at rows '3' to '5'. Finally, we obtain the result Ψ in the last row.

    Because the sequential access to SA and putting each L-type suffix in its correct position can be carried out in O(1) time, the whole process runs in O(n) time using only one single integer array A and O(1) workspace in total.
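    The bucket scan described above can be sketched in Python as follows. This is a simplified illustration rather than the paper's Algorithm 2: the function name psi_from_psi_s and the helper induce are hypothetical, Python dictionaries stand in for the four |Σ|-sized arrays, A[0] is assumed to hold the list head, and 0 is assumed to mark the end of the list.

```python
def psi_from_psi_s(text, A):
    """Extend the S-suffix list Psi_s stored in A to the full list Psi, in place.

    text : the string T, addressed as T[1..n] below
    A    : list of size n + 1 with A[0] = smallest S-suffix and 0 = end of list
    """
    T = " " + text                               # dummy T[0], so T is 1-indexed
    lbkts, lbkte, sbkts, sbkte = {}, {}, {}, {}  # a missing key plays the role of EMPTY

    # Step 2: one pass over Psi_s records the first and last S-suffix per bucket.
    i = A[0]
    while i != 0:
        sbkts.setdefault(T[i], i)
        sbkte[T[i]] = i
        i = A[i]

    def induce(i, scanning_l_bucket):
        """Append suf(T, i - 1) to its L-bucket if it is an L-suffix."""
        j = i - 1
        if j < 1:
            return
        is_l = T[j] >= T[i] if scanning_l_bucket else T[j] > T[i]
        if not is_l:
            return
        c = T[j]
        if c not in lbkts:                       # first L-suffix of bucket(c)
            lbkts[c] = lbkte[c] = j
        else:                                    # link after the current end
            A[lbkte[c]] = j
            lbkte[c] = j

    # Step 3: scan the buckets in ascending order, L part before S part.
    last = 0                                     # last suffix scanned so far (0 = head slot)
    for c in sorted(set(text)):
        for starts, ends, is_l_part in ((lbkts, lbkte, True), (sbkts, sbkte, False)):
            if c not in starts:
                continue
            A[last] = starts[c]                  # connect the previous segment to this one
            i = starts[c]
            while True:
                induce(i, is_l_part)
                if i == ends[c]:                 # ends[c] may have grown during induce
                    break
                i = A[i]
            last = ends[c]
    A[last] = 0
    return A
```

    Starting from the Ψ_s row of Table IV, the suffixes are scanned in the order 11, 7, 3, 8, 4, 2, 1, 10, 6, 9, 5, and the array ends up equal to the Ψ row.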

    Now, we consider the situation in which the S-suffixes are at least as many as the L-suffixes, that is, lnum ≤ snum. In this case, we need to compute Φ from Φ_ℓ. In Section 4.3, we presented the procedures to compute Φ_ℓ, in which all of the L-suffixes are stored in A, Φ_ℓ is an array-based singly linked list, and Φ_ℓ[0] is the lexicographically largest suffix in the list. We can scan all of the L-suffixes and S-suffixes in A in lexicographically descending order starting from Φ_ℓ[0]. Based on Lemma 1, we can sort the S-suffixes and store them in A. Thus, Φ can be obtained. This problem has been well addressed in [15, 16], so we will not discuss it here. Note that in [15, 16], BGoneT computes Φ_ℓ based on the LMS suffixes, and all of the L-suffixes will be linked in increasing order. So, BGoneT needs to rewrite the work array A to reverse the direction of the links before sorting the S-suffixes.


    4.5. In-place computing PSV/NSV from Ψ/Φ

    Goto and Bannai [15, 16] showed that NSV can be computed in-place from Φ. Similarly, given Ψ, the PSV can also be computed in-place. In this section, we will show how to compute PSV in-place.

    Because Ψ[i] is the immediate lexicographical successor of suf(T, i), we can think of Ψ as an array-based singly linked list, which links all elements of SA from left to right. Then, we can simulate the process for sorting the L-suffixes in SACA-K by accessing the Ψ array starting from the lexicographically smallest suffix, Ψ[0]. Because the suffix stored in Ψ[i] is no longer required to be kept after it has been processed, Ψ[i] can be rewritten to PSV[i]. Algorithm 3 shows the pseudo code for computing PSV from Ψ.

    The following corollary can be obtained from Lemma 4.1 in [21].

    Corollary 1. Given Ψ of an input string T, the PSV of T can be computed from Ψ in O(n) time and in-place using O(1) workspace.
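    A minimal Python sketch of this in-place rewriting is given below. It is not the paper's Algorithm 3. It assumes that PSV[i] is the start position of the nearest suffix that precedes suf(T, i) in suffix-array order and starts before position i, with 0 meaning that no such suffix exists; the function name psv_from_psi and the convention that A[0] is the list head and 0 terminates the list are also assumptions.

```python
def psv_from_psi(A):
    """Overwrite Psi (A[0] = head, 0 = end of list) with PSV for positions 1..n."""
    prev, cur = 0, A[0]
    while cur != 0:
        nxt = A[cur]          # read the successor before its slot is reused
        while prev > cur:     # skip candidates that start after position cur
            prev = A[prev]    # A[prev] already holds PSV[prev]
        A[cur] = prev         # 0 means "no previous smaller value"
        prev, cur = cur, nxt
    return A
```

    Each slot A[i] is read once as a Ψ link and afterwards only as a PSV value, which is why a single array suffices. A[0] keeps the list head and is not part of the PSV.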

    With the given PSV, we can compute NSV and the LZ77 based on the following lemma.

    Lemma 4 ([11]). Given the PSV of a string T of length n, NSV[i] of T can be sequentially obtained for all i = 1, ..., n in O(n) time using O(1) workspace excluding the PSV and T.

    Thus, if the L-suffixes outnumber the S-suffixes, we can obtain the LZ77 of a given string T with a linear in-place algorithm which computes Ψ_s first, then Ψ from Ψ_s, and finally rewrites Ψ to PSV, using only one single integer array plus O(1) extra workspace; otherwise, the algorithm computes Φ_ℓ first, then Φ from Φ_ℓ, and finally rewrites Φ to NSV, using only one single integer array plus O(1) extra workspace.
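    To show how the two arrays lead to the factorization, the following Python sketch materializes both PSV and NSV and then emits the factors. It is not the in-place sequential method of Lemma 4 and it uses naive suffix sorting for brevity. The function names psv_nsv and lz77, and the output convention of (source position, length) pairs with (character, 0) for a character that has not occurred before, are assumptions for illustration.

```python
def psv_nsv(sa):
    """sa lists the 1-based suffix start positions in lexicographic order."""
    n = len(sa)
    psv, nsv = [0] * (n + 1), [0] * (n + 1)
    prev = 0
    for cur in sa:
        while prev > cur:            # cur is the next smaller value for prev
            nsv[prev] = cur
            prev = psv[prev]
        psv[cur] = prev
        prev = cur
    return psv, nsv


def lz77(text):
    """Return the LZ77 factors of text as (source, length) or (char, 0)."""
    n = len(text)
    T = " " + text                   # dummy T[0], so T is 1-indexed
    sa = sorted(range(1, n + 1), key=lambda i: text[i - 1:])  # naive, for illustration
    psv, nsv = psv_nsv(sa)

    def lcp(i, j):
        """Length of the longest common prefix of suf(T, i) and suf(T, j), j < i."""
        length = 0
        while j and i + length <= n and T[i + length] == T[j + length]:
            length += 1
        return length

    factors, i = [], 1
    while i <= n:
        lp, ln = lcp(i, psv[i]), lcp(i, nsv[i])
        length, src = (lp, psv[i]) if lp >= ln else (ln, nsv[i])
        factors.append((src, length) if length else (T[i], 0))
        i += max(length, 1)
    return factors
```

    For T = mmiissiiss$, the sketch yields eight factors, the longest being the copy of length 4 from position 3 that covers T[7..10] = iiss.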

    4.6. LZoneSA

    Given that the SA of T is already available in some applications, we provide another variant of LZoneT, called LZoneSA, to compute the LZ77 from the SA efficiently. LZoneSA differs from LZoneT only in the way SA_ℓ or SA_s is computed. In Algorithm 1, that is, LZoneT, SA_ℓ or SA_s is computed from T. However, given that the SA is already known in the case of LZoneSA, SA_ℓ or SA_s can be computed efficiently as follows. For each character c, we count the numbers of the L-suffixes and the S-suffixes by scanning T from right to left, and then determine the range of bucket_s(c) or bucket_ℓ(c) in SA. In this way, we can quickly obtain SA_ℓ or SA_s from the given SA.
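    The counting idea can be sketched in Python as follows. The sketch relies on the fact that, inside each bucket(c) of the SA, the L-suffixes precede the S-suffixes. The function name split_sa is hypothetical, and for clarity the sketch returns both SA_ℓ and SA_s as new lists, whereas an actual implementation would need only one of them.

```python
from collections import Counter


def split_sa(text, sa):
    """Split a suffix array (1-based positions) into (SA_l, SA_s)."""
    n = len(text)
    l_cnt, s_cnt = Counter(), Counter()
    is_s = True                                  # suf(T, n) is always S-type
    s_cnt[text[n - 1]] += 1
    for q in range(n - 2, -1, -1):               # 0-based scan from right to left
        is_s = text[q] < text[q + 1] or (text[q] == text[q + 1] and is_s)
        (s_cnt if is_s else l_cnt)[text[q]] += 1

    sa_l, sa_s, start = [], [], 0
    for c in sorted(set(text)):                  # buckets in ascending order
        mid = start + l_cnt[c]                   # end of bucket_l(c)
        end = mid + s_cnt[c]                     # end of bucket_s(c)
        sa_l += sa[start:mid]
        sa_s += sa[mid:end]
        start = end
    return sa_l, sa_s
```

    For T = mmiissiiss$, the S-suffixes occupy the ranges of bucket_s($) and bucket_s(i), which gives SA_s = (11, 7, 3, 8, 4).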


    5. PERFORMANCE EVALUATION

    We compare the performance of nine programs in this evaluation experiment: KKP1, KKP2, KKP3, BGtwo, BGoneT, BGoneSA, LZoneT, LZoneSA and gzip (version 1.3.12), where gzip is a file compression utility commonly shipped with Linux. Both BGoneT and LZoneT compute the LZ77 directly from T in O(n) time using only one single integer array. BGtwo uses two integer arrays. The SA needed in other programs is computed by divsufsort, which was downloaded from https://code.google.com/p/libdivsufsort/.

    The experiment was performed on a computer with a 2.20 GHz Intel(R) Xeon(R) CPU E5-2407 processor, 16 GiB RAM and Linux (CentOS 6.4 Final 64-bit). All programs were compiled by g++ with options '-fomit-frame-pointer -W -Wall -Winline -DNDBUG -O3'. The running times were measured in seconds, starting after the input string T had been read into memory, and the average of five runs was reported.

    The datasets used in this experiment are available at http://pizzachili.dcc.uchile.cl/texts.html. Table V shows the details of the datasets. Table VI shows the runtime of the algorithms. The runtime of KKP1 includes the writing and reading time of the SA to and from the disk. The total time (in seconds) for each algorithm is the sum of all the times for running the algorithm on all the input data. The mean time (in seconds per MiB) for each algorithm is the total time divided by the total size (in MiB) of all the input data.

    The results show that for computing the LZ77 from an input string, KKP3 and gzip are on average 0.39/0.17 − 1 = 1.29 and 0.39/0.15 − 1 = 1.60 times faster than LZoneT, respectively, that is, 129% and 160% faster. LZoneT and LZoneSA are on average 0.49/0.39 − 1 = 0.26 and 0.47/0.30 − 1 = 0.57 times faster than BGoneT and BGoneSA, respectively, that is, 26% and 57% faster.

    To further compare the performance of BGoneT and LZoneT in different phases, we divide the whole process into three phases as follows. For BGoneT: (i) compute SA_lms from T; (ii) compute Φ_ℓ from SA_lms; (iii) compute LZ77 from Φ_ℓ. For LZoneT: (i) compute SA_ℓ/SA_s from T; (ii) compute Φ_ℓ/Ψ_s from SA_ℓ/SA_s; (iii) compute LZ77 from Φ_ℓ/Ψ_s.
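    The arithmetic behind these figures can be reproduced with a few lines of Python. The totals and the overall size of 2127.66 MiB are taken from Tables V and VI; the names TOTAL_MIB, total_seconds, mean and faster_by are chosen here only for illustration. As in the text, the ratios are computed from the means rounded to two decimals.

```python
TOTAL_MIB = 2127.66  # total size of all datasets (Table V)

# Total running times in seconds (Table VI).
total_seconds = {
    "KKP3": 359.36,
    "BGoneT": 1042.51,
    "BGoneSA": 990.93,
    "LZoneT": 830.42,
    "LZoneSA": 643.14,
    "gzip": 309.55,
}

# Mean time in seconds per MiB, rounded as in the table.
mean = {name: round(sec / TOTAL_MIB, 2) for name, sec in total_seconds.items()}


def faster_by(slow, fast):
    """Relative speed advantage of `fast` over `slow`, e.g. 0.26 means 26% faster."""
    return round(mean[slow] / mean[fast] - 1, 2)
```

    For instance, faster_by("BGoneT", "LZoneT") evaluates 0.49/0.39 − 1 and faster_by("BGoneSA", "LZoneSA") evaluates 0.47/0.30 − 1.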

    Table V. Datasets used in the experiments.

    Name              Size (MiB)   |Σ|   Description
    Proteins.200          200.00    25   Swissprot database
    English.200           200.00   225   Gutenberg project
    Dna.200               200.00    16   Human Genome
    Sources.200           200.00   230   Linux and GCC code
    Coreutils             195.77   236   GNU coreutils source
    Cere                  439.92     5   Baking yeast Genome
    Kernel                246.01   160   Linux kernel source
    Einstein.en.txt       445.96   139   Wikipedia articles
    Total                2127.66     -   Total number of characters

    Table VI. Time (in seconds) for computing Lempel–Ziv factorization from T for linear-time algorithms.

    Algorithm             KKP1     KKP2     KKP3    BGtwo   BGoneT  BGoneSA   LZoneT  LZoneSA     gzip
    Proteins.200         45.41    45.25    42.75    64.65   114.72   108.10    96.14    74.97    13.44
    English.200          39.98    39.89    37.39    60.23   112.18   104.36    91.30    67.69    22.11
    Dna.200              43.06    42.91    39.51    63.55   112.16   109.82    91.91    74.09    62.52
    Sources.200          29.68    29.48    27.65    47.70    84.78    78.09    67.99    52.09    11.52
    Coreutils            30.32    30.22    28.85    48.53    86.10    81.48    61.80    52.01    11.46
    Cere                 82.72    82.52    80.26   128.58   232.00   224.17   185.19   136.20   135.57
    Kernel               38.59    38.46    36.93    64.20   123.02   115.37    92.97    68.24    15.80
    Einstein.en.txt      67.72    67.52    66.02   108.50   177.55   169.54   143.12   117.88    37.21
    Total (s)           377.48   376.25   359.36   585.94  1042.51   990.93   830.42   643.14   309.55
    Mean (s/MiB)          0.18     0.18     0.17     0.28     0.49     0.47     0.39     0.30     0.15


    Table VII shows the time consumption in each phase for computing the LZ77 from T. BGoneT is faster for the first phase, but it took roughly six to seven times longer than LZoneT for the second phase (322.11 s versus 50.98 s in total). In the third phase, the time gaps between BGoneT and LZoneT are small. In Tables VII and VIII, the speedup for each set of input data is the ratio of the speed of LZone to that of BGone, that is, the ratio of the time consumed by BGone to that by LZone. Table VIII gives the time consumption for computing LZ77 from SA. The result shows that once the SA has been obtained, the speed of LZoneSA is on average twice that of BGoneSA.

    The key reason for LZone to run faster than BGone is its new way to compute Φ_ℓ or Ψ_s from SA_ℓ or SA_s, respectively, in the second phase. For example, having obtained SA_lms, BGone performs induced sorting in a non-sequential way to compute Φ_ℓ, which needs to access Lbkts[c] and Lbkte[c] frequently to dynamically mark the head or end of bucket_ℓ(c) for putting all the L-suffixes into their proper positions in A given by Lbkts[c] or Lbkte[c]. This process requires frequent random accesses to A. Different from BGone, LZone computes SA_ℓ or SA_s and then sequentially scans A twice for computing Φ_ℓ or Ψ_s, respectively: once to put all the L-suffixes or S-suffixes into the S- or L-type positions in A, respectively, and once more to sequentially link all these L-suffixes or S-suffixes to produce Φ_ℓ or Ψ_s, respectively. By doing so, LZone avoids random accesses to A and hence runs faster.

    Table VII. Time (in seconds) in each step for computing Lempel–Ziv factorization from T by BGoneT and LZoneT. Steps for BGoneT: (i) T → SA_lms, (ii) SA_lms → Φ_ℓ, (iii) Φ_ℓ → LZ77. Steps for LZoneT: (i) T → SA_ℓ/SA_s, (ii) SA_ℓ/SA_s → Φ_ℓ/Ψ_s, (iii) Φ_ℓ/Ψ_s → LZ77.

    Algorithm                     BGoneT                          LZoneT
    Step                 (i)    (ii)   (iii)   Total     (i)    (ii)   (iii)   Total  Speedup
    Proteins.200       45.93   32.68   36.09  114.70   59.88    4.91   31.81   96.60     1.19
    English.200        41.99   34.13   36.17  112.29   56.87    4.83   29.65   91.35     1.23
    Dna.200            39.14   34.89   38.02  112.05   53.59    4.87   33.48   91.94     1.22
    Sources.200        31.52   25.35   28.09   84.96   40.13    4.54   23.48   68.15     1.25
    Coreutils          28.81   26.38   30.74   85.93   36.41    4.35   20.97   61.73     1.39
    Cere               84.82   74.26   73.08  232.16  124.91   10.76   50.25  185.92     1.25
    Kernel             42.90   39.44   40.73  123.07   59.14    5.80   27.97   92.91     1.32
    Einstein.en.txt    70.06   54.98   52.42  177.46   85.96   10.92   45.97  142.85     1.24
    Total (s)         385.17  322.11  335.34 1042.62  516.89   50.98  263.58  831.45     1.25
    Mean (s/MiB)        0.18    0.15    0.16    0.49    0.24    0.02    0.12    0.39     1.26

    Table VIII. Time (in seconds) for computing Lempel–Ziv factorization from SA by BGoneSA and LZoneSA.

    Algorithm          BGoneSA  LZoneSA  Speedup
    Proteins.200         71.99    37.26     1.93
    English.200          73.09    35.60     2.05
    Dna.200              76.49    39.59     1.93
    Sources.200          55.90    28.86     1.94
    Coreutils            56.80    26.16     2.17
    Cere                154.39    63.69     2.42
    Kernel               83.38    34.86     2.39
    Einstein.en.txt     113.78    59.19     1.92
    Total (s)           685.83   325.21     2.11
    Mean (s/MiB)          0.32     0.15     2.11

    In addition to the linear-time LZ77 factorization algorithms described previously, there are also some non-linear-time LZ77 factorization algorithms that are practically fast and space economical. For example, the most space economical LZ77 factorization algorithm, called LZscan [25], is a non-linear-time algorithm that runs in O(dn) time and O((n log n)/d) bits of working space, where the parameter d is used for the space–time trade-off. LZscan divides T into d = ⌈n/b⌉ fixed-size blocks of length b, then parses the blocks one by one. LZscan runs slower when d is increased.

    ISA6s [22], which is another non-linear-time LZ77 factorization algorithm and runs in O(n log σ) time and (1 + ε)n log n + n + O(σ log n) bits of space, appears to be faster than LZone, but it needs much more space than LZone. More details for these two algorithms are available at https://www.cs.helsinki.fi/group/pads/lz77.html. Table IX shows the running times of these two non-linear-time algorithms compared with that of LZone. In this experiment, the parameter b for LZscan is chosen such that LZscan uses approximately the same space as that of LZone. The values of b for LZscan on each dataset in this table, listed from top to bottom, are {29, 29, 29, 29, 29, 65, 36, 66} in units of MiB.

    Table IX. Time (in seconds) for computing Lempel–Ziv factorization from T for both linear-time and non-linear-time algorithms.

    Algorithm           LZscan    ISA6s   LZoneT  LZoneSA
    Proteins.200        229.41    62.04    96.14    74.97
    English.200         188.30    52.34    91.30    67.69
    Dna.200             206.00    54.40    91.91    74.09
    Sources.200         191.08    38.03    67.99    52.09
    Coreutils            71.53    31.24    61.80    52.01
    Cere                206.79    85.49   185.19   136.20
    Kernel               73.86    40.04    92.97    68.24
    Einstein.en.txt      93.65    68.75   143.12   117.88
    Total (s)          1261.52   423.33   830.42   643.14
    Mean (s/MiB)          0.59     0.20     0.39     0.30

    For the space succinct algorithms LZscan, LZoneT, and LZoneSA, we see that both LZoneT and LZoneSA run faster than LZscan except for the last two datasets. In order to have LZscan run at a speed almost equal to that of LZoneT, the values of b are changed to {101, 95, 100, 100, 64, 95, 22, 5}. That is, b increases for all datasets except the last two. The reason is that, for LZscan on a given n, the time is O(dn) = O(n²/b) and the working space is O((n log n)/d) = O(b log n) bits, which means that a bigger b gives a faster speed and uses more space. In Table IX, LZscan runs faster than LZoneT only for the last two datasets. As a result, b is reduced only for the last two datasets, from {36, 66} to {22, 5}, to slow LZscan down, and is increased for all the other datasets to accelerate LZscan.

    6. CONCLUSION

    We presented a linear-time algorithm called LZone to compute the LZ77 for an input string over a constant alphabet, which requires only n words of workspace. LZone has two variations, LZoneT and LZoneSA. The former directly computes the LZ77 from the input string T, while the latter computes the LZ77 from the suffix array of T. Given T and/or its suffix array, we can obtain the integer array Φ_ℓ or Ψ_s. Once Φ_ℓ or Ψ_s is available, Φ or Ψ can be computed by simulating the sorting process of SACA-K. Then, NSV or PSV can be further computed from Φ or Ψ. Finally, the LZ77 can be obtained from either NSV or PSV. Note that BGoneT and LZoneT compute LZ77 directly from string T and do not need to compute the whole SA of string T in advance. Thus, more space can be saved when one needs to compute the LZ77 of a large-scale dataset.

    The difference between BGone and LZone is that BGone first computes SA_lms, while LZone first computes SA_ℓ/SA_s. BGone uses the fact that at most one of two neighboring suffixes is LMS, while LZone uses the fact that the numbers of L-suffixes and S-suffixes cannot both exceed ⌊n/2⌋. In the whole computation, a number of techniques are employed to rewrite the various auxiliary integer arrays from one to another in-place and in linear time. Our experimental study shows that in practice LZone runs much faster than BGone with the same space complexity. While LZone gives a faster way for computing the LZ77 in a small space, we expect that our program used in the experimental study can be better engineered by experts for further acceleration.


    ACKNOWLEDGEMENTS

    DEGP of China, The Research Grant Council of Hong Kong SAR; contract/grant number DEGP2014KTSCX007, GRF 810012.

    REFERENCES

    1. Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 1977; 23(3):337–343. DOI: 10.1109/TIT.1977.1055714.

    2. Bell T, Kulp D. Longest-match string searching for Ziv–Lempel compression. Software: Practice and Experience 1993; 23(7):757–771. DOI: 10.1002/spe.4380230705.

    3. Thies W, Hall S. Manipulating lossless video in the compressed domain. Proceedings of the 17th ACM International Conference on Multimedia, MM '09, ACM, New York, NY, USA, 2009; 331–340.

    4. Fraser CW. An instruction for direct interpretation of LZ77-compressed programs. Software: Practice and Experience 2006; 36(4):397–411. DOI: 10.1002/spe.702.

    5. Abel J. Incremental frequency count: a post BWT-stage for the Burrows–Wheeler compression algorithm. Software: Practice and Experience 2007; 37(3):247–265. DOI: 10.1002/spe.763.

    6. Langiu A. On parsing optimality for dictionary-based text compression: the zip case. Journal of Discrete Algorithms 2013; 20(0):65–70. DOI: 10.1016/j.jda.2013.04.001.

    7. Ferragina P, Manzini G. Indexing compressed text. Journal of the ACM 2005; 52(4):552–581. DOI: 10.1145/1082036.1082039.

    8. Claude F, Fariña A, Martínez-Prieto MA, Navarro G. Indexes for highly repetitive document collections. Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, ACM, New York, NY, USA, 2011; 463–468.

    9. Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ. A faster grammar-based self-index. In Language and Automata Theory and Applications, vol. 7183, Dediu AH, Martín-Vide C (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2012; 240–251.

    10. Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ. LZ77-based self-indexing with faster pattern matching. In Combinatorial Pattern Matching, vol. 8392, Pardo A, Viola A (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2014; 731–742.

    11. Kärkkäinen J, Kempa D, Puglisi SJ. Linear time Lempel–Ziv factorization: Simple, fast, small. In Combinatorial Pattern Matching, vol. 7922, Fischer J, Sanders P (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2013; 189–200.

    12. Chen G, Puglisi SJ, Smyth WF. Fast and practical algorithms for computing all the runs in a string. In Combinatorial Pattern Matching, vol. 4580, Ma B, Zhang K (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2007; 307–315.

    13. Al-Hafeedh A, Crochemore M, Ilie L, Kopylova E, Smyth WF, Tischler G, Yusufu M. A comparison of index-based Lempel–Ziv LZ77 factorization algorithms. ACM Computing Surveys 2012; 45(1):5:1–5. DOI: 10.1145/2379776.2379781.

    14. Kärkkäinen J, Manzini G, Puglisi SJ. Permuted longest-common-prefix array. In Combinatorial Pattern Matching, vol. 5577, Kucherov G, Ukkonen E (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2009; 181–192.

    15. Goto K, Bannai H. Space Efficient Linear Time Lempel–Ziv Factorization on Constant Size Alphabets, 2013. Available from: http://arxiv.org/abs/1310.1448 [last accessed 31 March 2014].

    16. Goto K, Bannai H. Space efficient linear time Lempel–Ziv factorization for small alphabets. Proceedings of the 2014 Data Compression Conference, DCC '14, IEEE Computer Society, Washington, USA, 2014; 163–172.

    17. Nong G. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Transactions on Information Systems (TOIS) 2013; 31(3):15:1–15. DOI: 10.1145/2493175.2493180.

    18. Skibiński P, Grabowski S, Deorowicz S. Revisiting dictionary-based compression. Software: Practice and Experience 2005; 35(15):1455–1476. DOI: 10.1002/spe.678.

    19. Skibiński P, Grabowski S, Swacha J. Effective asymmetric XML compression. Software: Practice and Experience 2008; 38(10):1027–1047. DOI: 10.1002/spe.859.

    20. Crochemore M, Ilie L. Computing longest previous factor in linear time and applications. Information Processing Letters 2008; 106(2):75–80. DOI: 10.1016/j.ipl.2007.10.006.

    21. Goto K, Bannai H. Simpler and faster Lempel Ziv factorization. Proceedings of the 2013 Data Compression Conference, DCC '13, IEEE Computer Society, Washington, USA, 2013; 133–142.

    22. Kempa D, Puglisi SJ. Lempel–Ziv factorization: Simple, fast, practical. In 2013 Proceedings of the Fifteenth Workshop on Algorithm Engineering and Experiments (ALENEX), Sanders P, Zeh N (eds)., ALENEX '13. SIAM Pub: Philadelphia, USA, 2013; 103–112.


    23. Ko P, Aluru S. Space efficient linear time construction of suffix arrays. In Combinatorial Pattern Matching, vol. 2676, Baeza-Yates R, Chávez E, Crochemore M (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2003; 200–210.

    24. Nong G, Zhang S, Chan WH. Two efficient algorithms for linear time suffix array construction. IEEE Transactions on Computers 2011; 60(10):1471–1484. DOI: 10.1109/TC.2010.188.

    25. Kärkkäinen J, Kempa D, Puglisi SJ. Lightweight Lempel–Ziv parsing. In Proceedings of 12th Symposium on Experimental Algorithms, Bonifaci V, Demetrescu C, Marchetti-Spaccamela A (eds)., SEA '13. Springer-Verlag: Berlin, Germany, 2013; 139–150.
