
Gzip Compression and Decompression

Date post: 23-Dec-2015
Upload: bertha-richardson
Transcript
Page 1:

Gzip Compression and Decompression

1. Gzip file format
2. Gzip compression algorithm
   . LZ77 algorithm
   . Dynamic Huffman coding algorithm
3. Gzip decompression algorithm
4. Other methods of data compression and open questions

Page 2:

Gzip file format

1. A gzip file consists of a series of "members". The members simply appear one after another in the file, with no additional information before, between, or after them.

2. Member format
Each member has the following format:

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more->)
+---+---+---+---+---+---+---+---+---+---+

if FLG.FEXTRA set:

+---+---+=================================+
| XLEN  |...XLEN bytes of "extra field"...| (more->)
+---+---+=================================+

Page 3:

if FLG.FNAME set:

+=========================================+
|...original file name, zero-terminated...| (more->)
+=========================================+

if FLG.FCOMMENT set:

+===================================+
|...file comment, zero-terminated...| (more->)
+===================================+

if FLG.FHCRC set:

+---+---+
| CRC16 |
+---+---+

+=======================+
|...compressed blocks...| (more->)
+=======================+

Page 4:

+---+---+---+---+---+---+---+---+
|     CRC32     |     ISIZE     |
+---+---+---+---+---+---+---+---+

ID1 = 31, ID2 = 139: these two bytes identify the file as being in gzip format.

CM (compression method): identifies the compression method used in the file. CM = 0-7 are reserved. CM = 8 denotes the "deflate" compression method, which is the one customarily used by gzip and which is documented elsewhere.

FLG (flags):
bit 0 FTEXT
bit 1 FHCRC
bit 2 FEXTRA
bit 3 FNAME
bit 4 FCOMMENT
others reserved.

CRC32: CRC-32 of the uncompressed data.
ISIZE: size of the original (uncompressed) data modulo 2^32.
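The member layout above can be walked field by field. The following is a minimal Python sketch, not a full reader: it assumes the byte string holds exactly one member, it reads the multi-byte fields as little-endian, and it calls the trailer's size field `isize` (the name used in RFC 1952). The function name `parse_member` and the dictionary keys are made up for this example.

```python
import gzip
import struct
import zlib

# FLG bit masks, in the order listed on the slide.
FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16

def parse_member(blob):
    """Parse the header and trailer of a byte string holding ONE gzip member."""
    id1, id2, cm, flg, mtime, xfl, os_ = struct.unpack_from("<BBBBIBB", blob, 0)
    if (id1, id2) != (31, 139):
        raise ValueError("not a gzip member")
    info = {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_}
    pos = 10                                   # fixed part of the header
    if flg & FEXTRA:                           # XLEN + XLEN bytes of extra field
        (xlen,) = struct.unpack_from("<H", blob, pos)
        pos += 2 + xlen
    if flg & FNAME:                            # zero-terminated file name
        end = blob.index(b"\x00", pos)
        info["name"] = blob[pos:end].decode("latin-1")
        pos = end + 1
    if flg & FCOMMENT:                         # zero-terminated comment
        end = blob.index(b"\x00", pos)
        info["comment"] = blob[pos:end].decode("latin-1")
        pos = end + 1
    if flg & FHCRC:                            # CRC16 of the header
        pos += 2
    info["data_offset"] = pos                  # compressed blocks start here
    # Trailer: the last 8 bytes of the member.
    info["crc32"], info["isize"] = struct.unpack_from("<II", blob, len(blob) - 8)
    return info

sample = b"hello gzip"
blob = gzip.compress(sample, mtime=0)
info = parse_member(blob)
# The bytes between header and trailer are a raw deflate stream.
payload = zlib.decompress(blob[info["data_offset"]:-8], -15)
```

Here `info["cm"]` is 8 (deflate), and the trailer values can be checked against `zlib.crc32(sample)` and `len(sample)`.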

Page 5:

2. Gzip compression algorithm

Introduction
Gzip combines the LZ77 algorithm and the dynamic Huffman algorithm to compress data. Gzip first compresses the data with the LZ77 algorithm, then compresses the result with the dynamic Huffman algorithm.
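As a rough illustration of what each stage contributes, the sketch below uses Python's standard `zlib` module (an implementation of the deflate method, not code from these slides) to compress the same data twice. The default strategy performs string matching followed by Huffman coding, while `Z_HUFFMAN_ONLY` skips string matching. The sample data and the helper name `deflate_size` are made up for this example.

```python
import zlib

def deflate_size(data, strategy):
    """Return the compressed size in bytes of `data` under a zlib strategy."""
    comp = zlib.compressobj(level=9, strategy=strategy)
    return len(comp.compress(data) + comp.flush())

data = b"TENNESSEE " * 200                                # highly repetitive input
both = deflate_size(data, zlib.Z_DEFAULT_STRATEGY)        # matching + Huffman
huffman_only = deflate_size(data, zlib.Z_HUFFMAN_ONLY)    # Huffman coding only
print(len(data), huffman_only, both)
```

On repetitive input like this, the string-matching stage accounts for most of the size reduction, and Huffman coding alone still shrinks the data because only a few distinct characters occur.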

2.1 LZ77 compression algorithm

Terms used in the algorithm:
. input stream: the sequence of characters to be compressed.
. character: the basic element in the input stream.
. coding position: the position in the input stream that is currently being coded (the beginning of the lookahead buffer).
. lookahead buffer: the character sequence from the coding position to the end of the input stream.

Page 6:

. window: has size w and contains the w characters before the coding position, i.e. the last w characters processed.
. A pointer points to the match in the window and also specifies its length.

The principle of encoding
The algorithm searches the window for the longest match with the lookahead buffer and outputs a pointer for that match. When we find the match, we use the data pair <offset, length> to take the place of the match.
Offset: the offset of the beginning of the match from the window's left bound (the example that follows counts this offset from 0 at the left bound).
Length: the length of the match.

The encoding algorithm

Page 7:

step 1: set the coding position to the beginning of the input stream.
step 2: if the coding position is not at the end of the input stream, search the window for the longest match with the lookahead buffer; otherwise the algorithm terminates.
step 3: if a match is found, output (off, length, c), where c is the character following the match, move the coding position and the window length+1 bytes forward, and go to step 2; otherwise go to step 4.
step 4: output the current character at the coding position, move the coding position and the window 1 byte forward, and go to step 2.

The following example explains the algorithm. Assume the size of the window is 10, its content is "abcdbbccaa", and the string to be coded is "abaeaaabaee". The encoding steps are as follows:

Page 8:

step 1: the longest match between the string and the window is "ab"; output (0,2,a), then move the window and the coding position forward 3 bytes.
step 2: the character at the current coding position is 'e'. The content of the window is "dbbccaaaba"; there is no match for 'e', so output 'e'. The window and the coding position move 1 byte forward.
step 3: the content of the window is "bbccaaabae" and the lookahead buffer is "aaabaee". The longest match is "aaabae", the last 6 characters of the window, followed by 'e'. Output (4,6,e).

Many other issues need to be considered. Refer to the gzip source code and documentation.
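The encoding steps and the worked example can be sketched in Python as follows. This is a simplified illustration, not gzip's actual matcher. It makes several assumptions: matches are found by brute-force search and must lie entirely inside the window, the leftmost of equally long matches wins, the last input character is always kept available as the following character c, and tokens are represented as either a bare character or an (offset, length, c) tuple. The names `lz77_encode` and `lz77_decode` are made up for this example.

```python
def lz77_encode(window, data, w=10):
    """Encode `data` given the initial `window` content (window size w)."""
    buf = window + data            # history followed by the text to code
    pos = len(window)              # coding position
    out = []
    while pos < len(buf):
        start = max(0, pos - w)    # left bound of the window
        best_off, best_len = 0, 0
        for cand in range(start, pos):
            length = 0
            # Stay inside the window and leave one character for c.
            while (cand + length < pos and pos + length < len(buf) - 1
                   and buf[cand + length] == buf[pos + length]):
                length += 1
            if length > best_len:  # strict '>' keeps the leftmost longest match
                best_off, best_len = cand - start, length
        if best_len > 0:
            out.append((best_off, best_len, buf[pos + best_len]))
            pos += best_len + 1
        else:
            out.append(buf[pos])   # no match: emit the literal character
            pos += 1
    return out

def lz77_decode(window, tokens, w=10):
    """Rebuild the coded text from the same initial window and the tokens."""
    buf = window
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length, c = tok
            start = max(0, len(buf) - w)
            buf += buf[start + off:start + off + length] + c
        else:
            buf += tok
    return buf[len(window):]

tokens = lz77_encode("abcdbbccaa", "abaeaaabaee")
print(tokens)   # [(0, 2, 'a'), 'e', (4, 6, 'e')]
```

The output reproduces the three steps of the example, and `lz77_decode("abcdbbccaa", tokens)` returns the original string.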

Page 9:

Dynamic Huffman Coding

Static Huffman coding algorithm:
Assume that we are given a set of characters and their frequencies. Then we can use the Huffman algorithm to build a code for these characters.

Dynamic Huffman coding builds the Huffman tree dynamically: we do not know the characters and their frequencies in advance. The following example introduces the process of the dynamic Huffman algorithm:

String: TENNESSEE

During the dynamic process of building the Huffman tree, we must obey one rule: maintain the sibling property. A tree has the sibling property if each node (except the root) has a sibling and if the nodes can be numbered in order of nondecreasing weight with each node adjacent to its sibling. Moreover, the parent of a node is higher in the numbering.
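For comparison with the dynamic process, the static case described above can be sketched in a few lines of Python: the frequencies of TENNESSEE are counted first, then the two lightest subtrees are merged repeatedly. This is a generic textbook construction, not code from gzip. The function name `huffman_codes` and the tie-breaking by insertion order are choices made for this example, so the exact bit patterns may differ from other implementations even though the total length is the same.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return {symbol: bit string} for a static Huffman code built from `text`."""
    freq = Counter(text)
    # Heap entries: (weight, tie_breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # a single symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)       # two lightest subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("TENNESSEE")
encoded = "".join(codes[ch] for ch in "TENNESSEE")
print(codes, len(encoded))
```

With frequencies E = 4, N = 2, S = 2, T = 1, the most frequent character E receives a 1-bit code and the whole string takes 17 bits.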

Page 10:

T
Stage 1 (First occurrence of t)

#9                   r
                    / \
#7, #8             0   t(1)

Order: 0, t(1)

* r represents the root
* 0 represents the null node
* t(1) denotes the occurrence of T with a frequency of 1
* #n is a node's number in the ordering; the numbers at the left of a row label that row's nodes from left to right

Page 11:

TE
Stage 2 (First occurrence of e)

#9                   r
                    / \
#7, #8             1   t(1)
                  / \
#5, #6           0   e(1)

Order: 0, e(1), 1, t(1)

Page 12:

TEN
Stage 3 (First occurrence of n)

#9                   r
                    / \
#7, #8             2   t(1)
                  / \
#5, #6           1   e(1)
                / \
#3, #4         0   n(1)

Order: 0, n(1), 1, e(1), 2, t(1)

This is not a Huffman tree; we need to adjust it into one.

Page 13:

Reorder: TEN

#9                   r
                    / \
#7, #8          t(1)   2
                      / \
#5, #6               1   e(1)
                    / \
#3, #4             0   n(1)

Order: 0, n(1), 1, e(1), t(1), 2

Page 14:

TENN
Stage 4 (Repetition of n)

#9                   r
                    / \
#7, #8          t(1)   3
                      / \
#5, #6               2   e(1)
                    / \
#3, #4             0   n(2)

Order: 0, n(2), 2, e(1), t(1), 3

The sibling property no longer holds, so the tree must be rebuilt: swap the updated node with the node whose number is the biggest in its block.
Block: a set of nodes whose weights are the same.

To maintain the sibling property, we swap node n with node t. If a swapped node has a subtree, the subtree is swapped together with it.
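The weight-ordering part of the sibling property can be checked mechanically. The Python sketch below assumes that nodes are numbered level by level from the bottom, left to right within a level, which is how the trees on these slides appear to be numbered (it is not the general definition). It represents a leaf as a `(symbol, count)` tuple and an internal node as a `[left, right]` list. The names `weight`, `numbering`, and `has_sibling_property` are made up for this example.

```python
def weight(node):
    """A leaf is (symbol, count); an internal node is a [left, right] list."""
    if isinstance(node, tuple):
        return node[1]
    return weight(node[0]) + weight(node[1])

def numbering(root):
    """Node weights in slide order: bottom level first, left to right, no root."""
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [child for n in frontier if isinstance(n, list) for child in n]
    return [weight(n) for level in reversed(levels[1:]) for n in level]

def has_sibling_property(root):
    """True if the weights are nondecreasing in the numbering order."""
    ws = numbering(root)
    return all(a <= b for a, b in zip(ws, ws[1:]))

null = ("0", 0)                                      # the null node
stage4 = [("t", 1), [[null, ("n", 2)], ("e", 1)]]    # TENN before the swap
swapped = [("n", 2), [[null, ("t", 1)], ("e", 1)]]   # after swapping n and t

print(numbering(stage4), has_sibling_property(stage4))
print(numbering(swapped), has_sibling_property(swapped))
```

For the stage 4 tree the weights come out as 0, 2, 2, 1, 1, 3, which is not nondecreasing. After n and t exchange places the weights are 0, 1, 1, 1, 2, 2 and the check passes.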

Page 15:

Reorder: TENN

#9                   r
                    / \
#7, #8          n(2)   2
                      / \
#5, #6               1   e(1)
                    / \
#3, #4             0   t(1)

Order: 0, t(1), 1, e(1), n(2), 2

t(1) and n(2) are swapped.

Page 16:

TENNE
Stage 5 (Repetition of e)

#9                   r
                    / \
#7, #8          n(2)   3
                      / \
#5, #6               1   e(2)
                    / \
#3, #4             0   t(1)

Order: 0, t(1), 1, e(2), n(2), 3

Page 17:

TENNES
Stage 6 (First occurrence of s)

#9                   r
                    / \
#7, #8          n(2)   4
                      / \
#5, #6               2   e(2)
                    / \
#3, #4             1   t(1)
                  / \
#1, #2           0   s(1)

Order: 0, s(1), 1, t(1), 2, e(2), n(2), 4

Page 18:

TENNESS
Stage 7 (Repetition of s)

#9                   r
                    / \
#7, #8          n(2)   5
                      / \
#5, #6               3   e(2)
                    / \
#3, #4             2   t(1)
                  / \
#1, #2           0   s(2)

Order: 0, s(2), 2, t(1), 3, e(2), n(2), 5

The sibling property is not valid. Adjust the tree to restore it.

Page 19:

Reorder: TENNESS

#9                          r
                         /     \
#7, #8                3           4
                     / \         / \
#3, #4, #5, #6      1   s(2) n(2)   e(2)
                   / \
#1, #2            0   t(1)

s(2) and t(1) are swapped.
Node 3 (with its subtree) and n(2) also need to be swapped, which gives the tree shown above.

Page 20:

TENNESSE
Stage 8 (Second repetition of e)

#9                          r
                         /     \
#7, #8                3           5
                     / \         / \
#3, #4, #5, #6      1   s(2) n(2)   e(3)
                   / \
#1, #2            0   t(1)

Order: 0, t(1), 1, s(2), n(2), e(3), 3, 5

Page 21:

TENNESSEE
Stage 9 (Third repetition of e)

#9                          r
                         /     \
#7, #8                3           6
                     / \         / \
#3, #4, #5, #6      1   s(2) n(2)   e(4)
                   / \
#1, #2            0   t(1)

The sibling property is not valid (e(4) at #6 outweighs node 3 at #7), so the Huffman tree needs to be rebuilt.

Page 22:

Reorder: TENNESSEE

#9                   r
                    / \
#7, #8          e(4)   5
                      / \
#5, #6            n(2)   3
                        / \
#3, #4                 1   s(2)
                      / \
#1, #2               0   t(1)

Adaptive Huffman decoding is the inverse procedure of encoding.

