How GZIP compression works - JS Conf EU 2014

Post on 14-Dec-2014

description

Data compression is an amazing topic. Even in today’s world, with fast networks and almost unlimited storage, data compression is still relevant, especially for mobile devices and countries with poor Internet connections. For better or worse, GZIP is the de facto lossless compression method for text data on websites. It is neither the fastest nor the best, but it provides an excellent tradeoff between speed and compression ratio. The way the Internet works also makes it difficult to adopt newer compression methods. This talk examines how GZIP works internally, explaining the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. Different implementations will be compared, such as GNU GZIP, 7-ZIP and zopfli, focusing on why and how some of these implementations perform better than others. Finally, we will try to go beyond GZIP, preprocessing our data to achieve better results, for example by transposing JSON.

transcript

HOW GZIP COMPRESSION WORKS

RAUL FRAILE

JSCONF EU BERLIN

ABOUT ME

• PHP/JS SOFTWARE DEVELOPER

• MS (RES) STUDENT IN COMPUTING TECHNOLOGIES.

• MADE IN SPAIN.

DATA COMPRESSION

NOT AN EXPERT *

DATA COMPRESSION IS AN AMAZING TOPIC

REALLY!

IT CAN BE SEEN LIKE … MAGIC

flickr.com/photos/jeffkrause/6799254170

flickr.com/photos/t_e_brown/8677750589

… IT’S NOT

INFORMATION THEORY

CLAUDE SHANNON

ENTROPY

flickr.com/photos/95303997@N07/10074330416

$H = -\sum_{x} p(x) \log_2 p(x)$

AVERAGE AMOUNT OF INFORMATION CONTAINED IN EACH MESSAGE

≈ NUMBER OF BITS TO REPRESENT THE MESSAGE
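The entropy formula above can be tried out on a short message. This is a minimal sketch, assuming symbol probabilities are estimated from character counts in the message itself; the function name `shannon_entropy` is made up for illustration and does not come from the talk.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Average information per symbol: H = -sum(p(x) * log2(p(x)))."""
    counts = Counter(message)
    total = len(message)
    # p(x) is estimated as the relative frequency of each character.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


text = "HELLO WORLD"
bits_per_symbol = shannon_entropy(text)
print(f"{bits_per_symbol:.3f} bits/symbol")
print(f"about {bits_per_symbol * len(text):.1f} bits for the whole message")
```

A message made of one repeated character has zero entropy, while a message whose four symbols are all equally likely needs two bits per symbol.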

225 days/year 62 %

17 days/year 6 %

flickr.com/photos/aigle_dore/5952296478

flickr.com/photos/mariano-mantel/13955110319

HUMAN BRAIN IS DESIGNED TO COMPRESS DATA

flickr.com/photos/birthintobeing/11841180046

flickr.com/photos/neolao/3105372669

flickr.com/photos/tommiephotography/6840025942

flickr.com/photos/earlysound/2186172726

MORSE CODE

SHORTER SEQUENCES FOR COMMON CHARACTERS

flickr.com/photos/amboo213/9044879245

DATA COMPRESSION IN HTTP

GET index.html
Accept-Encoding: gzip, deflate

GZIP + HTTP
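The negotiation shown above can be sketched from the server side. This is a simplified illustration, not how any particular web server does it: the helper `encode_body` is hypothetical, and it ignores quality values such as `gzip;q=0` that a real server would have to honour.

```python
import gzip


def encode_body(body: bytes, accept_encoding: str):
    """Return (extra response headers, payload) for a given Accept-Encoding value."""
    # Split "gzip, deflate" into {"gzip", "deflate"}, dropping any ";q=..." suffix.
    offered = {
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
        if token.strip()
    }
    if "gzip" in offered:
        return {"Content-Encoding": "gzip"}, gzip.compress(body)
    # The client did not offer gzip: send the body unchanged.
    return {}, body


page = b"<p>hello</p>" * 200
headers, payload = encode_body(page, "gzip, deflate")
print(headers, len(page), "->", len(payload), "bytes")
```

The client reverses the step with `gzip.decompress(payload)` when it sees `Content-Encoding: gzip` in the response.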

GZIP COMPRESSION

• DEFLATE ALGORITHM

• DESIGNED BY PHIL KATZ

• USED IN HTTP, PNG AND PDF

GZIP

DEFLATE

LZ77 + HUFFMAN CODING
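The layering above (a GZIP container around a DEFLATE stream) can be made visible with Python's `zlib` module. This is a sketch under the assumption that the gzip header carries no optional fields, which is the case when zlib writes the header itself; the 10-byte header and 8-byte trailer sizes come from the gzip file format, not from the slides.

```python
import struct
import zlib

data = b"THIS FILE IS HUGE! THAT'S BECAUSE THE FILE IS NOT COMPRESSED"

# wbits=31 asks zlib to wrap the DEFLATE stream in a gzip header and trailer.
compressor = zlib.compressobj(level=9, method=zlib.DEFLATED, wbits=31)
gz = compressor.compress(data) + compressor.flush()

header, deflate_stream, trailer = gz[:10], gz[10:-8], gz[-8:]
stored_crc, stored_size = struct.unpack("<II", trailer)

print("magic bytes:", header[:2].hex())  # 1f8b identifies gzip
print("method:", header[2])              # 8 means DEFLATE
print("CRC-32 matches:", stored_crc == zlib.crc32(data))
print("size matches:", stored_size == len(data))

# wbits=-15 decodes a raw DEFLATE stream with no container around it.
restored = zlib.decompress(deflate_stream, wbits=-15)
print("round trip:", restored == data)
```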

LZ77 (VARIATION)

THIS FILE IS HUGE! THAT'S BECAUSE THE FILE IS NOT COMPRESSED

SEARCH BUFFER (UP TO 32 KB) · LOOK-AHEAD

<33,9> (DISTANCE 33, LENGTH 9: THE SECOND " FILE IS " REPEATS TEXT SEEN 33 CHARACTERS EARLIER)

LITERALS · LENGTHS · DISTANCES
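The back-reference idea can be sketched with a naive greedy matcher. This is an illustration only, not how GNU gzip or zlib search for matches (they use hash chains and lazy matching); the function names are made up, and the limits of 3 to 258 symbols per match and a 32 KB window are the DEFLATE limits.

```python
def lz77_tokens(text, window=32 * 1024, min_len=3, max_len=258):
    """Greedy LZ77: emit literals or (distance, length) back-references."""
    tokens, pos = [], 0
    while pos < len(text):
        best_len, best_dist = 0, 0
        # Brute-force scan of the search buffer for the longest match.
        for start in range(max(0, pos - window), pos):
            length = 0
            while (length < max_len and pos + length < len(text)
                   and text[start + length] == text[pos + length]):
                length += 1
            if length >= best_len:  # on ties, prefer the nearest match
                best_len, best_dist = length, pos - start
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            pos += best_len
        else:
            tokens.append(text[pos])
            pos += 1
    return tokens


def lz77_expand(tokens):
    """Rebuild the text from literals and (distance, length) pairs."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # one symbol at a time, so overlaps work
        else:
            out.append(token)
    return "".join(out)


sentence = "THIS FILE IS HUGE! THAT'S BECAUSE THE FILE IS NOT COMPRESSED"
tokens = lz77_tokens(sentence)
print([t for t in tokens if isinstance(t, tuple)])
```

Among the back-references found for the sample sentence is the pair `(33, 9)` from the slide.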

HUFFMAN CODING

HELLO WORLD

01001000 01000101 01001100 01001100 01001111 00100000 01010111 01001111 01010010 01001100 01000100

88 BITS

FIXED-LENGTH CODES

CHAR  CODE
H     000
E     001
L     010
O     011
W     100
R     101
D     110
_     111

000 001 010 010 011 111 100 011 101 010 110

33 BITS
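The saving from 88 to 33 bits can be reproduced in a few lines. This sketch assigns codes in order of first appearance, so the individual code words differ from the table on the slide, but the width (3 bits for 8 distinct symbols) and the totals are the same. The helper name is hypothetical.

```python
def fixed_length_codes(text):
    """Give every distinct symbol a code of the same width."""
    alphabet = list(dict.fromkeys(text))  # distinct symbols, first-seen order
    width = max(1, (len(alphabet) - 1).bit_length())
    return {symbol: format(index, f"0{width}b") for index, symbol in enumerate(alphabet)}


text = "HELLO WORLD"
codes = fixed_length_codes(text)
encoded = "".join(codes[ch] for ch in text)
ascii_bits = "".join(format(byte, "08b") for byte in text.encode("ascii"))

print(len(ascii_bits), "bits as 8-bit ASCII")
print(len(encoded), "bits with fixed-length codes")
```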

HUFFMAN CODING

CHARACTER FREQUENCY:

CHAR  FREQ  CODE
L     3     0
O     2     1
H     1     00
E     1     01
W     1     10
R     1     11
D     1     000
_     1     001

HELLO WORLD

0001001001101110000

19 BITS

IT’S AMBIGUOUS

H EL H OD O…

VARIABLE-LENGTH CODES
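The ambiguity can be checked by counting how many different symbol sequences produce the same 19 bits. This sketch uses the code table from the slide (with a space in place of `_`); the counting function is made up for illustration.

```python
from functools import lru_cache

# Short codes for frequent symbols, but some codes are prefixes of others.
CODES = {"L": "0", "O": "1", "H": "00", "E": "01",
         "W": "10", "R": "11", "D": "000", " ": "001"}


def encode(text):
    return "".join(CODES[ch] for ch in text)


def count_decodings(bits):
    """Count the symbol sequences whose encoding is exactly `bits`."""
    @lru_cache(maxsize=None)
    def ways(pos):
        if pos == len(bits):
            return 1
        # Try every code word that matches at the current position.
        return sum(ways(pos + len(code))
                   for code in CODES.values()
                   if bits.startswith(code, pos))
    return ways(0)


bits = encode("HELLO WORLD")
print(bits, len(bits), "bits")
print(count_decodings(bits), "possible decodings")
```

Because `0` and `1` are themselves code words, every bit string can at least be read as a run of `L` and `O` symbols, so the decoder has no way to recover the original message.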

HUFFMAN CODING

CHAR  FREQ  CODE
L     3     10
O     2     111
H     1     001
E     1     1100
W     1     011
R     1     000
D     1     1101
_     1     010

HELLO WORLD

001 1100 10 10 111 010 011 111 000 10 1101

32 BITS
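A prefix-free code like the one above can be built with the classic Huffman procedure: repeatedly merge the two least frequent subtrees. This is a minimal sketch using `heapq`. Ties between equal frequencies are broken arbitrarily, so the individual code words may differ from the slide, but the total length for this message is 32 bits either way.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Build a prefix-free code by merging the two rarest subtrees until one is left."""
    tie = count()  # tie-breaker so the heap never compares the dict payloads
    heap = [(freq, next(tie), {symbol: ""}) for symbol, freq in Counter(text).items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct symbol still needs one bit
        return {symbol: "0" for symbol in heap[0][2]}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def huffman_decode(bits, codes):
    """Read bits left to right; a prefix-free code never needs look-ahead."""
    reverse = {code: symbol for symbol, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)


message = "HELLO WORLD"
table = huffman_codes(message)
packed = "".join(table[ch] for ch in message)
print(len(packed), "bits")
print(huffman_decode(packed, table))
```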

HUFFMAN CODING

TABLE 1: LITERALS + LENGTHS

TABLE 2: DISTANCES
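To show what goes into each table, the sketch below maps a match length and a distance to their DEFLATE symbols. The code ranges (length symbols 257 to 285, distance symbols 0 to 29, each with a base value and some extra bits) follow the DEFLATE specification, RFC 1951, and are not taken from the slides; the helper names are hypothetical.

```python
def _build_table(first_code, code_count, first_base, extra_bits_for):
    """Each code covers 2**extra consecutive values starting at its base."""
    table, base = [], first_base
    for i in range(code_count):
        extra = extra_bits_for(i)
        table.append((first_code + i, base, extra))
        base += 1 << extra
    return table


# Table 1 shares its alphabet with literals 0-255 and end-of-block 256.
LENGTH_TABLE = _build_table(257, 28, 3, lambda i: max(0, i // 4 - 1))
# Table 2 holds only distance codes.
DISTANCE_TABLE = _build_table(0, 30, 1, lambda i: max(0, i // 2 - 1))


def _lookup(table, value):
    for code, base, extra in reversed(table):
        if value >= base:
            return code, extra, value - base
    raise ValueError(value)


def length_symbol(length):
    """Return (symbol, number of extra bits, extra-bits value) for a match length."""
    if not 3 <= length <= 258:
        raise ValueError("DEFLATE match lengths are 3..258")
    if length == 258:
        return 285, 0, 0  # the maximum length has a dedicated symbol
    return _lookup(LENGTH_TABLE, length)


def distance_symbol(distance):
    """Return (symbol, number of extra bits, extra-bits value) for a distance."""
    if not 1 <= distance <= 32768:
        raise ValueError("DEFLATE distances are 1..32768")
    return _lookup(DISTANCE_TABLE, distance)


# The <33,9> pair from the LZ77 example.
print(length_symbol(9), distance_symbol(33))
```

Only the symbol numbers are Huffman-coded; the extra bits are written verbatim after the code word.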

BLOCKS

M BLOCK 1 | M BLOCK 2 | M … | M BLOCK N

MODE 1: NO COMPRESSION

MODE 2: FIXED CODE TABLES

MODE 3: GENERATED CODE TABLES
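The mode of a block can be read from the first three bits of a raw DEFLATE stream: one "final block" flag followed by a two-bit block type. The bit layout comes from the DEFLATE specification rather than from the slides, and the helper below is a hypothetical illustration that inspects only the first block.

```python
import zlib

BLOCK_TYPES = {
    0: "no compression (stored)",
    1: "fixed code tables",
    2: "generated (dynamic) code tables",
}


def first_block_header(data, level):
    """Compress to raw DEFLATE and return (is_final, block_type) of the first block."""
    compressor = zlib.compressobj(level=level, wbits=-15)  # -15: no zlib or gzip wrapper
    stream = compressor.compress(data) + compressor.flush()
    is_final = stream[0] & 1            # lowest bit: is this the last block?
    block_type = (stream[0] >> 1) & 0b11  # next two bits: the mode
    return is_final, block_type


for level in (0, 9):
    is_final, block_type = first_block_header(b"hello", level)
    print(f"level {level}: final={is_final}, mode={BLOCK_TYPES[block_type]}")
```

Level 0 always produces stored blocks; at other levels the compressor picks whichever mode it estimates to be smallest for each block.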

flickr.com/photos/functoruser/2436979033

GZIP COMPRESSION

IMPLEMENTATIONS

GNU GZIP · ZOPFLI · 7-ZIP

MODE FAST · MODE NORMAL · MODE HIGH COMPRESSION

GENERAL RULE: MORE TIME, BETTER COMPRESSION RATIO
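The time versus ratio tradeoff can be observed with a single implementation by varying its compression level. This sketch uses Python's `zlib` binding only; 7-Zip and zopfli are not covered. Treating levels 1, 6 and 9 as "fast", "normal" and "high compression" is an assumption made for illustration, and the sample data is invented.

```python
import time
import zlib


def compare_levels(data, levels=(1, 6, 9)):
    """Return {level: (compressed size, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        if zlib.decompress(compressed) != data:
            raise RuntimeError("round trip failed")
        results[level] = (len(compressed), elapsed)
    return results


# Hypothetical sample: log-like text with a lot of repetition.
sample = b"".join(f"line {i}: the quick brown fox\n".encode() for i in range(2000))
for level, (size, seconds) in compare_levels(sample).items():
    print(f"level {level}: {size} bytes in {seconds * 1000:.2f} ms")
```

Timings depend on the machine and the input, so the printed numbers are only indicative.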

GZIP COMPRESSION

WHY GZIP?

• GOOD COMPRESSION RATIO.

• FAST TO (UN)COMPRESS.

• IN THE WORST CASE, EXPANDS THE DATA SLIGHTLY.

• MEMORY FOOTPRINT INDEPENDENT OF THE INPUT DATA.

• FREE IMPLEMENTATIONS THAT AVOID PATENTS.
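The worst-case point in the list above can be checked with incompressible input. This is a small sketch assuming pseudo-random bytes stand in for data that cannot be compressed; the seed and the size are arbitrary choices.

```python
import gzip
import random

rng = random.Random(2014)  # fixed seed so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(10_000))

packed = gzip.compress(noise)
overhead = len(packed) - len(noise)

print(f"{len(noise)} bytes in, {len(packed)} bytes out, {overhead} bytes of overhead")
```

When nothing can be compressed, DEFLATE falls back to stored blocks, so the output is the input plus a few bytes of block and container headers.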

TRADEOFF

NEWER ALGORITHMS

ISSUES TRYING TO ADD BZIP2 SUPPORT TO CHROME

GZIP COMPRESSION

BEYOND GZIP

PREPROCESS DATA TO OPTIMIZE MATCHES

GZIP(T(DATA)) < GZIP(DATA)

TRANSPOSING JSON

[
  { "name": "John", "country": "USA" },
  { "name": "Stephan", "country": "Germany" },
  { "name": "Rob", "country": "USA" }
]

{
  "name": [ "John", "Stephan", "Rob" ],
  "country": [ "USA", "Germany", "USA" ]
}
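The transposition can be written as a pair of small functions. This is a sketch that assumes a non-empty list of objects that all share the same keys; the function names and the larger generated dataset are hypothetical and only serve to compare sizes.

```python
import gzip
import json


def transpose(rows):
    """Turn a list of same-shaped objects into one object of columns."""
    keys = list(rows[0])
    return {key: [row[key] for row in rows] for key in keys}


def untranspose(columns):
    """Inverse of transpose: rebuild the list of objects."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


def gzip_size(value):
    return len(gzip.compress(json.dumps(value).encode("utf-8")))


rows = [
    {"name": "John", "country": "USA"},
    {"name": "Stephan", "country": "Germany"},
    {"name": "Rob", "country": "USA"},
]
print(transpose(rows))

# Hypothetical larger dataset, to let the compressor find repeated patterns.
many = [{"name": f"user{i}", "country": ("USA", "Germany")[i % 2]} for i in range(1000)]
print(gzip_size(many), "bytes as rows,", gzip_size(transpose(many)), "bytes transposed")
```

The transposed form stores each key name once and places similar values next to each other, which tends to give the LZ77 stage longer matches. How much it helps depends on the data.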

XML/HTML ATTRIBUTES ORDER

<input id='f1' class='field' name="f1" type="text" />
<input class="field" id="f2" type="text" name="f2" />

17.76%

<input id="f1" class="field" name="f1" type="text" />
<input class="field" id="f2" type="text" name="f2" />

27.10%

<input id="f1" class="field" name="f1" type="text" />
<input id="f2" class="field" name="f2" type="text" />

38.32%

<input type="text" class="field" id="f1" name="f1" />
<input type="text" class="field" id="f2" name="f2" />

38.32%
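The effect of consistent quoting and attribute order can be measured directly. This sketch compresses the four snippets above with raw DEFLATE at level 9 through `zlib`. That choice of tool and settings is an assumption, so the percentages it prints may not match the figures on the slide.

```python
import zlib

VARIANTS = [
    """<input id='f1' class='field' name="f1" type="text" /> <input class="field" id="f2" type="text" name="f2" />""",
    '<input id="f1" class="field" name="f1" type="text" /> <input class="field" id="f2" type="text" name="f2" />',
    '<input id="f1" class="field" name="f1" type="text" /> <input id="f2" class="field" name="f2" type="text" />',
    '<input type="text" class="field" id="f1" name="f1" /> <input type="text" class="field" id="f2" name="f2" />',
]


def deflate_size(text):
    """Size in bytes of the raw DEFLATE stream, without gzip header or trailer."""
    compressor = zlib.compressobj(level=9, wbits=-15)
    return len(compressor.compress(text.encode("ascii")) + compressor.flush())


for markup in VARIANTS:
    original = len(markup.encode("ascii"))
    packed = deflate_size(markup)
    print(f"{original} -> {packed} bytes ({100 * (1 - packed / original):.2f}% smaller)")
```

All four variants are the same length before compression, so any difference in the output comes from how well the second tag can be expressed as back-references to the first.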

http://goo.gl/GgMw26

THANK YOU

Raúl Fraile @raulfraile