Date post: 03-Jun-2020
Optimizing Zlib on Arm: The power of NEON Adenilson Cavalcanti ARM - San Jose (California) @adenilsonc

Why zlib?

Zlib

Used everywhere (libpng, Skia, freetype, cronet, Firefox, Chrome, Linux kernel, Android, iOS, JDK, git, etc.).

Old code base released in 1995.

Written in K&R C style.

Context

Lacks any optimizations for ARM CPUs.

Problem statement

Identify potential optimization candidates and verify positive effects in Chromium.

Previous art

● Cloudflare
● Intel
● Zlib-ng

Before deepening the fork...

● Performed some benchmarking.
● Contacted each project.
● Mixed results (1 project never replied).

Before forking...

None of the projects focused on decompression* or had ARM-specific optimizations.

*Important for a web browser.

Meet Mr. Parrot

PNGs rely on zlib
● Transparent.
● Pre-filters.
● High-res.

Source: https://upload.wikimedia.org/wikipedia/commons/3/3f/ZebraHighRes.png

Parrots are not created equal

Original: 2.7MB

Palette: 0.8MB

Zopfli: 2.6MB

Perf to the rescue

NEON: Advanced SIMD (Single Instruction Multiple Data)

NEON

● Optional on ARMv7.
● Mandatory on ARMv8.

Registers

ARMv7
● 16 registers@128 bits: Q0 - Q15.
● 32 registers@64 bits: D0 - D31.
● Varied set of instructions: load, store, add, mul, etc.

ARMv8
● 32 registers@128 bits: Q0 - Q31.
● 32 registers@64 bits: D0 - D31.
● 32 registers@32 bits: S0 - S31.
● 32 registers@16 bits: H0 - H31.
● Varied set of instructions: load, store, add, mul, etc.

An example: VADD.I16 Q0, Q1, Q2
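
To make the instruction concrete, here is a minimal portable sketch of what VADD.I16 Q0, Q1, Q2 computes: Q0 receives the lane-by-lane sum of Q1 and Q2, treating each 128-bit register as eight 16-bit lanes. The type `q_reg_i16` and the function `vadd_i16_model` are hypothetical names used only for illustration; on real hardware the same operation is available through the `vaddq_u16` intrinsic from `<arm_neon.h>`.

```c
#include <stdint.h>

enum { LANES = 8 };  /* a 128-bit Q register holds 8 lanes of 16 bits */

/* Hypothetical model of a Q register viewed as 16-bit lanes. */
typedef struct { uint16_t lane[LANES]; } q_reg_i16;

/* Scalar model of VADD.I16 Qd, Qn, Qm: every lane is added independently
 * and wraps modulo 2^16; no carry crosses into the neighbouring lane. */
static q_reg_i16 vadd_i16_model(q_reg_i16 n, q_reg_i16 m)
{
    q_reg_i16 d;
    for (int i = 0; i < LANES; i++)
        d.lane[i] = (uint16_t)(n.lane[i] + m.lane[i]);
    return d;
}
```

The hardware performs all eight additions with a single instruction, which is where the speedup over a scalar loop comes from.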

Entropy & Compression

Entertaining definition

https://www.youtube.com/watch?v=l49MHwooaVQ

Formal definition

Shannon Entropy

$$H = -\sum_i p_i \log_2 p_i$$

Where:
p_i: probability of character i appearing in the stream of characters.

https://en.wiktionary.org/wiki/Shannon_entropy

Practical explanation

a) HTML b) JPEG

Practical visualization

./binwalk -E file

a) HTML: 0.68 b) JPEG: 0.95
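
A small worked example may help relate these numbers to Shannon entropy. Take the four-character stream "AAAB", so that p_A = 3/4 and p_B = 1/4. Using base-2 logarithms (an assumption; it gives the result in bits per character):

```latex
H = -\sum_i p_i \log_2 p_i
  = -\left( \tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4} \right)
  \approx 0.311 + 0.500
  = 0.811 \ \text{bits per character}
```

For byte data the maximum is 8 bits per byte. Assuming the values above are normalized to the range 0 to 1 by dividing by 8, 0.68 would correspond to roughly 5.4 bits per byte for the HTML file and 0.95 to roughly 7.6 bits per byte for the JPEG, which leaves far less redundancy for zlib to remove.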

Decompression optimizations

Adler-32 checksum

https://en.wikipedia.org/wiki/Adler-32

Adler-32 simplistic implementation

https://en.wikipedia.org/wiki/Adler-32
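
A simplistic implementation along the lines of the linked article can be sketched as follows. The function name `adler32_naive` is an illustrative choice; the algorithm keeps two sums, a (sum of bytes plus one) and b (sum of the successive values of a), both reduced modulo 65521 after every byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u  /* largest prime below 2^16 */

/* Naive Adler-32: two modulo operations per input byte. */
static uint32_t adler32_naive(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}
```

For the ASCII string "Wikipedia" this yields 0x11E60398, the value given in the linked article.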

Adler-32: problems

● Zlib’s Adler-32 was more than 7x faster than a naive implementation.

● It is hard to vectorize the following computation:
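
Both points can be seen in a short sketch. zlib's scalar code avoids the per-byte modulo by accumulating up to 5552 bytes (its NMAX constant) before reducing, which is one reason it outruns a naive version. The inner loop also shows the dependency that resists vectorization: each update of b needs the value of a produced in the same iteration. The function name `adler32_deferred` is hypothetical, and the loop unrolling used by zlib is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u
#define ADLER_NMAX 5552u  /* bytes that can be summed before 32-bit overflow */

/* Adler-32 with the modulo deferred to once per block of up to NMAX bytes. */
static uint32_t adler32_deferred(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t a = 1, b = 0;
    while (len > 0) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n > 0) {
            a += *data++;
            b += a;   /* needs the a just computed: loop-carried dependency */
            n--;
        }
        a %= ADLER_BASE;
        b %= ADLER_BASE;
    }
    return (b << 16) | a;
}
```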

Adler-32: technical drawing (Jan 2017)

Adler-32: ‘Taps’ to the rescue

Assembly: https://godbolt.org/g/KMeBAJ
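
One way to read the 'taps' idea is sketched below in portable scalar C; this is an illustration under stated assumptions, not the NEON code from the talk. Over a block of n bytes d_0 ... d_(n-1), the serial update of b can be rewritten as b + n*a + sum((n - i) * d_i), so the block reduces to a plain sum and a weighted sum with fixed weights n, n-1, ..., 1 (the taps). Both sums have no dependency between bytes, which is what makes them suitable for SIMD multiply-accumulate. The block size of 16 and the name `adler32_taps` are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAPS_MOD 65521u
enum { TAPS_BLOCK = 16 };  /* illustrative block size: one 128-bit load */

static uint32_t adler32_taps(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t a = 1, b = 0;
    while (len >= TAPS_BLOCK) {
        uint32_t sum = 0, weighted = 0;
        /* Each iteration is independent of the others, so a vector unit
         * could compute all 16 products at once. */
        for (int i = 0; i < TAPS_BLOCK; i++) {
            sum      += data[i];
            weighted += (uint32_t)(TAPS_BLOCK - i) * data[i];  /* tap weight */
        }
        b = (b + (uint32_t)TAPS_BLOCK * a + weighted) % TAPS_MOD;
        a = (a + sum) % TAPS_MOD;
        data += TAPS_BLOCK;
        len  -= TAPS_BLOCK;
    }
    while (len > 0) {  /* tail shorter than one block: plain serial update */
        a = (a + *data++) % TAPS_MOD;
        b = (b + a) % TAPS_MOD;
        len--;
    }
    return (b << 16) | a;
}
```

A production version would also defer the modulo across many blocks; it is applied per block here only to keep the sketch short.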

Adler-32: Intel got some love too!

https://bugs.chromium.org/p/chromium/issues/detail?id=688601

fast_chunk

● Second candidate in the perf profiling was inflate_fast.
● Very high level idea: perform long loads/stores in the byte array.
● Average 20% faster!
● Shipping on M62.
● Original patch by Simon Hosie.

https://bugs.chromium.org/p/chromium/issues/detail?id=697280
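
The high-level idea can be sketched as follows, assuming the hot spot is the copy of an LZ77 match from earlier output. The function `copy_match` and the 16-byte chunk size are hypothetical and this is not the Chromium patch: when the match source is at least one chunk behind the write position, whole chunks are moved with one wide load and one wide store; otherwise the copy falls back to bytes so that overlapping matches still replicate correctly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 16 };  /* illustrative: the width of one NEON register */

/* Copy `len` bytes of a match that starts `dist` bytes behind `out`.
 * Returns the new write position. */
static unsigned char *copy_match(unsigned char *out, size_t dist, size_t len)
{
    assert(dist > 0);
    const unsigned char *from = out - dist;
    if (dist >= CHUNK) {
        /* Source and destination chunks cannot overlap here, so a fixed-size
         * memcpy is valid; compilers typically lower it to wide loads/stores. */
        while (len >= CHUNK) {
            memcpy(out, from, CHUNK);
            from += CHUNK;
            out  += CHUNK;
            len  -= CHUNK;
        }
    }
    while (len > 0) {  /* short distances and tails: byte by byte */
        *out++ = *from++;
        len--;
    }
    return out;
}
```

This version never writes past `out + len`. Implementations that allow chunks to overshoot the end need slack space in the output buffer, which is a trade-off this sketch avoids.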

CRC-32

https://bugs.chromium.org/p/chromium/issues/detail?id=709716

● YMMV on PNGs (from 1 to 5%).
● Remember it is used while decompressing web content (29% boost for gzipped content).
● ARMv8-a has a crc32 instruction (from 3 to 10x faster than zlib’s crc32 C code).
● Shipping on M66.
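
A minimal sketch of how the instruction can be reached from C, assuming a compiler that defines `__ARM_FEATURE_CRC32` and provides the ACLE intrinsic `__crc32b` in `<arm_acle.h>`. The function name `crc32_sketch` is hypothetical and the bitwise fallback is only there to keep the example portable; it is not zlib's table-driven C code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
#endif

/* CRC-32 (reflected polynomial 0xEDB88320), one byte per step. */
static uint32_t crc32_sketch(uint32_t crc, const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    crc = ~crc;  /* the instruction does not apply the pre/post inversion */
#if defined(__ARM_FEATURE_CRC32)
    for (size_t i = 0; i < len; i++)
        crc = __crc32b(crc, data[i]);          /* hardware CRC step */
#else
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)      /* portable bitwise fallback */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
#endif
    return ~crc;
}
```

Calling `crc32_sketch(0, (const unsigned char *)"123456789", 9)` gives the standard check value 0xCBF43926. A tuned version would consume 4 or 8 bytes per instruction rather than one.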

Results: Chromium’s zlib*

* c-zlib

Arm: zlib format 1.4x

Arm: gzip format 1.5x

Arm: c-zlib X Vanilla

x86: c-zlib X Vanilla

We were missing compression...

Bonus: Compression on Arm

Slide-hash: NEON

https://chromium-review.googlesource.com/1136940

● Using NEON instruction vqsubq.

● Works on 8x 16bits chunks.

● Perf gain of 5%.
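
The operation being vectorized can be sketched in portable C. When the deflate window slides, every 16-bit hash table entry is reduced by the window size and entries that would go negative are clamped to zero, which is exactly an unsigned saturating subtract. The names `sat_sub_u16` and `slide_hash_sketch` are hypothetical; with NEON, the inner 8-lane loop corresponds to one `vqsubq_u16` on a 128-bit register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned saturating subtract: the per-lane behaviour of vqsubq_u16. */
static uint16_t sat_sub_u16(uint16_t x, uint16_t y)
{
    return (uint16_t)(x > y ? x - y : 0);
}

/* Subtract `wsize` from every table entry, clamping at zero. */
static void slide_hash_sketch(uint16_t *table, size_t entries, uint16_t wsize)
{
    assert(table != NULL || entries == 0);
    size_t i = 0;
    for (; i + 8 <= entries; i += 8)            /* 8 x 16-bit lanes per step */
        for (int lane = 0; lane < 8; lane++)
            table[i + lane] = sat_sub_u16(table[i + lane], wsize);
    for (; i < entries; i++)                    /* leftover entries */
        table[i] = sat_sub_u16(table[i], wsize);
}
```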

insert-string: crypto CRC-32

https://chromium-review.googlesource.com/c/chromium/src/+/1173262

● Using ARMv8-a instruction crc32.

● Works on 1x 32bits chunks.

● Perf gain of 24%.
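
A rough sketch of the idea, under several assumptions: the hash used to index deflate's string table is taken from a CRC of the next 4 input bytes, so one instruction on one 32-bit word replaces a multi-step rolling hash. The names `crc32_word` and `hash4_crc`, the 15-bit table size, and the zero seed are all hypothetical; `crc32_word` is a software stand-in for the hardware step and this is not the code from the linked review.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { HASH_BITS = 15, HASH_MASK = (1 << HASH_BITS) - 1 };  /* assumed size */

/* Software stand-in for one CRC step over a 32-bit word. */
static uint32_t crc32_word(uint32_t crc, uint32_t word)
{
    crc ^= word;
    for (int bit = 0; bit < 32; bit++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

/* Hash the 4 bytes at `p` into a table index. */
static uint32_t hash4_crc(const unsigned char *p)
{
    assert(p != NULL);
    uint32_t word;
    memcpy(&word, p, sizeof word);  /* unaligned-safe 32-bit load */
    return crc32_word(0, word) & HASH_MASK;
}
```

The result depends on host byte order, which is acceptable for a hash that is only used inside one compression run.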

Arm: current state

● Compression: average 1.36x faster, but 1.4x faster for HTML.
● Decompression: average 1.6x faster (gzip), but 1.8x faster for HTML.

Conclusions

● There is plenty of life left even in an old code base.
● NEON optimizations can yield a *huge* impact.
● It pays off to work in a lower layer.
● OSS love: Intel got it too.

Chromium’s zlib: c-zlib

● Decompression: 1.7x to 2x faster.
● Compression: 1.3x to 1.4x faster.
● Both ARM & x86 are supported.
● Highly tested (e.g. cronet, fuzzers).
● Widely deployed (over 1 billion users).
● Open to performance & security patches.

Zlib users should consider moving to Chromium’s zlib.

Resources

a) Slides: https://goo.gl/vaZA9o
b) Performance benchmarks: https://goo.gl/qLVdvh
c) Code: https://cs.chromium.org/chromium/src/third_party/zlib/

Final words

“This is how the open-source model works: building upon the work of others is far more efficient than rewriting everything.”

Jean-loup Gailly (zlib author)

https://slashdot.org/story/00/03/10/1043247/jean-loup-gailly-on-gzip-go-and-mandrake

Questions
