Granular computing for low-latency compression and compilation
Sadjad Fouladi , Riad S. Wahby , Emre Orbay ,
John Emmons , Brennan Shacklett , William Zeng ,
Karthikeyan Vasuki Balasubramaniam , Rahul Bhalerao ,
Daniel Reiter Horn , Ken Elkabany , Chris Lesniewski-Lass ,
Anirudh Sivaraman , George Porter , Geet Sethi ,
Eyal Cidon , Dan Iter , Shuvo Chaterjee,
Christos Kozyrakis , Matei Zaharia , Keith Winstein
Stanford
UC San Diego
Dropbox
MIT
Message of this talk
◮ Granular interfaces to computing resources will enable new applications.
◮ It’s worth refactoring existing applications using ideas from functional programming.
◮ Saving and restoring program states is a powerful tool.
Four interactive systems built with functional abstractions
Lepton: JPEG recompression with VP8 and “functional” codec
ExCamera: Fast VP8 encoding using “functional” VP8 codec
Salsify: Videoconferencing using “functional” VP8 codec
gg: “make -j5000” functional compilation for pennies
System 1: Lepton (distributed JPEG recompression)
Daniel Reiter Horn, Ken Elkabany, Chris Lesniewski-Lass, and Keith Winstein, The Design,
Implementation, and Deployment of a System to Transparently Compress Hundreds of
Petabytes of Image Files for a File-Storage Service, in NSDI 2017.
Storage Overview at Dropbox
• ¾ Media
• Roughly an Exabyte in storage
• Can we save backend space?
[Chart: share of Dropbox storage by file type — JPEGs, Videos, Other]
Idea: save storage with transparent recompression
◮ Requirement: byte-for-byte reconstruction of original file
◮ Approach: improve “lossless” steps only
◮ Replace DC-predicted Huffman code with an arithmetic code
◮ Use probability model to predict “1” vs. “0” with high accuracy
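As a toy illustration of why this works (this is not Lepton's actual probability model), consider an adaptive model that estimates P(bit = 1) from counts seen so far; an arithmetic coder driven by such a model pays roughly the model's entropy per bit, so accurate predictions yield big savings:

```python
import math

class AdaptiveBitModel:
    """Toy adaptive model: estimates P(bit = 1) from Laplace-smoothed counts."""
    def __init__(self):
        self.ones, self.zeros = 1, 1

    def p_one(self):
        return self.ones / (self.ones + self.zeros)

    def update(self, bit):
        if bit:
            self.ones += 1
        else:
            self.zeros += 1

def ideal_coded_bits(bits):
    """Cost (in bits) an arithmetic coder driven by this model would approach."""
    model, cost = AdaptiveBitModel(), 0.0
    for b in bits:
        p = model.p_one() if b else 1.0 - model.p_one()
        cost += -math.log2(p)   # arithmetic coding approaches this entropy bound
        model.update(b)
    return cost

biased = [1] * 90 + [0] * 10   # a highly predictable stream
# 100 raw bits compress to roughly 50 coded bits under this toy model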
Prior work
[Chart: decompression speed (Mbits/s) vs. compression savings (percent) — MozJPEG (arithmetic), JPEGrescan (progressive), packjpg (global sort + big model + arithmetic); up and to the right is better]
Requirements for distributed compression
◮ Independently store and decode small chunks of each file
◮ Chunk boundaries can be at any byte offset
◮ Boundary locations not under our control
◮ Achieve > 100 Mbps decoding speed per chunk
◮ Never ever lose any data
◮ Encoder and decoder both deterministic
◮ Immune to adversarial/pathological input files
Our solution
◮ Lepton converts Huffman codewords to VP8 arithmetic code
◮ Uses a very large probability model to avoid “global” operations
◮ Developed against real images
◮ But: how to achieve fast decode of independent chunks?
Making the state of the JPEG encoder explicit
◮ Formulate JPEG encoder in explicit state-passing style
◮ Implement DC-predicted Huffman encoder as a pure function
◮ Takes “state” as a formal parameter
◮ Can resume anywhere, even in the middle of a Huffman codeword
◮ State contains “everything required to resume from midstream”
◮ 16 bytes: partial Huffman codeword, prior DC value, etc.
◮ Save the state in Lepton-encoded chunks to aid decoder
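The idea can be sketched with a toy prefix code (not JPEG's Huffman tables): the encoder is a pure function whose "state" carries the bits that don't yet fill a whole byte, so encoding resumed from a saved state is byte-identical to encoding in one shot:

```python
from collections import namedtuple

# Toy stand-in for Lepton's 16-byte state: everything needed to resume
# encoding mid-stream (here, just the bits not yet flushed to a byte).
EncoderState = namedtuple("EncoderState", ["pending_bits"])

CODE = {"a": "0", "b": "10", "c": "11"}  # toy prefix code, not JPEG's

def encode_chunk(state, symbols):
    """Pure function: same (state, symbols) always yields the same output."""
    bits = state.pending_bits
    for s in symbols:
        bits += CODE[s]
    cut = len(bits) // 8 * 8
    out, rem = bits[:cut], bits[cut:]
    data = bytes(int(out[i:i + 8], 2) for i in range(0, len(out), 8))
    return EncoderState(rem), data

# One-shot vs. resumed-from-saved-state encodings are byte-identical,
# even though the split point falls in the middle of a codeword boundary.
s0 = EncoderState("")
_, whole = encode_chunk(s0, list("abcabcab"))
mid, part1 = encode_chunk(s0, list("abca"))
_, part2 = encode_chunk(mid, list("bcab"))
assert whole == part1 + part2
```

This is exactly the property that lets a decoder start from any saved state instead of from the beginning of the file.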
Two benefits to functional JPEG encoder
◮ Distribute decoding across independent 4 MiB chunks
◮ Save state at the beginning of each chunk
◮ Parallelize decoding within a chunk
◮ Split chunk into pieces and save a state for each
◮ Each piece can run in parallel, then concatenate outputs and send
Results (single-chunk decoding)
[Chart: decompression speed (Mbits/s) vs. compression savings (percent) — Lepton plotted against MozJPEG (arithmetic), JPEGrescan (progressive), and packjpg (global sort + big model + arithmetic); up and to the right is better]
Deployment
• Lepton has encoded 150 billion files
– 203 PiB of JPEG files
– Saving 46 PiB
– So far…
o Backfilling at > 6,000 images per second
Power Usage at 6,000 Encodes
[Chart: chassis power over time while backfilling at 6,000 encodes per second]
Lepton concluding thoughts
◮ A little bit of functional programming can go a long way
◮ Functional codec was required to be able to deploy backend JPEG recompression on a distributed filesystem
Overview
Lepton: JPEG recompression with VP8 and “functional” codec
ExCamera: Fast VP8 encoding using “functional” VP8 codec
Salsify: Videoconferencing using “functional” VP8 codec
gg: “make -j5000” functional compilation for pennies
System 2: ExCamera (fine-grained parallel video processing)
Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Vasuki Balasubramaniam, William Zeng,
Rahul Bhalerao, Anirudh Sivaraman, George Porter, Keith Winstein, Encoding, Fast and Slow:
Low-Latency Video Processing Using Thousands of Tiny Threads, in NSDI 2017.
https://ex.camera
Outline
• Vision & Goals
• mu: Supercomputing as a Service
• Fine-grained Parallel Video Encoding
• Evaluation
• Conclusion & Future Work
What we currently have
• People can make changes to a word-processing document
• The changes are instantly visible to the others
What we would like to have
• People can interactively edit and transform a video
• The changes are instantly visible to the others
The Problem
Currently, running such pipelines on videos takes hours and hours, even for a short video.
The Question
Can we achieve interactive collaborative video editing by using massive parallelism?
The challenges
• Low-latency video processing would need thousands of threads, running in
parallel, with instant startup.
• However, the finer-grained the parallelism, the worse the compression
efficiency.
Enter ExCamera
• We made two contributions:
• Framework to run 5,000-way parallel jobs with IPC on a commercial “cloud function” service.
• Purely functional video codec for massive fine-grained parallelism.
• We call the whole system ExCamera.
Outline
• Vision & Goals
• mu: Supercomputing as a Service
• Fine-grained Parallel Video Encoding
• Evaluation
• Conclusion & Future Work
Where to find thousands of threads?
• IaaS services provide virtual machines (e.g. EC2, Azure, GCE):
• Thousands of threads
• Arbitrary Linux executables
! Minute-scale startup time (OS has to boot up, ...)
! High minimum cost (60 mins EC2, 10 mins GCE)
3,600 threads on EC2 for one second → >$20
Cloud function services have (as yet) unrealized power
• AWS Lambda, Google Cloud Functions
• Intended for event handlers and Web microservices, but...
• Features:
✔ Thousands of threads
✔ Arbitrary Linux executables
✔ Sub-second startup
✔ Sub-second billing
3,600 threads for one second → 10¢
mu, supercomputing as a service
• We built mu, a library for designing and deploying general-purpose parallel computations on a commercial “cloud function” service.
• The system starts up thousands of threads in seconds and manages inter-thread communication.
• mu is open-source software: https://github.com/excamera/mu
Outline
• Vision & Goals
• mu: Supercomputing as a Service
• Fine-grained Parallel Video Encoding
• Evaluation
• Conclusion & Future Work
Now we have the threads, but...
• With the existing encoders, the finer-grained the parallelism, the worse the compression efficiency.
Video Codec
• A piece of software or hardware that compresses and decompresses digital video.
[Diagram: Encoder → compressed bitstream → Decoder]
How video compression works
• Exploit the temporal redundancy in adjacent images.
• Store the first image in its entirety: a key frame.
• For other images, only store a "diff" against the previous images: an interframe.
In a 4K video @15Mbps, a key frame is ~1 MB, but an interframe is ~25 KB.
Existing video codecs only expose a simple interface
encode([img₁, img₂, ..., imgₙ]) → keyframe + interframe[2:n]
decode(keyframe + interframe[2:n]) → [img₁, img₂, ..., imgₙ]
encode(i[1:200]) → keyframe1 + interframe[2:200]
[thread 01] encode(i[1:10]) → kf1 + if[2:10]
[thread 02] encode(i[11:20]) → kf11 + if[12:20]
[thread 03] encode(i[21:30]) → kf21 + if[22:30]
⠇
[thread 20] encode(i[191:200]) → kf191 + if[192:200]
Traditional parallel video encoding is limited
finer-grained parallelism ⇒ more key frames ⇒ worse compression efficiency
[Diagram: serial encoding produces one key frame; parallel encoding adds an extra ~1 MB key frame at the start of each chunk]
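A back-of-envelope calculation using the slide's figures (≈1 MB per key frame, ≈25 KB per interframe) shows how quickly this overhead grows:

```python
# Each extra key frame costs roughly (keyframe - interframe) bytes of
# overhead versus the interframe a serial encoder would have emitted.
KEYFRAME_KB, INTERFRAME_KB = 1000, 25

def keyframe_overhead_mb(total_frames, chunk_len):
    chunks = total_frames // chunk_len
    extra_keyframes = chunks - 1   # a serial encoder needs only one key frame
    return extra_keyframes * (KEYFRAME_KB - INTERFRAME_KB) / 1000

# 200 frames split into 20 ten-frame chunks: 19 extra key frames,
# ~18.5 MB of pure overhead.
print(keyframe_overhead_mb(200, 10))
```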
We need a way to start encoding mid-stream
• Starting to encode mid-stream requires access to intermediate computations.
• Traditional video codecs do not expose this information.
• We formulated this internal information and made it explicit: the “state”.
What we built: a video codec in explicit state-passing style
• VP8 decoder with no inner state:
decode(state, frame) → (state′, image)
• VP8 encoder: resume from specified state
encode(state, image) → interframe
• Adapt a frame to a different source state
rebase(state, image, interframe) → interframe′
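The three-function interface can be sketched with a trivial "diff codec" standing in for VP8 (lists of ints standing in for images); the shapes of the signatures, not the compression, are the point:

```python
# Minimal sketch of the explicit state-passing codec interface.
# The "state" here is just the previously decoded image.

def decode(state, frame):
    """decode(state, frame) -> (state', image)"""
    image = [s + d for s, d in zip(state, frame)]
    return image, image            # the new state is the decoded image

def encode(state, image):
    """Resume from any state: emit an interframe relative to it."""
    return [i - s for s, i in zip(state, image)]

def rebase(state, image, interframe):
    """Re-express a frame against a different source state."""
    return encode(state, image)    # in this toy codec, just re-encode

state = [0, 0, 0]
img = [5, 7, 9]
inter = encode(state, img)
new_state, out = decode(state, inter)
assert out == img
```

Because no state is hidden inside the codec, any thread holding a saved state can pick up encoding or decoding exactly where another left off.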
Putting it all together: ExCamera
• Divide the video into tiny chunks:
• [Parallel] encode tiny independent chunks.
• [Serial] rebase the chunks together and remove extra keyframes.
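The parallel-then-serial structure can be sketched as follows (a stand-in `encode_chunk` replaces vpxenc, and the rebase step is only indicated; ExCamera's real rebase rewrites each chunk's keyframe as an interframe continuing from the previous chunk's state):

```python
from concurrent.futures import ThreadPoolExecutor

frames = list(range(1, 25))                        # 24 frames
chunks = [frames[i:i + 6] for i in range(0, 24, 6)]  # 4 tiny chunks

def encode_chunk(chunk):
    # Stand-in for steps 1-4: encode each chunk independently, producing
    # one keyframe, the interframes, and the chunk's final decoder state.
    return {"kf": chunk[0], "ifs": chunk[1:], "state": chunk[-1]}

with ThreadPoolExecutor() as pool:                 # parallel phase
    encoded = list(pool.map(encode_chunk, chunks))

# Serial phase (step 5): rebase each later chunk's keyframe onto the
# previous chunk's final state, leaving a single keyframe overall.
video = [encoded[0]["kf"]] + encoded[0]["ifs"]
for prev, cur in zip(encoded, encoded[1:]):
    rebased = cur["kf"]   # rebase(prev["state"], ...) in the real system
    video += [rebased] + cur["ifs"]

assert video == frames
```

The expensive encoding runs fan out across thousands of cloud-function threads; only the cheap rebase chain is serial.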
1. [Parallel] Download a tiny chunk of raw video
[Diagram: threads 1–4 each download one chunk of raw frames: 1–6, 7–12, 13–18, 19–24]
2. [Parallel] vpxenc → keyframe, interframe[2:n]
Google's VP8 encoder
encode(img[1:n]) → keyframe + interframe[2:n]
3. [Parallel] decode → state ↝ next thread
Our explicit-state style decoder
decode(state, frame) → (state′, image)
4. [Parallel] last thread’s state ↝ encode
Our explicit-state style encoder
encode(state, image) → interframe
5. [Serial] last thread’s state ↝ rebase → state ↝ next thread
Adapt a frame to a different source state
rebase(state, image, interframe) → interframe′
6. [Parallel] Upload finished video
Outline
• Vision & Goals
• mu: Supercomputing as a Service
• Fine-grained Parallel Video Encoding
• Evaluation
• Conclusion & Future Work
How well does it compress?
[Chart: quality (SSIM dB) vs. average bitrate (Mbit/s) — vpx (1 thread), vpx (multithreaded), ExCamera[6, 1], and ExCamera[6, 16]; ExCamera[6, 16] compresses within ±3% of single-threaded vpx]
14.8-minute 4K video @ 20 dB:
vpxenc single-threaded: 453 mins
vpxenc multi-threaded: 149 mins
YouTube (H.264): 37 mins
ExCamera[6, 16]: 2.6 mins
ExCamera contributions
◮ Framework to run 5,000-way parallel jobs with IPC on a commercial “cloud function” service.
◮ Purely functional video codec for massive fine-grained
parallelism.
ExCamera concluding thoughts
◮ Operators of Lambda/Cloud Functions/OpenWhisk will find that their customers want as many parallel tiny tasks as they can get.
◮ “Lots of data” and “interactive” → massively parallel → functional
◮ Many interactive data-processing jobs aren’t embarrassingly parallel:
◮ Image and video filters
◮ 3D artists
◮ Compilation and software testing
◮ Interactive machine learning
◮ Data visualization
◮ Searching large datasets
◮ To achieve low end-to-end latency, distributed systems will need to treat application state as a first-class object.
Overview
Lepton: JPEG recompression with VP8 and “functional” codec
ExCamera: Fast VP8 encoding using “functional” VP8 codec
Salsify: Videoconferencing using “functional” VP8 codec
gg: “make -j5000” functional compilation for pennies
System 3: Salsify (videoconferencing)
Sadjad Fouladi, Emre Orbay, John Emmons, Riad S. Wahby, Keith Winstein, Salsify: for real-time
videoconferencing, a richer interface between the video codec and transport protocol is helpful,
in preparation.
https://github.com/excamera
Interfaces exposed by the transport layer
bandwidth = X
Sender → Receiver
◮ “Can I send one byte without blocking?” (select, poll)
◮ “Send as many of these bytes as possible.” (write)
◮ “What is X (1-sec average)?” (RTCOutboundRTPStreamStats)
◮ “How much can I send now without blocking?”
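The first interface on the list is the classic writability check; a minimal sketch (using a local socketpair in place of a network connection):

```python
import select
import socket

# "Can I send one byte without blocking?" -- select/poll on writability.
# Note there is no analogous standard call answering the last question,
# "how many bytes can I send now?"
a, b = socket.socketpair()
a.setblocking(False)

_, writable, _ = select.select([], [a], [], 0)
can_send = a in writable   # True: an empty socket buffer has room
a.close()
b.close()
print(can_send)
```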
The problem: codecs drive videoconferencing
1. Transport tells encoder what “bitrate” to target
2. Tens of milliseconds pass
3. Encoder delivers next frame
4. Transport stuck sending what encoder created
◮ Even if overshot/undershot capacity estimate
◮ Even if network has changed in the meantime
When interfaces ossify. . .
Q: “Can you make the encoder obey the congestion control?”
◮ A: “Our software codec was licensed from On2/Google and doesn’t support that interface.”
◮ A: “Our hardware codecs don’t support that interface (and we have to support 50 different chips).”
◮ A: “Our congestion control doesn’t support that interface.”
◮ A: “It doesn’t matter; Skype/WebRTC/Facetime works the same way.”
Hypothesis: functional interface could yield measurable benefits
1. Transport to Encoder: “Aim for x bytes.”
2. Encoder to Transport: Here’s a menu of N different encodings run in parallel.
3. Transport late-binds which option to send, based on current capacity estimate.
4. Transport to Encoder: “I sent option #3. Save that state, restore it to all N encoders, then go back to step 1 and encode a new frame.”
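The loop above can be sketched as follows (illustrative sizes and names; Salsify runs N real VP8 encoders in parallel from the same saved state, here reduced to two quality settings):

```python
def encode(state, image, quality):
    # Stand-in encoder: two settings with made-up output sizes in bytes.
    size = {"low": 10_000, "high": 40_000}[quality]
    return {"quality": quality, "size": size, "state": (image, quality)}

def pick_option(options, capacity_bytes):
    """Transport late-binds: the largest encoding that fits the estimate."""
    fitting = [o for o in options if o["size"] <= capacity_bytes]
    return max(fitting, key=lambda o: o["size"]) if fitting else None

state, sent_qualities = "initial", []
for image, capacity in [("frame1", 50_000), ("frame2", 15_000)]:
    options = [encode(state, image, q) for q in ("low", "high")]  # step 2
    sent = pick_option(options, capacity)                         # step 3
    state = sent["state"]      # step 4: winner's state -> all encoders
    sent_qualities.append(sent["quality"])

print(sent_qualities)   # the second frame drops to "low" as capacity falls
```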
Video-aware congestion control
[Diagram: sender and receiver timelines — packets for frame i and frame i+1 sent at t₁…t₅, with a grace period between frames]
◮ A video encoder is not a full-throttle source.
◮ Salsify’s sender includes time between outgoing packets as a “grace period.”
◮ Receiver maintains moving average of packet inter-arrival times, pausing inference during grace period
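A minimal sketch of the receiver-side estimator (assumed representation: each packet carries a flag saying whether the gap before it was a declared grace period):

```python
def estimate(arrivals, alpha=0.125):
    """EWMA of packet inter-arrival times; `arrivals` is a list of
    (timestamp_seconds, grace) pairs, where grace=True means the sender
    declared the preceding gap intentional."""
    est = None
    prev = arrivals[0][0]
    for t, grace in arrivals[1:]:
        if not grace:                     # pause inference during grace
            gap = t - prev
            est = gap if est is None else (1 - alpha) * est + alpha * gap
        prev = t
    return est

# A 2-second grace gap between frames doesn't pollute the ~10 ms estimate.
pkts = [(0.00, False), (0.01, False), (0.02, False), (2.02, True), (2.03, False)]
print(round(estimate(pkts), 3))   # → 0.01
```

Without the grace flag, the 2-second pause would be misread as a congestion signal and drag the estimate up.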
Network throughput (synthetic network outage)
[Chart: throughput (Mbps) vs. time (s), 0–20 s — Salsify, Skype, WebRTC]
Video frame delay (synthetic network outage)
[Chart: frame delay (s) vs. time (s), 0–20 s — Salsify, Skype, WebRTC]
Network trace (Verizon LTE)
[Chart: video quality (SSIM dB) vs. video delay (95th-percentile ms) — Salsify, Salsify (no late binding), WebRTC, FaceTime, Hangouts, Skype; up and to the left is better]
Network trace (AT&T LTE)
[Chart: video quality (SSIM dB) vs. video delay (95th-percentile ms) — Salsify, WebRTC, FaceTime, Hangouts, Skype; up and to the left is better]
Network trace (T-Mobile UMTS)
[Chart: video quality (SSIM dB) vs. video delay (95th-percentile ms) — Salsify, WebRTC, FaceTime, Hangouts, Skype; up and to the left is better]
Takeaways (videoconferencing)
◮ Need functional interface to the codec to get good match to late capacity estimate
◮ Need richer interface to the transport to expose “how many bytes can I send now?”
◮ No time like the present to start refactoring these interfaces
Overview
Lepton: JPEG recompression with VP8 and “functional” codec
ExCamera: Fast VP8 encoding using “functional” VP8 codec
Salsify: Videoconferencing using “functional” VP8 codec
gg: “make -j5000” functional compilation for pennies
Problem: the software-development OODA loop is lengthening
◮ Compiling large programs can take hours
◮ Trend against dynamic linking means “vendoring” all dependencies
◮ Slow feedback → waste of human resources
gg: the Stanford builder
◮ Goal: distribute builds across 5,000 cores in the cloud
◮ (without modifying build system)
◮ Use granular abstractions to computing resources
◮ AWS Lambda, IBM OpenWhisk, Google Cloud Functions. . .
◮ Requirements:
◮ Thousands of cores
◮ Arbitrary binaries
◮ Subsecond startup
◮ Subsecond billing
Approach: model and thunk everything
◮ Locally, run build system with delayed execution of major pipeline stages:
◮ preprocess
◮ compile
◮ assemble
◮ link
◮ ar, ranlib, strip
◮ Each model produces a thunk.
◮ closure that allows deterministic delayed execution of any pipeline stage
Example thunk
{
"function": {
"exe": "gcc",
"args": [
"gcc", "-g", "-O2", "-c", "-o", "TEST_remake.o", "remake.i"
],
"hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
},
"infiles": [
{
"filename": "remake.i",
"hash": "9f1d127592e2bee6e702b66c9114d813d059f65e8cdb79db2127e7d6d1b3384b",
"order": 0
}
],
"outfile": "TEST_remake.o"
}
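Forcing a thunk like the one above can be sketched against a content-addressed store (hypothetical helper names; gg's real format and execution differ in detail, and a trivial transformation stands in for actually running gcc):

```python
import hashlib

store = {}                                   # hash -> file contents

def put(data: bytes) -> str:
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

def force(thunk) -> str:
    """Deterministically run the thunk's function on its infiles and
    return the hash of the outfile (memoizable by thunk hash)."""
    infiles = {f["filename"]: store[f["hash"]] for f in thunk["infiles"]}
    # Stand-in for invoking thunk["function"]["exe"]: "compile" = upper-case.
    out = infiles[thunk["infiles"][0]["filename"]].upper()
    return put(out)

src_hash = put(b"int main() { return 0; }")
thunk = {
    "function": {"exe": "gcc", "args": ["gcc", "-c", "main.i"]},
    "infiles": [{"filename": "main.i", "hash": src_hash, "order": 0}],
    "outfile": "main.o",
}
out_hash = force(thunk)
assert store[out_hash] == b"INT MAIN() { RETURN 0; }"
```

Because every input is named by hash and execution is deterministic, a thunk can be forced locally or shipped to any of thousands of cloud-function workers with identical results.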
To compile, lazily force the thunk
◮ Thunks are self-contained and can be forced locally or in the cloud.
◮ Run 5,000+ thunks in parallel on Lambda/OpenWhisk
◮ Can trust others’ assertions
◮ “File with this hash → contents” (easy to detect invalid claims)
◮ “Thunk with this hash → result” (can prove a claim is invalid)
Granular computing for interactive applications
◮ Granular interfaces to computing resources will enable new applications.
◮ It’s worth refactoring existing applications using ideas from functional programming.
◮ Saving and restoring program states is a powerful tool.
◮ Lepton: JPEG recompression
◮ ExCamera: video encoding with thousands of tiny tasks
◮ Salsify: videoconferencing with a functional interface to the codec
◮ gg: “make -j5000” functional compilation for pennies
Thank you: Platform Lab, NSF, DARPA, Google, Dropbox, VMware, Facebook, Huawei, SITP.