Granular computing for low-latency compression and compilation
Sadjad Fouladi , Riad S. Wahby , Emre Orbay ,
John Emmons , Brennan Shacklett , William Zeng ,
Karthikeyan Vasuki Balasubramaniam , Rahul Bhalerao ,
Daniel Reiter Horn , Ken Elkabany , Chris Lesniewski-Lass ,
Anirudh Sivaraman , George Porter , Geet Sethi ,
Eyal Cidon , Dan Iter , Shuvo Chaterjee,
Christos Kozyrakis , Matei Zaharia , Keith Winstein
Stanford
UC San Diego
Dropbox
MIT
Message of this talk
◮ Granular interfaces to computing resources will enable new applications.
◮ It’s worth refactoring existing applications using ideas from functional programming.
◮ Saving and restoring program states is a powerful tool.
Four interactive systems built with functional abstractions
Lepton: JPEG recompression with VP8 and “functional” codec
ExCamera: Fast VP8 encoding using “functional” VP8 codec
Salsify: Videoconferencing using “functional” VP8 codec
gg: “make -j5000” functional compilation for pennies
System 1: Lepton (distributed JPEG recompression)
Daniel Reiter Horn, Ken Elkabany, Chris Lesniewski-Lass, and Keith Winstein, The Design,
Implementation, and Deployment of a System to Transparently Compress Hundreds of
Petabytes of Image Files for a File-Storage Service, in NSDI 2017.
Storage Overview at Dropbox
• ¾ Media
• Roughly an Exabyte in storage
• Can we save backend space?
[Chart: share of Dropbox storage by file type — JPEGs, Videos, Other]
Idea: save storage with transparent recompression
◮ Requirement: byte-for-byte reconstruction of original file
◮ Approach: improve “lossless” steps only
◮ Replace DC-predicted Huffman code with an arithmetic code
◮ Use probability model to predict “1” vs. “0” with high accuracy
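As a toy illustration of why this works (this is not Lepton's actual probability model), consider an adaptive model that estimates P(bit = 1) from counts seen so far; an arithmetic coder driven by such a model pays roughly the model's entropy per bit, so accurate predictions yield big savings:

```python
import math

class AdaptiveBitModel:
    """Toy adaptive model: estimates P(bit = 1) from Laplace-smoothed counts."""
    def __init__(self):
        self.ones, self.zeros = 1, 1

    def p_one(self):
        return self.ones / (self.ones + self.zeros)

    def update(self, bit):
        if bit:
            self.ones += 1
        else:
            self.zeros += 1

def ideal_coded_bits(bits):
    """Cost (in bits) an arithmetic coder driven by this model would approach."""
    model, cost = AdaptiveBitModel(), 0.0
    for b in bits:
        p = model.p_one() if b else 1.0 - model.p_one()
        cost += -math.log2(p)   # arithmetic coding approaches this entropy bound
        model.update(b)
    return cost

biased = [1] * 90 + [0] * 10   # a highly predictable stream
# 100 raw bits compress to roughly 50 coded bits under this toy model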
Prior work
[Chart: decompression speed (Mbits/s) vs. compression savings (percent) — MozJPEG (arithmetic), JPEGrescan (progressive), packjpg (global sort + big model + arithmetic); up and to the right is better]
Requirements for distributed compression
◮ Independently store and decode small chunks of each file
◮ Chunk boundaries can be at any byte offset
◮ Boundary locations not under our control
◮ Achieve > 100 Mbps decoding speed per chunk
◮ Never ever lose any data
◮ Encoder and decoder both deterministic
◮ Immune to adversarial/pathological input files
Our solution
◮ Lepton converts Huffman codewords to VP8 arithmetic code
◮ Uses a very large probability model to avoid “global” operations
◮ Developed against real images
◮ But: how to achieve fast decode of independent chunks?
Making the state of the JPEG encoder explicit
◮ Formulate JPEG encoder in explicit state-passing style
◮ Implement DC-predicted Huffman encoder as a pure function
◮ Takes “state” as a formal parameter
◮ Can resume anywhere, even in the middle of a Huffman codeword
◮ State contains “everything required to resume from midstream”
◮ 16 bytes: partial Huffman codeword, prior DC value, etc.
◮ Save the state in Lepton-encoded chunks to aid decoder
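The idea can be sketched with a toy prefix code (not JPEG's Huffman tables): the encoder is a pure function whose "state" carries the bits that don't yet fill a whole byte, so encoding resumed from a saved state is byte-identical to encoding in one shot:

```python
from collections import namedtuple

# Toy stand-in for Lepton's 16-byte state: everything needed to resume
# encoding mid-stream (here, just the bits not yet flushed to a byte).
EncoderState = namedtuple("EncoderState", ["pending_bits"])

CODE = {"a": "0", "b": "10", "c": "11"}  # toy prefix code, not JPEG's

def encode_chunk(state, symbols):
    """Pure function: same (state, symbols) always yields the same output."""
    bits = state.pending_bits
    for s in symbols:
        bits += CODE[s]
    cut = len(bits) // 8 * 8
    out, rem = bits[:cut], bits[cut:]
    data = bytes(int(out[i:i + 8], 2) for i in range(0, len(out), 8))
    return EncoderState(rem), data

# One-shot vs. resumed-from-saved-state encodings are byte-identical,
# even though the split point falls in the middle of a codeword boundary.
s0 = EncoderState("")
_, whole = encode_chunk(s0, list("abcabcab"))
mid, part1 = encode_chunk(s0, list("abca"))
_, part2 = encode_chunk(mid, list("bcab"))
assert whole == part1 + part2
```

This is exactly the property that lets a decoder start from any saved state instead of from the beginning of the file.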
Two benefits to functional JPEG encoder
◮ Distribute decoding across independent 4 MiB chunks
◮ Save state at the beginning of each chunk
◮ Parallelize decoding within a chunk
◮ Split chunk into pieces and save a state for each
◮ Each piece can run in parallel, then concatenate outputs and send
Results (single-chunk decoding)
[Chart: decompression speed (Mbits/s) vs. compression savings (percent) — Lepton plotted against MozJPEG (arithmetic), JPEGrescan (progressive), and packjpg (global sort + big model + arithmetic); up and to the right is better]
Deployment
• Lepton has encoded 150 billion files
– 203 PiB of JPEG files
– Saving 46 PiB
– So far…
o Backfilling at > 6,000 images per second
Power Usage at 6,000 Encodes
[Chart: chassis power over time while backfilling at 6,000 encodes per second]
Lepton concluding thoughts
◮ A little bit of functional programming can go a long way
◮ Functional codec was required to be able to deploy backend JPEG recompression on a distributed filesystem
Overview
Lepton: JPEG recompression with VP8 and “functional” codec
ExCamera: Fast VP8 encoding using “functional” VP8 codec
Salsify: Videoconferencing using “functional” VP8 codec
gg: “make -j5000” functional compilation for pennies
System 2: ExCamera (fine-grained parallel video processing)
Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Vasuki Balasubramaniam, William Zeng,
Rahul Bhalerao, Anirudh Sivaraman, George Porter, Keith Winstein, Encoding, Fast and Slow:
Low-Latency Video Processing Using Thousands of Tiny Threads, in NSDI 2017.
https://ex.camera
Outline
• Vision & Goals
• mu: Supercomputing as a Service
• Fine-grained Parallel Video Encoding
• Evaluation
• Conclusion & Future Work
What we currently have
• People can make changes to a word-processing document
• The changes are instantly visible to the others
What we would like to have
• People can interactively edit and transform a video
• The changes are instantly visible to the others
The Problem
Currently, running such pipelines on videos takes hours and hours, even for a short video.
The Question
Can we achieve interactive collaborative video editing by using massive parallelism?
The challenges
• Low-latency video processing would need thousands of threads, running in
parallel, with instant startup.
• However, the finer-grained the parallelism, the worse the compression
efficiency.
Enter ExCamera
• We made two contributions:
• Framework to run 5,000-way parallel jobs with IPC on a commercial “cloud function” service.
• Purely functional video codec for massive fine-grained parallelism.
• We call the whole system ExCamera.
Outline
• Vision & Goals
• mu: Supercomputing as a Service
• Fine-grained Parallel Video Encoding
• Evaluation
• Conclusion & Future Work
Where to find thousands of threads?
• IaaS services provide virtual machines (e.g. EC2, Azure, GCE):
• Thousands of threads
• Arbitrary Linux executables
! Minute-scale startup time (OS has to boot up, ...)
! High minimum cost (60 mins EC2, 10 mins GCE)
3,600 threads on EC2 for one second → >$20
Cloud function services have (as yet) unrealized power
• AWS Lambda, Google Cloud Functions
• Intended for event handlers and Web microservices, but...
• Features:
✔ Thousands of threads
✔ Arbitrary Linux executables
✔ Sub-second startup
✔ Sub-second billing
3,600 threads for one second → 10¢
mu, supercomputing as a service
• We built mu, a library for designing and deploying general-purpose parallel computations on a commercial “cloud function” service.
• The system starts up thousands of threads in seconds and manages inter-thread communication.
• mu is open-source software: https://github.com/excamera/mu
Outline
• Vision & Goals
• mu: Supercomputing as a Service
• Fine-grained Parallel Video Encoding
• Evaluation
• Conclusion & Future Work
Now we have the threads, but...
• With the existing encoders, the finer-grained the parallelism, the worse the compression efficiency.
Video Codec
• A piece of software or hardware that compresses and decompresses digital video.
[Diagram: Encoder → compressed bitstream → Decoder]
How video compression works
• Exploit the temporal redundancy in adjacent images.
• Store the first image in its entirety: a key frame.
• For other images, only store a "diff" against the previous images: an interframe.
In a 4K video @15Mbps, a key frame is ~1 MB, but an interframe is ~25 KB.
Existing video codecs only expose a simple interface
encode([img₁, img₂, ..., imgₙ]) → keyframe + interframe[2:n]
decode(keyframe + interframe[2:n]) → [img₁, img₂, ..., imgₙ]
encode(i[1:200]) → keyframe1 + interframe[2:200]
[thread 01] encode(i[1:10]) → kf1 + if[2:10]
[thread 02] encode(i[11:20]) → kf11 + if[12:20]
[thread 03] encode(i[21:30]) → kf21 + if[22:30]
⠇
[thread 20] encode(i[191:200]) → kf191 + if[192:200]
Traditional parallel video encoding is limited
finer-grained parallelism ⇒ more key frames ⇒ worse compression efficiency
[Diagram: serial encoding produces one key frame; parallel encoding adds an extra ~1 MB key frame at the start of each chunk]
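A back-of-envelope calculation using the slide's figures (≈1 MB per key frame, ≈25 KB per interframe) shows how quickly this overhead grows:

```python
# Each extra key frame costs roughly (keyframe - interframe) bytes of
# overhead versus the interframe a serial encoder would have emitted.
KEYFRAME_KB, INTERFRAME_KB = 1000, 25

def keyframe_overhead_mb(total_frames, chunk_len):
    chunks = total_frames // chunk_len
    extra_keyframes = chunks - 1   # a serial encoder needs only one key frame
    return extra_keyframes * (KEYFRAME_KB - INTERFRAME_KB) / 1000

# 200 frames split into 20 ten-frame chunks: 19 extra key frames,
# ~18.5 MB of pure overhead.
print(keyframe_overhead_mb(200, 10))
```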
We need a way to start encoding mid-stream
• Starting to encode mid-stream requires access to intermediate computations.
• Traditional video codecs do not expose this information.
• We formulated this internal information and made it explicit: the “state”.
What we built: a video codec in explicit state-passing style
• VP8 decoder with no inner state:
decode(state, frame) → (state′, image)
• VP8 encoder: resume from specified state
encode(state, image) → interframe
• Adapt a frame to a different source state
rebase(state, image, interframe) → interframe′
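The three-function interface can be sketched with a trivial "diff codec" standing in for VP8 (lists of ints standing in for images); the shapes of the signatures, not the compression, are the point:

```python
# Minimal sketch of the explicit state-passing codec interface.
# The "state" here is just the previously decoded image.

def decode(state, frame):
    """decode(state, frame) -> (state', image)"""
    image = [s + d for s, d in zip(state, frame)]
    return image, image            # the new state is the decoded image

def encode(state, image):
    """Resume from any state: emit an interframe relative to it."""
    return [i - s for s, i in zip(state, image)]

def rebase(state, image, interframe):
    """Re-express a frame against a different source state."""
    return encode(state, image)    # in this toy codec, just re-encode

state = [0, 0, 0]
img = [5, 7, 9]
inter = encode(state, img)
new_state, out = decode(state, inter)
assert out == img
```

Because no state is hidden inside the codec, any thread holding a saved state can pick up encoding or decoding exactly where another left off.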
Putting it all together: ExCamera
• Divide the video into tiny chunks:
• [Parallel] encode tiny independent chunks.
• [Serial] rebase the chunks together and remove extra keyframes.
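The parallel-then-serial structure can be sketched as follows (a stand-in `encode_chunk` replaces vpxenc, and the rebase step is only indicated; ExCamera's real rebase rewrites each chunk's keyframe as an interframe continuing from the previous chunk's state):

```python
from concurrent.futures import ThreadPoolExecutor

frames = list(range(1, 25))                        # 24 frames
chunks = [frames[i:i + 6] for i in range(0, 24, 6)]  # 4 tiny chunks

def encode_chunk(chunk):
    # Stand-in for steps 1-4: encode each chunk independently, producing
    # one keyframe, the interframes, and the chunk's final decoder state.
    return {"kf": chunk[0], "ifs": chunk[1:], "state": chunk[-1]}

with ThreadPoolExecutor() as pool:                 # parallel phase
    encoded = list(pool.map(encode_chunk, chunks))

# Serial phase (step 5): rebase each later chunk's keyframe onto the
# previous chunk's final state, leaving a single keyframe overall.
video = [encoded[0]["kf"]] + encoded[0]["ifs"]
for prev, cur in zip(encoded, encoded[1:]):
    rebased = cur["kf"]   # rebase(prev["state"], ...) in the real system
    video += [rebased] + cur["ifs"]

assert video == frames
```

The expensive encoding runs fan out across thousands of cloud-function threads; only the cheap rebase chain is serial.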
1. [Parallel] Download a tiny chunk of raw video
[Diagram: threads 1–4 each download one chunk of raw frames: 1–6, 7–12, 13–18, 19–24]
2. [Parallel] vpxenc → keyframe, interframe[2:n]
Google's VP8 encoder
encode(img[1:n]) → keyframe + interframe[2:n]
3. [Parallel] decode → state ↝ next thread
Our explicit-state style decoder
decode(state, frame) → (state′, image)
4. [Parallel] last thread’s state ↝ encode
Our explicit-state style encoder
encode(state, image) → interframe
5. [Serial] last thread’s state ↝ rebase → state ↝ next thread
Adapt a frame to a different source state
rebase(state, image, interframe) → interframe′
6. [Parallel] Upload finished video
Outline
• Vision & Goals
• mu: Supercomputing as a Service
• Fine-grained Parallel Video Encoding
• Evaluation
• Conclusion & Future Work
How well does it compress?
[Chart: quality (SSIM dB) vs. average bitrate (Mbit/s) — vpx (1 thread), vpx (multithreaded), ExCamera[6, 1], and ExCamera[6, 16]; ExCamera[6, 16] compresses within ±3% of single-threaded vpx]
14.8-minute 4K video @ 20 dB:
vpxenc single-threaded: 453 mins
vpxenc multi-threaded: 149 mins
YouTube (H.264): 37 mins
ExCamera[6, 16]: 2.6 mins
ExCamera contributions
◮ Framework to run 5,000-way parallel jobs with IPC on a commercial “cloud function” service.
◮ Purely functional video codec for massive fine-grained
parallelism.
ExCamera concluding thoughts
◮ Operators of Lambda/Cloud Functions/OpenWhisk will find that their customers want as many parallel tiny tasks as they can get.
◮ “Lots of data” and “interactive” → massively parallel → functional
◮ Many interactive data-processing jobs aren’t embarrassingly parallel:
◮ Image and video filters
◮ 3D artists
◮ Compilation and software testing
◮ Interactive machine learning
◮ Data visualization
◮ Searching large datasets
◮ To achieve low end-to-end latency, distributed systems will need to treat application state as a first-class object.
Overview
Lepton: JPEG recompression with VP8 and “functional” codec
ExCamera: Fast VP8 encoding using “functional” VP8 codec
Salsify: Videoconferencing using “functional” VP8 codec
gg: “make -j5000” functional compilation for pennies
System 3: Salsify (videoconferencing)
Sadjad Fouladi, Emre Orbay, John Emmons, Riad S. Wahby, Keith Winstein, Salsify: for real-time
videoconferencing, a richer interface between the video codec and transport protocol is helpful,
in preparation.
https://github.com/excamera
Interfaces exposed by the transport layer
bandwidth = X
Sender → Receiver
◮ “Can I send one byte without blocking?” (select, poll)
◮ “Send as many of these bytes as possible.” (write)
◮ “What is X (1-sec average)?” (RTCOutboundRTPStreamStats)
◮ “How much can I send now without blocking?”
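The first interface on the list is the classic writability check; a minimal sketch (using a local socketpair in place of a network connection):

```python
import select
import socket

# "Can I send one byte without blocking?" -- select/poll on writability.
# Note there is no analogous standard call answering the last question,
# "how many bytes can I send now?"
a, b = socket.socketpair()
a.setblocking(False)

_, writable, _ = select.select([], [a], [], 0)
can_send = a in writable   # True: an empty socket buffer has room
a.close()
b.close()
print(can_send)
```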
The problem: codecs drive videoconferencing
1. Transport tells encoder what “bitrate” to target
2. Tens of milliseconds pass
3. Encoder delivers next frame
4. Transport stuck sending what encoder created
◮ Even if overshot/undershot capacity estimate
◮ Even if network has changed in the meantime
When interfaces ossify. . .
Q: “Can you make the encoder obey the congestion control?”
◮ A: “Our software codec was licensed from On2/Google and doesn’t support that interface.”
◮ A: “Our hardware codecs don’t support that interface (and we have to support 50 different chips).”
◮ A: “Our congestion control doesn’t support that interface.”
◮ A: “It doesn’t matter; Skype/WebRTC/Facetime works the same way.”
Hypothesis: functional interface could yield measurable benefits
1. Transport to Encoder: “Aim for x bytes.”
2. Encoder to Transport: Here’s a menu of N different encodings run in parallel.
3. Transport late-binds which option to send, based on current capacity estimate.
4. Transport to Encoder: “I sent option #3. Save that state, restore it to all N encoders, then go back to step 1 and encode a new frame.”
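The loop above can be sketched as follows (illustrative sizes and names; Salsify runs N real VP8 encoders in parallel from the same saved state, here reduced to two quality settings):

```python
def encode(state, image, quality):
    # Stand-in encoder: two settings with made-up output sizes in bytes.
    size = {"low": 10_000, "high": 40_000}[quality]
    return {"quality": quality, "size": size, "state": (image, quality)}

def pick_option(options, capacity_bytes):
    """Transport late-binds: the largest encoding that fits the estimate."""
    fitting = [o for o in options if o["size"] <= capacity_bytes]
    return max(fitting, key=lambda o: o["size"]) if fitting else None

state, sent_qualities = "initial", []
for image, capacity in [("frame1", 50_000), ("frame2", 15_000)]:
    options = [encode(state, image, q) for q in ("low", "high")]  # step 2
    sent = pick_option(options, capacity)                         # step 3
    state = sent["state"]      # step 4: winner's state -> all encoders
    sent_qualities.append(sent["quality"])

print(sent_qualities)   # the second frame drops to "low" as capacity falls
```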
Video-aware congestion control
[Diagram: sender and receiver timelines — packets for frame i and frame i+1 sent at t₁…t₅, with a grace period between frames]
◮ A video encoder is not a full-throttle source.
◮ Salsify’s sender includes time between outgoing packets as a “grace period.”
◮ Receiver maintains moving average of packet inter-arrival times, pausing inference during grace period
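A minimal sketch of the receiver-side estimator (assumed representation: each packet carries a flag saying whether the gap before it was a declared grace period):

```python
def estimate(arrivals, alpha=0.125):
    """EWMA of packet inter-arrival times; `arrivals` is a list of
    (timestamp_seconds, grace) pairs, where grace=True means the sender
    declared the preceding gap intentional."""
    est = None
    prev = arrivals[0][0]
    for t, grace in arrivals[1:]:
        if not grace:                     # pause inference during grace
            gap = t - prev
            est = gap if est is None else (1 - alpha) * est + alpha * gap
        prev = t
    return est

# A 2-second grace gap between frames doesn't pollute the ~10 ms estimate.
pkts = [(0.00, False), (0.01, False), (0.02, False), (2.02, True), (2.03, False)]
print(round(estimate(pkts), 3))   # → 0.01
```

Without the grace flag, the 2-second pause would be misread as a congestion signal and drag the estimate up.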
Network throughput (synthetic network outage)
[Chart: throughput (Mbps) vs. time (s), 0–20 s — Salsify, Skype, WebRTC]
Video frame delay (synthetic network outage)
[Chart: frame delay (s) vs. time (s), 0–20 s — Salsify, Skype, WebRTC]
Network trace (Verizon LTE)
[Chart: video quality (SSIM dB) vs. video delay (95th-percentile ms) — Salsify, Salsify (no late binding), WebRTC, FaceTime, Hangouts, Skype; up and to the left is better]
Network trace (AT&T LTE)
[Chart: video quality (SSIM dB) vs. video delay (95th-percentile ms) — Salsify, WebRTC, FaceTime, Hangouts, Skype; up and to the left is better]
Network trace (T-Mobile UMTS)
[Chart: video quality (SSIM dB) vs. video delay (95th-percentile ms) — Salsify, WebRTC, FaceTime, Hangouts, Skype; up and to the left is better]
Takeaways (videoconferencing)
◮ Need functional interface to the codec to get good match to late capacity estimate
◮ Need richer interface to the transport to expose “how many bytes can I send now?”
◮ No time like the present to start refactoring these interfaces
Overview
Lepton: JPEG recompression with VP8 and “functional” codec
ExCamera: Fast VP8 encoding using “functional” VP8 codec
Salsify: Videoconferencing using “functional” VP8 codec
gg: “make -j5000” functional compilation for pennies
Problem: the software-development OODA loop is lengthening
◮ Compiling large programs can take hours
◮ Trend against dynamic linking means “vendoring” all dependencies
◮ Slow feedback → waste of human resources
gg: the Stanford builder
◮ Goal: distribute builds across 5,000 cores in the cloud
◮ (without modifying build system)
◮ Use granular abstractions to computing resources
◮ AWS Lambda, IBM OpenWhisk, Google Cloud Functions. . .
◮ Requirements:
◮ Thousands of cores
◮ Arbitrary binaries
◮ Subsecond startup
◮ Subsecond billing
Approach: model and thunk everything
◮ Locally, run build system with delayed execution of major pipeline stages:
◮ preprocess
◮ compile
◮ assemble
◮ link
◮ ar, ranlib, strip
◮ Each model produces a thunk.
◮ closure that allows deterministic delayed execution of any pipeline stage
Example thunk
{
"function": {
"exe": "gcc",
"args": [
"gcc", "-g", "-O2", "-c", "-o", "TEST_remake.o", "remake.i"
],
"hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
},
"infiles": [
{
"filename": "remake.i",
"hash": "9f1d127592e2bee6e702b66c9114d813d059f65e8cdb79db2127e7d6d1b3384b",
"order": 0
}
],
"outfile": "TEST_remake.o"
}
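Forcing a thunk like the one above can be sketched against a content-addressed store (hypothetical helper names; gg's real format and execution differ in detail, and a trivial transformation stands in for actually running gcc):

```python
import hashlib

store = {}                                   # hash -> file contents

def put(data: bytes) -> str:
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

def force(thunk) -> str:
    """Deterministically run the thunk's function on its infiles and
    return the hash of the outfile (memoizable by thunk hash)."""
    infiles = {f["filename"]: store[f["hash"]] for f in thunk["infiles"]}
    # Stand-in for invoking thunk["function"]["exe"]: "compile" = upper-case.
    out = infiles[thunk["infiles"][0]["filename"]].upper()
    return put(out)

src_hash = put(b"int main() { return 0; }")
thunk = {
    "function": {"exe": "gcc", "args": ["gcc", "-c", "main.i"]},
    "infiles": [{"filename": "main.i", "hash": src_hash, "order": 0}],
    "outfile": "main.o",
}
out_hash = force(thunk)
assert store[out_hash] == b"INT MAIN() { RETURN 0; }"
```

Because every input is named by hash and execution is deterministic, a thunk can be forced locally or shipped to any of thousands of cloud-function workers with identical results.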
To compile, lazily force the thunk
◮ Thunks are self-contained and can be forced locally or in the cloud.
◮ Run 5,000+ thunks in parallel on Lambda/OpenWhisk
◮ Can trust others’ assertions
◮ “File with this hash → contents” (easy to detect invalid claims)
◮ “Thunk with this hash → result” (can prove a claim is invalid)
Granular computing for interactive applications
◮ Granular interfaces to computing resources will enable new applications.
◮ It’s worth refactoring existing applications using ideas from functional programming.
◮ Saving and restoring program states is a powerful tool.
◮ Lepton: JPEG recompression
◮ ExCamera: video encoding with thousands of tiny tasks
◮ Salsify: videoconferencing with a functional interface to the codec
◮ gg: “make -j5000” functional compilation for pennies
Thank you: Platform Lab, NSF, DARPA, Google, Dropbox, VMware, Facebook, Huawei, SITP.