HUAWEI | EUROPEAN RESEARCH CENTER
John Doe
— Huawei Confidential —
DeepSpark
Asynchronous deep learning over Spark
Contributors: Natan Peterfreund, Roman Talyansky, Uri Verner, Zach Melamed, Youliang Yan, Rongfu Zheng
Presenter: Uri Verner
DeepSpark is a scalable deep learning framework for Spark-based distributed environments.
Outline
Background
DeepSpark architecture
Data locality optimizations
Initial results
Useful tools
What is Apache Spark?
Spark is an advanced framework for distributed computation
Very fast at iterative algorithms
In-memory data caching between iterations
Provides fault-tolerance and recovery
Efficient data transfer between nodes (“shuffle”)
Easy and expressive APIs
Synchronous vs. Asynchronous Training
[Diagram: workers repeatedly read input data, compute, and send updates to a parameter server]
Workers can get out of sync due to:
• network delays
• waiting for data
• machine crashes
• etc.
System Architecture
[Diagram: each training worker machine runs a Spark executor whose training manager drives Caffe across several GPUs; a Spark driver coordinates the cluster; the model lives in a distributed parameter server stored in a Spark RDD; input data is read from HDFS]
Data Parallelism with Asynchronous Distributed Stochastic Gradient Descent
Each worker operates asynchronously with the other workers, repeating:
1. Download model 𝑀 from the parameter server
2. Compute an update ∆𝑀 on its input data
3. Upload ∆𝑀 to the PS
4. The PS updates the model: 𝑀 := 𝑀 + ∆𝑀
Distributed Parameter Server
The model is stored in a Spark Resilient Distributed Dataset (RDD):
- Cached in memory
- API for distributed processing
Model update procedure:
- Training workers send their local updates to the PS machines in split form
- The PS machines merge the updates and compute a new global model
- The new global model is sent back to the training workers
Workers Don’t Wait For Model Update
[Diagram: each worker keeps a local model, a global model, and accumulated updates; Caffe runs Forward/Backward on the GPUs]
Training loop: (1) load the global model if a new one has arrived, (2) run Forward/Backward, (3) add the resulting update to the accumulated updates, (4) update the local model
Update loop: (1) get the latest model from the PS, (2) send the accumulated updates to the PS
Preserve Local Updates
[Diagram: the same two loops as the previous slide]
When a new global model is loaded, the worker re-applies its accumulated local updates on top of it, so its own recent work is never discarded ("read-my-writes" [1])
[1] ”More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server”, Ho et al., NIPS 2013
Limited Staleness
[Diagram: the slowest worker holds model version 2 while another worker has already loaded version 10]
Use a configurable staleness threshold
Work Assignment with HDFS
The input data is distributed across HDFS machines, stored in blocks of 128 MB (by default), and replicated.
Worker machines may also be HDFS machines.
Problem: assign each (unique) data block to a worker
Requirements (in order of priority):
- Equal work distribution
- Minimize data transfer over the network
The Data Block Assignment Problem
[Diagram: three layers of nodes, data blocks (N), replicas (R), and workers (W), with edges marking locality]
For each data block, choose one replica and assign it to a worker, such that each worker gets N/W blocks (±1) and non-local assignments are minimized.
Solving HDFS Locality Optimization
Represent the assignment as a minimum-cost flow problem, a classical problem with an efficient solution [2].
[2] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications.
[Diagram: flow network over data blocks (N), replicas (R), and workers (W)]
- source → data block: capacity = 1, cost = 0
- data block → replica: capacity = 1, cost = 0
- replica → worker: capacity = 1, cost = 0 for a local replica, 1 for a remote replica
- worker → sink: capacity = N/W, cost = 0
- a flow of N is pushed through the network
Assigning the HDFS data blocks
[Diagram: the saturated edges of the min-cost flow define the block-to-worker assignment]
Initial Results
Setup: 4 machines with one Titan X GPU per machine, TCP/IP over ConnectX-3 InfiniBand,
GoogLeNet model (from Caffe); each machine is used as both a worker and a PS.
[Plot: training loss vs. iterations (0K–40K) for Single worker, DeepSpark, and BSP (the ideal)]
[Plot: iteration time [ms] for Single worker, DeepSpark, and BSP]
Useful Optimization & Debugging Tools
Visualize the program’s execution using NVIDIA Tools Extension (NVTX)
Mark the beginnings and endings of all your important operations
[Screenshots: NVTX-annotated timelines of Caffe and Spark]
Useful Optimization & Debugging Tools
See CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX.
Time ranges are marked using push-pop semantics.
C++ trick: define a special class with "push" in the constructor and "pop" in the destructor.
Define a macro that creates a "profiling" object with info about the enclosing function; to describe
the function, use the macros __PRETTY_FUNCTION__, __FILE__, and __LINE__.
Example:
int func() {
PROFILER_FUNCTION_SCOPE();
... body of function ...
}
Copyright © 2016 Huawei Technologies. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new
technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such
information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.
EUROPEAN RESEARCH CENTER
DeepSpark
Contact emails: