Date posted: 22-Jan-2018
Category: Technology
Uploaded by: jim-dowling
@ODSC
Distributed Deep Learning on Hops
Robin Andersson
Fabio Buso
RISE SICS AB
Logical Clocks AB
London | October 12th-14th 2017
Please register on odsc.hops.site
Big Data and AI
Why you are here
From: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
Deep Learning with GPUs (on Hops)
Separate Clusters for Big Data and ML
*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!
Data Science in Enterprises Today
CTO: I need estimates for the ROI on these candidate features in our product.
Data Science Team: We are on it. Need to first sync up with IT and engineering.
Collaboration Overhead is High
IT
Data Science Team
Data Engineering
We need access to these Datasets
Ok
Prepare Dataset samples for Data Science
DataLake
GPU Cluster
1. Update Access Rights
2. Copy Dataset Samples (some time later)
3. Run experiments
How it should be
IT
Data Science Data Engineering
Here’s someone who can help you out
I need help to work on a project for the CTO
Project
Conda Env, CPU/Storage Quotas, Self-Service, GDPR
Kafka Topics
DataLake
GPU Cluster
Elasticsearch
HopsWorks Data Platform
HopsWorks
Kafka Topic
Project X Project Y
Project Data
HopsFS
Open Source fork of Apache HDFS
16x higher throughput than HDFS
37x more metadata capacity than HDFS
SSL/TLS instead of Kerberos
Scale Challenge Winner (2017)
https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
HopsYARN GPUs
Native GPU support in YARN - world first
Implications
- Schedule GPUs just like memory or CPU
- Exclusive allocation (no GPU-sharing)
- Distributed, scale-out Machine Learning
TensorFlow first-class support in Hops
TensorFlow code runs in Spark Executors, each with its own hyperparameters:
Spark Executor (TensorFlow code): 0.003 learning rate, 0.3 dropout
Spark Executor (TensorFlow code): 0.001 learning rate, 0.5 dropout
Spark Executor (TensorFlow code): 0.002 learning rate, 0.7 dropout
HopsUtil
Library for launching TensorFlow jobs
Manages the TensorBoard lifecycle
Helper Functions for Spark/Kafka/HDFS/etc
HopsUtil - Read data
from os import path
import tensorflow as tf
from hopsutil import hdfs

dataset = path.join(hdfs.project_path(), 'Resources/mnist/tfr/train')
files = tf.gfile.Glob(path.join(dataset, 'part-*'))
file_queue = tf.train.string_input_producer(files, … )
HopsUtil - initialize Pydoop HDFS API
The Pydoop HDFS API is a rich API that provides operations such as:
- Connecting to an HDFS instance
- General file operations (create, read, write)
- Getting information on files, directories, and the file system
Connect to HopsFS using HopsUtil:
from hopsutil import hdfs
pydoop_handle = hdfs.get()
HopsUtil - TensorBoard
from hopsutil import tensorboard
[...]
logdir = tensorboard.logdir()
sv = tf.train.Supervisor(is_chief=True, logdir=logdir, [...], save_model_secs=60)
HopsUtil - Hyperparameter searching
from hopsutil import tflauncher
def training(learning_rate, dropout):
    [....]

params = {'learning_rate': [0.001, 0.002, 0.003], 'dropout': [0.3, 0.5, 0.7]}
tflauncher.launch(spark, training, params)
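The slide does not spell out how the params dictionary turns into one training run per Spark executor. The sketch below shows two plausible expansions, a full grid and an index-wise pairing. The helper names grid_combinations and paired_combinations are hypothetical and are not part of hopsutil; which scheme tflauncher.launch actually applies is an assumption to verify against the library.

```python
from itertools import product


def grid_combinations(params):
    """Every combination of the listed values (full grid search)."""
    keys = sorted(params)
    return [dict(zip(keys, values))
            for values in product(*(params[k] for k in keys))]


def paired_combinations(params):
    """Index-wise pairing: the i-th value of each list forms run i."""
    keys = sorted(params)
    return [dict(zip(keys, values))
            for values in zip(*(params[k] for k in keys))]


params = {'learning_rate': [0.001, 0.002, 0.003], 'dropout': [0.3, 0.5, 0.7]}
grid = grid_combinations(params)      # 3 x 3 candidate runs
paired = paired_combinations(params)  # 3 candidate runs

for run in paired:
    # Each dictionary could be passed to training(**run) on one executor.
    print(run)
```

With three values per hyperparameter, a grid needs nine executors (or nine sequential runs), while pairing needs only three, so the choice directly affects how many GPUs a search occupies.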
HopsUtil - Logging
from hopsutil import hdfs
[...]
while not sv.should_stop() and step < steps:
    hdfs.log(sess.run(accuracy))
[...]
DEMO TIME! TensorFlow tour on HopsWorks
How to get started
How to get started (2)
How to get started (3)
TensorBoard
Dela - Search for interesting datasets
Dela - Import a Dataset
Dela
p2p network of Hops clusters
Find and share interesting datasets
Exploits unused bandwidth and backs off in case of network traffic
The Challenge
http://timdettmers.com/2017/08/31/deep-learning-research-directions
Experiment Time and Research Productivity
● Minutes, Hours:
  ○ Interactive analysis!
● 1-4 days:
  ○ Interactivity replaced by many parallel experiments
● 1-4 weeks:
  ○ High value experiments only
● >1 month:
  ○ Don’t even try!
Solution: Go distributed
State-of-the-Art in GPU Hardware
Nvidia DGX-1
SingleRoot Commodity GPU Cluster Computing
The budget side
Commodity Server*
➔ 10 Nvidia GTX 1080Ti
  ◆ 11 GB Memory
➔ 256 GB RAM
➔ 2 Intel Xeon CPUs
➔ Infiniband
➔ SingleRoot PCI Complex
10 x Commodity Server = 150K Euro
Nvidia DGX-1
➔ 8 Nvidia Tesla V100
  ◆ 16 GB Memory
➔ 512 GB RAM
➔ 2 Intel Xeon CPUs
➔ Infiniband
➔ NVLink
Price per DGX-1 = 150K Euro
*www.servethehome.com/single-root-or-dual-root-for-deep-learning-gpu-to-gpu-systems/
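The budget comparison can be reduced to a cost per GPU and a total amount of GPU memory for the same 150K Euro. The sketch below uses only the figures on the slide; it ignores the per-GPU performance difference between a GTX 1080Ti and a Tesla V100, as well as power and networking costs, so it is a rough comparison rather than a benchmark.

```python
def cost_per_gpu(total_price_eur, servers, gpus_per_server):
    """Price divided by the total number of GPUs bought."""
    return total_price_eur / (servers * gpus_per_server)


# Figures taken from the slide: both options cost 150K Euro.
commodity_cost_per_gpu = cost_per_gpu(150_000, servers=10, gpus_per_server=10)
dgx1_cost_per_gpu = cost_per_gpu(150_000, servers=1, gpus_per_server=8)

# Total GPU memory in GB for the same budget.
commodity_gpu_memory_gb = 10 * 10 * 11   # 100 x GTX 1080Ti, 11 GB each
dgx1_gpu_memory_gb = 8 * 16              # 8 x Tesla V100, 16 GB each

print("Commodity: %.0f EUR/GPU, %d GB GPU memory"
      % (commodity_cost_per_gpu, commodity_gpu_memory_gb))
print("DGX-1:     %.0f EUR/GPU, %d GB GPU memory"
      % (dgx1_cost_per_gpu, dgx1_gpu_memory_gb))
```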
Distributed TensorFlow
Distribute TensorFlow graph
Workers / Parameter server
Synchronous / Asynchronous
Model / Data parallelism
Problems:
- Clusterspec
- Manually starting processes
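To illustrate both problems: with plain Distributed TensorFlow, every process needs the same cluster specification listing the host:port of each parameter server and worker, and each process has to be started by hand with its own job name and task index. The sketch below shows that bookkeeping in plain Python. The host names, the port, and the helpers build_cluster_spec and launch_plan are made up for illustration; the resulting dictionary has the job-name-to-address-list shape that tf.train.ClusterSpec accepts.

```python
def build_cluster_spec(ps_hosts, worker_hosts, port=2222):
    """Map each job name to the list of host:port addresses of its tasks."""
    return {
        "ps": ["%s:%d" % (host, port) for host in ps_hosts],
        "worker": ["%s:%d" % (host, port) for host in worker_hosts],
    }


def launch_plan(cluster_spec):
    """List every (job_name, task_index, address) that must be started manually."""
    return [(job, index, address)
            for job in sorted(cluster_spec)
            for index, address in enumerate(cluster_spec[job])]


# Hypothetical hosts: one parameter server and two workers.
cluster_spec = build_cluster_spec(["node0"], ["node1", "node2"])
plan = launch_plan(cluster_spec)

for job, index, address in plan:
    # One process per entry, each given the full cluster_spec plus its own role.
    print("start --job_name=%s --task_index=%d on %s" % (job, index, address))
```

Every added or replaced machine means editing this specification and restarting processes, which is the manual work a wrapper can automate.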
Introducing TensorFlowOnSpark by YAHOO!
Wrapper for Distributed TensorFlow
- Creates clusterspec automatically!
- Runs on a Hadoop/Spark cluster
- Starts the workers/parameter servers automatically
- First attempt at “scheduling” GPUs
- Simplifies the programming model
- Manages TensorBoard
- “Migrate all existing TF programs with < 10 lines of code”
TensorFlowOnSpark architecture
Diagram components: Spark Driver; Spark Executor (Parameter Server); Spark Executor (Worker) x 2; HopsFS
Scaling TensorFlowOnSpark
Near linear scaling up to 8 workers
*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!
TensorFlowOnSpark on Hops
Our improved TensorFlowOnSpark - 1
Problem: Use RAM (1 GPU = 27 GB RAM) as a proxy to ‘schedule’ GPUs.
Solution: Hops provides GPU scheduling!
Our improved TensorFlowOnSpark - 2
Problem: A worker will wait until GPUs become available, potentially forever!
Solution: GPU scheduling ensures that the GPU is only allocated for that particular worker.
Our improved TensorFlowOnSpark - 3
Problem: Each parameter server allocates 1 GPU, this is a waste!
Solution: Only workers may use GPUs.
Conversion guide: TensorFlowOnSpark
TFCluster.run(spark, training_fun, num_executors, num_ps…)
Add PySpark and TensorFlowOnSpark imports
Create your own FileWriter
Replace tf.train.Server() with TFNode.start_cluster_server()
Full conversion guide for Distributed TensorFlow to TensorFlowOnSpark: https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
DEMO TIME! Distributed TF on Spark
Distributed Stochastic Gradient Descent
SGD with Data Parallelism (Single Host)
Facebook: Scaling Synchronous SGD
June 2017: training time on ImageNet reduced from 2 weeks to 1 hour
➔ ~90% scaling efficiency going from 8 to 256 GPUs
Learning rate heuristic / Warm-up phase / Large batches
Paper: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
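The learning rate heuristic and warm-up phase refer to the linear scaling rule and gradual warmup described in the paper: the learning rate grows in proportion to the global minibatch size, and it is ramped up linearly from the base rate over the first few epochs. The sketch below assumes the paper's reference setting of a 0.1 learning rate per 256 images and a 5-epoch warmup; the function names are illustrative, and the step-wise decay applied later in training is omitted.

```python
def scaled_learning_rate(global_batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: k times the batch size gives k times the learning rate."""
    return base_lr * global_batch_size / base_batch_size


def warmup_learning_rate(epoch, global_batch_size, warmup_epochs=5,
                         base_lr=0.1, base_batch_size=256):
    """Ramp linearly from base_lr to the scaled rate during the warmup epochs.

    `epoch` may be fractional so the rate can be updated every iteration.
    """
    target = scaled_learning_rate(global_batch_size, base_lr, base_batch_size)
    if epoch >= warmup_epochs:
        return target
    return base_lr + (target - base_lr) * epoch / warmup_epochs


# 256 GPUs with 32 images each gives a global minibatch of 8192 images.
global_batch = 256 * 32
for epoch in [0, 1, 2.5, 5, 10]:
    print(epoch, round(warmup_learning_rate(epoch, global_batch), 3))
```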
All-Reduce
N GPUs, K parameters
Communication cost per GPU: 2(N-1) * K/N
Nearly independent of the number of GPUs (approaches 2K as N grows)
Overlaps communication and computation
Drawback: Synchronous communication
From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
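To make the 2(N-1) * K/N cost concrete, here is a small single-process simulation of ring all-reduce. Each simulated GPU holds K values split into N chunks; a scatter-reduce phase of N-1 steps is followed by an all-gather phase of N-1 steps, and in every step each GPU sends one chunk of K/N values to its right-hand neighbour. This is an illustrative sketch, not the Baidu implementation; it assumes K is divisible by N and models GPUs as Python lists.

```python
def ring_allreduce(worker_data):
    """Sum equally sized vectors across workers arranged in a ring.

    Returns (reduced vectors per worker, number of values sent per worker).
    """
    n = len(worker_data)
    k = len(worker_data[0])
    assert k % n == 0, "sketch assumes K is divisible by N"
    chunk = k // n
    data = [list(vector) for vector in worker_data]
    sent = [0] * n

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, scatter-reduce: after N-1 steps worker i owns the full sum
    # of chunk (i + 1) mod N.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first so the sends are simultaneous.
        messages = [(i, (i - step) % n, data[i][span((i - step) % n)])
                    for i in range(n)]
        for src, c, payload in messages:
            dst = (src + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]
            sent[src] += len(payload)

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        messages = [(i, (i + 1 - step) % n, data[i][span((i + 1 - step) % n)])
                    for i in range(n)]
        for src, c, payload in messages:
            dst = (src + 1) % n
            data[dst][span(c)] = payload
            sent[src] += len(payload)

    return data, sent


# 4 simulated GPUs, 8 parameters each.
gradients = [[float(w * 10 + j) for j in range(8)] for w in range(4)]
reduced, sent = ring_allreduce(gradients)
print(reduced[0])
print(sent)  # each entry equals 2 * (N - 1) * K / N
```

Because each GPU only ever talks to its neighbour and sends K/N values per step, adding GPUs adds steps but shrinks the chunks, which is why the per-GPU volume stays close to 2K.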
Baidu All-Reduce - Performance scaling
From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
Horovod - Better than Baidu All-Reduce?
Fork of Baidu All-Reduce
Improvements
1. Replaced Baidu ring-allreduce with NVIDIA NCCL
2. Tensor Fusion
3. Support for larger models
4. Pip package
5. Horovod Timeline
Migrating existing code to run on Horovod
1. Run hvd.init()
2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list. Local rank maps to a unique GPU for the process.
3. Wrap the optimizer in hvd.DistributedOptimizer.
4. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes.
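The four migration steps can be collected in one helper. This is a sketch assuming TensorFlow 1.x and the horovod.tensorflow module; the function name build_horovod_training, the loss argument, the Adagrad optimizer, and the learning rate are placeholders chosen for illustration. The imports sit inside the function so the snippet can be loaded on a machine without TensorFlow or Horovod installed.

```python
def build_horovod_training(loss, learning_rate=0.01):
    """Apply the four Horovod migration steps to an existing TF 1.x loss tensor."""
    import tensorflow as tf            # assumes TensorFlow 1.x APIs
    import horovod.tensorflow as hvd

    # Step 1: initialise Horovod.
    hvd.init()

    # Step 2: pin one GPU per process; the local rank selects the GPU.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Step 3: wrap the optimizer so gradients are averaged with all-reduce.
    optimizer = tf.train.AdagradOptimizer(learning_rate)
    optimizer = hvd.DistributedOptimizer(optimizer)
    train_op = optimizer.minimize(loss)

    # Step 4: broadcast initial variable states from rank 0 to all processes.
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]

    return train_op, hooks, config
```

The returned config and hooks would typically be passed to tf.train.MonitoredTrainingSession(config=config, hooks=hooks), with one such process started per GPU.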
Horovod/Baidu AllReduce
Provide as a service on HopsWorks
Integration of All-Reduce with a Hadoop cluster
- Use YARN to schedule GPUs
Scheduling of homogeneous GPUs and network
- YARN supports node labels
HopsFS authentication/authorization
TensorBoard lifecycle management as in HopsUtil
The team
Active contributors: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.
Past contributors: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Aruna Kumari Yedurupaka, Tobias Johansson, Roberto Bampi, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid.
www.hops.io
github.com/hopshadoop
@hopshadoop