Apache Submarine
Presenter: Muhammad Afzal
Cisco Data Intelligence Platform
© 2019 Cisco and/or its affiliates. All rights reserved.
Data Intelligence Trends
Data Lake
• Massive ingest rates
• Near real-time analytics
• Data tiering

Containers / AI
• AI everywhere
• Containers as the de facto standard
• Kubernetes dominated

Object (S3) / File Storage
• Object (S3) – new standard
• Exabyte scale
• Warm / archival tier
Broad Applicability: Autonomous vehicles, Utilities, Federal, Financials, Healthcare
Disaggregated Architecture
Big Data Meets AI
• Hadoop 3.0 enables AI workloads to run natively with GPU and container resources
Eliminate Architectural Silos
• Data-intensive workloads, compute-intensive workloads, and storage systems work closely together
Cloud Scale Architecture
• Bringing Big Data, AI, and Object Storage together to scale to thousands of nodes and hundreds of petabytes
Cisco Data Intelligence Platform
Unlock intelligence, performance, and simplicity at scale for large data sets
Automation
Solution Management and Deployment Automation with Intersight
[Platform layers diagram:
• AI / Compute Farm – frameworks and compute apps
• Data Lake – Hadoop (HDFS and Object storage)
• Data Anywhere]
The Hadoop community initiated the Submarine project to make distributed deep learning/machine learning applications easy to launch, manage, and monitor.
Submarine Components
• Submarine Computation Engine – submits customized deep learning jobs to YARN.
• Submarine ecosystem integrations:
• Submarine-Zeppelin Integration – allows data scientists to code inside Zeppelin notebooks and submit/manage training jobs from Zeppelin.
• Submarine-Azkaban Integration – allows data scientists to submit a set of workflow tasks with dependencies directly to Azkaban from Zeppelin notebooks.
• Submarine Installer – installs Submarine and its dependent components.
Distributed Deep Learning with Apache Submarine
[Architecture diagram: the end user launches Submarine jobs through the Submarine CLI/REST interface; YARN schedules the worker and parameter-server (PS) containers and launches and monitors Tensorboard. Zeppelin, via its Submarine interpreter, is used for algorithm development and for editing and scheduling jobs on YARN.]
Submarine Architecture
Submarine Architecture – Contd.
Compute Farm
➢ Easy access to data/models in HDFS and other storage systems
➢ Utilize Docker containers
➢ Between-host container networking
➢ Specify compute resources
➢ Optionally launch Tensorboard
Data Lake / Object Store
Cisco Data Intelligence Platform
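As a rough sketch, each capability above corresponds to a flag of the Submarine CLI. The job name, paths, and image below are illustrative placeholders only; the full, concrete invocation appears on the next slide.

```shell
# Sketch: one flag per compute-farm capability (illustrative values;
# requires a Submarine-enabled YARN cluster and a real <version> to run).
#   --input_path       -> access data/models in HDFS
#   --docker_image     -> utilize Docker containers
#   --env ..._NETWORK  -> between-host container networking
#   --worker_resources -> specify compute resources
#   --tensorboard      -> optionally launch Tensorboard
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name demo-job \
  --input_path hdfs://CiscoHDP/tmp/data \
  --docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0 \
  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
  --worker_resources memory=8G,vcores=2,gpu=1 \
  --tensorboard
```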
Launch Distributed TensorFlow job via YARN CLI
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
--name dtf-job-01 \
--verbose \
--docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0 \
--input_path hdfs://CiscoHDP/tmp/cifar-10-data \
--checkpoint_path hdfs://CiscoHDP/tmp/cifar-10-jobdir \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--num_workers 2 \
--worker_resources memory=8G,vcores=2,gpu=1 \
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=1 --sync" \
--ps_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu:1.0 \
--num_ps 1 \
--ps_resources memory=2G,vcores=2 \
--ps_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
--tensorboard \
--tensorboard_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu:1.0
Monitor and Manage Submarine Jobs in YARN UI
➢ Registry DNS server creates DNS records for the end-points of all Submarine components
➢ Submarine components communicate with each other via these DNS records
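For illustration, YARN Registry DNS names follow the pattern `<component-instance>.<service>.<user>.<domain>`. In the sketch below, the component instance, submitting user, and DNS domain are assumptions; the service name is taken from the `dtf-job-01` CLI example.

```shell
# Sketch of how a Submarine component's Registry DNS name is composed.
# All four values are illustrative assumptions for this deck's cluster.
COMPONENT="worker-0"     # assumed: first TensorFlow worker instance
SERVICE="dtf-job-01"     # job name from the CLI example
SUBMIT_USER="hdfs"       # assumed: user who submitted the job
DOMAIN="hdp3.cisco.com"  # assumed: Registry DNS zone
FQDN="${COMPONENT}.${SERVICE}.${SUBMIT_USER}.${DOMAIN}"
echo "${FQDN}"  # -> worker-0.dtf-job-01.hdfs.hdp3.cisco.com
```

On a live cluster, resolving this name through the Registry DNS server (e.g. with `nslookup`) would return the worker container's IP address.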
Submarine Components
Submarine uses YARN to schedule and run distributed DL in Docker containers.
Compute Resources allocated to Submarine jobs
➢ YARN allocates compute resources (CPU, RAM, and GPU) and manages the life cycle of Submarine jobs
➢ After a job completes, its resources are returned to the YARN cluster
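Because Submarine jobs are ordinary YARN applications, the standard YARN CLI is one way to watch this life cycle — a sketch, assuming a live cluster; `<application_id>` is printed when the job is submitted and is not filled in here.

```shell
# Sketch: inspecting a Submarine job with the standard YARN application CLI.

# List all running YARN applications, including Submarine jobs:
yarn application -list

# Show the state, progress, and tracking URL of one job:
yarn application -status <application_id>

# Fetch aggregated container logs once the job has finished
# and its resources have been returned to the cluster:
yarn logs -applicationId <application_id>
```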
Submarine – Zeppelin Integration
Directly develop, debug, and visualize deep learning algorithms in Zeppelin
The Zeppelin Submarine interpreter automatically merges the algorithm files into sections and submits them to the Submarine computation engine for execution.
Visualize Deep Learning Job in Tensorboard
# yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
--name tensorboard-service-001 \
--docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu:1.0 \
--tensorboard
Execute a YARN service to monitor all TF jobs' training progress in one Tensorboard dashboard
Launch Tensorboard from Zeppelin
Thank You