Apache Submarine
Presenter: Muhammad Afzal
Cisco Data Intelligence Platform
© 2019 Cisco and/or its affiliates. All rights reserved.
Data Intelligence Trends
Data Lake
• Massive ingest rates
• Near real-time analytics
• Data tiering

Containers / AI
• AI everywhere
• Containers as the de facto standard
• Kubernetes dominated

Object (S3) / File Storage
• Object (S3) – new standard
• Exabyte scale
• Warm / archival tier
Broad Applicability: Autonomous vehicles, Utilities, Federal, Financials, Healthcare
Disaggregated Architecture
Big Data Meets AI
• Hadoop 3.0 enables AI workloads to run natively with GPU and container resources
Eliminate Architectural Silos
• Data-intensive workloads, compute-intensive workloads, and storage systems work closely together
Cloud Scale Architecture
• Bringing Big Data, AI, and Object Storage together to scale to thousands of nodes and hundreds of petabytes
Cisco Data Intelligence Platform
Unlock intelligence, performance, and simplicity at scale for large data sets
Automation
Solution Management and Deployment Automation with Intersight
[Platform layers diagram:
• AI / Compute Farm – frameworks and compute apps
• Data Lake – Hadoop (HDFS and Object storage)
• Data Anywhere]
The Hadoop community initiated the Submarine project to make distributed deep learning/machine learning applications easy to launch, manage, and monitor.
Submarine Components
• Submarine Computation Engine – submits customized deep learning jobs to YARN.
• Submarine ecosystem integrations:
• Submarine-Zeppelin Integration – allows data scientists to code inside Zeppelin notebooks and submit/manage training jobs from Zeppelin.
• Submarine-Azkaban Integration – allows data scientists to submit a set of workflow tasks with dependencies directly to Azkaban from Zeppelin notebooks.
• Submarine Installer – installs Submarine and its dependent components.
Distributed Deep Learning with Apache Submarine
[Architecture diagram: the end user launches Submarine jobs through the Submarine CLI/REST interface; YARN schedules the worker and parameter-server (PS) containers and launches and monitors Tensorboard. Zeppelin, via its Submarine interpreter, is used for algorithm development and for editing and scheduling jobs on YARN.]
Submarine Architecture
Submarine Architecture – Contd.
Compute Farm
➢ Easy access to data/models in HDFS and other storage systems
➢ Utilize Docker containers
➢ Between-host container networking
➢ Specify compute resources
➢ Optionally launch Tensorboard
Data Lake / Object Store
Cisco Data Intelligence Platform
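As a rough sketch, each capability above corresponds to a flag of the Submarine CLI. The job name, paths, and image below are illustrative placeholders only; the full, concrete invocation appears on the next slide.

```shell
# Sketch: one flag per compute-farm capability (illustrative values;
# requires a Submarine-enabled YARN cluster and a real <version> to run).
#   --input_path       -> access data/models in HDFS
#   --docker_image     -> utilize Docker containers
#   --env ..._NETWORK  -> between-host container networking
#   --worker_resources -> specify compute resources
#   --tensorboard      -> optionally launch Tensorboard
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name demo-job \
  --input_path hdfs://CiscoHDP/tmp/data \
  --docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0 \
  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
  --worker_resources memory=8G,vcores=2,gpu=1 \
  --tensorboard
```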
Launch Distributed TensorFlow job via YARN CLI
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
--name dtf-job-01 \
--verbose \
--docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0 \
--input_path hdfs://CiscoHDP/tmp/cifar-10-data \
--checkpoint_path hdfs://CiscoHDP/tmp/cifar-10-jobdir \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--num_workers 2 \
--worker_resources memory=8G,vcores=2,gpu=1 \
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=1 --sync" \
--ps_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu:1.0 \
--num_ps 1 \
--ps_resources memory=2G,vcores=2 \
--ps_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
--tensorboard \
--tensorboard_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu:1.0
Monitor and Manage Submarine Jobs in YARN UI
➢ Registry DNS server creates DNS records for the end-points of all Submarine components
➢ Submarine components communicate with each other via these DNS records
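For illustration, YARN Registry DNS names follow the pattern `<component-instance>.<service>.<user>.<domain>`. In the sketch below, the component instance, submitting user, and DNS domain are assumptions; the service name is taken from the `dtf-job-01` CLI example.

```shell
# Sketch of how a Submarine component's Registry DNS name is composed.
# All four values are illustrative assumptions for this deck's cluster.
COMPONENT="worker-0"     # assumed: first TensorFlow worker instance
SERVICE="dtf-job-01"     # job name from the CLI example
SUBMIT_USER="hdfs"       # assumed: user who submitted the job
DOMAIN="hdp3.cisco.com"  # assumed: Registry DNS zone
FQDN="${COMPONENT}.${SERVICE}.${SUBMIT_USER}.${DOMAIN}"
echo "${FQDN}"  # -> worker-0.dtf-job-01.hdfs.hdp3.cisco.com
```

On a live cluster, resolving this name through the Registry DNS server (e.g. with `nslookup`) would return the worker container's IP address.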
Submarine Components
Submarine uses YARN to schedule and run distributed DL in Docker containers.
Compute Resources allocated to Submarine jobs
➢ YARN allocates compute resources (CPU, RAM, and GPU) and manages the life cycle of Submarine jobs
➢ After a job completes, its resources are returned to the YARN cluster
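Because Submarine jobs are ordinary YARN applications, the standard YARN CLI is one way to watch this life cycle — a sketch, assuming a live cluster; `<application_id>` is printed when the job is submitted and is not filled in here.

```shell
# Sketch: inspecting a Submarine job with the standard YARN application CLI.

# List all running YARN applications, including Submarine jobs:
yarn application -list

# Show the state, progress, and tracking URL of one job:
yarn application -status <application_id>

# Fetch aggregated container logs once the job has finished
# and its resources have been returned to the cluster:
yarn logs -applicationId <application_id>
```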
Submarine – Zeppelin Integration
Directly develop, debug, and visualize deep learning algorithms in Zeppelin
The Zeppelin Submarine interpreter automatically merges the algorithm files into sections and submits them to the Submarine computation engine for execution.
Visualize Deep Learning Job in Tensorboard
# yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
--name tensorboard-service-001 \
--docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu:1.0 \
--tensorboard
Execute a YARN service to monitor all TF jobs' training progress in one Tensorboard dashboard
Launch Tensorboard from Zeppelin
Thank You