Multi-tenant Deep Learning and Streaming as-a-Service with Hopsworks

Theofilos Kakantousis (@theofiloskak)

COO – Logical Clocks AB

Big Data Moscow 2018

©2018 Logical Clocks AB. All Rights Reserved

Deep Learning & Streaming-as-a-Service in Sweden

● Hopsworks

– Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service

– Built on Hops Hadoop (www.hops.io)

– hops.site, 600+ users as of September 2018

● RISE SICS ICE

– 250 kW Datacenter, ~1000 servers

https://www.sics.se/projects/sics-ice-data-center-in-lulea

[…] the general consensus seems to be that everyone expects some gain in performance numbers if the dataset size is increased dramatically [...]

Deep Learning needs Big data

Sun et al. - Revisiting Unreasonable Effectiveness of Data in Deep Learning Era - 2017

Hestness et al. - Deep Learning Scaling is Predictable, Empirically - 2017

AI Hierarchy of Needs

Data Engineers

Data Scientists

Data Scientists?

DDL (Distributed Deep Learning)

Deep Learning, RL

Machine Learning (ML)

Data Analytics

Data Pipelines

Big Data

Lots of GPUs

GPUs

Full-stack Data Science

Hopsworks

REST API

Hopsworks

Develop Train Test Deploy

MySQL Cluster

Hive

InfluxDB

Elasticsearch

Kafka

Projects, Datasets, Users

HopsFS / YARN

Spark, Flink, TensorFlow

Jupyter

Jobs, Kibana, Grafana

REST API

Big data needs scalable storage

HopsFS*

Metadata

Datanode

Namenode

● HDFS derivative with distributed metadata

– 37x increased capacity
– 16x increased throughput

HDFS Client

HDFS Client

Scale-out all layers

* HopsFS - https://goo.gl/yFCsGc

Scale Challenge Winner (2017)

Hops

HopsFS support for Small Files *

RAM

NVMe Disk

Datanode

Namenode

> 64 KB (configurable)

< 1 KB    1 KB to 64 KB

● Integrates NVMe
● Open Images Dataset:
– 9m images
– ~80% small files (<64 KB)

NVMe Disk

Metadata layer - NDB

* Size Matters: Improving the Performance of Small Files in Hadoop, Middleware 2018. Niazi et al.
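The size labels on this slide can be read as a tiering rule. The following is a minimal sketch of that reading, assuming boundaries at 1 KB and 64 KB; the function name, the tier labels, and the exact handling of the boundary values are hypothetical and this is not HopsFS code.

```python
KB = 1024


def storage_tier(size_bytes: int, small_file_limit: int = 64 * KB) -> str:
    """Pick a storage tier for a file from its size (illustrative only)."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 1 * KB:
        return "metadata-ram"    # very small files stay in memory with the metadata
    if size_bytes <= small_file_limit:
        return "metadata-nvme"   # small files go to NVMe disk in the metadata layer
    return "datanode"            # larger files are stored on the Datanodes


for size in (200, 10 * KB, 5 * 1024 * KB):
    print(size, storage_tier(size))
```

The 64 KB limit is a parameter because the slide marks it as configurable.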

Multi-tenancy

Projects for Software-as-a-Service

Proj-42

Proj-X

Shared Topic

Topic

/Projs/My/Data

Proj-All

CompanyDB

Manage Projects like GitHub

Share like in Dropbox

Share any Data Source/Sink: HDFS Datasets, Kafka Topics, etc.

Project Authorization

● Data Owner Privileges
– Import/Export data
– Manage Membership
– Share DataSets, Topics
● Data Scientist Privileges
– Write and Run code
● Delegate Administration of privileges to users
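The two roles above amount to a small role-to-privilege table. A minimal sketch follows, assuming that a Data Owner can also write and run code (the slide does not say so explicitly); the enum, the action names, and `is_allowed` are hypothetical and are not the Hopsworks API.

```python
from enum import Enum


class Role(Enum):
    DATA_OWNER = "Data Owner"
    DATA_SCIENTIST = "Data Scientist"


# Action names are illustrative labels for the privileges listed on the slide.
SCIENTIST_ACTIONS = {"write_code", "run_code"}
OWNER_ACTIONS = SCIENTIST_ACTIONS | {
    "import_export_data",
    "manage_membership",
    "share_datasets_topics",
}  # assumption: owners keep the scientist privileges

PRIVILEGES = {
    Role.DATA_SCIENTIST: SCIENTIST_ACTIONS,
    Role.DATA_OWNER: OWNER_ACTIONS,
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the role carries the requested privilege."""
    return action in PRIVILEGES[role]


print(is_allowed(Role.DATA_SCIENTIST, "run_code"))
print(is_allowed(Role.DATA_SCIENTIST, "import_export_data"))
```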

Custom Python environments with Conda

Python libraries are usable by Spark/TensorFlow

TLS (not Kerberos) for security in Hops

● X.509 Certificates for authentication
● 1 Certificate for each project user
● New App certificate generated for each job
● Store an audit trail of the operations (read/write/create/etc.) users and apps perform on HopsFS

Resource Manager

Node Manager

HopsFS

Generate App Cert

Auth w/ App Cert

Project_user cert
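Certificate-based authentication of this kind is usually mutual TLS: the client verifies the server and also presents its own X.509 certificate. A minimal client-side sketch with Python's standard `ssl` module follows; the function name and the file arguments are hypothetical, and it does not reproduce how Hops distributes or loads its certificates.

```python
import ssl
from typing import Optional


def make_client_context(ca_file: Optional[str] = None,
                        cert_file: Optional[str] = None,
                        key_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a TLS client context that verifies the server and can present a client cert."""
    # PROTOCOL_TLS_CLIENT turns on certificate and hostname verification by default.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        # Trust only the given CA, e.g. an intermediate CA of the cluster.
        context.load_verify_locations(cafile=ca_file)
    if cert_file:
        # Present the project-user or application certificate to the server.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


context = make_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

With real PEM files, the paths would be passed as `ca_file`, `cert_file`, and `key_file`; none are needed to construct the default context shown here.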

TLS certificate generation

alice@gmail.com

Users don’t see the certificates, they authenticate using:
• LDAP
• password
• 2-Factor Authentication

Add/Del Users

Distributed Database

Insert/Remove Certs

Project Mgr

Root CA

HDFS, Spark, Kafka, YARN

Cert Signing Requests

Intermediate Certificate Authority

Hopsworks

Streaming-as-a-Service

ETL Workloads

Parquet

Hive

Hopsworks Jobs

trigger

Elastic

Pipelines transform raw data to structured data

HopsFS

Streaming Analytics in Hopsworks

HopsFS / YARN

Grafana/InfluxDB

Elastic/Kibana

Public Cloud or On-Premise

Parquet / ORC

Data Src

Batch Analytics

Kafka

…

MySQL

Lifecycle of a Streaming Job

Developer

1. Discover: Schema Registry and Kafka Broker Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: Producer/Consumer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
5. Distribute: X.509 certs to all hosts on the cluster
6. Cleanup securely

Operations

Facilitate dev+ops with hops-util https://github.com/logicalclocks/hops-util
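Step 2 of the lifecycle, done by hand, looks roughly like the sketch below. It assembles standard Kafka client SSL settings into a properties file; the broker addresses, store paths, and password are placeholders, the helper names are hypothetical, and the code does not use the hops-util API.

```python
from typing import Dict, List


def build_kafka_properties(brokers: List[str], keystore_path: str,
                           truststore_path: str, password: str) -> Dict[str, str]:
    """Collect the client settings needed to talk to Kafka over TLS."""
    return {
        "bootstrap.servers": ",".join(brokers),
        "security.protocol": "SSL",
        "ssl.keystore.location": keystore_path,      # holds the client certificate and key
        "ssl.keystore.password": password,
        "ssl.key.password": password,
        "ssl.truststore.location": truststore_path,  # holds the CA certificates to trust
        "ssl.truststore.password": password,
    }


def to_properties_text(props: Dict[str, str]) -> str:
    """Render the settings in Java .properties format, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items())) + "\n"


props = build_kafka_properties(
    brokers=["broker1:9091", "broker2:9091"],  # placeholder endpoints
    keystore_path="/tmp/k_certificate",        # placeholder paths
    truststore_path="/tmp/t_certificate",
    password="changeit",                       # placeholder secret
)
print(to_properties_text(props))
```

The resulting file would be handed to the producer or consumer in step 3 and deleted in step 6, since it contains the store passwords.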

Kafka Self-Service UI

Manage & Share
• Topics
• ACLs
• Avro Schemas

Realtime Logs

● YARN aggregates logs on job completion
– No good to us for Streaming

● Collect logs and make them searchable in real-time using Filebeat, Logstash, Elasticsearch, and Kibana

Realtime Logs

Resource Monitoring/Alerting

Jupyter Notebooks

Dela* – A Global Ecosystem for Datasets

Peer-to-Peer Search and Download for Huge DataSets (ImageNet, YouTube8M, MsCoCo, Reddit, etc.)

*http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)

ML & Deep Learning-as-a-Service

HopsML Pipeline

HopsML Spark/TensorFlow Arch

Executor/Tf Executor/Tf

Driver

HopsFS

TensorBoard

Model Serving

Conda Envs

Conda Envs

Distributed Training

Deep Learning Hierarchy of Scale

DDL AllReduce on GPU Servers

DDL with GPU Servers and Parameter Servers

Parallel Experiments on GPU Servers

Single GPU

Many GPUs on a Single GPU Server

Days/Hours

Days

Weeks

Minutes

Training Time for ImageNet

Hours

“My Model’s Training.”

Training
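The AllReduce tier at the top of this hierarchy relies on one collective operation: every worker contributes its local gradients and every worker ends up with the same average. The sketch below simulates only that result in plain Python, under the assumption that gradients are flat lists of floats; it does not model the ring communication pattern or any GPU library.

```python
from typing import List


def all_reduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Return the gradients each worker holds after an averaging all-reduce."""
    if not worker_grads:
        raise ValueError("at least one worker is required")
    length = len(worker_grads[0])
    if any(len(grads) != length for grads in worker_grads):
        raise ValueError("all workers must hold gradients of the same shape")
    n_workers = len(worker_grads)
    # Reduce: element-wise sum across workers, then divide by the worker count.
    mean = [sum(grads[i] for grads in worker_grads) / n_workers for i in range(length)]
    # Broadcast: every worker receives its own copy of the averaged gradients.
    return [list(mean) for _ in range(n_workers)]


local_grads = [[1.0, 2.0], [3.0, 4.0]]
print(all_reduce_mean(local_grads))
```

A parameter-server design reaches the same average, but routes it through dedicated server processes instead of exchanging data between the workers directly.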

GPU Resource Requests in HopsYARN

HopsYARN HopsYARN

10 GPUs on 1 host

100 GPUs on 10 hosts with ‘Infiniband’

Hops supports a Heterogeneous Mix of GPUs

4 GPUs on any host
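The three request shapes on this slide can be written as data plus a feasibility check. This is a minimal sketch under stated assumptions: `GpuRequest`, `can_place`, and the cluster description are hypothetical, each request is checked on its own without reserving resources, and none of it is the HopsYARN scheduler.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class GpuRequest:
    gpus_per_host: int
    hosts: int = 1
    interconnect: Optional[str] = None  # None means any host is acceptable


def can_place(request: GpuRequest, cluster: List[Dict]) -> bool:
    """Check whether enough hosts satisfy the per-host GPU and interconnect needs."""
    eligible = [
        host for host in cluster
        if host["free_gpus"] >= request.gpus_per_host
        and (request.interconnect is None
             or host.get("interconnect") == request.interconnect)
    ]
    return len(eligible) >= request.hosts


# Hypothetical cluster: ten 10-GPU Infiniband hosts and one 4-GPU host.
cluster = [{"free_gpus": 10, "interconnect": "infiniband"} for _ in range(10)]
cluster.append({"free_gpus": 4, "interconnect": None})

requests = {
    "10 GPUs on 1 host": GpuRequest(10),
    "100 GPUs on 10 hosts with Infiniband": GpuRequest(10, hosts=10, interconnect="infiniband"),
    "4 GPUs on any host": GpuRequest(4),
}
for label, request in requests.items():
    print(label, can_place(request, cluster))
```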

Experiments in Hopsworks

The boring part of the job

● Find good Hyperparameters for your model

● Test different configurations
● Automate this!

“I have to run a hundred experiments to find the best model,” he complained, as he showed me his Jupyter notebooks. “That takes time. Every experiment takes a lot of programming, because there are so many different parameters.”

[https://thomaswdinsmore.com/2018/01/30/predictions-for-2018/]

Experiments in TensorFlow/Hopsworks

● Run and evaluate multiple models in parallel on a subset of the dataset

Experiment 1 Experiment 2

Experiment 3 Experiment 4
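Running experiments in parallel over a set of hyperparameters is, at its simplest, a grid search. The sketch below uses a thread pool as a stand-in for cluster executors and a toy scoring function in place of real training; `train_and_evaluate`, `grid_search`, and the grid values are hypothetical and are not the Hopsworks experiment API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Dict, List, Tuple


def train_and_evaluate(params: Dict[str, float]) -> float:
    """Toy stand-in for a training run; returns a score where higher is better."""
    return -((params["learning_rate"] - 0.01) ** 2) - ((params["dropout"] - 0.5) ** 2)


def grid_search(grid: Dict[str, List[float]], max_workers: int = 4) -> Tuple[Dict[str, float], float]:
    """Evaluate every combination in the grid in parallel and return the best one."""
    names = sorted(grid)
    combos = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(train_and_evaluate, combos))  # order matches combos
    best = max(range(len(combos)), key=scores.__getitem__)
    return combos[best], scores[best]


grid = {"learning_rate": [0.001, 0.01, 0.1], "dropout": [0.3, 0.5]}
best_params, best_score = grid_search(grid)
print(best_params, best_score)
```

Each combination is independent of the others, which is what makes this workload easy to spread across executors or GPUs.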

Reproducible Experiments

● Results tracking
● Hyperparameter tracking
● Jupyter notebook versioning
● Conda Env versioning
● WIP: Dataset versioning
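One simple way to make an experiment reproducible is to store its hyperparameters and results next to fingerprints of the notebook and environment that produced them. The following is a minimal sketch of such a record; the field names, the use of SHA-256 digests, and the sample values are assumptions for illustration, not how Hopsworks stores experiments.

```python
import hashlib
import json
from typing import Any, Dict


def _digest(text: str) -> str:
    """Fingerprint a text artifact so that later changes can be detected."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def experiment_record(name: str, hyperparameters: Dict[str, Any], metrics: Dict[str, Any],
                      conda_env: str, notebook: str) -> Dict[str, Any]:
    """Bundle what is needed to identify and compare one experiment run."""
    return {
        "name": name,
        "hyperparameters": hyperparameters,
        "metrics": metrics,
        "conda_env_sha256": _digest(conda_env),
        "notebook_sha256": _digest(notebook),
    }


record = experiment_record(
    name="example-run",
    hyperparameters={"learning_rate": 0.01, "batch_size": 64},
    metrics={"accuracy": None},  # placeholder: filled in after the run finishes
    conda_env="dependencies:\n  - python=3.6\n",  # placeholder environment spec
    notebook="print('training code goes here')",   # placeholder notebook source
)
print(json.dumps(record, indent=2, sort_keys=True))
```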

Experiments Dashboard

TensorBoard (1)

TensorBoard (2)

HopsAPI*

● Python (also Java/Scala)
– Manage TensorBoard, load/save models in HDFS
– TensorFlow, Horovod, TensorFlowOnSpark
– Parallel experiments
  ● Gridsearch
  ● Model Architecture Search with Genetic Algorithms
– Secure Streaming Analytics with Kafka/Spark/Flink
– SSL/TLS certs, Avro Schema, Endpoints for Kafka/Zookeeper/etc.

* https://github.com/logicalclocks/hops-util-py

Model Serving

Standard serving infrastructure

Scale model serving with Kubernetes

Considered best practice by the community

Provide tools for:
● Fault tolerance
● Rolling release of new models
● Autoscaling

Model Monitoring

HopsFS

Serving infrastructure

Re-train and deploy new model

Model monitoring infrastructure

● Log model inference requests/results to Kafka
● Spark monitors model performance and input data
● When to retrain?
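The slide leaves "When to retrain?" open. One possible heuristic is to flag retraining when the mean of a recently logged input feature drifts away from its training-time baseline. The sketch below shows that idea with the standard library only; the function name, the z-score rule, the threshold, and the sample values are all assumptions, not the monitoring logic used in Hopsworks.

```python
from statistics import mean, pstdev
from typing import Sequence


def should_retrain(baseline: Sequence[float], recent: Sequence[float],
                   threshold: float = 3.0) -> bool:
    """Flag drift when the recent mean is far from the baseline mean."""
    baseline_mean = mean(baseline)
    baseline_std = pstdev(baseline)
    if baseline_std == 0:
        # A constant baseline: any change in the mean counts as drift.
        return mean(recent) != baseline_mean
    # Standard error of the mean for a sample of this size.
    standard_error = baseline_std / len(recent) ** 0.5
    z_score = abs(mean(recent) - baseline_mean) / standard_error
    return z_score > threshold


training_feature = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative values only
print(should_retrain(training_feature, [3.0, 3.1, 2.9, 3.0]))
print(should_retrain(training_feature, [9.0, 10.0, 11.0, 10.0]))
```

In a pipeline like the one on this slide, `recent` would come from the inference requests logged to Kafka, and a positive result would trigger the re-train and deploy step.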

Model Serving on Kubernetes

Orchestrating Hops workflows

Data Collection

Experimentation Training Serving

Feature Extraction

Data Transformation & Verification

Test

Airflow (Hopsworks Operator)

Demo

Summary

● Build a single platform to cover the entire AI hierarchy of needs.

● Increase productivity of Data Scientists
– Manage all your data pipelines and workflows under a single roof
– Have first-class support for Python / Streaming / Deep Learning / ML / Data Governance / GPUs

Hopsworks → logicalclocks.com
GitHub → github.com/logicalclocks
Twitter → @logicalclocks