Machine Learning on OpenStack
How can Scientific OpenStack help?
Feb 2021
John Garbutt, StackHPC
StackHPC Company Overview
● Formed 2016, based in Bristol, UK
○ Presence in Cambridge, France and Poland
○ Currently 16 people
● Founded on HPC expertise
○ Software Defined Networking
○ Systems Integration
○ OpenStack Development and Operations
● Motivation to transfer this expertise into Cloud to address HPC & HPDA
● “Open” Modus Operandi
○ Upstream development of OpenStack capabilities
○ Consultancy/Support to end-user organizations in managing HPC service transition
○ Scientific-SIG engagement for the Open Infrastructure Foundation
○ Open Source, Design, Development and Community
What is needed for Machine Learning?
Training, Inference, Data and more
https://developers.google.com/machine-learning/crash-course/production-ml-systems
TensorFlow Extended (TFX)
https://github.com/GoogleCloudPlatform/tf-estimator-tutorials/blob/54c3099d3a687052bd463e1344a8836913ac2d26/00_Miscellaneous/tfx/02_tfx_end_to_end.ipynb
Pipeline stages: Transform Data -> Model Training -> Inference
Machine Learning Breakdown
● Data Processing and Pipelines
○ Transform to extract Features and Labels
○ Data Visualization
● Training: Static vs Dynamic Model Training
○ Does the input data change over time?
○ Pipeline reproducibility
● Inference: Offline vs Online predictions
○ Regression and Classification raise similar questions
○ Decision latency can be critical; faster predictions may need more resources
● Model complexity
○ Linear, Non-linear, how deep, how wide, ...
● Flow: Dev -> Stage -> Prod
● MLOps: Configuration Management, Deployment tooling, Monitoring...
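The static vs dynamic training distinction above can be sketched in a few lines of Python. A running mean stands in for a real model here; this is purely illustrative and not tied to any framework mentioned in these slides:

```python
# Sketch: static training fits once on a fixed dataset; dynamic training
# keeps updating the model as new data arrives. A running mean stands in
# for a real model (illustrative only).

def train_static(samples):
    """Fit once; the model never sees data collected after training."""
    return sum(samples) / len(samples)

class DynamicModel:
    """Incrementally updated model: each new sample shifts the estimate."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n

static = train_static([1.0, 2.0, 3.0])      # trained on historic data only

dynamic = DynamicModel()
for x in [1.0, 2.0, 3.0, 10.0, 10.0]:       # input distribution drifts
    dynamic.update(x)

print(static)        # 2.0 -> stale once the data drifts
print(dynamic.mean)  # 5.2 -> tracks the newer samples
```

If the input data changes over time, only the dynamic approach keeps up, which is why pipeline reproducibility and retraining cadence matter.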
https://www.kubeflow.org/docs/started/kubeflow-overview/#introducing-the-ml-workflow
Infrastructure Requests
● Offline can fit batch (e.g. Slurm), but Online cannot
○ Offline Training and Online Inference: you may want a mix
● Scale up
○ CPUs are not always the best price per performance
○ GPUs are often better; also IPUs, Tensor cores, new CPU instructions
○ Connect to disaggregated accelerators and/or storage
● Scale out
○ Distribute work via an RDMA low latency interconnect
● High Performance Storage
○ Keep the model processing fed, share transformed data
○ RDMA enabled low latency access to data sets
● Monitoring to understand how your chosen model performs
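"Keep the model processing fed" has a back-of-envelope check: multiply GPU count by consumption rate and sample size. All the numbers below are illustrative assumptions, not measurements from these slides:

```python
# Back-of-envelope check (illustrative numbers): does the storage system
# deliver enough read bandwidth to keep N GPUs fed during training?

def required_read_gbps(num_gpus, samples_per_sec_per_gpu, sample_mb):
    """Aggregate read bandwidth (GB/s) needed so GPUs never stall on input."""
    return num_gpus * samples_per_sec_per_gpu * sample_mb / 1000.0

# Assumed: 8 GPUs, each consuming 1000 ImageNet-sized samples/s of ~0.15 MB.
need = required_read_gbps(num_gpus=8, samples_per_sec_per_gpu=1000, sample_mb=0.15)
print(f"{need:.1f} GB/s")  # 1.2 GB/s: roughly the peak of a single 10GbE link
```

Sums like this motivate the RDMA-enabled, low latency storage bullet: once several nodes train together, a single commodity link cannot sustain the input pipeline.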
Scientific OpenStack
HPC Stack 1.0
Motivations Driving a Change
● Manage the increasing complexity
● Better knowledge sharing
● Move away from Resource Silos
HPC Stack 2.0
OpenStack Magnum
● Kubernetes clusters on demand
○ … working to add support for K8s v1.20
○ Terraform and Ansible can be used to manage clusters
● Magnum Cluster Autoscaler
○ Automatically expands and shrinks the K8s cluster, within defined limits
○ Based on when pods can / can’t be scheduled
● Storage Integration
○ Cinder CSI for Volumes (ReadWriteOnce PVC)
○ Manila CSI for CephFS shares (ReadWriteMany PVC)
● Network Integration
○ Octavia load balancer as a service
https://github.com/RSE-Cambridge/iris-magnum
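The autoscaler behaviour above reduces to a simple decision rule: grow when pods are unschedulable, shrink when nodes are idle, stay within limits. A much-simplified sketch of that rule (the real Magnum Cluster Autoscaler is considerably more involved; names and numbers here are illustrative):

```python
# Sketch of the cluster-autoscaler decision: grow when pods can't be
# scheduled, shrink when nodes sit idle, always staying within the
# configured limits. Simplified; not the real autoscaler algorithm.

def desired_node_count(current, pending_pods, idle_nodes, min_nodes, max_nodes):
    if pending_pods > 0:                 # pods can't be scheduled: scale out
        target = current + pending_pods  # naive: one extra node per pending pod
    elif idle_nodes > 0:                 # spare capacity: scale in
        target = current - idle_nodes
    else:
        target = current
    return max(min_nodes, min(max_nodes, target))

print(desired_node_count(current=3, pending_pods=4, idle_nodes=0,
                         min_nodes=1, max_nodes=5))  # 5: capped at max_nodes
print(desired_node_count(current=3, pending_pods=0, idle_nodes=2,
                         min_nodes=2, max_nodes=5))  # 2: floored at min_nodes
```

The "within defined limits" clamping is what lets operators cap how much of a shared GPU pool any one cluster can absorb.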
OpenStack GPUs
GPUs in OpenStack
● Ironic
○ Full access to hardware, including all GPUs and Networking
● Virtual Machine with PCI Passthrough
○ Share out a single physical machine
○ Flavors with one or multiple GPUs
○ Some protection of GPU firmware
○ … restrictions around using data centre GPUs
● Virtual Machine with vGPU
○ Typical vGPU requires expensive licences
○ Time slicing the GPU, created for VDI
○ Depends if your workloads can saturate a full GPU
○ … but A100 includes MIG (multi-instance GPU)
● Some GPU features need RDMA networking
GPU Resource Management
● GPUs are expensive, and need careful management to get a good ROI
● Batch Queue System, e.g. Slurm
○ Sharing via batch jobs can be very efficient
○ … but not great for dynamic training and online inference
● OpenStack Quotas
○ Today there is no real support for GPU quotas
○ … but flavors that request GPUs can be limited to projects
○ and projects can be limited to specific subsets of hardware
● Reservations and Pre-emptibles
○ Scientific OpenStack is looking at OpenStack Blazar
○ Projects can reserve resources ahead of time
○ Option to use pre-emptibles to scale out when resources are free
○ … to stop people holding on to GPUs without using them
● Get in touch if you are interested in shaping this work
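The reservations-plus-pre-emptibles model can be sketched as simple capacity accounting: reservations hold GPUs for a project during a time window, and pre-emptible jobs may use whatever is left right now. This is a toy version of the idea, not Blazar's actual data model:

```python
# Sketch of Blazar-style GPU scheduling: reservations hold capacity for a
# project in a future window; pre-emptible jobs may use idle GPUs now, but
# get evicted when a reservation becomes active. Illustrative only.

TOTAL_GPUS = 8

def free_gpus(reservations, hour):
    """GPUs not covered by any reservation active at the given hour."""
    reserved = sum(r["gpus"] for r in reservations
                   if r["start"] <= hour < r["end"])
    return TOTAL_GPUS - reserved

reservations = [{"project": "astro", "gpus": 6, "start": 9, "end": 17}]

print(free_gpus(reservations, hour=8))   # 8: reservation not yet active
print(free_gpus(reservations, hour=10))  # 2: pre-emptibles squeezed to 2 GPUs
```

Pre-emptible capacity shrinking when reservations start is exactly what discourages idle hoarding: held-but-unused GPUs go back into the free pool.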
https://unsplash.com/photos/oBbTc1VoT-0
OpenStack and RDMA (Remote Direct Memory Access)
RDMA with OpenStack
● Ethernet with RoCEv2, also Infiniband
○ Ethernet switches can use ECN and PFC to reduce packet loss, plus a larger MTU
● SR-IOV
○ Hardware physical function (PF) mapped to multiple virtual functions (VFs)
○ Hardware configured to limit traffic to VLANs (and sometimes overlays)
○ … typically no security groups; bonding possible with some NICs; QoS possible
○ https://www.stackhpc.com/vxlan-ovs-bandwidth.html and https://www.stackhpc.com/sriov-kayobe.html
● Virtual Machine runs drivers for the specific NIC
○ … ignoring mdev for now
● Live-migration with SR-IOV
○ Makes use of hot unplug then hot plug
○ Possible to bond with a virtual NIC, but this breaks RDMA
○ … in future mdev may help, but not today
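The PF-to-VF relationship above is essentially fixed-pool bookkeeping: each physical function exposes a limited number of virtual functions, and each VM passthrough consumes one. A toy accounting sketch (hypothetical names; not an OpenStack or Neutron API):

```python
# Sketch of SR-IOV resource accounting: each physical function (PF) exposes
# a fixed number of virtual functions (VFs); each VM passthrough consumes
# one VF. Simplified bookkeeping, illustrative only.

class PhysicalFunction:
    def __init__(self, name, num_vfs):
        self.name = name
        self.free_vfs = num_vfs

    def allocate_vf(self):
        """Hand one VF to a VM, or return None when the PF is exhausted."""
        if self.free_vfs == 0:
            return None
        self.free_vfs -= 1
        return f"{self.name}-vf{self.free_vfs}"

pf = PhysicalFunction("ens3f0", num_vfs=2)   # assumed NIC name, 2 VFs
print(pf.allocate_vf())  # ens3f0-vf1
print(pf.allocate_vf())  # ens3f0-vf0
print(pf.allocate_vf())  # None: no VFs left on this PF
```

This is why VF counts per NIC become a capacity-planning input alongside CPUs and RAM when RDMA-capable flavors are offered.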
Kubernetes RDMA with OpenStack
● Some Pods need RDMA enabled networking
● Kubernetes in a VM with VF passthrough
○ OpenStack controls network isolation
○ Pods forced to use host networking to get RDMA
○ … not so bad if one Pod per VM, with the Magnum cluster autoscaler
● Kubernetes in a VM with PF passthrough
○ Kubelet manages virtual function passthrough to pods
○ OpenStack maps devices to physical networks
○ Switch port could be configured out of band to restrict allowed VLANs
○ The NIC that is passed through typically can’t be used by the host or any other VM
● Kubernetes deployed on an Ironic server
○ Similar to PF passthrough
○ … but Neutron could orchestrate the switch port configuration
RDMA Remote Storage
● OpenStack supports fast local storage
○ Ratio typically fixed with CPU and RAM, but workload needs vary
○ Remote storage, like Ceph RBD, can have very high latency
● OpenStack supports provider VLANs
○ Can be shared with a select group of projects
○ You can have a Neutron router onto the network, if required
● Shared Storage can be a workload in OpenStack
○ Examples: Lustre, BeeGFS
○ Run on baremetal or VMs with RDMA enabled
○ Provide a shared file system to stage data into
● External appliances can be accessed via provider VLANs
○ Some storage can integrate with OpenStack Manila for Filesystem as a Service
Example workload: Monitoring Slurm
Example workload: Horovod Benchmarks
● Distributed deep learning framework
● Supported by LF AI & Data Foundation
● https://github.com/horovod/horovod
● P3 AlaSKA
○ TCP 10GbE
○ RoCE 25GbE
○ Infiniband 100Gb
○ 2 GPU nodes, each with 4 x P100 GPUs
● ResNet 50 Benchmark
○ All tests use 8 P100 GPUs
○ On baremetal, using OpenStack Ironic
○ Horovod on K8s with TensorFlow, OpenMPI
○ Note: higher is better
Horovod on P3 AlaSKA
Horovod
https://github.com/horovod/horovod/blob/master/docs/benchmarks.rst
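Horovod's core primitive is an allreduce that averages gradients across all workers each step, which is why the interconnect (TCP vs RoCE vs Infiniband) dominates the benchmark results. A single-process simulation of just the averaging maths (real Horovod does this with MPI/NCCL over the network, nothing like this code):

```python
# Minimal simulation of the gradient averaging that Horovod's allreduce
# performs across workers. Real Horovod implements this with MPI/NCCL over
# the interconnect; this sketch only illustrates the arithmetic.

def allreduce_average(worker_grads):
    """Every worker ends up with the element-wise mean of all gradients."""
    n = len(worker_grads)
    summed = [sum(vals) for vals in zip(*worker_grads)]
    mean = [s / n for s in summed]
    return [mean[:] for _ in range(n)]  # each worker gets the same result

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 workers
print(allreduce_average(grads)[0])  # [4.0, 5.0] on every worker
```

Because every worker must exchange its full gradient every step, allreduce latency and bandwidth translate directly into the images/sec differences seen across the three fabrics.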
Example hardware: NVIDIA DGX A100
NVIDIA DGX A100
● 200Gb/s ConnectX-6 for each GPU
● Local NVMe to cache training data
● NVIDIA NVLink
○ DGX A100 has 6 NVIDIA NVSwitch fabrics
○ Each A100 GPU uses twelve NVLink interconnects, two to each NVSwitch
○ GPU-to-GPU communication at 600 GB/s
○ All-to-all peak of 4.8 TB/s in both directions
● A100 has Multi-Instance GPU (MIG)
○ Up to 7 MIGs per A100
○ Each MIG GPU instance has its own memory, cache, and streaming multiprocessors
○ Multiple users can share the same GPU and run all instances simultaneously, maximizing GPU efficiency
https://developer.nvidia.com/blog/defining-ai-innovation-with-dgx-a100/
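The "up to 7 MIGs per A100" bullet comes from the GPU being divided into 7 compute slices, with each MIG profile consuming a fixed number of them. A toy packing sketch (profile names follow NVIDIA's published 40GB A100 profiles; the bookkeeping itself is illustrative, not the `nvidia-smi mig` interface):

```python
# Sketch of carving an A100 into MIG instances: each profile consumes a
# fixed share of the GPU's 7 compute slices and its memory. Illustrative
# bookkeeping only, not the real MIG management interface.

A100_SLICES = 7

MIG_PROFILES = {          # profile -> (compute slices, memory GB)
    "1g.5gb": (1, 5),
    "2g.10gb": (2, 10),
    "3g.20gb": (3, 20),
}

def partition(requested):
    """Check a list of MIG profiles fits on one A100; return slices used."""
    used = sum(MIG_PROFILES[p][0] for p in requested)
    if used > A100_SLICES:
        raise ValueError(f"needs {used} slices, only {A100_SLICES} available")
    return used

print(partition(["3g.20gb", "2g.10gb", "2g.10gb"]))  # 7: fully packed
print(partition(["1g.5gb"] * 7))                     # 7: seven small instances
```

Because each instance has its own memory and cache, a mix like the above can serve seven small inference tenants, or fewer larger training jobs, from one physical GPU.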
Next Steps with Scientific OpenStack
Scientific OpenStack Digital Assets
● Existing assets
○ Reference OpenStack architecture, configuration and operational tooling
○ Reference platforms and workloads, such as:
■ https://github.com/RSE-Cambridge/iris-magnum
● Edinburgh Institute of Astronomy
○ 2nd IRIS site starting to adopt Scientific OpenStack Digital Assets
● SKA are funding moving P3 AlaSKA into IRIS
● Updates in March 2021 due to include:
○ GPU and SR-IOV best practice guides, using P3 AlaSKA hardware
○ Magnum updated to support Kubernetes v1.20
○ Improved Resource Management via Blazar
○ Prometheus based Utilization Monitoring
○ Assessment of porting JASMIN Cluster as a Service to IRIS
Summary of ML on OpenStack
● ML represents a diverse set of workloads
○ … with a correspondingly diverse set of infrastructure needs
○ Latency sensitive Online inference vs Offline training, Regression vs Classification, etc.
○ Wide variety of model complexity and size of data inputs
● Broad ecosystem of tools and platforms
○ Many assume Kubernetes is available for deployment
○ OpenStack can provide the resources needed, Kubernetes or not
● Challenging to use resources efficiently
○ No generic “best fit” mix of Compute, Networking and Storage
○ GPUs, Tensor cores, IPUs, FPGAs can be more efficient than CPUs
○ Storage and Networking need to keep processing fed
○ Demand for Online Inference (and training) can be hard to predict
Questions?