Aug 2019 - Shankar Chandrasekaran
MANAGING GPU ACCELERATED COMPUTING
RISE OF GPU COMPUTING
[Figure: performance over time, 1980 to 2020, log scale from 10^2 to 10^7. Single-threaded perf: 1.5X per year, then 1.1X per year. GPU-Computing perf: 1.5X per year, 1000X by 2025.]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.
APPLICATIONS
SYSTEMS
ALGORITHMS
CUDA
ARCHITECTURE
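The growth rates quoted on the RISE OF GPU COMPUTING chart can be compared with a little compound-growth arithmetic. The sketch below is only an illustration: the helper names are made up for this example, and it does not assume any particular baseline year for the "1000X by 2025" projection, since the chart text does not state one.

```python
import math


def years_to_reach(factor: float, rate_per_year: float) -> float:
    """Years of compound growth at rate_per_year needed to gain `factor`."""
    return math.log(factor) / math.log(rate_per_year)


def growth_after(years: float, rate_per_year: float) -> float:
    """Total improvement after `years` of compounding at rate_per_year."""
    return rate_per_year ** years


if __name__ == "__main__":
    # The two per-year rates that appear on the chart.
    for label, rate in [("1.5X per year", 1.5), ("1.1X per year", 1.1)]:
        print(
            f"{label}: {years_to_reach(1000, rate):.1f} years to reach 1000X, "
            f"{growth_after(10, rate):.1f}X after 10 years"
        )
```

At 1.5X per year a 1000X gain takes roughly 17 years, while at 1.1X per year it takes more than 70, which is the gap the chart is drawing attention to.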
END-TO-END
SOFTWARE STACK
RECORD-SETTING PERFORMANCE
AVAILABLE EVERYWHERE
NVIDIA GPU PLATFORM FOR ACCELERATING AI
Cloud Services
Systems
6 MLPerf Training Records
AWS SageMaker
GCP ML Engine
AzureML
Time Machine for AI (Training RN-50)
2015 - TESLA K80 | CUDA - 36,000 Mins (25 Days)
2017 - DGX-1 | VOLTA | TENSOR CORES - 480 Mins (8 Hrs)
2018 - DGX-2 | VOLTA | NVSWITCH - 63 Mins
2019 - DGX SUPERPOD | VOLTA | MELLANOX IB - <2 Mins
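To put the "Time Machine for AI" figures in proportion, the speedup of each system over the 2015 baseline can be computed directly. This is a rough sketch that assumes the four training times correspond, in order, to the four systems listed for 2015 through 2019. The 2019 figure is given only as "<2 Mins", so the value used here is an upper bound on time and the resulting speedup is a lower bound.

```python
# RN-50 training times in minutes, as listed on the slide.
# Assumption: times pair with systems in the order they appear.
# The 2019 entry is "<2 Mins"; 2.0 is used as an upper bound.
TRAINING_MINUTES = [
    (2015, "TESLA K80", 36_000.0),
    (2017, "DGX-1", 480.0),
    (2018, "DGX-2", 63.0),
    (2019, "DGX SUPERPOD", 2.0),
]


def speedups(entries):
    """Return (year, system, speedup) relative to the first entry."""
    baseline = entries[0][2]
    return [(year, system, baseline / minutes) for year, system, minutes in entries]


if __name__ == "__main__":
    for year, system, factor in speedups(TRAINING_MINUTES):
        print(f"{year} {system}: {factor:,.0f}x vs. 2015")
```

Under those assumptions the 2017 system is 75x faster than the 2015 baseline, and the 2019 system is at least 18,000x faster.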
INFRASTRUCTURE WORKFLOW FOR AI
Server (& rest of infra) Bring-up & Provisioning
Virtualization Provisioning
Container Orchestration
Application Deployment & Management
App/Infra Monitoring
Error handling & remediation
RUNNING AI & DATA SCIENCE JOBS WITH CONFIDENCE
NVIDIA DATA CENTER GPU MANAGER
Parallel monitoring & management ecosystem for GPU, memory, NVSWITCH, baseboard components
Active health monitoring
Diagnostics
Power and clock management
OUT OF BAND MANAGEMENT
Error handling & remediation
Server (& rest of infra) Bring-up & Provisioning
Performance validated NVIDIA GPU servers for faster rollout in production
NGC READY SERVERS
Hardware Reliability & Management
NVIDIA VIRTUAL COMPUTE SERVER
GPU Acceleration Features for Server Virtualization
Multi-VMs per GPU (Sharing)
NVIDIA NGC (Containers)
ECC & Page Retirement
Peer-to-Peer over NVLink
Multi-vGPU per VM (Aggregate)
New Features for vComputeServer
vSphere for Management, Monitoring & Migration
Enhanced, Flexible Scheduling
Virtualization Provisioning
NVIDIA GPUs in Kubernetes
• Simplify large-scale deployments of GPU-accelerated applications - GPU support in Kubernetes using the NVIDIA device plugin
• Specify GPU attributes such as GPU type and memory requirements for deployment in heterogeneous GPU clusters
• Visualize and monitor GPU metrics and health with an integrated GPU monitoring stack of NVIDIA DCGM, Prometheus and Grafana
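As a rough illustration of the first two bullets, the sketch below builds a Pod manifest that requests GPUs through the `nvidia.com/gpu` resource, which is the resource name the NVIDIA device plugin advertises to Kubernetes. Everything else is an assumption made for the example: the helper name, the pod name, the image name, and especially the node label used to pick a GPU type, which depends on how a given cluster labels its nodes.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, node_selector=None):
    """Build a Pod manifest (as a dict) that requests `gpus` NVIDIA GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    manifest = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested as an extended resource under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }
    if node_selector:
        # Steers the pod to nodes carrying these labels, e.g. a GPU type.
        manifest["spec"]["nodeSelector"] = dict(node_selector)
    return manifest


if __name__ == "__main__":
    pod = gpu_pod_manifest(
        name="gpu-job",                          # hypothetical pod name
        image="example.com/my-cuda-app:latest",  # hypothetical image
        gpus=2,
        node_selector={"gpu-type": "v100"},      # hypothetical node label
    )
    print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.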
Container Orchestration
App/Infra Monitoring
Application Deployment & Management
NVIDIA GPUs
NVIDIA Container Runtime
KUBERNETES GPU plugin
NGC Containers
Docker
AI DEVELOPMENT AND DEPLOYMENT
Data scientists | Developers | IT/DevOps
Trained Models
Apps with trained Models
New data to update models
Data Preprocessing
Labeling
Model Development & Evaluation
Train @scale
Optimization
Deployment & Monitoring
PLATFORM BUILT FOR TRAINING
Accelerating Every Framework And Fueling Innovation
All Major Frameworks
All Use-cases
Speech | Video | Translation | Personalization
Volta Tensor Core, NVSwitch, NVLink
Tensor Cores
NVLink NVSwitch
INFERENCE WITH TENSORRT INFERENCE SERVER
[Diagram labels: App 1, App 2, Front End Client Applications, AI Model Repository, AI Inference Cluster (CPU | GPU), TensorRT Inference Server App (x4)]
Any framework: TensorFlow | TensorRT Plan | PyTorch | Caffe | Custom
Any platform: Cloud | Data center, GPU | CPU
©2018 VMware, Inc.
NVIDIA NGC – 150+ CONTAINERS, PRE-TRAINED MODELS, TRAINING SCRIPTS AND WORKFLOWS
KEY TAKEAWAYS
GPU Platform Checklist
• NGC Ready servers, DCGM
• vComputeServer for vSphere environments
• Kubernetes for container orchestration on GPUs
• AI Training: GPU-optimized software for model development and training
• AI Inference: TensorRT Inference Server or vComputeServer fractional GPUs
www.nvidia.com
ngc.nvidia.com