NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
Deep Learning at NVIDIA
Peter Messmer ([email protected])
April 29, 2016
The Big Bang in Machine Learning
“Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”
[Figure: the convergence of DNNs, GPUs, and big data]
DEEP LEARNING EVERYWHERE
- INTERNET & CLOUD: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
- MEDIA & ENTERTAINMENT: Video Captioning, Video Search, Real-Time Translation
- AUTONOMOUS MACHINES: Pedestrian Detection, Lane Tracking, Traffic Sign Recognition
- SECURITY & DEFENSE: Face Detection, Video Surveillance, Satellite Imagery
- MEDICINE & BIOLOGY: Cancer Cell Detection, Diabetic Grading, Drug Discovery
Tesla Accelerated Computing Platform
Focused on Co-Design from Top to Bottom
Co-design spans the full stack: application, middleware, system software, large systems, and processor. It combines a productive programming model & tools, expert co-design, and accessibility.
A fast GPU engineered for high throughput, paired with a strong CPU.
[Figure: peak TFLOPS of NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs, 2008-2014]
INTRODUCING TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node
- Pascal Architecture: highest compute performance
- NVLink: GPU interconnect for maximum scalability
- CoWoS HBM2: unifying compute & memory in a single package
- Page Migration Engine: simple parallel programming with virtually unlimited memory
[Figure: node topology with two CPUs, PCIe switches, NVLink-connected Tesla P100 GPUs, and Unified Memory]
Giant Leaps in Everything
- Pascal Architecture: 21 teraflops of FP16 for deep learning
- NVLink: 5x GPU-GPU bandwidth
- CoWoS HBM2 stacked memory: 3x higher bandwidth for massive data workloads
- Page Migration Engine: virtually unlimited memory space
[Figure: four bar charts comparing K40, M40, and P100: teraflops (FP32/FP16), bidirectional NVLink bandwidth (GB/s), memory bandwidth (GB/s), and addressable memory (GB)]
NVIDIA DGX-1: The World's First Deep Learning Supercomputer
- 170 TFLOPS FP16
- 8x Tesla P100 16 GB
- NVLink Hybrid Cube Mesh
- Accelerates major AI frameworks
- Dual Xeon
- 7 TB SSD deep learning cache
- Dual 10 GbE, quad InfiniBand 100 Gb
- 3U, 3200 W
NVIDIA Deep Learning SDK
High-Performance GPU Acceleration for Deep Learning
- APPLICATIONS: computer vision (image classification, object detection), speech and audio (voice recognition, translation), behavior (recommendation engines, sentiment analysis)
- FRAMEWORKS: e.g., Mocha.jl
- DEEP LEARNING SDK: deep learning (cuDNN), math libraries (cuBLAS, cuSPARSE, cuFFT), multi-GPU (NCCL)
NVIDIA DIGITS
Interactive Deep Learning GPU Training System
Configure a DNN, process data, monitor training progress, visualize layers, and test images.
developer.nvidia.com/digits
Good Time for the Confluence of HPC and DL
- CSCS working on containers for WLCG data
- CSCS to upgrade Piz Daint to Pascal GPUs
- Simple process for basic applications
nvGRAPH: Accelerated Graph Analytics
- High-performance graph analytics, delivering results up to 3x faster than CPU-only
- Solves graphs with up to 2.5 billion edges on a single M40
- Accelerates a wide range of graph analytics apps:
  - PageRank: search, recommendation engines, social ad placement
  - Single-Source Shortest Path: robotic path planning, power network planning, logistics & supply chain planning
  - Single-Source Widest Path: IP routing, chip design / EDA, traffic-sensitive routing
developer.nvidia.com/nvgraph
[Figure: nvGRAPH 3x speedup in iterations/s, PageRank on a 1.5B-edge Twitter dataset, nvGRAPH on M40 vs. 48-core Xeon E5. CPU system: 4U server with 4x 12-core Xeon E5-2697 (30 MB cache, 2.70 GHz), 512 GB RAM]
OPENACC: More Science, Less Programming
- SIMPLE: minimal effort, small code modifications
- POWERFUL: up to 10x faster application performance
- PORTABLE: optimize once, run on GPUs and CPUs
- FREE FOR ACADEMIA

main()
{
    <serial code>
    #pragma acc kernels  // automatically runs on GPU
    {
        <parallel code>
    }
}