Contents
GPU computing update
NVIDIA deep learning training platform
NVIDIA online services platform
Tesla GPU products and roadmap
CPU: optimized for serial tasks
GPU accelerator: optimized for parallel tasks
Accelerated computing: 10x performance and 5x energy efficiency
10x growth in accelerated computing, 2008 to 2015
CUDA downloads: 150,000 (2008), 3 million (2015)
Academic papers: 4,000 (2008), 60,000 (2015)
Universities teaching: 60 (2008), 800 (2015)
Supercomputing teraflops: 77 (2008), 54,000 (2015)
Tesla GPUs: 6,000 (2008), 450,000 (2015)
CUDA apps: 27 (2008), 334 (2015)
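Tesla accelerates HPC data centers in government and enterprise
Sectors: supercomputing centers, higher education, government, energy, finance, manufacturing
Organizations shown: Tokyo Institute of Technology, Air Force Research Laboratory, Naval Research Laboratory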
360+ GPU-accelerated applications
www.nvidia.com/appscatalog
Accelerated computing is being adopted rapidly
NVIDIA GPUs are the accelerator of choice: NVIDIA GPU 85%, others 15%
GPU-accelerated applications, 2011 to 2015 (chart; data labels 113, 206, 242, 287, 367)
"More than half of new HPC systems will have accelerators installed." Intersect360 Research, Top 6 Prediction in HPC, Feb 2015
High-density GPU servers have become mainstream
Cray CS-Storm: 8 K80s per node
Dell C4130: 4 K80s per node
HP SL270: 8 K40s per node
Sugon: 4 K80s per node
NVIDIA deep learning training platform
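Deep learning examples
Image classification, object detection, localization, action recognition
Speech recognition, speech translation, natural language processing
Breast cancer cell mitosis detection, volumetric brain image segmentation
Pedestrian detection, lane detection, traffic sign recognition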
What is deep learning?
Image: "Volvo XC90"
Image source: Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, "Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks," ICML 2009 & Comm. ACM 2011.
Why is deep learning so hot right now?
Big data, GPU acceleration, new machine learning techniques
350 million images uploaded per day
2.5 petabytes of customer data hourly
300 hours of video uploaded every minute
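GPUs and deep learning
GPU implementations deliver:
• The same or better prediction accuracy
• Faster results
• A smaller footprint
• Lower power
ImageNet Challenge accuracy (chart legend: NVIDIA CUDA GPU): 72% (2010), 74% (2011), 84% (2012), 88% (2013), 93% (2014)
Neural networks and GPUs have in common: inherent parallelism, matrix operations, floating-point operations, bandwidth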
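The overlap between neural networks and GPUs listed above can be illustrated with a small sketch. It is written in plain Python on the CPU and assumes a hypothetical fully connected layer; the names `dense_forward` and `count_multiply_adds` are illustrative and do not come from the slides. The point is that every output element is an independent dot product, which is the kind of work a GPU can spread across many cores.

```python
def dense_forward(batch, weights, bias):
    """Fully connected layer: out[i][j] = sum_k batch[i][k] * weights[k][j] + bias[j].

    Each out[i][j] depends only on row i of the batch and column j of the
    weights, so all outputs could be computed in parallel.
    """
    n_out = len(bias)
    return [
        [sum(x * w_row[j] for x, w_row in zip(row, weights)) + bias[j]
         for j in range(n_out)]
        for row in batch
    ]


def count_multiply_adds(batch_size, n_in, n_out):
    """Number of independent multiply-add operations in one layer."""
    return batch_size * n_in * n_out


# Tiny illustrative inputs (2 samples, 3 inputs, 2 outputs).
batch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.5, -0.5]

output = dense_forward(batch, weights, bias)
print(output)
print(count_multiply_adds(len(batch), 3, 2))
```

For realistic layer sizes the multiply-add count grows into the billions per batch, which is why floating-point throughput and memory bandwidth dominate training time.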
NVIDIA's complete deep learning platform
Applications: DIGITS tools
Deep learning frameworks (Caffe, Torch, etc.)
Libraries: cuDNN, cuBLAS …
GPU: Tesla
Software
System management
Servers
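NVIDIA cuDNN
High-performance neural network training
• Accelerates Caffe, Theano, Torch, and other deep learning frameworks on GPUs
• Supports widely used layer types, including pooling, ReLU, sigmoid, softmax, and tanh
• Optimized for the latest NVIDIA GPU architectures
• Supports Linux, Windows, OS X, and Linux for Tegra (ARM)
GPU-accelerated deep learning frameworks
http://developer.nvidia.com/cuDNN
Performance keeps improving (chart): millions of images trained per day with cuDNN 1, cuDNN 2, and cuDNN 3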
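For readers unfamiliar with the layer types named on this slide, the following sketch gives plain Python reference definitions of them. These are not cuDNN API calls; the function names and the one-dimensional pooling variant are assumptions chosen for illustration, and a GPU library would apply the same math to large tensors.

```python
import math


def relu(x):
    """Rectified linear unit: passes positives, clamps negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """Hyperbolic tangent activation."""
    return math.tanh(x)


def softmax(values):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool_1d(values, window):
    """Non-overlapping max pooling over a 1-D sequence."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]


print(relu(-1.0), sigmoid(0.0), tanh(0.0))
print(softmax([1.0, 2.0, 3.0]))
print(max_pool_1d([1, 3, 2, 5, 4, 0], 2))
```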
NVIDIA DIGITS: interactive GPU deep learning training system
Process Data, Configure DNN, Monitor Progress, Visualize Layers
Test Image
http://developer.nvidia.com/digits
GPU-accelerated deep learning frameworks
Framework: Caffe | Torch | Theano | Minerva | Kaldi
Description: Deep learning framework | Scientific computing framework | Math expression compiler | Deep learning framework | Speech recognition toolkit
cuDNN: 3 | 3 | 3 | 3 | --
Multi-GPU: In Progress (nnet2)
Multi-Node: (nnet2)
License: BSD-2 | BSD | BSD | Apache 2.0 | Apache 2.0
Interface(s): Text-based definition files, Python, MATLAB | Python, Lua, MATLAB | Python | C++ | C++, shell scripts
Embedded
NVIDIA online services platform
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
Evolving workloads are a good fit for GPUs
Video transcoding, 2X: H.264 & H.265, SD & HD
Machine learning online services, 2X
Image processing, 5X: Resize, Filter, Search, Auto-Enhance
Video processing, 4X: Real-time Super Resolution, Stabilization, Enhancements
GPUs boost data center processing capacity
Add GPUs to boost data center
Diagram labels: Traditional, New Workload
Legend: Available capacity, GPU capacity for growth, Reclaimed CPU capacity
Tesla platform accelerates applications at scale
Low Power, Small Form Factor GPU: Acceleration in scale-out infrastructure
GPU REST Engine: High-throughput, low-latency accelerated services
Monitoring and Management: Deploy fault-tolerant and elastic GPU systems
Media Framework: Painless out-of-the-box support of GPUs in FFMPEG and OBS
YARN, OBS, FFMPEG
Low-power GPUs designed for applications at scale
Maxwell Architecture: 135% performance per core, 2x performance/Watt of Kepler Architecture
Low Power, Small Form Factor: PCIe Low Profile, 50W to 75W; easy upgrade/retrofit
Versatile Compute Platform: CUDA, Video Enhancements, Analytics, General Acceleration
Independent Video Engines: Independent on-chip video encode and decode engines accelerate H.264 and H.265
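Quality guaranteed: GPU processing makes real-time video enhancement possible
Video stabilization
Image contrast and sharpness enhancement
Advanced denoising
Deblocking
Downscaling (Lanczos & poly-phase multi-tap filters)
Super resolution
Smooth frame rate up-conversion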
GPU transcoding: 10x the throughput at ½ the power
Video processing at scale
GPU: 10 streams, 75 Watt
CPU: 1 stream, 150 Watt
Pipeline: Decode (NVDEC), Enhance, Infer, Encode (NVENC)
Streams: 1080p30 10 Mbps, 1080p30 5 Mbps, 720p30 3 Mbps, 720p 1.5 Mbps, 480p 1.2 Mbps, 360p 1 Mbps, 360p 0.5 Mbps, 240p 0.3 Mbps, 240p 0.15 Mbps, 1080p30 5 Mbps
1080p30 H.264 source; enhancement includes deblocking, motion stabilization and scaling
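The throughput and power figures on this slide can be combined into a single efficiency number. The short calculation below is only a worked example using the slide's own values (10 streams at 75 W for the GPU, 1 stream at 150 W for the CPU); the function name `streams_per_watt` is illustrative, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def streams_per_watt(streams, watts):
    """Transcoding efficiency as an exact fraction of streams per watt."""
    return Fraction(streams, watts)


# Figures taken from the slide.
gpu_efficiency = streams_per_watt(10, 75)
cpu_efficiency = streams_per_watt(1, 150)

throughput_ratio = Fraction(10, 1)   # GPU streams / CPU streams
power_ratio = Fraction(75, 150)      # GPU watts / CPU watts
efficiency_ratio = gpu_efficiency / cpu_efficiency

print(f"throughput: {float(throughput_ratio):.0f}x")
print(f"power: {float(power_ratio):.1f}x")
print(f"streams per watt: {float(efficiency_ratio):.0f}x")
```

Under these figures, ten times the throughput at half the power works out to twenty times as many streams per watt.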
Tesla GPU products and roadmap
2015 Tesla GPU accelerator products
1H 2015
K40: 12GB, 235W, PCIe Passive
K80: 2xGPU, 2x12GB, 300W, PCIe Passive
2H 2015 (new)
Tesla M40: 12GB, 250W, PCIe Passive (fastest DL solution)
Tesla M60: 2xGPU, 2x8GB, 300W / 225W, PCIe Passive (VDI solution)
Tesla M6: 8GB, 75W / 100W, MXM, PCIe in definition (VDI blade solution)
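Tesla M40: designed and built specifically for deep learning
Maxwell architecture: 3072 cores, ~7 TFLOPS, 12GB
Designed for the data center
• 24/7 Reliability
• Scalable Perf. w/ RDMA
• Datacenter mgmt. tools
World's fastest deep learning training
• Up to 2.4x K40
• Up to 1.7x K80
• Save days on each training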
Fastest deep learning training
Save days on each training iteration
Enable users to iterate to final solution much faster
Chart: number of days to train (axis 0 to 10) with Caffe and Torch on K40, K80, and M40; annotations "Save 3 days" and "Save 2 days"
Caffe uses AlexNet, batch size = 128, 280M images training set
Torch uses OverFeat, batch size = 128, 140M images training set
GPUDirect RDMA
Direct data transfer between GPUs
67% lower GPU-to-GPU latency
5x higher GPU-to-GPU MPI bandwidth
RDMA accelerates the scaling of deep learning
Yahoo, Baidu use RDMA to speed up deep learning training
"We have enhanced Caffe to use multiple GPUs on a server and benefit from RDMA to synchronize Deep Learning models"
http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop
"Given the properties of DL's SGD algorithms, it is desired to have very high bandwidth and ultra low latency interconnects to minimize inter-node communication costs"
http://arxiv.org/vc/arxiv/papers/1501/1501.02876v1.pdf
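To show why SGD-based training is sensitive to interconnect bandwidth and latency, here is a minimal sketch of synchronous data-parallel SGD. It is a single-process simulation under simplifying assumptions (a one-parameter model, two equally sized data shards, and a hypothetical `all_reduce_mean` helper standing in for the network exchange); it is not the Yahoo or Baidu implementation referenced above.

```python
def local_gradient(weight, shard):
    """Mean-squared-error gradient for the model y = weight * x on one shard."""
    return sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for a network all-reduce: every worker receives the mean.

    On a real cluster this step moves one gradient per model parameter
    between nodes on every training step, which is where RDMA helps.
    """
    return sum(values) / len(values)


def train(shards, weight=0.0, lr=0.01, steps=200):
    """Synchronous data-parallel SGD over equally sized shards."""
    for _ in range(steps):
        # Each worker computes a gradient on its own shard (in parallel on real hardware).
        grads = [local_gradient(weight, shard) for shard in shards]
        # All workers synchronize before anyone can take the next step.
        weight -= lr * all_reduce_mean(grads)
    return weight


# Synthetic data for y = 3x, split across two simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]

learned = train(shards)
print(round(learned, 6))
```

Because every step waits on the gradient exchange, the time spent in `all_reduce_mean` on a real cluster scales with model size and network speed, which is the cost the quoted passages aim to minimize.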
GPU roadmap
Chart: SGEMM / W (axis 0 to 72) by year, 2008 to 2018
Architectures in order: Tesla, Fermi, Kepler, Maxwell, Pascal, Volta
Pascal: Mixed Precision, Double Precision, 3D Memory, NVLink
New features of the Pascal GPU
NVLINK: GPU high-speed interconnect
Connect CPU to GPU or GPU to GPU
NVLINK 1.0, 80 GB/s, 4 Link Pairs
3D Stacked Memory
4x Higher Bandwidth (~1 TB/s)
3x Larger Capacity
4x More Energy Efficient per bit
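The NVLink figures above can be turned into a quick back-of-the-envelope comparison. The sketch below uses the slide's 80 GB/s across 4 link pairs; the PCIe Gen3 x16 figure of roughly 16 GB/s and the 16 GB payload are assumptions added for illustration, and real transfers would see lower effective rates on both interconnects.

```python
NVLINK_TOTAL_GB_S = 80.0     # from the slide: NVLINK 1.0, 80 GB/s
NVLINK_LINK_PAIRS = 4        # from the slide: 4 link pairs
PCIE_GEN3_X16_GB_S = 16.0    # assumption: approximate nominal PCIe Gen3 x16 rate


def transfer_seconds(size_gb, bandwidth_gb_s):
    """Ideal time to move size_gb at a given bandwidth, ignoring latency."""
    return size_gb / bandwidth_gb_s


per_link_gb_s = NVLINK_TOTAL_GB_S / NVLINK_LINK_PAIRS
bandwidth_ratio = NVLINK_TOTAL_GB_S / PCIE_GEN3_X16_GB_S

payload_gb = 16.0  # hypothetical payload size, chosen only for the example
nvlink_time = transfer_seconds(payload_gb, NVLINK_TOTAL_GB_S)
pcie_time = transfer_seconds(payload_gb, PCIE_GEN3_X16_GB_S)

print(f"per link: {per_link_gb_s:.0f} GB/s")
print(f"NVLink vs PCIe bandwidth: {bandwidth_ratio:.0f}x")
print(f"{payload_gb:.0f} GB over NVLink: {nvlink_time:.2f} s, over PCIe: {pcie_time:.2f} s")
```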
NVLink: high-speed GPU interconnect
GPU to CPU via NVLink: Pascal GPU and NVLINK-enabled CPU, 4 NVLink at 20 GB/s each, PCIe for control; HBM 16-32 GB at 1 Tbyte/s; DDR memory 10s-100s GB, DDR4 at 50-75 GB/s
GPU to GPU via NVLink: two Pascal GPUs, 4 NVLink at 20 GB/s each; x86 CPU attached through a PCIe switch
Whitepaper: http://www.nvidia.com/object/nvlink.html
NVLink High-Speed GPU Interconnect
2014: Kepler GPU, connected over PCIe to x86, ARM64, POWER CPU
2016: Pascal GPU, connected over NVLink to POWER CPU and over PCIe to x86, ARM64, POWER CPU
NVLink unleashes multi-GPU performance
Over 2x Application Performance Speedup When Next-Gen GPUs Connect via NVLink Versus PCIe
GPUs interconnected with NVLink: 5x faster than PCIe Gen3 x16
Diagram: two Tesla GPUs and a CPU attached through a PCIe switch
Chart: speedup vs PCIe-based server (axis 1.00x to 2.25x) for ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER, 3D FFT
3D FFT, ANSYS: 2 GPU configuration. All other apps comparing 4 GPU configuration. AMBER Cellulose (256x128x128), FFT problem size (256^3)
A flexible architecture for accelerating the data center
Pascal: Best-in-class single GPU performance
NVLINK + UVM: Efficient 4-GPU and 8-GPU scaling
vGPU: Graphics virtualization
Diagram: 8 GPU Cube Mesh, with two CPUs each attached through a PCIe switch
Thank you!