Mellanox – The Artificial Intelligence Interconnect Company
June 2017
Building the Most Efficient Machine Learning System
Mellanox Confidential | © 2017 Mellanox Technologies
Mellanox Overview
Company headquarters: Yokneam, Israel and Sunnyvale, California; worldwide offices
~2,900 employees worldwide
Ticker: MLNX
Adapters | Switches | Cables & Transceivers | System on a Chip | SmartNIC
Higher Data Speeds | Faster Data Processing | Better Data Security
Exponential Data Growth Everywhere
Enabling the Future of Machine Learning Applications
HPC and Machine Learning Share Same Interconnect Needs
Storage
High Performance Computing
Financial
Embedded Appliances
Database
Hyperscale
Machine Learning
IoT
Healthcare
Manufacturing
Retail
Self-Driving Vehicles
Highest Performance 100 and 200Gb/s Interconnect Solutions
Cables & Transceivers: active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
InfiniBand Switch: 40 HDR (200Gb/s) ports or 80 HDR100 (100Gb/s) ports; 16Tb/s throughput, 15.6 billion messages/sec
Adapters: 200Gb/s, 0.6us latency, 200 million messages per second (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
Ethernet Switch: 32 100GbE ports or 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE); 6.4Tb/s throughput
Today’s Datacenters Need the Most Intelligent Interconnect
Mellanox Delivers Best Return on Investment
60% Higher Return on Investment
Up to 50% Savings on Capital and Operation Expenses
World’s Highest Performance, Scalability and Productivity for Deep Learning
Mellanox Unlocks the Power of AI
[Framework logos: Chainer, Cognitive Toolkit]
Mellanox is Leading Artificial Intelligence (AI)
Health Care, Business Integrity, Business Intelligence
Knowledge Discovery, Security, Customer Support and more
Advancing Technology to Affect Science, Business, and Society
By Enabling Critical and Timely Decision Making
More Data + Better Models + Faster Interconnect
(GPUs, CPUs, FPGAs, Storage)
More Data → Faster Interconnect → Better Insight → Competitive Advantage
Enabling the Most Efficient Machine Learning Platforms (Examples)
Highest Performance, Scalability and Productivity for Deep Learning
World’s First PCIe Gen 4
Public Cloud Server for
Cognitive Computing
Enabling Analytics
in Cloud
Sets 2016 TeraSort Benchmark Record: 5x faster, 3x more energy efficient than the 2015 record
http://sortbenchmark.org/TencentSort2016.pdf
Smart Network for Azure Cloud Server: designed for big data analytics & AI
Mellanox Accelerates Machine Learning and Big Data
Mellanox Accelerates Machine Learning and Big Data
Big Sur & Big Basin Facebook Open Source AI Hardware Platform
Only ONE Network of Choice - Mellanox
Powering Self-Driving Cars: 2X faster training with PaddlePaddle
“…We rely on fast interconnect technologies and RDMA.”
Andrew Ng, Chief Scientist, Baidu
https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design
https://github.com/caffe2/caffe2/tree/master/caffe2/contrib/fbcollective/vendor/fbcollective
Real-Time Fraud Detection: 14 million transactions per day, 4 billion database inserts
Image Recognition: ~90% prediction accuracy
RDMA in TensorFlow, Caffe, and Caffe2
AI is Changing the Way We Interact with Computers
Automotive and Transportation
• Autonomous driving
• Pedestrian detection
• Accident avoidance

Security and Public Safety
• Surveillance
• Image analysis
• Facial recognition and detection

Consumer Web, Mobile, Retail
• Image tagging
• Speech recognition
• Natural language processing
• Recommendation and sentiment analysis

Medicine and Biology
• Drug discovery
• Diagnostic assistance
• Cancer cell detection

Broadcast, Media and Entertainment
• Captioning
• Search
• Recommendations
• Real-time translation

Finance, Fraud and Insurance
• Real-time trading
• Credit / risk analysis
• Fraud detection and prevention
Efficient Deep Learning Depends on Mellanox
Deep Learning Demands Highest Performance
TRAINING (training dataset → trained model; billions of TFLOPS)
• Scalability requires ultra-fast networking
• Same hardware needs as HPC
• Faster access to storage
• RDMA
• SHARP
• PeerDirect™, GPUDirect™, ROCm, others

INFERENCING (new data; billions of FLOPS)
• Highly transactional / supports many users
• Mellanox ultra-low latency
• Instant network response
• RDMA
• PeerDirect™, GPUDirect™, ROCm, others

Data types: images, video, text, speech, tabular, time series
Exponential Data Growth – The Need for Intelligent and Faster Interconnect
CPU-Centric (Onload) Data-Centric (Offload)
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale
CPU-Centric: must wait for the data, creating performance bottlenecks
Data-Centric: analyze data as it moves!
Data Centric Architecture to Overcome Latency Bottlenecks
CPU-Centric (Onload) network: HPC / machine learning communication latencies of 30-40us
Data-Centric (Offload) with In-Network Computing: HPC / machine learning communication latencies of 3-4us
Intelligent Interconnect Paves the Road to Exascale Performance
Mellanox Technology Accelerations for Machine Learning
In-Network Computing Key for Highest Return on Investment
RDMA | GPUDirect | NVMe over Fabrics | SHARP | Security
In-Network Computing Enables Deep Learning Frameworks
Middleware (MPI, gRPC) - optional
Mellanox accelerations for machine learning and big data: CUDA | SHARP | GPUDirect RDMA | NVMe over Fabrics | rCUDA
Mellanox Interconnect Solutions
Mellanox SHARP for Gradient Computation
The CPU in a parameter server quickly becomes the bottleneck (at roughly 4 nodes)
TCP adds significant overhead, and the traffic pattern is bursty
• SHARP performs the gradient averaging
• Removes the need for physical parameter server
• Removes all parameter server overhead
SHARP Provides Better Scalability and Reduced Network Traffic
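The point above can be made concrete with a small sketch (illustrative Python, not Mellanox code): numerically, in-network gradient averaging computes the same reduction a parameter server would, but SHARP performs it inside the switch fabric, so no single CPU has to receive every worker's full gradient.

```python
import numpy as np

def parameter_server_average(grads):
    """Parameter-server style: one node gathers all gradients and averages."""
    return sum(grads) / len(grads)

def in_network_average(grads):
    """SHARP-style result: the fabric reduces gradients as they move, and
    every worker receives the already-averaged tensor (an allreduce).
    Numerically identical; the difference is where the work happens and
    how much traffic converges on any one node."""
    return np.mean(np.stack(grads), axis=0)

rng = np.random.default_rng(0)
grads = [rng.standard_normal(4) for _ in range(8)]   # 8 simulated workers
assert np.allclose(parameter_server_average(grads), in_network_average(grads))
```

Because the two reductions agree, offloading the averaging removes the parameter server without changing the training math.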
Purpose-built for Acceleration of Deep Learning
PeerDirect™, GPUDirect® RDMA and ASYNC
What is GPUDirect™?
Provides significant decrease in communication latency for acceleration devices
Natively supported by Mellanox OFED
Supports peer-to-peer communications between Mellanox adapters and third-party devices
No unnecessary system memory copies & CPU overhead
Enables GPUDirect™ RDMA, GPUDirect™ ASYNC, ROCm and others
InfiniBand and Ethernet
[Diagram: data path between CPU, chipset, and vendor device, with and without peer-to-peer transfers]
Designed for Deep Learning Acceleration
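As a rough illustration of the copies GPUDirect™ removes, here is a toy Python model (not a real driver API; both function names are invented): without peer-to-peer DMA, device memory must be staged through a host bounce buffer before the NIC can read it.

```python
def send_without_gpudirect(device_buf):
    """Staged path: two host-side copies plus CPU involvement."""
    host_staging = list(device_buf)      # copy 1: device memory -> host buffer
    nic_view = list(host_staging)        # copy 2: host buffer -> NIC
    return nic_view, 2                   # payload, number of host copies

def send_with_gpudirect(device_buf):
    """Peer-to-peer path: the NIC reads device memory directly."""
    nic_view = list(device_buf)          # models a direct DMA from the device
    return nic_view, 0                   # zero host-side copies

payload = [1, 2, 3]
assert send_without_gpudirect(payload)[0] == send_with_gpudirect(payload)[0]
assert send_with_gpudirect(payload)[1] == 0
```

Same bytes arrive at the NIC either way; the saving is the eliminated system-memory copies and the CPU cycles they consume.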
GPUDirect™ RDMA and GPUDirect ASYNC™
Direct Connectivity GPU - Interconnect
GPUDirect™ RDMA Performance (source: Prof. DK Panda)
GPU-GPU internode latency: 9.3X better (2.18 usec); lower is better
GPU-GPU internode bandwidth: 10X better throughput; higher is better
NVIDIA® NCCL 2.0 Near-Linear Scalability
Optimized collective communication library
• Allreduce, Reduce, Broadcast, Reduce-scatter, Allgather
Inter-node communication using InfiniBand verbs and GPUDirect™ RDMA
Multi-rail support, Topology detection
50% performance improvement with NVIDIA® DGX-1™ across 32 NVIDIA Tesla® V100 GPUs
NVIDIA Accelerates Scalable Deep Learning with Mellanox
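The collectives listed above are commonly implemented as a ring: a reduce-scatter followed by an allgather, so each rank exchanges data only with its neighbors. The following NumPy simulation is a sketch of that pattern (the textbook algorithm, not NCCL's actual implementation):

```python
import numpy as np

def ring_allreduce(tensors):
    """Simulate a ring allreduce (sum) across len(tensors) ranks."""
    n = len(tensors)
    # Each rank splits its tensor into n chunks.
    chunks = [list(np.array_split(t.astype(float), n)) for t in tensors]

    # Reduce-scatter: after n-1 steps, rank r owns the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n                       # chunk rank r forwards
            chunks[(r + 1) % n][c] = chunks[(r + 1) % n][c] + chunks[r][c]

    # Allgather: circulate each fully reduced chunk around the ring.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            chunks[(r + 1) % n][c] = chunks[r][c]

    return [np.concatenate(ch) for ch in chunks]

ranks = [np.arange(8, dtype=float) * (r + 1) for r in range(4)]  # 4 ranks
out = ring_allreduce(ranks)
assert all(np.allclose(o, sum(ranks)) for o in out)
```

Each rank sends about 2*(n-1)/n of the tensor in total regardless of n, which is why ring collectives over fast links scale near-linearly.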
Performance and Scalability Examples
TensorFlow with Mellanox RDMA
Unmatched Linear Scalability, No Additional Cost
Up to 76% Efficiency and 50% Better Performance versus TCP
Reference Deployment Guide
Accelerating TensorFlow™ with gRPC over RDMA
Open source Machine Learning from Google
Distributed training with gRPC framework
• Google's optimized RPC for distributed networking
RDMA acceleration over UCX
• Unified Communication X (UCX)
• Integration with upstream TensorFlow
2X higher performance with RDMA (>2x faster; lower is better)
~2X Acceleration for TensorFlow with RDMA
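As a hedged sketch of how this surfaced to users: TensorFlow 1.x distributed jobs selected their transport via a protocol string passed to tf.train.Server (e.g. "grpc" for the TCP default, "grpc+verbs" for the RDMA-based contrib transport when TensorFlow was built with verbs support). The launcher below is hypothetical (trainer.py, the host names, and the flag names are illustrative); it only shows where such a protocol choice would be plumbed through:

```python
def worker_command(job_name, task_index, workers, ps, protocol="grpc+verbs"):
    """Build the argv for launching one distributed-TF process (sketch)."""
    return [
        "python", "trainer.py",
        "--job_name=%s" % job_name,
        "--task_index=%d" % task_index,
        "--worker_hosts=%s" % ",".join(workers),
        "--ps_hosts=%s" % ",".join(ps),
        "--protocol=%s" % protocol,   # swap to "grpc" to fall back to TCP
    ]

cmd = worker_command("worker", 0,
                     ["node1:2222", "node2:2222"], ["node0:2223"])
assert "--protocol=grpc+verbs" in cmd
```

The rest of the training script is unchanged; only the transport behind the RPC layer differs, which is what makes the ~2X RDMA speedup a drop-in gain.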
TensorFlow™ over RDMA in Apache® Spark™ Environment
Yahoo enhanced the TensorFlow C++ layer
to enable RDMA over InfiniBand
InfiniBand provides faster connectivity and
supports accelerated offload capability
Source: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep
InfiniBand Provides Near Linear Scalability for Inception Model Training
2X Acceleration for Baidu
Machine Learning Software from Baidu
• Usage: word prediction, translation, image processing
RDMA (GPUDirect) speeds training
• Lowers latency, increases throughput
• Frees more CPU cores for training
• Even better results with optimized RDMA
~2X Acceleration for Paddle Training with RDMA
ChainerMN Depends on InfiniBand
ChainerMN uses MPI for inter-node communication, and the NVIDIA® NCCL library for intra-node communication between GPUs
Leveraging InfiniBand results in near linear performance
Mellanox InfiniBand allows ChainerMN to achieve ~72% accuracy.
Source: http://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html
Machine Learning Performance Comparison
InfiniBand Delivers 60% Better Performance with 2X Less Infrastructure
DeepBench measures the performance of basic operations involved in training deep neural networks.
[Chart: results at 8 / 16 / 32 accelerators; 60.3% advantage; lower is better]
Scalable Deep Learning Depends on Mellanox
A Few Solution Examples
NVIDIA® DGX-1™
World’s first purpose-built system for deep
learning
• SaturnV is #28 on the Top500, 3.3Pf with 124 nodes
• SaturnV is also #1 on the Green500
Fully integrated hardware
• 8x Tesla™ P100 (Pascal) w/16GB per GPU
• 28,672 CUDA® cores
• 4x ConnectX-4 EDR 100Gb/s HCAs
Fully integrated software stack
• Major deep learning frameworks
• Drivers, NVIDIA CUDA, NVIDIA Deep Learning SDK
• GPUDirect™ RDMA
NVIDIA® DGX-1™ Deep Learning Server
Deep Learning Supercomputer in a Box
8 x NVIDIA® Tesla® P100 GPUs
5.3TFlops
16nm FinFET
NVLINK
4 x ConnectX®-4 EDR 100G InfiniBand
Adapters
NVIDIA® “SaturnV” Machine Learning Supercomputer
#28 on the Top500
3.3Pf with 124 DGX-1 nodes
#1 on the Green500
End-to-End Interconnect Solutions for All Platforms
Highest Performance and Scalability for
X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms
x86 | OpenPOWER | GPU | ARM | FPGA
Smart Interconnect to Unleash The Power of All Compute Architectures
Proven Advantages
RDMA delivers 2X performance advantage over traditional TCP
Machine Learning and HPC platforms share the same interconnect needs
Scalable, flexible, high performance, high bandwidth, end-to-end connectivity
Standards-based and supported by the largest eco-system
Supports all compute architectures: x86, Power, ARM, GPU, FPGA etc.
Native Offloading architecture
RDMA, GPUDirect, SHARP and other core accelerations
Backward and future compatible
Scalable Machine Learning Depends on Mellanox
Thank You