© 2019 Mellanox Technologies 1
Paving the Road to Exascale
September 2019
Interconnect Your Future
InfiniBand Accelerates 6 of Top 10 Supercomputers
Systems ranked #1, #2, #3, #5, #8, and #10
5 62 166 264
India’s National Supercomputing Program
World’s First HDR InfiniBand Supercomputer
HDR 200G InfiniBand Accelerated Supercomputers
Higher Data Speeds
Faster Data Processing
Better Data Security
Adapters
Switches
Cables & Transceivers
SmartNIC System on a Chip
HPC and AI Need the Most Intelligent Interconnect
The Need for Intelligent and Faster Interconnect
CPU-Centric (Onload) Data-Centric (Offload)
Must Wait for the Data
Creates Performance Bottlenecks
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale
[Figure: CPU and GPU nodes connected by an Onload Network (left) and by In-Network Computing (right)]
Analyze Data as it Moves!
Higher Performance and Scale
Data Centric Architecture to Overcome Latency Bottlenecks
CPU-Centric (Onload) Data-Centric (Offload)
Communications Latencies of 30-40us
Intelligent Interconnect Paves the Road to Exascale Performance
[Figure: CPU and GPU nodes in the CPU-centric (onload) and data-centric (offload) architectures]
Communications Latencies of 3-4us
GPUDirect RDMA
Application: Data Analysis, Real Time, Deep Learning
Communication: Mellanox SHARP In-Network Computing, MPI Tag Matching, MPI Rendezvous, Software Defined Virtual Devices
Network: Network Transport Offload, RDMA and GPU-Direct RDMA, SHIELD (Self-Healing Network), Enhanced Adaptive Routing and Congestion Control
Connectivity: Multi-Host Technology, Socket-Direct Technology, Enhanced Topologies
Accelerating All Levels of HPC / AI Frameworks
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
Reliable Scalable General Purpose Primitive
  In-network tree-based aggregation mechanism
  Large number of groups
  Multiple simultaneous outstanding operations
Applicable to Multiple Use-cases
  HPC applications using MPI / SHMEM
  Distributed machine learning applications
Scalable High Performance Collective Offload
  Barrier, Reduce, All-Reduce, Broadcast and more
  Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
  Integer and floating-point, 16/32/64 bits
[Figure: hosts send data up a tree of switches; the switches aggregate the data and return the aggregated result to the hosts]
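The in-network tree-based aggregation described above can be illustrated with a small host-side simulation. This is a minimal sketch, not Mellanox's SHARP implementation: the function names (`aggregate_level`, `tree_allreduce`), the fan-in parameter `radix`, and the sample values are assumptions made for illustration. Each simulated switch reduces the values of its children, and the result at the root is returned to every host.

```python
from functools import reduce
import operator

# A subset of the reduction operators named on the slide (illustrative mapping).
OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
    "or": operator.or_,
    "xor": operator.xor,
    "and": operator.and_,
}


def aggregate_level(values, radix, op):
    """One tree level: each simulated switch reduces up to `radix` children."""
    return [
        reduce(op, values[i:i + radix])
        for i in range(0, len(values), radix)
    ]


def tree_allreduce(host_values, op_name="sum", radix=2):
    """Reduce host values up a switch tree, then give every host the result.

    Returns (per_host_results, number_of_switch_levels).
    """
    if not host_values:
        raise ValueError("need at least one host value")
    if radix < 2:
        raise ValueError("radix must be at least 2")
    op = OPS[op_name]
    level = list(host_values)
    levels = 0
    while len(level) > 1:
        level = aggregate_level(level, radix, op)
        levels += 1
    # The root holds the aggregated result; it is sent back down to all hosts.
    return [level[0]] * len(host_values), levels


if __name__ == "__main__":
    results, depth = tree_allreduce([1, 2, 3, 4, 5], "sum", radix=2)
    print(results, depth)
```

In this toy model the number of switch levels grows with the logarithm of the host count, and a larger `radix` flattens the tree.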
SHARP AllReduce Performance Advantages (128 Nodes)
SHARP Enables 75% Reduction in Latency
Providing Scalable Flat Latency
Scalable Hierarchical Aggregation and Reduction Protocol
SHARP AllReduce Performance Advantages
1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology
SHARP Enables Highest Performance
Scalable Hierarchical Aggregation and Reduction Protocol
SHARP AllReduce Performance Advantages
Oak Ridge National Laboratory – CORAL Summit Supercomputer
SHARP Enables Highest Performance
Scalable Hierarchical Aggregation and Reduction Protocol
Performs the Gradient Averaging
Replaces All Physical Parameter Servers
Accelerates AI Performance
SHARP Accelerates AI Performance
The CPU in a parameter server becomes the bottleneck
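To make the gradient-averaging step concrete, here is a minimal sketch, assuming each worker holds its gradient as a plain list of floats. The names `allreduce_sum` and `average_gradients` are hypothetical and stand in for whatever collective a framework provides; nothing here is the NCCL or SHARP API. An all-reduce sum followed by a division by the worker count gives every worker the same averaged gradient, with no parameter server in the path.

```python
def allreduce_sum(worker_grads):
    """Element-wise sum across workers; every worker gets the same result.

    Stands in for a collective all-reduce (hypothetical helper).
    """
    if not worker_grads:
        raise ValueError("need at least one worker")
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all gradients must have the same length")
    total = [sum(column) for column in zip(*worker_grads)]
    # Each worker receives its own copy of the summed gradient.
    return [list(total) for _ in worker_grads]


def average_gradients(worker_grads):
    """Average gradients across workers: all-reduce sum, then divide by N."""
    summed = allreduce_sum(worker_grads)
    n = len(worker_grads)
    return [[value / n for value in grad] for grad in summed]


if __name__ == "__main__":
    # Made-up gradients for three workers, two parameters each.
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(average_gradients(grads))
```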
NCCL-SHARP Delivers Highest Performance
Adaptive Routing
InfiniBand Proven Adaptive Routing Performance
Oak Ridge National Laboratory – CORAL Summit supercomputer
Bisection bandwidth benchmark, based on mpiGraph
Explores the bandwidth between possible MPI process pairs
AR results demonstrate an average performance of 96% of the maximum bandwidth measured
mpiGraph explores the bandwidth between possible MPI process pairs. In the histograms, the single cluster with AR indicates that all pairs achieve nearly maximum bandwidth, while single-path static routing has nine clusters as congestion limits bandwidth, negatively impacting overall application performance.
“The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems”, Sudharshan S. Vazhkudai, Arthur S. Bland, Al Geist, Christopher J. Zimmer, Scott Atchley, Sarp Oral, Don E. Maxwell, Veronica G. Vergara Larrea, Wayne Joubert, Matthew A. Ezell, Dustin Leverman, James H. Rogers, Drew Schmidt, Mallikarjun Shankar, Feiyi Wang, Junqi Yin (Oak Ridge National Laboratory) and Bronis R. de Supinski, Adam Bertsch, Robin Goldstone, Chris Chambreau, Ben Casses, Elsa Gonsiorowski, Ian Karlin, Matthew L. Leininger, Adam Moody, Martin Ohmacht, Ramesh Pankajakshan, Fernando Pizzano, Py Watson, Lance D. Weems (Lawrence Livermore National Laboratory) and James Sexton, Jim Kahle, David Appelhans, Robert Blackmore, George Chochia, Gene Davison, Tom Gooding, Leopold Grinberg, Bill Hanson, Bill Hartner, Chris Marroquin, Bryan Rosenburg, Bob Walkup (IBM)
InfiniBand High Network Efficiency - mpiGraph
Oak Ridge National Lab Summit Supercomputer
Static Routing Adaptive Routing
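The 96% figure is an average over pairwise measurements, taken relative to the maximum bandwidth measured. A minimal sketch of that calculation follows, assuming the benchmark output has already been reduced to a flat list of per-pair bandwidths. The function names and the sample numbers are made up for illustration; they are not Summit data and not mpiGraph's output format.

```python
def average_fraction_of_max(pair_bandwidths):
    """Mean per-pair bandwidth divided by the maximum measured bandwidth."""
    if not pair_bandwidths:
        raise ValueError("need at least one measurement")
    peak = max(pair_bandwidths)
    if peak <= 0:
        raise ValueError("maximum bandwidth must be positive")
    return sum(pair_bandwidths) / len(pair_bandwidths) / peak


def histogram(pair_bandwidths, bin_width):
    """Count measurements per bin; bin k covers [k*bin_width, (k+1)*bin_width)."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = {}
    for bw in pair_bandwidths:
        k = int(bw // bin_width)
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical bandwidths in GB/s, not measured values.
    adaptive = [12.0, 12.5, 13.0, 12.5]
    static = [12.0, 6.0, 3.0, 9.0]
    print(average_fraction_of_max(adaptive), histogram(adaptive, 2.0))
    print(average_fraction_of_max(static), histogram(static, 2.0))
```

In this toy data the first list falls into a single histogram bin, the shape the slide describes for adaptive routing, while the second spreads across several bins, as congestion under static routing would.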
Network Topologies
Supporting a Variety of Topologies
Torus
Dragonfly
Fat Tree
Hypercube
HyperX
HDR InfiniBand
Highest-Performance 200Gb/s InfiniBand Solutions
Transceivers
Active Optical and Copper Cables
(10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
40 HDR (200Gb/s) InfiniBand Ports
80 HDR100 InfiniBand Ports
Throughput of 16Tb/s, <90ns Latency
200Gb/s Adapter, 0.6us Latency
215 Million Messages per Second
(10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
MPI, SHMEM/PGAS, UPC
For Commercial and Open Source Applications
Leverages Hardware Accelerations
System on Chip and SmartNIC
Programmable Adapter
Smart Offloads
HDR InfiniBand Switches
40 ports of HDR, 200G
80 ports of HDR100, 100G
40 QSFP56 ports
800 ports of HDR, 200G
1600 ports of HDR100, 100G
800 QSFP56 ports
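The port counts and the 16Tb/s throughput quoted for the 40-port switch are related by simple arithmetic. A short sketch follows, assuming the throughput figure counts both directions of every port and that each HDR port splits into two HDR100 ports, which is what the 40/80 and 800/1600 port counts imply. The function names are illustrative.

```python
HDR_GBPS = 200      # HDR port speed in Gb/s
HDR100_GBPS = 100   # HDR100 port speed in Gb/s


def aggregate_throughput_tbps(ports, gbps_per_port, bidirectional=True):
    """Total switch throughput in Tb/s for `ports` ports at `gbps_per_port`."""
    directions = 2 if bidirectional else 1  # assumption: both directions counted
    return ports * gbps_per_port * directions / 1000


def split_ports(hdr_ports, split_factor=2):
    """Number of HDR100 ports obtained by splitting each HDR port."""
    return hdr_ports * split_factor


if __name__ == "__main__":
    print(aggregate_throughput_tbps(40, HDR_GBPS))   # 40-port HDR switch
    print(split_ports(40), split_ports(800))         # HDR100 port counts
```

Under these assumptions, running the same switch as 80 HDR100 ports gives the same aggregate throughput, since twice the ports carry half the speed each.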
Highest Performance and Scalability for Exascale Platforms
7X Higher Performance
96% Network Utilization
Flat Latency
5000X Higher Resiliency
2X Higher Performance
Deep Learning
HDR 200G
NDR 400G
XDR 1000G
Thank You