© 2018 Mellanox Technologies | Confidential
Interconnect Topology Considerations Applied To Differing Applications And Clusters
December 2018
Interconnect Your Future
Mellanox Accelerates Leading HPC and AI Systems
World’s Top 3 Supercomputers
1. Summit CORAL System: World’s Fastest HPC / AI System, 9.2K InfiniBand Nodes
2. Sierra CORAL System: #2 USA Supercomputer, 8.6K InfiniBand Nodes
3. Wuxi Supercomputing Center: Fastest Supercomputer in China, 41K InfiniBand Nodes
Mellanox connects 53% of overall TOP500 platforms, or 265 systems (InfiniBand and Ethernet), demonstrating 38% growth in 12 months (Nov’17-Nov’18)
Mellanox InfiniBand and Ethernet Accelerate World-Leading Supercomputers on the Nov’18 TOP500 List
InfiniBand is the Interconnect of Choice for HPC and AI Infrastructures
Mellanox Ethernet is the Interconnect of Choice for Cloud and Hyperscale Platforms
InfiniBand accelerates the fastest HPC and AI supercomputer in the world – Oak Ridge National Laboratory ‘Summit’ system
InfiniBand accelerates the top 3 supercomputers in the world - #1 (USA), #2 (USA), #3 (China)
InfiniBand connects 135 supercomputers, or nearly 55% of overall HPC systems on the TOP500 list
InfiniBand is the most used high-speed interconnect for the TOP500 systems
Mellanox connects 130 Ethernet systems (25 Gigabit and faster), or 51% of total Ethernet systems
The TOP500 list has evolved to include both HPC and cloud / hyperscale (non-HPC) platforms
Nearly half of the platforms on the TOP500 list can be categorized as non-HPC application platforms (mostly Ethernet-based)
Higher Data Speeds
Faster Data Processing
Better Data Security
Adapters, Switches, Cables & Transceivers
SmartNIC System on a Chip
HPC and AI Needs the Most Intelligent Interconnect
Highest Performance HDR 200G InfiniBand
HDR 200G InfiniBand Accelerates Next Generation HPC/AI Systems
Adapters: 200Gb/s, 0.6us Latency, 215 Million Messages per Second (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
Switch: 40 HDR (200Gb/s) Ports, 80 HDR100 (100Gb/s) Ports, 16Tb/s Throughput, 15.6 Billion msg/sec
Interconnect: Transceivers, Active Optical and Copper Cables (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
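The switch port counts and the 16Tb/s throughput figure above agree with each other if the throughput counts both directions of every port. A minimal Python sketch of that arithmetic follows. The bidirectional counting is an assumption, not something the slide states, and the function name `aggregate_tbps` is made up for illustration.

```python
def aggregate_tbps(ports: int, gbps_per_port: int, bidirectional: bool = True) -> float:
    """Aggregate switch throughput in Tb/s, taking 1 Tb/s = 1000 Gb/s.

    bidirectional=True counts send and receive on every port
    (an assumption about how the slide's 16Tb/s figure is derived).
    """
    directions = 2 if bidirectional else 1
    return ports * gbps_per_port * directions / 1000


hdr = aggregate_tbps(40, 200)      # 40 HDR ports at 200Gb/s
hdr100 = aggregate_tbps(80, 100)   # 80 HDR100 ports at 100Gb/s
print(hdr, hdr100)  # 16.0 16.0
```

Both port configurations give the same aggregate, which matches the idea that each HDR port can be split into two HDR100 ports.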
The Need for Intelligent and Faster Interconnect
CPU-Centric (Onload): Must Wait for the Data, Creates Performance Bottlenecks
Data-Centric (Offload): Analyze Data as it Moves! Higher Performance and Scale
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale
[Figure: Onload Network vs. In-Network Computing, each connecting CPU and GPU nodes]
In-Network Computing to Enable Data-Centric Data Centers
[Figure: CPU and GPU nodes connected through an In-Network Computing fabric]
GPUDirect RDMA
Scalable Hierarchical Aggregation and Reduction Protocol
NVMe over Fabrics
In-Network Computing Delivers Highest Performance
In-Network Computing (Scalable Hierarchical Aggregation and Reduction Protocol): Delivers Highest Application Performance, 10X Performance Acceleration, Critical for HPC and Machine Learning Applications
Self-Healing Technology: Unbreakable Data Centers, Faster Network Recovery (35X, 5000X)
GPUDirect™ RDMA: GPU Acceleration Technology, 10X Performance Acceleration, Critical for HPC and Machine Learning Applications
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
Reliable Scalable General Purpose Primitive
  In-network Tree based aggregation mechanism
  Large number of groups
  Multiple simultaneous outstanding operations
Applicable to Multiple Use-cases
  HPC Applications using MPI / SHMEM
  Distributed Machine Learning applications
Scalable High Performance Collective Offload
  Barrier, Reduce, All-Reduce, Broadcast and more
  Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
  Integer and Floating-Point, 16/32/64 bits
[Figure: SHARP Tree, showing the SHARP Tree Root, SHARP Tree Aggregation Nodes (Process running on HCA), and SHARP Tree Endnodes (Process running on HCA)]
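The tree-based aggregation idea can be illustrated with a small host-side simulation in Python. This is only a sketch of the general technique (combine contributions level by level, then hand the root's result back to every end node). It is not the SHARP protocol or its API; the `radix` fan-in parameter and the function name `tree_allreduce` are hypothetical.

```python
from functools import reduce
import operator

# A subset of the reduction operations named on the slide, as Python callables.
OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
    "or": operator.or_,
    "xor": operator.xor,
    "and": operator.and_,
}


def tree_allreduce(values, op="sum", radix=2):
    """Simulate an all-reduce over an aggregation tree.

    values: one contribution per end node.
    radix:  hypothetical fan-in of each aggregation node.
    Returns (result seen by every end node, number of aggregation levels).
    """
    if not values:
        raise ValueError("need at least one end node")
    if radix < 2:
        raise ValueError("radix must be at least 2")
    combine = OPS[op]
    level, depth = list(values), 0
    # Each pass models one level of aggregation nodes combining their children.
    while len(level) > 1:
        level = [reduce(combine, level[i:i + radix])
                 for i in range(0, len(level), radix)]
        depth += 1
    # The root now holds the final value, which is distributed to all end nodes.
    return [level[0]] * len(values), depth


results, depth = tree_allreduce([3, 1, 4, 1, 5, 9, 2, 6], op="sum", radix=2)
print(results[0], depth)  # 31 3
```

With 8 end nodes and a fan-in of 2 the result is ready after 3 aggregation levels, so the number of steps grows with the tree depth and not with the number of end nodes.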
10X Higher Performance with GPUDirect™ RDMA
Accelerates HPC and Deep Learning performance
Lowest communication latency for GPUs
GPUDirect™ RDMA
Network Topologies
Supporting Variety of Topologies
Torus, Dragonfly, Fat Tree, Hypercube
Traditional Dragonfly vs Dragonfly+
[Figure: Traditional Dragonfly and Dragonfly+ group diagrams]
Dragonfly+ Topology
Several “groups”, connected using all-to-all links
The topology inside each group can be any topology
Reduces total cost of network (fewer long cables)
Utilizes Adaptive Routing for efficient operations
Simplifies future system expansion
[Figure: Full-Graph connecting every group to all other groups. Group 1 holds hosts 1 to H, Group 2 holds hosts H+1 to 2H, and Group G ends at host GH. Links are labeled B and L.]
1200-Nodes Dragonfly+ Systems Example
[Figure: three groups G1, G2, G3. Each group shows switches 1 to 20 with 20 HCAs each, plus switches labeled 1.1 to 1.20, 2.1 to 2.20, and 3.1 to 3.20 respectively.]
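The 1200-node example above can be checked with a few lines of Python. This sketch assumes, based on the figure labels, that each group has 20 switches with 20 HCAs attached to each. The function name `dragonfly_plus_size` is hypothetical, and `group_pairs` counts only how many pairs of groups the full graph must connect, not the number of physical cables.

```python
def dragonfly_plus_size(groups: int, leaves_per_group: int, hcas_per_leaf: int):
    """Return (hosts per group, total hosts, number of group pairs).

    In a full graph every group links to every other group,
    so the number of group pairs is G * (G - 1) / 2.
    """
    hosts_per_group = leaves_per_group * hcas_per_leaf
    total_hosts = groups * hosts_per_group
    group_pairs = groups * (groups - 1) // 2
    return hosts_per_group, total_hosts, group_pairs


print(dragonfly_plus_size(3, 20, 20))  # (400, 1200, 3)
```

Three groups of 400 hosts give the 1200 nodes in the slide title, and the three groups need three group-to-group connections to form the full graph.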
Future Expansion of Dragonfly+ Based System
Topology expansion of a Fat Tree, or a regular/Aries-like Dragonfly, requires one of the following:
  Reduction of early-phase bisection bandwidth due to reservation of ports on the network switches
  Re-cabling the long cables
Dragonfly+ is the only topology that allows system expansion at zero cost:
  While maintaining bisection bandwidth
  No port reservation
  No re-cabling
[Figure: Phase 1 group diagram; switches numbered 1 to 20, each connecting 20 HCAs]
Phase 1: 11 x 400 = 4400 hosts
Future Expansion of Dragonfly+ Based System
[Figure: Phase 2 group diagram after expansion; switches numbered 1 to 20, each connecting 20 HCAs]
Re-cable the central racks, a change local to the rack
Phase 1: 11 x 400 = 4400 hosts
Phase 2: 4400 + 10 x 400 = 8400 hosts
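The phase figures above follow from adding whole groups of 400 hosts. A minimal Python sketch of that bookkeeping is shown here. It assumes every group has 400 hosts, as in the slide's 11 x 400, and that groups stay fully connected to each other after expansion. The names `HOSTS_PER_GROUP` and `expansion_plan` are made up for illustration.

```python
HOSTS_PER_GROUP = 400  # assumption: every group has the same size as in Phase 1


def expansion_plan(groups_added_per_phase):
    """Return one (groups, hosts, group_pairs) tuple per phase.

    group_pairs is the number of group-to-group connections a full graph
    needs, G * (G - 1) / 2; it is not a count of physical cables.
    """
    plan, groups = [], 0
    for added in groups_added_per_phase:
        groups += added
        plan.append((groups, groups * HOSTS_PER_GROUP, groups * (groups - 1) // 2))
    return plan


for phase, (groups, hosts, pairs) in enumerate(expansion_plan([11, 10]), start=1):
    print(f"Phase {phase}: {groups} groups, {hosts} hosts, {pairs} group pairs")
# Phase 1: 11 groups, 4400 hosts, 55 group pairs
# Phase 2: 21 groups, 8400 hosts, 210 group pairs
```

Going from 11 to 21 groups nearly doubles the host count while the number of group pairs grows from 55 to 210, which is why the cabling change being local to the central racks matters.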
Questions?
Thank You