The Critical Path for HPC, Cloud and Machine Learning
Lugano, April 9, 2018
Advanced Networking
Modern Data Centers
Orchestration, SDS, SDN, Multi-Tenancy, On Demand
The Challenge – Software Implementation
Performance
Programmability
The Solution - Hardware Acceleration
Software-defined everything is key for modern data centers
With today's software-based solutions, functionality is (almost) there
To gain flexibility, performance, and cost efficiency, hardware acceleration is needed
High Performance Workloads Can’t Deliver Without HW Acceleration
Software vs. Software + Hardware Acceleration
Advanced Network Technologies for OpenStack
Overlay Networks
Overlay Network Advantages: Isolation, Simplicity, Scalability
[Diagram: physical view, two servers hosting VM1-VM8; virtual view, the same VMs grouped into three isolated virtual domains spanning both servers over NVGRE/VXLAN/Geneve overlay networks]
Turbocharge Overlay Networks
Overlay tunnels add network processing: they limit bandwidth and consume CPU
System efficiency drops by tens of percent
For penalty-free overlays at bare-metal performance, use a NIC with overlay network HW offloads: the ConnectX-4 and ConnectX-5 families
Mellanox adapters also support VXLAN VTEP (encap/decap)
[Chart: 40Gb/s ConnectX-3 Pro, 8 VM pairs, bandwidth and CPU efficiency]
                    Physical   VXLAN No Offloads   VXLAN HW Offloads
Bandwidth (Gb/s)    37.5       17.62               36.21
CPU % per 1 Gb/s    0.7        3.5                 0.7
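A quick way to verify that a port actually exposes VXLAN hardware offload is to inspect the kernel feature flags; a minimal sketch (the interface name is a placeholder):

```python
import subprocess

def vxlan_offload_enabled(iface: str) -> bool:
    """Check whether tx-udp_tnl-segmentation (VXLAN TSO offload) is on."""
    # 'ethtool -k' lists the kernel offload feature flags for an interface.
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.strip().startswith("tx-udp_tnl-segmentation:"):
            return "on" in line.split(":", 1)[1]
    return False

print(vxlan_offload_enabled("ens1f0"))  # hypothetical interface name
```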
Para-Virtualized vs. SR-IOV
Single Root I/O Virtualization (SR-IOV):
- PCIe device presents multiple instances to the OS/hypervisor
- Enables direct application access
- Bare-metal performance for VMs
- Reduces CPU overhead
- Enables many advanced NIC features (e.g. DPDK, RDMA, ASAP2)
[Diagram: para-virtualized path, VMs reach the NIC through a vSwitch in the hypervisor; SR-IOV path, VMs attach directly to Virtual Functions (VFs) of the NIC's eSwitch while the hypervisor owns the Physical Function (PF)]
Fully Integrated And Upstream With OpenStack
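For reference, VFs are typically instantiated through sysfs on Linux; a minimal sketch, assuming a hypothetical PF named ens1f0 and root privileges:

```python
from pathlib import Path

def create_vfs(pf: str, num_vfs: int) -> None:
    """Instantiate SR-IOV Virtual Functions on a Physical Function via sysfs."""
    node = Path(f"/sys/class/net/{pf}/device/sriov_numvfs")
    node.write_text("0")            # reset any existing VFs first
    node.write_text(str(num_vfs))   # create the requested number of VFs

create_vfs("ens1f0", 4)  # the VFs then appear as additional PCI devices
```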
Per VF (SR-IOV) Quality of Service (QoS)
New Neutron API for per-VF rate limiting, per-VF bandwidth guarantee, and packet pacing
Same model for para-virtualized and SR-IOV
In SR-IOV mode, QoS is enforced by HW: finer grained, more predictable, less jitter, lower CPU utilization
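As an illustration of the Neutron QoS model this builds on (not Mellanox-specific), a bandwidth-limit policy can be created and attached to a port; a hedged sketch using openstacksdk, where the cloud profile, values, and port ID are all assumptions:

```python
import openstack

# Connect using a cloud profile defined in clouds.yaml (name is an assumption).
conn = openstack.connect(cloud="mycloud")

# Create a QoS policy with an egress bandwidth-limit rule (values illustrative).
policy = conn.network.create_qos_policy(name="vf-rate-limit")
conn.network.create_qos_bandwidth_limit_rule(
    policy, max_kbps=1_000_000, max_burst_kbps=100_000, direction="egress")

# Attach the policy to an existing SR-IOV port (port ID is a placeholder).
conn.network.update_port("PORT_ID", qos_policy_id=policy.id)
```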
[Diagram: per-VF rate shapers feed QoS queues of work queues through round-robin arbiters into priority arbiters; traffic classes TC 0-7 with per-TC flow control are scheduled by DWRR TC groups under strict priority (enhanced ETS), with per-VF rate limiters]
Mellanox Advanced HW QoS Implementation
SR-IOV High Availability / VF LAG
SR-IOV VMs don't natively support bonding/HA; Mellanox enables transparent SR-IOV HA on a single NIC
LAG is implemented on the Mellanox NIC, so the VM sees only a single Virtual Function (VF)
Modes supported: Active-Passive (single-port BW), Active-Active (double-port BW), LACP
[Diagram: the VM's VF driver attaches to a single Virtual Function; inside the NIC, Port 1 and Port 2 are aggregated into a LAG beneath that VF, transparent to host and VM]
Tradeoffs Between Virtual Switch and SR-IOV
[Comparison: Virtual Switch vs. SR-IOV]
Open vSwitch (OVS) Challenges
Virtual switches such as Open vSwitch (OVS) are used as the forwarding plane in the hypervisor
Virtual switches implement extensive support for SDN (e.g. policy enforcement) and are widely used across the industry
Supports L2-L3 networking features (L2 & L3 forwarding, NAT, ACL, connection tracking, etc.); flow based
OVS challenges:
- Poor packet performance: <1M PPS with 2-4 cores
- Burns CPU: even with 12 cores, it can't reach one third of 100G NIC speed
- Bad user experience: high and unpredictable latency, packet drops
Solution: offload the OVS data plane into the Mellanox NIC using ASAP2 technology
ConnectX-5 Packet Processing Offload Capabilities
Flow tables: multiple programmable tables; dedicated, isolated tables for the hypervisor and/or VMs; practically unlimited table size
Can support millions of rules/flows
Classification: match on all header fields, including encapsulated packets; flexible field extraction via "Flexparse"
Actions:
- Steering
- Encap/decap: VXLAN, NVGRE, Geneve, MPLSoGRE/UDP, NSH; flexible encap/decap
- Drop/allow, mirror, flow ID, header rewrite, hairpin mode
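On Linux these flow tables are reachable through, among other interfaces, tc flower hardware offload; a minimal sketch that installs a hardware-only drop rule (interface name and address are placeholders):

```python
import subprocess

def offload_drop_rule(iface: str, dst_ip: str) -> None:
    """Install a flower classifier that drops traffic to dst_ip, in hardware only."""
    # Ensure an ingress qdisc exists to attach filters to.
    subprocess.run(["tc", "qdisc", "add", "dev", iface, "ingress"], check=False)
    # 'skip_sw' asks the kernel to program the rule into the NIC's flow tables.
    subprocess.run(["tc", "filter", "add", "dev", iface, "ingress",
                    "protocol", "ip", "flower", "skip_sw",
                    "dst_ip", dst_ip, "action", "drop"], check=True)

offload_drop_rule("ens1f0", "192.0.2.1")
```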
Accelerated Switching And Packet Processing (ASAP2)
ASAP2 takes advantage of the ConnectX-5 capability to accelerate or offload the "in host" network stack; it is a family of solutions:
- ASAP2 Direct: full vSwitch offload
- ASAP2 Flex: vSwitch acceleration
- ASAP2 Flex: VNF/VM acceleration
ASAP2 Direct: Full OVS Offload
Enables the SR-IOV data path with the OVS control plane; in other words, it supports most SDN controllers with an SR-IOV data plane
Open vSwitch remains the management interface while its data plane is offloaded to the Mellanox embedded switch (eSwitch) using ASAP2 Direct; see the setup sketch after the diagram below
[Diagram: OVS (OVS-eSwitch) runs in the hypervisor over netdev representor ports, one per VF plus the PF (wire) and the host exception path (user space); the ConnectX-5 eSwitch switches traffic directly between SR-IOV VFs and the wire, while para-virtualized VMs and the host IP interface still traverse the hypervisor]
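On Linux, the usual recipe for this mode is to flip the eSwitch into switchdev mode and enable OVS hardware offload; a hedged sketch (the PCI address and the service name vary by system and are assumptions here):

```python
import subprocess

def enable_ovs_offload(pci_addr: str) -> None:
    """Put the NIC eSwitch in switchdev mode and enable OVS hardware offload."""
    # Expose VF representor netdevs by moving the eSwitch out of legacy mode.
    subprocess.run(["devlink", "dev", "eswitch", "set", f"pci/{pci_addr}",
                    "mode", "switchdev"], check=True)
    # Tell OVS to program datapath flows into the NIC.
    subprocess.run(["ovs-vsctl", "set", "Open_vSwitch", ".",
                    "other_config:hw-offload=true"], check=True)
    # OVS must be restarted for the offload setting to take effect.
    subprocess.run(["systemctl", "restart", "openvswitch-switch"], check=True)

enable_ovs_offload("0000:03:00.0")  # hypothetical PCI address of the PF
```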
OVS over DPDK vs. OVS Offload
ConnectX-5 provides a significant performance boost without adding CPU resources
[Chart: message rate (million packets per second) vs. dedicated hypervisor cores: OVS over DPDK reaches 7.6 MPPS using 4 cores; OVS offload reaches 66 MPPS using 0 cores]

Test               ASAP2 Direct   OVS DPDK          Benefit
1 flow, VXLAN      66M PPS        7.6M PPS (VLAN)   8.6X
60K flows, VXLAN   19.8M PPS      1.9M PPS          10.4X
Remote Direct Memory Access (RDMA)
ZERO Copy Remote Data Transfer
Low Latency, High Performance Data Transfers
InfiniBand: 100Gb/s; RoCE (RDMA over Converged Ethernet): 100Gb/s
Kernel bypass, protocol offload
[Diagram: application buffers on the two hosts exchange data directly from user space, bypassing the kernel; the transport runs in NIC hardware]
RDMA in the Cloud
Enable RDMA applications to run in the cloud: scientific HPC, machine learning and AI, databases
Accelerate cloud infrastructure: VM migration over RDMA, message queues over RDMA (e.g. gRPC)
Accelerate cloud storage: iSER, NVMe-oF
RDMA Provides the Fastest OpenStack Block Storage Access
Using OpenStack built-in components and management (Open-iSCSI, the tgt target, Cinder), no additional software is required; RDMA is already inbox and used by our OpenStack customers!
[Diagram: compute servers running KVM hypervisors use Open-iSCSI with iSER in the adapter, across the switching fabric, to storage servers running an iSCSI/iSER target (tgt) with local disks and an RDMA cache, orchestrated by OpenStack Cinder; RDMA is used to accelerate iSCSI storage]
[Chart: bandwidth (MB/s) vs. I/O size (1-256 KB) for iSER writes with 4/8/16 VMs against iSCSI writes with 8/16 VMs; iSER saturates the PCIe limit, about 6X the iSCSI bandwidth]
RDMA enables 6x more bandwidth, 5x lower I/O latency, and lower CPU utilization
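For reference, moving an Open-iSCSI initiator onto RDMA is a matter of selecting the iSER transport at login time; a minimal sketch with placeholder target name and portal:

```python
import subprocess

def iser_login(target_iqn: str, portal: str) -> None:
    """Discover a target and log in over the iSER (iSCSI over RDMA) transport."""
    subprocess.run(["iscsiadm", "-m", "discovery", "-t", "sendtargets",
                    "-p", portal], check=True)
    # '-I iser' selects the iSER transport interface instead of TCP.
    subprocess.run(["iscsiadm", "-m", "node", "-T", target_iqn,
                    "-p", portal, "-I", "iser", "--login"], check=True)

iser_login("iqn.2018-04.com.example:storage", "192.0.2.10:3260")  # placeholders
```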
NVMe Over Fabrics
Sharing NVMe-based storage across multiple servers: better utilization of capacity, rack space, and power; scalability, management, fault isolation
The RDMA protocol is part of the standard: InfiniBand or Ethernet (RoCE)
OpenStack integration: Cinder driver*
* Roadmap
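To give a feel for the fabric side, an initiator attaches to an NVMe-oF target over RDMA with nvme-cli; a minimal sketch with placeholder address and subsystem name:

```python
import subprocess

def nvmf_connect(addr: str, subsys_nqn: str) -> None:
    """Attach a remote NVMe namespace over an RDMA fabric (RoCE or InfiniBand)."""
    subprocess.run(["modprobe", "nvme-rdma"], check=True)   # load the RDMA transport
    subprocess.run(["nvme", "connect", "-t", "rdma",
                    "-a", addr, "-s", "4420",               # 4420: default NVMe-oF port
                    "-n", subsys_nqn], check=True)          # subsystem NQN

nvmf_connect("192.0.2.20", "nqn.2018-04.com.example:nvme1")  # placeholders
```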
Data Plane Development Kit (DPDK)
What is DPDK? A set of open-source libraries and drivers for fast packet processing
What are the main usages and benefits of DPDK?
- Receive and send packets within a minimum number of CPU cycles (usually fewer than 80)
- Develop fast packet-capture algorithms
- Run third-party fast-path stacks
- Serve as an abstraction layer that enables application porting between CPU architectures
DPDK in the cloud: accelerate virtual switches (e.g., OVS over DPDK) and enable Virtual Network Functions (VNFs); see the launch sketch below
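As a concrete starting point, DPDK ships a reference forwarding application, testpmd; a hedged launch sketch using the DPDK 17.x-era CLI (core list and PCI address are placeholders):

```python
import subprocess

# Launch testpmd: 4 cores (-l 0-3), 4 memory channels (-n 4), bound to one
# ConnectX port by PCI address (-w, the old whitelist flag). The Mellanox
# PMDs (mlx4/mlx5) keep the port visible to the kernel while DPDK drives
# the data path.
subprocess.run(["testpmd", "-l", "0-3", "-n", "4", "-w", "0000:03:00.0",
                "--", "--rxq=4", "--txq=4", "--forward-mode=io"], check=True)
```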
DPDK with Mellanox: Industry-Leading Performance
66% lower latency compared to the competition
Highest performance and message rate in the market
[Chart: lossless ConnectX-5 Ex 100GbE frame rate, 16 cores; frames of 1024B and larger run at 100GbE line rate]
Frame size (B):      64      128    256    512    1024   1280   1518
Frame rate (Mpps):   139.22  84.46  45.29  23.50  11.97  9.62   8.13
DPDK with Mellanox – Secure & Cost Effective
SECURE (memory protection in hardware): NIC-based hardware memory protection and translation, through per-application memory registration and isolation. Benefits: better security; supports containerized DPDK applications without SR-IOV
COST EFFECTIVE: allows concurrent use of DPDK and non-DPDK applications on the same NIC, unlike the competition. Benefit: saves the CapEx of a dedicated DPDK NIC
MULTI-ARCH: supports multiple architectures. Benefit: tight integration with processor-specific accelerators (Neon, AVX, etc.)
OpenStack over InfiniBand – The Route to Extreme Performance
Transparent InfiniBand integration into OpenStack, since the Havana OpenStack release
RDMA directly from the VM (requires SR-IOV)
MAC-to-GUID mapping, VLAN-to-pkey mapping, InfiniBand SDN network
An ideal fit for high performance computing clouds
InfiniBand enables the highest performance and efficiency
Ironic Ethernet and InfiniBand Support
Ironic is OpenStack bare-metal provisioning, useful for High Performance Computing (HPC) and Big Data
Mellanox enabled Ironic support for InfiniBand and Ethernet:
- Bare-metal multi-tenancy over IB and Eth
- Zero-touch VLAN switch provisioning
- InfiniBand support for Ironic with Neutron, using pkey segmentation and OpenSM integration (via NEO/UFM)
I'm "Pixie Boots", the mascot of "Bear Metal" provisioning, a.k.a. Ironic
Comprehensive OpenStack Integration
- Integrated with major OpenStack distributions, in-box
- Neutron ML2 support for mixed environments (VXLAN, PV, SR-IOV) over Ethernet
- Neutron: hardware support for security and isolation
- Accelerating storage access by up to 5X
OpenStack plugins create seamless integration, control, and management
Machine Learning Network Needs
Neural Network Complexity Growth
Speech recognition (2014-2017): DeepSpeech → DeepSpeech-2 → DeepSpeech-3, about 30X growth in complexity
Image recognition (2013-2016): AlexNet → GoogleNet → ResNet → Inception-V2 → Inception-V4 → PolyNet, about 350X growth in complexity
Training Challenges
Training with large data sets and ever-growing networks can take a long time, in some cases weeks
In many cases training needs to happen frequently: during model development and tuning, and real-life use cases may require regular retraining
Accelerate training time with a scale-out architecture: add workers (nodes) to reduce training time
Types of parallelism that are now popular: data parallelism and model parallelism
The network is a critical element in accelerating distributed training!
Model and Data Parallelism
[Diagram: model parallelism, one model is split across workers, each holding part of the model; data parallelism, each worker holds a local model replica, consumes its own mini-batches of the data, and synchronizes with a main model via a parameter server or allreduce]
Accelerating Data Parallelism
Data parallelism communication pattern:
- Gradient updates to parameter servers or among workers
- Model parameter distribution among workers
- Frequent: every training step, due to the sequential nature of SGD
High bandwidth is needed as models become larger and larger (the number of parameters keeps increasing)
Usually characterized by bursts on the network, since workers are synchronized
RDMA and GPUDirect accelerate data parallelism; a toy illustration of the pattern follows
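To make the communication pattern concrete, here is a toy, single-process illustration of one synchronous data-parallel step (pure NumPy, no real network; the "gradient" and shapes are invented for the example):

```python
import numpy as np

def sgd_step(params, worker_batches, lr=0.01):
    """One synchronous data-parallel step: each worker computes a gradient on its
    own mini-batch, the gradients are aggregated (this average is the allreduce /
    parameter-server traffic), and every replica applies the identical update."""
    grads = [np.mean(batch, axis=0) - params for batch in worker_batches]  # toy gradient
    agg = np.mean(grads, axis=0)   # the reduction that crosses the network
    return params + lr * agg       # same update applied on every worker

params = np.zeros(4)
batches = [np.random.randn(32, 4) for _ in range(8)]  # 8 workers, 32 samples each
params = sgd_step(params, batches)
```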
What Is RDMA?
Remote Direct Memory Access (RDMA): an advanced transport protocol (same layer as TCP and UDP)
Main features:
- Remote memory read/write semantics in addition to send/receive
- Kernel bypass / direct user-space access
- Full hardware offload
- Secure, channel-based I/O
Application advantages: low latency, high bandwidth, low CPU consumption
RoCE: RDMA over Converged Ethernet, available for all Ethernet speeds, 10-100G
Verbs: the RDMA SW interface (equivalent to sockets); a resource-setup sketch follows
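A sketch of verbs resource setup using the pyverbs bindings from rdma-core; the device name is a placeholder, exact constructor signatures may differ across rdma-core versions, and connection establishment (exchanging QP numbers and keys out of band) is omitted:

```python
import pyverbs.enums as e
from pyverbs.device import Context
from pyverbs.pd import PD
from pyverbs.cq import CQ
from pyverbs.mr import MR
from pyverbs.qp import QP, QPCap, QPInitAttr

ctx = Context(name="mlx5_0")      # open the RDMA device (placeholder name)
pd = PD(ctx)                      # protection domain: isolates resources
mr = MR(pd, 4096,                 # register memory so the NIC may DMA into it
        e.IBV_ACCESS_LOCAL_WRITE | e.IBV_ACCESS_REMOTE_WRITE)
cq = CQ(ctx, 16)                  # completion queue for work completions
cap = QPCap(max_send_wr=16, max_recv_wr=16)
qp = QP(pd, QPInitAttr(qp_type=e.IBV_QPT_RC, scq=cq, rcq=cq, cap=cap))
# From here, peers exchange QP numbers and rkeys out of band, move the QP to
# RTS, and post send/read/write work requests that the NIC executes zero-copy.
```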
GPUDirect™ RDMA Technology
All Major Machine Learning Frameworks Support RDMA
- TensorFlow: several implementations upstream
  - Native (verbs): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/verbs
  - MPI; Horovod, donated by Uber, among others
- Caffe2: over MPI or the Gloo library
- Microsoft Cognitive Toolkit: native support
- NVIDIA NCCL2: native support in NCCL
A TensorFlow launch sketch follows
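For instance, the upstream verbs transport is selected when constructing a distributed TensorFlow (1.x) server; a minimal sketch with a made-up two-node cluster:

```python
import tensorflow as tf  # TensorFlow 1.x built with the contrib verbs transport

cluster = tf.train.ClusterSpec({
    "worker": ["10.0.0.1:2222", "10.0.0.2:2222"],  # placeholder addresses
})

# protocol='grpc+verbs' keeps gRPC for control messages but moves tensor
# payloads onto RDMA verbs; plain 'grpc' would stay on TCP.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
server.join()
```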
TensorFlow with Mellanox RDMA Test Report
System configuration: 8 x86 servers, 4 NVIDIA P100 GPUs per server, Mellanox 100G RDMA network, an NVMe drive per server
RDMA vs. TCP: up to 50% better performance
Advanced RDMA vs. TCP: up to 173% better performance
Reference deployment guide
SHARP To Accelerate Parameter Server
A bottleneck is created by the parameter server; SHARP reduces network load and latency
[Diagram: at scale, each of N workers produces a gradient ΔW1 … ΔWN; the switch network aggregates partial sums of ΔWi hop by hop into the full reduction Σ ΔWi in-network, rather than at the parameter server]
Data Ingestion
Data ingestion is the process of acquiring and preparing the input: a preprocessing stage before data reaches the machine learning framework
Examples: converting file/image formats, combining multiple data sources, cleaning noise / enhancing the input
Relevant for both training and inference
Data ingestion typically includes access to storage (local, distributed, or network storage) and pre-processing in a big data framework such as Hadoop or Spark
Accelerating data ingest is critical for machine learning performance
Data Pipeline
Mellanox Is Driving High Performance Storage
A majority of these customers are doing NVMe-oF POCs or early development with us today
Accelerate Big Data - Enabling Real-time Decisions
[Chart: TeraSort benchmark, execution time in seconds on the Ethernet network for Intel 10Gb/s, Mellanox 10Gb/s, and Mellanox 40Gb/s: 3X faster runtime]
[Chart: Fraud detection benchmark, total transaction time in ms for the existing solution vs. Aerospike with Mellanox + Samsung NVMe; the faster infrastructure (CPU + storage + network) leaves roughly 2x more time for running the fraud detection algorithm itself: ~2X faster runtime]
[Chart: Cassandra Stress benchmark: 25G of bandwidth delivered to the database]
Mellanox is certified by leading Big Data partners
Spark Over RDMA – Accelerate Map/Reduce
[Diagram: Spark shuffle; map tasks write map output files from the input, and reduce tasks fetch blocks from every mapper over the network, coordinated by the driver]
Shuffling is very expensive in terms of CPU, RAM, disk, and network I/O
Spark over RDMA speeds up shuffle operations; RDMA access is provided by "DiSNI" (an open-source Java interface to RDMA user libraries): https://github.com/zrlio/disni (a configuration sketch follows)
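For a feel of how such a shuffle plugin is wired in, a hedged sketch of a PySpark session that swaps in an RDMA shuffle manager; the class and jar paths follow the Mellanox SparkRDMA project and should be checked against its README:

```python
from pyspark.sql import SparkSession

# Replace Spark's default shuffle manager with an RDMA implementation.
# Class and jar names follow github.com/Mellanox/SparkRDMA and may vary by version.
spark = (SparkSession.builder
         .appName("terasort-rdma")
         .config("spark.driver.extraClassPath", "/opt/spark-rdma/spark-rdma.jar")
         .config("spark.executor.extraClassPath", "/opt/spark-rdma/spark-rdma.jar")
         .config("spark.shuffle.manager",
                 "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
         .getOrCreate())
```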
Spark over RDMA Performance Results: TeraSort
Testbed: HiBench TeraSort
- Workload: 175GB
- HDFS on Hadoop 2.6.0, no replication
- Spark 2.2.0: 1 master, 16 workers, 28 active Spark cores per node (420 total)
- Node info: Intel Xeon E5-2697 v3 @ 2.60GHz, RoCE 100GbE, 256GB RAM; HDD used for Spark local directories and HDFS
[Chart: TeraSort runtime in seconds (0-80), RDMA vs. standard TCP; RDMA finishes faster]
Spark over RDMA: Real Applications Results
[Chart: runtime in seconds, TCP vs. RDMA, for Customer App #1 (17% improvement), Customer App #2 (23%), and HiBench TeraSort (19%); lower is better]

Runtime samples    Input Size   Nodes   Cores per node   RAM per node   Improvement
Customer App #1    5GB          14      24               85GB           17%
Customer App #2    540GB        14      24               85GB           23%
HiBench TeraSort   300GB        15      28               256GB          19%
Containers
Containers vs. Virtual Machines
[Diagram: VMs each bundle a guest OS, libraries, and the app on top of a hypervisor; containers share the host operating system and run libraries plus the app under a container engine]
Containers are a packaging technology and lightweight virtualization
Containers Networking – Many Options
[Diagram: containers on a host reach the network through different paths]
- Direct host network
- Unix-domain sockets and other IPC
- Linux bridge (docker0)
- iptables (Docker proxy) with port mapping
- Open vSwitch
Containers Networking With Mellanox
10-100Gb/s Ethernet, stateless offloads, scalable and secure
DPDK usage from a container (*)
RDMA for InfiniBand and RoCE from within the container (*); a container launch sketch follows
SR-IOV (*)
Roadmap: container direct networking, vSwitch offload with ASAP2
(*) Initial support available; integration with Kubernetes later this year
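As an illustration of the current RDMA-from-a-container support, the host's RDMA device nodes can be passed straight into the container; a hedged sketch (the image name is a placeholder):

```python
import subprocess

# Expose the host's RDMA verbs and connection-manager device nodes to the
# container so RDMA applications inside it can drive the NIC directly.
subprocess.run(["docker", "run", "--rm", "-it",
                "--net=host",                        # share host networking
                "--device=/dev/infiniband/uverbs0",  # user verbs device
                "--device=/dev/infiniband/rdma_cm",  # RDMA connection manager
                "rdma-app:latest"], check=True)      # placeholder image
```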
Thank You