The Critical Path for HPC, Cloud and Machine Learning
Lugano, April 9, 2018
Advanced Networking
Modern Data Centers
Orchestration, SDS, SDN, Multi-Tenancy, On Demand
The Challenge – Software Implementation
Performance
Programmability
The Solution - Hardware Acceleration
Software-defined everything is key for modern data centers
With today's software-based solutions, functionality is (almost) there
To gain flexibility, performance, and cost efficiency, hardware acceleration is needed
High Performance Workloads Can’t Deliver Without HW Acceleration
Software vs. Software + Hardware Acceleration
Advanced Network Technologies for OpenStack
Overlay Networks
Overlay Network Advantages: Isolation, Simplicity, Scalability
[Diagram: physical view, two servers hosting VM1-VM8; virtual view, the same VMs grouped into three isolated virtual domains spanning both servers over NVGRE/VXLAN/Geneve overlay networks]
Turbocharge Overlay Networks
Overlay tunnels add network processing: they limit bandwidth and consume CPU
System efficiency drops by tens of percent
For penalty-free overlays at bare-metal performance, use a NIC with overlay network HW offloads: the ConnectX-4 and ConnectX-5 families
Mellanox adapters also support VXLAN VTEP (encap/decap)
[Chart: 40Gb/s ConnectX-3 Pro, 8 VM pairs, bandwidth and CPU efficiency]
                    Physical   VXLAN No Offloads   VXLAN HW Offloads
Bandwidth (Gb/s)    37.5       17.62               36.21
CPU % per 1 Gb/s    0.7        3.5                 0.7
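A quick way to verify that a port actually exposes VXLAN hardware offload is to inspect the kernel feature flags; a minimal sketch (the interface name is a placeholder):

```python
import subprocess

def vxlan_offload_enabled(iface: str) -> bool:
    """Check whether tx-udp_tnl-segmentation (VXLAN TSO offload) is on."""
    # 'ethtool -k' lists the kernel offload feature flags for an interface.
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.strip().startswith("tx-udp_tnl-segmentation:"):
            return "on" in line.split(":", 1)[1]
    return False

print(vxlan_offload_enabled("ens1f0"))  # hypothetical interface name
```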
Para-Virtualized vs. SR-IOV
Single Root I/O Virtualization (SR-IOV):
- PCIe device presents multiple instances to the OS/hypervisor
- Enables direct application access
- Bare-metal performance for VMs
- Reduces CPU overhead
- Enables many advanced NIC features (e.g. DPDK, RDMA, ASAP2)
[Diagram: para-virtualized path, VMs reach the NIC through a vSwitch in the hypervisor; SR-IOV path, VMs attach directly to Virtual Functions (VFs) of the NIC's eSwitch while the hypervisor owns the Physical Function (PF)]
Fully Integrated And Upstream With OpenStack
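For reference, VFs are typically instantiated through sysfs on Linux; a minimal sketch, assuming a hypothetical PF named ens1f0 and root privileges:

```python
from pathlib import Path

def create_vfs(pf: str, num_vfs: int) -> None:
    """Instantiate SR-IOV Virtual Functions on a Physical Function via sysfs."""
    node = Path(f"/sys/class/net/{pf}/device/sriov_numvfs")
    node.write_text("0")            # reset any existing VFs first
    node.write_text(str(num_vfs))   # create the requested number of VFs

create_vfs("ens1f0", 4)  # the VFs then appear as additional PCI devices
```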
Per VF (SR-IOV) Quality of Service (QoS)
New Neutron API for per-VF rate limiting, per-VF bandwidth guarantee, and packet pacing
Same model for para-virtualized and SR-IOV
In SR-IOV mode, QoS is enforced by HW: finer grained, more predictable, less jitter, lower CPU utilization
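As an illustration of the Neutron QoS model this builds on (not Mellanox-specific), a bandwidth-limit policy can be created and attached to a port; a hedged sketch using openstacksdk, where the cloud profile, values, and port ID are all assumptions:

```python
import openstack

# Connect using a cloud profile defined in clouds.yaml (name is an assumption).
conn = openstack.connect(cloud="mycloud")

# Create a QoS policy with an egress bandwidth-limit rule (values illustrative).
policy = conn.network.create_qos_policy(name="vf-rate-limit")
conn.network.create_qos_bandwidth_limit_rule(
    policy, max_kbps=1_000_000, max_burst_kbps=100_000, direction="egress")

# Attach the policy to an existing SR-IOV port (port ID is a placeholder).
conn.network.update_port("PORT_ID", qos_policy_id=policy.id)
```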
[Diagram: per-VF rate shapers feed QoS queues of work queues through round-robin arbiters into priority arbiters; traffic classes TC 0-7 with per-TC flow control are scheduled by DWRR TC groups under strict priority (enhanced ETS), with per-VF rate limiters]
Mellanox Advanced HW QoS Implementation
SR-IOV High Availability / VF LAG
SR-IOV VMs don't natively support bonding/HA; Mellanox enables transparent SR-IOV HA on a single NIC
LAG is implemented on the Mellanox NIC, so the VM sees only a single Virtual Function (VF)
Modes supported: Active-Passive (single-port BW), Active-Active (double-port BW), LACP
[Diagram: the VM's VF driver attaches to a single Virtual Function; inside the NIC, Port 1 and Port 2 are aggregated into a LAG beneath that VF, transparent to host and VM]
Tradeoffs Between Virtual Switch and SR-IOV
[Comparison: Virtual Switch vs. SR-IOV]
Open vSwitch (OVS) Challenges
Virtual switches such as Open vSwitch (OVS) are used as the forwarding plane in the hypervisor
Virtual switches implement extensive support for SDN (e.g. policy enforcement) and are widely used across the industry
Supports L2-L3 networking features (L2 & L3 forwarding, NAT, ACL, connection tracking, etc.); flow based
OVS challenges:
- Poor packet performance: <1M PPS with 2-4 cores
- Burns CPU: even with 12 cores, it can't reach one third of 100G NIC speed
- Bad user experience: high and unpredictable latency, packet drops
Solution: offload the OVS data plane into the Mellanox NIC using ASAP2 technology
ConnectX-5 Packet Processing Offload Capabilities
Flow tables: multiple programmable tables; dedicated, isolated tables for the hypervisor and/or VMs; practically unlimited table size
Can support millions of rules/flows
Classification: match on all header fields, including encapsulated packets; flexible field extraction via "Flexparse"
Actions:
- Steering
- Encap/decap: VXLAN, NVGRE, Geneve, MPLSoGRE/UDP, NSH; flexible encap/decap
- Drop/allow, mirror, flow ID, header rewrite, hairpin mode
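On Linux these flow tables are reachable through, among other interfaces, tc flower hardware offload; a minimal sketch that installs a hardware-only drop rule (interface name and address are placeholders):

```python
import subprocess

def offload_drop_rule(iface: str, dst_ip: str) -> None:
    """Install a flower classifier that drops traffic to dst_ip, in hardware only."""
    # Ensure an ingress qdisc exists to attach filters to.
    subprocess.run(["tc", "qdisc", "add", "dev", iface, "ingress"], check=False)
    # 'skip_sw' asks the kernel to program the rule into the NIC's flow tables.
    subprocess.run(["tc", "filter", "add", "dev", iface, "ingress",
                    "protocol", "ip", "flower", "skip_sw",
                    "dst_ip", dst_ip, "action", "drop"], check=True)

offload_drop_rule("ens1f0", "192.0.2.1")
```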
Accelerated Switching And Packet Processing (ASAP2)
ASAP2 takes advantage of the ConnectX-5 capability to accelerate or offload the "in host" network stack; it is a family of solutions:
- ASAP2 Direct: full vSwitch offload
- ASAP2 Flex: vSwitch acceleration
- ASAP2 Flex: VNF/VM acceleration
ASAP2 Direct: Full OVS Offload
Enables the SR-IOV data path with the OVS control plane; in other words, it supports most SDN controllers with an SR-IOV data plane
Open vSwitch remains the management interface while its data plane is offloaded to the Mellanox embedded switch (eSwitch) using ASAP2 Direct; see the setup sketch after the diagram below
[Diagram: OVS (OVS-eSwitch) runs in the hypervisor over netdev representor ports, one per VF plus the PF (wire) and the host exception path (user space); the ConnectX-5 eSwitch switches traffic directly between SR-IOV VFs and the wire, while para-virtualized VMs and the host IP interface still traverse the hypervisor]
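On Linux, the usual recipe for this mode is to flip the eSwitch into switchdev mode and enable OVS hardware offload; a hedged sketch (the PCI address and the service name vary by system and are assumptions here):

```python
import subprocess

def enable_ovs_offload(pci_addr: str) -> None:
    """Put the NIC eSwitch in switchdev mode and enable OVS hardware offload."""
    # Expose VF representor netdevs by moving the eSwitch out of legacy mode.
    subprocess.run(["devlink", "dev", "eswitch", "set", f"pci/{pci_addr}",
                    "mode", "switchdev"], check=True)
    # Tell OVS to program datapath flows into the NIC.
    subprocess.run(["ovs-vsctl", "set", "Open_vSwitch", ".",
                    "other_config:hw-offload=true"], check=True)
    # OVS must be restarted for the offload setting to take effect.
    subprocess.run(["systemctl", "restart", "openvswitch-switch"], check=True)

enable_ovs_offload("0000:03:00.0")  # hypothetical PCI address of the PF
```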
OVS over DPDK vs. OVS Offload
ConnectX-5 provides a significant performance boost without adding CPU resources
[Chart: message rate (million packets per second) vs. dedicated hypervisor cores: OVS over DPDK reaches 7.6 MPPS using 4 cores; OVS offload reaches 66 MPPS using 0 cores]

Test               ASAP2 Direct   OVS DPDK          Benefit
1 flow, VXLAN      66M PPS        7.6M PPS (VLAN)   8.6X
60K flows, VXLAN   19.8M PPS      1.9M PPS          10.4X
Remote Direct Memory Access (RDMA)
ZERO Copy Remote Data Transfer
Low Latency, High Performance Data Transfers
InfiniBand: 100Gb/s; RoCE (RDMA over Converged Ethernet): 100Gb/s
Kernel bypass, protocol offload
[Diagram: application buffers on the two hosts exchange data directly from user space, bypassing the kernel; the transport runs in NIC hardware]
RDMA in the Cloud
Enable RDMA applications to run in the cloud: scientific HPC, machine learning and AI, databases
Accelerate cloud infrastructure: VM migration over RDMA, message queues over RDMA (e.g. gRPC)
Accelerate cloud storage: iSER, NVMe-oF
RDMA Provides the Fastest OpenStack Block Storage Access
Using OpenStack built-in components and management (Open-iSCSI, the tgt target, Cinder), no additional software is required; RDMA is already inbox and used by our OpenStack customers!
[Diagram: compute servers running KVM hypervisors use Open-iSCSI with iSER in the adapter, across the switching fabric, to storage servers running an iSCSI/iSER target (tgt) with local disks and an RDMA cache, orchestrated by OpenStack Cinder; RDMA is used to accelerate iSCSI storage]
[Chart: bandwidth (MB/s) vs. I/O size (1-256 KB) for iSER writes with 4/8/16 VMs against iSCSI writes with 8/16 VMs; iSER saturates the PCIe limit, about 6X the iSCSI bandwidth]
RDMA enables 6x more bandwidth, 5x lower I/O latency, and lower CPU utilization
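For reference, moving an Open-iSCSI initiator onto RDMA is a matter of selecting the iSER transport at login time; a minimal sketch with placeholder target name and portal:

```python
import subprocess

def iser_login(target_iqn: str, portal: str) -> None:
    """Discover a target and log in over the iSER (iSCSI over RDMA) transport."""
    subprocess.run(["iscsiadm", "-m", "discovery", "-t", "sendtargets",
                    "-p", portal], check=True)
    # '-I iser' selects the iSER transport interface instead of TCP.
    subprocess.run(["iscsiadm", "-m", "node", "-T", target_iqn,
                    "-p", portal, "-I", "iser", "--login"], check=True)

iser_login("iqn.2018-04.com.example:storage", "192.0.2.10:3260")  # placeholders
```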
NVMe Over Fabrics
Sharing NVMe-based storage across multiple servers: better utilization of capacity, rack space, and power; scalability, management, fault isolation
The RDMA protocol is part of the standard: InfiniBand or Ethernet (RoCE)
OpenStack integration: Cinder driver*
* Roadmap
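To give a feel for the fabric side, an initiator attaches to an NVMe-oF target over RDMA with nvme-cli; a minimal sketch with placeholder address and subsystem name:

```python
import subprocess

def nvmf_connect(addr: str, subsys_nqn: str) -> None:
    """Attach a remote NVMe namespace over an RDMA fabric (RoCE or InfiniBand)."""
    subprocess.run(["modprobe", "nvme-rdma"], check=True)   # load the RDMA transport
    subprocess.run(["nvme", "connect", "-t", "rdma",
                    "-a", addr, "-s", "4420",               # 4420: default NVMe-oF port
                    "-n", subsys_nqn], check=True)          # subsystem NQN

nvmf_connect("192.0.2.20", "nqn.2018-04.com.example:nvme1")  # placeholders
```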
Data Plane Development Kit (DPDK)
What is DPDK? A set of open-source libraries and drivers for fast packet processing
What are the main usages and benefits of DPDK?
- Receive and send packets within a minimum number of CPU cycles (usually fewer than 80)
- Develop fast packet-capture algorithms
- Run third-party fast-path stacks
- Serve as an abstraction layer that enables application porting between CPU architectures
DPDK in the cloud: accelerate virtual switches (e.g., OVS over DPDK) and enable Virtual Network Functions (VNFs); see the launch sketch below
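As a concrete starting point, DPDK ships a reference forwarding application, testpmd; a hedged launch sketch using the DPDK 17.x-era CLI (core list and PCI address are placeholders):

```python
import subprocess

# Launch testpmd: 4 cores (-l 0-3), 4 memory channels (-n 4), bound to one
# ConnectX port by PCI address (-w, the old whitelist flag). The Mellanox
# PMDs (mlx4/mlx5) keep the port visible to the kernel while DPDK drives
# the data path.
subprocess.run(["testpmd", "-l", "0-3", "-n", "4", "-w", "0000:03:00.0",
                "--", "--rxq=4", "--txq=4", "--forward-mode=io"], check=True)
```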
DPDK with Mellanox: Industry-Leading Performance
66% lower latency compared to the competition
Highest performance and message rate in the market
[Chart: lossless ConnectX-5 Ex 100GbE frame rate, 16 cores; frames of 1024B and larger run at 100GbE line rate]
Frame size (B):      64      128    256    512    1024   1280   1518
Frame rate (Mpps):   139.22  84.46  45.29  23.50  11.97  9.62   8.13
DPDK with Mellanox – Secure & Cost Effective
SECURE (memory protection in hardware): NIC-based hardware memory protection and translation, through per-application memory registration and isolation. Benefits: better security; supports containerized DPDK applications without SR-IOV
COST EFFECTIVE: allows concurrent use of DPDK and non-DPDK applications on the same NIC, unlike the competition. Benefit: saves the CapEx of a dedicated DPDK NIC
MULTI-ARCH: supports multiple architectures. Benefit: tight integration with processor-specific accelerators (Neon, AVX, etc.)
OpenStack over InfiniBand – The Route to Extreme Performance
Transparent InfiniBand integration into OpenStack, since the Havana OpenStack release
RDMA directly from the VM (requires SR-IOV)
MAC-to-GUID mapping, VLAN-to-pkey mapping, InfiniBand SDN network
An ideal fit for high performance computing clouds
InfiniBand enables the highest performance and efficiency
Ironic Ethernet and InfiniBand Support
Ironic is OpenStack bare-metal provisioning, useful for High Performance Computing (HPC) and Big Data
Mellanox enabled Ironic support for InfiniBand and Ethernet:
- Bare-metal multi-tenancy over IB and Eth
- Zero-touch VLAN switch provisioning
- InfiniBand support for Ironic with Neutron, using pkey segmentation and OpenSM integration (via NEO/UFM)
I'm "Pixie Boots", the mascot of "Bear Metal" provisioning, a.k.a. Ironic
Comprehensive OpenStack Integration
- Integrated with major OpenStack distributions, in-box
- Neutron ML2 support for mixed environments (VXLAN, PV, SR-IOV) over Ethernet
- Neutron: hardware support for security and isolation
- Accelerating storage access by up to 5X
OpenStack plugins create seamless integration, control, and management
Machine Learning Network Needs
Neural Network Complexity Growth
Speech recognition (2014-2017): DeepSpeech → DeepSpeech-2 → DeepSpeech-3, about 30X growth in complexity
Image recognition (2013-2016): AlexNet → GoogleNet → ResNet → Inception-V2 → Inception-V4 → PolyNet, about 350X growth in complexity
Training Challenges
Training with large data sets and ever-growing networks can take a long time, in some cases weeks
In many cases training needs to happen frequently: during model development and tuning, and real-life use cases may require regular retraining
Accelerate training time with a scale-out architecture: add workers (nodes) to reduce training time
Types of parallelism that are now popular: data parallelism and model parallelism
The network is a critical element in accelerating distributed training!
Model and Data Parallelism
[Diagram: model parallelism, one model is split across workers, each holding part of the model; data parallelism, each worker holds a local model replica, consumes its own mini-batches of the data, and synchronizes with a main model via a parameter server or allreduce]
Accelerating Data Parallelism
Data parallelism communication pattern:
- Gradient updates to parameter servers or among workers
- Model parameter distribution among workers
- Frequent: every training step, due to the sequential nature of SGD
High bandwidth is needed as models become larger and larger (the number of parameters keeps increasing)
Usually characterized by bursts on the network, since workers are synchronized
RDMA and GPUDirect accelerate data parallelism; a toy illustration of the pattern follows
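To make the communication pattern concrete, here is a toy, single-process illustration of one synchronous data-parallel step (pure NumPy, no real network; the "gradient" and shapes are invented for the example):

```python
import numpy as np

def sgd_step(params, worker_batches, lr=0.01):
    """One synchronous data-parallel step: each worker computes a gradient on its
    own mini-batch, the gradients are aggregated (this average is the allreduce /
    parameter-server traffic), and every replica applies the identical update."""
    grads = [np.mean(batch, axis=0) - params for batch in worker_batches]  # toy gradient
    agg = np.mean(grads, axis=0)   # the reduction that crosses the network
    return params + lr * agg       # same update applied on every worker

params = np.zeros(4)
batches = [np.random.randn(32, 4) for _ in range(8)]  # 8 workers, 32 samples each
params = sgd_step(params, batches)
```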
What Is RDMA?
Remote Direct Memory Access (RDMA): an advanced transport protocol (same layer as TCP and UDP)
Main features:
- Remote memory read/write semantics in addition to send/receive
- Kernel bypass / direct user-space access
- Full hardware offload
- Secure, channel-based I/O
Application advantages: low latency, high bandwidth, low CPU consumption
RoCE: RDMA over Converged Ethernet, available for all Ethernet speeds, 10-100G
Verbs: the RDMA SW interface (equivalent to sockets); a resource-setup sketch follows
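A sketch of verbs resource setup using the pyverbs bindings from rdma-core; the device name is a placeholder, exact constructor signatures may differ across rdma-core versions, and connection establishment (exchanging QP numbers and keys out of band) is omitted:

```python
import pyverbs.enums as e
from pyverbs.device import Context
from pyverbs.pd import PD
from pyverbs.cq import CQ
from pyverbs.mr import MR
from pyverbs.qp import QP, QPCap, QPInitAttr

ctx = Context(name="mlx5_0")      # open the RDMA device (placeholder name)
pd = PD(ctx)                      # protection domain: isolates resources
mr = MR(pd, 4096,                 # register memory so the NIC may DMA into it
        e.IBV_ACCESS_LOCAL_WRITE | e.IBV_ACCESS_REMOTE_WRITE)
cq = CQ(ctx, 16)                  # completion queue for work completions
cap = QPCap(max_send_wr=16, max_recv_wr=16)
qp = QP(pd, QPInitAttr(qp_type=e.IBV_QPT_RC, scq=cq, rcq=cq, cap=cap))
# From here, peers exchange QP numbers and rkeys out of band, move the QP to
# RTS, and post send/read/write work requests that the NIC executes zero-copy.
```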
GPUDirect™ RDMA Technology
All Major Machine Learning Frameworks Support RDMA
- TensorFlow: several implementations upstream
  - Native (verbs): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/verbs
  - MPI; Horovod, donated by Uber, among others
- Caffe2: over MPI or the Gloo library
- Microsoft Cognitive Toolkit: native support
- NVIDIA NCCL2: native support in NCCL
A TensorFlow launch sketch follows
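For instance, the upstream verbs transport is selected when constructing a distributed TensorFlow (1.x) server; a minimal sketch with a made-up two-node cluster:

```python
import tensorflow as tf  # TensorFlow 1.x built with the contrib verbs transport

cluster = tf.train.ClusterSpec({
    "worker": ["10.0.0.1:2222", "10.0.0.2:2222"],  # placeholder addresses
})

# protocol='grpc+verbs' keeps gRPC for control messages but moves tensor
# payloads onto RDMA verbs; plain 'grpc' would stay on TCP.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
server.join()
```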
TensorFlow with Mellanox RDMA Test Report
System configuration: 8 x86 servers, 4 NVIDIA P100 GPUs per server, Mellanox 100G RDMA network, an NVMe drive per server
RDMA vs. TCP: up to 50% better performance
Advanced RDMA vs. TCP: up to 173% better performance
Reference deployment guide
SHARP To Accelerate Parameter Server
A bottleneck is created by the parameter server; SHARP reduces network load and latency
[Diagram: at scale, each of N workers produces a gradient ΔW1 … ΔWN; the switch network aggregates partial sums of ΔWi hop by hop into the full reduction Σ ΔWi in-network, rather than at the parameter server]
Data Ingestion
Data ingestion is the process of acquiring and preparing the input: a preprocessing stage before data reaches the machine learning framework
Examples: converting file/image formats, combining multiple data sources, cleaning noise / enhancing the input
Relevant for both training and inference
Data ingestion typically includes access to storage (local, distributed, or network storage) and pre-processing in a big data framework such as Hadoop or Spark
Accelerating data ingest is critical for machine learning performance
Data Pipeline
Mellanox Is Driving High Performance Storage
A majority of these customers are doing NVMe-oF POCs or early development with us today
Accelerate Big Data - Enabling Real-time Decisions
[Chart: TeraSort benchmark, execution time in seconds on the Ethernet network for Intel 10Gb/s, Mellanox 10Gb/s, and Mellanox 40Gb/s: 3X faster runtime]
[Chart: Fraud detection benchmark, total transaction time in ms for the existing solution vs. Aerospike with Mellanox + Samsung NVMe; the faster infrastructure (CPU + storage + network) leaves roughly 2x more time for running the fraud detection algorithm itself: ~2X faster runtime]
[Chart: Cassandra Stress benchmark: 25G of bandwidth delivered to the database]
Mellanox is certified by leading Big Data partners
Spark Over RDMA – Accelerate Map/Reduce
[Diagram: Spark shuffle; map tasks write map output files from the input, and reduce tasks fetch blocks from every mapper over the network, coordinated by the driver]
Shuffling is very expensive in terms of CPU, RAM, disk, and network I/O
Spark over RDMA speeds up shuffle operations; RDMA access is provided by "DiSNI" (an open-source Java interface to RDMA user libraries): https://github.com/zrlio/disni (a configuration sketch follows)
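For a feel of how such a shuffle plugin is wired in, a hedged sketch of a PySpark session that swaps in an RDMA shuffle manager; the class and jar paths follow the Mellanox SparkRDMA project and should be checked against its README:

```python
from pyspark.sql import SparkSession

# Replace Spark's default shuffle manager with an RDMA implementation.
# Class and jar names follow github.com/Mellanox/SparkRDMA and may vary by version.
spark = (SparkSession.builder
         .appName("terasort-rdma")
         .config("spark.driver.extraClassPath", "/opt/spark-rdma/spark-rdma.jar")
         .config("spark.executor.extraClassPath", "/opt/spark-rdma/spark-rdma.jar")
         .config("spark.shuffle.manager",
                 "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
         .getOrCreate())
```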
Spark over RDMA Performance Results: TeraSort
Testbed: HiBench TeraSort
- Workload: 175GB
- HDFS on Hadoop 2.6.0, no replication
- Spark 2.2.0: 1 master, 16 workers, 28 active Spark cores per node (420 total)
- Node info: Intel Xeon E5-2697 v3 @ 2.60GHz, RoCE 100GbE, 256GB RAM; HDD used for Spark local directories and HDFS
[Chart: TeraSort runtime in seconds (0-80), RDMA vs. standard TCP; RDMA finishes faster]
Spark over RDMA: Real Applications Results
[Chart: runtime in seconds, TCP vs. RDMA, for Customer App #1 (17% improvement), Customer App #2 (23%), and HiBench TeraSort (19%); lower is better]

Runtime samples    Input Size   Nodes   Cores per node   RAM per node   Improvement
Customer App #1    5GB          14      24               85GB           17%
Customer App #2    540GB        14      24               85GB           23%
HiBench TeraSort   300GB        15      28               256GB          19%
Containers
Containers vs. Virtual Machines
[Diagram: VMs each bundle a guest OS, libraries, and the app on top of a hypervisor; containers share the host operating system and run libraries plus the app under a container engine]
Containers are a packaging technology and lightweight virtualization
Containers Networking – Many Options
[Diagram: containers on a host reach the network through different paths]
- Direct host network
- Unix-domain sockets and other IPC
- Linux bridge (docker0)
- iptables (Docker proxy) with port mapping
- Open vSwitch
Containers Networking With Mellanox
10-100Gb/s Ethernet, stateless offloads, scalable and secure
DPDK usage from a container (*)
RDMA for InfiniBand and RoCE from within the container (*); a container launch sketch follows
SR-IOV (*)
Roadmap: container direct networking, vSwitch offload with ASAP2
(*) Initial support available; integration with Kubernetes later this year
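As an illustration of the current RDMA-from-a-container support, the host's RDMA device nodes can be passed straight into the container; a hedged sketch (the image name is a placeholder):

```python
import subprocess

# Expose the host's RDMA verbs and connection-manager device nodes to the
# container so RDMA applications inside it can drive the NIC directly.
subprocess.run(["docker", "run", "--rm", "-it",
                "--net=host",                        # share host networking
                "--device=/dev/infiniband/uverbs0",  # user verbs device
                "--device=/dev/infiniband/rdma_cm",  # RDMA connection manager
                "rdma-app:latest"], check=True)      # placeholder image
```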
Thank You