Revolutionary Performance and Scalability for HPC and AI

Transcript
Page 1: Revolutionary Performance and Scalability for HPC and AI

© 2018 Mellanox Technologies

Scot Schultz, Sr. Director, HPC / AI and Technical Computing

Mellanox Technologies

Join the Conversation #OpenPOWERSummit

Revolutionary Performance and Scalability for HPC and AI

Page 2: Revolutionary Performance and Scalability for HPC and AI

NVMe, NVMe Over Fabrics

Big Data Storage

Hadoop / Spark

Software Defined Storage

SQL & NoSQL Database

Deep Learning

HPC

GPUDirect

RDMA

MPI

NCCL

SHARP

Image, Video Recognition

Sentiment Analysis

Fraud & Flaw Detection

Voice Recognition & Search

Same Interconnect Technology Enables a Variety of Applications

Page 3: Revolutionary Performance and Scalability for HPC and AI

Accelerating the Next Generation of HPC and AI

Summit CORAL System

Fastest Supercomputer in Japan

Fastest Supercomputer in Canada

InfiniBand Dragonfly+ Topology

Sierra CORAL System

Highest Performance with InfiniBand

Page 4: Revolutionary Performance and Scalability for HPC and AI

Higher Data Speeds

Faster Data Processing

Better Data Security

Adapters

Switches

Cables & Transceivers

SmartNIC System on a Chip

HPC and AI Needs the Most Intelligent Interconnect

Page 5: Revolutionary Performance and Scalability for HPC and AI

Cloud

Big Data

Enterprise

Business Intelligence

HPC

Storage

Security

Machine Learning

Internet of Things

Exponential Data Growth Everywhere

Page 6: Revolutionary Performance and Scalability for HPC and AI

Big Data? No… REALLY BIG DATA

The average data generated by a self-driving vehicle is expected to reach 40 TB for every eight hours of driving (this mostly applies to full-service fleet vehicles)

The Pratt & Whitney PW1000G engine has 5,000 sensors installed, generating about 10 GB of data per second; with an average 12-hour flight time it can produce up to 844 TB of data

Mellanox is the de-facto interconnect for deep learning deployments

Page 7: Revolutionary Performance and Scalability for HPC and AI

What’s Important For ML & Big Data Networking?

High Bandwidth and Low Latency: massive amounts of data require thick data pipes; the NVIDIA DGX-1 uses 4 x 100G InfiniBand ports; port latency is under 90 nsec with Mellanox EDR InfiniBand; Spectrum is the world's lowest-latency Ethernet switch

No Packet Loss: Mellanox Spectrum switches have zero packet loss

Offloads (free up the CPU): stateless offloads, RDMA, GPUDirect RDMA, GPUDirect ASYNC, virtualization and container networking

GPUDirect™ RDMA

In-Network Computing: SHARP

Storage Acceleration: RDMA, NVMe over Fabrics acceleration, erasure coding acceleration

Security: IPsec and TLS offloads

Resiliency: robust end-to-end network solutions

Page 8: Revolutionary Performance and Scalability for HPC and AI

Data Centric Architecture to Overcome Latency Bottlenecks

CPU-Centric (Onload), network only: communication latencies of 30-40 us

Data-Centric (Offload), In-Network Computing: communication latencies of 3-4 us

Intelligent Interconnect Paves the Road to Exascale Performance

Page 9: Revolutionary Performance and Scalability for HPC and AI

In-Network Computing: delivers highest application performance, 10X performance acceleration, critical for HPC and machine learning applications

Self-Healing Technology: unbreakable data centers, 5000X faster network recovery

GPUDirect™ RDMA: GPU acceleration technology, 10X performance acceleration, critical for HPC and machine learning applications

In-Network Computing Delivers Highest Performance

Page 10: Revolutionary Performance and Scalability for HPC and AI

30%-250% Higher Return on Investment

Up to 50% Saving on Capital and Operation Expenses

Highest Application Performance, Scalability and Productivity

InfiniBand Delivers Best Return on Investment

Molecular Dynamics: 1.9X better
Chemistry: 2X better
Automotive: 1.4X better
Genomics: 2.5X better
Weather: 1.3X better

Page 11: Revolutionary Performance and Scalability for HPC and AI

Cognitive Toolkit

RDMA Supercharges Leading AI Frameworks

Mellanox Supercharges Leading AI Companies

60% Higher ROI

50% Lower CapEx & OpEx

An Intelligent Network Unlocks the Power of AI

Page 12: Revolutionary Performance and Scalability for HPC and AI

10X Higher Performance with GPUDirect™ RDMA Technology

GPUDirect™ RDMA

Purpose-built for Acceleration of Deep Learning

Lowest communication latency for acceleration devices

No unnecessary system memory copies and CPU overhead
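To make the "no system memory copies" point concrete, here is a minimal sketch of an application handing GPU-resident buffers straight to the communication library. It assumes mpi4py and CuPy on top of a CUDA-aware MPI build; the ring-exchange pattern and buffer size are illustrative, not taken from the slide. When GPUDirect RDMA is available, such transfers can move between the HCA and GPU memory without staging through host RAM.

```python
# Hedged sketch: exchanging GPU-resident buffers with a CUDA-aware MPI.
# Assumes mpi4py >= 3.1 and CuPy on top of an MPI built with CUDA support;
# the ring exchange and buffer size are illustrative.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

send = cp.full(1 << 20, rank, dtype=cp.float32)  # buffer lives in GPU memory
recv = cp.empty_like(send)

# mpi4py accepts objects exposing __cuda_array_interface__, so the GPU buffer
# is handed to MPI as-is; with GPUDirect RDMA the HCA reads and writes GPU
# memory directly rather than staging data through host RAM.
comm.Sendrecv(sendbuf=send, dest=(rank + 1) % size,
              recvbuf=recv, source=(rank - 1) % size)
print(f"rank {rank}: received buffer from rank {(rank - 1) % size}")
```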

Page 13: Revolutionary Performance and Scalability for HPC and AI

GPUDirect™ ASYNC

GPUDirect RDMA (GPUDirect 3.0): direct data path between the GPU and the Mellanox interconnect; the control path still uses the CPU

GPUDirect ASYNC (GPUDirect 4.0): both the data path and the control path go directly between the GPU and the Mellanox interconnect; the CPU prepares and queues communication tasks on the GPU, the GPU triggers communication on the HCA, and the Mellanox HCA directly accesses GPU memory

Maximum Performance for GPU Clusters

Page 14: Revolutionary Performance and Scalability for HPC and AI

Distributed Training

Training on large data sets can take a long time, in some cases weeks

In many cases training needs to happen frequently: model development and tuning, and real-life use cases may require regular retraining

Accelerate training with a scale-out architecture: add workers (nodes) to reduce training time

Data parallelism is the solution

The network is a critical element in accelerating distributed training!
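As a rough illustration of the scale-out pattern described above, the following sketch uses PyTorch's DistributedDataParallel, which shards the data across workers and averages gradients every step. The model, dataset, and hyperparameters are placeholders, and the launch via torchrun is an assumption rather than anything prescribed by the deck.

```python
# Hedged sketch: data-parallel training with PyTorch DistributedDataParallel.
# Assumes a launch such as: torchrun --nproc_per_node=<gpus> train.py
# The model, dataset, and hyperparameters are placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

dist.init_process_group(backend="nccl")        # NCCL uses RDMA / GPUDirect when available
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Each added worker (node/GPU) trains on its own shard of the data set.
dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
loader = DataLoader(dataset, batch_size=64, sampler=DistributedSampler(dataset))

for x, y in loader:
    opt.zero_grad()
    loss = F.cross_entropy(model(x.cuda()), y.cuda())
    loss.backward()                            # DDP averages gradients across workers here
    opt.step()
```

Adding workers shrinks each worker's shard and per-step compute, so the gradient exchange in backward() becomes the dominant cost at scale, which is why the network matters.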

Page 15: Revolutionary Performance and Scalability for HPC and AI

RDMA Accelerates Distributed Training

Data parallelism communication: workers to parameter servers and parameter servers to workers; frequent; potentially high bandwidth; bursty; point-to-point and collectives

RDMA Improves Point-to-Point and Collective Operations
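A minimal sketch of the worker-to-parameter-server traffic pattern listed above, using mpi4py point-to-point messages. Rank 0 standing in as the parameter server, the gradient size, step count, and learning rate are all illustrative assumptions; these frequent, bursty Send/Recv exchanges are exactly the transfers RDMA's zero-copy path accelerates.

```python
# Hedged sketch: the worker <-> parameter-server exchange with mpi4py
# point-to-point calls. Rank 0 standing in as the parameter server, the
# gradient size, step count, and learning rate are illustrative.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
dim, steps, lr = 1 << 16, 10, 0.01

if rank == 0:                                    # parameter server
    params = np.zeros(dim, dtype=np.float32)
    grads = np.empty((size - 1, dim), dtype=np.float32)
    for step in range(steps):
        for w in range(1, size):                 # workers push gradients
            comm.Recv(grads[w - 1], source=w, tag=step)
        params -= lr * grads.mean(axis=0)        # SGD step on the averaged gradient
        for w in range(1, size):                 # workers pull updated parameters
            comm.Send(params, dest=w, tag=step)
else:                                            # worker
    params = np.zeros(dim, dtype=np.float32)
    for step in range(steps):
        grad = np.random.rand(dim).astype(np.float32)   # stand-in for backprop
        comm.Send(grad, dest=0, tag=step)
        comm.Recv(params, source=0, tag=step)
```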

Page 16: Revolutionary Performance and Scalability for HPC and AI

Model Parallelism

Model size is limited by the compute engine (a GPU, for example)

In some cases the model does not fit in a single compute engine: large models, or small compute devices such as FPGAs

Model parallelism slices the model and runs each part on a different compute engine

Networking becomes a critical element

High bandwidth, low latency and RDMA are mandatory
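To show what "slicing the model" looks like in practice, here is a hedged PyTorch sketch that places two stages of a model on different devices; the layer sizes are placeholders and two visible GPUs are assumed. Once the stages sit on different nodes, the activation hop in forward() (and the matching gradient hop in backward) rides on the network, which is why bandwidth, latency, and RDMA matter.

```python
# Hedged sketch: a model sliced into two stages on different devices.
# Assumes two visible GPUs; layer sizes are placeholders.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 1 lives on the first device, stage 2 on the second.
        self.stage1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage1(x.to("cuda:0"))
        return self.stage2(h.to("cuda:1"))      # activations hop between devices

model = TwoStageModel()
out = model(torch.randn(64, 4096))
print(out.shape)                                # torch.Size([64, 10])
```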


Page 17: Revolutionary Performance and Scalability for HPC and AI

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): a reliable, scalable, general-purpose primitive

In-network tree-based aggregation mechanism: large number of groups, multiple simultaneous outstanding operations

Applicable to multiple use cases: HPC applications using MPI / SHMEM, distributed machine learning applications

Scalable high-performance collective offload: Barrier, Reduce, All-Reduce, Broadcast and more; Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND; integer and floating-point, 16/32/64/128 bits

SHARP tree diagram: SHARP tree root, SHARP tree aggregation nodes, and SHARP tree endnodes (processes running on the HCA)
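From the application's point of view, the collectives listed above are ordinary MPI calls; a minimal mpi4py Allreduce/Barrier example is sketched below. Whether the reduction runs on the hosts or is offloaded into the switches by SHARP is decided by the MPI stack and fabric configuration (for example, HPC-X with SHARP enabled), not by this code, so treat the offload as an assumption about the environment.

```python
# Hedged sketch: an MPI Allreduce (Sum) and Barrier as issued by an application.
# Whether these are offloaded to the switches by SHARP depends on the MPI
# stack and fabric configuration, not on this code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = np.full(1024, comm.Get_rank(), dtype=np.float32)
total = np.empty_like(local)

comm.Allreduce(local, total, op=MPI.SUM)        # reduction across all ranks
comm.Barrier()                                  # Barrier is also offload-capable

if comm.Get_rank() == 0:
    print(total[0])                             # 0 + 1 + ... + (size - 1)
```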

Page 18: Revolutionary Performance and Scalability for HPC and AI

SHARP Allreduce Performance Advantages

SHARP Enables 75% Reduction in Latency

Providing Scalable Flat Latency

Page 19: Revolutionary Performance and Scalability for HPC and AI

Performs the gradient averaging; removes the need for a physical parameter server; removes all parameter server overhead

Mellanox SHARP Technology Accelerates AI

The CPU in a parameter server quickly becomes the bottleneck (at roughly 4 nodes)

Page 20: Revolutionary Performance and Scalability for HPC and AI

Mellanox Accelerates TensorFlow with RDMA

Unmatched Linear Scalability at No Additional Cost

50% Better Performance

Page 21: Revolutionary Performance and Scalability for HPC and AI

Mellanox Accelerates TensorFlow (Advanced Verbs)

10GbE is not enough for large-scale models: 6.5X faster training with 100GbE (the chart shows 2.5X and 6.5X speedups)
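One concrete way TensorFlow exposed an RDMA transport was the optional verbs protocol in TensorFlow 1.x; the sketch below selects it when creating a server. It assumes a TensorFlow build compiled with verbs support, and the cluster hostnames, ports, and job layout are placeholders rather than details from the slide.

```python
# Hedged sketch: choosing TensorFlow's RDMA (verbs) transport (TensorFlow 1.x,
# built with verbs support). Hostnames, ports, and job layout are placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# "grpc+verbs" keeps gRPC for administrative messages while moving tensor
# payloads over RDMA instead of the TCP byte stream.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
server.join()
```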

Page 22: Revolutionary Performance and Scalability for HPC and AI

Mellanox Accelerates IBM Deep Learning

Mellanox InfiniBand Provides 95% Linear Scalability

Page 23: Revolutionary Performance and Scalability for HPC and AI

Mellanox Accelerates NVIDIA NCCL 2.0

50% performance improvement with NVIDIA® DGX-1 across 32 NVIDIA Tesla V100 GPUs, using InfiniBand RDMA and GPUDirect™ RDMA
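As a rough illustration of the traffic behind this result, the sketch below all-reduces a GPU tensor across processes with PyTorch's NCCL backend; NCCL picks up InfiniBand RDMA and GPUDirect RDMA when the drivers and fabric expose them. The tensor size and the torchrun launch are assumptions, not details from the slide.

```python
# Hedged sketch: an explicit NCCL all-reduce of a GPU tensor, e.g. launched
# with torchrun --nproc_per_node=<gpus> allreduce.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")         # NCCL runs the GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.full((1 << 20,), float(dist.get_rank()), device="cuda")
dist.all_reduce(x, op=dist.ReduceOp.SUM)        # same collective that reduces gradients
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: element 0 = {x[0].item()}")
```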

Page 24: Revolutionary Performance and Scalability for HPC and AI

Proven Advantages

RDMA delivers 2X or more performance advantage over traditional TCP

Machine Learning and HPC platforms share the same interconnect needs

Scalable, flexible, high performance, high bandwidth, end-to-end connectivity

Standards-based and supported by the largest eco-system

Supports all compute architectures: x86, Power, ARM, GPU, FPGA etc.

Native Offloading architecture

RDMA, GPUDirect, SHARP and other core accelerations

Backward and future compatible

Scalable Machine Learning Depends on Mellanox

Page 25: Revolutionary Performance and Scalability for HPC and AI

Questions?

