
High Performance Computing with SLES for HPC on AWS [TUT1393]

David Duncan, Partner Solutions Architect, AWS
Kevin Ayres, Solutions Architect, SUSE

HPC Customer Pain Points

Complexity, Maintenance, Time to Solution

“My IT staff doesn’t have time to update and test all the different software components.”

• Better management software is needed, and deployment approach needs to be updated to leverage HPC and cloud infrastructure

• Stack components provided by multiple vendors, making it more challenging to maintain

“I need to maximize application performance, scale workloads, and minimize overhead.”

• Parallel software is lacking with many applications needing a major re-design

• Stack components provided by multiple vendors, making management more challenging

• Segmented into commercial and scientific, and there is not enough collaboration

“Composing a working HPC environment is difficult, time-consuming, requiring experts.”

• Clusters are hard to use and manage as they become more complex in heterogeneous environments

• Storage access time and data management are becoming new bottlenecks

The HPC universe is expanding in new ways


CAGR 2016-2021:

• 5.6% Supercomputer (>$500K)
• 5.0% Divisional ($250K-$500K)
• 6.3% Departmental ($100K-$250K)
• 6.3% Workgroup (<$100K)

• HPC is a growth market, with a growing recognition of strategic value

• HPC ROI is very high
• $551 on average revenue per dollar invested in HPC
• $52 on average profit (or cost savings) per dollar invested in HPC

• Key use cases:
• HPC in the cloud (incl. HPCaaS)
• Cognitive computing (incl. AI/ML/DL)
• HPDA (High Performance Data Analysis)
• IoT

• Key applications:
• Modeling and simulation
• Data analytics

Source: Hyperion Research, June 2017


4/15/2019

Goal: Propel the Arm HPC ecosystem and exascale computing in the UK

• More than 12,000 Arm-based cores running across three universities
• 64 Apollo 70 systems per site
• Two 32-core Cavium ThunderX2 processors per system
• Running SUSE Linux Enterprise for High Performance Computing

Catalyst UK project: HPE, Arm, SUSE, and three leading UK universities establish one of the largest Arm-based supercomputer deployments in the world

SuperMUC Petascale system runs SUSE on Lenovo ThinkSystem

Geophysicists use earthquake simulation software to investigate seismic waves beneath Earth’s surface

Calculations involved in this kind of simulation are so complex that they push even supercomputers to their limits

SUSE continues to work with NVIDIA to enable support for the latest NVIDIA GPU cards – important in HPC modeling and simulation

NVIDIA’s expertise in programmable GPUs has led to breakthroughs in parallel processing which make supercomputing inexpensive and widely accessible

Univa and SUSE together manage containerized HPC and AI workloads on TSUBAME 3.0

Scaling machine learning for SUSE Linux containers, servers, clusters and clouds with Apache Spark and Univa

TSUBAME touted as the “supercomputer for everyone”

Predicting traffic congestion or share prices, simulating human organs, or forecasting the weather

“The excellent management that SUSE Linux Enterprise Server provides is one of the key factors behind Tsubame 2.5’s success.” – Professor Satoshi Matsuoka, Global Scientific Information and Computing Center, Tokyo Institute of Technology

Bright Cluster Manager supports SUSE, enabling customers to deploy, manage and monitor SLES clusters using the familiar Bright interface

Bright Cluster Manager lets users monitor and build clusters of any size that are easy to provision, operate, monitor, manage and scale

Motivations – why HPC in the Cloud? (and what is HPC anyway?)

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Why AWS for HPC?

Low cost with flexible pricing

Efficient clusters

Unlimited infrastructure

Faster time to results

Concurrent clusters on-demand

Increased collaboration


Cost advantages

On-Premises Capital Expense Model

• High upfront capital cost
• High cost of ongoing support

Amazon Web Services Pay-As-You-Go Model

• Use only what you need
• Multiple pricing models


One-way or two-way doors…

Traditional infrastructure barriers present significant hurdles in accelerating the speed at which HPC customers innovate.

HPC on AWS changes the equation by allowing users to experiment and operate without compromises…

Popular HPC workloads on AWS

• Genomics Processing
• Modeling and Simulation
• Government and Educational Research
• Monte Carlo Simulations
• Transcoding and Encoding
• Computational Chemistry

… and many more


AWS global infrastructure

20 global regions, 1 local region, 61 availability zones

Regions:

• AWS GovCloud: AWS GovCloud West, AWS GovCloud East
• US West: Oregon, Northern California
• US East: N. Virginia, Ohio
• Canada: Central
• South America: São Paulo
• EU: Ireland, Frankfurt, London, Paris
• Asia Pacific: Singapore, Sydney, Tokyo, Seoul, Mumbai
• China: Beijing, Ningxia

Announced Regions: Stockholm, Hong Kong, Bahrain, Italy

Ensuring High Availability

The AWS Cloud infrastructure:

• A Region is a physical location in the world where we have multiple Availability Zones
• Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities. Applications and data are replicated in real time and kept consistent across the different AZs
• Low latency enables real-time data replication
• Distance ensures high availability

[Map: AWS Regions (20), AZs (61): N. Virginia, Ohio, N. California, Oregon, Mumbai, Seoul, Singapore, Sydney, Canada, Beijing, Frankfurt, Ireland, London, São Paulo, GovCloud (US-West), Ningxia, Paris, Osaka (Local), Tokyo, GovCloud (US-East), Stockholm]

Important enablers for HPC on the cloud

Compute performance – CPUs, GPUs, FPGAs

Memory performance – high RAM requirements in many applications

Network performance – throughput, latency, and consistency

Storage performance – including shared filesystems

Automation and cluster/job management

Remote graphics for interactive applications

ISV support – including license management

… and SCALE

Defining HPC – example use cases

Axes:

• Data Light: minimal requirements for high performance storage
• Data Heavy: benefits from access to high performance storage
• Clustered (tightly coupled)
• Distributed / Grid (loosely coupled)

Example workloads:

• Fluid dynamics, weather forecasting, materials simulations, crash simulations
• Risk simulations, molecular modeling, contextual search, logistics simulations
• Animation and VFX, semiconductor verification, image processing/GIS, genomics
• Seismic processing, metagenomics, astrophysics, deep learning



Cluster HPC

Tightly coupled, latency-sensitive applications

Use larger EC2 compute instances, placement groups, enhanced networking, HPC job schedulers

Grid HPC

Loosely coupled, pleasingly parallel

Use a variety of EC2 instances, multiple AZs, Spot, Auto Scaling, Amazon SQS, AWS Batch

Grids of Clusters

Running parallel cluster jobs, parameter studies

Use a grid strategy on the cloud to run a group of parallel, individually-clustered HPC jobs
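The loosely coupled pattern above can be illustrated with a small parameter sweep. This is only a local sketch: the `simulate` function and its toy oscillator model are hypothetical stand-ins for a real HPC job, and a thread pool stands in for a fleet of EC2 instances. The point it shows is that each task depends only on its own inputs, so tasks can run anywhere and in any order.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def simulate(params):
    """Hypothetical stand-in for one independent HPC job.

    Toy model: damped natural frequency of a unit-mass oscillator.
    """
    stiffness, damping = params
    return math.sqrt(max(stiffness - (damping / 2.0) ** 2, 0.0))


def run_sweep(param_grid, max_workers=4):
    """Run every parameter combination independently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate, param_grid))


# Six independent tasks: no task needs data from any other task.
grid = [(k, c) for k in (1.0, 4.0, 9.0) for c in (0.0, 2.0)]
results = run_sweep(grid)
```

On a real grid, each call to `simulate` would typically become a queued job or a container run rather than a thread, but the independence between tasks is the same.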

Grid Computing Examples (Scale Out)


Global-scale grids for research


Large Hadron Collider (LHC)



Best practices for Spot: diversify across many instance types, multiple AZs, and multiple Regions, and use stateless architectures
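One way to picture Spot diversification is as a list of (instance type, Availability Zone) pairs, each of which is a separate capacity pool. The sketch below only builds a request dictionary and does not call AWS. The field names are modeled on the EC2 Fleet request shape as I understand it and should be checked against the API reference; the template name, instance types, and zones are placeholders.

```python
from itertools import product


def diversified_overrides(instance_types, availability_zones):
    """One override per (instance type, AZ) pair; each pair is its own Spot pool."""
    return [
        {"InstanceType": itype, "AvailabilityZone": az}
        for itype, az in product(instance_types, availability_zones)
    ]


def build_fleet_request(template_name, instance_types, availability_zones, total_capacity):
    """Assemble a fleet-style request dict (assumed field names, not sent anywhere)."""
    return {
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": template_name,  # hypothetical template
                    "Version": "$Latest",
                },
                "Overrides": diversified_overrides(instance_types, availability_zones),
            }
        ],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": total_capacity,
            "DefaultTargetCapacityType": "spot",
        },
        # Spread capacity across pools instead of chasing a single cheapest pool.
        "SpotOptions": {"AllocationStrategy": "diversified"},
        "Type": "request",
    }


request = build_fleet_request(
    "hpc-grid-worker",
    ["c5.4xlarge", "c4.4xlarge", "m5.4xlarge"],
    ["us-east-1a", "us-east-1b"],
    100,
)
pools = request["LaunchTemplateConfigs"][0]["Overrides"]
```

Three instance types across two zones give six pools, so an interruption in any single pool affects only part of the fleet.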

1.1M vCPUs for machine learning

A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing, breaking the record for creating the largest high-performance cluster in the cloud by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS region.


The graph highlights the elastic, automatic expansion of resources. Clemson took advantage of the new per-second billing for EC2 instances.

The vCPU count is comparable to the core count on the largest supercomputers in the world.

HPC in design and manufacturing


• Applications for engineering: molecular dynamics, CAD, CAE, EDA
• Collaboration tools for engineering
• Big data for manufacturing yield analysis

Running drive-head simulations at scale: millions of parallel parameter sweeps, running months of simulations in just hours

Over 85,000 Intel cores running at peak, using Spot Instances

Cluster Computing Examples (Scale Up)

Tightly-coupled HPC – weather

Structural simulation

Fluid dynamics – Ansys Fluent

C4.8xlarge instance type, 140M-cell model, F1 car CFD benchmark

HPC in aerospace

“Rescale’s ScaleX cloud platform is a game-changer for engineering. It gives Boom computing resources comparable to building a large on-premise HPC center. Rescale lets us move fast with minimal capital spending and resources overhead.”

Josh Krall, CTO & Co-Founder

Boom leverages Rescale and AWS to enable supersonic travel

Simulated vortex lift with 200M cell models on 512+ cores

Increased simulation throughput: 100 jobs in parallel with 6x speedup per job → 600x speedup

Eliminated IT overhead, including server capital costs & in-house IT and software teams

Elastic HPC capacity and pay-as-you-go AWS clusters allow business agility & ability to scale
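The 600x figure quoted above is the product of two independent factors: how many jobs run at once and how much faster each job runs. A minimal sketch of that arithmetic follows, assuming the baseline runs jobs one at a time; the 12-hour baseline job length is a made-up number for illustration only.

```python
import math


def aggregate_speedup(parallel_jobs, per_job_speedup):
    """Throughput gain versus running jobs one at a time at baseline speed."""
    return parallel_jobs * per_job_speedup


def campaign_hours(total_jobs, baseline_hours_per_job, parallel_jobs, per_job_speedup):
    """Wall-clock hours for a campaign that runs in waves of parallel jobs."""
    waves = math.ceil(total_jobs / parallel_jobs)
    return waves * baseline_hours_per_job / per_job_speedup


speedup = aggregate_speedup(100, 6)          # 100 jobs in parallel, 6x per job
serial = 100 * 12.0                          # hypothetical: 12 h per job, one at a time
elastic = campaign_hours(100, 12.0, 100, 6)  # one wave, each job 6x faster
```

With these assumed inputs the serial campaign takes 1,200 hours and the elastic one takes 2 hours, which is the same 600x ratio.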

Accelerated Computing


Accelerated computing on AWS


Parallelism increases throughput

CPU: High speed, highly flexible

GPU/FPGA: High throughput, high efficiency

GPUs and FPGAs can provide massive parallelism and higher efficiency than CPUs for many categories of applications

GPU-accelerated computing on AWS


NVIDIA GPU

P2/P3: GPU-accelerated computing

Enabling a high degree of parallelism – each GPU has thousands of cores

Consistent, well documented set of APIs (CUDA, OpenACC, OpenCL)

Supported by a wide variety of ISVs and open source frameworks

XILINX ULTRASCALE+ FPGA

F1: FPGA-accelerated computing

Massively parallel – each FPGA includes millions of parallel system logic cells

Flexible – no fixed instruction set, can implement wide or narrow datapaths

Programmable using available, cloud-based FPGA development tools

FPGA-accelerated computing on AWS


A Glance at Storage

HPC storage dataflow


[Diagram: HPC storage dataflow]

• Corporate datacenter: object, block, and file storage
• Data transfer (ingress and egress): Internet / VPN, ISV connectors, AWS Direct Connect, AWS Snowball, Amazon CloudFront, Storage Gateway, S3 Transfer Acceleration, Amazon Kinesis Firehose
• EC2 instance: EFS, EBS + EC2 instance store, FSx for Lustre, or other mounted shared file system
• Amazon S3 / S3-IA and Amazon Glacier, with lifecycle policies


Amazon FSx for Lustre

Fully managed Lustre file system for compute-intensive workloads

Native file system interface

Simple and fully managed

Seamless access to your data repositories

Massively scalable performance

Cost-optimized for compute-intensive workloads

Secure and compliant


Instance Survey

HPC on AWS solution components

Automation and orchestration

• AWS Batch
• NICE EnginFrame
• AWS ParallelCluster

Visualization

• NICE DCV
• Amazon AppStream 2.0

Compute

• Amazon EC2 Spot
• AWS Auto Scaling
• Amazon EC2 instances (compute and accelerated)

Networking

• Enhanced networking
• Placement groups
• Elastic Fabric Adapter

Storage

• Amazon S3
• Amazon EBS
• Amazon EFS
• Amazon FSx for Lustre

Broadest choice of processors and architectures

Intel® Xeon® Scalable (Skylake) processor

AMD EPYC processor


Right compute for the right application and workload

AWS Graviton Processor


AWS compute instance types

• General Purpose: M5, M4, T2, A1
• Compute Optimized: C5, C4
• Storage and I/O Optimized: I3, D2, H1
• Memory Optimized: X1, R4
• GPU Graphics: G3, G2, EG
• GPU and FPGA Compute: P3, P2, F1


Elastic Network Adapter (ENA)

Latest generation of Enhanced Networking

Hardware Checksums

Multi-Queue Support

Receive Side Steering

Up to 100Gbps in a Placement Group

Open Source Amazon Network Driver (Upstream first)

Compatible with MPI libraries including OpenMPI 3.0

Accelerating time to results with HPC on AWS

Intel® Xeon® Scalable processor-based Amazon EC2 instances provide unique advantages to accelerate HPC workloads

• Intel® Xeon® Scalable processors:
  o Up to 28 cores delivering enhanced per-core performance, and significant increases in memory bandwidth (6 memory channels) and I/O bandwidth and throughput (48 PCIe lanes)
  o Intel® Advanced Vector Extensions 512 (Intel® AVX-512): Intel® AVX-512 can handle your most demanding computational tasks and accelerate performance for workloads
  o Intel® AVX-512 delivers up to 2X more FLOPs per clock cycle for HPC, analytics, cryptography, and data compression workloads
• Intel software development tools:
  o Allow you to take advantage of the Intel Xeon Scalable processor and quickly deploy a fully elastic HPC cluster on AWS Cloud. Once created, the cluster provisions standard HPC tools such as schedulers, a Message Passing Interface (MPI) environment, and shared storage.


Custom AWS silicon with 64-bit Arm Neoverse cores

AWS Graviton Processor

Rapidly innovate, build, and iterate on behalf of our customers

Targeted workloads optimizations



Innovations in HPC infrastructure

High clock speed compute instances: Z1d

• Z1d instances are optimized for memory-intensive, compute-intensive applications
• Custom Intel Xeon Scalable processor
• Up to 4 GHz sustained, all-turbo performance
• Up to 384 GiB DDR4 memory
• Enhanced networking, up to 25 Gbps throughput

HPC stack on AWS

• 3D graphics virtual workstation
• License managers and cluster head nodes with job schedulers
• Cloud-based, auto-scaling HPC clusters
• Shared file storage
• Storage cache

Featuring Intel Xeon Scalable (Skylake) processor



Amazon EC2 P3 Instances (October 2017)

One of the fastest, most powerful GPU instances in the cloud

• Up to eight NVIDIA Tesla V100 GPUs
• 1 PetaFLOPs of computational performance – up to 14x better than P2
• 300 GB/s GPU-to-GPU communication (NVLink) – 9X better than P2
• 16GB GPU memory with 900 GB/sec peak GPU memory bandwidth

F1 FPGA instance types on AWS

Up to 8 Xilinx UltraScale+ 16nm VU9P FPGA devices in a single instance

The f1.16xlarge size provides:

8 FPGAs, each with over 2 million customer-accessible FPGA programmable logic cells and over 5000 programmable DSP blocks

Each of the 8 FPGAs has 4 DDR-4 interfaces, with each interface accessing a 16GiB, 72-bit wide, ECC-protected memory

A1 software ecosystem

Tools

• Coming soon: Amazon Corretto

OSVs and ISVs

• SUSE Linux Enterprise Server 15
• Ubuntu 16.04 and newer
• Red Hat Enterprise Linux 7.6 and newer
• Amazon Linux 2
• More coming soon

Containers

• Most Docker official images support arm64
• ECS: available today
• EKS: coming soon


Better together


SUSE is a supported enterprise Linux HPC distribution on Amazon EC2

Collaboration to accelerate HPC usage and cloud services across Amazon EC2

SUSE High Performance Computing

Amazon Machine Images (AMIs)

Amazon maintained

• Broad set of Linux and Windows images
• Kept up-to-date by Amazon in each region

Marketplace maintained

• Managed and maintained by AWS Marketplace partners

Your machine images

• AMIs you have created from Amazon EC2 instances
• Can keep private, share with other accounts, or publish to the community

SUSE maintained

• Managed and maintained by the SUSE Public Cloud Team

The structure of instance pricing

• Reservation Discount (Reserved Instance)
• Instance
• On-demand subscription
• Infrastructure Cost


AWS PrivateLink

Share services privately between VPCs and on-premises networks

Secure. Scalable. Reliable.

[Diagram labels: Amazon Virtual Private Cloud (VPC), Customer Interface VPC Endpoint, Salesforce Data Centers, AWS Direct Connect, Salesforce VPC, Endpoint Service]

ISV Managed

• Software vendor offers customers a managed service; can be offered via AWS Marketplace.
• Billing, security, and performance are fully managed by the software vendor (SaaS).
• Good option for product evaluations and training, or for production use

Supported

• Software vendor enables customers to run software in the customer’s Virtual Private Cloud.
• Customer manages their own AWS deployments. Software vendor provides support.
• Great option to balance customer need for ISV support with IT controls

Self-Managed

• Customer deploys into their own Virtual Private Cloud, using their own IT policies.
• Customer designs, monitors, and manages all needed cloud infrastructure. Software vendor may or may not be aware of customer’s use of AWS.
• This is a common method of deployment today

Hybrid

• Customer deploys a portion of their total software workloads on AWS, while maintaining significant on-premises, legacy IT infrastructure.
• This may be desired for data compliance reasons, software license policies, for IT refresh cycles, or to accommodate custom hardware

Optimizing HPC software on AWS



Building your Spot Fleet

Automation and Control

Automatic re-sizing of compute clusters based upon demand and policies

Auto Scaling
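Demand-based resizing can be sketched as a small sizing rule: work out how many nodes the pending jobs need, then clamp that number to the limits set by policy. The function below is a hypothetical illustration of that idea, not the logic of AWS Auto Scaling itself; the slot counts and limits are assumed values.

```python
import math


def desired_nodes(pending_jobs, slots_per_job, slots_per_node, min_nodes=0, max_nodes=10):
    """Nodes needed for the pending work, clamped to the policy limits."""
    needed = math.ceil(pending_jobs * slots_per_job / slots_per_node)
    return max(min_nodes, min(max_nodes, needed))


idle = desired_nodes(0, slots_per_job=4, slots_per_node=36)     # empty queue
busy = desired_nodes(20, slots_per_job=4, slots_per_node=36)    # 80 slots of work
capped = desired_nodes(500, slots_per_job=4, slots_per_node=36) # demand above the cap
```

An empty queue sizes the cluster to the minimum, moderate demand scales it up proportionally, and very high demand is held at the policy maximum.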

Launch templates are now available in AWS CloudFormation with Auto Scaling and EC2 Fleet

Use launch templates to achieve …

• Consistent experience
• Simplified permissions
• Governance & best practices
• Increased productivity

Cost optimization

• Conservative: Reserved Instances + On-Demand Instances
• Optimized: Reserved Instances + Spot
• Optimized with scale-out (magnify the peak): Reserved Instances + On-Demand Instances + Spot

Reserved

Make a low, one-time payment and receive a significant discount on the hourly charge

For committed utilization

On-Demand

Pay for compute capacity by the hour, with per-second billing and no long-term commitments

For spiky workloads, or to define needs

Spot

Bid for unused capacity, charged at a Spot price that fluctuates based on supply and demand

For high-scale, time-flexible workloads
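The trade-off between the three pricing models can be made concrete with a small cost model: a steady baseline runs on Reserved capacity, and the burst above it runs on either On-Demand or Spot. This is a rough sketch only. All prices, node counts, and the fraction of time spent at peak are hypothetical numbers, and real Spot prices fluctuate.

```python
def blended_hourly_cost(baseline_nodes, peak_nodes, peak_fraction, prices, burst_with="spot"):
    """Average hourly cost: baseline on reserved, burst on on-demand or spot.

    peak_fraction is the share of hours during which the burst nodes are running.
    """
    burst_nodes = max(peak_nodes - baseline_nodes, 0)
    baseline_cost = baseline_nodes * prices["reserved"]
    burst_cost = burst_nodes * peak_fraction * prices[burst_with]
    return baseline_cost + burst_cost


# Hypothetical per-node hourly prices, for illustration only.
prices = {"on_demand": 1.00, "reserved": 0.60, "spot": 0.30}

conservative = blended_hourly_cost(10, 50, 0.25, prices, burst_with="on_demand")
optimized = blended_hourly_cost(10, 50, 0.25, prices, burst_with="spot")
```

With these assumed inputs, moving the burst from On-Demand to Spot lowers the average hourly cost from about 16 to about 9, provided the burst workload can tolerate interruption.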


Coming soon: On-Demand support

• SUSE Linux Enterprise HPC

• Pricing through the AWS Marketplace

• A single bill for Infrastructure and Subscriptions

• Availability for X86 and Arm HPC clusters

• Extended Service Pack Overlap Support (ESPOS)

• Long Term Service Pack Support (LTSS)

• Support for burstable cluster sizes

• Support for SLE HPC 15 – separate from SLES 15

- Required for access to HPC Module packages on SLE 15

SUSE High Performance Computing


Coming soon: AWS ParallelCluster support


Configuration in Minutes

There’s not a great deal involved in getting a cluster up and running. This config file will do it.

• To change a cluster, update the config and then update the cluster.
• Queue size options: 0 instances is really OK. The cluster will grow on demand when you put jobs in the queue.
• Scheduler: choose from “sge”, “torque”, “openlava”, or “slurm”.
• There are also options for leveraging the Spot market.
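The config file shown on the slide did not survive in this text. As a rough illustration of the options mentioned, the sketch below generates an INI-style config with Python's `configparser`. The section and key names are assumptions modeled on ParallelCluster 2.x-style configs and should be checked against the ParallelCluster documentation; the region, key pair, VPC, and subnet values are placeholders. No base OS is set, because SUSE support is described here as coming soon.

```python
import configparser
import io

ALLOWED_SCHEDULERS = {"sge", "torque", "openlava", "slurm"}


def build_cluster_config(scheduler="slurm", max_queue_size=10, use_spot=True):
    """Build an INI-style cluster config (assumed key names, placeholder values)."""
    if scheduler not in ALLOWED_SCHEDULERS:
        raise ValueError(f"unsupported scheduler: {scheduler}")
    cfg = configparser.ConfigParser()
    cfg["global"] = {"cluster_template": "default"}
    cfg["aws"] = {"aws_region_name": "us-east-1"}  # placeholder region
    cfg["cluster default"] = {
        "key_name": "my-key-pair",  # placeholder key pair name
        "scheduler": scheduler,
        "initial_queue_size": "0",  # start empty; grow when jobs are queued
        "max_queue_size": str(max_queue_size),
        "cluster_type": "spot" if use_spot else "ondemand",
        "vpc_settings": "public",
    }
    cfg["vpc public"] = {
        "vpc_id": "vpc-PLACEHOLDER",
        "master_subnet_id": "subnet-PLACEHOLDER",
    }
    return cfg


def render(cfg):
    """Serialize the config to the text that would be written to disk."""
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


config = build_cluster_config()
config_text = render(config)
```

Generating the file from code keeps the scheduler choice validated and makes it easy to produce variants, for example an On-Demand cluster with `build_cluster_config(use_spot=False)`.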

Feeding workloads

Using the highly available Simple Queue Service to feed EC2 nodes

Processing task/processing trigger

Amazon SQS

Processing results
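The queue-fed pattern in this diagram can be sketched locally. Here Python's in-process `queue.Queue` stands in for Amazon SQS and threads stand in for EC2 worker nodes, so this is an analogy rather than SQS client code; the squaring step is a hypothetical placeholder for real processing.

```python
import queue
import threading


def worker(tasks, results):
    """Pull tasks until a sentinel arrives, pushing each result to the results queue."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work for this worker
            tasks.task_done()
            break
        results.put((task, task * task))  # hypothetical processing step
        tasks.task_done()


def process_all(items, num_workers=3):
    """Feed items through a task queue to a pool of workers and collect the results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return dict(results.get() for _ in range(results.qsize()))


squares = process_all(range(5))
```

Because workers only interact through the queues, the number of workers can change without touching the producer, which is what lets a queue-fed fleet scale with demand.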

Supporting Services


Coming soon: EFA support


Elastic Fabric Adapter (EFA)

For Amazon EC2 instances

Supports high levels of inter-instance communication, enabling workloads such as computational fluid dynamics and weather modeling.

EFA supports industry-standard libfabric APIs. EFA will be available as an optional EC2 networking feature that you can enable on C5n.9xl, C5n.18xl, and P3dn.24xl instances.


Q and A?


Unpublished Work of SUSE LLC. All Rights Reserved.

This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer

This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.
