High Performance Computing with SLES for HPC on AWS [tut1393]
David Duncan, Partner Solutions Architect, AWS
Kevin Ayres, Solutions Architect, SUSE
HPC Customer Pain Points
Complexity, Maintenance, Time to Solution
“My IT staff doesn’t have time to update and test all the different software components.”
• Better management software is needed, and deployment approach needs to be updated to leverage HPC and cloud infrastructure
• Stack components provided by multiple vendors, making it more challenging to maintain
“I need to maximize application performance, scale workloads, and minimize overhead.”
• Parallel software is lacking with many applications needing a major re-design
• Stack components provided by multiple vendors, making managing more challenging
• Segmented into commercial and scientific, and there is not enough collaboration
“Composing a working HPC environment is difficult, time-consuming, requiring experts.”
• Clusters are hard to use and manage as they become more complex in heterogeneous environments
• Storage access time and data management are becoming new bottlenecks
The HPC universe is expanding in new ways
CAGR 2016-2021:
• 5.6% Supercomputer (>$500K)
• 5.0% Divisional ($250K-$500K)
• 6.3% Departmental ($100K-$250K)
• 6.3% Workgroup (<$100K)
• HPC is a growth market, with a growing recognition of strategic value
• HPC ROI is very high
  • $551 on average revenue per dollar invested in HPC
  • $52 on average profit (or cost savings) per dollar invested in HPC
• Key use cases:
  • HPC in the cloud (incl. HPCaaS)
  • Cognitive computing (incl. AI/ML/DL)
  • HPDA (High Performance Data Analysis)
  • IoT
• Key applications:
  • Modeling and simulation
  • Data analytics
Source: Hyperion Research, June 2017
4/15/2019
Goal: Propel the Arm HPC ecosystem and exascale computing in the UK
• More than 12,000 Arm-based cores running across three universities
• 64 Apollo 70 systems per site
• Two 32-core Cavium ThunderX2 processors per system
• Running SUSE Linux Enterprise for High Performance Computing
Catalyst UK project: HPE, Arm, SUSE, and three leading UK universities establish one of the largest Arm-based supercomputer deployments in the world
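As a quick check on the Catalyst UK figures, the per-site configuration multiplies out to slightly more than 12,000 cores. The sketch below uses only the numbers stated on the slide; the variable names are illustrative.

```python
# Core count implied by the Catalyst UK figures on this slide.
SITES = 3                 # three universities
SYSTEMS_PER_SITE = 64     # Apollo 70 systems per site
SOCKETS_PER_SYSTEM = 2    # two ThunderX2 processors per system
CORES_PER_SOCKET = 32     # 32-core Cavium ThunderX2

cores_per_site = SYSTEMS_PER_SITE * SOCKETS_PER_SYSTEM * CORES_PER_SOCKET
total_cores = SITES * cores_per_site

print(f"cores per site: {cores_per_site}")  # 4096
print(f"total cores:    {total_cores}")     # 12288
```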
SuperMUC Petascale system runs SUSE on Lenovo ThinkSystem
Geophysicists use earthquake simulation software to investigate seismic waves beneath Earth’s surface
Calculations involved in this kind of simulation are so complex that they push even supercomputers to their limits
SUSE continues to work with NVIDIA to enable support for the latest NVIDIA GPU cards – important in HPC modeling and simulation
NVIDIA’s expertise in programmable GPUs has led to breakthroughs in parallel processing which make supercomputing inexpensive and widely accessible
Univa and SUSE together manage containerized HPC and AI workloads on TSUBAME 3.0
Scaling machine learning for SUSE Linux containers, servers, clusters and clouds with Apache Spark and Univa
TSUBAME touted as the “supercomputer for everyone”
Predicting traffic congestion or share prices, simulating human organs, or forecasting the weather
“The excellent management that SUSE Linux Enterprise Server provides is one of the key factors behind Tsubame 2.5’s success.” – Professor Satoshi Matsuoka, Global Scientific Information and Computing Center, Tokyo Institute of Technology
Bright Cluster Manager supports SUSE, enabling customers to deploy, manage and monitor SLES clusters using the familiar Bright interface
Bright Cluster Manager lets users monitor and build clusters of any size that are easy to provision, operate, monitor, manage and scale
Motivations – why HPC in the Cloud? (and what is HPC anyway?)
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Why AWS for HPC?
Low cost with flexible pricing
Efficient clusters
Unlimited infrastructure
Faster time to results
Concurrent clusters on-demand
Increased collaboration
Cost advantages
On Premises: Capital Expense Model
• High upfront capital cost
• High cost of ongoing support
Amazon Web Services: Pay As You Go Model
• Use only what you need
• Multiple pricing models
One-way or two-way doors…
Traditional infrastructure barriers present significant hurdles to accelerating the speed at which HPC customers innovate. HPC on AWS changes the equation by allowing users to experiment and operate without compromises…
Genomics Processing
Modeling and Simulation
Government and Educational Research
Monte Carlo Simulations
Transcoding and Encoding
Computational Chemistry
Popular HPC workloads on AWS
… and many more
AWS global infrastructure
20 global regions, 1 local region, 61 availability zones
Map legend: Region & Number of Availability Zones
• AWS GovCloud: GovCloud (US-West), GovCloud (US-East)
• US West: Oregon, Northern California
• US East: N. Virginia, Ohio
• Canada: Central
• South America: São Paulo
• EU: Ireland, Frankfurt, London, Paris
• Asia Pacific: Singapore, Sydney, Tokyo, Seoul, Mumbai
• China: Beijing, Ningxia
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Announced Regions: Stockholm, Hong Kong, Bahrain, Italy
Ensuring High Availability
The AWS Cloud infrastructure:
• A Region is a physical location in the world where we have multiple Availability Zones
• Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities. Applications and data are replicated in real time and kept consistent across the different AZs
Low latency ensures real-time data replication
Distance ensures high availability
[Map: AWS Regions (20), AZs (61). Labeled regions: N. Virginia, Ohio, N. California, Oregon, Mumbai, Seoul, Singapore, Sydney, Canada, Beijing, Frankfurt, Ireland, London, São Paulo, GovCloud (US-West), Ningxia, Paris, Osaka (Local), Tokyo, GovCloud (US-East), Stockholm]
Important enablers for HPC on the cloud
Compute performance – CPUs, GPUs, FPGAs
Memory performance – high RAM requirements in many applications
Network performance – throughput, latency, and consistency
Storage performance – including shared filesystems
Automation and cluster/job management
Remote graphics for interactive applications
ISV support – including license management
…and SCALE
Defining HPC – example use cases
Data Light: Minimal requirements for high performance storage
Data Heavy: Benefits from access to high performance storage
Clustered (Tightly coupled)
Distributed / Grid (Loosely coupled)
Fluid dynamics, Weather forecasting, Materials simulations, Crash simulations
Risk simulations, Molecular modeling, Contextual search, Logistics simulations
Animation and VFX, Semiconductor verification, Image processing/GIS, Genomics
Seismic processing, Metagenomics, Astrophysics, Deep learning
Cluster HPC
Tightly coupled, latency-sensitive applications
Use larger EC2 compute instances, placement groups, enhanced networking, HPC job schedulers
Grid HPC
Loosely coupled, pleasingly parallel
Use a variety of EC2 instances, multiple AZs, Spot, Auto Scaling, Amazon SQS, AWS Batch
Grids of Clusters
Running parallel cluster jobs, parameter studies
Use a grid strategy on the cloud to run a group of parallel, individually-clustered HPC jobs
Grid Computing Examples (Scale Out)
Global-scale grids for research
Large Hadron Collider (LHC)
Best-practices using Spot: diversify computing with many instance types, multiple AZs, multiple regions, and with stateless architectures
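One way to picture the diversification advice is as a cross product of instance types and subnets, where each subnet sits in a different Availability Zone. The sketch below builds a request dictionary shaped like the arguments to the EC2 `create_fleet` API; it makes no AWS call. The launch template ID, subnet IDs, instance types, and target capacity are placeholders chosen for illustration, not values from the talk.

```python
from itertools import product


def diversified_overrides(instance_types, subnet_ids):
    """Return one launch override per (instance type, subnet) pair."""
    return [
        {"InstanceType": itype, "SubnetId": subnet}
        for itype, subnet in product(instance_types, subnet_ids)
    ]


def fleet_request(launch_template_id, instance_types, subnet_ids, target_capacity):
    """Build a dict shaped like EC2 create_fleet arguments (nothing is sent to AWS)."""
    return {
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",
                },
                "Overrides": diversified_overrides(instance_types, subnet_ids),
            }
        ],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": target_capacity,
            "DefaultTargetCapacityType": "spot",
        },
        "Type": "maintain",
    }


# Placeholder identifiers: substitute your own launch template and subnets.
request = fleet_request(
    launch_template_id="lt-EXAMPLE",
    instance_types=["c5.9xlarge", "c4.8xlarge", "m5.12xlarge"],
    subnet_ids=["subnet-EXAMPLE-a", "subnet-EXAMPLE-b"],
    target_capacity=100,
)
pools = request["LaunchTemplateConfigs"][0]["Overrides"]
print(len(pools))  # 3 instance types x 2 subnets = 6 capacity pools
```

More pools give the fleet more places to find capacity when one pool is interrupted, which is why the slide pairs diversification with stateless architectures.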
1.1M vCPUs for machine learning
A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing, breaking the record for creating the largest high-performance cluster in the cloud by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS region.
The graph highlights the elastic, automatic expansion of resources. Clemson took advantage of the new per-second billing for EC2 instances.
The vCPU count usage is comparable to the core count on the largest supercomputers in the world.
HPC in design and manufacturing
• Applications for engineering: Molecular dynamics, CAD, CAE, EDA
• Collaboration tools for engineering
• Big data for manufacturing yield analysis
Running drive-head simulations at scale: millions of parallel parameter sweeps, running months of simulations in just hours
Over 85,000 Intel cores running at peak, using Spot Instances
Cluster Computing Examples (Scale Up)
Tightly-coupled HPC – weather
Structural simulation
Fluid dynamics – Ansys Fluent
C4.8xlarge instance type
140M cell model
F1 car CFD benchmark
HPC in aerospace
“Rescale’s ScaleX cloud platform is a game-changer for engineering. It gives Boom computing resources comparable to building a large on-premise HPC center. Rescale lets us move fast with minimal capital spending and resources overhead.”
Josh Krall, CTO & Co-Founder
Boom leverages Rescale and AWS to enable supersonic travel
Simulated vortex lift with 200M cell models on 512+ cores
Increased simulation throughput: 100 jobs in parallel with 6x speedup per job → 600x speedup
Eliminated IT overhead, including server capital costs & in-house IT and software teams
Elastic HPC capacity and pay-as-you-go AWS clusters allow business agility & ability to scale
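The 600x figure above is the product of two independent factors: how many jobs run at once and how much faster each job finishes. A minimal sketch of that arithmetic follows, assuming the jobs are fully independent and capacity is available for all of them at the same time. The 12-hour baseline runtime is a hypothetical number used only to make the example concrete.

```python
import math


def throughput_speedup(parallel_jobs, per_job_speedup):
    """Speedup of a whole campaign versus running jobs serially at baseline speed."""
    return parallel_jobs * per_job_speedup


def campaign_hours(total_jobs, baseline_hours_per_job, parallel_jobs, per_job_speedup):
    """Wall-clock hours when jobs run in waves of `parallel_jobs` at a time."""
    waves = math.ceil(total_jobs / parallel_jobs)
    return waves * baseline_hours_per_job / per_job_speedup


speedup = throughput_speedup(parallel_jobs=100, per_job_speedup=6)

# Hypothetical baseline: 100 jobs that each take 12 hours when run one by one.
serial_hours = 100 * 12
cloud_hours = campaign_hours(100, 12, parallel_jobs=100, per_job_speedup=6)

print(speedup)                     # 600
print(serial_hours / cloud_hours)  # 600.0
```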
Accelerated Computing
Accelerated computing on AWS
Parallelism increases throughput
CPU: High speed, highly flexible
GPU/FPGA: High throughput, high efficiency
GPUs and FPGAs can provide massive parallelism and higher efficiency than CPUs for many categories of applications
GPU-accelerated computing on AWS
NVIDIA GPU
P2/P3: GPU-accelerated computing
Enabling a high degree of parallelism: each GPU has thousands of cores
Consistent, well documented set of APIs (CUDA, OpenACC, OpenCL)
Supported by a wide variety of ISVs and open source frameworks
XILINX ULTRASCALE+ FPGA
F1: FPGA-accelerated computing
Massively parallel – each FPGA includes millions of parallel system logic cells
Flexible – no fixed instruction set, can implement wide or narrow datapaths
Programmable using available, cloud-based FPGA development tools
FPGA-accelerated computing on AWS
A Glance at Storage
HPC storage dataflow
[Diagram: HPC storage dataflow between the corporate datacenter and AWS, with ingress and egress paths.
Data Transfer: Internet / VPN, ISV connectors, AWS Direct Connect, AWS Snowball, Amazon CloudFront, Storage Gateway, S3 Transfer Acceleration, Amazon Kinesis Firehose.
Object, Block, File Storage: Amazon S3 / S3-IA and Amazon Glacier (lifecycle policies), EBS + EC2 instance store, EFS or other mounted shared file system, FSx for Lustre, EC2 instance.]
Amazon FSx for Lustre
Fully managed Lustre file system for compute-intensive workloads
Native file system interface
Simple and fully managed
Seamless access to your data repositories
Massively scalable performance
Cost-optimized for compute-intensive workloads
Secure and compliant
Instance Survey
HPC on AWS solution components
Automation and orchestration
• AWS Batch
• NICE EnginFrame
• AWS ParallelCluster
Visualization
• NICE DCV
• Amazon AppStream 2.0
Compute
• Amazon EC2 Spot
• AWS Auto Scaling
• Amazon EC2 instances (compute and accelerated)
Networking
• Enhanced networking
• Placement groups
• Elastic Fabric Adapter
Storage
• Amazon S3
• Amazon EBS
• Amazon EFS
• Amazon FSx for Lustre
Broadest choice of processors and architectures
Intel® Xeon® Scalable (Skylake) processor
AMD EPYC processor
Right compute for the right application and workload
AWS Graviton Processor
AWS compute instance types
General Purpose: M5, T2, M4, A1
Compute Optimized: C5, C4
Storage and I/O Optimized: I3, D2, H1
Memory Optimized: X1, R4
GPU Graphics: G3, G2, EG
GPU and FPGA Compute: P3, P2, F1
Elastic Network Adapter (ENA)
Latest generation of Enhanced Networking
Hardware Checksums
Multi-Queue Support
Receive Side Steering
Up to 100Gbps in a Placement Group
Open Source Amazon Network Driver (Upstream first)
Compatible with MPI libraries including OpenMPI 3.0
Accelerating time to results with HPC on AWS
Intel® Xeon® Scalable processor-based Amazon EC2 instances provide unique advantages to accelerate HPC workloads
• Intel® Xeon® Scalable processors:
  o Up to 28 cores delivering enhanced per-core performance, and significant increases in memory bandwidth (6 memory channels) and I/O bandwidth and throughput (48 PCIe lanes)
  o Intel® Advanced Vector Extensions 512 (Intel® AVX-512): Intel® AVX-512 can handle your most demanding computational tasks and accelerate performance for workloads
  o Intel® AVX-512 delivers up to 2X more FLOPs/clock-cycle for HPC, analytics, cryptography and data compression workloads
• Intel software development tools
  o Allow you to take advantage of the Intel Xeon Scalable processor and quickly deploy a fully elastic HPC cluster on AWS Cloud. Once created, the cluster provisions standard HPC tools such as schedulers, Message Passing Interface (MPI) environment, and shared storage.
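The "up to 2X more FLOPs/clock-cycle" claim follows from vector width: a 512-bit register holds twice as many values as a 256-bit one. The sketch below works that out for double precision. It assumes two fused multiply-add (FMA) units per core for both vector widths, which does not hold for every processor model, and the 2.0 GHz clock in the peak calculation is a hypothetical figure. Sustained clock speed under heavy vector load is usually lower than nominal, so treat the result as an upper bound.

```python
def flops_per_cycle(vector_bits, element_bits=64, fma_units=2):
    """Peak floating-point operations per core per clock cycle.

    Each FMA counts as two operations (one multiply, one add) per lane.
    The default of two FMA units is an assumption, not a property of every CPU.
    """
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units


def peak_gflops(cores, clock_ghz, per_cycle):
    """Theoretical peak GFLOPS for one processor."""
    return cores * clock_ghz * per_cycle


avx2 = flops_per_cycle(256)    # 4 double-precision lanes
avx512 = flops_per_cycle(512)  # 8 double-precision lanes

print(avx2, avx512, avx512 / avx2)    # 16 32 2.0
print(peak_gflops(28, 2.0, avx512))   # 1792.0 (28 cores, hypothetical 2.0 GHz)
```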
Custom AWS silicon with 64-bit Arm Neoverse cores
AWS Graviton Processor
Rapidly innovate, build, and iterate on behalf of our customers
Targeted workload optimizations
Innovations in HPC infrastructure
High clock speed compute instances: Z1d
• Z1d instances are optimized for memory-intensive, compute-intensive applications
• Custom Intel Xeon Scalable processor
• Up to 4 GHz sustained, all-turbo performance
• Up to 384 GiB DDR4 memory
• Enhanced networking, up to 25 Gbps throughput
HPC stack on AWS
• 3D graphics virtual workstation
• License managers and cluster head nodes with job schedulers
• Cloud-based, auto-scaling HPC clusters
• Shared file storage
• Storage cache
Featuring Intel Xeon Scalable (Skylake) processor
Amazon EC2 P3 Instances (October 2017)
• Up to eight NVIDIA Tesla V100 GPUs
• 1 PetaFLOPS of computational performance: up to 14x better than P2
• 300 GB/s GPU-to-GPU communication (NVLink): 9X better than P2
• 16GB GPU memory with 900 GB/sec peak GPU memory bandwidth
One of the fastest, most powerful GPU instances in the cloud
F1 FPGA instance types on AWS
Up to 8 Xilinx UltraScale+ 16nm VU9P FPGA devices in a single instance
The f1.16xlarge size provides:
8 FPGAs, each with over 2 million customer-accessible FPGA programmable logic cells and over 5000 programmable DSP blocks
Each of the 8 FPGAs has 4 DDR-4 interfaces, with each interface accessing a 16GiB, 72-bit wide, ECC-protected memory
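The per-FPGA figures above add up to the instance-level totals as follows. This is plain arithmetic on the numbers stated on the slide; because the logic cell and DSP counts are quoted as "over" a value, the totals for those two are lower bounds.

```python
# Figures quoted for the f1.16xlarge size.
FPGAS = 8
DDR4_INTERFACES_PER_FPGA = 4
GIB_PER_INTERFACE = 16
LOGIC_CELLS_PER_FPGA_MIN = 2_000_000  # "over 2 million", so a lower bound
DSP_BLOCKS_PER_FPGA_MIN = 5_000       # "over 5000", so a lower bound

memory_per_fpga_gib = DDR4_INTERFACES_PER_FPGA * GIB_PER_INTERFACE
total_fpga_memory_gib = FPGAS * memory_per_fpga_gib
min_logic_cells = FPGAS * LOGIC_CELLS_PER_FPGA_MIN
min_dsp_blocks = FPGAS * DSP_BLOCKS_PER_FPGA_MIN

print(memory_per_fpga_gib)    # 64 GiB of DDR4 attached to each FPGA
print(total_fpga_memory_gib)  # 512 GiB across the instance
print(min_logic_cells)        # at least 16,000,000 logic cells
print(min_dsp_blocks)         # at least 40,000 DSP blocks
```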
A1 Software ecosystem
Tools
• Coming soon: Amazon Corretto
OSVs and ISVs
• SUSE Linux Enterprise Server 15
• Ubuntu 16.04 and newer
• Red Hat Enterprise Linux 7.6 and newer
• Amazon Linux 2
• More coming soon
Containers
• Most Docker official images support arm64
• ECS: available today
• EKS: coming soon
Better together
SUSE is a supported enterprise Linux HPC distribution on Amazon EC2
Collaboration to accelerate HPC usage and cloud services across Amazon EC2
SUSE High Performance Computing
Amazon Machine Images (AMIs)
Amazon maintained
• Broad set of Linux and Windows images
• Kept up-to-date by Amazon in each region
Marketplace maintained
• Managed and maintained by AWS Marketplace partners
Your machine images
• AMIs you have created from Amazon EC2 instances
• Can keep private, share with other accounts, or publish to the community
SUSE maintained
• Managed and maintained by the SUSE Public Cloud Team
The structure of instance pricing
* Reservation Discount (Reserved Instance) * Instance
* On-demand subscription * Infrastructure Cost
AWS PrivateLink
Share services privately between VPCs and on-premises networks
Secure. Scalable. Reliable.
[Diagram: Amazon Virtual Private Cloud (VPC), Customer, Interface VPC Endpoint, AWS Direct Connect, Salesforce Data Centers, Salesforce VPC Endpoint Service]
ISV Managed
• Software vendor offers customers a managed service; can be offered via AWS Marketplace.
• Billing, security, performance is fully managed by software vendor (SaaS).
• Good options for product evaluations and training, or for production use
Supported
• Software vendor enables customers to run software in customer’s Virtual Private Cloud.
• Customer manages their own AWS deployments. Software vendor provides support.
• Great option to balance customer need for ISV support with IT controls
Self-Managed
• Customer deploys into their own Virtual Private Cloud, using their own IT policies.
• Customer designs, monitors, and manages all needed cloud infrastructure. Software vendor may or may not be aware of customer’s use of AWS.
• This is a common method of deployment today
Hybrid
• Customer deploys a portion of their total software workloads on AWS, while maintaining significant on-premises, legacy IT infrastructure.
• This may be desired for data compliance reasons, software license policies, for IT refresh cycles, or to accommodate custom hardware
Optimizing HPC software on AWS
Building your Spot Fleet
Automation and Control
Automatic re-sizing of compute clusters based upon demand and policies
Auto Scaling
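To illustrate resizing "based upon demand and policies", the sketch below derives a desired instance count from the number of queued jobs and clamps it to a minimum and maximum. This is a hypothetical policy written for illustration, not the algorithm that AWS Auto Scaling applies. The function and parameter names are assumptions; the result is the kind of value you would pass as a group's desired capacity.

```python
import math


def desired_capacity(pending_jobs, jobs_per_instance, min_size=0, max_size=100):
    """Instances needed to cover the queue, clamped to the allowed group size."""
    if jobs_per_instance <= 0:
        raise ValueError("jobs_per_instance must be positive")
    needed = math.ceil(pending_jobs / jobs_per_instance)
    return max(min_size, min(max_size, needed))


print(desired_capacity(0, 4))      # 0: an empty queue scales the group to zero
print(desired_capacity(10, 4))     # 3: ten jobs at four per instance, rounded up
print(desired_capacity(1000, 4))   # 100: capped at max_size
```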
Consistent experience
Simplified permissions
Governance & best practices
Launch templates are now available in AWS CloudFormation with Auto Scaling and EC2 Fleet
Use launch templates to achieve …
Increased productivity
Cost optimization
[Chart: capacity mix of Reserved Instances, On-Demand Instances, and Spot under three strategies: Conservative, Optimized, and Optimized with scale-out (magnify the peak)]
Reserved
Make a low, one-time payment and receive a significant discount on the hourly charge
For committed utilization
On-Demand
Pay for compute capacity by the hour, with per-second billing and no long-term commitments
For spiky workloads, or to define needs
Spot
Bid for unused capacity, charged at a Spot price that fluctuates based on supply and demand
For high-scale, time-flexible workloads
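The effect of billing granularity on an On-Demand bill can be sketched in a few lines. The example assumes a 60-second minimum charge and a hypothetical rate of $1.00 per instance-hour; neither number comes from the talk. Whether an instance is billed per second or per hour depends on the operating system and purchase option, so the granularity is left as a parameter.

```python
import math


def billed_seconds(runtime_seconds, granularity=1, minimum=60):
    """Round runtime up to the billing granularity, subject to a minimum charge."""
    units = math.ceil(runtime_seconds / granularity)
    return max(units * granularity, minimum)


def on_demand_cost(runtime_seconds, hourly_rate, instances=1, granularity=1):
    """Cost of running `instances` identical instances for the same duration."""
    seconds = billed_seconds(runtime_seconds, granularity)
    return instances * seconds * hourly_rate / 3600


runtime = 90 * 60 + 30  # a job that runs for 90 minutes and 30 seconds
per_second = on_demand_cost(runtime, hourly_rate=1.00, instances=10)
per_hour = on_demand_cost(runtime, hourly_rate=1.00, instances=10, granularity=3600)

print(round(per_second, 2))  # 15.08
print(round(per_hour, 2))    # 20.0
```

Short or bursty jobs benefit most from fine-grained billing, since the rounding overhead is a larger share of the total.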
Coming Soon: On-Demand support
• SUSE Linux Enterprise HPC
• Pricing through the AWS Marketplace
• A single bill for Infrastructure and Subscriptions
• Availability for X86 and Arm HPC clusters
• Extended Service Pack Overlap Support (ESPOS)
• Long Term Service Pack Support (LTSS)
• Support for burstable cluster sizes
• Support for SLE HPC 15 – separate from SLES 15
- Required for access to HPC Module packages on SLE 15
SUSE High Performance Computing
Coming Soon: AWS ParallelCluster support
Configuration in Minutes
There’s not a great deal involved in getting a cluster up and running. This config file will do it.
• To change the cluster, update the config and update the cluster.
• Queue size options: 0 instances is really OK. The cluster will grow on demand when you put jobs in the queue.
• Scheduler: choose from “sge”, “torque”, “openlava” or “slurm”.
• There are also options for leveraging the Spot market.
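The config file shown on the slide did not survive extraction, so the following is a stand-in rather than a reproduction. It is a minimal sketch that generates an INI file in the style of the AWS ParallelCluster 2.x configuration format as I understand it. Treat the section and key names as assumptions to verify against the ParallelCluster documentation for your version. The key pair name, VPC ID, subnet ID, and region are placeholders.

```python
import configparser
import io

SCHEDULERS = {"sge", "torque", "openlava", "slurm"}  # options named on the slide


def build_config(scheduler="slurm", max_queue_size=10, use_spot=False):
    """Build a ParallelCluster-2.x-style config (key names are assumptions)."""
    if scheduler not in SCHEDULERS:
        raise ValueError(f"unsupported scheduler: {scheduler}")
    cfg = configparser.ConfigParser()
    cfg["global"] = {"cluster_template": "default"}
    cfg["aws"] = {"aws_region_name": "us-east-1"}  # placeholder region
    cfg["cluster default"] = {
        "key_name": "EXAMPLE-KEY",                 # placeholder key pair
        "vpc_settings": "public",
        "scheduler": scheduler,
        "initial_queue_size": "0",                 # start empty, grow on demand
        "max_queue_size": str(max_queue_size),
        "cluster_type": "spot" if use_spot else "ondemand",
    }
    cfg["vpc public"] = {
        "vpc_id": "vpc-EXAMPLE",                   # placeholder VPC
        "master_subnet_id": "subnet-EXAMPLE",      # placeholder subnet
    }
    return cfg


def render(cfg):
    """Return the config as INI text."""
    buffer = io.StringIO()
    cfg.write(buffer)
    return buffer.getvalue()


config = build_config(scheduler="slurm", max_queue_size=20, use_spot=True)
text = render(config)
print(text)
```

Setting the initial queue size to 0 matches the slide's point that an empty compute fleet is fine, because nodes are added when jobs arrive.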
Feeding workloads
Using highly available Simple Queue Service to feed EC2 nodes
Processing task/processing trigger
Amazon SQS
Processing results
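The pattern in this diagram, where workers pull tasks from a queue and report results, can be sketched as a receive, process, delete loop. The worker below expects a client object with `receive_message` and `delete_message` methods that take the same keyword arguments as the SQS client in boto3. To keep the example runnable without AWS credentials, it is exercised against a small in-memory stand-in. The `FakeQueue` class, the queue URL, and the task names are all hypothetical.

```python
def drain_queue(sqs, queue_url, handle, max_batches=10):
    """Receive messages in batches, process each one, then delete it."""
    processed = 0
    for _ in range(max_batches):
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 per call
            WaitTimeSeconds=20,      # long polling
        )
        messages = response.get("Messages", [])
        if not messages:
            break
        for message in messages:
            handle(message["Body"])
            # Delete only after successful processing, so a crashed worker's
            # task becomes visible again and another node can retry it.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            processed += 1
    return processed


class FakeQueue:
    """In-memory stand-in for the SQS client, for local testing only."""

    def __init__(self, bodies):
        self._pending = {str(i): body for i, body in enumerate(bodies)}

    def receive_message(self, QueueUrl, MaxNumberOfMessages=1, WaitTimeSeconds=0):
        batch = list(self._pending.items())[:MaxNumberOfMessages]
        if not batch:
            return {}
        return {"Messages": [{"ReceiptHandle": h, "Body": b} for h, b in batch]}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self._pending.pop(ReceiptHandle, None)


results = []
queue = FakeQueue([f"task-{n}" for n in range(25)])
count = drain_queue(queue, "https://example.invalid/queue", results.append)
print(count)  # 25
```

Because each task is deleted only after it is handled, this loop suits Spot-backed worker fleets where nodes can disappear mid-task.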
Supporting Services
Coming Soon: Elastic Fabric Adapter (EFA) support
Elastic Fabric Adapter (EFA)
For Amazon EC2 instances
Supports high levels of inter-instance communications
Enabling: computational fluid dynamics, weather modeling, etc.
EFA supports industry-standard libfabric APIs. EFA will be available as an optional EC2 networking feature that you can enable on c5n.9xlarge, c5n.18xlarge, and p3dn.24xlarge instances.
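Since EFA is an optional networking feature, it has to be requested when instances are launched. The sketch below builds a parameter dictionary shaped like the arguments to the EC2 `run_instances` API, with the network interface type set to `efa` and the instances placed in a placement group. It does not call AWS. The set of allowed instance types reflects only the three sizes named on the slide, and the image, subnet, security group, and placement group identifiers are placeholders.

```python
# Instance sizes named on the slide; the real list may differ over time.
EFA_INSTANCE_TYPES = {"c5n.9xlarge", "c5n.18xlarge", "p3dn.24xlarge"}


def efa_launch_parameters(
    instance_type, image_id, subnet_id, security_group_id, placement_group, count
):
    """Build a dict shaped like EC2 run_instances arguments with EFA enabled."""
    if instance_type not in EFA_INSTANCE_TYPES:
        raise ValueError(f"{instance_type} is not in the EFA list from the slide")
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"GroupName": placement_group},
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                "InterfaceType": "efa",
                "SubnetId": subnet_id,
                "Groups": [security_group_id],
            }
        ],
    }


# Placeholder identifiers: substitute your own resources.
params = efa_launch_parameters(
    instance_type="c5n.18xlarge",
    image_id="ami-EXAMPLE",
    subnet_id="subnet-EXAMPLE",
    security_group_id="sg-EXAMPLE",
    placement_group="hpc-cluster-EXAMPLE",
    count=4,
)
print(params["NetworkInterfaces"][0]["InterfaceType"])  # efa
```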
Q and A?
Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.