High Performance Computing (HPC)
CAEA e-Learning Series
Jonathan G. Dudley, Ph.D. | 06/09/2015
© 2015 CAE Associates
Agenda
Introduction
HPC Background
— Why HPC
— SMP vs. DMP
— Licensing
HPC Terminology
— Types of HPC: HPC Cluster & Workstation HPC
— Hardware Components: CPU vs. Cores, GPU vs. Phi, HDD vs. SSD
— Interconnects
— GPU Acceleration
CAE Associates Inc.
Engineering Consulting Firm in Middlebury, CT specializing in FEA and CFD analysis.
ANSYS® Channel Partner since 1985, providing sales of ANSYS® products, training, and technical support.
e-Learning Webinar Series
This presentation is part of a series of e-Learning webinars offered by CAE Associates.
You can view many of our previous e-Learning sessions on our website or on the CAE Associates YouTube channel.
If you are a New Jersey or New York resident, you can earn continuing education credit for attending the full webinar and completing a survey which will be emailed to you after the presentation.
CAEA Resource Library
Our Resource Library contains over 250 items, including:
— Consulting case studies
— Conference and seminar presentations
— Software demonstrations
— Useful macros and scripts
The content is searchable, and you can download copies of the material to review at your convenience.
CAEA Engineering Advantage Blog
Our Engineering Advantage Blog offers weekly insights from our experienced technical staff.
CAEA ANSYS® Training
Classes can be held at our Training Center at CAE Associates or on-site at your location.
CAE Associates is offering online training classes in 2015! Registration is available on our website.
Agenda
Introduction
HPC Background
— Why HPC
— Licensing
— SMP vs. DMP
HPC Terminology
— Types of HPC: HPC Cluster & Workstation HPC
— Hardware Components: CPU vs. Cores, GPU vs. Phi, HDD vs. SSD
— Interconnects
— GPU Acceleration
Why High Performance Computing (HPC)?
Remove computing limitations from engineers in all phases of design, analysis, and testing.
Impact product design
— Faster simulation
— More efficient parametric studies
Larger models
— More accuracy: turbulence modeling, particle tracking, more refined models
Design optimization
— More runs for a fixed hardware configuration
Why HPC?
Using today's multicore computers is key for companies to remain competitive.
The ANSYS HPC product suite allows scalability to whatever computational level is required, from single-user or small user group options at entry level up to virtually unlimited parallel capacity or large user group options at enterprise level.
— Reduce turnaround time
— Examine more design variants faster
— Simulate larger or more complex models
4 Main Product Licenses
HPC (per-process)
HPC Pack
— HPC product rewarding volume parallel processing for high-fidelity simulations.
— Each simulation consumes one or more Packs.
— Parallel capacity enabled increases quickly with added Packs:

HPC Packs per Simulation | Cores Enabled
1 | 8
2 | 32
3 | 128
4 | 512
5 | 2048
6 | 8192
7 | 32768

HPC Workgroup
— HPC product rewarding volume parallel processing for increased simulation throughput, shared among engineers throughout a single location or the world.
— 16 to 32768 parallel cores shared across any number of simulations on a single server.
HPC Parametric Pack
— Enables simultaneous execution of multiple design points while consuming just one set of licenses.
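The cores-enabled progression in the HPC Pack table is geometric: each additional Pack quadruples the enabled core count. A minimal sketch of the pattern (the helper name is illustrative, not an ANSYS licensing API):

```python
def hpc_pack_cores(packs: int) -> int:
    # Cores enabled by a given number of HPC Packs; the table's
    # values 8, 32, 128, ..., 32768 are 2 * 4**packs for packs 1..7.
    return 2 * 4 ** packs

print([hpc_pack_cores(p) for p in range(1, 8)])
# [8, 32, 128, 512, 2048, 8192, 32768]
```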
Shared and Distributed Memory
Shared Memory: Shared Memory Parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable.
— OpenMP is the industry standard.
Distributed Memory: Distributed Memory Parallel (DMP) processing assumes that the physical memory for each process is separate from all other processes.
— Requires message-passing software to communicate between cores.
— MPI is the industry standard.
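The SMP/DMP distinction can be seen in miniature with Python's threads (one shared address space, like OpenMP) versus processes (separate address spaces, like MPI ranks). This is a conceptual sketch only, not how ANSYS implements either model:

```python
import threading
import multiprocessing

def bump(counter):
    # Increment a counter held in a Python dict.
    counter["value"] += 1

def demo():
    # SMP analogue: threads share one address space, so the parent
    # observes the increment made by the worker thread.
    shared = {"value": 0}
    t = threading.Thread(target=bump, args=(shared,))
    t.start(); t.join()

    # DMP analogue: a child process works on its own copy of memory,
    # so the parent's dict is unchanged. Real DMP codes move results
    # between such separate address spaces with MPI messages.
    separate = {"value": 0}
    p = multiprocessing.Process(target=bump, args=(separate,))
    p.start(); p.join()
    return shared["value"], separate["value"]

if __name__ == "__main__":
    print(demo())  # (1, 0): the thread's update is visible, the process's is not
```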
Distributed ANSYS Architecture
Domain decomposition approach:
— Break the problem into "N" pieces.
— "Solve" the global problem independently within each domain.
— Communicate information across the boundaries as necessary.
Sparse, PCG & LANPCG solvers all support distributed execution.
DMP runs on a single node or a cluster! SMP is for a single node only.
Benefits
— The entire SOLVE phase is parallel.
— More computations performed in parallel, with faster solution time.
— Better speed-ups than SMP: can achieve > 4x speed-up on 8 cores (try getting that with SMP!).
— Can be used for jobs running on hundreds of cores.
— Can take advantage of resources on multiple machines.
— Memory usage and bandwidth scale.
— Disk (I/O) usage scales (i.e., parallel I/O).
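The ">4x on 8 cores" speed-up can be sanity-checked with Amdahl's law, speedup = 1 / ((1 - f) + f/n), where f is the parallel fraction of the runtime and n the core count. The 0.90 fraction below is an illustrative assumption, not a measured ANSYS value:

```python
def amdahl_speedup(f, n):
    """Amdahl's law: ideal speedup on n cores when a fraction f
    of the runtime is perfectly parallelizable."""
    return 1.0 / ((1.0 - f) + f / n)

# A solver whose SOLVE phase is ~90% of runtime and fully parallel:
s8 = amdahl_speedup(0.90, 8)
print(round(s8, 2))  # 4.71, consistent with the >4x-on-8-cores claim
```

Note that reaching even 4x on 8 cores requires f ≈ 0.86, which is why parallelizing the entire SOLVE phase (as DMP does) matters.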
ANSYS Mechanical Scaling
[Scaling chart, ANSYS v15: 6M-DOF model with plasticity, contact, and bolt pretension, 4 load steps]
Parallel Settings ANSYS APDL
— SMP with GPU acceleration settings
— DMP: for multiple-core or multiple-node processing
— For GPU acceleration using DMP: on the Customization/Preferences tab, under Additional Parameters, add the command line argument: -acc nvidia
Parallel Settings ANSYS CFX/Fluent
— Fluent: multiple-core processing and GPU acceleration options are set under Parallel Settings.
— CFX: parallel settings options.
2 Common Types of HPC
HPC Cluster
— Communication via a series of switches and interconnects:
• Infiniband
• Gigabit Ethernet (1 Gb/s, 10 Gb/s)
• Fiber
— Scalable:
• DOE supercomputer: 1.6M cores
Workstation HPC
— Single-desktop communication
— More than 2 cores, commonly 8 or more
— Quad-socket current builds:
• Xeon E5-4600: up to 48 cores
• Up to 1 TB of 1866 MHz DDR3 RAM
Central Processing Unit and Cores
Intel Xeon E5 Processor Series (quad-socket MOBO):
— 4-18 cores per CPU
— Frequency: 1.8-3.5 GHz
— L3 cache: up to 2.5 MB/core
— Bus: 6.4-9.6 GT/s QPI
Intel Xeon E7 Processor Series:
— 4-18 cores per CPU
— Frequency: 1.9-3.2 GHz
— L3 cache: up to 2.5 MB/core
— Bus: 6.4-9.6 GT/s QPI
RAM
— DDR4: supports 2000-4000 MT/s (1 MT/s = 10^6 transfers/s)
— DDR3: supports 800-2000 MT/s
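The MT/s ratings translate to bandwidth once the 64-bit (8-byte) channel width is factored in; a quick sketch:

```python
def channel_bandwidth_gbs(mt_per_s, bus_bytes=8):
    """Peak bandwidth of one DRAM channel in GB/s: transfers per
    second times the 64-bit (8-byte) bus width."""
    return mt_per_s * 1e6 * bus_bytes / 1e9

print(channel_bandwidth_gbs(1866))  # DDR3-1866: ~14.9 GB/s per channel
print(channel_bandwidth_gbs(2400))  # DDR4-2400: ~19.2 GB/s per channel
```

Multi-channel controllers multiply this figure, which is why memory bandwidth (not just frequency) governs solver scaling.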
Graphical and Co-Processing Units
GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, analytics, engineering, consumer, and enterprise applications.
Co-processing uses a computer processor (PCI card) to supplement the functions of the primary processor:
— Floating-point arithmetic
— Signal processing
Supported cards:
— Mechanical and Fluent only
— 64-bit Windows or Linux x64
• Tesla K10 and K20 series
• Quadro 6000
• Quadro K5000 and K6000
— Xeon Phi 3000, 5000, 7000 series (ANSYS Mechanical only)
GPU Acceleration
ANSYS Mechanical
— For models with solid elements > 500k DOF
— DMP is preferred
— For DOF > 5M, add another card or use a single card with 12 GB (K40, K6000)
— The whole problem must fit on the GPU
— PCG/JCG solver: MSAVE off
— Models with lower Lev_Diff are better suited
ANSYS Fluent
— Higher AMG workloads are ideal for GPU acceleration; coupled problems benefit from GPUs
— 1e6 cells require ~4 GB of GPU RAM
— Better performance with lower CPU core counts
— 3 to 4 CPU cores per 1 GPU
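The ~4 GB of GPU RAM per 10^6 cells rule of thumb makes card sizing a one-liner (the helper name is illustrative):

```python
def fluent_gpu_ram_gb(cells, gb_per_million=4.0):
    """Slide rule of thumb: ~4 GB of GPU RAM per 1e6 cells
    for GPU-accelerated Fluent runs."""
    return cells / 1e6 * gb_per_million

print(fluent_gpu_ram_gb(3_000_000))  # a 3M-cell case needs ~12 GB, a K40-class card
```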
GPU/Co-Processing Licensing
Licensing options:
— HPC Packs for quick scale-up
— HPC Workgroup for flexibility
GPUs are treated the same as CPU cores in the licensing model. As you scale up, license cost per core decreases.
Hard Disk
Conventional SATA
— 7200 RPM and 10k RPM
— Ideal for volume storage
— Cheapest
Serial Attached SCSI (SAS)
— 15k RPM drives (RAID 0)
— Ideal "scratch space" drives
Solid State Drives (SSD)
— Fastest read/write operations
— Lower power, cooler, quieter
— No mechanical parts
— Ideal for OS drive
— Cost per GB is highest
Interconnects
Internal
— Controlled by the motherboard
— Intel QuickPath Interconnect (QPI)
• PCIe 2.0 x8 = 32 Gb/s
• PCIe 3.0 x8 = 63 Gb/s
• PCIe 4.0 x8 = 125 Gb/s
External
— Gigabit Ethernet (1 Gb/s)
— Infiniband (56 Gb/s FDR)
— Fibre Channel (16 Gb/s)
— Ethernet RDMA (40 Gb/s)
Mechanical/APDL requires at least a 10 Gb/s interconnect for scaling past one node.
• Prefer Infiniband QDR/FDR for large clusters.
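The interconnect ratings matter most when large files move between nodes. A back-of-the-envelope comparison, ignoring protocol overhead (and noting the Gb/s-to-GB/s factor of 8):

```python
def transfer_seconds(gigabytes, link_gbps):
    """Time to move a payload over a link, ignoring protocol
    overhead: payload in gigabytes, link speed in gigabits/s."""
    return gigabytes * 8 / link_gbps

# Shipping a 100 GB result file between nodes:
print(round(transfer_seconds(100, 1)))   # gigabit Ethernet: 800 s
print(round(transfer_seconds(100, 56)))  # FDR Infiniband: 14 s
```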
Basic Guidelines
— Faster cores = faster solution
— Faster RAM = faster solution
• Be aware of memory bandwidth
— 4 GB RAM/core for ANSYS CFD
— Hyper-threading: off
— Turbo Boost: only for low core counts
— Faster HD = faster solution
• Especially for intensive I/O
• RAID 0 for multiple disks
• SSD or SAS 15k drives
• Parallel file systems
— Faster is better! More is better.
• Must balance budget/performance
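The 4 GB/core guideline makes node memory sizing straightforward (illustrative helper, not an ANSYS tool):

```python
def cfd_ram_gb(cores, gb_per_core=4.0):
    """Slide rule of thumb: provision ~4 GB of RAM per core
    for ANSYS CFD workloads."""
    return cores * gb_per_core

print(cfd_ram_gb(16))  # a 16-core CFD node should carry ~64 GB
```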
HPC Revolution
Every computer today is a parallel computer.
Every simulation in ANSYS can benefit from parallel processing.