CST – COMPUTER SIMULATION TECHNOLOGY | www.cst.com
High Performance Computing in CST STUDIO SUITE
Felix Wolfheimer
GPU Computing Performance
[Figure: Speedup of solver loop versus number of GPUs (Tesla K40, 0 to 4), comparing CST STUDIO SUITE 2013 and CST STUDIO SUITE 2014; speedup axis from 0 to 18.]
Benchmark performed on system equipped with dual Xeon E5-2630 v2 (Ivy Bridge EP) processors, and four Tesla K40 cards. Model has 80 million mesh cells.
GPU computing performance has been improved in CST STUDIO SUITE 2014 because CPU and GPU resources are now used in parallel.
Promo offer for EUC participants: 25% discount for K40 cards
Typical GPU System Configurations
Entry level
Workstation with 1 GPU card
Available "off the shelf"
Good acceleration for smaller models
Limited model size (depends on available GPU memory and features used)
Professional level
Workstation/server with multiple internal or external GPU cards
Many configurations available
Good acceleration for medium-size and large models
Limited model size (depends on available GPU memory and features used)
Enterprise level
Cluster system with high-speed interconnect
High flexibility: can handle extremely large models using MPI Computing and also many parallel simulation tasks using Distributed Computing (DC)
Administrative overhead
Higher price
CST engineers are available to discuss with you which configuration makes sense for your applications and usage scenario.
MPI Computing: Area of Application
MPI Computing is a way to handle very large models efficiently.
Some application examples for MPI Computing:
Electrically very large structures (e.g. RCS calculation, lightning strike)
Extremely complex structures (e.g. SI simulation for a full package)
MPI Computing: Working Principle
Based on a domain decomposition of the simulation domain. Each cluster computer works on its part of the domain. Automatic load balancing ensures an equal distribution of the workload. It works cross-platform on Windows and Linux systems.
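The idea of splitting the simulation domain into equally loaded parts can be illustrated with a minimal sketch. It assumes a simple 1D slab decomposition along one mesh axis and an equal cost per mesh plane; the decomposition and load-balancing scheme actually used by CST STUDIO SUITE is not described on this slide, and the function name `split_domain` is hypothetical.

```python
def split_domain(n_planes: int, n_nodes: int) -> list[range]:
    """Split n_planes mesh planes into n_nodes contiguous slabs of near-equal size."""
    if n_nodes <= 0 or n_planes < n_nodes:
        raise ValueError("need at least one mesh plane per node")
    base, extra = divmod(n_planes, n_nodes)
    slabs = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional plane,
        # so slab sizes differ by at most one plane.
        size = base + (1 if node < extra else 0)
        slabs.append(range(start, start + size))
        start += size
    return slabs


if __name__ == "__main__":
    # Hypothetical mesh with 1000 planes distributed over 3 cluster nodes.
    for node, slab in enumerate(split_domain(n_planes=1000, n_nodes=3)):
        print(f"node {node}: planes {slab.start}..{slab.stop - 1} ({len(slab)} planes)")
```

Each slab boundary corresponds to a subdomain boundary across which neighbouring nodes would have to exchange field data, which is why a low-latency interconnect helps.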
[Figure: The CST STUDIO SUITE® Frontend connects to the MPI Client Nodes, which are linked by a high-speed/low-latency interconnection network (optional). The domain decomposition, with its subdomain boundaries, is shown in mesh view.]
MPI Matrix Computation
Performance results (for two cluster nodes)*:
Model | Matrix Comp. Time/s (2013) | Matrix Comp. Time/s (2014) | Speedup (Matrix Comp.)** | Speedup (Total Sim.)**
340M cells | 10,301 | 1,217 | 8.46 | 2.63
47M cells | 12,921 | 4,018 | 3.22 | 1.85
* System configuration: compute nodes are equipped with dual eight-core Xeon E5-2650 processors, 4x K20 GPUs, and InfiniBand FDR interconnect.
** Speedup between versions 2013 and 2014 of CST STUDIO SUITE.
The performance of the matrix computation step has been improved significantly for the new version of CST STUDIO SUITE.
Up to version 2013, matrix computation is single-threaded when MPI Computing is used.
Version 2014 uses all available cores on all cluster nodes.
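The matrix computation speedups can be reproduced from the reported times as a quick check. This is only arithmetic on the numbers in the performance results; the helper name `speedup` is an illustrative choice.

```python
# Matrix computation times in seconds (version 2013, version 2014),
# taken from the performance results for two cluster nodes.
MATRIX_TIMES = [
    (10_301, 1_217),
    (12_921, 4_018),
]


def speedup(t_old: float, t_new: float) -> float:
    """Speedup of the new version relative to the old one."""
    return t_old / t_new


if __name__ == "__main__":
    for t_2013, t_2014 in MATRIX_TIMES:
        print(f"{t_2013} s -> {t_2014} s: speedup {speedup(t_2013, t_2014):.2f}")
```

The total simulation speedups (2.63 and 1.85) are smaller than the matrix computation speedups because the matrix step is only one part of the total simulation time.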
MPI Calculation Example
Frequency: 2 GHz
Dimensions: 17.4 x 4.5 x 16.2 m (116 x 30 x 108 λ)
Volume: 375,840 λ³
660 million cells
4 node MPI cluster
4 Tesla K20 GPUs on each node
Total of 16 GPUs with 6 GB RAM at 60% memory
Total memory: < 100 GB
2 GHz blade antenna positioned on aircraft
Broadband calculation time ~ 4h
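The electrical size quoted for this example follows directly from the physical dimensions and the wavelength at 2 GHz. The sketch below redoes that arithmetic; the GPU memory estimate reads "60% memory" as 60% utilization of each GPU's 6 GB, which is an interpretation and not stated explicitly on the slide.

```python
C0 = 299_792_458.0  # speed of light in vacuum, m/s


def electrical_size(dims_m, freq_hz):
    """Return the dimensions in wavelengths at freq_hz, rounded to whole wavelengths."""
    wavelength = C0 / freq_hz
    return tuple(round(d / wavelength) for d in dims_m)


def volume(dims):
    """Product of three edge lengths."""
    return dims[0] * dims[1] * dims[2]


if __name__ == "__main__":
    size = electrical_size((17.4, 4.5, 16.2), 2e9)
    print("size in wavelengths:", size)
    print("volume in cubic wavelengths:", volume(size))

    # Assumption: "60% memory" means 60% of each GPU's 6 GB is in use.
    used_gb = 16 * 6 * 0.60
    print(f"GPU memory in use: {used_gb:.1f} GB")
```

Under that reading the GPU memory in use is well below the quoted total of less than 100 GB.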
Sub-Volume Monitors
Sub-volume monitors record field data only in a region of interest, which reduces the amount of data stored. This is especially important for large models with hundreds of millions of mesh cells.
Field data is only stored in the sub-volume defined by the box
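How much data a sub-volume monitor saves can be estimated from the share of mesh cells inside the box. The following is a rough sketch, assuming the stored data scales with the number of cells in the box; the mesh dimensions are hypothetical and the function name `subvolume_fraction` is illustrative.

```python
import math


def subvolume_fraction(domain_cells, box_cells):
    """Fraction of mesh cells inside the monitor box; both arguments are (nx, ny, nz)."""
    if any(b > d for b, d in zip(box_cells, domain_cells)):
        raise ValueError("monitor box must fit inside the domain")
    return math.prod(box_cells) / math.prod(domain_cells)


if __name__ == "__main__":
    # Hypothetical mesh of 1000 x 600 x 1100 = 660 million cells,
    # with a monitor box covering 200 x 100 x 200 cells.
    fraction = subvolume_fraction((1000, 600, 1100), (200, 100, 200))
    print(f"stored share of field data: {fraction:.2%}")
```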
Distributed Computing
[Figure: The CST STUDIO SUITE® Frontend connects to the DC Main Controller and the DC Solver Servers.]
“Jobs” could be:
port excitations*
frequency points*
parameter variations
optimization iterations
* 2 in parallel included with standard license
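The distribution of independent jobs can be pictured with a small sketch in which a pool of worker threads stands in for the solver servers. This is an analogy only, not the CST Distributed Computing protocol; `run_job` and `dispatch` are hypothetical names, and the job body is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(job: str) -> str:
    """Placeholder for one solver run (hypothetical); a real job would run a simulation."""
    return f"{job}: done"


def dispatch(jobs: list[str], n_servers: int) -> list[str]:
    """Run independent jobs on n_servers workers and return results in job order."""
    # Each worker thread stands in for one solver server that pulls
    # the next pending job as soon as it becomes idle.
    with ThreadPoolExecutor(max_workers=n_servers) as pool:
        return list(pool.map(run_job, jobs))


if __name__ == "__main__":
    jobs = [f"port excitation {i}" for i in range(1, 5)]
    for line in dispatch(jobs, n_servers=2):
        print(line)
```

The scheme works because the jobs listed above are independent of each other, so no communication between solver servers is needed while they run.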
The model has 16 ports.
Only 8 ports need to be computed if symmetry conditions are defined.
Distribute the 8 simulation runs to different solver servers with GPU acceleration.
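For this example, the number of sequential rounds needed for the 8 port runs can be estimated as follows. The sketch assumes that all runs take the same time and ignores transfer and scheduling overhead; `solver_rounds` is a hypothetical helper.

```python
import math


def solver_rounds(n_runs: int, n_servers: int) -> int:
    """Sequential rounds needed when n_runs equal-length runs share n_servers."""
    if n_servers <= 0:
        raise ValueError("need at least one solver server")
    return math.ceil(n_runs / n_servers)


if __name__ == "__main__":
    n_runs = 8  # 8 of the 16 ports remain after defining symmetry conditions
    for servers in (1, 2, 4, 8):
        print(f"{servers} solver server(s): {solver_rounds(n_runs, servers)} round(s)")
```

Under these assumptions, more than 8 solver servers bring no further benefit for this model, since there are only 8 runs to distribute.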
DC Simulation Time Improvement
[Figure: Speedup (total time) versus number of DC Solver Servers (1, 2, 4, 8), for CPU only and for 1 GPU (Tesla 20) per node; speedup axis from 0 to 30.]
Dual Intel Xeon X5675 CPUs (3.06 GHz), fastest memory configuration, 1 Tesla 20 GPU per node, 1 Gb Ethernet interconnect, 40 million mesh cells
DC Main Controller
The DC Main Controller gives you a complete overview of what is happening on your cluster.
Job Status
Machine Status
In the 2014 version, essential resources (RAM usage and disk space) are monitored as well.
GPU Assignment
Users who have smaller jobs can start multiple solver servers and assign each GPU to a separate server. This allows for a more efficient use of multi-GPU hardware.
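One generic way to pin a process to a single GPU is the CUDA runtime's `CUDA_VISIBLE_DEVICES` environment variable. The sketch below builds one environment per solver server on that basis. It is an assumption for illustration: the slide does not say how CST STUDIO SUITE implements GPU assignment, and the server names are hypothetical.

```python
def gpu_environments(n_gpus: int) -> dict[str, dict[str, str]]:
    """Map one hypothetical solver-server name to an environment exposing a single GPU."""
    return {
        # CUDA_VISIBLE_DEVICES restricts a CUDA process to the listed device indices.
        f"solver-server-{gpu}": {"CUDA_VISIBLE_DEVICES": str(gpu)}
        for gpu in range(n_gpus)
    }


if __name__ == "__main__":
    # Hypothetical machine with 4 GPUs: four servers, one GPU each.
    for name, env in gpu_environments(4).items():
        print(name, env)
```

With this layout, four small jobs can run side by side on a four-GPU machine instead of one job occupying all cards.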
Supported Acceleration Methods
[Table: Acceleration methods supported by the solvers of CST STUDIO SUITE. Columns: Solver, Multithreading, GPU Computing, Distributed Computing, MPI Computing. Footnote: on one GPU card.]
Most other solvers support Multithreading and Distributed Computing for parameter sweeps and optimization.
Choose the Right Acceleration Method
Solver | Model Size | Number of Simulations | Acceleration Technique
Transient | below memory limit of GPU hardware | low | GPU Computing
Transient | below memory limit of GPU hardware | medium/high | GPU Computing on a DC Cluster (Distributed Excitations)
Transient | above memory limit of GPU hardware | - | MPI or combined MPI+GPU Computing
Frequency Domain | can be handled by a single machine | medium/high | Distributed Computing (Distributed Frequency Points)
Integral Equation | can't be handled by a single machine | - | MPI Computing
Integral Equation | can be handled by a single machine | medium/high | Distributed Computing (Distributed Frequency Points)
Parameter Sweep/Optimization | n/a | medium/high | Distributed Computing
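The selection table can also be written as a small lookup function. This is a sketch that encodes only the rows listed above; the function name and its boolean parameters (`fits` for "below the GPU memory limit" or "can be handled by a single machine", `many_simulations` for "medium/high") are illustrative choices, and combinations not covered by the table return `None`.

```python
from typing import Optional


def choose_acceleration(solver: str, fits: bool, many_simulations: bool) -> Optional[str]:
    """Return the acceleration technique from the table, or None if no row applies."""
    if solver == "Parameter Sweep/Optimization":
        # Model size is not applicable for this row.
        return "Distributed Computing" if many_simulations else None
    if solver == "Transient":
        if not fits:
            return "MPI or combined MPI+GPU Computing"
        if many_simulations:
            return "GPU Computing on a DC Cluster (Distributed Excitations)"
        return "GPU Computing"
    if solver == "Integral Equation" and not fits:
        return "MPI Computing"
    if solver in ("Frequency Domain", "Integral Equation") and fits and many_simulations:
        return "Distributed Computing (Distributed Frequency Points)"
    return None


if __name__ == "__main__":
    print(choose_acceleration("Transient", fits=True, many_simulations=False))
    print(choose_acceleration("Integral Equation", fits=False, many_simulations=False))
```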
HPC in the Cloud
CST is working together with HPC hardware and service providers to enable easy access to large computing power for challenging simulations which can't be run on in-house hardware. Users rent a CST license for the resources they need and pay the HPC provider for the required hardware.
[Figure: CST license + HPC system provider]
Currently supported providers hosting CST STUDIO SUITE:
More information can be found in the HPC section of our website: https://www.cst.com/Products/HPC/Cloud-Computing
HPC Hardware Design Process
A general hardware recommendation is available on our website which helps you to configure standard systems (e.g. workstations) for CST STUDIO SUITE. For HPC systems (multi-GPU systems, clusters) our hardware experts are available to guide you through the whole process of system design and benchmarking to ensure that your new system is compatible with CST STUDIO SUITE and delivers the expected performance.
HPC System Design Process
Personal contact with CST engineers to design the solution.
Benchmarking of the designed computing solution in the hardware test center of the preferred vendor.
Buy the machine if it fulfills your expectations.