MicroConsult Engineering H.Güttler June 08, 2016 page 1
High Performance Computing with ANSYS (Mechanical).
Examples from practice
Dr. Herbert Güttler
MicroConsult GmbH
Holunderweg 8
89182 Bernstadt
www.microconsult-engineering.de
MicroConsult:
MCE HPC Cluster
2,000,000,000 DOFs
ANSYS R 17
HPC
Seminars & Presentations
MicroConsult GmbH
Since 2009 ANSYS Enhanced Solutions Partner
'Pushing the Limits'
www.ansys-blog.com/10x-faster-insight-structural-analysis
Tools 2009-2015
Hardware: Dec 2015
416 E5 2690 V3 Haswell cores @ 2.6 GHz
300 E5 2690 V2 Ivy Bridge cores @ 3.0 GHz
64 E5 4627 V2 Ivy Bridge cores @ 3.3 GHz
40 E5 4627 V3 Haswell cores @ 2.6 GHz
128 E5 2690 V1 Sandy Bridge cores @ 2.9 GHz
10..37 GB / core RAM (12 TB total)
Accelerators:
10 Kepler K20x
2 Kepler K40x
4 Kepler K80 (dual GPU)
2 Xeon Phi 7120P
10 Xeon Phi 31S1P
Peak Performance ANSYS
single job: 6 TFLOPs (Haswell only)
single node: 0.6 TFLOP (4S Haswell)
single node: 1.1 TFLOP (2S+4K80)
Infiniband interconnect (FDR/QDR)
Compute servers SSD only
Remote Access: 3x HP-RGS
6 file servers, 350 TB
SLES 11 SP03 for compute nodes
Closed-loop air-cooled rack (20 kW)
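The inventory above can be cross-checked with a few lines of arithmetic. The sketch below sums the listed cores and derives two averages. It assumes that "12 TB" means 12 x 1024 GB and that the 6 TFLOPs single-job figure was reached on all Haswell cores; neither is stated on the slide.

```python
# CPU inventory from the hardware slide: (model, cores, clock in GHz).
cpu_pools = [
    ("E5 2690 V3 Haswell", 416, 2.6),
    ("E5 2690 V2 Ivy Bridge", 300, 3.0),
    ("E5 4627 V2 Ivy Bridge", 64, 3.3),
    ("E5 4627 V3 Haswell", 40, 2.6),
    ("E5 2690 V1 Sandy Bridge", 128, 2.9),
]

total_cores = sum(cores for _, cores, _ in cpu_pools)
haswell_cores = sum(cores for name, cores, _ in cpu_pools if "Haswell" in name)

# Assumption: "12 TB total" is read as 12 * 1024 GB.
total_ram_gb = 12 * 1024
avg_ram_per_core_gb = total_ram_gb / total_cores

# Assumption: the 6 TFLOPs single-job figure used every Haswell core.
gflops_per_haswell_core = 6000 / haswell_cores

print(f"total cores: {total_cores}, Haswell cores: {haswell_cores}")
print(f"average RAM per core: {avg_ram_per_core_gb:.1f} GB")
print(f"GFLOPs per Haswell core: {gflops_per_haswell_core:.1f}")
```

The average RAM per core falls inside the 10 to 37 GB range quoted on the slide, which is a useful sanity check on the totals.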
Tools 2016
Slaves: (Haswell)
96 cores / 8 sockets in 2U @ 1.7 kW
1.25 TB RAM
> 1.2 TFLOPs
Headnodes / GPU nodes
24 cores + up to 6 GPUs in 2U
0.5 TB RAM
0.8 TFLOPs Tesla GPUs /
0.5 TFLOPs with XeonPhi
K80:
2 GPU Processors
on one PCIe board
Slaves: (Broadwell)
112 cores / 8 sockets in 2U @ 1.7 kW
1.25 TB RAM
> 1.4 TFLOPs
Interconnect: FDR Performance (R16)
Latencies:
Latency time from master to core 1 = 1.259 µs
Latency time from master to core 2 = 1.175 µs
…
Latency time from master to core 9 = 2.183 µs
Latency time from master to core 10 = 2.393 µs
…
Latency time from master to core 16 = 1.979 µs
Latency time from master to core 31 = 2.119 µs
Bandwidth:
Communication speed from master to core 1 = 8077.06 MB/sec
Communication speed from master to core 2 = 8857.00 MB/sec
…
Communication speed from master to core 9 = 5312.38 MB/sec
Communication speed from master to core 10 = 5377.34 MB/sec
…
Communication speed from master to core 16 = 5121.90 MB/sec
Communication speed from master to core 31 = 4925.74 MB/sec
Groups: core-core on die, socket-socket, node-node
Interconnect: FDR Performance (R17 Intel MPI 5.0.3)
Latency time from master to core 1 = 0.922 µs
Latency time from master to core 2 = 0.893 µs
…
Latency time from master to core 11 = 1.391 µs
Latency time from master to core 12 = 1.304 µs
…
Latency time from master to core 22 = 1.884 µs
Latency time from master to core 242 = 1.850 µs
Communication speed from master to core 1 = 7390.67 MB/sec
Communication speed from master to core 2 = 8166.53 MB/sec
…
Communication speed from master to core 11 = 6475.93 MB/sec
Communication speed from master to core 12 = 6484.40 MB/sec
…
Communication speed from master to core 22 = 5789.84 MB/sec
Communication speed from master to core 242 = 5790.00 MB/sec
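To see what these figures mean for a single message, a simple latency-plus-bandwidth model can be applied. This is a minimal sketch, not how ANSYS or Intel MPI measure these values; the message sizes are hypothetical, and 1 MB is assumed to be 10^6 bytes.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_s):
    """Estimated time for one message: latency + size / bandwidth.

    Assumption: 1 MB = 1e6 bytes, so MB/s is numerically bytes per microsecond.
    """
    return latency_us + message_bytes / bandwidth_mb_per_s

# R17 figures from the slide: (latency in µs, bandwidth in MB/s).
links = {
    "core 1": (0.922, 7390.67),
    "core 242": (1.850, 5790.00),
}

for size in (1_000, 1_000_000):  # message sizes in bytes (hypothetical)
    for name, (lat, bw) in links.items():
        t = transfer_time_us(size, lat, bw)
        print(f"{size:>9} B to {name}: {t:8.2f} µs")
```

For small messages the latency term dominates, for large messages the bandwidth term does, which is why both numbers are reported.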
Some thoughts about cost:
Example: Ball grid array
Mold
PCB
Solder balls
(solid 186 & 187)
!! no contact elements !!
M O D E L   S U M M A R Y
                            Processor
               Number      Max       Min
Elements       814197    58648     43932
Nodes         1344197   104296     77268
Shared Nodes    85903    15477      7424
DOFs          4032591   312888    231804
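Two quick checks can be read off the model summary: the number of DOFs per node and how evenly the domain decomposition spreads the work. The sketch below uses only the numbers printed above; the max/min ratio is just one possible balance measure, chosen here for illustration.

```python
# Values from the MODEL SUMMARY on the slide.
summary = {
    "Elements": {"total": 814197, "max": 58648, "min": 43932},
    "Nodes": {"total": 1344197, "max": 104296, "min": 77268},
    "DOFs": {"total": 4032591, "max": 312888, "min": 231804},
}

dofs_per_node = summary["DOFs"]["total"] / summary["Nodes"]["total"]

# Max/min ratio across processes: 1.0 would be a perfectly even split.
imbalance = {key: row["max"] / row["min"] for key, row in summary.items()}

print(f"DOFs per node: {dofs_per_node}")
for key, ratio in imbalance.items():
    print(f"{key}: max/min = {ratio:.2f}")
```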
BGA Scaling
Chart: rating (runs/day, axis 0 to 400) vs. # cores (0 to 288). Series:
ANSYS 16P2 Ivy Bridge E5 2690 V2 base 16
ANSYS 16P2 Ivy Bridge E5 2690 V2 base 20
ANSYS 16P3 Ivy Bridge E5 2690 V2 base 20
ANSYS 16P4 Haswell E5 V3 Haswell01+02 octal 31S1P
ANSYS 16P4 Haswell E5 V3 Haswell
ANSYS 16P4 Haswell E5 2670 V3 dual K40m 340.65
ANSYS 16P4 Sandy Bridge E5 V0 octal K20
ANSYS 16P4 Haswell E5 2670 V3 dual K80 340.65
ANSYS 16P4 Haswell E5 V3 quad K80 2N
ANSYS 16P4 Haswell E5 2697 V3 quad K80
ANSYS R17Beta UP20150312
ANSYS R17Beta UP20150312 + TILING=16
BGA R17 UP20150922 Haswell @ 286 cores
Breakthrough in performance above ~100 cores with R17
Accelerated vs. Non-Accelerated
BGA model on various platforms: 2S E5 2690 V3, 2S E5 2697 V3 and 4S E5 4627 V3,
combined with Intel and Nvidia accelerators. ANSYS R17
For the 2S systems with no accelerators, we observe a deviation
from the linear increase in performance when using 28 cores. With the
4S system, which provides more memory bandwidth per core, we
measure linear behavior up to 40 cores.
Haswell vs. Broadwell. 2S and 4S Systems
Chart: rating [runs/day] (axis 0 to 80) vs. # cores (2 to 40). Series:
E5 2690 V3
E5 2690 V3 COD
E5 2697A V4 2133
E5 2697A V4 2133 COD
E5 4627 V3 (Quad)
Memory bandwidth becomes an issue
I/O bound or compute bound BGA model
Chart: rating (runs/day, axis 0 to 60) vs. # cores (2 to 24). Series:
ANSYS Haswell E5 2690 V3 single XP941
ANSYS Haswell E5 2690 V3 XP941 Raid0
ANSYS Haswell E5 2690 V3 single SATA HD
ANSYS Haswell E5 2690 V3 single SATA HD + Resta
Running 'incore' (still writing .full, .esav etc., one file per process)
I/O bandwidth becomes an issue
Performance Comparison (BGA, R17)
Chart: rating [runs/day] for BGA (axis 0 to 400). Data labels:
7.52, 23.76, 63.53, 89.53, 155.68, 200.00, 364.56
36.97, 55.38, 77.42, 97.85, 106.80, 111.20, 114.44
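The rating used in these charts is given in runs per day. Assuming the usual benchmark convention that a rating is 86,400 seconds divided by the elapsed time of one run (the slides do not spell this out), ratings convert to run times as in this sketch:

```python
SECONDS_PER_DAY = 86_400

def elapsed_seconds(rating_runs_per_day):
    """Elapsed time of one run for a given rating (assumed: rating = 86400 / seconds)."""
    return SECONDS_PER_DAY / rating_runs_per_day

def rating(elapsed_s):
    """Inverse conversion: runs per day for a given elapsed time in seconds."""
    return SECONDS_PER_DAY / elapsed_s

for r in (7.52, 114.44, 364.56):  # ratings taken from the chart labels
    print(f"{r:7.2f} runs/day -> {elapsed_seconds(r) / 60:6.1f} min per run")
```

Under that assumption, the highest rating in the chart corresponds to a run time of roughly four minutes.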
GPUs: Pros & Cons
Pro:
Add numerical performance to previous
generation hardware
Licensing: GPU counts as 1 additional
core
Choice between 2 GPU vendors:
Nvidia / Tesla
Intel / Xeon Phi
Easily activated
Con:
Only used with factorization, all other
tasks are handled by CPU
Works best for mid sized problems
Deactivated when pivoting is active
GPUs offer no technical advantage that cannot be matched by
additional conventional cores or latest-generation hardware
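The first "Con" item can be made concrete with Amdahl's law: if only the factorization is accelerated, the remaining work caps the overall gain. The sketch below is an illustration only; the factorization shares and GPU speedups are hypothetical inputs, not measurements from this talk.

```python
def overall_speedup(accelerated_fraction, accelerator_speedup):
    """Amdahl's law: only `accelerated_fraction` of the run time gets faster."""
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / accelerator_speedup)

# Hypothetical inputs: factorization share of the run time, GPU speedup on it.
for fraction in (0.5, 0.8):
    for gpu_speedup in (2.0, 5.0):
        s = overall_speedup(fraction, gpu_speedup)
        print(f"factor share {fraction:.0%}, GPU x{gpu_speedup:.0f}: overall x{s:.2f}")
```

Even an arbitrarily fast accelerator cannot deliver more than 1 / (1 - fraction) overall, for example 2x when factorization is half of the run time.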
Applications:
• BGA, LQFP
• Full assembly vs. single devices
• Focus Solder Creep
Benchmark Results: Leda Benchmark
ANSYS versions compared (table columns): ANSYS 11, ANSYS 12, ANSYS 12.1, ANSYS 13 SP02, ANSYS 14, ANSYS 14.5, ANSYS 15.0.7, ANSYS R16 final, ANSYS R17 UP 20150922
Thermal (full model), 3 MDOF:
  4h (8 cores)
  1h (8 cores + 1 GPU)
  0.8h (32 cores)
Thermo-mechanical simulation (full model), 7.8 MDOF:
  ~5.5 days for 163 iterations (8 cores)
  34.3h for 164 iterations (20 cores)
  12.5h for 195 iterations (64 cores)
  9.9h for 195 iterations (64 cores)
  7.5h for 195 iterations (128 cores)
  6.4h for 196 iterations (128 E5 cores)
  6.3h (96 E5 cores + 16 GPUs)
  5.7h for 197 iterations (128 E5 cores)
  4.3h for 221 iterations (256 E5 Haswell cores)
  4.2h for 214 iterations (256 E5 Haswell cores) + Trimming
Interpolation of boundary conditions:
  37h for 16 load steps
  Identical to ANSYS 11
  Identical to ANSYS 11
  0.2h (improved algorithm)
  0.2h
Submodel: creep strain analysis, 5.5 MDOF:
  ~5.5 days for 492 iterations (16 cores)
  38.5h for 492 iterations (16 cores)
  8.5h for 492 iterations (76 cores)
  6.1h for 488 iterations (128 cores)
  5.9h for 498 iterations (64 cores + 8 GPUs)
  4.2h (256 cores)
  4.0h for 498 iterations (128 E5 cores)
  4.2h for 488 iterations (128 E5 cores)
  2.8h for 427 iterations (256 E5 Haswell cores)
  1.8h for 427 iterations (256 E5 Haswell cores)
2 weeks | 5 days | 2 days | 1 day | ½ day | ½ day | ½ day
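The improvement over the years can be expressed as a speedup factor. The sketch below compares the first and last entries of the thermo-mechanical and submodel rows; it treats "~5.5 days" as exactly 132 hours and ignores the differing iteration counts, so the factors are rough estimates.

```python
HOURS_PER_DAY = 24

# First and last entries of two rows from the table (run time in hours).
rows = {
    "Thermo-mechanical, full model": (5.5 * HOURS_PER_DAY, 4.2),
    "Submodel creep strain": (5.5 * HOURS_PER_DAY, 1.8),
}

def speedup(old_hours, new_hours):
    """Ratio of old to new run time."""
    return old_hours / new_hours

for name, (old, new) in rows.items():
    print(f"{name}: {old:.0f} h -> {new} h, about {speedup(old, new):.0f}x")
```

These factors combine newer software releases with more and faster cores, so they should not be read as the gain from either one alone.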
Done in half a day? Comparison between 2008 and 2013
Chart: temperature [K] vs. time for passive cycling and active cycling
Also a BGA, but different geometry.
New type of study requires 10x numerical effort
2014: Solder Creep with 5s pulses
Real world case 'coil' from Sept 2015 (creep simulation, no contacts)
Chart: run time [h] (axis 0 to 80)
128 SB cores, R16: 3 days
256 Haswell cores, R17UP20150826: 1 day
===========================
= multifrontal statistics =
===========================
number of equations = 5482467
no. of nonzeroes in lower triangle of a = 313754425
no. of nonzeroes in the factor l = 11828109273
ratio of nonzeroes in factor (min/max) = 0.0000
number of super nodes = 89933
maximum order of a front matrix = 23409
maximum size of a front matrix = 274002345
maximum size of a front trapezoid = 201671967
no. of floating point ops for factor = 9.9553D+13
no. of floating point ops for solve = 4.6363D+10
ratio of flops for factor (min/max) = 0.0072
near zero pivot monitoring activated
number of pivots adjusted = 0
negative pivot monitoring activated
number of negative pivots encountered = 0
factorization panel size = 128
number of cores used = 240
time (cpu & wall) for structure input = 0.110000 0.106551
time (cpu & wall) for ordering = 18.267475 18.267475
time (cpu & wall) for other matrix prep = 4.772525 4.818491
time (cpu & wall) for value input = 0.150000 0.153441
time (cpu & wall) for matrix distrib. = 1.690000 1.693778
time (cpu & wall) for numeric factor = 28.590000 28.688992
computational rate (mflops) for factor = 3482087.702471 3470072.623324
time (cpu & wall) for numeric solve = 0.350000 0.363866
computational rate (mflops) for solve = 132464.403774 127416.390197
effective I/O rate (MB/sec) for solve = 504689.370800 485456.439361
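The computational rate in this log can be reproduced from the flop count and the wall time, and broken down per core. The sketch below uses only values from the statistics above; the flop count is printed with five significant digits, so the result matches the logged rate only approximately.

```python
factor_flops = 9.9553e13        # "no. of floating point ops for factor"
factor_wall_s = 28.688992       # wall time for numeric factor
cores = 240                     # "number of cores used"

rate_mflops = factor_flops / factor_wall_s / 1e6
rate_gflops_per_core = rate_mflops / 1e3 / cores

# Wall times of all listed phases, in the order they appear in the log.
phase_wall_s = [0.106551, 18.267475, 4.818491, 0.153441,
                1.693778, 28.688992, 0.363866]
factor_share = factor_wall_s / sum(phase_wall_s)

print(f"{rate_mflops:.0f} Mflops total, {rate_gflops_per_core:.1f} GFLOPs per core")
print(f"numeric factorization share of listed wall time: {factor_share:.0%}")
```

The share of the numeric factorization is the part of the solver time that an accelerator could address at all.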
Relative Performance
Chart: relative performance (axis 0 to 6) vs. # cores (0 to 256). Series:
ANSYS R17Beta UP20150630 Model BGA
ANSYS R17Beta UP20150630 Model BGA XXL
ANSYS R17Beta UP20150630 Model iTCU
ANSYS R17Beta UP20150630 Model Leda Full
ANSYS R16 Model Ledafull
ANSYS R17Beta UP20150630 Model Leda sub
Test runs for 20 min @ 256 cores, IVB, based on elapsed times
BGA XXL is an extended version of the BGA model / no contacts
Leda full includes many contacts, Leda sub is an IC model (no contacts)
iTCU uses few contacts
Conclusions
• HPC with ANSYS Mechanical works!
– MCE is running single ANSYS Mechanical jobs on 100+ cores every day
– Various Licensing Models for HPC (starting above 2 cores)
• Best with powerful hardware (lots of RAM + fast interconnect)
• For single compute nodes, multicore machines perform on par with GPU accelerated
solutions (XPhi / Tesla)
Acknowledgements
• Jeff Beisheim, Tim Pawlak, ANSYS Inc.
• Natalja Schafet, Oliver Adamik, Robert Bosch GmbH
• Philipp Schmid, Holger Mai, MicroConsult Engineering GmbH
© CADFEM 2016
CADFEM Engineering Simulation Cloud
Even more flexible rental model, even broader range of offerings
• Hardware + software bundles
• Everything from a single source
• 1 x ANSYS Mechanical + HPC Workgroup 16 + 3D interactive cloud hardware
– 1 week: € 2,700
– 1 month: € 7,100
• 1 x ANSYS Mechanical Solver + 3 HPC Pack + 100 Haswell cores (batch)
– 1 week: € 3,700
– 1 month: € 9,700
• 1 x ANSYS CFD Solver + 3 HPC Pack + 100 Haswell cores (batch)
– 1 week: € 3,700
– 1 month: € 9,900
http://www.ecadfem.com/3437.html
Appendix
What's the actual speed?
Broadwell
Turbo Boost (Broadwell)
Source: https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Broadwell-EP.22_.2814_nm.29
Intel Knights Landing (coming soon)
http://wccftech.com/intel-knights-landing-detailed-16-gb-highbandwidth-ondie-memory-384-gb-ddr4-system-memory-support-8-billion-transistors/
http://www.theplatform.net/2015/03/11/future-xeon-phi-specs-emerge-at-open-compute-summit/
Report for Q4 2015
Turbine model with
2 BDOF
Meshing 35min (tracking inside WB)
Write ds.dat 2h (estimate from timestamps)
Generate .db from ds.dat: 75min (logfile)
Solve: 9h (with default 1E-8 criterion, logfile)
In total: about 12h for everything, without postprocessing.
To put these results into context:
The original 1 BDOF model took 16h to make and 15h to solve,
but with a 1E-5 criterion.
We now have twice the size in half of the time,
plus better accuracy.
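The totals quoted in this report can be checked by adding up the individual steps. A minimal sketch using only the durations listed above:

```python
# Step durations from the Q4 2015 report, in hours.
steps_2bdof = {
    "meshing": 35 / 60,
    "write ds.dat": 2.0,
    "generate .db": 75 / 60,
    "solve": 9.0,
}
total_2bdof = sum(steps_2bdof.values())

# Earlier 1 BDOF model: 16 h to build plus 15 h to solve.
total_1bdof = 16 + 15

print(f"2 BDOF total: {total_2bdof:.1f} h")
print(f"1 BDOF total: {total_1bdof} h, time ratio {total_2bdof / total_1bdof:.2f}")
```

The summed steps come to just under 13 hours, slightly above the "about 12h" quoted, and the time ratio stays below one half, consistent with "twice the size in half of the time".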
GPUs: Pros & Cons
Pro:
Add numerical performance to previous
generation hardware
Licensing: GPU counts as 1 additional
core
Choice between 2 GPU vendors:
Nvidia / Tesla
Intel / Xeon Phi
Easily activated
Con:
Only used for factorization; all other tasks are handled by the CPU,
so only a fraction of the listed numerical performance (TFLOPs) is used
Works best for mid-sized problems: the local memory of the GPU limits
the maximum problem size, and the overhead of addressing the GPU limits
its use for small problems
Deactivated when pivoting is active (happens with a lot of joints)
GPUs offer no advantage that cannot be matched by additional
conventional cores or latest-generation hardware.