MicroConsult Engineering, H. Güttler, June 8, 2016

High Performance Computing with ANSYS (Mechanical)

Examples from Practice

Dr. Herbert Güttler

MicroConsult GmbH

Holunderweg 8

89182 Bernstadt

www.microconsult-engineering.de


MicroConsult:

MCE HPC Cluster

2,000,000,000 DOFs

ANSYS R 17

HPC

Seminars & Presentations

MicroConsult GmbH

Since 2009 ANSYS

Enhanced Solutions Partner

‚Pushing the Limits‘

www.ansys-blog.com/10x-faster-insight-structural-analysis


Tools 2009-2015


Hardware: Dec 2015

416 E5 2690 V3 Haswell cores @ 2.6 GHz

300 E5 2690 V2 Ivy Bridge cores @ 3.0 GHz

64 E5 4627 V2 Ivy Bridge cores @ 3.3 GHz

40 E5 4627 V3 Haswell cores @ 2.6 GHz

128 E5 2690 V1 Sandy Bridge cores @ 2.9 GHz

10 to 37 GB RAM per core (12 TB total)

Accelerators:

10 Kepler K20x

2 Kepler K40x

4 Kepler K80 (dual GPU)

2 Xeon Phi 7120P

10 Xeon Phi 31S1P

Peak Performance ANSYS

single job: 6 TFLOPs (Haswell only)

single node: 0.6 TFLOP (4S Haswell)

single node: 1.1 TFLOP (2S+4K80)

Infiniband interconnect (FDR/QDR)

Compute servers SSD only

Remote Access: 3x HP-RGS

6 Fileservers 350TB

SLES 11 SP03 for compute nodes

Closed-loop air-cooled rack (20 kW)
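As a quick cross-check of the inventory above, the following sketch sums the listed cores and derives the average RAM per core. The dictionary keys are only labels for this example, and treating 12 TB as 12,000 GB is an assumption.

```python
# Core counts per CPU type, copied from the hardware list above.
cores = {
    "E5 2690 V3 Haswell": 416,
    "E5 2690 V2 Ivy Bridge": 300,
    "E5 4627 V2 Ivy Bridge": 64,
    "E5 4627 V3 Haswell": 40,
    "E5 2690 V1 Sandy Bridge": 128,
}

# 12 TB total RAM; decimal units (1 TB = 1000 GB) are an assumption.
total_ram_gb = 12 * 1000

total_cores = sum(cores.values())
avg_ram_per_core = total_ram_gb / total_cores

print(f"total cores: {total_cores}")
print(f"average RAM per core: {avg_ram_per_core:.1f} GB")
```

The average falls inside the 10 to 37 GB per core range quoted on the slide, as expected for a mix of differently equipped nodes.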


Tools 2016

Slaves (Haswell):

96 cores / 8 sockets in 2U @ 1.7 kW

1.25 TB RAM

> 1.2 TFLOPs

Headnodes / GPU nodes:

24 cores + up to 6 GPUs in 2U

0.5 TB RAM

0.8 TFLOPs with Tesla GPUs / 0.5 TFLOPs with Xeon Phi

K80: 2 GPU processors on one PCIe board

Slaves (Broadwell):

112 cores / 8 sockets in 2U @ 1.7 kW

1.25 TB RAM

> 1.4 TFLOPs
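To compare the two slave node types, the quoted figures can be normalized per core and per watt. This is only a sketch: the TFLOPs values on the slide are lower bounds ("> 1.2", "> 1.4"), and the helper names are made up for this example.

```python
# (label, cores, lower-bound TFLOPs, power in kW), taken from the node list above.
nodes = [
    ("Haswell slave", 96, 1.2, 1.7),
    ("Broadwell slave", 112, 1.4, 1.7),
]

def per_core_gflops(tflops: float, cores: int) -> float:
    """Node throughput divided evenly across its cores, in GFLOPs."""
    return tflops * 1000.0 / cores

def gflops_per_watt(tflops: float, kilowatts: float) -> float:
    """Node throughput per watt of the quoted node power."""
    return (tflops * 1000.0) / (kilowatts * 1000.0)

for label, cores, tflops, kw in nodes:
    print(f"{label}: {per_core_gflops(tflops, cores):.1f} GFLOPs/core, "
          f"{gflops_per_watt(tflops, kw):.2f} GFLOPs/W")
```

With these lower bounds, both generations come out at the same throughput per core; the Broadwell node gains from having more cores in the same power envelope.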


Interconnect: FDR Performance (R16)

Latencies:

Latency time from master to core 1 = 1.259 µs

…

Bandwidth:

Communication speed from master to core 1 = 8077.06 MB/sec

…

Ranges compared: core to core (on die), socket to socket, node to node
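A common way to read latency and bandwidth together is the simple linear model: transfer time = latency + message size / bandwidth. The sketch below applies it to the two "master to core 1" values quoted above. Treating 1 MB as 10^6 bytes is an assumption, and the model itself is a textbook simplification, not something taken from the ANSYS output.

```python
# Values for "master to core 1" as quoted on the slide.
LATENCY_S = 1.259e-6            # 1.259 microseconds
BANDWIDTH_BPS = 8077.06e6       # 8077.06 MB/sec, assuming 1 MB = 1e6 bytes

def transfer_time(message_bytes: float) -> float:
    """Estimated time to send one message under the linear latency/bandwidth model."""
    return LATENCY_S + message_bytes / BANDWIDTH_BPS

def break_even_size() -> float:
    """Message size (bytes) at which latency and payload time are equal."""
    return LATENCY_S * BANDWIDTH_BPS

for size in (64, 1024, 1024 * 1024):
    print(f"{size:>8d} bytes: {transfer_time(size) * 1e6:8.2f} µs")
print(f"break-even message size: {break_even_size():.0f} bytes")
```

Messages much smaller than the break-even size are dominated by latency, larger ones by bandwidth, which is why both figures matter for a distributed solver.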

Interconnect: FDR Performance (R17 Intel MPI 5.0.3)



…




Some thoughts about cost:


Example: Ball grid array

Mold

PCB

Solder balls

(SOLID186 and SOLID187 elements)

!! no contact elements !!

M O D E L   S U M M A R Y

                              Processor
                  Number        Max        Min
Elements          814197      58648      43932
Nodes            1344197     104296      77268
Shared Nodes       85903      15477       7424
DOFs             4032591     312888     231804
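The Max and Min columns show how evenly the domain decomposition spreads the model across processes. A small sketch of that check follows; the `imbalance` helper is a name introduced here for illustration.

```python
# (total, per-process max, per-process min), copied from the model summary above.
summary = {
    "Elements": (814197, 58648, 43932),
    "Nodes": (1344197, 104296, 77268),
    "Shared Nodes": (85903, 15477, 7424),
    "DOFs": (4032591, 312888, 231804),
}

def imbalance(per_process_max: int, per_process_min: int) -> float:
    """Ratio of the largest to the smallest share; 1.0 means perfectly balanced."""
    return per_process_max / per_process_min

for name, (total, pmax, pmin) in summary.items():
    print(f"{name:13s} max/min = {imbalance(pmax, pmin):.2f}")

# Each node carries three translational DOFs in this purely structural model.
dofs_per_node = summary["DOFs"][0] / summary["Nodes"][0]
print(f"DOFs per node: {dofs_per_node:.1f}")
```

The most heavily loaded process holds roughly a third more DOFs than the least loaded one, and the slowest process sets the pace of each solver step.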


BGA Scaling

[Figure: rating (runs/day, axis 0 to 400) versus number of cores (axis 0 to 288)]

Series:

• ANSYS 16P2 Ivy Bridge E5 2690 V2 base 16

• ANSYS 16P2 Ivy Bridge E5 2690 V2 base 20

• ANSYS 16P3 Ivy Bridge E5 2690 V2 base 20

• ANSYS 16P4 Haswell E5 V3 Haswell01+02 octal 31S1P

• ANSYS 16P4 Haswell E5 V3 Haswell

• ANSYS 16P4 Haswell E5 2670 V3 dual K40m 340.65

• ANSYS 16P4 Sandy Bridge E5 V0 octal K20

• ANSYS 16P4 Haswell E5 2670 V3 dual K80 340.65

• ANSYS 16P4 Haswell E5 V3 quad K80 2N

• ANSYS 16P4 Haswell E5 2697 V3 quad K80

• ANSYS R17Beta UP20150312

• ANSYS R17Beta UP20150312 + TILING=16

• BGA R17 UP20150922 Haswell @ 286 cores

Breakthrough in performance above ~100 cores with R17
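The rating used on the scaling charts is expressed in runs per day, so it follows directly from the elapsed time of one run: rating = 86,400 s / elapsed seconds. The sketch below shows that conversion and a parallel-efficiency calculation between two core counts. The numbers in the usage lines are hypothetical and are not read off the chart.

```python
SECONDS_PER_DAY = 86_400

def rating(elapsed_seconds: float) -> float:
    """Runs per day that fit into 24 hours at the given elapsed time per run."""
    return SECONDS_PER_DAY / elapsed_seconds

def elapsed_from_rating(runs_per_day: float) -> float:
    """Inverse conversion: elapsed seconds per run for a given rating."""
    return SECONDS_PER_DAY / runs_per_day

def parallel_efficiency(rating_base: float, cores_base: int,
                        rating_n: float, cores_n: int) -> float:
    """Achieved speedup divided by the ideal (linear) speedup."""
    return (rating_n / rating_base) / (cores_n / cores_base)

# Hypothetical example values, for illustration only.
print(rating(864.0))                              # a 864 s run
print(parallel_efficiency(100.0, 64, 300.0, 256)) # 64 -> 256 cores
```

An efficiency near 1.0 means the curve is still on the linear part; the flattening of the older releases above roughly 100 cores shows up as a falling efficiency.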


Accelerated vs. Non-Accelerated

BGA model on various platforms: 2S E5 2690 V3, 2S E5 2697 V3 and 4S E5 4627 V3, combined with Intel and Nvidia accelerators. ANSYS R17.

For the 2S systems with no accelerators, we observe a deviation from the linear increase in performance when using 28 cores. With the 4S system, which provides more memory bandwidth per core, we measure linear behavior up to 40 cores.


Haswell vs. Broadwell. 2S and 4S Systems

[Figure: rating (runs/day, axis 0 to 80) versus number of cores (2, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40)]

Series: E5 2690 V3; E5 2690 V3 COD; E5 2697A V4 2133; E5 2697A V4 2133 COD; E5 4627 V3 (Quad)

Memory bandwidth becomes an issue


I/O bound or compute bound: BGA model

[Figure: rating (runs/day, axis 0 to 60) versus number of cores (2, 4, 8, 12, 16, 20, 24)]

Series:

• ANSYS Haswell E5 2690 V3 single XP941

• ANSYS Haswell E5 2690 V3 XP941 Raid0

• ANSYS Haswell E5 2690 V3 single SATA HD

• ANSYS Haswell E5 2690 V3 single SATA HD + Resta

Running 'incore' (still writing .full, .esav etc., one file per process)

I/O bandwidth becomes an issue


Performance Comparison (BGA, R17)

[Figure: bar chart of rating (runs/day) for the BGA model, axis 0 to 400]

Data labels (first group): 7.52, 23.76, 63.53, 89.53, 155.68, 200.00, 364.56

Data labels (second group): 36.97, 55.38, 77.42, 97.85, 106.80, 111.20, 114.44


GPUs: Pros & Cons

Pro:

• Add numerical performance to previous-generation hardware

• Licensing: a GPU counts as 1 additional core

• Choice between 2 GPU vendors: Nvidia (Tesla) and Intel (Xeon Phi)

• Easily activated

Con:

• Only used for factorization; all other tasks are handled by the CPU

• Works best for mid-sized problems

• Deactivated when pivoting is active

There is no technical advantage of GPUs that cannot be compensated for by using additional conventional cores or latest-generation hardware.
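Because the accelerator only helps during factorization, its effect on total run time is bounded by the share of time spent there. Amdahl's law makes this concrete. The sketch below uses made-up numbers (60% of the run in factorization, factorization 3x faster with a GPU); neither value comes from the slides.

```python
def overall_speedup(factor_fraction: float, factor_speedup: float) -> float:
    """Amdahl's law: only the factorization share of the run is accelerated."""
    return 1.0 / ((1.0 - factor_fraction) + factor_fraction / factor_speedup)

def speedup_limit(factor_fraction: float) -> float:
    """Upper bound reached if factorization took no time at all."""
    return 1.0 / (1.0 - factor_fraction)

# Hypothetical values for illustration only.
fraction_in_factorization = 0.6
gpu_factor_speedup = 3.0

print(f"overall speedup: "
      f"{overall_speedup(fraction_in_factorization, gpu_factor_speedup):.2f}x")
print(f"upper bound:     {speedup_limit(fraction_in_factorization):.2f}x")
```

Under these assumed numbers a 3x faster factorization yields well under 2x overall, which illustrates why only a fraction of the accelerator's listed TFLOPs shows up in elapsed time.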


Applications:

• BGA, LQFP

• Full assembly vs. single devices

• Focus: solder creep


Benchmark Results: Leda Benchmark

ANSYS versions compared (table columns, left to right): ANSYS 11, ANSYS 12, ANSYS 12.1, ANSYS 13 SP02, ANSYS 14, ANSYS 14.5, ANSYS 15.0.7, ANSYS R16 final, ANSYS R17 UP 20150922

Thermal (full model), 3 MDOF:

• 4h (8 cores)

• 1h (8 cores + 1 GPU)

• 0.8h (32 cores)

Thermo-mechanical simulation (full model), 7.8 MDOF:

• ~5.5 days for 163 iterations (8 cores)

• 34.3h for 164 iterations (20 cores)

• 12.5h for 195 iterations (64 cores)

• 6.4h for 196 iterations (128 E5 cores)

• 6.3h (96 E5 cores + 16 GPUs)

• 5.7h for 197 iterations (128 E5 cores)

• 4.3h for 221 iterations (256 E5 Haswell cores)

• 4.2h for 214 iterations (256 E5 Haswell cores) + Trimming

Interpolation of boundary conditions:

• 37h for 16 load steps

• Identical to ANSYS 11

• Identical to ANSYS 11

• 0.2h (improved algorithm)

• 0.2h

Submodel: creep strain analysis, 5.5 MDOF:

• ~5.5 days for 492 iterations (16 cores)

• 5.9h for 498 iterations (64 cores + 8 GPUs)

• 4.2h (256 cores)

Overall duration: 2 weeks, 5 days, 2 days, 1 day, ½ day, ½ day, ½ day
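Since the iteration counts differ between runs, elapsed time per iteration is a fairer basis for comparison than total hours. The sketch below does this for the thermo-mechanical entries that report both hours and iterations. The "~5.5 days" figure is approximate on the slide, so the resulting ratios are rough.

```python
# (configuration, elapsed hours, iterations) from the thermo-mechanical row.
runs = [
    ("8 cores", 5.5 * 24.0, 163),          # "~5.5 days", approximate
    ("20 cores", 34.3, 164),
    ("64 cores", 12.5, 195),
    ("128 E5 cores", 6.4, 196),
    ("128 E5 cores, later release", 5.7, 197),
    ("256 E5 Haswell cores", 4.3, 221),
]

def hours_per_iteration(hours: float, iterations: int) -> float:
    return hours / iterations

baseline = hours_per_iteration(runs[0][1], runs[0][2])

speedups = []
for label, hours, iterations in runs:
    per_iter = hours_per_iteration(hours, iterations)
    speedups.append(baseline / per_iter)
    print(f"{label:30s} {per_iter * 60:6.2f} min/iteration, "
          f"{speedups[-1]:5.1f}x vs. first entry")
```

The combined effect of newer software, newer hardware and more cores is what the per-iteration ratio captures; it does not separate the three contributions.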


Done in half a day? Comparison between 2008 and 2013

[Figure: temperature [K] versus time, comparing passive cycling and active cycling]

Also a BGA, but with a different geometry.

The new type of study requires 10x the numerical effort.


2014: Solder Creep with 5s pulses


[Figure: run time [h] (axis 0 to 80) for two configurations]

Real-world case 'coil' from Sept 2015 (creep simulation, no contacts)

• 128 SB cores, R16: 3 days

• 256 Haswell cores, R17UP20150826: 1 day

===========================

= multifrontal statistics =

===========================

number of equations = 5482467

no. of nonzeroes in lower triangle of a = 313754425

no. of nonzeroes in the factor l = 11828109273

ratio of nonzeroes in factor (min/max) = 0.0000

number of super nodes = 89933

maximum order of a front matrix = 23409

maximum size of a front matrix = 274002345

maximum size of a front trapezoid = 201671967

no. of floating point ops for factor = 9.9553D+13

no. of floating point ops for solve = 4.6363D+10

ratio of flops for factor (min/max) = 0.0072

near zero pivot monitoring activated

number of pivots adjusted = 0

negative pivot monitoring activated

number of negative pivots encountered = 0

factorization panel size = 128

number of cores used = 240

time (cpu & wall) for structure input = 0.110000 0.106551

time (cpu & wall) for ordering = 18.267475 18.267475

time (cpu & wall) for other matrix prep = 4.772525 4.818491

time (cpu & wall) for value input = 0.150000 0.153441

time (cpu & wall) for matrix distrib. = 1.690000 1.693778

time (cpu & wall) for numeric factor = 28.590000 28.688992

computational rate (mflops) for factor = 3482087.702471 3470072.623324

time (cpu & wall) for numeric solve = 0.350000 0.363866

computational rate (mflops) for solve = 132464.403774 127416.390197

effective I/O rate (MB/sec) for solve = 504689.370800 485456.439361
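The computational rates in this log can be reproduced from the other lines: rate = floating point operations / time. The sketch below recomputes the factorization rate from the values above and divides it by the core count. Small deviations from the logged rate are expected because the operation count is printed with only five significant digits.

```python
# Values copied from the multifrontal statistics above.
factor_flops = 9.9553e13          # no. of floating point ops for factor
factor_wall_s = 28.688992         # wall time for numeric factor
solve_flops = 4.6363e10           # no. of floating point ops for solve
solve_wall_s = 0.363866           # wall time for numeric solve
cores_used = 240

def mflops(flops: float, seconds: float) -> float:
    """Computational rate in millions of floating point operations per second."""
    return flops / seconds / 1e6

factor_rate = mflops(factor_flops, factor_wall_s)
solve_rate = mflops(solve_flops, solve_wall_s)
per_core_gflops = factor_rate / cores_used / 1000.0

print(f"factor rate: {factor_rate:,.0f} Mflops (log: 3,470,072.6)")
print(f"solve rate:  {solve_rate:,.0f} Mflops (log: 127,416.4)")
print(f"factor rate per core: {per_core_gflops:.1f} GFLOPs")
```

The factorization runs at roughly 3.5 TFLOPs across 240 cores, while the forward/backward solve is far slower per operation, which is typical because it is limited by memory traffic rather than arithmetic.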


Relative Performance

[Figure: relative performance (axis 0 to 6) versus number of cores (axis 0 to 256)]

Series:

• ANSYS R17Beta UP20150630 Model BGA

• ANSYS R17Beta UP20150630 Model BGA XXL

• ANSYS R17Beta UP20150630 Model iTCU

• ANSYS R17Beta UP20150630 Model Leda Full

• ANSYS R16 Model Ledafull

• ANSYS R17Beta UP20150630 Model Leda sub

Test runs for 20 min @ 256 cores, IVB, based on elapsed times

BGA XXL is an extended version of the BGA model / no contacts

Leda full includes many contacts, Leda sub is an IC model (no contacts)

iTCU uses few contacts


Conclusions

• HPC with ANSYS Mechanical works!

– MCE is running single ANSYS Mechanical jobs on 100+ cores every day

– Various Licensing Models for HPC (starting above 2 cores)

• Best with powerful hardware (lots of RAM + fast interconnect)

• For single compute nodes, multicore machines perform on par with GPU-accelerated solutions (Xeon Phi / Tesla)

Acknowledgements

• Jeff Beisheim, Tim Pawlak, ANSYS Inc.

• Natalja Schafet, Oliver Adamik, Robert Bosch GmbH

• Philipp Schmid, Holger Mai, MicroConsult Engineering GmbH

© CADFEM 2016

CADFEM Engineering Simulation Cloud

An even more flexible rental model and an even broader range of offerings

• Hardware + software bundles

• Everything from a single source

• 1 x ANSYS Mechanical + HPC Workgroup 16 + 3D interactive cloud hardware

– 1 week: € 2,700

– 1 month: € 7,100

• 1 x ANSYS Mechanical Solver + 3 HPC Pack + 100 Haswell cores (batch)

– 1 week: € 3,700

– 1 month: € 9,700

• 1 x ANSYS CFD Solver + 3 HPC Pack + 100 Haswell cores (batch)

– 1 week: € 3,700

– 1 month: € 9,900

http://www.ecadfem.com/3437.html




Appendix


What‘s the actual speed?


Broadwell


Turbo Boost (Broadwell)

Source: https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Broadwell-EP.22_.2814_nm.29


Intel Knights Landing (coming soon)

http://wccftech.com/intel-knights-landing-detailed-16-gb-highbandwidth-ondie-memory-384-gb-ddr4-system-memory-support-8-billion-transistors/

http://www.theplatform.net/2015/03/11/future-xeon-phi-specs-emerge-at-open-compute-summit/

Report for Q4 2015

Turbine model with 2 BDOF

• Meshing: 35 min (tracking inside WB)

• Write ds.dat: 2h (estimate from timestamps)

• Generate .db from ds.dat: 75 min (logfile)

• Solve: 9h (with default 1E-8 criterion, logfile)

In total: about 12h for everything, without postprocessing.

To put these results into context: the original 1 BDOF model took 16h to build and 15h to solve, but with a 1E-5 criterion.

We now have twice the size in half the time, plus better accuracy.
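The totals can be checked by summing the individual steps. The sketch below adds up the four durations listed for the 2 BDOF run and compares them with the 16h + 15h quoted for the earlier 1 BDOF model. Step names are just labels for this example, and the "2h" entry is itself an estimate on the slide.

```python
# Step durations in minutes for the 2 BDOF turbine model, from the list above.
steps_min = {
    "Meshing": 35,
    "Write ds.dat": 2 * 60,          # estimated from timestamps
    "Generate .db from ds.dat": 75,
    "Solve": 9 * 60,
}

total_2bdof_h = sum(steps_min.values()) / 60.0

# Earlier 1 BDOF model: 16h to build plus 15h to solve.
total_1bdof_h = 16 + 15

time_ratio = total_1bdof_h / total_2bdof_h
size_ratio = 2 / 1  # 2 BDOF versus 1 BDOF

print(f"2 BDOF total: {total_2bdof_h:.1f} h")
print(f"1 BDOF total: {total_1bdof_h} h")
print(f"elapsed time reduced by a factor of {time_ratio:.1f}")
print(f"throughput gain (size x time): {size_ratio * time_ratio:.1f}x")
```

The listed steps sum to between 12 and 13 hours, consistent with the rough total given, and the comparison holds even though the newer run used the tighter 1E-8 convergence criterion.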


GPUs: Pros & Cons

Pro:

• Add numerical performance to previous-generation hardware

• Licensing: a GPU counts as 1 additional core

• Choice between 2 GPU vendors: Nvidia (Tesla) and Intel (Xeon Phi)

• Easily activated

Con:

• Only used for factorization; all other tasks are handled by the CPU, so only a fraction of the listed numerical performance (TFLOPs) is used

• Works best for mid-sized problems: the local memory of the GPU limits the maximum problem size, and the overhead of addressing the GPU limits its use for small problems

• Deactivated when pivoting is active (happens with a lot of joints)

There is no advantage of GPUs that cannot be compensated for by using additional conventional cores or latest-generation hardware.
