
ANSYS HPC Technology Leadership
Barbara Hutchings ([email protected])
© 2011 ANSYS, Inc. September 20, 2011

Why ANSYS Users Need HPC
Insight you can't get any other way

HPC enables high fidelity
• Include details for reliable results
• Be sure your design is "right"
• Innovate with confidence

HPC delivers throughput
• Consider multiple design ideas
• Optimize the design
• Ensure performance across a range of conditions

ANSYS HPC Leadership
A History of HPC Performance

1980s ► Vector processing on mainframes
1990 ► Shared memory multiprocessing for structural simulations
1993 ► 1st general-purpose parallel CFD with interactive client-server user environment
1994 ► Iterative PCG solver introduced for large structural analysis
1994 - 1995 ► Parallel dynamic mesh refinement and coarsening ► Dynamic load balancing
1998 - 1999 ► Integration with load management systems ► Support for Linux clusters, low-latency interconnects ► 10M cell fluids simulations, 128 processors
1999 - 2000 ► 64-bit large memory addressing ► Shared memory multiprocessing (HFSS 7)
2001 - 2003 ► Parallel dynamic moving/deforming mesh ► Distributed memory particle tracking
2004 ► 1st company to solve 100M structural DOF
2005 - 2006 ► Parallel meshing (fluids) ► Support for clusters using Windows HPC
2005 - 2007 ► Distributed sparse solver ► Distributed PCG solver ► Variational Technology ► DANSYS released ► Distributed Solve (DSO), HFSS 10
2007 - 2008 ► Optimized performance on multicore processors ► 1st one-billion-cell fluids simulation
2009 ► Ideal scaling to 2048 cores (fluids) ► Teraflop performance at 512 cores (structures) ► Parallel I/O (fluids) ► Domain decomposition introduced (HFSS 12)
2010 ► Hybrid parallel for sustained multi-core performance (fluids) ► GPU acceleration (structures)

► 15% spent on R&D
► 570 software developers
► Partner relationships

Today's multi-core / many-core hardware evolution makes HPC a software development imperative. ANSYS is committed to maintaining performance leadership.

HPC – A Software Development Imperative

• Clock Speed – Leveling off

• Core Counts – Growing; exploding on GPUs

• Future performance depends on highly scalable parallel software

Source: http://www.lanl.gov/news/index.php/fuseaction/1663.article/d/20085/id/13277

ANSYS FLUENT Scaling Achievement

Systems keep improving: faster processors, more cores
• Ideal rating (speed) doubled in two years!

Memory bandwidth per core and network latency/BW stress scalability
• 2008 release (12.0) re-architected MPI – huge scaling improvement, for a while…
• 2010 release (13.0) introduces hybrid parallelism – and scaling continues!

[Charts: ANSYS FLUENT rating vs. number of cores. 2008 hardware (Intel Harpertown, DDR IB), up to 1,200 cores: releases 6.3.35 and 12.0.5 vs. ideal scaling. 2010 hardware (Intel Westmere, QDR IB), up to 1,536 cores: releases 12.1.0 and 13.0.0 vs. ideal scaling.]

Extreme CFD Scaling - 1000s of cores

Enabled by ongoing software innovation

Hybrid parallel: fast shared memory communication (OpenMP) within a machine to speed up overall solver performance; distributed memory (MPI) between machines
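To make the hybrid model concrete, here is a minimal sketch of the pattern (illustrative only, not ANSYS source): one MPI rank per machine communicates across nodes, while OpenMP threads share the work inside each rank. The cell count and the toy update loop are made up for the example.

```c
/* Minimal hybrid MPI + OpenMP sketch (illustrative, not ANSYS code).
 * Each MPI rank owns a slab of cells; OpenMP threads update the slab
 * in shared memory, and MPI reduces a residual across machines. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* Ask for an MPI library that tolerates threaded ranks. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int ncells = 1000000;              /* cells per rank (toy size) */
    double *u = malloc(ncells * sizeof *u);
    double local_res = 0.0, global_res = 0.0;

    /* Shared-memory parallelism inside the machine: OpenMP threads. */
    #pragma omp parallel for reduction(+:local_res)
    for (int i = 0; i < ncells; ++i) {
        u[i] = (double)(i % 17) * 0.5;       /* stand-in for a solver update */
        local_res += u[i] * u[i];
    }

    /* Distributed-memory communication between machines: MPI. */
    MPI_Allreduce(&local_res, &global_res, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d residual=%e\n",
               nranks, omp_get_max_threads(), global_res);

    free(u);
    MPI_Finalize();
    return 0;
}
```

A run would typically place one rank per machine (with Open MPI, something like mpirun -np 4 --map-by node ./hybrid) and set OMP_NUM_THREADS to the number of cores per machine.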

[Chart: "Scaling to Thousands of Cores" – core solver rating vs. number of cores (up to 3,840) on the 111M cell truck benchmark, comparing ANSYS Fluent 13.0 and ANSYS Fluent 14.0 (pre-release).]

Parallel Scaling ANSYS Mechanical

[Charts: solution rating vs. number of cores, R12.1 vs. R13.0 – Sparse Solver (parallel re-ordering), up to 256 cores, and PCG Solver (pre-conditioner scaling), up to 64 cores.]

Focus on bottlenecks in the distributed memory solvers (DANSYS):

● Sparse Solver
• Parallelized equation ordering
• 40% faster with updated Intel MKL

● Preconditioned Conjugate Gradient (PCG) Solver
• Parallelized preconditioning step (a minimal PCG sketch follows below)
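For readers who have not seen the algorithm, the following is a minimal serial Jacobi-preconditioned conjugate gradient sketch. It only shows where the preconditioning step sits in the iteration; it is not the DANSYS solver, and the 4x4 test matrix and tolerance are invented for the example.

```c
/* Minimal Jacobi-preconditioned CG sketch (illustrative only, serial,
 * dense storage). The "apply preconditioner" step z = M^{-1} r is the
 * part that distributed solvers parallelize across cores. */
#include <stdio.h>
#include <math.h>

#define N 4

static void matvec(const double A[N][N], const double x[N], double y[N]) {
    for (int i = 0; i < N; ++i) {
        y[i] = 0.0;
        for (int j = 0; j < N; ++j) y[i] += A[i][j] * x[j];
    }
}

static double dot(const double a[N], const double b[N]) {
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main(void) {
    /* Small SPD test matrix and right-hand side. */
    double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
    double b[N] = {1,2,3,4}, x[N] = {0,0,0,0};
    double r[N], z[N], p[N], Ap[N];

    matvec(A, x, r);
    for (int i = 0; i < N; ++i) r[i] = b[i] - r[i];        /* r = b - A x           */
    for (int i = 0; i < N; ++i) z[i] = r[i] / A[i][i];     /* z = M^{-1} r (Jacobi) */
    for (int i = 0; i < N; ++i) p[i] = z[i];
    double rz = dot(r, z);

    for (int it = 0; it < 100 && sqrt(dot(r, r)) > 1e-10; ++it) {
        matvec(A, p, Ap);
        double alpha = rz / dot(p, Ap);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (int i = 0; i < N; ++i) z[i] = r[i] / A[i][i]; /* preconditioning step  */
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < N; ++i) p[i] = z[i] + beta * p[i];
    }
    for (int i = 0; i < N; ++i) printf("x[%d] = %f\n", i, x[i]);
    return 0;
}
```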

Architecture-Aware Partitioning

Original partitions are remapped to the cluster considering the network topology and latencies (a toy remapping sketch follows below)

Minimizes inter-machine traffic, reducing load on network switches

Improves performance, particularly on slow interconnects and/or large clusters

[Figure: partition graph, 3 machines with 8 cores each; colors indicate machines. Original mapping vs. new mapping.]
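As a toy illustration of the remapping idea (this is not the ANSYS partitioning algorithm), the sketch below greedily co-locates partitions that share the most faces onto the same machine and reports how many faces still cross the network. The face counts and machine sizes are made up.

```c
/* Toy sketch of architecture-aware remapping (not the ANSYS algorithm):
 * given face counts between mesh partitions, greedily co-locate heavily
 * communicating partitions on the same machine so that fewer faces
 * cross the inter-machine network. */
#include <stdio.h>

#define NPART 6   /* mesh partitions (one per core in this toy) */
#define NMACH 3   /* machines in the cluster                    */
#define CAP   2   /* cores (partitions) per machine             */

int main(void) {
    /* faces[i][j]: number of shared faces between partitions i and j. */
    int faces[NPART][NPART] = {
        { 0, 90,  5,  5,  2,  1},
        {90,  0,  4,  6,  3,  2},
        { 5,  4,  0, 80,  7,  3},
        { 5,  6, 80,  0,  5,  4},
        { 2,  3,  7,  5,  0, 70},
        { 1,  2,  3,  4, 70,  0},
    };
    int machine[NPART];
    int load[NMACH] = {0};
    for (int i = 0; i < NPART; ++i) machine[i] = -1;

    /* Repeatedly pick the heaviest face count between an unplaced
     * partition and any other partition, and keep the pair together. */
    for (int placed = 0; placed < NPART; ) {
        int bi = -1, bj = -1, best = -1;
        for (int i = 0; i < NPART; ++i)
            for (int j = 0; j < NPART; ++j)
                if (machine[i] == -1 && i != j && faces[i][j] > best) {
                    best = faces[i][j]; bi = i; bj = j;
                }
        int m;
        if (machine[bj] != -1 && load[machine[bj]] < CAP) {
            m = machine[bj];                                /* join the partner's machine */
        } else {
            for (m = 0; m < NMACH && load[m] >= CAP; ++m) ; /* first machine with room    */
        }
        machine[bi] = m; load[m]++; placed++;
    }

    int cross = 0;
    for (int i = 0; i < NPART; ++i)
        for (int j = i + 1; j < NPART; ++j)
            if (machine[i] != machine[j]) cross += faces[i][j];

    for (int i = 0; i < NPART; ++i)
        printf("partition %d -> machine %d\n", i, machine[i]);
    printf("faces crossing the network: %d\n", cross);
    return 0;
}
```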

File I/O Performance

Case file IO

• Both read and write significantly faster in R13

• A combination of serial-IO optimizations as well as parallel-IO techniques, where available

Parallel-IO (.pdat)

• Significant speedup of parallel IO, particularly for cases with a large number of zones (a minimal MPI-IO sketch follows the results below)

• Support for Lustre, EMC/MPFS, AIX/GPFS file systems added

Data file IO (.dat)

• Performance in R12 was highly optimized. Further incremental improvements done in R13

Parallel data write, R12 vs. R13:
  BMW:        -68%
  FL5L2 4M:   -63%
  Circuit:    -97%
  Truck 14M:  -64%

truck_14m, case read (12.1.0 vs. 13.0.0 by core count):
  Cores:    48     96     192    384
  12.1.0:   91.2   79.5   90.4   94.4
  13.0.0:   37.3   35.2   39.6   45.3
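As a rough illustration of what parallel-IO means here (not the Fluent .pdat implementation), the sketch below has every MPI rank write its own slice of a field array into one shared file with a collective MPI-IO call. The file name and array sizes are invented for the example.

```c
/* Minimal MPI-IO sketch (illustrative): each rank writes its slice of
 * a solution array into one shared file using a collective write, so
 * a parallel file system (e.g. Lustre or GPFS) can serve ranks in parallel. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int ncells = 100000;                 /* cells owned by this rank (toy) */
    double *field = malloc(ncells * sizeof *field);
    for (int i = 0; i < ncells; ++i)
        field[i] = rank + 0.001 * i;           /* stand-in for solver data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "field.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the _all variant is collective,
     * letting the MPI library aggregate requests for the file system. */
    MPI_Offset offset = (MPI_Offset)rank * ncells * sizeof(double);
    MPI_File_write_at_all(fh, offset, field, ncells, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(field);
    MPI_Finalize();
    return 0;
}
```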

What about GPU Computing?

CPUs and GPUs work in a collaborative fashion (a generic offload sketch follows below)

Multi-core processors (CPU)
• Typically 4-6 cores
• Powerful, general purpose

Many-core processors (GPU)
• Typically hundreds of cores
• Great for highly parallel code, within memory constraints

[Diagram: CPU and GPU connected by a PCI Express channel.]
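A generic sketch of this offload pattern (not ANSYS code), assuming the CUDA runtime and cuBLAS are available: the CPU stages matrices across the PCI Express channel, the GPU performs a dense matrix multiply, and the result is copied back to host memory.

```c
/* Generic CPU + GPU collaboration sketch (illustrative only):
 * copy data across PCI Express, run a dense matrix multiply on the
 * GPU with cuBLAS, and copy the result back to host memory. */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1024;
    size_t bytes = (size_t)n * n * sizeof(double);
    double *hA = malloc(bytes), *hB = malloc(bytes), *hC = malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0; hB[i] = 2.0; }

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);                    /* allocate GPU memory */
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  /* transfer over PCIe  */
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    /* C = alpha*A*B + beta*C, executed on the many-core GPU. */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  /* result back to CPU  */
    printf("C[0] = %f\n", hC[0]);                       /* expect 2.0 * n = 2048 */

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```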

ANSYS Mechanical SMP – GPU Speedup
Tesla C2050 and Intel Xeon 5560

[Chart: solver kernel speedups and overall speedups from GPU acceleration.]

From NAFEMS World Congress, May 2011, Boston, MA, USA: "Accelerate FEA Simulations with a GPU" by Jeff Beisheim, ANSYS

R14: GPU Acceleration for DANSYS

• Windows workstation: two Intel Xeon 5560 processors (2.8 GHz, 8 cores total), 32 GB RAM, NVIDIA Tesla C2070, Windows 7, TCC driver mode

[Chart: R14 Distributed ANSYS total simulation speedups for the R13 benchmark set, 4 CPU cores vs. 4 CPU cores + 1 GPU. Benchmarks: V13cg-1 (JCG, 1100k), V13sp-1 (sparse, 430k), V13sp-2 (sparse, 500k), V13sp-3 (sparse, 2400k), V13sp-4 (sparse, 1000k), V13sp-5 (sparse, 2100k); speedup values shown: 1.52, 1.16, 1.70, 1.20, 2.24, 1.44.]

ANSYS Mechanical – Multi-Node GPU

• Solder Joint Benchmark (4 MDOF, creep strain analysis)
• Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, NVIDIA Tesla M2070, InfiniBand

[Chart: R14 Distributed ANSYS total speedup with and without GPU at 16, 32, and 64 cores; speedup values shown: 1.9x, 3.2x, 1.7x, 3.4x, 4.4x.]

[Figure: solder joint model showing mold, PCB, and solder balls.]

Results courtesy of MicroConsult Engineering, GmbH

GPU Acceleration for CFD

First capability for "specialty physics" – view factors, ray tracing, reaction rates, etc.
• Radiation view factor calculation (ANSYS FLUENT 14 - beta)

R&D focus on linear solvers, smoothers – but potential limited by Amdahl's Law (see the worked example below)
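To see why Amdahl's Law caps the benefit, a worked example with illustrative numbers (the 50% accelerable fraction and 10x kernel speedup are assumptions, not measured Fluent data):

\[
S \;=\; \frac{1}{(1-p) + p/s}, \qquad
p = 0.5,\; s = 10 \;\Rightarrow\; S = \frac{1}{0.5 + 0.05} \approx 1.8,
\qquad \lim_{s\to\infty} S = \frac{1}{1-p} = 2.
\]

Even an infinitely fast GPU cannot push the overall speedup past 1/(1-p), which is why the fraction of the solve that can be accelerated matters as much as the kernel speedup itself.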

Case Study

HPC for High Fidelity CFD

http://www.ansys.com/About+ANSYS/ANSYS+Advantage+Magazine/Current+Issue

• 8M to 12M element turbocharger models (ANSYS CFX)

• Previous practice (8 HPC nodes)

● Full stage compressor runs 36-48 hours

● Turbine simulations up to 72 hours

• Current practice (160 nodes)

● 32 nodes per simulation

● Full stage compressor 4 hours

● Turbine simulations 5-6 hours

● Simultaneous consideration of 5 ideas

● Ability to address design uncertainty – clearance tolerance

“ANSYS HPC technology is enabling Cummins to use larger models with greater geometric details and more-realistic treatment of physical phenomena.”

Case Study
HPC for High Fidelity CFD

EURO/CFD
• Model sizes up to 200M cells (ANSYS FLUENT)
• 2011 cluster of 700 cores
  – 64-256 cores per simulation

[Figure: model size vs. turnaround time – 3 million cells (6 days), 10 million (5 days), 25 million (4 days), 50 million (2 days) – illustrating increases in spatial-temporal accuracy and in the complexity of physical phenomena: supersonic, multiphase, radiation, compressibility, conduction/convection, transient, optimisation/DOE, dynamic mesh, LES, combustion, aeroacoustics, fluid-structure interaction.]

Case Study
HPC for High Fidelity Mechanical

Microconsult GmbH
Solder joint failure analysis
• Thermal stress 7.8 MDOF
• Creep strain 5.5 MDOF

Simulation time reduced from 2 weeks to 1 day
• From 8-26 cores (past) to 128 cores (present)

"HPC is an important competitive advantage for companies looking to optimize the performance of their products and reduce time to market."

Case Study

HPC for Desktop Productivity

• Cognity Limited – steerable conductors for oil recovery
• ANSYS Mechanical simulations to determine load carrying capacity
• 750K elements, many contacts
• 12-core workstations / 24 GB RAM
• 6X speedup / results in 1 hour or less
• 5-10 design iterations per day

"Parallel processing makes it possible to evaluate five to 10 design iterations per day, enabling Cognity to rapidly improve their design."

http://www.ansys.com/About+ANSYS/ANSYS+Advantage+Magazine/Current+Issue

Case Study

Skewed Waveguide Array (HFSS)

• 16x16 (256 elements and excitations)
• Skewed Rectangular Waveguide (WR90) Array – 1.3M matrix size
• Using 8 cores: 3 hrs. solution time, 0.4 GB memory total
• Using 16 cores: 2 hrs. solution time, 0.8 GB memory total
• Additional cores: faster solution time, more memory

[Figure: unit cell shown with wireframe view of virtual array.]

Case Study
Desktop Productivity Cautionary Tale

NVIDIA – case study on the value of hardware refresh and software best practice
Deflection and bending of 3D glasses
• ANSYS Mechanical – 1M DOF models

Optimization of:
• Solver selection (direct vs. iterative)
• Machine memory (in-core execution)
• Multicore (8-way) parallel with GPU acceleration

Before/After: 77x speedup – from 60 hours per simulation to 47 minutes.

Most importantly: HPC tuning added scope for design exploration and optimization.

“Take Home” Points / Discussion

ANSYS HPC performance enables scaling for high fidelity
● What could you learn from a 10M (or 100M) cell / DOF model?
● What could you learn if you had time to consider 10x more design ideas?
● Scaling applies to "all physics", "all hardware" (desktop and cluster)

ANSYS continually invests in software development for HPC
• Maximized value from your HPC investment
• This creates differentiated competitive advantage for ANSYS users

Comments / Questions / Discussion

