Stan Posey, CAE Industry Development
NVIDIA, Santa Clara, CA, USA
2 ANSYS 2011 Regional Conferences
NVIDIA and HPC Evolution of GPUs
Public, based in Santa Clara, CA | ~$4B revenue | ~5,500 employees
Founded in 1993 and public since 1999, with primary business in the semiconductor industry
Products for graphics in workstations, notebooks, mobile devices, etc.
Began R&D of GPUs for HPC in 2004, released first Tesla and CUDA in 2007
Development of GPUs as a co-processing accelerator for x86 CPUs
2004: Began strategic investments in GPU as HPC co-processor
2006: G80 first GPU with built-in compute features, 128 cores; CUDA SDK Beta
2007: Tesla 8-series based on G80, 128 cores – CUDA 1.0, 1.1
2008: Tesla 10-series based on GT200, 240 cores – CUDA 2.0, 2.3
2009: Tesla 20-series, code-named "Fermi", up to 512 cores – CUDA SDK 3.0, 3.2
HPC Evolution of GPUs: 3 Years With 3 Generations
NVIDIA and ANSYS Collaboration Focus
NVIDIA GPU status levels: Available Today; Updates for 2011; Product Evaluation; Research Evaluation
Structural Mechanics:
- ANSYS Mechanical 13: SMP, single GPU
- ANSYS Mechanical 14: DMP, improved PCG
- ANSYS Mechanical 15: multi-GPU, multi-node
Fluid Dynamics:
- ANSYS CFD 14: radiation HT (beta)
- ANSYS CFD 15: solver, other models
Electromagnetics:
- ANSYS HFSS
- ANSYS Maxwell
- ANSYS Nexxim (Signal Integrity)
NVIDIA Provides Business and Engineering Investments in ANSYS Technology Developments
ANSYS runs the heavy matrix-solver workloads on the GPU and the other routines on the CPU
ANSYS Mechanical GPU acceleration is user-transparent
Jobs launch and complete without additional user steps
1. ANSYS job launched on CPU
2. Solver operations sent to GPU
3. GPU sends results back to CPU
4. ANSYS job completes on CPU
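The four steps above can be sketched as a host-side dispatch pattern. This is a minimal illustration in Python, not ANSYS code: `run_job`, `cpu_solve`, and `gpu_solve` are hypothetical names, and the "GPU" backend is a stand-in that runs on the CPU so the example stays runnable without CUDA hardware.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def cpu_solve(a: Matrix, b: Vector) -> Vector:
    """Dense Gaussian elimination with partial pivoting (illustration only)."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def gpu_solve(a: Matrix, b: Vector) -> Vector:
    """Hypothetical device backend. A real one would copy a and b to GPU
    memory, solve there, and copy x back; here it reuses the CPU solver."""
    return cpu_solve(a, b)


def run_job(a: Matrix, b: Vector,
            solver: Callable[[Matrix, Vector], Vector] = cpu_solve
            ) -> Tuple[Vector, List[str]]:
    """Run one job; the caller's steps are the same for either backend."""
    steps = ["1. job launched on CPU"]
    steps.append(f"2. solver operations sent to {solver.__name__}")
    x = solver(a, b)
    steps.append("3. solver results returned to CPU")
    steps.append("4. job completes on CPU")
    return x, steps


if __name__ == "__main__":
    solution, log = run_job([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], solver=gpu_solve)
    print(solution)
    print("\n".join(log))
```

Only the `solver` argument changes between the two runs, which is the sense in which the acceleration is transparent to the user.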
How ANSYS Software Works With GPUs
Important Considerations for ANSYS and GPUs
Core ANSYS focus is on direct and iterative linear solvers
- Others (models, matrix assembly) move to GPUs in progressive stages
Most ANSYS software employs a domain parallel method
- GPU computing fits this method, preserves DANSYS investments
- ANSYS 13 focus was SMP solvers; ANSYS 14 focus is DANSYS solvers
ANSYS software is parallel and scales well for multi-core CPUs
- Direct solvers use a scheme of computations on both GPU and CPU
- Iterative solvers have computations on GPU, matrix assembly on CPU
- Investigations include GPU performance against multi-core CPU only
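To make the iterative-solver split concrete, here is a textbook Jacobi-preconditioned conjugate gradient in plain Python. It is a sketch under stated assumptions, not the ANSYS PCG/JCG implementation: `assemble_1d_stiffness` and `jacobi_pcg` are hypothetical names, the matrix is a small 1D Laplacian chosen only because it is symmetric positive definite, and everything runs on the CPU. In the scheme described above, the assembly function corresponds to the CPU-side work and the loop inside `jacobi_pcg` (matrix-vector products, dot products, vector updates) to the part a GPU would carry.

```python
import math


def assemble_1d_stiffness(n):
    """CPU-side step: assemble a small SPD tridiagonal matrix (1D Laplacian)."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 2.0
        if i > 0:
            a[i][i - 1] = -1.0
        if i < n - 1:
            a[i][i + 1] = -1.0
    return a


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def jacobi_pcg(a, b, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned conjugate gradient; returns (x, iterations)."""
    n = len(b)
    inv_diag = [1.0 / a[i][i] for i in range(n)]   # Jacobi preconditioner
    x = [0.0] * n
    r = b[:]                                       # r = b - A x with x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = z[:]
    rz = dot(r, z)
    if rz == 0.0:                                  # zero right-hand side
        return x, 0
    b_norm = math.sqrt(dot(b, b))
    for k in range(1, max_iter + 1):
        ap = matvec(a, p)                          # dominant cost per iteration
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if math.sqrt(dot(r, r)) / b_norm < tol:
            return x, k
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    a = assemble_1d_stiffness(5)
    x, iterations = jacobi_pcg(a, [1.0] * 5)
    print(x, iterations)
```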
ANSYS Presentation at NVIDIA GTC 2010
Sep 20-23, 2010, San Jose Convention Center, San Jose, California, USA
"Accelerating System Level Signal Integrity Simulation with GPU", Dr. Ekanathan Palamadai, ANSYS
Nexxim 13.0 Convolution Results for Tesla C2050 (lower is better):
Intel Nehalem 8 core CPU, OpenMP: 108 h
NVIDIA Tesla C2050 GPU, OpenMP: 4 h
Single Precision ~27x
Double Precision ~13x
Speedup combines GPU and other SW changes
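As a quick check of the arithmetic, the 108 h and 4 h run times quoted for Nexxim reproduce the single-precision figure of about 27x. The helper name `speedup` is an assumption for illustration. The slide does not give a run time for the double-precision case, so none is derived here.

```python
def speedup(baseline_hours, accelerated_hours):
    """Ratio of baseline run time to accelerated run time."""
    return baseline_hours / accelerated_hours


cpu_hours = 108.0  # Intel Nehalem 8-core CPU, from the slide
gpu_hours = 4.0    # Tesla C2050, from the slide

single_precision_speedup = speedup(cpu_hours, gpu_hours)
print(f"{single_precision_speedup:.0f}x")  # prints "27x"
```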
ANSYS CFD 14.0 to Offer (Beta) GPU Capability
ANSYS CFD preliminary results of radiation heat transfer view-factor computation on GPUs vs. CPUs
Radiation HT Applications:
- Underhood cooling
- Cabin comfort HVAC
- Furnace simulations
- Solar loads on buildings
- Combustor in turbine
- Electronics passive cooling
Other ANSYS CFD Evaluations:
- Models (e.g. disperse phase)
- Implicit equation solvers
NOTE: Growing CPU time of view-factor computations inhibits proper inclusion of radiation HT effects
NOTE: GPU time remains low even as view-factor computations grow very large
ANSYS Announcement of NVIDIA CUDA Support
"This initial development for GPU computing demonstrates our focus on evolving ANSYS software to take advantage of important technology trends in high-performance computing," said Dipankar Choudhury, vice president of corporate product strategy and planning at ANSYS. "We work to achieve optimized software performance, across the full spectrum of HPC technologies, so that our customers get maximum value from their investment in HPC. Here, our technical collaboration with NVIDIA has resulted in a significant benefit for our mutual customers."
ANSYS Mechanical 13: Collaboration on SMP direct sparse and PCG/JCG iterative solvers – CUDA 3.2 support in 13.0 SP2
Initial release for both Linux and Windows 64-bit, and single GPU per job – multi-GPU under evaluation for future release:
- Model limits for direct depend on largest front sizes: GPUs good for ~1M DOF to ~8M DOF for 6GB Tesla C2075 or Quadro 6000
- Model limits for iterative depend on GPU memory: GPUs good for ~1M DOF to ~5M DOF for 6GB Tesla C2075 or Quadro 6000
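The model-size guidance above can be written as a small lookup. This is only a sketch: `GPU_DOF_RANGE` and `in_suggested_gpu_range` are hypothetical names, and the bounds are the approximate figures quoted for a 6 GB Tesla C2075 or Quadro 6000. They are rules of thumb rather than hard limits, since the real constraint for the direct solver is the largest front size, not the DOF count itself.

```python
# Approximate DOF ranges quoted for a 6 GB Tesla C2075 or Quadro 6000.
# These are rules of thumb from the slide, not enforced limits.
GPU_DOF_RANGE = {
    "direct": (1_000_000, 8_000_000),     # depends on largest front sizes
    "iterative": (1_000_000, 5_000_000),  # depends on GPU memory
}


def in_suggested_gpu_range(dof, solver):
    """Return True if a model's DOF count falls in the quoted range."""
    low, high = GPU_DOF_RANGE[solver]
    return low <= dof <= high


if __name__ == "__main__":
    print(in_suggested_gpu_range(2_100_000, "direct"))     # turbine model size
    print(in_suggested_gpu_range(10_000_000, "direct"))
    print(in_suggested_gpu_range(3_000_000, "iterative"))
```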
ANSYS Mechanical 14: Collaboration on DMP solvers – Nov 11
Details of ANSYS Mechanical for NVIDIA GPUs
ANSYS Mechanical Results of Solver Acceleration
NOTE: Results of ANSYS Mechanical for Tesla C2050 and Intel Xeon 5560
From NAFEMS World Congress, May 2011, Boston, MA, USA: “Accelerate FEA Simulations with a GPU” by Jeff Beisheim, ANSYS
System Configuration:
- Xeon 5560, 2.8 GHz, 2 sockets, 8 cores
- 32 GB memory
- Win XP SP2 64-bit
- Tesla C2050 GPU
[Figures: GPU Solver Kernel Speedups; GPU Overall Simulation Speedups]
ANSYS 13 and 14P3 Performance Study by NVIDIA
HP Z800 Workstation Configuration - $9,403
- Windows 7 Professional 64-bit or CentOS
- 2 x Xeon® X5650 HC 2.66 GHz CPUs 12MB/1333 (12 cores)
- NVIDIA Quadro 2000 1 GB Graphics
- HP 24 GB (6x4GB) DDR3-1333 ECC memory
- HP 500 GB SATA 7200 HDD
- Add HP 24 GB (6x4GB) for total 48 GB - $1,800 (included)
Source: h10010.www1.hp.com/wwpc/pscmisc/vac/us/en/sm/workstations/z800.html
NVIDIA Tesla C2075 GPU – about $2,000
NOTE: Directly configures with HP Z800 as above and with other workstations and servers [vendors appear later in presentation]
ANSYS Mechanical Model – V13sp-5
Turbine geometry of 2,100 K DOF and mostly SOLID187 FEs
Single load step, static, large deflection nonlinear
Use of ANSYS Mechanical 13 SMP and 14P3 DANSYS direct sparse solver
ANSYS Mechanical Model – V13cg-2 [Study still a work in progress]
Engine block (static, linear) of 6,270 K DOF and mostly SOLID187 FEs
Use of ANSYS Mechanical 14P3 SMP PCG iterative solver
HP Z800 Workstation + NVIDIA Tesla C2075 GPU
ANSYS Mechanical 13 on GPU Workstation (AVAILABLE TODAY)
V13sp-5 Model:
- Turbine geometry
- 2,100 K DOF
- SOLID187 FEs
- Static, nonlinear
- One load step
- Direct sparse
ANSYS Mechanical times in seconds (lower is better)
CPU: Xeon 5670 2.93 GHz Westmere (Dual Socket); core counts are grouped on the chart as 1 Socket and 2 Socket
Cores   CPU only (s)   CPU + Tesla C2075 (s)   Speedup
1       2449           526                     4.7x
2       1484           450                     3.3x
4       846            414                     2.0x
6       633            395                     1.6x
8       560            358                     1.6x
12      512            359                     1.4x
NOTE: Results Based on ANSYS Mechanical 13.0 SP2 SMP Solver Jun 2011
NOTE: Add a Tesla C2075 to use with 6 cores: now 30% faster than 12, with 6 available for other tasks
Results from HP Z800 Workstation, 2 x Xeon X5670 2.93GHz 48GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17
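The speedup labels and the "30% faster" note on this slide follow directly from the plotted times. The sketch below recomputes them in Python. The variable names are illustrative, and the pairing of each CPU-only time with a CPU + Tesla C2075 time is a reading of the chart: it is the pairing that reproduces the speedup labels shown on the slide.

```python
# ANSYS Mechanical 13.0 SP2 SMP times in seconds for model V13sp-5,
# keyed by CPU core count (values read from the chart on this slide).
cpu_seconds = {1: 2449, 2: 1484, 4: 846, 6: 633, 8: 560, 12: 512}
gpu_seconds = {1: 526, 2: 450, 4: 414, 6: 395, 8: 358, 12: 359}

# Speedup from adding the GPU at the same core count.
speedups = {n: round(cpu_seconds[n] / gpu_seconds[n], 1) for n in cpu_seconds}

# 6 cores + GPU compared with 12 CPU-only cores.
gain_pct = (cpu_seconds[12] / gpu_seconds[6] - 1.0) * 100.0

if __name__ == "__main__":
    for n, s in speedups.items():
        print(f"{n:>2} cores: {s}x")
    print(f"6 cores + GPU vs 12 cores: {gain_pct:.0f}% faster")
```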
ANSYS Mechanical 14 Preview on GPU Workstation (AVAILABLE NOV 2011)
V13sp-5 Model:
- Turbine geometry
- 2,100 K DOF
- SOLID187 FEs
- Static, nonlinear
- One load step
- Direct sparse
ANSYS Mechanical times in seconds (lower is better)
CPU: Xeon 5670 2.93 GHz Westmere (Dual Socket); core counts are grouped on the chart as 1 Socket and 2 Socket
Cores   CPU only (s)   CPU + Tesla C2075 (s)   Speedup
1       1848           444                     4.2x
2       1192           342                     3.5x
4       846            314                     2.7x
6       564            273                     2.1x
8       516            270                     1.9x
12      399            -                       -
NOTE: Results Based on ANSYS Mechanical 14.0 Preview 3 DMP Solver Aug 2011
NOTE: Add a Tesla C2075 to use with 6 cores: now 46% faster than 12, with 6 available for other tasks
Results from HP Z800 Workstation, 2 x Xeon X5670 2.93GHz 48GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17
ANSYS Mechanical for 12-Core GPU Workstation
NOTE: Comparison of ANSYS Mechanical 14.0 Preview 3 DMP vs. 13.0 SP2 SMP for Tesla GPU
V13sp-5 Model:
- Turbine geometry
- 2,100 K DOF
- SOLID187 FEs
- Static, nonlinear
- One load step
- Direct sparse
ANSYS Mechanical times in seconds, Xeon 5670 + Tesla C2075 (lower is better)
Cores   13.0 SP2 SMP (s)   14.0 P3 DMP (s)   Improvement
4       414                314               32%
6       395                273               45%
8       358                270               33%
Results from HP Z800 Workstation, 2 x Xeon X5670 2.93GHz 48GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17
Study on System Memory Effects at 4 Cores
V12sp-5 Model:
- Turbine geometry
- 2,100 K DOF
- SOLID187 FEs
- Static, nonlinear
- One load step
- Direct sparse
ANSYS Mechanical times in seconds (lower is better)
CPU: Xeon 5560 2.8 GHz Nehalem 4 Cores (Dual Socket)
System memory          CPU only (s)   CPU + Tesla C2050 (s)   Speedup
48 GB In-memory        830            426                     2.0x
32 GB Out-of-memory    1155           682                     1.7x
24 GB Out-of-memory    1524           1214                    1.3x
NOTE: 34 GB required for in-memory solution
NOTE: Greatest benefit for CPU and CPU+GPU is in-memory solution
NOTE: Results Based on ANSYS Mechanical 13.0 SMP Direct Solver Sep 2010
Results from HP Z800 Workstation, 2 x Xeon X5560 2.8GHz CPUs, 48GB memory, MKL 10.25; Tesla C2050, CUDA 3.1
ANSYS Mechanical 14 Preview on GPU Workstation (AVAILABLE NOV 2011)
V13cg-2 Model:
- Engine block
- 6,270 K DOF
- SOLID187 FEs
- Static, linear
- PCG iterative
ANSYS Mechanical times in seconds (lower is better)
CPU: Xeon 5670 2.93 GHz Westmere (Dual Socket); core counts are grouped on the chart as 1 Socket and 2 Socket
Cores   CPU only (s)   CPU + Tesla C2075 (s)   Speedup
1       1758           280                     6.3x
2       1175           -                       -
4       817            161                     5.1x
6       721            153                     4.7x
8       732            147                     5.0x
12      828            146                     5.7x
NOTE: Results Based on ANSYS Mechanical 14.0 Preview 3 SMP Solver Aug 2011
NOTE: Results for SMP only
Results from HP Z800 Workstation, 2 x Xeon X5670 2.93GHz 48GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17
ANSYS Mechanical 14 Preview on GPU Workstation (AVAILABLE NOV 2011)
V13cg-2 Model:
- Engine block
- 6,270 K DOF
- SOLID187 FEs
- Static, linear
- PCG iterative
ANSYS Mechanical times in seconds (lower is better)
CPU: Xeon 5670 2.93 GHz Westmere; core counts are grouped on the chart as 1 Socket and 2 Socket
Cores   DANSYS (s)   SMP (s)   SMP + Tesla C2075 (s)   GPU speedup over SMP
1       1829         1758      280                     6.3x
2       1048         1175      -                       -
4       682          817       161                     5.1x
6       605          721       153                     4.7x
8       666          732       147                     5.0x
12      465          828       146                     5.7x
NOTE: Results Based on ANSYS Mechanical 14.0 Preview 3 Solvers Aug 2011
NOTE: DANSYS outperforms SMP
Results from HP Z800 Workstation, 2 x Xeon X5670 2.93GHz 48GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17
How ANSYS is Licensed for NVIDIA GPUs
ANSYS Base License: Unlocks up to 2 CPU cores
ANSYS HPC Pack: Unlocks up to 8 CPU cores; unlocks 1 computational GPU *
ANSYS HPC Core Licensees: Contact ANSYS to enquire
* Academic customers: GPU feature is bundled with ANSYS Base License
ANSYS 14 Performance Gain > 4X vs. Base License
Factors gain over Base License results:
Configuration                  Speed-up   Solution Cost
Base License 2 Core            1.0        1.0
ANSYS HPC Pack 6 Cores         2.1        1.35
ANSYS HPC Pack 8 Cores         2.3        1.35
ANSYS HPC Pack 6 Cores + GPU   4.4        1.38
Solution Cost Basis:
- ANSYS base license
- ANSYS HPC Pack
- $10K for Workstation
- $2K for Tesla C2075
Performance Basis, V13sp-5 Model:
- 2,100 K DOF
- SOLID187 FEs
- Static nonlinear
- One load step
- Direct sparse
NOTE: Based on ANSYS Mechanical 14.0 Preview 3 DMP Solver Aug 2011 and Model V13sp-5
NOTE: Invest 38% more over Base License for a gain of over 4x!
Results from HP Z800 Workstation, 2 x Xeon X5670 2.93GHz 48GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17
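The gain factors on this slide can be recomputed from the ANSYS Mechanical 14.0 Preview 3 DMP timings shown earlier in the deck for model V13sp-5, and then set against the relative solution costs. The following Python sketch does that under stated assumptions: the function and variable names are illustrative, the 2-core time is taken as the Base License baseline, and "gain per unit cost" is a derived ratio that does not appear on the slide.

```python
# Times in seconds for model V13sp-5, ANSYS Mechanical 14.0 Preview 3 DMP.
BASE_2_CORE_SECONDS = 1192  # Base License, 2 cores

# name -> (run time in seconds, solution cost relative to Base License)
CONFIGS = {
    "HPC Pack, 6 cores": (564, 1.35),
    "HPC Pack, 8 cores": (516, 1.35),
    "HPC Pack, 6 cores + GPU": (273, 1.38),
}


def gain_over_base(seconds, base_seconds=BASE_2_CORE_SECONDS):
    """Speed-up factor relative to the Base License run time."""
    return base_seconds / seconds


def summarize(configs=CONFIGS):
    """Return name -> (gain, gain per unit of relative cost)."""
    summary = {}
    for name, (seconds, rel_cost) in configs.items():
        gain = gain_over_base(seconds)
        summary[name] = (gain, gain / rel_cost)
    return summary


if __name__ == "__main__":
    for name, (gain, per_cost) in summarize().items():
        print(f"{name}: {gain:.1f}x gain, {per_cost:.2f} gain per unit cost")
```

On these numbers the 6 cores + GPU configuration gives the largest gain for a relative cost of 1.38, which is the basis for the "38% more for over 4x" note.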
NVIDIA Use of ANSYS Software for Product Design
ANSYS Icepak – active and passive cooling of IC packages
ANSYS Mechanical – large deflection bending of PCBs
ANSYS Mechanical – comfort and fit of 3D emitter glasses
ANSYS Mechanical – shock & vib of solder ball assemblies
NVIDIA HPC Case Study: Performance Gain of 77x
ANSYS Mechanical Simulations by NVIDIA for Design of 3D Emitter Glasses
Simulation for prediction of comfort, fit, and handling
Study optimized on CPU platform before applying GPU
Once impossible model parameterization now practical
Recommended System Configurations
Workstations with Tesla GPUs
Existing System:
• Tesla C2050 (3 GB)
• Tesla C2075 (6 GB)
New System Purchase:
• Total 6-8 CPU cores
• Total 48 GB of CPU memory
• Disk with minimum 500 GB
• Tesla C2075 + Quadro 2000 for pre/post
  -- OR --
• Quadro 6000 (6 GB)
Servers with Tesla GPUs
Existing System:
• Tesla S2050 (12 GB or 3 GB/GPU)
New System Purchase:
• Total 4 CPUs, 6-8 CPU cores each
• Total 4 x16 PCIe (one for each GPU)
• Total 96 to 128 GB of CPU memory
• Disk with minimum 2000 GB (scratch)
• Tesla M2070 or Tesla M2090
Summary and Next Steps
ANSYS Software supports NVIDIA GPUs for Computation
ANSYS 13.0 since Nov 2010; new features coming in ANSYS 14.0
Joint collaboration on ANSYS 13.0 is only the beginning
Collaboration ongoing in all disciplines of CSM, CFD and CEM
Learn more about ANSYS and NVIDIA GPU solution
More at: www.nvidia.com/object/tesla-ansys-accelerations.html
Want to try ANSYS on NVIDIA GPUs? Contact [email protected]
Thank You. Questions?
Stan Posey | CAE Industry Development | [email protected]
NVIDIA, Santa Clara, CA, USA