Innovation Intelligence ® Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012
Transcript
  • Innovation Intelligence®

    Speedup Altair RADIOSS Solvers

    Using NVIDIA GPU

    Eric LEQUINIOU, HPC Director

    Hongwei Zhou, Senior Software Developer

    May 16, 2012


    ALTAIR OVERVIEW

  • Copyright © 2012 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

    Altair’s Vision

    Simulation, predictive analytics and optimization

    leveraging high performance computing for engineering

    and business decision making


    25+ Years of Innovation

    40+ Offices in 16 Countries

    1500+ Employees Worldwide

    Altair Engineering


    Altair’s Brands and Companies

    Solid State Lighting

    Products

    Engineering

    Simulation Platform

    On-demand Cloud

    Computing Technology

    Product Innovation

    Consulting

    Business Intelligence &

    Data Analytics Solutions

    Industrial Design

    Technology


    RADIOSS … AcuSolve … MotionSolve

    Statics

    NVH

    Non-Linear (Implicit)

    Non-Linear (Explicit)

    Thermal and CFD

    Optimization – OptiStruct and HyperStudy

    Data and Process Management

    Multi-Body Dynamics

    Partner Alliance Solutions

    1-D Systems, Fatigue, Ergonomics, Industrial

    Design, Injection Molding, Noise and Vibration (NVH),

    Composite Materials, Electromagnetics

    Pre-Post

    HyperWorks for Analysis and Optimization


    Simulate Real-life Models with HyperWorks Solvers


    AcuSolve Already Benefits from GPU Acceleration

    • High-performance computational fluid dynamics (CFD) software

    • The leading finite-element-based CFD technology

    • High accuracy and scalability on massively parallel architectures

    [Chart: S-duct benchmark, 80K degrees of freedom; elapsed time, lower is better. 4-core Xeon Nehalem CPU: 549 s; 1 core + 1 Tesla GPU: 279 s (2X*); 4-core CPU: 549 s; 2 cores + 2 GPUs: 165 s (3.3X*). Hybrid MPI/OpenMP used for the multi-GPU test. *Performance gain versus 4-core CPU.]


    RADIOSS PORTING ON GPU


    Motivations to use GPU

    [Chart: Gflop Peak (DP), Gflop cost in $, and Gflops per Watt for 1 node Intel X5670, 2 nodes Intel X5670, 1 node + 1 Nvidia M2090, and 1 node + 2 Nvidia M2090. Peak DP performance: 140, 280, 795, and 1450 Gflops respectively; cost-per-Gflop and Gflops-per-Watt strongly favor the GPU configurations.]

    • Cost-effective solution

    • Power-efficient solution

    How much of the peak can we get?

    – Which part of the code is best suited?

    – What coding effort is required?

    – What is the speedup for my application?


    RADIOSS Porting on Nvidia GPU

    • Assess the potential of GPU for RADIOSS

    • Focus on Implicit

    – Direct Solver

    – Highly optimized, compute-intensive solver

    – Limited scalability on multicore and cluster systems

    – Iterative Solver

    – Ability to solve huge problems with low memory requirements

    – Efficient parallelization

    – High cost on CPU due to the large number of iterations needed for convergence

    • Double precision required

    • Integrate GPU parallelization into our Hybrid parallelization technology


    RADIOSS DIRECT SOLVER PORTING


    Multifrontal Sparse Direct Solver

    do j = 1, N
       assemble(A(j))
       factor(j)
       update(j)
    end do

    Non-pivoting: Cholesky. Pivoting: LDLT.
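The non-pivoting branch of the factor step above is an ordinary dense Cholesky factorization applied front by front. As a schematic illustration only (plain Python, not RADIOSS code), it can be sketched as:

```python
# Schematic illustration only (not RADIOSS code): the non-pivoting
# factor step is a dense Cholesky on each frontal matrix.
import math

def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a is a list of lists)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)   # diagonal entry
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]  # off-diagonal entry
    return L

def factor_fronts(fronts):
    """do j = 1, N: assemble / factor / update -- here only the factor step."""
    return [cholesky(front) for front in fronts]

# One 2x2 symmetric positive-definite front:
L = factor_fronts([[[4.0, 2.0], [2.0, 3.0]]])[0]
```

In the multifrontal solver the expensive part is not this factor kernel but the Schur-complement update of the remaining rows, which is where CUBLAS DGEMM is applied.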


    Concerns with GPU Acceleration

    • CUBLAS (DGEMM) – a perfect candidate to speed up the update module

    – The frontal matrix can be too large to fit in GPU memory

    – The frontal matrix can be too small, making GPU use inefficient

    – Data transfer is not trivial

    – Only the lower triangular part of the matrix is of interest

    • Pivoting is required in real applications

    – Pivot searching is a sequential operation

    – The factor module has limited parallel potential

    – It can be as expensive as the update module in extreme cases
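One common mitigation for the first concern is to split an oversized frontal matrix into column panels that each fit in device memory. The helper below is a hypothetical sketch (the names and the 8-byte double-precision entry size are illustrative assumptions, not RADIOSS internals):

```python
# Hypothetical sketch: split a frontal matrix into column panels small
# enough to stage on the GPU for a DGEMM-style update. Names and the
# 8-byte double size are assumptions, not RADIOSS internals.
def panel_cols(n_rows, n_cols, gpu_mem_bytes, elem_bytes=8):
    """Yield (start, end) column ranges whose panel fits in gpu_mem_bytes."""
    cols_per_panel = max(1, gpu_mem_bytes // (elem_bytes * n_rows))
    start = 0
    while start < n_cols:
        end = min(start + cols_per_panel, n_cols)
        yield (start, end)
        start = end

# A 10000 x 3000 double-precision front with 100 MB of usable GPU memory:
panels = list(panel_cols(10000, 3000, 100 << 20))
```

Each panel is then transferred and updated separately, which also creates the opportunity to overlap transfers with computation on other panels.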


    CUBLAS Speeds up the "Update" Module

    • Improve the profile of the "Update" module

    – Base – BCSLIB 4.3

    • Asynchronous computing

    – Overlap the computation

    – Overlap the communication

    [Diagram: CUDA streams S1, S2, S3 overlapping the LDLT update]


    Numerical Test: Non-Pivoting Case

    [Chart: GPU speedup, non-pivoting case. Elapsed time (s), 4 CPU vs. 4 CPU + 1 GPU, solver and total elapsed, speedups of 3.9X and 2.9X; lower is better. Second chart: profile (%) of "update" vs. total, base (BCSLIB-EXT) vs. improved.]

    Linear static – non-pivoting case

    Benchmark: 2.8 million degrees of freedom

    Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; Nvidia C2070; CUDA 4.0; MKL 10.3; RHEL 5.1


    Numerical Test: Pivoting Case

    [Chart: GPU speedup, pivoting case. Elapsed time (s), 4 CPU vs. 4 CPU + 1 GPU, solver and total elapsed, speedups of 2.8X and 2.5X; lower is better. Second chart: profile (%) of "update" vs. total, base (BCSLIB-EXT) vs. improved.]

    Nonlinear static – pivoting case. Customer model – engine block

    Benchmark: 2.5 million degrees of freedom

    Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; Nvidia C2070; CUDA 4.0; MKL 10.3; RHEL 5.1


    Challenging Work

    • On some models, "update" is not that dominant

    – E.g., an engine block with 1st-order elements

    • In-core / "quasi" in-core is preferred

    – A memory threshold to get reasonably good speedup will be provided to the user

    – Make sure the essential computation is in core

    [Chart: Elapsed time (s), 4 CPU vs. 4 CPU + 1 GPU, solver and total elapsed, speedups of 2.1X and 1.8X; lower is better. Second chart: profile (%) of "update" vs. total, base (BCSLIB-EXT) vs. improved.]

    1.5 million DOFs with pivoting


    Summary & Perspectives

    • GPU and CUDA significantly enhance the performance of the direct solver

    – CUBLAS on the Fermi card is very fast

    – Asynchronous computing is the key component

    – Improving the profile is a necessary step (Amdahl's law)

    • Application areas

    – Nonlinear analysis – robust and accurate solution

    – Optimization – matrix factorization reused in sensitivity analysis

    • Future work

    – Other dynamic solvers: Block Lanczos eigenvalue solver

    – Altair AMSES (automated multilevel-substructure eigenvalue solver)
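The Amdahl's law point above can be made concrete with a small calculation (the 75% figure is illustrative, not a measured RADIOSS profile):

```python
# Amdahl's law: if only the "update" module is accelerated, overall
# speedup is capped by the un-accelerated remainder of the runtime.
# The fractions used here are illustrative, not measured profiles.
def amdahl(accel_fraction, module_speedup):
    """Overall speedup when accel_fraction of runtime is sped up by module_speedup."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / module_speedup)

# Even an infinitely fast "update" taking 75% of runtime caps speedup at 4x:
cap = amdahl(0.75, 1e12)     # approaches 4.0
modest = amdahl(0.75, 5.0)   # 1 / (0.25 + 0.15) = 2.5
```

This is why "improving the profile" (raising the share of runtime spent in the accelerated module) matters as much as making the module itself faster.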


    RADIOSS ITERATIVE SOLVER PORTING


    Preconditioned Conjugate Gradient Method

    • Problem

    – Solve the linear equation Ax – b = 0

    – with A symmetric positive-definite

    • Solution

    – Preconditioned Conjugate Gradient (PCG) solves iteratively: M^-1 · (Ax – b) = 0

    – The method converges quickly

    – Efficiency depends on M

    • Other advantages

    – Low memory consumption

    – Efficient parallelization: SMP, MPI


    PCG Porting on GPU using Cuda

    r0 = b – A·x0
    z0 = M^-1·r0
    p0 = z0
    k = 0
    DO WHILE NOT DONE
       alpha_k = (r_k^T · z_k) / (p_k^T · A · p_k)
       x_k+1 = x_k + alpha_k · p_k
       r_k+1 = r_k – alpha_k · A · p_k
       IF (||r_k+1|| < precision) THEN
          DONE = TRUE
       END IF
       z_k+1 = M^-1 · r_k+1
       beta_k = (z_k+1^T · r_k+1) / (z_k^T · r_k)
       p_k+1 = z_k+1 + beta_k · p_k
       k = k + 1
    END DO
    result = x_k+1

    • Pretty simple original Fortran code

    • Only a few kernels to write in Cuda

    – Sparse matrix-vector product

    – Focus optimization effort on this kernel

    • Use of CUBLAS

    – DAXPY

    – DDOT
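A minimal, self-contained sketch of the PCG loop on this slide, using a simple Jacobi (diagonal) preconditioner M = diag(A). Plain Python lists stand in for the CUBLAS DAXPY/DDOT calls and the sparse matrix-vector kernel, so this illustrates the algorithm, not the ported code:

```python
# PCG with a Jacobi preconditioner M = diag(A). Dense lists stand in
# for the sparse matrix-vector kernel and the CUBLAS DAXPY/DDOT calls.
def pcg(A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = [bi - ti for bi, ti in zip(b, matvec(x))]   # r0 = b - A x0
    z = [r[i] / A[i][i] for i in range(n)]          # z0 = M^-1 r0
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # DAXPY
        r = [ri - alpha * api for ri, api in zip(r, Ap)]   # DAXPY
        if dot(r, r) ** 0.5 < tol:                         # DDOT
            break
        z = [r[i] / A[i][i] for i in range(n)]             # z = M^-1 r
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Small SPD system: solution should converge to [1, 1]
x = pcg([[4.0, 1.0], [1.0, 3.0]], [5.0, 4.0])
```

The structure shows why the porting effort concentrates on the sparse matrix-vector product: every other operation in the loop is a vector update or dot product that maps directly onto CUBLAS.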


    Hybrid MPP Computing with CUDA and MPI

    RADIOSS 11 Hybrid MPP Speedup

    • Hybrid MPP version of RADIOSS

    – 2 parallelization levels

    – MPI parallelization based on domain decomposition

    – Multi-threaded MPI processes under OpenMP

    – Enhanced performance

    – Higher scalability

    – Better efficiency

    • Extend Hybrid to multi-GPU programming

    – Integrate GPU parallelization into our hybrid programming model

    – MPI manages communication between GPUs

    – Portions of code that are not GPU-enabled benefit from OpenMP parallelization

    – Programming model extendable to multiple nodes with multiple GPUs
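The two-level mapping described above can be sketched as a toy layout function (all names and counts here are hypothetical, for illustration only): each MPI rank owns one domain and one GPU, and the node's cores are divided among the ranks' OpenMP threads:

```python
# Toy sketch of the hybrid MPI + OpenMP + GPU layout: one GPU per MPI
# rank, leftover cores per rank run OpenMP threads. Counts hypothetical.
def hybrid_layout(n_nodes, ranks_per_node, cores_per_node, gpus_per_node):
    layout = []
    for node in range(n_nodes):
        for local_rank in range(ranks_per_node):
            layout.append({
                "rank": node * ranks_per_node + local_rank,
                "node": node,
                "gpu": local_rank % gpus_per_node,            # one GPU per MPI process
                "omp_threads": cores_per_node // ranks_per_node,
            })
    return layout

# 2 nodes, 2 MPI ranks/node, 12 cores/node, 2 GPUs/node
# -> the "Hybrid 4 MPI x 6 SMP + 4 GPUs" configuration of the benchmarks
layout = hybrid_layout(2, 2, 12, 2)
```

In the real code the binding would be done with MPI rank information and a CUDA device-selection call; the point of the sketch is only the rank-to-GPU and rank-to-threads bookkeeping.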


    PCG Multi GPUs Parallelization

    • Domain decomposition to split the original problem

    – Optimal load balancing

    • MPI communications between domains

    – Controlled by the CPU

    – One GPU associated with one MPI process

    • OpenMP multithreading

    – Speed up calculations not performed on the GPU

    – Leverage the available CPU cores


    Benchmark #1

    Linear Problem #1

    Hood of a car with pressure loads

    Compute displacements and stresses

    Benchmark: 0.9 million degrees of freedom; 24 million non-zeros; 140000 shells + 13000 solids + 1100 RBE3; 4300 iterations

    Platform: Nvidia PSG Cluster – 2 nodes, each with dual Nvidia M2090 GPUs; Cuda v3.2; Intel Westmere 2x6 X5670 @ 2.93 GHz; Linux RHEL 5.4 with Intel MPI 4.0

    Performance: elapsed time decreased by up to 9X

    [Chart, 1 node – 2 Nvidia M2090, elapsed (s), lower is better: SMP 6-core 351; Hybrid 2 MPI x 6 SMP 185; SMP 6 + 1 GPU 84 (4.2X*); Hybrid 2 MPI x 6 SMP + 2 GPUs 53 (6.5X*). Chart, 2 nodes – 4 Nvidia M2090: Hybrid 4 MPI x 6 SMP 104; Hybrid 4 MPI x 6 SMP + 4 GPUs 38 (9.2X*). *Performance gain versus SMP 6-core.]


    Benchmark #2

    Linear Problem #2

    Hood of a car with pressure loads

    Refined model

    Compute displacements and stresses

    Benchmark: 2.2 million degrees of freedom; 62 million non-zeros; 380000 shells + 13000 solids + 1100 RBE3; 5300 iterations

    Platform: Nvidia PSG Cluster – 2 nodes, each with dual Nvidia M2090 GPUs; Cuda v3.2; Intel Westmere 2x6 X5670 @ 2.93 GHz; Linux RHEL 5.4 with Intel MPI 4.0

    Performance: elapsed time decreased by up to 13X

    [Chart, 1 node – 2 Nvidia M2090, elapsed (s), lower is better: SMP 6-core 1106; Hybrid 2 MPI x 6 SMP 572; SMP 6 + 1 GPU 254 (4.3X*); Hybrid 2 MPI x 6 SMP + 2 GPUs 143 (7.5X*). Chart, 2 nodes – 4 Nvidia M2090: Hybrid 4 MPI x 6 SMP 306; Hybrid 4 MPI x 6 SMP + 4 GPUs 85 (13X*). *Performance gain versus SMP 6-core.]


    Benchmark #3

    Gravity problem with contacts

    Full car model

    Compute displacements and stresses

    Benchmark: 0.85 million degrees of freedom; 21 million non-zeros; 138000 shells + 11000 solids + 5700 1-D elements + 230 rigid bodies + 6 rigid walls; 1 gravity load and 22 contact interfaces; ~9000 iterations

    Platform: Nvidia PSG Cluster – 2 nodes, each with dual Nvidia M2090 GPUs; Cuda v3.2; Intel Westmere 2x6 X5670 @ 2.93 GHz; Linux RHEL 5.4 with Intel MPI 4.0

    Performance: elapsed time decreased by up to 9X

    [Chart, 1 node – 2 Nvidia M2090, elapsed (s), lower is better: SMP 6-core 1430; Hybrid 2 MPI x 6 SMP 1223; SMP 6 + 1 GPU 443 (3.2X*); Hybrid 2 MPI x 6 SMP + 2 GPUs 225 (6.4X*). Chart, 2 nodes – 4 Nvidia M2090: Hybrid 4 MPI x 6 SMP 531; Hybrid 4 MPI x 6 SMP + 4 GPUs 163 (8.8X*). *Performance gain versus SMP 6-core.]


    CONCLUSION


    Conclusion

    • The RADIOSS implicit direct and iterative solvers have been successfully ported to Nvidia GPUs using Cuda

    • Adding a GPU significantly improves the performance of these solvers

    • For the iterative solver, Hybrid MPP allows runs on multi-GPU workstations and GPU clusters with good scalability and enhanced performance

    • GPU support for the implicit solvers is planned for HyperWorks 12

    A big thanks to the Nvidia team for their great support: Stan Posey, Steven Rennich, Thomas Reed, Peng Wang!

