CoPA: Co-Design center for Particle Applications

Date post: 16-Oct-2021
Transcript

CoPA: Co-Design center for Particle Applications

PI: Susan Mniszewski (Los Alamos National Laboratory)

Lab Leads: Jim Belak, Deputy PI (LLNL); Salman Habib (ANL); Stuart Slattery (ORNL); C-S Chang (PPPL); Steve Plimpton (SNL)

Legend: Communication; Memory/flop kernel

Molecular Dynamics (MD)

1. Halo exchange of ghost cells
(2. Construct neighbor lists): may not be done every timestep, or never if using cell lists for simple nonbonded interactions
3. Compute forces on particles due to short-range neighbors
4. Integrate equations of motion (particle update)
5. Resorting of particles into cell lists

CoMD, ExaMiniMD, CabanaMD proxy apps

Particle-in-Cell (PIC)

1. Particle deposition (interpolation from particles → mesh)
2. (Poisson) field solve on mesh: when long-range fields exist, e.g. for plasma PIC
3. Gather forces (mesh → particles)
4. Particle push (update)
(5. Remapping: generate new set of particles): may be done occasionally/never
6. Resorting of particles into cell lists

MD (w/ long-range)

1. Halo exchange of ghost cells
(2. Construct neighbor lists)
3. Compute forces on particles due to short-range neighbors
4. Compute (approximate) long-range forces, e.g. via Ewald, P3M, FMM, or tree methods
5. Integrate equations of motion (particle update)
6. Resorting of particles into cell lists

HACCmk kernel

O(N) Quantum MD

1. Halo exchange of ghost cells
2. Construct neighbor lists
3. Build Hamiltonian (H) and overlap (S) matrices
4. Invert S and orthogonalize H
5. SCF loop iteration
   a) Build density matrix
   b) Calculate partial charges via Coulomb sum
   c) Update H and, if necessary, orthogonalize
6. Compute forces on atoms
7. Particle update

CoSP2 proxy app

Four key sub-motifs of particle-based simulation codes
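The particle-in-cell motif (deposition, gather, push) can be illustrated with a minimal sketch in plain C++. It assumes a 1D periodic mesh and linear (cloud-in-cell) weighting, neither of which is stated on the poster; the field solve is left out. `Mesh1D`, `linear_weights`, `deposit`, `gather`, and `push` are hypothetical names used only for this illustration and are not part of Cabana or any CoPA proxy app.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 1D periodic mesh: n nodes spaced dx apart.
struct Mesh1D {
    std::size_t n;
    double dx;
    double length() const { return static_cast<double>(n) * dx; }
};

struct Particle {
    double x;  // position in [0, length)
    double v;  // velocity
};

// Indices and linear weights of the two mesh nodes bracketing position x.
struct Weights {
    std::size_t left, right;
    double w_left, w_right;
};

Weights linear_weights(const Mesh1D& m, double x) {
    assert(m.n > 0 && m.dx > 0.0);
    assert(x >= 0.0 && x < m.length());
    const double s = x / m.dx;
    const std::size_t i = static_cast<std::size_t>(s);  // floor, since s >= 0
    const double frac = s - static_cast<double>(i);
    return {i % m.n, (i + 1) % m.n, 1.0 - frac, frac};
}

// Step 1: deposit the charge q of every particle onto the mesh (particles -> mesh).
std::vector<double> deposit(const Mesh1D& m, const std::vector<double>& x, double q) {
    std::vector<double> rho(m.n, 0.0);
    for (double xp : x) {
        const Weights w = linear_weights(m, xp);
        rho[w.left] += q * w.w_left / m.dx;
        rho[w.right] += q * w.w_right / m.dx;
    }
    return rho;
}

// Step 3: interpolate a mesh field back to a particle position (mesh -> particles).
double gather(const Mesh1D& m, const std::vector<double>& field, double x) {
    assert(field.size() == m.n);
    const Weights w = linear_weights(m, x);
    return w.w_left * field[w.left] + w.w_right * field[w.right];
}

// Step 4: explicit particle push with periodic wrap-around.
void push(const Mesh1D& m, Particle& p, double e_at_p, double q_over_m, double dt) {
    p.v += q_over_m * e_at_p * dt;
    p.x += p.v * dt;
    p.x = std::fmod(p.x, m.length());
    if (p.x < 0.0) p.x += m.length();
    if (p.x >= m.length()) p.x = 0.0;  // guard against rounding at the boundary
}
```

Because the two weights of each particle sum to one, the deposited charge density integrates to the total particle charge, which is a useful sanity check for any deposition kernel.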

Challenges of O(N) Quantum Molecular Dynamics

• Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions

• Electronic structure methods typically require manipulations of a Hamiltonian or density matrix
  – For non-metallic systems, these are sparse → can obtain O(N) scaling
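One way to build a density matrix using only matrix multiplications is second-order spectral projection (SP2). The poster does not spell out the algorithm (it only names the CoSP2 proxy app), so the following is a sketch under that assumption, not the PROGRESS/BML implementation. It uses small dense matrices for readability; reaching O(N) would additionally require a sparse, thresholded matrix format. The names `Matrix`, `multiply`, `trace`, and `sp2_density_matrix` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<double>;  // dense, row-major, n x n

Matrix multiply(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

double trace(const Matrix& a, std::size_t n) {
    double t = 0.0;
    for (std::size_t i = 0; i < n; ++i) t += a[i * n + i];
    return t;
}

// Density matrix of a symmetric (orthogonalized) Hamiltonian h with nocc
// occupied states, via trace-correcting SP2 iterations.
Matrix sp2_density_matrix(const Matrix& h, std::size_t n, double nocc,
                          double tol = 1e-12, int max_iter = 100) {
    assert(h.size() == n * n);

    // Gershgorin estimates of the spectral bounds [emin, emax].
    double emin = std::numeric_limits<double>::infinity();
    double emax = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < n; ++i) {
        double radius = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) radius += std::abs(h[i * n + j]);
        emin = std::min(emin, h[i * n + i] - radius);
        emax = std::max(emax, h[i * n + i] + radius);
    }
    assert(emax > emin);

    // X0 = (emax * I - H) / (emax - emin): eigenvalues in [0, 1],
    // with the lowest-energy states mapped closest to 1.
    Matrix x(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            x[i * n + j] = ((i == j ? emax : 0.0) - h[i * n + j]) / (emax - emin);

    for (int iter = 0; iter < max_iter; ++iter) {
        const Matrix x2 = multiply(x, x, n);
        const double tr_x = trace(x, n);
        const double tr_x2 = trace(x2, n);
        if (std::abs(tr_x - tr_x2) < tol) break;  // X is (numerically) idempotent

        // Pick the projection whose trace lands closer to the occupation.
        if (std::abs(tr_x2 - nocc) < std::abs(2.0 * tr_x - tr_x2 - nocc)) {
            x = x2;                               // X <- X^2 lowers the trace
        } else {
            for (std::size_t k = 0; k < n * n; ++k)
                x[k] = 2.0 * x[k] - x2[k];        // X <- 2X - X^2 raises the trace
        }
    }
    return x;
}
```

For a three-site chain with hopping -1 and one occupied state, the result is the projector onto the lowest eigenvector, so its trace is 1 and its middle diagonal element is 0.5.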

PROGRESS/BML libraries for O(N) Quantum Molecular Dynamics

Lead: Christian Negre (LANL), Co-Lead: Jean-Luc Fattebert (ORNL)

https://github.com/lanl/qmd-progress

https://github.com/lanl/bml

Cabana: A Co-Designed HPC Library for Particle Applications

Lead: Stuart Slattery (ORNL)

ECP particle application space (CoPA Tier 1 partners)

Anatomy of a Timestep

Applications using PROGRESS/BML

• PROGRESS and BML libraries provide performance portability

• LATTE (Los Alamos Transferable Tight-binding for Energetics)
  – Developed and utilized within the EXAALT AD project
  – Fast O(N^3) and efficient, parallel O(N) density matrix builds
  – Extended Lagrangian Born-Oppenheimer MD gives long-time energy conservation without expensive SCF cycles
  – Graph-based linear scaling electronic structure theory

• DFTB+ (Density Functional Tight Binding (and more)) – U. Bremen
  – Approximate density functional theory based quantum simulation tool (~1000 citations)
  – Full integration with PROGRESS/BML after LANL visit by lead author (B. Aradi)
  – https://github.com/cnegre/dftbplus; manuscript in preparation.

• MGmol – LLNL (+ORNL)
  – Real-space finite difference, pseudopotential DFT code
  – 2016 Gordon Bell finalist
  – O(N) scheme relies on a matrix divide-and-conquer approach to determine orbital occupation/density matrix

Cabana Data Structures

Particles are tuples of multidimensional data:

struct Particle
{
    float pos_x;
    float pos_y;
    float pos_z;
    double vel_x;
    double vel_y;
    double vel_z;
    int matid;
};

Let’s rethink this in terms of SIMD units (e.g. a CUDA warp or CPU vector instruction)

struct SimdParticle
{
    float pos_x[VECLEN];
    float pos_y[VECLEN];
    float pos_z[VECLEN];
    double vel_x[VECLEN];
    double vel_y[VECLEN];
    double vel_z[VECLEN];
    int matid[VECLEN];
};

An array of SimdParticle objects can be configured to a struct-of-arrays when VECLEN = (# of particles), or to an array-of-structs when VECLEN = 1
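How VECLEN interpolates between the two layouts can be sketched in plain C++ with no Cabana or Kokkos dependency. This is an illustration only: `ParticleAoSoA`, `block_index`, and `lane_index` are hypothetical names, and VECLEN is a compile-time template parameter here, so the struct-of-arrays limit corresponds to any VECLEN at least as large as the particle count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One SIMD-sized block of particles, as in the struct above.
template <std::size_t VECLEN>
struct SimdParticle {
    float pos_x[VECLEN];
    float pos_y[VECLEN];
    float pos_z[VECLEN];
    double vel_x[VECLEN];
    double vel_y[VECLEN];
    double vel_z[VECLEN];
    int matid[VECLEN];
};

// Hypothetical array-of-structs-of-arrays container: a vector of blocks.
template <std::size_t VECLEN>
class ParticleAoSoA {
public:
    explicit ParticleAoSoA(std::size_t num_particle)
        : size_(num_particle),
          blocks_((num_particle + VECLEN - 1) / VECLEN) {}  // round up

    std::size_t size() const { return size_; }
    std::size_t num_blocks() const { return blocks_.size(); }

    // Particle p lives in block p / VECLEN, at lane p % VECLEN.
    static std::size_t block_index(std::size_t p) { return p / VECLEN; }
    static std::size_t lane_index(std::size_t p) { return p % VECLEN; }

    float& pos_x(std::size_t p) {
        assert(p < size_);
        return blocks_[block_index(p)].pos_x[lane_index(p)];
    }
    int& matid(std::size_t p) {
        assert(p < size_);
        return blocks_[block_index(p)].matid[lane_index(p)];
    }

private:
    std::size_t size_;
    std::vector<SimdParticle<VECLEN>> blocks_;
};
```

With 20 particles, `ParticleAoSoA<1>` holds 20 single-particle blocks (array-of-structs), `ParticleAoSoA<32>` holds one block (effectively struct-of-arrays), and `ParticleAoSoA<8>` holds three blocks, so consecutive particles within a block sit next to each other in each member array.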

• VECLEN should be the size of a CUDA warp when using an NVIDIA GPU

• VECLEN should be a multiple of the vector width on an Intel CPU

• Allocate a block of these and you have an AoSoA

• Cabana data structure syntax is similar to Kokkos but instead of a View we build a new container of a specified number of particles in AoSoA format in a specified memory space:

Developers: Adedoyin Adetokunbo (LANL), Jean-Luc Fattebert (ORNL), Jamal Mohd-Yusof (LANL), Christian Negre (LANL), Daniel Osei-Kuffuor (LLNL)

Developers: Bob Bird (LANL), Guangye Chen (LANL), Damien Lebrun-Grandie (ORNL), Christoph Junghans (LANL), Aaron Scheinberg (PPPL), Stan Moore (SNL), Sam Reeve (LLNL), Shane Fogerty (LANL)

using DataTypes = Cabana::MemberTypes<float[3],double[3],int>;
enum MyTypeNames { Position = 0, Velocity = 1, Matid = 2 };
Cabana::AoSoA<DataTypes,Kokkos::HostSpace,8> host_aosoa( num_particle );
Cabana::AoSoA<DataTypes,Kokkos::CudaUVMSpace,32> device_aosoa( num_particle );

https://github.com/ECP-copa/Cabana

What is Cabana?

• Cabana is a software library for developing exascale applications that use particle algorithms, including particle-grid (Cajita)

• The library contains general particle data structures and algorithms implemented with those data structures

• Cabana is designed for modern DOE HPC architectures and builds directly on Kokkos

• Cabana is open source and distributed on GitHub under a BSD 3-clause license

How does Cabana impact ECP applications?

• Serves as a rapid development tool for algorithm research

• Acts as an intermediary to hardware vendors via isolated algorithm implementations

• Best practices can be translated to ECP applications

• Integration with ECP mini-applications

• Integration with full ECP applications

How is Cabana designed and developed?

• Co-Design motifs distilled into a shared set of concrete algorithms across partner applications

• A requirements document was produced in FY18 Q1 to build library specifications

• Implementations are designed in partnership with and reviewed by application stakeholders

• Performance benchmarks and mini-applications give continuous feedback

CoPA Cabana Ecosystem

Applications: LAMMPS, HACC, XGC, …

CabanaMD: Molecular dynamics proxy app (LJ, NNP)

CabanaPIC: Particle-in-cell proxy app

Cabana: Performance portable particle motifs; multi-node performance portable particle comm; flexible particle data layout

Cajita: Performance portable particle-grid motifs, multi-node portable grid comm, long-range solvers

Kokkos: On-node performance portability: execution and memory spaces (CUDA, OpenMP threads, HPX, HIP, SYCL)

MPI

LAMMPS/SNAP GPU Performance Over Time

Significant increase of roughly 5x in performance over the baseline implementation of the SNAP benchmark running on NVIDIA V100 GPUs (Aidan Thompson (EXAALT), Stan Moore (CoPA) and Rahulkumar Gayatri (NESAP))

XGC to full-size Summit for a production case tokamak plasma and geometry (Aaron Scheinberg (WDMapp/CoPA) and Stuart Slattery (CoPA))

The improved Cabana version performs better than the original GPU version.

Figure: original version vs. improved Cabana version (lower is better); 50M electrons/GPU, 2.4T ions and electrons on 90% of Summit.

Library approach:

- PROGRESS: High-level solvers
- BML: Low-level matrix formats and APIs are the same for all matrix types (dense, sparse) and architectures, but underlying implementations can be different
- Dense matrix routines wrap BLAS/LAPACK calls
- Sparse matrix routines are hand-written
- CPU only or CPU-GPU
- New: CSR, ELLBLOCK formats. Acceleration with MAGMA.

Performance portability:

- One codebase
- Flexibility in configuring and building
- Ported and tested for multiple compilers: GNU, IBM, Intel, and more
- Leverage existing BLAS linear algebra libraries
- GPU acceleration in many forms
  - MAGMA for dense matrix format
  - OpenMP offload for hand-written code
  - CUDA/CUBLAS/CUSPARSE as needed

MAIN GOAL: Construct a flexible library ecosystem for Quantum Chemistry Applications adapted to pre-exascale and exascale architectures.

• A basic matrix library (BML) offers abstractions for matrix operations independent of the underlying data structures (dense or sparse) and algorithms arising in quantum chemistry codes in C and Fortran

• A high-level PROGRESS solver library using BML provides methods (algorithms) for integration into existing codes (LATTE, DFTB+, etc.) or building new codes
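The layering described above, one matrix API with format-specific implementations underneath and solvers written only against that API, can be sketched as follows. This is an illustration in C++ and not the BML or PROGRESS API (those are C and Fortran libraries); `Matrix`, `DenseMatrix`, `EllpackMatrix`, and `rayleigh_quotient` are hypothetical names, and the ELLPACK-style sparse layout is chosen here only as a simple example of a sparse format.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Format-independent interface (hypothetical; stands in for the "same API" layer).
class Matrix {
public:
    virtual ~Matrix() = default;
    virtual std::size_t rows() const = 0;
    virtual void set(std::size_t i, std::size_t j, double value) = 0;
    // y = A x; each format supplies its own implementation.
    virtual std::vector<double> apply(const std::vector<double>& x) const = 0;
};

class DenseMatrix final : public Matrix {
public:
    explicit DenseMatrix(std::size_t n) : n_(n), data_(n * n, 0.0) {}
    std::size_t rows() const override { return n_; }
    void set(std::size_t i, std::size_t j, double value) override {
        data_[i * n_ + j] = value;
    }
    std::vector<double> apply(const std::vector<double>& x) const override {
        assert(x.size() == n_);
        std::vector<double> y(n_, 0.0);
        for (std::size_t i = 0; i < n_; ++i)
            for (std::size_t j = 0; j < n_; ++j)
                y[i] += data_[i * n_ + j] * x[j];
        return y;
    }
private:
    std::size_t n_;
    std::vector<double> data_;
};

// ELLPACK-style storage: at most m_ stored entries per row.
class EllpackMatrix final : public Matrix {
public:
    EllpackMatrix(std::size_t n, std::size_t max_per_row)
        : n_(n), m_(max_per_row), nnz_(n, 0),
          cols_(n * max_per_row, 0), vals_(n * max_per_row, 0.0) {}
    std::size_t rows() const override { return n_; }
    void set(std::size_t i, std::size_t j, double value) override {
        for (std::size_t k = 0; k < nnz_[i]; ++k)
            if (cols_[i * m_ + k] == j) { vals_[i * m_ + k] = value; return; }
        assert(nnz_[i] < m_);  // row is full: this sketch does not grow rows
        cols_[i * m_ + nnz_[i]] = j;
        vals_[i * m_ + nnz_[i]] = value;
        ++nnz_[i];
    }
    std::vector<double> apply(const std::vector<double>& x) const override {
        assert(x.size() == n_);
        std::vector<double> y(n_, 0.0);
        for (std::size_t i = 0; i < n_; ++i)
            for (std::size_t k = 0; k < nnz_[i]; ++k)  // stored entries only
                y[i] += vals_[i * m_ + k] * x[cols_[i * m_ + k]];
        return y;
    }
private:
    std::size_t n_, m_;
    std::vector<std::size_t> nnz_, cols_;
    std::vector<double> vals_;
};

// "Solver-level" routine written once against the interface: (x . A x) / (x . x).
double rayleigh_quotient(const Matrix& a, const std::vector<double>& x) {
    const std::vector<double> ax = a.apply(x);
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        num += x[i] * ax[i];
        den += x[i] * x[i];
    }
    assert(den > 0.0);
    return num / den;
}
```

The same `rayleigh_quotient` call works on either format, while the dense version touches all n² entries and the sparse version touches only stored ones, which mirrors the split between a shared API and differing underlying implementations.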

Software Architecture

PROGRESS/BML enables LATTE Scaling on Summit for metallic systems

Performance on CPU+GPU and CPU is similar for small matrices. As the matrix size increases, the CPU+GPU is the clear winner with good speedup! (Christian Negre, LANL)

Matrices of small sizes (< 3000) do more compute on the CPU for diagonalization in MAGMA. (J.-L. Fattebert, ORNL)

Figure annotation: sizes relevant to EXAALT

OLCF GPU Hackathon: CabanaPIC Proxy App (plots marked "lower is better" and "higher is better")
