Team members: Martin Kwok, Matti Kortelainen, Alexei Strelchenko, Giuseppe Cerati (Fermilab)
Team mentors: Mathew Colgrove (NVIDIA), James Osborne, Taylor Childers (ANL)
HEP-CCE PPS Meeting, 7 May 2021
Propagate-to-R: Hackathon Summary
The Event
• 1 (pre-event) + 3 (main-event) days of 9am-5pm Zoom calls
- Daily 5-minute updates from each of the 13 teams
- Other teams: ML / domain-specific code (astro/chem, etc.), usually with a bigger code base than P2R
- Most of the time spent in a breakout room with our mentor, Matt:
  prototype ideas with OpenACC => profile => more ideas + discussion => repeat => port to CUDA
- Alexei and Giuseppe worked independently on their own goals
• Help from our mentor was very useful
- Learned how to use Nsight Compute and interpret its output
- Exclusive 4-day access to A100 GPUs was a nice bonus
List of accomplished items
• Implemented a parallel STL version (Alexei)
• Bug fix and validation (Giuseppe)
- Update track uncertainties across layers (was previously a no-op)
• Explored many avenues to speed up the CUDA kernel
- Static shared memory
- Dynamic shared memory
- Privatizing some of the costly variables
- Minimizing usage of thread-local memory
• One clear winning CUDA version in terms of kernel time (5x speed-up)
• Code infrastructure development
- Multiple CUDA streams
- Added warm-up cycles
- Added compiler macros to include/exclude H2D/D2H transfer time (see the sketch after this list)
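As an illustration of the last two infrastructure items, here is a minimal sketch of a warm-up cycle plus a compile-time switch for including or excluding transfers in the timed region. The macro name INCLUDE_TRANSFERS, the kernel, and the sizes are hypothetical, not the actual P2R code:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work_kernel(float* x) { x[threadIdx.x] += 1.0f; }  // placeholder work

int main() {
  const int n = 256;
  float h[n] = {0.0f};
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));

  work_kernel<<<1, n>>>(d);  // warm-up cycle, excluded from timing
  cudaDeviceSynchronize();

#ifdef INCLUDE_TRANSFERS     // hypothetical macro: time H2D + kernel + D2H
  auto t0 = std::chrono::steady_clock::now();
#endif
  cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
#ifndef INCLUDE_TRANSFERS    // otherwise time the kernel only
  auto t0 = std::chrono::steady_clock::now();
#endif
  work_kernel<<<1, n>>>(d);
  cudaDeviceSynchronize();
#ifndef INCLUDE_TRANSFERS
  auto t1 = std::chrono::steady_clock::now();
#endif
  cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
#ifdef INCLUDE_TRANSFERS
  auto t1 = std::chrono::steady_clock::now();
#endif
  std::printf("elapsed: %.6f s\n", std::chrono::duration<double>(t1 - t0).count());
  cudaFree(d);
  return 0;
}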
P2R overview
• P2R is a standalone mini-app that performs the core math of parallelized track reconstruction
- Builds tracks in the radial direction from detector hits
- Lightweight kernel extracted from a more realistic application (mkFit)
• Performs hit propagation to r and a Kalman update
- Per-track matrix multiplications plus helix-parametrization formulas (standard update formulas below)
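For reference, the Kalman update computes the textbook formulas below; the notation is generic, and the exact matrix layout in P2R may differ:

K  = P H^T (H P H^T + R)^{-1}    (Kalman gain)
x' = x + K (m - H x)             (updated track parameters)
P' = (I - K H) P                 (updated covariance)

where x and P are the propagated track state and covariance, m and R are the hit measurement and its covariance, and H projects the track state onto the measurement.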
Improvement with CUDA version
• Issue 1: not launching enough threads
- Improved the work-allocation scheme to ensure ~1 thread per track
- Mapped 1 MPTRK to 1 CUDA block with bsize threads (see the sketch after the diagram)
[Diagram: total work = ntrks x nevts tracks over events evt 0, evt 1, ...; tracks are grouped in batches of bsize, called MPTRK, with the i-th track at position i within its batch]
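A minimal sketch of this mapping, assuming the MPTRK batch layout in the diagram above; the struct layout and kernel body are illustrative placeholders, not the real P2R math:

#include <cuda_runtime.h>

constexpr int bsize = 32;            // tracks per MPTRK batch

struct MPTRK {                       // illustrative batch layout
  float par[6][bsize];               // 6 track parameters per track
};

// One CUDA block per MPTRK, one thread per track in the batch,
// so every thread owns exactly one track.
__global__ void propagate(MPTRK* batches) {
  const int batch = blockIdx.x;      // which MPTRK
  const int it    = threadIdx.x;     // track index inside the batch
  for (int p = 0; p < 6; ++p)
    batches[batch].par[p][it] += 1.0f;  // placeholder for the real math
}

// Launch over all tracks: total work = nevts * ntrks,
// grouped into (nevts * ntrks) / bsize MPTRK batches:
//   propagate<<<nevts * ntrks / bsize, bsize>>>(d_batches);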
Improvement with CUDA version
• Issue 2: large local arrays in the subroutines limited GPU resource usage
- Used shared memory to hold temporary data across the CUDA block
[Diagram, before: track data plus per-track local arrays; each track initializes a data structure of size bsize but consumes only one element]
Improvement with CUDA version
• Issue 2 (continued): explored other ways to store the local arrays (shared-memory sketch after the diagram)
- Putting some local arrays in global memory
- Minimizing usage of thread-local memory: not as fast
[Diagram, after: track data with local arrays held in shared memory; the local arrays are shared across the CUDA block]
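A minimal sketch of the shared-memory variant; the array name errorProp, the 6x6 temporary per track, and the kernel body are illustrative assumptions:

#include <cuda_runtime.h>

constexpr int bsize = 32;                  // tracks per MPTRK batch / CUDA block

// Instead of each thread keeping its temporaries in registers/local memory
// (which limited occupancy), stage one slice per track in shared memory,
// held on-chip for the whole CUDA block.
__global__ void propagate_shared(float* out) {
  __shared__ float errorProp[36 * bsize];  // one 6x6 temporary per track
  const int it = threadIdx.x;              // this thread's track
  float* tmp = &errorProp[36 * it];        // slice owned by this thread
  for (int i = 0; i < 36; ++i)
    tmp[i] = static_cast<float>(i);        // placeholder for the real math
  __syncthreads();
  out[blockIdx.x * bsize + it] = tmp[0];
}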
Improvement with CUDA version
• Scanned the kernel time as a function of bsize (a driver sketch follows the plot)
[Plot: kernel time vs. bsize; gain from coalesced access as bsize grows, saturating once shared memory is used up]
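A sketch of how such a bsize scan can be driven, templating the kernel on the block size and timing with CUDA events; the kernel body and sizes are illustrative:

#include <cstdio>
#include <cuda_runtime.h>

template <int BS>
__global__ void kernel(float* x) {
  x[blockIdx.x * BS + threadIdx.x] += 1.0f;  // placeholder work
}

template <int BS>
float time_ms(float* d, int ntracks) {       // time one configuration
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  kernel<BS><<<ntracks / BS, BS>>>(d);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  const int ntracks = 1 << 20;
  float* d = nullptr;
  cudaMalloc(&d, ntracks * sizeof(float));
  std::printf("bsize  32: %.3f ms\n", time_ms<32>(d, ntracks));
  std::printf("bsize  64: %.3f ms\n", time_ms<64>(d, ntracks));
  std::printf("bsize 128: %.3f ms\n", time_ms<128>(d, ntracks));
  cudaFree(d);
  return 0;
}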
Comparing CUDA with CPU result
• Dominated by transfer time (only ~1/3 of the time is in the kernel)
• Back-of-the-envelope calculation:
- CPU version's peak throughput: ~2.5e7 tracks/s
- CUDA V100 peak throughput: ~1e8 tracks/s [kernel only]
• Can hide kernel time under transfer time with multiple streams
- Maximum possible CUDA performance (including transfer time) ~ CPU performance
- The original CUDA version was slower
[Profiler output: CPU version with TBB, and a trace of the CUDA version showing HtoD mem copy ~66% and kernel time ~33%]
[Profiler output: CPU version with TBB, and a CUDA trace hiding kernel time under transfer time with multiple streams; mem copy + kernel now ~66%]
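A minimal sketch of the multi-stream overlap; the chunking, stream count, and kernel are illustrative, and pinned host memory is required for the async copies to actually overlap:

#include <cuda_runtime.h>

__global__ void kernel(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;                    // placeholder work
}

int main() {
  const int ntotal = 1 << 22, nstreams = 4;
  const int chunk = ntotal / nstreams;
  float *h = nullptr, *d = nullptr;
  cudaMallocHost(&h, ntotal * sizeof(float)); // pinned host memory
  cudaMalloc(&d, ntotal * sizeof(float));

  cudaStream_t s[nstreams];
  for (int i = 0; i < nstreams; ++i) cudaStreamCreate(&s[i]);

  // Pipeline H2D copy / kernel / D2H copy per chunk so the copies of one
  // chunk overlap the kernel of another.
  for (int i = 0; i < nstreams; ++i) {
    float* hp = h + i * chunk;
    float* dp = d + i * chunk;
    cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    kernel<<<(chunk + 255) / 256, 256, 0, s[i]>>>(dp, chunk);
    cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
  }
  for (int i = 0; i < nstreams; ++i) cudaStreamSynchronize(s[i]);

  for (int i = 0; i < nstreams; ++i) cudaStreamDestroy(s[i]);
  cudaFree(d);
  cudaFreeHost(h);
  return 0;
}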
ParallelSTL version
• Implemented using C++17 standard parallel algorithms (pattern sketched below)
- Targets both x86 and NVIDIA back-ends
• Required a significant redesign of the code structure
• Used gnu-10.2 (with TBB 2020.3) and nvc++ 2021.3
• NV back-end (A100): approx. 4.5x slower than the best CUDA implementation
• x86 back-end: approx. 2x slower than the NV back-end
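A minimal sketch of the parallel STL pattern involved; the loop body is an illustrative placeholder, not the P2R math:

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// One parallel loop over all tracks. Compiled with nvc++ -stdpar this
// offloads to the GPU via unified memory; compiled with g++ and linked
// against TBB it runs on x86 threads.
int main() {
  const int ntracks = 1 << 20;
  std::vector<float> par(ntracks);
  std::vector<int> idx(ntracks);
  std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., ntracks-1

  std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                [p = par.data()](int i) {
                  p[i] = 0.5f * static_cast<float>(i);  // placeholder math
                });
  return 0;
}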
Summary and moving forward
• The CUDA version is well optimized; we understand why, and what its limitations are
• Propagate Giuseppe's bug fix
• Focus now on the Kokkos port