Team members: Martin Kwok, Matti Kortelainen, Alexei Strelchenko, Giuseppe Cerati (Fermilab)
Team mentors: Mathew Colgrove (NVIDIA), James Osborne, Taylor Childers (ANL)
HEP-CCE PPS Meeting, 7 May 2021
Propagate-to-R: Hackathon Summary
The Event
• 1 (pre-event) + 3 (main-event) days of 9am-5pm Zoom calls
- Daily 5-minute updates from each of the 13 teams
- Other teams: ML / domain-specific code (astro/chem, etc.), usually with a bigger code base than P2R
- Most of the time spent in a breakout room with our mentor, Matt:
  prototype ideas with OpenACC => profile => more ideas + discussion => repeat => port to CUDA
- Alexei and Giuseppe worked independently on their own goals
• Help from our mentor was very useful
- Learned how to use Nsight Compute and interpret its output
- Exclusive 4-day access to A100 GPUs was a nice bonus
List of accomplished items
• Implemented a parallel STL version (Alexei)
• Bug fix and validation (Giuseppe)
- Update track uncertainties across layers (was previously a no-op)
• Explored many avenues to speed up the CUDA kernel
- Static shared memory
- Dynamic shared memory
- Privatizing some of the costly variables
- Minimizing usage of thread-local memory
• One clear winning CUDA version in terms of kernel time (5x speed-up)
• Code infrastructure development
- Multiple CUDA streams
- Added warm-up cycles
- Added compiler macros to include/exclude H2D/D2H transfer time (see the sketch after this list)
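As an illustration of the last two infrastructure items, here is a minimal sketch of a warm-up cycle plus a compile-time switch for including or excluding transfers in the timed region. The macro name INCLUDE_TRANSFERS, the kernel, and the sizes are hypothetical, not the actual P2R code:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work_kernel(float* x) { x[threadIdx.x] += 1.0f; }  // placeholder work

int main() {
  const int n = 256;
  float h[n] = {0.0f};
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));

  work_kernel<<<1, n>>>(d);  // warm-up cycle, excluded from timing
  cudaDeviceSynchronize();

#ifdef INCLUDE_TRANSFERS     // hypothetical macro: time H2D + kernel + D2H
  auto t0 = std::chrono::steady_clock::now();
#endif
  cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
#ifndef INCLUDE_TRANSFERS    // otherwise time the kernel only
  auto t0 = std::chrono::steady_clock::now();
#endif
  work_kernel<<<1, n>>>(d);
  cudaDeviceSynchronize();
#ifndef INCLUDE_TRANSFERS
  auto t1 = std::chrono::steady_clock::now();
#endif
  cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
#ifdef INCLUDE_TRANSFERS
  auto t1 = std::chrono::steady_clock::now();
#endif
  std::printf("elapsed: %.6f s\n", std::chrono::duration<double>(t1 - t0).count());
  cudaFree(d);
  return 0;
}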
P2R overview
• P2R is a standalone mini-app that performs the core math of parallelized track reconstruction
- Builds tracks in the radial direction from detector hits
- Lightweight kernel extracted from a more realistic application (mkFit)
• Performs hit propagation to r and a Kalman update
- Per-track matrix multiplications plus helix-parametrization formulas (standard update formulas below)
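For reference, the Kalman update computes the textbook formulas below; the notation is generic, and the exact matrix layout in P2R may differ:

K  = P H^T (H P H^T + R)^{-1}    (Kalman gain)
x' = x + K (m - H x)             (updated track parameters)
P' = (I - K H) P                 (updated covariance)

where x and P are the propagated track state and covariance, m and R are the hit measurement and its covariance, and H projects the track state onto the measurement.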
Improvement with CUDA version
• Issue 1: not launching enough threads
- Improved the work-allocation scheme to ensure ~1 thread per track
- Mapped 1 MPTRK to 1 CUDA block with bsize threads (see the sketch after the diagram)
[Diagram: total work = ntrks x nevts tracks over events evt 0, evt 1, ...; tracks are grouped in batches of bsize, called MPTRK, with the i-th track at position i within its batch]
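A minimal sketch of this mapping, assuming the MPTRK batch layout in the diagram above; the struct layout and kernel body are illustrative placeholders, not the real P2R math:

#include <cuda_runtime.h>

constexpr int bsize = 32;            // tracks per MPTRK batch

struct MPTRK {                       // illustrative batch layout
  float par[6][bsize];               // 6 track parameters per track
};

// One CUDA block per MPTRK, one thread per track in the batch,
// so every thread owns exactly one track.
__global__ void propagate(MPTRK* batches) {
  const int batch = blockIdx.x;      // which MPTRK
  const int it    = threadIdx.x;     // track index inside the batch
  for (int p = 0; p < 6; ++p)
    batches[batch].par[p][it] += 1.0f;  // placeholder for the real math
}

// Launch over all tracks: total work = nevts * ntrks,
// grouped into (nevts * ntrks) / bsize MPTRK batches:
//   propagate<<<nevts * ntrks / bsize, bsize>>>(d_batches);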
Improvement with CUDA version
• Issue 2: large local arrays in the subroutines limited GPU resource usage
- Used shared memory to hold temporary data across the CUDA block
[Diagram, before: track data plus per-track local arrays; each track initializes a data structure of size bsize but consumes only one element]
Improvement with CUDA version
• Issue 2 (continued): explored other ways to store the local arrays (shared-memory sketch after the diagram)
- Putting some local arrays in global memory
- Minimizing usage of thread-local memory: not as fast
[Diagram, after: track data with local arrays held in shared memory; the local arrays are shared across the CUDA block]
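A minimal sketch of the shared-memory variant; the array name errorProp, the 6x6 temporary per track, and the kernel body are illustrative assumptions:

#include <cuda_runtime.h>

constexpr int bsize = 32;                  // tracks per MPTRK batch / CUDA block

// Instead of each thread keeping its temporaries in registers/local memory
// (which limited occupancy), stage one slice per track in shared memory,
// held on-chip for the whole CUDA block.
__global__ void propagate_shared(float* out) {
  __shared__ float errorProp[36 * bsize];  // one 6x6 temporary per track
  const int it = threadIdx.x;              // this thread's track
  float* tmp = &errorProp[36 * it];        // slice owned by this thread
  for (int i = 0; i < 36; ++i)
    tmp[i] = static_cast<float>(i);        // placeholder for the real math
  __syncthreads();
  out[blockIdx.x * bsize + it] = tmp[0];
}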
Improvement with CUDA version
• Scanned the kernel time as a function of bsize (a driver sketch follows the plot)
[Plot: kernel time vs. bsize; gain from coalesced access as bsize grows, saturating once shared memory is used up]
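A sketch of how such a bsize scan can be driven, templating the kernel on the block size and timing with CUDA events; the kernel body and sizes are illustrative:

#include <cstdio>
#include <cuda_runtime.h>

template <int BS>
__global__ void kernel(float* x) {
  x[blockIdx.x * BS + threadIdx.x] += 1.0f;  // placeholder work
}

template <int BS>
float time_ms(float* d, int ntracks) {       // time one configuration
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  kernel<BS><<<ntracks / BS, BS>>>(d);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  const int ntracks = 1 << 20;
  float* d = nullptr;
  cudaMalloc(&d, ntracks * sizeof(float));
  std::printf("bsize  32: %.3f ms\n", time_ms<32>(d, ntracks));
  std::printf("bsize  64: %.3f ms\n", time_ms<64>(d, ntracks));
  std::printf("bsize 128: %.3f ms\n", time_ms<128>(d, ntracks));
  cudaFree(d);
  return 0;
}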
Comparing CUDA with CPU result
• Dominated by transfer time (only ~1/3 of the time is in the kernel)
• Back-of-the-envelope calculation:
- CPU version's peak throughput: ~2.5e7 tracks/s
- CUDA V100 peak throughput: ~1e8 tracks/s [kernel only]
• Can hide kernel time under transfer time with multiple streams
- Maximum possible CUDA performance (including transfer time) ~ CPU performance
- The original CUDA version was slower
[Profiler output: CPU version with TBB, and a trace of the CUDA version showing HtoD mem copy ~66% and kernel time ~33%]
[Profiler output: CPU version with TBB, and a CUDA trace hiding kernel time under transfer time with multiple streams; mem copy + kernel now ~66%]
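A minimal sketch of the multi-stream overlap; the chunking, stream count, and kernel are illustrative, and pinned host memory is required for the async copies to actually overlap:

#include <cuda_runtime.h>

__global__ void kernel(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;                    // placeholder work
}

int main() {
  const int ntotal = 1 << 22, nstreams = 4;
  const int chunk = ntotal / nstreams;
  float *h = nullptr, *d = nullptr;
  cudaMallocHost(&h, ntotal * sizeof(float)); // pinned host memory
  cudaMalloc(&d, ntotal * sizeof(float));

  cudaStream_t s[nstreams];
  for (int i = 0; i < nstreams; ++i) cudaStreamCreate(&s[i]);

  // Pipeline H2D copy / kernel / D2H copy per chunk so the copies of one
  // chunk overlap the kernel of another.
  for (int i = 0; i < nstreams; ++i) {
    float* hp = h + i * chunk;
    float* dp = d + i * chunk;
    cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    kernel<<<(chunk + 255) / 256, 256, 0, s[i]>>>(dp, chunk);
    cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
  }
  for (int i = 0; i < nstreams; ++i) cudaStreamSynchronize(s[i]);

  for (int i = 0; i < nstreams; ++i) cudaStreamDestroy(s[i]);
  cudaFree(d);
  cudaFreeHost(h);
  return 0;
}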
ParallelSTL version
• Implemented using C++17 standard parallel algorithms (pattern sketched below)
- Targets both x86 and NVIDIA back-ends
• Required a significant redesign of the code structure
• Used gnu-10.2 (with TBB 2020.3) and nvc++ 2021.3
• NV back-end (A100): approx. 4.5x slower than the best CUDA implementation
• x86 back-end: approx. 2x slower than the NV back-end
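A minimal sketch of the parallel STL pattern involved; the loop body is an illustrative placeholder, not the P2R math:

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// One parallel loop over all tracks. Compiled with nvc++ -stdpar this
// offloads to the GPU via unified memory; compiled with g++ and linked
// against TBB it runs on x86 threads.
int main() {
  const int ntracks = 1 << 20;
  std::vector<float> par(ntracks);
  std::vector<int> idx(ntracks);
  std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., ntracks-1

  std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                [p = par.data()](int i) {
                  p[i] = 0.5f * static_cast<float>(i);  // placeholder math
                });
  return 0;
}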
Summary and moving forward
• The CUDA version is well optimized; we understand why, and what its limitations are
• Propagate Giuseppe's bug fix
• Focus now on the Kokkos port