Managed by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Los Alamos National Laboratory
Porting mini-apps to ARM HPC systems
Brian J Gravelle, Dave Nystrom
September 2019, ARM Research Summit
(unclassified)
LA-UR-19-29059
10/1/2019 | 3Los Alamos National Laboratory
Introduction
• x86 and Power dominate HPC CPU market
• ARM is a new alternative with potential for
  • Low-power systems
  • Customization through multiple chip makers
  • High levels of parallelism
Introduction
How will old codes work on new systems?
Are the performance issues significantly different?
Systems Used
• Compare Intel Skylake to Marvell ThunderX2

                     Skylake Gold 6152   ThunderX2-B1
Cores                44                  56
Threads per core     2                   4
Clock                2.1GHz              2.0GHz
L1 data cache        32K                 32K
L2 cache             1024K               256K
L3 cache             30976K              32768K
Memory Controllers   6                   8
SIMD instructions    Up to 512 bit       128 bit NEON
*This work used systems funded by the Computational Systems and Software Environments (CSSE) subprogram of LANL’s ASC program NNSA/DOE
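
As a rough illustration of what the table implies for double-precision code, the sketch below computes the total hardware threads per node and the number of 64-bit values that fit in one SIMD register on each system. The numbers are taken from the table; the dictionary layout and function names are illustrative assumptions.

```python
# Values copied from the systems table; the dictionary layout and function
# names are illustrative assumptions, not part of the original study.
SYSTEMS = {
    "Skylake Gold 6152": {"cores": 44, "threads_per_core": 2, "simd_bits": 512},
    "ThunderX2-B1": {"cores": 56, "threads_per_core": 4, "simd_bits": 128},
}


def hardware_threads(cores: int, threads_per_core: int) -> int:
    """Total hardware threads available on the node."""
    return cores * threads_per_core


def doubles_per_vector(simd_bits: int) -> int:
    """How many 64-bit floating-point values fit in one SIMD register."""
    return simd_bits // 64


for name, spec in SYSTEMS.items():
    threads = hardware_threads(spec["cores"], spec["threads_per_core"])
    lanes = doubles_per_vector(spec["simd_bits"])
    print(f"{name}: {threads} hardware threads, {lanes} doubles per vector")
```

By this count a full-width Skylake vector holds four times as many doubles as a NEON vector, which is one plausible reason vectorization could matter more on Skylake.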
Measurement Methodology
• TAU
  • Performance measurement
  • Sampling
  • Profiling
• Caliper
  • Performance measurement
  • Instrumentation
• PAPI
  • Hardware counter interface
Measurements of Interest
• Frontend and Backend Stalls
• Cache performance
• SIMD instruction use
• Energy
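
Raw counter totals are easier to compare across systems once they are normalized. The sketch below shows one common way to turn counts into a cache miss rate and a stall fraction. The counter values are hypothetical placeholders, and the function names are assumptions rather than anything from TAU, Caliper, or PAPI.

```python
def miss_rate(misses: int, accesses: int) -> float:
    """Fraction of cache accesses that missed; 0.0 if there were no accesses."""
    return misses / accesses if accesses else 0.0


def stall_fraction(stall_cycles: int, total_cycles: int) -> float:
    """Fraction of cycles lost to stalls; 0.0 if no cycles were counted."""
    return stall_cycles / total_cycles if total_cycles else 0.0


# Hypothetical counter readings, for illustration only.
l1_misses, l1_accesses = 2_500_000, 10_000_000
frontend_stalls, cycles = 3_000_000, 12_000_000

print(f"L1 miss rate: {miss_rate(l1_misses, l1_accesses):.0%}")
print(f"Frontend stall fraction: {stall_fraction(frontend_stalls, cycles):.0%}")
```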
Mini-apps used
• SNAP
  • Mini-app for PARTISN
  • Computation based on 6D neutral particle transport
  • 3D structured spatial mesh
  • 3D velocity space uses 2 angle coordinates and an energy coordinate
  • Fortran, MPI, OpenMP
SNAP Problem size
• Problem Size
  • 3D mesh 270x40x64
• Skylake
  • MPI+OpenMP threading
  • 40 ranks with 2 threads each
• ThunderX2
  • MPI+OpenMP threading
  • 40 ranks with 4 threads each
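
To make the scale of this configuration concrete, the sketch below works out the total cell count and the total thread count on each system. The even split of cells across ranks is a simplifying assumption for illustration; SNAP's actual decomposition may differ.

```python
MESH = (270, 40, 64)  # 3D mesh dimensions from the slide
RANKS = 40            # MPI ranks used on both systems
THREADS_PER_RANK = {"Skylake": 2, "ThunderX2": 4}


def total_cells(mesh):
    """Number of spatial cells in the structured mesh."""
    nx, ny, nz = mesh
    return nx * ny * nz


def total_threads(ranks, threads_per_rank):
    """Ranks times OpenMP threads per rank."""
    return ranks * threads_per_rank


cells = total_cells(MESH)
print(f"cells: {cells}, average cells per rank: {cells // RANKS}")
for system, tpr in THREADS_PER_RANK.items():
    print(f"{system}: {total_threads(RANKS, tpr)} threads in total")
```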
SNAP Analysis (Intel 2t ARM 4t)
[Figure: SNAP Frontend and Backend Stalls. Bar groups: Frontend Stalls, Backend Stalls. Series: Skylake, ThunderX2.]
SNAP Analysis (Intel 2t)
[Figure: Time and Instructions vs Vector types for dim3_sweep. X axis: Type of Floating Point operations (FLOP count, No Vec, SSE, AVX 2, AVX 512). Y axis: Count of FP INS or Cycles used in dim3_sweep. Series: FP INS count, ideal FP ins, Cycle count.]
SNAP Analysis (ARM 4t)
• ARM SIMD comparison – solve time

No SIMD   NEON    Speedup
84.7s     66.1s   1.28x
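
The speedup figure follows directly from the two solve times. A minimal sketch of the arithmetic, with illustrative function names:

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Ratio of baseline time to optimized time (greater than 1 means faster)."""
    if optimized_seconds <= 0:
        raise ValueError("optimized time must be positive")
    return baseline_seconds / optimized_seconds


def time_reduction(baseline_seconds: float, optimized_seconds: float) -> float:
    """Fraction of the baseline time that was saved."""
    return 1.0 - optimized_seconds / baseline_seconds


# Solve times from the slide: no SIMD vs NEON on ThunderX2.
print(f"speedup: {speedup(84.7, 66.1):.2f}x")
print(f"time saved: {time_reduction(84.7, 66.1):.1%}")
```

Put differently, NEON trims roughly a fifth off the ThunderX2 solve time.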
SNAP Analysis
• Vector instructions provide significant improvement to Skylake performance
• The gain is smaller on ThunderX2: NEON gives only a 1.28x solve-time speedup
SNAP Energy Results
[Figure: SNAP energy. Bar groups: Intel 1 thread, Intel 2 threads, ARM 1 thread, ARM 4 threads. Series: Package, DRAM, Total.]
SNAP Energy Results
[Figure: SNAP energy, including the no-SIMD run. Bar groups: Intel 1 thread, Intel 2 threads, ARM 1 thread, ARM 4 threads, Intel NoSIMD 2 threads. Series: Package, DRAM, Total.]
Mini-apps used
• XSBench
  • “mini-app representing a key computational kernel of the Monte Carlo neutronics application OpenMC”
  • Mostly tabular lookup based on randomly generated energy values
  • C, OpenMP
XSBench Problem size
• Problem Size
  • Standard size reactor with 5E6 particles
• Skylake
  • OpenMP threading
  • 88 threads
• ThunderX2
  • OpenMP threading
  • Ranging from 56 to 448 threads
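
The ThunderX2 range runs past what the hardware provides: with 56 cores and 4 hardware threads per core, anything above 224 software threads oversubscribes the node. The sketch below enumerates such a sweep, assuming a step of 56 threads (as the axis ticks of the later hardware-counter plots suggest). The function names are illustrative.

```python
CORES = 56               # ThunderX2 cores, from the systems table
HW_THREADS_PER_CORE = 4  # hardware threads per core, from the systems table


def thread_sweep(first=56, last=448, step=56):
    """Thread counts to test, inclusive of both endpoints."""
    return list(range(first, last + 1, step))


def is_oversubscribed(n_threads, cores=CORES, hw_per_core=HW_THREADS_PER_CORE):
    """True when more software threads are requested than hardware threads exist."""
    return n_threads > cores * hw_per_core


for n in thread_sweep():
    label = "oversubscribed" if is_oversubscribed(n) else "fits in hardware"
    print(f"{n:4d} threads = {n // CORES} per core ({label})")
```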
XSBench Analysis
[Figure: Frontend and Backend Stalls. Bar groups: Frontend Stalls, Backend Stalls. Series: Skylake (history), Skylake (event), ThunderX2 (history), ThunderX2 (event).]
XSBench Analysis
[Figure: Cache Performance. Bar groups: L1 Miss Rate, L2 Miss Rate (0% to 100%). Series: Skylake (history), Skylake (event), ThunderX2 (history), ThunderX2 (event).]
XSBench Improvements
• Event-based XSBench
  • There is a nuclide array with each element storing random energies and other values
  • The main loop iterates over this array
  • For each nuclide it looks up data based primarily on the energy value
  • Lookups are distributed to threads
  • Randomly ordered energy values prevent locality in the lookups
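
The access pattern described above can be sketched in a few lines. This is a simplified, single-threaded illustration and not XSBench code: the grid, the table, and all function names are assumptions chosen to show why random energies scatter the lookups across the table.

```python
import bisect
import random


def make_energy_grid(n_points, rng):
    """Sorted grid of tabulated energies (hypothetical stand-in for the real data)."""
    return sorted(rng.random() for _ in range(n_points))


def make_events(n_events, rng):
    """One randomly sampled energy per event, in no particular order."""
    return [rng.random() for _ in range(n_events)]


def lookup(grid, xs_table, energy):
    """Binary-search the grid and return the table entry at or below the energy.

    Energies below the first grid point are clamped to the first entry.
    """
    i = max(bisect.bisect_right(grid, energy) - 1, 0)
    return xs_table[i]


def run_event_based(grid, xs_table, events):
    """Main loop: one independent lookup per event."""
    return [lookup(grid, xs_table, e) for e in events]


rng = random.Random(0)
grid = make_energy_grid(1000, rng)
xs_table = [rng.random() for _ in grid]
events = make_events(10, rng)
results = run_event_based(grid, xs_table, events)
print(len(results), "lookups performed")
```

Because consecutive events carry unrelated energies, consecutive lookups land in unrelated parts of the table. In the real code the loop iterations would be shared among OpenMP threads.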
XSBench Improvements
• To improve cache locality, sort the nuclide array before performing lookups
• The optimized kernel in the XSBench distribution sorts by both energy and material
• Our version sorts by energy only
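
The effect of an energy-only sort can be illustrated without any hardware counters by measuring how far consecutive lookups jump within the energy grid. The sketch below is an assumption-laden toy, not the authors' kernel: the grid is random, and "index jumps" is a hypothetical proxy for cache locality.

```python
import bisect
import random


def grid_index(grid, energy):
    """Index of the grid point at or below the energy (clamped to 0)."""
    return max(bisect.bisect_right(grid, energy) - 1, 0)


def index_jumps(grid, energies):
    """Total distance between the grid indices of consecutive lookups."""
    idx = [grid_index(grid, e) for e in energies]
    return sum(abs(b - a) for a, b in zip(idx, idx[1:]))


def energy_sorted_order(energies):
    """Event indices in ascending energy order (energy-only sort)."""
    return sorted(range(len(energies)), key=energies.__getitem__)


rng = random.Random(42)
grid = sorted(rng.random() for _ in range(1000))
energies = [rng.random() for _ in range(10000)]

order = energy_sorted_order(energies)
sorted_energies = [energies[i] for i in order]

print("jumps, random order:", index_jumps(grid, energies))
print("jumps, sorted order:", index_jumps(grid, sorted_energies))
```

After sorting, the grid index never decreases, so the total jump distance cannot exceed the grid length. Keeping the permutation in `order` allows results to be written back to their original event positions.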
XSBench Improvements
• “Base” versions are the default (event or history)
• “K1” – optimized event-based in distribution
• “K2” – event-based version optimized by us
KNL Experiment
[Figure, panel 1: KNL (DRAM) Speedup compared to thread counts. X axis: Thread Counts (22 to 408). Y axis: Lookups/s. Series: Base (History), Base (Event), k1 event, k2 event.]
[Figure, panel 2: KNL (HBM) Speedup compared to thread counts. X axis: Thread counts (22 to 408). Y axis: Lookups/s. Series: Base (History), Base (Event), k1 event, k2 event.]
Results
[Figure, panel 1: Skylake Speed compared to thread count. X axis: thread counts. Y axis: Lookups/s. Series: Base (History), Base (Event), k1 event, k2 event.]
[Figure, panel 2: ARM Speed compared to thread count. X axis: thread counts. Y axis: Lookups/s. Series: Base (History), Base (Event), k1 event, k2 event.]
ARM HW Counter comparison
[Figure, panel 1: L1 Data Cache misses. X axis values: 56 to 448 in steps of 56. Series: Event Base, Event K1, Event K2.]
[Figure, panel 2: L2 Data Cache misses. X axis values: 56 to 448 in steps of 56. Series: Event Base, Event K1, Event K2.]
XSBench Energy Results
[Figure: Lookups per Joule (Higher is better). Bar groups: Event Base, Event K1, Event K2. Series: ThunderX2, Skylake.]
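
Lookups per Joule is simply the work done divided by the energy consumed. A minimal sketch of the metric follows; the input numbers are hypothetical placeholders, not measurements from this study.

```python
def lookups_per_joule(lookups: float, energy_joules: float) -> float:
    """Energy efficiency: completed lookups per Joule consumed."""
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    return lookups / energy_joules


# Hypothetical run: 1.0e8 lookups while consuming 2.0e4 J.
print(lookups_per_joule(1.0e8, 2.0e4))
```

Because the metric divides by total energy, a faster kernel can score higher even at the same average power, simply by finishing sooner.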
Conclusion
• Getting mini-apps (and presumably the full applications on which they are based) up and running on ARM is easy.
• Performance bottlenecks are for the most part similar; i.e., whether a code is memory bound or compute bound does not change.
• Some specific issues may differ; in these cases frontend stalls were higher than backend stalls on the ThunderX2.
Questions?