Intranode Stencil Performance Evaluation Collection
Julian Hammer <[email protected]>, Julian Hornich, Georg Hager, Thomas Gruber, Gerhard Wellein
INSPECT
Supported by the Federal Ministry of Education and Research (METACCA)
The Broadwell CPU has two NUMA domains in Cluster-on-Die mode, each with 9 cores. The code saturates at 5 cores and scales to both NUMA domains. Downclocking from 2.3 to 1.3 GHz, about 5/9 of the nominal frequency, would save energy without compromising performance.
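The 5/9 figure can be checked with a short calculation. The sketch below assumes that per-core performance scales linearly with the clock while the saturated, bandwidth-bound performance does not depend on it, so the clock can be lowered until all 9 cores of a NUMA domain are needed to saturate. The function name is illustrative and not part of INSPECT.

```python
def lowest_saturating_frequency(nominal_ghz, saturation_cores, available_cores):
    """Clock at which `available_cores` are just enough to saturate memory bandwidth.

    Assumption: per-core performance scales linearly with the clock and the
    saturated (bandwidth-bound) performance is independent of the clock.
    """
    return nominal_ghz * saturation_cores / available_cores


# Values from the text: 2.3 GHz nominal, saturation at 5 of 9 cores.
f = lowest_saturating_frequency(2.3, saturation_cores=5, available_cores=9)
print(f"{f:.2f} GHz")
```

Under these assumptions the result lands close to the 1.3 GHz setting used for the downclocked measurement.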
Multicore scaling is well predicted. Single-core performance is based on inefficient assembly. Transfers cannot be measured on AMD Zen, since no suitable counters have been published.
As for Broadwell, IACA predicts TReg-L1 too pessimistically. Considering this, the traffic in the new non-inclusive cache hierarchy is well predicted by the cache simulator.
To allow reproduction of INSPECT results, the entire workflow is automated and published as open source on GitHub. INSPECT depends on the following tools:
STEMPEL (for stencil code generation) and pycachesim (for cache simulation). Additionally, it depends on IACA for Intel architectures and requires an optimizing compiler.
Layer conditions (LCs) are an analytic cache reuse model to estimate inter-cache and memory transfer volumes. The hits for each condition in the radius-3 stencil have been color coded here:
[Figure: stencil classification scheme]
dimensions: 2D, 3D
radius: r1, r2, r3
stencil type: star, box
coefficient weighting: homogeneous, heterogeneous, point-symmetric, isotropic
coefficient type: variable, constant
datatype: float, double, float _Complex, double _Complex
Illustrated examples: 2D/r1/star, 2D/r1/box, 3D/r1/star, 3D/r2/star, 2D/r3/star
Stencil Classification
Compiler Forensics
Counting Complex FLOPs
Reproducibility
Machine Description
To structure the library, stencils are classified along the following scheme:
STEMPEL[1] is used to generate example codes based on this classification.
Feature-Architecture Matrix
Floating point operations per lattice-site update (FLOP per LUP) are counted with regard to the high-level code.
The 22 counted FLOP per LUP of the example stencil add up to 60 real FLOP per LUP.
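The step from counted to real FLOP can be written out as a small calculation. The sketch below uses 2 real FLOP per complex addition and 6 per complex multiplication. The split into 18 additions and 4 multiplications for the 3D/r3/star isotropic stencil is an assumption, chosen because it is consistent with both the 22 counted and the 60 real FLOP per LUP quoted on this poster.

```python
REAL_FLOP_PER_COMPLEX_ADD = 2  # (a+bi) + (c+di): two real additions
REAL_FLOP_PER_COMPLEX_MUL = 6  # (a+bi) * (c+di): four real multiplications, two real additions


def real_flop_per_lup(complex_adds, complex_muls):
    """Real FLOP executed for a given number of complex operations per LUP."""
    return (complex_adds * REAL_FLOP_PER_COMPLEX_ADD
            + complex_muls * REAL_FLOP_PER_COMPLEX_MUL)


# Assumed operation split for 3D/r3/star, isotropic, constant coefficients:
radius, dimensions = 3, 3
points = 1 + 2 * dimensions * radius  # 19 grid points touched per update
complex_muls = 1 + radius             # one coefficient per distance (0..3)
complex_adds = points - 1             # summing 19 values takes 18 additions

counted = complex_adds + complex_muls
real = real_flop_per_lup(complex_adds, complex_muls)
print(counted, "counted FLOP per LUP,", real, "real FLOP per LUP")
```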
Each micro-architecture requires a model definition, a cache hierarchy description and benchmark data. This information is also useful for general reference when interpreting and comparing benchmark results. The Broadwell Execution-Cache-Memory (ECM) model is presented here; all parameters can be found on the INSPECT website.
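How the ECM contributions combine into a single-core prediction can be sketched in a few lines. The sketch assumes the non-overlapping transfer rule commonly applied to Intel server cores in the ECM model [7]: in-core execution (TOL) overlaps with the sum of all data transfer times. The input numbers are made up for illustration and are not the Broadwell parameters from the INSPECT website.

```python
from itertools import accumulate


def ecm_predictions(t_ol, t_reg_l1, transfers):
    """Cycles per cacheline with the data set residing in L1, L2, L3 and memory.

    t_ol      -- overlapping in-core time (TOL)
    t_reg_l1  -- non-overlapping register/L1 time (TReg-L1)
    transfers -- [TL1-L2, TL2-L3, TL3-MEM]
    """
    data_times = accumulate([t_reg_l1, *transfers])  # running sum per memory level
    return [max(t_ol, t) for t in data_times]        # max of overlapping and summed part


# Hypothetical contributions in cycles per cacheline.
levels = ["L1", "L2", "L3", "MEM"]
prediction = ecm_predictions(t_ol=20, t_reg_l1=10, transfers=[8, 8, 12])
for level, cycles in zip(levels, prediction):
    print(f"data in {level}: {cycles} cy/CL")
```

With these made-up inputs the kernel is core-bound while the data fits into L1 or L2 and becomes transfer-bound from L3 onward.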
The report allows for assembly code inspection; details on the compiler and its options are provided in “System Information” and defined by the machine description. Here, icc version 19.0.2 was used.
Transfer Volumes
Model vs Measurement Visualization
Cycles per cacheline are used as the base unit of throughput. Here, one cacheline corresponds to four iterations, because four double-complex elements (16 bytes each) fit in one 64-byte cacheline. Lower is faster! For comparison, lattice updates per second have been added.
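The relation between the two units can be made explicit with a small conversion helper. This is a minimal sketch; the function name and the choice of the 2.3 GHz nominal clock in the example are illustrative, and the clock that actually applies depends on the frequency the benchmark ran at.

```python
def cy_per_cl_to_mlups(cycles_per_cacheline, clock_ghz,
                       cacheline_bytes=64, element_bytes=16):
    """Convert cycles per cacheline into million lattice updates per second."""
    lup_per_cacheline = cacheline_bytes // element_bytes  # 4 for double _Complex
    lups_per_second = clock_ghz * 1e9 * lup_per_cacheline / cycles_per_cacheline
    return lups_per_second / 1e6


# Example: 40 cy/CL at an assumed clock of 2.3 GHz.
mlups = cy_per_cl_to_mlups(40, clock_ghz=2.3)
print(f"{mlups:.1f} MLUP/s")
```

Because the conversion is a reciprocal, halving the cycles per cacheline doubles the MLUP/s, which is why the secondary axis of such plots is not linear.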
ECM and Roofline models are built with the layer condition model or cache simulation (plots left and center). A phenomenological ECM model is constructed from measured data transfers and port occupations (right). One may conclude that spatial blocking with 300 elements in the inner two dimensions would be beneficial.
INSPECT is based on STEMPEL and Kerncraft, which in turn is based on IACA, OSACA, pycachesim and LIKWID. This table provides an overview of the current architecture support:
Outlook
Improve the internal workflow, improve visualization and reporting, include support for ARM, add more architectures, make the machine description more versatile, and include support for penalty cycles on all levels of the memory hierarchy.
References
[1] Guerrera. 2019. STEMPEL: Stencil TEMPlate Engineering Library. https://github.com/RRZE-HPC/stempel
[2] Hammer et al. 2017. Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels. Tools for High Performance Computing 2016. doi:10.1007/978-3-319-56702-0_1
[3] Hornich et al. 2019. Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT. arXiv:1906.08138
[4] Luszczek et al. 2005. Introduction to the HPC Challenge Benchmark Suite. doi:10.1.1.112.1817
[5] Meuer et al. 2014. The TOP500: History, Trends, and Future Directions in High Performance Computing (1st ed.). Chapman & Hall/CRC.
[6] SPEC. 2019. Standard Performance Evaluation Corporation. https://www.spec.org
[7] Stengel et al. 2015. Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model. Proceedings of the 29th ACM on International Conference on Supercomputing - ICS ’15. doi:10.1145/2751205.2751240
[8] Williams et al. 2009. Roofline. Commun. ACM 52, 4 (Apr 2009). doi:10.1145/1498765.1498785
[9] McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter.
[10] Laukemann et al. 2018. Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures. In 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). doi:10.1109/PMBS.2018.8641578
[11] Treibig et al. 2010. LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.38
[12] Franchetti et al. 2018. SPIRAL: Extreme Performance Portability. In Proceedings of the IEEE, Special Issue on From High Level Specifications to High Performance Code, 2018. doi:10.1109/JPROC.2018.2873289
Related Work
We are unaware of any project that aims to build a performance reference library including modelling, measurements and performance counters. SPEC[6], HPCC[4] and STREAM[9], among others, consider neither modelling nor anything near this level of detail. SPEC is very detailed on runtime information, which is often helpful, but vital information such as the executed assembly is missing. None of the named tools are meant to act as a framework for user-provided codes. Other frameworks, such as SPIRAL[12], focus on optimization rather than on modelling and explaining behavior.
Layer Conditions
Stacked Single-core Plots
Multicore Scaling
https://git.io/fjMHx
Interactive LC Calculator, prefilled with 3D/r3/star stencil:
https://git.io/fjMDb
Broadwell Description
Modelling and presenting performance data, even for simple kernels such as stencils, is not trivial. We therefore present an overview of how to interpret and what to learn from an INSPECT report, and highlight best practices for performance data reporting.
To make a performance report comprehensible, the following questions need to be answered:
> What type of code and underlying method is used?
> What did the compiler make of the code?
> What are the expected memory and cache data volumes?
> What is the expected maximum computational performance?
> What is the prediction of a performance model?
> How fast does it actually run?
> How does it scale across cores, NUMA domains and sockets?
> How do data volumes compare to predictions?
> What are the steps needed to reproduce?
Tool Analysis
Intel pre-Skylake
Intel Skylake
AMD Zen
ARM ThunderX2
OSACA Instruction Throughput
pycachesim Hit/Miss Prediction
Kerncraft
Compute w/ OSACA OSACA
Compute w/ IACA
Data transfers with pycachesim
Data transfers with Layer Conditions
INSPECT
Layer Conditions
Single Core Grid Scaling (Cache Simulator) OSACA
Single Core Grid Scaling (Phenomenological) LIKWID
Data Transfers between Caches LIKWID
Multi Core Thread Scaling OSACA
work-in-progress
no support for non-Intel architectures and discontinued development
depends on LIKWID or OSACA support
insufficient or partial hardware counter support
Documentation
Microbenchmarks
[Figure: Broadwell ECM model. Memory levels: Registers, L1, L2, L3, MEM. Contributions: TOL (execution), TReg-L1, TL1-L2, TL2-L3, TL3-MEM, combined by max and sum]
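[Figure: layer conditions of the radius-3 stencil over the axes i, j, k. Color legend: compulsory misses / full caching, next neighbor caching, line caching, 2D layer caching]
Counted                   Underlying operation                        Real FLOP executed
complex addition          (a+bi) + (c+di) = (a+c) + (b+d)i            (2)
complex multiplication    (a+bi)(c+di) = (ac-bd) + (ad+bc)i           (6)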
[Figure: stacked single-core plot. Y axis: reciprocal throughput [cycle / work unit]; secondary axis: performance [work / time]. Series: TOL, TReg-L1, TL1-L2, TL2-L3, TL3-MEM, ECM prediction, Roofline prediction, measurement]
Validation of data volumes is accomplished with hardware performance counters, using LIKWID. The cache simulator captures some artifacts and transition behavior. The layer condition model is very accurate in steady states. Kerncraft uses the following machine-specific hardware performance counters for Broadwell:
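The comparison behind such a validation can be sketched as follows: a counted number of cacheline transfers is turned into bytes per LUP and set against the predicted volume. The counter readings below are made up for illustration, no specific counter event names are assumed, and the helper names are hypothetical.

```python
def bytes_per_lup(cachelines_transferred, lattice_updates, cacheline_bytes=64):
    """Measured data volume per lattice update from a cacheline transfer count."""
    return cachelines_transferred * cacheline_bytes / lattice_updates


def relative_deviation(measured, predicted):
    """Signed deviation of the measurement from the prediction."""
    return (measured - predicted) / predicted


# Hypothetical reading: 2.1 million cachelines moved during 1 million updates.
measured = bytes_per_lup(2_100_000, 1_000_000)
predicted = 8 * 16  # e.g. 8 misses per LUP at 16 Byte per miss
print(f"measured {measured:.1f} B/LUP, predicted {predicted} B/LUP, "
      f"deviation {relative_deviation(measured, predicted):+.1%}")
```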
[Figure: Broadwell memory hierarchy (Registers, L1, L2, L3, MEM) annotated with the LOAD counters and STORE counters used between levels]
Report for Skylake X
https://git.io/fjDxK
Report for AMD Zen
[Figure: multicore thread scaling at 1.3 GHz. X axis: number of threads, 2 to 18; Y axis: performance @ 1.3 GHz [MLUP/s], 0 to 450. Series: benchmark, ECM LC prediction, Roofline LC prediction]
https://git.io/fjy3g
Kerncraft [2] (for performance modelling, evaluation and benchmarking)
LIKWID [11] (for performance measurements and machine model extraction)
OSACA [10] (for AMD in-core predictions)
IACA Output
How to test this stencil and replicate the data
Layer Condition Data
Cache Simulator Data
Generate analyzable stencil code: stempel gen -D 3 -r 3 -t "double _Complex" -C constant -k star -i --store stencil.c
[ output of the IACA, OSACA or LLVM-MCA tool ]
System Information
[ output of likwid-topology and other utilities to document and record the machine configuration ]
and compilable benchmark code with: stempel bench stencil.c -m BroadwellEP_E5-2697_CoD.yml --store
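The two STEMPEL invocations above can be assembled from the stencil properties listed in a report. This is a minimal sketch that only builds the argument lists and does not run anything; it uses exactly the flags that appear in the commands above, and reading -i as the switch for isotropic weighting is an assumption. The helper names are hypothetical.

```python
import shlex


def stempel_gen_args(dimension, radius, datatype, coefficients, kind,
                     isotropic, output):
    """Argument list for the 'stempel gen' call shown in the report."""
    args = ["stempel", "gen", "-D", str(dimension), "-r", str(radius),
            "-t", datatype, "-C", coefficients, "-k", kind]
    if isotropic:
        args.append("-i")  # assumed to select isotropic weighting
    return args + ["--store", output]


def stempel_bench_args(source, machine_file):
    """Argument list for the 'stempel bench' call shown in the report."""
    return ["stempel", "bench", source, "-m", machine_file, "--store"]


gen = stempel_gen_args(3, 3, "double _Complex", "constant", "star",
                       isotropic=True, output="stencil.c")
bench = stempel_bench_args("stencil.c", "BroadwellEP_E5-2697_CoD.yml")
print(shlex.join(gen))
print(shlex.join(bench))
```

With STEMPEL installed, each list could be passed to subprocess.run; keeping the arguments as a list avoids quoting problems with the space in "double _Complex".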
Multicore scaling:
Complete analysis report available here!
https://git.io/fjMDd
[Figure: Single Core Grid Scaling. Analytical ECM and Roofline Model Prediction vs. Performance. Panels: modelled with cache simulation, modelled with layer conditions, phenomenological ECM model. X axis: grid size (N=M=P), 0 to 1000; Y axis: cycles / cacheline, 0 to 140; secondary axis: MLUP / second (230, 115, 76.7, 57.5, 46.0, 38.3, 32.8). Series: TL3-MEM, TL2-L3, TL1-L2, TReg-L1, TOL, benchmark, Roofline]
[Figure: Multicore Thread Scaling. X axis: number of threads, 2 to 18; Y axis: performance [MLUP/s], 0 to 450. Series: benchmark, ECM LC prediction, Roofline LC prediction]
[Figure: Data Transfers between Caches. Panels: modelled with cache simulation, modelled with layer conditions. X axis: grid size (N=M=P), 0 to 1000; Y axis: data transfers [Byte/LUP], 0 to 140. Series: L1-L2, L2-L3 and L3-MEM, each measured and predicted]
Layer Conditions
LC for L1 cache with 32 KB:
condition   misses   hits
            0        20
            2        18
            6        14
            8        12
            12       8
else        14       6
LC for L2 cache with 256 KB:
condition   misses   hits
            0        20
            2        18
            6        14
            8        12
            12       8
else        14       6
LC for L3 cache with 22.5 MB:
condition   misses   hits
            0        20
            2        18
            6        14
            8        12
            12       8
else        14       6
1 miss = 16 Byte / LUP
576288
Kernel Source Code
C Code (generated or hand-written):
Assembly (generated by compiler):
Stencil Properties
dimension: 3D
radius: 3
weighting: isotropic
kind: star
coefficients: constant
datatype: double _Complex
machine: BroadwellEP_E5-2697_CoD
FLOP per LUP: 22
Performance Model and Benchmark Report
Same single-NUMA-domain and full-socket performance at lower clock speed
ECM too pessimistic due to TReg-L1 overestimation by IACA
TReg-L1 difference shows overestimation by IACA for ECM
Main memory transfer stream penalties show with lower TReg-L1
Transition to next layer condition (line caching in L2 at ~1170)
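The transition near N ≈ 1170 and the miss and hit counts in the layer condition tables can be reproduced with a simplified reuse-distance calculation. The sketch below is an illustration of the idea and not Kerncraft's implementation. It assumes a 3D radius-3 star stencil that reads one array and writes another, 16-byte elements, cache sizes in units of 1024 bytes, and N=M for the two inner dimensions. All function names are hypothetical.

```python
def star_offsets(radius, n, m):
    """Linearized read offsets of a 3D star stencil; n and m are the sizes of
    the two inner dimensions (i is the fastest running index, then j)."""
    offsets = {0}
    for r in range(1, radius + 1):
        for stride in (1, n, n * m):
            offsets.update((r * stride, -r * stride))
    return sorted(offsets)


def layer_condition(offsets, threshold, written_arrays=1):
    """Return (misses, hits, required_elements), assuming every reuse distance
    of up to `threshold` elements is served from the cache."""
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    hits = sum(1 for g in gaps if g <= threshold)
    # Longer gaps miss, as do the leading read access and each written array.
    misses = (len(gaps) - hits) + 1 + written_arrays
    required = sum(g for g in gaps if g <= threshold) + misses * threshold
    return misses, hits, required


def max_line_caching_size(cache_bytes, radius=3, element_bytes=16):
    """Largest N (with N=M) for which the line caching condition still holds."""
    capacity = cache_bytes // element_bytes
    n = 2 * radius + 2  # small starting size with distinct, ordered offsets
    while layer_condition(star_offsets(radius, n + 1, n + 1), n + 1)[2] <= capacity:
        n += 1
    return n


n = 600
offsets = star_offsets(3, n, n)
for name, threshold in [("next neighbor", 1), ("line", n), ("2D layer", n * n)]:
    misses, hits, _ = layer_condition(offsets, threshold)
    print(f"{name} caching: {misses} misses, {hits} hits, {misses * 16} Byte/LUP")
print("line caching in L2 holds up to N =", max_line_caching_size(256 * 1024))
```

In this simplified model line caching needs about 14 rows of N elements, which is where the L2 limit comes from. The full-caching row of the tables (0 misses, 20 hits) is not covered, because the sketch always counts the leading access to each array as a miss.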