
LIKWID is a collection of command-line tools for performance-aware programmers of multicore and manycore CPUs. It follows the UNIX design philosophy of “one task, one tool”. Among its many capabilities are system topology reporting, enforcement of thread-core affinity for threading, MPI, and hybrid programming models, setting clock speeds, hardware performance event counting, energy measurements, and low-level benchmarking. As of version 5 it supports not only x86 CPUs (Intel/AMD) but also ARM and POWER architectures and Nvidia GPUs.

LIKWID 5: Lightweight Performance Tools

Thomas Gruber, Jan Eitzinger, Georg Hager, and Gerhard Wellein

Erlangen Regional Computing Center (RRZE), 91058 Erlangen, Germany

Data repository with code, scripts, plot files and measurement results:

References
[1] J. Treibig et al.: "Likwid: A lightweight performance-oriented tool suite for x86 multicore environments." 2010 39th International Conference on Parallel Processing Workshops. IEEE, 2010.
[2] D. Poliakoff et al.: "Gotcha: An Function-Wrapping Interface for HPC Tools." 2019 International Workshop on Extreme-Scale Programming Tools.
[3] F. Jansen et al.: "From bijels to Pickering emulsions: A lattice Boltzmann study." Physical Review E 83, 4 (2011), 046707.

Grant Nr. 01IH13009
Grant Nr. 01IH16012

Thanks to

LIKWID 5 Tools Architecture

[Figure: architecture diagram. Components shown: user applications, LIKWID CLI applications, Lua RT, pinning lib, Lua API, Marker API, Python API, LIKWID core C API and GPU API*, hwloc, LIKWID suid daemon, perf_event, CUDA*, Linux OS kernel, Nvidia GPUs.]

LIKWID MarkerAPI

Soft matter system simulation with single fluid; 16 and 32 nodes/core, cubic domain, Hazel Hen (HLRS)

Success story
Code using LB3D ([3], Lattice Boltzmann engine) in Fortran08
Institute for Dynamics of Complex Fluids and Interfaces of the Helmholtz Association

• Tracking down caching problems with main data structure
• Fixing compiler vectorization due to OOP paradigm (C malloc'd data structures unknown to be contiguous)
→ more than 3-fold performance increase

Documentation

Nvidia GPU MarkerAPI

LIKWID_NVMARKER_INIT;
double *x = malloc(N*sizeof(double));
for (i=0; i<N; i++) { x[i] = 2.0; }
LIKWID_NVMARKER_START("cudafunction");
cudaMalloc(&cu_x, N*sizeof(double));
cudaMemcpy(cu_x, x, N*sizeof(double), …);
cufunc<<<256, 256>>>(N, cu_x);
cudaMemcpy(x, cu_x, N*sizeof(double), …);
LIKWID_NVMARKER_STOP("cudafunction");
LIKWID_NVMARKER_CLOSE;

Self-monitoring of application with LIKWID's nvmon C-API

nvmon_init(num_gpus, glist);
gid = nvmon_addEventSet("GPUEVENT0:GPU0");
num_events = nvmon_getNumberOfEvents(gid);
nvmon_setupCounters(gid);
double *x = malloc(N*sizeof(double));
for (i=0; i<N; i++) { x[i] = 2.0; }
nvmon_startCounters();
cudaMalloc(&cu_x, N*sizeof(double));
cudaMemcpy(cu_x, x, N*sizeof(double), …);
cufunc<<<256, 256>>>(N, cu_x);
cudaMemcpy(x, cu_x, N*sizeof(double), …);
nvmon_stopCounters();
for (i=0; i<num_gpus; i++) {
    for (j=0; j<num_events; j++) {
        double r = nvmon_getResult(gid, i, j);
        printf("GPU%d Event %d: %f\n", glist[i], j, r);
    }
}
nvmon_finalize();

Monitor all activities on CPUs:
likwid-perfctr -C 0,1 -g GRP ./a.out
Measure already running application:
likwid-perfctr … --perfpid <pid>
Count only for wrapped program:
likwid-perfctr … --execpid ./a.out
Use MarkerAPI and count only application:
likwid-perfctr … --execpid -m ./a.out

New CPU backend (perf_event)

Support for core-local counters and all uncore units (including energy counts) with all event options

[Diagram labels: NVML, CUPTI, PerfWorks]

#include <likwid-marker.h>
int main(…)
{
    […]
    LIKWID_MARKER_INIT;
    […]
    #pragma omp parallel
    {
        LIKWID_MARKER_REGISTER("region");
    }
    #pragma omp parallel
    {
        for (int j=0; j < iters; j++) {
            LIKWID_MARKER_START("region");
            #pragma omp for reduction(+:y[0:N_rows])
            for (int c=0; c<N_cols; c++) {
                for (int r=0; r<N_rows; r++) {
                    y[r] = y[r] + a[c*N_rows+r] * x[c];
                }
            }
            LIKWID_MARKER_STOP("region");
            if (j == iters/2) LIKWID_MARKER_SWITCH;
        }
    }
    […]
    LIKWID_MARKER_CLOSE;
    return 0;
}
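A common way to keep instrumented code buildable on machines without LIKWID is to guard the header with a preprocessor symbol and fall back to empty macros. The following is a minimal serial sketch of that idea, not the poster's code: it assumes the build passes -DLIKWID_PERFMON when LIKWID is available, and the helper names mvm_marked and measure_mvm are hypothetical. The OpenMP pragmas and the group switch from the listing above are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the build defines LIKWID_PERFMON (e.g. -DLIKWID_PERFMON) when
 * LIKWID is installed. Otherwise the markers expand to nothing, so the same
 * source compiles and runs without the library. */
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_REGISTER(tag)
#define LIKWID_MARKER_START(tag)
#define LIKWID_MARKER_STOP(tag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical helper: y += A * x for a column-major n_rows x n_cols matrix,
 * wrapped in a marker region named "region" as in the listing above. */
void mvm_marked(size_t n_rows, size_t n_cols,
                const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);
    LIKWID_MARKER_START("region");
    for (size_t c = 0; c < n_cols; c++) {
        for (size_t r = 0; r < n_rows; r++) {
            y[r] = y[r] + a[c * n_rows + r] * x[c];
        }
    }
    LIKWID_MARKER_STOP("region");
}

/* Hypothetical driver: initialise the MarkerAPI, run one sweep, close it. */
void measure_mvm(size_t n_rows, size_t n_cols,
                 const double *a, const double *x, double *y)
{
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_REGISTER("region");
    mvm_marked(n_rows, n_cols, a, x, y);
    LIKWID_MARKER_CLOSE;
}
```

Without LIKWID the function simply computes the product. With LIKWID, the binary would be linked against the library and run under likwid-perfctr with -m so that only the marked region is counted.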

Support for most recent architectures: Cascade Lake SP (incl. Intel Optane DC)

Support for most recent architecture: Zen2 alias Rome

Generic support for ARMv7 and ARMv8
Extended support for Marvell Thunder X2 (incl. memory controllers, socket interconnect and L3 cache)

Core event support for POWER8 and POWER9
Nest event support for POWER9 (incl. memory controllers)

NEW performance monitoring backend for Nvidia GPUs
NEW topology backend for Nvidia GPUs
Providing events from CUPTI, NVML and (soon) PerfWorks
Basic set of performance groups (FLOPS_DP, FLOPS_SP, MEM, L2, …)
Distinct C/C++ API and GPU MarkerAPI macros for full flexibility

CPU MarkerAPI for C/C++, Fortran90 and Lua included
Python (pip install pylikwid)
Java (GitHub: http://tiny.cc/p7pdez)

LIKWID's performance groups are validated against well-understood kernels:
• likwid-bench kernels (handcrafted assembly benchmarks)
  • Load only, store only, memory copy
  • Stream triad, Schoenauer triad, Daxpy
• Important HPC kernels:
  • DP/SP dense matrix-vector-multiplication
  • Stencils

Load data transfer analysis for DP dense quad. matrix-vector-multiplication
(x[] traffic negligible as only loaded once per row):
• Only matrix a[] is loaded from lower cache level: 8 Byte/update
• Matrix a[] and y[] are loaded from lower cache level: 16 Byte/update
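The two traffic figures above can be turned into a small model to compare against a measured event count. This is a sketch under stated assumptions: the function names are hypothetical, an "update" is one y[r] += a[...] * x[c] operation, and the 64-byte cache line size is an assumption that holds for the Intel Broadwell system named on this poster but should be checked elsewhere.

```c
#include <assert.h>
#include <stddef.h>

enum {
    BYTES_PER_DOUBLE = 8,
    CACHE_LINE_BYTES = 64   /* assumption: 64-byte cache lines */
};

/* Expected bytes loaded from the lower cache level for one sweep over an
 * n_rows x n_cols matrix. If y_from_lower_level is non-zero, y[] does not
 * stay in the upper cache and is loaded again for every update. */
double mvm_model_bytes(size_t n_rows, size_t n_cols, int y_from_lower_level)
{
    double updates = (double)n_rows * (double)n_cols;
    double bytes_per_update = y_from_lower_level
                                  ? 2.0 * BYTES_PER_DOUBLE   /* a[] and y[] */
                                  : 1.0 * BYTES_PER_DOUBLE;  /* a[] only   */
    return updates * bytes_per_update;
}

/* Convert a measured count of loaded cache lines into bytes per update,
 * the unit used in the analysis above. */
double measured_bytes_per_update(double cache_lines,
                                 size_t n_rows, size_t n_cols)
{
    assert(n_rows > 0 && n_cols > 0);
    return cache_lines * CACHE_LINE_BYTES
           / ((double)n_rows * (double)n_cols);
}
```

A measured value near 8 Byte/update would then suggest that only a[] comes from the lower level, while a value near 16 suggests that y[] is reloaded as well.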

LIKWID event validation

likwid-perfctr support for Nvidia GPU events through GOTCHA [2] in combination with CPU measurements:
$ likwid-perfctr -C 0-4 -g CPUEVENT:PMC0 -G 0,1 -W GPUEVENT:GPU0 ./cuda.a.out
For GPUMarkerAPI (-m): instrument code once, control measurements from outside

Micro-benchmarking
Handcrafted assembly streaming benchmarks
Kernels for x86_64, ARMv7, ARMv8 and POWER included (NT-stores, FMAs, AVX512, VSX, NEON, …)
New: Dynamic loading of benchmarks for rapid prototyping
Support for hardware performance measurements (LIKWID MarkerAPI) included
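For readers unfamiliar with the streaming kernels named on this poster, the loops below show what each one computes. They are plain C illustrations only: the actual likwid-bench kernels are handcrafted assembly, and all function names here are hypothetical. The bandwidth helper counts only the explicit array streams of 8-byte doubles and ignores any write-allocate traffic, which is an assumption that matters for kernels with stores.

```c
#include <assert.h>
#include <stddef.h>

/* x = A[i] (load only); the sum keeps the loads from being optimised away. */
double load_only(const double *A, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += A[i];
    return s;
}

/* A[i] = c (store only) */
void store_only(double *A, double c, size_t n)
{
    for (size_t i = 0; i < n; i++) A[i] = c;
}

/* A[i] = B[i] (copy) */
void copy_kernel(double *A, const double *B, size_t n)
{
    for (size_t i = 0; i < n; i++) A[i] = B[i];
}

/* A[i] = B[i] + c*C[i] (stream triad) */
void stream_triad(double *A, const double *B, const double *C,
                  double c, size_t n)
{
    for (size_t i = 0; i < n; i++) A[i] = B[i] + c * C[i];
}

/* A[i] = B[i] + C[i]*D[i] (Schoenauer triad) */
void schoenauer_triad(double *A, const double *B, const double *C,
                      const double *D, size_t n)
{
    for (size_t i = 0; i < n; i++) A[i] = B[i] + C[i] * D[i];
}

/* Bandwidth in GByte/s for n iterations touching `streams` arrays of doubles,
 * e.g. streams = 1 (load), 2 (copy), 3 (stream triad), 4 (Schoenauer triad). */
double bandwidth_gbyte_per_s(size_t n, int streams, double seconds)
{
    assert(streams > 0 && seconds > 0.0);
    return (double)n * (double)streams * 8.0 / seconds / 1.0e9;
}
```

Timing each loop over arrays much larger than the last-level cache and passing the runtime to bandwidth_gbyte_per_s gives numbers in the same unit as a GByte/s bandwidth comparison, although compiler-generated code will generally not match tuned assembly.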

Event comparison for DP dense quad. matrix-vector-multiplication
Intel Broadwell E5-2697 v4 @ 2.3 GHz, 4 threads

L2_TRANS.DEMAND_DATA_RD: This event counts Demand Data Read requests that access L2 cache, including rejects.
L1D.REPLACEMENT: This event counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.
L2_RQSTS.DEMAND_DATA_RD_MISS: This event counts the number of demand Data Read requests that miss L2 cache. Only not rejected loads are counted.
L2_LINES_IN.ALL: This event counts the number of L2 cache lines filling the L2. Counting does not cover rejects.

[Figure: event comparison plot. Legend: prefetchers active, prefetchers inactive, used by LIKWID.]

[Figure: Micro-architectural comparison of likwid-bench kernels. Full socket (1 thread per core), total size 2 GB. Axis: memory bandwidth in GByte/s (ticks 0 to 200). Kernels: x = A[i] (load only), A[i] = c (store only), A[i] = B[i] (copy), A[i] = B[i]+c*C[i] (stream), A[i] = B[i]+C[i]*D[i] (triad). Systems: Intel CLX (AVX512), AMD NAPLES (AVX), IBM PWR9 (VSX), Marvell TX2 (NEON).]
