Instrumentation and Run-Time Measurement
VampirTrace
Overview
• Instrumentation
– Automatic, manual and binary instrumentation
• Run-time measurement
– Behind the scenes, post-processing
– Trace file format, overhead
• Options, settings, parameters
– Environment Variables
– PAPI hardware performance counters
– Memory allocation counters, application I/O calls
– Filtering, grouping
• FAQ and Issues
INSTRUMENTATION
Instrumentation in General
Edit – Compile – Run Cycle
[Figure: Source Code → Compiler → Binary → Run → Results]
Edit – Compile – Run Cycle with VampirTrace
[Figure: Source Code → VT Wrapper (Compiler) → Binary → Run → Results and Traces]
Compiler Wrappers
• Easiest way of using VampirTrace
• No source code modifications
• In the build system of your application, substitute calls to the regular compiler with calls to the VampirTrace compiler wrappers
– For compiling and linking
– e.g. in the makefile change icc to vtcc
• Rebuild the application
• Run the application to produce trace data
Instrumentation & Measurement
• What do you need to do for it?
– VampirTrace and a supported compiler
• Instrumentation (automatic with compiler wrappers)
• Re-compile & re-link
• Trace Run (run with appropriate test data set)
• More details later
Original makefile settings:
CC = icc
CXX = icpc
F90 = ifc
MPICC = mpicc
With VampirTrace compiler wrappers:
CC = vtcc
CXX = vtcxx
F90 = vtf90
MPICC = vtcc -vt:cc mpicc
Compiler Wrappers
Captured events:
• All user function entries and exits
– If supported by the compiler (Intel, GNU, PGI, NEC, IBM)
• MPI calls and messages
– If the application is MPI parallel
• OMP regions
– If the application is OpenMP parallel
Manual Instrumentation
• Allows for detailed source code instrumentation
– e.g. regions of functions such as loops
• Can be combined with automatic instrumentation
• Be sure to instrument all function exits!
– Otherwise post-mortem analysis will fail
• I personally consider this advanced usage of VampirTrace!
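Why every function exit matters: an analysis tool has to pair each enter event with its matching leave event. The following is a minimal sketch of such a consistency check in C. The event struct, the depth limit and the function name are hypothetical and only illustrate the idea; they are not part of the VampirTrace API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event record: a region id plus an enter/leave flag. */
typedef struct {
    int region_id;
    int is_enter;   /* 1 = enter, 0 = leave */
} region_event;

#define MAX_DEPTH 64   /* arbitrary nesting limit for this sketch */

/* Returns 1 if every enter has a matching leave in properly nested order. */
static int events_are_balanced(const region_event *ev, size_t n) {
    int stack[MAX_DEPTH];
    size_t depth = 0;

    assert(ev != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ev[i].is_enter) {
            if (depth == MAX_DEPTH) return 0;          /* nesting too deep */
            stack[depth++] = ev[i].region_id;
        } else {
            /* a leave must close the most recently entered region */
            if (depth == 0 || stack[depth - 1] != ev[i].region_id) return 0;
            depth--;
        }
    }
    return depth == 0;   /* regions left open mean a missing exit */
}
```

A function that returns early without reaching its VT_USER_END would show up here as a region that is entered but never left.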
Manual Instrumentation
• Add the following to your source code to instrument a region, e.g. in C (also available for C++ and Fortran):
#include "vt_user.h"
...
VT_USER_START("Region_1");
...
VT_USER_END("Region_1");
...
• Compile with “-DVTRACE”
– Otherwise, VampirTrace macros will expand to empty blocks, producing zero overhead
vtcc -vt:inst manual prog.c -DVTRACE -o prog
Binary Instrumentation
• Using DYNINST
– http://www.dyninst.org
• Source should be compiled with “-g” switch
• “vtunify” has to be run manually afterwards
vtcc -vt:inst dyninst prog.c -o prog
RUN-TIME MEASUREMENT
Behind the Scenes
Unifying - Post-Processing
OTF – Open Trace Format
Tracing Overhead
Workflow
1) Instrumentation
– Hide instrumentation in compiler wrappers
– Use underlying compiler and add appropriate options
– e.g. change CC=mpicc to CC=vtcc -vt:cc mpicc
2) Test Run
– Use representative test input
– Set parameters, environment variables, etc.
– Selective tracing
3) Get Trace
Automatic Function Tracing
• Uses compiler support to add tracing calls at every function entry and exit
• Compilers supported:
– GNU, Intel, PGI, PathScale, IBM, Sun Fortran, NEC
• Binary instrumentation via Dyninst
MPI and OpenMP Tracing
• Tracing of MPI-1 and MPI-IO events via the PMPI interface
• Tracing of OpenMP directives via OPARI source-to-source instrumentation
Hardware Performance Counters
• Recording PAPI counter(s) at every function entry / exit
• PAPI allows access to hardware (mostly CPU) counters, e.g. floating point operations, cache misses, exceptions
• Can derive rates, e.g. GFlop/s of each function
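How a rate can be derived from counter values: take the difference of the counter and of the timestamp between function entry and exit, then divide. A minimal sketch in C, assuming a floating point operation counter (such as PAPI_FP_OPS) and nanosecond timestamps; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sample taken at a function entry or exit. */
typedef struct {
    uint64_t time_ns;   /* timestamp in nanoseconds */
    uint64_t fp_ops;    /* accumulated floating point operations */
} counter_sample;

/* GFlop/s achieved between the entry and exit samples of one function call. */
static double gflops_between(counter_sample enter, counter_sample leave) {
    assert(leave.time_ns > enter.time_ns);
    assert(leave.fp_ops >= enter.fp_ops);

    double ops = (double)(leave.fp_ops - enter.fp_ops);
    double ns  = (double)(leave.time_ns - enter.time_ns);

    /* operations per nanosecond equals 1e9 operations per second, i.e. GFlop/s */
    return ops / ns;
}
```

For example, 4000 operations within 2000 ns correspond to 2 GFlop/s.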
Memory and I/O Tracing
• Tracing of memory allocation calls via libc built-in hooks
• malloc, realloc, free, …
• Tracing of I/O calls, accessed files, transferred data volume via wrappers for I/O calls
• open, read, write, …
Instrumentation & Measurement
What does VampirTrace do in the background?
• Trace Run:
– Event data collection
– Precise time measurement
– Parallel timer synchronization
– Collecting parallel process/thread traces
– Collecting performance counters from PAPI, memory usage, POSIX I/O calls, fork/system/exec calls, and more …
– Filtering and grouping of function calls
Behind the Scenes
• Trace data is written to a buffer in memory first
• When this buffer is full, data is flushed to storage
• After the application has run to completion, these trace files are unified to produce the final OTF trace
• Most aspects of this behavior can be customized with environment variables
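The buffer-then-flush behavior described above can be sketched in C as follows. This is only an illustration under stated assumptions: the event layout, the tiny buffer capacity and all function names are hypothetical and do not reflect VampirTrace's internal implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical event record; the real record layout differs. */
typedef struct {
    uint64_t timestamp;
    uint32_t region_id;
    uint8_t  is_enter;            /* 1 = enter, 0 = leave */
} trace_event;

#define TRACE_BUF_CAPACITY 4      /* tiny on purpose, so flushes are visible */

typedef struct {
    trace_event events[TRACE_BUF_CAPACITY];
    size_t      used;             /* events currently held in memory */
    size_t      flushes;          /* number of flushes so far */
    FILE       *out;              /* intermediate trace file, NULL = discard */
} trace_buffer;

static void trace_buffer_init(trace_buffer *b, FILE *out) {
    assert(b != NULL);
    b->used = 0;
    b->flushes = 0;
    b->out = out;
}

/* Write buffered events to storage; returns 0 on success, -1 on a short write. */
static int trace_buffer_flush(trace_buffer *b) {
    assert(b != NULL);
    int status = 0;
    if (b->used == 0) return 0;
    if (b->out != NULL) {
        size_t written = fwrite(b->events, sizeof(trace_event), b->used, b->out);
        if (written != b->used) status = -1;
    }
    b->used = 0;
    b->flushes++;
    return status;
}

/* Store one event in memory, flushing first if the buffer is full. */
static int trace_buffer_record(trace_buffer *b, trace_event ev) {
    assert(b != NULL);
    int status = 0;
    if (b->used == TRACE_BUF_CAPACITY) status = trace_buffer_flush(b);
    b->events[b->used++] = ev;
    return status;
}
```

Each flush interrupts the application, which is why the buffer size and the number of permitted flushes are worth tuning for a trace run.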
File-based Workflow
Unifying - Post-Processing
• Normally, trace data is unified automatically after the application has run to completion
• This takes time, depending on the amount of trace data
• Can be switched off by an environment variable
• vtunify <number-of-trace-files> <trace-file-prefix>
vtunify 16 my_trace
How to Store Trace Data - Trace File
Various trace file formats (for HPC):
– VTF3 (TU Dresden)
– Tau Trace Format (Univ. of Oregon, LANL and JSC/Jülich)
– EPILOG (JSC/Jülich/Germany)
– STF (Pallas GmbH, now Intel)
– OTF (TU Dresden)
• ASCII or binary file formats
• single/multiple file(s) per trace
• merge process traces to single file
• multiple streams for parallel/selective I/O
OTF – Open Trace Format
• Open source trace file format
– Available from the homepage of TU Dresden, ZIH
http://www.tu-dresden.de/zih/otf/
• Includes powerful libotf for use in custom applications
• API / Interfaces
– High level interface for analysis tools
– Low level interface for trace libraries
• Actively developed
– In cooperation with the University of Oregon and Lawrence Livermore National Laboratory
Tracing Overhead
• Measured on SGI Altix 4700, Itanium 2 1.6 GHz
• Tracing overhead per function call (from test program with one million function calls, multiple repetitions)
• Suppressed inlining: icc -O2 -ip-no-inlining
Configuration        Intel Trace Collector   VampirTrace
Without PAPI         1.10 µs                 0.92 µs
1 PAPI counter       9.25 µs                 4.47 µs
3 PAPI counters      9.64 µs                 4.61 µs
Filtered function    1.04 µs                 0.82 µs
OPTIONS, SETTINGS, PARAMETERS
Environment Variables
PAPI hardware performance counters
Memory allocation counters
Application I/O calls
Filtering
Grouping
Environment Variables
• By default, trace data is written to the current working directory
• Almost all of this behavior can be customized with environment variables
• Environment variables must be set prior to running the application, not prior to building the application
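Why the variables matter at run time rather than at build time: a measurement library typically reads its settings from the environment when the traced program starts. A minimal sketch in C of such a lookup; the helper names are hypothetical, and treating only the value "yes" as enabled is a simplifying assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the value of an environment variable, or `fallback` if unset or empty. */
static const char *setting_or_default(const char *name, const char *fallback) {
    assert(name != NULL && fallback != NULL);
    const char *value = getenv(name);
    return (value != NULL && value[0] != '\0') ? value : fallback;
}

/* Return 1 if the variable is set to "yes", 0 otherwise (assumption of this sketch). */
static int setting_is_enabled(const char *name) {
    return strcmp(setting_or_default(name, "no"), "yes") == 0;
}
```

Because getenv is evaluated in the running process, a variable exported only during the build has no effect on the trace run.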
Environment Variables
VT_PFORM_GDIR    Directory where final trace file is stored
VT_PFORM_LDIR    Directory for intermediate trace files
VT_FILE_PREFIX   Trace file name
VT_BUFFER_SIZE   Internal trace buffer size
VT_MAX_FLUSHES   Max number of buffer flushes
VT_MEMTRACE      Enable memory allocation tracing
VT_IOTRACE       Enable I/O tracing
VT_MPITRACE      Enable MPI tracing
VT_FILTER_SPEC   Name of filter file
VT_GROUPS_SPEC   Name of function groups file
VT_COMPRESSION   Compress trace files
VT_METRICS       List of PAPI counters
PAPI Counter
• PAPI counters can be included in traces
– If PAPI is available on the platform
– If VampirTrace was built with PAPI support
• VT_METRICS can be used to specify a colon-separated list of PAPI counters
• VampirTrace releases after 5.8.1 will offer a customizable separator, because Component-PAPI counters use colons in their counter names
export VT_METRICS=PAPI_FP_OPS:PAPI_L2_TCM
Memory Counter
• Memory allocation counters can be included in traces
– If VampirTrace was built with memory allocation support
– If GNU glibc is used on the platform
• Memory functions in glibc such as “malloc” and “free” are traced
• Environment variable VT_MEMTRACE
export VT_MEMTRACE=yes
I/O Counter
• I/O counters can be included in traces
– If VampirTrace was built with I/O tracing support
• Standard I/O calls like “open” and “read” are recorded
• Environment variable VT_IOTRACE
export VT_IOTRACE=yes
User defined Counter
• Records program variables or any other numerical quantity
• Helps to find “that one loop iteration” which causes trouble
#include "vt_user.h"

int main() {
    unsigned int i, cid, cgid;

    /* define a counter group and a counter "i" (unit "#") in it */
    cgid = VT_COUNT_GROUP_DEF("loopindex");
    cid = VT_COUNT_DEF("i", "#", VT_COUNT_TYPE_UNSIGNED, cgid);

    for( i = 1; i <= 100; i++ ) {
        /* record the current loop index as the counter value */
        VT_COUNT_UNSIGNED_VAL(cid, i);
    }
    return 0;
}
Function Filtering
• Filtering is one of the ways to reduce trace file size
• Environment variable VT_FILTER_SPEC
• Filter definition file contains a list of filters
• Filter rules can be global to all processes or only be assigned to specific ranks (see the manual for more details of rank specific filtering)
• See also the vtfilter tool
– Can generate a customized filter file
– Can reduce the size of existing trace files
%> export VT_FILTER_SPEC=filter.spec
my_*;test_* -- 1000
debug_* -- 0
calculate -- -1
* -- 1000000
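How the rules of the filter file above might be evaluated can be sketched in C. The sketch assumes that the number after "--" is a call limit, that 0 means the function is never recorded, that -1 means no limit, and that the first matching rule wins; check the manual for the exact semantics. The rule table and function names are hypothetical, and POSIX fnmatch stands in for the pattern matching.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* One filter rule: a wildcard pattern and a call limit. */
typedef struct {
    const char *pattern;
    long        limit;    /* 0 = never record, -1 = unlimited (assumed semantics) */
} filter_rule;

/* The rules from the example filter file, one pattern per entry. */
static const filter_rule RULES[] = {
    { "my_*",      1000 },
    { "test_*",    1000 },
    { "debug_*",   0 },
    { "calculate", -1 },
    { "*",         1000000 },
};

/* Call limit of the first rule whose pattern matches `name`. */
static long call_limit_for(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < sizeof RULES / sizeof RULES[0]; i++) {
        if (fnmatch(RULES[i].pattern, name, 0) == 0) return RULES[i].limit;
    }
    return -1;   /* no rule matched: record without limit */
}

/* Decide whether the next call of `name` is still recorded. */
static int should_record(const char *name, long calls_so_far) {
    long limit = call_limit_for(name);
    if (limit < 0) return 1;
    return calls_so_far < limit;
}
```

With these rules, debug_* functions never appear in the trace, while calculate is always recorded.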
Switch Tracing On/Off
• Starting and stopping of tracing should be performed with care
• Tracing has to be switched back on at the same call level at which it was switched off, to ensure the consistency of the trace file
• Useful if your program behaves in an iterative manner or if you are only interested in some parts of your application
• Recompile your source code with the user macro “-DVTRACE”
#include "vt_user.h"
…
VT_OFF();
for( i = 1; i < 100; i++ ) {
    /* do something */
}
VT_ON();
…
%> vtcc … -DVTRACE source_code.c …
Selective Instrumentation
• Selective instrumentation can help you to reduce the size of your trace file, so that only the parts of interest are recorded
• One option is to use manual instrumentation instead of automatic instrumentation
• Another option is to modify your Makefile so that automatic instrumentation (the default) is only applied to the source files (functions) of interest
%> vtcc -vt:inst manual … source_code.c
Function Grouping
• Groups can be defined by the user to group related functions
– Groups can be assigned different colors in Vampir, highlighting application behavior
• Environment variable VT_GROUPS_SPEC
• Group file contains a list of groups with associated functions
export VT_GROUPS_SPEC=/path/to/groups.spec
CALC=calculate
MISC=my*;test
UNKNOWN=*
Advanced Performance Monitoring
• CUDA wrapper library
– Based on LD_PRELOAD
– Usable with dynamically linked libraries
– Little overhead (indirection)
– No re-compilation (neither application nor library)
[Figure: Application, Preload-Library (Wrapper-Function) and CUDA (Function), with enter and leave events around the wrapped function call]
Advanced Performance Monitoring
• vtlibwrapgen
– Abstraction layer for process monitoring
– Dynamic and static libraries
– Requires library’s header file only
– Portable
[Figure: monitor-gen takes foo.h and produces callback.inc.*; together with libmonitor/src and vt_user.h, make builds libmonitor.so]
vtlibwrapgen -g SDL -o SDLwrap.c /usr/include/SDL/*.h
vtlibwrapgen --build --shared -o libSDLwrap SDLwrap.c
LD_PRELOAD=$PWD/libSDLwrap.so <executable>
QUESTIONS?