Performance Analysis using the Vampir Toolchain
Robert Henschel (HPA-IU)
David Cronk (CS-UTK)
Thomas William (PSW-ZIH)
Overview

Morning Session (Innovation Center, Room 105)
• 09:00 – 10:15 Overview: Event Based Program Analysis
• 10:15 – 10:45 Break
• 10:45 – 11:45 Instrumentation and Runtime Measurement
• 11:45 – 13:00 Lunch break

Afternoon Session
• 13:00 – 13:45 Using PAPI Performance Counters
• 13:45 – 14:00 Break
• 14:00 – 15:00 Trace Visualization
• 15:00 – 15:30 Break
• 15:30 – 18:00 Hands On (Wrubel Computing Center, Building WCC, Room 107)
TU DRESDEN, ZIH, AND HPC
We do have computers in Germany too (although quite old ones)
Dresden University of Technology
• Founded in 1828
• One of the oldest technical universities in Germany
• 14 faculties and a number of specialized institutes
• More than 35,000 students, about 4,000 employees, 438 professors
• International courses of study, bachelor, masters
• One of the largest faculties for computer science in Germany
• 110 million Euro annual third-party funding
• http://tu-dresden.de
Center for Information Services and HPC (ZIH)
• Central Scientific Unit at TU Dresden
• Competence Center for "Parallel Computing and Software Tools"
• Strong commitment to support real users
• Development of algorithms and methods: Cooperation with users from all departments
• Providing infrastructure and qualified service for TU Dresden and Saxony
Structure of ZIH
• Management
– Director: Prof. Dr. Wolfgang E. Nagel
– Assistant directors: Dr. Peter Fischer (COO), Dr. Matthias S. Müller (CTO)
• Administration (7 Employees)
• Departments (ca. 100 Employees; incl. Trainees)
– Department of interdisciplinary function support and coordination (IAK)
– Department of networking and communication services (NK)
– Department of central systems and services (ZSD)
– Department of innovative methods of computing (IMC)
– Department of programming and software tool-kits (PSW)
– Department of distributed and data intensive computing (VDR)
Today's Main HPC Infrastructure (installed in 2006)

[Diagram: HPC component (6.5 TB main memory) and PC farm, connected to an HPC-SAN (68 TB hard-disk capacity) and a PC-SAN (68 TB hard-disk capacity), plus a petabyte tape storage (1 PB capacity); link bandwidths 8 GB/s, 4 GB/s, and 1.8 GB/s]
Areas of Expertise
• Research topics
– Architecture and performance analysis of High Performance Computers
– Programming methods and techniques for HPC systems
– Grid Computing
– Software tools to support programming and optimization
– Modeling algorithms of biological processes
– Mathematical models, algorithms, and efficient implementations
• Role of mediator between vendors, developers, and users
• Pick-up and preparation of new concepts, methods, and techniques
• Teaching and Education
Performance Analysis Tools
• The Vampir performance analysis toolkit
– Vampir: Scalable event trace visualization
– VampirTrace: Instrumentation and run-time data collection
– Open Trace Format (OTF): Event trace data format
Vampir-Team
http://www.tu-dresden.de/zih/ptools
http://www.vampir.eu
Matthias Jurenz
Dr. Andreas Knüpfer
Matthias Lieber
Holger Mickler
Dr. Hartmut Mix
Dr. Matthias Müller
Ronny Brendel
Jens Doleschal
Ronald Geisler
Daniel Hackenberg
Robert Henschel
Prof. Wolfgang E. Nagel
Michael Peter
Heide Rohling
Matthias Weber
Thomas William
Performance Analysis Tools
EVENT BASED PROGRAM ANALYSIS
Why performance analysis?
• Moore's Law still holds, so no need to tune performance?
• Increasingly difficult to get close to peak performance
– for sequential computation
• memory wall
• optimum pipelining, ...
– for parallel interaction
• Amdahl's law
• synchronization with a single late-comer, ...
• Efficiency is important because of limited resources
• Scalability is important to cope with the next bigger simulation
OVERVIEW
Basics about Parallelization
Performance Analysis with Profiling
Instrumentation and Tracing
Motivation
• Reasons for parallel programming:
– Higher Performance
• Solve the same problem in shorter time
• Solve larger problems in the same time
– Higher Capability
• Solve problems that cannot be solved on a single processor
• Larger memory on parallel computers
• Time constraints limit the possible problem size
(weather forecast: turn-around within a working day)
• In both cases performance is one of the major concerns:
– Also consider sequential performance within the parallel sections
Parallelization Strategies
• General strategy for parallelization:
– Distribute the work to many workers
• Limitations:
– Not all tasks can be split into smaller sub-tasks
– Dependencies between sub-tasks
– Coordination overhead
– (same as for human teams)
• Algorithms:
– Different algorithms for the same problem differ in terms of parallelization
– Different "best" algorithms for serial vs. parallel execution or for different parallelization schemes
BASICS ABOUT PARALLELIZATION
Speed-up
• Definition of speed-up S_P:

    S_P = T_S / T_P

– T_S: serial execution time
– T_P: parallel execution time with P CPUs
• Speed-up versus number of used processors:

[Figure: ideal vs. real speed-up for 1 to 8 CPUs]

• Actual speed-up often lower than the optimal one due to the aforementioned limitations.
Parallel Efficiency
• Alternative definition: parallel efficiency E_P:

    E_P = S_P / P = T_S / (P * T_P)

– T_S: serial execution time
– T_P: parallel execution time with P CPUs
• Parallel efficiency versus number of used processors:

[Figure: ideal vs. real parallel efficiency for 1 to 8 CPUs]
Amdahl’s law
• Fundamental limit of parallelization
• Only a fraction F of the algorithm is parallel with speed-up S_P
• A fraction (1-F) is serial
• Then the maximum resulting speed-up is:

    S = 1 / ( (1-F) + F/S_P )

[Figure: maximum speed-up for 1 to 16 CPUs, ideal and for F = 99%, 95%, 90%, 80%]
Amdahl’s law
• If you know your desired speed-up S you can calculate F:

    F = 1 - 1/S

– F gives you the percentage of your program that has to be executed in parallel in order to achieve a speed-up S (asymptotically).
– In order to estimate the resulting effort you need to know in which parts of your program (1-F) of the time is spent.
– This is even before considering the actual parallelization method
• Might add new serial sections
• Brings coordination overhead
• Will not scale arbitrarily high, i.e. the parallel section will stay > 0
Amdahl’s law, example
• Example program with some sub-routines calling one another:

    # calls   Time (%)   Accumulated Time (%)   Call
    155648      31.22         31.22             Calc
    603648      22.24         53.46             Multiply
    155648      10.05         63.51             Matmul
    214528       9.33         72.84             Copy
    603648       7.87         80.71             Find

– For a maximum speed-up of 2 one needs to parallelize Calc and Multiply.
– For a maximum speed-up of 5 all need to be parallelized!
General Parallelization Strategy
• Therefore, successful parallelization requires:
– Finding the actual hot-spots of work
– Sufficient potential for parallelization
– A parallelization strategy that introduces minimum coordination overhead
• There are no general rules! Things that help to achieve high performance:
– Know your application
– Know your compiler
– Understand the performance tool
– Know the characteristics of the hardware
PERFORMANCE ANALYSIS WITH PROFILING
Profiling
• Profiling gives an overview of the distribution of run time
• Usually on the level of subroutines, also possible at line-by-line level
• Rather low overhead
• Usually good enough to find computation hot spots
• Too little detail to detect performance problems and their causes
• More sophisticated ways of profiling:
– Based on hardware performance counters
– Phase-based profiles
– Call-path profiles
Profiling
• Profile Recording
– Of aggregated information (Time, Counts, …)
– About program and system entities
• Functions, loops, basic blocks
• Application, processes, threads, …
• Methods of Profile Creation
– PC sampling (statistical approach)
– Direct measurement (deterministic approach)
Profiling with gprof
• Compile with profiling support
– Using -pg for GNU, -p -g for Intel
– Optimization (-O3) might obscure the output somewhat
• Execute normally
• Used to be only for sequential programs
• Parallel profiling only works with the GMON_OUT_PREFIX trick:

%> mpicc -p -g -O2 heat-mpi-slow-big.c -o heat-mpi-slow-big
%> export GMON_OUT_PREFIX=ggg
%> mpirun -np 4 heat-mpi-slow-big
%> ls
ggg.11762 ggg.11763 ggg.11764 ggg.11765
Profiling with gprof
• Pre-process profiling output with gprof:
– Text output
– There are also GUI front-ends like pgprof (PGI) and kprof (KDE)
• For a single rank:

%> gprof [-b] heat-mpi-slow-big ggg.11765 | less

• Combine results for all ranks:

%> gprof -s heat-mpi-slow-big ggg.*
%> gprof [-b] heat-mpi-slow-big gmon.sum | less
Profiling with gprof
– Flat profile for one of four ranks:

Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.00 2.08 2.08 1 2.08 2.08 Algorithm
0.00 2.08 0.00 1 0.00 0.00 CalcBoundaries
0.00 2.08 0.00 1 0.00 0.00 DistributeNodes

– Flat profile for all four ranks combined:

Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.00 8.59 8.59 4 2.15 2.15 Algorithm
0.00 8.59 0.00 4 0.00 0.00 CalcBoundaries
0.00 8.59 0.00 4 0.00 0.00 DistributeNodes
Profiling with gprof
– Annotated call graph for one of four ranks:

Call graph
granularity: each sample hit covers 4 byte(s) for 0.48% of 2.08 seconds
index % time self children called name
2.08 0.00 1/1 main [2]
[1] 100.0 2.08 0.00 1 Algorithm [1]
-----------------------------------------------
<spontaneous>
[2] 100.0 0.00 2.08 main [2]
2.08 0.00 1/1 Algorithm [1]
0.00 0.00 1/1 DistributeNodes [4]
0.00 0.00 1/1 CalcBoundaries [3]
-----------------------------------------------
0.00 0.00 1/1 main [2]
[3] 0.0 0.00 0.00 1 CalcBoundaries [3]
-----------------------------------------------
0.00 0.00 1/1 main [2]
[4] 0.0 0.00 0.00 1 DistributeNodes [4]
-----------------------------------------------
Profiling
• Simple profiling is a good starting point
– Reveals computational hot spots
– Hides away outlier values in the average
• More details needed for
– Parallel analysis and identification of performance problems
– Finding optimization opportunities
• Advanced profiling tools:
– TAU http://www.cs.uoregon.edu/research/tau/
– HPCToolkit http://hpctoolkit.org/
INSTRUMENTATION AND TRACING
Event Tracing
• Collect more detailed information for more insight
• Do not summarize run-time information
• Collect individual events with properties during run-time
• Event Tracing can be used for:
– Visualization (VampirSuite)
– Automatic analysis (Scalasca)
– Debugging or for re-play (VampirSuite + Scalasca)
Tracing
• Recording of run-time events (points of interest)
– During program execution
– Enter/leave of functions/subroutines
– Send/receive of messages, synchronization
– More …
– Saved as event records
• Timestamp, process, thread, event type
• Event specific information
• Sorted by time stamp
– Collected via instrumentation & trace library
Profiling vs Tracing
• Tracing Advantages
– Preserve temporal and spatial relationships (context)
– Allow reconstruction of dynamic behavior on any required abstraction level
– Profiles can be calculated from trace
• Tracing Disadvantages
– Traces can become very large
– May cause perturbation
– Instrumentation and tracing is complicated
• Event buffering, clock synchronization, …
Common Event Types
• Enter/leave of function/routine/region
– Time stamp, process/thread, function ID
• Send/receive of P2P message (MPI)
– Time stamp, sender, receiver, length, tag, communicator
• Collective communication (MPI)
– Time stamp, process, root, communicator, # bytes
• Hardware performance counter values
– Time stamp, process, counter ID, value
• Etc.
Parallel Trace
10010 P 1 ENTER 5
10090 P 1 ENTER 6
10110 P 1 ENTER 12
10110 P 1 SEND TO 3 LEN 1024 ...
10330 P 1 LEAVE 12
10400 P 1 LEAVE 6
10520 P 1 ENTER 9
10550 P 1 LEAVE 9
...
10020 P 2 ENTER 5
10095 P 2 ENTER 6
10120 P 2 ENTER 13
10300 P 2 RECV FROM 3 LEN 1024 ...
10350 P 2 LEAVE 13
10450 P 2 LEAVE 6
10620 P 2 ENTER 9
10650 P 2 LEAVE 9
...
DEF TIMERRES 1000000000
DEF PROCESS 1 `Master`
DEF PROCESS 2 `Slave`
DEF FUNCTION 5 `main`
DEF FUNCTION 6 `foo`
DEF FUNCTION 9 `bar`
DEF FUNCTION 12 `MPI_Send`
DEF FUNCTION 13 `MPI_Recv`
Instrumentation
• Instrumentation: Process of modifying programs to detect and report events by calling instrumentation functions.
– Instrumentation functions provided by trace library
– Call == notification about run-time event
– There are various ways of instrumentation
Source Code Instrumentation
(manually or automatically)

Original:

int foo(void* arg){
  if (cond){
    return 1;
  }
  return 0;
}

Instrumented:

int foo(void* arg){
  enter(6);
  if (cond){
    leave(6);
    return 1;
  }
  leave(6);
  return 0;
}
Source Code Instrumentation
Manually
– Large effort, error prone
– Difficult to manage
Automatically
– Via source to source translation
– Program Database Toolkit (PDT)
http://www.cs.uoregon.edu/research/pdt/
– OpenMP Pragma And Region Instrumentor (Opari)
http://www.fz-juelich.de/zam/kojak/opari/
Wrapper Function Instrumentation
• Provide wrapper functions
– Call instrumentation function for notification
– Call original target for functionality
• Via preprocessor directives:

#define MPI_Init WRAPPER_MPI_Init
#define MPI_Send WRAPPER_MPI_Send

• Via library preload:
– Preload instrumented dynamic library
– Suitable for standard libraries (e.g. MPI, glibc)
The MPI Profiling Interface
– Each MPI function has two names:
• MPI_xxx and PMPI_xxx
– Selective replacement of MPI routines at link time

[Diagram: the user program calls MPI_Send; at link time this resolves to the wrapper library's MPI_Send, which records the event and then calls PMPI_Send in the MPI library]
Compiler Instrumentation
• gcc -finstrument-functions -c foo.c
• Many compilers support instrumentation:
(GCC, Intel, IBM, PGI, NEC, Hitachi, Sun Fortran, …)
• No source modification
void __cyg_profile_func_enter( <args> );
void __cyg_profile_func_exit( <args> );
Dynamic Instrumentation
• Modify binary executable in memory
• Insert instrumentation calls
• Very platform/machine dependent, expensive
• DynInst project (http://www.dyninst.org)
– Common interface
– Alpha/Tru64, MIPS/IRIX, PowerPC/AIX, Sparc/Solaris, x86/Linux+Windows, ia64/Linux
Instrumentation & Trace Overhead
• Overhead for an empty function call (in ticks); without instrumentation: 15 ticks

             DynInst    GCC    PDT   manual
  dummy          568     52     60       59
  id+timer       937    451    300      299
  f.id           633    219    120      119
  f.symbol       637    278    121      120
  f.addr.        638    115    117      117
Trace Libraries
• Provide instrumentation functions
• Receive events of various types
• Collect event properties
– Time stamp
– Location (thread, process, cluster node, MPI rank)
– Event specific properties
– Perhaps hardware performance counter values
• Record to memory buffer, flush eventually
• Try to be fast, minimize overhead
Trace Files & Formats
• TAU Trace Format (Univ. of Oregon)
• Epilog (ZAM, FZ Jülich)
• STF (Pallas, now Intel)
• Open Trace Format (OTF)
– ZIH, TU Dresden in coop. with Oregon & Jülich
– Single/multiple files per trace
– Fast sequential and random access
– Including API for writing/reading
– Supports auxiliary information
– See http://www.tu-dresden.de/zih/otf/
Interoperability
Other Tools
• TAU profiling (University of Oregon, USA)
– Extensive profiling and tracing for parallel applications, with visualization, comparison, etc.
http://www.cs.uoregon.edu/research/tau/
• Paraver (CEPBA, Barcelona, Spain)
– Trace-based parallel performance analysis and visualization
http://www.cepba.upc.edu/paraver/
• Scalasca (FZ Jülich)
– Tracing and automatic detection of performance problems
http://www.scalasca.org
• Intel Trace Collector & Analyzer
– Very similar to Vampir