          Performance Measurement Tools and Techniques Pekka Manninen CSC, the Finnish IT center for science PRACE Petascale Summer School Stockholm, Sweden Aug 26-29 2008

Performance Measurement Tools and Techniques

Pekka Manninen

CSC, the Finnish IT center for science

PRACE Petascale Summer School

Stockholm, Sweden Aug 26-29 2008


Part I: Introduction

• Motivation
• Traditional and petascale optimization
• Performance data and optimization


Motivation

It is all about software
• There is no use in building petascale machines without suitable software
• Single-core efficiency is no longer the key issue; scalability is

Large-scale parallel application development is still work on the frontier, with novel challenges
• How to utilize thousands of cores
• How to deal with I/O

Good performance analysis tools are mandatory for parallel program development


Traditional optimization process overview

Application development stage
- Choose algorithms
- Choose data structures
- Develop or port application

Serial optimization stage
- Perform serial optimization

Parallel optimization stage
- Apply parallelization
- Perform parallel optimization


Petascales: Optimization flowchart

[Flowchart; the arrows between the boxes are not recoverable from the extracted text. It starts from "Unoptimal, correct parallel code" and ends at "Optimized code".]

Development boxes
• Choose algorithms, data structures and parallelization strategy
• Develop code
• Apply parallelization

Measurement boxes
• Measure parallel performance
• Measure single-core performance
• Identify performance bottlenecks

Decision boxes (each with Yes/No branches)
• Assess scalability: sufficient?
• Converged?
• Sufficient?

Optimization boxes
• Reduce overhead from communication and load imbalance
• Optimize I/O
• Apply compiler optimization
• Tune for the processor
• Link optimized libraries


Optimization considerations

1. Load balance

2. Minimal dedicated time for communication
• Minimize communication
• Overlap computation and communication

3. CPU utilization
• Optimal memory access (cache utilization)
• Pipeline performance (branch prediction, prefetching)
• SIMD operations

4. Efficient I/O


Examples of relevant measurements

Execution time across CPUs
• For an application to scale, all tasks should be kept equally loaded

MPI trace
• How and when the communication is carried out, hinting at how to optimize communication
• Communication bandwidth

Function call-tree and execution time profile
• Pinpoint the execution hotspots, i.e. where to spend most of the effort in serial optimization

Hardware counters (e.g. cache utilization ratio, instruction usage, computational intensity, flop rate)
• Provide insight into the potential inefficiencies of a given routine

I/O statistics
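As a rough illustration of the first measurement, execution time across CPUs, the sketch below computes two commonly used imbalance figures from per-task times. The task times are hypothetical, and the two figures are generic definitions chosen for illustration; they are not claimed to be the exact formulas any particular tool reports.

```python
from statistics import mean

def load_imbalance(times):
    """Return (max/mean ratio, share of the slowest task's time that the average task idles, in %)."""
    slowest = max(times)
    average = mean(times)
    ratio = slowest / average
    idle_pct = 100.0 * (slowest - average) / slowest
    return ratio, idle_pct

# Hypothetical per-task wall-clock times in seconds (not measured data).
task_times = [10.0, 10.0, 10.0, 20.0]
ratio, idle_pct = load_imbalance(task_times)
print(f"max/mean = {ratio:.2f}, average idle share = {idle_pct:.1f}%")
```

A perfectly balanced run gives a ratio of 1 and an idle share of 0%; the further the slowest task is from the average, the more time the other tasks spend waiting for it.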


Performance data collection

Two “dimensions”

When collection is triggered
• Asynchronous (sampling) or synchronous (code instrumentation)

How data is recorded
• Profile or a trace file

[Diagram: acquisition (sampling, instrumentation) versus presentation (profile, timeline)]
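The difference between the two ways of recording data can be sketched in a few lines. The example below uses synchronous instrumentation (a hypothetical `instrument` decorator, not part of any tool discussed here) and stores every measurement twice: once as a trace, which keeps one timestamped event per call, and once as a profile, which keeps only accumulated totals per function. Sampling is not shown.

```python
import time
from collections import defaultdict
from functools import wraps

trace = []                               # timeline: one (name, start, end) event per call
profile = defaultdict(lambda: [0, 0.0])  # summary: name -> [number of calls, total seconds]

def instrument(func):
    """Wrap func so that every call is timed and recorded in both data structures."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            end = time.perf_counter()
            trace.append((func.__name__, start, end))
            entry = profile[func.__name__]
            entry[0] += 1
            entry[1] += end - start
    return wrapper

@instrument
def work(n):
    return sum(i * i for i in range(n))

for n in (1000, 2000, 3000):
    work(n)

print(len(trace), "trace events;", profile["work"][0], "calls in the profile")
```

The trace grows with every call, whereas the profile stays at one entry per function; this is why trace files become large on long runs and why a profile cannot show when something happened.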


Things to be kept in mind

The objective of performance analysis is to understand the behavior of the whole system and to apply that understanding to improve the performance

Instrumentation always causes overhead
• Artifacts in all measurements

A performance analyst should
• Have an understanding of the different levels of the system architecture
• Be able to communicate with users as well as developers
• Be patient enough to explore a broad range of hypotheses and double-check them
• Be open-minded as to where the performance bottleneck could be


Part II: Cray performance analysis tools as an example

• Overview
• Usage


Cray performance analysis infrastructure

CrayPat
• pat_build - a utility for application instrumentation without a need for source code modification
• Transparent run-time library for measurements
• pat_report - performance reports and visualization file
• pat_help - interactive help utility

Cray Apprentice2
• An advanced graphical performance analysis and visualization tool


Instrumentation with pat_build

No source or makefile modification needed, link-time instrumentation
• Requires object files
• Instruments compiler-optimized code
• Generates a stand-alone instrumented executable and preserves the original binary

Automatic instrumentation at group level
• Supports both asynchronous and synchronous instrumentation

Basic usage
% module load xt-craypat
% make clean; make
% pat_build -g <trace groups> a.out


Instrumentation with pat_build

Trace groups
biolibs    Cray bioinformatics library routines
blas       BLAS subroutines
heap       Dynamic heap
io         stdio + sysio trace groups
lapack     LAPACK subroutines
math       ANSI math
mpi        MPI statistics
omp        OpenMP API
omp-rtl    OpenMP runtime library
pthreads   POSIX threads
shmem      SHMEM API
stdio      All functions that accept or return the FILE* construct
sysio      I/O system calls
system     System calls


Instrumentation with pat_build

Automatic profiling analysis (APA)
% module load xt-craypat/4.2
% make clean; make
% pat_build -O apa a.out

Executing the instrumented executable produces measurement data for pat_report, which in turn generates a report and an .apa file that allows for fine-tuning the analysis


Fine-grained instrumentation

Fortran
      include "pat_apif.h"
      ...
      call PAT_region_begin(id, "label", ierr)
      <code segment>
      call PAT_region_end(id, ierr)

C
      #include <pat_api.h>
      ...
      ierr = PAT_region_begin(id, "label");
      <code segment>
      ierr = PAT_region_end(id);


Collecting data

Run the instrumented application (a.out+pat) as usual; this creates the measurement data (.xf file)

Must be run on a Lustre file system

Optional runtime variables
• Get an optional timeline view of the program by setting PAT_RT_SUMMARY=0
• Request hardware performance counter information by setting PAT_RT_HWPC=<HWPC group id>
• Customize the number of files used to store raw data with PAT_RT_EXPFILE_MAX


Collecting data

Hardware performance counter groups
1   Summary with translation lookaside buffer metrics
2   L1 and L2 cache metrics
3   Bandwidth information
4   HyperTransport information
5   Floating point instruction (including SSE) information
6   Cycles stalled and resources empty
7   Cycles stalled and resources full
8   Instructions and branches
9   Instruction cache values


Analysis with pat_report

Combines information from binary with the raw performance measurement data

Generates a text report of performance results and/or formats data to be visualized with Apprentice2

Basic usage
pat_report -O <keywords> data_file.xf

Useful keywords
profile        Subroutine level data
callers        Function callers
calltree       Calltree
heap           Heap information, instrument with -g heap
mpi            MPI statistics, instrument with -g mpi
load_balance   Load balance information
help           Available options


Analysis with pat_report

ct                      -O calltree
defaults                Tables that would appear by default.
heap                    -O heap_program,heap_hiwater,heap_leaks
io                      -O read_stats,write_stats
lb                      -O load_balance
load_balance            -O lb_program,lb_group,lb_function
mpi                     -O mpi_callers
callers                 Profile by Function and Callers
callers+src             Profile by Function and Callers, with Line Numbers
calltree                Function Calltree View
calltree+src            Calltree View with Callsite Line Numbers
heap_hiwater            Heap Stats during Main Program
heap_leaks              Heap Leaks during Main Program
heap_program            Heap Usage at Start and End of Main Program
hwpc                    HW Performance Counter Data
load_balance_function   Load Balance across PE's by Function
load_balance_group      Load Balance across PE's by FunctionGroup
load_balance_program    Load Balance across PE's


Analysis with pat_report

load_balance_sm     Load Balance with MPI Sent Message Stats
loops               Loop Stats from -hprofile_generate
mpi_callers         MPI Sent Message Stats by Caller
mpi_dest_bytes      MPI Sent Message Stats by Destination PE
mpi_dest_counts     MPI Sent Message Stats by Destination PE
mpi_rank_order      Suggested MPI Rank Order
mpi_sm_rank_order   Sent Message Stats and Suggested MPI Rank Order
pgo_details         Loop Stats detail from -hprofile_generate
profile             Profile by Function Group and Function
profile+src         Profile by Group, Function, and Line
profile_pe.th       Profile by Function Group and Function
profile_pe_th       Profile by Function Group and Function
profile_th_pe       Profile by Function Group and Function
program_time        Program Wall Clock Time
read_stats          File Input Stats by Filename
samp_profile        Profile by Function
samp_profile+src    Profile by Group, Function, and Line
thread_times        Program Wall Clock Time
write_stats         File Output Stats by Filename


Part III: Case study

• Workplan
• Demonstration of tools
• Analysis


Case study: CP2K performance analysis

A code for ab-initio molecular dynamics simulations written in Fortran 95, uses MPI for parallelization

Test job: A few time-step DFT simulation of 512 water molecules

We wish to know the following
• Function profile - where are the hotspots?
• Single-core efficiency
• Load balance and communication analysis - what are the scalability bottlenecks?

Runtimes (s) on Cray XT4

#CPUs   FFTSG   FFTW3
  64     1550    1592
 128      979    1022
 256      659     660
 512      560     558
1024      416     434
2048      546     542

Peak performance around 19 TFlop/s - still quite far away from petascales

With 2048 cores there is not enough to calculate and so much communication overhead that the execution is slower than with 1024 cores!
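To quantify how far the scaling is from ideal, the runtimes in the table can be converted to speedup and parallel efficiency. The sketch below does this relative to the 64-core run; choosing 64 cores as the baseline is an assumption made here, since no smaller run is listed.

```python
# Runtimes in seconds from the table above (Cray XT4).
runtimes = {
    "FFTSG": {64: 1550, 128: 979, 256: 659, 512: 560, 1024: 416, 2048: 546},
    "FFTW3": {64: 1592, 128: 1022, 256: 660, 512: 558, 1024: 434, 2048: 542},
}

def scaling(times, base=64):
    """Speedup and parallel efficiency relative to the run on `base` cores."""
    result = {}
    for cores, seconds in sorted(times.items()):
        speedup = times[base] / seconds
        efficiency = speedup / (cores / base)
        result[cores] = (speedup, efficiency)
    return result

for library, times in runtimes.items():
    print(library)
    for cores, (speedup, efficiency) in scaling(times).items():
        print(f"  {cores:5d} cores: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```

With this baseline the FFTSG run on 1024 cores is less than four times faster than on 64 cores, a parallel efficiency of roughly 23%, and the speedup drops again at 2048 cores.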


Case study: CP2K performance analysis

Workplan
• Take the longer route to reduce instrumentation overhead
• Instrument the code (that is built with the CrayPat module loaded)
    pat_build -O apa cp2k.pat (→ cp2k.pat+pat)
• Execute cp2k.pat+pat with 512 cores (→ an .xf file)
• Obtain a sampling profile
    pat_report -O samp_profile+src xf-file.xf > samp_profile
• This will produce the profile as well as an .apa file
• Run a more exhaustive analysis by editing the .apa file and rebuilding the executable
    pat_build -O apa-file.apa
• Executing this will produce another .xf file - visualize the analysis with Apprentice2


Sampling profile output

Table 1: Profile by Group, Function, and Line

 Samp % |  Samp |    Imb. |   Imb. |Group
        |       |    Samp | Samp % | Function
        |       |         |        |  Source
        |       |         |        |   Line
        |       |         |        |    PE='HIDE'

 100.0% | 35712 |      -- |     -- |Total
|------------------------------------------------
|  63.1% | 22537 |      -- |     -- |MPI
||-----------------------------------------------
||  17.3% |  6164 |  459.63 |   7.0% |MPI_Bcast
||  14.8% |  5291 | 1409.44 |  21.1% |MPI_Recv
||   8.6% |  3088 | 2372.59 |  43.5% |mpi_allreduce_
||   5.9% |  2123 |  211.08 |   9.1% |mpi_alltoallv_
||   5.2% |  1844 |  909.33 |  33.1% |MPI_Reduce
||   4.5% |  1595 |  415.49 |  20.7% |mpi_waitall_
||   1.4% |   508 |  224.91 |  30.7% |MPI_Send
||   1.1% |   406 |   84.81 |  17.3% |mpi_alltoall_
||===============================================
|  23.4% |  8355 |      -- |     -- |ETC
||-----------------------------------------------
||  14.7% |  5249 |  397.36 |   7.1% |dgemm_kernel
||   3.5% |  1256 |  149.35 |  10.7% |dgemm_otcopy
||   1.5% |   531 |   80.12 |  13.1% |dgemm_oncopy
||===============================================
|  13.5% |  4820 |      -- |     -- |USER
||-----------------------------------------------
||   3.2% |  1145 |      -- |     -- |UPDATE_COST_CPU_ROW.in.DISTRIBUTION_OPTIMIZE
||||---------------------------------------------
4|||  2.2% |   781 |  486.43 |  38.5% |line.359
4|||  1.0% |   343 |  216.15 |  38.7% |line.358
|================================================


Sampling profile

The corresponding fractions with other processor counts

#cores   MPI (%)   ETC (%)   USER (%)
 128      37.0      47.3      15.7
 256      46.4      38.7      14.9
 512      63.1      23.4      13.5
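The fractions become easier to interpret when they are turned into approximate seconds. The sketch below multiplies them by the FFTSG runtimes from the runtime table of the case study. This rests on an assumption that the slides do not state: that the fractions measured in the sampling experiment carry over unchanged to the uninstrumented runs. Treat the result as an estimate only.

```python
# Sample fractions (%) from the table above: (MPI, ETC, USER) per core count.
fractions = {128: (37.0, 47.3, 15.7), 256: (46.4, 38.7, 14.9), 512: (63.1, 23.4, 13.5)}
# FFTSG runtimes (s) from the runtime table of the case study.
runtime_s = {128: 979, 256: 659, 512: 560}

def estimated_seconds(cores):
    """Assumption: the sampled fractions carry over to the uninstrumented runtime."""
    mpi, etc, user = fractions[cores]
    total = runtime_s[cores]
    return {"MPI": total * mpi / 100, "ETC": total * etc / 100, "USER": total * user / 100}

for cores in sorted(fractions):
    estimate = estimated_seconds(cores)
    print(cores, {group: round(seconds, 1) for group, seconds in estimate.items()})
```

Under this assumption the time spent in MPI stays between roughly 300 and 360 seconds at all three core counts, while the combined ETC and USER time shrinks from about 620 to about 210 seconds. The computation scales; the communication does not.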


APA file

# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
# pat_build -O cp2k.pat+pat+4779-896sdt.apa
#
# These suggested trace options are based on data from:
#
# /wrk/pmannin/Prace/CP2K/cp2k/tests/scala/512/sg/cp2k.pat+pat+4779-896sdt.ap2, /wrk/pmannin/Prace/CP2K/cp2k/tests/scala/512/sg/cp2k.pat+pat+4779-896sdt.xf
# ----------------------------------------------------------------------
# HWPC group to collect by default.
# -Drtenv=PAT_RT_HWPC=0 # Summary with instructions metrics.
-Drtenv=PAT_RT_HWPC=2
# ----------------------------------------------------------------------
# Libraries to trace.
-g mpi
# ----------------------------------------------------------------------

Select the hardware counter group here or give the PAPI calls explicitly; here we get group 2 (-Drtenv=PAT_RT_HWPC=2)

Insert the desired trace groups here (-g mpi)


APA file continued

# User-defined functions to trace, sorted by % of samples.
# Limited to top 200. A function is commented out if it has < 1%
# of samples, or if a cumulative threshold of 90% has been reached,
# or if it has size < 200 bytes.
-w # Enable tracing of user-defined functions.
# Note: -u should NOT be specified as an additional option.
# 3.21% 756 bytes
-T UPDATE_COST_CPU_ROW.in.DISTRIBUTION_OPTIMIZE
# 0.74% 5757 bytes
-T PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS
# 0.66% 72344 bytes
-T RS_PW_TRANSFER_DISTRIBUTED.in.REALSPACE_GRID_TYPES
# 0.64% 9426 bytes
-T RS_PW_TRANSFER_REPLICATED.in.REALSPACE_GRID_TYPES
# 0.56% 2172 bytes
-T PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS
# 0.49% 17890 bytes
# -T CP_SM_FM_MULTIPLY_2D.in.CP_SM_FM_INTERACTIONS

We may now select the functions we want to trace and reduce the instrumentation overhead by ignoring functions with little significance for the execution time; here we uncomment a few
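The selection rule quoted in the file header can be written down as a small filter. This is a sketch of the rule as worded in the comments, not the tool's actual implementation; in particular, how the cumulative 90% threshold is accumulated is an assumption made here (sample percentages are summed over the listed functions in order). The function name `suggest` and its parameters are hypothetical.

```python
# (function, % of samples, size in bytes) as listed in the .apa file above.
candidates = [
    ("UPDATE_COST_CPU_ROW.in.DISTRIBUTION_OPTIMIZE", 3.21, 756),
    ("PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS", 0.74, 5757),
    ("RS_PW_TRANSFER_DISTRIBUTED.in.REALSPACE_GRID_TYPES", 0.66, 72344),
    ("RS_PW_TRANSFER_REPLICATED.in.REALSPACE_GRID_TYPES", 0.64, 9426),
    ("PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS", 0.56, 2172),
    ("CP_SM_FM_MULTIPLY_2D.in.CP_SM_FM_INTERACTIONS", 0.49, 17890),
]

def suggest(functions, min_pct=1.0, cumulative_pct=90.0, min_bytes=200, limit=200):
    """Return (name, traced) pairs following the rule quoted in the file header."""
    suggestions, cumulative = [], 0.0
    for name, pct, size in sorted(functions, key=lambda f: -f[1])[:limit]:
        traced = pct >= min_pct and cumulative < cumulative_pct and size >= min_bytes
        suggestions.append((name, traced))
        cumulative += pct  # assumption: the threshold accumulates sample percentages
    return suggestions

for name, traced in suggest(candidates):
    print(("-T " if traced else "# -T ") + name)
```

Applied to the six functions shown, the rule as stated would leave only UPDATE_COST_CPU_ROW enabled, since all the others are below 1% of samples. The additional -T lines that are active in the file are consistent with the functions having been uncommented by hand.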


Performance analysis

Let us get the most out of the analysis data

pat_report -O defaults,mpi,io,lb,heap,mpi_dest_bytes,hwpc,mpi_sm_rank_order xf-file.xf


Table 1: Profile by Function Group and Function

Experiment=1 / Group / Function / PE='HIDE'

========================================================================

Totals for program

------------------------------------------------------------------------

Time% 100.0%

Time 1527.856590

Imb.Time --

Imb.Time% --

Calls 167616828

REQUESTS_TO_L2:DATA 12.127M/sec 16047354721 req

DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 8.449M/sec 11180256256 fills

DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 3.139M/sec 4153438330 fills

PAPI_L1_DCA 1519.962M/sec 2011407347660 refs

User time (approx) 1323.327 secs 3043652030586 cycles

Cycles 1323.327 secs 3043652030586 cycles

User time (approx) 1323.327 secs 3043652030586 cycles

Utilization rate 86.6%

LD & ST per D1 miss 131.18 refs/miss

D1 cache hit ratio 99.2%

LD & ST per D2 miss 484.28 refs/miss

D2 cache hit ratio 72.9%

D1+D2 cache hit ratio 99.8%

Effective D1+D2 Reuse 7.57 refs/byte

System to D1 refill 3.139M/sec 4153438330 lines

System to D1 bandwidth 191.567MB/sec 265820053147 bytes

L2 to Dcache bandwidth 515.661MB/sec 715536400364 bytes

Note how the instrumentation affects the performance


MPI_SYNC

------------------------------------------------------------------------

Time% 51.9%

Time 792.264406

Imb.Time --

Imb.Time% --

Calls 347497

REQUESTS_TO_L2:DATA 0.474M/sec 373392744 req

DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 0.371M/sec 291730786 fills

DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 0.025M/sec 19302776 fills

PAPI_L1_DCA 1444.128M/sec 1136805936028 refs

User time (approx) 787.192 secs 1810541859938 cycles

Cycles 787.192 secs 1810541859938 cycles

User time (approx) 787.192 secs 1810541859938 cycles

Utilization rate 99.4%

LD & ST per D1 miss 3654.93 refs/miss

D1 cache hit ratio 100.0%

LD & ST per D2 miss 58893.39 refs/miss

D2 cache hit ratio 93.8%

D1+D2 cache hit ratio 100.0%

Effective D1+D2 Reuse 920.21 refs/byte

System to D1 refill 0.025M/sec 19302776 lines

System to D1 bandwidth 1.497MB/sec 1235377634 bytes

L2 to Dcache bandwidth 22.619MB/sec 18670770284 bytes

This is all load imbalance!


USER

------------------------------------------------------------------------

Time% 31.1%

Time 475.803299

Imb.Time --

Imb.Time% --

Calls 141943266

REQUESTS_TO_L2:DATA 43.794M/sec 13550844676 req

DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 31.395M/sec 9714170564 fills

DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 10.629M/sec 3288776269 fills

PAPI_L1_DCA 1861.004M/sec 575832343661 refs

User time (approx) 309.420 secs 711666603720 cycles

Cycles 309.420 secs 711666603720 cycles

User time (approx) 309.420 secs 711666603720 cycles

Utilization rate 65.0%

LD & ST per D1 miss 44.28 refs/miss

D1 cache hit ratio 97.7%

LD & ST per D2 miss 175.09 refs/miss

D2 cache hit ratio 74.7%

D1+D2 cache hit ratio 99.4%

Effective D1+D2 Reuse 2.74 refs/byte

System to D1 refill 10.629M/sec 3288776269 lines

System to D1 bandwidth 648.732MB/sec 210481681247 bytes

L2 to Dcache bandwidth 1916.183MB/sec 621706916112 bytes

Cache statistics. There's room to improve both L1 and L2 utilization.
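The derived cache figures in the report follow from the three raw counters. The sketch below recomputes them for the USER group. The formulas are inferred from the printed numbers and are an assumption rather than a documented definition, but they reproduce the reported values to the precision shown.

```python
def cache_metrics(l1_refs, refills_from_l2, refills_from_system):
    """Derive hit ratios from the three raw counters of HWPC group 2."""
    d1_misses = refills_from_l2 + refills_from_system  # every D1 miss is refilled from L2 or memory
    return {
        "refs_per_d1_miss": l1_refs / d1_misses,
        "d1_hit": 1 - d1_misses / l1_refs,
        "d2_hit": 1 - refills_from_system / d1_misses,
        "d1_d2_hit": 1 - refills_from_system / l1_refs,
    }

# Raw counter values copied from the USER group above.
user = cache_metrics(
    l1_refs=575832343661,            # PAPI_L1_DCA
    refills_from_l2=9714170564,      # DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED
    refills_from_system=3288776269,  # DATA_CACHE_REFILLS_FROM_SYSTEM:ALL
)
for name, value in user.items():
    print(f"{name}: {value:.4f}")
```

Seen this way, a D2 hit ratio of about 75% means that roughly one in four D1 misses in user code goes all the way to main memory.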


Table 2: Load Balance with MPI Sent Message Stats

Time % | Time | Sent | Sent Msg | Avg Sent |Experiment=1

| | Msg | Total Bytes | Msg Size |Group

| | Count | | | PE[mmm]

100.0% | 1527.687783 | 587504 | 27691142616 | 47133.54 |Total

|--------------------------------------------------------------------

| 51.9% | 792.264406 | -- | -- | -- |MPI_SYNC

||-------------------------------------------------------------------

|| 0.2% | 1294.073494 | -- | -- | -- |pe.369

|| 0.0% | 370.692059 | -- | -- | -- |pe.434

|| 0.0% | 253.923512 | -- | -- | -- |pe.14

||===================================================================

| 31.1% | 475.803299 | -- | -- | -- |USER

||-------------------------------------------------------------------

|| 0.1% | 769.359224 | -- | -- | -- |pe.28

|| 0.0% | 247.921795 | -- | -- | -- |pe.5

|| 0.0% | 195.379051 | -- | -- | -- |pe.405

||===================================================================

| 15.3% | 234.166242 | 587504 | 27691142616 | 47133.54 |MPI

||-------------------------------------------------------------------

|| 0.0% | 277.189769 | 668446 | 28291878712 | 42324.85 |pe.14

|| 0.0% | 233.824474 | 577886 | 27245088600 | 47146.13 |pe.393

|| 0.0% | 205.883493 | 581264 | 27281927088 | 46935.52 |pe.305

||===================================================================

MPI data transfer without load imbalance


Table 3: MPI Sent Message Stats by Caller

Sent Msg | Sent | MsgSz | 16B<= | 256B<= | 4KB<= | 64KB<= |Experiment=1

Total Bytes | Msg | <16B | MsgSz | MsgSz | MsgSz | MsgSz |Function

| Count | Count | <256B | <4KB | <64KB | <1MB | Caller

| | | Count | Count | Count | Count | PE[mmm]

27691142615 | 587504 | 23182 | 21316 | 179467 | 250579 | 112965 |Total

|-----------------------------------------------------------------------------

| 16856792064 | 205380 | -- | -- | -- | 147168 | 58212 |mpi_isend_

||----------------------------------------------------------------------------

|| 10371981312 | 58212 | -- | -- | -- | -- | 58212 |MP_ISENDRECV_RM2.in.MESSAGE_PASSING

|||---------------------------------------------------------------------------

3|| 8620867584 | 48384 | -- | -- | -- | -- | 48384 |CP_SM_FM_MULTIPLY_2D.in.CP_SM_FM_INTERACTIONS

4|| | | | | | | | CP_SM_FM_MULTIPLY.in.CP_SM_FM_INTERACTIONS

|||||-------------------------------------------------------------------------

5|||| 3232825344 | 18144 | -- | -- | -- | -- | 18144 |QS_KS_BUILD_KOHN_SHAM_MATRIX.in.QS_KS_METHODS

6|||| | | | | | | | QS_KS_UPDATE_QS_ENV.in.QS_KS_METHODS

|||||||-----------------------------------------------------------------------

7|||||| 2649120768 | 14868 | -- | -- | -- | -- | 14868 |SCF_ENV_DO_SCF.in.QS_SCF

8|||||| | | | | | | | SCF.in.QS_SCF

9|||||| | | | | | | | QS_ENERGIES.in.QS_ENERGY

10||||| | | | | | | | QS_FORCES.in.QS_FORCE

11||||| | | | | | | | FORCE_ENV_CALC_ENERGY_FORCE.in.FORCE_ENV_METHODS

||||||||||||------------------------------------------------------------------

...

||||||||||||||||||------------------------------------------------------------

18|||||||||||||||| 1628024832 | 9072 | -- | -- | -- | -- | 9072 |pe.416

18|||||||||||||||| 1617573888 | 9072 | -- | -- | -- | -- | 9072 |pe.164

18|||||||||||||||| 1600155648 | 9072 | -- | -- | -- | -- | 9072 |pe.374

||||||||||||||||||============================================================

Message passing profile

We could go into a routine and see if we could aggregate small messages


Table 5: Heap Stats during Main Program

Tracked | MBytes | Total | Allocs | Total | Tracked | Tracked |Experiment=1

Heap | Not | Allocs | Not | Frees | Objects | MBytes |PE[mmm]

HiWater | Tracked | | Tracked | | Not | Not |

MBytes | | | | | Freed | Freed |

54.178 | 665.609 | 9047403 | 2167539 | 9046867 | 465 | 0.705 |Total

|-----------------------------------------------------------------------------------

| 67.376 | 568.817 | 8618099 | 1962537 | 8617537 | 468 | 0.563 |pe.424

| 53.599 | 669.015 | 9026169 | 2020513 | 9025735 | 378 | 0.679 |pe.350

| 45.430 | 646.063 | 8993493 | 2260616 | 8993059 | 381 | 0.401 |pe.379

|===================================================================================


Table 7: File Input Stats by Filename

Read | Read MB | Read Rate | Reads | Read |Experiment=1

Time | | MB/sec | | B/Call |File Name

| | | | | PE[mmm]

| | | | | File Desc

0.000 | 1.132719 | 3304.704390 | 47 | 25271.11 |Total

|---------------------------------------------------------------

| 0.000 | 0.886948 | 3678.979746 | 34 | 27353.88 |NA

||--------------------------------------------------------------

|| 0.000 | 0.001757 | 5.794658 | 34 | 54.18 |pe.107

3| | | | | | fd.12

|| 0.000 | 0.001757 | 7.319901 | 34 | 54.18 |pe.114

3| | | | | | fd.12

|| 0.000 | 0.000929 | 7.620604 | 19 | 51.26 |pe.0

3| | | | | | fd.49

||==============================================================

| 0.000 | 0.245771 | 2417.241058 | 13 | 19823.85 |<NA>

||--------------------------------------------------------------

|| 0.052 | 0.245771 | 4.721184 | 6899 | 37.35 |pe.0

3| | | | | | fd.49

|| 0.000 | -- | -- | -- | -- |pe.51

|| 0.000 | -- | -- | -- | -- |pe.242

|===============================================================


Table 8: File Output Stats by Filename

Write | Write MB | Write Rate | Writes |Write B/Call |Experiment=1

Time | | MB/sec | | |File Name

| | | | | PE[mmm]

| | | | | File Desc

0.006 | 1627.057666 | 280217.977719 | 166 | 10277672.40 |Total

|------------------------------------------------------------------------

| 0.005 | 1624.943947 | 301911.123163 | 56 | 30426379.00 |out-w512-RESTART.wfn

||-----------------------------------------------------------------------

|| 2.756 | 1624.943947 | 589.670154 | 28714 | 59339.60 |pe.0

3| | | | | | fd.49

|| 0.000 | -- | -- | -- | -- |pe.51

|| 0.000 | -- | -- | -- | -- |pe.242

||=======================================================================

| 0.000 | 2.032463 | 8471.972651 | 108 | 19733.26 |<NA>

||-----------------------------------------------------------------------

|| 0.123 | 2.032463 | 16.546812 | 55300 | 38.54 |pe.0

3| | | | | | fd.49

|| -- | -- | -- | -- | -- |pe.51

|| -- | -- | -- | -- | -- |pe.242

||=======================================================================

| 0.000 | 0.081256 | 440.883140 | 2 | 42601.50 |stdout

||-----------------------------------------------------------------------

|| 0.094 | 0.081256 | 0.861100 | 1238 | 68.82 |pe.0

3| | | | | | fd.1

|| 0.000 | -- | -- | -- | -- |pe.51

|| 0.000 | -- | -- | -- | -- |pe.242

|========================================================================

Rather modest I/O; check the I/O bandwidth!
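Checking the bandwidth amounts to dividing the volume by the time. The sketch below does this for the restart file written by pe.0, using the values printed in Table 8. It assumes that the report's MB means 2**20 bytes, which is what makes the bytes-per-call figure agree with the table.

```python
MB = 1024 * 1024  # assumption: the report's MB means 2**20 bytes

def io_stats(megabytes, seconds, calls):
    """Return (rate in MB/s, average bytes per call)."""
    return megabytes / seconds, megabytes * MB / calls

# pe.0 writing out-w512-RESTART.wfn, values from Table 8.
rate, bytes_per_call = io_stats(megabytes=1624.943947, seconds=2.756, calls=28714)
print(f"{rate:.1f} MB/s, {bytes_per_call:.1f} bytes per write")
```

The result, close to 590 MB/s with an average of about 58 KB per write call, matches the pe.0 row of the table up to the rounding of the printed write time.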


Table 13: Load Balance across PE's

...

========================================================================

pe.491

------------------------------------------------------------------------

Time% 0.2%

Time 1733.719207

Calls 23369392

REQUESTS_TO_L2:DATA 8.170M/sec 13727855580 req

DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 5.761M/sec 9680509601 fills

DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 2.079M/sec 3493230434 fills

PAPI_L1_DCA 1461.354M/sec 2455488979184 refs

User time (approx) 1680.284 secs 3864653185625 cycles

Cycles 1680.284 secs 3864653185625 cycles

User time (approx) 1680.284 secs 3864653185625 cycles

Utilization rate 96.9%

LD & ST per D1 miss 186.39 refs/miss

D1 cache hit ratio 99.5%

LD & ST per D2 miss 702.93 refs/miss

D2 cache hit ratio 73.5%

D1+D2 cache hit ratio 99.9%

Effective D1+D2 Reuse 10.98 refs/byte

System to D1 refill 2.079M/sec 3493230434 lines

System to D1 bandwidth 126.889MB/sec 223566747776 bytes

L2 to Dcache bandwidth 351.638MB/sec 619552614464 bytes

========================================================================

Compare the number of calls and the time spent


Table 13: Load Balance across PE's

...

========================================================================

pe.2

------------------------------------------------------------------------

Time% 0.2%

Time 1312.932656

Calls 318135867

REQUESTS_TO_L2:DATA 20.391M/sec 19283776533 req

DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 14.358M/sec 13578864880 fills

DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 5.022M/sec 4749508915 fills

PAPI_L1_DCA 1616.882M/sec 1529092888549 refs

User time (approx) 945.705 secs 2175120420977 cycles

Cycles 945.705 secs 2175120420977 cycles

User time (approx) 945.705 secs 2175120420977 cycles

Utilization rate 72.0%

LD & ST per D1 miss 83.43 refs/miss

D1 cache hit ratio 98.8%

LD & ST per D2 miss 321.95 refs/miss

D2 cache hit ratio 74.1%

D1+D2 cache hit ratio 99.7%

Effective D1+D2 Reuse 5.03 refs/byte

System to D1 refill 5.022M/sec 4749508915 lines

System to D1 bandwidth 306.530MB/sec 303968570560 bytes

L2 to Dcache bandwidth 876.371MB/sec 869047352320 bytes

========================================================================

Compare the number of calls and the time spent
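The comparison asked for on the two Table 13 slides (pe.491 and pe.2) can be made concrete by dividing time by calls. This is a small Python sketch using only the Time and Calls values printed in the report. The helper name time_per_call_us is illustrative and not part of CrayPat.

```python
# Time and Calls copied from the two "Table 13" slides (pe.491 and pe.2).
pes = {
    "pe.491": {"time_s": 1733.719207, "calls": 23369392},
    "pe.2": {"time_s": 1312.932656, "calls": 318135867},
}


def time_per_call_us(time_s, calls):
    """Average time per call in microseconds."""
    return time_s / calls * 1e6


for name, d in pes.items():
    print(f"{name}: {d['calls']:>11d} calls, "
          f"{time_per_call_us(**d):8.2f} us/call")

# How the two processes compare with each other.
call_ratio = pes["pe.2"]["calls"] / pes["pe.491"]["calls"]
time_ratio = pes["pe.2"]["time_s"] / pes["pe.491"]["time_s"]
print(f"pe.2 / pe.491: {call_ratio:.1f}x the calls, {time_ratio:.2f}x the time")
```

pe.2 makes roughly 13.6 times as many calls as pe.491 but accumulates only about three quarters of its time, so the work per call differs strongly between the two processes.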


Table 15: Load Balance across PE's by Function

...

========================================================================

USER / PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS / pe.216

------------------------------------------------------------------------

Time% 0.0%

Time 342.454473

Calls 576

REQUESTS_TO_L2:DATA 1084579465 req

DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 1011292217 fills

DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 42841511 fills

PAPI_L1_DCA 177829252850 refs

User time (approx) 311763447500 cycles

User time (approx) 311763447500 cycles

========================================================================

...

========================================================================

USER / PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS / pe.361

------------------------------------------------------------------------

Time% 0.0%

Time 0.001128

Calls 576

REQUESTS_TO_L2:DATA 23918 req

DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 21906 fills

DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 1474 fills

PAPI_L1_DCA 1334202 refs

User time (approx) 0 cycles

User time (approx) 0 cycles

Why doesn't this routine employ all the processes?


Table 15: Load Balance across PE's by Function

...

========================================================================

USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS / pe.408

------------------------------------------------------------------------

Time% 0.0%

Time 181.970770

Calls 283115520

REQUESTS_TO_L2:DATA 1921736331 req

DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 1725999972 fills

DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 2667162 fills

PAPI_L1_DCA 299580285973 refs

User time (approx) 152982200000 cycles

User time (approx) 152982200000 cycles

========================================================================

USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS / pe.127

------------------------------------------------------------------------

Time 0.000000

========================================================================

USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS / pe.425

------------------------------------------------------------------------

Time 0.000000

========================================================================

Or this?


Table 20: HW Performance Counter Data

Experiment=1 / PE='HIDE'

========================================================================

Totals for program

------------------------------------------------------------------------

REQUESTS_TO_L2:DATA 12.130M/sec 16047201243 req

DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED 8.451M/sec 11180135091 fills

DATA_CACHE_REFILLS_FROM_SYSTEM:ALL 3.140M/sec 4153365739 fills

PAPI_L1_DCA 1520.227M/sec 2011113494425 refs

User time (approx) 1322.904 secs 3042678719117 cycles

Cycles 1322.904 secs 3042678719117 cycles

User time (approx) 1322.904 secs 3042678719117 cycles

Utilization rate 86.6%

LD & ST per D1 miss 131.16 refs/miss

D1 cache hit ratio 99.2%

LD & ST per D2 miss 484.21 refs/miss

D2 cache hit ratio 72.9%

D1+D2 cache hit ratio 99.8%

Effective D1+D2 Reuse 7.57 refs/byte

System to D1 refill 3.140M/sec 4153365739 lines

System to D1 bandwidth 191.625MB/sec 265815407319 bytes

L2 to Dcache bandwidth 515.821MB/sec 715528645846 bytes

========================================================================
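The derived lines of Table 20 can be reproduced from the four raw counters. The formulas in this Python sketch are inferred by matching the printed numbers, not taken from CrayPat documentation. The 64-byte cache line and the reading of "MB" as 2^20 bytes are likewise assumptions.

```python
# Raw counter totals copied from Table 20 above.
l1_dca = 2011113494425       # PAPI_L1_DCA: L1 data cache accesses (refs)
refills_l2 = 11180135091     # DATA_CACHE_REFILLS:L2_* (fills)
refills_sys = 4153365739     # DATA_CACHE_REFILLS_FROM_SYSTEM:ALL (fills)
user_time_s = 1322.904       # User time (approx)

LINE_BYTES = 64              # assumed cache line size in bytes
MIB = 2**20                  # assumed meaning of "MB" in the report

# Assumption: every D1 miss is refilled either from L2 or from the system.
d1_misses = refills_l2 + refills_sys

refs_per_d1_miss = l1_dca / d1_misses            # table: 131.16
refs_per_d2_miss = l1_dca / refills_sys          # table: 484.21
d1_hit = 1 - d1_misses / l1_dca                  # table: 99.2%
d2_hit = 1 - refills_sys / d1_misses             # table: 72.9%
d1_d2_hit = 1 - refills_sys / l1_dca             # table: 99.8%
reuse = refs_per_d2_miss / LINE_BYTES            # table: 7.57 refs/byte
sys_bw = refills_sys * LINE_BYTES / MIB / user_time_s   # table: 191.625 MB/sec
l2_bw = refills_l2 * LINE_BYTES / MIB / user_time_s     # table: 515.821 MB/sec

print(f"LD & ST per D1 miss   {refs_per_d1_miss:8.2f} refs/miss")
print(f"D1 cache hit ratio    {d1_hit:8.1%}")
print(f"LD & ST per D2 miss   {refs_per_d2_miss:8.2f} refs/miss")
print(f"D2 cache hit ratio    {d2_hit:8.1%}")
print(f"D1+D2 cache hit ratio {d1_d2_hit:8.1%}")
print(f"Effective D1+D2 reuse {reuse:8.2f} refs/byte")
print(f"System to D1 bandwidth {sys_bw:8.3f} MB/sec")
print(f"L2 to Dcache bandwidth {l2_bw:8.3f} MB/sec")
```

The same arithmetic can be applied to the per-PE counter blocks in Table 13 to compare cache behaviour between processes.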


Table 21: Sent Message Stats and Suggested MPI Rank Order

Sent Msg Total Bytes per MPI rank

         Max           Avg           Min    Max    Min
 Total Bytes   Total Bytes   Total Bytes   Rank   Rank
288785211952   55382285231   43823619696      8    511
------------------------------------------------------------
Dual core: Sent Msg Total Bytes per node

Rank             Max            Avg            Min   Max Node   Min Node
Order    Total Bytes    Total Bytes    Total Bytes   Ranks      Ranks
d       332608831648   110764570462    91978303568   8,511      178,269
u       332608831648   110764570462    91978303568   8,511      178,269
2       332753774320   110764570462    91702010072   503,8      409,102
0       335218166984   110764570462    89167505112   8,264      255,511
1       573867407720   110764570462    88992394184   8,9        502,503
------------------------------------------------------------
Quad core: Sent Msg Total Bytes per node

Rank              Max            Avg            Min   Max Node        Min Node
Order     Total Bytes    Total Bytes    Total Bytes   Ranks           Ranks
d        424587135216   221529140924   184022706744   8,511,178,269   344,209,68,473
u        424587135216   221529140924   184022706744   8,511,178,269   344,209,68,473
2        662859801904   221529140924   183540043808   502,9,503,8     374,137,375,136
0        665630542072   221529140924   181038379488   8,264,9,265     246,502,247,503
1       1145872706976   221529140924   178161802184   8,9,10,11       500,501,502,503

According to CrayPat, the default SMP-like placement of ranks is the worst choice

However, this custom placement of ranks did not do much in practice
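One way to read Table 21 is to rank the suggested orders by the most heavily loaded node. The Python sketch below uses only the dual-core Max and Avg values from the table. The max/avg ratio is an illustrative imbalance measure, not a CrayPat metric. Treating order 1 as the SMP-like placement assumes the labels follow the MPICH_RANK_REORDER_METHOD numbering.

```python
# "Max Total Bytes" per node for each rank order, copied from the
# dual-core part of Table 21; the average is the same for every order.
avg_bytes = 110764570462
max_bytes = {
    "d": 332608831648,
    "u": 332608831648,
    "2": 332753774320,
    "0": 335218166984,
    "1": 573867407720,
}


def imbalance(max_b, avg_b):
    """Ratio of the busiest node's sent bytes to the average (1.0 = balanced)."""
    return max_b / avg_b


# Sort the orders from the lowest to the highest maximum per-node traffic.
ranked = sorted(max_bytes, key=max_bytes.get)
for order in ranked:
    print(f"order {order}: max/avg = {imbalance(max_bytes[order], avg_bytes):.2f}")
```

Order 1 ends up last with a max/avg ratio of about 5.2, against about 3.0 for the suggested orders d and u. This only measures the bytes sent, which may be one reason why the reordering changed little in the actual run.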


Visualization with Apprentice2

The previous statistics can be viewed graphically with Apprentice2
• pat_report version 4.2 produces the .ap2 file; with older versions, you get it by passing the -f ap2 switch to pat_report
• Just launch it with the app2 command


You will get more info by holding the mouse cursor over a slice. Clicking a slice will show the load balance of that function


Smallest, average and largest individual time


The same profile as a list; clicking a function shows its HW counter data


HW counter overview. This would be useful if cache miss or cycle stall counters had been recorded.


Routine call flow window (who calls whom) and how the execution time is divided


The largest execution time on the left, smallest on the right


Final remarks on CP2K analysis

Load balance is a major concern, and the largest optimization effort should be put there
• Performance analysis shows in which routines the most severe imbalances occur, so we can start looking into the issue from there

Messages are usually not too small, so there is no need for tedious aggregation

I/O seems to be written efficiently

Single core efficiency should be investigated in the most intensive routines
• L1 and L2 cache access could be improved
• We would also need other HW counter data for real optimization work, e.g. SSE instruction utilization and memory stalls

