11/17/02
1
PAPI and Dynaprof
Application Signatures and Performance Analysis of Scientific Applications
Philip J. Mucci
Innovative Computing Laboratory, UTK
Performance Evaluation Research Center, LBL
[email protected]
http://icl.cs.utk.edu/~mucci/dynaprof/snapshots/sc2002.ppt
Goals
● Understanding the behavior of the application
  – Identification of bottlenecks.
  – Usage of the hardware resources.
  – Effects of that usage on performance.
● Using Dynaprof to achieve that goal
  – Command line usage
  – 3 Dynaprof probes
    ● Wallclock time
    ● Hardware performance counters
    ● Resource usage traces
Motivation
● Optimize the application's performance.
● Evaluate the algorithm's efficiency.
● Generate an application signature.
  – A collection of data that represents the major terms in the performance model.
● Develop a performance model.
Overview of Hardware Counters
● Data is NOT PORTABLE, but PAPI is...
● Small number of registers dedicated for performance monitoring functions.
  – AMD Athlon, 4 counters
  – Pentium <= III, 2 counters
  – Pentium IV, 18 counters
  – IA64, 4 counters
  – Alpha 21x64, 2 counters
  – Power 3, 8 counters
  – Power 4, 8 counters to a group
  – UltraSparc II, 2 counters
  – MIPS R14K, 2 counters
Applications used in this Tutorial
● Serial:
  – FSPX: A binary alloy solidification benchmark.
  – SWIM: The SPEC shallow water benchmark.
● Parallel (MPI):
  – Ex19 from the PETSc distribution.
  – Solves the nonlinear driven cavity problem with multigrid: a 2D driven cavity problem solved in a velocity-vorticity formulation.
FSPX Execution Environment
● Intel PIII, 1.2 GHz
  – FP Results/Clock: 1 (1.2 Gflips)
    ● 4 SP/clk with SSE, 2 DP/clk with SSE2
  – Caches: 16K/16K, 256K
● G77 version 2.96
  -g -O -malign-double -mpentiumpro -funroll-loops -fexpensive-optimizations
● Execution time:
> /bin/time fspx
115.370u 0.030s 1:58.17 97.6% 0+0k 0+0io 162pf+0w
swim Execution Environment
● IBM Nighthawk, 16-way Power 3, 375 MHz
  – FP Results/Clock: 4 (1.5 Gflips)
  – Caches: 32K/64K, 8MB
  – MPI over TCP/IP via switch
● Xlc 5.0.2.1 built with -g -O3 -qstrict -qarch=pwr3 -qtune=pwr3
● Execution time:
> /bin/time poe swim -procs 2
0.4u 0.0s 0:15 3% 217+3933k 0+0io 1pf+0w
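The peak rates quoted on the execution environment slides appear to follow from clock frequency times floating point results per clock. The sketch below redoes that arithmetic for the two machines; the function name `peak_giga_results` is made up for illustration, and the assumption is simply that peak = clock x results/clock.

```python
def peak_giga_results(clock_mhz: float, fp_results_per_clock: int) -> float:
    """Theoretical peak, in billions of floating point results per second."""
    return clock_mhz * 1e6 * fp_results_per_clock / 1e9


# Clock rates and results/clock are taken from the slides.
machines = {
    "Intel PIII, 1.2 GHz": (1200.0, 1),
    "IBM Power 3, 375 MHz": (375.0, 4),
}

for name, (mhz, per_clock) in machines.items():
    print(f"{name}: {peak_giga_results(mhz, per_clock):.1f} G results/s")
```

Comparing a measured rate against this ceiling is one way to judge how much of the hardware an application actually uses.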
ex19 Execution Environment
● IBM Nighthawk, 16-way Power 3, 375 MHz
  – FP Results/Clock: 4 (1.5 Gflips)
  – Caches: 32K/64K, 8MB
● Xlc 5.0.2.1 built with -g
● Execution time:
> /bin/time poe ex19 -procs 2 -da_grid_x 56 -da_grid_y 56
0.520u 0.200s 0:44.18 1.6% 297+3580k 0+0io 0pf+0w
Gprof
● Gathers timer interrupts vs. text address.
● Recompile with the -pg option.
● Gprof profile is useful for a high-level overview.
● Does it tell us why?
Gprof Profile of FSPX
%time  cumulative  self   calls      ms/call  tot/call  name
21.71  18.93       18.93  6080       3.11     3.11      flux_
19.99  36.36       17.43  9124       1.91     3.91      proflux_
 8.26  43.56        7.20  6080       1.18     1.18      pde_
 8.11  50.63        7.07  6080       1.16     4.17      phase_
 7.96  57.57        6.94  100061386  0.00     0.00      cplintg_
 7.46  64.08        6.51  100061388  0.00     0.00      cpsintg_
 6.05  69.36        5.28  49807360   0.00     0.00      tsofx_
 5.60  74.24        4.88  49807362   0.00     0.00      tlofx_
 4.07  77.79        3.55  62202877   0.00     0.00      cpl_
 2.44  79.92        2.13  37371906   0.00     0.00      cps_
 1.67  81.38        1.46  37371904   0.00     0.00      hl_
 1.43  82.63        1.25  37371904   0.00     0.00      hs_
 1.07  83.56        0.93  24903681   0.00     0.00      elqds_
 0.89  84.34        0.78  37371904   0.00     0.00      aks_
FSPX: Top 4 functions
● Top 4 functions make up about 58% of execution time
● In module update.F
  – flux
  – proflux
  – pde
● In module phase.F
  – phase
● Use the list command to explore modules and functions
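The share of run time attributed to the top functions can be checked directly from the %time column of the gprof flat profile. This is a minimal sketch that hard-codes the first eight rows of the FSPX profile; `top_share` is a hypothetical helper, not part of gprof.

```python
# (function, percent of self time) rows from the gprof flat profile of FSPX.
flat_profile = [
    ("flux_", 21.71), ("proflux_", 19.99), ("pde_", 8.26), ("phase_", 8.11),
    ("cplintg_", 7.96), ("cpsintg_", 7.46), ("tsofx_", 6.05), ("tlofx_", 5.60),
]


def top_share(rows, n):
    """Sum the percent-of-time column for the n most expensive functions."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return sum(pct for _, pct in ranked[:n])


print(f"top 4 functions: {top_share(flat_profile, 4):.2f}% of run time")
```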
Gprof Profile of SWIM
 %     cumulative  self
 time  seconds     seconds  name
 37.3  3.22        3.22     .calc2 [1]
 33.4  6.10        2.88     .calc1 [2]
 24.7  8.23        2.13     .calc3 [3]
  1.3  8.34        0.11     .kickpipes [4]
  1.0  8.43        0.09     .inital [5]
Gprof Profile of ex19
 %     cumulative  self   name
 70.6  22.57       22.57  .MatLUFactorNumeric_SeqAIJ_Inode
  6.4  24.61        2.04  .MatFDColoringCreate_MPIAIJ [2]
  5.2  26.26        1.65  .MatSetValues_MPIAIJ [3]
  3.4  27.35        1.09  .MatLUFactorSymbolic_SeqAIJ [4]
  2.3  28.09        0.74  .MatSolve_SeqAIJ_Inode [5]
  2.3  28.82        0.73  .FormFunctionLocal [6]
  1.7  29.35        0.53  .memset [7]
  1.2  29.74        0.39  .MatSetValues [8]
  0.9  30.02        0.28  .MatFDColoringApply [9]
  0.7  30.24        0.22  .kickpipes [10]
Dynaprof Environment Variables
● LD_LIBRARY_PATH: Colon-separated list of directories to search for shared libraries. We need to find:
  – DynInst library
  – PAPI library
  – Any dependencies of the above (libperfctr.so, libcpc.so)
● DYNINSTAPI_RT_LIB: Full pathname of the DynInst runtime library.
● No settings necessary for the AIX/DPCL port.
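Before starting Dynaprof it can help to confirm that both variables are usable. The following is a rough pre-flight check, not part of Dynaprof itself: the helper names are invented, and the library file names passed in at the bottom (`libdyninstAPI.so`, `libpapi.so`) are assumptions that should be adjusted to the local installation.

```python
import os


def find_library(name, search_path):
    """Return the first directory in a colon-separated path that holds `name`."""
    for directory in search_path.split(":"):
        if directory and os.path.isfile(os.path.join(directory, name)):
            return directory
    return None


def check_environment(required_libs, env=os.environ):
    """Collect human-readable problems with the Dynaprof environment."""
    problems = []
    search_path = env.get("LD_LIBRARY_PATH", "")
    for lib in required_libs:
        if find_library(lib, search_path) is None:
            problems.append(f"{lib} not found on LD_LIBRARY_PATH")
    rt_lib = env.get("DYNINSTAPI_RT_LIB")
    if not rt_lib:
        problems.append("DYNINSTAPI_RT_LIB is not set")
    elif not os.path.isfile(rt_lib):
        problems.append(f"DYNINSTAPI_RT_LIB points to a missing file: {rt_lib}")
    return problems


if __name__ == "__main__":
    # Library file names are assumptions; adjust them to the local install.
    for problem in check_environment(["libdyninstAPI.so", "libpapi.so"]):
        print("warning:", problem)
```

On the AIX/DPCL port this check would not apply, since no settings are needed there.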
Running Dynaprof
● Usage:
dynaprof [-d] [serial_application]
● -d enables debugging output
● Specifying an application automatically loads it into the tool immediately after initialization.
Command Line Interface
● Uses GNU Readline library for input
● Full featured Command Line Editing
  – File and command completion: <Tab>
  – History: <Up>/<Down>
● Settings, macros and aliases in ~/.inputrc
● Allows Emacs or VI style bindings
  – set editing-mode emacs
  – set editing-mode vi
● See man page, TexInfo file or home page.
Load command
● Starts the application and stops it at the first instruction.
● Usage:
load <application> [args]
> dynaprof
(dynaprof) load tests/fspx
Poeload command
● For use with MPI applications on AIX and DPCL.
  – DPCL < 3.2.5 requires full path
● Usage:
poeload <application> [args]
(dynaprof) poeload tests/swim -procs 2
Mpiload command
● For use with MPI applications.
● Stops the application after it calls PMPI_Init().
● Mostly useful for script driven execution of MPI jobs
● Usage:
mpiload <application> [args]
(dynaprof) mpiload tests/mpicount
Attach command
● Attaches to a running application (or poe process) and stops it.
● Usage:
attach <application> <pid>
(dynaprof) ^Z
> tests/fspx &
[2] 17500
> fg
(dynaprof) attach tests/fspx 17500
Poeattach Command
● For use with MPI applications on AIX and DPCL.
  – DPCL < 3.2.5 requires full path
● Usage:
poeattach <application> <pid_of_poe>
(dynaprof) ^Z
> poe ex19 -da_grid_x 56 -da_grid_y 56 -procs 2 &
[2] 17500
> fg
(dynaprof) poeattach ex19 17500
List command
● list
  – List all modules in process
● list <pattern>
  – List all matching modules
● list <module>
  – List all functions in module
● list <module> <pattern>
  – List all matching functions in module
● list <module> <function>
  – List instrumentable points in function
Exploring FSPX
(dynaprof) list
DEFAULT_MODULE
eos.F
phase.F
setup.F
supmain.F
io.F
properties.F
solveT.F
update.F
libm.so.6
libc.so.6
● G77's Fortran runtime support. Code compiled with g77 without -g ends up in the DEFAULT_MODULE.
● Application code
● Shared libraries
Exploring FSPX 2
(dynaprof) list DEFAULT_MODULE
call_gmon_start
fini_dummy
copy
ap_end
op_gen
gt_num
f_s
ne_d
e_d
i_tem
f_list
type_f
l_R
rd_count
● G77's Fortran runtime support. Code compiled with g77 without -g ends up in the DEFAULT_MODULE.
Exploring FSPX 3
(dynaprof) list phase.F
phase_
(dynaprof) list update.F
proflux_
flux_
pde_
(dynaprof) list phase.F phase_
Entry
Call tsofx_
Call tlofx_
Call eslds_
Call elqds_
Call tinsol_
Call s_wsle
Call do_lio
Call do_lio
Call do_lio
● Function calls
Use command
● Loads a probe shared library into address space
(dynaprof) use [probe [args]]
● Use by itself displays current probe.
● To change options, respecify probe.
● 4 probes in this release
  – Wallclock: Real time clock
  – PAPI: Hardware metrics
  – Perfometer: Real-time visualization of streaming hardware metrics
Instr command
● instr
  – List all instrumented functions
● instr module <pattern> [arg]
  – Instrument all functions in modules matching pattern
● instr function <module> <pattern> [arg]
  – Instrument all functions matching pattern in module
Threads and Dynaprof Probes
● For threaded code, use the same probe!
● Dynaprof detects threads and loads a special version of the probe library.
● Each probe specifies what to do when a new thread is discovered.
● Each thread gets the same instrumentation.
Probe Warning
● Instrumentation is not free.
● Consider granularity of region being measured.
● Overhead for PAPI 2.3 is O(100) cycles.
  – Between 500 and 2000 cycles for a 2 counter read.
● Overhead for Wallclock is O(100) cycles.
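To make the granularity warning concrete, the sketch below estimates how probe cost compares with the cycles measured for a coarse and a fine-grained region. It uses call counts and cycle totals from the FSPX PAPI_TOT_CYC call tree in this deck and the 500-cycle lower bound quoted above. Whether probe cost is actually included in the reported totals is not stated in the slides, so treat this as a back-of-the-envelope ratio; the function name is illustrative.

```python
def probe_overhead_ratio(calls, cycles_per_probe, measured_cycles):
    """Ratio of estimated probe cycles to the cycles reported for a region."""
    return calls * cycles_per_probe / measured_cycles


LOW_ESTIMATE = 500  # cycles, lower bound quoted for a 2 counter read

# proflux_: 9124 calls, 2.876e11 inclusive cycles (coarse-grained region).
coarse = probe_overhead_ratio(9124, LOW_ESTIMATE, 2.876e11)
# cpl_ called from proflux_: 3.737e07 calls, 2.542e10 cycles (fine-grained).
fine = probe_overhead_ratio(3.737e07, LOW_ESTIMATE, 2.542e10)

print(f"proflux_: {coarse:.6%} of measured cycles")
print(f"cpl_:     {fine:.1%} of measured cycles")
```

Under these assumptions the probe cost is negligible for a routine called a few thousand times, but comparable to the measurement itself for a routine called tens of millions of times.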
Wallclock Probe
● High resolution, low latency timer
● Usage:
use wallclockprobe
● Reports time in microseconds, 1.0 x 10^-6 s.
PAPI Probe
● Count PAPI Presets or Native Events
● Usage:
use papiprobe [event,event,...]
● Default argument is PAPI_FP_INS, or PAPI_TOT_INS if the architecture doesn't support PAPI_FP_INS.
● Available events can be obtained by using:
papi_avail -a
PAPI Probe and Multiplexing
● Requesting more metrics than there are physical counters automatically enables multiplexing.
● Minimum runtime of instrumented regions must be observed, such that all virtual counters get a chance to run at least once.
run-time_min = num_events * 0.01 s
● Automatic warning functionality is being rolled into PAPI.
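The multiplexing rule can be applied mechanically. This sketch combines the counter counts from the "Overview of Hardware Counters" slide with the minimum-runtime formula above. It makes the simplifying assumption that each requested event needs one physical counter, which may not hold for derived events; the function and constant names are illustrative.

```python
# Physical counter counts as listed on the hardware counters overview slide.
PHYSICAL_COUNTERS = {
    "AMD Athlon": 4, "Pentium <= III": 2, "Pentium IV": 18, "IA64": 4,
    "Alpha 21x64": 2, "Power 3": 8, "UltraSparc II": 2, "MIPS R14K": 2,
}
SLICE_SECONDS = 0.01  # per-event time slice from the minimum-runtime formula


def needs_multiplexing(platform, num_events):
    """True when more events are requested than the platform has counters."""
    return num_events > PHYSICAL_COUNTERS[platform]


def min_runtime_seconds(num_events):
    """Shortest region runtime that lets every virtual counter run once."""
    return num_events * SLICE_SECONDS


events = 4
if needs_multiplexing("Pentium <= III", events):
    print(f"multiplexed; regions should run >= {min_runtime_seconds(events):.2f} s")
```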
PAPI Native Events
● Look in the PAPI distribution
● See the README file for your architecture in the src directory
● See the example program tests/native.c in the src/tests directory
Power 3 Events
PAPI_L1_DCM   Yes  Level 1 data cache misses (PM_LD_MISS_L1,PM_ST_L1MISS)
PAPI_L1_ICM   No   Level 1 instruction cache misses (PM_IC_MISS)
PAPI_L1_TCM   Yes  Level 1 cache misses (PM_IC_MISS,PM_LD_MISS_L1,PM_ST_L1MISS)
PAPI_CA_SNP   No   Requests for a snoop (PM_SNOOP)
PAPI_CA_SHR   No   Requests for exclusive access to shared cache line (PM_SNOOP_E_TO_S)
PAPI_CA_ITV   No   Requests for cache line intervention (PM_SNOOP_PUSH_INT)
PAPI_BRU_IDL  No   Cycles branch units are idle (PM_BRU_IDLE)
PAPI_FXU_IDL  No   Cycles integer units are idle (PM_FXU_IDLE)
PAPI_FPU_IDL  No   Cycles floating point units are idle (PM_FPU_IDLE)
PAPI_LSU_IDL  No   Cycles load/store units are idle (PM_LSU_IDLE)
PAPI_TLB_TL   No   Total translation lookaside buffer misses (PM_TLB_MISS)
PAPI_L1_LDM   No   Level 1 load misses (PM_LD_MISS_L1)
PAPI_L1_STM   No   Level 1 store misses (PM_ST_L1MISS)
PAPI_L2_LDM   No   Level 2 load misses (PM_LD_MISS_EXCEED_L2)
PAPI_L2_STM   No   Level 2 store misses (PM_ST_MISS_EXCEED_L2)
PAPI_BTAC_M   No   Branch target address cache misses (PM_BTAC_MISS)
PAPI_PRF_DM   No   Data prefetch cache misses (PM_PREF_MATCH_DEM_MISS)
PAPI_TLB_SD   No   Translation lookaside buffer shootdowns (PM_TLBSYNC_RERUN)
PAPI_CSR_FAL  No   Failed store conditional instructions (PM_ST_COND_FAIL)
PAPI_CSR_SUC  No   Successful store conditional instructions (PM_RESRV_CMPL)
PAPI_CSR_TOT  No   Total store conditional instructions (PM_RESRV_RQ)
PAPI_MEM_SCY  Yes  Cycles Stalled Waiting for memory accesses (PM_CMPLU_WT_LD,PM_CMPLU_WT_ST)
PAPI_MEM_RCY  No   Cycles Stalled Waiting for memory Reads (PM_CMPLU_WT_LD)
PAPI_MEM_WCY  No   Cycles Stalled Waiting for memory writes (PM_CMPLU_WT_ST)
PAPI_STL_ICY  No   Cycles with no instruction issue (PM_0INST_DISP)
PAPI_STL_CCY  No   Cycles with no instructions completed (PM_0INST_CMPL)
PAPI_BR_CN    No   Conditional branch instructions (PM_CBR_DISP)
PAPI_BR_MSP   No   Conditional branch instructions mispredicted (PM_MPRED_BR_CAUSED_GC)
PAPI_BR_PRC   No   Conditional branch instructions correctly predicted (PM_BR_PRED)
Power 3 Events 2
PAPI_FMA_INS  No   FMA instructions completed (PM_EXEC_FMA)
PAPI_TOT_IIS  No   Instructions issued (PM_INST_DISP)
PAPI_TOT_INS  No   Instructions completed (PM_INST_CMPL)
PAPI_INT_INS  Yes  Integer instructions (PM_FXU0_PROD_RESULT,PM_FXU1_PROD_RESULT,PM_FXU2_PROD_RESULT)
PAPI_FP_INS   Yes  Floating point instructions (PM_FPU0_CMPL,PM_FPU1_CMPL)
PAPI_LD_INS   No   Load instructions (PM_LD_CMPL)
PAPI_SR_INS   No   Store instructions (PM_ST_CMPL)
PAPI_BR_INS   No   Branch instructions (PM_BR_CMPL)
PAPI_FLOPS    Yes  Floating point instructions per second (PM_CYC,PM_FPU0_CMPL,PM_FPU1_CMPL)
PAPI_TOT_CYC  No   Total cycles (PM_CYC)
PAPI_IPS      Yes  Instructions per second (PM_CYC,PM_INST_CMPL)
PAPI_LST_INS  Yes  Load/store instructions completed (PM_LD_CMPL,PM_ST_CMPL)
PAPI_SYC_INS  No   Synchronization instructions completed (PM_SYNC)
PAPI_FDV_INS  No   Floating point divide instructions (PM_FPU_FDIV)
PAPI_FSQ_INS  No   Floating point square root instructions (PM_FPU_FSQRT)
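Rows marked Yes list more than one native event, which suggests the preset is computed from several counters. As an illustration only, the sketch below shows one plausible way a rate such as PAPI_FLOPS could be formed from PM_CYC, PM_FPU0_CMPL and PM_FPU1_CMPL. The combining formula is an assumption rather than PAPI's documented implementation, and the counter readings are hypothetical.

```python
CLOCK_HZ = 375e6  # Power 3 clock rate from the execution environment slides


def derived_flops(pm_cyc, pm_fpu0_cmpl, pm_fpu1_cmpl, clock_hz=CLOCK_HZ):
    """Floating point instructions per second from three raw counter values."""
    elapsed_seconds = pm_cyc / clock_hz
    return (pm_fpu0_cmpl + pm_fpu1_cmpl) / elapsed_seconds


# Hypothetical counter readings, not measurements from this deck.
rate = derived_flops(pm_cyc=3.75e8, pm_fpu0_cmpl=1.0e8, pm_fpu1_cmpl=5.0e7)
print(f"{rate / 1e6:.0f} M FP instructions/s")
```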
Power 4 Events
PAPI_L1_DCM   Yes  Level 1 data cache misses (PM_LD_MISS_L1,PM_ST_MISS_L1)
PAPI_FXU_IDL  No   Cycles integer units are idle (PM_FXU_IDLE)
PAPI_TLB_DM   No   Data translation lookaside buffer misses (PM_DTLB_MISS)
PAPI_TLB_IM   No   Instruction translation lookaside buffer misses (PM_ITLB_MISS)
PAPI_TLB_TL   Yes  Total translation lookaside buffer misses (PM_DTLB_MISS,PM_ITLB_MISS)
PAPI_L1_LDM   No   Level 1 load misses (PM_LD_MISS_L1)
PAPI_L1_STM   No   Level 1 store misses (PM_ST_MISS_L1)
PAPI_STL_ICY  No   Cycles with no instruction issue (PM_0INST_FETCH)
PAPI_HW_INT   No   Hardware interrupts (PM_EXT_INT)
PAPI_FMA_INS  No   FMA instructions completed (PM_FPU_FMA)
PAPI_TOT_IIS  No   Instructions issued (PM_INST_DISP)
PAPI_TOT_INS  No   Instructions completed (PM_INST_CMPL)
PAPI_INT_INS  No   Integer instructions (PM_FXU_FIN)
PAPI_FP_INS   No   Floating point instructions (PM_FPU_FIN)
PAPI_FLOPS    Yes  Floating point instructions per second (PM_CYC,PM_FPU_FIN)
PAPI_TOT_CYC  No   Total cycles (PM_CYC)
PAPI_IPS      Yes  Instructions per second (PM_CYC,PM_INST_CMPL)
PAPI_L1_DCA   Yes  Level 1 data cache accesses (PM_LD_REF_L1,PM_ST_REF_L1)
PAPI_L1_DCR   No   Level 1 data cache reads (PM_LD_REF_L1)
PAPI_L1_DCW   No   Level 1 data cache writes (PM_ST_REF_L1)
PAPI_FDV_INS  No   Floating point divide instructions (PM_FPU_FDIV)
PAPI_FSQ_INS  No   Floating point square root instructions (PM_FPU_FSQRT)
Pentium III Events
PAPI_L1_DCM   No   Level 1 data cache misses (0x45,0x45)
PAPI_L1_ICM   No   Level 1 instruction cache misses (0xf28,0xf28)
PAPI_L2_ICM   No   Level 2 instruction cache misses (0x68,0x68)
PAPI_L1_TCM   No   Level 1 cache misses (0xf2e,0xf2e)
PAPI_L2_TCM   No   Level 2 cache misses (0x24,0x24)
PAPI_CA_SHR   No   Requests for exclusive access to shared cache line (0x22e,0x22e)
PAPI_CA_CLN   No   Requests for exclusive access to clean cache line (0x66,0x66)
PAPI_CA_INV   No   Requests for cache line invalidation (0x69,0x69)
PAPI_CA_ITV   No   Requests for cache line intervention (0x4007b,0x4007b)
PAPI_TLB_IM   No   Instruction translation lookaside buffer misses (0x85,0x85)
PAPI_L1_LDM   No   Level 1 load misses (0xf29,0xf29)
PAPI_L1_STM   No   Level 1 store misses (0xf2a,0xf2a)
PAPI_L2_LDM   Yes  Level 2 load misses (0x24,0x25)
PAPI_L2_STM   No   Level 2 store misses (0x25,0x25)
PAPI_BTAC_M   No   Branch target address cache misses (0xe2,0xe2)
PAPI_HW_INT   No   Hardware interrupts (0xc8,0xc8)
PAPI_BR_CN    No   Conditional branch instructions (0xc4,0xc4)
PAPI_BR_TKN   No   Conditional branch instructions taken (0xc9,0xc9)
PAPI_BR_NTK   Yes  Conditional branch instructions not taken (0xc4,0xc9)
PAPI_BR_MSP   No   Conditional branch instructions mispredicted (0xc5,0xc5)
PAPI_BR_PRC   Yes  Conditional branch instructions correctly predicted (0xc4,0xc5)
PAPI_TOT_IIS  No   Instructions issued (0xd0,0xd0)
PAPI_TOT_INS  No   Instructions completed (0xc0,0xc0)
PAPI_FP_INS   No   Floating point instructions (0xc1,0x0)
PAPI_BR_INS   No   Branch instructions (0xc4,0xc4)
PAPI_VEC_INS  No   Vector/SIMD instructions (0xb0,0xb0)
PAPI_FLOPS    Yes  Floating point instructions per second (0xc1,0x79)
Intel Pentium IV Events
PAPI_L1_DCM   No   Level 1 data cache misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L2_DCM   No   Level 2 data cache misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L1_LDM   No   Level 1 load misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L1_STM   No   Level 1 store misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L2_LDM   No   Level 2 load misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L2_STM   No   Level 2 store misses (0x0003b000/0x12000204@0x8000000c)
PAPI_TOT_INS  No   Instructions completed (0x00039000/0x04000204@0x8000000c)
PAPI_FP_INS   No   Floating point instructions (0x0003b000/0x18000204@0x8000000c 0x00033000/0x09000034@0x80000008)
PAPI_TOT_CYC  No   Total cycles (0x00ff9000/0x7e000004@0x8000000d)
(Arguments to perfex -e from PerfCtr distribution)
Sun UltraSparc II Events
PAPI_L1_ICM   Yes  Level 1 instruction cache misses (0x8,0x8)
PAPI_L2_TCM   Yes  Level 2 cache misses (0xc,0xc)
PAPI_CA_SNP   No   Requests for a snoop (-1,0xe)
PAPI_CA_INV   No   Requests for cache line invalidation (0xe,-1)
PAPI_L1_LDM   Yes  Level 1 load misses (0x9,0x9)
PAPI_L1_STM   Yes  Level 1 store misses (0xa,0xa)
PAPI_BR_MSP   No   Conditional branch instructions mispredicted (-1,0x2)
PAPI_TOT_IIS  No   Instructions issued (-1,0x1)
PAPI_TOT_INS  No   Instructions completed (-1,0x1)
PAPI_LD_INS   No   Load instructions (0x9,-1)
PAPI_SR_INS   No   Store instructions (0xa,-1)
PAPI_TOT_CYC  No   Total cycles (0x0,0x0)
PAPI_IPS      Yes  Instructions per second (0x0,0x1)
PAPI_L1_DCR   No   Level 1 data cache reads (0x9,-1)
PAPI_L1_DCW   No   Level 1 data cache writes (0xa,-1)
PAPI_L1_ICH   No   Level 1 instruction cache hits (-1,0x8)
PAPI_L2_ICH   No   Level 2 instruction cache hits (-1,0xf)
PAPI_L1_ICA   No   Level 1 instruction cache accesses (0x8,-1)
PAPI_L2_TCH   No   Level 2 total cache hits (-1,0xc)
PAPI_L2_TCA   No   Level 2 total cache accesses (0xc,-1)
Sun UltraSparc III Events
PAPI_L1_ICM   No   Level 1 instruction cache misses (-1,0x8)
PAPI_L2_ICM   No   Level 2 instruction cache misses (-1,0xf)
PAPI_L2_TCM   No   Level 2 cache misses (-1,0xc)
PAPI_TLB_DM   No   Data translation lookaside buffer misses (-1,0x12)
PAPI_TLB_IM   No   Instruction translation lookaside buffer misses (-1,0x11)
PAPI_L1_LDM   No   Level 1 load misses (-1,0x9)
PAPI_L1_STM   No   Level 1 store misses (-1,0xa)
PAPI_BR_MSP   No   Conditional branch instructions mispredicted (-1,0x2)
PAPI_TOT_IIS  No   Instructions issued (0x1,0x1)
PAPI_TOT_INS  No   Instructions completed (0x1,0x1)
PAPI_FP_INS   Yes  Floating point instructions (0x18,0x27)
PAPI_TOT_CYC  No   Total cycles (0x0,0x0)
PAPI_IPS      Yes  Instructions per second (0x0,0x1)
PAPI_L1_DCR   No   Level 1 data cache reads (0x9,-1)
PAPI_L1_DCW   No   Level 1 data cache writes (0xa,-1)
PAPI_L1_ICH   No   Level 1 instruction cache hits (0x8,-1)
PAPI_L1_ICA   Yes  Level 1 instruction cache accesses (0x8,0x8)
PAPI_L2_TCH   Yes  Level 2 total cache hits (0xc,0xc)
PAPI_L2_TCA   No   Level 2 total cache accesses (0xc,-1)
PAPI_FML_INS  No   Floating point multiply instructions (-1,0x27)
PAPI_FAD_INS  No   Floating point add instructions (0x18,-1)
MIPS R12K Events
PAPI_L1_DCM   No   Level 1 data cache misses (25)
PAPI_L1_ICM   No   Level 1 instruction cache misses (9)
PAPI_L2_DCM   No   Level 2 data cache misses (26)
PAPI_L2_ICM   No   Level 2 instruction cache misses (10)
PAPI_L1_TCM   Yes  Level 1 cache misses (9,25)
PAPI_L2_TCM   Yes  Level 2 cache misses (10,26)
PAPI_CA_SHR   No   Requests for exclusive access to shared cache line (31)
PAPI_CA_INV   No   Requests for cache line invalidation (13)
PAPI_CA_ITV   No   Requests for cache line intervention (12)
PAPI_TLB_TL   No   Total translation lookaside buffer misses (23)
PAPI_PRF_DM   No   Data prefetch cache misses (17)
PAPI_CSR_FAL  No   Failed store conditional instructions (5)
PAPI_CSR_SUC  Yes  Successful store conditional instructions (20,5)
PAPI_CSR_TOT  No   Total store conditional instructions (20)
PAPI_BR_CN    No   Conditional branch instructions (6)
PAPI_BR_MSP   No   Conditional branch instructions mispredicted (24)
PAPI_BR_PRC   Yes  Conditional branch instructions correctly predicted (6,24)
PAPI_TOT_IIS  No   Instructions issued (1)
PAPI_TOT_INS  No   Instructions completed (15)
PAPI_FP_INS   No   Floating point instructions (21)
PAPI_LD_INS   No   Load instructions (18)
PAPI_SR_INS   No   Store instructions (19)
PAPI_FLOPS    Yes  Floating point instructions per second (0,21)
PAPI_TOT_CYC  No   Total cycles (0)
PAPI_IPS      Yes  Instructions per second (0,15)
PAPI_LST_INS  Yes  Load/store instructions completed (18,19)
Alpha/DADD 21264 Events
PAPI_L1_ICM   No   Level 1 instruction cache misses (0x3)
PAPI_L2_TCM   No   Level 2 cache misses (0x1)
PAPI_TLB_DM   No   Data translation lookaside buffer misses (0x2)
PAPI_BR_UCN   No   Unconditional branch instructions (0x15)
PAPI_BR_CN    No   Conditional branch instructions (0x16)
PAPI_BR_NTK   No   Conditional branch instructions not taken (0x18)
PAPI_BR_MSP   No   Conditional branch instructions mispredicted (0x19)
PAPI_BR_PRC   No   Conditional branch instructions correctly predicted (0x1a)
PAPI_TOT_IIS  No   Instructions issued (0x7)
PAPI_TOT_INS  No   Instructions completed (0x8)
PAPI_INT_INS  No   Integer instructions (0x9)
PAPI_FP_INS   No   Floating point instructions (0x14)
PAPI_LD_INS   No   Load instructions (0xa)
PAPI_SR_INS   No   Store instructions (0xb)
PAPI_TOT_CYC  No   Total cycles (0x0)
PAPI_LST_INS  No   Load/store instructions completed (0xc)
PAPI_SYC_INS  No   Synchronization instructions completed (0xd)
PAPI_FML_INS  No   Floating point multiply instructions (0x11)
PAPI_FAD_INS  No   Floating point add instructions (0x10)
PAPI_FDV_INS  No   Floating point divide instructions (0x12)
PAPI_FSQ_INS  No   Floating point square root instructions (0x13)
Perfometer Probe
● Sends a stream of performance data every N seconds to the Perfometer GUI.
● Functions can be colored at instrumentation time.
  – Default color is white, 0xFFFFFF
● Usage:
use perfometerprobe [0xRRGGBB]
instr <args> <0xRRGGBB>
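Since showrgb reports colors as decimal red, green and blue values, a small conversion step is needed to get the 0xRRGGBB form the probe expects. The helper below is a hypothetical convenience, not part of Dynaprof; it only formats the color argument and an instr command string in the syntax shown above.

```python
def rgb_to_probe_color(red, green, blue):
    """Format 0-255 channel values as a 0xRRGGBB argument string."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError(f"channel out of range: {channel}")
    return f"0x{red:02x}{green:02x}{blue:02x}"


# Build an instr command that colors one function red.
color = rgb_to_probe_color(255, 0, 0)
print(f"instr function swim.F calc1_ {color}")
```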
Perfometer Probe 2
● Perfometer GUI is NOT launched automatically.
● showrgb in X11 lists colors and names.
● Run the Java GUI
  – java -jar Perfometer.jar
● Connect up to the specified hostname and port.
Instrumenting SWIM with perfometerprobe
Module perfometerprobe.so was loaded.
Module libperfometer.so was loaded.
Module libpapi.so was loaded.
(dynaprof) instr function swim.F calc1_ 0xff0000
swim.F, inserted 1 instrumentation points
(dynaprof) instr function swim.F calc2_ 0x00ff00
swim.F, inserted 1 instrumentation points
(dynaprof) instr function swim.F calc3_ 0x0000ff
swim.F, inserted 1 instrumentation points
(dynaprof) run
Module libnss_files.so.2 was loaded.
Module libnss_nisplus.so.2 was loaded.
Module libnsl.so.1 was loaded.
Module libnss_dns.so.2 was loaded.
Module libresolv.so.2 was loaded.
Perfometer client awaiting connection on port #33733
Instrumenting FSPX for Instructions Per Cycle
(dynaprof) use probes/papiprobe PAPI_TOT_CYC, PAPI_TOT_INS
Module papiprobe.so was loaded.
Module libpapi.so was loaded.
Module libperfctr.so was loaded.
(dynaprof) instr module update.F
update.F, inserted 3 instrumentation points
(dynaprof) instr module pde.F
(dynaprof) instr
proflux_
flux_
pde_
(dynaprof) instr module phase.F
phase.F, inserted 1 instrumentation points
(dynaprof) instr
proflux_
flux_
pde_
phase_
Instrumenting SWIM for Instructions Per Cycle
(dynaprof) use probes/papiprobe PAPI_TOT_CYC, PAPI_TOT_INS
Module papiprobe.so was loaded.
Module libpapi.so was loaded.
Module libperfctr.so was loaded.
(dynaprof) instr function swim.F calc*
swim.F, inserted 3 instrumentation points
(dynaprof) instr
calc1_
calc2_
calc3_
calc3z_
Reporting Probe Data
● The wallclock and PAPI probes produce very similar data.
● Both use a parsing script written in Perl.
  – wallclockrpt <file>
  – papiproberpt <file>
● Produce 3 profiles
  – Inclusive: T_function = T_self + T_children
  – Exclusive: T_function = T_self
  – 1-Level Call Tree: T_child = Inclusive T_function
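The three profile definitions can be written out on a toy call graph. This is a minimal sketch with hypothetical function names and times, assuming each function has a single caller so the graph is a tree; it is not how the Perl report scripts are implemented.

```python
# Hypothetical self times (seconds) and a tree-shaped call graph.
self_time = {"main": 2.0, "solve": 5.0, "dot": 3.0}
children = {"main": ["solve"], "solve": ["dot"], "dot": []}


def exclusive(function):
    """Exclusive: T_function = T_self."""
    return self_time[function]


def inclusive(function):
    """Inclusive: T_function = T_self + T_children."""
    return self_time[function] + sum(inclusive(c) for c in children[function])


def call_tree(function):
    """1-level call tree: each direct child reported with its inclusive time."""
    return {child: inclusive(child) for child in children[function]}


for name in self_time:
    print(name, exclusive(name), inclusive(name), call_tree(name))
```

The exclusive profile answers where time is spent, while the inclusive profile and call tree show which callers are responsible for it.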
Fspx Cycles & Instrs.
Exclusive Profile of Metric PAPI_TOT_INS.
Name           Percent    Total      Calls
-------------  -------    -----      --------
TOTAL          100        3.031e+11  1
unknown        53.81      1.631e+11  1
proflux_       27.75      8.411e+10  9124
phase_         15.44      4.68e+10   6080
flux_          2.507      7.598e+09  6080
pde_           0.4884     1.48e+09   6080
Inclusive Profile of Metric PAPI_TOT_INS.
Name           Percent    Total      SubCalls
-------------  -------    -----      --------
TOTAL          100        3.031e+11  0
proflux_       59.31      1.797e+11  2.242e+08
phase_         37.69      1.142e+11  1.247e+08
flux_          2.507      7.598e+09  0
pde_           0.4884     1.48e+09   0
1-Level Inclusive Call Tree of Metric PAPI_TOT_INS.
Parent/-Child  Percent    Total      Calls
-------------  -------    -----      --------
TOTAL          100        3.031e+11  1
proflux_       100        1.797e+11  9124
- akl_         8.504      1.529e+10  3.737e+07
- aks_         8.4        1.51e+10   3.737e+07
- cpl_         8.525      1.532e+10  3.737e+07
- cps_         8.525      1.532e+10  3.737e+07
- hl_          9.689      1.742e+10  3.737e+07
- hs_          9.564      1.719e+10  3.737e+07
flux_          100        7.598e+09  6080
pde_           100        1.48e+09   6080
phase_         100        1.142e+11  6080
- tsofx_       11.72      1.339e+10  2.49e+07
- tlofx_       11.49      1.312e+10  2.49e+07
- eslds_       12.88      1.471e+10  2.49e+07
- elqds_       12.69      1.449e+10  2.49e+07
- tinsol_      4.999e-07  571        1
- tinmush_     1.114      1.273e+09  7.271e+04
- xsoft_       0.121      1.383e+08  7.271e+04
- xloft_       0.1031     1.178e+08  7.271e+04
- cpl_         8.913      1.018e+10  2.483e+07
Exclusive Profile of Metric PAPI_TOT_CYC.
Name           Percent    Total      Calls
-------------  -------    -----      --------
TOTAL          100        5.017e+11  1
unknown        53.62      2.69e+11   1
proflux_       27.75      1.393e+11  9124
phase_         14.9       7.475e+10  6080
flux_          3.096      1.554e+10  6080
pde_           0.6356     3.189e+09  6080
Inclusive Profile of Metric PAPI_TOT_CYC.
Name           Percent    Total      SubCalls
-------------  -------    -----      --------
TOTAL          100        5.017e+11  0
proflux_       57.32      2.876e+11  2.242e+08
phase_         38.92      1.953e+11  1.247e+08
flux_          3.096      1.554e+10  0
pde_           0.6356     3.189e+09  0
1-Level Inclusive Call Tree of Metric PAPI_TOT_CYC.
Parent/-Child  Percent    Total      Calls
-------------  -------    -----      --------
TOTAL          100        5.017e+11  1
proflux_       100        2.876e+11  9124
- akl_         7.945      2.285e+10  3.737e+07
- aks_         7.871      2.264e+10  3.737e+07
- cpl_         8.84       2.542e+10  3.737e+07
- cps_         8.705      2.503e+10  3.737e+07
- hl_          9.252      2.661e+10  3.737e+07
- hs_          8.967      2.579e+10  3.737e+07
flux_          100        1.554e+10  6080
pde_           100        3.189e+09  6080
phase_         100        1.953e+11  6080
- tsofx_       12.42      2.425e+10  2.49e+07
- tlofx_       12.42      2.425e+10  2.49e+07
- eslds_       13.41      2.618e+10  2.49e+07
- elqds_       13.41      2.62e+10   2.49e+07
- tinsol_      1.013e-06  1978       1
- tinmush_     1.716      3.351e+09  7.271e+04
- xsoft_       0.1749     3.415e+08  7.271e+04
- xloft_       0.151      2.95e+08   7.271e+04
- cpl_         8.032      1.569e+10  2.483e+07
fspx IPC
proflux  0.61
phase    0.63
flux     0.49
pde      0.46
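The IPC figures can be approximately reproduced by dividing instruction totals by cycle totals. The sketch below assumes the exclusive PAPI_TOT_INS and PAPI_TOT_CYC totals from the FSPX profiles are the inputs; the slides do not say which profile was used. Because those totals are printed to four significant digits, the results land within about 0.01 of the values shown on the slide rather than matching exactly.

```python
# Exclusive totals copied from the FSPX PAPI_TOT_INS and PAPI_TOT_CYC profiles.
instructions = {
    "proflux_": 8.411e10, "phase_": 4.68e10, "flux_": 7.598e09, "pde_": 1.48e09,
}
cycles = {
    "proflux_": 1.393e11, "phase_": 7.475e10, "flux_": 1.554e10, "pde_": 3.189e09,
}


def ipc(name):
    """Instructions per cycle for one instrumented function."""
    return instructions[name] / cycles[name]


for name in instructions:
    print(f"{name:10s} {ipc(name):.2f}")
```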
Swim Cycles & Instrs.
Exclusive Profile of Metric PAPI_TOT_INS.
Name           Percent  Total      Calls
-------------  -------  -----      --------
TOTAL          100      1.723e+09  1
calc2          38.28    6.598e+08  120
calc1          32.31    5.567e+08  120
calc3          22.33    3.847e+08  118
unknown        7.084    1.221e+08  1
Inclusive Profile of Metric PAPI_TOT_INS.
Name           Percent  Total      SubCalls
-------------  -------  -----      --------
TOTAL          100      1.723e+09  0
calc2          39.42    6.793e+08  1680
calc1          35.28    6.08e+08   1800
calc3          22.87    3.942e+08  1652
1-Level Inclusive Call Tree of Metric PAPI_TOT_INS.
Parent/-Child  Percent  Total      Calls
-------------  -------  -----      --------
TOTAL          100      1.723e+09  1
calc1          100      6.08e+08   120
- fsav         0.02065  1.255e+05  120
- mpi_irecv    0.03132  1.904e+05  120
- mpi_isend    0.05911  3.593e+05  120
- mpi_isend    0.06434  3.912e+05  120
- mpi_waitall  0.9013   5.479e+06  120
- mpi_irecv    0.03132  1.904e+05  120
- mpi_irecv    0.03132  1.904e+05  120
- mpi_isend    0.05356  3.256e+05  120
- mpi_isend    0.05079  3.088e+05  120
- mpi_waitall  6.813    4.142e+07  120
- mpi_irecv    0.03132  1.904e+05  120
- mpi_irecv    0.03132  1.904e+05  120
- mpi_isend    0.07504  4.562e+05  120
- mpi_isend    0.06757  4.108e+05  120
- mpi_waitall  0.161    9.791e+05  120
calc2          100      6.793e+08  120
- fsav         0.01848  1.255e+05  120
- mpi_irecv    0.02804  1.904e+05  120
- mpi_irecv    0.02804  1.904e+05  120
- mpi_isend    0.07762  5.273e+05  120
- mpi_isend    0.048    3.26e+05   120
- mpi_waitall  0.8084   5.491e+06  120
- mpi_irecv    0.02804  1.904e+05  120
- mpi_isend    0.05213  3.541e+05  120
Exclusive Profile of Metric PAPI_TOT_CYC.
Name           Percent  Total      Calls
-------------  -------  -----      --------
TOTAL          100      3.181e+09  1
calc2          34.85    1.108e+09  120
calc1          33.48    1.065e+09  120
calc3          26.1     8.301e+08  118
unknown        5.568    1.771e+08  1
Inclusive Profile of Metric PAPI_TOT_CYC.
Name           Percent  Total      SubCalls
-------------  -------  -----      --------
TOTAL          100      3.181e+09  0
calc2          35.98    1.144e+09  1680
calc1          35.61    1.133e+09  1800
calc3          26.88    8.55e+08   1652
1-Level Inclusive Call Tree of Metric PAPI_TOT_CYC.
Parent/-Child  Percent  Total      Calls
-------------  -------  -----      --------
TOTAL          100      3.181e+09  1
calc1          100      1.133e+09  120
- fsav         0.03432  3.887e+05  120
- mpi_irecv    0.07356  8.332e+05  120
- mpi_isend    0.0663   7.51e+05   120
- mpi_isend    0.0739   8.371e+05  120
- mpi_waitall  0.7189   8.143e+06  120
- mpi_irecv    0.1646   1.864e+06  120
- mpi_irecv    0.03407  3.859e+05  120
- mpi_isend    0.1867   2.115e+06  120
- mpi_isend    0.06067  6.872e+05  120
- mpi_waitall  4.22     4.78e+07   120
- mpi_irecv    0.03979  4.506e+05  120
- mpi_irecv    0.03008  3.407e+05  120
- mpi_isend    0.1014   1.148e+06  120
- mpi_isend    0.07568  8.573e+05  120
- mpi_waitall  0.1076   1.219e+06  120
calc2          100      1.144e+09  120
- fsav         0.03382  3.87e+05   120
- mpi_irecv    0.03222  3.687e+05  120
- mpi_irecv    0.03554  4.067e+05  120
- mpi_isend    0.0959   1.097e+06  120
- mpi_isend    0.05655  6.471e+05  120
- mpi_waitall  0.7268   8.317e+06  120
- mpi_irecv    0.1865   2.134e+06  120
- mpi_isend    0.2616   2.993e+06  120
- mpi_isend    0.06976  7.983e+05  120
Swim IPC
calc2  0.59
calc1  0.53
calc3  0.46
Perfometer Screenshot
Dynaprof 0.8 SC Release
● Binary distribution for 4 Platforms on the website
  – AIX 3.x / DPCL 3.2.5 on Power 3
  – Linux / DynInst 3.0 on Pentium <= III
  – Solaris 2.8 / DynInst 3.0 on UltraSparc II/III
  – IRIX / DynInst 3.0 on MIPS R10/12/14k
  – Power 4 and Pentium 4 are coming...
● Xdynaprof Java/Swing GUI included
● perfometerprobe and GUI included
● Updated documentation
References
● The Dynaprof Homepage
http://www.cs.utk.edu/~mucci/dynaprof
● The PAPI Homepage
http://icl.cs.utk.edu/projects/papi
● The DynInst Homepage
http://www.dyninst.org
● The DPCL Homepage
http://oss.software.ibm.com/developerworks/opensource/dpcl
● The Vprof Homepage
http://aros.ca.sandia.gov/~cljanss/perf/vprof
● The GNU Readline Homepage
http://cnswww.cns.cwru.edu/~chet/readline/rltop.html