IBM Petascale Workshop 2006
PAPI Directions: Overview
- What's PAPI?
- What's New? (features, platforms)
- What's Next? (Network PAPI, Thermal PAPI)
- When? (PAPI release roadmap)
- What's ICL? (a word from our sponsor)
What's PAPI?
A software layer (library) designed to provide the tool developer and application engineer with a consistent interface and methodology for using the performance counter hardware found in most major microprocessors.
- Countable events are defined in two ways: platform-neutral Preset Events and platform-dependent Native Events
- Preset Events can be derived from multiple Native Events
- All events are referenced by name and collected into EventSets for sampling
- Events can be multiplexed if counters are limited
- Statistical sampling is implemented by software overflow with timer-driven sampling, or by hardware overflow if supported by the platform
Where's PAPI?
PAPI runs on most modern processors and operating systems of interest to HPC:
- IBM POWER3, 4, 5 / AIX
- POWER4, 5 / Linux
- PowerPC-32 and -64 / Linux
- Blue Gene
- Intel Pentium II, III, 4, M, EM64T, etc. / Linux
- Intel Itanium
- AMD Athlon, Opteron / Linux
- Cray T3E, X1, XD3, XT3 Catamount
- Altix, Sparc, ...

NOTE: All Linux implementations require the perfctr kernel patch, except Itanium, which uses the built-in perfmon interface. Perfmon2 development is underway to replace perfctr and be pre-installed in the kernel: NO PATCHES NEEDED!
Extending PAPI beyond the CPU
PAPI has historically targeted on-processor performance counters. Several categories of off-processor counters exist:
- network interfaces: Myrinet, Infiniband, GigE
- memory interfaces: Cray X1
- thermal and power interfaces: ACPI

CHALLENGE: Extend the PAPI interface to address multiple counter domains, while preserving the PAPI calling semantics, ease of use, and platform independence for existing applications.
Multi-Substrate PAPI
Goals:
- Isolate hardware-dependent code in a separable 'substrate' module
- Extend platform-independent code to support multiple simultaneous substrates
- Add or modify API calls to support access to any of several substrates
- Modify the build environment for easy selection and configuration of multiple available substrates
PAPI 3.0 Design
[Diagram: the PAPI High Level and PAPI Low Level APIs form a portable, hardware-independent layer above a single PAPI Machine Dependent Substrate (the machine-specific layer), which accesses the hardware performance counters through the operating system and a kernel extension.]
PAPI 4.0 Multiple Substrate Design
[Diagram: the same hardware-independent layer (PAPI High Level and PAPI Low Level) now sits above several PAPI Machine Dependent Substrates at once; each substrate drives its own counters (hardware performance counters on-processor, or off-processor hardware counters) through its operating system and kernel extension.]
API Changes
Three calls are augmented with a substrate index; the old syntax is preserved in wrapper functions for backward compatibility.
Modified entry points:
- PAPI_create_eventset -> PAPI_create_sbstr_eventset
- PAPI_get_opt -> PAPI_get_sbstr_opt
- PAPI_num_hwctrs -> PAPI_num_sbstr_hwctrs
New entry points for new functionality:
- PAPI_num_substrates
- PAPI_get_sbstr_info
Old code can run with no source modifications.
PAPI 4.0 Status
- Multi-substrate development complete
- Some CPU platforms not yet ported
- Substrates available for:
  - ACPI (Advanced Configuration and Power Interface)
  - Myrinet MX
- Substrates under development for:
  - Infiniband
  - GigE
- Friendly User release available now for CVS checkout
Myrinet MX Counters
LANAI_UPTIME, COUNTERS_UPTIME, BAD_CRC8, BAD_CRC32, UNSTRIPPED_ROUTE, PKT_DESC_INVALID, RECV_PKT_ERRORS, PKT_MISROUTED, DATA_SRC_UNKNOWN, DATA_BAD_ENDPT, DATA_ENDPT_CLOSED, DATA_BAD_SESSION, PUSH_BAD_WINDOW, PUSH_DUPLICATE, PUSH_OBSOLETE, PUSH_RACE_DRIVER, PUSH_BAD_SEND_HANDLE_MAGIC, PUSH_BAD_SRC_MAGIC, PULL_OBSOLETE, PULL_NOTIFY_OBSOLETE, PULL_RACE_DRIVER, ACK_BAD_TYPE, ACK_BAD_MAGIC, ACK_RESEND_RACE, LATE_ACK, ACK_NACK_FRAMES_IN_PIPE, NACK_BAD_ENDPT, NACK_ENDPT_CLOSED, NACK_BAD_SESSION, NACK_BAD_RDMAWIN, NACK_EVENTQ_FULL, SEND_BAD_RDMAWIN, CONNECT_TIMEOUT, CONNECT_SRC_UNKNOWN, QUERY_BAD_MAGIC, QUERY_TIMED_OUT, QUERY_SRC_UNKNOWN, RAW_SENDS, RAW_RECEIVES, RAW_OVERSIZED_PACKETS, RAW_RECV_OVERRUN, RAW_DISABLED, CONNECT_SEND, CONNECT_RECV, ACK_SEND, ACK_RECV, PUSH_SEND, PUSH_RECV, QUERY_SEND, QUERY_RECV, REPLY_SEND, REPLY_RECV, QUERY_UNKNOWN, DATA_SEND_NULL, DATA_SEND_SMALL, DATA_SEND_MEDIUM, DATA_SEND_RNDV, DATA_SEND_PULL, DATA_RECV_NULL, DATA_RECV_SMALL_INLINE, DATA_RECV_SMALL_COPY, DATA_RECV_MEDIUM, DATA_RECV_RNDV, DATA_RECV_PULL, ETHER_SEND_UNICAST_CNT, ETHER_SEND_MULTICAST_CNT, ETHER_RECV_SMALL_CNT, ETHER_RECV_BIG_CNT, ETHER_OVERRUN, ETHER_OVERSIZED, DATA_RECV_NO_CREDITS, PACKETS_RESENT, PACKETS_DROPPED, MAPPER_ROUTES_UPDATE, ROUTE_DISPERSION, OUT_OF_SEND_HANDLES, OUT_OF_PULL_HANDLES, OUT_OF_PUSH_HANDLES, MEDIUM_CONT_RACE, CMD_TYPE_UNKNOWN, UREQ_TYPE_UNKNOWN, INTERRUPTS_OVERRUN, WAITING_FOR_INTERRUPT_DMA, WAITING_FOR_INTERRUPT_ACK, WAITING_FOR_INTERRUPT_TIMER, SLABS_RECYCLING, SLABS_PRESSURE, SLABS_STARVATION, OUT_OF_RDMA_HANDLES, EVENTQ_FULL, BUFFER_DROP, MEMORY_DROP, HARDWARE_FLOW_CONTROL, SIMULATED_PACKETS_LOST, LOGGING_FRAMES_DUMPED, WAKE_INTERRUPTS, AVERTED_WAKEUP_RACE, DMA_METADATA_RACE
Multiple Measurements
The HPCC HPL benchmark with three performance metrics: FLOPS, temperature, and network sends/receives.
[Chart: Node 7]
Multiple Measurements (2)
The HPCC HPL benchmark with three performance metrics: FLOPS, temperature, and network sends/receives.
[Chart: Node 3]
Data Structure Addressing
Goal: Measure events related to specific data addresses (structures).
Availability:
- Itanium: 160 / 475 native events
- rumored on POWER4; POWER5?

PAPI example:

...
opt.addr.eventset = EventSet;
opt.addr.start = (caddr_t)array;
opt.addr.end = (caddr_t)(array + size_array);
retval = PAPI_set_opt(PAPI_DATA_ADDRESS, &opt);
actual.start = (caddr_t)array - opt.addr.start_off;
actual.end = (caddr_t)(array + size_array) + opt.addr.end_off;
...
Rensselaer to Build and House $100 Million Supercomputer
NY Times, May 11, 2006
Rensselaer Polytechnic Institute announced yesterday that it was combining forces with New York State and I.B.M. to build a $100 million supercomputer that will be among the 10 most powerful in the world. The computer, a type of I.B.M. system known as Blue Gene, will be on Rensselaer's campus in Troy, N.Y., and will have the power to perform more than 70 trillion calculations per second. It will mainly be used to help researchers make smaller, faster semiconductor devices and for nanotechnology research.
PAPI and BG/L
Performance Counters:
- 48 UPC counters, shared by both CPUs; external to the CPU cores; only 32 bits :(
- 2 counters on each FPU: 1 counts loads/stores, 1 counts arithmetic operations
- Accessed via bgl_perfctr
[Diagram: each of the two CPU cores has 2 FPU PMCs; the UPC module holds the 48 shared counters.]
PAPI and BG/L (2): Versions
PAPI 2.3.4
- Original release
- Poor native event support
PAPI 3.2.2 beta
- Currently being beta tested
- Full access to native events by name
Limitations:
- Only events exposed by bgl_perfctr
- No control over native event edges
- Still no overflow/profile support
- Is there a timer available?
- No configure script (cross-compilation)
- No scripted acceptance test suite (multiple queuing systems)
PAPI and BG/L (3): Presets
Test case avail.c: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   : (1312)
Model string and code    : PVR=0x5202:0x1891 Serial=R00-M0-N1-C:J16-U01 (1375869073)
CPU Revision             : 20994.062500
CPU Megahertz            : 700.000000
CPU's in this Node       : 1
Nodes in this System     : 32
Total CPU's              : 32
Number Hardware Counters : 52
Max Multiplex Counters   : 32
-------------------------------------------------------------------------
Name             Derived  Description (Mgr. Note)
PAPI_L3_TCM      No       Level 3 cache misses ()
PAPI_L3_LDM      Yes      Level 3 load misses ()
PAPI_L3_STM      No       Level 3 store misses ()
PAPI_FMA_INS     No       FMA instructions completed ()
PAPI_TOT_CYC     No       Total cycles ()
PAPI_L2_DCH      Yes      Level 2 data cache hits ()
PAPI_L2_DCA      Yes      Level 2 data cache accesses ()
PAPI_L3_TCH      No       Level 3 total cache hits ()
PAPI_FML_INS     No       Floating point multiply instructions ()
PAPI_FAD_INS     No       Floating point add instructions ()
PAPI_BGL_OED     No       BGL special event: Oedipus operations ()
PAPI_BGL_TS_32B  Yes      BGL special event: Torus 32B chunks sent ()
PAPI_BGL_TS_FULL Yes      BGL special event: Torus no token UPC cycles ()
PAPI_BGL_TR_DPKT Yes      BGL special event: Tree 256 byte packets ()
PAPI_BGL_TR_FULL Yes      BGL special event: UPC cycles (CLOCKx2) tree rcv is full ()
-------------------------------------------------------------------------
avail.c PASSED
PAPI and BG/L (4): Native Events
328 native events available (only events exposed by bgl_perfctr):
- 4 arithmetic events per FPU
- 4 load/store events per FPU
- 312 UPC events

BGL_FPU_ARITH_ADD_SUBTRACT 0x40000000 |Add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)|
BGL_FPU_ARITH_MULT_DIV     0x40000001 |Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)|
BGL_FPU_ARITH_OEDIPUS_OP   0x40000002 |Oedipus operations, All symmetric, asymmetric, and complex Oedipus multiply-add instructions|
...
BGL_UPC_TS_ZP_VCD0_CHUNKS  0x40000145 |ZP vcd0 chunks|
BGL_UPC_TS_ZP_VCD1_CHUNKS  0x40000146 |ZP vcd1 chunks|
BGL_PAPI_TIMEBASE          0x40000148 |special event for getting the timebase reg|
-------------------------------------------------------------------------
Total events reported: 328
native_avail.c PASSED
XT3 and Catamount
The Oak Ridger, February 21, 2006
“The Cray XT3 Jaguar, the flagship computing system in ORNL's Leadership Computing Facility, was ranked tenth in the world in a November 2005 survey of supercomputers, delivering 20.5 trillion operations per second (teraflops).”
PAPI and Catamount
- Opteron-based; Catamount OS similar to CNK
- Driven by a Sandia-Cray version of perfctr
- No overflow / profiling
- Configure works because compile node == compute node
- Test suite script works because there's only one queuing system
...and of course, Cell
- The PlayStation 3's CPU is based on a chip codenamed "Cell"
- Each Cell contains 8 APUs; an APU is a self-contained vector processor acting independently of the others
- 4 floating point units capable of a total of 32 Gflop/s (8 Gflop/s each)
- 256 Gflop/s peak! (32-bit floating point); 64-bit floating point at 25 Gflop/s
- But what about the performance counters!
When? PAPI Release Schedule
PAPI 3.3.0: RealSoonNow™
- BG/L in beta testing
- Merging and deprecating PAPI 3.0.8.1
- Regression testing on other platforms
PAPI 4.0: Q2, 2006
- Porting some substrates to the multi-substrate model
- Developing additional non-CPU substrates
Wanna help? Distributed testing...
Distributed Testing
Problem: How do you develop/test/verify on multiple systems with multiple OS's at multiple sites? Automatically, transparently, repetitively.
Candidates:
- Dart / CTest
- Mozilla Tinderbox
- DejaGnu
- Homegrown
- Others?
A Word from our Sponsor...
Innovative Computing Laboratory
Jack's research group in the CS Department
Size: about 45-50 people (16 students; 19 scientific staff; 10 support staff; 1 visitor)
Funding:
- NSF: Supercomputer Centers (UCSD & NCSA), Next Generation Software (NGS), Info Tech Res. (ITR), Middleware Init. (NMI)
- DOE: Scientific Discovery through Advanced Computing (SciDAC), Math in Comp Sci (MICS)
- DARPA: High Productivity Computing Systems
- DOD Modernization
Work with companies: AMD, Cray, Dolphin, Microsoft, MathWorks, Intel, Sun, Myricom, SGI, HP, IBM, Northrop-Grumman
- PhD dissertation, MS project
- Equipment: a number of clusters, desktop machines, office setup
- Summer internships: industry, ORNL, ...
- Travel to meetings
- Participate in publications
ICL Class of 2005
Speculative Performance Positions
PostDoc positions probably available:
- PAPI: new platforms (Cell?), new substrates (Infiniband?)
- KOJAK: automated performance analysis
- ECLIPSE PTP & TAU integration
See me for brochures or more info.