Martin Bakal
ScicomP 5/25/2016
TotalView on IBM PowerLE and CORAL Sierra/Summit
Agenda
• Corporate Overview
• CORAL Milestones
• TotalView
• New architecture
• Demo
• Questions
Founded: 1989
We are the largest independent provider of cross-platform software development tools and embedded components
Company snapshot
Our capabilities cover different languages, code bases, and platforms. We meet development where – and how – it happens.
Headquarters: Louisville, CO
Employees: 350
Offices Worldwide: 11
Meeting customer needs with capabilities
Our products and services
Tools Libraries
SourcePro OS, database, network, and analysis abstraction for C++
Visualization Real-time data visualization at scale
IMSL Numerical Libraries Scalable math and statistics algorithms
PV-WAVE Visual data analysis
HydraExpress SOA/C++ modernization framework
HostAccess Terminal emulation for Windows
Stingray MFC GUI components
OpenLogic Audits Detailed open source license and security risk guidance
OpenLogic Support Enterprise-grade SLA support
Klocwork On-the-fly static code analysis for app security
TotalView for HPC Scalable debugging
CodeDynamics Commercial dynamic analysis
Zend Server Enterprise PHP app server
Zend Studio PHP IDE
Zend Guard PHP encoding and obfuscation
TotalView for HPC
• Comprehensive multi-core and multi-threaded analysis and debug environment
– Thread-specific breakpoints
– Control individual thread execution
– View thread-specific stack and data
– View complex data types easily
• Integrated reverse debugging
• Track memory leaks in running applications
• Supports C/C++ on Linux
• Allowing the business to have
– Predictable development schedules
– Less time spent debugging
– Platform coverage
• Linux, BG/Q, CUDA GPUs, Xeon Phi, Linux-PowerLE with GPUs, etc.
LLNL/Sierra Focus Areas
• Collaborative work
– Rogue Wave, LLNL, IBM, Nvidia, RWTH Aachen
• Focuses on three areas
– OpenMP 4 + GPUs debugging
– MPI+GPU debugger performance and scalability
– EVAL (conditional breakpoints) performance and scalability
DO NOT COPY OR REDISTRIBUTE WITHOUT WRITTEN PERMISSION
OpenMP 4 + GPUs Debugging
• OpenMP 4 debugging support (CPUs and GPUs) for Sierra
• Collaborate on OpenMP Debug API (OMPD) design
• Three phases
– Phase 1: TotalView/OMPD: OMP3.1/CPU, x86_64
– Phase 2: TotalView/OMPD: OMP4/CPU/GPU, x86_64
– Phase 3: TotalView/OMPD: OMP4/CPU/GPU, PowerLE
• Phase 1 progress to date
– Draft of OMPD for OpenMP 3.1 completed
– RWTH Aachen implemented OMPD DLL for Intel OpenMP RTL
– TotalView/OMPD feature development progressing
OMP Control Vars & Meta Info
Intel OMPD DLL currently returns no control variable information
Meta information shows version #, ID, and DLL path
OMP Parallel Regions
Parallel region hierarchy at thread, process and group widths
Aggregated, process/thread list:“#p:#t[dpid-range.dtid-range, …]”
OMP Task Regions
Task region display is similar to parallel region display, but shows the task relationships
OMP Threads
Thread-centric views of information available from OMPD
OMP Stack Filtering
• “Raw”, unfiltered stack displays the OMP RTL stack frames
OMP RTL frames typically uninteresting to users
OMP Stack Filtering
• OMPD allows the debugger to portably find and filter out OMP RTL frames
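A minimal sketch of the filtering idea, not TotalView's implementation. The stack contents are made up, and the `is_runtime_frame` predicate is a hypothetical stand-in for the question a debugger would put to OMPD; the name-prefix test it uses here is exactly the non-portable shortcut that OMPD is meant to replace.

```python
# Hypothetical raw stack, innermost frame first. The __kmp_* names follow the
# naming style of the Intel OpenMP runtime; the user frames are invented.
RAW_STACK = [
    "compute_row",
    "L_main_42__par_region0_1_2",
    "__kmp_invoke_microtask",
    "__kmp_fork_call",
    "main",
]


def is_runtime_frame(name):
    """Stand-in for an OMPD query: True if the frame belongs to the OMP RTL.

    Assumption: a prefix test is used only to keep the sketch runnable.
    """
    return name.startswith("__kmp")


def filter_stack(frames, is_rtl=is_runtime_frame):
    """Return the frames a user cares about, preserving their order."""
    return [frame for frame in frames if not is_rtl(frame)]


if __name__ == "__main__":
    for frame in filter_stack(RAW_STACK):
        print(frame)
```

Passing the predicate as a parameter keeps the filtering logic independent of how runtime frames are recognized, which is the part OMPD is meant to make portable.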
OMP Master/Slave Stack Linking
Stack hyperlink “connects” a slave’s thread frame to its master’s thread frame
Selecting the frame jumps to the parent thread and stack frame that invoked the parallel region
OMP Master/Slave Stack Linking
Clicking again “climbs” the parallel region tree, focusing on its parent
OMP Master/Slave Stack Linking
Now at the “root” of the OMP parallel region tree
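The climbing behavior on the last three slides can be pictured as following parent links until there are none left. This is only an illustration: the `Region` type, its fields, and the thread IDs are hypothetical and do not describe TotalView's data structures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """One parallel region; `parent` is the region whose thread invoked it."""
    name: str
    thread: str
    parent: Optional["Region"] = None


def climb(region: Region) -> List[Region]:
    """Return the chain from `region` up to the root, following parent links."""
    chain = []
    node = region
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


# Hypothetical three-level nesting; each "click" moves one step up the chain.
root = Region("outer", "1.1")
inner = Region("inner", "1.3", parent=root)
leaf = Region("innermost", "1.7", parent=inner)
path = [region.thread for region in climb(leaf)]
```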
OMP Mangled Outlined-Functions
• OMP outlined-function name mangling is not standard
• DWARF could connect an outlined function to its containing function
– E.g., DW_AT_omp_outlined <containing-die>
• Instead of
“L_func_42__par_region0_1_2”
• Debugger could reliably show something like
“func (parallel region 1 at file.c#42)”
• Needed
– A DWARF OpenMP proposal
– A compiler developer to produce the DWARF
OMP Variable Information
• Users have asked for “OpenMP variable information”
– E.g., private, shared, firstprivate, copyin, reduction, etc.
– Compile-time attributes of the variable that the compiler knows
– DWARF could represent these attributes
OMP4 + GPUs
• OMP4 compilers currently produce no DWARF for GPU code
– IBM is working on a solution
• OMPD currently supports only OpenMP 3.1 (no GPUs)
– Specification must be extended for OpenMP 4 + GPU
– Seeking OMP4+GPU OMPD implementation
• TotalView modifications
– DEVICE and TARGET region support
– CUDA/GPU support
– Depends on the OMP RTL execution model
OpenMP 4 Loose Ends
• Help push toward OMPD standardization
• OMPD for IBM/LOMP
– When IBM implements the DLL
– TotalView should be able to “just” use it
• OMP aggregated logical call tree
– Reassemble the structure of an executing OMP4 program into a logical call tree
MPI+GPU Debugging at Scale
• Performance, scalability, and functionality on MPI+GPU targets
• Two phases of Application Driven Tuning (ADT) with GPUs
– Phase 1: Linux-x86_64 (LULESH/RAJA, HYPRE, LAMMPS)
– Phase 2: Linux-PowerLE (other benchmarks)
• NVidia CILP allows MPI processes to share GPUs on a node
– CILP (hardware pre-emption)
• CUDA Debug API Limitations
– Requires creating a debug agent process per target process
– TotalView creates a “bushier” MRNet tree
– Future work
• Fix the API to support true multi-process debugging
• Add support for MPS debugging
EVAL Point Performance and Scalability
• Support evaluating conditional breakpoints in the debugger servers
• Allows the interpreter to run in parallel in the servers
• Client contains “heavyweight” stuff: symbol data, compilers, IL generator
• Server remains “lightweight” adding a small IL interpreter
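A toy sketch of the client/server split described above, under stated assumptions: the IL format, opcode names, and function names are invented for illustration and are not TotalView's. The "client" does the parsing (here borrowed from Python's `ast` module) and emits a flat postfix program; the "server" is a small stack machine that needs no parser or symbol tables.

```python
import ast
import operator

# Operators the toy IL supports, keyed by AST node type.
OPS = {
    ast.Add: ("add", operator.add),
    ast.Sub: ("sub", operator.sub),
    ast.Mult: ("mul", operator.mul),
    ast.Gt: ("gt", operator.gt),
    ast.Lt: ("lt", operator.lt),
    ast.Eq: ("eq", operator.eq),
}
OPCODES = {name: fn for name, fn in OPS.values()}


def compile_condition(source):
    """Client side: parse once and emit a flat postfix IL program."""
    il = []

    def emit(node):
        if isinstance(node, ast.Constant):
            il.append(("push", node.value))
        elif isinstance(node, ast.Name):
            il.append(("load", node.id))
        elif isinstance(node, ast.BinOp) and type(node.op) in OPS:
            emit(node.left)
            emit(node.right)
            il.append((OPS[type(node.op)][0], None))
        elif (isinstance(node, ast.Compare) and len(node.ops) == 1
              and type(node.ops[0]) in OPS):
            emit(node.left)
            emit(node.comparators[0])
            il.append((OPS[type(node.ops[0])][0], None))
        else:
            raise ValueError("unsupported expression")

    emit(ast.parse(source, mode="eval").body)
    return il


def interpret(il, variables):
    """Server side: evaluate the IL against this server's local variables."""
    stack = []
    for op, arg in il:
        if op == "push":
            stack.append(arg)
        elif op == "load":
            stack.append(variables[arg])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPCODES[op](left, right))
    return stack.pop()


# Compile once on the client, then "broadcast" the same IL to every server.
il = compile_condition("i * 2 > limit")
servers = [{"i": 10, "limit": 50}, {"i": 40, "limit": 50}]
hits = [interpret(il, local_vars) for local_vars in servers]
```

Because each server evaluates the condition against its own processes, only the breakpoints whose condition holds need to be reported back to the client.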
[Diagram: the TV Client lexes, parses, compiles, and generates IL, and has its own IL interpreter. The IL is broadcast to the TV Servers, each of which runs an IL interpreter.]
Aggregation
• TotalView has been enhanced to add new types of aggregation
– Aggregated process and thread status
• New root window
• CLI dstatus
– Aggregated stack back trace
• Graphical call tree window
• CLI dwhere
– Aggregated data
• CLI dprint
Focus on Data Aggregation
• Makes it easy to collect information about data and arrays from all threads
• Added support for aggregated data collection
• On CUDA still prints for each thread
TotalView dprint command
dprint -g agg_1
Focus:64:32000[0-63.1-500]
0x00000000(0):64:16000[0-63.1,0-63.3,0-63.5,0-63.7,0-63.9,...]
0x00000001(1):64:16000[0-63.2,0-63.4,0-63.6,0-63.8,0-63.10,...]
One line for each unique value of the variable
Portable across platforms and will be supported on Linux-PowerLE CORAL Sierra/Summit
Variable
# of MPI ranks:# of threads[MPI rank range.thread range]
Data on specific value
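A rough sketch of how output shaped like this could be produced: group every (rank, thread) sample by value, then print one line per unique value with the rank count, the thread count, and a compressed list of locations. The function names and the input layout are assumptions made for illustration; the real dprint formatting rules may differ.

```python
from collections import defaultdict


def runs(sorted_ids):
    """Yield (first, last) for each run of consecutive integers."""
    start = prev = sorted_ids[0]
    for n in sorted_ids[1:]:
        if n != prev + 1:
            yield start, prev
            start = n
        prev = n
    yield start, prev


def aggregate(samples):
    """samples maps (rank, thread) -> non-negative int value.

    Returns one line per unique value, in the style
    value:#ranks:#threads[rank-range.thread, ...].
    """
    by_value = defaultdict(lambda: defaultdict(list))
    for (rank, thread), value in samples.items():
        by_value[value][thread].append(rank)

    lines = []
    for value in sorted(by_value):
        threads = by_value[value]
        ranks = {r for rank_list in threads.values() for r in rank_list}
        count = sum(len(rank_list) for rank_list in threads.values())
        parts = []
        for thread in sorted(threads):
            for lo, hi in runs(sorted(threads[thread])):
                span = f"{lo}-{hi}" if lo != hi else f"{lo}"
                parts.append(f"{span}.{thread}")
        lines.append(
            f"0x{value:08x}({value}):{len(ranks)}:{count}[{','.join(parts)}]"
        )
    return lines


# Hypothetical job: 4 ranks with threads 1..4; odd threads hold 0, even hold 1.
samples = {(r, t): (t + 1) % 2 for r in range(4) for t in range(1, 5)}
report = aggregate(samples)
```

For this small input the report has two lines, one per unique value, regardless of how many ranks and threads contributed.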
Licensing
• Licensing is an issue
• Flexera doesn’t support Power with their FlexNet Publisher product
• This means we have to time bomb the product after a year
New architecture
New Architecture in building UI
[Diagram: a Qt 4 based Front-End sits on the TotalView Debug Interface (TVDI). The Back-End consists of the TotalView Debug Engine Interface (TVDEI) on top of the TotalView Debugger Engine (TVDE). Front-End and Back-End each have a Transport Mechanism and speak the TotalView DebugWire Protocol (TVDWP). A Communication Channel connects to the TotalView Debug Server, which controls the target processes.]
Used to be one large application
Why does the architecture matter
• This isn’t short term thinking
– No need for XWindows on target platforms
– Performance
– Scalability
– Platforms
• More effective debugger on all platforms
• Easier 3rd party integrations
Multi-threading made easier
• How do you debug a problem in a 50 thread application that occurs in 1 thread?
– Without TotalView,
• Set a breakpoint in code
• Run and hope you hit the right thread
– With TotalView
• Set thread-specific breakpoints
• Better multithreaded debugger
– Understand the state of all of your threads
– Focus on specific threads
• View stack and data
• Built to scale for HPC, and brings that scalability to mainstream commercial environments
Viewing complex data types
• How do you inspect complex data types for changes?
– Without TotalView,
• Look through pointer at memory
• Map to the data structures
• Recreate the data type by hand
– With TotalView
• View the data structure directly
• Users get to focus on debugging
– Complex data types support includes
• STL collections
• Large multi-dimensional arrays
• Boost collection classes
• C++11 specific types
Reverse debugging
• How do you isolate an intermittent failure?
– Without TotalView,
• Set a breakpoint in code
• Realize you ran past the problem
• Re-load
• Set breakpoint earlier
• Hope it fails
• Keep repeating
– With TotalView
• Start recording
• Set a breakpoint
• See failure
• Run backwards/forwards in context of failing execution
– Reverse Debugging
• Re-creates the context when going backwards
• Focus down to a specific problem area easily
• Saves days in recreating a failure
How do you identify buffer overflows?
Runtime Memory Analysis: Eliminate Memory Errors
– Detects memory leaks before they are a problem
– Explore heap memory usage
Features
– Detects
• Malloc API misuse
• Memory leaks
• Buffer overflows
– Low runtime overhead
– Easy to use
• Works with vendor libraries
• No recompilation
• No instrumentation
Memory Analysis
Regression Testing
• How do you make sure a bug you fixed never returns?
– Build a regression test
– The issue is that building one is typically time consuming
• What is the method to build a regression test?
– Use the tools that helped you find it
• How do you run a regression test?
– Invoke it during your build process
• Enter TotalView scripts
– Command line driven
– Access to application internals
– Same commands as in the debugger
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Print
!
! Process:
!   ./server (Debugger Process ID: 1, System ID: 12110)
! Thread:
!   Debugger ID: 1.1, System ID: 3083946656
! Time Stamp:
!   06-26-2008 14:04:09
! Triggered from event:
!   actionpoint
! Results:
!   foreign_addr = {
!     sin_family = 0x0002 (2)
!     sin_port = 0x1fb6 (8118)
!     sin_addr = {
!       s_addr = 0x6658a8c0 (1717086400)
!     }
!     sin_zero = ""
!   }
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
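One way a build step could turn script output like this into a pass/fail regression check is to parse the printed fields and assert on them. The sketch below assumes the log layout shown on the slide; the parser, the `LOG` excerpt, and the expected values are illustrative only and are not part of TotalView.

```python
import re

# Excerpt of the "Results" section from the log above, one field per line.
LOG = """\
! Results:
!   foreign_addr = {
!     sin_family = 0x0002 (2)
!     sin_port = 0x1fb6 (8118)
!     sin_addr = {
!       s_addr = 0x6658a8c0 (1717086400)
!     }
!     sin_zero = ""
!   }
"""

# Matches lines such as "sin_port = 0x1fb6 (8118)".
FIELD = re.compile(r"(\w+)\s*=\s*0x([0-9a-fA-F]+)\s*\((\d+)\)")


def parse_fields(log_text):
    """Return {field name: integer value} for every 'name = 0x.. (n)' line."""
    fields = {}
    for name, hex_text, dec_text in FIELD.findall(log_text):
        value = int(dec_text)
        if int(hex_text, 16) != value:
            raise ValueError(f"inconsistent value for {name}")
        fields[name] = value
    return fields


def check_regression(log_text, expected):
    """Return the names whose logged value differs from the expected one."""
    actual = parse_fields(log_text)
    return sorted(k for k, v in expected.items() if actual.get(k) != v)


fields = parse_fields(LOG)
failures = check_regression(LOG, {"sin_family": 2, "sin_port": 8118})
```

An empty `failures` list means the values captured at the action point still match what the fix was supposed to guarantee.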
Scales to meet your need
• Support debugging on thousands of cores
– MRNet is built to multicast
– Aggregates data to/from cores
• Remote Display Client
– Debug on a remote machine
– Easy to configure
– Focus on your debugging
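The aggregation idea behind a tree overlay can be sketched as a level-by-level reduction: each intermediate node merges what its children report, so the front end sees one merged result instead of one message per core. This is a conceptual illustration only; it does not use the MRNet API, and the fan-out and status values are assumptions.

```python
from collections import Counter


def reduce_tree(leaves, fanout):
    """Merge leaf Counters level by level.

    Returns (merged result at the root, number of tree levels traversed).
    Assumes `leaves` is non-empty and `fanout` is at least 2.
    """
    level = list(leaves)
    depth = 0
    while len(level) > 1:
        level = [
            sum(level[i:i + fanout], Counter())
            for i in range(0, len(level), fanout)
        ]
        depth += 1
    return level[0], depth


# Hypothetical job: 1024 processes, every 128th one is stopped.
leaves = [
    Counter({"stopped" if i % 128 == 0 else "running": 1})
    for i in range(1024)
]
status, depth = reduce_tree(leaves, fanout=32)
```

With a fan-out of 32, the 1024 per-process reports collapse to a single summary in two levels, which is the property that lets the approach scale to thousands of cores.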