Martin Bakal ScicomP 5/25/2016 TotalView on IBM PowerLE and CORAL Sierra/Summit
Page 1: TotalView on IBM PowerLE and CORAL Sierra/Summit · spscicomp.org/wordpress/wp-content/uploads/2016/05/... · TotalView for HPC • Comprehensive multi-core and multi-threaded analysis

Martin Bakal

ScicomP 5/25/2016

TotalView on IBM PowerLE and CORAL Sierra/Summit

Page 2

Agenda

• Corporate Overview

• CORAL Milestones

• TotalView

• New architecture

• Demo

• Questions

Page 3

Company snapshot

We are the largest independent provider of cross-platform software development tools and embedded components

Our capabilities cover different languages, code bases, and platforms. We meet development where – and how – it happens.

Founded: 1989

Headquarters: Louisville, CO

Employees: 350

Offices Worldwide: 11

Page 4

Meeting customer needs with capabilities

Page 5

Our products and services

Tools and libraries

• SourcePro: OS, database, network, and analysis abstraction for C++

• Visualization: Real-time data visualization at scale

• IMSL Numerical Libraries: Scalable math and statistics algorithms

• PV-WAVE: Visual data analysis

• HydraExpress: SOA/C++ modernization framework

• HostAccess: Terminal emulation for Windows

• Stingray: MFC GUI components

• OpenLogic Audits: Detailed open source license and security risk guidance

• OpenLogic Support: Enterprise-grade SLA support

• Klocwork: On-the-fly static code analysis for app security

• TotalView for HPC: Scalable debugging

• CodeDynamics: Commercial dynamic analysis

• Zend Server: Enterprise PHP app server

• Zend Studio: PHP IDE

• Zend Guard: PHP encoding and obfuscation

Page 6

TotalView for HPC

• Comprehensive multi-core and multi-threaded analysis and debug environment

– Thread-specific breakpoints

– Control individual thread execution

– View thread-specific stack and data

– View complex data types easily

• Integrated reverse debugging

• Track memory leaks in running applications

• Supports C/C++ on Linux

• Allowing the business to have

– Predictable development schedules

– Less time spent debugging

– Platform coverage: Linux, BG/Q, CUDA GPUs, Xeon Phi, Linux-PowerLE with GPUs, etc.

Page 7

LLNL/Sierra Focus Areas

• Collaborative work

– Rogue Wave, LLNL, IBM, NVIDIA, RWTH Aachen

• Focuses on three areas

– OpenMP 4 + GPUs debugging

– MPI+GPU debugger performance and scalability

– EVAL (conditional breakpoints) performance and scalability

DO NOT COPY OR REDISTRIBUTE WITHOUT WRITTEN PERMISSION

Page 8

OpenMP 4 + GPUs Debugging

• OpenMP 4 debugging support (CPUs and GPUs) for Sierra

• Collaborate on OpenMP Debug API (OMPD) design

• Three phases

– Phase 1: TotalView/OMPD: OMP3.1/CPU, x86_64

– Phase 2: TotalView/OMPD: OMP4/CPU/GPU, x86_64

– Phase 3: TotalView/OMPD: OMP4/CPU/GPU, PowerLE

• Phase 1 progress to date follows

– Draft of OMPD for OpenMP 3.1 completed

– RWTH Aachen implemented OMPD DLL for Intel OpenMP RTL

– TotalView/OMPD feature development progressing

Page 9

OMP Control Vars & Meta Info

Intel OMPD DLL currently returns no control variable information

Meta information shows version #, ID, and DLL path

Page 10

OMP Parallel Regions

Parallel region hierarchy at thread, process and group widths

Aggregated process/thread list: “#p:#t[dpid-range.dtid-range, …]”
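The aggregated list notation can be illustrated with a small sketch. It takes a set of (debugger process ID, debugger thread ID) pairs and prints them as "#p:#t[dpid-range.dtid-range, …]". The function names `compress` and `aggregate` are hypothetical, and the grouping rule used here (processes that own the same set of thread IDs are merged into one entry) is an assumption made for illustration, not TotalView's documented behavior.

```python
from collections import defaultdict


def compress(ids):
    """Collapse integers into comma-separated ranges, e.g. [1, 2, 3, 7] -> '1-3,7'."""
    ids = sorted(set(ids))
    spans = []
    start = prev = ids[0]
    for n in ids[1:]:
        if n == prev + 1:
            prev = n
            continue
        spans.append((start, prev))
        start = prev = n
    spans.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in spans)


def aggregate(threads):
    """Format (dpid, dtid) pairs as '#p:#t[dpid-range.dtid-range, ...]'."""
    threads = set(threads)
    tids_by_pid = defaultdict(set)
    for pid, tid in threads:
        tids_by_pid[pid].add(tid)

    # Assumption: processes with identical thread-ID sets share one entry.
    pids_by_tidset = defaultdict(list)
    for pid, tids in tids_by_pid.items():
        pids_by_tidset[frozenset(tids)].append(pid)

    entries = [
        f"{compress(pids)}.{compress(tids)}"
        for tids, pids in sorted(pids_by_tidset.items(), key=lambda kv: min(kv[1]))
    ]
    return f"{len(tids_by_pid)}:{len(threads)}[{', '.join(entries)}]"


# 64 processes, each with threads 1..500
print(aggregate({(p, t) for p in range(64) for t in range(1, 501)}))
# prints "64:32000[0-63.1-500]"
```

The compact form stays short no matter how many processes and threads share a state, which is why it suits displays that cover thousands of threads.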

Page 11

OMP Task Regions

Task region display is similar to parallel region display, but shows the task relationships

Page 12

OMP Threads

Thread-centric views of information available from OMPD

Page 13

OMP Stack Filtering

• “Raw”, unfiltered stack displays the OMP RTL stack frames

OMP RTL frames typically uninteresting to users

Page 14

OMP Stack Filtering

• OMPD allows the debugger to portably find and filter out OMP RTL frames

Page 15

OMP Master/Slave Stack Linking

Stack hyperlink “connects” a slave’s thread frame to its master’s thread frame

Selecting the frame jumps to the parent thread and stack frame that invoked the parallel region

Page 16

OMP Master/Slave Stack Linking

Clicking again “climbs” the parallel region tree, focusing on its parent

Page 17

OMP Master/Slave Stack Linking

Now at the “root” of the OMP parallel region tree
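The "climbing" behavior described on the last three slides amounts to following parent links until no parent remains. A minimal sketch, assuming a hypothetical `links` table that maps each thread ID to the thread and stack frame that invoked its parallel region (the thread IDs and frame names are made up for illustration):

```python
# Hypothetical snapshot: thread -> (parent thread, frame in the parent that
# invoked the parallel region). The root thread has no entry.
links = {
    "1.5": ("1.2", "outer_region"),
    "1.2": ("1.1", "main"),
}


def climb(links, tid):
    """Return the (thread, frame) pairs visited while climbing to the root."""
    path = []
    while tid in links:
        tid, frame = links[tid]
        path.append((tid, frame))
    return path


# Each element corresponds to one click on the stack hyperlink.
print(climb(links, "1.5"))
# prints "[('1.2', 'outer_region'), ('1.1', 'main')]"
```

The sketch assumes the links form a tree; a cycle in the table would make the loop run forever.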

Page 18

OMP Mangled Outlined-Functions

• OMP outlined-function name mangling is not standard

• DWARF could connect an outlined function to its containing function

– E.g., DW_AT_omp_outlined <containing-die>

• Instead of

“L_func_42__par_region0_1_2”

• Debugger could reliably show something like

“func (parallel region 1 at file.c#42)”

• Needed

– A DWARF OpenMP proposal

– A compiler developer to produce the DWARF
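To make the proposal concrete, here is a sketch of how a debugger might build the friendlier name if such an attribute existed. `DW_AT_omp_outlined` is the attribute suggested on the slide and is not part of any published DWARF standard; the dictionary layout, the `region_number` field, and the `display_name` function are assumptions made purely for illustration.

```python
# Hypothetical debug-info entries (DIEs), keyed by offset.
dies = {
    0x10: {"DW_AT_name": "func", "DW_AT_decl_file": "file.c"},
    0x20: {
        "DW_AT_name": "L_func_42__par_region0_1_2",
        "DW_AT_decl_file": "file.c",
        "DW_AT_decl_line": 42,
        "DW_AT_omp_outlined": 0x10,  # proposed: reference to the containing DIE
        "region_number": 1,          # assumption: also recorded by the compiler
    },
}


def display_name(dies, offset):
    """Prefer the containing function's name over the mangled outlined name."""
    die = dies[offset]
    container = die.get("DW_AT_omp_outlined")
    if container is None:
        return die["DW_AT_name"]
    owner = dies[container]["DW_AT_name"]
    return (
        f'{owner} (parallel region {die["region_number"]} '
        f'at {die["DW_AT_decl_file"]}#{die["DW_AT_decl_line"]})'
    )


print(display_name(dies, 0x20))
# prints "func (parallel region 1 at file.c#42)"
```

The point of the design is that the debugger never parses the mangled name, so it keeps working when a compiler changes its naming scheme.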

Page 19

OMP Variable Information

• Users have asked for “OpenMP variable information”

– E.g., private, shared, firstprivate, copyin, reduction, etc.

– Compile-time attributes of the variable that the compiler knows

– DWARF could represent these attributes

Page 20

OMP4 + GPUs

• OMP4 compilers currently produce no DWARF for GPU code

– IBM is working on a solution

• OMPD currently supports only OpenMP 3.1 (no GPUs)

– Specification must be extended for OpenMP 4 + GPU

– Seeking OMP4+GPU OMPD implementation

• TotalView modifications

– DEVICE and TARGET region support

– CUDA/GPU support

– Depends on the OMP RTL execution model

Page 21

OpenMP 4 Loose Ends

• Help push toward OMPD standardization

• OMPD for IBM/LOMP

– When IBM implements the DLL

– TotalView should be able to “just” use it

• OMP aggregated logical call tree

– Reassemble the structure of an executing OMP4 program into a logical call tree

Page 22

MPI+GPU Debugging at Scale

• Performance, scalability, and functionality on MPI+GPU targets

• Two phases of Application Driven Tuning (ADT) with GPUs

– Phase 1: Linux-x86_64 (LULESH/RAJA, HYPRE, LAMMPS)

– Phase 2: Linux-PowerLE (other benchmarks)

• NVIDIA CILP allows MPI processes to share GPUs on a node

– CILP (hardware pre-emption)

• CUDA Debug API limitations

– Requires creating a debug agent process per target process

– TotalView creates a “bushier” MRNet tree

– Future work

• Fix the API to support true multi-process debugging

• Add support for MPS debugging

Page 23

EVAL Point Performance and Scalability

• Support evaluating conditional breakpoints in the debugger servers

• Allows the interpreter to run in parallel in the servers

• Client contains “heavyweight” stuff: symbol data, compilers, IL generator

• Server remains “lightweight” adding a small IL interpreter

[Diagram: a TV Client (Lex, Parse, Compile, Generate IL; IL Interpreter) broadcasts the IL to eight TV Servers, each running its own IL Interpreter.]
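The client/server split above can be sketched in a few lines. The client does the expensive work once (parsing a condition and emitting a flat instruction list), and each server only runs a tiny interpreter over its own local data. Everything here is an illustrative assumption: the instruction names, the `compile_condition` and `run_il` functions, and the use of Python's `ast` module as the "compiler" are stand-ins, not TotalView's actual IL.

```python
import ast
import operator

# Client side ("heavyweight"): parse the condition and emit stack-machine IL.
_BINOPS = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "mul", ast.Mod: "mod"}
_CMPOPS = {ast.Lt: "lt", ast.Gt: "gt", ast.Eq: "eq", ast.NotEq: "ne"}


def compile_condition(source):
    il = []

    def emit(node):
        if isinstance(node, ast.Constant):
            il.append(("push", node.value))
        elif isinstance(node, ast.Name):
            il.append(("load", node.id))
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            emit(node.left)
            emit(node.right)
            il.append((_BINOPS[type(node.op)],))
        elif (isinstance(node, ast.Compare) and len(node.ops) == 1
              and type(node.ops[0]) in _CMPOPS):
            emit(node.left)
            emit(node.comparators[0])
            il.append((_CMPOPS[type(node.ops[0])],))
        else:
            raise ValueError("unsupported expression")

    emit(ast.parse(source, mode="eval").body)
    return il


# Server side ("lightweight"): a small interpreter, no parser or symbol data.
_OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul,
        "mod": operator.mod, "lt": operator.lt, "gt": operator.gt,
        "eq": operator.eq, "ne": operator.ne}


def run_il(il, variables):
    stack = []
    for op, *args in il:
        if op == "push":
            stack.append(args[0])
        elif op == "load":
            stack.append(variables[args[0]])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(_OPS[op](left, right))
    return stack.pop()


# Compile once on the client, then "broadcast" the same IL to every server.
il = compile_condition("i % 100 == 0")
servers = {0: {"i": 200}, 1: {"i": 250}, 2: {"i": 300}}  # hypothetical local state
hits = [rank for rank, local in servers.items() if run_il(il, local)]
print(hits)  # prints "[0, 2]"
```

Because only servers whose condition is true need to report back, the client is not a bottleneck for breakpoints that are hit often but rarely satisfied.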

Page 24

Aggregation

• TotalView has been enhanced to add new types of aggregation

– Aggregated process and thread status

• New root window

• CLI dstatus

– Aggregated stack back trace

• Graphical call tree window

• CLI dwhere

– Aggregated data

• CLI dprint

Page 25

Focus on Data Aggregation

• Makes it easy to collect the value of a variable or array from all threads

• Added support for aggregated data collection

• On CUDA, output is still printed for each thread

TotalView dprint command

dprint -gagg_1
Focus:64:32000[0-63.1-500]
0x00000000(0):64:16000[0-63.1,0-63.3,0-63.5,0-63.7,0-63.9,...]
0x00000001(1):64:16000[0-63.2,0-63.4,0-63.6,0-63.8,0-63.10,...]

Callouts on the output:

• Variable

• # of MPI ranks:# of threads[MPI rank range.threads]

• Data on specific value

• One line for each unique value of the variable

Portable across platforms and will be supported on Linux-PowerLE CORAL Sierra/Summit
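The idea of printing one line per unique value can be sketched as a simple group-by. The sketch below assumes a hypothetical `samples` mapping from (MPI rank, thread) to the variable's integer value, and it reproduces only the "value:ranks:threads" prefix of each output line; the function name and the handling of the bracketed rank/thread list are not taken from TotalView.

```python
from collections import defaultdict


def aggregate_values(samples):
    """Return one summary line per unique value in `samples`.

    `samples` maps (rank, thread) -> non-negative integer value.
    """
    holders_by_value = defaultdict(list)
    for where, value in samples.items():
        holders_by_value[value].append(where)

    lines = []
    for value in sorted(holders_by_value):
        holders = holders_by_value[value]
        ranks = {rank for rank, _ in holders}
        lines.append(f"0x{value:08x}({value}):{len(ranks)}:{len(holders)}")
    return lines


# 64 ranks x 500 threads; odd-numbered threads hold 0, even-numbered hold 1,
# mirroring the slide's example.
samples = {(r, t): (t + 1) % 2 for r in range(64) for t in range(1, 501)}
for line in aggregate_values(samples):
    print(line)
# prints "0x00000000(0):64:16000" then "0x00000001(1):64:16000"
```

With 32,000 threads and two distinct values, the output is two lines instead of 32,000, which is the practical benefit the slide points at.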

Page 26

Licensing

• Licensing is an issue

• Flexera doesn’t support Power with their FlexNet Publisher product

• This means we have to time bomb the product after a year

Page 27

New architecture

Page 28

New Architecture in building UI

[Diagram: layered architecture. Labels, in extraction order: Qt 4 Based Front-End; TotalView Debug Interface (TVDI); Back-End; TotalView DebugWire Protocol (TVDWP); Transport Mechanism; TotalView Debug Engine Interface (TVDEI); TotalView Debugger Engine (TVDE); Transport Mechanism; Front-End; Communication Channel; TotalView Debug Server; Process (three boxes)]

Used to be one large application

Page 29

Why does the architecture matter?

• This isn’t short-term thinking

– No need for X Windows on target platforms

– Performance

– Scalability

– Platforms

• More effective debugger on all platforms

• Easier 3rd party integrations

Page 30

Page 31

Multi-threading made easier

• How do you debug a problem in a 50-thread application that occurs in 1 thread?

– Without TotalView

• Set a breakpoint in code

• Run and hope you hit the right thread

– With TotalView

• Set thread-specific breakpoints

• Better multithreaded debugger

– Understand the state of all of your threads

– Focus on specific threads

• View stack and data

• Built to scale to HPC and leveraging that in mainstream commercial envs

Page 32

Viewing complex data types

• How do you inspect complex data types for changes?

– Without TotalView

• Look through pointer at memory

• Map to the data structures

• Recreate the data type by hand

– With TotalView

• View the data structure directly

• Users get to focus on debugging

– Complex data types support includes

• STL collections

• Large multi-dimensional arrays

• Boost collection classes

• C++11-specific types

Page 33

Reverse debugging

• How do you isolate an intermittent failure?

– Without TotalView

• Set a breakpoint in code

• Realize you ran past the problem

• Re-load

• Set breakpoint earlier

• Hope it fails

• Keep repeating

– With TotalView

• Start recording

• Set a breakpoint

• See failure

• Run backwards/forwards in context of failing execution

– Reverse Debugging

• Re-creates the context when going backwards

• Focus down to a specific problem area easily

• Saves days in recreating a failure

Page 34

How do you identify buffer overflows?

Runtime Memory Analysis: Eliminate Memory Errors

– Detects memory leaks before they are a problem

– Explore heap memory usage

Features

– Detects

• Malloc API misuse

• Memory leaks

• Buffer overflows

– Low runtime overhead

– Easy to use

• Works with vendor libraries

• No recompilation

• No instrumentation

Memory Analysis
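The slide does not say how the three error classes are detected. One common technique for catching buffer overflows without recompiling is to wrap each allocation in guard bytes and check them later, and the toy model below illustrates that idea only. The `GuardedHeap` class, its method names, and the guard pattern are all hypothetical; this is a sketch of the general approach, not a description of TotalView's implementation.

```python
GUARD = b"\xde\xad\xbe\xef"  # arbitrary pattern placed before and after each block


class GuardedHeap:
    """Toy allocator that tracks live blocks and surrounds them with guards."""

    def __init__(self):
        self.blocks = {}   # handle -> (raw bytes including guards, user size)
        self.next_handle = 0

    def malloc(self, size):
        self.next_handle += 1
        self.blocks[self.next_handle] = (bytearray(GUARD + bytes(size) + GUARD), size)
        return self.next_handle

    def free(self, handle):
        # Malloc API misuse: freeing something that is not a live block.
        if handle not in self.blocks:
            raise ValueError("free of unknown or already-freed block")
        del self.blocks[handle]

    def write(self, handle, offset, data):
        # Deliberately unchecked, like a raw pointer store in C.
        raw, _ = self.blocks[handle]
        start = len(GUARD) + offset
        raw[start:start + len(data)] = data

    def corrupted(self):
        """Handles of blocks whose leading or trailing guard was overwritten."""
        bad = []
        for handle, (raw, size) in self.blocks.items():
            tail = len(GUARD) + size
            if raw[:len(GUARD)] != GUARD or raw[tail:tail + len(GUARD)] != GUARD:
                bad.append(handle)
        return bad

    def leaks(self):
        """Handles of blocks that were never freed."""
        return sorted(self.blocks)


heap = GuardedHeap()
a = heap.malloc(8)
b = heap.malloc(8)
heap.write(a, 0, b"x" * 8)    # fits exactly
heap.write(b, 0, b"y" * 10)   # two bytes past the end: overwrites the guard
heap.free(a)
print(heap.corrupted(), heap.leaks())  # prints "[2] [2]"
```

Because the checks live in the allocator rather than in the application, the program under test needs no source changes, which matches the "no recompilation, no instrumentation" bullets.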

Page 35

Regression Testing

• How do you make sure a bug you fixed never returns?

– Build a regression test

– The issue is that it is typically time-consuming

• What is the method to build a regression test?

– Use the tools that helped you find it

• How do you run a regression test?

– Invoke it during your build process

• Enter TotalView scripts

– Command line driven

– Access to application internals

– Same commands as in the debugger

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Print
!
! Process:
!   ./server (Debugger Process ID: 1, System ID: 12110)
! Thread:
!   Debugger ID: 1.1, System ID: 3083946656
! Time Stamp:
!   06-26-2008 14:04:09
! Triggered from event:
!   actionpoint
! Results:
!   foreign_addr = {
!     sin_family = 0x0002 (2)
!     sin_port = 0x1fb6 (8118)
!     sin_addr = {
!       s_addr = 0x6658a8c0 (1717086400)
!     }
!     sin_zero = ""
!   }
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
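One way to turn output like this into a regression test is to parse the printed fields and compare them with known-good values during the build. The sketch below assumes the script output has been captured as text; the `LOG` string is an excerpt retyped from the slide (its indentation is a reconstruction), and the helper names `parse_fields` and `regressions` are hypothetical.

```python
import re

# Excerpt of the output shown on the slide (layout reconstructed).
LOG = """\
Results:
  foreign_addr = {
    sin_family = 0x0002 (2)
    sin_port = 0x1fb6 (8118)
    sin_addr = {
      s_addr = 0x6658a8c0 (1717086400)
    }
    sin_zero = ""
  }
"""

# Matches lines of the form "name = 0xHEX (DECIMAL)".
FIELD = re.compile(r"^[ \t]*(\w+) = 0x([0-9a-fA-F]+) \((\d+)\)[ \t]*$", re.MULTILINE)


def parse_fields(log):
    """Extract integer fields, checking that hex and decimal forms agree."""
    fields = {}
    for name, hex_text, dec_text in FIELD.findall(log):
        value = int(hex_text, 16)
        if value != int(dec_text):
            raise ValueError(f"inconsistent value for {name}")
        fields[name] = value
    return fields


def regressions(log, expected):
    """Return {field: (expected, actual)} for every field that differs."""
    actual = parse_fields(log)
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


print(regressions(LOG, {"sin_family": 2, "sin_port": 8118}))  # prints "{}"
```

A build step could fail whenever `regressions` returns a non-empty dictionary, so a fixed bug that reappears is caught automatically.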

Page 36

Scales to meet your need

• Support debugging on thousands of cores

– MRNet is built to multicast

– Aggregates data to/from cores

• Remote Display Client

– Debug on a remote machine

– Easy to configure

– Focus on your debugging
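The MRNet bullets above describe a tree that multicasts requests down and aggregates results on the way back up. The reduction half of that idea can be sketched as follows. This is not the MRNet API: the `reduce_tree` function, the fan-out value, and the per-process state strings are assumptions chosen only to show why aggregation in a tree keeps the front end from handling every process individually.

```python
from collections import Counter


def reduce_tree(leaf_states, fanout):
    """Merge per-process states level by level, `fanout` children per parent.

    Returns the merged counts and the number of tree levels traversed.
    """
    level = [Counter([state]) for state in leaf_states]
    hops = 0
    while len(level) > 1:
        # Each parent sums the counters of up to `fanout` children.
        level = [
            sum(level[i:i + fanout], Counter())
            for i in range(0, len(level), fanout)
        ]
        hops += 1
    return (level[0] if level else Counter()), hops


states = ["running"] * 1000 + ["stopped"] * 24   # 1024 hypothetical processes
totals, hops = reduce_tree(states, fanout=32)
print(dict(totals), hops)
# prints "{'running': 1000, 'stopped': 24} 2"
```

With a fan-out of 32, 1,024 processes are summarized in two levels, and the root only ever merges 32 partial results.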

