Scalable Debugging with TotalView on Blue Gene
John DelSignore, CTO, TotalView Technologies
TotalView Technologies – Confidential and Proprietary – Plans Subject to Change without Notice
www.totalviewtech.com
Agenda
• TotalView on Blue Gene
  – A little history
  – Current status
• Recent TotalView improvements
  – ReplayEngine (reverse debugging)
  – Remote Display
  – TotalView Script (batch debugging)
• Future work
  – BG/*
  – Heterogeneous systems
  – Many core, transactional memory, speculative execution
  – Petascale debugging
Supported Blue Gene Architectures and Compilers
• Blue Gene/L and Blue Gene/P
• Languages / Compilers
  – C/C++, Fortran, Assembly
  – GNU Compilers
  – IBM Compilers
  – IBM OpenMP (on BG/P)
• Parallel Environments
  – IBM MPI
  – IBM OpenMP (on BG/P)
  – Pthreads (BG/P)
• Runtime linking/loading (BG/P)
  – Shared libraries
  – Dynamically loaded shared libraries
Blue Gene Architecture
• TotalView client (GUI/CLI) runs on the Front End node
• Client communicates with the TotalView debugger servers running on the I/O nodes via a socket
• The debugger servers communicate with the CIOD to control processes and threads running on the Compute nodes
• Fanout ratios (CNs/server)
  – BG/L: 32-64, 2 cores/CN, 128 threads/server
  – BG/P: 128-256, 4 cores/CN, 1024 threads/server
  – Ratio increasing (8K thr/svr?)
  – Parallelize server operation
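The threads-per-server figures follow from multiplying the upper fanout ratio by the cores per compute node. The sketch below reproduces that arithmetic in Python; it assumes one thread per core, and `servers_needed` is a hypothetical helper added only for illustration, not something the slides describe.

```python
import math

def threads_per_server(cns_per_server: int, cores_per_cn: int) -> int:
    """Threads one debugger server handles, assuming one thread per core."""
    return cns_per_server * cores_per_cn

def servers_needed(total_threads: int, per_server: int) -> int:
    """Debugger servers required to cover a job (hypothetical helper)."""
    return math.ceil(total_threads / per_server)

# Upper fanout ratios quoted on the slide.
bgl = threads_per_server(64, 2)    # BG/L: 64 CNs/server, 2 cores/CN
bgp = threads_per_server(256, 4)   # BG/P: 256 CNs/server, 4 cores/CN

print(f"BG/L: {bgl} threads/server, BG/P: {bgp} threads/server")
print("Servers for an 8,192-thread BG/L job:", servers_needed(8192, bgl))
```

At the maximum BG/L fanout, an 8,192-thread job would need 64 debugger servers under these assumptions; a lower fanout ratio means proportionally more servers.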
TotalView Blue Gene/L Support
• TotalView involvement since 2003
• Support for Blue Gene/L since 2005
• Debugging interfaces developed via close collaboration with IBM
• Used on DOE/NNSA/LLNL's Blue Gene/L system containing 212 K cores
  – Heap memory debugging support added
  – Blue Gene/L scaling and performance tuning project
  – TotalView has debugged jobs as large as 8,192 processes (LLNL)
• Work on Blue Gene/L facilitated Blue Gene/P support
TotalView Blue Gene/P Support
• Blue Gene/P supported since Q4 2007
• Continued close collaboration with IBM to develop multithreaded debugging interfaces
• Support for shared libraries and dynamically loaded libraries
• Scalability improvements
• TotalView has debugged jobs as large as 32K (Jülich)
TotalView Blue Gene/P Sites
• Currently running at over 30 sites in Germany, France, UK, and US, including
  – Argonne
  – Boston University
  – Daresbury
  – IDRIS
  – Jülich
  – LLNL
  – Max Planck
  – ORNL
  – Princeton University
  – Rensselaer Polytechnic Institute
• Jülich workshop, March 08
• Argonne workshop, May 08
Recent TotalView Improvements on Blue Gene and Linux
• Remote Display
  – Run a remote version of the TotalView GUI…
  – …display it locally, with fast, interactive performance
  – Easy, fast, secure
• tvscript
  – Simplifies debugging batch jobs
  – Event/action paradigm
  – Configurable
• ReplayEngine
  – Step execution back in time
  – Uses reverse debugging technology
  – Linux x86 and x86-64 (currently only)
Remote Display
• Presents a window on your machine that will display TotalView executing on a remote system
• Two components:
  – Client, which runs on the local system; available for Linux x86 and x86-64, Windows XP, and Vista
  – Server, which runs on any system supported by TotalView, invisibly managing the connections between the host and client
• The Client also provides for submission of jobs to batch queuing systems PBS Pro and LoadLeveler
Batch Scripting
• Designed for debugging in a batch environment
• tvscript lets you define the events to act on and the actions to take when an event occurs
• Typical events
  – Action point (e.g., breakpoint)
  – Memory error (e.g., malloc returns 0, guard block corruption)
  – Errors (e.g., SEGV, FPE)
• Typical actions
  – Display a backtrace
  – List memory leaks
  – Print variables and arrays
• Configurable
  – Supports external script files
  – Allows generation of even more complex actions and events
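The event/action paradigm can be pictured as a table that maps each event type to a list of actions to run when that event occurs. The Python sketch below is only an illustration of that idea: the class `EventActionTable`, the event names, and the action functions are all hypothetical and do not reflect tvscript's actual syntax or implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# An action takes the event context and returns a line for the batch log.
Action = Callable[[dict], str]

class EventActionTable:
    """Hypothetical event -> actions mapping, in the spirit of a batch debugger."""

    def __init__(self) -> None:
        self._actions: Dict[str, List[Action]] = defaultdict(list)

    def on(self, event: str, action: Action) -> None:
        """Register an action to run whenever `event` occurs."""
        self._actions[event].append(action)

    def fire(self, event: str, context: dict) -> List[str]:
        """Run every action registered for `event`, in registration order."""
        return [action(context) for action in self._actions.get(event, [])]

def display_backtrace(ctx: dict) -> str:
    return "backtrace: " + " <- ".join(ctx["frames"])

def print_variable(ctx: dict) -> str:
    return f"{ctx['name']} = {ctx['value']}"

table = EventActionTable()
table.on("error", display_backtrace)          # e.g. SEGV, FPE
table.on("actionpoint", print_variable)       # e.g. a breakpoint was hit
table.on("actionpoint", display_backtrace)

for line in table.fire("actionpoint",
                       {"name": "x", "value": 42, "frames": ["solve", "main"]}):
    print(line)
```

Because the job runs unattended, each action writes to a log rather than waiting for user input, which is what makes this style suitable for batch queues.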
ReplayEngine
Step forward over functions
Step forward into functions
Advance forward out of current function, after the call
Advance forward to selected line
Step backward over functions
Step backward into functions
Advance backward out of current function, to before the call
Advance backward to selected line
Advance forward to “live” session
• Intuitive user interface, integrated with TotalView
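The controls above can be thought of as moving a cursor over a recorded execution history, with the "live" session at its end. The toy Python sketch below illustrates that mental model only; `ReplayLog` and its methods are hypothetical names, and nothing here describes how ReplayEngine is actually implemented.

```python
class ReplayLog:
    """Toy recorded history of program states with a movable cursor."""

    def __init__(self, initial_state: dict) -> None:
        self._history = [dict(initial_state)]
        self._cursor = 0

    def record(self, state: dict) -> None:
        """Append a new state; only valid while at the live end of the history."""
        if self._cursor != len(self._history) - 1:
            raise RuntimeError("return to the live session before recording")
        self._history.append(dict(state))
        self._cursor += 1

    def step_back(self) -> dict:
        """Move one recorded state backward, stopping at the start."""
        if self._cursor > 0:
            self._cursor -= 1
        return self._history[self._cursor]

    def step_forward(self) -> dict:
        """Move one recorded state forward, stopping at the live end."""
        if self._cursor < len(self._history) - 1:
            self._cursor += 1
        return self._history[self._cursor]

    def go_live(self) -> dict:
        """Jump to the most recent recorded state (the live session)."""
        self._cursor = len(self._history) - 1
        return self._history[self._cursor]

log = ReplayLog({"line": 1, "x": 0})
log.record({"line": 2, "x": 5})
log.record({"line": 3, "x": 7})

print(log.step_back())   # state recorded at line 2
print(log.step_back())   # state recorded at line 1
print(log.go_live())     # back to the latest state, line 3
```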
Possible Future Blue Gene Work
• BG/* support
  – Support future generations of Blue Gene
• Fast conditional breakpoints/watchpoints
  – Expressions compiled/patched into target, execute in parallel, about 10 usecs/expression
• Asynchronous thread control
  – Thread barrier breakpoint, thread single stepping
• User programmable visual data
  – Allows the user to define complex data access functions
• Debugging optimized code
• Postmortem debugging
• Fast DLL debugging interface
• LLNL collaboration for scalable subset attach
  – Integrates with lightweight tools such as STAT
Possible Other Future Work
• Scalability/performance
  – Continue scalability and performance improvements
  – Tree-based infrastructure for logarithmic scaling
  – Petascale debugging
  – Hundreds of thousands of threads
• Heterogeneous systems
  – IBM Roadrunner (x86-64/Cell)
  – GPUs
• Emerging technologies
  – Many core
  – Transactional memory
  – Speculative execution
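To see why a tree-based infrastructure gives logarithmic scaling, compare the number of levels in a fanout tree with the number of endpoints it reaches. The Python sketch below computes that depth; the function name `tree_depth` and the fanout of 32 are assumptions chosen for illustration, not figures from the slides.

```python
def tree_depth(leaves: int, fanout: int) -> int:
    """Levels needed for a tree with the given fanout to reach `leaves` endpoints."""
    if leaves < 1 or fanout < 2:
        raise ValueError("need leaves >= 1 and fanout >= 2")
    depth, reach = 0, 1
    while reach < leaves:
        reach *= fanout   # each extra level multiplies the reach by the fanout
        depth += 1
    return depth

# Hypothetical fanout of 32 children per tree node.
for threads in (1_024, 100_000, 1_000_000):
    print(f"{threads:>9} threads -> {tree_depth(threads, 32)} levels")
```

With a flat layout, the client's work grows linearly with the number of endpoints; with a tree, each node handles at most `fanout` children and the number of hops grows only with the logarithm of the endpoint count.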
Questions?
More Information
• Blue Gene Technical Development Interest Group
  – Contact [email protected]
• Technical support
  – [email protected]
• BG LLNL case study
  – www.totalviewtech.com/pdf/case_study_scientific_computing.pdf
• Customer training or webinars
  – [email protected]
• Web site
  – www.totalviewtech.com