Operated by Los Alamos National Security, LLC for DOE/NNSA
LA-UR 09-02032
Total Work-Flow: Exploiting Hybrid Computing Architectures for Scientific Computing
Ben Bergen
Computational Physics (CCS-2)
Los Alamos National Laboratory
Brian Albright (X-1), Kevin Bowers (D.E. Shaw), Lin Yin (X-1), William Daughton (X-1)
ScicomP 15
Overview
Roadrunner System Overview
Basic Considerations and Programming Models
Adapting VPIC Kinetic Plasma Code to Roadrunner
Optimizing Total Workflow
Open Science on Roadrunner
Slide 2
Roadrunner is a Cluster
Slide 3
Roadrunner is a Cluster of Clusters
Slide 4
Roadrunner is a Cluster of Clusters with Accelerators
Slide 5
Triblade Compute Node
Slide 6
Original Blade Topology
Slide 7
One-to-one affinity between Opteron core and Cell processor
Newer versions of DaCS support two-to-one affinity
Not sure about four-to-one???
Roadrunner: Basic Considerations for Adaptation
First hybrid supercomputer of the current generation incorporating x86_64, PowerPC, and SPU ISAs.
Codes require three executables
  x86_64 executable runs on the Opteron host processor
  PowerPC executable runs on the Power Processing Element (PPE) accelerator processor
  SPU threads run on the eight Synergistic Processing Element (SPE) special-purpose vector unit processors
Three compilers: gcc, ppu-gcc, spu-gcc (also XL C/C++)
Design considerations: Process launch and synchronization
Slide 8
Roadrunner has three different architectures
Opteron
PowerPC
SPE
Roadrunner: Basic Considerations for Adaptation
Incorporates main memory on the Opteron and Cell eDP blades plus the local store user-controlled SRAM on the SPEs
Codes that run on Roadrunner must handle communication between these memory spaces
  Distributed memory communication between Opteron hosts
  Point-to-point communication between Opteron host and Cell accelerator
  Direct Memory Access (DMA) communication between Cell main memory and SPE local store memory
Opteron and Cell have different endianness
  Some byte-swapping is necessary
Cell blades are diskless
Design considerations: Communication and I/O
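The byte-swapping mentioned above can be sketched as follows. This is a minimal illustration, not code from VPIC: the names swap32 and swap_buffer32 are hypothetical. It relies only on the fact that the Opteron (x86_64) is little-endian and the Cell (PowerPC) is big-endian, so 32-bit words exchanged between them need their bytes reversed. Single-precision floats are best swapped as raw 32-bit patterns, which is what the buffer routine assumes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t swap32(uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Swap every 32-bit word of a buffer in place, e.g. before handing data
   from the little-endian host to the big-endian accelerator.
   Float payloads should be passed here as their raw bit patterns. */
static void swap_buffer32(uint32_t *buf, size_t n) {
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        buf[i] = swap32(buf[i]);
}
```

Applying the swap twice restores the original data, so the same routine can serve both directions of the host/accelerator exchange.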
Slide 9
Roadrunner has three different address spaces
Roadrunner: Basic Considerations for Adaptation
Process launch and synchronization: MPI, DaCS/ALF, libSPE2
Communication: MPI, DaCS/ALF, libSPE2
Slide 10
Multiple tools and programming models
  MPI
  DaCS
  libSPE2
Hierarchical/heterogeneous advantages
  Fault tolerance
    Faults can be caught at multiple levels
  Scalability
    Strong scalability is possible on SPEs
    Weak scalability through distributed memory
Programming Models: Host-Centric (Function Offload)
Slide 11
Opteron
Cell
Pros
  Allows staged development
    Existing MPI codes will run on Opterons
  Synchronous or asynchronous function offload to accelerator
  Minimizes reliance on PPE (poor performer!)
Cons
  Potential data-movement bottleneck
  Offload cost must be amortized by work done on accelerator
Programming Models: Accelerator-Centric
Slide 12
Opteron
Cell
Pros
  Also allows staged development
    Existing MPI codes will run on PowerPC (PPE)
  Hides complexity of hybrid architecture
  Avoids data-movement bottleneck
Cons
  Heavier reliance on PPE
  Computationally intensive portions of code must run on SPEs
  Requires “relay” to forward message traffic
Message Passing Relay
Slide 13
[Diagram: two Cell/Opteron pairs]
Direct point-to-point communication is not possible between Cells
Message Passing Relay
Slide 14
[Diagram: two Cell/Opteron pairs with data in transit]
Relay forwards messages through hosts to peer
Programming Models: All Roads Lead Everywhere
Slide 15
Opteron
Cell
There is a natural evolution of both of these approaches into a fully hybrid computing model
  Initial difference is in the program locus, or control process
  In the “evolved” model, the host process runs a task queue
  Tasks may be offloaded to other host-type cores or to accelerators
  Task data may live in the worker’s memory to avoid data-movement bottlenecks
  More to follow on how we can use this!
Scheduler
Cell/GPU
Opteron Core
Slide 16
Particle-In-Cell (PIC) Methods Simulate Plasma Physics
One application of VPIC is to simulate Laser Plasma Interactions (LPI) critical to understanding Inertial Confinement Fusion (ICF) at the National Ignition Facility (NIF)
Several difficulties arise during the compression of hohlraum capsules
  Laser scattering – not enough energy to compress capsule
  Laser scattering – laser does not target desired areas (asymmetric compression)
  Pre-heating – electrons heat plasma, making compression more difficult
LLNL pF3D modeling of a laser beam
VPIC modeling of a single laser speckle
Integrated LLNL Hydra modeling of ICF experiment
Slide 17
Particle-In-Cell Method
[Diagram: time-iteration loop over the spatial domain (grids and particles): Advance Particles, Accumulate Currents, Update Fields, Interpolate Field Effects]
Slide 18
VPIC – Vector Particle-In-Cell
3D, fully relativistic, electromagnetic Particle-In-Cell (PIC) code
  Self-consistent evolution of a kinetic plasma
  Charge conserving (no implicit solve)
Optimized for data motion
  Single precision – half the memory bandwidth/double the theoretical peak
  Single-pass particle processing
  Field interpolation coefficients are pre-computed
Optimized for modern architectures
  Uses short-vector SIMD intrinsics (SSE, Altivec, SPU)
Assumes that particles do not leave voxel in which they started
  Exceptions are handled separately
O(N) particle sorting
  Improves spatial locality of particle data access
  Improves temporal locality of field data access
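One common way to get an O(N) sort keyed on voxel index is a counting sort. The slide states only the complexity, so the choice of counting sort here is an assumption, and sort_by_voxel is a hypothetical name. The particle layout mirrors the particle_t struct shown later in the deck.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct particle {
    float   dx, dy, dz;  /* position (relative to voxel) */
    int32_t i;           /* index of voxel containing particle */
    float   ux, uy, uz;  /* particle normalized momentum */
    float   q;           /* particle charge */
} particle_t;

/* Counting sort of n particles by voxel index into 'out'.
   Runs in O(n + n_voxel) time and keeps the input order within a voxel.
   Returns 0 on success, -1 if the scratch array cannot be allocated. */
static int sort_by_voxel(const particle_t *in, particle_t *out,
                         size_t n, int32_t n_voxel) {
    assert(n_voxel > 0);
    size_t *start = calloc((size_t)n_voxel + 1, sizeof *start);
    if (start == NULL) return -1;

    /* Pass 1: histogram, stored one slot to the right. */
    for (size_t k = 0; k < n; ++k) {
        assert(in[k].i >= 0 && in[k].i < n_voxel);
        start[in[k].i + 1]++;
    }
    /* Prefix sum: start[v] becomes the first output slot for voxel v. */
    for (int32_t v = 0; v < n_voxel; ++v)
        start[v + 1] += start[v];
    /* Pass 2: scatter each particle to its voxel's next free slot. */
    for (size_t k = 0; k < n; ++k)
        out[start[in[k].i]++] = in[k];

    free(start);
    return 0;
}
```

After the sort, particles in the same voxel are contiguous, so consecutive particles touch the same field data; that is the locality benefit the slide describes.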
Slide 19
Porting to Roadrunner (things that we did)
Message Passing Relay (MP Relay)
  Flattens communication topology
  Allows logical point-to-point communication between Cell processors
  Abstracts remote I/O layer for restart and visualization dumps
Pipelined execution
  Code restructured for data-parallel thread execution
  Current support for serial, pthreads, and SPE threads
  Simple, common interface: init(), finalize(), execute(function_t), sync()
Particle data structures
  Optimized for efficient communication via DMA requests
  Can be tuned to cache size on traditional cached-memory architectures (padding)
Voxel cache (access to field data)
  Fully associative least recently used (LRU) policy
  Simple interface: voxel_cache_fetch() and voxel_cache_wait()
Text overlay support
  Allows acceleration of field advance, particle sorting and accumulators
Pipeline Abstraction
Slide 20
[Diagram: master thread timeline: init, execute, sync, finalize]
Worker threads block for execute message to reduce thread creation overhead
  pthreads implementation uses condition variables
  SPE implementation uses mailboxes
SPE symbols are exposed to the PPE through _SPUEAR_ linker magic
  Function call is implemented through mailbox message
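The pthreads flavor of this abstraction can be sketched as below. This is an illustration under assumptions, not VPIC's implementation: the pipeline_ prefix, the fixed worker count, the function_t signature (worker id as the only argument), and demo_kernel are all hypothetical. What it takes from the slide is the shape: workers are created once, block on a condition variable until execute posts work, and the master waits in sync.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_WORKERS 4
typedef void (*function_t)(int worker_id);   /* assumed signature */

static pthread_t       threads[N_WORKERS];
static int             ids[N_WORKERS];
static pthread_mutex_t lock       = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  work_done  = PTHREAD_COND_INITIALIZER;
static function_t      current_fn;
static unsigned long   generation;     /* bumped once per execute */
static int             pending;        /* workers still running current_fn */
static int             shutting_down;

static void *worker_main(void *arg) {
    int id = *(int *)arg;
    unsigned long seen = 0;
    pthread_mutex_lock(&lock);
    for (;;) {
        /* Block until new work is posted or shutdown is requested. */
        while (generation == seen && !shutting_down)
            pthread_cond_wait(&work_ready, &lock);
        if (generation == seen) break;           /* shutdown, nothing left */
        seen = generation;
        function_t fn = current_fn;
        pthread_mutex_unlock(&lock);
        fn(id);                                  /* run outside the lock */
        pthread_mutex_lock(&lock);
        if (--pending == 0) pthread_cond_signal(&work_done);
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void pipeline_init(void) {
    generation = 0; pending = 0; shutting_down = 0;
    for (int i = 0; i < N_WORKERS; ++i) {
        ids[i] = i;
        int rc = pthread_create(&threads[i], NULL, worker_main, &ids[i]);
        assert(rc == 0); (void)rc;
    }
}

/* Post one function to all workers; call pipeline_sync before the next one. */
static void pipeline_execute(function_t fn) {
    pthread_mutex_lock(&lock);
    assert(pending == 0);
    current_fn = fn;
    pending = N_WORKERS;
    ++generation;
    pthread_cond_broadcast(&work_ready);
    pthread_mutex_unlock(&lock);
}

static void pipeline_sync(void) {
    pthread_mutex_lock(&lock);
    while (pending > 0) pthread_cond_wait(&work_done, &lock);
    pthread_mutex_unlock(&lock);
}

static void pipeline_finalize(void) {
    pipeline_sync();
    pthread_mutex_lock(&lock);
    shutting_down = 1;
    pthread_cond_broadcast(&work_ready);
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < N_WORKERS; ++i) pthread_join(threads[i], NULL);
}

/* Hypothetical kernel: each worker updates only its own slot. */
static int partial[N_WORKERS];
static void demo_kernel(int id) { partial[id] += id + 1; }
```

A typical call sequence would be pipeline_init(), then repeated pipeline_execute(demo_kernel) and pipeline_sync() pairs, then pipeline_finalize(). The functions carry a prefix because a bare sync() would collide with the POSIX function of the same name.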
Slide 21
VPIC applies best strategy to particle advance
Data are processed in segments of even multiples of 16 particles
  Segments are accessed in blocks of up to 512 particles (16 KB largest possible single DMA request)
  Triple-buffered: streaming data paradigm (read, update, write)
Block processing groups particles in sets of 4
  Optimal for single-precision SIMD operations
  Inner loop is 4x hand unrolled

typedef struct particle {
  float dx, dy, dz; // position (relative to voxel)
  int32_t i;        // index of voxel containing particle
  float ux, uy, uz; // particle normalized momentum
  float q;          // particle charge
} particle_t;       // 32 bytes
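The numbers on this slide fit together arithmetically: a 32-byte particle and a 16 KB maximum DMA request give 16384 / 32 = 512 particles per block. The sketch below checks that at compile time and shows one way to walk a segment block by block. The names MAX_DMA_BYTES, BLOCK_PARTICLES, block_count and block_length are hypothetical, and the code requires C11 for _Static_assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct particle {
    float   dx, dy, dz;  /* position (relative to voxel) */
    int32_t i;           /* index of voxel containing particle */
    float   ux, uy, uz;  /* particle normalized momentum */
    float   q;           /* particle charge */
} particle_t;

enum {
    MAX_DMA_BYTES   = 16 * 1024,                          /* largest single DMA request */
    BLOCK_PARTICLES = MAX_DMA_BYTES / sizeof(particle_t)  /* 512 */
};

_Static_assert(sizeof(particle_t) == 32, "particle_t must be 32 bytes");
_Static_assert(BLOCK_PARTICLES == 512, "one block must fill one DMA request");

/* Number of DMA blocks needed to stream a segment of n particles. */
static size_t block_count(size_t n) {
    return (n + BLOCK_PARTICLES - 1) / BLOCK_PARTICLES;
}

/* Number of particles in block b (0-based) of a segment of n particles. */
static size_t block_length(size_t n, size_t b) {
    size_t first = b * BLOCK_PARTICLES;
    assert(first < n);
    size_t left = n - first;
    return left < BLOCK_PARTICLES ? left : (size_t)BLOCK_PARTICLES;
}
```

For example, a segment of 1040 particles (a multiple of 16) splits into two full blocks of 512 and a final block of 16.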
Slide 22
Overlays
VPIC’s particle advance logic maxes out the Local Store (LS)
  Particle advance data uses 206 KB
  This leaves ~50 KB for text (machine instructions)
Overlays are segments of text that can be loaded/unloaded from LS
  Expand the effective maximum size of an SPE program
  Avoid overhead of starting new SPE threads (prohibitive)
  Limited by management table size
IBM has implemented overlay support as a software cache
  Overlay manager fetches text that is not in LS (DMA call)
  No prefetch capability
Slide 23
Overlay Properties
[Diagram: SPE Local Store (Data, Root Segment, Region 1, Region 2) and Main Memory (segments SA, SB, SC, SD, SE, SF)]
Slide 24
Overlay Properties
Text is partitioned into regions with a static root segment
[Diagram: SPE Local Store (Data, and Text made up of the Root Segment, Region 1, Region 2) and Main Memory (segments SA, SB, SC, SD, SE, SF)]
Slide 25
Overlay Properties
Each region can be filled by specific segments of text
[Diagram: SPE Local Store (Data, Root Segment, Region 1, Region 2) and Main Memory (segments SA, SB, SC, SD, SE, SF), with the segments for region 1 highlighted]
Slide 26
Overlay Properties
The size of a region is determined by its largest segment
[Diagram: SPE Local Store (Data, Root Segment, Region 1, Region 2) and Main Memory (segments SA, SB, SC, SD, SE, SF); segment sizes shown: 32KB, 28KB, 32KB, 20KB]
Slide 27
Overlay Properties
Loading a new segment overwrites its respective region
[Diagram: SPE Local Store (Data, Root Segment, Region 1, Region 2) and Main Memory (segments SA, SB, SC, SD, SE, SF); segment SB is loaded into its region]
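The overlay properties on the preceding slides can be modeled in a few lines. This is a toy bookkeeping model, not IBM's overlay manager: segment_t, region_size, overlay_load and the two-region limit are hypothetical, and a real manager would issue a DMA transfer where this code only records which segment is resident.

```c
#include <assert.h>
#include <stddef.h>

#define N_REGIONS 2

typedef struct segment {
    const char *name;    /* label such as "SA" */
    int         region;  /* the one region this segment may occupy */
    size_t      bytes;   /* size of the segment's text */
} segment_t;

/* Which segment each region currently holds (NULL means empty). */
static const segment_t *resident[N_REGIONS];

/* A region must be as large as the largest segment assigned to it. */
static size_t region_size(const segment_t *segs, size_t n, int region) {
    size_t max = 0;
    for (size_t k = 0; k < n; ++k)
        if (segs[k].region == region && segs[k].bytes > max)
            max = segs[k].bytes;
    return max;
}

/* Make 'seg' resident in its region, overwriting whatever was there.
   Returns 1 if a load was needed, 0 if the segment was already resident. */
static int overlay_load(const segment_t *seg) {
    assert(seg->region >= 0 && seg->region < N_REGIONS);
    if (resident[seg->region] == seg) return 0;
    resident[seg->region] = seg;   /* a real manager would DMA the text here */
    return 1;
}
```

Because there is no prefetch, alternating calls between two segments that share a region pay for a load on every switch, which is why the code decomposition matters.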
Slide 28
Overlays
VPIC has been extended to support overlays
  Particle advance, accumulators and particle sorting have been accelerated
  Current code decomposition uses one region with two segments
Even this fairly trivial approach has difficulties
  VPIC overlay strategy uses stack for data buffers
  Stack placement changes with linkage (even with only trivial code changes)
  Silent overflows into memory handled by overlay manager
  Data corruption, hangs, segmentation faults…
May need to implement light-weight heap
  void * spu_malloc(), spu_free_all()
  Would reserve 224 KB of LS space for heap
  Actual implementation will use static byte array in root segment
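A light-weight heap of the kind proposed here could be a bump allocator over a static byte array. The sketch below is one possible shape, not the planned implementation: the slide gives no parameter list for spu_malloc, so the size_t argument is an assumption, as are the 16-byte alignment and the choice to return NULL when the heap is exhausted. It requires C11 for _Alignas.

```c
#include <assert.h>
#include <stddef.h>

#define SPU_HEAP_BYTES (224 * 1024)  /* heap size quoted on the slide */
#define SPU_ALIGN      16            /* assumed alignment for DMA buffers */

/* Static byte array standing in for the heap in the root segment. */
static _Alignas(SPU_ALIGN) unsigned char spu_heap[SPU_HEAP_BYTES];
static size_t spu_heap_used;

/* Hand out the next aligned chunk, or NULL if the heap is exhausted.
   Unlike a stack buffer, an oversized request fails loudly instead of
   silently overwriting neighboring memory. */
static void *spu_malloc(size_t bytes) {
    size_t rounded = (bytes + SPU_ALIGN - 1) & ~(size_t)(SPU_ALIGN - 1);
    assert(spu_heap_used % SPU_ALIGN == 0);
    if (rounded < bytes || rounded > SPU_HEAP_BYTES - spu_heap_used)
        return NULL;
    void *p = &spu_heap[spu_heap_used];
    spu_heap_used += rounded;
    return p;
}

/* Release everything at once; individual frees are not supported. */
static void spu_free_all(void) {
    spu_heap_used = 0;
}
```

The all-or-nothing free matches a streaming kernel that allocates its buffers at the start of a pass and drops them together at the end.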
Slide 29
VPIC Highlights
1.00 Trillion Particle Run (Poughkeepsie just after stand-up)
  Aggressive test of full system
  Achieved sustained performance of >374.25 TF
    11% of theoretical max performance (single precision 3.0 PF)
    Gordon Bell Prize Finalist SC2008
  Cell processes used 42.8 TB RAM (93.8% of available Cell memory)
  Opteron processes used 7.3 TB RAM
Science Runs: Back-scatter in laser plasma interactions
  Current bread-and-butter runs on 6 CUs (4,096 ranks : 32,768 threads)
  Next set of runs will be at 16 CUs (11,520 ranks : 92,160 threads)
  11x speedup over Opteron-only
  Excellent machine stability (main difficulty is I/O subsystem)
Efficiency is a Poor Metric for Performance
Real computational workflows expose many more bottlenecks
  Data I/O for visualization and restarts
  Data rendering for visualization
  Diagnostics and statistical analysis
Many of these steps can be handled concurrently
  Once the pipeline is full, we can fully subscribe a hybrid compute node
  Reduces vulnerability to machine instabilities by reducing total time to solution
  Special purpose accelerators can be targeted to specific tasks
  Host process only manages tasks
VPIC will be enhanced to address these issues
  Initial enhancements will use pipeline abstraction
  OpenCL implementation planned
Slide 30
Hybrid Computing Architectures Can Help Us!
Exploiting Hybrid Architectures
Slide 31
[Diagram: Host Process/Task Queue with a Scheduler; workers are Cell, Opteron, Opteron Core, and GPU; task types are Disk I/O, Rendering, and Computation]
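The host-side task queue in this picture can be sketched as a small FIFO whose entries are tagged with the kind of work they carry. Everything here is hypothetical (task_t, the queue capacity, the run_on_worker stand-in); it only illustrates the idea that the host process manages tasks while the work itself would run on whichever core or accelerator suits each kind.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TASK_COMPUTE, TASK_RENDER, TASK_DISK_IO, TASK_KINDS } task_kind_t;

typedef struct task {
    task_kind_t kind;  /* decides which worker type should run it */
    int         step;  /* simulation step the task belongs to */
} task_t;

#define QUEUE_CAP 16
typedef struct task_queue {
    task_t slot[QUEUE_CAP];
    size_t head, count;      /* ring buffer state */
} task_queue_t;

/* Append a task; returns -1 if the queue is full. */
static int queue_push(task_queue_t *q, task_t t) {
    if (q->count == QUEUE_CAP) return -1;
    q->slot[(q->head + q->count) % QUEUE_CAP] = t;
    q->count++;
    return 0;
}

/* Remove the oldest task; returns -1 if the queue is empty. */
static int queue_pop(task_queue_t *q, task_t *out) {
    if (q->count == 0) return -1;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 0;
}

/* Stand-in for offloading: a real scheduler would hand the task to an
   Opteron core, a Cell, or a GPU. Here we only count tasks per kind. */
static int handled[TASK_KINDS];
static void run_on_worker(const task_t *t) {
    assert(t->kind < TASK_KINDS);
    handled[t->kind]++;
}

/* Host loop: dispatch until the queue is empty; returns tasks dispatched. */
static int drain(task_queue_t *q) {
    int n = 0;
    task_t t;
    while (queue_pop(q, &t) == 0) {
        run_on_worker(&t);
        ++n;
    }
    return n;
}
```

In a concurrent version the dispatch would be asynchronous, so that rendering and disk I/O for one step overlap with computation for the next.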
Commodity Nodes Like This Already Exist!
Scalable Informatics – Pegasus GPU+Cell Node
  4-16 AMD or Intel cores
  8-128 GB RAM
  One or more Tesla GPU cards
  One or more GA-180 (PXCAB) Cell cards
Slide 32
How can we develop for a cluster of such nodes?
OpenCL – One Possibility
OpenCL is a programming framework for accelerated compute nodes
  Runtime – handles work distribution and JIT compilation
    Fully static embedded kernels are supported
    Topology interrogation
  API – process launch, communication and synchronization
  OpenCL C – kernel programming language
    Abstraction layer for SIMD vector types and intrinsics
    Explicit dependency specification of kernel parameters
Host process controls one or more attached devices
Still missing
  Heterogeneous device support
  Support for clusters (host-to-host communication)
  Build/configuration system
Slide 33
Hybrid Compute Node
Slide 34
Hybrid OpenMPI
  Working version in use on Roadrunner architecture
  Extends MPI Interface
  Process launch on multiple architectures
  Introduces hierarchical communicators
  Available in next release
OpenCL
  No current support for peer-to-peer communication between hosts
Others
  CellSs, OpenMP
[Diagram: nodes joined by an IB Interconnect; each node has a Node Control Process, an Abstraction Layer over OpenCL, OpenMPI (h) and OpenMPI, and C/C++/OpenCL C Kernel Logic]
Open Science on Roadrunner
Internal peer-reviewed process identified 9 projects
  Bio-Fuels, Astrophysics, Plasma Physics, Phylogenetics, Atmospheric Science, Molecular Dynamics
Projects have been awarded allocations on full machine
  Many are currently underway
  Window of 3-4 months before the machine goes behind-the-fence
Cerrillos
  162 TF Roadrunner architecture (2 CUs)
  Call has been issued and proposals are being evaluated
  Allocations will begin soon!
Education
  Hands-On Cell programming class (second offering currently underway)
  Student program
  Development allocations for collaboration
Slide 35
Slide 36
Supernovae
The final event in the evolution of a sufficiently massive star is a supernova
SNSPH, developed at LANL by Chris Fryer and Mike Warren, is a parallel three-dimensional smoothed particle hydrodynamics code
Simulations conducted on Roadrunner will allow comparison with data from actual light curves and spectra from supernova observations
  Large Synoptic Survey Telescope (LSST) and the Joint Dark Energy Mission (JDEM)
Work on Roadrunner will extend these simulations by calculating light curves and spectra from full radiation-hydrodynamic models of these explosions.
Shock Compression of Metals
Dislocation interactions such as line defects determine the strength of metals
Roadrunner will finally allow us to realize the promise of computational science by bridging the gap between simulation and experiment
Slide 37
This animation shows a shock front traveling through polycrystalline Fe causing a phase transformation from the bcc (gray) to hcp (red) and fcc (green) structure
Turbulent Mixing in Buoyancy Driven Flows
Material mixing to molecular scale in the presence of turbulence-induced stirring is an important process in many areas
Most studies to date address the Boussinesq case
Significant and unexpected differences in the mixing process occur as the material density parameters diverge
These animations highlight the complexity of the mixing process, illustrating the new physics associated with mixing at large density differences
Slide 38
Laser Plasma Interactions (LPI)
Slide 39
Thanks!
Our HPC Division staff is committed to making Roadrunner succeed
  Meghan Wingate
  Mark Vernon
  Phil Church
  Randall Rheinheimer
Applications’ developers have done amazing work
  Sriram Swaminarayan
  Tim Kelley
  Paul Henning
  Jamal Mohd-Yusof
Special thanks to Larry Cox for funding Khronos membership!
Slide 40