Computational Research Division (BIPS)
Leading Computational Methods on the Earth Simulator
Leonid Oliker
Lawrence Berkeley National Laboratory
Joint work with:
Julian Borrill, Andrew Canning, Stephane Ethier, Art Mirin, David Parks, John Shalf, David Skinner, Michael Wehner, Patrick Worley
Motivation
Stagnating application performance is a well-known problem in scientific computing
By the end of the decade, numerous mission-critical applications are expected to have 100X the computational demands of current levels
Many HEC platforms are poorly balanced for the demands of leading applications: memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology
Traditional superscalar trends are slowing down: most benefits of ILP and pipelining have been mined, and clock frequency is limited by power concerns
Achieving the next order of magnitude in computing will require O(100K)-O(1M) processors: interconnects whose cost scales nonlinearly (crossbar, fat-tree) become impractical
Application Evaluation
Microbenchmarks, algorithmic kernels, and performance modeling/prediction are important components of understanding and improving architectural efficiency
However, full-scale application performance is the final arbiter of system utility and is necessary as a baseline to support all complementary approaches
Our evaluation work emphasizes full applications, with real input data, at the appropriate scale
Requires coordination of computer scientists and application experts from highly diverse backgrounds; allows advanced optimizations for given architectural platforms
In-depth studies reveal limitations of compilers, operating systems, and hardware, since all of these components must work together at scale
Our recent efforts have focused on the potential of modern parallel vector systems
Effective code vectorization is an integral part of the process; first US team to conduct an Earth Simulator performance study
Vector Paradigm
High memory bandwidth: allows the system to effectively feed the ALUs (high byte-to-flop ratio)
Flexible memory addressing modes: support fine-grained strided and irregular data access
Vector registers: hide memory latency via deep pipelining of memory loads/stores
Vector ISA: a single instruction specifies a large number of identical operations
Vector architectures allow for reduced control complexity, efficient utilization of a large number of computational resources, and the potential for automatic discovery of parallelism
Can some of these features be added to modern microprocessors?
However: vectors are most effective when sufficient regularity is discoverable in the program structure; performance suffers even if only a small % of the code is non-vectorizable (Amdahl's Law), as the sketch below illustrates
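As an illustration of the regularity requirement (a minimal sketch of my own, not taken from any of the studied codes), the first loop below maps directly onto vector hardware, while the second carries a dependence between iterations and must run in scalar mode:

```fortran
! Hypothetical sketch contrasting a vectorizable loop with one that a
! vectorizing compiler must execute in scalar mode.
program vector_demo
  implicit none
  integer, parameter :: n = 1024
  real(8) :: x(n), y(n), a
  integer :: i

  a = 2.0d0
  x = 1.0d0
  y = 0.0d0

  ! Independent, unit-stride iterations: one stream of vector loads, a
  ! vector multiply-add, and vector stores covers many elements at once,
  ! amortizing memory latency across the full register length.
  do i = 1, n
     y(i) = a*x(i) + y(i)
  end do

  ! Loop-carried dependence: y(i) requires y(i-1) from the previous
  ! iteration, so this loop cannot vectorize and runs at scalar speed.
  do i = 2, n
     y(i) = y(i-1) + x(i)
  end do

  print *, y(n)
end program vector_demo
```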
ES Processor Overview
Cacheless architecture
8 CPUs per SMP node
8-way replicated vector pipes
72 vector registers, 256 64-bit words each
Divide unit
32 GB/s pipe to FPLRAM
4-way superscalar out-of-order scalar unit @ 1 GFlop/s
64KB I$ & D$
ES: specially developed FPLRAM (Full Pipelined RAM); SX-6: DDR-SDRAM 128/256Mb
ES: uses IN, 12.3 GB/s bi-directional between any 2 nodes, 640 nodes; SX-6: uses IXS, 8 GB/s bi-directional between any 2 nodes, max 128 nodes
In April 2002, the Earth Simulator (ES) became operational: Peak ES performance > all DOE and DOD systems combined
Remained #1 on the Top500 list for an unprecedented 2.5 years!
Earth Simulator Overview
Machine type: 640 nodes, each an 8-way SMP of vector processors (5120 processors total)
Machine peak: 40 TF/s (processor peak 8 GF/s)
OS: extended version of Super-UX, a 64-bit Unix OS based on System V-R3
Connection structure: single-stage crossbar network (1400 miles of cable, 83,000 copper cables); 7.9 TB/s aggregate switching capacity; 12.3 GB/s bi-directional between any two nodes
Global Barrier Counter within the interconnect allows global barrier synchronization in <3.5 usec
Storage: 480 TB disk, 1.5 PB tape
Compilers: Fortran 90, HPF, ANSI C, C++
Batch: similar to NQS, PBS
Parallelization: vectorization at the processor level; OpenMP, Pthreads, MPI, HPF across processors
No parallel I/O: each node has a separate file system
No remote access (until recently)
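For reference (my arithmetic, consistent with the figures quoted above), the machine peak follows directly from the processor count and per-processor peak:

$$640\ \text{nodes} \times 8\ \text{processors/node} \times 8\ \text{GFlop/s} = 40{,}960\ \text{GFlop/s} \approx 40\ \text{TFlop/s}$$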
Architectural Comparison
Node Type | Where | Network | CPU/Node | Clock (MHz) | Peak (GFlop/s) | STREAM BW (GB/s/P) | Peak byte/flop | MPI BW (GB/s/P) | MPI Latency (usec) | Network Topology
Power3 | NERSC | Colony | 16 | 375 | 1.5 | 0.4 | 0.26 | 0.13 | 16.3 | Fat-tree
Itanium2 | LLNL | Quadrics | 4 | 1400 | 5.6 | 1.1 | 0.19 | 0.25 | 3.0 | Fat-tree
Opteron | NERSC | InfiniBand | 2 | 2200 | 4.4 | 2.3 | 0.51 | 0.59 | 6.0 | Fat-tree
X1 | ORNL | Custom | 4 | 800 | 12.8 | 14.9 | 1.16 | 6.3 | 7.1 | 4D-Hypercube
X1E | ORNL | Custom | 4 | 1130 | 18.0 | 9.7 | 0.54 | 2.9 | 5.0 | 4D-Hypercube
ES | ESC | IN | 8 | 1000 | 8.0 | 26.3 | 3.29 | 1.5 | 5.6 | Crossbar
SX-8 | HLRS | IXS | 8 | 2000 | 16.0 | 41.0 | 2.56 | 2.0 | 5.0 | Crossbar
Custom vectors: high bandwidth, flexible addressing, many registers, vector ISA; less control complexity, many ALUs, automatic parallelism
ES shows the best balance between memory bandwidth and peak performance (see the worked ratio below)
Data caches of the superscalar systems and the X1(E) can potentially reduce memory costs
X1E: 2 MSPs per MCM increases contention for memory and interconnect
A key 'balance point' for vector systems is the scalar:vector ratio
Opteron/InfiniBand shows the best balance among the superscalars; Itanium2/Quadrics has the lowest latency
Cost is a critical metric, however we are unable to provide such data: pricing is proprietary and varies based on customer and time frame
Poorly balanced systems cannot solve certain classes of problems/resolutions regardless of processor count!
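As a hedged note on how the "Peak byte/flop" column can be read (my arithmetic, consistent with every row above): it is the per-processor STREAM bandwidth divided by the per-processor peak flop rate, e.g. for the ES and Power3:

$$\frac{26.3\ \text{GB/s}}{8.0\ \text{GFlop/s}} \approx 3.29\ \text{bytes/flop}, \qquad \frac{0.4\ \text{GB/s}}{1.5\ \text{GFlop/s}} \approx 0.26\ \text{bytes/flop}$$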
Application Overview
Name | Discipline | Problem/Method | Structure
Cactus | Astrophysics | Theory of General Relativity, ADM-BSSN, Method of Lines | Grid
LBMHD | Plasma Physics | Magneto-Hydrodynamics, Lattice-Boltzmann | Lattice/Grid
MADCAP | Cosmology | Cosmic Microwave Background, Dense Linear Algebra, high I/O | Dense Matrix
GTC | Magnetic Fusion | Particle-in-Cell, Vlasov-Poisson | Particle/Grid
PARATEC | Material Science | Density Functional Theory, Kohn-Sham, 3D FFT | Fourier/Grid
FVCAM | Climate Modeling | AGCM, Finite Volume, Navier-Stokes, FFT | Grid
Examining candidate ultra-scale applications with abundant data parallelism
Codes were designed for superscalar architectures and required vectorization effort
ES use requires passing minimum vectorization and parallelization hurdles
Astrophysics: CACTUS
Numerical solution of Einstein's equations from the theory of general relativity
Among the most complex in physics: a set of coupled nonlinear hyperbolic and elliptic systems with thousands of terms
CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
Evolves the PDEs on a regular grid using finite differences (see the stencil sketch below)
[Figure: visualization of a grazing collision of two black holes]
Developed at Max Planck Institute, vectorized by John Shalf, LBNL
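To make the regular-grid structure concrete, here is a minimal finite-difference sketch (an assumed stand-in, not the actual ADM-BSSN kernels, which contain thousands of terms): the contiguous inner loop over the x dimension supplies the vector length, which is why per-processor performance tracks the x extent as noted on the next slide.

```fortran
! Hypothetical sketch of a regular-grid finite-difference update.
! The unit-stride inner loop over i (the x dimension) is what the
! vector units on the ES/SX systems exploit.
program stencil_demo
  implicit none
  integer, parameter :: nx = 250, ny = 80, nz = 80
  real(8), allocatable :: u(:,:,:), unew(:,:,:)
  real(8) :: dt, h
  integer :: i, j, k

  allocate(u(nx,ny,nz), unew(nx,ny,nz))
  u = 1.0d0; unew = u; dt = 0.1d0; h = 1.0d0

  do k = 2, nz-1
     do j = 2, ny-1
        ! Unit-stride, independent iterations: fully vectorizable.
        do i = 2, nx-1
           unew(i,j,k) = u(i,j,k) + dt/(h*h) *                       &
                ( u(i+1,j,k) + u(i-1,j,k) + u(i,j+1,k) + u(i,j-1,k)  &
                + u(i,j,k+1) + u(i,j,k-1) - 6.0d0*u(i,j,k) )
        end do
     end do
  end do

  print *, unew(nx/2, ny/2, nz/2)
end program stencil_demo
```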
CACTUS: Performance
SX-8 attains the highest per-processor performance ever attained for Cactus
ES achieves the highest overall performance and efficiency to date: 39X faster than Power3!
Vector performance is related to the x-dimension (vector length)
Excellent scaling on ES using a fixed data size per processor (weak scaling); opens the possibility of computations at unprecedented scale
X1 is surprisingly poor (4X slower than ES) due to its low scalar:vector ratio
The unvectorized boundary condition required 15% of runtime on ES and 30+% on X1, versus <5% for the scalar version: unvectorized code can quickly dominate cost (see the worked example below)
Poor superscalar performance despite high computational intensity: register spilling due to the large number of loop variables, and prefetch engines inhibited by the multi-layer ghost-zone calculations
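A hedged back-of-the-envelope illustration of this effect (my numbers, not taken from the study): if a fraction $s$ of the original work cannot be vectorized and the remainder is sped up by a factor $V$, the non-vectorized part grows to occupy

$$\frac{s}{s + (1-s)/V}$$

of the total runtime. With $s = 0.05$ and a modest $V = 10$, that is already $0.05 / (0.05 + 0.095) \approx 34\%$ of the runtime, which is the sense in which a small non-vectorizable fraction quickly dominates.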
Problem size: 250x80x80 per processor (weak scaling)

P | Power3 (Seaborg) GF/s/P, %pk | Itanium2 (Thunder) GF/s/P, %pk | X1 (Phoenix) GF/s/P, %pk | SX-6* (ES) GF/s/P, %pk | SX-8 GF/s/P, %pk
16 | 0.10, 6% | 0.58, 10% | 0.81, 6% | 2.8, 35% | 4.3, 27%
64 | 0.08, 6% | 0.56, 10% | 0.72, 6% | 2.7, 34% | -
256 | 0.07, 5% | 0.55, 10% | 0.68, 5% | 2.7, 34% | -
Plasma Physics: LBMHD
LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
Performs 2D/3D simulation of high-temperature plasma, evolving from initial conditions and decaying to form current sheets
Spatial grid coupled to an octagonal streaming lattice, block-distributed over the processor grid
Main computational components: collision, stream, interpolation
Vectorization: loop interchange and unrolling (see the sketch below)
HPCS benchmark
Ported by Jonathan Carter; developed by George Vahala's group, College of William & Mary
[Figure: evolution of vorticity into turbulent structures]
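As a hedged sketch of the loop-interchange idea mentioned above (assumed structure and names, not the actual LBMHD source): the short loop over lattice streaming directions is moved outward so that the long, unit-stride grid loop becomes the innermost, vectorizable one.

```fortran
! Hypothetical sketch of loop interchange for vectorization.  f holds one
! distribution value per grid point and streaming direction; feq is the
! local equilibrium it relaxes toward in the collision step.
program lbm_interchange
  implicit none
  integer, parameter :: ngrid = 4096, ndir = 9
  real(8) :: f(ngrid, ndir), feq(ngrid, ndir)
  real(8), parameter :: tau = 0.6d0
  integer :: i, l

  f = 1.0d0; feq = 0.9d0

  ! After interchange: the short direction loop (ndir ~ O(10)) is outermost
  ! and the long grid loop is innermost, giving full vector length.
  do l = 1, ndir
     do i = 1, ngrid
        f(i, l) = f(i, l) - (f(i, l) - feq(i, l)) / tau
     end do
  end do

  print *, f(1, 1)
end program lbm_interchange
```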
LBMHD-3D: Performance
Grid Size | P | Power3 (Seaborg) | Itanium2 (Thunder) | Opteron (Jacquard) | X1 (Phoenix) | X1E (Phoenix) | SX-6 (ES) | SX-8 (HLRS)   [each entry: GF/s/P, %pk]
256^3 | 16 | 0.14, 9% | 0.26, 5% | 0.70, 16% | 5.2, 41% | 6.6, 37% | 5.5, 69% | 7.9, 49%
512^3 | 64 | 0.15, 9% | 0.35, 6% | 0.68, 15% | 5.2, 41% | 5.8, 32% | 5.3, 66% | 8.1, 51%
1024^3 | 256 | 0.14, 9% | 0.32, 6% | 0.60, 14% | 5.2, 41% | 6.0, 33% | 5.5, 68% | 9.6, 60%
2048^3 | 512 | 0.14, 9% | 0.35, 6% | 0.59, 13% | - | 5.8, 32% | 5.2, 65% | -
It is not unusual to see the vector systems achieve > 40% of peak while the superscalar architectures achieve ~10%
There is plenty of computation, but the large working set causes register spilling on the scalar systems
Unlike the superscalar approach, the large vector register sets hide latency
Opteron shows impressive superscalar performance, 2X the speed of Itanium2: Opteron has >2X the STREAM bandwidth, and Itanium2 cannot store FP data in L1 cache
ES sustains 68% of peak up to 4800 processors: 26 TFlop/s (SC2005 Gordon Bell finalist), the highest performance ever attained for this code by far
SX-8 shows the highest raw performance but lags behind ES in efficiency: commodity DDR2-SDRAM vs. the ES's high-performance custom FPLRAM
X1E achieved the same performance as the X1 using the original code version; turning off caching resulted in about a 10% improvement over the X1
Magnetic Fusion: GTC
Gyrokinetic Toroidal Code: transport of thermal energy via plasma microturbulence
The goal of magnetic fusion is a burning-plasma power plant producing clean energy
GTC solves the 3D gyroaveraged kinetic system with a particle-in-cell (PIC) approach
PIC scales as N instead of N^2: particles interact with the electromagnetic field on a grid
Allows solving the equations of particle motion as ODEs (instead of nonlinear PDEs)
Vectorization inhibited: multiple particles may attempt to concurrently update the same grid point during charge deposition
[Figure: whole volume and cross section of the electrostatic potential field, showing elongated turbulence eddies]
Developed at PPPL; vectorized/optimized by Stephane Ethier and ORNL/Cray/NEC
GTC Particle Decomposition
Vectorization: particle charge is deposited among the nearest grid points
Several particles can contribute to the same grid point, preventing vectorization
Solution: VLEN copies of the charge deposition array, with a reduction after the main loop (see the sketch below); greatly increases the memory footprint (8X)
GTC was originally optimized for superscalar SMPs using MPI/OpenMP; however, OpenMP severely increases memory use for the vector implementation, since vectorization and thread-level parallelism compete with each other
Previous vector experiments were limited to 64-way MPI parallelism: 64 is the optimal number of domains for the 1D toroidal decomposition (independent of the number of particles)
Updated GTC algorithm introduces a third level of parallelism: particles are split between several processors within each 1D domain
Allows increased concurrency and number of studied particles
Larger particle counts allow increased-resolution studies; particles are not subject to the Courant condition (same timestep); allows multiple-species calculations
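A hedged sketch of the duplicated-accumulator technique described above (variable names and sizes are my assumptions, not the GTC source): each vector lane accumulates into its own private copy of the charge array, removing the write conflict, and the copies are summed after the main loop.

```fortran
! Hypothetical sketch of vectorizing PIC charge deposition by replicating
! the accumulation array (one copy per vector lane) and reducing afterwards.
program charge_deposit
  implicit none
  integer, parameter :: ngrid = 1000, npart = 100000, vlen = 256
  real(8) :: charge(ngrid), charge_v(ngrid, vlen)
  real(8) :: w(npart)
  integer :: ig(npart)          ! grid point owning each particle
  integer :: ip, lane, g

  call random_number(w)
  ig = 1 + mod( (/ (ip, ip = 0, npart-1) /), ngrid )
  charge_v = 0.0d0

  ! Main deposition loop: consecutive particles in a vector slice write to
  ! different copies of the array, so no two lanes touch the same address.
  do ip = 1, npart
     lane = 1 + mod(ip - 1, vlen)
     charge_v(ig(ip), lane) = charge_v(ig(ip), lane) + w(ip)
  end do

  ! Reduction over the private copies; holding these replicas is the
  ! memory-footprint cost noted above.
  charge = 0.0d0
  do lane = 1, vlen
     do g = 1, ngrid
        charge(g) = charge(g) + charge_v(g, lane)
     end do
  end do

  print *, sum(charge)
end program charge_deposit
```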
GTC: Performance
The new decomposition algorithm efficiently utilizes high processor counts (as opposed to 64 on ES)
Breakthrough of the Tflop/s barrier on ES for an important SciDAC code: 7.2 Tflop/s on 4096 processors
SX-8 has the highest raw performance (of any architecture) but lower efficiency than ES
Opens the possibility of a new set of high phase-space-resolution simulations
Scalar architectures suffer from low computational intensity, irregular data access, and register spilling
Opteron/InfiniBand is 50% faster than Itanium2/Quadrics and only 1/2 the speed of X1: Opteron benefits from its on-chip memory controller and caching of FP data in L1
X1 suffers from the overhead of its scalar code portions
The original (unmodified) X1 version performed 12% *slower* on X1E; additional optimizations increased performance by 50%! Recent ORNL work increases performance an additional 75%
SciDAC code, HPCS benchmark
P | Part/Cell | Power3 (Seaborg) | Itanium2 (Thunder) | Opteron (Jacquard) | X1 (Phoenix) | X1E (Phoenix) | SX-6 (ES) | SX-8 (HLRS)   [each entry: GF/s/P, %pk]
128 | 200 | 0.14, 9% | 0.39, 7% | 0.59, 13% | 1.2, 9% | 1.7, 10% | 1.9, 23% | 2.3, 14%
256 | 400 | 0.14, 9% | 0.39, 7% | 0.57, 13% | 1.2, 9% | 1.7, 10% | 1.8, 22% | 2.3, 15%
512 | 800 | 0.14, 9% | 0.38, 7% | 0.51, 12% | - | 1.7, 9% | 1.8, 22% | -
1024 | 1600 | 0.14, 9% | 0.37, 7% | - | - | - | 1.8, 22% | -
Material Science: PARATEC
First-principles quantum mechanical total energy calculation using pseudopotentials and a plane-wave basis set
Density Functional Theory (DFT) to calculate the structure and electronic properties of new materials
DFT calculations are among the largest consumers of supercomputer cycles in the world
33% 3D FFT, 33% BLAS3, 33% hand-coded F90
Part of the calculation is in real space, the rest in Fourier space; uses a specialized 3D FFT to transform the wavefunctions
[Figure: conduction band minimum electron state for a Cadmium Selenide (CdSe) quantum dot]
Developed by Andrew Canning with Louie and Cohen's groups (LBNL, UCB)
Global transpose from Fourier to real space
Multiple simultaneous FFTs: reduces communication latency
Vectorize across (not within) the FFTs (see the sketch below)
Custom FFT: only nonzero elements are sent
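As a hedged illustration of "vectorize across, not within, FFTs" (my sketch, with a naive DFT standing in for the real custom FFT): many independent 1D transforms are performed together, and the innermost (vector) loop runs over the batch index, giving a long vector length even though each individual transform is short.

```fortran
! Hypothetical sketch of batching many independent 1-D transforms so the
! inner loop vectorizes across transforms rather than within one of them.
program batched_dft
  implicit none
  integer, parameter :: n = 64       ! transform length
  integer, parameter :: nbatch = 512 ! number of independent transforms
  complex(8) :: a(nbatch, n), b(nbatch, n)
  complex(8) :: w
  real(8) :: twopi
  integer :: j, k, ib

  twopi = 8.0d0 * atan(1.0d0)
  a = (1.0d0, 0.0d0)
  b = (0.0d0, 0.0d0)

  do k = 1, n
     do j = 1, n
        w = exp( cmplx(0.0d0, -twopi*real((j-1)*(k-1), 8)/real(n, 8), kind=8) )
        ! Unit-stride inner loop over the batch: long vector length.
        do ib = 1, nbatch
           b(ib, k) = b(ib, k) + w * a(ib, j)
        end do
     end do
  end do

  print *, b(1, 1)
end program batched_dft
```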
PARATEC: Performance
All architectures generally perform well due to the computational intensity of the code (BLAS3, FFT)
ES achieves the highest overall performance to date: 5.5 Tflop/s on 2048 processors
The main ES advantage for this code is its fast interconnect
Allows never-before-possible high-resolution simulations: the CdSe quantum dot is the largest cell-size atomistic experiment ever run using PARATEC
Important uses: quantum dots can be attached to organic molecules as electronic dye tags
SX-8 achieves the highest per-processor performance
X1/X1E show the lowest % of peak: non-vectorizable code is much more expensive on X1/X1E (32:1 scalar:vector ratio), and the 4D hypercube has a lower bisection-bandwidth-to-computation ratio; X1 performance is comparable to Itanium2
Itanium2 outperforms Opteron (unlike LBMHD/GTC) because PARATEC is less sensitive to memory access issues (high computational intensity) and Opteron lacks an FMA unit
Quadrics shows better scaling of all-to-all communication at large concurrencies
Problem: 488-atom CdSe quantum dot

P | Power3 (Seaborg) | Itanium2 (Thunder) | Opteron (Jacquard) | X1 (Phoenix) | X1E (Phoenix) | SX-6 (ES) | SX-8 (HLRS)   [each entry: GF/s/P, %pk]
128 | 0.93, 62% | 2.8, 51% | - | 3.2, 25% | 3.8, 21% | 5.1, 64% | 7.5, 49%
256 | 0.85, 67% | 2.6, 47% | 2.0, 45% | 3.0, 24% | 3.3, 18% | 5.0, 62% | 6.8, 43%
512 | 0.73, 49% | 2.4, 44% | 1.0, 22% | - | 2.2, 12% | 4.4, 55% | -
1024 | 0.60, 40% | 1.8, 32% | - | - | - | 3.6, 46% | -
Atmospheric Modeling: FVCAM
CAM3.1: atmospheric component of CCSM3.0; our focus is the Finite Volume (FV) approach
The AGCM consists of physics (PS) and a dynamical core (DC)
DC approximates the dynamics of the atmosphere
PS calculates the source terms for the equations of motion: turbulence, radiative transfer, clouds, boundary layer, etc.
DC default: Eulerian spectral transform, which maps onto the sphere but allows only a 1D decomposition in latitude
The FVCAM grid is rectangular (longitude, latitude, level), allowing a 2D decomposition (latitude, level) in the dynamics phase; the singularity at the pole prevents decomposition in longitude
Dynamical equations are bounded by Lagrangian surfaces, requiring remapping to an Eulerian reference frame
Hybrid (MPI/OpenMP) programming: MPI tasks are limited by the number of latitude lines; OpenMP increases potential parallelism and improves the surface-to-volume ratio, but is not always effective on some platforms
Experiments/vectorization: Art Mirin, Dave Parks, Michael Wehner, Pat Worley
[Figure: simulated Class IV hurricane at 0.5 degree resolution. This storm was produced solely through the chaos of the atmospheric model; it is one of the many events produced by FVCAM at a resolution of 0.5 degrees.]
FVCAM Decomposition and Vectorization
Processor communication topology and volume for the 1D and 2D FVCAM decompositions were generated with the IPM profiling tool, used to understand interconnect requirements
The 1D approach has straightforward nearest-neighbor communication
The 2D communication bulk is also nearest-neighbor; however, the pattern is more complex due to the vertical decomposition and the transposition during remapping
Total volume in the 2D remap is reduced due to the improved surface-to-volume ratio
Vectorization touched 5 routines, about 1000 lines:
Move the latitude calculation to the inner loops to maximize parallelism
Reduce the number of branches, performing logical tests in advance (indirect indexing; see the sketch below)
Vectorize across (not within) the FFTs for the polar filters
However, higher concurrency for a fixed-size problem limits the performance of the vectorized FFTs
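A hedged sketch of the branch-hoisting idea above (my own example, not the FVCAM source): the logical test is performed once in advance to build an index list, so the main compute loop is branch-free and relies on the vector hardware's indexed (gather/scatter) addressing.

```fortran
! Hypothetical sketch of hoisting a logical test out of the compute loop.
program gather_demo
  implicit none
  integer, parameter :: n = 10000
  real(8) :: q(n), flux(n)
  integer :: idx(n)
  integer :: i, m

  call random_number(q)
  flux = 0.0d0

  ! Pass 1: perform the logical test once, building an index list.
  m = 0
  do i = 1, n
     if (q(i) > 0.5d0) then
        m = m + 1
        idx(m) = i
     end if
  end do

  ! Pass 2: branch-free main loop over the gathered indices; the indirect
  ! access is handled by the vector unit's flexible addressing.
  do i = 1, m
     flux(idx(i)) = 2.0d0 * q(idx(i))
  end do

  print *, m, sum(flux)
end program gather_demo
```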
FVCAM3.1: Performance
The FVCAM 2D decomposition allows effective use of >2X as many processors; increasing the number of vertical discretizations (1, 4, 7) allows higher concurrencies
Results show high-resolution vector performance at 361x576x26 (0.5 x 0.625 degrees)
X1E achieves a simulated speedup of over 4500 at P=672, the highest ever achieved; Power3 is limited to a speedup of 600 regardless of concurrency
A factor of at least 1000X is necessary for the simulation to be tractable
Raw speed of X1E: 1.14X X1, 1.4X ES, 3.7X Thunder, 13X Seaborg
At high concurrencies (P=672) all platforms achieve a low % of peak (<7%); ES achieves the highest sustained performance (over 10% at P=256)
Vectors suffer from the short vector lengths of the fixed problem size, especially in the FFTs
Superscalars generally achieve lower efficiencies/performance than vectors
Finer resolutions require an increased number of more powerful processors
[Figure: simulated speedup (simulated years per wallclock year) vs. processor count for Power3, Itanium2, ES, X1, and X1E]
[Figure: % of theoretical peak for Power3, Itanium2, ES, X1, and X1E at P=32 (1D), P=256 (2D:4), P=336 (2D:7), and P=672 (2D:7)]
Cosmology: MADCAP
Microwave Anisotropy Dataset Computational Analysis Package
Optimal general algorithm for extracting key cosmological data from the Cosmic Microwave Background (CMB) radiation
Anisotropies in the CMB contain the early history of the Universe
Recasts the problem as dense linear algebra: ScaLAPACK
Out-of-core calculation: holds only ~3 of the 50 matrices in memory (see the size estimate below)
[Figure: temperature anisotropies in the CMB (Boomerang)]
Developed by Julian Borrill, LBNL
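As a hedged illustration of why the calculation must go out of core (my arithmetic, assuming double-precision dense matrices): at the largest pixel count analyzed below, a single dense matrix occupies

$$40{,}000^2 \times 8\ \text{bytes} \approx 12.8\ \text{GB},$$

so the full set of ~50 matrices is on the order of 640 GB, and only a few can be held resident while the rest are staged from disk.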
MADCAP: Performance
Overall performance can be surprisingly low for a dense linear algebra code
I/O takes a heavy toll on Phoenix and Columbia: I/O optimization is currently in progress
The NERSC Power3 shows the best system balance with respect to I/O
ES lacks high-performance parallel I/O; the code was rewritten to utilize local disks
HPCS I/O candidate benchmark
Number of Pixels | P | NERSC (Power3) | Columbia (Itanium2) | Phoenix (X1) | ES (SX-6*)   [each entry: Gflop/s/P, %peak]
10K | 64 | 0.73, 49% | 1.2, 20% | 2.2, 17% | 2.9, 37%
20K | 256 | 0.76, 51% | 1.1, 19% | 0.6, 5% | 4.0, 50%
40K | 1024 | 0.75, 50% | - | - | 4.6, 58%
[Figure: stacked runtime breakdown (LBST, Calc, MPI, I/O) for Seaborg, ES, Phoenix, and Columbia at P=16, 64, 256, and 1024]
Performance Overview
Tremendous potential of vector systems: ES achieves unprecedented aggregate performance on almost all test applications
LBMHD-3D achieves 26 TF/s using 4800 ES processors (68% of peak), a Gordon Bell finalist
GTC achieves 7.2 TF/s on 4096 ES processors
PARATEC achieves 5.5 TF/s on 2048 ES processors
ES consistently achieves the highest efficiency
Investigating how to incorporate vector-like facilities into modern microprocessors
To date we have looked mostly at regularly structured algorithms; many methods remain unexamined (and are more difficult to vectorize): implicit, multigrid, AMR, unstructured (adaptive), iterative and direct sparse solvers, N-body, fast multipole
One size does not fit all: a range of architectural designs is needed to best fit the algorithms
[Figure: % of theoretical peak for CACTUS, FVCAM, GTC, LBMHD3D, and PARATEC on SX-8, ES, X1, X1E, Power3, Itanium2, and Opteron]
[Figure: per-processor speedup relative to ES for CACTUS, FVCAM, GTC, LBMHD3D, and PARATEC on SX-8, ES, X1, X1E, Power3, Itanium2, and Opteron]