OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY
The Center for Computational Sciences
100 TF Sustained on Cray X Series
SOS 8
April 13, 2004
James B. White III (Trey)
Disclaimer
The opinions expressed here do not necessarily represent those of the CCS, ORNL, DOE, the Executive Branch of the Federal Government of the United States of America, or even UT-Battelle.
Disclaimer (cont.)
- Graph-free, chart-free environment
- For graphs and charts, see:
  http://www.csm.ornl.gov/evaluation/PHOENIX/
100 Real TF on Cray Xn
- Who needs capability computing?
- Application requirements
- Why Xn?
- Laundry, Clean and Otherwise
- Rants
  - Custom vs. Commodity
  - MPI
  - CAF
  - Cray
Who needs capability computing?
- OMB?
- Politicians?
- Vendors?
- Center directors?
- Computer scientists?
Who needs capability computing?
- Application scientists
- According to the scientists themselves
Personal Communications
- Fusion: General Atomics, Iowa, ORNL, PPPL, Wisconsin
- Climate: LANL, NCAR, ORNL, PNNL
- Materials: Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin
- Biology: NCI, ORNL, PNNL
- Chemistry: Auburn, LANL, ORNL, PNNL
- Astrophysics: Arizona, Chicago, NC State, ORNL, Tennessee
Scientists Need Capability
- Climate scientists need simulation fidelity to support policy decisions
  - All we can say now is that humans cause warming
- Fusion scientists need to simulate fusion devices
  - All we can do now is model decoupled subprocesses at disparate time scales
- Materials scientists need to design new materials
  - Just starting to reproduce known materials
Scientists Need Capability (cont.)
- Biologists need to simulate proteins and protein pathways
  - Baby steps with smaller molecules
- Chemists need similar increases in complexity
- Astrophysicists need to simulate nucleosynthesis (high-res 3D CFD, 6D neutrinos, long times)
  - Today: low-res 3D CFD, approximate 3D neutrinos, short times
Why Scientists Might Resist
- Capacity also needed
- Software isn't ready
- Coerced to run capability-sized jobs on inappropriate systems
Capability Requirements
- Sample DOE SC applications
  - Climate: POP, CAM
  - Fusion: AORSA, Gyro
  - Materials: LSMS, DCA-QMC
Parallel Ocean Program (POP)
- Baroclinic: 3D, nearest-neighbor, scalable; memory-bandwidth limited
- Barotropic: 2D implicit system, latency-bound
- Ocean-only simulation
  - Higher resolution
  - Faster time steps
- As ocean component for CCSM
  - Atmosphere dominates
Community Atmospheric Model (CAM)
- Atmosphere component for CCSM
- Higher resolution?
  - Physics changes; parameterization must be retuned, model must be revalidated
  - Major effort, rare event
  - Spectral transform not dominant
- Dramatic increases in computation per grid point
  - Dynamic vegetation, carbon cycle, atmospheric chemistry, …
- Faster time steps
All-Orders Spectral Algorithm (AORSA)
- Radio-frequency fusion-plasma simulation
- Highly scalable
  - Dominated by ScaLAPACK
  - Still in weak-scaling regime
- But…
  - Expanded physics reducing ScaLAPACK dominance
  - Developing sparse formulation
Gyro
- Continuum gyrokinetic simulation of fusion-plasma microturbulence
- 1D data decomposition
- Spectral method: high communication volume
- Some need for increased resolution
- More iterations
Locally Self-Consistent Multiple Scattering (LSMS)
- Calculates electronic structure of large systems
- One atom per processor
- Dominated by local DGEMM
- First real application to sustain a TF
- But… moving to sparse formulation with a distributed solve for each atom
Dynamic Cluster Approximation (DCA-QMC)
- Simulates high-temperature superconductors
- Dominated by DGER (BLAS2); memory-bandwidth limited
- Quantum Monte Carlo, but…
  - Fixed start-up cost per process favors fewer, faster processors
  - Needs powerful processors to avoid parallelizing each Monte Carlo stream
Few DOE SC Applications
- Weak-ish scaling
- Dense linear algebra
  - But moving to sparse
Many DOE SC Applications
- "Strong-ish" scaling
  - Limited increase in grid points
  - Major increase in expense per grid point
  - Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication
Why X1?
- "Strong-ish" scaling
  - Limited increase in grid points
  - Major increase in expense per grid point
  - Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication
Tangent: Strongish* Scaling
- Firm
- Semistrong
- Unweak
- Strongoidal
- MSTW (More Strong Than Weak)
- JTSoS (Just This Side of Strong)
- WNS (Well-Nigh Strong)
- Seak, Steak, Streak, Stroak, Stronk
- Weag, Weng, Wong, Wrong, Twong
* Greg Lindahl, Vendor Scum
X1 for 100 TF Sustained?
- Uh, no
- OS not scalable or fault-resilient enough for 10^4 processors
- That "price/performance" thing
- That "power & cooling" thing
Xn for 100 TF Sustained
- For DOE SC applications, YES
- Most-promising candidate
  -or-
- Least-implausible candidate
Why X, again?
- Most-powerful processors
  - Reduce need for scalability
  - Obey Amdahl's Law
- High memory bandwidth
  - See above
- Globally addressable memory
  - Lowest, most hide-able latency
  - Scale latency-bound applications
- High interconnect bandwidth
  - Scale bandwidth-bound applications
The Bad News
- Scalar performance
- "Some tuning required"
- Ho-hum MPI latency
  - See Rants
Scalar Performance
- Compilation is slow
- Amdahl's Law for single processes
  - Parallelization -> Vectorization
- Hard to port GNU tools
  - GCC? Are you kidding?
  - GCC compatibility, on the other hand…
- Black Widow will be better
“Some Tuning Required”
- Vectorization requires:
  - Independent operations
  - Dependence information
  - Mapping to vector instructions
- Applications take a wide spectrum of steps to inhibit this
  - May need a couple of compiler directives
  - May need extensive rewriting
Application Results
- Awesome
- Indifferent
- Recalcitrant
- Hopeless
Awesome Results
- 256-MSP X1 already showing unique capability
- Apps bound by memory bandwidth, interconnect bandwidth, or interconnect latency
  - POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, …
- Many examples from DoD
Indifferent Results
- Cray X1 is brute-force fast, but not cost-effective
- Dense linear algebra
  - Linpack, AORSA, LSMS
Recalcitrant Results
- Inherent algorithms are fine
- Source code or ongoing code mods don't vectorize
- Significant code rewriting done, ongoing, or needed
  - CLM, CAM, Nimrod, M3D
Aside: How to Avoid Vectorization
- Use pointers to add false dependencies
- Put deep call stacks inside loops
- Put debug I/O operations inside compute loops
- Did I mention using pointers?
Aside: Software Design
- In general, we don't know how to systematically design efficient, maintainable HPC software
- Vectorization imposes constraints on software design
  - Bad: existing software must be rewritten
  - Good: resulting software often faster on modern superscalar systems
- "Some tuning required" for X series
  - Bad: you must tune
  - Good: tuning is systematic, not a Black Art
- Vectorization "constraints" may help us develop effective design patterns for HPC software
Hopeless Results
- Dominated by unvectorizable algorithms
- Some benchmark kernels of questionable relevance
- No known DOE SC applications
Summary
- DOE SC scientists do need 100 TF and beyond of sustained application performance
- The Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond
“Custom” Rant
- "Custom vs. commodity" is a red herring
  - CMOS is commodity
  - Memory is commodity
  - Wires are commodity
- Cooling is independent of vector vs. scalar
  - PNNL liquid-cools clusters
  - Vector systems may move to air cooling
- All vendors do custom packaging
- Real issue: software
MPI Rant
- Latency-bound apps are often limited by "MPI_Allreduce(…, MPI_SUM, …)"
  - Not ping pong!
  - An excellent abstraction that is eminently optimizable
- Some apps are limited by point-to-point
  - Remote load/store implementations (CAF, UPC) have performance advantages over MPI
  - But MPI could be implemented using load/store, inlined, and optimized
  - On the other hand, it's easier to avoid pack/unpack with the load/store model
Co-Array-Fortran Rant
- No such thing as one-sided communication
  - It's all two-sided: send+receive, sync+put+sync, sync+get+sync
  - Same parallel algorithms
- CAF mods can be highly nonlocal
  - Adding CAF in a subroutine can have implications for the argument types, and thus for the callers, the callers' callers, etc.
  - Rarely the case for MPI
- We use CAF to avoid MPI-implementation performance inadequacies
  - Avoiding nonlocality by cheating with Cray pointers
Cray Rant
- Cray XD1 (OctigaBay) follows in the tradition of the T3E
  - Very promising architecture
  - Dumb name
- Interesting competitor to Red Storm
Questions?
James B. White III (Trey)
http://www.csm.ornl.gov/evaluation/PHOENIX/