E.T. International, Inc.
X-Stack: Programming Challenges, Runtime Systems, and Tools
Brandywine Team, May 2013
DynAX: Innovations in Programming Models, Compilers and Runtime Systems for Dynamic Adaptive Event-Driven Execution Models
Objectives
• Scalability: Expose, express, and exploit O(10^10) concurrency
• Locality: Locality-aware data types, algorithms, and optimizations
• Programmability: Easy expression of asynchrony, concurrency, and locality
• Portability: Stack portability across heterogeneous architectures
• Energy Efficiency: Maximize static and dynamic energy savings while managing the tradeoff between energy efficiency, resilience, and performance
• Resilience: Gradual degradation in the face of many faults
• Interoperability: Leverage legacy code through a gradual transformation towards exascale performance
• Applications: Support NWChem
Brandywine X-Stack Software Stack
• SWARM (Runtime System)
• SCALE (Compiler)
• HTA (Library)
• R-Stream (Compiler)
• NWChem + Co-Design Applications
• Rescinded Primitive Data Types
SWARM vs. MPI, OpenMP, OpenCL
• SWARM: asynchronous event-driven tasks (dependencies, resources, active messages, control migration)
• MPI, OpenMP, OpenCL: communicating sequential processes, bulk-synchronous message passing
[Figure: timelines contrasting active threads and waiting time under the two models]
SWARM: Principles of Operation
• Codelets
  * Basic unit of parallelism
  * Nonblocking tasks
  * Scheduled upon satisfaction of precedent constraints
• Hierarchical locale tree: spatial position, data locality
• Lightweight synchronization
• Active Global Address Space (planned)
• Dynamics
  * Asynchronous split-phase transactions: latency hiding
  * Message-driven computation
  * Control-flow and dataflow
  * Futures
  * Error handling
  * Fault tolerance (planned)
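The codelet scheduling rule above (a nonblocking task runs once all of its precedent constraints are satisfied) can be sketched with a dependence counter. This is a minimal illustration of the model, not SWARM's actual API:

```python
# Minimal codelet sketch (hypothetical interface, not SWARM's):
# a codelet is a nonblocking task that becomes runnable once its
# dependence counter reaches zero.
from collections import deque

class Codelet:
    def __init__(self, fn):
        self.fn = fn
        self.deps = 0             # precedent constraints not yet satisfied
        self.successors = []      # codelets waiting on this one

    def then(self, succ):
        succ.deps += 1
        self.successors.append(succ)
        return succ

def run(ready):
    queue = deque(ready)
    while queue:
        c = queue.popleft()
        c.fn()                    # nonblocking: runs to completion
        for s in c.successors:    # satisfy one precedent constraint each
            s.deps -= 1
            if s.deps == 0:
                queue.append(s)

log = []
a = Codelet(lambda: log.append("a"))
b = Codelet(lambda: log.append("b"))
c = Codelet(lambda: log.append("c"))
a.then(c)
b.then(c)                         # c runs only after both a and b
run([a, b])
print(log)                        # ['a', 'b', 'c']
```

Because codelets never block, the scheduler needs no preemption: a task is either not yet runnable, runnable, or finished.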
Cholesky DAG
• POTRF → TRSM
• TRSM → GEMM, SYRK
• SYRK → POTRF
[Figure: task DAG over factorization steps 1-3 for POTRF, TRSM, SYRK, and GEMM tiles]
• Implementations: OpenMP, SWARM
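The DAG above can be enumerated for a tiled matrix. The sketch below (task names follow the slide; the generator itself is illustrative) lists the tasks a dependence-driven runtime would schedule for an N x N tile grid:

```python
# Enumerate the tiled Cholesky task DAG: at step k, POTRF(k,k) enables
# the TRSMs in column k, which enable the SYRK/GEMM updates that feed
# the next step's POTRF.
def cholesky_tasks(n):
    tasks = []
    for k in range(n):
        tasks.append(("POTRF", k, k))
        for i in range(k + 1, n):
            tasks.append(("TRSM", i, k))        # depends on POTRF(k,k)
        for i in range(k + 1, n):
            tasks.append(("SYRK", i, k))        # depends on TRSM(i,k)
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j, k)) # depends on TRSM(i,k), TRSM(j,k)
    return tasks

tasks = cholesky_tasks(3)
print(len(tasks))                               # 10 tasks for a 3x3 tile grid
```

Only the edges listed on the slide constrain execution, so TRSMs in the same column, and updates on disjoint tiles, can all run concurrently.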
Cholesky Decomposition: Xeon
[Figure: speedup over serial vs. # threads (1-12) for naïve OpenMP, tuned OpenMP, and SWARM]
Cholesky Decomposition: Xeon Phi
[Figure: OpenMP vs. SWARM on Xeon Phi, 240 threads]
OpenMP's fork-join programming model suffers on many-core chips such as the Xeon Phi; SWARM removes these synchronizations.
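A toy timing model (an illustration with made-up durations, not a measurement) shows why fork-join hurts at high thread counts: each phase ends only when its slowest task does, while dependence-driven scheduling lets each chain of tasks wait only on its own predecessors:

```python
# Two chains of work across two phases; durations are arbitrary.
phases = [[3, 1],    # phase 1: chain 0 takes 3, chain 1 takes 1
          [1, 3]]    # phase 2: chain 0 takes 1, chain 1 takes 3

# Fork-join: a barrier after each phase waits for the slowest task.
fork_join_time = sum(max(phase) for phase in phases)   # 3 + 3 = 6

# Dependence-driven: chain c waits only on its own previous task.
chain_end = [0, 0]
for phase in phases:
    for c, d in enumerate(phase):
        chain_end[c] += d
async_time = max(chain_end)                            # max(4, 4) = 4

print(fork_join_time, async_time)                      # 6 4
```

The gap widens with more threads, since the probability that some thread straggles in each phase grows with the thread count.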
Cholesky: SWARM vs. ScaLapack/MKL
[Figure: GFLOPS (0-16000) vs. # nodes (2-64) for ScaLapack/MKL and SWARM]
16-node cluster: Intel Xeon E5-2670, 16 cores, 2.6 GHz
Asynchrony is key in large dense linear algebra.
Code Transition to Exascale
1. Determine application execution, communication, and data access patterns.
2. Find ways to accelerate application execution directly.
3. Consider the data access pattern to lay out data better across distributed heterogeneous nodes.
4. Convert single-node synchronization to asynchronous control-flow/data-flow (OpenMP -> asynchronous scheduling).
5. Remove bulk-synchronous communications where possible (MPI -> asynchronous communication).
6. Synergize inter-node and intra-node code.
7. Determine further optimizations afforded by the asynchronous model.
This method was successfully deployed for the NWChem code transition.
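Steps 4 and 5 can be sketched with futures. The block below is a hypothetical pattern, not NWChem or SWARM code: `compute` and `exchange` are stand-ins for a local kernel and a communication step, and the point is that each block's communication overlaps the next block's computation instead of waiting at a bulk-synchronous barrier:

```python
# Replace a bulk-synchronous compute/exchange loop with futures so
# communication of one block overlaps computation of the next.
from concurrent.futures import ThreadPoolExecutor
import time

def compute(block):
    time.sleep(0.01)          # stand-in for a local kernel
    return block * 2

def exchange(result):
    time.sleep(0.01)          # stand-in for an asynchronous send
    return result

with ThreadPoolExecutor(max_workers=4) as pool:
    sends = []
    for block in range(4):
        r = compute(block)                      # compute this block
        sends.append(pool.submit(exchange, r))  # send overlaps next compute
    results = [f.result() for f in sends]       # drain outstanding sends

print(results)                # [0, 2, 4, 6]
```

The synchronization point moves from "after every block" to "once, at the end", which is the essence of removing bulk-synchronous communication.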
Self-Consistent Field Module from NWChem
• NWChem is used by thousands of researchers.
• The code is designed to be highly scalable, to the petaflop scale.
• Thousands of man-hours have been spent on tuning and performance.
• The Self-Consistent Field (SCF) module is a key component of NWChem.
• As part of the DOE X-Stack program, ETI has worked with PNNL to extract the algorithm from NWChem to study how to improve it.
Serial Optimizations
[Figure: cumulative serial speedup (up to ~16x) from successive optimizations: original, symmetry of g(), BLAS/LAPACK, precomputed g values, Fock matrix symmetry]
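The last optimization listed, Fock matrix symmetry, exploits the fact that the Fock matrix is symmetric: only the upper triangle need be computed, roughly halving the element evaluations. A minimal sketch, where `element` is a stand-in for the actual Fock-element computation:

```python
# Build a symmetric matrix by evaluating only the upper triangle and
# mirroring: n*(n+1)/2 evaluations instead of n*n.
def build_symmetric(n, element):
    F = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):       # upper triangle only (i <= j)
            F[i][j] = element(i, j)
            F[j][i] = F[i][j]       # mirror into the lower triangle
    return F

F = build_symmetric(4, lambda i, j: 10 * i + j)
print(F[1][3], F[3][1])             # equal by construction: 13 13
```

The same idea underlies the "symmetry of g()" bar: symmetric integrals are computed once and reused for their mirrored index combinations.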
Single Node Parallelization
[Figure: speedup of OpenMP versions and SWARM vs. # threads (1-32): OpenMP dynamic, guided, and static schedules (v1-v3), SWARM, and ideal scaling]
Multi-Node Parallelization
[Figure: SCF multi-node execution time in seconds (log scale) vs. # cores (16-2048) for SWARM and MPI]
[Figure: SCF multi-node speedup over single node (log scale) vs. # cores (16-2048) for SWARM and MPI]
Information Repository
• All of this information is available in more detail at the X-Stack wiki: http://www.xstackwiki.com
Questions?
Acknowledgements
• Co-PIs: Benoit Meister (Reservoir), David Padua (Univ. Illinois), John Feo (PNNL)
• Other team members:
  * ETI: Mark Glines, Kelly Livingston, Adam Markey
  * Reservoir: Rich Lethin
  * Univ. Illinois: Adam Smith
  * PNNL: Andres Marquez
• DOE: Sonia Sachs, Bill Harrod