NSF/DARPA OPAAL
Adaptive Parallelization Strategies using Data-driven Objects
Laxmikant Kale
First Annual Review
27-28 October 1999, Iowa City
Outline
- Quench and solidification codes
- Coarse grain parallelization of the quench code
- Adaptive parallelization techniques
  - Dynamic variations
  - Adaptive load balancing
- Finite element framework with adaptivity
- Preliminary results
Coarse grain parallelization
- Structure of the current sequential quench code:
  - 2-D array of elements (each independently refined)
  - Within-row dependence
  - Independent rows, but they share global variables
- Parallelization using Charm++: 3 hours of effort (after a false start)
  - About 20 lines of change to the F90 code
  - A 100-line Charm++ wrapper (sketched below)
- Observations:
  - Global variables that are defined and used within inner-loop iterations are easily dealt with in Charm++, in contrast to OpenMP
  - Dynamic load balancing is possible, but was unnecessary
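The wrapper itself is not shown in the slides; the following is a minimal sketch, under assumed names (the module, entry methods, the row count, and the Fortran binding `quench_row_` are all hypothetical), of how such a Charm++ wrapper around an F90 code can look: one chare per independent row, each calling the serial Fortran kernel.

```
// quench.ci -- Charm++ interface file (hypothetical names)
mainmodule quench {
  readonly CProxy_Main mainProxy;
  mainchare Main {
    entry Main(CkArgMsg* m);
    entry void rowDone();
  };
  array [1D] Row {
    entry Row();
    entry void compute(int step);
  };
};
```

```cpp
// quench.C -- the C++ side of the wrapper (a sketch, not the original code)
#include "quench.decl.h"

extern "C" void quench_row_(int* row, int* step); // serial F90 kernel (assumed binding)

CProxy_Main mainProxy;

class Main : public CBase_Main {
  int remaining;
public:
  Main(CkArgMsg* m) : remaining(64) {             // 64 rows, chosen arbitrarily
    mainProxy = thisProxy;
    CProxy_Row rows = CProxy_Row::ckNew(remaining); // one chare per row
    rows.compute(0);                              // broadcast: start every row
  }
  void rowDone() { if (--remaining == 0) CkExit(); }
};

class Row : public CBase_Row {
public:
  Row() {}
  Row(CkMigrateMessage* m) {}
  void compute(int step) {
    int id = thisIndex;
    quench_row_(&id, &step);  // globals defined and used inside the row loop
                              // become per-call state, so rows don't interfere
    mainProxy.rowDone();
  }
};

#include "quench.def.h"
```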
Performance results
[Figure: speedup for Micro1D vs. number of processors; both axes run 0 to 70.]
Contributors:
Engineering: N. Sobh, R. Haber
Computer Science: M. Bhandarkar, R. Liu, L. Kale
OpenMP experience
- Work by J. Hoeflinger and D. Padua, with N. Sobh, R. Haber, J. Dantzig, and N. Provatas
- Solidification code: parallelized using OpenMP
- Relatively straightforward, after a key decision: parallelize by rows only
OpenMP experience: Quench code on Origin2000
- Privatization of variables is needed, since the outer loop was parallelized
- Unexpected initial difficulties with OpenMP:
  - Led initially to a large slowdown in the parallelized code
  - Traced to unnecessary locking in the MATMUL intrinsic
[Figure: execution time in seconds vs. number of processors (1 to 8).]
Adaptive Strategies
- Advanced codes model dynamic and irregular behavior
  - Solidification: adaptive grid refinement
  - Quench: complex dependencies, parallelization within elements
- To parallelize these effectively, adaptive runtime strategies are necessary
Multi-partition decomposition
- Idea: decompose the problem into a number of partitions, independent of the number of processors
- # Partitions > # Processors
- The system maps partitions to processors
- The system should be able to map and re-map objects as needed (see the sketch below)
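As a concrete illustration (a sketch with hypothetical names, not code from the project): in Charm++ the partition count is a free parameter of the program, so a main chare can simply create several partitions per processor and leave all placement to the runtime.

```cpp
// Overdecomposition sketch: # partitions > # processors, mapping owned
// by the runtime. Assumes a .ci file declaring Main and array [1D] Partition.
#include "decomp.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    int numPartitions = 8 * CkNumPes();   // e.g. 8 partitions per processor
    CProxy_Partition parts = CProxy_Partition::ckNew(numPartitions);
    parts.start();   // application code addresses parts[i], never a
                     // processor; the runtime can re-map objects later
  }
};
```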
Charm++
- A parallel C++ library
- Supports data-driven objects: singleton objects, object arrays, groups, ...
- Many objects per processor, with method execution scheduled by availability of data
- The system supports automatic instrumentation and object migration (sketched below)
- Works with other paradigms: MPI, OpenMP, ...
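Object migration relies on each object being able to serialize its own state. A minimal sketch, with assumed names and assuming a matching .ci file declaring `array [1D] Partition`, of what makes a Charm++ array element migratable:

```cpp
// A migratable Charm++ array element (hypothetical names). The pup()
// routine serializes the object's state so the runtime can pack it,
// move it to another processor, and unpack it there.
#include "pup_stl.h"   // PUP support for STL containers
#include <vector>

class Partition : public CBase_Partition {
  std::vector<double> field;         // per-partition simulation state
  int step;
public:
  Partition() : field(1024, 0.0), step(0) {}
  Partition(CkMigrateMessage* m) {}  // required migration constructor
  void pup(PUP::er& p) {             // called for both pack and unpack
    CBase_Partition::pup(p);
    p | field;
    p | step;
  }
};
```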
Data-driven execution in Charm++
[Diagram: on each processor, a scheduler repeatedly picks the next message from its message queue and invokes the addressed object's method; a conceptual sketch follows.]
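The following self-contained sketch conveys the idea behind the diagram (it is a conceptual illustration, not the actual Charm++ internals): execution is driven purely by which messages are available, and objects never block waiting for data.

```cpp
// Conceptual sketch of data-driven execution: a per-processor scheduler
// pops messages off a queue and delivers each to its target object.
#include <cstdio>
#include <queue>
#include <vector>

struct Message { int objectId; double payload; };

struct Object {
  int id;
  void receive(double x) { std::printf("object %d got %g\n", id, x); }
};

int main() {
  std::vector<Object> objects = {{0}, {1}, {2}};  // many objects per processor
  std::queue<Message> messageQ;
  messageQ.push({2, 3.14});
  messageQ.push({0, 2.71});
  // Scheduler loop: method execution is scheduled by data availability.
  while (!messageQ.empty()) {
    Message m = messageQ.front(); messageQ.pop();
    objects[m.objectId].receive(m.payload);
  }
}
```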
Load Balancing Framework
- Aimed at handling:
  - Continuous (slow) load variation
  - Abrupt load variation (refinement)
  - Workstation clusters in multi-user mode
- Measurement based: exploits temporal persistence of computation and communication structures
- Very accurate instrumentation (compared with estimation) is possible via Charm++/Converse (see the sketch below)
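From an object's point of view, participation in measurement-based balancing looks roughly like the sketch below, using the Charm++ AtSync mechanism (names other than usesAtSync, AtSync, and ResumeFromSync are assumed, and the `iterate` entry method would need a matching .ci declaration):

```cpp
// Measurement-based load balancing from the object's point of view.
class Worker : public CBase_Worker {
  int step;
public:
  Worker() : step(0) { usesAtSync = true; }  // opt in to the LB framework
  Worker(CkMigrateMessage* m) {}
  void iterate() {
    // ... one timestep of work, instrumented automatically ...
    if (++step % 20 == 0)
      AtSync();                       // hand control to the balancer, which
                                      // consults the measured load database
    else
      thisProxy[thisIndex].iterate();
  }
  void ResumeFromSync() {             // invoked after migration decisions
    thisProxy[thisIndex].iterate();
  }
};
```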
Object balancing framework
Utility of the framework: workstation clusters
- Cluster of 8 machines; one machine gets another job
- The parallel job slows down on all machines
- Using the framework:
  - Detection mechanism
  - Migrate objects away from the overloaded processor
  - Almost the original throughput is restored!
Performance on timeshared clusters
[Figure: Another user logged on about 28 seconds into a parallel run on 8 workstations. Throughput dipped from 10 steps per second to 7. The load balancer intervened at 35 seconds and restored throughput to almost its initial value.]
Utility of the framework: intrinsic load imbalance
- To test the abilities of the framework, a simple problem: Gauss-Jacobi iterations, with selected sub-domains refined (sketched below)
- ConSpector: a web-based tool
  - Submit parallel jobs
  - Monitor performance and application behavior
  - Interact with running jobs via GUI interfaces
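The benchmark's kernel, sketched sequentially below (in the actual test each chunk is a Charm++ object, and the struct and member names here are illustrative only): refining a chunk doubles its point count, so its per-step cost grows, which is exactly the load imbalance the balancer must correct.

```cpp
// One chunk of a 1-D Gauss-Jacobi relaxation. Chunks need >= 2 points.
#include <vector>

struct Chunk {
  int level = 0;                 // refinement level, raised interactively
  std::vector<double> u, next;
  explicit Chunk(int points) : u(points, 0.0), next(u) {}
  void refine() {                // doubles this chunk's work per step
    ++level;
    u.assign(u.size() * 2, 0.0);
    next = u;
  }
  void step(double left, double right) {   // ghost values from neighbors
    next.front() = 0.5 * (left + u[1]);
    for (size_t i = 1; i + 1 < u.size(); ++i)
      next[i] = 0.5 * (u[i - 1] + u[i + 1]);
    next.back() = 0.5 * (u[u.size() - 2] + right);
    u.swap(next);
  }
};
```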
[Figure: AppSpector view of the load balancer on the synthetic Jacobi relaxation benchmark. Imbalance is introduced by interactively refining a subset of cells around t = 9 seconds; the resulting load imbalance brings utilization down to 80% from the peak of 96%. The load balancer kicks in around t = 16 and restores utilization to around 94%.]
Using the Load Balancing Framework
[Diagram: the layers involved in using the load balancing framework: Converse; Charm++; the load database + balancer; MPI-on-Charm (Irecv+, automatic conversion from MPI); the FEM and Structured frameworks; cross-module interpolation; a migration path and a framework path into the balancer.]
Example application: crack propagation (P. Geubelle et al.)
- Similar in structure to the Quench components
- 1900 lines of F90, rewritten using the FEM framework in C++: 1200 lines of C++ code
- Framework: 500 lines of code, reused by all applications
- Parallelization is handled completely by the framework (see the sketch below)
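The slides do not show the framework's interface, so the following is a hypothetical sketch of the division of labor it implies (none of these names are the framework's real API): the application writes serial per-chunk FEM code, while the framework partitions the mesh with Metis, creates one migratable object per chunk, and exchanges shared-node values across chunk boundaries.

```cpp
#include <vector>

// What the application writes: a serial driver over ONE mesh chunk.
// Partitioning, object creation, and communication live in the framework.
struct Chunk {
  std::vector<int> conn;          // element-to-node connectivity, this chunk
  std::vector<double> nodeVal;    // nodal values; boundary nodes are shared
};

void driver(Chunk& c) {
  // element loop: gather nodal values, compute element contributions,
  // scatter results back to c.nodeVal -- exactly as in the serial code
}

// Framework side (greatly simplified): run every chunk's driver, then
// combine contributions at nodes shared between neighboring chunks.
void frameworkStep(std::vector<Chunk>& chunks) {
  for (Chunk& c : chunks) driver(c);
  // ... sum shared-node values across chunk boundaries, then repeat ...
}
```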
Crack Propagation
[Figure: decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements. Both decompositions were obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle.]
“Overhead” of multi-partition method
[Figure: overhead vs. number of partitions (1 to 1000, log scale).]
Overhead study on 8 processors
[Figure: execution time on 8 processors vs. number of chunks per processor (1 to 100, log scale).]
When running on 8 processors, using multiple partitions per processor is also beneficial, due to cache behavior.
Cross-approach comparison
[Figure: performance comparison across approaches. Execution time in seconds vs. number of partitions (1 to 128) for the original MPI-F90 code, the all-C++ Charm++ framework version, and the F90 + Charm++ library version.]
Load balancer in action
Summary and Planned Research
- Use the adaptive FEM framework to parallelize the Quench code further
- Quad-tree based solidification code:
  - First phase: parallelize each phase separately
  - Then parallelize across refinement phases
- Refine the FEM framework:
  - Use feedback from applications
  - Support implicit solvers and multigrid