STAPL: The C++ Standard Template Adaptive Parallel Library
Alin Jula, Department of Computer Science, Texas A&M University
Ping An, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato and Lawrence Rauchwerger
http://www.cs.tamu.edu/research/parasol
Motivation

Building-block library
– Nested parallelism
Inter-operability with existing code
– Superset of STL
Portability and performance
– Layered architecture
– Run-time adaptivity
STAPL – C++ Standard Template Adaptive Parallel Library
Philosophy

Interface Layer
– STL compatible
Concurrency & Communication Layer
– Generic parallelism, synchronization
Software Implementation Layer
– Instantiates concurrency & communication
Machine Layer
– Architecture-dependent code
Related Work

System   | Paradigm  | Architecture | Nested Par. | Adaptive | Generic      | Irregular | Data Decomp. | Data Mapping | Scheduling                         | Comm/Comp Overlap
STAPL    | SPMD/MIMD | shared/dist  | yes         | yes      | yes          | yes       | auto/user    | auto/user    | user, static, dyn, block           | yes
AVTL     | SPMD      | dist         | no          | no       | yes          | no        | auto         | auto         | user, MPI-based                    | no
CHARM++  | MIMD      | shared/dist  | no          | no       | yes (limits) | yes       | user         | auto         | prioritized execution              | yes
CHAOS++  | SPMD      | dist         | no          | no       | yes          | yes       | auto/user    | auto/user    | data decomp. based                 | no
CILK*    | SPMD/MIMD | shared/dist  | yes         | no       | no           | yes       | user         | auto         | work stealing                      | no
NESL*    | SPMD/MIMD | shared/dist  | yes         | no       | yes          | yes       | user         | auto         | work & depth model                 | no
POOMA    | SPMD      | shared/dist  | no          | no       | yes          | no        | user         | user         | pthread scheduling                 | no
PSTL     | SPMD      | shared/dist  | no          | no       | yes          | yes       | auto/user    | auto/user    | Tulip RTS                          | no
SPLIT-C* | SPMD      | shared/dist  | yes         | no       | no           | yes       | user         | auto         | user                               | yes

* Parallel programming language
STL Overview

Data is stored in Containers
STL provides standardized Algorithms
Iterators bind Algorithms to Containers
– Iterators are generalized pointers

Example:

vector<int> vect;
...                              // initialization of 'vect'
sort(vect.begin(), vect.end());
STAPL Overview

[Figure: STAPL architecture. A pRange binds a pAlgorithm to a pContainer; the run-time system's Scheduler, Distributor, and Executor map subranges onto processors.]

Data is stored in pContainers
STAPL provides standardized pAlgorithms
pRanges bind pAlgorithms to pContainers
– Similar to STL Iterators, but must also support parallelism
pRange

pRange is the parallel counterpart of the STL iterator:
– Binds pAlgorithms to pContainers
– Provides an abstract view of a scoped data space
– The data space is (recursively) partitioned into subranges
It is more than an iterator, since it supports parallelization:
– The scheduler/distributor decides how computation and data structures should be mapped to the machine
– Data dependences among subranges can be represented by a data dependence graph (DDG)
– The executor launches parallel computation, manages communication, and enforces dependences
pRange

Provides random access to a partition of the data space
– View and access are provided by a collection of iterators describing the pRange boundary
pRanges are partitioned into subranges
– Automatically by STAPL, based on machine characteristics, number of processors, partition factors, etc.
– Manually, according to user-specified partitions
A pRange can represent relationships among subspaces as data dependence graphs (DDGs), for scheduling
pRange

Each subspace is disjoint and could itself be a pRange
– Nested parallelism

[Figure: a pRange over the data space is recursively partitioned into disjoint subspaces; each subspace may itself be a pRange.]

stapl::pRange<stapl::pVector<int>::iterator> dataRange(segBegin, segEnd);
dataRange.partition();
stapl::pRange<stapl::pVector<int>::iterator> dataSubrange = dataRange.get_subrange(3);
dataSubrange.partition_like(<0.25, 0.25, 0.25, 0.25> * size);
pContainer

pContainer is the parallel counterpart of the STL container
– Provides parallel and concurrent methods
Maintains an internal pRange
– Updated during insert/delete operations
– Minimizes redistribution
Completed: pVector, pList, pTree
Example:
[Figure: a pVector stores its data in STL vectors and maintains an internal pRange over them.]
pAlgorithm

pAlgorithm is the parallel counterpart of the STL algorithm
Parallel algorithms take as input:
– a pRange
– work functions that operate on subranges
and apply the work function to all subranges:

template <class SubRange>
class pAddOne : public stapl::pFunction {
public:
  ...
  void operator()(SubRange& spr) {
    typename SubRange::iterator i;
    for (i = spr.begin(); i != spr.end(); i++)
      (*i)++;
  }
};
...
p_transform(pRange, pAddOne);
Run-Time System

Support for different architectures
– HP V2200
– SGI Origin 2000, SGI Power Challenge
Support for different paradigms
– OpenMP, Pthreads
– MPI
Memory allocation
– HOARD
[Figure: the run-time system maps a pAlgorithm onto clusters of processors (e.g., Cluster 1 holding Proc 12-15, plus Clusters 2-4).]
Run-Time System

Scheduler
– Determines an execution order (DDG)
– Policies:
  Automatic: static, block, dynamic, partial self-scheduling, complete self-scheduling
  User-defined
Distributor
– Hierarchical data distribution
– Automatic and user-defined
Executor
– Executes the DDG
– Processor assignment
– Synchronization and communication
STL to STAPL Automatic Translation

A C++ preprocessor converts STL code into STAPL parallel code
Iterators are used to construct pRanges
The user is responsible for safe parallelization

#include <start_STAPL>
  accumulate(x.begin(), x.end(), 0);
  for_each(x.begin(), x.end(), foo());
#include <stop_STAPL>

Preprocessing phase:
pi_accumulate(x.begin(), x.end(), 0);
pi_for_each(x.begin(), x.end(), foo());

pRange construction:
p_accumulate(x_pRange, 0);
p_for_each(x_pRange, foo());

In some cases automatic translation provides performance similar to hand-written STAPL code (about 5% deterioration)
Performance: p_inner_product

[Chart: experimental results on the HP V2200.]
pTree

Parallel tree that supports bulk commutative operations in parallel
Each processor is assigned a set of subtrees to maintain
Operations on the base are atomic
Operations on subtrees are parallel

[Figure: the base of the tree (atomic) with subtrees (parallel) assigned to processors P1, P2, P3.]

Example: parallel insertion algorithm. Each processor is given a set of elements.
1) Each processor creates local buckets corresponding to the subtrees
2) Each processor collects the buckets that correspond to its subtrees
3) Elements in the subtree buckets are inserted into the tree in parallel
pTree

Basis for the STAPL pSet, pMultiSet, pMap, and pMultiMap containers
– Covers all remaining STL containers
Results are sequentially consistent, although the internal structure may vary
Requires negligible additional memory
pTrees can be used either sequentially or in parallel in the same execution
– Allows switching back and forth between parallel and sequential use
Performance: pTree

[Chart: experimental results on the HP V2200.]
Algorithm Adaptivity

Problem: parallel algorithms are highly sensitive to
– Architecture: number of processors, memory interconnection, cache, available resources, etc.
– Environment: thread management, memory allocation, operating system policies, etc.
– Data characteristics: input type, layout, etc.
Solution: implement a number of different algorithms and adaptively choose the best one at run-time
Adaptive Framework
Case Study: Adaptive Sorting

Sort   | Strength                   | Weakness
Column | Theoretically time-optimal | Many passes over data
Merge  | Low memory overhead        | Poor scalability
Radix  | Extremely fast             | Integers only
Sample | Two passes over data       | High memory overhead
Performance: Adaptive Sorting

[Charts: performance on 10 million integers on the HP V2200, SGI Power Challenge, and SGI Origin 2000.]
Performance: Run-Time Tests

if (data_type == INTEGER)
  radix_sort();
else if (num_procs < 5)
  merge_sort();
else
  column_sort();

[Chart: results on the SGI Origin 2000.]
Performance: Molecular Dynamics*

Discrete-time particle-interaction simulation
– Written in STL
– Time steps calculate system evolution (dependence)
– Parallelized within a time step
STAPL utilization:
– pAlgorithms: p_for_each, p_transform, p_accumulate
– pContainers: pVector (push_back)
– Automatic vs. manual translation (5% performance deterioration)

* Code written by Danny Rintoul at Sandia National Labs
Performance: Molecular Dynamics

Execution time (sec) on the HP V2200:

Particles | 1 proc | 4 procs | 8 procs | 12 procs | 16 procs
108K      | 2815   | 1102    | 546     | 386      | 309
23K       | 627    | 238     | 132     | 94       | 86

• 40%-49% parallelized
• Input sensitive
• Use pTree on the rest
Performance: Particle Transport*

Generic particle transport solver
– Regular and arbitrary grids
– Numerically intensive, 25k-line C++ STAPL code
– Sweep function unaware of parallel issues
STAPL utilization:
– pAlgorithms: p_for_each
– pContainers: pVector (for data distribution)
– Scheduler: determines grid data dependencies
– Executor: satisfies data dependencies

* Joint effort between Texas A&M Nuclear Engineering and Computer Science, funded by DOE ASCI
Performance: Particle Transport

Profile and speedups on an SGI Origin 2000 using 16 processors:

Code Region                   | % Seq. | Speedup
Create computational grid     | 10.00  | 14.5
Scattering across group-sets  | 0.05   | N/A
Scattering within a group-set | 0.40   | 15.94
Sweep                         | 86.86  | 14.72
Convergence across group-sets | 0.05   | N/A
Convergence within group-sets | 0.05   | N/A
Other                         | 2.59   | N/A
Total                         | 100.00 | 14.70
Performance: Particle Transport

[Chart: experimental results on the SGI Origin 2000.]
Summary

Parallel equivalent of STL
– Many codes can immediately utilize STAPL
– Automatic translation
Building-block library
– Portability (layered architecture)
– Performance (adaptivity)
– Automatic recursive parallelism
STAPL performs well both in small pAlgorithm test cases and in large codes
STAPL Status and Current Work

pAlgorithms: fully implemented
pContainers: pVector, pList, pTree
pRange: mostly implemented
Run-time:
– Executor fully implemented
– Scheduler fully implemented
– Distributor work in progress
Adaptive mechanism (case study: sorting)
OpenMP + MPI (mixed) work in progress
– OpenMP version fully implemented
– MPI version work in progress
http://www.cs.tamu.edu/research/parasol

Project funded by
• NSF
• DOE