STAPL: The C++ Standard Template Adaptive Parallel Library
Alin Jula, Department of Computer Science, Texas A&M University
Ping An, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato and Lawrence Rauchwerger
http://www.cs.tamu.edu/research/parasol
Motivation

Building-block library
– Nested parallelism
Inter-operability with existing code
– Superset of STL
Portability and performance
– Layered architecture
– Run-time adaptivity
STAPL – C++ Standard Template Adaptive Parallel Library
Philosophy

Interface Layer
– STL compatible
Concurrency & Communication Layer
– Generic parallelism, synchronization
Software Implementation Layer
– Instantiates concurrency & communication
Machine Layer
– Architecture-dependent code
Related Work

System   | Paradigm  | Architecture | Nested Par. | Adaptive | Generic      | Irregular | Data Decomp. | Data Mapping | Scheduling                         | Comm/Comp Overlap
STAPL    | SPMD/MIMD | shared/dist  | yes         | yes      | yes          | yes       | auto/user    | auto/user    | user, static, dyn, block           | yes
AVTL     | SPMD      | dist         | no          | no       | yes          | no        | auto         | auto         | user, MPI-based                    | no
CHARM++  | MIMD      | shared/dist  | no          | no       | yes (limits) | yes       | user         | auto         | prioritized execution              | yes
CHAOS++  | SPMD      | dist         | no          | no       | yes          | yes       | auto/user    | auto/user    | data decomp. based                 | no
CILK*    | SPMD/MIMD | shared/dist  | yes         | no       | no           | yes       | user         | auto         | work stealing                      | no
NESL*    | SPMD/MIMD | shared/dist  | yes         | no       | yes          | yes       | user         | auto         | work & depth model                 | no
POOMA    | SPMD      | shared/dist  | no          | no       | yes          | no        | user         | user         | pthread scheduling                 | no
PSTL     | SPMD      | shared/dist  | no          | no       | yes          | yes       | auto/user    | auto/user    | Tulip RTS                          | no
SPLIT-C* | SPMD      | shared/dist  | yes         | no       | no           | yes       | user         | auto         | user                               | yes

* Parallel programming language
STL Overview

Data is stored in Containers
STL provides standardized Algorithms
Iterators bind Algorithms to Containers
– Iterators are generalized pointers

Example:

vector<int> vect;
...                              // initialization of 'vect'
sort(vect.begin(), vect.end());
STAPL Overview

[Figure: STAPL architecture. A pRange binds a pAlgorithm to a pContainer; the run-time system's Scheduler, Distributor, and Executor map subranges onto processors.]

Data is stored in pContainers
STAPL provides standardized pAlgorithms
pRanges bind pAlgorithms to pContainers
– Similar to STL Iterators, but must also support parallelism
pRange

pRange is the parallel counterpart of the STL iterator:
– Binds pAlgorithms to pContainers
– Provides an abstract view of a scoped data space
– The data space is (recursively) partitioned into subranges
It is more than an iterator, since it supports parallelization:
– The scheduler/distributor decides how computation and data structures should be mapped to the machine
– Data dependences among subranges can be represented by a data dependence graph (DDG)
– The executor launches parallel computation, manages communication, and enforces dependences
pRange

Provides random access to a partition of the data space
– View and access are provided by a collection of iterators describing the pRange boundary
pRanges are partitioned into subranges
– Automatically by STAPL, based on machine characteristics, number of processors, partition factors, etc.
– Manually, according to user-specified partitions
A pRange can represent relationships among subspaces as data dependence graphs (DDGs), for scheduling
pRange

Each subspace is disjoint and could itself be a pRange
– Nested parallelism

[Figure: a pRange over the data space is recursively partitioned into disjoint subspaces; each subspace may itself be a pRange.]

stapl::pRange<stapl::pVector<int>::iterator> dataRange(segBegin, segEnd);
dataRange.partition();
stapl::pRange<stapl::pVector<int>::iterator> dataSubrange = dataRange.get_subrange(3);
dataSubrange.partition_like(<0.25, 0.25, 0.25, 0.25> * size);
pContainer

pContainer is the parallel counterpart of the STL container
– Provides parallel and concurrent methods
Maintains an internal pRange
– Updated during insert/delete operations
– Minimizes redistribution
Completed: pVector, pList, pTree
Example:
[Figure: a pVector stores its data in STL vectors and maintains an internal pRange over them.]
pAlgorithm

pAlgorithm is the parallel counterpart of the STL algorithm
Parallel algorithms take as input:
– a pRange
– work functions that operate on subranges
and apply the work function to all subranges:

template <class SubRange>
class pAddOne : public stapl::pFunction {
public:
  ...
  void operator()(SubRange& spr) {
    typename SubRange::iterator i;
    for (i = spr.begin(); i != spr.end(); i++)
      (*i)++;
  }
};
...
p_transform(pRange, pAddOne);
Run-Time System

Support for different architectures
– HP V2200
– SGI Origin 2000, SGI Power Challenge
Support for different paradigms
– OpenMP, Pthreads
– MPI
Memory allocation
– HOARD
[Figure: the run-time system maps a pAlgorithm onto clusters of processors (e.g., Cluster 1 holding Proc 12-15, plus Clusters 2-4).]
Run-Time System

Scheduler
– Determines an execution order (DDG)
– Policies:
  Automatic: static, block, dynamic, partial self-scheduling, complete self-scheduling
  User-defined
Distributor
– Hierarchical data distribution
– Automatic and user-defined
Executor
– Executes the DDG
– Processor assignment
– Synchronization and communication
STL to STAPL Automatic Translation

A C++ preprocessor converts STL code into STAPL parallel code
Iterators are used to construct pRanges
The user is responsible for safe parallelization

#include <start_STAPL>
  accumulate(x.begin(), x.end(), 0);
  for_each(x.begin(), x.end(), foo());
#include <stop_STAPL>

Preprocessing phase:
pi_accumulate(x.begin(), x.end(), 0);
pi_for_each(x.begin(), x.end(), foo());

pRange construction:
p_accumulate(x_pRange, 0);
p_for_each(x_pRange, foo());

In some cases automatic translation provides performance similar to hand-written STAPL code (about 5% deterioration)
Performance: p_inner_product

[Chart: experimental results on the HP V2200.]
pTree

Parallel tree that supports bulk commutative operations in parallel
Each processor is assigned a set of subtrees to maintain
Operations on the base are atomic
Operations on subtrees are parallel

[Figure: the base of the tree (atomic) with subtrees (parallel) assigned to processors P1, P2, P3.]

Example: parallel insertion algorithm. Each processor is given a set of elements.
1) Each processor creates local buckets corresponding to the subtrees
2) Each processor collects the buckets that correspond to its subtrees
3) Elements in the subtree buckets are inserted into the tree in parallel
pTree

Basis for the STAPL pSet, pMultiSet, pMap, and pMultiMap containers
– Covers all remaining STL containers
Results are sequentially consistent, although the internal structure may vary
Requires negligible additional memory
pTrees can be used either sequentially or in parallel in the same execution
– Allows switching back and forth between parallel and sequential use
Performance: pTree

[Chart: experimental results on the HP V2200.]
Algorithm Adaptivity

Problem: parallel algorithms are highly sensitive to
– Architecture: number of processors, memory interconnection, cache, available resources, etc.
– Environment: thread management, memory allocation, operating system policies, etc.
– Data characteristics: input type, layout, etc.
Solution: implement a number of different algorithms and adaptively choose the best one at run-time
Adaptive Framework
Case Study: Adaptive Sorting

Sort   | Strength                   | Weakness
Column | Theoretically time-optimal | Many passes over data
Merge  | Low memory overhead        | Poor scalability
Radix  | Extremely fast             | Integers only
Sample | Two passes over data       | High memory overhead
Performance: Adaptive Sorting

[Charts: performance on 10 million integers on the HP V2200, SGI Power Challenge, and SGI Origin 2000.]
Performance: Run-Time Tests

if (data_type == INTEGER)
  radix_sort();
else if (num_procs < 5)
  merge_sort();
else
  column_sort();

[Chart: results on the SGI Origin 2000.]
Performance: Molecular Dynamics*

Discrete-time particle-interaction simulation
– Written in STL
– Time steps calculate system evolution (dependence)
– Parallelized within a time step
STAPL utilization:
– pAlgorithms: p_for_each, p_transform, p_accumulate
– pContainers: pVector (push_back)
– Automatic vs. manual translation (5% performance deterioration)

* Code written by Danny Rintoul at Sandia National Labs
Performance: Molecular Dynamics

Execution time (sec) on the HP V2200:

Particles | 1 proc | 4 procs | 8 procs | 12 procs | 16 procs
108K      | 2815   | 1102    | 546     | 386      | 309
23K       | 627    | 238     | 132     | 94       | 86

• 40%-49% parallelized
• Input sensitive
• Use pTree on the rest
Performance: Particle Transport*

Generic particle transport solver
– Regular and arbitrary grids
– Numerically intensive, 25k-line C++ STAPL code
– Sweep function unaware of parallel issues
STAPL utilization:
– pAlgorithms: p_for_each
– pContainers: pVector (for data distribution)
– Scheduler: determines grid data dependencies
– Executor: satisfies data dependencies

* Joint effort between Texas A&M Nuclear Engineering and Computer Science, funded by DOE ASCI
Performance: Particle Transport

Profile and speedups on an SGI Origin 2000 using 16 processors:

Code Region                   | % Seq. | Speedup
Create computational grid     | 10.00  | 14.5
Scattering across group-sets  | 0.05   | N/A
Scattering within a group-set | 0.40   | 15.94
Sweep                         | 86.86  | 14.72
Convergence across group-sets | 0.05   | N/A
Convergence within group-sets | 0.05   | N/A
Other                         | 2.59   | N/A
Total                         | 100.00 | 14.70
Performance: Particle Transport

[Chart: experimental results on the SGI Origin 2000.]
Summary

Parallel equivalent of STL
– Many codes can immediately utilize STAPL
– Automatic translation
Building-block library
– Portability (layered architecture)
– Performance (adaptivity)
– Automatic recursive parallelism
STAPL performs well both in small pAlgorithm test cases and in large codes
STAPL Status and Current Work

pAlgorithms: fully implemented
pContainers: pVector, pList, pTree
pRange: mostly implemented
Run-time:
– Executor fully implemented
– Scheduler fully implemented
– Distributor work in progress
Adaptive mechanism (case study: sorting)
OpenMP + MPI (mixed) work in progress
– OpenMP version fully implemented
– MPI version work in progress
http://www.cs.tamu.edu/research/parasol

Project funded by
• NSF
• DOE