
STAPL The C++ Standard Template Adaptive Parallel Library

Date post: 11-Jan-2016
STAPL The C++ Standard Template Adaptive Parallel Library. Alin Jula Department of Computer Science, Texas A&M Ping An, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato and Lawrence Rauchwerger http://www.cs.tamu.edu/research/parasol. Motivation. - PowerPoint PPT Presentation
Transcript
Page 1: STAPL The C++ Standard Template Adaptive Parallel Library

STAPL
The C++ Standard Template Adaptive Parallel Library

Alin Jula
Department of Computer Science, Texas A&M

Ping An, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato and Lawrence Rauchwerger

http://www.cs.tamu.edu/research/parasol


2 Texas A&M UniversitySTAPL

Motivation

Building block library
– Nested parallelism

Interoperability with existing code
– Superset of STL

Portability and performance
– Layered architecture
– Run-time adaptivity

STAPL – C++ Standard Template Adaptive Parallel Library

Philosophy

Interface Layer
– STL compatible

Concurrency & Communication Layer
– Generic parallelism, synchronization

Software Implementation Layer
– Instantiates concurrency & communication

Machine Layer
– Architecture dependent code

Related Work

| Feature | STAPL | AVTL | CHARM++ | CHAOS++ | CILK* | NESL* | POOMA | PSTL | SPLIT C* |
|---|---|---|---|---|---|---|---|---|---|
| Paradigm | SPMD/MIMD | SPMD | MIMD | SPMD | SPMD/MIMD | SPMD/MIMD | SPMD | SPMD | SPMD |
| Architecture | shared/dist | dist | shared/dist | dist | shared/dist | shared/dist | shared/dist | shared/dist | shared/dist |
| Nested parallelism | yes | no | no | no | yes | yes | no | no | yes |
| Adaptive | yes | no | no | no | no | no | no | no | no |
| Generic | yes | yes | yes (limits) | yes | no | yes | yes | yes | no |
| Irregular | yes | no | yes | yes | yes | yes | no | yes | yes |
| Data decomp. | auto/user | auto | user | auto/user | user | user | user | auto/user | user |
| Data mapping | auto/user | auto | auto | auto/user | auto | auto | user | auto/user | auto |
| Scheduling | user, static, dyn, block | user, MPI-based | prioritized execution | data decomp. based | work stealing | work & depth model | pthread scheduling | Tulip RTS | user |
| Comm/comp overlap | yes | no | yes | no | no | no | no | no | yes |

* Parallel programming language

STL Overview

Data is stored in Containers
STL provides standardized Algorithms
Iterators bind Algorithms to Containers
– are generalized pointers

Example (labels on the slide: Container, Iterator, Algorithm):

    vector<int> vect;
    ... // initialization of 'vect' variable
    sort(vect.begin(), vect.end());

STAPL Overview

[Figure: STAPL architecture diagram with the components pContainer, pRange, pAlgorithm, Algorithm, Processor, and the pRange Run-time System (Scheduler, Distributor, Executor)]

Data is stored in pContainers
STAPL provides standardized pAlgorithms
pRanges bind pAlgorithms to pContainers
– Similar to STL Iterators, but must also support parallelism

pRange

pRange is the parallel counterpart of the STL iterator:
– Binds pAlgorithms to pContainers
– Provides an abstract view of a scoped data space
– Data space is (recursively) partitioned into subranges

More than an iterator since it supports parallelization:
– Scheduler/distributor decides how computation and data structures should be mapped to the machine
– Data dependences among subranges can be represented by a data dependence graph (DDG)
– Executor launches parallel computation, manages communication, and enforces dependences

pRange

Provides random access to a partition of the data space
– View and access provided by a collection of iterators describing the pRange boundary

pRanges are partitioned into subranges
– Automatically by STAPL based on machine characteristics, number of processors, partition factors, etc.
– Manually according to user-specified partitions

pRange can represent relationships among subspaces as Data Dependence Graphs (DDG) (for scheduling)
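
The recursive partitioning described above can be illustrated with a minimal sketch in standard C++. It is not the STAPL pRange class: RangeSketch, its members, and the even-split policy are assumptions made for illustration only. It shows a range defined by a pair of boundary iterators that can be split into disjoint subranges, each of which can be split again.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a pRange: a pair of boundary iterators plus
// nested subranges. This is NOT the STAPL class; all names are illustrative.
template <class Iterator>
struct RangeSketch {
    Iterator first;
    Iterator last;
    std::vector<RangeSketch> subranges;  // empty until partition() is called

    RangeSketch(Iterator f, Iterator l) : first(f), last(l) {}

    std::size_t size() const {
        return static_cast<std::size_t>(std::distance(first, last));
    }

    // Split [first, last) into `parts` disjoint, contiguous subranges whose
    // sizes differ by at most one element.
    void partition(std::size_t parts) {
        assert(parts > 0);
        subranges.clear();
        const std::size_t n = size();
        Iterator cur = first;
        for (std::size_t p = 0; p < parts; ++p) {
            const std::size_t len = n / parts + (p < n % parts ? 1 : 0);
            Iterator next = std::next(cur, static_cast<std::ptrdiff_t>(len));
            subranges.emplace_back(cur, next);
            cur = next;
        }
    }
};
```

Because every subrange is itself a RangeSketch, calling partition() on a subrange gives the nested structure the slides associate with nested parallelism.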

pRange

Each subspace is disjoint and could itself be a pRange
– Nested parallelism

[Figure: a Data Space viewed as a pRange and divided into subspaces, with one subspace shown as a nested pRange]

    stapl::pRange<stapl::pVector<int>::iterator> dataRange(segBegin, segEnd);
    dataRange.partition();
    stapl::pRange<stapl::pVector<int>::iterator> dataSubrange = dataRange.get_subrange(3);
    dataSubrange.partition_like(<0.25,0.25,0.25,0.25> * size);
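
The last call in the slide code partitions a subrange according to a list of fractions. How partition_like is implemented is not shown in the slides, so the following is only a sketch of one way such fractions could be turned into subrange boundaries. The function boundaries_from_weights and its rounding policy are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of STAPL: convert weights such as
// {0.25, 0.25, 0.25, 0.25} into boundary offsets over n elements.
// Subrange i covers the half-open interval [bounds[i], bounds[i + 1]).
std::vector<std::size_t> boundaries_from_weights(
        std::size_t n, const std::vector<double>& weights) {
    assert(!weights.empty());
    double total = 0.0;
    for (double w : weights) {
        assert(w >= 0.0);
        total += w;
    }
    assert(total > 0.0);

    std::vector<std::size_t> bounds;
    bounds.push_back(0);
    double cumulative = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        cumulative += weights[i];
        // The last boundary is pinned to n so rounding never drops elements.
        std::size_t end = (i + 1 == weights.size())
            ? n
            : static_cast<std::size_t>(cumulative / total * static_cast<double>(n));
        if (end < bounds.back()) end = bounds.back();  // keep offsets monotone
        bounds.push_back(end);
    }
    return bounds;
}
```

With four equal weights over 100 elements, the offsets come out as 0, 25, 50, 75, 100.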


pContainer

pContainer is the parallel counterpart of the STL container
– Provides parallel and concurrent methods

Maintains an internal pRange
– Updated during insert/delete operations
– Minimizes redistribution

Completed: pVector, pList, pTree

Example: [Figure: a pVector, shown with its pRange and underlying STL vector]

pAlgorithm

pAlgorithm is the parallel counterpart of the STL algorithm

Parallel algorithms take as input
– a pRange
– work functions that operate on subranges
and apply the work function to all subranges

    template<class SubRange>
    class pAddOne : public stapl::pFunction {
    public:
      ...
      // Increment every element of the subrange.
      void operator()(SubRange& spr) {
        typename SubRange::iterator i;
        for (i = spr.begin(); i != spr.end(); i++)
          (*i)++;
      }
    };
    ...
    p_transform(pRange, pAddOne);
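
To make "apply the work function to all subranges" concrete, here is a sketch in standard C++ threads rather than STAPL. SubRangeSketch, apply_to_subranges, and AddOne are hypothetical names. The one-thread-per-subrange policy is an assumption chosen for brevity, not a description of the STAPL executor.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// One subrange = a pair of iterators into the underlying container.
template <class Iterator>
using SubRangeSketch = std::pair<Iterator, Iterator>;

// Hypothetical stand-in for a pAlgorithm: run `work` once per subrange, each
// on its own thread. Subranges must be disjoint, so no locking is needed.
template <class Iterator, class WorkFunction>
void apply_to_subranges(const std::vector<SubRangeSketch<Iterator>>& subranges,
                        WorkFunction work) {
    std::vector<std::thread> workers;
    workers.reserve(subranges.size());
    for (const auto& sr : subranges) {
        workers.emplace_back([sr, &work] { work(sr.first, sr.second); });
    }
    for (auto& t : workers) t.join();  // wait for every subrange to finish
}

// Work function in the spirit of the slide's pAddOne: increment each element.
struct AddOne {
    template <class Iterator>
    void operator()(Iterator first, Iterator last) const {
        for (; first != last; ++first) ++(*first);
    }
};
```

Calling apply_to_subranges(subs, AddOne{}) on two halves of a vector increments every element exactly once. The work function stays unaware of how many subranges or threads exist.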

Run-Time System

Support for different architectures
– HP V2200
– SGI Origin 2000, SGI Power Challenge

Support for different paradigms
– OpenMP, Pthreads
– MPI

Memory allocation
– HOARD

[Figure: the run-time system mapping a pAlgorithm onto a machine; labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Proc 12, Proc 13, Proc 14, Proc 15]

Run-Time System

Scheduler
– Determines an execution order (DDG)
– Policies:
    Automatic: Static, Block, Dynamic, Partial Self Scheduling, Complete Self Scheduling
    User defined

Distributor
– Hierarchical data distribution
– Automatic and user defined

Executor
– Executes the DDG
    Processor assignment
    Synchronization and communication
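
As an illustration of "determine an execution order" from a data dependence graph, the sketch below uses Kahn's topological-sort algorithm. This is a textbook technique substituted here for illustration. The slides do not say which algorithm the STAPL scheduler uses, and the adjacency-list encoding is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical DDG encoding: node i is a subrange task, and edges[i] lists
// the tasks that must wait for task i. Returns one valid execution order.
std::vector<std::size_t> execution_order(
        const std::vector<std::vector<std::size_t>>& edges) {
    const std::size_t n = edges.size();
    std::vector<std::size_t> pending(n, 0);  // unmet dependences per task
    for (const auto& successors : edges) {
        for (std::size_t s : successors) {
            assert(s < n);
            ++pending[s];
        }
    }

    std::queue<std::size_t> ready;  // tasks with all dependences satisfied
    for (std::size_t i = 0; i < n; ++i) {
        if (pending[i] == 0) ready.push(i);
    }

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t task = ready.front();
        ready.pop();
        order.push_back(task);  // an executor would launch the task here
        for (std::size_t s : edges[task]) {
            if (--pending[s] == 0) ready.push(s);
        }
    }
    assert(order.size() == n && "dependence graph must be acyclic");
    return order;
}
```

All tasks in the ready queue at the same moment are independent of one another. That is the point at which an executor could assign them to different processors.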

STL to STAPL Automatic Translation

C++ preprocessor converts STL code into STAPL parallel code
Iterators are used to construct pRanges
User is responsible for safe parallelization

    #include <start_STAPL>
    accumulate(x.begin(), x.end(), 0);
    for_each(x.begin(), x.end(), foo());
    #include <stop_STAPL>

Preprocessing phase:

    pi_accumulate(x.begin(), x.end(), 0);
    pi_for_each(x.begin(), x.end(), foo());

pRange construction:

    p_accumulate(x_pRange, 0);
    p_for_each(x_pRange, foo());

• In some cases automatic translation provides similar performance to code written directly in STAPL (5% deterioration)
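
The remark that the user is responsible for safe parallelization can be illustrated with accumulate. The sketch below is plain sequential C++, not STAPL, and chunked_accumulate is a hypothetical name. It computes one partial result per chunk and then combines them, which matches std::accumulate only when the operation is associative.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model of accumulating over a partitioned range: one partial
// sum per chunk, then a final combine. The chunk loop runs sequentially here,
// but its iterations are independent and could run on separate processors.
int chunked_accumulate(const std::vector<int>& x, std::size_t chunks, int init) {
    assert(chunks > 0);
    std::vector<int> partial(chunks, 0);
    const std::size_t n = x.size();
    for (std::size_t c = 0; c < chunks; ++c) {
        const auto lo = static_cast<std::ptrdiff_t>(n * c / chunks);
        const auto hi = static_cast<std::ptrdiff_t>(n * (c + 1) / chunks);
        partial[c] = std::accumulate(x.begin() + lo, x.begin() + hi, 0);
    }
    // Combine step: init is applied once, as in the sequential call.
    return std::accumulate(partial.begin(), partial.end(), init);
}
```

For integer addition the chunked and sequential results agree. For a non-associative operation they generally would not, and a preprocessor cannot check that on the user's behalf.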

Performance: p_inner_product

[Figure: experimental results on HP V2200]

pTree

Parallel tree supports bulk commutative operations in parallel
Each processor is assigned a set of subtrees to maintain
Operations on the base are atomic
Operations on subtrees are parallel

[Figure: tree with a Base (atomic) above Subtrees (parallel), with processors P1, P2, P3]

Example: Parallel Insertion Algorithm
Each processor is given a set of elements
1) Each processor creates local buckets corresponding to the subtrees
2) Each processor collects the buckets that correspond to its subtrees
3) Elements in the subtree buckets are inserted into the tree in parallel
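
The three steps above can be sketched as follows. This is a sequential model in standard C++, not the pTree implementation. The splitter keys that stand in for the base, the use of std::set for each subtree, and the name bucketed_insert are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical model of the bucketed insertion. `splitters` plays the role of
// the base: keys below splitters[0] belong to subtree 0, keys in
// [splitters[0], splitters[1]) to subtree 1, and so on.
std::vector<std::set<int>> bucketed_insert(
        const std::vector<std::vector<int>>& per_proc_elements,
        const std::vector<int>& splitters) {
    assert(std::is_sorted(splitters.begin(), splitters.end()));
    const std::size_t num_subtrees = splitters.size() + 1;
    const std::size_t num_procs = per_proc_elements.size();

    // Step 1: each processor sorts its own elements into local buckets.
    std::vector<std::vector<std::vector<int>>> buckets(
        num_procs, std::vector<std::vector<int>>(num_subtrees));
    for (std::size_t p = 0; p < num_procs; ++p) {
        for (int key : per_proc_elements[p]) {
            const auto pos =
                std::upper_bound(splitters.begin(), splitters.end(), key);
            const auto subtree = static_cast<std::size_t>(pos - splitters.begin());
            buckets[p][subtree].push_back(key);
        }
    }

    // Steps 2 and 3: the owner of each subtree collects the buckets meant for
    // it and inserts them. Different subtrees never touch the same data.
    std::vector<std::set<int>> subtrees(num_subtrees);
    for (std::size_t s = 0; s < num_subtrees; ++s) {
        for (std::size_t p = 0; p < num_procs; ++p) {
            subtrees[s].insert(buckets[p][s].begin(), buckets[p][s].end());
        }
    }
    return subtrees;
}
```

The loops over processors and over subtrees run one after another in this sketch, but no iteration depends on another. That independence is what allows the insertions into the subtrees to proceed in parallel.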

pTree

Basis for STAPL pSet, pMultiSet, pMap, pMultiMap containers
– Covers all remaining STL containers

Results are sequentially consistent although internal structure may vary
Requires negligible additional memory
pTrees can be used either sequentially or in parallel in the same execution
– Allows switching back and forth between parallel and sequential

Performance: pTree

[Figure: experimental results on HP V2200]

Algorithm Adaptivity

Problem: parallel algorithms are highly sensitive to
– Architecture: number of processors, memory interconnection, cache, available resources, etc.
– Environment: thread management, memory allocation, operating system policies, etc.
– Data characteristics: input type, layout, etc.

Solution: implement a number of different algorithms and adaptively choose the best one at run-time

Adaptive Framework

Case Study - Adaptive Sorting

| Sort | Strength | Weakness |
|---|---|---|
| Column | Theoretically time optimal | Many passes over data |
| Merge | Low memory overhead | Poor scalability |
| Radix | Extremely fast | Integers only |
| Sample | Two passes over data | High memory overhead |

Performance: Adaptive Sorting

[Figure: performance on 10 million integers; panels: V2200, Power Challenge, Origin 2000]

Performance: Run-Time Tests

    if (data_type == INTEGER)
        radix_sort();
    else if (num_procs < 5)
        merge_sort();
    else
        column_sort();

[Figure: results on Origin 2000]
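
The run-time test above can be written as a small compilable function. The enum names and the choose_sort signature are hypothetical. Only the decision rule (radix for integers, merge below five processors, column otherwise) comes from the slide.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative types; the slide names only INTEGER and the three sorts.
enum class DataType { Integer, Other };
enum class SortChoice { Radix, Merge, Column };

// Selects a sorting algorithm from run-time information, following the
// decision rule shown on the slide.
SortChoice choose_sort(DataType data_type, std::size_t num_procs) {
    assert(num_procs > 0);
    if (data_type == DataType::Integer) return SortChoice::Radix;
    if (num_procs < 5) return SortChoice::Merge;
    return SortChoice::Column;
}
```

Returning a choice rather than calling the sort directly keeps the selection policy separate from the algorithms. That makes it straightforward to swap in thresholds tuned for a different machine.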

Performance: Molecular Dynamics*

Discrete time particle interaction simulation
– Written in STL
– Time steps calculate system evolution (dependence)
– Parallelized within time step

STAPL utilization:
– pAlgorithms: p_for_each, p_transform, p_accumulate
– pContainers: pVector (push_back)
– Automatic vs. manual (5% performance deterioration)

* Code written by Danny Rintoul at Sandia National Labs

Performance: Molecular Dynamics

Execution time (sec), experimental results on HP V2200

| Number of particles | 1 processor | 4 processors | 8 processors | 12 processors | 16 processors |
|---|---|---|---|---|---|
| 108K | 2815 | 1102 | 546 | 386 | 309 |
| 23k | 627 | 238 | 132 | 94 | 86 |

• 40%-49% parallelized
• Input sensitive
• Use pTree on rest

Performance - Particle Transport*

Generic particle transport solver
– Regular and arbitrary grids
– Numerically intensive, 25k-line C++ STAPL code
– Sweep function unaware of parallel issues

STAPL utilization:
– pAlgorithms: p_for_each
– pContainers: pVector (for data distribution)
– Scheduler: determine grid data dependencies
– Executor: satisfy data dependencies

* Joint effort between Texas A&M Nuclear Engineering and Computer Science, funded by DOE ASCI

Performance - Particle Transport

Profile and speedups on SGI Origin 2000 using 16 processors

| Code Region | % Seq. | Speedup |
|---|---|---|
| Create computational grid | 10.00 | 14.5 |
| Scattering across group-sets | 0.05 | N/A |
| Scattering within a group-set | 0.40 | 15.94 |
| Sweep | 86.86 | 14.72 |
| Convergence across group sets | 0.05 | N/A |
| Convergence within group sets | 0.05 | N/A |
| Other | 2.59 | N/A |
| Total | 100.00 | 14.70 |
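
To read the speedup figures, it can help to convert them to parallel efficiency (speedup divided by processor count). The helpers below are a generic sketch using the standard definitions. The function names are illustrative and nothing here is taken from the STAPL code.

```cpp
#include <cassert>
#include <cstddef>

// Speedup: serial execution time divided by parallel execution time.
double speedup(double serial_seconds, double parallel_seconds) {
    assert(parallel_seconds > 0.0);
    return serial_seconds / parallel_seconds;
}

// Parallel efficiency: fraction of ideal (linear) speedup actually achieved.
double efficiency(double measured_speedup, std::size_t processors) {
    assert(processors > 0);
    return measured_speedup / static_cast<double>(processors);
}
```

For the reported total speedup of 14.70 on 16 processors, efficiency(14.70, 16) evaluates to roughly 0.92.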

Performance - Particle Transport

[Figure: experimental results on SGI Origin 2000]

Summary

Parallel equivalent to STL
– Many codes can immediately utilize STAPL
– Automatic translation

Building block library
– Portability (layered architecture)
– Performance (adaptive)
– Automatic recursive parallelism

STAPL performs well in small pAlgorithm test cases and large codes

STAPL Status and Current Work

pAlgorithms - fully implemented
pContainers - pVector, pList, pTree
pRange - mostly implemented
Run-Time
– Executor fully implemented
– Scheduler fully implemented
– Distributor work in progress

Adaptive mechanism (case study: sorting)

OpenMP + MPI (mixed) work in progress
– OpenMP version fully implemented
– MPI version work in progress

http://www.cs.tamu.edu/research/parasol

Project funded by
• NSF
• DOE

