
Post on 13-May-2021


Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 3 Parallel Algorithm Design

Parallel Programming in C with MPI and OpenMP

Michael J. Quinn (modified by Wonyong Sung)


Parallel Programming

• Load balancing – best with data-level parallel processing

• Low communication overhead – architecture dependent

• Precedence relations and scheduling – memory/cache/I/O considerations


Partitioning

Dividing both computation and data into pieces.

Domain (data) decomposition
• Divide the data into pieces – in many cases the best partitioning
• SPMD (Single Program Multiple Data) paradigm
• Good for massively parallel distributed-memory multicomputers

Functional decomposition
• Divide the computation into pieces
• May be good for heterogeneous multiprocessors


Partitioning Checklist

• At least 10x more primitive tasks than processors in the target computer
• Minimize redundant computation and redundant data storage
• Primitive tasks roughly the same size
• Number of tasks an increasing function of problem size (scalable partitioning)


Communication

Determine the values passed among tasks.

Local communication
• A task needs values from a small number of other tasks
• Create channels illustrating the data flow

Global communication
• A significant number of tasks contribute data to a computation
• Don’t create channels for them early in the design


Communication Checklist

• Communication operations balanced among tasks
• Each task communicates with only a small group of neighbors
• Tasks can perform their communications concurrently
• Tasks can perform their computations concurrently


Agglomeration

Grouping tasks into larger tasks. Goals:
• Eliminate communication between primitive tasks agglomerated into a consolidated task
• Maintain scalability of the program
• Combine groups of sending and receiving tasks

In MPI programming, goal often to create one agglomerated task per processor


Agglomeration Checklist

• Locality of the parallel algorithm has increased
• Replicated computations take less time than the communications they replace
• Data replication doesn’t affect scalability
• Agglomerated tasks have similar computational and communication costs
• Number of tasks increases with problem size
• Number of tasks suitable for likely target systems
• Tradeoff between agglomeration and code-modification costs is reasonable


Mapping

The process of assigning tasks to processors.

MPI and OpenMP programming models:
• MPI: purely parallel – needs a parallel algorithm from the start
• OpenMP: fork-join model – start from a sequential version and parallelize incrementally

Conflicting goals of mapping:
• Maximize processor utilization
• Minimize interprocessor communication

Finding an optimal mapping is an NP-hard problem.


Mapping Decision Tree

Static number of tasks
  Structured communication
    Constant computation time per task
      • Agglomerate tasks to minimize communication
      • Create one task per processor
    Variable computation time per task
      • Cyclically map tasks to processors
  Unstructured communication
    • Use a static load-balancing algorithm

Dynamic number of tasks


Mapping Strategy

Static number of tasks: handled by the decision tree above.

Dynamic number of tasks:
• Frequent communication between tasks – use a dynamic load-balancing algorithm, which analyzes the current tasks and produces a new mapping of tasks to processors
• Many short-lived tasks – use a run-time task-scheduling algorithm


Task Scheduling

Centralized method
• One manager and many workers
• When a worker processor has nothing to do, it requests a task from the manager
• The manager can sometimes become a bottleneck

Distributed method
• Each processor maintains its own list of tasks
• Push (processors with too many tasks send some to others) and pull strategies

Hybrid method


Mapping Checklist

• Considered designs based on one task per processor and multiple tasks per processor
• Evaluated static and dynamic task allocation
• If dynamic task allocation is chosen, the task allocator is not a performance bottleneck
• If static task allocation is chosen, the ratio of tasks to processors is at least 10:1


Case Studies

• Boundary value problem
• Finding the maximum
• The n-body problem
• Adding data input


Partitioning

• One data item per grid point
• Associate one primitive task with each grid point
• Two-dimensional domain decomposition


Communication

• Identify the communication pattern between primitive tasks
• Each interior primitive task has three incoming and three outgoing channels


Sequential execution time

χ – time to update one element
n – number of elements
m – number of iterations

Sequential execution time: m(n − 1)χ


Parallel Execution Time

p – number of processors
λ – message latency

Parallel execution time: m(χ⌈(n − 1)/p⌉ + 2λ)


Finding the Maximum Error

Computed    0.15    0.16    0.16    0.19
Correct     0.15    0.16    0.17    0.18
Error (%)   0.00%   0.00%   6.25%   5.26%

Maximum error: 6.25%


Reduction

Given an associative operator ⊕, compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an−1

Examples
• Add
• Multiply
• And, Or
• Maximum, Minimum


Binomial Trees

Subgraph of hypercube


Finding Global Sum

4 2 0 7

-3 5 -6 -3

8 1 2 3

-4 4 6 -1


Finding Global Sum

1 7 -6 4

4 5 8 2


Finding Global Sum

8 -2

9 10


Finding Global Sum

17 8


Finding Global Sum

25

Binomial Tree


Agglomeration


Agglomeration

(Figure: each agglomerated task computes a local partial sum before the global reduction.)


Partitioning

Domain partitioning
• Assume one task per particle
• Each task holds its particle’s position and velocity vector

Each iteration:
• Get the positions of all other particles
• Compute the particle’s new position and velocity


Gather


All-gather


Scatter


Scatter in log p Steps

(Figure: recursive halving – the root starts with items 1–8; in each step every task holding data sends half of it to another task: 12345678 splits into 1234 and 5678, then into 12 / 34 / 56 / 78, then into single items.)


Summary: Task/channel Model

Parallel computation
• A set of tasks
• Interactions through channels

Good designs
• Maximize local computation
• Minimize communication
• Scale up


Summary: Design Steps

• Partition the computation
• Agglomerate tasks
• Map tasks to processors

Goals
• Maximize processor utilization
• Minimize interprocessor communication


Summary: Fundamental Algorithms

• Reduction
• Gather and scatter
• All-gather