
Post on 13-May-2021


Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 3 Parallel Algorithm Design

Parallel Programming in C with MPI and OpenMP

Michael J. Quinn (modified by Wonyong Sung)


Parallel Programming

• Load balancing – best with data-level parallel processing

• Low communication overhead – architecture dependent

• Precedence relations and scheduling – memory/cache/I/O considerations


Partitioning

Dividing both computation and data into pieces.

Domain (data) decomposition
• Divide the data into pieces – in many cases the best partitioning
• SPMD (Single Program Multiple Data) paradigm
• Good for massively parallel distributed-memory multicomputers

Functional decomposition
• Divide the computation into pieces
• May be good for heterogeneous multiprocessors


Partitioning Checklist

• At least 10x more primitive tasks than processors in the target computer
• Minimize redundant computation and redundant data storage
• Primitive tasks roughly the same size
• Number of tasks an increasing function of problem size (scalable partitioning)


Communication

Determine the values passed among tasks.

Local communication
• A task needs values from a small number of other tasks
• Create channels illustrating the data flow

Global communication
• A significant number of tasks contribute data to a computation
• Don’t create channels for them early in the design


Communication Checklist

• Communication operations balanced among tasks
• Each task communicates with only a small group of neighbors
• Tasks can perform their communications concurrently
• Tasks can perform their computations concurrently


Agglomeration

Grouping tasks into larger tasks. Goals:
• Eliminate communication between primitive tasks agglomerated into a consolidated task
• Maintain scalability of the program
• Combine groups of sending and receiving tasks

In MPI programming, goal often to create one agglomerated task per processor


Agglomeration Checklist

• Locality of the parallel algorithm has increased
• Replicated computations take less time than the communications they replace
• Data replication doesn’t affect scalability
• Agglomerated tasks have similar computational and communication costs
• Number of tasks increases with problem size
• Number of tasks suitable for likely target systems
• Tradeoff between agglomeration and code-modification costs is reasonable


Mapping

The process of assigning tasks to processors.

MPI and OpenMP programming models:
• MPI: purely parallel – needs a parallel algorithm from the start
• OpenMP: fork-join model – start from a sequential version and parallelize incrementally

Conflicting goals of mapping:
• Maximize processor utilization
• Minimize interprocessor communication

Finding an optimal mapping is an NP-hard problem.


Mapping Decision Tree

Static number of tasks
  Structured communication
    Constant computation time per task
      • Agglomerate tasks to minimize communication
      • Create one task per processor
    Variable computation time per task
      • Cyclically map tasks to processors
  Unstructured communication
    • Use a static load-balancing algorithm

Dynamic number of tasks


Mapping Strategy

Static number of tasks: handled by the decision tree above.

Dynamic number of tasks:
• Frequent communication between tasks – use a dynamic load-balancing algorithm, which analyzes the current tasks and produces a new mapping of tasks to processors
• Many short-lived tasks – use a run-time task-scheduling algorithm


Task Scheduling

Centralized method
• One manager and many workers
• When a worker processor has nothing to do, it requests a task from the manager
• The manager can sometimes become a bottleneck

Distributed method
• Each processor maintains its own list of tasks
• Push (processors with too many tasks send some to others) and pull strategies

Hybrid method


Mapping Checklist

• Considered designs based on one task per processor and multiple tasks per processor
• Evaluated static and dynamic task allocation
• If dynamic task allocation is chosen, the task allocator is not a performance bottleneck
• If static task allocation is chosen, the ratio of tasks to processors is at least 10:1


Case Studies

• Boundary value problem
• Finding the maximum
• The n-body problem
• Adding data input


Partitioning

• One data item per grid point
• Associate one primitive task with each grid point
• Two-dimensional domain decomposition


Communication

• Identify the communication pattern between primitive tasks
• Each interior primitive task has three incoming and three outgoing channels


Sequential execution time

χ – time to update one element
n – number of elements
m – number of iterations

Sequential execution time: m(n − 1)χ


Parallel Execution Time

p – number of processors
λ – message latency

Parallel execution time: m(χ⌈(n − 1)/p⌉ + 2λ)


Finding the Maximum Error

Computed    0.15    0.16    0.16    0.19
Correct     0.15    0.16    0.17    0.18
Error (%)   0.00%   0.00%   6.25%   5.26%

Maximum error: 6.25%


Reduction

Given an associative operator ⊕, compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an−1

Examples
• Add
• Multiply
• And, Or
• Maximum, Minimum


Binomial Trees

Subgraph of hypercube


Finding Global Sum

4 2 0 7

-3 5 -6 -3

8 1 2 3

-4 4 6 -1


Finding Global Sum

1 7 -6 4

4 5 8 2


Finding Global Sum

8 -2

9 10


Finding Global Sum

17 8


Finding Global Sum

25

Binomial Tree


Agglomeration


Agglomeration

(Figure: each agglomerated task computes a local partial sum before the global reduction.)


Partitioning

Domain partitioning
• Assume one task per particle
• Each task holds its particle’s position and velocity vector

Each iteration:
• Get the positions of all other particles
• Compute the particle’s new position and velocity


Gather


All-gather


Scatter


Scatter in log p Steps

(Figure: recursive halving – the root starts with items 1–8; in each step every task holding data sends half of it to another task: 12345678 splits into 1234 and 5678, then into 12 / 34 / 56 / 78, then into single items.)


Summary: Task/channel Model

Parallel computation
• A set of tasks
• Interactions through channels

Good designs
• Maximize local computation
• Minimize communication
• Scale up


Summary: Design Steps

• Partition the computation
• Agglomerate tasks
• Map tasks to processors

Goals
• Maximize processor utilization
• Minimize interprocessor communication


Summary: Fundamental Algorithms

• Reduction
• Gather and scatter
• All-gather