
CSCE569 Parallel Computing
Lecture 4, TTH 3:30PM-4:45PM
Dr. Jianjun Hu
http://mleg.cse.sc.edu/edu/csce569/
University of South Carolina, Department of Computer Science and Engineering

Outline: Parallel Programming Model and Algorithm Design
Task/channel model
Algorithm design methodology
Case studies

Task/Channel Model
Parallel computation = set of tasks
Task:
  Program
  Local memory
  Collection of I/O ports
Tasks interact by sending messages through channels

Task/Channel Model
Figure: tasks connected to one another by channels.

Foster’s Design Methodology
Partitioning
Communication
Agglomeration
Mapping

Foster’s Methodology
Problem → Partitioning → Communication → Agglomeration → Mapping

Partitioning
Dividing computation and data into pieces
Domain decomposition
  Divide data into pieces
  Determine how to associate computations with the data
Functional decomposition
  Divide computation into pieces
  Determine how to associate data with the computations

Example Domain Decompositions

Example Functional Decomposition

Partitioning Checklist
At least 10x more primitive tasks than processors in target computer
Minimize redundant computations and redundant data storage
Primitive tasks roughly the same size
Number of tasks an increasing function of problem size

Communication
Determine values passed among tasks
Local communication
  Task needs values from a small number of other tasks
  Create channels illustrating data flow
Global communication
  Significant number of tasks contribute data to perform a computation
  Don’t create channels for them early in design

Communication Checklist
Communication operations balanced among tasks
Each task communicates with only a small group of neighbors
Tasks can perform communications concurrently
Tasks can perform computations concurrently

Agglomeration
Grouping tasks into larger tasks
Goals
  Improve performance
  Maintain scalability of program
  Simplify programming
In MPI programming, goal often to create one agglomerated task per processor

Agglomeration Can Improve Performance
Eliminate communication between primitive tasks agglomerated into consolidated task
Combine groups of sending and receiving tasks

Agglomeration Checklist
Locality of parallel algorithm has increased
Replicated computations take less time than communications they replace
Data replication doesn’t affect scalability
Agglomerated tasks have similar computational and communications costs
Number of tasks increases with problem size
Number of tasks suitable for likely target systems
Tradeoff between agglomeration and code modification costs is reasonable

Mapping
Process of assigning tasks to processors
Centralized multiprocessor: mapping done by operating system
Distributed memory system: mapping done by user
Conflicting goals of mapping
  Maximize processor utilization
  Minimize interprocessor communication

Mapping Example

Optimal Mapping
Finding optimal mapping is NP-hard
Must rely on heuristics

Mapping Decision Tree
Static number of tasks
  Structured communication
    Constant computation time per task
      Agglomerate tasks to minimize communication
      Create one task per processor
    Variable computation time per task
      Cyclically map tasks to processors
  Unstructured communication
    Use a static load balancing algorithm
Dynamic number of tasks

Mapping Strategy
Static number of tasks
Dynamic number of tasks
  Frequent communications between tasks: use a dynamic load balancing algorithm
  Many short-lived tasks: use a run-time task-scheduling algorithm

Mapping Checklist
Considered designs based on one task per processor and multiple tasks per processor
Evaluated static and dynamic task allocation
If dynamic task allocation chosen, task allocator is not a bottleneck to performance
If static task allocation chosen, ratio of tasks to processors is at least 10:1

Case Studies
Boundary value problem
Finding the maximum
The n-body problem
Adding data input

Boundary Value Problem
Figure: a thin rod, wrapped in insulation, with both ends held in ice water.

Rod Cools as Time Progresses

Finite Difference Approximation
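A brief sketch of the discretization, assuming the standard explicit scheme for the 1-D heat equation (space index $i$, time index $j$, mesh ratio $r = k\,\Delta t/(\Delta x)^2$):

$$u_{i,j+1} = r\,u_{i-1,j} + (1 - 2r)\,u_{i,j} + r\,u_{i+1,j}$$

Each new value depends on the point itself and its two spatial neighbors at the previous time step, which drives the communication pattern analyzed below.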

Partitioning
One data item per grid point
Associate one primitive task with each grid point
Two-dimensional domain decomposition

Communication
Identify communication pattern between primitive tasks
Each interior primitive task has three incoming and three outgoing channels

Agglomeration and Mapping
Figure: the primitive tasks along the rod are agglomerated into one task per processor.

Sequential Execution Time
$\chi$ – time to update one element
$n$ – number of elements
$m$ – number of iterations
Sequential execution time: $m(n-1)\chi$

Parallel Execution Time
$p$ – number of processors
$\lambda$ – message latency
Parallel execution time: $m(\chi \lceil (n-1)/p \rceil + 2\lambda)$
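A minimal sketch of this agglomerated design in C with MPI, assuming one block of the rod per process, illustrative constants N, M, and R, and that p divides N; each iteration exchanges one ghost cell with each neighbor, the communication counted by the $2\lambda$ term:

```c
#include <mpi.h>
#include <stdlib.h>

#define N 1000    /* grid points on the rod (assumed) */
#define M 10000   /* time steps (assumed)             */
#define R 0.25    /* mesh ratio k*dt/dx^2 (assumed)   */

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local_n = N / p;                        /* points owned by this process */
    /* +2 ghost cells; calloc zeroes them, so the outer ghost cells on the end
       processes model the 0-degree ice-water boundaries */
    double *u  = calloc(local_n + 2, sizeof(double));
    double *nu = calloc(local_n + 2, sizeof(double));
    for (int i = 1; i <= local_n; i++) u[i] = 100.0;   /* initial temperature (assumed) */

    for (int step = 0; step < M; step++) {
        /* exchange ghost cells with left and right neighbors */
        if (id > 0)
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, id - 1, 0,
                         &u[0], 1, MPI_DOUBLE, id - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (id < p - 1)
            MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, id + 1, 0,
                         &u[local_n + 1], 1, MPI_DOUBLE, id + 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* explicit finite difference update of the interior points */
        for (int i = 1; i <= local_n; i++)
            nu[i] = R * u[i - 1] + (1.0 - 2.0 * R) * u[i] + R * u[i + 1];

        double *tmp = u; u = nu; nu = tmp;      /* advance to the next time level */
    }

    free(u); free(nu);
    MPI_Finalize();
    return 0;
}
```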

Finding the Maximum Error

Computed:  0.15   0.16   0.16   0.19
Correct:   0.15   0.16   0.17   0.18
Error (%): 0.00%  0.00%  6.25%  5.26%

Maximum error: 6.25%

Reduction
Given associative operator $\oplus$, compute $a_0 \oplus a_1 \oplus a_2 \oplus \dots \oplus a_{n-1}$
Examples
  Add
  Multiply
  And, Or
  Maximum, Minimum

Parallel Reduction Evolution
Figures: three slides showing the task/channel graph as partial results are combined pairwise over successive steps.

Binomial Trees

Subgraph of hypercube

Finding Global Sum
Initial values (16 tasks):
   4   2   0   7
  -3   5  -6  -3
   8   1   2   3
  -4   4   6  -1
After step 1 (8 partial sums):
   1   7  -6   4
   4   5   8   2
After step 2 (4 partial sums):
   8  -2
   9  10
After step 3 (2 partial sums):
  17   8
After step 4 (global sum):
  25

Binomial Tree

Agglomeration
Figure: primitive tasks are agglomerated into one task per processor; each agglomerated task computes a local sum, and the partial sums are then combined, as in the sketch below.
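A minimal sketch of the agglomerated global sum in C with MPI, under the assumption that each process first reduces its own block locally and MPI_Reduce (typically implemented as a tree-structured combination) then merges the p partial sums:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* illustrative local block; in practice each process holds its share of the data */
    double local[4] = { id + 1.0, id + 2.0, id + 3.0, id + 4.0 };

    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < 4; i++)
        local_sum += local[i];                 /* local (agglomerated) reduction */

    /* combine the p partial sums; only the root receives the result */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (id == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Swapping MPI_SUM for MPI_MAX gives the maximum-reduction needed for the error example earlier in this lecture.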

The n-body Problem
Figure: every particle exerts a force on every other particle, so each particle’s new state depends on the positions of all the others.

Partitioning
Domain partitioning
Assume one task per particle
Task has particle’s position, velocity vector
Iteration:
  Get positions of all other particles
  Compute new position, velocity

Gather

All-gather

Complete Graph for All-gather

Hypercube for All-gather

Communication Time
Hypercube all-gather:
$$\sum_{i=1}^{\log p} \left( \lambda + \frac{2^{i-1} n}{\beta p} \right) = \lambda \log p + \frac{n(p-1)}{\beta p}$$
Complete graph all-gather:
$$(p-1)\left( \lambda + \frac{n/p}{\beta} \right) = (p-1)\lambda + \frac{(p-1)n}{\beta p}$$
where $\lambda$ is the message latency and $\beta$ is the channel bandwidth (data items per unit time).
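A minimal sketch of the position exchange for the n-body iteration in C with MPI, assuming N particles split evenly across processes and three coordinates per particle (the constant N and the placeholder initialization are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* total number of particles (assumed) */

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local_n = N / p;                                     /* particles per process */
    double *my_pos  = malloc(local_n * 3 * sizeof(double));  /* x, y, z per particle  */
    double *all_pos = malloc(N * 3 * sizeof(double));

    for (int i = 0; i < local_n * 3; i++)
        my_pos[i] = (double) id;                             /* placeholder positions */

    /* after this call, every process holds the positions of all N particles */
    MPI_Allgather(my_pos, local_n * 3, MPI_DOUBLE,
                  all_pos, local_n * 3, MPI_DOUBLE, MPI_COMM_WORLD);

    /* ... compute forces, then new velocities and positions, from all_pos ... */

    free(my_pos);
    free(all_pos);
    MPI_Finalize();
    return 0;
}
```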

Adding Data Input

Scatter

Scatter in log p Steps
Figure: eight items (1-8) start on one process; at each step every process holding data passes half of it to a partner, so all blocks are in place after log p steps.
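A minimal sketch of the data-input step in C with MPI, assuming process 0 reads (here, fabricates) the full array and MPI_Scatter hands one contiguous block to each process; implementations typically use a tree, matching the log p steps on the slide:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8   /* total data items (assumed divisible by p) */

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *all = NULL;
    if (id == 0) {                                   /* only the root holds the input */
        all = malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) all[i] = i + 1;  /* items 1..8, as in the figure  */
    }

    int local_n = N / p;
    int *block = malloc(local_n * sizeof(int));

    /* distribute one contiguous block of the input to each process */
    MPI_Scatter(all, local_n, MPI_INT, block, local_n, MPI_INT,
                0, MPI_COMM_WORLD);

    printf("process %d received %d item(s) starting with %d\n", id, local_n, block[0]);

    free(block);
    if (id == 0) free(all);
    MPI_Finalize();
    return 0;
}
```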

Summary: Task/Channel Model
Parallel computation
  Set of tasks
  Interactions through channels
Good designs
  Maximize local computations
  Minimize communications
  Scale up

Summary: Design Steps
Partition computation
Agglomerate tasks
Map tasks to processors
Goals
  Maximize processor utilization
  Minimize inter-processor communication

Summary: Fundamental Algorithms
Reduction
Gather and scatter
All-gather