CSCE569 Parallel Computing


Lecture 4, TTh 03:30PM-04:45PM
Dr. Jianjun Hu
http://mleg.cse.sc.edu/edu/csce569/
University of South Carolina, Department of Computer Science and Engineering

Chapter Objectives
- Creating 2-D arrays
- Thinking about "grain size"
- Introducing point-to-point communications
- Reading and printing 2-D matrices
- Analyzing performance when computations and communications overlap

Outline
- All-pairs shortest path problem
- Dynamic 2-D arrays
- Parallel algorithm design
- Point-to-point communication
- Block row matrix I/O
- Analysis and benchmarking

All-pairs Shortest Path Problem

[Figure: a directed, weighted graph on the vertices A, B, C, D, E with edge weights between 1 and 6.]

Resulting Adjacency Matrix Containing Distances

        A    B    C    D    E
   A    0    6    3    6    4
   B    4    0    7   10    8
   C   12    6    0    3    1
   D    7    3   10    0   11
   E    9    5   12    2    0

Floyd’s Algorithm

for k ← 0 to n-1
  for i ← 0 to n-1
    for j ← 0 to n-1
      a[i,j] ← min(a[i,j], a[i,k] + a[k,j])
    endfor
  endfor
endfor
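As a concrete reference, here is a minimal serial C version of the loop above. The 4-vertex distance matrix is a made-up illustration, not the graph from the earlier figure, and INF is a stand-in for "no edge".

#include <stdio.h>

#define N   4
#define INF 1000000            /* stands in for "no edge" */

/* Floyd's algorithm: afterwards a[i][j] is the shortest distance from i to j */
void floyd(int a[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (a[i][k] + a[k][j] < a[i][j])
                    a[i][j] = a[i][k] + a[k][j];
}

int main(void)
{
    int a[N][N] = {             /* illustrative input, 0 on the diagonal */
        { 0,   3, INF,   7},
        { 8,   0,   2, INF},
        { 5, INF,   0,   1},
        { 2, INF, INF,   0}
    };

    floyd(a);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%8d", a[i][j]);
        printf("\n");
    }
    return 0;
}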

Why It Works

After iteration k-1, a[i,j] holds the length of the shortest path from i to j whose intermediate vertices all come from {0, 1, ..., k-1}. The shortest path from i to j that may additionally pass through vertex k is either that path, or the shortest path from i to k through 0, 1, ..., k-1 joined with the shortest path from k to j through 0, 1, ..., k-1. All three lengths were computed in previous iterations, so iteration k only needs the update a[i,j] ← min(a[i,j], a[i,k] + a[k,j]).

Dynamic 1-D Array Creation

[Figure: pointer A on the run-time stack referencing a block of elements allocated on the heap.]
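A minimal sketch of the pattern the figure describes: the pointer variable lives on the run-time stack, while the elements it refers to are allocated on the heap. The names and sizes here are arbitrary.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 10;

    /* A itself is on the run-time stack; the n ints are on the heap */
    int *A = (int *) malloc(n * sizeof(int));
    if (A == NULL) return 1;

    for (int i = 0; i < n; i++)
        A[i] = i * i;
    printf("A[%d] = %d\n", n - 1, A[n - 1]);

    free(A);
    return 0;
}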

Dynamic 2-D Array Creation

[Figure: on the run-time stack, Bstorage points to one contiguous block of elements on the heap, and B points to an array of row pointers into that block.]
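A sketch of the two-step allocation the figure suggests, using the names Bstorage and B from the slide: one contiguous block for all the elements plus an array of row pointers, so the code can write B[i][j] while the data stays contiguous (which simplifies file I/O and message passing later).

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 4, m = 5;                                        /* n rows, m columns   */

    int  *Bstorage = (int *)  malloc(n * m * sizeof(int));   /* contiguous elements */
    int **B        = (int **) malloc(n * sizeof(int *));     /* row pointers        */
    if (Bstorage == NULL || B == NULL) return 1;

    for (int i = 0; i < n; i++)
        B[i] = &Bstorage[i * m];                             /* row i starts at offset i*m */

    B[2][3] = 42;                                            /* ordinary 2-D indexing */
    printf("B[2][3] = %d\n", B[2][3]);

    free(B);
    free(Bstorage);
    return 0;
}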

Designing the Parallel Algorithm
- Partitioning
- Communication
- Agglomeration and mapping

Partitioning
- Domain or functional decomposition? Look at the pseudocode.
- The same assignment statement is executed n³ times.
- No functional parallelism.
- Domain decomposition: divide matrix A into its n² elements.

Communication

- Primitive tasks: one per matrix element; e.g., updating a[3,4] when k = 1 needs a[3,1] and a[1,4]
- Iteration k: every task in row k broadcasts its value within its task column
- Iteration k: every task in column k broadcasts its value within its task row

Agglomeration and Mapping
- Number of tasks: static
- Communication among tasks: structured
- Computation time per task: constant
- Strategy: agglomerate tasks to minimize communication, creating one task per MPI process

Two Data Decompositions

[Figure: rowwise block striped decomposition vs. columnwise block striped decomposition of the matrix.]

Comparing Decompositions
- Columnwise block striped: the broadcast within columns is eliminated
- Rowwise block striped: the broadcast within rows is eliminated, and reading the matrix from a file is simpler
- Choose the rowwise block striped decomposition (see the sketch below for how rows are assigned to processes)
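A sketch of how a rowwise block striped decomposition assigns rows to processes. The BLOCK_LOW / BLOCK_HIGH / BLOCK_SIZE macros are written out here for illustration; the course code may provide equivalents in a shared header.

#include <stdio.h>

/* block decomposition of n rows over p processes */
#define BLOCK_LOW(id,p,n)   ((id) * (n) / (p))
#define BLOCK_HIGH(id,p,n)  (BLOCK_LOW((id) + 1, (p), (n)) - 1)
#define BLOCK_SIZE(id,p,n)  (BLOCK_HIGH((id), (p), (n)) - BLOCK_LOW((id), (p), (n)) + 1)

int main(void)
{
    int n = 10, p = 4;
    for (int id = 0; id < p; id++)
        printf("process %d owns rows %d..%d (%d rows)\n",
               id, BLOCK_LOW(id, p, n), BLOCK_HIGH(id, p, n), BLOCK_SIZE(id, p, n));
    return 0;
}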

File Input

[Figure: reading the matrix from a file one block of rows at a time and sending each block to the process that owns it.]

Pop Quiz

Why don’t we input the entire file at onceand then scatter its contents among theprocesses, allowing concurrent messagepassing?

Point-to-point Communication
- Involves a pair of processes
- One process sends a message
- The other process receives the message

Send/Receive Not Collective

Function MPI_Send

int MPI_Send (
    void         *message,      /* address of the first element to send */
    int           count,        /* number of elements to send */
    MPI_Datatype  datatype,     /* type of each element */
    int           dest,         /* rank of the destination process */
    int           tag,          /* message tag */
    MPI_Comm      comm          /* communicator */
)

Function MPI_Recv

int MPI_Recv (
    void         *message,      /* address of the receive buffer */
    int           count,        /* maximum number of elements to receive */
    MPI_Datatype  datatype,     /* type of each element */
    int           source,       /* rank of the sending process */
    int           tag,          /* message tag */
    MPI_Comm      comm,         /* communicator */
    MPI_Status   *status        /* information about the received message */
)

Coding Send/Receive

…
if (ID == j) {
    … Receive from i …
}
…
if (ID == i) {
    … Send to j …
}
…

The receive appears before the send. Why does this work?
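A minimal runnable sketch of the same pattern with concrete MPI calls; the choice of ranks (0 receives, 1 sends), the tag, and the value sent are made up for illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int id, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    if (id == 0) {          /* the "process j" role: posts the receive first */
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("process 0 received %d\n", value);
    } else if (id == 1) {   /* the "process i" role: sends */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}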

Inside MPI_Send and MPI_Recv

[Figure: MPI_Send copies the message from the sending process's program memory, typically into a system buffer; the message travels to a system buffer on the receiving side, and MPI_Recv copies it into the receiving process's program memory.]

Return from MPI_Send
- The function blocks until the message buffer is free
- The message buffer is free when the message has been copied to a system buffer, or the message has been transmitted
- Typical scenario: the message is copied to a system buffer and transmission overlaps computation

Return from MPI_Recv
- The function blocks until the message is in the buffer
- If the message never arrives, the function never returns

Deadlock
- Deadlock: a process waits for a condition that will never become true
- It is easy to write send/receive code that deadlocks:
  - Two processes both receive before they send (see the sketch below)
  - The send tag doesn't match the receive tag
  - A process sends its message to the wrong destination process
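A minimal sketch of the first case above: both ranks post a blocking receive before their send, so neither receive can ever be matched. The ranks, tag, and values are made up; reversing the send/receive order on one of the two ranks (or using MPI_Sendrecv) removes the deadlock.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int id, other, mine, theirs;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    other = 1 - id;          /* intended to run with exactly 2 processes */
    mine  = id;

    /* DEADLOCK: both processes block here, each waiting for a message
       the other process has not sent yet */
    MPI_Recv(&theirs, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
    MPI_Send(&mine,   1, MPI_INT, other, 0, MPI_COMM_WORLD);

    printf("process %d received %d\n", id, theirs);   /* never reached */
    MPI_Finalize();
    return 0;
}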

Computational Complexity
- Innermost loop has complexity Θ(n)
- Middle loop is executed at most ⌈n/p⌉ times
- Outer loop is executed n times
- Overall complexity: Θ(n³/p)

Communication Complexity
- No communication in the inner loop
- No communication in the middle loop
- Broadcast in the outer loop: complexity Θ(n log p)
- Overall communication complexity: Θ(n² log p)

Execution Time Expression (1)

n ⌈n/p⌉ n χ + n ⌈log p⌉ (λ + 4n/β)

Term by term:
- n: iterations of the outer loop
- ⌈n/p⌉: iterations of the middle loop
- n: iterations of the inner loop
- χ: cell update time
- n: iterations of the outer loop (one broadcast each)
- ⌈log p⌉: messages per broadcast
- λ + 4n/β: message-passing time (latency λ plus transmission of a 4n-byte row at bandwidth β)

Computation/communication Overlap

Execution Time Expression (2)

n ⌈n/p⌉ n χ + n ⌈log p⌉ λ + 4n ⌈log p⌉ / β

Term by term:
- n: iterations of the outer loop
- ⌈n/p⌉: iterations of the middle loop
- n: iterations of the inner loop
- χ: cell update time
- n ⌈log p⌉ λ: message-passing time (latency) for ⌈log p⌉ messages per broadcast, one broadcast per outer-loop iteration
- 4n ⌈log p⌉ / β: message transmission that cannot be hidden, since the remaining transmission overlaps with computation
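For concreteness, a small sketch that evaluates both expressions for a range of process counts. The machine parameters χ (cell update time), λ (latency), and β (bandwidth) below are placeholder values, not the ones measured for the benchmark that follows.

#include <stdio.h>
#include <math.h>

/* Expression (1): n*ceil(n/p)*n*chi + n*ceil(log2 p)*(lambda + 4n/beta) */
double model1(int n, int p, double chi, double lambda, double beta)
{
    double msgs = ceil(log2((double) p));
    return n * ceil((double) n / p) * n * chi + n * msgs * (lambda + 4.0 * n / beta);
}

/* Expression (2): the transmission term appears once per broadcast tree,
   since the rest of the transmission overlaps with computation */
double model2(int n, int p, double chi, double lambda, double beta)
{
    double msgs = ceil(log2((double) p));
    return n * ceil((double) n / p) * n * chi + n * msgs * lambda + 4.0 * n * msgs / beta;
}

int main(void)
{
    int    n      = 1000;       /* placeholder problem size        */
    double chi    = 20.0e-9;    /* placeholder cell update time, s */
    double lambda = 300.0e-6;   /* placeholder message latency, s  */
    double beta   = 1.0e7;      /* placeholder bandwidth, bytes/s  */

    for (int p = 1; p <= 8; p++)
        printf("p=%d  expr1=%6.2f s  expr2=%6.2f s\n",
               p, model1(n, p, chi, lambda, beta), model2(n, p, chi, lambda, beta));
    return 0;
}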

Predicted vs. Actual Performance

Processes   Predicted (sec)   Actual (sec)
        1             25.54          25.54
        2             13.02          13.89
        3              9.01           9.60
        4              6.89           7.29
        5              5.86           5.99
        6              5.01           5.16
        7              4.40           4.50
        8              3.94           3.98

Summary
- Two matrix decompositions: rowwise block striped and columnwise block striped
- Blocking send/receive functions: MPI_Send and MPI_Recv
- Overlapping communication with computation