Introduction to the Message Passing Interface (MPI)


Introduction to the Message Passing Interface (MPI)

Jan Fostier

May 3rd, 2017

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies


Moore's Law

"Transistor count doubles every two years"

(Illustration from Wikipedia)

Moore's Law

(Illustration from Wikipedia: transistor counts in 2000 versus 2011)

Evolution of top 500 supercomputers over time

(Image taken from www.top500.org)

17.6 PFlops/s (#1 ranked)
76 TFlops/s (#500 ranked)

Exponential increase of supercomputer peak performance over time

Application area – performance share

(Image taken from www.top500.org; largest shares: "research" and "unknown")

Operating system family

(Largest shares: "Linux", "Unix", "Mixed")

Motivation for parallel computing

• Want to run the same program faster
  § What is considered an acceptable runtime depends on the application
    o SETI@Home, Folding@Home, GIMPS: years may be acceptable
    o For R&D applications: days or even weeks are acceptable
      – CFD, CEM, bioinformatics, cheminformatics, and many more
    o Prediction of tomorrow's weather should take less than a day of computation time
    o Some applications require real-time behavior
      – Computer games, algorithmic trading
• Want to run bigger datasets
• Want to reduce financial cost and/or power consumption

Distributed-memory architecture

(Diagram: several machines, each with its own CPU, memory and network interface (NI), connected by an interconnection network, e.g. Gigabit Ethernet, Infiniband, Myrinet, ...)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Message Passing Interface (MPI)

• MPI = library specification, not an implementation
• Most important implementations
  § Open MPI (http://www.open-mpi.org, MPI-3 standard)
  § Intel MPI (proprietary, MPI-3 standard)
• Specifies routines for (among others)
  § Point-to-point communication (between 2 processes)
  § Collective communication (> 2 processes)
  § Topology setup
  § Parallel I/O
• Bindings for C/C++ and Fortran

MPI reference works

• MPI standards: http://www.mpi-forum.org/docs/
• MPI: The Complete Reference (M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra). Available from http://switzernet.com/people/emin-gabrielyan/060708-thesis-ref/papers/Snir96.pdf
• Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd ed. (W. Gropp, E. Lusk, A. Skjellum)

MPI standard

• Started in 1992 (Workshop on Standards for Message-Passing in a Distributed Memory Environment) with support from vendors, library writers and academia
• MPI version 1.0 (May 1994)
  • Final pre-draft in 1993 (Supercomputing '93 conference)
  • Final version June 1994
• MPI version 2.0 (July 1997)
  • Support for one-sided communication
  • Support for process management
  • Support for parallel I/O
• MPI version 3.0 (September 2012)
  • Support for non-blocking collective communication
  • Fortran 2008 bindings
  • New one-sided communication routines

Hello world example in MPI

#include <mpi.h>
#include <iostream>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    cout << "Hello World from process " << rank << "/" << size << endl;

    MPI_Finalize();
    return EXIT_SUCCESS;
}

[john@doe ~]$ mpirun -np 4 ./helloWorld
Hello World from process 2/4
Hello World from process 3/4
Hello World from process 0/4
Hello World from process 1/4

Output order is random

Basic MPI routines

• int MPI_Init(int *argc, char ***argv)
  § Initialization: all processes must call this prior to any other MPI routine.
  § Strips off (possible) arguments provided by "mpirun".
• int MPI_Finalize(void)
  § Cleanup: all processes must call this routine at the end of the program.
  § All pending communication should have finished before calling this.
• int MPI_Comm_size(MPI_Comm comm, int *size)
  § Returns the size of the "communicator" associated with "comm"
  § Communicator = user-defined subset of processes
  § MPI_COMM_WORLD = communicator that involves all processes
• int MPI_Comm_rank(MPI_Comm comm, int *rank)
  § Returns the rank of the process in the communicator
  § Range: [0 ... size – 1]

Message Passing Interface Mechanisms

(Diagram: P processes (or instances) of the "Hello World" program, one per machine, connected by an interconnection network, e.g. Gigabit Ethernet, Infiniband, Myrinet, ...)

mpirun launches P independent processes across the different machines
• Each process is an instance of the same program
• Each process holds its own stack variables: rank = 0, 1, 2, ..., P-1 and size = P
• Each process prints "Hello World"

Terminology

• Computer program = passive collection of instructions.
• Process = instance of a computer program that is being executed.
• Multitasking = running multiple processes on a CPU.
• Thread = smallest stream of instructions that can be managed by an OS scheduler (= light-weight process).
• Distributed-memory system = multi-processor system where each processor has direct (fast) access to its own private memory and relies on inter-processor communication to access another processor's memory (typically slower).

Multithreading versus multiprocessing

Multi-threading:
• Single process
• Shared memory address space
• Protect data against simultaneous writing
• Limited to a single machine
• E.g. Pthreads, CILK, OpenMP, etc.

Message passing:
• Multiple processes
• Separate memory address spaces
• Explicitly communicate everything
• Multiple machines possible
• E.g. MPI, Unified Parallel C, PVM

(Diagram: in the multi-threaded model, a single process runs sequential part I, spawns threads A, B, C, D (thread spawning), joins them (thread joining), and runs sequential part II; in the message-passing model, processes A, B, C, D each run sequential parts I and II themselves.)

Message Passing Interface (MPI)

• MPI mechanisms (depend on the implementation)
  • Compiling an MPI program from source
    • mpicc -O3 main.cpp -o main        (C; mpicc / mpic++ default to gcc)
      = gcc -O3 main.cpp -o main -L<IncludeDir> -l<mpiLibs>
    • Also mpic++ (or mpicxx), mpif77, mpif90, etc.  (C++, Fortran)
  • Running MPI applications (manually)
    • mpirun -np <number of program instances> <your program>
    • List of worker nodes specified in some config file.
• Using MPI on the UGent HPC cluster
  • Load the appropriate module first, e.g.
    • module load intel/2017a
  • Compiling an MPI program from source
    • mpigcc (uses the GNU "gcc" C compiler)
    • mpiicc (uses the Intel "icc" C compiler)
    • mpigxx (uses the GNU "g++" C++ compiler)
    • mpiicpc (uses the Intel "icpc" C++ compiler)
  • Submit a job using a job script (see further)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Basic MPI point-to-point communication

...
int rank, size, count;
char b[40];
MPI_Status status;

... // init MPI and rank and size variables

if (rank != 0) {                    // branching on rank
    char str[] = "Hello World";
    MPI_Send(str, 12, MPI_CHAR, 0, 123, MPI_COMM_WORLD);
} else {
    for (int i = 1; i < size; i++) {
        MPI_Recv(b, 40, MPI_CHAR, i, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_CHAR, &count);
        printf("I received %s from process %d with size %d and tag %d\n",
               b, status.MPI_SOURCE, count, status.MPI_TAG);
    }
}
...

[john@doe ~]$ mpirun -np 4 ./ptpcomm
I received Hello World from process 1 with size 12 and tag 123
I received Hello World from process 2 with size 12 and tag 123
I received Hello World from process 3 with size 12 and tag 123

Message Passing Interface Mechanisms

(Diagram: P processes across the machines; the process with rank 0 performs the receives (Recv 3x in the 4-process example), the processes with ranks 1, 2, ..., P-1 each send a "Hello World" message to it over the interconnection network.)

mpirun launches P independent processes across the different machines
• Each process is an instance of the same program

Blocking send and receive

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

• buf: pointer to the message to send
• count: number of items to send
• datatype: datatype of each item
  – number of bytes sent: count * sizeof(datatype)
• dest: rank of the destination process
• tag: value to identify the message [0 ... at least 32767]
• comm: communicator specification (e.g. MPI_COMM_WORLD)

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status)

• buf: pointer to the buffer to store the received data
• count: upper bound (!) of the number of items to receive
• datatype: datatype of each item
• source: rank of the source process (or MPI_ANY_SOURCE)
• tag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }

Sending and receiving

Two-sided communication:
• Both the sender and the receiver are involved in the data transfer (as opposed to one-sided communication)
• A posted send must match a receive

When do MPI_Send and MPI_Recv match?
• 1. Rank of the receiving process
• 2. Rank of the sending process
• 3. Tag
  – custom value to distinguish messages from the same sender
• 4. Communicator

Rationale for communicators
• Used to create subsets of processes
• Transparent use of tags
  – modules can be written in isolation
  – communication within a module through its own communicator
  – communication between modules through a shared communicator

MPI Datatypes

MPI_Datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              no conversion, bit pattern transferred as is
MPI_PACKED            grouped messages

Querying for information

• MPI_Status
  • Stores information about the MPI_Recv operation

    typedef struct MPI_Status {
        int MPI_SOURCE;
        int MPI_TAG;
        int MPI_ERROR;
    };

  • Does not contain the size of the received message

• int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
  • returns the number of data items received in the count variable
  • not directly accessible from the status variable

Blocking send and receive

(Timing diagrams of the "request to send / ok to send / data" handshake:)
a) sender comes first: idling at the sender (no buffering of the message)
b) sending/receiving at about the same time: idling minimized
c) receiver comes first: idling at the receiver

(Slide reproduced from slides by John Mellor-Crummey)

Deadlocks

int a[10], b[10], myRank;
MPI_Status s1, s2;
MPI_Comm_rank( MPI_COMM_WORLD, &myRank );

if ( myRank == 0 ) {
    MPI_Send( a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD );
    MPI_Send( b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD );
} else if ( myRank == 1 ) {
    MPI_Recv( b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s1 );
    MPI_Recv( a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s2 );
}

(Slide credit: John Mellor-Crummey)

If MPI_Send is blocking (handshake protocol), this program will deadlock: rank 0 first sends the message with tag 1, while rank 1 first waits for the message with tag 2.
• If the 'eager' protocol is used, it may run
• Depends on the message size
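One way to remove the deadlock is to make the receive order on rank 1 match the send order on rank 0 (alternatively, MPI_Sendrecv, introduced later, can combine such exchanges). A minimal sketch of the reordered variant; the wrapper function exchange() is only illustrative:

#include <mpi.h>

/* Deadlock-free variant: rank 1 receives the messages in the order in
   which rank 0 sends them, so each blocking send finds a matching
   receive regardless of the protocol used by the MPI implementation. */
void exchange(int myRank) {
    int a[10] = {0}, b[10] = {0};
    MPI_Status s1, s2;

    if (myRank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    } else if (myRank == 1) {
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s1);  /* tag 1 first */
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s2);  /* then tag 2  */
    }
}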

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Network cost modeling

• Various choices of interconnection network
  § Gigabit Ethernet: cheap, but far too slow for HPC applications
  § Infiniband / Myrinet: high-speed interconnects
• Simple performance model for point-to-point communication:

      T_comm = α + β·n

  § α = latency
  § B = 1/β = saturation (asymptotic) bandwidth (bytes/s)
  § n = number of bytes to transmit
  § Effective bandwidth B_eff:

      B_eff = n / (α + β·n) = n / (α + n/B)
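As a worked example with illustrative (assumed, not measured) numbers α = 2 µs and B = 1/β = 1 GB/s:

      B_eff(1 KB) = 10^3 / (2·10^-6 + 10^-6)  ≈ 0.33 GB/s
      B_eff(1 MB) = 10^6 / (2·10^-6 + 10^-3)  ≈ 0.998 GB/s

Small messages are latency-dominated; only large messages approach the asymptotic bandwidth B, which is the trend measured on the Gengar cluster below.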

The case of Gengar (UGent cluster): bandwidth and latency

(Diagram: a node with two quad-core CPUs; each core has its own L1 cache, pairs of cores share 6 MByte of L2 cache, and the node holds 16 GByte of RAM; nodes are connected through non-blocking Infiniband (switched fabric). Figures shown: bandwidths of ~100 Gbit/s and ~20 Gbit/s, latencies of 2 ns, 50 ns, and ~1 μs to the 193 other machines.)

Measure effective bandwidth: ring test

Idea: send a single message of size N in a circle (process 0 → 1 → 2 → ... → P-1 → 0)
• Increase the message size N exponentially
  • 1 byte, 2 bytes, 4 bytes, ..., 1024 bytes, 2048 bytes, 4096 bytes, ...
• Benchmark the results (measure the wall clock time T)
• Bandwidth = N * P / T

Hands-on: ring test in MPI

void sendRing( char *buffer, int length ) {
    /* send message in a ring here */
}

int main( int argc, char * argv[] ) {
    ...
    char *buffer = (char*) calloc( 1048576, sizeof(char) );
    int msgLen = 8;
    for (int i = 0; i < 18; i++, msgLen *= 2) {
        double startTime = MPI_Wtime();
        sendRing( buffer, msgLen );
        double stopTime = MPI_Wtime();
        double elapsedSec = stopTime - startTime;
        if (rank == 0)
            printf( "Bandwidth for size %d is: %f\n", ... );
    }
    ...
}

Job script example

mpijob.sh job script example:

#!/bin/sh
#
#PBS -o output.file
#PBS -e error.file
#PBS -l nodes=2:ppn=all
#PBS -l walltime=00:02:00
#PBS -m n

cd $VSC_SCRATCH/<your directory>
module load intel/2017a
module load scripts
mympirun ./<program name> <program arguments>

qsub mpijob.sh
qstat / qdel / etc. remain the same

Hands-on: ring test in MPI (solution)

void sendRing( char *buffer, int msgLen ) {
    int myRank, numProc;
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
    MPI_Comm_size( MPI_COMM_WORLD, &numProc );
    MPI_Status status;

    int prevR = (myRank - 1 + numProc) % numProc;
    int nextR = (myRank + 1) % numProc;

    if (myRank == 0) { // send first, then receive
        MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
        MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD, &status );
    } else { // receive first, then send
        MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD, &status );
        MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
    }
}

Basic MPI routines

Timing routines in MPI

• double MPI_Wtime( void )
  • returns the time in seconds relative to "some time" in the past
  • "some time" in the past is fixed during the process
• double MPI_Wtick( void )
  • returns the resolution of MPI_Wtime() in seconds
  • e.g. 10^-3 = millisecond resolution
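A minimal usage sketch of these timing routines (the timed loop is a placeholder, not taken from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();            /* start of the timed region */
    for (int i = 0; i < 1000; i++) {
        /* ... work to be timed ... */
    }
    double t1 = MPI_Wtime();            /* end of the timed region */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("elapsed: %f s (timer resolution: %g s)\n",
               t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}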

Bandwidth on Gengar (UGent cluster)

(Plot: effective bandwidth B_eff (Mbyte/s, 0–1200) versus message size (byte, 1E+00 to 1E+08), for Gigabit Ethernet and Infiniband.)

The effective bandwidth increases for larger messages.

Benchmark results: comparison of CPU load

(Plot: CPU load for Infiniband versus Gigabit Ethernet.)

Exchanging messages in MPI

int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  int dest, int sendtag, void *recvbuf,
                  int recvcount, MPI_Datatype recvtype, int source,
                  int recvtag, MPI_Comm comm, MPI_Status *status )

• sendbuf: pointer to the message to send
• sendcount: number of elements to transmit
• sendtype: datatype of the items to send
• dest: rank of the destination process
• sendtag: identifier for the message
• recvbuf: pointer to the buffer to store the message (disjoint from sendbuf)
• recvcount: upper bound (!) to the number of elements to receive
• recvtype: datatype of the items to receive
• source: rank of the source process (or MPI_ANY_SOURCE)
• recvtag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }

int MPI_Sendrecv_replace( ... )
• The buffer is replaced by the received data

Basic MPI routines: Sendrecv example

const int len = 10000;
int a[len], b[len];
if ( myRank == 0 ) {
    MPI_Send( a, len, MPI_INT, 1, 0, MPI_COMM_WORLD );
    MPI_Recv( b, len, MPI_INT, 2, 2, MPI_COMM_WORLD, &status );
} else if ( myRank == 1 ) {
    MPI_Sendrecv( a, len, MPI_INT, 2, 1, b, len, MPI_INT, 0,
                  0, MPI_COMM_WORLD, &status );
} else if ( myRank == 2 ) {
    MPI_Sendrecv( a, len, MPI_INT, 0, 2, b, len, MPI_INT, 1,
                  1, MPI_COMM_WORLD, &status );
}

• Compatibility between Sendrecv and 'normal' send and recv
• Sendrecv can help to prevent deadlocks

(Diagram: the messages travel along the ring 0 → 1 → 2 → 0; safe to exchange!)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Non-blocking communication

Idea:
• Do something useful while waiting for communications to finish
• Try to overlap communications and computations

How?
• Replace blocking communication by non-blocking variants (see the sketch below)

    MPI_Send(...)                     →  MPI_Isend(..., MPI_Request *request)
    MPI_Recv(..., MPI_Status *status) →  MPI_Irecv(..., MPI_Request *request)

• I = immediate functions
• The MPI_Isend and MPI_Irecv routines return immediately
• Polling routines are needed to verify progress
  • The request handle is used to identify communications
  • The status field moves to the polling routines (see further)
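A minimal sketch of the pattern, assuming a hypothetical exchange between exactly two ranks (not taken from the slides): post the receive and send early, do useful work, and wait before touching the buffers (MPI_Waitall is one of the polling routines introduced below):

#include <mpi.h>

/* Post a non-blocking receive and send, overlap them with computation,
   and only then wait for completion. */
void overlap_example(int rank) {
    double sendbuf[1000] = {0}, recvbuf[1000];
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    int other = (rank == 0) ? 1 : 0;   /* assumes exactly 2 ranks */

    MPI_Irecv(recvbuf, 1000, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, 1000, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not touch sendbuf/recvbuf ... */

    MPI_Waitall(2, reqs, stats);       /* both transfers are now complete */
}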

Non-blocking communications

Asynchronous progress
• = ability to progress communications while performing calculations
• Depends on the hardware
  • Gigabit Ethernet = very limited
  • Infiniband = far more possibilities
• Depends on the MPI implementation
  • Multithreaded implementations of MPI (e.g. Open MPI)
  • Daemon for asynchronous progress (e.g. LAM/MPI)
• Depends on the protocol
  • Eager protocol
  • Handshake protocol
• Still the subject of ongoing research

Non-blocking sending and receiving

(Timing diagrams of the Isend / Irecv handshake:)
a) the network interface supports overlapping computations and communications
b) the network interface has no such support

(Slide reproduced from slides by John Mellor-Crummey)

Non-blocking communications: polling/waiting routines

int MPI_Wait( MPI_Request *request, MPI_Status *status )
    request: handle to identify the communication
    status: status information (cf. 'normal' MPI_Recv)

int MPI_Test( MPI_Request *request, int *flag, MPI_Status *status )
    Returns immediately. Sets flag = true if the communication has completed.

int MPI_Waitany( int count, MPI_Request *array_of_requests,
                 int *index, MPI_Status *status )
    Waits for exactly one communication to complete.
    If more than one communication has completed, it picks a random one.
    index returns the index of the completed communication.

int MPI_Testany( int count, MPI_Request *array_of_requests,
                 int *index, int *flag, MPI_Status *status )
    Returns immediately. Sets flag = true if at least one communication completed.
    If more than one communication has completed, it picks a random one.
    index returns the index of the completed communication.
    If flag = false, index returns MPI_UNDEFINED.

Example: client-server code

if ( rank != 0 ) { // client code
    while ( true ) { // generate requests and send to the server
        generate_request( data, &size );
        MPI_Send( data, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD );
    }
} else { // server code (rank == 0)
    MPI_Request *reqList = new MPI_Request[nProc];
    for ( int i = 0; i < nProc - 1; i++ )
        MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                   MPI_COMM_WORLD, &reqList[i] );
    while ( true ) { // main consumer loop
        MPI_Status status;
        int reqIndex, recvSize;
        MPI_Waitany( nProc-1, reqList, &reqIndex, &status );
        MPI_Get_count( &status, MPI_CHAR, &recvSize );
        do_service( buffer[reqIndex].data, recvSize );
        MPI_Irecv( buffer[reqIndex].data, MAX_LEN, MPI_CHAR,
                   status.MPI_SOURCE, tag, MPI_COMM_WORLD, &reqList[reqIndex] );
    }
}

Non-blocking communications: polling/waiting routines (cont'd)

int MPI_Waitall( int count, MPI_Request *array_of_requests,
                 MPI_Status *array_of_statuses )
    Waits for all communications to complete.

int MPI_Testall( int count, MPI_Request *array_of_requests,
                 int *flag, MPI_Status *array_of_statuses )
    Returns immediately. Sets flag = true if all communications have completed.

int MPI_Waitsome( int incount, MPI_Request *array_of_requests,
                  int *outcount, int *array_of_indices,
                  MPI_Status *array_of_statuses )
    Waits for at least one communication to complete.
    outcount contains the number of communications that have completed.
    Completed requests are set to MPI_REQUEST_NULL.

int MPI_Testsome( int incount, MPI_Request *array_of_requests,
                  int *outcount, int *array_of_indices,
                  MPI_Status *array_of_statuses )
    Same as Waitsome, but returns immediately.
    The flag field is no longer needed: outcount = 0 if no communications have completed.

Example: improved client-server code

if ( rank != 0 ) { // same client code
    ...
} else { // server code (rank == 0)
    MPI_Request *reqList = new MPI_Request[nProc-1];
    MPI_Status *status = new MPI_Status[nProc-1];
    int *reqIndex = new int[nProc-1];
    int recvSize;
    for ( int i = 0; i < nProc - 1; i++ )
        MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                   MPI_COMM_WORLD, &reqList[i] );
    while ( true ) { // main consumer loop
        int numMsg;
        MPI_Waitsome( nProc-1, reqList, &numMsg, reqIndex, status );
        for ( int i = 0; i < numMsg; i++ ) {
            MPI_Get_count( &status[i], MPI_CHAR, &recvSize );
            do_service( buffer[reqIndex[i]].data, recvSize );
            MPI_Irecv( buffer[reqIndex[i]].data, MAX_LEN, MPI_CHAR,
                       status[i].MPI_SOURCE, tag, MPI_COMM_WORLD,
                       &reqList[reqIndex[i]] );
        }
    }
}

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Barrier synchronization

(Diagram: each process (by rank) computes, then calls MPI_Barrier and idles until the last process reaches the barrier; the waiting time is process dependent.)

MPI_Barrier(MPI_Comm comm)
This function does not return until all processes in comm have called it.

Broadcast

MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
          int root, MPI_Comm comm)

MPI_Bcast broadcasts count elements of type datatype stored in buffer at the root process to all other processes in comm, where this data is stored in buffer.

(Diagram: before the call, only the buffer at the root process contains useful data; the buffer contents at the non-root processes will be overwritten. After MPI_Bcast, the buffers at all processes p0 ... p4 contain the same count elements.)

Broadcast example

...
int rank, size;

... // init MPI and rank and size variables

int root = 0;
char buffer[12];

if (rank == root)                 // fill the buffer at the root process only
    sprintf(buffer, "Hello World");

MPI_Bcast(buffer, 12, MPI_CHAR, root, MPI_COMM_WORLD);   // all processes must call MPI_Bcast

printf("Process %d has %s stored in the buffer.\n", rank, buffer);
...

[john@doe ~]$ mpirun -np 4 ./broadcast
Process 1 has Hello World stored in the buffer.
Process 0 has Hello World stored in the buffer.
Process 3 has Hello World stored in the buffer.
Process 2 has Hello World stored in the buffer.

Broadcast algorithm

(Diagram: the root process p0 broadcasts a message of n bytes to processes p1 ... p7.)

• Linear algorithm: subsequently sending n bytes from the root process to the P-1 other processes takes (α + βn)(P - 1) time.
• A binary tree algorithm takes only (α + βn)⌈log2 P⌉ time.
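A sketch of such a binary tree broadcast built from point-to-point calls, assuming the root is rank 0; MPI_Bcast already implements this (usually better), so the code only illustrates where the ⌈log2 P⌉ communication rounds come from:

#include <mpi.h>

/* Tree broadcast from rank 0: in round k, every rank r < 2^k that already
   holds the data forwards it to rank r + 2^k (if that rank exists).
   After ceil(log2(P)) rounds, all P processes hold the data. */
void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int step = 1; step < size; step *= 2) {
        if (rank < step) {                    /* this rank already has the data */
            int dest = rank + step;
            if (dest < size)
                MPI_Send(buf, count, type, dest, 0, comm);
        } else if (rank < 2 * step) {         /* this rank receives in this round */
            MPI_Recv(buf, count, type, rank - step, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}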

Scatter

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendType,
            void *recvbuf, int recvcount, MPI_Datatype recvType,
            int root, MPI_Comm comm)

MPI_Scatter partitions sendbuf at the root process into P equal parts of size sendcount and sends each process in comm (including root) a portion in rank order.

(Diagram: root = p1 holds blocks d0 ... d4 (sendcount elements each) in its send buffer; after MPI_Scatter, process pi holds block di (recvcount elements) in its receive buffer. The send buffer only matters at the root process.)

Scatter example

int root = 0;
char recvBuf[7];

if (rank == root) {
    char sendBuf[25];                                   // fill the send buffer at the root process only
    sprintf(sendBuf, "This is the source data.");
    MPI_Scatter(sendBuf, 6, MPI_CHAR, recvBuf, 6, MPI_CHAR,
                root, MPI_COMM_WORLD);
} else {
    MPI_Scatter(NULL, 0, MPI_CHAR, recvBuf, 6, MPI_CHAR,   // first three parameters are ignored on non-root processes
                root, MPI_COMM_WORLD);
}

recvBuf[6] = '\0';
printf("Process %d has %s in receive buffer\n", rank, recvBuf);
...

[john@doe ~]$ mpirun -np 4 ./scatter
Process 1 has s the  in receive buffer
Process 0 has This i in receive buffer
Process 3 has  data. in receive buffer
Process 2 has source in receive buffer

Scatter algorithm

(Diagram: the root process p0 scatters blocks to processes p1 ... p7; one block = n bytes.)

• Linear algorithm: subsequently sending n bytes from the root process to the P-1 other processes takes (α + βn)(P - 1) time.
• A binary (tree) algorithm takes only α⌈log2 P⌉ + βn(P - 1) time (reduced number of communication rounds!).

Scatter (vector variant)

MPI_Scatterv(void *sendbuf, int *sendcnts, int *displs,
             MPI_Datatype sendType, void *recvbuf, int recvcnt,
             MPI_Datatype recvType, int root, MPI_Comm comm)

The partitions of sendbuf don't need to be of equal size and are specified per receiving process: their first index by the displs array, their size by the sendcnts array.

(Diagram: root = p1; process pi receives sendcnts[i] elements starting at offset displs[i] of the send buffer, which only matters at the root process.)

Gather

MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendType,
           void *recvbuf, int recvcount, MPI_Datatype recvType,
           int root, MPI_Comm comm)

MPI_Gather gathers equal partitions of size recvcount from each of the P processes in comm (including root) and stores them in recvbuf at the root process in rank order.

(Diagram: root = p1; each process pi holds block di (sendcount elements) in its send buffer; after MPI_Gather, the receive buffer at the root contains d0 ... d4. The receive buffer only matters at the root process.)

Gather is the opposite of Scatter.

A vector variant, MPI_Gatherv, exists, a similar generalization as MPI_Scatterv.

Gather example

int root = 0;
int sendBuf = rank;

if (rank == root) {
    int *recvBuf = new int[size];            // receive buffer exists at the root process only
    MPI_Gather(&sendBuf, 1, MPI_INT, recvBuf, 1, MPI_INT, root, MPI_COMM_WORLD);
    cout << "Receive buffer at root process: " << endl;
    for (size_t i = 0; i < size; i++)
        cout << recvBuf[i] << " ";
    cout << endl;
    delete [] recvBuf;
} else {
    MPI_Gather(&sendBuf, 1, MPI_INT, NULL, 1, MPI_INT,   // receive parameters are ignored on non-root processes
               root, MPI_COMM_WORLD);
}

[john@doe ~]$ mpirun -np 4 ./gather
Receive buffer at root process:
0 1 2 3

Gather algorithm

(Diagram: processes p1 ... p7 send their blocks to the root process p0; one block = n bytes.)

• Linear algorithm: subsequently sending n bytes from the P-1 processes to the root takes (α + βn)(P - 1) time.
• A binary (tree) algorithm takes only α⌈log2 P⌉ + βn(P - 1) time (reduced number of communication rounds!).

Allgather

MPI_Allgather(void *sendbuf, int sendcnt, MPI_Datatype sendType,
              void *recvbuf, int recvcnt, MPI_Datatype recvType,
              MPI_Comm comm)

MPI_Allgather is a generalization of MPI_Gather, in the sense that the data is gathered by all processes, instead of just the root process.

(Diagram: each process pi holds block di (sendcount elements) in its send buffer; after MPI_Allgather, the receive buffer of every process p0 ... p4 contains d0 ... d4.)

Allgather algorithm

(Diagram: butterfly exchange among processes p0 ... p7: blocks of n bytes, then 2n bytes, then 4n bytes are exchanged; one block = n bytes.)

• P calls to gather take P[α⌈log2 P⌉ + βn(P - 1)] time (using the best gather algorithm).
• Gather followed by broadcast takes 2α⌈log2 P⌉ + βn(P⌈log2 P⌉ + P - 1) time.
• The "butterfly" algorithm takes only α log2 P + βn(P - 1) time (in case P is a power of two).

All-to-all communication

MPI_Alltoall(void *sendbuf, int sendcnt, MPI_Datatype sendType,
             void *recvbuf, int recvcnt, MPI_Datatype recvType,
             MPI_Comm comm)

Using MPI_Alltoall, every process sends a distinct message to every other process.

(Diagram: process p0 holds a0 ... a4, p1 holds b0 ... b4, ..., p4 holds e0 ... e4 in their send buffers; after MPI_Alltoall, process pi holds blocks ai, bi, ci, di, ei in its receive buffer, i.e. the matrix of blocks is transposed.)

A vector variant, MPI_Alltoallv, exists, allowing for different sizes for each process.

Reduce

MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype dataType,
           MPI_Op op, int root, MPI_Comm comm)

The reduce operation aggregates ("reduces") scattered data at the root process.

(Diagram: root = p0; each process pi holds di (count elements) in its send buffer; after MPI_Reduce, the receive buffer at the root contains d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4, where ◊ = operation, like sum, product, maximum, etc.)

Reduce operations

Available reduce operations (associative and commutative):

MPI_MAX      maximum
MPI_MIN      minimum
MPI_SUM      sum
MPI_PROD     product
MPI_LAND     logical AND
MPI_BAND     bitwise AND
MPI_LOR      logical OR
MPI_BOR      bitwise OR
MPI_LXOR     logical exclusive OR
MPI_BXOR     bitwise exclusive OR
MPI_MAXLOC   maximum and its location
MPI_MINLOC   minimum and its location

User-defined operations are also possible.
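A minimal usage sketch (not taken from the slides): every process contributes its rank and the root receives the sum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendval = rank, sum = 0;
    /* Sum the ranks of all processes; only the root receives the result. */
    MPI_Reduce(&sendval, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, sum);  /* = size*(size-1)/2 */

    MPI_Finalize();
    return 0;
}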

Allreduce operation

MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
              MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)

Similar to the reduce operation, but the result is available on every process.

(Diagram: each process pi holds di (count elements) in its send buffer; after MPI_Allreduce, the receive buffer of every process p0 ... p4 contains d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4.)

Scan operation

MPI_Scan(void *sendbuf, void *recvbuf, int count,
         MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)

A scan performs a partial reduction of the data; every process has a distinct result.

(Diagram: each process pi holds di (count elements) in its send buffer; after MPI_Scan, p0 holds d0, p1 holds d0 ◊ d1, p2 holds d0 ◊ d1 ◊ d2, p3 holds d0 ◊ d1 ◊ d2 ◊ d3, and p4 holds d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4 in its receive buffer.)
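A minimal sketch of an inclusive prefix sum with MPI_Scan (illustrative, not taken from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process contributes the value (rank + 1); the scan returns
       the inclusive prefix sum 1 + 2 + ... + (rank + 1) on each process. */
    int myval = rank + 1, prefix = 0;
    MPI_Scan(&myval, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}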

Hands-on: matrix-vector multiplication

Master: coordinates the work of the others
Slave: does a bit of work

Task: compute A.b
A: double-precision (m x n) matrix
b: double-precision (n x 1) column matrix

Master algorithm:
1. Broadcast b to each slave
2. Send 1 row of A to each slave
3. while ( not all m results received ) {
       Receive result from any slave s
       if ( not all m rows sent )
           Send new row to slave s
       else
           Send termination message to s
   }
4. continue

Slave algorithm:
1. Broadcast b (in fact receive b)
2. do {
       Receive message m
       if ( m != termination ) {
           compute result
           send result to master
       }
   } while ( m != termination )
3. slave terminates

Matrix-vector multiplication

int main( int argc, char** argv ) {
    int rows = 100, cols = 100;   // dimensions of a
    double **a;
    double *b, *c;
    int master = 0;               // rank of master
    int myid;                     // rank of this process
    int numprocs;                 // number of processes

    // allocate memory for a, b and c
    a = (double**)malloc(rows * sizeof(double*));
    for( int i = 0; i < rows; i++ )
        a[i] = (double*)malloc(cols * sizeof(double));
    b = (double*)malloc(cols * sizeof(double));
    c = (double*)malloc(rows * sizeof(double));

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myid );
    MPI_Comm_size( MPI_COMM_WORLD, &numprocs );

    if( myid == master )
        // execute master code
    else
        // execute slave code

    MPI_Finalize();
}

Matrix-vector multiplication (master code)

// initialize a and b
for(int j = 0; j < cols; j++) {
    b[j] = 1.0;
    for(int i = 0; i < rows; i++) a[i][j] = i;
}

// broadcast b to each slave
MPI_Bcast( b, cols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD );

// send a row of a to each slave, tag = row number
int numsent = 0;
for( int i = 0; (i < numprocs-1) && (i < rows); i++ ) {
    MPI_Send( a[i], cols, MPI_DOUBLE_PRECISION, i+1, i, MPI_COMM_WORLD );
    numsent++;
}

for( int i = 0; i < rows; i++ ) {
    MPI_Status status; double ans; int sender;
    MPI_Recv( &ans, 1, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE,
              MPI_ANY_TAG, MPI_COMM_WORLD, &status );
    c[status.MPI_TAG] = ans;
    sender = status.MPI_SOURCE;
    if ( numsent < rows ) { // send more work if any
        MPI_Send( a[numsent], cols, MPI_DOUBLE_PRECISION,
                  sender, numsent, MPI_COMM_WORLD );
        numsent++;
    } else // send termination message
        MPI_Send( MPI_BOTTOM, 0, MPI_DOUBLE_PRECISION, sender,
                  rows, MPI_COMM_WORLD );
}

Matrix-vector multiplication (slave code)

// broadcast b to each slave (receive here)
MPI_Bcast( b, cols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD );

// receive rows of a, tag = row number
if( myid <= rows ) {
    double* buffer = (double*)malloc(cols*sizeof(double));
    while (true) {
        MPI_Status status;
        MPI_Recv( buffer, cols, MPI_DOUBLE_PRECISION, master,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &status );
        if( status.MPI_TAG != rows ) { // not a termination message
            double ans = 0.0;
            for(int i = 0; i < cols; i++)
                ans += buffer[i]*b[i];
            MPI_Send( &ans, 1, MPI_DOUBLE_PRECISION, master,
                      status.MPI_TAG, MPI_COMM_WORLD );
        } else
            break;
    }
} // more processes than rows => no work for some nodes

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Network cost modeling

• Simple performance model for multiple communications
  – Bisection bandwidth: sum of the bandwidths of the (average number of) links to cut in order to partition the network in two halves.
  – Diameter: maximum number of hops to connect any two devices.

(Diagram: a network split into two halves by a bisection cut.)

Bus topology

A bus is a communication channel that is shared by all connected devices.
• No more than two devices can communicate at any given time
• Hardware controls which devices have access
• High risk of contention when multiple devices try to access the bus simultaneously: a bus is a "blocking" interconnect.

Bus network properties
• Bisection bandwidth = point-to-point bandwidth (independent of the number of devices)
• Diameter = 1 (single hop)

Crossbar switch

(Diagram: input devices i1 ... i4 and output devices o1 ... o4 connected through a grid of switches.)

A crossbar switch can connect any combination of input/output pairs at full bandwidth = fully non-blocking.

Static routing in a crossbar switch

(Diagram: input devices i1 ... i4 and output devices o1 ... o4; each switch is either in state A (pass-through) or state B (turn).)

The switches at the crossing of an input/output pair should be in state B; the other switches should be in state A. The path selection is fixed = static routing.

Input–output pairings:

Input   Output
i1      o2
i2      o4
i3      o1
i4      o3

Crossbar switch for distributed-memory systems

• Crossbar implementation of a switch with P = 4 machines.
• Fully non-blocking
• Expensive: P^2 switches and O(P^2) wires
• Usually implemented in a switch (a 4-port non-blocking switch connecting machines M1 ... M4)

Clos switching networks

(Diagram: n inputs and n outputs connected through three stages of 4x4 crossbar (CB) switches: input stage, middle stage, output stage.)

• Built in three stages, using crossbar switches as components.
• Every output of an input-stage switch is connected to a different middle-stage switch.
• Every input can be connected to every output, using any one of the middle-stage switches.
• All input/output pairs can communicate simultaneously (non-blocking).
• Requires far less wiring than conventional crossbar switches.


Clos switching networks: adaptive routing

(Diagram sequence: connections A, B, C and D are already routed through the three-stage network; a new connection E is requested.)

• Adaptive routing may be necessary: with the current assignment of middle-stage switches, no free path can be found for E.
• Adaptive routing reroutes an existing path to a different middle-stage switch, freeing up capacity.
• After rerouting, E can now be connected as well.

Switch examples

(Photos of non-blocking switches: an Infiniband switch (24 ports) and a Gigabit Ethernet switch (24 ports).)

Mesh networks

(Diagram: a 2D mesh of nodes with a bisection cut.)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Basic performance terminology

• Runtime ("How long does it take to run my program?")
  § In practice, only the wall clock time matters
  § Depends on the number of parallel processes P
  § T_P = runtime using P processes
• Speedup ("How much faster does my program run in parallel?")
  § S_P = T_1 / T_P
  § In the ideal case, S_P = P
  § Superlinear speedups are usually due to cache effects
• Parallel efficiency ("How well is the parallel infrastructure used?")
  § η_P = S_P / P   (0 ≤ η_P ≤ 1)
  § In the ideal case, η_P = 1 = 100%
  § It depends on the application what is considered acceptable efficiency

Strong scaling: Amdahl's law

• Strong scaling = increasing the number of parallel processes for a fixed-size problem
• Simple model: partition the sequential runtime T_1 into a parallelizable fraction (1 - s) and an inherently sequential fraction s:
  § T_1 = s·T_1 + (1 - s)·T_1     (0 ≤ s ≤ 1)
  § Therefore, T_P = s·T_1 + (1 - s)·T_1 / P

(Diagram: as P grows from 1 to ∞, the parallelizable part (1 - s)·T_1 shrinks proportionally to 1/P while the sequential part s·T_1 remains.)

Strong scaling: Amdahl's law

• Consequently,

      S_P = T_1 / T_P = T_1 / (s·T_1 + (1 - s)·T_1 / P) = 1 / (s + (1 - s)/P)     (Amdahl's law)

• The speedup is bounded: S_∞ = 1/s
  § e.g. s = 1%: S_∞ = 100
  § e.g. s = 5%: S_∞ = 20
• Sources of the sequential fraction s·T_1
  § Process startup overhead
  § Inherently sequential portions of the code
  § Dependencies between subtasks
  § Communication overhead
  § Function calling overhead
  § Load imbalance

Amdahl's law

(Plot: parallel speedup versus the number of parallel processes P (0–30), for s = 5%, 2%, 1% and 0%; the s = 0% curve is the linear speedup.)

For small P, the speedup is close to linear.

Amdahl's law

(Same graph as on the previous slide, but with a logarithmic x-axis up to P = 65536: each curve saturates at S_∞ = 1/s.)

Amdahl's law

(Same graph as on the previous slide, but expressed as parallel efficiency: for s > 0 the efficiency drops towards zero as P grows.)

Weak scaling: Gustafson's law

• Weak scaling: increasing both the number of parallel processes P and the problem size N.
• Simple model: assume that the sequential part is constant, and that the parallelizable part is proportional to N:
  § T_1(N) = T_1(1)·[s + (1 - s)·N]     (0 ≤ s ≤ 1)

(Diagram: for N = 1, 2, 3, 4, ..., the sequential part s·T_1(1) stays constant while the parallelizable part (1 - s)·T_1(1) grows proportionally to N.)

Gustafson's law

• Therefore, T_P(N) = T_1(1)·[s + (1 - s)·N/P], and hence T_P(P) = T_1(1).
• Speedup: S_P(P) = s + (1 - s)·P = P - s·(P - 1)     (Gustafson's law)
• Solve increasingly larger problems using a proportionally higher number of processes (N = P).

(Diagram: for N = P = 1, 2, 3, 4, the runtime T_P(P) stays constant: the sequential part s·T_1(1) plus one parallelizable part (1 - s)·T_1(1) per process.)

Gustafson's law

(Plot: parallel speedup versus the number of parallel processes P = problem size N (0–250), for s = 5%, 2%, 1% and 0%: the speedup grows linearly with slope 1 - s.)

Gustafson's law

(Same graph as on the previous slide, but expressed as parallel efficiency and with a logarithmic scale: the efficiency tends to η_∞ = 1 - s.)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Case Study 1: Parallel matrix-matrix product

Case study: parallel matrix multiplication

• Matrix-matrix multiplication: C = α·A·B + β·C (BLAS xgemm)
  § Assume C is an m x n, A an m x k, and B a k x n matrix.
  § α, β = scalars; assume α = β = 1 in what follows, i.e. C = C + A·B
• Initially, the matrix elements are distributed among the P processes
  § Assume the same scheme for each matrix A, B and C
• Each process computes the values of C that are local to that process
  § Required data from A and B that is not local needs to be communicated
  § Performance modeling assumes the α, β, γ model:
    o α = latency
    o β = per-element transfer time
    o γ = time for a single floating-point computation

(Case study reproduced from J. Demmel)

Case study: parallel matrix-matrix product

• Two-dimensional partitioning
  § Partition the matrices in 2D over an r x c mesh of processes (P = r·c)
  § X(I,J) refers to block (I,J) of matrix X (X = {A, B, C})
  § Process p_n is also denoted p_i,j (n = i·c + j) and holds X(I,J)

(Diagram: the r x c process mesh p_0,0 ... p_r-1,c-1; e.g. matrix B (k x n) is partitioned into blocks B(I,J) over the r rows and c columns of processes.)

(Case study reproduced from J. Demmel)

Case study: parallel matrix-matrix product

• Second approach: two-dimensional partitioning (SUMMA)
  § Process p_i,j needs to compute C(I,J):

      C(I,J) = C(I,J) + Σ_{i=0}^{k-1} A(I,i) · B(i,J)      for all blocks I, J (do this in parallel)

  § The index i refers to a single column of A or a single row of B (k elements).

(Diagram: C(I,J) += A(I,i) * B(i,J); for process p_i,j, C(I,J) is local, the blocks A(I,i) and B(i,J) need to be communicated to p_i,j, and the remaining data is not needed by p_i,j.)

(Case study reproduced from J. Demmel)

Case study: parallel matrix-matrix product

• Second approach: two-dimensional partitioning
  § SUMMA algorithm (all I and J in parallel):

    for i = 0 to k-1
        broadcast A(I, i) within the process row
        broadcast B(i, J) within the process column
        C(I,J) += A(I, i) * B(i, J)
    endfor

  § Cost of the inner loop (executed k times):
      log2(c)·(α + β·m/r) + log2(r)·(α + β·n/c) + 2·m·n·γ/P
  § Total: T_P = 2·k·m·n·γ/P + k·α·(log2 c + log2 r) + k·β·((m/r)·log2 c + (n/c)·log2 r)
  § For n = m = k and r = c = √P we find:
      Parallel efficiency η_P = 1 / (1 + (α/γ)·(P·log2 P)/(2n²) + (β/γ)·(√P·log2 P)/n)
      Isoefficiency when n grows as √P (constant memory per node!)

(Case study reproduced from J. Demmel)
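A rough MPI sketch of this inner loop for a √P x √P process grid, built with MPI_Comm_split to form the row and column communicators. The block size bs, the scratch buffers and the naive local update are illustrative assumptions, not taken from the slides:

#include <mpi.h>
#include <string.h>

/* SUMMA-style sketch: rank layout is a sqrtP x sqrtP grid with
   myRow = rank / sqrtP and myCol = rank % sqrtP.  Ablock/Bblock/Cblock
   are the locally stored bs x bs blocks; Atmp/Btmp are scratch buffers. */
void summa(double *Ablock, double *Bblock, double *Cblock,
           double *Atmp, double *Btmp, int bs, int sqrtP, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int myRow = rank / sqrtP, myCol = rank % sqrtP;

    MPI_Comm rowComm, colComm;
    MPI_Comm_split(comm, myRow, myCol, &rowComm);   /* processes in my row    */
    MPI_Comm_split(comm, myCol, myRow, &colComm);   /* processes in my column */

    for (int l = 0; l < sqrtP; l++) {
        /* the owner of block column l of A / block row l of B broadcasts it */
        if (myCol == l) memcpy(Atmp, Ablock, bs * bs * sizeof(double));
        if (myRow == l) memcpy(Btmp, Bblock, bs * bs * sizeof(double));
        MPI_Bcast(Atmp, bs * bs, MPI_DOUBLE, l, rowComm);
        MPI_Bcast(Btmp, bs * bs, MPI_DOUBLE, l, colComm);

        /* local update C(I,J) += A(I,l) * B(l,J)  (naive triple loop) */
        for (int i = 0; i < bs; i++)
            for (int j = 0; j < bs; j++)
                for (int k = 0; k < bs; k++)
                    Cblock[i * bs + j] += Atmp[i * bs + k] * Btmp[k * bs + j];
    }
    MPI_Comm_free(&rowComm);
    MPI_Comm_free(&colComm);
}

In a production code, the naive triple loop would be replaced by a Level-3 BLAS call, which is exactly the "blocking" refinement shown on the next slide.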

Case study: parallel matrix-matrix product

• Even more efficient: use a "blocking" algorithm

    for i = 0 to k-1 step b
        end = min(i+b-1, k-1)
        broadcast A(I, i:end) within the process row
        broadcast B(i:end, J) within the process column
        C(I,J) += A(I, i:end) * B(i:end, J)      // perform this product using Level-3 BLAS
    endfor

• The SUMMA algorithm is implemented in PBLAS = Parallel BLAS
• The algorithm can be extended to a block-cyclic layout (see further)

Case study: parallel Gaussian elimination

(Image taken from J. Demmel)

Case Study 2: Parallel Sorting

Case study: parallel sorting algorithm

• Sequential sorting of n keys (1st Bachelor)
  § Bubble sort: O(n^2)
  § Merge sort: O(n log n), even in the worst case
  § Quicksort: O(n log n) expected, O(n^2) worst case, fast in practice
• Parallel sorting of n keys, using P processes
  § Initially, each process holds n/P keys (unsorted)
  § Eventually, each process holds n/P keys (sorted)
    o The keys per process are sorted
    o If q < r, each key assigned to process q is less than or equal to every key assigned to process r (sort keys in rank order)

Case study: parallel sorting algorithm

Bubble sort algorithm: O(n^2)

void Bubble_sort(int *a, int n) {
    for (int listLen = n; listLen >= 2; listLen--)
        for (int i = 0; i < listLen-1; i++)
            if (a[i] > a[i+1]) {       // "compare-swap" operation
                int temp = a[i];
                a[i] = a[i+1];
                a[i+1] = temp;
            }
}

Example: 5 2 4 8 1 → 2 4 5 1 8   (after iteration 1)
                   → 2 4 1 5 8   (after iteration 2)
                   → 2 1 4 5 8   (after iteration 3)
                   → 1 2 4 5 8   (after iteration 4)

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

• Bubble sort
  § The result of the current step (a[i] > a[i+1]?) depends on the previous step
    o The value of a[i] is determined by the previous step
    o The algorithm is "inherently serial"
    o Not much point in trying to parallelize this algorithm
• Odd-even transposition sort
  § Decouple the algorithm into two phases: even and odd
    o Even phase: compare-swap on the element pairs (a[0],a[1]), (a[2],a[3]), (a[4],a[5]), ...
    o Odd phase: compare-swap on the element pairs (a[1],a[2]), (a[3],a[4]), (a[5],a[6]), ...

Case study: parallel sorting algorithm

Even-odd transposition sort algorithm: O(n^2)

void Even_odd_sort(int *a, int n) {
    for (int phase = 0; phase < n; phase++)
        if (phase % 2 == 0) { // even phase
            for (int i = 0; i < n-1; i += 2)
                if (a[i] > a[i+1]) {      // "compare-swap" operation
                    int temp = a[i];
                    a[i] = a[i+1];
                    a[i+1] = temp;
                }
        } else { // odd phase
            for (int i = 1; i < n-1; i += 2)
                if (a[i] > a[i+1]) {      // "compare-swap" operation
                    int temp = a[i];
                    a[i] = a[i+1];
                    a[i+1] = temp;
                }
        }
}

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Even-odd transposition sort algorithm: O(n^2)

Example: 5 2 4 8 1 → 2 5 4 8 1   (even phase)
                   → 2 4 5 1 8   (odd phase)
                   → 2 4 1 5 8   (even phase)
                   → 2 1 4 5 8   (odd phase)
                   → 1 2 4 5 8   (even phase)

The parallelism within each even or odd phase is now obvious:
the compare-swap between (a[i], a[i+1]) is independent from (a[i+2], a[i+3]).

Case study: parallel sorting algorithm

Parallel algorithm
• First, assume P == n (one element per process)

(Diagram: during phase j, neighboring processes holding a[i-2], a[i-1], a[i], a[i+1], ... exchange values in pairs; during phase j+1 the pairing shifts by one. All compare-swaps within a phase execute in parallel.)

Communicate the value with the neighbor:
• The right process (highest rank) keeps the largest value
• The left process (lowest rank) keeps the smallest value

(Image reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Parallel algorithm
• Now assume n/P >> 1 (as is typically the case)
• Example: P = 4; n = 16

                 process 0     process 1     process 2     process 3
initial values   15,11,9,16    3,14,8,7      4,6,12,10     5,2,13,1
local sorting    9,11,15,16    3,7,8,14      4,6,10,12     1,2,5,13
phase 0 (even)   3,7,8,9       11,14,15,16   1,2,4,5       6,10,12,13
phase 1 (odd)    3,7,8,9       1,2,4,5       11,14,15,16   6,10,12,13
phase 2 (even)   1,2,3,4       5,7,8,9       6,10,11,12    13,14,15,16
phase 3 (odd)    1,2,3,4       5,6,7,8       9,10,11,12    13,14,15,16

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Parallel even-odd transposition sort pseudocode:

sort local keys
for (int phase = 0; phase < P; phase++) {
    neighbor = computeNeighbor(phase, myRank);
    if (I'm not idle) { // first and/or last process may be idle
        send all my keys to neighbor
        receive all keys from neighbor
        if (myRank < neighbor)
            keep smaller keys
        else
            keep larger keys
    }
}

Theorem: the parallel odd-even transposition sort algorithm sorts the input list after P (= number of processes) phases.

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Implementation of computeNeighbor (MPI):

int computeNeighbor(int phase, int myRank) {
    int neighbor;
    if (phase % 2 == 0) {
        if (myRank % 2 == 0)
            neighbor = myRank + 1;
        else
            neighbor = myRank - 1;
    } else {
        if (myRank % 2 == 0)
            neighbor = myRank - 1;
        else
            neighbor = myRank + 1;
    }
    if (neighbor == -1 || neighbor == P)   // out of range: no partner in this phase
        neighbor = MPI_PROC_NULL;
    return neighbor;
}

When MPI_PROC_NULL is used as the destination or source rank in MPI_Send or MPI_Recv, no communication takes place.

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Implementation of the data exchange in MPI:
• Be careful of deadlocks
• In both even and odd phases, communication always takes place between a process with an even rank and a process with an odd rank, so the two sides can order their calls differently:

if (myRank % 2 == 0) {
    MPI_Send(...);
    MPI_Recv(...);
} else {
    MPI_Recv(...);
    MPI_Send(...);
}

Exchange the order to prevent deadlocks! Alternatively, use MPI_Sendrecv(...), as in the sketch below.

(Algorithm reproduced from P. Pacheco)
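A minimal sketch of one phase of the key exchange using MPI_Sendrecv; the helper exchange_phase() and the merge step (a simple re-sort instead of a linear merge) are illustrative, not taken from the slides:

#include <mpi.h>
#include <stdlib.h>

static int cmp(const void *x, const void *y) {
    return (*(const int*)x > *(const int*)y) - (*(const int*)x < *(const int*)y);
}

/* One phase of parallel odd-even transposition sort: exchange the whole
   sorted local block with 'neighbor' via MPI_Sendrecv (deadlock-free, and
   a no-op when neighbor == MPI_PROC_NULL), then keep the smaller or
   larger half depending on the rank order. */
void exchange_phase(int *myKeys, int nLocal, int myRank, int neighbor) {
    int *recvKeys = malloc(nLocal * sizeof(int));
    int *merged   = malloc(2 * nLocal * sizeof(int));

    MPI_Sendrecv(myKeys,   nLocal, MPI_INT, neighbor, 0,
                 recvKeys, nLocal, MPI_INT, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (neighbor != MPI_PROC_NULL) {
        for (int i = 0; i < nLocal; i++) {
            merged[i] = myKeys[i];
            merged[nLocal + i] = recvKeys[i];
        }
        qsort(merged, 2 * nLocal, sizeof(int), cmp);    /* simple, not a linear merge */
        int offset = (myRank < neighbor) ? 0 : nLocal;  /* lower rank keeps smaller keys */
        for (int i = 0; i < nLocal; i++) myKeys[i] = merged[offset + i];
    }
    free(recvKeys);
    free(merged);
}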

Case study: parallel sorting algorithm

Parallel odd-even transposition sort algorithm analysis
• Initial sorting: O(n/P · log(n/P)) time
  • Use an efficient sequential sorting algorithm, e.g. quicksort or merge sort
• Per phase: 2(α + (n/P)·β) + γ·n/P
• Total runtime: T_P(n) = O(n/P · log(n/P)) + 2(α·P + n·β) + γ·n
                        = 1/P · O(n log n) + O(n)
• Linear speedup when P is small and n is large
• However, bad asymptotic behaviour
  • When n and P increase proportionally, the runtime per process is O(n)
  • What we really want: O(log n)
  • Difficult! (but possible!)

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

• Sorting networks (= graphical depiction of sorting algorithms)
  § A number of horizontal "wires" (= elements to sort)
  § Connected by vertical "comparators" (= compare and swap)

(Diagram: example network with 4 elements to sort on the vertical axis and time running horizontally; each comparator takes inputs x and y and outputs min(x,y) on one wire and max(x,y) on the other.)

Case study: parallel sorting algorithm

(Diagram: the bubble sort algorithm drawn as a sorting network, sequentially and in parallel.)

Sequential runtime = number of comparators = "size of the sorting network" = n·(n - 1)/2

Parallel runtime (assume P == n processes) = "depth of the sorting network" = 2n - 3

Case study: parallel sorting algorithm

Definition: depth of a sorting network (= parallel runtime)
• Zero at the inputs of each wire
• For a comparator with inputs of depth d1 and d2, the depth of its outputs is 1 + max(d1, d2)
• Depth of the sorting network = maximum depth over the outputs

(Diagram: the bubble sort network annotated with the depth of each comparator, increasing from 1 up to 2n - 3.)

Case study: parallel sorting algorithm

• Sorting network of odd-even transposition sort
  (Diagram: n wires with alternating even and odd comparator columns.)
• Parallel runtime = "depth of the sorting network" = n
• ... or P (we assume n == P)

Case study: parallel sorting algorithm

• Can we do better?
  • Sequential sorting algorithms are O(n log n)
  • Ideally, P and n can scale proportionally: P = O(n)
  • That means that we want to sort n numbers in O(log n) time
• This is possible (!), however with a big constant pre-factor
• We will describe an algorithm that can sort n numbers in O(log^2 n) parallel time (using P = O(n) processes)
  • This algorithm has the best performance in practice
  • ... unless n becomes huge (n > 2^2000)
  • Nobody wants to sort that many numbers

Case study: parallel sorting algorithm

• Theorem (the 0-1 principle): if a sorting network with n inputs sorts all 2^n binary strings of length n correctly, then it sorts all sequences correctly (proof: see references).

(Example diagram: a binary input sequence and its sorted 0/1 output.)

We will design an algorithm that can sort binary sequences in O(log^2 n) time.

Case study: parallel sorting algorithm

• Step 1: create a sorting network that sorts bitonic sequences
• Definition: a bitonic sequence is a sequence which is first increasing and then decreasing, or can be circularly shifted to become so.
  § (1, 2, 3, 3.14, 5, 4, 3, 2, 1) is bitonic
  § (4, 5, 4, 3, 2, 1, 1, 2, 3) is bitonic
  § (1, 2, 1, 2) is not bitonic
• Over zeros and ones, a bitonic sequence is of the form
  § 0^i 1^j 0^k or 1^i 0^j 1^k (with e.g. 0^i = 000...0 = i consecutive zeros)
  § i, j or k can be zero

Case study: parallel sorting algorithm

• Now, let's create a sorting network that sorts a bitonic sequence
• A half-cleaner network connects line i with line i + n/2

(Diagram: half-cleaner example for n = 8.)

• If the input is a binary bitonic sequence, then for the output:
  § Elements in the top half are smaller than the corresponding elements in the bottom half, i.e. the halves are relatively sorted.
  § One of the halves of the output consists of only zeros or only ones (i.e. is "clean"); the other half is bitonic.

Case study: parallel sorting algorithm

• Example of a half-cleaner network

(Diagram: a bitonic 0/1 input sequence of length 8 enters the half-cleaner; at the output, the upper half is smaller than the lower half; here the upper half is "clean" (all zeros) and the lower half is bitonic.)

Case study: parallel sorting algorithm

• Therefore, a bitonic-sorter[n] (i.e. a network that sorts a bitonic sequence of length n) is obtained as a Half-cleaner[n] followed by a Bitonic-sorter[n/2] on each half.

A bitonic sorter sorts a bitonic sequence of length n = 2^k using
• size = n·k/2 = (n/2)·log2 n comparators (= sequential time)
• depth = k = log2 n (= parallel time)

Case study: parallel sorting algorithm

• Step 2: build a network merger[n] that merges two sorted sequences of length n/2 so that the output is sorted
  § Flip the second sequence and concatenate the first and the flipped second sequence
  § The concatenated sequence is bitonic, so it can be sorted using step 1

(Diagram: sorted sequence 1 and the flipped sorted sequence 2 enter the first comparator column; the two halves of the output then go to two bitonic-sorter[n/2] networks.)

Case study: parallel sorting algorithm

• Step 3: build a sorter[n] network that sorts arbitrary sequences
  § Do this recursively from merger[n] building blocks: a sorter[n] consists of two sorter[n/2] networks followed by a merger[n] (from step 2).
  § Depth: D(1) = 0 and D(n) = D(n/2) + log2 n = O(log^2 n)
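The same recursive structure can be written down directly as (sequential) code. A minimal sketch for n a power of two; in a parallel implementation, every comparator column of a stage would be executed concurrently:

/* Bitonic sort sketch for n = 2^k elements.
   bitonic_merge() plays the role of merger[n]/bitonic-sorter[n]: it assumes
   a[lo..lo+n-1] is bitonic and sorts it (dir = 1 ascending, dir = 0 descending). */
static void compare_swap(int a[], int i, int j, int dir) {
    if ((a[i] > a[j]) == dir) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

static void bitonic_merge(int a[], int lo, int n, int dir) {
    if (n <= 1) return;
    int m = n / 2;
    for (int i = lo; i < lo + m; i++)      /* half-cleaner[n] */
        compare_swap(a, i, i + m, dir);
    bitonic_merge(a, lo,     m, dir);      /* bitonic-sorter[n/2] */
    bitonic_merge(a, lo + m, m, dir);
}

void bitonic_sort(int a[], int lo, int n, int dir) {   /* sorter[n] */
    if (n <= 1) return;
    int m = n / 2;
    bitonic_sort(a, lo,     m, 1);   /* sort the first half ascending */
    bitonic_sort(a, lo + m, m, 0);   /* sort the second half descending (= the "flip") */
    bitonic_merge(a, lo, n, dir);    /* merger[n] */
}

Called as bitonic_sort(a, 0, n, 1) for an ascending sort of n = 2^k elements.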

Case study: parallel sorting algorithm

Example: sorter[16]

(Diagram: the sorter[16] network built from merger[2], merger[4], merger[8] and merger[16] stages; the merger[16] stage starts with a flip-cleaner[16] followed by half-cleaner[8] networks.)

Case study: parallel sorting algorithm

• Further reading on bitonic networks:
  § http://valis.cs.uiuc.edu/~sariel/teach/2004/b/webpage/lec/14_sortnet_notes.pdf
• In case n is not a power of two:
  § http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/oddn.htm