Introduction to the Message Passing Interface (MPI)


Introduction to the Message Passing Interface (MPI)

Jan Fostier

May 3rd, 2017

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies


Moore's Law

"Transistor count doubles every two years"

(Illustration from Wikipedia)

Moore's Law

(Illustration from Wikipedia: transistor counts in 2000 versus 2011)

Evolution of top 500 supercomputers over time

(Image taken from www.top500.org)

17.6 PFlops/s (#1 ranked)
76 TFlops/s (#500 ranked)

Exponential increase of supercomputer peak performance over time

Application area – performance share

(Image taken from www.top500.org; largest shares: "research" and "unknown")

Operating system family

(Largest shares: "Linux", "Unix", "Mixed")

Motivation for parallel computing

• Want to run the same program faster
  § What is considered an acceptable runtime depends on the application
    o SETI@Home, Folding@Home, GIMPS: years may be acceptable
    o For R&D applications: days or even weeks are acceptable
      – CFD, CEM, bioinformatics, cheminformatics, and many more
    o Prediction of tomorrow's weather should take less than a day of computation time
    o Some applications require real-time behavior
      – Computer games, algorithmic trading
• Want to run bigger datasets
• Want to reduce financial cost and/or power consumption

Distributed-memory architecture

(Diagram: several machines, each with its own CPU, memory and network interface (NI), connected by an interconnection network, e.g. Gigabit Ethernet, Infiniband, Myrinet, ...)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Message Passing Interface (MPI)

• MPI = library specification, not an implementation
• Most important implementations
  § Open MPI (http://www.open-mpi.org, MPI-3 standard)
  § Intel MPI (proprietary, MPI-3 standard)
• Specifies routines for (among others)
  § Point-to-point communication (between 2 processes)
  § Collective communication (> 2 processes)
  § Topology setup
  § Parallel I/O
• Bindings for C/C++ and Fortran

MPI reference works

• MPI standards: http://www.mpi-forum.org/docs/
• MPI: The Complete Reference (M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra). Available from http://switzernet.com/people/emin-gabrielyan/060708-thesis-ref/papers/Snir96.pdf
• Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd ed. (W. Gropp, E. Lusk, A. Skjellum)

MPI standard

• Started in 1992 (Workshop on Standards for Message-Passing in a Distributed Memory Environment) with support from vendors, library writers and academia
• MPI version 1.0 (May 1994)
  • Final pre-draft in 1993 (Supercomputing '93 conference)
  • Final version June 1994
• MPI version 2.0 (July 1997)
  • Support for one-sided communication
  • Support for process management
  • Support for parallel I/O
• MPI version 3.0 (September 2012)
  • Support for non-blocking collective communication
  • Fortran 2008 bindings
  • New one-sided communication routines

Hello world example in MPI

#include <mpi.h>
#include <iostream>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    cout << "Hello World from process " << rank << "/" << size << endl;

    MPI_Finalize();
    return EXIT_SUCCESS;
}

[john@doe ~]$ mpirun -np 4 ./helloWorld
Hello World from process 2/4
Hello World from process 3/4
Hello World from process 0/4
Hello World from process 1/4

Output order is random

Basic MPI routines

• int MPI_Init(int *argc, char ***argv)
  § Initialization: all processes must call this prior to any other MPI routine.
  § Strips off (possible) arguments provided by "mpirun".
• int MPI_Finalize(void)
  § Cleanup: all processes must call this routine at the end of the program.
  § All pending communication should have finished before calling this.
• int MPI_Comm_size(MPI_Comm comm, int *size)
  § Returns the size of the "communicator" associated with "comm"
  § Communicator = user-defined subset of processes
  § MPI_COMM_WORLD = communicator that involves all processes
• int MPI_Comm_rank(MPI_Comm comm, int *rank)
  § Returns the rank of the process in the communicator
  § Range: [0 ... size – 1]

Message Passing Interface Mechanisms

(Diagram: P processes (or instances) of the "Hello World" program, one per machine, connected by an interconnection network, e.g. Gigabit Ethernet, Infiniband, Myrinet, ...)

mpirun launches P independent processes across the different machines
• Each process is an instance of the same program
• Each process holds its own stack variables: rank = 0, 1, 2, ..., P-1 and size = P
• Each process prints "Hello World"

Terminology

• Computer program = passive collection of instructions.
• Process = instance of a computer program that is being executed.
• Multitasking = running multiple processes on a CPU.
• Thread = smallest stream of instructions that can be managed by an OS scheduler (= light-weight process).
• Distributed-memory system = multi-processor system where each processor has direct (fast) access to its own private memory and relies on inter-processor communication to access another processor's memory (typically slower).

Multithreading versus multiprocessing

Multi-threading:
• Single process
• Shared memory address space
• Protect data against simultaneous writing
• Limited to a single machine
• E.g. Pthreads, CILK, OpenMP, etc.

Message passing:
• Multiple processes
• Separate memory address spaces
• Explicitly communicate everything
• Multiple machines possible
• E.g. MPI, Unified Parallel C, PVM

(Diagram: in the multi-threaded model, a single process runs sequential part I, spawns threads A, B, C, D (thread spawning), joins them (thread joining), and runs sequential part II; in the message-passing model, processes A, B, C, D each run sequential parts I and II themselves.)

Message Passing Interface (MPI)

• MPI mechanisms (depend on the implementation)
  • Compiling an MPI program from source
    • mpicc -O3 main.cpp -o main        (C; mpicc / mpic++ default to gcc)
      = gcc -O3 main.cpp -o main -L<IncludeDir> -l<mpiLibs>
    • Also mpic++ (or mpicxx), mpif77, mpif90, etc.  (C++, Fortran)
  • Running MPI applications (manually)
    • mpirun -np <number of program instances> <your program>
    • List of worker nodes specified in some config file.
• Using MPI on the UGent HPC cluster
  • Load the appropriate module first, e.g.
    • module load intel/2017a
  • Compiling an MPI program from source
    • mpigcc (uses the GNU "gcc" C compiler)
    • mpiicc (uses the Intel "icc" C compiler)
    • mpigxx (uses the GNU "g++" C++ compiler)
    • mpiicpc (uses the Intel "icpc" C++ compiler)
  • Submit a job using a job script (see further)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Basic MPI point-to-point communication

...
int rank, size, count;
char b[40];
MPI_Status status;

... // init MPI and rank and size variables

if (rank != 0) {                    // branching on rank
    char str[] = "Hello World";
    MPI_Send(str, 12, MPI_CHAR, 0, 123, MPI_COMM_WORLD);
} else {
    for (int i = 1; i < size; i++) {
        MPI_Recv(b, 40, MPI_CHAR, i, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_CHAR, &count);
        printf("I received %s from process %d with size %d and tag %d\n",
               b, status.MPI_SOURCE, count, status.MPI_TAG);
    }
}
...

[john@doe ~]$ mpirun -np 4 ./ptpcomm
I received Hello World from process 1 with size 12 and tag 123
I received Hello World from process 2 with size 12 and tag 123
I received Hello World from process 3 with size 12 and tag 123

Message Passing Interface Mechanisms

(Diagram: P processes across the machines; the process with rank 0 performs the receives (Recv 3x in the 4-process example), the processes with ranks 1, 2, ..., P-1 each send a "Hello World" message to it over the interconnection network.)

mpirun launches P independent processes across the different machines
• Each process is an instance of the same program

Blocking send and receive

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

• buf: pointer to the message to send
• count: number of items to send
• datatype: datatype of each item
  – number of bytes sent: count * sizeof(datatype)
• dest: rank of the destination process
• tag: value to identify the message [0 ... at least 32767]
• comm: communicator specification (e.g. MPI_COMM_WORLD)

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status)

• buf: pointer to the buffer to store the received data
• count: upper bound (!) of the number of items to receive
• datatype: datatype of each item
• source: rank of the source process (or MPI_ANY_SOURCE)
• tag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }

Sending and receiving

Two-sided communication:
• Both the sender and the receiver are involved in the data transfer (as opposed to one-sided communication)
• A posted send must match a receive

When do MPI_Send and MPI_Recv match?
• 1. Rank of the receiving process
• 2. Rank of the sending process
• 3. Tag
  – custom value to distinguish messages from the same sender
• 4. Communicator

Rationale for communicators
• Used to create subsets of processes
• Transparent use of tags
  – modules can be written in isolation
  – communication within a module through its own communicator
  – communication between modules through a shared communicator

MPI Datatypes

MPI_Datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              no conversion, bit pattern transferred as is
MPI_PACKED            grouped messages

Querying for information

• MPI_Status
  • Stores information about the MPI_Recv operation

    typedef struct MPI_Status {
        int MPI_SOURCE;
        int MPI_TAG;
        int MPI_ERROR;
    };

  • Does not contain the size of the received message

• int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
  • returns the number of data items received in the count variable
  • not directly accessible from the status variable

Blocking send and receive

(Timing diagrams of the "request to send / ok to send / data" handshake:)
a) sender comes first: idling at the sender (no buffering of the message)
b) sending/receiving at about the same time: idling minimized
c) receiver comes first: idling at the receiver

(Slide reproduced from slides by John Mellor-Crummey)

Deadlocks

int a[10], b[10], myRank;
MPI_Status s1, s2;
MPI_Comm_rank( MPI_COMM_WORLD, &myRank );

if ( myRank == 0 ) {
    MPI_Send( a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD );
    MPI_Send( b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD );
} else if ( myRank == 1 ) {
    MPI_Recv( b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s1 );
    MPI_Recv( a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s2 );
}

(Slide credit: John Mellor-Crummey)

If MPI_Send is blocking (handshake protocol), this program will deadlock: rank 0 first sends the message with tag 1, while rank 1 first waits for the message with tag 2.
• If the 'eager' protocol is used, it may run
• Depends on the message size
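One way to remove the deadlock is to make the receive order on rank 1 match the send order on rank 0 (alternatively, MPI_Sendrecv, introduced later, can combine such exchanges). A minimal sketch of the reordered variant; the wrapper function exchange() is only illustrative:

#include <mpi.h>

/* Deadlock-free variant: rank 1 receives the messages in the order in
   which rank 0 sends them, so each blocking send finds a matching
   receive regardless of the protocol used by the MPI implementation. */
void exchange(int myRank) {
    int a[10] = {0}, b[10] = {0};
    MPI_Status s1, s2;

    if (myRank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    } else if (myRank == 1) {
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s1);  /* tag 1 first */
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s2);  /* then tag 2  */
    }
}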

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Network cost modeling

• Various choices of interconnection network
  § Gigabit Ethernet: cheap, but far too slow for HPC applications
  § Infiniband / Myrinet: high-speed interconnects
• Simple performance model for point-to-point communication:

      T_comm = α + β·n

  § α = latency
  § B = 1/β = saturation (asymptotic) bandwidth (bytes/s)
  § n = number of bytes to transmit
  § Effective bandwidth B_eff:

      B_eff = n / (α + β·n) = n / (α + n/B)
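As a worked example with illustrative (assumed, not measured) numbers α = 2 µs and B = 1/β = 1 GB/s:

      B_eff(1 KB) = 10^3 / (2·10^-6 + 10^-6)  ≈ 0.33 GB/s
      B_eff(1 MB) = 10^6 / (2·10^-6 + 10^-3)  ≈ 0.998 GB/s

Small messages are latency-dominated; only large messages approach the asymptotic bandwidth B, which is the trend measured on the Gengar cluster below.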

The case of Gengar (UGent cluster): bandwidth and latency

(Diagram: a node with two quad-core CPUs; each core has its own L1 cache, pairs of cores share 6 MByte of L2 cache, and the node holds 16 GByte of RAM; nodes are connected through non-blocking Infiniband (switched fabric). Figures shown: bandwidths of ~100 Gbit/s and ~20 Gbit/s, latencies of 2 ns, 50 ns, and ~1 μs to the 193 other machines.)

Measure effective bandwidth: ring test

Idea: send a single message of size N in a circle (process 0 → 1 → 2 → ... → P-1 → 0)
• Increase the message size N exponentially
  • 1 byte, 2 bytes, 4 bytes, ..., 1024 bytes, 2048 bytes, 4096 bytes, ...
• Benchmark the results (measure the wall clock time T)
• Bandwidth = N * P / T

Hands-on: ring test in MPI

void sendRing( char *buffer, int length ) {
    /* send message in a ring here */
}

int main( int argc, char * argv[] ) {
    ...
    char *buffer = (char*) calloc( 1048576, sizeof(char) );
    int msgLen = 8;
    for (int i = 0; i < 18; i++, msgLen *= 2) {
        double startTime = MPI_Wtime();
        sendRing( buffer, msgLen );
        double stopTime = MPI_Wtime();
        double elapsedSec = stopTime - startTime;
        if (rank == 0)
            printf( "Bandwidth for size %d is: %f\n", ... );
    }
    ...
}

Job script example

mpijob.sh job script example:

#!/bin/sh
#
#PBS -o output.file
#PBS -e error.file
#PBS -l nodes=2:ppn=all
#PBS -l walltime=00:02:00
#PBS -m n

cd $VSC_SCRATCH/<your directory>
module load intel/2017a
module load scripts
mympirun ./<program name> <program arguments>

qsub mpijob.sh
qstat / qdel / etc. remain the same

Hands-on: ring test in MPI (solution)

void sendRing( char *buffer, int msgLen ) {
    int myRank, numProc;
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
    MPI_Comm_size( MPI_COMM_WORLD, &numProc );
    MPI_Status status;

    int prevR = (myRank - 1 + numProc) % numProc;
    int nextR = (myRank + 1) % numProc;

    if (myRank == 0) { // send first, then receive
        MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
        MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD, &status );
    } else { // receive first, then send
        MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD, &status );
        MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
    }
}

Basic MPI routines

Timing routines in MPI

• double MPI_Wtime( void )
  • returns the time in seconds relative to "some time" in the past
  • "some time" in the past is fixed during the process
• double MPI_Wtick( void )
  • returns the resolution of MPI_Wtime() in seconds
  • e.g. 10^-3 = millisecond resolution
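A minimal usage sketch of these timing routines (the timed loop is a placeholder, not taken from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();            /* start of the timed region */
    for (int i = 0; i < 1000; i++) {
        /* ... work to be timed ... */
    }
    double t1 = MPI_Wtime();            /* end of the timed region */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("elapsed: %f s (timer resolution: %g s)\n",
               t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}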

Bandwidth on Gengar (UGent cluster)

(Plot: effective bandwidth B_eff (Mbyte/s, 0–1200) versus message size (byte, 1E+00 to 1E+08), for Gigabit Ethernet and Infiniband.)

The effective bandwidth increases for larger messages.

Benchmark results: comparison of CPU load

(Plot: CPU load for Infiniband versus Gigabit Ethernet.)

Exchanging messages in MPI

int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  int dest, int sendtag, void *recvbuf,
                  int recvcount, MPI_Datatype recvtype, int source,
                  int recvtag, MPI_Comm comm, MPI_Status *status )

• sendbuf: pointer to the message to send
• sendcount: number of elements to transmit
• sendtype: datatype of the items to send
• dest: rank of the destination process
• sendtag: identifier for the message
• recvbuf: pointer to the buffer to store the message (disjoint from sendbuf)
• recvcount: upper bound (!) to the number of elements to receive
• recvtype: datatype of the items to receive
• source: rank of the source process (or MPI_ANY_SOURCE)
• recvtag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }

int MPI_Sendrecv_replace( ... )
• The buffer is replaced by the received data

Basic MPI routines: Sendrecv example

const int len = 10000;
int a[len], b[len];
if ( myRank == 0 ) {
    MPI_Send( a, len, MPI_INT, 1, 0, MPI_COMM_WORLD );
    MPI_Recv( b, len, MPI_INT, 2, 2, MPI_COMM_WORLD, &status );
} else if ( myRank == 1 ) {
    MPI_Sendrecv( a, len, MPI_INT, 2, 1, b, len, MPI_INT, 0,
                  0, MPI_COMM_WORLD, &status );
} else if ( myRank == 2 ) {
    MPI_Sendrecv( a, len, MPI_INT, 0, 2, b, len, MPI_INT, 1,
                  1, MPI_COMM_WORLD, &status );
}

• Compatibility between Sendrecv and 'normal' send and recv
• Sendrecv can help to prevent deadlocks

(Diagram: the messages travel along the ring 0 → 1 → 2 → 0; safe to exchange!)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Non-blocking communication

Idea:
• Do something useful while waiting for communications to finish
• Try to overlap communications and computations

How?
• Replace blocking communication by non-blocking variants (see the sketch below)

    MPI_Send(...)                     →  MPI_Isend(..., MPI_Request *request)
    MPI_Recv(..., MPI_Status *status) →  MPI_Irecv(..., MPI_Request *request)

• I = immediate functions
• The MPI_Isend and MPI_Irecv routines return immediately
• Polling routines are needed to verify progress
  • The request handle is used to identify communications
  • The status field moves to the polling routines (see further)
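A minimal sketch of the pattern, assuming a hypothetical exchange between exactly two ranks (not taken from the slides): post the receive and send early, do useful work, and wait before touching the buffers (MPI_Waitall is one of the polling routines introduced below):

#include <mpi.h>

/* Post a non-blocking receive and send, overlap them with computation,
   and only then wait for completion. */
void overlap_example(int rank) {
    double sendbuf[1000] = {0}, recvbuf[1000];
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    int other = (rank == 0) ? 1 : 0;   /* assumes exactly 2 ranks */

    MPI_Irecv(recvbuf, 1000, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, 1000, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not touch sendbuf/recvbuf ... */

    MPI_Waitall(2, reqs, stats);       /* both transfers are now complete */
}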

Non-blocking communications

Asynchronous progress
• = ability to progress communications while performing calculations
• Depends on the hardware
  • Gigabit Ethernet = very limited
  • Infiniband = far more possibilities
• Depends on the MPI implementation
  • Multithreaded implementations of MPI (e.g. Open MPI)
  • Daemon for asynchronous progress (e.g. LAM/MPI)
• Depends on the protocol
  • Eager protocol
  • Handshake protocol
• Still the subject of ongoing research

Non-blocking sending and receiving

(Timing diagrams of the Isend / Irecv handshake:)
a) the network interface supports overlapping computations and communications
b) the network interface has no such support

(Slide reproduced from slides by John Mellor-Crummey)

Non-blocking communications: polling/waiting routines

int MPI_Wait( MPI_Request *request, MPI_Status *status )
    request: handle to identify the communication
    status: status information (cf. 'normal' MPI_Recv)

int MPI_Test( MPI_Request *request, int *flag, MPI_Status *status )
    Returns immediately. Sets flag = true if the communication has completed.

int MPI_Waitany( int count, MPI_Request *array_of_requests,
                 int *index, MPI_Status *status )
    Waits for exactly one communication to complete.
    If more than one communication has completed, it picks a random one.
    index returns the index of the completed communication.

int MPI_Testany( int count, MPI_Request *array_of_requests,
                 int *index, int *flag, MPI_Status *status )
    Returns immediately. Sets flag = true if at least one communication completed.
    If more than one communication has completed, it picks a random one.
    index returns the index of the completed communication.
    If flag = false, index returns MPI_UNDEFINED.

Example: client-server code

if ( rank != 0 ) { // client code
    while ( true ) { // generate requests and send to the server
        generate_request( data, &size );
        MPI_Send( data, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD );
    }
} else { // server code (rank == 0)
    MPI_Request *reqList = new MPI_Request[nProc];
    for ( int i = 0; i < nProc - 1; i++ )
        MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                   MPI_COMM_WORLD, &reqList[i] );
    while ( true ) { // main consumer loop
        MPI_Status status;
        int reqIndex, recvSize;
        MPI_Waitany( nProc-1, reqList, &reqIndex, &status );
        MPI_Get_count( &status, MPI_CHAR, &recvSize );
        do_service( buffer[reqIndex].data, recvSize );
        MPI_Irecv( buffer[reqIndex].data, MAX_LEN, MPI_CHAR,
                   status.MPI_SOURCE, tag, MPI_COMM_WORLD, &reqList[reqIndex] );
    }
}

Non-blocking communications: polling/waiting routines (cont'd)

int MPI_Waitall( int count, MPI_Request *array_of_requests,
                 MPI_Status *array_of_statuses )
    Waits for all communications to complete.

int MPI_Testall( int count, MPI_Request *array_of_requests,
                 int *flag, MPI_Status *array_of_statuses )
    Returns immediately. Sets flag = true if all communications have completed.

int MPI_Waitsome( int incount, MPI_Request *array_of_requests,
                  int *outcount, int *array_of_indices,
                  MPI_Status *array_of_statuses )
    Waits for at least one communication to complete.
    outcount contains the number of communications that have completed.
    Completed requests are set to MPI_REQUEST_NULL.

int MPI_Testsome( int incount, MPI_Request *array_of_requests,
                  int *outcount, int *array_of_indices,
                  MPI_Status *array_of_statuses )
    Same as Waitsome, but returns immediately.
    The flag field is no longer needed: outcount = 0 if no communications have completed.

Example: improved client-server code

if ( rank != 0 ) { // same client code
    ...
} else { // server code (rank == 0)
    MPI_Request *reqList = new MPI_Request[nProc-1];
    MPI_Status *status = new MPI_Status[nProc-1];
    int *reqIndex = new int[nProc-1];
    int recvSize;
    for ( int i = 0; i < nProc - 1; i++ )
        MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                   MPI_COMM_WORLD, &reqList[i] );
    while ( true ) { // main consumer loop
        int numMsg;
        MPI_Waitsome( nProc-1, reqList, &numMsg, reqIndex, status );
        for ( int i = 0; i < numMsg; i++ ) {
            MPI_Get_count( &status[i], MPI_CHAR, &recvSize );
            do_service( buffer[reqIndex[i]].data, recvSize );
            MPI_Irecv( buffer[reqIndex[i]].data, MAX_LEN, MPI_CHAR,
                       status[i].MPI_SOURCE, tag, MPI_COMM_WORLD,
                       &reqList[reqIndex[i]] );
        }
    }
}

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Barrier synchronization

(Diagram: each process (by rank) computes, then calls MPI_Barrier and idles until the last process reaches the barrier; the waiting time is process dependent.)

MPI_Barrier(MPI_Comm comm)
This function does not return until all processes in comm have called it.

Broadcast

MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
          int root, MPI_Comm comm)

MPI_Bcast broadcasts count elements of type datatype stored in buffer at the root process to all other processes in comm, where this data is stored in buffer.

(Diagram: before the call, only the buffer at the root process contains useful data; the buffer contents at the non-root processes will be overwritten. After MPI_Bcast, the buffers at all processes p0 ... p4 contain the same count elements.)

Broadcast example

...
int rank, size;

... // init MPI and rank and size variables

int root = 0;
char buffer[12];

if (rank == root)                 // fill the buffer at the root process only
    sprintf(buffer, "Hello World");

MPI_Bcast(buffer, 12, MPI_CHAR, root, MPI_COMM_WORLD);   // all processes must call MPI_Bcast

printf("Process %d has %s stored in the buffer.\n", rank, buffer);
...

[john@doe ~]$ mpirun -np 4 ./broadcast
Process 1 has Hello World stored in the buffer.
Process 0 has Hello World stored in the buffer.
Process 3 has Hello World stored in the buffer.
Process 2 has Hello World stored in the buffer.

Broadcast algorithm

(Diagram: the root process p0 broadcasts a message of n bytes to processes p1 ... p7.)

• Linear algorithm: subsequently sending n bytes from the root process to the P-1 other processes takes (α + βn)(P - 1) time.
• A binary tree algorithm takes only (α + βn)⌈log2 P⌉ time.
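A sketch of such a binary tree broadcast built from point-to-point calls, assuming the root is rank 0; MPI_Bcast already implements this (usually better), so the code only illustrates where the ⌈log2 P⌉ communication rounds come from:

#include <mpi.h>

/* Tree broadcast from rank 0: in round k, every rank r < 2^k that already
   holds the data forwards it to rank r + 2^k (if that rank exists).
   After ceil(log2(P)) rounds, all P processes hold the data. */
void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int step = 1; step < size; step *= 2) {
        if (rank < step) {                    /* this rank already has the data */
            int dest = rank + step;
            if (dest < size)
                MPI_Send(buf, count, type, dest, 0, comm);
        } else if (rank < 2 * step) {         /* this rank receives in this round */
            MPI_Recv(buf, count, type, rank - step, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}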

Scatter

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendType,
            void *recvbuf, int recvcount, MPI_Datatype recvType,
            int root, MPI_Comm comm)

MPI_Scatter partitions sendbuf at the root process into P equal parts of size sendcount and sends each process in comm (including root) a portion in rank order.

(Diagram: root = p1 holds blocks d0 ... d4 (sendcount elements each) in its send buffer; after MPI_Scatter, process pi holds block di (recvcount elements) in its receive buffer. The send buffer only matters at the root process.)

Scatter example

int root = 0;
char recvBuf[7];

if (rank == root) {
    char sendBuf[25];                                   // fill the send buffer at the root process only
    sprintf(sendBuf, "This is the source data.");
    MPI_Scatter(sendBuf, 6, MPI_CHAR, recvBuf, 6, MPI_CHAR,
                root, MPI_COMM_WORLD);
} else {
    MPI_Scatter(NULL, 0, MPI_CHAR, recvBuf, 6, MPI_CHAR,   // first three parameters are ignored on non-root processes
                root, MPI_COMM_WORLD);
}

recvBuf[6] = '\0';
printf("Process %d has %s in receive buffer\n", rank, recvBuf);
...

[john@doe ~]$ mpirun -np 4 ./scatter
Process 1 has s the  in receive buffer
Process 0 has This i in receive buffer
Process 3 has  data. in receive buffer
Process 2 has source in receive buffer

Scatter algorithm

(Diagram: the root process p0 scatters blocks to processes p1 ... p7; one block = n bytes.)

• Linear algorithm: subsequently sending n bytes from the root process to the P-1 other processes takes (α + βn)(P - 1) time.
• A binary (tree) algorithm takes only α⌈log2 P⌉ + βn(P - 1) time (reduced number of communication rounds!).

Scatter (vector variant)

MPI_Scatterv(void *sendbuf, int *sendcnts, int *displs,
             MPI_Datatype sendType, void *recvbuf, int recvcnt,
             MPI_Datatype recvType, int root, MPI_Comm comm)

The partitions of sendbuf don't need to be of equal size and are specified per receiving process: their first index by the displs array, their size by the sendcnts array.

(Diagram: root = p1; process pi receives sendcnts[i] elements starting at offset displs[i] of the send buffer, which only matters at the root process.)

Gather

MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendType,
           void *recvbuf, int recvcount, MPI_Datatype recvType,
           int root, MPI_Comm comm)

MPI_Gather gathers equal partitions of size recvcount from each of the P processes in comm (including root) and stores them in recvbuf at the root process in rank order.

(Diagram: root = p1; each process pi holds block di (sendcount elements) in its send buffer; after MPI_Gather, the receive buffer at the root contains d0 ... d4. The receive buffer only matters at the root process.)

Gather is the opposite of Scatter.

A vector variant, MPI_Gatherv, exists, a similar generalization as MPI_Scatterv.

Gather example

int root = 0;
int sendBuf = rank;

if (rank == root) {
    int *recvBuf = new int[size];            // receive buffer exists at the root process only
    MPI_Gather(&sendBuf, 1, MPI_INT, recvBuf, 1, MPI_INT, root, MPI_COMM_WORLD);
    cout << "Receive buffer at root process: " << endl;
    for (size_t i = 0; i < size; i++)
        cout << recvBuf[i] << " ";
    cout << endl;
    delete [] recvBuf;
} else {
    MPI_Gather(&sendBuf, 1, MPI_INT, NULL, 1, MPI_INT,   // receive parameters are ignored on non-root processes
               root, MPI_COMM_WORLD);
}

[john@doe ~]$ mpirun -np 4 ./gather
Receive buffer at root process:
0 1 2 3

Gather algorithm

(Diagram: processes p1 ... p7 send their blocks to the root process p0; one block = n bytes.)

• Linear algorithm: subsequently sending n bytes from the P-1 processes to the root takes (α + βn)(P - 1) time.
• A binary (tree) algorithm takes only α⌈log2 P⌉ + βn(P - 1) time (reduced number of communication rounds!).

Allgather

MPI_Allgather(void *sendbuf, int sendcnt, MPI_Datatype sendType,
              void *recvbuf, int recvcnt, MPI_Datatype recvType,
              MPI_Comm comm)

MPI_Allgather is a generalization of MPI_Gather, in the sense that the data is gathered by all processes, instead of just the root process.

(Diagram: each process pi holds block di (sendcount elements) in its send buffer; after MPI_Allgather, the receive buffer of every process p0 ... p4 contains d0 ... d4.)

Allgather algorithm

(Diagram: butterfly exchange among processes p0 ... p7: blocks of n bytes, then 2n bytes, then 4n bytes are exchanged; one block = n bytes.)

• P calls to gather take P[α⌈log2 P⌉ + βn(P - 1)] time (using the best gather algorithm).
• Gather followed by broadcast takes 2α⌈log2 P⌉ + βn(P⌈log2 P⌉ + P - 1) time.
• The "butterfly" algorithm takes only α log2 P + βn(P - 1) time (in case P is a power of two).

All-to-all communication

MPI_Alltoall(void *sendbuf, int sendcnt, MPI_Datatype sendType,
             void *recvbuf, int recvcnt, MPI_Datatype recvType,
             MPI_Comm comm)

Using MPI_Alltoall, every process sends a distinct message to every other process.

(Diagram: process p0 holds a0 ... a4, p1 holds b0 ... b4, ..., p4 holds e0 ... e4 in their send buffers; after MPI_Alltoall, process pi holds blocks ai, bi, ci, di, ei in its receive buffer, i.e. the matrix of blocks is transposed.)

A vector variant, MPI_Alltoallv, exists, allowing for different sizes for each process.

Reduce

MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype dataType,
           MPI_Op op, int root, MPI_Comm comm)

The reduce operation aggregates ("reduces") scattered data at the root process.

(Diagram: root = p0; each process pi holds di (count elements) in its send buffer; after MPI_Reduce, the receive buffer at the root contains d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4, where ◊ = operation, like sum, product, maximum, etc.)

Reduce operations

Available reduce operations (associative and commutative):

MPI_MAX      maximum
MPI_MIN      minimum
MPI_SUM      sum
MPI_PROD     product
MPI_LAND     logical AND
MPI_BAND     bitwise AND
MPI_LOR      logical OR
MPI_BOR      bitwise OR
MPI_LXOR     logical exclusive OR
MPI_BXOR     bitwise exclusive OR
MPI_MAXLOC   maximum and its location
MPI_MINLOC   minimum and its location

User-defined operations are also possible.
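A minimal usage sketch (not taken from the slides): every process contributes its rank and the root receives the sum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendval = rank, sum = 0;
    /* Sum the ranks of all processes; only the root receives the result. */
    MPI_Reduce(&sendval, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, sum);  /* = size*(size-1)/2 */

    MPI_Finalize();
    return 0;
}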

Allreduce operation

MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
              MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)

Similar to the reduce operation, but the result is available on every process.

(Diagram: each process pi holds di (count elements) in its send buffer; after MPI_Allreduce, the receive buffer of every process p0 ... p4 contains d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4.)

Scan operation

MPI_Scan(void *sendbuf, void *recvbuf, int count,
         MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)

A scan performs a partial reduction of the data; every process has a distinct result.

(Diagram: each process pi holds di (count elements) in its send buffer; after MPI_Scan, p0 holds d0, p1 holds d0 ◊ d1, p2 holds d0 ◊ d1 ◊ d2, p3 holds d0 ◊ d1 ◊ d2 ◊ d3, and p4 holds d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4 in its receive buffer.)
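A minimal sketch of an inclusive prefix sum with MPI_Scan (illustrative, not taken from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process contributes the value (rank + 1); the scan returns
       the inclusive prefix sum 1 + 2 + ... + (rank + 1) on each process. */
    int myval = rank + 1, prefix = 0;
    MPI_Scan(&myval, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}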

Hands-on: matrix-vector multiplication

Master: coordinates the work of the others
Slave: does a bit of work

Task: compute A.b
A: double-precision (m x n) matrix
b: double-precision (n x 1) column matrix

Master algorithm:
1. Broadcast b to each slave
2. Send 1 row of A to each slave
3. while ( not all m results received ) {
       Receive result from any slave s
       if ( not all m rows sent )
           Send new row to slave s
       else
           Send termination message to s
   }
4. continue

Slave algorithm:
1. Broadcast b (in fact receive b)
2. do {
       Receive message m
       if ( m != termination ) {
           compute result
           send result to master
       }
   } while ( m != termination )
3. slave terminates

Matrix-vector multiplication

int main( int argc, char** argv ) {
    int rows = 100, cols = 100;   // dimensions of a
    double **a;
    double *b, *c;
    int master = 0;               // rank of master
    int myid;                     // rank of this process
    int numprocs;                 // number of processes

    // allocate memory for a, b and c
    a = (double**)malloc(rows * sizeof(double*));
    for( int i = 0; i < rows; i++ )
        a[i] = (double*)malloc(cols * sizeof(double));
    b = (double*)malloc(cols * sizeof(double));
    c = (double*)malloc(rows * sizeof(double));

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myid );
    MPI_Comm_size( MPI_COMM_WORLD, &numprocs );

    if( myid == master )
        // execute master code
    else
        // execute slave code

    MPI_Finalize();
}

Matrix-vector multiplication (master code)

// initialize a and b
for(int j = 0; j < cols; j++) {
    b[j] = 1.0;
    for(int i = 0; i < rows; i++) a[i][j] = i;
}

// broadcast b to each slave
MPI_Bcast( b, cols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD );

// send a row of a to each slave, tag = row number
int numsent = 0;
for( int i = 0; (i < numprocs-1) && (i < rows); i++ ) {
    MPI_Send( a[i], cols, MPI_DOUBLE_PRECISION, i+1, i, MPI_COMM_WORLD );
    numsent++;
}

for( int i = 0; i < rows; i++ ) {
    MPI_Status status; double ans; int sender;
    MPI_Recv( &ans, 1, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE,
              MPI_ANY_TAG, MPI_COMM_WORLD, &status );
    c[status.MPI_TAG] = ans;
    sender = status.MPI_SOURCE;
    if ( numsent < rows ) { // send more work if any
        MPI_Send( a[numsent], cols, MPI_DOUBLE_PRECISION,
                  sender, numsent, MPI_COMM_WORLD );
        numsent++;
    } else // send termination message
        MPI_Send( MPI_BOTTOM, 0, MPI_DOUBLE_PRECISION, sender,
                  rows, MPI_COMM_WORLD );
}

Matrix-vector multiplication (slave code)

// broadcast b to each slave (receive here)
MPI_Bcast( b, cols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD );

// receive rows of a, tag = row number
if( myid <= rows ) {
    double* buffer = (double*)malloc(cols*sizeof(double));
    while (true) {
        MPI_Status status;
        MPI_Recv( buffer, cols, MPI_DOUBLE_PRECISION, master,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &status );
        if( status.MPI_TAG != rows ) { // not a termination message
            double ans = 0.0;
            for(int i = 0; i < cols; i++)
                ans += buffer[i]*b[i];
            MPI_Send( &ans, 1, MPI_DOUBLE_PRECISION, master,
                      status.MPI_TAG, MPI_COMM_WORLD );
        } else
            break;
    }
} // more processes than rows => no work for some nodes

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Network cost modeling

• Simple performance model for multiple communications
  – Bisection bandwidth: sum of the bandwidths of the (average number of) links to cut in order to partition the network in two halves.
  – Diameter: maximum number of hops to connect any two devices.

(Diagram: a network split into two halves by a bisection cut.)

Bus topology

A bus is a communication channel that is shared by all connected devices.
• No more than two devices can communicate at any given time
• Hardware controls which devices have access
• High risk of contention when multiple devices try to access the bus simultaneously: a bus is a "blocking" interconnect.

Bus network properties
• Bisection bandwidth = point-to-point bandwidth (independent of the number of devices)
• Diameter = 1 (single hop)

Crossbar switch

(Diagram: input devices i1 ... i4 and output devices o1 ... o4 connected through a grid of switches.)

A crossbar switch can connect any combination of input/output pairs at full bandwidth = fully non-blocking.

Static routing in a crossbar switch

(Diagram: input devices i1 ... i4 and output devices o1 ... o4; each switch is either in state A (pass-through) or state B (turn).)

The switches at the crossing of an input/output pair should be in state B; the other switches should be in state A. The path selection is fixed = static routing.

Input–output pairings:

Input   Output
i1      o2
i2      o4
i3      o1
i4      o3

Crossbar switch for distributed-memory systems

• Crossbar implementation of a switch with P = 4 machines.
• Fully non-blocking
• Expensive: P^2 switches and O(P^2) wires
• Usually implemented in a switch (a 4-port non-blocking switch connecting machines M1 ... M4)

Clos switching networks

(Diagram: n inputs and n outputs connected through three stages of 4x4 crossbar (CB) switches: input stage, middle stage, output stage.)

• Built in three stages, using crossbar switches as components.
• Every output of an input-stage switch is connected to a different middle-stage switch.
• Every input can be connected to every output, using any one of the middle-stage switches.
• All input/output pairs can communicate simultaneously (non-blocking).
• Requires far less wiring than conventional crossbar switches.


Clos switching networks: adaptive routing

(Diagram sequence: connections A, B, C and D are already routed through the three-stage network; a new connection E is requested.)

• Adaptive routing may be necessary: with the current assignment of middle-stage switches, no free path can be found for E.
• Adaptive routing reroutes an existing path to a different middle-stage switch, freeing up capacity.
• After rerouting, E can now be connected as well.

Switch examples

(Photos of non-blocking switches: an Infiniband switch (24 ports) and a Gigabit Ethernet switch (24 ports).)

Mesh networks

(Diagram: a 2D mesh of nodes with a bisection cut.)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Basic performance terminology

• Runtime ("How long does it take to run my program?")
  § In practice, only the wall clock time matters
  § Depends on the number of parallel processes P
  § T_P = runtime using P processes
• Speedup ("How much faster does my program run in parallel?")
  § S_P = T_1 / T_P
  § In the ideal case, S_P = P
  § Superlinear speedups are usually due to cache effects
• Parallel efficiency ("How well is the parallel infrastructure used?")
  § η_P = S_P / P   (0 ≤ η_P ≤ 1)
  § In the ideal case, η_P = 1 = 100%
  § It depends on the application what is considered acceptable efficiency

Strong scaling: Amdahl's law

• Strong scaling = increasing the number of parallel processes for a fixed-size problem
• Simple model: partition the sequential runtime T_1 into a parallelizable fraction (1 - s) and an inherently sequential fraction s:
  § T_1 = s·T_1 + (1 - s)·T_1     (0 ≤ s ≤ 1)
  § Therefore, T_P = s·T_1 + (1 - s)·T_1 / P

(Diagram: as P grows from 1 to ∞, the parallelizable part (1 - s)·T_1 shrinks proportionally to 1/P while the sequential part s·T_1 remains.)

Strong scaling: Amdahl's law

• Consequently,

      S_P = T_1 / T_P = T_1 / (s·T_1 + (1 - s)·T_1 / P) = 1 / (s + (1 - s)/P)     (Amdahl's law)

• The speedup is bounded: S_∞ = 1/s
  § e.g. s = 1%: S_∞ = 100
  § e.g. s = 5%: S_∞ = 20
• Sources of the sequential fraction s·T_1
  § Process startup overhead
  § Inherently sequential portions of the code
  § Dependencies between subtasks
  § Communication overhead
  § Function calling overhead
  § Load imbalance

Amdahl's law

(Plot: parallel speedup versus the number of parallel processes P (0–30), for s = 5%, 2%, 1% and 0%; the s = 0% curve is the linear speedup.)

For small P, the speedup is close to linear.

Amdahl's law

(Same graph as on the previous slide, but with a logarithmic x-axis up to P = 65536: each curve saturates at S_∞ = 1/s.)

Amdahl's law

(Same graph as on the previous slide, but expressed as parallel efficiency: for s > 0 the efficiency drops towards zero as P grows.)

Weak scaling: Gustafson's law

• Weak scaling: increasing both the number of parallel processes P and the problem size N.
• Simple model: assume that the sequential part is constant, and that the parallelizable part is proportional to N:
  § T_1(N) = T_1(1)·[s + (1 - s)·N]     (0 ≤ s ≤ 1)

(Diagram: for N = 1, 2, 3, 4, ..., the sequential part s·T_1(1) stays constant while the parallelizable part (1 - s)·T_1(1) grows proportionally to N.)

Gustafson's law

• Therefore, T_P(N) = T_1(1)·[s + (1 - s)·N/P], and hence T_P(P) = T_1(1).
• Speedup: S_P(P) = s + (1 - s)·P = P - s·(P - 1)     (Gustafson's law)
• Solve increasingly larger problems using a proportionally higher number of processes (N = P).

(Diagram: for N = P = 1, 2, 3, 4, the runtime T_P(P) stays constant: the sequential part s·T_1(1) plus one parallelizable part (1 - s)·T_1(1) per process.)

Gustafson's law

(Plot: parallel speedup versus the number of parallel processes P = problem size N (0–250), for s = 5%, 2%, 1% and 0%: the speedup grows linearly with slope 1 - s.)

Gustafson's law

(Same graph as on the previous slide, but expressed as parallel efficiency and with a logarithmic scale: the efficiency tends to η_∞ = 1 - s.)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies

Case Study 1: Parallel matrix-matrix product

Case study: parallel matrix multiplication

• Matrix-matrix multiplication: C = α·A·B + β·C (BLAS xgemm)
  § Assume C is an m x n, A an m x k, and B a k x n matrix.
  § α, β = scalars; assume α = β = 1 in what follows, i.e. C = C + A·B
• Initially, the matrix elements are distributed among the P processes
  § Assume the same scheme for each matrix A, B and C
• Each process computes the values of C that are local to that process
  § Required data from A and B that is not local needs to be communicated
  § Performance modeling assumes the α, β, γ model:
    o α = latency
    o β = per-element transfer time
    o γ = time for a single floating-point computation

(Case study reproduced from J. Demmel)

Case study: parallel matrix-matrix product

• Two-dimensional partitioning
  § Partition the matrices in 2D over an r x c mesh of processes (P = r·c)
  § X(I,J) refers to block (I,J) of matrix X (X = {A, B, C})
  § Process p_n is also denoted p_i,j (n = i·c + j) and holds X(I,J)

(Diagram: the r x c process mesh p_0,0 ... p_r-1,c-1; e.g. matrix B (k x n) is partitioned into blocks B(I,J) over the r rows and c columns of processes.)

(Case study reproduced from J. Demmel)

Case study: parallel matrix-matrix product

• Second approach: two-dimensional partitioning (SUMMA)
  § Process p_i,j needs to compute C(I,J):

      C(I,J) = C(I,J) + Σ_{i=0}^{k-1} A(I,i) · B(i,J)      for all blocks I, J (do this in parallel)

  § The index i refers to a single column of A or a single row of B (k elements).

(Diagram: C(I,J) += A(I,i) * B(i,J); for process p_i,j, C(I,J) is local, the blocks A(I,i) and B(i,J) need to be communicated to p_i,j, and the remaining data is not needed by p_i,j.)

(Case study reproduced from J. Demmel)

Case study: parallel matrix-matrix product

• Second approach: two-dimensional partitioning
  § SUMMA algorithm (all I and J in parallel):

    for i = 0 to k-1
        broadcast A(I, i) within the process row
        broadcast B(i, J) within the process column
        C(I,J) += A(I, i) * B(i, J)
    endfor

  § Cost of the inner loop (executed k times):
      log2(c)·(α + β·m/r) + log2(r)·(α + β·n/c) + 2·m·n·γ/P
  § Total: T_P = 2·k·m·n·γ/P + k·α·(log2 c + log2 r) + k·β·((m/r)·log2 c + (n/c)·log2 r)
  § For n = m = k and r = c = √P we find:
      Parallel efficiency η_P = 1 / (1 + (α/γ)·(P·log2 P)/(2n²) + (β/γ)·(√P·log2 P)/n)
      Isoefficiency when n grows as √P (constant memory per node!)

(Case study reproduced from J. Demmel)
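A rough MPI sketch of this inner loop for a √P x √P process grid, built with MPI_Comm_split to form the row and column communicators. The block size bs, the scratch buffers and the naive local update are illustrative assumptions, not taken from the slides:

#include <mpi.h>
#include <string.h>

/* SUMMA-style sketch: rank layout is a sqrtP x sqrtP grid with
   myRow = rank / sqrtP and myCol = rank % sqrtP.  Ablock/Bblock/Cblock
   are the locally stored bs x bs blocks; Atmp/Btmp are scratch buffers. */
void summa(double *Ablock, double *Bblock, double *Cblock,
           double *Atmp, double *Btmp, int bs, int sqrtP, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int myRow = rank / sqrtP, myCol = rank % sqrtP;

    MPI_Comm rowComm, colComm;
    MPI_Comm_split(comm, myRow, myCol, &rowComm);   /* processes in my row    */
    MPI_Comm_split(comm, myCol, myRow, &colComm);   /* processes in my column */

    for (int l = 0; l < sqrtP; l++) {
        /* the owner of block column l of A / block row l of B broadcasts it */
        if (myCol == l) memcpy(Atmp, Ablock, bs * bs * sizeof(double));
        if (myRow == l) memcpy(Btmp, Bblock, bs * bs * sizeof(double));
        MPI_Bcast(Atmp, bs * bs, MPI_DOUBLE, l, rowComm);
        MPI_Bcast(Btmp, bs * bs, MPI_DOUBLE, l, colComm);

        /* local update C(I,J) += A(I,l) * B(l,J)  (naive triple loop) */
        for (int i = 0; i < bs; i++)
            for (int j = 0; j < bs; j++)
                for (int k = 0; k < bs; k++)
                    Cblock[i * bs + j] += Atmp[i * bs + k] * Btmp[k * bs + j];
    }
    MPI_Comm_free(&rowComm);
    MPI_Comm_free(&colComm);
}

In a production code, the naive triple loop would be replaced by a Level-3 BLAS call, which is exactly the "blocking" refinement shown on the next slide.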

Case study: parallel matrix-matrix product

• Even more efficient: use a "blocking" algorithm

    for i = 0 to k-1 step b
        end = min(i+b-1, k-1)
        broadcast A(I, i:end) within the process row
        broadcast B(i:end, J) within the process column
        C(I,J) += A(I, i:end) * B(i:end, J)      // perform this product using Level-3 BLAS
    endfor

• The SUMMA algorithm is implemented in PBLAS = Parallel BLAS
• The algorithm can be extended to a block-cyclic layout (see further)

Case study: parallel Gaussian elimination

(Image taken from J. Demmel)

Case Study 2: Parallel Sorting

Case study: parallel sorting algorithm

• Sequential sorting of n keys (1st Bachelor)
  § Bubble sort: O(n^2)
  § Merge sort: O(n log n), even in the worst case
  § Quicksort: O(n log n) expected, O(n^2) worst case, fast in practice
• Parallel sorting of n keys, using P processes
  § Initially, each process holds n/P keys (unsorted)
  § Eventually, each process holds n/P keys (sorted)
    o The keys per process are sorted
    o If q < r, each key assigned to process q is less than or equal to every key assigned to process r (sort keys in rank order)

Case study: parallel sorting algorithm

Bubble sort algorithm: O(n^2)

void Bubble_sort(int *a, int n) {
    for (int listLen = n; listLen >= 2; listLen--)
        for (int i = 0; i < listLen-1; i++)
            if (a[i] > a[i+1]) {       // "compare-swap" operation
                int temp = a[i];
                a[i] = a[i+1];
                a[i+1] = temp;
            }
}

Example: 5 2 4 8 1 → 2 4 5 1 8   (after iteration 1)
                   → 2 4 1 5 8   (after iteration 2)
                   → 2 1 4 5 8   (after iteration 3)
                   → 1 2 4 5 8   (after iteration 4)

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

• Bubble sort
  § The result of the current step (a[i] > a[i+1]?) depends on the previous step
    o The value of a[i] is determined by the previous step
    o The algorithm is "inherently serial"
    o Not much point in trying to parallelize this algorithm
• Odd-even transposition sort
  § Decouple the algorithm into two phases: even and odd
    o Even phase: compare-swap on the element pairs (a[0],a[1]), (a[2],a[3]), (a[4],a[5]), ...
    o Odd phase: compare-swap on the element pairs (a[1],a[2]), (a[3],a[4]), (a[5],a[6]), ...

Case study: parallel sorting algorithm

Even-odd transposition sort algorithm: O(n^2)

void Even_odd_sort(int *a, int n) {
    for (int phase = 0; phase < n; phase++)
        if (phase % 2 == 0) { // even phase
            for (int i = 0; i < n-1; i += 2)
                if (a[i] > a[i+1]) {      // "compare-swap" operation
                    int temp = a[i];
                    a[i] = a[i+1];
                    a[i+1] = temp;
                }
        } else { // odd phase
            for (int i = 1; i < n-1; i += 2)
                if (a[i] > a[i+1]) {      // "compare-swap" operation
                    int temp = a[i];
                    a[i] = a[i+1];
                    a[i+1] = temp;
                }
        }
}

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Even-odd transposition sort algorithm: O(n^2)

Example: 5 2 4 8 1 → 2 5 4 8 1   (even phase)
                   → 2 4 5 1 8   (odd phase)
                   → 2 4 1 5 8   (even phase)
                   → 2 1 4 5 8   (odd phase)
                   → 1 2 4 5 8   (even phase)

The parallelism within each even or odd phase is now obvious:
the compare-swap between (a[i], a[i+1]) is independent from (a[i+2], a[i+3]).

Case study: parallel sorting algorithm

Parallel algorithm
• First, assume P == n (one element per process)

(Diagram: during phase j, neighboring processes holding a[i-2], a[i-1], a[i], a[i+1], ... exchange values in pairs; during phase j+1 the pairing shifts by one. All compare-swaps within a phase execute in parallel.)

Communicate the value with the neighbor:
• The right process (highest rank) keeps the largest value
• The left process (lowest rank) keeps the smallest value

(Image reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Parallel algorithm
• Now assume n/P >> 1 (as is typically the case)
• Example: P = 4; n = 16

                 process 0     process 1     process 2     process 3
initial values   15,11,9,16    3,14,8,7      4,6,12,10     5,2,13,1
local sorting    9,11,15,16    3,7,8,14      4,6,10,12     1,2,5,13
phase 0 (even)   3,7,8,9       11,14,15,16   1,2,4,5       6,10,12,13
phase 1 (odd)    3,7,8,9       1,2,4,5       11,14,15,16   6,10,12,13
phase 2 (even)   1,2,3,4       5,7,8,9       6,10,11,12    13,14,15,16
phase 3 (odd)    1,2,3,4       5,6,7,8       9,10,11,12    13,14,15,16

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Parallel even-odd transposition sort pseudocode:

sort local keys
for (int phase = 0; phase < P; phase++) {
    neighbor = computeNeighbor(phase, myRank);
    if (I'm not idle) { // first and/or last process may be idle
        send all my keys to neighbor
        receive all keys from neighbor
        if (myRank < neighbor)
            keep smaller keys
        else
            keep larger keys
    }
}

Theorem: the parallel odd-even transposition sort algorithm sorts the input list after P (= number of processes) phases.

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Implementation of computeNeighbor (MPI):

int computeNeighbor(int phase, int myRank) {
    int neighbor;
    if (phase % 2 == 0) {
        if (myRank % 2 == 0)
            neighbor = myRank + 1;
        else
            neighbor = myRank - 1;
    } else {
        if (myRank % 2 == 0)
            neighbor = myRank - 1;
        else
            neighbor = myRank + 1;
    }
    if (neighbor == -1 || neighbor == P)   // out of range: no partner in this phase
        neighbor = MPI_PROC_NULL;
    return neighbor;
}

When MPI_PROC_NULL is used as the destination or source rank in MPI_Send or MPI_Recv, no communication takes place.

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

Implementation of the data exchange in MPI:
• Be careful of deadlocks
• In both even and odd phases, communication always takes place between a process with an even rank and a process with an odd rank, so the two sides can order their calls differently:

if (myRank % 2 == 0) {
    MPI_Send(...);
    MPI_Recv(...);
} else {
    MPI_Recv(...);
    MPI_Send(...);
}

Exchange the order to prevent deadlocks! Alternatively, use MPI_Sendrecv(...), as in the sketch below.

(Algorithm reproduced from P. Pacheco)
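A minimal sketch of one phase of the key exchange using MPI_Sendrecv; the helper exchange_phase() and the merge step (a simple re-sort instead of a linear merge) are illustrative, not taken from the slides:

#include <mpi.h>
#include <stdlib.h>

static int cmp(const void *x, const void *y) {
    return (*(const int*)x > *(const int*)y) - (*(const int*)x < *(const int*)y);
}

/* One phase of parallel odd-even transposition sort: exchange the whole
   sorted local block with 'neighbor' via MPI_Sendrecv (deadlock-free, and
   a no-op when neighbor == MPI_PROC_NULL), then keep the smaller or
   larger half depending on the rank order. */
void exchange_phase(int *myKeys, int nLocal, int myRank, int neighbor) {
    int *recvKeys = malloc(nLocal * sizeof(int));
    int *merged   = malloc(2 * nLocal * sizeof(int));

    MPI_Sendrecv(myKeys,   nLocal, MPI_INT, neighbor, 0,
                 recvKeys, nLocal, MPI_INT, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (neighbor != MPI_PROC_NULL) {
        for (int i = 0; i < nLocal; i++) {
            merged[i] = myKeys[i];
            merged[nLocal + i] = recvKeys[i];
        }
        qsort(merged, 2 * nLocal, sizeof(int), cmp);    /* simple, not a linear merge */
        int offset = (myRank < neighbor) ? 0 : nLocal;  /* lower rank keeps smaller keys */
        for (int i = 0; i < nLocal; i++) myKeys[i] = merged[offset + i];
    }
    free(recvKeys);
    free(merged);
}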

Case study: parallel sorting algorithm

Parallel odd-even transposition sort algorithm analysis
• Initial sorting: O(n/P · log(n/P)) time
  • Use an efficient sequential sorting algorithm, e.g. quicksort or merge sort
• Per phase: 2(α + (n/P)·β) + γ·n/P
• Total runtime: T_P(n) = O(n/P · log(n/P)) + 2(α·P + n·β) + γ·n
                        = 1/P · O(n log n) + O(n)
• Linear speedup when P is small and n is large
• However, bad asymptotic behaviour
  • When n and P increase proportionally, the runtime per process is O(n)
  • What we really want: O(log n)
  • Difficult! (but possible!)

(Algorithm reproduced from P. Pacheco)

Case study: parallel sorting algorithm

• Sorting networks (= graphical depiction of sorting algorithms)
  § A number of horizontal "wires" (= elements to sort)
  § Connected by vertical "comparators" (= compare and swap)

(Diagram: example network with 4 elements to sort on the vertical axis and time running horizontally; each comparator takes inputs x and y and outputs min(x,y) on one wire and max(x,y) on the other.)

Case study: parallel sorting algorithm

(Diagram: the bubble sort algorithm drawn as a sorting network, sequentially and in parallel.)

Sequential runtime = number of comparators = "size of the sorting network" = n·(n - 1)/2

Parallel runtime (assume P == n processes) = "depth of the sorting network" = 2n - 3

Case study: parallel sorting algorithm

Definition: depth of a sorting network (= parallel runtime)
• Zero at the inputs of each wire
• For a comparator with inputs of depth d1 and d2, the depth of its outputs is 1 + max(d1, d2)
• Depth of the sorting network = maximum depth over the outputs

(Diagram: the bubble sort network annotated with the depth of each comparator, increasing from 1 up to 2n - 3.)

Case study: parallel sorting algorithm

• Sorting network of odd-even transposition sort
  (Diagram: n wires with alternating even and odd comparator columns.)
• Parallel runtime = "depth of the sorting network" = n
• ... or P (we assume n == P)

Case study: parallel sorting algorithm

• Can we do better?
  • Sequential sorting algorithms are O(n log n)
  • Ideally, P and n can scale proportionally: P = O(n)
  • That means that we want to sort n numbers in O(log n) time
• This is possible (!), however with a big constant pre-factor
• We will describe an algorithm that can sort n numbers in O(log^2 n) parallel time (using P = O(n) processes)
  • This algorithm has the best performance in practice
  • ... unless n becomes huge (n > 2^2000)
  • Nobody wants to sort that many numbers

Case study: parallel sorting algorithm

• Theorem (the 0-1 principle): if a sorting network with n inputs sorts all 2^n binary strings of length n correctly, then it sorts all sequences correctly (proof: see references).

(Example diagram: a binary input sequence and its sorted 0/1 output.)

We will design an algorithm that can sort binary sequences in O(log^2 n) time.

Case study: parallel sorting algorithm

• Step 1: create a sorting network that sorts bitonic sequences
• Definition: a bitonic sequence is a sequence which is first increasing and then decreasing, or can be circularly shifted to become so.
  § (1, 2, 3, 3.14, 5, 4, 3, 2, 1) is bitonic
  § (4, 5, 4, 3, 2, 1, 1, 2, 3) is bitonic
  § (1, 2, 1, 2) is not bitonic
• Over zeros and ones, a bitonic sequence is of the form
  § 0^i 1^j 0^k or 1^i 0^j 1^k (with e.g. 0^i = 000...0 = i consecutive zeros)
  § i, j or k can be zero

Case study: parallel sorting algorithm

• Now, let's create a sorting network that sorts a bitonic sequence
• A half-cleaner network connects line i with line i + n/2

(Diagram: half-cleaner example for n = 8.)

• If the input is a binary bitonic sequence, then for the output:
  § Elements in the top half are smaller than the corresponding elements in the bottom half, i.e. the halves are relatively sorted.
  § One of the halves of the output consists of only zeros or only ones (i.e. is "clean"); the other half is bitonic.

Case study: parallel sorting algorithm

• Example of a half-cleaner network

(Diagram: a bitonic 0/1 input sequence of length 8 enters the half-cleaner; at the output, the upper half is smaller than the lower half; here the upper half is "clean" (all zeros) and the lower half is bitonic.)

Case study: parallel sorting algorithm

• Therefore, a bitonic-sorter[n] (i.e. a network that sorts a bitonic sequence of length n) is obtained as a Half-cleaner[n] followed by a Bitonic-sorter[n/2] on each half.

A bitonic sorter sorts a bitonic sequence of length n = 2^k using
• size = n·k/2 = (n/2)·log2 n comparators (= sequential time)
• depth = k = log2 n (= parallel time)

Case study: parallel sorting algorithm

• Step 2: build a network merger[n] that merges two sorted sequences of length n/2 so that the output is sorted
  § Flip the second sequence and concatenate the first and the flipped second sequence
  § The concatenated sequence is bitonic, so it can be sorted using step 1

(Diagram: sorted sequence 1 and the flipped sorted sequence 2 enter the first comparator column; the two halves of the output then go to two bitonic-sorter[n/2] networks.)

Case study: parallel sorting algorithm

• Step 3: build a sorter[n] network that sorts arbitrary sequences
  § Do this recursively from merger[n] building blocks: a sorter[n] consists of two sorter[n/2] networks followed by a merger[n] (from step 2).
  § Depth: D(1) = 0 and D(n) = D(n/2) + log2 n = O(log^2 n)
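The same recursive structure can be written down directly as (sequential) code. A minimal sketch for n a power of two; in a parallel implementation, every comparator column of a stage would be executed concurrently:

/* Bitonic sort sketch for n = 2^k elements.
   bitonic_merge() plays the role of merger[n]/bitonic-sorter[n]: it assumes
   a[lo..lo+n-1] is bitonic and sorts it (dir = 1 ascending, dir = 0 descending). */
static void compare_swap(int a[], int i, int j, int dir) {
    if ((a[i] > a[j]) == dir) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

static void bitonic_merge(int a[], int lo, int n, int dir) {
    if (n <= 1) return;
    int m = n / 2;
    for (int i = lo; i < lo + m; i++)      /* half-cleaner[n] */
        compare_swap(a, i, i + m, dir);
    bitonic_merge(a, lo,     m, dir);      /* bitonic-sorter[n/2] */
    bitonic_merge(a, lo + m, m, dir);
}

void bitonic_sort(int a[], int lo, int n, int dir) {   /* sorter[n] */
    if (n <= 1) return;
    int m = n / 2;
    bitonic_sort(a, lo,     m, 1);   /* sort the first half ascending */
    bitonic_sort(a, lo + m, m, 0);   /* sort the second half descending (= the "flip") */
    bitonic_merge(a, lo, n, dir);    /* merger[n] */
}

Called as bitonic_sort(a, 0, n, 1) for an ascending sort of n = 2^k elements.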

Case study: parallel sorting algorithm

Example: sorter[16]

(Diagram: the sorter[16] network built from merger[2], merger[4], merger[8] and merger[16] stages; the merger[16] stage starts with a flip-cleaner[16] followed by half-cleaner[8] networks.)

Case study: parallel sorting algorithm

• Further reading on bitonic networks:
  § http://valis.cs.uiuc.edu/~sariel/teach/2004/b/webpage/lec/14_sortnet_notes.pdf
• In case n is not a power of two:
  § http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/oddn.htm