UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science
CMPSCI 377: Operating Systems
Distributed Parallel Programming
Emery Berger
University of Massachusetts Amherst
Outline
Previously:
Programming with threads
Shared memory, single machine
Today:
Distributed parallel programming
Message passing
some material adapted from slides by Kathy Yelick, UC Berkeley
Why Distribute?
SMP (symmetric multiprocessor): easy to program, but limited
The bus becomes a bottleneck when processors are not operating locally
Typically < 32 processors
$$$
[Figure: SMP architecture: processors P1, P2, ..., Pn, each with a cache ($), sharing a single memory over a network/bus]
Distributed Memory
Vastly different platforms:
Networks of workstations
Supercomputers
Clusters
Distributed Architectures
Distributed memory machines: local memory, but no global memory
Individual nodes are often SMPs
Network interface (NI) for all interprocessor communication: message passing
[Figure: distributed memory architecture: nodes P0, P1, ..., Pn, each with its own memory and network interface (NI), connected by an interconnect]
Message Passing
Program: a number of independent, communicating processes
Each is a thread + a local address space only
Shared data: partitioned across the processes
Processes communicate by send & receive events
On a cluster: messages are sent over sockets
[Figure: processes P0, P1, ..., Pn, each with private copies of s and i (s: 12, i: 2; s: 14, i: 3; s: 11, i: 1); y = ..s.. reads only the local s; data crosses the network only via explicit send (send P1,s) and receive (receive Pn,s)]
Message Passing
Pros: efficient
Makes data sharing explicit
Can communicate only what is strictly necessary for computation
No coherence protocols, etc.
Cons: difficult
Requires manual partitioning: divide up the problem across processors
Unnatural model (for some)
Deadlock-prone (hurray)
Message Passing Interface (MPI)
Library approach to message passing
Supports most common architectural abstractions
Vendors supply optimized versions
⇒ programs run on different machines, but with (somewhat) different performance
Bindings for popular languages
Especially Fortran, C
Also C++, Java
MPI execution model
Spawns multiple copies of the same program (SPMD = single program, multiple data)
Each copy is a different “process” (different local memory)
Each copy can act differently by determining which process “self” corresponds to (see the sketch below)
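For instance, a minimal SPMD branching sketch (not from the slides; assumes rank was already obtained via MPI_Comm_rank, as in the example that follows):

    // Every process runs this same code; behavior diverges based on rank.
    if (rank == 0) {
        // coordinator: e.g., read input, hand out work, collect results
    } else {
        // worker: e.g., compute on this process's piece of the data
    }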
An Example

% mpirun -np 10 exampleProgram

#include <stdio.h>
#include <mpi.h>

int main(int argc, char * argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                // initializes MPI (passes arguments in)
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // returns # of processes in the "world"
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // which process am I?
    printf("Hello world from process %d of %d\n",
           rank, size);
    MPI_Finalize();                        // we're done sending messages
    return 0;
}
An Example

% mpirun -np 10 exampleProgram
Hello world from process 5 of 10
Hello world from process 3 of 10
Hello world from process 9 of 10
Hello world from process 0 of 10
Hello world from process 2 of 10
Hello world from process 4 of 10
Hello world from process 1 of 10
Hello world from process 6 of 10
Hello world from process 8 of 10
Hello world from process 7 of 10
%

What happened? The ten processes run concurrently, with no ordering guarantees between them, so their output lines arrive in an unpredictable, interleaved order.
Message Passing
Messages can be sent directly to another process:
MPI_Send, MPI_Recv
Or to all processes:
MPI_Bcast (the same call performs the send at the root and the receive on every other process)
Send/Recv Example

Send data from process 0 to all
“Pass it along” communication
Operations:

    MPI_Send(data, count, MPI_INT, dest, 0, MPI_COMM_WORLD);
    MPI_Recv(data, count, MPI_INT, source, 0, MPI_COMM_WORLD, &status);
Send & Receive
Send integer input in a ring
int main(int argc, char * argv[]) {int rank, value, size;MPI_Status status;MPI_Init(&argc, &argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &size);do {
if (rank == 0) {scanf( "%d", &value );MPI_Send(&value, 1, MPI_INT, rank + 1,
0, MPI_COMM_WORLD );} else {
MPI_Recv(&value, 1, MPI_INT, rank - 1,0, MPI_COMM_WORLD, &status );
if (rank < size - 1)MPI_Send( &value, 1, MPI_INT, rank + 1,
0, MPI_COMM_WORLD );}printf("Process %d got %d\n", rank, value);
} while (value >= 0);MPI_Finalize();return 0;
}
UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science 19
Send & Receive
Send integer input in a ring
int main(int argc, char * argv[]) {int rank, value, size;MPI_Status status;MPI_Init(&argc, &argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &size);do {
if (rank == 0) {scanf( "%d", &value );MPI_Send(&value, 1, MPI_INT, rank + 1,
0, MPI_COMM_WORLD );} else {
MPI_Recv(&value, 1, MPI_INT, rank - 1,0, MPI_COMM_WORLD, &status );
if (rank < size - 1)MPI_Send( &value, 1, MPI_INT, rank + 1,
0, MPI_COMM_WORLD );}printf("Process %d got %d\n", rank, value);
} while (value >= 0);MPI_Finalize();return 0;
}
send destination?
UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science 20
Send & Receive
Send integer input in a ring
int main(int argc, char * argv[]) {int rank, value, size;MPI_Status status;MPI_Init(&argc, &argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &size);do {
if (rank == 0) {scanf( "%d", &value );MPI_Send(&value, 1, MPI_INT, rank + 1,
0, MPI_COMM_WORLD );} else {
MPI_Recv(&value, 1, MPI_INT, rank - 1,0, MPI_COMM_WORLD, &status );
if (rank < size - 1)MPI_Send( &value, 1, MPI_INT, rank + 1,
0, MPI_COMM_WORLD );}printf("Process %d got %d\n", rank, value);
} while (value >= 0);MPI_Finalize();return 0;
}
receive from?
UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science 21
Send & Receiveint main(int argc, char * argv[]) {int rank, value, size;MPI_Status status;MPI_Init(&argc, &argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &size);do {
if (rank == 0) {scanf( "%d", &value );MPI_Send(&value, 1, MPI_INT, rank + 1,
0, MPI_COMM_WORLD );} else {
MPI_Recv(&value, 1, MPI_INT, rank - 1,0, MPI_COMM_WORLD, &status );
if (rank < size - 1)MPI_Send( &value, 1, MPI_INT, rank + 1,
0, MPI_COMM_WORLD );}printf("Process %d got %d\n", rank, value);
} while (value >= 0);MPI_Finalize();return 0;
}
message tag
message tag
message tag
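As noted earlier, blocking message passing is deadlock-prone. A minimal sketch of the classic pitfall (not from the slides; hypothetical buffers a and b of length N): if two processes both send first and receive second, each blocking MPI_Send may wait for a matching receive that is never posted.

    // Both processes send first, then receive: this can deadlock, because
    // a blocking MPI_Send is allowed to wait for the matching MPI_Recv.
    if (rank == 0) {
        MPI_Send(a, N, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(b, N, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
    } else if (rank == 1) {
        MPI_Send(b, N, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(a, N, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }
    // One fix: reverse the send/receive order on one side; others include
    // MPI_Sendrecv and the non-blocking operations covered later.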
Exercise

Compute expensiveComputation(i) on n processes; process 0 computes & prints the sum
// MPI_Send (&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD );int main(int argc, char * argv[]) {int rank, size;MPI_Status status;MPI_Init(&argc, &argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &size);if (rank == 0) {
int sum = 0;
printf(“sum = %d\n", sum);} else {
}MPI_Finalize(); return 0;
}
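One possible solution sketch (not from the slides; expensiveComputation is stubbed here so the program is self-contained): every process computes its own piece, and process 0 collects and sums them.

#include <stdio.h>
#include <mpi.h>

// Hypothetical stand-in for the exercise's expensiveComputation().
static int expensiveComputation(int i) { return i * i; }

int main(int argc, char * argv[]) {
    int rank, size;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        // Process 0 computes its own piece, then receives one value
        // from each other process and accumulates the total.
        int sum = expensiveComputation(0);
        for (int i = 1; i < size; i++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
            sum += value;
        }
        printf("sum = %d\n", sum);
    } else {
        // Every other process computes its piece and sends it to process 0.
        int value = expensiveComputation(rank);
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

With the collective operations mentioned at the end of these slides, the receive loop collapses to a single call: MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD).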
Broadcast
Send and receive: point-to-point
Can also broadcast data
Source sends to everyone else
Broadcast

Repeatedly broadcast input (one integer) to all

#include <stdio.h>
#include <mpi.h>

int main(int argc, char * argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    do {
        if (rank == 0)
            scanf("%d", &value);
        MPI_Bcast(&value,           // value to send (at the root) or receive (elsewhere)
                  1,                // how many to send/receive
                  MPI_INT,          // the datatype
                  0,                // who's "root" for the broadcast
                  MPI_COMM_WORLD);
        printf("Process %d got %d\n", rank, value);
    } while (value >= 0);
    MPI_Finalize();
    return 0;
}
Communication Flavors
Basic communication
blocking = wait until done
point-to-point = from me to you
broadcast = from me to everyone
Non-blocking (see the sketch below)
Think create & join, fork & wait…
MPI_Isend, MPI_Irecv
MPI_Wait, MPI_Waitall, MPI_Test
Collective: e.g., MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather
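A minimal non-blocking sketch (not from the slides; assumes exactly two processes): each rank posts a receive, starts a send, may overlap useful computation, then waits for both, in the spirit of fork & wait.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char * argv[]) {
    int rank, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sendval = rank;
    int other = 1 - rank;   // partner process (assumes ranks 0 and 1 only)
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    // ... overlap useful computation here while messages are in flight ...
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  // the "join": both complete
    printf("Process %d got %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}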
The End
Scaling Limits
Kernel used in atmospheric models
99% floating-point ops; multiplies/adds
Sweeps through memory with little reuse
One “copy” of code running independently on varying numbers of procs
(From Pat Worley, ORNL)