MPI: Message Passing Interface
Mehmet Balman, CmpE 587
Dec, 2001
Parallel Computing
•Separate workers or processes
•Interact by exchanging information
Types of parallel computing:
• SIMD (single instruction multiple data)
• SPMD (single program multiple data)
• MPMD (multiple program multiple data)
Hardware models:
•Distributed memory (Paragon, IBM SPx, workstation network)
•Shared memory (SGI Power Challenge, Cray T3D)
Communication with other processes
•Cooperative
all parties agree to transfer data
•One sided
one worker performs transfer of data
What is MPI?
•A message-passing library specification
•Multiple processors by message passing
•Library of functions and macros that can be used in Fortran and C programs
•For parallel computers, clusters, and heterogeneous networks
Who designed MPI?
•Vendors: IBM, Intel, TMC, Meiko, Cray, Convex, nCUBE
•Library writers: PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda
•Broad Participation
Development history (1992-1994):
•Began at Williamsburg Workshop in April, 1992
•Organized at Supercomputing '92 (November)
•Met every six weeks for two days
•Pre-final draft distributed at Supercomputing '93
•Final version of draft in May, 1994
•Public and vendor implementations available
Features of MPI
•Point-to-point communication
• blocking, nonblocking
• synchronous, asynchronous
• ready, buffered
•Collective routines
• built-in, user defined
•Large # of data movement routines
•Built-in support for grids and graphs
•125 functions (MPI is large)
•6 basic functions (MPI is small)
•Communicators combine context and groups for message security
Example:
#include "mpi.h"
#include <stdio.h>
int main( int argc, char **argv){
int rank, size;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
printf( "Hello world! I'm %d of %d\n", rank, size );
MPI_Finalize(); return 0;
}
What happens when an MPI job is run
1. The user issues a directive to the operating system which has the effect of placing a copy of the executable program on each processor.
2. Each processor begins execution of its copy of the executable.
3. Different processes can execute different statements by branching within the program. Typically the branching is based on process rank.
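The rank-based branching in step 3 can be sketched in plain C. In a real MPI program the rank would come from MPI_Comm_rank; here it is passed in as a parameter purely for illustration, and the role names are made up:

```c
#include <assert.h>
#include <string.h>

/* Sketch of SPMD branching: every process runs the same program,
 * but chooses its role from its rank. The rank is a parameter here
 * only so the sketch compiles without mpi.h. */
static const char *role_for_rank(int rank) {
    if (rank == 0)
        return "master";   /* e.g., read input, distribute work */
    else
        return "worker";   /* e.g., compute on an assigned subrange */
}
```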
Envelope of a message (control block):
• the rank of the receiver
• the rank of the sender
• a tag
• a communicator
Two mechanisms for partitioning the message space:
•Tags (0-32767 at minimum)
•Communicators (e.g., MPI_COMM_WORLD)
MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv
MPI_Send( start, count, datatype, dest, tag, comm )
MPI_Recv(start, count, datatype, source, tag, comm, status)
MPI_Bcast(start, count, datatype, root, comm)
MPI_Reduce(start, result, count, datatype, operation, root, comm)
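The effect of MPI_Reduce with MPI_SUM can be illustrated without MPI: the root ends up with the elementwise sum of every process's send buffer. In this plain-C sketch the per-process buffers are rows of a 2-D array instead of living on separate processes; the sizes are arbitrary:

```c
#include <assert.h>

#define NPROCS 4   /* stand-in for the communicator size */
#define COUNT  3   /* elements per process buffer */

/* Simulates MPI_Reduce(..., MPI_SUM, root, ...): result[i] is the
 * sum of bufs[p][i] over all "processes" p. */
static void reduce_sum(int bufs[NPROCS][COUNT], int result[COUNT]) {
    for (int i = 0; i < COUNT; i++) {
        result[i] = 0;
        for (int p = 0; p < NPROCS; p++)
            result[i] += bufs[p][i];
    }
}
```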
Collective patterns
Collective Computation Operations
Operation Name Meaning
MPI_MAX     Maximum
MPI_MIN     Minimum
MPI_SUM     Sum
MPI_PROD    Product
MPI_LAND    Logical And
MPI_BAND    Bitwise And
MPI_LOR     Logical Or
MPI_BOR     Bitwise Or
MPI_LXOR    Logical Exclusive Or
MPI_BXOR    Bitwise Exclusive Or
MPI_MAXLOC  Maximum and Location of Maximum
MPI_MINLOC  Minimum and Location of Minimum
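MPI_MAXLOC and MPI_MINLOC operate on (value, index) pairs (e.g. the MPI_DOUBLE_INT type in C): the larger (or smaller) value wins, and ties go to the lower index. A plain-C sketch of the MAXLOC combine rule:

```c
#include <assert.h>

/* Pair layout matching what MPI_DOUBLE_INT describes. */
struct double_int { double val; int idx; };

/* Combine rule of MPI_MAXLOC: keep the larger value; on a tie,
 * keep the smaller index. */
static struct double_int maxloc(struct double_int a, struct double_int b) {
    if (a.val > b.val) return a;
    if (b.val > a.val) return b;
    return (a.idx < b.idx) ? a : b;   /* tie: lower index wins */
}
```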
MPI_Op_create(user_function, commute, op)
 commute: true if the operation is commutative
MPI_Op_free(op)
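The user_function passed to MPI_Op_create has the shape void fn(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype) and must combine invec into inoutvec elementwise. Below is a sketch of a commutative elementwise-product operation; the MPI_Datatype argument is dropped so the sketch compiles without mpi.h:

```c
#include <assert.h>

/* Sketch of an MPI reduction function body: combines invec into
 * inoutvec elementwise (here: product of ints). In real MPI code the
 * signature would also take an MPI_Datatype* last argument. */
static void int_prod(void *invec, void *inoutvec, int *len) {
    int *in = (int *)invec, *inout = (int *)inoutvec;
    for (int i = 0; i < *len; i++)
        inout[i] *= in[i];
}
```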
User defined communication groups
Communicator: contains a context and a group. Group : just a set of processes.
MPI_Comm_create( oldcomm, group, &newcomm )
MPI_Comm_group( oldcomm, &group )
MPI_Group_free( &group )
MPI_Group_incl MPI_Group_excl
MPI_Group_range_incl MPI_Group_range_excl
MPI_Group_union MPI_Group_intersection
Non-blocking operations
•MPI_Isend(start, count, datatype, dest, tag, comm, request)
•MPI_Irecv(start, count, datatype, source, tag, comm, request)
•MPI_Wait(request, status)
•MPI_Waitall
•MPI_Waitany
•MPI_Waitsome
•MPI_Test( request, flag, status)
Non-blocking operations return immediately.
Communication Modes
•Synchronous mode ( MPI_Ssend): the send does not complete until a matching receive has begun.
•Buffered mode ( MPI_Bsend): the user supplies the buffer to system for its use.
•Ready mode ( MPI_Rsend): user guarantees that matching receive has been posted.
Non-blocking versions: MPI_Issend, MPI_Irsend, MPI_Ibsend
int bufsize = 100000;   /* must be large enough, including MPI_BSEND_OVERHEAD */
char *buf = malloc(bufsize);
MPI_Buffer_attach( buf, bufsize );
...
MPI_Bsend( ... same as MPI_Send ... );
...
MPI_Buffer_detach( &buf, &bufsize );
Datatypes
Two main purposes:
•Heterogeneity: parallel programs between different processors
•Noncontiguous data: structures, vectors with non-unit stride
MPI datatype        C datatype
MPI_CHAR            signed char
MPI_SHORT           signed short int
MPI_INT             signed int
MPI_LONG            signed long int
MPI_UNSIGNED_CHAR   unsigned char
MPI_UNSIGNED_SHORT  unsigned short int
MPI_UNSIGNED        unsigned int
MPI_UNSIGNED_LONG   unsigned long int
MPI_FLOAT           float
MPI_DOUBLE          double
MPI_LONG_DOUBLE     long double
MPI_BYTE
MPI_PACKED
Build derived type:
void Build_derived_type(INDATA_TYPE* indata, MPI_Datatype* message_type_ptr){
    int block_lengths[3];
    MPI_Aint displacements[3];
    MPI_Aint addresses[4];
    MPI_Datatype typelist[3];

    typelist[0] = MPI_FLOAT; typelist[1] = MPI_FLOAT; typelist[2] = MPI_INT;
    block_lengths[0] = block_lengths[1] = block_lengths[2] = 1;

    MPI_Address(indata, &addresses[0]);
    MPI_Address(&(indata->a), &addresses[1]);
    MPI_Address(&(indata->b), &addresses[2]);
    MPI_Address(&(indata->n), &addresses[3]);

    displacements[0] = addresses[1] - addresses[0];
    displacements[1] = addresses[2] - addresses[0];
    displacements[2] = addresses[3] - addresses[0];

    MPI_Type_struct(3, block_lengths, displacements, typelist, message_type_ptr);
    MPI_Type_commit(message_type_ptr);
}
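The displacements that Build_derived_type computes with MPI_Address differences are simply the byte offsets of the struct members from the start of the struct, which standard C exposes via offsetof. Assuming INDATA_TYPE looks like the struct below (the member names a, b, n are taken from the example above):

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout of INDATA_TYPE, matching the members a, b, n
 * used in Build_derived_type above. */
typedef struct {
    float a;
    float b;
    int   n;
} INDATA_TYPE;

/* offsetof(INDATA_TYPE, m) equals the displacement
 * MPI_Address(&indata->m) - MPI_Address(indata). */
```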
Other derived data types
•int MPI_Type_contiguous(int count, MPI_Datatype oldtype,MPI_Datatype *newtype)
elements are contiguous entries in an array
•int MPI_Type_vector(int count, int block_length,int stride,MPI_Datatype element_type,MPI_Datatype *new_type)
elements are equally spaced entries of an array
•int MPI_Type_indexed(int count, int *array_of_blocklengths, int *array_of_displacements, MPI_Datatype element_type, MPI_Datatype *new_type)
elements are arbitrary entries of an array
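Which array entries MPI_Type_vector(count, block_length, stride, ...) describes can be written down directly: block b, element e within the block, maps to index b*stride + e of the old array. A one-line plain-C sketch of that mapping:

```c
#include <assert.h>

/* Index selected by MPI_Type_vector: block `block` (0-based),
 * element `elem` within the block, with the given stride.
 * E.g. count=3, block_length=2, stride=4 selects 0,1, 4,5, 8,9. */
static int vector_index(int block, int elem, int stride) {
    return block * stride + elem;
}
```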
Pack/unpack:
void Get_data4(int my_rank, float* a_ptr, float* b_ptr, int* n_ptr){
    int root = 0;
    char buffer[100];
    int position;
    if (my_rank == 0){
        printf("Enter a, b, and n\n");
        scanf("%f %f %d", a_ptr, b_ptr, n_ptr);
        position = 0;
        MPI_Pack(a_ptr, 1, MPI_FLOAT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(b_ptr, 1, MPI_FLOAT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(n_ptr, 1, MPI_INT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Bcast(buffer, 100, MPI_PACKED, root, MPI_COMM_WORLD);
    } else {
        MPI_Bcast(buffer, 100, MPI_PACKED, root, MPI_COMM_WORLD);
        position = 0;
        MPI_Unpack(buffer, 100, &position, a_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, b_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, n_ptr, 1, MPI_INT, MPI_COMM_WORLD);
    }
}
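The position bookkeeping that MPI_Pack/MPI_Unpack do can be mimicked in plain C with memcpy: each call copies `size` bytes at the current position and advances it, so unpacking in the same order round-trips the values. A sketch (no MPI, and no cross-machine data conversion, which the real calls also handle):

```c
#include <assert.h>
#include <string.h>

/* Mimic of MPI_Pack: append `size` bytes of src to buf at *pos. */
static void pack_bytes(const void *src, size_t size, char *buf, int *pos) {
    memcpy(buf + *pos, src, size);
    *pos += (int)size;
}

/* Mimic of MPI_Unpack: read `size` bytes at *pos from buf into dst. */
static void unpack_bytes(const char *buf, int *pos, void *dst, size_t size) {
    memcpy(dst, buf + *pos, size);
    *pos += (int)size;
}
```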
Profiling
static int nsend = 0;
int MPI_Send(void *start, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
{
    nsend++;
    return PMPI_Send(start, count, datatype, dest, tag, comm);
}
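The interposition idea behind the PMPI profiling interface can be shown without MPI: the user-visible function counts the call, then forwards to the "real" implementation under a P-prefixed name. The Fake_Send/PFake_Send names below are hypothetical stand-ins for MPI_Send/PMPI_Send:

```c
#include <assert.h>

static int nsend = 0;   /* profiling counter, as in the slide */

/* Stand-in for PMPI_Send: the real implementation. */
static int PFake_Send(int payload) {
    return payload;   /* pretend the send succeeded */
}

/* Stand-in for MPI_Send: the profiling wrapper that users link against. */
static int Fake_Send(int payload) {
    nsend++;                      /* record the call */
    return PFake_Send(payload);   /* forward to the real routine */
}
```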
Architecture of MPI•Complex communication operations can be expressed portably in terms of lower-level ones
•All MPI functions are implemented in terms of the macros and functions that make up the ADI(Abstract Device Interface)
ADI
1. specifying a message to be sent or received
2. moving data between the API and the message-passing hardware
3. managing lists of pending messages (both sent and received)
4. providing basic information about the execution environment (e.g., how many tasks there are)
Upper layers of MPICH
Channel Interface
Routines for send and receive envelope(control) information:
•MPID_SendControl(MPID_SendControlBlock )
•MPID_RecvAnyControl
•MPID_ControlMsgAvail
Send and receive data:
•MPID_SendChannel
•MPID_RecvFromChannel
Channel Interface
Three different data exchange mechanisms:
Eager (default)
Data is sent to the destination immediately. Buffered on the receiver side.
Rendezvous (MPI_Ssend)
Data is sent to the destination only when requested.
Get (shared memory)
Data is read directly by the receiver.
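An implementation typically picks between eager and rendezvous by message size: small messages go out immediately (the receiver buffers them), large ones wait until the receiver asks for the data. A sketch of such a decision rule; the 64 KB cutoff is an illustrative assumption, not MPICH's actual threshold:

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical protocol selection: small payloads use the eager
 * exchange, large ones use rendezvous. The threshold is made up
 * for illustration. */
#define EAGER_LIMIT (64 * 1024)

static const char *choose_protocol(size_t nbytes) {
    return (nbytes <= EAGER_LIMIT) ? "eager" : "rendezvous";
}
```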
Lower layers of MPICH
Summary
•Point-to-point and collective operations
•Blocking
•Non-blocking
•Asynchronous
•Synchronous
•Buffered
•Ready
•Abstraction for processes
•Rank within a group
•Virtual topologies
•Data types
•User-defined
•Predefined
•pack/unpack
•Architecture of MPI
•ADI(Abstract Device Interface)
•Channel Interface