MPI Message Passing Interface


MPI: Message Passing Interface

Mehmet Balman, Cmpe 587

December 2001

Parallel Computing

• Separate workers or processes

• Interact by exchanging information

Types of parallel computing:

• SIMD (single instruction multiple data)

• SPMD (single program multiple data)

• MPMD (multiple program multiple data)

Hardware models:

• Distributed memory (Paragon, IBM SPx, workstation network)

•Shared memory (SGI Power Challenge, Cray T3D)

Communication with other processes:

• Cooperative: all parties agree to transfer the data

• One-sided: one worker performs the transfer of the data

What is MPI?

•A message-passing library specification

• Programs multiple processors that communicate by message passing

• A library of functions and macros that can be used in C and Fortran programs

•For parallel computers, clusters, and heterogeneous networks

Who designed MPI?

• Vendors: IBM, Intel, TMC, Meiko, Cray, Convex, nCUBE

• Library writers: PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda

• Broad participation

Development history (1992-1994):

•Began at Williamsburg Workshop in April, 1992

•Organized at Supercomputing '92 (November)

•Met every six weeks for two days

•Pre-final draft distributed at Supercomputing '93

•Final version of draft in May, 1994

•Public and vendor implementations available

Features of MPI

• Point-to-point communication

• blocking, nonblocking

• synchronous, asynchronous

• ready, buffered

•Collective routines

• built-in, user defined

•Large # of data movement routines

•Built-in support for grids and graphs

•125 functions (MPI is large)

•6 basic functions (MPI is small)

•Communicators combine context and groups for message security

Example

#include "mpi.h"

#include <stdio.h>

int main( int argc, char **argv){

int rank, size;

MPI_Init( &argc, &argv );

MPI_Comm_rank( MPI_COMM_WORLD, &rank );

MPI_Comm_size( MPI_COMM_WORLD, &size );

printf( "Hello world! I'm %d of %d\n", rank, size );

MPI_Finalize(); return 0;

}
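With a typical MPI installation (MPICH, for example) a program like this is compiled with the mpicc wrapper and launched on several processes with mpirun, e.g. mpirun -np 4 ./hello; each process then prints its own rank.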

What happens when an MPI job is run

1. The user issues a directive to the operating system, which places a copy of the executable program on each processor

2. Each processor begins execution of its copy of the executable

3. Different processes can execute different statements by branching within the program. Typically the branching is based on process rank.
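For example, a minimal sketch of rank-based branching (the coordinator/worker labels are purely illustrative):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* only the copy with rank 0 takes this branch */
        printf("I am the coordinator\n");
    } else {
        /* every other copy takes this branch */
        printf("I am worker %d\n", rank);
    }

    MPI_Finalize();
    return 0;
}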

Envelope of a message: (control block)

• the rank of the receiver

• the rank of the sender

• a tag

• a communicator

Two mechanisms for partitioning the message space:

• Tags (0-32767)

• Communicators (e.g., MPI_COMM_WORLD)

The six basic functions:

MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv

MPI_Send( start, count, datatype, dest, tag, comm )

MPI_Recv(start, count, datatype, source, tag, comm, status)

MPI_Bcast(start, count, datatype, root, comm)

MPI_Reduce(start, result, count, datatype, operation, root, comm)
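A minimal sketch of how these calls fit together, assuming the program is run with at least two processes (the payload value 42 and tag 99 are invented for illustration):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value, sum;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* point-to-point: process 0 sends an integer to process 1 with tag 99 */
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
    }

    /* collective: every process contributes its rank, the sum arrives at root 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}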

Collective patterns

Collective Computation Operations

Operation name   Meaning
MPI_MAX          Maximum
MPI_MIN          Minimum
MPI_SUM          Sum
MPI_PROD         Product
MPI_LAND         Logical and
MPI_BAND         Bitwise and
MPI_LOR          Logical or
MPI_BOR          Bitwise or
MPI_LXOR         Logical exclusive or
MPI_BXOR         Bitwise exclusive or
MPI_MAXLOC       Maximum and location of maximum
MPI_MINLOC       Minimum and location of minimum
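Note that MPI_MAXLOC and MPI_MINLOC reduce value/index pairs, so they are used with paired datatypes such as MPI_DOUBLE_INT. A minimal sketch (local_result and my_rank are assumed to have been computed earlier):

struct {
    double value;
    int    rank;
} in, out;

in.value = local_result;   /* local value, assumed computed earlier */
in.rank  = my_rank;

/* at root 0, out.value is the global maximum and out.rank the rank that owns it */
MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);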

MPI_Op_create(user_function, commute, op)
  commute: true if the operation is commutative

MPI_Op_free(op)
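A minimal sketch of registering a user-defined operation (my_op and my_sum are invented names; the function follows the MPI_User_function signature):

void my_op(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int i;
    float *in    = (float *) invec;
    float *inout = (float *) inoutvec;
    for (i = 0; i < *len; i++)
        inout[i] = in[i] + inout[i];   /* a plain elementwise sum, for illustration */
}

MPI_Op my_sum;
MPI_Op_create(my_op, 1, &my_sum);   /* commute = 1: the operation is commutative */
/* ... pass my_sum as the operation argument of MPI_Reduce ... */
MPI_Op_free(&my_sum);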

User defined communication groups

Communicator: contains a context and a group. Group: just a set of processes.

MPI_Comm_create( oldcomm, group, &newcomm )

MPI_Comm_group( oldcomm, &group )

MPI_Group_free( &group )

MPI_Group_incl MPI_Group_excl

MPI_Group_range_incl MPI_Group_range_excl

MPI_Group_union MPI_Group_intersection
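As an illustration, a minimal sketch that puts the first half of MPI_COMM_WORLD into its own communicator (assuming at least two processes; all variable names are invented):

int       size;
int       range[1][3];          /* one {first, last, stride} triplet */
MPI_Group world_group, half_group;
MPI_Comm  half_comm;

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_group(MPI_COMM_WORLD, &world_group);

range[0][0] = 0;                /* first rank in the subgroup */
range[0][1] = size / 2 - 1;     /* last rank in the subgroup  */
range[0][2] = 1;                /* stride                     */
MPI_Group_range_incl(world_group, 1, range, &half_group);

/* processes outside the group receive MPI_COMM_NULL */
MPI_Comm_create(MPI_COMM_WORLD, half_group, &half_comm);

MPI_Group_free(&world_group);
MPI_Group_free(&half_group);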

Non-blocking operations

•MPI_Isend(start, count, datatype, dest, tag, comm, request)

•MPI_Irecv(start, count, datatype, source, tag, comm, request)

•MPI_Wait(request, status)

•MPI_Waitall, MPI_Waitany, MPI_Waitsome

•MPI_Test( request, flag, status)

Non-blocking operations return immediately; completion is checked later with MPI_Wait or MPI_Test.
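A minimal sketch of posting both operations, computing, and then waiting (assuming the job is run with exactly two processes, and that rank was obtained earlier with MPI_Comm_rank):

int         sendbuf = 123, recvbuf;        /* invented payload */
int         partner = (rank == 0) ? 1 : 0;
MPI_Request requests[2];
MPI_Status  statuses[2];

MPI_Irecv(&recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Isend(&sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &requests[1]);

/* ... computation that does not touch sendbuf or recvbuf ... */

MPI_Waitall(2, requests, statuses);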

Communication Modes

•Synchronous mode ( MPI_Ssend): the send does not complete until a matching receive has begun.

•Buffered mode (MPI_Bsend): the user supplies a buffer to the system for its use.

•Ready mode ( MPI_Rsend): user guarantees that matching receive has been posted.

Non-blocking versions: MPI_Issend, MPI_Irsend, MPI_Ibsend

int bufsize;
char *buf = malloc(bufsize);
MPI_Buffer_attach( buf, bufsize );
...
MPI_Bsend( ... same arguments as MPI_Send ... );
...
MPI_Buffer_detach( &buf, &bufsize );

Datatypes

Two main purposes:

• Heterogeneity: parallel programs between different processors

• Noncontiguous data: structures, vectors with non-unit stride

MPI datatype         C datatype
MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
MPI_BYTE
MPI_PACKED

Build derived type

void Build_derived_type(INDATA_TYPE* indata, MPI_Datatype* message_type_ptr)
{
    int          block_lengths[3];
    MPI_Aint     displacements[3];
    MPI_Aint     addresses[4];
    MPI_Datatype typelist[3];

    /* the structure holds two floats followed by an int */
    typelist[0] = MPI_FLOAT;
    typelist[1] = MPI_FLOAT;
    typelist[2] = MPI_INT;

    block_lengths[0] = block_lengths[1] = block_lengths[2] = 1;

    MPI_Address(indata,       &addresses[0]);
    MPI_Address(&(indata->a), &addresses[1]);
    MPI_Address(&(indata->b), &addresses[2]);
    MPI_Address(&(indata->n), &addresses[3]);

    /* displacement of each member relative to the start of the structure */
    displacements[0] = addresses[1] - addresses[0];
    displacements[1] = addresses[2] - addresses[0];
    displacements[2] = addresses[3] - addresses[0];

    MPI_Type_struct(3, block_lengths, displacements, typelist, message_type_ptr);
    MPI_Type_commit(message_type_ptr);
}

Other derived data types

•int MPI_Type_contiguous(int count, MPI_Datatype oldtype,MPI_Datatype *newtype)

elements are contiguous entries in an array

•int MPI_Type_vector(int count, int block_length,int stride,MPI_Datatype element_type,MPI_Datatype *new_type)

elements are equally spaced entries of an array

•int MPI_Type_indexed(int count, int array_of_blocklengths[], int array_of_displacements[], MPI_Datatype element_type, MPI_Datatype *new_type)

elements are arbitrary entries of an array
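For example, a single column of a row-major N x N matrix of floats can be described with MPI_Type_vector and sent in one call (a minimal sketch; N and the matrix A are assumed to be defined elsewhere):

MPI_Datatype column_type;

/* N blocks of 1 float each, separated by a stride of N floats */
MPI_Type_vector(N, 1, N, MPI_FLOAT, &column_type);
MPI_Type_commit(&column_type);

/* send column 2 of A to process 1 with tag 0 */
MPI_Send(&A[0][2], 1, column_type, 1, 0, MPI_COMM_WORLD);

MPI_Type_free(&column_type);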

Pack/unpack

void Get_data4(int my_rank, float* a_ptr, float* b_ptr, int* n_ptr)
{
    int  root = 0;
    char buffer[100];
    int  position;

    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%f %f %d", a_ptr, b_ptr, n_ptr);

        /* pack the three values into one contiguous buffer */
        position = 0;
        MPI_Pack(a_ptr, 1, MPI_FLOAT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(b_ptr, 1, MPI_FLOAT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(n_ptr, 1, MPI_INT,   buffer, 100, &position, MPI_COMM_WORLD);

        MPI_Bcast(buffer, 100, MPI_PACKED, root, MPI_COMM_WORLD);
    } else {
        MPI_Bcast(buffer, 100, MPI_PACKED, root, MPI_COMM_WORLD);

        /* unpack in the same order the values were packed */
        position = 0;
        MPI_Unpack(buffer, 100, &position, a_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, b_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, n_ptr, 1, MPI_INT,   MPI_COMM_WORLD);
    }
}

Profiling

static int nsend = 0;

int MPI_Send(void *start, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    nsend++;    /* count calls, then pass through to the profiling entry point */
    return PMPI_Send(start, count, datatype, dest, tag, comm);
}

Architecture of MPI

• Complex communication operations can be expressed portably in terms of lower-level ones

• All MPI functions are implemented in terms of the macros and functions that make up the ADI (Abstract Device Interface)

ADI

1. specifying a message to be sent or received

2. moving data between the API and the message-passing hardware

3. managing lists of pending messages (both sent and received)

4. providing basic information about the execution environment (e.g., how many tasks there are)

Upper layers of MPICH

Channel Interface

Routines to send and receive envelope (control) information:

•MPID_SendControl (MPID_SendControlBlock)

•MPID_RecvAnyControl

•MPID_ControlMsgAvail

Send and receive data:

•MPID_SendChannel

•MPID_RecvFromChannel

Channel Interface: three different data exchange mechanisms

Eager (default)

Data is sent to the destination immediately; it is buffered on the receiver side.

Rendezvous (MPI_Ssend)

Data is sent to the destination only when requested.

Get (shared memory)

Data is read directly by the receiver.

Lower layers of MPICH

Summary

• Point-to-point and collective operations

•Blocking

•Non-blocking

•Asynchronous

•Synchronous

•Buffered

•Ready

•Abstraction for processes

•Rank within a group

•Virtual topologies

•Data types

•User-defined

•Predefined

•pack/unpack

•Architecture of MPI

•ADI(Abstract Device Interface)

•Channel Interface