DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI
Page 1: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING

CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI

Page 2: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Distributed Memory Programming Paradigm

Message passing programming is targeted at distributed memory machines (although it can be emulated on shared memory machines).

Set of processors, each with its own memory.

Communication between processors via a network.

Each processor runs a copy of the same executable — Single Program, Multiple Data (SPMD):

Written in a conventional programming language.

With its own memory address space, also known as ‘local’, ‘private’, or ‘non-global’ memory.

Passing of data between processors done using library subroutine calls.

Page 3: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Cont…

SPMD is a special case of the Multiple Instruction, Multiple Data (MIMD) model. MIMD allows different executables to be run on each processor, but the SPMD model is simpler, more portable, and hardly ever a limitation.

Page 4: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Message Passing Paradigm

Data movement specified explicitly by programmer.

Data is propagated between processors via messages.

Communication types:

Point-to-point — the most basic form: one processor sends data and one processor receives it. This is the building block for all other communications.

Collective, involving more than two processors, e.g.:

One processor broadcasting data to other processors.

Synchronisation, or barrier.

Reduction operation, e.g. global sum.

Page 5: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Cont…

Collective communications should be faster than equivalent operations built by the user from point-to-point calls, since library implementations are optimised.

Communications libraries allow programmer to specify a virtual interconnect topology, which abstracts over the physical connectivity of the parallel computer.

The programmer must ensure correctness of the code (every send must have a matching receive, deadlock must be avoided, etc.).

Page 6: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Message Passing Programs

In some cases the parallel program requires a completely different algorithm, so the code will look very different from the sequential code.

However in many cases message passing codes use standard domain decomposition parallelisation and look very similar to sequential codes, except for:

Some code to distribute arrays across processors.

Some code to pass data across processors where necessary, e.g. exchange array values at edges of domains, do global sums, etc.

The rest of the code is mostly the same, basically executing the same sequential algorithm, except that it acts only on the local subset of the array.

In the latter case it is straightforward to use the same code on sequential and parallel machines, with preprocessing directives to select sequential or parallel version.

Alternatively, the code can be run sequentially by just using the parallel version on 1 processor, but there is an overhead.

Page 7: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Types of Message Passing Programs

Embarrassingly Parallel:
Virtually no communication required.
Easily load balanced (even dynamically).
Expect perfect speedup.

Regular and Synchronous:
Easily statically load balanced.
Expect good speedup for local communication.
Expect reasonable speedup for non-local communication.

Irregular and/or Asynchronous:
Difficult to load balance – often requires dynamic load balancing.
Communications overhead usually large.
Hard to get good speedup (but possible with some applications).
Usually cannot be done efficiently using data parallel programming – must use message passing to get best performance.

Page 8: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Embarrassingly Parallel Problems

Each element of an array can be processed independently of the others. No communication is required, except to combine the final results.

Static load balancing is usually trivial – any kind of distribution (e.g. scattered) can be used, since communication is not a factor.

Dynamic load balancing can be done using a task farm approach (a code sketch follows below):

A master (or controller) process sends parcels of work to each of a set of workers (processors).

Workers return results when they are done, and request more work when idle.

This approach gives automatic dynamic load balancing as long as:
1. the problem can be split into lots of parcels of work;
2. the parcels are not too small, to minimise communications overhead;
3. the parcels are not too large, so that processors are not left waiting for one slow processor once the others are done.

Expect perfect speedup.
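A minimal sketch of such a task farm in C with MPI (not from the original slides; the work itself, do_work(), and the use of an integer index as the 'parcel' are placeholder assumptions):

#include <stdio.h>
#include <mpi.h>

#define WORK_TAG 1
#define STOP_TAG 2

double do_work(int parcel) { return (double)parcel * parcel; }  /* placeholder work */

int main(int argc, char *argv[]) {
    int rank, size, num_parcels = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* master (controller) */
        int next = 0, active = 0;
        MPI_Status status;
        double result;
        /* seed each worker with one parcel */
        for (int w = 1; w < size && next < num_parcels; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, WORK_TAG, MPI_COMM_WORLD);
        /* collect results and hand out the remaining parcels */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            active--;
            if (next < num_parcels) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, WORK_TAG,
                         MPI_COMM_WORLD);
                next++; active++;
            }
        }
        /* tell all workers to stop */
        for (int w = 1; w < size; w++)
            MPI_Send(&next, 1, MPI_INT, w, STOP_TAG, MPI_COMM_WORLD);
    } else {                              /* worker */
        int parcel;
        MPI_Status status;
        while (1) {
            MPI_Recv(&parcel, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == STOP_TAG) break;
            double result = do_work(parcel);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, WORK_TAG, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

Because idle workers ask for work implicitly (by returning a result), load balancing is automatic even if parcels take different amounts of time.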

Page 9: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Regular and Synchronous Problems

Regular algorithm and data structures. Synchronous (or loosely synchronous) communications. Use standard domain decomposition. Usually requires:
local communication (edges of domain);
collective communication (combine final results); and
(sometimes) non-local communication.

As long as the amount of data per processor is large enough that computation time is much greater than communication time, expect:
Good speedup for local communication.
Reasonable speedup for non-local communication.

For problems with local communication:
Computation goes like the volume of the grid (or array) on each processor.
Communication goes like the edge (perimeter or surface) of the grid on each processor.
Comms overhead ∼ Comm / Comp ∼ 1 / (grid length).

So expect reasonable speedup as long as the size of the array is much larger than the number of processors.
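As a concrete illustration (not in the original slides), suppose each processor holds an L × L subgrid of a 2D domain and exchanges halo edges with its neighbours:

\[
T_{\mathrm{comp}} \propto L^{2}, \qquad
T_{\mathrm{comm}} \propto 4L, \qquad
\frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \propto \frac{4}{L}
\]

so the communication overhead falls like 1/L: doubling the local grid length roughly halves the relative cost of communication.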

Page 10: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Irregular and/or Asynchronous

Irregular algorithm and/or data structures.

Cannot be implemented efficiently except using message passing.

Usually a high communications overhead.

Complex asynchronous (and usually non-local) communication requires careful coding.

May require careful load balancing.

May require dynamic repartitioning of data between processors.

Difficult to get good speedup, but possible for some problems.

In some cases most of the problem is regular, and only part of it is irregular, which makes things easier, but still need to use message passing instead of (or as well as) data parallel.

HPF allows calls to extrinsic functions, which may be written in MPI, to support these mixed regular and irregular applications.

Page 11: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Message Passing Libraries

Long and noble history (since early 1980’s).

Main idea: abstraction from machine topology — routines to send/receive/broadcast messages between processors without having to manage the routing through the particular machine topology.

User sees a library for an existing language (e.g. C or Fortran), not (usually) extensions to the language.

Speed, not portability, historically the main criterion (but these are not necessarily exclusive).

N(implementations) ∼ N(vendors, institutes):

Machine specific – NX, MPL, CMMD, CS-Tools

Portable – CROS, Express, P4, Chameleon, CHIMP, PARMACS, PICL, PVM, SHMEM, TITCH, TINY,VERTEX, ZIPCODE

Many good ideas in them.

Basic functionality is the same; interfaces are similar but annoyingly different.

PVM (Parallel Virtual Machine) had been the de facto standard in recent times, but focused on distributed computing (more so after the introduction of MPI).

Page 12: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Message Passing Interface

In 1994 a standardized Message Passing Interface (MPI) was established by an international committee (the MPI Forum).

Prompted by the success of the HPF Forum in creating a standard data parallel language.

Interface to both C and Fortran, other languages could be supported (C++ bindings are provided in version 2, Java bindings are being investigated).

Different languages have slightly different syntax for the interface, e.g. error values are returned as integer return codes in C, and as extra parameters in Fortran.

Provides only a specification, which can be implemented on any parallel computer or workstation network.

Some implementations of this specification are:

Versions for many different machines – e.g. MPICH, LAM, CHIMP, MPI-F

Versions from vendors for specific machines – e.g. IBM SP2

Initial specification (MPI-1) did not handle some aspects such as parallel I/O and distributed computing support, but most of these issues are addressed in the latest version (MPI-2).

Page 13: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Reasons for Using MPI:

Standardization - MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries.

Portability - There is no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard.

Performance Opportunities - Vendor implementations should be able to exploit native hardware features to optimize performance. For more information about MPI performance see the MPI Performance Topics tutorial.

Functionality - Over 115 routines are defined in MPI-1 alone.

Availability - A variety of implementations are available, both vendor and public domain.

Page 14: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Programming Model:

MPI lends itself to virtually any distributed memory parallel programming model. In addition, MPI is commonly used to implement (behind the scenes) some shared memory models, such as Data Parallel, on distributed memory architectures.

Hardware platforms:

Distributed Memory: Originally, MPI was targeted at distributed memory systems.

Shared Memory: As shared memory systems became more popular, particularly SMP / NUMA architectures, MPI implementations for these platforms appeared.

Hybrid: MPI is now used on just about any common parallel architecture, including massively parallel machines, SMP clusters, workstation clusters and heterogeneous networks.

All parallelism is explicit: the programmer is responsible for correctly identifying parallelism and implementing parallel algorithms using MPI constructs.

The number of tasks dedicated to running a parallel program is static. New tasks cannot be dynamically spawned during run time. (MPI-2 addresses this issue.)

Page 15: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI Preliminaries

MPI provides support for:
Point-to-point message passing.
Collective communications.
Communicators to group messages and processes.
Inquiry routines to query the environment.
Constants and data types.

Typographical convention: all MPI identifiers are prefixed by ‘MPI_’. C routine names may contain lower case, as in ‘MPI_Init’. Fortran routines SHOUT AT YOU IN UPPER CASE, as in ‘MPI_INIT’. MPI constants are all in upper case, e.g. ‘MPI_FLOAT’ is an MPI C datatype.

C routines are actually integer functions which return a status code. Fortran routines really are subroutines, and return a status code as an argument. You are strongly advised to always check these status codes.

The number of processors used is specified on the command line, when running the MPI loader that loads the MPI program onto the processors, to avoid hard-coding it into the program.

Page 16: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI Routines

MPI has 129 different routines, but most programs can be written using about a dozen simple routines:

#include "mpi.h"    in C
include "mpif.h"    in Fortran

MPI_INIT        Initialize MPI computation
MPI_FINALIZE    Terminate MPI computation
MPI_COMM_SIZE   Determine number of processes
MPI_COMM_RANK   Determine process number
MPI_SEND        Send a message
MPI_RECV        Receive a message
MPI_BARRIER     Synchronize
MPI_BCAST       Broadcast same data to all procs
MPI_GATHER      Get data from all procs
MPI_SCATTER     Send different data to all procs
MPI_REDUCE      Combine data from all procs onto a single proc
MPI_ALLREDUCE   Combine data from all procs onto all procs

Reduction operations include +, *, AND, OR, min, max.

This is only a general overview of these routines — check the MPI specification, MPI programming books and courses, or the manuals for the implementation you are using, for the exact usage of each routine.
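As an illustration of some of the collective routines listed above (a sketch, not from the slides), the root scatters one integer to each process, each process doubles its value, and the root gathers the results back:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, item, *sendbuf = NULL, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* root prepares one value per process */
        sendbuf = malloc(size * sizeof(int));
        recvbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) sendbuf[i] = i;
    }

    /* send a different value to every process, then collect the doubled results */
    MPI_Scatter(sendbuf, 1, MPI_INT, &item, 1, MPI_INT, 0, MPI_COMM_WORLD);
    item *= 2;
    MPI_Gather(&item, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++) printf("recvbuf[%d] = %d\n", i, recvbuf[i]);
        free(sendbuf); free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}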

Page 17: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Point-to-Point Communications

Communication between two processors: one to send, and one to receive the data.

Information required to specify the message:
Identification of the sender processor.
Identification of the destination processor.
Type of data — e.g. integers, floats, characters.
Number of data elements to send — a good idea if sending arrays.
Where the data to send lives — usually a pointer.
Where the received data should be stored — again a pointer.

Two different communication operations:
Blocking — sender and receiver wait until the communication is complete.
Non-blocking — send and receive are handled ‘offline’; the processor continues on.

In addition, four send modes:
Synchronous — waits for the recipient to complete.
Buffered — copies the message data into a buffer.
Standard — MPI decides whether the message should be buffered.
Ready — a send is only started when a matching receive is initiated.
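A sketch showing all four send modes in C (not from the slides; run with at least two processes). Note that MPI_Bsend requires the user to attach buffer space, and a ready-mode send is only valid if the matching receive is already posted, which the barrier below guarantees:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, val, i, tag = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        MPI_Request req;
        MPI_Status status;
        /* post the receive for the ready-mode message in advance */
        MPI_Irecv(&val, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);        /* receive is now guaranteed to be posted */
        MPI_Wait(&req, &status);
        printf("ready-mode message: %d\n", val);
        for (i = 0; i < 3; i++) {           /* standard, synchronous, buffered */
            MPI_Recv(&val, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
            printf("received %d\n", val);
        }
    } else if (rank == 0) {
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        char *buf = malloc(bufsize);
        MPI_Barrier(MPI_COMM_WORLD);        /* wait until rank 1 has posted its receive */
        val = 1; MPI_Rsend(&val, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);  /* ready */
        val = 2; MPI_Send (&val, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);  /* standard */
        val = 3; MPI_Ssend(&val, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);  /* synchronous */
        MPI_Buffer_attach(buf, bufsize);                               /* buffered mode needs a user buffer */
        val = 4; MPI_Bsend(&val, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    } else {
        MPI_Barrier(MPI_COMM_WORLD);        /* any other ranks just join the barrier */
    }

    MPI_Finalize();
    return 0;
}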

Page 18: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Collective Communications

Key concept: Groups are sets of processors that communicate with each other in a certain way.

Communications can take place amongst Group members without cross-talk.

Rather like definable datatypes, Groups allow a better mapping between the language and the problem.

They also allow the construction of libraries that are guaranteed not to interfere with communications within the user program, since all their communications are confined to a particular group.

MPI implements Groups using data objects called Communicators.

MPI_COMM_SIZE gives the number of processes (or processors, if there is one process per processor) in a communicator.

MPI defines a special Communicator, MPI_COMM_WORLD, for the Group of all processes (or processors). Subgroups are sub-divided from the world Group.

Each Group member is identified by a number, its Rank, enumerated 0, ..., (Number in Group − 1).
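As an illustration (not from the slides), MPI_Comm_split can carve MPI_COMM_WORLD into subgroups; here processes are split by the parity of their world rank, and each process gets a new rank within its own sub-communicator:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int world_rank, sub_rank, sub_size;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* colour = 0 for even world ranks, 1 for odd; key = world_rank keeps ordering */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);
    MPI_Comm_size(sub_comm, &sub_size);

    printf("World rank %d is rank %d of %d in its subgroup\n",
           world_rank, sub_rank, sub_size);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}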

Page 19: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Header files (required for all programs/routines which make MPI library calls):

C include file:        #include "mpi.h"
Fortran include file:  include 'mpif.h'

Page 20: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Format of MPI Calls

C Binding

Format rc = MPI_Xxxxx(parameter, ... )

Example rc = MPI_Bsend(&buf,count,type,dest,tag,comm)

Error Code: returned as "rc"; equal to MPI_SUCCESS if successful.

Page 21: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

General MPI Program Structure:
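The original slide presents this structure as a diagram (serial code, MPI initialisation, parallel work and message passing, MPI termination, serial code). A minimal C skeleton of the same flow, as an illustration, is:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;

    /* serial code: no MPI calls before MPI_Init */

    MPI_Init(&argc, &argv);                /* start of the parallel (MPI) region */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* parallel work and message passing goes here */
    printf("Process %d of %d is working\n", rank, size);

    MPI_Finalize();                        /* end of the parallel (MPI) region */

    /* serial code: no MPI calls after MPI_Finalize */
    return 0;
}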

Page 22: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Communicators and Groups

MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require you to specify a communicator as an argument.

Use MPI_COMM_WORLD whenever a communicator is required - it is the predefined communicator that includes all of your MPI processes.

Page 23: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Rank

Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A rank is sometimes also called a "task ID". Ranks are contiguous and begin at zero.

Used by the programmer to specify the source and destination of messages. Often used conditionally by the application to control program execution (if rank=0 do this / if rank=1 do that).
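A small illustration of such rank-based branching (a sketch, not from the slides; run with at least two processes): rank 0 sends a value and rank 1 receives it.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                /* rank 0 does this ... */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {         /* ... rank 1 does that */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}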

Page 24: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI_Init

Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI function, and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent.

MPI_Init (&argc,&argv)

MPI_INIT (ierr)

Page 25: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI_Comm_size

Determines the number of processes in the group associated with a communicator. Generally used within the communicator MPI_COMM_WORLD to determine the number of processes being used by your application.

MPI_Comm_size (comm,&size)

MPI_COMM_SIZE (comm,size,ierr)

Page 26: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI_Comm_rank

Determines the rank of the calling process within the communicator. Initially, each process will be assigned a unique integer rank between 0 and number of processors - 1 within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.

MPI_Comm_rank (comm,&rank) MPI_COMM_RANK (comm,rank,ierr)


Page 28: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI_Abort

Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified.

MPI_Abort (comm,errorcode)

MPI_ABORT (comm,errorcode,ierr)

Page 29: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI_Get_processor_name

Returns the processor name. Also returns the length of the name. The buffer for "name" must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into "name" is implementation dependent - it may not be the same as the output of the "hostname" or "host" shell commands.

MPI_Get_processor_name (&name,&resultlength)

MPI_GET_PROCESSOR_NAME (name,resultlength,ierr)
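A small usage sketch in C (an illustration; the name buffer is sized with MPI_MAX_PROCESSOR_NAME as required):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    char name[MPI_MAX_PROCESSOR_NAME];   /* buffer must be at least this big */
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Get_processor_name(name, &len);
    printf("Process %d is running on %s (name length %d)\n", rank, name, len);

    MPI_Finalize();
    return 0;
}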

Page 30: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI_Initialized

Indicates whether MPI_Init has been called - returns flag as either logical true (1) or false(0). MPI requires that MPI_Init be called once and only once by each process. This may pose a problem for modules that want to use MPI and are prepared to call MPI_Init if necessary. MPI_Initialized solves this problem.

MPI_Initialized (&flag)

MPI_INITIALIZED (flag,ierr)
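A sketch of the guard pattern described above, for a hypothetical library routine my_library_setup() that must work whether or not the caller has already initialized MPI:

#include <mpi.h>

/* hypothetical library entry point */
void my_library_setup(void) {
    int flag;

    MPI_Initialized(&flag);
    if (!flag) {
        /* caller has not started MPI yet, so do it here
         * (passing NULL for argc/argv is permitted since MPI-2) */
        MPI_Init(NULL, NULL);
    }

    /* ... the library can now safely make MPI calls ... */
}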

Page 31: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI_Wtime

Returns an elapsed wall clock time in seconds (double precision) on the calling processor.

MPI_Wtime ()

MPI_WTIME ()

Page 32: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI_Wtick

Returns the resolution in seconds (double precision) of MPI_Wtime.

MPI_Wtick ()

MPI_WTICK ()
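A typical timing sketch using both routines (illustrative only; the timed work is a placeholder comment):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    double t0, t1;

    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();
    /* ... work to be timed goes here ... */
    t1 = MPI_Wtime();

    printf("Elapsed time: %f s (clock resolution %g s)\n", t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}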

Page 33: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

MPI_Finalize

Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program - no other MPI routines may be called after it.

MPI_Finalize ()

MPI_FINALIZE (ierr)

Page 34: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Hello World in MPI

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ierror, rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        printf("Hello World!\n");
    }

    ierror = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (ierror != MPI_SUCCESS) {
        MPI_Abort(MPI_COMM_WORLD, ierror);
    }

    printf("I am %d out of %d.\n", rank, size);

    MPI_Finalize();
    return 0;
}

Must call the initialize and finalize routines. argc and argv are required only for the C language. Note the difference between a single output from the program (Hello World!) and output from each of the processors.

Page 35: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Sending a Message: Blocking, standard mode send

int ierror, message[message_len], dest_rank, tag = 1;
...
ierror = MPI_Send(message, message_len, MPI_INT, dest_rank, tag, MPI_COMM_WORLD);

The arguments specify:
The message to send.
The message length.
The type of each data element.
The rank of the destination processor.
A tag, for filtering out unwanted messages.
The communicator for the sender & receiver group.

Every SEND must have a corresponding RECEIVE (and vice versa), or else we will get deadlock – processors will be waiting forever to send (or receive) messages that never get received (or sent).

Deadlock can also occur if message buffers overflow – the MPI implementation should handle this.

Page 36: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Receiving a Message: Blocking Receive

int ierror, store[store_len], source_rank, tag = 1;
MPI_Status status;
...
ierror = MPI_Recv(store, store_len, MPI_INT, source_rank, tag, MPI_COMM_WORLD, &status);

The arguments specify:
Where to store the message, and its size.
The type of each data element.
The rank of the source processor.
The same tag as used by the sender processor.
The communicator for the sender & receiver group.
The data structure of the received message (which provides the message length, so it can be compared to what is expected).

If communication is regular, e.g. each processor sends a message to its right while receiving one from its left (a shift), the separate SEND and RECV calls can be combined into a single SENDRECV (send and receive) call, as sketched below.
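A sketch of such a shift using MPI_Sendrecv (not from the slides; it assumes a periodic ring, so each rank sends to rank+1 and receives from rank-1, modulo the number of processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, right, left, send_val, recv_val, tag = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;          /* neighbour to send to */
    left  = (rank - 1 + size) % size;   /* neighbour to receive from */
    send_val = rank;

    /* the combined send-and-receive avoids the deadlock risk of two blocking calls */
    MPI_Sendrecv(&send_val, 1, MPI_INT, right, tag,
                 &recv_val, 1, MPI_INT, left,  tag,
                 MPI_COMM_WORLD, &status);

    printf("Rank %d received %d from rank %d\n", rank, recv_val, left);

    MPI_Finalize();
    return 0;
}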

Page 37: DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI.

Global Summation: Global sum, using a Reduction Function

float part_sum = 1.0, global_sum;
int ierror, root_rank = 0;
...
ierror = MPI_Reduce(&part_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, root_rank, MPI_COMM_WORLD);

The arguments specify:
The partial and global sum variables.
The number of elements in the partial sum (= 1).
The datatype of the element to sum.
The reduction operation.
The rank (processor) where the result will be deposited.
The communicator for the participating processors.

MPI_REDUCE places the result on the specified processor; MPI_ALLREDUCE places the result on all processors.

Other reduction operations include MPI_MAX, MPI_MIN, MPI_LAND (logical AND), MPI_LXOR (exclusive OR), etc.

Also useful are MPI_MINLOC and MPI_MAXLOC, which return the location of the minimum or maximum values in the data structure.
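A small self-contained example (not from the slides) using MPI_Allreduce, so that every process receives the global sum; each process's contribution here is just a placeholder value derived from its rank:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    float part_sum, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    part_sum = (float)(rank + 1);    /* each process contributes its own partial sum */

    /* combine the partial sums from all processes and give the result to all of them */
    MPI_Allreduce(&part_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d: global sum = %f\n", rank, global_sum);

    MPI_Finalize();
    return 0;
}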

