July 30 - August 01, 2002
MPI Basics
1 of 43
Linux Clusters and Tiled Display Walls
MPI
An Introduction
Kadin Tseng
Scientific Computing and Visualization Group
Boston University
SCV Computing Facilities
• SGI Origin2000 – 192 processors
  – Lego (32), Slinky (32), Jacks (64), Playdoh (64)
  – R10000, 195 MHz
• IBM SP – four 16-processor nodes
  – Hal – login node
  – Power3, 375 MHz
• IBM Regatta – three 32-processor nodes
  – Twister – login node (currently in friendly-user phase)
  – Power4, 1.3 GHz
• Linux cluster – 38 2-processor nodes
  – Skate and Cootie – two login nodes
  – 1.3 GHz Intel Pentium 3 (Myrinet interconnect)
Useful SCV Info
• SCV home page http://scv.bu.edu/
• Batch queues http://scv.bu.edu/SCV/scf-techsumm.html
• Resource applications
  – MARINER http://acct.bu.edu/FORMS/Mariner-Pappl.html
  – Alliance http://www.ncsa.uiuc.edu/alliance/applying/
• Help
  – Online FAQs
  – Web-based tutorials (MPI, OpenMP, F90, MATLAB, Graphics tools)
  – HPC consultations by appointment
    • Doug Sondak ([email protected])
    • Kadin Tseng ([email protected])
  – [email protected], [email protected]
  – Origin2000 Repository http://scv.bu.edu/SCV/Origin2000/
  – IBM SP and Regatta Repository http://scv.bu.edu/SCV/IBMSP/
  – Alliance web-based tutorials http://foxtrot.ncsa.uiuc.edu:8900/
Multiprocessor Architectures
Linux cluster
– Shared-memory: between the 2 processors of each node.
– Distributed-memory: across the 38 nodes; both processors in each node can be used.
Shared Memory Architecture

[Figure: processors P0, P1, P2, …, Pn all connected to a single shared Memory]

P0, P1, …, Pn are processors.
Distributed Memory Architecture

[Figure: nodes N0, N1, N2, …, Nn, each with its own local memory M0, M1, M2, …, Mn, connected through an Interconnect]

M0, M1, …, Mn are the memories associated with nodes N0, N1, …, Nn. The interconnect is Myrinet (or Ethernet).
Parallel Computing Paradigms
• Message Passing (MPI, PVM, ...)
  – distributed and shared memory
• Directives (OpenMP, ...)
  – shared memory
• Multi-level parallel programming (MPI + OpenMP)
  – distributed/shared memory
MPI Topics to Cover
• Fundamentals
• Basic MPI functions
• Nonblocking send/receive
• Collective communications
• Virtual topologies
• Virtual topology example
  – Solution of the Laplace Equation
What is MPI ?
• MPI stands for Message Passing Interface.
• It is a library of subroutines/functions, not a computer language.
• These subroutines/functions are callable from Fortran or C programs.
• The programmer writes Fortran/C code, inserts the appropriate MPI subroutine/function calls, compiles, and finally links with the MPI message-passing library.
• In general, MPI codes run on shared-memory multiprocessors, distributed-memory multicomputers, clusters of workstations, or heterogeneous clusters of the above.
• The current MPI standard is MPI-1; MPI-2 will be available in the near future.
Why MPI ?
• To enable more analyses in a prescribed amount of time.
• To reduce the time required for one analysis.
• To increase the fidelity of physical modeling.
• To have access to more memory.
• To provide efficient communication between nodes of networks of workstations, …
• To enhance code portability; MPI works on both shared- and distributed-memory machines.
• For “embarrassingly parallel” problems, such as some Monte Carlo applications, parallelizing with MPI can be trivial, with near-linear (or even superlinear) scaling.
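To see why such problems parallelize so easily, here is a small serial C sketch (not part of the course materials) of a Monte Carlo estimate of pi. Every sample is independent, so p processes could each take samples/p points, with no communication needed until the final hit counts are summed.

```c
#include <stdlib.h>

/* Estimate pi by sampling points in the unit square and counting the
   fraction that lands inside the quarter circle. Each sample is
   independent of the rest, which is what makes the method
   "embarrassingly parallel". */
double monte_carlo_pi(long samples, unsigned int seed)
{
    long hits = 0;
    srand(seed);
    for (long k = 0; k < samples; k++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return 4.0 * (double)hits / (double)samples;
}
```

With 1,000,000 samples the estimate is typically within a few thousandths of pi; the statistical error shrinks as 1/sqrt(samples), regardless of how many processes share the work.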
MPI Preliminaries
• MPI’s pre-defined constants, function prototypes, etc., are collected in a header file. This file must be included in your code wherever MPI function calls appear (in “main” and in user subroutines/functions):
  – #include "mpi.h" for C codes
  – #include "mpi++.h" for C++ codes
  – include "mpif.h" for Fortran 77 and Fortran 9x codes
• MPI_Init must be the first MPI function called.
• Terminate MPI by calling MPI_Finalize.
• These two functions must be called exactly once in user code.
MPI Preliminaries (continued)
• C is a case-sensitive language. MPI function names always begin with “MPI_”, followed by a specific name whose leading character is capitalized, e.g., MPI_Comm_rank. MPI pre-defined constants are expressed in upper-case characters, e.g., MPI_COMM_WORLD.
• Fortran is not case-sensitive; no specific case rules apply.
• MPI Fortran routines return the error status as the last argument of the subroutine call, e.g.,
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
• The error status is returned as the int function value of C MPI functions, e.g.,
  int ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
What is A Message
• A collection of data (array) of MPI data types
  – basic data types such as int/INTEGER, float/REAL
  – derived data types
• Message “envelope” – source, destination, tag, communicator
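As a conceptual sketch only (MPI does not expose such a struct), the envelope can be pictured as the routing half of a message: a receive accepts a message only when the envelope fields agree with what the receiver asked for.

```c
#include <string.h>

/* Illustrative only -- not an MPI data structure. A message pairs its
   data (buffer, count, datatype) with an envelope that MPI uses to
   route and match it. */
struct envelope {
    int source;        /* rank of the sending process    */
    int dest;          /* rank of the receiving process  */
    int tag;           /* user-chosen label for matching */
    const char *comm;  /* communicator name, e.g. "MPI_COMM_WORLD" */
};

/* A receive matches a message only if source, tag, and communicator
   all agree (MPI also offers wildcards such as MPI_ANY_SOURCE). */
int envelope_matches(const struct envelope *msg,
                     int source, int tag, const char *comm)
{
    return msg->source == source &&
           msg->tag == tag &&
           strcmp(msg->comm, comm) == 0;
}
```

The examples later in this tutorial exercise exactly these fields: the tag 123 labels the job, and the status argument of MPI_Recv reports the actual source and tag of the message that arrived.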
Modes of Communication
• Point-to-point communication
  – Blocking – returns from the call when the task completes
    • several send modes; one receive mode
  – Nonblocking – returns from the call without waiting for the task to complete
    • several send modes; one receive mode
• Collective communication
Essentials of Communication
• The sender must specify a valid destination.
• The sender’s and receiver’s data type, tag, and communicator must match.
• The receiver can receive from a non-specific (but valid) source.
• The receiver returns an extra (status) parameter reporting info about the message received.
• The sender specifies the size of sendbuf; the receiver specifies an upper bound for recvbuf.
MPI Data Types vs C Data Types
• MPI types – C types
  – MPI_INT – signed int
  – MPI_UNSIGNED – unsigned int
  – MPI_FLOAT – float
  – MPI_DOUBLE – double
  – MPI_CHAR – char
  – . . .
MPI vs Fortran Data Types
• MPI_INTEGER – INTEGER
• MPI_REAL – REAL
• MPI_DOUBLE_PRECISION – DOUBLE PRECISION
• MPI_CHARACTER – CHARACTER(1)
• MPI_COMPLEX – COMPLEX
• MPI_LOGICAL – LOGICAL
• …
MPI Data Types
• MPI_PACKED
• MPI_BYTE
Some MPI Implementations
There are a number of implementations:
• MPICH (ANL) – version 1.2.1 is the latest; a list of supported MPI-2 functionalities is available.
• LAM (UND/OSC)
• CHIMP (EPCC)
• Vendor implementations (SGI, IBM, …)
• Don’t worry – as long as the vendor supports the MPI Standard (IBM’s MPL does not).
• Job execution procedures may differ between implementations.
Example 1 (Integration)
We will introduce some fundamental MPI function calls through the computation of a simple integral by the Mid-point rule.
Here p is the number of partitions and n is the number of increments per partition:
∫ from a to b of cos(x) dx
    = Σ_{i=0}^{p-1} Σ_{j=0}^{n-1} ∫ from a_ij to a_ij + h of cos(x) dx
    ≈ Σ_{i=0}^{p-1} Σ_{j=0}^{n-1} h cos(a_ij + h/2)

where h = (b - a)/(p*n) and a_ij = a + (i*n + j)*h.
Integrate cos(x) by Mid-point Rule
[Figure: plot of f(x) = cos(x) for 0 ≤ x ≤ π/2; the shaded area under the curve represents the integral of cos(x) from 0 to π/2]
Example 1 - Serial fortran code
      Program Example1
      implicit none
      integer n, p, i, j
      real h, result, a, b, integral, pi

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      p = 4             ! number of partitions (processes)
      n = 500           ! # of increments in each partition
      h = (b-a)/p/n     ! length of increment
      result = 0.0      ! initialize solution to the integral
      do i=0,p-1        ! integral sum over all partitions
        result = result + integral(a,i,h,n)
      enddo
      print *,'The result =',result
      stop
      end
. . Serial fortran code (cont’d)
      real function integral(a, i, h, n)
      implicit none
      integer n, i, j
      real h, h2, aij, a, fct, x

      fct(x) = cos(x)   ! integral kernel (statement function)
      integral = 0.0    ! initialize integral
      h2 = h/2.
      do j=0,n-1        ! sum over all "j" increments
        aij = a + (i*n + j)*h   ! lower limit of integration
        integral = integral + fct(aij+h2)*h
      enddo
      return
      end
example1.f continues . . .
Example 1 - Serial C code
#include <math.h>
#include <stdio.h>

float integral(float a, int i, float h, int n);

int main() {
  int n, p, i;
  float h, result, a, b, pi;

  pi = acos(-1.0);  /* = 3.14159... */
  a = 0.;           /* lower limit of integration */
  b = pi/2.;        /* upper limit of integration */
  p = 4;            /* # of partitions */
  n = 500;          /* increments in each partition */
  h = (b-a)/n/p;    /* length of increment */
  result = 0.0;
  for (i=0; i<p; i++) {   /* integral sum over partitions */
    result += integral(a,i,h,n);
  }
  printf("The result =%f\n",result);
  return 0;
}
. . Serial C code (cont’d)
float integral(float a, int i, float h, int n)
{
  int j;
  float h2, aij, integ;

  integ = 0.0;   /* initialize integral */
  h2 = h/2.;
  for (j=0; j<n; j++) {      /* integral in each partition */
    aij = a + (i*n + j)*h;   /* lower limit of integration */
    integ += cos(aij+h2)*h;
  }
  return integ;
}
example1.c continues . . .
Example 1 - Parallel fortran code
      PROGRAM Example1_1
      implicit none
      integer n, p, i, j, ierr, master, Iam
      real h, result, a, b, integral, pi
      include "mpif.h"   ! pre-defined MPI constants, ...
      integer source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result
      data master/0/     ! 0 is the master processor responsible
                         ! for collecting integral sums ...
MPI functions used for this example:
• MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
• MPI_Send, MPI_Recv
July 30 - August 01, 2002
MPI Basics
27 of 43
Linux Clusters and Tiled Display Walls. . . Parallel fortran code (cont’d)
! executable statements before MPI_Init are not
! advisable; side effects are implementation-dependent
      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      n = 500           ! increments in each process
      dest = master     ! proc to compute final result
      tag = 123         ! set tag for job

! Starts MPI processes ...
      call MPI_Init(ierr)
! Get current process id
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
! Get number of processes
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
      h = (b - a)/p/n   ! length of increment (p is now known)
... Parallel fortran code (cont’d)
      my_result = integral(a,Iam,h,n)   ! compute local sum
      write(*,"('Process ',i2,' has the partial result of',f10.6)")
     &      Iam, my_result

      if(Iam .eq. master) then
        result = my_result   ! init. final result
        do source=1,p-1      ! loop to collect local sums (not parallel!)
          call MPI_Recv(my_result, 1, MPI_REAL, source, tag,
     &                  MPI_COMM_WORLD, status, ierr)   ! not safe
          result = result + my_result
        enddo
        print *,'The result =',result
      else
        call MPI_Send(my_result, 1, MPI_REAL, dest, tag,
     &                MPI_COMM_WORLD, ierr)   ! send my_result to dest
      endif
      call MPI_Finalize(ierr)   ! let MPI finish up ...
      end
Example 1 - Parallel C code
#include <mpi.h>
#include <math.h>
#include <stdio.h>

float integral(float a, int i, float h, int n);   /* prototype */

int main(int argc, char *argv[]) {
  int n, p, i;
  float h, result, a, b, pi, my_result;
  int myid, source, dest, tag;
  MPI_Status status;

  MPI_Init(&argc, &argv);                 /* start MPI processes  */
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);   /* current process id   */
  MPI_Comm_size(MPI_COMM_WORLD, &p);      /* # of processes       */
… Parallel C code (continued)
  pi = acos(-1.0);  /* = 3.14159... */
  a = 0.;           /* lower limit of integration */
  b = pi/2.;        /* upper limit of integration */
  n = 500;          /* number of increments within each process */
  dest = 0;         /* process that computes the final result */
  tag = 123;        /* tag to identify this particular job */
  h = (b-a)/n/p;    /* length of increment */

  i = myid;         /* MPI process number range is [0,p-1] */
  my_result = integral(a,i,h,n);
  printf("Process %d has the partial result of %f\n", myid, my_result);
… Parallel C code (continued)
  if (myid == 0) {
    result = my_result;
    for (i=1; i<p; i++) {
      source = i;   /* MPI process number range is [0,p-1] */
      MPI_Recv(&my_result, 1, MPI_FLOAT, source, tag,
               MPI_COMM_WORLD, &status);
      result += my_result;
    }
    printf("The result =%f\n", result);
  } else {
    MPI_Send(&my_result, 1, MPI_FLOAT, dest, tag,
             MPI_COMM_WORLD);   /* send my_result to "dest" */
  }
  MPI_Finalize();   /* let MPI finish up ... */
  return 0;
}
Example1_2 – Parallel Integration
      PROGRAM Example1_2
      implicit none
      integer n, p, i, j, k, ierr, master
      real h, a, b, integral, pi
      integer req(1)
      include "mpif.h"   ! brings in pre-defined MPI constants, ...
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result, result
      data master/0/
MPI functions used for this example:
• MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
• MPI_Recv, MPI_Isend, MPI_Wait
Example1_2 (continued)
c**Starts MPI processes ...
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      n = 500           ! number of increments within each process
      dest = master     ! process that computes the final result
      tag = 123         ! tag to identify this particular job
      h = (b-a)/n/p     ! length of increment

      my_result = integral(a,Iam,h,n)   ! integral of process Iam
      write(*,*)'Iam=',Iam,', my_result=',my_result
Example1_2 (continued)
      if(Iam .eq. master) then   ! the following is serial
        result = my_result
        do k=1,p-1
          call MPI_Recv(my_result, 1, MPI_REAL,
     &                  MPI_ANY_SOURCE, tag,          ! "wildcard" source
     &                  MPI_COMM_WORLD, status, ierr) ! status reports source ...
          result = result + my_result   ! Total = sum of integrals
        enddo
      else
        call MPI_Isend(my_result, 1, MPI_REAL, dest, tag,
     &                 MPI_COMM_WORLD, req, ierr)   ! send my_result to "dest"
        call MPI_Wait(req, status, ierr)   ! wait for nonblocking send ...
      endif
c**results from all procs have been collected and summed ...
      if(Iam .eq. 0) write(*,*)'Final Result =',result
      call MPI_Finalize(ierr)   ! let MPI finish up ...
      stop
      end
Example1_3 Parallel Integration
      PROGRAM Example1_3
      implicit none
      integer n, p, i, j, ierr, master
      real h, result, a, b, integral, pi
      include "mpif.h"   ! brings in pre-defined MPI constants, ...
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result
      data master/0/
MPI functions used for this example:
• MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
• MPI_Bcast, MPI_Reduce, MPI_SUM
Example1_3 (continued)
c**Starts MPI processes ...
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      dest = 0          ! process that computes the final result
      tag = 123         ! tag to identify this particular job
      if(Iam .eq. master) then
        print *,'The requested number of processors =',p
        print *,'enter number of increments within each process'
        read(*,*)n
      endif
c**Broadcast "n" to all processes
      call MPI_Bcast(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
Example1_3 (continued)
      h = (b-a)/n/p   ! length of increment
      my_result = integral(a,Iam,h,n)
      write(*,"('Process ',i2,' has the partial result of',f10.6)")
     &      Iam, my_result

c**Compute the integral sum on process "dest"
      call MPI_Reduce(my_result, result, 1, MPI_REAL, MPI_SUM, dest,
     &                MPI_COMM_WORLD, ierr)

      if(Iam .eq. master) then
        print *,'The result =',result
      endif

      call MPI_Finalize(ierr)   ! let MPI finish up ...
      stop
      end
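MPI_Reduce with MPI_SUM combines one contribution per process into a single value on the destination rank. A plain-C stand-in (not MPI itself, just the arithmetic it performs) makes the semantics concrete; the four partial results used below are the ones printed in the sample run at the end of these slides, and they sum to the full integral.

```c
/* Conceptual stand-in for MPI_Reduce(..., MPI_SUM, dest, ...): given
   the p partial results (one per rank), the destination rank ends up
   with their sum. A real MPI library performs this combination
   internally, often as a tree of partial sums rather than one loop. */
float reduce_sum(const float *partial, int p)
{
    float total = 0.0f;
    for (int i = 0; i < p; i++)
        total += partial[i];
    return total;
}
```

With the partial results from the sample output (0.382683, 0.324423, 0.216773, 0.076120), reduce_sum returns approximately 1.000000, matching the MPI_Reduce result in Example1_3.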
Compilations With MPICH
Creating a Makefile is a practical way to compile your programs:

CC  = mpicc
CCC = mpiCC
F77 = mpif77
F90 = mpif90
OPTFLAGS = -O3

f-example: f-example.o
	$(F77) $(OPTFLAGS) -o f-example f-example.o

c-example: c-example.o
	$(CC) $(OPTFLAGS) -o c-example c-example.o
Compilations With MPICH (cont’d)
Makefile continues . . .
.c.o:
	$(CC) $(OPTFLAGS) -c $*.c

.f.o:
	$(F77) $(OPTFLAGS) -c $*.f

.f90.o:
	$(F90) $(OPTFLAGS) -c $*.f90

.SUFFIXES: .f90
Run Jobs Interactively
On SCV’s Linux cluster, use mpirun.ch_gm to run MPI jobs:

skate% mpirun.ch_gm -np 4 example

(alias mpirun '/usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.7-10enterprise/smp/pgi/ssh/bin/mpirun.ch_gm')
Run Jobs in Batch
We use PBS, the Portable Batch System, to manage batch jobs:

% qsub my_pbs_script

Here qsub is the batch submission command, while my_pbs_script is a user-provided script containing the job-specific PBS directives.
My_pbs_script
#!/bin/bash
# Set the default queue
#PBS -q dque
# Request 4 nodes with 1 processor per node
#PBS -l nodes=4:ppn=1,walltime=00:10:00
MYPROG="psor"
GMCONF=~/.gmpi/conf.$PBS_JOBID
/usr/local/xcat/bin/pbsnodefile2gmconf $PBS_NODEFILE >$GMCONF
cd $PBS_O_WORKDIR
NP=$(head -1 $GMCONF)
mpirun.ch_gm --gm-f $GMCONF --gm-recv polling --gm-use-shmem --gm-kill 5 --gm-np $NP PBS_JOBID=$PBS_JOBID $MYPROG
Output of Example1_1
skate% make -f make.linux example1_1
skate% mpirun.ch_gm -np 4 example1_1
Process 1 has the partial result of 0.324423
Process 2 has the partial result of 0.216773
Process 0 has the partial result of 0.382683
Process 3 has the partial result of 0.076120
 The result = 1.000000

Note that the output lines are not in rank order!