An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos
Transcript
Page 1: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

An Introduction to Parallel Programming and MPICH

Nikolaos Hatzopoulos

Page 2: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

What is Serial Computing?

• Traditionally, software has been written for serial computation:
– To be run on a single computer having a single Central Processing Unit (CPU);
– A problem is broken into a discrete series of instructions.
– Instructions are executed one after another.
– Only one instruction may execute at any moment in time.
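To make the idea concrete, here is a minimal serial sketch in C, in which every addition executes one after another on a single CPU. The twelve example numbers are just an illustration, anticipating the sum example used later in these slides.

#include <stdio.h>

int main(void)
{
    int data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int sum = 0;

    /* one addition per step, executed strictly in sequence */
    for (int i = 0; i < 12; i++)
        sum += data[i];

    printf("Serial sum: %d\n", sum);   /* prints 78 */
    return 0;
}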

Page 3: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Serial Computing:

Page 4: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

What is Parallel Computing?

• In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:

– To be run using multiple CPUs
– A problem is broken into discrete parts that can be solved concurrently
– Each part is further broken down to a series of instructions
– Instructions from each part execute simultaneously on different CPUs

Page 5: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Parallel Computing

Page 6: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Computer Architecture (von Neumann)

• Comprised of four main components:
– Memory
– Control Unit
– Arithmetic Logic Unit
– Input/Output

• Read/write, random-access memory is used to store both program instructions and data:
– Program instructions are coded data which tell the computer to do something
– Data is simply information to be used by the program

• The Control Unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.

• The Arithmetic Logic Unit performs basic arithmetic operations.

• Input/Output is the interface to the human operator.

Page 7: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

UMA, or Uniform Memory Access

In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect) as seen in the following diagram:

Page 8: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

UMA, or Uniform Memory Access

UMA gets its name from the fact that each processor must use the same shared bus to access memory, resulting in a memory access time that is uniform across all processors. Note that access time is also independent of data location within memory. That is, access time remains the same regardless of which shared memory module contains the data to be retrieved.

Page 9: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

NUMA (Non-Uniform Memory Access)

• In the NUMA shared memory architecture, each processor has its own local memory module that it can access directly and with a distinctive performance advantage. At the same time, it can also access any memory module belonging to another processor using a shared bus (or some other type of interconnect) as seen in the diagram below:

Page 10: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

NUMA (Non-Uniform Memory Access)

What gives NUMA its name is that memory access time varies with the location of the data to be accessed. If data resides in local memory, access is fast. If data resides in remote memory, access is slower. The advantage of the NUMA architecture as a hierarchical shared memory scheme is its potential to improve average case access time through the introduction of fast, local memory.

Page 11: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Modern multiprocessor systems

• In this complex hierarchical scheme, processors are grouped by their physical location on one or another multi-core CPU package, or "node". Processors within a node share access to memory modules, as per the UMA shared memory architecture. At the same time, they may also access memory from the remote node using a shared interconnect, but with slower performance, as per the NUMA shared memory architecture.

Page 12: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Distributed computing

• A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.

Page 13: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Parallel algorithm for Distributed Memory Computing

• We assume that we have these numbers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] and we want to add them with a parallel algorithm on 4 CPUs.

• Solution:

CPU0: 1, 2, 3     -> local sum 6
CPU1: 4, 5, 6     -> local sum 15
CPU2: 7, 8, 9     -> local sum 24
CPU3: 10, 11, 12  -> local sum 33

CPU0 collects the partial sums: 6 + 15 + 24 + 33 = 78

Page 14: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

What are the benefits of a parallel program?

• We assume that Tn is the time to pass a message through the network and To is the time to execute one operation.

• In our example we would need: Tn + 3To + Tn + 4To = 2Tn + 7To

• For a serial program: 12To

• We assume that To = 1 and Tn = 10:
Parallel = 2x10 + 7x1 = 27
Serial = 12x1 = 12

• Conclusion: serial is faster than parallel for such a small problem.

Page 15: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

What are the benefits of a parallel program?

• We assume that we have 12,000 numbers to add.

• Parallel: Tn + 3,000To + Tn + 4To = 10 + 3,000x1 + 10 + 4x1 = 3,024

• Serial: 12,000To = 12,000x1 = 12,000

• Conclusion: parallel is about 4 times faster than serial.

• Parallel computing becomes beneficial for large-scale computational problems.
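The two estimates above follow the same simple cost model: one message to scatter the data (Tn), the local additions done in parallel ((N/p)To), one message to gather the partial results (Tn), and the final combination on CPU0 (pTo). Here is a minimal sketch that reproduces both estimates; the helper name parallel_time is an assumption, not from the slides.

#include <stdio.h>

/* Estimated time for the simple model used above:
 * scatter (Tn) + local additions ((N/p)*To) + gather (Tn) + combine (p*To) */
static double parallel_time(double N, double p, double Tn, double To)
{
    return 2.0 * Tn + (N / p) * To + p * To;
}

int main(void)
{
    double To = 1.0, Tn = 10.0;

    printf("N = 12:     parallel = %.0f, serial = %.0f\n",
           parallel_time(12, 4, Tn, To), 12 * To);        /* 27 vs 12 */
    printf("N = 12,000: parallel = %.0f, serial = %.0f\n",
           parallel_time(12000, 4, Tn, To), 12000 * To);  /* 3024 vs 12000 */
    return 0;
}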

Page 16: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

MPICH

• MPICH is a freely available, portable implementation of MPI, a standard for message-passing for distributed-memory applications used in parallel computing.

• Message Passing Interface (MPI) is a specification for an API that allows many computers to communicate with one another.

• MPICH is a library for C/C++ or Fortran

Page 17: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Installing MPICH on Linux

• Web page: http://www.mcs.anl.gov/research/projects/mpi/mpich1/

• Download for Linux: ftp://ftp.mcs.anl.gov/pub/mpi/mpich.tar.gz

• Untar: tar xvfz mpich.tar.gz

• Configure:
as root: ./configure --prefix=/usr/local --rsh=ssh
as user: ./configure --prefix=/home/username --rsh=ssh

• Compile: make

• Install: make install

Page 18: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Testing MPICH

• $ which mpicc
It should give the path of mpicc where we installed it, like ~/bin/mpicc, and the same for mpirun.

• To run a test, from the MPICH installation dir:
$ cd examples/basic
$ make
$ mpirun -np 2 cpi

result:
Process 0 of 2 on localhost.localdomain
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000000
Process 1 of 2 on localhost.localdomain

Page 19: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Possible Errors

• The path of mpicc is not found:
$ cd ~
$ gedit .bashrc
Add the following line at the bottom:
export PATH=$PATH:/path_of_mpich/bin
Save and re-login.

• When we run: mpirun -np 2 cpi
p0_29223: p4_error: Could not gethostbyname for host buster.localdomain; may be invalid name
This means it cannot resolve buster.localdomain, which is our hostname.
As root: gedit /etc/hosts, locate 127.0.0.1 and add the hostname at the end of that line. Example:

127.0.0.1 localhost.localdomain localhost buster.localdomain

Page 20: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

ssh login without password

• To avoid typing our password as many times as the np value, we can set up a login without a password.

• $ ssh-keygen
Finishing this process creates two files:
$ cd ~/.ssh
$ ls
id_rsa  id_rsa.pub  known_hosts
$ cp id_rsa.pub authorized_keys2
So when we do $ ssh localhost it will log in without a password.

Page 21: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

hello.c mpich program

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_rank;
    int size;
    int namelen;
    char proc_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(proc_name, &namelen);

    if (my_rank == 2)
        printf("Hello - I am process 2\n");
    else
        printf("Hello from process %d of %d on %s\n", my_rank, size, proc_name);

    MPI_Finalize();
    return 0;
}

Page 22: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Run hello.c

• $ mpicc hello.c
• $ mpirun -np 4 a.out

result:

Hello from process 0 of 4 on localhost.localdomain
Hello from process 1 of 4 on localhost.localdomain
Hello from process 3 of 4 on localhost.localdomain
Hello - I am process 2

• NOTE: the results may be displayed in a different order; that depends on how the operating system manages the processes.

Page 23: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

From the Documentation of MPICH

• http://www.mcs.anl.gov/research/projects/mpi/mpich1/docs.html

• MPI_MAX_PROCESSOR_NAME: maximum length of name returned by MPI_GET_PROCESSOR_NAME

• MPI_Init: initialize the MPI execution environment

• MPI_Comm_rank: determines the rank of the calling process in the communicator

• MPI_Comm_size: determines the size of the group associated with a communicator

• MPI_Get_processor_name: gets the name of the processor

• MPI_Finalize: terminates the MPI execution environment

Page 24: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Prepare data for parallel sum

if (my_rank == 0) {                       /* ON CPU0 */
    array_size = 12;
    for (i = 0; i < array_size; i++)
        data[i] = i + 1;                  /* fill the data array 1, 2, 3, ..., 12 */

    for (target = 1; target < p; target++)
        MPI_Send(&array_size, 1, MPI_INT, target, tag1,
                 MPI_COMM_WORLD);         /* send array size to the other CPUs */

    loc_array_size = array_size / p;      /* calculate local array size */
    k = loc_array_size;

    for (target = 1; target < p; target++) {
        MPI_Send(&data[k], loc_array_size, MPI_INT, target, tag2,
                 MPI_COMM_WORLD);         /* send data to the other CPUs */
        k += loc_array_size;              /* offsets k = 3, 6, 9 */
    }

    for (k = 0; k < loc_array_size; k++)
        data_loc[k] = data[k];            /* initialize CPU0's local array */
} else {
    MPI_Recv(&array_size, 1, MPI_INT, 0, tag1, MPI_COMM_WORLD,
             &status);                    /* receive array size from CPU0 */
    loc_array_size = array_size / p;
    MPI_Recv(&data_loc[0], loc_array_size, MPI_INT, 0, tag2, MPI_COMM_WORLD,
             &status);                    /* receive local array from CPU0 */
}
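The fragment above is only one branch of a larger program. A minimal sketch of the surrounding declarations and setup it relies on might look like the following; MAX_N and the tag values 1, 2, 3 are assumptions, since the slides do not show them.

#include <stdio.h>
#include "mpi.h"

#define MAX_N 1000                        /* assumed maximum array size */

int main(int argc, char** argv)
{
    int my_rank, p, source, target, i, k;
    int array_size, loc_array_size;
    int data[MAX_N], data_loc[MAX_N];
    int res, finres;
    int tag1 = 1, tag2 = 2, tag3 = 3;     /* assumed tag values */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* ... the "Prepare data" and "Parallel sum" code from the slides goes here ... */

    MPI_Finalize();
    return 0;
}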

Page 25: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Parallel sum

res = 0;                                  /* parallel sum of the local array */
for (k = 0; k < loc_array_size; k++)
    res = res + data_loc[k];

if (my_rank != 0) {
    MPI_Send(&res, 1, MPI_INT, 0, tag3, MPI_COMM_WORLD);   /* send result to CPU0 */
} else {
    finres = res;                         /* result of CPU0 */
    printf("\n Result of process %d: %d\n", my_rank, res);
    for (source = 1; source < p; source++) {
        MPI_Recv(&res, 1, MPI_INT, source, tag3, MPI_COMM_WORLD,
                 &status);                /* receive results from the other CPUs */
        finres = finres + res;
        printf("\n Result of process %d: %d\n", source, res);
    }
    printf("\n\n\n Final Result: %d\n", finres);
}
MPI_Finalize();

Page 26: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

Parallel Sum Output

$ mpirun -np 4 a.out

Result of process 0: 6

Result of process 1: 15

Result of process 2: 24

Result of process 3: 33

Final Result: 78
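A design note: the slides build the reduction by hand with MPI_Send and MPI_Recv, which is a good way to learn the point-to-point calls. The same final result is usually obtained with the collective MPI_Reduce; here is a minimal self-contained sketch (the way res is computed is just a stand-in for the local sums 6, 15, 24, 33 of the slide example):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_rank, p, res, finres;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* stand-in for each process's local sum of its three numbers */
    res = (my_rank * 3 + 1) + (my_rank * 3 + 2) + (my_rank * 3 + 3);

    /* MPI adds every process's res into finres on rank 0 */
    MPI_Reduce(&res, &finres, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("Final Result: %d\n", finres);   /* 78 when run with -np 4 */

    MPI_Finalize();
    return 0;
}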

Page 27: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

MPI_Send

• Performs a basic send
• Synopsis

– #include "mpi.h" int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )

• Input Parameters
– buf: initial address of send buffer (choice)
– count: number of elements in send buffer (nonnegative integer)
– datatype: datatype of each send buffer element (handle)
– dest: rank of destination (integer)
– tag: message tag (integer)
– comm: communicator (handle)

Page 28: An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos.

MPI_Recv

• Basic receive
• Synopsis

– #include "mpi.h" int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )

• Output Parameters
– buf: initial address of receive buffer (choice)
– status: status object (Status)

• Input Parameters
– count: maximum number of elements in receive buffer (integer)
– datatype: datatype of each receive buffer element (handle)
– source: rank of source (integer)
– tag: message tag (integer)
– comm: communicator (handle)
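To see the two calls working as a pair, here is a minimal sketch that sends one integer from rank 0 to rank 1 (run with mpirun -np 2; the tag value 99 and the message contents are just an illustration):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);       /* send one int to rank 1 */
    } else if (my_rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);  /* receive it from rank 0 */
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}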

