
Parallel Programming

AMANO, Hideharu

Parallel Programming

Message Passing: PVM, MPI

Shared Memory: POSIX thread, OpenMP, CUDA/OpenCL

Automatic Parallelizing Compilers

Message passing (blocking: rendezvous)

(Diagram: the sender's Send and the receiver's Receive synchronize directly; the sender waits until the matching Receive completes.)

Message passing (with buffer)

(Diagram: Send deposits the message into a buffer, so the sender can proceed before the matching Receive is issued.)

Message passing (non-blocking)

(Diagram: the sender issues Send and continues with other jobs while the transfer completes; the Receive is posted independently.)

PVM (Parallel Virtual Machine)

A buffer is provided for the sender. Both blocking and non-blocking receive are provided. Barrier synchronization is supported.

MPI (Message Passing Interface)

A superset of PVM for point-to-point communication. Group communication and various other communication patterns are supported. Error checking is done with a communication tag.

Programming style using MPI

SPMD (Single Program Multiple Data Streams): multiple processes execute the same program, and each does independent processing based on its process number.

Program execution using MPI

The specified number of processes are generated and distributed to the nodes of a NORA machine or PC cluster.

Communication methods

Point-to-Point communication: a sender and a receiver execute functions for sending and receiving, and the two calls must be strictly matched.

Collective communication: communication among multiple processes, in which the same function is executed by all participating processes. It can be replaced with a sequence of point-to-point communications, but is often more efficient.
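As an illustration that is not part of the original slides, the sketch below sums one value per process with the collective MPI_Reduce; every process calls the same function, and only rank 0 receives the result:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int pid, nprocs;
    double local_sum, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local_sum = (double)pid;   /* each process contributes its own rank */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (pid == 0)
        printf("sum of ranks 0..%d = %.0f\n", nprocs - 1, global_sum);

    MPI_Finalize();
    return 0;
}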

Fundamental MPI functions

Most programs can be described using six fundamental functions:

MPI_Init() … MPI initialization
MPI_Comm_rank() … get the process #
MPI_Comm_size() … get the total process #
MPI_Send() … message send
MPI_Recv() … message receive
MPI_Finalize() … MPI termination

Other MPI functions

Functions for measurement:
MPI_Barrier() … barrier synchronization
MPI_Wtime() … get the clock time

Non-blocking functions: each consists of a communication request and a completion check; other calculation can be executed while waiting.
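The following sketch (an assumption, not taken from the slides) combines these functions: rank 0 sends one integer to rank 1 with a non-blocking request that is completed by MPI_Wait, and the elapsed time is measured with MPI_Barrier and MPI_Wtime. It assumes at least two processes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int pid, data = 0;
    MPI_Request req;
    MPI_Status status;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    MPI_Barrier(MPI_COMM_WORLD);   /* align the start of the measurement */
    t0 = MPI_Wtime();

    if (pid == 0) {
        data = 100;
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* other calculation could be done here while the send is in flight */
        MPI_Wait(&req, &status);   /* check (complete) the request */
    } else if (pid == 1) {
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &status);
    }

    t1 = MPI_Wtime();
    if (pid == 0)
        printf("elapsed time: %f s\n", t1 - t0);

    MPI_Finalize();
    return 0;
}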

An Example

#include <stdio.h>
#include <mpi.h>

#define MSIZE 64

int main(int argc, char **argv)
{
    char msg[MSIZE];
    int pid, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            fputs(msg, stdout);
        }
    }
    else {
        sprintf(msg, "Hello, world! (from process #%d)\n", pid);
        MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();

    return 0;
}

Initialize and Terminate

int MPI_Init(

   int *argc, /* pointer to argc */

   char ***argv /* pointer to argv */ );

mpi_init(ierr)

   integer ierr       ! return code

The command-line arguments must be passed directly as argc and argv.

int MPI_Finalize();

   mpi_finalize(ierr)

   integer ierr       ! return code

Communicator functions

MPI_Comm_rank returns the rank (process ID) within the communicator comm.

int MPI_Comm_rank(
   MPI_Comm comm, /* communicator */
   int *rank      /* process ID (output) */ );

mpi_comm_rank(comm, rank, ierr)
   integer comm, rank
   integer ierr       ! return code

MPI_Comm_size returns the total number of processes in the communicator comm.

int MPI_Comm_size(
   MPI_Comm comm, /* communicator */
   int *size      /* number of processes (output) */ );

mpi_comm_size(comm, size, ierr)
   integer comm, size
   integer ierr        ! return code

Communicators are used for sharing a communication space among a subset of processes. MPI_COMM_WORLD is a pre-defined communicator for all processes.
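As a hedged sketch (MPI_Comm_split is not covered in the slides above), a new communicator for a subset of processes can be created by splitting MPI_COMM_WORLD, here into even-rank and odd-rank groups; the fragment assumes MPI_Init has already been called:

int world_rank, subrank, subsize;
MPI_Comm subcomm;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
/* processes with the same "color" (world_rank % 2) end up in the same communicator */
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);

MPI_Comm_rank(subcomm, &subrank);   /* rank inside the new communicator */
MPI_Comm_size(subcomm, &subsize);
MPI_Comm_free(&subcomm);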

MPI_Send

It sends data to process "dest".

int MPI_Send(
   void *buf,             /* send buffer */
   int count,             /* # of elements to send */
   MPI_Datatype datatype, /* datatype of elements */
   int dest,              /* destination (receiver) process ID */
   int tag,               /* tag */
   MPI_Comm comm          /* communicator */ );

mpi_send(buf, count, datatype, dest, tag, comm, ierr)
   <type> buf(*)
   integer count, datatype, dest, tag, comm
   integer ierr ! return code

Tags are used to identify messages.

MPI_Recv

int MPI_Recv(
   void *buf,             /* receive buffer */
   int count,             /* # of elements to receive */
   MPI_Datatype datatype, /* datatype of elements */
   int source,            /* source (sender) process ID */
   int tag,               /* tag */
   MPI_Comm comm,         /* communicator */
   MPI_Status *status     /* status (output) */ );

mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
   <type> buf(*)
   integer count, datatype, source, tag, comm, status(mpi_status_size)
   integer ierr ! return code

The same tag as the sender's must be passed to MPI_Recv. Pass a pointer to an MPI_Status variable. It is a structure with three members: MPI_SOURCE, MPI_TAG and MPI_ERROR, which store the sender's process ID, the tag, and an error code.
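A brief sketch (not in the original slides) of how the status output is typically used, together with the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG:

int value;
MPI_Status status;

/* accept a message from any sender with any tag, then inspect the status */
MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
printf("received %d from process %d (tag %d)\n",
       value, status.MPI_SOURCE, status.MPI_TAG);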

datatype and count

The size of the message is specified by count and datatype:
MPI_CHAR … char
MPI_INT … int
MPI_FLOAT … float
MPI_DOUBLE … double
… etc.

Compile and Execution

% icc -o hello hello.c -lmpi

% mpirun -np 8 ./hello

Hello, world! (from process #1)

Hello, world! (from process #2)

Hello, world! (from process #3)

Hello, world! (from process #4)

Hello, world! (from process #5)

Hello, world! (from process #6)

Hello, world! (from process #7)

POSIX Thread

The standard API for controlling threads on Linux (POSIX: Portable Operating System Interface).

Thread handling: pthread_create(); pthread_join(); pthread_exit();

Synchronization with a mutex:

pthread_mutex_lock(); pthread_mutex_trylock(); pthread_mutex_unlock();

Condition variables (semaphore-like signaling): pthread_cond_signal(); pthread_cond_wait(); etc.
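A minimal sketch (my own example, not from the slides) using pthread_create(), pthread_join() and a mutex: four threads increment a shared counter inside the lock. Compile with -pthread.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

static int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* each thread increments the shared counter inside the critical section */
void *worker(void *arg)
{
    pthread_mutex_lock(&lock);
    counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);    /* wait for each thread to finish */

    printf("counter = %d\n", counter);
    return 0;
}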

OpenMP

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int tid, npes;
        tid = omp_get_thread_num();
        npes = omp_get_num_threads();
        printf("Hello World from %d of %d\n", tid, npes);
    }
    return 0;
}

Multiple threads are generated by the pragma. Variables declared globally can be shared.

A convenient pragma for parallel execution:

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 1000; i++) {
        c[i] = a[i] + b[i];
    }
}

The assignment of loop iterations to threads is automatically adjusted so that the load of each thread becomes even.
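One more hedged sketch (not from the slides): when the loop accumulates into a single variable, a reduction clause lets each thread keep a private partial sum that is combined at the end of the loop. Compile with -fopenmp.

#include <stdio.h>

int main(void)
{
    int i;
    double sum = 0.0;

    /* each thread accumulates a private partial sum; the partial sums
       are combined into sum when the loop finishes */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= 1000; i++)
        sum += 1.0 / i;

    printf("sum = %f\n", sum);
    return 0;
}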

CUDA/OpenCL

CUDA was developed for GPGPU programming.

SPMD (Single Program Multiple Data); 3-D management of threads; 32 threads are managed as a warp.

SIMD programming; architecture-dependent memory model. OpenCL is a standard language for heterogeneous accelerators.

Heterogeneous Programming with CUDA

(Diagram: execution alternates between the host and the device — serial code runs on the host, then a parallel kernel KernelA(args) runs on the device, then serial code on the host again, then a parallel kernel KernelB(args) on the device.)

Threads and thread blocks

(Diagram: thread blocks 0, 1, …, N-1, each containing threads 0–7 identified by threadID, all executing the same code:)

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

A kernel is a grid of thread blocks; each thread executes the same code.

Threads in the same block may synchronize with barriers: __syncthreads(). Thread blocks cannot synchronize with each other, so the execution order depends on the machine.

Memory Hierarchy

Per-thread: local memory
Per-block: shared memory
Per-device: global memory (shared by sequentially launched kernels, e.g., Kernel 0 followed by Kernel 1)

cudaMemcpy() is used to transfer data between host memory and device memory.

CUDA extensions

Declaration specifiers:
__global__ void kernelFunc(...);  // kernel function, runs on the device
__device__ int GlobalVar;         // variable in device memory
__shared__ int sharedVar;         // variable in per-block shared memory

Extended function-call syntax for parallel kernel launch:
KernelFunc<<<dimGrid, dimBlock>>>(...);  // launch dimGrid blocks with dimBlock threads each

Special variables for thread identification in kernels:
dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;

Barrier synchronization between threads:
__syncthreads();
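A small hedged sketch (my own, not from the slides) showing __shared__ and __syncthreads() together: each block loads a tile of BLOCK_SIZE elements into per-block shared memory and writes it back reversed.

#define BLOCK_SIZE 256

/* reverse the elements inside each block using per-block shared memory */
__global__ void reverse_in_block(int *d_data)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;
    int gidx = blockIdx.x * blockDim.x + t;

    tile[t] = d_data[gidx];
    __syncthreads();                 /* wait until the whole tile is loaded */
    d_data[gidx] = tile[blockDim.x - 1 - t];
}

/* launch example (N is assumed to be a multiple of BLOCK_SIZE):
   reverse_in_block<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_data); */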

CUDA runtime

Device management: cudaGetDeviceCount(), cudaGetDeviceProperties()

Device memory management: cudaMalloc(), cudaFree(), cudaMemcpy()

Graphics interoperability: cudaGLMapBufferObject(), cudaD3D9MapResources()

Texture management: cudaBindTexture(), cudaBindTextureToArray()

Example: Increment Array Elements

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    increment_cpu(a, b, N);
}

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

Example: Increment Array Elements

Let's assume N = 16 and blockDim = 4:

Block 0: blockIdx.x=0, blockDim.x=4, threadIdx.x=0,1,2,3 → idx=0,1,2,3
Block 1: blockIdx.x=1, blockDim.x=4, threadIdx.x=0,1,2,3 → idx=4,5,6,7
Block 2: blockIdx.x=2, blockDim.x=4, threadIdx.x=0,1,2,3 → idx=8,9,10,11
Block 3: blockIdx.x=3, blockDim.x=4, threadIdx.x=0,1,2,3 → idx=12,13,14,15

int idx = blockDim.x * blockIdx.x + threadIdx.x;
maps the local index threadIdx to a global index.

blockDim should be >= 32 in real code! Using more blocks hides the memory latency in the GPU.

Host code

// allocate host memory
unsigned int numBytes = N * sizeof(float);
float *h_A = (float *) malloc(numBytes);

// allocate device memory
float *d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// execute the kernel
increment_gpu<<<N/blockSize, blockSize>>>(d_A, b);

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_A);

(Diagram: GeForce GTX 280 with 240 cores — the Host feeds an Input Assembler and a Thread Execution Manager, which dispatch work to groups of Thread Processors, each group with its per-block shared memory (PBSM); Load/Store units connect the processors to Global Memory.)

Hardware Implementation: Execution Model

(Diagram: the Host launches Kernel 1 on the Device as Grid 1, made of Blocks (0,0) … (2,1), and then Kernel 2 as Grid 2 with the same block layout. Block (1,1) is expanded to show its threads grouped into warps: threads (0,0)–(31,0) form Warp 0, threads (32,0)–(63,0) form Warp 1, threads (0,1)–(31,1) form Warp 2, threads (32,1)–(63,1) form Warp 3, threads (0,2)–(31,2) form Warp 4, and threads (32,2)–(63,2) form Warp 5.)

A multiprocessor executes the same instruction on a group of threads called a warp. Warp size = the number of threads in a warp.

Automatic parallelizing compilers

They automatically translate code written for uniprocessors into code for multiprocessors. Loop-level parallelism is the main target of parallelization. Fortran codes have been the main targets:

No pointers
The array structure is simple

Recently, restricted C has become a target language: OSCAR compiler (Waseda Univ.), COINS.

Shared memory model vs. message passing model

Shared memory benefits: a distributed OS is easy to implement; automatic parallelizing compilers; POSIX threads, OpenMP.

Message passing benefits: formal verification is easy (blocking); no side effects (a shared variable is a side effect in itself); small cost.

Parallel Programming Contest

In this lecture, a parallel programming contest will be held. All students who want to get the credit must join it. At the minimum, the program must run correctly. Students with good results will be given the credit unconditionally.