Lecture 5. OpenMP Intro.

Prof. Taeweon Suh
Computer Science Education

Korea University

COM503 Parallel Computer Architecture & Programming

References

• Many of the lecture slides are based on the following web material, with some modifications:
  https://computing.llnl.gov/tutorials/openMP/

• Official OpenMP page: http://openmp.org/
  It contains up-to-date information on OpenMP
  It also includes tutorials, book example codes, etc.

OpenMP

• What does OpenMP stand for?
  Short version: Open Multi-Processing
  Long version: Open specifications for Multi-Processing via collaborative work between interested parties from the hardware and software industry, government, and academia

• Standardized: Jointly defined and endorsed by a group of major computer hardware and software vendors

• Comprised of three primary API components (see the sketch at the end of this slide):
  Compiler Directives (#pragma)
  Runtime Library Routines
  Environment Variables

• Portable: The API is specified for C/C++ and Fortran
  Implementations exist for most major platforms, including Linux and Windows
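
A minimal sketch (not from the original slides) showing the three components together: a compiler directive opens the parallel region, runtime library routines query the team, and the OMP_NUM_THREADS environment variable (assumed to be set before the run) selects the thread count.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Environment variable component: set before running,
       e.g. in bash: export OMP_NUM_THREADS=4 */

    /* Compiler directive component: create a team of threads */
    #pragma omp parallel
    {
        /* Runtime library component: query thread id and team size */
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}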

History

• In the early 90's, shared-memory machine vendors supplied similar, directive-based Fortran programming extensions
  The user would augment a serial Fortran program with directives specifying which loops were to be parallelized
  The compiler would be responsible for automatically parallelizing such loops across the SMP processors

• The OpenMP standard specification effort started in 1997, led by the OpenMP Architecture Review Board (ARB)
  The ARB members included Compaq/Digital, HP, Intel, IBM, Kuck & Associates, Inc. (KAI), Silicon Graphics, Sun Microsystems, and the U.S. Department of Energy

Release History

• Our textbook covers the OpenMP 2.5 specification
• Starting with OpenMP 2.5, the specification is released together for C/C++ and Fortran
• The most recent release, OpenMP 4.0, came out in July 2013
[Figure: timeline of OpenMP specification releases]
Shared Memory Machines

• OpenMP is designed for shared memory machines (UMA or NUMA)

Fork-Join Model

• Begin as a single process (master thread). The master thread executes sequentially until the first parallel region construct is encountered.

• FORK: the master thread creates a team of parallel threads

• The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads

• JOIN: When the team threads complete the statements in the parallel region, they synchronize and terminate, leaving only the master thread

Source: https://computing.llnl.gov/tutorials/openMP/#Introduction
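
A minimal sketch (not part of the original slide) of the fork-join model: serial code runs on the master thread, a team of threads is forked for the parallel region, and the team joins back so that only the master continues.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Serial part: only the master thread runs here\n");

    #pragma omp parallel   /* FORK: a team of threads is created */
    {
        printf("Parallel region: executed by thread %d\n",
               omp_get_thread_num());
    }                      /* JOIN: implicit barrier; only the master continues */

    printf("Serial part again: back to the master thread\n");
    return 0;
}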

Other Parallel Programming Models?

• MPI: Message Passing Interface
  Developed for distributed-memory architectures, where multiple processes execute independently and communicate data

• Most widely used in the high-end technical computing community, where clusters are common
  Most vendors of shared-memory systems also provide MPI implementations
  Most MPI implementations consist of a specific set of APIs callable from C, C++, Fortran, or Java

• MPI implementations
  MPICH
    Freely available, portable implementation of MPI
    Free software, available for most flavors of Unix and Windows
  Open MPI

Other Parallel Programming Models?

• Pthreads: POSIX (Portable Operating System Interface) Threads
  Shared-memory programming model
  Defined as a set of C language programming types and procedure calls

• A collection of routines for creating, managing, and coordinating a collection of threads

• Programming with Pthreads is much more complex than with OpenMP

Serial Code Example

• Serial version of the dot product program
  The dot product is an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number obtained by multiplying corresponding entries and adding up those products: sum = a[0]*b[0] + a[1]*b[1] + ... + a[n-1]*b[n-1]

#include <stdio.h>

int main(int argc, char *argv[])
{
    double sum;
    double a[256], b[256];
    int i, n;

    n = 256;

    /* Initialize the two vectors */
    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    /* Accumulate the dot product serially */
    sum = 0.0;
    for (i = 0; i < n; i++) {
        sum = sum + a[i] * b[i];
    }

    printf("sum = %9.2lf\n", sum);
    return 0;
}

MPI Example

• MPI
  To compile, ‘mpicc dot_product_mpi.c -o dot_product_mpi’
  To run, ‘mpirun -np 4 -machinefile machine_file dot_product_mpi’

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    double sum, sum_local;
    double a[256], b[256];
    int i, n;
    int numprocs, myid, my_first, my_last;

    n = 256;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    printf("#procs = %d\n", numprocs);

    /* Each process works on its own contiguous chunk of the vectors */
    my_first = myid * n / numprocs;
    my_last  = (myid + 1) * n / numprocs;

    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    /* Local partial dot product */
    sum_local = 0.0;
    for (i = my_first; i < my_last; i++) {
        sum_local = sum_local + a[i] * b[i];
    }

    /* Combine the partial sums from all processes */
    MPI_Allreduce(&sum_local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (myid == 0)
        printf("sum = %9.2lf\n", sum);

    MPI_Finalize();
    return 0;
}

Pthreads Example

• Pthreads
  To compile, ‘gcc dot_product_pthread.c -o dot_product_pthread -pthread’

#include <stdio.h>
#include <pthread.h>

#define NUMTHRDS 4

double sum = 0;
double a[256], b[256];
int n = 256;

pthread_t thds[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg);

int main(int argc, char *argv[])
{
    pthread_attr_t attr;
    void *status;
    int i;

    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    pthread_mutex_init(&mutexsum, NULL);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    /* FORK: create the worker threads, passing each its thread id */
    for (i = 0; i < NUMTHRDS; i++) {
        pthread_create(&thds[i], &attr, dotprod, (void *)(long)i);
    }
    pthread_attr_destroy(&attr);

    /* JOIN: wait for all workers to finish */
    for (i = 0; i < NUMTHRDS; i++) {
        pthread_join(thds[i], &status);
    }

    printf("sum = %9.2lf\n", sum);
    pthread_mutex_destroy(&mutexsum);
    pthread_exit(NULL);
}

Pthreads Example

• Pthreads
  To compile, ‘gcc dot_product_pthread.c -o dot_product_pthread -pthread’

void *dotprod(void *arg)
{
    int myid, i, my_first, my_last;
    double sum_local;

    /* The thread id was passed through the void * argument */
    myid = (int)(long)arg;
    my_first = myid * n / NUMTHRDS;
    my_last  = (myid + 1) * n / NUMTHRDS;

    /* Local partial dot product over this thread's chunk */
    sum_local = 0.0;
    for (i = my_first; i < my_last; i++) {
        sum_local = sum_local + a[i] * b[i];
    }

    /* Protect the update of the shared sum with a mutex */
    pthread_mutex_lock(&mutexsum);
    sum = sum + sum_local;
    pthread_mutex_unlock(&mutexsum);

    pthread_exit((void *)0);
}

Another Pthread Example

• Pthreads
  To compile, ‘gcc pthread_creation.c -o pthread_creation -pthread’

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *print_message_function(void *ptr);

int main(void)
{
    pthread_t thread1, thread2;
    char *message1 = "Thread 1";
    char *message2 = "Thread 2";
    int iret1, iret2;

    /* Create independent threads, each executing print_message_function */
    iret1 = pthread_create(&thread1, NULL, print_message_function, (void *)message1);
    iret2 = pthread_create(&thread2, NULL, print_message_function, (void *)message2);

    /* Wait until the threads are complete before main continues.    */
    /* Without the joins we risk executing exit(), which terminates  */
    /* the process and all threads before they have completed.       */
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    printf("Thread 1 returns: %d\n", iret1);
    printf("Thread 2 returns: %d\n", iret2);
    exit(0);
}

void *print_message_function(void *ptr)
{
    char *message = (char *)ptr;
    printf("%s \n", message);
    return NULL;
}

OpenMP Example

• OpenMP
  To compile, ‘gcc dot_product_omp.c -o dot_product_omp -fopenmp’

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    double sum;
    double a[256], b[256];
    int i, n;

    n = 256;

    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    sum = 0.0;

    /* Parallelize the loop; each thread accumulates a private copy of
       sum, and the copies are combined (+) at the end of the loop */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum = sum + a[i] * b[i];
    }

    printf("sum = %9.2lf\n", sum);
    return 0;
}

OpenMP

• Compiler Directive Based
  Parallelism is specified through the use of compiler directives in the source code
• Nested Parallelism Support
  Parallel regions may be placed inside other parallel regions
• Dynamic Threads
  The number of threads used to execute parallel regions can be altered dynamically
• Memory Model
  OpenMP provides a "relaxed-consistency" memory model. In other words, threads can cache their data and are not required to maintain exact consistency with real memory all the time.
  When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is flushed by all threads as needed (see the sketch below).
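
A hedged sketch (not from the original slides) of the flush mechanism mentioned above, based on the common producer/consumer pattern; the variable names data and flag are made up for the example.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, flag = 0;   /* shared between the threads */

    #pragma omp parallel num_threads(2) shared(data, flag)
    {
        if (omp_get_thread_num() == 0) {
            /* Producer: write the data, then publish the flag */
            data = 42;
            #pragma omp flush(data)
            flag = 1;
            #pragma omp flush(flag)
        } else if (omp_get_thread_num() == 1) {
            /* Consumer: spin until the producer's flag becomes visible */
            do {
                #pragma omp flush(flag)
            } while (!flag);
            #pragma omp flush(data)
            printf("data = %d\n", data);
        }
    }
    return 0;
}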

OpenMP Parallel Construct

• A parallel region is a block of code executed by multiple threads. This is the fundamental OpenMP parallel construct.

#pragma omp parallel [clause[[,] clause]…]
    structured block

• This construct is used to specify the block that should be executed in parallel
  A team of threads is created to execute the associated parallel region
  Each thread in the team is assigned a unique thread number (0 to #threads-1)
  The master thread is a member of that team and has thread number 0
  Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code
  By itself, the construct does not distribute the work of the region among the threads; the programmer must use a work-sharing construct to specify that

• There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point.

• The code not enclosed by a parallel construct will be executed serially

Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    #pragma omp parallel
    {
        printf("Parallel region is executed by thread ID %d\n",
               omp_get_thread_num());

        if (omp_get_thread_num() == 2) {
            printf("  Thread %d does things differently\n",
                   omp_get_thread_num());
        }
    }
    return 0;
}

How Many Threads?

• The number of threads in a parallel region is determined by the following factors, in order of precedence

1. Evaluation of the IF clause
   The IF clause is supported on the parallel construct only
   #pragma omp parallel if (n > 5)

2. num_threads clause with a parallel construct
   The num_threads clause is supported on the parallel construct only
   #pragma omp parallel num_threads(8)

3. omp_set_num_threads() library function
   omp_set_num_threads(8)

4. OMP_NUM_THREADS environment variable
   In bash, use ‘export OMP_NUM_THREADS=4’

5. Implementation default - usually the number of CPUs on a node, even though it could be dynamic

Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NUM_THREADS 8

int main(int argc, char *argv[])
{
    int n = 6;

    omp_set_num_threads(NUM_THREADS);

    //#pragma omp parallel
    #pragma omp parallel if (n > 5) num_threads(n)
    {
        printf("Parallel region is executed by thread ID %d\n",
               omp_get_thread_num());

        if (omp_get_thread_num() == 2) {
            printf("  Thread %d does things differently\n",
                   omp_get_thread_num());
        }
    }
    return 0;
}

Work-Sharing Constructs

• Work-sharing constructs are used to distribute computation among the threads in a team
  #pragma omp for
  #pragma omp sections
  #pragma omp single

• By default, threads wait at a barrier at the end of a work-sharing region until the last thread has completed its share of the work
  However, the programmer can suppress this barrier with the nowait clause (a sketch follows below)
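
A minimal sketch (not from the original slides) of the nowait clause: threads leave the first loop without waiting and move straight to the second one. The arrays and loop bodies are made up for the example and are chosen so the two loops are independent.

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void)
{
    int i;
    double a[N], b[N];

    #pragma omp parallel shared(a, b) private(i)
    {
        /* No barrier at the end of this loop: each thread moves on
           to the next work-sharing region as soon as it is done */
        #pragma omp for nowait
        for (i = 0; i < N; i++)
            a[i] = i * 1.0;

        /* Safe only because this loop never touches a[], so it does
           not depend on work other threads may still be finishing */
        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = 2.0 * i;
    }

    for (i = 0; i < N; i++)
        printf("a[%d] = %4.1f  b[%d] = %4.1f\n", i, a[i], i, b[i]);
    return 0;
}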

Loop Construct

• The loop construct causes the immediately following loop iterations to be executed in parallel

#pragma omp for [clause[[,] clause]…]

for loop

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 8;
    int i;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel shared(n) private(i)
    {
        /* The iterations of the loop are divided among the threads */
        #pragma omp for
        for (i = 0; i < n; i++)
            printf(" Thread %d executes loop iteration %d\n",
                   omp_get_thread_num(), i);
    }
    return 0;
}

Section Construct

• The section construct is the easiest way to have different threads execute different kinds of work

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NUM_THREADS 4

void funcA();
void funcB();

int main(int argc, char *argv[])
{
    omp_set_num_threads(NUM_THREADS);

    /* Each section is executed once, by some thread in the team */
    #pragma omp parallel sections
    {
        #pragma omp section
        funcA();

        #pragma omp section
        funcB();
    }
    return 0;
}

void funcA()
{
    printf("In funcA: this section is executed by thread %d\n",
           omp_get_thread_num());
}

void funcB()
{
    printf("In funcB: this section is executed by thread %d\n",
           omp_get_thread_num());
}

#pragma omp sections [clause[[,] clause]…]
{
    [#pragma omp section]
        structured block
    [#pragma omp section]
        structured block
}

Section Construct

• At run time, the specified code blocks are executed by the threads in the team
  Each thread executes one code block at a time
  Each code block will be executed exactly once
  If there are fewer threads than code blocks, some threads execute multiple code blocks
  If there are fewer code blocks than threads, the remaining threads will be idle
  Assignment of code blocks to threads is implementation-dependent
  Depending on the type of work performed in the various code blocks and the number of threads used, this construct might lead to a load-balancing problem

Single Construct

• The single construct specifies that the block should be executed by one thread only

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4
#define N 8

int main(int argc, char *argv[])
{
    int i, a, b[N];

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel shared(a, b) private(i)
    {
        /* Only one thread initializes the shared variable a */
        #pragma omp single
        {
            a = 10;
            printf("Single construct executed by thread %d\n",
                   omp_get_thread_num());
        }

        /* The implicit barrier at the end of single ensures a is set
           before the team uses it here */
        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = a;
    }

    printf("After the parallel region\n");
    for (i = 0; i < N; i++)
        printf("b[%d] = %d\n", i, b[i]);
    return 0;
}

#pragma omp single [clause[[,] clause]…]
    structured block

Single Construct

• Only one thread executes the block associated with the single construct
  The other threads wait at an implicit barrier until the thread executing the single code block has completed

• What if the single construct were omitted in the previous example?
  Memory consistency issue? Performance issue?
  Would a barrier then be required before the #pragma omp for? (see the sketch below)
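
A hedged sketch (not from the original slides) of one possible answer to the last question: without the single construct, one way to keep a single writer is to test the thread number, and an explicit barrier is then needed before the #pragma omp for so that a is written before it is read.

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4
#define N 8

int main(void)
{
    int i, a, b[N];

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel shared(a, b) private(i)
    {
        /* Hand-rolled replacement for the single construct:
           only thread 0 writes the shared variable */
        if (omp_get_thread_num() == 0)
            a = 10;

        /* Explicit barrier replaces the implicit barrier that the
           single construct would have provided */
        #pragma omp barrier

        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = a;
    }

    for (i = 0; i < N; i++)
        printf("b[%d] = %d\n", i, b[i]);
    return 0;
}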

Misc

• A useful Linux command
  top (display Linux tasks) provides a dynamic real-time view of a running system
• Try pressing 1, z, and H after running the top command

Misc

• Useful Linux commands
  ps -eLf
    Displays thread IDs for OpenMP and Pthreads
  top
    Displays process IDs, which can be used to monitor the processes created by MPI

Misc

• top does not show threads
• ps -eLf
  Displays thread IDs for OpenMP and Pthreads

Backup Slides

Goal of OpenMP

• Standardization: Provide a standard among a variety of shared memory architectures/platforms

• Lean and Mean: Establish a simple and limited set of directives for programming shared-memory machines
  Significant parallelism can be implemented by using just 3 or 4 directives

• Ease of Use: Provide the capability to incrementally parallelize a serial program, unlike message-passing libraries, which typically require an all-or-nothing approach
  Provide the capability to implement both coarse-grain and fine-grain parallelism

• Portability: Supports Fortran (77, 90, and 95), C, and C++
  Public forum for API and membership
