Spring 2019 CS4823/6643 Parallel Computing 1
OpenMP with Examples
Wei Wang
The Components of OpenMP Programs
● Construct
– A block of code to be parallelized in a specific way
● Directive
– A special statement (called a pragma) used to declare a block to be a construct
● Clause
– Appended to directives to declare additional attributes for a construct
– Usually used to specify how to handle variables, e.g., whether to make a variable shared or private
● Run-Time Library Functions
– Functions to be called from OpenMP programs
– Usually used to query execution-time information or control execution
The Components of OpenMP
● Example:
int i;
#pragma omp for private(i)
for (i = 0; i < 100000; i++) { a[i] = 2 * i; }

Clause (variable i is private to each thread; array a is shared by all threads)
Directive (declares the following for loop to be parallelized)
Construct (the for loop to be parallelized)
Constructs and Directives: What We Will Learn
● Parallel construct
– #pragma omp parallel
● Work sharing constructs
– “For” loops: #pragma omp for
– Sections: #pragma omp sections
– SINGLE code block: #pragma omp single
● Combined Parallel Work-Sharing Constructs
– #pragma omp parallel for/sections
● Synchronization constructs
– MASTER block: #pragma omp master
– CRITICAL block: #pragma omp critical
– BARRIER: #pragma omp barrier
– ATOMIC: #pragma omp atomic
● There are a few more constructs and directives
Parallel Construct
● Syntax:
● Functionality:
– Create multiple threads (number specified by the user) to run in parallel
– If not combined with other directives, each thread executes “CODE BLOCK” independently
#pragma omp parallel
{
    CODE BLOCK
}
Parallel Construct cont’d
● Example - code*:
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        printf("Hello world\n");
        printf("Goodbye world\n");
    }
    return 0;
}
● Example – output:
Hello world
Goodbye world
Hello world
Hello world
Hello world
Goodbye world
Hello world
Hello world
Hello world
Goodbye world
Goodbye world
Hello world
Goodbye world
Goodbye world
Goodbye world
Goodbye world
Note the random sequence; every run the sequence is different
By default OpenMP creates one thread per core (the fox01 server has 12 cores); this particular output was captured with 8 threads.
*example file name: parallel_directive.c
Work Sharing Construct: For Loops
● Syntax:
● Functionality:
– Create multiple threads (number specified by the user), each executes some of the iterations
#pragma omp parallel
{
    #pragma omp for
    for (...) {
        CODE BLOCK
    }
}
Need this “parallel” or there’s only one thread
Work Sharing Construct: For Loops cont’d
● Example - code*:
#include <stdio.h>
#include <omp.h>

int main() {
    int i;
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < 16; i++)
            printf("i is %d\n", i);
    }
    return 0;
}
● Example – output:
i is 0
i is 1
i is 4
i is 5
i is 6
i is 7
i is 14
i is 15
i is 10
i is 11
i is 8
i is 9
i is 12
i is 13
i is 2
i is 3
Scrambled output indicates multiple threads run concurrently
*example file name: for_directive.c
8 threads (by default) execute 16 iterations ==> 2 iterations/thread
“i” goes up to 15, and each value appears once, indicating only 16 iterations are executed in total, hence “work sharing”
Work Sharing Construct: For Loops cont’d
● The for directive only parallelizes the outer loop
#pragma omp parallel private(j)
{
    #pragma omp for
    for (i = ...) {
        SOME CODE
        for (j = ...) {
            SOME OTHER CODE
        }
    }
}
Only the outer loop (the i loop) is parallelized. The inner loop (loop j) will be executed by every thread.
Because the inner loop is executed by every thread, it is important that j is a private variable (more on private variables later).
Work Sharing Construct: Sections
● Syntax:
● Functionality:
– Create multiple threads (number specified by the user), each executes some of the sections
– Can declare any number of sections
– For non-iterative (i.e., not for-loop) parallel algorithms
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {CODE BLOCK}
        #pragma omp section
        {CODE BLOCK}
    }
}
Work Sharing Construct: Sections cont’d
● Example - code*:
int i, j;
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        for (i = 0; i < 8; i++)
            printf("i is %d\n", i);
        #pragma omp section
        for (j = 0; j < 8; j++)
            printf("j is %d\n", j);
    }
}
● Example – output:
j is 0
j is 1
j is 2
j is 3
i is 0
i is 1
i is 2
j is 4
i is 3
j is 5
i is 4
j is 6
j is 7
i is 5
i is 6
i is 7
Intermingled i, j outputs indicate parallel execution
*example file name: section_directive.c
Since we have only two sections, only two threads will be used for execution
Values of “i” or “j” always go sequentially from 0 to 7, indicating each for loop is executed serially, i.e., each section is executed by one thread
Work Sharing Construct: SINGLE Directive
● Syntax:
● Functionality:
– Specify that a block of code should be executed by only one thread, once, within a parallel construct
#pragma omp parallel
{
    { CODE BLOCKS EXECUTED IN PARALLEL }
    #pragma omp single
    { CODE BLOCK TO BE EXECUTED BY ONE THREAD }
    { CODE BLOCKS EXECUTED IN PARALLEL }
}
Work Sharing Construct: SINGLE Directive cont’d
● Example - code*:
#pragma omp parallel
{
    #pragma omp single
    {
        printf("This only prints once\n");
    }
    printf("Hello world\n");
    printf("Goodbye world\n");
}
● Example – output:
This only prints once
Hello world
Goodbye world
Hello world
Hello world
Goodbye world
Goodbye world
Hello world
Hello world
Hello world
Hello world
Goodbye world
Hello world
Goodbye world
Goodbye world
Goodbye world
Goodbye world
This line is only printed once even within a parallel construct
*example file name: single_directive.c
The other printfs are executed multiple times by multiple threads
Combined Parallel Work-Sharing Constructs
● For convenience, OpenMP allows combining parallel and work share directives into one line:
#pragma omp parallel
{
    #pragma omp for
    for (...) {
        CODE BLOCK
    }
}
#pragma omp parallel for
for (...) {
    CODE BLOCK
}
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {CODE BLOCK}
        #pragma omp section
        {CODE BLOCK}
    }
}
#pragma omp parallel sections
{
    #pragma omp section
    {CODE BLOCK}
    #pragma omp section
    {CODE BLOCK}
}
Synchronization Construct: MASTER Directive
● Syntax:
● Functionality:
– Specify that a block of code should be executed by the master thread, once, within a parallel construct
– The master thread is usually the first thread that is created, and is usually responsible for coordinating the other threads
// within a parallel and/or work sharing construct
{
    { CODE BLOCKS EXECUTED IN PARALLEL }
    #pragma omp master
    { CODE BLOCK TO BE EXECUTED BY MASTER THREAD }
    { CODE BLOCKS EXECUTED IN PARALLEL }
}
Synchronization Construct: MASTER Directive
● Example - code*:
#pragma omp parallel
{
    #pragma omp master
    {
        printf("I am the master!\n");
    }
    printf("Hello world\n");
    printf("Goodbye world\n");
}
● Example – output:
Hello world
Hello world
Goodbye world
Hello world
Hello world
Hello world
Goodbye world
I am the master!
Hello world
Goodbye world
Goodbye world
Hello world
Goodbye world
Goodbye world
Hello world
Goodbye world
Goodbye world
This line is only printed once even within a parallel construct
*example file name: master_directive.c
The other printfs are executed multiple times by multiple threads
Note that the master thread may not be the first to get executed
Synchronization Construct: MASTER Directive cont’d
● MASTER vs. SINGLE:
– The MASTER construct is always executed by the master thread
● Many algorithms need a master thread to coordinate parallel tasks; it is usually good to know that this coordination work is always done in one thread
– SINGLE construct is executed by the first thread that hits the SINGLE construct code
● Because of this dynamic scheduling nature, SINGLE construct is slightly slower than MASTER construct
– The SINGLE construct has an implicit barrier after it
● If you compare the outputs of the MASTER and SINGLE examples, you will find that
– the SINGLE printf is always executed before the other printfs, because there is an implicit barrier blocking the other threads from proceeding until the SINGLE printf is executed
– the MASTER printf can be executed after the other printfs, because there is no barrier, and the master thread may be scheduled to run after the other threads
Synchronization Construct: BARRIER Directive
● Syntax:
● Functionality:
– When a BARRIER directive is reached, a thread will wait at that point until all other threads have reached that barrier. All threads then resume executing in parallel the code that follows the barrier.
// within a parallel and/or work sharing construct
{
    { CODE BLOCKS EXECUTED IN PARALLEL }
    #pragma omp barrier
    { CODE BLOCKS EXECUTED IN PARALLEL }
}
Synchronization Construct: BARRIER Directive
● Example - code*:
#pragma omp parallel
{
    printf("Hello world\n");
    #pragma omp barrier
    printf("Goodbye world\n");
}
● Example – output:
Hello world
Hello world
Hello world
Hello world
Hello world
Hello world
Hello world
Hello world
Goodbye world
Goodbye world
Goodbye world
Goodbye world
Goodbye world
Goodbye world
Goodbye world
Goodbye world
“Hello world”s are always printed before “Goodbye world”s, because the barrier directs threads to wait for each other before proceeding
*example file name: barrier_directive.c
Compare the output with the parallel construct where there is no barrier
Synchronization Construct: CRITICAL Directive
● Syntax:
● Functionality:
– The CRITICAL directive specifies a region of code that must be executed by only one thread at a time.
// within a parallel and/or work sharing construct
{
    {CODE BLOCKS EXECUTED IN PARALLEL}
    #pragma omp critical
    {
        CODE BLOCK EXECUTED IN SEQUENTIAL
    }
    {CODE BLOCKS EXECUTED IN PARALLEL}
}
Synchronization Construct: CRITICAL Directive cont’d
● Example – code* (control example without CRITICAL):
int i = 0, j = 0;
#pragma omp parallel
{
    {
        i = i + 1;
        j = j + 1;
    }
}
printf("i = %d, j = %d\n", i, j);
● Example – output:
Run 1: i = 4, j = 4
Run 2: i = 4, j = 5
Run 3: i = 5, j = 6

The values of i and j are not stable; they can be any number up to the total number of threads

*example file name: critical_directive.c
Synchronization Construct: CRITICAL Directive cont’d
● Example - code*:
int i = 0, j = 0;
#pragma omp parallel
{
    #pragma omp critical
    {
        i = i + 1;
        j = j + 1;
    }
}
printf("i = %d, j = %d\n", i, j);
● Example – output:
Run 1: i = 12, j = 12
Run 2: i = 12, j = 12
Run 3: i = 12, j = 12

The values of i and j are stable, and are always the same as the number of threads (by default 12 threads on fox01)

*example file name: critical_directive.c
Synchronization Construct: ATOMIC Directive
● Syntax:
● Functionality:
– The ATOMIC directive specifies one statement that must be executed by only one thread at a time.
// within a parallel and/or work sharing construct
{
    {CODE BLOCKS EXECUTED IN PARALLEL}
    #pragma omp atomic
    STATEMENT EXECUTED IN SEQUENTIAL
    {CODE BLOCKS EXECUTED IN PARALLEL}
}
Synchronization Construct: ATOMIC Directive cont’d
● Example – code* (control example without ATOMIC):
int i = 0;
#pragma omp parallel
{
    i = i + 1;
}
printf("i = %d\n", i);
● Example – output:
Run 1: i = 4
Run 2: i = 4
Run 3: i = 5

The value of i is not stable; it can be any number up to the total number of threads

*example file name: atomic_directive.c
Synchronization Construct: ATOMIC Directive cont’d
● Example - code*:
int i = 0;
#pragma omp parallel
{
    #pragma omp atomic
    i = i + 1;
}
printf("i = %d\n", i);
● Example – output:
Run 1: i = 8
Run 2: i = 8
Run 3: i = 8

The value of i is stable, and is always the same as the number of threads (8 in this run)

*example file name: atomic_directive.c
CLAUSES
● In OpenMP, clauses are used to specify the properties and/or behavior of a construct, e.g.,
– How to use variables?
– How to map and partition work?
– How to create and execute threads?
● Clauses are appended to directives
● Clauses we will learn:
– Variables-related: private, shared, default, firstprivate, lastprivate, reduction
– Mapping/partitioning-related: schedule
– Execution-related: nowait
CLAUSE: PRIVATE
● Syntax (always after a directive):
● Description:– The PRIVATE clause declares variables in its list to be private
to each thread.
– For a private variable:
● OpenMP creates a new instance of this variable for each thread
● Each thread will only access this new instance instead of the original instance
● The new instance is not initialized, i.e., it starts with a random value
● After execution of the construct, all new instances are discarded
private (var1, var2, …)
CLAUSE: PRIVATE cont’d
● Example – code:
int i = 11;
printf("Before: i is %d\n", i);
#pragma omp parallel private(i)
{
    printf("Within: i is %d\n", i);
}
printf("After: i is %d\n", i);

● Example – output:
Before: i is 11
Within: i is 1
Within: i is 1
Within: i is 1
Within: i is 1
Within: i is 1
Within: i is 32604
Within: i is 1
Within: i is 1
After: i is 11
The values of “i” before and after the parallel construct are always “11”, indicating this instance of “i” is not accessed by any thread
The values of “i” are not all the same for each thread, indicating these “i”s are different from the original. The differences in these “i”s also suggest these “i”s are not initialized.
*example file name: private_clause.c
CLAUSE: FIRSTPRIVATE
● Syntax (always after a directive):
● Description:
– Same as PRIVATE, FIRSTPRIVATE variables are private to a thread
– Different from PRIVATE, FIRSTPRIVATE variables are initialized
● A FIRSTPRIVATE variable is initialized with the original instance’s value
firstprivate (var1, var2, …)
CLAUSE: FIRSTPRIVATE cont’d
● Example – code:
int i = 3;
printf("Before: i is %d\n", i);
#pragma omp parallel firstprivate(i)
{
    printf("Within: i's init value is %d\n", i);
    i++;
    printf("Within: i's new value is %d\n", i);
}
printf("After: i is %d\n", i);
*example file name: firstprivate_clause.c
CLAUSE: FIRSTPRIVATE cont’d
● Example – output:
Before: i is 3
Within: i's init value is 3
Within: i's init value is 3
Within: i's init value is 3
Within: i's new value is 4
Within: i's init value is 3
Within: i's new value is 4
Within: i's init value is 3
Within: i's new value is 4
Within: i's new value is 4
Within: i's init value is 3
Within: i's new value is 4
Within: i's init value is 3
Within: i's new value is 4
Within: i's init value is 3
Within: i's new value is 4
Within: i's new value is 4
After: i is 3
The values of “i” before and after the parallel construct are always “3”, indicating this instance of “i” is not accessed by any thread
For each thread, “i”’s value always starts with “3”, indicating “i” is initialized; “i”’s final value is always 4, indicating “i” is private.
CLAUSE: LASTPRIVATE
● Syntax (always after a directive):
● Description:
– Same as PRIVATE, a LASTPRIVATE variable is private to a thread
– Different from PRIVATE, the original instance of a LASTPRIVATE variable is changed after parallel execution
● The value copied back into the original variable is obtained from the sequentially last iteration or section of the enclosing construct
lastprivate (var1, var2, …)
CLAUSE: LASTPRIVATE cont’d
● Example – code:
int i = 0, j = 0;
#pragma omp parallel for lastprivate(j)
for (i = 0; i < 8; i++) {
    j = i;
    printf("i=%d, j=%d\n", i, j);
}
printf("Final val of j is %d\n", j);
*example file name: lastprivate_clause.c
● Example – output:
i=0, j=0
i=7, j=7
i=2, j=2
i=1, j=1
i=6, j=6
i=5, j=5
i=3, j=3
i=4, j=4
Final val of j is 7
After parallel execution, the value of the original “j” is updated based on the last loop iteration
CLAUSE: SHARED
● Syntax (always after a directive):
● Description:– The SHARED clause declares variables in its list to be
shared among all threads in the team.
– For a shared variable:
● There is only one instance of this variable
● Every thread accesses this same instance
● Accesses to shared variables should be properly protected if shared variables will be written (e.g., using critical/atomic directives)
shared (var1, var2, …)
CLAUSE: SHARED cont’d
● Example – code:
int i = 0;
printf("Before: i is %d\n", i);
#pragma omp parallel shared(i)
{
    #pragma omp critical
    {
        i++;
        printf("Within: i is %d\n", i);
    }
}
printf("After: i is %d\n", i);

● Example – output:
Before: i is 0
Within: i is 1
Within: i is 2
Within: i is 3
Within: i is 4
Within: i is 5
Within: i is 6
Within: i is 7
Within: i is 8
Within: i is 9
Within: i is 10
Within: i is 11
Within: i is 12
After: i is 12
The value of “i” is incremented by each thread from 1 to 12 (recall there are 12 threads by default); the CRITICAL directive ensures the updates to “i” are sequential.
*example file name: shared_clause.c
CLAUSE: DEFAULT
● Syntax (always after a directive):
● Description:
– The DEFAULT clause specifies the default scope (shared or not) for all variables in a construct.
– If “none” is the default, programmers must explicitly scope their variables
– The actually allowed types for DEFAULT are compiler-dependent
● Some compilers do not allow default(private)
● For a variable, if:
– the DEFAULT clause is not used;
– and the variable is not explicitly scoped;
– and the variable is not the loop variable of a for construct,
– then the variable is shared by default
default (shared | private | none)
CLAUSE: REDUCTION
● Syntax (always after a directive):
● Description:
– After the parallel execution, a reduction is performed on the variables using the specified operator
– Variables used in REDUCTION clause are treated as FIRSTPRIVATE variables
– You can use most C operators
reduction (operator: var1, var2)
CLAUSE: REDUCTION cont’d
● Example – code:
int i = 0;
int a[4] = {0, 1, 2, 3};
int b[4] = {4, 5, 6, 7};
int result = 0;

#pragma omp parallel for reduction(+:result)
for (i = 0; i < 4; i++)
    result += a[i] * b[i];
printf("Final result is %d\n", result);

● Example – output:
Final result is 38
Although each thread only adds its own products to a private copy of “result”, the final “result” is the reduction (sum in this case) of all the products.
*example file name: reduction_clause.c
CLAUSE: SCHEDULE
● Syntax (only valid for for constructs):
● Description:
– Describes how iterations of the loop are divided among the threads
– Schedule types: STATIC, DYNAMIC, GUIDED, RUNTIME, AUTO
– Chunk size is optional; chunk size is used as a guide for how many iterations should be assigned to each thread
#pragma omp for schedule(type [, chunk])
CLAUSE: SCHEDULE cont’d
● Schedule types:
– STATIC: each thread takes a chunk of iterations; if chunk is not specified, iterations are evenly divided among threads
– DYNAMIC: each thread takes a chunk of iterations; once a thread finishes, a new chunk is assigned to it
– GUIDED: similar to DYNAMIC, except that the chunk size shrinks gradually as more threads request new work to do
– RUNTIME: Allow users to specify schedule type during execution by setting environment variable OMP_SCHEDULE
– AUTO: Let compiler and OpenMP library decide
CLAUSE: NOWAIT
● Syntax (always after a work sharing directive):
● Description:
– There is an implicit barrier after each work sharing construct
nowait
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 10; i++) { SOME WORK }
    // Implicit barrier here
    #pragma omp for
    for (j = 0; j < 10; j++) { SOME WORK }
}
CLAUSE: NOWAIT cont’d
● Description cont’d:
– The NOWAIT clause removes that barrier. Therefore, after a thread finishes the previous construct, it can proceed immediately to the next construct
#pragma omp parallel
{
    #pragma omp for nowait
    for (i = 0; i < 10; i++) { SOME WORK }
    // NO barrier here
    #pragma omp for
    for (j = 0; j < 10; j++) { SOME WORK }
}
CLAUSE: NOWAIT cont’d
● Example – code (without nowait):
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 8; i++)
        printf("i is %d\n", i);
    #pragma omp for
    for (j = 0; j < 8; j++)
        printf("j is %d\n", j);
}

● Example – output:
i is 0
i is 1
i is 6
i is 7
i is 5
i is 2
i is 3
i is 4
j is 0
j is 7
j is 2
j is 1
j is 5
j is 3
j is 4
j is 6
*example file name: nowait_clause.c
Loop i is always executed before loop j
CLAUSE: NOWAIT cont’d
● Example – code (with nowait):
#pragma omp parallel
{
    #pragma omp for nowait
    for (i = 0; i < 8; i++)
        printf("i is %d\n", i);
    #pragma omp for
    for (j = 0; j < 8; j++)
        printf("j is %d\n", j);
}

● Example – output:
i is 0
j is 0
i is 1
j is 1
i is 2
j is 2
i is 3
j is 3
i is 4
j is 4
i is 7
j is 7
i is 5
j is 5
i is 6
j is 6
*example file name: nowait_clause.c
Some of loop i’s iterations are executed after loop j’s iterations have started
Runtime Function: omp_get_thread_num
● Function declaration:
● Description:
– Returns the thread number (id) of the calling thread
#include <omp.h>int omp_get_thread_num(void)
Runtime Function: omp_get_thread_num cont’d
● Example – code:
int i, id;
#pragma omp parallel private(id)
{
    id = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < 16; i++)
        printf("thread %d handles iteration i = %d\n", id, i);
}

● Example – output:
thread 0 handles iteration i = 0
thread 0 handles iteration i = 1
thread 7 handles iteration i = 14
thread 7 handles iteration i = 15
thread 2 handles iteration i = 4
thread 2 handles iteration i = 5
thread 4 handles iteration i = 8
thread 4 handles iteration i = 9
thread 6 handles iteration i = 12
thread 6 handles iteration i = 13
thread 3 handles iteration i = 6
thread 3 handles iteration i = 7
thread 1 handles iteration i = 2
thread 1 handles iteration i = 3
thread 5 handles iteration i = 10
thread 5 handles iteration i = 11
*example file name: ompgetthreadnum_clause.c
Thread number/id
More Run-time Functions
Function Name           Description
omp_set_num_threads     Sets the number of threads that will be used in the next parallel region
omp_get_num_threads     Returns the number of threads currently in the team executing the parallel region from which it is called
omp_get_wtime           Provides a portable wall clock timing routine
omp_get_wtick           Returns a double-precision floating point value equal to the number of seconds between successive clock ticks
*More functions can be found at:https://computing.llnl.gov/tutorials/openMP/#RunTimeLibrary