Parallel Programming Models
Parallel Programming Models:
• Data parallelism / Task parallelism
• Explicit parallelism / Implicit parallelism
• Shared memory / Distributed memory
• Other programming paradigms
  • Object-oriented
  • Functional and logic
Parallel Programming Models
Shared Memory
The programmer's task is to specify the activities of a set of processes that communicate by reading and writing shared memory.
• Advantage: the programmer need not be concerned with data-distribution issues.
• Disadvantage: high-performance implementations may be difficult on computers that lack hardware support for shared memory, and race conditions tend to arise more easily (see the sketch below).
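A minimal sketch of such a race (a hypothetical example using POSIX threads, not part of the original slides): two threads increment a shared counter without synchronization, so concurrent read-modify-write sequences can overwrite each other and updates are lost.

#include <pthread.h>
#include <stdio.h>

long counter = 0;                 /* shared by both threads */

void *worker(void *arg) {
    long i;
    for (i = 0; i < 1000000; i++)
        counter++;                /* unsynchronized read-modify-write: a race */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Frequently prints less than 2000000: increments were lost. */
    printf("counter = %ld\n", counter);
    return 0;
}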
Distributed Memory
Processes have only local memory and must use some other mechanism (e.g., message passing or remote procedure call) to exchange information (see the message-passing sketch below).
• Advantage: programmers have explicit control over data distribution and communication.
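For contrast, a minimal message-passing sketch using MPI (introduced with the other tools below); the point is that a value must be explicitly sent from one process's local memory to another's.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                                 /* lives in process 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d\n", value);   /* a copy now in process 1's memory */
    }
    MPI_Finalize();
    return 0;
}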
Shared vs Distributed Memory
[Figure: shared memory, four processors (P) connected by a bus to a single shared memory; distributed memory, four processors each with its own local memory (M), connected by a network]
Parallel Programming Models
Parallel Programming Tools:
• Parallel Virtual Machine (PVM): distributed memory, explicit parallelism
• Message-Passing Interface (MPI): distributed memory, explicit parallelism
• PThreads: shared memory, explicit parallelism
• OpenMP: shared memory, explicit parallelism
• High-Performance Fortran (HPF): implicit parallelism
• Parallelizing compilers: implicit parallelism
Parallel Programming Models
Shared Memory Model
• Used on shared-memory MIMD architectures.
• A program consists of many independent threads.
• Concurrently executing threads all share a single, common address space.
• Threads can exchange information by reading and writing memory using normal variable-assignment operations.
Parallel Programming Models
Memory Coherence Problem
The system must ensure that the latest value of a variable updated in one thread is used when that same variable is accessed in another thread.
• Hardware support and compiler support are required.
• Cache-coherency protocols provide this support in hardware.
[Figure: a variable X shared between Thread 1 and Thread 2]
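As an illustration, consider a classic producer-consumer handoff. The sketch below (an assumed example using the OpenMP flush directive, which forces a consistent view of memory) shows the kind of support needed so that thread 0's update of x becomes visible to thread 1.

#include <omp.h>
#include <stdio.h>

int main(void) {
    int x = 0, flag = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {      /* thread 0: producer */
            x = 42;
            #pragma omp flush(x, flag)        /* make the new value of x visible */
            flag = 1;
            #pragma omp flush(flag)
        } else {                              /* thread 1: consumer */
            int ready = 0;
            while (!ready) {
                #pragma omp flush(flag)       /* re-read flag from memory */
                ready = flag;
            }
            #pragma omp flush(x)
            printf("x = %d\n", x);            /* sees 42 */
        }
    }
    return 0;
}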
Parallel Programming Models
Distributed Shared Memory (DSM) Systems
• Implement the shared memory model on distributed-memory MIMD architectures.
• Concurrently executing threads all share a single, common address space.
• Threads can exchange information by reading and writing memory using normal variable-assignment operations.
• A message-passing layer serves as the means for communicating updated values throughout the system.
Parallel Programming Models
Synchronization operations in the Shared Memory Model:
• Monitors
• Locks (see the OpenMP sketch below)
• Critical sections
• Condition variables
• Semaphores
• Barriers
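As a concrete instance, OpenMP exposes locks through run-time library calls; a minimal sketch (an assumed example, using the standard omp_*_lock routines covered by the OpenMP run-time library):

#include <omp.h>
#include <stdio.h>

int main(void) {
    int counter = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);
    #pragma omp parallel
    {
        omp_set_lock(&lock);      /* enter the critical region */
        counter++;                /* protected update of shared data */
        omp_unset_lock(&lock);    /* leave the critical region */
    }
    omp_destroy_lock(&lock);
    printf("counter = %d\n", counter);
    return 0;
}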
OpenMP
• Shared-memory programming model
• Thread-based parallelism
• Fork/join model
• Compiler-directive based
• No support for parallel I/O
OpenMP
Master thread executes the sequential sections.
Master thread forks additional threads.
At the end of the parallel code, the created threads terminate and control returns to the master thread (join).
[Figure: fork/join execution: the master thread forks a team of threads at each parallel region and joins them at the end of the region]
OpenMP
General Code Structure

#include <omp.h>

main () {
  int var1, var2, var3;

  /* Serial code */
  ...

  /* Fork a team of threads; specify variable scoping */
  #pragma omp parallel private(var1, var2) shared(var3)
  {
    /* Parallel section executed by all threads */
    ...
    /* All threads join the master thread and disband */
  }

  /* Resume serial code */
  ...
}
OpenMP
General Code Structure: Example

#include <omp.h>
#include <stdio.h>

main () {
  int nthreads, tid;

  #pragma omp parallel private(tid) /* Fork a team of threads */
  {
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    if (tid == 0) { /* master thread */
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and terminate */
}
OpenMP
parallel Directive
The execution of the code block after the parallel pragma is replicated among the threads.

#include <omp.h>

main () {
  struct job_struct job_ptr;
  struct task_struct *task_ptr;   /* used as a pointer below */

  #pragma omp parallel private(task_ptr)
  {
    task_ptr = get_next_task(&job_ptr);
    while (task_ptr != NULL) {
      complete_task(task_ptr);
      task_ptr = get_next_task(&job_ptr);
    }
  }
}
[Figure: job_ptr is a shared variable; the master thread and thread 1 each have a private task_ptr]
OpenMP
parallel for Directive

#include <omp.h>

main () {
  int i;
  float b[5];

  #pragma omp parallel for
  for (i=0; i < 5; i++)
    b[i] = i;
}
[Figure: b is shared; the master thread (0) and thread 1 each have a private copy of i]
In parallel for, variables are shared by default, with the exception of the loop index variable, which is private.
OpenMP
Execution context: the address space containing all of the variables a thread may access.
Shared variable: has the same address in the execution context of every thread.
Private variable: has a different address in the execution context of every thread.
OpenMP
private Clause: declares the variables in its list to be private to each thread.
private (list)

#include <omp.h>

main () {
  int i, j, n;   /* j must be declared before it can be made private */
  float a[10][10];

  n = 10;
  #pragma omp parallel for private(j)
  for (i=0; i < n; i++)
    for (j=0; j < n; j++)
      a[i][j] = a[i][j] + i;
}
OpenMP
critical Directive
Directs the compiler to enforce mutual exclusion among the threads executing the block of code.
#include <omp.h>
main() {
int x;
x = 0;
#pragma omp parallel shared(x)
{
#pragma omp critical
x = x + 1;
} /* end of parallel section */
}
OpenMP
reduction Clause

reduction (operator : variable)

#include <omp.h>

main () {
  int i, n;
  float a, x, p;

  n = 100;
  a = 0.0;
  #pragma omp parallel for private(x) reduction(+:a)
  for (i=0; i < n; i++) {
    x = i/10.0;
    a += x*x;
  }
  p = a/n;
}
OpenMP
reduction Operators
+ addition
- subtraction
* multiplication
& bitwise and
| bitwise or
^ bitwise exclusive or
&& conditional and
|| conditional or
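The logical operators are useful for reducing a per-iteration test to a single flag. A short sketch (a hypothetical example; the function and variable names are ours):

#include <omp.h>

int all_positive(const float *a, int n) {
    int ok = 1;    /* 1 is the identity element for && */
    int i;
    #pragma omp parallel for reduction(&&:ok)
    for (i = 0; i < n; i++)
        ok = ok && (a[i] > 0.0f);   /* each thread ANDs its own results */
    return ok;     /* the private copies are combined with && at the join */
}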
OpenMP
Loop Scheduling: allows the iterations of a loop to be allocated to threads.
Static schedule: all iterations are allocated to threads before any loop iterations execute.
• Low overhead
• Potentially high load imbalance
Dynamic schedule: only some of the iterations are allocated to threads at the beginning of the loop's execution; threads that complete their iterations are then eligible to get additional work.
• Higher overhead
• Reduced load imbalance
OpenMP
schedule Clause
schedule (type [, chunk])
• type: static, dynamic, etc.
• chunk: the number of contiguous iterations assigned to each thread
Increasing the chunk size can reduce overhead and increase the cache hit rate.
OpenMP
schedule Clause: Example

#include <omp.h>

main () {
  int i, n;
  float a[10];

  n = 10;
  #pragma omp parallel for private(i) schedule(static,5)
  for (i=0; i < n; i++)
    a[i] = a[i] + i;
}
OpenMP
schedule Clause: Example

#include <omp.h>
#define CHUNKSIZE 100
#define N 1000

main () {
  int i, chunk;
  float a[N], b[N], c[N];

  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;

  #pragma omp parallel shared(a,b,c,chunk) private(i)
  {
    /* Iterations will be distributed dynamically in chunk-sized pieces */
    #pragma omp for schedule(dynamic,chunk) nowait
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];
  } /* end of parallel section */
}
OpenMP
nowait Clause
Tells the compiler to omit the barrier synchronization at the end of the parallel for loop.

#include <omp.h>
#define CHUNKSIZE 100
#define N 1000

main () {
  int i, chunk;
  float a[N], b[N], c[N];

  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;

  #pragma omp parallel shared(a,b,c,chunk) private(i)
  {
    #pragma omp for schedule(dynamic,chunk) nowait
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];
  }
}
OpenMP
for Directive
Specifies that the iterations of the loop immediately following it must be executed in parallel by the team.

Sequential version:

for (i=0; i < n; i++) {
  low = a[i];
  high = b[i];
  if (low > high) {
    break;
  }
  for (j=low; j < high; j++)
    c[j] = (c[j]-a[i])/b[i];
}

Parallel version:

#pragma omp parallel private(i,j)
for (i=0; i < n; i++) {
  low = a[i];
  high = b[i];
  if (low > high) {
    break;
  }
  #pragma omp for
  for (j=low; j < high; j++)
    c[j] = (c[j]-a[i])/b[i];
}
OpenMP
single Directive
Specifies that the enclosed code is to be executed by only one thread in the team. Threads in the team that do not execute the single directive wait at the end of the enclosed code block.

Sequential version:

for (i=0; i < n; i++) {
  low = a[i];
  high = b[i];
  if (low > high) {
    break;
  }
  for (j=low; j < high; j++)
    c[j] = (c[j]-a[i])/b[i];
}

Parallel version:

#pragma omp parallel private(i,j)
for (i=0; i < n; i++) {
  low = a[i];
  high = b[i];
  if (low > high) {
    #pragma omp single
    break;
  }
  #pragma omp for
  for (j=low; j < high; j++)
    c[j] = (c[j]-a[i])/b[i];
}
OpenMP
#include <omp.h>
#include <stdio.h>

int a, b, i, tid;
float x;
#pragma omp threadprivate(a, x)

main () {
  omp_set_dynamic(0); /* Explicitly turn off dynamic threads */

  #pragma omp parallel private(b,tid)
  {
    tid = omp_get_thread_num();
    a = tid;
    b = tid;
    x = 1.1 * tid + 1.0;
    printf("Thread %d: a,b,x= %d %d %f\n", tid, a, b, x);
  } /* end of parallel section */

  printf("Master thread doing serial work here\n");

  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    printf("Thread %d: a,b,x= %d %d %f\n", tid, a, b, x);
  } /* end of parallel section */
}
Output:
Thread 0: a,b,x= 0 0 1.000000
Thread 2: a,b,x= 2 2 3.200000
Thread 3: a,b,x= 3 3 4.300000
Thread 1: a,b,x= 1 1 2.100000
Master thread doing serial work here
Thread 0: a,b,x= 0 0 1.000000
Thread 3: a,b,x= 3 0 4.300000
Thread 1: a,b,x= 1 0 2.100000
Thread 2: a,b,x= 2 0 3.200000
threadprivate Directive
The threadprivate directive makes global file-scope variables (C/C++) private to each thread and persistent across the execution of multiple parallel regions.
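In the output above, the threadprivate a and x retain their per-thread values in the second parallel region, while b (private only in the first region) reads as the never-assigned global. A related clause, copyin, initializes each thread's threadprivate copy from the master's value; a minimal sketch (an assumed example, not from the original slides):

#include <omp.h>
#include <stdio.h>

int a;
#pragma omp threadprivate(a)

int main(void) {
    a = 99;                        /* set in the master thread */
    #pragma omp parallel copyin(a) /* every thread starts with a == 99 */
    {
        printf("thread %d sees a = %d\n", omp_get_thread_num(), a);
    }
    return 0;
}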
OpenMP
parallel sections Directive – Functional Parallelism
Specifies that the enclosed section(s) of code are to be divided among the threads in the team and executed concurrently.

#include <omp.h>

main () {
  ...
  #pragma omp parallel sections
  {
    #pragma omp section /* thread 1 */
    v = alpha();
    #pragma omp section /* thread 2 */
    w = beta();
    #pragma omp section /* thread 3 */
    y = delta();
  } /* end of parallel sections */

  x = gamma(v, w);
  printf("%f\n", epsilon(x, y));
}
OpenMP
parallel sections Directive – Functional Parallelism
Another solution:

main () {
  ...
  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section /* thread 1 */
      v = alpha();
      #pragma omp section /* thread 2 */
      w = beta();
    }
    #pragma omp sections
    {
      #pragma omp section /* thread 3 */
      x = gamma(v, w);
      #pragma omp section /* thread 4 */
      y = delta();
    }
  } /* end of parallel section */

  printf("%f\n", epsilon(x, y));
}
OpenMP
section Directive: Example

#include <omp.h>
#define N 1000

main () {
  int i;
  float a[N], b[N], c[N], d[N];

  for (i=0; i < N; i++) {
    a[i] = i * 1.5;
    b[i] = i + 22.35;
  }

  #pragma omp parallel shared(a,b,c,d) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i=0; i < N; i++)
        c[i] = a[i] + b[i];

      #pragma omp section
      for (i=0; i < N; i++)
        d[i] = a[i] * b[i];
    } /* end of sections */
  } /* end of parallel section */
}
OpenMP
Synchronization constructs:
• master directive: specifies a region that is to be executed only by the master thread of the team
• critical directive: specifies a region of code that must be executed by only one thread at a time
• barrier directive: synchronizes all threads in the team
• atomic directive: specifies that a specific memory location must be updated atomically (a mini critical section); a combined sketch follows below
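A small sketch combining three of these constructs (an assumed example, not from the original slides):

#include <omp.h>
#include <stdio.h>

int main(void) {
    int hits = 0;
    #pragma omp parallel
    {
        #pragma omp master
        printf("setup done by the master thread only\n");

        #pragma omp barrier        /* all threads wait until the master is done */

        #pragma omp atomic         /* mini critical section for one update */
        hits += 1;
    }
    printf("hits = %d\n", hits);   /* one increment per thread */
    return 0;
}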
OpenMP
Run-time Library Routines:
• omp_set_num_threads(int): sets the number of threads that will be used in the next parallel region
• omp_get_num_threads(void): returns the number of threads that are currently executing the parallel region
• omp_get_thread_num(void): returns the calling thread's number within the team
• omp_get_num_procs(void): returns the number of processors
• omp_in_parallel(void): used to determine whether the code is currently executing in parallel (a usage sketch follows)
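A short sketch exercising these routines (an assumed example):

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("processors available: %d\n", omp_get_num_procs());
    printf("in parallel? %d\n", omp_in_parallel());   /* 0 in the serial region */

    omp_set_num_threads(4);   /* request 4 threads for the next parallel region */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)    /* only the master reports */
            printf("team size: %d, in parallel? %d\n",
                   omp_get_num_threads(), omp_in_parallel());
    }
    return 0;
}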
Parallel Programming Models
Example: Pi calculation
π = ∫₀¹ f(x) dx = ∫₀¹ 4/(1+x²) dx ≈ w ∑ᵢ f(xᵢ)

f(x) = 4/(1+x²)
n = 10
w = 1/n
xᵢ = w(i − 0.5)

[Figure: midpoint-rule approximation of f(x) = 4/(1+x²) on [0, 1], with x-axis ticks at 0, 0.1, 0.2, ..., xᵢ, ..., 1]
Parallel Programming Models
Sequential Code

#include <stdio.h>
#define f(x) 4.0/(1.0+x*x)

main() {
  int n, i;
  float w, x, sum, pi;

  printf("n?\n");
  scanf("%d", &n);
  w = 1.0/n;
  sum = 0.0;
  for (i=1; i<=n; i++) {
    x = w*(i-0.5);
    sum += f(x);
  }
  pi = w*sum;
  printf("%f\n", pi);
}
OpenMP
Parallel Code

#include <omp.h>
#include <stdio.h>

#define f(x) 4.0/(1.0+x*x)
#define NUM_THREADS 4

main() {
  float sum, w, pi, x;
  int i, n;

  printf("n?\n");
  scanf("%d", &n);   /* read the problem size, as in the sequential version */
  sum = 0.0;
  w = 1.0/n;
  omp_set_num_threads(NUM_THREADS);

  #pragma omp parallel for private(x)
  for (i=0; i<n; i++) {
    x = (i+0.5)*w;
    #pragma omp critical
    sum += f(x);
  }
  pi = sum*w;
  printf("pi=%f\n", pi);
}
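Because every thread funnels through the critical section on each iteration, the version above serializes the accumulation. Since sum is combined with +, the loop is more naturally written with the reduction clause introduced earlier; a variant sketch (our rewrite, not from the original slides):

#include <omp.h>
#include <stdio.h>

#define f(x) 4.0/(1.0+x*x)
#define NUM_THREADS 4

main() {
  float sum, w, pi, x;
  int i, n;

  printf("n?\n");
  scanf("%d", &n);
  sum = 0.0;
  w = 1.0/n;
  omp_set_num_threads(NUM_THREADS);

  /* Each thread accumulates into a private copy of sum;
     the copies are combined with + when the threads join. */
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i=0; i<n; i++) {
    x = (i+0.5)*w;
    sum += f(x);
  }
  pi = sum*w;
  printf("pi=%f\n", pi);
}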