Introduction to OpenMP
Presenter: Vengada Karthik Rangaraju
Fall 2012 Term
September 13th, 2012
What is OpenMP?
• Open standard for shared-memory multiprocessing
• Goal: exploit multicore hardware with shared memory
• Programmer's view: the OpenMP API
• Structure: three primary API components:
  – Compiler directives
  – Runtime library routines
  – Environment variables
Shared Memory Architecture in a Multi-Core Environment
The key components of the API and its functions
• Compiler directives
  - Spawning parallel regions (threads)
  - Synchronizing
  - Dividing blocks of code among threads
  - Distributing loop iterations
The key components of the API and its functions
• Runtime library routines
  - Setting & querying the number of threads
  - Nested parallelism
  - Control over locks
  - Thread information
  (a short sketch of these routines follows)
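A minimal sketch of a few of these routines; the printf body is illustrative only:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(4);             // request 4 threads for the next region

        omp_lock_t lock;
        omp_init_lock(&lock);               // runtime-managed lock

        #pragma omp parallel
        {
            omp_set_lock(&lock);            // one thread at a time past this point
            printf("Thread %d of %d\n",
                   omp_get_thread_num(),    // this thread's id
                   omp_get_num_threads());  // team size inside the region
            omp_unset_lock(&lock);
        }

        omp_destroy_lock(&lock);
        return 0;
    }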
The key components of the API and its functions
• Environment variables
  - Setting the number of threads
  - Specifying how loop iterations are divided
  - Thread-processor binding
  - Enabling/disabling dynamic threads
  - Nested parallelism
  (a short sketch follows)
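A minimal sketch of how these variables take effect. The variable names (OMP_NUM_THREADS, OMP_SCHEDULE, OMP_PROC_BIND, OMP_DYNAMIC, OMP_NESTED) are standard OpenMP; the program itself is illustrative:

    // Environment variables are read at program start, e.g. (bash):
    //   OMP_NUM_THREADS=8 OMP_SCHEDULE="dynamic,4" OMP_PROC_BIND=true ./a.out
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            #pragma omp single          // one thread reports the team size
            printf("Running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }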
Goals
• Standardization
• Ease of use
• Portability
Paradigm for using OpenMP

1. Write a sequential program.
2. Find the parallelizable portions of the program.
3. Insert directives/pragmas into the existing code; insert calls to runtime library routines and modify environment variables, if desired.
4. Compile with an OpenMP-capable compiler.
5. Run!
What happens here?
Compiler translation
#pragma omp <directive-type> <directive-clauses>
{
    // ... block of code executed as the directive instructs ...
}
Basic Example in C
// ... sequential part ...
#pragma omp parallel    // fork
{
    printf("Hello from thread %d.\n", omp_get_thread_num());
}                       // join
// ... sequential part ...
What exactly happens when lines of code are executed in parallel?
• A team of threads is created
• Each thread can have its own set of private variables
• All threads can share variables
• The original thread is the master thread
• Fork-join model
• Nested parallelism is possible
OpenMP life cycle – Petri net model
Compiler directives – The Multi-Core Magic Spells!
<directive type>    Description
parallel            Each thread performs the same computation as the
                    others (replicated computation).
for / sections      These are called workshare directives. Portions of
                    the overall work are divided among threads (different
                    computations). They do not create threads themselves;
                    they must be enclosed inside a parallel directive for
                    threads to take over the divided work (see the sketch
                    below).
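As a sketch of that rule (the array a and bound n are placeholders): #pragma omp for only divides iterations among the threads of an enclosing parallel region, and OpenMP also offers the combined parallel for shorthand.

    #pragma omp parallel                 // creates the team of threads
    {
        #pragma omp for                  // divides the iterations among that team
        for (int i = 0; i < n; i++)
            a[i] = 2 * a[i];
    }

    // Combined shorthand for the same pattern:
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2 * a[i];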
Compiler directives – The Multi-Core Magic Spells!
• Types of workshare directives
for         Countable iterations; the trip count is known when the
            loop is encountered [static]
sections    One or more sequential sections of code, each executed
            by a single thread
single      Serializes a section of code: exactly one thread
            executes it (see the sketch below)
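A brief sketch of single (the printf body is illustrative): exactly one thread runs the block, while the rest of the team waits at the implicit barrier at its end.

    #pragma omp parallel
    {
        #pragma omp single               // one thread executes this block;
        {                                // the others wait at its implicit barrier
            printf("Setup done by thread %d\n", omp_get_thread_num());
        }
        // ... work performed by the whole team ...
    }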
Compiler directives – The Multi-Core Magic Spells!
• Clauses associated with each directive
<directive type>    <directive clause>
parallel            if(expression)
                    private(var1, var2, …)
                    firstprivate(var1, var2, …)
                    shared(var1, var2, …)
                    num_threads(integer-value)
                    (a sketch combining these follows)
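A sketch combining several of these clauses (n, scale, and tmp are placeholder variables): the region is parallel only if the if() expression is true, at most four threads are used, each thread gets an uninitialized private tmp, and each thread's scale starts from the value it held before the region.

    int n = 100000;
    double scale = 2.0, tmp;

    #pragma omp parallel if(n > 10000) num_threads(4) \
            private(tmp) firstprivate(scale) shared(n)
    {
        tmp = scale * omp_get_thread_num();   // per-thread private result
        printf("Thread %d: tmp = %f\n", omp_get_thread_num(), tmp);
    }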
Compiler directives – The Multi-Core Magic Spells!
• Clauses associated with each directive
<directive type>    <directive clause>
for                 schedule(type, chunk)
                    private(var1, var2, …)
                    firstprivate(var1, var2, …)
                    lastprivate(var1, var2, …)
                    shared(var1, var2, …)
                    collapse(n)
                    nowait
                    reduction(operator:list)
                    (a sketch of collapse and nowait follows)
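A sketch of the loop-specific clauses (process() is a hypothetical function doing independent per-cell work): collapse(2) fuses the two loops into a single iteration space before dividing it, and nowait removes the implicit barrier at the end of the loop.

    #pragma omp parallel
    {
        #pragma omp for collapse(2) schedule(dynamic, 8) nowait
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                process(i, j);           // hypothetical independent work

        // with nowait, threads continue here without waiting for each other
    }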
Compiler directives – The Multi-Core Magic Spells!
• Clauses associated with each directive
<directive type>    <directive clause>
sections            private(var1, var2, …)
                    firstprivate(var1, var2, …)
                    lastprivate(var1, var2, …)
                    reduction(operator:list)
                    nowait
Matrix Multiplication using loop directive
#pragma omp parallel private(i, j, k)
{
    #pragma omp for
    for (i = 0; i < N; i++)
        for (k = 0; k < K; k++)
            for (j = 0; j < M; j++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
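For reference, a self-contained, runnable version of this fragment; the matrix sizes and test values are arbitrary choices, and omp_get_wtime() is added for rough timing:

    #include <stdio.h>
    #include <omp.h>

    #define N 512
    #define K 512
    #define M 512

    static double A[N][K], B[K][M], C[N][M];   // static arrays start zeroed

    int main(void)
    {
        int i, j, k;

        for (i = 0; i < N; i++)                // arbitrary test data
            for (k = 0; k < K; k++)
                A[i][k] = 1.0;
        for (k = 0; k < K; k++)
            for (j = 0; j < M; j++)
                B[k][j] = 1.0;

        double start = omp_get_wtime();
        #pragma omp parallel private(i, j, k)
        {
            #pragma omp for
            for (i = 0; i < N; i++)
                for (k = 0; k < K; k++)
                    for (j = 0; j < M; j++)
                        C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }
        printf("C[0][0] = %.1f, took %.3f s\n", C[0][0], omp_get_wtime() - start);
        return 0;
    }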
Scheduling Parallel Loops
• Static
• Dynamic
• Guided
• Automatic
• Runtime
Scheduling Parallel Loops
• Static
  - The amount of work per iteration is the same
  - Contiguous chunks of iterations are handed to threads in round-robin (RR) fashion
  - 1 chunk = x iterations
Scheduling Parallel Loops
• Dynamic
  - The amount of work per iteration varies
  - Each thread grabs a chunk of iterations and returns to grab another chunk when it has executed them

• Guided
  - Same as dynamic; the only difference is that chunk sizes shrink as the loop progresses, each chunk being proportional to the number of iterations remaining
Scheduling Parallel Loops
• Runtime
  - The schedule is determined at run time via the OMP_SCHEDULE environment variable; a library routine (omp_set_schedule) is also provided (see the sketch below)

• Automatic
  - The implementation chooses any schedule
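A sketch of the runtime schedule (work() and n are placeholders): the schedule is picked when the program starts, e.g. OMP_SCHEDULE="guided,16" ./a.out in bash, or set from the program with omp_set_schedule():

    omp_set_schedule(omp_sched_dynamic, 4);   // library-routine alternative to
                                              // the OMP_SCHEDULE variable
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        work(i);                              // hypothetical loop body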
Matrix Multiplication using loop directive – with a schedule
#pragma omp parallel private(i, j, k)
{
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++)
        for (k = 0; k < K; k++)
            for (j = 0; j < M; j++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
OpenMP workshare directive – sections
int g;

void foo(int m, int n)
{
    int p, i;

    // assumes an enclosing parallel region, as noted earlier;
    // otherwise the sections execute serially
    #pragma omp sections firstprivate(g) nowait
    {
        #pragma omp section
        {
            p = f1(g);
            for (i = 0; i < m; i++)
                do_stuff();
        }

        #pragma omp section
        {
            p = f2(g);
            for (i = 0; i < n; i++)
                do_other_stuff();
        }
    }
    return;
}
Parallelizing when the number of iterations is unknown (dynamic)!

• OpenMP has a directive called task
Explicit Tasks

void processList(Node* list)
{
    #pragma omp parallel
    #pragma omp single
    {
        Node *currentNode = list;
        while (currentNode) {
            #pragma omp task firstprivate(currentNode)
            doWork(currentNode);
            currentNode = currentNode->next;
        }
    }
}
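A runnable version of the same idea; the Node type and doWork() are hypothetical stand-ins, since the slide leaves them undefined:

    #include <stdio.h>
    #include <omp.h>

    typedef struct Node {                // hypothetical list node
        int value;
        struct Node *next;
    } Node;

    static void doWork(Node *n)          // hypothetical per-node work
    {
        printf("Node %d processed by thread %d\n",
               n->value, omp_get_thread_num());
    }

    void processList(Node *list)
    {
        #pragma omp parallel
        #pragma omp single               // one thread walks the list, creating tasks
        {
            Node *currentNode = list;
            while (currentNode) {
                #pragma omp task firstprivate(currentNode)
                doWork(currentNode);
                currentNode = currentNode->next;
            }
        }                                // implicit barrier: all tasks finish here
    }

    int main(void)
    {
        Node c = {3, NULL}, b = {2, &c}, a = {1, &b};
        processList(&a);
        return 0;
    }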
Explicit tasks – Petri net model
Synchronization
• Barrier
• Critical
• Atomic
• Flush
(a sketch contrasting these follows)
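A sketch contrasting three of these (compute() is a hypothetical per-thread function): atomic protects a single memory update cheaply, critical guards an arbitrary block, and barrier holds every thread until all arrive. flush, not shown, forces a consistent view of shared variables and is rarely needed directly.

    int counter = 0;
    double best = 0.0;

    #pragma omp parallel
    {
        #pragma omp atomic               // cheap: protects one memory update
        counter++;

        double candidate = compute();    // hypothetical per-thread result

        #pragma omp critical             // general mutual exclusion
        {
            if (candidate > best)
                best = candidate;
        }

        #pragma omp barrier              // no thread proceeds until all arrive
        // ... a later phase can now safely read counter and best ...
    }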
Performing Reductions
• A loop containing a reduction appears inherently sequential, since each iteration updates a result that depends on the previous iteration.
• OpenMP allows such loops to be parallelized, as long as the developer declares that the loop contains a reduction and indicates the variable and the kind of reduction via clauses.
Without using reduction:

#pragma omp parallel shared(array, sum) firstprivate(local_sum)
{
    #pragma omp for private(i, j)
    for (i = 0; i < max_i; i++) {
        for (j = 0; j < max_j; ++j)
            local_sum += array[i][j];
    }

    #pragma omp critical
    sum += local_sum;
}
Using Reductions in OpenMP
sum = 0;
#pragma omp parallel shared(array)
{
    #pragma omp for reduction(+:sum) private(i, j)
    for (i = 0; i < max_i; i++) {
        for (j = 0; j < max_j; ++j)
            sum += array[i][j];
    }
}
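A self-contained, runnable version of the reduction example; the array size and contents are arbitrary:

    #include <stdio.h>
    #include <omp.h>

    #define MAX_I 1000
    #define MAX_J 1000

    static int array[MAX_I][MAX_J];

    int main(void)
    {
        int i, j;
        long sum = 0;

        for (i = 0; i < MAX_I; i++)      // arbitrary test data
            for (j = 0; j < MAX_J; j++)
                array[i][j] = 1;

        #pragma omp parallel for reduction(+:sum) private(j)
        for (i = 0; i < MAX_I; i++)
            for (j = 0; j < MAX_J; j++)
                sum += array[i][j];

        printf("sum = %ld (expected %d)\n", sum, MAX_I * MAX_J);
        return 0;
    }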
Programming for performance
• Use the if clause before creating parallel regions
• Understand cache coherence
• Use parallel and flush judiciously
• critical and atomic – know the difference!
• Avoid unnecessary computations in critical regions
• Use barrier sparingly – a starvation alert!
References

• NUMA/UMA
http://vvirtual.wordpress.com/2011/06/13/what-is-numa/
http://www.e-zest.net/blog/non-uniform-memory-architecture-numa/
• OpenMP basics
https://computing.llnl.gov/tutorials/openMP/
• Workshop on OpenMP/SMP, by Tim Mattson of Intel (video)
http://www.youtube.com/watch?v=TzERa9GA6vY
Interesting links
• OpenMP official page
http://openmp.org/wp/
• 32 OpenMP Traps for C++ Developers
http://www.viva64.com/en/a/0054/#ID0EMULM