
University of Massachusetts Amherst • Department of Computer Science

Parallel & Concurrent Programming: OpenMP

Emery Berger
CMPSCI 691W
Spring 2006


Outline

Last time(s): MPI
Point-to-point & collective communication (library calls)

Today: OpenMP
Parallel directives (language extensions to Fortran/C/C++)


Motivation

Take vectors a & b (100 ints each).
Distribute them across all processors.
Each processor: compute the sum of its a[i] * b[i] products.
Print the overall sum.

MPI: use MPI_Scatter, MPI_Gather, or MPI_Reduce:

MPI_Scatter/Gather(sendbuf, cnt, type, recvbuf, recvcnt, type, root, comm)
MPI_Reduce(sendbuf, recvbuf, cnt, type, op, root, comm)


MPI Solution

MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);

// Distribute a and b: each rank receives 100/size elements
MPI_Scatter (a, 100 / size, MPI_INT, a1, 100 / size, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Scatter (b, 100 / size, MPI_INT, b1, 100 / size, MPI_INT, 0, MPI_COMM_WORLD);

// Multiply each chunk
for (int i = 0; i < 100 / size; i++) {
  z += a1[i] * b1[i];
}

// Reduce by summing into rank 0
MPI_Reduce (&z, &zsum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

// Output result
if (rank == 0) {
  cout << zsum << endl;
}


Ideal Solution

int z = 0;
parallel for (i = 0; i < 100; i++) {
  z += a[i] * b[i];
}
cout << z << endl;


OpenMP Solution

int z = 0;
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
  z += a[i] * b[i];
}
cout << z << endl;

OpenMP pragma directives:
Omit them = sequential program
More declarative style
Add more pragmas for more efficiency


OpenMP Concepts

Fork-join model:
One thread executes the sequential code.
Upon reaching a parallel directive, it starts a new team of work-sharing threads, then waits until all are done (usually at a barrier).
Parallel regions can be nested!

Apparently global shared memory, but with a relaxed consistency model.


Consistency

Consistency = the ordering of reads & writes, both within a thread and across threads.

The most "intuitive" consistency model is sequential consistency (Lamport): the program behaves like some sequential execution. BUT: it seriously limits parallelism, since threads must synchronize frequently.


OpenMP Consistency

OpenMP defines consistency in terms of flushes. A flush writes a set of variables back to memory. If two flushes have intersecting variable sets, all threads must see those flushes in some sequential order.


Parallel Execution

#pragma omp parallel executes the next chunk of code across all threads, or across some number of them given by the num_threads(n) clause.

Only the "master thread" continues after the parallel section completes.


Dynamic Threads


Parallel + nowait

There is an implicit barrier at the end of each work-sharing construct unless nowait is specified. A barrier also acts as a flush operation.


Parallel + Memory

Memory model:
Heap objects are shared.
Stack objects are private, including loop iterators...
unless indicated otherwise.


Parallel Example


Data-Sharing Attributes

shared
private: each thread gets its own private copy (undefined initial value)
firstprivate: copies in the original value
lastprivate: copies out the private value


Lastprivate Example


Threadprivate Example

Can also declare variables as always thread-private.


Reduce

reduction: a private value per thread, initialized "appropriately" for one of the predefined operators and copied out to the original variable at the end.

reduction(+:a) initializes a = 0
reduction(*:a) initializes a = 1


OpenMP Solution

int z = 0;
#pragma omp parallel for reduction(+:z)
for (int i = 0; i < 100; i++) {
  z += a[i] * b[i];
}
cout << z << endl;

OpenMP pragma directives:
Omit them = sequential program
More declarative style
Add more pragmas for more efficiency


All Together


But Still Races...


Master & Synchronization

master: always run by the master thread.
critical: declares a critical section (one thread at a time); names can be added for greater concurrency.
barrier
atomic: the following update executes atomically (a++, a--, etc.).
ordered: executes the loop body sequentially, in iteration order.


Atomic Example


The End


Single Example



Ordered For


Copyin Example


Copyprivate Example



The End

Next time: OpenMP