Programming with Shared Memory

Chapter 8 of Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 1998.

Figure 8.1 Shared memory multiprocessor using a single bus: processors with caches connected by a bus to memory modules.

In a shared memory system, any memory location is accessible by any of the processors.

A single address space exists, meaning that each memory location is given a unique address within a single range of addresses.

For a small number of processors, a common architecture is the single-bus architecture, in which all the processors and memory modules attach to a common bus, as shown in Figure 8.1.

Several Alternatives for Programming:

• Using a new programming language

• Modifying an existing sequential language

• Using library routines with an existing sequential language

• Using a sequential programming language and asking a parallelizing compiler to convert it into parallel executable code

• UNIX processes

• Threads (Pthreads, Java, ...)

TABLE 8.1 SOME EARLY PARALLEL PROGRAMMING LANGUAGES

Language           Originator/date                  Comments
Concurrent Pascal  Brinch Hansen, 1975 (a)          Extension to Pascal
Ada                U.S. Dept. of Defense, 1979 (b)  Completely new language
Modula-P           Bräunl, 1986 (c)                 Extension to Modula 2
C*                 Thinking Machines, 1987 (d)      Extension to C for SIMD systems
Concurrent C       Gehani and Roome, 1989 (e)       Extension to C
Fortran D          Fox et al., 1990 (f)             Extension to Fortran for data parallel programming

a. Brinch Hansen, P. (1975), "The Programming Language Concurrent Pascal," IEEE Trans. Software Eng., Vol. 1, No. 2 (June), pp. 199–207.
b. U.S. Department of Defense (1981), "The Programming Language Ada Reference Manual," Lecture Notes in Computer Science, No. 106, Springer-Verlag, Berlin.
c. Bräunl, T., R. Norz (1992), Modula-P User Manual, Computer Science Report, No. 5/92 (August), Univ. Stuttgart, Germany.
d. Thinking Machines Corp. (1990), C* Programming Guide, Version 6, Thinking Machines System Documentation.
e. Gehani, N., and W. D. Roome (1989), The Concurrent C Programming Language, Silicon Press, New Jersey.
f. Fox, G., S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu (1990), Fortran D Language Specification, Technical Report TR90-141, Dept. of Computer Science, Rice University.

Figure 8.2 FORK-JOIN construct: a main program issues FORK statements to create spawned processes; each path of execution terminates at a JOIN, after which execution continues sequentially.

Constructs for Specifying Parallelism

Creating Concurrent Processes – FORK-JOIN Construct

FORK-JOIN constructs have been applied as extensions to FORTRAN and to the UNIX operating system.

A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.

When both JOIN statements have been reached, processing continues in a sequential fashion.

UNIX Heavyweight Processes

The UNIX system call fork() creates a new process. The new process (child process) is an exact copy of the calling process except that it has a unique process ID. It has its own copy of the parent's variables. They are initially assigned the same values as the original variables.

The forked process starts execution at the point of the fork.

On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.

Processes are “joined” with the system calls wait() and exit() defined as

wait(statusp); /*delays caller until signal received or one of its */

/*child processes terminates or stops */

exit(status); /*terminates a process. */

Hence, a single child process can be created by

.

pid = fork(); /* fork */

.

Code to be executed by both child and parent

.

if (pid == 0) exit(0); else wait(0); /* join */

.

If the child is to execute different code, we could use

pid = fork();

if (pid == 0) {

.

code to be executed by slave

.

} else {

.

Code to be executed by parent

.

}

if (pid == 0) exit(0); else wait(0);

.
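A complete, compilable version of this join pattern might look like the following sketch (POSIX assumed; the printed messages are illustrative and error handling is minimal):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                              /* fork */
    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0)
        printf("child:  pid = %d\n", getpid());      /* code executed by child */
    else
        printf("parent: child pid = %d\n", pid);     /* code executed by parent */

    if (pid == 0) exit(0); else wait(NULL);          /* join */
    printf("parent continues after the join\n");
    return 0;
}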

Figure 8.3 Differences between a process and threads: (a) a process has its own code, heap, files, interrupt routines, stack, and instruction pointer (IP); (b) threads within a process share the code, heap, files, and interrupt routines, but each thread has its own stack and IP.

Threads

The process created with UNIX fork is a "heavyweight" process; it is a completely separate program with its own variables, stack, and memory allocation.

A much more efficient mechanism is the thread, which shares the same memory space and global variables between routines.

Pthreads

IEEE Portable Operating System Interface (POSIX) Section 1003.1 standard.

Executing a Pthread Thread

Figure 8.4 pthread_create() and pthread_join(): the main program calls pthread_create(&thread1, NULL, proc1, &arg) to start thread1 executing proc1(&arg), and later calls pthread_join() on thread1 to wait for its termination and collect its return status.
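A minimal compilable sketch following Figure 8.4 (the names proc1 and arg and the printed message are illustrative; typically compiled with -lpthread):

#include <stdio.h>
#include <pthread.h>

void *proc1(void *arg)                            /* thread routine */
{
    int value = *(int *) arg;
    printf("thread1 received %d\n", value);
    return NULL;
}

int main(void)
{
    pthread_t thread1;
    int arg = 42;                                 /* illustrative argument value */

    pthread_create(&thread1, NULL, proc1, &arg);  /* start thread1 */
    pthread_join(thread1, NULL);                  /* wait for thread1 to terminate */
    return 0;
}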

Pthread Barrier

The routine pthread_join() waits for one specific thread to terminate.

To create a barrier waiting for all threads, pthread_join() could be repeated:

.

for (i = 0; i < n; i++)

pthread_create(&thread[i], NULL, (void *) slave, (void *) &arg);

.

for (i = 0; i < n; i++)

pthread_join(thread[i], NULL);

.

Figure 8.5 Detached threads: the main program issues several pthread_create() calls; each detached thread runs to its own termination and is never joined.

Detached Threads

It may be that a thread is not concerned with knowing when a thread it creates terminates, in which case a join is not needed.

Threads that are not joined are called detached threads.

When detached threads terminate, they are destroyed and their resources released.

Statement Execution Order

Once processes or threads are created, their execution order will depend upon the system.

On a single processor system, the processor will be time shared between the processes/threads, in an order determined by the system if not specified, although typically a thread executes to completion if not blocked.

On a multiprocessor system, the opportunity exists for different processes/threads to execute on different processors.

The instructions of individual processes/threads might be interleaved in time.

Example

If there were two processes with the machine instructions

Process 1          Process 2
Instruction 1.1    Instruction 2.1
Instruction 1.2    Instruction 2.2
Instruction 1.3    Instruction 2.3

there are several possible orderings, including

Instruction 1.1
Instruction 1.2
Instruction 2.1
Instruction 1.3
Instruction 2.2
Instruction 2.3

assuming an instruction cannot be divided into smaller interruptible steps.

If two processes were to print messages, for example, the messages could appear in different orders depending upon the scheduling of the processes calling the print routine.

Worse, the individual characters of each message could be interleaved if the machine instructions of instances of the print routine could be interleaved.

Compiler Optimizations

In addition to interleaved execution of machine instructions in processes/threads, a compiler (or the processor) might reorder instructions of your program for optimization purposes while preserving the logical correctness of the program.

Example

The statements

a = b + 5;

x = y + 4;

could be compiled to execute in reverse order:

x = y + 4;

a = b + 5;

and still be logically correct.

It may be advantageous to delay statement a = b + 5 because some previous instruction currently being executed in the processor needs more time to produce the value for b.

It is very common for modern processors to execute machine instructions out of order for increased speed of execution.

Thread-Safe Routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results.

Example

Standard I/O is designed to be thread safe (and will print messages without interleaving the characters).

Routines that access shared data and static data may require special care to be made thread safe.

Example

System routines that return time may not be thread safe.

Thread-safety issues in any routine can be avoided by forcing only one thread to execute the routine at a time. This could be achieved by simply enclosing the routine in a critical section, but this is very inefficient.

Figure 8.6 Conflict in accessing a shared variable: Process 1 and Process 2 each read the shared variable x, add 1, and write the result back.

Accessing Shared Data

Consider two processes each of which is to add one to a shared data item, x.

It will be necessary for the contents of the x location to be read, x + 1 computed, and the result written back to the location.

With two processes doing this at approximately the same time, we have

Instruction      Process 1         Process 2
x = x + 1;       read x            read x
                 compute x + 1     compute x + 1
                 write to x        write to x

(time increases down the table)
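This lost-update conflict is easy to reproduce with a small Pthreads sketch (not from the text; the thread routine name and iteration count are illustrative). Without any protection, the final value of x is usually less than expected:

#include <stdio.h>
#include <pthread.h>

#define N 1000000
int x = 0;                             /* shared variable */

void *adder(void *ignored)
{
    for (int i = 0; i < N; i++)
        x = x + 1;                     /* read x, compute x + 1, write to x */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, adder, NULL);
    pthread_create(&t2, NULL, adder, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d (expected %d)\n", x, 2 * N);   /* typically less than 2*N */
    return 0;
}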

Critical Section

The problem of accessing shared data can be generalized by considering shared resources.

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time.

The first process to reach a critical section for a particular resource enters and executes the critical section.

The process prevents all other processes from entering their critical sections for the same resource.

Once the process has finished its critical section, another process is allowed to enter a critical section for the same resource.

This mechanism is known as mutual exclusion.

Figure 8.7 Control of critical sections through busy waiting: Process 1 and Process 2 each execute

while (lock == 1) do_nothing;
lock = 1;
  critical section
lock = 0;

with the second process spinning in its while loop until the first sets lock back to 0.

Locks

The simplest mechanism for ensuring mutual exclusion of critical sections.

A lock is a 1-bit variable that is a 1 to indicate that a process has entered the critical section and a 0 to indicate that no process is in the critical section.

The lock operates much like that of a door lock.

A process coming to the "door" of a critical section and finding it open may enter the critical section, locking the door behind it to prevent other processes from entering. Once the process has finished the critical section, it unlocks the door and leaves.

Spin Lock

Example

while (lock == 1) do_nothing; /* no operation in while loop */

lock = 1; /* enter critical section */

.

critical section.

lock = 0; /* leave critical section */
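As written, the test in the while loop and the subsequent lock = 1 are two separate operations, so two processes could both find the lock open and enter together; practical spin locks rely on an indivisible test-and-set. A sketch using the GCC atomic builtin __sync_lock_test_and_set (a toolchain assumption, not part of the text):

volatile int lock = 0;                      /* 0 = open, 1 = closed */

void spin_lock(volatile int *lk)
{
    while (__sync_lock_test_and_set(lk, 1) == 1)
        ;                                   /* busy wait: loop until the old value was 0 */
}

void spin_unlock(volatile int *lk)
{
    __sync_lock_release(lk);                /* atomically sets *lk back to 0 */
}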

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutually exclusive lock variables, or "mutex" variables.

To use a mutex, first it must be declared as of type pthread_mutex_t and initialized, usually in the "main" thread:

pthread_mutex_t mutex1;

.

.

pthread_mutex_init(&mutex1, NULL);

NULL specifies a default attribute for the mutex.

A mutex can be destroyed with pthread_mutex_destroy().

A critical section can then be protected using pthread_mutex_lock() and pthread_mutex_unlock():

pthread_mutex_lock(&mutex1);

.

critical section.

pthread_mutex_unlock(&mutex1);

If a thread reaches a mutex lock and finds it locked, it will wait for the lock to open.

If more than one thread is waiting for the lock to open when it opens, the system will select one thread to be allowed to proceed.

Only the thread that locks a mutex can unlock it.
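A self-contained sketch of a mutex protecting a shared counter, removing the conflict shown in Figure 8.6 (illustrative names; typically compiled with -lpthread):

#include <stdio.h>
#include <pthread.h>

#define N 1000000
int counter = 0;                              /* shared data */
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *ignored)
{
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&mutex1);          /* enter critical section */
        counter++;
        pthread_mutex_unlock(&mutex1);        /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);        /* always 2*N with the mutex in place */
    return 0;
}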

Figure 8.8 Deadlock (deadly embrace): (a) two-process deadlock, with processes P1 and P2 each holding one of the resources R1 and R2 while requesting the other; (b) n-process deadlock, with processes P1 … Pn and resources R1 … Rn forming a circular chain of requests.

Deadlock

Deadlock can occur with two processes when one requires a resource held by the other, and this process requires a resource held by the first process.

Deadlock can also occur in a circular fashion with several processes, each having a resource wanted by another.

These particular forms of deadlock are known as deadly embrace.

Given a set of processes having various resource requests, a circular path between any group indicates a potential deadlock situation.

Deadlock can be eliminated between two processes accessing more than one resource if both processes make requests first for one resource and then for the other.

Pthreads

Offers one routine that can test whether a lock is actually closed without blocking the thread: pthread_mutex_trylock(). This routine will lock an unlocked mutex and return 0, or will return EBUSY if the mutex is already locked; it might find a use in overcoming deadlock.

Semaphores

A semaphore, s (say), is a positive integer (including zero) operated upon by two operations named P and V.

P operation, P(s)

Waits until s is greater than zero and then decrements s by one and allows the process to continue.

V operation, V(s)

Increments s by one to release one of the waiting processes (if any).

The P and V operations are performed indivisibly.

A mechanism for activating waiting processes is also implicit in the P and V operations. Though the exact algorithm is not specified, the algorithm is expected to be fair.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

Mutual Exclusion of Critical Sections

Can be achieved with one semaphore having the value 0 or 1 (a binary semaphore), which acts as a lock variable, but the P and V operations include a process scheduling mechanism.

The semaphore is initialized to 1, indicating that no process is in its critical section associated with the semaphore.

Each mutually exclusive critical section is preceded by a P(s) and terminated with a V(s) on the same semaphore; i.e.,

Process 1              Process 2              Process 3
Noncritical section    Noncritical section    Noncritical section
  .                      .                      .
P(s)                   P(s)                   P(s)
  .                      .                      .
Critical section       Critical section       Critical section
  .                      .                      .
V(s)                   V(s)                   V(s)
  .                      .                      .
Noncritical section    Noncritical section    Noncritical section

Any process might reach its P(s) operation first (or more than one process may reach it simultaneously).

The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0, inhibiting the other processes from proceeding past their P(s) operations, but any process reaching its P(s) operation will be recorded so that one can be selected when the critical section is released.

When the process reaches its V(s) operation, it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into its critical section.

General Semaphore

Can take on positive values other than zero and one.

Such semaphores provide, for example, a means of recording the number of "resource units" available or used and can be used to solve producer/consumer problems.

Semaphore routines exist for UNIX processes. They do not exist in Pthreads as such, though they can be written, and they do exist in the real-time extension to Pthreads.
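As a sketch, the POSIX (real-time extension) semaphore routines sem_wait() and sem_post() map directly onto P and V; assuming <semaphore.h> is available and linking with -lpthread:

#include <stdio.h>
#include <semaphore.h>

sem_t s;                               /* binary semaphore used as a lock */

void P(sem_t *sp) { sem_wait(sp); }    /* P(s): wait until s > 0, then decrement s */
void V(sem_t *sp) { sem_post(sp); }    /* V(s): increment s, waking one waiter if any */

int main(void)
{
    sem_init(&s, 0, 1);                /* initialized to 1: no process in its critical section */

    P(&s);
    printf("in critical section\n");   /* critical section */
    V(&s);

    sem_destroy(&s);
    return 0;
}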

Monitor

A suite of procedures that provides the only method to access a shared resource.

Essentially the data and the operations that can operate upon the data are encapsulated into one structure.

Reading and writing can only be done by using a monitor procedure, and only one process can use a monitor procedure at any instant.

A monitor procedure could be implemented using a semaphore to protect its entry; i.e.,

monitor_proc1()
{
  P(monitor_semaphore);
  .
  monitor body
  .
  V(monitor_semaphore);
  return;
}

Java Monitor

The concept of a monitor exists in Java.

The keyword synchronized in Java makes a block of code in a method thread safe, preventing more than one thread from being inside the method at a time.

Condition Variables

Often, a critical section is to be executed if a specific global condition exists; for example, if a certain value of a variable has been reached.

With locks, the global variable would need to be examined at frequent intervals ("polled") within a critical section. This is a very time-consuming and unproductive exercise.

The problem can be overcome by introducing so-called condition variables.

Operations

Three operations are defined for a condition variable:

Wait(cond_var) — wait for a condition to occur

Signal(cond_var) — signal that the condition has occurred

Status(cond_var) — return the number of processes waiting for the condition to occur

The wait operation will also release a lock or semaphore and can be used to allow another process to alter the condition.

When the process calling wait() is finally allowed to proceed, the lock or semaphore is again set.

Example

Consider one or more processes (or threads) designed to take action when a counter, x, is zero. Another process or thread is responsible for decrementing the counter. The routines could be of the form

action()                        counter()
{                               {
  .                               .
  lock();                         lock();
  while (x != 0)                  x--;
    wait(s);                      if (x == 0) signal(s);
  unlock();                       unlock();
  take_action();                  .
  .                               .
}                               }

Pthread Condition Variables

Condition variables are associated with a specific mutex.

Given the declarations and initializations

pthread_cond_t cond1;

pthread_mutex_t mutex1;

.

pthread_cond_init(&cond1, NULL);

pthread_mutex_init(&mutex1, NULL);

the Pthreads arrangement for signal and wait is as follows:

Signals are not remembered, which means that threads must already be waiting for a signal to receive it.

action()                                   counter()
{                                          {
  .                                          .
  pthread_mutex_lock(&mutex1);               pthread_mutex_lock(&mutex1);
  while (c != 0)                             c--;
    pthread_cond_wait(&cond1, &mutex1);      if (c == 0) pthread_cond_signal(&cond1);
  pthread_mutex_unlock(&mutex1);             pthread_mutex_unlock(&mutex1);
  take_action();                             .
  .                                          .
}                                          }
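A complete, compilable version of this arrangement might look as follows (a sketch; the initial counter value, the single action thread, and the loop in counter() are illustrative assumptions):

#include <stdio.h>
#include <pthread.h>

int c = 5;                                      /* shared counter */
pthread_cond_t  cond1  = PTHREAD_COND_INITIALIZER;
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;

void *action(void *ignored)
{
    pthread_mutex_lock(&mutex1);
    while (c != 0)                              /* wait releases mutex1, re-acquires on wakeup */
        pthread_cond_wait(&cond1, &mutex1);
    pthread_mutex_unlock(&mutex1);
    printf("counter reached zero, taking action\n");
    return NULL;
}

void *counter(void *ignored)
{
    for (int i = 0; i < 5; i++) {
        pthread_mutex_lock(&mutex1);
        c--;
        if (c == 0) pthread_cond_signal(&cond1);
        pthread_mutex_unlock(&mutex1);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, action, NULL);
    pthread_create(&t2, NULL, counter, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}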

Language Constructs for Parallelism

Shared Data

Shared memory variables might be declared as shared with, say,

shared int x;

par Construct

For specifying concurrent statements:

par {

S1;

S2;

.

.

Sn;

}

forall Construct

To start multiple similar processes together:

forall (i = 0; i < n; i++) {

S1;

S2;

.

.

Sm;

}

which generates n processes, each consisting of the statements forming the body of the for loop, S1, S2, …, Sm. Each process uses a different value of i.

Example

forall (i = 0; i < 5; i++)

a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
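forall is a language construct rather than C; one way to sketch the same effect in C is to create one Pthread per iteration, passing each thread its own value of i (an illustrative sketch, not part of the text):

#include <pthread.h>

int a[5];                                 /* shared array */

void *body(void *p)
{
    int i = *(int *) p;                   /* this thread's value of i */
    a[i] = 0;
    return NULL;
}

int main(void)
{
    pthread_t thread[5];
    int index[5];

    for (int i = 0; i < 5; i++) {         /* "forall": start all iterations */
        index[i] = i;
        pthread_create(&thread[i], NULL, body, &index[i]);
    }
    for (int i = 0; i < 5; i++)           /* implicit barrier at the end of the forall */
        pthread_join(thread[i], NULL);
    return 0;
}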

Dependency Analysis

To identify which processes could be executed together.

Example

We can see immediately in the code

forall (i = 0; i < 5; i++)

a[i] = 0;

that every instance of the body is independent of the other instances and all instances can be executed simultaneously.

However, it may not be obvious; e.g.,

forall (i = 2; i < 6; i++) {

x = i - 2*i + i*i;

a[i] = a[x];

}

Preferably, we need an algorithmic way of recognizing the dependencies, which might be used by a parallelizing compiler.

Bernstein's Conditions

Set of conditions that are sufficient to determine whether two processes can be executed simultaneously. Let us define two sets of memory locations, I (input) and O (output), such that

Ii is the set of memory locations read by process Pi.

Oj is the set of memory locations altered by process Pj.

For two processes P1 and P2 to be executed simultaneously, the inputs to process P1 must not be part of the outputs of P2, and the inputs of P2 must not be part of the outputs of P1; i.e.,

I1 ∩ O2 = φ
I2 ∩ O1 = φ

where φ is an empty set. The set of outputs of each process must also be different; i.e.,

O1 ∩ O2 = φ

If the three conditions are all satisfied, the two processes can be executed concurrently.

Example

Suppose the two statements are (in C)

a = x + y;

b = x + z;

We have

I1 = (x, y) O1 = (a)

I2 = (x, z) O2 = (b)

and the conditions

I1 ∩ O2 = φ
I2 ∩ O1 = φ
O1 ∩ O2 = φ

are satisfied. Hence, the statements a = x + y and b = x + z can be executed simultaneously.
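A sketch (not from the text) of how the conditions might be checked mechanically, representing each I and O set as an array of variable names:

#include <stdio.h>
#include <string.h>

/* returns 1 if the two name sets share an element (non-empty intersection) */
int intersects(const char *s1[], int n1, const char *s2[], int n2)
{
    for (int i = 0; i < n1; i++)
        for (int j = 0; j < n2; j++)
            if (strcmp(s1[i], s2[j]) == 0) return 1;
    return 0;
}

/* Bernstein's conditions: I1∩O2, I2∩O1, and O1∩O2 must all be empty */
int can_run_concurrently(const char *I1[], int nI1, const char *O1[], int nO1,
                         const char *I2[], int nI2, const char *O2[], int nO2)
{
    return !intersects(I1, nI1, O2, nO2) &&
           !intersects(I2, nI2, O1, nO1) &&
           !intersects(O1, nO1, O2, nO2);
}

int main(void)
{
    /* a = x + y;  b = x + z;  from the example above */
    const char *I1[] = { "x", "y" }, *O1[] = { "a" };
    const char *I2[] = { "x", "z" }, *O2[] = { "b" };

    printf("concurrent? %d\n",
           can_run_concurrently(I1, 2, O1, 1, I2, 2, O2, 1));  /* prints 1 */
    return 0;
}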

Figure 8.9 False sharing in caches: Processor 1 and Processor 2 each hold a copy of the same main-memory block in their caches (selected by an address tag) while referencing different words within the block.

Shared Data in Systems with Caches

All modern computer systems have cache memory, high-speed memory closely attached to each processor for holding recently referenced data and code.

Cache coherence protocols

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated (by resetting a valid bit in the cache). These copies are only updated when the associated processor makes a reference to the data.

False Sharing

The key characteristic used is that caches are organized in blocks of contiguous locations. When a processor first references one or more bytes in a block, the whole block is transferred into the cache from the main memory.

False sharing occurs when different parts of a block are required by different processors, but not the same bytes. If one processor writes to one part of the block, copies of the complete block in other caches must be updated or invalidated, even though the actual data is not shared.

Solution for False Sharing

The compiler can alter the layout of the data stored in the main memory, separating data only altered by one processor into different blocks.

This may be difficult to satisfy in all situations. For example, the code

forall (i = 0; i < 5; i++)

a[i] = 0;

is likely to create false sharing, as the elements of a, a[0], a[1], a[2], a[3], and a[4], are likely to be stored in consecutive locations in memory.

The only way to avoid false sharing would be to place each element in a different block, which would create significant wastage of storage for a large array.

Even code such as

par {

x = 0;

y = 0;

}

where x and y are shared variables, could create false sharing, as the variables x and y are likely to be stored together in a data segment.
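One common hand-applied form of this layout change is to pad each processor's data out to a separate cache-block-sized region; a sketch, where the 64-byte block size and block-aligned placement are assumptions about the machine:

#define CACHE_LINE 64                       /* assumed cache block size in bytes */

struct padded_counter {
    int value;
    char pad[CACHE_LINE - sizeof(int)];     /* forces each counter into its own block */
};

struct padded_counter counter[8];           /* one element per processor/thread */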

Figure 8.10 Shared memory locations for the UNIX processes program example: addr points to the shared sum, which is followed by the array a[].

Program Examples

To sum the elements of an array, a[1000]:

int sum, a[1000];

sum = 0;

for (i = 0; i < 1000; i++)

sum = sum + a[i];

UNIX Processes

For this example, the calculation will be divided into two parts, one doing the even i and one doing the odd i; i.e.,

Process 1                               Process 2
sum1 = 0;                               sum2 = 0;
for (i = 0; i < 1000; i = i + 2)        for (i = 1; i < 1000; i = i + 2)
  sum1 = sum1 + a[i];                     sum2 = sum2 + a[i];

Each process will add its result (sum1 or sum2) to an accumulating result, sum (after sum is initialized):

sum = sum + sum1; sum = sum + sum2;

producing the final answer. The result location, sum, will need to be shared and access protected by a lock. For this program, a shared data structure is created:

#include <sys/types.h>

#include <sys/ipc.h>

#include <sys/shm.h>

#include <sys/sem.h>

#include <stdio.h>

#include <errno.h>

#define array_size 1000 /* no of elements in shared memory */

extern char *shmat();

void P(int *s);

void V(int *s);

int main()

{

int shmid, s, pid; /* shared memory, semaphore, proc id */

char *shm; /*shared mem. addr returned by shmat()*/

int *a, *addr, *sum; /* shared data variables*/

int partial_sum; /* partial sum of each process */

int i;

/* initialize semaphore set */

int init_sem_value = 1;

s = semget(IPC_PRIVATE, 1, (0600 | IPC_CREAT));

if (s == -1) { /* if unsuccessful*/

perror("semget");

exit(1);

}

if (semctl(s, 0, SETVAL, init_sem_value) < 0) {

perror("semctl");

exit(1);

}

/* create segment*/

shmid = shmget(IPC_PRIVATE, ((array_size + 1)*sizeof(int)), /* sum plus a[] */

(IPC_CREAT|0600));

if (shmid == -1) {

perror("shmget");

exit(1);

}

/* map segment to process data space */

shm = shmat(shmid, NULL, 0);

/* returns address as a character*/

if (shm == (char*)-1) {

perror("shmat");

exit(1);

}

addr = (int*)shm; /* starting address */

sum = addr; /* accumulating sum */

addr++;

a = addr; /* array of numbers, a[] */

*sum = 0;

for (i = 0; i < array_size; i++) /* load array with numbers */

*(a + i) = i+1;

pid = fork(); /* create child process */

if (pid == 0) { /* child does this */

partial_sum = 0;

for (i = 0; i < array_size; i = i + 2)

partial_sum += *(a + i);

} else { /* parent does this */

partial_sum = 0;

for (i = 1; i < array_size; i = i + 2)

partial_sum += *(a + i);

}

P(&s); /* for each process, add partial sum */

*sum += partial_sum;

V(&s);

printf("\nprocess pid = %d, partial sum = %d\n", pid, partial_sum);

if (pid == 0) exit(0); else wait(0); /* terminate child proc */

printf("\nThe sum of 1 to %i is %d\n", array_size, *sum);

/* remove semaphore */

if (semctl(s, 0, IPC_RMID, 1) == -1) {

perror("semctl");

exit(1);

}

/* remove shared memory */

if (shmctl(shmid, IPC_RMID, NULL) == -1) {

perror("shmctl");

exit(1);

}

} /* end of main */

void P(int *s) /* P(s) routine*/

{

struct sembuf sembuffer, *sops;

sops = &sembuffer;

sops->sem_num = 0;

sops->sem_op = -1;

sops->sem_flg = 0;

if (semop(*s, sops, 1) < 0) {

perror("semop");

exit(1);

}

return;

}

void V(int *s) /* V(s) routine */

{

struct sembuf sembuffer, *sops;

sops = &sembuffer;

sops->sem_num = 0;

sops->sem_op = 1;

sops->sem_flg = 0;

if (semop(*s, sops, 1) <0) {

perror("semop");

exit(1);

}

return;

}

SAMPLE OUTPUT

process pid = 0, partial sum = 250000

process pid = 26127, partial sum = 250500

The sum of 1 to 1000 is 500500

Figure 8.11 Shared memory locations for the Section 8.4.2 program example: the array a[], global_index, and sum.

Pthreads Example

In this example, n threads are created, each taking numbers from the list to add to their sums. When all numbers have been taken, the threads can add their partial results to a shared location sum.

The shared location global_index is used by each thread to select the next element of a[].

After global_index is read, it is incremented in preparation for the next element to be read.

The result location is sum, as before, and will also need to be shared and access protected by a lock.

#include <stdio.h>

#include <pthread.h>

#define array_size 1000

#define no_threads 10

/* shared data */

int a[array_size]; /* array of numbers to sum */

int global_index = 0; /* global index */

int sum = 0; /* final result, also used by slaves */

pthread_mutex_t mutex1; /* mutually exclusive lock variable */

void *slave(void *ignored) /* Slave threads */

{

int local_index, partial_sum = 0;

do {

pthread_mutex_lock(&mutex1);/* get next index into the array */

local_index = global_index;/* read current index & save locally*/

global_index++; /* increment global index */

pthread_mutex_unlock(&mutex1);

if (local_index < array_size) partial_sum += *(a + local_index);

} while (local_index < array_size);

pthread_mutex_lock(&mutex1); /* add partial sum to global sum */

sum += partial_sum;

pthread_mutex_unlock(&mutex1);

return NULL; /* Thread exits */

}

int main() {

int i;

pthread_t thread[10]; /* threads */

pthread_mutex_init(&mutex1,NULL); /* initialize mutex */

for (i = 0; i < array_size; i++) /* initialize a[] */

a[i] = i+1;

for (i = 0; i < no_threads; i++) /* create threads */

if (pthread_create(&thread[i], NULL, slave, NULL) != 0)

perror("Pthread_create fails");

for (i = 0; i < no_threads; i++) /* join threads */

if (pthread_join(thread[i], NULL) != 0)

perror("Pthread_join fails");

printf("The sum of 1 to %i is %d\n", array_size, sum);

} /* end of main */

SAMPLE OUTPUT

The sum of 1 to 1000 is 500500

Java Example

public class Adder

{

public int[] array;

private int sum = 0;

private int index = 0;

private int number_of_threads = 10;

private int threads_quit;

public Adder()

{

threads_quit = 0;

array = new int[1000];

initializeArray();

startThreads();

}

public synchronized int getNextIndex()

{

if(index < 1000) return(index++); else return(-1);

}

public synchronized void addPartialSum(int partial_sum)

{

sum = sum + partial_sum;

if(++threads_quit == number_of_threads)

System.out.println("The sum of the numbers is " + sum);

}

private void initializeArray()

{

int i;

for(i = 0;i < 1000;i++) array[i] = i;

}

public void startThreads()

{

int i = 0;

for(i = 0;i < 10;i++)

{

AdderThread at = new AdderThread(this,i);

at.start();

}

}

public static void main(String args[])

{

Adder a = new Adder();

}

}

class AdderThread extends Thread

{

int partial_sum = 0;

Adder parent;

int number;

public AdderThread(Adder parent,int number)

{

this.parent = parent;

this.number = number;

}

public void run()

{

int index = 0;

while(index != -1) {

partial_sum = partial_sum + parent.array[index];

index = parent.getNextIndex();

}

System.out.println("Partial sum from thread " + number + " is "

+ partial_sum);

parent.addPartialSum(partial_sum);

}

}

P R O B L E M S

Scientific/Numerical

8-1. List the possible orderings of the instructions of the two processes, each having three instructions.

8-2. Write code using Pthreads with a condition variable to implement the example given in the section on condition variables, for two "action" routines waiting on a "counter" routine to decrement a counter to zero.

8-3. Analyze the code

forall (i = 2; i < 6; i++) {
  x = i - 2*i + i*i;
  a[i] = a[x];
}

as given in Section 8.3.4 and determine whether any instances of the body can be executed simultaneously.

8-4. List all possible outputs when the following code is executed:

j = 0;
k = 0;
forall (i = 1; i <= 2; i++) {
  j = j + 10;
  k = k + 100;
}
printf("i=%i,j=%i,k=%i\n", i, j, k);

assuming that each assignment statement is atomic. (Clue: Number the assignment statements and then find every possible sequence.)

8-5. The following C-like parallel code is supposed to transpose a matrix:

forall (i = 0; i < n; i++)
  forall (j = 0; j < n; j++)
    a[i][j] = a[j][i];

Explain why the code will not work. Rewrite the code so that it will work.

8-6. The following C-like parallel routine is supposed to compute the sum of the first n numbers:

int summation(int n)
{
  int sum = 0;
  forall (i = 1; i <= n; i++)
    sum = sum + i;
  return(sum);
}

Why will it not work? Rewrite the code so that it will work given n = 200 and 51 processors.

8-7. Determine and explain how the following code for a barrier works (based upon the two-phase barrier given in Chapter 6, Section 6.1.3):

void barrier()
{
  lock(arrival);
  count++;
  if (count < n) unlock(arrival);
  else unlock(departure);
  lock(departure);
  count--;
  if (count > 0) unlock(departure);
  else unlock(arrival);
  return;
}

Why is it necessary to use two lock variables, arrival and departure?

8-8. Write a Pthreads program to perform numerical integration as described in Chapter 4, Section 4.2.2. Compare using different decomposition methods (rectangular and trapezoidal).

8-9. Rewrite the Pthread example code in Section 8.4 so that the slaves will take (up to) 10 consecutive numbers to add as a group to reduce the access to the index.

8-10. Condition variables can be used to detect distributed termination. Introduce condition variables into a load-balancing program that has distributed termination such as described in Chapter 7.

8-11. Write a multithreaded program consisting of two threads in which a file is read into a buffer by one thread and written out to another file by another thread.

8-12. Write a multithreaded program to find the roots of the quadratic equation ax² + bx + c = 0 using the formula

x = (−b ± √(b² − 4ac)) / 2a

where intermediate values are computed by different threads. Use a condition variable to recognize when each thread has completed its designated computation.

Real Life

8-13. Write a multithreaded program to simulate two automatic teller machines being accessed by different persons on a single shared account. Enhance the program to allow automatic debits to occur.

8-14. Write a multithreaded program for an airline ticket reservation system to enable different travel agents access to a single source of available tickets (in shared memory).

8-15. Write a multithreaded program for a medical information system accessed by various doctors who may try to retrieve and update a patient's history (add something, etc.), which is held in shared memory.

8-16. Write a multithreaded program for selling tickets to the next concert of the rock group "Purple Mums" in Ericsson Stadium, Charlotte, North Carolina.

8-17. Write a multithreaded program to simulate a computer network in which workstations are connected by a single Ethernet and send messages to each other and to a main server at random intervals. Model each workstation by one thread making random requests for other workstations and take into account message sizes and collisions.

8-18. Extend Problem 8-17 by providing multiple Ethernet lines (as described in Chapter 1, Section 1.4).

8-19. Write a multithreaded program to simulate a hypercube network and a mesh network, both with multiple parallel communication links between nodes. Determine how the performance changes when the number of parallel links between nodes is increased and make a comparative study of the performance of the hypercube and mesh using the results of your simulation. Performance metrics include the number of requests that are accepted in each time period. See Wilkinson (1996) for further details and sample results of this simulation.

8-20. Write a program to simulate a digital system consisting of AND, OR, and NOT gates connected in various user-defined ways. Each AND and OR gate has two inputs and one output. Each NOT gate has one input and one output. Each gate is to be implemented as a thread that receives Boolean values from other gates. The data for this program will be an array defining the interconnections and the gate functions. For example, Table 8.2 defines the logic circuit shown in Figure 8.12. First establish that your program can simulate the specific logic circuit shown in Figure 8.12 and then modify the program to cope with any arrangement of gates given that there are a maximum of eight gates.

Figure 8.12 Sample logic circuit: inputs Test1, Test2, and Test3; gates 1, 2, and 3; outputs Output1 and Output2.

TABLE 8.2 LOGIC CIRCUIT DESCRIPTION FOR FIGURE 8.12

Gate  Function  Input 1  Input 2  Output
1     AND       Test1    Test2    Gate1
2     NOT       Gate1    —        Output1
3     OR        Test3    Gate1    Output2

8-21. Devise a problem that uses locks for protecting critical sections and condition variables and requires less than three pages of code. Implement the problem.

8-22. Many problems in previous chapters can be implemented using threads. Select one and make a comparative study of using Pthreads and using PVM or MPI.

8-23. Write a multithreaded program to implement the following arcade game: A river has logs floating downstream (or to and fro). A frog must cross the river by jumping on logs as the logs pass by, as illustrated in Figure 8.13. The user controls when the frog jumps, which can only be perpendicular to the riverbanks. You win if the frog makes it to the opposite side of the river, and you lose if the frog lands in the river. Graphical output is necessary and sound effects are preferable. Concurrent movements of the logs are to be controlled by separate threads. [This problem was suggested and implemented for a short open-ended assignment (Problem 8-21) by Christopher Wilson, a senior at UNCC in 1997. Other arcade games may be amenable to a thread implementation.]

8-24. Write a simple Web server using a collection of threads organized in a master-slave configuration. The master thread receives requests. When a request is received, the master thread checks a pool of slave threads to find a free thread. The request is handed to the first free thread, which then services the request, as illustrated in Figure 8.14. [This problem was suggested and implemented for a short open-ended assignment (Problem 8-21) by Kevin Vaughan, a junior at North Carolina State University in 1997.]

Figure 8.13 River and frog for Problem 8-23: logs move along the river and the frog jumps between them.

Figure 8.14 Thread pool for Problem 8-24: a master thread receives requests and signals a free slave thread in the pool, which then services the request.

