Page 1: Parallel Algorithms

Parallel Algorithms

CET306
Harry R. Erwin

University of Sunderland

Page 2: Parallel Algorithms

Roadmap

• Theoretical Models
  – Turing Machine (TM)
  – Von Neumann Machine (VNM)
  – Random Access Machine (RAM)
  – Parallel Random Access Machine (PRAM)
  – Policies
• Shared-Memory Programming
• Distributed-Memory Programming
• Portable Libraries
  – PVM
  – MPI
• Critical Comparison
• Parallel Patterns

Page 3: Parallel Algorithms

Texts

• Clay Breshears (2009) The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications, O'Reilly Media, 304 pages.

• Mordechai Ben-Ari (2006) Principles of Concurrent and Distributed Programming, Addison-Wesley.

Page 4: Parallel Algorithms

Theoretical Models

• Turing Machine (TM)
• Von Neumann Machine (VNM)
• Random Access Machine (RAM)
• Parallel Random Access Machine (PRAM)

Page 5: Parallel Algorithms

Turing Machine (TM)
(From Wikipedia)

• Turing wrote that the Turing machine, here called a Logical Computing Machine, consisted of:

• “...an infinite memory capacity obtained in the form of an infinite tape marked out into squares, on each of which a symbol could be printed. At any moment there is one symbol in the machine; it is called the scanned symbol. The machine can alter the scanned symbol and its behaviour is in part determined by that symbol, but the symbols on the tape elsewhere do not affect the behaviour of the machine. However, the tape can be moved back and forth through the machine, this being one of the elementary operations of the machine. Any symbol on the tape may therefore eventually have an innings.” (Turing 1948, p. 61)

Page 6: Parallel Algorithms

Commentary

• You can think of a Turing Machine as automating what a mathematician does in proving a statement.
• The tape is the current state of a proof, and the question is whether the Turing Machine ever stops (having successfully proven the statement). That is provably unsolvable.
• Any Turing Machine can be simulated by a Universal Turing Machine (UTM), with a ‘program’ at the beginning of the tape, followed by the statement to be proven.
• All digital computer programs are special cases of this.
• Analogue computers introduce Super-Turing Machines.

Page 7: Parallel Algorithms

Von Neumann Machine (VNM) or Architecture (VNA)

• (Wikipedia) “This describes a design architecture for an electronic digital computer with subdivisions of a central arithmetic part, a central control part, a memory to store both data and instructions, external storage, and input and output mechanisms. The meaning of the phrase has evolved to mean a stored-program computer in which an instruction fetch and a data operation cannot occur at the same time because they share a common bus. This is referred to as the Von Neumann bottleneck and often limits the performance of the system.”

Page 8: Parallel Algorithms

Commentary

• (Wikipedia) “The design of a Von Neumann architecture is simpler than the more modern Harvard architecture, which is also a stored-program system but has one dedicated set of address and data buses for memory, and another set of address and data buses for fetching instructions.”

• “A stored-program digital computer is one that keeps its programmed instructions, as well as its data, in read-write, random-access memory (RAM). In the vast majority of modern computers, the same memory is used for both data and program instructions.”

• (Backus quoted on Wikipedia) “The shared bus between the program memory and data memory leads to the Von Neumann bottleneck, the limited throughput (data transfer rate) between the CPU and memory compared to the amount of memory.”

Page 9: Parallel Algorithms

Harvard Architecture

• (Wikipedia) “The Harvard architecture is a computer architecture with physically separate storage and signal pathways for instructions and data. The term originated from the Harvard Mark I relay-based computer, which stored instructions on punched tape (24 bits wide) and data in electro-mechanical counters. These early machines had data storage entirely contained within the central processing unit, and provided no access to the instruction storage as data. Programs needed to be loaded by an operator; the processor could not boot itself.”

• “Today, most processors implement such separate signal pathways for performance reasons but actually implement a Modified Harvard architecture, so they can support tasks such as loading a program from disk storage as data and then executing it.”

• Security suggests data and instructions should be stored in separate areas, and instructions should be non-modifiable. The Modified Harvard architecture needs to be modified further to support this.

Page 10: Parallel Algorithms

Random Access Machine (RAM)

• Simplified Von Neumann Machine

• Can be given multiple storage levels

• Note that the difference between a Von Neumann Machine and a Harvard Machine is ignored in this model.

[Slide diagram: the CPU connected to Random Access Memory (RAM), with Input and Output.]

Page 11: Parallel Algorithms

Parallel Random Access Machine (PRAM)

• Pronounced “P-ram”
• At its simplest, consists of multiple CPUs accessing a common memory of unlimited size.
• Shared clock: one instruction per cycle.
• Memory access performance among the CPUs is identical.

Page 12: Parallel Algorithms

PRAM Models

• Concurrent Read, Concurrent Write (CRCW)
  – Multiple threads can read and write a common memory location at the same time.
• Concurrent Read, Exclusive Write (CREW)
  – Multiple threads can read and one thread can write a common memory location at the same time.
• Exclusive Read, Concurrent Write (ERCW)
  – One thread can read and multiple threads can write a common memory location at the same time.
• Exclusive Read, Exclusive Write (EREW)
  – One thread can read and one thread can write a common memory location at the same time.
• Policies
  – The PRAM algorithm sorts out the interaction.

Page 13: Parallel Algorithms

Policies

• Who actually gets access during exclusive read or write operations.
• What gets written in a concurrent write operation:
  – Ensure the same value is written.
  – Random choice.
  – Some logical, arithmetic, or illogical combination of the values being written.
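A minimal C++ sketch of the last policy, assuming the combining rule is “keep the maximum of the values written”; the combiner choice and all names here are illustrative, not part of the PRAM definition.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Simulated CRCW cell with a "combining" write policy: when several threads
// write in the same step, the stored result is the maximum of the values.
std::atomic<int> cell{0};

void combining_write(int value) {
    int old = cell.load();
    // Atomically keep the larger of the current contents and the new value.
    while (old < value && !cell.compare_exchange_weak(old, value)) {
        // Another writer got in first; 'old' now holds its value, so retry.
    }
}

int main() {
    std::vector<std::thread> writers;
    for (int v : {7, 3, 42, 19})
        writers.emplace_back(combining_write, v);   // concurrent writes to one cell
    for (auto& t : writers) t.join();
    std::cout << "cell = " << cell.load() << '\n';  // 42 under the "maximum" policy
    return 0;
}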

Page 14: Parallel Algorithms

Programming

• Shared-Memory Programming
• Distributed-Memory Programming
• Portable Libraries
  – PVM
  – MPI

• Critical Comparison of programming models

Page 15: Parallel Algorithms

Shared-Memory Programming

• Petered out in 1985-95 with a limit of about 32 processors due to bus contention.

Page 16: Parallel Algorithms

Distributed-Memory Programming

• Some of the memory in the system is allocated to individual processors and some is shared.

• The processors need to collaborate, mostly handled by message-passing.
  – PVM
  – MPI

• Beowulf clusters showed how to combine PCs using MPI to get high performance. We have a cluster at Sunderland, and C. Panchev knows this area.
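A minimal message-passing sketch using MPI’s standard C API from C++: rank 0 sends one integer to rank 1, which prints it. The payload and tag are arbitrary; build with an MPI compiler wrapper (e.g. mpicxx) and run with mpirun -np 2.

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;                             // arbitrary message contents
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::cout << "rank 1 received " << payload << " from rank 0\n";
    }

    MPI_Finalize();
    return 0;
}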

Page 17: Parallel Algorithms

Critical Comparison

• Features common to shared-memory and distributed-memory programming:
  – There is no free lunch. Some parts of your program will have to run serially.
  – Management is unavoidable. The work has to be divided up. You can exploit data parallelism, or you can split the parts of the job among processors.
  – Data have to be shared. Live with it.
  – You can allocate work on the fly or you can plan it.

Page 18: Parallel Algorithms

Shared Memory Issues

• Threads will need their own private memory areas. Usually you can do this by allocating thread-local memory. This can be for a given method execution, or you can use thread-local storage that stays with a thread.
• Performance of data access will be an issue. Think about storage conflicts and data races.
• Communication in memory involves synchronisation.
• You will need mutual exclusion or synchronisation primitives. Learn about them (a small sketch follows below).
• Learn about producer/consumer or boss/worker protocols.
• Learn about reader/writer locks.
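A minimal C++ sketch of two of the primitives named above: a mutex for mutual exclusion and a condition variable driving a producer/consumer hand-off. The integer queue and all names are illustrative.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex m;                      // mutual exclusion for the shared queue
std::condition_variable not_empty; // lets the consumer sleep until work arrives
std::queue<int> work;
bool done = false;

void producer() {
    for (int i = 0; i < 5; ++i) {
        {
            std::lock_guard<std::mutex> lock(m);  // exclusive access to 'work'
            work.push(i);
        }
        not_empty.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    not_empty.notify_one();
}

void consumer() {
    std::unique_lock<std::mutex> lock(m);
    for (;;) {
        not_empty.wait(lock, [] { return !work.empty() || done; });
        if (work.empty() && done) break;
        int item = work.front();
        work.pop();
        lock.unlock();                            // do the real work outside the lock
        std::cout << "consumed " << item << '\n';
        lock.lock();
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
    return 0;
}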

Page 19: Parallel Algorithms

Pattern Languages

• Alexander (1977) invented pattern languages as practical tools for describing architectural expertise in some domain.

• The elements of a pattern language are patterns. Each pattern describes a problem that occurs over and over again, and the core of the solution to that problem, in such a way that the solution can be reused many times without ever being applied the same way twice.

• A pattern isn’t considered proven until it has been used at least three times in real applications.

Page 20: Parallel Algorithms

Design Patterns

• The four essential elements (Gamma et al.) of a design pattern are:
  – A descriptive name.
  – A problem description that shows when to apply the pattern and in what contexts. The description also explains how it helps to complete larger patterns.
  – A solution that abstractly describes the constituent elements, their relationships, responsibilities, and collaborations.
  – The results and trade-offs that should be taken into account when applying the pattern.

Page 21: Parallel Algorithms

Pattern Resources

• Gamma, Helm, Johnson, and Vlissides, 1995, Design Patterns, Addison-Wesley.
• The Portland Pattern Repository: http://c2.com/ppr/
• Resources on Parallel Patterns: http://www.cs.uiuc.edu/homes/snir/PPP/
• Visual Studio 2010 and the Parallel Patterns Library: http://msdn.microsoft.com/en-us/magazine/dd434652.aspx http://www.microsoft.com/download/en/details.aspx?id=19222 http://msdn.microsoft.com/en-us/library/dd492418.aspx
• Alexander, 1977, A Pattern Language: Towns/Buildings/Construction, Oxford University Press. (For historical interest.)

Page 22: Parallel Algorithms

Some Parallel Patterns

• Source: Williams, A (2011) “Picking Patterns for Parallel Programs (Part 1)”, Overload, 105, 15-17.

• Loop Parallelism
• Fork/Join
• Pipelines
• Actor
• Speculative Execution

Page 23: Parallel Algorithms

Loop Parallelism

• Problem
  – There is a for loop that operates on many independent data items.
• Solution
  – Parallelise the for loop. The operation should depend only on the loop counter, and the individual loop iterations should not interact (see the sketch after this list).
• Positives
  – Scales very nicely.
  – Very common.
• Negatives
  – Overhead of setting up the threads.
  – Avoid if there is interaction, as the individual iterations may execute in any order.
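A minimal Loop Parallelism sketch, assuming the Microsoft Parallel Patterns Library (<ppl.h>) listed on the Pattern Resources slide is available; the data and the squaring operation are illustrative only, and a standard-library parallel algorithm would serve the same purpose.

#include <ppl.h>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> data(1000, 2.0);

    // Parallelise the for loop: the body depends only on the loop counter i,
    // and each iteration touches a distinct element, so iterations do not interact.
    concurrency::parallel_for(0, static_cast<int>(data.size()), [&](int i) {
        data[i] = data[i] * data[i];
    });

    std::cout << data[0] << '\n';  // prints 4
    return 0;
}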

Page 24: Parallel Algorithms

Fork/Join

• Problem
  – The task can be broken into two or more parts that can be run in parallel.
• Solution
  – Use a thread for each part. This can also be recursive (a recursive sketch follows below).
• Positives
  – Handles part interaction better than Loop Parallelism.
  – Works best at the top level of the application.
• Negatives
  – Needs to be managed centrally so that hardware parallelism is utilised efficiently.
  – Overhead of threads.
  – Bursty parallelism.
  – Uneven workloads.
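A minimal recursive Fork/Join sketch in standard C++: the range is split, the left half is forked onto another thread with std::async, the right half runs on the current thread, and the results are joined with get(). The cutoff value and names are arbitrary choices for illustration.

#include <cstddef>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

long long parallel_sum(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
    const std::size_t cutoff = 10000;               // below this, run serially
    if (hi - lo <= cutoff)
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);

    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,      // fork the left half
                           parallel_sum, std::cref(v), lo, mid);
    long long right = parallel_sum(v, mid, hi);     // this thread does the right half
    return left.get() + right;                      // join and combine
}

int main() {
    std::vector<int> v(1000000, 1);
    std::cout << parallel_sum(v, 0, v.size()) << '\n';  // prints 1000000
    return 0;
}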

Page 25: Parallel Algorithms

Pipelines

• Problem
  – You have a set of tasks to be applied in turn to data. First-in, first-out.
  – This problem shows up in sensor data processing a lot.
• Solution
  – Set up the tasks to run in parallel (as sketched below).
  – Fill the input queue.
• Positives
  – Adapts well to heterogeneous hardware configurations.
• Negatives
  – Setting it up.
  – Ensuring that the tasks have similar durations, to avoid a rate-limiting step.
  – Cache interaction during transfers between pipeline stages.
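An illustrative two-stage pipeline in standard C++: stage 1 produces values, stage 2 transforms and prints them, and the stages run in parallel connected first-in, first-out by a small blocking queue. The Channel class and the squaring stage are invented for this sketch.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

class Channel {                       // FIFO hand-off between two pipeline stages
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> q;
    bool closed = false;
public:
    void push(int v) {
        { std::lock_guard<std::mutex> lock(m); q.push(v); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m); closed = true; }
        cv.notify_one();
    }
    std::optional<int> pop() {        // empty optional means "stream finished"
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !q.empty() || closed; });
        if (q.empty()) return std::nullopt;
        int v = q.front(); q.pop();
        return v;
    }
};

int main() {
    Channel ch;
    std::thread stage1([&] {          // stage 1: fill the input queue
        for (int i = 1; i <= 5; ++i) ch.push(i);
        ch.close();
    });
    std::thread stage2([&] {          // stage 2: consume in FIFO order
        while (auto v = ch.pop()) std::cout << *v * *v << '\n';
    });
    stage1.join();
    stage2.join();
    return 0;
}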

Page 26: Parallel Algorithms

Actor

• Problem
  – Message-passing object-orientation with concurrency.
  – Message sending is asynchronous.
  – Response processing uses call-backs.
• Solution
  – Objects communicate (only) via message queues (see the sketch below).
• Positives
  – Actors can be analysed independently.
  – Avoids data races.
• Negatives
  – Set-up and queue management overhead.
  – Not good for short-lived threads.
  – Not an ideal communications mechanism.
  – Limited scalability.
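A compact, illustrative Actor sketch in standard C++: the actor owns a thread and a mailbox, other code communicates with it only by posting messages, and responses come back through a call-back. The CounterActor name and its messages are invented for illustration; note the actor’s state is touched only by its own thread, which is how the pattern avoids data races.

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class CounterActor {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> mailbox;
    bool stopping = false;
    int count = 0;                    // state touched only by the actor's thread
    std::thread worker{[this] { run(); }};

    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !mailbox.empty() || stopping; });
            if (mailbox.empty()) return;          // drained after stop was requested
            auto msg = std::move(mailbox.front());
            mailbox.pop();
            lock.unlock();
            msg();                                // one message at a time: no races on 'count'
        }
    }
    void post(std::function<void()> msg) {
        { std::lock_guard<std::mutex> lock(m); mailbox.push(std::move(msg)); }
        cv.notify_one();
    }
public:
    void increment() { post([this] { ++count; }); }          // asynchronous message
    void query(std::function<void(int)> reply) {             // call-back carries the response
        post([this, reply] { reply(count); });
    }
    ~CounterActor() {
        { std::lock_guard<std::mutex> lock(m); stopping = true; }
        cv.notify_one();
        worker.join();
    }
};

int main() {
    CounterActor actor;
    actor.increment();
    actor.increment();
    actor.query([](int value) { std::cout << "count = " << value << '\n'; });
    return 0;                         // destructor drains the mailbox and joins
}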

Page 27: Parallel Algorithms

Speculative Execution

• Problem
  – There’s an optional path that may be required for a solution, but it takes a lot of time.
• Solution
  – Start it early and cancel it if it’s not needed (a sketch follows below).
  – This is part of how BI works.
  – Part of why time travel implies P==NP.
• Positives
  – Exploits parallelism.
  – Likely to improve performance.
• Negatives
  – Wastes energy and resources.
  – Interferes with other uses of parallelism.
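A minimal Speculative Execution sketch in standard C++: the expensive optional path is started early with std::async and discarded if the cheap path succeeds. Because standard futures cannot be forcibly cancelled, cancellation here is cooperative through an atomic flag; the fast and slow paths are stand-ins for illustration.

#include <atomic>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

std::atomic<bool> cancelled{false};

int expensive_fallback() {
    for (int i = 0; i < 100; ++i) {
        if (cancelled) return -1;                 // give up early if told to
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return 999;                                   // the slow, optional answer
}

int main() {
    // Start the optional path speculatively, before we know whether we need it.
    auto speculative = std::async(std::launch::async, expensive_fallback);

    bool fast_path_succeeded = true;              // pretend the cheap path worked
    if (fast_path_succeeded) {
        cancelled = true;                         // cancel: the result is not needed
        std::cout << "used the fast path\n";
    } else {
        std::cout << "fallback = " << speculative.get() << '\n';
    }
    // The future's destructor waits for the (now cancelled) task to finish.
    return 0;
}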

Page 28: Parallel Algorithms

Conclusion

• We’ve explored some of the concepts of shared memory and distributed memory programming.
• I’ve also introduced patterns.
• The tutorial is about the dining philosophers problem. There’s a lot on the web, including a few C# versions. Try to solve it on your own first.

