Parallel Programming Concepts Summary
Dr. Peter Tröger, M.Sc. Frank Feinbube
Course Topics
■ The Parallelization Problem
□ Power wall, memory wall, Moore's law
□ Terminology and metrics
■ Shared Memory Parallelism
□ Theory of concurrency, hardware today and in the past
□ Programming models, optimization, profiling
■ Shared Nothing Parallelism
□ Theory of concurrency, hardware today and in the past
□ Programming models, optimization, profiling
■ Accelerators
■ Patterns
■ Future trends
Scaring Students with Word Clouds ...
The Free Lunch Is Over
■ Clock speed curve flattened in 2003
□ Heat
□ Power consumption
□ Leakage
■ 2-3 GHz since 2001 (!)
■ Speeding up serial instruction execution through clock speed improvements no longer works
■ We stumbled into the Many-Core Era
[Herb Sutter, 2009]
The Power Wall
■ Air cooling capabilities are limited
□ Maximum temperature of 100-125 °C, hot spot problem
□ Static and dynamic power consumption must be limited
■ Power consumption increases with Moore's law, but growth of hardware performance is expected
■ Further reducing voltage as compensation
□ We can't do that endlessly, lower limit around 0.7 V
□ Strange physical effects
■ Next-generation processors need to use even less power
□ Lower the frequencies, scale them dynamically
□ Use only parts of the processor at a time ('dark silicon')
□ Build energy-efficient special purpose hardware
■ No chance for faster processors through frequency increase
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Memory Wall
■ Caching: Well-established optimization technique for performance
■ Relies on data locality
□ Some instructions are often used (e.g. loops)
□ Some data is often used (e.g. local variables)
□ Hardware keeps a copy of the data in the faster cache
□ On read attempts, data is taken directly from the cache
□ On writes, data is cached and eventually written to memory
■ Similar to ILP, the potential is limited
□ Larger caches do not help automatically
□ At some point, all data locality in the code is already exploited
□ Manual vs. compiler-driven optimization
[arstechnica.com]
Memory Wall
■ If caching is limited, we simply need faster memory
■ The problem: Shared memory is 'shared'
□ Interconnect contention
□ Memory bandwidth
◊ Memory transfer speed is limited by the power wall
◊ Memory transfer size is limited by the power wall
■ Transfer technology cannot keep up with GHz processors
■ Memory is too slow, and the effects cannot be hidden through caching completely → "memory wall"
[dell.com]
The Situation
■ Hardware people
□ The number of transistors N is still increasing
□ Building larger caches no longer helps (memory wall)
□ ILP is out of options (ILP wall)
□ Voltage / power consumption is at the limit (power wall)
◊ Some help from dynamic scaling approaches
□ Frequency is stalled (power wall)
□ The only possible offer is to use the increasing N for more cores
■ For faster software in the future ...
□ Speedup must come from the utilization of an increasing core count, since F is now fixed
□ Software must participate in the power wall handling, to keep F fixed
□ Software must tackle the memory wall
Three Ways Of Doing Anything Faster [Pfister]
■ Work harder (clock speed)
Ø Power wall problem
Ø Memory wall problem
■ Work smarter (optimization, caching)
Ø ILP wall problem
Ø Memory wall problem
■ Get help (parallelization)
□ More cores per single CPU
□ Software needs to exploit them in the right way
Ø Memory wall problem
[Figure: one problem distributed across multiple cores of a single CPU]
Parallelism on Different Levels
■ A processor chip (socket)
□ Chip multi-processing (CMP)
◊ Multiple CPUs per chip, called cores
◊ Multi-core / many-core
□ Simultaneous multi-threading (SMT)
◊ Interleaved execution of tasks on one core
◊ Example: Intel Hyperthreading
□ Chip multi-threading (CMT) = CMP + SMT
□ Instruction-level parallelism (ILP)
◊ Parallel processing of single instructions per core
■ Multiple processor chips in one machine (multi-processing)
□ Symmetric multi-processing (SMP)
■ Multiple processor chips in many machines (multi-computer)
Parallelism on Different Levels
[Figure: CMP architecture with ILP and SMT inside every core; arstechnica.com]
Parallelism on Different Levels: Blue Gene/Q [© 2011 IBM Corporation, IBM System Technology Group]
1. Chip: 16+2 processor cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
■ Sustained single node performance: 10x P, 20x L
■ MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
■ Software and hardware support for programming models for exploitation of node hardware concurrency
Memory on Different Levels
■ Registers (volatile) - fast, expensive, small
■ Processor Caches (volatile)
■ Random Access Memory (RAM) (volatile)
■ Flash / SSD Memory (non-volatile)
■ Hard Drives (non-volatile)
■ Tapes (non-volatile) - slow, cheap, large
A Wild Mixture
[Figure: a cluster of GF100 GPU nodes connected by a network]
[Figure: four identical nodes, each with two CPUs (QPI-connected, DDR3 memory), a MIC and a GPU with GDDR5 attached via 16x PCIe, and dual Gigabit LAN]
The Parallel Programming Problem
[Figure: matching a flexible parallel application to an execution environment with a given configuration and type]
Hardware Abstraction: Flynn's Taxonomy
■ Classify parallel hardware architectures according to their capabilities in the instruction and data processing dimension
□ Single Instruction, Single Data (SISD)
□ Single Instruction, Multiple Data (SIMD)
□ Multiple Instruction, Single Data (MISD)
□ Multiple Instruction, Multiple Data (MIMD)
[Figure: for each class, one processing step consumes instruction(s) and data item(s) and produces output]
Hardware Abstraction: Tasks + Processing Elements
[Figure: programs spawn processes and tasks that are mapped to processing elements (PEs); PEs share memory within a node, and nodes are connected by a network]
Hardware Abstraction: PRAM
■ RAM assumptions: constant memory access time, unlimited memory
■ PRAM assumptions: non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors
■ Alternative models: BSP, LogP
[Figure: RAM - one CPU with input, memory and output; PRAM - multiple CPUs attached to memory via a shared bus]
Hardware Abstraction: BSP
■ Leslie G. Valiant. A Bridging Model for Parallel Computation, 1990
■ Success of the von Neumann model
□ Bridge between hardware and software
□ High-level languages can be efficiently compiled on this model
□ Hardware designers can optimize the realization of this model
■ Similar model for parallel machines
□ Should be neutral about the number of processors
□ Program should be written for v virtual processors that are mapped to p physical ones
□ When v >> p, the compiler has options
■ BSP computation consists of a series of supersteps:
□ 1.) Concurrent computation on all processors
□ 2.) Exchange of data between all processes
□ 3.) Barrier synchronization
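The three superstep phases can be sketched with Python threads, using a shared list for the exchange and a barrier to end the superstep (a minimal illustration of the model, not a BSP library):

```python
import threading

P = 4                                   # number of (virtual) processors
data = list(range(1, 13))               # shared input
partial = [0] * P                       # written during the computation phase
totals = [0] * P
barrier = threading.Barrier(P)          # ends the superstep

def superstep(rank):
    # 1) concurrent computation on a strided slice of the data
    partial[rank] = sum(data[rank::P])
    # 2) "exchange": results become visible via the shared partial list
    barrier.wait()                      # 3) barrier synchronization
    # next superstep: every processor now sees all partial results
    totals[rank] = sum(partial)

threads = [threading.Thread(target=superstep, args=(r,)) for r in range(P)]
for t in threads: t.start()
for t in threads: t.join()
```

After the barrier, every processor computes the same global sum from the exchanged partial results.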
Hardware Abstraction: CSP
■ Behavior of real-world objects can be described through their interaction with other objects
□ Leave out internal implementation details
□ The interface of a process is described as a set of atomic events
■ Event examples for an ATM:
□ card - insertion of a credit card in the ATM card slot
□ money - extraction of money from the ATM dispenser
■ Events for a printer: {accept, print}
■ Alphabet - set of relevant (!) events for an object description
□ An event may never happen in the interaction
□ Interaction is restricted to this set of events
□ αATM = {card, money}
■ A CSP process is the behavior of an object, described with its alphabet
Hardware Abstraction: LogP
■ Criticism of oversimplification in PRAM-based approaches, which encourages exploitation of 'formal loopholes' (e.g. communication)
■ Trend towards multicomputer systems with large local memories
■ Characterization of a parallel machine by:
□ P: Number of processors
□ g (gap): Minimum time between two consecutive transmissions
◊ Reciprocal corresponds to per-processor communication bandwidth
□ L (latency): Upper bound on messaging time
□ o (overhead): Exclusive processor time needed for a send / receive operation
■ L, o, g in multiples of processor cycles
Hardware Abstraction: OpenCL
Private
Per work-item
Local Shared within a workgroup
Global/ Constant Visible to
all workgroups
Host Memory
On the CPU ParProg | GPU Computing | FF2013
25
[4]
The Parallel Programming Problem
Software View: Concurrency vs. Parallelism
■ Concurrency means dealing with several things at once
□ Programming concept for the developer
□ In shared-memory systems, implemented by time sharing
■ Parallelism means doing several things at once
□ Demands parallel hardware
■ Parallel programming is a misnomer
□ Concurrent programming aiming at parallel execution
■ Any parallel software is concurrent software
□ Note: Some researchers disagree, most practitioners agree
■ Concurrent software is not always parallel software
□ Many server applications achieve scalability by optimizing concurrency only (web server)
Server Example: No Concurrency, No Parallelism
Server Example: Concurrency for Throughput
Server Example: Parallelism for Throughput
Server Example: Parallelism for Speedup
Concurrent Execution
■ Program as sequence of atomic statements
□ "Atomic": Executed without interruption
■ Concurrent execution is the interleaving of atomic statements from multiple tasks
□ Tasks may share resources (variables, operating system handles, ...)
□ Operating system timing is not predictable, so interleaving is not predictable
□ May impact the result of the application
■ Since parallel programs are concurrent programs, we need to deal with that!
Example (initial value x=1): Task A executes y=x; y=y-1; x=y and task B executes z=x; z=z+1; x=z
Case 1: y=x, z=x, y=y-1, z=z+1, x=y, x=z → x=2
Case 2: y=x, y=y-1, x=y, z=x, z=z+1, x=z → x=1
Case 3: y=x, y=y-1, z=x, z=z+1, x=z, x=y → x=0
Case 4: y=x, z=x, y=y-1, z=z+1, x=z, x=y → x=0
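The cases above can be checked mechanically by enumerating every interleaving of the two tasks; this sketch models each atomic statement as a function on a state dictionary:

```python
from itertools import combinations

# Task A: y=x; y=y-1; x=y    Task B: z=x; z=z+1; x=z
A = [lambda s: s.update(y=s["x"]),
     lambda s: s.update(y=s["y"] - 1),
     lambda s: s.update(x=s["y"])]
B = [lambda s: s.update(z=s["x"]),
     lambda s: s.update(z=s["z"] + 1),
     lambda s: s.update(x=s["z"])]

finals = set()
# choose which 3 of the 6 execution slots belong to task A;
# the order of statements within each task stays fixed
for slots in combinations(range(6), 3):
    state = {"x": 1, "y": 0, "z": 0}
    a, b = iter(A), iter(B)
    for i in range(6):
        (next(a) if i in slots else next(b))(state)
    finals.add(state["x"])
# finals now holds every reachable final value of x
```

Depending on the interleaving, the final value of x is 0, 1 or 2, which is exactly why unpredictable timing may impact the result.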
Critical Section
■ N threads have some code - the critical section - with shared data access
■ Mutual Exclusion demand
□ Only one thread at a time is allowed into its critical section, among all threads that have critical sections for the same resource
■ Progress demand
□ If no other thread is in the critical section, the decision for entering should not be postponed indefinitely; only threads that wait for entering the critical section are allowed to participate in decisions
■ Bounded Waiting demand
□ It must not be possible for a thread requiring access to a critical section to be delayed indefinitely by other threads entering the section (starvation problem)
Critical Sections with Mutexes
[Figure: threads T1-T3 bracket their critical sections with m.lock() / m.unlock(); while one thread holds the mutex, the others wait in a queue]
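The same locking discipline, sketched with Python's threading module (here `threading.Lock` plays the role of the mutex `m` from the figure):

```python
import threading

m = threading.Lock()
counter = 0

def worker():
    global counter
    for _ in range(10000):
        m.acquire()        # enter the critical section; others wait
        counter += 1       # shared data access
        m.release()        # leave the critical section

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
# counter is exactly 30000; without the lock, interleaved
# read-modify-write sequences could lose updates
```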
Critical Sections with High-Level Primitives
■ Today: Multitude of high-level synchronization primitives
■ Spinlock
□ Performs busy waiting, lowest overhead for short locks
■ Reader / Writer Lock
□ Special case of mutual exclusion through semaphores
□ Multiple "reader" processes can enter the critical section at the same time, but a "writer" process should gain exclusive access
□ Different optimizations possible: minimum reader delay, minimum writer delay, throughput, ...
■ Mutex
□ Semaphore that works amongst operating system processes
■ Concurrent Collections
□ Blocking queues and key-value maps with concurrency support
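A reader/writer lock can be built from a condition variable; this sketch (class name and policy are illustrative, not a standard API) lets readers share the section while a writer gets exclusive access:

```python
import threading

class ReaderWriterLock:
    """Illustrative reader/writer lock: readers share, writers exclude."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:              # wait while a writer is inside
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()      # wake a waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True              # exclusive access

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

lock = ReaderWriterLock()
lock.acquire_read(); lock.acquire_read()     # two readers enter together
readers_inside = lock._readers
lock.release_read(); lock.release_read()
```

This simple policy can starve writers under a constant stream of readers; the optimization goals listed above (minimum writer delay etc.) lead to more elaborate variants.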
Critical Sections with High-Level Primitives
■ Reentrant Lock
□ Lock can be obtained several times without locking on itself
□ Useful for cyclic algorithms (e.g. graph traversal) and problems where lock bookkeeping is very expensive
□ A reentrant mutex needs to remember the locking thread(s), which increases the overhead
■ Barriers
□ All concurrent activities stop there and continue together
□ Participants statically defined at compile- or start-time
□ Newer dynamic barrier concepts allow late binding of participants (e.g. X10 clocks, Java phasers)
□ Memory barriers or memory fences enforce separation of memory operations before and after the barrier
◊ Needed for low-level synchronization implementation
Nasty Stuff
Deadlock
■ Two or more processes / threads are unable to proceed
■ Each is waiting for one of the others to do something
Livelock
■ Two or more processes / threads continuously change their states in response to changes in the other processes / threads
■ No global progress for the application
Race condition
■ Two or more processes / threads are executed concurrently
■ The final result of the application depends on the relative timing of their execution
Coffman Conditions
■ 1970. E. G. Coffman and A. Shoshani. Sequencing tasks in multiprocess systems to avoid deadlocks
□ All conditions must be fulfilled to allow a deadlock to happen
□ Mutual exclusion condition - individual resources are available or held by no more than one thread at a time
□ Hold and wait condition - threads already holding resources may attempt to hold new resources
□ No preemption condition - once a thread holds a resource, it must voluntarily release it on its own
□ Circular wait condition - it is possible for a thread to wait for a resource held by the next thread in the chain
■ Avoiding circular wait turned out to be the easiest solution for deadlock avoidance
■ Avoiding mutual exclusion leads to non-blocking synchronization
□ These algorithms no longer have a critical section
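Breaking the circular wait condition usually means imposing one global lock acquisition order; a small sketch (ordering by object identity is one arbitrary but consistent choice):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
shared = []

def transfer(name, first, second):
    # Danger: if one thread takes a then b and another takes b then a,
    # circular wait (and hence deadlock) becomes possible. Acquiring
    # both locks in one global order breaks the fourth condition.
    for lock in sorted((first, second), key=id):
        lock.acquire()
    shared.append(name)                 # critical section on both resources
    for lock in (first, second):
        lock.release()

t1 = threading.Thread(target=transfer, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=transfer, args=("t2", lock_b, lock_a))
t1.start(); t2.start()
t1.join(); t2.join()
```

Both threads request the locks in opposite order, but the sorted acquisition means neither can hold one lock while waiting for the other in a cycle.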
Terminology
Starvation
■ A runnable process / thread is overlooked indefinitely
■ Although it is able to proceed, it is never chosen to run (dispatching / scheduling)
Atomic Operation
■ Function or action implemented as a sequence of one or more instructions
■ Appears to be indivisible - no other process / thread can see an intermediate state or interrupt the operation
■ Executed as a group, or not executed at all
Mutual Exclusion
■ The requirement that when one process / thread is using a resource, no other shall be allowed to do so
Is it worth the pain?
■ Parallelization metrics are application-dependent, but follow a common set of concepts
□ Speedup: More resources lead to less time for solving the same task
□ Linear speedup: n times more resources → n times speedup
□ Scaleup: More resources solve a larger version of the same task in the same time
□ Linear scaleup: n times more resources → n times larger problem solvable
■ The most important goal depends on the application
□ Transaction processing usually heads for throughput (scalability)
□ Decision support usually heads for response time (speedup)
Speedup
■ Idealized assumptions
□ All tasks are equal sized
□ All code parts can run in parallel
■ Example: application with v=12 tasks
□ Processing elements N=1: time needed T1=12
□ Processing elements N=3: time needed T3=4
□ (Linear) speedup: T1/T3 = 12/4 = 3
[Figure: 12 equal tasks executed serially vs. in three parallel streams of four]
Speedup with Load Imbalance
■ Assumptions
□ Tasks have different sizes; the best-possible speedup depends on optimized resource usage
□ All code parts can run in parallel
■ Example: application with v=12 tasks
□ Processing elements N=1: time needed T1=16
□ Processing elements N=3: time needed T3=6
□ Speedup: T1/T3 = 16/6 ≈ 2.67
[Figure: 12 unequal tasks; the longest of the three parallel streams determines T3]
Speedup with Serial Parts
■ Each application has inherently non-parallelizable serial parts
□ Algorithmic limitations
□ Shared resources acting as bottleneck
□ Overhead for program start
□ Communication overhead in shared-nothing systems
[Figure: execution alternates serial phases tSER1, tSER2, tSER3 with parallelizable phases tPAR1, tPAR2]
Amdahl's Law
■ Gene Amdahl. "Validity of the single processor approach to achieving large scale computing capabilities". AFIPS 1967
□ Serial parts: TSER = tSER1 + tSER2 + tSER3 + ...
□ Parallelizable parts: TPAR = tPAR1 + tPAR2 + tPAR3 + ...
□ Execution time with one processing element: T1 = TSER + TPAR
□ Execution time with N parallel processing elements: TN >= TSER + TPAR/N
◊ Equal only on perfect parallelization, e.g. no load imbalance
□ Amdahl's Law for the maximum speedup with N processing elements:
S = T1 / TN = (TSER + TPAR) / (TSER + TPAR/N)
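The law as a small function, with TSER and TPAR given as fractions of T1:

```python
def amdahl_speedup(t_ser, t_par, n):
    """Maximum speedup S = (TSER + TPAR) / (TSER + TPAR/n)."""
    return (t_ser + t_par) / (t_ser + t_par / n)

# 10% serial work: four cores give only ~3.08x instead of the ideal 4x,
# and even a huge machine stays bounded by 1/TSER = 10
s4 = amdahl_speedup(0.1, 0.9, 4)
s_many = amdahl_speedup(0.1, 0.9, 10**9)
```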
Amdahl's Law
■ Speedup through parallelism is hard to achieve
■ For unlimited resources, speedup is bound by the serial parts:
□ Assume T1 = 1, then S(N→∞) = T1 / T(N→∞) = 1 / TSER
■ The parallelization problem relates to all system layers
□ Hardware offers some degree of parallel execution
□ The speedup gained is bound by serial parts:
◊ Limitations of hardware components
◊ Necessary serial activities in the operating system, virtual runtime system, middleware and the application
◊ Overhead for the parallelization itself
Gustafson-Barsis' Law (1988)
■ Gustafson and Barsis pointed out that people are typically not interested in the shortest execution time
□ Rather solve the biggest problem in reasonable time
■ Problem size could then scale with the number of processors
□ Leads to a larger parallelizable part with increasing N
□ Typical goal in simulation problems
■ Time spent in the sequential part is usually fixed or grows slower than the problem size → linear speedup possible
■ Formally:
□ PN: Portion of the program that benefits from parallelization, depending on N (and implicitly the problem size)
□ Maximum scaled speedup by N processors: S(N) = (1 - PN) + N · PN
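The scaled speedup as a function (the serial fraction 1 - PN stays fixed while the parallel part grows with N):

```python
def gustafson_speedup(p_n, n):
    """Scaled speedup S(N) = (1 - PN) + N * PN."""
    return (1 - p_n) + n * p_n

# 90% parallelizable work on 10 processors: near-linear scaling of 9.1
s = gustafson_speedup(0.9, 10)
```

Compare this with Amdahl's pessimistic bound for the same fractions: Amdahl fixes the problem size, Gustafson-Barsis scales it with N.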
The Parallel Programming Problem
Programming Model for Shared Memory
■ Different programming models for concurrency in shared memory
■ Processes and threads are mapped to processing elements (cores)
■ Process- and thread-based programming is typically part of operating system lectures
[Figure: concurrent processes with explicitly shared memory; concurrent threads inside one process memory; concurrent tasks mapped to the threads of a process]
OpenMP
■ Programming with the fork-join model
□ Master thread forks into declared tasks
□ Runtime environment may run them in parallel, based on dynamic mapping to threads from a pool
□ Worker task barrier before finalization (join)
[Figure: fork-join execution; Wikipedia]
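The fork-join idea, sketched with Python's thread pool instead of OpenMP (real OpenMP code would use `#pragma omp` directives in C/C++ or Fortran):

```python
from concurrent.futures import ThreadPoolExecutor

def task(i):
    return i * i               # one declared task per input element

# The master "forks" tasks into a pool of worker threads; leaving the
# with-block joins them, acting as the barrier before finalization.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, range(8)))
```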
Task Scheduling
■ Classical task scheduling with a central queue
□ All worker threads fetch tasks from a central queue
□ Scalability issue with increasing thread (resp. core) count
■ Work stealing in OpenMP (and other libraries)
□ Task queue per thread
□ An idling thread "steals" tasks from another thread
□ Independent from thread scheduling
□ Only mutual synchronization
□ No central queue
[Figure: work stealing - each thread pushes new tasks onto its own queue and takes its next task from there; an idle thread steals from another thread's queue]
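The queue discipline can be sketched without real threads: each worker pops its newest own task (LIFO) and, when idle, steals the oldest task of another worker (FIFO). The round-robin driver loop below stands in for the scheduler:

```python
from collections import deque

# one task queue (deque) per worker; worker 0 initially owns all tasks
queues = [deque(range(6)), deque(), deque()]
executed = [[], [], []]

def run_one(worker):
    own = queues[worker]
    if own:
        executed[worker].append(own.pop())        # newest own task (LIFO)
        return True
    for victim, q in enumerate(queues):           # idle: steal the oldest
        if victim != worker and q:
            executed[worker].append(q.popleft())  # task of another (FIFO)
            return True
    return False

while True:
    progressed = [run_one(w) for w in range(3)]   # round-robin "scheduling"
    if not any(progressed):
        break                                     # no work left anywhere
```

All six tasks get executed, and the two initially idle workers obtain their work purely by stealing.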
PGAS Languages
■ Non-uniform memory architectures (NUMA) became the default
■ But: The understanding of memory in programming is flat
□ All variables are equal in access time
□ Considering the memory hierarchy is low-level coding (e.g. cache-aware programming)
■ Partitioned global address space (PGAS) approach
□ Driven by the high-performance computing community
□ Modern approach for large-scale NUMA
□ Explicit notion of a memory partition per processor
◊ Data is designated as local (near) or global (possibly far)
◊ Programmer is aware of NUMA nodes
□ Performance optimization for deep memory hierarchies
Parallel Programming for Accelerators
■ OpenCL exposes CPUs, GPUs, and other accelerators as "devices"
■ Each "device" contains one or more "compute units", i.e. cores, SMs, ...
■ Each "compute unit" contains one or more SIMD "processing elements" [4]
The BIG idea behind OpenCL
■ OpenCL execution model: execute a kernel at each point in a problem domain
■ E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions
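The execution model, mimicked in plain Python rather than the OpenCL API: the "kernel" function runs once per point of the problem domain (the per-pixel operation here is a made-up placeholder):

```python
def kernel(x, y, image):
    # hypothetical per-pixel operation; in OpenCL this body would be
    # compiled once and launched over the whole domain
    return image[y][x] * 2

width, height = 4, 4
image = [[x + y for x in range(width)] for y in range(height)]

# one "kernel invocation" per point of the width x height domain
result = [[kernel(x, y, image) for x in range(width)] for y in range(height)]
```

On a GPU the invocations run in parallel across the processing elements instead of in a sequential loop.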
Message Passing
■ Programming paradigm targeting shared-nothing infrastructures
□ Implementations for shared memory are available, but typically not the best-possible approach
■ Multiple instances of the same application run on a set of nodes (SPMD)
[Figure: a submission host starts instances 0-3 on the execution hosts]
Single Program Multiple Data (SPMD)
■ The sequential program and its data distribution become a sequential node program with message passing
■ Identical copies P0-P3 run with different process identifications
Actor Model
■ Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular Actor Formalism for Artificial Intelligence. IJCAI 1973
□ Another mathematical model for concurrent computation
□ No global system state concept (relationship to physics)
□ Actor as computation primitive
◊ Makes local decisions
◊ Concurrently creates more actors
◊ Concurrently sends / receives messages
□ Asynchronous one-way messaging with changing topology (the CSP communication graph is fixed), no order guarantees
□ Recipient is identified by mailing address
■ "Everything is an actor"
Actor Model
■ Interaction with asynchronous, unordered, distributed messaging
■ Fundamental aspects
□ Emphasis on local state, time and name space
□ No central entity
□ Actor A gets to know actor B only by direct creation, or by name transmission from another actor C
■ Computation
□ Not a global state sequence, but a partially ordered set of events
◊ Event: Receipt of a message by a target actor
◊ Each event is a transition from one local state to another
◊ Events may happen in parallel
■ Messaging reliability declared as an orthogonal aspect
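A minimal actor sketch in Python (illustrative only; production systems would use a framework such as Akka or Erlang processes): local state, a mailbox, and one message processed at a time, with `None` as a hypothetical poison-pill convention:

```python
import queue
import threading

class Actor:
    """Minimal actor: local state is mutated only by the mailbox thread."""
    def __init__(self):
        self._mailbox = queue.Queue()
        self.total = 0                       # local state, never shared directly
        self._done = threading.Event()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):                     # asynchronous one-way messaging
        self._mailbox.put(msg)

    def _run(self):
        while True:
            msg = self._mailbox.get()        # each receipt is one event
            if msg is None:                  # poison pill ends the actor
                self._done.set()
                return
            self.total += msg                # local state transition

adder = Actor()
for value in (1, 2, 3, 4):
    adder.send(value)
adder.send(None)
adder._done.wait()
```

Because only the actor's own thread touches `total`, no lock is needed: mutual exclusion falls out of the one-message-at-a-time processing.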
Message Passing Interface (MPI)
■ MPI_GATHER ( IN sendbuf, IN sendcount, IN sendtype, OUT recvbuf, IN recvcount, IN recvtype, IN root, IN comm )
□ Each process sends its buffer to the root process, including root itself
□ Incoming messages are stored in rank order
□ The receive buffer is ignored for all non-root processes
□ MPI_GATHERV allows a varying count of data to be received
□ The call returns once the send buffer is reusable (completion elsewhere is not promised)
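The rank-ordered semantics, mimicked in plain Python (illustration only; real code would call `MPI_Gather`, or `comm.gather` in mpi4py):

```python
def gather(send_bufs, root):
    """send_bufs[rank] is each process's send buffer; only root receives."""
    recv_bufs = [None] * len(send_bufs)
    # incoming messages are stored in rank order, root's own data included;
    # the receive buffer of every non-root process stays unused
    recv_bufs[root] = [item for rank in range(len(send_bufs))
                            for item in send_bufs[rank]]
    return recv_bufs

out = gather([[0, 1], [10, 11], [20, 21]], root=0)
```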
The Parallel Programming Problem
Execution Environment Mapping
[Figure: mapping parallel applications to Single Instruction, Multiple Data (SIMD) and Multiple Instruction, Multiple Data (MIMD) execution environments]
Patterns for Parallel Programming [Mattson]
■ Finding Concurrency Design Space
□ Task / data decomposition, task grouping and ordering due to data flow dependencies, design evaluation
■ Algorithm Structure Design Space
□ Task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination
□ Mapping of concurrent design elements to execution units
■ Supporting Structures Design Space
□ SPMD, master / worker, loop parallelism, fork / join, shared data, shared queue, distributed array
□ Program structures and data structures used for code creation
■ Implementation Mechanisms Design Space
Designing Parallel Algorithms [Foster]
■ Map a workload problem onto an execution environment
□ Concurrency for speedup
□ Data locality for speedup
□ Scalability
■ The best parallel solution typically differs massively from the sequential version of an algorithm
■ Foster defines four distinct stages of a methodological approach
■ Example: Parallel Sum
Example: Parallel Reduction
■ Reduce a set of elements into one, given an operation
■ Example: Sum
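A parallel sum is usually organized as a tree: pairs are combined in each round, so the rounds (and thus the parallel steps) grow only logarithmically with the input size. A sequential sketch of that combination order:

```python
def tree_reduce(values, op):
    """Pairwise reduction: each round halves the number of values,
    so N elements need about log2(N) rounds of parallel combines."""
    values = list(values)
    while len(values) > 1:
        nxt = [op(values[i], values[i + 1])
               for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # odd element carries over to next round
            nxt.append(values[-1])
        values = nxt
    return values[0]

total = tree_reduce(range(1, 13), lambda a, b: a + b)
```

In a parallel implementation, every combine within one round runs on a different processing element.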
Designing Parallel Algorithms [Foster]
■ A) Search for concurrency and scalability
□ Partitioning - decompose computation and data into small tasks
□ Communication - define the necessary coordination of task execution
■ B) Search for locality and other performance-related issues
□ Agglomeration - consider performance and implementation costs
□ Mapping - maximize processor utilization, minimize communication
■ Might require backtracking or parallel investigation of steps
Partitioning
■ Expose opportunities for parallel execution - fine-grained decomposition
■ A good partition keeps computation and data together
□ Data partitioning leads to data parallelism
□ Computation partitioning leads to task parallelism
□ Complementary approaches, can lead to different algorithms
□ Reveal hidden structures of the algorithm that have potential
□ Investigate complementary views on the problem
■ Avoid replication of either computation or data; this can be revised later to reduce communication overhead
■ This step results in multiple candidate solutions
Partitioning - Decomposition Types
■ Domain Decomposition
□ Define small data fragments
□ Specify the computation for them
□ Different phases of computation on the same data are handled separately
□ Rule of thumb: First focus on large or frequently used data structures
■ Functional Decomposition
□ Split up the computation into disjoint tasks, ignoring the data accessed for the moment
□ With significant data overlap, domain decomposition is more appropriate
Partitioning Strategies [Breshears]
■ Produce at least as many tasks as there will be threads / cores
□ But: It might be more effective to use only a fraction of the cores (granularity)
□ Computation must pay off with respect to the overhead
■ Avoid synchronization, since it adds up as overhead to the serial execution time
■ Patterns for data decomposition
□ By element (one-dimensional)
□ By row, by column group, by block (multi-dimensional)
□ Influenced by the ratio of computation and synchronization
Partitioning - Checklist
■ Checklist for the resulting partitioning scheme
□ Order of magnitude more tasks than processors?
→ Keeps flexibility for the next steps
□ Avoidance of redundant computation and storage requirements?
→ Scalability for large problem sizes
□ Tasks of comparable size?
→ Goal is to allocate equal work to processors
□ Does the number of tasks scale with the problem size?
→ The algorithm should be able to solve larger problems with more processors
■ Resolve bad partitioning by estimating performance behavior, and eventually reformulating the problem
Communication Step
■ Specify links between data consumers and data producers
■ Specify kind and number of messages on these links
■ Domain decomposition problems might have tricky communication infrastructures, due to data dependencies
■ Communication in functional decomposition problems can easily be modeled from the data flow between the tasks
■ Categorization of communication patterns
□ Local communication (few neighbors) vs. global communication
□ Structured communication (e.g. tree) vs. unstructured communication
□ Static vs. dynamic communication structure
□ Synchronous vs. asynchronous communication
Communication - Hints
■ Distribute computation and communication, don't centralize the algorithm
□ Bad example: Central manager for parallel summation
□ Divide-and-conquer helps as a mental model to identify concurrency
■ Unstructured communication is hard to agglomerate, better avoid it
■ Checklist for communication design
□ Do all tasks perform the same amount of communication?
→ Distribute or replicate communication hot spots
□ Does each task perform only local communication?
□ Can communication happen concurrently?
□ Can computation happen concurrently?
Ghost Cells
■ Domain decomposition might lead to chunks that demand data from each other for their computation
■ Solution 1: Copy the necessary portion of data ('ghost cells')
□ If no synchronization is needed after the update
□ Data amount and frequency of updates influence the resulting overhead and efficiency
□ Additional memory consumption
■ Solution 2: Access relevant data 'remotely'
□ Delays thread coordination until the data is really needed
□ Correctness ("old" data vs. "new" data) must be considered on parallel progress
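Solution 1 for a one-dimensional stencil, sketched sequentially (the function names are illustrative): each chunk copies one ghost cell from its neighbor so it can update its own cells without further coordination, and the chunked result matches the serial one:

```python
def stencil_serial(a):
    # new value = three-point neighborhood average, interior points only
    return [(a[i-1] + a[i] + a[i+1]) / 3 for i in range(1, len(a) - 1)]

def stencil_chunked(a, nchunks):
    n = len(a) // nchunks
    result = []
    for c in range(nchunks):
        lo, hi = c * n, (c + 1) * n
        # copy one ghost cell on each side where a neighbor chunk exists
        local = a[max(lo - 1, 0):min(hi + 1, len(a))]
        offset = lo - max(lo - 1, 0)    # index of first owned cell in local
        for i in range(lo, hi):
            if 1 <= i <= len(a) - 2:    # interior points only
                j = i - lo + offset
                result.append((local[j-1] + local[j] + local[j+1]) / 3)
    return result
```

In a real parallel run, the ghost copies would have to be refreshed after every update step, which is exactly the overhead vs. memory trade-off described above.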
Agglomeration Step
■ The algorithm so far is correct, but not specialized for some execution environment
■ Check partitioning and communication decisions again
□ Agglomerate tasks for efficient execution on some machine
□ Replicate data and / or computation for efficiency reasons
■ The resulting number of tasks can still be greater than the number of processors
■ Three conflicting guiding decisions
□ Reduce communication costs by coarser granularity of computation and communication
□ Preserve flexibility with respect to later mapping decisions
□ Reduce software engineering costs (serial → parallel version)
Agglomeration – Granularity vs. Flexibility
■ Reduce communication costs by coarser granularity
□ Sending less data
□ Sending fewer messages (per-message initialization costs)
□ Agglomerate, especially if tasks cannot run concurrently
◊ Also reduces task creation costs
□ Replicate computation to avoid communication (also helps with reliability)
■ Preserve flexibility
□ A flexibly large number of tasks is still a prerequisite for scalability
■ Define granularity as a compile-time or run-time parameter
Agglomeration - Checklist
■ Are communication costs reduced by increasing locality?
■ Does replicated computation outweigh its costs in all cases?
■ Does data replication restrict the range of problem sizes / processor counts?
■ Do the larger tasks still have similar computation / communication costs?
■ Do the larger tasks still act with sufficient concurrency?
■ Does the number of tasks still scale with the problem size?
■ How much can the task count decrease without disturbing load balancing, scalability, or engineering costs?
■ Is the transition to parallel code worth the engineering costs?
Mapping Step
■ Only relevant for shared-nothing systems, since shared memory systems typically perform automatic task scheduling
■ Minimize execution time by
□ Placing concurrent tasks on different nodes
□ Placing tasks with heavy communication on the same node
■ Conflicting strategies, additionally restricted by resource limits
□ In general, an NP-complete bin packing problem
■ Set of sophisticated (dynamic) heuristics for load balancing
□ Preference for local algorithms that do not need global scheduling state
Surface-To-Volume Effect [Foster, Breshears]
■ Visualize the data to be processed (in parallel) as a sliced 3D cube
■ Synchronization requirements of a task
□ Proportional to the surface of the data slice it operates upon
□ Visualized by the amount of 'borders' of the slice
■ Computation work of a task
□ Proportional to the volume of the data slice it operates upon
□ Represents the granularity of decomposition
■ Ratio of synchronization and computation
□ High synchronization, low computation, high ratio → bad
□ Low synchronization, high computation, low ratio → good
□ The ratio decreases for increasing data size per task
■ Coarse granularity by agglomerating tasks in all dimensions
□ For a given volume, the surface then goes down → good
Surface-To-Volume Effect [Foster, Breshears]
[Figure: surface-to-volume ratio of sliced data cubes; (C) nicerweb.com]
Surface-to-Volume Effect [Foster]
■ Computation on an 8x8 grid
■ (a): 64 tasks, one point each
□ 64 x 4 = 256 synchronizations
□ 256 data values are transferred
■ (b): 4 tasks, 16 points each
□ 4 x 4 = 16 synchronizations
□ 16 x 4 = 64 data values are transferred
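Foster's 8x8 numbers can be reproduced with a two-line count (idealized, as on the slide: every task synchronizes once per side, border effects ignored):

```python
def comm_cost(n, k):
    """k*k square tasks on an n*n grid: synchronizations and data values,
    counting 4 sides per task and one transferred value per border point."""
    tasks = k * k
    side = n // k                  # edge length of one task's block
    syncs = tasks * 4              # surface: 4 sides per task
    values = syncs * side          # data volume: side values per side
    return syncs, values

fine = comm_cost(8, 8)             # 64 one-point tasks
coarse = comm_cost(8, 2)           # 4 sixteen-point tasks
```

The coarser decomposition does the same total computation with a quarter of the transferred data, illustrating why agglomeration lowers the surface-to-volume ratio.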
Designing Parallel Algorithms [Breshears]
■ A parallel solution must keep the sequential consistency property
■ "Mentally simulate" the execution of the parallel streams
□ Check critical parts of the parallelized sequential application
■ Consider the amount of computation per parallel task
■ Overhead is always introduced by moving from serial to parallel code
■ The speedup must offset the parallelization overhead (Amdahl)
■ Granularity: Amount of parallel computation done before synchronization is needed
□ Fine-grained granularity overhead vs. coarse-grained granularity concurrency
◊ Iterative approach to finding the right granularity
◊ The decision might be correct only for a chosen execution environment
OK ?!?
Certificate 'for free'