Parallel Programming Concepts - OpenHPI Course
Week 2: Shared Memory Parallelism - Basics
Unit 2.1: Parallelism and Concurrency
Dr. Peter Tröger + Teaching Team
Summary: Week 1
■ Moore's Law and the Power Wall
  □ Processing elements no longer get faster
  □ More cores per processor chip, specialized hardware
■ ILP Wall and Memory Wall
  □ ILP hard to optimize further
  □ Memory access is not fast enough
■ Parallel Hardware Classification
  □ Parallelism on all levels, SIMD vs. MIMD
■ Memory Architectures
  □ UMA vs. NUMA
■ Speedup and Scaling
  □ Amdahl's Law and Gustafson's Law
Hardware Perspective
[Figure: Hardware perspective. Programs decompose into processes and tasks, which are mapped onto processing elements (PEs); each node combines several PEs with a memory, and nodes are connected by a network.]
What is the software perspective?
Back in the day: Batch Processing
■ IBM 1401 - October 5th, 1959
  □ One of the first general-purpose computers
  □ 1401 Processing Unit, 1402 Card Read-Punch (250 cards / minute), printer
■ First concepts of batch processing for multiple job input cards
  □ Operator loads monitor software to run batched jobs from a prepared input tape
  □ Programs are constructed to jump back to the monitor program after termination (early Fortran language)
  □ The monitor program realizes better utilization of the extremely expensive hardware
■ The monitor concept later became the foundation for operating system schedulers
[ibm-1401.info]
Back in the day: Multi-Programming
■ Batch processing is nice, but jobs still waited too long for I/O
■ Idea: load two jobs into memory at the same time
  □ While one is waiting for I/O results, let the other one run
■ Multiplexing of jobs became a basic operating system task -> multi-programming or multi-tasking
  □ Maximize CPU utilization by sacrificing memory
[Figure: CPU utilization under multi-programming. Program A and Program B each alternate Run and Wait phases; in the combined view, the CPU runs one program while the other waits for I/O. (C) Stallings]
Back in the day: Time Sharing
■ Users started to demand interaction with their programs
■ Advent of time-sharing / preemptive multi-tasking systems
  □ Minimize single-user response time
  □ Extension of multi-programming to interactive jobs
  □ Starting point for Unix operating systems in the 1960s
  □ Relies on preemption support in hardware
■ Time-sharing behavior is now the default in operating systems
  □ Applications appear to run 'at the same time'
  □ User assumptions about software behavior should be fulfilled independently of the execution environment
  □ Concurrency must be considered by the developers
Terminology
■ Concurrency
  □ Capability of a system to have multiple activities in progress at the same time
  □ Activities can have a different pace of execution
  □ Concurrent execution on one or multiple processing elements
    ◊ Managed by the operating system
  □ Demands scheduling, dispatch and (often) synchronization
■ Parallelism
  □ Capability of a system to execute activities simultaneously
  □ Demands parallel hardware and concurrency support
  □ Historically, only a term in shared-nothing programming
Terminology
■ Concurrency vs. parallelism vs. distribution
  □ Two tasks defined by the application
    ◊ Specify concurrent activities in the program code
    ◊ Might be executed concurrently or in parallel
    ◊ May be distributed on different machines
■ Management of concurrent activities in an operating system
  □ Multiple applications being executed at the same time
  □ Scheduling and dispatch map runnable tasks to available processing elements
  □ With multiple processing elements, tasks can run in parallel
Concurrency vs. Parallelism
■ Concurrency means dealing with several things at once
  □ A programming concept for the developer
  □ In shared-memory systems, implemented by time sharing
■ Parallelism means doing several things at once
  □ Demands parallel hardware
■ "Parallel programming" is a misnomer
  □ It is concurrent programming aiming at parallel execution
■ Any parallel software is concurrent software
  □ Note: some researchers disagree, most practitioners agree
■ Concurrent software is not always parallel software
  □ Many server applications achieve scalability by optimizing concurrency only (web servers)
[Figure: Parallelism as a subset of concurrency.]
Server Example: No Concurrency, No Parallelism
Server Example: Concurrency for Throughput
Server Example: Parallelism for Throughput
Server Example: Parallelism for Speedup
Terminology
■ Concurrency
  □ Capability of a system to have multiple activities in progress at the same time
  □ Demands scheduling, dispatch and (often) synchronization
■ Parallelism
  □ Capability of a system to execute activities simultaneously
  □ Demands parallel hardware and concurrency support
"The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects." [Herb Sutter, 2005]
Parallel Programming Concepts - OpenHPI Course
Week 2: Shared Memory Parallelism - Basics
Unit 2.2: Concurrency Problems
Dr. Peter Tröger + Teaching Team
Terminology
■ Concurrency
  □ Capability of a system to have multiple activities in progress at the same time
  □ Demands scheduling, dispatch and (often) synchronization
■ Parallelism
  □ Capability of a system to execute activities simultaneously
  □ Demands parallel hardware and concurrency support
"When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone." [Kansas legislature, early 20th century]
Concurrent Execution
■ Code executed in two tasks for concurrently fetching some data
■ Two shared variables between the tasks
■ Executed …
  □ … by two threads on a single core
  □ … by two threads on two cores
■ What can happen?
char *input, *output;

void echo() {
    input = my_id();
    output = input;
    printf("%s", output);
}

// … thread creation: both threads execute echo() on the shared variables …
Concurrent Execution
■ A program is a sequence of atomic statements
  □ "Atomic": executed without interruption
■ Concurrent execution is the interleaving of atomic statements from multiple tasks
  □ Tasks may share resources (variables, operating system handles, …)
  □ Operating system timing is not predictable, so the interleaving is not predictable
  □ This may impact the result of the application
■ Since parallel programs are concurrent programs, we need to deal with that!
Task A: y=x; y=y-1; x=y    Task B: z=x; z=z+1; x=z    (initial value x=1)

Case 1: y=x, y=y-1, x=y, z=x, z=z+1, x=z → x=1
Case 2: y=x, z=x, y=y-1, z=z+1, x=y, x=z → x=2
Case 3: y=x, z=x, y=y-1, z=z+1, x=z, x=y → x=0
Case 4: y=x, z=x, z=z+1, x=z, y=y-1, x=y → x=0
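The scenario above can be reproduced in Java. The following minimal sketch (class and variable names are illustrative) runs the two tasks as threads, so the final value of x depends on the actual interleaving:

public class Interleaving {
    static int x = 1; // shared variable

    public static void main(String[] args) throws InterruptedException {
        // Task A decrements x via a local copy, Task B increments it.
        Thread a = new Thread(() -> { int y = x; y = y - 1; x = y; });
        Thread b = new Thread(() -> { int z = x; z = z + 1; x = z; });
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("x = " + x); // 0, 1, or 2, depending on interleaving
    }
}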
Race Condition
■ Concurrent activities may share global resources
  □ Concurrent read and write access to the same memory
  □ The order of interleaving becomes relevant and may or may not activate a problematic situation
  □ Programming errors become non-deterministic
■ Race condition
  □ The final result of an operation depends on the order of execution
  □ This is always bad, even if no error occurs
  □ Well-known issue since the 1960s, identified by E. W. Dijkstra
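A minimal Java sketch of a race condition (names are illustrative): two threads perform unsynchronized read-modify-write updates on a shared counter, so updates can be lost depending on the interleaving:

public class RaceDemo {
    static int counter = 0; // shared resource, no synchronization

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++)
                counter++; // read-modify-write, not atomic
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Expected 200000; typically less, because concurrent updates get lost.
        System.out.println("counter = " + counter);
    }
}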
Dining Philosophers [Dijkstra]
■ Widely known thought experiment for race conditions
■ Five philosophers, each using two forks for eating
■ Common dining room, circular table surrounded by five labeled chairs, only five forks available for all
■ In the center, a large bowl of spaghetti, constantly replenished
■ When a philosopher gets hungry:
  □ Sits down on his chair
  □ Always picks up the left fork first
  □ Then picks up the right fork and eats
  □ When finished, puts down both forks and leaves until he gets hungry again
  □ Hungry philosophers wait for the second fork before starting to eat
Dining Philosophers [Dijkstra]
■ No two neighbors can eat at the same time, since two forks are needed → Mutual Exclusion
■ Due to the eating policy, some of them may never get their turn → Starvation
■ All philosophers may pick up the left fork at the same time → Deadlock (see also the sketch below)
  □ Can be solved by putting the left fork back when the right fork is unavailable
  □ This prevents the deadlock, but introduces the possibility that all philosophers keep acting and still remain hungry → Livelock
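A minimal Java sketch of the dining philosophers (all names are illustrative). Instead of the put-back policy described above, it avoids the deadlock by a different strategy: every philosopher picks up the lower-numbered fork first, which breaks the circular wait:

import java.util.concurrent.locks.ReentrantLock;

public class DiningPhilosophers {
    static final int N = 5;
    static final ReentrantLock[] forks = new ReentrantLock[N];

    public static void main(String[] args) {
        for (int i = 0; i < N; i++) forks[i] = new ReentrantLock();
        for (int i = 0; i < N; i++) {
            final int left = i, right = (i + 1) % N;
            // Global fork order: always take the lower-numbered fork first.
            final int first = Math.min(left, right), second = Math.max(left, right);
            new Thread(() -> {
                for (int meal = 0; meal < 3; meal++) {
                    forks[first].lock();
                    forks[second].lock();
                    try {
                        System.out.println(Thread.currentThread().getName() + " eats");
                    } finally {
                        forks[second].unlock();
                        forks[first].unlock();
                    }
                }
            }, "Philosopher-" + i).start();
        }
    }
}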
Concurrency Issues
■ Mutual Exclusion
  □ The requirement that when one concurrent task is using a shared resource, no other shall be allowed to do so
■ Deadlock
  □ Two or more concurrent tasks are unable to proceed
  □ Each is waiting for one of the others to do something
■ Starvation
  □ A runnable task is overlooked indefinitely
  □ Although it is able to proceed, it is never chosen to run
■ Livelock
  □ Two or more concurrent tasks continuously change their states in response to changes in the other tasks
  □ No global progress for the application
Concurrency Types
■ Independent tasks, unaware of each other: competition for resources
  □ Task results are always independent from the activities of other tasks
■ Dependent tasks, sharing resources: cooperation by sharing
  □ Task results may be affected by mutual exclusion mechanisms
  □ Task results may depend on information from other tasks
■ Dependent tasks, sharing a communication channel: cooperation by communication
  □ Task results may be affected by channel access
  □ Task results may depend on information from other tasks
Parallel Programming Concepts - OpenHPI Course
Week 2: Shared Memory Parallelism - Basics
Unit 2.3: Critical Section, Semaphore, Mutex
Dr. Peter Tröger + Teaching Team
Concurrency in History
■ 1961, Atlas Computer, Kilburn & Howarth
  □ First use of interrupts to simulate concurrent execution of multiple programs: multiprogramming
■ 1965, "Cooperating Sequential Processes"
  □ Landmark publication by E. W. Dijkstra
  □ First general principles of concurrent programming
  □ Basic concepts: critical section, mutual exclusion, fairness, speed independence
  □ Note: Historical literature talks about 'processes' as concurrent tasks
Cooperating Sequential Processes [Dijkstra]
■ A sequential single task as application
  □ Execution speed does not influence the correctness
■ Multiple concurrent tasks as application
  □ Besides rare moments of cooperation, tasks run autonomously
  □ The speed assumption should still hold
  □ If it is not fulfilled, it might bring "analogue interferences"
■ Note: Dijkstra already identified the race condition problem here
■ Idea of a critical section
  □ Needs at least two concurrent sequential tasks
  □ At any moment, at most one of them should be inside the section
  □ Prevents race conditions inside the critical section
  □ Implementation through shared synchronization variables
Critical Section
■ N tasks have a critical section for a common resource
■ An algorithm to implement this must consider:
  □ Mutual Exclusion demand
    ◊ Only one task at a time is allowed in the critical section
  □ Progress demand
    ◊ If no other task is in the critical section, the decision for entering should not be postponed indefinitely
    ◊ Only already waiting tasks are part of the decision
  □ Bounded Waiting demand
    ◊ Ensure an upper bound on the waiting time for section entrance
    ◊ Promise that the mutex can be locked at some point in time
    ◊ Targets the starvation problem, not livelock
Critical Section
■ Dekker provided the first algorithmic solution for shared memory
■ Generalized by Lamport with the Bakery algorithm
■ Both solutions assume atomicity and predictable sequential execution at the machine-code level
  □ Execution is unpredictable due to instruction-level parallelism and compiler optimizations
  □ Software must somehow realize atomic synchronization
  □ Works with processor hardware support: test-and-set, compare-and-swap, read-modify-write operations (see the sketch below)
■ Ready-to-use implementations in practice
  □ On operating system level
  □ Wrapped in language and class libraries
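As a sketch of such hardware support wrapped in a class library (an assumed example using java.util.concurrent): a lock-free counter built on compare-and-swap. The loop retries until the atomic publish succeeds:

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int increment() {
        for (;;) {
            int current = value.get();              // read the shared value
            int next = current + 1;                 // speculative local work
            if (value.compareAndSet(current, next)) // atomic compare-and-swap
                return next;                        // success: update published
            // Otherwise another task interfered; retry.
        }
    }
}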
Binary and General Semaphores
■ Several tasks are waiting to enter the critical section
  □ Find a solution that allows waiting processes to 'sleep'
■ Special-purpose integer called a semaphore (see the sketch below)
  □ wait() operation: decrease the value by 1 as an atomic step
    ◊ Performed when critical section entrance is requested
    ◊ Blocks if the semaphore is already zero
  □ signal() operation: increase the value by 1 as an atomic step
    ◊ Performed when the critical section is left
    ◊ Releases one instance of the protected resource
■ A binary semaphore has an initial value of 1
  □ Never goes higher
■ A counting semaphore has an initial value of N
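A sketch of a binary semaphore guarding a critical section, using java.util.concurrent.Semaphore (class and variable names are illustrative):

import java.util.concurrent.Semaphore;

public class SemaphoreDemo {
    static final Semaphore sem = new Semaphore(1); // binary: initial value 1
    static int shared = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                try {
                    sem.acquire();   // wait(): decrement, block at zero
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    shared++;        // critical section
                } finally {
                    sem.release();   // signal(): increment, wake one waiter
                }
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(shared);  // always 200000 with the semaphore
    }
}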
Semaphore and the Operating System
■ Semaphore implementation in the operating system
  □ Wrapped by language or class libraries
■ The implementation typically suspends / resumes the calling thread
  □ Waiting threads are managed internally
  □ The wake-up order is typically undefined
■ Mutex concept standardized as part of the POSIX specification
  □ Similar to a binary semaphore, but a flag instead of a counter
  □ Consideration of ownership
  □ A mutex can only be released by the task that locked it
  □ lock() and unlock() operations
  □ May support recursive locking to avoid recursive deadlock
■ Typical understanding today
  □ Use a mutex for locking, a semaphore for signaling
Example
[Figure: Three threads T1, T2, T3 call m.lock() and m.unlock() around their critical sections. T1 enters its critical section first; T2 and T3 block on m.lock() and are placed in the waiting queue. Each m.unlock() admits the next queued thread into its critical section.]
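The timeline above can be sketched in Java with a ReentrantLock as the mutex m (thread names follow the figure; the rest is illustrative):

import java.util.concurrent.locks.ReentrantLock;

public class MutexDemo {
    static final ReentrantLock m = new ReentrantLock();

    static void worker(String name) {
        m.lock();                  // blocks while another thread holds m
        try {
            System.out.println(name + " in critical section");
            Thread.sleep(100);     // simulate work inside the section
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            m.unlock();            // admit the next thread from the waiting queue
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"T1", "T2", "T3"})
            new Thread(() -> worker(name)).start();
    }
}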
Deadlocks
■ Unknown whether a wait() / lock() operation will block
■ Unknown which task will continue on signal() / unlock()
■ Unpredictable regardless of the number of processing elements
■ Starvation
  □ Indefinite blocking of one of the tasks in the waiting queue
■ Deadlock
  □ Multiple tasks are waiting indefinitely for a lock that can only be released by one of the waiting tasks
  □ Good practice for avoidance: all tasks lock and unlock in the same order
Task 1" Task 2"
wait(S)" wait(Q)"wait(Q)" wait(S)"
..." ..."signal(S)" signal(Q)"signal(Q)" signal(S)"
Potential
Deadlock
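A Java sketch of the pattern above (the semaphore names follow the figure; everything else is illustrative). Task 1 and Task 2 acquire S and Q in opposite order, so each can end up holding one semaphore while waiting forever for the other:

import java.util.concurrent.Semaphore;

public class DeadlockDemo {
    static final Semaphore S = new Semaphore(1), Q = new Semaphore(1);

    static void lockBoth(Semaphore first, Semaphore second, String name) {
        try {
            first.acquire();
            Thread.sleep(10);      // widen the window for the deadlock
            second.acquire();      // may block forever
            System.out.println(name + " acquired both");
            second.release();
            first.release();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        new Thread(() -> lockBoth(S, Q, "Task 1")).start();
        new Thread(() -> lockBoth(Q, S, "Task 2")).start();
        // Avoidance: make both tasks acquire in the same order, e.g. always (S, Q).
    }
}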
Parallel Programming Concepts - OpenHPI Course
Week 2: Shared Memory Parallelism - Basics
Unit 2.4: Monitor Concept
Dr. Peter Tröger + Teaching Team
Monitors
■ 1974, "Monitors: An Operating System Structuring Concept", C. A. R. Hoare
  □ First formal description of the monitor concept
  □ Originally invented by Per Brinch Hansen in 1972
■ The operating system has to schedule requests for resources
  □ Per resource: local data and functionality
  □ Example: current printer status and printer driver functions
■ Combination of resource state and associated functionality
■ Monitor: collection of associated data and functionality
  □ Functions are the same for all instances
  □ Function invocations are mutually exclusive and lead to occupation of the monitor
  □ What happens if the state is not appropriate?
Condition Variables
■ The function implementation itself might need to wait at some point
  □ wait() operation:
    ◊ Issued inside the monitor, causes the caller to wait
    ◊ Temporarily releases the monitor lock while waiting
  □ signal() / notify() operation:
    ◊ Resumes one of the waiting callers
■ There might be more than one reason for waiting inside the function
  □ Variable of type condition in the monitor
  □ Delay operations on some condition variable: condvar.wait(), condvar.signal()
  □ Tasks are signaled for the condition they are waiting for
  □ Transparent implementation as a queue of waiting processes
■ Note: We discuss 'Mesa-style' monitors here (check Wikipedia)
Example: Single Resource Monitor
■ Monitor methods are mutually exclusive, locking on object level
■ The wait() operation releases the lock on the monitor object
■ notify() releases one of the tasks blocked in wait()
class monitor Account {
    private int balance := 0
    private Condition mayBeBigEnough

    public method withdraw(int amount) {
        while balance < amount
            mayBeBigEnough.wait()
        assert(balance >= amount)
        balance := balance - amount
    }

    public method deposit(int amount) {
        balance := balance + amount
        mayBeBigEnough.notify()
    }
}

[Based on Wikipedia example. Pseudocode.]
Example: Java Monitors
■ Monitors are part of the Java programming language
■ Each class can be a monitor
  □ Mutual exclusion of method calls through the synchronized keyword
  □ Methods can use arbitrary objects as condition variables
■ Each object can be used as a condition variable
  □ The Object base class provides the condition variable functionality
    ◊ Object.wait(), Object.notify(), waiting queue
  □ Both functions are only callable from synchronized methods
■ A thread gives up ownership of a monitor
  □ By calling XXX.wait()
  □ By leaving the synchronized method
■ Signaling threads still must give up the ownership of the monitor
Example: Java Monitors
■ The operating system typically gives preference to threads that were just woken up
■ An awakened thread cannot proceed immediately, it still needs the monitor lock
  □ The lock is relinquished by the notify() caller only when it leaves the synchronized block
■ Note: Many high-level concurrency abstractions since Java 5
Method                    Description
void wait();              Enter the waiting queue and block until notified by another thread.
void wait(long timeout);  Enter the waiting queue and block until notified or the timeout elapses.
void notify();            Wake up one arbitrary thread in the waiting queue. If no threads are waiting, do nothing.
void notifyAll();         Wake up all waiting threads. Should be preferred, to avoid the 'lost wakeup' problem.
Example: Java Monitors
class SingleElementQueue {
    int n;
    boolean valueSet = false;

    synchronized int get() {
        while (!valueSet) {
            try { this.wait(); }
            catch (InterruptedException e) { /* ... */ }
        }
        valueSet = false;
        this.notify();
        return n;
    }

    synchronized void put(int n) {
        while (valueSet) {
            try { this.wait(); }
            catch (InterruptedException e) { /* ... */ }
        }
        this.n = n;
        valueSet = true;
        this.notify();
    }
}

class Producer implements Runnable {
    SingleElementQueue q;
    Producer(SingleElementQueue q) {
        this.q = q;
        new Thread(this, "Producer").start();
    }
    public void run() {
        int i = 0;
        while (true) { q.put(i++); }
    }
}

class Consumer implements Runnable { ... } // symmetric to Producer, calls q.get()

class App {
    public static void main(String[] args) {
        SingleElementQueue q = new SingleElementQueue();
        new Producer(q);
        new Consumer(q);
    }
}
Parallel Programming Concepts - OpenHPI Course
Week 2: Shared Memory Parallelism - Basics
Unit 2.5: Advanced Concurrency Concepts
Dr. Peter Tröger + Teaching Team
High-Level Primitives
■ Today: Multitude of high-level synchronization primitives
  □ Usable in language or class libraries
  □ Often based on native operating system support
■ Spinlock (see the sketch below)
  □ Implementation of a waiting loop on a status variable
  □ Performs busy waiting, the processing element runs at 100%
  □ Suitable for short expected waiting times
  □ Less overhead than true mutex synchronization
  □ Also works for code that is not allowed to sleep (drivers)
  □ Only implementable with atomic instructions
  □ Often used inside operating system kernels
  □ Only reasonable with multiple processing elements, otherwise it hurts performance
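A minimal spinlock sketch in Java (illustrative, built on an atomic test-and-set flag). lock() busy-waits until the atomic compare-and-set succeeds:

import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // Busy waiting: spin until we atomically flip false -> true.
        while (!locked.compareAndSet(false, true))
            Thread.onSpinWait(); // CPU hint, available since Java 9
    }

    public void unlock() {
        locked.set(false);       // release: next spinning thread may enter
    }
}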
High-Level Primitives
■ Reader / Writer Lock (see the sketch below)
  □ Special case of mutual exclusion
  □ Protects a critical section
  □ Multiple "reader" tasks can enter at the same time
  □ A "writer" task needs exclusive access
  □ Different optimizations possible
    ◊ Minimum reader delay
    ◊ Minimum writer delay
    ◊ Throughput, …
■ Concurrent Collections
  □ Queues, arrays, hash maps, …
  □ Concurrent access and mutual exclusion already built in
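A reader/writer lock sketch using java.util.concurrent (class and field names are illustrative): any number of readers may hold the lock at once, while a writer gets exclusive access:

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SharedValue {
    private final ReadWriteLock rw = new ReentrantReadWriteLock();
    private int value = 0;

    public int read() {
        rw.readLock().lock();      // shared: concurrent readers allowed
        try {
            return value;
        } finally {
            rw.readLock().unlock();
        }
    }

    public void write(int v) {
        rw.writeLock().lock();     // exclusive: blocks readers and writers
        try {
            value = v;
        } finally {
            rw.writeLock().unlock();
        }
    }
}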
High-Level Primitives
■ Reentrant Lock
  □ A lock that can be obtained several times by the same thread without deadlocking on itself
  □ Useful for cyclic algorithms (e.g. graph traversal)
  □ Useful when lock bookkeeping is very expensive
  □ Needs to remember the locking thread(s)
■ Barriers (see the sketch below)
  □ Concurrent activities meet at one point and continue together
  □ Participants are statically defined
  □ A newer dynamic barrier concept allows participants to be defined at run-time
  □ A memory barrier or memory fence enforces the separation of memory operations before and after the barrier
Barrier
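A barrier sketch in Java using CyclicBarrier (all names are illustrative): three workers finish phase 1 at different times, meet at the barrier, and continue into phase 2 together:

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        final int parties = 3;
        CyclicBarrier barrier = new CyclicBarrier(parties,
                () -> System.out.println("-- all arrived, starting next phase --"));
        for (int i = 0; i < parties; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    System.out.println("worker " + id + ": phase 1 done");
                    barrier.await(); // block until all parties have arrived
                    System.out.println("worker " + id + ": phase 2");
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}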
Futures
■ Future: Object acting as a proxy for an upcoming result
  □ The method starting the activity can return immediately
  □ A read-only placeholder is returned as the result
■ Fetching the final result
  □ Use the future object as a variable in an expression
  □ Call a blocking function to get the final result
#include <iostream>
#include <future>

int main() {
    std::future<int> f = std::async(std::launch::async, [](){ return 8; });
    std::cout << "Waiting..." << std::flush;
    f.wait();
    std::cout << "Done!\nResult: " << f.get() << '\n';
}
Lock-Free Programming
■ A way of sharing data between concurrent tasks without maintaining locks
  □ Prevents deadlock and livelock conditions
  □ Suspension of one task should never influence another task
  □ Blocking by design does not disqualify a lock-free realization
■ Algorithms rely again on hardware support for atomic operations
void LockFreeQueue::push(Node* newHead) {
    for (;;) {
        // Copy a shared variable (m_Head) to a local.
        Node* oldHead = m_Head;
        // Do some speculative work, not yet visible to other threads.
        newHead->next = oldHead;
        // Next, attempt to publish our changes to the shared variable.
        if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
            return;
    }
}
8 Simple Rules For Concurrency [Breshears]
■ "Concurrency is still more art than science"
  □ Identify truly independent computations
  □ Implement concurrency at the highest level possible
    ◊ Coarse-grained tasks are easier to manage
  □ Plan early for scalability
    ◊ Reduce interdependencies as much as possible
  □ Re-use code through libraries
  □ Never assume a particular order of execution
  □ Use task-specific storage if possible, avoid global data
  □ Dare to change the algorithm for a better chance of concurrency
    ◊ First correctness, then performance
Summary: Week 2
■ Parallelism and Concurrency
  □ Parallel programming deals with concurrency issues
  □ Parallelism is mainly a hardware property
■ Concurrency Problems
  □ Race condition, deadlock, livelock, starvation
■ Critical Sections
  □ Progress, bounded waiting, semaphores, mutex
■ Monitor Concept
  □ Condition variables, wait(), notify()
■ Advanced Concurrency Concepts
  □ Spinlocks, Reader / Writer Locks, Barriers, Futures
How do these abstract concepts map to real programming languages?