Concurrent, Multi-core Programming on Windows and .NET — Pre-conference Session
David Callahan, Distinguished Engineer, Microsoft Corporation
Joe Duffy, Lead Software Engineer, Microsoft Corporation
Stephen Toub, Lead Program Manager, Microsoft Corporation
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Overview and Architecture
• The Shift to Manycore — parallel computing matters
• Foundations — parallel computing concepts
• Techniques — concerns, top-down
Moore's Law
“That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.” -- Intel co-founder Gordon Moore in 1965
Quad-core Nehalem announced at IDF in 2007: 731 Million transistors (more than 13 doublings later…)
Spending Moore's Dividend
[Chart: minimum system requirements across Windows releases — Windows 3.1, NT 3.51, Windows 95, Windows 98, Windows 2000, Windows XP, Windows XP SP2, Vista Premium — plotted on a log scale (1.0–1000.0) for processor (SPECInt), memory (MB), and disk (MB). Thanks to Jim Larus of Microsoft Research.]
52% CAGR in SPEC performance! Attack of the killer micros! Software is a gas!
The Power Wall
[Chart: power density (W/cm²), log scale from 1 to 10,000, vs. year (’70–’10) for Intel processors — the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, and the Pentium® family. The trend line passes the power density of a hot plate and extrapolates toward that of a nuclear reactor, a rocket nozzle, and the Sun’s surface.]
Dr. Pat Gelsinger, Sr. VP, Intel Corporation and GM, Digital Enterprise Group, February 19, 2004, Intel Developer Forum, Spring 2004
The Memory Wall · The ILP Wall
Single-thread software performance will not be improving (much)
The Manycore Shift
Future processors: more cores, not faster cores — optimized for parallel workloads (e.g., Intel Larrabee)
Return of the free lunch = scalable parallel programs
Parallel Computing
Latent parallelism for future scaling
Focus on data – the scalable dimension
Tasks instead of threads
No silver bullet – many “right” approaches
Outline for the rest of this session:
• Running example
• Parallelism, concurrency, “multi-threading”
• System attributes
• Approaches to parallelism
• Mutual exclusion
• Concurrent data structures
• The rabbit hole of performance
• The black hole of correctness
Example: Connected Components

Identify connected components and map every node to its containing component. (All code in this talk is pseudocode.)

foreach node do node.component = null
foreach node do
    if (node.component == null) then
        node.component = new Component;
        roots.add(node);
        dfsearch(node)
    fi
Example: Connected Components

function dfsearch(n)
    foreach m in adjacent(n) do
        if (m.component == null) then
            m.component = n.component
            dfsearch(m)
        fi
Parallelization Strategy
• Start concurrent searches at arbitrary nodes; each is a candidate component
• Identify where searches “collide”
• Arbitrate “ownership” of the nodes
• Record that two candidates connect
• Candidates & connections form a reduced graph
• Recursively find components on the reduced graph
• Update nodes to refer to final components
Parallelism v. Concurrency
• Concurrent processing: independent requests (most server applications)
• Parallel processing: decompose one task to enable concurrent execution
• From the example: “Start concurrent searches …” is the parallelism; “Arbitrate ‘ownership’ of the nodes” is the concurrency — scheduling tasks, simulating isolation of threads
Multi-threading, asynchronous, …

System Metrics
• Fairness: allocating resources to competing uses
• Preemption: removing resources from one use to enable another
• Responsiveness: latency from request to response
• Throughput: work (requests) per unit time

For parallelism these are not a goal but a context — an existing architectural concern. Drive overheads down.
Approaches to Parallelism — ways to structure the code
• Task v. data parallelism
• Structured multi-threading
• Work-sharing (SPMD, OpenMP)
• Dataflow
Task v. Data

parallel foreach node do node.component = NULL
parallel foreach node do … start a parallel search …

Classically data parallel: the same operation applied to a homogeneous collection. Data-focused, but built on an underlying “task” model for generality.
Structured Multi-threading

function dfsearch(n)
    parallel foreach m in adjacent(n) do
        if (… first to visit m …) dfsearch(m) fi

Each iteration is a task; all tasks finish before the function returns.
• Emphasizes recursive decomposition
• Preserves function interfaces (“fork-join”)
• Structured control constructs: parallel loops, co-begin
Work-sharing

Parallel                        -- acquire workers
    shared foreach node do node.component = NULL
                                -- implied barrier, workers wait
    shared foreach node do … start a parallel search …
                                -- release workers

• Emphasizes processors: “fork-join” threads plus a barrier
• Structured control constructs: “shared loops”; improving support for recursion
• OpenMP is the common binding of this model
Resource management is too hard.
Dataflow
• A computation is a network: data flows along connections; nodes consume, transform, and produce data
• A good fit for: streaming media algorithms, concurrent event systems, component interactions for responsiveness, and occasionally recursive algorithms

[Figure: data-flow graph for the subtasks of Strassen multiplication — inputs m1–m7 feeding outputs c11, c12, c21, c22.]
Mutual Exclusion

function dfsearch(n)
    foreach m in adjacent(n) do
        if (m.component == NULL) then
            m.component = n.component
            dfsearch(m)
        fi

The race, interleaved:

    Search 1                          Search 2
    if (m.component == null)?         if (m.component == null)?
    m.component = n1.component        m.component = n2.component
    dfsearch(m)                       dfsearch(m)   ← both searches claim m!

Identify where searches “collide”; arbitrate “ownership” of the nodes.
Locking for Mutual Exclusion

function dfsearch(n)
    foreach m in adjacent(n) do
        m.lock();
        var old = m.component;
        if (old == NULL) m.component = n.component
        m.unlock();
        if (old == NULL) then
            dfsearch(m)
        else if (old != n.component) then
            -- record the “edge” between searches
        endif

Locks provide exclusion — one action at a time for any specific node — but the algorithm’s correctness still depends on careful reasoning that order does not matter.
Lock-free Techniques

word compare_and_swap(word *loc, word oldv, word newv) {
    word current = *loc;
    if (current == oldv) *loc = newv;
    return current;
}

function dfsearch(n)
    foreach m in adjacent(n) do
        var old = compare_and_swap(&m.component, NULL, n.component)
        if (old == NULL) then …

compare_and_swap is a common hardware primitive (shown above as pseudocode; the real instruction performs the read-compare-write atomically).
• Short duration
• Preemption friendly
• Limited scenarios
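In .NET the same primitive is exposed as Interlocked.CompareExchange. A minimal C# sketch of the arbitration step above — the Node, Component, and TryClaim names are illustrative scaffolding, not from the talk:

using System.Threading;

class Component { }

class Node {
    public Component component;   // null until some search claims it
    public Node[] adjacent;
}

static class Search {
    // Atomically claim m for 'candidate' iff nobody owns it yet.
    // Returns the previous owner: null means we won the race.
    static Component TryClaim(Node m, Component candidate) {
        return Interlocked.CompareExchange(ref m.component, candidate, null);
    }

    internal static void DfSearch(Node n) {
        foreach (Node m in n.adjacent) {
            Component old = TryClaim(m, n.component);
            if (old == null)
                DfSearch(m);            // we own m; keep searching
            else if (old != n.component) {
                // collision: two candidate components touch here;
                // record the (old, n.component) edge for the reduced graph
            }
        }
    }
}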
Concurrent Data Structures

function dfsearch(n, edges)
    foreach m in adjacent(n) do
        m.lock();               -- arbitrate “ownership” of the nodes
        var old = m.component;
        if (old == NULL) m.component = n.component
        m.unlock();
        if (old == NULL) then
            dfsearch(m, edges)
        else if (old != n.component) then
            edges.insert(old, n.component)   -- record that two candidates connect
        endif

Identify where searches “collide”. The edges collection must be concurrency-safe and high-bandwidth.
Example: top level

parallel foreach node do node.component = NULL
parallel foreach node do
    node.lock()
    var old = node.component
    if (old == NULL) node.component = new Component
    node.unlock()
    if (old == NULL) then
        roots.add(node)
        dfsearch(node, edges)
    fi
-- (roots, edges) form a derived problem
Moore’s Dividend: Sequential → Parallel
• Identify computations that are, or may grow to be, performance concerns
• Over-decompose for scaling (the new Free Lunch)
• Structured multi-threading with a data focus
• Relax sequential order to gain more parallelism; ensure atomicity of unordered interactions
• Consider data as well as control flow
• Careful data-structure & locking choices to manage contention
Parallel Computing
The Rabbit Hole of Performance
[Chart: execution time vs. number of processors (1–14) for a program at 95% and at 99% parallel efficiency.]

A program that is 95% (or 99%) parallelizable, with a 3% overhead to parallelize. What eats the remainder: contention, load balance, cache effects, latencies, preemption.
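Those curves follow from Amdahl’s law (which this deck returns to later). As a hedged back-of-envelope model — the chart’s exact data isn’t reproduced here — with parallel fraction f, P processors, and fixed overhead o:

\[
  \text{speedup}(P) \;=\; \frac{t_S}{t_P}
  \;=\; \frac{1}{(1 - f) + \frac{f}{P} + o}
\]
% e.g. f = 0.95, o = 0.03, P = 8:
%   1 / (0.05 + 0.95/8 + 0.03) ≈ 5.0x  -- not 8x
% and as P grows without bound, speedup -> 1 / (0.05 + 0.03) = 12.5x.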
The Black Hole of Correctness
• Data races
• Deadlock
• Livelock
• Memory hierarchy
• Performance robustness
• Reproducibility & testing
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Scheduling: Explicit Threading — for coarse-grained work and agents

System.Threading.Thread exposes Windows threads.
• Explicit control over concurrent work
• Join: wait for it to exit
• IsBackground: allows the process to exit while threads are alive
• Interrupt: dangerous! Abort: even more dangerous! Suspend/Resume: run away!!
• Cons: too expensive for fine-grained work; no process-wide resource management

Thread t = new Thread(delegate {
    // concurrent work
});
t.Start();
Scheduling: ThreadPool — for fine-grained work

System.Threading.ThreadPool executes work with a pool of Threads.
• Manages queues of work items and shared threads; each thread dispatches work items from the queue
• The CLR controls the # of threads to ensure good scaling (improvements for .NET 4.0 to be discussed later)
• Best for fine-grained (task and data) parallelism: short-lived work items that don’t (or seldom) block
• Also used for async I/O completion (FileStream.BeginRead); prefer it over asynchronous delegate invocation
• Cons: blocking on the thread pool can lead to deadlocks / degradation; not straightforward to wait, cancel, continue from, etc.

ThreadPool.QueueUserWorkItem(delegate {
    // concurrent work
});
ThreadPool in Action

[Diagram: the program thread enqueues items 1–6 into the pool’s queue; worker threads 1…p dequeue and execute them.]
Scheduling: ThreadPool — advanced capabilities
• Two pools of work/threads: one for work items (QueueUserWorkItem), one for I/O completions; both are process-wide
• Min/max threads — default: min = 0, max = 250 * CPU
• The max used to be the cause of frequent deadlocks; fixed in the .NET Framework 3.5
• Thread injection & retirement is automatic: the CLR has a daemon thread that watches for blocking, and it throttles creation at 2 threads/second
• The min is often used to reduce “startup” latency
Scheduling: Async Prog Model (APM) — a common async API pattern in the Framework

An API convention supported by much of .NET. For a synchronous API:

int Foo(object o, string s);

the async counterparts are:

IAsyncResult BeginFoo(object o, string s, AsyncCallback callback, object state);
int EndFoo(IAsyncResult result);

• “Begin” schedules work to run asynchronously; “End” retrieves the operation’s return value or throws its exception
• Rendezvous options — blocking: wait on IAsyncResult.AsyncWaitHandle; callback: pass an AsyncCallback to “Begin”, which runs when completed; polling: check IAsyncResult.IsCompleted
• Must call “End” (in almost all cases)
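A small sketch of the callback rendezvous option, using the real FileStream.BeginRead/EndRead pair; the file name and buffer size are arbitrary choices for illustration:

using System;
using System.IO;

class ApmExample {
    static void Main() {
        FileStream fs = new FileStream("data.bin", FileMode.Open,
            FileAccess.Read, FileShare.Read, 4096, true /* async */);
        byte[] buffer = new byte[4096];

        // "Begin" schedules the read and returns immediately.
        fs.BeginRead(buffer, 0, buffer.Length, delegate(IAsyncResult ar) {
            int bytesRead = fs.EndRead(ar);   // retrieves the result (or throws)
            Console.WriteLine("Read {0} bytes", bytesRead);
            fs.Close();
        }, null);

        // … the caller is free to do other work here …
        Console.ReadLine();   // keep the demo process alive for the callback
    }
}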
Scheduling: I/O Completion Ports — efficient async I/O on Windows
• Bound to file HANDLEs and network sockets; async I/O is entirely async in hardware
• An interrupt posts a packet to the port when complete; threads block on the port to be notified of packets; Windows throttles waking for good scaling
• The CLR ThreadPool has a single, global port: all async I/O via the APM goes through it, and a pool of thread-pool I/O threads waits on it
• Highly efficient: the # of blocked threads is amortized
• Can post to it directly using the ThreadPool API:

public static unsafe bool UnsafeQueueNativeOverlapped(NativeOverlapped* overlapped)
Background Work and UIs — UI Marshaling

.NET UI frameworks have thread affinity: controls must be accessed on their creating thread. Each framework has a marshaling mechanism.

Windows Forms (Control.Invoke / BeginInvoke):

// on background thread
Control c = …;
c.BeginInvoke((Action)delegate {
    // runs on UI thread
});

WPF (Dispatcher.Invoke / BeginInvoke):

// on background thread
Control c = …;
c.Dispatcher.BeginInvoke((Action)delegate {
    // runs on UI thread
});
Background Work and UIs — SynchronizationContext

SynchronizationContext hides the marshaling details behind Send (sync) and Post (async):
• SynchronizationContext — Send: d(); Post: ThreadPool.QueueUserWorkItem
• WindowsFormsSynchronizationContext — Send: Control.Invoke; Post: Control.BeginInvoke
• DispatcherSynchronizationContext — Send: Dispatcher.Invoke; Post: Dispatcher.BeginInvoke
• Obtain via SynchronizationContext.Current or AsyncOperationManager.SynchronizationContext
Background Work and UIs — BackgroundWorker

Performs work on the right thread:
• Heavy lifting on a ThreadPool thread (the DoWork event)
• Progress reporting and completion on the UI thread (the ProgressChanged and RunWorkerCompleted events)
• Kicked off by a call to RunWorkerAsync
• Also supports cancellation: initiated by CancelAsync; DoWork needs to poll the CancellationPending flag
• Built on top of SynchronizationContext, captured when the BackgroundWorker is instantiated — works for both Windows Forms and WPF; hides Control.Invoke, Dispatcher.Invoke, etc.
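A minimal sketch of the pattern; the form and its progressBar/statusLabel controls are assumed scaffolding, not from the talk:

using System.ComponentModel;
using System.Windows.Forms;

class MainForm : Form {
    ProgressBar progressBar = new ProgressBar();   // assumed UI controls
    Label statusLabel = new Label();

    void StartWork() {
        BackgroundWorker worker = new BackgroundWorker();
        worker.WorkerReportsProgress = true;
        worker.WorkerSupportsCancellation = true;

        worker.DoWork += delegate(object s, DoWorkEventArgs e) {
            // runs on a ThreadPool thread
            for (int i = 0; i < 100; i++) {
                if (worker.CancellationPending) { e.Cancel = true; return; }
                // … heavy lifting for step i …
                worker.ReportProgress(i + 1);
            }
        };
        worker.ProgressChanged += delegate(object s, ProgressChangedEventArgs e) {
            progressBar.Value = e.ProgressPercentage;   // runs on the UI thread
        };
        worker.RunWorkerCompleted +=
            delegate(object s, RunWorkerCompletedEventArgs e) {
                statusLabel.Text = e.Cancelled ? "Canceled" : "Done";
            };

        worker.RunWorkerAsync();   // kick it off
    }
}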
Thread hopping — ExecutionContext

Ambient state is associated with each thread: security info, call context, … Async points represent a logical continuation, so that state must be flowed. ExecutionContext:
• Capture: gets the current context
• Run: executes a delegate with a captured context
• Flowed automatically by Thread, ThreadPool, Control.BeginInvoke, etc. — but if you do something funky…
• Flow can be suppressed: SuppressFlow, ThreadPool.UnsafeQueueUserWorkItem
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch (12pm–1:15pm) · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Concurrency and State — The Pitfalls of Shared Memory

Threads in a process share access to memory. Easy to share information — enticing, even! But hard to identify what is shared:
• Private: locals, heap objects not shared
• Shared: statics, heap objects shared explicitly — and transitively. E.g.:

class C {
    static int s_f;
    int m_f;
public:
    void f(int * py) {
        int x;
        x++;        // local variable
        s_f++;      // static class member
        m_f++;      // class member
        (*py)++;    // pointer to something
    }
};
Managing State — Isolation, Immutability, and Synchronization

Isolation (a.k.a. confinement): memory space is “partitioned”; no two threads ever access the same state.
    +: no overhead, easy to reason about
    −: sharing is often needed, leading to message passing

Immutability: data is only read, never written.
    +: no overhead, easy to reason about
    −: C# and VB encourage mutability (by lineage)
    −: copying means efficiency can be a challenge
    +: see F# for promising advances!

Synchronization: lock access to shared state.
    +: flexible; programming techniques remain similar
    −: perf overhead, deadlocks, races, …
Sharing Hazards: R/W

Read/Write, a.k.a. non-repeatable read — t1’s 2nd read of x differs from its 1st:

static int x = 0;

void t1() {              void t2() {
    int y = x;               …
                             x = 42;
    int z = x;               …
    // y != z            }
}
Sharing Hazards: W/R

Write/Read, a.k.a. dirty read — t2 saw t1’s initial write of x, but it got “rolled back”:

static int x = 0;

void t1() {                  void t2() {
    try {                        …
        x = 42;                  int y = x;
        …                        f(y);
        throw e;             }
    } catch {
        // whoops; rollback!
        x = 0;
        throw;
    }
}
Sharing Hazards: W/W

Write/Write — what value is in x at the end? And what do y and z contain? Who knows:

static int x = 0;

void t1() {          void t2() {
    x = 42;              x = 99;
    int y = x;           int z = x;
}                    }
On Serializability and Linearizability — ensuring A happens-before (→) B

Transactions are useful concepts. For some method M:
• A — atomic: M happens all-at-once
• C — consistent: M never moves the program into an inconsistent state
• I — isolated: M’s intermediary work is isolated
• D — durable: M’s effects persist for the lifetime of the context in which the effects take place

The effect? Serializability! Given two properly serialized methods M and N, we say either M → N or N → M.

Linearization points: the place where M’s effects take place. Locks are the common implementation technique.
Concurrency and Time — Example of a Serializability Problem

static int x;
…
x++;

// compiles to

MOV EAX, [x]
INC EAX
MOV [x], EAX

One interleaving of three threads t0, t1, t2 (value written/held after each step shown in #):

T=0   t2: MOV EAX, [x]   # 0
T=1   t0: MOV EAX, [x]   # 0
T=2   t0: INC EAX        # 1
T=3   t0: MOV [x], EAX   # 1
T=4   t1: MOV EAX, [x]   # 1
T=5   t1: INC EAX        # 2
T=6   t1: MOV [x], EAX   # 2
T=7   t2: INC EAX        # 1
T=8   t2: MOV [x], EAX   # 1   ← three increments, yet x ends at 1!
Concurrency Begets Complexity
• Behavior — sequential: deterministic; concurrent: nondeterministic
• Memory — sequential: stable; concurrent: in flux (unless private, read-only, or protected by a lock)
• Locks — sequential: unnecessary; concurrent: essential
• Invariants — sequential: must hold only on method entry/exit or calls to external code; concurrent: must hold anytime the protecting lock is not held
• Deadlock — sequential: impossible; concurrent: possible, but can be mitigated
• Testing — sequential: code coverage finds most bugs; concurrent: code coverage is insufficient, since races, timing, and environments probabilistically change behavior
• Debugging — sequential: trace the execution leading to the failure, and finding a fix is generally assured; concurrent: postulate a race and inspect the code — root causes easily remain unidentified
Interlocked Operations — Hardware Synchronization

Interlocked operations are the lowest-level sync primitive:
• A.k.a. compare-and-swap (CAS): XCHG, LOCK CMPXCHG, LOCK CMPXCHG8B
• An atomic sequence of read and write; locks, events, etc. are all built out of them
• Exposed via the System.Threading.Interlocked static class
• Fairly expensive: cache coherency means bus traffic; >100 cycles on non-NUMA, >500 cycles on NUMA (uncontended)

int Add(ref int l, int v);
int CompareExchange(ref int l, int v, int cmp);
int Decrement(ref int l);
int Increment(ref int l);
int Exchange(ref int l, int v);
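A small sketch of both flavors — a plain atomic increment, and the standard CAS retry loop for a conditional update. Counter and TryIncrementBelow are illustrative names:

using System.Threading;

class Counter {
    private int m_count;

    // One LOCK-prefixed instruction; no lock object needed.
    public void Increment() {
        Interlocked.Increment(ref m_count);
    }

    // Conditional read-modify-write via the classic CAS retry loop:
    // increment only while the count is below a cap.
    public bool TryIncrementBelow(int cap) {
        while (true) {
            int observed = m_count;
            if (observed >= cap)
                return false;                    // cap reached; give up
            if (Interlocked.CompareExchange(
                    ref m_count, observed + 1, observed) == observed)
                return true;                     // nobody raced us; success
            // else: another thread changed m_count; re-read and retry
        }
    }
}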
Kernel Synchronization Objects — The Foundation on top of Which All Else Exists

Kernel objects are the basic primitives:
• Signaled / nonsignaled state; WaitForSingleObject, WaitForMultipleObjects
• *Msg variety: pump messages (STAs, GUIs); *Ex variety: alertable waits (dispatch APCs) — the CLR always pumps and does alertable waits
• Data synchronization kinds: Mutex (mutually exclusive), Semaphore (N signals before nonsignaled), auto-reset event (becomes nonsignaled when a wait is satisfied), manual-reset event (must be reset manually)

Exposed through System.Threading.WaitHandle:

public class WaitHandle : IDisposable {
    public void Close();
    public bool WaitOne();
    // timeout variants, and plenty of others…
    public static bool WaitAll(WaitHandle[] hs);
    public static int WaitAny(WaitHandle[] hs);
}
Mutex and Semaphore — in .NET

Useful for Win32 interop, when ACLs need to be used, and for inter-AppDomain synchronization — but otherwise too heavyweight for industrial use.

public class Mutex : WaitHandle {
    public Mutex(string name, MutexSecurity acl, …);
    public void ReleaseMutex();
}

public class Semaphore : WaitHandle {
    public Semaphore(int initialCount, int maximumCount,
                     string name, SemaphoreSecurity acl, …);
    public void Release(int count);
}
Auto- and Manual-Reset Events — in .NET

• Manual-reset: once set, all waiting threads are awoken; reset must happen manually
• Auto-reset: once set, one thread is awoken; reset happens automatically
• Useful for all the previously stated reasons — but also because an event is “sticky”

public class EventWaitHandle : WaitHandle {
    public EventWaitHandle(bool initialState, EventResetMode mode,
                           string name, EventWaitHandleSecurity acl, …);
    public void Reset();
    public void Set();
}

public enum EventResetMode { AutoReset, ManualReset }

public class AutoResetEvent : EventWaitHandle { … }
public class ManualResetEvent : EventWaitHandle { … }
Monitors — Locking

Monitor.Enter/Exit (and TryEnter) provide mutual exclusion:
• Language syntax; any CLR object can be used as the target
• Recursive and timeout-based acquires
• Spins briefly to reduce the frequency of context switches
• Thin and fat locks: a header word before the fields in the object layout is used for the thin lock, and acquiring it incurs a single interlocked operation; a fat lock (sync-block + event) is created on contention and reclaimed by the GC

[C#]  lock (obj) { … }
[VB]  SyncLock obj … End SyncLock

// both expand to:
Monitor.Enter(obj);
try {
    …
} finally {
    Monitor.Exit(obj);
}
Monitors — Condition Variables

Monitor.Wait/Pulse/PulseAll — an alternative to events:
• A waiter releases all locks on the target while it waits
• Pulse wakes one waiter; PulseAll wakes all
• Efficient mechanisms are used internally to pool events: the target is inflated to a fat lock to register the thread/event pair; the event used is one per thread

bool P = false;
…
lock (obj) {
    while (!P)
        Monitor.Wait(obj);
    …
}

// … elsewhere …
lock (obj) {
    P = true;
    Monitor.Pulse(obj);   // or Monitor.PulseAll(obj)
}
Reader/Writer Locks — When Mutual Exclusion Is Unnecessary

Take advantage of R/R “conflicts” being safe: allow many readers in the lock, but when one writer is in, no others can be.
• ReaderWriterLock: read and write modes (AcquireXXLock/ReleaseXXLock); upgrades (but prone to broken serialization)
• ReaderWriterLockSlim (in 3.5) is cheaper and supersedes the old ReaderWriterLock class: read, write, and upgradeable-read modes (EnterXXLock/ExitXXLock); deadlock-free upgrades
• However, scalability can still suffer: when the work in the lock is small, the CAS dominates
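A sketch of the 3.5 API on a read-mostly cache; the Cache class itself is illustrative, not from the talk:

using System.Collections.Generic;
using System.Threading;

class Cache<TKey, TValue> {
    private readonly Dictionary<TKey, TValue> m_map =
        new Dictionary<TKey, TValue>();
    private readonly ReaderWriterLockSlim m_lock = new ReaderWriterLockSlim();

    public bool TryGet(TKey key, out TValue value) {
        m_lock.EnterReadLock();        // many readers may hold this at once
        try { return m_map.TryGetValue(key, out value); }
        finally { m_lock.ExitReadLock(); }
    }

    public void Set(TKey key, TValue value) {
        m_lock.EnterWriteLock();       // a writer excludes everyone else
        try { m_map[key] = value; }
        finally { m_lock.ExitWriteLock(); }
    }
}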
Locks and Fairness — Convoy Avoidance

Should locks be fair? Thread A arrives and gets the lock; thread B arrives and must wait; thread C arrives and must wait. Is it guaranteed that B gets the lock when A exits?
• Pre-Windows Server 2003 SP1: yes. Post: no!
• Fairness exacerbates convoys. E.g., A leaves the lock and pulses B; the time for B to awaken is at least C, where C = the cost of a context switch (>10,000 cycles) — and could be >2C!
• If thread D arrives in this window, it must wait — fairness effectively extends lock hold times!
Thread Local Storage (TLS) — Confined State Within Threads

Ensures a reference is local to one thread.
• Static (preferred) — provides the best performance (~a TLS lookup + field fetch):

[ThreadStatic]
private static T foo;

• Dynamic — Thread.SetData(k, v); … T v = (T)Thread.GetData(k); — only needed when per-object TLS is needed
• In both cases, initialization is tricky: a static initializer runs only once, not once per thread. E.g., static T foo = new T(); is not what you want — every use needs to check for initialization.
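A sketch of the check-on-every-use pattern that this caveat forces; PerThread and t_random are illustrative names:

using System;
using System.Threading;

static class PerThread {
    [ThreadStatic]
    private static Random t_random;   // an initializer here would run on
                                      // only the FIRST thread to touch it

    public static Random Random {
        get {
            if (t_random == null)     // so every use checks and lazily creates
                t_random = new Random(Thread.CurrentThread.ManagedThreadId);
            return t_random;
        }
    }
}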
Immutability in Practice — An ImmutableStack<T> Type

public class ImmutableStack<T> {
    private readonly T m_value;
    private readonly ImmutableStack<T> m_next;
    private readonly bool m_empty;

    public ImmutableStack() { m_empty = true; }

    internal ImmutableStack(T value, ImmutableStack<T> next) {
        m_value = value;
        m_next = next;
        m_empty = false;
    }

    public ImmutableStack<T> Push(T value) {
        // Pushing never mutates: it allocates a new head node.
        return new ImmutableStack<T>(value, this);
    }

    public ImmutableStack<T> Pop(out T value) {
        if (m_empty) throw new Exception("Empty.");
        value = m_value;
        return m_next;
    }
}
Memory Models (Danger!) — Architecture and Platform Guarantees

Sometimes locks are unnecessary: native pointer-sized writes are atomic (32-bit/64-bit). But compilers + processors reorder reads + writes.

CLR memory model:
• Data dependence is respected
• All writes are store/release (nothing moves after them)
• All ‘volatile’ reads are load/acquire (nothing moves before them)
• Adjacent writes can be coalesced
• A fence ensures nothing moves in either direction (a lock, an interlocked operation, Thread.MemoryBarrier)

C++: volatiles and explicit barriers.

Sophisticated code can exploit this — warning! E.g., double-checked locking. It is hard to test, since real reordering differs per platform, and it often requires so much cleverness that locks win out.
Memory Reordering — Examples

Example 1 — can it end with A == 1 && B == 0?

X = Y = 0;
// thread 1      // thread 2
X = 1;           A = Y;
Y = 1;           B = X;

No — except on IA64. (No StoreStore, no LoadLoad reordering.)

Example 2 — can it end with A == 1 && B == 1 && C == 0?

X = Y = 0;
// thread 1      // thread 2      // thread 3
X = 1;           A = X;           Y = 1;
                 B = Y;           C = X;

No. (Transitivity.)

Example 3 — can it end with A == 0 && B == 0?

X = Y = 0;
// thread 1      // thread 2
X = 1;           Y = 1;
A = Y;           B = X;

Yes! (StoreLoad reordering is permitted.)
Word Tearing — Accessing Nonatomic Locations Without Proper Synchronization

Can the assert fire? Yes! On a 32-bit machine each long access is comprised of two reads/writes. Possible values include 0x0L and 0xaaaabbbbccccddddL (of course), but also 0xaaaabbbb00000000L and 0x00000000ccccddddL!!

internal static long s_x;

void t1() {
    int i = 0;
    while (true) {
        s_x = (i & 1) == 0 ? 0x0L : unchecked((long)0xaaaabbbbccccddddUL);
        i++;
    }
}

void t2() {
    while (true) {
        long x = s_x;
        Debug.Assert(x == 0x0L || x == unchecked((long)0xaaaabbbbccccddddUL));
    }
}
Lock Freedom — Double-Edged Sword

Interlocked provides hardware atomicity: CompareExchange(&a, b, c) — if a contains c, replace it with b, guaranteed atomic in the hardware. It can be used to build scalable, lock-free algorithms:

class Stack<T> {
    Node<T> head;

    void Push(T obj) {
        Node<T> n = new Node<T>(obj);
        Node<T> h;
        do {
            h = head;
            n.next = h;
        } while (Interlocked.CompareExchange(ref head, n, h) != h);
    }

    T Pop() {
        Node<T> n;
        do {
            n = head;   // note: an empty stack (head == null) is not handled here
        } while (Interlocked.CompareExchange(ref head, n.next, n) != n);
        return n.Value;
    }
    …
}
Double-Checked Locking — Efficient Lazy Initialization (Variant 1: Never Create >1)

The CLR 2.0 memory model guarantees this is safe; other memory models (ECMA/1.1, C++, Java) don’t. The volatile is necessary to prevent speculative loads (IA64).

class Foo {
    private static volatile Foo s_inst;
    private static object s_mutex = new object();

    internal static Foo Instance {
        get {
            if (s_inst == null)
                lock (s_mutex)
                    if (s_inst == null)
                        s_inst = new Foo(…);
            return s_inst;
        }
    }
}
Double-Checked Locking — Efficient Lazy Initialization (Variant 2: >1 OK)

If the possibility of creating a garbage instance is OK, go lock-free:

class Foo {
    private static volatile Foo s_inst;

    internal static Foo Instance {
        get {
            if (s_inst == null) {
                Foo candidate = new Foo();
                Interlocked.CompareExchange(ref s_inst, candidate, null);
                // losers’ candidates become garbage; s_inst stays unique
            }
            return s_inst;
        }
    }
}
Building a Spin-lock — Trickier Than You Think!

class SpinLock {
    private int m_state = 0;

    public void Enter() {
        while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0)
            /* spin */ ;
    }

    public void Exit() {
        m_state = 0;
    }
}
Building a Spin-lock — Brain-Melting Details …

Spin-waiting needs to:
• Yield immediately on a 1-CPU machine
• Not use CompareExchange in a loop — too much cache contention; back off to add randomization; reread and only retry CompareExchange when the state is seen as free [TTAS]; possibly consider queueing [MCS]
• Avoid priority starvation (the Sleep(0) issue)
• Issue PAUSE instructions on Intel HT machines
• Mark Thread.BeginCriticalRegion/EndCriticalRegion
• Use the managed thread ID to avoid OS thread affinity
Building a Spin-lock — Try Numero Dos (Still Imperfect)

class SpinLock {
    private volatile int m_state = 0;

    public void Enter() {
        int tid = Thread.CurrentThread.ManagedThreadId;
        while (true) {
            if (Interlocked.CompareExchange(ref m_state, tid, 0) == 0)
                return;   // acquired
            int iters = 1;
            while (m_state != 0) {   // TTAS: reread before retrying the CAS
                if (Environment.ProcessorCount == 1) {
                    // single CPU: spinning can't help; yield immediately
                    if (iters % 5 == 0) Thread.Sleep(1);
                    else Thread.Sleep(0);
                    iters++;
                } else {
                    Thread.SpinWait(iters);   // exponential back-off
                    if (iters >= 4096) Thread.Sleep(1);
                    else {
                        if (iters >= 2048) Thread.Sleep(0);
                        iters *= 2;
                    }
                }
            }
        }
    }

    public void Exit() {
        m_state = 0;
    }
}
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Synchronization Best Practices — Lock Consistently

Do: lock over all mutable shared state. Do: always use the same lock for the same state. Do: comment on how state is protected.

class MyList<T> {
    T[] items;   // lock: items
    int n;       // lock: items

    void Add(T item) {
        lock (items) {
            items[n] = item;
            n++;
        }
    }
    …
}
Synchronization Best Practices — Lock for the Right Duration

Do: lock over the entire invariant. Don’t: lock for longer than is absolutely necessary.

class MyList<T> {
    T[] items;   // lock: items
    int n;       // lock: items

    // invariant: n is the count of valid
    // items in the list and items[n] == null

    void Add(T item) {
        lock (items) {
            items[n] = item;
            n++;
        }
    }
    …
}
Synchronization Best Practices — Make Critical Regions Short and Sweet

Do: minimize the time you hold a lock. Don’t: call others’ code while you hold locks. Don’t: block while you hold locks.

class MyList<T> {
    ...

    void Add(T t) {
        lock (items) {
            items[n] = t;
            n++;
        }
        Listener.Notify(this);   // others' code runs outside the lock
    }
    …
}
Synchronization Best Practices — Encapsulate Your Locks

Don’t: use public lock objects. Don’t: lock on Types or Strings.

class MyList<T> {
    T[] items;
    int n;

    static object slk = new object();   // private, dedicated lock object
    …
    static void ResetStats() {
        lock (slk) { … }
    }
    …
}
Synchronization Best Practices — Avoiding Deadlocks

Do: acquire locks in a consistent order. The code below violates that rule — DoAB and DoBA take the same two locks in opposite orders, a classic recipe for deadlock:

class MyService {
    A a; B b; …

    void DoAB() {
        lock (a) lock (b) { a.Do(); b.Do(); }
    }
    void DoBA() {
        lock (b) lock (a) { b.Do(); a.Do(); }   // opposite order!
    }
}
Synchronization Best Practices — Locking Miscellany

Do: document your locking policy, especially for public APIs.
Do: use a reader/writer lock if readers are common.
Do: prefer lock-based code to lock-free code.
Do: prefer Monitors over kernel synchronization.
Avoid: lock recursion in your designs.
Don’t: build your own lock.

ThreadPool Best Practices

Avoid: writing your own thread pools.
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break (3:15pm–3:45pm) · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Coarse vs. Fine-Grained Concurrency — The Impact of Multi-core on Apps

Server programs have it “easy”: they have been doing this for years. With a steady stream of incoming work W, typically W >= #P, so one W per P isn’t such a bad strategy. Multi-core is only a problem once P > W — then the server needs to worry about fine-grained parallelism.

Clients, not so much. They are user-centric and responsiveness-oriented: “user clicks a button” … now what? Faster? Sure! They need to go fine-grained.
Going Fine-Grained — Code and Data

Two basic approaches to fine-grained concurrency:

Approach #1: it’s the task (i.e., code)
• Futures — make a function call o.f() and its results will be available as soon as possible; enough independent calls leads to scaling
• Divide and conquer — split computations as you go into “left” and “right”, running one on a different thread (recursively)

Approach #2: it’s the data
• Partitioning — split the data source into n partitions, and process each partition in parallel (the database query approach)
• Pipelining — break into n stages, and execute each stage in parallel

Ideally, app developers choose whichever is most appropriate for the task at hand. Coarse-grained mixed with task and data parallelism == happiness (so long as libraries don’t get in the way).
A Taxonomy of Concurrency
• Agents/CSPs: message passing, loose coupling
• Task parallelism: statements, structured, futures, ~O(1) parallelism
• Data parallelism: data operations, O(N) parallelism
• Messaging ties them together
On Speedups — Metrics Worth Measuring

Goal: tS/tP ≈ P, where tS is the time it takes to run an algorithm sequentially and tP is the time it takes to run it on P processors.
• If tS/tP = P, this is a linear speedup (good! for every doubling in processors, we halve execution time)
• Most problems aren’t linear, but we try to get as close as possible
• If tS/tP > P, we have super-linear speedup! Wowzas!

Design challenge: cheap problem decomposition. Parallelism always costs something, even when efficient: synchronization, inter-thread communication, cache effects, and threading — costs that simply aren’t factors in sequential code.
Algorithms — Parallel For Loops

Independent loop iterations may run in parallel:

for (int i = 0; i < n; i++) a(i);

… but only if there are no loop-carried dependencies and no shared-state mutation.

Static decomposition: loop iterations are assigned to workers a priori (contiguous chunks, striping, etc.).
    +: simple, predictable, efficient
    −: can’t tolerate iteration imbalance or blocking

Dynamic decomposition: loop iterations are assigned on demand, usually in chunks.
    +: tolerates imbalance and blocking
    −: more difficult; communication overhead
Algorithms — Parallel For Loops: Static Decomposition

void ParallelForS(int lo, int hi, Action<int> body, int p) {
    int chunk = ((hi - lo) + p - 1) / p;   // iterations per thread
    ManualResetEvent mre = new ManualResetEvent(false);
    int remaining = p;

    // Schedule the threads to run in parallel
    for (int i = 0; i < p; i++) {
        ThreadPool.QueueUserWorkItem(delegate(object procId) {
            int start = lo + (int)procId * chunk;
            for (int j = start; j < start + chunk && j < hi; j++) {
                body(j);
            }
            if (Interlocked.Decrement(ref remaining) == 0) mre.Set();
        }, i);
    }

    mre.WaitOne();   // wait for them to finish
}
Algorithms — Parallel For Loops: Dynamic Decomposition

void ParallelForD(int lo, int hi, Action<int> body, int p) {
    const int chunk = 16;   // chunk size (constant)
    ManualResetEvent mre = new ManualResetEvent(false);
    int remaining = p;
    int current = lo;

    // Schedule the threads to run in parallel
    for (int i = 0; i < p; i++) {
        ThreadPool.QueueUserWorkItem(delegate(object procId) {
            int j;
            while ((j = (Interlocked.Add(ref current, chunk) - chunk)) < hi) {
                for (int k = 0; k < chunk && j + k < hi; k++) {
                    body(j + k);
                }
            }
            if (Interlocked.Decrement(ref remaining) == 0) mre.Set();
        }, i);
    }

    mre.WaitOne();   // wait for them to finish
}
Algorithms — Parallel Foreach Loops

What about IEnumerable<T> objects?

IEnumerable<T> enumerable = …;
foreach (T e in enumerable) a(e);

• Challenge: we don’t know the size a priori and can’t index into it (ICollection<T> is slightly better — it has a size — but not by much)
• Can be handled a bit like dynamic for-loop partitioning: hand out chunks of elements at a time
• Requires serializing access to a single enumerator: a large chunk size means more time spent in the lock, but less frequent lock arrivals
Algorithms — Parallel Foreach Loops

void ParallelForEach<T>(IEnumerable<T> e, Action<T> body, int p) {
    const int chunk = 16;   // chunk size (constant)
    ManualResetEvent mre = new ManualResetEvent(false);
    int remaining = p;

    using (IEnumerator<T> en = e.GetEnumerator()) {   // shared
        // Schedule the threads to run in parallel
        for (int i = 0; i < p; i++) {
            ThreadPool.QueueUserWorkItem(delegate(object procId) {
                T[] buffer = new T[chunk];
                int j;
                do {
                    lock (en) {   // serialize access to the one enumerator
                        for (j = 0; j < chunk && en.MoveNext(); j++)
                            buffer[j] = en.Current;
                    }
                    for (int k = 0; k < j; k++) body(buffer[k]);
                } while (j == chunk);
                if (Interlocked.Decrement(ref remaining) == 0) mre.Set();
            }, i);
        }
        mre.WaitOne();   // wait for them to finish
    }
}
Algorithms — Divide and Conquer: Recursion

Process separate “halves” of the problem in parallel: the right half concurrently, and the left half on the current thread (recursively). Switch over to sequential at some point, else too much parallelism is exposed. E.g., mirror a binary tree in place — the sequential version is shown here; see the parallel sketch below:

void Mirror(TreeNode node) {
    if (node == null) return;
    Mirror(node.Left);
    Mirror(node.Right);
    TreeNode tmp = node.Left;
    node.Left = node.Right;
    node.Right = tmp;
}
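A sketch of the parallel version using the era’s primitives, assuming the Mirror and TreeNode from the slide above are in scope; the ParMirror name and depth cutoff are assumptions, not from the talk. Note it forks the right half onto the ThreadPool and joins with an event, so it inherits the earlier caveat about blocking pool threads:

using System.Threading;

static class TreeOps {
    const int Cutoff = 4;   // beyond ~2^4 forks, go sequential

    public static void ParMirror(TreeNode node, int depth) {
        if (node == null) return;
        if (depth >= Cutoff) {          // enough parallelism exposed:
            Mirror(node);               // fall back to the sequential version
            return;
        }
        using (ManualResetEvent done = new ManualResetEvent(false)) {
            ThreadPool.QueueUserWorkItem(delegate {
                ParMirror(node.Right, depth + 1);   // right half: other thread
                done.Set();
            });
            ParMirror(node.Left, depth + 1);        // left half: this thread
            done.WaitOne();                         // join the fork
        }
        TreeNode tmp = node.Left;
        node.Left = node.Right;
        node.Right = tmp;
    }
}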
Algorithms — Reductions

Associative and commutative reductions can be done in parallel: e.g., Sum, Count, Min, Max, Average, …

int ParallelSum(int[] array, int p) {
    int chunk = (array.Length + p - 1) / p;   // iterations per thread
    ManualResetEvent mre = new ManualResetEvent(false);
    int sum = 0, remaining = p;

    // Schedule the threads to run in parallel
    for (int i = 0; i < p; i++) {
        ThreadPool.QueueUserWorkItem(delegate(object procId) {
            int mySum = 0;   // local partial sum: no sharing in the hot loop
            int start = (int)procId * chunk;
            for (int j = start; j < start + chunk && j < array.Length; j++)
                mySum += array[j];
            Interlocked.Add(ref sum, mySum);   // one combine per worker
            if (Interlocked.Decrement(ref remaining) == 0) mre.Set();
        }, i);
    }

    mre.WaitOne();   // wait for them to finish
    return sum;
}
When to “Go Parallel”?

There is a cost; it is only worthwhile when the work per task/element is large, and/or the number of tasks/elements is large.

[Chart: speedup vs. work per task and # of tasks — 1 task (sequential) at one end, then a break-even point at some number of tasks, then a point of diminishing returns.]
Speedup Inhibitors — Synchronous I/O

[Diagram: two threads running work items, with running and waiting segments. With synchronous I/O, each thread stalls waiting for its I/O to finish — 6 work items take 4 time units. With asynchronous (overlapped) I/O, computation overlaps the I/O — the same 6 work items take 3 time units.]
Speedup Inhibitors — Synchronization

Synchronization inhibits speedup.

[Diagram: threads 1–4 alternating between running, running-with-lock, and waiting as a single lock passes among them; only one thread at a time makes progress inside the lock, serializing execution.]
Speedup Inhibitors — Load Imbalance

Load imbalance exacerbates Amdahl’s Law. E.g., if Your API is 50% of the work and is only ever sequential, the maximum parallel speedup is 2x! More than 2 threads is just wasted resource: S = 50%, 1/S == 2 — no matter how many processors, 2x is it.

[Diagram: sequential vs. parallel timelines for threads 1–4; the sequential portion (Your API) leaves threads 2–4 idle.]
Algorithms — Other Miscellaneous Algorithms
• Pipelining: multiple stages run in parallel with one another (a producer/consumer relationship)
• Speculation and search: many threads cooperate to “precompute” an answer; if the answer isn’t needed, the speculation is discarded — wasted work if unused, but speedups otherwise
• Dataflow: a Future<T> abstraction — accessing the Value waits if it isn’t computed yet, or runs the computation synchronously otherwise; synchronization is driven by data dependence
Algorithms — Producer/Consumer: Blocking & Bounded Queue

public class BlockingBoundedQueue<T> {
    private Queue<T> m_queue = new Queue<T>();
    private Semaphore m_fullSemaphore = new Semaphore(128, 128);   // free slots
    private Semaphore m_emptySemaphore = new Semaphore(0, 128);    // filled slots

    public void Enqueue(T item) {
        m_fullSemaphore.WaitOne();            // blocks while the queue is full
        lock (m_queue) { m_queue.Enqueue(item); }
        m_emptySemaphore.Release();
    }

    public T Dequeue() {
        T e;
        m_emptySemaphore.WaitOne();           // blocks while the queue is empty
        lock (m_queue) { e = m_queue.Dequeue(); }
        m_fullSemaphore.Release();
        return e;
    }
}
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Example: "Baby Names"
IEnumerable<BabyInfo> babies = ...;var results = new List<BabyInfo>();foreach(var baby in babies){ if (baby.Name == queryName && baby.State == queryState && baby.Year >= yearStart && baby.Year <= yearEnd) { results.Add(baby); }}results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
Manual Parallel Solution

IEnumerable<BabyInfo> babies = …;
var results = new List<BabyInfo>();
int partitionsCount = Environment.ProcessorCount;
int remainingCount = partitionsCount;
var enumerator = babies.GetEnumerator();
try {
    using (var done = new ManualResetEvent(false)) {
        for (int i = 0; i < partitionsCount; i++) {
            ThreadPool.QueueUserWorkItem(delegate {
                var partialResults = new List<BabyInfo>();
                while (true) {
                    BabyInfo baby;
                    lock (enumerator) {
                        if (!enumerator.MoveNext()) break;
                        baby = enumerator.Current;
                    }
                    if (baby.Name == queryName && baby.State == queryState &&
                        baby.Year >= yearStart && baby.Year <= yearEnd) {
                        partialResults.Add(baby);
                    }
                }
                lock (results) results.AddRange(partialResults);
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
    }
}
finally {
    if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose();
}
LINQ Solution

var results = from baby in babies.AsParallel()
              where baby.Name == queryName && baby.State == queryState &&
                    baby.Year >= yearStart && baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;

The only change from the sequential LINQ query: .AsParallel() on the data source.
Visual Studio 2010 — Tools / Programming Models / Runtimes

[Diagram: the Visual Studio 2010 parallel stack. Tools: parallel debugger windows, concurrency profiler/analysis. Programming models — managed: PLINQ and the Task Parallel Library; native: the Parallel Pattern Library and Agents Library. Concurrency runtime — managed: the ThreadPool; native: a task scheduler and resource manager. Data structures on both sides, all sitting on threads and the operating system.]
Parallel Extensions to the .NET Framework 4.0

What is it?
• .NET types (mscorlib.dll, System.dll, System.Core.dll); no compiler changes necessary
• A work-stealing runtime
• Multiple programming models: declarative data parallelism (PLINQ), imperative data and task parallelism (Task Parallel Library), coordination/synchronization constructs (Coordination Data Structures)
• A common exception-handling model; parallel debugging and profiling support

Why is it good?
• Supports parallelism in any .NET language
• Delivers reduced concept count and complexity, and better time to solution
• Begins to move parallelism capabilities from concurrency experts to domain experts
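For flavor, here is roughly what the two imperative models look like as they eventually shipped in .NET 4 — the exact surface was still in flux at the time of this talk:

using System;
using System.Threading.Tasks;

class TplSketch {
    static void Main() {
        // Imperative data parallelism: the library partitions the
        // iteration space and schedules it with work stealing.
        Parallel.For(0, 10, i => Console.WriteLine("iteration {0}", i));

        // Imperative task parallelism, future-style: Result waits
        // if the value isn't computed yet.
        Task<long> sum = Task.Factory.StartNew(() => {
            long s = 0;
            for (int i = 1; i <= 1000; i++) s += i;
            return s;
        });
        Console.WriteLine(sum.Result);   // prints 500500
    }
}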
Parallel Extensions Architecture

[Diagram: the C#, VB, C++, F#, and other .NET compilers produce MSIL against the Parallel Extensions stack. PLINQ: declarative queries with query analysis, data partitioning (chunk, range, hash, striped, repartitioning), operator types (map, filter, sort, search, reduce, …), and merging (buffering options, order preservation, inverted). Task Parallel Library: loop replacements, imperative task parallelism, parallel algorithms. Coordination Data Structures: concurrent collections, synchronization types, coordination types. All scheduled by a work-stealing scheduler over threads on processors 1…p.]
Work-Stealing in Action

[Diagram: the program thread pushes tasks into a global queue; worker threads 1…p each maintain a local queue, pushing and popping their own tasks at one end, and stealing tasks from other workers’ queues when their own run dry.]
Coordination Data Structures

Used throughout PLINQ and TPL; they address many of today’s core concurrency issues.
• Thread-safe collections: ConcurrentStack<T>, ConcurrentQueue<T>, ConcurrentDictionary<TKey,TValue>, …
• Work exchange: BlockingCollection<T>, IProducerConsumerCollection<T>
• Phased operation: CountdownEvent, Barrier
• Locks: ManualResetEventSlim, SemaphoreSlim, SpinLock, SpinWait
• Initialization: LazyInit<T>
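As one concrete example, BlockingCollection<T> subsumes the hand-rolled blocking bounded queue shown earlier. A sketch against the API as it shipped in .NET 4:

using System;
using System.Collections.Concurrent;
using System.Threading;

class PipelineSketch {
    static void Main() {
        // Bounded to 128 items, like the semaphore-based queue earlier.
        BlockingCollection<int> queue = new BlockingCollection<int>(128);

        Thread producer = new Thread(delegate() {
            for (int i = 0; i < 1000; i++)
                queue.Add(i);            // blocks while the queue is full
            queue.CompleteAdding();      // signals "no more items"
        });
        Thread consumer = new Thread(delegate() {
            // blocks while empty; ends once adding is complete and drained
            foreach (int item in queue.GetConsumingEnumerable())
                Console.WriteLine(item);
        });

        producer.Start(); consumer.Start();
        producer.Join(); consumer.Join();
    }
}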
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
6 Hours in a Slide — Talk Recap
• Concurrency begins with architecture: state management, agent interactions (coarse concurrency), problem decomposition (fine-grained parallelism)
• It is possible with today’s tools: Windows and .NET offer rich support — kernel objects, threads and thread pools, user-mode sync primitives (.NET & C++)
• Advances are on the horizon to bring parallelism within everybody’s reach; you saw some (Parallel Extensions) …
What the Future Holds — Programming Models

Safety: the current offerings have minimal impact (sharp knives). Three key themes:
• Functional: immutable & pure
• Safe imperative: isolated
• Safe side-effects: transactions
Plus verification tools and patterns.

Agents (CSPs) + tasks + data: 1st-class isolated agents; raise the level of abstraction — what, not how.
What the Future Holds — Efficiency and Heterogeneity

Efficiency: “do no harm” — O(P) >= O(1); more static decision-making vs. dynamic; profile-guided optimizations.

The future is heterogeneous: chip multiprocessors are “easy”. Out-of-order vs. in-order cores, GPGPU (fusion of x86 with the GPU), vector ISAs, possibly different memory systems.
All Programmers Will Not Be Parallel
• Implicit parallelism: use APIs that internally use parallelism — apps, LINQ queries, etc., structured in terms of agents
• Explicit parallelism, safe: frameworks, DSLs, XSLT, sorting, searching
• Explicit parallelism, unsafe: Parallel Extensions, etc.
In Conclusion

Opportunity and crisis: competitive advantage for those who grok it; less incentive to upgrade the client platform without it. Architects & senior developers, pay heed — it is time to start thinking and experimenting. Not yet ready for ubiquitous consumption [5-year horizon], but it can make a real difference today in select, embarrassingly parallel places.

Begin experimenting today: Windows Vista + .NET 3.5; play with the Parallel Extensions (.NET 4.0 and C++). Exciting times!

Thank you.
Learn more about concurrency:

Concurrent Programming on Windows (Addison-Wesley) — covers Win32 & the .NET Framework. Just released! Available at the PDC bookstore.

Book signing — Where: PDC bookstore. Date/Time: Wednesday, Oct. 29, 2:30 PM – 3:00 PM.

Learn more about parallel computing at msdn.com/concurrency — and download the Parallel Extensions to the .NET Framework!
PDC Parallelism Sessions
• Microsoft Visual Studio: Bringing out the Best in Multicore Systems — Monday, Oct. 27, 1:45 PM – 3:00 PM
• Parallel Programming for C++ Developers in the Next Version of Microsoft Visual Studio — Monday, Oct. 27, 3:30 PM – 4:45 PM
• The Concurrency and Coordination Runtime and Decentralized Software Services Toolkit — Tuesday, Oct. 28, 1:45 PM – 3:00 PM
• Parallel Programming for Managed Developers with the Next Version of Microsoft Visual Studio — Wednesday, Oct. 29, 10:30 AM – 11:45 AM
• Research: Concurrency Analysis Platform and Tools for Finding Concurrency Bugs — Wednesday, Oct. 29, 10:30 AM – 11:45 AM
• Concurrency Runtime Deep Dive: How to Harvest Multicore Computing Resources — Wednesday, Oct. 29, 1:15 PM – 2:30 PM
• Addressing the Hard Problems of Concurrency — Thursday, Oct. 30, 8:30 AM – 10:00 AM
• Parallel Computing Application Architectures and Opportunities — Thursday, Oct. 30, 10:15 AM – 11:45 AM
• Future of Parallel Computing (Panel) — Thursday, Oct. 30, 12:00 PM – 1:30 PM
Thank you!
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.