Concurrent, Multi-core Programming on Windows and .NET — Pre-conference Session
David Callahan, Distinguished Engineer, Microsoft Corporation
Joe Duffy, Lead Software Engineer, Microsoft Corporation
Stephen Toub, Lead Program Manager, Microsoft Corporation
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Overview and Architecture
• The Shift to Manycore — parallel computing matters
• Foundations — parallel computing concepts
• Techniques — concerns, top-down
Moore's Law
“That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.” -- Intel co-founder Gordon Moore in 1965
Quad-core Nehalem announced at IDF in 2007: 731 Million transistors (more than 13 doublings later…)
Spending Moore's Dividend
[Chart: minimum system requirements across Windows releases — Windows 3.1, NT 3.51, Windows 95, Windows 98, Windows 2000, Windows XP, Windows XP SP2, Vista Premium — plotted on a log scale (1.0–1000.0) for processor (SPECInt), memory (MB), and disk (MB). Thanks to Jim Larus of Microsoft Research.]
52% CAGR in SPEC performance! Attack of the killer micros! Software is a gas!
The Power Wall
[Chart: power density (W/cm²), log scale from 1 to 10,000, vs. year (’70–’10) for Intel processors — the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, and the Pentium® family. The trend line passes the power density of a hot plate and extrapolates toward that of a nuclear reactor, a rocket nozzle, and the Sun’s surface.]
Dr. Pat Gelsinger, Sr. VP, Intel Corporation and GM, Digital Enterprise Group, February 19, 2004, Intel Developer Forum, Spring 2004
The Memory Wall · The ILP Wall
Single-thread software performance will not be improving (much)
The Manycore Shift
Future processors: more cores, not faster cores — optimized for parallel workloads (e.g., Intel Larrabee)
Return of the free lunch = scalable parallel programs
Parallel Computing
Latent parallelism for future scaling
Focus on data – the scalable dimension
Tasks instead of threads
No silver bullet – many “right” approaches
Outline for the rest of this session:
• Running example
• Parallelism, concurrency, “multi-threading”
• System attributes
• Approaches to parallelism
• Mutual exclusion
• Concurrent data structures
• The rabbit hole of performance
• The black hole of correctness
Example: Connected Components

Identify connected components and map every node to its containing component. (All code in this talk is pseudocode.)

foreach node do node.component = null
foreach node do
    if (node.component == null) then
        node.component = new Component;
        roots.add(node);
        dfsearch(node)
    fi
Example: Connected Components

function dfsearch(n)
    foreach m in adjacent(n) do
        if (m.component == null) then
            m.component = n.component
            dfsearch(m)
        fi
Parallelization Strategy
• Start concurrent searches at arbitrary nodes; each is a candidate component
• Identify where searches “collide”
• Arbitrate “ownership” of the nodes
• Record that two candidates connect
• Candidates & connections form a reduced graph
• Recursively find components on the reduced graph
• Update nodes to refer to final components
Parallelism v. Concurrency
• Concurrent processing: independent requests (most server applications)
• Parallel processing: decompose one task to enable concurrent execution
• From the example: “Start concurrent searches …” is the parallelism; “Arbitrate ‘ownership’ of the nodes” is the concurrency — scheduling tasks, simulating isolation of threads
Multi-threading, asynchronous, …

System Metrics
• Fairness: allocating resources to competing uses
• Preemption: removing resources from one use to enable another
• Responsiveness: latency from request to response
• Throughput: work (requests) per unit time

For parallelism these are not a goal but a context — an existing architectural concern. Drive overheads down.
Approaches to Parallelism — ways to structure the code
• Task v. data parallelism
• Structured multi-threading
• Work-sharing (SPMD, OpenMP)
• Dataflow
Task v. Data

parallel foreach node do node.component = NULL
parallel foreach node do … start a parallel search …

Classically data parallel: the same operation applied to a homogeneous collection. Data-focused, but built on an underlying “task” model for generality.
Structured Multi-threading

function dfsearch(n)
    parallel foreach m in adjacent(n) do
        if (… first to visit m …) dfsearch(m) fi

Each iteration is a task; all tasks finish before the function returns.
• Emphasizes recursive decomposition
• Preserves function interfaces (“fork-join”)
• Structured control constructs: parallel loops, co-begin
Work-sharing

Parallel                        -- acquire workers
    shared foreach node do node.component = NULL
                                -- implied barrier, workers wait
    shared foreach node do … start a parallel search …
                                -- release workers

• Emphasizes processors: “fork-join” threads plus a barrier
• Structured control constructs: “shared loops”; improving support for recursion
• OpenMP is the common binding of this model
Resource management is too hard.
Dataflow
• A computation is a network: data flows along connections; nodes consume, transform, and produce data
• A good fit for: streaming media algorithms, concurrent event systems, component interactions for responsiveness, and occasionally recursive algorithms

[Figure: data-flow graph for the subtasks of Strassen multiplication — inputs m1–m7 feeding outputs c11, c12, c21, c22.]
Mutual Exclusion

function dfsearch(n)
    foreach m in adjacent(n) do
        if (m.component == NULL) then
            m.component = n.component
            dfsearch(m)
        fi

The race, interleaved:

    Search 1                          Search 2
    if (m.component == null)?         if (m.component == null)?
    m.component = n1.component        m.component = n2.component
    dfsearch(m)                       dfsearch(m)   ← both searches claim m!

Identify where searches “collide”; arbitrate “ownership” of the nodes.
Locking for Mutual Exclusion

function dfsearch(n)
    foreach m in adjacent(n) do
        m.lock();
        var old = m.component;
        if (old == NULL) m.component = n.component
        m.unlock();
        if (old == NULL) then
            dfsearch(m)
        else if (old != n.component) then
            -- record the “edge” between searches
        endif

Locks provide exclusion — one action at a time for any specific node — but the algorithm’s correctness still depends on careful reasoning that order does not matter.
Lock-free Techniques

word compare_and_swap(word *loc, word oldv, word newv) {
    word current = *loc;
    if (current == oldv) *loc = newv;
    return current;
}

function dfsearch(n)
    foreach m in adjacent(n) do
        var old = compare_and_swap(&m.component, NULL, n.component)
        if (old == NULL) then …

compare_and_swap is a common hardware primitive (shown above as pseudocode; the real instruction performs the read-compare-write atomically).
• Short duration
• Preemption friendly
• Limited scenarios
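In .NET the same primitive is exposed as Interlocked.CompareExchange. A minimal C# sketch of the arbitration step above — the Node, Component, and TryClaim names are illustrative scaffolding, not from the talk:

using System.Threading;

class Component { }

class Node {
    public Component component;   // null until some search claims it
    public Node[] adjacent;
}

static class Search {
    // Atomically claim m for 'candidate' iff nobody owns it yet.
    // Returns the previous owner: null means we won the race.
    static Component TryClaim(Node m, Component candidate) {
        return Interlocked.CompareExchange(ref m.component, candidate, null);
    }

    internal static void DfSearch(Node n) {
        foreach (Node m in n.adjacent) {
            Component old = TryClaim(m, n.component);
            if (old == null)
                DfSearch(m);            // we own m; keep searching
            else if (old != n.component) {
                // collision: two candidate components touch here;
                // record the (old, n.component) edge for the reduced graph
            }
        }
    }
}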
Concurrent Data Structures

function dfsearch(n, edges)
    foreach m in adjacent(n) do
        m.lock();               -- arbitrate “ownership” of the nodes
        var old = m.component;
        if (old == NULL) m.component = n.component
        m.unlock();
        if (old == NULL) then
            dfsearch(m, edges)
        else if (old != n.component) then
            edges.insert(old, n.component)   -- record that two candidates connect
        endif

Identify where searches “collide”. The edges collection must be concurrency-safe and high-bandwidth.
Example: top level

parallel foreach node do node.component = NULL
parallel foreach node do
    node.lock()
    var old = node.component
    if (old == NULL) node.component = new Component
    node.unlock()
    if (old == NULL) then
        roots.add(node)
        dfsearch(node, edges)
    fi
-- (roots, edges) form a derived problem
Moore’s Dividend: Sequential → Parallel
• Identify computations that are, or may grow to be, performance concerns
• Over-decompose for scaling (the new Free Lunch)
• Structured multi-threading with a data focus
• Relax sequential order to gain more parallelism; ensure atomicity of unordered interactions
• Consider data as well as control flow
• Careful data-structure & locking choices to manage contention
Parallel Computing
The Rabbit Hole of Performance
[Chart: execution time vs. number of processors (1–14) for a program at 95% and at 99% parallel efficiency.]

A program that is 95% (or 99%) parallelizable, with a 3% overhead to parallelize. What eats the remainder: contention, load balance, cache effects, latencies, preemption.
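Those curves follow from Amdahl’s law (which this deck returns to later). As a hedged back-of-envelope model — the chart’s exact data isn’t reproduced here — with parallel fraction f, P processors, and fixed overhead o:

\[
  \text{speedup}(P) \;=\; \frac{t_S}{t_P}
  \;=\; \frac{1}{(1 - f) + \frac{f}{P} + o}
\]
% e.g. f = 0.95, o = 0.03, P = 8:
%   1 / (0.05 + 0.95/8 + 0.03) ≈ 5.0x  -- not 8x
% and as P grows without bound, speedup -> 1 / (0.05 + 0.03) = 12.5x.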
The Black Hole of Correctness
• Data races
• Deadlock
• Livelock
• Memory hierarchy
• Performance robustness
• Reproducibility & testing
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Scheduling: Explicit Threading — for coarse-grained work and agents

System.Threading.Thread exposes Windows threads.
• Explicit control over concurrent work
• Join: wait for it to exit
• IsBackground: allows the process to exit while threads are alive
• Interrupt: dangerous! Abort: even more dangerous! Suspend/Resume: run away!!
• Cons: too expensive for fine-grained work; no process-wide resource management

Thread t = new Thread(delegate {
    // concurrent work
});
t.Start();
Scheduling: ThreadPool — for fine-grained work

System.Threading.ThreadPool executes work with a pool of Threads.
• Manages queues of work items and shared threads; each thread dispatches work items from the queue
• The CLR controls the # of threads to ensure good scaling (improvements for .NET 4.0 to be discussed later)
• Best for fine-grained (task and data) parallelism: short-lived work items that don’t (or seldom) block
• Also used for async I/O completion (FileStream.BeginRead); prefer it over asynchronous delegate invocation
• Cons: blocking on the thread pool can lead to deadlocks / degradation; not straightforward to wait, cancel, continue from, etc.

ThreadPool.QueueUserWorkItem(delegate {
    // concurrent work
});
ThreadPool in Action

[Diagram: the program thread enqueues items 1–6 into the pool’s queue; worker threads 1…p dequeue and execute them.]
Scheduling: ThreadPool — advanced capabilities
• Two pools of work/threads: one for work items (QueueUserWorkItem), one for I/O completions; both are process-wide
• Min/max threads — default: min = 0, max = 250 * CPU
• The max used to be the cause of frequent deadlocks; fixed in the .NET Framework 3.5
• Thread injection & retirement is automatic: the CLR has a daemon thread that watches for blocking, and it throttles creation at 2 threads/second
• The min is often used to reduce “startup” latency
Scheduling: Async Prog Model (APM) — a common async API pattern in the Framework

An API convention supported by much of .NET. For a synchronous API:

int Foo(object o, string s);

the async counterparts are:

IAsyncResult BeginFoo(object o, string s, AsyncCallback callback, object state);
int EndFoo(IAsyncResult result);

• “Begin” schedules work to run asynchronously; “End” retrieves the operation’s return value or throws its exception
• Rendezvous options — blocking: wait on IAsyncResult.AsyncWaitHandle; callback: pass an AsyncCallback to “Begin”, which runs when completed; polling: check IAsyncResult.IsCompleted
• Must call “End” (in almost all cases)
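A small sketch of the callback rendezvous option, using the real FileStream.BeginRead/EndRead pair; the file name and buffer size are arbitrary choices for illustration:

using System;
using System.IO;

class ApmExample {
    static void Main() {
        FileStream fs = new FileStream("data.bin", FileMode.Open,
            FileAccess.Read, FileShare.Read, 4096, true /* async */);
        byte[] buffer = new byte[4096];

        // "Begin" schedules the read and returns immediately.
        fs.BeginRead(buffer, 0, buffer.Length, delegate(IAsyncResult ar) {
            int bytesRead = fs.EndRead(ar);   // retrieves the result (or throws)
            Console.WriteLine("Read {0} bytes", bytesRead);
            fs.Close();
        }, null);

        // … the caller is free to do other work here …
        Console.ReadLine();   // keep the demo process alive for the callback
    }
}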
Scheduling: I/O Completion Ports — efficient async I/O on Windows
• Bound to file HANDLEs and network sockets; async I/O is entirely async in hardware
• An interrupt posts a packet to the port when complete; threads block on the port to be notified of packets; Windows throttles waking for good scaling
• The CLR ThreadPool has a single, global port: all async I/O via the APM goes through it, and a pool of thread-pool I/O threads waits on it
• Highly efficient: the # of blocked threads is amortized
• Can post to it directly using the ThreadPool API:

public static unsafe bool UnsafeQueueNativeOverlapped(NativeOverlapped* overlapped)
Background Work and UIs — UI Marshaling

.NET UI frameworks have thread affinity: controls must be accessed on their creating thread. Each framework has a marshaling mechanism.

Windows Forms (Control.Invoke / BeginInvoke):

// on background thread
Control c = …;
c.BeginInvoke((Action)delegate {
    // runs on UI thread
});

WPF (Dispatcher.Invoke / BeginInvoke):

// on background thread
Control c = …;
c.Dispatcher.BeginInvoke((Action)delegate {
    // runs on UI thread
});
Background Work and UIs — SynchronizationContext

SynchronizationContext hides the marshaling details behind Send (sync) and Post (async):
• SynchronizationContext — Send: d(); Post: ThreadPool.QueueUserWorkItem
• WindowsFormsSynchronizationContext — Send: Control.Invoke; Post: Control.BeginInvoke
• DispatcherSynchronizationContext — Send: Dispatcher.Invoke; Post: Dispatcher.BeginInvoke
• Obtain via SynchronizationContext.Current or AsyncOperationManager.SynchronizationContext
Background Work and UIs — BackgroundWorker

Performs work on the right thread:
• Heavy lifting on a ThreadPool thread (the DoWork event)
• Progress reporting and completion on the UI thread (the ProgressChanged and RunWorkerCompleted events)
• Kicked off by a call to RunWorkerAsync
• Also supports cancellation: initiated by CancelAsync; DoWork needs to poll the CancellationPending flag
• Built on top of SynchronizationContext, captured when the BackgroundWorker is instantiated — works for both Windows Forms and WPF; hides Control.Invoke, Dispatcher.Invoke, etc.
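A minimal sketch of the pattern; the form and its progressBar/statusLabel controls are assumed scaffolding, not from the talk:

using System.ComponentModel;
using System.Windows.Forms;

class MainForm : Form {
    ProgressBar progressBar = new ProgressBar();   // assumed UI controls
    Label statusLabel = new Label();

    void StartWork() {
        BackgroundWorker worker = new BackgroundWorker();
        worker.WorkerReportsProgress = true;
        worker.WorkerSupportsCancellation = true;

        worker.DoWork += delegate(object s, DoWorkEventArgs e) {
            // runs on a ThreadPool thread
            for (int i = 0; i < 100; i++) {
                if (worker.CancellationPending) { e.Cancel = true; return; }
                // … heavy lifting for step i …
                worker.ReportProgress(i + 1);
            }
        };
        worker.ProgressChanged += delegate(object s, ProgressChangedEventArgs e) {
            progressBar.Value = e.ProgressPercentage;   // runs on the UI thread
        };
        worker.RunWorkerCompleted +=
            delegate(object s, RunWorkerCompletedEventArgs e) {
                statusLabel.Text = e.Cancelled ? "Canceled" : "Done";
            };

        worker.RunWorkerAsync();   // kick it off
    }
}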
Thread hopping — ExecutionContext

Ambient state is associated with each thread: security info, call context, … Async points represent a logical continuation, so that state must be flowed. ExecutionContext:
• Capture: gets the current context
• Run: executes a delegate with a captured context
• Flowed automatically by Thread, ThreadPool, Control.BeginInvoke, etc. — but if you do something funky…
• Flow can be suppressed: SuppressFlow, ThreadPool.UnsafeQueueUserWorkItem
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch (12pm–1:15pm) · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Concurrency and State — The Pitfalls of Shared Memory

Threads in a process share access to memory. Easy to share information — enticing, even! But hard to identify what is shared:
• Private: locals, heap objects not shared
• Shared: statics, heap objects shared explicitly — and transitively. E.g.:

class C {
    static int s_f;
    int m_f;
public:
    void f(int * py) {
        int x;
        x++;        // local variable
        s_f++;      // static class member
        m_f++;      // class member
        (*py)++;    // pointer to something
    }
};
Managing State — Isolation, Immutability, and Synchronization

Isolation (a.k.a. confinement): memory space is “partitioned”; no two threads ever access the same state.
    +: no overhead, easy to reason about
    −: sharing is often needed, leading to message passing

Immutability: data is only read, never written.
    +: no overhead, easy to reason about
    −: C# and VB encourage mutability (by lineage)
    −: copying means efficiency can be a challenge
    +: see F# for promising advances!

Synchronization: lock access to shared state.
    +: flexible; programming techniques remain similar
    −: perf overhead, deadlocks, races, …
Sharing Hazards: R/W

Read/Write, a.k.a. non-repeatable read — t1’s 2nd read of x differs from its 1st:

static int x = 0;

void t1() {              void t2() {
    int y = x;               …
                             x = 42;
    int z = x;               …
    // y != z            }
}
Sharing Hazards: W/R

Write/Read, a.k.a. dirty read — t2 saw t1’s initial write of x, but it got “rolled back”:

static int x = 0;

void t1() {                  void t2() {
    try {                        …
        x = 42;                  int y = x;
        …                        f(y);
        throw e;             }
    } catch {
        // whoops; rollback!
        x = 0;
        throw;
    }
}
Sharing Hazards: W/W

Write/Write — what value is in x at the end? And what do y and z contain? Who knows:

static int x = 0;

void t1() {          void t2() {
    x = 42;              x = 99;
    int y = x;           int z = x;
}                    }
On Serializability and Linearizability — ensuring A happens-before (→) B

Transactions are useful concepts. For some method M:
• A — atomic: M happens all-at-once
• C — consistent: M never moves the program into an inconsistent state
• I — isolated: M’s intermediary work is isolated
• D — durable: M’s effects persist for the lifetime of the context in which the effects take place

The effect? Serializability! Given two properly serialized methods M and N, we say either M → N or N → M.

Linearization points: the place where M’s effects take place. Locks are the common implementation technique.
Concurrency and Time — Example of a Serializability Problem

static int x;
…
x++;

// compiles to

MOV EAX, [x]
INC EAX
MOV [x], EAX

One interleaving of three threads t0, t1, t2 (value written/held after each step shown in #):

T=0   t2: MOV EAX, [x]   # 0
T=1   t0: MOV EAX, [x]   # 0
T=2   t0: INC EAX        # 1
T=3   t0: MOV [x], EAX   # 1
T=4   t1: MOV EAX, [x]   # 1
T=5   t1: INC EAX        # 2
T=6   t1: MOV [x], EAX   # 2
T=7   t2: INC EAX        # 1
T=8   t2: MOV [x], EAX   # 1   ← three increments, yet x ends at 1!
Concurrency Begets Complexity
• Behavior — sequential: deterministic; concurrent: nondeterministic
• Memory — sequential: stable; concurrent: in flux (unless private, read-only, or protected by a lock)
• Locks — sequential: unnecessary; concurrent: essential
• Invariants — sequential: must hold only on method entry/exit or calls to external code; concurrent: must hold anytime the protecting lock is not held
• Deadlock — sequential: impossible; concurrent: possible, but can be mitigated
• Testing — sequential: code coverage finds most bugs; concurrent: code coverage is insufficient, since races, timing, and environments probabilistically change behavior
• Debugging — sequential: trace the execution leading to the failure, and finding a fix is generally assured; concurrent: postulate a race and inspect the code — root causes easily remain unidentified
Interlocked Operations — Hardware Synchronization

Interlocked operations are the lowest-level sync primitive:
• A.k.a. compare-and-swap (CAS): XCHG, LOCK CMPXCHG, LOCK CMPXCHG8B
• An atomic sequence of read and write; locks, events, etc. are all built out of them
• Exposed via the System.Threading.Interlocked static class
• Fairly expensive: cache coherency means bus traffic; >100 cycles on non-NUMA, >500 cycles on NUMA (uncontended)

int Add(ref int l, int v);
int CompareExchange(ref int l, int v, int cmp);
int Decrement(ref int l);
int Increment(ref int l);
int Exchange(ref int l, int v);
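A small sketch of both flavors — a plain atomic increment, and the standard CAS retry loop for a conditional update. Counter and TryIncrementBelow are illustrative names:

using System.Threading;

class Counter {
    private int m_count;

    // One LOCK-prefixed instruction; no lock object needed.
    public void Increment() {
        Interlocked.Increment(ref m_count);
    }

    // Conditional read-modify-write via the classic CAS retry loop:
    // increment only while the count is below a cap.
    public bool TryIncrementBelow(int cap) {
        while (true) {
            int observed = m_count;
            if (observed >= cap)
                return false;                    // cap reached; give up
            if (Interlocked.CompareExchange(
                    ref m_count, observed + 1, observed) == observed)
                return true;                     // nobody raced us; success
            // else: another thread changed m_count; re-read and retry
        }
    }
}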
Kernel Synchronization Objects — The Foundation on top of Which All Else Exists

Kernel objects are the basic primitives:
• Signaled / nonsignaled state; WaitForSingleObject, WaitForMultipleObjects
• *Msg variety: pump messages (STAs, GUIs); *Ex variety: alertable waits (dispatch APCs) — the CLR always pumps and does alertable waits
• Data synchronization kinds: Mutex (mutually exclusive), Semaphore (N signals before nonsignaled), auto-reset event (becomes nonsignaled when a wait is satisfied), manual-reset event (must be reset manually)

Exposed through System.Threading.WaitHandle:

public class WaitHandle : IDisposable {
    public void Close();
    public bool WaitOne();
    // timeout variants, and plenty of others…
    public static bool WaitAll(WaitHandle[] hs);
    public static int WaitAny(WaitHandle[] hs);
}
Mutex and Semaphore — in .NET

Useful for Win32 interop, when ACLs need to be used, and for inter-AppDomain synchronization — but otherwise too heavyweight for industrial use.

public class Mutex : WaitHandle {
    public Mutex(string name, MutexSecurity acl, …);
    public void ReleaseMutex();
}

public class Semaphore : WaitHandle {
    public Semaphore(int initialCount, int maximumCount,
                     string name, SemaphoreSecurity acl, …);
    public void Release(int count);
}
Auto- and Manual-Reset Events — in .NET

• Manual-reset: once set, all waiting threads are awoken; reset must happen manually
• Auto-reset: once set, one thread is awoken; reset happens automatically
• Useful for all the previously stated reasons — but also because an event is “sticky”

public class EventWaitHandle : WaitHandle {
    public EventWaitHandle(bool initialState, EventResetMode mode,
                           string name, EventWaitHandleSecurity acl, …);
    public void Reset();
    public void Set();
}

public enum EventResetMode { AutoReset, ManualReset }

public class AutoResetEvent : EventWaitHandle { … }
public class ManualResetEvent : EventWaitHandle { … }
Monitors — Locking

Monitor.Enter/Exit (and TryEnter) provide mutual exclusion:
• Language syntax; any CLR object can be used as the target
• Recursive and timeout-based acquires
• Spins briefly to reduce the frequency of context switches
• Thin and fat locks: a header word before the fields in the object layout is used for the thin lock, and acquiring it incurs a single interlocked operation; a fat lock (sync-block + event) is created on contention and reclaimed by the GC

[C#]  lock (obj) { … }
[VB]  SyncLock obj … End SyncLock

// both expand to:
Monitor.Enter(obj);
try {
    …
} finally {
    Monitor.Exit(obj);
}
Monitors — Condition Variables

Monitor.Wait/Pulse/PulseAll — an alternative to events:
• A waiter releases all locks on the target while it waits
• Pulse wakes one waiter; PulseAll wakes all
• Efficient mechanisms are used internally to pool events: the target is inflated to a fat lock to register the thread/event pair; the event used is one per thread

bool P = false;
…
lock (obj) {
    while (!P)
        Monitor.Wait(obj);
    …
}

// … elsewhere …
lock (obj) {
    P = true;
    Monitor.Pulse(obj);   // or Monitor.PulseAll(obj)
}
Reader/Writer Locks — When Mutual Exclusion Is Unnecessary

Take advantage of R/R “conflicts” being safe: allow many readers in the lock, but when one writer is in, no others can be.
• ReaderWriterLock: read and write modes (AcquireXXLock/ReleaseXXLock); upgrades (but prone to broken serialization)
• ReaderWriterLockSlim (in 3.5) is cheaper and supersedes the old ReaderWriterLock class: read, write, and upgradeable-read modes (EnterXXLock/ExitXXLock); deadlock-free upgrades
• However, scalability can still suffer: when the work in the lock is small, the CAS dominates
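A sketch of the 3.5 API on a read-mostly cache; the Cache class itself is illustrative, not from the talk:

using System.Collections.Generic;
using System.Threading;

class Cache<TKey, TValue> {
    private readonly Dictionary<TKey, TValue> m_map =
        new Dictionary<TKey, TValue>();
    private readonly ReaderWriterLockSlim m_lock = new ReaderWriterLockSlim();

    public bool TryGet(TKey key, out TValue value) {
        m_lock.EnterReadLock();        // many readers may hold this at once
        try { return m_map.TryGetValue(key, out value); }
        finally { m_lock.ExitReadLock(); }
    }

    public void Set(TKey key, TValue value) {
        m_lock.EnterWriteLock();       // a writer excludes everyone else
        try { m_map[key] = value; }
        finally { m_lock.ExitWriteLock(); }
    }
}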
Locks and Fairness — Convoy Avoidance

Should locks be fair? Thread A arrives and gets the lock; thread B arrives and must wait; thread C arrives and must wait. Is it guaranteed that B gets the lock when A exits?
• Pre-Windows Server 2003 SP1: yes. Post: no!
• Fairness exacerbates convoys. E.g., A leaves the lock and pulses B; the time for B to awaken is at least C, where C = the cost of a context switch (>10,000 cycles) — and could be >2C!
• If thread D arrives in this window, it must wait — fairness effectively extends lock hold times!
Thread Local Storage (TLS) — Confined State Within Threads

Ensures a reference is local to one thread.
• Static (preferred) — provides the best performance (~a TLS lookup + field fetch):

[ThreadStatic]
private static T foo;

• Dynamic — Thread.SetData(k, v); … T v = (T)Thread.GetData(k); — only needed when per-object TLS is needed
• In both cases, initialization is tricky: a static initializer runs only once, not once per thread. E.g., static T foo = new T(); is not what you want — every use needs to check for initialization.
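A sketch of the check-on-every-use pattern that this caveat forces; PerThread and t_random are illustrative names:

using System;
using System.Threading;

static class PerThread {
    [ThreadStatic]
    private static Random t_random;   // an initializer here would run on
                                      // only the FIRST thread to touch it

    public static Random Random {
        get {
            if (t_random == null)     // so every use checks and lazily creates
                t_random = new Random(Thread.CurrentThread.ManagedThreadId);
            return t_random;
        }
    }
}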
Immutability in Practice — An ImmutableStack<T> Type

public class ImmutableStack<T> {
    private readonly T m_value;
    private readonly ImmutableStack<T> m_next;
    private readonly bool m_empty;

    public ImmutableStack() { m_empty = true; }

    internal ImmutableStack(T value, ImmutableStack<T> next) {
        m_value = value;
        m_next = next;
        m_empty = false;
    }

    public ImmutableStack<T> Push(T value) {
        // Pushing never mutates: it allocates a new head node.
        return new ImmutableStack<T>(value, this);
    }

    public ImmutableStack<T> Pop(out T value) {
        if (m_empty) throw new Exception("Empty.");
        value = m_value;
        return m_next;
    }
}
Memory Models (Danger!) — Architecture and Platform Guarantees

Sometimes locks are unnecessary: native pointer-sized writes are atomic (32-bit/64-bit). But compilers + processors reorder reads + writes.

CLR memory model:
• Data dependence is respected
• All writes are store/release (nothing moves after them)
• All ‘volatile’ reads are load/acquire (nothing moves before them)
• Adjacent writes can be coalesced
• A fence ensures nothing moves in either direction (a lock, an interlocked operation, Thread.MemoryBarrier)

C++: volatiles and explicit barriers.

Sophisticated code can exploit this — warning! E.g., double-checked locking. It is hard to test, since real reordering differs per platform, and it often requires so much cleverness that locks win out.
Memory Reordering — Examples

Example 1 — can it end with A == 1 && B == 0?

X = Y = 0;
// thread 1      // thread 2
X = 1;           A = Y;
Y = 1;           B = X;

No — except on IA64. (No StoreStore, no LoadLoad reordering.)

Example 2 — can it end with A == 1 && B == 1 && C == 0?

X = Y = 0;
// thread 1      // thread 2      // thread 3
X = 1;           A = X;           Y = 1;
                 B = Y;           C = X;

No. (Transitivity.)

Example 3 — can it end with A == 0 && B == 0?

X = Y = 0;
// thread 1      // thread 2
X = 1;           Y = 1;
A = Y;           B = X;

Yes! (StoreLoad reordering is permitted.)
Word Tearing — Accessing Nonatomic Locations Without Proper Synchronization

Can the assert fire? Yes! On a 32-bit machine each long access is comprised of two reads/writes. Possible values include 0x0L and 0xaaaabbbbccccddddL (of course), but also 0xaaaabbbb00000000L and 0x00000000ccccddddL!!

internal static long s_x;

void t1() {
    int i = 0;
    while (true) {
        s_x = (i & 1) == 0 ? 0x0L : unchecked((long)0xaaaabbbbccccddddUL);
        i++;
    }
}

void t2() {
    while (true) {
        long x = s_x;
        Debug.Assert(x == 0x0L || x == unchecked((long)0xaaaabbbbccccddddUL));
    }
}
Lock Freedom — Double-Edged Sword

Interlocked provides hardware atomicity: CompareExchange(&a, b, c) — if a contains c, replace it with b, guaranteed atomic in the hardware. It can be used to build scalable, lock-free algorithms:

class Stack<T> {
    Node<T> head;

    void Push(T obj) {
        Node<T> n = new Node<T>(obj);
        Node<T> h;
        do {
            h = head;
            n.next = h;
        } while (Interlocked.CompareExchange(ref head, n, h) != h);
    }

    T Pop() {
        Node<T> n;
        do {
            n = head;   // note: an empty stack (head == null) is not handled here
        } while (Interlocked.CompareExchange(ref head, n.next, n) != n);
        return n.Value;
    }
    …
}
Double-Checked Locking — Efficient Lazy Initialization (Variant 1: Never Create >1)

The CLR 2.0 memory model guarantees this is safe; other memory models (ECMA/1.1, C++, Java) don’t. The volatile is necessary to prevent speculative loads (IA64).

class Foo {
    private static volatile Foo s_inst;
    private static object s_mutex = new object();

    internal static Foo Instance {
        get {
            if (s_inst == null)
                lock (s_mutex)
                    if (s_inst == null)
                        s_inst = new Foo(…);
            return s_inst;
        }
    }
}
Double-Checked Locking — Efficient Lazy Initialization (Variant 2: >1 OK)

If the possibility of creating a garbage instance is OK, go lock-free:

class Foo {
    private static volatile Foo s_inst;

    internal static Foo Instance {
        get {
            if (s_inst == null) {
                Foo candidate = new Foo();
                Interlocked.CompareExchange(ref s_inst, candidate, null);
                // losers’ candidates become garbage; s_inst stays unique
            }
            return s_inst;
        }
    }
}
Building a Spin-lock — Trickier Than You Think!

class SpinLock {
    private int m_state = 0;

    public void Enter() {
        while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0)
            /* spin */ ;
    }

    public void Exit() {
        m_state = 0;
    }
}
Building a Spin-lock — Brain-Melting Details …

Spin-waiting needs to:
• Yield immediately on a 1-CPU machine
• Not use CompareExchange in a loop — too much cache contention; back off to add randomization; reread and only retry CompareExchange when the state is seen as free [TTAS]; possibly consider queueing [MCS]
• Avoid priority starvation (the Sleep(0) issue)
• Issue PAUSE instructions on Intel HT machines
• Mark Thread.BeginCriticalRegion/EndCriticalRegion
• Use the managed thread ID to avoid OS thread affinity
Building a Spin-lock — Try Numero Dos (Still Imperfect)

class SpinLock {
    private volatile int m_state = 0;

    public void Enter() {
        int tid = Thread.CurrentThread.ManagedThreadId;
        while (true) {
            if (Interlocked.CompareExchange(ref m_state, tid, 0) == 0)
                return;   // acquired
            int iters = 1;
            while (m_state != 0) {   // TTAS: reread before retrying the CAS
                if (Environment.ProcessorCount == 1) {
                    // single CPU: spinning can't help; yield immediately
                    if (iters % 5 == 0) Thread.Sleep(1);
                    else Thread.Sleep(0);
                    iters++;
                } else {
                    Thread.SpinWait(iters);   // exponential back-off
                    if (iters >= 4096) Thread.Sleep(1);
                    else {
                        if (iters >= 2048) Thread.Sleep(0);
                        iters *= 2;
                    }
                }
            }
        }
    }

    public void Exit() {
        m_state = 0;
    }
}
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Synchronization Best Practices — Lock Consistently

Do: lock over all mutable shared state. Do: always use the same lock for the same state. Do: comment on how state is protected.

class MyList<T> {
    T[] items;   // lock: items
    int n;       // lock: items

    void Add(T item) {
        lock (items) {
            items[n] = item;
            n++;
        }
    }
    …
}
Synchronization Best Practices — Lock for the Right Duration

Do: lock over the entire invariant. Don’t: lock for longer than is absolutely necessary.

class MyList<T> {
    T[] items;   // lock: items
    int n;       // lock: items

    // invariant: n is the count of valid
    // items in the list and items[n] == null

    void Add(T item) {
        lock (items) {
            items[n] = item;
            n++;
        }
    }
    …
}
Synchronization Best Practices — Make Critical Regions Short and Sweet

Do: minimize the time you hold a lock. Don’t: call others’ code while you hold locks. Don’t: block while you hold locks.

class MyList<T> {
    ...

    void Add(T t) {
        lock (items) {
            items[n] = t;
            n++;
        }
        Listener.Notify(this);   // others' code runs outside the lock
    }
    …
}
Synchronization Best Practices — Encapsulate Your Locks

Don’t: use public lock objects. Don’t: lock on Types or Strings.

class MyList<T> {
    T[] items;
    int n;

    static object slk = new object();   // private, dedicated lock object
    …
    static void ResetStats() {
        lock (slk) { … }
    }
    …
}
Synchronization Best Practices — Avoiding Deadlocks

Do: acquire locks in a consistent order. The code below violates that rule — DoAB and DoBA take the same two locks in opposite orders, a classic recipe for deadlock:

class MyService {
    A a; B b; …

    void DoAB() {
        lock (a) lock (b) { a.Do(); b.Do(); }
    }
    void DoBA() {
        lock (b) lock (a) { b.Do(); a.Do(); }   // opposite order!
    }
}
Synchronization Best Practices — Locking Miscellany

Do: document your locking policy, especially for public APIs.
Do: use a reader/writer lock if readers are common.
Do: prefer lock-based code to lock-free code.
Do: prefer Monitors over kernel synchronization.
Avoid: lock recursion in your designs.
Don’t: build your own lock.

ThreadPool Best Practices

Avoid: writing your own thread pools.
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break (3:15pm–3:45pm) · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Coarse vs. Fine-Grained Concurrency — The Impact of Multi-core on Apps

Server programs have it “easy”: they have been doing this for years. With a steady stream of incoming work W, typically W >= #P, so one W per P isn’t such a bad strategy. Multi-core is only a problem once P > W — then the server needs to worry about fine-grained parallelism.

Clients, not so much. They are user-centric and responsiveness-oriented: “user clicks a button” … now what? Faster? Sure! They need to go fine-grained.
Going Fine-Grained — Code and Data

Two basic approaches to fine-grained concurrency:

Approach #1: it’s the task (i.e., code)
• Futures — make a function call o.f() and its results will be available as soon as possible; enough independent calls leads to scaling
• Divide and conquer — split computations as you go into “left” and “right”, running one on a different thread (recursively)

Approach #2: it’s the data
• Partitioning — split the data source into n partitions, and process each partition in parallel (the database query approach)
• Pipelining — break into n stages, and execute each stage in parallel

Ideally, app developers choose whichever is most appropriate for the task at hand. Coarse-grained mixed with task and data parallelism == happiness (so long as libraries don’t get in the way).
A Taxonomy of Concurrency
• Agents/CSPs: message passing, loose coupling
• Task parallelism: statements, structured, futures, ~O(1) parallelism
• Data parallelism: data operations, O(N) parallelism
• Messaging ties them together
On Speedups — Metrics Worth Measuring

Goal: tS/tP ≈ P, where tS is the time it takes to run an algorithm sequentially and tP is the time it takes to run it on P processors.
• If tS/tP = P, this is a linear speedup (good! for every doubling in processors, we halve execution time)
• Most problems aren’t linear, but we try to get as close as possible
• If tS/tP > P, we have super-linear speedup! Wowzas!

Design challenge: cheap problem decomposition. Parallelism always costs something, even when efficient: synchronization, inter-thread communication, cache effects, and threading — costs that simply aren’t factors in sequential code.
Algorithms — Parallel For Loops

Independent loop iterations may run in parallel:

for (int i = 0; i < n; i++) a(i);

… but only if there are no loop-carried dependencies and no shared-state mutation.

Static decomposition: loop iterations are assigned to workers a priori (contiguous chunks, striping, etc.).
    +: simple, predictable, efficient
    −: can’t tolerate iteration imbalance or blocking

Dynamic decomposition: loop iterations are assigned on demand, usually in chunks.
    +: tolerates imbalance and blocking
    −: more difficult; communication overhead
Algorithms — Parallel For Loops: Static Decomposition

void ParallelForS(int lo, int hi, Action<int> body, int p) {
    int chunk = ((hi - lo) + p - 1) / p;   // iterations per thread
    ManualResetEvent mre = new ManualResetEvent(false);
    int remaining = p;

    // Schedule the threads to run in parallel
    for (int i = 0; i < p; i++) {
        ThreadPool.QueueUserWorkItem(delegate(object procId) {
            int start = lo + (int)procId * chunk;
            for (int j = start; j < start + chunk && j < hi; j++) {
                body(j);
            }
            if (Interlocked.Decrement(ref remaining) == 0) mre.Set();
        }, i);
    }

    mre.WaitOne();   // wait for them to finish
}
Algorithms — Parallel For Loops: Dynamic Decomposition

void ParallelForD(int lo, int hi, Action<int> body, int p) {
    const int chunk = 16;   // chunk size (constant)
    ManualResetEvent mre = new ManualResetEvent(false);
    int remaining = p;
    int current = lo;

    // Schedule the threads to run in parallel
    for (int i = 0; i < p; i++) {
        ThreadPool.QueueUserWorkItem(delegate(object procId) {
            int j;
            while ((j = (Interlocked.Add(ref current, chunk) - chunk)) < hi) {
                for (int k = 0; k < chunk && j + k < hi; k++) {
                    body(j + k);
                }
            }
            if (Interlocked.Decrement(ref remaining) == 0) mre.Set();
        }, i);
    }

    mre.WaitOne();   // wait for them to finish
}
Algorithms — Parallel Foreach Loops

What about IEnumerable<T> objects?

IEnumerable<T> enumerable = …;
foreach (T e in enumerable) a(e);

• Challenge: we don’t know the size a priori and can’t index into it (ICollection<T> is slightly better — it has a size — but not by much)
• Can be handled a bit like dynamic for-loop partitioning: hand out chunks of elements at a time
• Requires serializing access to a single enumerator: a large chunk size means more time spent in the lock, but less frequent lock arrivals
Algorithms — Parallel Foreach Loops

void ParallelForEach<T>(IEnumerable<T> e, Action<T> body, int p) {
    const int chunk = 16;   // chunk size (constant)
    ManualResetEvent mre = new ManualResetEvent(false);
    int remaining = p;

    using (IEnumerator<T> en = e.GetEnumerator()) {   // shared
        // Schedule the threads to run in parallel
        for (int i = 0; i < p; i++) {
            ThreadPool.QueueUserWorkItem(delegate(object procId) {
                T[] buffer = new T[chunk];
                int j;
                do {
                    lock (en) {   // serialize access to the one enumerator
                        for (j = 0; j < chunk && en.MoveNext(); j++)
                            buffer[j] = en.Current;
                    }
                    for (int k = 0; k < j; k++) body(buffer[k]);
                } while (j == chunk);
                if (Interlocked.Decrement(ref remaining) == 0) mre.Set();
            }, i);
        }
        mre.WaitOne();   // wait for them to finish
    }
}
Algorithms — Divide and Conquer: Recursion

Process separate “halves” of the problem in parallel: the right half concurrently, and the left half on the current thread (recursively). Switch over to sequential at some point, else too much parallelism is exposed. E.g., mirror a binary tree in place — the sequential version is shown here; see the parallel sketch below:

void Mirror(TreeNode node) {
    if (node == null) return;
    Mirror(node.Left);
    Mirror(node.Right);
    TreeNode tmp = node.Left;
    node.Left = node.Right;
    node.Right = tmp;
}
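A sketch of the parallel version using the era’s primitives, assuming the Mirror and TreeNode from the slide above are in scope; the ParMirror name and depth cutoff are assumptions, not from the talk. Note it forks the right half onto the ThreadPool and joins with an event, so it inherits the earlier caveat about blocking pool threads:

using System.Threading;

static class TreeOps {
    const int Cutoff = 4;   // beyond ~2^4 forks, go sequential

    public static void ParMirror(TreeNode node, int depth) {
        if (node == null) return;
        if (depth >= Cutoff) {          // enough parallelism exposed:
            Mirror(node);               // fall back to the sequential version
            return;
        }
        using (ManualResetEvent done = new ManualResetEvent(false)) {
            ThreadPool.QueueUserWorkItem(delegate {
                ParMirror(node.Right, depth + 1);   // right half: other thread
                done.Set();
            });
            ParMirror(node.Left, depth + 1);        // left half: this thread
            done.WaitOne();                         // join the fork
        }
        TreeNode tmp = node.Left;
        node.Left = node.Right;
        node.Right = tmp;
    }
}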
Algorithms — Reductions

Associative and commutative reductions can be done in parallel: e.g., Sum, Count, Min, Max, Average, …

int ParallelSum(int[] array, int p) {
    int chunk = (array.Length + p - 1) / p;   // iterations per thread
    ManualResetEvent mre = new ManualResetEvent(false);
    int sum = 0, remaining = p;

    // Schedule the threads to run in parallel
    for (int i = 0; i < p; i++) {
        ThreadPool.QueueUserWorkItem(delegate(object procId) {
            int mySum = 0;   // local partial sum: no sharing in the hot loop
            int start = (int)procId * chunk;
            for (int j = start; j < start + chunk && j < array.Length; j++)
                mySum += array[j];
            Interlocked.Add(ref sum, mySum);   // one combine per worker
            if (Interlocked.Decrement(ref remaining) == 0) mre.Set();
        }, i);
    }

    mre.WaitOne();   // wait for them to finish
    return sum;
}
When to “Go Parallel”?

There is a cost; it is only worthwhile when the work per task/element is large, and/or the number of tasks/elements is large.

[Chart: speedup vs. work per task and # of tasks — 1 task (sequential) at one end, then a break-even point at some number of tasks, then a point of diminishing returns.]
Speedup Inhibitors — Synchronous I/O

[Diagram: two threads running work items, with running and waiting segments. With synchronous I/O, each thread stalls waiting for its I/O to finish — 6 work items take 4 time units. With asynchronous (overlapped) I/O, computation overlaps the I/O — the same 6 work items take 3 time units.]
Speedup Inhibitors — Synchronization

Synchronization inhibits speedup.

[Diagram: threads 1–4 alternating between running, running-with-lock, and waiting as a single lock passes among them; only one thread at a time makes progress inside the lock, serializing execution.]
Speedup Inhibitors — Load Imbalance

Load imbalance exacerbates Amdahl’s Law. E.g., if Your API is 50% of the work and is only ever sequential, the maximum parallel speedup is 2x! More than 2 threads is just wasted resource: S = 50%, 1/S == 2 — no matter how many processors, 2x is it.

[Diagram: sequential vs. parallel timelines for threads 1–4; the sequential portion (Your API) leaves threads 2–4 idle.]
Algorithms — Other Miscellaneous Algorithms
• Pipelining: multiple stages run in parallel with one another (a producer/consumer relationship)
• Speculation and search: many threads cooperate to “precompute” an answer; if the answer isn’t needed, the speculation is discarded — wasted work if unused, but speedups otherwise
• Dataflow: a Future<T> abstraction — accessing the Value waits if it isn’t computed yet, or runs the computation synchronously otherwise; synchronization is driven by data dependence
Algorithms — Producer/Consumer: Blocking & Bounded Queue

public class BlockingBoundedQueue<T> {
    private Queue<T> m_queue = new Queue<T>();
    private Semaphore m_fullSemaphore = new Semaphore(128, 128);   // free slots
    private Semaphore m_emptySemaphore = new Semaphore(0, 128);    // filled slots

    public void Enqueue(T item) {
        m_fullSemaphore.WaitOne();            // blocks while the queue is full
        lock (m_queue) { m_queue.Enqueue(item); }
        m_emptySemaphore.Release();
    }

    public T Dequeue() {
        T e;
        m_emptySemaphore.WaitOne();           // blocks while the queue is empty
        lock (m_queue) { e = m_queue.Dequeue(); }
        m_fullSemaphore.Release();
        return e;
    }
}
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
Example: "Baby Names"
IEnumerable<BabyInfo> babies = ...;var results = new List<BabyInfo>();foreach(var baby in babies){ if (baby.Name == queryName && baby.State == queryState && baby.Year >= yearStart && baby.Year <= yearEnd) { results.Add(baby); }}results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
Manual Parallel Solution

IEnumerable<BabyInfo> babies = …;
var results = new List<BabyInfo>();
int partitionsCount = Environment.ProcessorCount;
int remainingCount = partitionsCount;
var enumerator = babies.GetEnumerator();
try {
    using (var done = new ManualResetEvent(false)) {
        for (int i = 0; i < partitionsCount; i++) {
            ThreadPool.QueueUserWorkItem(delegate {
                var partialResults = new List<BabyInfo>();
                while (true) {
                    BabyInfo baby;
                    lock (enumerator) {
                        if (!enumerator.MoveNext()) break;
                        baby = enumerator.Current;
                    }
                    if (baby.Name == queryName && baby.State == queryState &&
                        baby.Year >= yearStart && baby.Year <= yearEnd) {
                        partialResults.Add(baby);
                    }
                }
                lock (results) results.AddRange(partialResults);
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
    }
}
finally {
    if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose();
}
LINQ Solution

var results = from baby in babies.AsParallel()
              where baby.Name == queryName && baby.State == queryState &&
                    baby.Year >= yearStart && baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;

The only change from the sequential LINQ query: .AsParallel() on the data source.
Visual Studio 2010 — Tools / Programming Models / Runtimes

[Diagram: the Visual Studio 2010 parallel stack. Tools: parallel debugger windows, concurrency profiler/analysis. Programming models — managed: PLINQ and the Task Parallel Library; native: the Parallel Pattern Library and Agents Library. Concurrency runtime — managed: the ThreadPool; native: a task scheduler and resource manager. Data structures on both sides, all sitting on threads and the operating system.]
Parallel Extensions to the .NET Framework 4.0

What is it?
• .NET types (mscorlib.dll, System.dll, System.Core.dll); no compiler changes necessary
• A work-stealing runtime
• Multiple programming models: declarative data parallelism (PLINQ), imperative data and task parallelism (Task Parallel Library), coordination/synchronization constructs (Coordination Data Structures)
• A common exception-handling model; parallel debugging and profiling support

Why is it good?
• Supports parallelism in any .NET language
• Delivers reduced concept count and complexity, and better time to solution
• Begins to move parallelism capabilities from concurrency experts to domain experts
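For flavor, here is roughly what the two imperative models look like as they eventually shipped in .NET 4 — the exact surface was still in flux at the time of this talk:

using System;
using System.Threading.Tasks;

class TplSketch {
    static void Main() {
        // Imperative data parallelism: the library partitions the
        // iteration space and schedules it with work stealing.
        Parallel.For(0, 10, i => Console.WriteLine("iteration {0}", i));

        // Imperative task parallelism, future-style: Result waits
        // if the value isn't computed yet.
        Task<long> sum = Task.Factory.StartNew(() => {
            long s = 0;
            for (int i = 1; i <= 1000; i++) s += i;
            return s;
        });
        Console.WriteLine(sum.Result);   // prints 500500
    }
}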
Parallel Extensions Architecture

[Diagram: the C#, VB, C++, F#, and other .NET compilers produce MSIL against the Parallel Extensions stack. PLINQ: declarative queries with query analysis, data partitioning (chunk, range, hash, striped, repartitioning), operator types (map, filter, sort, search, reduce, …), and merging (buffering options, order preservation, inverted). Task Parallel Library: loop replacements, imperative task parallelism, parallel algorithms. Coordination Data Structures: concurrent collections, synchronization types, coordination types. All scheduled by a work-stealing scheduler over threads on processors 1…p.]
Work-Stealing in Action

[Diagram: the program thread pushes tasks into a global queue; worker threads 1…p each maintain a local queue, pushing and popping their own tasks at one end, and stealing tasks from other workers’ queues when their own run dry.]
Coordination Data Structures

Used throughout PLINQ and TPL; they address many of today’s core concurrency issues.
• Thread-safe collections: ConcurrentStack<T>, ConcurrentQueue<T>, ConcurrentDictionary<TKey,TValue>, …
• Work exchange: BlockingCollection<T>, IProducerConsumerCollection<T>
• Phased operation: CountdownEvent, Barrier
• Locks: ManualResetEventSlim, SemaphoreSlim, SpinLock, SpinWait
• Initialization: LazyInit<T>
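As one concrete example, BlockingCollection<T> subsumes the hand-rolled blocking bounded queue shown earlier. A sketch against the API as it shipped in .NET 4:

using System;
using System.Collections.Concurrent;
using System.Threading;

class PipelineSketch {
    static void Main() {
        // Bounded to 128 items, like the semaphore-based queue earlier.
        BlockingCollection<int> queue = new BlockingCollection<int>(128);

        Thread producer = new Thread(delegate() {
            for (int i = 0; i < 1000; i++)
                queue.Add(i);            // blocks while the queue is full
            queue.CompleteAdding();      // signals "no more items"
        });
        Thread consumer = new Thread(delegate() {
            // blocks while empty; ends once adding is complete and drained
            foreach (int item in queue.GetConsumingEnumerable())
                Console.WriteLine(item);
        });

        producer.Start(); consumer.Start();
        producer.Join(); consumer.Join();
    }
}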
Agenda: Overview and Architecture · Mechanisms for Asynchrony · Lunch · Topics in Synchronization · Synchronization Best Practices · Break · Designs and Algorithms · .NET Framework 4.0 · Wrap-up
6 Hours in a Slide — Talk Recap
• Concurrency begins with architecture: state management, agent interactions (coarse concurrency), problem decomposition (fine-grained parallelism)
• It is possible with today’s tools: Windows and .NET offer rich support — kernel objects, threads and thread pools, user-mode sync primitives (.NET & C++)
• Advances are on the horizon to bring parallelism within everybody’s reach; you saw some (Parallel Extensions) …
What the Future Holds — Programming Models

Safety: the current offerings have minimal impact (sharp knives). Three key themes:
• Functional: immutable & pure
• Safe imperative: isolated
• Safe side-effects: transactions
Plus verification tools and patterns.

Agents (CSPs) + tasks + data: 1st-class isolated agents; raise the level of abstraction — what, not how.
What the Future Holds — Efficiency and Heterogeneity

Efficiency: “do no harm” — O(P) >= O(1); more static decision-making vs. dynamic; profile-guided optimizations.

The future is heterogeneous: chip multiprocessors are “easy”. Out-of-order vs. in-order cores, GPGPU (fusion of x86 with the GPU), vector ISAs, possibly different memory systems.
All Programmers Will Not Be Parallel
• Implicit parallelism: use APIs that internally use parallelism — apps, LINQ queries, etc., structured in terms of agents
• Explicit parallelism, safe: frameworks, DSLs, XSLT, sorting, searching
• Explicit parallelism, unsafe: Parallel Extensions, etc.
In Conclusion

Opportunity and crisis: competitive advantage for those who grok it; less incentive to upgrade the client platform without it. Architects & senior developers, pay heed — it is time to start thinking and experimenting. Not yet ready for ubiquitous consumption [5-year horizon], but it can make a real difference today in select, embarrassingly parallel places.

Begin experimenting today: Windows Vista + .NET 3.5; play with the Parallel Extensions (.NET 4.0 and C++). Exciting times!

Thank you.
Learn more about concurrency:

Concurrent Programming on Windows (Addison-Wesley) — covers Win32 & the .NET Framework. Just released! Available at the PDC bookstore.

Book signing — Where: PDC bookstore. Date/Time: Wednesday, Oct. 29, 2:30 PM – 3:00 PM.

Learn more about parallel computing at msdn.com/concurrency — and download the Parallel Extensions to the .NET Framework!
PDC Parallelism Sessions
• Microsoft Visual Studio: Bringing out the Best in Multicore Systems — Monday, Oct. 27, 1:45 PM – 3:00 PM
• Parallel Programming for C++ Developers in the Next Version of Microsoft Visual Studio — Monday, Oct. 27, 3:30 PM – 4:45 PM
• The Concurrency and Coordination Runtime and Decentralized Software Services Toolkit — Tuesday, Oct. 28, 1:45 PM – 3:00 PM
• Parallel Programming for Managed Developers with the Next Version of Microsoft Visual Studio — Wednesday, Oct. 29, 10:30 AM – 11:45 AM
• Research: Concurrency Analysis Platform and Tools for Finding Concurrency Bugs — Wednesday, Oct. 29, 10:30 AM – 11:45 AM
• Concurrency Runtime Deep Dive: How to Harvest Multicore Computing Resources — Wednesday, Oct. 29, 1:15 PM – 2:30 PM
• Addressing the Hard Problems of Concurrency — Thursday, Oct. 30, 8:30 AM – 10:00 AM
• Parallel Computing Application Architectures and Opportunities — Thursday, Oct. 30, 10:15 AM – 11:45 AM
• Future of Parallel Computing (Panel) — Thursday, Oct. 30, 12:00 PM – 1:30 PM
Thank you!
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.