
(How) Can Programmers Conquer the

Multicore Menace?

Saman Amarasinghe

Computer Science and Artificial Intelligence Laboratory

Massachusetts Institute of Technology

Outline

• The Multicore Menace

• Deterministic Multithreading via Kendo

• Algorithmic Choices via PetaBricks

• Conquering the Multicore Menace


Today: The Happily Oblivious Average Joe Programmer

• Joe is oblivious about the processor
  – Moore's law brings Joe performance
  – Sufficient for Joe's requirements

• Joe has built a solid boundary between hardware and software
  – High-level languages abstract away the processors
  – Ex: Java bytecode is machine independent

• This abstraction has provided a lot of freedom for Joe

• Parallel programming is only practiced by a few experts


[Chart: Uniprocessor performance (SPECint, relative to VAX-11/780), 1978 to 2016, against processor transistor counts (10,000 to 1,000,000,000) for the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, and Itanium 2: 25%/year growth, then 52%/year, then ??%/year. "Moore's Law". From David Patterson; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]

[Chart repeated, titled "Uniprocessor Performance (SPECint)": the 25%/year and 52%/year eras highlighted]

[Chart repeated, with the post-2004 ??%/year flattening highlighted]

Squandering of the Moore’s Dividend

• 10,000x performance gain in 30 years! (~46% per year)
• Where did this performance go?
• Last decade we concentrated on correctness and programmer productivity
• Little to no emphasis on performance
• This is reflected in:
  – Languages
  – Tools
  – Research
  – Education

• Software Engineering: the only engineering discipline where performance or efficiency is not a central theme

Matrix Multiply: An Example of Unchecked Excesses

• Abstraction and Software Engineering
  – Immutable types
  – Dynamic dispatch
  – Object oriented
• High-level languages
• Memory management
  – Transpose for unit stride
  – Tile for cache locality
• Vectorization
• Prefetching
• Parallelization

[Chart residue: performance ratios 2,271x, 296,260x, 522x, 1,117x, 7,514x, 12,316x, 33,453x, 87,042x, 220x]

Matrix Multiply: An Example of Unchecked Excesses

• Typical Software Engineering Approach
  – In Java
  – Object oriented
  – Immutable
  – Abstract types
  – No memory optimizations
  – No parallelization

• Good Performance Engineering Approach
  – In C/Assembly
  – Memory optimized (blocked)
  – BLAS libraries
  – Parallelized (to 4 cores)

• In comparison: lowest to highest MPG in transportation

[Chart residue: 14,700x, 296,260x, 294,000x]
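As a concrete illustration of the memory optimization the slide mentions, here is a minimal, hedged sketch (not the BLAS code the talk measured) contrasting a naive triple loop with a tiled version; the dimension and tile size are illustrative constants.

```c
#include <string.h>

#define N 64
#define TILE 16

/* Naive triple loop: for each output element, walks a column of B,
 * a large-stride access pattern that thrashes the cache for big N. */
void matmul_naive(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Tiled ("blocked") version: processes TILE x TILE blocks so the
 * working set stays cache-resident; same arithmetic, same results. */
void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

Tiling alone typically buys a large constant factor on matrices that exceed cache; the slide's full gap also includes vectorization, BLAS, and parallelization.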

Joe the Parallel Programmer

• Moore's law is not bringing any more performance gains

• If Joe needs performance he has to deal with multicores
  – Joe has to deal with performance
  – Joe has to deal with parallelism

Why Parallelism is Hard

• A huge increase in complexity and work for the programmer
  – Programmer has to think about performance!
  – Parallelism has to be designed in at every level

• Programmers are trained to think sequentially
  – Deconstructing problems into parallel tasks is hard for many of us

• Parallelism is not easy to implement
  – Parallelism cannot be abstracted or layered away
  – Code and data have to be restructured in very different (non-intuitive) ways

• Parallel programs are very hard to debug
  – Combinatorial explosion of possible execution orderings
  – Race condition and deadlock bugs are non-deterministic and elusive
  – Non-deterministic bugs go away in the lab environment and with instrumentation
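The classic shape of such a bug is a shared counter. A minimal pthreads sketch (illustrative names; any counts would do): with the mutex the total is exact; delete the lock/unlock pair and the final value becomes non-deterministic because increments are lost depending on the interleaving.

```c
#include <pthread.h>

enum { NTHREADS = 4, ITERS = 100000 };

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds ITERS to the shared counter under a lock.
 * Without the lock, `counter++` is a read-modify-write race and
 * the total varies from run to run. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

long run_counter_demo(void) {
    pthread_t tid[NTHREADS];
    counter = 0;
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return counter;
}
```

Note that even the correct version is non-deterministic in *which* thread holds the lock at any moment, which is exactly the interleaving freedom Kendo pins down below.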

Outline

• The Multicore Menace

• Deterministic Multithreading via Kendo
  – Joint work with Marek Olszewski and Jason Ansel

• Algorithmic Choices via PetaBricks

• Conquering the Multicore Menace


Racing for Lock Acquisition

• Two threads
  – Start at the same time
  – 1st thread: 1000 instructions to the lock acquisition
  – 2nd thread: 1100 instructions to the lock acquisition

[Figure: instruction count vs. physical time for the two threads racing to the lock]

Non-Determinism

• Inherent in parallel applications
  – Accesses to shared data can experience many possible interleavings
  – New! Was not the case for sequential applications!
  – Almost never part of program specifications
  – Even the simplest parallel programs, e.g. a work queue, are non-deterministic

• Non-determinism is undesirable
  – Hard to create programs with repeatable results
  – Difficult to perform cyclic debugging
  – Testing offers weaker guarantees

Deterministic Multithreading

• Observation:
  – Non-determinism need not be a required property of threads
  – We can interleave thread communication in a deterministic manner
  – Call this Deterministic Multithreading

• Deterministic multithreading:
  – Makes debugging easier
  – Tests offer guarantees again
  – Supports existing programming models/languages
  – Allows programmers to "determinize" computations that have previously been difficult to do so using today's programming idioms
  – e.g.: Radiosity (Singh et al. 1994), LocusRoute (Rose 1988), and Delaunay Triangulation (Kulkarni et al. 2008)

Deterministic Multithreading

• Strong Determinism
  – Deterministic interleaving for all accesses to shared data for a given input
  – Attractive, but difficult to achieve efficiently without hardware support

• Weak Determinism
  – Deterministic interleaving of all lock acquisitions for a given input
  – Cheaper to enforce
  – Offers the same guarantees as strong determinism for data-race-free program executions
  – Can be checked with a dynamic race detector!

Kendo

• A Prototype Deterministic Locking Framework
  – Provides Weak Determinism for C and C++ code
  – Runs on commodity hardware today!
  – Implements a subset of the pthreads API
  – Enforces determinism without sacrificing load balance
  – Tracks progress of threads to dynamically construct the deterministic interleaving: Deterministic Logical Time
  – Incurs low performance overhead (16% geomean on Splash2)

Deterministic Logical Time

• Abstract counterpart to physical time
  – Used to deterministically order events on an SMP machine
  – Necessary to construct the deterministic interleaving

• Represented as P independently updated deterministic logical clocks
  – Not updated based on the progress of other threads (unlike Lamport clocks)
  – Event1 (on Thread 1) occurs before Event2 (on Thread 2) in Deterministic Logical Time if Thread 1 has a lower deterministic logical clock than Thread 2 at the time of the events

Deterministic Logical Clocks

• Requirements
  – Must be based on events that are deterministically reproducible from run to run
  – Track the progress of threads in physical time as closely as possible (for better load balancing of the deterministic interleaving)
  – Must be cheap to compute
  – Must be portable across micro-architectures
  – Must be stored in memory for other threads to observe

Deterministic Logical Clocks

• Some x86 performance counter events satisfy many of these requirements
  – Chose the "Retired Store Instructions" event

• Required changes to the Linux kernel
  – Performance counters are accessible only at the kernel level
  – Added an interrupt service routine
  – Increments each thread's deterministic logical clock (in memory) on every performance counter overflow
  – Frequency of overflows can be controlled

Locking Algorithm

• Construct a deterministic interleaving of lock acquires from deterministic logical clocks
  – Simulate the interleaving that would occur if running in deterministic logical time

• Uses the concept of a turn
  – It's a thread's turn when:
    – All threads with smaller IDs have greater deterministic logical clocks
    – All threads with larger IDs have greater or equal deterministic logical clocks

Locking Algorithm

function det_mutex_lock(l) {
  pause_logical_clock();
  wait_for_turn();
  lock(l);
  inc_logical_clock();
  enable_logical_clock();
}

function det_mutex_unlock(l) {
  unlock(l);
}
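The turn rule above can be sketched directly in C. This is a single-threaded illustration of the predicate only, with plain longs standing in for the counter-driven clocks and illustrative names (the real Kendo derives the clocks from retired-store performance counters and spins on this condition inside wait_for_turn).

```c
#include <stdbool.h>

/* Sketch of Kendo's turn rule over an array of per-thread
 * deterministic logical clocks. It is thread `id`'s turn when every
 * smaller-ID thread has a strictly greater clock and every larger-ID
 * thread has a greater-or-equal clock; ties are broken by thread ID,
 * so exactly one thread can have the turn at a time. */
bool is_my_turn(int id, const long clocks[], int nthreads) {
    for (int i = 0; i < nthreads; i++) {
        if (i == id)
            continue;
        if (i < id && clocks[i] <= clocks[id])
            return false;
        if (i > id && clocks[i] < clocks[id])
            return false;
    }
    return true;
}
```

Because the clocks are deterministic, this predicate picks the same acquisition order on every run regardless of physical-time scheduling.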

Example

[Figure sequence: two threads advancing in deterministic logical time (vertical axis) vs. physical time. At first the clocks race ahead independently (t=3/t=5, t=6/t=11, t=11/t=20: "It's a race!"). Thread 1 reaches det_lock(a) at t=25 and calls wait_for_turn(); Thread 2 reaches det_lock(a) later in physical time but with a smaller clock (t=22), so it is Thread 2's turn: "Thread 2 will always acquire the lock first!" Thread 2 acquires the lock, its clock is incremented (t=26), it executes and calls det_unlock(a) (t=32); now it is Thread 1's turn, and Thread 1 acquires the lock (t=28).]

Locking Algorithm Improvements

• Eliminate deadlocks in nested locks
  – Make a thread increment its deterministic logical clock while it spins on the lock
  – Must do so deterministically

• Queuing for fairness
• Lock priority boosting

• See the ASPLOS 2009 paper on Kendo for details

Evaluation

• Methodology
  – Converted the Splash2 benchmark suite to use the Kendo framework
  – Eliminated data races
  – Checked determinism by examining output and the final deterministic logical clocks of each thread

• Experimental Framework
  – Processor: Intel Core 2 quad-core running at 2.66GHz
  – OS: Linux 2.6.23 (modified for performance counter support)

Results

[Chart: execution time relative to non-deterministic execution (0 to 1.6) for tsp, quicksort, ocean, barnes, radiosity, raytrace, fmm, volrend, water-nsqrd, and the mean; each bar split into application time, interrupt overhead, and deterministic wait overhead]

Effect of Interrupt Frequency

[Chart: execution time relative to non-deterministic execution (0 to 5) for interrupt periods of 64, 128, 256, 512, 1K, 2K, 4K, 8K, and 16K; each bar split into application time, interrupt overhead, and deterministic wait overhead]

Related Work

• DMP (Deterministic Multiprocessing)
  – Hardware design that provides Strong Determinism

• StreamIt Language
  – Streaming programming model only allows one interleaving of inter-thread communication

• Cilk Language
  – Fork/join programming model that can produce programs with semantics that always match a deterministic "serialization" of the code
  – Cannot be used with locks
  – Must be data-race free (can be checked with a Cilk race detector)

Outline

• The Multicore Menace

• Deterministic Multithreading via Kendo

• Algorithmic Choices via PetaBricks
  – Joint work with Jason Ansel, Cy Chan, Yee Lok Wong, Qin Zhao, and Alan Edelman

• Conquering the Multicore Menace

Observation 1: Algorithmic Choice

• For many problems there are multiple algorithms
  – In most cases there is no single winner
  – An algorithm will be the best performing for a given:
    – Input size
    – Amount of parallelism
    – Communication bandwidth / synchronization cost
    – Data layout
    – Data itself (sparse data, convergence criteria, etc.)

• Multicores expose many of these to the programmer
  – Exponential growth of cores (impact of Moore's law)
  – Wide variation of memory systems, types of cores, etc.

• No single algorithm can be the best for all the cases
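A familiar instance of input-size-dependent choice, sketched in C under illustrative assumptions (the cutoff value 64 is a stand-in for whatever an autotuner would pick on a given machine):

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative tuned cutoff: below it, insertion sort's low constant
 * wins; above it, merge sort's O(n log n) wins. An autotuner picks
 * this value per machine instead of hard-coding it. */
#define CUTOFF 64

static void insertion_sort(int *a, int n) {
    for (int i = 1; i < n; i++) {
        int key = a[i], j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

/* Hybrid sort: recursive decomposition with an algorithmic choice at
 * the leaves, in the spirit of a PetaBricks-style tuned sort. */
void tuned_sort(int *a, int n) {
    if (n <= CUTOFF) { insertion_sort(a, n); return; }
    int mid = n / 2;
    tuned_sort(a, mid);
    tuned_sort(a + mid, n - mid);
    /* Merge the two sorted halves through a scratch buffer. */
    int *tmp = malloc(n * sizeof(int));
    int i = 0, j = mid, k = 0;
    while (i < mid && j < n) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < n)   tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof(int));
    free(tmp);
}
```

The two recursive calls are independent, so a parallel runtime is free to run them on different cores.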

Observation 2: Natural Parallelism

• The world is a parallel place
  – It is natural to many, e.g. mathematicians
  – ∑, sets, simultaneous equations, etc.

• It seems that computer scientists have a hard time thinking in parallel
  – We have unnecessarily imposed sequential ordering on the world
    – Statements executed in sequence
    – for i = 1 to n
    – Recursive decomposition (given f(n), find f(n+1))

• This was useful at one time to limit complexity… but it is a big problem in the era of multicores

Observation 3: Autotuning

• The good old days: model-based optimization
• Now
  – Machines are too complex to accurately model
  – Compiler passes have many subtle interactions
  – Thousands of knobs and billions of choices

• But…
  – Computers are cheap
  – We can do end-to-end execution of multiple runs
  – Then use machine learning to find the best choice
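The core of end-to-end autotuning is simple: run each candidate, measure it, keep the best. A hedged sketch with a pluggable cost function (in practice `measure` would time a real run, and a genetic tuner would search the space instead of enumerating it; the toy costs below are made up):

```c
/* Pick the candidate configuration with the lowest measured cost.
 * In a real autotuner, `measure` times an end-to-end execution of
 * the program under that choice. */
int best_choice(int nchoices, double (*measure)(int choice)) {
    int best = 0;
    double best_cost = measure(0);
    for (int c = 1; c < nchoices; c++) {
        double cost = measure(c);
        if (cost < best_cost) { best_cost = cost; best = c; }
    }
    return best;
}

/* Toy cost model standing in for real timings (illustrative numbers). */
double toy_cost(int choice) {
    static const double costs[] = { 3.0, 1.25, 2.5, 4.0 };
    return costs[choice];
}
```

Exhaustive search like this only works for a handful of choices; with thousands of knobs, the search strategy (here, a hybrid genetic tuner) is where the real work is.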

PetaBricks Language

transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
  // Base case, compute a single element
  to(AB.cell(x,y) out)
  from(A.row(y) a, B.column(x) b) {
    out = dot(a, b);
  }
}

• Implicitly parallel description

[Diagram: matrices A (c×h), B (w×c), and AB (w×h); element AB(x,y) computed from row y of A and column x of B]

PetaBricks Language

transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
  // Base case, compute a single element
  to(AB.cell(x,y) out)
  from(A.row(y) a, B.column(x) b) {
    out = dot(a, b);
  }

  // Recursively decompose in c
  to(AB ab)
  from(A.region(0, 0, c/2, h) a1,
       A.region(c/2, 0, c, h) a2,
       B.region(0, 0, w, c/2) b1,
       B.region(0, c/2, w, c) b2) {
    ab = MatrixAdd(MatrixMultiply(a1, b1),
                   MatrixMultiply(a2, b2));
  }
}

• Implicitly parallel description
• Algorithmic choice

[Diagram: A split into a1, a2 and B split into b1, b2 along c]

PetaBricks Language

transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
  // Base case, compute a single element
  to(AB.cell(x,y) out)
  from(A.row(y) a, B.column(x) b) {
    out = dot(a, b);
  }

  // Recursively decompose in c
  to(AB ab)
  from(A.region(0, 0, c/2, h) a1,
       A.region(c/2, 0, c, h) a2,
       B.region(0, 0, w, c/2) b1,
       B.region(0, c/2, w, c) b2) {
    ab = MatrixAdd(MatrixMultiply(a1, b1),
                   MatrixMultiply(a2, b2));
  }

  // Recursively decompose in w
  to(AB.region(0, 0, w/2, h) ab1,
     AB.region(w/2, 0, w, h) ab2)
  from(A a,
       B.region(0, 0, w/2, c) b1,
       B.region(w/2, 0, w, c) b2) {
    ab1 = MatrixMultiply(a, b1);
    ab2 = MatrixMultiply(a, b2);
  }
}

[Diagram: B split into b1, b2 and AB into ab1, ab2 along w]

PetaBricks Language

transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
  // Base case, compute a single element
  to(AB.cell(x,y) out)
  from(A.row(y) a, B.column(x) b) {
    out = dot(a, b);
  }

  // Recursively decompose in c
  to(AB ab)
  from(A.region(0, 0, c/2, h) a1,
       A.region(c/2, 0, c, h) a2,
       B.region(0, 0, w, c/2) b1,
       B.region(0, c/2, w, c) b2) {
    ab = MatrixAdd(MatrixMultiply(a1, b1),
                   MatrixMultiply(a2, b2));
  }

  // Recursively decompose in w
  to(AB.region(0, 0, w/2, h) ab1,
     AB.region(w/2, 0, w, h) ab2)
  from(A a,
       B.region(0, 0, w/2, c) b1,
       B.region(w/2, 0, w, c) b2) {
    ab1 = MatrixMultiply(a, b1);
    ab2 = MatrixMultiply(a, b2);
  }

  // Recursively decompose in h
  to(AB.region(0, 0, w, h/2) ab1,
     AB.region(0, h/2, w, h) ab2)
  from(A.region(0, 0, c, h/2) a1,
       A.region(0, h/2, c, h) a2,
       B b) {
    ab1 = MatrixMultiply(a1, b);
    ab2 = MatrixMultiply(a2, b);
  }
}

PetaBricks Compiler Internals

[Diagram: PetaBricks source code → compiler passes → rule/transform headers and rule bodies → rule body IR, choice grids, and choice dependency graph → code generation → C++ (sequential leaf code plus parallel dynamically scheduled code) → runtime]

Choice Grids

transform RollingSum
from A[n]
to B[n]
{
  Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
  Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
}

[Diagram: input A[0..n]; output B[0..n], with cell 0 computable by Rule2 only and cells 1..n by Rule1 or Rule2]
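The trade-off between the two rules is easy to see in plain C. This is a sketch of what each rule's body computes (the i == 0 case stands in for the choice grid's Rule2-only first cell); both produce the same output, but with very different dependence structure.

```c
/* Rule 1: b[i] = b[i-1] + a[i]. O(n) total work, but a serial
 * dependence chain: each output cell needs the previous one. */
void rolling_sum_rule1(const int *a, int *b, int n) {
    for (int i = 0; i < n; i++)
        b[i] = (i == 0 ? 0 : b[i - 1]) + a[i];
}

/* Rule 2: b[i] = sum of a[0..i]. O(n^2) total work, but every cell
 * depends only on the input, so all cells can run in parallel. */
void rolling_sum_rule2(const int *a, int *b, int n) {
    for (int i = 0; i < n; i++) {
        int sum = 0;
        for (int j = 0; j <= i; j++)
            sum += a[j];
        b[i] = sum;
    }
}
```

Which rule wins depends on n and on how many cores are available, which is precisely the choice left to the autotuner.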

Choice Dependency Graph

transform RollingSum
from A[n]
to B[n]
{
  Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
  Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
}

[Diagram: the choice grid annotated with dependency edges labeled (r2, =), (r1, =), (r2, <=), (r1, =, -1), and (r1, <)]

PetaBricks Autotuning

[Diagram: PetaBricks source code → compiler passes and code generation → C++ compiled user code running on the parallel runtime engine; an autotuner drives repeated executions and produces a choice configuration file]

PetaBricks Execution

[Diagram: compiled user code plus a choice configuration file run on the parallel runtime engine; the dependency graph is pruned according to the chosen configuration before execution]

Experimental Setup

• Test System
  – Dual quad-core (8 cores) Xeon X5460 @ 3.16GHz w/ 8GB RAM
  – CSAIL Debian 4.0 (etch), kernel 2.6.18

• Training
  – Using our hybrid genetic tuner
  – Trained using all 8 cores
  – Training times varied from ~1 min to ~1 hour

Sort

[Chart: time (0 to 0.010) vs. input size (0 to 2000) for Insertion Sort, Quick Sort, Merge Sort, Radix Sort]

Sort

[Chart: same, with the Autotuned curve added]

Eigenvector Solve

[Chart: time (0 to 0.05) vs. size (0 to 500) for Bisection, DC, QR]

Eigenvector Solve

[Chart: same, with the Autotuned curve added]

Poisson

[Chart: time (log scale) vs. matrix size (3 to 2049) for Direct, Jacobi, SOR, Multigrid]

Poisson

[Chart: same, with the Autotuned curve added]

Scalability

[Chart: speedup (0 to 8) vs. number of cores (1 to 8) for MM, Sort, Poisson, Eigenvector Solve]

Impact of Autotuning

• Custom hybrid genetic tuner
• Huge gains by training on the target architecture:

  Run On \ Trained On            | Niagara (8 cores) | Xeon E7340 (8 cores) | Xeon E7340 (1 core)
  SunFire T200 Niagara (8 cores) | 1.00x             | 0.72x                |
  Xeon E7340 (8 cores)           | 0.43x             | 1.00x                | 0.30x

Related Work

• SPARSITY, OSKI – Sparse Matrices

• ATLAS, FLAME – Linear Algebra

• FFTW

• STAPL – Template Framework Library

• SPL – Digital signal processing

• High level optimization via automated statistical modeling. (Eric Brewer)

Outline

• The Multicore Menace

• Deterministic Multithreading via Kendo

• Algorithmic Choices via PetaBricks

• Conquering the Multicore Menace


Conquering the Menace

• Parallelism Extraction
  – The world is parallel, but most computer science is based in sequential thinking
  – Parallel Languages
    – A natural way to describe the maximal concurrency in the problem
  – Parallel Thinking
    – Theory, algorithms, data structures, education

• Parallelism Management
  – Mapping algorithmic parallelism to a given architecture
  – New hardware support
    – Easier to enforce correctness
    – Reduce the cost of bad decisions
  – A Universal Parallel Compiler


Hardware Opportunities

• Don't have to contend with uniprocessors
• Not your same old multiprocessor problem
  – How does going from multiprocessors to multicores impact programs?
  – What changed?
  – Where is the impact?
    – Communication bandwidth
    – Communication latency

Communication Bandwidth

• How much data can be communicated between two cores?
  – 32 Gigabits/sec → ~300 Terabits/sec (10,000x)

• What changed?
  – Number of wires
  – Clock rate
  – Multiplexing

• Impact on programming model?
  – Massive data exchange is possible
  – Data movement is not the bottleneck, so processor affinity is not that important

Parallel Language Opportunities

• We need a lot more innovation! Languages that…
  – require no non-intuitive reorganization of data or code
  – make the programmer focus on concurrency, but not performance: off-load the parallelism and performance issues to the compiler (akin to ILP compilation for VLIW machines)
  – eliminate hard problems such as race conditions and deadlocks (akin to the elimination of memory bugs in Java)
  – inform the programmer if they have done something illegal (akin to a type system or runtime null-pointer checks)
  – take advantage of domains to reduce the parallelization burden (akin to the StreamIt language for the streaming domain)
  – use novel hardware to eliminate problems & help the programmer (akin to cache coherence hardware)

Compilation Opportunities

• Universal Parallel Compiler: GCC for uniprocessors
  – Easily portable to any uniprocessor
  – Able to obtain respectable performance
  – A single program (in C) runs on all uniprocessors

• MultiCompiler: a universal compiler for parallel systems
  – Language exposes maximal parallelism; compiler manages it
  – Unlike uniprocessors, many single decisions are performance critical
  – Candidates: don't bind a single decision, keep multiple tracks
  – Learning: learn and improve heuristics
  – Adaptation: dynamically choose candidates and adapt the program to resources and runtime conditions


Conclusions

• Kendo
  – The first system to efficiently provide weak determinism on commodity hardware
  – Provides a systematic method of reproducing many non-deterministic bugs
  – Incurs modest performance overhead when running on 4 processors
  – This low overhead makes it possible to leave it on while an application is deployed

• PetaBricks
  – The first language where micro-level algorithmic choice can be naturally expressed
  – Autotuning can find the best choice
  – Can switch between choices as the solution is constructed

• Switching to multicores without losing the gains in programmer productivity may be the Grandest of the Grand Challenges
  – Half a century of work, still no winning solution
  – Will affect everyone!
  – A lot more work to do to solve this problem!!!

http://groups.csail.mit.edu/commit/

