Cores, cores, everywhere

Based on joint work with Martín Abadi, Andrew Baumann, Paul Barham, Richard Black, Vladimir Gajinov, Orion Hodson, Rebecca Isaacs, Ross McIlroy, Simon Peter, Vijayan Prabhakaran, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania

• Two hardware trends
• Barrelfish operating system
• Message-passing software
• Managing parallel work

Amdahl’s law

“Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with 128 cores, how many cores do you need to use to get a 4x speed-up on the overall program?”

Amdahl’s law, f=70%

[Figure: speedup (0 to 4.5) against #cores (1 to 16). Series: desired 4x speedup; speedup achieved (perfect scaling on 70%).]

Limit as c→∞ = 1/(1-f) = 3.33
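The quiz can be checked numerically. The following is a minimal sketch, assuming the usual form of Amdahl's law, speedup(c) = 1 / ((1 - f) + f / c), where f is the fraction of the run time that scales perfectly. The function names are illustrative and do not come from the talk.

```python
def amdahl_speedup(f, cores):
    """Speedup when a fraction f of the run time scales perfectly over `cores`."""
    return 1.0 / ((1.0 - f) + f / cores)


def amdahl_limit(f):
    """Upper bound on the speedup as the number of cores grows without bound."""
    return 1.0 / (1.0 - f)


def cores_needed(f, target, max_cores):
    """Smallest core count that reaches `target` speedup, or None if unreachable."""
    for cores in range(1, max_cores + 1):
        if amdahl_speedup(f, cores) >= target:
            return cores
    return None


if __name__ == "__main__":
    print(round(amdahl_limit(0.70), 2))     # 3.33
    print(cores_needed(0.70, 4.0, 128))     # None
```

With f = 70% the bound is below 4, so no number of cores out of the 128 reaches a 4x speedup; that is the point of the quiz.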


[Figure: speedup (0.94 to 1.12) against #cores (1 to 16). Series: speedup achieved with perfect scaling.]

Amdahl’s law limit, just 1.11x


[Figure: speedup (0 to 60) against #cores (1 to 127).]

Amdahl’s law & multi-core

Suppose that the same h/w budget (space or power) can make us:

[Diagram: three chip layouts built from the same budget: 16 small cores, 1 big core, or 4 medium cores.]

(analysis from Hill & Marty “Amdahl’s law in the multicore era”)

Perf of big & small cores

[Figure: core perf (relative to 1 big core, 0 to 1.2) against resources dedicated to core (1/16, 1/8, 1/4, 1/2, 1).]

Assumption: perf = α √resource

Total perf (16 small cores): 16 * 1/4 = 4

Total perf (1 big core): 1 * 1 = 1
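The trade-off between few big cores and many small ones can be sketched by combining the square-root assumption with Amdahl's law. This is an illustrative model in the spirit of Hill & Marty, not code from the talk. It assumes α = 1 (so one big core using the whole budget has perf 1), that the serial fraction runs on a single core, and that the parallel fraction f uses every core perfectly. All names are hypothetical.

```python
import math


def core_perf(resource_fraction):
    """Per-core perf relative to one big core, assuming perf = sqrt(resource)."""
    return math.sqrt(resource_fraction)


def symmetric_chip_perf(n_cores, f):
    """Program perf, relative to one big core, on n equal cores sharing the budget."""
    per_core = core_perf(1.0 / n_cores)
    serial_time = (1.0 - f) / per_core               # one core runs the serial part
    parallel_time = f / (n_cores * per_core)         # all cores share the parallel part
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    for f in (0.5, 0.9, 1.0):
        row = [round(symmetric_chip_perf(n, f), 2) for n in (1, 4, 16)]
        print(f, row)   # columns: 1 big, 4 medium, 16 small
```

Under these assumptions the 16 small cores only reach their total perf of 4 when the whole program is parallel; a fully serial program on the same chip runs at 1/4 the speed of the big core.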



[Figure: perf (relative to 1 big core, 0 to 3.5) against #cores (1 to 16). Series: 1 big, 4 medium, 16 small.]



[Figure: perf (relative to 1 big core, 0 to 1.2) against #cores (1 to 16). Series: 1 big, 4 medium, 16 small.]


Asymmetric chips

[Diagram: one larger core alongside 12 small cores on the same chip.]


[Figure: perf (relative to 1 big core, 0 to 1.6) against #cores (1 to 16). Series: 1 big, 4 medium, 16 small, 1+12.]
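The appeal of the "1+12" layout can be sketched with the same square-root assumption used earlier in the deck. This is a hedged illustration, not the talk's own model. It assumes the larger core occupies 4 of the 16 resource slots (as the diagram suggests), that the serial fraction runs on the larger core, and that the parallel fraction f uses all 13 cores perfectly. The function name and parameters are hypothetical.

```python
import math


def asymmetric_chip_perf(big_fraction, n_small, small_fraction, f):
    """Program perf, relative to one whole-chip core, on 1 larger + n smaller cores."""
    big = math.sqrt(big_fraction)                    # assumed perf = sqrt(resource)
    small = math.sqrt(small_fraction)
    serial_time = (1.0 - f) / big                    # serial part on the larger core
    parallel_time = f / (big + n_small * small)      # parallel part on every core
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    for f in (0.5, 0.9, 1.0):
        print(f, round(asymmetric_chip_perf(4 / 16, 12, 1 / 16, f), 2))
```

Under these assumptions the mixed chip gives up a little peak throughput (3.5 against 4 for 16 small cores) but runs the serial part twice as fast as a small core would.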


Two hardware trends

Traditional multi-processor machines

Asymmetric performance and/or instruction sets

Cache-coherent multicore

AMD Istanbul: 6 cores, per-core L2, per-package L3

[Diagram: four packages, each with six cores (one L2 per core) sharing an L3, attached to RAM.]

Single-chip cloud computer (SCC)

• 24 * 2-core tiles
• On-chip mesh n/w
• Non-coherent caches
• Hardware supported messaging

[Diagram: a tile holds two cores, each with its own L2, plus a router and MPB; the mesh connects to memory controllers (MC), a VRC, the system interface, and RAM.]

MSR Beehive

• Ring interconnect
• Message passing in h/w
• No cache coherence
• Split-phase memory access

[Diagram: cores 1 to N (Module RISCN) on a ring (RingIn[31:0], SlotTypeIn[3:0], SrcDestIn[3:0]) carrying messages and locks, with Module MemMux, MQ, a DDR controller, RAM, and a display controller; read data returns over a pipelined bus to all cores.]

Two hardware trends

Traditional multi-processor machines

Asymmetric performance and/or instruction sets

Non-cache-coherent access to memory


Messaging vs shared data as default

• Fundamental model is message based
• “It’s better to have shared memory and not need it than to need shared memory and not have it”

Shared state, one-big-lock

Fine-grained locking

Clustered objects, partitioning

Distributed state, replica maintenance

Traditional operating systems

Barrelfish multikernel

The Barrelfish multi-kernel OS

[Diagram: apps run above one OS node per core (x64, x64, ARM, accelerator core); each OS node holds a state replica, the nodes communicate by message passing, and all cores sit on the hardware interconnect.]




System runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64




System components, each local to a specific core, and using message passing





User-mode programs: several models supported, including conventional shared-memory OpenMP & pthreads




Shared Resource Database Consensus

// Two-phase commit, synchronous version.
bool updatePermissions(page_t page, flags_t flags) {
    bool ok = true;

    // Voting phase: each RPC blocks before the next core is contacted.
    for (core in cores)
        ok &= permUpdateRequest_rpc(core, page, flags);

    // Commit phase: tell every core the outcome.
    if (ok) {
        localUpdatePermissions(page, flags);
        for (core in cores)
            permUpdateCommit_send(core, page, flags);
    } else {
        for (core in cores)
            permUpdateAbort_send(core, page, flags);
    }
    return ok;
}

Blocking RPC before sending to next core

~400 cycles, assuming process is scheduled on other core!

Shared Resource Database Consensus

// Event-driven version: the request loop returns at once, and the outcome
// is decided later, in recvReply, once every core has answered.
void updatePermissions(page_t page, flags_t flags) {
    state_t *st = malloc(sizeof(state_t));
    st->ok = true;
    st->page = page;
    st->flags = flags;
    st->count = 0;
    for (core in cores) {
        permUpdateRequest_send(core, page, flags, st);
        st->count++;
    }
}

// Called once per reply; st carries the state ripped out of the stack.
void recvReply(state_t *st, bool ok) {
    st->ok &= ok;
    if (--st->count == 0) {          // last outstanding reply
        if (st->ok) {
            localUpdatePermissions(st->page, st->flags);
            for (core in cores)
                permUpdateCommit_send(core, st->page, st->flags);
        } else {
            for (core in cores)
                permUpdateAbort_send(core, st->page, st->flags);
        }
        free(st);
    }
}

Stack-ripped

A send can fail to complete immediately (e.g., due to a full channel), so the code would need to be stack-ripped at each of the send calls as well.

AC: Asynchronous C

• Synchronous: easy to program, poor performance
• Event-driven: difficult to program, good performance
• AC: similar programming model to synchronous, similar performance to event-driven

Shared Resource Database Consensus

bool updatePermissions(page_t page, flags_t flags) {
    bool ok = true;
    do {
        for (core in cores)
            // async marks code that can block; execution can continue after it.
            async {
                // AC version of the message RPC.
                ok &= permUpdateRequest_AC(core, page, flags);
            }
    } finish;   // Not passed until all async work created in the do {} finish block has completed.
    if (ok) {
        localUpdatePermissions(page, flags);
        for (core in cores)
            permUpdateCommit_send(core, page, flags);
    } else {
        for (core in cores)
            permUpdateAbort_send(core, page, flags);
    }
    return ok;
}
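For readers without an AC compiler, the shape of the do {} finish voting phase can be approximated with Python's asyncio. This is only an analogy under stated assumptions, not the AC runtime. `perm_update_request` is a hypothetical stand-in for the request RPC, the `refusing` set and the `log` list exist purely for illustration, and `asyncio.gather` plays the role of finish.

```python
import asyncio


async def perm_update_request(core, page, flags, refusing):
    """Hypothetical stand-in for an RPC asking one core to vote on the update."""
    await asyncio.sleep(0.001)          # simulated message round trip
    return core not in refusing


async def update_permissions(cores, page, flags, log, refusing=frozenset()):
    # Voting phase: every request is issued before any reply is awaited,
    # much like async statements inside a do {} finish block.
    votes = await asyncio.gather(
        *(perm_update_request(core, page, flags, refusing) for core in cores)
    )
    ok = all(votes)
    # Commit phase: record the outcome that would be sent to each core.
    for core in cores:
        log.append(("commit" if ok else "abort", core, page))
    return ok


if __name__ == "__main__":
    log = []
    print(asyncio.run(update_permissions(range(4), 0x1000, 0b11, log)))  # True
```

The code keeps the straight-line structure of the synchronous version, while the round trips to the different cores overlap instead of being paid one after another.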

Shared Resource Database Consensus

[Figure: time per operation / cycles (0 to 50000) against # cores (2 to 16). Series: Event-Driven, Synchronous, AC.]

Performance

• Ping-pong test
• Minimum-sized messages
• AMD 4 * 4-core machine
• Using cores sharing L3 cache

Ping-pong latency (cycles)

Using UMP channel directly                        931
Using event-based stubs                          1134
Synchronous model (client only)                  1266
Synchronous model (client and server)            1405
MPI (Visual Studio 2008 + HPC-Pack 2008 SDK)     2780

Performance

Function call latency (cycles)

Direct (normal function call)            8
async foo() (foo does not block)        12
async foo() (foo blocks)              1692

• “Do not fear async”
– Think about correctness: if the callee doesn’t block then perf is basically unchanged


Adding Parallelism

do {
    async msg_send(core_1, "Computing Forces");
    // Spawn a bunch of parallel tasks that can be run across multiple cores
    par fluidAnimate(computeForces, cells, range);
} finish;   // Wait for parallel and async tasks to complete before continuing

FluidAnimate

• for each frame

– move particles to correct cell
– calculate cell density
– calculate particle forces
– calculate particle positions
– render frame

Static Partitioning

• for each frame





–render frame

Problem: Uneven workload

Problem: Barrier Synchronization

Problem: Thread Preemption

Approach taken by (e.g.) OpenMP and Intel Parallel Building Blocks

They assume you own the machine and know your workload

Dynamic Partitioning (Work-Stealing)

• for each frame





–render frame



Problem: Spawn / Sync Overhead

Cilk-5: 218 cycles per task

Wool (old version): 97 cycles per task

Density calculation task: ~10 cycles per particle
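A quick calculation shows why these figures matter. The sketch below assumes that spawn/sync cost and useful work simply add up, and that a task processes a fixed batch of particles at about 10 cycles each; the function names are illustrative.

```python
def overhead_fraction(spawn_cycles, work_cycles_per_item, items_per_task):
    """Share of a task's total cycles spent on spawn/sync rather than work."""
    work = work_cycles_per_item * items_per_task
    return spawn_cycles / (spawn_cycles + work)


def items_for_overhead(spawn_cycles, work_cycles_per_item, max_fraction):
    """Smallest batch size that keeps the overhead at or below max_fraction."""
    items = 1
    while overhead_fraction(spawn_cycles, work_cycles_per_item, items) > max_fraction:
        items += 1
    return items


if __name__ == "__main__":
    print(round(overhead_fraction(218, 10, 1), 2))   # 0.96 (Cilk-5, one particle per task)
    print(round(overhead_fraction(97, 10, 1), 2))    # 0.91 (Wool, one particle per task)
    print(items_for_overhead(218, 10, 0.10))         # 197
    print(items_for_overhead(97, 10, 0.10))          # 88
```

With one particle per task, almost all cycles go to scheduling; on these numbers a task has to cover on the order of a hundred or more particles before the overhead drops to 10%.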



Problem: Cache Locality



Problem: Data Synchronization

Space-Time Continuum

• Controlled partitioning programming model
– Flexible enough to enable movement on this spectrum
– Runtime system controls re-partitioning
– Application controls how
    • Parameterise how data is partitioned
    • Decide whether data-synchronisation is necessary

[Diagram: a spectrum running from Dynamic Partitioning to Static Partitioning, with Workload 1 and Workload 2 placed along it for a 64 Core Server and for a 4 Core Laptop; Controlled Partitioning spans the range.]






FluidAnimate

void computeForces(cell_t [][][] cells, dimentions_t d) {
    range_t range = { .x_start=0, .x_curr=0, .x_end=d.x_len, ...};
    do {
        par fluidAnimate (computeForces, cells, range);
    } finish;
}

par_task fluidAnimate {
    task computeForces(cell_t cell) {
        for (particle in cell) {
            struct cell_t [] ncells = getNeighbours(cell);
            particle.force = calcForce(particle, ncells);
        }
    }

    range_t [] subdivide(range_t curr_cells, int num) {
        // subdivide curr into num equal cubes, and add to new
    }

    cell_t getNext(cells_t [][][] cells, range_t range) {
        // return next cell in cells, or NULL if finished
    }
}

FluidAnimate

void __computeForces_task(range_t my_range, cells_t [][][] cells) {
    cell_t cell = __fluidAnimate_getNext(cells, my_range);
    // Aggregation of multiple task iterations: one task loops over many cells.
    do {
        for (particle in cell) {
            struct cell_t [] ncells = getNeighbours(cell);
            particle.force = calcForce(particle, ncells);
        }
        // Automatic repartitioning when necessary.
        int num = calico_should_subdivide();
        if (num > 0) {
            range_t[] new_ranges = __fluidAnimate_subdivide(my_range, num);
            calico_schedule_par(__computeForces_task, new_ranges, cells);
            return;
        }
    } while ((cell = __fluidAnimate_getNext(cells, my_range)) != NULL);
}

range_t[] __fluidAnimate_subdivide(range_t curr_cells, int num) {
    // subdivide curr into num equal cubes, and add to new
}

cell_t __fluidAnimate_getNext(cells_t [][][] cells, range_t range) {
    // return next cell in cells, or NULL if finished
}
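The control flow of such a task (work through a range, and hand the rest back as smaller ranges when the runtime asks) can be mimicked in a few lines. The sketch below is a single-threaded, one-dimensional simulation rather than the Calico runtime: `should_subdivide` is a hypothetical stand-in for the runtime's decision, and a queue stands in for the scheduler.

```python
from collections import deque


def run_tasks(cells, should_subdivide):
    """Process every cell once, splitting a task's remaining range on request.

    `should_subdivide(start, end)` returns how many pieces the remaining
    half-open range should be split into (0 means keep going).
    """
    pending = deque([(0, len(cells))])      # ranges waiting for a worker
    processed = []
    while pending:
        start, end = pending.popleft()
        while start < end:
            processed.append(cells[start])  # the per-cell task body
            start += 1
            num = should_subdivide(start, end)
            if num > 0 and end - start >= num:
                step = (end - start) // num
                bounds = [start + i * step for i in range(num)] + [end]
                pending.extend(zip(bounds[:-1], bounds[1:]))
                break                       # this task ends; the pieces carry on
    return processed


if __name__ == "__main__":
    cells = list(range(20))
    order = run_tasks(cells, lambda s, e: 2 if e - s > 4 else 0)
    print(sorted(order) == cells)           # True
```

When the runtime never asks for a split, the loop degenerates to one task walking the whole range, which corresponds to the static end of the spectrum.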

FluidAnimate

par_task fluidAnimate {
    task moveParticles(cell_t cell) { ... }
    task computeDensities(cell_t cell) { ... }
    task computeForces(cell_t cell) { ... }
    task renderCell(cell_t cell) { ... }

    range_t [] subdivide(range_t curr_cells, int num) {
        // subdivide curr into num equal cubes, and add to new
    }

    cell_t getNext(cells_t [][][] cells, range_t range) {
        // return next cell in cells, or NULL if finished
    }

    bool calcOnDifferentCore(cell_t cell, range_t range) {
        // return true if cell is not within range
    }
}

FluidAnimate

par_task fluidAnimate {
    task moveParticles(cell_t cell) {
        for (particle in cell) {
            cell_t new_cell = calculateParticlesCell(particle);
            if (new_cell == cell) continue;
            if (onDifferentCore(new_cell)) {
                lockAndUpdate(new_cell, particle);
            } else {
                updateNoLock(new_cell, particle);
            }
        }
    }
    ...
    bool calcOnDifferentCore(cell_t cell, range_t range) {
        // return true if cell is not within range
    }
    ...
}

Inside the task, onDifferentCore(new_cell) stands for: calcOnDifferentCore(new_cell, my_range);

FluidAnimate Results

[Figure: wall-clock execution time (normalised to sequential, 0 to 1.4; lower is better) against number of cores (0 to 8). Series: Parsec Native, Calico. No competition for CPU-time.]

FluidAnimate Results

[Figure: wall-clock execution time (normalised to sequential, 0 to 2.5; lower is better) against number of cores (0 to 8). Series: Parsec Native, Calico. Competition for CPU-time.]


http://www.barrelfish.org

©2010 Microsoft Corporation. All rights reserved. This material is provided for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.

Date post:	21-Dec-2015