
Chronos: Efficient Speculative Parallelism for Accelerators

MALEEN ABEYDEERA, DANIEL SANCHEZ

ASPLOS 2020

Current hardware accelerators are limited to easy parallelism

Current Accelerators

Target easy parallelism

Tasks and dependences known in advance


e.g.: Deep learning, Genomics

Chronos

Targets hard parallelism

Require speculative execution

e.g.: Graph analytics, simulation, transactional databases

Problem and Insight

Problem

Prior speculation mechanisms (Transactional Memory, Thread Level Speculation) require global conflict detection

Shared memory system → coherence protocol

Coherence poorly suited for accelerators

Insight

Limit the data that each core can access

Divide work into tiny tasks and send them to data

Coordinate tasks through order constraints

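[Figure: Transaction 1 (W, Y) and Transaction 2 (Z, W) on Core 1 and Core 2, with objects W, X, Y, and Z in memory; tasks are coordinated through order constraints]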

Local conflict detection → No coherence needed

Contributions

SLOT (Spatially Located Ordered Tasks): A new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts

Chronos: An implementation of SLOT that provides a common framework for acceleration of applications with speculative parallelism


https://chronos-arch.csail.mit.edu/


Speculative parallelism with single-object tasks

Discrete Event Simulation (DES) for Digital Circuits

[Figure: example circuit with OR, NAND, and XOR gates, labeled with delays from 1 ns to 6 ns, and a timeline (Time, ns, 1 to 6) of the events O1, N2, X3, and X6]

If X6 is being speculatively executed

Prior techniques rely on global conflict detection

[Figure: Core 1 and Core 2, each with a private cache, below a shared cache / directory; events O1, N2, X3, and X6 on a timeline (Time, ns, 1 to 6)]

Why? No restriction on where a task can run

Relies on coherence protocol to find conflicts

Insight 1: Leveraging spatial task mapping for local conflict detection

[Figure: Core 1 and Core 2, each with a private cache, below a shared cache / directory; events O1, N2, X3, and X6 on a timeline (Time, ns, 1 to 6), each marked as mapped to Core 1 or mapped to Core 2]

Impose restrictions on where a task can run

Conflict detection is local to a core
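A minimal sketch of why spatial mapping makes conflict detection local, assuming every task on a given object runs on the same core. The names coreOf and CoreConflictTable, and the modulo mapping, are hypothetical and are not taken from Chronos.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using Timestamp = std::uint64_t;
using ObjectId = std::uint32_t;

// Assumed mapping: all tasks on one object land on one core.
unsigned coreOf(ObjectId object, unsigned numCores) {
    assert(numCores > 0);
    return object % numCores;
}

// Per-core bookkeeping. No other core is consulted, so no coherence
// protocol is involved in finding conflicts.
class CoreConflictTable {
public:
    // Records that task (object, ts) ran speculatively on this core.
    // Returns the timestamps of same-object tasks that already ran with a
    // later timestamp; they ran too early and must be aborted and requeued.
    std::vector<Timestamp> execute(ObjectId object, Timestamp ts) {
        std::set<Timestamp>& ran = ran_[object];
        std::vector<Timestamp> toAbort(ran.upper_bound(ts), ran.end());
        ran.erase(ran.upper_bound(ts), ran.end());
        ran.insert(ts);
        return toAbort;
    }

private:
    std::map<ObjectId, std::set<Timestamp>> ran_;
};
```

In the circuit example, if X6 runs speculatively and X3 then arrives for the same XOR gate, execute returns {6}: X6 is the only task that has to be undone, and the check never leaves the core.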

Insight 2: Leveraging order to ensure atomicity

Account (object)   Balance
W                  $100
X                  $1500
Y                  $200
Z                  $400

Transaction              Timestamps
Tx. 1: Transfer W → Y    0, 1
Tx. 2: Transfer Z → W    10, 11
Tx. 3: Transfer X → Z    20, 21

Banking application: Each transaction decrements the balance of one account and increments another

Assign a disjoint timestamp range for each coarse transaction
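The idea can be sketched in a few lines, assuming each transfer is split into one debit task and one credit task, each touching a single account. The type and function names, the range width of 10, and the sequential runInOrder loop are illustrative assumptions; a real SLOT system would run the tasks speculatively while preserving this order.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using Timestamp = std::uint64_t;
using Balances = std::map<std::string, long>;

// One single-object task: it touches exactly one account.
struct Task {
    std::string account;  // the only object this task may access
    long delta;           // amount added to the balance (negative = debit)
};

// Tasks keyed by timestamp; iterating the multimap yields timestamp order.
using Schedule = std::multimap<Timestamp, Task>;

const Timestamp kRange = 10;  // assumed width of each transaction's range

// Splits transaction number txIndex into a debit and a credit task.
// The pair uses timestamps txIndex * kRange and txIndex * kRange + 1, so
// tasks of different transactions never interleave.
void addTransfer(Schedule& s, Timestamp txIndex, const std::string& from,
                 const std::string& to, long amount) {
    assert(amount >= 0);
    const Timestamp base = txIndex * kRange;
    s.emplace(base, Task{from, -amount});
    s.emplace(base + 1, Task{to, amount});
}

// Reference semantics: apply the tasks one at a time in timestamp order.
void runInOrder(const Schedule& s, Balances& balances) {
    for (const auto& entry : s) {
        balances[entry.second.account] += entry.second.delta;
    }
}
```

With transaction indices 0, 1, and 2 this reproduces the timestamp pairs (0, 1), (10, 11), and (20, 21). Because the two tasks of a transfer are adjacent in timestamp order, no other transaction can observe the state between the debit and the credit.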

Benefits of fine-grained tasks


✓ Increased data locality

✓ Reduced network traffic

✓ Increased parallelism

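[Figure: two designs side by side. "Brings data to compute": Transaction 1 (W, Y) and Transaction 2 (Z, Y) run on Core 1 and Core 2 and access objects W, X, Y, and Z in memory. "Sends compute to data": tasks are sent to the cores that hold W (Core 1) and Y, Z (Core 2), and are coordinated through order constraints.]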

✓ Low probability and impact of aborts

✓ Asynchronous communication

SLOT (Spatially Located Ordered Tasks)


SLOT programs consist of tasks

Tasks can create children tasks through a simple API:

slot::enqueue( fn_ptr, timestamp, object-id, arguments…);

Timestamp : Specifies order. Tasks appear to execute in timestamp order

Object-id : Specifies dependences. Tasks with same object-id are treated as data-dependent

Tasks with different object-ids can only communicate through arguments
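To make these semantics concrete, here is a small sequential model of the enqueue call. It is a sketch under stated assumptions, not the Chronos runtime: the namespace slot_sketch, the Runtime class, the extra Runtime& and object-id parameters passed to each task, the modulo tileOf mapping, and the hop example task are all hypothetical. It executes tasks strictly in timestamp order, which is the behavior a speculative implementation must appear to have.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

namespace slot_sketch {  // hypothetical stand-in for the real slot:: API

using Timestamp = std::uint64_t;
using ObjectId = std::uint32_t;

struct Task {
    Timestamp ts;
    std::uint64_t seq;  // insertion order, used to break timestamp ties
    ObjectId object;
    std::function<void()> body;
};

// Orders the priority queue so the earliest timestamp is on top.
struct RunsLater {
    bool operator()(const Task& a, const Task& b) const {
        if (a.ts != b.ts) return a.ts > b.ts;
        return a.seq > b.seq;
    }
};

class Runtime {
public:
    explicit Runtime(unsigned numTiles) : numTiles_(numTiles) {
        assert(numTiles > 0);
    }

    // Mirrors enqueue(fn_ptr, timestamp, object-id, arguments...).
    template <typename Fn, typename... Args>
    void enqueue(Fn fn, Timestamp ts, ObjectId object, Args... args) {
        Runtime* self = this;
        std::function<void()> body = [=]() { fn(*self, ts, object, args...); };
        queue_.push(Task{ts, nextSeq_++, object, body});
    }

    // Assumed spatial mapping: same object-id, same tile.
    unsigned tileOf(ObjectId object) const { return object % numTiles_; }

    // Runs all tasks, including children created along the way.
    void run() {
        while (!queue_.empty()) {
            Task t = queue_.top();
            queue_.pop();
            t.body();
        }
    }

private:
    unsigned numTiles_;
    std::uint64_t nextSeq_ = 0;
    std::priority_queue<Task, std::vector<Task>, RunsLater> queue_;
};

struct Log {
    std::vector<Timestamp> order;  // timestamps in execution order
};

// Example task: records its timestamp, then creates a child on the next
// object. State reaches the child only through the arguments.
void hop(Runtime& rt, Timestamp ts, ObjectId object, Log* log, int remaining) {
    log->order.push_back(ts);
    if (remaining > 0) {
        rt.enqueue(hop, ts + 2, object + 1, log, remaining - 1);
    }
}

}  // namespace slot_sketch
```

Enqueuing hop at timestamps 5 and 1 and calling run() executes the timestamp-1 chain first, even though it was enqueued second: order comes from timestamps, not from enqueue order.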

SLOT programming example (in software)

[Figure: the example circuit, with gate delays of 1 ns, 2 ns, and 5 ns]

SLOT version:

    // Simulates an event arriving at a gate
    void simToggle(Time time, GateInput input) {
        auto& gate = input.gate;
        bool toggledOutput = updateState(gate, input);
        if (toggledOutput) {
            // create events for connected gates
            for (GateInput i : gate.connectedInputs()) {
                Time nextTime = time + gate.delay(input, i);
                slot::enqueue(simToggle, nextTime, i.gateID, i);
            }
        }
    }

    enqueueInitialTasks();
    slot::run();

Sequential version:

    // Pending events, ordered by time
    PriorityQueue<Time, GateInput> eventQueue;

    // Simulates an event arriving at a gate
    void simToggle(Time time, GateInput input) {
        auto& gate = input.gate;
        bool toggledOutput = updateState(gate, input);
        if (toggledOutput) {
            // create events for connected gates
            for (GateInput i : gate.connectedInputs()) {
                Time nextTime = time + gate.delay(input, i);
                eventQueue.enqueue(nextTime, i);
            }
        }
    }

    enqueueInitialEvents();

    // event loop. Sequentially execute in ts order
    while (!eventQueue.empty()) {
        auto [time, input] = eventQueue.dequeue();
        simToggle(time, input);
    }

Chronos: An implementation of SLOT

Chronos overview

Chronos provides a framework to build accelerators for applications with speculative parallelism

[Figure: Chronos architecture. Tiles 0 to N, each with PEs, a private non-coherent cache, and a task unit, are connected by a task traffic interconnect and reach Mem0 to Mem3 through a memory traffic interconnect. Legend: Chronos framework, application-specific RTL]

The developer specifies the tasks and how they are implemented

◦ Either software routines on soft cores, or specialized Processing Elements (PE)

Framework takes care of task management and speculative execution

Task life cycle

[Figure: task life cycle. States: Idle, Running, Finished. Transitions: Create, Dispatch, Finish, Commit, Abort, Discard. Decision after an abort: Parent aborted? (Y / N). Legend: Mapped to Tile A, Mapped to Tile B]
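One possible reading of the life cycle as a state machine is sketched below. The slide names the states and transitions, but the exact arrows here are an assumption: an aborted task is discarded if its parent was aborted too, and otherwise returns to Idle to be re-executed. The Committed and Discarded terminal states and all identifiers are hypothetical.

```cpp
#include <cassert>

enum class TaskState { Idle, Running, Finished, Committed, Discarded };
enum class TaskEvent { Dispatch, Finish, Commit, Abort };

// Returns the next state. Events that do not apply leave the state unchanged.
TaskState step(TaskState s, TaskEvent e, bool parentAborted = false) {
    switch (e) {
    case TaskEvent::Dispatch:
        return s == TaskState::Idle ? TaskState::Running : s;
    case TaskEvent::Finish:
        return s == TaskState::Running ? TaskState::Finished : s;
    case TaskEvent::Commit:
        return s == TaskState::Finished ? TaskState::Committed : s;
    case TaskEvent::Abort:
        if (s == TaskState::Committed || s == TaskState::Discarded) return s;
        // Assumed rule: a child of an aborted parent should never have
        // existed, so it is discarded; otherwise the task is requeued.
        return parentAborted ? TaskState::Discarded : TaskState::Idle;
    }
    return s;
}

// Example walk: a task is aborted once while its parent is still valid,
// runs again, and finally commits.
TaskState exampleWalk() {
    TaskState s = TaskState::Idle;
    s = step(s, TaskEvent::Dispatch);
    s = step(s, TaskEvent::Abort, /*parentAborted=*/false);
    assert(s == TaskState::Idle);  // requeued for re-execution
    s = step(s, TaskEvent::Dispatch);
    s = step(s, TaskEvent::Finish);
    s = step(s, TaskEvent::Commit);
    return s;
}
```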

Chronos internal dataflow

[Figure: tasks flowing between Tile A and Tile B over the task interconnect. Each tile has a task queue, a commit queue, a TSB, a PE, and a cache. Tasks are shown with their timestamps and a state: IDLE (I), RUNNING (R), or FINISHED (F). Labeled steps: task creation / dispatch, speculative state of finished tasks, abort messages, requeue task]

Versioning and commit protocol

Eager versioning

Updates speculative values in place

Store old values in an undo log

Key benefits

Makes the common case (commits) fast

Makes speculative data available before commit

[Figure: a core, main memory / cache, and an undo log]

Commit Protocol (GVT – Global Virtual Time)

LVT (Earliest unfinished ts in the tile)

GVT (Earliest unfinished ts in the system)

GVT = min{LVT_0, ..., LVT_N}

Key benefits

Achieves fast and parallel commits

[Figure: Tile 0, Tile 1, through Tile N connected to a GVT arbiter]
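Both mechanisms fit in a short sketch, assuming each task keeps its own undo log and that a finished task may commit once its timestamp is earlier than the GVT. The map-based memory, the type and function names, and the exact commit condition are assumptions for illustration, not the Chronos hardware design.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Timestamp = std::uint64_t;
using Addr = std::uint32_t;
using Memory = std::unordered_map<Addr, long>;  // absent addresses read as 0

struct UndoEntry {
    Addr addr;
    long oldValue;
};

struct SpeculativeTask {
    Timestamp ts;
    std::vector<UndoEntry> undoLog;
};

// Eager versioning: save the old value, then update memory in place.
void specWrite(Memory& mem, SpeculativeTask& task, Addr addr, long value) {
    task.undoLog.push_back(UndoEntry{addr, mem[addr]});
    mem[addr] = value;
}

// Abort (the rare case): restore old values, newest entry first.
void abortTask(Memory& mem, SpeculativeTask& task) {
    for (auto it = task.undoLog.rbegin(); it != task.undoLog.rend(); ++it) {
        mem[it->addr] = it->oldValue;
    }
    task.undoLog.clear();
}

// Commit (the common case): memory is already up to date, drop the log.
void commitTask(SpeculativeTask& task) { task.undoLog.clear(); }

// GVT = min{LVT_0, ..., LVT_N}
Timestamp computeGvt(const std::vector<Timestamp>& lvts) {
    assert(!lvts.empty());
    return *std::min_element(lvts.begin(), lvts.end());
}

// Assumed rule: nothing unfinished precedes a task whose ts is below the GVT.
bool canCommit(const SpeculativeTask& task, Timestamp gvt) {
    return task.ts < gvt;
}
```

Commit only discards log entries, so each tile can retire its own eligible tasks without coordinating with the others beyond learning the GVT.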

Chronos FPGA implementation

Developed an FPGA implementation of Chronos – up to 16 tiles

Running at 125 MHz

High task throughput – can enqueue, dequeue, execute and commit 8 tasks per cycle on a 16-tile system

[Figure: FPGA layout showing the AWS Shell and 16 tiles]

Experimental methodology

Four accelerators built using Chronos framework running on AWS FPGAs

• Discrete Event Simulation (DES)

• Maxflow

• Single Source Shortest Paths (SSSP)

• Astar Search

Custom PEs per application: 32-way multithreaded PE, single PE/tile

Baseline: Highly optimized software parallel implementations running on a 40-threaded Xeon AWS instance

Platform       AWS Instance   Price ($/hr)
Baseline CPU   M4.10xlarge    2.00
FPGA           F1.2xlarge     1.65

Chronos performance vs. 40-threaded Xeon

App       Max. Concurrent Tasks   FPGA 1t / CPU 1t   Overall Speedup
des       256                     2.45×              15.3×
maxflow   192                     0.11×              4.3×
sssp      512                     0.24×              3.6×
astar     192                     0.58×              3.5×

[Figure: speedup chart with labeled values 15.3x (des), 4.3x (maxflow), 3.6x (sssp), and 3.5x (astar)]

Runs many more tasks in parallel

Specialization helps to run a single task efficiently (narrowing the 19× frequency gap with CPU)
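The speedups can also be read against the hourly instance prices from the methodology slide ($2.00 for the CPU baseline, $1.65 for the FPGA). The helper below is a derived, illustrative metric and not a result reported in the slides; its name and the choice of normalizing by price are assumptions.

```cpp
#include <cassert>

// Speedup per dollar relative to the CPU baseline:
// speedup x (CPU price per hour / FPGA price per hour).
double costNormalizedSpeedup(double speedup, double cpuPricePerHour,
                             double fpgaPricePerHour) {
    assert(speedup > 0 && cpuPricePerHour > 0 && fpgaPricePerHour > 0);
    return speedup * cpuPricePerHour / fpgaPricePerHour;
}
```

For example, costNormalizedSpeedup(15.3, 2.00, 1.65) scales the des speedup by the price ratio. Since the FPGA instance is the cheaper of the two, every cost-normalized value is larger than the raw speedup.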

Chronos performance analysis


Breakdown of aggregate PE cycles

Observation:

Most work is ultimately useful (only 11% of cycles result in wasted work)

See the paper for more

Non-speculative applications

Non-rollback applications

Chronos with RISC-V cores

Projected performance on ASIC Chronos

Chronos resource utilization


Conclusion

Prior speculative parallel systems have relied on cache coherence to detect conflicts, precluding their use in accelerators

SLOT (Spatially Located Ordered Tasks): A new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts

Chronos: An implementation of SLOT that provides a common framework for acceleration of applications with speculative parallelism

◦ Use Chronos to build FPGA accelerators for four challenging applications providing up to 15x speedup over a multicore baseline



