Chronos: Efficient Speculative Parallelism for Accelerators MALEEN ABEYDEERA, DANIEL SANCHEZ ASPLOS 2020
Transcript
Page 1: Chronos: Efficient Speculative Parallelism for Accelerators (people.csail.mit.edu/sanchez/papers/2020.chronos.asplos.slides.pdf)

Chronos: Efficient Speculative Parallelism for Accelerators

MALEEN ABEYDEERA, DANIEL SANCHEZ

ASPLOS 2020

Page 2:

Current hardware accelerators are limited to easy parallelism

Current Accelerators

Target easy parallelism

Tasks and dependences known in advance

e.g.: Deep learning, Genomics

Chronos

Targets hard parallelism

Requires speculative execution

e.g.: Graph analytics, simulation, transactional databases

Page 3:

Problem and Insight

Problem
• Prior speculation mechanisms (Transactional Memory, Thread-Level Speculation) require global conflict detection
• Shared memory system → coherence protocol
• Coherence is poorly suited for accelerators

Insight
• Limit the data that each core can access
• Divide work into tiny tasks and send them to data
• Coordinate tasks through order constraints

[Figure: Transaction 1 (W, Y) and Transaction 2 (Z, W) on Core 1 and Core 2, with objects W, X, Y, Z in memory and order constraints between tasks.]

Local conflict detection → No coherence needed

Page 4:

Contributions

SLOT (Spatially Located Ordered Tasks): A new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts

Chronos: An implementation of SLOT that provides a common framework for acceleration of applications with speculative parallelism

https://chronos-arch.csail.mit.edu/

Page 5:

Speculative parallelism with single-object tasks

Discrete Event Simulation (DES) for Digital Circuits

[Figure: a circuit of OR, NAND, and XOR gates with delays between 1 ns and 6 ns, and a timeline (Time in ns, 1 to 6) with tasks O1, N2, X3, and X6. Callout, truncated in the source: "If X6 is being speculatively executed".]

Page 6:

Prior techniques rely on global conflict detection

Why? No restriction on where a task can run

Relies on coherence protocol to find conflicts

[Figure: Core 1 and Core 2, each with a Private Cache, above a Shared Cache / Directory; timeline (Time in ns, 1 to 6) with tasks O1, N2, X3, and X6 spread across the two cores.]

Page 7:

Insight 1: Leveraging spatial task mapping for local conflict detection

Impose restrictions on where a task can run

Conflict detection is local to a core

[Figure: Core 1 and Core 2, each with a Private Cache, above a Shared Cache / Directory; timeline (Time in ns, 1 to 6) with tasks O1, N2, X3, and X6, annotated "Mapped to Core 1" and "Mapped to Core 2".]
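The restriction on where a task can run can be pictured as a fixed function from the object a task touches to the core (or tile) that runs it. The following is a minimal C++ sketch under that reading. The modulo mapping and the names `tileOf`, `mayConflict`, and `sameTile` are assumptions made for illustration; the slides do not say which mapping function Chronos uses.

```cpp
#include <cstdint>

// Hypothetical mapping: the slides only say that tasks are restricted in
// where they can run. A modulo hash is assumed here for illustration.
constexpr std::uint32_t tileOf(std::uint64_t objectId, std::uint32_t numTiles) {
    return static_cast<std::uint32_t>(objectId % numTiles);
}

struct Task {
    std::uint64_t timestamp;
    std::uint64_t objectId;  // the single object this task accesses
};

// Two single-object tasks can only conflict if they access the same object.
constexpr bool mayConflict(const Task& a, const Task& b) {
    return a.objectId == b.objectId;
}

// Because the mapping depends only on the object ID, tasks that may conflict
// always land on the same tile, so each tile can check conflicts locally.
constexpr bool sameTile(const Task& a, const Task& b, std::uint32_t numTiles) {
    return tileOf(a.objectId, numTiles) == tileOf(b.objectId, numTiles);
}
```

With any such mapping, `mayConflict(a, b)` implies `sameTile(a, b, n)`, which is the property that lets conflict detection stay local.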

Page 8:

Insight 2: Leveraging order to ensure atomicity

Banking application: Each transaction decrements the balance of one account and increments another

Account (object)   Balance
W                  $100
X                  $1500
Y                  $200
Z                  $400

Transaction             Timestamps
Tx. 1: Transfer W → Y   0, 1
Tx. 2: Transfer Z → W   10, 11
Tx. 3: Transfer X → Z   20, 21

Assign a disjoint timestamp range for each coarse transaction
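As a rough illustration of the timestamp ranges, one transfer can be split into two single-object tasks whose timestamps fall inside a range owned by that transaction. Because ranges are disjoint, no task from another transaction is ordered between the two halves, so the transfer appears atomic. In this C++ sketch the stride of 10 mirrors the slide's 0, 10, 20 example; the names `AccountTask` and `splitTransfer`, and the choice of debit-then-credit order, are assumptions.

```cpp
#include <cstdint>
#include <vector>

// One fine-grained task: touches exactly one account (object).
struct AccountTask {
    std::uint64_t timestamp;
    char account;        // object ID, e.g. 'W'
    std::int64_t delta;  // signed change to the balance
};

// Assumed layout: transaction k owns timestamps [k*stride, (k+1)*stride).
// The slide's example uses 0-1, 10-11, 20-21, which matches stride = 10.
std::vector<AccountTask> splitTransfer(std::uint64_t txIndex, char from, char to,
                                       std::int64_t amount,
                                       std::uint64_t stride = 10) {
    const std::uint64_t base = txIndex * stride;
    return {
        {base, from, -amount},    // decrement the source account
        {base + 1, to, +amount},  // increment the destination account
    };
}
```

For example, `splitTransfer(1, 'Z', 'W', amount)` yields tasks at timestamps 10 and 11, matching Tx. 2 on the slide.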

Page 9:

Benefits of fine-grained tasks

✓ Increased data locality
✓ Reduced network traffic
✓ Increased parallelism
✓ Low probability and impact of aborts
✓ Asynchronous communication

[Figure: two panels, "Brings data to compute" and "Sends compute to data", showing Transaction 1 (W, Y) and Transaction 2 (Z, Y) on Core 1 and Core 2, with objects W, X, Y, Z in memory and order constraints between tasks.]

Page 10:

SLOT (Spatially Located Ordered Tasks)

SLOT programs consist of tasks

Tasks can create child tasks through a simple API:

    slot::enqueue(fn_ptr, timestamp, object-id, arguments…);

Timestamp: Specifies order. Tasks appear to execute in timestamp order

Object-id: Specifies dependences. Tasks with the same object-id are treated as data-dependent

Tasks with different object-ids can only communicate through arguments
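To make the semantics concrete, here is a minimal sequential emulation of this API in C++. It only reproduces the guarantee that tasks appear to execute in timestamp order; it does not speculate or run anything in parallel. Everything in it is an assumption for illustration: the `slot_sketch` namespace, the `Runtime` class, passing arguments by lambda capture instead of as trailing parameters, and breaking timestamp ties by insertion order (the slides do not say how ties are ordered).

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

namespace slot_sketch {

using Timestamp = std::uint64_t;
using ObjectId = std::uint64_t;

struct Task {
    Timestamp ts;
    ObjectId object;    // unused here; hardware would use it for placement
    std::uint64_t seq;  // insertion order, used only to break ties (assumed)
    std::function<void(Timestamp)> fn;
};

// Orders the priority queue so the earliest timestamp is on top.
struct Later {
    bool operator()(const Task& a, const Task& b) const {
        return a.ts != b.ts ? a.ts > b.ts : a.seq > b.seq;
    }
};

class Runtime {
public:
    // Stand-in for slot::enqueue(fn_ptr, timestamp, object-id, arguments...).
    // Arguments are captured inside the callable.
    void enqueue(std::function<void(Timestamp)> fn, Timestamp ts, ObjectId object) {
        tasks_.push(Task{ts, object, nextSeq_++, std::move(fn)});
    }

    // Runs tasks one at a time in timestamp order. Tasks may enqueue children.
    void run() {
        while (!tasks_.empty()) {
            Task t = tasks_.top();  // copy before pop so fn may enqueue safely
            tasks_.pop();
            t.fn(t.ts);
        }
    }

private:
    std::priority_queue<Task, std::vector<Task>, Later> tasks_;
    std::uint64_t nextSeq_ = 0;
};

}  // namespace slot_sketch
```

A task enqueued at timestamp 10 that creates a child at timestamp 20 will see that child run before an already-queued task at timestamp 30, which is the ordering behavior the API description calls for.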

Page 11:

SLOT programming example (in software)

[Figure: example circuit with delays of 1 ns, 5 ns, 1 ns, and 2 ns and signal values 0 and 1.]

SLOT version:

    // Simulates an event arriving at a gate
    void simToggle(Time time, GateInput input) {
        auto gate = input.gate;
        bool toggledOutput = updateState(gate, input);
        if (toggledOutput) {
            // Create events for connected gates
            for (GateInput i : gate.connectedInputs()) {
                Time nextTime = time + gate.delay(input, i);
                slot::enqueue(simToggle, nextTime, i.gateID, i);
            }
        }
    }

    enqueueInitialTasks();
    slot::run();

Sequential version (software event loop):

    PriorityQueue<Time, GateInput> eventQueue;

    // Simulates an event arriving at a gate
    void simToggle(Time time, GateInput input) {
        auto gate = input.gate;
        bool toggledOutput = updateState(gate, input);
        if (toggledOutput) {
            // Create events for connected gates
            for (GateInput i : gate.connectedInputs()) {
                Time nextTime = time + gate.delay(input, i);
                eventQueue.enqueue(nextTime, i);
            }
        }
    }

    enqueueInitialEvents();

    // Event loop: sequentially execute events in timestamp order
    while (!eventQueue.empty()) {
        auto [time, input] = eventQueue.dequeue();
        simToggle(time, input);
    }

Page 12:

Chronos: An implementation of SLOT

Page 13:

Chronos overview

Chronos provides a framework to build accelerators for applications with speculative parallelism

The developer specifies the tasks and how they are implemented
◦ Either software routines on soft cores, or specialized Processing Elements (PEs)

The framework takes care of task management and speculative execution

[Figure: Tiles 0 to N, each containing PEs, a private non-coherent cache, and a Task Unit. Tiles are connected by a Task Traffic Interconnect and reach memories Mem0 to Mem3 through a Memory Traffic Interconnect. Legend: Chronos Framework; Application-specific RTL.]

Page 14:

Task life cycle

[Figure: task state diagram. States: Idle, Running, Finished. Transitions: Create, Dispatch, Finish, Commit, Abort, Discard. Decision: "Parent aborted?" (Y/N).]
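The life cycle can be sketched as a small state machine. The transcript preserves only the state and transition names, so the edges below are one plausible reading and not a confirmed description of Chronos: an aborted task is assumed to be discarded if its parent was also aborted and requeued as idle otherwise. The names `State`, `Event`, and `step` are hypothetical.

```cpp
#include <optional>

enum class State { Idle, Running, Finished, Committed, Discarded };
enum class Event { Dispatch, Finish, Commit, Abort };

// Returns the next state, or std::nullopt if the event does not apply.
// Edge set is an assumption reconstructed from the diagram's labels.
std::optional<State> step(State s, Event e, bool parentAborted = false) {
    switch (e) {
        case Event::Dispatch:
            if (s == State::Idle) return State::Running;
            break;
        case Event::Finish:
            if (s == State::Running) return State::Finished;
            break;
        case Event::Commit:
            if (s == State::Finished) return State::Committed;
            break;
        case Event::Abort:
            // Assumed: discard when the parent aborted too, else requeue.
            if (s == State::Idle || s == State::Running || s == State::Finished)
                return parentAborted ? State::Discarded : State::Idle;
            break;
    }
    return std::nullopt;
}
```

Under this reading, a finished task that is aborted while its parent survives returns to Idle and can be dispatched again.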

Page 15: Chronos: Efficient Speculative Parallelism for Acceleratorspeople.csail.mit.edu/sanchez/papers/2020.chronos.asplos.slides.pdf · Chronos: An implementation of SLOT that provides a

Mapped to Tile A

Mapped to Tile B

Chronos internal dataflow

15

Task Interconnect

Tile

ATi

le B

Task Queue Commit Queue TSB

Cache

Cache

IDLE (I)

RUNNING (R)

FINISHED (F)

1

I

2

I

3

I

6

I

1

2

1

Task creation/ dispatch

1

R

6

PE

Speculative state of finished tasks

1 ns

0

5 ns1 ns

2 ns

111

6 6

1

F

22

2

R

3

6

R

8

I

Abort messages

Requeue task

2 ns

6

Page 16:

Versioning and commit protocol

Eager versioning
• Updates speculative values in place
• Stores old values in an undo log

Key benefits
• Makes the common case (commits) fast
• Makes speculative data available before commit

Commit protocol (GVT, Global Virtual Time)
• LVT: earliest unfinished timestamp in the tile
• GVT: earliest unfinished timestamp in the system
• GVT = min{LVT_0, ..., LVT_N}

Key benefits
• Achieves fast and parallel commits

[Figure: a Core with an Undo Log above Main Memory / Cache; Tile 0, Tile 1, ..., Tile N each report an LVT to a GVT Arbiter.]
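Both mechanisms on this slide are small enough to sketch in software. The C++ below is an illustration under stated assumptions, not the Chronos hardware design: `UndoLog` models eager versioning for a single task over a map standing in for memory, and `gvt` computes the minimum of the per-tile LVTs as in the slide's formula. The rule in `canCommit`, that a finished task may commit once its timestamp is below the GVT, is my reading of "earliest unfinished timestamp in the system" and is not stated on the slide. All names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Addr = std::uint64_t;
using Word = std::int64_t;
using Timestamp = std::uint64_t;
using Memory = std::unordered_map<Addr, Word>;

// Eager versioning for one task: writes update memory in place and the
// overwritten value is appended to an undo log.
class UndoLog {
public:
    void write(Memory& mem, Addr a, Word v) {
        log_.emplace_back(a, mem[a]);  // save old value (0 if never written)
        mem[a] = v;
    }
    // Commit is cheap: memory already holds the new values.
    void commit() { log_.clear(); }
    // Abort restores old values in reverse order of the writes.
    void abort(Memory& mem) {
        for (auto it = log_.rbegin(); it != log_.rend(); ++it)
            mem[it->first] = it->second;
        log_.clear();
    }

private:
    std::vector<std::pair<Addr, Word>> log_;
};

// GVT = min{LVT_0, ..., LVT_N}. Requires at least one tile.
Timestamp gvt(const std::vector<Timestamp>& lvts) {
    return *std::min_element(lvts.begin(), lvts.end());
}

// Assumed rule: every task ordered before the GVT has finished, so a
// finished task with an earlier timestamp can no longer be aborted.
bool canCommit(Timestamp finishedTaskTs, Timestamp gvtNow) {
    return finishedTaskTs < gvtNow;
}
```

The sketch shows why commits are the fast path: `commit` only drops the log, while `abort` is the operation that has to walk it.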

Page 17:

Chronos FPGA implementation

Developed an FPGA implementation of Chronos – up to 16 tiles

Running at 125 MHz

High task throughput – can enqueue, dequeue, execute and commit 8 tasks per cycle on a 16-tile system

[Figure: AWS Shell and 16 tiles.]

Page 18:

Experimental methodology

Four accelerators built using the Chronos framework, running on AWS FPGAs:
• Discrete Event Simulation (DES)
• Maxflow
• Single Source Shortest Paths (SSSP)
• Astar Search

Custom PEs per application: 32-way multithreaded PE, single PE/tile

Baseline: Highly optimized parallel software implementations running on a 40-threaded Xeon AWS instance

Platform       AWS Instance   Price ($/hr)
Baseline CPU   M4.10xlarge    2.00
FPGA           F1.2xlarge     1.65

Page 19:

Chronos performance vs. 40-threaded Xeon

App       Max. Concurrent Tasks   FPGA 1t / CPU 1t   Overall Speedup
des       256                     2.45×              15.3×
maxflow   192                     0.11×              4.3×
sssp      512                     0.24×              3.6×
astar     192                     0.58×              3.5×

Runs many more tasks in parallel

Specialization helps to run a single task efficiently (narrowing the 19× frequency gap with CPU)
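The methodology slide lists hourly instance prices, so the speedups can be restated as work per dollar. The slides do not report this metric; the C++ sketch below is plain arithmetic on the published numbers, and the names `Result`, `kResults`, and `perfPerDollarGain` are hypothetical.

```cpp
#include <string>
#include <vector>

struct Result {
    std::string app;
    double overallSpeedup;  // vs. the 40-threaded Xeon baseline
};

// Hourly prices from the methodology table.
constexpr double kCpuPricePerHour = 2.00;   // M4.10xlarge
constexpr double kFpgaPricePerHour = 1.65;  // F1.2xlarge

// Work per dollar on the FPGA relative to the CPU baseline:
// speedup x (CPU price / FPGA price).
constexpr double perfPerDollarGain(double overallSpeedup) {
    return overallSpeedup * kCpuPricePerHour / kFpgaPricePerHour;
}

// Overall speedups as listed in the performance table.
const std::vector<Result> kResults = {
    {"des", 15.3}, {"maxflow", 4.3}, {"sssp", 3.6}, {"astar", 3.5}};
```

Since the FPGA instance is the cheaper of the two, the price ratio of about 1.21 scales every speedup upward by the same factor.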

Page 20:

Chronos performance analysis

[Figure: Breakdown of aggregate PE cycles.]

Observation:

Most work is ultimately useful (only 11% of cycles result in wasted work)

Page 21:

See the paper for more

Non-speculative applications

Non-rollback applications

Chronos with RISC-V cores

Projected performance on ASIC Chronos

Chronos resource utilization

Page 22:

Conclusion

Prior speculative parallel systems have relied on cache coherence to detect conflicts, precluding their use in accelerators

SLOT (Spatially Located Ordered Tasks): A new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts

Chronos: An implementation of SLOT that provides a common framework for acceleration of applications with speculative parallelism
◦ Use Chronos to build FPGA accelerators for four challenging applications, providing up to 15x speedup over a multicore baseline

https://chronos-arch.csail.mit.edu/

