Chronos: Efficient Speculative Parallelism for Accelerators
MALEEN ABEYDEERA, DANIEL SANCHEZ
ASPLOS 2020
Current hardware accelerators are limited to easy parallelism
Current Accelerators
Target easy parallelism
Tasks and dependences known in advance
e.g.: Deep learning, Genomics
Chronos
Targets hard parallelism
Requires speculative execution
e.g.: Graph analytics, simulation, transactional databases
Problem and Insight
Problem
Prior speculation mechanisms (Transactional Memory, Thread Level Speculation) require global conflict detection
Shared memory system → coherence protocol
Coherence poorly suited for accelerators
Insight
Limit the data that each core can access
Divide work into tiny tasks and send them to data
Coordinate tasks through order constraints
[Figure: Transaction 1 (W, Y) and Transaction 2 (Z, W) on Core 1 and Core 2 over memory objects W, X, Y, Z, with order constraints between tasks]
Local conflict detection → No coherence needed
Contributions
SLOT (Spatially Located Ordered Tasks): A new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts
Chronos: An implementation of SLOT that provides a common framework for acceleration of applications with speculative parallelism
https://chronos-arch.csail.mit.edu/
Speculative parallelism with single-object tasks
Discrete Event Simulation (DES) for Digital Circuits
[Figure: circuit of OR, NAND and XOR gates with per-gate delays in ns, and a timeline (Time in ns, 1 to 6) of tasks O1, N2, X3 and X6. Callout: "If X6 is being speculatively executed"]
Prior techniques rely on global conflict detection
Why? No restriction on where a task can run
[Figure: tasks O1, N2, X3 and X6 on Core 1 and Core 2, each core with a private cache, above a shared cache / directory; timeline in ns, 1 to 6]
Relies on coherence protocol to find conflicts
Insight 1: Leveraging spatial task mapping for local conflict detection
Impose restrictions on where a task can run
Conflict detection is local to a core
[Figure: tasks O1, N2, X3 and X6 on Core 1 and Core 2, each core with a private cache, above a shared cache / directory; timeline in ns, 1 to 6. Legend: Mapped to Core 1, Mapped to Core 2]
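One way to picture this restriction is a fixed function from object to core, so that every task naming a given object lands on the same core. The sketch below is illustrative only: `homeCore`, the modulo mapping, and `mayConflict` are assumptions made for this example, not the mapping Chronos actually uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical static mapping: every object has exactly one home core.
// Assumption: a plain modulo; the slides do not specify the real function.
inline std::uint32_t homeCore(std::uint64_t objectId, std::uint32_t numCores) {
    return static_cast<std::uint32_t>(objectId % numCores);
}

// Two tasks can only conflict when they name the same object. Tasks that
// name the same object always share a home core, so a check performed
// inside each core sees every possible conflict.
inline bool mayConflict(std::uint64_t objectA, std::uint64_t objectB) {
    return objectA == objectB;
}
```

With two cores, objects 4 and 6 both map to core 0 but are never reported as conflicting, while two tasks on object 6 always meet on core 0.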
Insight 2: Leveraging order to ensure atomicity
Account (object)   Balance
W                  $100
X                  $1500
Y                  $200
Z                  $400
Transaction               Timestamps
Tx. 1: Transfer W → Y     0, 1
Tx. 2: Transfer Z → W     10, 11
Tx. 3: Transfer X → Z     20, 21
Banking application: Each transaction decrements the balance of one account and increments another
Assign a disjoint timestamp range for each coarse transaction
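The splitting step can be sketched as follows. This is a minimal illustration under stated assumptions: the `Task` struct, `splitTransfer`, `runInOrder`, and the transfer amounts used in the usage note are hypothetical, and the in-order loop only models the result that speculative execution has to reproduce.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// One single-object task: add `delta` to `account` at `timestamp`.
struct Task {
    long timestamp;
    char account;
    long delta;
};

// Split one transfer into two single-object tasks. `base` is the start of
// the transaction's private timestamp range (0, 10, 20 in the example above).
std::vector<Task> splitTransfer(long base, char from, char to, long amount) {
    return {{base, from, -amount}, {base + 1, to, +amount}};
}

// Reference semantics: tasks appear to execute in timestamp order. Because
// the ranges are disjoint, the two halves of a transfer are never
// interleaved with another transfer's tasks.
void runInOrder(std::vector<Task> tasks, std::map<char, long>& balance) {
    std::sort(tasks.begin(), tasks.end(),
              [](const Task& a, const Task& b) { return a.timestamp < b.timestamp; });
    for (const Task& t : tasks) balance[t.account] += t.delta;
}
```

For instance, with assumed amounts of $50 (W to Y at base 0), $30 (Z to W at base 10) and $500 (X to Z at base 20), the final balances are W $80, X $1000, Y $250, Z $870, and the total stays at $2200.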
Benefits of fine-grained tasks
✓ Increased data locality
✓ Reduced network traffic
✓ Increased parallelism
✓ Low probability and impact of aborts
✓ Asynchronous communication
[Figure: Transaction 1 (W, Y) and Transaction 2 (Z, Y) on Core 1 and Core 2 over memory objects W, X, Y, Z, linked by order constraints. Panel labels: "Brings data to compute" and "Sends compute to data"]
SLOT (Spatially Located Ordered Tasks)
SLOT programs consist of tasks
Tasks can create children tasks through a simple API:
slot::enqueue( fn_ptr, timestamp, object-id, arguments…);
Timestamp : Specifies order. Tasks appear to execute in timestamp order
Object-id : Specifies dependences. Tasks with same object-id are treated as data-dependent
Tasks with different object-ids can only communicate through arguments
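To make the "appear to execute in timestamp order" rule concrete, here is a minimal sequential emulation. It is a sketch, not the Chronos runtime: the `slot_sketch` namespace, the `TaskFn` signature, and the tie-breaking by insertion order are all assumptions, and the object ID is only carried along because a sequential run needs no conflict detection.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

namespace slot_sketch {  // hypothetical name; not the real slot:: API

using Timestamp = std::uint64_t;
using ObjectId = std::uint64_t;
using TaskFn = std::function<void(Timestamp, ObjectId)>;

struct Task {
    Timestamp ts;
    ObjectId obj;
    TaskFn fn;
    std::uint64_t seq;  // insertion order, used only to break timestamp ties
};

// Orders the priority queue so the earliest timestamp is on top.
struct Later {
    bool operator()(const Task& a, const Task& b) const {
        if (a.ts != b.ts) return a.ts > b.ts;
        return a.seq > b.seq;
    }
};

inline std::priority_queue<Task, std::vector<Task>, Later>& pending() {
    static std::priority_queue<Task, std::vector<Task>, Later> q;
    return q;
}

// Creates a task; tasks may call this to create children.
inline void enqueue(TaskFn fn, Timestamp ts, ObjectId obj) {
    static std::uint64_t nextSeq = 0;
    pending().push(Task{ts, obj, std::move(fn), nextSeq++});
}

// Runs every task, including children created along the way, in timestamp
// order. A speculative implementation must produce the same result.
inline void run() {
    while (!pending().empty()) {
        Task t = pending().top();
        pending().pop();
        t.fn(t.ts, t.obj);
    }
}

}  // namespace slot_sketch
```

A task enqueued at timestamp 2 that creates a child at timestamp 3 will see that child run before an unrelated task enqueued earlier at timestamp 5.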
SLOT programming example (in software)
[Figure: example gate-level circuit with gate delays in ns]
// SLOT version
// Simulates an event arriving at a gate
void simToggle(Time time, GateInput input) {
  auto& gate = input.gate;
  bool toggledOutput = updateState(gate, input);
  if (toggledOutput) {
    // create events for connected gates
    for (GateInput i : gate.connectedInputs()) {
      Time nextTime = time + gate.delay(input, i);
      slot::enqueue(simToggle, nextTime, i.gateID, i);
    }
  }
}
enqueueInitialTasks();
slot::run();
// Sequential version
PriorityQueue<Time, GateInput> eventQueue;
// Simulates an event arriving at a gate
void simToggle(Time time, GateInput input) {
  auto& gate = input.gate;
  bool toggledOutput = updateState(gate, input);
  if (toggledOutput) {
    // create events for connected gates
    for (GateInput i : gate.connectedInputs()) {
      Time nextTime = time + gate.delay(input, i);
      eventQueue.enqueue(nextTime, i);
    }
  }
}
enqueueInitialEvents();
// Event loop: sequentially execute events in timestamp order
while (!eventQueue.empty()) {
  auto [time, input] = eventQueue.dequeue();
  simToggle(time, input);
}
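For readers who want to run the sequential event loop, the following is a self-contained sketch of it. The `Gate` layout (two inputs, one delay per gate, a `fanout` list), the `simulate` wrapper, and the use of `std::priority_queue` are assumptions made for illustration; the slide's `updateState`, `connectedInputs` and `delay(input, i)` helpers are folded into plain struct fields.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

using Time = unsigned;

// Identifies one input pin of one gate (assumed representation).
struct GateInput { std::size_t gate; std::size_t pin; };

struct Gate {
    std::function<bool(bool, bool)> op;  // two-input logic function
    Time delay = 1;                      // propagation delay in ns
    bool in[2] = {false, false};
    bool out = false;
    std::vector<GateInput> fanout;       // input pins driven by this output
};

struct Event { Time time; GateInput input; };
struct Later {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};
using EventQueue = std::priority_queue<Event, std::vector<Event>, Later>;

// Simulates an event arriving at a gate: the named input pin toggles.
void simToggle(std::vector<Gate>& gates, EventQueue& eventQueue, Time time, GateInput input) {
    Gate& gate = gates[input.gate];
    gate.in[input.pin] = !gate.in[input.pin];
    bool newOut = gate.op(gate.in[0], gate.in[1]);
    if (newOut != gate.out) {
        gate.out = newOut;
        // create events for connected gates
        for (GateInput i : gate.fanout) eventQueue.push(Event{time + gate.delay, i});
    }
}

// Event loop: executes events in timestamp order, returns how many ran.
std::size_t simulate(std::vector<Gate>& gates, const std::vector<Event>& initial) {
    EventQueue eventQueue;
    for (const Event& e : initial) eventQueue.push(e);
    std::size_t processed = 0;
    while (!eventQueue.empty()) {
        Event e = eventQueue.top();
        eventQueue.pop();
        simToggle(gates, eventQueue, e.time, e.input);
        ++processed;
    }
    return processed;
}
```

In a two-gate circuit where an OR gate (delay 1 ns) drives one input of an XOR gate, toggling an OR input at time 0 produces a second event at the XOR gate at time 1.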
Chronos: An implementation of SLOT
Chronos overview
Chronos provides a framework to build accelerators for applications with speculative parallelism
[Figure: Tiles 0 to N, each with PEs, a private non-coherent cache and a task unit, connected by a task traffic interconnect and reaching Mem0 to Mem3 through a memory traffic interconnect. Legend: Chronos framework, application-specific RTL]
The developer specifies the tasks and how they are implemented
◦ Either software routines on soft cores, or specialized Processing Elements (PEs)
Framework takes care of task management and speculative execution
Task life cycle
[Figure: task life cycle. States: Idle, Running, Finished. Transitions: Create, Dispatch, Finish, Commit, Abort, Discard, with the decision "Parent aborted?" (Y/N). Legend: Mapped to Tile A, Mapped to Tile B]
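The life cycle can be read as a small state machine. The sketch below is one interpretation of the diagram, not a specification: the `Committed` and `Discarded` terminal states, the `nextState` function, and the rule that an aborted task is discarded when its parent was aborted and otherwise returns to Idle for re-execution are assumptions inferred from the transition labels.

```cpp
#include <cassert>

enum class TaskState { Idle, Running, Finished, Committed, Discarded };
enum class TaskEvent { Dispatch, Finish, Commit, Abort };

// Returns the next state, or the current state when the event does not
// apply. Assumption: on abort, "Parent aborted? = Y" discards the task and
// "N" sends it back to Idle so it can be dispatched again.
inline TaskState nextState(TaskState s, TaskEvent e, bool parentAborted = false) {
    switch (e) {
        case TaskEvent::Dispatch:
            return s == TaskState::Idle ? TaskState::Running : s;
        case TaskEvent::Finish:
            return s == TaskState::Running ? TaskState::Finished : s;
        case TaskEvent::Commit:
            return s == TaskState::Finished ? TaskState::Committed : s;
        case TaskEvent::Abort:
            if (s == TaskState::Committed || s == TaskState::Discarded) return s;
            return parentAborted ? TaskState::Discarded : TaskState::Idle;
    }
    return s;
}
```

Under these assumptions a finished task that is aborted on its own goes back to Idle, while one whose parent was aborted is discarded, since the parent will recreate it when it runs again.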
Chronos internal dataflow
[Figure: animation of tasks moving between Tile A and Tile B over the task interconnect. Each tile shows a task queue, a commit queue (speculative state of finished tasks), a TSB, a PE and a cache. Task states: IDLE (I), RUNNING (R), FINISHED (F). Labeled steps: task creation/dispatch, abort messages, requeue task]
Versioning and commit protocol
Eager versioning
Updates speculative values in place
Store old values in an undo log
Key benefits
Makes the common case (commits) fast
Makes speculative data available before commit
[Figure: core writing to main memory / cache, with an undo log]
Commit Protocol (GVT – Global Virtual Time)
LVT (Earliest unfinished ts in the tile)
GVT (Earliest unfinished ts in the system)
GVT = min{LVT_0, …, LVT_N}
Key benefits
Achieves fast and parallel commits
[Figure: Tile 0, Tile 1, …, Tile N connected to the GVT arbiter]
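Both mechanisms are small enough to sketch in software. The code below is an illustration under stated assumptions, not the hardware design: `computeGvt`, `canCommit` and `VersionedMemory` are hypothetical names, the commit rule (a finished task may commit once its timestamp is earlier than the GVT) is the usual reading of a GVT protocol rather than something the slide spells out, and the undo log here covers a single task.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Timestamp = std::uint64_t;

// GVT = min{LVT_0, ..., LVT_N}. Assumes at least one tile.
inline Timestamp computeGvt(const std::vector<Timestamp>& lvts) {
    return *std::min_element(lvts.begin(), lvts.end());
}

// Assumption: nothing earlier than the GVT is still unfinished, so a
// finished task with an earlier timestamp can no longer be aborted.
inline bool canCommit(Timestamp finishedTs, Timestamp gvt) {
    return finishedTs < gvt;
}

// Eager versioning for one task: write in place, remember the old value.
class VersionedMemory {
public:
    void write(std::uint64_t addr, std::int64_t value) {
        undoLog_.emplace_back(addr, mem_[addr]);  // save old value first
        mem_[addr] = value;                       // then update in place
    }
    std::int64_t read(std::uint64_t addr) { return mem_[addr]; }
    // Commit is the cheap path: just drop the log.
    void commit() { undoLog_.clear(); }
    // Abort restores old values in reverse order of the writes.
    void abort() {
        for (auto it = undoLog_.rbegin(); it != undoLog_.rend(); ++it)
            mem_[it->first] = it->second;
        undoLog_.clear();
    }
private:
    std::unordered_map<std::uint64_t, std::int64_t> mem_;
    std::vector<std::pair<std::uint64_t, std::int64_t>> undoLog_;
};
```

With LVTs of 7, 3 and 9 the GVT is 3, so a finished task at timestamp 2 can commit while one at timestamp 5 must keep its undo log.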
Chronos FPGA implementation
Developed an FPGA implementation of Chronos – up to 16 tiles
Running at 125 MHz
High task throughput – can enqueue, dequeue, execute and commit 8 tasks per cycle on a 16-tile system
[Figure: AWS Shell and 16 tiles]
Experimental methodology
Four accelerators built using the Chronos framework, running on AWS FPGAs
• Discrete Event Simulation (DES)
• Maxflow
• Single Source Shortest Paths (SSSP)
• Astar Search
Custom PEs per application: 32-way multithreaded PE, single PE/tile
Baseline: Highly optimized software parallel implementations running on a 40-threaded Xeon AWS instance
Platform        AWS Instance   Price ($/hr)
Baseline CPU    M4.10xlarge    2.00
FPGA            F1.2xlarge     1.65
Chronos performance vs. 40-threaded Xeon
App       Max. Concurrent Tasks   FPGA 1t / CPU 1t   Overall Speedup
des       256                     2.45×              15.3×
maxflow   192                     0.11×              4.3×
sssp      512                     0.24×              3.6×
astar     192                     0.58×              3.5×
Runs many more tasks in parallel
Specialization helps to run a single task efficiently (narrowing the 19× frequency gap with CPU)
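Since the two AWS instances are priced differently, the speedups can also be read per dollar. The helper below is a derived illustration, not a result reported in the talk: `costNormalizedSpeedup` is a hypothetical name, and the only inputs are the overall speedups and the hourly prices quoted in the deck ($2.00/hr for the CPU instance, $1.65/hr for the FPGA instance).

```cpp
#include <cassert>
#include <cmath>

// Work done per dollar on the FPGA relative to the CPU:
// overall speedup scaled by the ratio of hourly prices.
constexpr double costNormalizedSpeedup(double speedup, double cpuPricePerHour,
                                       double fpgaPricePerHour) {
    return speedup * cpuPricePerHour / fpgaPricePerHour;
}
```

For des, `costNormalizedSpeedup(15.3, 2.00, 1.65)` comes to about 18.5, and for astar `costNormalizedSpeedup(3.5, 2.00, 1.65)` comes to about 4.2.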
Chronos performance analysis
Breakdown of aggregate PE cycles
Observation:
Most work is ultimately useful (only 11% of cycles result in wasted work)
See the paper for more
Non-speculative applications
Non-rollback applications
Chronos with RISC-V cores
Projected performance on ASIC Chronos
Chronos resource utilization
Conclusion
Prior speculative parallel systems have relied on cache coherence to detect conflicts, precluding their use in accelerators
SLOT (Spatially Located Ordered Tasks): A new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts
Chronos: An implementation of SLOT that provides a common framework for acceleration of applications with speculative parallelism
◦ Use Chronos to build FPGA accelerators for four challenging applications, providing up to 15× speedup over a multicore baseline
https://chronos-arch.csail.mit.edu/