Scalable Interconnects for Reconfigurable Spatial Architectures
Yaqi Zhang, Alexander Rucker, Matthew Vilim, Raghu Prabhakar, William Hwang, Kunle Olukotun
Electrical Engineering, Stanford University
ISCA ’19: The 46th International Symposium on Computer Architecture, Phoenix, AZ
ISCA 2019 Scalable Interconnects for Reconfigurable Spatial Architectures 1/27
Spatial Accelerators
• Energy efficient
• High-throughput
• Low-latency
Examples:
• Plasticine (ISCA ’17)
• Compressed-sparse CNN accelerator (ISCA ’17)
• Stream-dataflow accelerator (ISCA ’17)
Accelerator Characteristics
• High compute density
• High on-chip memory bandwidth
• Distributed compute and memory resources
• Streaming interface between compute and memory
• Statically mapped and scheduled compute graph
[Figure: array of distributed compute (C) and on-chip scratchpad (M) blocks, with off-chip DRAM (D) interfaces.]
Key Challenges
On-chip networks play a critical role in:
• Energy efficiency (↓ data movement)
• Flexibility
• Scalability
• Compute utilization
Communication Patterns

Architecture         Communication Frequency   Communication Granularity   Limited by
Processor            Infrequent                Packet                      Latency
Spatial Accelerator  Frequent                  Fine-grained                Throughput
[Figure: Multi-Processor. Compute (C) and memory (M) units connected by a memory bus; labeled "Parallelism".]
[Figure: Spatial Accelerator. Rows of compute (C) and memory (M) blocks; labeled "Parallelism" and "Pipelining".]
[Figure: compute stages and on-chip memory connected through the network.]
Outline
Motivation
Network Design Space
Compilation Flow
Evaluation
Static Network
[Figure: array of physical blocks (PB) connected through switches (S).]
Pros: Guaranteed bandwidth
Cons: Low link utilization; P&R failures
Dynamic Network
[Figure: array of physical blocks (PB) connected through routers (R).]
Pros: Link sharing
Cons: Limited bandwidth; Deadlock
Hybrid Network: Static and Dynamic
[Figure: array of physical blocks (PB) connected through both routers (R) and switches (S).]
Pros: Link sharing; More bandwidth; Guaranteed P&R
Cons: More area; More static power
Spatial: A DSL for Reconfigurable Accelerators

    // Tiled inner product
    val vecA,vecB = DRAM[T](N)
    Reduce(Reg[T])( N ){ i =>
      val tileA,tileB = SRAM[T](tilesize)
      tileA load vecA(i::i+tilesize)
      tileB load vecB(i::i+tilesize)
      Reduce(Reg[T])(tilesize) { j =>
        tileA(j) * tileB(j)
      } { _ + _ }
    }

• Annotate data size N
• Calculate loop iterations
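For reference, the tiled inner product can be mirrored in plain Python. This is a minimal sketch, not Spatial itself. The names `tiled_inner_product` and `loop_iterations` are hypothetical, the outer loop is assumed to step by `tilesize`, and `N` is assumed to be a multiple of `tilesize`.

```python
def tiled_inner_product(vec_a, vec_b, tilesize):
    """Reference model of the tiled inner product (plain Python, not Spatial)."""
    n = len(vec_a)
    assert n == len(vec_b) and n % tilesize == 0
    total = 0  # plays the role of the outer Reduce register
    for i in range(0, n, tilesize):
        # "Load" one tile of each vector, as the SRAM loads do on chip.
        tile_a = vec_a[i:i + tilesize]
        tile_b = vec_b[i:i + tilesize]
        partial = 0  # plays the role of the inner Reduce register
        for j in range(tilesize):
            partial += tile_a[j] * tile_b[j]
        total += partial
    return total


def loop_iterations(n, tilesize):
    """Trip counts implied by annotating the data size N.

    Returns (outer iterations, total inner iterations); assumes n % tilesize == 0.
    """
    outer = n // tilesize
    return outer, outer * tilesize
```

With the data size known, the trip counts of both loops are fixed before the program runs, which is what lets a compiler reason about how often each part of the program executes.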
Compilation flow stages: Spatial → Accelerator Compiler → Mapping → Characterization → Simulation
Accelerator Compiler
• Allocate compute and memory Virtual Blocks (VBs)
• Infer activation counts for logical links
[Figure: the tiled inner product program is compiled into a graph of virtual blocks (A, B, C, D) connected by logical links.]
Mapping
• Partition VB graph to meet hardware constraints
• Place and route the VB graph onto the network
• Allocate VCs for the dynamic network
[Figure: partitioning turns the VB graph (A, B, C, D) into (A-1, A-2, B, C, D).]
[Figure: the partitioned VB graph (A-1, A-2, B, C, D) is placed and routed onto a network of physical blocks (PB), routers (R), and switches (S).]
Placement and Routing
• Start with random placement
• Route all links, in order of activation count
  • Build most efficient broadcast tree
  • Guarantee static network placement, if possible
  • Else, map the link to the dynamic network
• Re-place VBs with the highest routing cost
  • Dynamic network congestion
  • Average route length
  • Maximum route length
• Repeat routing
Summary
• Iteratively reduce routing cost
• Map bandwidth-critical links onto the static network
[Figure: example VB graph (A, B, C, D) and its placement before and after re-placing VBs.]
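The loop above can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: X-then-Y routing on a grid, a per-edge static capacity `static_cap`, the cost formula, and the swap-based re-placement are all assumptions, and every function name is hypothetical. Broadcast trees are not modeled.

```python
import random


def xy_path(a, b):
    """Grid edges on an X-then-Y route from slot a to slot b (assumed routing rule)."""
    (x, y), (bx, by) = a, b
    edges = []
    while x != bx:
        nx = x + (1 if bx > x else -1)
        edges.append(((x, y), (nx, y)))
        x = nx
    while y != by:
        ny = y + (1 if by > y else -1)
        edges.append(((x, y), (x, ny)))
        y = ny
    return edges


def route(place, links, static_cap):
    """Route links in descending activation count; spill to dynamic when static is full."""
    used = {}  # static links already claimed per directed grid edge
    vb_cost = {vb: 0 for vb in place}
    for src, dst, act in sorted(links, key=lambda l: -l[2]):
        path = xy_path(place[src], place[dst])
        if all(used.get(e, 0) < static_cap for e in path):
            for e in path:
                used[e] = used.get(e, 0) + 1
            cost = len(path)              # static: pay for route length only
        else:
            cost = len(path) * (1 + act)  # dynamic: also pay for congestion (assumed)
        vb_cost[src] += cost
        vb_cost[dst] += cost
    return vb_cost


def place_and_route(vbs, links, width, height, static_cap=1, iters=200, seed=0):
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    place = dict(zip(vbs, rng.sample(slots, len(vbs))))  # random initial placement
    cost = route(place, links, static_cap)
    for _ in range(iters):
        worst = max(cost, key=cost.get)  # VB with the highest routing cost
        new_slot = rng.choice(slots)
        trial = dict(place)
        for vb, slot in place.items():   # swap with the occupant, if any
            if slot == new_slot:
                trial[vb] = place[worst]
        trial[worst] = new_slot
        trial_cost = route(trial, links, static_cap)
        if sum(trial_cost.values()) <= sum(cost.values()):
            place, cost = trial, trial_cost  # keep moves that do not raise total cost
    return place, sum(cost.values())
```

Because links are routed in descending activation count, the most active links claim static capacity first, and only the remainder falls back to the dynamic network.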
Area and Energy Characterization
• Synthesize switch and router RTL at 28 nm, 1 GHz
• Power simulation with PrimeTime
• Decompose power into:
  • Inactive (per-cycle)
  • Active (per-bit)
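One way to read this decomposition is as a linear model of power against activation. The sketch below assumes that model; `decompose_power` and its parameters are hypothetical names, and the numbers in the usage note are illustrative rather than taken from the slides.

```python
def decompose_power(p_idle_w, p_full_w, link_bits, freq_hz):
    """Split measured power into an inactive part and a per-bit active energy.

    Assumes power grows linearly between 0% activation (p_idle_w) and
    100% activation (p_full_w), with link_bits transferred per cycle at 100%.
    """
    inactive_w = p_idle_w
    bits_per_second = link_bits * freq_hz  # traffic at 100% activation
    active_j_per_bit = (p_full_w - p_idle_w) / bits_per_second
    return inactive_w, active_j_per_bit
```

For example, `decompose_power(0.05, 0.15, 512, 1e9)` charges 0.05 W on every cycle and spreads the remaining 0.10 W over 512 bits per cycle at 1 GHz.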
[Figure: power (W) versus activation (%) for the switch and the router, split into inactive and active power. The switch panel compares static and static+dynamic.]
Simulation
• Integrate simulator with DRAMSim and BookSim
• Track transmitted data in switches and routers
• Estimate per-app power with activity traces:

$$E_{\text{net}} = \sum_{\text{allocated}} \left( P_{\text{inactive}}\, T_{\text{sim}} + E_{\text{flit}} \cdot \#\text{flit} \right)$$
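The energy estimate can be sketched as a sum over allocated switches and routers. `NetElement` and its field names are hypothetical, and the sketch assumes the sum covers both the inactive and the per-flit term for each element.

```python
from dataclasses import dataclass


@dataclass
class NetElement:
    """One allocated switch or router (field names are illustrative)."""
    p_inactive_w: float  # inactive power, in watts
    e_flit_j: float      # energy per transmitted flit, in joules
    flits: int           # flits counted in the activity trace


def network_energy(elements, t_sim_s):
    """E_net: sum over allocated elements of P_inactive * T_sim + E_flit * #flit."""
    return sum(e.p_inactive_w * t_sim_s + e.e_flit_j * e.flits for e in elements)
```

Unallocated switches and routers are left out of `elements`, so they contribute nothing to the total.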
Area and Energy Characterization
[Figure: area (mm²) and energy characterization plots; x-axis ticks 1 to 5 and 2, 4, 8.]
Takeaway
• Switches take less energy to transmit data than routers
• Bandwidth scales more efficiently on the static network
Benchmarks
Category         Application
Linear Algebra   Dot Product, Outer Product, Black Scholes, GEMM
Database         TPC-H Query 6
Clustering       k-Means Clustering
Inference        Lattice Regression, LSTM (RNN), GRU (RNN), LeNet (CNN)
Training         Gaussian Discriminant Analysis, Logistic Regression, Stochastic Gradient Descent
Benchmark Resource Usage
Evaluated Design Space
• Different network configurations
  • Static: flow control, bandwidth
  • Dynamic: VC count, flit width
  • Hybrid
• Different applications
• Different architectures
  • Pipelined (high throughput)
  • Scheduled (low throughput)
Evaluated Metrics
• Performance (Perf)
• Area efficiency (1/Area)
• Performance per area (Perf/Area)
• Power efficiency (1/Power)
• Energy efficiency (Perf/Watt)
Reported values are the geomean across all applications, normalized to the worst network configuration.
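The reporting rule (geomean across applications, normalized to the worst configuration) can be sketched as follows. The function names are hypothetical, and "worst" is assumed to mean the configuration with the lowest geomean of a higher-is-better metric.

```python
import math


def geomean(values):
    """Geometric mean of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_geomean(metric_by_config):
    """Map {config: [per-application metric]} to {config: geomean / worst geomean}.

    Assumes higher is better, so the worst configuration scores exactly 1.0.
    """
    gm = {cfg: geomean(vals) for cfg, vals in metric_by_config.items()}
    worst = min(gm.values())
    return {cfg: g / worst for cfg, g in gm.items()}
```

Against a fixed baseline configuration, the geomean of per-application ratios equals the ratio of geomeans, so normalizing before or after averaging gives the same number.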
Breakdown   Compute   On-Chip Memory   Network   DRAM
Area        51.0%     32.0%            17.0%     n/a
Power       26.0%     34.8%            15.6%     23.6%
Hybrid Network VCs and Flit Width
[Figure: radar charts of Perf, Perf/Area, and Perf/Watt for VC count (vc2, vc4; axis labels 1.1, 1.3, 1.1) and flit width (flit128, flit256, flit512; axis labels 1.0, 1.7, 1.0).]
Dynamic network flit width and VC count can be decreased with no performance loss.
Static vs. Dynamic vs. Hybrid
[Figure: normalized performance per application (DotProduct, OuterProduct, BlackScholes, TPCHQ6, Lattice, GDA, GEMM, Kmeans, LogReg, SGD, LSTM, GRU, LeNet) for the Dynamic, Hybrid (2.25x), and Static (3x) networks.]
The dynamic network performs poorly on compute-bound applications due to insufficient bandwidth.
Most Efficient Network Configurations
[Figure: normalized data movement per application (DotProduct, OuterProduct, BlackScholes, TPCHQ6, Lattice, GDA, GEMM, Kmeans, LogReg, SGD, LSTM, GRU, LeNet) for the Hybrid (2.25x) and Static (3x) networks.]
The hybrid network reduces data movement by using a dynamic network as an escape path.
Most Efficient Network Configurations: Pipelined Architecture
[Figure: radar chart of Perf, Perf/Area, and Perf/Watt for the Hybrid and Static networks (axis labels 7.0, 6.9, 2.3).]
• A hybrid network improves energy efficiency by 1.8x with performance similar to a static network.
• Performance varies up to 7x between the best and worst network configurations.
Conclusion
• Network performance correlates strongly with bandwidth for spatial accelerators
• Bandwidth scales more efficiently on a static network
• A hybrid (large static, small dynamic) network:
  • Eliminates place and route failure
  • Improves perf/watt
Thank You!
Static Network: Flow Control
[Figure: source (Src) to destination (Dst) path under end-to-end flow control and per-hop flow control; signals labeled "Back Pressure" and "Ack".]
Static Network: Bandwidth
[Figure: array of physical blocks (PB) connected through switches (S), with multiple links between neighboring switches.]
We vary the number of links between switches.
Dynamic Network
[Figure: router, annotated with flit width.]
We vary the number of Virtual Channels (VCs) and flit width.
Static Network Bandwidth
[Figure: radar chart of Perf, 1/Area, and 1/Power for static network bandwidths x1, x2, and x3 (axis labels 2.0, 3.4, 4.3), next to a network diagram annotated "3x static network bandwidth".]
Bandwidth strongly impacts accelerator performance.
Static Network Flow Control: Credit-Based vs. Per-Hop
[Figure: radar chart of Perf, 1/Area, and 1/Power for credit and per-hop flow control (axis labels 3.3, 1.2, 2.1), with a Src-to-Dst diagram labeled End-to-end Flow Control, Per-hop Flow Control, Back Pressure, and Ack.]
Credit-based flow control has 3x lower performance.
Accelerator Model
• Pool of compute and memory resources
• Compute:
  • SIMD pipeline, or
  • Vector processor with a small instruction window
[Figure: compute physical block in pipelined and scheduled variants (labels: input buffers, SIMD lanes, stages, function unit); memory physical block (scratchpad bank); array of compute, memory, and DRAM PBs.]
Statically Routed Dynamic Network
• Streaming protocol requires in-order transmission
  • Can’t use adaptive or oblivious routing
  • Can’t drop packets
• Routes are looked up in a table at runtime
• Route to multiple outputs for efficient broadcast links
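Table-driven routing with multiple outputs can be sketched as below. The table layout, the `"eject"` port name, the flow ID, and the three-router line topology are all assumptions made for illustration. The sketch also assumes the routes of a flow form a tree with no cycles.

```python
def forward(route_table, router, flow_id):
    """Output ports a router uses for a flow; more than one port means broadcast."""
    return route_table[router][flow_id]


def deliver(route_table, neighbors, src_router, flow_id):
    """Follow the fixed routes of one flow and return the routers that eject it."""
    reached, frontier = set(), [src_router]
    while frontier:
        router = frontier.pop()
        for port in forward(route_table, router, flow_id):
            if port == "eject":
                reached.add(router)  # hand the flit to the local physical block
            else:
                frontier.append(neighbors[router][port])
    return reached


# Hypothetical line of three routers: r0 - r1 - r2.
neighbors = {"r0": {"east": "r1"}, "r1": {"east": "r2"}, "r2": {}}
# Flow 7 enters at r0 and is broadcast to the blocks at r1 and r2.
route_table = {
    "r0": {7: ["east"]},
    "r1": {7: ["eject", "east"]},
    "r2": {7: ["eject"]},
}
```

Here `deliver(route_table, neighbors, "r0", 7)` returns `{"r1", "r2"}`. Since the table fixes one path per flow, every flit of that flow takes the same route, which keeps delivery in order.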
Performance Scaling
[Figure: normalized performance versus number of PBs (32, 64, 128) for BlackScholes, TPCHQ6, GEMM, and SGD on the pipelined and scheduled architectures; series D-x0, S-x1, S-x2, S-x3, H-x1, H-x2, H-x3.]
Key Design Challenges
• Off-chip memory bandwidth
• Compute throughput
• On-chip memory bandwidth
• On-chip network bandwidth