Scalable Interconnects for Reconfigurable Spatial Architectures

Yaqi Zhang, Alexander Rucker, Matthew Vilim, Raghu Prabhakar, William Hwang, Kunle Olukotun

Electrical Engineering, Stanford University

ISCA ’19: The 46th International Symposium on Computer Architecture, Phoenix, AZ

Spatial Accelerators

• Energy efficient
• High-throughput
• Low-latency

Examples:
• Plasticine (ISCA ’17)
• Compressed-sparse CNN accelerator (ISCA ’17)
• Stream-dataflow accelerator (ISCA ’17)

Accelerator Characteristics

• High compute density
• High on-chip memory bandwidth
• Distributed compute and memory resources
• Streaming interface between compute and memory
• Statically mapped and scheduled compute graph

[Figure: array of blocks. Legend: C = Compute, M = On-chip Scratchpad, D = Off-chip DRAM]

Key Challenges

On-chip networks play a critical role in:

• Energy efficiency (↓ data movement)
• Flexibility
• Scalability
• Compute utilization

Communication Patterns

Architecture          Communication Frequency   Communication Granularity   Limited by
Processor             Infrequent                Packet                      Latency
Spatial Accelerator   Frequent                  Fine-grained                Throughput

[Figure: Multi-Processor: compute (C) and memory (M) blocks connected by a memory bus; labeled Parallelism]
[Figure: Spatial Accelerator: interleaved rows of compute (C) and memory (M) blocks; labeled Parallelism and Pipelining]
[Figure: chain of Compute, Network, On-chip Memory, Network, Compute]

Outline

Motivation

Network Design Space

Compilation Flow

Evaluation

Static Network

[Figure: grid of Physical Blocks (PB) connected through Switches (S)]

Pros                   Cons
Guaranteed bandwidth   Low link utilization
                       P&R failures

Dynamic Network

[Figure: grid of Physical Blocks (PB) connected through Routers (R)]

Pros           Cons
Link sharing   Limited bandwidth
               Deadlock

Hybrid Network: Static and Dynamic

[Figure: grid of Physical Blocks (PB), each attached to both a Router (R) and a Switch (S)]

Pros              Cons
Link sharing      More area
More bandwidth    More static power
Guaranteed P&R


Spatial: A DSL for Reconfigurable Accelerators

    // Tiled inner product
    val vecA,vecB = DRAM[T](N)
    Reduce(Reg[T])(N){ i =>
      val tileA,tileB = SRAM[T](tilesize)
      tileA load vecA(i::i+tilesize)
      tileB load vecB(i::i+tilesize)
      Reduce(Reg[T])(tilesize) { j =>
        tileA(j) * tileB(j)
      } { _ + _ }
    }

• Annotate data size N
• Calculate loop iterations

[Progress bar: Spatial, Accelerator Compiler, Mapping, Characterization, Simulation]
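As a rough reference for what the tiled inner product computes, the following is a plain Python model, not Spatial code. It assumes the outer loop advances by one tile per iteration (the slide writes the outer range simply as N), and the function name and return tuple are hypothetical. It also counts loop iterations, which is the quantity the slide says is calculated from the annotated data size.

```python
def tiled_inner_product(vec_a, vec_b, tile_size):
    """Software model of the tiled inner product.

    Returns (result, outer_iterations, inner_iterations).
    Assumption: the outer loop steps by tile_size.
    """
    assert len(vec_a) == len(vec_b)
    n = len(vec_a)
    total = 0
    outer_iters = 0
    inner_iters = 0
    for i in range(0, n, tile_size):        # outer Reduce: one tile per iteration
        tile_a = vec_a[i:i + tile_size]     # models: tileA load vecA(i::i+tilesize)
        tile_b = vec_b[i:i + tile_size]     # models: tileB load vecB(i::i+tilesize)
        partial = 0
        for j in range(len(tile_a)):        # inner Reduce over one tile
            partial += tile_a[j] * tile_b[j]
            inner_iters += 1
        total += partial                    # outer reduction of tile results
        outer_iters += 1
    return total, outer_iters, inner_iters
```

For two 4-element vectors and a tile size of 2, the model runs 2 outer iterations and 4 inner iterations in total.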

Accelerator Compiler

• Allocate compute and memory Virtual Blocks (VBs)
• Infer activation counts for logical links

[Figure: the tiled inner product program is compiled into a graph of Virtual Blocks (A, B, C, D) connected by Logical Links]
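One plausible way to infer activation counts, sketched in Python: assume a logical link fires once per iteration of every loop that encloses it, so its count is the product of those trip counts. The rule, the link names, and the values N = 1024 and tilesize = 16 are illustrative assumptions, not taken from the talk.

```python
from math import prod

def infer_activation_counts(link_loops, trip_counts):
    """Map each logical link to the product of its enclosing loops' trip counts.

    link_loops:  {link name: [names of enclosing loops]}  (hypothetical input format)
    trip_counts: {loop name: number of iterations}
    """
    return {
        link: prod(trip_counts[loop] for loop in loops)
        for link, loops in link_loops.items()
    }

# Hypothetical example: N = 1024, tilesize = 16
trip_counts = {"outer": 1024 // 16, "inner": 16}
link_loops = {
    "A->B": ["outer"],           # fires once per tile
    "B->C": ["outer", "inner"],  # fires once per element
}
counts = infer_activation_counts(link_loops, trip_counts)
```

Links with higher counts would be the bandwidth-critical ones that the later mapping step prioritizes.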

Mapping

• Partition VB graph to meet hardware constraints
• Place and route the VB graph onto the network
• Allocate virtual channels (VCs) for the dynamic network

[Figure: the VB graph (A, B, C, D) is partitioned into (A-1, A-2, B, C, D), then placed and routed onto a hybrid network of Physical Blocks (PB), Routers (R), and Switches (S)]

Placement and Routing

• Start with random placement
• Route all links, in order of activation count
    • Build most efficient broadcast tree
    • Guarantee static network placement, if possible
    • Else, map the link to the dynamic network
• Re-place VBs with the highest routing cost
    • Dynamic network congestion
    • Average route length
    • Maximum route length
• Repeat routing

Summary

Iteratively reduce routing cost
Map bandwidth-critical links onto the static network

[Figure: VB graph (A, B, C, D) and its successive placements on the grid]
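The loop above can be sketched in Python as follows. This is a heavily simplified illustration, not the authors' tool: it assumes Manhattan distance as route length, a single global budget of static links (`static_capacity`, hypothetical), and activation count times route length as a stand-in for dynamic network congestion. Broadcast trees are omitted.

```python
import random

def route(placement, links, static_capacity):
    """Route links in order of activation count; return (assignment, per-VB cost)."""
    assignment = {}
    cost = {vb: 0 for vb in placement}
    remaining = static_capacity
    for src, dst, count in sorted(links, key=lambda l: -l[2]):
        (x0, y0), (x1, y1) = placement[src], placement[dst]
        length = abs(x0 - x1) + abs(y0 - y1)   # Manhattan route length
        if remaining >= length:                # static network still has room
            assignment[(src, dst)] = "static"
            remaining -= length
            link_cost = length
        else:                                  # else, map to the dynamic network
            assignment[(src, dst)] = "dynamic"
            link_cost = length * count         # crude congestion proxy
        cost[src] += link_cost
        cost[dst] += link_cost
    return assignment, cost

def place_and_route(vbs, links, width, height, static_capacity, iterations=20, seed=0):
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    placement = dict(zip(vbs, rng.sample(slots, len(vbs))))   # random start
    assignment, cost = route(placement, links, static_capacity)
    for _ in range(iterations):
        worst = max(cost, key=cost.get)        # VB with the highest routing cost
        best_total, best_placement = sum(cost.values()), None
        for slot in slots:                     # try moving or swapping it
            trial = dict(placement)
            occupant = next((v for v, s in placement.items() if s == slot), None)
            if occupant is not None:
                trial[occupant] = placement[worst]
            trial[worst] = slot
            _, trial_cost = route(trial, links, static_capacity)
            if sum(trial_cost.values()) < best_total:
                best_total, best_placement = sum(trial_cost.values()), trial
        if best_placement is None:             # no move reduces total cost
            break
        placement = best_placement
        assignment, cost = route(placement, links, static_capacity)   # repeat routing
    return placement, assignment, cost
```

Because links are routed in descending order of activation count, the bandwidth-critical links claim static capacity first and only the remainder spill onto the dynamic network.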

Area and Energy Characterization

• Synthesize switch and router RTL at 28 nm, 1 GHz
• Power simulation with PrimeTime
• Decompose power into:
    • Inactive (per-cycle)
    • Active (per-bit)

[Figure: Switch: power (W, 0.00 to 0.15) vs. activation (%), split into Inactive and Active; series: static, static+dynamic]
[Figure: Router: power (W, 0.00 to 0.04) vs. activation (%), split into Inactive and Active]

Simulation

• Integrate simulator with DRAMSim and BookSim
• Track transmitted data in switches and routers
• Estimate per-app power with activity traces:

$$E_{\text{net}} = \sum_{\text{allocated}} P_{\text{inactive}} \, T_{\text{sim}} + E_{\text{flit}} \, \#\text{flit}$$
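A minimal Python sketch of this energy estimate, assuming the sum runs over every allocated switch and router and that both terms sit inside the sum. The dictionary keys and all numbers are made up for illustration.

```python
def network_energy(elements, t_sim_s):
    """Estimate network energy in joules.

    elements: one dict per allocated switch or router with
        p_inactive_w : inactive power in watts
        e_flit_j     : energy per transmitted flit in joules
        flits        : number of flits transmitted during the run
    t_sim_s: simulated run time in seconds
    """
    return sum(
        e["p_inactive_w"] * t_sim_s + e["e_flit_j"] * e["flits"]
        for e in elements
    )

# Hypothetical activity trace for one switch and one router
trace = [
    {"p_inactive_w": 0.01, "e_flit_j": 1e-12, "flits": 1000},
    {"p_inactive_w": 0.02, "e_flit_j": 2e-12, "flits": 500},
]
energy_j = network_energy(trace, t_sim_s=1e-3)
```

Unallocated switches and routers contribute nothing under this reading, which is why the sum is restricted to allocated elements.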


Area and Energy Characterization

[Figure: area and energy plots; only the axis unit mm² and tick values survive extraction]

Takeaway

Switches take less energy to transmit data than routers
Bandwidth scales more efficiently on the static network

Benchmarks

Category         Application
Linear Algebra   Dot Product, Outer Product, Black Scholes, GEMM
Database         TPC-H Query 6
Clustering       k-Means Clustering
Inference        Lattice Regression, LSTM (RNN), GRU (RNN), LeNet (CNN)
Training         Gaussian Discriminant Analysis, Logistic Regression, Stochastic Gradient Descent

Benchmark Resource Usage

[Figure: benchmark resource usage; content lost in extraction]

Evaluated Design Space

• Different network configurations
    • Static: flow control, bandwidth
    • Dynamic: VC count, flit width
    • Hybrid
• Different applications
• Different architectures
    • Pipelined (high throughput)
    • Scheduled (low throughput)

Evaluated Metrics

• Performance (Perf)
• Area efficiency (1/Area)
• Performance per area (Perf/Area)
• Power efficiency (1/Power)
• Energy efficiency (Perf/Watt)

Reported values are the geomean across all applications, normalized to the worst network configuration.

Area breakdown:    Compute 51.0%, On-Chip Memory 32.0%, Network 17.0%
Power breakdown:   Compute 26.0%, On-Chip Memory 34.8%, Network 15.6%, DRAM 23.6%
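The reporting convention can be illustrated with a short Python sketch. It takes one reading of the slide: compute each configuration's geometric mean across applications, then divide by the smallest geomean. The configuration names and numbers below are invented.

```python
import math

def geomean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalized_geomeans(metric):
    """metric: {config: {application: value}}, higher is better.

    Returns each config's geomean divided by the worst config's geomean,
    so the worst configuration scores 1.0.
    """
    gm = {cfg: geomean(list(apps.values())) for cfg, apps in metric.items()}
    worst = min(gm.values())
    return {cfg: g / worst for cfg, g in gm.items()}

# Hypothetical per-application performance for two configurations
perf = {
    "dynamic": {"app1": 1.0, "app2": 4.0},
    "static": {"app1": 4.0, "app2": 16.0},
}
scores = normalized_geomeans(perf)
```

With these inputs the geomeans are 2 and 8, so the normalized scores are approximately 1.0 for the worst configuration and 4.0 for the other.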

Hybrid Network VCs and Flit Width

[Figure: radar chart over Perf, Perf / Area, and Perf / Watt comparing vc2 and vc4; labeled values 1.1, 1.3, 1.1]
[Figure: radar chart over Perf, Perf / Area, and Perf / Watt comparing flit128, flit256, and flit512; labeled values 1.0, 1.7, 1.0]

Dynamic network flit width and VC count can be decreased with no performance loss.

Static vs. Dynamic vs. Hybrid

[Figure: normalized performance (0.0 to 1.0) per benchmark (DotProduct, OuterProduct, BlackScholes, TPCHQ6, Lattice, GDA, GEMM, Kmeans, LogReg, SGD, LSTM, GRU, LeNet) for Dynamic, Hybrid (2.25x), and Static (3x) networks]

The dynamic network performs poorly on compute-bound applications due to insufficient bandwidth.

Most Efficient Network Configurations

[Figure: normalized data movement (0.0 to 1.0) per benchmark (DotProduct, OuterProduct, BlackScholes, TPCHQ6, Lattice, GDA, GEMM, Kmeans, LogReg, SGD, LSTM, GRU, LeNet) for Hybrid (2.25x) and Static (3x) networks]

The hybrid network reduces data movement by using a dynamic network as an escape path.

Most Efficient Network Configurations: Pipelined Architecture

[Figure: radar chart over Perf, Perf / Area, and Perf / Watt comparing Hybrid and Static; labeled values 7.0, 6.9, 2.3]

A hybrid network improves energy efficiency by 1.8x with performance similar to a static network.
Performance varies up to 7x between the best and worst network configurations.

Conclusion

• Network performance correlates strongly with bandwidth for spatial accelerators
• Bandwidth scales more efficiently on a static network
• A hybrid (large static, small dynamic) network:
    • Eliminates place and route failure
    • Improves perf/watt

Thank You!

Static Network: Flow Control

[Figure: path from source (Src) to destination (Dst) under End-to-end Flow Control and Per-hop Flow Control; arrows labeled Ack and Back Pressure]

Static Network: Bandwidth

[Figure: grid of Physical Blocks (PB) connected through Switches (S)]

We vary the number of links between switches.

Dynamic Network

[Figure: Router with links annotated by flit width]

We vary the number of Virtual Channels (VCs) and flit width.

Static Network Bandwidth

[Figure: radar chart over Perf, 1 / Area, and 1 / Power comparing x1, x2, and x3 static bandwidth; labeled values 2.0, 3.4, 4.3]
[Figure: hybrid network grid of Physical Blocks (PB), Routers (R), and Switches (S) with 3x static network bandwidth]

Bandwidth strongly impacts accelerator performance.

Static Network Flow Control: Credit-Based vs. Per-Hop

[Figure: radar chart over Perf, 1 / Area, and 1 / Power comparing credit and per-hop flow control; labeled values 3.3, 1.2, 2.1]
[Figure: path from source (Src) to destination (Dst) under End-to-end Flow Control and Per-hop Flow Control; arrows labeled Ack and Back Pressure]

Credit-based flow control has 3x lower performance.

Accelerator Model

• Pool of compute and memory resources
• Compute:
    • SIMD pipeline, or
    • Vector processor with a small instruction window

[Figure: Compute Physical Block in Pipelined form (input buffers feeding stages of SIMD lanes) and Scheduled form (SIMD lanes of function units); Memory Physical Block with scratchpad banks; grid of Compute PBs and Memory PBs bordered by DRAM PBs]

Statically Routed Dynamic Network

• Streaming protocol requires in-order transmission
    • Can’t use adaptive or oblivious routing
    • Can’t drop packets
• Routes are looked up in a table at runtime
• Route to multiple outputs for efficient broadcast links
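The table lookup with multi-output routing can be pictured with a small Python model. The class name, the use of a stream ID as the table key, and the port names are assumptions made for illustration; the slides do not specify the table format.

```python
from collections import deque

class TableRouter:
    """Toy router whose routes are fixed at compile time and looked up per flit."""

    def __init__(self, route_table):
        # route_table: {stream id: [output ports]} (hypothetical format)
        self.route_table = route_table
        self.outputs = {}   # output port -> FIFO of forwarded flits

    def forward(self, stream_id, flit):
        """Copy the flit to every output port listed for its stream."""
        ports = self.route_table[stream_id]
        for port in ports:
            self.outputs.setdefault(port, deque()).append(flit)
        return ports

# Stream 7 is a broadcast link with two destinations; stream 3 is point-to-point.
router = TableRouter({7: ["east", "south"], 3: ["north"]})
router.forward(7, "a")
router.forward(7, "b")
router.forward(3, "c")
```

Because each stream always takes the same table entry and each port is a FIFO, flits of one stream leave every port in the order they arrived, which matches the in-order requirement of the streaming protocol.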

Performance Scaling

[Figure: normalized performance vs. number of PBs (32, 64, 128) for BlackScholes, TPCHQ6, GEMM, and SGD; top row: pipelined architecture (0 to 40), bottom row: scheduled architecture (0 to 15); series: D-x0, S-x1, S-x2, S-x3, H-x1, H-x2, H-x3]

Key Design Challenges

• Off-chip memory bandwidth
• Compute throughput
• On-chip memory bandwidth
• On-chip network bandwidth