Scalable Interconnects for Reconfigurable Spatial Architectures

Yaqi Zhang, Alexander Rucker, Matthew Vilim, Raghu Prabhakar, William Hwang, Kunle Olukotun

Electrical Engineering, Stanford University

ISCA ’19: The 46th International Symposium on Computer Architecture, Phoenix, AZ

ISCA 2019 Scalable Interconnects for Reconfigurable Spatial Architectures 1/27

Spatial Accelerators

• Energy efficient
• High-throughput
• Low-latency

Examples:
• Plasticine (ISCA ’17)
• Compressed-sparse CNN accelerator (ISCA ’17)
• Stream-dataflow accelerator (ISCA ’17)

Accelerator Characteristics

• High compute density
• High on-chip memory bandwidth
• Distributed compute and memory resources
• Streaming interface between compute and memory
• Statically mapped and scheduled compute graph

[Figure: array of compute (C) and on-chip scratchpad (M) blocks with off-chip DRAM (D) interfaces. Legend: C = Compute, M = On-chip Scratchpad, D = Off-chip DRAM.]

Key Challenges

On-chip networks play a critical role in:

• Energy efficiency (↓ data movement)
• Flexibility
• Scalability
• Compute utilization

Communication Patterns

Architecture          Communication Frequency   Communication Granularity   Limited by
Processor             Infrequent                Packet                      Latency
Spatial Accelerator   Frequent                  Fine-grained                Throughput

[Figure: Multi-Processor: compute units (C) and memories (M) on a shared memory bus, exploiting parallelism.]
[Figure: Spatial Accelerator: rows of compute (C) and memory (M) blocks, exploiting parallelism and pipelining.]
[Figure: Compute, Network, On-chip Memory, Network, Compute.]

Outline

Motivation

Network Design Space

Compilation Flow

Evaluation

Static Network

[Figure: grid of Physical Blocks (PB) connected by Switches (S).]

Pros: Guaranteed bandwidth
Cons: Low link utilization, place-and-route (P&R) failures

Dynamic Network

[Figure: grid of Physical Blocks (PB) connected by Routers (R).]

Pros: Link sharing
Cons: Limited bandwidth, deadlock

Hybrid Network: Static and Dynamic

[Figure: grid of Physical Blocks (PB), each connected to both a Router (R) and a Switch (S).]

Pros: Link sharing, more bandwidth, guaranteed P&R
Cons: More area, more static power

Spatial: A DSL for Reconfigurable Accelerators

    // Tiled inner product
    val vecA, vecB = DRAM[T](N)
    Reduce(Reg[T])(N by tilesize) { i =>     // one outer iteration per tile
      val tileA, tileB = SRAM[T](tilesize)
      tileA load vecA(i::i+tilesize)
      tileB load vecB(i::i+tilesize)
      Reduce(Reg[T])(tilesize) { j =>
        tileA(j) * tileB(j)
      } { _ + _ }                            // sum the products within a tile
    } { _ + _ }                              // sum the per-tile partial results

• Annotate data size N
• Calculate loop iterations
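To make the loop structure and the iteration counts concrete, here is a plain-Python reference for the tiled inner product. It is only a sketch: it is not Spatial code, the function name is made up for illustration, and it assumes the outer loop advances one tile at a time.

```python
def tiled_inner_product(vec_a, vec_b, tilesize):
    """Plain-Python reference for a tiled inner product (illustrative only)."""
    assert len(vec_a) == len(vec_b)
    n = len(vec_a)
    total = 0
    outer_iters = 0
    inner_iters = 0
    for i in range(0, n, tilesize):        # outer reduction: one tile per iteration
        tile_a = vec_a[i:i + tilesize]     # models "tileA load vecA(i::i+tilesize)"
        tile_b = vec_b[i:i + tilesize]     # models "tileB load vecB(i::i+tilesize)"
        outer_iters += 1
        partial = 0
        for j in range(len(tile_a)):       # inner reduction over one tile
            partial += tile_a[j] * tile_b[j]
            inner_iters += 1
        total += partial                   # combine per-tile partial sums
    return total, outer_iters, inner_iters


result, outer_iters, inner_iters = tiled_inner_product(list(range(8)), [2] * 8, tilesize=4)
print(result, outer_iters, inner_iters)
```

With N known, both iteration counts are fixed before execution, which is what allows a compiler to calculate them statically.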

Compilation flow: Spatial ⇒ Accelerator Compiler ⇒ Mapping ⇒ Characterization ⇒ Simulation

Accelerator Compiler

• Allocate compute and memory Virtual Blocks (VBs)
• Infer activation counts for logical links

[Figure: the tiled inner product program ⇒ a graph of Virtual Blocks A, B, C, D connected by Logical Links.]
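One way to picture how activation counts could follow from loop iteration counts is sketched below. It assumes that a logical link carries one token per iteration of the loops enclosing it, which is an illustrative simplification rather than the compiler's documented rule. The link names are hypothetical and the example assumes N is a multiple of the tile size.

```python
from math import ceil, prod


def activation_count(trip_counts):
    """Times a link fires, assuming one token per iteration of its enclosing loops."""
    return prod(trip_counts)


def inner_product_links(n, tilesize):
    """Hypothetical links for a tiled inner product; assumes n % tilesize == 0."""
    tiles = ceil(n / tilesize)
    return {
        "tile_load_request": activation_count([tiles]),            # once per outer iteration
        "multiply_operand": activation_count([tiles, tilesize]),   # once per inner iteration
    }


print(inner_product_links(n=1024, tilesize=64))
```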

Mapping

• Partition VB graph to meet hardware constraints
• Place and route the VB graph onto the network
• Allocate virtual channels (VCs) for the dynamic network

[Figure: VB graph A, B, C, D ⇒ partitioned graph A-1, A-2, B, C, D ⇒ placement on the hybrid network of Physical Blocks (PB), Routers (R), and Switches (S).]

Placement and Routing

• Start with random placement
• Route all links, in order of activation count
    • Build most efficient broadcast tree
    • Guarantee static network placement, if possible
    • Else, map the link to the dynamic network
• Re-place VBs with the highest routing cost
    • Dynamic network congestion
    • Average route length
    • Maximum route length
• Repeat routing

Summary: Iteratively reduce routing cost. Map bandwidth-critical links onto the static network.

[Figure: VB graph A, B, C, D and its placement on a row of physical blocks, before (D, B, A, C) and after (C, D, B, A) re-placement.]
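The loop above can be sketched in a few dozen lines. This is a simplified illustration under several assumptions that do not come from the slides: links are routed busiest-first along a fixed X-then-Y path, each hop has `static_capacity` static links, a link that does not fit falls back to the dynamic network at a flat `dynamic_penalty`, and the cost is activation count times route length. Broadcast trees and congestion modeling are omitted, and all function and parameter names are hypothetical.

```python
import random


def xy_route(src, dst):
    """Directed hops of a dimension-ordered (X then Y) path between grid cells."""
    (x, y), hops = src, []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        hops.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        hops.append(((x, y), (x, ny)))
        y = ny
    return hops


def route_all(placement, links, static_capacity, dynamic_penalty):
    """Route links busiest-first; return (total cost, cost charged to each VB)."""
    used = {}
    vb_cost = {vb: 0 for vb in placement}
    total = 0
    for src, dst, count in sorted(links, key=lambda link: -link[2]):
        hops = xy_route(placement[src], placement[dst])
        fits = all(used.get(h, 0) < static_capacity for h in hops)
        if fits:  # reserve one static link on every hop
            for h in hops:
                used[h] = used.get(h, 0) + 1
        # otherwise the link falls back to the (penalized) dynamic network
        cost = count * len(hops) * (1 if fits else dynamic_penalty)
        total += cost
        vb_cost[src] += cost
        vb_cost[dst] += cost
    return total, vb_cost


def place_and_route(vbs, links, width, height, static_capacity=1,
                    dynamic_penalty=4, iters=200, seed=0):
    rng = random.Random(seed)
    cells = [(x, y) for x in range(width) for y in range(height)]
    assert len(vbs) <= len(cells), "grid too small for the VB graph"
    slots = list(vbs) + [None] * (len(cells) - len(vbs))
    rng.shuffle(slots)  # random initial placement; None marks an empty cell

    def evaluate(candidate):
        placement = {vb: cells[k] for k, vb in enumerate(candidate) if vb is not None}
        return (placement, *route_all(placement, links, static_capacity, dynamic_penalty))

    placement, best, vb_cost = evaluate(slots)
    for _ in range(iters):
        worst = max(vb_cost, key=vb_cost.get)      # VB with the highest routing cost
        a, b = slots.index(worst), rng.randrange(len(slots))
        trial = list(slots)
        trial[a], trial[b] = trial[b], trial[a]    # re-place it (swap or move)
        trial_placement, cost, trial_cost = evaluate(trial)
        if cost < best:                            # keep only improving moves
            slots, placement, best, vb_cost = trial, trial_placement, cost, trial_cost
    return placement, best


vbs = ["A", "B", "C", "D"]
links = [("A", "B", 100), ("A", "C", 100), ("B", "D", 10), ("C", "D", 10)]
print(place_and_route(vbs, links, width=2, height=2))
```

Accepting only improving moves keeps the sketch short; a production placer would more likely use simulated annealing or a similar scheme to escape local minima.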

Area and Energy Characterization

• Synthesize switch and router RTL at 28 nm, 1 GHz
• Power simulation with PrimeTime
• Decompose power into:
    • Inactive (per-cycle)
    • Active (per-bit)

[Figure: Power (W) versus Activation (%) for the Switch (static and static+dynamic variants) and for the Router, each split into Inactive and Active components.]

Simulation

• Integrate simulator with DRAMSim and BookSim
• Track transmitted data in switches and routers
• Estimate per-app power with activity traces:

$$E_{\text{net}} = \sum_{\text{allocated}} \left( P_{\text{inactive}} \, T_{\text{sim}} + E_{\text{flit}} \cdot \#\text{flit} \right)$$
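The energy estimate is a sum over allocated switches and routers of an inactive term and an active term. A minimal sketch follows, assuming the sum covers both terms for each component. It also assumes inactive power is in watts, simulated time in seconds, and per-flit energy in joules; the function name and the numbers in the example are invented purely for illustration.

```python
def network_energy(components, t_sim):
    """Sum inactive and active energy over allocated switches and routers.

    components: iterable of (p_inactive_watts, e_flit_joules, flit_count)
    t_sim: simulated run time in seconds
    """
    return sum(p_inactive * t_sim + e_flit * flits
               for p_inactive, e_flit, flits in components)


# Made-up numbers: one switch and one router over a 1 ms run.
allocated = [
    (0.05, 2e-12, 1_000_000),
    (0.01, 5e-12, 200_000),
]
print(network_energy(allocated, t_sim=1e-3))
```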

Area and Energy Characterization

[Figure: area (mm²) and energy plots for switch and router configurations.]

Takeaway: Switches take less energy to transmit data than routers. Bandwidth scales more efficiently on the static network.

Benchmarks

Category         Application
Linear Algebra   Dot Product, Outer Product, Black Scholes, GEMM
Database         TPC-H Query 6
Clustering       k-Means Clustering
Inference        Lattice Regression, LSTM (RNN), GRU (RNN), LeNet (CNN)
Training         Gaussian Discriminant Analysis, Logistic Regression, Stochastic Gradient Descent

Benchmark Resource Usage

[Figure: resource usage of each benchmark.]

Evaluated Design Space

• Different network configurations
    • Static: flow control, bandwidth
    • Dynamic: VC count, flit width
    • Hybrid
• Different applications
• Different architectures
    • Pipelined (high throughput)
    • Scheduled (low throughput)

Evaluated Metrics

• Performance (Perf)
• Area efficiency (1/Area)
• Performance per area (Perf/Area)
• Power efficiency (1/Power)
• Energy efficiency (Perf/Watt)

Reported values are the geomean across all applications, normalized to the worst network configuration.

Area breakdown: Compute 51.0%, On-chip Memory 32.0%, Network 17.0%
Power breakdown: Compute 26.0%, On-chip Memory 34.8%, Network 15.6%, DRAM 23.6%
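The reporting convention can be sketched as follows. This assumes a single worst configuration, chosen by its geomean, serves as the baseline for every other configuration. The slides do not spell out this detail, and the configuration names and numbers are invented for illustration.

```python
from math import prod


def geomean(values):
    """Geometric mean of positive values."""
    return prod(values) ** (1.0 / len(values))


def normalized_geomeans(metric):
    """metric: {config: [per-application value]}, where higher is better."""
    means = {cfg: geomean(vals) for cfg, vals in metric.items()}
    worst = min(means.values())              # baseline: the worst configuration
    return {cfg: m / worst for cfg, m in means.items()}


# Made-up performance numbers for two applications.
print(normalized_geomeans({"dynamic": [1.0, 4.0], "static": [4.0, 16.0]}))
```

The geometric mean is the usual choice for normalized ratios because the ranking it produces does not depend on which configuration is used as the baseline.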

Hybrid Network VCs and Flit Width

[Figure: two radar plots over Perf, Perf / Area, and Perf / Watt. One compares vc4 and vc2 (axis labels 1.1, 1.3, 1.1); the other compares flit128, flit256, and flit512 (axis labels 1.0, 1.7, 1.0).]

Dynamic network flit width and VC count can be decreased with no performance loss.

Static vs. Dynamic vs. Hybrid

[Figure: Normalized Performance (0.0 to 1.0) per application (DotProduct, OuterProduct, BlackScholes, TPCHQ6, Lattice, GDA, GEMM, Kmeans, LogReg, SGD, LSTM, GRU, LeNet) for Dynamic, Hybrid (2.25x), and Static (3x) networks.]

The dynamic network performs poorly on compute-bound applications due to insufficient bandwidth.

Most Efficient Network Configurations

[Figure: Normalized Data Movement (0.0 to 1.0) per application (DotProduct, OuterProduct, BlackScholes, TPCHQ6, Lattice, GDA, GEMM, Kmeans, LogReg, SGD, LSTM, GRU, LeNet) for Hybrid (2.25x) and Static (3x) networks.]

The hybrid network reduces data movement by using a dynamic network as an escape path.

Most Efficient Network Configurations: Pipelined Architecture

[Figure: radar plot over Perf, Perf / Area, and Perf / Watt comparing Hybrid and Static (axis labels 7.0, 6.9, 2.3).]

A hybrid network improves energy efficiency by 1.8x with performance similar to a static network.

Performance varies up to 7x between the best and worst network configurations.

Conclusion

• Network performance correlates strongly with bandwidth for spatial accelerators
• Bandwidth scales more efficiently on a static network
• A hybrid (large static, small dynamic) network:
    • Eliminates place and route failure
    • Improves perf/watt

Thank You!

Static Network: Flow Control

[Figure: a route from Src to Dst under End-to-end Flow Control and under Per-hop Flow Control. Labels: Back Pressure, Ack.]

Static Network: Bandwidth

[Figure: grid of Physical Blocks (PB) connected by Switches (S).]

We vary the number of links between switches.

Dynamic Network

[Figure: a Router with its links annotated with the flit width.]

We vary the number of Virtual Channels (VCs) and flit width.

Static Network Bandwidth

[Figure: radar plot over Perf, 1 / Area, and 1 / Power comparing static bandwidths x1, x2, and x3 (axis labels 2.0, 3.4, 4.3), next to a hybrid network grid labeled "3x static network bandwidth".]

Bandwidth strongly impacts accelerator performance.

Static Network Flow Control: Credit-Based vs. Per-Hop

[Figure: radar plot over Perf, 1 / Area, and 1 / Power comparing credit and per-hop flow control (axis labels 3.3, 1.2, 2.1), next to a Src-to-Dst diagram of End-to-end Flow Control and Per-hop Flow Control. Labels: Back Pressure, Ack.]

Credit-based flow control has 3x lower performance.
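A generic model helps explain why end-to-end credits can limit throughput: a sender holding C credits on a path with round-trip latency L cycles can sustain at most C / L tokens per cycle. The sketch below simulates that textbook behavior. It is not the authors' simulator, it does not reproduce the 3x figure, and the function and parameter names are hypothetical.

```python
from collections import deque


def credit_throughput(credits, one_way_latency, cycles=10_000):
    """Tokens delivered per cycle with end-to-end credits over a fixed-latency path."""
    in_flight = deque()   # arrival cycles of tokens travelling to the destination
    returning = deque()   # arrival cycles of credits travelling back to the source
    delivered = 0
    for now in range(cycles):
        while returning and returning[0] <= now:
            returning.popleft()
            credits += 1                          # a credit arrives back at the source
        while in_flight and in_flight[0] <= now:
            in_flight.popleft()
            delivered += 1                        # token reaches the destination
            returning.append(now + one_way_latency)
        if credits > 0:                           # send at most one token per cycle
            credits -= 1
            in_flight.append(now + one_way_latency)
    return delivered / cycles


print(credit_throughput(credits=2, one_way_latency=4))   # limited by credits
print(credit_throughput(credits=8, one_way_latency=4))   # credits cover the round trip
```

In this model the sender needs enough credits to cover the full round trip before it can send a token every cycle, so long routes require deep buffering at the destination.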

Accelerator Model

• Pool of compute and memory resources
• Compute:
    • SIMD pipeline, or
    • Vector processor with a small instruction window

[Figure: Compute Physical Block in Pipelined and Scheduled variants (labels: Input Buffers, SIMD Lanes, Stages, Function Unit); Memory Physical Block (label: Scratchpad Bank); array of Compute PBs, Memory PBs, and DRAM PBs.]

Statically Routed Dynamic Network

• Streaming protocol requires in-order transmission
    • Can’t use adaptive or oblivious routing
    • Can’t drop packets
• Routes are looked up in a table at runtime
    • Route to multiple outputs for efficient broadcast links
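Table-driven routing with multiple outputs per entry can be sketched as below. The table layout, the `"forward"` and `"eject"` entry kinds, and the router and PB names are all hypothetical. The point is only that every packet of a flow follows the same compile-time path, so order is preserved, and one table entry can fan out to several outputs for a broadcast.

```python
def deliver(tables, start_router, flow_id):
    """Follow per-router tables; return the PBs a flow reaches, in visit order."""
    reached, frontier = [], [start_router]
    while frontier:
        router = frontier.pop(0)
        for kind, target in tables[router][flow_id]:
            if kind == "eject":        # hand the packet to a local physical block
                reached.append(target)
            else:                      # "forward": send a copy to a neighboring router
                frontier.append(target)
    return reached


# Hypothetical tables, filled in at compile time: flow 7 is broadcast from R0.
tables = {
    "R0": {7: [("forward", "R1"), ("forward", "R2")]},
    "R1": {7: [("eject", "PB1")]},
    "R2": {7: [("eject", "PB2")]},
}
print(deliver(tables, "R0", 7))
```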

Performance Scaling

[Figure: normalized performance versus number of PBs (32, 64, 128) for BlackScholes, TPCHQ6, GEMM, and SGD on the pipelined and scheduled architectures. Series: D-x0, S-x1, S-x2, S-x3, H-x1, H-x2, H-x3.]

Key Design Challenges

• Off-chip memory bandwidth
• Compute throughput
• On-chip memory bandwidth
• On-chip network bandwidth