Transcript
  • Why do we need DNN accelerators?

    • Millions of parameters (i.e., weights)
    • Billions of computations
    • Heavy data movement

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology

    DNN Topology        Number of Weights
    AlexNet (2012)      3.98M
    VGGnet-16 (2014)    28.25M
    GoogleNet (2015)    6.77M
    Resnet-20 (2016)    0.27M
    Resnet-110 (2016)   1.7M

    Need high throughput

    Need to reduce energy

    This makes CPUs inefficient

    This makes GPUs inefficient

  • Spatial (or Dataflow) Accelerators

    • Millions of parameters (i.e., weights)
    • Billions of computations
    • Heavy data movement

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology

    Spread computations across hundreds of ALUs

    Reuse data within the array via local storage and direct communication

    Examples: MIT Eyeriss, Google TPU, Xilinx xDNN

    [Figure: a spatial accelerator, a 4x4 grid of ALUs with local Register/FIFO/SRAM storage, fed by a memory hierarchy and a control unit]

  • Two Design Questions

    • How do we map billions of computations over limited compute and memory resources?
    • How do we design an accelerator to efficiently map arbitrary layer types and dataflows?

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 7

  • Outline of Talk

    • How do we map billions of computations over limited compute and memory resources?
    • How do we design an accelerator to efficiently map arbitrary layer types and dataflows?

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 8

  • Motivation: Data Movement

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 9

    VGG16 conv3_2: 1.85 billion multiply-add ops; 590 K weights; 803 K inputs; 803 K outputs

    Re-use

    “Dataflow”

    Energy costs (relative):
    8-bit integer multiply                        1x
    Fetch two 8-bit operands from large SRAM     ~10x
    Fetch two 8-bit operands from DRAM           ~100x

    Fortunately …

    Slide Acknowledgment: Joel Emer, Angshuman Parashar, Michael Pellauer (NVIDIA)

    How to exploit reuse?
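
    To make the reuse argument concrete, here is a back-of-the-envelope sketch (not from the talk) that applies the relative energy costs above to the VGG16 conv3_2 numbers, under a hypothetical two-level hierarchy: either every operand is re-fetched from DRAM for each MAC (no reuse), or each unique operand is fetched from DRAM once and then served from a large SRAM (ideal reuse).

    /* Back-of-the-envelope energy estimate for VGG16 conv3_2, using the
     * relative costs from the slide: MAC = 1x, SRAM fetch ~10x, DRAM fetch ~100x.
     * Two illustrative scenarios only: no reuse vs. ideal reuse. */
    #include <stdio.h>

    int main(void) {
        const double MACS    = 1.85e9;   /* multiply-add operations  */
        const double WEIGHTS = 590e3;    /* unique weights           */
        const double INPUTS  = 803e3;    /* unique input activations */
        const double OUTPUTS = 803e3;    /* unique output activations */

        const double E_MAC = 1.0, E_SRAM = 10.0, E_DRAM = 100.0;

        /* No reuse: every MAC pulls a weight and an input from DRAM. */
        double no_reuse = MACS * (E_MAC + 2 * E_DRAM);

        /* Ideal reuse: each unique operand touches DRAM once; each MAC is
         * then served from SRAM (weight read, input read, partial-sum update). */
        double ideal = MACS * (E_MAC + 3 * E_SRAM)
                     + (WEIGHTS + INPUTS + OUTPUTS) * E_DRAM;

        printf("no reuse : %.2e energy units\n", no_reuse);
        printf("ideal    : %.2e energy units\n", ideal);
        printf("ratio    : %.1fx\n", no_reuse / ideal);
        return 0;
    }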

  • What is “Dataflow”

    • How to schedule the DNN computation (i.e., loop transformations: ordering, tiling, unrolling)
    • How to map computations across PEs (i.e., data staging within accelerators)
    • Goal of a good dataflow: Algorithmic Data Reuse → Hardware Reuse

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 10

    [Figure: a 7-dimensional network layer (weights K x C x R x S, inputs N x C x Y x X, outputs N x K x Y’ x X’) being mapped onto a 4x4 array of PEs]

  • Types of Algorithmic Data Reuse in DNNs

    Convolutional Reuse: CONV layers only (sliding window). Reused data: activations and filter weights.
    Fmap Reuse: CONV and FC layers. Reused data: activations (one input fmap shared across multiple filters).
    Filter Reuse: CONV and FC layers (batch size > 1). Reused data: filter weights (one filter shared across the input fmaps in a batch).

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 11
    Slide Acknowledgment: Yu-Hsin Chen, Vivienne Sze, Joel Emer (MIT)

  • Hardware structures to exploit reuse

    Temporal Reuse: memory hierarchy / staging buffers (DRAM → Buf → RF → ALU). E.g., custom memory hierarchies in accelerators.
    Spatial Reuse: multicast-supporting NoCs. E.g., hierarchical bus in Eyeriss (ISCA 2016), tree in MAERI (ASPLOS 2018).
    Spatio-Temporal Reuse: neighbor-to-neighbor (forwarding) connections. E.g., TPU (ISCA 2017), local network in Eyeriss (ISCA 2016).

    [Figure: each reuse style illustrated as data delivered to PEs (PE0..PE3) over space and time]

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 12

  • Dataflow 101 – 1D convolution

    Weights[S] * Inputs[X] = Outputs[X’], where X’ = X - S

    for (int x = 0; x < X’; x++)
      for (int s = 0; s < S; s++)
        Output[x] += Weight[s] * Input[x+s];

    [Figure: iteration space with the spatial dimension (#PEs) on one axis and the temporal dimension on the other; each point is a partial sum. Map space: “Output stationary” dataflow, mapped here onto a single PE (MultAdd + L1 buffers).]

    PE0 timeline:
    t0: W[0], I[0] → O[0]
    t1: W[1], I[1] → O[0]
    t2: W[2], I[2] → O[0]
    t3: W[0], I[1] → O[1]
    t4: W[1], I[2] → O[1]
    t5: W[2], I[3] → O[1]

    How often do we need to fetch a new weight?  Every cycle
    How often do we need to fetch a new input?   Every cycle
    How often do we start contributing to a new output?  Every S cycles

    Note: “Stationary” => intuition rather than precise specification

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 13

  • Dataflow 101 – 1D convolution

    Weights[S] * Inputs[X] = Outputs[X’], where X’ = X - S

    for (int s = 0; s < S; s++)
      for (int x = 0; x < X’; x++)
        Output[x] += Weight[s] * Input[x+s];

    [Figure: the same iteration space, now walked in “Weight stationary” order on a single PE (MultAdd + L1 buffers).]

    PE0 timeline:
    t0: W[0], I[0] → O[0]
    t1: W[0], I[1] → O[1]
    t2: W[0], I[2] → O[2]
    ...  W[0], I[X’-1] → O[X’-1]
    then: W[1], I[1] → O[0];  W[1], I[2] → O[1];  W[1], I[3] → O[2];  ...

    How often do we need to fetch a new weight?  Every X’ cycles (the weight stays stationary while the inner loop sweeps the outputs)
    How often do we need to fetch a new input?   Every cycle
    How often do we start contributing to a new output?  Every cycle

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 14
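
    The two loop orders above differ only in which operand stays put. As a sanity check (an illustrative sketch, not from the talk), the snippet below runs both orderings for small S and X’ and counts how often the weight, input, and output index changes between consecutive MACs; the counts reproduce the “every cycle” / “every S cycles” / “every X’ cycles” answers on the last two slides.

    /* Count, for the two 1D-convolution loop orders above, how often each
     * operand index changes between consecutive MACs (illustrative sketch). */
    #include <stdio.h>

    #define S  3
    #define XP 6          /* X' = number of outputs */

    static void count(int weight_stationary) {
        int prev_w = -1, prev_i = -1, prev_o = -1;
        int new_w = 0, new_i = 0, new_o = 0, macs = 0;
        int outer = weight_stationary ? S : XP;
        int inner = weight_stationary ? XP : S;

        for (int a = 0; a < outer; a++) {
            for (int b = 0; b < inner; b++) {
                int s  = weight_stationary ? a : b;   /* weight index */
                int x  = weight_stationary ? b : a;   /* output index */
                int in = x + s;                       /* input index  */
                new_w += (s  != prev_w);
                new_i += (in != prev_i);
                new_o += (x  != prev_o);
                prev_w = s; prev_i = in; prev_o = x;
                macs++;
            }
        }
        printf("%-17s: %d MACs, %d new weights, %d new inputs, %d new outputs\n",
               weight_stationary ? "weight-stationary" : "output-stationary",
               macs, new_w, new_i, new_o);
    }

    int main(void) {
        count(0);   /* output-stationary: x outer, s inner */
        count(1);   /* weight-stationary: s outer, x inner */
        return 0;
    }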

  • The “best” dataflow

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 15

    Weights[S] * Inputs[X] = Outputs[X’], where X’ = X - S

    Suppose the “best” dataflow is the one with maximum reuse:

    Common metric                          Weights   Inputs   Outputs / Partial Sums
    Minimum accesses to backing store         S        X          X’
    Max operand reuse within the PE          SX’      SX’         SX’

    How do we achieve this with a one-PE design?
    (Backing store (e.g., DRAM) → local (L1) buffers for weights, inputs, partial sums → PE)

    L1 buffer size for zero re-fetch:
    Dataflow             Weights   Inputs   Outputs
    Weight-stationary       1        X’       X’
    Input-stationary        S        1        S
    Output-stationary       S        S        1

    Buffer accesses (per buffered entry):
    Dataflow             Weights   Inputs   Outputs
    Weight-stationary      SX’       S        S
    Input-stationary       X’       SX’       X’
    Output-stationary      X’       X’       SX’

    Note: for every data class the product of the two table entries equals SX’, the total number of MACs.

    Choose one of these three based on the area/power budget of the buffers, and the latency/energy cost of each access.
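
    As a small consistency check (illustrative only, not from the talk), the two tables above can be written out for concrete S and X’ and the note verified programmatically: for every data class and dataflow, L1 buffer size times per-entry accesses equals the total MAC count S*X’.

    /* Reproduce the two tables above for concrete S and X' and check that
     * (L1 size for zero re-fetch) * (accesses per buffered entry) == S*X'. */
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        const int S = 3, Xp = 6;            /* example sizes */
        const char *df[3]  = {"weight-stationary", "input-stationary", "output-stationary"};
        /* rows: dataflow; columns: weights, inputs, outputs */
        int l1_size[3][3]  = {{1, Xp, Xp}, {S, 1, S}, {S, S, 1}};
        int accesses[3][3] = {{S * Xp, S, S}, {Xp, S * Xp, Xp}, {Xp, Xp, S * Xp}};

        for (int d = 0; d < 3; d++) {
            for (int c = 0; c < 3; c++)
                assert(l1_size[d][c] * accesses[d][c] == S * Xp);
            printf("%-18s L1 size {W,I,O} = {%d,%d,%d}\n",
                   df[d], l1_size[d][0], l1_size[d][1], l1_size[d][2]);
        }
        return 0;
    }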

  • Getting More Realistic

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 16

    7D computation space: R * S * X * Y * C * K * N
    Weights: K x C x R x S;  Inputs: N x C x Y x X;  Partial sums: N x K x Y’ x X’ (X’ = X - S, Y’ = Y - R)

    Accelerator resources: number of PEs, memory hierarchy (DRAM → L2 weight/input/output buffers → per-PE L1 weight/input/output buffers), interconnect bandwidth

    Transform + map the DNN onto the accelerator: millions of dataflows

  • Why does Dataflow matter?

    • Loop transformations (loop order and tile size)
      • Determine the interconnect bandwidth requirement
      • Determine the buffer size within PEs
    • Mapping (over space and time)
      • Opportunities for spatial, temporal, and spatio-temporal reuse
      • Energy of reads/writes/interconnect

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 17

    How do we explore all possible dataflows?

    {Performance, Energy} = f(Dimension Sizes, Hardware Resources, Dataflow)

  • MAESTRO: Analytical Cost/Benefit Model*

    Inputs: DNN layer sizes (K, C, Y, X, R, S), HW resources, and a mapping (dataflow)

    Data Reuse Analysis over an Abstract HW Model → Communication Analysis + Computation Analysis, producing:
    • Buffer Analysis: size requirement, access count (energy)
    • NoC Analysis: BW requirement, NoC activity count
    • Runtime Analysis: roofline throughput, expected runtime

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 18
    *H. Kwon et al., “An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators,” https://arxiv.org/abs/1805.02566

  • Input specification to MAESTRO

    1 | //Layer Description
    2 | Layer CONV VGG16_C1
    3 | K=64; C=3; R=3; S=3; Y=224; X=224
    4 | endLayer

    1 | //Hardware Resource Description
    2 | L1Size 64
    3 | L2Size 1024
    4 | NoCBW 64
    5 | Multcast True
    6 | NumPEs 256

    1 | //Mapping (Dataflow) Description
    ???

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 19

  • Example Dataflow

    “Weight stationary” dataflow over three PEs (iteration space → map space, spatial dimension = PEs, temporal dimension = time):

    Time   PE0                        PE1                        PE2
    t=0    W[0], I[0..2] → O[0..2]    W[1], I[1..3] → O[0..2]    W[2], I[2..4] → O[0..2]
    t=1    W[0], I[3..5] → O[3..5]    W[1], I[4..6] → O[3..5]    W[2], I[5..7] → O[3..5]

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 20

  • Data Mapping over Space and Time

    Spatial_Map(1,1) S
    Temporal_Map(3,3) X’
    *Map(Mapping size, Offset) Dim

    Weight is spatially mapped across PEs (i.e., parallelization): each PE holds one weight (PE0: W[0], PE1: W[1], PE2: W[2]).
    Output is temporally mapped at each PE: a tile of 3 outputs per time step (O[0..2] at t=0, then O[3..5] at t=1, ...), advancing by the offset each step.

    (Space-time table as on the previous slide.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 21
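
    To make the two directives concrete, here is a small illustrative sketch (not MAESTRO’s actual code) that expands Spatial_Map(1,1) S and Temporal_Map(3,3) X’ into per-PE, per-time-step index ranges, reproducing the table on slide 20; the I[x+s] → O[x] indexing comes from the 1D convolution on slide 13.

    /* Expand a data-centric mapping into per-PE, per-timestep tiles.
     * Spatial_Map(1,1) S   : each PE gets a size-1 tile of S, offset 1 apart.
     * Temporal_Map(3,3) X' : every PE walks size-3 tiles of X', offset 3 per step. */
    #include <stdio.h>

    int main(void) {
        const int S = 3, Xp = 6;
        const int sp_size = 1, sp_ofs = 1;     /* Spatial_Map(1,1) over S   */
        const int tp_size = 3, tp_ofs = 3;     /* Temporal_Map(3,3) over X' */
        const int num_pes   = S / sp_ofs;
        const int num_steps = Xp / tp_ofs;

        for (int t = 0; t < num_steps; t++) {
            for (int pe = 0; pe < num_pes; pe++) {
                int s_lo = pe * sp_ofs, s_hi = s_lo + sp_size - 1;   /* weight(s) on this PE   */
                int x_lo = t * tp_ofs,  x_hi = x_lo + tp_size - 1;   /* outputs this time step */
                /* inputs needed: I[x+s] for x in [x_lo..x_hi], s in [s_lo..s_hi] */
                printf("t=%d PE%d: W[%d..%d]  I[%d..%d]  O[%d..%d]\n",
                       t, pe, s_lo, s_hi, x_lo + s_lo, x_hi + s_hi, x_lo, x_hi);
            }
        }
        return 0;
    }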

  • Data Reuse over Space and Time

    Spatial_Map(1,1) S
    Temporal_Map(3,3) X’
    *Map(Mapping size, Offset) Dim

    Weight reuse opportunity: across time (i.e., stationary)
    Input reuse opportunity: across space (i.e., multicast)
    Output reuse opportunity: across space (i.e., multicast)

    (Space-time table as on slide 20.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 22

  • Data Movement Order

    Spatial_Map(1,1) S
    Temporal_Map(3,3) X’
    *Map(Mapping size, Offset) Dim

    (Space-time table as on slide 20, highlighting the order in which data moves to the PEs.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 23

  • Describing Dataflows in MAESTRO

    • Data Mapping: Temporal_Map, Spatial_Map, Cluster (PE grouping for hierarchies)
    • Data Movement

    Example: dataflows from recent accelerators in MAESTRO representation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 24

  • The Dataflow Playground

    Four example mappings over the same 1D-convolution iteration space (*Map(Mapping size, Offset) Dim):

    Spatial_Map(1,1) S;   Temporal_Map(3,3) X’   →  Weight stationary
    Temporal_Map(3,3) X’; Spatial_Map(1,1) S     →  Weight stationary
    Temporal_Map(3,3) S;  Spatial_Map(1,1) X’    →  Output stationary
    Spatial_Map(1,1) X’;  Temporal_Map(3,3) S    →  Output stationary

    (Whichever dimension is spatially mapped determines what stays stationary in each PE: S → weights, X’ → outputs.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 25

  • Dataflow → Hardware Implications

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 26

    [Figure: several 4-PE + L2-buffer organizations, illustrating how the chosen dataflow drives distribution and reduction traffic, and hence the L2 read bandwidth and L2 write bandwidth]
  • MAESTRO: Analytical Cost/Benefit Model (overview repeated; see slide 18)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 27
    *H. Kwon et al., “An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators,” https://arxiv.org/abs/1805.02566

  • Abstract Accelerator Model

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 28

    Shared buffer (L2), connected to/from DRAM, feeding N PEs (PE 0 .. PE N-1), each with a private L1 buffer and an ALU, over a Network-on-Chip (NoC).

    NoC parameters: 1) bandwidth, 2) average latency, 3) multicast capability, 4) forwarding capability
    L2 parameters: 1) size of L2 buffer, 2) read/write bandwidth
    PE parameters: 1) number of PEs, 2) size of L1 buffer, 3) vector width

    L2/L1 buffer: scratchpad.  L0 buffer (in ALUs): register file.
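
    The parameter list above maps naturally onto a small configuration record. The sketch below is illustrative (the struct and field names are mine, not MAESTRO’s API), filled in loosely with the example values from the slide-19 hardware description.

    /* A hypothetical encoding of the abstract accelerator model above.
     * Field names are illustrative; they are not MAESTRO's actual interface. */
    #include <stdbool.h>
    #include <stdio.h>

    struct noc_model { int bandwidth; int avg_latency; bool multicast; bool forwarding; };
    struct l2_model  { int size; int rd_bw; int wr_bw; };
    struct pe_model  { int num_pes; int l1_size; int vector_width; };

    struct accel_model {
        struct noc_model noc;
        struct l2_model  l2;
        struct pe_model  pe;
    };

    int main(void) {
        /* Values loosely following the HW resource description on slide 19;
         * latency, forwarding, and vector width are made-up placeholders. */
        struct accel_model m = {
            .noc = { .bandwidth = 64, .avg_latency = 1, .multicast = true, .forwarding = false },
            .l2  = { .size = 1024, .rd_bw = 64, .wr_bw = 64 },
            .pe  = { .num_pes = 256, .l1_size = 64, .vector_width = 1 },
        };
        printf("PEs=%d, L1=%d, L2=%d, NoC BW=%d, multicast=%d\n",
               m.pe.num_pes, m.pe.l1_size, m.l2.size, m.noc.bandwidth, m.noc.multicast);
        return 0;
    }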

  • MAESTRO: Analytical Cost/Benefit Model (overview repeated; see slide 18)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 29
    *H. Kwon et al., “An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators,” https://arxiv.org/abs/1805.02566

  • MAESTRO Analysis Engine

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 30

    //Buffer Requirements
    L1BufferRequirement = 2 x (MV[Weights] + MV[Inputs] + MV[Outputs])
    L2BufferRequirement = 2 x { (M[Weights] + (NumSpIter-1) x MSUV[Weights]) + (M[Inputs] + (NumSpIter-1) x MSUV[Inputs])

    //Data volumes required for each spatial iteration in four cases (spatial volume)// (first/steady) temporal iteration x (steady/last) spatial iteration // -> {first/stedy, first/last, steady/steady, steady/last}SV_FTP_SSP[in_data_cls]= tilesz_lst[sp_var] * {MV[in_data_cls] + (num_sp_tiles-1) * MSUV[in_data_cls]}SV_FTP_LSP[in_data_cls]= tilesz_lst[sp_var] * {MV[in_data_cls] + (num_sp_edge_tiles-1) * MSUV[in_data_cls] }SV_STP_SSP[in_data_cls]= tilesz_lst[sp_var] * {MTUV[in_data_cls] + (num_sp_tiles-1) * MTSUV[in_data_cls] }SV_STP_LSP[in_data_cls]= tilesz_lst[sp_var] * {MTUV[in_data_cls] + (num_sp_edge_tiles-1) * MTSUV[in_data_cls] }//Multicasting factor Multcast_factor[in_data_cls] = MV[in_data_cls] / MSUV[in_data_cls]

    //Buffer access countsL2Wr[in_data_cls] = // Product of Loop sizes of each corresponding variable to in_data_clsL2Rd[in_data_cls] = (NoC.multicast_support)? (sp_iter-1) * SV_FTP_SSP[in_data_cls] + SV_FTP_LSP[in_data_cls] + (tp_iter-1) * {(sp_iter-1) * SV_STP_SSP[in_data_cls] + SV_STP_LSP[in_data_cls]} / tp_freq[in_data_cls] : (sp_iter-1) * num_sp_tiles * MV[in_data_cls] + num_sp_edge_tiles * MV[in_data_cls] + (tp_iter-1) * { (sp_iter-1) * num_sp_tiles * MTUV[in_data_cls] + num_sp_edge_tiles * MTUV[in_data_cls] }/tp_freq[in_data_cls]

    L1Wr[in_data_cls] = (NoC.multicast_support)? L2Rd[in_data_cls] * multcast_factor : L2Rd[in_data_cls]; L1Rd[in_data_cls] = tp_iter * { (sp_iter-1) * num_sp_tiles * MV[in_data_cls] + num_sp_edge_tiles * MV[in_data_cls] }

    Input: The number of ALUs in each PE (num_alus), temporal update frequency (tp_freq), number of spatial iterations (sp_iter), number of temporal iterations (tp_iter)Output: Total runtime for a given input layer (runtime) Procedure ComputeRuntime runtime = 0; //First temporal iteration if(sp_iter > 1) then init_noc_delay = NoCDelay(SV_FTP_SSP[input]) + NoCDelay(SV_FTP_SSP[weight]) else then init_noc_delay = NoCDelay(SV_FTP_LSP[input]) + NoCDelay(SV_FTP_LSP[weight]) end runtime += init_noc_delay;

    if(sp_iter > 2) then //already loaded the first data sets L2ToL1_noc_delay = NoCDelay(SV_FTP_SSP[weight] + SV_FTP_SSP[input]) L1ToL2_noc_delay = NoCDelay(SV_FTP_SSP[output]) runtime += (sp_iter-2) *max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay) else then L2ToL1_noc_delay = NoCDelay(SV_FTP_LSP[weight] + SV_FTP_LSP[input]) L1ToL2_noc_delay = NoCDelay(SV_FTP_LSP[output]) runtime += (sp_iter-1) * max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay) end

    //Rest of temporal iterations if(sp_iter > 1) then L2ToL1_noc_delay = NoCDelay(SV_STP_SSP[weight]/tp_freq[weight] + SV_STP_SSP[input]/tp_freq[input]) L1ToL2_noc_delay = NoCDelay(SV_STP_SSP[output]//tp_freq[output]) runtime += (tp_iter-1) * (sp_iter-1) *max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay) else then L2ToL1_noc_delay = NoCDelay(SV_STP_LSP[weight]/tp_freq[weight] + SV_STP_LSP[input]/tp_freq[input]) L1ToL2_noc_delay = NoCDelay(SV_STP_LSP[output]/tp_freq[output]) end runtime += (tp_iter-1) * max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay) return runtime;endprocedure

    Volume Analysis Reuse Analysis Runtime Analysis Buffer Analysis

    //MV: Mapped volume
    MV[Weights] = M(K) x M(C) x M(R) x M(S)
    MV[Inputs]  = M(C) x M(Y) x M(X)
    MV[Outputs] = M(K) x M(Y’) x M(X’)

    //MSUV: Mapped spatially unique volume
    MSUV[Weights] = GetSpUSz(K) x GetSpUSz(C) x GetSpUSz(R) x GetSpUSz(S)
    MSUV[Inputs]  = GetSpUSz(C) x GetSpUSz(Y) x GetSpUSz(X)
    MSUV[Outputs] = GetSpUSz(K) x GetSpUSz(C) x GetSpUSz(Y’) x GetSpUSz(X’)

    //MTUV: Mapped temporally unique volume
    MTUV[Weights] = TU(K) x TU(C) x TU(R) x TU(S)
    MTUV[Inputs]  = TU(C) x TU(Y) x TU(X)
    MTUV[Outputs] = TU(K) x TU(C) x TU(Y’) x TU(X’)

    //MSTUV: Mapped spatially and temporally unique volume
    MSTUV[Weights] = GetSTpUSz(K) x GetSTpUSz(C) x GetSTpUSz(R) x GetSTpUSz(S)
    MSTUV[Inputs]  = GetSTpUSz(C) x GetSTpUSz(Y) x GetSTpUSz(X)
    MSTUV[Outputs] = GetSTpUSz(K) x GetSTpUSz(C) x GetSTpUSz(Y’) x GetSTpUSz(X’)

    * GetSpUSz(V)  = (V.pragma.class == TemporalMap)? M(V) : SU(V);
    * GetSTpUSz(V) = (V.pragma.class == SpatialMap)? SU(V) : TU(V);

    Input: dataflow description in MAESTRO pragmas (df_desc)Output: The total or uniquely mapped size of a data class on a PE (mp_sz) Procedure AnalyzeVariableMapping: for each pragma in df_desc switch(pragma.class) case TemporalMap: M[pragma.var] = pragma.map_sz; SU[pragma.var] = 0; TU[pragma.var] = (pragma.map_sz > pragma.ofs)? pragma.ofs : pragma.map_sz; case SpatialMap: M[pragma.var] = pragma.map_sz; SU[pragma.var] = (pragma.map_sz > pragma.ofs)? pragma.ofs : pragma.map_sz; TU[pragma.var] = pragma.map_sz; case Unroll: M[pragma.var] = pragma.loop.sz; SU[pragma.var] = 0; TU[pragma.var] = pragma.loop.sz; end endendprocedure

    Input: Dataflow description (df_desc), loop list (lp_lst), pragma id of spatial map (sp_prag_id) a tile size list processed by AnalyzeTiles (tilesz_lst)Output: Number of spatial iterations (sp_iter), number of temporal iterations (tp_iter)Procedure AnalyzeNumIterations sp_iter = 1; tp_iter = 1; for each pragma in df_desc if(pragma.id > sp_prag_id) then if(pragma.class == TemporalMap) then tp_iter *= lp_list[pragma.var].size / pragma.ofs; end end else if(pragma.id == sp_prag_id) then sp_iter *= lp_list[pragma.var].size / pragma.ofs / tilesz_lst[pragma.var]; end end return {sp_iter, tp_iter};endProcedureInput: dataflow description (df_desc), target data class (data_cls), temporal loop list (tp_lp_lst)Output: Number of temporal iterations to have a change in mapped data points of data_cls (tp_freq)Procedure AnalyzeTemporalUpdateFrequency tp_freq=1; upper_most_sz = 1; saw_cor_var = false; if(df_desc.has_spatial_map(data_cls)) then return tp_freq; end for each loop in tp_lp_lst if(data_cls.has(loop.var)) then if(!saw_cor_var) then saw_cor_var = true; end tp_freq=1; else then if(saw_cor_var) then pragma = df_desc.search(loop.var); tp_freq *= loop.size/pragma.ofs; end end end return tp_freq; endProcedure

    Ignore this “eyechart” slide – Just an intuition on how it works

    • Analytical model – not cycle-accurate simulation; 1000-4000x faster
    • Error within 5% of cycle-accurate RTL simulations of Eyeriss and NVDLA
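
    As intuition for the first step of the eyechart, here is a heavily simplified, illustrative sketch (mine, not MAESTRO’s code) of the mapped-volume bookkeeping: the per-dimension mapped tile sizes M(...) are multiplied into the MV volumes, which in turn give a per-PE L1 requirement via the L1BufferRequirement formula above.

    /* Simplified mapped-volume bookkeeping in the spirit of the formulas above.
     * M[] holds the mapped tile size of each loop dimension on one PE. */
    #include <stdio.h>

    enum dim { K, C, R, S, Y, X, Yp, Xp, NUM_DIMS };

    int main(void) {
        /* Example mapped sizes per dimension (illustrative numbers only). */
        int M[NUM_DIMS] = {0};
        M[K] = 1; M[C] = 3; M[R] = 3; M[S] = 3;
        M[Y] = 3; M[X] = 3; M[Yp] = 1; M[Xp] = 1;

        long mv_weights = (long)M[K] * M[C] * M[R] * M[S];
        long mv_inputs  = (long)M[C] * M[Y] * M[X];
        long mv_outputs = (long)M[K] * M[Yp] * M[Xp];

        /* L1BufferRequirement = 2 x (MV[Weights] + MV[Inputs] + MV[Outputs]);
         * the factor of 2 is from the slide (commonly it covers double buffering). */
        long l1_req = 2 * (mv_weights + mv_inputs + mv_outputs);

        printf("MV[W]=%ld MV[I]=%ld MV[O]=%ld -> L1 requirement = %ld elements\n",
               mv_weights, mv_inputs, mv_outputs, l1_req);
        return 0;
    }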

  • MAESTRO: Analytical Cost/Benefit Model (overview repeated; see slide 18)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 31
    *H. Kwon et al., “An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators,” https://arxiv.org/abs/1805.02566

  • Use Cases: (i) HW Design

    DNN layer sizes: fixed.  Mapping (dataflow): fixed.  HW resources: searched.
    HW DSE: search HW configurations.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 32

  • HW DSE using MAESTRO

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 33

    [Figure: design-space plots of throughput (MACs/cycle) and normalized energy (10^9 x single-MAC energy) versus area (mm^2), buffer size (KB), and power (mW), for VGG16-CONV2 and VGG16-CONV11 under the NVDLA dataflow, sweeping the number of PEs (8, 32, 64, 96, 128, 152); area and power constraints are marked, and the energy-optimized and throughput-optimized designs are highlighted]

    The best HW configuration for throughput is different from the best design for energy.

    The DSE engine searched 480M designs and identified 2.5M valid designs, at an average rate of 0.17M designs per second.

  • Use Cases: (ii) Compiler Design

    DNN layer sizes: fixed.  HW resources: fixed.  Mapping (dataflow): searched.
    Dataflow DSE: generate the optimal mapping.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 34

  • Dataflow Comparison using MAESTRO

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 35

    NOTE: this represents the performance and energy of each dataflow on a normalized PE substrate; it is not representative of the performance of the original architectures.

    [Figure: comparison of five dataflow styles (NLR, WS, Shi, DLA, RS) on an early VGG16 layer (CONV1) and a late layer (CONV11): (a) bandwidth requirement (Gbps), (b) L1 memory requirement (KB), (c) throughput (GFLOPS); plus normalized energy broken down into MAC, L1 read, L1 write, L2 read, and L2 write, and scalability as the number of PEs grows from 16 to 256]

    Takeaway: no one dataflow is best for all layers.

  • Use Cases: (iii) HW-SW Co-Design

    DNN layer sizes: fixed.  HW resources and mapping (dataflow): searched together.
    HW/SW co-design: search HW configurations + mappings.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 36

  • Summary of MAESTRO

    • Precise specification of dataflows using a data-centric approach
    • Analytical model for analyzing reuse => performance, memory, interconnect, energy
    • Use for HW design-space or mapping-space exploration, or HW-SW co-design
    • Validation ongoing…

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 37

  • Outline of Talk

    • How do we map billions of computations over limited compute and memory resources?
    • How do we design an accelerator to efficiently map arbitrary layer types and dataflows?

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 38

  • Myriad Dataflows in DNN Accelerators

    • DNN topologies
      • Layer size / shape
      • Layer types: Convolution / Pool / FC / LSTM
      • New sub-structures: e.g., Inception in GoogleNet
    • Compiler/Mapper (e.g., MAESTRO)
      • Loop scheduling: reordering and tiling
      • Mapping: output/weight/input/row-stationary
    • Algorithmic optimization
      • Weight pruning: sparse workloads

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 39

  • The current trend for supporting multiple dataflows

    • New dataflow → new accelerator
      • Data reuse: FlexFlow (2017), Eyeriss (2016), ...
      • Cross-layer: Fused CNN (2016)
      • Sparse CNN: SCNN (2017), EIE (2016), ...
      • LSTM: ESE (2017), ...

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 40

    Can we have one architectural solution that can handle arbitrary dataflows and provides ~100% utilization?

  • What is the computation in a DNN?

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 41

    [Figure: a CONV layer slides a filter (W0, W1, W2, ...) over the input activations to produce output activations; each output is a neuron computing the weighted sum Out = Σ(Wi·Xi) over inputs X0..Xk and weights W0..Wk]

    Computing the weighted sum = independent multiplications + accumulation of partial products.

    Our key insight: each dataflow translates into neurons of different sizes.

  • Irregular Dataflow: Pruning

    Example: weight pruning (sparse workload). Pruning removes weights between layer 1 and layer 2.

    [Figure: a dense neuron Out = Σ(Wi·Xi) over (X0, X1, X2) becomes, after removing W1, a smaller neuron over (X0, X2) only]

    Our key insight: each dataflow translates into neurons of different sizes.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 42

  • The MAERI Abstraction

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 43

    [Figure: a prefetch buffer feeding a pool of multipliers and a pool of adders; weights/inputs flow in and outputs flow out; the compute units are grouped into virtual neurons VN0, VN1, VN2]

    Virtual Neuron (VN): a temporary grouping of compute units for one output, e.g. “MultAlloc(3); AddAlloc(2)” or “MultAlloc(2); AddAlloc(1)”.

    How to enable flexible grouping? Need flexible connectivity!
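
    As an illustration of the abstraction (a sketch of the idea, not MAERI’s controller code), the snippet below carves virtual neurons of requested sizes out of fixed multiplier and adder pools, in the spirit of the MultAlloc/AddAlloc calls on this slide; an n-input neuron needs n multipliers and n-1 adders.

    /* Allocate virtual neurons (VNs) of requested sizes from fixed multiplier
     * and adder pools. Illustrative sketch; not MAERI's actual interface. */
    #include <stdio.h>

    #define NUM_MULTS 8
    #define NUM_ADDS  7

    int main(void) {
        int vn_sizes[] = {3, 2, 3};        /* inputs (non-zero weights) per VN */
        int num_vns = (int)(sizeof vn_sizes / sizeof vn_sizes[0]);
        int next_mult = 0, next_add = 0;

        for (int vn = 0; vn < num_vns; vn++) {
            int mults = vn_sizes[vn];      /* "MultAlloc(n)"  */
            int adds  = vn_sizes[vn] - 1;  /* "AddAlloc(n-1)" */
            if (next_mult + mults > NUM_MULTS || next_add + adds > NUM_ADDS) {
                printf("VN%d does not fit; map it in a later temporal pass\n", vn);
                break;
            }
            printf("VN%d: %d multipliers starting at %d, %d adders starting at %d\n",
                   vn, mults, next_mult, adds, next_add);
            next_mult += mults;
            next_add  += adds;
        }
        return 0;
    }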

  • Naïve Approach: Full Crossbar

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 44

    [Figure: the multiplier and adder pools connected by a full crossbar]

    Wire overhead = O(n^2)

    Need “specialization” in the interconnection network for the traffic in DNN accelerators.

  • Traffic Patterns in DNN Accelerators*

    * H. Kwon et al., Rethinking NoCs for Spatial DNN Accelerators, NOCS 2017

    * GB: global buffer; NoC: Network-on-Chip (interconnection network); PE: processing element (compute units)

    One-to-Many (Distribution): GB → PEs, e.g., input and weight distribution to PEs
    Many-to-One (Collection/Reduction): PEs → GB, e.g., partial sum and output reduction
    One-to-One (Forwarding): PE → neighboring PE, e.g., input/weight/partial-sum forwarding

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 45

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 46

    [Figure: the MAERI microarchitecture. From/to DRAM, weights and inputs enter a distribution tree (simple 1x2 switches) feeding a row of multiplier switches; their products are reduced in an augmented reduction tree of adder switches and passed through activation units to produce outputs; an accelerator controller with a lookup table configures the fabric according to the dataflow received from the CPU]

    Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna, “MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects,” ASPLOS 2018; IEEE Micro Top Picks 2019 Honorable Mention

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 47

    (MAERI block diagram as on the previous slide.)

    Distribution Network:
    • Spatial reuse via multicasts
    • High bandwidth via fat links

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 48

    (MAERI block diagram as on slide 46.)

    Local FIFOs for temporal reuse, i.e., “stationary” data

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 49

    (MAERI block diagram as on slide 46.)

    Linear Local Network:
    • Forwarding of weights
    • Spatio-temporal reuse

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 50

    (MAERI block diagram as on slide 46.)

    Reduction Network:
    • High bandwidth via fat links
    • Provably non-blocking
    • Reductions via forwarding links

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 51

    (MAERI block diagram as on slide 46, with all four mechanisms together.)

    Distribution Network: spatial reuse via multicasts; high bandwidth via fat links
    Local FIFOs: temporal reuse, i.e., “stationary” data
    Linear Local Network: forwarding of weights; spatio-temporal reuse
    Reduction Network: high bandwidth via fat links; provably non-blocking; reductions via forwarding links

  • The MAERI Implementation: Micro-Switches

    (MAERI block diagram as on slide 46.)

    Distribute Switch (1x2 switch): Data_In → Left_Out, Right_Out
    Multiplier Switch (multiplier + 2x2 switch): Data_In, Fwd_In → Fwd_Out, Data_Out
    Adder Switch (adder (+/>) + 3x2 switch): Input_L, Input_R, Fwd_In → Fwd_Out, Output_Up

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 52

  • Example: Computing a CONV layer

    • [Communication] Distribute weights and inputs (image pixels) to multiplier switches
      (assume: weight stationary, convolutional reuse of inputs via local links)
    • [Computation] Compute partial sums
    • [Computation] Reduce partial sums
    • [Communication] Collect outputs into the buffer

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 53

  • MAERI Operation Example

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 54

    Sparse weight filter (one weight pruned to 0):
        W00 W01 W02
        W10 W11  0

    Input activation (4x4): X00..X33.  Output activation (4x4): O00..O33.  The filter slides over the input.

    O00 = W00·X00 + W01·X01 + W02·X02 + W10·X10 + W11·X11
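
    A tiny sketch of the single-output computation above (with made-up weight and input values): the pruned weight is simply skipped, which is exactly what mapping this output onto a 5-wide virtual neuron achieves.

    /* Compute O00 for the sparse 2x3 filter above: the zero weight in the
     * bottom-right position is skipped, so only 5 multiplies feed the sum. */
    #include <stdio.h>

    int main(void) {
        double W[2][3] = {{1, 2, 3}, {4, 5, 0}};          /* last weight pruned */
        double X[4][4] = {{1, 1, 1, 1}, {2, 2, 2, 2},
                          {3, 3, 3, 3}, {4, 4, 4, 4}};    /* input activations  */
        double o00 = 0;
        for (int r = 0; r < 2; r++)
            for (int c = 0; c < 3; c++)
                if (W[r][c] != 0)                         /* skip pruned weight */
                    o00 += W[r][c] * X[r][c];
        printf("O00 = %g\n", o00);                        /* 1+2+3+8+10 = 24    */
        return 0;
    }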

  • MAERI Operation Example: Virtual Neuron Construction

    (MAERI block diagram as on slide 46; sparse weight filter from the previous slide.)

    VN size = 5 (five non-zero weights per output).
    The controller configures the switches.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 55

  • MAERI Operation Example: Weight Distribution

    (MAERI block diagram as on slide 46; the five non-zero weights of the sparse filter are sent down the distribution tree to the multiplier switches.)

    Distribution bandwidth is tunable. Suppose BW = 3.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 56

  • MAERI Operation Example: Weight Distribution (continued)

    (Weight distribution continues at the configured bandwidth of 3 values per cycle until all five non-zero weights are in place.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 57

  • MAERI Operation Example: Input Distribution

    Inputs delivered to the three virtual neurons:
    VN0: X00 X01 X02 X10 X11    VN1: X10 X11 X12 X20 X21    VN2: X20 X21 X22 X30 X31

    Utilize multicast to reduce latency and energy (neighboring VNs share inputs such as X10, X11, X20, X21).

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 58

  • MAERI Operation Example: Partial Sum Reduction

    (The augmented reduction tree reduces the partial sums of each virtual neuron; the three VNs are reduced simultaneously.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 59

  • MAERI Operation Example: Sliding Window

    The filter window slides: weights stay stationary in the multiplier switches, and inputs are partially reused via forwarding on the local links; only the new inputs for the next window (X03, X12, X13, X22, X23, X32) are distributed. Repeat steps 4-5 for each window position.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 60

  • Mapping optimal dataflows for MAERI

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 61

    [Figure: the MAERI mapper, mRNA, takes the neurons of a deep neural network, searches for the optimal dataflow, and emits dataflow configurations that map virtual neurons (VN0, VN1, VN2, ...) onto the multiplier/adder substrate, reaching ~100% utilization]

    Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna, “MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects,” ASPLOS 2018; IEEE Micro Top Picks 2019 Honorable Mention
    Z. Zhao, H. Kwon, S. Kuhar, W. Sheng, Z. Mao, T. Krishna, “Efficient Mapping Space Exploration on a Reconfigurable Neural Accelerator,” ISPASS 2019

  • Example Mapping – Dense CNN

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 62

    Our key insight: each dataflow translates into neurons of different sizes.

    [Figure: for a dense CNN layer, equal-sized virtual neurons (VN0, VN1, VN2) are carved out of the multiplier/adder fabric; weights/inputs stream in and partial outputs stream out of each VN]

  • Example Mapping – Sparse DNN

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 63

    Our key insight: each dataflow translates into neurons of different sizes.

    [Figure: for a sparse (pruned) DNN layer, virtual neurons of different sizes are allocated across the fabric, matching the non-zero weights of each output]

  • Example Mapping – LSTM/FC

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 64

    Our key insight: each dataflow translates into neurons of different sizes.

    [Figure: mapping an LSTM/fully-connected layer onto the fabric; virtual neurons are sized to the layer’s long dot products]

  • Example - Impact of Mappings

    Metric                             Mapping 1   Mapping 2   Mapping 3   Mapping 4   Mapping 5
    DN BW requirement for input         16 / 8      8 / 4       16 / 8      8 / 4       9
    DN BW requirement for weight        16          16          8           8           4
    RN BW requirement                   1           2           2           4           4
    Number of DS accesses for weight    64          64          64          64          56
    Number of DS accesses for input     64 / 32     64 / 32     64 / 32     64 / 32     42
    Number of reduces                   15          14          14          12          12
    Number of RS accesses               30          28          28          24          24
    Number of MS accesses               8           8           8           8           0
    Number of iterations                36          41          41          45          73
    Peak utilization rate               100%        100%        100%        100%        100%
    Average utilization rate            100%        100%        100%        100%        56%

    Callouts: best performance (fewest iterations); least DN bandwidth; least RN bandwidth; low utilization rate (Mapping 5).

    Z. Zhao, H. Kwon, S. Kuhar, W. Sheng, Z. Mao, T. Krishna, “Efficient Mapping Space Exploration on a Reconfigurable Neural Accelerator,” ISPASS 2019

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 65

  • End-to-End Performance

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 66

    [Figure: VGG16 end-to-end runtime (ms), per layer CONV1..CONV13, comparing MAERI with feature-map parallelism, MAERI with channel parallelism, MAERI with an adaptive dataflow, and Eyeriss with array partitioning; y-axis 0-4500 ms]

  • Summary of MAERI

    • DNN models are evolving rapidly
      • Multiple layer types
      • Sparsity optimizations
      • Myriad dataflows for scheduling and mapping
    • MAERI enables dynamic grouping of an arbitrary number of MACs (a “Virtual Neuron”) via reconfigurable, non-blocking interconnects, providing
      • Future-proofing against new DNN models and dataflows
      • Near-100% compute unit utilization

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 67

  • Takeaways

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 68

    AI will be pervasive.

    [Figure: recap of the talk: a convolutional neural network (weights, inputs, outputs over the 7D loop nest) is mapped onto a 2D hardware array of PEs; an analytical model (MAESTRO) analyzes DNN dataflows; and a DNN accelerator with configurable interconnects (MAERI) can map irregular dataflows such as pruned neurons]

    Thank you!

    http://synergy.ece.gatech.edu/tools/maestro/

    http://synergy.ece.gatech.edu/tools/maeri/

