Why do we need DNN accelerators?
- Millions of parameters (i.e., weights)
- Billions of computations
- Heavy data movement
DNN Topology | Number of Weights
AlexNet (2012) | 3.98M
VGGnet-16 (2014) | 28.25M
GoogLeNet (2015) | 6.77M
ResNet-20 (2016) | 0.27M
ResNet-110 (2016) | 1.7M
- Need high throughput
- Need to reduce energy
These requirements make both CPUs and GPUs inefficient for DNNs.
Spatial (or Dataflow) Accelerators
- Spread computations across hundreds of ALUs
- Reuse data within the array via local storage and direct communication
- Examples: MIT Eyeriss, Google TPU, Xilinx xDNN
[Diagram: a spatial (dataflow) accelerator — a 2D array of ALUs with local registers/FIFOs/SRAM and direct neighbor links, fed by a memory hierarchy and a control unit.]
Two Design Questions
- How do we map billions of computations over limited compute and memory resources?
- How do we design an accelerator to efficiently map arbitrary layer types and dataflows?
Outline of Talk
- How do we map billions of computations over limited compute and memory resources?
- How do we design an accelerator to efficiently map arbitrary layer types and dataflows?
Motivation: Data Movement
VGG16 conv3_2:
- Multiply-Add ops: 1.85 billion
- Weights: 590 K
- Inputs: 803 K
- Outputs: 803 K

Fortunately, there is re-use. How do we exploit it? That is the "dataflow".

Energy costs (relative):
- 8-bit integer multiply: 1x
- Fetch two 8-bit operands from a large SRAM: ~10x
- Fetch two 8-bit operands from DRAM: ~100x

Slide acknowledgment: Joel Emer, Angshuman Parashar, Michael Pellauer (NVIDIA)
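As a back-of-the-envelope sketch (my arithmetic, using only the numbers above): each multiply-add touches three values, so the roughly 2.2M distinct weights, inputs, and outputs must each be touched on the order of a few thousand times — reuse that a good dataflow keeps out of DRAM.

// Rough reuse estimate for VGG16 conv3_2 from the figures on this slide.
#include <cstdio>
int main() {
    const double macs     = 1.85e9;                 // multiply-add operations
    const double operands = 590e3 + 803e3 + 803e3;  // weights + inputs + outputs
    const double accesses = 3.0 * macs;             // each MAC reads a weight and an input, and updates an output
    std::printf("average touches per value: ~%.0f\n", accesses / operands);  // ~2500x
    return 0;
}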
What is a "Dataflow"?
- How the DNN computation is scheduled (i.e., loop transformations: ordering, tiling, unrolling)
- How computations are mapped across PEs (i.e., data staging within the accelerator)
- Goal of a good dataflow: turn algorithmic data reuse into hardware reuse
[Diagram: a 7-dimensional convolutional layer — weights (K filters of size R×S×C), inputs (N activations of size X×Y×C), outputs (N activations of size X'×Y'×K) — being mapped onto a 2D array of PEs.]
Types of Algorithmic Data Reuse in DNNs
- Convolutional reuse (CONV layers only, via the sliding window): both activations and filter weights are reused.
- Fmap reuse (CONV and FC layers): the same input fmap is reused across filters — reuse of activations.
- Filter reuse (CONV and FC layers, batch size > 1): the same filter is reused across input fmaps — reuse of filter weights.

Slide acknowledgment: Yu-Hsin Chen, Vivienne Sze, Joel Emer (MIT)
Hardware structures to exploit reuse
- Temporal reuse: memory hierarchy / staging buffers (DRAM → Buf → RF → ALU). E.g., custom memory hierarchies in accelerators.
- Spatial reuse: multicast-supporting NoCs. E.g., hierarchical bus in Eyeriss (ISCA 2016), tree in MAERI (ASPLOS 2018).
- Spatio-temporal reuse: neighbor-to-neighbor connections (buffer-to-buffer forwarding). E.g., TPU (ISCA 2017), local network in Eyeriss (ISCA 2016).
[Diagrams: PE-vs-time views showing the same data element reused over time within one PE, across PEs in the same cycle, or forwarded PE-to-PE over successive cycles.]
Dataflow 101 – 1D convolution: "Output stationary" dataflow

for(int x = 0; x < X'; x++)
  for(int s = 0; s < S; s++)
    Output[x] += Weight[s] * Input[x+s]

Weights have size S, Inputs size X, and Outputs size X' = X - S. In the iteration space (spatial dimension = #PEs, temporal dimension = cycles), each point is a partial sum.

Schedule on a single PE (PE0, with a multiply-add unit and L1 buffers):
t0: W[0], I[0] → O[0]
t1: W[1], I[1] → O[0]
t2: W[2], I[2] → O[0]
t3: W[0], I[1] → O[1]
t4: W[1], I[2] → O[1]
t5: W[2], I[3] → O[1]
...

How often do we need to fetch a new weight? Every cycle.
How often do we need to fetch a new input? Every cycle.
How often do we start contributing to a new output? Every S cycles.

Note: "stationary" conveys an intuition rather than a precise specification.
Dataflow 101 – 1D convolution: "Weight stationary" dataflow

for(int s = 0; s < S; s++)
  for(int x = 0; x < X'; x++)
    Output[x] += Weight[s] * Input[x+s]

Schedule on a single PE (PE0):
t0: W[0], I[0] → O[0]
t1: W[0], I[1] → O[1]
t2: W[0], I[2] → O[2]
...
then W[0], I[X'-1] → O[X'-1], followed by W[1], I[1] → O[0], W[1], I[2] → O[1], and so on.

How often do we need to fetch a new weight? Every X' cycles.
How often do we need to fetch a new input? Every cycle.
How often do we start contributing to a new output? Every cycle.
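A small self-contained sketch (mine, not from the slides) of the two loop orders above: both produce identical outputs; only the order in which weights, inputs, and outputs are revisited changes, which is exactly what "output stationary" versus "weight stationary" captures.

// 1-D convolution with the two loop orders from the previous slides.
#include <cstdio>
int main() {
    const int S = 3, X = 9, Xp = X - S;   // filter size, input size, output size (X' = X - S, as on the slides)
    int W[S] = {1, 2, 3}, I[X] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    int O_os[Xp] = {0}, O_ws[Xp] = {0};

    // "Output stationary": each output is completed before moving to the next one.
    for (int x = 0; x < Xp; x++)
        for (int s = 0; s < S; s++)
            O_os[x] += W[s] * I[x + s];

    // "Weight stationary": each weight is reused across all outputs before being replaced.
    for (int s = 0; s < S; s++)
        for (int x = 0; x < Xp; x++)
            O_ws[x] += W[s] * I[x + s];

    for (int x = 0; x < Xp; x++)
        std::printf("O[%d] = %d (same in both orders: %d)\n", x, O_os[x], O_ws[x]);
    return 0;
}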
The “best” dataflow
The 1-D convolution has Weights of size S, Inputs of size X, and Outputs of size X' = X - S.

Common metric | Weights | Inputs | Outputs / Partial Sums
Minimum accesses to backing store | S | X | X'
Max operand reuse within the PE | SX' | SX' | SX'
How do we achieve this with a single-PE design?

L1 buffer size for zero re-fetch:
Dataflow | Weights | Inputs | Outputs
Weight-stationary | 1 | X' | X'
Input-stationary | S | 1 | S
Output-stationary | S | S | 1

Buffer accesses (per buffer entry):
Dataflow | Weights | Inputs | Outputs
Weight-stationary | SX' | S | S
Input-stationary | X' | SX' | X'
Output-stationary | X' | X' | SX'

Note: for each data class, buffer size × accesses per entry always equals SX'.
[Diagram: a single PE with local (L1) buffers for weights, inputs, and partial sums, backed by a backing store (e.g., DRAM).]
Suppose the "best" dataflow is the one that maximizes reuse: choose one of the three dataflows based on the area/power budget of the buffers and the latency/energy cost of each access.
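A minimal sketch (mine, not from the talk) that checks one row of the tables above: with a weight-stationary order, a 1-entry weight buffer, and X'-entry input and output buffers, the weight slot is touched S·X' times while each input and output slot is touched S times, so buffer size × accesses per entry comes out to S·X' for every data class.

// Count per-slot L1 buffer accesses for the weight-stationary 1-D convolution.
#include <cstdio>
int main() {
    const int S = 3, Xp = 6;               // filter size and number of outputs
    long weight_slot = 0;                  // the single stationary weight register
    long input_slot[Xp] = {0}, output_slot[Xp] = {0};

    for (int s = 0; s < S; s++)            // weight loop outermost: the weight stays put
        for (int x = 0; x < Xp; x++) {
            weight_slot++;                 // read the stationary weight
            input_slot[x]++;               // read input slot x (holding I[x+s] this pass)
            output_slot[x]++;              // accumulate into output slot x
        }

    std::printf("weight slot: %ld (= S*X')\n", weight_slot);
    std::printf("each input slot: %ld, each output slot: %ld (= S)\n",
                input_slot[0], output_slot[0]);
    return 0;
}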
Getting More Realistic
[Diagram: a full CONV layer — weights (K×C×R×S), inputs (N×C×Y×X), and outputs/partial sums (N×K×Y'×X', with X' = X - S and Y' = Y - R) — mapped onto an accelerator with DRAM, shared L2 buffers for weights/inputs/outputs, and per-PE L1 buffers.]
- 7D computation space: R × S × X × Y × C × K × N
- Hardware resources: number of PEs, memory hierarchy, interconnect bandwidth
- Transform + map the computation space onto the resources → millions of possible dataflows
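For reference, the untransformed 7-deep loop nest over this computation space looks roughly like the sketch below (mine; dimension names follow the slides, while strides, padding, and edge handling are simplified):

// 7D CONV loop nest: N (batch) x K (output channels) x C (input channels)
// x Y'/X' (output rows/cols) x R/S (filter rows/cols), with X' = X - S and Y' = Y - R as on the slides.
const int N = 1, K = 2, C = 2, Y = 6, X = 6, R = 3, S = 3, Yp = Y - R, Xp = X - S;
float Input[N][C][Y][X], Weight[K][C][R][S], Output[N][K][Yp][Xp];

void conv7d() {
    for (int n = 0; n < N; n++)                   // batch
      for (int k = 0; k < K; k++)                 // output channels (filters)
        for (int c = 0; c < C; c++)               // input channels
          for (int yp = 0; yp < Yp; yp++)         // output rows
            for (int xp = 0; xp < Xp; xp++)       // output columns
              for (int r = 0; r < R; r++)         // filter rows
                for (int s = 0; s < S; s++)       // filter columns
                  Output[n][k][yp][xp] += Weight[k][c][r][s] * Input[n][c][yp + r][xp + s];
}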
Why does Dataflow matter?
- Loop transformations (loop order and tile size)
  - Determine the interconnect bandwidth requirement
  - Determine the buffer size within each PE
- Mapping (over space and time)
  - Determines the opportunities for spatial, temporal, and spatio-temporal reuse
  - Determines the energy of reads/writes/interconnect traffic
How do we explore all possible dataflows?
{Performance, Energy} = f(Dimension Sizes, Hardware Resources, Dataflow)
MAESTRO: Analytical Cost/Benefit Model
Inputs: DNN layer sizes (K, C, R, S, X, Y, ...), an abstract HW model of the resources, and a mapping (dataflow).
A data reuse analysis drives computation and communication analyses over the abstract HW model:
- Buffer analysis: size requirement, access counts (energy)
- NoC analysis: bandwidth requirement, NoC activity counts
- Runtime analysis: roofline throughput, expected runtime
*H. Kwon et al., "An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators," https://arxiv.org/abs/1805.02566
Input specification to MAESTRO

//Layer Description
Layer CONV VGG16_C1
K=64; C=3; R=3; S=3; Y=224; X=224
endLayer

//Hardware Resource Description
L1Size 64
L2 Size 1024
NoCBW 64
Multcast True
NumPEs 256

//Mapping (Dataflow) Description
???
Example Dataflow
[Diagram: the 1-D convolution's iteration space (spatial dimension = PEs, temporal dimension = time) mapped onto PE0–PE2. At t = 0, PE0/PE1/PE2 hold W[0]/W[1]/W[2], read inputs I[0..2]/I[1..3]/I[2..4], and contribute partial sums to O[0..2]; at t = 1 they read I[3..5]/I[4..6]/I[5..7] and contribute to O[3..5].]
This point in the map space is a "weight stationary" dataflow.
Data Mapping over Space and Time
  Spatial_Map(1,1) S
  Temporal_Map(3,3) X'
  (*Map(Mapping size, Offset) Dim)
- Weight is spatially mapped across PEs (i.e., parallelization), one weight per PE.
- Output is temporally mapped at each PE, three outputs at a time.
[Same PE-vs-time diagram as the previous slide.]
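A tiny sketch (mine) that replays these two directives and prints what each PE holds at each step; it reproduces the schedule in the diagram (PE p holds W[p], and at step t the PEs work on outputs O[3t .. 3t+2]).

// Replay Spatial_Map(1,1) S; Temporal_Map(3,3) X' for S = 3 PEs and X' = 6 outputs.
#include <cstdio>
int main() {
    const int S = 3, Xp = 6;
    const int sp_ofs = 1;                 // Spatial_Map(1,1) over S: mapping size 1, offset 1 -> one weight per PE
    const int tp_size = 3, tp_ofs = 3;    // Temporal_Map(3,3) over X': three outputs per temporal step
    for (int t = 0; t * tp_ofs < Xp; t++)           // temporal steps
        for (int pe = 0; pe * sp_ofs < S; pe++) {   // PEs (spatial dimension)
            int w  = pe * sp_ofs;                   // weight index mapped to this PE
            int o0 = t * tp_ofs;                    // first output index in this step
            std::printf("t=%d PE%d: W[%d], O[%d..%d], I[%d..%d]\n",
                        t, pe, w, o0, o0 + tp_size - 1, o0 + w, o0 + tp_size - 1 + w);
        }
    return 0;
}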
Data Reuse over Space and Time
  Spatial_Map(1,1) S
  Temporal_Map(3,3) X'
  (*Map(Mapping size, Offset) Dim)
- Weight reuse opportunity: across time (i.e., stationary).
- Input reuse opportunity: across space (i.e., multicast).
- Output reuse opportunity: across space (i.e., reduction of partial sums across PEs).
[Same PE-vs-time diagram as before.]
Data Movement Order
  Spatial_Map(1,1) S
  Temporal_Map(3,3) X'
  (*Map(Mapping size, Offset) Dim)
The listed order of the directives defines the order in which data moves over space and time.
[Same PE-vs-time diagram as before.]
Describing Dataflows in MAESTRO
- Data Mapping: Temporal_Map, Spatial_Map, Cluster (PE grouping for hierarchies)
- Data Movement: the order of the directives
Example: Dataflows from recent accelerators in MAESTRO representation
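As a sketch of how the 1-D example from the previous slides could be written in this notation (the pragma names follow the slides; the released MAESTRO tool's exact syntax and keywords may differ):

//Mapping (Dataflow) Description (sketch for the 1-D convolution example)
//S is parallelized across PEs (one weight per PE); X' is iterated within each PE in tiles of 3
Spatial_Map(1,1) S
Temporal_Map(3,3) X'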
The Dataflow Playground
The four orderings of the two directives over S and X' (*Map(Mapping size, Offset) Dim), each mapped onto PE0–PE2 over the 0..8 iteration space:
- Spatial_Map(1,1) S; Temporal_Map(3,3) X' → weight stationary
- Temporal_Map(3,3) X'; Spatial_Map(1,1) S → weight stationary
- Spatial_Map(1,1) X'; Temporal_Map(3,3) S → output stationary
- Temporal_Map(3,3) S; Spatial_Map(1,1) X' → output stationary
Spatially mapping S pins a weight to each PE (weight stationary); spatially mapping X' pins an output to each PE (output stationary).
Dataflow → Hardware Implications
[Diagrams: the same PE array (PE0–PE3) under an L2 buffer; different dataflows imply different distribution patterns (L2 read bandwidth) and reduction patterns (L2 write bandwidth).]
Abstract Accelerator Model
[Diagram: PE 0 ... PE N-1, each with a private L1 buffer and ALU, connected through a Network-on-Chip (NoC) to a shared L2 buffer, which talks to/from DRAM.]
Model parameters:
- NoC: 1) bandwidth, 2) average latency, 3) multicast capability, 4) forwarding capability
- Shared (L2) buffer: 1) size, 2) read/write bandwidth
- PEs: 1) number of PEs, 2) size of L1 buffer, 3) vector width
The L2/L1 buffers are scratchpads; the L0 buffer (inside the ALUs) is a register file.
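A sketch of how these knobs could be captured as a cost model's input (the struct and field names are mine, not MAESTRO's actual API):

// Hypothetical container for the abstract accelerator model parameters listed above.
struct NoCModel {
    double bandwidth_gbps;      // (1) bandwidth
    double avg_latency_cycles;  // (2) average latency
    bool   multicast;           // (3) multicast capability
    bool   forwarding;          // (4) forwarding capability
};

struct AcceleratorModel {
    int      num_pes;           // (1) number of PEs
    int      l1_size_bytes;     // (2) private L1 (scratchpad) size per PE
    int      vector_width;      // (3) ALU vector width
    int      l2_size_bytes;     // shared L2 (scratchpad) size
    double   l2_rw_bandwidth;   // L2 read/write bandwidth
    NoCModel noc;                // NoC between the L2 buffer and the PEs
};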
MAESTRO Analysis Engine
Volume analysis:

//MV: Mapped volume
MV[Weights] = M(K) x M(C) x M(R) x M(S)
MV[Inputs]  = M(C) x M(Y) x M(X)
MV[Outputs] = M(K) x M(Y') x M(X')

//MSUV: Mapped spatially unique volume
MSUV[Weights] = GetSpUSz(K) x GetSpUSz(C) x GetSpUSz(R) x GetSpUSz(S)
MSUV[Inputs]  = GetSpUSz(C) x GetSpUSz(Y) x GetSpUSz(X)
MSUV[Outputs] = GetSpUSz(K) x GetSpUSz(C) x GetSpUSz(Y') x GetSpUSz(X')

//MTUV: Mapped temporally unique volume
MTUV[Weights] = TU(K) x TU(C) x TU(R) x TU(S)
MTUV[Inputs]  = TU(C) x TU(Y) x TU(X)
MTUV[Outputs] = TU(K) x TU(C) x TU(Y') x TU(X')

//MSTUV: Mapped spatially and temporally unique volume
MSTUV[Weights] = GetSTpUSz(K) x GetSTpUSz(C) x GetSTpUSz(R) x GetSTpUSz(S)
MSTUV[Inputs]  = GetSTpUSz(C) x GetSTpUSz(Y) x GetSTpUSz(X)
MSTUV[Outputs] = GetSTpUSz(K) x GetSTpUSz(C) x GetSTpUSz(Y') x GetSTpUSz(X')

* GetSpUSz(V)  = (V.pragma.class == TemporalMap) ? M(V) : SU(V)
* GetSTpUSz(V) = (V.pragma.class == SpatialMap)  ? SU(V) : TU(V)

Reuse analysis (per data class in_data_cls):

//Data volumes required for each spatial iteration (spatial volume), in four cases:
//(first/steady) temporal iteration x (steady/last) spatial iteration -> {first/steady, first/last, steady/steady, steady/last}
SV_FTP_SSP[in_data_cls] = tilesz_lst[sp_var] * { MV[in_data_cls]   + (num_sp_tiles-1)      * MSUV[in_data_cls] }
SV_FTP_LSP[in_data_cls] = tilesz_lst[sp_var] * { MV[in_data_cls]   + (num_sp_edge_tiles-1) * MSUV[in_data_cls] }
SV_STP_SSP[in_data_cls] = tilesz_lst[sp_var] * { MTUV[in_data_cls] + (num_sp_tiles-1)      * MSTUV[in_data_cls] }
SV_STP_LSP[in_data_cls] = tilesz_lst[sp_var] * { MTUV[in_data_cls] + (num_sp_edge_tiles-1) * MSTUV[in_data_cls] }

//Multicasting factor
Multcast_factor[in_data_cls] = MV[in_data_cls] / MSUV[in_data_cls]

Buffer analysis:

//Buffer requirements
L1BufferRequirement = 2 x (MV[Weights] + MV[Inputs] + MV[Outputs])
L2BufferRequirement = 2 x { (M[Weights] + (NumSpIter-1) x MSUV[Weights]) + (M[Inputs] + (NumSpIter-1) x MSUV[Inputs]) }

//Buffer access counts
L2Wr[in_data_cls] = product of the loop sizes of each variable appearing in in_data_cls
L2Rd[in_data_cls] = (NoC.multicast_support) ?
      (sp_iter-1) * SV_FTP_SSP[in_data_cls] + SV_FTP_LSP[in_data_cls]
        + (tp_iter-1) * { (sp_iter-1) * SV_STP_SSP[in_data_cls] + SV_STP_LSP[in_data_cls] } / tp_freq[in_data_cls]
    : (sp_iter-1) * num_sp_tiles * MV[in_data_cls] + num_sp_edge_tiles * MV[in_data_cls]
        + (tp_iter-1) * { (sp_iter-1) * num_sp_tiles * MTUV[in_data_cls] + num_sp_edge_tiles * MTUV[in_data_cls] } / tp_freq[in_data_cls]
L1Wr[in_data_cls] = (NoC.multicast_support) ? L2Rd[in_data_cls] * Multcast_factor[in_data_cls] : L2Rd[in_data_cls]
L1Rd[in_data_cls] = tp_iter * { (sp_iter-1) * num_sp_tiles * MV[in_data_cls] + num_sp_edge_tiles * MV[in_data_cls] }

Runtime analysis:

Procedure ComputeRuntime
  Input: number of ALUs per PE (num_alus), temporal update frequency (tp_freq),
         number of spatial iterations (sp_iter), number of temporal iterations (tp_iter)
  Output: total runtime for a given input layer (runtime)
  runtime = 0
  //First temporal iteration
  if (sp_iter > 1) then
    init_noc_delay = NoCDelay(SV_FTP_SSP[input]) + NoCDelay(SV_FTP_SSP[weight])
  else
    init_noc_delay = NoCDelay(SV_FTP_LSP[input]) + NoCDelay(SV_FTP_LSP[weight])
  end
  runtime += init_noc_delay
  if (sp_iter > 2) then  //already loaded the first data sets
    L2ToL1_noc_delay = NoCDelay(SV_FTP_SSP[weight] + SV_FTP_SSP[input])
    L1ToL2_noc_delay = NoCDelay(SV_FTP_SSP[output])
    runtime += (sp_iter-2) * max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay)
  else
    L2ToL1_noc_delay = NoCDelay(SV_FTP_LSP[weight] + SV_FTP_LSP[input])
    L1ToL2_noc_delay = NoCDelay(SV_FTP_LSP[output])
    runtime += (sp_iter-1) * max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay)
  end
  //Rest of the temporal iterations
  if (sp_iter > 1) then
    L2ToL1_noc_delay = NoCDelay(SV_STP_SSP[weight]/tp_freq[weight] + SV_STP_SSP[input]/tp_freq[input])
    L1ToL2_noc_delay = NoCDelay(SV_STP_SSP[output]/tp_freq[output])
    runtime += (tp_iter-1) * (sp_iter-1) * max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay)
  else
    L2ToL1_noc_delay = NoCDelay(SV_STP_LSP[weight]/tp_freq[weight] + SV_STP_LSP[input]/tp_freq[input])
    L1ToL2_noc_delay = NoCDelay(SV_STP_LSP[output]/tp_freq[output])
    runtime += (tp_iter-1) * max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay)
  end
  return runtime
endProcedure

Mapping-analysis helpers:

Procedure AnalyzeVariableMapping
  Input: dataflow description in MAESTRO pragmas (df_desc)
  Output: the total or uniquely mapped size of a data class on a PE (mp_sz)
  for each pragma in df_desc
    switch (pragma.class)
      case TemporalMap:
        M[pragma.var] = pragma.map_sz; SU[pragma.var] = 0
        TU[pragma.var] = (pragma.map_sz > pragma.ofs) ? pragma.ofs : pragma.map_sz
      case SpatialMap:
        M[pragma.var] = pragma.map_sz
        SU[pragma.var] = (pragma.map_sz > pragma.ofs) ? pragma.ofs : pragma.map_sz
        TU[pragma.var] = pragma.map_sz
      case Unroll:
        M[pragma.var] = pragma.loop.sz; SU[pragma.var] = 0; TU[pragma.var] = pragma.loop.sz
    end
  end
endProcedure

Procedure AnalyzeNumIterations
  Input: dataflow description (df_desc), loop list (lp_lst), pragma id of the spatial map (sp_prag_id),
         tile size list produced by AnalyzeTiles (tilesz_lst)
  Output: number of spatial iterations (sp_iter), number of temporal iterations (tp_iter)
  sp_iter = 1; tp_iter = 1
  for each pragma in df_desc
    if (pragma.id > sp_prag_id) and (pragma.class == TemporalMap) then
      tp_iter *= lp_lst[pragma.var].size / pragma.ofs
    else if (pragma.id == sp_prag_id) then
      sp_iter *= lp_lst[pragma.var].size / pragma.ofs / tilesz_lst[pragma.var]
    end
  end
  return {sp_iter, tp_iter}
endProcedure

Procedure AnalyzeTemporalUpdateFrequency
  Input: dataflow description (df_desc), target data class (data_cls), temporal loop list (tp_lp_lst)
  Output: number of temporal iterations between changes in the mapped data points of data_cls (tp_freq)
  tp_freq = 1; saw_cor_var = false
  if (df_desc.has_spatial_map(data_cls)) then return tp_freq end
  for each loop in tp_lp_lst
    if (data_cls.has(loop.var)) then
      saw_cor_var = true; tp_freq = 1
    else if (saw_cor_var) then
      pragma = df_desc.search(loop.var)
      tp_freq *= loop.size / pragma.ofs
    end
  end
  return tp_freq
endProcedure
Ignore this “eyechart” slide – Just an intuition on how it works
- Analytical model (no cycle-accurate simulation): 1000–4000x faster
- Error within 5% of cycle-accurate RTL simulations of Eyeriss and NVDLA
Use Cases: (i) HW Design
With the DNN layer sizes and the mapping (dataflow) fixed, search over HW configurations (HW design-space exploration).
HW DSE using MAESTRO
[Plots: throughput (MACs/cycle) and normalized energy (in units of 10^9 × single-MAC energy) versus area (mm²), buffer size (KB), and power (mW) for the NVDLA dataflow on VGG16 CONV layers (e.g., CONV2 and CONV11), with area and power constraints marked and design points distinguished by number of PEs; the throughput-optimized and energy-optimized designs are highlighted.]
- The best HW configuration for throughput is different from the best design for energy.
- The DSE engine searched 480M designs and identified 2.5M valid designs, at an average rate of 0.17M designs per second.
Use Cases: (ii) Compiler Design
With the DNN layer sizes and the HW resources fixed, search over mappings (dataflow DSE) to generate an optimized mapping.
Dataflow Comparison using MAESTRO
NOTE: these results represent the performance and energy of each dataflow on a normalized PE substrate; they are not representative of the performance of the original architectures.
[Charts: (a) bandwidth requirement (Gbps), (b) L1 memory requirement (KB), and (c) throughput (GFLOPS) for five dataflow styles (NLR, WS, Shi, DLA, RS) on an early layer (VGG16 CONV1) and a late layer (VGG16 CONV11); normalized energy broken down into MAC, L1 Rd, L1 Wr, L2 Rd, and L2 Wr for the same layers; and scalability as the number of PEs grows from 16 to 256.]
Takeaway - No one dataflow is best for all layers
Use Cases: (iii) HW-SW Co-Design
With only the DNN layer sizes fixed, jointly search HW configurations and mappings (HW/SW co-design).
Summary of MAESTRO
- Precise specification of dataflows using a data-centric approach
- Analytical model for analyzing reuse → performance, memory, interconnect, energy
- Usable for HW design-space exploration, mapping-space exploration, or HW-SW co-design
- Validation is ongoing
Outline of Talk
- How do we map billions of computations over limited compute and memory resources?
- How do we design an accelerator to efficiently map arbitrary layer types and dataflows?
Myriad Dataflows in DNN Accelerators
- DNN topologies
  - Layer size / shape
  - Layer types: Convolution / Pool / FC / LSTM
  - New sub-structures: e.g., Inception in GoogLeNet
- Compiler/Mapper (e.g., MAESTRO)
  - Loop scheduling: reordering and tiling
  - Mapping: output / weight / input / row-stationary
- Algorithmic optimization
  - Weight pruning: sparse workloads
The current trend for supporting multiple dataflows
- New dataflow → new accelerator
  - Data reuse: FlexFlow (2017), Eyeriss (2016), ...
  - Cross-layer: Fused CNN (2016)
  - Sparse CNN: SCNN (2017), EIE (2016), ...
  - LSTM: ESE (2017), ...
Can we have one architectural solution that can handle arbitrary dataflows and provide ~100% utilization?
What is the computation in a DNN?
[Diagram: in a CONV layer, a filter (weights W0, W1, W2, ...) slides over the input activation to produce the output activation. Each output is a neuron computing a weighted sum Σ(Wi·Xi) over inputs X0..Xk: independent multiplications followed by accumulation of the partial products.]
Our Key insight: Each dataflow translates into neurons of different sizes
Irregular Dataflow: Pruning
[Diagram: pruning removes weights between Layer 1 and Layer 2, shrinking a neuron's weighted sum Σ(Wi·Xi) from inputs {X0, X1, X2} to only {X0, X2}.]
Our key insight: each dataflow translates into neurons of different sizes.
Example: weight pruning (a sparse workload).
The MAERI Abstraction
[Diagram: a prefetch buffer feeding a multiplier pool and an adder pool; weights/inputs enter the multipliers and outputs leave through the adders. The compute units are grouped into virtual neurons VN0, VN1, VN2 of different sizes, e.g., "MultAlloc(3); AddAlloc(2)" or "MultAlloc(2); AddAlloc(1)".]
Virtual Neuron (VN): a temporary grouping of compute units for one output.
How do we enable such flexible grouping?
Need flexible connectivity!
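Before looking at the wiring, here is a toy sketch (mine, not MAERI code) of the grouping idea itself: carve virtual neurons of arbitrary sizes out of a fixed pool of multipliers, so a pruned neuron with only two surviving weights simply gets a 2-multiplier VN.

// Group a fixed pool of multipliers into virtual neurons (VNs) of requested sizes.
#include <cstdio>
#include <vector>

int main() {
    const int num_mults = 8;
    std::vector<int> vn_sizes = {3, 3, 2};        // e.g., two dense neurons and one pruned neuron
    std::vector<int> vn_of_mult(num_mults, -1);   // which VN each multiplier belongs to

    int next = 0;
    for (int vn = 0; vn < (int)vn_sizes.size(); vn++)
        for (int i = 0; i < vn_sizes[vn] && next < num_mults; i++)
            vn_of_mult[next++] = vn;              // contiguous grouping, in the spirit of "MultAlloc(n)"

    for (int m = 0; m < num_mults; m++)
        std::printf("multiplier %d -> VN%d\n", m, vn_of_mult[m]);
    return 0;
}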
Naïve Approach: Full Crossbar
[Diagram: the prefetch buffer connected to every multiplier and adder through a full crossbar.]
Wire overhead = O(n²).
We need "specialization" of the interconnection network for the traffic patterns in DNN accelerators.
Traffic Patterns in DNN Accelerators*
* H. Kwon et al., Rethinking NoCs for Spatial DNN Accelerators, NOCS 2017
- One-to-Many (distribution): e.g., input and weight distribution to the PEs.
- Many-to-One (collection/reduction): e.g., partial sum and output reduction.
- One-to-One (forwarding): e.g., input/weight/partial-sum forwarding between PEs.
(GB: global buffer; NoC: network-on-chip / interconnection network; PE: processing element / compute unit)
The MAERI Implementation
[Diagram: the MAERI substrate — weights and inputs arrive from DRAM into a prefetch buffer; a distribution tree fans them out to a row of multiplier switches (X); an augmented reduction tree of adder switches (+) collects and reduces the partial sums; activation units produce the outputs. An accelerator controller with a lookup table configures all switches according to the dataflow received from the CPU, and simple switches steer data within the trees.]
Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna, "MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects," ASPLOS 2018 (IEEE Micro Top Picks 2019 Honorable Mention).
The MAERI Implementation: key structures (each highlighted on the same block diagram)
- Distribution network: spatial reuse via multicasts; high bandwidth via fat links.
- Local FIFOs: temporal reuse, i.e., "stationary" data.
- Linear local network: forwarding of weights; spatio-temporal reuse.
- Reduction network: high bandwidth via fat links; provably non-blocking; reductions via forwarding links.
The MAERI Implementation: Micro-Switches
- Distribute switch: a 1×2 switch (Data_In → Left_Out / Right_Out).
- Multiplier switch: a multiplier plus a 2×2 switch (Data_In, Fwd_In → Data_Out, Fwd_Out).
- Adder switch: an adder/comparator (+/>) plus a 3×2 switch (Input_L, Input_R, Fwd_In → Output_Up, Fwd_Out).
Example: Computing a CONV layer
- [Communication] Distribute weights and inputs (image pixels) to the multiplier switches. Assume a weight-stationary dataflow with convolutional reuse of inputs via the local links.
- [Computation] Compute partial sums.
- [Computation] Reduce partial sums.
- [Communication] Collect the outputs into the buffer.
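A compact sketch (mine, not MAERI RTL) of those steps for a single virtual neuron: weights are loaded once and stay stationary, the inputs for each window are delivered, partial products are formed, and the reduction tree's job is played by a simple sum.

// One virtual neuron computing a 1-D sliding window with stationary weights.
#include <cstdio>
int main() {
    const int S = 3, X = 6, Xp = X - S;
    int W[S] = {2, 1, 3};                    // [Communication] weights distributed once, kept stationary
    int I[X] = {1, 2, 3, 4, 5, 6};

    for (int x = 0; x < Xp; x++) {           // slide the window over the input
        int partial[S];
        for (int s = 0; s < S; s++)
            partial[s] = W[s] * I[x + s];    // [Computation] independent multiplies (multiplier switches)
        int out = 0;
        for (int s = 0; s < S; s++)
            out += partial[s];               // [Computation] reduce partial sums (adder switches)
        std::printf("O[%d] = %d\n", x, out); // [Communication] collect the output to the buffer
    }
    return 0;
}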
MAERI Operation Example
Workload: a 2×3 sparse weight filter whose last weight has been pruned to zero (W00 W01 W02 / W10 W11 0) is convolved with a 4×4 input activation (X00..X33) to produce the output activation (O00..O33); the filter slides over the input. For example:
O00 = W00·X00 + W01·X01 + W02·X02 + W10·X10 + W11·X11
Virtual Neuron construction: the controller configures the switches to build virtual neurons of size 5 — one multiplier per non-zero weight of the sparse filter.
Weight distribution: the non-zero weights are sent to the multiplier switches over the distribution tree. The distribution bandwidth is tunable; suppose BW = 3 values per cycle.
Input distribution: the three virtual neurons receive the input windows (X00 X01 X02 X10 X11), (X10 X11 X12 X20 X21), and (X20 X21 X22 X30 X31); multicast is used for the inputs shared between neighboring windows to reduce latency and energy.
Partial-sum reduction: all virtual neurons reduce their partial sums simultaneously over the augmented reduction tree.
Sliding window: the weights stay stationary; as the window slides, inputs already in the array are partially reused via forwarding on the local links and only the new inputs are distributed. The distribution and reduction steps repeat for each window position.
Mapping optimal dataflows for MAERI
[Diagram: the MAERI mapper, mRNA, takes the neurons of a deep neural network, searches for an optimal dataflow, and emits dataflow configurations that group MAERI's multipliers and adders into virtual neurons (VN0, VN1, VN2, ...) fed from the weight/input/output SRAM (to/from DRAM).]
Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna, "MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects," ASPLOS 2018 (IEEE Micro Top Picks 2019 Honorable Mention).
Z. Zhao, H. Kwon, S. Kuhar, W. Sheng, Z. Mao, T. Krishna, "Efficient Mapping Space Exploration on a Reconfigurable Neural Accelerator," ISPASS 2019.
~100% Utilization
Example Mapping – Dense CNN
[Diagram: the multiplier/adder substrate grouped into virtual neurons (VN0, VN1, VN2); weights and inputs flow into each VN and partial outputs flow out.]
Example Mapping – Sparse DNN
[Diagram: the same substrate regrouped into virtual neurons of different sizes, matching the number of surviving (non-zero) weights of each pruned neuron.]
Example Mapping – LSTM/FC
[Diagram: the same substrate regrouped into virtual neurons for the neurons of LSTM/FC layers.]
Example – Impact of Mappings

Metric | Mapping 1 | Mapping 2 | Mapping 3 | Mapping 4 | Mapping 5
DN BW requirement for input | 16 / 8 | 8 / 4 | 16 / 8 | 8 / 4 | 9
DN BW requirement for weight | 16 | 16 | 8 | 8 | 4
RN BW requirement | 1 | 2 | 2 | 4 | 4
Number of DS accesses for weight | 64 | 64 | 64 | 64 | 56
Number of DS accesses for input | 64 / 32 | 64 / 32 | 64 / 32 | 64 / 32 | 42
Number of reduces | 15 | 14 | 14 | 12 | 12
Number of RS accesses | 30 | 28 | 28 | 24 | 24
Number of MS accesses | 8 | 8 | 8 | 8 | 0
Number of iterations | 36 | 41 | 41 | 45 | 73
Peak utilization rate | 100% | 100% | 100% | 100% | 100%
Average utilization rate | 100% | 100% | 100% | 100% | 56%

Mapping 1 gives the best performance (fewest iterations) and the least RN bandwidth requirement; Mapping 5 needs the least DN bandwidth but has a low average utilization rate.

Z. Zhao, H. Kwon, S. Kuhar, W. Sheng, Z. Mao, T. Krishna, "Efficient Mapping Space Exploration on a Reconfigurable Neural Accelerator," ISPASS 2019.
End-to-End Performance
[Chart: VGG16 end-to-end runtime (ms), layer by layer (CONV1–CONV13), comparing MAERI with feature-map parallelism, MAERI with channel parallelism, MAERI with an adaptive dataflow, and Eyeriss with array partitioning.]
Summary of MAERI
- DNN models are evolving rapidly: multiple layer types, sparsity optimizations, and myriad dataflows for scheduling and mapping.
- MAERI enables dynamic grouping of an arbitrary number of MACs ("virtual neurons") via reconfigurable, non-blocking interconnects, providing:
  - Future-proofing against new DNN models and dataflows
  - Near-100% compute-unit utilization
Takeaways
- AI will be pervasive.
- Convolutional neural networks map naturally onto 2D hardware arrays of PEs.
- An analytical model (MAESTRO) enables DNN dataflow analysis.
- A DNN accelerator with configurable interconnects (MAERI) can map irregular dataflows.
Thank you!
http://synergy.ece.gatech.edu/tools/maestro/
http://synergy.ece.gatech.edu/tools/maeri/