Transcript
  • Why do we need DNN accelerators?

    • Millions of parameters (i.e., weights)
    • Billions of computations
    • Heavy data movement

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology

    DNN Topology        Number of Weights
    AlexNet (2012)      3.98M
    VGGnet-16 (2014)    28.25M
    GoogleNet (2015)    6.77M
    Resnet-20 (2016)    0.27M
    Resnet-110 (2016)   1.7M

    Need high throughput

    Need to reduce energy

    This makes CPUs inefficient

    This makes GPUs inefficient

  • Spatial (or Dataflow) Accelerators

    • Millions of parameters (i.e., weights)
    • Billions of computations
    • Heavy data movement

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology

    Spread computations across hundreds of ALUs

    Reuse data within the array via local storage and direct communication

    Examples: MIT Eyeriss, Google TPU, Xilinx xDNN

    [Figure: a spatial accelerator, a 4x4 grid of ALUs with local Register/FIFO/SRAM storage, fed by a memory hierarchy and a control unit]

  • Two Design Questions

    • How do we map billions of computations over limited compute and memory resources?
    • How do we design an accelerator to efficiently map arbitrary layer types and dataflows?

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 7

  • Outline of Talk

    • How do we map billions of computations over limited compute and memory resources?
    • How do we design an accelerator to efficiently map arbitrary layer types and dataflows?

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 8

  • Motivation: Data Movement

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 9

    VGG16 conv3_2: 1.85 billion multiply-add ops; 590 K weights; 803 K inputs; 803 K outputs

    Re-use

    “Dataflow”

    Energy costs (relative):
    8-bit integer multiply                        1x
    Fetch two 8-bit operands from large SRAM     ~10x
    Fetch two 8-bit operands from DRAM           ~100x

    Fortunately …

    Slide Acknowledgment: Joel Emer, Angshuman Parashar, Michael Pellauer (NVIDIA)

    How to exploit reuse?
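
    To make the reuse argument concrete, here is a back-of-the-envelope sketch (not from the talk) that applies the relative energy costs above to the VGG16 conv3_2 numbers, under a hypothetical two-level hierarchy: either every operand is re-fetched from DRAM for each MAC (no reuse), or each unique operand is fetched from DRAM once and then served from a large SRAM (ideal reuse).

    /* Back-of-the-envelope energy estimate for VGG16 conv3_2, using the
     * relative costs from the slide: MAC = 1x, SRAM fetch ~10x, DRAM fetch ~100x.
     * Two illustrative scenarios only: no reuse vs. ideal reuse. */
    #include <stdio.h>

    int main(void) {
        const double MACS    = 1.85e9;   /* multiply-add operations  */
        const double WEIGHTS = 590e3;    /* unique weights           */
        const double INPUTS  = 803e3;    /* unique input activations */
        const double OUTPUTS = 803e3;    /* unique output activations */

        const double E_MAC = 1.0, E_SRAM = 10.0, E_DRAM = 100.0;

        /* No reuse: every MAC pulls a weight and an input from DRAM. */
        double no_reuse = MACS * (E_MAC + 2 * E_DRAM);

        /* Ideal reuse: each unique operand touches DRAM once; each MAC is
         * then served from SRAM (weight read, input read, partial-sum update). */
        double ideal = MACS * (E_MAC + 3 * E_SRAM)
                     + (WEIGHTS + INPUTS + OUTPUTS) * E_DRAM;

        printf("no reuse : %.2e energy units\n", no_reuse);
        printf("ideal    : %.2e energy units\n", ideal);
        printf("ratio    : %.1fx\n", no_reuse / ideal);
        return 0;
    }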

  • What is “Dataflow”

    • How to schedule the DNN computation (i.e., loop transformations: ordering, tiling, unrolling)
    • How to map computations across PEs (i.e., data staging within accelerators)
    • Goal of a good dataflow: Algorithmic Data Reuse → Hardware Reuse

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 10

    [Figure: a 7-dimensional network layer (weights K x C x R x S, inputs N x C x Y x X, outputs N x K x Y’ x X’) being mapped onto a 4x4 array of PEs]

  • Types of Algorithmic Data Reuse in DNNs

    Convolutional Reuse: CONV layers only (sliding window). Reused data: activations and filter weights.
    Fmap Reuse: CONV and FC layers. Reused data: activations (one input fmap shared across multiple filters).
    Filter Reuse: CONV and FC layers (batch size > 1). Reused data: filter weights (one filter shared across the input fmaps in a batch).

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 11
    Slide Acknowledgment: Yu-Hsin Chen, Vivienne Sze, Joel Emer (MIT)

  • Hardware structures to exploit reuse

    Temporal Reuse: memory hierarchy / staging buffers (DRAM → Buf → RF → ALU). E.g., custom memory hierarchies in accelerators.
    Spatial Reuse: multicast-supporting NoCs. E.g., hierarchical bus in Eyeriss (ISCA 2016), tree in MAERI (ASPLOS 2018).
    Spatio-Temporal Reuse: neighbor-to-neighbor (forwarding) connections. E.g., TPU (ISCA 2017), local network in Eyeriss (ISCA 2016).

    [Figure: each reuse style illustrated as data delivered to PEs (PE0..PE3) over space and time]

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 12

  • Dataflow 101 – 1D convolution

    Weights[S] * Inputs[X] = Outputs[X’], where X’ = X - S

    for (int x = 0; x < X’; x++)
      for (int s = 0; s < S; s++)
        Output[x] += Weight[s] * Input[x+s];

    [Figure: iteration space with the spatial dimension (#PEs) on one axis and the temporal dimension on the other; each point is a partial sum. Map space: “Output stationary” dataflow, mapped here onto a single PE (MultAdd + L1 buffers).]

    PE0 timeline:
    t0: W[0], I[0] → O[0]
    t1: W[1], I[1] → O[0]
    t2: W[2], I[2] → O[0]
    t3: W[0], I[1] → O[1]
    t4: W[1], I[2] → O[1]
    t5: W[2], I[3] → O[1]

    How often do we need to fetch a new weight?  Every cycle
    How often do we need to fetch a new input?   Every cycle
    How often do we start contributing to a new output?  Every S cycles

    Note: “Stationary” => intuition rather than precise specification

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 13

  • Dataflow 101 – 1D convolution

    Weights[S] * Inputs[X] = Outputs[X’], where X’ = X - S

    for (int s = 0; s < S; s++)
      for (int x = 0; x < X’; x++)
        Output[x] += Weight[s] * Input[x+s];

    [Figure: the same iteration space, now walked in “Weight stationary” order on a single PE (MultAdd + L1 buffers).]

    PE0 timeline:
    t0: W[0], I[0] → O[0]
    t1: W[0], I[1] → O[1]
    t2: W[0], I[2] → O[2]
    ...  W[0], I[X’-1] → O[X’-1]
    then: W[1], I[1] → O[0];  W[1], I[2] → O[1];  W[1], I[3] → O[2];  ...

    How often do we need to fetch a new weight?  Every X’ cycles (the weight stays stationary while the inner loop sweeps the outputs)
    How often do we need to fetch a new input?   Every cycle
    How often do we start contributing to a new output?  Every cycle

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 14
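
    The two loop orders above differ only in which operand stays put. As a sanity check (an illustrative sketch, not from the talk), the snippet below runs both orderings for small S and X’ and counts how often the weight, input, and output index changes between consecutive MACs; the counts reproduce the “every cycle” / “every S cycles” / “every X’ cycles” answers on the last two slides.

    /* Count, for the two 1D-convolution loop orders above, how often each
     * operand index changes between consecutive MACs (illustrative sketch). */
    #include <stdio.h>

    #define S  3
    #define XP 6          /* X' = number of outputs */

    static void count(int weight_stationary) {
        int prev_w = -1, prev_i = -1, prev_o = -1;
        int new_w = 0, new_i = 0, new_o = 0, macs = 0;
        int outer = weight_stationary ? S : XP;
        int inner = weight_stationary ? XP : S;

        for (int a = 0; a < outer; a++) {
            for (int b = 0; b < inner; b++) {
                int s  = weight_stationary ? a : b;   /* weight index */
                int x  = weight_stationary ? b : a;   /* output index */
                int in = x + s;                       /* input index  */
                new_w += (s  != prev_w);
                new_i += (in != prev_i);
                new_o += (x  != prev_o);
                prev_w = s; prev_i = in; prev_o = x;
                macs++;
            }
        }
        printf("%-17s: %d MACs, %d new weights, %d new inputs, %d new outputs\n",
               weight_stationary ? "weight-stationary" : "output-stationary",
               macs, new_w, new_i, new_o);
    }

    int main(void) {
        count(0);   /* output-stationary: x outer, s inner */
        count(1);   /* weight-stationary: s outer, x inner */
        return 0;
    }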

  • The “best” dataflow

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 15

    Weights[S] * Inputs[X] = Outputs[X’], where X’ = X - S

    Suppose the “best” dataflow is the one with maximum reuse:

    Common metric                          Weights   Inputs   Outputs / Partial Sums
    Minimum accesses to backing store         S        X          X’
    Max operand reuse within the PE          SX’      SX’         SX’

    How do we achieve this with a one-PE design?
    (Backing store (e.g., DRAM) → local (L1) buffers for weights, inputs, partial sums → PE)

    L1 buffer size for zero re-fetch:
    Dataflow             Weights   Inputs   Outputs
    Weight-stationary       1        X’       X’
    Input-stationary        S        1        S
    Output-stationary       S        S        1

    Buffer accesses (per buffered entry):
    Dataflow             Weights   Inputs   Outputs
    Weight-stationary      SX’       S        S
    Input-stationary       X’       SX’       X’
    Output-stationary      X’       X’       SX’

    Note: for every data class the product of the two table entries equals SX’, the total number of MACs.

    Choose one of these three based on the area/power budget of the buffers, and the latency/energy cost of each access.
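
    As a small consistency check (illustrative only, not from the talk), the two tables above can be written out for concrete S and X’ and the note verified programmatically: for every data class and dataflow, L1 buffer size times per-entry accesses equals the total MAC count S*X’.

    /* Reproduce the two tables above for concrete S and X' and check that
     * (L1 size for zero re-fetch) * (accesses per buffered entry) == S*X'. */
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        const int S = 3, Xp = 6;            /* example sizes */
        const char *df[3]  = {"weight-stationary", "input-stationary", "output-stationary"};
        /* rows: dataflow; columns: weights, inputs, outputs */
        int l1_size[3][3]  = {{1, Xp, Xp}, {S, 1, S}, {S, S, 1}};
        int accesses[3][3] = {{S * Xp, S, S}, {Xp, S * Xp, Xp}, {Xp, Xp, S * Xp}};

        for (int d = 0; d < 3; d++) {
            for (int c = 0; c < 3; c++)
                assert(l1_size[d][c] * accesses[d][c] == S * Xp);
            printf("%-18s L1 size {W,I,O} = {%d,%d,%d}\n",
                   df[d], l1_size[d][0], l1_size[d][1], l1_size[d][2]);
        }
        return 0;
    }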

  • Getting More Realistic

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 16

    7D computation space: R * S * X * Y * C * K * N
    Weights: K x C x R x S;  Inputs: N x C x Y x X;  Partial sums: N x K x Y’ x X’ (X’ = X - S, Y’ = Y - R)

    Accelerator resources: number of PEs, memory hierarchy (DRAM → L2 weight/input/output buffers → per-PE L1 weight/input/output buffers), interconnect bandwidth

    Transform + map the DNN onto the accelerator: millions of dataflows

  • Why does Dataflow matter?

    • Loop transformations (loop order and tile size)
      • Determine the interconnect bandwidth requirement
      • Determine the buffer size within PEs
    • Mapping (over space and time)
      • Opportunities for spatial, temporal, and spatio-temporal reuse
      • Energy of reads/writes/interconnect

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 17

    How do we explore all possible dataflows?

    {Performance, Energy} = f(Dimension Sizes, Hardware Resources, Dataflow)

  • MAESTRO: Analytical Cost/Benefit Model*

    Inputs: DNN layer sizes (K, C, Y, X, R, S), HW resources, and a mapping (dataflow)

    Data Reuse Analysis over an Abstract HW Model → Communication Analysis + Computation Analysis, producing:
    • Buffer Analysis: size requirement, access count (energy)
    • NoC Analysis: BW requirement, NoC activity count
    • Runtime Analysis: roofline throughput, expected runtime

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 18
    *H. Kwon et al., “An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators,” https://arxiv.org/abs/1805.02566

  • Input specification to MAESTRO

    1 | //Layer Description
    2 | Layer CONV VGG16_C1
    3 | K=64; C=3; R=3; S=3; Y=224; X=224
    4 | endLayer

    1 | //Hardware Resource Description
    2 | L1Size 64
    3 | L2Size 1024
    4 | NoCBW 64
    5 | Multcast True
    6 | NumPEs 256

    1 | //Mapping (Dataflow) Description
    ???

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 19

  • Example Dataflow

    “Weight stationary” dataflow over three PEs (iteration space → map space, spatial dimension = PEs, temporal dimension = time):

    Time   PE0                        PE1                        PE2
    t=0    W[0], I[0..2] → O[0..2]    W[1], I[1..3] → O[0..2]    W[2], I[2..4] → O[0..2]
    t=1    W[0], I[3..5] → O[3..5]    W[1], I[4..6] → O[3..5]    W[2], I[5..7] → O[3..5]

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 20

  • Data Mapping over Space and Time

    Spatial_Map(1,1) S
    Temporal_Map(3,3) X’
    *Map(Mapping size, Offset) Dim

    Weight is spatially mapped across PEs (i.e., parallelization): each PE holds one weight (PE0: W[0], PE1: W[1], PE2: W[2]).
    Output is temporally mapped at each PE: a tile of 3 outputs per time step (O[0..2] at t=0, then O[3..5] at t=1, ...), advancing by the offset each step.

    (Space-time table as on the previous slide.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 21
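
    To make the two directives concrete, here is a small illustrative sketch (not MAESTRO’s actual code) that expands Spatial_Map(1,1) S and Temporal_Map(3,3) X’ into per-PE, per-time-step index ranges, reproducing the table on slide 20; the I[x+s] → O[x] indexing comes from the 1D convolution on slide 13.

    /* Expand a data-centric mapping into per-PE, per-timestep tiles.
     * Spatial_Map(1,1) S   : each PE gets a size-1 tile of S, offset 1 apart.
     * Temporal_Map(3,3) X' : every PE walks size-3 tiles of X', offset 3 per step. */
    #include <stdio.h>

    int main(void) {
        const int S = 3, Xp = 6;
        const int sp_size = 1, sp_ofs = 1;     /* Spatial_Map(1,1) over S   */
        const int tp_size = 3, tp_ofs = 3;     /* Temporal_Map(3,3) over X' */
        const int num_pes   = S / sp_ofs;
        const int num_steps = Xp / tp_ofs;

        for (int t = 0; t < num_steps; t++) {
            for (int pe = 0; pe < num_pes; pe++) {
                int s_lo = pe * sp_ofs, s_hi = s_lo + sp_size - 1;   /* weight(s) on this PE   */
                int x_lo = t * tp_ofs,  x_hi = x_lo + tp_size - 1;   /* outputs this time step */
                /* inputs needed: I[x+s] for x in [x_lo..x_hi], s in [s_lo..s_hi] */
                printf("t=%d PE%d: W[%d..%d]  I[%d..%d]  O[%d..%d]\n",
                       t, pe, s_lo, s_hi, x_lo + s_lo, x_hi + s_hi, x_lo, x_hi);
            }
        }
        return 0;
    }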

  • Data Reuse over Space and Time

    Spatial_Map(1,1) S
    Temporal_Map(3,3) X’
    *Map(Mapping size, Offset) Dim

    Weight reuse opportunity: across time (i.e., stationary)
    Input reuse opportunity: across space (i.e., multicast)
    Output reuse opportunity: across space (i.e., multicast)

    (Space-time table as on slide 20.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 22

  • Data Movement Order

    Spatial_Map(1,1) S
    Temporal_Map(3,3) X’
    *Map(Mapping size, Offset) Dim

    (Space-time table as on slide 20, highlighting the order in which data moves to the PEs.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 23

  • Describing Dataflows in MAESTRO

    • Data Mapping: Temporal_Map, Spatial_Map, Cluster (PE grouping for hierarchies)
    • Data Movement

    Example: dataflows from recent accelerators in MAESTRO representation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 24

  • The Dataflow Playground

    Four example mappings over the same 1D-convolution iteration space (*Map(Mapping size, Offset) Dim):

    Spatial_Map(1,1) S;   Temporal_Map(3,3) X’   →  Weight stationary
    Temporal_Map(3,3) X’; Spatial_Map(1,1) S     →  Weight stationary
    Temporal_Map(3,3) S;  Spatial_Map(1,1) X’    →  Output stationary
    Spatial_Map(1,1) X’;  Temporal_Map(3,3) S    →  Output stationary

    (Whichever dimension is spatially mapped determines what stays stationary in each PE: S → weights, X’ → outputs.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 25

  • Dataflow → Hardware Implications

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 26

    [Figure: several 4-PE + L2-buffer organizations, illustrating how the chosen dataflow drives distribution and reduction traffic, and hence the L2 read bandwidth and L2 write bandwidth]
  • MAESTRO: Analytical Cost/Benefit Model (overview repeated; see slide 18)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 27
    *H. Kwon et al., “An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators,” https://arxiv.org/abs/1805.02566

  • Abstract Accelerator Model

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 28

    Shared buffer (L2), connected to/from DRAM, feeding N PEs (PE 0 .. PE N-1), each with a private L1 buffer and an ALU, over a Network-on-Chip (NoC).

    NoC parameters: 1) bandwidth, 2) average latency, 3) multicast capability, 4) forwarding capability
    L2 parameters: 1) size of L2 buffer, 2) read/write bandwidth
    PE parameters: 1) number of PEs, 2) size of L1 buffer, 3) vector width

    L2/L1 buffer: scratchpad.  L0 buffer (in ALUs): register file.
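
    The parameter list above maps naturally onto a small configuration record. The sketch below is illustrative (the struct and field names are mine, not MAESTRO’s API), filled in loosely with the example values from the slide-19 hardware description.

    /* A hypothetical encoding of the abstract accelerator model above.
     * Field names are illustrative; they are not MAESTRO's actual interface. */
    #include <stdbool.h>
    #include <stdio.h>

    struct noc_model { int bandwidth; int avg_latency; bool multicast; bool forwarding; };
    struct l2_model  { int size; int rd_bw; int wr_bw; };
    struct pe_model  { int num_pes; int l1_size; int vector_width; };

    struct accel_model {
        struct noc_model noc;
        struct l2_model  l2;
        struct pe_model  pe;
    };

    int main(void) {
        /* Values loosely following the HW resource description on slide 19;
         * latency, forwarding, and vector width are made-up placeholders. */
        struct accel_model m = {
            .noc = { .bandwidth = 64, .avg_latency = 1, .multicast = true, .forwarding = false },
            .l2  = { .size = 1024, .rd_bw = 64, .wr_bw = 64 },
            .pe  = { .num_pes = 256, .l1_size = 64, .vector_width = 1 },
        };
        printf("PEs=%d, L1=%d, L2=%d, NoC BW=%d, multicast=%d\n",
               m.pe.num_pes, m.pe.l1_size, m.l2.size, m.noc.bandwidth, m.noc.multicast);
        return 0;
    }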

  • MAESTRO: Analytical Cost/Benefit Model (overview repeated; see slide 18)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 29
    *H. Kwon et al., “An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators,” https://arxiv.org/abs/1805.02566

  • MAESTRO Analysis Engine

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 30

    //Buffer Requirements
    L1BufferRequirement = 2 x (MV[Weights] + MV[Inputs] + MV[Outputs])
    L2BufferRequirement = 2 x { (M[Weights] + (NumSpIter-1) x MSUV[Weights]) + (M[Inputs] + (NumSpIter-1) x MSUV[Inputs])

    //Data volumes required for each spatial iteration in four cases (spatial volume)// (first/steady) temporal iteration x (steady/last) spatial iteration // -> {first/stedy, first/last, steady/steady, steady/last}SV_FTP_SSP[in_data_cls]= tilesz_lst[sp_var] * {MV[in_data_cls] + (num_sp_tiles-1) * MSUV[in_data_cls]}SV_FTP_LSP[in_data_cls]= tilesz_lst[sp_var] * {MV[in_data_cls] + (num_sp_edge_tiles-1) * MSUV[in_data_cls] }SV_STP_SSP[in_data_cls]= tilesz_lst[sp_var] * {MTUV[in_data_cls] + (num_sp_tiles-1) * MTSUV[in_data_cls] }SV_STP_LSP[in_data_cls]= tilesz_lst[sp_var] * {MTUV[in_data_cls] + (num_sp_edge_tiles-1) * MTSUV[in_data_cls] }//Multicasting factor Multcast_factor[in_data_cls] = MV[in_data_cls] / MSUV[in_data_cls]

    //Buffer access countsL2Wr[in_data_cls] = // Product of Loop sizes of each corresponding variable to in_data_clsL2Rd[in_data_cls] = (NoC.multicast_support)? (sp_iter-1) * SV_FTP_SSP[in_data_cls] + SV_FTP_LSP[in_data_cls] + (tp_iter-1) * {(sp_iter-1) * SV_STP_SSP[in_data_cls] + SV_STP_LSP[in_data_cls]} / tp_freq[in_data_cls] : (sp_iter-1) * num_sp_tiles * MV[in_data_cls] + num_sp_edge_tiles * MV[in_data_cls] + (tp_iter-1) * { (sp_iter-1) * num_sp_tiles * MTUV[in_data_cls] + num_sp_edge_tiles * MTUV[in_data_cls] }/tp_freq[in_data_cls]

    L1Wr[in_data_cls] = (NoC.multicast_support)? L2Rd[in_data_cls] * multcast_factor : L2Rd[in_data_cls]; L1Rd[in_data_cls] = tp_iter * { (sp_iter-1) * num_sp_tiles * MV[in_data_cls] + num_sp_edge_tiles * MV[in_data_cls] }

    Input: The number of ALUs in each PE (num_alus), temporal update frequency (tp_freq), number of spatial iterations (sp_iter), number of temporal iterations (tp_iter)Output: Total runtime for a given input layer (runtime) Procedure ComputeRuntime runtime = 0; //First temporal iteration if(sp_iter > 1) then init_noc_delay = NoCDelay(SV_FTP_SSP[input]) + NoCDelay(SV_FTP_SSP[weight]) else then init_noc_delay = NoCDelay(SV_FTP_LSP[input]) + NoCDelay(SV_FTP_LSP[weight]) end runtime += init_noc_delay;

    if(sp_iter > 2) then //already loaded the first data sets L2ToL1_noc_delay = NoCDelay(SV_FTP_SSP[weight] + SV_FTP_SSP[input]) L1ToL2_noc_delay = NoCDelay(SV_FTP_SSP[output]) runtime += (sp_iter-2) *max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay) else then L2ToL1_noc_delay = NoCDelay(SV_FTP_LSP[weight] + SV_FTP_LSP[input]) L1ToL2_noc_delay = NoCDelay(SV_FTP_LSP[output]) runtime += (sp_iter-1) * max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay) end

    //Rest of temporal iterations if(sp_iter > 1) then L2ToL1_noc_delay = NoCDelay(SV_STP_SSP[weight]/tp_freq[weight] + SV_STP_SSP[input]/tp_freq[input]) L1ToL2_noc_delay = NoCDelay(SV_STP_SSP[output]//tp_freq[output]) runtime += (tp_iter-1) * (sp_iter-1) *max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay) else then L2ToL1_noc_delay = NoCDelay(SV_STP_LSP[weight]/tp_freq[weight] + SV_STP_LSP[input]/tp_freq[input]) L1ToL2_noc_delay = NoCDelay(SV_STP_LSP[output]/tp_freq[output]) end runtime += (tp_iter-1) * max(L2ToL1_noc_delay, L1ToL2_noc_delay + ComputeDelay) return runtime;endprocedure

    Volume Analysis Reuse Analysis Runtime Analysis Buffer Analysis

    //MV: Mapped volume
    MV[Weights] = M(K) x M(C) x M(R) x M(S)
    MV[Inputs]  = M(C) x M(Y) x M(X)
    MV[Outputs] = M(K) x M(Y’) x M(X’)

    //MSUV: Mapped spatially unique volume
    MSUV[Weights] = GetSpUSz(K) x GetSpUSz(C) x GetSpUSz(R) x GetSpUSz(S)
    MSUV[Inputs]  = GetSpUSz(C) x GetSpUSz(Y) x GetSpUSz(X)
    MSUV[Outputs] = GetSpUSz(K) x GetSpUSz(C) x GetSpUSz(Y’) x GetSpUSz(X’)

    //MTUV: Mapped temporally unique volume
    MTUV[Weights] = TU(K) x TU(C) x TU(R) x TU(S)
    MTUV[Inputs]  = TU(C) x TU(Y) x TU(X)
    MTUV[Outputs] = TU(K) x TU(C) x TU(Y’) x TU(X’)

    //MSTUV: Mapped spatially and temporally unique volume
    MSTUV[Weights] = GetSTpUSz(K) x GetSTpUSz(C) x GetSTpUSz(R) x GetSTpUSz(S)
    MSTUV[Inputs]  = GetSTpUSz(C) x GetSTpUSz(Y) x GetSTpUSz(X)
    MSTUV[Outputs] = GetSTpUSz(K) x GetSTpUSz(C) x GetSTpUSz(Y’) x GetSTpUSz(X’)

    * GetSpUSz(V)  = (V.pragma.class == TemporalMap)? M(V) : SU(V);
    * GetSTpUSz(V) = (V.pragma.class == SpatialMap)? SU(V) : TU(V);

    Input: dataflow description in MAESTRO pragmas (df_desc)Output: The total or uniquely mapped size of a data class on a PE (mp_sz) Procedure AnalyzeVariableMapping: for each pragma in df_desc switch(pragma.class) case TemporalMap: M[pragma.var] = pragma.map_sz; SU[pragma.var] = 0; TU[pragma.var] = (pragma.map_sz > pragma.ofs)? pragma.ofs : pragma.map_sz; case SpatialMap: M[pragma.var] = pragma.map_sz; SU[pragma.var] = (pragma.map_sz > pragma.ofs)? pragma.ofs : pragma.map_sz; TU[pragma.var] = pragma.map_sz; case Unroll: M[pragma.var] = pragma.loop.sz; SU[pragma.var] = 0; TU[pragma.var] = pragma.loop.sz; end endendprocedure

    Input: Dataflow description (df_desc), loop list (lp_lst), pragma id of spatial map (sp_prag_id) a tile size list processed by AnalyzeTiles (tilesz_lst)Output: Number of spatial iterations (sp_iter), number of temporal iterations (tp_iter)Procedure AnalyzeNumIterations sp_iter = 1; tp_iter = 1; for each pragma in df_desc if(pragma.id > sp_prag_id) then if(pragma.class == TemporalMap) then tp_iter *= lp_list[pragma.var].size / pragma.ofs; end end else if(pragma.id == sp_prag_id) then sp_iter *= lp_list[pragma.var].size / pragma.ofs / tilesz_lst[pragma.var]; end end return {sp_iter, tp_iter};endProcedureInput: dataflow description (df_desc), target data class (data_cls), temporal loop list (tp_lp_lst)Output: Number of temporal iterations to have a change in mapped data points of data_cls (tp_freq)Procedure AnalyzeTemporalUpdateFrequency tp_freq=1; upper_most_sz = 1; saw_cor_var = false; if(df_desc.has_spatial_map(data_cls)) then return tp_freq; end for each loop in tp_lp_lst if(data_cls.has(loop.var)) then if(!saw_cor_var) then saw_cor_var = true; end tp_freq=1; else then if(saw_cor_var) then pragma = df_desc.search(loop.var); tp_freq *= loop.size/pragma.ofs; end end end return tp_freq; endProcedure

    Ignore this “eyechart” slide – Just an intuition on how it works

    • Analytical model – not cycle-accurate simulation; 1000-4000x faster
    • Error within 5% of cycle-accurate RTL simulations of Eyeriss and NVDLA
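
    As intuition for the first step of the eyechart, here is a heavily simplified, illustrative sketch (mine, not MAESTRO’s code) of the mapped-volume bookkeeping: the per-dimension mapped tile sizes M(...) are multiplied into the MV volumes, which in turn give a per-PE L1 requirement via the L1BufferRequirement formula above.

    /* Simplified mapped-volume bookkeeping in the spirit of the formulas above.
     * M[] holds the mapped tile size of each loop dimension on one PE. */
    #include <stdio.h>

    enum dim { K, C, R, S, Y, X, Yp, Xp, NUM_DIMS };

    int main(void) {
        /* Example mapped sizes per dimension (illustrative numbers only). */
        int M[NUM_DIMS] = {0};
        M[K] = 1; M[C] = 3; M[R] = 3; M[S] = 3;
        M[Y] = 3; M[X] = 3; M[Yp] = 1; M[Xp] = 1;

        long mv_weights = (long)M[K] * M[C] * M[R] * M[S];
        long mv_inputs  = (long)M[C] * M[Y] * M[X];
        long mv_outputs = (long)M[K] * M[Yp] * M[Xp];

        /* L1BufferRequirement = 2 x (MV[Weights] + MV[Inputs] + MV[Outputs]);
         * the factor of 2 is from the slide (commonly it covers double buffering). */
        long l1_req = 2 * (mv_weights + mv_inputs + mv_outputs);

        printf("MV[W]=%ld MV[I]=%ld MV[O]=%ld -> L1 requirement = %ld elements\n",
               mv_weights, mv_inputs, mv_outputs, l1_req);
        return 0;
    }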

  • MAESTRO: Analytical Cost/Benefit Model (overview repeated; see slide 18)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 31
    *H. Kwon et al., “An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators,” https://arxiv.org/abs/1805.02566

  • Use Cases: (i) HW Design

    DNN layer sizes: fixed.  Mapping (dataflow): fixed.  HW resources: searched.
    HW DSE: search HW configurations.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 32

  • HW DSE using MAESTRO

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 33

    [Figure: design-space plots of throughput (MACs/cycle) and normalized energy (10^9 x single-MAC energy) versus area (mm^2), buffer size (KB), and power (mW), for VGG16-CONV2 and VGG16-CONV11 under the NVDLA dataflow, sweeping the number of PEs (8, 32, 64, 96, 128, 152); area and power constraints are marked, and the energy-optimized and throughput-optimized designs are highlighted]

    The best HW configuration for throughput is different from the best design for energy.

    The DSE engine searched 480M designs and identified 2.5M valid designs, at an average rate of 0.17M designs per second.

  • Use Cases: (ii) Compiler Design

    DNN layer sizes: fixed.  HW resources: fixed.  Mapping (dataflow): searched.
    Dataflow DSE: generate the optimal mapping.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 34

  • Dataflow Comparison using MAESTRO

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 35

    NOTE: this represents the performance and energy of each dataflow on a normalized PE substrate; it is not representative of the performance of the original architectures.

    [Figure: comparison of five dataflow styles (NLR, WS, Shi, DLA, RS) on an early VGG16 layer (CONV1) and a late layer (CONV11): (a) bandwidth requirement (Gbps), (b) L1 memory requirement (KB), (c) throughput (GFLOPS); plus normalized energy broken down into MAC, L1 read, L1 write, L2 read, and L2 write, and scalability as the number of PEs grows from 16 to 256]

    Takeaway: no one dataflow is best for all layers.

  • Use Cases: (iii) HW-SW Co-Design

    DNN layer sizes: fixed.  HW resources and mapping (dataflow): searched together.
    HW/SW co-design: search HW configurations + mappings.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 36

  • Summary of MAESTRO

    • Precise specification of dataflows using a data-centric approach
    • Analytical model for analyzing reuse => performance, memory, interconnect, energy
    • Use for HW design-space or mapping-space exploration, or HW-SW co-design
    • Validation ongoing…

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 37

  • Outline of Talk

    • How do we map billions of computations over limited compute and memory resources?
    • How do we design an accelerator to efficiently map arbitrary layer types and dataflows?

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 38

  • Myriad Dataflows in DNN Accelerators

    • DNN topologies
      • Layer size / shape
      • Layer types: Convolution / Pool / FC / LSTM
      • New sub-structures: e.g., Inception in GoogleNet
    • Compiler/Mapper (e.g., MAESTRO)
      • Loop scheduling: reordering and tiling
      • Mapping: output/weight/input/row-stationary
    • Algorithmic optimization
      • Weight pruning: sparse workloads

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 39

  • The current trend for supporting multiple dataflows

    • New dataflow → new accelerator
      • Data reuse: FlexFlow (2017), Eyeriss (2016), ...
      • Cross-layer: Fused CNN (2016)
      • Sparse CNN: SCNN (2017), EIE (2016), ...
      • LSTM: ESE (2017), ...

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 40

    Can we have one architectural solution that can handle arbitrary dataflows and provides ~100% utilization?

  • What is the computation in a DNN?

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 41

    [Figure: a CONV layer slides a filter (W0, W1, W2, ...) over the input activations to produce output activations; each output is a neuron computing the weighted sum Out = Σ(Wi·Xi) over inputs X0..Xk and weights W0..Wk]

    Computing the weighted sum = independent multiplications + accumulation of partial products.

    Our key insight: each dataflow translates into neurons of different sizes.

  • Irregular Dataflow: Pruning

    Example: weight pruning (sparse workload). Pruning removes weights between layer 1 and layer 2.

    [Figure: a dense neuron Out = Σ(Wi·Xi) over (X0, X1, X2) becomes, after removing W1, a smaller neuron over (X0, X2) only]

    Our key insight: each dataflow translates into neurons of different sizes.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 42

  • The MAERI Abstraction

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 43

    [Figure: a prefetch buffer feeding a pool of multipliers and a pool of adders; weights/inputs flow in and outputs flow out; the compute units are grouped into virtual neurons VN0, VN1, VN2]

    Virtual Neuron (VN): a temporary grouping of compute units for one output, e.g. “MultAlloc(3); AddAlloc(2)” or “MultAlloc(2); AddAlloc(1)”.

    How to enable flexible grouping? Need flexible connectivity!
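
    As an illustration of the abstraction (a sketch of the idea, not MAERI’s controller code), the snippet below carves virtual neurons of requested sizes out of fixed multiplier and adder pools, in the spirit of the MultAlloc/AddAlloc calls on this slide; an n-input neuron needs n multipliers and n-1 adders.

    /* Allocate virtual neurons (VNs) of requested sizes from fixed multiplier
     * and adder pools. Illustrative sketch; not MAERI's actual interface. */
    #include <stdio.h>

    #define NUM_MULTS 8
    #define NUM_ADDS  7

    int main(void) {
        int vn_sizes[] = {3, 2, 3};        /* inputs (non-zero weights) per VN */
        int num_vns = (int)(sizeof vn_sizes / sizeof vn_sizes[0]);
        int next_mult = 0, next_add = 0;

        for (int vn = 0; vn < num_vns; vn++) {
            int mults = vn_sizes[vn];      /* "MultAlloc(n)"  */
            int adds  = vn_sizes[vn] - 1;  /* "AddAlloc(n-1)" */
            if (next_mult + mults > NUM_MULTS || next_add + adds > NUM_ADDS) {
                printf("VN%d does not fit; map it in a later temporal pass\n", vn);
                break;
            }
            printf("VN%d: %d multipliers starting at %d, %d adders starting at %d\n",
                   vn, mults, next_mult, adds, next_add);
            next_mult += mults;
            next_add  += adds;
        }
        return 0;
    }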

  • Naïve Approach: Full Crossbar

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 44

    [Figure: the multiplier and adder pools connected by a full crossbar]

    Wire overhead = O(n^2)

    Need “specialization” in the interconnection network for the traffic in DNN accelerators.

  • Traffic Patterns in DNN Accelerators*

    * H. Kwon et al., Rethinking NoCs for Spatial DNN Accelerators, NOCS 2017

    * GB: global buffer; NoC: Network-on-Chip (interconnection network); PE: processing element (compute units)

    One-to-Many (Distribution): GB → PEs, e.g., input and weight distribution to PEs
    Many-to-One (Collection/Reduction): PEs → GB, e.g., partial sum and output reduction
    One-to-One (Forwarding): PE → neighboring PE, e.g., input/weight/partial-sum forwarding

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 45

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 46

    [Figure: the MAERI microarchitecture. From/to DRAM, weights and inputs enter a distribution tree (simple 1x2 switches) feeding a row of multiplier switches; their products are reduced in an augmented reduction tree of adder switches and passed through activation units to produce outputs; an accelerator controller with a lookup table configures the fabric according to the dataflow received from the CPU]

    Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna, “MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects,” ASPLOS 2018; IEEE Micro Top Picks 2019 Honorable Mention

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 47

    (MAERI block diagram as on the previous slide.)

    Distribution Network:
    • Spatial reuse via multicasts
    • High bandwidth via fat links

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 48

    (MAERI block diagram as on slide 46.)

    Local FIFOs for temporal reuse, i.e., “stationary” data

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 49

    (MAERI block diagram as on slide 46.)

    Linear Local Network:
    • Forwarding of weights
    • Spatio-temporal reuse

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 50

    (MAERI block diagram as on slide 46.)

    Reduction Network:
    • High bandwidth via fat links
    • Provably non-blocking
    • Reductions via forwarding links

  • The MAERI Implementation

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 51

    (MAERI block diagram as on slide 46, with all four mechanisms together.)

    Distribution Network: spatial reuse via multicasts; high bandwidth via fat links
    Local FIFOs: temporal reuse, i.e., “stationary” data
    Linear Local Network: forwarding of weights; spatio-temporal reuse
    Reduction Network: high bandwidth via fat links; provably non-blocking; reductions via forwarding links

  • The MAERI Implementation: Micro-Switches

    (MAERI block diagram as on slide 46.)

    Distribute Switch (1x2 switch): Data_In → Left_Out, Right_Out
    Multiplier Switch (multiplier + 2x2 switch): Data_In, Fwd_In → Fwd_Out, Data_Out
    Adder Switch (adder (+/>) + 3x2 switch): Input_L, Input_R, Fwd_In → Fwd_Out, Output_Up

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 52

  • Example: Computing a CONV layer

    • [Communication] Distribute weights and inputs (image pixels) to multiplier switches
      (assume: weight stationary, convolutional reuse of inputs via local links)
    • [Computation] Compute partial sums
    • [Computation] Reduce partial sums
    • [Communication] Collect outputs into the buffer

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 53

  • MAERI Operation Example

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 54

    Sparse weight filter (one weight pruned to 0):
        W00 W01 W02
        W10 W11  0

    Input activation (4x4): X00..X33.  Output activation (4x4): O00..O33.  The filter slides over the input.

    O00 = W00·X00 + W01·X01 + W02·X02 + W10·X10 + W11·X11
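
    A tiny sketch of the single-output computation above (with made-up weight and input values): the pruned weight is simply skipped, which is exactly what mapping this output onto a 5-wide virtual neuron achieves.

    /* Compute O00 for the sparse 2x3 filter above: the zero weight in the
     * bottom-right position is skipped, so only 5 multiplies feed the sum. */
    #include <stdio.h>

    int main(void) {
        double W[2][3] = {{1, 2, 3}, {4, 5, 0}};          /* last weight pruned */
        double X[4][4] = {{1, 1, 1, 1}, {2, 2, 2, 2},
                          {3, 3, 3, 3}, {4, 4, 4, 4}};    /* input activations  */
        double o00 = 0;
        for (int r = 0; r < 2; r++)
            for (int c = 0; c < 3; c++)
                if (W[r][c] != 0)                         /* skip pruned weight */
                    o00 += W[r][c] * X[r][c];
        printf("O00 = %g\n", o00);                        /* 1+2+3+8+10 = 24    */
        return 0;
    }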

  • MAERI Operation Example: Virtual Neuron Construction

    (MAERI block diagram as on slide 46; sparse weight filter from the previous slide.)

    VN size = 5 (five non-zero weights per output).
    The controller configures the switches.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 55

  • MAERI Operation Example: Weight Distribution

    (MAERI block diagram as on slide 46; the five non-zero weights of the sparse filter are sent down the distribution tree to the multiplier switches.)

    Distribution bandwidth is tunable. Suppose BW = 3.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 56

  • MAERI Operation Example: Weight Distribution (continued)

    (Weight distribution continues at the configured bandwidth of 3 values per cycle until all five non-zero weights are in place.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 57

  • MAERI Operation Example: Input Distribution

    Inputs delivered to the three virtual neurons:
    VN0: X00 X01 X02 X10 X11    VN1: X10 X11 X12 X20 X21    VN2: X20 X21 X22 X30 X31

    Utilize multicast to reduce latency and energy (neighboring VNs share inputs such as X10, X11, X20, X21).

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 58

  • MAERI Operation Example: Partial Sum Reduction

    (The augmented reduction tree reduces the partial sums of each virtual neuron; the three VNs are reduced simultaneously.)

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 59

  • MAERI Operation Example: Sliding Window

    The filter window slides: weights stay stationary in the multiplier switches, and inputs are partially reused via forwarding on the local links; only the new inputs for the next window (X03, X12, X13, X22, X23, X32) are distributed. Repeat steps 4-5 for each window position.

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 60

  • Mapping optimal dataflows for MAERI

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 61

    [Figure: the MAERI mapper, mRNA, takes the neurons of a deep neural network, searches for the optimal dataflow, and emits dataflow configurations that map virtual neurons (VN0, VN1, VN2, ...) onto the multiplier/adder substrate, reaching ~100% utilization]

    Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna, “MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects,” ASPLOS 2018; IEEE Micro Top Picks 2019 Honorable Mention
    Z. Zhao, H. Kwon, S. Kuhar, W. Sheng, Z. Mao, T. Krishna, “Efficient Mapping Space Exploration on a Reconfigurable Neural Accelerator,” ISPASS 2019

  • Example Mapping – Dense CNN

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 62

    Our key insight: each dataflow translates into neurons of different sizes.

    [Figure: for a dense CNN layer, equal-sized virtual neurons (VN0, VN1, VN2) are carved out of the multiplier/adder fabric; weights/inputs stream in and partial outputs stream out of each VN]

  • Example Mapping – Sparse DNN

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 63

    Our key insight: each dataflow translates into neurons of different sizes.

    [Figure: for a sparse (pruned) DNN layer, virtual neurons of different sizes are allocated across the fabric, matching the non-zero weights of each output]

  • Example Mapping – LSTM/FC

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 64

    Our key insight: each dataflow translates into neurons of different sizes.

    [Figure: mapping an LSTM/fully-connected layer onto the fabric; virtual neurons are sized to the layer’s long dot products]

  • Example - Impact of Mappings

    Metric                             Mapping 1   Mapping 2   Mapping 3   Mapping 4   Mapping 5
    DN BW requirement for input         16 / 8      8 / 4       16 / 8      8 / 4       9
    DN BW requirement for weight        16          16          8           8           4
    RN BW requirement                   1           2           2           4           4
    Number of DS accesses for weight    64          64          64          64          56
    Number of DS accesses for input     64 / 32     64 / 32     64 / 32     64 / 32     42
    Number of reduces                   15          14          14          12          12
    Number of RS accesses               30          28          28          24          24
    Number of MS accesses               8           8           8           8           0
    Number of iterations                36          41          41          45          73
    Peak utilization rate               100%        100%        100%        100%        100%
    Average utilization rate            100%        100%        100%        100%        56%

    Callouts: best performance (fewest iterations); least DN bandwidth; least RN bandwidth; low utilization rate (Mapping 5).

    Z. Zhao, H. Kwon, S. Kuhar, W. Sheng, Z. Mao, T. Krishna, “Efficient Mapping Space Exploration on a Reconfigurable Neural Accelerator,” ISPASS 2019

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 65

  • End-to-End Performance

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 66

    [Figure: VGG16 end-to-end runtime (ms), per layer CONV1..CONV13, comparing MAERI with feature-map parallelism, MAERI with channel parallelism, MAERI with an adaptive dataflow, and Eyeriss with array partitioning; y-axis 0-4500 ms]

  • Summary of MAERI

    • DNN models are evolving rapidly
      • Multiple layer types
      • Sparsity optimizations
      • Myriad dataflows for scheduling and mapping
    • MAERI enables dynamic grouping of an arbitrary number of MACs (a “Virtual Neuron”) via reconfigurable, non-blocking interconnects, providing
      • Future-proofing against new DNN models and dataflows
      • Near-100% compute unit utilization

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 67

  • Takeaways

    March 24, 2019, FastPath Workshop | Tushar Krishna | Georgia Institute of Technology 68

    AI will be pervasive.

    [Figure: recap of the talk: a convolutional neural network (weights, inputs, outputs over the 7D loop nest) is mapped onto a 2D hardware array of PEs; an analytical model (MAESTRO) analyzes DNN dataflows; and a DNN accelerator with configurable interconnects (MAERI) can map irregular dataflows such as pruned neurons]

    Thank you!

    http://synergy.ece.gatech.edu/tools/maestro/

    http://synergy.ece.gatech.edu/tools/maeri/

