
Extending the RISC-V ISA for Optimized Support of CNNs in a Multi-Core Context

RISC-V Summit Dec 3-6 2018 Santa Clara

Eric Flamand, CTO & CoFounder of Greenwaves Technologies

Who are we?

• French-based startup created in 2015
• First product, GAP8, launched in Feb 2018


Our Market Vision

The IoT pipe: NB-IoT, LTE-M, Sigfox, LoRa, etc.
• B/day to kB/day
• Battery operated sensors

Market demand: rich sensor data
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s
• 24-bit @ 50kHz = 1.2 Mbit/s
• Linear PCM = 1.4 Mbit/s

• Keyword spotting, beam forming, speech pre-processing
• Vibration analysis, fault detection
• Face detection, presence detection, counting, emotion detection

Our Market Vision

The IoT pipe: NB-IoT, LTE-M, Sigfox, LoRa, etc.
• B/day to kB/day
• Battery operated sensors

Market demand: rich sensor data
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s
• 24-bit @ 50kHz = 1.2 Mbit/s
• Linear PCM = 1.4 Mbit/s

• CNN, SVM, Bayesian, Boosting, Cepstral analysis

Market demand + Low operation cost + Low deployment cost + Low installation cost = Massive deployment of intelligent rich data sensors

Issue: this needs way more MIPS than an MCU can deliver, but has to stay within an MCU power envelope.

GAP8 An IOT Application Processor


[Block diagram. FC clock & voltage domain: Fabric Controller (L1, ROM, I$, Debug), L2 Memory, PMU, RTC, Micro DMA and peripherals (LVDS, Serial I/Q, UART, SPI, I2C, I2S, CPI, HyperBus, GPIO / PWM). Cluster clock & voltage domain: Core 0 to Core 7, Shared L1 Memory, Shared Instruction Cache, Logarithmic Interconnect, Cluster DMA, HW Sync, HWCE, Debug.]

Two independent clock and voltage domains, from 0-133MHz/1V up to 0-250MHz/1.2V

MCU function
• Extended RISC-V core
• Extensive I/O set
• Micro DMA
• Embedded DC/DC converters
• Secured execution / e-fuses

Computation engine function
• 8 extended RISC-V cores
• Fully programmable
• Efficient parallelization
• Shared instruction cache
• Multi channel DMA
• HW synchronization
• HW convolution engine (3 * 3x3)

An integrated, hierarchical architecture
• Deep sleep: 1uA
• Retentive: 1µA+x*8µA
• Pre-analysis: 1mWs
• Inference: few 10mWs

TSMC 55LP, 1.0V to 1.2V, Max Freq: 133 MHz to 250 MHz

Gap8 The open source heritage


- Best in class Instruction Set Architecture (ISA)
- UC Berkeley originated
- GWT member of the RiscV Foundation

- Open Source Computing Platform created by ETHZ and UniBo
- Permissive license (solderpad)
- Multiple tape outs
- GWT contributes to PULP

GreenWaves
- Innovating on Risc-V and PULP
- Proprietary balanced system solution (SOC) based on PULP open source elements plus GWT proprietary elements, on both the HW and the SW/Tools side


Extending the ISA – Impact on CNN centric applications

ISA Extension(s)


Given our 4-stage, in-order pipeline, how can we increase ILP with a moderate gate count increase for a given application family?

Group 1: Loop Kernels
• Zero overhead HW loop
• Post modified load/store, Reg/Reg load/store

Group 2: DSP/Linear Algebra
• Mac/Msu with optional normalization and rounding
• Add/sub/mult with optional normalization and rounding
• Clip, Min, Max, Abs

Group 3: Bit manipulation
• Insert/extract/set/clear/findfirst/findlast/countleadingbits/rotation
• PopulationCount

Group 4: Vectorial/SIMD, 4 Bytes or 2 Half Words
• Add/sub/avg/min/max/abs/shift/logical
• Shuffle/insert/extract/pack
• Dot product/sum of dot products

Group 5: Complex Numbers, Trellis
• Product/Conjugate/Rotation
• Max path/path selection
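To make the Group 2 and Group 4 entries more concrete, here is a minimal sketch in portable C of what a signed multiply-accumulate and a 4 x 8-bit "sum of dot products" compute. It is a behavioural model written for illustration, not the instruction definitions: the function names macs, sdotsp_b, lane_s8 and pack4 are hypothetical, and the lane ordering (lane 0 in the low byte) is an assumption.

```c
#include <stdint.h>

/* Assumed model of a signed multiply-accumulate (Group 2):
   returns acc + a * b, computed on 32 bits. */
static int32_t macs(int32_t acc, int16_t a, int16_t b) {
    return acc + (int32_t)a * (int32_t)b;
}

/* Hypothetical helper: read byte lane 0..3 of a word as a signed value. */
static int32_t lane_s8(uint32_t word, int lane) {
    int32_t v = (int32_t)((word >> (8 * lane)) & 0xFFu);
    return v >= 128 ? v - 256 : v;
}

/* Assumed model of a 4 x 8-bit signed sum of dot products (Group 4):
   returns acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]. */
static int32_t sdotsp_b(int32_t acc, uint32_t a, uint32_t b) {
    for (int lane = 0; lane < 4; lane++)
        acc += lane_s8(a, lane) * lane_s8(b, lane);
    return acc;
}

/* Hypothetical helper: pack four signed bytes, lane 0 in the low byte. */
static uint32_t pack4(int8_t v0, int8_t v1, int8_t v2, int8_t v3) {
    return (uint32_t)(uint8_t)v0 |
           ((uint32_t)(uint8_t)v1 << 8) |
           ((uint32_t)(uint8_t)v2 << 16) |
           ((uint32_t)(uint8_t)v3 << 24);
}
```

With this model, one sdotsp_b call replaces four byte loads per operand and four macs calls, which is where the vectorized byte kernels get most of their gain.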

PULP

Greenwaves

Performance and Power Measurement


Impact on CNN

• Selected layers
  • Convolution, fixed point. Will use 5x5
  • Convolution, binary. Will use 5x5
  • Max pooling. Will use 2x2 with stride 2
  • Average pooling. Will use 2x2 with stride 2
  • Linear

• Vectorization impact
  • Qx.y <= 15: vector of 2 signed short int
  • Qx.y <= 7: vector of 4 signed bytes

• Compared configurations
  • Pure RiscV
  • Gap8 without vector (Groups 1, 2 and 3)
  • Gap8 with vector (Groups 1, 2, 3 and 4)
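The vectorization rule above can be sketched in C as follows. This is an illustration under stated assumptions: "Qx.y <= 15" is read here as "x integer bits plus y fractional bits, sign excluded, fit in 15 bits", and the helper names simd_lanes and q_mul_round are hypothetical. The second function shows the "normalization and rounding" step that follows a fixed-point multiply.

```c
#include <stdint.h>

/* Lanes available in a 32-bit register for a Qx.y value, following the
   rule above: x + y <= 7 fits a signed byte (4 lanes), x + y <= 15 fits
   a signed short (2 lanes). Wider formats are treated as scalar here. */
static int simd_lanes(int x, int y) {
    int bits = x + y;            /* magnitude bits, sign bit excluded */
    if (bits <= 7)
        return 4;
    if (bits <= 15)
        return 2;
    return 1;
}

/* Multiply two values that each carry y fractional bits, then normalize
   the product back to y fractional bits with round-to-nearest.
   Relies on >> being an arithmetic shift for negative values, which is
   implementation-defined in C but true on common compilers. */
static int32_t q_mul_round(int32_t a, int32_t b, int y) {
    int32_t prod = a * b;                        /* 2*y fractional bits */
    int32_t half = (y > 0) ? (1 << (y - 1)) : 0; /* half an output LSB */
    return (prod + half) >> y;
}
```

For example, in Q0.7 the value 0.5 is stored as 64, and q_mul_round(64, 64, 7) gives 32, which is 0.25 in the same format.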

Convolution


    for (int c=0; c<(W-4); c++) {
        for (int l=0; l<(H-4); l++) {
            int R = Out[l*W+c]<<Norm;
            for (int kl=0; kl<5; kl++)
                for (int kc=0; kc<5; kc++)
                    R += Filter[kl*5+kc]*In[(l+kl)*W + c+kc];
            Out[l*W+c] = R>>Norm;
        }
    }

RiscV (x5)

    .L28:
        lb t5,0(a6)
        lb t4,0(a1)
        lb t3,2(a1)
        lb s10,2(a6)
        lb t1,4(a1)
        lb s8,4(a6)
        lb a7,6(a1)
        lb s7,6(a6)
        mul t4,t4,t5
        lb s6,8(a6)
        lb t5,8(a1)
        add a1,a1,10
        add a6,a6,t6
        mul t3,t3,s10
        add t4,t4,s9
        mul t1,t1,s8
        add t3,t3,t4
        mul a7,a7,s7
        add t1,t1,t3
        mul t5,t5,s6
        add a7,a7,t1
        add s9,t5,a7
        bne t0,a1,.L28

Gap8 NoVect (x5)

        lp.setup x1,t3,(.L242)
        lb t5,0(t1)
        lb t4,0(a7)
        lb s10,1(t1)
        lb s9,1(a7)
        p.macs a6,t5,t4
        lb s8,2(t1)
        lb s7,2(a7)
        lb s6,3(t1)
        lb t6,3(a7)
        lb t5,4(t1)
        lb t4,4(a7)
        add t1,t1,5
        add a7,a7,a1
        p.macs a6,s10,s9
        p.macs a6,s8,s7
        p.macs a6,s6,t6
    .L242:
        p.macs a6,t5,t4

Gap8 Vect

        lp.setup x1,s2,(.L68)
        lb a4,0(s3)
        p.lw a7,a1(s4!)
        p.lb t1,a1(s5!)
        sll a4,a4,a5
        pv.sdotsp.b a4,t3,t6
        pv.sdotsp.b a4,a6,t0
        pv.sdotsp.b a4,a0,t2
        pv.sdotsp.b a4,a2,s0
        pv.sdotsp.b a4,a7,s1
        pv.sdotsp.b a4,a3,t4
        pv.sdotsp.b a4,t1,t5
        sra a4,a4,a5
        p.sb a4,s7(s3!)
        mv t3,a6
        pv.shuffle2.b a3,t1,s6
        mv a6,a0
        mv a0,a2
    .L68:
        mv a2,a7

Binary Convolution


        lp.setup x1,t3,(.L286)
        lhu a5,0(a5)
        p.bclr a2,a7,28,3
        or a2,a2,160
        p.extractur a5,a5,a2
        p.insert a6,a5,5,24
        srl a5,a6,1
        xor a5,a5,t4
        xor a2,t4,a6
        not a5,a5
        not a2,a2
        and a5,a5,t0
        lhu s3,0(t1)
        and a2,a2,t0
        p.cnt a5,a5
        p.cnt a2,a2
        sll a5,a5,8
        or a5,a5,a2
        pv.add.b a5,s3,a5
        add a7,a7,a3
        p.sh a5,t6(t1!)
        srl a6,a6,6
    .L286:
        srl a5,a7,3

On pure RiscV, PopCount does not use a naïve implementation, but it still costs approx. 15 cycles
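For reference, a well-known non-naïve software population count is the branch-free bit-parallel (SWAR) form below, shown together with the XNOR-and-count step that a binary convolution reduces to. This is a sketch for illustration: the slides do not say which software PopCount was measured, and the names popcount32 and xnor_match_count are hypothetical.

```c
#include <stdint.h>

/* Bit-parallel population count: sums bits in 2-bit, 4-bit and 8-bit
   groups, then adds the four byte counts with one multiply. */
static unsigned popcount32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    return (unsigned)((v * 0x01010101u) >> 24);
}

/* Binary "dot product" of n (1..32) one-bit inputs and one-bit weights,
   packed one per bit: counts the positions where input and weight agree. */
static unsigned xnor_match_count(uint32_t in, uint32_t w, unsigned n) {
    uint32_t mask = (n >= 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    return popcount32(~(in ^ w) & mask);
}
```

A 5x5 binary kernel fits in 25 bits, so one xnor_match_count(in, w, 25) call covers a whole output. In software the count alone takes about a dozen ALU operations plus constant loads, whereas the Gap8 listing above does it with a single p.cnt.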

Convolution


Cycles per output         RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short Convolution 5x5     135.1    100.2     40.2           5.3
Short Xnor Conv 5x5       29.9     11.4      11.4           1.5
Byte Convolution 5x5      135.1    98.2      19.1           2.5
Byte Xnor Conv 5x5        29.9     11.4      11.4           1.5

pJ per output             RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short Convolution 5x5     9674.2   7677.4    3008.9         1384.7
Short Xnor Conv 5x5       2379.8   864.4     864.4          374.3
Byte Convolution 5x5      9674.2   7524.2    1430.2         653.8
Byte Xnor Conv 5x5        2379.8   864.9     864.9          373.8
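The gains implied by these tables are simple ratios. The small C sketch below shows the arithmetic; the function names gain and parallel_efficiency are hypothetical and the figures in the note that follows are copied from the Byte Convolution 5x5 rows.

```c
/* Ratio of a baseline cost (cycles or pJ) to an improved cost. */
static double gain(double baseline, double improved) {
    return baseline / improved;
}

/* Fraction of ideal n-core scaling actually reached:
   (single-core cost / n-core cost) / n, where 1.0 is perfect scaling. */
static double parallel_efficiency(double one_core, double n_core, int n) {
    return (one_core / n_core) / (double)n;
}
```

For Byte Convolution 5x5, gain(135.1, 19.1) is about 7.1 for the extensions on one core, parallel_efficiency(19.1, 2.5, 8) is about 0.95 for the step to 8 cores, and gain(1430.2, 653.8) is about 2.2 for the energy saved by going parallel.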

Pooling


Max Pooling

    for (int c=0; c<Wo; c++)
        for (int l=0; l<Ho; l++)
            Out[l*Wo+c] = Max(Max(In[2*l*W+2*c], In[2*l*W + 2*c+1]),
                              Max(In[(2*l+1)*W+2*c], In[(2*l+1)*W + 2*c+1]));

RiscV

    .L14:
        lbu a5,1(a2)
        lbu t4,0(a2)
        add a7,a7,1
        sll a6,a5,24
        sll t3,t4,24
        sra a6,a6,24
        sra t3,t3,24
        bge a6,t3,.L11
        mv a5,t4
        mv a6,t3
    .L11:
        lbu t3,1(a4)
        add a2,a2,t5
        sll t4,t3,24
        sra t4,t4,24
        bge a6,t4,.L12
        mv a5,t3
    .L12:
        lbu a6,0(a4)
        sll t4,a5,24
        sra t4,t4,24
        sll t3,a6,24
        sra t3,t3,24
        bge t4,t3,.L13
        mv a5,a6
    .L13:
        sb a5,0(t1)

Gap8 No Vect

        lp.setup x1,a6,(.L216)
        p.lb a5,s2(a7!)
        p.lb t0,s2(t4!)
        p.lb a4,s2(t5!)
        p.lb t6,s2(t3!)
        p.max a5,a5,t0
        p.max a4,a4,t6
        p.max a5,a5,a4
    .L216:
        p.sb a5,s0(t1!)

Gap8 Vect

        lp.setup x1,a3,(.L29)
        p.lh a4,t6(a0!)
        p.lh a5,t6(a2!)
        pv.max.b a5,a5,a4
        pv.extract.b a4,a5,0
        pv.extract.b a5,a5,1
        p.max a5,a4,a5
    .L29:
        p.sb a5,t5(a1!)

Average Pooling

    for (int c=0; c<Wo; c++)
        for (int l=0; l<Ho; l++)
            Out[l*Wo+c] = (In[2*l*W+2*c] + In[2*l*W + 2*c+1] +
                           In[(2*l+1)*W+2*c] + In[(2*l+1)*W + 2*c+1])>>2;

RiscV

    .L21:
        lb a5,0(a4)
        lb t5,1(a4)
        lb t4,0(a2)
        lb t3,1(a2)
        add a5,a5,t5
        add a5,a5,t4
        add a5,a5,t3
        sra a5,a5,2
        sb a5,0(a7)
        add a6,a6,1
        add a4,a4,t1
        add a2,a2,t1
        add a7,a7,t0
        bne t6,a6,.L21

Gap8 No Vect

        lp.setup x1,a4,(.L227)
        p.lb a5,s2(a6!)
        p.lb t0,s2(t3!)
        p.lb t6,s2(t4!)
        p.lb t5,s2(t1!)
        add a5,a5,t0
        add a5,a5,t6
        p.addN a5,a5,t5,2
    .L227:
        p.sb a5,s0(a7!)

Gap8 Vect

        lp.setup x1,a4,(.L39)
        p.lh a5,t0(a6!)
        p.lh t3,t0(t1!)
        pv.shuffle2.b a5,t3,t4
        pv.dotsp.sci.b a5,a5,1
        sra a5,a5,2
    .L39:
        p.sb a5,t6(a7!)

Pooling


Cycles per output         RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short 2x2/2 Max Pool      32.2     8.3       8.2            1.2
Short 2x2/2 Avg Pool      16.2     8.3       6.2            1.1
Byte 2x2/2 Max Pool       32.2     8.3       8.2            0.9
Byte 2x2/2 Avg Pool       16.2     8.3       7.2            1.1

pJ per output             RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short 2x2/2 Max Pool      2268.7   598.1     597.2          293.0
Short 2x2/2 Avg Pool      1179.9   650.3     456.2          253.1
Byte 2x2/2 Max Pool       2268.7   597.5     596.8          215.0
Byte 2x2/2 Avg Pool       1179.4   650.4     531.3          254.1

Linear


    for (int i=0; i<H; i++) {
        int R = Out[i]<<Norm;
        for (int j=0; j<(W>>2); j++) {
            R += In[4*j]*Filter[W*i+4*j];
            R += In[4*j+1]*Filter[W*i+4*j+1];
            R += In[4*j+2]*Filter[W*i+4*j+2];
            R += In[4*j+3]*Filter[W*i+4*j+3];
        }
        for (int j=4*(W>>2); j<W; j++) R += In[j]*Filter[W*i+j];
        Out[i] = R>>Norm;
    }

RiscV

    .L41:
        lb t3,0(a2)
        lb s5,0(a6)
        lb t1,1(a2)
        lb s4,1(a6)
        p.mul t3,t3,s5
        lb a7,2(a2)
        lb s3,2(a6)
        lb s1,3(a2)
        lb s2,3(a6)
        add a2,a2,4
        add a6,a6,4
        p.mul t1,t1,s4
        add t3,t3,t4
        p.mul a7,a7,s3
        add t1,t1,t3
        p.mul t3,s1,s2
        add a7,a7,t1
        add t4,t3,a7
        bne t5,a2,.L41

Gap8 NoVect

        lp.setup x1,t3,(.L259)
        lb t5,0(a7)
        lb t4,0(t1)
        lb s4,1(a7)
        lb s3,1(t1)
        p.macs a6,t5,t4
        lb s2,2(a7)
        lb s1,2(t1)
        lb t5,3(a7)
        lb t4,3(t1)
        add a7,a7,4
        add t1,t1,4
        p.macs a6,s4,s3
        p.macs a6,s2,s1
    .L259:
        p.macs a6,t5,t4

Gap8 Vect

        lp.setup x1,t5,(.L99)
        p.lw t4,8(t6!)
        p.lw t3,8(t0!)
        p.lw t1,8(a2!)
        p.lw a7,8(t2!)
        pv.sdotsp.b a6,t4,t3
    .L99:
        pv.sdotsp.b a6,t1,a7

Linear


Cycles per sum of product   RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short Linear                5.3      3.0       1.5            0.3
Byte Linear                 5.3      3.0       0.8            0.2

pJ per sum of product       RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short Linear                382.5    241.3     122.7          64.2
Byte Linear                 382.3    241.0     62.3           34.6

Summary - Performance


Average speed gain from the extensions: 3.6

Convolution: 80% of CNN workload

Summary – Energy Efficiency


Average energy gain from the extensions: 3.4

Convolution: 80% of CNN workload


Memory Management

Handling large networks with minimal energy overhead

Memory Management


[Diagram: memory hierarchy. Shared L1, L2 and External L3 (Ram/Flash), linked by DMA and uDMA; core counts 1 and 8.]

• Gap8 is not equipped with data caches
  • Silicon area
  • More importantly, energy efficiency, mostly due to hit ratio
• We can turn this weakness into an (energy) benefit if we can automate data transfers
• In practice a vast majority of traffic is predictable

[Timeline: Exec, L2 to L1, L3 to L2]

Automatic data tiling and pipelined memory transfers, interleaved with parallel calls to compute kernels, are handled by our “Autotiler” tool
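The idea of tiling with pipelined transfers can be sketched in plain C as a double-buffered loop: while the kernel works on one L1 buffer, the next tile is fetched into the other. This is a hand-written illustration, not Autotiler output. memcpy stands in for an asynchronous DMA transfer, and TILE_ELEMS, fetch_tile, kernel_sum and tiled_sum are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_ELEMS 64   /* hypothetical tile size fitting the L1 budget */

/* Stand-in for a DMA transfer from L2 to L1. A real DMA would return
   immediately and be waited on before the buffer is read. */
static void fetch_tile(int8_t *l1, const int8_t *l2, size_t n) {
    memcpy(l1, l2, n);
}

/* Example basic kernel: accumulate one tile. */
static int32_t kernel_sum(const int8_t *tile, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += tile[i];
    return acc;
}

/* Number of valid elements in tile t (the last tile may be partial). */
static size_t tile_len(size_t total, size_t t) {
    size_t left = total - t * TILE_ELEMS;
    return left < TILE_ELEMS ? left : TILE_ELEMS;
}

/* Process `total` elements living in L2 through two L1 buffers. */
static int32_t tiled_sum(const int8_t *l2_data, size_t total) {
    int8_t l1[2][TILE_ELEMS];
    size_t ntiles = (total + TILE_ELEMS - 1) / TILE_ELEMS;
    int32_t acc = 0;

    if (ntiles == 0)
        return 0;
    fetch_tile(l1[0], l2_data, tile_len(total, 0));   /* prologue */
    for (size_t t = 0; t < ntiles; t++) {
        size_t cur = t & 1;
        if (t + 1 < ntiles)   /* start fetching the next tile */
            fetch_tile(l1[cur ^ 1], l2_data + (t + 1) * TILE_ELEMS,
                       tile_len(total, t + 1));
        acc += kernel_sum(l1[cur], tile_len(total, t)); /* compute */
    }
    return acc;
}
```

Because the access pattern is known ahead of time, every fetch can be issued one step early, which is what lets explicit transfers replace a data cache without stalling the compute kernel.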

AutoTiler


Basic Kernels (usually seen as libraries)

How to handle a parametric tile
• Vectorization + Parallelization
• No assumption on where actual data are located

User Kernels (can be grouped and organized as generators)

Passing actual data to basic kernels and having data circulate between them
• A multi-dimensional iteration space (2D, 3D, 4D, 5D, ...) and a traversal order
• Each argument is a subspace of the iteration space and has actual dimensions, a location (L2, external) and properties. Its order may differ from that of the iteration space
• Given a memory budget, the auto tiler “tiles” each argument and generates a fully pipelined implementation interleaving processing and data transfers
• Basic Kernels are inserted at defined locations in the iteration space (prologue, body, epilogue, ...)
• Generated tiles are passed to Basic Kernels

AutoTiler


Basic Kernels, User Kernels, Group of User Kernels, Generators

C Programs, calls to Autotiler’s Model API

C Libraries

Autotiler Library (Constraints Solver, C Code Generator)

Compile & Run on PC

C code for the target handling data transfers and Basic Kernels dispatch on cluster’s cores

    #include "AutoTilerLib.h"
    #include "CNN_Generator.h"

    void Mnist()
    {
        CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5, 1, 32, 28, 28, 1);
        CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1);
        CNN_TiledLinearLayer("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0);
    }


On real life networks

Key Word Spotting


[Power trace. CNN on HWCE: avg power 8.79mW, duration 58ms. MFCC on FC: avg power 3.3mW, duration 170ms.]

Processing of 1 second of voice data at 1.0V:

CNN (cluster)
• SW version: 155ms, 11.8mW : 1.8mW average
• HWCE version: 58ms, 8.8mW : 509uW average

MFCC (FC)
• 170ms, 3.3mW : 560uW average

Total
• 1.07mW with HWCE
• 2.36mW in SW

Google CNN:
• Conv 8x20, MaxPool 2x2/2, 1 InFeat, 32 OutFeat, W:95, H:40
• Conv 4x10, ReLU, InFeat 32, OutFeat 32
• Linear: 10 Outs
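The per-second averages on this slide follow from duty cycling: active power times active time, divided by the 1 second period. The sketch below reproduces that arithmetic. It assumes the power drawn between bursts is negligible, and average_power_mw is a hypothetical helper name.

```c
/* Average power over one period for a block that draws active_mw
   during active_ms and (by assumption) nothing for the rest. */
static double average_power_mw(double active_ms, double active_mw,
                               double period_ms) {
    return active_mw * active_ms / period_ms;
}
```

With the slide's figures, average_power_mw(58, 8.8, 1000) is about 0.51 mW for the CNN on HWCE and average_power_mw(170, 3.3, 1000) is about 0.56 mW for the MFCC, which adds up to the quoted 1.07 mW. The software CNN gives about 1.83 mW, for a total near 2.39 mW with these rounded inputs, close to the quoted 2.36 mW.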

CNN Based Text Recognition


Trainable Par: 421 263
Neurons: 1 511 904

33ms per image

DRONET: RESNET-based Autonomous Drone


• Developed by UZH and ETH-Z
• Autonomously follows a road and avoids collisions
• Up to 18 frames per second at maximum frequency
• @1.0V, FC: 50MHz, Cluster: 100MHz: 6.5fps, 40mW

Conclusion

• Well-selected extensions can really make a difference at a very limited silicon area overhead.

• On CNNs we measure a factor of approx. 3.5 for both speed and energy efficiency on a single core.

• Parallelism brings another boost factor (24.4) on performance thanks to close-to-optimal scaling. The root cause is the architecture.

• More interestingly, parallelism contributes very significantly to the energy-per-operation improvement, with a factor of 2 on top of the ISA extension contribution, for a total of x7.4 vs. RiscV single core. Here too, the root cause is the architecture.

• These gains are further amplified by the ability to optimally manage memory transfers across the memory hierarchy.

• This enables the support of mid-complexity CNNs within an MCU-class power budget.



Thank You!
