Extending the RISC-V ISA for Optimized Support of CNNs in a
Multi-Core context
RISC-V Summit Dec 3-6 2018 Santa Clara
Eric Flamand, CTO & CoFounder of Greenwaves Technologies
Who are we?
• French based startup created in 2015
• First product, GAP8, launched in Feb 2018
12/4/2018 RISC-V Summit Dec 2018 2
Our Market Vision
The IoT pipe: NB-IoT, LTE-M, Sigfox, LoRa, etc.
B/day to kB/day
Battery operated sensors
Market demand: rich sensor data
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s
• 24-bit @ 50 kHz = 1.2 Mbit/s
• Linear PCM = 1.4 Mbit/s
Applications
• Keyword spotting, beam forming, speech pre-processing
• Vibration analysis, fault detection
• Face detection, presence detection, counting, emotion detection
CNN, SVM, Bayesian, Boosting, Cepstral analysis
Market demand + Low operation cost + Low deployment cost + Low installation cost = Massive deployment of intelligent rich data sensors
Issue: far more MIPS than an MCU can deliver, but within an MCU power envelope?
GAP8 An IOT Application Processor
[Block diagram: GAP8]
FC clock & voltage domain: Fabric Controller (L1, ROM, I$, Debug), L2 memory, PMU, RTC, Micro DMA
Peripherals: LVDS, Serial I/Q, UART, SPI, I2C, I2S, CPI, HyperBus, GPIO / PWM
Cluster clock & voltage domain: Cores 0 to 7, Shared Instruction Cache, Logarithmic Interconnect, Shared L1 Memory, Cluster DMA, HW Sync, HWCE, Debug
Two independent clock and voltage domains, from 0-133MHz/1V up to 0-250MHz/1.2V
MCU Function
• Extended RISC-V core
• Extensive I/O set
• Micro DMA
• Embedded DC/DC converters
• Secured execution / e-fuses
Computation engine function
• 8 extended RISC-V cores
• Fully programmable
• Efficient parallelization
• Shared instruction cache
• Multi channel DMA
• HW synchronization
• HW convolution Engine (3 * 3x3)
An integrated, hierarchical architecture
• Deep sleep: 1 µA
• Retentive: 1 µA + x*8 µA
• Pre-analysis: 1 mW range
• Inference: a few tens of mW
TSMC 55LP, 1.0V to 1.2V, max freq: 133 MHz to 250 MHz
Gap8 The open source heritage
- Best in class Instruction Set Architecture (ISA)
- UC Berkeley originated
- GWT member of the RISC-V Foundation
- Open source computing platform created by ETHZ and UniBo
- Permissive license (Solderpad)
- Multiple tape outs
- GWT contributes to PULP
GreenWaves
- Innovating on RISC-V and PULP
- Proprietary balanced system solution (SoC) based on PULP open source elements plus GWT proprietary elements, on both the HW and SW/tools sides
Extending the ISA – Impact on CNN centric applications
ISA Extension(s)
Given our 4-stage, in-order pipeline, how can we increase ILP with a moderate gate count increase for a given application family?
Group 1: Loop Kernels
• Zero overhead HW loop
• Post modified load/store, Reg/Reg load/store
Group 2: DSP/Linear Algebra
• Mac/Msu with optional normalization and rounding
• Add/sub/mult with optional normalization and rounding
• Clip, Min, Max, Abs
Group 3: Bit manipulation
• Insert/extract/set/clear/findfirst/findlast/countleadingbits/rotation
• PopulationCount
Group 4: Vectorial/SIMD 4 Bytes, 2 Half Words
• Add/sub/avg/min/max/abs/shift/logical
• Shuffle/insert/extract/pack
• Dot product/sum of dot products
Group 5: Complex Numbers, Trellis
• Product/Conjugate/Rotation
• Max path/path selection
PULP
Greenwaves
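To make the extension groups more concrete, here is a minimal C sketch of what a few of these operations compute. The function names, the rounding convention and the clipping range are assumptions chosen for illustration; they are not taken from the GAP8 instruction definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Group 2 (assumed semantics): multiply-accumulate, then a rounding right
   shift of 'norm' bits (the "normalization"). */
static int32_t mac_norm_round(int32_t acc, int32_t a, int32_t b, unsigned norm)
{
    assert(norm < 31);
    int32_t r = acc + a * b;
    if (norm > 0)
        r += (int32_t)1 << (norm - 1);   /* round to nearest */
    return r >> norm;                    /* arithmetic shift on mainstream compilers */
}

/* Group 2 (assumed semantics): clip x to the signed range of 'bits' bits. */
static int32_t clip_signed(int32_t x, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    int32_t hi = (int32_t)(((uint32_t)1 << (bits - 1)) - 1u);
    int32_t lo = -hi - 1;
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Read lane i (0..3) of a word holding 4 signed bytes. */
static int32_t lane8(uint32_t v, int i)
{
    int32_t b = (int32_t)((v >> (8 * i)) & 0xFFu);
    return b >= 128 ? b - 256 : b;
}

/* Group 4 (assumed semantics): dot product of two vectors of 4 signed bytes
   packed in 32-bit words, added to an accumulator. */
static int32_t sdot4_acc(int32_t acc, uint32_t va, uint32_t vb)
{
    for (int i = 0; i < 4; i++)
        acc += lane8(va, i) * lane8(vb, i);
    return acc;
}
```

Each helper stands for work that would otherwise take several base-ISA instructions (multiply, add, shift, compare and branch, or four separate multiplies), which is where the ILP gain comes from.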
Performance and Power Measurement
Impact on CNN
• Selected layers
  • Convolution, fixed point. Will use 5x5
  • Convolution, binary. Will use 5x5
  • Max pooling. Will use 2x2 with stride 2
  • Average pooling. Will use 2x2 with stride 2
  • Linear
• Vectorization impact
  • Qx.y <= 15: vector of 2 signed short int
  • Qx.y <= 7: vector of 4 signed bytes
Compared configurations
• Pure RiscV
• Gap8 without vector (Groups 1, 2 and 3)
• Gap8 with vector (Groups 1, 2, 3 and 4)
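The vectorization rule can be sketched in C as follows. This reads "Qx.y <= 15" as "x + y magnitude bits plus one sign bit fit in a 16-bit lane", which is an interpretation of the slide rather than something it states; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of Qx.y values that fit in one 32-bit register, assuming x + y
   counts the magnitude bits and one extra bit holds the sign. */
static int lanes_per_word(int x, int y)
{
    assert(x >= 0 && y >= 0);
    int bits = x + y;
    if (bits <= 7)
        return 4;   /* vector of 4 signed bytes */
    if (bits <= 15)
        return 2;   /* vector of 2 signed shorts */
    return 1;       /* scalar 32-bit value */
}

/* Convert a real value to Qx.y: scale by 2^y and round to nearest. */
static int32_t to_fixed(double v, int y)
{
    assert(y >= 0 && y < 31);
    double scaled = v * (double)(1 << y);
    return (int32_t)(scaled + (scaled >= 0 ? 0.5 : -0.5));
}
```

For example, a Q1.6 weight of 0.5 is stored as 32, and four such weights share one register in the "Gap8 with vector" configuration.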
Convolution
for (int c=0; c<(W-4); c++) {
    for (int l=0; l<(H-4); l++) {
        int R = Out[l*W+c]<<Norm;
        for (int kl=0; kl<5; kl++)
            for (int kc=0; kc<5; kc++)
                R += Filter[kl*5+kc]*In[(l+kl)*W + c+kc];
        Out[l*W+c] = R>>Norm;
    }
}
RiscV (inner loop, x5):
.L28:
    lb t5,0(a6)
    lb t4,0(a1)
    lb t3,2(a1)
    lb s10,2(a6)
    lb t1,4(a1)
    lb s8,4(a6)
    lb a7,6(a1)
    lb s7,6(a6)
    mul t4,t4,t5
    lb s6,8(a6)
    lb t5,8(a1)
    add a1,a1,10
    add a6,a6,t6
    mul t3,t3,s10
    add t4,t4,s9
    mul t1,t1,s8
    add t3,t3,t4
    mul a7,a7,s7
    add t1,t1,t3
    mul t5,t5,s6
    add a7,a7,t1
    add s9,t5,a7
    bne t0,a1,.L28

Gap8 NoVect (inner loop, x5):
    lp.setup x1,t3,(.L242)
    lb t5,0(t1)
    lb t4,0(a7)
    lb s10,1(t1)
    lb s9,1(a7)
    p.macs a6,t5,t4
    lb s8,2(t1)
    lb s7,2(a7)
    lb s6,3(t1)
    lb t6,3(a7)
    lb t5,4(t1)
    lb t4,4(a7)
    add t1,t1,5
    add a7,a7,a1
    p.macs a6,s10,s9
    p.macs a6,s8,s7
    p.macs a6,s6,t6
.L242:
    p.macs a6,t5,t4

Gap8 Vect:
    lp.setup x1,s2,(.L68)
    lb a4,0(s3)
    p.lw a7,a1(s4!)
    p.lb t1,a1(s5!)
    sll a4,a4,a5
    pv.sdotsp.b a4,t3,t6
    pv.sdotsp.b a4,a6,t0
    pv.sdotsp.b a4,a0,t2
    pv.sdotsp.b a4,a2,s0
    pv.sdotsp.b a4,a7,s1
    pv.sdotsp.b a4,a3,t4
    pv.sdotsp.b a4,t1,t5
    sra a4,a4,a5
    p.sb a4,s7(s3!)
    mv t3,a6
    pv.shuffle2.b a3,t1,s6
    mv a6,a0
    mv a0,a2
.L68:
    mv a2,a7
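For readers who want to run the reference kernel on a PC, the loop nest from this slide can be wrapped as a self-contained function. This is a sketch under assumptions: the slide does not give the element types, so signed bytes are assumed here, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 5x5 fixed-point convolution over a W x H byte image. Each output starts
   from its previous value (a bias or partial sum) scaled up by Norm bits,
   accumulates 25 products, and is scaled back down. Outputs use the same
   row stride W as the input, as in the slide's loop nest. */
static void conv5x5_byte(const int8_t *In, const int8_t *Filter, int8_t *Out,
                         int W, int H, unsigned Norm)
{
    assert(W >= 5 && H >= 5 && Norm < 16);
    for (int c = 0; c < W - 4; c++) {
        for (int l = 0; l < H - 4; l++) {
            /* Multiply instead of shifting left so that negative values
               stay well defined in C. */
            int32_t R = Out[l * W + c] * (1 << Norm);
            for (int kl = 0; kl < 5; kl++)
                for (int kc = 0; kc < 5; kc++)
                    R += Filter[kl * 5 + kc] * In[(l + kl) * W + c + kc];
            /* No saturation: the caller must pick Norm so that the result
               fits in a byte. */
            Out[l * W + c] = (int8_t)(R >> Norm);
        }
    }
}
```

The 25 multiply-accumulates per output are the work that the three listings schedule differently: 25 separate multiplies and adds, 25 fused `p.macs`, or 7 four-lane `pv.sdotsp.b`.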
Binary Convolution
    lp.setup x1,t3,(.L286)
    lhu a5,0(a5)
    p.bclr a2,a7,28,3
    or a2,a2,160
    p.extractur a5,a5,a2
    p.insert a6,a5,5,24
    srl a5,a6,1
    xor a5,a5,t4
    xor a2,t4,a6
    not a5,a5
    not a2,a2
    and a5,a5,t0
    lhu s3,0(t1)
    and a2,a2,t0
    p.cnt a5,a5
    p.cnt a2,a2
    sll a5,a5,8
    or a5,a5,a2
    pv.add.b a5,s3,a5
    add a7,a7,a3
    p.sh a5,t6(t1!)
    srl a6,a6,6
.L286:
    srl a5,a7,3
The PopCount used on RiscV is not a naïve implementation, but it still costs approx. 15 cycles
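As an illustration of why a software population count costs on the order of a dozen instructions, here is one common branch-free variant in C, together with the XNOR dot product it serves. The slide does not say which popcount variant was measured, and the +1/-1 bit encoding below is the usual binarized-network convention, assumed here rather than taken from the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free population count (classic SWAR bit trick): sum bits in
   pairs, then nibbles, then bytes, then add the four byte counts. */
static unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (unsigned)((x * 0x01010101u) >> 24);
}

/* Binary dot product over the low n bits of a and w. Assumed encoding:
   bit 1 means +1, bit 0 means -1, so each matching bit contributes +1 and
   each mismatch -1: result = 2 * popcount(xnor) - n. */
static int xnor_dot(uint32_t a, uint32_t w, unsigned n)
{
    assert(n >= 1 && n <= 32);
    uint32_t mask = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    unsigned matches = popcount32(~(a ^ w) & mask);
    return 2 * (int)matches - (int)n;
}
```

With n = 25, one call to `xnor_dot` covers a whole 5x5 binary filter window, which is why a single-cycle `p.cnt` has such a large effect on this kernel.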
Convolution
Cycles per output       RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short Convolution 5x5   135.1    100.2     40.2           5.3
Short Xnor Conv 5x5     29.9     11.4      11.4           1.5
Byte Convolution 5x5    135.1    98.2      19.1           2.5
Byte Xnor Conv 5x5      29.9     11.4      11.4           1.5

pJ per output           RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short Convolution 5x5   9674.2   7677.4    3008.9         1384.7
Short Xnor Conv 5x5     2379.8   864.4     864.4          374.3
Byte Convolution 5x5    9674.2   7524.2    1430.2         653.8
Byte Xnor Conv 5x5      2379.8   864.9     864.9          373.8
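The gains implied by these measurements are simple ratios. The sketch below copies the cycles-per-output rows from the table and derives the ratios; the struct and function names are hypothetical helpers, not part of any GAP8 tool.

```c
#include <assert.h>

/* One table row: a measurement for each of the four configurations. */
struct row {
    const char *name;
    double rv_1c, gap8_1c, gap8_1c_vect, gap8_8c_vect;
};

/* Cycles per output, copied from the convolution table. */
static const struct row cycles_per_output[] = {
    { "Short Convolution 5x5", 135.1, 100.2, 40.2, 5.3 },
    { "Short Xnor Conv 5x5",    29.9,  11.4, 11.4, 1.5 },
    { "Byte Convolution 5x5",  135.1,  98.2, 19.1, 2.5 },
    { "Byte Xnor Conv 5x5",     29.9,  11.4, 11.4, 1.5 },
};

/* Ratio of a baseline measurement to an improved one. */
static double gain(double baseline, double improved)
{
    assert(baseline > 0.0 && improved > 0.0);
    return baseline / improved;
}

/* Gain from the ISA extensions alone, on one core. */
static double isa_gain(const struct row *r)
{
    return gain(r->rv_1c, r->gap8_1c_vect);
}

/* Gain from going from 1 to 8 cores with the extensions enabled. */
static double parallel_gain(const struct row *r)
{
    return gain(r->gap8_1c_vect, r->gap8_8c_vect);
}
```

For the byte convolution row this gives roughly 7.1x from the extensions, 7.6x from the 8 cores, and about 54x overall; the same `gain` helper applied to the energy row (9674.2 vs 653.8 pJ) gives about 14.8x.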
Pooling
Max Pooling

for (int c=0; c<Wo; c++)
    for (int l=0; l<Ho; l++)
        Out[l*Wo+c] = Max(Max(In[2*l*W+2*c], In[2*l*W + 2*c+1]),
                          Max(In[(2*l+1)*W+2*c], In[(2*l+1)*W + 2*c+1]));

RiscV:
.L14:
    lbu a5,1(a2)
    lbu t4,0(a2)
    add a7,a7,1
    sll a6,a5,24
    sll t3,t4,24
    sra a6,a6,24
    sra t3,t3,24
    bge a6,t3,.L11
    mv a5,t4
    mv a6,t3
.L11:
    lbu t3,1(a4)
    add a2,a2,t5
    sll t4,t3,24
    sra t4,t4,24
    bge a6,t4,.L12
    mv a5,t3
.L12:
    lbu a6,0(a4)
    sll t4,a5,24
    sra t4,t4,24
    sll t3,a6,24
    sra t3,t3,24
    bge t4,t3,.L13
    mv a5,a6
.L13:
    sb a5,0(t1)

Gap8 No Vect:
    lp.setup x1,a6,(.L216)
    p.lb a5,s2(a7!)
    p.lb t0,s2(t4!)
    p.lb a4,s2(t5!)
    p.lb t6,s2(t3!)
    p.max a5,a5,t0
    p.max a4,a4,t6
    p.max a5,a5,a4
.L216:
    p.sb a5,s0(t1!)

Gap8 Vect:
    lp.setup x1,a3,(.L29)
    p.lh a4,t6(a0!)
    p.lh a5,t6(a2!)
    pv.max.b a5,a5,a4
    pv.extract.b a4,a5,0
    pv.extract.b a5,a5,1
    p.max a5,a4,a5
.L29:
    p.sb a5,t5(a1!)

Average Pooling

for (int c=0; c<Wo; c++)
    for (int l=0; l<Ho; l++)
        Out[l*Wo+c] = (In[2*l*W+2*c] + In[2*l*W + 2*c+1] +
                       In[(2*l+1)*W+2*c] + In[(2*l+1)*W + 2*c+1])>>2;

RiscV:
.L21:
    lb a5,0(a4)
    lb t5,1(a4)
    lb t4,0(a2)
    lb t3,1(a2)
    add a5,a5,t5
    add a5,a5,t4
    add a5,a5,t3
    sra a5,a5,2
    sb a5,0(a7)
    add a6,a6,1
    add a4,a4,t1
    add a2,a2,t1
    add a7,a7,t0
    bne t6,a6,.L21

Gap8 No Vect:
    lp.setup x1,a4,(.L227)
    p.lb a5,s2(a6!)
    p.lb t0,s2(t3!)
    p.lb t6,s2(t4!)
    p.lb t5,s2(t1!)
    add a5,a5,t0
    add a5,a5,t6
    p.addN a5,a5,t5,2
.L227:
    p.sb a5,s0(a7!)

Gap8 Vect:
    lp.setup x1,a4,(.L39)
    p.lh a5,t0(a6!)
    p.lh t3,t0(t1!)
    pv.shuffle2.b a5,t3,t4
    pv.dotsp.sci.b a5,a5,1
    sra a5,a5,2
.L39:
    p.sb a5,t6(a7!)
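The two pooling kernels can be written as self-contained C functions for testing on a PC. This is a sketch assuming signed byte data and even image dimensions; the function names are hypothetical. The max version is arranged the way the vector listing works: a vertical max across the two rows first, then a horizontal max across the resulting pair.

```c
#include <assert.h>
#include <stdint.h>

static int8_t max_i8(int8_t a, int8_t b) { return a > b ? a : b; }

/* 2x2 max pooling with stride 2 on a W x H byte image (W and H even). */
static void maxpool2x2(const int8_t *In, int8_t *Out, int W, int H)
{
    assert(W > 0 && H > 0 && W % 2 == 0 && H % 2 == 0);
    int Wo = W / 2, Ho = H / 2;
    for (int l = 0; l < Ho; l++) {
        const int8_t *r0 = In + (2 * l) * W;  /* top row of the window */
        const int8_t *r1 = r0 + W;            /* bottom row of the window */
        for (int c = 0; c < Wo; c++) {
            /* Vertical max per column (the lane-wise step), then the
               horizontal max of the two results. */
            int8_t m0 = max_i8(r0[2 * c], r1[2 * c]);
            int8_t m1 = max_i8(r0[2 * c + 1], r1[2 * c + 1]);
            Out[l * Wo + c] = max_i8(m0, m1);
        }
    }
}

/* 2x2 average pooling with stride 2: sum of the window shifted right by 2,
   as on the slide (an arithmetic shift is assumed for negative sums). */
static void avgpool2x2(const int8_t *In, int8_t *Out, int W, int H)
{
    assert(W > 0 && H > 0 && W % 2 == 0 && H % 2 == 0);
    int Wo = W / 2, Ho = H / 2;
    for (int l = 0; l < Ho; l++) {
        const int8_t *r0 = In + (2 * l) * W;
        const int8_t *r1 = r0 + W;
        for (int c = 0; c < Wo; c++) {
            int sum = r0[2 * c] + r0[2 * c + 1] + r1[2 * c] + r1[2 * c + 1];
            Out[l * Wo + c] = (int8_t)(sum >> 2);
        }
    }
}
```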
Pooling
Cycles per output      RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short 2x2/2 Max Pool   32.2     8.3       8.2            1.2
Short 2x2/2 Avg Pool   16.2     8.3       6.2            1.1
Byte 2x2/2 Max Pool    32.2     8.3       8.2            0.9
Byte 2x2/2 Avg Pool    16.2     8.3       7.2            1.1

pJ per output          RV 1C    Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short 2x2/2 Max Pool   2268.7   598.1     597.2          293.0
Short 2x2/2 Avg Pool   1179.9   650.3     456.2          253.1
Byte 2x2/2 Max Pool    2268.7   597.5     596.8          215.0
Byte 2x2/2 Avg Pool    1179.4   650.4     531.3          254.1
Linear
for (int i=0; i<H; i++) {
    int R = Out[i]<<Norm;
    for (int j=0; j<(W>>2); j++) {
        R += In[4*j]*Filter[W*i+4*j];
        R += In[4*j+1]*Filter[W*i+4*j+1];
        R += In[4*j+2]*Filter[W*i+4*j+2];
        R += In[4*j+3]*Filter[W*i+4*j+3];
    }
    for (int j=4*(W>>2); j<W; j++) R += In[j]*Filter[W*i+j];
    Out[i] = R>>Norm;
}

RiscV:
.L41:
    lb t3,0(a2)
    lb s5,0(a6)
    lb t1,1(a2)
    lb s4,1(a6)
    p.mul t3,t3,s5
    lb a7,2(a2)
    lb s3,2(a6)
    lb s1,3(a2)
    lb s2,3(a6)
    add a2,a2,4
    add a6,a6,4
    p.mul t1,t1,s4
    add t3,t3,t4
    p.mul a7,a7,s3
    add t1,t1,t3
    p.mul t3,s1,s2
    add a7,a7,t1
    add t4,t3,a7
    bne t5,a2,.L41

Gap8 NoVect:
    lp.setup x1,t3,(.L259)
    lb t5,0(a7)
    lb t4,0(t1)
    lb s4,1(a7)
    lb s3,1(t1)
    p.macs a6,t5,t4
    lb s2,2(a7)
    lb s1,2(t1)
    lb t5,3(a7)
    lb t4,3(t1)
    add a7,a7,4
    add t1,t1,4
    p.macs a6,s4,s3
    p.macs a6,s2,s1
.L259:
    p.macs a6,t5,t4

Gap8 Vect:
    lp.setup x1,t5,(.L99)
    p.lw t4,8(t6!)
    p.lw t3,8(t0!)
    p.lw t1,8(a2!)
    p.lw a7,8(t2!)
    pv.sdotsp.b a6,t4,t3
.L99:
    pv.sdotsp.b a6,t1,a7
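The step from the scalar loop to the vector listing can be modelled in portable C by replacing each group of four multiply-accumulates with one 4-lane dot product. This is a software model under assumptions: `dot4_acc` is a hypothetical helper standing in for what `pv.sdotsp.b` appears to do in the listing, and signed byte data is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a 4 x int8 dot product with accumulation. */
static int32_t dot4_acc(int32_t acc, const int8_t *a, const int8_t *b)
{
    for (int k = 0; k < 4; k++)
        acc += (int32_t)a[k] * b[k];
    return acc;
}

/* Linear (fully connected) layer: H outputs, each the dot product of the
   W inputs with one row of Filter, in fixed point with Norm fractional bits. */
static void linear_byte(const int8_t *In, const int8_t *Filter, int8_t *Out,
                        int W, int H, unsigned Norm)
{
    assert(W > 0 && H > 0 && Norm < 16);
    for (int i = 0; i < H; i++) {
        int32_t R = Out[i] * (1 << Norm);   /* bias or partial sum, scaled up */
        int j = 0;
        for (; j + 4 <= W; j += 4)          /* four products per step */
            R = dot4_acc(R, &In[j], &Filter[W * i + j]);
        for (; j < W; j++)                  /* leftover elements when W % 4 != 0 */
            R += In[j] * Filter[W * i + j];
        Out[i] = (int8_t)(R >> Norm);       /* no saturation, as on the slide */
    }
}
```

In the vector listing the loop body is two word loads per operand and two dot products, so eight products are retired per iteration.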
Linear
Cycles per sum of product   RV 1C   Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short Linear                5.3     3.0       1.5            0.3
Byte Linear                 5.3     3.0       0.8            0.2

pJ per sum of product       RV 1C   Gap8 1C   Gap8 1C/Vect   Gap8 8C/Vect
Short Linear                382.5   241.3     122.7          64.2
Byte Linear                 382.3   241.0     62.3           34.6
Summary - Performance
Average Extension’s Speed Gain: 3.6
Convolution: 80% of CNN workload
Summary – Energy Efficiency
Average Extension’s Energy Gain: 3.4
Convolution: 80% of CNN workload
Memory Management
Handling large network with minimal energy overhead
Memory Management
[Figure: memory hierarchy with cores 1 to 8, Shared L1, L2 and External L3 (RAM/Flash), connected by DMA and uDMA]
• Gap8 is not equipped with data caches
  • Silicon area
  • More importantly, energy efficiency, mostly due to hit ratio
• We can turn this weakness into an (energy) benefit if we can automate data transfers
• In practice, the vast majority of traffic is predictable
[Figure: timeline of three overlapping activities: Exec, L2 to L1, L3 to L2]
Automatic data tiling, with pipelined memory transfers interleaved with parallel calls to the compute kernel, is handled by our “Autotiler” tool
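The overlap of transfers and computation can be sketched as a double-buffered loop. This is a PC-side model under stated assumptions, not Autotiler output: `dma_copy` is a hypothetical stand-in that uses `memcpy` where real hardware would start an asynchronous transfer, the tile size is arbitrary, and the kernel is a plain sum.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TILE 4  /* elements per tile; stands for a hypothetical L1 budget */

/* Stand-in for an L2 to L1 DMA transfer. memcpy only models the data
   movement; on the target the copy would run in the background. */
static void dma_copy(int32_t *dst, const int32_t *src, int n)
{
    memcpy(dst, src, (size_t)n * sizeof *dst);
}

/* Example compute kernel working on one tile held in "L1". */
static int64_t sum_tile(const int32_t *tile, int n)
{
    int64_t s = 0;
    for (int i = 0; i < n; i++)
        s += tile[i];
    return s;
}

/* Process n elements stored in "L2" through two "L1" buffers: while tile t
   is processed from one buffer, tile t+1 is fetched into the other. */
static int64_t tiled_sum(const int32_t *l2, int n)
{
    assert(n >= 0);
    int32_t l1[2][TILE];
    int64_t total = 0;
    int ntiles = (n + TILE - 1) / TILE;
    if (ntiles == 0)
        return 0;

    dma_copy(l1[0], l2, n < TILE ? n : TILE);         /* prologue: fetch tile 0 */
    for (int t = 0; t < ntiles; t++) {
        int cur = t & 1;
        int len = (t == ntiles - 1) ? n - t * TILE : TILE;
        if (t + 1 < ntiles) {                         /* start fetching tile t+1 */
            int next_len = (t + 1 == ntiles - 1) ? n - (t + 1) * TILE : TILE;
            dma_copy(l1[cur ^ 1], l2 + (t + 1) * TILE, next_len);
        }
        total += sum_tile(l1[cur], len);              /* compute on tile t */
    }
    return total;
}
```

Because the access pattern is known in advance, the fetch of the next tile never has to wait for a miss, which is the energy argument made against data caches above.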
AutoTiler
Basic Kernels
How to handle a parametric tile
• Vectorization + Parallelization
• No assumption on where actual data are located
Usually seen as libraries

User Kernels
Passing actual data to basic kernels and having data circulating between them
• A multi dimensional iteration space (2D, 3D, 4D, 5D, ...) and a traversal order
• Each argument is a sub space of the iteration space and has actual dimensions, location (L2, external) and properties. Order may differ from the one of the iteration space
• Given a memory budget, the auto tiler “tiles” each argument and generates a fully pipelined implementation interleaving processing and data transfers
• Basic Kernels are inserted at defined locations in the iteration space (prologue, body, epilogue, …)
• Generated tiles are passed to Basic Kernels
Can be grouped and organized as generators
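To give a feel for "given a memory budget, tile each argument", here is a deliberately simple sizing rule in C. It is an assumption-laden stand-in for the Autotiler's constraints solver, whose actual algorithm is not described on the slides: it tiles a 2D input by rows only and requires double-buffered input and output tiles to fit in L1.

```c
#include <assert.h>
#include <stddef.h>

/* Largest number of input rows per tile such that double-buffered input
   and output tiles fit in l1_budget bytes. For a KxK filter, a tile of h
   input rows of width w yields (h - K + 1) output rows of width (w - K + 1).
   Returns 0 when even the smallest tile (K rows) does not fit. */
static int max_tile_rows(size_t l1_budget, int w, int total_rows, int K,
                         size_t elem_size)
{
    assert(K >= 1 && w >= K && total_rows >= K && elem_size > 0);
    int best = 0;
    for (int h = K; h <= total_rows; h++) {
        size_t in_bytes  = (size_t)h * (size_t)w * elem_size;
        size_t out_bytes = (size_t)(h - K + 1) * (size_t)(w - K + 1) * elem_size;
        if (2 * (in_bytes + out_bytes) > l1_budget)
            break;                      /* larger tiles only need more memory */
        best = h;
    }
    return best;
}
```

For a 28x28 byte image, a 5x5 filter and a 1024-byte budget this picks 11 input rows per tile. A real solver also has to account for filter coefficients, overlap between consecutive tiles and several arguments at once.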
AutoTiler
[Figure: Autotiler flow]
• Basic Kernels: C libraries
• User Kernels, Groups of User Kernels, Generators: C programs, calls to Autotiler’s Model API
• Autotiler Library (Constraints Solver, C Code Generator)
• Compile & run on PC
• Output: C code for the target, handling data transfers and Basic Kernels dispatch on the cluster’s cores
#include "AutoTilerLib.h"
#include "CNN_Generator.h"
void Mnist()
{
    CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5, 1, 32, 28, 28, 1);
    CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1);
    CNN_TiledLinearLayer ("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0);
}
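The sizes passed to the three generator calls are consistent with a 5x5 convolution without padding followed by 2x2 pooling with stride 2. The sketch below checks that arithmetic; reading the arguments as (filter size, input features, output features, width, height) is an assumption, since the slides do not document the generator signatures, and both function names are hypothetical.

```c
#include <assert.h>

/* Output size along one dimension of an NxN convolution without padding,
   followed by 2x2 pooling with stride 2. */
static int conv_pool_out(int in, int n)
{
    assert(n >= 1 && in >= n);
    int conv = in - n + 1;   /* valid convolution */
    return conv / 2;         /* 2x2/2 pooling */
}

/* Number of inputs seen by a linear layer fed with 'features' maps of
   size w x h. */
static int linear_inputs(int features, int w, int h)
{
    assert(features > 0 && w > 0 && h > 0);
    return features * w * h;
}
```

Under that reading, 28x28 becomes 12x12 after the first layer and 4x4 after the second, so the linear layer maps 64 * 4 * 4 = 1024 inputs to 10 outputs.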
On real life networks
Key Word Spotting
[Power trace annotations]
CNN on HWCE: avg power 8.79 mW, duration 58 ms
MFCC on FC: avg power 3.3 mW, duration 170 ms

Processing of 1 second of voice data at 1.0V:
CNN (cluster)
• SW version: 155 ms at 11.8 mW: 1.8 mW average
• HWCE version: 58 ms at 8.8 mW: 509 µW average
MFCC (FC)
• 170 ms at 3.3 mW: 560 µW average
Total: 1.07 mW with HWCE, 2.36 mW in SW

Google CNN:
• Conv 8x20, MaxPool 2x2/2, 1 InFeat, 32 OutFeat, W:95, H:40
• Conv 4x10, ReLU, InFeat 32, OutFeat 32
• Linear: 10 Outs
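The average figures on this slide follow from duty cycling: a task that is active for part of each second contributes its active power scaled by the fraction of time it runs. A minimal sketch, assuming idle power is negligible (which the slide's arithmetic implies but does not state); the function name is hypothetical.

```c
#include <assert.h>

/* Average power in mW over a period, for a task that draws active_mw for
   active_ms and is counted as drawing nothing for the rest of the period. */
static double avg_power_mw(double active_ms, double active_mw, double period_ms)
{
    assert(period_ms > 0.0 && active_ms >= 0.0 && active_ms <= period_ms);
    return active_mw * active_ms / period_ms;
}
```

With a 1000 ms period, the HWCE CNN gives 8.8 * 58 / 1000, about 0.51 mW, and the MFCC 3.3 * 170 / 1000, about 0.56 mW, for a total of about 1.07 mW. The SW CNN gives about 1.83 mW; adding the MFCC yields about 2.39 mW, close to the slide's 2.36 mW, which appears to sum the rounded values.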
CNN Based Text Recognition
Trainable parameters: 421 263
Neurons: 1 511 904
33ms per image
DRONET: RESNET based Autonomous Drone
• Developed by UZH and ETH-Z
• Autonomously follows a road and avoids collisions
• Up to 18 frames per second at maximum frequency
• @1.0V, FC: 50 MHz, Cluster: 100 MHz: 6.5 fps at 40 mW
Conclusion
• Well-selected extensions can really make a difference at a very limited silicon area overhead.
• On CNNs we measure a factor of approx. 3.5 for both speed and energy efficiency on a single core.
• Parallelism brings another boost factor (24.4) on performance thanks to close to optimal scaling. The root cause is the architecture.
• More interestingly, parallelism contributes very significantly to the energy per operation improvement, with a factor of 2 on top of the ISA extension contribution, for a total of x7.4 vs a single RiscV core. Here also, the root cause is the architecture.
• These gains are further amplified by the capability to optimally manage memory transfers across the memory hierarchy.
• This enables the support of mid-complexity CNNs within an MCU class power budget.
Thank You!