A Science & Technology Center
Creating Intelligence at the Edge - Part 2
Vladimir Stojanović
E3S Retreat
September 20, 2018
Computing is moving toward the Edge
Role of electronics
Big issues – speed, available power for local computation
Opportunity for new technologies to help – enter E3S
Autonomous Driving: What networks do we need to run?
[Lin et al. ASPLOS18]
DET (object detection) - YOLO
TRA (object tracking) - GOTURN
LOC (localization) - ORB-SLAM
~ Level 3 pipeline
Autonomous driving HW requirements
Fastest human driver reaction: 100-150 ms
Automated driving system requirements
• 99.99th-percentile latency < 100 ms
• Frame rate > 10 fps
Driving range reduction on Chevy Bolt
Overhead of storage and cooling ~100%
[Lin et al. ASPLOS18]
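The two system requirements on this slide can be checked against measured data. The following is a minimal sketch, assuming per-frame latencies are logged in milliseconds and the frame rate is measured separately; the function names `percentile` and `meets_requirements` are illustrative, and the nearest-rank percentile is only one of several common definitions.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100.0 * len(ordered))
    rank = min(len(ordered), max(1, rank))  # clamp to a valid 1-based rank
    return ordered[rank - 1]

def meets_requirements(latencies_ms, measured_fps,
                       tail_pct=99.99, max_latency_ms=100.0, min_fps=10.0):
    """Check the tail-latency and frame-rate thresholds from the slide.
    The default thresholds are the slide's numbers; everything else is assumed."""
    tail_ms = percentile(latencies_ms, tail_pct)
    return tail_ms < max_latency_ms and measured_fps > min_fps
```

A tail percentile is used rather than the mean because a handful of slow frames is enough to violate the requirement even when the average latency looks comfortable.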
CPU performance is not enough
DNNs take most of the time in all tasks!
Need a few orders of magnitude improvement
Opportunity for acceleration
[Lin et al. ASPLOS18]
Acceleration results
• GPU+FPGA/ASIC needed for <10% driving range impact
• FPGA/ASIC needed for <5% driving range impact
[Lin et al. ASPLOS18]
Platforms: Xeon (CPU), Titan X (GPU), Stratix V (FPGA), EIE and Eyeriss (ASIC); 45nm SOI
What does the future hold?
Would like to achieve Level 5 (fully autonomous)
• Higher accuracy required => More complex DNNs
• More sensors (e.g. LIDAR) => Additional/more complex DNNs
• Higher resolution => More storage/computation
• Additional algorithms => Human-machine interaction [Dragan]
GPU/ASIC latency is OK at Full HD, but no design meets the latency requirement at QHD
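To give a rough sense of why the step from Full HD to QHD is hard, the sketch below compares pixel counts. It assumes the standard 1920x1080 and 2560x1440 frame sizes and, as a simplification not taken from the slides, that per-frame work grows linearly with pixel count; `pixels` and `relative_work` are illustrative names.

```python
# Standard frame sizes (width, height) in pixels.
RESOLUTIONS = {
    "Full HD": (1920, 1080),
    "QHD": (2560, 1440),
}

def pixels(name):
    """Number of pixels in one frame at the named resolution."""
    width, height = RESOLUTIONS[name]
    return width * height

def relative_work(target, baseline="Full HD"):
    """Ratio of pixels per frame, used here as a crude proxy for
    storage and computation per frame (assumed linear scaling)."""
    return pixels(target) / pixels(baseline)

print(f"QHD vs Full HD: {relative_work('QHD'):.2f}x pixels per frame")
```

Under that assumption, a design with less than roughly 1.8x latency headroom at Full HD would miss the same deadline at QHD.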
Current inference accelerators
PX Xavier (12nm): 30W, 30 INT8 Top/s, 1 Tops/W
PX Pegasus (12nm): 500W, 320 INT8 Top/s, 0.6 Tops/W
Stanford EIE: 600mW, 0.1 INT8 Top/s, 0.5 Tops/W (process normalized to 16nm)
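The efficiency column is throughput divided by power. The sketch below recomputes the raw ratio from the power and throughput figures listed above; the dictionary layout and the name `tops_per_watt` are illustrative. For EIE the raw ratio comes out near 0.17 Tops/W rather than the listed 0.5, presumably because the slide's figure is process-normalized to 16nm, which this sketch does not attempt to model.

```python
# (power in W, throughput in INT8 Top/s), as listed on the slide.
ACCELERATORS = {
    "PX Xavier": (30.0, 30.0),
    "PX Pegasus": (500.0, 320.0),
    "Stanford EIE": (0.6, 0.1),
}

def tops_per_watt(power_w, tops):
    """Raw energy efficiency: throughput per unit power, no process normalization."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return tops / power_w

for name, (power_w, tops) in ACCELERATORS.items():
    print(f"{name}: {tops_per_watt(power_w, tops):.2f} Tops/W")
```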
Last time: PBP impact on micro-architecture
• Algorithmic transformations enable simple, systolic (flow-through) architecture with in-situ coefficient storage (minimal energy)
• Fixed or reconfigurable input/output shuffle (permutation)
Page 9, 11/1/2018
Multiply-accumulates with local weight storage (dense sub-blocks)
Input vector shuffle
Output vector shuffle
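The three stages named on this slide (input vector shuffle, multiply-accumulates on dense sub-blocks, output vector shuffle) can be sketched as a permuted block-diagonal matrix-vector product. This is an illustrative sketch only: the function names, the square sub-blocks, and the example permutations are assumptions, not the chip's actual configuration.

```python
def shuffle(vec, perm):
    """Return a vector whose i-th element is vec[perm[i]]."""
    return [vec[p] for p in perm]

def block_matvec(blocks, vec):
    """Multiply a block-diagonal matrix by vec. Each block is a dense
    sub-matrix that only sees its own slice of the input, so its weights
    can be stored next to its multiply-accumulate units."""
    out, offset = [], 0
    for block in blocks:
        size = len(block[0])
        chunk = vec[offset:offset + size]
        for row in block:
            out.append(sum(w * x for w, x in zip(row, chunk)))
        offset += size
    return out

def layer(blocks, in_perm, out_perm, x):
    """Input shuffle -> dense sub-block MACs -> output shuffle."""
    return shuffle(block_matvec(blocks, shuffle(x, in_perm)), out_perm)

def dense_equivalent(blocks, in_perm, out_perm):
    """Build the full matrix the layer implements, one column per unit vector."""
    n = len(in_perm)
    cols = [layer(blocks, in_perm, out_perm,
                  [1 if i == j else 0 for i in range(n)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(len(out_perm))]

blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # two dense 2x2 sub-blocks
in_perm, out_perm = [2, 0, 3, 1], [1, 3, 0, 2]  # hypothetical shuffles
print(layer(blocks, in_perm, out_perm, [1, 2, 3, 4]))
```

In this example the equivalent 4x4 matrix has only 8 nonzero weights out of 16, and the shuffles spread them across the matrix while each sub-block stays dense and local.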
Optimizing accelerator energy
Minimize data movement
Store weights on-chip
• Pruning
• Dense memory
In-memory computation
• Reconfigurable interconnect
• Dense memory
Computation flexibility
• Need a mix of acceleration and regular CPU
20 Top/s, 0.4W
• 50 Tops/W
• 50-100x improvement
• 90% of the power in the accelerator
[Figure: chip with Rocket Core and SPACE Accelerator labeled]
Berkeley AI core chip [Naous, Kang, Stojanovic]
Taped out May 2018
2.5mm x 2.5mm, 16nm
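The headline numbers on this slide follow from simple arithmetic. The sketch below derives the 50 Tops/W figure from 20 Top/s at 0.4W and compares it with the efficiencies listed for current inference accelerators earlier in the deck (1, 0.6, and 0.5 Tops/W); the variable and function names are illustrative.

```python
def efficiency(tops, power_w):
    """Energy efficiency in Tops/W from throughput (Top/s) and power (W)."""
    return tops / power_w

# Berkeley AI core chip figures from this slide.
chip_tops_per_watt = efficiency(20.0, 0.4)

# Efficiencies listed for current inference accelerators, in Tops/W.
baselines = {"PX Xavier": 1.0, "PX Pegasus": 0.6, "Stanford EIE": 0.5}

improvement = {name: chip_tops_per_watt / eff for name, eff in baselines.items()}

print(f"chip: {chip_tops_per_watt:.0f} Tops/W")
for name, factor in improvement.items():
    print(f"vs {name}: {factor:.0f}x")
```

The resulting factors fall between 50x and 100x, which appears consistent with the improvement range quoted on the slide.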
Top Level Accelerator View
[Figure: Tile Interface, Routing Matrix blocks, and two groups of processing elements (PE1 ... PEn1 and PE1 ... PEn2)]
Network layer accelerator implementation
[Figure: Input Activations feeding PE1 ... PEn, producing Output Activations. PE detail labels: Input Permutations, Latches, Input Activations, Multiply Units, SRAM (Weights), Adder Tree, Quantizer, ReLU, SRAM (Output Activations)]
• Minimize the size of the routing matrix through smart output SRAM/Mux scheduling
• Input permutations via small muxes
• Output permutations via output SRAM reads
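The PE blocks labeled in the figure can be sketched as a small software model. This is a behavioral sketch under stated assumptions, not the chip's implementation: the stage order (adder tree, then ReLU, then quantizer), the shift-and-saturate INT8 quantization, and all function names are assumed for illustration.

```python
def adder_tree(values):
    """Sum values by pairwise reduction, level by level."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

def relu(x):
    return max(0, x)

def quantize_int8(acc, shift):
    """Scale by an arithmetic right shift, then saturate to signed 8 bits (assumed scheme)."""
    return max(-128, min(127, acc >> shift))

def pe_output(weights, activations, shift=4):
    """One PE output: multiply units -> adder tree -> ReLU -> quantizer."""
    products = [w * a for w, a in zip(weights, activations)]
    return quantize_int8(relu(adder_tree(products)), shift)

def read_permuted(output_sram, read_order):
    """Output permutation realized as the order of SRAM read addresses."""
    return [output_sram[addr] for addr in read_order]
```

Reading the output SRAM in permuted address order means the permutation costs no extra routing: results are written in natural order and the shuffle happens at read time.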
Parameterized generator allows evaluation of various scenarios
SRAM still dominates power and area, and limits throughput
[Figure: PE power breakdown]
Where next?
Now have a full architecture generator and advanced CMOS benchmark
Architecture that can be tuned further for new devices
• e.g. NEMS reconfigurable interconnect
• e.g. NEMS or other dense memory
Fully benchmark the architecture with new devices vs. advanced CMOS design