A Science & Technology Center

Creating Intelligence at the Edge - Part 2

Vladimir Stojanović

E3S Retreat

September 20, 2018


Computing is moving toward the Edge

Big issues – speed, available power for local computation

Opportunity for new technologies to help – enter E3S


Autonomous Driving: What networks do we need to run?

[Lin et al. ASPLOS18]

Detection (DET): YOLO

Tracking (TRA): GOTURN

Localization (LOC): ORB-SLAM

~ Level 3 pipeline


Autonomous driving hw requirements

Fastest human driver reaction 100-150ms

Automated driving system requirements

99.99th-percentile latency <100ms

Frame rate > 10 fps
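
The two requirements above can be checked against measured per-frame latencies. The following is only a sketch: the function names, the nearest-rank percentile definition, and the sample numbers are assumptions made for illustration, not taken from the slides or from [Lin et al. ASPLOS18].

```python
def tail_latency(latencies_ms, num=9999, den=10000):
    """Nearest-rank percentile (default 99.99th), using integer arithmetic
    so the rank is not affected by floating-point rounding."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = max(1, -(-num * n // den))  # ceil(num * n / den)
    return ordered[rank - 1]


def meets_requirements(latencies_ms, frame_period_ms):
    """True if the 99.99th-percentile latency is under 100 ms and the
    frame rate implied by frame_period_ms is above 10 fps."""
    fps = 1000.0 / frame_period_ms
    return tail_latency(latencies_ms) < 100.0 and fps > 10.0


# Hypothetical trace: 9,999 frames at 50 ms and a single 150 ms outlier.
trace = [50.0] * 9999 + [150.0]
print(tail_latency(trace), meets_requirements(trace, frame_period_ms=80.0))
```

With 10,000 samples a single outlier still passes, but a second one pushes the 99.99th percentile to the outlier value, which is why the tail requirement is much stricter than a mean-latency target.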

Driving range reduction on Chevy Bolt

Overhead of storage and cooling ~100%



CPU performance is not enough

DNNs take most of the time in all tasks!

Need a few orders of magnitude improvement

Opportunity for acceleration

DNNs take most of the time! [Lin et al. ASPLOS18]


Acceleration results

GPU+FPGA/ASIC needed for <10% driving range impact

FPGA/ASIC needed for <5% driving range impact


Platforms compared: Xeon (CPU), Titan X (GPU), Stratix V (FPGA), EIE and Eyeriss (ASIC), 45nm SOI


What does the future hold?

Would like to achieve Level 5 (fully autonomous)

Higher accuracy required => More complex DNNs

More sensors (e.g. LIDAR) => Additional/more complex DNNs

Higher resolution => More storage/computation

Additional algorithms => Human-machine interaction [Dragan]

GPU/ASIC latency o.k. at Full HD, but no design meets the latency requirement at QHD


Current inference accelerators

PX Xavier (12nm): 30W, 30 INT8 TOPS, 1 TOPS/W

PX Pegasus (12nm): 500W, 320 INT8 TOPS, 0.6 TOPS/W

Stanford EIE: 600mW, 0.1 INT8 TOPS, 0.5 TOPS/W (process normalized to 16nm)
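
The efficiency figure quoted for each accelerator is throughput divided by power. A minimal sketch of that arithmetic, using the throughput and power numbers from the slide (the function name and dictionary layout are illustrative assumptions):

```python
def tops_per_watt(tops, watts):
    """Energy efficiency in tera-operations per second per watt."""
    return tops / watts


# (INT8 tera-operations per second, power in watts), as listed on the slide.
accelerators = {
    "PX Xavier": (30.0, 30.0),
    "PX Pegasus": (320.0, 500.0),
    "Stanford EIE": (0.1, 0.6),
}

for name, (tops, watts) in accelerators.items():
    print(f"{name}: {tops_per_watt(tops, watts):.2f} TOPS/W")
```

For EIE the raw ratio comes out lower than the 0.5 TOPS/W on the slide, because the slide's figure is normalized to a 16nm process; this sketch does not attempt that normalization.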


Last time: PBP impact on micro-architecture

Algorithmic transformations enable simple, systolic (flow-through) architecture with in-situ coefficient storage (minimal energy)

Fixed or reconfigurable input/output shuffle (permutation)

Page 9, 11/1/2018

Multiply-accumulates with local weight storage (dense sub-blocks)

Input vector shuffle

Output vector shuffle
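
The three stages named on this slide (input vector shuffle, multiply-accumulates over dense sub-blocks, output vector shuffle) can be sketched in plain Python. This is an illustrative model only: the function names, the list-of-blocks weight layout, and the example permutations are assumptions, not the authors' implementation.

```python
def shuffle(vec, perm):
    """Return a vector whose i-th entry is vec[perm[i]]."""
    return [vec[p] for p in perm]


def block_diagonal_matvec(blocks, vec):
    """Multiply vec by a block-diagonal matrix stored as dense sub-blocks.
    Each block only sees its own contiguous slice of the input, so its
    weights can stay local to the unit that uses them."""
    out = []
    offset = 0
    for block in blocks:
        width = len(block[0])
        segment = vec[offset:offset + width]
        for row in block:
            out.append(sum(w * x for w, x in zip(row, segment)))
        offset += width
    return out


def permuted_block_layer(blocks, in_perm, out_perm, vec):
    """Input shuffle, dense sub-block multiply-accumulates, output shuffle."""
    shuffled = shuffle(vec, in_perm)
    partial = block_diagonal_matvec(blocks, shuffled)
    return shuffle(partial, out_perm)


# Hypothetical 4-input layer built from two 2x2 dense sub-blocks.
blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
result = permuted_block_layer(blocks, [2, 0, 3, 1], [1, 0, 3, 2], [1, 2, 3, 4])
print(result)
```

Only the two shuffles move data across block boundaries, which is what lets the multiply-accumulate stage remain a simple flow-through structure.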


Optimizing accelerator energy

Minimize data-movement

Store weights on-chip

• Pruning

• Dense-memory

In-memory computation

• Reconfigurable interconnect

• Dense-memory

Computation flexibility

Need a mix of acceleration and regular CPU

20 TOPS, 0.4W

50 TOPS/W

50-100x improvement

90% of the power in the accelerator

Rocket Core

SPACE Accelerator

Berkeley AI core chip [Naous, Kang, Stojanovic]

Taped-out May 2018

2.5mm x 2.5mm, 16nm
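
The 50 TOPS/W figure follows from 20 tera-operations per second at 0.4W. The sketch below also compares it with the efficiencies listed on the "Current inference accelerators" slide. Reading the "50-100x improvement" as a comparison against those accelerators is an assumption made here, and the names in the code are illustrative.

```python
def improvement_factor(new_tops_per_watt, baseline_tops_per_watt):
    """How many times more efficient the new design is than a baseline."""
    return new_tops_per_watt / baseline_tops_per_watt


chip_efficiency = 20.0 / 0.4  # TOPS/W for the Berkeley AI core chip

# Baseline efficiencies in TOPS/W, as quoted on the earlier slide.
baselines = {"PX Xavier": 1.0, "PX Pegasus": 0.6, "Stanford EIE": 0.5}

for name, baseline in baselines.items():
    factor = improvement_factor(chip_efficiency, baseline)
    print(f"vs {name}: {factor:.0f}x")
```

Under that assumption the factors span roughly 50x (against 1 TOPS/W) to 100x (against 0.5 TOPS/W), matching the range on the slide.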


Top Level Accelerator View

[Figure: tile interface and rows of processing elements (PE1 ... PEn1, PE1 ... PEn2) connected through routing matrices]


Network layer accelerator implementation

[Figure: block diagram with labels: input activations, input permutations, latches, PE1 ... PEn, multiply units, SRAM (weights), adder tree, ReLU, quantizer, SRAM (output activations)]

Minimize the size of the routing matrix through smart output SRAM/Mux scheduling

Input permutations via small muxes

Output permutations via output SRAM reads
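
A behavioral sketch of one processing element may help connect the labeled blocks to the two permutation mechanisms. Everything here is an assumption for illustration: the ordering of ReLU before the quantizer, the INT8 saturation range, the `scale` parameter, and all function names are hypothetical and not taken from the chip's design.

```python
def relu(x):
    return x if x > 0 else 0


def quantize_int8(x, scale):
    """Scale, round, and saturate to the signed 8-bit range.
    Python's round() rounds exact halves to the nearest even integer."""
    q = int(round(x / scale))
    return max(-128, min(127, q))


def pe_outputs(activations, mux_select, weight_rows, scale):
    """One PE pass: input muxes, multiply units, adder tree, ReLU, quantizer."""
    # Input permutation: each mux selects which activation feeds a multiplier.
    routed = [activations[s] for s in mux_select]
    results = []
    for weights in weight_rows:  # one row of locally stored weights per output
        products = [w * a for w, a in zip(weights, routed)]  # multiply units
        total = sum(products)                                # adder tree
        results.append(quantize_int8(relu(total), scale))
    return results


def read_permuted(output_sram, read_order):
    """Output permutation expressed as the order of SRAM read addresses."""
    return [output_sram[addr] for addr in read_order]


# Hypothetical values chosen to exercise ReLU clipping and INT8 saturation.
sram = pe_outputs([1, -2, 3], [2, 0, 1], [[1, 1, 1], [-1, 0, 0], [50, 0, 0]], 1)
print(sram, read_permuted(sram, [2, 0, 1]))
```

In this model the output permutation needs no extra routing hardware: results are written in natural order and the shuffle appears only in the sequence of read addresses, which is one way to read the slide's point about keeping the routing matrix small.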


Parameterized generator allows evaluation of various scenarios

SRAM still dominates power and area, and limits throughput

[Figure: PE power breakdown]


Where next?

Now have a full architecture generator and advanced CMOS benchmark

Architecture that can be tuned further for new devices

e.g. NEMS reconfigurable interconnect

e.g. NEMS or other dense memory

Fully benchmark the architecture with new devices vs. advanced CMOS design
