Date post: 04-Jul-2020
Page 1

Deep Learning Hardware Acceleration

Jorge Albericio+, Alberto Delmas Lascorz, Patrick Judd, Sayeh Sharify, Tayler Hetherington*, Natalie Enright Jerger, Tor Aamodt*, Andreas Moshovos

+ now at NVIDIA

Page 2

Disclaimer

The University of Toronto has filed patent applications for the mentioned technologies.

Page 3

Deep Learning: Where Time Goes?

Convolutional Neural Networks: e.g., Image Classification

Time: ~60% - 90% inner products

[Figure: an inner-product unit (multipliers feeding an adder) over 100s - 1000s of terms]

Page 4

Deep Learning: Where Time Goes?

Time: ~60% - 90% inner products

[Figure: several inner-product units (multipliers feeding adders), each over 100s - 1000s of terms]

Page 5

SIMD: Exploit Computation Structure

DaDianNao: 4K terms/cycle

[Figure: lanes 0-15 feeding multipliers; labels "x16" and "x256"]

Page 6

Our Approach

Improve by Exploiting Value Properties

Maintaining:

• Massive Parallelism
• SIMD Lanes
• Wide Memory Accesses

No Modifications to the Networks

[Figure: lanes 0-15 feeding Filter 0 through Filter 15]

Page 7

Longer Term Goal

One Architecture to Rule them All

[Figure: lanes 0-15 feeding Filter 0 through Filter 15]

Page 8

Value Properties to Exploit? Many ~0 values

Page 9

Value Properties to Exploit? Varying Precision Needs

Page 10

Our Results: Performance

vs. DaDianNao, which was ~300x over GPUs

[Figure: bar chart of speedup (axis 0 to 4) for CNVLUTIN, STRIPES, ENGINE P / PRAGMATIC, and TARTAN +, at 100% and 99% accuracy. Bar labels in extraction order: 1.5x, 1.9x, 3.1x, 1.60x, 2.08x, 4.3x. Venue labels: ISCA’16, MICRO’16, Arxiv, ICLR Workshop]

Page 11

Our Results: Memory Footprint and Bandwidth

• Proteus: 44% less memory bandwidth + footprint

Page 12

Roadmap

• Avoiding computations with ~0
• Performance from precision
• Performance from zero bits
• Reducing footprint and bandwidth

Page 13

#1: Skipping Ineffectual Activations

Page 14

Cnvlutin: ISCA’16

Many ineffectual multiplications

[Figure: inner-product unit in which a zero operand yields a zero product]

Page 15

Many Activations and Weights are Intrinsically Ineffectual (zero)

• Zero Activations:
  • Calculated at runtime
  • 45% of runtime values are zero
  • The percentage is stable for any input
  • None are always zero
• Zero Weights:
  • Known in advance
  • Not pursued in this work

Page 16

Cnvlutin

Many ineffectual multiplications

[Figure: inner-product unit in which a zero operand yields a zero product]

Page 17

Cnvlutin

Many more ineffectual multiplications

[Figure: inner-product unit with zero operands; inputs labelled a and b]

Page 18

Cnvlutin

Beating Fast and “Dumb” SIMD is Hard

• On-the-fly ineffectual product elimination
• Performance + energy
• Optional: accuracy loss + performance
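A minimal sketch of what eliminating ineffectual products means for a single inner product. It is a software illustration only, not the hardware mechanism; the function name and the sample values are hypothetical.

```python
def inner_product_skip_zeros(activations, weights):
    """Return (result, multiplications performed), skipping zero activations."""
    total, multiplies = 0, 0
    for a, w in zip(activations, weights):
        if a == 0:
            continue  # ineffectual: the product would be zero anyway
        total += a * w
        multiplies += 1
    return total, multiplies


# Hypothetical data: 5 of the 8 activations are zero.
acts = [0, 3, 0, 0, 2, 0, 1, 0]
wts = [5, 1, 7, 2, 4, 9, 6, 3]
result, multiplies = inner_product_skip_zeros(acts, wts)
```

The result matches the full inner product while only three of the eight multiplications are performed.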

Page 19

Cnvlutin

No Accuracy Loss:

• +52% performance
• -7% power
• +5% area

Can relax the ineffectual criterion:

• Better performance: 60%
• Even more w/ some accuracy loss

Page 20

Deep Learning: Convolutional Neural Networks

[Figure: an image passes through 10s-100 layers; output: "Beaver", maybe]

Page 21

Deep Learning: Convolutional Networks

• > 60% - 90% of time in Convolutional Layers

[Figure: convolutional layer; labels: input activations (depth i), weights (K x K x i, N filters), output activations (depth N)]
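To make the structure of a convolutional layer concrete, here is a minimal sketch that assumes one flattened K x K x i input window and N filters. The function names and the sizes are illustrative and do not come from the talk. Each output activation is one inner product.

```python
def inner_product(activations, weights):
    """Sum of element-wise products: the operation that dominates run time."""
    return sum(a * w for a, w in zip(activations, weights))


def conv_window(window, filters):
    """One output activation per filter for a single flattened input window."""
    return [inner_product(window, f) for f in filters]


# Hypothetical sizes: K = 3 and i = 64 give 576 terms per output activation.
K, channels = 3, 64
terms_per_output = K * K * channels
```

With hundreds of terms per output and one output per filter per window position, the inner products quickly dominate the work.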

Page 22

Why are there so many zero neurons?

Hypothesis: Filter = feature

Page 23

SIMD: Exploit Computation Structure

DaDianNao: 4K terms/cycle

[Figure: lanes 0-15 feeding multipliers; labels "x16" and "x256"]

Page 24

Skipping Ineffectual Activations: Key Challenge

• Processing all Activations:
  • All Lanes operate in lock step

[Figure: Lane 0 through Lane 15, each with inputs 0-15 feeding multipliers]

Page 25

Naïve Solution: No Wide Memory Accesses

• 16 independent narrow activation streams

[Figure: Lane 0 through Lane 15]

Page 26

Removing Zeroes: At the output of each layer

[Figure: labels: Layer i, encode, Neuron Mem, Layer i + 1; network output: "Beaver", maybe]

Page 27

Maintaining Wide Accesses But Skipping Zeroes

#1: Partition NM (Neuron Memory) in 16 Slices over 16 Banks

Processing order does not matter

[Figure: neuron lanes 0, 1, ..., 15]

Page 28

Maintaining Wide Accesses But Skipping Zeroes

#2: Fetch and Maintain One Container per Slice

Container: up to 16 non-zero neurons

[Figure: neuron lanes 0, 1, ..., 15]

Page 29

Maintaining Wide Accesses But Skipping Zeroes

#3: Keep Neuron Lanes Supplied with One Neuron Per Cycle

[Figure: lanes 0, 1, ..., 15]

Page 30

Maintaining Wide Accesses But Skipping Zeroes

#4: When a container is exhausted, get the next one within the slice

[Figure: lanes 0, 1, ..., 15]

Page 31

Maintaining Wide Accesses But Skipping Zeroes

• Container: stores only non-zeros
• Encoding: Value, 4-bit offset
• Could use 1 extra bit: encoded vs. raw

Page 32

ZFNAf: Enabling the Skipping of Ineffectual Neurons

Inside Each Neuron Container

• Zero-Free Neuron Array Format:
  • Only non-zero neurons + offsets
  • Brick-level
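The container format can be sketched as follows, assuming a brick holds 16 neurons so that each offset fits in 4 bits. The function names are hypothetical, and the sketch ignores the banked memory layout described on the earlier slides.

```python
BRICK = 16  # neurons per brick; offsets 0..15 fit in 4 bits


def encode_brick(brick):
    """Keep only non-zero neurons, each paired with its offset in the brick."""
    assert len(brick) == BRICK
    return [(value, offset) for offset, value in enumerate(brick) if value != 0]


def decode_brick(pairs):
    """Rebuild the dense brick from (value, offset) pairs."""
    brick = [0] * BRICK
    for value, offset in pairs:
        brick[offset] = value
    return brick


def brick_inner_product(pairs, weights):
    """The offset selects the matching weight, so zeros never reach a multiplier."""
    return sum(value * weights[offset] for value, offset in pairs)
```

For example, a brick whose only non-zero neurons are 7 at position 1 and 2 at position 4 encodes to `[(7, 1), (2, 4)]`, and the inner product needs two multiplications instead of sixteen.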

Page 33

Cnvlutin: No Accuracy Loss

[Figure: results chart; axis annotation: "better"]

Page 34

Loosening the Ineffectual Neuron Criterion

• Treat neurons close to zero as zero

Open Questions:

Are these robust? How to find the best?
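A small sketch of the relaxed criterion, assuming a single magnitude threshold. The threshold is a hypothetical tuning parameter, and how to choose it is exactly the open question on the slide.

```python
def prune_near_zero(activations, threshold):
    """Treat activations with magnitude below the threshold as zero."""
    return [0 if abs(a) < threshold else a for a in activations]


def zero_fraction(values):
    """Fraction of values that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)
```

Raising the threshold increases the fraction of skippable activations, at the risk of losing accuracy.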

Page 35

#2: Exploiting Precision

Page 36

Another Property of CNNs

Operand Precision Required: Fixed? 16 bits?

[Figure: inner-product unit with 16-bit operands]

Page 37

CNNs: Precision Requirements Vary

Operand Precision Required: Varies (not fixed), from 5 bits to 13 bits

[Figure: inner-product unit; operand labels: 16 bits, p bits]

Page 38

Stripes

Execution time is proportional to p; ideal speedup over the 16-bit baseline = 16 / p

Performance + Energy Efficiency + Accuracy Knob

[Figure: inner-product unit with a p-bit operand]
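A minimal sketch of a bit-serial inner product. It reads the slide's 16 / p as the ideal speedup over a 16-bit bit-parallel baseline (an interpretation; execution time itself shrinks as p drops). It assumes unsigned activations that fit in p bits and processes the least significant bit first; the real design's bit order and sign handling may differ.

```python
def bit_serial_inner_product(activations, weights, p):
    """Process one activation bit per cycle; return (result, cycles)."""
    total = 0
    for bit in range(p):  # one cycle per activation bit
        # Each lane contributes its weight only if this activation bit is 1.
        partial = sum(w for a, w in zip(activations, weights) if (a >> bit) & 1)
        total += partial << bit  # scale the partial sum by 2**bit
    return total, p


def ideal_speedup(p, baseline_bits=16):
    """Ideal speedup when run time scales with the precision p."""
    return baseline_bits / p
```

For instance, with 3-bit activations `[5, 3, 0, 7]` and weights `[2, -1, 4, 3]`, three cycles produce 28, the same value as the ordinary inner product.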

Page 39

Stripes: Key Concept

• Devil in the Details: carefully choose what to serialize and what to reuse; same input wires as baseline

[Figure: terms per step; labels: 2 of 2b x 2b, 2 of 1b x 2b, 4 of 1b x 2b]

Page 40

SIMD: Exploit Computation Structure

DaDianNao: 4K terms/cycle

[Figure: lanes 0-15 feeding multipliers; label "x16"]

Page 41

Stripes Bit-Serial Engine

[Figure: lanes 0-15 through 248-255; each multiplier takes a 1-bit input and a 16-bit input]

Page 42

Compensating for Bit-Serial’s Compute Bandwidth Loss

• Each Tile:
  • 16 Windows Concurrently, 16 neurons each
  • 16 Filters
  • 16 partial output neurons

[Figure: tile with 16 x 16 input neurons, 16 synapses, and output neurons]

Page 43

Stripes

No Accuracy Loss:

• +192% performance
• -57% energy
• +32% area

More performance w/ accuracy loss

* Without the older networks: LeNet + Convnet

Page 44

Stripes: Performance Boost

[Figure: results chart; axis annotation: "better"]

Page 45

Fully-Connected Layers?

• Each Tile:
  • No Weight Reuse
  • Cannot Have 16 Windows

[Figure: tile with 16 x 16 input neurons, 16 synapses, and output neurons]

Page 46

Fully-Connected Layers

• No Weight Reuse
• Cannot Have 16 Windows

[Figure: input neurons connected to output neurons through synapses]

Page 47

TARTAN: Accelerating Fully-Connected Layers

• Bit-Parallel Engine
  • V: activation
  • I: weight
  • Both 2 bits

Page 48

Bit-Parallel Engine: Processing one Activation x Weight

• Cycle 1:
  • Activation: a1 and Weight: W

Page 49

Bit-Parallel Engine: Processing Another Pair

• Cycle 2:
  • Activation: a2 and Weight: W
  • a1 x W + a2 x W over two cycles

Page 50

TARTAN engine

• 2 x 1b activation inputs
• 2b or 2 x 1b weight inputs

[Figure: engine with activation and weight inputs]

Page 51

TARTAN: Convolutional Layer Processing

• Cycle 1: Load 2b weight into BRs

Page 52

TARTAN: Weight x 1st bit of Two Activations

• Cycle 2: Multiply W with bit 1 of activations a1 and a2

Page 53

TARTAN: Weight x 2nd bit of Two Activations

• Cycle 3: Multiply W with 2nd bit of a1 and a2
• Load new W’ into BR
• 3-stage pipeline to do TWO of (2b activation x 2b weight)

Page 54

TARTAN: Fully-Connected Layers: Loading Weights

• What is different? Weights cannot be reused
• Cycle 1: Load first bit of two weights into ARs
  • Bit 1 of Two Different Weights

Page 55

TARTAN: Fully-Connected Layers: Loading Weights

• Cycle 2: Load 2nd bit of w1 and w2 into ARs
  • Bit 2 of Two Different Weights
  • Loaded Different Weights to Each Unit

Page 56

TARTAN: Fully-Connected Layers: Processing Activations

• Cycle 3: Move AR into BR and proceed as before over two cycles
• 5-stage pipeline to do:
  • TWO of (2b activation x 2b weight)
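The two weight-loading modes can be sketched with a toy model of a single unit. This is an illustration under loud assumptions: values are unsigned, the class and method names are hypothetical, and the cycle counter simply adds up the steps named on the slides rather than modelling how a real pipeline overlaps them.

```python
class TartanUnit:
    """Toy unit with a serial-load register (AR) and a working register (BR)."""

    def __init__(self):
        self.ar = 0
        self.br = 0
        self.cycles = 0

    def load_weight_parallel(self, weight):
        """Convolutional layers: the whole weight is written to BR in one cycle."""
        self.br = weight
        self.cycles += 1

    def load_weight_serial(self, weight, bits):
        """Fully-connected layers: one weight bit enters AR per cycle, then AR moves to BR."""
        self.ar = 0
        for bit in range(bits):
            self.ar |= ((weight >> bit) & 1) << bit
            self.cycles += 1
        self.br = self.ar
        self.cycles += 1  # the AR-to-BR move

    def multiply_serial(self, activation, bits):
        """Multiply BR by the activation, one activation bit per cycle."""
        product = 0
        for bit in range(bits):
            if (activation >> bit) & 1:
                product += self.br << bit
            self.cycles += 1
        return product
```

With 2-bit operands, the convolutional path takes 1 + 2 = 3 steps and the fully-connected path takes 2 + 1 + 2 = 5 steps in this model, which lines up with the 3-stage and 5-stage figures on the slides.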

Page 57

TARTAN: Result Summary

• Bit-Serial TARTAN
  • 2.04x faster than DaDianNao
  • 1.25x more energy efficient at the same frequency
  • 1.5x area overhead
• 2-bit at-a-time TARTAN
  • 1.6x faster than DaDianNao
  • Roughly the same energy efficiency
  • 1.25x area overhead

Page 58

Bit-Pragmatic Engine

Operand Information Content Varies

Page 59

Inner-Products

• Want to do A x B
• Let’s look at A
• Which bits really matter?

[Figure: A x B]

Page 60

Zero Bit Content: 16-bit fixed-point

• Only 8% of bits are non-zero once precision is reduced
• 10% - 15% otherwise

Page 61

Zero Bit Content: 8-bit Quantized (TensorFlow-like)

• Only 27% of bits are non-zero

Page 62

Pragmatic Concept: Bit-Parallel Engine

Page 63

Pragmatic Concept: Use Shift-and-Add

• Simply Modify Stripes?
• Too Large + Cross Lane Synchronization
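Shift-and-add over only the non-zero bits can be sketched as below, assuming unsigned operands. The function names are hypothetical; the point is that the number of steps equals the number of 1 bits in A, not its bit width.

```python
def one_bit_offsets(value):
    """Positions of the 1 bits in an unsigned value (the bits that matter)."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


def shift_and_add(a, b):
    """Compute a * b by adding b shifted by each 1-bit position of a.

    Returns (product, steps), where steps is the number of 1 bits in a.
    """
    offsets = one_bit_offsets(a)
    return sum(b << offset for offset in offsets), len(offsets)
```

For example, `shift_and_add(0b00100100, 5)` needs two steps, one per 1 bit, to produce 180.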

Page 64

Bit-Parallel Engine

[Figure: lanes 0-15; each multiplier takes two 16-bit inputs]

Page 65

STRIPES

[Figure: lanes 0-15 through 248-255; each multiplier takes a 1-bit input and a 16-bit input]

Page 66

Pragmatic: Naive STRIPES extension? Problem #1: Too Large

BIG = 3.7x area overhead just for the datapath

[Figure: lanes 0-15 through 248-255; each 16-bit input passes through a shifter (>>) driven by an offset, producing 32-bit terms; datapath marked BIG]

Page 67

Solution to #1? 2-Stage Shifting

• Process in groups of Max N Difference
• Example with N = 4
• Some opportunity loss, much lower area overhead
• Can skip groups of all zeroes

[Figure: lanes 0-15 with small per-lane shifters (>>) and a shared second shifter, 16-bit inputs and 20-bit terms; datapath marked OK; example bit patterns 10 00 00 and 00 00 10]

Page 68

Solution to #1? 2-Stage Shifting

• Process in groups of Max N Difference
• Example with N = 4
• Some opportunity loss, much lower area overhead

[Figure: lanes 0-15 with small per-lane shifters (>>) and a shared second shifter, 16-bit inputs and 20-bit terms; datapath marked OK; example bit patterns 10 00 00 and 00 00 10; shared shift of 4]
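One possible reading of 2-stage shifting, sketched as a toy model. The grouping rule here is an assumption, not a detail confirmed by the slides: in each cycle the smallest pending 1-bit offset across lanes becomes a shared base, and only lanes whose next offset is less than n above that base take part. Activations are assumed unsigned, and all names are hypothetical.

```python
def two_stage_inner_product(activations, weights, n=4):
    """Toy model of 2-stage shifting; returns (result, cycles)."""
    # Pending 1-bit offsets per lane, lowest first.
    pending = [[i for i in range(a.bit_length()) if (a >> i) & 1]
               for a in activations]
    total, cycles = 0, 0
    while any(pending):
        base = min(offsets[0] for offsets in pending if offsets)
        partial = 0
        for offsets, w in zip(pending, weights):
            if offsets and offsets[0] - base < n:
                # Stage 1: small per-lane shift by the difference from the base.
                partial += w << (offsets.pop(0) - base)
            # Lanes outside the window wait: the opportunity loss.
        total += partial << base  # Stage 2: one shared shift after the adder.
        cycles += 1
    return total, cycles
```

With activations `0b100000` and `0b000010`, the two 1 bits are 4 positions apart, so `n=4` needs two cycles while a wider window (`n=16`) needs one; both give the same result.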

Page 69

Lane Synchronization

• Different # of 1 bits
  • Lanes go out of sync
  • May have to fetch up to 256 different activations from NM
• Keep Lanes Synchronized:
  • No cost: All lanes
• Extra register for weights:
  • Allow columns to advance by 1
  • Some cost but much better performance
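The cost of keeping lanes synchronized can be estimated with a small sketch. It assumes each activation takes one cycle per 1 bit and at least one cycle overall; both are assumptions of this illustration, and the function names are hypothetical.

```python
def popcount(value):
    """Number of 1 bits in an unsigned value."""
    return bin(value).count("1")


def synchronized_cycles(groups):
    """Cycles when all lanes wait for the slowest lane in each group.

    `groups` is a list of activation groups, one activation per lane.
    """
    return sum(max(max(1, popcount(a)) for a in group) for group in groups)


def unsynchronized_cycles(groups):
    """Lower bound if each lane could run ahead freely: the busiest lane's total."""
    lanes = zip(*groups)
    return max(sum(max(1, popcount(a)) for a in lane) for lane in lanes)
```

For two lanes processing `[[0b1, 0b111], [0b1111, 0b1]]`, full synchronization costs 3 + 4 = 7 cycles, while the busiest lane alone needs only 5. That gap is what letting columns advance independently tries to recover.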

Page 70

Speedup and Energy Efficiency vs. DaDianNao

Page 71

Bit-Pragmatic

No Accuracy Loss:

• +310% performance
• -48% Energy
• +45% Area

Better w/ 8-bit Quantization

4.3x with Encoding

Page 72

Reducing Memory Footprint and Bandwidth

Page 73

Proteus

• Operand Precision Required Varies

Proteus: Store in reduced precision in memory

Less Bandwidth, Less Energy

Page 74

Proteus: Pick Per Layer Precision

• Weights (synapses) and Data (activations/neurons)

Layered Extension: Compatible with Existing Systems

Page 75

Conventional Format: Base Precision

Data Physically aligns with Unit Inputs

[Figure: lanes 0-15 feeding multipliers]

Page 76

Conventional Format: Base Precision

Need Shuffling Network to Route Synapses

4K input bits, any of 4K output bit positions: 4K x 4K Crossbar

[Figure: lanes 0-15 through 0-255 feeding multipliers through a crossbar]

Page 77

Proteus’ Key Idea: Pack Along Data Lane Columns

Local Shufflers: 16b input, 16b output

Much simpler

[Figure: lanes 0-15 feeding multipliers]
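A minimal sketch of storing values at reduced precision and packing them per lane. It assumes unsigned p-bit values and uses a Python integer as the packed word; the function names are hypothetical, and the local shufflers themselves are not modelled.

```python
def pack(values, p):
    """Pack unsigned p-bit values back to back into one integer."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << p), "value does not fit in p bits"
        word |= v << (i * p)
    return word


def unpack(word, p, count):
    """Recover `count` p-bit values from a packed integer."""
    mask = (1 << p) - 1
    return [(word >> (i * p)) & mask for i in range(count)]


def pack_by_lane(rows, p):
    """Pack column-wise so each lane's values stay within that lane's own stream."""
    lanes = len(rows[0])
    return [pack([row[j] for row in rows], p) for j in range(lanes)]


def footprint_reduction(p, base_bits=16):
    """Fraction of storage saved relative to the base precision."""
    return 1 - p / base_bits
```

Because each lane only ever unpacks its own stream, the routing stays local to the lane instead of requiring a crossbar across all input bits.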

Page 78

Proteus

44% less memory bandwidth

Page 79

What’s Next

• Training
• Prototype
  • Design Space: lower-end configurations
• Unified Architecture
  • Dispatcher + Compute
  • Other Workloads: Comp. Photo
• General Purpose Compute Class

Page 80

A Value-Based Approach to Acceleration

• More properties to discover and exploit
  • E.g., Filters do overlap significantly
• CNNs are one class
• Other networks
  • Use the same layers
  • Relative importance is different
• Training

