Page 1: Deep Learning Hardware Acceleration · Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify Tayler Hetherington* Natalie Enright Jerger Tor Aamodt* Andreas Moshovos

Deep Learning Hardware Acceleration

Jorge Albericio+, Alberto Delmas Lascorz, Patrick Judd, Sayeh Sharify, Tayler Hetherington*, Natalie Enright Jerger, Tor Aamodt*, Andreas Moshovos

+ now at NVIDIA

Page 2:

Disclaimer

The University of Toronto has filed patent applications for the technologies mentioned.

Page 3:

Deep Learning: Where Does the Time Go?

Convolutional Neural Networks: e.g., image classification

Time: ~60%-90% goes to inner products of 100s-1000s of terms.

[Figure: multiply-accumulate (x, +) datapath]

Page 4:

Deep Learning: Where Does the Time Go?

Time: ~60%-90% goes to inner products of 100s-1000s of terms.

[Figure: several parallel multiply-accumulate (x, +) chains]

Page 5:

SIMD: Exploit Computation Structure

DaDianNao: 4K terms/cycle

[Figure: 16 tiles (x16), each with multiplier lanes 0-15 over activation and weight vectors]

Page 6:

Our Approach

Improve by exploiting value properties, while maintaining:

• Massive parallelism
• SIMD lanes
• Wide memory accesses
• No modifications to the networks

[Figure: lanes 0-15 across Filter 0 through Filter 15]

Page 7:

Longer-Term Goal

One Architecture to Rule them All

[Figure: lanes 0-15 across Filter 0 through Filter 15]

Page 8:

Value Properties to Exploit?

0.0…0a x A

Page 9:

Value Properties to Exploit

Page 10:

Value Properties to Exploit

[Figure: several terms, each an operand x A]

Page 11:

Our Results: Performance

[Bar chart; y-axis: speedup, 0 to 4; vs. DaDianNao, which was ~300x over GPUs]

Cnvlutin: 1.5x (ISCA'16), Stripes: 1.9x (MICRO'16), Pragmatic: 3.1x (arXiv), Tartan: 1.60x-2.08x

Accuracy: 100% / 99%

Page 12:

Our Results: Memory Footprint and Bandwidth

• Proteus: 44% less memory bandwidth and footprint

Page 13:

Cnvlutin: ISCA'16

Many ineffectual multiplications

[Figure: multiply-accumulate terms where one operand is 0]

Page 14:

Many Activations and Weights are Intrinsically Ineffectual (zero)

• Zero activations:
  • Calculated at runtime
  • 45% of runtime values are zero
  • The percentage is stable across inputs
  • None are always zero
• Zero weights:
  • Known in advance
  • Not pursued in this work
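The 45% figure above is easy to reproduce in spirit: after a ReLU, roughly every pre-activation value at or below zero becomes exactly zero. A minimal sketch (the Gaussian input is an illustrative assumption, not the slides' measured networks):

```python
import random

def relu(xs):
    """Standard ReLU: negative values become exactly zero."""
    return [max(0.0, v) for v in xs]

def zero_fraction(activations):
    """Fraction of activation values that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)

# Pre-activation values roughly centered at zero: after ReLU, about
# half become exactly zero, in the same ballpark as the ~45% of
# runtime activation values the slides report.
random.seed(0)
pre = [random.gauss(0.0, 1.0) for _ in range(10_000)]
post = relu(pre)
print(zero_fraction(post))
```

The exact fraction depends on the network and layer; the point is only that it is large and input-independent in aggregate.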

Page 15:

Cnvlutin

Many ineffectual multiplications

[Figure: multiply-accumulate terms where one operand is 0]

Page 16:

Cnvlutin

Many more ineffectual multiplications

[Figure: multiply-accumulate terms with zero operands, plus small values a and b]

Page 17:

Cnvlutin

Beating fast and "dumb" SIMD is hard:

• On-the-fly ineffectual-product elimination
• Performance + energy gains
• Optional: trade some accuracy for more performance

Page 18:

Cnvlutin

• No accuracy loss
• +52% performance
• -7% power
• +5% area
• Relaxing the ineffectual criterion: 60% better performance, and even more with some accuracy loss

Page 19:

Deep Learning: Convolutional Neural Networks

[Figure: an image passes through 10s-100s of layers and comes out classified: "Swedish meatballs, maybe"]

Page 20:

Naïve Solution: No Wide Memory Accesses

• 16 independent narrow activation streams

[Figure: Lane 0 through Lane 15]

Page 21:

Removing Zeroes: At the Output of Each Layer

[Figure: Layer i -> encode -> Neuron Mem -> Layer i+1; final output: "Swedish meatballs, maybe"]
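One way to picture the encoding step above: as each layer's output is produced, zeros are dropped and each surviving value keeps an offset so the next layer can still index it. A simplified sketch (the Cnvlutin paper's actual format works on fixed-size groups of values; this flat list is an assumption for illustration):

```python
def encode_nonzero(activations):
    """Drop zeros at a layer's output, keeping (value, offset) pairs
    so the next layer can still locate each surviving activation.
    A simplified, flat-list sketch of a Cnvlutin-style encoding."""
    return [(v, i) for i, v in enumerate(activations) if v != 0]

# Only the two non-zero activations are stored and later processed.
assert encode_nonzero([0, 7, 0, 0, 3]) == [(7, 1), (3, 4)]
```

Because the encoding happens on the way out of one layer, the next layer's multipliers never see the zeros at all.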

Page 22:

Cnvlutin: No Accuracy Loss

[Chart; higher is better]

Page 23:

Loosening the Ineffectual Neuron Criterion

• Treat neurons close to zero as zero

Open questions: Are these thresholds robust? How do we find the best one?

Page 24:

Another Property of CNNs

Is the required operand precision fixed? 16 bits?

[Figure: multiply-accumulate datapath with 16-bit operands]

Page 25:

CNNs: Precision Requirements Vary

The required operand precision is not fixed: it varies from 5 bits to 13 bits (vs. a fixed 16 bits), p bits per layer.

[Figure: multiply-accumulate datapath with p-bit operands]
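A crude way to see where a per-layer precision p comes from: find the smallest bit width that still represents every value in the layer. This is a sketch only; the papers instead pick the shortest width that preserves network accuracy, which can be even smaller than exact representation requires:

```python
def min_precision(values, base=16):
    """Smallest bit width that represents every (unsigned) value
    exactly, capped at the base precision. A crude stand-in for the
    accuracy-driven per-layer profiling used by Stripes/Proteus."""
    need = max((v.bit_length() for v in values), default=1)
    return min(max(need, 1), base)

# A layer whose largest value is 12 only needs 4 of the 16 bits.
assert min_precision([3, 12, 1]) == 4
```

Different layers yield different p, which is exactly the variation the next slides exploit.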

Page 26:

Stripes

Execution time = 16 / p

Performance + energy efficiency + an accuracy knob

[Figure: multiply-accumulate datapath with p-bit operands]
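The 16/p execution-time claim follows from feeding each activation one bit per cycle: a layer that only needs p bits finishes in p cycles instead of 16. A minimal functional sketch of bit-serial multiply-accumulate (hardware details like lane counts are omitted):

```python
def bit_serial_mac(activations, weights, p):
    """Multiply-accumulate with activations fed one bit per cycle.

    p is the precision (bit width) the layer actually needs; a
    bit-parallel engine always pays for 16 bits, so the relative
    speedup is 16 / p."""
    acc = 0
    for cycle in range(p):             # one activation bit per cycle
        for a, w in zip(activations, weights):
            bit = (a >> cycle) & 1     # serialize the activation
            acc += (bit * w) << cycle  # shift-and-add the weight
    return acc

acts, wts = [5, 3, 7], [2, 4, 1]
# With p = 3 the serial result matches the ordinary inner product.
assert bit_serial_mac(acts, wts, p=3) == sum(a * w for a, w in zip(acts, wts))
```

Correctness does not depend on p as long as p covers every activation's magnitude, which is what the per-layer precision profiling guarantees.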

Page 27:

Stripes: Key Concept

• Devil in the details: carefully choose what to serialize and what to reuse, using the same input wires as the baseline

[Figure: terms/step: 2 of (2b x 2b) vs. 2 of (1b x 2b) vs. 4 of (1b x 2b)]

Page 28:

SIMD: Exploit Computation Structure

DaDianNao: 4K terms/cycle

[Figure: 16 tiles (x16), each with multiplier lanes 0-15]

Page 29:

Stripes Bit-Serial Engine

[Figure: multiplier lanes 0-15 through 248-255; 16-bit weights, activations fed 1 bit per cycle]

Page 30:

Compensating for Bit-Serial's Compute Bandwidth Loss

• Each tile:
  • 16 windows concurrently, 16 neurons each
  • 16 filters
  • 16 partial output neurons

[Figure: 16x16 neurons and synapses per tile]

Page 31:

Stripes

• No accuracy loss
• +192% performance*
• -57% energy
• +32% area
• More performance with accuracy loss

* Excluding the older networks: LeNet + ConvNet

Page 32:

Stripes: Performance Boost

[Chart; higher is better]

Page 33:

Fully-Connected Layers?

• Each tile:
  • No weight reuse
  • Cannot have 16 windows

[Figure: 16x16 neurons and synapses per tile]

Page 34:

Fully-Connected Layers

• No weight reuse
• Cannot have 16 windows

[Figure: input neurons -> synapses -> output neurons]

Page 35:

TARTAN: Accelerating Fully-Connected Layers

• Bit-parallel engine
• V: activation
• I: weight
• Both 2 bits

Page 36:

Bit-Parallel Engine: Processing One Activation x Weight

• Cycle 1: activation a1 and weight W

Page 37:

Bit-Parallel Engine: Processing Another Pair

• Cycle 2: activation a2 and weight W
• a1 x W + a2 x W over two cycles

Page 38:

TARTAN Engine

• 2 x 1b activation inputs
• 2b or 2 x 1b weight inputs

[Figure: activation and weight inputs feeding the unit]

Page 39:

TARTAN: Convolutional Layer Processing

• Cycle 1: load the 2b weight into the BRs

Page 40:

TARTAN: Weight x 1st Bit of Two Activations

• Cycle 2: multiply W with bit 1 of activations a1 and a2

Page 41:

TARTAN: Weight x 2nd Bit of Two Activations

• Cycle 3: multiply W with the 2nd bit of a1 and a2
• Load the new W' into the BR
• A 3-stage pipeline computes two (2b activation x 2b weight) products
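Functionally, the conv-layer pipeline in the last three slides reduces to: load a weight once, then consume one bit of each of two activations per cycle. A behavioral sketch (pipeline registers BR/AR and timing are abstracted away; only the arithmetic is modeled):

```python
def tartan_conv_pair(w, a1, a2, p=2):
    """Behavioral sketch of TARTAN's conv processing: the weight W is
    loaded once (cycle 1), then multiplied against one bit of each of
    two activations per cycle (cycles 2..p+1)."""
    out1 = out2 = 0
    for bit in range(p):                        # one activation bit per cycle
        out1 += ((a1 >> bit) & 1) * (w << bit)  # shift-and-add
        out2 += ((a2 >> bit) & 1) * (w << bit)
    return out1, out2

# Both products match ordinary multiplication for 2-bit activations.
assert tartan_conv_pair(3, a1=2, a2=1) == (3 * 2, 3 * 1)
```

In the fully-connected case the same datapath is used, but the weights themselves must also be streamed in bit by bit (the next slides), stretching this into the 5-stage pipeline.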

Page 42:

TARTAN: Fully-Connected Layers: Loading Weights

• What is different? Weights cannot be reused
• Cycle 1: load the first bit of two weights into the ARs

[Figure: bit 1 of two different weights]

Page 43:

TARTAN: Fully-Connected Layers: Loading Weights

• Cycle 2: load the 2nd bit of w1 and w2 into the ARs
• Bit 2 of two different weights
• A different weight has now been loaded into each unit

Page 44:

TARTAN: Fully-Connected Layers: Processing Activations

• Cycle 3: move the AR into the BR and proceed as before over two cycles
• A 5-stage pipeline computes two (2b activation x 2b weight) products

Page 45:

TARTAN: Result Summary

• Bit-serial TARTAN:
  • 2.04x faster than DaDianNao
  • 1.25x more energy efficient at the same frequency
  • 1.5x area overhead
• 2-bit-at-a-time TARTAN:
  • 1.6x faster than DaDianNao
  • Roughly the same energy efficiency
  • 1.25x area overhead

Page 46:

Bit-Pragmatic Engine

Operand information content varies

[Figure: multiply-accumulate datapath]

Page 47:

Inner Products

• We want to compute A x B
• Let's look at A
• Which bits really matter?

Page 48:

Zero Bit Content: 16-bit Fixed-Point

• Only 8% of bits are non-zero once precision is reduced
• 10%-15% otherwise
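These percentages come from simply counting 1 bits across operand values. A minimal sketch of that measurement (the input values here are made up; the slides measured real network activations):

```python
def essential_bit_fraction(values, precision=16):
    """Fraction of bits that are 1 across fixed-point values:
    the 'essential' information a shift-and-add engine must process."""
    mask = (1 << precision) - 1
    ones = sum(bin(v & mask).count("1") for v in values)
    return ones / (len(values) * precision)

# Two 4-bit values with three 1 bits between them: 3/8 of bits matter.
assert essential_bit_fraction([0b0001, 0b0011], precision=4) == 3 / 8
```

An engine whose work scales with the count of 1 bits, rather than with the nominal width, only touches this small essential fraction.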

Page 49:

Zero Bit Content: 8-bit Quantized (TensorFlow-like)

• Only 27% of bits are non-zero

Page 50:

Pragmatic Concept: Use Shift-and-Add

• Simply modify Stripes?
• Too large + cross-lane synchronization
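The shift-and-add concept: instead of stepping through all p bit positions of an activation, process only the positions that hold a 1, shifting the weight by each. A minimal functional sketch (the Pragmatic hardware encodes these positions up front rather than peeling them off one at a time):

```python
def pragmatic_product(a, w):
    """Compute a * w by shifting w once per non-zero bit of a.
    Cycle count is the popcount of a, not its full bit width."""
    acc = 0
    while a:
        pos = a.bit_length() - 1   # highest remaining 1-bit position
        acc += w << pos            # shift-and-add the weight
        a &= ~(1 << pos)           # clear that bit
    return acc

# 0b10010 has only two 1 bits, so two shift-and-add steps suffice.
assert pragmatic_product(0b10010, 7) == 18 * 7
```

With only ~8%-27% of bits non-zero (previous slides), the savings over even a bit-serial engine are substantial.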

Page 51:

Bit-Parallel Engine

[Figure: multiplier lanes 0-15; 16-bit activation and weight inputs]

Page 52:

Stripes

[Figure: multiplier lanes 0-15 through 248-255; 16-bit weights, activations fed 1 bit per cycle]

Page 53:

Pragmatic: Naive Stripes Extension? Problem #1: Too Large

[Figure: lanes 0-15 through 248-255; each lane needs a full shifter (>>) and 32-bit adders: BIG]

Page 54:

Solution to #1: 2-Stage Shifting

• Process bits in groups with a maximum difference of N
• Example with N = 4
• Some opportunity loss, but much lower area overhead
• Can skip groups that are all zeroes

[Figure: a shared coarse shifter plus small per-lane shifters (>>, 20-bit intermediates): OK; example bit groups 10 00 00 and 00 00 10]
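The grouping idea can be sketched as a scheduler: repeatedly take the highest remaining 1-bit offsets, handle everything within a window of N of that maximum using one shared coarse shift plus cheap per-lane shifts of at most N-1, and leave the rest for a later step. This is a simplified model; the exact grouping policy in the real design is an assumption here:

```python
def schedule_two_stage(offsets, n=4):
    """Split bit offsets into (coarse_shift, fine_shifts) steps.
    Each step covers offsets within a window of n of the current
    maximum, so per-lane shifters only need to cover 0..n-1."""
    steps = []
    remaining = sorted(offsets, reverse=True)
    while remaining:
        base = max(remaining[0] - (n - 1), 0)   # shared coarse shift
        group = [o for o in remaining if o >= base]
        steps.append((base, [o - base for o in group]))
        remaining = [o for o in remaining if o < base]
    return steps

# Offsets {15, 14} and {3, 2} are more than n apart, so two steps.
assert schedule_two_stage([15, 14, 3, 2], n=4) == [(12, [3, 2]), (0, [3, 2])]
```

The opportunity loss the slide mentions shows up as extra steps when offsets straddle a window boundary; the win is that no lane ever needs a full-width shifter.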

Page 55:

Solution to #1: 2-Stage Shifting

• Process bits in groups with a maximum difference of N
• Example with N = 4
• Some opportunity loss, but much lower area overhead

[Figure: a shared coarse shifter plus small per-lane shifters, N = 4; example bit groups 10 00 00 and 00 00 10]

Page 56:

Lane Synchronization

• Activations have different numbers of 1 bits, so lanes go out of sync
• May have to fetch up to 256 different activations from NM
• Keeping lanes synchronized:
  • No cost: synchronize all lanes
  • An extra register per column: some cost, better performance

Page 57:

Bit-Pragmatic

• No accuracy loss
• +310% performance
• -48% energy
• +45% area
• Even better with 8-bit quantized nets

[Figure: multiply-accumulate datapath]

Page 58:

Processing Only the Essential Information

• 8-bit quantized representation

[Chart: Stripes 8b vs. Pragmatic]

Page 59:

Bit-Pragmatic

• Better encoding is possible and improves performance

Page 60:

Proteus

• The required operand precision varies

Proteus: store values in reduced precision in memory

Less bandwidth, less energy

[Figure: multiply-accumulate datapath]
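The storage idea can be sketched as plain bit packing: if a layer only needs p bits per value, store just those p bits back to back, for a 16/p reduction in footprint and bandwidth. A minimal sketch (Proteus actually packs along data-lane columns, per the next slides; this flat stream ignores that layout detail):

```python
def pack(values, p):
    """Pack p-bit values into one contiguous bitstream."""
    stream, nbits = 0, 0
    for v in values:
        stream |= (v & ((1 << p) - 1)) << nbits  # append p bits
        nbits += p
    return stream, nbits

def unpack(stream, nbits, p):
    """Recover the p-bit values from the bitstream."""
    return [(stream >> i) & ((1 << p) - 1) for i in range(0, nbits, p)]

# Three values stored in 12 bits instead of 3 x 16 = 48 bits.
vals = [5, 9, 3]
s, n = pack(vals, p=4)
assert unpack(s, n, p=4) == vals and n == 12
```

The hard part in hardware is not the packing itself but restoring alignment with the compute lanes, which is what the crossbar-vs-local-shuffler discussion below is about.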

Page 61:

Conventional Format: Base Precision

Data physically aligns with unit inputs

[Figure: lanes 0-15]

Page 62:

Conventional Format: Base Precision

Need a shuffling network to route synapses: any of 4K input bits to any of 4K output bit positions

[Figure: lanes 0-15 fed by a 4K x 4K crossbar]

Page 63:

Proteus' Key Idea: Pack Along Data Lane Columns

Local shufflers: 16b input to 16b output; much simpler

[Figure: lanes 0-15]

Page 64:

Proteus

44% less memory bandwidth

Page 65:

What's Next

• Training
• Prototype
• Design space: lower-end configurations
• Unified architecture
• Dispatcher + compute
• Other workloads: computational photography
• A general-purpose compute class

Page 66:

Our Results: Performance

[Bar chart; y-axis: speedup, 0 to 4; vs. DaDianNao, which is ~300x over GPUs]

Cnvlutin: 1.5x (ISCA'16), Stripes: 1.9x (MICRO'16), Pragmatic: 3.1x (arXiv), Tartan: 1.60x-2.08x

Accuracy: 100% / 99%

Page 67:

A Value-Based Approach to Acceleration

• More properties to discover and exploit
  • E.g., filters do overlap significantly
• CNNs are one class
• Other networks:
  • Use the same layers
  • Their relative importance differs
• Training