Transcript
Page 1: ELSA: Hardware-Software Co-Design for Efficient ...

ELSA: Hardware-Software Co-Design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks

Tae Jun Ham*, Yejin Lee*, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, Jae W. Lee

The 48th ACM/IEEE International Symposium on Computer Architecture (ISCA) ARC Lab @ Seoul National University

* These authors contributed equally to this work

Page 2: ELSA: Hardware-Software Co-Design for Efficient ...


Self-Attention: A Key Primitive of Transformer-based Models

• Transformer-based models achieve state-of-the-art performance in many domains

• NLP: Question Answering, Translation, Language Modeling

• CV: Image Classification, Object Detection

• Self-attention is the key primitive of Transformer-based models

• Self-attention identifies relations among entities, enabling models to effectively handle sequential data

Page 3: ELSA: Hardware-Software Co-Design for Efficient ...


Self-Attention: A Key Primitive of Transformer-based Models

• Transformer-based models often come with a large computational cost

• Self-attention often accounts for more than 30% of the total runtime of Transformer models

• This portion becomes even larger with longer input sequences

[Figure: runtime breakdown into Self-Attention vs. Others for BERT, RoBERTa, ALBERT, SASRec, and BERT4Rec, at the default sequence length and at a 4x larger sequence length]

Page 4: ELSA: Hardware-Software Co-Design for Efficient ...


Self-Attention Mechanism: Dot Product Computation

Q. What is self-attention?
A. Find the key vectors relevant to a query vector, then return the weighted sum of the value vectors corresponding to those relevant keys.

Step 1. Attention Score: compute the dot product of the query with each row of the key matrix.

[Figure: example Query matrix (n x d), Key matrix (n x d), Attention Score matrix (n x n), and Value matrix (n x d)]

Page 5: ELSA: Hardware-Software Co-Design for Efficient ...


Self-Attention Mechanism: Softmax Computation

Q. What is self-attention?
A. Find the key vectors relevant to a query vector, then return the weighted sum of the value vectors corresponding to those relevant keys.

Step 1. Attention Score: compute the dot product of the query with each row of the key matrix.
Step 2. Softmax-normalized Attention Score: apply softmax normalization (σ) to the attention scores.

[Figure: example Query matrix (n x d), Key matrix (n x d), Attention Score matrix (n x n), Normalized Score matrix (n x n), and Value matrix (n x d)]

Page 6: ELSA: Hardware-Software Co-Design for Efficient ...


Self-Attention Mechanism: Weighted Sum Computation

Q. What is self-attention?
A. Find the key vectors relevant to a query vector, then return the weighted sum of the value vectors corresponding to those relevant keys.

Step 1. Attention Score: compute the dot product of the query with each row of the key matrix.
Step 2. Softmax-normalized Attention Score: apply softmax normalization (σ) to the attention scores.
Step 3. Weighted Sum of Value Vectors: compute the weighted sum of the value matrix rows, using the normalized scores as weights, to produce the output.

[Figure: example Query matrix (n x d), Key matrix (n x d), Attention Score matrix (n x n), Normalized Score matrix (n x n), Value matrix (n x d), and Output matrix (n x d)]
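The three steps above can be summarized in a short NumPy sketch of the exact (non-approximate) self-attention computation; the 1/sqrt(d) scaling used in standard Transformers is omitted here to match the slide's formulation.

```python
import numpy as np

def self_attention(Q, K, V):
    """Exact self-attention over n entities: Steps 1-3 from the slides."""
    # Step 1: attention scores, one dot product per (query, key) pair -> n x n
    scores = Q @ K.T
    # Step 2: softmax normalization along each row (over keys)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)          # normalized scores, n x n
    # Step 3: weighted sum of value vectors -> n x d output
    return weights @ V

# Toy example with n = 3 entities of dimension d = 3
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 3, 3))
print(self_attention(Q, K, V))
```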

Page 7: ELSA: Hardware-Software Co-Design for Efficient ...


• The attention mechanism looks like a series of dense matrix operations, but it is also a content-based search

• The softmax operation filters out data that are NOT relevant to the query, based on the attention scores

Key Idea

What if we could find the set of candidate keys that are likely to be relevant to the query without much computation?

Avoid processing irrelevant keys

Reduce the amount of computation

Example: raw attention scores and their softmax-normalized scores, softmax(x_i) = e^{x_i} / Σ_j e^{x_j}

Attention scores:   6.66, 1.74, 5.58, 1.80, -0.85, -0.29, -0.19, -1.03, 10.90, 2.33, 5.54
Normalized scores:  0.014, 0.000, 0.005, 0.000, 0.000, 0.000, 0.000, 0.000, 0.976, 0.000, 0.005

Opportunities for Approximation in Self-Attention Mechanism
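The filtering effect in the example above is easy to reproduce; the following few lines of NumPy recompute the softmax-normalized scores and show that a single key receives almost all of the weight.

```python
import numpy as np

scores = np.array([6.66, 1.74, 5.58, 1.80, -0.85, -0.29, -0.19, -1.03, 10.90, 2.33, 5.54])
probs = np.exp(scores) / np.exp(scores).sum()
print(np.round(probs, 3))  # the key with score 10.90 gets ~0.98; most keys get ~0.000
```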

Page 8: ELSA: Hardware-Software Co-Design for Efficient ...


Why Specialized Hardware?

Observation: Conventional hardware such as GPUs does not benefit from such approximation
• GPUs are better optimized for full similarity computation (dot product) than for the presented approximate similarity computation

[Figure: GPU device vs. ELSA HW accelerator]

Specialized hardware can fully exploit the potential benefits of the presented approximate self-attention mechanism

Page 9: ELSA: Hardware-Software Co-Design for Efficient ...


ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism

[Figure: overview of the approximate self-attention algorithm. During preprocessing, ELSA performs efficient key hash computation and key norm computation over the key matrix K. For each query from the query matrix Q, it then performs (a) efficient query hash computation, (b) Hamming distance computation between the query hash and each key hash, (c) Hamming-distance-to-angle conversion, (d) cosine, (e) key norm multiplication, and (f) candidate selection, before moving on to the next query.]

With its novel approximate self-attention algorithm and specialized hardware, ELSA achieves a significant reduction in both the runtime and the energy spent on the self-attention mechanism.

[Figure: the approximate self-attention algorithm (left) mapped onto the specialized hardware for approximation (right), comprising hash computation units (HashComp, HashMem), XOR and popcount (Σ) units, a cosine LUT, comparators, an arbiter (Arb), KeyMem and ValMem, multiply-accumulate arrays, an output buffer, and a division unit]

Page 10: ELSA: Hardware-Software Co-Design for Efficient ...


Estimating Angular Distance

The attention score between a given query vector Q and a key vector K is
AttentionScore(K, Q) = K · Q

Assuming n entities, there are n² (query, key) pairs in total → n² · d multiply-and-add (MAC) operations.

Key Idea: ELSA maps the query vector Q and each key vector K to binary vectors and approximates the attention scores from these binary embeddings instead of performing d MAC operations per pair.

[Figure: an example query Q and key matrix K are mapped to binary vectors, Hash for Q and Hash for K]

Page 11: ELSA: Hardware-Software Co-Design for Efficient ...


Estimating Angular Distance

The attention score between a given query vector Q and a key vector K is
AttentionScore(K, Q) = K · Q = ||K|| · ||Q|| · cos(θ_{K,Q}) ∝ ||K|| · cos(θ_{K,Q})

Key Idea: Approximate cos(θ_{K,Q}) to efficiently approximate the attention score using binary hashing.

Self-attention is about finding the keys relevant to a given query Q, so we can ignore ||Q|| since it is the same for all keys.

Page 12: ELSA: Hardware-Software Co-Design for Efficient ...


Estimating Angular Distance

The attention score between a given query vector Q and a key vector K is
AttentionScore(K, Q) = K · Q = ||K|| · ||Q|| · cos(θ_{K,Q}) ∝ ||K|| · cos(θ_{K,Q})

Key Idea: Approximate cos(θ_{K,Q}) to efficiently approximate the attention score using binary hashing.
Self-attention is about finding the keys relevant to a given query Q, so we can ignore ||Q|| since it is the same for all keys.

Step 1. Initialize k (= 4) random vectors (v1, v2, v3, v4); the normal vectors they define (n_v1, n_v2, n_v3, n_v4) are illustrated, each splitting the space into a + side and a − side.

[Figure: four random hyperplanes with normal vectors n_v1 through n_v4]
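A minimal NumPy sketch of this random-hyperplane (sign) hashing follows; the function and variable names are illustrative rather than taken from the paper, and each hash bit simply records on which side of one random hyperplane the vector falls, as described above.

```python
import numpy as np

def make_random_vectors(d, k, seed=0):
    """Draw k random d-dimensional vectors; each one defines a hashing hyperplane."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((k, d))

def binary_hash(x, random_vectors):
    """k-bit hash of x: bit i is 1 if x lies on the + side of hyperplane i (x . v_i >= 0)."""
    return (random_vectors @ x >= 0).astype(np.uint8)

# Toy example: d = 3 dimensional vectors, k = 4 hash bits
vs = make_random_vectors(d=3, k=4)
q = np.array([-0.6, -0.8, -1.1])
key = np.array([0.3, -0.2, -1.1])
print(binary_hash(q, vs), binary_hash(key, vs))
```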

Page 13: ELSA: Hardware-Software Co-Design for Efficient ...
Page 14: ELSA: Hardware-Software Co-Design for Efficient ...

1) In a 7-dimensional vector space, pick a random vector v1 ∈ R^7, e.g., v1 = [0.5, -0.7, 0.3, 0.9, 1.2, ...].

Page 15: ELSA: Hardware-Software Co-Design for Efficient ...

1) In a 7-dimensional vector space, pick a random vector v1 ∈ R^7.

2) Create a hyperplane: v1 serves as the normal vector n_v1 of the hyperplane, splitting the space into the side where x · n_v1 ≥ 0 and the side where x · n_v1 < 0.

Page 16: ELSA: Hardware-Software Co-Design for Efficient ...

1) In a 7-dimensional vector space, pick k random vectors v1, ..., vk ∈ R^7.

2) Create the corresponding hyperplanes, each with its normal vector n_vi.

3) Binary hashing (example: k = 3): hash each vector according to which side of each hyperplane it lies on, e.g., h(K) = [1 0 0] and h(Q) = [1 1 1], where k is the number of random vectors.

Page 17: ELSA: Hardware-Software Co-Design for Efficient ...

4) Hamming distance: with h(K) = [1 0 0] and h(Q) = [1 1 1], XOR gives [0 1 1], so hamming(K, Q) = Σ [0 1 1] = 2.

Page 18: ELSA: Hardware-Software Co-Design for Efficient ...

4) Hamming distance: hamming(K, Q) = Σ [0 1 1] = 2.

5) Compute θ (with k = 3): θ_{Q,K} ≈ (π/k) · hamming(K, Q) = (π/3) · 2 = 2π/3; the attention score for (Q, K) is then approximated from this angle.

Page 19: ELSA: Hardware-Software Co-Design for Efficient ...


Estimating Angular Distance

[Figure: three example vectors whose 4-bit binary hashes are [0110], [1110], and [0001], shown against the four random hyperplanes ❶-❹]

Pairwise angle estimates from the hashes (k = 4):
θ ≈ (π/k) · hamming([1110], [0001]) = (π/4) · 4 = π
θ ≈ (π/k) · hamming([0001], [0110]) = (π/4) · 3 = 3π/4
θ ≈ (π/k) · hamming([0110], [1110]) = (π/4) · 1 = π/4

→ The angular distance can be obtained just from the binary embeddings, giving the approximate score ||K|| · cos(hamming(h(Q), h(K)) · π/k).

The attention score between a given query vector Q and a key vector K is
AttentionScore(K, Q) = K · Q = ||K|| · ||Q|| · cos(θ_{K,Q}) ∝ ||K|| · cos(θ_{K,Q})

Key Idea: Approximate cos(θ_{K,Q}) to efficiently approximate the attention score using binary hashing.
Self-attention is about finding the keys relevant to a given query Q, so we can ignore ||Q|| since it is the same for all keys.
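The hash-to-angle arithmetic on this slide can be checked with a few lines of NumPy, using the three hash values from the figure and k = 4 bits:

```python
import numpy as np

k = 4
h1 = np.array([0, 1, 1, 0])
h2 = np.array([1, 1, 1, 0])
h3 = np.array([0, 0, 0, 1])

def angle_estimate(ha, hb, k):
    """Estimated angle between two vectors from their k-bit binary hashes."""
    return np.count_nonzero(ha ^ hb) * np.pi / k

print(angle_estimate(h2, h3, k))  # 3.14... = pi     (all 4 bits differ)
print(angle_estimate(h3, h1, k))  # 2.36... = 3*pi/4 (3 bits differ)
print(angle_estimate(h1, h2, k))  # 0.79... = pi/4   (1 bit differs)
```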

Page 20: ELSA: Hardware-Software Co-Design for Efficient ...


Approximate Attention Score with Binary Hashing

Step 1. Preprocessing: compute the hash values for the keys and the query, and the norm of each key.

Step 2. Approximate Attention Score Computation: compute an approximate attention score for each key:
AttentionScore(K, Q) / ||Q|| = K · Q / ||Q|| = ||K|| · cos(θ_{K,Q}) ≈ ||K|| · cos(hamming(h(Q), h(K)) · π/k), where k is the number of hash bits

Original computation of ||K|| · cos(θ_{K,Q}): d multiply-and-add operations
Approximate computation:
  1. Bitwise XOR (Hamming distance)
  2. Lookup table: f(x) = cos(x · π/k)
  3. Multiplication by ||K||

Step 3. Candidate Selection: filter out potentially irrelevant keys whose approximate attention score is below a pretrained threshold t, i.e., keep key K only if ||K|| · cos(hamming(h(Q), h(K)) · π/k) ≥ t. (Keys whose score falls below t are filtered out.)
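As a rough sketch of Steps 2 and 3 (illustrative names and a caller-supplied threshold t; the hardware replaces the NumPy calls with an XOR/popcount unit and a small cosine lookup table):

```python
import numpy as np

def approx_scores(q_hash, key_hashes, key_norms, k):
    """Step 2: approximate attention scores from binary hashes and key norms."""
    hamming = np.count_nonzero(q_hash ^ key_hashes, axis=1)   # bitwise XOR + popcount per key
    cos_lut = np.cos(np.arange(k + 1) * np.pi / k)            # lookup table f(x) = cos(x*pi/k)
    return key_norms * cos_lut[hamming]                       # one multiplication per key

def select_candidates(q_hash, key_hashes, key_norms, k, t):
    """Step 3: keep only the keys whose approximate score reaches the threshold t."""
    return np.nonzero(approx_scores(q_hash, key_hashes, key_norms, k) >= t)[0]

# Toy usage: 8 keys with k = 4 hash bits
rng = np.random.default_rng(0)
key_hashes = rng.integers(0, 2, size=(8, 4))
q_hash = rng.integers(0, 2, size=4)
key_norms = rng.uniform(0.5, 2.0, size=8)
print(select_candidates(q_hash, key_hashes, key_norms, k=4, t=0.5))
```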

Page 21: ELSA: Hardware-Software Co-Design for Efficient ...


Candidate Selection

Q. How do we find the right threshold value for all of the NN layers (and sub-layers)?
A. Learn it from the training data!

A naive approach is to sort all approximate attention scores and select the top-scoring keys.
Issue #1: Sorting is expensive in hardware
Issue #2: Sorting can only happen after every key's estimated score has been computed

ELSA instead selects the keys whose estimated score exceeds a certain threshold.
Pro #1: No expensive sort is needed
Pro #2: Filtering can happen while each key's estimated score is being computed

Page 22: ELSA: Hardware-Software Co-Design for Efficient ...


Candidate Selection

Example: raw attention scores [1.35, -0.72, 0.86, 0.24] → softmax-normalized scores (e^score / Σ e^score) [0.48, 0.06, 0.30, 0.16]

Step 1. Set Hyperparameter g: the user enters the hyperparameter g for the accuracy/performance tradeoff.
Assuming n is the number of input entities (keys), we identify the set of keys whose softmax-normalized attention score s′ exceeds g/n.
If g = 1 and n = 100, the user considers entities with s′ > 0.01 to be relevant.

Step 2. Learn the raw attention score whose softmax-normalized attention score equals g/n.
A higher g means more aggressive approximation; a lower g means more conservative approximation.
In the example above, with g = 1 and n = 4, keys with s′ < 0.25 are filtered out.

Step 3. Repeat for all cases in the training data.
This yields a different threshold for each layer (and sub-layer).
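The slide does not spell out the exact statistic used to turn these per-example observations into one threshold per (sub-)layer, so the following is only a plausible offline sketch: for every training query we record the smallest raw score that still counts as relevant (softmax-normalized score > g/n), then pick a conservative low percentile of those values.

```python
import numpy as np

def learn_threshold(raw_score_rows, g, percentile=10):
    """Offline threshold-learning sketch for one (sub-)layer.

    raw_score_rows: iterable of 1-D arrays, each holding one query's raw
    attention scores over its n keys, collected from the training data.
    """
    per_query_minima = []
    for scores in raw_score_rows:
        n = len(scores)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                      # softmax-normalized scores s'
        relevant = scores[probs > g / n]          # keys the user considers relevant
        if relevant.size:
            per_query_minima.append(relevant.min())
    # a low percentile keeps the threshold conservative (relevant keys rarely filtered)
    return np.percentile(per_query_minima, percentile)

# The 4-key example from the slide with g = 1: the keys with s' of 0.48 and 0.30 are relevant
print(learn_threshold([np.array([1.35, -0.72, 0.86, 0.24])], g=1))  # -> 0.86
```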

Page 23: ELSA: Hardware-Software Co-Design for Efficient ...


ELSA Hardware Implementation

1. Preprocessing Phase (key hash computation): the HashComp unit reads the Key matrix and stores the binary key hashes in HashMem (⇒ generates the binary hashes)
2. Execution Phase
   2-1 Compute Query Hash
   2-2 Candidate Selection & Attention Computation
   2-3 Output Division

Page 24: ELSA: Hardware-Software Co-Design for Efficient ...


ELSA Hardware Implementation

1. Preprocessing Phase (key hash computation)
2. Execution Phase
   2-1 Compute Query Hash: compute the hash h(Q) for the query using the HashComp unit and store it in HashMem
   2-2 Candidate Selection & Attention Computation
   2-3 Output Division

Page 25: ELSA: Hardware-Software Co-Design for Efficient ...


ELSA Hardware Implementation

1. Preprocessing Phase (key hash computation)
2. Execution Phase
   2-1 Compute Query Hash
   2-2 Candidate Selection & Attention Computation: compute the approximate similarity and filter out keys with low similarity
       Approximate similarity: ||K|| · cos(hamming(h(Q), h(K)) · π/k)
   2-3 Output Division

[Block diagram: the key and query hashes from HashComp/HashMem feed XOR and popcount (Σ) units, a cosine LUT, multipliers, and comparators (>); an arbiter (Arb) emits the selected row IDs]

Page 26: ELSA: Hardware-Software Co-Design for Efficient ...


ELSA Hardware Implementation

1. Preprocessing Phase (key hash computation)
2. Execution Phase
   2-1 Compute Query Hash
   2-2 Candidate Selection & Attention Computation: compute the partial attention for the selected keys and accumulate it in the buffer:
       Σ_i e^(Q·K_i) · V_i
   2-3 Output Division

[Block diagram: the selected row IDs index KeyMem and ValMem; multiplier and accumulate (+=) arrays build up the partial attention output for the query in the buffer]

Page 27: ELSA: Hardware-Software Co-Design for Efficient ...


ELSA Hardware Implementation

1. Preprocessing Phase (key hash computation)
2. Execution Phase
   2-1 Compute Query Hash
   2-2 Candidate Selection & Attention Computation
   2-3 Output Division: normalize the accumulated outcome by dividing it by the sum of exponents:
       (Σ_i e^(Q·K_i) · V_i) / Σ_i e^(Q·K_i)

[Block diagram: the accumulated buffer contents pass through a divider to produce the final output]
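Functionally, stages 2-2 and 2-3 amount to a streaming softmax-weighted sum over only the selected keys; the sketch below mirrors that dataflow in NumPy (illustrative names, ignoring pipelining, fixed-point arithmetic, and numerical-stability tricks).

```python
import numpy as np

def attention_for_query(q, K, V, selected_ids):
    """Streaming equivalent of stages 2-2 and 2-3 for a single query.

    Accumulates exp(q . K_i) * V_i and exp(q . K_i) over the selected keys only,
    then performs one division at the end (output division).
    """
    acc = np.zeros(V.shape[1])    # partial attention output buffer
    denom = 0.0                   # running sum of exponents
    for i in selected_ids:        # only keys that survived candidate selection
        w = np.exp(q @ K[i])
        acc += w * V[i]
        denom += w
    return acc / denom            # stage 2-3: output division
```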

Page 28: ELSA: Hardware-Software Co-Design for Efficient ...


ELSA Hardware Pipelining

• Query-level pipelining across the three execution stages (2-1 Compute Query Hash, 2-2 Candidate Selection & Attention Computation, 2-3 Output Division)
• Fine-grained pipelining within each pipeline stage

[Figure: after the preprocessing phase (key hash computation), the (i-1)-th, i-th, and (i+1)-th queries overlap in stages 2-1, 2-2, and 2-3 of the execution phase]

Page 29: ELSA: Hardware-Software Co-Design for Efficient ...


Evaluation: Impact of Approximate Attention

• Line graph: model accuracy metric
• Bar graph: portion of selected candidates
• X-axis: hyperparameter for the accuracy/performance tradeoff

Less than 1% degradation in the accuracy metrics is observed while inspecting only ~33% of the keys.

Page 30: ELSA: Hardware-Software Co-Design for Efficient ...


Evaluation: Performance

[Figure: speedup over GPU for Ideal, Ours (<1%), Ours (<2.5%), Ours (<5%), and Ours (No Approx)]

• Proposed HW accelerator (w/o approximation): 8x – 46x speedup over GPU
• Proposed HW accelerator (w/ approximation, <1% accuracy metric degradation): 20x – 157x speedup over GPU
• Proposed HW accelerator (w/ approximation, <2.5% accuracy metric degradation): 32x – 168x speedup over GPU

Page 31: ELSA: Hardware-Software Co-Design for Efficient ...

[Figure: performance/Watt and energy consumption results]

Page 32: ELSA: Hardware-Software Co-Design for Efficient ...

Tae Jun Ham*, Yejin Lee*, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, Jae W. Lee

* These authors contributed equally to this work

ELSA: Hardware-Software Co-Design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks

The 48th ACM/IEEE International Symposium on Computer Architecture (ISCA) ARC Lab @ Seoul National University

