ELSA: Hardware-Software Co-Design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks
Tae Jun Ham*, Yejin Lee*, Seong Hoon Seo, Soosung Kim
[email protected] [email protected] [email protected] [email protected]
Hyunji Choi, Sung Jun Jung, Jae W. Lee
[email protected] [email protected] [email protected]
The 48th ACM/IEEE International Symposium on Computer Architecture (ISCA) ARC Lab @ Seoul National University
* These authors contributed equally to this work
Self-Attention: A Key Primitive of Transformer-based Models
• Transformer-based models achieve state-of-the-art performance across domains
  • Natural Language Processing: QA, Translation, Language Modeling
  • Computer Vision: Image Classification, Object Detection
• Self-attention is the key primitive of Transformer-based models
  • Self-attention identifies relations among input entities, enabling models to effectively handle sequential data.
Self-Attention: A Key Primitive of Transformer-based Models
• Transformer-based models often come with a large computational cost
• Self-attention often accounts for more than 30% of the total runtime of Transformer-based models
  • An even larger portion with longer input sequences
[Chart: runtime breakdown into Self-Attention vs. Others for BERT, RoBERTa, ALBERT, SASRec, and BERT4Rec, at the default sequence length and at a 4x larger sequence length]
Self-Attention Mechanism: Dot Product Computation
Q. What is Self-Attention?
A. Find the key vectors relevant to a query vector, then return the weighted sum of the value vectors corresponding to the relevant keys.
Step 1. Attention Score: Compute the dot product of each row in the key matrix and the query.
[Figure: Query Matrix (n x d) and Key Matrix (n x d) produce the Attention Score matrix (n x n); the Value Matrix (n x d) is shown alongside, with example values]
Self-Attention Mechanism: Softmax Computation
Step 2. Softmax-Normalized Attention Score: Apply softmax normalization to each row of the Attention Score matrix.
[Figure: Attention Score (n x n) -> softmax -> Normalized Score (n x n), with example values]
Self-Attention Mechanism: Weighted Sum Computation
Step 3. Weighted Sum of Value Vectors: Compute the weighted sum of the Value Matrix row vectors, using the Normalized Scores as weights, to produce the Output (n x d).
[Figure: Normalized Score (n x n) x Value Matrix (n x d) -> Output (n x d), with example values]
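Taken together, Steps 1-3 amount to only a few lines; below is a minimal NumPy sketch of the mechanism as described on these slides (unscaled dot-product attention; function and variable names are illustrative, not from the paper):

```python
import numpy as np

def self_attention(Q, K, V):
    """Steps 1-3 for all n queries at once. Q, K, V: (n, d) matrices."""
    scores = Q @ K.T                                   # Step 1: Attention Score (n x n)
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))  # max subtracted for stability
    weights = exp / exp.sum(axis=1, keepdims=True)     # Step 2: softmax-normalized scores
    return weights @ V                                 # Step 3: weighted sum of value rows (n x d)

n, d = 3, 3
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))               # toy inputs
out = self_attention(Q, K, V)                          # (3, 3) output
```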
Opportunities for Approximation in Self-Attention Mechanism
• The attention mechanism looks like a series of dense matrix operations, but it is also a content-based search
• The softmax operation filters out data that are NOT relevant to the query, based on attention scores
Key Idea: What if we could find the set of candidates that are likely to be relevant to the query without much computation?
→ Avoid processing irrelevant keys → Reduce the amount of computation
Softmax: s'_i = e^{s_i} / Σ_j e^{s_j}
[Example: attention scores (6.66, 1.74, 5.58, 1.80, -0.85, -0.29, -0.19, -1.03, 10.90, 2.33, 5.54) yield softmax-normalized scores (0.014, 0.000, 0.005, 0.000, 0.000, 0.000, 0.000, 0.000, 0.976, 0.000, 0.005); a single key dominates]
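The filtering effect is easy to reproduce; a quick NumPy check using the scores from the example above:

```python
import numpy as np

# Attention scores from the example above
scores = np.array([6.66, 1.74, 5.58, 1.80, -0.85, -0.29,
                   -0.19, -1.03, 10.90, 2.33, 5.54])
probs = np.exp(scores) / np.exp(scores).sum()
print(np.round(probs, 3))
# [0.014 0.    0.005 0.    0.    0.    0.    0.    0.976 0.    0.005]
# Keys with near-zero weight contribute almost nothing to the weighted sum,
# so skipping them barely changes the attention output.
```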
Why Specialized Hardware?
Observation: Conventional hardware such as GPUs does not benefit from such approximation
• GPUs are better optimized for the full similarity computation (dot product) than for the presented approximate similarity computation
[Chart: GPU device vs. ELSA HW accelerator]
Specialized hardware can fully exploit the potential benefits of the presented approximate self-attention mechanism
ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism
[Figure: Approximate Self-Attention Algorithm (left) and Specialized Hardware for Approximation (right). Algorithm: (1) efficient key hash computation and key norm computation as preprocessing; then, per query: (2) efficient query hash computation, (3) Hamming distance computation, (4) Hamming distance to angle conversion, (5) cosine, (6) key norm multiplication, (7) candidate selection; then process the next query. Hardware: hash computation unit (HashComp), hash memory (HashMem), XOR/popcount trees, cosine LUT, comparators with an arbiter, key/value memories (KeyMem, ValMem), multiply-accumulate lanes, and an output-division buffer.]
With its novel approximate self-attention algorithm and specialized hardware, ELSA achieves a significant reduction in the runtime as well as the energy spent on the self-attention mechanism.
Estimating Angular Distance
Attention Score between a given query vector Q and a key vector K:
AttentionScore(K, Q) = K · Q
Assuming n entities, there are n² (key, query) pairs in total → n² · d MAC operations
Key Idea: ELSA maps the query vector Q and each key vector K to binary vectors and approximates attention scores with a single multiply-and-add (MAC) operation (plus cheap bitwise operations) per pair
[Figure: real-valued Query Q and Key Matrix K (d dimensions) mapped to short binary vectors "Hash for Q" and "Hash for K"]
Key Idea: Approximate cos(θ_{K,Q}) to efficiently approximate the attention score using binary hashing
AttentionScore(K, Q) = K · Q = ‖Q‖ ‖K‖ cos(θ_{K,Q}) ∝ ‖K‖ cos(θ_{K,Q})
Self-attention finds the keys relevant to a given query Q, so we can ignore ‖Q‖ since it is the same for all keys
To approximate cos(θ_{K,Q}) with binary hashing, initialize k (= 4) random vectors (v1, v2, v3, v4); the hyperplanes they define (with normal vectors n_v1, n_v2, n_v3, n_v4) partition the space
[Figure: four random hyperplanes with +/- sides, defined by normal vectors n_v1 through n_v4]
Worked example:
1) In a d-dimensional vector space, pick a random vector v ∈ R^d, e.g., v1 = [0.5, -0.7, 0.3, 0.9, 1.2, ...]
2) Each random vector defines a hyperplane with v as its normal vector: points with x · v1 ≥ 0 lie on one side, points with x · v1 < 0 on the other
3) Binary hashing (e.g., k = 3, where k is the number of random vectors): each bit records which side of one hyperplane the vector falls on, e.g., h(K) = [1 0 0], h(Q) = [1 1 1]
4) Hamming distance via XOR: hamming(K, Q) = Σ([1 0 0] XOR [1 1 1]) = Σ([0 1 1]) = 2
5) Angle estimate (k = 3): θ_{Q,K} ≈ hamming(K, Q) · π/k = 2 · π/3 = 2π/3, from which the approximate attention score for (Q, K) follows
Estimating Angular Distance
With the k (= 4) hyperplanes above, every vector is embedded as a k-bit hash:
h(x1) = [0110], h(x2) = [1110], h(x3) = [0001]
The angular distance is estimated as θ ≈ hamming · π/k:
θ_{x1,x2} ≈ (π/4) · hamming([0110], [1110]) = π/4
θ_{x2,x3} ≈ (π/4) · hamming([1110], [0001]) = π
θ_{x3,x1} ≈ (π/4) · hamming([0001], [0110]) = 3π/4
→ The angular distance is obtained from the binary embeddings alone, giving the approximate score ‖K‖ cos(hamming(h(Q), h(K)) · π/k)
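A minimal sketch of this hashing scheme, assuming random-hyperplane (sign) projections as on the slides; names and the parameter values (n, d, k) are illustrative:

```python
import numpy as np

def hash_vectors(X, planes):
    """k-bit binary hash: bit i records the sign of the projection onto plane i."""
    return (X @ planes.T >= 0).astype(np.uint8)   # (n, k) bits

def estimate_angles(hq, hK, k):
    """theta_{Q,K} ~= hamming(h(Q), h(K)) * pi / k."""
    hamming = (hq ^ hK).sum(axis=1)               # XOR + popcount per key
    return hamming * np.pi / k

n, d, k = 128, 64, 8
rng = np.random.default_rng(0)
planes = rng.standard_normal((k, d))              # k random hyperplane normals
K = rng.standard_normal((n, d))
q = rng.standard_normal(d)

hK = hash_vectors(K, planes)                      # precomputed once per sequence
hq = hash_vectors(q[None, :], planes)[0]          # computed once per query
theta = estimate_angles(hq, hK, k)                # estimated angle to every key
approx_score = np.linalg.norm(K, axis=1) * np.cos(theta)   # ||K|| cos(theta)
```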
Approximate Attention Score with Binary Hashing
Step 1. Preprocessing: Compute hash values for the keys and the query, and the norm of each key
Step 2. Approximate Attention Score Computation: Compute the approximate attention score for each key
AttentionScore(K, Q) = K · Q = ‖K‖ ‖Q‖ cos(θ_{Q,K}) ≈ ‖K‖ cos(hamming(h(Q), h(K)) · π/k), where k = number of hash bits
Original Computation: d multiply-and-add operations → ‖K‖ ‖Q‖ cos(θ_{Q,K})
Approximate Computation:
① Bitwise XOR (Hamming distance)
② Lookup table: f(x) = cos(x · π/k)
③ Multiplication by ‖K‖
Step 3. Candidate Selection: Filter out potentially irrelevant keys whose approximate attention score ‖K‖ cos(hamming(h(Q), h(K)) · π/k) is below a pretrained threshold t
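Steps 2-3 then reduce to an XOR, a table lookup, and one multiply per key; a sketch under the same assumptions as the previous snippet (the threshold value here is illustrative):

```python
import numpy as np

k = 8
lut = np.cos(np.arange(k + 1) * np.pi / k)    # f(x) = cos(x * pi / k), one entry per Hamming distance

def select_candidates(hq, hK, key_norms, t):
    """Steps 2-3: approximate scores via XOR + LUT + one multiply, then threshold."""
    hamming = (hq ^ hK).sum(axis=1)           # Step 2-(1): bitwise XOR + popcount
    approx = key_norms * lut[hamming]         # Step 2-(2, 3): LUT then multiply by ||K||
    return np.nonzero(approx >= t)[0]         # Step 3: keep keys scoring at least t

# Usage with hq, hK, and K from the previous sketch:
# candidates = select_candidates(hq, hK, np.linalg.norm(K, axis=1), t=2.0)
```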
Candidate Selection
Option 1: Sort all approximate attention scores and select the top-scoring keys
  Issue #1: Sorting is expensive in hardware
  Issue #2: Sorting can only happen after every key's estimated score has been computed
Option 2: Select keys whose estimated score exceeds a certain threshold
  Pro #1: No expensive sort is needed
  Pro #2: Filtering can happen while each key's estimated score is being computed
Q. How do we find the right threshold value for all of the NN layers (and sub-layers)?
A. Learn it from the training data!
Candidate Selection: Learning the Threshold
[Example: raw attention scores (1.35, -0.72, 0.86, 0.24) softmax-normalize to (0.48, 0.06, 0.30, 0.16)]
Step 1. Set Hyperparameter g: The user enters hyperparameter g for the accuracy-performance tradeoff
Assuming n is the number of input entities (keys), we identify the set of keys whose softmax-normalized attention score (s′) exceeds g/n
If g = 1 and n = 100, the user considers entities with s′ > 0.01 to be relevant
Step 2. Learn the Raw Attention Score whose Softmax-Normalized Attention Score > g/n
Higher g: aggressive approximation; lower g: conservative approximation
Example: with g = 1 and n = 4, filter out keys with s′ < 0.25
Step 3. Repeat for All Cases in the Training Data
This yields a different threshold for each layer (and sub-layer)
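A sketch of this threshold-learning recipe; how the per-example boundaries are aggregated across the training data (here: the mean) is my assumption, not taken from the slides:

```python
import numpy as np

def learn_threshold(score_rows, g):
    """For each training example, find the smallest raw score whose
    softmax-normalized value exceeds g/n, then aggregate across examples."""
    per_example = []
    for s in score_rows:                        # s: raw attention scores for one query, length n
        p = np.exp(s - s.max())
        p /= p.sum()                            # softmax-normalized scores s'
        relevant = s[p > g / len(s)]            # keys deemed relevant (s' > g/n)
        if relevant.size:
            per_example.append(relevant.min())  # raw-score boundary for this example
    return float(np.mean(per_example))          # aggregated threshold (mean as one choice)

# The slide's example: g = 1, n = 4 -> keep keys with s' > 0.25
rows = [np.array([1.35, -0.72, 0.86, 0.24])]    # s' = (0.48, 0.06, 0.30, 0.16)
t = learn_threshold(rows, g=1.0)                # boundary lies at raw score 0.86
```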
ELSA Hardware Implementation
1 Preprocessing Phase (Key Hash Computation): the HashComp unit generates a binary hash for each key and stores it in HashMem
2 Execution Phase
  2-1 Compute Query Hash
  2-2 Candidate Selection & Attention Computation
  2-3 Output Division
2-1 Compute Query Hash: compute the hash h(Q) for the incoming query using the HashComp unit
2-2 Candidate Selection: compute the approximate similarity ‖K‖ cos(hamming(h(Q), h(K)) · π/k) for every key, and filter out keys with low similarity
[Datapath: key and query hashes from HashMem feed parallel XOR + popcount units; a LUT maps each Hamming distance to its cosine; a multiplier applies ‖K‖; comparators (>) against the threshold emit the Selected Row IDs through an arbiter]
2-2 Attention Computation: for each selected key, compute the partial attention e^{K_i·Q} · V_i and accumulate it in the buffer, i.e., Σ_i e^{K_i·Q} · V_i over the selected rows
[Datapath: Selected Row IDs index KeyMem and ValMem; multiply-accumulate lanes compute K_i·Q, an exponentiation unit produces e^{K_i·Q}, and the weighted value vectors are accumulated (+=) into the output buffer]
2-3 Output Division: normalize the accumulated outcome by dividing it by the sum of exponents:
Output = (Σ_i e^{K_i·Q} · V_i) / (Σ_i e^{K_i·Q})
[Datapath: a divider reads the accumulated weighted sum and the exponent sum from the buffer to produce the final output row]
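A behavioral model of execution phases 2-2 and 2-3 for a single query, assuming the candidate list from the selection step; this mirrors the slide's formula rather than the actual hardware datapath:

```python
import numpy as np

def execution_phase(q, K, V, candidates):
    """Phases 2-2 and 2-3 for one query: accumulate exp(score) * value
    over the selected keys, then divide by the sum of exponents."""
    acc = np.zeros(V.shape[1])            # weighted-sum buffer
    exp_sum = 0.0                         # running sum of exponents
    for i in candidates:                  # only the selected rows are touched
        w = np.exp(K[i] @ q)              # e^{K_i . Q}
        acc += w * V[i]                   # accumulate partial attention
        exp_sum += w
    return acc / exp_sum                  # 2-3: output division
```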
ELSA Hardware Pipelining
• Query-level pipelining across stages 2-1 (query hash), 2-2 (candidate selection & attention computation), and 2-3 (output division): while the i-th query occupies stage 2-2, the (i+1)-th query's hash is computed in 2-1 and the (i-1)-th query's output is divided in 2-3
• Fine-grained pipelining within each pipeline stage
[Pipeline diagram: stages 2-1 / 2-2 / 2-3 overlapped across the (i-1)-th, i-th, and (i+1)-th queries]
Evaluation: Impact of Approximate Attention
• Line graph: model accuracy metric; bar graph: portion of selected candidates (x-axis: hyperparameter for the accuracy/performance tradeoff)
• Less than 1% accuracy metric degradation is observed while inspecting only ~33% of the keys
Evaluation: Performance
[Bar chart; legend: GPU, Ideal, Ours (<1%), Ours (<2.5%), Ours (<5%), Ours (No Approx)]
• Proposed HW accelerator (w/o approximation): 8x – 46x speedup over GPU
• Proposed HW accelerator (w/ approximation), <1% accuracy metric degradation: 20x – 157x speedup over GPU
• With <2.5% accuracy metric degradation: 32x – 168x speedup over GPU
[Chart: performance/W and energy consumption comparison]