Post on 09-Apr-2018
NAX: Near-Data Approximate Computing
Amir Yazdanbakhsh, Jacob Sacks, Choungki Song¹, Hadi Esmaeilzadeh, Pejman Lotfi-Kamran², Nam Sung Kim³
Georgia Institute of Technology
¹ University of Wisconsin-Madison
² The Institute for Research in Fundamental Sciences
³ University of Illinois at Urbana-Champaign
2
Approximate Computing: Embracing Imprecision
Relax the abstraction of "near perfect" accuracy in data processing, storage, and communication.
Accept imprecision to improve performance, energy dissipation, resource utilization, efficiency.
3
Virtual Reality
Data Analytics
Machine Learning
Multimedia Processing
[Figure: the application classes above shown with a GPU and an NGPU, each drawn as a 4 x 4 grid of SMs]
4
Diverse classes of GPU applications are amenable to "approximation".
5
Neural Transformation for GPUs
[Figure: two panels, each labeled "Neural Network"]
6
Neural Network Operations
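Inputs: $x_{j,0}, \dots, x_{j,i}, \dots, x_{j,n}$
Weights: $w_{j,0}, \dots, w_{j,i}, \dots, w_{j,n}$
$$y_j = \mathrm{sigmoid}\left(w_{j,0} \times x_{j,0} + \dots + w_{j,i} \times x_{j,i} + \dots + w_{j,n} \times x_{j,n}\right)$$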
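The neuron computation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sigmoid` and `neuron_output` and the sample values are hypothetical and do not come from the slides.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs):
    """Compute y_j = sigmoid(sum_i w_{j,i} * x_{j,i}) for a single neuron j."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum)

# Illustrative values only (not taken from the slides).
y = neuron_output([0.5, -0.25, 1.0], [1.0, 2.0, 0.0])
```

Each neuron is a chain of multiply-and-add steps followed by one nonlinear function, which is why the later slides focus on an arithmetic unit and a sigmoid lookup table.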
7
Runtime Breakdown of Baseline GPU
Amir Yazdanbakhsh, et al., "Neural Acceleration for GPU Throughput Processors", MICRO 2015.
[Figure: normalized runtime (0% to 100%) per application, split into Data Processing and Data Communication]
8
Runtime Breakdown of NGPU
Amir Yazdanbakhsh, et al., "Neural Acceleration for GPU Throughput Processors", MICRO 2015.
[Figure: normalized runtime per application, split into Data Processing and Data Communication; callout: 45%]
9
In-DRAM Computing Challenges
DRAM is cost-sensitive!
10
In-DRAM Computing Challenges
DRAM is under a power constraint!
11
In-DRAM Computing Challenges
GPU is SIMD!
[Figure: GPU and DRAM block diagram. Labels: 8 x 8 grid of cores; Streaming Multiprocessor (SM); Interconnection Network; L2 Cache; Memory Controller; Memory Partition; DRAM banks A to P; DRAM Logic; Accelerator Logic]
12
Near-Data Approximate Computing
In-DRAM Ctrl
13
Near-Data Approximate Computing
[Figure: accelerator logic inside a DRAM chip. DRAM labels: banks A, B, C, D; four half-banks; RD; COLDEC; S/A; I/O; IOCNT; Bitline; Read Data; Write Data. Accelerator labels: Weight Register, Arithmetic Unit, Sigmoid LUT. The GPU and DRAM block diagram from the earlier slide is shown alongside.]
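The "Sigmoid LUT" block in this figure suggests that the activation function is read from a table instead of being computed. The following Python sketch shows the general idea of a lookup-table sigmoid. The table size, the input range, and the nearest-entry lookup are all assumptions made for illustration; the slides do not give the actual table parameters.

```python
import math

LUT_SIZE = 64               # assumed number of table entries
X_MIN, X_MAX = -8.0, 8.0    # assumed input range covered by the table

# Precompute the sigmoid at evenly spaced points across [X_MIN, X_MAX].
STEP = (X_MAX - X_MIN) / (LUT_SIZE - 1)
SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(X_MIN + i * STEP))) for i in range(LUT_SIZE)]

def sigmoid_lut(z: float) -> float:
    """Approximate sigmoid(z) by the nearest table entry, saturating outside the range."""
    if z <= X_MIN:
        return SIGMOID_LUT[0]
    if z >= X_MAX:
        return SIGMOID_LUT[-1]
    index = round((z - X_MIN) / STEP)
    return SIGMOID_LUT[index]
```

A larger table lowers the approximation error at the cost of more storage, which is the kind of trade-off an area-constrained design has to weigh.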
14
NAX Execution Flow
Step 1: In-DRAM Ctrl
[Figure: the GPU and DRAM block diagram from the earlier slides, annotated with step 1]
15
NAX Execution Flow
Step 2: In-DRAM Ctrl
[Figure: the GPU and DRAM block diagram from the earlier slides, annotated with step 2]
16
NAX Execution Flow
Step 3: In-DRAM Ctrl
[Figure: the GPU and DRAM block diagram from the earlier slides, annotated with step 3]
17
NAX Execution Flow
Step 4: In-DRAM Ctrl
[Figure: the GPU and DRAM block diagram from the earlier slides, annotated with step 4]
18
NAX Execution Flow
Step 5: In-DRAM Ctrl
[Figure: the GPU and DRAM block diagram from the earlier slides, annotated with step 5]
19
NAX Execution Flow
Step 6: In-DRAM Ctrl
20
NAX Microarchitectures
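[Figure: arithmetic unit datapath. Labels: input register ($X_i$), shifter, shift register, adder (+), output register, controller, LUT]
LUT entries: $S_{00} = (00110)_2$, $S_{01} = (00100)_2$, $S_{02} = (00011)_2$, $S_{03} = (00001)_2$
Floating Point
Fixed Point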
21
Simplification of Integrated Arithmetic
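[Figure: arithmetic unit datapath. Labels: input register ($X_i$), shifter, shift register, adder (+), output register, controller, LUT]
LUT entries: $S_{00} = (00110)_2$, $S_{01} = (00100)_2$, $S_{02} = (00011)_2$, $S_{03} = (00001)_2$
$W_i = (01011010)_2 = (90)_{10}$
$X_i = (01111101)_2 = (125)_{10}$
$Y_i = X_i \times W_i = (11{,}250)_{10}$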
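The four LUT entries on this slide (6, 4, 3, 1 in decimal) match the positions of the 1 bits in the weight $W_i = 90 = 2^6 + 2^4 + 2^3 + 2^1$. A short Python sketch of that relationship follows; the helper name `shift_amounts` is hypothetical, and how the real hardware fills its LUT is not stated in the slides.

```python
def shift_amounts(weight: int) -> list:
    """Return the set-bit positions of a non-negative weight, most significant first."""
    if weight < 0:
        raise ValueError("this sketch handles non-negative weights only")
    return [bit for bit in range(weight.bit_length() - 1, -1, -1)
            if (weight >> bit) & 1]

W = 0b01011010                                  # 90, the weight on the slide
shifts = shift_amounts(W)                       # positions of the 1 bits in W
encoded = [format(s, "05b") for s in shifts]    # 5-bit encodings, as in the LUT
```

With the shift amounts stored, multiplying by the weight reduces to shifting the input and adding, one entry at a time.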
22
Simplification of Integrated Arithmetic
[Figure: same datapath, LUT entries, and operands as the previous slide; values shown in the datapath: 6 and 0]
23
Simplification of Integrated Arithmetic
[Figure: same datapath, LUT entries, and operands as the previous slide; value shown in the datapath: 4]
24
Simplification of Integrated Arithmetic
[Figure: same datapath, LUT entries, and operands as the previous slide; value shown in the datapath: 3]
25
Simplification of Integrated Arithmetic
[Figure: same datapath, LUT entries, and operands as the previous slide; value shown in the datapath: 1]
26
Simplification of Integrated Arithmetic
Iteration 1
[Figure: arithmetic unit datapath; $S_{00} = (00110)_2$ is applied, and $S_{01}$, $S_{02}$, $S_{03}$ remain in the LUT]
$W_i = (90)_{10}$, $X_i = (125)_{10}$, $Y_i = X_i \times W_i = (11{,}250)_{10}$
$T_1 = (X_i \ll 6) + 0 = (8000)_{10}$
Error = 28.9%
27
Simplification of Integrated Arithmetic
Iteration 2
[Figure: arithmetic unit datapath; $S_{01} = (00100)_2$ is applied, and $S_{02}$, $S_{03}$ remain in the LUT]
$W_i = (90)_{10}$, $X_i = (125)_{10}$, $Y_i = X_i \times W_i = (11{,}250)_{10}$
$X_i \ll 4 = (2000)_{10}$
$T_2 = (X_i \ll 4) + T_1 = (10000)_{10}$
Error = 11.1%
28
Simplification of Integrated Arithmetic
Iteration 3
[Figure: arithmetic unit datapath; $S_{02} = (00011)_2$ is applied, and $S_{03}$ remains in the LUT]
$W_i = (90)_{10}$, $X_i = (125)_{10}$, $Y_i = X_i \times W_i = (11{,}250)_{10}$
$X_i \ll 3 = (1000)_{10}$
$T_3 = (X_i \ll 3) + T_2 = (11000)_{10}$
Error = 2.2%
29
Simplification of Integrated Arithmetic
Iteration 4
[Figure: arithmetic unit datapath; $S_{03} = (00001)_2$ is applied]
$W_i = (90)_{10}$, $X_i = (125)_{10}$, $Y_i = X_i \times W_i = (11{,}250)_{10}$
$X_i \ll 1 = (250)_{10}$
$T_4 = (X_i \ll 1) + T_3 = (11250)_{10}$
Error = 0.0%
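The four iterations on these slides amount to a shift-and-add multiplication that can stop early and accept the remaining error. The Python sketch below reproduces the running sums for the slide's operands. The generator name `approximate_products` is hypothetical, and the error is computed relative to the exact product and rounded to one decimal place.

```python
def approximate_products(x: int, shifts):
    """Yield the running sum T_k after each shift-and-add iteration."""
    total = 0
    for s in shifts:
        total += x << s     # add the input shifted by the next LUT entry
        yield total

X = 0b01111101              # 125, the input on the slides
W = 0b01011010              # 90, the weight on the slides
SHIFTS = [6, 4, 3, 1]       # decimal values of the LUT entries S00..S03
exact = X * W               # 11250

for k, t in enumerate(approximate_products(X, SHIFTS), start=1):
    error = (exact - t) / exact * 100
    print(f"Iteration {k}: T{k} = {t}, error = {error:.1f}%")
```

Because the shifts are applied from the most significant bit down, each iteration adds a smaller term than the one before, so truncating after the first few iterations already captures most of the product.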
30
Experimental Setup
Power Model
• Technology node: 40 nm (3 metal layers)
• Synopsys, Cadence
• GPUWattch, McPAT and CACTI, Verilog
GPU Simulator
• GPGPU-Sim cycle-level simulator
• Fermi-based GTX 480, shader core frequency 1.4 GHz
• NVCC compiler, -O3
Machine Learning, Finance, Vision, 3D Gaming, Medical Imaging, Numerical Analysis, Image Processing
31
NAX Speedup Compared to NGPU
[Figure: speedup over NGPU per application for NAX-AFxP, NAX-FxP, and NAX-FP; y-axis: Speedup, 0.0 to 3.5; callouts: 2.0x, 2.0x, 1.2x]
NAX-AFxP provides 1.2x speedup compared to NGPU.
32
NAX Energy Saving Compared to NGPU
[Figure: energy saving over NGPU per application for NAX-AFxP, NAX-FxP, and NAX-FP; y-axis: Energy Saving, 0.0 to 8.0; callout: 4.8x]
NAX-AFxP provides 4.8x energy saving compared to NGPU.
33
DRAM System Power
[Figure: DRAM system power increase per application for NAX-AFxP, NAX-FxP, and NAX-FP; lower is better; callout: 0.7x]
NAX-AFxP yields 0.7x DRAM system power.
34
Application Quality Loss
[Figure: quality loss per application for NAX-AFxP, NAX-FxP, and NAX-FP]
Quality loss is below 10% in all applications except one.
35
NAX: Near-Data Approximate Computing
Benefits over NGPU and overhead:
• 4.8X energy saving
• 1.2X speedup
• 2% area overhead per DRAM chip
• ≤ 10% quality loss
• 0.7X DRAM system power