Electrical Engineering – Electronic Systems group
Kanishkan Vadivel, Henk Corporaal, Pekka Jääskeläinen
General Architectures for DNN
Recap
• Inference and learning principles
• Improving network efficiency – focuses on reducing the number of MACs and weights
• Loop transformations – software tricks to use the memory hierarchy effectively
Quantization
• fp32: sign (1) | exponent (8) | mantissa (23)
• fp16: sign (1) | exponent (5) | mantissa (10)
A minimal quantization sketch follows below.
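Reduced-precision formats like fp16 are one form of quantization. Below is a minimal sketch of the more general uniform int8 scheme; the scale choice and names are illustrative, not from the slides:

```cpp
// Uniform symmetric int8 quantization sketch: map real values onto
// a small integer grid with one scale factor per tensor.
#include <algorithm>
#include <cmath>
#include <cstdint>

int8_t quantize(float x, float scale)        // e.g., scale = max|x| / 127
{
    float q = std::round(x / scale);
    q = std::clamp(q, -127.0f, 127.0f);      // C++17 <algorithm>
    return static_cast<int8_t>(q);
}

float dequantize(int8_t q, float scale) { return q * scale; }
```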
Outline for Next Two Lectures
• Introduction
• Background on DNN computations
• DNN on traditional compute platforms
  • General-purpose processor (CPU)
  • Domain-specific processors (DSPs, VLIW-SIMD)
  • Graphics processing unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized architectures
Introduction
• Machine learning plays a major role in today's world
Compute Intensity of DNN
• Compute intensity is roughly proportional to the accuracy of the DNN
Source: Scaling for edge inference, Nature Electronics, 2018
Energy Efficiency Requirements
• Ranges from the cloud to edge devices (low-power embedded applications)
• Different energy budgets and compute capabilities

Power budget: edge node ~mW | embedded device ~W | cloud server ~kW | HPC cloud ~MW
(compute capability increases in the same order)
Hardware Platform for DNN
Workload:
1. Inference
2. Training
3. Meta-learning
Compute platforms:
1. High-performance computing
2. Embedded systems

[Chart: flexibility vs. energy efficiency and performance/area. CPU, GPU, DSP, ASIP, FPGA, and ASIC trade flexibility for efficiency, spanning roughly a 1000x efficiency range*]
*Markovic, EE292 class, Stanford, 2013
Deep Convolutional Neural Networks
Convolution layers contribute more than 90% of the overall computation, dominating runtime and energy consumption.
Source: ICIP Tutorial, 2019
Convolution Layer
• Input fmap: H × W (per channel); weights (filter): R × S; output fmap: E × F, where E = H − R + 1 and F = W − S + 1 (stride 1, no padding)
• One output pixel: R × S MACs
• Full output fmap: R × S × (H − R + 1) × (W − S + 1) MACs
• With C input channels: × C
• With M output fmaps (M filters): × M
• For a batch of N input fmaps: × N

Total number of MACs = R × S × (H − R + 1) × (W − S + 1) × C × M × N

A minimal loop-nest sketch of this layer follows below.
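A minimal C++ sketch of the layer just derived; the NCHW layout and names are illustrative, with stride 1 and no padding, and it executes exactly the MAC count above:

```cpp
// Naive 7-loop convolution matching the MAC count above.
#include <vector>

void conv_layer(int N, int M, int C, int H, int W, int R, int S,
                const std::vector<float>& in,   // N x C x H x W
                const std::vector<float>& wgt,  // M x C x R x S
                std::vector<float>& out)        // N x M x E x F
{
    const int E = H - R + 1, F = W - S + 1;
    for (int n = 0; n < N; ++n)
      for (int m = 0; m < M; ++m)
        for (int e = 0; e < E; ++e)
          for (int f = 0; f < F; ++f) {
              float acc = 0.0f;  // one output pixel
              for (int c = 0; c < C; ++c)
                for (int r = 0; r < R; ++r)
                  for (int s = 0; s < S; ++s)
                      acc += in[((n*C + c)*H + (e + r))*W + (f + s)]
                           * wgt[((m*C + c)*R + r)*S + s];
              out[((n*M + m)*E + e)*F + f] = acc;
          }
    // Total MACs executed: N * M * E * F * C * R * S.
}
```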
Fully Connected Layer
• Height and width of the output fmap are 1 (E = F = 1)
• Filters are as large as the input fmaps (R = H, S = W)
• Input fmaps: N × C × H × W; weights: M × C × H × W; output fmap: N × M
Source: ICIP Tutorial, 2019
Fully Connected Layer = Matrix Multiplication
• Weights: M × (C·H·W), with C·R·S = C·H·W
• Input fmaps: (C·H·W) × N
• Output fmap: M × N
A minimal GEMM sketch of this view follows below.
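A minimal C++ GEMM sketch of the FC layer as stated above; the row-major layout and names are illustrative:

```cpp
// FC layer as a plain GEMM: out (M x N) = weights (M x CHW) * in (CHW x N).
#include <vector>

void fc_layer(int M, int N, int CHW,
              const std::vector<float>& wgt,  // M x CHW, row-major
              const std::vector<float>& in,   // CHW x N, row-major
              std::vector<float>& out)        // M x N, row-major
{
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < CHW; ++k)
                acc += wgt[m * CHW + k] * in[k * N + n];
            out[m * N + n] = acc;  // M * N * CHW MACs in total
        }
}
```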
Compute Intensity of Popular CNNs
Overview of Microprocessor Designs
Source: Time Moore, Liming Xiu, 2019
Intrinsic Compute Capability
Source: XETAL-II, 2010

TPU vs. CPU: ~200x difference in performance/watt for DL
Source of Inefficiency
• Results do not include DRAM power
• More than 50% of the energy is spent on caches and control logic
[Chart: energy breakdown – compute vs. cache & control]

How to improve?
• Reduce control overhead
• Improve the cache hierarchy
• Multi-core/cluster concepts provide an additional performance gain
Source: Computing's Energy Problem, ISSCC 2014
Reduce Control Overhead: SIMD Extensions
• Intel – SSE (Streaming SIMD Extensions, 4 × 32-bit single precision) [SSE2, SSE3, SSSE3, SSE4]; AVX, AVX2, AVX-512 (256/512-bit)
• AMD – 3DNow!
• Arm – VFP (single/double-precision co-processor); NEON (128-bit); SVE (128 to 2048-bit)
• Qualcomm – 4 × 1024-bit (4096 bits)
DNN-specific extensions: reduced precision and instruction-set extensions (see the intrinsics sketch below)
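A minimal sketch of the intrinsics route for the extensions listed above: an 8-wide fp32 multiply-accumulate loop using AVX2/FMA. Compile with -mavx2 -mfma; n is assumed to be a multiple of 8:

```cpp
// 8-wide fp32 dot product with AVX2 + FMA intrinsics – the kind of
// SIMD MAC loop a DNN inner product maps onto.
#include <immintrin.h>

float dot8n(const float* a, const float* b, int n)  // n % 8 == 0
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb
    }
    // Horizontal reduction of the 8 partial sums.
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) sum += tmp[i];
    return sum;
}
```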
Example: Intel Cascade Lake, 2019 (VNNI – Vector Neural Network Instructions)
[Die photo: 28 cores]
SIMD Extensions for DNNs (VNNI)
• Mixed-precision mode – INT8 × INT8 products accumulated into INT32
• VNNI – FMA in a single cycle, compared to 3 cycles with normal SIMD instructions
• Some architectures support a "2x2 dot-product" as well
A hedged intrinsics sketch follows below.
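A sketch of one VNNI dot-product step using the AVX-512 VNNI intrinsic _mm512_dpbusd_epi32, which multiplies unsigned int8 elements by signed int8 elements and accumulates groups of four products into int32 lanes. Requires AVX512-VNNI hardware (e.g., Cascade Lake) and the matching compiler flags:

```cpp
#include <immintrin.h>

// One fused int8 dot-product step: 64 u8*s8 products, accumulated
// four-at-a-time into 16 int32 lanes of 'acc'.
__m512i vnni_dp_step(__m512i acc, __m512i a_u8, __m512i b_s8)
{
    return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
}
```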
Brain Floating Point
• bfloat16: same dynamic range as IEEE FP32, but with less precision
• Example use cases: Google TPU, Cooper Lake Xeon processors
• Another option – "posit" floating point (an adaptable FP format)
A conversion sketch follows below.
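A minimal sketch of why bfloat16 preserves FP32's range: it is simply the top 16 bits of an IEEE-754 fp32 word (1 sign, 8 exponent, 7 mantissa bits). Real implementations typically round to nearest even; truncation keeps the sketch short:

```cpp
#include <cstdint>
#include <cstring>

// fp32 -> bfloat16 by truncation: keep sign + full exponent,
// drop the low 16 mantissa bits (precision, not range, is lost).
uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // safe type-pun
    return static_cast<uint16_t>(bits >> 16);
}

// bfloat16 -> fp32: zero-fill the dropped mantissa bits.
float bf16_to_f32(uint16_t b)
{
    uint32_t bits = static_cast<uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```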
How about BNNs?
y = popcount(W XNOR X)
A bit-level sketch follows below.
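A bit-level sketch of the formula above, assuming weights and activations constrained to {−1, +1} and packed one bit per value (C++20 for std::popcount):

```cpp
#include <bit>
#include <cstdint>

// Binary dot product of 64 packed +/-1 values: XNOR marks positions
// where the signs agree; popcount counts them; 2*matches - 64 maps
// the result back to +/-1 arithmetic.
int bnn_dot64(uint64_t w, uint64_t x)
{
    uint64_t match = ~(w ^ x);        // XNOR
    int bits = std::popcount(match);  // C++20 <bit>
    return 2 * bits - 64;
}
```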
Software Stack for DNN
• Parallelizing compiler
• Inline assembly
• Intrinsics
• Optimized libraries – MKL-DNN, clDNN, BLAS, Arm NN, Arm CMSIS-NN, and many more
A sketch of the compiler-vectorization path follows below.
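A minimal sketch of the "parallelizing compiler" entry above: a plain scalar loop the compiler can auto-vectorize (e.g., at -O3), here nudged with an OpenMP SIMD pragma; no intrinsics or assembly involved:

```cpp
// Scalar source; the compiler emits SIMD FMAs where available.
// __restrict (a common compiler extension) tells it the arrays
// don't alias, which unblocks vectorization.
void scale_add(float* __restrict y, const float* __restrict x,
               float a, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```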
Example: Arm NN Library
TensorFlow → Arm NN
Comparison among CNN Libraries on CPU
• Caffe backends – ATLAS, OpenBLAS, MKL, OpenMP, and CaffeConTroll
• Performance depends on the quality of the library optimizations for the target
Source: Evaluating the energy efficiency of deep CNNs, Da Li, 2016 (ConvNet on a 16-core Xeon E5)
Distributed Learning and Inference
Source: Large Scale Distributed Deep Networks, Jeffrey Dean et al., Google, 2012
Distributed DL – Approach
Distributed DL – Performance
• Models with more parameters benefit more from the use of additional machines
Domain Specific Processors (VLIW-SIMD)
• Processors optimized for a specific application domain (e.g., vision, signal processing)
• Examples: Qualcomm Hexagon, Movidius (Intel), CEVA, and many more
• Support for DNN:
  • Instruction-set extensions
  • DNN accelerator in the execution pipeline
Programming Model
Hexagon DSP (Qualcomm)
Hexagon vs. Quad-Core CPU
Source: Hot Chips
Hexagon – Power Breakdown
• Less overhead in control logic and memory compared to CPUs
Another Example: CEVA DSP
Source: AnandTech
Final Example: Movidius Myriad 2
Source: Hot Chips 2014
Final Example: Intel Neural Compute Stick
Graphics Processing Units (GPUs)
• SIMD vs. GPU:
  • GPUs use threads instead of vectors
  • GPUs have "shared memory" spaces
[Figure: SIMD vs. GPU execution model]
How Are Threads Scheduled?
Example: NVIDIA Fermi – 2009
• Streaming Multiprocessors (SMs)
  • 32 CUDA cores per SM
  • ALU – 32/64-bit
  • FP – single/double precision (with FMA)
  • SFU – sine, cosine, sqrt, etc.
• Clock – 1.5 GHz (estimated)
• Peak performance – 1.5 TFLOPS
Example: NVIDIA Volta – 2017
• 80 Streaming Multiprocessors (SMs)
  • 64 INT32/FP32 cores per SM
  • 32 FP64 CUDA cores per SM
  • 8 tensor cores per SM – each performs a 4x4 matrix multiply; together 512 FMAs = 1024 FP ops per SM per cycle
• Clock – 1.53 GHz
• Peak tensor throughput – 1.53 GHz × 80 SMs × 1024 ops ≈ 125 TFLOPS
Internals of Tensor Core
• Modes of operation – Volta:
  • FP16 – A, B, C are FP16
  • Mixed precision – A and B are FP16, C is FP32
• Turing GPUs support 1-, 2-, 4-, and 8-bit data types (INT4, INT8 on tensor cores)
A conceptual sketch follows below.
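A conceptual sketch, in plain C++ rather than real CUDA, of the mixed-precision mode described above: D = A·B + C on 4x4 tiles, fp16 inputs, fp32 accumulation. _Float16 is a compiler extension standing in for hardware fp16 (substitute float to just model the math):

```cpp
using half = _Float16;  // compiler extension (recent GCC/Clang)

// One tensor-core step: 4x4x4 multiply-accumulate, fp32 accumulator.
void tc_mma_4x4(const half A[4][4], const half B[4][4],
                const float C[4][4], float D[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];                       // fp32 accumulate
            for (int k = 0; k < 4; ++k)
                acc += float(A[i][k]) * float(B[k][j]); // fp16 inputs
            D[i][j] = acc;  // 64 FMAs per tile = one tensor-core step
        }
}
```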
Scheduling Example for a 16x16x16 GEMM
Source: AnandTech
A tiling sketch of this decomposition follows below.
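A sketch of the decomposition behind the scheduling example: the 16x16x16 GEMM splits into 4x4 output tiles, each accumulated over four 4x4x4 sub-products, which is the per-step work of one tensor core. Sizes, types, and loop order are illustrative:

```cpp
using half = _Float16;  // compiler extension; use float if unsupported

void gemm16(const half A[16][16], const half B[16][16], float C[16][16])
{
    for (int bi = 0; bi < 16; bi += 4)             // 4x4 output tiles
      for (int bj = 0; bj < 16; bj += 4)
        for (int i = 0; i < 4; ++i)
          for (int j = 0; j < 4; ++j) {
              float acc = C[bi + i][bj + j];
              for (int bk = 0; bk < 16; bk += 4)   // 4 tensor-core steps
                  for (int k = 0; k < 4; ++k)      // one 4x4x4 sub-product
                      acc += float(A[bi + i][bk + k])
                           * float(B[bk + k][bj + j]);
              C[bi + i][bj + j] = acc;
          }
}
```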
How to Use Tensor Cores
• cuBLAS, cuDNN, etc.
• The library takes care of tiling and the storage hierarchy
• Opcode: HMMA (half-precision matrix multiply-accumulate)
GPU Performance
• We still need the CPU to some extent
GPU vs. CPU Performance
• CPU – 16-core Intel Xeon E5-2650 v2 @ 2.6 GHz
• Benchmark: AlexNet
• Lower batch sizes lead to underutilization on all devices
• The K20 has less memory than the Titan X
Concluding Remarks
• The compute and data requirements of DNNs are quite high, and a major part of the computation comes from matrix multiplications (i.e., MAC ops)
• Common DNN-specific extensions in generic architectures are:
  1. Instruction-set extensions – generally SIMD support at reduced precision
  2. A DNN accelerator on the datapath (co-processor, tensor core, etc.)
• The effective performance of a platform depends on the hardware capability and the software support (programming model and libraries used to realize the network)
• Energy efficiency is still a limitation of generic platforms for DNNs
Reference
• Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs, Da Li and Xinbo Chen, 2016
Backup
Multithreading Categories
[Figure: pipeline-slot occupancy over processor cycles (time on the vertical axis) for superscalar, fine-grained multithreading, coarse-grained multithreading, simultaneous multithreading, and multiprocessing; legend: Threads 1–5 plus idle slots]
Example: IBM POWER4 (Superscalar)
Example: IBM POWER5
• Supports 2 threads
• 2 fetches (PC), 2 initial decodes
• 2 commits (architected register sets)
POWER5 Thread Performance
• The relative priority of each thread is controllable in hardware
• For balanced operation, both threads run slower than if they "owned" the machine
Any guess on the largest chip so far?
Source: Cerebras, Hot Chips 2019