Murat E Guney, Kazushige Goto, Timothy B Costa, Sarah Knepper, Louise Huot, Arthur A Mitrano, Shane Story
Presenter: Marius Cornea
24th IEEE Symposium on Computer Arithmetic
July 24th, 2017
Copyright © 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Optimization Notice
Intel® Math Kernel Library (Intel® MKL)
Speeds up computations for scientific, engineering, financial and machine learning applications
Provides key functionality for dense and sparse linear algebra (BLAS, LAPACK, PARDISO), FFTs, vector math, summary statistics, deep learning primitives and more
Included in Intel® Parallel Studio XE and Intel® System Studio Suites
Optimized for single core vectorization and cache utilization
Automatic parallelism for multi-core and many-core
Scales to clusters and beyond
Great performance with minimal effort
Components of Intel MKL
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
• Iterative Solvers
• PARDISO*
• Cluster Sparse Solver

Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT

Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
• Vector RNGs

Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance

Deep Neural Networks
• Convolution
• Pooling
• Normalization
• ReLU
• Inner Product

And More…
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
Automatic Dispatching to Tuned ISA-specific Code Paths

| | 64-bit Intel® Xeon® Processor | Intel® Xeon® Processor 5100 series | Intel® Xeon® Processor 5500 series | Intel® Xeon® Processor 5600 series | Intel® Xeon® Processor E5-2600 v2 series | Intel® Xeon® Processor E5-2600 v3/v4 series | Intel® Xeon® Scalable Processor¹ | Intel® Xeon Phi™ x100 Coprocessor (KNC) | Intel® Xeon Phi™ x200 Processor (KNL) |
|---|---|---|---|---|---|---|---|---|---|
| Up to Core(s) | 1 | 2 | 4 | 6 | 12 | 18-22 | 28 | 61 | 72 |
| Up to Threads | 2 | 2 | 8 | 12 | 24 | 36-44 | 56 | 244 | 288 |
| SIMD Width | 128 | 128 | 128 | 128 | 256 | 256 | 512 | 512 | 512 |
| Vector ISA | Intel® SSE3 | Intel® SSE3 | Intel® SSE4-4.1 | Intel® SSE4.2 | Intel® AVX | Intel® AVX2 | Intel® AVX-512 | IMCI 512 | Intel® AVX-512 |

More cores · More threads · Wider vectors

1. Product specifications for launched and shipped products are available on ark.intel.com.
BLAS – Basic Linear Algebra Subprograms
De-facto standard APIs since the 1980s (FORTRAN 77)
Level 1: vector-vector operations
Level 2: matrix-vector operations
Level 3: matrix-matrix operations
Precisions: single (S), double (D), single complex (C), double complex (Z)
Original BLAS available at: http://netlib.org/blas
| Operation | MKL Routine ("D is for double") | Example | Computational complexity (work) |
|---|---|---|---|
| Vector-vector | DAXPY | y = y + αx | O(N) |
| Matrix-vector | DGEMV | y = αAx + βy | O(N²) |
| Matrix-matrix | DGEMM | C = αA·B + βC | O(N³) |
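The three levels differ mainly in their compute-to-data ratio. As a rough illustration only (not MKL's implementation, which is vectorized and parallelized), naive C versions of the Level 1 and Level 2 operations above might look like:

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference versions of two BLAS levels (illustration only). */

/* Level 1 (DAXPY): y = y + alpha*x  -- O(N) work on O(N) data */
static void daxpy_ref(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2 (DGEMV): y = alpha*A*x + beta*y, with A an n x n matrix,
 * stored row-major here for simplicity (Fortran BLAS is column-major).
 * O(N^2) work on O(N^2) data. */
static void dgemv_ref(size_t n, double alpha, const double *A,
                      const double *x, double beta, double *y) {
    for (size_t i = 0; i < n; i++) {
        double dot = 0.0;
        for (size_t j = 0; j < n; j++)
            dot += A[i * n + j] * x[j];
        y[i] = alpha * dot + beta * y[i];
    }
}
```

Level 3 (GEMM) is where blocking and packing pay off: O(N³) work on O(N²) data leaves room for cache reuse that Levels 1 and 2 do not have.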
GEMM: GEneral Matrix-Matrix Multiplication
C = alpha*op(A)*op(B) + beta*C
op(X) = X or Xᵀ
A is m×k
B is k×n
C is m×n
[Diagram: matrices A (m×k), B (k×n), C (m×n) with their leading dimensions lda, ldb, ldc]
GEMM: GEneral Matrix-Matrix Multiplication
C = alpha*op(A)*op(B) + beta*C

C = beta*C
DO i = 1, M
  DO j = 1, N
    DO kk = 1, K
      C(i,j) = C(i,j) + alpha*A(i,kk)*B(kk,j)
    END DO
  END DO
END DO
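The same reference loop can be written in C. This sketch assumes column-major storage with explicit leading dimensions (the Fortran BLAS convention) and handles only the no-transpose case:

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access with leading dimension ldx (0-based),
 * mirroring the Fortran X(i,j) above. */
#define ELT(X, ldx, i, j) ((X)[(size_t)(j) * (ldx) + (i)])

/* Reference GEMM, C = alpha*A*B + beta*C (no transposes).
 * A is m x k, B is k x n, C is m x n. */
static void gemm_ref(int m, int n, int k, double alpha,
                     const double *A, int lda, const double *B, int ldb,
                     double beta, double *C, int ldc) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int kk = 0; kk < k; kk++)
                acc += ELT(A, lda, i, kk) * ELT(B, ldb, kk, j);
            ELT(C, ldc, i, j) = alpha * acc + beta * ELT(C, ldc, i, j);
        }
}
```

This direct translation does O(m·n·k) flops but reuses almost nothing from cache; the optimizations in the following slides exist to close that gap.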
Intel® Xeon Phi™ x200 Architecture
Self-boot processor eliminates PCIe bottlenecks
Supports x87, SSE, AVX, AVX2, AVX512
36 tiles interconnected with 2D mesh
Each tile has 2 cores (2VPU/core), 1MB L2 cache
Two-wide core: decodes, allocates, and retires 2 instructions per cycle
2 FMA units per core: 32 double-precision / 64 single-precision FLOPs per cycle
MCDRAM: high-bandwidth memory (STREAM Triad: ~450 GB/s)
MCDRAM is configurable: flat, cache, and hybrid
Integrated fabric support
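In flat mode, MCDRAM appears as a separate NUMA node, so an application's allocations can be placed there from the command line. The commands below show one common approach; the node number and the application name `./my_gemm_app` are placeholders — check `numactl -H` on the actual system first:

```shell
# Flat mode: MCDRAM is typically exposed as NUMA node 1 (Quadrant mode).
numactl --membind=1 ./my_gemm_app     # allocate everything in MCDRAM
numactl --preferred=1 ./my_gemm_app   # prefer MCDRAM, fall back to DDR
```

In cache mode no such binding is needed: MCDRAM transparently caches DDR.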
GEMM Optimizations
Goal is to reach the peak floating-point throughput of the processor
Two key levels of optimization:
1. Kernel-level optimizations
Compute intensive
Hand-written assembly kernel
Maximize core utilization by optimal instruction scheduling and cache usage
2. Thread-level optimizations
Optimizations to utilize the entire processor
Sub-matrices are dynamically scheduled on each core
Kernel-level Optimizations
Goal is to reach the peak floating point throughput of the core
Threads share compute resources (L2, mesh, MCDRAM)
Hand-tuned kernel for optimal instruction scheduling and cache reuse
Computes the partial results on blocks from matrices A, B and C
Block sizes chosen to maximize cache utilization and limit internal buffer sizes
| Blocking Dimension | SGEMM | DGEMM | ZGEMM | CGEMM |
|---|---|---|---|---|
| M | 9984 | 4992 | 2496 | 4992 |
| N | 224 | 112 | 56 | 112 |
| K | 336 | 336 | 336 | 336 |
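The blocking structure can be sketched as follows. The MB/NB/KB values here are tiny illustrative stand-ins for the production sizes in the table above, and the innermost region stands in for the hand-written assembly kernel operating on packed buffers:

```c
#include <assert.h>

/* Cache-blocked GEMM sketch, C += A*B, all n x n row-major.
 * Block sizes are illustrative, not MKL's tuned values. */
enum { MB = 4, NB = 4, KB = 4 };

static int imin(int a, int b) { return a < b ? a : b; }

static void gemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int jb = 0; jb < n; jb += NB)          /* blocking along N */
        for (int kb = 0; kb < n; kb += KB)      /* blocking along K */
            for (int ib = 0; ib < n; ib += MB)  /* blocking along M */
                /* Inner region: partial product of one block triple.
                 * In MKL this is the assembly kernel. */
                for (int i = ib; i < imin(ib + MB, n); i++)
                    for (int j = jb; j < imin(jb + NB, n); j++) {
                        double acc = 0.0;
                        for (int k = kb; k < imin(kb + KB, n); k++)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += acc;
                    }
}
```

Each block triple touches only MB·KB + KB·NB + MB·NB elements, so with well-chosen sizes the working set stays resident in cache while O(MB·NB·KB) flops are performed on it.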
Inner-most Kernel
Employs AVX512 FMA instructions to achieve near-peak core floating-point throughput
Computes the partial sum for 28 columns of matrix C
Streams Atemp & Btemp from L1 & L2 caches
Accesses matrices Atemp and Btemp contiguously
Software prefetches guarantee that vector loads and FMAs do not incur cache misses
After K=336 FMAs, the columns of matrix C are written to memory and the accumulators are cleared
Performance is limited by the 2-wide core:
Peak SGEMM efficiency: ~86%
Peak DGEMM efficiency: ~81%
Data Copying (1)
Matrices A and B are copied into internal buffers (Atemp and Btemp)
Copies made before kernel routines are called
Internal buffers are an order of magnitude smaller than the original matrices
Buffers are re-used for each sub-block of the original matrices A and B
Buffers are stored in MCDRAM
Caveat: for small M/N values, copying may be skipped for best performance
[Diagram: block Aik of A copied into Atemp, block Bkj of B copied into Btemp; the output block Cij of C is not copied]
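A minimal sketch of the copy (packing) step, assuming row-major storage; MKL's actual internal buffer layout is kernel-specific and may interleave elements differently:

```c
#include <assert.h>

/* Copy an mb x kb block of A (leading dimension lda, row-major) into a
 * contiguous buffer Atemp. The kernel can then stream Atemp with unit
 * stride and aligned loads instead of striding through A. */
static void pack_block(int mb, int kb, const double *A, int lda,
                       double *Atemp) {
    for (int i = 0; i < mb; i++)
        for (int k = 0; k < kb; k++)
            Atemp[i * kb + k] = A[i * lda + k];
}
```

The copy is O(mb·kb) and is amortized because the same packed block is reused across many kernel invocations (once per panel of B it meets).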
Data Copying (2)
Copying matrices into internal buffers enables
Accessing matrices A and B contiguously
More predictable access pattern
Reducing overhead of kernel loops
Cache-aligned loads
Minimizing cache and DTLB misses
More of A and B fits into the caches
Data Copying (3)
[Diagram: copying matrix A into the Atemp buffer and matrix B into the Btemp buffer]
Thread-Level Optimizations
Partition matrices according to kernel block sizes (see the table on slide 10)
Dynamically assign computational tasks for sub-matrices to threads
Main thread from each team takes the next available task from the queue
Three key tasks associated with the submatrices:
1. Copy submatrix of A to the temporary buffer Atemp
2. Copy submatrix of B to the temporary buffer Btemp
3. Compute partial results for the output matrix: C += Atemp* Btemp
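The dynamic assignment can be sketched with a shared atomic counter acting as the task queue. This ignores MKL's task prioritization and thread-team structure; `worker` is the body each thread would execute until the queue drains:

```c
#include <assert.h>
#include <stdatomic.h>

/* Dynamic-scheduling sketch: workers claim the next task index from a
 * shared atomic counter. Each index would map to one of the three task
 * kinds above (copy A, copy B, or compute a C block). */
enum { NUM_TASKS = 12 };

static atomic_int next_task = 0;
static int done[NUM_TASKS];   /* records which tasks were executed */

/* Loop body each worker thread would run concurrently. */
static void worker(void) {
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);  /* claim next task */
        if (t >= NUM_TASKS)
            return;                               /* queue exhausted */
        done[t] = 1;                              /* "execute" task t */
    }
}
```

Because `atomic_fetch_add` hands each index to exactly one claimant, no two threads ever execute the same task, and faster threads naturally pick up more work — the load-balancing property that matters when tiles contend for the mesh and MCDRAM.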
Thread Mapping: 2 Threads Sharing Btemp (8 cores)
[Diagram: Atemp (336×336 blocks) and four Btemp/C panels of 112 columns each; 4 tiles / 8 cores; each Btemp is shared by a pair of cores on different tiles]
Best configuration for MCDRAM
Btemp is not shared between two cores on the same tile
~10% efficiency drop if Btemp is shared only within a single tile
Only 1 hardware-thread per core is utilized
Thread Mapping: 4 Threads Sharing Btemp (8 cores)
[Diagram: Atemp (336×336 blocks) and Btemp/C panels of 112 columns each; 4 tiles / 8 cores; each Btemp is shared by 4 threads spanning 2 tiles]
Best configuration for DDR-only run
Matrix Atemp can be stored in L2 cache
B shared between 2 tiles
No sharing penalty if Btemp is shared between two or more tiles
Set MKL_FAST_MEMORY_LIMIT=0 to limit Intel MKL memory usage to DDR
More Details on Thread-level Optimizations
Dynamic scheduling with prioritized tasks
Copy matrix A in advance to unblock future computations
Requires two temporary buffers for matrix A
Static scheduling if the matrix dimensions are small
All thread mapping is done automatically by Intel MKL
Just link with Intel OpenMP library
Disabling L2 cache hardware prefetchers improves performance for large DGEMM
~2% improvement
Tickless OS-kernel improves performance and stability
Performance Data (1 thread)
[Chart: SGEMM and DGEMM FMA Efficiency vs. matrix dimensions (M=N=K, 0-10000); y-axis: FMA efficiency, 0-100%; series: SGEMM Efficiency, DGEMM Efficiency, SGEMM Achievable Efficiency, DGEMM Achievable Efficiency]
Configuration Info - Software: Intel Math Kernel Library 2017; Hardware: Intel Xeon Phi™ 7250 Processor, 68 cores (34 MB L2 Cache, 1.4 GHz), 96 GB RAM and 16 GB MCDRAM; Operating System: RHEL 7.2 GA x86_64. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel measured as of May 2017. For more information go to http://www.intel.com/performance
Performance Data (68 threads, MCDRAM flat-mode)
[Chart: SGEMM and DGEMM Performance vs. matrix dimensions (M=N=K, 0-20000); y-axis: performance, 0-5.0 TF/s; series: SGEMM, DGEMM]
Performance Data (68 threads, MCDRAM cache-mode)
[Chart: SGEMM and DGEMM Performance vs. matrix dimensions (M=N=K, 0-20000); y-axis: performance, 0-5.0 TF/s; series: SGEMM, DGEMM]
Conclusion
Intel MKL SGEMM and DGEMM reach 4.5 TF/s and 2.1 TF/s, respectively
MCDRAM cache-mode sustains high performance levels
Intel MKL routines automatically exploit the VPUs and the many cores
BLAS3 and LAPACK functions run at performance close to that of GEMM
Intel MKL can be installed for free!
https://software.intel.com/en-us/articles/free-mkl
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804