Murat E Guney, Kazushige Goto, Timothy B Costa, Sarah Knepper, Louise Huot, Arthur A Mitrano, Shane Story
Presenter: Marius Cornea
24th IEEE Symposium on Computer Arithmetic
July 24th, 2017
Copyright © 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Optimization Notice
Intel® Math Kernel Library (Intel® MKL)
Speeds up computations for scientific, engineering, financial and machine learning applications
Provides key functionality for dense and sparse linear algebra (BLAS, LAPACK, PARDISO), FFTs, vector math, summary statistics, deep learning primitives and more
Included in Intel® Parallel Studio XE and Intel® System Studio Suites
Optimized for single core vectorization and cache utilization
Automatic parallelism for multi-core and many-core
Scales to clusters and beyond
Great performance with minimal effort
Components of Intel MKL
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
• Iterative Solvers
• PARDISO*
• Cluster Sparse Solver

Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT

Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
• Vector RNGs

Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance

Deep Neural Networks
• Convolution
• Pooling
• Normalization
• ReLU
• Inner Product

And More…
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
Automatic Dispatching to Tuned ISA-specific Code Paths

| | 64-bit Intel® Xeon® Processor | Intel® Xeon® Processor 5100 series | Intel® Xeon® Processor 5500 series | Intel® Xeon® Processor 5600 series | Intel® Xeon® Processor E5-2600 v2 series | Intel® Xeon® Processor E5-2600 v3/v4 series | Intel® Xeon® Scalable Processor¹ | Intel® Xeon Phi™ x100 Coprocessor (KNC) | Intel® Xeon Phi™ x200 Processor (KNL) |
|---|---|---|---|---|---|---|---|---|---|
| Up to Core(s) | 1 | 2 | 4 | 6 | 12 | 18-22 | 28 | 61 | 72 |
| Up to Threads | 2 | 2 | 8 | 12 | 24 | 36-44 | 56 | 244 | 288 |
| SIMD Width | 128 | 128 | 128 | 128 | 256 | 256 | 512 | 512 | 512 |
| Vector ISA | Intel® SSE3 | Intel® SSE3 | Intel® SSE4-4.1 | Intel® SSE4.2 | Intel® AVX | Intel® AVX2 | Intel® AVX-512 | IMCI 512 | Intel® AVX-512 |

More cores · More threads · Wider vectors

1. Product specifications for launched and shipped products are available on ark.intel.com.
BLAS – Basic Linear Algebra Subprograms
De-facto standard APIs since the 1980s (FORTRAN 77)
Level 1: vector-vector operations
Level 2: matrix-vector operations
Level 3: matrix-matrix operations
Precisions: single (S), double (D), single complex (C), double complex (Z)
Original BLAS available at: http://netlib.org/blas
| Operation | MKL Routine ("D is for double") | Example | Computational complexity (work) |
|---|---|---|---|
| Vector-vector | DAXPY | y = y + αx | O(N) |
| Matrix-vector | DGEMV | y = αAx + βy | O(N²) |
| Matrix-matrix | DGEMM | C = αA·B + βC | O(N³) |
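The three levels differ mainly in their compute-to-data ratio. As a rough illustration only (not MKL's implementation, which is vectorized and parallelized), naive C versions of the Level 1 and Level 2 operations above might look like:

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference versions of two BLAS levels (illustration only). */

/* Level 1 (DAXPY): y = y + alpha*x  -- O(N) work on O(N) data */
static void daxpy_ref(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2 (DGEMV): y = alpha*A*x + beta*y, with A an n x n matrix,
 * stored row-major here for simplicity (Fortran BLAS is column-major).
 * O(N^2) work on O(N^2) data. */
static void dgemv_ref(size_t n, double alpha, const double *A,
                      const double *x, double beta, double *y) {
    for (size_t i = 0; i < n; i++) {
        double dot = 0.0;
        for (size_t j = 0; j < n; j++)
            dot += A[i * n + j] * x[j];
        y[i] = alpha * dot + beta * y[i];
    }
}
```

Level 3 (GEMM) is where blocking and packing pay off: O(N³) work on O(N²) data leaves room for cache reuse that Levels 1 and 2 do not have.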
GEMM: GEneral Matrix-Matrix Multiplication
C = alpha*op(A)*op(B) + beta*C
op(X) = X or Xᵀ
A is m×k
B is k×n
C is m×n
[Diagram: matrices A (m×k), B (k×n), C (m×n) with their leading dimensions lda, ldb, ldc]
GEMM: GEneral Matrix-Matrix Multiplication
C = alpha*op(A)*op(B) + beta*C

C = beta*C
DO i = 1, M
  DO j = 1, N
    DO kk = 1, K
      C(i,j) = C(i,j) + alpha*A(i,kk)*B(kk,j)
    END DO
  END DO
END DO
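The same reference loop can be written in C. This sketch assumes column-major storage with explicit leading dimensions (the Fortran BLAS convention) and handles only the no-transpose case:

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access with leading dimension ldx (0-based),
 * mirroring the Fortran X(i,j) above. */
#define ELT(X, ldx, i, j) ((X)[(size_t)(j) * (ldx) + (i)])

/* Reference GEMM, C = alpha*A*B + beta*C (no transposes).
 * A is m x k, B is k x n, C is m x n. */
static void gemm_ref(int m, int n, int k, double alpha,
                     const double *A, int lda, const double *B, int ldb,
                     double beta, double *C, int ldc) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int kk = 0; kk < k; kk++)
                acc += ELT(A, lda, i, kk) * ELT(B, ldb, kk, j);
            ELT(C, ldc, i, j) = alpha * acc + beta * ELT(C, ldc, i, j);
        }
}
```

This direct translation does O(m·n·k) flops but reuses almost nothing from cache; the optimizations in the following slides exist to close that gap.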
Intel® Xeon Phi™ x200 Architecture
Self-boot processor eliminates PCIe bottlenecks
Supports x87, SSE, AVX, AVX2, AVX512
36 tiles interconnected with 2D mesh
Each tile has 2 cores (2VPU/core), 1MB L2 cache
Two-wide core: decodes, allocates, and retires 2 instructions per cycle
2 FMA units per core: 32 double-precision / 64 single-precision FLOPs per cycle
MCDRAM: high-bandwidth memory (STREAM Triad: ~450 GB/s)
MCDRAM is configurable: flat, cache, and hybrid
Integrated fabric support
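In flat mode, MCDRAM appears as a separate NUMA node, so an application's allocations can be placed there from the command line. The commands below show one common approach; the node number and the application name `./my_gemm_app` are placeholders — check `numactl -H` on the actual system first:

```shell
# Flat mode: MCDRAM is typically exposed as NUMA node 1 (Quadrant mode).
numactl --membind=1 ./my_gemm_app     # allocate everything in MCDRAM
numactl --preferred=1 ./my_gemm_app   # prefer MCDRAM, fall back to DDR
```

In cache mode no such binding is needed: MCDRAM transparently caches DDR.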
GEMM Optimizations
Goal is to reach the peak floating-point throughput of the processor
Two key levels of optimization:
1. Kernel-level optimizations
Compute intensive
Hand-written assembly kernel
Maximize core utilization by optimal instruction scheduling and cache usage
2. Thread-level optimizations
Optimizations to utilize the entire processor
Sub-matrices are dynamically scheduled on each core
Kernel-level Optimizations
Goal is to reach the peak floating point throughput of the core
Threads share compute resources (L2, mesh, MCDRAM)
Hand-tuned kernel for optimal instruction scheduling and cache reuse
Computes the partial results on blocks from matrices A, B and C
Block sizes chosen to maximize cache utilization and limit internal buffer sizes
| Blocking Dimension | SGEMM | DGEMM | ZGEMM | CGEMM |
|---|---|---|---|---|
| M | 9984 | 4992 | 2496 | 4992 |
| N | 224 | 112 | 56 | 112 |
| K | 336 | 336 | 336 | 336 |
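The blocking structure can be sketched as follows. The MB/NB/KB values here are tiny illustrative stand-ins for the production sizes in the table above, and the innermost region stands in for the hand-written assembly kernel operating on packed buffers:

```c
#include <assert.h>

/* Cache-blocked GEMM sketch, C += A*B, all n x n row-major.
 * Block sizes are illustrative, not MKL's tuned values. */
enum { MB = 4, NB = 4, KB = 4 };

static int imin(int a, int b) { return a < b ? a : b; }

static void gemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int jb = 0; jb < n; jb += NB)          /* blocking along N */
        for (int kb = 0; kb < n; kb += KB)      /* blocking along K */
            for (int ib = 0; ib < n; ib += MB)  /* blocking along M */
                /* Inner region: partial product of one block triple.
                 * In MKL this is the assembly kernel. */
                for (int i = ib; i < imin(ib + MB, n); i++)
                    for (int j = jb; j < imin(jb + NB, n); j++) {
                        double acc = 0.0;
                        for (int k = kb; k < imin(kb + KB, n); k++)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += acc;
                    }
}
```

Each block triple touches only MB·KB + KB·NB + MB·NB elements, so with well-chosen sizes the working set stays resident in cache while O(MB·NB·KB) flops are performed on it.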
Inner-most Kernel
Employs AVX512 FMA instructions to achieve near-peak core floating-point throughput
Computes the partial sum for 28 columns of matrix C
Streams Atemp & Btemp from L1 & L2 caches
Accesses matrices Atemp and Btemp contiguously
Software prefetches guarantee that vector loads and FMAs do not incur cache misses
After K=336 FMAs, the columns of matrix C are written to memory and the accumulators are cleared
Performance is limited by the 2-wide core:
Peak SGEMM efficiency: ~86%
Peak DGEMM efficiency: ~81%
Data Copying (1)
Matrices A and B are copied into internal buffers (Atemp and Btemp)
Copies made before kernel routines are called
Internal buffers are an order of magnitude smaller than the original matrices
Buffers are re-used for each sub-block of the original matrices A and B
Buffers are stored in MCDRAM
Caveat: for small M/N values, copying may be skipped for best performance
[Diagram: block Aik of A copied into Atemp, block Bkj of B copied into Btemp; the output block Cij of C is not copied]
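A minimal sketch of the copy (packing) step, assuming row-major storage; MKL's actual internal buffer layout is kernel-specific and may interleave elements differently:

```c
#include <assert.h>

/* Copy an mb x kb block of A (leading dimension lda, row-major) into a
 * contiguous buffer Atemp. The kernel can then stream Atemp with unit
 * stride and aligned loads instead of striding through A. */
static void pack_block(int mb, int kb, const double *A, int lda,
                       double *Atemp) {
    for (int i = 0; i < mb; i++)
        for (int k = 0; k < kb; k++)
            Atemp[i * kb + k] = A[i * lda + k];
}
```

The copy is O(mb·kb) and is amortized because the same packed block is reused across many kernel invocations (once per panel of B it meets).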
Data Copying (2)
Copying matrices into internal buffers enables
Accessing matrices A and B contiguously
More predictable access pattern
Reducing overhead of kernel loops
Cache-aligned loads
Minimizing cache and DTLB misses
More of A and B fits into the caches
Data Copying (3)
[Diagram: copying matrix A into the Atemp buffer and matrix B into the Btemp buffer]
Thread-Level Optimizations
Partition matrices according to kernel block sizes (see the table on slide 10)
Dynamically assign computational tasks for sub-matrices to threads
Main thread from each team takes the next available task from the queue
Three key tasks associated with the submatrices:
1. Copy submatrix of A to the temporary buffer Atemp
2. Copy submatrix of B to the temporary buffer Btemp
3. Compute partial results for the output matrix: C += Atemp* Btemp
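The dynamic assignment can be sketched with a shared atomic counter acting as the task queue. This ignores MKL's task prioritization and thread-team structure; `worker` is the body each thread would execute until the queue drains:

```c
#include <assert.h>
#include <stdatomic.h>

/* Dynamic-scheduling sketch: workers claim the next task index from a
 * shared atomic counter. Each index would map to one of the three task
 * kinds above (copy A, copy B, or compute a C block). */
enum { NUM_TASKS = 12 };

static atomic_int next_task = 0;
static int done[NUM_TASKS];   /* records which tasks were executed */

/* Loop body each worker thread would run concurrently. */
static void worker(void) {
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);  /* claim next task */
        if (t >= NUM_TASKS)
            return;                               /* queue exhausted */
        done[t] = 1;                              /* "execute" task t */
    }
}
```

Because `atomic_fetch_add` hands each index to exactly one claimant, no two threads ever execute the same task, and faster threads naturally pick up more work — the load-balancing property that matters when tiles contend for the mesh and MCDRAM.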
Thread Mapping: 2 Threads Sharing Btemp (8 cores)
[Diagram: Atemp (336×336 blocks) and four Btemp/C panels of 112 columns each; 4 tiles / 8 cores; each Btemp is shared by a pair of cores on different tiles]
Best configuration for MCDRAM
Btemp is not shared between two cores on the same tile
~10% efficiency drop if Btemp is shared only within a single tile
Only 1 hardware-thread per core is utilized
Thread Mapping: 4 Threads Sharing Btemp (8 cores)
[Diagram: Atemp (336×336 blocks) and Btemp/C panels of 112 columns each; 4 tiles / 8 cores; each Btemp is shared by 4 threads spanning 2 tiles]
Best configuration for DDR-only run
Matrix Atemp can be stored in L2 cache
B shared between 2 tiles
No sharing penalty if Btemp is shared between two or more tiles
Set MKL_FAST_MEMORY_LIMIT=0 to limit Intel MKL memory usage to DDR
More Details on Thread-level Optimizations
Dynamic scheduling with prioritized tasks
Copy matrix A in advance to unblock future computations
Requires two temporary buffers for matrix A
Static scheduling if the matrix dimensions are small
All thread mapping is done automatically by Intel MKL
Just link with Intel OpenMP library
Disabling L2 cache hardware prefetchers improves performance for large DGEMM
~2% improvement
Tickless OS-kernel improves performance and stability
Performance Data (1 thread)
[Chart: SGEMM and DGEMM FMA Efficiency vs. matrix dimensions (M=N=K, 0-10000); y-axis: FMA efficiency, 0-100%; series: SGEMM Efficiency, DGEMM Efficiency, SGEMM Achievable Efficiency, DGEMM Achievable Efficiency]
Configuration Info - Software: Intel Math Kernel Library 2017; Hardware: Intel Xeon Phi™ 7250 Processor, 68 cores (34 MB L2 Cache, 1.4 GHz), 96 GB RAM and 16 GB MCDRAM; Operating System: RHEL 7.2 GA x86_64. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel measured as of May 2017. For more information go to http://www.intel.com/performance
Performance Data (68 threads, MCDRAM flat-mode)
[Chart: SGEMM and DGEMM Performance vs. matrix dimensions (M=N=K, 0-20000); y-axis: performance, 0-5.0 TF/s; series: SGEMM, DGEMM]
Performance Data (68 threads, MCDRAM cache-mode)
[Chart: SGEMM and DGEMM Performance vs. matrix dimensions (M=N=K, 0-20000); y-axis: performance, 0-5.0 TF/s; series: SGEMM, DGEMM]
Conclusion
Intel MKL SGEMM and DGEMM reach 4.5 TF/s and 2.1 TF/s, respectively
MCDRAM cache-mode sustains high performance levels
Intel MKL routines automatically exploit the VPUs and the many cores
BLAS3 and LAPACK functions run at performance close to that of GEMM
Intel MKL can be installed for free!
https://software.intel.com/en-us/articles/free-mkl
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804