
Murat E Guney, Kazushige Goto, Timothy B Costa, Sarah Knepper, Louise Huot, Arthur A Mitrano, Shane Story

Presenter: Marius Cornea

24th IEEE Symposium on Computer Arithmetic

July 24th, 2017

Copyright © 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Optimization Notice

Intel® Math Kernel Library (Intel® MKL)


Speeds up computations for scientific, engineering, financial and machine learning applications

Provides key functionality for dense and sparse linear algebra (BLAS, LAPACK, PARDISO), FFTs, vector math, summary statistics, deep learning primitives and more

Included in Intel® Parallel Studio XE and Intel® System Studio Suites

Optimized for single core vectorization and cache utilization

Automatic parallelism for multi-core and many-core

Scales to clusters and beyond

Great performance with minimal effort


Components of Intel MKL

Linear Algebra

• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
• Iterative Solvers
• PARDISO*
• Cluster Sparse Solver

Fast Fourier Transforms

• Multidimensional
• FFTW interfaces
• Cluster FFT

Vector Math

• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
• Vector RNGs

Summary Statistics

• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance

Deep Neural Networks

• Convolution
• Pooling
• Normalization
• ReLU
• Inner Product

And More…

• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver


Automatic Dispatching to Tuned ISA-specific Code Paths

| Processor | Up to Cores | Up to Threads | SIMD Width | Vector ISA |
|---|---|---|---|---|
| Intel® Xeon® Processor 64-bit | 1 | 2 | 128 | Intel® SSE3 |
| Intel® Xeon® Processor 5100 series | 2 | 2 | 128 | Intel® SSE3 |
| Intel® Xeon® Processor 5500 series | 4 | 8 | 128 | Intel® SSE4-4.1 |
| Intel® Xeon® Processor 5600 series | 6 | 12 | 128 | Intel® SSE4.2 |
| Intel® Xeon® Processor E5-2600 v2 series | 12 | 24 | 256 | Intel® AVX |
| Intel® Xeon® Processor E5-2600 v3/v4 series | 18-22 | 36-44 | 256 | Intel® AVX2 |
| Intel® Xeon® Scalable Processor¹ | 28 | 56 | 512 | Intel® AVX-512 |
| Intel® Xeon Phi™ x100 Coprocessor (KNC) | 61 | 244 | 512 | IMCI 512 |
| Intel® Xeon Phi™ x200 Processor (KNL) | 72 | 288 | 512 | Intel® AVX-512 |

More cores, more threads, wider vectors.

1. Product specification for launched and shipped products available on ark.intel.com.


BLAS – Basic Linear Algebra Subprograms

De-facto standard APIs since the 1980s (Fortran77)

Level 1: vector-vector operations

Level 2: matrix-vector operations

Level 3: matrix-matrix operations

Precisions: single, double, single complex, double complex

Original BLAS available at: http://netlib.org/blas

| Operation | MKL Routine ("D is for double") | Example | Computational complexity (work) |
|---|---|---|---|
| Vector-Vector | DAXPY | y = y + αx | O(N) |
| Matrix-Vector | DGEMV | y = αAx + βy | O(N²) |
| Matrix-Matrix | DGEMM | C = αA·B + βC | O(N³) |


GEMM: GEneral Matrix-Matrix Multiplication

C = alpha*op(A)*op(B) + beta*C, where op(X) = X or Xᵀ

A is m×k, B is k×n, C is m×n

[Figure: matrices A (m×k), B (k×n) and C (m×n), stored with leading dimensions lda, ldb and ldc.]


GEMM: GEneral Matrix-Matrix Multiplication

C = alpha*op(A)*op(B) + beta*C

Reference implementation:

```fortran
C = beta*C
DO i = 1, M
  DO j = 1, N
    DO kk = 1, K
      C(i,j) = C(i,j) + alpha*A(i,kk)*B(kk,j)
    END DO
  END DO
END DO
```


Intel® Xeon Phi™ x200 Architecture

Self-boot processor eliminates PCIe bottlenecks

Supports x87, SSE, AVX, AVX2, AVX512

36 tiles interconnected with 2D mesh

Each tile has 2 cores (2VPU/core), 1MB L2 cache

Two-wide core: decodes, allocates and retires 2 instructions per cycle

Two FMA units per core, for a throughput of:

32 double-precision and 64 single-precision flops per cycle

MCDRAM: high BW memory (Triad: 450 GB/sec)

MCDRAM is configurable: flat, cache, and hybrid

Integrated fabric support


GEMM Optimizations

Goal is to reach the peak floating-point throughput of the processor

Two key levels of optimization:

1. Kernel-level optimizations

Compute intensive

Hand-written assembly kernel

Maximize core utilization by optimal instruction scheduling and cache usage

2. Thread-level optimizations

Optimizations to utilize the entire processor

Sub-matrices are dynamically scheduled on each core


Kernel-level Optimizations

Goal is to reach the peak floating point throughput of the core

Threads share compute resources (L2, mesh, MCDRAM)

Hand-tuned kernel for optimal instruction scheduling and cache reuse

Computes the partial results on blocks from matrices A, B and C

Block sizes chosen to maximize cache utilization and limit internal buffer sizes

| Blocking Dimension | SGEMM | DGEMM | ZGEMM | CGEMM |
|---|---|---|---|---|
| M | 9984 | 4992 | 2496 | 4992 |
| N | 224 | 112 | 56 | 112 |
| K | 336 | 336 | 336 | 336 |


Inner-most Kernel

Employs AVX512 FMA instructions to achieve near-peak core floating-point throughput

Computes the partial sum for 28 columns of matrix C

Streams Atemp & Btemp from L1 & L2 caches

Accesses matrices Atemp and Btemp contiguously

Software prefetches guarantee that vector loads and FMAs do not incur cache misses

After K=336 FMAs, the columns of matrix C are written to memory and the accumulators are cleared

Performance is limited by the 2-wide core:

Peak SGEMM efficiency: ~86%

Peak DGEMM efficiency: ~81%


Data Copying (1)

Matrices A and B are copied into internal buffers (Atemp and Btemp)

Copies made before kernel routines are called

Internal buffers an order of magnitude smaller than the original matrices

Buffers are re-used for each sub-block of the original matrices A and B

Buffers are stored in MCDRAM

Caveat: for small M/N values, copying may be skipped for best performance

[Figure: blocks Aik of matrix A and Bkj of matrix B are copied into the buffers Atemp and Btemp; matrix C is not copied.]


Data Copying (2)

Copying matrices into internal buffers enables:

• Accessing matrices A and B contiguously
• A more predictable access pattern
• Reduced overhead in the kernel loops
• Cache-aligned loads
• Fewer cache and DTLB misses
• Fitting more of A and B into the caches

[Figure: row vs. column traversal order over a matrix laid out across addresses 0x0000-0x0FFF.]


Data Copying (3)

Copy matrix A to Atemp; copy matrix B to Btemp.


Thread-Level Optimizations

Partition matrices according to the kernel block sizes (see the blocking-dimension table above)

Dynamically assign computational tasks for sub-matrices to threads

Main thread from each team takes the next available task from the queue

Three key tasks associated with the submatrices:

1. Copy submatrix of A to the temporary buffer Atemp

2. Copy submatrix of B to the temporary buffer Btemp

3. Compute partial results for the output matrix: C += Atemp* Btemp


Thread Mapping: 2-threads Sharing Btemp (8 cores)

[Figure: thread mapping of the 336×336 Atemp block and four 112-column Btemp/C panels across 8 cores on 4 tiles, with two threads sharing each Btemp.]

Best configuration for MCDRAM

Btemp is not shared between two cores on the same tile

10% efficiency drop if Btemp is shared inside one tile only

Only 1 hardware thread per core is utilized


Thread Mapping: 4 Threads Sharing Btemp (8 cores)

[Figure: thread mapping of the 336×336 Atemp block and two 112-column Btemp/C panels across 8 cores on 4 tiles, with four threads sharing each Btemp.]

Best configuration for a DDR-only run

Matrix Atemp can be stored in L2 cache

Btemp is shared between 2 tiles

No sharing penalty if Btemp is shared between two or more tiles

Set MKL_FAST_MEMORY_LIMIT=0 to limit Intel MKL memory usage to DDR


More Details on Thread-level Optimizations

Dynamic scheduling with prioritized tasks

Copy matrix A in advance to unblock future computations

Requires two temporary buffers for matrix A

Static scheduling if the matrix dimensions are small

All thread mapping is done automatically by Intel MKL

Just link with Intel OpenMP library

Disabling L2 cache hardware prefetchers improves performance for large DGEMM

~2% improvement

Tickless OS-kernel improves performance and stability


Performance Data (1 thread)

[Figure: SGEMM and DGEMM FMA efficiency (0-100%) vs. matrix dimensions (M=N=K, 0-10000), with achievable-efficiency reference lines for each precision.]

Configuration Info - Software: Intel Math Kernel Library 2017. Hardware: Intel Xeon Phi™ 7250 Processor, 68 cores (34 MB L2 Cache, 1.4 GHz), 96 GB RAM and 16 GB MCDRAM. Operating System: RHEL 7.2 GA x86_64. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel measured as of May 2017. For more information go to http://www.intel.com/performance


Performance Data (68 threads, MCDRAM flat-mode)


[Figure: SGEMM and DGEMM performance (0-5.0 TF/sec) vs. matrix dimensions (M=N=K, 0-20000).]


Performance Data (68 threads, MCDRAM cache-mode)


[Figure: SGEMM and DGEMM performance (0-5.0 TF/sec) vs. matrix dimensions (M=N=K, 0-20000).]


Conclusion

Intel MKL SGEMM and DGEMM reach 4.5 TF/sec and 2.1 TF/sec, respectively

MCDRAM cache-mode sustains high performance levels

Intel MKL routines automatically exploit the VPUs and the many cores

BLAS3 and LAPACK functions run at performance close to that of GEMM

Intel MKL can be installed for free!

https://software.intel.com/en-us/articles/free-mkl


Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


