Date post: 13-Jun-2020
Vectorization Kirill Rogozhin Intel
Transcript
Page 1:

Vectorization

Kirill Rogozhin

Intel

Page 2:

Copyright © 2015 Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice

Motivation

Page 3:

The “Free Lunch” is over, really

Processor clock rate growth halted around 2005

Source: © 2014, James Reinders, Intel, used with permission

Page 4:

Moore’s Law Is STILL Going Strong

Hardware performance potential continues to grow

“We think we can continue Moore's Law for at least another 10 years.”

Intel Senior Fellow Mark Bohr, 2015

[Figure: Processor scaling trends. Relative scaling (log scale, 1e+00 to 1e+06) versus date, 1980 to 2010, for Transistors, Clock, Power, Performance, and Performance/W.]

Page 5:

Core(s)  Threads  SIMD Width  Processor
1        2        128         Intel® Xeon® processor 64-bit
2        2        128         Intel® Xeon® processor 5100 series
4        8        128         Intel® Xeon® processor 5500 series
6        12       128         Intel® Xeon® processor 5600 series
8        16       256         Intel® Xeon® processor code-named Sandy Bridge EP
12       24       256         Intel® Xeon® processor code-named Ivy Bridge EP
18       36       256         Intel® Xeon® processor code-named Haswell EP
>18      >36      512         Future Xeon
61       244      512         Intel® Xeon Phi™ coprocessor Knights Corner
70+      280+     512         Intel® Xeon Phi™ processor & coprocessor Knights Landing (1)

*Product specification for launched and shipped products available on ark.intel.com. 1. Not launched or in planning.

More cores. More threads. Wider vectors.

High performance software must exploit both:
• Threading parallelism
• Vector data parallelism

Page 6:

Untapped Potential Can Be Huge!

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Configurations for Binomial Options SP are at the end of this presentation.

The Difference Is Growing With Each New Generation of Hardware

Page 7:

Mandelbrot: ~2000x Speedup on Xeon Phi™. Isn’t it Cool?

#pragma omp parallel for schedule(guided)
for (int32_t y = 0; y < ImageHeight; ++y) {
    float c_im = max_imag - y * imag_factor;
    #pragma omp simd safelen(32)
    for (int32_t x = 0; x < ImageWidth; ++x) {
        fcomplex in_vals_tmp = (min_real + x * real_factor) + (c_im * 1.0iF);
        count[y][x] = mandel(in_vals_tmp, max_iter);
    }
}

Intel Xeon Phi™ system, Linux64, 61 cores running 244 threads at 1GHz, 32 KB L1, 512 KB L2 per core. Intel C/C++ Compiler 1internal build.

#pragma omp declare simd uniform(max_iter), simdlen(32)
uint32_t mandel(fcomplex c, uint32_t max_iter)
{
    uint32_t count = 1;
    fcomplex z = c;
    while ((cabsf(z) < 2.0f) && (count < max_iter)) {
        z = z * z + c;
        count++;
    }
    return count;
}

Page 8:

Untapped Potential Can Be Huge!

The Difference Is Growing With Each New Generation of Hardware

Many codes are still here

Page 9:

Don’t use a single vector lane/thread!

Un-vectorized and un-threaded software will underperform

Page 10:

Permission to Design for All Lanes

Threading and Vectorization needed to fully utilize modern hardware

Page 11:

Vector SIMD parallelism, vectorization.

Vector Processing

[Figure: one vector instruction adds VL pairs of elements Ai and Bi and produces VL results Ci.]

Page 12:

Cumulative (approx.) # of Vector Instructions

ISA      Instructions  Processor      Year
SSE      62
SSE2     294
SSE3     318
SSSE3    378
SSE4     471
SSE4.2   485           Nehalem        2008
AVX      1109          Sandy Bridge   2011
AVX2     1557          Haswell        2013
AVX512   ~3800         Next Xeon/KNL  2015+

How can customers use these new instructions?

Page 13:

Why SIMD vector parallelism?

Page 14:

Vectorization of Code

for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i];

[Figure: Scalar addition computes one element per instruction: a[i] + b[i] -> c[i]. Vector addition computes eight elements per instruction: a[i..i+7] + b[i..i+7] -> c[i..i+7].]

Page 15:

Vector data operations: data operations done in parallel

void v_add(float *c, float *a, float *b)
{
    for (int i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];
}

Page 16:

Scalar Processing

Loop:
1. LOAD a[i] -> Ra
2. LOAD b[i] -> Rb
3. ADD Ra, Rb -> Rc
4. STORE Rc -> c[i]
5. ADD i + 1 -> i

[Figure: scalar processing adds one A and one B to produce one C per instruction.]

Page 17:

Vector Processing

Loop:
1. LOADv4 a[i:i+3] -> Rva
2. LOADv4 b[i:i+3] -> Rvb
3. ADDv4 Rva, Rvb -> Rvc
4. STOREv4 Rvc -> c[i:i+3]
5. ADD i + 4 -> i

[Figure: vector processing adds VL pairs of elements Ai and Bi to produce VL results Ci per instruction.]

Page 18:

We call this “vectorization”
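To make the scalar-to-vector transformation concrete, here is a minimal sketch in plain C of what vectorizing v_add with a vector length of 4 amounts to. The names v_add_scalar, v_add_strip_mined, and VL are illustrative and do not come from the slides, and the explicit length parameter n is an assumption that replaces MAX. A real compiler would emit one SIMD load, add, and store for each four-element block instead of the inner lane loop.

```c
#include <stddef.h>

#define VL 4  /* assumed vector length: 4 floats in a 128-bit register */

/* Scalar form: one load, add, and store per iteration (i advances by 1). */
void v_add_scalar(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Strip-mined form: the outer loop advances by VL, mirroring the
   LOADv4 / ADDv4 / STOREv4 steps. The inner lane loop stands in for
   what would be a single vector instruction per step. */
void v_add_strip_mined(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + VL <= n; i += VL) {
        for (size_t lane = 0; lane < VL; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];
    }
    /* Remainder loop: scalar cleanup for the last n % VL elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions produce the same result; the second only makes the blocking and the remainder handling visible.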

Page 19:

Intel® SSE and AVX-128 Data Types

SSE:
- 4x floats

SSE-2:
- 16x bytes
- 8x 16-bit shorts
- 4x 32-bit integers
- 2x 64-bit integers
- 1x 128-bit(!) integer
- 2x doubles

Page 20:

AVX-256 Data Types

Intel® AVX:
- 8x floats
- 4x doubles

Intel® AVX2:
- 32x bytes
- 16x 16-bit shorts
- 8x 32-bit integers
- 4x 64-bit integers
- 2x 128-bit(!) integers

Page 21:

AVX-512 Data Types

AVX-512:
- 16x floats
- 8x doubles
- 16x 32-bit integers

3/16/2017 21

...
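The element counts on the three data-type slides all follow from one piece of arithmetic: register width in bits divided by element width in bits. A small sketch, where simd_lanes is a hypothetical helper name and not part of any Intel API:

```c
#include <stddef.h>

/* Number of elements of a given size that fit in one SIMD register.
   register_bits is 128 (SSE), 256 (AVX/AVX2), or 512 (AVX-512). */
static size_t simd_lanes(size_t register_bits, size_t element_bytes)
{
    return register_bits / (8 * element_bytes);
}
```

For example, simd_lanes(128, sizeof(float)) gives the 4 floats of SSE, simd_lanes(256, sizeof(double)) gives the 4 doubles of AVX, and simd_lanes(512, sizeof(float)) gives the 16 floats of AVX-512, assuming the usual 4-byte float and 8-byte double.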

Page 22:

16x DP speed-up over scalar. 8x DP speed-up over SSE with Advanced Vector Extensions 512 (AVX-512)

Higher performance for the most demanding computational tasks

- Significant leap to 512-bit SIMD support for processors

- Intel® Compilers and Intel® Math Kernel Library include AVX-512 support

- Strong compatibility with AVX

- Added EVEX prefix enables additional functionality

- Appears first in Intel® Xeon Phi™ coprocessor, code named Knights Landing

Page 23:

[Figure: two quadrant diagrams with axes serial/threaded and scalar/vector.]

Parallel, Fast Serial
- Multicore + Vector
- Leadership Today and Tomorrow
- Most Commonly Used Parallel Processor*

Optimized for Highly-Vectorizable Parallel Apps
- Many Core
- Support for 512 bit vectors
- Higher memory bandwidth
- Common SW programming

Next generation Intel Xeon Phi (Knights Landing): Targeted for Highly-Vectorizable, Parallel Apps

Single Source Code + Optimization

*Based on highest volume CPU in the IDC HPC Qview Q1’13

Page 24:

Knights Landing Architectural Diagram

Diagram is for conceptual purposes only and only illustrates a CPU and memory – it is not to scale and does not

[Figure: Knights Landing package with up to 72 cores arranged in tiles, eight MCDRAM devices, six DDR4 channels, PCIe Gen3 x36, DMI to the Wellsburg PCH, and HFI over two Micro-Coax Cables (IFP).]

Package:
- Up to 72 cores
- 2D mesh architecture
- Over 3 TF DP peak
- Full Xeon ISA compatibility through AVX-512
- ~3x single-thread performance compared to Knights Corner

Memory:
- Up to 16GB high-bandwidth on-package memory (MCDRAM), exposed as NUMA node, ~500 GB/s sustained BW
- 6 channels DDR4, up to 384GB

I/O:
- PCIe Gen3 x36
- Wellsburg PCH over DMI, common with Grantley PCH
- 2 ports Storm Lake Integrated Fabric (HFI), on-package, 50 GB/s bi-directional

Tile:
- Two cores, 2 VPU per core, 1MB L2, HUB

Core:
- 2x 512b VPU per core (Vector Processing Units)
- Based on Intel® Atom Silvermont processor with many HPC enhancements
- Deep out-of-order buffers
- Gather/scatter in hardware
- Improved branch prediction
- 4 threads/core
- High cache bandwidth
- & more

Page 25:

(Recap) Parallel programming for multi-core and many-core processors

[Figure: three layers labeled A, B, and C.]

Page 26:

How could we program these parallel machines?

[Figure: three layers labeled A, B, and C.]

A “Three Layer Cake” “abstracts” common hybrid parallelism programming approaches

Page 27:

How could we program these parallel machines?

Implementing the Cake

[Figure: three layers labeled A, B, and C.]

Programming models:
A: MPI, tbb::flow, PGAS
B: OpenMP4.x, Cilk Plus, TBB
C: OpenMP4.x, Cilk Plus

Software tools: Cluster Edition, Professional Edition

Page 28:

How could we program these parallel machines?

[Figure: layers B and C highlighted.]

• Different methods exist
• OpenMP4.x:
  • Industry standard
  • C/C++ and Fortran
  • Supported by Intel Compiler (14, 15, 16), GCC 4.9+, …
  • Both levels of microprocessor parallelism

Page 29:

2 level parallelism decomposition with OpenMP4.x: image processing example

#pragma omp parallel for                 /* layer B: threads */
for (int y = 0; y < ImageHeight; ++y) {
    #pragma omp simd                     /* layer C: SIMD lanes */
    for (int x = 0; x < ImageWidth; ++x) {
        count[y][x] = mandel(in_vals[y][x]);
    }
}
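The image processing example can be turned into a self-contained sketch. The version below is an assumption-laden illustration, not the code behind the slide's measurements: it replaces the fcomplex type with separate real and imaginary floats so that it builds with any C99 compiler, and the image size, iteration limit, coordinate bounds, and the render function name are all made up for the example. Without an OpenMP flag the pragmas are ignored and the code still runs serially.

```c
#include <stdint.h>

#define IMAGE_WIDTH  64   /* illustrative sizes, not from the slides */
#define IMAGE_HEIGHT 32
#define MAX_ITER     100

/* Escape-time count for one point c = c_re + i*c_im.
   declare simd asks the compiler to also build a vector variant. */
#pragma omp declare simd
static uint32_t mandel(float c_re, float c_im)
{
    float z_re = c_re, z_im = c_im;
    uint32_t count = 1;
    /* |z| < 2 is tested as |z|^2 < 4 to avoid a square root. */
    while (z_re * z_re + z_im * z_im < 4.0f && count < MAX_ITER) {
        float t = z_re * z_re - z_im * z_im + c_re;
        z_im = 2.0f * z_re * z_im + c_im;
        z_re = t;
        count++;
    }
    return count;
}

void render(uint32_t count[IMAGE_HEIGHT][IMAGE_WIDTH])
{
    const float min_re = -2.0f, max_re = 1.0f;
    const float min_im = -1.0f, max_im = 1.0f;
    const float re_step = (max_re - min_re) / (IMAGE_WIDTH - 1);
    const float im_step = (max_im - min_im) / (IMAGE_HEIGHT - 1);

    /* Outer loop: rows are distributed across threads. */
    #pragma omp parallel for schedule(guided)
    for (int y = 0; y < IMAGE_HEIGHT; ++y) {
        float c_im = max_im - y * im_step;
        /* Inner loop: neighbouring pixels are mapped to SIMD lanes. */
        #pragma omp simd
        for (int x = 0; x < IMAGE_WIDTH; ++x) {
            float c_re = min_re + x * re_step;
            count[y][x] = mandel(c_re, c_im);
        }
    }
}
```

A caller declares a uint32_t array of IMAGE_HEIGHT by IMAGE_WIDTH and passes it to render. The rows are independent, which is what makes the outer loop safe to thread, and each pixel in a row is independent, which is what makes the inner loop safe to vectorize.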

Page 30:

2 level parallelism decomposition with OpenMP4.x: fluid dynamics example

#pragma omp parallel for                 /* layer B: threads */
for (int i = 0; i < X_Dim; ++i) {
    #pragma omp simd                     /* layer C: SIMD lanes */
    for (int m = 0; m < n_velocities; ++m) {
        next_i = f(i, velocities(m));
        X[i] = next_i;
    }
}

Page 31:

Programming for vector SIMD parallelism

Vector Processing

[Figure: one vector instruction adds VL pairs of elements Ai and Bi and produces VL results Ci.]

Page 32:

Many Ways to Vectorize

Listed from greatest ease of use (top) to greatest programmer control (bottom):

Implicit:
- Use Performance Libraries (MKL, IPP)
- Compiler: Auto-vectorization (no change of code)
- Compiler: Auto-vectorization hints (#pragma vector, …)

Explicit (user mandated) Vector Programming:
- OpenMP4.x, Intel Cilk Plus
- Cilk Plus Array Notation (CEAN) (a[:] = b[:] + c[:])

Instruction aware:
- SIMD intrinsic class (e.g.: F32vec, F64vec, …)
- Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …)
- Assembler code (e.g.: [v]addps, [v]addss, …)
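As a rough illustration of how three of these approaches differ in source code, the sketch below writes the same element-wise addition as a plain loop (implicit), with an OpenMP 4.x SIMD directive (explicit), and with SSE intrinsics (instruction aware). The function names are invented for this example. The intrinsics path assumes an x86 compiler that defines __SSE__ and falls back to scalar code elsewhere, and the directive version assumes the caller passes non-overlapping arrays.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Implicit: plain loop, vectorization is left to the compiler. */
void add_auto(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Explicit: the directive tells the compiler the loop may be vectorized.
   The caller is responsible for c not overlapping a or b. */
void add_omp_simd(float *c, const float *a, const float *b, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Instruction aware: 4 floats per SSE instruction, unaligned accesses. */
void add_intrinsics(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar remainder (or the whole loop when SSE is unavailable). */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The first two versions retarget to wider vectors by recompiling; the intrinsics version is tied to 128-bit SSE until it is rewritten, which is the trade-off the slide's ordering expresses.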

Page 33:

Explicit Vector Programming with OpenMP 4.x

[Figure: compilation flow. Input: C/C++/FORTRAN source code, including the vector part of the OpenMP* 4.0 extension -> Vectorizer (map vector parallelism to vector ISA) -> Optimization and Code Generation -> Intel® SSE, Intel® AVX, Intel® MIC.]

Vectorizer makes retargeting easy!

Page 34:

Compiling for Intel® AVX

> icc -O2 -xcore-avx2 src.cpp -o test.exe

Intel® AVX2; Haswell CPU

> icc -O2 -xcore-avx2 -axCOMMON-AVX512 src.cpp -o test.exe

Default is AVX2. If AVX512 is available, use this “code path”.

Math libraries may target SSE/AVX2/AVX512 automatically at runtime

Page 35:

Pragma SIMD Example

Ignore data dependencies, indirectly mitigate control flow dependence & assert alignment:

void vec1(float *a, float *b, int off, int len)
{
    #pragma omp simd safelen(32) aligned(a:64, b:64)
    for (int i = 0; i < len; i++)
    {
        a[i] = (a[i] > 1.0f) ?
               a[i] * b[i] :
               a[i + off] * b[i];
    }
}
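Because the directive asserts alignment and the absence of harmful dependencies instead of checking them, the caller has to make those promises true. The sketch below shows one way to do that, under stated assumptions: alloc_floats_64 is a hypothetical helper built on C11 aligned_alloc, the array a is sized to len + off elements because a[i + off] may be read for every i, and off is kept well above the safelen(32) value so that no iteration reads an element that a concurrently executed iteration writes.

```c
#include <stdlib.h>

/* The slide's kernel, with a float literal for the comparison. */
void vec1(float *a, float *b, int off, int len)
{
    #pragma omp simd safelen(32) aligned(a:64, b:64)
    for (int i = 0; i < len; i++) {
        a[i] = (a[i] > 1.0f) ? a[i] * b[i] : a[i + off] * b[i];
    }
}

/* Hypothetical helper: allocate n floats on a 64-byte boundary.
   aligned_alloc requires the size to be a multiple of the alignment,
   so the byte count is rounded up to the next multiple of 64. */
float *alloc_floats_64(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t padded = (bytes + 63) / 64 * 64;
    return aligned_alloc(64, padded);
}
```

A typical call allocates a with alloc_floats_64(len + off) and b with alloc_floats_64(len), fills both, calls vec1(a, b, off, len), and releases the buffers with free. Passing pointers from plain malloc or a small off would break the promises made in the directive, and the compiler would not report it.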

Page 36:

SIMD-enabled functions

Write a function for one element and add pragma as follows:

#pragma omp declare simd
float foo(float a, float b, float c, float d)
{
    return a * b + c * d;
}

Call the scalar version:

e = foo(a, b, c, d);

Call vector version via SIMD loop:

#pragma omp simd
for (i = 0; i < n; i++) {
    A[i] = foo(B[i], C[i], D[i], E[i]);
}

A[:] = foo(B[:], C[:], D[:], E[:]);
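Put together as one compilable unit, the SIMD-enabled function and its two call styles might look like the sketch below. The wrapper name apply_foo and its parameter list are assumptions added for the example; the Cilk Plus array-notation call is left out because it needs a compiler with Cilk Plus support.

```c
#include <stddef.h>

/* Written for one element; declare simd asks the compiler to
   generate a vector variant in addition to the scalar one. */
#pragma omp declare simd
float foo(float a, float b, float c, float d)
{
    return a * b + c * d;
}

/* Vector call site: inside a SIMD loop the compiler may call the
   vector variant of foo, processing several elements per call. */
void apply_foo(float *A, const float *B, const float *C,
               const float *D, const float *E, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        A[i] = foo(B[i], C[i], D[i], E[i]);
}
```

A scalar call such as foo(1.0f, 2.0f, 3.0f, 4.0f) still works unchanged and uses the scalar variant, so existing callers do not need to be modified.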

