Xeon Phi Programming - South Ural State...

Intel® Many Integrated Core Architecture

Software & Services Group, Developer Relations Division

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Optimization Notice

Xeon Phi Programming Basics

Dmitry Petunin

Lead Technical Consultant, Intel




Contents

• General concept

• Programming models

– Native execution

– Automatic offload

– Explicit offload

– Implicit offload

• Parallelization

• Vectorization

• Prefetching

• MPI programming

• Intel Parallel Studio XE and Intel Cluster Studio XE tools

Single source code for multicore and MIC architectures

Support for multicore and massive parallelism

Intel® Cluster Studio XE* Distributed Performance

Intel® Parallel Studio XE* Advanced Performance

Intel® Trace Analyzer and Collector

Intel® MPI Library

Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor

Intel® C/C++ and Fortran Compilers w/OpenMP

Intel® MKL, Intel® Cilk Plus, Intel® TBB Library, Intel® IPP Library

Intel® Parallel Studio XE

Performance. Scalability


Many-Core Hosted Native MIC Programming

• Enabled by -mmic compiler switch

• Fully supported by compiler vectorization, Intel® MKL, OpenMP*, Intel® TBB, Intel® Cilk Plus, Intel® MPI, …

– No Intel® Integrated Performance Primitives library yet

• Might be an option for some applications:

– Needs to fit into memory!

– Should be highly parallel code

– Serial parts are slower on MIC than on host!

– Limited access to external environment like I/O

– Native MIC file system exists in memory only!

– NFS allows external I/O but …




Options for Offloading Application Code

• Intel Composer XE 2011 for MIC supports three models:

– Automatic offload

o Use MKL routines that offload their computation to Xeon Phi

– Offload pragmas

o Only trigger offload when a MIC device is present

o Safely ignored by non-MIC compilers

– Offload keywords

o Only trigger offload when a MIC device is present

o Language extensions, need conditional compilation to be ignored

• Offloading and parallelism are orthogonal

– Offloading only transfers control to the MIC devices

– Parallelism needs to be exploited by a second model (e.g. OpenMP*)




Automatic offload

Automatic offload with Intel® Math Kernel Library (Intel® MKL)

void foo() /* Intel® Math Kernel Library */
{
    float *A, *B, *C; /* Matrices */

    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

Targets: Intel® Xeon® processor, Intel® Xeon Phi™ coprocessor

Implicit automatic offloading requires no code changes; simply link with the offload MKL library.

Intel® Math Kernel Library is applicable to multicore and many-core programming.

4/1/2013



Explicit offload




Heterogeneous Compiler – Offload using Explicit Copies – Modifier Example

float reduction(float *data, int numberOf)
{
    float ret = 0.f;

    #pragma offload target(mic) in(data:length(numberOf))
    {
        #pragma omp parallel for reduction(+:ret)
        for (int i=0; i < numberOf; ++i)
            ret += data[i];
    }

    return ret;
}

Note: copies numberOf elements to the coprocessor, not numberOf*sizeof(float) bytes – the compiler knows data’s type




Heterogeneous Compiler – Offload using Explicit Copies – Data Movement

• Default treatment of in/out variables in a #pragma offload statement

– At the start of an offload:

o Space is allocated on the coprocessor

o in variables are transferred to the coprocessor

– At the end of an offload:

o out variables are transferred from the coprocessor

o Space for both types (as well as inout) is deallocated on the coprocessor

[Figure: Host and MIC timelines for #pragma offload inout(pA:length(n)) {...}. Steps: 1 Allocate, 2 Copy over, 3 pA used on the MIC, 4 Copy back, 5 Free]




Heterogeneous Compiler Offload using Explicit Copies

C/C++ | Syntax | Semantics

Offload pragma | #pragma offload <clauses> <statement block> | Allow next statement block to execute on Intel® MIC Architecture or host CPU

Keyword for variable & function definitions | __attribute__((target(mic))) | Compile function for, or allocate variable on, both CPU and Intel® MIC Architecture

Entire blocks of code | #pragma offload_attribute(push, target(mic)) … #pragma offload_attribute(pop) | Mark entire files or large blocks of code for generation on both host CPU and Intel® MIC Architecture

Data transfer | #pragma offload_transfer target(mic) | Initiates asynchronous data transfer, or initiates and completes synchronous data transfer




Heterogeneous Compiler Offload using Explicit Copies

Fortran | Syntax | Semantics

Offload directive | !dir$ omp offload <clause> <OpenMP construct> | Execute next OpenMP* parallel construct on Intel® MIC Architecture

Offload directive | !dir$ offload <clauses> <statement> | Execute next statement (function call) on Intel® MIC Architecture

Keyword for variable/function definitions | !dir$ attributes offload:<MIC> :: <rtn-name> | Compile function or variable for CPU and Intel® MIC Architecture

Data transfer | !dir$ offload_transfer target(mic) | Initiates asynchronous data transfer, or initiates and completes synchronous data transfer





Heterogeneous Compiler Offload using Explicit Copies – Clauses

Clauses | Syntax | Semantics

Target specification | target( name[:card_number] ) | Where to run construct

Conditional offload | if (condition) | Boolean expression

Inputs | in(var-list [modifiers]) | Copy from host to coprocessor

Outputs | out(var-list [modifiers]) | Copy from coprocessor to host

Inputs & outputs | inout(var-list [modifiers]) | Copy host to coprocessor and back when offload completes

Non-copied data | nocopy(var-list [modifiers]) | Data is local to target

Async. offload | signal(signal-slot) | Trigger async offload

Async. offload | wait(signal-slot) | Wait for completion

Variables and pointers restricted to scalars, structs of scalars, and arrays of scalars




Heterogeneous Compiler Offload using Explicit Copies – Modifiers

Modifiers | Syntax | Semantics

Specify pointer length | length(element-count-expr) | Copy N elements of the pointer’s type

Control pointer memory allocation | alloc_if ( condition ) | Allocate memory to hold data referenced by pointer if condition is TRUE

Control freeing of pointer memory | free_if ( condition ) | Free memory used by pointer if condition is TRUE

Control target data alignment | align ( expression ) | Specify minimum memory alignment on target

Variables and pointers restricted to scalars, structs of scalars, and arrays of scalars
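The alloc_if and free_if modifiers can be combined to keep a buffer resident on the coprocessor between two offloads. The following is a minimal sketch of that idea, not code from the slides: the function name sum_and_max is hypothetical, the pragmas are guarded by the __INTEL_OFFLOAD macro so that other compilers simply run both loops on the host, and card 0 is assumed for both offloads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: two passes over the same buffer in two offloads.
 * The first offload allocates and copies data but does not free it;
 * the second reuses the resident copy (nocopy) and frees it at the end. */
void sum_and_max(float *data, int count, float *sum_out, float *max_out)
{
    float sum = 0.0f;
    float max;
    int i;

    assert(data != NULL && count > 0);
    max = data[0];

#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) in(count) inout(sum) \
            in(data : length(count) alloc_if(1) free_if(0))
#endif
    {
        for (i = 0; i < count; i++)
            sum += data[i];
    }

#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) in(count) inout(max) \
            nocopy(data : length(count) alloc_if(0) free_if(1))
#endif
    {
        for (i = 0; i < count; i++)
            if (data[i] > max)
                max = data[i];
    }

    *sum_out = sum;
    *max_out = max;
}
```

Without the modifiers, each offload would allocate, transfer and free the array again, so the second pass would pay for a second copy over the PCIe bus.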





Implicit offload




Heterogeneous Compiler Offload using Implicit Copies

• Section of memory maintained at the same virtual address on both the host and Intel® MIC Architecture coprocessor

• Reserving same address range on both devices allows

– Seamless sharing of complex pointer-containing data structures

– Elimination of user marshaling and data management

– Use of simple language extensions to C/C++

[Figure: a C/C++ executable with offload code; host memory on the Host and KN* memory on Intel® MIC map the same address range]




Heterogeneous Compiler Offload using Implicit Copies

• When “shared” memory is synchronized

– Automatically done around offloads (so memory is only synchronized on entry to, or exit from, an offload call)

– Only modified data is transferred between CPU and coprocessor

• Dynamic memory you wish to share must be allocated with special functions: _Offload_shared_malloc, _Offload_shared_aligned_malloc, _Offload_shared_free, _Offload_shared_aligned_free

• Allows transfer of C++ objects

– Pointers are no longer an issue when they point to “shared” data

• Well-known methods can be used to synchronize access to shared data and prevent data races within offloaded code

– E.g., locks, critical sections, etc.

This model is integrated with the Intel® Cilk™ Plus parallel extensions


Note: Not supported on Fortran - available for C/C++ only




Heterogeneous Compiler Implicit: Offloading using _Cilk_offload Example

// Shared variable declaration for pi
_Cilk_shared float pi;

// Shared function declaration for compute_pi
_Cilk_shared void compute_pi(int count)
{
    int i;

    #pragma omp parallel for reduction(+:pi)
    for (i=0; i<count; i++)
    {
        float t = (float)((i+0.5f)/count);
        pi += 4.0f/(1.0f+t*t);
    }
}

void findpi()
{
    int count = 10000;

    // Initialize shared global variables
    pi = 0.0f;

    // Compute pi on target
    _Cilk_offload compute_pi(count);

    pi /= count;
}
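The loop in this example is a midpoint-rule approximation of the integral of 4/(1+t²) over [0,1], which equals π. As a host-only sketch for checking the arithmetic, the same sum can be written in plain C without the shared and offload keywords. The name estimate_pi is hypothetical, and double is used here instead of the slide's float to reduce rounding error.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi: the integral of 4/(1+t^2) over [0,1].
 * Each iteration samples the integrand at the middle of one of
 * `count` equal sub-intervals; the sum is divided by count at the end. */
double estimate_pi(int count)
{
    double sum = 0.0;
    int i;

    assert(count > 0);

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < count; i++) {
        double t = (i + 0.5) / count;
        sum += 4.0 / (1.0 + t * t);
    }
    return sum / count;
}
```

With count = 10000, as in findpi() above, the estimate agrees with π to well beyond five decimal places.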




Heterogeneous Compiler Keyword _Cilk_shared for Data/Functions


What | Syntax | Semantics

Function | int _Cilk_shared f(int x) { return x+1; } | Versions generated for both CPU and card; may be called from either side

Global | _Cilk_shared int x = 0; | Visible on both sides

File/Function static | static _Cilk_shared int x; | Visible on both sides, only to code within the file/function

Class | class _Cilk_shared x {…}; | Class methods, members, and operators are available on both sides

Pointer to shared data | int _Cilk_shared *p; | p is local (not shared), can point to shared data

A shared pointer | int *_Cilk_shared p; | p is shared; should only point at shared data

Entire blocks of code | #pragma offload_attribute(push, _Cilk_shared) … #pragma offload_attribute(pop) | Mark entire files or large blocks of code _Cilk_shared using this pragma




Heterogeneous Compiler Implicit: Offloading using _Cilk_offload

Feature | Example | Description

Offloading a function call | x = _Cilk_offload func(y); | func executes on coprocessor if possible

Offloading a function call | x = _Cilk_offload_to (card_number) func(y); | func must execute on specified coprocessor

Offloading asynchronously | x = _Cilk_spawn _Cilk_offload func(y); | Non-blocking offload

Offload a parallel for-loop | _Cilk_offload _Cilk_for(i=0; i<N; i++) { a[i] = b[i] + c[i]; } | Loop executes in parallel on target. The loop is implicitly outlined as a function call.




Heterogeneous Compiler Command-line Options

Offload-specific arguments to the Intel® Compiler:

• Generate host+coprocessor code (by default only host code is generated): -offload-build (Deprecated – offload is default)

• Produce a report of offload data transfers at compile time (not runtime): -opt-report-phase:offload

• Add Intel® MIC Architecture compiler switches: -offload-copts:"switches"

• Add Intel® MIC Architecture archiver switches: -offload-aropts:"switches"

• Add Intel® MIC Architecture linker switches: -offload-ldopts:"switches"

Example:

icc -g -O2 -mkl -offload-build -offload-copts="-g -O3" -offload-ldopts="-L/opt/intel/composerxe_mic/mkl/lib/mic" foo.c




Offload examples




Example 1: Using MKL for offloading Lapack and Blas routines

int main() {
    // initialize variables …

    #pragma offload target(mic) in(transa, transb, N, alpha, beta) \
            in(A:length(matrix_elements)) in(B:length(matrix_elements)) \
            inout(C:length(matrix_elements))
    {
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
              &beta, C, &N);
    }

    // … continue code
}

Sgemm performs C=beta*C+alpha*A*B, transa and transb regulate the transposition of A and B and the Ns define the sizes of the matrices (see documentation). C is input and output, all others are input only.

MKL will automatically make optimal use of MIC
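To make the formula C=beta*C+alpha*A*B concrete, here is a deliberately naive reference loop for square matrices without transposition. It is a sketch for checking results on small inputs, not a replacement for MKL: the names sgemm_reference and demo_sgemm_corner are hypothetical, and column-major storage is assumed because that is the BLAS convention.

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference for C = beta*C + alpha*A*B with n x n matrices
 * stored column-major (element (row, col) lives at row + col*n). */
void sgemm_reference(int n, float alpha, const float *a, const float *b,
                     float beta, float *c)
{
    int row, col, k;

    assert(n > 0 && a != NULL && b != NULL && c != NULL);

    for (col = 0; col < n; col++) {
        for (row = 0; row < n; row++) {
            float acc = 0.0f;
            for (k = 0; k < n; k++)
                acc += a[row + (size_t)k * n] * b[k + (size_t)col * n];
            c[row + (size_t)col * n] =
                beta * c[row + (size_t)col * n] + alpha * acc;
        }
    }
}

/* Small 2x2 check: A = [1 3; 2 4], B = [5 7; 6 8], C filled with ones. */
float demo_sgemm_corner(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    float c[4] = {1.0f, 1.0f, 1.0f, 1.0f};

    sgemm_reference(2, 2.0f, a, b, 0.5f, c);
    return c[0]; /* 0.5*1 + 2*(1*5 + 3*6) */
}
```

Comparing a small offloaded sgemm call against such a loop is one simple way to confirm that the in/inout clauses transfer the right number of elements.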





Example 2: Simultaneous computation on host and accelerator

When using a straight

#pragma offload

the host blocks until completion of the offloaded region or function. To obtain maximum performance, it is necessary to keep the host working while the offload computes.

[Figure: Host and Target timelines. After prework, host and target each compute part of the workload in parallel, followed by postwork]




Example 2: Simultaneous computation on host and accelerator - OpenMP

// Function is generated for both MIC and CPU
double __attribute__((target(mic))) myworkload(double input){
    // do something useful here
    return result;
}

int main(void){
    //…. Initialize variables

    // Create two threads in an OpenMP sections environment
    #pragma omp parallel sections
    {
        // One thread executes the offload code on MIC
        #pragma omp section
        {
            #pragma offload target(mic)
            result1= myworkload(input1);
        }

        // The other thread executes the same function on the host
        #pragma omp section
        result2= myworkload(input2);
    }
}




Example 2: Simultaneous computation on host and accelerator - Cilk

// Function is generated for both MIC and CPU
_Cilk_shared double myworkload(double input){
    // do something useful here
    return result;
}

int main() {
    // One thread is spawned and executes the offload code on MIC
    result1 = _Cilk_spawn _Cilk_offload myworkload(input2);

    // The host executes the same function and waits
    result2 = myworkload(input1);
    cilk_sync;
}














Data Parallelism of Intel® Processors (2)

AVX

• Vector size: 256 bit

• Data types: 32 and 64 bit float

• VL: 4, 8, 16

Intel® MIC

• Vector size: 512 bit

• Data types: 32 and 64 bit integer, 32 and 64 bit float

• VL: 8, 16

[Figure: element-wise operation Xi◦Yi on 32 bit float lanes. AVX: X1..X8 and Y1..Y8 in a 256 bit register (bits 0 to 255). Intel® MIC: X1..X16 and Y1..Y16 in a 512 bit register (bits 0 to 511)]







Vectorization of Code

• Transform sequential code to exploit data parallel capabilities (SIMD) of Intel processors

– Manually by explicit syntax

– Automatically by tools like a compiler

for(i = 0; i <= MAX;i++)
    c[i] = a[i] + b[i];

[Figure: scalar execution adds a[i] and b[i] into c[i] one element at a time; vector execution adds a[i] to a[i+7] and b[i] to b[i+7] into c[i] to c[i+7] in one operation]








Vectorization by Compiler

[Figure: compiler vectorization flow. Input: C/C++/FORTRAN source code. Express/expose vector parallelism: Array Notation, SIMD pragma, Vectorization Hints (ivdep/vector pragmas), Fully Automatic Analysis, Elemental Function (side label: vector part of Intel® Cilk™ Plus extension). Vectorizer: map vector parallelism to vector ISA, optimize and code gen. Targets: Intel® SSE, Intel® AVX, Intel® MIC]

Vectorizer makes retargeting easy!








Vectorization Report

Provides details on vectorization success & failure:

Linux*, Mac OS* X: -vec-report<n>, Windows*: /Qvec-report<n>

n Diagnostic Messages

0 Tells the vectorizer to report no diagnostic information. Useful for turning off reporting in case it was enabled on command line earlier.

1 Tells the vectorizer to report on vectorized loops. [default if n missing]

2 Tells the vectorizer to report on vectorized and non-vectorized loops.

3 Tells the vectorizer to report on vectorized and non-vectorized loops and any proven or assumed data dependences.

4 Tells the vectorizer to report on non-vectorized loops.

5 Tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized.

6 Tells the vectorizer to use greater detail when reporting on vectorized and non-vectorized loops and any proven or assumed data dependences.

X To be done: Even more details in next compiler releases !








Vectorization Report Sample

Example:

35: subroutine fd( y )
36: integer :: i
37: real, dimension(10), intent(inout) :: y
38: do i=2,10
39: y(i) = y(i-1) + 1
40: end do
41: end subroutine fd

novec.f90(38): (col. 3) remark: loop was not vectorized: existence of vector dependence.

novec.f90(39): (col. 5) remark: vector dependence: proven FLOW dependence between y line 39, and y line 39.

novec.f90(38:3-38:3):VEC:MAIN_: loop was not vectorized: existence of vector dependence

Additional details about loop transformations, in-lining, versioning, etc. are reported by compiler switch -opt-report








Array Sections

• Array Section Notation

<array base> [ <lower bound> : <length> [: <stride>] ] [ <lower bound> : <length> [: <stride>] ].....

• Note that length is chosen, not upper bound as in Fortran [lower bound : upper bound]

A[:] // All elements of vector A

B[2:6] // Elements 2 to 7 of vector B

D[0:3:2] // Elements 0,2,4 of vector D

E[0:3][0:4] // 12 elements from E[0][0] to E[2][3]

[Figure: float B[10] with elements 0 to 9; B[2:6] = … selects elements 2 to 7]
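Because the second field of a section is a length rather than an upper bound, it can help to spell out which indices a section touches. The sketch below expresses base[lower:length:stride] as an ordinary C loop; copy_section and demo_last_of_section are hypothetical helper names, not part of Intel® Cilk™ Plus.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C meaning of dst[0:length] = src[lower:length:stride]:
 * `length` elements are read, starting at `lower`, `stride` apart. */
void copy_section(const int *src, size_t lower, size_t length,
                  size_t stride, int *dst)
{
    size_t k;

    assert(stride > 0);
    for (k = 0; k < length; k++)
        dst[k] = src[lower + k * stride];
}

/* Returns the last index selected from a 10-element array whose
 * element i holds the value i. */
int demo_last_of_section(size_t lower, size_t length, size_t stride)
{
    int b[10], out[10];
    int i;

    assert(length > 0 && lower + (length - 1) * stride < 10);
    for (i = 0; i < 10; i++)
        b[i] = i;
    copy_section(b, lower, length, stride, out);
    return out[length - 1];
}
```

For B[2:6] the last selected element is 7, and for D[0:3:2] it is 4, matching the comments in the list above.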


What is Elemental Function?

• Write a function for one element.

• Add __declspec(vector) to get vector code for it.

__declspec(vector)
float foo(float a, float b, float c, float d) {
    return a * b + c * d;
}

• and obtain

vmulps ymm0, ymm0, ymm1
vmulps ymm2, ymm2, ymm3
vaddps ymm0, ymm0, ymm2
ret

• Call it from auto-vec or SIMD loop

for(i=0;i<n;i++){
    A[i] = foo(B[i], C[i], D[i], E[i]);
}

• Call it from Array Notation

A[:] = foo(B[:], C[:], D[:], E[:]);

• Call it from Elemental Function

__declspec(vector)
float bar(float a, float b, float c, float d){
    return sinf(foo(a,b,c,d));
}

• Call scalar version from scalar code

e = foo(a, b, c, d);

Elemental Function: Uniform/Linear clauses

• Why do we need them?

– Because “vector” loads and stores of IA chips are optimized for accessing immediately next elements in memory (e.g., [v]movups).

• They are most useful when consumed in the address computation.

• __declspec(vector) void foo(float *a, int i);

– a is a vector of pointers

– i is a vector of integers

– a[i] becomes gather/scatter.

• __declspec(vector(uniform(a))) void foo(float *a, int i);

– a is a pointer

– i is a vector of integers

• __declspec(vector(linear(i))) void foo(float *a, int i);

– a is a vector of pointers

– i is a sequence of integers [i, i+1, i+2…]

• __declspec(vector(uniform(a),linear(i))) void foo(float *a, int i);

– a is a pointer

– i is a sequence of integers [i, i+1, i+2…]

– a[i] is a unit-stride load/store ([v]movups).
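The same uniform/linear idea can be written with the OpenMP 4.0 declare simd directive, which a later slide describes as very much the same as Intel's elemental functions. This is a sketch under that assumption rather than the Intel-specific syntax shown above; double_element, double_all and demo_double_all are hypothetical names. A compiler without OpenMP SIMD support ignores the pragmas and runs the scalar code.

```c
#include <assert.h>

/* a is the same pointer in every SIMD lane (uniform) and i advances by
 * one from lane to lane (linear), so a[i] is a unit-stride access. */
#pragma omp declare simd uniform(a) linear(i:1)
void double_element(float *a, int i)
{
    a[i] *= 2.0f;
}

/* SIMD loop that calls the elemental function once per element. */
void double_all(float *a, int n)
{
    int i;

    assert(a != 0 && n >= 0);

    #pragma omp simd
    for (i = 0; i < n; i++)
        double_element(a, i);
}

float demo_double_all(void)
{
    float v[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};

    double_all(v, 8);
    return v[7];
}
```

Dropping either clause would still give correct results, but the compiler would have to assume a gather/scatter access pattern for a[i].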

SIMD Pragma: definition

• Top-level

– #pragma simd

– !DIR$ SIMD

• Attached clauses to describe semantics / aid code generation

– vectorlength(VL)/vectorlengthfor(TYPE)

– private/firstprivate/lastprivate(var1[, var2, …])

– reduction(oper1:var1[, …][, oper2:var2[, …]])

– linear(var1[:step1][, var2[:step2], …])

– [no]assert

3/29/2013

SIMD Pragma: simple examples

void foo(int *A, int N, int n){
    int i;
    #pragma simd vectorlength(4)
    for (i=0; i<n; i++){
        A[i] = A[i] + A[i-N];
    }
}

• #pragma simd is not applicable in general if “0 < N < n”, because iteration i then reads a value written N iterations earlier. Vectorization is still possible if N isn’t too small: with vectorlength(4), N must be at least 4.

short sum(float *A, int N, int n){
    int i; short x = 0, xt;
    #pragma simd reduction(+:x)
    for (i=0; i<n; i++){
        xt = x + A[i]*2;
        x = xt + N;
    }
    return x;
}

• Tell compiler “x” has sum-reduction semantics
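Why the dependence distance N matters for A[i] = A[i] + A[i-N] can be checked with a small emulation. The sketch below does not use any SIMD instructions: it mimics a 4-wide vector loop by loading all four operands of a chunk before storing any result, and compares that with the scalar loop. All names are hypothetical, and the loop starts at i = dist so that every read stays inside the array.

```c
#include <assert.h>
#include <string.h>

#define VL 4    /* emulated vector length, as in vectorlength(4) */
#define LEN 32  /* array length used by the demo */

/* Scalar reference: iteration i sees the results of earlier iterations. */
static void shift_add_scalar(int *a, int dist, int n)
{
    int i;
    for (i = dist; i < n; i++)
        a[i] = a[i] + a[i - dist];
}

/* Emulated vector execution: each chunk of VL iterations reads all of
 * its operands first and only then writes its VL results. */
static void shift_add_chunked(int *a, int dist, int n)
{
    int i = dist, lane;

    for (; i + VL <= n; i += VL) {
        int tmp[VL];
        for (lane = 0; lane < VL; lane++)
            tmp[lane] = a[i + lane] + a[i + lane - dist];
        for (lane = 0; lane < VL; lane++)
            a[i + lane] = tmp[lane];
    }
    for (; i < n; i++)  /* scalar remainder */
        a[i] = a[i] + a[i - dist];
}

/* Returns 1 if chunked execution gives the scalar result for `dist`. */
int chunked_matches_scalar(int dist)
{
    int x[LEN], y[LEN], i;

    assert(dist > 0 && dist < LEN);
    for (i = 0; i < LEN; i++)
        x[i] = y[i] = i + 1;
    shift_add_scalar(x, dist, LEN);
    shift_add_chunked(y, dist, LEN);
    return memcmp(x, y, sizeof x) == 0;
}
```

For distances of 4 or more every operand of a chunk was produced by an earlier chunk, so both versions agree; for distances 1 to 3 a chunk reads values it has not yet updated and the results differ.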



Compiler has to assume the worst case the language/flag allow.

float *A;
void vectorize()
{
    int i;
    for (i = 0; i < 102400; i++) {
        A[i] *= 2.0f;
    }
}

Loop Body:

• Load of A

• Load of A[i]

• Multiply with 2.0f

• Store of A[i]

Q: Will the store modify A? A: Maybe. “NO” is needed to make vectorization legal.

3) Recompile with -ansi-alias

• icc -vec-report1 -ansi-alias test1.c

• test1.c(4): (col. 3) remark: LOOP WAS VECTORIZED.

4) Change “float *A” to “float *restrict A”.

• icc -vec-report1 test1a.c

• test1a.c(4): (col. 3) remark: LOOP WAS VECTORIZED.

5) Add “#pragma ivdep” to the loop.

• icc -vec-report1 test1b.c

• test1b.c(5): (col. 3) remark: LOOP WAS VECTORIZED.

Wait, we aren’t done yet
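The restrict qualifier is standard C99, so the idea behind step 4 can also be shown on a function parameter instead of the global pointer used on the slide. This is a sketch with hypothetical names (scale_in_place, demo_scale); whether a given compiler then vectorizes the loop still depends on its options and version.

```c
#include <assert.h>
#include <stddef.h>

/* `restrict` promises that, inside this function, the floats are reached
 * only through `data`. Because `data` is a local copy of the pointer, a
 * store to data[i] also cannot change the pointer itself. */
void scale_in_place(float *restrict data, size_t count, float factor)
{
    size_t i;

    assert(data != NULL || count == 0);
    for (i = 0; i < count; i++)
        data[i] *= factor;
}

float demo_scale(void)
{
    float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};

    scale_in_place(v, 4, 2.0f);
    return v[3];
}
```

The promise is the programmer's responsibility: calling such a function with overlapping buffers through different restrict pointers is undefined behavior.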



Does FORTRAN have aliasing issue?

• Standard rule is in favor of compiler optimization. In plain English:

• Two storage locations in different names won’t overlap unless both are read-only.

      SUBROUTINE FOO(A,B,N)
      REAL A(*), B(*)
      DO I=1, N
        A(I) = B(I)+1
      ENDDO
      END

• Compiler still needs to do memory disambiguation (or data dependence analysis)

      SUBROUTINE FOO(A,M,N1,N2)
      REAL A(*)
      DO I=N1, N2
        A(I) = A(I-M)+1
      ENDDO
      END

6) ifort ftest1.f -vec-report2

7) Add !DIR$ IVDEP: ifort ftest1a.f -vec-report2



Writing Explicit Vector Code with Intel® Cilk™Plus

float *A;
void vectorize()
{
    for (int i = 0; i < 102400; i++) {
        A[i] *= 2.0f;
    }
}

8) Using SIMD Pragma

float *A;
void vectorize()
{
    #pragma simd vectorlength(4)
    for (int i = 0; i < 102400; i++) {
        A[i] *= 2.0f;
    }
}

9) Using Array Notation

float *A;
void vectorize()
{
    A[0:102400] *= 2.0f;
}

10) Using Elemental Function

float *A;
__declspec(noinline)
__declspec(vector(uniform(p), linear(i)))
void mul(float *p, int i)
{
    p[i] *= 2.0f;
}
void vectorize()
{
    for (int i = 0; i < 102400; i++) {
        mul(A, i);
    }
}



Memory Accesses and Alignment

Memory Access Patterns

• Unit-stride: A[i] A[i+1] A[i+2] A[i+3] …

• Strided (special form of gather/scatter): A[2*i] A[2*(i+1)] A[2*(i+2)]

• Gather/Scatter: A[B[i]] A[B[i+1]] A[B[i+2]]

If you write in Cilk™ Plus Array Notation, access patterns are obvious at a glance: unit-stride means A[:] or A[lb:#elems]. This helps you think more clearly.

Alignment Optimization

• Aligned unit-stride: Addr % SIZE == 0

• Alignment unknown unit-stride: Addr % SIZE == ???

• Misaligned unit-stride: Addr % SIZE != 0

SIZE: 64B for Xeon Phi™, 32B for AVX1/2, 16B for SSE4.2 and below

Align your data AND tell the compiler

• Good array data alignment for

– Pentium 4 to Core i7: 16B

– AVX: 32B

– MIC: 64B

• Data alignment directive (64B example)

– C/C++ Windows: __declspec(align(64)) float A[1000];

– C/C++ Linux/MacOS: float A[1000] __attribute__((aligned(64)));

– Fortran: !DIR$ ATTRIBUTES ALIGN: 64:: A

• Aligned malloc

– _aligned_malloc()

– _mm_malloc()

• Data alignment assertion (64B example)

– C/C++: __assume_aligned(p,64);

– Fortran: !DIR$ ASSUME_ALIGNED A(1):64

• Multiple of good number

– __assume(n%16==0)

• Aligned loop assertion

– C/C++: #pragma vector aligned

– Fortran: !DIR$ VECTOR ALIGNED
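For heap data, a portable way to obtain 64B alignment is the C11 function aligned_alloc, used here as a stand-in for the _mm_malloc() and _aligned_malloc() calls listed above. The sketch assumes a C11 library; alloc_aligned_floats, is_aligned and demo_alignment_ok are hypothetical names. Telling the compiler about the alignment (for example with __assume_aligned or #pragma vector aligned) remains a separate step.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* bytes, the value the slide recommends for MIC */

/* Allocates room for `count` floats on a 64-byte boundary.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. Release the buffer with free(). */
float *alloc_aligned_floats(size_t count)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;

    if (rounded == 0)
        rounded = ALIGNMENT;
    return aligned_alloc(ALIGNMENT, rounded);
}

/* Returns 1 if p is a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment > 0);
    return ((uintptr_t)p % alignment) == 0;
}

int demo_alignment_ok(void)
{
    float *a = alloc_aligned_floats(1000);
    int ok = (a != NULL) && is_aligned(a, ALIGNMENT);

    free(a);
    return ok;
}
```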



Fixed-size Array Sections

Short vector coding

#define VLEN 4
for(i=0;i<N;i+=VLEN){
    A[i:VLEN]= B[i:VLEN]+C[i:VLEN];
    D[i:VLEN]= E[i:VLEN]+A[i:VLEN];
}

Similar C loop

for(i=0;i<N;i+=VLEN){
    for(j=0;j<VLEN;j++) A[i+j]=B[i+j]+C[i+j];
    for(j=0;j<VLEN;j++) D[i+j]=E[i+j]+A[i+j];
}

Long vector coding

A[0:N]=B[0:N]+C[0:N];
D[0:N]=E[0:N]+A[0:N];

This is visually appealing, but may not be high performing.

Similar to C loops:

for(i=0;i<N;i++){ A[i]=B[i]+C[i]; }
for(i=0;i<N;i++){ D[i]=E[i]+A[i]; }

Use short-vector coding if you have data reuse between statements and N is big.



Alignment and Module Data Known Sized Arrays

Example: Global arrays declared in modules with known size.

module mymod
  !dir$ attributes align:64 :: a
  !dir$ attributes align:64 :: b
  real (kind=8) :: a(1000), b(1000)
end module mymod

subroutine add_them()
  use mymod
  implicit none
  ! array syntax shown, could also be explicit loop
  !...No explicit directive needed to say that A and B
  ! are aligned, the USE brings that information
  a = a + b
end subroutine add_them

This saves coding effort AND improves performance!


Alignment and Module Data Allocatable Arrays

Example 8.2: Global allocatable arrays declared in modules, but allocated elsewhere.

module mymod
  real, allocatable :: a(:), b(:)
end module mymod

subroutine add_them()
  use mymod
  implicit none
  !dec$ vector aligned
  a = a + b
end subroutine add_them

Currently cannot use !dir$ attributes align:64 here: it is not safe to assume that the actual allocation site will use an aligned allocation.

Plan is to let you write “attribute” syntax for 64B alignment cases. ETA: Feb 2013.

11/13/2012

















A Family of Parallel Programming Models: Developer Choice

• Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism. Open sourced. Also an Intel product.

• Intel® Threading Building Blocks: Widely used C++ template library for parallelism. Open sourced. Also an Intel product.

• Domain-Specific Libraries: Intel® Integrated Performance Primitives, Intel® Math Kernel Library

• Established Standards: Message Passing Interface (MPI), OpenMP*, Coarray Fortran, OpenCL*

• Research and Development: Intel® Concurrent Collections, Offload Extensions, Intel® SPMD Parallel Compiler

Choice of high-performance parallel programming models

Applicable to Multicore and Many-core Programming







Cilk Tasking – Very Simple

/* Matrix Transpose */
cilk_for (int i = 0; i < n; i++)
    cilk_for (int j = 0; j < n; j++)
        b[j][i] = a[i][j];

int fib (int n) {
    if (n < 2) return 1;
    else {
        int x = cilk_spawn fib(n-1);
        int y = cilk_spawn fib(n-2);
        cilk_sync;
        return x + y;
    }
}

• A “composable” model for thread parallelism

– Programming in tasks, not threads

– Don’t ask: “How many cores are available?”

• Very closely follows the serial execution semantic

– For deterministic code, in fact, it is the same

– Easy testing

– Easy debugging

• For reductions and critical regions, additional “hyper-objects” (“reducers”) are available








OpenMP* Support

• Intel® Compilers (both C++ and Fortran) are fully compliant with OpenMP* 3.1

– See http://www.openmp.org/ for standard, tutorials etc.

• Includes OpenMP Tasking

• Many Intel-specific control mechanisms for thread mapping, scheduler control, memory allocation, thread-private variable implementation etc.

• Support for OpenMP 4.0 being added

– Standard still being worked on

– Many features will be added to new compiler release in 2013

• Automatic parallelization by the compiler maps to the Intel OpenMP run-time system for thread management







OpenMP 4.0 (beta2 released Q1/13)

• Portable SIMD construct

– Execute iterations of following loop in SIMD chunks: #pragma omp simd [clause [[,] clause] …]

– Not the same as Intel’s pragma SIMD but …

• SIMD function declaration prefix

– #pragma omp declare simd [clause [[,] clause] …]

– Build vector version of function to be called from “SIMD” loop

– Very much the same as Intel’s elemental functions

• Extended affinity support

– E.g. via env variables OMP_PLACES and OMP_PROC_BIND

– Similarly powerful to Intel’s KMP_AFFINITY

• FORTRAN 2003 support

• User defined reductions

• Taskgroups and a lot more …

• Support for offload/accelerator (TR1, not in spec yet)
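As an illustration of the portable SIMD construct, the sketch below marks a dot-product loop with #pragma omp simd and a reduction clause. The names dot_product and demo_dot are hypothetical, and a compiler without OpenMP 4.0 SIMD support will ignore the pragma and run the loop as ordinary scalar code.

```c
#include <assert.h>

/* Dot product with an explicit SIMD reduction: each SIMD lane keeps a
 * partial sum, and the partial sums are combined after the loop. */
float dot_product(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    int i;

    assert(x != 0 && y != 0 && n >= 0);

    #pragma omp simd reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

float demo_dot(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {4.0f, 3.0f, 2.0f, 1.0f};

    return dot_product(x, y, 4);
}
```

Without the reduction clause the update of sum would be a loop-carried dependence, which is the same situation the Intel-specific #pragma simd reduction(+:x) addresses.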















Prefetching Basics

Compiler prefetching is turned on by default for the Intel® Xeon Phi™ coprocessor

• At option levels -O2 and above

• Prefetches issued for all regular memory accesses inside loops

• Prefetching for memory accesses expressed using load/store intrinsics

• Maximal loop prefetching

Use the compiler reporting options to see detailed diagnostics of prefetching per loop

• -opt-report-phase hlo -opt-report 3

Use compiler option -no-opt-prefetch to turn off compiler prefetching




Loop-Prefetches

• Prefetches issued targeting memory access in a future iteration of the loop

• Targeting regular array accesses

• Pointer accesses similar to array accesses where the address can be predicted in advance

• Supports address calculations that involve:

– Affine functions of surrounding loop indices

– More complicated access-patterns that require additional instructions inside the loop




Indirect Prefetch Example

#pragma simd reduction(+:fxtmp,fytmp,fztmp) vectorlengthfor(double)
for (int jj = 0; jj < jnum; jj++) {
    int j,sbindex, jtype; double factor_lj;
    j = jlist[jj]; sbindex = sbmask(j); …
    _mm_prefetch((char *) &xx[jlist[jj+1+16]], 1);
    …
    _mm_prefetch((char *) &xx[jlist[jj+8+16]], 1);
    _mm_prefetch((char *) &ff[jlist[jj+1+16]], 5);
    …
    _mm_prefetch((char *) &ff[jlist[jj+8+16]], 5);
    double delx = xtmp - xx[j].x; double dely = ytmp - xx[j].y;
    double delz = ztmp - xx[j].z; double rsq = delx*delx + dely*dely + delz*delz;
    if (rsq < global_cutsq) {
        double r2inv = 1.0/rsq; double r6inv = r2inv*r2inv*r2inv;
        double forcelj = r6inv * (global_lj1*r6inv - global_lj2);
        double fpair = factor_lj*forcelj*r2inv;
        fxtmp += delx*fpair; fytmp += dely*fpair; fztmp += delz*fpair;
        if (NEWTON_PAIR || j < nlocal) {
            ff[j].x -= delx*fpair; ff[j].y -= dely*fpair; ff[j].z -= delz*fpair; }
    }
}
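The pattern in this example, prefetching the element that an index list will select a fixed number of iterations ahead, can be reduced to a small self-contained sketch. It uses the GCC/Clang builtin __builtin_prefetch instead of the _mm_prefetch intrinsic from the slide, and the names gather_sum, demo_gather_sum and the distance of 16 are assumptions for illustration; a useful distance has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* assumed look-ahead, tune per workload */

/* Sums values[index[i]] and hints the cache about the element that will
 * be needed PREFETCH_DISTANCE iterations later. The hint is optional:
 * the result is the same with or without it. */
double gather_sum(const double *values, const int *index, size_t count)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < count; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_DISTANCE < count)
            __builtin_prefetch(&values[index[i + PREFETCH_DISTANCE]], 0, 1);
#endif
        sum += values[index[i]];
    }
    return sum;
}

/* Demo: 64 values 1..64 visited in reverse order through an index list. */
double demo_gather_sum(void)
{
    double values[64];
    int index[64];
    int i;

    for (i = 0; i < 64; i++) {
        values[i] = (double)(i + 1);
        index[i] = 63 - i;
    }
    return gather_sum(values, index, 64);
}
```

The bounds check before the prefetch keeps the read of index[i + PREFETCH_DISTANCE] inside the index list; the slide's version assumes the list is long enough.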





Interactions with the Hardware Prefetcher

• Intel® Xeon Phi™ coprocessor has a hardware L2 prefetcher that is enabled by default

• If software prefetches are doing a good job, then hardware prefetching does not kick in

– In several workloads (such as stream), maximal software prefetching gives the best performance

• Any references not prefetched by compiler may get prefetched by hardware




Directive Support for Loop Prefetches

• Directive to turn off prefetching for a particular loop

– #pragma noprefetch

– CDEC$ noprefetch

– Specify before a loop, affects only that loop, does not affect inner loops

• Directive to turn off prefetching for a particular routine

– #pragma noprefetch

– CDEC$ noprefetch

– Specify at the top of the routine as the first executable statement

• Prefetch pragma support for C loops

– #pragma prefetch var:hint:distance

• Prefetch directive support for Fortran loops

– CDEC$ prefetch var:hint:distance




Prefetch Distance Tuning Option

-opt-prefetch-distance=n1[,n2]

• n1 specifies the distance for first-level prefetches into L2

• n2 specifies prefetch distance for second-level prefetches from L2 to L1 (use n2 <= n1)

• -opt-prefetch-distance=64,32

• -opt-prefetch-distance=24

o Use first-level distance=24, second-level distance to be determined by compiler


o Turns off all first-level prefetches, second-level uses distance=4 (Use this if you want to rely on hardware prefetching to L2, and compiler prefetching from L2 to L1)


o First-level distance=16, no second-level prefetches issued

• If option not specified, all distances determined by compiler

Support for multicore and massive parallelism

Intel® Cluster Studio XE* Distributed Performance

Intel® Parallel Studio XE* Advanced Performance


Intel® MPI Library

Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor

Intel® C/C++ and Fortran Compilers w/OpenMP

Intel® MKL, Intel® Cilk Plus, Intel® TBB Library, Intel® IPP Library

Intel® Parallel Studio XE

Performance. Scalability

Intel® VTune™ Amplifier XE Performance Profiler

Where is my application…

• Spending Time?

– Focus tuning on functions taking time

– See call stacks

– See time on source

• Wasting Time?

– See cache misses on your source

– See functions sorted by # of cache misses

• Waiting Too Long?

– See locks by wait time

– Red/Green for CPU utilization during wait

• Windows & Linux

• Low overhead

• No special recompiles

“We improved the performance of the latest run 3 fold. We wouldn't have found the problem without something like Intel® VTune™ Amplifier XE.” Claire Cates, Principal Developer, SAS Institute Inc.

Intel® VTune™ Amplifier XE: Advanced Profiling for Scalable Multicore Performance

Intel VTune Amplifier XE supports Intel MIC Architecture

VTune Amplifier XE profiles the MIC architecture through its remote functionality and requires a host




Intel® MPI Library support for the Intel® Xeon Phi™ Coprocessor





MPI+Offload

• MPI ranks on Intel® Xeon® processors (only)

• All messages into/out of processors

• Offload models used to accelerate MPI ranks

• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® MIC Architecture

• Homogeneous network of hybrid nodes:

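[Figure: four hybrid nodes, each a Xeon host paired with a MIC coprocessor, attached to a common network. Data is exchanged between the Xeon hosts via MPI; each host reaches its MIC via Offload.]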



MPI+Offload: How to run

• Compile your code with the offload directives

$ mpiifort -openmp test.f -o test.offload

• Create your hosts file (Xeon only)

$ cat hosts

node0

node1

• Run your application (Xeon only)

$ mpirun -f hosts -n 2 ./test.offload




Many-core Hosted (Native)

• MPI ranks on Intel® Xeon Phi™ coprocessors (only)

• All messages into/out of Intel® Xeon Phi™ coprocessors

• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes

• Programmed as homogeneous network of many-core CPUs:

[Figure: the same four Xeon + MIC nodes on a network; here the data and the MPI traffic belong to the MIC coprocessors.]



Many-core Hosted (Native): How to run

• Compile your code for the Intel® Xeon Phi™ coprocessor

$ mpiifort -mmic test.f -o test.mic

• Copy the MIC-enabled executable to the coprocessor

$ scp test.mic mic0:/home/user/test.mic

• Create your hosts file (MIC only)

$ cat hosts

mic0

mic1

• Let the library know you're planning on running on MIC

$ export I_MPI_MIC=1

• Run your application (from the Xeon)

$ mpirun -f hosts -n 4 /home/user/test.mic




Symmetric

• MPI ranks on Intel® Xeon Phi™ coprocessors and Intel® Xeon® processors

• Messages to/from any core

• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes

• Programmed as heterogeneous network of homogeneous nodes:

[Figure: the same four Xeon + MIC nodes on a network; data resides on both the Xeon hosts and the MIC coprocessors, and MPI connects all of them.]



Symmetric: How to run

• Compile your code for the Intel® Xeon node

$ mpiifort test.f -o test

• And for the Intel® Xeon Phi™ coprocessor

$ mpiifort -mmic test.f -o test.mic

• Copy the MIC-enabled executable to the coprocessor

$ scp test.mic mic0:/home/user/test.mic

• Create your hosts file (Xeon+MIC)

$ cat hosts

node0

mic0

mic1

• Let the library know you're planning on running on MIC

$ export I_MPI_MIC=1

• Run your application (from the Xeon)

$ mpirun -f hosts -n 4 /home/user/test.mic




Scale Performance: Tune Hybrid Cluster MPI and Thread Performance

Tune cross-node MPI (Intel® Trace Analyzer and Collector)

• Visualize MPI behavior

• Evaluate MPI load balancing

• Find communication hotspots

Tune single node threading (Intel® VTune™ Amplifier XE)

• Visualize thread behavior

• Evaluate thread load balancing

• Find thread sync. bottlenecks



Key Features

• Low Overhead

• Catch all MPI events

• Powerful configuration mechanism

– Filters, settings, features

• Automatic source-code references

• Instrumentation

– Rich API

– Binary instrumentation (itcpin)

– Compiler based (-tcollect)

• Fail-safe version

• Comparison of multiple profiles

• Idealizer

• MPI Correctness Checking


How to use Intel® Trace Analyzer and Collector

• Step 1: Run your binary and create a tracefile. Run the binary for a representative amount of time (to reduce initialization influences) on representative data (no corner cases)

$ mpirun -trace -n 2 ./test

– Alternative 1: Generate an instrumented binary via re-linking

$ mpiicc -trace test.c -o test.inst

$ mpirun -n 2 ./test.inst

– Alternative 2: Instrument the binary

$ itcpin --run -- ./test

• Step 2: To view the generated trace file, start the GUI:

$ traceanalyzer &





[Figure: compare the event timelines of two communication profiles (blue = computation, red = communication)]

[Figure: chart showing how the MPI processes interact]




Chart

A Chart is a numerical or graphical diagram





Timelines: Event Timeline

• Get impression of program structure

• Display functions, messages and collective operations for each process/thread along time-axis

• Retrieval of detailed event information





Timelines: Qualitative Timeline

• Find patterns and irregularities

• Display attributes of functions, messages or collective operations as they occur for any process/thread

• Retrieval of detailed event information





Timelines: Quantitative Timeline

• Get impression on parallelism and load balance

• Show for every function how many threads/processes are currently executing it





Profiles: Flat Function Profile

• Statistics about functions





Profiles: Call-Tree and Call-Graph

• Function statistics including calling hierarchy

– Tree: call-stack

– Graph: calling dependencies





Communication Profiles

• Statistics about point-to-point or collective communication

• Generic matrix supports grouping by several attributes in each dimension: Sender, Receiver, Data volume per msg, Tag, Communicator, Type

• Available attributes: Count, Bytes transferred, Time, Transfer rate




SCIF – low level communication interface






SCIF Symmetric Communications Interface

• The SCIF driver provides a reliable connection-based messaging layer, as well as functionality which abstracts RMA operations.

• The SCIF API is documented in the Intel® MIC SCIF API Reference Manual for User Mode Linux and the Intel® MIC SCIF API Reference Manual for Kernel Mode Linux.

• A common API is exposed for use in both user mode (ring 3) and kernel mode (ring 0). The two variants differ slightly in function signatures, and a few functions are available only in user mode or only in kernel mode.




SCIF - Nodes and Ports

• SCIF node: physical endpoint in the SCIF network. The host and MIC Architecture devices are SCIF nodes (all cores under a single OS). Each node has a node identifier assigned at boot time. Node IDs are generally based on PCIe discovery order. The host node is always assigned ID 0.

• SCIF port: logical destination on a SCIF node. Within a node, a SCIF port on that node may be referred to by its number, a 16-bit integer, similar to an IP port.

• SCIF port identifier: unique across a SCIF network; it comprises both a node identifier and a local port number (analogous to a complete TCP/IP address with port)
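The node/port pairing described above can be pictured with a small C sketch. The type `port_id` and both helpers are hypothetical stand-ins written for illustration, not definitions taken from scif.h, and the 16-bit width of the node field is an assumption (the text only states that the port number is a 16-bit integer).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a SCIF port identifier: node ID plus local port. */
struct port_id {
    uint16_t node;  /* node identifier; the host is always node 0 (width assumed) */
    uint16_t port;  /* local port number on that node, a 16-bit integer */
};

/* Two identifiers name the same destination only if both fields match. */
int port_id_equal(struct port_id a, struct port_id b)
{
    return a.node == b.node && a.port == b.port;
}

/* The host is always node 0, so a host-side port needs only the port number. */
struct port_id host_port(uint16_t port)
{
    struct port_id id = { 0, port };
    return id;
}
```

The same local port number on two different nodes yields two distinct identifiers, just as the same TCP port on two IP addresses names two different endpoints.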





SCIF – Opening a connection

Connecting endpoint              Listening endpoint (node Nj)

epdi=scif_open()                 epdj=scif_open()

scif_bind(epdi,pm)               scif_bind(epdj,pn)

                                 scif_listen(epdj,qLen)

scif_connect(epdi,(Nj,pn))       scif_accept(*nepd,peer)




SCIF - Messaging

• After the connection has been established, messages may be exchanged:

• int scif_send(scif_epd_t epd,void* msg,int len,int flags);

• int scif_recv(scif_epd_t epd,void* msg,int len,int flags);

• Messages may be up to 2^31-1 bytes long

• Message layer queues are relatively short, though

• For bulk data transfer use the SCIF RMA functionality

• The connection is bi-directional





Extension SCIF

SCIF

• Host-KNC communications backbone

• Provides communication capabilities within a single platform (node)

• Low latency, low overhead communication

• Provides a uniform API for communication across the host's PCI Express* system buses

• Directly exposes DMA capabilities for high bandwidth transfer

• Fully exposed (/usr/include/scif.h)





SCIF – Connections Functionality

• scif_epd_t scif_open(void);

Create a new endpoint

• int scif_bind(scif_epd_t epd, uint16_t pn);

Bind Endpoint to port

• int scif_listen(scif_epd_t epd, int backlog);

Set endpoint to listen

• int scif_connect(scif_epd_t epd, struct scif_portID* dst);

Request connection to listening endpoint

• int scif_accept (scif_epd_t epd, struct scif_portID* peer, scif_epd_t* newepd, int flags);

Accepts the connection request

• int scif_close (scif_epd_t epd);

Closes the connection









SCIF – RMA Operations

• off_t scif_register(scif_epd_t epd, void* addr, size_t len, off_t offset, int prot_flags, int map_flags);

Expose a range of address space for control by a remote process.

The memory must be registered before it can be mapped for RMA

• int scif_unregister(scif_epd_t epd, off_t offset, size_t len);

Revoke registration/mapping

• int scif_readfrom(scif_epd_t epd, off_t loffset, size_t len, off_t roffset, int rma_flags);

Read from mapped address range

• int scif_writeto(scif_epd_t epd, off_t loffset, size_t len, off_t roffset, int rma_flags);

Write to mapped address range




Conclusion

• Programming Xeon™ and Xeon™ Phi requires the same skills and knowledge

• Parallelization and vectorization are the key to efficient programs on Xeon™ Phi

• Automatic offload in MKL is the simplest way to use Xeon™ Phi

• Pay attention to optimizing the use of the memory hierarchy. Use prefetching

• The Intel Parallel Studio XE 2013 and Intel Cluster Studio XE 2013 tools significantly extend the developer's capabilities




Legal Disclaimer & Optimization Notice


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2012, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804








Example 3: Asynchronous Transfer & Double Buffering

• Overlap computation and communication

• Generalizes to data domain decomposition

[Figure: timeline of data blocks sent from Host to Target and processed there, labeled pre-work, iteration 0, iteration 1, …, iteration n, and last iteration (n+1).]
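The ping-pong pattern behind double buffering can be sketched on the host alone, without any offload pragmas. This is only an illustration of the control flow. `load_block` and `process_block` are hypothetical stand-ins for the asynchronous transfer and the offloaded computation, `CNT` is an arbitrary block size, and because plain C runs them one after another there is no real overlap here.

```c
#include <assert.h>
#include <stddef.h>

#define CNT 8  /* elements per data block (illustrative size) */

/* Stand-in for an asynchronous host-to-target transfer: fills buf with block k. */
void load_block(float *buf, int k)
{
    for (int j = 0; j < CNT; j++)
        buf[j] = (float)(k * CNT + j);
}

/* Stand-in for the offloaded computation on one block. */
float process_block(const float *buf)
{
    float s = 0.0f;
    for (int j = 0; j < CNT; j++)
        s += buf[j];
    return s;
}

/* Ping-pong between two buffers: block i+1 is requested before block i is processed. */
float double_buffered_sum(int iter)
{
    float buf[2][CNT];
    float total = 0.0f;

    if (iter <= 0)
        return 0.0f;

    load_block(buf[0], 0);                        /* pre-work: send the first block */
    for (int i = 0; i < iter; i++) {
        if (i != iter - 1)
            load_block(buf[(i + 1) % 2], i + 1);  /* next block goes to the idle buffer */
        total += process_block(buf[i % 2]);       /* work on the current block */
    }
    return total;
}
```

With real asynchronous transfers, the load of block i+1 would proceed while block i is being processed. That is why the next block must always go to the buffer that is not currently in use.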




Example 3 – Using Signals

// This does nothing except allocating an array
#pragma offload_transfer target(mic:0) \
    nocopy(in1 : length(cnt) alloc_if(1) free_if(0))

// Start an asynchronous transfer, tracking signal in1
#pragma offload_transfer target(mic:0) \
    in(in1 : length(cnt) alloc_if(0) free_if(0)) signal(in1)

// Start once the completion of the transfer of in1 is signaled
#pragma offload target(mic:0) nocopy(in1) wait(in1) \
    out(res1 : length(cnt) alloc_if(0) free_if(0))
{ … }

// This does nothing except freeing an array
#pragma offload_transfer target(mic:0) \
    nocopy(in1 : length(cnt) alloc_if(0) free_if(1))




Example 3: Double Buffering I

int main(int argc, char* argv[]) {
    // … Allocate & initialize in1, res1,
    // … in2, res2 on host

    // Only allocate arrays on card with alloc_if(1), no transfer
    #pragma offload_transfer target(mic:0) in(cnt) \
        nocopy(in1, res1, in2, res2 : length(cnt) \
               alloc_if(1) free_if(0))

    do_async_in();

    // Only free arrays on card with free_if(1), no transfer
    #pragma offload_transfer target(mic:0) \
        nocopy(in1, res1, in2, res2 : length(cnt) \
               alloc_if(0) free_if(1))

    return 0;
}




Example 3: Double Buffering II

void do_async_in() {
    float lsum;
    int i;

    lsum = 0.0f;

    // Send buffer in1
    #pragma offload_transfer target(mic:0) in(in1 : length(cnt) \
        alloc_if(0) free_if(0)) signal(in1)

    for (i = 0; i < iter; i++) {
        if (i % 2 == 0) {
            // Send buffer in2
            #pragma offload_transfer target(mic:0) if(i != iter - 1) \
                in(in2 : length(cnt) alloc_if(0) free_if(0)) signal(in2)

            // Once in1 is ready (signal!) process in1
            #pragma offload target(mic:0) nocopy(in1) wait(in1) \
                out(res1 : length(cnt) alloc_if(0) free_if(0))
            {
                compute(in1, res1);
            }

            lsum = lsum + sum_array(res1);
        } else {…




Example 3: Double Buffering III

        …} else {
            // Send buffer in1
            #pragma offload_transfer target(mic:0) if(i != iter - 1) \
                in(in1 : length(cnt) alloc_if(0) free_if(0)) signal(in1)

            // Once in2 is ready (signal!) process in2
            #pragma offload target(mic:0) nocopy(in2) wait(in2) \
                out(res2 : length(cnt) alloc_if(0) free_if(0))
            {
                compute(in2, res2);
            }

            lsum = lsum + sum_array(res2);
        } // else
    } // for

    async_in_sum = lsum / (float)iter;
} // do_async_in()



Intel Confidential


Fortran Vectorization

Specific focus on

– Unit stride vectorization

– Copy in/out with temp array usage

– Treatment of user-provided alignment statements

– Multiversion code: defer decision until runtime

• For alignment

• For stride

Some examples here:

– Example 1: Adjustable size arrays as routine parameters

– Example 2: Assumed shape arrays as routine parameters

– More examples and details on link in BKM Pages




Fortran adjustable size array parameters

Adjustable size arrays as parameters

subroutine adj(Y, Z, M, N)

real, intent(inout), dimension(M, N) :: Y

real, intent(in), dimension(M, N) :: Z

integer, intent(in) :: M, N

Y = Y + Z

return

end

2 Questions for vectorization:

– Stride and alignment




Adjustable Size Arrays Vectorization

Question 1: Stride of Y and Z

– When file adj.f90 is compiled separately, what should the compiler assume?

– For adjustable size arrays, it assumes the array parameters are unit-stride

– At the call site, sectioning could have been applied

adj( A[1:m:2, 1:n:2], B[1:m:2, 1:n:2], m/2, n/2)

– Compiler generates pack/unpack (compress/decompress) into/from temporary unit-stride array

tmpA[1:m/2,1:n/2] = A[1:m:2,1:n:2]

tmpB[1:m/2,1:n/2] = B[1:m:2,1:n:2]

adj( tmpA, tmpB, m/2, n/2)

A[1:m:2,1:n:2] = tmpA[1:m/2,1:n/2]

B[1:m:2,1:n:2] = tmpB[1:m/2,1:n/2]

– No sectioning? Just pass refs to A and B
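The pack/call/unpack sequence can be mimicked by hand in C. The sketch below is a one-dimensional analogue under stated assumptions: `add_unit` plays the role of the separately compiled routine that expects contiguous data, and `add_strided` (a hypothetical name) does what the compiler would generate at the call site. Unlike the slide, it copies back only the array that was modified.

```c
#include <assert.h>
#include <stdlib.h>

/* Kernel written for contiguous (unit-stride) data, like adj(Y, Z, M, N). */
void add_unit(float *y, const float *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += z[i];
}

/* Call-site wrapper for a strided section: pack, call, unpack.
   Returns 0 on success, -1 if the temporaries cannot be allocated. */
int add_strided(float *a, const float *b, size_t n, size_t stride)
{
    if (n == 0)
        return 0;

    float *tmp_a = malloc(n * sizeof *tmp_a);
    float *tmp_b = malloc(n * sizeof *tmp_b);
    if (!tmp_a || !tmp_b) {
        free(tmp_a);
        free(tmp_b);
        return -1;
    }

    for (size_t i = 0; i < n; i++) {   /* pack (gather) into unit-stride temporaries */
        tmp_a[i] = a[i * stride];
        tmp_b[i] = b[i * stride];
    }
    add_unit(tmp_a, tmp_b, n);         /* the kernel sees contiguous data */
    for (size_t i = 0; i < n; i++)     /* unpack (scatter) the updated array */
        a[i * stride] = tmp_a[i];

    free(tmp_a);
    free(tmp_b);
    return 0;
}
```

The two copies are the price of keeping the kernel unit-stride, so for small sections the packing overhead can outweigh the gain from vectorization.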




Adjustable Size Arrays Alignment

Question 2: Alignment of Y and Z

– Y and Z could be unaligned

• Depends on alignments of A/tmpA and B/tmpB

– Separate compilation, no information from other files

– Ensure aligned allocation using

• !dec$ attribute align

• -align array64byte

– Compiler should allocate tmpA, tmpB with same alignment as A,B

– Tell the compiler using

• !dec$ vector aligned (per loop, for all arrays in loop)

• !dec$ assume_aligned Y:64, Z:64 (before loop, for each array)




Fortran Assumed Shape Array Parameter

Assumed shape arrays as parameters

subroutine ash(A, B, C)

real, intent(out), dimension(:) :: A

real, intent(in), dimension(:) :: B

real, intent(in), dimension(:) :: C

A = B + C

return

end

No information is passed explicitly by the programmer

– Implicit interface (dope vector) for extent, stride info

– Populated by the compiler, passed from caller to callee

Can have any stride

– Compiler does not generate packing/unpacking at call site

Same 2 questions: Stride and alignment.




Assumed Shape Array Vectorization

Any stride is possible for each of the 3 arrays

– Multiversion code to check for stride at runtime

– How many versions? There are 2^3=8 combinations:

• unitstride(A) & unitstride(B) & unitstride(C)

• unitstride(A) & unitstride(B) & !unitstride(C)

• unitstride(A) & !unitstride(B) & unitstride(C)

• ...

• !unitstride(A) & !unitstride(B) & !unitstride(C)

– Compiler generates 2 versions:

• Ver1: All arrays are unitstride

• Ver2: At least 1 array is non-unitstride

– Version 1 can be vectorized using vmovaps/vloadunpack (alignment)

– Version 2 can be vectorized using vgather
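The two-version scheme can be written out by hand in C as a rough analogue. Here the strides are passed explicitly (in elements), standing in for what the Fortran dope vector carries. The function name `add_multiversion` and its return code are illustrative assumptions, and which instructions a compiler actually emits for each branch is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Computes a[i*sa] = b[i*sb] + c[i*sc] for i in [0, n); strides are in elements.
   Returns which version ran: 1 = all unit stride, 2 = at least one non-unit stride. */
int add_multiversion(float *a, ptrdiff_t sa,
                     const float *b, ptrdiff_t sb,
                     const float *c, ptrdiff_t sc, size_t n)
{
    if (sa == 1 && sb == 1 && sc == 1) {
        /* Version 1: contiguous accesses, a candidate for plain vector loads/stores. */
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
        return 1;
    }
    /* Version 2: general strides, a candidate for gather/scatter. */
    for (size_t i = 0; i < n; i++)
        a[(ptrdiff_t)i * sa] = b[(ptrdiff_t)i * sb] + c[(ptrdiff_t)i * sc];
    return 2;
}
```

One runtime test replaces eight specialised loops: the common all-unit-stride case gets the fast path, and every other combination shares the general one.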




Assumed Shape Array Alignment

Each array can have arbitrary alignment

– User should help compiler with alignment assumptions (as before)

– Without user help, the compiler generates

• A peel loop that iterates until one array is aligned

o Preferred array to align is the one we store into (i.e., A)

o Still (N-1) arrays could be unaligned

• A multiversion code that checks alignment of B (2nd array)

• No further multiversioning for array C (too deep version tree)

if( A,B,C all unit stride )
    Peel loop until A is aligned (uses vscatter for A)
    if( B is aligned )
        [al64] A = [al64] B + C //Version 1a
    else
        [al64] A = B + C //Version 1b
    endif
else
    A = B + C //Version 2
endif
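The unit-stride branch of this scheme (peel, then versions 1a and 1b) can be approximated by hand in C. The sketch assumes GCC or Clang for `__builtin_assume_aligned`; the function name `add_with_peel`, the 64-byte `ALIGN` constant, and the returned peel count are choices made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN 64  /* bytes; matches the [al64] annotation in the pseudocode */

/* a[i] = b[i] + c[i] with a scalar peel loop until the store target is aligned.
   Returns the number of peeled iterations so the split is observable. */
size_t add_with_peel(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;

    /* Peel: scalar iterations until a + i sits on an ALIGN-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) % ALIGN) != 0) {
        a[i] = b[i] + c[i];
        i++;
    }
    size_t peeled = i;
    size_t rest = n - i;
    if (rest == 0)
        return peeled;

    /* After peeling, the store target is aligned by construction. */
    float *pa = __builtin_assume_aligned(a + i, ALIGN);
    const float *pc = c + i;  /* alignment of c is never checked */

    if (((uintptr_t)(b + i) % ALIGN) == 0) {
        /* Version 1a: aligned stores to a and aligned loads from b. */
        const float *pb = __builtin_assume_aligned(b + i, ALIGN);
        for (size_t k = 0; k < rest; k++)
            pa[k] = pb[k] + pc[k];
    } else {
        /* Version 1b: only the stores to a are known to be aligned. */
        const float *pb = b + i;
        for (size_t k = 0; k < rest; k++)
            pa[k] = pb[k] + pc[k];
    }
    return peeled;
}
```

With 4-byte floats the peel loop runs at most 15 iterations, and stopping the versioning at B keeps the number of generated loop bodies small.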

Date post:	12-Jan-2020