Intel® Many Integrated Core Architecture
Software & Services Group, Developer Relations Division
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Optimization Notice
Xeon Phi Programming Basics
Dmitry Petunin
Lead Technical Consultant, Intel
Contents
• General concept
• Programming models
– Native execution
– Automatic offload
– Explicit offload
– Implicit offload
• Parallelization
• Vectorization
• Prefetching
• MPI programming
• Intel Parallel Studio XE and Intel Cluster Studio XE tools
Single source code for multicore and MIC architectures
Support for multicore and massive parallelism
Intel® Cluster Studio XE* Distributed Performance
Intel® Parallel Studio XE* Advanced Performance
Intel® Trace Analyzer and Collector
Intel® MPI Library
Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
Intel® C/C++ and Fortran Compilers w/OpenMP
Intel® MKL, Intel® Cilk Plus, Intel® TBB Library, Intel® IPP Library
Intel® Parallel Studio XE
Performance. Scalability
Many-Core Hosted: Native MIC Programming
• Enabled by the -mmic compiler switch
• Fully supported by compiler vectorization, Intel® MKL, OpenMP*, Intel® TBB, Intel® Cilk Plus, Intel® MPI, …
– No Intel® Integrated Performance Primitives library yet
• Might be an option for some applications:
– Needs to fit into memory!
– Should be highly parallel code
– Serial parts are slower on MIC than on host!
– Limited access to external environment like I/O
– Native MIC file system exists in memory only!
– NFS allows external I/O but …
Options for Offloading Application Code
• Intel Composer XE 2011 for MIC supports three models:
– Automatic offload
o Use MKL routines that offload their computation to Xeon Phi
– Offload pragmas
o Only trigger offload when a MIC device is present
o Safely ignored by non-MIC compilers
– Offload keywords
o Only trigger offload when a MIC device is present
o Language extensions; need conditional compilation to be ignored
• Offloading and parallelism are orthogonal
– Offloading only transfers control to the MIC devices
– Parallelism needs to be exploited by a second model (e.g. OpenMP*)
Automatic offload
void foo() /* Intel® Math Kernel Library */ {
float *A, *B, *C; /* Matrices */
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}
Automatic offload with Intel® Math Kernel Library (Intel® MKL)
Targets: Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor
Implicit automatic offloading requires no code changes; simply link with the offload MKL library.
Intel® MKL is applicable to multicore and many-core programming.
Explicit offload
Heterogeneous Compiler – Offload using Explicit Copies – Modifier Example
float reduction(float *data, int numberOf)
{
float ret = 0.f;
#pragma offload target(mic) in(data:length(numberOf))
{
#pragma omp parallel for reduction(+:ret)
for (int i=0; i < numberOf; ++i)
ret += data[i];
}
return ret;
}
Note: copies numberOf elements to the coprocessor, not numberOf*sizeof(float) bytes – the compiler knows data’s type
Heterogeneous Compiler – Offload using Explicit Copies – Data Movement
• Default treatment of in/out variables in a #pragma offload statement
– At the start of an offload:
o Space is allocated on the coprocessor
o in variables are transferred to the coprocessor
– At the end of an offload:
o out variables are transferred from the coprocessor
o Space for both types (as well as inout) is deallocated on the coprocessor
[Figure: Host and MIC. For #pragma offload inout(pA:length(n)) {...} the numbered steps are: 1 Allocate, 2 Copy over, 3 pA (on the MIC), 4 Copy back, 5 Free.]
Heterogeneous Compiler – Offload using Explicit Copies
C/C++ | Syntax | Semantics
Offload pragma | #pragma offload <clauses> <statement block> | Allow next statement block to execute on Intel® MIC Architecture or host CPU
Keyword for variable & function definitions | __attribute__((target(mic))) | Compile function for, or allocate variable on, both CPU and Intel® MIC Architecture
Entire blocks of code | #pragma offload_attribute(push, target(mic)) … #pragma offload_attribute(pop) | Mark entire files or large blocks of code for generation on both host CPU and Intel® MIC Architecture
Data transfer | #pragma offload_transfer target(mic) | Initiates asynchronous data transfer, or initiates and completes synchronous data transfer
Heterogeneous Compiler – Offload using Explicit Copies
Fortran | Syntax | Semantics
Offload directive | !dir$ omp offload <clause> <OpenMP construct> | Execute next OpenMP* parallel construct on Intel® MIC Architecture
Offload directive | !dir$ offload <clauses> <statement> | Execute next statement (function call) on Intel® MIC Architecture
Keyword for variable/function definitions | !dir$ attributes offload:<MIC> :: <rtn-name> | Compile function or variable for CPU and Intel® MIC Architecture
Data transfer | !dir$ offload_transfer target(mic) | Initiates asynchronous data transfer, or initiates and completes synchronous data transfer
Heterogeneous Compiler – Offload using Explicit Copies – Clauses
Clauses | Syntax | Semantics
Target specification | target( name[:card_number] ) | Where to run construct
Conditional offload | if (condition) | Boolean expression
Inputs | in(var-list [modifiers]) | Copy from host to coprocessor
Outputs | out(var-list [modifiers]) | Copy from coprocessor to host
Inputs & outputs | inout(var-list [modifiers]) | Copy host to coprocessor and back when offload completes
Non-copied data | nocopy(var-list [modifiers]) | Data is local to target
Async. offload | signal(signal-slot) | Trigger async offload
Async. offload | wait(signal-slot) | Wait for completion
Variables and pointers restricted to scalars, structs of scalars, and arrays of scalars
Heterogeneous Compiler – Offload using Explicit Copies – Modifiers
Modifiers | Syntax | Semantics
Specify pointer length | length(element-count-expr) | Copy N elements of the pointer’s type
Control pointer memory allocation | alloc_if ( condition ) | Allocate memory to hold data referenced by pointer if condition is TRUE
Control freeing of pointer memory | free_if ( condition ) | Free memory used by pointer if condition is TRUE
Control target data alignment | align ( expression ) | Specify minimum memory alignment on target
Variables and pointers restricted to scalars, structs of scalars, and arrays of scalars
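The alloc_if and free_if modifiers can keep a buffer resident on the coprocessor between two offloads. The sketch below is a minimal illustration under stated assumptions: the function name sum_two_passes is hypothetical, the pragmas are guarded by __INTEL_OFFLOAD so that other compilers build a host-only version, and card 0 is assumed so that both offloads reach the same device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: two offloads over the same buffer.
 * Without an offload-enabled Intel compiler, both loops run on the host. */
float sum_two_passes(float *data, int n)
{
    float first = 0.0f, second = 0.0f;
    assert(data != NULL && n > 0);

    /* Pass 1: allocate on the coprocessor, copy in, do not free on exit. */
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic:0) in(data : length(n) alloc_if(1) free_if(0)) inout(first)
#endif
    {
        for (int i = 0; i < n; ++i)
            first += data[i];
    }

    /* Pass 2: reuse the resident buffer without copying, free it on exit. */
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic:0) nocopy(data : length(n) alloc_if(0) free_if(1)) inout(second)
#endif
    {
        for (int i = 0; i < n; ++i)
            second += data[i] * 2.0f;
    }

    return first + second;
}
```

The design choice here is to pay for the host-to-coprocessor copy once and let the second offload skip it.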
Implicit offload
Heterogeneous Compiler – Offload using Implicit Copies
• Section of memory maintained at the same virtual address on both the host and Intel® MIC Architecture coprocessor
• Reserving same address range on both devices allows
– Seamless sharing of complex pointer-containing data structures
– Elimination of user marshaling and data management
– Use of simple language extensions to C/C++
[Figure: the C/C++ executable on the host (host memory) and the offload code on Intel® MIC (KN* memory) share a region at the same address range.]
Heterogeneous Compiler Offload using Implicit Copies
• When “shared” memory is synchronized
– Automatically done around offloads (so memory is only synchronized on entry to, or exit from, an offload call)
– Only modified data is transferred between CPU and coprocessor
• Dynamic memory you wish to share must be allocated with special functions: _Offload_shared_malloc, _Offload_shared_aligned_malloc, _Offload_shared_free, _Offload_shared_aligned_free
• Allows transfer of C++ objects
– Pointers are no longer an issue when they point to “shared” data
• Well-known methods can be used to synchronize access to shared data and prevent data races within offloaded code
– E.g., locks, critical sections, etc.
This model is integrated with the Intel® Cilk™ Plus parallel extensions
Note: Not supported on Fortran - available for C/C++ only
Heterogeneous Compiler – Implicit: Offloading using _Cilk_offload Example
// Shared variable declaration for pi
_Cilk_shared float pi;
// Shared function declaration for compute_pi
_Cilk_shared void compute_pi(int count)
{
    int i;
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++)
    {
        float t = (float)((i + 0.5f) / count);
        pi += 4.0f / (1.0f + t * t);
    }
}
void findpi()
{
    int count = 10000;
    // Initialize shared global variables
    pi = 0.0f;
    // Compute pi on target
    _Cilk_offload compute_pi(count);
    pi /= count;
}
Heterogeneous Compiler – Keyword _Cilk_shared for Data/Functions
What | Syntax | Semantics
Function | int _Cilk_shared f(int x) { return x+1; } | Versions generated for both CPU and card; may be called from either side
Global | _Cilk_shared int x = 0; | Visible on both sides
File/Function static | static _Cilk_shared int x; | Visible on both sides, only to code within the file/function
Class | class _Cilk_shared x {…}; | Class methods, members, and operators are available on both sides
Pointer to shared data | int _Cilk_shared *p; | p is local (not shared), can point to shared data
A shared pointer | int *_Cilk_shared p; | p is shared; should only point at shared data
Entire blocks of code | #pragma offload_attribute(push, _Cilk_shared) … #pragma offload_attribute(pop) | Mark entire files or large blocks of code _Cilk_shared using this pragma
Heterogeneous Compiler – Implicit: Offloading using _Cilk_offload
Feature | Example | Description
Offloading a function call | x = _Cilk_offload func(y); | func executes on coprocessor if possible
Offloading a function call | x = _Cilk_offload_to (card_number) func(y); | func must execute on specified coprocessor
Offloading asynchronously | x = _Cilk_spawn _Cilk_offload func(y); | Non-blocking offload
Offload a parallel for-loop | _Cilk_offload _Cilk_for(i=0; i<N; i++) { a[i] = b[i] + c[i]; } | Loop executes in parallel on target. The loop is implicitly outlined as a function call.
Heterogeneous Compiler – Command-line Options
Offload-specific arguments to the Intel® Compiler:
• Generate host+coprocessor code (by default only host code is generated): -offload-build (Deprecated; offload is default)
• Produce a report of offload data transfers at compile time (not runtime): -opt-report-phase:offload
• Add Intel® MIC Architecture compiler switches: -offload-copts:"switches"
• Add Intel® MIC Architecture archiver switches: -offload-aropts:"switches"
• Add Intel® MIC Architecture linker switches: -offload-ldopts:"switches"
Example: icc -g -O2 -mkl -offload-build -offload-copts="-g -O3" -offload-ldopts="-L/opt/intel/composerxe_mic/mkl/lib/mic" foo.c
Offload examples
Example 1: Using MKL for offloading LAPACK and BLAS routines
int main(void) {
    // initialize variables …
    #pragma offload target(mic) in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) in(B:length(matrix_elements)) \
        inout(C:length(matrix_elements))
    {
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
              &beta, C, &N);
    }
    // … continue code
}
Sgemm performs C=beta*C+alpha*A*B, transa and transb regulate the transposition of A and B and the Ns define the sizes of the matrices (see documentation). C is input and output, all others are input only.
MKL will automatically make optimal use of MIC
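To make the formula C = beta*C + alpha*A*B concrete, here is an unoptimized reference sketch. It is not the MKL implementation: sgemm_ref is a hypothetical name, and the sketch assumes square n x n matrices in column-major storage with no transposition.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version of C = alpha*A*B + beta*C for square n x n matrices,
 * column-major storage, no transposition (illustration only). */
void sgemm_ref(int n, float alpha, const float *A, const float *B,
               float beta, float *C)
{
    assert(n > 0 && A != NULL && B != NULL && C != NULL);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += A[i + k * n] * B[k + j * n];   /* (A*B)(i,j) */
            C[i + j * n] = alpha * acc + beta * C[i + j * n];
        }
    }
}
```

C is read and written, which is why the offload pragma above lists it as inout while A and B are in only.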
Example 2: Simultaneous computation on host and accelerator
When using a plain #pragma offload, the host blocks until the offloaded region or function completes. To obtain maximum performance, keep the host working while the offload computes.
[Figure: Host and Target timelines. The host runs Prework, then host and target each compute the workload in parallel, then the host runs Postwork.]
Example 2: Simultaneous computation on host and accelerator - OpenMP
// Function is generated for both MIC and CPU
double __attribute__((target(mic))) myworkload(double input){
    // do something useful here
    return result;
}
int main(void){
    //…. Initialize variables
    // Create two threads in an OpenMP sections environment
    #pragma omp parallel sections
    {
        // One thread executes the offload code on MIC
        #pragma omp section
        {
            #pragma offload target(mic)
            result1 = myworkload(input1);
        }
        // The other thread executes the same function on the host
        #pragma omp section
        result2 = myworkload(input2);
    }
}
Example 2: Simultaneous computation on host and accelerator - Cilk
// Function is generated for both MIC and CPU
_Cilk_shared double myworkload(double input){
    // do something useful here
    return result;
}
int main() {
    // One thread is spawned and executes the offload code on MIC
    result1 = _Cilk_spawn _Cilk_offload myworkload(input1);
    // The host executes the same function and waits
    result2 = myworkload(input2);
    _Cilk_sync;
}
Vectorization
Data Parallelism of Intel® Processors (2)
AVX: vector size 256 bit; data types: 32 and 64 bit float; VL: 4, 8
Intel® MIC: vector size 512 bit; data types: 32 and 64 bit integer, 32 and 64 bit float; VL: 8, 16
[Figure: element-wise operation Xi◦Yi. The AVX register (bits 0 to 255) holds X1…X8; the Intel® MIC register (bits 0 to 511) holds X1…X16. Illustrations: Xi, Yi & results 32 bit float.]
Vectorization of Code
• Transform sequential code to exploit data parallel capabilities (SIMD) of Intel processors
– Manually by explicit syntax
– Automatically by tools like a compiler
for(i = 0; i <= MAX;i++)
c[i] = a[i] + b[i];
[Figure: scalar addition c[i] = a[i] + b[i], one element at a time, versus a single SIMD addition over eight elements: a[i..i+7] + b[i..i+7] = c[i..i+7].]
Vectorization by Compiler
Input: C/C++/FORTRAN source code
Express/expose vector parallelism:
• Array Notation
• Elemental Function
• SIMD pragma
• Vectorization Hints (ivdep/vector pragmas)
• Fully Automatic Analysis
Vector part of Intel® Cilk™ Plus extension
Vectorizer: map vector parallelism to vector ISA, optimize and code gen for Intel® SSE, Intel® AVX, Intel® MIC
Vectorizer makes retargeting easy!
Vectorization Report
Provides details on vectorization success & failure:
Linux*, Mac OS* X: -vec-report<n>, Windows*: /Qvec-report<n>
n Diagnostic Messages
0 Tells the vectorizer to report no diagnostic information. Useful for turning off reporting in case it was enabled on command line earlier.
1 Tells the vectorizer to report on vectorized loops. [default if n missing]
2 Tells the vectorizer to report on vectorized and non-vectorized loops.
3 Tells the vectorizer to report on vectorized and non-vectorized loops and any proven or assumed data dependences.
4 Tells the vectorizer to report on non-vectorized loops.
5 Tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized.
6 Tells the vectorizer to use greater detail when reporting on vectorized and non-vectorized loops and any proven or assumed data dependences.
X To be done: Even more details in next compiler releases !
Vectorization Report Sample
Example:
35: subroutine fd( y )
36: integer :: i
37: real, dimension(10), intent(inout) :: y
38: do i=2,10
39: y(i) = y(i-1) + 1
40: end do
41: end subroutine fd
novec.f90(38): (col. 3) remark: loop was not vectorized: existence of vector dependence.
novec.f90(39): (col. 5) remark: vector dependence: proven FLOW dependence between y line 39, and y line 39.
novec.f90(38:3-38:3):VEC:MAIN_: loop was not vectorized: existence of vector dependence
Additional details about loop transformations, in-lining, versioning, etc. are reported by compiler switch -opt-report
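Why the flow dependence blocks vectorization can be seen by simulating it. The sketch below is an illustration in C, not compiler output: prefix_sequential follows the loop's sequential semantics, while the hypothetical prefix_naive_simd4 mimics a 4-wide vector step that performs all loads of a block before any of its stores.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential semantics of y(i) = y(i-1) + 1: every iteration reads the
 * value written by the previous one (loop-carried flow dependence). */
void prefix_sequential(float *y, int n)
{
    assert(y != NULL && n > 0);
    for (int i = 1; i < n; ++i)
        y[i] = y[i - 1] + 1.0f;
}

/* What an illegal 4-wide vectorization would compute: within a block,
 * all loads happen before any store, so stale values are read. */
void prefix_naive_simd4(float *y, int n)
{
    assert(y != NULL && n > 0);
    for (int i = 1; i < n; i += 4) {
        float tmp[4];
        int w = (n - i < 4) ? n - i : 4;   /* width of the last block */
        for (int k = 0; k < w; ++k)
            tmp[k] = y[i + k - 1] + 1.0f;  /* vector load + add */
        for (int k = 0; k < w; ++k)
            y[i + k] = tmp[k];             /* vector store */
    }
}
```

Starting from an all-zero array, the two functions produce different results, which is exactly what the compiler must rule out before it vectorizes.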
Array Sections
• Array Section Notation
<array base> [ <lower bound> : <length> [: <stride>] ]
[ <lower bound> : <length> [: <stride>] ].....
• Note that the second field is a length,
– not an upper bound as in Fortran [lower bound : upper bound]
A[:] // All elements of vector A
B[2:6] // Elements 2 to 7 of vector B
D[0:3:2] // Elements 0,2,4 of vector D
E[0:3][0:4] // 12 elements from E[0][0] to E[2][3]
Example: for float B[10] with indices 0 to 9, B[2:6] = … assigns to B[2] through B[7].
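The lower bound : length : stride convention can be spelled out as an ordinary C loop. This is a sketch of the indexing rule only; copy_section is a hypothetical helper, not part of Cilk Plus.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the section src[lower : length : stride] into dst[0 : length].
 * Element k of the section is src[lower + k * stride]. */
void copy_section(const int *src, int lower, int length, int stride, int *dst)
{
    assert(src != NULL && dst != NULL && length >= 0 && stride != 0);
    for (int k = 0; k < length; ++k)
        dst[k] = src[lower + k * stride];
}
```

With an array holding its own indices, the section 2:6 yields 2 through 7, and 0:3:2 yields 0, 2, 4.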
What is Elemental Function?
• Write a function for one element.
• Add __declspec(vector) to get vector code for it.
__declspec(vector)
float foo(float a, float b,
          float c, float d) {
    return a * b + c * d;
}
• and obtain
vmulps ymm0, ymm0, ymm1
vmulps ymm2, ymm2, ymm3
vaddps ymm0, ymm0, ymm2
ret
• Call it from auto-vec or SIMD loop
for(i=0;i<n;i++){
    A[i] = foo(B[i], C[i],
               D[i], E[i]);
}
• Call it from Array Notation
A[:] = foo(B[:], C[:], D[:], E[:]);
• Call it from Elemental Function
__declspec(vector)
float bar(float a, float b,
          float c, float d){
    return sinf(foo(a,b,c,d));
}
• Call scalar version from scalar code
e = foo(a, b, c, d);
Elemental Function: Uniform/Linear clauses
• Why do we need them?
– Because “vector” loads and stores of IA chips are optimized for accessing immediately next elements in memory (e.g., [v]movups).
• They are most useful when consumed in the address computation.
• __declspec(vector) void foo(float *a, int i);
– a is a vector of pointers
– i is a vector of integers
– a[i] becomes gather/scatter.
• __declspec(vector(uniform(a))) void foo(float *a, int i);
– a is a pointer
– i is a vector of integers
– a[i] becomes gather/scatter.
• __declspec(vector(linear(i))) void foo(float *a, int i);
– a is a vector of pointers
– i is a sequence of integers [i, i+1, i+2…]
– a[i] becomes gather/scatter.
• __declspec(vector(uniform(a),linear(i))) void foo(float *a, int i);
– a is a pointer
– i is a sequence of integers [i, i+1, i+2…]
– a[i] is a unit-stride load/store ([v]movups).
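The same uniform/linear idea exists in standard OpenMP 4.0 as declare simd. The sketch below swaps in that spelling instead of __declspec(vector) so that it builds with any C99 compiler; scale_elem and scale_all are hypothetical names, and without an OpenMP SIMD option (for example -fopenmp-simd with GCC) the pragmas are ignored and the code runs as plain scalar C.

```c
#include <assert.h>
#include <stddef.h>

/* a is the same pointer for all lanes (uniform); i advances by 1 from
 * lane to lane (linear), so a[i] can be a unit-stride access. */
#pragma omp declare simd uniform(a) linear(i : 1)
void scale_elem(float *a, int i)
{
    a[i] *= 2.0f;
}

/* Calls the elemental function from a SIMD loop. */
void scale_all(float *a, int n)
{
    assert(a != NULL && n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        scale_elem(a, i);
}
```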
SIMD Pragma: definition
• Top-level
– #pragma simd
– !DIR$ SIMD
• Attached clauses to describe semantics / aid code generation
– vectorlength(VL)/vectorlengthfor(TYPE)
– private/firstprivate/lastprivate(var1[, var2, …])
– reduction(oper1:var1[, …][, oper2:var2[, …]])
– linear(var1[:step1][, var2[:step2], …])
– [no]assert
SIMD Pragma: simple examples
void foo(int *A, int N, int n){
    int i;
    #pragma simd vectorlength(4)
    for (i=0; i<n; i++){
        A[i] = A[i] + A[i-N];
    }
}
• #pragma simd is not applicable in general if “0 < N < n”, but vectorization is still possible if N isn’t too small (N ≥ 4 for vectorlength(4)).
float sum(float *A, int n){
    int i; float x = 0;
    #pragma simd reduction(+:x)
    for (i=0; i<n; i++){
        x = x + A[i]*2;
    }
    return x;
}
• Tell compiler “x” has sum-reduction semantics
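For compilers without #pragma simd, a sum reduction can be written with the OpenMP 4.0 simd construct, which takes the same reduction clause. This is a sketch with a swapped-in standard pragma; sum_simd is a hypothetical name, and the pragma is simply ignored when OpenMP SIMD support is not enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of n floats. The reduction clause tells the compiler that x may be
 * accumulated in per-lane partial sums that are combined at the end. */
float sum_simd(const float *A, int n)
{
    float x = 0.0f;
    assert(A != NULL && n >= 0);
    #pragma omp simd reduction(+:x)
    for (int i = 0; i < n; ++i)
        x += A[i];
    return x;
}
```

Because the partial sums are combined in a different order than in scalar code, floating-point results can differ in the last bits for general inputs.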
Compiler has to assume the worst case the language/flag allow.
float *A;
void vectorize() {
    int i;
    for (i = 0; i < 102400; i++) {
        A[i] *= 2.0f;
    }
}
Loop Body:
• Load of A
• Load of A[i]
• Multiply with 2.0f
• Store of A[i]
3) Recompile with -ansi-alias
• icc -vec-report1 -ansi-alias test1.c
• test1.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
4) Change “float *A” to “float *restrict A”.
• icc -vec-report1 test1a.c
• test1a.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
5) Add “#pragma ivdep” to the loop.
• icc -vec-report1 test1b.c
• test1b.c(5): (col. 3) remark: LOOP WAS VECTORIZED.
Q: Will the store modify A? A: Maybe. A “NO” is needed to make vectorization legal.
Wait, we aren’t done yet
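Another common way to remove the question "can the store to A[i] change the pointer A?" is to copy the global pointer into a local variable before the loop. This is a sketch of that alternative, not one of the numbered steps; vectorize_local is a hypothetical name, and whether a given compiler then vectorizes the loop is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

float *A;  /* global pointer, as in the example above */

/* The address of the local copy a is never taken, so no store through
 * a[i] can modify a itself; the pointer is loaded from A only once. */
void vectorize_local(size_t n)
{
    float *a = A;
    assert(a != NULL);
    for (size_t i = 0; i < n; ++i)
        a[i] *= 2.0f;
}
```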
Does FORTRAN have an aliasing issue?
• Standard rule is in favor of compiler optimization. In plain English:
• Two storage locations with different names won’t overlap unless both are read-only.
SUBROUTINE FOO(A,B,N)
  REAL A(*), B(*)
  DO I=1, N
    A(I) = B(I)+1
  ENDDO
END
• Compiler still needs to do memory disambiguation (or data dependence analysis)
SUBROUTINE FOO(A,M,N1,N2)
  REAL A(*)
  DO I=N1, N2
    A(I) = A(I-M)+1
  ENDDO
END
6) ifort ftest1.f -vec-report2
7) Add !DIR$ IVDEP, then: ifort ftest1a.f -vec-report2
Writing Explicit Vector Code with Intel® Cilk™ Plus
float *A;
void vectorize() {
    for (int i = 0; i < 102400; i++) {
        A[i] *= 2.0f;
    }
}
8) Using SIMD Pragma
float *A;
void vectorize() {
    #pragma simd vectorlength(4)
    for (int i = 0; i < 102400; i++) {
        A[i] *= 2.0f;
    }
}
9) Using Array Notation
float *A;
void vectorize() {
    A[0:102400] *= 2.0f;
}
10) Using Elemental Function
float *A;
__declspec(noinline)
__declspec(vector(uniform(p), linear(i)))
void mul(float *p, int i){
    p[i] *= 2.0f;
}
void vectorize() {
    for (int i = 0; i < 102400; i++) {
        mul(A, i);
    }
}
Memory Accesses and Alignment
Memory Access Patterns
• Unit-stride: A[i], A[i+1], A[i+2], A[i+3], …
• Strided (special form of gather/scatter): A[2*i], A[2*(i+1)], A[2*(i+2)], …
• Gather/Scatter: A[B[i]], A[B[i+1]], A[B[i+2]], …
If you write in Cilk™ Plus Array Notation, access patterns are obvious at a glance: unit-stride means A[:] or A[lb:#elems], which helps you think more clearly.
Alignment Optimization (unit-stride accesses A[i], A[i+1], A[i+2], A[i+3], …)
• Aligned: Addr % SIZE == 0
• Alignment unknown: Addr % SIZE == ???
• Misaligned: Addr % SIZE != 0
SIZE: 64B for Xeon Phi™, 32B for AVX1/2, 16B for SSE4.2 and below
Align your data AND tell the compiler
• Good array data alignment for
– Pentium 4 to Core i7: 16B
– AVX: 32B
– MIC: 64B
• Data alignment directive (64B example)
– C/C++ Windows: __declspec(align(64)) float A[1000];
– C/C++ Linux/MacOS: float A[1000] __attribute__((aligned(64)));
– Fortran: !DIR$ ATTRIBUTES ALIGN: 64:: A
• Aligned malloc
– _aligned_malloc()
– _mm_malloc()
• Data alignment assertion (64B example)
– C/C++: __assume_aligned(p,64);
– Fortran: !DIR$ ASSUME_ALIGNED A(1):64
• Multiple of good number
– __assume(n%16==0)
• Aligned loop assertion
– C/C++: #pragma vector aligned
– Fortran: !DIR$ VECTOR ALIGNED
Align your data AND tell the compiler!!
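A portable way to get 64-byte aligned heap memory is C11 aligned_alloc, used here as a stand-in for _mm_malloc or _aligned_malloc. This is a minimal sketch; alloc_floats_64 and is_aligned_64 are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for n floats on a 64-byte boundary. C11 requires the
 * size passed to aligned_alloc to be a multiple of the alignment, so the
 * byte count is rounded up. Release the memory with free(). */
float *alloc_floats_64(size_t n)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 63u) / 64u * 64u;
    return aligned_alloc(64, rounded);
}

/* Checks the condition Addr % SIZE == 0 for SIZE = 64. */
int is_aligned_64(const void *p)
{
    return ((uintptr_t)p % 64u) == 0;
}
```

Allocating aligned memory is only half of the advice above: the compiler still has to be told, for example with __assume_aligned or #pragma vector aligned.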
Fixed-size Array Sections
Short vector coding
#define VLEN 4
for(i=0;i<N;i+=VLEN){
    A[i:VLEN]= B[i:VLEN]+C[i:VLEN];
    D[i:VLEN]= E[i:VLEN]+A[i:VLEN];
}
Similar C loop
for(i=0;i<N;i+=VLEN){
    for(j=0;j<VLEN;j++) A[i+j]=B[i+j]+C[i+j];
    for(j=0;j<VLEN;j++) D[i+j]=E[i+j]+A[i+j];
}
Long vector coding
A[0:N]=B[0:N]+C[0:N];
D[0:N]=E[0:N]+A[0:N];
This is visually appealing, but may not be high performing.
Similar to C loops:
for(i=0;i<N;i++){ A[i]=B[i]+C[i]; }
for(i=0;i<N;i++){ D[i]=E[i]+A[i]; }
Use short-vector coding if you have data reuse between statements and N is big.
Alignment and Module Data Known Sized Arrays
Example: Global arrays declared in modules with known size.
module mymod
!dir$ attributes align:64 :: a
!dir$ attributes align:64 :: b
real (kind=8) :: a(1000), b(1000)
end module mymod
subroutine add_them()
use mymod
implicit none
! array syntax shown, could also be explicit loop
!...No explicit directive needed to say that A and B
! are aligned, the USE brings that information
a = a + b
end subroutine add_them
This saves coding effort AND
improves performance!
Alignment and Module Data Allocatable Arrays
Example 8.2: Global allocatable arrays declared in modules, but allocated elsewhere.
module mymod
real, allocatable :: a(:), b(:)
end module mymod
subroutine add_them()
use mymod
implicit none
!dec$ vector aligned
a = a + b
end subroutine add_them
Currently you cannot use !dir$ attributes align:64 here: it is not safe to assume that the actual allocation site will use an aligned allocation.
The plan is to let you write the “attribute” syntax for 64-byte alignment cases (ETA: Feb 2013).
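The same concern exists in C: a pointer to dynamically allocated memory carries no alignment guarantee unless the allocation site provides one. The sketch below assumes a C11 library with aligned_alloc; the helper names alloc_aligned_doubles, is_aligned64 and the C version of add_them are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* bytes; matches the align:64 used on the slides */

/* Allocate n doubles on a 64-byte boundary. C11 aligned_alloc expects
   the size to be a multiple of the alignment, so the byte count is
   rounded up. Release the memory with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    return aligned_alloc(ALIGNMENT, rounded);
}

/* Returns 1 if p sits on a 64-byte boundary, 0 otherwise. */
int is_aligned64(const void *p)
{
    return ((uintptr_t)p % ALIGNMENT) == 0;
}

/* C analogue of add_them(): verify the alignment promise before any
   "vector aligned" style assertion would be relied upon. */
void add_them(size_t n, double *a, const double *b)
{
    assert(is_aligned64(a) && is_aligned64(b));
    for (size_t i = 0; i < n; i++) a[i] += b[i];
}
```

Checking alignment at the point of use is cheap and catches the case where the allocation site and the compute routine disagree.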
A Family of Parallel Programming Models Developer Choice
• Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism
– Open sourced
– Also an Intel product
• Intel® Threading Building Blocks: widely used C++ template library for parallelism
– Open sourced
– Also an Intel product
• Domain-Specific Libraries
– Intel® Integrated Performance Primitives
– Intel® Math Kernel Library
• Established Standards
– Message Passing Interface (MPI)
– OpenMP*
– Coarray Fortran
– OpenCL*
• Research and Development
– Intel® Concurrent Collections
– Offload Extensions
– Intel® SPMD Parallel Compiler
Choice of high-performance parallel programming models
Applicable to Multicore and Many-core Programming
Cilk Tasking – Very Simple
/* Matrix Transpose */
cilk_for (int i = 0; i < n; i++)
    cilk_for (int j = 0; j < n; j++)
        b[j][i] = a[i][j];

int fib (int n) {
    if (n < 2) return 1;
    else {
        int x = cilk_spawn fib(n-1);
        int y = cilk_spawn fib(n-2);
        cilk_sync;
        return x + y;
    }
}
• A “composable” model for thread parallelism
– Programming in tasks, not threads
– Don’t ask: “How many cores are available?”
• Very closely follows the serial execution semantics; for deterministic code it is, in fact, the same
– Easy testing
– Easy debugging
• For reductions and critical regions, additional “hyper-objects” (“reducers”) are available
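The claim about serial semantics can be checked with what is usually called the serial elision: define the Cilk keywords away and the program becomes ordinary C. The sketch below assumes a compiler without Cilk support; the two macros are only needed in that case and must not be defined when a Cilk-enabled compiler is used.

```c
#include <assert.h>

/* Serial elision: with the keywords defined as nothing, each spawn is a
   normal call and each sync is an empty statement. */
#define cilk_spawn
#define cilk_sync

int fib(int n)
{
    assert(n >= 0);
    if (n < 2) return 1;
    int x = cilk_spawn fib(n - 1);
    int y = cilk_spawn fib(n - 2);
    cilk_sync;
    return x + y;
}
```

Because the function is deterministic, the serial elision returns the same value as a parallel run would, which is what makes testing and debugging in serial mode meaningful.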
OpenMP* Support
• Intel® Compilers (both C++ and Fortran) are fully compliant with OpenMP* 3.1
– See http://www.openmp.org/ for standard, tutorials etc
• Includes OpenMP Tasking
• Many Intel-specific control mechanisms for thread mapping, scheduler control, memory allocation, thread-private variable implementation etc
• Support for OpenMP 4.0 being added
– Standard still being worked on
– Many features will be added to new compiler release in 2013
• The compiler’s automatic parallelization maps onto the Intel OpenMP runtime system for thread management
OpenMP 4.0 (beta2 released Q1/13)
• Portable SIMD construct
– Execute iterations of the following loop in SIMD chunks: #pragma omp simd [clause [[,] clause] …]
– Not the same as Intel’s pragma SIMD but …
• SIMD function declaration prefix: #pragma omp declare simd [clause [[,] clause] …]
– Builds a vector version of the function, to be called from a “SIMD” loop
– Very much the same as Intel’s elemental functions
• Extended affinity support
– E.g. via the environment variables OMP_PLACES and OMP_PROC_BIND
– Similarly powerful to Intel’s KMP_AFFINITY
• FORTRAN 2003 support
• User defined reductions
• Taskgroups and a lot more …
• Support for offload/accelerator (TR1, not in spec yet)
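A minimal sketch of the two SIMD features in C follows. The function names scale_add and saxpy are illustrative. If OpenMP is not enabled at compile time, the pragmas are ignored and the code simply runs as a serial loop, so the result is the same either way.

```c
#include <assert.h>
#include <stddef.h>

/* Ask the compiler to also build a vector version of this function,
   callable from a SIMD loop. */
#pragma omp declare simd
float scale_add(float a, float x, float y)
{
    return a * x + y;
}

/* y[i] = a * x[i] + y[i], with the loop marked for SIMD execution. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = scale_add(a, x[i], y[i]);
}
```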
Prefetching Basics
Compiler prefetching is turned on by default for the Intel® Xeon Phi™ coprocessor
• At option levels -O2 and above
• Prefetches issued for all regular memory accesses inside loops
• Prefetching for memory accesses expressed using load/store intrinsics
• Maximal loop prefetching
Use the compiler reporting options to see detailed diagnostics of prefetching per loop
• -opt-report-phase hlo -opt-report 3
Use compiler option -no-opt-prefetch to turn off compiler prefetching
Loop-Prefetches
• Prefetches issued targeting memory access in a future iteration of the loop
• Targeting regular array accesses
• Pointer accesses similar to array accesses where the address can be predicted in advance
• Supports address calculations that involve:
– Affine functions of surrounding loop indices
– More complicated access-patterns that require additional instructions inside the loop
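What a loop prefetch amounts to can be sketched by hand. The example below is not compiler output; it uses the GCC/Clang builtin __builtin_prefetch as a stand-in, and the function name gather_sum and the distance PF_DIST are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PF_DIST 16  /* prefetch distance in iterations; a tuning assumption */

/* Sum data[idx[i]] for i in [0, n): an access pattern whose address
   needs an extra load (idx[...]) to compute. The prefetch targets the
   element that will be needed PF_DIST iterations from now. */
double gather_sum(const double *data, const int *idx, size_t n)
{
    assert(data != NULL && idx != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)  /* stay inside idx[] */
            __builtin_prefetch(&data[idx[i + PF_DIST]], 0, 3);
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is identical with or without it; only the timing changes.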
Indirect Prefetch Example
#pragma simd reduction(+:fxtmp,fytmp,fztmp) vectorlengthfor(double)
for (int jj = 0; jj < jnum; jj++) {
    int j, sbindex, jtype; double factor_lj;
    j = jlist[jj]; sbindex = sbmask(j); …
    _mm_prefetch((char *) &xx[jlist[jj+1+16]], 1);
    …
    _mm_prefetch((char *) &xx[jlist[jj+8+16]], 1);
    _mm_prefetch((char *) &ff[jlist[jj+1+16]], 5);
    …
    _mm_prefetch((char *) &ff[jlist[jj+8+16]], 5);
    double delx = xtmp - xx[j].x; double dely = ytmp - xx[j].y;
    double delz = ztmp - xx[j].z; double rsq = delx*delx + dely*dely + delz*delz;
    if (rsq < global_cutsq) {
        double r2inv = 1.0/rsq; double r6inv = r2inv*r2inv*r2inv;
        double forcelj = r6inv * (global_lj1*r6inv - global_lj2);
        double fpair = factor_lj*forcelj*r2inv;
        fxtmp += delx*fpair; fytmp += dely*fpair; fztmp += delz*fpair;
        if (NEWTON_PAIR || j < nlocal) {
            ff[j].x -= delx*fpair; ff[j].y -= dely*fpair; ff[j].z -= delz*fpair; }
    }
}
Interactions with the Hardware Prefetcher
• Intel® Xeon Phi™ coprocessor has a hardware L2 prefetcher that is enabled by default
• If software prefetches are doing a good job, then hardware prefetching does not kick in
– In several workloads (such as stream), maximal software prefetching gives the best performance
• Any references not prefetched by compiler may get prefetched by hardware
Directive Support for Loop Prefetches
• Directive to turn off prefetching for a particular loop
– #pragma noprefetch
– CDEC$ noprefetch
– Specify before a loop, affects only that loop, does not affect inner loops
• Directive to turn off prefetching for a particular routine
– #pragma noprefetch
– CDEC$ noprefetch
– Specify at the top of the routine as the first executable statement
• Prefetch pragma support for C loops
– #pragma prefetch var:hint:distance
• Prefetch directive support for Fortran loops
– CDEC$ prefetch var:hint:distance
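As a sketch of where the C pragma goes, the loop below requests prefetches for one array using the var:hint:distance form from the slide. The function name scale and the hint and distance numbers are placeholders, not recommendations; consult the compiler documentation for what each hint value means. Compilers that do not know this Intel-specific pragma ignore it, so the code still builds and runs.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = k * src[i]. The pragmas apply to the loop that follows and
   name the array to prefetch, a hint, and a distance in iterations. */
void scale(size_t n, float k, const float *src, float *dst)
{
    assert(src != NULL && dst != NULL);
    #pragma prefetch src:1:64
    #pragma prefetch src:0:8
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```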
Prefetch Distance Tuning
Option: -opt-prefetch-distance=n1[,n2]
• n1 specifies the distance for first-level prefetches into L2
• n2 specifies prefetch distance for second-level prefetches from L2 to L1 (use n2 <= n1)
• -opt-prefetch-distance=64,32
• -opt-prefetch-distance=24
o Use first-level distance=24, second-level distance to be determined by compiler
• -opt-prefetch-distance=0,4
o Turns off all first-level prefetches, second-level uses distance=4 (Use this if you want to rely on hardware prefetching to L2, and compiler prefetching from L2 to L1)
• -opt-prefetch-distance=16,0
o First-level distance=16, no second-level prefetches issued
• If option not specified, all distances determined by compiler
Where is my application…
Spending Time?
• Focus tuning on functions taking time
• See call stacks
• See time on source
Wasting Time?
• See cache misses on your source
• See functions sorted by # of cache misses
Waiting Too Long?
• See locks by wait time
• Red/Green for CPU utilization during wait
Intel® VTune™ Amplifier XE Performance Profiler
• Windows & Linux
• Low overhead
• No special recompiles
“We improved the performance of the latest run 3 fold. We wouldn't have found the problem without something like Intel® VTune™ Amplifier XE.”
Claire Cates, Principal Developer, SAS Institute Inc.
Intel® VTune™ Amplifier XE
Advanced Profiling for Scalable Multicore Performance
Intel VTune Amplifier XE supports Intel MIC Architecture
VTune Amplifier XE works on the MIC architecture through its remote functionality and requires a host
Intel® MPI Library support for the Intel® Xeon Phi™ Coprocessor
MPI+Offload
• MPI ranks on Intel® Xeon® processors (only)
• All messages into/out of processors
• Offload models used to accelerate MPI ranks
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® MIC Architecture
• Homogeneous network of hybrid nodes:
[Figure: four Xeon hosts, each with a MIC coprocessor, connected by a network; labels: Data, MPI, Offload]
MPI+Offload: How to run
• Compile your code with the offload directives
$ mpiifort -openmp test.f -o test.offload
• Create your hosts file (Xeon only)
$ cat hosts
node0
node1
• Run your application (Xeon only)
$ mpirun -f hosts -n 2 ./test.offload
Many-core Hosted (Native)
• MPI ranks on Intel® Xeon Phi™ coprocessors (only)
• All messages into/out of Intel® Xeon Phi™ coprocessors
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
• Programmed as homogeneous network of many-core CPUs:
[Figure: four Xeon hosts, each with a MIC coprocessor, connected by a network; labels: Data, MPI]
Many-core Hosted (Native): How to run
• Compile your code for Intel® Xeon Phi™ Coprocessor
$ mpiifort -mmic test.f -o test.mic
• Copy the MIC-enabled executable to the coprocessor
$ scp test.mic mic0:/home/user/test.mic
• Create your hosts file (MIC only)
$ cat hosts
mic0
mic1
• Let the library know you’re planning on running on MIC
$ export I_MPI_MIC=1
• Run your application (from the Xeon)
$ mpirun -f hosts -n 4 /home/user/test.mic
Symmetric
• MPI ranks on Intel® Xeon Phi™ coprocessors and Intel® Xeon® processors
• Messages to/from any core
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
• Programmed as heterogeneous network of homogeneous nodes:
[Figure: four Xeon hosts, each with a MIC coprocessor, connected by a network; labels: Data, MPI]
Symmetric: How to run
• Compile your code for the Intel® Xeon node
$ mpiifort test.f -o test
• And for Intel® Xeon Phi™ Coprocessor
$ mpiifort -mmic test.f -o test.mic
• Copy the MIC-enabled executable to the coprocessor
$ scp test.mic mic0:/home/user/test.mic
• Create your hosts file (Xeon+MIC)
$ cat hosts
node0
mic0
mic1
• Let the library know you’re planning on running on MIC
$ export I_MPI_MIC=1
• Run your application (from the Xeon)
$ mpirun -f hosts -n 4 /home/user/test.mic
Scale Performance: Tune Hybrid Cluster MPI and Thread Performance
Tune cross-node MPI (Intel® Trace Analyzer and Collector)
• Visualize MPI behavior
• Evaluate MPI load balancing
• Find communication hotspots
Tune single node threading (Intel® VTune™ Amplifier XE)
• Visualize thread behavior
• Evaluate thread load balancing
• Find thread sync. bottlenecks
Key Features
• Low Overhead
• Catch all MPI events
• Powerful configuration mechanism
– Filters, settings, features
• Automatic source-code references
• Instrumentation
– Rich API
– Binary instrumentation (itcpin)
– Compiler based (-tcollect)
• Fail-safe version
• Comparison of multiple profiles
• Idealizer
• MPI Correctness Checking
How to use Intel® Trace Analyzer and Collector
• Step 1: Run your binary and create a tracefile. Run the binary for a representative amount of time (to reduce initialization influences) on representative data (no corner cases).
$ mpirun -trace -n 2 ./test
– Alternative 1: Generate an instrumented binary via re-linking
$ mpiicc -trace test.c -o test.inst
$ mpirun -n 2 ./test.inst
– Alternative 2: Instrument the binary
$ itcpin --run -- ./test
• Step 2: To view the generated trace file, start the GUI:
$ traceanalyzer &
Intel® Trace Analyzer and Collector
Compare the event timelines of two communication profiles
• Blue = computation
• Red = communication
Chart showing how the MPI processes interact
Chart
A Chart is a numerical or graphical diagram
Timelines: Event Timeline
• Get impression of program structure
• Display functions, messages and collective operations for each process/thread along time-axis
• Retrieval of detailed event information
Timelines: Qualitative Timeline
• Find patterns and irregularities
• Display attributes of functions, messages or collective operations as they occur for any process/thread
• Retrieval of detailed event information
Timelines: Quantitative Timeline
• Get impression on parallelism and load balance
• Show for every function how many threads/processes are currently executing it
Profiles: Flat Function Profile
• Statistics about functions
Profiles: Call-Tree and Call-Graph
• Function statistics including calling hierarchy
– Tree: call-stack
– Graph: calling dependencies
Communication Profiles
• Statistics about point-to-point or collective communication
• Generic matrix supports grouping by several attributes in each dimension: Sender, Receiver, Data volume per msg, Tag, Communicator, Type
• Available attributes: Count, Bytes transferred, Time, Transfer rate
SCIF – low level communication interface
SCIF Symmetric Communications Interface
• The SCIF driver provides a reliable connection-based messaging layer, as well as functionality which abstracts RMA operations.
• The SCIF API is documented in the Intel® MIC SCIF API Reference Manual for User Mode Linux and the Intel® MIC SCIF API Reference Manual for Kernel Mode Linux.
• A common API is exposed for use in both user mode (ring 3) and kernel mode (ring 0), with the exception of slight differences in signature, and several functions which are only available in user mode, and several only available in kernel mode.
SCIF - Nodes and Ports
• SCIF node: physical endpoint in the SCIF network. The host and MIC Architecture devices are SCIF nodes (all cores under a single OS). Each node has a node identifier assigned at boot time. Node IDs are generally based on PCIe discovery order. The host node is always assigned ID 0.
• SCIF port: logical destination on a SCIF node. Within a node, a SCIF port on that node may be referred to by its number, a 16-bit integer, similar to an IP port.
• SCIF port identifier: is unique across a SCIF network, comprising both a node identifier and a local port number (analogous to a complete TCP/IP address with port)
SCIF – Opening a connection
Connecting endpoint (node i)      Listening endpoint (node j)
epdi=scif_open()                  epdj=scif_open()
scif_bind(epdi,pm)                scif_bind(epdj,pn)
                                  scif_listen(epdj,qLen)
scif_connect(epdi,(Nj,pn))        scif_accept(*nepd,peer)
SCIF - Messaging
• After the connection has been established, messages may be exchanged:
• int scif_send(scif_epd_t epd,void* msg,int len,int flags);
• int scif_recv(scif_epd_t epd,void* msg,int len,int flags);
• Messages may be up to 2^31-1 bytes long
• Message layer queues are relatively short, though
• For bulk data transfer use the SCIF RMA functionality
• The connection is bi-directional
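Because the message queues are short, a caller may want to handle the case where a send accepts only part of a buffer. The sketch below is a generic retry loop, not SCIF code: it assumes a send function that returns the number of bytes accepted (or a value <= 0 on failure), uses a plain int in place of scif_epd_t so that it builds without scif.h, and includes a mock transport (mock_send) purely for demonstration.

```c
#include <assert.h>
#include <stddef.h>

/* Function-pointer type with the same shape as the scif_send prototype
   on the slide; the endpoint type is simplified to int (assumption). */
typedef int (*send_fn)(int epd, void *msg, int len, int flags);

/* Keep calling `send` until all `len` bytes are accepted.
   Returns len on success, or the first return value <= 0 on failure. */
int send_all(send_fn send, int epd, void *msg, int len, int flags)
{
    assert(send != NULL && len >= 0);
    char *p = msg;
    int done = 0;
    while (done < len) {
        int n = send(epd, p + done, len - done, flags);
        if (n <= 0) return n;  /* error or no progress: give up */
        done += n;
    }
    return done;
}

/* Mock transport: accepts at most 3 bytes per call and counts calls. */
static int mock_calls = 0;
int mock_send(int epd, void *msg, int len, int flags)
{
    (void)epd; (void)msg; (void)flags;
    mock_calls++;
    return len < 3 ? len : 3;
}
```

For large payloads the slide's advice still applies: use the RMA functions rather than looping over the messaging layer.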
Extension SCIF
SCIF
• Host-KNC communications backbone
• Provides communication capabilities within a single platform (node)
• Low latency, low overhead communication
• Provides a uniform API for communication across the host’s PCI Express* system buses
• Directly exposes DMA capabilities for high bandwidth transfer
• Fully exposed (/usr/include/scif.h)
SCIF – Connections Functionality
• scif_epd_t scif_open(void);
Create a new endpoint
• int scif_bind(scif_epd_t epd, uint16_t pn);
Bind Endpoint to port
• int scif_listen(scif_epd_t epd, int backlog);
Set endpoint to listen
• int scif_connect(scif_epd_t epd, struct scif_portID* dst);
Request connection to listening endpoint
• int scif_accept (scif_epd_t epd, struct scif_portID* peer, scif_epd_t* newepd, int flags);
Accepts the connection request
• int scif_close (scif_epd_t epd);
Closes the connection
SCIF – RMA Operations
• off_t scif_register(scif_epd_t epd, void* addr, size_t len, off_t offset, int prot_flags, int map_flags);
Expose a range of address space for control by a remote process.
The memory must be registered before it can be mapped for RMA
• int scif_unregister(scif_epd_t epd, off_t offset, size_t len);
Revoke registration/mapping
• int scif_readfrom(scif_epd_t epd, off_t loffset, size_t len, off_t roffset, int rma_flags);
Read from mapped address range
• int scif_writeto(scif_epd_t epd, off_t loffset, size_t len, off_t roffset, int rma_flags);
Write to mapped address range
Conclusion
• Programming Xeon™ and Xeon™ Phi requires the same skills and knowledge
• Parallelization and vectorization are the key to efficient programs on Xeon™ Phi
• Automatic offload in MKL is the simplest way to use Xeon™ Phi
• Pay attention to optimizing use of the memory hierarchy. Use prefetching
• The Intel Parallel Studio XE 2013 and Intel Cluster Studio XE 2013 tools significantly extend what a developer can do
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2012, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Example 3: Asynchronous Transfer & Double Buffering
• Overlap computation and communication
• Generalizes to data domain decomposition
[Figure: host and target timelines. The host sends data blocks to the target, which processes one block while the next is being transferred. Stages: pre-work, iteration 0, iteration 1, …, iteration n, iteration n+1 (last iteration).]
Example 3 – Using Signals
// This does nothing except allocate an array
#pragma offload_transfer target(mic:0) \
    nocopy(in1:length(cnt) alloc_if(1) free_if(0))
// Start an asynchronous transfer, tracking signal in1
#pragma offload_transfer target(mic:0) \
    in(in1:length(cnt) alloc_if(0) free_if(0)) signal(in1)
// Start once the completion of the transfer of in1 is signaled
#pragma offload target(mic:0) nocopy(in1) wait(in1) \
    out(res1:length(cnt) alloc_if(0) free_if(0))
// This does nothing except free an array
#pragma offload_transfer target(mic:0) \
    nocopy(in1:length(cnt) alloc_if(0) free_if(1))
Example 3: Double Buffering I
int main(int argc, char* argv[]) {
    // … Allocate & initialize in1, res1,
    // … in2, res2 on host
    // Only allocate arrays on card with alloc_if(1), no transfer
    #pragma offload_transfer target(mic:0) in(cnt) \
        nocopy(in1, res1, in2, res2 : length(cnt) \
        alloc_if(1) free_if(0))
    do_async_in();
    // Only free arrays on card with free_if(1), no transfer
    #pragma offload_transfer target(mic:0) \
        nocopy(in1, res1, in2, res2 : length(cnt) \
        alloc_if(0) free_if(1))
    return 0;
}
Example 3: Double Buffering II
void do_async_in() {
    float lsum;
    int i;
    lsum = 0.0f;
    // Send buffer in1
    #pragma offload_transfer target(mic:0) in(in1 : length(cnt) \
        alloc_if(0) free_if(0)) signal(in1)
    for (i = 0; i < iter; i++) {
        if (i % 2 == 0) {
            // Send buffer in2
            #pragma offload_transfer target(mic:0) if(i != iter - 1) \
                in(in2 : length(cnt) alloc_if(0) free_if(0)) signal(in2)
            // Once in1 is ready (signal!) process in1
            #pragma offload target(mic:0) nocopy(in1) wait(in1) \
                out(res1 : length(cnt) alloc_if(0) free_if(0))
            {
                compute(in1, res1);
            }
            lsum = lsum + sum_array(res1);
        } else {…
Example 3: Double Buffering III
…} else {
#pragma offload_transfer target(mic:0) if(i != iter - 1) \
in(in1 : length(cnt) alloc_if(0) free_if(0)) signal(in1)
#pragma offload target(mic:0) nocopy(in2) wait(in2) \
out(res2 : length(cnt) alloc_if(0) free_if(0))
{
compute(in2, res2);
}
lsum = lsum + sum_array(res2);
}
} // for
async_in_sum = lsum / (float)iter;
} // do_async_in()
Send buffer in1
Once in2 is ready (signal!) process in2
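The buffer rotation in Example 3 can be modeled without any offload pragmas. The plain C sketch below only reproduces the schedule (which buffer is computed on and which one is sent ahead); `db_step` and `double_buffer_step` are hypothetical names that do not appear in the original sample.

```c
#include <assert.h>

/* Hypothetical model of one loop iteration in do_async_in():
   which buffer is computed on, and which one is sent ahead of time. */
typedef struct {
    int compute_buf;   /* 1 -> in1/res1, 2 -> in2/res2 */
    int prefetch_buf;  /* buffer whose transfer is started, 0 if none */
} db_step;

db_step double_buffer_step(int i, int iter)
{
    db_step s;
    assert(iter > 0 && i >= 0 && i < iter);
    /* Even iterations work on buffer 1, odd iterations on buffer 2. */
    s.compute_buf = (i % 2 == 0) ? 1 : 2;
    /* The other buffer is sent meanwhile, except in the last iteration
       (mirrors the if(i != iter - 1) clause on offload_transfer). */
    s.prefetch_buf = (i != iter - 1) ? 3 - s.compute_buf : 0;
    return s;
}
```

With iter = 3, iteration 0 computes on buffer 1 while buffer 2 is in flight, iteration 1 swaps the roles, and iteration 2 computes on buffer 1 without starting another transfer.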
Intel Confidential
Fortran Vectorization
Specific focus on
– Unit stride vectorization
– Copy in/out with temp array usage
– Treatment of user-provided alignment statements
– Multiversion code: defer decision until runtime
  – For alignment
  – For stride
Some examples here:
– Example 1: Adjustable size arrays as routine parameters
– Example 2: Assumed shape arrays as routine parameters
– More examples and details are linked from the BKM pages
Fortran adjustable size array parameters
Adjustable size arrays as parameters
subroutine adj(Y, Z, M, N)
real, intent(inout), dimension(M, N) :: Y
real, intent(in), dimension(M, N) :: Z
integer, intent(in) :: M, N
Y = Y + Z
return
end
2 Questions for vectorization:
– Stride and alignment
Adjustable Size Arrays Vectorization
Question 1: Stride of Y and Z
– When adj.f90 is compiled separately, what should the compiler assume?
– For adjustable size arrays, it assumes the array parameters are unit-stride
– At the call site, sectioning could have been applied
call adj(A(1:m:2, 1:n:2), B(1:m:2, 1:n:2), m/2, n/2)
– Compiler generates pack/unpack (compress/decompress) into/from temporary unit-stride arrays
tmpA(1:m/2, 1:n/2) = A(1:m:2, 1:n:2)
tmpB(1:m/2, 1:n/2) = B(1:m:2, 1:n:2)
call adj(tmpA, tmpB, m/2, n/2)
A(1:m:2, 1:n:2) = tmpA(1:m/2, 1:n/2)
B(1:m:2, 1:n:2) = tmpB(1:m/2, 1:n/2)
– No sectioning? Just pass refs to A and B
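The copy-in/copy-out scheme can be sketched in C for the one-dimensional case. This is an illustration under stated assumptions, not the code the Fortran compiler emits: `add_unit_stride` stands in for the separately compiled routine, `add_with_copy_in_out` is a hypothetical caller-side wrapper, and only the updated array is copied back.

```c
#include <assert.h>
#include <stdlib.h>

/* Unit-stride kernel, standing in for the separately compiled adj(). */
static void add_unit_stride(float *y, const float *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += z[i];
}

/* Hypothetical caller-side wrapper: y[i*stride] += z[i*stride], i = 0..n-1.
   Returns 0 on success, -1 if the temporaries cannot be allocated. */
int add_with_copy_in_out(float *y, const float *z, size_t n, size_t stride)
{
    assert(stride >= 1);
    if (n == 0)
        return 0;
    if (stride == 1) {                 /* no sectioning: pass the arrays directly */
        add_unit_stride(y, z, n);
        return 0;
    }
    float *tmp_y = malloc(n * sizeof *tmp_y);
    float *tmp_z = malloc(n * sizeof *tmp_z);
    if (!tmp_y || !tmp_z) {
        free(tmp_y);
        free(tmp_z);
        return -1;
    }
    for (size_t i = 0; i < n; i++) {   /* pack (copy in) */
        tmp_y[i] = y[i * stride];
        tmp_z[i] = z[i * stride];
    }
    add_unit_stride(tmp_y, tmp_z, n);  /* kernel only ever sees unit stride */
    for (size_t i = 0; i < n; i++)     /* unpack (copy out) */
        y[i * stride] = tmp_y[i];
    free(tmp_y);
    free(tmp_z);
    return 0;
}
```

The kernel stays simple and vectorizable, and the cost of the strided accesses moves into the pack and unpack loops at the call site.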
Adjustable Size Arrays Alignment
Question 2: Alignment of Y and Z
– Y and Z could be unaligned
  – Depends on alignments of A/tmpA and B/tmpB
– Separate compilation, no information from other files
– Ensure aligned allocation using
  – !dec$ attributes align
  – -align array64byte
– Compiler should allocate tmpA, tmpB with same alignment as A, B
– Tell the compiler using
  – !dec$ vector aligned
    • Per loop, for all arrays in loop
  – !dec$ assume_aligned Y:64, Z:64
    • Before loop, for each array
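As a rough C analogue of the "ensure aligned allocation" step, the sketch below allocates on a 64-byte boundary and checks the result. It does not use the Fortran directives above; `VEC_ALIGN`, `is_aligned64` and `alloc_floats_aligned64` are hypothetical names, and `aligned_alloc` is the standard C11 function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 64u  /* assumed vector alignment in bytes, as on the slide */

/* Nonzero if p lies on a 64-byte boundary. */
int is_aligned64(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}

/* Allocate room for n floats on a 64-byte boundary (release with free()). */
float *alloc_floats_aligned64(size_t n)
{
    assert(n <= (SIZE_MAX - VEC_ALIGN) / sizeof(float));
    size_t bytes = n * sizeof(float);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    bytes = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (bytes == 0)
        bytes = VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, bytes);
}
```

A pointer obtained this way is aligned, but the same pointer offset by one float is not, which is the situation that arises when an array section does not start at the first element.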
Fortran Assumed Shape Array Parameter
Assumed shape arrays as parameters
subroutine ash(A, B, C)
real, intent(out), dimension(:) :: A
real, intent(in), dimension(:) :: B
real, intent(in), dimension(:) :: C
A = B + C
return
end
No information is passed explicitly by the programmer
– Extent and stride information is passed implicitly in an array descriptor (dope vector)
– Populated by the compiler, passed from caller to callee
Can have any stride
– Compiler does not generate packing/unpacking at call site
Same 2 questions: Stride and alignment.
Assumed Shape Array Vectorization
Any stride is possible for each of the 3 arrays
– Multiversion code to check for stride at runtime
– How many versions? There are 2^3=8 combinations:
  – unitstride(A) & unitstride(B) & unitstride(C)
  – unitstride(A) & unitstride(B) & !unitstride(C)
  – unitstride(A) & !unitstride(B) & unitstride(C)
  – ...
  – !unitstride(A) & !unitstride(B) & !unitstride(C)
– Compiler generates 2 versions:
  – Ver1: All arrays are unitstride
  – Ver2: At least 1 array is non-unitstride
– Version 1 can be vectorized using vmovaps/vloadunpack (alignment)
– Version 2 can be vectorized using vgather
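The two-version scheme can be written by hand in C to show the idea. This is a sketch, not compiler output: the strides are passed explicitly instead of through a dope vector, and `add_multiversion` is a hypothetical name that returns the version it executed so the dispatch can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* a[i*sa] = b[i*sb] + c[i*sc] for i = 0..n-1, strides in elements.
   Returns 1 if the all-unit-stride version ran, 2 otherwise. */
int add_multiversion(float *a, ptrdiff_t sa,
                     const float *b, ptrdiff_t sb,
                     const float *c, ptrdiff_t sc, size_t n)
{
    assert(sa != 0);  /* a zero stride on the output would overwrite one element */
    if (sa == 1 && sb == 1 && sc == 1) {
        /* Version 1: contiguous accesses, plain vector loads/stores apply. */
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
        return 1;
    }
    /* Version 2: at least one array is strided, so a vectorizer
       would have to fall back to gather/scatter style accesses. */
    for (size_t i = 0; i < n; i++)
        a[(ptrdiff_t)i * sa] = b[(ptrdiff_t)i * sb] + c[(ptrdiff_t)i * sc];
    return 2;
}
```

A single runtime test replaces the eight possible stride combinations: only the all-unit-stride case gets the fast path, and every other combination shares the general loop.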
Assumed Shape Array Alignment
Each array can have arbitrary alignment
– User should help compiler with alignment assumptions (as before)
– Without user help, the compiler generates
  – A peel loop that iterates until one array is aligned
    • Preferred array to align is the one we store into (i.e., A)
  – Still (N-1) arrays could be unaligned
  – A multiversion code that checks alignment of B (2nd array)
  – No further multiversioning for array C (too deep version tree)
if( A,B,C all unit stride )
    Peel loop until A is aligned (uses vscatter for A)
    if( B is aligned )
        [al64] A = [al64] B + C //Version 1a
    else
        [al64] A = B + C //Version 1b
    endif
else
    A = B + C //Version 2
endif
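The peel length and the choice between Version 1a and 1b follow from simple address arithmetic. The C sketch below assumes 64-byte vector alignment and naturally aligned floats; `peel_count` and `b_aligned_after_peel` are hypothetical helpers, not compiler internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_ALIGN 64u  /* assumed vector alignment in bytes */

/* Number of leading elements to handle one by one until p reaches
   a 64-byte boundary, capped at the array length n. */
size_t peel_count(const float *p, size_t n)
{
    uintptr_t mis = (uintptr_t)p % VEC_ALIGN;
    assert(mis % sizeof(float) == 0);  /* floats assumed naturally aligned */
    size_t peel = mis ? (VEC_ALIGN - mis) / sizeof(float) : 0;
    return peel < n ? peel : n;
}

/* After peeling until a is aligned, is b aligned too?
   1 corresponds to "Version 1a", 0 to "Version 1b". */
int b_aligned_after_peel(const float *a, const float *b, size_t n)
{
    size_t peel = peel_count(a, n);
    return ((uintptr_t)(b + peel) % VEC_ALIGN) == 0;
}
```

Peeling can align only one array. B ends up aligned only if it starts with the same misalignment as A, which is why a runtime check on B is still needed.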