Day 5: Introduction to Parallel Intel® Architectures
Transcript
Page 1

Day 5: Introduction to Parallel Intel® Architectures

Lecture day 5

Ryo Asai, Colfax International — colfaxresearch.com

April 2017

Page 2

Disclaimer

While best efforts have been used in preparing this training, Colfax International makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents and specifically disclaims any implied warranties of merchantability or fitness for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials.

Page 3

Recap: OpenMP

Directives discussed:

▷ omp parallel: create team of threads

▷ parallel for loop and sections

▷ omp task: variables are firstprivate by default

▷ taskwait: synchronization point that waits for completion of the child tasks

▷ reduction < atomic < critical

▷ ordered: execute loop in parallel; ordered block is executed sequentially following the natural loop ordering

▷ taskloop: similar to omp for, but uses the more flexible task mechanism instead of worksharing
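
As a quick refresher, here is a minimal sketch (not from the original slides) that combines several of these directives:

#include <cstdio>
#include <omp.h>

int main() {
  const int n = 1000;
  double sum = 0.0;

  // parallel for with a reduction: each thread accumulates a private
  // partial sum, which is combined at the end of the loop.
  #pragma omp parallel for reduction(+: sum)
  for (int i = 0; i < n; i++)
    sum += 1.0;

  // omp task / taskwait: one thread creates two tasks; taskwait waits
  // for the child tasks to complete before the block continues.
  #pragma omp parallel
  {
    #pragma omp single
    {
      #pragma omp task
      printf("task A (thread %d)\n", omp_get_thread_num());
      #pragma omp task
      printf("task B (thread %d)\n", omp_get_thread_num());
      #pragma omp taskwait
      printf("both tasks done; sum = %.1f\n", sum);
    }
  }
  return 0;
}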

Page 4

Code Modernization

Code Modernization: optimizing software to better utilize features available in modern computer architectures.

Page 5

Colfax Research

http://colfaxresearch.com/

Page 6

§1. Introduction

Page 7

40-year Microprocessor Trend

Source: https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/

Page 8

Parallelism

Task Parallelism – multiple instructions multiple data elements (MIMD)

Data Parallelism – single instruction multiple data elements (SIMD)

Unbounded growth opportunity, but not automatic
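
A minimal sketch (not from the slides) of how the two kinds of parallelism map onto OpenMP constructs:

#include <omp.h>

void scale_and_add(int n, float* A, const float* B) {
  // Task (MIMD) parallelism: multiple threads run different iterations concurrently.
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    A[i] *= 2.0f;

  // Data (SIMD) parallelism: a single vector instruction operates on several elements at once.
  #pragma omp simd
  for (int i = 0; i < n; i++)
    A[i] += B[i];
}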

Page 9

Short Vector Support

Vector instructions – one of the implementations of SIMD (Single Instruction Multiple Data) parallelism.

[Figure: scalar vs. vector instructions — scalar instructions add one pair of elements at a time, while one vector instruction adds an entire vector length of elements at once, e.g., (4, 0, -2, 9) + (1, 3, 8, -7) = (5, 3, 6, 2).]

Page 10

Intel Architectures

Page 11

Intel Architecture

Page 12

Intel Xeon Processors

▷ 1-, 2-, 4-way

▷ General-purpose

▷ Highly parallel (44 cores*)

▷ Resource-rich

▷ Forgiving performance

▷ Theor. ∼ 1.0 TFLOP/s in DP*

▷ Meas. ∼ 154 GB/s bandwidth*

* 2-way Intel Xeon processor, Broadwell architecture (2016), top-of-the-line (e.g., E5-2699 V4)

Page 13

Intel Xeon Phi Processors (2nd Gen)

2nd generation of the Intel Many Integrated Core (MIC) architecture. Specialized platform for demanding computing applications.

▷ Bootable host processor or coprocessor

▷ 3+ TFLOP/s DP

▷ 6+ TFLOP/s SP

▷ Up to 16 GiB MCDRAM

▷ MCDRAM bandwidth ≈5x DDR4

▷ Binary compatible with Intel Xeon

▷ More information

Page 14

Instruction Sets in Intel Architecture

[Figure: timeline (1995–2020) of vector instruction sets in Intel architecture — MMX (64-bit), SSE/SSE2/SSE3/.../SSE4.2 (128-bit), AVX (256-bit), and the 512-bit IMCI (Intel Xeon Phi, KNC) and AVX-512 (KNL).]

Page 15

Increasing Core Count (Intel Xeon)

[Figure: physical core count of Intel Xeon processors by year and microarchitecture, 2005–2016 — NetBurst, Core, Penryn, Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell — growing from 2 to 24 cores.]

Page 16

§2. Vectorization

Page 17

Short Vector Support

Vector instructions – one of the implementations of SIMD (Single Instruction Multiple Data) parallelism.

[Figure: scalar vs. vector instructions — same illustration as on page 9.]

Page 18

Workflow of Vector Computation

Page 19

Explicit Vectorization

Page 20

Intel Intrinsics Guide

https://software.intel.com/sites/landingpage/IntrinsicsGuide

Page 21

Example Explicit Vectorization (AVX-512)

#include <immintrin.h>
// ... //
double *A = (double *) malloc(sizeof(double)*n);
double *B = (double *) malloc(sizeof(double)*n);
// ... //
for (int i = 0; i < n; i += 8) {   // 8 doubles per 512-bit register
  // A[i] += B[i];
  __m512d Avec = _mm512_loadu_pd(&A[i]);
  __m512d Bvec = _mm512_loadu_pd(&B[i]);
  Avec = _mm512_add_pd(Avec, Bvec);
  _mm512_storeu_pd(&A[i], Avec);
}

student@cdt% icpc -xMIC-AVX512 explicit.cc
student@cdt% g++ -mavx512f -mavx512pf -mavx512cd -mavx512er explicit.cc
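
The loop above assumes n is a multiple of the vector length. A hedged sketch (not on the slide) of one common way to handle the remainder is a scalar tail loop:

#include <immintrin.h>

void add_arrays(int n, double* A, const double* B) {
  int i = 0;
  for (; i + 8 <= n; i += 8) {        // full 512-bit vectors: 8 doubles at a time
    __m512d Avec = _mm512_loadu_pd(&A[i]);
    __m512d Bvec = _mm512_loadu_pd(&B[i]);
    _mm512_storeu_pd(&A[i], _mm512_add_pd(Avec, Bvec));
  }
  for (; i < n; i++)                  // scalar tail for the remaining elements
    A[i] += B[i];
}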

Page 22

Detecting Available Instructions

In the OS:

[student@cdt ~]% cat /proc/cpuinfo
...
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm fsgsbase
bogomips        : 5985.17
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
...

In code:

// Intel compiler
// preprocessor macros:

#ifdef __SSE__
  // ...SSE code path
#endif

#ifdef __SSE4_2__
  // ...SSE4.2 code path
#endif

#ifdef __AVX__
  // ...AVX code path
#endif
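
The macros above select a code path at compile time. For a run-time check, GCC and Clang also provide the __builtin_cpu_supports builtin; a small sketch (not from the slides):

#include <cstdio>

int main() {
  // Run-time dispatch: query the CPU for a feature before taking a code path.
  if (__builtin_cpu_supports("avx512f"))
    printf("AVX-512F code path\n");
  else if (__builtin_cpu_supports("avx"))
    printf("AVX code path\n");
  else
    printf("SSE fallback\n");
  return 0;
}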

Page 23

Automatic Vectorization

Page 24

Automatic Vectorization (Intel Compiler)

Intel Compilers have auto vectorization enabled by default:

student@cdt% icpc -xMIC-AVX512 automatic.cc
student@cdt% icpc -S -xMIC-AVX512 automatic.cc   # produce assembly
student@cdt% cat automatic.s                     # default name; change with -o
// ..... //
  vmovups 8(%r14,%rsi,8), %zmm0          #17.5 c1
  vaddpd  8(%rax,%rsi,8), %zmm0, %zmm2   #17.5 c13 stall 2
  vmovupd %zmm2, 8(%r14,%rsi,8)          #17.5 c19 stall 2
// ..... //
student@cdt% icpc -xMIC-AVX512 automatic.cc -qopt-report=5   # produce report
student@cdt% cat automatic.optrpt
// ..... //
LOOP BEGIN at automatic.cc(16,3)
// ..... //
remark #15300: LOOP WAS VECTORIZED
// ..... //
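
The automatic.cc source itself is not reproduced in this transcript; a minimal sketch consistent with the vaddpd on A and B seen in the generated assembly (an assumption, not the original file) could be:

#include <cstdlib>

int main() {
  const int n = 1 << 20;
  double *A = (double*) malloc(sizeof(double)*n);
  double *B = (double*) malloc(sizeof(double)*n);
  // ... initialize A and B ... //
  for (int i = 0; i < n; i++)   // the compiler auto-vectorizes this loop
    A[i] += B[i];
  free(A);
  free(B);
  return 0;
}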

Page 25

Automatic Vectorization (GCC)

Easiest to enable with -O3 flag.

student@cdt% g++ -O3 -mavx512f -mavx512pf -mavx512cd -mavx512er -ffast-math \
%   automatic.cc
student@cdt% g++ -O3 -S -mavx512f -mavx512pf -mavx512cd -mavx512er -ffast-math \
%   -g -fverbose-asm automatic.cc   # produce verbose assembly
student@cdt% cat automatic.s
// ..... //
  .loc 1 17 0 discriminator 2
// ... //
  vaddpd (%rsi,%rdx), %zmm0, %zmm0   # MEM[base: vectp_A.28_89, ....
student@cdt% g++ -O3 -mavx512f -mavx512pf -mavx512cd -mavx512er -ffast-math -g \
%   -fopt-info-vec -fopt-info-vec-missed automatic.cc   # produce verbose report
// ... //
automatic.cc:16:23: note: loop vectorized
automatic.cc:16:23: note: loop peeled for vectorization to enhance alignment
student@cdt% g++ -O3 -mavx512f -mavx512pf -mavx512cd -mavx512er -g \
%   -fopt-info-vec=v.rpt -fopt-info-vec-missed=v.rpt automatic.cc   # report file

Page 26

OpenMP SIMD

OpenMP 4.0 introduced the simd construct. The compiler will try to vectorize the annotated loop.

#pragma omp simd
for (int i = 0; i < n; i++)
  A[i] += B[i];

Combined with parallel for. You may need to set a chunk size that is a multiple of the vector length.

#pragma omp parallel for simd schedule(static,16)
for (int i = 0; i < n; i++)
  A[i] += B[i];

Nested parallel and simd constructs.

#pragma omp parallel for
for (int i = 0; i < n; i++) {
  #pragma omp simd
  for (int j = 0; j < n; j++)
    A[i*n+j] += B[i*n+j];
}

Page 27

Limitations of Auto-Vectorization

Page 28

Limitations of Auto-Vectorization

There are certain limitations on automatic vectorization.

▷ Only for loops are supported. No while loops.

▷ Iteration count must be known at the beginning of the for loop.

▷ The loop cannot contain non-vectorizable operations (e.g., I/O).

▷ All functions must be inlined or declared SIMD-enabled.

▷ No vector dependence.

Any of these could prevent vectorization, but you may be able to find "hints" on what is preventing vectorization in the vectorization reports.
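
For instance, a loop whose trip count is not known when the loop starts cannot be auto-vectorized; a hypothetical sketch (not from the slides):

int scale_positive_prefix(int n, float* A) {
  // The exit condition depends on the data, so the trip count is unknown
  // up front: this while loop cannot be auto-vectorized.
  int i = 0;
  while (i < n && A[i] > 0.0f) {
    A[i] *= 2.0f;
    i++;
  }
  return i;   // number of elements processed
}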

Page 29

SIMD-Enabled Functions

Define function in one file (e.g., library), use in another

// Compiler will produce 3 versions:
#pragma omp declare simd
float my_simple_add(float x1, float x2) {
  return x1 + x2;
}

// May be in a separate file
#pragma omp simd
for (int i = 0; i < N; ++i) {
  output[i] = my_simple_add(inputa[i], inputb[i]);
}

Page 30

Vector Dependence

It is unsafe to vectorize a loop with vector dependence.

// A = {1,2,3,4,5}
for (int i = 1; i < 5; i++)
  A[i] += A[i-1];

[Figure: scalar vs. vector execution of the loop above. The scalar version (4 instructions) updates A[i-1] every iteration and produces the correct prefix sums 1, 3, 6, 10, 15. The vector version (1 instruction) loads all of A[i-1] at i=1 and produces the wrong result 1, 3, 5, 7, 9.]

Page 31

Assumed Vector Dependence

▷ True vector dependence – vectorization impossible:

float *a, *b;
for (int i = 1; i < n; i++)
  a[i] += b[i]*a[i-1];   // dependence on the previous element

▷ Assumed vector dependence – compiler suspects dependence

void mycopy(int n,
            float* a, float* b) {
  for (int i = 0; i < n; i++)
    a[i] = b[i];
}

vega@lyra% icpc -c vdep.cc -qopt-report
vega@lyra% cat vdep.optrpt
...
remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
...

Page 32

Resolving Assumed Dependency

▷ restrict: keyword indicating that there is no pointer aliasing (a C99 keyword; accepted in C++ by the Intel compiler with the -restrict option)

void mycopy(int n,
            float* restrict a,
            float* restrict b) {
  for (int i = 0; i < n; i++)
    a[i] = b[i];
}

vega@lyra% icpc -c vdep.cc -qopt-report \
%   -restrict
vega@lyra% cat vdep.optrpt
...
remark #15304: LOOP WAS VECTORIZED
...

▷ #pragma ivdep: ignores assumed dependency for a loop (Intel Compiler)

void mycopy(int n, float* a, float* b) {
#pragma ivdep
  for (int i = 0; i < n; i++)
    a[i] = b[i];
}
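
A portable alternative (a sketch based on the OpenMP simd construct shown earlier, not on this slide): #pragma omp simd also tells the compiler to vectorize the loop, overriding its assumed dependences.

void mycopy(int n, float* a, float* b) {
  // The programmer asserts that vectorizing this loop is safe.
  #pragma omp simd
  for (int i = 0; i < n; i++)
    a[i] = b[i];
}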

Page 33

§3. Sneak Peek

Page 34

Now what?

I have vectorized and multi-threaded code!

Some people stop here. But even if your application is multi-threaded and vectorized, it may not be optimal. Optimization could unlock more performance for your application.

Example areas for consideration:

▷ Multi-threading
  • Do my threads have enough work?
  • Are my threads independent?
  • Is work distributed properly?

▷ Vectorization
  • Is my data organized well for vectorization?
  • Do I have regular loop patterns?

Page 35

§4. Additional Topic: Working with NUMA

Page 36

NUMA Architectures

NUMA = Non-Uniform Memory Access. Cores have fast access to local memory, slow access to remote memory.

[Figure: Non-Uniform Memory Architecture (NUMA) — CPU 0 with its local memory banks and CPU 1 with its local memory banks.]

Examples:

▷ Multi-socket Intel Xeon processors

▷ Second generation Intel Xeon Phi in sub-NUMA clustering mode

Page 37

Intel Xeon CPU: Memory Organization

▷ Hierarchical cache structure

▷ Two-way processors have NUMA architecture

[Figure: memory organization of a two-way Intel Xeon system. Each package: cores with registers, L1 cache (32 KiB/core, ~4 cycles), L2 cache (256 KiB/core, ~10 cycles), shared LLC (35 MiB/package, ~30 cycles), and DDR4 main memory (up to 1.5 TiB/package, ~200 cycles, ~60 GB/s/package); the two packages are connected by QPI.]

Page 38

KNL Die Organization: Tiles

▷ Up to 36 tiles, each with 2 physical cores (72 total).

▷ Distributed L2 cache across a mesh interconnect.

[Figure: KNL die layout — 36 tiles (72 cores), each tile pairing two cores with a shared L2 cache, connected by a mesh; DDR4 controllers for ≤ 384 GiB system DDR4 at ~90 GB/s, on-package MCDRAM (≤ 16 GiB, ~400 GB/s), and PCIe.]

Page 39

Thread Affinity

Page 40

What is Thread Affinity

▷ OpenMP threads may migrate between cores

▷ Forbid migration — improve locality — increase performance

▷ Affinity patterns "scatter" and "compact" may improve cache sharing, relieve thread contention

Page 41

The KMP_HW_SUBSET Environment Variable

Control the # of cores and # of threads per core:

KMP_HW_SUBSET=[<cores>c,]<threads-per-core>t

vega@lyra-mic0% export KMP_HW_SUBSET=3t   # 3 threads per core
vega@lyra-mic0% ./my-native-app

or

vega@lyra% export MIC_ENV_PREFIX=XEONPHI
vega@lyra% export KMP_HW_SUBSET=1t            # 1 thread per core on host
vega@lyra% export XEONPHI_KMP_HW_SUBSET=2t    # 2 threads per core on Xeon Phi
vega@lyra% ./my-offload-app

Page 42

The KMP_AFFINITY Environment Variable

KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]

modifier:

▷ verbose/nonverbose
▷ respect/norespect
▷ warnings/nowarnings
▷ granularity=core or thread

type:

▷ type=compact, scatter or balanced
▷ type=explicit,proclist=[<proc_list>]
▷ type=disabled or none

The most important argument is type:

▷ compact: place threads as close to each other as possible

▷ scatter: place threads as far from each other as possible

Page 43

OMP_PROC_BIND and OMP_PLACES Variables

Control the binding pattern, including nested parallelism:

OMP_PROC_BIND=type[,type[,...]]

Here type=true, false, spread, close or master. Comma separates settings for different levels of nesting (OMP_NESTED must be enabled).

Control the granularity of binding:

OMP_PLACES=<threads|cores|sockets|(explicit)>
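
A hypothetical usage sketch (the application name is an assumption; the prompt style follows the other examples in these slides):

vega@lyra% export OMP_PLACES=cores             # bind at core granularity
vega@lyra% export OMP_NESTED=true
vega@lyra% export OMP_PROC_BIND=spread,close   # spread the outer team, pack each nested team
vega@lyra% ./my-app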

Page 44

Thread Affinity: Scatter Pattern

Generally beneficial for bandwidth-bound applications.

OMP_NUM_THREADS={1 thread/core} or KMP_HW_SUBSET=1t
KMP_AFFINITY=scatter,granularity=fine

Page 45

Thread Affinity: Compact Pattern

Generally beneficial for compute-bound applications.

OMP_NUM_THREADS={2 (4) threads/core on Xeon (Xeon Phi)}
KMP_AFFINITY=compact,granularity=fine

Page 46

Parallelism and Affinity Interfaces

Intel-specific (in order of priority):

▷ Functions (e.g., kmp_set_affinity())

▷ Compiler arguments (e.g., -par-affinity)

▷ Environment variables (e.g., KMP_AFFINITY)

Defined by the OpenMP standard (in order of priority):

▷ Clauses in pragmas (e.g., proc_bind)

▷ Functions (e.g., omp_set_num_threads())

▷ Environment variables (e.g., OMP_PROC_BIND)
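
For example, the OpenMP-standard proc_bind clause can be set per parallel region (a sketch, not from the slides):

#include <cstdio>
#include <omp.h>

int main() {
  // The clause in the pragma overrides OMP_PROC_BIND for this region only.
  #pragma omp parallel proc_bind(spread)
  {
    printf("thread %d\n", omp_get_thread_num());
  }
  return 0;
}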

Page 47

Impact of Affinity on Bandwidth

[Figure: histogram of memory bandwidth (GB/s, 40–90) vs. number of trials (out of 1000) for the STREAM SCALE benchmark with 40 threads, comparing standard and optimized runs.]

▷ Without affinity: "fortunate" and "unfortunate" runs

▷ With affinity "scatter": consistently good performance

Plot from this paper

Page 48

First-Touch Locality

Page 49

Allocation on First Touch

▷ Memory allocation occurs not during _mm_malloc(), but upon the first write to the buffer ("first touch")

▷ Default NUMA allocation policy is "on first touch"

▷ For better performance in NUMA systems, initialize data with the same parallel pattern as during data usage

float* A = (float*)_mm_malloc(n*m*sizeof(float), 64);

// Initializing from parallel region for better performance
#pragma omp parallel for
for (int i = 0; i < n; i++)
  for (int j = 0; j < m; j++)
    A[i*m + j] = 0.0f;

Page 50

First-Touch Allocation Policy

[Figure: poor vs. good first-touch allocation on a two-socket system (CPU 0 and CPU 1 joined by QPI, NUMA nodes 0 and 1). Poor: serial initialization — for (i=0; i<n; i++) A[i] = 0.0; — touches every VM page of array A from one thread, so all pages land in the memory of CPU 0. Good: four threads each initialize one quarter of A (thread 0: [0, n/4), thread 1: [n/4, n/2), thread 2: [n/2, 3n/4), thread 3: [3n/4, n)), spreading the pages across both NUMA nodes.]

Page 51

Impact of First-Touch Allocation

[Figure: performance in billion values/s (higher is better), vectorized parallel code with private variables vs. parallel initialization (first-touch allocation): Intel Xeon processor E5-2697 V2 — 11.2 vs. 22.1; Intel Xeon Phi coprocessor 7120P (KNC) — 13.9 vs. 13.6; Intel Xeon Phi processor 7210 (KNL) — 20 vs. 20.]

Page 52

Binding to NUMA Nodes with numactl

▷ libnuma – a Linux library for fine-grained control over NUMA policy

▷ numactl – a tool for global NUMA policy control

vega@lyra% numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 65457 MB
node 0 free: 24426 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 65536 MB
node 1 free: 53725 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

vega@lyra% numactl --membind=<nodes> --cpunodebind=<nodes> ./myApplication
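
On a 2nd-generation Xeon Phi in flat memory mode, MCDRAM is exposed as a separate NUMA node (one with no CPUs attached), so the same tool can place an application's memory there. A sketch; the node numbering is an assumption, so check numactl --hardware first:

vega@lyra% numactl --hardware                    # identify which node is MCDRAM (no CPUs listed for it)
vega@lyra% numactl --membind=1 ./myApplication   # assuming node 1 is the MCDRAM node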
