Optimizing Commercial Software for Intel® Xeon Phi™ Coprocessors: Lessons Learned
©2013 Acceleware Ltd. All rights reserved.
Dan Cyca, Chief Technical Officer, Acceleware
Supercomputing Conference, Denver, Colorado, USA, November 17-22, 2013
In My Parallel Universe…
Small to medium-sized seismic companies aren't limited by computational resources when processing seismic data.
Seismic Computing Requirements
[Chart: compute requirements for seismic imaging grow from ~100 GF in 1990 toward ~1 EF by 2015, as algorithms advance from the paraxial wave-equation approximation (Kirchhoff migration, Post-SDM, PreSTM) to the full wave-equation approximation (WEM, RTM, FWI, elastic imaging). Source: Total, 2012]
RTM Overview
[Diagram: the source wavefield is propagated forwards in time, the receiver data are propagated backwards in time, and the source and receiver wavefields are correlated to form the image.]
RTM Introduction
- Finite-difference code
- Compute intensive: 10s of hours per seismic shot
- Large memory footprint: ~100 GB per shot
- Large local storage requirement: ~500 GB per shot
- 10,000s of shots
RTM: Computational Requirements
- An RTM image is made by migrating and then stacking a large number of shots (typically between 10,000 and 100,000)
- Migrating each shot requires two or three 3D wave propagations
- Each shot migration requires large RAM (~100 GB) and temporary disk space (~500 GB)
- Runtime per shot varies from a few minutes (low-frequency isotropic) to several hours (high-frequency anisotropic)
- A typical compute cluster used for RTM has 100s of nodes
In My Parallel Universe…
- Small to medium-sized seismic companies aren't limited by computational resources when processing seismic data
  - We want to make RTM (1 PFlop) available to these companies
- We're delivering parallel software to run RTM on Xeon Phi systems
RTM: Wave Propagation
- Finite-difference time-domain technique
  - 3D stencils
- 3D grid with millions of points
  - Update the entire grid every time step (a minimal sketch follows below)
  - 1000s of time steps
- Memory footprint of 10-100 GB
- Wavefield data from the forward pass are stored to disk to facilitate imaging
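To make that loop structure concrete, here is a minimal, illustrative time-stepping sketch: a 1-D, second-order leapfrog update. The production kernels are 3-D and 8th-order in space (shown later); all names here are hypothetical, not taken from the actual RTM code.

    #include <stddef.h>

    /* Minimal 1-D, 2nd-order leapfrog wave update (illustrative only).
     * The whole grid is updated every time step, for thousands of steps.
     * c2dt2[i] holds c(x)^2 * dt^2 / dx^2 at grid point i. */
    void propagate(float *prev, float *cur, float *next,
                   const float *c2dt2, size_t n, size_t numSteps)
    {
        for (size_t t = 0; t < numSteps; t++) {
            for (size_t i = 1; i + 1 < n; i++) {
                float lap = cur[i-1] - 2.0f*cur[i] + cur[i+1]; /* Laplacian */
                next[i] = 2.0f*cur[i] - prev[i] + c2dt2[i]*lap;
            }
            /* Rotate buffers so `cur` becomes `prev` for the next step. */
            float *tmp = prev; prev = cur; cur = next; next = tmp;
        }
    }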
Parallelizing Single Shots
- The finite-difference grid contains over 200 million cells per volume (2 GB)
- Numerous volumes per shot (Earth model, wavefields and image)
- One shot easily fits in a CPU compute node, but may be too large for a single Xeon Phi
Parallelizing Each Shot: Multiple Cards
- The volume is partitioned into pieces that fit on a single Xeon Phi (see the sketch below)
[Diagram: the volume split across Phi 0, Phi 1 and Phi 2]
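The slides do not show the partitioning code, but an even split of one grid dimension across cards might look like the following sketch; the function name and the choice of axis are illustrative only.

    #include <stddef.h>

    /* Illustrative even split of nx grid planes across numCards
     * coprocessors; card `card` gets planes [*x0, *x1). The first
     * `rem` cards each take one extra plane. */
    void partition_x(size_t nx, size_t numCards, size_t card,
                     size_t *x0, size_t *x1)
    {
        size_t base = nx / numCards;
        size_t rem  = nx % numCards;
        *x0 = card*base + (card < rem ? card : rem);
        *x1 = *x0 + base + (card < rem ? 1 : 0);
    }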
Parallelizing Each Shot: Multiple Cards
- Boundaries must be transferred between partitions
- Transfers can become a bottleneck unless they are done asynchronously with the stencil calculations (see the MPI sketch below)
[Diagram: halo regions exchanged between Phi 0 and Phi 1 while each card continues computing]
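One common way to get this overlap with MPI, sketched here under assumptions (this is not Acceleware's actual implementation, and all buffer and parameter names are illustrative): post non-blocking sends and receives, compute the interior points that need no halo data, then wait and finish the boundary points.

    #include <mpi.h>

    /* Exchange halo planes with left/right neighbours while computing
     * the interior, so transfer time hides behind stencil time. */
    void step_with_overlap(float *sendL, float *sendR,
                           float *recvL, float *recvR,
                           int haloCount, int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[4];
        MPI_Irecv(recvL, haloCount, MPI_FLOAT, left,  0, comm, &reqs[0]);
        MPI_Irecv(recvR, haloCount, MPI_FLOAT, right, 1, comm, &reqs[1]);
        MPI_Isend(sendL, haloCount, MPI_FLOAT, left,  1, comm, &reqs[2]);
        MPI_Isend(sendR, haloCount, MPI_FLOAT, right, 0, comm, &reqs[3]);

        /* ... compute interior stencil points (no halo data needed) ... */

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        /* ... compute boundary points using recvL/recvR halo data ... */
    }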
Parallelizing Each Shot: Within Card
[Diagram: the x/y plane is tiled across cores and threads (Core 0 threads 0-3, Core 1 threads 0-1, ...); z is the remaining dimension]
- Data in x and y are split over cores
- Operations in the z dimension are vectorized (see the kernel on the next slide)
Levels of Parallelism
- Each shot is split over multiple Xeon Phi coprocessors (or Xeon nodes) using MPI
- The partition on each Phi is split over cores using OpenMP
- Operations on each thread are vectorized using the compiler's autovectorizer
Kernel: 8th Order Spatial Derivative

    #pragma omp parallel for
    for(size_t x = xMin; x < xMax; x++)
    {
        for(size_t y = yMin; y < yMax; y++)
        {
            size_t const idx = x*strideX + y*strideY;
    #pragma vector …
            for(size_t z = zMin; z < zMax; z++)
            {
                size_t const i = idx + z;
                pVy[i] = yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY])
                       + yCoeffs[1]*(pV[i-3*strideY]-pV[i+3*strideY])
                       + yCoeffs[2]*(pV[i-2*strideY]-pV[i+2*strideY])
                       + yCoeffs[3]*(pV[i-1*strideY]-pV[i+1*strideY])
                       + yCoeffs[4]*pV[i];
            }
        }
    }

- Triple loop over the three dimensions
- One-dimensional derivative: a simple calculation with large memory bandwidth demands
Tuning OpenMP

    #pragma omp parallel for collapse(2) schedule(static)
    for(size_t x = xMin; x < xMax; x++)
    {
        for(size_t y = yMin; y < yMax; y++)
        {
            size_t const idx = x*strideX + y*strideY;
    #pragma vector …
            for(size_t z = zMin; z < zMax; z++)
            {
                size_t const i = idx + z;
                // Derivative calculations
            }
        }
    }

- Many options are available for OpenMP
  - Tuning is especially important on Phi (mostly because of the high thread count)
- Here we use static loop scheduling, because it has the lowest overhead
  - It is also the most prone to load-balance issues
Tuning OpenMP
- collapse(2) combines two adjacent for loops; here, the X and Y dimensions are combined (e.g. X = 250, Y = 150)
- Work is divided more evenly onto cores when there are more iterations (the arithmetic is checked in the sketch below)
  - 250 iterations on 240 threads (60 cores x 4 threads) means 10 threads do double work while the other threads wait (about 1/2 of the time wasted)
  - 250 x 150 = 37,500 iterations divide much better onto 240 threads (about 1/157 of the time wasted)
- Improved Phi performance by 1.5x!
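The load-balance arithmetic above can be verified with a small helper; this is a hypothetical utility written for illustration, not part of the RTM code.

    #include <stddef.h>

    /* Fraction of time wasted when `iters` equal-cost iterations are
     * statically scheduled over `threads` threads: the slowest thread
     * runs ceil(iters/threads) iterations while the average thread
     * only needs iters/threads. */
    double wasted_fraction(size_t iters, size_t threads)
    {
        size_t slowest = (iters + threads - 1) / threads;  /* ceil */
        double ideal   = (double)iters / (double)threads;
        return 1.0 - ideal / (double)slowest;
    }

    /* wasted_fraction(250, 240)     ~= 0.48  (about half the time wasted)
     * wasted_fraction(250*150, 240) ~= 0.005 (well under 1% wasted)     */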
Tuning Thread Affinity
- We programmatically set affinity with run-dependent logic (a minimal Linux sketch follows below)
- Isolating the various tasks prevents over-subscription of cores
[Diagram: cores 0-61 partitioned between transfer threads, disk I/O threads, propagation threads and OS threads]
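On Linux this kind of pinning can be done with pthread_setaffinity_np; here is a minimal sketch, where the function name and the per-task core-assignment policy are illustrative rather than Acceleware's actual logic.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single core so that transfer, disk-I/O
     * and propagation threads never share a core. Returns 0 on success. */
    static int pin_current_thread_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }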
Tuning Thread Affinity
- Thread affinity settings improved scaling on multiple Phis and on multiple CPU sockets

                                Dual Xeon Phi vs.   Dual Xeon sockets vs.
                                single Phi          single socket
    Without affinity changes    1.3x                1.9x
    With affinity changes       1.9x                1.7x

- Different settings are used for Xeon Phi and Xeon
Tuning Memory Access

    #pragma omp parallel for collapse(2) schedule(static)
    for(size_t x = xMin; x < xMax; x++)
    {
        for(size_t y = yMin; y < yMax; y++)
        {
            size_t const idx = x*strideX + y*strideY;
            __assume(strideX%16==0);
            __assume(strideY%16==0);
            __assume(idx%16==0);
            __assume_aligned(pV, 64);
            __assume_aligned(pVy, 64);
    #pragma vector always assert vecremainder
    #pragma ivdep
    #pragma vector nontemporal (pVy)
            for(size_t z = zMin; z < zMax; z++)
            {
                size_t const i = idx + z;
                pVy[i] = yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY]) + …
            }
        }
    }

- pVy[i] is written once and should not be cached: the nontemporal pragma streams the stores past the cache instead of polluting it with data that will never be re-read
- The __assume hints tell the compiler about the indexing so it knows when it can use aligned reads and writes
- Improved performance by 1.1x on both Xeon and Xeon Phi!
Current Performance Results
- For anisotropic wave propagation, a Xeon Phi coprocessor delivers ~2.3x the performance of a single Xeon E5-2670 CPU
- The same code-base and optimizations are applied to both Xeon and Xeon Phi
About Acceleware
- Professional training
  - Xeon Phi coprocessor optimization
  - OpenCL
  - OpenMP
  - MPI
- High performance consulting
  - Feasibility studies
  - Porting and optimization
  - Algorithm parallelization
- Accelerated software
  - Oil and gas
  - Electromagnetics
Questions? Come visit us in booth #1825!

Head Office
Tel: +1 403.249.9099
Email: [email protected]

Viktoria Kaczur, Senior Account Manager
Tel: +1 403.249.9099 ext. 356
Cell: +1 403.671.4455
Email: [email protected]