DiamondTile Algorithm for High-Performance Wave Modeling
Vadim Levchenko, Anastasia Perepelkina
Keldysh Institute of Applied Mathematics RAS
GTC 2015
FLOPs and Bandwidth Performance Ratio
[Figure: memory bandwidth (GB/s) versus peak performance (TFLOP/s, fp32) for nVidia Maxwell (2014-15), nVidia Kepler (2012-13), Intel CPU (2014), and NEC SX (199x), with guide lines at 4, 0.1, and 0.04 Bytes/Flop.]
RoofLine model
S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52:65–76, 2009.
L. Barba, R. Yokota. How Will the Fast Multipole Method Fare in the Exascale Era? CSE13.
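Wave Modeling Specifics
\[
\frac{\partial^2 F}{\partial t^2} = c^2\left(\frac{\partial^2 F}{\partial x^2} + \frac{\partial^2 F}{\partial y^2} + \frac{\partial^2 F}{\partial z^2}\right) \quad (+\ \text{BCs} + \text{ICs})
\]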
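Finite difference along each axis:
\[
\left.\frac{\partial^2 F}{\partial x^2}\right|_{x_0,y_0,z_0} = \frac{1}{\Delta x^2}\sum_{i=0}^{N_O/2} C_i\left(F|_{x_0+i\Delta x,\,y_0,\,z_0} + F|_{x_0-i\Delta x,\,y_0,\,z_0}\right)
\]
$N_O = 2$ for $\partial^2 F/\partial t^2$; $N_O = 2, 4, 6, \ldots, 14$ for the coordinate axes.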
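A minimal sketch of the symmetric stencil sum, for illustration only. The coefficient table is an assumption: it holds the standard central-difference weights for orders 2 and 4, with the centre weight halved because the sum above counts the i = 0 term twice. The names `COEFFS` and `second_derivative` are hypothetical.

```python
# Assumed coefficients C_i for the symmetric sum; C_0 is half the usual
# centre weight because the i = 0 term adds F(x0) twice.
COEFFS = {
    2: [-1.0, 1.0],
    4: [-5.0 / 4.0, 4.0 / 3.0, -1.0 / 12.0],
}


def second_derivative(f, x0, dx, order):
    """Approximate d2f/dx2 at x0 with an order-`order` symmetric stencil."""
    total = 0.0
    for i, c_i in enumerate(COEFFS[order]):
        total += c_i * (f(x0 + i * dx) + f(x0 - i * dx))
    return total / dx**2


if __name__ == "__main__":
    # The exact second derivative of x**4 at x = 1 is 12.
    for order in (2, 4):
        print(order, second_derivative(lambda x: x**4, 1.0, 0.1, order))
```

The stencil reaches N_O/2 cells to each side, which is the width that later sets the tilt of the tile shapes.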
Per cell, per time step:
- $O = 1 + 3N_O$ FMA operations
- $D = 3 + 3N_O$ data
Operational intensity: $O/D \sim 1/2$ Flop/byte (naïve algorithm).
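The estimate of about 1/2 Flop/byte can be checked with a short calculation. This sketch assumes fp32 values (4 bytes each) and 2 Flop per FMA; both are assumptions, as is the function name `naive_intensity`.

```python
def naive_intensity(order, bytes_per_value=4, flops_per_fma=2):
    """Flop/byte of the naive update for approximation order `order`."""
    fma_ops = 1 + 3 * order   # O: FMA operations per cell per time step
    data = 3 + 3 * order      # D: values moved per cell per time step
    return (flops_per_fma * fma_ops) / (bytes_per_value * data)


if __name__ == "__main__":
    for order in (2, 4, 6, 8, 10, 12, 14):
        print(order, round(naive_intensity(order), 3))
```

Under these assumptions the ratio stays a little below 1/2 for every order from 2 to 14 and approaches 1/2 as the order grows.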
[Figure: x-y-t diagram of the characteristic cone $x^2 + y^2 + z^2 = c^2 t^2$, marking the domain of influence, the domain of dependence, the asynchronous domains, and the synchronization instant.]
Cross-shaped stencil fits into diamond shape.
Wave equation modelling
Computational Grid projection to (x–t)
Traditional stepwise evaluation order
Overlapping stencils increase operational intensity:
- $O = 1 + 3N_O$ FMA operations
- $D = 3$ data
Operational intensity: $O/D \sim (1 + N_O)$ Flop/byte
RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot. x-axis: localization parameter, cell calculations/(data loads + stores); y-axis: performance, 10^9 cells/sec. Roofs for TitanZ and GTX 970. Marked: naive, the best of stepwise, CUDA FDTD3d results.]
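The roofline bound behind this kind of plot is the smaller of the compute ceiling and the bandwidth slope. A minimal sketch follows; the function name and the numbers in the usage lines are hypothetical, not measured values for any device in the talk.

```python
def roofline(intensity, peak_rate, bandwidth):
    """Attainable rate: limited by bandwidth * intensity or by the peak."""
    return min(peak_rate, bandwidth * intensity)


if __name__ == "__main__":
    # Hypothetical device: peak of 100 units/s, bandwidth of 50 loads/s.
    for intensity in (0.1, 0.5, 1.0, 2.0, 4.0):
        print(intensity, roofline(intensity, peak_rate=100.0, bandwidth=50.0))
```

Left of the ridge point (intensity = peak_rate / bandwidth) the algorithm is memory bound, so raising the localization parameter is what moves it up the roof.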
LRnLA method
- Locality: take advantage of the memory subsystem hierarchy, from on-chip CPU cache up to disk and network.
- Recursivity: application of the “divide et impera” strategy in any situation (computer architectures, numerical schemes, etc.).
- non-Locality: optimized for distributed computations.
- Asynchrony: adaptable parallel computations on any level.
Memory Subsystem Hierarchy for GPGPU and CPU
[Figure: data throughput (B/sec, 10^9 to 10^14) versus data set size (B, 1K to 1T) for three devices. GK110 (GTX Titan): regs, L1+sh, L2, GDDR5. GM204 (GTX 980): regs, L1+sh, L2, GDDR5. Haswell (Xeon E5 v3): regs, L1, L2, LLC, DDR4, SSD/PCIe, HDD.]
DiamondTile based algorithm construction
Computational grid in x-y and x-t projections
Computational domain is subdivided into diamond-shaped tiles in x-y.
- Diamond encloses the cross-shaped stencil
- All elements along the 3rd (z) axis are included
- Choose a DiamondTile on the first time step
- Plot the influence cone of the first tile
- Choose a shifted DiamondTile on another time step (Nt steps later)
- Plot the dependence cone of the last tile
- Find the intersection
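The construction steps can be sketched in a 1D (x-t) projection with half-open cell intervals. This is an illustration, not the authors' code: it assumes both cones widen by N_O/2 cells per time step and that the last tile is shifted by (N_O/2)·Nt cells, which the slides do not state. All function names are hypothetical.

```python
def influence_cone(base, steps, radius):
    """Cells that the base tile can affect after `steps` time steps."""
    lo, hi = base
    return (lo - radius * steps, hi + radius * steps)


def dependence_cone(top, steps_back, radius):
    """Cells that the top tile depends on, `steps_back` steps earlier."""
    lo, hi = top
    return (lo - radius * steps_back, hi + radius * steps_back)


def tower(base, nt, order):
    """Intersect the two cones on every time layer from 0 to nt."""
    radius = order // 2                      # stencil half-width
    lo, hi = base
    top = (lo - radius * nt, hi - radius * nt)   # assumed shift of last tile
    layers = []
    for s in range(nt + 1):
        a = influence_cone(base, s, radius)
        b = dependence_cone(top, nt - s, radius)
        layers.append((max(a[0], b[0]), min(a[1], b[1])))
    return layers


if __name__ == "__main__":
    for s, cells in enumerate(tower(base=(10, 14), nt=3, order=2)):
        print(s, cells)
```

Under these assumptions the intersection keeps the base width on every layer and moves by N_O/2 cells per time step, that is, a tilted tower.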
DiamondTorre Algorithm shape
Understand Algorithm as a shape
- Stepwise
- Domain decomposition
- More operational intensity
- DiamondTorre
DiamondTorre Algorithm shape
- DiamondTorre tilt depends on stencil size
- Stencil width is determined by the order of approximation (NO)
DiamondTorre Algorithm parameters
Performance depends on careful choice of algorithm parameters:
- Size of the DiamondTorre base: Diamond Tile Size, DTS
- Number of time layers: Nt
Operational Intensity ∼ DTS/(4 − 1/DTS) (for large Nt)
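Evaluating the large-Nt estimate for a few tile sizes shows how quickly it grows. The function name is hypothetical, and the DTS values are simply the ones labelled on the roofline plots.

```python
def torre_intensity(dts):
    """Large-Nt estimate from the slide: DTS / (4 - 1/DTS)."""
    return dts / (4.0 - 1.0 / dts)


if __name__ == "__main__":
    for dts in (1, 4, 7, 14, 20):
        print(dts, round(torre_intensity(dts), 3))
```

For large DTS the estimate approaches DTS/4, so it scales roughly linearly with the tile size.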
RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot. x-axis: localization parameter, cell calculations/(data loads + stores); y-axis: performance, 10^9 cells/sec. Roofs for TitanZ and GTX 970. Marked: naive, the best of stepwise, DiamondTile at DTS = 1, 4, 7, 14, 20, and DiamondTorre for various DTS.]
DiamondTorre Algorithm with CUDA
In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block.
First stage
Second stage
Odd and even stages alternate, with synchronization after each stage.
- At first, some cells remain on the first time step, while others are advanced by several time layers.
- At the end, all data are advanced to a given time step, which is determined by the DiamondTorre height.
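A 1D analogue of this traversal can be checked against stepwise evaluation. The sketch below is a simplification under stated assumptions, not the authors' CUDA implementation: it uses a second-order (NO = 2) leapfrog update with zero boundaries, parallelogram towers of width `width` tilted by one cell per layer, sequential left-to-right tower order, and full space-time storage. All names are hypothetical.

```python
import math


def update(prev, cur, i, k):
    """Leapfrog update of cell i from the two previous time layers."""
    return 2.0 * cur[i] - prev[i] + k * (cur[i + 1] - 2.0 * cur[i] + cur[i - 1])


def stepwise(f0, f1, k, nt):
    """Reference order: finish each time layer before starting the next."""
    prev, cur = list(f0), list(f1)
    n = len(cur)
    for _ in range(nt):
        nxt = [0.0] * n                      # boundary cells stay zero
        for i in range(1, n - 1):
            nxt[i] = update(prev, cur, i, k)
        prev, cur = cur, nxt
    return cur


def skewed_towers(f0, f1, k, nt, width):
    """Tower order: each tower advances nt layers before the next starts."""
    n = len(f1)
    layers = [list(f0), list(f1)] + [[0.0] * n for _ in range(nt)]
    n_towers = (n + nt + width - 1) // width  # enough to cover the skew
    for tower in range(n_towers):             # left to right
        for s in range(1, nt + 1):            # bottom to top of the tower
            lo = tower * width - (s - 1)      # shift left by one per layer
            for i in range(max(lo, 1), min(lo + width, n - 1)):
                layers[s + 1][i] = update(layers[s - 1], layers[s], i, k)
    return layers[-1]


def demo_pulse(n):
    """Gaussian pulse with zero boundary values (illustrative input)."""
    f = [math.exp(-((i - n // 2) / 4.0) ** 2) for i in range(n)]
    f[0] = f[-1] = 0.0
    return f


if __name__ == "__main__":
    f0 = demo_pulse(64)
    same = skewed_towers(f0, f0, 0.25, 10, 8) == stepwise(f0, f0, 0.25, 10)
    print("tower order matches stepwise order:", same)
```

While the first towers are being processed, cells to their right are still on the initial layers; only after the last tower does every cell reach layer Nt. In this sketch a tower touches only `width` cells per layer, which is what keeps its working set small.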
RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot. x-axis: localization parameter, cell calculations/(data loads + stores); y-axis: performance, 10^9 cells/sec. Roofs for TitanZ and GTX 970. Marked: naive, the best of stepwise, DiamondTile at DTS = 1, 4, 7, 14, 20, DiamondTorre for various DTS, and CUDA FDTD3d results.]
[Figure: calculation rate (Gcells/sec, 0 to 60) for various scheme/algorithm parameters NO/DTS (2/1, 4/1, 6/1, 8/1, 10/1, 12/1, 14/1, 6/1, 6/2, 4/1, 4/2, 4/3, 2/1 to 2/7) on GTX 750Ti, GTX 970, and TitanZ (1).]
[Figure: calculation rate (Gcells/sec) versus parallel level (warps), log-log axes, for TitanZ and GTX970. Reference levels: FDTD3d CPU rate, FDTD3d CPU rate with -O3, FDTD3d TitanZ rate, FDTD3d GTX970 rate.]
Wave Modeling Applications
FDTD simulation for electromagnetics (2nd and 4th order approximation, PML) (Zakirov A., Goryachev I.)
Gas dynamics with RKDG scheme (Korneev B.)
FDTD simulation for elastic seismic media (Levander scheme, 4th order, PML, Thomsen anisotropy, TFSF source) (Levchenko V., Zakirov A., Perepelkina A., Ivanov A.)
Particle-in-cell plasma kinetics (Levchenko V., Perepelkina A., Goryachev I.)
Main Results and Conclusions
- New DiamondTile algorithms of the LRnLA family are developed for wave modeling. The algorithms are efficient on the memory and parallelism models of CUDA GPGPUs.
- Unlike the traditional stepwise evaluation order, data dependencies are traced across many time iteration steps. This increases operational intensity and allows higher calculation rates.
- Performance of 50-60 billion cells/s is achieved with Titan, as well as with GTX970, in the implementation of wave modeling.