DiamondTile Algorithm for High-Performance Wave Modeling

Date post: 03-Feb-2021
DiamondTile Algorithm for High-Performance Wave Modeling. Vadim Levchenko, Anastasia Perepelkina, Keldysh Institute of Applied Mathematics RAS. GTC 2015.
Transcript
  • DiamondTile Algorithm for High-Performance Wave Modeling

    Vadim Levchenko, Anastasia Perepelkina

    Keldysh Institute of Applied Mathematics RAS

    GTC 2015

  • FLOPs and Bandwidth Performance Ratio

    [Figure: memory bandwidth, GB/s (ticks 10 to 1000), against peak performance, TFLOP/s (fp32) (ticks 0.1 to 10), for nVidia Maxwell (2014-15), nVidia Kepler (2012-13), Intel CPU (2014) and NEC SX (199x). Guide lines at 4, 0.1 and 0.04 Bytes/Flops.]

  • RoofLine model

    S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52:65–76, 2009.

    L. Barba, R. Yokota. How Will the Fast Multipole Method Fare in the Exascale Era? CSE13.
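
The bound at the heart of the Roofline model cited above can be written in a few lines: attainable performance is the smaller of the peak compute rate and the memory bandwidth multiplied by the operational intensity. The sketch below is only an illustration; the function names and the sample device numbers are assumptions, not values read from the slides.

```python
def roofline(peak_flops, bandwidth_bytes, intensity):
    """Attainable FLOP/s under the Roofline model.

    peak_flops      -- peak compute rate, FLOP/s
    bandwidth_bytes -- memory bandwidth, bytes/s
    intensity       -- operational intensity of the kernel, FLOP/byte
    """
    return min(peak_flops, bandwidth_bytes * intensity)


def ridge_point(peak_flops, bandwidth_bytes):
    """Operational intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth_bytes


# Hypothetical device (assumption): 5 TFLOP/s peak, 250 GB/s memory bandwidth.
PEAK = 5e12
BW = 250e9

if __name__ == "__main__":
    for oi in (0.5, 5.0, 50.0):
        rate = roofline(PEAK, BW, oi)
        print(f"intensity {oi:5.1f} FLOP/byte -> {rate / 1e12:.3f} TFLOP/s")
```

A kernel whose operational intensity lies below the ridge point is limited by memory bandwidth, which is the regime the rest of the talk tries to escape.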

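  • Wave Modeling Specifics

    $$\frac{\partial^2 F}{\partial t^2} = c^2\left(\frac{\partial^2 F}{\partial x^2} + \frac{\partial^2 F}{\partial y^2} + \frac{\partial^2 F}{\partial z^2}\right) \quad (+\,\text{BCs} + \text{ICs})$$

    Finite difference along each axis:

    $$\left.\frac{\partial^2 F}{\partial x^2}\right|_{x_0,y_0,z_0} = \frac{1}{\Delta x^2}\sum_{i=0}^{N_O/2} C_i\left(F|_{x_0+i\Delta x,\,y_0,\,z_0} + F|_{x_0-i\Delta x,\,y_0,\,z_0}\right)$$

    $N_O = 2$ for $\partial^2 F/\partial t^2$; $N_O = 2, 4, 6, \ldots, 14$ for the coordinate axes ($N_O$ is written NO in the plain text below).

    Per cell, per time step calculation:

    - O = 1 + 3NO FMA operations
    - D = 3 + 3NO data

    Operational intensity: O/D ∼ 1/2 Flop/byte (naïve algorithm).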

    [Figure: light cone $x^2 + y^2 + z^2 = c^2 t^2$ drawn in (x, y, t) axes, marking the domain of influence, the domain of dependence, the asynchronous domains and the synchronization instant.]
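
For the lowest order, NO = 2, the scheme on this slide reduces to the familiar three-point second difference along each axis combined with a central difference in time. The NumPy sketch below shows one such time step. It rests on assumptions the slide does not state: periodic boundaries (via `np.roll`), a unit grid step, and the hypothetical names `laplacian_no2` and `leapfrog_step`; `courant2` stands for $c^2 \Delta t^2 / \Delta x^2$.

```python
import numpy as np


def laplacian_no2(f):
    """Second-order (NO = 2) discrete Laplacian on a periodic 3D grid, unit spacing."""
    lap = -6.0 * f
    for axis in range(3):
        # Cross-shaped stencil: one neighbour on each side along every axis.
        lap += np.roll(f, 1, axis=axis) + np.roll(f, -1, axis=axis)
    return lap


def leapfrog_step(f_prev, f_curr, courant2):
    """Advance the wave equation by one time step (central difference in time)."""
    return 2.0 * f_curr - f_prev + courant2 * laplacian_no2(f_curr)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f0 = rng.standard_normal((16, 16, 16))
    f1 = f0.copy()  # zero initial velocity
    f2 = leapfrog_step(f0, f1, courant2=0.25)
    print(f2.shape)
```

Each call to `leapfrog_step` reads the whole grid and writes it back once, which is exactly the stepwise access pattern whose low operational intensity the slide points out.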

  • Wave Modeling Specifics

    The cross-shaped stencil fits into a diamond shape.

  • Wave equation modelling

    Computational grid projection to (x–t)

  • Traditional stepwise evaluation order

    Overlapping stencils increase operational intensity:

    - O = 1 + 3NO FMA operations
    - D = 3 data

    Operational intensity: O/D ∼ (1 + NO) Flop/byte
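
The effect of stencil overlap on the operation and data counts quoted above can be tabulated directly. The sketch below reports the raw ratio of FMA operations to data items per cell. It deliberately does not convert to Flop/byte, because that conversion depends on the assumed flops per FMA and bytes per value, which the slide does not spell out. The function name and its flag are hypothetical.

```python
from fractions import Fraction


def ops_per_data(no, overlapping):
    """Ratio of FMA operations to data items per cell, per time step.

    no          -- order of approximation along the coordinate axes (NO)
    overlapping -- True if neighbouring stencils share their loaded data,
                   so only 3 data items per cell are counted
    """
    ops = 1 + 3 * no
    data = 3 if overlapping else 3 + 3 * no
    return Fraction(ops, data)


if __name__ == "__main__":
    for no in (2, 4, 6, 8, 10, 12, 14):
        naive = ops_per_data(no, overlapping=False)
        shared = ops_per_data(no, overlapping=True)
        print(f"NO={no:2d}  naive={float(naive):.2f}  overlapping={float(shared):.2f}")
```

The naive ratio stays below one for every order, while the overlapping ratio grows linearly with NO.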

  • RoofLine Model for Wave Equation on GPGPU

    [Figure: roofline plot. Horizontal axis: localization parameter, cell calculations/(data loads + stores), ticks 0.1 to 10. Vertical axis: performance, 10^9 cells/sec, ticks 10 to 1000. Marked points: naive, the best of stepwise, CUDA FDTD3d results. Devices: TitanZ, GTX 970.]

  • LRnLA method

    Locality: take advantage of the memory subsystem hierarchy, from on-chip CPU cache up to disk and network.

    Recursivity: apply the “divide et impera” strategy in any situation (computer architectures, numerical schemes, etc.).

    non-Locality: optimized for distributed computations.

    Asynchrony: adaptable parallel computations at any level.

  • Memory Subsystem Hierarchy for GPGPU and CPU

    [Figure: data throughput, B/sec (10^9 to 10^14), against data set size, B (1K to 1T), in three panels. GK110 (GTX Titan): regs, L1+sh, L2, GDDR5. GM204 (GTX 980): regs, L1+sh, L2, GDDR5. Haswell (Xeon E5 v3): regs, L1, L2, LLC, DDR4, SSD/PCIe, HDD.]

  • DiamondTile based algorithm construction

    Computational grid in x-y and x-t projections.

    The computational domain is subdivided into diamond-shaped tiles in x-y.

    - Each diamond encloses the cross-shaped stencil
    - All elements along the 3rd (z) axis are included

    Construction steps:

    - Choose a DiamondTile on the first time step
    - Plot the influence cone of the first tile
    - Choose a shifted DiamondTile on another time step (Nt steps later)
    - Plot the dependence cone of the last tile
    - Find the intersection
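
The construction steps above can be traced in the x–t projection alone, where a tile is just an interval of cells. The sketch below is a simplified illustration, not the authors' construction in full: it assumes a stencil half-width `r` (NO/2), half-open cell intervals, and a final tile shifted by `r` cells per time step. All function names are hypothetical.

```python
def influence_cone(tile, k, r):
    """Cells at layer k that the tile (lo, hi) placed at layer 0 can affect."""
    lo, hi = tile
    return (lo - r * k, hi + r * k)


def dependence_cone(tile, nt, k, r):
    """Cells at layer k that the tile (lo, hi) placed at layer nt depends on."""
    lo, hi = tile
    return (lo - r * (nt - k), hi + r * (nt - k))


def intersect(a, b):
    """Intersection of two half-open intervals, or None if it is empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def tower_layers(width, nt, r):
    """Intersection of the two cones on every layer from 0 to nt."""
    first = (0, width)
    # Assumption: the last tile is shifted by r cells for each of the nt steps.
    last = (r * nt, width + r * nt)
    return [
        intersect(influence_cone(first, k, r), dependence_cone(last, nt, k, r))
        for k in range(nt + 1)
    ]


if __name__ == "__main__":
    for k, cells in enumerate(tower_layers(width=4, nt=3, r=1)):
        print(k, cells)
```

Under these assumptions every layer of the intersection has the same width as the base tile and is displaced by `r` cells per step, which gives a slanted prism whose tilt grows with the stencil size.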

  • Understand Algorithm as a shape

    [Figures, in sequence: stepwise; domain decomposition; more operational intensity; DiamondTorre.]

  • DiamondTorre Algorithm shape

    - DiamondTorre tilt depends on stencil size
    - Stencil width is determined by the order of approximation (NO)

  • DiamondTorre Algorithm parameters

    Performance depends on careful choice of algorithm parameters:

    - Size of the DiamondTorre base: Diamond Tile Size, DTS
    - Quantity of time layers: Nt

    Operational intensity ∼ DTS/(4 − 1/DTS) (for large Nt)
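
Taking the large-Nt estimate above at face value, a short script shows how the operational intensity scales with the tile size. The DTS values are the ones that label the roofline plots in this talk; the function name is hypothetical and the units are whatever the slide's estimate implies.

```python
def operational_intensity(dts):
    """Large-Nt estimate quoted on the slide: DTS / (4 - 1/DTS)."""
    return dts / (4.0 - 1.0 / dts)


# DTS values that appear as labels on the roofline plots.
DTS_VALUES = (1, 4, 7, 14, 20)

if __name__ == "__main__":
    for dts in DTS_VALUES:
        print(f"DTS={dts:2d}  estimate={operational_intensity(dts):.3f}")
```

For large DTS the estimate approaches DTS/4, so the intensity grows roughly linearly with the tile size.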

  • RoofLine Model for Wave Equation on GPGPU

    [Figure: roofline plot. Horizontal axis: localization parameter, cell calculations/(data loads + stores), ticks 0.1 to 10. Vertical axis: performance, 10^9 cells/sec, ticks 10 to 1000. Marked points: naive; the best of stepwise; DiamondTile with DTS = 1; DiamondTorre for various DTS (1, 4, 7, 14, 20). Devices: TitanZ, GTX 970.]

  • DiamondTorre Algorithm with CUDA

    In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block.

    A first stage is followed by a second stage. Odd and even stages alternate, with synchronization after each stage.

    - At first, some portion of the cells remains on the first time step, while others are advanced by several time layers
    - At the end, all data have progressed to a given time step, which is determined by the DiamondTorre height
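
The claim that slanted towers reach the same final state as stepwise evaluation can be checked on a one-dimensional analogue. The sketch below is a serial illustration under stated assumptions, not the authors' CUDA kernel: one spatial dimension, NO = 2, fixed zero boundaries, towers that lean left by one cell per time step and are processed left to right. It does not model threads along z or the odd/even stages. All names, including `width` as a stand-in for the tile size, are hypothetical.

```python
import numpy as np


def update_cell(f, k, i, courant2):
    """Compute layer k+1 at cell i from layers k and k-1 (NO = 2 stencil)."""
    f[k + 1, i] = (2.0 * f[k, i] - f[k - 1, i]
                   + courant2 * (f[k, i + 1] - 2.0 * f[k, i] + f[k, i - 1]))


def init(n, nt):
    """Space-time array with two initial layers; unset cells are NaN."""
    f = np.full((nt + 1, n), np.nan)
    x = np.arange(n)
    pulse = np.exp(-0.5 * ((x - n / 2) / 3.0) ** 2)
    f[0] = pulse
    f[1] = pulse          # zero initial velocity
    f[:, 0] = 0.0         # fixed boundaries on every layer
    f[:, -1] = 0.0
    return f


def run_stepwise(n, nt, courant2):
    """Traditional order: finish a whole time layer before starting the next."""
    f = init(n, nt)
    for k in range(1, nt):
        for i in range(1, n - 1):
            update_cell(f, k, i, courant2)
    return f


def run_towers(n, nt, courant2, width):
    """Tower order: climb all time layers inside one slanted tile, then move on."""
    assert width >= 2, "dependencies of neighbouring towers require width >= 2"
    f = init(n, nt)
    n_towers = (n + nt) // width + 1
    for j in range(n_towers):              # towers from left to right
        for k in range(1, nt):             # climb the tower
            lo = j * width - k             # lean left by one cell per step
            hi = lo + width
            for i in range(max(lo, 1), min(hi, n - 1)):
                update_cell(f, k, i, courant2)
    return f


if __name__ == "__main__":
    a = run_stepwise(64, 20, 0.25)
    b = run_towers(64, 20, 0.25, width=8)
    print("identical:", np.array_equal(a, b))
```

Because unset cells start as NaN, any read of a value that has not been computed yet would poison the result, so agreement with the stepwise run also confirms that the tower order respects every data dependency.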

  • RoofLine Model for Wave Equation on GPGPU

    [Figure: roofline plot. Horizontal axis: localization parameter, cell calculations/(data loads + stores), ticks 0.1 to 10. Vertical axis: performance, 10^9 cells/sec, ticks 10 to 1000. Marked points: naive; the best of stepwise; DiamondTile with DTS = 1; DiamondTorre for various DTS (1, 4, 7, 14, 20); CUDA FDTD3d results. Devices: TitanZ, GTX 970.]

  • [Figure: calculation rate, Gcells/sec (0 to 60), for various scheme/algorithm parameters NO/DTS: 2/1, 4/1, 6/1, 8/1, 10/1, 12/1, 14/1, 6/1, 6/2, 4/1, 4/2, 4/3, 2/1, 2/2, 2/3, 2/4, 2/5, 2/6, 2/7. Devices: GTX 750Ti, GTX 970, TitanZ (1).]

  • [Figure: calculation rate, Gcells/sec (0.01 to 100), against parallel level, warps (0.01 to 1000). Series: TitanZ, GTX970. Reference levels: FDTD3d CPU rate, FDTD3d CPU rate with -O3, FDTD3d TitanZ rate, FDTD3d GTX970 rate.]

  • Wave Modeling Applications

    - FDTD simulation for electromagnetics (2nd and 4th order approximation, PML) (Zakirov A., Goryachev I.)
    - Gas dynamics with RKDG scheme (Korneev B.)
    - FDTD simulation for elastic seismic media (Levander scheme, 4th order, PML, Thomsen anisotropy, TFSF source) (Levchenko V., Zakirov A., Perepelkina A., Ivanov A.)
    - Particle-in-cell plasma kinetics (Levchenko V., Perepelkina A., Goryachev I.)

  • Main Results and Conclusions

    - New DiamondTile algorithms of the LRnLA family are developed for wave modeling. The algorithms are efficient on the memory and parallelism models of CUDA GPGPU.
    - Unlike the traditional stepwise evaluation order, data dependencies are traced over many time iteration steps. This increases operational intensity and allows higher calculation rates to be reached.
    - Performance of 50-60 billion cells/s is achieved with Titan, as well as with GTX970, in the implementation of wave modeling.

