From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions
SC19, Denver, Colorado, November 2019
University of Stuttgart, IPVS, SSE
Transcript
Page 1: Title slide

From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions

Denver, Colorado

November, 2019

University of Stuttgart, IPVS, SSE

Page 2: Authors

Authors

• Gregor Daiß
• John Biddiscombe
• Parsa Amini
• Patrick Diehl
• Juhan Frank
• Kevin Huck
• Hartmut Kaiser
• Dominic Marcello
• David Pfander
• Dirk Pflüger

Page 3: Motivation

Motivation

• Simulation of binary star systems and their mergers
• Octo-Tiger models these star systems as self-gravitating fluids on an AMR grid, built on HPX
• Large-scale runs on Piz Daint use up to 768 million cells

Contributions:
• Significant speedup from replacing MPI with Libfabric, without changing any application code
• Efficient integration of small GPU kernels into an asynchronous many-task runtime system


Page 5: Table of Contents

Table of Contents

1 Octo-Tiger in a Nutshell
2 HPX and the Libfabric Parcelport in a Nutshell
3 Asynchronous Many Tasks with GPUs
4 Results

Page 6: Octo-Tiger in a Nutshell

Octo-Tiger in a Nutshell

Page 7: Octo-Tiger in a Nutshell

Octo-Tiger in a Nutshell

Octo-Tiger simulates self-gravitating fluids on an AMR grid

Gravity Solver
• Using the Fast Multipole Method (FMM)
• Has to be solved every timestep
• Is the more compute-intensive part

Hydro Solver
• Navier-Stokes equations
• Using finite volumes

Page 8: Octo-Tiger in a Nutshell

Octo-Tiger in a Nutshell

Octo-Tiger simulates self-gravitating fluids on an AMR grid

[Figure: the adaptive grid is distributed across compute nodes, with one HPX locality per node (Node 1 / Locality 1, Node 2 / Locality 2, ...).]

Page 9: HPX and the Libfabric Parcelport in a Nutshell

HPX and the Libfabric Parcelport in a Nutshell

Page 10: A Distributed Task-Based Runtime

A Distributed Task-Based Runtime

[Figure: Localities 0, 1, ..., N connected through an Active Global Address Space (locality still matters); components live on localities and are invoked via actions and async.]

• Unified C++ syntax for local and remote operations
• Asynchronous tasks with futures

Page 11: Standards Driven C++ Tasking for Parallelism and Concurrency

Standards Driven C++ Tasking for Parallelism and Concurrency

Futures for Synchronization
• Continuation Passing Style (CPS) preferred
• Functional approach to programming
• Task synchronization is also data driven

Runtime
• Lightweight threads
• Suspend on get(), resume when ready
• Work stealing when the current task is done/suspended

AGAS
• Manages a handle to a component
• Forwards work to the locality holding the data
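A minimal sketch of this style of tasking (not from the slides; exact header names vary between HPX releases): hpx::async spawns a lightweight thread, get() suspends only the calling task, and the worker thread underneath steals other work until the result is ready.

```cpp
#include <hpx/hpx_main.hpp>        // lets a plain main() run inside the HPX runtime
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <iostream>

// Hypothetical stand-in for a piece of solver work.
double expensive_kernel(double x) { return x * x; }

int main()
{
    // Spawn a lightweight HPX task; hpx::async returns immediately with a future.
    hpx::future<double> f = hpx::async(expensive_kernel, 3.0);

    // get() suspends only this lightweight thread; the worker thread underneath
    // is free to steal and run other tasks until f becomes ready.
    std::cout << f.get() << '\n';
    return 0;
}
```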


Page 14: Building DAGs from Futures

Building DAGs from Futures

[Figure: three ways to build a task DAG from futures: chaining with f1.then() (Task 1 -> Task 2), joining with when_xxx(f1, f2) (Tasks 1 and 2 -> Task 3), and fanning out N continuations from a shared future via shared.then().]
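A small sketch of the three patterns in the figure (task1 and task2 are hypothetical stand-ins, not Octo-Tiger code):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <utility>

// Hypothetical stand-ins for real solver tasks.
int task1() { return 1; }
int task2() { return 2; }

int main()
{
    hpx::future<int> f1 = hpx::async(task1);
    hpx::future<int> f2 = hpx::async(task2);

    // f1.then(): chain Task 2 behind Task 1.
    hpx::future<int> chained =
        f1.then([](hpx::future<int> f) { return f.get() + 1; });

    // when_all(...): Task 3 runs once both predecessors are ready.
    hpx::future<void> joined =
        hpx::when_all(std::move(chained), std::move(f2))
            .then([](auto&&) { /* Task 3 */ });

    // shared.then(): fan out N continuations from one shared future.
    hpx::shared_future<int> sf = hpx::async(task1).share();
    hpx::future<void> c1 = sf.then([](hpx::shared_future<int>) { /* consumer 1 */ });
    hpx::future<void> c2 = sf.then([](hpx::shared_future<int>) { /* consumer 2 */ });

    joined.get();
    c1.get();
    c2.get();
    return 0;
}
```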

Page 15: Remote Actions use Active Messages

Remote Actions use Active Messages

AMR refinement and redistribution:
• Moving a subgrid from one node to another means calling a (copy) constructor on a remote node, with the contents of this subgrid as parameter(s)

Halo exchange:
• Execute a put on the remote node, with a data buffer as parameter
• Execute a get on the remote node, with a (local) buffer address as parameter


Page 17: Active Messages

Active Messages

Syntax
• Instead of the more traditional

  MPI_Isend(buffer, count, datatype, dest_rank, tag, comm, request)

  HPX messages take the form of a remote function invocation

  future = hpx::async(dest_locality, function, arg1, arg2...)

  where any C++ data can be sent as arguments (vector/set/list/map/custom); see the sketch below.

Implementation
• Data is passed as arguments to a remote function (or object::function)
• Remote function parameters are serialized into a parcel, consisting of
  - a function identifier (including the object if complex, like a grid::node)
  - a list of parameters

Channels
• HPX uses a Channel abstraction to simplify send/recv for halo regions
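A minimal sketch of this remote-invocation syntax using a plain action (names such as sum_halo are made up; header names vary between HPX releases):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/actions.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/runtime.hpp>

#include <vector>

// Hypothetical free function to be run on another locality.
double sum_halo(std::vector<double> const& halo)
{
    double s = 0.0;
    for (double v : halo)
        s += v;
    return s;
}
// Turn the function into an action so it can be invoked remotely.
HPX_PLAIN_ACTION(sum_halo, sum_halo_action)

int main()
{
    // Pick a destination locality (may be this one if the run is not distributed).
    std::vector<hpx::id_type> localities = hpx::find_all_localities();
    hpx::id_type dest = localities.back();

    std::vector<double> halo(512, 1.0);   // any serializable C++ type works

    // The "active message": the action id plus the serialized arguments form
    // the parcel; the result comes back through the future.
    hpx::future<double> f = hpx::async<sum_halo_action>(dest, halo);
    double result = f.get();
    (void) result;
    return 0;
}
```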


Page 20: Existing MPI-Based Parcelport Implementation

Existing MPI-Based Parcelport Implementation

A parcel is represented by a 'chunk list' + data block
• If params are small (eager protocol)
  - Index chunk (size/offset); copy into the parcel buffer
• For large params (rendezvous)
  - Pointer chunk; separate sends
• Message handling of parcels is currently sub-optimal: one-sided put/get can/should be used for rendezvous items

[Figure: parcel layout (header, eager data, chunk list with chunk types) and the send/receive handshake: SendParcel, large data? -> ack, post recvs, send data, RecvParcel, decode.]
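A schematic of that eager-vs-rendezvous decision (illustrative only, not HPX's actual serialization code; the 4 KiB cutoff is an assumed value): small parameters are copied into the parcel buffer as index chunks, large ones are recorded as pointer chunks and sent separately.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Schematic only: the real parcelport uses HPX's serialization chunker.
constexpr std::size_t eager_threshold = 4096;   // assumed cutoff, in bytes

struct chunk
{
    bool        is_pointer;   // pointer chunk (rendezvous) vs index chunk (eager)
    std::size_t size;
    std::size_t offset;       // offset into the eager buffer (index chunks)
    void const* data;         // address for a separate send/get (pointer chunks)
};

struct parcel_buffer
{
    std::vector<std::uint8_t> eager;   // header + small, copied parameters
    std::vector<chunk>        chunks;  // the 'chunk list'
};

void add_parameter(parcel_buffer& p, void const* data, std::size_t size)
{
    if (size <= eager_threshold)
    {
        // Eager protocol: copy into the parcel buffer, remember size/offset.
        std::size_t offset = p.eager.size();
        p.eager.insert(p.eager.end(),
                       static_cast<std::uint8_t const*>(data),
                       static_cast<std::uint8_t const*>(data) + size);
        p.chunks.push_back({false, size, offset, nullptr});
    }
    else
    {
        // Rendezvous: record a pointer chunk; the data travels in a separate
        // send, or is fetched with a one-sided get.
        p.chunks.push_back({true, size, 0, data});
    }
}
```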

Page 21: Libfabric as an Alternative to MPI

Libfabric as an Alternative to MPI

Downsides of MPI
• MPI_Put/Get is not an asynchronous API
• Copies completions to MPI_Request handles
• Memory management less flexible

Benefits of libfabric
• Asynchronous API (incl. Put/Get - enqueue many)
• Maps driver/GNI completion queues (without copy)
• Robustly threadsafe
• Vectorized sends: fi_sendv
• Flexible memory pinning

[Figure: software stack on the Cray GNI interconnect: MPI (MPI_Requests, communicators, memory windows, epochs) and libfabric (completions, endpoints, memory pinning) both sit above the kernel/user-space GNI driver, with HPX futures layered on top of libfabric.]

Page 22: Fine Tuning of RDMA-Based Parcelport

Fine Tuning of RDMA-Based Parcelport

Impedance match between HPX and libfabric API
• Identical asynchronous GNI/driver-level completions for send/recv/get/(put)
• Trigger futures directly from the completion handler

Memory Management
• C++ allocator for pinned memory blocks
• Flow control: we explicitly manage queues (= buffers)
• fi_sendv allows reduced memory copies
• Multi-parcels when send buffers are filling up (fi_sendv)
• RDMA<T> types integrated into our parcelport (channels ongoing)

Threading
• Robust threadsafe libfabric API
• FI_CONTEXT allows us to be 100% lock-free in our HPX layer
• Map completions directly to objects (cf. communicators)
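One way such a pinned-memory allocation could look (a sketch assuming an already-opened libfabric fid_domain; not the parcelport's actual allocator): every block handed out is registered with fi_mr_reg so the NIC can DMA to and from it directly.

```cpp
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

#include <cstddef>
#include <cstdlib>
#include <stdexcept>

// One pinned block: a heap allocation plus its libfabric memory registration.
struct pinned_block
{
    void*          addr = nullptr;
    std::size_t    size = 0;
    struct fid_mr* mr   = nullptr;   // memory-region handle used by the NIC
};

// Allocate a block and register (pin) it with libfabric so the NIC can DMA it.
pinned_block allocate_pinned(struct fid_domain* domain, std::size_t size)
{
    pinned_block b;
    b.addr = std::malloc(size);
    b.size = size;
    int rc = fi_mr_reg(domain, b.addr, size,
                       FI_SEND | FI_RECV | FI_READ | FI_WRITE |
                       FI_REMOTE_READ | FI_REMOTE_WRITE,
                       0 /*offset*/, 0 /*requested_key*/, 0 /*flags*/,
                       &b.mr, nullptr /*context*/);
    if (rc != 0)
        throw std::runtime_error("fi_mr_reg failed");
    return b;
}

void free_pinned(pinned_block& b)
{
    if (b.mr)
        fi_close(&b.mr->fid);   // unregister before releasing the memory
    std::free(b.addr);
    b = pinned_block{};
}
```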


Page 25: Performance/Integration of Libfabric in the Runtime

Performance/Integration of Libfabric in the Runtime

[Figure: Octo-Tiger running on two thread pools; in thread pool 1 every core (Core 1 ... Core N) alternates between polling/handling completions and executing tasks from its task queues, while a separate thread pool 2 can take over the polling, with or without task queues of its own.]

• Every core can poll for completion events during background processing
• Polling can be moved to another thread pool, with or without tasks
• Every microsecond saved in polling/handling = 1 MFLOP on a 1 TFLOP/s GPU
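A rough sketch of the per-core background polling described above (assumed structure, not the actual parcelport code): each worker thread drains the libfabric completion queue between tasks and uses the FI_CONTEXT pointer carried by each completion to make the corresponding future ready.

```cpp
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

// Each posted operation carries one of these as its context (FI_CONTEXT mode),
// so a completion can be mapped back to its future without any lookup table.
struct operation_context
{
    struct fi_context ctx;                       // must come first: libfabric scratch space
    void (*on_complete)(operation_context*);     // e.g. makes an hpx::promise ready
};

// Called from every worker thread as HPX "background work" between tasks.
// Returns true if any completions were handled, i.e. progress was made.
bool poll_completions(struct fid_cq* cq)
{
    struct fi_cq_msg_entry entries[16];
    ssize_t n = fi_cq_read(cq, entries, 16);
    if (n <= 0)                                  // typically -FI_EAGAIN: nothing to do
        return false;
    for (ssize_t i = 0; i < n; ++i)
    {
        auto* op = static_cast<operation_context*>(entries[i].op_context);
        op->on_complete(op);                     // trigger the future directly here
    }
    return true;
}
```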

Page 26: Asynchronous Many Tasks with GPUs

Asynchronous Many Tasks with GPUs


Page 31: Example FMM Kernels from Octo-Tiger

Example FMM Kernels from Octo-Tiger

• Calculation of the gravity interactions between neighboring cells on the same oct-tree level
• Stencil code
• The (3D) stencil has 1074 elements
• The stencil gets applied for all of the 512 cells per subgrid
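A toy CUDA kernel illustrating the shape of such a stencil kernel (only the dimensions are taken from the slide; the real multipole kernels compute far more per interaction): one thread per cell of the subgrid, each looping over the 1074 stencil elements.

```cpp
// Dimensions from the slide; everything else is simplified.
constexpr int CELLS_PER_SUBGRID = 512;    // 8 x 8 x 8 cells per subgrid
constexpr int STENCIL_SIZE      = 1074;   // stencil elements

// Toy kernel: one thread per cell, each looping over the whole stencil.
// cell_data is assumed padded so that cell + offset never leaves the array.
__global__ void fmm_stencil_toy(const double* __restrict__ cell_data,
                                const int*    __restrict__ stencil_offsets,
                                const double* __restrict__ stencil_weights,
                                double*       __restrict__ result)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= CELLS_PER_SUBGRID)
        return;

    double acc = 0.0;
    for (int s = 0; s < STENCIL_SIZE; ++s)
    {
        int neighbor = cell + stencil_offsets[s];   // real code resolves 3D offsets
        acc += stencil_weights[s] * cell_data[neighbor];
    }
    result[cell] = acc;
}

// Possible launch, e.g. on a dedicated stream:
//     fmm_stencil_toy<<<2, 256, 0, stream>>>(cells, offsets, weights, out);
```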

Page 32: Running the CUDA Kernel

Running the CUDA Kernel

[Figure: a CPU thread enqueues work on a CUDA stream over time: 1. memcpy async HtoD, 2. queue CUDA kernel, 3. memcpy async DtoH. When to sync the results?]

Goals:
• Interleave the GPU kernel with arbitrary CPU kernels and communication
• Non-blocking synchronization



Page 37: Integrating with HPX

Integrating with HPX

[Figure: an HPX task enqueues 1. memcpy async HtoD, 2. queue CUDA kernel, 3. memcpy async DtoH on a CUDA stream and then 4. gets a future from the HPX scheduler, which has 4.5. inserted a callback into the stream; while the task waits in fut.get(), the scheduler runs arbitrary other HPX tasks.]

Solution:
• Use HPX tasks instead
• Insert a callback into the CUDA stream
• The scheduler can return a future that becomes ready once this callback gets executed
• The HPX task gets suspended
• The HPX thread can work on other tasks/communication
• The task will be resumed once the GPU kernel has finished
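A hand-rolled sketch of this callback mechanism (HPX also ships a ready-made CUDA/future integration, and its production code polls CUDA events from the scheduler rather than touching HPX objects from the CUDA callback thread; hpx::promise here means the local promise type, called hpx::lcos::local::promise in older releases):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <cuda_runtime.h>

// Host callback, enqueued behind the copies and the kernel; it makes the
// associated future ready (step 4.5 in the figure).
void CUDART_CB mark_done(void* user_data)
{
    auto* p = static_cast<hpx::promise<void>*>(user_data);
    p->set_value();
}

// Returns a future that becomes ready once everything queued so far on
// `stream` (HtoD copy, kernel, DtoH copy) has finished. The promise must
// outlive the callback.
hpx::future<void> stream_future(cudaStream_t stream, hpx::promise<void>& p)
{
    hpx::future<void> f = p.get_future();
    cudaLaunchHostFunc(stream, mark_done, &p);
    return f;
}
```

A task would then either suspend on f.get() or attach the next step with f.then(...), so the worker thread stays free for other tasks while the GPU works.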

Page 38: Filling the GPU?

Filling the GPU?

• One kernel calculates 512 * 1074 cell interactions
• Depending on the type, 12 to 455 floating point operations per interaction
• Still not enough work to utilize even one GPU

• Leverage CUDA streams for implicit work aggregation
• Avoid on-the-fly allocations
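Back-of-the-envelope, using the figures above: 512 × 1074 ≈ 550,000 interactions per kernel, so even at 455 floating point operations per interaction one kernel amounts to roughly 0.25 GFLOP of work; far too little to keep a multi-TFLOP/s GPU busy on its own, hence the aggregation across many concurrent streams.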

Page 39: Filling the GPU!

Filling the GPU!

[Figure: an HPX task asks a lock-free, thread-local launcher to launch a kernel and receives a future; the launcher picks one of the GPU's slots (Slot 1 ... Slot 4), each with its own pinned host memory, CUDA stream and CUDA buffer, or falls back to a CPU kernel if none is free.]

Solution
• Launch many small kernels in different streams
• One launcher for each HPX thread
• One kernel launch per slot
• If all slots are busy, execute the kernel on the CPU
• Most kernels executed on the GPU (99.5%)
• Arbitrary number of slots on multiple GPUs
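A simplified sketch of such a launcher (assumed shape only; Octo-Tiger's real launcher is per-HPX-thread and lock-free, and manages an arbitrary number of slots per GPU):

```cpp
#include <atomic>

#include <cuda_runtime.h>

constexpr int NUM_SLOTS = 4;

// One slot = one CUDA stream plus pre-allocated buffers (no on-the-fly allocations).
// Streams and buffers are assumed to be set up once at startup.
struct launch_slot
{
    cudaStream_t      stream{};
    double*           pinned_host = nullptr;   // cudaMallocHost'ed staging buffer
    double*           device_buf  = nullptr;   // cudaMalloc'ed device buffer
    std::atomic<bool> busy{false};
};

launch_slot slots[NUM_SLOTS];

// Try the GPU first; if every slot is occupied, run the CPU fallback kernel.
template <typename GpuLaunch, typename CpuKernel>
void launch(GpuLaunch&& gpu_launch, CpuKernel&& cpu_kernel)
{
    for (int i = 0; i < NUM_SLOTS; ++i)
    {
        bool expected = false;
        if (slots[i].busy.compare_exchange_strong(expected, true))
        {
            // Enqueue copies + kernel on this slot's stream; the completion
            // callback (or future continuation) resets busy to false again.
            gpu_launch(slots[i]);
            return;
        }
    }
    cpu_kernel();   // all slots busy: CPU fallback (rare in practice, ~0.5%)
}
```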


Page 42: Work Size

Work Size

[Figure: the same launcher scheme as on the previous slide, extended to multiple GPUs (GPU 1, GPU 2), each with its own slots.]

• With an arbitrary number of slots on multiple GPUs, 99.5207% of the kernels are executed on the GPU; the remainder fall back to the CPU

Page 43: Asynchronous Many Tasks with GPUs

Asynchronous Many Tasks with GPUs

Advantages:
• Optimized GPU launch
  • CUDA streams
  • HPX futures
  • Non-blocking launcher
• Overlapping
  • CPU/GPU tasks
  • Computation and communication

Page 44: Results

Results

Page 45: FMM Node-Level Results on Piz Daint

FMM Node-Level Results on Piz Daint

• Ported the most important FMM kernels to the GPU
• Scenario: V1309 contact binary merger
• Setup: single node, 12 HPX threads, 128 launch slots (CUDA streams)
• 10928 sub-grids, resulting in 5595136 cells

Utilized Hardware (one Piz Daint node)   FMM runtime   FMM GFLOP/s   Fraction of peak   Total scenario runtime (FMM + Hydro + Others)
Intel Xeon E5-2690v3 (CPU only)          980 s         157 GFLOP/s   31% [CPU]          2415 s
with 1x NVIDIA P100 (PCI-E)              158 s         973 GFLOP/s   21% [GPU]          1592 s


Page 47: Distributed Results

Distributed Results

[Figure: log-log plot of speedup w.r.t. sub-grids on one node against the number of nodes (2 to 8192), with one pair of curves per refinement level (Level 14 to Level 17): red curves use the MPI parcelport, blue curves the Libfabric parcelport.]

• The red lines show the results using HPX's MPI parcelport and the blue lines the results using HPX's Libfabric parcelport, respectively
• The number of subgrids ranges from 10928 (level 14) to 1.5 million subgrids (level 17), depending on the refinement level
• Achieved a weak scaling efficiency of 68.1% with 2048 nodes on Piz Daint on level 17
• At 4096 nodes: 2.7x speedup using Libfabric


Page 49: Summary

Summary

• The HPX programming model exposes easy synchronization with futures for
  • Networking
  • GPU / CUDA
  • CPU
• Reduced overhead to maximize throughput

Page 50: Thank you for your attention!

Thank you for your attention!

Page 51: Distributed Results

Distributed Results

[Figure: ratio of processed subgrids per second (Libfabric vs. MPI parcelport), roughly between 1 and 2.5, plotted against the number of nodes (2 to 8,192) for refinement levels 14, 15 and 16.]

• Ratio of processed sub-grids per second between HPX's Libfabric and MPI parcelports on Piz Daint
• The switch to Libfabric did not require any changes within Octo-Tiger

