BW Webinar ‘18

Charm++ and Adaptive MPI

5/30/18


Challenges in Parallel Programming

• Applications are getting more sophisticated
  – Adaptive refinement
  – Multi-scale, multi-module, multi-physics
  – E.g. load imbalance emerges as a huge problem for some apps
• Exacerbated by strong scaling needs from apps
  – Strong scaling: run an application with the same input data on more processors, and get better speedups
  – Weak scaling: larger datasets on more processors in the same time
• Hardware variability
  – Static/dynamic
  – Heterogeneity: processor types, process variation, etc.
  – Power/Temperature/Energy
  – Component failure
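The strong and weak scaling definitions above can be turned into numbers with the usual textbook ratios. The sketch below is illustrative only: the function names are hypothetical and the formulas are the standard definitions, not something stated on the slides.

```cpp
#include <cassert>

// Strong scaling: the same input is run on more processors.
// t1 is the time on one processor, tp the time on p processors.
inline double speedup(double t1, double tp) { return t1 / tp; }

// Parallel efficiency of a strong-scaling run: speedup divided by processor count.
inline double strong_efficiency(double t1, double tp, int p) {
  return speedup(t1, tp) / p;
}

// Weak scaling: the input grows with p, so the ideal is constant time.
// Efficiency is then the ratio of the baseline time to the scaled-up time.
inline double weak_efficiency(double t1, double tp) { return t1 / tp; }
```

For example, a run that drops from 100 s to 25 s on 8 processors has a speedup of 4 and a strong-scaling efficiency of 0.5.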

Our View

• To deal with these challenges, we must seek:
  – Not full automation
  – Not full burden on app developers
  – But: a good division of labor between the system and app developers
    • Programmer: what to do in parallel; System: where, when
• Develop language driven by needs of real applications
  – Avoid “platonic” pursuit of “beautiful” ideas
  – Co-developed with NAMD, ChaNGa, OpenAtom, ...
• Pragmatic focus
  – Ground-up development, portability
  – Accessibility for a broad user base


What is Charm++?

• Charm++ is a generalized approach to writing parallel programs
  – An alternative to the likes of MPI, UPC, GA, etc.
  – But not to sequential languages such as C, C++, and Fortran
• Represents:
  – The style of writing parallel programs
  – The runtime system
  – And the entire ecosystem that surrounds it
• Three design principles:
  – Overdecomposition, Migratability, Asynchrony

Overdecomposition

• Decompose the work units & data units into many more pieces than execution units
  – Cores/Nodes/…
• Not so hard: we do decomposition anyway
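A minimal sketch of what "many more pieces than execution units" looks like as an initial placement. The block mapping and the names `block_map` and `chunks_per_pe` are assumptions made for illustration; a runtime is free to place and later move the pieces differently.

```cpp
#include <cassert>
#include <vector>

// Assign num_chunks work units to num_pes execution units in contiguous blocks.
// Returns, for each chunk, the PE that initially hosts it.
std::vector<int> block_map(int num_chunks, int num_pes) {
  std::vector<int> home(num_chunks);
  for (int c = 0; c < num_chunks; ++c) {
    // Integer arithmetic spreads any remainder across the PEs.
    home[c] = static_cast<int>((static_cast<long long>(c) * num_pes) / num_chunks);
  }
  return home;
}

// Count how many chunks each PE hosts under a given placement.
std::vector<int> chunks_per_pe(const std::vector<int>& home, int num_pes) {
  std::vector<int> count(num_pes, 0);
  for (int pe : home) ++count[pe];
  return count;
}
```

The number of chunks is chosen first and the number of PEs only enters at placement time, so the same decomposition can be reused on a different machine size.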

Migratability

• Allow these work and data units to be migratable at runtime
  – i.e. the programmer or runtime can move them
• Consequences for the application developer
  – Communication must now be addressed to logical units with global names, not to physical processors
  – But this is a good thing
• Consequences for RTS
  – Must keep track of where each unit is
  – Naming and location management
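The two consequences above (senders use logical names, the RTS tracks locations) can be sketched with a toy lookup table. This is an illustration under simplifying assumptions, not the Charm++ location manager: `LocationTable` and `route` are hypothetical names, and a real runtime distributes this information rather than keeping one central map.

```cpp
#include <cassert>
#include <unordered_map>

// Toy location table: maps a logical unit's global index to the PE hosting it.
class LocationTable {
 public:
  void place(int unit, int pe) { where_[unit] = pe; }
  // Migration only changes the table entry; senders keep using the same index.
  void migrate(int unit, int new_pe) { where_.at(unit) = new_pe; }
  int lookup(int unit) const { return where_.at(unit); }

 private:
  std::unordered_map<int, int> where_;
};

// A sender addresses a logical unit; the PE is resolved at delivery time.
inline int route(const LocationTable& table, int unit) {
  return table.lookup(unit);
}
```

Because the sender never names a processor, moving unit 23 from one PE to another requires no change in application code.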

Asynchrony: Message-Driven Execution

• With over-decomposition and migratability:
  – You have multiple units on each processor
  – They address each other via logical names
• Need for scheduling:
  – What sequence should the work units execute in?
  – One answer: let the programmer sequence them
    • Seen in current codes, e.g. some AMR frameworks
  – Message-driven execution:
    • Let the work unit that happens to have data (“message”) available for it execute next
    • Let the RTS select among ready work units
    • Programmer should not specify what executes next, but can influence it via priorities
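A minimal sketch of message-driven execution with priorities, assuming one scheduler per processor. The `Scheduler` class, the "smaller value runs first" convention, and the tie-breaking by arrival order are all choices made for this sketch, not the Charm++ scheduler.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// A message carries the work to run when it is delivered to its target unit.
struct Message {
  int priority;               // smaller value runs earlier (assumption of this sketch)
  long seq;                   // arrival order, used to break priority ties
  std::function<void()> run;  // handler invoked on delivery
};

// Ordering for the priority queue: true when a should run after b.
struct RunsAfter {
  bool operator()(const Message& a, const Message& b) const {
    if (a.priority != b.priority) return a.priority > b.priority;
    return a.seq > b.seq;
  }
};

class Scheduler {
 public:
  void enqueue(int priority, std::function<void()> run) {
    queue_.push(Message{priority, next_seq_++, std::move(run)});
  }
  // Deliver messages until none are left; handlers may enqueue more.
  void run_until_idle() {
    while (!queue_.empty()) {
      Message m = queue_.top();  // copy out before popping
      queue_.pop();
      m.run();
    }
  }

 private:
  std::priority_queue<Message, std::vector<Message>, RunsAfter> queue_;
  long next_seq_ = 0;
};
```

The program never states which unit executes next. It only makes messages available, and the scheduler picks among whatever is ready, with priorities as a hint.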


Realization of This Model in Charm++

• Overdecomposed entities: chares
  – Chares are C++ objects
  – With methods designated as “entry” methods
    • Which can be invoked asynchronously by remote chares
  – Chares are organized into indexed collections
    • Each collection may have its own indexing scheme
      – 1D, ..., 6D
      – Sparse
      – Bitvector or string as an index
  – Chares communicate via asynchronous method invocations
    • A[i].foo(…);
      – A is the name of a collection, i is the index of the particular chare.
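To make the meaning of A[i].foo(…) concrete, here is a self-contained mock in plain C++. It is not the Charm++ API: `Cell`, `ElementProxy`, `ArrayProxy`, and `deliver_all` are hypothetical stand-ins, and everything lives in one process. The point it illustrates is that the call returns immediately after queuing a message, and the method body runs later on delivery.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Pending invocations; a real runtime would keep one queue per processor.
using MessageQueue = std::deque<std::function<void()>>;

// Stand-in for a chare: a plain C++ object with one "entry" method.
struct Cell {
  int total = 0;
  void foo(int x) { total += x; }  // runs when the message is delivered
};

// Proxy to one element: foo() sends a message instead of calling directly.
class ElementProxy {
 public:
  ElementProxy(Cell& target, MessageQueue& q) : target_(target), q_(q) {}
  void foo(int x) {
    Cell* t = &target_;
    q_.push_back([t, x] { t->foo(x); });
  }

 private:
  Cell& target_;
  MessageQueue& q_;
};

// Proxy to an indexed collection, so that A[i].foo(x) reads as on the slide.
class ArrayProxy {
 public:
  ArrayProxy(int n, MessageQueue& q) : cells_(n), q_(q) {}
  ElementProxy operator[](int i) { return ElementProxy(cells_.at(i), q_); }
  const Cell& peek(int i) const { return cells_.at(i); }

 private:
  std::vector<Cell> cells_;
  MessageQueue& q_;
};

// Drain the queue, delivering each message in arrival order.
inline void deliver_all(MessageQueue& q) {
  while (!q.empty()) {
    std::function<void()> m = std::move(q.front());
    q.pop_front();
    m();
  }
}
```

After `A[23].foo(7)` the target's state is still unchanged. It changes only once `deliver_all` hands the queued message to the object.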

Global Object Space

[Figure: a Global Object Space spanning Processor 0, Processor 1, Processor 2, and Processor 3, each with its own Scheduler and Message Queue]

Message-driven Execution

[Figure: Processor 0 and Processor 1, each with a Scheduler and a Message Queue, with the invocation A[23].foo(…)]

[Figures: Processor 0, Processor 1, Processor 2, and Processor 3, each with a Scheduler and a Message Queue]


Empowering the RTS

[Figure labels: Asynchrony, Overdecomposition, Migratability; Adaptive Runtime System; Introspection, Adaptivity]

• The Adaptive RTS can:
  – Dynamically balance loads
  – Optimize communication:
    • Spread over time, async collectives
  – Automatic latency tolerance
  – Prefetch data with almost perfect predictability


Charm++ and CSE Applications

Enabling CS technology of parallel objects and intelligent runtime systems has led to several CSE collaborative applications

[Figure labels: Well-known Biophysics Molecular Simulation App (Gordon Bell Award, 2002); Computational Astronomy; Nano-Materials; Synergy]

Summary: What is Charm++?

• Charm++ is a way of parallel programming
• It is based on:
  – Objects
  – Overdecomposition
  – Asynchrony
    • Asynchronous method invocations
  – Migratability
  – Adaptive runtime system
• It has been co-developed synergistically with multiple CSE applications

Grainsize

• Charm++ philosophy:
  – Let the programmer decompose their work and data into coarse-grained entities
• It is important to understand what I mean by coarse-grained entities
  – You don’t write sequential programs that some system will auto-decompose
  – You don’t write programs where there is one object for each float
  – You consciously choose a grainsize, but choose it independently of the number of processors
    • Or parameterize it, so you can tune later

Crack Propagation

[Figure: decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle]

This is 2D, circa 2002, but shows overdecomposition for unstructured meshes


Working definition of grainsize: amount of computation per remote interaction

Choose grainsize to be just large enough to amortize the overhead
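One way to read "just large enough to amortize the overhead" is as a bound on the fraction of time spent on per-interaction overhead. The model below is a simplifying assumption (a fixed overhead per remote interaction, measured in microseconds), and the function names are hypothetical.

```cpp
#include <cassert>

// Fraction of time lost to overhead when each remote interaction costs
// overhead_us and is accompanied by grain_us of useful computation.
inline double overhead_fraction(double grain_us, double overhead_us) {
  return overhead_us / (grain_us + overhead_us);
}

// Smallest grain that keeps the overhead fraction at or below max_fraction.
// From o / (g + o) <= f it follows that g >= o * (1 - f) / f.
inline double min_grain_us(double overhead_us, double max_fraction) {
  return overhead_us * (1.0 - max_fraction) / max_fraction;
}
```

Under this model, an overhead of 10 microseconds per interaction and a 10% overhead budget call for roughly 90 microseconds of computation per interaction. Going much larger than that gives up flexibility for the load balancer without buying much efficiency.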

Grainsize in a common setting

[Plot: axis label “number of points per chare”; annotation “2 MB/chare, 256 objects per core”]


Grainsize: Weather Forecasting in BRAMS

• BRAMS: Brazilian weather code (based on RAMS)
• AMPI version (Eduardo Rodrigues, with Mendes, J. Panetta, ..)

Instead of using 64 work units on 64 cores, used 1024 on 64

Baseline: 64 Objects

[Plot: Profile of Usage for Processors 0-63. Vertical axis: Usage Percent (%), 0 to 100. Horizontal axis: PE 0 to 62, plus Avg.]

Time per Step: 46s

Overdecomposition: 1024 Objects

[Plot: Profile of Usage for Processors 0-63. Vertical axis: Usage Percent (%), 0 to 100. Horizontal axis: PE 0 to 62, plus Avg.]

Time per Step: 33s

Benefits from communication/computation overlap


With Load Balancing: 1024 objects

[Plot: Usage Profile for Processors 0-63. Vertical axis: Usage Percent (%), 80 to 100. Horizontal axis: PE 0 to 62, plus Avg.]

Time per Step: 27s

Configuration                        Time per step
No overdecomp (64 threads)           46 sec
+ Overdecomposition (1024 threads)   33 sec
+ Load balancing (1024 threads)      27 sec
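Read as ratios, the three per-step times give the following improvements. These figures are derived here from the 46 s, 33 s, and 27 s values and are not stated on the slides.

```latex
\frac{46\,\mathrm{s}}{33\,\mathrm{s}} \approx 1.39
\quad\text{(overdecomposition vs. baseline)},\qquad
\frac{33\,\mathrm{s}}{27\,\mathrm{s}} \approx 1.22
\quad\text{(adding load balancing)},\qquad
\frac{46\,\mathrm{s}}{27\,\mathrm{s}} \approx 1.70
\quad\text{(both vs. baseline)}
```

Equivalently, the time per step falls by about 41% overall, since (46 - 27) / 46 ≈ 0.41.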

Charm++ Benefits

[Figure: diagram with the following labels]
• Overdecomposition
• Migratability
• Message-driven execution
• Introspective and adaptive runtime system
• Scalable tools
• Automatic overlap of communication and computation
• Emulation for performance prediction
• Fault tolerance
• Dynamic load balancing (topology-aware, scalable)
• Temperature/power/energy optimizations
• Perfect prefetch
• Compositionality


Locality and Prefetch

• Objects connote and promote locality
• Message-driven execution
  – A strong principle of prediction for data and code use
  – Much stronger than principle of locality
    • Can be used to scale the memory wall:
    • Prefetching of needed data:
      – Into scratchpad memories, for example

[Figure: Processor 1 with its Scheduler and Message Queue]

Impact on Communication

• Current use of communication network:
  – Compute-communicate cycles in typical MPI apps
  – The network is used for a fraction of time
    • And is on the critical path
• Current communication networks are over-engineered by necessity

[Figure: timelines for P1 and P2 in a BSP-based application]

Impact on Communication

• With overdecomposition:
  – Communication is spread over an iteration
  – Adaptive overlap of communication and computation

[Figure: timelines for P1 and P2; overdecomposition enables overlap]
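A back-of-the-envelope model of why overlap helps, assuming an idealized step in which computation and communication can be fully separated or fully overlapped. Real applications fall between the two bounds; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Compute-then-communicate: the network idles during compute and the core
// waits during communication, so the two costs add up.
inline double step_time_without_overlap(double compute_s, double comm_s) {
  return compute_s + comm_s;
}

// Best case with overdecomposition: while one object waits on its messages,
// another object on the same core computes, so the longer phase dominates.
inline double step_time_with_full_overlap(double compute_s, double comm_s) {
  return std::max(compute_s, comm_s);
}
```

With 8 s of compute and 2 s of communication per step, the bounds are 10 s without overlap and 8 s with full overlap.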


Communication Data from Chombo

[Figure: communication data for two cases: Chombo with reductions, and Chombo on Charm (experimental). Work by Phil Miller]

Decomposition Challenges

• Current method is to decompose to processors
  – This has many problems
  – Deciding which processor does what work in detail is difficult at large scale
• Decomposition should be independent of number of processors
  – Enabled by object-based decomposition
• Let runtime system (RTS) assign objects to available resources adaptively
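As an illustration of a runtime assigning objects to resources, here is a generic greedy heuristic: place the heaviest object first, always onto the currently least-loaded PE. It is a textbook strategy shown under the assumption that per-object loads are already measured; it is not presented as the algorithm any particular Charm++ load balancer uses, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Greedy assignment: heaviest object first, always onto the least-loaded PE.
// Returns assignment[i] = PE chosen for object i.
std::vector<int> greedy_assign(const std::vector<double>& load, int num_pes) {
  std::vector<int> order(load.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return load[a] > load[b]; });

  // Min-heap of (current PE load, PE id).
  using Entry = std::pair<double, int>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
  for (int p = 0; p < num_pes; ++p) pes.push({0.0, p});

  std::vector<int> assignment(load.size());
  for (int obj : order) {
    Entry e = pes.top();
    pes.pop();
    assignment[obj] = e.second;
    pes.push({e.first + load[obj], e.second});
  }
  return assignment;
}

// Load of the busiest PE under a given assignment.
double max_pe_load(const std::vector<double>& load,
                   const std::vector<int>& assignment, int num_pes) {
  std::vector<double> total(num_pes, 0.0);
  for (std::size_t i = 0; i < load.size(); ++i) total[assignment[i]] += load[i];
  return *std::max_element(total.begin(), total.end());
}
```

The heuristic only works well when there are many more objects than PEs, which is exactly what overdecomposition provides.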


Decomposition Independent of numCores

• Rocket simulation example under traditional MPI

[Figure: processors 1, 2, ..., P, each holding one Solid and one Fluid partition]

• With migratable objects:
  – Benefit: load balance, communication optimizations, modularity

[Figure: Solid1, Solid2, Solid3, ..., Solidn and Fluid1, Fluid2, ..., Fluidm as separate objects]

Adaptive MPI


What is Adaptive MPI?

• AMPI is an MPI implementation on top of Charm++’s runtime system
  – Enables Charm++’s dynamic features for pre-existing MPI codes


Process Virtualization

• AMPI virtualizes MPI “ranks”, implementing them as migratable user-level threads rather than OS processes
  – Benefits:
    • Communication/computation overlap
    • Cache benefits from smaller working sets
    • Dynamic load balancing
    • Lower latency messaging within a process
  – Disadvantages:
    • Global/static variables are shared by all threads in an OS process scope
      – AMPI provides support for privatizing them automatically at compile/run-time
      – Ongoing work to fully automate
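The global-variable hazard can be shown without any threading library by replaying the interleaving by hand. This sketch only illustrates the problem and a manual fix (moving state into a per-rank structure); it does not show AMPI's own privatization mechanisms, and all names are hypothetical.

```cpp
#include <cassert>

// Unsafe under virtualization: one copy per OS process, so every virtual
// rank scheduled in that process shares it.
int g_my_rank = -1;

// Safer: state that belongs to a rank is stored with that rank.
struct RankState {
  int my_rank = -1;
};

// Two virtual ranks interleave in one process, as user-level threads would.
// Returns what rank 0 reads back after rank 1 has run in between.
inline int read_back_with_global() {
  g_my_rank = 0;     // rank 0 records its id, then yields (e.g. in a receive)
  g_my_rank = 1;     // rank 1 runs in the same process and overwrites it
  return g_my_rank;  // rank 0 resumes and sees rank 1's value
}

// The same interleaving with per-rank state: each rank sees its own value.
inline int read_back_with_private_state() {
  RankState r0, r1;
  r0.my_rank = 0;
  r1.my_rank = 1;
  return r0.my_rank;
}
```

With one rank per OS process the global version happens to work, which is why the issue only shows up once ranks are virtualized.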



Dynamic Load Balancing

• Isomalloc memory allocator
  – No need for the user to explicitly write de/serialization (PUP) routines
  – Memory allocator migrates all heap data and stack transparently
  – Works on all 64-bit platforms except BGQ & Windows
• AMPI ranks are migratable across address spaces at runtime
  – Add a call to AMPI_Migrate(MPI_Info) in the application’s main iterative loop


Page 80: Charm++ and Adaptive MPI - Blue Waters · BW Webinar ‘18 Migratability • Allow these work and data units to be migratable at runtime – i.e. the programmer or runtime can move

BWWebinar‘18

Fault Tolerance• AMPI ranks can be migrated to persistent storage or in remote

memories for fault tolerance – Storage can be Disk, SSD, NVRAM, etc.

• The runtime uses a scalable fault detection algorithm and restarts automatically on a failure

– Restart is online, within the same job

• Checkpointing strategy is specified by passing a different MPI_Info to AMPI_Migrate()
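The checkpoint-then-restart idea can be illustrated with a toy model in plain C++. This is only a sketch under stated assumptions: it does not use AMPI, the names State, step and runWithCheckpoints are hypothetical, and an in-memory copy stands in for disk or remote memory. It shows that rolling back to the last checkpoint and continuing in the same run gives the same final result as a failure-free run.

```cpp
#include <cassert>
#include <vector>

// Toy application state: an iteration counter plus some data.
struct State {
    int iteration = 0;
    std::vector<double> data;
};

// One step of a made-up, deterministic iterative computation.
void step(State& s) {
    for (double& x : s.data) x = 0.5 * x + 1.0;
    ++s.iteration;
}

// Runs until `totalIters` steps are done, checkpointing every `period` steps.
// If `failAt` is non-negative, the live state is lost once when that
// iteration is reached, and the run resumes from the last checkpoint.
State runWithCheckpoints(State s, int totalIters, int period, int failAt) {
    assert(period > 0);
    State checkpoint = s;      // stands in for disk or a remote memory
    bool failed = false;
    while (s.iteration < totalIters) {
        if (!failed && s.iteration == failAt) {
            failed = true;
            s = checkpoint;    // restart: roll back, stay in the same run
            continue;
        }
        step(s);
        if (s.iteration % period == 0) checkpoint = s;
    }
    return s;
}
```

Calling runWithCheckpoints(init, 12, 5, 7) loses the state at iteration 7, rolls back to the checkpoint taken at iteration 5, and still finishes at iteration 12 with the same data as a run with failAt = -1.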


Communication Optimizations

• Along with overlapping communication, AMPI optimizes for communication locality:
  – Within a core, within a process, within a host, etc.
  – Communication-aware load balancers can maximize locality
• AMPI outperforms process-based MPIs for messages within a process
  – All messaging is done in user-space: no kernel involvement
  – Utilize the full memory bandwidth on a node for messaging

[Figure not reproduced: OSU MPI Benchmarks on Quartz, an Intel Omni-Path cluster at LLNL]
Compiling & Running AMPI Programs

• To compile an AMPI program:
  – charm/bin/ampicc -o pgm pgm.o
  – For migratability, link with: -memory isomalloc
  – For LB strategies, link with: -module CommonLBs
• To run an AMPI job, specify the # of virtual processes (+vp)
  – ./charmrun +p 1024 ./pgm
  – ./charmrun +p 1024 ./pgm +vp 16384
  – ./charmrun +p 1024 ./pgm +vp 16384 +balancer RefineLB
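The relation between +p and +vp in the run lines above is simple arithmetic, sketched here in plain C++. The helper name virtualizationRatio is hypothetical, and the result is an average: how ranks are actually placed is up to the runtime and the load balancer.

```cpp
#include <cassert>

// Average number of virtual ranks (+vp) hosted per core (+p).
// This is only the average; actual placement is decided by the runtime.
double virtualizationRatio(int virtualRanks, int cores) {
    assert(cores > 0 && virtualRanks >= 0);
    return static_cast<double>(virtualRanks) / cores;
}
```

For example, +p 1024 with +vp 16384 gives virtualizationRatio(16384, 1024) = 16 virtual ranks per core on average.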


Case Study

• LULESH proxy-application (LLNL)
  – Shock hydrodynamics on an unstructured mesh
  – With artificial load imbalance included to test runtimes
• No mutable global/static variables: can run on AMPI as is
  1. Replace mpicc with ampicc
  2. Link with "-module CommonLBs -memory isomalloc"
  3. Run with # of virtual processes and a load balancing strategy:
     • ./charmrun +p 2048 ./lulesh2.0 +vp 16384 +balancer GreedyLB
LULESH: Without Virtualization & LB

• Load imbalance appears during pt2pt messaging and in MPI_Allreduce each timestep
• Communication/computation cycles mean the network is underutilized most of the time
LULESH: With 8x Virtualization & LB

[Timeline figure not reproduced; its legend marks GreedyLB]

• Most of the communication time is overlapped by computation after load balancing
• The communication of each virtual rank is overlapped with the computation of others scheduled on the same core
  – Projections allows viewing all virtual ranks on a PE, not only what is currently scheduled on one
    • In Projections Timeline, select: View -> Show Nested Bracketed User Events
• Communication is spread over the whole timestep
  – Peak network bandwidth used is reduced by 3x
AMPI Summary

• AMPI provides the dynamic RTS support of Charm++ with the familiar API of MPI
  – Communication optimizations
  – Dynamic load balancing
  – Automatic fault tolerance
  – Checkpoint/restart
  – OpenMP runtime integration
• See the AMPI Manual for more info.
Hello World with Chares

hello.ci

  mainmodule hello {
    mainchare Main {
      entry Main(CkArgMsg *m);
    };
    chare Singleton {
      entry Singleton();
    };
  };

hello.cpp

  #include "hello.decl.h"

  class Main : public CBase_Main {
   public:
    Main(CkArgMsg *m) { CProxy_Singleton::ckNew(); };
  };

  class Singleton : public CBase_Singleton {
   public:
    Singleton() {
      ckout << "Hello World!" << endl;
      CkExit();
    };
  };

  #include "hello.def.h"

The .ci file is processed to generate code for classes such as CBase_Main, CBase_Singleton, CProxy_Singleton
Charm++ File Structure

• C++ objects (including Charm++ objects)
  – Defined in regular .h and .cpp files
• Chare objects, entry methods (asynchronous methods)
  – Defined in .ci file
  – Implemented in the .cpp file

Hello World Example

• Compiling
  – charmc hello.ci
  – charmc -c hello.cpp
  – charmc -o hello hello.o
• Running
  – ./charmrun +p7 ./hello
  – The +p7 tells the system to use seven cores

Compiling a Charm++ Program

Charm Termination

• There is a special system call CkExit() that terminates the parallel execution on all processors (but it is called on one processor) and performs the requisite cleanup
• The traditional exit() is insufficient because it only terminates one process, not the entire parallel job (and will cause a hang)
• CkExit() should be called when you can safely terminate the application (you may want to synchronize before calling this)
Entry Method Invocation Example: .ci file

  mainmodule MyModule {
    mainchare Main {
      entry Main(CkArgMsg *m);
    };
    chare Simple {
      entry Simple(double y);
      entry void findArea(int radius, bool done);
    };
  };
Does this program execute correctly?

  struct Main : public CBase_Main {
    Main(CkArgMsg *m) {
      CProxy_Simple sim = CProxy_Simple::ckNew(3.1415);
      for (int i = 1; i < 10; i++) sim.findArea(i, false);
      sim.findArea(10, true);
    }
  };

  struct Simple : public CBase_Simple {
    double y;
    Simple(double pi) { y = pi; }
    void findArea(int r, bool done) {
      ckout << "Area: " << y * r * r << endl;
      if (done) CkExit();
    }
  };
No! Methods are Asynchronous

• If a chare sends multiple entry method invocations
• These may be delivered in any order

  sim.findArea(1, false);
  ...
  sim.findArea(10, true);

  Simple::findArea(int r, bool done) {
    ckout << "Area: " << y * r * r << endl;
    if (done) CkExit();
  }

• Output:

  Area: 254.34
  Area: 200.96
  Area: 28.26
  Area: 3.14
  Area: 12.56
  Area: 153.86
  Area: 50.24
  Area: 78.50
  Area: 314.00

  or

  Area: 28.26
  Area: 78.50
  Area: 3.14
  Area: 113.04
  Area: 314.00

• Exiting only after all ten invocations have arrived (assuming count is an int member of Simple initialized to 0):

  Simple::findArea(int r, bool done) {
    ckout << "Area: " << y * r * r << endl;
    if (++count == 10) CkExit();
  }
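The difference between exiting on the done flag and exiting on a count can be tried out without Charm++. The sketch below is a plain C++ model, not Charm++ code: a random shuffle stands in for "delivered in any order", and the names Invocation and processedBeforeExit are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

struct Invocation {
    int radius;
    bool done;
};

// Delivers the ten findArea invocations in a shuffled order and returns how
// many were processed before the exit condition fired.
// useCounter == false: exit when the done flag arrives (the original check).
// useCounter == true : exit after the tenth delivery (the counting check).
int processedBeforeExit(unsigned seed, bool useCounter) {
    std::vector<Invocation> inbox;
    for (int r = 1; r <= 10; ++r) inbox.push_back({r, r == 10});

    std::mt19937 rng(seed);
    std::shuffle(inbox.begin(), inbox.end(), rng);  // models arbitrary delivery order

    int count = 0;
    for (const Invocation& inv : inbox) {
        ++count;  // this invocation's area would be printed here
        bool exitNow = useCounter ? (count == 10) : inv.done;
        if (exitNow) break;
    }
    return count;
}
```

With useCounter set to true the function returns 10 for every seed. With the done-flag check, it returns fewer than 10 whenever the shuffle places the done invocation before the end, which corresponds to the truncated outputs shown on the slide.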

Chare Arrays

• Indexed collections of chares
  – Every item in the collection has a unique index and proxy
  – Can be indexed like an array or by an arbitrary object
  – Can be sparse or dense
  – Elements may be dynamically inserted and deleted
  – Elements are distributed across the available processors
    • May be migrated to other nodes by the user or the runtime
• For many scientific applications, collections of chares are a convenient abstraction
Declaring a Chare Array

.ci file, starting from regular chares:

  chare foo {
    entry foo();  // constructor
    // ... entry methods ...
  }
  chare bar {
    entry bar();  // constructor
    // ... entry methods ...
  }

.ci file, declared as chare arrays:

  array [1d] foo {
    entry foo();  // constructor
    // ... entry methods ...
  }
  array [2d] bar {
    entry bar();  // constructor
    // ... entry methods ...
  }
Constructing a Chare Array

• Constructed much like a regular chare, using ckNew
• The size of each dimension is passed to the constructor at the end

  void someMethod() {
    CProxy_foo myFoo = CProxy_foo::ckNew(<params>, 10);    // 1d, size 10
    CProxy_bar myBar = CProxy_bar::ckNew(<params>, 5, 5);  // 2d, size 5x5
  }

• The proxy represents the entire array, and may be indexed to obtain a proxy to an individual element in the array

  myFoo[4].invokeEntry(...);
  myBar(2,4).method3(...);
thisIndex

• 1d: thisIndex returns the index of the current chare array element
• 2d: thisIndex.x and thisIndex.y return the indices of the current chare array element

.ci file:

  array [1d] foo {
    entry foo();
  }

.cpp file:

  struct foo : public CBase_foo {
    foo() { ckout << "array index: " << thisIndex; }
  };
Chare Array: Hello Example

.ci file:

  mainmodule arr {
    mainchare Main {
      entry Main(CkArgMsg *);
    }
    array [1D] hello {
      entry hello(int);
      entry void printHello();
    }
  }

.cpp file:

  #include "arr.decl.h"

  struct Main : CBase_Main {
    Main(CkArgMsg *msg) {
      int arraySize = atoi(msg->argv[1]);
      CProxy_hello p = CProxy_hello::ckNew(arraySize, arraySize);
      p[0].printHello();
    }
  };

  struct hello : CBase_hello {
    int arraySize;
    hello(int n) : arraySize(n) {}
    void printHello() {
      CkPrintf("PE[%d]: hello from p[%d]\n", CkMyPe(), thisIndex);
      if (thisIndex == arraySize - 1) CkExit();
      else thisProxy[thisIndex + 1].printHello();
    }
  };

  #include "arr.def.h"
Broadcast

• A message to each object in a collection
• The chare array proxy object is used to perform a broadcast
• It looks like a function call to the proxy object
• From a chare array element that is a member of the same array:

  thisProxy.foo();

• From any chare that has a proxy p to the chare array:

  p.foo();
Reduction

• Combines a set of values:
  – sum, max, ...
• The operator must be commutative and associative
• Each object calls contribute in a reduction
Reduction: Example

  #include "reduction.decl.h"

  const int numElements = 49;

  class Main : public CBase_Main {
   public:
    Main(CkArgMsg *msg) { CProxy_Elem::ckNew(thisProxy, numElements); }
    void done(int value) {
      CkPrintf("value: %d\n", value);
      CkExit();
    }
  };

  class Elem : public CBase_Elem {
   public:
    Elem(CProxy_Main mProxy) {
      int val = thisIndex;
      CkCallback cb(CkReductionTarget(Main, done), mProxy);
      contribute(sizeof(int), &val, CkReduction::sum_int, cb);
    }
  };

  #include "reduction.def.h"

Output:

  value: 1176
  Program finished.
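The value 1176 is the sum of the indices 0 through 48. As a rough illustration of why the operator must be commutative and associative, here is a plain C++ sketch that combines the contributions pairwise, level by level, rather than left to right. It is not how Charm++ implements reductions; treeSum and sumOfIndices are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise (tree-shaped) sum: combines neighbours level by level instead of
// using a single left-to-right loop.
int treeSum(std::vector<int> values) {
    if (values.empty()) return 0;
    while (values.size() > 1) {
        std::vector<int> next;
        for (std::size_t i = 0; i + 1 < values.size(); i += 2)
            next.push_back(values[i] + values[i + 1]);
        if (values.size() % 2 == 1) next.push_back(values.back());
        values = next;
    }
    return values[0];
}

// Each of the n elements contributes its own index, as Elem does with thisIndex.
int sumOfIndices(int n) {
    assert(n >= 0);
    std::vector<int> contributions;
    for (int i = 0; i < n; ++i) contributions.push_back(i);
    return treeSum(contributions);
}
```

sumOfIndices(49) gives 1176, matching the output above. Because addition is commutative and associative, the grouping and order of the partial sums do not change the result.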


Chare Arrays view

Dynamic Load Balancing

• Object-based decomposition (i.e. virtualized decomposition) helps
  – Charm++ RTS reassigns objects to PEs to balance load
  – But how does the RTS decide?
    • Multiple strategy options
    • E.g. just move objects away from overloaded processors to underloaded processors
  – How is load determined?

Measurement Based Load Balancing

• Principle of Persistence
  – Object communication patterns and computational loads tend to persist over time
  – In spite of dynamic behavior
    • Abrupt but infrequent changes
    • Slow and small changes
  – Recent past is a good predictor of near future
• Runtime instrumentation
  – Measures communication volume and computation time
• Measurement-based load balancers
  – Measure load information for chares
  – Periodically use the instrumented database to make new decisions and migrate objects
  – Many alternative strategies can use the database
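One possible strategy over measured loads is a greedy assignment, sketched below in plain C++. This is a generic textbook heuristic and is not claimed to be the implementation of any Charm++ balancer; it ignores communication and migration cost, and the names greedyAssign and maxPeLoad are hypothetical. The measured load of each object from the last interval is used as the prediction for the next one, following the principle of persistence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Greedy assignment: take objects from heaviest to lightest and give each one
// to the currently least-loaded PE. Returns assignment[i] = PE for object i.
std::vector<int> greedyAssign(const std::vector<double>& loads, int numPes) {
    assert(numPes > 0);
    std::vector<std::size_t> order(loads.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return loads[a] > loads[b]; });

    // Min-heap of (current load, PE id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
    for (int pe = 0; pe < numPes; ++pe) pes.push({0.0, pe});

    std::vector<int> assignment(loads.size(), -1);
    for (std::size_t obj : order) {
        Entry least = pes.top();
        pes.pop();
        assignment[obj] = least.second;
        pes.push({least.first + loads[obj], least.second});
    }
    return assignment;
}

// Largest total load on any PE under the given assignment.
double maxPeLoad(const std::vector<double>& loads,
                 const std::vector<int>& assignment, int numPes) {
    std::vector<double> perPe(numPes, 0.0);
    for (std::size_t i = 0; i < loads.size(); ++i) perPe[assignment[i]] += loads[i];
    return *std::max_element(perPe.begin(), perPe.end());
}
```

For measured loads {8, 1, 1, 1, 1, 4, 2, 2} on two PEs, placing the first four objects on one PE and the rest on the other leaves the busier PE with load 11, while the greedy assignment brings it down to 10, which is an even split of the total of 20.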

5/30/18 �67

Page 150: Charm++ and Adaptive MPI - Blue Waters · BW Webinar ‘18 Migratability • Allow these work and data units to be migratable at runtime – i.e. the programmer or runtime can move

BWWebinar‘18

Using the Load Balancer

5/30/18 �68

Page 151: Charm++ and Adaptive MPI - Blue Waters · BW Webinar ‘18 Migratability • Allow these work and data units to be migratable at runtime – i.e. the programmer or runtime can move

BWWebinar‘18

Using the Load Balancer

• Link a LB module– -module<strategy>

– RefineLB, NeighborLB, GreedyCommLB, others– EveryLB will include all load balancing strategies

5/30/18 �68

Page 152: Charm++ and Adaptive MPI - Blue Waters · BW Webinar ‘18 Migratability • Allow these work and data units to be migratable at runtime – i.e. the programmer or runtime can move

BWWebinar‘18

Using the Load Balancer

• Link a LB module– -module<strategy>

– RefineLB, NeighborLB, GreedyCommLB, others– EveryLB will include all load balancing strategies

• Compile time option (specify default balancer)– -balancerRefineLB

5/30/18 �68

Page 153: Charm++ and Adaptive MPI - Blue Waters · BW Webinar ‘18 Migratability • Allow these work and data units to be migratable at runtime – i.e. the programmer or runtime can move

BWWebinar‘18

Using the Load Balancer

• Link a LB module
  – -module <strategy>
  – RefineLB, NeighborLB, GreedyCommLB, others
  – EveryLB will include all load balancing strategies
• Compile time option (specify default balancer)
  – -balancer RefineLB
• Runtime option (override default)
  – +balancer RefineLB

Instrumentation

• By default, instrumentation is enabled
  – Automatically collects load information
• Sometimes, you want LB decisions to be based only on a portion of your program
  – To disable by default, provide runtime argument +LBOff
  – To toggle instrumentation in code, use LBTurnInstrumentOn() and LBTurnInstrumentOff()

Code to Use Load Balancing

• Set usesAtSync = true; in chare constructor
• Insert AtSync() call at a natural barrier
  – Call from every chare in all collections
  – Does not block
• Implement ResumeFromSync() to resume execution
  – A typical ResumeFromSync() contributes to a reduction

Example: Stencil

// Synchronize at every iteration: Main starts next iteration
void Main::endIter(err) {
  if (err < T) CkExit();
  else stencilProxy.sendBoundaries();
}

// Assume a 1D Stencil chare array with near neighbor communication
void Stencil::sendBoundaries() {
  thisProxy(wrap(x-1)).updateGhost(RIGHT, left_ghost);
  thisProxy(wrap(x+1)).updateGhost(LEFT, right_ghost);
}

void Stencil::updateGhost(int dir, double ghost) {
  updateBoundary(dir, ghost);
  if (++remoteCount == 2) {
    remoteCount = 0;
    doWork();
  }
}

Example: Stencil cont.

void Stencil::doWork() {
  e = (computeKernel() < DELTA);
  if (++i % 10 == 0) { AtSync(); }  // Allow load balancing every 10 iterations
  else { contribute(8, e, CkCallback(CkReductionTarget(Main, endIter), mainProxy)); }
}

void Stencil::ResumeFromSync() {
  contribute(CkCallback(CkReductionTarget(Main, endIter), mainProxy));
}

Serialization and PUP

• How can the RTS move arbitrary objects across nodes?
• Charm++ has a framework for serializing data called PUP
• PUP: Pack and Unpack
• With PUP, chares become serializable and can be transported to memory, disk, or another processor

Simple PUP for a Simple Chare

class MyChare : public CBase_MyChare {
  int a;
  float b;
  char c;
  double localArray[LOCAL_SIZE];
};

void pup(PUP::er &p) {
  p | a;
  p | b;
  p | c;
  p(localArray, LOCAL_SIZE);
}

Writing an Advanced PUP Routine

class MyChare : public CBase_MyChare {
  int heapArraySize;
  float *heapArray;
  MyClass *pointer;
};

void pup(PUP::er &p) {
  p | heapArraySize;
  if (p.isUnpacking()) {
    heapArray = new float[heapArraySize];
  }
  p(heapArray, heapArraySize);
  bool isNull = !pointer;
  p | isNull;
  if (!isNull) {
    if (p.isUnpacking()) {
      pointer = new MyClass();
    }
    p | *pointer;
  }
}
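The reason a single pup routine can serve sizing, packing and unpacking can be illustrated with a toy stand-in written in plain C++. This is only a sketch: `ToyPupper` and `Particles` are hypothetical names invented here, and the real `PUP::er` in Charm++ is considerably more capable. The sketch assumes trivially copyable element types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy stand-in for PUP::er (hypothetical, not the Charm++ class).
class ToyPupper {
 public:
  enum Mode { Sizing, Packing, Unpacking };
  ToyPupper(Mode mode, std::vector<char>* buffer)
      : mode_(mode), buf_(buffer), pos_(0) {}
  bool isUnpacking() const { return mode_ == Unpacking; }
  std::size_t size() const { return pos_; }  // bytes traversed so far

  // Traverse n trivially copyable items: count, write, or read them.
  template <typename T>
  void operator()(T* data, std::size_t n) {
    const std::size_t bytes = n * sizeof(T);
    if (mode_ == Packing) {
      buf_->resize(pos_ + bytes);
      if (bytes) std::memcpy(&(*buf_)[pos_], data, bytes);
    } else if (mode_ == Unpacking) {
      assert(pos_ + bytes <= buf_->size());
      if (bytes) std::memcpy(data, &(*buf_)[pos_], bytes);
    }
    pos_ += bytes;  // Sizing mode only advances the counter
  }

 private:
  Mode mode_;
  std::vector<char>* buf_;
  std::size_t pos_;
};

// Scalar form, mirroring the p | x notation on the slides.
template <typename T>
ToyPupper& operator|(ToyPupper& p, T& value) {
  p(&value, 1);
  return p;
}

// Hypothetical user type with heap data, in the style of the advanced example.
struct Particles {
  int count;
  float* mass;
  Particles() : count(0), mass(nullptr) {}
  ~Particles() { delete[] mass; }
  Particles(const Particles&) = delete;
  Particles& operator=(const Particles&) = delete;

  void pup(ToyPupper& p) {
    p | count;
    if (p.isUnpacking()) {  // allocate before the array is read back
      delete[] mass;
      mass = new float[count];
    }
    p(mass, count);
  }
};
```

Calling `pup` with a sizing, then a packing, then an unpacking `ToyPupper` traverses the same fields in the same order each time, which is why one routine is enough for migration, marshalling and storage.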

PUP Uses

• Moving objects for load balancing
• Marshalling user defined data types
  – When using a type you define as a parameter for an entry method
  – Type has to be serialized to go over network, uses PUP for this
  – Can add PUP to any class, doesn’t have to be a chare
• Serializing for storage

Split Execution: Checkpoint Restart

• Can use to stop execution and resume later
  – The job runs for 5 hours, then will continue in new allocation another day!
• We can use PUP for this!
• Instead of migrating to another PE, just “migrate” to disk

How to Enable Split Execution

• Call to checkpoint the application is made in the main chare at a synchronization point
• log_path is file system path for checkpoint
• Callback cb called when checkpoint (or restart) is done
  – For restart, user needs to provide argument +restart and path of checkpoint file at runtime

CkCallback cb(CkIndex_Hello::SayHi(), helloProxy);
CkStartCheckpoint("log_path", cb);

shell> ./charmrun hello +p4 +restart log_path

Chares Are Reactive

• The way we described Charm++ so far, a chare is a reactive entity:
  – If it gets this method invocation, it does this action,
  – If it gets that method invocation then it does that action
  – But what does it do?
  – In typical programs, chares have a life-cycle
• How to express the life-cycle of a chare in code?
  – Only when it exists
    • i.e. some chares may be truly reactive, and the programmer does not know the life cycle
  – But when it exists, its form is:
    • Computations depend on remote method invocations, and completion of other local computations
    • A DAG (Directed Acyclic Graph)!

Structured Dagger
The serial construct

• The serial construct
  – A sequential block of C++ code in the .ci file
  – The keyword serial means that the code block will be executed without interruption/preemption, like an entry method
  – Syntax: serial <optionalString> { /* C++ code */ }
  – The <optionalString> is used for identifying the serial for performance analysis
  – Serial blocks can access all members of the class they belong to
• Examples (.ci file):

entry void method1(parameters) {
  serial {
    thisProxy.invokeMethod(10);
    callSomeFunction();
  }
};

entry void method2(parameters) {
  serial "setValue" {
    value = 10;
  }
};

Structured Dagger
The when construct

• The when construct
  – Declare the actions to perform when a message is received
  – In sequence, it acts like a blocking receive

entry void someMethod() {
  when entryMethod1(parameters) { /* block2 */ }
  when entryMethod2(parameters) { /* block3 */ }
};

Structured Dagger
The when construct: waiting for multiple invocations

• Execute SDAG_CODE when method1 and method2 arrive

when method1(int param1, int param2), method2(bool param3)
  SDAG_CODE

• Which is semantically the same as this:

when myMethod1(int param1, int param2) {
  when myMethod2(bool param3) { }
  SDAG_CODE
}
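One way to picture a when that waits for two invocations is a small object that buffers each arrival and runs the body once both are present, in whichever order they come. The following plain C++ sketch illustrates that behavior only; `WhenBoth` is a hypothetical name and this is not the code that the Charm++ translator generates for SDAG.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch (not generated SDAG code): run a body once both
// method1 and method2 have been invoked, in either order.
class WhenBoth {
 public:
  explicit WhenBoth(std::function<void(int, int, bool)> body)
      : body_(body), have1_(false), have2_(false),
        param1_(0), param2_(0), param3_(false) {}

  void method1(int param1, int param2) {
    param1_ = param1;
    param2_ = param2;
    have1_ = true;
    tryFire();
  }

  void method2(bool param3) {
    param3_ = param3;
    have2_ = true;
    tryFire();
  }

 private:
  // The body runs only when both dependencies are satisfied.
  void tryFire() {
    if (have1_ && have2_) {
      have1_ = false;
      have2_ = false;
      body_(param1_, param2_, param3_);
    }
  }

  std::function<void(int, int, bool)> body_;
  bool have1_, have2_;
  int param1_, param2_;
  bool param3_;
};
```

Neither call blocks the caller: an early arrival is simply remembered, which matches the reactive, message-driven style of chares described earlier.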

Structured Dagger
The when construct: reference number matching

• The when clause can wait on a certain reference number
• If a reference number is specified for a when, the first parameter for the when must be the reference number
• Semantics: the when will “block” until a message arrives with that reference number

when method1[100](int ref, bool param1)
  /* sdag block */

serial {
  proxy.method1(200, false); /* will not be delivered to the when */
  proxy.method1(100, true);  /* will be delivered to the when */
}
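Reference number matching can be thought of as a mailbox keyed by reference number: messages with other numbers stay buffered until some when asks for them. The plain C++ sketch below illustrates that idea under this assumption; `RefMailbox`, `deliver` and `tryReceive` are hypothetical names, not part of Charm++.

```cpp
#include <cassert>
#include <map>
#include <queue>

// Hypothetical mailbox (not a Charm++ class): messages are buffered per
// reference number until a matching receive asks for them.
class RefMailbox {
 public:
  // Corresponds to an incoming method1(ref, payload) invocation.
  void deliver(int ref, bool payload) { pending_[ref].push(payload); }

  // Corresponds to "when method1[ref](...)": succeeds only if a message
  // carrying exactly this reference number is already buffered.
  bool tryReceive(int ref, bool* out) {
    std::map<int, std::queue<bool> >::iterator it = pending_.find(ref);
    if (it == pending_.end() || it->second.empty()) return false;
    *out = it->second.front();
    it->second.pop();
    if (it->second.empty()) pending_.erase(it);
    return true;
  }

 private:
  std::map<int, std::queue<bool> > pending_;
};
```

With this model, the message sent with reference number 200 in the example above is not lost; it waits in the buffer until a when for 200 becomes active.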

Structured Dagger
Other constructs

• if-then-else
  – Same as the typical C if-then-else semantics and syntax
• for
  – Defines a sequenced for loop (like a sequential C for loop)
• while
  – Defines a sequenced while loop (like a sequential C while loop)
• forall
  – Has “do-all” semantics: iterations may execute in any order
• overlap
  – Allows multiple independent constructs to execute in any order

http://charm.cs.illinois.edu/manuals/html/charm++/5.html

Interoperability and Within Node Parallelism

• GPGPUs are supported
  – Via a “GPU Manager” module, with asynchronous callbacks into Charm++ code
• Multicore:
  – Charm++ has its own OpenMP runtime implementation (via LLVM)
    • Highly flexible nested parallelism
  – Charm++ can run in a mode with 1 PE on each process
    • Interoperates with regular OpenMP, OMPSS, other task models
• Charm++ interoperates with MPI
  – So, some modules can be written in Charm++, rest in MPI

Control flow within chare

• Structured dagger notation
  – Provides a script-like language for expressing DAG of dependencies between method invocations and computations
• Threaded Entry methods
  – Allows entry methods to block without blocking the PE
  – Supports futures, and
  – ability to suspend/resume threads

Advanced Concepts

• Priorities
• Entry method tags
• Quiescence detection
• LiveViz: visualization from a parallel program
• CharmDebug: a powerful debugging tool
• Projections: Performance Analysis and Visualization, really nice, and a workhorse tool for Charm++ developers
• Messages (instead of marshalled parameters)
• Processor-aware constructs:
  – Groups: like a non-migratable chare array with one element on each “core”
  – Nodegroups: one element on each process

NAMD: Biomolecular Simulations

• Collaboration with K. Schulten
• With over 70,000 registered users
• Scaled to most top US supercomputers
• In production use on supercomputers and clusters and desktops
• Gordon Bell award in 2002

Determination of the structure of HIV capsid by researchers including Prof Schulten

Parallelization using Charm++

ChaNGa: Parallel Gravity

• Collaborative project (NSF)
  – with Tom Quinn, Univ. of Washington
• Gravity, gas dynamics
• Barnes-Hut tree codes
  – Oct tree is natural decomp
  – Geometry has better aspect ratios, so you “open” up fewer nodes
  – But is not used because it leads to bad load balance
  – Assumption: one-to-one map between sub-trees and PEs
  – Binary trees are considered better load balanced

With Charm++: Use Oct-Tree, and let Charm++ map subtrees to processors

Evolution of Universe and Galaxy Formation

OpenAtom: On the fly ab initio molecular dynamics on the ground state surface with instantaneous GW-BSE level spectra

PIs: G.J. Martyna, IBM; S. Ismail-Beigi, Yale; L. Kale, UIUC
Team: Q. Li, IBM; M. Kim, Yale; S. Mandal, Yale; E. Bohm, UIUC; N. Jain, UIUC; M. Robson, UIUC; E. Mikida, UIUC; P. Jindal, UIUC; T. Wicky, UIUC

Decomposition and Computation Flow

MiniApps

Mini-App | Features | Machine | Max cores
AMR | Overdecomposition, Custom array index, Message priorities, Load Balancing, Checkpoint restart | BG/Q | 131,072
LeanMD | Overdecomposition, Load Balancing, Checkpoint restart, Power awareness | BG/P; BG/Q | 131,072; 32,768
Barnes-Hut (n-body) | Overdecomposition, Message priorities, Load Balancing | Blue Waters | 16,384
LULESH 2.02 | AMPI, Over-decomposition, Load Balancing | Hopper | 8,000
PDES | Overdecomposition, Message priorities, TRAM | Stampede | 4,096

Available at: http://charmplusplus.org/miniApps/

More MiniApps

Mini-App | Features | Machine | Max cores
1D FFT | Interoperable with MPI | BG/P; BG/Q | 65,536; 16,384
Random Access | TRAM | BG/P; BG/Q | 131,072; 16,384
Dense LU | SDAG | XT5 | 8,192
Sparse Triangular Solver | SDAG | BG/P | 512
GTC | SDAG | BG/Q | 1,024
SPH | | Blue Waters | -

Describes seven major applications developed using Charm++

More info on Charm++, including the miniApps: http://charm.cs.illinois.edu

Saving Cooling Energy

• Easy: increase A/C setting
  – But: some cores may get too hot
• So, reduce frequency if temperature is high (DVFS)
  – Independently for each chip
• But, this creates a load imbalance!
• No problem, we can handle that:
  – Migrate objects away from the slowed-down processors
  – Balance load using an existing strategy
  – Strategies take speed of processors into account
• Implemented in experimental version
  – SC 2011 paper, IEEE TC paper
• Several new power/energy-related strategies
  – PASA ‘12: Exploiting differential sensitivities of code segments to frequency change
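What "taking the speed of processors into account" might look like can be sketched in plain C++. The following is an illustration under simple assumptions, not the strategy from the cited papers and not a Charm++ API: `assignBySpeed` is a hypothetical function, `work` holds per-object work in arbitrary units, and `speed` holds the relative speed of each PE after frequency scaling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: place each object on the PE that would finish it
// earliest, where finish time = (work already assigned + new work) / speed.
std::vector<int> assignBySpeed(const std::vector<double>& work,
                               const std::vector<double>& speed) {
  assert(!speed.empty());
  std::vector<double> assigned(speed.size(), 0.0);  // work units per PE
  std::vector<int> placement(work.size(), -1);
  for (std::size_t i = 0; i < work.size(); ++i) {
    std::size_t best = 0;
    double bestFinish = (assigned[0] + work[i]) / speed[0];
    for (std::size_t pe = 1; pe < speed.size(); ++pe) {
      const double finish = (assigned[pe] + work[i]) / speed[pe];
      if (finish < bestFinish) {  // ties keep the lower-numbered PE
        bestFinish = finish;
        best = pe;
      }
    }
    assigned[best] += work[i];
    placement[i] = static_cast<int>(best);
  }
  return placement;
}
```

With six equal objects and one PE running at half speed, this sketch places four objects on the full-speed PE and two on the slowed-down one, so both finish at about the same time.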

PARM: Power Aware Resource Manager

• Charm++ RTS facilitates malleable jobs
• PARM can improve throughput under a fixed power budget using:
  – overprovisioning (adding more nodes than conventional data center)
  – RAPL (capping power consumption of nodes)
  – Job malleability and moldability

