MPJ meets Gadget

11

MPJ meets Gadget

Bryan Carpenter

OMII, University of SouthamptonSouthampton SO17 1BJ, UK

March 24 2006

[email protected]

A Java Code for Cosmological Simulations?

22

Contents Java for Scientific Computing - background MPJ Express overview The Gadget 2 code Making a Java version of Gadget

33

Acknowledgements

Work done in collaboration with Mark Baker and Aamir Shafi.

Most work done by latter.

44

MPJ Background

55

Java History The Java language grabbed public attention in 1995,

with the release of the HotJava experimental Web browser, and its later incorporation into the Netscape browser.

Very suddenly, Java became one of the most important programming languages in the industry.

Within a year or so of its release, some people were suggesting that Java might be good for high performance scientific computing.• A workshop on Java for Science and Engineering

Computation was held at Syracuse University in late 1996 – a precursor of the subsequent Java Grande activities.

66

Java for High Performance Computing?

Various people (Java Grande, 1997-2003, RIP) have argued that in general Java provides a better programming platform than precursors.• In the parallel computing world there has been a long

history of novel language concepts to support parallelism, multithreading etc.

• Java seems to incorporate at least some of these ideas, and it has the benefit that it is a “mainstream” language.

In general Java encourages better software engineering and is more portable than, say, Fortran or C.• There is a huge body of supporting software for Java.

77

Performance To enable safe execution of code on multiple platforms, Java

programs are translated to portable byte code instructions for the Java Virtual Machine (JVM).

Modern JVMs perform compilation from byte code to native machine code on the fly, as the Java program is executed.• Typical JVMs implement many of the most important kinds of

optimizations used by compilers for “traditional” programming languages.

• Adaptive compilation may even allow optimizations that are impractical for static compilers, because run-time information is available.

• Quality of compilation comparable to many C or Fortran compilers.

But Java has safety features that may limit performance.• Difficulty of managing memory explicitly, exact exception processing, …

88

MPJ Background MPI was introduced in June 1994 as a standard

message passing API for parallel scientific computing:• Language bindings for C, C++, and Fortran.

Java Grande Message Passing Workgroup defined Java bindings in 1998.

Previous efforts follow two approaches: • Pure Java approaches:

Remote Method Invocation (RMI), Java Sockets,

• Java Native Interface (JNI) approaches: mpiJava – Java wrapper to platform native MPI.

99

MPJ Goals Modular design: unify earlier approaches by

pluggable communication devices.

Support high-level functionality of MPI-1.2 in pure Java code, as far as possible:• Point to point communications,

• Collective communications,

• Groups, communicators, and contexts,

• Derived datatypes – buffering API.

Open source distribution:

http://dsg.port.ac.uk/projects/mpj

1010

MPJ Design

MPJ point to point communications (Base level)

mpjdev (MPJ Device level)

MPJ collective Communications (High level)

Hardware (NIC, Memory etc)

MPJ API

JNI Java NIO

Java Virtual Machine (JVM)

JNI

Native MPI

gmdev

ThreadsAPI

smpdev

xdev

niodev

1111

MPJ Demos There are a few existing demos for MPJ, inherited from

the mpiJava project, e.g.• A 2d Fluid Flow example.

• Some Monte Carlo simulations of 2d spin systems from condensed matter physics.

These have nice GUIs, but aren’t very representative of real codes used by computational scientists.

1212

MPJ demo: CFD – inviscid flow

1313

MPJ demo: Q-state Potts model

1414

Gadget

1515

Gadget-2 Gadget-2 is a free-software, production code for

cosmological N-body (and hydrodynamic) computations.

http://www.mpa-garching.mpg.de/gadget• Written by Volker Springel, of the Max Plank Institute for

Astrophysics, Garching. It is written in the C language – already parallelized

using MPI. Versions have been used in various research papers in

astrophysics literature, including the Millennium Simulation.

1616

Millennium Simulation See the paper Simulating the Joint Evolution of Quasars,

Galaxies and their Large-Scale Distribution by Springel et al on Gadget home page.

Follows evolution of 1010 dark matter “particles” from early Universe (z = 127) to current day.

Performed on 512 nodes of IBM p690.• Used 1TB of distributed memory.

• 350,000 CPU hours – 28 days elapsed time.

• Floating point performance around 0.2 TFLOPS. Around 20Tb total data generated.

1717

Dynamics in Gadget Gadget is “just” simulating the movement of (a lot

of) representative particles under the influence of Newton’s law of gravity.• Plus some hydrodynamic forces, but these don’t affect

dominant Dark Matter. Co-moving coordinates take account of expansion of

Universe according to General Relativity, but otherwise basically classical mechanics.

Classical N-body problem.

1818

Gadget Main Loop This is a slightly simplified view of the Gadget code:

… Initialize … while (not done) { move_particles() ; // update positions

domain_Decomposition() ; compute_accelerations() ; advance_and_find_timesteps() ; // update velocities

}

• Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).

1919

Computing Forces The function compute_accelerations() must

compute forces experienced by all particles.• In particular must compute gravitational force.

Because gravity is a long range force, every particle affects every other. Naively, total cost is O(N2).• with N ≈ 1010, this is quite infeasible.

Need some kind of approximation.• Intuitive approach: when computing force on particle i,

group together particles in selected regions distant from i, and treat groups as single particles, located at their centres of mass.

2020

Barnes Hut Tree First divide cubical region of 3d space into 23 = 8

regions, halving each dimension. For every sub-region that contains any particles, divide

again recursively to make an octtree, until “leaf” regions have at most one particle.

2121

Two Dimensional Example

Picture borrowed from

www.cs.berkeley.edu/~demmel/cs267/lecture26/lecture26.html

2222

Barnes Hut Force Computation To compute the force on a particle i, traverse tree

starting from root:• if a node n is “distant from” i, just add contribution to force

on i from centre of mass of n – no need to visit children of n;

• if node n is “close to” i, visit children of n and recurse. Hinges on definition of distant from/close to.• Basic idea is that a node representing some region of space is

distant from a particle i if the angle it subtends is smaller than a threshold opening angle:

i

n

θ

2323

Complexity

On average, number of nodes “opened” to compute force on i is O(log N), as opposed to visiting O(N) particles in naïve algorithm.

A huge win when N ≈ 1010.

2424

Domain Decomposition Need to divide space and/or particle set into domains,

where each domain is handled by a single processor. Problem is that we can’t just divide space evenly,

because some regions will have many more particle than others – poor load balancing.

Conversely, can’t just divide particles evenly, because particles move throughout space, and want to maintain physically close particles on the same processor, as far as practical – communication problem.

2525

Peano-Hilbert Curve Warren and Salmon originally suggested using a space-

filling curve:

Picture borrowed from

http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf

2626

Peano-Hilbert Key Gadget applies the recursion 20 times, logically

dividing space into up to 220 × 220 × 220 cells on the Peano-Hilbert curve.

Then can label each cell by its location along the Peano-Hilbert curve – 260 possible locations comfortably fit into a 64-bit word.

2727

Decomposition based on P-H Curve Because number of cells > > number of particles,

segments of linear Peano-Hilbert curve sparsely populated.

Sort particles by their Peano-Hilbert key, then divide evenly into P domains.• Intuitively – stretch out the P-H curve with particles dotted

along it; segment it into P parts where each part has the same number of particles.

Characteristics of this decomposition:• Good load balancing.• Domains simply connected and quite “compact” in real space,

because particles that are close along P-H curve are close in real space (converse often but not always true).

• Domains have relatively simple mapping to BH octtree nodes.

2828

Distribution of BH Tree in Gadget

Ibid.

2929

Distributed Representation of Tree Every processor hold a copy of the root nodes, and a

copy of all child nodes down to the point where all particles in of a node are held on a single remote processor. Remotely held nodes are called pseudo-particles.

To compute the force on a single local target particle, traverse tree from root as usual, and accumulate contributions from locally held particles.

Build an export list containing target particle and hosts of pseudo-particles encountered in walk.

3030

Communication After local computation for all target particles, process

export list and send list of local target particles to all hosts that own pseudo-particle nodes needed for those particles.

All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles.

These contributions are returned to the original processor, and added into the accelerations for the target particles.

3131

Other Communications Notably:• The Domain Decomposition itself requires (in principle) a

distributed sort of the particle list.

• Gadget approximates this sort, but still fairly intricate communications.

In Gadget all communication is uses the standard MPI library. It makes fairly extensive use of collective communication routines.

3232

Java Gadget

3333

History MPJ Express released 2005. Realized we need a realistic exemplar code: both as a

demo, and to drive further improvements of our software.

First thoughts that an N-body simulation code would make sense – Summer 2005.

Learned about Gadget in December that year. Effort to make a Java version of Gadget started

seriously early February 2006 – expecting many months work.

Non-trivial simulations running in March.

3434

Gadget Code Statistics Public distribution is around 17,000 lines of C,

distributed over about 30 source files. Dependencies on:• MPI library for parallelization.

• GNU scientific library (but only a handful of functions)

• FFTW – library for parallel Fourier transforms. Comes with a set of initial conditions files that we can

use to test our code, including “colliding galaxies” and “cluster formation”.

3535

Translation of Gadget Code Manually translated, but in first cut, deliberately keep the Java

code as similar to C as possible.• e.g. nearly all of Java Gadget is currently implemented as static methods

on one enormous class called Main.

• Small supporting classes corresponding to structs of C code.

Currently not supporting a number of features, including:• PMTree algorithm (an optimization using Fourier Transforms on a mesh

to compute long-range part of gravitational force),

• Periodic Boundary Conditions,

• Multiple Initial Conditions files,

• COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, …

Not quite complete in a few other ways (“magic numbers”).

3636

Handling Dependencies Replace MPI calls with MPJ calls (!) FFTW is not needed, because we disable the PMTree

algorithm. Few GSL functions that were needed (numeric

quadrature, etc) were hand translated to Java.

3737

Test Cases Restrictions aside, we have successfully run Colliding

Galaxies and Cluster Formation example simulations on 2, 4 and 8 processors.• These use pure Dark Matter – hydrodynamics code not yet

tested. Results are indistinguishable from running the original

Gadget.

3838

Colliding Galaxies

3939

Cluster Formation

4040

Custom serialization in Java Gadget-2 -- Motivation

Java’s built in object serialization can have detrimental effect on parallel application performance

Experiment• Compare the performance of sending

An array of 1, 1Kbytes, 1Mbytes byte array elements An array of 1, 1Kbytes, 1Mbytes object array elements, where each

object contains exactly one byte• Both tests are communicating same amount of data

Message Size Latency for MPI.BYTE Latency for MPI.OBJECT Bytes communicated in objects case

1 167 us 282 us 72 (overhead is 71 bytes)

1K 512 us 7745 us ~ 17K (overhead is 16K bytes)

1M .09 milliseconds 10.85 seconds ~ 17M (overhead is 16M bytes)

4141

Custom serialization in Java Gadget-2

In the original C Gadget-2, initial conditions are read into an array of C struct called ParticleData and SphParticleData

• Particles that need to be exported are copied to a contiguous memory region called CommBuffer

In Java version, ParticleData and SphParticleData are objects arrays• CommBuffer is a contiguous memory region, which is an instance of

ByteBuffer class

• Each primitive datatype in object arrays are copied to (packed) to CommBuffer by the sender and copied from (unpacked) CommBuffer by the receiver

Helps avoiding the overheads of Java’s object serialization

4242

Initial Benchmarks Here we are running the simulations defined by the

initial conditions files distributed with Gadget. The numbers of particles are relatively modest, so

parallel speedups are less than perfect.

4343

4444

Interpretation These are first-cut results – comparing the production

C Gadget with a mostly unoptimized Java code. In this version the Java code appears to be roughly a

factor of two slower than C. It is fairly clear that Java memory usage is very

inefficient, probably this has an adverse effect on numerical performance (poor caching).

Need to investigate whether Java code can be rewritten to make better use of memory.

4545

Possible enhancement to the MPJ API

Support for sending a basic datatype• For example, if we want to send an integer, it has to be part of or

copied to an integer array before communication Support for sending from and receiving to ByteBuffer

• Our buffering API in MPJ Express allows efficient reuse of ByteBuffers• Application developers can get a chunk of memory (ByteBuffer) using

buffering API, copy their data, and use it for communication MPJ could now use this ByteBuffer for socket send method

• No additional copying• Can help reduce latency

Date post:	02-Jan-2016
Category:	Documents
Upload:	dingbang-lio
View:	87 times
Download:	1 times

MPJ meets Gadget

Documents