Page 1

INAF Osservatorio Astrofisico di Catania

“ScicomP 9”, Bologna, March 23 – 26 2004

Using LAPI and MPI-2 in an N-body cosmological code on IBM SP

M. Comparato, U. Becciani, C. Gheller, V. Antonuccio

• The N-Body project at OACT

• The FLY code

• Performance analysis with LAPI and MPI-2

• Questions

Page 2

FLY Project

• People: V. Antonuccio, U. Becciani, M. Comparato, D. Ferro

• Funds: Included in the project “Problematiche Astrofisiche Attuali ed Alta Formazione nel Campo del Supercalcolo” funded by MIUR with more than 500,000 Euros + 170,000 Euros

• INAF provides grants on MPP systems at CINECA

• Resources: IBM SP, SGI Origin systems, Cray T3E

Page 3

IBM SP POWER3 (INAF Astrophysical Observatory of Catania)

24 Processors 222 MHz

Global RAM Memory: 48 Gbytes

Disk Space: 254 GB (72.8 GB HD per node + 36.2 GB HD cws)

Network Topology: SPS scalable Omega switch and FastEthernet node interconnection type

Bandwidth: 300 Mbyte/s peak bi-directional transfer rate

Programming Languages: C, C++, Fortran 90

Parallel paradigms: OpenMP, MPI, LAPI

Page 4

NEW SYSTEM: IBM POWER4 P650 (INAF Astrophysical Observatory of Catania)

8 Processors 1.1 GHz

Global RAM Memory: 16 Gbytes

Disk Array: 254 GB

L2: 1.5 Mbytes

L3: 128 Mbytes

Memory: 2 GB per processor

Page 5

Gravitational N-body problem

The cosmological simulation

The N-body technique makes it possible to perform cosmological simulations that describe the evolution of the universe.

With this method matter is represented as a set of particles:

• each particle is characterized by mass, position and velocity,

• the only force considered is gravity,

• the evolution of the system is obtained by numerically integrating the equation of motion over a proper time interval.

Page 6

Gravitational N-body problem

The cosmological simulation

• The direct interaction (P-P) method is conceptually the simplest,

• … but it scales as O(N²), and this makes it impossible to run simulations with N ≥ 10⁵ particles.

• To overcome this problem, tree- or mesh-based algorithms have been developed; these scale as N log N and N.

• Still, only supercomputers and parallel codes allow the user to run simulations with N ≥ 10⁷ particles.
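To get a feel for the scaling argument on this slide, the following sketch compares the number of particle pairs in a direct sum with an N log N estimate. The function names and the choice of a base-2 logarithm are illustrative assumptions, and the constant factors of a real tree code are ignored.

```python
import math

def direct_pair_count(n):
    """Number of distinct particle pairs evaluated by a direct (P-P) sum."""
    return n * (n - 1) // 2

def tree_cost_estimate(n):
    """Rough N log2 N operation estimate for a tree code (constants ignored)."""
    return n * math.log2(n)

for n in (10**5, 10**7):
    ratio = direct_pair_count(n) / tree_cost_estimate(n)
    print(f"N = {n}: direct pairs = {direct_pair_count(n):.2e}, "
          f"N log2 N = {tree_cost_estimate(n):.2e}, ratio = {ratio:.0f}")
```

The ratio grows roughly linearly with N, which is why the direct method becomes impractical well before the tree-based one does.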

Page 7

FLY: Parallel tree N-body code for Cosmological Applications

• Based on the Barnes-Hut Algorithm (J. Barnes & P. Hut, Nature, 324, 1986)

• Fortran 90 parallel code

• High-performance code for MPP/SMP architectures using the one-sided communication paradigm: SHMEM – LAPI

• It runs on Cray T3E, SGI Origin, and IBM SP systems.

• Typical simulations require 350 MB of RAM for 1 million particles

Page 8

Gravitational N-body problem

The Barnes-Hut algorithm

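The particles evolve according to the laws of Newtonian physics:

$$\frac{d^2 \mathbf{x}_i}{dt^2} = -\sum_{j \neq i} \frac{G m_j \, \mathbf{d}_{ij}}{|\mathbf{d}_{ij}|^3}$$

where $\mathbf{d}_{ij} = \mathbf{x}_i - \mathbf{x}_j$.

Considering a region, the force component on the i-th particle may be computed as:

$$-\sum_{j} \frac{G m_j \, \mathbf{d}_{ij}}{|\mathbf{d}_{ij}|^3} \approx -\frac{G M \, \mathbf{d}_{i,cm}}{|\mathbf{d}_{i,cm}|^3} + \text{higher order multipole terms}$$

where the sum runs over the particles j in the region, M is their total mass, and $\mathbf{d}_{i,cm}$ is the separation between particle i and the center of mass of the region.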
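As an illustration of the Newtonian sum on this slide, here is a minimal direct-summation sketch in plain Python. It is not taken from FLY (which is written in Fortran 90); the function name, the list-based data layout and the choice G = 1 in code units are assumptions made for the example.

```python
def direct_accelerations(positions, masses, G=1.0):
    """Acceleration of each particle by direct summation:
    a_i = -sum_{j != i} G * m_j * d_ij / |d_ij|^3, with d_ij = x_i - x_j."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Separation vector d_ij = x_i - x_j and its squared length
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            inv_r3 = 1.0 / (r2 * r2 ** 0.5)
            for k in range(3):
                acc[i][k] -= G * masses[j] * d[k] * inv_r3
    return acc

# Two unit masses one unit apart on the x axis: they attract each other.
acc = direct_accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0])
print(acc)
```

The double loop over i and j is exactly the O(N²) cost that the Barnes-Hut approximation is meant to avoid.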

Page 9

Gravitational N-body problem

Tree Formation

[Figure: 2D domain decomposition and the corresponding tree structure, starting from the root.]

The split of each sub-domain is carried out until only one body (a leaf) is contained in each box.
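The splitting rule described on this slide can be sketched with a small 2D quadtree. This is an illustrative Python sketch, not the FLY data structure: the `Cell` class, its fields and `count_leaves` are hypothetical names, and bodies are assumed to have distinct positions (two coincident bodies would split forever).

```python
class Cell:
    """Square box of the 2D domain: empty, a leaf with one body, or split in 4."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # center and half side length
        self.body = None       # (x, y, mass) when the cell is a leaf
        self.children = None   # list of 4 sub-cells once the box is split

    def _child_for(self, x, y):
        index = (1 if x >= self.cx else 0) + (2 if y >= self.cy else 0)
        return self.children[index]

    def insert(self, body):
        if self.children is None and self.body is None:
            self.body = body                  # empty box becomes a leaf
            return
        if self.children is None:             # occupied leaf: split the box
            h = self.half / 2.0
            self.children = [Cell(self.cx + sx * h, self.cy + sy * h, h)
                             for sy in (-1, 1) for sx in (-1, 1)]
            old, self.body = self.body, None
            self._child_for(old[0], old[1]).insert(old)
        self._child_for(body[0], body[1]).insert(body)

def count_leaves(cell):
    """Number of boxes that contain exactly one body."""
    if cell.children is None:
        return 1 if cell.body is not None else 0
    return sum(count_leaves(c) for c in cell.children)

root = Cell(0.5, 0.5, 0.5)   # unit square
bodies = [(0.1, 0.1, 1.0), (0.9, 0.2, 1.0), (0.8, 0.8, 1.0), (0.85, 0.9, 1.0)]
for b in bodies:
    root.insert(b)
print(count_leaves(root))
```

After all insertions every body sits alone in its own box, so the number of leaves equals the number of bodies.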

Page 10

Gravitational N-body problem

Force Computation

The force on any particle is computed as the sum of the forces by the nearby particles plus the force by the distant cells whose mass distributions are approximated by multipole series truncated, typically, at the quadrupole order.

If $\frac{\text{Cellsize}}{d_{i,cm}} < \theta$, mark the cell (θ is the opening angle parameter of the Barnes-Hut algorithm).

Two phases:

• tree walk procedure

• force computation
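The two ingredients of this slide, the cell acceptance test and the replacement of a distant group by its total mass at the center of mass, can be sketched as follows. This is a hedged illustration: the function names, the default θ = 0.8 and the truncation at the monopole term (the slide mentions quadrupole order) are assumptions of the example, not FLY's actual settings.

```python
import math

def accept_cell(cell_size, d_i_cm, theta=0.8):
    """Opening criterion: use the cell as a single mass if size/distance < theta."""
    return cell_size / d_i_cm < theta

def monopole_acc(x_i, cluster, G=1.0):
    """Acceleration at x_i from the cluster's total mass placed at its center of mass."""
    M = sum(m for _, _, m in cluster)
    cmx = sum(x * m for x, _, m in cluster) / M
    cmy = sum(y * m for _, y, m in cluster) / M
    dx, dy = x_i[0] - cmx, x_i[1] - cmy
    r3 = math.hypot(dx, dy) ** 3
    return (-G * M * dx / r3, -G * M * dy / r3)

def direct_acc(x_i, cluster, G=1.0):
    """Acceleration at x_i summed over every body of the cluster."""
    ax = ay = 0.0
    for x, y, m in cluster:
        dx, dy = x_i[0] - x, x_i[1] - y
        r3 = math.hypot(dx, dy) ** 3
        ax -= G * m * dx / r3
        ay -= G * m * dy / r3
    return (ax, ay)

# Four unit masses inside a cell of size 1, about 10.5 length units from the particle.
cluster = [(10.0, 0.0, 1.0), (10.5, 0.5, 1.0), (10.5, -0.5, 1.0), (11.0, 0.0, 1.0)]
particle = (0.0, 0.0)
far_enough = accept_cell(1.0, 10.5)
approx = monopole_acc(particle, cluster)
exact = direct_acc(particle, cluster)
rel_error = abs(approx[0] - exact[0]) / abs(exact[0])
print(far_enough, rel_error)
```

For this distant cell the monopole term alone agrees with the direct sum to well under one percent, which is the kind of error the higher order multipole terms reduce further.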

Page 11

FLY Block Diagram

• SYSTEM INITIALIZATION

• TIME STEP CYCLE:

    • TREE FORMATION and BARRIER

    • FORCE COMPUTATION (TREE INSPECTION, ACC. COMPONENTS)

    • BARRIER

    • UPDATE POSITIONS and BARRIER

• STOP

Page 12

Parallel implementation of FLY

Data distribution

Two main data structures:

• particles

• tree

Particles are distributed in blocks such that each processor has the same number of contiguous bodies.

Tree structure is distributed in a cyclic way such that each processor has the same number of cells.

[Figure: example with 4 processors. Bodies are assigned in contiguous blocks; tree cells are assigned cyclically (cells 1, 5, 9, 13, ... on the first processor; 2, 6, 10, 14, ... on the second; and so on).]
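The two distribution schemes on this slide amount to two different index-to-processor mappings. The sketch below uses 0-based indices and hypothetical function names, and assumes the number of bodies is divisible by the number of processors.

```python
def block_owner(body_index, n_bodies, n_procs):
    """Processor owning a body under block distribution (contiguous chunks)."""
    return body_index // (n_bodies // n_procs)

def cyclic_owner(cell_index, n_procs):
    """Processor owning a tree cell under cyclic (round-robin) distribution."""
    return cell_index % n_procs

n_procs = 4
block_map = [block_owner(i, 16, n_procs) for i in range(16)]
cyclic_map = [cyclic_owner(c, n_procs) for c in range(16)]
print(block_map)
print(cyclic_map)
```

Both mappings give every processor the same number of items; they differ only in which items end up together.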

Page 13

Parallel implementation of FLY

Work distribution

• Each processor calculates the force for its local particles.

• To do that, the whole tree structure (which is distributed among processors) must be accessed asynchronously (one-sided communication is required)

• This leads to a huge communication overhead

Page 14

FLY: Tips and tricks

To reduce the communication overhead, we have implemented several “tricks”:

• Dynamical Load Balancing: processors help each other

• Grouping: close particles have the same interaction with far distributions of mass

• Data Buffering

Page 15

FLY: Data buffering

Data buffering: free RAM segments are dynamically allocated to store remote data (tree cell properties and remote bodies) already accessed during the tree walk procedure.

Performance improvement (16 million bodies on Cray T3E, 32 PEs, 156 Mbytes for each PE):

• Without buffering: each PE executes 700 GET operations for each local body

• Using buffering: each PE executes ONLY 8 - 25 GET operations for each local body
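The idea behind data buffering can be sketched as a local cache in front of a remote store that counts GET operations. Everything here is hypothetical: `RemoteTree` stands in for tree cells held by other processors, and the "keep until the buffer is full" policy is an assumption, since the slides do not describe FLY's actual replacement policy.

```python
class RemoteTree:
    """Stand-in for tree cells stored on other processors; counts GET operations."""

    def __init__(self, cells):
        self.cells = cells
        self.get_count = 0

    def get(self, cell_id):
        self.get_count += 1          # each call models one remote GET
        return self.cells[cell_id]

class BufferedTree:
    """Keeps already-fetched cells in local memory until the buffer is full."""

    def __init__(self, remote, capacity):
        self.remote = remote
        self.capacity = capacity
        self.buffer = {}

    def get(self, cell_id):
        if cell_id in self.buffer:   # already accessed: no communication
            return self.buffer[cell_id]
        data = self.remote.get(cell_id)
        if len(self.buffer) < self.capacity:
            self.buffer[cell_id] = data
        return data

remote = RemoteTree({i: i * 10 for i in range(5)})
buffered = BufferedTree(remote, capacity=3)
walk = [0, 1, 2, 0, 1, 2, 3, 3, 0]   # cell ids visited during a tree walk
values = [buffered.get(c) for c in walk]
print(len(walk), remote.get_count)
```

In this toy walk, 9 accesses cost only 5 remote GETs; cells near the root of the tree are revisited most often, which is why buffering them pays off.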

Page 16

Why FLYing from LAPI to MPI-2

• LAPI is a proprietary parallel programming library (IBM)

• Implementing FLY using MPI-2 improves the portability of our code

• RMA calls introduced in MPI-2 make the porting simple, since there is a direct correspondence between basic functions

lapi_get(…) → mpi_get(…)

lapi_put(…) → mpi_put(…)

• MPI-2 doesn’t have an atomic fetch and increment call

Page 17

MPI-2 synchronization

• MPI_Win_lock and MPI_Win_unlock mechanism

    when we need the data just after the call

    when only one processor accesses remote data

• MPI_Win_fence mechanism

    when we can separate non-RMA access from RMA access

    when all the processors access remote data at the same time

Page 18

MPI-2 synchronization

• The FLY algorithm requires continuous asynchronous access to remote data

• passive target synchronization is needed

• we have to use the lock/unlock mechanism.

Page 19

MPI-2 Drawback

Unfortunately, lock and unlock are usually not implemented efficiently (or not implemented at all):

• LAM: not implemented

• MPICH: not implemented

• MPICH2: waiting for the next release

• IBM AIX: poor performance

• IBM Turbo MPI2: testing phase

Page 20

FLY 3.3

Problem: poor performance of lock/unlock calls

Workaround: rewrite portions of the code (where possible) to separate non-RMA accesses from RMA accesses, so that the fence calls can be used

Result: MPI-2 version runs 2 times faster

Why don’t we port these changes to the LAPI version?

FLY 3.3 was born

Page 21

FLY 3.2 vs FLY 3.3: 2M particles test

Static part:

• Tree generation

• Cell properties

• …

Dynamic part:

• Interaction list

• Force computation

• …

[Figure, two panels (“Static” and “Dynamic”): time in seconds versus number of processors (1 to 64) for FLY_3.2 and FLY_3.3.]

Page 22

FLY 3.2 vs FLY 3.3: 2M particles test

[Figure, panel “Total”: total simulation time in seconds versus number of processors (1 to 64) for FLY_3.2 and FLY_3.3.]

[Figure, panel “Scalability”: timing normalized on the number of processors, (t_n × n) / t_1, versus number of processors (1 to 64) for FLY_3.2 and FLY_3.3.]
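The scalability measure used in this plot, (t_n × n) / t_1, can be computed as in the sketch below. The timings are hypothetical placeholders chosen only to show the arithmetic; they are not read from the slides, and the function name is an assumption.

```python
def normalized_timing(t_n, n, t_1):
    """Scalability metric (t_n * n) / t_1: 1.0 is ideal, larger means overhead."""
    return (t_n * n) / t_1

# Hypothetical wall-clock times in seconds, NOT measurements from the slides.
timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0}
t_1 = timings[1]
scalability = {n: normalized_timing(t, n, t_1) for n, t in timings.items()}
for n, s in scalability.items():
    print(n, round(s, 2))
```

With perfect scaling t_n = t_1 / n, so the metric stays at 1; a curve that rises with the number of processors indicates growing parallel overhead.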

Page 23

FLY 3.3 vs FLY MPI-2: 2M particles test

Static part:

• Tree generation

• Cell properties

• …

Dynamic part:

• Interaction list

• Force computation

• …

[Figure, two panels (“Dynamic” and “Static”): time in seconds versus number of processors for FLY_3.3 and FLY_MPI.]

Page 24

Conclusions

Present:

• Low-performance MPI-2 version of FLY (for now)

• More scalable LAPI version of FLY

Future:

• TurboMPI2

• MPICH2 (porting to Linux clusters)

• OO interface to hydrodynamic codes (FLASH)

