INAF Osservatorio Astrofisico di Catania
“ScicomP 9”, Bologna, March 23-26, 2004
Using LAPI and MPI-2 in an N-body cosmological code on IBM SP
M. Comparato, U. Becciani, C. Gheller, V. Antonuccio
• The N-Body project at OACT
• The FLY code
• Performance analysis with LAPI and MPI-2
• Questions
FLY Project
• People: V. Antonuccio, U. Becciani, M. Comparato, D. Ferro
• Funds: included in the project “Problematiche Astrofisiche Attuali ed Alta Formazione nel Campo del Supercalcolo”, funded by MIUR with more than 500,000 + 170,000 Euros
• INAF provides grants on MPP systems at CINECA
• Resources: IBM SP, SGI Origin systems, Cray T3E
IBM SP POWER3
24 processors at 222 MHz
Global RAM memory: 48 GB
Disk space: 254 GB (72.8 GB HD per node + 36.2 GB HD on the CWS)
Network topology: SPS scalable Omega switch, with FastEthernet node interconnection
Bandwidth: 300 MB/s peak bidirectional transfer rate
Programming languages: C, C++, Fortran 90
Parallel paradigms: OpenMP, MPI, LAPI
NEW SYSTEM: IBM POWER4 p650
8 processors at 1.1 GHz
Global RAM memory: 16 GB
Disk array: 254 GB
L2 cache: 1.5 MB
L3 cache: 128 MB
Memory: 2 GB per processor
Gravitational N-body problem
The cosmological simulation

The N-body technique allows one to perform cosmological simulations that describe the evolution of the universe. With this method matter is represented as a set of particles:
• each particle is characterized by mass, position and velocity;
• the only force considered is gravity;
• the evolution of the system is obtained by numerically integrating the equations of motion over a proper time interval.
• The direct interaction (P-P) method is conceptually the simplest (see the sketch below)...
• ... but it scales as O(N²), which makes it impossible to run simulations with more than ~10⁵ particles.
• To overcome this problem, tree- and mesh-based algorithms have been developed; they scale as O(N log N) and O(N) respectively.
• Still, only supercomputers and parallel codes allow the user to run simulations with N ≥ 10⁷ particles.
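For illustration only, a minimal direct-summation (P-P) sketch, not FLY's code (softening and the time integrator are omitted); the nested loop over all pairs is what makes the cost O(N²):

  ! Direct P-P gravity: every pair interacts, hence O(N**2) work.
  subroutine direct_force(n, m, x, a)
    implicit none
    integer, intent(in)  :: n
    real,    intent(in)  :: m(n), x(3, n)   ! masses and positions
    real,    intent(out) :: a(3, n)         ! gravitational accelerations
    real, parameter :: g = 1.0              ! units with G = 1 (assumption)
    real    :: d(3), r3
    integer :: i, j
    a = 0.0
    do i = 1, n
       do j = 1, n
          if (i == j) cycle
          d  = x(:, j) - x(:, i)            ! equals -d_ij in the notation below
          r3 = sqrt(sum(d*d))**3
          a(:, i) = a(:, i) + g * m(j) * d / r3
       end do
    end do
  end subroutine direct_force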
FLY: Parallel tree N-body code for Cosmological Applications
• Based on the Barnes-Hut Algorithm (J. Barnes & P. Hut, Nature, 324, 1986)
• Fortran 90 parallel code
• High performance code for MPP/SMP architectures, using one-sided communication paradigms: SHMEM, LAPI
• It runs on Cray T3E, SGI Origin and IBM SP systems
• Typical simulations require 350 MB of RAM per million particles
Gravitational N-body problem
The Barnes-Hut algorithm
The particles evolve according to the laws of Newtonian physics:

$$\frac{d^2 \mathbf{x}_i}{dt^2} = -\sum_{j \neq i} \frac{G m_j\, \mathbf{d}_{ij}}{|\mathbf{d}_{ij}|^3}, \qquad \mathbf{d}_{ij} = \mathbf{x}_i - \mathbf{x}_j$$

Considering a distant region of space, the force component on the i-th particle may be computed as:

$$-\sum_{j} \frac{G m_j\, \mathbf{d}_{ij}}{|\mathbf{d}_{ij}|^3} \simeq -\frac{G M\, \mathbf{d}_{cm,i}}{|\mathbf{d}_{cm,i}|^3} + \text{higher order multipole terms}$$

where $M$ is the total mass of the region and $\mathbf{d}_{cm,i}$ is the vector from its center of mass to the i-th particle.
Gravitational N-body problem
Tree Formation
[Figure] 2D domain decomposition and tree structure, starting from the root cell. The split of each sub-domain is carried out until each box contains only one body (a leaf).
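As a toy illustration of this splitting rule, a minimal 2D quadtree sketch; all names are illustrative, not FLY's actual (3D, parallel) routines:

  module tree_build
    implicit none
  contains
    ! Recursively split a square box into four quadrants until each box
    ! contains at most one body (a leaf), as in the figure above.
    recursive subroutine build(x0, y0, ext, idx, x, y)
      real,    intent(in) :: x0, y0, ext   ! lower corner and side of the box
      integer, intent(in) :: idx(:)        ! indices of the bodies in this box
      real,    intent(in) :: x(:), y(:)    ! body coordinates
      integer :: i, q, nsub(4), sub(size(idx), 4)
      if (size(idx) <= 1) return           ! a leaf: stop splitting
      nsub = 0
      do i = 1, size(idx)                  ! sort bodies into the 4 quadrants
         q = 1
         if (x(idx(i)) >= x0 + ext/2) q = q + 1
         if (y(idx(i)) >= y0 + ext/2) q = q + 2
         nsub(q) = nsub(q) + 1
         sub(nsub(q), q) = idx(i)
      end do
      ! a real code would allocate a cell here and store its mass, center
      ! of mass and quadrupole moments for the force computation
      do q = 1, 4
         if (nsub(q) > 0) call build(x0 + merge(ext/2, 0.0, mod(q-1, 2) == 1), &
                                     y0 + merge(ext/2, 0.0, (q-1)/2 == 1),     &
                                     ext/2, sub(1:nsub(q), q), x, y)
      end do
    end subroutine build
  end module tree_build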
Gravitational N-body problem
Force Computation
The force on any particle is computed as the sum of the forces from nearby particles, plus the forces from distant cells whose mass distributions are approximated by multipole series truncated, typically, at the quadrupole order.
If $d_{cm,i} > \text{Cellsize}/\theta$ (with θ the opening-angle parameter), the cell is marked as accepted; otherwise it is opened and its children are examined, as sketched below.

Two phases:
• tree walk procedure
• force computation
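A minimal sketch of this acceptance test (illustrative names; details such as the exact cell reference point differ in the real code):

  ! Barnes-Hut opening criterion: a cell of side `cellsize`, whose center
  ! of mass lies at distance `d` from the particle, is accepted for the
  ! multipole approximation when it subtends a small enough angle.
  logical function accept_cell(cellsize, d, theta)
    implicit none
    real, intent(in) :: cellsize, d, theta
    accept_cell = (d > cellsize / theta)
  end function accept_cell

During the tree walk, accepted cells are added to the interaction list; rejected cells are opened and their children are tested in turn.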
FLY Block Diagram

SYSTEM INITIALIZATION
→ TREE FORMATION and BARRIER
→ FORCE COMPUTATION (TREE INSPECTION, ACC. COMPONENTS) and BARRIER
→ UPDATE POSITIONS and BARRIER
→ back to TREE FORMATION (time step cycle), until STOP
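Schematically, the time step cycle is a loop of this shape (a minimal sketch with all the actual work elided; `nsteps` and the comments are illustrative, not FLY's code):

  program fly_cycle
    use mpi
    implicit none
    integer, parameter :: nsteps = 100      ! assumed number of time steps
    integer :: step, ierr
    call MPI_Init(ierr)                     ! system initialization ...
    do step = 1, nsteps                     ! time step cycle
       ! ... tree formation ...
       call MPI_Barrier(MPI_COMM_WORLD, ierr)
       ! ... force computation: tree inspection, then acc. components ...
       call MPI_Barrier(MPI_COMM_WORLD, ierr)
       ! ... update positions ...
       call MPI_Barrier(MPI_COMM_WORLD, ierr)
    end do
    call MPI_Finalize(ierr)                 ! STOP
  end program fly_cycle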
Parallel implementation of FLY
Data distribution
Two main data structures:
• particles
• tree
Particles are distributed in blocks, such that each processor holds the same number of contiguous bodies (e.g. with 4 processors, the first quarter of the bodies goes to PE 0, the second quarter to PE 1, and so on).

The tree structure is distributed in a cyclic way, such that each processor holds the same number of cells (cell 1 to PE 0, cell 2 to PE 1, ...). A sketch of the two mappings follows.
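A minimal sketch of the two mappings (illustrative names; assumes 0-based processor ranks and that `nproc` divides the counts evenly):

  ! Block distribution: body i lives on the PE owning its contiguous chunk.
  integer function owner_of_body(i, nbody, nproc)
    implicit none
    integer, intent(in) :: i, nbody, nproc
    owner_of_body = (i - 1) / (nbody / nproc)
  end function owner_of_body

  ! Cyclic distribution: cell c is assigned round-robin across the PEs.
  integer function owner_of_cell(c, nproc)
    implicit none
    integer, intent(in) :: c, nproc
    owner_of_cell = mod(c - 1, nproc)
  end function owner_of_cell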
Parallel implementation of FLY
Work distribution
• Each processor calculates the force for its local particles.
• To do that, the whole tree structure (which is distributed among the processors) must be accessed asynchronously (one-sided communications are required)
• This leads to a huge communication overhead
FLY: Tips and tricks
In order to mitigate the communication overhead, we have implemented several “tricks”:
Dynamical Load Balancing: processors help each other
Grouping: nearby particles share the same interaction with distant mass distributions
Data Buffering
Data buffering: free RAM segments are dynamically allocated to store remote data (tree cell properties and remote bodies) already accessed during the tree walk procedure.
FLY: Data buffering

Performance improvement (16 million bodies on a Cray T3E, 32 PEs, 156 MB per PE):
• without buffering, each PE executes ~700 GET operations for each local body;
• with buffering, each PE executes only 8-25 GET operations for each local body.

A sketch of the buffering idea follows.
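A minimal sketch of the idea, not FLY's implementation (which dynamically allocates free RAM segments): a local table keyed by remote cell identifier, consulted before issuing a GET.

  module cell_cache
    implicit none
    integer, parameter :: nprops = 8, maxcache = 100000   ! assumed sizes
    integer :: ncached = 0
    integer :: ids(maxcache)             ! which remote cells are buffered
    real    :: store(nprops, maxcache)   ! their buffered properties
  contains
    ! Return the buffered copy of cell `c`, if we fetched it earlier.
    logical function lookup(c, props)
      integer, intent(in)  :: c
      real,    intent(out) :: props(nprops)
      integer :: k
      lookup = .false.
      do k = 1, ncached
         if (ids(k) == c) then
            props = store(:, k)
            lookup = .true.
            return
         end if
      end do
    end function lookup
    ! Remember a cell fetched with a remote GET, so the next access is local.
    subroutine insert(c, props)
      integer, intent(in) :: c
      real,    intent(in) :: props(nprops)
      if (ncached < maxcache) then
         ncached = ncached + 1
         ids(ncached) = c
         store(:, ncached) = props
      end if
    end subroutine insert
  end module cell_cache

During the tree walk one would call lookup first and fall back to a remote GET (plus insert) only on a miss; a real implementation would use a faster indexing scheme than this linear search.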
Why FLYing from LAPI to MPI-2
• LAPI is a proprietary (IBM) parallel programming library
• Implementing FLY using MPI-2 improves the portability of our code
• RMA calls introduced in MPI-2 make the porting simple, since there is a direct correspondence between the basic functions:

lapi_get(…) → mpi_get(…)
lapi_put(…) → mpi_put(…)
• MPI-2 doesn’t have an atomic fetch and increment call
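Before any MPI-2 RMA call, each process must expose its local data in a window; a minimal sketch (buffer name and sizes are illustrative, and 4-byte reals are assumed):

  program expose_tree
    use mpi
    implicit none
    integer, parameter :: ncells = 100000, nprops = 8   ! assumed local tree size
    real :: cells(nprops, ncells)        ! local portion of the tree
    integer :: win, ierr
    integer(kind=MPI_ADDRESS_KIND) :: winsize
    call MPI_Init(ierr)
    winsize = int(nprops, MPI_ADDRESS_KIND) * ncells * 4
    ! disp_unit = 4, so target displacements are counted in reals
    call MPI_Win_create(cells, winsize, 4, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)
    ! ... RMA access epochs (fence or lock/unlock) go here ...
    call MPI_Win_free(win, ierr)
    call MPI_Finalize(ierr)
  end program expose_tree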
MPI-2 synchronization

• MPI_Win_lock / MPI_Win_unlock mechanism:
  - when we need the data just after the call
  - when only one processor accesses the remote data

• MPI_Win_fence mechanism:
  - when we can separate non-RMA accesses from RMA accesses
  - when all the processors access remote data at the same time
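A minimal sketch of the passive-target pattern on a window like the one above (names are illustrative):

  ! Fetch one remote cell: lock, get, unlock. The data in `props` is
  ! usable right after MPI_Win_unlock, with no action by the target.
  subroutine fetch_cell(win, target_rank, disp, props)
    use mpi
    implicit none
    integer, intent(in) :: win, target_rank
    integer(kind=MPI_ADDRESS_KIND), intent(in) :: disp   ! in reals (disp_unit = 4)
    real, intent(out) :: props(8)
    integer :: ierr
    call MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win, ierr)
    call MPI_Get(props, 8, MPI_REAL, target_rank, disp, 8, MPI_REAL, win, ierr)
    call MPI_Win_unlock(target_rank, win, ierr)
  end subroutine fetch_cell

With the fence mechanism, the same MPI_Get would instead be bracketed by two collective MPI_Win_fence(0, win, ierr) calls on all processes, and props would be valid only after the closing fence.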
MPI-2 synchronization

• The FLY algorithm requires continuous asynchronous access to remote data
• hence passive target synchronization is needed
• so we have to use the lock/unlock mechanism
MPI-2 Drawback
Unfortunately, lock and unlock are usually not implemented efficiently (or are not implemented at all):

• LAM: not implemented
• MPICH: not implemented
• MPICH2: waiting for the next release
• IBM AIX: poor performance
• IBM Turbo MPI2: testing phase
FLY 3.3
Problem: poor performance of the lock/unlock calls.
Workaround: rewrite portions of the code (where possible) to separate non-RMA accesses from RMA accesses, so that the fence calls can be used.
Result: the MPI-2 version runs 2 times faster.

Why not port these changes back to the LAPI version? Thus FLY 3.3 was born.
Static part:
• Tree generation
• Cell properties
• …

Dynamic part:
• Interaction list
• Force computation
• …
FLY 3.2 vs FLY 3.3: 2M particles test

[Charts: wall-clock time (sec.) of the Static and Dynamic parts vs. number of processors (1-64), for FLY_3.2 and FLY_3.3]
FLY 3.2 vs FLY 3.3: 2M particles test

[Charts: Total simulation time (sec.) and Scalability, (t_n × n)/t_1, vs. number of processors (1-64), for FLY_3.2 and FLY_3.3]

Scalability: timing normalized on the number of processors.
FLY 3.3 vs FLY MPI-2: 2M particles test

[Charts: wall-clock time (sec.) of the Dynamic and Static parts vs. number of processors, for FLY_3.3 and FLY_MPI]
Conclusions
Present:
• Low performance MPI-2 version of FLY (for now)
• More scalable LAPI version of FLY
Future:
• TurboMPI2
• MPICH2 (porting to Linux clusters)
• OO interface to hydrodynamic codes (FLASH)