Date post: | 13-Jan-2016 |
Category: |
Documents |
Upload: | leon-stevenson |
CSC 7600 Lecture 17: Applied Parallel Algorithms 3, Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
APPLIED PARALLEL ALGORITHMS 3
Prof. Thomas Sterling
Dr. Hartmut Kaiser
Center for Computation and Technology
Louisiana State University
March 22, 2011
Topics
• LU Decomposition
• N-Body Problem
• Parallel Sorting
Puzzle of the Day
The expected output of the following C program is to print the elements in the array. But when actually run, it doesn't do so. Why?

    #include <stdio.h>

    int array[] = {23, 34, 12, 17, 204, 99, 16};
    #define TOTAL_ELEMENTS (sizeof(array) / sizeof(array[0]))

    int main()
    {
        int d;
        for (d = -1; d <= (TOTAL_ELEMENTS-2); ++d)
            printf("%d\n", array[d+1]);
        return 0;
    }
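One likely explanation, sketched here under the assumption of a typical platform where `size_t` is at least as wide as `int`: `sizeof` yields an unsigned `size_t`, so in the loop condition `d` is converted to unsigned, `-1` becomes a huge value, and the body never runs. The two helper functions below are hypothetical; they only count iterations instead of printing.

```c
#include <stddef.h>

static int array[] = {23, 34, 12, 17, 204, 99, 16};
#define TOTAL_ELEMENTS (sizeof(array) / sizeof(array[0]))

/* Counts iterations using the puzzle's original loop condition. */
int iterations_as_written(void) {
    int d, count = 0;
    /* d is converted to size_t for the comparison, so -1 compares as a
       very large unsigned value and the condition is false at once. */
    for (d = -1; d <= (TOTAL_ELEMENTS - 2); ++d)
        count++;
    return count;
}

/* Same loop, with the element count cast to int so the comparison is signed. */
int iterations_with_cast(void) {
    int d, count = 0;
    for (d = -1; d <= (int)TOTAL_ELEMENTS - 2; ++d)
        count++;
    return count;
}
```

With the cast, the loop visits all seven elements; many compilers also flag the original comparison when sign-comparison warnings are enabled.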
LU Factorization
• Gaussian elimination is simple, but
  – What if we have to solve many Ax = b systems for different values of b?
• This happens a LOT in real applications
• Another method is the “LU Factorization” (LU Decomposition)
• Ax = b
• Say we could rewrite A = L U, where L is a lower triangular matrix and U is an upper triangular matrix: O(n^3)
• Then Ax = b is written L U x = b
• Solve L y = b: O(n^2)
• Solve U x = y: O(n^2)
• Triangular system solves are easy: in the lower triangular system, equation i has i unknowns; in the upper triangular system, equation n-i has i unknowns
[Figure: a lower triangular and an upper triangular system, each written as matrix × x = right-hand side]
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
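The two triangular solves can be sketched as follows. This is a minimal illustration, not code from the lecture: the fixed size `N` and the function names `solve_lower` and `solve_upper` are assumptions, and both routines assume nonzero diagonal entries.

```c
#include <assert.h>

#define N 3  /* illustrative fixed size (assumption) */

/* Forward substitution: solve L y = b for lower triangular L.
   Row i only needs the already computed y[0..i-1]. */
void solve_lower(double L[N][N], const double b[N], double y[N]) {
    for (int i = 0; i < N; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= L[i][j] * y[j];
        assert(L[i][i] != 0.0);
        y[i] = s / L[i][i];
    }
}

/* Back substitution: solve U x = y for upper triangular U.
   Row i only needs the already computed x[i+1..N-1]. */
void solve_upper(double U[N][N], const double y[N], double x[N]) {
    for (int i = N - 1; i >= 0; i--) {
        double s = y[i];
        for (int j = i + 1; j < N; j++)
            s -= U[i][j] * x[j];
        assert(U[i][i] != 0.0);
        x[i] = s / U[i][i];
    }
}
```

Each routine touches about n^2/2 matrix entries, which is where the O(n^2) cost per right-hand side comes from once L and U are known.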
LU Factorization: Principle
• It works just like Gaussian elimination, but instead of zeroing out elements, one “saves” the scaling coefficients.
• Magically, A = L × U!
• Should be done with pivoting as well

Worked example (at each step: Gaussian elimination + save the scaling factor in the position that would have been zeroed):

    A                          1    2   -1
                               4    3    1
                               2    2    3

    eliminate a21, save 4      1    2   -1
                               4   -5    5
                               2    2    3

    eliminate a31, save 2      1    2   -1
                               4   -5    5
                               2   -2    5

    eliminate a32, save 2/5    1    2   -1
                               4   -5    5
                               2   2/5   3

    L =   1    0    0          U =   1    2   -1
          4    1    0                0   -5    5
          2   2/5   1                0    0    3
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
LU Factorization
    LU-sequential(A,n) {
      for k = 0 to n-2 {
        // preparing column k (stores the scaling factors)
        for i = k+1 to n-1
          a_ik ← -a_ik / a_kk
        for j = k+1 to n-1
          // Task T_kj: update of column j
          for i = k+1 to n-1
            a_ij ← a_ij + a_ik * a_kj
      }
    }
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
• We’re going to look at the simplest possible version
  – No pivoting: just creates a bunch of indirections that are easy but make the code look complicated without changing the overall principle
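A minimal C sketch of the same no-pivoting factorization follows. It is an illustration under stated assumptions: the size `N` and the name `lu_in_place` are hypothetical, and it uses the common convention of storing positive multipliers and subtracting (equivalent to the pseudo-code's negated factors and addition), so that the result matches the L and U of the worked 3×3 example.

```c
#include <assert.h>

#define N 3  /* illustrative fixed size (assumption) */

/* In-place LU without pivoting. Afterwards the strict lower triangle of a
   holds the multipliers (L has an implicit unit diagonal) and the upper
   triangle, diagonal included, holds U. */
void lu_in_place(double a[N][N]) {
    for (int k = 0; k < N - 1; k++) {
        assert(a[k][k] != 0.0);            /* no pivoting: zero pivot is fatal */
        for (int i = k + 1; i < N; i++)    /* prepare column k */
            a[i][k] /= a[k][k];
        for (int j = k + 1; j < N; j++)    /* update column j */
            for (int i = k + 1; i < N; i++)
                a[i][j] -= a[i][k] * a[k][j];
    }
}
```

Applied to the matrix with rows (1, 2, -1), (4, 3, 1), (2, 2, 3), the array ends up holding the multipliers 4, 2 and 2/5 below the diagonal and the rows of U on and above it.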
Parallel LU on a ring
• Since the algorithm operates by columns from left to right, we should distribute columns to processors
• Principle of the algorithm
  – At each step, the processor that owns column k does the “prepare” task and then broadcasts the bottom part of column k to all others
    • Annoying if the matrix is stored in row-major fashion
    • Remember that one is free to store the matrix in any way one wants, as long as it’s coherent and the right output is generated
  – After the broadcast, the other processors can then update their data.
• Assume there is a function alloc(k) that returns the rank of the processor that owns column k
  – Basically so that we don’t clutter our program with too many global-to-local index translations
• In fact, we will first write everything in terms of global indices, so as to avoid all annoying index arithmetic
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
LU-broadcast algorithm
    LU-broadcast(A,n) {
      q ← MY_NUM()
      p ← NUM_PROCS()
      for k = 0 to n-2 {
        if (alloc(k) == q)
          // preparing column k
          for i = k+1 to n-1
            buffer[i-k-1] ← a_ik ← -a_ik / a_kk
        broadcast(alloc(k),buffer,n-k-1)
        for j = k+1 to n-1
          if (alloc(j) == q)
            // update of column j
            for i = k+1 to n-1
              a_ij ← a_ij + buffer[i-k-1] * a_kj
      }
    }
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
Dealing with local indices
• Assume that p divides n
• Each processor needs to store r=n/p columns and its local indices go from 0 to r-1
• After step k, only columns with indices greater than k will be used
• Simple idea: use a local index, l, that everyone initializes to 0
• At step k, processor alloc(k) increases its local index so that next time it will point to its next local column
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
LU-broadcast algorithm
    ...
    double a[n][r];   // rows 0..n-1, local columns 0..r-1

    q ← MY_NUM()
    p ← NUM_PROCS()
    l ← 0
    for k = 0 to n-2 {
      if (alloc(k) == q) {
        for i = k+1 to n-1
          buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
        l ← l+1
      }
      broadcast(alloc(k),buffer,n-k-1)
      for j = l to r-1
        for i = k+1 to n-1
          a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
    }
    }
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
Bad load balancing
[Figure: columns distributed in contiguous blocks over P1, P2, P3, P4; labels: "already done", "already done", "working on it"]
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
Good Load Balancing?
Cyclic distribution
[Figure: columns distributed cyclically over the processors; labels: "working on it", "already done", "already done"]
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
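The difference between the two distributions can be made concrete with a small sketch. The function names `alloc_block`, `alloc_cyclic` and `active_columns` are hypothetical; only the idea of an `alloc(k)` owner function comes from the slides, and p is assumed to divide n for the block case.

```c
#include <assert.h>

/* Owner of global column k under a block distribution (p divides n). */
int alloc_block(int k, int n, int p) {
    assert(p > 0 && n % p == 0);
    return k / (n / p);
}

/* Owner of global column k under a cyclic distribution. */
int alloc_cyclic(int k, int p) {
    assert(p > 0);
    return k % p;
}

/* Number of columns j > k (the ones still updated at step k) that
   processor q owns, under the chosen distribution. */
int active_columns(int q, int k, int n, int p, int cyclic) {
    int count = 0;
    for (int j = k + 1; j < n; j++) {
        int owner = cyclic ? alloc_cyclic(j, p) : alloc_block(j, n, p);
        if (owner == q)
            count++;
    }
    return count;
}
```

For n = 8, p = 4 and step k = 3, the block distribution leaves processors 0 and 1 with nothing to update while processors 2 and 3 hold two columns each; the cyclic distribution leaves every processor with one column.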
Load-balanced program
    ...
    double a[n][r];   // rows 0..n-1, local columns 0..r-1

    q ← MY_NUM()
    p ← NUM_PROCS()
    l ← 0
    for k = 0 to n-2 {
      if (k mod p == q) {
        for i = k+1 to n-1
          buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
        l ← l+1
      }
      broadcast(alloc(k),buffer,n-k-1)
      for j = l to r-1
        for i = k+1 to n-1
          a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
    }
    }
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
Performance Analysis
• How long does this code take to run?
  – This is not an easy question because there are many tasks and many communications
• A little bit of analysis shows that the execution time is the sum of three terms
  – n-1 communications: n L + (n^2/2) b + O(1)
  – n-1 column preparations: (n^2/2) w’ + O(1)
  – column updates: (n^3/3p) w + O(n^2)
• Therefore, the execution time is O(n^3/p)
  – Note that the sequential time is: O(n^3)
• Therefore, we have perfect asymptotic efficiency!
  – This is good, but isn’t always the best in practice
• How can we improve this algorithm?
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
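The three leading terms can be evaluated directly to see which one dominates for a given n and p. The function below is a hypothetical helper, not part of the lecture; reading L as a per-message latency, b as a per-element transfer time, and w’ and w as per-element compute times is an assumption, and the lower-order O(1) and O(n^2) terms are dropped.

```c
/* Leading-order execution time estimate for the broadcast-based parallel LU:
   n*L + (n^2/2)*b  +  (n^2/2)*w_prep  +  (n^3/(3p))*w                      */
double lu_broadcast_time(double n, double p,
                         double L, double b, double w_prep, double w) {
    double comm   = n * L + (n * n / 2.0) * b;     /* n-1 communications      */
    double prep   = (n * n / 2.0) * w_prep;        /* n-1 column preparations */
    double update = (n * n * n / (3.0 * p)) * w;   /* column updates          */
    return comm + prep + update;
}
```

Only the update term shrinks with p, which is why the communication and preparation terms matter for small n even though the asymptotic efficiency is perfect.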
Pipelining on the Ring
• So far, in the algorithm we’ve used a simple broadcast
• Nothing was specific to being on a ring of processors and it’s portable
  – in fact you could just write raw MPI that just looks like our pseudo-code and have a very limited, inefficient for small n, LU factorization that works only for some number of processors
• But it’s not efficient
  – The n-1 communication steps are not overlapped with computations
  – Therefore Amdahl’s law, etc.
• Turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation
  – It almost looks like inserting the source code from the broadcast code we saw at the very beginning throughout the LU code
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
Previous program
    ...
    double a[n][r];   // rows 0..n-1, local columns 0..r-1

    q ← MY_NUM()
    p ← NUM_PROCS()
    l ← 0
    for k = 0 to n-2 {
      if (k mod p == q) {
        for i = k+1 to n-1
          buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
        l ← l+1
      }
      broadcast(alloc(k),buffer,n-k-1)
      for j = l to r-1
        for i = k+1 to n-1
          a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
    }
    }
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
LU-pipeline algorithm
    double a[n][r];   // rows 0..n-1, local columns 0..r-1

    q ← MY_NUM()
    p ← NUM_PROCS()
    l ← 0
    for k = 0 to n-2 {
      if (k mod p == q) {
        for i = k+1 to n-1
          buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
        l ← l+1
        send(buffer,n-k-1)
      } else {
        recv(buffer,n-k-1)
        if (q ≠ (k-1) mod p)
          send(buffer, n-k-1)
      }
      for j = l to r-1
        for i = k+1 to n-1
          a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
    }
    }
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
N Bodies
OU Supercomputing Center for Education & Research
OU Supercomputing Center for Education & Research
Img src: http://www.lsbu.ac.uk/water
N-Body Problems
An N-body problem is a problem involving N “bodies” – that is, particles (e.g., stars, atoms) – each of which applies a force to all of the others.
For example, if you have N stars, then each of the N stars exerts a force (gravity) on all of the other N–1 stars.
Likewise, if you have N atoms, then every atom exerts a force on all of the other N–1 atoms. The forces are Coulombic and van der Waals.
2-Body Problem
When N is 2, you have – surprise! – a 2-Body Problem: exactly two particles, each exerting a force that acts on the other.
The relationship between the 2 particles can be expressed as a differential equation that can be solved analytically, producing a closed-form solution.
So, given the particles’ initial positions and velocities, you can immediately calculate their positions and velocities at any later time.
OU Supercomputing Center for Education & Research
N-Body Problems
• For N of 3 or more, no one knows how to solve the equations to get a closed form solution.
• So, numerical simulation is pretty much the only way to study groups of 3 or more bodies.
• Popular applications of N-body codes include astronomy and chemistry.
• Note that, for N bodies, there are on the order of N^2 forces, denoted O(N^2).
OU Supercomputing Center for Education & Research
N-Body Problems
• Given N bodies, each body exerts a force on all of the other N–1 bodies.
• Therefore, there are N · (N–1) forces in total.
• You can also think of this as (N · (N–1))/2 forces, in the sense that the force from particle A to particle B is the same (except in the opposite direction) as the force from particle B to particle A.
OU Supercomputing Center for Education & Research
N-Body Problems
• Given N bodies, each body exerts a force on all of the other N–1 bodies.
• Therefore, there are N · (N–1) forces in total.
• In Big-O notation, that’s O(N^2) forces to calculate.
• So, calculating the forces takes O(N^2) time to execute.
• But, there are only N particles, each taking up the same amount of memory, so we say that N-body codes are of:
  – O(N) spatial complexity (memory)
  – O(N^2) time complexity
OU Supercomputing Center for Education & Research
How to Calculate?
• Whatever your physics is, you have some function, F(A,B), that expresses the force between two bodies A and B.
• For example,
F(A,B) = G · m_A · m_B / dist(A,B)^2
where G is the gravitational constant and m is the mass of the particle in question.
• If you have all of the forces for every pair of particles, then you can calculate their sum, obtaining the force on every particle.
OU Supercomputing Center for Education & Research
Gravitational N-Body Problem
• Objective is to find positions and movements of bodies in space (say planets) that are subject to gravitational forces from other bodies using Newtonian laws of physics.
• The gravitational force between two bodies of masses $m_a$ and $m_b$ a distance $r$ apart is
  $F = \frac{G m_a m_b}{r^2}$
• Subject to forces, a body will accelerate according to Newton’s second law:
  $F = ma$
  where m is the mass of the body, F is the force it experiences, and a is the resultant acceleration.
Gravitational N-Body Problem
• For a precise numeric description, differential equations would be used:
  $F = m \frac{dv}{dt}$ and $v = \frac{dx}{dt}$
• Let the time interval be $\Delta t$. Then, for a particular body of mass m, the force is given by
  $F = \frac{m (v^{t+1} - v^{t})}{\Delta t}$
  and a new velocity
  $v^{t+1} = v^{t} + \frac{F \Delta t}{m}$
  – where $v^{t+1}$ is the velocity of the body at time t + 1 and
  – $v^{t}$ is the velocity of the body at time t.
• If a body is moving at a velocity v over the time interval $\Delta t$, its position changes by
  $x^{t+1} - x^{t} = v \Delta t$
  – where $x^{t}$ is its position at time t.
• Once bodies move to new positions, the forces change and the computation has to be repeated.
Gravitational N-Body Problem
– The velocity is not actually constant over the time interval $\Delta t$.
• It can help to have a “leap-frog” computation in which velocity and position are computed alternately:
  $F^{t} = \frac{m (v^{t+1/2} - v^{t-1/2})}{\Delta t}$
• and
  $x^{t+1} - x^{t} = v^{t+1/2} \Delta t$
• where the positions are computed for times t, t + 1, t + 2, etc. and the velocities are computed for times t + 1/2, t + 3/2, t + 5/2, etc.
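A one-dimensional leap-frog update can be sketched as below. The `Body1D` struct and the function name `leapfrog_step` are hypothetical, introduced only for illustration; the update formulas are the two leap-frog equations, with velocities held at half steps and positions at whole steps.

```c
/* State of one body in one dimension (illustrative layout, an assumption). */
typedef struct {
    double x;       /* position at time t           */
    double v_half;  /* velocity at time t - 1/2     */
    double m;       /* mass                         */
} Body1D;

/* Advance one body by one time step dt, given the force f evaluated at
   its current position (time t). */
void leapfrog_step(Body1D *b, double f, double dt) {
    b->v_half += f * dt / b->m;   /* v^{t+1/2} = v^{t-1/2} + F^t * dt / m */
    b->x      += b->v_half * dt;  /* x^{t+1}   = x^t + v^{t+1/2} * dt     */
}
```

A caller would recompute the force from the new positions before each call, since the force at time t + 1 depends on where all bodies have moved.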
Three-Dimensional Space
– In a three-dimensional space having a coordinate system (x, y, z),
• the distance between the bodies at $(x_a, y_a, z_a)$ and $(x_b, y_b, z_b)$ is given by
  $r = \sqrt{(x_b - x_a)^2 + (y_b - y_a)^2 + (z_b - z_a)^2}$
• The forces are resolved in the three directions, using, for example,
  $F_x = \frac{G m_a m_b}{r^2} \left( \frac{x_b - x_a}{r} \right)$
  $F_y = \frac{G m_a m_b}{r^2} \left( \frac{y_b - y_a}{r} \right)$
  $F_z = \frac{G m_a m_b}{r^2} \left( \frac{z_b - z_a}{r} \right)$
  – where particles are of mass $m_a$ and $m_b$ and
  – have coordinates $(x_a, y_a, z_a)$ and $(x_b, y_b, z_b)$.
O(N^2) Forces
Note that this picture shows only the forces between A and everyone else.
[Figure: body A with the forces between it and every other body]
OU Supercomputing Center for Education & Research
Sequential Code
Overall gravitational N-body computation can be described by:

    for (t = 0; t < tmax; t++) {             /* for each time period */
        for (i = 0; i < N; i++) {            /* for each body */
            F = Force_routine(i);            /* compute force on ith body */
            vnew[i] = v[i] + F * dt / m;     /* compute new velocity */
            xnew[i] = x[i] + vnew[i] * dt;   /* and new position */
        }
        for (i = 0; i < N; i++) {            /* for each body */
            x[i] = xnew[i];                  /* update velocity & position */
            v[i] = vnew[i];
        }
    }

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
How to Parallelize?
Okay, so let’s say you have a nice serial (single-CPU) code that does an N-body calculation.
How are you going to parallelize it?
You could:
• have a master feed particles to processes;
• have a master feed interactions to processes;
• have each process decide on its own subset of the particles, and then share around the forces;
• have each process decide its own subset of the interactions, and then share around the forces.
OU Supercomputing Center for Education & Research
Do You Need a Master?
• Let’s say that you have N bodies, and therefore you have ½N(N-1) interactions (every particle interacts with all of the others, but you don’t need to calculate both A→B and B→A).
• Do you need a master?
• Well, can each processor determine on its own either (a) which of the bodies to process, or (b) which of the interactions?
• If the answer is yes, then you don’t need a master.
OU Supercomputing Center for Education & Research
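Option (a) can be sketched with a function that every process evaluates locally from its own rank, so no master is involved. The name `my_bodies` and the half-open range convention are assumptions made for this illustration; the split hands the first `n % size` ranks one extra body when the division is uneven.

```c
#include <assert.h>

/* Computes the half-open range [*lo, *hi) of body indices that process
   `rank` (out of `size` processes) handles when n bodies are split as
   evenly as possible. Every process can call this independently. */
void my_bodies(int rank, int size, int n, int *lo, int *hi) {
    assert(size > 0 && rank >= 0 && rank < size && n >= 0);
    int base  = n / size;   /* bodies every rank gets            */
    int extra = n % size;   /* ranks 0..extra-1 get one more     */
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}
```

With 10 bodies on 4 processes this yields the ranges [0,3), [3,6), [6,8) and [8,10), which tile the bodies without gaps or overlap.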
Nbody - OpenMP

    #ifndef _NBODY_H
    #define _NBODY_H

    /* Needed includes */
    #include <stdlib.h>   /* atoi */
    #include <stdio.h>    /* fprintf */
    #include <time.h>     /* clock/clock_gettime */
    #include <malloc.h>   /* malloc */
    #include <math.h>     /* sqrt */
    #ifdef WITH_OPENMP
    #include <omp.h>
    #endif

    /* Some constants */
    #define GRAV_CONST 6.673e-11   /* m^3/(kg*s^2) */
    #define EPSILON 1e-12

    /* for init_bodies */
    #define INIT_LINEAR 0
    #define INIT_SPIRAL 1

    /* ickiness in naming of time stuff */
    #define GNU_TIME
    #ifdef GNU_TIME
    # define __need_clock_t
    # define mytspec clock_t
    # define get_time(tspec) tspec = clock()
    #else
    # define mytspec timespec_t
    # define get_time(tspec) clock_gettime(CLOCK_SGI_CYCLE,&tspec)
    #endif

    #define X 0
    #define Y 1

    #if 0
    #ifndef _BOOL
    #endif
    #ifdef _STANDARD_C_PLUS_PLUS
    // If -LANG:std is specified, it defines the macro
    // in the preceding line. Use new-style headers
    // and a using directive to bring names from the
    // std namespace into the global namespace.
    #include <complex>
    #include <iostream>
    using namespace std;
    #else
    // If -LANG:std is not specified, use old-style headers,
    // and there is no need for a using directive.
    #include <complex.h>
    #include <iostream.h>
    #endif
    complex<float> x(1,2);
    #endif

    /* the Body structure */
    struct Body_struct {
        double mass;
        double pos[2];
        double vel[2];
    };
    typedef struct Body_struct Body;

    /* Subroutines in nbody.c */
    Body* init_bodies(unsigned int num_bodies, int init_type);
    int check_simulation(Body *bodies, int num_bodies);
    double elapsed_time(const mytspec t2, const mytspec t1);

    #endif /* _NBODY_H */
Nbody - OpenMP

    #include "nbody.h"
    #include <omp.h>

    int main(int argc, char **argv) {
        mytspec start_time, end_time;
        int num_bodies, num_steps, max_threads=1;
        int i, j, k, l;
        double dt=1.0, dv[2];
        double r[2], dist, force_len, force_ij[2], tot_force_i[2];
    #if NEWTON_OPT
        double *forces_matrix;
    #endif
        Body *bodies;
        const char Usage[] = "Usage: nbody <num bodies> <num_steps>\n";

        /* both arguments are read below, so require two of them */
        if (argc < 3) { fprintf(stderr, Usage); exit(1); }
        num_bodies = atoi(argv[1]);
        num_steps = atoi(argv[2]);

        /* Initialize with OpenMP */
    #ifdef _OPENMP
        max_threads = omp_get_max_threads();
    #else
        printf("Warning: no OpenMP!\n");
    #endif
    #if NEWTON_OPT > 0
        printf("Using Newton's third law optimization, variant %d.\n", NEWTON_OPT);
    #endif
        printf("Initializing with %d threads, %d bodies, %d time steps\n",
               max_threads, num_bodies, num_steps);
        bodies = init_bodies(num_bodies, INIT_SPIRAL);
        check_simulation(bodies, num_bodies);
Nbody - OpenMP

        printf("Running "); fflush(stdout);
        get_time(start_time);

    #define PRIVATE_VARS r, dist, force_len, force_ij, tot_force_i, dv
    /* dist holds the squared distance, so dist*sqrt(dist) is r^3 */
    #define Calc_Force_ij() \
        r[X] = bodies[j].pos[X] - bodies[i].pos[X]; \
        r[Y] = bodies[j].pos[Y] - bodies[i].pos[Y]; \
        dist = r[X]*r[X] + r[Y]*r[Y]; \
        force_len = GRAV_CONST * bodies[i].mass * bodies[j].mass \
                    / (dist*sqrt(dist)); \
        force_ij[X] = force_len * r[X]; \
        force_ij[Y] = force_len * r[Y]
    …

    #pragma omp parallel for private(j, PRIVATE_VARS)
        for (i=0; i<num_bodies; i++) {
            tot_force_i[X] = 0.0;
            tot_force_i[Y] = 0.0;
            for (j=0; j<num_bodies; j++) {
                if (j==i) continue;
                Calc_Force_ij();
                tot_force_i[X] += force_ij[X];
                tot_force_i[Y] += force_ij[Y];
            }
            Step_Body_i();
        }
    #elif NEWTON_OPT == 1
    #define forces(i,j,x) forces_matrix[x + 2*(j + num_bodies*i)]
Nbody - OpenMP

    …

    Body *init_bodies(unsigned int num_bodies, int init_type) {
        int i;
        double n = num_bodies;
        Body *bodies = (Body *) malloc(num_bodies * sizeof(Body));
        for (i=0; i<num_bodies; i++) {
            switch (init_type) {
            case INIT_LINEAR:
                bodies[i].mass = 1.0;
                bodies[i].pos[X] = i/n;
                bodies[i].pos[Y] = i/n;
                bodies[i].vel[X] = 0.0;
                bodies[i].vel[Y] = 0.0;
                break;
            case INIT_SPIRAL:
                bodies[i].mass = (n-i)/n;
                bodies[i].pos[X] = (1+i/n) * cos(2*M_PI*i/n) / 2;
                bodies[i].pos[Y] = (1+i/n) * sin(2*M_PI*i/n) / 2;
                bodies[i].vel[X] = 0.0;
                bodies[i].vel[Y] = 0.0;
                break;
            }
        }
        return bodies;   /* return only after all bodies are initialized */
    }
Nbody - OpenMP

    /**
     * Verify that the simulation is running correctly;
     * it should satisfy the invariant of conservation of momentum
     */
    int check_simulation(Body *bodies, int num_bodies) {
        int i, check_ok;
        double momentum[2] = { 0.0, 0.0 };
        for (i=0; i<num_bodies; i++) {
            momentum[X] += bodies[i].mass * bodies[i].vel[X];
            momentum[Y] += bodies[i].mass * bodies[i].vel[Y];
        }
        /* fabs, not abs: abs would truncate the double to an int */
        check_ok = ((fabs(momentum[X]) < EPSILON) && (fabs(momentum[Y]) < EPSILON));
        if (!check_ok)
            printf("Warning: total momentum = (%3.3f, %3.3f)\n",
                   momentum[X], momentum[Y]);
        return check_ok;
    }

    #ifdef GNU_TIME
    double elapsed_time(const mytspec t2, const mytspec t1) {
        return 1.0 * (t2 - t1) / CLOCKS_PER_SEC;
    }
    #else
    double elapsed_time(const mytspec t2, const mytspec t1) {
        return (((double)t2.tv_sec) + ((double)t2.tv_nsec / 1e9))
             - (((double)t1.tv_sec) + ((double)t1.tv_nsec / 1e9));
    }
    #endif
N-Body “Pipeline” Implementation Flowchart
1. Initialize MPI environment
2. Create ring communicator
3. Initialize particle parameters
4. Copy local particle data to send buffer
5. Initiate transmission of send buffer to the RIGHT neighbor in ring
6. Initiate reception of data from the LEFT neighbor in ring
7. Compute forces between local and send buffer particles
8. Wait for message exchange to complete
9. Copy particle data from receive buffer to send buffer
10. Processed particles from all remote nodes? N: go back to step 5. Y: continue.
11. Update positions of local particles
12. All iterations done? N: go back to step 4. Y: continue.
13. Finalize MPI
N-Body (source code)

    #include "mpi.h"
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>

    /* Pipeline version of the algorithm... */
    /* we really need the velocities as well… */

    /* Simplified structure describing parameters of a single particle */
    typedef struct { double x, y, z; double mass; } Particle;

    /* We use leapfrog for the time integration ... */

    /* Structure to hold force components and old position coordinates of a particle */
    typedef struct { double xold, yold, zold; double fx, fy, fz; } ParticleV;

    void InitParticles( Particle[], ParticleV [], int );
    double ComputeForces( Particle [], Particle [], ParticleV [], int );
    double ComputeNewPos( Particle [], ParticleV [], int, double, MPI_Comm );

    #define MAX_PARTICLES 4000
    #define MAX_P 128
N-Body (source code)

    int main( int argc, char *argv[] )
    {
        Particle  particles[MAX_PARTICLES];   /* Particles on ALL nodes */
        ParticleV pv[MAX_PARTICLES];          /* Particle velocity */
        Particle  sendbuf[MAX_PARTICLES],     /* Pipeline buffers */
                  recvbuf[MAX_PARTICLES];
        MPI_Request request[2];
        int counts[MAX_P],                    /* Number on each processor */
            displs[MAX_P];                    /* Offsets into particles */
        int rank, size, npart, i, j,
            offset;                           /* location of local particles */
        int totpart,                          /* total number of particles */
            cnt;                              /* number of times in loop */
        MPI_Datatype particletype;
        double sim_t;                         /* Simulation time */
        double time;                          /* Computation time */
        int pipe, left, right, periodic;
        MPI_Comm commring;
        MPI_Status statuses[2];

        /* Initialize MPI Environment */
        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );

        /* Create 1-dimensional periodic Cartesian communicator (a ring) */
        periodic = 1;
        MPI_Cart_create( MPI_COMM_WORLD, 1, &size, &periodic, 1, &commring );
        MPI_Cart_shift( commring, 0, 1, &left, &right );   /* Find the closest neighbors in ring */
N-Body (source code)

        /* Calculate local fraction of particles */
        if (argc < 2) {
            fprintf( stderr, "Usage: %s n\n", argv[0] );
            MPI_Abort( MPI_COMM_WORLD, 1 );
        }
        npart = atoi(argv[1]) / size;
        if (npart * size > MAX_PARTICLES) {
            fprintf( stderr, "%d is too many; max is %d\n", npart*size, MAX_PARTICLES );
            MPI_Abort( MPI_COMM_WORLD, 1 );
        }
        MPI_Type_contiguous( 4, MPI_DOUBLE, &particletype );   /* Data type corresponding to Particle struct */
        MPI_Type_commit( &particletype );

        /* Get the sizes and displacements */
        MPI_Allgather( &npart, 1, MPI_INT, counts, 1, MPI_INT, commring );
        displs[0] = 0;
        for (i=1; i<size; i++)
            displs[i] = displs[i-1] + counts[i-1];
        totpart = displs[size-1] + counts[size-1];

        /* Generate the initial values */
        InitParticles( particles, pv, npart);
        offset = displs[rank];
        cnt = 10;
        time = MPI_Wtime();
        sim_t = 0.0;

        /* Begin simulation loop */
        while (cnt--) {
            double max_f, max_f_seg;
N-Body (source code)

            /* Load the initial send buffer */
            memcpy( sendbuf, particles, npart * sizeof(Particle) );
            max_f = 0.0;
            for (pipe=0; pipe<size; pipe++) {
                if (pipe != size-1) {
                    /* Initialize send to the "right" neighbor, while receiving from the "left" */
                    MPI_Isend( sendbuf, npart, particletype, right, pipe, commring, &request[0] );
                    MPI_Irecv( recvbuf, npart, particletype, left, pipe, commring, &request[1] );
                }
                /* Compute forces */
                max_f_seg = ComputeForces( particles, sendbuf, pv, npart );
                if (max_f_seg > max_f) max_f = max_f_seg;

                /* Wait for updates to complete and copy received particles to the send buffer */
                if (pipe != size-1) MPI_Waitall( 2, request, statuses );
                memcpy( sendbuf, recvbuf, counts[pipe] * sizeof(Particle) );
            }
            /* Compute the changes in position using the already calculated forces */
            sim_t += ComputeNewPos( particles, pv, npart, max_f, commring );

            /* We could do graphics here (move particles on the display) */
        }
        time = MPI_Wtime() - time;
        if (rank == 0) {
            printf( "Computed %d particles in %f seconds\n", totpart, time );
        }
        MPI_Finalize();
        return 0;
    }
N-Body (source code)

    /* Initialize particle positions, masses and forces */
    void InitParticles( Particle particles[], ParticleV pv[], int npart )
    {
        int i;
        for (i=0; i<npart; i++) {
            particles[i].x = drand48();
            particles[i].y = drand48();
            particles[i].z = drand48();
            particles[i].mass = 1.0;
            pv[i].xold = particles[i].x;
            pv[i].yold = particles[i].y;
            pv[i].zold = particles[i].z;
            pv[i].fx = 0;
            pv[i].fy = 0;
            pv[i].fz = 0;
        }
    }

    /* Compute forces (2-D only) */
    double ComputeForces( Particle myparticles[], Particle others[], ParticleV pv[], int npart )
    {
        double max_f, rmin;
        int i, j;

        max_f = 0.0;
        for (i=0; i<npart; i++) {
            double xi, yi, mi, rx, ry, mj, r, fx, fy;
            rmin = 100.0;
            xi = myparticles[i].x;
            yi = myparticles[i].y;
            fx = 0.0;
            fy = 0.0;
N-Body (source code)

            for (j=0; j<npart; j++) {
                rx = xi - others[j].x;
                ry = yi - others[j].y;
                mj = others[j].mass;
                r = rx * rx + ry * ry;
                /* ignore overlap and same particle */
                if (r == 0.0) continue;
                if (r < rmin) rmin = r;
                /* compute forces */
                r = r * sqrt(r);
                fx -= mj * rx / r;
                fy -= mj * ry / r;
            }
            pv[i].fx += fx;
            pv[i].fy += fy;
            /* Compute a rough estimate of (1/m)|df / dx| */
            fx = sqrt(fx*fx + fy*fy)/rmin;
            if (fx > max_f) max_f = fx;
        }
        return max_f;
    }

    /* Update particle positions (2-D only) */
    double ComputeNewPos( Particle particles[], ParticleV pv[], int npart, double max_f, MPI_Comm commring )
    {
        int i;
        double a0, a1, a2;
        static double dt_old = 0.001, dt = 0.001;
        double dt_est, new_dt, dt_new;
N-Body (source code)

        /* integration is a0 * x^+ + a1 * x + a2 * x^- = f / m */
        a0 = 2.0 / (dt * (dt + dt_old));
        a2 = 2.0 / (dt_old * (dt + dt_old));
        a1 = -(a0 + a2);   /* also -2/(dt*dt_old) */
        for (i=0; i<npart; i++) {
            double xi, yi;
            /* Very, very simple leapfrog time integration. We use a variable
               step version to simplify time-step control. */
            xi = particles[i].x;
            yi = particles[i].y;
            particles[i].x = (pv[i].fx - a1 * xi - a2 * pv[i].xold) / a0;
            particles[i].y = (pv[i].fy - a1 * yi - a2 * pv[i].yold) / a0;
            pv[i].xold = xi;
            pv[i].yold = yi;
            pv[i].fx = 0;
            pv[i].fy = 0;
        }
        /* Recompute a time step. Stability criterion is roughly
           2/sqrt(1/m |df/dx|) >= dt. We leave a little room */
        dt_est = 1.0/sqrt(max_f);
        if (dt_est < 1.0e-6) dt_est = 1.0e-6;
        MPI_Allreduce( &dt_est, &dt_new, 1, MPI_DOUBLE, MPI_MIN, commring );
        /* Modify time step */
        if (dt_new < dt) {
            dt_old = dt;
            dt = dt_new;
        } else if (dt_new > 4.0 * dt) {
            dt_old = dt;
            dt *= 2.0;
        }
        return dt_old;
    }
Demo : N-Body Problem

    > mpiexec -np 4 ./nbodypipe 4000
    Computed 4000 particles in 1.119051 seconds
Barnes-Hut Algorithm
• Start with whole space in which one cube contains the bodies (or particles).
  – First, this cube is divided into eight subcubes.
  – If a subcube contains no particles, the subcube is deleted from further consideration.
  – If a subcube contains more than one body, it is recursively divided until every subcube contains not more than one body.
• This process creates an octtree; that is,
  – a tree with up to eight edges from each node.
• The leaves represent cells each containing one body.
• The decomposition for a two-dimensional case follows the same construction except with up to four edges from each node (a quadtree)
– In the Barnes-Hut algorithm, after the tree has been constructed, the total mass and center of mass of the subcube is stored at each node.
• The force on each body can then be obtained by traversing the tree starting at the root, stopping at a node when the clustering approximation can be used, e.g. when:
  $r \ge \frac{d}{\theta}$
  – where r is the distance from the body to the center of mass of the subcube, d is the side length of the subcube, and $\theta$ is a constant typically 1.0 or less ($\theta$ is called the opening angle).
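The stopping test during the tree traversal can be sketched as a single predicate. The function name `can_use_cluster` is hypothetical, and the reading of r as the distance to the cell's center of mass and d as the cell's side length is an assumption about the symbols in the criterion.

```c
#include <assert.h>

/* Returns 1 when a cell of side length d, whose center of mass lies at
   distance r from the body, may be treated as one point mass, i.e. when
   r >= d / theta. Written as r * theta >= d to avoid the division. */
int can_use_cluster(double r, double d, double theta) {
    assert(theta > 0.0 && d >= 0.0 && r >= 0.0);
    return r * theta >= d;
}
```

A smaller theta makes the test harder to pass, so more cells are opened and the result moves closer to the direct O(N^2) sum at a higher cost.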
• Once all the bodies have been given new positions and velocities,
  – the process is repeated for each time period.
  – This means that the whole octtree must be reconstructed for each time period (because the bodies have moved).
  – Constructing the tree requires a time of O(n log n), and so does computing all the forces, so that the overall time complexity of the method is O(n log n).
• The algorithm can be described by the following:

    for (t = 0; t < tmax; t++) {   /* for each time period */
        Build_Octtree();           /* construct Octtree (or Quadtree) */
        Tot_Mass_Center();         /* compute total mass & center */
        Comp_Force();              /* traverse tree/computing forces */
        Update();                  /* update position/velocity */
    }

  – Build_Octtree(): can be constructed from the positions of the bodies, considering each body in turn.
  – Tot_Mass_Center(): must traverse the tree, computing the total mass and center of mass at each node:
    $M = \sum_{i=0}^{7} m_i$, $C = \frac{1}{M} \sum_{i=0}^{7} (m_i \times c_i)$
    » where the positions of the centers of mass have three components, in the x, y, and z directions.
  – Comp_Force(): must visit nodes ascertaining whether the clustering approximation can be applied to compute the force of all the bodies in that cell.
    » If the clustering approximation cannot be applied, the children of the node must be visited.
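The per-node work of Tot_Mass_Center() can be sketched as combining the summaries of up to eight children. The `Cell` struct and the function name `combine_children` are hypothetical names for this illustration; the arithmetic is the total-mass and center-of-mass sums for one node.

```c
#include <assert.h>

/* Summary stored at a tree node (illustrative layout, an assumption). */
typedef struct {
    double m;     /* total mass                 */
    double c[3];  /* center of mass (x, y, z)   */
} Cell;

/* Combines up to eight child summaries into the parent's summary:
   M = sum of m_i, C = (1/M) * sum of m_i * c_i. */
Cell combine_children(const Cell child[], int nchild) {
    Cell parent = { 0.0, { 0.0, 0.0, 0.0 } };
    assert(nchild >= 0 && nchild <= 8);
    for (int i = 0; i < nchild; i++) {
        parent.m += child[i].m;
        for (int d = 0; d < 3; d++)
            parent.c[d] += child[i].m * child[i].c[d];
    }
    if (parent.m > 0.0)            /* leave C at the origin for empty cells */
        for (int d = 0; d < 3; d++)
            parent.c[d] /= parent.m;
    return parent;
}
```

Calling this bottom-up, leaves first, fills in every node with one pass over the tree.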
Orthogonal Recursive Bisection
(For 2-dimensional area) First, find a vertical line that divides the area into two areas, each with an equal number of bodies. For each area, find a horizontal line that divides it into two areas, each with an equal number of bodies. Repeat as required.
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
CSC 7600 Lecture 17: Applied Parallel Algorithms 3, Spring 2011
O(NlogN) TreeCode
54
[cdekate@celeritas tree]$ ./treecode nbody=4000
Hierarchical N-body code (theta scan)
   nbody    dtime      eps   theta  usequad    dtout    tstop
    4000  0.03125   0.0250    1.00    false  0.25000   2.0000
……
    time    |T+U|        T       -U     -T/U   |Vcom|   |Jtot|   CPUtot
   2.000  0.24094  0.23061  0.47155  0.48904  0.00019  0.00494    0.085
DEMO
N Body Viz
55
The Millennium Run used more than 10 billion particles to trace the evolution of the matter distribution in a cubic region of the Universe over 2 billion light-years on a side. It kept busy the principal supercomputer at the Max Planck Society's Supercomputing Centre in Garching, Germany for more than a month. By applying sophisticated modelling techniques to the 25 Tbytes of stored output, Virgo scientists have been able to recreate evolutionary histories both for the 20 million or so galaxies which populate this enormous volume and for the supermassive black holes which occasionally power quasars at their hearts. By comparing such simulated data to large observational surveys, one can clarify the physical processes underlying the buildup of real galaxies and black holes.
Topics
• LU Decomposition
• N-Body Problem
• Parallel Sorting
– Bubble Sort
– Merge Sort
– Heap Sort
– Quick Sort
56
Parallel Sorting
• Finding a permutation of a sequence [a1, a2, ..., an-1] such that a1 <= a2 <= ... <= an-1
• Often we sort records based on a key
• Parallel sort results in:
– Partial sequences are sorted on all nodes
– Largest value on node N-1 is smaller than or equal to the smallest value on node N
• Several ways to parallelize
– Chunk sequence, sort locally, merge back (bubblesort)
– Project algorithm structure onto communication and distribution scheme (quicksort)
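The two result conditions listed above can be checked with a short serial routine. This is only a sketch that models each node's data as an in-memory chunk; the `Chunk` type and the function `distributed_sorted` are hypothetical names, and a real run would gather or exchange boundary values with MPI instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the data held by one node. */
typedef struct { const int *v; int n; } Chunk;

/* Returns 1 if, walking the nodes in rank order, the values never
   decrease. That covers both conditions at once: every partial sequence
   is sorted, and the largest value on one node is <= the smallest value
   on the next. Empty chunks are allowed. */
int distributed_sorted(const Chunk *c, int p)
{
    int have_prev = 0, prev = 0;

    assert(p >= 0);
    for (int k = 0; k < p; k++) {
        for (int i = 0; i < c[k].n; i++) {
            if (have_prev && c[k].v[i] < prev)
                return 0;
            prev = c[k].v[i];
            have_prev = 1;
        }
    }
    return 1;
}
```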
57
Topics
• LU Decomposition
• N-Body Problem
• Parallel Sorting
– Bubble Sort
– Merge Sort
– Heap Sort
– Quick Sort
58
Bubble Sort
• The bubble sort is the oldest and simplest sort in use. Unfortunately, it's also the slowest.
• The bubble sort works by comparing each item in the list with the item next to it, and swapping them if required.
• The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).
• This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list.
• The bubble sort is generally considered to be the most inefficient sorting algorithm in common usage. Under best-case conditions (the list is already sorted), the bubble sort approaches linear, O(n), complexity. The general case is O(n^2).
• Pros: Simplicity and ease of implementation.
• Cons: Extremely inefficient.
Reference: http://math.hws.edu/TMCM/java/xSortLab/
Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/sorting/bubblesort.c
http://www.sci.hkbu.edu.hk
59
Bubblesort
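void sort(int *v, int n)
{
    int i, j, tmp;
    for (i = n-2; i >= 0; i--)         /* after each pass, v[i+1..n-1] is final */
        for (j = 0; j <= i; j++)
            if (v[j] > v[j+1]) {       /* swap an out-of-order adjacent pair */
                tmp = v[j];
                v[j] = v[j+1];
                v[j+1] = tmp;
            }
}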
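The listing on this slide always runs every pass, so it never reaches the O(n) best case mentioned earlier. A minimal sketch of the early-exit variant, which stops after the first pass that makes no swaps, is shown here; the function name and the returned pass count are additions for illustration only.

```c
#include <assert.h>

/* Bubble sort that stops as soon as a full pass makes no swaps.
   Returns the number of passes performed. */
int bubble_sort_early_exit(int *v, int n)
{
    int passes = 0;
    int swapped = 1;

    assert(n >= 0);
    for (int last = n - 1; swapped && last > 0; last--) {
        swapped = 0;
        for (int j = 0; j < last; j++) {
            if (v[j] > v[j + 1]) {
                int t = v[j];
                v[j] = v[j + 1];
                v[j + 1] = t;
                swapped = 1;
            }
        }
        passes++;
    }
    return passes;
}
```

On an already sorted list the function makes a single pass of n-1 comparisons and returns.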
60
Bubblesort
61
Discussion
• Bubble sort takes time proportional to N*N/2 for N data items.
• This parallelization splits the N data items into chunks of N/P, so the sorting time on each of the P processors is now proportional to (N/P*N/P)/2, i.e. the local sorting time is reduced by a factor of P*P (not counting the cost of merging the chunks back)!
• Bubble sort is much slower than quick sort!
– Better to run quick sort on a single processor than bubble sort on many processors!
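As a rough worked example of the estimate above, assume N = 10^6 items and P = 4 processors (values chosen only for illustration), and ignore communication and the final merge:

```latex
\begin{align*}
T_1 &\propto \frac{N^2}{2} = \frac{(10^6)^2}{2} = 5 \times 10^{11} \\
T_P &\propto \frac{(N/P)^2}{2} = \frac{(2.5 \times 10^5)^2}{2} = 3.125 \times 10^{10} \\
\frac{T_1}{T_P} &= P^2 = 16
\end{align*}
```

For comparison, N log2 N for the same N is about 2 × 10^7, which, up to constant factors, is why a single-processor quick sort is expected to beat this parallel bubble sort.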
http://www.sci.hkbu.edu.hk
62
Topics
• LU Decomposition
• N-Body Problem
• Parallel Sorting
– Bubble Sort
– Merge Sort
– Heap Sort
– Quick Sort
63
Merge Sort
• The merge sort splits the list to be sorted into two equal halves, and places them in separate arrays.
• Each array is recursively sorted, and then merged back together to form the final sorted list.
• Like most recursive sorts, the merge sort has an algorithmic complexity of O(n log n).
• Elementary implementations of the merge sort make use of three arrays: one for each half of the data set and one to store the sorted list in. The algorithm below merges the arrays in-place, so only two arrays are required. There are non-recursive versions of the merge sort, but they don't yield any significant performance enhancement over the recursive algorithm on most machines.
• Pros: Marginally faster than the heap sort for larger sets.
• Cons: At least twice the memory requirements of the other sorts; recursive.
Reference
http://math.hws.edu/TMCM/java/xSortLab/
64
Merge Sort
[cdekate@celeritas sort]$ mpiexec -np 4 ./mergesort
1000000; 4 processors; 0.250000 secs
[cdekate@celeritas sort]$
65
Mergesort
void msort(int *A, int min, int max)
{
    int *C;    /* dummy, just to fit the function */
    int mid = (min+max)/2;
    int lowerCount = mid - min + 1;
    int upperCount = max - mid;

    /* If the range consists of at most one element, it's already sorted */
    if (max <= min) {
        return;
    } else {
        /* Otherwise, sort the first half */
        msort(A, min, mid);
        /* Now sort the second half */
        msort(A, mid+1, max);
        /* Now merge the two halves (merge() is defined elsewhere) */
        C = merge(A + min, lowerCount, A + mid + 1, upperCount);
    }
}
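The merge() helper called by msort() is not shown on the slides. A minimal sketch with the same call shape is given here as an assumption, not as the lecture's implementation: it expects the two sorted runs to be adjacent in memory and, unlike the in-place version the slides mention, it uses a temporary buffer before copying the result back.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Merge sorted run a[0..na-1] with sorted run b[0..nb-1], where b must
   start directly after a (b == a + na). The merged result is written
   back over both runs. Returns a, or NULL if the temporary buffer
   cannot be allocated. */
int *merge(int *a, int na, int *b, int nb)
{
    int i = 0, j = 0, k = 0;
    int *tmp;

    assert(na >= 0 && nb >= 0);
    assert(b == a + na);
    if (na + nb == 0)
        return a;
    tmp = malloc((size_t)(na + nb) * sizeof *tmp);
    if (tmp == NULL)
        return NULL;
    while (i < na && j < nb)             /* take from a on ties: stable */
        tmp[k++] = (b[j] < a[i]) ? b[j++] : a[i++];
    while (i < na)
        tmp[k++] = a[i++];
    while (j < nb)
        tmp[k++] = b[j++];
    memcpy(a, tmp, (size_t)(na + nb) * sizeof *tmp);
    free(tmp);
    return a;
}
```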
66
Mergesort
67
Topics
• LU Decomposition
• N-Body Problem
• Parallel Sorting
– Bubble Sort
– Merge Sort
– Heap Sort
– Quick Sort
68
Heap Sort
• The heap sort is the slowest of the O(n log n) sorting algorithms, but unlike the merge and quick sorts it doesn't require massive recursion or multiple arrays to work. This makes it the most attractive option for very large data sets of millions of items.
• The heap sort works as its name suggests:
1. It begins by building a heap out of the data set,
2. Then removing the largest item and placing it at the end of the sorted array.
3. After removing the largest item, it reconstructs the heap and removes the largest remaining item and places it in the next open position from the end of the sorted array.
4. This is repeated until there are no items left in the heap and the sorted array is full.
• Elementary implementations require two arrays: one to hold the heap and the other to hold the sorted elements.
• To do an in-place sort and save the space the second array would require, the algorithm below "cheats" by using the same array to store both the heap and the sorted array. Whenever an item is removed from the heap, it frees up a space at the end of the array that the removed item can be placed in.
• Pros: In-place and non-recursive, making it a good choice for extremely large data sets.
• Cons: Slower than the merge and quick sorts.
Reference: http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/heapsort.html
Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/heapsort/heapsort.c
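The in-place scheme described above can be sketched as follows. This is an illustrative version written for these notes, not the heapsort.c file linked above; the names `sift_down` and `heap_sort` are assumptions.

```c
#include <assert.h>

/* Restore the max-heap property for the subtree rooted at `root`,
   considering only elements v[0..end] (end is inclusive). */
static void sift_down(int *v, int root, int end)
{
    while (2 * root + 1 <= end) {
        int child = 2 * root + 1;              /* left child */
        if (child + 1 <= end && v[child] < v[child + 1])
            child++;                           /* right child is larger */
        if (v[root] >= v[child])
            return;                            /* heap property holds */
        int t = v[root];
        v[root] = v[child];
        v[child] = t;
        root = child;
    }
}

/* In-place, non-recursive heap sort into ascending order. */
void heap_sort(int *v, int n)
{
    assert(n >= 0);
    /* Step 1: build a max-heap out of the data set. */
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(v, i, n - 1);
    /* Steps 2-4: move the largest item into the slot freed at the end
       of the shrinking heap, then rebuild the heap. */
    for (int end = n - 1; end > 0; end--) {
        int t = v[0];
        v[0] = v[end];
        v[end] = t;
        sift_down(v, 0, end - 1);
    }
}
```

The heap occupies the front of the array and the sorted items accumulate at the back, so no second array is needed.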
69
Heapsort
70
Topics
• LU Decomposition
• N-Body Problem
• Parallel Sorting
– Bubble Sort
– Merge Sort
– Heap Sort
– Quick Sort
71
Quick Sort
• The quick sort is an in-place, divide-and-conquer, massively recursive sort.
• Divide and Conquer Algorithms
– Algorithms that solve (conquer) problems by dividing them into smaller sub-problems until the problem is so small that it is trivially solved.
• In Place
– In-place sorting algorithms don't require additional temporary space to store elements as they sort; they use the space originally occupied by the elements.
• Quicksort takes time proportional to N*N for N data items in the worst case, but usually N*log2N, which is much faster
– for 1,000,000 items, N*log2N ~ 1,000,000*20
• Constant communication cost – 2*N data items
– for 1,000,000 items, must send/receive 2*1,000,000 from/to root
• In general, processing/communication is proportional to N*log2N/(2*N) = log2N/2
– so for 1,000,000 items, only 20/2 = 10 times as much processing as communication
• Suggests can only get speedup, with this parallelization, for very large N
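The ratio can be worked through for two problem sizes. The second size, N = 10^9, is an added illustration and not a figure from the lecture:

```latex
\begin{align*}
\frac{\text{processing}}{\text{communication}} &\propto \frac{N \log_2 N}{2N} = \frac{\log_2 N}{2} \\
N = 10^6: \quad & \log_2 N \approx 19.9, \quad \frac{\log_2 N}{2} \approx 10 \\
N = 10^9: \quad & \log_2 N \approx 29.9, \quad \frac{\log_2 N}{2} \approx 15
\end{align*}
```

Increasing N by a factor of 1000 raises the ratio only from about 10 to about 15, which illustrates how slowly the balance shifts toward computation.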
Reference: http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/qsort.html
Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/qsort/qsort.c
http://www.sci.hkbu.edu.hk
72
Quick Sort
• The recursive algorithm consists of four steps (which closely resemble the merge sort):
1. If there is one element or fewer in the array to be sorted, return immediately.
2. Pick an element in the array to serve as a "pivot" point. (Usually the left-most element in the array is used.)
3. Split the array into two parts - one with elements larger than the pivot and the other with elements smaller than the pivot.
4. Recursively repeat the algorithm for both parts of the original array.
• The efficiency of the algorithm depends strongly on which element is chosen as the pivot point.
• The worst-case efficiency of the quick sort, O(n^2), occurs when the list is already sorted and the left-most element is chosen as the pivot.
• If the data to be sorted isn't random, randomly choosing a pivot point is recommended. As long as the pivot point is chosen randomly, the quick sort has an expected algorithmic complexity of O(n log n).
Pros: Extremely fast.
Cons: Very complex algorithm, massively recursive.
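The four steps can be sketched in C with a randomly chosen pivot. This is an illustrative serial version, not the qsort.c source cited in the lecture; the function names and the partition scheme (Lomuto, with the pivot parked at the right end) are choices made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

static void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Sorts v[lo..hi] (inclusive bounds) into ascending order. */
void quick_sort(int *v, int lo, int hi)
{
    assert(v != NULL);
    if (lo >= hi)                              /* step 1: 0 or 1 element */
        return;

    int p = lo + rand() % (hi - lo + 1);       /* step 2: random pivot */
    swap_int(&v[p], &v[hi]);                   /* park the pivot at the end */
    int pivot = v[hi];

    int store = lo;                            /* step 3: partition */
    for (int i = lo; i < hi; i++) {
        if (v[i] < pivot) {
            swap_int(&v[i], &v[store]);
            store++;
        }
    }
    swap_int(&v[store], &v[hi]);               /* pivot into final position */

    quick_sort(v, lo, store - 1);              /* step 4: recurse on both parts */
    quick_sort(v, store + 1, hi);
}
```

With a random pivot, an already sorted input no longer triggers the worst case systematically, although any particular run can still be unlucky.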
http://www.sci.hkbu.edu.hk
73
Quicksort
74
Summary : Material for the Test
• LU decomposition: Slides 5-19
• N-body problem: Slides 33-48
• Sorting Algorithms: Slides 57-74
75