Agenda
Parallel jobs
» Paradigms to parallelize algorithms
» Profiling and compiler optimization
» Implementations to parallelize code
• OpenMP
• MPI
Queuing
» Job queuing
» Integrating Parallel Programs
Questions and Answers
Traditional “Ad-Hoc” Linux Cluster
Full Linux install to disk; Load into memory
» Manual & Slow; 5-30 minutes
Full set of disparate daemons, services, user/password, & host access setup
Jobs run via a basic parallel shell held together by complex glue scripts
Monitoring & management added as isolated tools
[Diagram: master node connected to compute nodes over an interconnection network; the master also connects to the Internet or an internal network]
Cluster Virtualization Architecture Realized
Minimal in-memory OS with a single daemon, rapidly deployed in seconds - no disk required
» Less than 20 seconds
Virtual, unified process space enables intuitive single sign-on, job submission
» Effortless job migration to nodes
Monitor & manage efficiently from the Master
» Single System Install
» Single Process Space
» Shared cache of the cluster state
» Single point of provisioning
» Better performance due to lightweight nodes
» No version skew, which makes the cluster inherently more reliable
[Diagram: master node connected to lightweight compute nodes (with optional disks) over an interconnection network; the master also connects to the Internet or an internal network]
Just a Primer
Only a brief introduction is provided here. Many other in-depth tutorials are available on the web and in published sources.
» http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
» https://computing.llnl.gov/?set=training&page=index
Parallel Code Primer
Paradigms for writing parallel programs depend upon the application
» SIMD (single-instruction multiple-data)
» MIMD (multiple-instruction multiple-data)
» MISD (multiple-instruction single-data)
SIMD will be presented here as it is a commonly used template
» A single application source is compiled to perform operations on different sets of data (Single Program Multiple Data (SPMD) model)
» The data is read by the different threads or passed between threads via messages (hence MPI = message passing interface)
• Contrast this with shared memory or OpenMP, where data is accessed locally via shared memory
• Optimizations in the MPI implementation can perform localhost optimization; however, the program is still written using a message passing construct
Explicitly Parallel Programs
Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
Most distributed parallel programs are now written using MPI
» Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
» ClusterWare comes integrated with customized versions of MPICH and OpenMPI
Example Code
Calculate π through numerical integration
» Compute π by integrating f(x) = 4/(1 + x**2) from 0 to 1
• The integrand is four times the derivative of arctan(x), so the integral over [0, 1] is 4·arctan(1) = π
• See source code (a serial sketch follows below)
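The serial source is not reproduced in this transcript. A minimal sketch of what cpi-serial.c could look like, consistent with the compile commands and output shown under "Compiling and running code" below (the exact variable names and output format are assumptions):

    /* cpi-serial.c - approximate pi with the midpoint rule (illustrative sketch) */
    #include <stdio.h>
    #include <math.h>

    double f(double a) { return 4.0 / (1.0 + a * a); }

    int main(void)
    {
        const double PI25DT = 3.141592653589793238462643;
        long n = 400000000;                /* number of intervals */
        double h = 1.0 / (double) n;
        double sum = 0.0;
        long i;

        printf("Process 0\n");
        for (i = 1; i <= n; i++)
            sum += f(h * ((double) i - 0.5));   /* evaluate f at the midpoint of interval i */

        printf("pi is approximately %.16f, Error is %.16f\n",
               h * sum, fabs(h * sum - PI25DT));
        return 0;
    }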
Compiling and running code
set n = 400,000,000

$ gcc -o cpi-serial cpi-serial.c
$ time ./cpi-serial
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m11.009s
user    0m11.007s
sys     0m0.000s

$ gcc -g -pg -o cpi-serial_prof cpi-serial.c
$ time ./cpi-serial_prof
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m11.012s
user    0m11.010s
sys     0m0.000s
$ ls -ltra
total 40
…
-rw-rw-r-- 1 jhan jhan 586 Mar  7 14:45 gmon.out
Profiling
-g flag includes debugging information in the binary. Useful for gdb tracing of an application and for profiling
-pg flag generates code to write profile information
$ gprof cpi-serial_prof gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds      calls  ns/call  ns/call  name
74.48      2.85     2.85                               main
23.95      3.77     0.92  400000000     2.29     2.29   f
 2.37      3.86     0.09                               frame_dummy

Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 0.26% of 3.86 seconds

index % time    self  children    called                 name
                                                         <spontaneous>
[1]     97.7    2.85    0.92                             main [1]
                0.92    0.00  400000000/400000000            f [2]
-----------------------------------------------
                0.92    0.00  400000000/400000000            main [1]
[2]     23.8    0.92    0.00  400000000                  f [2]
-----------------------------------------------
                                                         <spontaneous>
[3]      2.3    0.09    0.00                             frame_dummy [3]
-----------------------------------------------
-A flag to gprof shows calls next to source code
Profiling Tips
Code should be profiled using a realistic data set
» Contrast the call graphs of n=100 versus n=400,000,000
Profiling can give tips about where to optimize the current algorithm, but it can’t suggest alternative (better) algorithms
» e.g. a Monte Carlo algorithm to calculate π
Amdahl’s Law
» The speedup parallelization achieves is limited by the serial part of the code
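» As a reminder (the formula is not shown on the slide), a common statement of Amdahl's Law is

    S(N) = 1 / ((1 - p) + p/N)

  where p is the fraction of the runtime that can be parallelized and N is the number of processors; even as N grows, the speedup is capped at 1/(1 - p).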
OpenMP Introduction
Parallelization using shared memory in a single machine
» A portion of the code is forked on the machine to parallelize it
» i.e. not distributed parallelization
Done using pragmas in the source code. Compiler must support OpenMP (gcc 4, Intel, etc.)
» $ gcc -fopenmp -o cpi-openmp cpi-openmp.c
» See source code (a sketch follows below)
Profiling can add overhead to the resulting executable
'time' can be used to measure improvement
Runtime selection of the number of threads using the OMP_NUM_THREADS environment variable
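The OpenMP source is not included in this transcript. A minimal sketch of what cpi-openmp.c could look like (variable names and output format are assumptions; the reduction clause is what makes the parallel accumulation safe):

    /* cpi-openmp.c - midpoint-rule pi, parallelized with OpenMP (illustrative sketch) */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double PI25DT = 3.141592653589793238462643;
        long n = 400000000;                 /* number of intervals */
        double h = 1.0 / (double) n;
        double sum = 0.0, x;
        long i;

        /* Each thread works on a chunk of the loop; reduction(+:sum) gives every
           thread a private partial sum and combines them when the loop ends. */
        #pragma omp parallel for private(x) reduction(+:sum)
        for (i = 1; i <= n; i++) {
            x = h * ((double) i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }

        printf("pi is approximately %.16f, Error is %.16f\n",
               h * sum, fabs(h * sum - PI25DT));
        return 0;
    }

Built with 'gcc -fopenmp -o cpi-openmp cpi-openmp.c' and run as 'OMP_NUM_THREADS=4 ./cpi-openmp', as on the following slide.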
Scaling with OpenMP
$ time OMP_NUM_THREADS=1 ./cpi-openmp
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m10.583s
user    0m10.581s
sys     0m0.001s

$ time OMP_NUM_THREADS=2 ./cpi-openmp
Process 0
Process 1
pi is approximately 3.1415926535900218, Error is 0.0000000000002287

real    0m5.295s
user    0m11.297s
sys     0m0.000s

$ time OMP_NUM_THREADS=4 ./cpi-openmp
…
real    0m2.650s
user    0m10.586s
sys     0m0.003s

$ time OMP_NUM_THREADS=8 ./cpi-openmp
…
real    0m1.416s
user    0m11.294s
sys     0m0.001s
Scaling with OpenMP
[Chart: "OpenMP Speedup" - measured speedup versus number of processors (1-8), plotted against the ideal 1:1 speedup line]
Code is easy to parallelize
Good scaling is seen up to 8 processors; a kink in the curve is expected
Role of the Compiler
GCC versus Intel C
$ time OMP_NUM_THREADS=1 ./cpi-openmp
…
real    0m10.583s
user    0m10.581s
sys     0m0.001s

$ gcc -O3 -fopenmp -o cpi-openmp-gcc-O3 cpi-openmp.c
$ time OMP_NUM_THREADS=1 ./cpi-openmp-gcc-O3
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m3.154s
user    0m3.143s
sys     0m0.011s
$ time OMP_NUM_THREADS=8 ./cpi-openmp-gcc-O3
…
real    0m0.399s
user    0m3.181s
sys     0m0.001s

$ icc -openmp -o cpi-openmp-icc cpi-openmp.c
$ time OMP_NUM_THREADS=1 ./cpi-openmp-icc
Process 0
pi is approximately 3.1415926535899272, Error is 0.0000000000001341

real    0m1.618s
user    0m1.575s
sys     0m0.000s
$ time OMP_NUM_THREADS=8 ./cpi-openmp-icc
…
real    0m0.204s
user    0m1.584s
sys     0m0.001s
OpenMP Summary
OpenMP provides a mechanism to parallelize within a single machine
Shared memory and variables are handled “automatically”
With an appropriate compiler, OpenMP can provide significant speedups
Coupled with large core count SMP machines, OpenMP could be all of the parallelization required
» GPU programming is similar to the OpenMP model
Explicitly Parallel Programs
Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
Most distributed parallel programs are now written using MPI
» Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
» ClusterWare comes integrated with customized versions of MPICH and OpenMPI
Running MPI Code
Binaries are executed simultaneously
» on the same machine or different machines
After the binaries start running, the MPI_COMM_WORLD communicator is established
Any data to be transferred must be explicitly specified by the programmer
Hooks exist for a number of languages
» E.g. Python (https://computing.llnl.gov/code/pdf/pyMPI.pdf)
Example MPI Source
cpi.c calculates π using MPI in C
#include "mpi.h"#include <stdio.h>#include <math.h>
double f( double );double f( double a ){ return (4.0 / (1.0 + a*a));}
int main( int argc, char *argv[]){ int done = 0, n, myid, numprocs, i; double PI25DT = 3.141592653589793238462643; double mypi, pi, h, sum, x; double startwtime = 0.0, endwtime; int namelen; char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD,&numprocs); MPI_Comm_rank(MPI_COMM_WORLD,&myid); MPI_Get_processor_name(processor_name,&namelen);
fprintf(stderr,"Process %d on %s\n", myid, processor_name);
n = 0;
while (!done) { if (myid == 0) {/* printf("Enter the number of intervals: (0
quits) "); scanf("%d",&n);*/
if (n==0) n=100; else n=0;
startwtime = MPI_Wtime(); } MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); if (n == 0) done = 1; else { h = 1.0 / (double) n; sum = 0.0; for (i = myid + 1; i <= n; i += numprocs) { x = h * ((double)i - 0.5); sum += f(x); } mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0) {
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));endwtime = MPI_Wtime();printf("wall clock time = %f\n", endwtime-startwtime);
}
} } MPI_Finalize();
return 0;}
Slide annotations on the source:
» mpi.h is the system include file that defines the MPI functions
» MPI_Init initializes the MPI execution environment
» MPI_Comm_size determines the size of the group associated with a communicator
» MPI_Comm_rank determines the rank of the calling process in the communicator
» MPI_Get_processor_name gets the name of the processor
» Actions are differentiated based on rank; only the "master" (rank 0) sets n and the start time
» MPI_Wtime is the MPI built-in function to get a time value
» MPI_Bcast broadcasts "1" MPI_INT starting at &n from the process with rank "0" to all other processes of the group
» Each worker runs the loop, incrementing the counter by the number of processes (versus dividing the range, which risks an off-by-one error)
» MPI_Reduce applies the MPI_SUM function to "1" MPI_DOUBLE at &mypi on all workers in MPI_COMM_WORLD, producing a single value at &pi on rank "0"
» Only rank "0" outputs the value of pi; MPI_Finalize terminates the MPI execution environment
Other Common MPI Functions
MPI_Send, MPI_Recv
» Blocking send and receive between two specific ranks
MPI_Isend, MPI_Irecv
» Non-blocking send and receive between two specific ranks
man pages exist for the MPI functions
Poorly written programs can suffer from poor communication efficiency (e.g. stair-stepping), or can lose data if the system buffer fills before a blocking send or receive is posted to match a non-blocking receive or send (a basic blocking send/receive sketch follows this list)
Care should be taken when creating temporary files: multiple ranks may be running on the same host and can overwrite the same temporary file (include the rank in the file name, in a unique temporary directory per simulation)
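A minimal, hypothetical sketch (not from the slides) of the blocking pattern: rank 0 sends a small buffer to rank 1 with MPI_Send, and rank 1 posts a matching MPI_Recv with the same tag and communicator.

    /* Illustrative blocking point-to-point example between rank 0 and rank 1 */
    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myid, numprocs;
        double buf[4] = { 1.0, 2.0, 3.0, 4.0 };

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

        if (numprocs >= 2) {
            if (myid == 0) {
                /* tag 99 must match the tag used by the receiver */
                MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
            } else if (myid == 1) {
                MPI_Status status;
                MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
                printf("Rank 1 received %g %g %g %g\n",
                       buf[0], buf[1], buf[2], buf[3]);
            }
        }
        MPI_Finalize();
        return 0;
    }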
Compiling MPICH programs
mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/MPICH
» Environment variables can be used to set the compiler:
• CC, CPP, FC, F90
» Command line options to set the compiler:
• -cc=, -cxx=, -fc=, -f90=
» GNU, PGI, and Intel compilers are supported
Running MPICH programs
mpirun is used to launch MPICH programs
Dynamic allocation can be done when using the -np flag
Mapping is also supported when using the -map flag
If Infiniband is installed, the interconnect fabric can be chosen using the machine flag:
» -machine p4
» -machine vapi
Scaling with MPI
$ which mpicc
/usr/bin/mpicc
$ mpicc -show -o cpi-mpi cpi-mpi.c
gcc -L/usr/lib64/MPICH/p4/gnu -I/usr/include -o cpi-mpi cpi-mpi.c -lmpi -lbproc
$ mpicc -o cpi-mpi cpi-mpi.c
$ time mpirun -np 1 ./cpi-mpi
Process 0 on scyld.localdomain
…
real    0m11.198s
user    0m11.187s
sys     0m0.010s
$ time mpirun -np 2 ./cpi-mpi
Process 0 on scyld.localdomain
Process 1 on n0
…
real    0m6.486s
user    0m5.510s
sys     0m0.009s
$ time mpirun -map -1:-1:-1:-1:-1:-1:-1:-1:0:0:0:0:0:0:0:0 ./cpi-mpi
…
real    0m1.283s
user    0m1.381s
sys     0m0.016s
Environment Variable Options
Additional environment variable control:
» NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.
» ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
» ALL_NODES—Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.
» ALL_LOCAL — Run every process on the master node; used for debugging purposes.
» NO_LOCAL — Don’t run any processes on the master node.
» EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment.
» BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will be the first process (MPI Rank 0) and so on.
Compiling and Running OpenMPI programs
The env-modules package allows users to change their environment variables according to predefined files
» module avail
» module load openmpi/gnu
» GNU, PGI, and Intel compilers are supported
mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /opt/scyld/openmpi
mpirun is used to run code
Interconnect can be selected at runtime
» -mca btl openib,tcp,sm,self
» -mca btl udapl,tcp,sm,self
Compiling and Running OpenMPI programs
What env-modules does: Set user environment prior to compiling
» export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /opt/scyld/openmpi
» Environment variables can be used to set the compiler:
• OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_FC
Prior to running, PATH and LD_LIBRARY_PATH should be set
» module load openmpi/gnu
» /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
OR:
» export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
export MANPATH=/opt/scyld/openmpi/gnu/share/man
export LD_LIBRARY_PATH=/opt/scyld/openmpi/gnu/lib:${LD_LIBRARY_PATH}
» /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
Scaling with MPI Implementations
[Chart: "Scaling with MPI for gcc" - speedup relative to GCC versus number of processors (up to 16) for MPICC - GCC - P4, MPICC - GCC - VAPI, OpenMPI - GCC - tcp,sm,self, and OpenMPI - GCC - openib,sm,self, plotted against the ideal 1:1 speedup line]
Infiniband allows wider scaling
There is a performance difference between MPICH and OpenMPI
A little artificial because it’s only two physical machines
Scaling with MPI Implementations
Larger problems would allow continued scaling
[Chart: "Compiler Comparison" - speedup relative to GCC versus number of processors (up to 16). Series: Ideal Speedup 1:1, MPICC - GCC - P4, MPICC - GCC - VAPI, OpenMPI - GCC - tcp,sm,self, OpenMPI - GCC - openib,sm,self; Ideal Speedup 3.49:1, MPICC - GCC -O3 - P4, MPICC - GCC -O3 - VAPI, MPICC - ICC - P4, MPICC - ICC - VAPI, OpenMPI - GCC -O3 - tcp,sm,self, OpenMPI - GCC -O3 - openib,sm,self, OpenMPI - ICC - tcp,sm,self, OpenMPI - ICC - openib,sm,self]
MPI Summary
MPI provides a mechanism to parallelize in a distributed fashion
» Localhost optimization is done when ranks run on the same shared-memory machine
Shared variables are explicitly handled by the developer
The tradeoff between CPU and I/O can determine the performance characteristics
Hybrid programming models are possible
» MPI code with OpenMP sections
» MPI code with GPU calls
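A minimal, hypothetical hybrid sketch (not from the slides): each MPI rank takes an interleaved share of the intervals and OpenMP threads split that rank's loop; a compiler with OpenMP support is assumed (e.g. mpicc -fopenmp).

    /* Illustrative hybrid MPI + OpenMP pi calculation */
    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        const double PI25DT = 3.141592653589793238462643;
        long n = 400000000, i;
        int myid, numprocs;
        double h, x, sum = 0.0, mypi, pi;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

        h = 1.0 / (double) n;
        /* MPI ranks interleave intervals; OpenMP threads split each rank's share */
        #pragma omp parallel for private(x) reduction(+:sum)
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double) i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        /* Combine the per-rank partial results on rank 0 */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));

        MPI_Finalize();
        return 0;
    }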
Queuing
How are resources allocated among multiple users and/or groups?
» Statically by using bpctl user and group permissions
» ClusterWare supports a variety of queuing packages
• TaskMaster (advanced MOAB policy-based scheduler integrated with ClusterWare)
• Torque
• SGE
Interacting with Torque
To submit a job:
» qsub script.sh
• Example script.sh:

#!/bin/sh
#PBS -j oe
#PBS -l nodes=4
cd $PBS_O_WORKDIR
hostname
• qsub does not accept arguments for script.sh. All executable arguments must be included in the script itself
» Administrators can create a ‘qapp’ script that takes user arguments, creates script.sh with the user arguments embedded, and runs ‘qsub script.sh’
Interacting with Torque
Other commands
» qstat – Status of queue server and jobs
» qdel – Remove a job from the queue
» qhold, qrls – Hold and release a job in the queue
» qmgr – Administrator command to configure pbs_server
» /var/spool/torque/server_name: should match hostname of the head node
» /var/spool/torque/mom_priv/config: file to configure pbs_mom
• ‘$usecp *:/home /home’ indicates that pbs_mom should use ‘cp’ rather than ‘rcp’ or ‘scp’ to relocate the stdout and stderr files at the end of execution
» pbsnodes – Administrator command to monitor the status of the resources
» qalter – Administrator command to modify the parameters of a particular job (e.g. requested time)
Other options to qsub
Options that can be included in a script (with the #PBS directive) or on the qsub command line
» Join output and error files: #PBS -j oe
» Request resources: #PBS -l nodes=2:ppn=2
» Request walltime: #PBS -l walltime=24:00:00
» Define a job name: #PBS -N jobname
» Send mail at job events: #PBS -m be
» Assign job to an account: #PBS -A account
» Export current environment variables: #PBS -V
To start an interactive queue job use:
» qsub -I for Torque
» qrsh for SGE
Queue script case studies
qapp script:
» Be careful about escaping special characters in the redirect section (\$, \’, \”)
#!/bin/bash
# Usage: qapp arg1 arg2
debug=0
opt1="${1}"
opt2="${2}"
if [[ "${opt2}" == "" ]] ; then
    echo "Not enough arguments"
    exit 1
fi
cat > app.sh << EOF
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd \$PBS_O_WORKDIR
app $opt1 $opt2
EOF
if [[ "${debug}" -lt 1 ]] ; then
    qsub app.sh
fi
if [[ "${debug}" -eq 0 ]] ; then
    /bin/rm -f app.sh
fi
Queue script case studies
Using local scratch:

#!/bin/bash
#PBS -j oe
#PBS -l nodes=1

cd $PBS_O_WORKDIR

tmpdir="/scratch/$USER/$PBS_JOBID"
/bin/mkdir -p $tmpdir

rsync -a ./ $tmpdir

cd $tmpdir

$pathto/app arg1 arg2

cd $PBS_O_WORKDIR

rsync -a $tmpdir/ .

/bin/rm -fr $tmpdir
Queue script case studies
Using local scratch for MPICH parallel jobs:
» pbsdsh is a Torque command
#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8

cd $PBS_O_WORKDIR

tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"

/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"

cd $tmpdir

mpirun -machine vapi $pathto/app arg1 arg2

cd $PBS_O_WORKDIR

/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"

/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
Queue script case studies
Using local scratch for OpenMPI parallel jobs:
» Do a ‘module load openmpi/gnu’ prior to running qsub
» OR explicitly include a ‘module load openmpi/gnu’ in the script itself
#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
#PBS -V

cd $PBS_O_WORKDIR

tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"

/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"

cd $tmpdir

/usr/openmpi/gnu/bin/mpirun -np `cat $PBS_NODEFILE | wc -l` -mca btl openib,sm,self $pathto/app arg1 arg2

cd $PBS_O_WORKDIR

/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"

/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
Other considerations
A queue script need not be a single command
» Multiple steps can be performed from a single script
• Guaranteed resources
• Jobs should typically be a minimum of 2 minutes
» Pre-processing and post-processing can be done from the same script using the local scratch space
» If configured, it is possible to submit additional jobs from a running queued job
To remove multiple jobs from the queue:
» qstat | grep " [RQ] " | awk '{print $1}' | xargs qdel
Integrating Parallel Programs
The scheduler only keeps track of available resources
It does not monitor how the resources are used
The onus is on the user to request and use the correct resources
» OpenMP: be sure to request multiple processors on the same machine
• Torque: #PBS -l nodes=1:ppn=x
• SGE: Correct PE (parallel environment) submission
» MPI: be sure to use the machines that have been assigned by the queue system
• Torque: MPICH and OpenMPI mpirun will do the correct thing. $PBS_NODEFILE contains a list of assigned hosts
• SGE: $PE_HOSTFILE contains a list of assigned hosts. OpenMPI’s mpirun may need to be recompiled
Integrating Parallel Programs
Be careful about task pinning (taskset)
» Different jobs may assume the same CPU set, resulting in oversubscription of some cores while other cores sit idle
» In a shared environment, not using task pinning can be easier at a slight trade-off in performance
Make sure that the same MPI implementation and compiler combination is used to run the code as was used to compile and link