Agenda
Parallel jobs
» Paradigms to parallelize algorithms
» Profiling and compiler optimization
» Implementations to parallelize code
• OpenMP
• MPI
Queuing
» Job queuing
» Integrating Parallel Programs
Questions and Answers
Traditional “Ad-Hoc” Linux Cluster
Full Linux install to disk; Load into memory
» Manual & Slow; 5-30 minutes
Full set of disparate daemons, services, user/password, & host access setup
Jobs run via a basic parallel shell held together by complex glue scripts
Monitoring & management added as isolated tools
[Diagram: master node connected to compute nodes over an interconnection network; the master also connects to the Internet or an internal network]
Cluster Virtualization Architecture Realized
Minimal in-memory OS with a single daemon, rapidly deployed in seconds - no disk required
» Less than 20 seconds
Virtual, unified process space enables intuitive single sign-on, job submission
» Effortless job migration to nodes
Monitor & manage efficiently from the Master
» Single System Install
» Single Process Space
» Shared cache of the cluster state
» Single point of provisioning
» Better performance due to lightweight nodes
» No version skew, which makes the cluster inherently more reliable
[Diagram: master node connected to lightweight compute nodes (with optional disks) over an interconnection network; the master also connects to the Internet or an internal network]
Just a Primer
Only a brief introduction is provided here. Many other in-depth tutorials are available on the web and in published sources.
» http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
» https://computing.llnl.gov/?set=training&page=index
Parallel Code Primer
Paradigms for writing parallel programs depend upon the application
» SIMD (single-instruction multiple-data)
» MIMD (multiple-instruction multiple-data)
» MISD (multiple-instruction single-data)
SIMD will be presented here as it is a commonly used template
» A single application source is compiled to perform operations on different sets of data (Single Program Multiple Data (SPMD) model)
» The data is read by the different threads or passed between threads via messages (hence MPI = message passing interface)
• Contrast this with shared memory or OpenMP, where data is accessed locally via shared memory
• Optimizations in the MPI implementation can perform localhost optimization; however, the program is still written using a message passing construct
Explicitly Parallel Programs
Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
Most distributed parallel programs are now written using MPI
» Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
» ClusterWare comes integrated with customized versions of MPICH and OpenMPI
Example Code
Calculate π through numerical integration
» Compute π by integrating f(x) = 4/(1 + x**2) from 0 to 1
• The integrand is four times the derivative of arctan(x), so the integral over [0, 1] is 4·arctan(1) = π
• See source code (a serial sketch follows below)
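The serial source is not reproduced in this transcript. A minimal sketch of what cpi-serial.c could look like, consistent with the compile commands and output shown under "Compiling and running code" below (the exact variable names and output format are assumptions):

    /* cpi-serial.c - approximate pi with the midpoint rule (illustrative sketch) */
    #include <stdio.h>
    #include <math.h>

    double f(double a) { return 4.0 / (1.0 + a * a); }

    int main(void)
    {
        const double PI25DT = 3.141592653589793238462643;
        long n = 400000000;                /* number of intervals */
        double h = 1.0 / (double) n;
        double sum = 0.0;
        long i;

        printf("Process 0\n");
        for (i = 1; i <= n; i++)
            sum += f(h * ((double) i - 0.5));   /* evaluate f at the midpoint of interval i */

        printf("pi is approximately %.16f, Error is %.16f\n",
               h * sum, fabs(h * sum - PI25DT));
        return 0;
    }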
Compiling and running code
set n = 400,000,000

$ gcc -o cpi-serial cpi-serial.c
$ time ./cpi-serial
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m11.009s
user    0m11.007s
sys     0m0.000s

$ gcc -g -pg -o cpi-serial_prof cpi-serial.c
$ time ./cpi-serial_prof
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m11.012s
user    0m11.010s
sys     0m0.000s
$ ls -ltra
total 40
…
-rw-rw-r-- 1 jhan jhan 586 Mar  7 14:45 gmon.out
Profiling
-g flag includes debugging information in the binary. Useful for gdb tracing of an application and for profiling
-pg flag generates code to write profile information
$ gprof cpi-serial_prof gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds      calls  ns/call  ns/call  name
74.48      2.85     2.85                               main
23.95      3.77     0.92  400000000     2.29     2.29   f
 2.37      3.86     0.09                               frame_dummy

Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 0.26% of 3.86 seconds

index % time    self  children    called                 name
                                                         <spontaneous>
[1]     97.7    2.85    0.92                             main [1]
                0.92    0.00  400000000/400000000            f [2]
-----------------------------------------------
                0.92    0.00  400000000/400000000            main [1]
[2]     23.8    0.92    0.00  400000000                  f [2]
-----------------------------------------------
                                                         <spontaneous>
[3]      2.3    0.09    0.00                             frame_dummy [3]
-----------------------------------------------
-A flag to gprof shows calls next to source code
Profiling Tips
Code should be profiled using a realistic data set
» Contrast the call graphs of n=100 versus n=400,000,000
Profiling can give tips about where to optimize the current algorithm, but it can’t suggest alternative (better) algorithms
» e.g. a Monte Carlo algorithm to calculate π
Amdahl’s Law
» The speedup parallelization achieves is limited by the serial part of the code
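» As a reminder (the formula is not shown on the slide), a common statement of Amdahl's Law is

    S(N) = 1 / ((1 - p) + p/N)

  where p is the fraction of the runtime that can be parallelized and N is the number of processors; even as N grows, the speedup is capped at 1/(1 - p).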
OpenMP Introduction
Parallelization using shared memory in a single machine
» A portion of the code is forked on the machine to parallelize it
» i.e. not distributed parallelization
Done using pragmas in the source code. Compiler must support OpenMP (gcc 4, Intel, etc.)
» $ gcc -fopenmp -o cpi-openmp cpi-openmp.c
» See source code (a sketch follows below)
Profiling can add overhead to the resulting executable
'time' can be used to measure improvement
Runtime selection of the number of threads using the OMP_NUM_THREADS environment variable
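The OpenMP source is not included in this transcript. A minimal sketch of what cpi-openmp.c could look like (variable names and output format are assumptions; the reduction clause is what makes the parallel accumulation safe):

    /* cpi-openmp.c - midpoint-rule pi, parallelized with OpenMP (illustrative sketch) */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double PI25DT = 3.141592653589793238462643;
        long n = 400000000;                 /* number of intervals */
        double h = 1.0 / (double) n;
        double sum = 0.0, x;
        long i;

        /* Each thread works on a chunk of the loop; reduction(+:sum) gives every
           thread a private partial sum and combines them when the loop ends. */
        #pragma omp parallel for private(x) reduction(+:sum)
        for (i = 1; i <= n; i++) {
            x = h * ((double) i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }

        printf("pi is approximately %.16f, Error is %.16f\n",
               h * sum, fabs(h * sum - PI25DT));
        return 0;
    }

Built with 'gcc -fopenmp -o cpi-openmp cpi-openmp.c' and run as 'OMP_NUM_THREADS=4 ./cpi-openmp', as on the following slide.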
Scaling with OpenMP
$ time OMP_NUM_THREADS=1 ./cpi-openmp
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m10.583s
user    0m10.581s
sys     0m0.001s

$ time OMP_NUM_THREADS=2 ./cpi-openmp
Process 0
Process 1
pi is approximately 3.1415926535900218, Error is 0.0000000000002287

real    0m5.295s
user    0m11.297s
sys     0m0.000s

$ time OMP_NUM_THREADS=4 ./cpi-openmp
…
real    0m2.650s
user    0m10.586s
sys     0m0.003s

$ time OMP_NUM_THREADS=8 ./cpi-openmp
…
real    0m1.416s
user    0m11.294s
sys     0m0.001s
Scaling with OpenMP
[Chart: "OpenMP Speedup" - measured speedup versus number of processors (1-8), plotted against the ideal 1:1 speedup line]
Code is easy to parallelize
Good scaling is seen up to 8 processors; a kink in the curve is expected
Role of the Compiler
GCC versus Intel C
$ time OMP_NUM_THREADS=1 ./cpi-openmp
…
real    0m10.583s
user    0m10.581s
sys     0m0.001s

$ gcc -O3 -fopenmp -o cpi-openmp-gcc-O3 cpi-openmp.c
$ time OMP_NUM_THREADS=1 ./cpi-openmp-gcc-O3
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m3.154s
user    0m3.143s
sys     0m0.011s
$ time OMP_NUM_THREADS=8 ./cpi-openmp-gcc-O3
…
real    0m0.399s
user    0m3.181s
sys     0m0.001s

$ icc -openmp -o cpi-openmp-icc cpi-openmp.c
$ time OMP_NUM_THREADS=1 ./cpi-openmp-icc
Process 0
pi is approximately 3.1415926535899272, Error is 0.0000000000001341

real    0m1.618s
user    0m1.575s
sys     0m0.000s
$ time OMP_NUM_THREADS=8 ./cpi-openmp-icc
…
real    0m0.204s
user    0m1.584s
sys     0m0.001s
OpenMP Summary
OpenMP provides a mechanism to parallelize within a single machine
Shared memory and variables are handled “automatically”
With an appropriate compiler, OpenMP can provide significant speedups
Coupled with large core count SMP machines, OpenMP could be all of the parallelization required
» GPU programming is similar to the OpenMP model
Explicitly Parallel Programs
Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
Most distributed parallel programs are now written using MPI
» Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
» ClusterWare comes integrated with customized versions of MPICH and OpenMPI
Running MPI Code
Binaries are executed simultaneously
» on the same machine or different machines
After the binaries start running, the MPI_COMM_WORLD communicator is established
Any data to be transferred must be explicitly specified by the programmer
Hooks exist for a number of languages
» E.g. Python (https://computing.llnl.gov/code/pdf/pyMPI.pdf)
Example MPI Source
cpi.c calculates π using MPI in C
#include "mpi.h"#include <stdio.h>#include <math.h>
double f( double );double f( double a ){ return (4.0 / (1.0 + a*a));}
int main( int argc, char *argv[]){ int done = 0, n, myid, numprocs, i; double PI25DT = 3.141592653589793238462643; double mypi, pi, h, sum, x; double startwtime = 0.0, endwtime; int namelen; char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD,&numprocs); MPI_Comm_rank(MPI_COMM_WORLD,&myid); MPI_Get_processor_name(processor_name,&namelen);
fprintf(stderr,"Process %d on %s\n", myid, processor_name);
n = 0;
while (!done) { if (myid == 0) {/* printf("Enter the number of intervals: (0
quits) "); scanf("%d",&n);*/
if (n==0) n=100; else n=0;
startwtime = MPI_Wtime(); } MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); if (n == 0) done = 1; else { h = 1.0 / (double) n; sum = 0.0; for (i = myid + 1; i <= n; i += numprocs) { x = h * ((double)i - 0.5); sum += f(x); } mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0) {
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));endwtime = MPI_Wtime();printf("wall clock time = %f\n", endwtime-startwtime);
}
} } MPI_Finalize();
return 0;}
Slide annotations on the source:
» mpi.h is the system include file that defines the MPI functions
» MPI_Init initializes the MPI execution environment
» MPI_Comm_size determines the size of the group associated with a communicator
» MPI_Comm_rank determines the rank of the calling process in the communicator
» MPI_Get_processor_name gets the name of the processor
» Actions are differentiated based on rank; only the "master" (rank 0) sets n and the start time
» MPI_Wtime is the MPI built-in function to get a time value
» MPI_Bcast broadcasts "1" MPI_INT starting at &n from the process with rank "0" to all other processes of the group
» Each worker runs the loop, incrementing the counter by the number of processes (versus dividing the range, which risks an off-by-one error)
» MPI_Reduce applies the MPI_SUM function to "1" MPI_DOUBLE at &mypi on all workers in MPI_COMM_WORLD, producing a single value at &pi on rank "0"
» Only rank "0" outputs the value of pi; MPI_Finalize terminates the MPI execution environment
Other Common MPI Functions
MPI_Send, MPI_Recv
» Blocking send and receive between two specific ranks
MPI_Isend, MPI_Irecv
» Non-blocking send and receive between two specific ranks
man pages exist for the MPI functions
Poorly written programs can suffer from poor communication efficiency (e.g. stair-stepping), or can lose data if the system buffer fills before a blocking send or receive is posted to match a non-blocking receive or send (a basic blocking send/receive sketch follows this list)
Care should be taken when creating temporary files: multiple ranks may be running on the same host and can overwrite the same temporary file (include the rank in the file name, in a unique temporary directory per simulation)
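A minimal, hypothetical sketch (not from the slides) of the blocking pattern: rank 0 sends a small buffer to rank 1 with MPI_Send, and rank 1 posts a matching MPI_Recv with the same tag and communicator.

    /* Illustrative blocking point-to-point example between rank 0 and rank 1 */
    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myid, numprocs;
        double buf[4] = { 1.0, 2.0, 3.0, 4.0 };

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

        if (numprocs >= 2) {
            if (myid == 0) {
                /* tag 99 must match the tag used by the receiver */
                MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
            } else if (myid == 1) {
                MPI_Status status;
                MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
                printf("Rank 1 received %g %g %g %g\n",
                       buf[0], buf[1], buf[2], buf[3]);
            }
        }
        MPI_Finalize();
        return 0;
    }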
Compiling MPICH programs
mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/MPICH
» Environment variables can be used to set the compiler:
• CC, CPP, FC, F90
» Command line options to set the compiler:
• -cc=, -cxx=, -fc=, -f90=
» GNU, PGI, and Intel compilers are supported
Running MPICH programs
mpirun is used to launch MPICH programs
Dynamic allocation can be done when using the -np flag
Mapping is also supported when using the -map flag
If Infiniband is installed, the interconnect fabric can be chosen using the machine flag:
» -machine p4
» -machine vapi
Scaling with MPI
$ which mpicc
/usr/bin/mpicc
$ mpicc -show -o cpi-mpi cpi-mpi.c
gcc -L/usr/lib64/MPICH/p4/gnu -I/usr/include -o cpi-mpi cpi-mpi.c -lmpi -lbproc
$ mpicc -o cpi-mpi cpi-mpi.c
$ time mpirun -np 1 ./cpi-mpi
Process 0 on scyld.localdomain
…
real    0m11.198s
user    0m11.187s
sys     0m0.010s
$ time mpirun -np 2 ./cpi-mpi
Process 0 on scyld.localdomain
Process 1 on n0
…
real    0m6.486s
user    0m5.510s
sys     0m0.009s
$ time mpirun -map -1:-1:-1:-1:-1:-1:-1:-1:0:0:0:0:0:0:0:0 ./cpi-mpi
…
real    0m1.283s
user    0m1.381s
sys     0m0.016s
Environment Variable Options
Additional environment variable control:
» NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.
» ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
» ALL_NODES—Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.
» ALL_LOCAL — Run every process on the master node; used for debugging purposes.
» NO_LOCAL — Don’t run any processes on the master node.
» EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment.
» BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will be the first process (MPI Rank 0) and so on.
Compiling and Running OpenMPI programs
The env-modules package allows users to change their environment variables according to predefined files
» module avail
» module load openmpi/gnu
» GNU, PGI, and Intel compilers are supported
mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /opt/scyld/openmpi
mpirun is used to run code
Interconnect can be selected at runtime
» -mca btl openib,tcp,sm,self
» -mca btl udapl,tcp,sm,self
Compiling and Running OpenMPI programs
What env-modules does: Set user environment prior to compiling
» export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /opt/scyld/openmpi
» Environment variables can be used to set the compiler:
• OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_FC
Prior to running, PATH and LD_LIBRARY_PATH should be set
» module load openmpi/gnu
» /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
OR:
» export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
export MANPATH=/opt/scyld/openmpi/gnu/share/man
export LD_LIBRARY_PATH=/opt/scyld/openmpi/gnu/lib:${LD_LIBRARY_PATH}
» /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
Scaling with MPI Implementations
[Chart: "Scaling with MPI for gcc" - speedup relative to GCC versus number of processors (up to 16) for MPICC - GCC - P4, MPICC - GCC - VAPI, OpenMPI - GCC - tcp,sm,self, and OpenMPI - GCC - openib,sm,self, plotted against the ideal 1:1 speedup line]
Infiniband allows wider scaling
There is a performance difference between MPICH and OpenMPI
A little artificial because it’s only two physical machines
Scaling with MPI Implementations
Larger problems would allow continued scaling
[Chart: "Compiler Comparison" - speedup relative to GCC versus number of processors (up to 16). Series: Ideal Speedup 1:1, MPICC - GCC - P4, MPICC - GCC - VAPI, OpenMPI - GCC - tcp,sm,self, OpenMPI - GCC - openib,sm,self; Ideal Speedup 3.49:1, MPICC - GCC -O3 - P4, MPICC - GCC -O3 - VAPI, MPICC - ICC - P4, MPICC - ICC - VAPI, OpenMPI - GCC -O3 - tcp,sm,self, OpenMPI - GCC -O3 - openib,sm,self, OpenMPI - ICC - tcp,sm,self, OpenMPI - ICC - openib,sm,self]
MPI Summary
MPI provides a mechanism to parallelize in a distributed fashion
» Localhost optimization is done when ranks run on the same shared-memory machine
Shared variables are explicitly handled by the developer
The tradeoff between CPU and I/O can determine the performance characteristics
Hybrid programming models are possible
» MPI code with OpenMP sections
» MPI code with GPU calls
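A minimal, hypothetical hybrid sketch (not from the slides): each MPI rank takes an interleaved share of the intervals and OpenMP threads split that rank's loop; a compiler with OpenMP support is assumed (e.g. mpicc -fopenmp).

    /* Illustrative hybrid MPI + OpenMP pi calculation */
    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        const double PI25DT = 3.141592653589793238462643;
        long n = 400000000, i;
        int myid, numprocs;
        double h, x, sum = 0.0, mypi, pi;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

        h = 1.0 / (double) n;
        /* MPI ranks interleave intervals; OpenMP threads split each rank's share */
        #pragma omp parallel for private(x) reduction(+:sum)
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double) i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        /* Combine the per-rank partial results on rank 0 */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));

        MPI_Finalize();
        return 0;
    }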
Queuing
How are resources allocated among multiple users and/or groups?
» Statically by using bpctl user and group permissions
» ClusterWare supports a variety of queuing packages
• TaskMaster (advanced MOAB policy-based scheduler integrated with ClusterWare)
• Torque
• SGE
Interacting with Torque
To submit a job:
» qsub script.sh
• Example script.sh:

#!/bin/sh
#PBS -j oe
#PBS -l nodes=4
cd $PBS_O_WORKDIR
hostname
• qsub does not accept arguments for script.sh. All executable arguments must be included in the script itself
» Administrators can create a ‘qapp’ script that takes user arguments, creates script.sh with the user arguments embedded, and runs ‘qsub script.sh’
Interacting with Torque
Other commands
» qstat – Status of queue server and jobs
» qdel – Remove a job from the queue
» qhold, qrls – Hold and release a job in the queue
» qmgr – Administrator command to configure pbs_server
» /var/spool/torque/server_name: should match hostname of the head node
» /var/spool/torque/mom_priv/config: file to configure pbs_mom
• ‘$usecp *:/home /home’ indicates that pbs_mom should use ‘cp’ rather than ‘rcp’ or ‘scp’ to relocate the stdout and stderr files at the end of execution
» pbsnodes – Administrator command to monitor the status of the resources
» qalter – Administrator command to modify the parameters of a particular job (e.g. requested time)
Other options to qsub
Options that can be included in a script (with the #PBS directive) or on the qsub command line
» Join output and error files: #PBS -j oe
» Request resources: #PBS -l nodes=2:ppn=2
» Request walltime: #PBS -l walltime=24:00:00
» Define a job name: #PBS -N jobname
» Send mail at job events: #PBS -m be
» Assign job to an account: #PBS -A account
» Export current environment variables: #PBS -V
To start an interactive queue job use:
» qsub -I for Torque
» qrsh for SGE
Queue script case studies
qapp script:
» Be careful about escaping special characters in the redirect section (\$, \’, \”)
#!/bin/bash
# Usage: qapp arg1 arg2
debug=0
opt1="${1}"
opt2="${2}"
if [[ "${opt2}" == "" ]] ; then
    echo "Not enough arguments"
    exit 1
fi
cat > app.sh << EOF
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd \$PBS_O_WORKDIR
app $opt1 $opt2
EOF
if [[ "${debug}" -lt 1 ]] ; then
    qsub app.sh
fi
if [[ "${debug}" -eq 0 ]] ; then
    /bin/rm -f app.sh
fi
Queue script case studies
Using local scratch:

#!/bin/bash
#PBS -j oe
#PBS -l nodes=1

cd $PBS_O_WORKDIR

tmpdir="/scratch/$USER/$PBS_JOBID"
/bin/mkdir -p $tmpdir

rsync -a ./ $tmpdir

cd $tmpdir

$pathto/app arg1 arg2

cd $PBS_O_WORKDIR

rsync -a $tmpdir/ .

/bin/rm -fr $tmpdir
Queue script case studies
Using local scratch for MPICH parallel jobs:
» pbsdsh is a Torque command
#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8

cd $PBS_O_WORKDIR

tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"

/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"

cd $tmpdir

mpirun -machine vapi $pathto/app arg1 arg2

cd $PBS_O_WORKDIR

/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"

/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
Queue script case studies
Using local scratch for OpenMPI parallel jobs:
» Do a ‘module load openmpi/gnu’ prior to running qsub
» OR explicitly include a ‘module load openmpi/gnu’ in the script itself
#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
#PBS -V

cd $PBS_O_WORKDIR

tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"

/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"

cd $tmpdir

/usr/openmpi/gnu/bin/mpirun -np `cat $PBS_NODEFILE | wc -l` -mca btl openib,sm,self $pathto/app arg1 arg2

cd $PBS_O_WORKDIR

/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"

/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
Other considerations
A queue script need not be a single command
» Multiple steps can be performed from a single script
• Guaranteed resources
• Jobs should typically be a minimum of 2 minutes
» Pre-processing and post-processing can be done from the same script using the local scratch space
» If configured, it is possible to submit additional jobs from a running queued job
To remove multiple jobs from the queue:
» qstat | grep " [RQ] " | awk '{print $1}' | xargs qdel
Integrating Parallel Programs
The scheduler only keeps track of available resources
It does not monitor how the resources are used
The onus is on the user to request and use the correct resources
» OpenMP: be sure to request multiple processors on the same machine
• Torque: #PBS -l nodes=1:ppn=x
• SGE: Correct PE (parallel environment) submission
» MPI: be sure to use the machines that have been assigned by the queue system
• Torque: MPICH and OpenMPI mpirun will do the correct thing. $PBS_NODEFILE contains a list of assigned hosts
• SGE: $PE_HOSTFILE contains a list of assigned hosts. OpenMPI’s mpirun may need to be recompiled
Integrating Parallel Programs
Be careful about task pinning (taskset)
» Different jobs may assume the same CPU set, resulting in oversubscription of some cores while other cores sit idle
» In a shared environment, not using task pinning can be easier at a slight trade-off in performance
Make sure that the same MPI implementation and compiler combination is used to run the code as was used to compile and link