Copyright © 2014 CeGP
Parallel Computing Basics
With a Case Study on the CeGP Cluster
Dr. Tamir Hegazy, Dr. Entao Liu, Dr. Zhiling Long
September 17, 2014
Supercomputing & Geophysics
Seminar Goals
After this seminar, you should be able to:
– Differentiate among
  • Parallel computing architectures
  • Parallel programming models
– Write parallel programs
  • Matlab
  • Message Passing Interface (MPI)
  • MatlabMPI/pMatlab
– Run your parallel programs on the CeGP cluster
Outline
• Parallel computing basics
• New CeGP cluster
• Matlab Parallel Processing Toolbox
• Message Passing Interface (MPI)
• MatlabMPI/pMatlab
Parallel Computing Basics
Definition
• Classic definition: a parallel computer is a “collection of processing elements that communicate and cooperate to solve large problems fast.”
[Diagram: form vs. function; a uniprocessor (one processor P with memory M and I/O) versus parallel processors (multiple Ps sharing memory and I/O)]
Coupling in Parallel Systems

                     Tightly Coupled    Loosely Coupled
Share/sync clock?    Yes                No
Share bus?           Yes                No
Communication        Faster             Slower
Cost                 Higher             Lower
Scalability          Lower              Higher
Energy efficiency    Higher             Lower
Examples             Multi-cores        Today's clusters
[Diagram: a uniprocessor (P, M, I/O); tightly coupled processors sharing one memory and I/O over a common bus; loosely coupled processors, each with its own memory and I/O]
Parallel Computing Paradigms
                 Shared Address Space    Message Passing
Coupling         Tighter                 Looser
Communication    Through bus/memory      Through network
Primitives       read, write             send, recv
• What’s the difference between parallel and distributed systems?
Shared Address Space
[Diagram: dancehall (UMA) architecture, where all processors reach all memories through a switch/network, vs. NUMA, where each processor has a local memory and reaches remote memories through the network]
• Versus shared memory
• Common architectures
• User‐level operations: read/write or load/store
[Diagram: processes i and j each have a private portion and a shared portion of their address space; the shared portions map to common locations in physical memory]
Message Passing
• Versus message passing for inter-process communication
• Architecture: similar to NUMA
  – Major difference: communication through I/O, not memory
• Architecture convergence
[Diagram: NUMA-like organization, with processors and local memories connected through a switch/network]
Flynn's Taxonomy

                                 Data Stream
                            Single        Multiple
Instruction      Single     SISD          SIMD
Stream           Multiple   MISD          MIMD

[Diagrams: instruction pool vs. data pool organization for SISD, SIMD, MISD, and MIMD]

Program          Single     SPMD
                 Multiple   MPMD

Aka data parallel (vs. task parallel)
Parallelization Steps

Decomposition → Assignment → Orchestration → Mapping

[Diagram: a sequential computation is partitioned into tasks (decomposition), the tasks are assigned to processes p0-p3 (assignment), the processes are coordinated into a parallel program (orchestration), and the program is mapped onto the processors of the parallel architecture (mapping)]
Parallelizing a Once‐Sequential Program
[Diagram: a sequential program with execution time Ts consists of an inherently sequential part and a parallelizable part; after parallelization, the parallelizable part runs concurrently and the execution time becomes Tp]

Speedup = Ts / Tp
Speedup Analysis (1)
S = Ts / Tp = Ts / ( s·Ts + (1 − s)·Ts/p + Tcomms )

where
  S      = speedup
  p      = number of processors
  Ts     = sequential execution time
  Tp     = parallel execution time
  Tcomms = effective communication time
  s      = inherently sequential fraction

The term (1 − s)·Ts/p is the parallel processing time.
Speedup Analysis (2)
• Amdahl's Law (ignore communication time):

    S = 1 / ( s + (1 − s)/p )

  (s: inherently sequential fraction, p: number of processors)

• Amdahl's limit:

    lim (p → ∞) S = 1/s

    s       0.5   0.2   0.1   0.01   0.001
    Limit   2     5     10    100    1000
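For example, with s = 0.1 and p = 10 processors, S = 1 / (0.1 + 0.9/10) ≈ 5.3; no matter how many processors are added, the speedup can never exceed 1/s = 10.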
Speedup Analysis (3)
• Degree of parallelism limit
  – Degree of parallelism: the number of parallel operations in a program
  – The speedup cannot exceed the number of operations that can usefully run in parallel
• Efficiency: E(p) = S(p)/p ≤ 1, so S(p) ≤ p (linear speedup is the best case)
• Communication limit
  – Assume Tcomm(p) = f(p)·Ts (f: communication-to-computation ratio)
  – Assume s = 0 (perfectly parallelizable): S(p) = 1 / ( 1/p + f(p) ), so lim (p → ∞) S(p) = 1/f (the computation-to-communication ratio)
CeGP Cluster at GT
CeGP Cluster at GT
• Received in July 2014
• Provider: PSSC Labs
  – Based in California
  – Top provider of cluster solutions
• Three-year warranty and support
• Expansion plans
Cluster Specs at a Glance
• Physical cores: 44
• Logical cores: 88 (hyperthreading)
• Nodes: 3
• Total memory: 192 GB
• Total storage: 34 TB
• Interconnect: Gigabit Ethernet
• UPS for head node
• OS: CentOS
Cluster Architecture Overview
• Head node: 12 (24 HT) Intel Xeon cores @ 2.1 GHz, 64 GB RAM, 16x 2 TB disks, Gb Ethernet NIC
• Compute node 1: 16 (32 HT) Intel Xeon cores @ 2.6 GHz, 64 GB RAM, 1 TB disk, Gb Ethernet NIC
• Compute node 2: 16 (32 HT) Intel Xeon cores @ 2.6 GHz, 64 GB RAM, 1 TB disk, Gb Ethernet NIC
• All nodes are connected through a Gb Ethernet switch
MATLAB Parallel Toolbox
Parallel Processing/Computing on MATLAB
• Carry out multiple tasks simultaneously on different processors
• Speed up task-parallel applications
• MATLAB has a well documented and convenient parallel processing toolbox
• Built-in syntax abstracts away the complexity involved in parallel computing
• Supports the use of NVIDIA GPUs
Basics in Parallel Processing
• Parallel for loops (multi-core CPUs)
  – parfor (independent parameter sweeps / Monte Carlo experiments)
  – Distributed arrays (matrices larger than the memory limit of a single computer)
• GPU computing
  – CUDA-enabled NVIDIA GPUs
  – FFT/FFT2
  – A\b
Application Speedup
[Figure borrowed from www.mathworks.com]
How to use parfor?
• Create a pool of workers:
    pool = parpool(8);
• Do the parallel computing using parfor loops
• Clean up by deleting the pool of workers when finished:
    delete(gcp);
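A minimal end-to-end sketch of the three steps above (the pool size and the loop body are illustrative placeholders, not from the original slides):

    pool = parpool(8);                        % 1) create a pool of 8 workers
    results = zeros(1, 100);
    parfor k = 1:100                          % 2) do the parallel computing with parfor
        results(k) = sum(svd(rand(200)));     %    some independent, CPU-heavy work
    end
    delete(gcp);                              % 3) clean up: delete the pool of workers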
parfor
• Adding 1 to s can be done in any order
• The condition on p(i) is independent of s
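A minimal sketch of the pattern described above (the data vector p and the threshold are illustrative assumptions; the original slide's listing is not reproduced here):

    p = rand(1, 1e6);        % hypothetical data vector
    s = 0;
    parfor i = 1:numel(p)
        if p(i) > 0.5        % the test on p(i) does not depend on s
            s = s + 1;       % s is a reduction variable: the additions may happen in any order
        end
    end
    disp(s);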
Distributed Arrays/Matrices
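The listing from the original slide is not reproduced here; below is a minimal sketch, assuming a worker pool is already open, of how a distributed array spreads a matrix across the workers while keeping ordinary MATLAB syntax:

    A = distributed.rand(4000, 4000);   % matrix partitioned across the pool's workers
    b = distributed.rand(4000, 1);
    x = A \ b;                          % the backslash solve runs on the workers
    err = norm(gather(A*x - b));        % gather brings the (small) result back to the client
    disp(err);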
GPU Arrays
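Again, a minimal sketch with assumed sizes (not the original slide's code): data is moved to the GPU with gpuArray, computed on there, and brought back with gather:

    A = gpuArray(rand(2000));   % copy a random matrix to the GPU
    F = fft2(A);                % FFT2 executes on the GPU
    B = gather(F);              % copy the result back to host memory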
Benchmarking A\b on the GPU
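The benchmark figure from the original slide is not reproduced; the following is a rough timing sketch (assumed matrix size and timing method) for comparing A\b on the CPU and on the GPU:

    n = 4000;
    A = rand(n);  b = rand(n, 1);

    tic;  x_cpu = A \ b;  t_cpu = toc;                        % CPU solve

    Ag = gpuArray(A);  bg = gpuArray(b);
    tic;  x_gpu = Ag \ bg;  wait(gpuDevice);  t_gpu = toc;    % GPU solve; wait for the GPU to finish

    fprintf('CPU: %.3f s, GPU: %.3f s, speedup: %.2fx\n', t_cpu, t_gpu, t_cpu/t_gpu);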
Message Passing Interface (MPI)
MPI at a Glance
• Not a language
• Standardized system, several implementations
• Dominant in parallel programming and high-performance computing (HPC)
• Set of routines callable from C/C++, Fortran
• MPI 1.0: 1992
• MPI 2.0: 1997
• We focus on MPI 1.0 for C
MPI Program Structure

General structure:

    #include headers
    int main() {
        // Declare variables: general and MPI-related
        // Initialize MPI
        // Work in parallel . . .
        // Finalize MPI
    }

Concrete skeleton:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        . . .
        int myID, nProc;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nProc);
        MPI_Comm_rank(MPI_COMM_WORLD, &myID);

        // Work in parallel . . .

        MPI_Finalize();
    }
Simple Example: Sum of the Integers 1 to 1,000,000

    // Specify boundaries
    start = (1000000*myID/nProc) + 1;
    end   = 1000000*(myID+1)/nProc;

    // Work in parallel
    for (i = start; i <= end; i++)
        sum = sum + i;

    // Collect results
    if (myID == 0)
        for (i = 1; i < nProc; i++) {
            MPI_Recv(&accum, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &status);
            sum = sum + accum;
        }
    else
        MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);

    // Display results
    if (myID == 0)
        printf("Sum from 1 to 1000000 is: %d\n", sum);
Compile and Run MPI
    mpicc -o sum1e6.o sum1e6.c     # sum1e6.o: output (binary) file; sum1e6.c: source file
    mpirun -np 10 sum1e6.o         # -np 10: number of processes

Output:

    Sum from 1 to 1000000 is: 1784293664

(The exact sum is 500,000,500,000; the printed value reflects 32-bit int overflow.)
Example: Matrix Multiplication
[Diagram: A × B = C]
Matrix Multiplication: Data Partitioning
[Diagram: each process P0, P1, P2 multiplies its block of rows of A by the full matrix B to produce the corresponding rows of C]
MPI_Scatter
    MPI_Scatter(A,              // send buffer
                4,              // send count
                MPI_FLOAT,      // send data type
                A_row,          // receive buffer
                4,              // receive count
                MPI_FLOAT,      // receive data type
                0,              // source (root) process id
                MPI_COMM_WORLD  // comm. handle
               );
[Diagram: A on P0 is scattered into A_row on P0, P1, and P2]
MPI_Bcast
    MPI_Bcast(B,              // buffer: sent from the root, received by the others
              20,             // count
              MPI_FLOAT,      // data type
              0,              // root (source) process id
              MPI_COMM_WORLD  // comm. handle
             );
[Diagram: B on P0 is broadcast to P0, P1, and P2]
MPI_Gather
    MPI_Gather(C_row,          // send buffer
               5,              // send count
               MPI_FLOAT,      // send data type
               C,              // receive buffer (significant only at the root)
               5,              // receive count
               MPI_FLOAT,      // receive data type
               0,              // root (destination) process id
               MPI_COMM_WORLD  // comm. handle
              );
[Diagram: C_row on P0, P1, and P2 is gathered into C on P0]
More MPI Collective Calls (1)
MPI_Gather + MPI_Bcast
Diagrams from mpitutorial.com
More MPI Collective Calls (2)
• MPI_Reduce + MPI_Bcast
• Reduction operations: MPI_MAX, MPI_MIN, MPI_PROD, MPI_LAND, MPI_LOR, MPI_BAND, MPI_BOR, MPI_MAXLOC, MPI_MINLOC
Diagrams from mpitutorial.com
MPI Collective Calls (3)
MPI_Alltoall()
[Diagram: every process P0, P1, P2 sends a distinct block of data to every other process]
Blocking vs. Non-blocking Calls
• The order of MPI_Send and MPI_Recv calls can cause deadlocks
• There are equivalent non-blocking calls: MPI_Isend and MPI_Irecv
MatlabMPI
What is MatlabMPI?
• A Matlab implementation of a subset of the Message Passing Interface (MPI) standard, developed at MIT Lincoln Laboratory
• An extremely compact (~200 lines) implementation on top of standard Matlab file I/O
• Can match the bandwidth of C-based MPI at large message sizes
Installation/Setup
• Download at http://www.ll.mit.edu/mission/cybersec/softwaretools/matlabmpi/matlabmpi.html
• PC: add to startup.m (usually at $matlabroot/toolbox/local)
  – addpath MatlabMPIInstallationFolder\src
  – addpath ProgramLaunchingFolder\MatMPI
• Linux: add to startup.m
  – addpath /home/username/MatlabMPI/src
• The source file MPI_Probe.m needs to be modified:
  – Line 42: [pathstr, name, ext, versn] = fileparts(file_name); remove the versn output, since fileparts in recent MATLAB releases returns only three outputs
Principles
Principles of MatlabMPI: Implementation of Basic MPI Communications
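As a rough illustration of the principle (a simplified sketch with made-up file names, not the actual MatlabMPI source): a send writes the message to a buffer file and then creates a lock file, while the matching receive waits for the lock file to appear and loads the buffer.

    % sketch_send.m -- write the message payload, then signal with a lock file.
    function sketch_send(my_rank, dest, tag, data)
        buf_file  = sprintf('MatMPI/from%d_to%d_tag%d_buf.mat',  my_rank, dest, tag);
        lock_file = sprintf('MatMPI/from%d_to%d_tag%d_lock.mat', my_rank, dest, tag);
        save(buf_file, 'data');          % message payload
        save(lock_file, 'tag');          % lock file: "message is ready"
    end

    % sketch_recv.m -- wait until the lock file appears, then load the payload.
    function data = sketch_recv(source, my_rank, tag)
        buf_file  = sprintf('MatMPI/from%d_to%d_tag%d_buf.mat',  source, my_rank, tag);
        lock_file = sprintf('MatMPI/from%d_to%d_tag%d_lock.mat', source, my_rank, tag);
        while ~exist(lock_file, 'file')
            pause(0.1);                  % poll until the sender has finished writing
        end
        s = load(buf_file);
        data = s.data;
    end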
Functions: Core
• The core library implements the basic MPI operations:
  MPI_Run.m, MPI_Init.m, MPI_Finalize.m, MPI_Comm_size.m, MPI_Comm_rank.m,
  MPI_Send.m, MPI_Recv.m, MPI_Abort.m, MPI_Bcast.m, MPI_Probe.m, MPI_cc.m
Functions: Utility
• The utility library implements auxiliary functions beside the MPI core:
  MatMPI_Comm_dir.m, MatMPI_Save_messages.m, MatMPI_Delete_all.m, MatMPI_Comm_settings.m,
  MatMPI_Buffer_file.m, MatMPI_Lock_file.m, MatMPI_Commands.m, MatMPI_Comm_init.m, MatMPI_mcc_wrappers
Example: Image Convolution
Serial implementation: C = conv2(A, B, 'same');
[Diagram: image A, padded image, convolution with the (flipped) kernel B, convolved image C]
Padding is needed around the borders; Matlab takes care of it.

Parallel implementation:
[Diagram: image A is split into sub_images, each padded into a work_image, each convolved with kernel B in a serial manner, then merged into the convolved image]
Padding must be done using actual pixel data from the neighboring sub_images to avoid errors.
Code: Skeleton

    % General initialization.
    MPI_Init;                            % Initialize MPI.
    comm = MPI_COMM_WORLD;               % Create communicator.
    comm_size = MPI_Comm_size(comm);     % Get size.
    my_rank = MPI_Comm_rank(comm);       % Get rank.

    % Prepare kernel(nFX, nFY).
    % Prepare image(nX, nY).
    % Split into sub_image(nSX, nSY) for each processor.
    % Prepare work_image based on sub_image with appropriate padding
    % (as shown on the next slide).

    % Convolve each work_image with the kernel.
    work_image = conv2(work_image, kernel, 'same');
    % Extract the convolved result.
    sub_image = work_image(nFX/2+1:nFX/2+nSX, 1:nSY);

    % Finalize MatlabMPI.
    MPI_Finalize;

    % Host processor collects the sub_images and merges them into the final image.
    % End of program.
Code: Padding

    % Find out the ranks of the left and right processors.
    left = my_rank - 1;
    if (left < 0), left = comm_size - 1; end
    right = my_rank + 1;
    if (right >= comm_size), right = 0; end
    ltag = 1; rtag = 2;    % Create message tags.

    % Extract from the left/right side of sub_image
    % and send to the left/right processor.
    l_sub_image = sub_image(1:nFX/2, 1:nSY);
    MPI_Send(left, ltag, comm, l_sub_image);
    r_sub_image = sub_image(nSX-nFX/2+1:nSX, 1:nSY);
    MPI_Send(right, rtag, comm, r_sub_image);

    % Prepare work_image(nSX+nFX, nSY).
    work_image = zeros(nSX+nFX, nSY);
    work_image(nFX/2+1:nFX/2+nSX, 1:nSY) = sub_image;   % Copy sub_image into the central part.
    r_pad = MPI_Recv(right, ltag, comm);                % Receive the right padding from the right processor.
    work_image(nFX/2+nSX+1:nSX+nFX, 1:nSY) = r_pad;     % Put it into the right part of work_image.
    l_pad = MPI_Recv(left, rtag, comm);                 % Receive the left padding from the left processor.
    work_image(1:nFX/2, 1:nSY) = l_pad;                 % Put it into the left part of work_image.
[Diagram: sub_image(left), sub_image(self), sub_image(right); l_pad and r_pad are exchanged with the neighbors via MPI_Send/MPI_Recv and placed around sub_image(self) to form work_image(self)]
How to Run

An example script (RUN.m from the package):

    % Abort leftover jobs.
    MPI_Abort;
    pause(2.0);

    % Delete the leftover MPI directory.
    MatMPI_Delete_all;
    pause(2.0);

    % Define machines; empty means run locally.
    machines = {};

    % Define machines.
    %machines = {'machineA:/directoryA' ...
    %            'machineB:/directoryB'};

    % Run the script on 2 processes.
    eval(MPI_Run('convolveImage', 2, machines));
• System requirements:
  – Shared memory systems require a single Matlab license; distributed memory systems require one Matlab license per machine
  – A directory visible to every machine (defaults to the launching directory but can be changed)
Comparison with MPI
• File I/O makes MatlabMPI less efficient than C-based MPI
• MatlabMPI is more desirable if many other Matlab toolboxes may be utilized for the application
pMatlab
• Newer package built upon MatlabMPI
• Hides message passing from the programmer
• Utilizes a global array library, implemented using MatlabMPI
• Interested? Check it out at http://www.ll.mit.edu/mission/cybersec/softwaretools/pmatlab/pmatlab.html
Questions?
Thank You!