Slide 1
MIT Lincoln Laboratory
Toward Mega-Scale Computing with pMatlab
Chansup Byun and Jeremy Kepner
MIT Lincoln Laboratory
Vipin Sachdeva and Kirk E. Jordan
IBM T.J. Watson Research Center
HPEC 2010
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Slide 2
Outline
• Introduction
– What is Parallel Matlab (pMatlab)
– IBM Blue Gene/P System
– BG/P Application Paths
– Porting pMatlab to BG/P
• Performance Studies
• Optimization for Large Scale Computation
• Summary
Slide 3
Parallel Matlab (pMatlab)
[Diagram: layered architecture. The Application (Input, Analysis, Output) reaches the Parallel Library through the User Interface; the Parallel Library reaches the Parallel Hardware through the Hardware Interface. The Parallel Library consists of the Library Layer (pMatlab: Vector/Matrix, Comp, Task, Conduit) on top of the Kernel Layer (Math: MATLAB/Octave; Messaging: MatlabMPI).]
Layered architecture for parallel computing
• Kernel layer does single-node math and parallel messaging
• Library layer provides a parallel data and computation toolbox to Matlab users
Slide 4
IBM Blue Gene/P System
• Core speed: 850 MHz
• LLGrid core count: ~1K
• Blue Gene/P core count: ~300K
Slide 5
Blue Gene Application Paths
[Diagram: two application paths in the Blue Gene environment. Serial and pleasantly parallel apps use High Throughput Computing (HTC); highly scalable message-passing apps use High Performance Computing (MPI).]
• High Throughput Computing (HTC)
– Enabling BG partition for many single-node jobs
– Ideal for “pleasantly parallel” type applications
Slide 6
HTC Node Modes on BG/P
• Symmetrical Multiprocessing (SMP) mode
– One process per compute node
– Full node memory available to the process
• Dual mode
– Two processes per compute node
– Half of the node memory per process
• Virtual Node (VN) mode
– Four processes per compute node (one per core)
– One quarter of the node memory per process
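The three modes trade process count against memory per process. A minimal Python sketch of that arithmetic follows. The function name is hypothetical, and the 2 GB node memory in the usage loop is an assumed value for illustration, not a figure from these slides.

```python
# Processes per compute node for each HTC node mode listed above.
PROCESSES_PER_NODE = {"smp": 1, "dual": 2, "vn": 4}


def layout(mode, num_nodes, node_memory_gb):
    """Return (total processes, memory per process in GB) for a partition."""
    per_node = PROCESSES_PER_NODE[mode]
    return num_nodes * per_node, node_memory_gb / per_node


# node_memory_gb=2.0 is an assumption, not a value from the slides.
for mode in ("smp", "dual", "vn"):
    total, mem = layout(mode, num_nodes=512, node_memory_gb=2.0)
    print(f"{mode}: {total} processes, {mem} GB each")
```

With 4096 nodes, VN mode gives the 4096x4 = 16384 process layout used later in the pStream study.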
Slide 7
Porting pMatlab to BG/P System
• Requesting and booting a BG partition in HTC mode
– Execute “qsub” command
    Define number of processes, runtime, HTC boot script
    (htcpartition --trace 7 --boot --mode dual \
        --partition $COBALT_PARTNAME)
    Wait for the partition ready (until the boot completes)
• Running jobs
– Create and execute a Unix shell script to run a series of “submit” commands including
    submit -mode dual -pool ANL-R00-M1-512 \
        -cwd /path/to/working/dir -exe /path/to/octave \
        -env LD_LIBRARY_PATH=/home/cbyun/lib \
        -args "--traditional MatMPI/MatMPIdefs523.m"
• Combine the two steps
    eval(pRUN('m_file', Nprocs, 'bluegene-smp'))
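The job-running step lends itself to scripting. The Python sketch below writes one “submit” line per process, reusing only the flags shown above. It assumes one MatMPI/MatMPIdefs<rank>.m file per process rank, which the file name in the example suggests but the slide does not state. The function name is hypothetical, and whether the commands should run in sequence or in the background is not specified here.

```python
def build_submit_script(nprocs, pool, cwd, exe, lib_path, mode="dual"):
    """Return shell script text with one submit command per process rank."""
    lines = ["#!/bin/sh"]
    for rank in range(nprocs):
        # Assumption: each rank reads its own MatMPI/MatMPIdefs<rank>.m file.
        lines.append(
            f"submit -mode {mode} -pool {pool} "
            f"-cwd {cwd} -exe {exe} "
            f"-env LD_LIBRARY_PATH={lib_path} "
            f'-args "--traditional MatMPI/MatMPIdefs{rank}.m"'
        )
    return "\n".join(lines) + "\n"


# Pool name and paths are the placeholders shown on the slide.
script = build_submit_script(
    nprocs=4,
    pool="ANL-R00-M1-512",
    cwd="/path/to/working/dir",
    exe="/path/to/octave",
    lib_path="/home/cbyun/lib",
)
print(script)
```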
Slide 8
Outline
• Introduction
• Performance Studies
– Single Process Performance
– Point-to-Point Communication
– Scalability
• Optimization for Large Scale Computation
• Summary
Slide 9
Performance Studies
• Single Processor Performance
– MandelBrot
– ZoomImage
– Beamformer
– Blurimage
– Fast Fourier Transform (FFT)
– High Performance LINPACK (HPL)
• Point-to-Point Communication
– pSpeed
• Scalability
– Parallel Stream Benchmark: pStream
Slide 10
Single Process Performance: Intel Xeon vs. IBM PowerPC 450
[Bar chart: time relative to Matlab (lower is better) for HPL, MandelBrot, ZoomImage*, Beamformer, Blurimage*, and FFT. Series: Matlab 2009b, LLGrid; Octave 3.2.2, LLGrid; Octave 3.2.2, IBM BG/P; Octave 3.2.2, IBM BG/P (Clock Normalized). Absolute times annotated on the chart: 26.4 s, 11.9 s, 18.8 s, 6.1 s, 1.5 s, 5.2 s.]
* conv2() performance issue in Octave has been improved in a subsequent release
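Clock normalization rescales a BG/P timing to the clock rate of the comparison machine. A minimal Python sketch follows, assuming simple linear scaling by the clock ratio, since the slides do not give the formula. The 850 MHz figure is the BG/P core speed quoted earlier; the reference clock in the usage line is a hypothetical placeholder, because the Xeon clock rate is not stated here.

```python
BGP_CLOCK_MHZ = 850.0  # BG/P core speed quoted in the slides


def clock_normalized(time_s, ref_clock_mhz, clock_mhz=BGP_CLOCK_MHZ):
    """Scale a measured time as if the core ran at ref_clock_mhz.

    Assumes run time is inversely proportional to clock rate.
    """
    return time_s * clock_mhz / ref_clock_mhz


# 3200 MHz is a made-up reference clock used only for illustration.
print(clock_normalized(10.0, ref_clock_mhz=3200.0))
```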
Slide 11
Octave Performance With Optimized BLAS
[Plot: DGEMM performance comparison. x-axis: matrix size (N x N); y-axis: MFLOPS.]
Slide 12
Single Process Performance: Stream Benchmark
[Bar chart: performance relative to Matlab (higher is better) for the three kernels below. Series: Matlab 2009b, LLGrid; Octave 3.2.2, LLGrid; Octave 3.2.2, IBM BG/P; Octave 3.2.2, IBM BG/P (Clock Normalized). Bandwidths annotated on the chart: 944 MB/s, 1208 MB/s, 996 MB/s.]
• Scale: b = q c
• Add: c = a + b
• Triad: a = b + q c
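The three kernels are simple enough to sketch directly. The Python version below uses plain lists, so its absolute bandwidth says nothing about Matlab or Octave; it only shows what each kernel computes and how bytes moved are counted (two arrays for Scale, three for Add and Triad, 8 bytes per element). Add follows the standard STREAM definition, c = a + b. The function name and the initial values are illustrative.

```python
import time


def stream_kernels(n, q=3.0):
    """Run Scale, Add, and Triad once on length-n lists of floats.

    Returns (a, b, c, bandwidths), where bandwidths maps each kernel
    name to bytes moved per second.
    """
    word = 8  # bytes per double-precision value
    a = [1.0] * n
    b = [2.0] * n
    c = [0.5] * n
    bandwidths = {}

    t0 = time.perf_counter()
    b = [q * x for x in c]                  # Scale: b = q c
    bandwidths["scale"] = 2 * word * n / max(time.perf_counter() - t0, 1e-9)

    t0 = time.perf_counter()
    c = [x + y for x, y in zip(a, b)]       # Add: c = a + b
    bandwidths["add"] = 3 * word * n / max(time.perf_counter() - t0, 1e-9)

    t0 = time.perf_counter()
    a = [y + q * z for y, z in zip(b, c)]   # Triad: a = b + q c
    bandwidths["triad"] = 3 * word * n / max(time.perf_counter() - t0, 1e-9)

    return a, b, c, bandwidths


a, b, c, bw = stream_kernels(100_000)
for name, value in bw.items():
    print(f"{name}: {value / 1e6:.1f} MB/s")
```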
Slide 13
Point-to-Point Communication
[Diagram: processes Pid = 0, 1, 2, 3, ..., Np-1, each exchanging messages with its neighbor.]
• pMatlab example: pSpeed
– Send/receive messages to/from the neighbor.
– Messages are files in pMatlab.
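Because messages are files, a send is a file write and a receive is a wait for that file to appear. The Python sketch below illustrates the idea only. The file naming, the pickle format, and the write-then-rename step are assumptions made for this example and are not MatlabMPI's actual on-disk protocol.

```python
import os
import pickle
import tempfile
import time


def _path(comm_dir, src, dest, tag):
    # Hypothetical naming scheme: one file per (source, destination, tag).
    return os.path.join(comm_dir, f"msg_{src}_to_{dest}_tag{tag}.pkl")


def send(comm_dir, src, dest, tag, data):
    """Write the message to a temporary file, then rename it into place."""
    final = _path(comm_dir, src, dest, tag)
    tmp = final + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(data, f)
    os.replace(tmp, final)  # the final name appears only after the write


def recv(comm_dir, src, dest, tag, timeout=10.0, poll=0.01):
    """Poll until the message file appears, then read and delete it."""
    path = _path(comm_dir, src, dest, tag)
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(path)
        time.sleep(poll)
    with open(path, "rb") as f:
        data = pickle.load(f)
    os.remove(path)
    return data


with tempfile.TemporaryDirectory() as comm_dir:
    send(comm_dir, src=0, dest=1, tag=7, data=[1.0, 2.0, 3.0])
    received = recv(comm_dir, src=0, dest=1, tag=7)
print(received)
```

Every send and receive touches the file system, which is why the file system choice dominates the pSpeed results that follow.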
Slide 14
Filesystem Consideration
• A single NFS-shared disk (Mode S)
[Diagram: processes Pid = 0, 1, 2, 3, ..., Np-1 exchanging message files through one shared disk.]
• A group of cross-mounted, NFS-shared disks to distribute messages (Mode M)
[Diagram: processes Pid = 0, 1, 2, 3, ..., Np-1 exchanging message files through several cross-mounted disks.]
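The difference between the two modes comes down to where each message file is written. A minimal Python sketch, assuming that Mode M picks a disk by destination Pid in round-robin order; the slides do not say how messages are assigned to disks, and the function name and paths are hypothetical.

```python
def message_dir(dest, mode, shared_dir, cross_mounts=None):
    """Pick the directory in which a message for process `dest` is written.

    Mode "S": every message goes to the single shared directory.
    Mode "M": messages are spread over several cross-mounted directories.
    Assumption: the directory is chosen by destination Pid, round-robin.
    """
    if mode == "S":
        return shared_dir
    if mode == "M":
        if not cross_mounts:
            raise ValueError("mode M needs at least one cross-mounted directory")
        return cross_mounts[dest % len(cross_mounts)]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical paths, for illustration only.
mounts = ["/mnt/node0", "/mnt/node1", "/mnt/node2", "/mnt/node3"]
for pid in range(6):
    print(pid,
          message_dir(pid, "S", "/shared/MatMPI"),
          message_dir(pid, "M", "/shared/MatMPI", mounts))
```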
Slide 15
pSpeed Performance on LLGrid: Mode S
[Four plots: Matlab 2009b with 1x2, 2x1, 4x1, and 8x1 process configurations. x-axis: message size, 8*2^N bytes, N = 1 to 24; y-axis: bandwidth, bytes/sec (log scale); series: Iterations 1 to 4. Higher is better.]
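The x-axis values and the plotted bandwidth are related by simple arithmetic. A minimal Python sketch, assuming bandwidth is the size of one message divided by its measured time; how pSpeed itself accounts for the send and receive halves of an exchange is not stated in the slides, and the function names are hypothetical.

```python
def message_size_bytes(n):
    """Size of the n-th message on the x-axis: 8 * 2^n bytes."""
    return 8 * 2 ** n


def bandwidth(n, seconds):
    """Bytes per second for one message of index n sent in `seconds`."""
    return message_size_bytes(n) / seconds


sizes = [message_size_bytes(n) for n in range(1, 25)]
print(sizes[0], sizes[-1])  # smallest and largest message, in bytes
```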
Slide 16
pSpeed Performance on LLGrid: Mode M
[Four plots comparing Matlab 2009b and Octave 3.2.2: time in seconds (lower is better) and bandwidth in bytes/sec (higher is better), each for the 4x1 and 8x1 process configurations. x-axis: message size, 8*2^N bytes, N = 1 to 24.]
Slide 17
pSpeed Performance on BG/P
BG/P Filesystem: GPFS
[Two plots for Octave 3.2.2 with 2x1, 4x1, and 8x1 process configurations: bandwidth in bytes/sec and time in seconds versus message size, 8*2^N bytes, N = 1 to 24.]
Slide 18
pStream Results with Scaled Size
• SMP mode: Initial global array size of 2^25 for Np=1
– Global array size scales proportionally as the number of processes increases (1024x1)
[Plot: bandwidth in MB/sec (log scale) versus number of processes, 2^(N-1), N = 1 to 11, for Scale, Add, and Triad.]
563 GB/sec, 100% efficiency at Np = 1024
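In the scaled-size study the work per process stays constant while the global array grows. A short Python sketch of the sizes involved, assuming the global array is divided evenly among processes; the function names are hypothetical.

```python
BASE_SIZE = 2 ** 25  # global array size for Np = 1, from the slide


def scaled_global_size(np_procs, base=BASE_SIZE):
    """Global array size when the problem grows in proportion to Np."""
    return base * np_procs


def local_size(global_size, np_procs):
    """Elements held by each process under an even block distribution."""
    return global_size // np_procs


g = scaled_global_size(1024)
print(g, local_size(g, 1024))
```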
Slide 19
pStream Results with Fixed Size
• Global array size of 2^30
– The number of processes scaled up to 16384 (4096x4)
VN: 8.832 TB/sec, 101% efficiency at Np=16384
DUAL: 4.333 TB/sec, 96% efficiency at Np=8192
SMP: 2.208 TB/sec, 98% efficiency at Np=4096
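The efficiency figures can be related to per-process bandwidth. The Python sketch below assumes efficiency means aggregate bandwidth divided by Np times the single-process bandwidth; the slides report efficiencies without defining them or giving the single-process baseline, so that baseline is left as a parameter. Decimal units (1 TB = 10^12 bytes) are also an assumption.

```python
def per_process_bandwidth(aggregate, np_procs):
    """Average bandwidth contributed by each process."""
    return aggregate / np_procs


def efficiency(aggregate, np_procs, single_process_bandwidth):
    """Aggregate bandwidth divided by Np times the one-process bandwidth."""
    return aggregate / (np_procs * single_process_bandwidth)


TB = 1e12  # decimal terabyte; an assumption about the units on the slide
reported = {
    "VN": (8.832 * TB, 16384),
    "DUAL": (4.333 * TB, 8192),
    "SMP": (2.208 * TB, 4096),
}
for mode, (bw, np_procs) in reported.items():
    mb_s = per_process_bandwidth(bw, np_procs) / 1e6
    print(f"{mode}: {mb_s:.1f} MB/s per process")
```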
Slide 20
Outline
• Introduction
• Performance Studies
• Optimization for Large Scale Computation
– Aggregation
• Summary
Slide 21
Current Aggregation Architecture
• The leader process receives all the distributed data from other processes.
• All other processes send their portion of the distributed data to the leader process.
• The process is inherently sequential.
– The leader receives Np-1 messages.
[Diagram: processes 0 through 7 (Np = 8), each sending its data to the leader. Np: total number of processes.]
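The sequential pattern can be sketched in a few lines of Python. This is an in-memory illustration, not the pMatlab agg() implementation: each loop iteration stands in for one file-based receive, and the choice of Pid 0 as leader is an assumption.

```python
def leader_aggregate(parts, leader=0):
    """Gather every process's part at the leader, one message at a time.

    `parts` maps Pid -> list of local values. Returns the combined list
    (in Pid order) and the number of messages the leader received.
    """
    gathered = {leader: list(parts[leader])}
    messages = 0
    for pid in sorted(parts):
        if pid == leader:
            continue
        gathered[pid] = list(parts[pid])  # stands in for one receive
        messages += 1
    combined = [x for pid in sorted(gathered) for x in gathered[pid]]
    return combined, messages


parts = {pid: [pid * 10, pid * 10 + 1] for pid in range(8)}
combined, messages = leader_aggregate(parts)
print(messages)  # Np - 1 receives, all handled by the leader
```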
Slide 22
Binary-Tree Based Aggregation
• BAGG: Distributed message collection using a binary tree
– The even-numbered processes send a message to their odd-numbered neighbors.
– The odd-numbered processes receive a message from their even-numbered neighbors.
[Diagram: binary tree for eight processes. Pairs (0,1), (2,3), (4,5), (6,7) combine into 0, 2, 4, 6, which then combine into 0 and 4.]
• The maximum number of messages a process may send/receive is N, where Np = 2^N
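A binary-tree gather can be sketched as follows. This Python version is an in-memory illustration, not the pMatlab bagg() code. It follows the diagram, in which Pids 0, 2, 4, 6 hold the data after the first round and Pids 0 and 4 after the second, and it assumes each process starts with one contiguous part that is concatenated in Pid order.

```python
def bagg(parts):
    """Binary-tree gather for Np = 2^N processes.

    `parts` maps Pid -> list of local values. In each round, the
    process at Pid + step sends everything it holds to Pid, and step
    doubles. Returns the combined list held by Pid 0 and the largest
    number of messages any single process received.
    """
    np_procs = len(parts)
    if np_procs < 1 or np_procs & (np_procs - 1):
        raise ValueError("Np must be a power of two")
    held = {pid: list(vals) for pid, vals in parts.items()}
    received = {pid: 0 for pid in parts}
    step = 1
    while step < np_procs:
        for pid in range(0, np_procs, 2 * step):
            held[pid].extend(held.pop(pid + step))  # one message
            received[pid] += 1
        step *= 2
    return held[0], max(received.values())


combined, most = bagg({pid: [pid] for pid in range(8)})
print(combined, most)
```

For Np = 8, no process handles more than three receives, compared with seven at the leader in the sequential scheme.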
Slide 23
BAGG() Performance
[Plot: time in seconds (log scale) versus number of CPUs, 2^N, N = 1 to 7, for agg() and bagg() on IBRIX and on LUSTRE.]
• Two dimensional data and process distribution
• Two different file systems are used for performance comparison
– IBRIX: file system for users’ home directories
– LUSTRE: parallel file system for all computation
IBRIX: 10x faster at Np=128
LUSTRE: 8x faster at Np=128
Slide 24
BAGG() Performance, 2
[Plot: time in seconds (log scale) versus number of CPUs, 2^(N+2), N = 1 to 8, for agg() and bagg() on GPFS.]
• Four dimensional data and process distribution
• With GPFS file system on IBM Blue Gene/P System (ANL’s Surveyor)
– From 8 processes to 1024 processes
2.5x faster at Np=1024
Slide 25
Generalizing Binary-Tree Based Aggregation
• HAGG: Extend the binary tree to the next power of two
– Suppose that Np = 6; the next power of two is Np* = 8
– Skip any messages from/to the fictitious Pids.
[Diagram: the eight-process binary tree (0 to 7, then 0, 2, 4, 6, then 0 and 4); for Np = 6, Pids 6 and 7 are fictitious.]
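The generalization can be sketched by building the tree for Np* and skipping any exchange that involves a Pid that does not exist. The Python code below is an in-memory illustration under that reading of the slide, not the pMatlab hagg() implementation; the function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def hagg(parts):
    """Binary-tree gather for any Np >= 1.

    The tree is built for Np* = next_power_of_two(Np). An exchange
    whose sender is a fictitious Pid (>= Np) is skipped; the receiver
    is always below the sender, so checking the sender is enough.
    Returns the combined list at Pid 0 and the number of skips.
    """
    np_procs = len(parts)
    np_star = next_power_of_two(np_procs)
    held = {pid: list(vals) for pid, vals in parts.items()}
    skipped = 0
    step = 1
    while step < np_star:
        for pid in range(0, np_star, 2 * step):
            sender = pid + step
            if sender >= np_procs:
                skipped += 1  # fictitious Pid: nothing to send
                continue
            held[pid].extend(held.pop(sender))
        step *= 2
    return held[0], skipped


combined, skipped = hagg({pid: [pid] for pid in range(6)})
print(combined, skipped)
```

The skip test is the bookkeeping cost that separates this scheme from the pure power-of-two case.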
Slide 26
BAGG() vs. HAGG()
• HAGG() generalizes BAGG()
– Removes the restriction (Np = 2^N) in BAGG()
– Additional costs associated with bookkeeping
• Performance comparison on two dimensional data and process distribution
[Plot: time in seconds (log scale) versus number of CPUs, 2^N, N = 1 to 7, for agg, bagg, and hagg on IBRIX.]
~3x faster at Np=128
Slide 27
BAGG() vs. HAGG(), 2
[Plot: time in seconds (log scale) versus number of CPUs, 2^(N+2), N = 1 to 7, for bagg() and hagg() on GPFS.]
• Performance comparison on four dimensional data and process distribution
• Performance difference is marginal on a dedicated environment
– SMP mode on IBM Blue Gene/P System
Slide 28
BAGG() Performance with Crossmounts
• Significant performance improvement by reducing resource contention on the file system
– Performance is jittery because a production cluster was used for the performance test
[Plot: total runtime in seconds (lower is better) versus number of processes, 2^(N-1), N = 1 to 8. Series: Old AGG, IBRIX; New AGG, IBRIX; New AGG, Hybrid (IBRIX+Crossmounts).]
Slide 29
Summary
• pMatlab has been ported to the IBM Blue Gene/P system
• Clock-normalized, single-process performance of Octave on the BG/P system is on par with Matlab
• For pMatlab point-to-point communication (pSpeed), file system performance is important.
– Performance is as expected with GPFS on BG/P
• Parallel Stream Benchmark scaled to 16384 processes
• Developed a new pMatlab aggregation function using a binary tree to scale beyond 1024 processes