
p2Matlab: Productive Parallel Matlab for the Exascale

Vipin Sachdeva
College of Computing

Georgia Institute of Technology
Atlanta, GA

[email protected]

Abstract—MATLAB® and its open-source implementation Octave have proven to be one of the most productive environments for scientific computing in recent years. There have been multiple efforts to develop an efficient parallel implementation of MATLAB, including by MathWorks® (Parallel Computing Toolbox), MIT Lincoln Labs (pMatlab) and several other organizations. However, most of these implementations seem to suffer from issues in performance or productivity or both. With the rapid scaling of high-end systems to hundreds of thousands of cores, and discussions of exascale systems in the near future, a scalable parallel Matlab would be of immense benefit to practitioners in the scientific computing industry. In this paper, we first describe our work to create an efficient pMatlab running on the IBM BlueGene/P architecture, and present our experiments with several important kernels used in scientific computing, including kernels from the HPC Challenge Awards. We explain the bottlenecks of the current pMatlab implementation on the BlueGene/P architecture, especially at high processor counts, and then outline the steps required to develop a parallel MATLAB/Octave implementation, p2Matlab, which is truly scalable to hundreds of thousands of processors.

Keywords-MATLAB, HPC, parallel programming, Octave

I. INTRODUCTION

The MIT Lincoln Laboratory Grid (LLGrid) team has developed pMatlab [6], a parallel Matlab toolkit that makes parallel programming with Matlab accessible and simple by using two partitioned global address space (PGAS) data types, parallel maps and distributed arrays. This enables pMatlab programmers to work in their familiar environment of numerical arrays and to parallelize their serial codes with only a few lines of code changes.

The parallel Matlab toolkit is made of two layers. The pMatlab layer provides parallel data structures and library functions, and the MatlabMPI layer provides messaging capability. The MatlabMPI layer in its current implementation relies on file I/O and locking as a means of exchanging messages. For further details on pMatlab, please refer to [6], [5].

In Section II, we first explain our efforts to create an efficient pMatlab ported to IBM's BlueGene/P architecture, namely for the pMatlab layer and the MatlabMPI layer operating on BlueGene/P. Since Matlab is not available on the BG/P system, an alternative open-source version of Matlab-equivalent software, Octave, is used as a substitute for Matlab. For MatlabMPI, the investigation was first done using IBM's GPFS filesystem and subsequently using Active Storage Fabric (ASF), which maps part of the memory of the compute nodes to a filesystem. For submission of pMatlab jobs, HTC (High-Throughput Computing) mode on BlueGene/P was used.

In Section III, the results of our existing pMatlab implementation on the BlueGene/P architecture are presented for several kernels such as matrix-multiply, STREAM and FFT, among others. Based on these results, the bottlenecks of the current implementation are presented, especially in the MatlabMPI layer, which relies on disk I/O as a means of communication. In Section IV, we outline the efforts needed to develop a more scalable pMatlab, p2Matlab, based on either a "fast" filesystem or a real MPI implementation as a means of communication.

II. PMATLAB ON BLUEGENE/P

Our current work has resulted in a fully functioning implementation of pMatlab on the BlueGene/P architecture. BlueGene/P is IBM's massively parallel supercomputer, scalable to 262,144 quad-processor nodes with a peak performance of 3.56 Pflops. The BlueGene/P system enables this unprecedented scaling via architectural and design choices that maximize performance per watt, performance per square foot, and mean time between failures. For more details on the BlueGene/P architecture, please refer to [7].

As mentioned before, pMatlab relies on an engine (Matlab or Octave) to run on the compute nodes of the system, and on the messaging layer MatlabMPI to do the communication amongst the processes. pMatlab constructs every kernel as a computation on a "matrix" object. The map API allows partitioning of "matrix" objects amongst the processes, in row-wise, column-wise or block-wise format. Once the matrix object is partitioned appropriately by the programmer depending on the computation, each of the processes computes on its own part of the matrix object.
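To make the map API concrete, the sketch below creates a row-wise map, builds a distributed matrix from it, and operates on each process's local piece. The calls (map, zeros with a map argument, local, put_local, and the global Np) follow the pMatlab conventions described in [6], but the snippet is an illustration rather than verbatim pMatlab code.

  % Illustrative pMatlab-style usage (names follow [6]; treat as a sketch).
  N = 4096;
  mapA = map([Np 1], {}, 0:Np-1);     % row-wise partition over all Np processes
  % mapA = map([1 Np], {}, 0:Np-1);   % column-wise partition instead
  % mapA = map([2 Np/2], {}, 0:Np-1); % 2-D block partition instead

  A = zeros(N, N, mapA);              % distributed "matrix" object built from the map
  Aloc = local(A);                    % this process's own rows of A
  Aloc = 2 .* Aloc;                   % each process computes on its local piece
  A = put_local(A, Aloc);             % write the local piece back into A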

For communication amongst the processes, pMatlab uses the MatlabMPI layer, which is currently based on file I/O. For example, if processor X has to send a message to processor Y, X first opens a lock file to denote that it is still in the process of writing data to the file. Once the lock file is created, processor X then writes a binary Matlab (.mat) file with the actual data itself. Processor Y, when it reaches its stage of reading the data, first spins on the lock file. If the lock file is still present, it denotes that the data required by processor Y is not yet available. If there is no lock file, processor Y can safely read data from the appropriate file. The files are suffixed by the process IDs of the sending and the receiving process; for example, in this case the files will be suffixed X_Y.mat or X_Y.lock.
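The following Octave sketch illustrates this file-based handshake. It is not the actual MatlabMPI source; the file naming and the final removal of the lock file are assumptions consistent with the description above.

  % Illustrative Octave sketch of the file-based handshake (not MatlabMPI source).
  function file_send(src, dest, data)
    datafile = sprintf('%d_%d.mat', src, dest);
    lockfile = sprintf('%d_%d.lock', src, dest);
    fid = fopen(lockfile, 'w'); fclose(fid);   % lock: writing in progress
    save('-mat', datafile, 'data');            % payload as a binary .mat file
    delete(lockfile);                          % unlock: receiver may now read
  end

  function data = file_recv(src, dest)
    datafile = sprintf('%d_%d.mat', src, dest);
    lockfile = sprintf('%d_%d.lock', src, dest);
    % spin until the lock file is gone and the data file exists
    while exist(lockfile, 'file') || ~exist(datafile, 'file')
      pause(0.01);
    end
    s = load('-mat', datafile);
    data = s.data;
  end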

The compute engine of pMatlab on BlueGene/P is based on Octave [2], an open-source alternative to Matlab. This choice was made for several reasons, including Matlab not being supported on PowerPC architectures. The other reason was that, since Octave is fully open-source, it can be configured and subsequently optimized based on the user's needs. Octave is mostly compatible with Matlab, and is capable of reading Matlab's .m files. To get Octave running successfully on the compute nodes, it has to be considered that BlueGene/P's front-end and compute nodes are different architectures and run different operating systems: the compute nodes are 32-bit and run CNK (the compute node kernel), while the front-end is a 64-bit Power6 architecture running SuSE Linux. This is common in high-end systems, as running a full-featured operating system on the compute nodes leads to issues in scalability and performance. For this reason, cross-configuration and compilation had to be completed for Octave to run on BlueGene/P's compute nodes. Octave relies on standard GNU autoconf scripts for configuration/compilation, which accept options for cross-compilation. Octave without the X11 libraries is functional and running on the compute nodes; further optimization of Octave will, however, be needed, and is one of the steps outlined for our future work.

No porting effort was required for the second component of pMatlab, MatlabMPI, to work on BlueGene/P, as it relies on disk I/O. This makes MatlabMPI highly portable across all architectures, as long as there is a single shared filesystem across all compute nodes. Our BlueGene/P system uses IBM's GPFS (General Parallel File System). To reduce the overhead of disk I/O, experimentation with ASF (Active Storage Fabric) was also completed. ASF allows every process to contribute a part of its memory to create a memory-mapped shared filesystem among the compute nodes; for example, if there are 1024 compute processes and each of them contributes 1 GB, the mounted "fast" filesystem will consist of 1 TB. This filesystem has very fast access times, and could potentially speed up the "communication" time amongst the processes. For further information on ASF, please refer to [3]. In Section III, the results for both GPFS and ASF filesystems are presented.

pMatlab "forks" a single Matlab or Octave process into user-specified single-node processes, each computing on its own part of the data (defined by the map API explained before). However, the fork API is not supported by CNK; for this reason, we relied on BlueGene/P's HTC mode as an alternative for job submission and creation on the compute nodes. HTC mode on BG/P allows parts of the machine to be used for single-node jobs. Unlike an MPI job, these processes are completely independent and do not communicate at all. While the pMatlab processes do communicate, in their current implementation they communicate using disk I/O and are not dependent on MPI. Hence, HTC is a suitable alternative to "fork" processes on the compute nodes. For further details on HTC, please refer to [1].

Our fully functioning port of pMatlab on BlueGene/P thus proceeds as follows: we create a standard Octave command line on the front-end by launching the Octave executable compiled for the front-end nodes. The actual job submission to the compute nodes is then performed through this front-end process, which uses the HTC API submit to create Octave processes on the compute nodes. The compute nodes then use disk I/O (GPFS or ASF) to do communication, and finally return the results to the front-end. In the next section, we present results of our implementation for several kernels, including kernels from the HPC Challenge benchmarks.
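A hedged sketch of this launch loop on the front-end is shown below. The arguments of the HTC submit command and the name of the per-rank kernel script are assumptions; only the existence of the submit API is taken from the text above.

  % Hedged sketch of the front-end launch loop (submit arguments are assumed).
  nprocs = 128;
  for rank = 0:nprocs-1
    cmd = sprintf('submit octave -q kernel.m %d %d &', rank, nprocs);
    system(cmd);                 % one single-node Octave job per rank via HTC mode
  end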


III. RESULTS AND ANALYSIS

In this section, we present results for our current implementation of pMatlab on the BlueGene/P architecture. Our system consists of 4096 quad-core compute nodes, and allows up to 16,384 MPI processes. Each compute node is based on four 32-bit PowerPC cores with a frequency of 750 MHz, and the four cores share 4 GB of RAM. The compute nodes are connected to a GPFS filesystem having a peak bandwidth of 15 GB/s. The GPFS is serviced by four file system servers, connected through 10 GB/s Ethernet to the I/O nodes. Our system has a pset ratio of 16, which implies that each I/O node services 16 compute nodes.

Figure 1. pStream bandwidths on the BlueGene/P

We first present results for the pStream benchmark, part of the HPC Challenge benchmarks. Figure 1 shows the bandwidths of the pStream benchmark, as a percentage of the linear bandwidths (obtained by linearly scaling the bandwidth at 64 processes). For this benchmark, we create an array of 2^30 elements, equally distributed amongst the processes at initialization. pStream does not require any communication, and only depends on simple operations (scale, add, triad) on a global array equally partitioned amongst the processes. For this reason, we see linear scaling up to 16,384 processes, which is the full size of our system. At 16,384 processes, the bandwidth of the system for pStream operations exceeds 95% of the linear bandwidth of the system.
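The reason for this behavior is that the scale, add and triad operations touch only the local piece of each distributed array, so no files are exchanged at all. The sketch below, in the same illustrative pMatlab-style notation as before, shows the per-process triad work; the bandwidth accounting follows the usual STREAM convention and is not taken from the benchmark's source.

  % pStream-style triad on the local pieces only; no messages are exchanged.
  Ntotal = 2^30;
  m = map([Np 1], {}, 0:Np-1);                 % split the vectors over all processes
  A = zeros(Ntotal, 1, m);
  B = zeros(Ntotal, 1, m);  B = put_local(B, rand(size(local(B))));
  C = zeros(Ntotal, 1, m);  C = put_local(C, rand(size(local(C))));
  q = 3.14;

  tic;
  A = put_local(A, local(B) + q .* local(C));  % triad: a = b + q*c, purely local
  t = toc;
  bytes_per_proc = 3 * 8 * (Ntotal / Np);      % read b and c, write a (8-byte doubles)
  bw_per_proc = bytes_per_proc / t;            % bytes per second on this process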

Figure 2. Time taken for matrix-multiply - compute and aggregate

We next present results for a simple matrix-matrix multiply kernel: we partition the first matrix (A) row-wise, but each process has the full matrix B. Each process thus computes a subset of the rows of the result matrix. We did not employ any optimization techniques, as we wanted to see the results for a very simple implementation. We also did not link the Octave running on the compute nodes with an optimized BLAS implementation; this was done to keep the compute nodes as busy as possible, and to keep the compute-to-communication ratio as high as possible. Figure 2 shows the time taken to multiply two square double-precision matrices, each of dimension 4096, on up to 128 processes. The aggregate time is the time taken to aggregate the full result matrix on the root process. This includes the time taken by the root process to receive rows of the result matrix from all compute processes; for example, in the case of 128 processes, the root process has to read 127 files (one from each process), and each file contains the results of every row-column multiplication performed by that process. With an increasing number of processes, the size of every file decreases (less data per process), but the number of files increases. This leads to a bottleneck for the file system. As can be seen, the compute step is scalable up to 128 processes, but the aggregate step increases in time with an increasing number of processes. At higher processor counts, the increase is steeper compared to lower processor counts; at 128 processes, the communication time ends up taking a bigger percentage of the total time. This result shows the bottlenecks of disk I/O, even for a problem with very limited communication such as matrix multiplication.
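In the same illustrative pMatlab-style notation (names such as agg follow [6] but should be treated as a sketch, not our actual kernel source), the computation just described looks roughly as follows: A and the result are distributed row-wise, B is replicated, the multiply is purely local, and the aggregate step gathers all rows on the root.

  % Row-partitioned matrix multiply followed by the aggregate step (sketch).
  N = 4096;
  mapA = map([Np 1], {}, 0:Np-1);      % A and C are split row-wise across processes
  A = zeros(N, N, mapA);
  A = put_local(A, rand(size(local(A))));
  B = rand(N, N);                      % B is an ordinary, fully replicated matrix
  C = zeros(N, N, mapA);

  C = put_local(C, local(A) * B);      % compute step: this process's rows of C
  Cfull = agg(C);                      % aggregate step: gather all rows on the root
                                       % (this is the file-I/O-bound part)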

Figure 3. FFT times using both GPFS and ASF

For our next example, we show results for a more bandwidth-intensive kernel, the fast Fourier transform (FFT), which is also part of the HPC Challenge benchmarks. In this case, the size of the input array increases linearly with the number of processors (weak scaling). For this example, we show results with both GPFS and ASF. In the ASF case, the lock file and the data file are both opened in the memory-mapped filesystem, which is significantly faster than GPFS. As can be seen from Figure 3, with ASF the compute times increase marginally, which can be attributed to the system noise associated with ASF, while the communication times show a larger decrease; even so, the communication times remain more than two orders of magnitude larger than the compute times. This signifies that the current method of communication based on disk I/O is unsustainable for high process counts on standard filesystems such as GPFS. The current ASF implementation depends on only one server handling the I/O requests, which might be a bottleneck. Further analysis of the performance of pMatlab on ASF, or on faster filesystems such as flash-memory-based filesystems, is thus required.

IV. CONCLUSION

In Section III, we showed that our existing implementation, which relies on file I/O (both GPFS and ASF) as a means of communication, creates bottlenecks in performance, even for problems that are pleasingly parallel. The STREAM benchmark, which does not use communication at all, is seen to be scalable up to 16,384 processes. Matrix-multiply, which only uses communication in the last step, as well as FFT, which does more periodic communication, both suffer in performance due to the overhead of the communication step; in the case of FFT, the degradation in performance is worse. Thus, any kernel which relies on an aggregate step (similar to MPI_Reduce) during execution, such as matrix multiplication or FFT, shows scalability issues in the communication step. We therefore propose p2Matlab, a parallel MATLAB implementation with a more scalable means of communication amongst the processes; the communication could still be based on ASF-based I/O, or depend on an MPI implementation. The first part of this work will study the bottlenecks of the ASF implementation in more detail, and analyze whether ASF can provide truly fast communication amongst the processes; a flash-memory-based filesystem could also be an alternative to ASF. A more long-term solution could be to use a true MPI or sockets library for MatlabMPI; since MPI has been tested and optimized for many applications, this could provide a scalable communication layer for MatlabMPI. One of the earlier efforts to couple Octave with MPI is bcMPI [4], but in our experiments, bcMPI did not work with the newer releases of Octave. Our work will build on the bcMPI effort, and then subsequently optimize the implementation. We then intend to evaluate several problems from life sciences, finance and data analysis to fully ascertain the usability of p2Matlab in several domains. Domain-specific extensions to the pMatlab APIs, and the ability to handle objects other than double-precision matrices, will be the other areas of work. Support for shared-memory parallelism and vectorization as part of further optimization of Octave is also needed. Our final goal is to be able to launch a truly scalable Octave implementation running on hundreds of thousands of processes on a massively parallel system such as BlueGene/P from a Matlab or Octave environment on a user desktop. We believe such an implementation will make HPC accessible to the masses.

ACKNOWLEDGMENT

I thank my research advisor, David A. Bader, in the School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology. I would like to thank Dr. Kirk Jordan (IBM) for technical advice on BlueGene/P and HTC mode, and Blake Fitch (IBM) for providing assistance with ASF. I would also like to thank Jeremy Kepner, Julie Mullen and Chansup Byun (MIT Lincoln Labs) for technical help on pMatlab.

REFERENCES

[1] Tom Budnik, Brant Knudson, Mark Megerian, and Sam Miller. High Throughput Computing on IBM's BlueGene®/P. Technical report, International Business Machines, Rochester, NY, 2008.

[2] John W. Eaton, David Bateman, and Soren Hauberg. GNU Octave Manual Version 3. Network Theory Ltd, 2008.

[3] Blake G. Fitch, Aleksandr Rayshubskiy, Michael C. Pitman, T. J. Christopher Ward, and Robert S. Germain. Using the Active Storage Fabrics model to address Petascale Storage Challenges. In Proc. 4th Annual Workshop on Petascale Data Storage, pages 47–54, Portland, OR, 2009.

[4] Dave Hudak, Neil Ludban, Jaya Natarajan, Siddharth Samsi, and Ashok Krishnamurthy. bcMPI: HPC Computational Science IDE on Itanium. In Proc. GELATO 2007, 2007.

[5] J. Kepner. Parallel MATLAB for Multicore and Multinode Computers. SIAM, Philadelphia, PA, 2009.

[6] J. Kepner and N. T. Bliss. Parallel Matlab: The Next Generation. In Proc. High-Performance Embedded Computing, 2003.

[7] IBM BlueGene Team. Overview of the IBM BlueGene/P project. IBM Journal of Research and Development, 52(1), January 2008.
