
Journal of African Earth Sciences 100 (2014) 1–6

Contents lists available at ScienceDirect

Journal of African Earth Sciences

journal homepage: www.elsevier.com/locate/jafrearsci

MT3DMSP – A parallelized version of the MT3DMS code

http://dx.doi.org/10.1016/j.jafrearsci.2014.06.006   1464-343X/© 2014 Elsevier Ltd. All rights reserved.

* Corresponding author. Address: Gustav-Zeuner-Str. 12, D-09596 Freiberg, Germany. Tel.: +49 17634181843.

E-mail addresses: [email protected] (R. Abdelaziz), [email protected] (H.H. Le).

Ramadan Abdelaziz a,*, Hai Ha Le b

a Institute for Hydrogeology, TU Bergakademie Freiberg, Germany
b Institute for Geophysics and Geoinformatics, TU Bergakademie Freiberg, Germany

Article info

Article history: Received 11 February 2014; Received in revised form 2 June 2014; Accepted 10 June 2014; Available online 25 June 2014

Keywords: MT3DMS; Parallel programming; SSOR-AI; OpenMP; MT3DMS5P

Abstract

A parallelized version of the 3-D multi-species transport model MT3DMS was developed and tested. Specifically, open multiprocessing (OpenMP) was utilized for communication between the processors. MT3DMS simulates solute transport by dividing the calculation into flow and transport steps. In this article, a new preconditioner derived from Symmetric Successive Over-Relaxation (SSOR) was added to the generalized conjugate gradient solver. This preconditioner is well suited to the parallel architecture. A case study in the test field at TU Bergakademie Freiberg was used to produce the results and analyze the code performance. It was observed that most of the running time is spent on advection and dispersion. The parallel version significantly decreases the running time of solute transport modeling. In addition, this work provides a first attempt to demonstrate the capability and versatility of MT3DMS5P to simulate solute transport in fractured gneiss rock.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

In the past decade, fluid flow and solute transport in porous media and fractured aquifers have attracted the attention of many geoscientists. The hydraulic and transport properties controlling fluid flow are highly heterogeneous in fractured aquifers. Simulations are time-consuming, and the computation time depends largely on the model scale: a large model area with many grid cells requires more computation time. To accelerate the 3-D multi-species transport model, the need arose to use parallel computing. MT3DMS (Zheng and Wang, 1999) is a block-centered finite difference solver that performs the spatial and temporal discretization of the equations. In general, the flow time steps of a fluid flow model are large, and each flow time step needs to be subdivided into smaller transport steps because of stability criteria.

In this paper, an early attempt to construct and implement a highly parallel version of the MT3DMS v5.3 (Zheng, 2010) three-dimensional multi-species transport simulator is demonstrated. The aim of this work was to parallelize MT3DMS so that large-scale models can be executed in a reasonable time. Furthermore, a new preconditioning option that helps to improve the solution scheme is proposed. To the best of our knowledge, MT3DMS has never before been parallelized to simulate contaminant transport in an aquifer.

1.1. Parallel programming

Parallel programming increases the usable capability of computers. Many compilers can parallelize a program for multiprocessing and thereby accelerate the processing. One of these compilers is the Intel® Fortran compiler. The recent Intel® compiler contains OpenMP 3.1 and automatic loop parallelization. Moreover, the recent version includes features that improve the performance of the code.

Parallel programming is aimed at breaking a program down into parts that can be executed concurrently. A parallel program always uses multiple threads, so developers need to control the side effects that can occur when these threads interact with shared resources. The idea of parallel programming is not new, but developers still find it hard to implement effectively. Recently, several models have been developed to support the development of parallel programs. Most prominent among these are open multiprocessing (OpenMP) for shared memory programming (Chandra, 2001; Chapman et al., 2008; Chivers and Sleightholme, 2012; OpenMP, 2013) and the message passing interface (MPI) for distributed memory programming (MPI, 2013; OpenMPI, 2013). Although a parallel program developed with MPI can run on both shared and distributed memory architectures, programming with MPI requires more code changes to go from the serial version to the parallel version. Moreover, limitations of the communication network between compute nodes can sometimes result in poor performance of MPI parallel programs.


Fig. 1. The methodology to add parallelism to MT3DMS.

Table 1. Time consumption for different procedures within MT3DMS.

No  Procedure  Total time (%)
1   SADV5U     63
2   DSP5FM     20
3   GCG5AP     13
4   Others      4


Programming with OpenMP requires fewer code changes, i.e., it uses features of the compiler. In this study, results from parallel programming for a single system with several processors and cores are presented. The emphasis is on using OpenMP and the parallel features supported by the Intel® Fortran Compiler only (Intel, 2013).

OpenMP is a shared-memory application programming interface (API) that supports shared memory multiprocessing programming. In this context, shared memory multiprocessing means that the processes share memory in such a way that each of them can access any memory unit at the same speed, i.e., they have a uniform memory access (UMA) time. OpenMP consists of a set of compiler directives, library routines, and environment variables that determine the run-time behavior of a program. OpenMP directives can be added to a sequential program in Fortran, C, or C++ to describe how the work is to be shared among the threads that will execute the program on different processors or cores, and to order accesses to shared data as needed. The appropriate insertion of OpenMP features into a sequential program allows many, perhaps most, applications to benefit from shared-memory parallel architectures, often with minimal modification to the code. In practice, a compiler such as the Intel® Fortran Compiler can do much of the work of inserting OpenMP features into a program automatically via compiler parameters.
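As a minimal illustration (not taken from the MT3DMS source; the array name and size are arbitrary), the following Fortran sketch shows how a single OpenMP directive is enough to distribute an independent cell loop over the available threads. With the Intel® Fortran Compiler, such directives take effect when the program is built with the OpenMP option (e.g., -qopenmp in recent releases).

program omp_demo
  use omp_lib                       ! OpenMP runtime library routines
  implicit none
  integer, parameter :: ncell = 1000000
  real(8) :: conc(ncell)
  integer :: i

  ! Each iteration is independent, so the loop can be shared among threads.
  !$omp parallel do
  do i = 1, ncell
     conc(i) = 0.5d0 * dble(i)      ! illustrative cell-by-cell update
  end do
  !$omp end parallel do

  print *, 'threads available:', omp_get_max_threads(), '  conc(1) =', conc(1)
end program omp_demo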

The Intel® Fortran Compiler includes some useful features for parallel programming. A serial program can become a parallel program through a compiler option alone, without any code change. These features are auto loop parallelization, guided auto parallelism, and automatic vectorization.

The auto loop parallelization feature automatically translates serial portions of the input program into equivalent multi-threaded code. Automatic loop parallelization determines the loops that are good work-sharing candidates, performs the data-flow analysis to verify correct parallel execution, and partitions the data for threaded code generation as needed, just as in programming with OpenMP directives. Besides the auto loop parallelization feature, the Intel® Fortran compiler can read OpenMP directives inserted manually by programmers.

The automatic vectorization feature of the Intel® Fortran Compiler converts normal instructions to single instruction, multiple data (SIMD) instructions. It detects low-level operations in the program that can be executed in parallel and then converts the sequential program to process 2, 4, 8, or 16 elements in one operation, depending on the data type. OpenMP and vectorization can be combined for better performance.
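A sketch of the kind of loop the auto-vectorizer targets is given below (purely illustrative; not MT3DMS code): a unit-stride loop with no cross-iteration dependences, which the compiler can map to SIMD instructions at normal optimization levels; recent Intel compilers can report which loops were vectorized via an optimization-report option.

program vec_demo
  implicit none
  integer, parameter :: n = 1024
  real(8) :: a(n), b(n), c(n)
  integer :: i

  a = 1.0d0
  b = 2.0d0

  ! A simple stride-1 loop with no dependences: the compiler can issue
  ! SIMD instructions that process several elements per operation.
  do i = 1, n
     c(i) = a(i) + 2.0d0 * b(i)
  end do

  print *, 'c(1) =', c(1), '  c(n) =', c(n)
end program vec_demo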

2. Methodology

Adding parallelism to MT3DMS (Zheng, 2010) is an iterative procedure comprising four steps: (1) survey and locate time-consuming code and design the parallel code; (2) code and compile with optimization options; (3) check for memory/threading errors and test the program; (4) tune the code and report the effect. Fig. 1 shows the methodology used to parallelize MT3DMS.

2.1. Locate time-consuming codes and design new parallel codes

The Intel® Advisor is a tool to survey and analyze code in order to locate time-consuming code regions. In this study, the Intel® Advisor was run on the MT3DMS (Zheng, 2010) code with sample data. The time consumption of each procedure is summarized in Table 1. Running MT3DMS on a single processor requires 63% of the simulation time for the advection scheme, 20% for dispersion, 13% for the generalized conjugate gradient (GCG) solver, and 4% for other calculations.

The SADV5U procedure solves the advection term with the third-order total-variation-diminishing (TVD) scheme. The DSP5FM procedure formulates the coefficient matrix for the dispersion term using the implicit finite-difference scheme. The GCG5AP procedure finds the solution with the generalized conjugate gradient method, using one of three preconditioners (Jacobi, SSOR, or MIC) with the Lanczos/Orthomin acceleration scheme for non-symmetric matrices (Zheng, 2010).

2.1.1. Analyze the gain and the correctness of parallelization

The main loop of SADV5U runs through all cells from top to bottom, from back to front, and from left to right to calculate the concentration at the new time level, n + 1, from the old time level, n. The maximum gain of the parallelized main loop is shown in Fig. 2(a). Parallelizing this procedure without changing the code produces incorrect results because of read/write conflicts in the shared memory.
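The following fragment (illustrative only, not MT3DMS code) shows the kind of read/write conflict involved: a naive !$omp parallel do around a loop that updates a shared variable loses updates, whereas a reduction clause gives each thread a private partial result and combines them safely.

program race_demo
  implicit none
  integer, parameter :: n = 100000
  real(8) :: c(n), total
  integer :: i

  c = 1.0d0

  ! WRONG if parallelized as-is: all threads read and write 'total'
  ! concurrently, so updates can be lost (a read/write conflict).
  ! total = 0.0d0
  ! !$omp parallel do
  ! do i = 1, n
  !    total = total + c(i)
  ! end do
  ! !$omp end parallel do

  ! CORRECT: the reduction clause gives each thread a private partial
  ! sum and combines the partial sums at the end of the loop.
  total = 0.0d0
  !$omp parallel do reduction(+:total)
  do i = 1, n
     total = total + c(i)
  end do
  !$omp end parallel do

  print *, 'total =', total      ! expected: 100000.0
end program race_demo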

The DSP5FM procedure loops through all cells to calculate the coefficient matrix, i.e., matrix A and the right-hand-side vector RHS. Fig. 2(b) shows the maximum gain of parallel code for this procedure. The correctness check of Intel® Advisor (Intel, 2013) shows that this procedure can be parallelized correctly without code changes.

The GCG5AP procedure uses an iterative method to find the solution. GCG uses a preconditioning matrix to reduce the number of iterations. The time-consuming part of GCG is the calculation of the search direction by solving the equation Q⁻¹x = y in each iteration. The current preconditioning options, Symmetric Successive Over-Relaxation (SSOR) and Modified Incomplete Cholesky (MIC), are not easy to parallelize because the forward/backward substitutions make them strongly serial (Ament et al., 2010; Helfenstein and Koko, 2012). The Jacobi preconditioning option requires more iterations than the other preconditioners.
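The difficulty lies in the loop-carried dependence of the triangular solves: each unknown needs previously computed unknowns, as the following generic forward-substitution sketch makes explicit (dense storage for clarity; this is not the MT3DMS implementation).

! Forward substitution for a lower-triangular system L*x = b.
! The value x(i) needs x(1:i-1), so the outer loop carries a
! dependence and cannot simply be split among threads.
subroutine forward_subst(n, l, b, x)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: l(n, n), b(n)
  real(8), intent(out) :: x(n)
  integer :: i, j
  real(8) :: s

  do i = 1, n                 ! sequential: iteration i needs results of 1..i-1
     s = b(i)
     do j = 1, i - 1
        s = s - l(i, j) * x(j)
     end do
     x(i) = s / l(i, i)
  end do
end subroutine forward_subst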

The maximum gain of parallelizing GCG5AP, shown in Fig. 2(c), indicates that a new parallel algorithm is needed.

2.1.2. New parallel codes for the SADV5U procedure

The SADV5U procedure implements the algorithm given in the MT3DMS manual (Zheng, 2010). The new solution at time level n + 1 for cell (i, j, k) can be calculated using Eq. (1) as follows.


Fig. 2. The maximum gain of the parallelization. (a) SADV5U, (b) DSP5FM and (c) GCG5AP.


\[
\theta_{i,j,k}\,\frac{C^{n+1}_{i,j,k}-C^{n}_{i,j,k}}{\Delta t}
= -\frac{q^{x}_{i,j+\frac{1}{2},k}\,C^{n}_{i,j+\frac{1}{2},k}-q^{x}_{i,j-\frac{1}{2},k}\,C^{n}_{i,j-\frac{1}{2},k}}{\Delta x_{j}}
-\frac{q^{y}_{i+\frac{1}{2},j,k}\,C^{n}_{i+\frac{1}{2},j,k}-q^{y}_{i-\frac{1}{2},j,k}\,C^{n}_{i-\frac{1}{2},j,k}}{\Delta y_{i}}
-\frac{q^{z}_{i,j,k+\frac{1}{2}}\,C^{n}_{i,j,k+\frac{1}{2}}-q^{z}_{i,j,k-\frac{1}{2}}\,C^{n}_{i,j,k-\frac{1}{2}}}{\Delta z_{k}}
\qquad (1)
\]

According to the algorithm, the loop that calculates the concentration values at the faces k + 1/2, i + 1/2, and j + 1/2 for all k = 1 ... NLAY − 1, i = 1 ... NCOL − 1, j = 1 ... NROW − 1 at time level n can be run before the calculation of the concentration values at all cells. Therefore, the code in SADV5U has been divided into two main loops, and the first (face-value) loop has been parallelized with an OpenMP directive (i.e., !$omp parallel do), as sketched below.

2.1.3. A new preconditioner

The current version of MT3DMS contains three preconditioners: Jacobi, Modified Incomplete Cholesky, and Symmetric Successive Over-Relaxation (SSOR). Helfenstein and Koko (2012) suggested a new approximate inverse derived from the SSOR preconditioner (called SSOR-AI). This SSOR-AI preconditioner was implemented here and is described below.

Since A is a symmetric positive definite matrix, it can be decomposed as A = L + D + L^T, where D is the diagonal matrix of the diagonal elements of A and L is the strictly lower triangular part of A. Then, the SSOR preconditioner is given by

\[
Q(\omega) = \frac{1}{2-\omega}\left(\frac{1}{\omega}D + L\right)\left(\frac{1}{\omega}D\right)^{-1}\left(\frac{1}{\omega}D + L\right)^{T} \qquad (2)
\]

where ω is any real number between 0 and 2. This preconditioner is also symmetric positive definite and can be factorized as Q = KK^T. In the case of SSOR, K is defined as

\[
K = \frac{1}{\sqrt{2-\omega}}\,(E + L)\,E^{-\frac{1}{2}}
  = \frac{1}{\sqrt{2-\omega}}\,E\left(I + E^{-1}L\right)E^{-\frac{1}{2}}, \qquad (3)
\]

where 0 < ω < 2 and E = (1/ω)D. So, the inverse of K can be obtained by the following formula:

\[
K^{-1} = \sqrt{2-\omega}\;E^{\frac{1}{2}}\left(I + E^{-1}L\right)^{-1}E^{-1}. \qquad (4)
\]

Denote the spectral radius of a matrix A as ρ(A) and assume further that ρ(E^{-1}L) < 1; then a Neumann series approximation of the inverse of K is

\[
K^{-1} \approx \sqrt{2-\omega}\;E^{\frac{1}{2}}\left[I - E^{-1}L + \left(E^{-1}L\right)^{2} - \left(E^{-1}L\right)^{3} + \dots\right]E^{-1}. \qquad (5)
\]

A first-order approximate inverse of K is written as

\[
K^{-1} \approx \sqrt{2-\omega}\;E^{\frac{1}{2}}\left[I - E^{-1}L\right]E^{-1}
        = \sqrt{2-\omega}\;E^{-\frac{1}{2}}\left[I - LE^{-1}\right]. \qquad (6)
\]

Therefore, the inverse of Q is Q^{-1} = (KK^{T})^{-1} = (K^{T})^{-1}K^{-1} = (K^{-1})^{T}K^{-1}, or

\[
Q^{-1} \approx (2-\omega)\left(I - LE^{-1}\right)^{T}E^{-1}\left(I - LE^{-1}\right). \qquad (7)
\]

By substituting E = (1/ω)D into Eq. (7), the inverse of Q can be computed by Eq. (8):

\[
Q^{-1} \approx (2-\omega)\,\omega\left(I - \omega LD^{-1}\right)^{T}D^{-1}\left(I - \omega LD^{-1}\right). \qquad (8)
\]

By defining the inverse of Q directly, the equation Q⁻¹x = y can be solved by matrix–vector multiplications, which are easy to parallelize. The advantages of this new preconditioning method are that it is easy to parallelize and that it requires fewer iterations.
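A sketch of how Eq. (8) can be applied as a preconditioning step, y = Q⁻¹x, using only diagonal scalings and matrix–vector products is given below. Dense storage of the strictly lower triangle L and the diagonal d is used here for clarity; the actual MT3DMSP code works on MT3DMS's own data structures, so the names and layout are illustrative, not the shipped implementation.

! Apply the first-order SSOR-AI of Eq. (8): y = (2-w)*w*(I-wLD^-1)^T D^-1 (I-wLD^-1) x
subroutine ssor_ai_apply(n, l, d, omega, x, y)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: l(n, n)     ! strictly lower triangular part of A
  real(8), intent(in)  :: d(n)        ! diagonal of A
  real(8), intent(in)  :: omega       ! relaxation factor, 0 < omega < 2
  real(8), intent(in)  :: x(n)
  real(8), intent(out) :: y(n)
  real(8) :: t1(n), t2(n), t3(n)
  integer :: i, j
  real(8) :: s

  t1 = x / d                                   ! D^{-1} x

  !$omp parallel do private(j, s)              ! rows are independent
  do i = 1, n
     s = 0.0d0
     do j = 1, i - 1
        s = s + l(i, j) * t1(j)
     end do
     t2(i) = x(i) - omega * s                  ! (I - omega*L*D^{-1}) x
  end do
  !$omp end parallel do

  t2 = t2 / d                                  ! D^{-1} (I - omega*L*D^{-1}) x

  !$omp parallel do private(j, s)
  do i = 1, n
     s = 0.0d0
     do j = i + 1, n                           ! L^T has its entries above the diagonal
        s = s + l(j, i) * t2(j)
     end do
     t3(i) = t2(i) - omega * s / d(i)          ! (I - omega*D^{-1}*L^T) (...)
  end do
  !$omp end parallel do

  y = (2.0d0 - omega) * omega * t3
end subroutine ssor_ai_apply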

2.2. Code and compile parallel codes with optimization options

The new parallel codes of the SADV5U and GCG5AP procedures have been implemented. OpenMP directives have also been inserted into the DSP5FM procedure, and parallel programming features of the Intel® Fortran Compiler (Intel, 2013), such as automatic vectorization, were used. The codes are available as supplementary material accompanying this paper.

2.3. Check memory/threading error and test the program

Intel® Inspector (Intel, 2013) has been used to check for memory and threading errors in the parallel MT3DMS. The results of running this software confirm the correctness of the parallel processing. The output of the parallel MT3DMS is also compared with the output of the original MT3DMS to check correctness.
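The output comparison itself can be as simple as the following sketch, which assumes, purely hypothetically, that both runs have exported their concentrations as plain-text lists of values; real MT3DMS unformatted output files would need the corresponding binary reader instead.

! Illustrative correctness check: report the largest absolute difference
! between serial and parallel results (hypothetical plain-text files).
program compare_runs
  implicit none
  integer :: ios1, ios2, n
  real(8) :: c_serial, c_parallel, maxdiff

  open(unit=11, file='conc_serial.txt',   status='old', action='read')
  open(unit=12, file='conc_parallel.txt', status='old', action='read')

  maxdiff = 0.0d0
  n = 0
  do
     read(11, *, iostat=ios1) c_serial
     read(12, *, iostat=ios2) c_parallel
     if (ios1 /= 0 .or. ios2 /= 0) exit
     maxdiff = max(maxdiff, abs(c_serial - c_parallel))
     n = n + 1
  end do

  print *, 'values compared:', n, '  max |serial - parallel| =', maxdiff
  close(11); close(12)
end program compare_runs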

2.4. Tune program and report

Intel® VTune™ Amplifier (Intel, 2013) is a tool to analyze code performance and identify where and how to benefit from the available hardware resources. Fig. 3 shows that CFACE in procedure SADV5U and procedure DSP5FM make good use of the hardware resources.

3. Applicability

The Freiberg test field contains six wells that were drilled for training purposes and are each about 50 m deep.


Fig. 3. Analyze function performance – ordered by CPU time.


The upper part of the boreholes (3–5 m) is protected by a steel standpipe; the deeper part was left as an open borehole. The static water level is 6 m below the ground surface. The main fractured rock was determined by geophysical and hydraulic tests to be 11 m below the ground surface. Packer and tracer tests were conducted. Fig. 4 shows the experimental area on the TU Bergakademie Freiberg campus. The test field was used to produce the results and allowed us to test the code efficiency.

A three-dimensional finite difference scheme was used to simulate the field conditions. The model consists of five layers, with 280 cells in the x-direction and 100 cells in the y-direction on a square grid of 0.1 × 0.1 m. A total of 140,000 grid cells were active.

Fig. 4. The test site in Freiberg.

This fine grid resolution was chosen to achieve numerical accuracy for the transport equations. For simplicity, a no-flow boundary condition was specified at the north and south of the model domain, while a constant-head boundary was applied at the east and west boundaries to drive the flow from right to left. The model domain was selected to be just large enough to keep the computational cost low. Fig. 5 illustrates a 2-D plan view of the test site, including selected wells, the model grid, and the boundary conditions.

Two groundwater models were selected to simulate the flow and transport in the fractured gneiss aquifer. The MODFLOW-2005 (Harbaugh, 2005) code was used to solve the fluid flow, while the solute transport was simulated with MT3DMS (Zheng, 2010).



Fig. 5. Model domain in a two-dimensional plan view.

Fig. 6. Calculated concentrations in parallel and serial modes vs. observed concentration values.


Both codes are three-dimensional finite difference codes: MODFLOW uses a block-centered finite difference formulation, and MT3DMS was developed to work with the block-centered flow model. MODFLOW and MT3DMS are available free of charge from the U.S. Geological Survey and the University of Alabama, which also offers technical support. MODFLOW simulates both confined and unconfined flow in three-dimensional heterogeneous aquifer systems. MT3DMS is a three-dimensional model that can simulate multi-species transport of solutes undergoing advection, dispersion/diffusion, and chemical reactions under general, spatially distributed hydrogeological conditions. Both codes can accommodate several types of time-dependent boundary conditions and forcing terms, and both have modular structures that allow the user to modify and interface them for any particular application.

MODFLOW generated the flow field, which is one of the main inputs for MT3DMS. The solute transport in the flow field was simulated using MT3DMS version 5.3, the most recent version available. MT3DMS is a three-dimensional multi-species solute transport model (Zheng, 2006). Instead of applying a single (effective) porosity, MT3DMS was applied with a double porosity for simulating the solute transport. The solution is obtained by coupling MODFLOW (Harbaugh, 2005) and MT3DMS (Zheng and Wang, 1999; Zheng, 2010). The double-porosity approach is more suitable than the traditional single-porosity approach for modeling solute transport in fractured aquifers because of the complexity and heterogeneity of the fracture structure.

The generalized conjugate gradient solver with the upstream third-order total-variation-diminishing (TVD) scheme for the advection term was selected. The TVD (ULTIMATE) scheme was applied to solve the advection term because it reduces artificial oscillation. A double-porosity advection–dispersion model was used to simulate solute transport, which is more realistic than a single-porosity advection–dispersion model. The TVD criterion was used to control the time-step size, with a total simulation time of 3.5 days. There were 100 stress periods in the flow model, and each flow time step was divided into several transport steps according to the stability criteria. The fluid flow and transport parameters, such as hydraulic conductivity and porosity, were obtained from Abdelaziz et al. (2013), Abdelaziz and Merkel (2012), and Abdelaziz and Zambrano-Bigiarini (2014).

Fig. 6 shows the concentration over time for the serial and parallel implementations. The calculated breakthrough curve displays a good fit with the observed concentrations.

The code runs on an Enermax desktop computer with an Intel® Core™ i7-875K processor:

- Processor number: i7-875K
- Number of cores: 4
- Number of threads: 8
- Clock speed: 2.93 GHz

The simulation time was reduced from 7.45 min to 3.5 min, a speedup of about 2.1. With parallelized advection and dispersion and the SSOR-AI preconditioner applied, the parallel processing was clearly more efficient. The model results are compared to the observed concentrations and show an acceptable match.

4. Conclusions

A new solute transport modeling code, MT3DMSP, has been developed and tested. MT3DMSP was written in Fortran, like the previous version. This study indicates that the parallelized code MT3DMSP is more efficient than the previous version. The maximum number of processors in our computer limited the achievable reduction in computation time, but we expect that a higher number of processors will further reduce the computation time required. This is particularly advantageous when calibrating a large and complex model.

In summary:



- The original code does not give the best possible performance because it still works in serial mode, while most computers nowadays are multi-core.
- On our test computer, MT3DMSP saves up to 60% of the running time in comparison with MT3DMS.
- The new SSOR-AI + CG solver is fast compared to other solvers (e.g., Modified Incomplete Cholesky (MIC)).

Note that the redistributable library package for Intel 64/32 should be installed on the operating system before running the new MT3DMS5P. It is available free of charge from the Intel website.

Acknowledgments

The first author wishes to thank Prof. Broder J. Merkel for technical support. In addition, we want to thank the anonymous reviewers and the editors for their valuable comments and suggestions. This work has been supported by the Department of Hydrogeology at TU Freiberg. Many thanks to Prof. Dr. Helmut Schaeben, Prof. Dr. Heinrich Jasper, and Dr. Ines Görz for their help and advice to the second author.

References

Abdelaziz, R., Merkel, B.J., 2012. Analytical and numerical modeling of flow in a fractured gneiss aquifer. J. Water Resour. Protect. 4.

Abdelaziz, R., Zambrano-Bigiarini, M., 2014. Particle swarm optimization for inverse modeling of solute transport in fractured gneiss aquifer. J. Contam. Hydrol. http://dx.doi.org/10.1016/j.jconhyd.2014.06.003.

Abdelaziz, R., Pearson, A.J., Merkel, B.J., 2013. Lattice Boltzmann modeling for tracer test analysis in a fractured gneiss aquifer. Natur. Sci. 5.

Ament, M., Knittel, G., Weiskopf, D., Strasser, W., 2010. A parallel preconditioned conjugate gradient solver for the Poisson problem on a multi-GPU platform. In: 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 583–592.

Chandra, R., 2001. Parallel Programming in OpenMP. Morgan Kaufmann.

Chapman, B., Jost, G., van der Pas, R., 2008. Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press.

Chivers, I.D., Sleightholme, J., 2012. Introduction to Programming with Fortran: With Coverage of Fortran 90, 95, 2003, 2008 and 77. Springer.

Harbaugh, A.W., 2005. MODFLOW-2005, the U.S. Geological Survey Modular Ground-Water Model: The Ground-Water Flow Process. US Department of the Interior, US Geological Survey.

Helfenstein, R., Koko, J., 2012. Parallel preconditioned conjugate gradient algorithm on GPU. J. Comput. Appl. Math. 236, 3584–3590. http://dx.doi.org/10.1016/j.cam.2011.04.025.

Intel, 2013. Intel® Parallel Studio XE Suites (accessed 03.11.13).

MPI, 2013. The Message Passing Interface (MPI) standard, <http://www.mcs.anl.gov/research/projects/mpi/> (accessed 28.10.13).

OpenMP, 2013. The OpenMP API specification for parallel programming, <http://openmp.org/wp/> (accessed 06.09.13).

OpenMPI, 2013. Open MPI: Open Source High Performance Computing, <http://www.open-mpi.de/> (accessed 28.10.13).

Zheng, C., 2006. MT3DMS v5.3: A Modular Three-Dimensional Multispecies Transport Model for Simulation of Advection, Dispersion and Chemical Reactions of Contaminants in Groundwater Systems. Supplemental User's Guide.

Zheng, C., 2010. MT3DMS v5.3: Supplemental User's Guide.

Zheng, C., Wang, P.P., 1999. MT3DMS: A Modular Three-Dimensional Multi-Species Transport Model for Simulation of Advection, Dispersion, and Chemical Reactions of Contaminants in Ground-Water Systems: Documentation and User's Guide.

