TM5 manual for parallel version
by Wouter Peters

July 2003

Contents

1 Introduction
2 Model structure
3 Data structure
  3.1 book-keeping
  3.2 data declarations
  3.3 short-lived tracers
4 Flow of the model
5 Frequently used routines
  5.1 Sources and Sinks
  5.2 Chemistry and photolysis
  5.3 Budgets
6 Input/Output
7 MPI modules
  7.1 mpi_constants.f90
  7.2 mpi_communication.f90
  7.3 swap_all_mass.f90
8 FAQ and common tasks
  8.1 Running the model
  8.2 Adding a transported tracer
  8.3 Adding a short lived tracer
  8.4 Adding a new emission field
  8.5 Viewing the contents of an array
  8.6 Adding a reaction
  8.7 Debugging
  8.8 Getting model performance data
9 Model performance tests
10 More resources


1 Introduction

Recent developments in high-performance computing have led to a huge increase in the number of multi-processor platforms. These machines range from dual-processor PCs to massively parallel systems with as many as 1000 processors. Somewhere in that range are also the currently popular clusters of PCs, usually running under Linux. All of these platforms have different architectures and therefore require a different approach in terms of software development to perform optimally. Several parallel computing software protocols therefore exist.

These protocols usually take advantage of one specific characteristic of a platform, such as the ability to access memory quickly, to share memory between processors, or to transfer data between processors easily. The choice of protocol is thus directly related to the architecture of the platform that the software will execute on.

An easy way to exploit the availability of several processors is through the OpenMP protocol. Under OpenMP, directives are put into the source code indicating to the compiler where to share the workload over processors. This is usually done in large loops where operations are performed independently of previous calculations in the same loop, so that the work can easily be divided. Communication occurs before the loop, to distribute data, and after the loop, to gather the results in the central, shared memory of all the processors. This method works best if processors share a common memory that can be accessed fast, and if communication from processors to memory is fast. Disadvantages of this method are the delay in distributing and gathering data into the (large!) central memory, and the limited scaling that can be reached by only parallelizing at the loop level.
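
As an illustration of the directive style (a sketch only, not TM5 code; the subroutine, loop, and variable names are invented):

! Hypothetical example of loop-level OpenMP parallelism (not TM5 code): the
! compiler distributes the independent loop iterations over the available threads.
subroutine scale_field(field,factor,n)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: factor
  real, intent(inout) :: field(n)
  integer :: i
!$OMP PARALLEL DO PRIVATE(i) SHARED(field,factor)
  do i=1,n
     field(i) = factor*field(i)   ! iterations are independent, so the loop can be shared
  enddo
!$OMP END PARALLEL DO
end subroutine scale_field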

An alternative to OpenMP is the Message Passing Interface (MPI). Unlike the 'automatic' parallelization of OpenMP, MPI requires the programmer to implement the data structure and the processor-to-processor communication. MPI is especially well suited for architectures with distributed memory and large amounts of cache memory per processor. With a well-considered implementation, MPI programs can scale extremely well and reach high speed-up factors. This is because the workload can be balanced over the processors much better, and each processor can work on its own private (fast!) memory instead of on a central shared memory.

For TM5, MPI was chosen as the protocol. Despite the larger investment in code development, previous experience with TM3 had shown MPI to work well for this type of problem. The main platform for TM5 (an SGI-3800) possesses the distributed memory, large cache memory per processor, and fast communication between processors that make MPI viable. Moreover, the distribution of memory over processors can be an extra advantage, since TM5 is a very memory-intensive application.

2 Model structure

The aim of the parallelization project was to change the original TM5 source code as little as possible. The new data structure in TM5 is extremely well suited for this. Nevertheless, changes (mostly additions) had to be made to the code.


For TM5 parallel, the choice was made to run the code parallel over tracers (good speed-up in transport), as well as over vertical layers (good speed-up in chemistry). Thus, each processor handles one or more tracers during transport (and sources+sinks), and that same processor handles one or more vertical layers during chemistry. Between these processes, data are swapped between processors. Several routines have been developed to take care of this data swap (see 'mpi_communication.f90', section 7.2).

A fair amount of book-keeping is necessary to track the status of the model for each region and each process. For example, region 3 can be working on chemistry and thus be parallel over levels, while it will return to region 2 afterward to resume x-advection, which is parallel over tracers. The data structure has been implemented such that data are only swapped (remember: communication is relatively expensive in time!) when necessary. This can be either when changing from a tracer-parallel process to a level-parallel process, or when communication between regions occurs (update_parent for instance) and data need to be in the same domain.

The current status of the model is determined by the active communicator. A communicator is a concept from MPI that denotes a group of processors sharing information. The main communicator is called mpi_comm_world; it contains all available processors and is always active. In TM5 parallel, two additional communicators exist: com_trac, the communicator that contains all processors assigned to handle tracers, and com_lev, the communicator that contains all processors assigned to handle levels. A processor can be part of either or both communicators, depending on the number of tracers and levels. For example, in a run with 5 processors, 25 levels, and 3 tracers, com_lev will contain all processors (each with 5 levels), while com_trac will only contain processors 0, 1, and 2. Thus, processors 3 and 4 wait idle whenever a process parallel over tracers is executed.
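
The actual group set-up is done in subroutine initialize_domains (see 'mpi_communication.f90', section 7.2). Purely as an illustration of the concept (a sketch under the assumption that ntracetloc is already known; not the TM5 implementation), a tracer communicator could be derived from mpi_comm_world like this:

! Sketch only: derive a communicator containing the PE's that hold tracers.
! TM5 builds com_trac and com_lev in subroutine initialize_domains, which may differ.
subroutine make_tracer_comm(ntracetloc,myid,com_trac,myid_t)
  implicit none
  include 'mpif.h'
  integer, intent(in)  :: ntracetloc, myid
  integer, intent(out) :: com_trac, myid_t
  integer :: color, ierr

  color = MPI_UNDEFINED
  if (ntracetloc > 0) color = 1     ! only PE's that actually hold tracers join com_trac
  call mpi_comm_split(mpi_comm_world,color,myid,com_trac,ierr)

  myid_t = 999                      ! convention used in TM5: 999 = not part of this domain
  if (com_trac /= MPI_COMM_NULL) call mpi_comm_rank(com_trac,myid_t,ierr)
end subroutine make_tracer_comm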

The status of the active communicator is contained in the parameter which_par (see 'mpi_constants.f90', section 7.1) and is changed by the subroutine switch_domain ('mpi_communication.f90', section 7.2). The status of the data for each region is contained in the parameter previous_par(nregions). By comparing (subroutine check_domain) which_par and previous_par(region), the model determines whether to switch communicators (switch_domain) and whether to switch data (swap_all_mass).

Since these checks and swaps are implemented in the subroutines tracer/start/do_steps, most users will not have to deal with these issues. Data are guaranteed to reside on the right domain inside each subroutine such as source1, trace0, and ebi. However, it is important for developers to be aware which communicator is active inside each subroutine they work on, as this determines the location in memory of the data. Generally, the subroutines can be subdivided into four groups: those parallel over tracers, over levels, over neither, or over both. The first two categories are obvious, while the third and fourth are not. A subroutine that is not parallel is usually something simple that is easier just to execute on each processor than to run on a limited number of processors and broadcast the results. An example is subroutine calc_dxy11, which was simply left unchanged. A few routines are parallel over both levels and tracers, since they are called very often at different locations in the code. Restricting these to one domain would cause a huge number of (expensive!) data swaps. Examples are subroutines update_parent, put_xedges, and accumulate_mmix. A complete list of subroutines and their parallel scope is included in Table 1.


3 Data structure

In order to cope with the dual modes of operation in the model, adjustments were made to the data structure. These can be separated into adjustments made to do the necessary book-keeping, and those made to hold extra or different data.

3.1 book-keeping

Subroutine mpi_constants initializes a number of arrays used in the book-keeping, which are important to model developers. To address individual processors, each processor is numbered from 0 to n-1, and this value is contained in myid. Besides this number in the global scope, each processor has a separate number myid_t with which it is referenced in the tracer domain, and myid_k in the levels domain. These numbers can be, and usually will be, the same. However, processors that do not take part in a domain hold the number 999 here. Output that describes the status of each processor and the values of the book-keeping arrays is given at the beginning of each run and should be consulted if strange results are produced by the model (see 'Debugging', section 8.7).

The number of tracers handled by each processor is known on all processors through the array ntracet_ar, and similarly the number of levels through lmar. Besides being frequently used in the communication routines, these arrays are sometimes useful to calculate offsets (see the next section).
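
For example (a sketch; it assumes myid, ntracet_ar, and lmar from mpi_constants, and follows the same sum construct used in trace1 and the budget code):

! Sketch: global offsets of the first tracer and first level owned by this PE,
! computed from the book-keeping arrays.
integer :: offsetn, offsetl

offsetn = sum(ntracet_ar(0:myid-1))   ! tracers held by all PE's with a lower rank
offsetl = sum(lmar(0:myid-1))         ! levels held by all PE's with a lower rank
! the global number of local tracer n is then offsetn+n,
! and the global number of local level l is offsetl+l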

Finally, three arrays are used in sources_sinks to determine which processor executes which part of the code. Firstly, tracer_active (logical) has dimension ntrace and tells each processor whether a certain tracer number is active on that processor or not. This allows statements such as:

if(tracer_active(io3)) then
   ...
endif

Secondly, the array tracer_loc translates global tracer numbers to local tracer numbers:

rm(i,j,l_level,tracer_loc(ico)) = rm(i,j,l_level,tracer_loc(ico)) + x

And thirdly, proc_tracer holds, for each tracer number, the number of the processor assigned to handle it. This is useful in communication:

call scatter_after_read_k(rmt,im(region),jm(region),lm(region),0,0,0,1,&
                          rmct,proc_tracer(inox))

which tells the processor that holds inox to scatter data to all others (from the array rmt with dimensions im,jm,lm,0 to the array rmct).


3.2 data declarations

Most importantly, arrays used throughout the model (e.g. m, rm, rxm, rym, rzm) are now declared twice: once on the tracer domain, and once on the levels domain. The letters t (tracers) and k (levels) were appended to separate these two arrays. Thus, a declaration in global data now looks like:

imr=im(reg) ; jmr=jm(reg) ; lmr= lmloc ; nt = ntracet
allocate ( mass_dat(reg)%rm_k(-1:imr+2,-1:jmr+2,lmr,nt) )
imr=im(reg) ; jmr=jm(reg) ; lmr= lm(reg) ; nt = ntracetloc
allocate ( mass_dat(reg)%rm_t(-1:imr+2,-1:jmr+2,lmr,nt) )

where lmloc and ntracetloc are the numbers of levels and tracers handled by each processor. In the code, data can only be on one domain at a time: when the data are on the tracer domain, m_k, rm_k, etc. contain the value 0.0 everywhere! This ensures that no operations are done on these arrays, as their results would be overwritten when a swap occurs. As an extra safety measure, there is an option to deallocate the arrays that are not used, and to re-allocate them as soon as a swap occurs. This is set through the parameter allocate_mass in mpi_constants. Note, however, that this takes extra time and resources and is set to false by default.

At the start of each subroutine, pointers are set to point to the correct arrays. In the case of subroutine trace1, which is parallel over tracers:

!-----------------------------------------------------------------------
subroutine trace1
!-----------------------------------------------------------------------
!...... declarations ......
if(ntracetloc == 0) return
do region=1,nregions
   call check_domain(region,'n','tracer')   !WP! data must be on tracers for trace1
   m   => mass_dat(region)%m_t
   rm  => mass_dat(region)%rm_t
   rxm => mass_dat(region)%rxm_t
   rym => mass_dat(region)%rym_t
   rzm => mass_dat(region)%rzm_t
   do n=1,ntracetloc
      offsetn = sum(ntracet_ar(0:myid-1))
      rm(1:im(region),1:jm(region),1:lm(region),n) = &
           1e-30*m(1:im(region),1:jm(region),1:lm(region))/fscale(offsetn+n)
      if(adv_scheme.eq.'slope') then
         rxm(1:im(region),1:jm(region),1:lm(region),n) = 0.0
         rym(1:im(region),1:jm(region),1:lm(region),n) = 0.0
         rzm(1:im(region),1:jm(region),1:lm(region),n) = 0.0
      endif
   enddo
   nullify(m)
   nullify(rm)
   nullify(rxm)
   nullify(rym)
   nullify(rzm)
enddo
print *,' rm initialized at mixing ratio of 1e-30'
do region = nregions,2,-1
   call update_parent(region)
enddo
call print_totalmass(1)

end subroutine trace1

Processors that are not participating in the subroutine are excluded on the first line. The call to check_domain here is an extra safety measure. In the rest of the subroutine, m, rm, etc. can be referenced by their common names, and the code is almost identical to the non-parallel version. However, the programmer has to take care when looping over tracers or levels: the loops have to keep to the bounds of the local data (ntracetloc in this case). Finally, it is possible to address data contained in an array that is not parallel. For instance, fscale, which holds scaling factors for each tracer, is declared simply in sequence. To have each processor address the correct tracer in that array, we calculate offsetn, which is the number of tracers handled by all processors with a lower rank than the one calling (0:myid-1). By adding the local number of the tracer handled by myid, we end up referencing the correct tracer number in fscale (see trace1).
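
A small, self-contained numerical illustration (the numbers are invented; the actual distribution depends on the run configuration):

program offset_example
  ! Hypothetical distribution: 3 PE's with ntracet_ar = (/4,4,3/); we look at myid = 1.
  implicit none
  integer, parameter :: ntracet_ar(0:2) = (/ 4, 4, 3 /)
  integer, parameter :: myid = 1
  integer :: offsetn, n

  offsetn = sum(ntracet_ar(0:myid-1))   ! tracers on PE 0, so offsetn = 4
  n = 2                                 ! second local tracer on this PE
  print *, 'global tracer number =', offsetn+n   ! prints 6, so use fscale(offsetn+n)
end program offset_example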

3.3 short-lived tracers

A second important change to the data structure is the separation of the array rm into two arrays, one of type mass_dat and one of type chem_dat:

(1) imr=im(reg) ; jmr=jm(reg) ; lmr= lmloc ; nt = ntracet
    allocate ( mass_dat(reg)%rm_k(-1:imr+2,-1:jmr+2,lmr,nt) )

(2) imr = im(region) ; jmr = jm(region) ; lmr = lmloc
    allocate ( chem_dat(region)%rm_k(imr,jmr,lmr,ntracet+1:ntracet+ntrace_chem+3) )

The parameter ntrace_chem now holds the number of non-transported tracers. There is no _t version of chem_dat%rm_k, since these tracers are only used in chemistry, which is parallel over levels. The fourth dimension here runs from ntracet+1 to ntracet+ntrace_chem, so that the data can still be addressed consecutively:

if(n<=ntracet) then
   y(i,j,n) = rm(i,j,level,n)/ye(i,j,iairm)*ye(i,j,iairn)*fscale(n)    !kg ----> #/cm3
else
   y(i,j,n) = rmc(i,j,level,n)/ye(i,j,iairm)*ye(i,j,iairn)*fscale(n)   !kg ----> #/cm3
endif

4 Flow of the model

Every model run starts with setting up the MPI environment, numbering the processors in each domain, and assigning levels and tracers to processors. After printing diagnostic output for this, the model continues with the regular subroutine start. Once the main loop has been reached, each time step proceeds according to the diagram in Figure 1.

Figure 1: model flow. (Legend: update parent, no swap needed; update parent, swap of parent to levels needed; write BC before advection, no swap needed; write BC before advection, swap of child to tracers needed; C = chemistry, with swap to levels needed; S = sources, with swap to tracers needed.)

The moments where special provisions were made in the model to account for the parallel computations have been colored in the diagram. Red colors signify a swap of data (m, rm, rxm, rym, rzm) from tracers to levels, while green colors show the opposite. Obviously, data have to be swapped many times due to the staggered treatment of regions, and because communication between parents and children can only occur when both are in the same domain. Since the put_edges and update_parent routines are not tied specifically to tracers or levels but can handle both, 6 data swaps were avoided. Also note that the swap halfway through the time step (before the double update_parent) could be avoided if region 3 swapped its data back to the tracer domain after chemistry. This would also render the following two swaps before the writing of the boundary conditions obsolete, and could thus save additional time.

5 Frequently used routines

5.1 Sources and Sinks

The module sources_sinks has undergone quite a serious expansion of its source code. Many additions were made while the structure was left mostly intact. The routines source1 and source2, which are called from mainZoom when the process 's' is executed, are both parallel over tracers. Thus, each processor can read the files necessary to update the tracers assigned to it, and then proceed to add or remove tracer mass and update budgets.

To assign tasks to processors, the array tracer_active is used:

if(tracer_active(idms)) then
   allocate(dms_land(region)%surf(imr,jmr))
   call readtm3hdf('DMSland.hdf',rank2,nlon360,nlat180,level1,idate(2)-1,field2d,'spec1')
   call msg_emis(amonth,'vegetation/soil','DMS',xms,sum(field2d))
   call coarsen_emission('dms_land',nlon360,nlat180,field2d,dms_land,add_field)
   do region=1,nregions
      call do_add_2d(region,idms,level1,dms_land(region)%surf,xmdms,xms)
   enddo
endif !tracer_active

Note that there is no need to distribute any data to other processors, and that coarsen_emission and do_add_2d can proceed for each processor (and thus tracer) individually.

If an emission field is used for several tracers that are not on the same processor (e.g. ozone destruction rates used by io3 and io3s, or bmbcycle used for all pyrogenic tracers), it is often easiest to assign the needed field on all processors and let the root read and distribute the field. The alternative would be to have communication only between the processors that actually need the field, but this requires more programming, cannot be handled by the available routines, and would save memory but probably not time.

The short-lived tracers, which now only exist in chemistry (see 'short-lived tracers', section 3.3), are allocated in subroutine trace0 and initialized in the new subroutine init_short. Emissions for short-lived tracers are added through the new subroutine do_add_2d_short. Currently, this only occurs for tracer imgly. De-allocation of all the arrays used in sources_sinks and chemistry occurs in subroutine trace_after_read.

For specific examples of how to add tracers or emissions to sources_sinks, see 'FAQ and common tasks' (section 8).


5.2 Chemistry and photolysis

Chemistry in the non-parallel version of TM5 was called for each gridbox separately by looping over the vertical levels. This concept has not changed, but the scope of the vertical loop was changed to run from 1 to lmloc instead of from 1 to lm.
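
Schematically, and not as literal TM5 code (the loop body is omitted here for illustration):

! Sketch of the changed loop bounds in the chemistry driver; each PE now only
! treats the vertical levels assigned to it.
do level=1,lmloc              ! non-parallel model: do level=1,lm(region)
   ! ... fill the local fields for this level and call the chemistry solver ...
enddo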

Photolysis rate calculations are initialized in trace1 when the model is parallel over tracers. The rates per level and time step are retrieved in subroutine chemie, which is parallel over levels (but does not depend on levels).

5.3 Budgets

Since budgets are kept in processes that can be parallel in either domain, the updating of budgets can be parallel over either tracers or levels. This means that no particular data structure (_t or _k) was chosen for the budget arrays:

real, dimension(nbudg,nbud_vg,ntracet) :: budconvg
real, dimension(nbudg,nbud_vg,ntracet) :: budconvg_all

budconvg_all is used when finalizing the budgets, when the budgets accumulated on all processors are added. Since each processor only has values in the locations it was assigned to (and zeroes everywhere else), the final budconvg_all will contain the full budget of all tracers on all processors.
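
A minimal sketch of how such a summation over processors could be done (illustration only; the actual TM5 budget finalization code may be organized differently):

! Sketch: add the partial budgets of all PE's into budconvg_all on every PE.
! Because each PE holds zeroes outside its own tracers or levels, a plain sum
! reproduces the complete budget.
call mpi_allreduce(budconvg, budconvg_all, nbudg*nbud_vg*ntracet, &
                   my_real, mpi_sum, mpi_comm_world, ierr)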

Accumulating budgets starts with setting up the parameters that determine the offset per processor over levels or tracers, the bounds of the tracer and level loops, and the communicators:

which_par = previous_par(region)

if(which_par=='tracer'.and.ntracetloc==0) return
if(which_par=='levels'.and.lmloc==0) return   !WP!

IF(which_par=='tracer') THEN

   rm => mass_dat(region)%rm_t
   lmr = lm(region)
   nt = ntracetloc
   communicator = com_trac              !WP! assign com_trac as communicator
   root_id = root_t
   offsetn = sum(ntracet_ar(0:myid-1))  ! offset for global value of n
   offsetl = 0                          ! no offset for levels

ELSEIF(which_par=='levels') THEN

   rm => mass_dat(region)%rm_k
   lmr = lmloc
   nt = ntracet
   communicator = com_lev               !WP! assign com_lev as communicator
   root_id = root_k
   offsetl = sum(lmar(0:myid-1))        ! offset for global value of l
   offsetn = 0                          ! no offset for tracers

ENDIF

After that, the calculations (budini, budconc, etc.) can be done regardless of the domain. Output of the HDF files with the budgets is handled by one processor and done on the parallel tracer domain. Calls to update budgets in sources_sinks are thus handled as:

budemi(nzone,nzone_v,i_tracer)=budemi(nzone,nzone_v,i_tracer)+x/mol_mass*1e3 !mole

Note that when budgets are updated, one needs to account for the offsets of tracers or levels, depending on the current parallel domain.
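
A hedged sketch of such an update in the tracer-parallel case (illustration only; n is the local tracer number and offsetn follows the same construct shown in the budget code above):

! Sketch: update a budget entry with the proper global tracer index while
! running parallel over tracers.
offsetn = sum(ntracet_ar(0:myid-1))
budemi(nzone,nzone_v,offsetn+n) = budemi(nzone,nzone_v,offsetn+n) + x/mol_mass*1e3   ! mole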

6 Input/Output

To handle the I/O of the many HDF files used by the model, relatively few changes had to be made. Most importantly, each subroutine where input is read now starts with a statement that explicitly states which processor is assigned to do the reading. If the array to be read is going to be distributed among the processors later, the processor root (or root_t, root_k) is designated. After the proper file is read, the data are either:

1. broadcasted by root to all others (within the global, tracer, or levels domain)

2. scattered by the designated processor to all others (within the global, tracer, or levels domain)

3. not distributed but used only locally by the designated processor

Examples of each of these can be found throughout the source code; copying and pasting existing calls is usually the most fail-safe method.
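
As an illustration of pattern (1), a sketch only (the file name and field are invented, and the readtm3hdf call merely mimics the form used elsewhere in sources_sinks):

! Sketch of pattern (1): one designated PE reads a 2d field, then broadcasts it.
if (myid == root) then
   call readtm3hdf('some_input.hdf',rank2,nlon360,nlat180,level1,1,field2d,'spec1')
endif
call mpi_bcast(field2d, nlon360*nlat180, my_real, root, mpi_comm_world, ierr)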

Output of the model can be separated into three groups:

1. output to the screen for diagnostics,

2. output to text files,

3. and output of fields to HDF files.

(1) Output to the screen can still be obtained with simple print statements. However, each processor will execute the print statement separately, and many values will be given. Moreover, since each processor is working on a different part of the code, the output on the screen does not necessarily reflect the order in which statements were executed according to the flow of the model. To ensure proper output, use statements such as:


call barrier_t
print*,myid,'value of nsrce ', nsrce
call barrier_t

or

if(tracer_active(io3)) print*,myid,'value of z ', z
call barrier_t

If a sum of values is needed, or a specific value from a distributed array, other means can help. For instance, mpi_reduce statements can find minima and maxima over multiple processors, and add values of distributed arrays. See the code and 'Debugging' (section 8.7) for examples.
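
For instance (a sketch; the array z and the variables zmax_local and zmax_global are invented names):

! Sketch: find the global maximum of a distributed array and print it once.
real :: zmax_local, zmax_global

zmax_local = maxval(z)
call mpi_reduce(zmax_local, zmax_global, 1, my_real, mpi_max, root, mpi_comm_world, ierr)
if (myid == root) print *, 'global maximum of z: ', zmax_global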

(2) Output to text files is in principle not encouraged in TM5, but it is possible. The user needs to ensure that one processor is designated to open, write, and close the text file, and that it has access to all the needed data. If not, each processor will open a file separately (with different units) and write only the part of the data it owns.

(3) For output to HDF files, several routines are available. For one-time output there are parallel copies of dump_field (2d, 3d, 4d) which will gather data from the current domain and write them to a specified file. Output of mixing-ratio and budget files works along similar principles. Each of these routines uses the gather_tracer (_t or _k) routine to accumulate data on one processor before writing. See the modules tm5_io.f90 and sources_sinks.f90 for examples of I/O, and 'Viewing the contents of an array' (section 8.5) for more information.

7 MPI modules

7.1 mpi constants.f90

Similar to dimension.f90 for the non-parallel model, mpi_constants is used to set the global parameters needed to successfully run TM5 parallel. The parameter npes is compared against the number of actually available processors, and the model exits when these are not the same. See 'Model structure' (section 2) for more information.

! this module declares the values needed in MPI communications
! WP january 2003
!
module mpi_constants
  USE dims
  implicit none
  include 'mpif.h'

  integer,parameter :: npes = 1   ! number of PE's at compilation time

  integer :: my_real          ! platform dependent reference to real values for MPI (from mpif.h)
  integer :: myid             ! PE number in mpi_comm_world
  integer :: nprocs           ! number of PE's
  integer :: pe_first_tracer  ! lowest myid involved in processes over tracers
  integer :: pe_first_l       ! lowest myid involved in processes over levels
  integer :: ierr             ! return status of MPI routine calls
  integer :: com_trac         ! communicator with only PE's having nonzero ntracetloc
  integer :: com_lev          ! communicator with only PE's having nonzero lmloc
  integer :: myid_t           ! PE number in com_trac (can differ from mpi_comm_world!)
  integer :: myid_k           ! PE number in com_lev
  integer :: root             ! myid of root in mpi_comm_world
  integer :: root_k           ! myid of root in com_lev
  integer :: root_t           ! myid of root in com_trac
  character(len=6) :: which_par                         ! either 'levels' or 'tracer'
  character(len=6),dimension(nregions) :: previous_par  ! remembers previous parallel regime
  integer :: lmar(0:npes-1)        ! number of levels assigned to each PE
  integer :: ntracet_ar(0:npes-1)  ! number of transported tracers assigned to each PE
  integer :: lmloc                 ! number of levels at this PE
  integer :: ntracetloc            ! number of transported tracers at this PE
  logical,dimension(ntracet) :: tracer_active  ! tracer is on this processor (true) or not (false)
  integer,dimension(ntracet) :: tracer_loc     ! translates global to local tracer number
  integer,dimension(ntracet) :: proc_tracer    ! myid of the PE that handles each tracer
  logical,parameter :: allocate_mass=.false.   ! allocate and deallocate mass after each swap

end module mpi_constants

7.2 mpi communication.f90

Module mpi_communication holds the specific routines added to handle parallel communication. It is used mostly to control the parallel flow and check the model's progress. It also contains routines to scatter and gather data from one processor to all others and from all processors to one. Note that the actual swapping of mass is done through the subroutines defined in swap_all_mass.f90.

! this module holds all specific routines needed to run MPI
! WP january 2003
!
module mpi_comm
  use mpi_constants
  implicit none

contains

  subroutine startmpi                 ! necessary calls to start mpi programs
  subroutine stopmpi                  ! finish mpi and stop the program
  subroutine initialize_domains       ! set up the tracer domain and the levels domain
  subroutine determine_lmar           ! give each processor a 'myid' in the domain
  subroutine determine_first          ! communicate processor numbers
  subroutine determine_tracer_active  ! communicate tracer numbers

  subroutine set_domain               ! switches between tracers and levels domain on request
  subroutine check_domain             ! checks which domain is active and switches if necessary
  subroutine barrier                  ! let all PE's wait for each other
  subroutine barrier_t                ! let all PE's in the tracer domain wait for each other
  subroutine barrier_k                ! let all PE's in the levels domain wait for each other
  subroutine gather_tracer_t          ! gather all data from the tracer domain on processor root (mpi_gatherv)
  subroutine gather_tracer_k          ! gather all data from the levels domain on processor root (mpi_gatherv)
  subroutine scatter_after_read_k     ! scatter data to all processors in the levels domain (mpi_scatterv)
  subroutine scatter_after_read_t     ! scatter data to all processors in the tracer domain (mpi_scatterv)

end module mpi_comm

7.3 swap all mass.f90

subroutine swap_all_mass(region,iaction)
  !WP! swaps the arrays m, rm, rxm, rym, and rzm from tracers to levels (iaction=0)
  !WP! or from levels to tracers (iaction=1) for the requested region
  !WP! after swapping, the old data are set to zero to avoid mistakes with 'old' data
  !WP! 23 december 2002

  use global_data,   only : mass_dat
  use mpi_comm
  use mpi_constants, only : myid,root,root_t,root_k,mpi_comm_world,ierr,my_real
  implicit none

  !______________________________I/O________________________________________

  integer,intent(in) :: iaction,region

  !______________________________local______________________________________

  integer :: imr,jmr,lmr
  real,dimension(:,:,:),pointer   :: msource
  real,dimension(:,:,:),pointer   :: mtarget
  real,dimension(:,:,:,:),pointer :: rxmsource
  real,dimension(:,:,:,:),pointer :: rxmtarget
  integer :: nsend,xx

  !______________________________start______________________________________

  imr=im(region)
  jmr=jm(region)
  lmr=lm(region)

  do xx=1,4   !WP! once for rm and once for each of the x, y, and z slopes
     select case(xx)
     case(1)
        rxmsource => mass_dat(region)%rm_t
        rxmtarget => mass_dat(region)%rm_k
     case(2)
        rxmsource => mass_dat(region)%rxm_t
        rxmtarget => mass_dat(region)%rxm_k
     case(3)
        rxmsource => mass_dat(region)%rym_t
        rxmtarget => mass_dat(region)%rym_k
     case(4)
        rxmsource => mass_dat(region)%rzm_t
        rxmtarget => mass_dat(region)%rzm_k
     end select

     if(iaction.eq.0) then

        call tracer_to_k(rxmtarget,         & ! tracer_to_k first receives the target (rm_k)
                         rxmsource,         &
                         imr,jmr,lmr,       & ! x,y,z-dimensions
                         2,2,0,             & ! nr of halo cells in each direction
                         ntracet,ntracetloc,& ! global and local number of tracers
                         ntracet_ar)          ! nr of tracers on other PE's, needed for offset
        call barrier

     elseif(iaction.eq.1) then

        call k_to_tracer(rxmtarget,         & ! k_to_tracer first receives the source (rm_k)
                         rxmsource,         &
                         imr,jmr,lmr,       & ! x,y,z-dimensions
                         2,2,0,             & ! nr of halo cells in each direction
                         ntracet,ntracetloc,& ! global and local number of tracers
                         ntracet_ar)          ! nr of tracers on other PE's, needed for offset

     endif

     nullify(rxmsource)
     nullify(rxmtarget)
     if(okdebug.and.myid==root) print*,'Scattered tracer mass and slopes ',xx

  enddo

end subroutine swap_all_mass

subroutine tracer_to_k   ! uses mpi_alltoallv to swap vectors of data
subroutine k_to_tracer   ! uses mpi_alltoallv to swap vectors of data

8 FAQ and common tasks

8.1 Running the model

The model can only be run on platforms that support MPI! Both at compilation time and when the model is executing, the number of processors needs to be known and remain constant. The number of processors is set at two different locations in the job file: (1) in the header for submission to the queue and (2) as a parameter for the rest of the script. The number of processors also needs to be specified in module mpi_constants before compilation. The model checks whether the numbers of processors specified in these different places are the same and aborts if not.

When choosing a number of processors, aim for a reasonable number. Reasonable here is defined by:

• the number of tracers you're running: nprocessor > ntracer will not bring much more speed-up unless there is very intense chemistry.


• if nprocessor > nlevels > ntracers, several processors will be idle for the entire run. Since these are nevertheless dedicated to your simulation, you might still pay for them.

• requesting more processors usually gets you into longer waiting queues.

• too few processors will not scale your problem down enough (in terms of memory and workload per processor).

As a starting point, try to request a number of processors that equals either the number of tracers or the number of levels, whichever is smaller. If both are very large, try to get an integer fraction of either (for example, with 24 transported tracers and 25 levels, 12 processors gives each processor 2 tracers during transport and 2 or 3 levels during chemistry).

A second parameter that needs to be set is my_real. This parameter specifies to MPI how many bytes are sent per real number. It is platform dependent, and failing to set it correctly will result in strange errors. Look for the file mpif.h on your system to see what the options are (mpi_real, mpi_real8, mpi_double, etc.), and try several options (perhaps with a small test program) to ensure this works properly.
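
A minimal sketch (which constant is correct depends on the platform, the MPI library, and compiler flags such as -r8; verify with a small test program as suggested above):

! Sketch only: my_real must match the byte size of the model's default reals.
my_real = mpi_real        ! 4-byte reals; use mpi_real8 if reals are promoted to 8 bytes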

Running parallel code under MPI is done through the command:

mpirun -np ?? tm5.exe << input

where the number of processors is filled in at the question marks. If the model is run through the job files, this should all be set automatically.

8.2 Adding a transported tracer

• Go into the dimensions module of your project and increase the number of transported tracers (ntracet) by one. Next, go into the chemistry module and specify a name, number, and weight for the new tracer. Also increase ntracet_chem by one.

• If you want to see the budget of your new tracer at the end of each simulation, be sure to give it the number 1 in the tracer list in module chemistry.

• If a specific initialization is needed for your tracer, go into module sources_sinks and add it in subroutine trace1. Note that if a restart from a saveold file is specified (istart=3), the new tracer will likely not be found in this file and will thus be initialized to 0.0!

• Rethink the number of processors that you request, and consider changing it to perform better with the new number of tracers.

• Be sure to

make clean

before compiling and running. Check the output and pay specific attention to the distribution of the tracers over the processors to make sure all worked well.


8.3 Adding a short lived tracer

• Go into the chemistry module, increase the parameter ntrace_chem, and specify a name, number, and weight for the new tracer.

• Depending on what you wish to do with the new tracer, add an emission field, a chemical reaction, or another process (see the next sections).

• Be sure to

make clean

before compiling and running. Check the output and pay specific attention to the distribution of the tracers over the processors to make sure all worked well.

8.4 Adding a new emission field

• Check the module chemistry to see if the type of field (2D, 3D, etc.) you want to add already exists. If not, create a new type containing a pointer with the required dimensions, for instance:

type myemis_data
   real,dimension(:,:),pointer :: surf
end type myemis_data

If so, go into module sources_sinks and add a target field of the right type to the declarations, for example:

type(myemis_data),dimension(nregions),target :: myemis

• In module sources_sinks, go to subroutine trace0 and allocate space to hold the new emission field on the processors that need access to these data. If we're adding, for instance, CO2 emissions:

do region=1,nregions
   imr = im(region) ; jmr = jm(region)
   if(tracer_active(ico2)) allocate(myemis(region)%surf(imr,jmr))
enddo

If the field will be used for multiple tracers, allocate the field on all processors by omitting the if statement.

• Read your emission field on 1x1 degrees through the subroutine readtm3hdf and send it to coarsen_emission with myemis%surf as target. Let either the appropriate processor (tracer_active(ico2)) do this reading and coarsening, or let root_t handle it.

• If the field needs to be known by multiple processors, finish by broadcasting the new field from root_t to all others with:

do region=1,nregions   ! transmit co2 emissions
   call mpi_bcast(myemis(region)%surf, im(region)*jm(region), my_real, &
                  root_t, mpi_comm_world, ierr)
enddo


• Now go into subroutine source1 and add a loop to actually apply the emissions. Check whether the emissions go into a transported or a short-lived tracer, and whether it is a 2d or 3d field:

if(tracer_active(ico2)) then
   call do_add_2d(region,ico2,level1,myemis(region)%surf,xmco2,xmco2)        ! transported
   call do_add_2d_short(region,ico2,level1,myemis(region)%surf,xmco2,xmco2)  ! short-lived
endif

Budgets of emissions are kept automatically in subroutine do_add_2d.

• Finally, go into subroutine trace_end and deallocate the emission fields on all processors at the end of the run.

8.5 Viewing the contents of an array

For offline viewing, using for instance IDL, dump the array using subroutine dump_field (3d or 4d):

call dump_field4d(region,rm,lmr,ntracet,(/2,2,0/),’rm’,’rmlev.hdf’)

The arguments to dump_field4d are the array (rm), the number of levels in the array (lmr), the number of tracers in the array (ntracet), the number of halo cells for each dimension as an array of 3 integers, and finally the name of the field and the name of the HDF file to dump to. The routine automatically checks which communicator is active, and which data to collect from which processor to complete the requested array.

Viewing the contents of an array in the code itself can be more complicated, as the data reside on different processors and print statements will thus show you only the part on the processor doing the printing. Therefore, make sure the print statement contains the myid of the processor, and is bracketed between mpi barriers so that the output is given at the same time by all processors. See 'Debugging' (section 8.7) for more information.

8.6 Adding a reaction

Reactions can always be added within subroutine EBI and do not require further consideration of parallel issues.

8.7 Debugging

Debugging under MPI requires a special approach. Since each processor is running its own private copy of the model and only accesses its own memory, the first step in debugging is to find the processor that causes the crash or error. Experience has shown that 90% of the coding errors and mistakes are made in MPI-related calls. The recommended approach to debugging is:


1. Go to the last point in the code you know to still work fine, and check what the next few calls/subroutines are going to be. If any of these are calls to MPI library routines, check the parameters in the call carefully, and make sure that the array on which the operation is executed exists and has the correct dimensions on all the participating processors.

2. If you do not spot the mistake, make a call to stopmpi at different locations in the code to see where exactly the error occurs. Since processors are not always at the same point in the model, the fact that certain lines have been executed (e.g. print statements) does not mean a fatal error cannot have occurred earlier. Remember: print statements are not a good tool to locate a mistake when using MPI! The specific line can often be found by commenting lines out of the code and running up to the stopmpi statement.

3. Once the location has been determined, check what could be wrong in that line. Most often, arrays are accessed out of bounds, or do not exist on one of the processors (either because they were not allocated or because the processor was not supposed to be active in that part of the code). Use the intrinsics ubound and lbound to check array boundaries (see the sketch after this list). Another common mistake is sending the wrong kind of data (e.g. receiving integers while sending reals, etcetera).

4. If the calling structure and array bounds appear to be okay, check the contents of arrays by performing dump_fields, or by well-placed print statements (bracketed in mpi barriers and including the processor id's). Often, previous operations that accessed arrays out of bounds have partly overwritten other arrays (by writing out of memory) and cause problems further along in the code. You can only see this from strange numbers appearing in otherwise reasonable arrays.

5. If the problem has not been identified, set the number of processors to one and see what happens. Since memory access out of bounds is almost impossible here, it might help you recognize this problem. If necessary, make dump_fields and output on one processor and compare them to multiple processors. Results should be identical!

6. When you think a problem has been found and corrected, do not remove your debugging code additions until you have tested your solution (even if you saw the problem right away...). Ensure that you have solved the problem on one processor, as well as on multiple processors, before you proceed!

7. Remember that the whole code has to be recompiled (make clean) when the number of tracers, levels, or processors changes. This easy solution is often forgotten...
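
As an illustration of points 3 and 4 above (a sketch; rm is used as an example array, adapt it to the array under suspicion):

! Sketch: check array bounds and print a value on every PE, bracketed by
! barriers so that the output of all PE's appears together.
call barrier
print *, myid, ' rm bounds:  ', lbound(rm), ubound(rm)
print *, myid, ' rm(1,1,1,1):', rm(1,1,1,1)
call barrier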

8.8 Getting model performance data

Getting performance data is very platform specific, and it is usually best to check the documentation of the platform you're working on. Here, I will illustrate my preferred technique to assess model performance on TERAS, the Dutch national supercomputer. The software used is called SPEEDSHOP.

• Edit the jobfile so that the line where mpirun is invoked (for 4 processors here) looks like:


mpirun -np 4 ssrun -[experiment name here] tracer.exe << input

The name of the experiment is chosen from a list available on the TERAS website, and denotes what aspect of model performance will be tested. This can be the time spent in each subroutine, the number of cache misses per subroutine, or one of several other options.

• SSRUN has created several output files with the names:

tracer.exe.fdc_sampx.*

These files (where * is the runid) contain the performance data from each processor separately.

• The performance for a single processor is viewed with the command:

prof -lines tracer.exe.fdc_sampx.[runid]

To view the statistics from all processors, first merge the output files through the command:

ssaggregate -e file1 file2 file3 ... -o outfile

Then continue as above, substituting tracer.exe.* with the name of your merged output file.

• Usually, you begin with a timer experiment (fpc_sampx) to see where the model spends most of its time, and follow this up with a primary cache misses experiment (fdc_hwc). Usually, the results of these experiments will show that the slowest subroutine spends most of its time in a line where primary cache misses are abundant. Take another look at the source code and try to solve this problem.

9 Model performance tests

The first tests with the new TM5 parallel code can be seen in Figure 2. The code was tested on TERAS in a configuration with full CBM-IV chemistry and three regions (glb6x4, eur3x2, eur1x1). Thus, 25 layers and 24 transported tracers were used. Obviously, a large decrease in turnover time can be achieved by using multiple processors (a factor of ∼9.5 here). However, this increases the cost (in total CPU seconds billed) of the run, and not all the added processors lead to a similar scaling in turnover time. At 20 processors, the speed-up of the model is 50% less effective than a linear (1:1) speed-up. Interpolating the blue curve (visually) suggests that using more than 20 processors will not reduce the turnover time significantly, in line with the suggested limit of 24 (ntrace) or 25 (nlevel) processors.

10 More resources

• email questions to Wouter Peters

• The TM5 website http://www.phys.uu.nl/ tm5/


• MPI website http://www-unix.mcs.anl.gov/mpi/

• LAM-MPI implementation on clusters and MacOSX http://www.lam-mpi.org/

• Debugging tools http://www.sara.nl/userinfo/teras/usage/progdevel/debugging/index.html

• Speedshop and MPI on Teras http://www.sara.nl/userinfo/teras/usage/progdevel/analysis/index.html

• SGI profiling information http://techpubs.sgi.com


Table 1: subroutines and the domain over which they are parallelized in the model

(The table marks, for each subroutine in advectx.f90, advecty.f90, advectz.f90, budget_global.f90, chemistry_cbm4.f90, coarsen_region.f90, controlg.f90, ebi_cbm4.f90, gather_tracer.f90, global.f90, mainZoom.f90, mix_edges.f90, mpi_communication.f90, put_xedges.f90, put_yedges.f90, redgridZoom.f90, sources_sinks_cbm4.f90, subscalg.f90, swap_all_mass.f90, tm5_io.f90, update_parent.f90, and write_mix.f90, whether it is parallel over neither domain, both domains, tracers, or levels; the individual column marks could not be recovered in this transcription.)


Figure 2: First results from the TM5 parallelization, as tested on TERAS.
