
Original Article

The International Journal of High Performance Computing Applications 1–13. © The Author(s) 2017. Reprints and permissions: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/1094342016672543. hpc.sagepub.com

Accelerating NWChem Coupled Cluster through dataflow-based execution

Heike Jagode1, Anthony Danalis1 and Jack Dongarra1,2,3

Abstract

Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWCHEM, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelized in coarse chunks. In this paper, we present our effort of converting the NWCHEM's CC code into a dataflow-based form that is capable of utilizing the task scheduling system PARSEC (Parallel Runtime Scheduling and Execution Controller): a software package designed to enable high-performance computing at scale. We discuss the modularity of our approach and explain how the PARSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWCHEM codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared with the original version of NWCHEM); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWCHEM, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation.

Keywords: PaRSEC, tasks, dataflow, DAG, PTG, NWChem, CCSD

1. Introduction

Simulating non-trivial physical systems in the field of computational chemistry imposes such high demands on the performance of software and hardware that it comprises one of the driving forces of high-performance computing. In particular, many-body methods, such as Coupled Cluster (CC) (Bartlett and Musial, 2007) of the quantum chemistry package NWCHEM (Valiev et al., 2010), come with a significant computational cost, which stresses the importance of the scalability of NWCHEM in the context of real science.

On the software side, the complexity of these software packages, with diverse code hierarchies, and millions of lines of code in a variety of programming languages, represents a central obstacle for long-term sustainability in the rapidly changing landscape of high-performance computing. On the hardware side, despite the need for high performance, harnessing large fractions of the processing power of modern large-scale computing platforms has become increasingly difficult over the past couple of decades. This is due both to the increasing scale and the increasing complexity and heterogeneity of modern (and projected future) platforms.

This paper is centered around code modernization, focusing on adapting the existing NWCHEM CC methods to a dataflow-based approach by utilizing the task scheduling system PARSEC. We argue that dataflow-driven task-based programming models, in contrast to the control flow model of coarse-grain parallelism, are a more sustainable way to achieve computation at scale.

1 University of Tennessee, Knoxville, USA
2 Oak Ridge National Laboratory, USA
3 University of Manchester, UK

Corresponding author: Heike Jagode, Innovative Computing Laboratory, University of Tennessee, Department of Electrical Engineering and Computer Science, Suite 203 Claxton, 1122 Volunteer Boulevard, Knoxville, TN 37996, USA. Email: [email protected]

The Parallel Runtime Scheduling and Execution Control (PARSEC) (Bosilca et al., 2012) framework is a task-based dataflow-driven runtime that enables task execution based on holistic conditions, leading to better occupancy of computational resources. PARSEC enables task-based applications to execute on distributed memory heterogeneous machines, and provides sophisticated communication and task scheduling engines that hide the hardware complexity from the application developer. The main difference between PARSEC and other task-based engines lies in the way tasks, and their data dependencies, are represented. PARSEC employs a unique, symbolic description of algorithms allowing for innovative ways of discovering and processing the graph of tasks. Namely, PARSEC uses an extension of the symbolic Parameterized Task Graph (PTG) (Cosnard and Loi, 1995; Danalis et al., 2014) to represent the tasks and their data dependencies to other tasks. The PTG is a problem-size-independent representation that allows for immediate inspection of a task's neighborhood, regardless of the location of the task in the directed acyclic graph (DAG). This contrasts with all other task-scheduling systems, which discover the tasks and their dependencies at runtime (through the execution of skeleton programs) and therefore cannot process a future task that has not yet been discovered, or face large overheads due to storing and traversing the DAG that represents the whole execution of the parallel application.
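To give a feel for why a parameterized task graph can be inspected without ever being stored, the toy C sketch below (this is not PaRSEC API, nor its PTG/JDF syntax; all names and the dependency pattern are invented for illustration) defines a two-dimensional family of tasks T(i,j) whose predecessors and successors are computed symbolically from the task's own parameters, in constant time and independently of the problem size.

```c
#include <stdio.h>

/* A toy "parameterized task graph": tasks T(i,j) over a 2-D execution
 * space 0 <= i,j < N, where T(i,j) consumes the outputs of T(i-1,j) and
 * T(i,j-1).  The point is only that the neighborhood of any task can be
 * computed symbolically from its parameters, without materializing the
 * DAG of N*N tasks. */

#define N 1000               /* problem-size parameter */

typedef struct { int i, j; } task_id;

/* Enumerate the predecessors of T(i,j); returns how many were written. */
static int predecessors(task_id t, task_id out[2]) {
    int n = 0;
    if (t.i > 0) out[n++] = (task_id){ t.i - 1, t.j };
    if (t.j > 0) out[n++] = (task_id){ t.i, t.j - 1 };
    return n;
}

/* Enumerate the successors of T(i,j); returns how many were written. */
static int successors(task_id t, task_id out[2]) {
    int n = 0;
    if (t.i + 1 < N) out[n++] = (task_id){ t.i + 1, t.j };
    if (t.j + 1 < N) out[n++] = (task_id){ t.i, t.j + 1 };
    return n;
}

int main(void) {
    /* Inspect an arbitrary task deep inside the DAG in O(1), which is
     * what a compact, problem-size-independent representation enables. */
    task_id t = { 777, 3 }, nbr[2];
    int np = predecessors(t, nbr);
    printf("T(%d,%d) has %d predecessor(s); first: T(%d,%d)\n",
           t.i, t.j, np, nbr[0].i, nbr[0].j);
    int ns = successors(t, nbr);
    printf("T(%d,%d) has %d successor(s); first: T(%d,%d)\n",
           t.i, t.j, ns, nbr[0].i, nbr[0].j);
    return 0;
}
```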

In this paper, we describe the transformations of the NWCHEM CC code to a dataflow version that is executed over PARSEC. Specifically, we discuss our effort of breaking down the computation of the CC methods into fine-grained tasks with explicitly defined data dependencies, so that the serialization imposed by the traditional linear algorithms can be eliminated, allowing the overall computation to scale to much larger computational resources.

Despite having in-house expertise in PARSEC, and working closely and deliberately with computational chemists, this code conversion proved to be laborious. Still, the outcome of our effort of exploiting finer granularity and parallelism with runtime/dataflow scheduling is twofold. First, it successfully demonstrates the feasibility of converting the CC kernel into a form that can execute in a dataflow-based task-scheduling environment. Second, it confirms that utilizing dataflow-based execution for CC methods enables more efficient and scalable computations. We present a thorough performance evaluation and demonstrate that the modified CC component of NWCHEM outperforms the original implementation by more than a factor of two.

2. Overview of NWChem

Computational modeling has become an integral part of many research efforts in key application areas in chemical, physical, and biological sciences. NWCHEM is a molecular modeling software developed to take full advantage of the advanced computing systems available. NWCHEM provides many methods to compute the properties of molecular and periodic systems by using standard quantum-mechanical descriptions of the electronic wave function or density. The CC theory (Bartlett and Musial, 2007) is considered by many to be a gold standard for accurate quantum-mechanical description of ground and excited states of chemical systems. Its accuracy, however, comes at a significant computational cost.

2.1. Tensor Contraction Engine

An important role in designing the optimum memory versus cost strategies in CC implementations is played by the automatic code generator, the Tensor Contraction Engine (TCE) (Hirata, 2003), which abstracts and automates the time-consuming and error-prone processes of deriving the working equations of second-quantized many-electron theories and synthesizing efficient parallel computer programs on the basis of these equations. Current development is mostly focused on CC implementations, which can utilize any type of single-determinantal reference function including restricted, restricted open-shell, and unrestricted Hartree–Fock determinants (RHF, ROHF, and UHF, respectively) in describing closed- and open-shell molecular systems. All TCE CC implementations take advantage of Global Arrays (GA) (Nieplocha et al., 2006) functionalities, which support the distributed memory programming model.

2.2. CC Single Double

Especially important in the hierarchy of the CC formalism is the iterative CC model with single and double excitations (CCSD) (Purvis and Bartlett, 1982), which is a starting point for many accurate perturbative CC formalisms including the ubiquitous CCSD(T) approach (Raghavachari et al., 1989). Our starting point for the investigation in this paper is the CCSD version that takes advantage of the alternative task scheduling (ATS). The details of these implementations have been described in previous publications (Kowalski et al., 2011). In summary, the original CCSD TCE implementations aggregated a large number of subroutines, which calculate either recursive intermediates or contributions to a residual vector. The dimensionalities of the tensors involved in a given subroutine greatly impact the memory, computation, and communication characteristics of each subroutine, which can lead to pronounced problems with load balancing. To address this problem and improve the scalability of the CCSD implementations, NWCHEM exploits the dependencies exposed between the task pools and groups them into classes characterized by a collective task pool. This was done in such a way as to ensure sufficient parallelism in each class while minimizing the total number of such classes.

3. Implementation of CC theory

In the first subsection, we highlight the basics necessary to understand the original parallel implementation of CC through TCE. We then describe the challenges of converting TCE-generated code into a PARSEC-enabled version and its execution model that is based on the PTG abstraction.

3.1. CC theory through TCE

In NWCHEM, the CCSD code (among other kernels) is generated through the TCE into multiple sub-kernels that are divided into so-called "T1" and "T2" subroutines for equations that determine the T1 and T2 amplitude matrices. These amplitude matrices embody the number of excitations in the wave function, where T1 represents all single excitations and T2 represents all double excitations. The underlying equations of these theories are all expressed as contractions of many-dimensional arrays or tensors (generalized matrix multiplications). There are typically many thousands of such terms in any one problem, but their regularity makes it relatively straightforward to translate them into FORTRAN code, parallelized with the use of GA, through the TCE. In general, NWCHEM contains about one million lines of human-generated code and over two million lines of TCE-generated code.
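To make the idea of a tensor contraction as a generalized matrix multiplication concrete, the self-contained C sketch below evaluates one made-up contraction, R[i,j,a,b] = sum_k T[i,j,k] * V[k,a,b], by flattening the index pairs (i,j) and (a,b). The tensor names and dimensions are invented for illustration and are not taken from the TCE-generated code, where the inner product would be handed to an optimized DGEMM.

```c
#include <stdio.h>

/* One tensor contraction, R[i,j,a,b] = sum_k T[i,j,k] * V[k,a,b],
 * viewed as a (NI*NJ x NK) by (NK x NA*NB) matrix multiplication:
 * row index = (i,j), column index = (a,b), summation index = k. */

enum { NI = 3, NJ = 4, NK = 5, NA = 2, NB = 3 };

static double T[NI][NJ][NK], V[NK][NA][NB], R[NI][NJ][NA][NB];

int main(void) {
    /* Fill the input tensors with deterministic values. */
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            for (int k = 0; k < NK; k++)
                T[i][j][k] = i + 0.1 * j + 0.01 * k;
    for (int k = 0; k < NK; k++)
        for (int a = 0; a < NA; a++)
            for (int b = 0; b < NB; b++)
                V[k][a][b] = k - 0.5 * a + 0.25 * b;

    /* The contraction itself: the (i,j) and (a,b) loops enumerate the
     * rows and columns of the flattened matrices, k is contracted away. */
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            for (int a = 0; a < NA; a++)
                for (int b = 0; b < NB; b++) {
                    double sum = 0.0;
                    for (int k = 0; k < NK; k++)
                        sum += T[i][j][k] * V[k][a][b];
                    R[i][j][a][b] = sum;
                }

    printf("R[1][2][1][2] = %g\n", R[1][2][1][2]);
    return 0;
}
```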

3.1.1. Structure of the CCSD approach. For the iterative CCSD code, there exist 19 T1 and 41 T2 subroutines, and all of them highlight very similar code structure and patterns. Figure 1 shows the pseudo FORTRAN code for one of the generated T1 and T2 subroutines, highlighting that most work is in deep loop nests. These loop nests consist of three types of code:

• local memory management (i.e. MA_PUSH_GET(), MA_POP_STACK());

• calls to functions (i.e. GET_HASH_BLOCK(), ADD_HASH_BLOCK()) that transfer data over the network via the GA layer;

• calls to the subroutines that perform the actual computation on the data: DGEMM() and TCE_SORT_*() (which performs an O(n) remapping of the data, rather than an O(n log n) sorting).

The control flow of the loops is parameterized, but static. That is, the induction variable of a loop with a header such as "DO p3b = noab+1,noab+nvab" (i.e. p3b) may take different values between different executions of the code, but during a single execution of CCSD the values of the parameters noab and nvab will not vary; therefore every time this loop executes it will perform the same number of steps, and the induction variable p3b will take the same set of values. This enables us to restructure the body of the inner loop into tasks that can be executed by PARSEC, specifically tasks with an execution space that is parameterized (by noab, nvab, etc.), but constant during execution.
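The sketch below renders this control structure schematically in C (the real code is TCE-generated FORTRAN, as in Figure 1). The parameter names noab and nvab and the spin-style lookup guard come from the discussion above; the concrete values, the 0-based indexing, and the helper names are invented for illustration.

```c
#include <stdio.h>

/* Schematic rendering of a TCE-style loop nest: the loop bounds depend
 * only on parameters that are fixed for the whole CCSD run, while the
 * decision to do real work for a block depends on program data. */

enum { NOAB = 4, NVAB = 6 };                      /* run-constant parameters */
static const int spin[NOAB + NVAB] = { 1, 1, 2, 2, 1, 1, 1, 2, 2, 2 };

static void sort_and_gemm(int p3b, int h4b) {     /* stand-in for TCE_SORT_* + DGEMM */
    printf("kernel for block (p3b=%d, h4b=%d)\n", p3b, h4b);
}

int main(void) {
    /* The iteration space is identical every time the subroutine runs
     * (C 0-based analogue of "DO p3b = noab+1, noab+nvab") ... */
    for (int p3b = NOAB; p3b < NOAB + NVAB; p3b++)
        for (int h4b = 0; h4b < NOAB; h4b++)
            /* ... but whether a block does real work depends on program
             * data (here: a made-up spin symmetry check), which a static
             * compiler analysis cannot resolve. */
            if (spin[p3b] == spin[h4b])
                sort_and_gemm(p3b, h4b);
    return 0;
}
```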

3.1.2. Parallelization of CCSD. Parallelism of the TCE-generated CC code follows a coarse task-stealing model. The work inside each T1 and T2 subroutine is grouped into chains of multiple matrix-multiply kernels (GEMM). The GEMM operations within each chain are executed serially, but different chains are executed in a parallel fashion. However, the work is divided into levels. More precisely, the 19 T1 subroutines are divided into three different levels and the execution of the 41 T2 subroutines is divided into four different levels. The task-stealing model applies only within each level, and there is an explicit synchronization step between the levels. Therefore, the number of chains that are available for parallel execution at any time is a subset of the total number of chains.

Load balancing within each of the seven levels of subroutines is achieved through shared variables (exemplified in Figure 1 through SharedCounter()) that are atomically updated (read–modify–write) using GA operations. This is an excellent case where very good parallelism already exists but where additional parallelism can be obtained by examining the data dependencies in the memory blocks of each matrix. For example, elements of the so-called T1 amplitude matrices can be used for further computation before all of the elements are computed. However, the current implementation of CC features a significant amount of synchronizations that prevent introducing additional levels of parallelism, which consequently limits the overall scaling on much larger computational resources.

Figure 1. Pseudocode of one CCSD subroutine as generated by the TCE.


In addition, the use of atomically updated shared variables, which is currently at the heart of the task-stealing and load-balancing solution, is bound to become inefficient at large scale, becoming a bottleneck and causing major overhead.

Also, the notion of task in the current CC implementation of NWCHEM and the notion of task in PARSEC are not identical. As discussed before, in NWCHEM, a task is a whole chain of GEMMs, executed serially, one after the other. In our PARSEC implementation of CC, each individual GEMM kernel is a task on its own, and the choice between executing them as a chain, or as a reduction tree, is almost as simple as flipping a switch.

In summary, the most significant impact of porting CC over PARSEC is the ability to eliminate redundant synchronizations between the levels and to break down the algorithms into finer-grained tasks with explicitly defined dependencies.

3.2. CC theory over PaRSEC

PARSEC provides a front-end compiler for converting canonical serial codes into the PTG representation. However, due to computability limits, this tool is limited to polyhedral codes, i.e. if a code is not affine then a static analysis is not possible at compile time. Affine codes can only contain loop headers (bounds and steps) and array indices with expressions that are limited to addition and subtraction of the induction variables, constants, and numeric literals (as well as multiplication by numeric literals) and branches (if–then–else) that contain only similar arithmetic expressions and comparison operators. Such affine codes (e.g. dense linear algebra is largely affine) can be fully analyzed using polyhedral analysis, and a compact representation of the DAG, the PTG, can be generated. A critical characteristic of the PTG is that it can be used by the runtime to evaluate any part of the DAG without having to store the entire DAG in memory. This is one of the important features that differentiate PARSEC from any other task-scheduling system.

The problem with affine codes, though, is that they are a very small subset of real-world applications. The CC code generated by TCE is neither organized in pure tasks (i.e. functions with no side-effects to any memory other than arguments passed to the function itself) nor is the control flow affine (e.g. the loop execution space has holes in it; branches are statically undecidable since their outcome depends on program data, and thus cannot be resolved at compile time).

While the CC code seems polyhedral, it is not quite so. The code generated by TCE includes branches that perform array lookups into program data. For example, branches such as "IF (int_mb(k_spin+h7b-1) ...)" (see Figure 1) are very common. Such branches make the code not only non-affine, but statically undecidable, since their outcome depends on program data and thus cannot be resolved at compile time.

However, while the behavior of the CC code depends on program data, this data is constant during a given execution of the code. Therefore, the code can be expressed as a parameterized DAG, by using lookups into the program data, either directly or indirectly. In our implementation we access the program data indirectly by building meta-data structures in a preliminary step. The details of this "first step" are explained in the next section.

In the work described in this paper, we implemented a dataflow form for all functions of the CCSD computation that are associated with calculating parts of the T2 amplitudes, particularly the ones that perform a GEMM operation (the most time-consuming parts). More precisely, we converted a total of 29 of the 41 T2 subroutines (see Note 1), which we refer to under the unified moniker of "GA:T2" for the original version, and "PaRSEC:T2" for the dataflow version of the subroutines.

Figure 2 illustrates the dataflow of the original 41 T2 subroutines. It shows how the work is divided into four distinct levels. The solid edges of this DAG represent the dataflow where output data (matrix C) is added to output data in another level (C→C); and the dotted edges show where output data (matrix C) is used as input data (matrix B) for different subroutines (C→B).

As mentioned above, the task-stealing model applies only within each level, and there is an explicit synchronization step between the four levels. As for the PARSEC version of this code, our dataflow implementation of the 29 "PaRSEC:T2" subroutines is displayed in black, and the remaining 12 of the 41 T2 subroutines are displayed in light gray. The latter are the subroutines that do not perform a GEMM operation; due to their insignificance in terms of execution time, they are not yet converted into a dataflow form but executed as in the original code.

Our chosen subroutines comprise approximately 91% of the execution time of all 41 T2 subroutines when computing the CCSD correlation energy of the beta-carotene molecule (C40H56) (not including the 19 T1 subroutines and additional execution steps that set up the iterative CCSD computation). More details are discussed in Section 6.1 and illustrated in Figure 7a.

4. Dataflow version of CC

In this section, we describe our design decisions for the CC dataflow version and discuss various levels of parallelism and optimizations that have been studied for the PARSEC-enabled CC implementation.

4.1. Design decisions

The original code of our chosen subroutines consists of deep loop nests that contain the memory access routines as well as the main computation, namely SORT and GEMM. In addition to the loops, the code contains several IF statements, such as the one mentioned above. When CC executes, the code goes through the entire execution space of the loop nests, and only executes the actual computation kernels (SORT and GEMM) if the multiple IF branches evaluate to true. To create the PARSEC-enabled version of the subroutines (PaRSEC:T2), we decomposed the code into two steps.

The first step traverses the execution space and evaluates all IF statements, without executing the actual computation kernels (SORT and GEMM). This step uncovers sparsity information by examining the program data (i.e. int_mb(k_spin+h7b-1)) that is involved in the IF branches, and stores the results in custom meta-data vectors that we defined. The custom meta-data vectors merely hold information regarding the actual loop iterations that will execute the computational kernels at runtime, i.e. iterations where all of the IF statements evaluate to true. This step significantly reduces the execution space of the loop nests by eliminating all entries that would not have executed. In addition, this step probes the GA library to discover where the program data resides in memory and stores these addresses into the meta-data structures as well.

The second step is the execution of the PTG representation of the subroutines. Since the control flow depends on the program data, the PTG examines our custom meta-data vectors populated by the first step; this allows the execution space of the modified subroutines over PARSEC to match the original execution space of GA:T2. Also, using the meta-data structures, PARSEC accesses the program data directly from memory, without using GA.
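The following C sketch illustrates this two-step decomposition, reusing the toy guard from the sketch in Section 3.1.1. It is an illustration only, not the actual NWChem/PaRSEC code: step 1 walks the full loop nest once, evaluates the data-dependent IFs, and records the surviving (p3b, h4b) pairs in a meta-data vector (the real code also records block addresses obtained from GA); step 2, which stands in for the PTG executed by PaRSEC, only ever iterates over that compacted vector.

```c
#include <stdio.h>

enum { NOAB = 4, NVAB = 6 };
static const int spin[NOAB + NVAB] = { 1, 1, 2, 2, 1, 1, 1, 2, 2, 2 };

typedef struct { int p3b, h4b; /* + block address, sizes, ... */ } work_item;

static void gemm_task(work_item w) {              /* stand-in for one PaRSEC task */
    printf("task for block (p3b=%d, h4b=%d)\n", w.p3b, w.h4b);
}

int main(void) {
    work_item meta[NOAB * NVAB];
    int nwork = 0;

    /* Step 1: traverse the execution space and evaluate the guards only,
     * without doing any of the actual work. */
    for (int p3b = NOAB; p3b < NOAB + NVAB; p3b++)
        for (int h4b = 0; h4b < NOAB; h4b++)
            if (spin[p3b] == spin[h4b])
                meta[nwork++] = (work_item){ p3b, h4b };

    /* Step 2: the execution space is now dense; every recorded entry
     * carries real work and can be turned into an independent task. */
    for (int n = 0; n < nwork; n++)
        gemm_task(meta[n]);

    printf("%d of %d candidate blocks carried work\n", nwork, NOAB * NVAB);
    return 0;
}
```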

4.2. Parallelization and optimization

Figure 2. Directed acyclic graph (DAG) of the 41 T2 subroutines and its data dependencies.

One of the main reasons we are porting CC over PARSEC is the ability of the latter to express tasks and their dependencies at a finer granularity, as well as the decoupling of work tasks and communication operations that enables us to experiment with more advanced communication patterns than serial chains. Since matrix addition is an associative and commutative operation, the order in which the GEMMs are performed does not bear great significance as long as the results are atomically added. This enables us to perform all GEMM operations in parallel and sum the results using a binary reduction tree. Figure 3 shows the DAG of eight GEMM operations utilizing a binary tree reduction (as opposed to a serial "chain" of GEMMs). Clearly, in this implementation there are significantly fewer sequential steps than in the original chain (McCraw et al., 2014). For the sake of completeness, Figure 4 depicts such a chain where eight GEMM operations are computed sequentially.

In addition, the sequential steps are matrix additions, not GEMM operations, so they are significantly faster, especially for larger matrices. Reductions only apply to GEMM operations that execute on the same node, thus avoiding additional communication.
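The difference between the two orderings can be sketched in a few lines of C. This is only an illustration of the accumulation patterns (tiny matrices, a naive stand-in for DGEMM, invented helper names), not the PaRSEC implementation: the chain performs N sequential GEMM accumulations into the same result, whereas the tree computes N independent partial results (which could run in parallel) and then sums them in log2(N) levels of cheap matrix additions. Both variants produce the same result, which is why the switch between them does not affect the semantics of the algorithm.

```c
#include <stdio.h>

#define N   8     /* number of independent GEMM partial results */
#define DIM 2     /* tiny matrices keep the sketch self-contained */

typedef struct { double a[DIM][DIM]; } matrix;

/* C += A * B  (naive stand-in for DGEMM) */
static void gemm_acc(matrix *C, const matrix *A, const matrix *B) {
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            for (int k = 0; k < DIM; k++)
                C->a[i][j] += A->a[i][k] * B->a[k][j];
}

static void add_into(matrix *acc, const matrix *x) {
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            acc->a[i][j] += x->a[i][j];
}

/* Serial chain: N sequential GEMM accumulations into the same C. */
static void chain(matrix *C, const matrix A[N], const matrix B[N]) {
    for (int n = 0; n < N; n++)
        gemm_acc(C, &A[n], &B[n]);
}

/* Binary tree: independent GEMMs, then pairwise sums in log2(N) levels. */
static void tree(matrix *C, const matrix A[N], const matrix B[N]) {
    matrix part[N] = {0};
    for (int n = 0; n < N; n++)                       /* parallelizable */
        gemm_acc(&part[n], &A[n], &B[n]);
    for (int stride = 1; stride < N; stride *= 2)     /* reduction levels */
        for (int n = 0; n + stride < N; n += 2 * stride)
            add_into(&part[n], &part[n + stride]);
    add_into(C, &part[0]);
}

int main(void) {
    matrix A[N], B[N], C1 = {0}, C2 = {0};
    for (int n = 0; n < N; n++)
        for (int i = 0; i < DIM; i++)
            for (int j = 0; j < DIM; j++) {
                A[n].a[i][j] = n + i + 1;
                B[n].a[i][j] = n + j + 1;
            }
    chain(&C1, A, B);
    tree(&C2, A, B);
    printf("chain C[0][0]=%g  tree C[0][0]=%g\n", C1.a[0][0], C2.a[0][0]);
    return 0;
}
```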

The original version of the code performs an atomic accumulate–write operation (via calls to ADD_HASH_BLOCK()) at the end of each chain. Since our dataflow version of the code computes the GEMMs for each chain in parallel, we eliminate the global atomic GA functionality and perform direct memory access instead, using local atomic locks within each node to prevent race conditions. The choice of our implementation, discussed in this paper, is based on earlier investigations presented in Danalis et al. (2015) and Jagode et al. (2016), where we compared the performance of different variants of the modified code and explain the different behaviors that lead to the differences in performance.

4.3. Additional levels of parallelism

It is important to note that our work of converting the entire NWCHEM CC "FORTRAN plus Global Arrays" implementation into a dataflow form has been of an incremental nature in order to preserve the original behavior of the code. This allowed us to initially focus only on the most time-consuming and most significant subroutines (the 29 heavy GEMM routines), and more importantly, execute them over PARSEC without having to convert the entire CCSD module. The successful conversion of these 29 kernels has proven to be very beneficial, resulting in a performance improvement of more than a factor of two in the execution of the entire CC component of NWCHEM. This result justifies our conclusion that the utilization of dataflow-based execution of CC methods enables more efficient and scalable computation.

Figure 3. Parallel GEMM operations followed by reduction.

Figure 4. Chain of GEMM operations computed sequentially.

After completion of the dataflow implementation within each of the four levels, the next increment of work that we are currently pursuing focuses on implementing dataflow between the levels. From the DAG in Figure 2 it becomes very apparent that not every subroutine in the upper levels has to wait for the completion of all subroutines in the lower levels. For instance, t2_2_2() in Level 3 only depends on the data coming out of t2_2_4(), t2_2_5(), and t2_2_2_2() in Level 2; while these three only depend on 10 (of the 29) subroutines (see Note 2) in Level 1. Instead of putting the execution of the tasks that comprise t2_2_2() on hold until all subroutines in Level 1 and Level 2 are completed, the output of the tasks that flow into t2_2_2() can be passed as soon as it becomes available. The return of enabling dataflow between levels is twofold. First, it increases the level of parallelism even more; and second, it enables the freedom to choose a certain placement of tasks. For example, tasks of subroutine t2_2_2() in Level 3 can be computed where the Level 2 subroutines, whose output flows into t2_2_2(), store their data.

Another advantage of enabling dataflow between the levels, in addition to the benefits resulting from increased levels of parallelism, is that part of the aforementioned work that is necessary in the current version of the code will become unnecessary as soon as the complete CCSD code has been converted to PARSEC. Specifically, data will not need to be pulled and pushed into the GA at the beginning and end of each subroutine if all subroutines execute over PARSEC. Instead, the different PARSEC tasks that comprise a subroutine will pass their output to the tasks that comprise another subroutine using the internal communication engine of PARSEC. This will be done implicitly, without user involvement, since PARSEC handles communication internally.

5. Dataflow as a programming paradigm

In this section we will discuss the reasons why the PARSEC implementation of CC is faster than the original code by contrasting the corresponding programming paradigms.

5.1. Communication computation overlapping

The original code is written in FORTRAN (Note 3) and makes calls to the GA toolkit for communication and synchronization purposes. Global Arrays enables applications to adopt the Partitioned Global Address Space (PGAS) programming model and provides primitives for one-sided communication and atomic operations. However, the structure of the CC code that makes the communication calls does not take advantage of these more advanced concepts and rather follows a more traditional Coarse Grain Parallelism (CGP) programming paradigm. Let us consider again the pseudocode shown in Figure 1. This figure abstracts away many of the details of the actual code but captures the structure very accurately. Namely, the work is organized in subroutines (that are further organized in logical steps) and inside each subroutine there are multiple nested loops with the call to the most computationally intensive functions (i.e. DGEMM and TCE_SORT_*) contained in the innermost loop. Interestingly, the calls to the functions that fetch the data to be processed by these calls (i.e. GET_HASH_BLOCK) are also in the innermost loop, immediately preceding the computation. In other words, the program contains no additional work that is available between the call to the data transfer function and the call to the computation function that uses the data. This means that no matter how sophisticated and efficient the communication library and the network hardware are, this data exchange will not be overlapped with computation. This behavior leads to a major waste of efficiency, since at almost any given point in the program there is additional work (in other subroutines) that is not semantically dependent on the work currently being performed. However, the coarse-grain structure of the program, with work and communication contained in deep loop nests inside subroutines, does not allow for different units of work, that are independent of one another, to be utilized for communication–computation overlapping.

This missed opportunity for overlapping can be witnessed in the execution trace shown in Figure 5a. The figure shows an enlarged segment of the execution trace to improve readability. The shown segment is representative of the whole execution and does not exhibit a unique behavior. In this trace, the different colors represent different operations (i.e. GEMM, SORT, data transfer), the x-axis represents time (i.e. longer boxes signify operations that took longer to finish), and each row corresponds to one MPI rank (i.e. one instance of the parallel application) out of the total 224 used for this run. The red color represents the execution of a matrix multiply (GEMM), which is the most computationally expensive operation in this code. The blue and purple colors represent data transfers (matrices A and B needed to perform the C += A × B operation that a GEMM performs). As can be easily seen by the prominence of the blue color in the trace, the communication imposes an unacceptably high overhead in the execution of the original code.

This behavior is in stark contrast with the dataflow-based execution model that the PARSEC-enabled version of the code follows. In the latter, computation load is organized in tasks (not loops, or subroutines), and tasks declare to the runtime their dependencies to other tasks. When a task completes, the runtime can initiate the asynchronous transfer of its output data while scheduling for execution the next task whose dependencies have been satisfied. This opportunistic execution, typical of dataflow-based systems, allows for maximum communication–computation overlapping. This is true because at any given time during the life of the program, if there is available computation, it will be performed without unnecessary waiting for some predetermined order of loops or subroutines to be satisfied. The opportunistic execution of the dataflow-based code is demonstrated in the trace shown in Figure 5b. As can be seen in this figure, which is an enlarged view of the beginning of the execution, many of the tasks responsible for reading input data (blue and purple boxes) and sending it to the tasks that will perform the computation (red boxes) are scheduled for execution early on. However, as soon as some data transfers complete, computation tasks become ready and, from that point on, communication and computation are overlapped. It is worth noting that, in contrast with the trace of the original code (Figure 5a), the length of the blue and purple boxes in Figure 5b does not represent the communication cost of the data transfers, since tasks in PARSEC do not perform explicit communication. Rather, these boxes represent the tasks that merely express their communication needs to the runtime by specifying their dependencies to other tasks. The actual data transfers are scheduled and managed by the runtime and are fully overlapped with the computation. As a result, they do not appear in the trace, but they are responsible for the light gray gaps (Figure 5b), which are periods of time where no work can be executed because the corresponding data has not been transferred yet (a phenomenon that only happens at the beginning of the execution). To summarize, by comparing the two traces we can clearly see that the original NWCHEM code faces significant communication delays which can be largely addressed through the use of communication–computation overlapping.

One could argue that the original FORTRAN code can be modified to allow for more communication–computation overlapping without resorting to dataflow-based programming. A developer could reorganize the loop nests into a form of a pipeline, so that each iteration "prefetches" the data needed for the computation of a future iteration. This can be achieved if every iteration initialized an asynchronous transfer for data needed by future iterations and then proceeded to execute work whose data was prefetched by a previous iteration. This would probably increase communication–computation overlapping and decrease waiting time; however, it would not be sufficient to achieve the performance improvements gained in the dataflow-based execution. The reason can be seen in the trace. In each row (i.e. for every MPI rank) there are time periods where communication (blue) takes a small amount of time in comparison with useful work (red). However, in each row there are also time periods where a few communication operations take significantly longer than the following computation. As a result, pipelining work and data transfer would only remove a small part of the communication overhead, unless a very deep pipeline is used, which would lead to significant temporary storage overhead (because all of the incoming data that correspond to future iterations have to be stored in temporary buffers). In the case of our dataflow-based version of CC, the runtime does not merely pipeline loop iterations, but overlaps communication with completely independent computation: work that, in the original code, resides in completely different subroutines and cannot be utilized for overlapping unless a major restructuring of the code is performed.
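For concreteness, the C sketch below shows what such hand-written pipelining might look like: classic double buffering with a pipeline depth of one. The helpers prefetch_block() and wait_block() are hypothetical stand-ins for nonblocking one-sided gets; they are not GA or NWChem API and are implemented as dummies so the sketch runs on its own. A deeper pipeline would simply use more buffers, which is exactly the temporary-storage cost mentioned above.

```c
#include <stdio.h>

#define NBLOCKS 16
#define BLOCK   256

typedef struct { double a[BLOCK]; } block;

/* Hypothetical communication layer: start fetching block 'idx' into 'dst'
 * and wait for a previously started fetch.  Dummies here, so the sketch
 * is self-contained (the "fetch" just fills the buffer synchronously). */
static void prefetch_block(int idx, block *dst) {
    for (int i = 0; i < BLOCK; i++) dst->a[i] = idx;
}
static void wait_block(int idx) { (void)idx; }

static double compute(const block *b) {           /* stand-in for DGEMM work */
    double s = 0.0;
    for (int i = 0; i < BLOCK; i++) s += b->a[i];
    return s;
}

int main(void) {
    block buf[2];                   /* double buffer */
    double total = 0.0;

    prefetch_block(0, &buf[0]);     /* prime the pipeline */
    for (int k = 0; k < NBLOCKS; k++) {
        if (k + 1 < NBLOCKS)
            prefetch_block(k + 1, &buf[(k + 1) % 2]);  /* overlap next fetch */
        wait_block(k);                                  /* data for step k ready */
        total += compute(&buf[k % 2]);                  /* useful work for step k */
    }
    printf("total = %g\n", total);
    return 0;
}
```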

Figure 5. Execution traces: (a) trace of original NWCHEM code; (b) trace of dataflow-based NWCHEM code.

5.2. Freedom from control flow

As we mentioned earlier, the original code executes all GEMM operations in serial chains and only allows for parallelism between different chains. This structure defines both the order in which operations (that are semantically independent) will execute and the granularity of parallelism. Changing either while preserving the straightforward structure of the original code is not a trivial exercise. The dataflow-based PARSEC form of the code that we have created departs from the simplicity of the original code, but it is not subject to either limitation mentioned above. Namely, as we discussed earlier, in PARSEC we execute all GEMM operations in parallel and perform a reduction on the output data of all GEMM operations that belonged to a chain in the original code. In this way, we preserve the semantics of the code, but liberate the execution from the unnecessary limitations on the granularity of parallelism and strict ordering that were not dictated by the semantics of the algorithm, but rather by inherent limitations of control-flow-based programs and coarse grain parallelism.

5.3. Multi-threading and accelerators

Another artifact of the way the original code is structured is that taking advantage of multi-threading to utilize multiple cores on each node is not easy to implement efficiently. As a result, the NWCHEM application does not use threads and rather relies on multiple MPI ranks per node in order to utilize multiple cores. In PARSEC, multi-threading comes at no additional cost for the developer. Once the developer has defined the tasks and the dataflow between them, PARSEC will automatically use threads to accomplish the work, utilizing multiple hardware resources. Furthermore, PARSEC uses a combination of thread local queues and global queues to store available tasks during the execution. When a task Ti completes, the tasks Tj that were waiting for the output of Ti will be placed in the thread local queue of the thread that executed Ti. As a result, work that should naturally execute as a chain (because the output of one task is the input of another) has a high probability of executing as a chain in PARSEC, taking advantage of cache locality and increasing execution efficiency. At the same time, the existence of global queues and work stealing guarantees that PARSEC will also exhibit good load balance.
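The locality and load-balance behavior described above can be illustrated with a small, purely sequential simulation. Nothing in this sketch is PaRSEC code: the worker count, chain lengths, and helper names are invented, and the global queue of the real system is omitted for brevity. Each worker keeps a local LIFO queue, a completed task's successor is pushed onto the queue of the worker that ran it (so chains tend to stay on one worker), and an idle worker steals from another worker's queue.

```c
#include <stdio.h>

#define WORKERS 2
#define CHAINS  3   /* independent chains of tasks */
#define LEN     3   /* tasks per chain: chain c has tasks (c,0),(c,1),(c,2) */

typedef struct { int chain, step; } task;
typedef struct { task buf[CHAINS * LEN]; int top; } stack;

static void push(stack *s, task t) { s->buf[s->top++] = t; }
static int  pop (stack *s, task *t) {
    if (s->top == 0) return 0;
    *t = s->buf[--s->top];
    return 1;
}

int main(void) {
    stack local[WORKERS] = {0};

    /* Seed: distribute the first task of every chain round-robin. */
    for (int c = 0; c < CHAINS; c++)
        push(&local[c % WORKERS], (task){ c, 0 });

    int done = 0;
    while (done < CHAINS * LEN) {
        for (int w = 0; w < WORKERS; w++) {
            task t;
            if (!pop(&local[w], &t)) {
                /* Idle: try to steal from another worker. */
                int victim = -1;
                for (int v = 0; v < WORKERS && victim < 0; v++)
                    if (v != w && pop(&local[v], &t)) victim = v;
                if (victim < 0) continue;
                printf("worker %d steals (%d,%d) from worker %d\n",
                       w, t.chain, t.step, victim);
            }
            printf("worker %d runs  (%d,%d)\n", w, t.chain, t.step);
            done++;
            /* Completion: the successor in the chain becomes ready and is
             * placed in *this* worker's local queue. */
            if (t.step + 1 < LEN)
                push(&local[w], (task){ t.chain, t.step + 1 });
        }
    }
    return 0;
}
```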

While outside the scope of this paper, PARSEC also enables the use of accelerators without too much complexity overhead for the developer. If the developer provides a kernel that can execute on an accelerator, and specifies the availability in the PTG representation of the code, then PARSEC will execute this kernel on the accelerator. In the work we are currently pursuing, we are experimenting with the execution of some of the GEMM operations on an Intel Xeon Phi, aiming to maximize performance given the tradeoffs between using the additional computing power of the accelerator and paying the overhead of transferring the necessary data to it.

6. Performance evaluation

In this section we present the performance of the entire CCSD code using the dataflow version "PaRSEC:T2" of the 29 CC subroutines and contrast it with the performance of the original code "GA:T2". Figure 6 depicts a high-level view of the integration of the PARSEC-enabled code in NWCHEM's CCSD component. The code that we timed (see start and end timers in Figure 6) includes all 19 T1 and 41 T2 subroutines as well as additional execution steps that set up the iterative CCSD computation. The only difference between the original NWCHEM runs and our modified version is the replacement of the 29 original T2 subroutines "GA:T2" with their dataflow version "PaRSEC:T2" and the prerequisites discussed earlier; these prerequisites include meta-data vector population, initialization, and finalization of PARSEC. Also, in our experiments we allow for all iterations of the iterative CCSD code to reach completion.

6.1. Methodology

As input, we used the beta-carotene molecule (C40H56) in the 6-31G basis set, composed of 472 basis set functions. In our tests, we kept all core electrons frozen, and correlated 296 electrons. Figure 7a shows the relative workload of different subroutines (omitting those that fell under 0.1%). To calculate this load we sum the number of floating point operations of each GEMM that a subroutine performs (given the sizes of the input matrices). In addition, Figure 7b shows the distribution of chain lengths for the five subroutines with the highest workload in the case of beta-carotene. The different colors in this figure are for readability only. As can be seen from these statistics, the subroutines that we targeted for our dataflow conversion effort comprise approximately 91% of the execution time of all 41 T2 subroutines in the original NWCHEM TCE CCSD execution.

Figure 6. High-level view of PARSEC code in NWCHEM.

The scalability tests for the original TCE-generated code and the dataflow version of PaRSEC:T2 were performed on the Cascade computer system at EMSL/PNNL. Each node has 128 GB of main memory and is a dual-socket Intel Xeon E5-2670 (Sandy Bridge EP) system with a total of 16 cores running at 2.6 GHz. We performed various performance tests utilizing 1, 2, 4, 8, and 16 cores per node. NWCHEM v6.5 was compiled with the Intel 14.0.3 compiler, using the optimized BLAS library MKL 11.1, provided on Cascade.

6.2. Discussion

Figure 8 shows the execution time of the entire CCSD kernel when the implementation found in the original NWCHEM code is used, and when our PARSEC-based dataflow implementation is used for the (earlier-mentioned) 29 PaRSEC:T2 subroutines. Each of the experiments was run three times; the variance between the runs, however, is so small that it is not visible in the figures. Also, the correctness of the final computed energies has been verified for each run, and differences occur only in the last digit or two (meaning, the energies match for up to the 14th decimal place). In the graph we depict the behavior of the original code using the dark (green) dashed line and the behavior of the PARSEC implementation using the light (orange) solid lines. Once again, the execution time of the PARSEC runs does not exclude any steps performed by the modified code.

On a 32 node partition, the PARSEC version of the CCSD code performs best for 16 cores/node while the original code performs best for 8 cores/node. Comparing the two, the PARSEC execution runs more than twice as fast: to be precise, it executes in 48% of the best time of the original. If we ignore the PARSEC run on 16 cores/node, in an effort to compare performance when both versions use 8 cores/node and thus have similar power consumption, we find that PARSEC still runs 44% faster than the original.

Figure 7. CCSD statistics for beta-carotene and tile size of 45: (a) relative load of heaviest subroutines; (b) chain length distribution.

Figure 8. Execution time comparison using beta-carotene on EMSL/PNNL Cascade: (a) performance on 32 nodes; (b) performance on 64 nodes.


The results are similar on a 64 node partition: the PaRSEC version of CCSD is fastest (for 16 cores/node) with a 43% runtime improvement compared with the original code (which on 64 nodes performs best for 4 cores/node). It is also interesting to point out that for 64 nodes, while PaRSEC manages to use an increasing number of cores, all the way up to 64 × 16 = 1024 cores, to improve performance, the original code exhibits a slowdown beyond 4 cores/node. This behavior is not surprising since (1) the unit of parallelism of the original code (chain of GEMMs) is much coarser than that of PARSEC (single GEMM), and (2) the original code uses a global atomic variable for load balancing while PARSEC distributes the work in a round robin fashion and avoids any kind of global agreement in the critical path.

7. Related work

An alternate approach for achieving better load balancing in the TCE CC code is the Inspector–Executor method (Ozog et al., 2013). This method applies performance model-based cost estimation techniques for the computations to assign tasks to processors. This technique focuses on balancing the computational cost without taking into consideration the data locality.

ACES III (Lotrich et al., 2008) is another method that has been used effectively to parallelize CC codes. In this work, the CC algorithms are designed in a domain-specific language called the Super Instruction Assembly Language (SIAL) (Deumens et al., 2011). This serves a similar function to the TCE, but with an even higher level of abstraction of the equations. The SIAL program, in turn, is run by an MPMD parallel virtual machine, the Super Instruction Processor (SIP). SIP has components that coordinate the work by tasks, communicate information between tasks for retrieving data, and then for execution.

The Dynamic Load-balanced Tensor Contractions framework (Lai et al., 2013) has been designed with the goal of providing dynamic task partitioning for tensor contraction expressions. Each contraction is decomposed into fine-grained units of tasks. Units from independent contractions can be executed in parallel. As in TCE, the tensors are distributed among all processes via global address space. However, since GA does not explicitly manage data redistribution, the communication pattern resulting from one-sided accesses is often irregular (Solomonik et al., 2013).

8. Conclusion and future work

We have successfully demonstrated the feasibility of converting TCE-generated code into a form that can execute in a dataflow-based task-scheduling environment, such as PARSEC. Our effort substantiates that utilizing dataflow-based execution for CC methods enables more efficient and scalable computation, as our performance evaluation reveals a performance boost of a factor of two for the entire CCSD kernel.

This strategy with PARSEC offers many advantages since communication becomes implicit (and can be overlapped with computation), finer-grained tasks can be executed in more efficient orderings than sequential chains (i.e. binary trees), and each of these finer-grained parallel tasks is able to run on different cores of multicore systems, or even different parts of heterogeneous platforms. This will enable computation at extreme scale in the era of many-core, highly heterogeneous platforms, utilizing the components (e.g. CPU, GPU accelerators, and/or Xeon Phi coprocessors) that perform best for the type of task under consideration.

As a next step, we will automate the conversion of the entire NWCHEM TCE CC implementation into a dataflow form so that it can be integrated into more software levels of NWChem with minimal human involvement. Ultimately, the generation of a dataflow version will be adopted by the TCE.

Acknowledgments

We thank the anonymous reviewers for their improvement suggestions. A portion of this research was performed using EMSL, a DOE Office of Science User Facility sponsored by the Office of Biological and Environmental Research and located at Pacific Northwest National Laboratory.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This material is based upon work supported in part by the Air Force Office of Scientific Research under AFOSR Award Number FA9550-12-1-0476, and the DOE Office of Science, Advanced Scientific Computing Research, under award number DE-SC0006733 "SUPER - Institute for Sustained Performance, Energy and Resilience".

Notes

1. All subroutines with prefix "icsd_t2_" and suffixes 2_2_2_2(), 2_2_3(), 2_4_2(), 2_5_2(), 2_6(), lt2_3x(), 4_2_2(), 4_3(), 4_4(), 5_2(), 5_3(), 6_2_2(), 6_3(), 7_2(), 7_3(), vt1ic_1_2(), 8(), 2_2_2(), 2_4(), 2_5(), 4_2(), 5(), 6_2(), vt1ic_1, 7(), 2_2(), 4(), 6(), 2().

2. The 10 subroutines are t2_2_1(), t2_2_6(), t2_2_4_1(), t2_2_4_2(), t2_2_5_1(), t2_2_5_2(), t2_2_2_1(), t2_2_2_3(), t2_2_2_2_1(), t2_2_2_2_2().


3. NWCHEM uses a mixture of FORTRAN dialects, including FORTRAN 77 and more modern ones.

References

Bartlett RJ and Musial M (2007) Coupled-cluster theory in quantum chemistry. Reviews of Modern Physics 79(1): 291–352. DOI: 10.1103/RevModPhys.79.291.
Valiev M, Bylaska EJ, Govind N, et al. (2010) NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications 181(9): 1477–1489. DOI: 10.1016/j.cpc.2010.04.018.
Bosilca G, Bouteiller A, Danalis A, et al. (2012) DAGuE: A generic distributed DAG engine for high performance computing. Parallel Computing 38(1–2): 37–51.
Cosnard M and Loi M (1995) Automatic task graph generation techniques. In: Proceedings of the 28th Hawaii International Conference on System Sciences, pp. 113–122. Washington, DC: IEEE.
Danalis A, Bosilca G, Bouteiller A, et al. (2014) PTG: An abstraction for unhindered parallelism. In: Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), 2014 Fourth International Workshop on, pp. 21–30. Washington, DC: IEEE. DOI: 10.1109/WOLFHPC.2014.8.
Hirata S (2003) Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. Journal of Physical Chemistry A 107(46): 9887–9897. DOI: 10.1021/jp034596z.
Nieplocha J, Palmer B, Tipparaju V, et al. (2006) Advances, applications and performance of the Global Arrays shared memory programming toolkit. International Journal of High Performance Computing Applications 20(2): 203–231.
Purvis G and Bartlett R (1982) A full coupled-cluster singles and doubles model: The inclusion of disconnected triples. The Journal of Chemical Physics 76(4): 1910–1918. DOI: 10.1063/1.443164.
Raghavachari K, Trucks GW, Pople JA, et al. (1989) A 5th-order perturbation comparison of electron correlation theories. Chemical Physics Letters 157(6): 479–483.
Kowalski K, Krishnamoorthy S, Olson R, et al. (2011) Scalable implementations of accurate excited-state coupled cluster theories: Application of high-level methods to porphyrin-based systems. In: High Performance Computing, Networking, Storage and Analysis (SC), 2011, pp. 1–10. Washington, DC: IEEE.
McCraw H, Danalis A, Herault T, et al. (2014) Utilizing dataflow-based execution for coupled cluster methods. In: Proceedings of IEEE Cluster 2014, pp. 296–297. Washington, DC: IEEE.
Danalis A, Jagode H, Bosilca G, et al. (2015) PaRSEC in practice: Optimizing a legacy chemistry application through distributed task-based execution. In: 2015 IEEE International Conference on Cluster Computing, pp. 304–313. Washington, DC: IEEE. DOI: 10.1109/CLUSTER.2015.50.
Jagode H, Danalis A, Bosilca G, et al. (2015) Accelerating NWChem coupled cluster through dataflow-based execution. In: Parallel Processing and Applied Mathematics: 11th International Conference, PPAM 2015, Krakow, Poland, September 6–9, 2015, Revised Selected Papers, Part I, pp. 366–376. Cham: Springer International Publishing. DOI: 10.1007/978-3-319-32149-3_35.
Ozog D, Shende S, Malony A, et al. (2013) Inspector/executor load balancing algorithms for block-sparse tensor contractions. In: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13), pp. 483–484. New York, NY: ACM.
Lotrich V, Flocke N, Ponton M, et al. (2008) Parallel implementation of electronic structure energy, gradient and hessian calculations. The Journal of Chemical Physics 128: 194104. DOI: 10.1063/1.2920482.
Deumens E, Lotrich VF, Perera A, et al. (2011) Software design of ACES III with the super instruction architecture. Wiley Interdisciplinary Reviews: Computational Molecular Science 1(6): 895–901.
Lai PW, Stock K, Rajbhandari S, et al. (2013) A framework for load balancing of tensor contraction expressions via dynamic task partitioning. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–10. New York, NY: ACM.
Solomonik E, Matthews D, Hammond J, et al. (2013) Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions. In: Parallel Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pp. 813–824. Washington, DC: IEEE.

Author biographies

Heike Jagode is a Research Director with the Innovative Computing Laboratory at the University of Tennessee, Knoxville. She specializes in high-performance computing (HPC) and the efficient use of advanced computer architectures, focusing primarily on developing methods and tools for performance analysis and tuning of parallel scientific applications. Her research interests include the multi-disciplinary effort to convert computational chemistry algorithms into a dataflow-based form to make them compatible with next-generation task-scheduling systems, such as PaRSEC. She is currently pursuing a PhD in Computer Science from the University of Tennessee, Knoxville. Previously, she received an MSc in High-Performance Computing from The University of Edinburgh, Scotland, UK; an MSc in Applied Techno-Mathematics and a BSc in Applied Mathematics from the University of Applied Sciences Mittweida, Germany.

Anthony Danalis is currently a Research Scientist II with the Innovative Computing Laboratory at the University of Tennessee, Knoxville. His research interests come from the area of HPC. Recently, his work has been focused on the subjects of performance analysis, system benchmarking, compiler analysis and optimization, dataflow programming, and accelerators. He received his PhD in Computer Science from the University of Delaware on compiler optimizations for HPC. Previously, he received an MSc from the University of Delaware and an MSc from the University of Crete, Greece, both on Computer Networks, and a BSc in Physics from the University of Crete, Greece.

Jack Dongarra holds an appointment at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced computer architectures, programming methodology, and tools for parallel computers. He was awarded the IEEE Sid Fernbach Award in 2004; in 2008 he was the recipient of the first IEEE Medal of Excellence in Scalable Computing; in 2010 he was the first recipient of the SIAM Special Interest Group on Supercomputing's award for Career Achievement; in 2011 he was the recipient of the IEEE IPDPS Charles Babbage Award; and in 2013 he received the ACM/IEEE Ken Kennedy Award. He is a Fellow of the AAAS, ACM, IEEE, and SIAM and a member of the US National Academy of Engineering.


