
A General Sparse Tensor Framework for Electronic Structure Theory

Samuel Manzer,†,‡ Evgeny Epifanovsky,§ Anna I. Krylov,∥ and Martin Head-Gordon*,†,‡

†Kenneth S. Pitzer Center for Theoretical Chemistry, Department of Chemistry, University of California, Berkeley, Berkeley, California 94720, United States
‡Chemical Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
§Q-Chem, Inc., 6601 Owens Drive, Suite 105, Pleasanton, California 94588, United States
∥Department of Chemistry, University of Southern California, Los Angeles, California 90089, United States

ABSTRACT: Linear-scaling algorithms must be developed in order to extend the domain of applicability of electronic structure theory to molecules of any desired size. However, the increasing complexity of modern linear-scaling methods makes code development and maintenance a significant challenge. A major contributor to this difficulty is the lack of robust software abstractions for handling block-sparse tensor operations. We therefore report the development of a highly efficient symbolic block-sparse tensor library in order to provide access to high-level software constructs to treat such problems. Our implementation supports arbitrary multi-dimensional sparsity in all input and output tensors. We avoid cumbersome machine-generated code by implementing all functionality as a high-level symbolic C++ language library and demonstrate that our implementation attains very high performance for linear-scaling sparse tensor contractions.

1. INTRODUCTION

Block-sparse tensors are central mathematical objects in electronic structure theory, the quintessential example of which is the four-center, two-electron integral tensor. The sparsity of the two-electron integral tensor and its blocked structure arise from the application of common integral screening algorithms, the cornerstone of modern self-consistent field (SCF) calculations.1,2 More expensive correlated electronic structure methods transform the two-electron integrals into the molecular orbital basis, requiring additional approximations to preserve sparsity in the computation.3−5 Such codes can become very cumbersome to develop and maintain. Consider, for example, a universally sparse transformation of the three-center atomic orbital basis electron repulsion integrals, (μν|Q), into a localized occupied6,7 basis and virtual molecular orbital8,9 basis (indices i and a below), a common step in local correlation methods:8

$(ia|Q) = \sum_{\mu \in \{\mu\}_i} \sum_{\nu \in \{\nu\}_a} c_{\mu i}\, c_{\nu a}\, (\mu\nu|Q)$   (1)

where Q is an auxiliary basis index, as used in the resolution of the identity (RI) approximation,10,11 and the AO integrals are explicitly given as

$(\mu\nu|Q) = \int\! d\mathbf{r}_1 \int\! d\mathbf{r}_2\; \omega_{\mu}(\mathbf{r}_1)\, \omega_{\nu}(\mathbf{r}_1)\, r_{12}^{-1}\, \omega_{Q}(\mathbf{r}_2)$   (2)

The restricted domain of the summations in eq 1 indicates that the transformations are sparse. When one index determines the range of another index, as the chosen molecular orbital index i controls which μ index entries are nonzero and thus must be traversed by the summation, we say that these two indices are coupled. Any number of indices may be coupled together for varying physical reasons; for example, the μ, ν, i, and a indices from eq 1 are all coupled. The pairs (μ, i) and (ν, a) are coupled by the sparsity of the molecular orbital coefficients, while the pair (μ, ν) is coupled by atomic orbital overlap sparsity. A minimal sketch of such hand-coded coupled-index loops follows.
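To make the notion of coupled indices concrete, the following hand-written sketch shows how a contraction like eq 1 looks when every sparse restriction is coded explicitly. This is not code from our library; the SparseMap type and the map names are assumptions made purely for illustration, and the additional (μ, ν) overlap coupling is noted but not enforced.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical pair-wise sparse maps: for each occupied MO index i,
// the significant AO indices mu (and likewise a -> nu).
using SparseMap = std::map<int, std::vector<int>>;

// (ia|Q) = sum_{mu in {mu}_i} sum_{nu in {nu}_a} c[mu][i] c[nu][a] (mu nu|Q).
// Dense storage is used for clarity; only the loop bounds are sparse.
void half_transform(
    const std::vector<std::vector<std::vector<double>>>& ao, // (mu nu|Q)
    const std::vector<std::vector<double>>& c_occ,           // c[mu][i]
    const std::vector<std::vector<double>>& c_virt,          // c[nu][a]
    const SparseMap& mu_of_i, const SparseMap& nu_of_a,
    std::vector<std::vector<std::vector<double>>>& out)      // (ia|Q)
{
    for (std::size_t i = 0; i < out.size(); ++i)
        for (std::size_t a = 0; a < out[i].size(); ++a)
            for (int mu : mu_of_i.at(static_cast<int>(i)))       // (mu, i) coupled
                for (int nu : nu_of_a.at(static_cast<int>(a))) { // (nu, a) coupled
                    // A real code would also skip (mu, nu) pairs with no
                    // AO overlap: the third coupling described in the text.
                    const double cc = c_occ[mu][i] * c_virt[nu][a];
                    for (std::size_t q = 0; q < ao[mu][nu].size(); ++q)
                        out[i][a][q] += cc * ao[mu][nu][q];
                }
}
```

Changing the storage order of any tensor here forces every loop nest of this kind to be rewritten by hand, which is precisely the maintenance burden described next.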

When a code consists of a long series of such sparse operations, including many different coupled sparse indices with varying types of sparsity, performing such simple refactoring operations as changing the storage order of the indices in several of the contractions becomes an extremely difficult task. While the RI approximation is applied in many contexts in order to reduce the dimensionality of the integral tensors, the developer is still forced to work with tensors containing at least three indices. The increasing sophistication of modern electronic structure methods makes the development of software infrastructure to efficiently handle these objects ever more necessary. While historically most electronic structure methods have been laboriously implemented by manually composing loops and matrix operations, interest in more suitable software paradigms has been steadily growing.

Received: August 28, 2016. Published: January 24, 2017.

Article: J. Chem. Theory Comput. 2017, 13, 1108−1116. DOI: 10.1021/acs.jctc.6b00853. © 2017 American Chemical Society. pubs.acs.org/JCTC

The early work of Windus and Pople12 is perhaps the first recognizably modern approach to tensors in electronic structure software. In more recent times, the development of tensor infrastructure has been dominated by the Tensor Contraction Engine (TCE) project.13 The capabilities of the TCE span the entire process of tensor expression derivation, factorization, and code generation.13,14 The TCE has been used to implement extremely complex electronic structure methods, with a particular emphasis on high-order coupled cluster methods.13,15,16 In an alternate line of development, Epifanovsky et al. have developed the libtensor project, a high-level symbolic C++ library to describe block tensor operations,17 following an early implementation of the C++ block-tensor library powering coupled-cluster methods in Q-Chem since 1998. A similar library project with a greater emphasis on concurrency is the Cyclops Tensor Framework of Solomonik et al.18 Indeed, libtensor was recently extended to massively parallel architectures via fusion with Cyclops. Finally, the TiledArray project of Valeev and Calvin is an actively developed high-level library.19

Quantum chemistry has always had a pressing need for software tools to treat sparse tensors, which appear in virtually all reduced-scaling implementations of electron-correlation methods and also in recently developed high-performance mean-field methods.20,21 The TCE was designed to exploit sparsity arising from spatial, spin, and index-permutational symmetries.13 These TCE features have been employed to design sophisticated treatments of quantum chemical problems that make effective use of these types of sparsity.22 Other early efforts to provide rapid software prototyping capabilities for certain classes of sparse problems23,24 came from Parkhill and Head-Gordon.25 This was followed later by a different formulation from Kats and Manby.26 Recently, there has been a surge of interest in the area of sparse tensor tools. Pinski et al. have formalized the nomenclature and provided an implementation of several commonly used sparse operations.27 Lewis et al. have described a block-sparse tensor library that also derives cost savings from exploitation of rank sparsity within individual blocks.28 Finally, Solomonik et al. have extended the Cyclops Tensor Framework to handle a degree of sparsity and demonstrated this capability in the case of massively parallel contraction of a sparse tensor with a dense tensor.29

In this work, we describe our block-sparse tensor library, an extension of the libtensor project.17 We provide the first detailed description of an implementation capable of handling fully general multi-dimensional sparsity in tensor contractions. Our library has the unique ability to represent universally sparse contractions, in which the output and all input tensors have uniquely defined sparse structures. This capability is provided with the added benefit of a high-level, symbolic C++ implementation. We present concrete numerical evidence that the ability to handle multi-dimensional sparsity provides significant cost reductions relative to utilizing only pair-wise sparsity and that the low overhead of our implementation allows very high floating-point performance to be obtained. In addition, we discuss the algorithms by which our library resolves an inline symbolic tensor expression into nested loops that perform the minimal number of arithmetic operations in a black-box, fully automated fashion.

2. THEORY

2.1. Sparsity in Electronic Structure Theory. Sparsity occurs in many different forms in various mathematical contexts across the field of electronic structure theory. The simplest, "lowest common denominator" form of sparsity is element-wise sparsity: tensor sparsity that occurs without any additional structure or pattern. Element-wise sparsity is tracked using data structures that record significant tensor elements on an individual basis, such as the compressed-sparse-row30 storage format. While exploiting such sparsity reduces the operation count of electronic structure computations, it is advantageous to take advantage of higher-level sparse structure when possible.31,32
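For readers unfamiliar with element-wise formats, a minimal compressed-sparse-row (CSR) matrix and a matrix-vector product over its stored elements are sketched below; this is the standard textbook layout, not code from the library described in this article.

```cpp
#include <cstddef>
#include <vector>

// Minimal compressed-sparse-row (CSR) matrix: only nonzero elements
// are stored, indexed individually rather than in blocks.
struct CsrMatrix {
    std::size_t nrows = 0, ncols = 0;
    std::vector<std::size_t> row_ptr;  // size nrows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> values;        // size nnz
};

// y += A * x, touching only the stored nonzero elements.
void spmv(const CsrMatrix& a, const std::vector<double>& x,
          std::vector<double>& y) {
    for (std::size_t r = 0; r < a.nrows; ++r)
        for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k)
            y[r] += a.values[k] * x[a.col_idx[k]];
}
```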

Many of the most crucial tensors in electronic structure theory exhibit block sparsity, in which the significant tensor elements are clustered together in blocks of varying size and dimensionality. Perhaps the most famous manifestation of block sparsity in the context of electronic structure theory is found in the two-electron integral tensor. The two-electron integrals have a natural block-sparse structure that arises from the shell structure of widely used Gaussian basis sets. The molecular orbital coefficient matrices in localized orbital basis sets also exhibit block-sparse structure due to the use of atom-centered functions in the atomic orbital basis. After the application of appropriate numerical thresholds to define the block-sparse structure of such root tensors, the sparse structure of subsequent tensors may be derived using methods that we discuss. We summarize by noting that, due to the common occurrence of "atom-centered function groups that decay with distance," block-sparse tensors of varying forms are ubiquitous in electronic structure theory.

Electronic structure tensors often also contain higher-level "structured sparsity", which can arise for very different reasons than the decay-with-distance phenomena described above. Point group and spin symmetries give rise to significant sparsity in many electronic structure algorithms, and general purpose libraries have been devised to exploit these properties in a wide class of computations.17 While the focus of this work is on the exploitation of small-block sparsity such as that found in the two-electron integral tensor, the algorithms discussed herein can also be used to treat more structured sparsity through appropriate block definition.

2.2. Block-Sparse Tensor Storage Format. Unfortunately, implementation of block-sparse tensor contractions has historically required time-consuming and error-prone manual coding of each individual contraction. However, several recent high-level library projects have utilized block sparsity in some form,26,28 and our work continues in this vein. In order to obtain high floating-point efficiency through the use of existing highly optimized matrix multiply routines, our library stores tensors using the well-established "block-major order".12,17,26,28

In order to give a clear definition of "block-major order," we clarify several terms. Throughout this document, we use the terms "index" and "subspace" somewhat interchangeably. In general, a subspace of a tensor is a one-dimensional object with a certain dimension (size) and with a particular blocking pattern. An index is a label applied to the elements of a subspace, and indices are grouped inside parentheses to indicate that the subspaces that they label are coupled by sparsity in some way. To illustrate our usage of these terms, we would say that the tensor I(μν)Q is composed of three subspaces and that the indices μ and ν are coupled to form a sparse index group. An index group is either a single dense index or any set of indices coupled by sparsity to one another. We only drop the parentheses convention for sparse index groups in the case of tensors with superscripted indices.


We now define block-major order as follows. First, the subspaces of the tensor are subdivided into blocks; for example, in our implementation of the PARI-K algorithm,21 indices corresponding to atomic orbitals are subdivided by shell (a set of Gaussians with the same center and contraction scheme). Those corresponding to auxiliary basis indices are subdivided by atom, and occupied molecular-orbital basis indices are not subdivided at all and referenced as a single block. A given block of the tensor is specified by a tuple of block indices, hereafter referred to as an "index tuple." Elements of the tensor whose absolute indices fall within the bounds of a given index tuple are stored contiguously in memory. Only nonzero blocks are stored. By convention, the distance between tensor elements corresponding to consecutive block index values of a given subspace decreases with increasing subspace index (this is analogous to "row-major" order for matrices). An example of a tensor stored in this format is shown in Figure 1.

It is somewhat more cumbersome to compute offsets into block-major tensors than into typical "row-major" or "column-major" arrays, and this difficulty is compounded when said tensors are sparse. We note here only that there are many different efficient approaches through which one can compute the offsets of blocks in sparse tensors, including simply a lookup of index tuples using a hash table. Given this fact, in the algorithms described in the following sections, we simply refer generically to the process of "computing block offsets"; one concrete possibility is sketched below.
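As one concrete illustration, the sketch below computes block offsets for the tensor of Figure 1 by storing nonzero blocks back to back and recording each block's offset under its index tuple. The text mentions hash-table lookup as one option; an ordered map is used here purely for brevity, and all names are illustrative.

```cpp
#include <array>
#include <cstddef>
#include <iostream>
#include <map>
#include <vector>

int main() {
    // Blocking of Figure 1: indices nu and sigma split (10, 5, 5),
    // auxiliary index Q split (15, 5).
    const std::vector<std::size_t> nu_sizes{10, 5, 5};
    const std::vector<std::size_t> sigma_sizes{10, 5, 5};
    const std::vector<std::size_t> q_sizes{15, 5};

    // Nonzero index tuples of Figure 1.
    const std::vector<std::array<std::size_t, 3>> nonzero{
        {0, 1, 0}, {1, 1, 1}, {2, 0, 1}};

    std::map<std::array<std::size_t, 3>, std::size_t> offset_of;
    std::size_t offset = 0;
    for (const auto& t : nonzero) {  // blocks stored back to back
        offset_of[t] = offset;
        offset += nu_sizes[t[0]] * sigma_sizes[t[1]] * q_sizes[t[2]];
    }
    // Prints offsets 0, 750, 875 and total stored size 1125 (of 8000).
    for (const auto& [t, off] : offset_of)
        std::cout << t[0] << t[1] << t[2] << " -> " << off << "\n";
    std::cout << "stored elements: " << offset << "\n";
}
```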

2.3. Tensor Sparsity. Historically, the exploitation of sparsity in electronic structure has focused on pair-wise sparse couplings between indices, for example, the overlap sparsity between two atomic orbital indices μλ in the two-electron integrals33 or the locality of the expansion coefficients c(μi) of local molecular orbitals.34 Pinski et al.27 and Riplinger et al.35 have discussed methods for tracking the sparsity relationships between two indices, using a "sparse map" data structure which they describe as two nested lists.35 While the infrastructure they propose is powerful, its pair-wise nature leads to certain inevitable limitations. For example, in order to formulate a local RI expansion using coefficients CiμQ of the localized occupied-virtual pair iμ, Pinski et al. use a pair-wise sparse map between i and the auxiliary basis index Q. However, it would be more natural in this context to map the iμ pair itself to a unique local subset of auxiliary basis functions.

We discuss why it is necessary in many contexts to utilize a higher-dimensional description, which we refer to as "tensor sparsity," in order to capture the full measure of sparsity in many of the tensors appearing in electronic structure. As a strong example, we consider the fit coefficients C(μλQ) arising in the pair resolution of the identity approximation (PARI),20 in which a product of Gaussians, |μAλB), is approximately represented by a fit, $|\widetilde{\mu_A \lambda_B})$, which is an expansion in terms of the union of the sets of auxiliary basis functions centered on the respective atoms, A and B, of the two functions:

$|\widetilde{\mu_A \lambda_B}) = \sum_{Q \in A \cup B} C_{(\mu\lambda Q)}\, |Q)$   (3)

Clearly, the sparse structure of this tensor cannot be described simply by pair-wise maps; the range of the Q index depends upon both other indices. One might expect that issues of this nature could be handled within the existing pair-wise frameworks by relabeling the pair of indices (μν) as a single "combined pair index." However, this does not yield convenient data structures for use in tensor operations in which only one index from the specified index pair couples to the sparsity present in a different tensor. We may readily obtain an example of such a problem by considering the transformation of one index of the PARI fit coefficients into a basis of localized molecular orbitals with the MO coefficients c(λj):

$D_{(\mu j Q)} = \sum_{\lambda} C_{(\mu\lambda Q)}\, c_{(\lambda j)}$   (4)

For this and many other problems, we propose the use of a sparse tree data structure, an example of which is shown in Figure 2. In this data structure, each nonzero index tuple corresponds to a specific lowest-level node of the tree. The individual block index values for the different indices are nested at each appropriate level; thus, the index tuples (0,3,6) and (0,4,2) nest under the common node corresponding to block index 0 of the first subspace. Each significant value of a given subspace index corresponds to a separate subtree. Thus, the entire list of index tuples may be generated by depth-first traversal of the tree, where at each traversal step the path from the root to the leaf node is the current index tuple. The "list of lists" utilized by Pinski et al.27 and Riplinger et al.35 may be regarded, in some sense, as a two-dimensional special case of this data structure.
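A minimal sketch of this data structure follows, assuming (as described for the actual implementation below) that the tree is held as a single sorted, contiguous array of block index tuples; the class and method names are illustrative only.

```cpp
#include <algorithm>
#include <array>
#include <climits>
#include <utility>
#include <vector>

using Tuple3 = std::array<int, 3>;

struct SparseTree3 {
    std::vector<Tuple3> tuples;  // sorted; one entry per nonzero block

    // Depth-first traversal of the conceptual tree is just an in-order
    // walk: each tuple is the root-to-leaf path of one leaf node.
    template <class Fn>
    void for_each(Fn fn) const {
        for (const Tuple3& t : tuples) fn(t);
    }

    // The subtree under a given first-level block index (e.g., tuples
    // (0,3,6) and (0,4,2) under block index 0) is a contiguous run of
    // the sorted array, recoverable by binary search.
    std::pair<std::vector<Tuple3>::const_iterator,
              std::vector<Tuple3>::const_iterator>
    subtree(int first) const {
        auto lo = std::lower_bound(tuples.begin(), tuples.end(),
                                   Tuple3{first, INT_MIN, INT_MIN});
        auto hi = std::lower_bound(tuples.begin(), tuples.end(),
                                   Tuple3{first + 1, INT_MIN, INT_MIN});
        return {lo, hi};
    }
};
```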

Figure 1. An example of the block-major sparse tensor storage format used in our library. The example is a third-rank tensor of AO integrals (i.e., like eq 2) with formal dimensions of (20,20,20). As shown on the left, the ranges of the first two indices (ν and σ) are divided into three blocks (of sizes 10, 5, and 5, respectively), while the third index (Q) is divided into two blocks of sizes 15 and 5. In this fictional tensor, whose 8000 elements are divided into 18 blocks, only the three index tuples (0,1,0), (1,1,1), and (2,0,1) correspond to nonzero blocks. The sizes of the three nonzero blocks are illustrated on the right, and they are stored consecutively in memory.

Figure 2. Conceptual representation of our sparse tree data structure for the blocks that contain nonzero elements, and its literal multi-dimensional representation, for an example of a third-rank tensor.


However, we have found that, in order to obtain high performance, it is necessary to minimize the number of independent memory allocations involved in the construction of a sparse tree. As such, while we find that conceptualizing our data structure as a tree and referring to it as such yields a better understanding of its fundamental operations, the actual implementation stores the tree as a single contiguous allocation of block index tuples, as shown in Figure 2.

In a subsequent section, we define a set of (nearly) linear-scaling operations on these sparse tree data structures. Using these operational definitions, we then prove that no combination of pair-wise sparse maps can recover the full sparsity present in a tensor description; only the tensor description can minimize the number of arithmetic operations in a sparse problem. We provide numerical examples of the size of the cost reduction for practical calculations in our Results and Discussion section.

2.4. Sparse Operations on Tree Data Structures. We now specify a standardized set of operations that may be performed on these sparse tree data structures. The first operation we discuss is "permutation", in which we permute the order of the levels in the tree. This may be trivially implemented with O(N log N) scaling by swapping the corresponding entries in each index tuple, followed by resorting the index tuples into lexicographic order using standard sorting algorithms. The "inversion" operation of Pinski et al., in which the order of the subspace indices in a pair-wise map is reversed, may be regarded as a special case of tree permutation. An example of the permutation operation is shown in Figure 3. This operation is used in many situations, particularly to bring all sparsity information in a tensor contraction into agreement with a common loop ordering.
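A sketch of the permutation operation on the flat sorted-tuple representation introduced above (illustrative code, not the library's API):

```cpp
#include <algorithm>
#include <array>
#include <utility>
#include <vector>

using Tuple3 = std::array<int, 3>;

// Swap two levels of the tree by swapping the corresponding entries
// of every index tuple, then re-sort into lexicographic order;
// O(N log N) in the number of tuples, as stated in the text.
void permute_levels(std::vector<Tuple3>& tuples, int level_a, int level_b) {
    for (Tuple3& t : tuples)
        std::swap(t[level_a], t[level_b]);
    std::sort(tuples.begin(), tuples.end());
}
```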

The second operation that we discuss is "fusion", in which the sparsity of two tensors may be combined to yield a higher-dimensional object. The following example illustrates the utility of this operation. Consider a standard sparse matrix product

$C_{(ij)} = \sum_{k} A_{(ik)}\, B_{(kj)}$   (5)

All indices in the above equation refer to blocks rather than elements, and the parentheses are used to denote nonzero blocks. What is the range of the k index in the inner loop of the contraction in eq 5? Clearly, it depends upon the particular iteration of the outer loops over the i and j indices. We may obtain the appropriate set of k index values for a given set of outer indices by fusing the trees describing A and B, meaning that we join the lowest level of tree A to the highest level of tree B, as shown in Figure 4.

This fusion operation bears a resemblance to the "chaining" operation of Pinski et al., in which two pair-wise sparse maps with a common index are used to generate a pair-wise map between their unshared indices. However, our fusion operation preserves the full multi-dimensional character of all input sparsity, yielding a higher-dimensional tree reflecting the new combined coupling. The space of loop iterations traversed in eq 5 corresponds to the elements of the resulting fused tree, and the set of k index blocks touched for given values of the outer loop indices is given by the appropriate set of leaf nodes.

The fusion operation is extended to multiple indices in the following manner: suppose that we wish to fuse two trees, containing m and n indices, respectively, along k shared indices. First, the right-hand side tree is permuted such that the k fused indices occupy the first k positions, while retaining the relative order of the unfused indices. The process of fusion is then repeated as in Figure 4, with arrows drawn matching left-hand side tree entries with the right-hand side tree entries whose first k indices match. This results in a new tree of order n + m − k; a sketch of the single-index case follows.
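The following sketch shows single-index fusion for the trees of eq 5, again on the flat sorted-tuple representation; it is illustrative code, not the library's implementation.

```cpp
#include <algorithm>
#include <array>
#include <climits>
#include <vector>

using Tuple2 = std::array<int, 2>;
using Tuple3 = std::array<int, 3>;

// Join the lowest level of tree A, pairs (i,k), to the highest level
// of tree B, pairs (k,j), producing the (i,k,j) tree of nonzero block
// contributions to eq 5.
std::vector<Tuple3> fuse(const std::vector<Tuple2>& a,   // sorted (i,k)
                         const std::vector<Tuple2>& b) { // sorted (k,j)
    std::vector<Tuple3> fused;
    for (const Tuple2& ik : a) {
        // Entries of B sharing the value k form a contiguous run of
        // the sorted array, located by binary search.
        auto it = std::lower_bound(b.begin(), b.end(), Tuple2{ik[1], INT_MIN});
        for (; it != b.end() && (*it)[0] == ik[1]; ++it)
            fused.push_back({ik[0], ik[1], (*it)[1]});
    }
    std::sort(fused.begin(), fused.end());  // keep canonical tuple order
    return fused;
}
```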

Finally, we discuss the "reduction" operation. Reduction involves the removal of a given subspace from a tree, yielding a lower-dimensional tree. It is implemented in a near-linear-scaling fashion in several stages. First, the appropriate entry is deleted from each index tuple. The index tuples are then resorted, and the duplicate entries which may have been introduced are removed. An example of the reduction operation is given in Figure 5.
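A corresponding sketch of reduction over one level of a rank-3 tree, under the same illustrative representation:

```cpp
#include <algorithm>
#include <array>
#include <vector>

using Tuple2 = std::array<int, 2>;
using Tuple3 = std::array<int, 3>;

// Remove one subspace from a rank-3 tree, then re-sort and
// deduplicate the resulting rank-2 tuples.
std::vector<Tuple2> reduce(const std::vector<Tuple3>& tree, int level) {
    std::vector<Tuple2> out;
    out.reserve(tree.size());
    for (const Tuple3& t : tree) {      // delete one entry per tuple
        Tuple2 r;
        for (int s = 0, d = 0; s < 3; ++s)
            if (s != level) r[d++] = t[s];
        out.push_back(r);
    }
    std::sort(out.begin(), out.end());  // restore lexicographic order
    out.erase(std::unique(out.begin(), out.end()), out.end());
    return out;
}
```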

These three tree structure operations together provide an approach for implementing sparse tensor contraction in a general way. Referring back to eq 5 as a simple example, one first performs appropriate permutations of A and B, followed by the fusion of their trees to implicitly define (though not explicitly form) a third-rank tensor, F(ikj) = A(ik) * B(kj). Nonzero blocks of the fused tensor implicitly contain element-wise multiplications between the nonzero blocks of A and B, although we form only the tree describing the sparsity of this tensor rather than the elements themselves. Finally, reduction over k blocks defines the block structure of the output product, C(ij) = Trk[F(ikj)]. While reduction is formally a partial trace (i.e., summation), in practice, this is recombined with the delayed element-wise multiplications associated with fusion and implemented as matrix multiplication.
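The sketch below strings these ideas together for eq 5: matching on k plays the role of the fused (i,k,j) tree, and accumulation into C's blocks realizes the reduction over k. Names and layout are illustrative assumptions; a production code would drive a dgemm over matching block runs instead of the naive triple loop shown.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <vector>

using Tuple2 = std::array<int, 2>;
using Block  = std::vector<double>;  // row-major dense block

struct BlockMatrix {
    std::vector<int> row_sizes, col_sizes;  // block dimensions per subspace
    std::map<Tuple2, Block> blocks;         // nonzero blocks only
};

// Block-sparse product C(ij) = sum_k A(ik) B(kj) of eq 5.
void contract_eq5(const BlockMatrix& a, const BlockMatrix& b, BlockMatrix& c) {
    for (const auto& [ik, ablk] : a.blocks) {
        const int i = ik[0], k = ik[1];
        for (const auto& [kj, bblk] : b.blocks) {
            if (kj[0] != k) continue;  // only fused-tree entries survive
            const int j = kj[1];
            const int m = a.row_sizes[i], n = b.col_sizes[j], l = a.col_sizes[k];
            Block& cblk = c.blocks[{i, j}];  // reduction over k merges here
            if (cblk.empty()) cblk.assign(static_cast<std::size_t>(m) * n, 0.0);
            // Production code replaces this triple loop with a dgemm call.
            for (int p = 0; p < m; ++p)
                for (int q = 0; q < n; ++q)
                    for (int r = 0; r < l; ++r)
                        cblk[p * n + q] += ablk[p * l + r] * bblk[r * n + q];
        }
    }
}
```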

Figure 3. Permutation of indices 1 and 2 (i.e., the second and third columns) of a sparse tree.

Figure 4. Fusion of two lower-dimensional sparse trees which share a common index to form a higher-dimensional tree. In this example, two two-dimensional tensors with a common index type (lowest level of tree A; highest level of tree B) are fused to make a three-dimensional sparse tensor. One application of the fusion operation can be seen from eq 5: the fused three-dimensional tensor represents nonzero block contributions to the sparse matrix multiply.


2.5. Emergent Sparsity. We now highlight an interesting special case of the exploitation of sparsity in the context of electronic structure theory, though it surely arises in other contexts also. If a given tensor (Tensor 1) is contracted with another tensor (Tensor 2) sharing two or more of its indices, it is clear that we should utilize the sparse coupling between those shared indices, as present in Tensor 2, not only in the latter contraction but also "pre-emptively" in the formation of Tensor 1. By doing so, only a small part of Tensor 1 must be constructed, resulting in great computational savings. If we do not, we would be forming elements of Tensor 1 which, while they could be numerically significant, would be multiplied by zero values when they were actually utilized in the subsequent contraction. It is thus often advantageous, when possible, to impose a sparse structure on the output of a tensor contraction (in this example, Tensor 1) that is somewhat independent of the sparse structure of its input tensors.

The above discussion is perhaps a bit abstract, so we now provide a real-world example. In a separate publication, we discuss the concurrent use of the occ-RI-K36 and PARI-K21 approximations in the context of SCF for molecular interactions (SCF-MI).37 The SCF-MI method is defined by the construction of a minimum-energy Slater determinant from absolutely localized molecular orbitals (ALMOs).37 Each ALMO is expanded only in terms of the atomic orbitals residing on its respective molecular fragment; this results in a block-diagonal MO coefficient matrix and gives many of the tensors in the method additional block-sparse structure.

The mixed representation of the Hartree−Fock exchange matrix consisting of occupied orbitals (index i) and atomic orbitals (index ν) needed for the occ-RI-K method using nonorthogonal ALMOs is

$K_{i\nu} = \sum_{j} (i\, j^{\bullet} \mid j_{\bullet}\, \nu)$   (6)

We have used the standard superscript notation to indicate that the first occurrence of the j index is contravariant.38 Our new method utilizes the Dunlap robust fit functional to remove the first-order fitting errors associated with the PARI-K approximation:39

$(\mu\lambda \mid \nu\sigma) \approx (\widetilde{\mu\lambda} \mid \nu\sigma) + (\mu\lambda \mid \widetilde{\nu\sigma}) - (\widetilde{\mu\lambda} \mid \widetilde{\nu\sigma})$   (7)

In the context of evaluating the exchange matrix, the expression corresponding to the third term in the Dunlap functional is therefore

$(\widetilde{\nu\, j_{\bullet}} \mid \widetilde{j^{\bullet}\, i}) = \sum_{PQ} (D'')^{\bullet j}_{i \bullet P}\, (P \mid Q)\, D_{(\nu j Q)}$   (8)

In this expression, $(D'')^{\bullet j}_{i\bullet P}$ corresponds to the fully MO-transformed fit coefficients (with one of the MO indices contravariant due to the nonorthogonality of the absolutely localized molecular orbitals),37 while D(νjQ) corresponds to the half-transformed fit coefficients. The combined expression is evaluated in two steps. In the first step, the fully MO-transformed RI fit coefficients are contracted with the (nonsparse) auxiliary basis Coulomb integrals:

$E^{\bullet j}_{i \bullet Q} = \sum_{P} (D'')^{\bullet j}_{i \bullet P}\, (P \mid Q)$   (9)

This intermediate is Tensor 1 in the context of the earlier abstract discussion. Even if $(D'')^{\bullet j}_{i\bullet P}$ has only a linear number of nonzero elements, because (P|Q) is nonsparse, the product has a quadratic number of significant elements (i.e., index Q is uncoupled from i and j). In the second step, the resulting tensor, $E^{\bullet j}_{i\bullet Q}$, is contracted with the half-transformed fit coefficients (playing the role of Tensor 2) and used to decrement the economized exchange matrix in a contraction that has emergent sparsity:

$-K_{i\nu} = \sum_{jQ} E^{\bullet j}_{i \bullet Q}\, D_{(\nu j Q)}$   (10)

The emergent sparsity comes from the fact that the summed index pair jQ is sparse in D(νjQ) (Tensor 2), even though it is dense in $E^{\bullet j}_{i\bullet Q}$. Thus, in the contraction, for every j value, only a constant number of Q blocks will be significant. We therefore impose this sparse coupling between j and Q on $E^{\bullet j}_{i\bullet Q}$, rendering the first contraction linear-scaling to form just the required blocks; a sketch of imposing such output sparsity follows.
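The sketch below illustrates imposing output sparsity: only the (j, Q) blocks present in the sparse tree of the downstream tensor D(νjQ) are ever formed. The jq_tree container and all function names are illustrative assumptions, not the library's interface.

```cpp
#include <array>
#include <set>

using Tuple2 = std::array<int, 2>;

// Returns true if block (j, Q) survives the imposed output sparsity,
// i.e., it appears in the sparse tree of the downstream tensor D.
bool keep_block(const std::set<Tuple2>& jq_tree, int j, int q) {
    return jq_tree.count({j, q}) != 0;
}

// Driver: loop over all candidate (i, Q, j) blocks of the eq 9
// intermediate E, but skip the (quadratically many) blocks that the
// second contraction (eq 10) would multiply by zero. What remains is
// linear-scaling in system size.
void form_e(int nblk_i, int nblk_q, int nblk_j,
            const std::set<Tuple2>& jq_tree /* from D's tree */) {
    for (int i = 0; i < nblk_i; ++i)
        for (int j = 0; j < nblk_j; ++j)
            for (int q = 0; q < nblk_q; ++q) {
                if (!keep_block(jq_tree, j, q)) continue;
                // ... contract (D'')(i,P,j) with (P|Q) for this block ...
            }
}
```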

Due to cases such as this, it is desirable that any sparse tensor framework allow the imposition of arbitrary sparse structure on the output of any tensor contraction. Local correlation models represent a related example, where sparsity is imposed based on a model that is far more drastic than would be obtained based on strict numerical thresholding.

2.6. Tensor Sparsity vs Pair-Wise Sparsity. Utilizing multi-dimensional sparsity brings additional code complexity if not handled appropriately. We therefore provide definitive proof that exploiting sparsity in the full multi-dimensional form can reduce the operation count beyond what is possible through consideration of pair-wise sparsity alone and at worst will recover the performance of the pair-wise implementation. Once again, we consider as an example the local density fitting expansion for the atomic orbital product μν. The most economical expansion naturally takes the form of some set of auxiliary functions Q centered near the μν product. However, to avoid the additional complexity of dealing with high-dimensional sparsity, many local fitting codes, such as the ARI-K algorithm of Sodt and Head-Gordon,40 simply associate the fit domain of Q values with a single AO index, either μ or ν. The most economical pair-wise approximate description of the multi-dimensional sparsity inherent in our RI coefficient tensor is obtained by reduction of the appropriate excluded tree index using the definition of the reduction operation from the previous section. For example, to obtain the (μ,Q) pair-wise map, we may reduce over the ν index of the (μ,ν,Q) tree. The use of any pair-wise map that does not contain all elements resulting from this reduction operation amounts to making further physical approximations with respect to the multi-dimensional case, approximations that may or may not be justified.

Figure 5. Reduction over the second subspace of a rank-3 tree to form a two-dimensional tree.

We consider now the problem of looping over the significant elements of the fit coefficient tensor C(μνQ) corresponding to the sparse tree shown in Figure 6. While in this simplified example the simplest solution would be to loop over the tensor elements in order without consideration of sparsity at all, this is clearly not possible if another tensor must be accessed concurrently as, for example, in the case of contraction with another tensor Tνj. If we are restricted to using pair-wise maps within our nested loops, then the minimum-operation solution that retains the same physical content as the multi-dimensional loop is to run one outer loop over its full index range and restrict every inner loop by the applicable pair-wise sparse map(s) obtained by reduction over the appropriate subspace(s) of the full multi-dimensional tree. We may recover the space of iterations performed by such nested pair-wise loops using our previously defined tree fusion operation. For example, given the loop nesting order μ,ν,Q, we first reduce each subspace of our tree individually to obtain the pair-wise maps (μ,ν), (μ,Q), and (ν,Q). Then, we fuse the (μ,ν) and (ν,Q) trees to produce a new tree suitable for looping over all three indices. We then prune the resulting tree by performing two-index fusion between it and our (μ,Q) tree. In this case, the two-index fusion amounts to an "intersection" between the two trees, in the same sense as defined by Pinski et al. The resulting tree, the final tree in Figure 6, represents the full space of iterations that would be performed by a set of nested loops over the pair-wise sparse maps. We have highlighted the red element to indicate an increase in the number of operations relative to the multi-dimensional case from which we started. We further note that any fused tree composed from pair-wise maps that corresponds to a set of nested loops over all indices in the tensor must contain all entries from the original multi-dimensional tree. The combination of these two facts proves that the cost of the pair-wise implementation is an upper bound to the cost of the multi-dimensional implementation; a small worked example follows in code. We give concrete numerical examples of the cost savings associated with exploitation of multi-dimensionality in the Results and Discussion section.
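The following self-contained example makes the argument concrete on invented data: reducing a three-tuple tree to its three pair-wise maps and then rebuilding the iteration space by fusion and intersection yields one spurious tuple, the analogue of the red element in Figure 6.

```cpp
#include <array>
#include <iostream>
#include <set>

using T3 = std::array<int, 3>;
using T2 = std::array<int, 2>;

int main() {
    // A tiny (mu, nu, Q) tree; the data is invented for illustration.
    const std::set<T3> tree{{0, 0, 0}, {1, 0, 1}, {1, 1, 0}};

    std::set<T2> mn, nq, mq;  // the three pair-wise reductions
    for (const T3& t : tree) {
        mn.insert({t[0], t[1]});
        nq.insert({t[1], t[2]});
        mq.insert({t[0], t[2]});
    }

    // Fuse (mu,nu) with (nu,Q) on the shared nu, then intersect with
    // the (mu,Q) map: the space nested pair-wise loops would visit.
    std::set<T3> pairwise;
    for (const T2& ab : mn)
        for (const T2& bc : nq)
            if (ab[1] == bc[0] && mq.count({ab[0], bc[1]}))
                pairwise.insert({ab[0], ab[1], bc[1]});

    for (const T3& t : pairwise)
        if (!tree.count(t))  // prints the spurious tuple (1,0,0)
            std::cout << "extra: " << t[0] << t[1] << t[2] << "\n";
}
```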

2.7. Automatic Loop Generation: A Fully Symbolic Library. In the interest of providing easily extensible, maintainable implementations of electronic structure methods, we provide a library that does not require the explicit coding of any loops over indices or cumbersome machine-generated code. The libtensor17 library has provided such functionality for some time for dense tensors; in the present work, we have extended this library to treat block-sparse tensors. Our library utilizes C++ operator overloading to enable parsing of in-code symbolic sparse tensor expressions, allowing problems such as a fully sparse integral transformation to be converted into a single line of C++ code, as shown in Figure 7.

Our algorithm begins by choosing a loop order that traverses all indices in the problem. The problem of optimally ordering sparse nested loops is very complex. In the interest of simplicity of the initial implementation, we make only one potential rearrangement to the index order given in the contraction expression; namely, we decide whether or not the loops over contracted or uncontracted indices should be the outer loops. We make a crude estimate of the performance trade-off between these two choices by comparing the product of the contracted dimensions with the product of the uncontracted dimensions; a sketch of this heuristic is given below. This will favor reuse of input tensor elements in cases in which many indices are contracted, meaning that the output tensor is of much lower order than the input tensors. This procedure is neither sophisticated nor optimal, and a more advanced treatment is desirable. However, we find that we are able to obtain good performance with the present implementation.
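A sketch of this heuristic, with illustrative function and parameter names:

```cpp
#include <cstddef>
#include <vector>

// Place the contracted indices outermost when the product of the
// contracted dimensions exceeds the product of the uncontracted ones;
// this favors reuse of input tensor elements when the output tensor
// is of much lower order than the inputs.
bool contracted_loops_outermost(
    const std::vector<std::size_t>& contracted_dims,
    const std::vector<std::size_t>& uncontracted_dims) {
    std::size_t c = 1, u = 1;
    for (std::size_t d : contracted_dims)   c *= d;
    for (std::size_t d : uncontracted_dims) u *= d;
    return c > u;
}
```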

Our automatic loop generation code must handle an arbitrary number of indices across both sparse and dense tensors. In order to generate a canonical set of loops over all indices, we must first fuse all coupled sparse trees in the input and output tensors. The algorithm performing this process is summarized in Figure 8. A given loop will traverse either a single sparse tree produced by the above loop fusion step (and therefore possibly traverse many indices at once) or a single index that is dense in its occurrences in all tensors in the expression. While one could represent dense tensors as sparse trees with all block combinations included, this would introduce undesirable overhead.

Figure 6. Example of the advantage of multi-dimensional sparsity relative to pair-wise sparsity. The red element is erroneously traversed in nested loops over the combined pair-wise sparsity but not over the multi-dimensional original.

Figure 7. Conversion of symbolic tensor contraction expression to C++ code, as deployed in our development version of Q-Chem.


Having obtained a canonical set of sparse trees, we prepare our nested loops to be run in two phases. First, we merge loops over indices that are coupled as members of the same sparse tree (note that they need not all be members of the same tensor) into a "loop group." Universally dense indices each constitute a loop group unto themselves. Once the loop groups have been formed, we match each loop group to the list of all (tensor, index group) pairs that it touches. Finally, we generate the list of offsets into each (tensor, index group) pair for every iteration of the loop group. The loop grouping process is summarized in Figure 9. It should be noted that our design does not allow interspersing loops over dense indices between loops over coupled sparse indices. This is an undesirable limitation of our library, and one which we are currently working to correct.

Finally, we proceed to execute the nested loops themselves. Our implementation uses a recursive algorithm in which each loop group corresponds to a unique recursion level. Our loop list execution routine takes a block kernel parameter, allowing the same loop traversal algorithm to implement many different block-level operations such as permutation and contraction. The algorithm by which we traverse all tensors and perform the requested operations in terms of individual block operations is summarized in Figure 10; a recursive sketch is given below.
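A recursive sketch of this execution scheme, with illustrative types (the library's real interface differs):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// One recursion level per loop group; the block kernel is invoked
// once per complete assignment of block indices.
struct LoopGroup {
    std::vector<int> block_values;  // significant block indices to visit
};

using BlockKernel = std::function<void(const std::vector<int>&)>;

void run_loops(const std::vector<LoopGroup>& groups, std::size_t level,
               std::vector<int>& current, const BlockKernel& kernel) {
    if (level == groups.size()) {   // innermost: apply the block kernel
        kernel(current);            // e.g., a block permutation or dgemm
        return;
    }
    for (int b : groups[level].block_values) {
        current.push_back(b);
        run_loops(groups, level + 1, current, kernel);
        current.pop_back();
    }
}
```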

An unfortunate side effect of this implementation is that the trees representing coupled sparsity across multiple tensors are held explicitly in memory and looped over. These trees can become very large due to their high dimensionality, in some cases comparable in size to some of the input tensors themselves. We have nonetheless successfully run jobs with over 10,000 basis functions in augmented basis sets on nodes with 256−512 GB of RAM, demonstrating that our implementation may be used in practical calculations. Work on an alternate formulation of the fusion steps that avoids this relatively high memory usage is in progress.

3. RESULTS AND DISCUSSION

3.1. Multi-Dimensional Sparsity: Numerical Performance. We now provide numerical examples of the advantages of utilizing the full multi-dimensional index coupling present in higher-order tensors. We examine a series of globular water clusters, an example of which is shown in Figure 11. For each of these clusters, we compute the half-transformed PARI fit coefficient tensor D(μjQ) given in eq 4. We then zero all blocks of D(μjQ) in which no element of a given block exceeds a threshold of 10^−6 (based on the second-order errors of the Dunlap robust fit,21 we may use such a loose threshold to screen RI fit coefficients).

We now compare the operation count of a set of loops traversing the significant blocks of D(μjQ) using the full three-dimensional tree versus the operation count yielded by using nested pair-wise sparse maps. As discussed in the Theory section, the nested loops over pair-wise sparse maps traverse a set of index tuples corresponding to the entries of a sparse tree created by a two-step process. First, we generate three different two-dimensional sparse trees by reducing the full D(μjQ) tree over all three different indices. These trees, formed by reducing over indices 0, 1, and 2, are labeled (jQ), (μQ), and (μj), respectively. We then fuse these trees together in sequence over the appropriate shared indices to recreate a new "pair-wise" tree. As the numerical results in Table 1 demonstrate, the result is that we lose a very large factor of sparsity. Maintaining a fully multi-dimensional description of tensor sparsity is therefore necessary in order to keep arithmetic operations to a minimum.

3.2. Tensor Contraction Performance. We have already employed the sparse tensor library in this work in the context of implementing two high-performance applications,21,36 interfacing with the Q-Chem electronic structure program,43 and are in the process of reporting further applications in a separate publication. In order to highlight the generality of our library, we consider examples of three types of contractions: [sparse] = [sparse] × [sparse], [sparse] = [sparse] × [dense], and [dense] = [sparse] × [sparse]. Our chosen examples of these contraction types are taken from the preceding SCF-MI equations and are summarized in eqs 11−13 in the order specified above.

Figure 8. Algorithm for performing fusion of sparsity across all tensors to form a canonical set of trees.

Figure 9. Algorithm for grouping loops over indices by sparse coupling.

Figure 10. Nested loop execution algorithm for general sparse tensors.

Figure 11. A 46-molecule water cluster used in our performance benchmarks.

$D_{(\mu j Q)} = \sum_{\lambda} C_{(\mu\lambda Q)}\, c_{(\lambda j)}$   (11)

$E^{\bullet j}_{i \bullet Q} = \sum_{P} (D'')^{\bullet j}_{i \bullet P}\, (P \mid Q)$   (12)

$-K_{i\nu} = \sum_{jQ} E^{\bullet j}_{i \bullet Q}\, D_{(\nu j Q)}$   (13)

As a further illustration of the benefits yielded by the abstract symbolic nature of our library, our implementation uses only a few lines of literal C++ code to implement these three contractions with full exploitation of sparsity.
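The listing itself is rendered as an image in the original article and does not survive in this transcript; the sketch below suggests its flavor using the letter-index expression style of libtensor-like symbolic libraries. The stand-in declarations exist only so the example compiles; every identifier here is an assumption, not the article's actual code.

```cpp
// Stand-in declarations; in the real library the expression machinery
// comes from libtensor headers rather than these mock types.
struct letter {};
struct label {};
inline label operator|(const letter&, const letter&) { return {}; }
inline label operator|(label, const letter&) { return {}; }
struct expr {
    expr& operator-=(const expr&) { return *this; }
};
struct tensor {
    expr operator()(label) const { return {}; }
};
inline expr contract(const letter&, const expr&, const expr&) { return {}; }
inline expr contract(label, const expr&, const expr&) { return {}; }

int main() {
    letter mu, lam, nu, i, j, P, Q;
    tensor C, c_mo, Dpp, coul;  // inputs on the right-hand sides
    tensor D, E, K;             // outputs

    // One symbolic line per contraction, eqs 11-13:
    D(mu | j | Q) = contract(lam, C(mu | lam | Q), c_mo(lam | j)); // eq 11
    E(i | Q | j)  = contract(P, Dpp(i | P | j), coul(P | Q));      // eq 12
    K(i | nu)    -= contract(j | Q, E(i | Q | j), D(nu | j | Q));  // eq 13
}
```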

As a test system for benchmarking these tensor contractions, we have chosen linear stacked benzene ring structures, such as that shown in Figure 12. The spacing between consecutive benzene rings is 3.8 Å. We avoid the traditional linear benchmarks such as alkanes in this case because ALMOs are traditionally defined only for nonbonded fragments. For stack lengths of 10, 20, and 30 rings, we time the first SCF iteration starting from a guess ALMO coefficient matrix consisting of the block-diagonal combination of the monomer MO coefficient matrices. Our timing results are shown in Figure 13. The floating-point performance of each contraction for the largest stack was measured as 1.3, 4.3, and 2.0 GFLOPS, respectively. These rates correspond respectively to 25%, 80%, and 38% of single-core peak floating-point performance (5.4 GFLOPS, as gauged by the performance of a 10,000 × 10,000 dense matrix multiplication using the Intel MKL library) for the single AMD Opteron 6376 processor on which these calculations were run. Thus, our tensor library captures the linear-scaling behavior inherent in the sparse nature of these structures and does so while retaining high floating-point performance. This leaves the SCF-MI calculations dominated by the three-center AO integral calculation, the optimization of which we discuss in future work.

The library implementation presented in this work is not yet parallel. We believe the most natural approach to implementing shared-memory parallelism for the operations that we have described is to choose an index of the output tensor and divide its significant blocks among multiple processors. All of the loop generation and execution operations may then be performed as before, with each processor using a set of sparse trees that have been "pruned" so that the divided index falls within the appropriate range specified for that processor. Such parallelization should suffice for calculations of moderate size; extension to parallelization over more indices is technically more difficult but certainly possible.

4. CONCLUSIONS

We have described a highly abstract, high-performance implementation of common block-sparse tensor operations. Subject to definition of appropriate block structures, our implementation is capable of utilizing truly arbitrary sparsity to reduce operation count without sacrificing floating-point performance. We have demonstrated the capabilities of this library in the context of a high-performance SCF-MI code fully described in a separate publication. The source code for our block-sparse library is available on GitHub as part of the libtensor project at https://github.com/epifanovsky/libtensor. Work on adding greater concurrency support to the library is also in progress. We hope that the capabilities of our library can aid in the development of more complex electronic structure methods that either exploit locality or impose local models.

Table 1. Percentage of Nonzero Elements in the Half-Transformed PARI Fit Coefficients D(μjQ) Defined in eq 4, Illustrated in Terms of Different Sparse Trees^a

system       D(μjQ) % dense   (μj) % dense   (jQ) % dense   (μQ) % dense   pair-wise % dense
water 46     1.0              68             96             34             28
water 71     0.6              55             88             28             19
water 127    0.3              41             71             20             10

^a Percentages are based on block tuples, not on elements. The discrepancy between the first column and the "pair-wise" column is the sparsity lost by utilizing only a pair-wise representation. The aug-cc-pVTZ41 basis set was used in conjunction with the cc-pVTZ-JK RI fitting basis set,42 and the half-transformed coefficients were screened with a threshold of 10^−6.

Figure 12. Stacked benzene molecules used to benchmark linear-scaling contractions.

Figure 13. Timings for all three contractions from eqs 11−13. Benzene stack calculations used the aug-cc-pVTZ41 basis set with an integral threshold of 10^−10. Reported times are for the first SCF iteration.


■ AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected].

ORCID
Martin Head-Gordon: 0000-0002-4309-6669

Notes
The authors declare the following competing financial interest(s): M.H.-G. and A.I.K. are part-owners of Q-Chem, Inc.

■ ACKNOWLEDGMENTS

Initial support for this work was provided by the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences, with additional support from Q-Chem Inc. through an NIH Subcontract from Grant 2 R44 GM096678-02 and a subcontract from MURI Grant W911NF-14-1-0359. M.H.-G. and A.I.K. are part-owners of Q-Chem Inc.

■ REFERENCES

(1) Cremer, D.; Gauss, J. J. Comput. Chem. 1986, 7, 274−282.
(2) Häser, M.; Ahlrichs, R. J. Comput. Chem. 1989, 10, 104−111.
(3) Bartlett, R. J. Annu. Rev. Phys. Chem. 1981, 32, 359−401.
(4) Jung, Y.; Shao, Y.; Head-Gordon, M. J. Comput. Chem. 2007, 28, 1953−1964.
(5) Huntington, L. M. J.; Hansen, A.; Neese, F.; Nooijen, M. J. Chem. Phys. 2012, 136, 064101.
(6) Boys, S. F. Rev. Mod. Phys. 1960, 32, 296−299.
(7) Pipek, J.; Mezey, P. G. J. Chem. Phys. 1989, 90, 4916.
(8) Pulay, P. Chem. Phys. Lett. 1983, 100, 151−154.
(9) Subotnik, J. E.; Dutoi, A. D.; Head-Gordon, M. J. Chem. Phys. 2005, 123, 114108.
(10) Vahtras, O.; Almlöf, J.; Feyereisen, M. Chem. Phys. Lett. 1993, 213, 514−518.
(11) Feyereisen, M.; Fitzgerald, G.; Komornicki, A. Chem. Phys. Lett. 1993, 208, 359−363.
(12) Windus, T. L.; Pople, J. A. Int. J. Quantum Chem. 1995, 56, 485−495.
(13) Hirata, S. J. Phys. Chem. A 2003, 107, 9887−9897.
(14) Hirata, S. Theor. Chem. Acc. 2006, 116, 2−17.
(15) Hirata, S. J. Chem. Phys. 2004, 121, 51.
(16) Shiozaki, T.; Hirao, K.; Hirata, S. J. Chem. Phys. 2007, 126, 244106.
(17) Epifanovsky, E.; Wormit, M.; Kuś, T.; Landau, A.; Zuev, D.; Khistyaev, K.; Manohar, P.; Kaliman, I.; Dreuw, A.; Krylov, A. I. J. Comput. Chem. 2013, 34, 2293−2309.
(18) Solomonik, E.; Matthews, D.; Hammond, J. R.; Demmel, J. Cyclops Tensor Framework: Reducing communication and eliminating load imbalance in massively parallel contractions. IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2013; pp 813−824.
(19) Calvin, J.; Valeev, E. TiledArray: A massively-parallel, block-sparse tensor library written in C++. https://github.com/valeevgroup/tiledarray/ (accessed 2015).
(20) Merlot, P.; Kjærgaard, T.; Helgaker, T.; Lindh, R.; Aquilante, F.; Reine, S.; Pedersen, T. B. J. Comput. Chem. 2013, 34, 1486−1496.
(21) Manzer, S. F.; Epifanovsky, E.; Head-Gordon, M. J. Chem. Theory Comput. 2015, 11, 518−527.
(22) Baumgartner, G.; Auer, A.; Bernholdt, D. E.; Bibireata, A.; Choppella, V.; Cociorva, D.; Gao, X.; Harrison, R. J.; Hirata, S.; Krishnamoorthy, S.; Krishnan, S.; Lam, C.-C.; Lu, Q.; Nooijen, M.; Pitzer, R. M.; Ramanujam, J.; Sadayappan, P.; Sibiryakov, A. Synthesis of High-Performance Parallel Programs for a Class of Ab Initio Quantum Chemistry Models. Proc. IEEE 2005, 93, 276−292.
(23) Parkhill, J. A.; Lawler, K.; Head-Gordon, M. J. Chem. Phys. 2009, 130, 084101.
(24) Parkhill, J.; Head-Gordon, M. J. Chem. Phys. 2010, 133, 024103.
(25) Parkhill, J. A.; Head-Gordon, M. Mol. Phys. 2010, 108, 513−522.
(26) Kats, D.; Manby, F. R. J. Chem. Phys. 2013, 138, 144101.
(27) Pinski, P.; Riplinger, C.; Valeev, E. F.; Neese, F. J. Chem. Phys. 2015, 143, 034108.
(28) Lewis, C. A.; Calvin, J. A.; Valeev, E. F. arXiv:1510.01156, 2015.
(29) Solomonik, E.; Hoefler, T. arXiv:1512.00066, 2015.
(30) Gustavson, F. G. ACM Trans. Math. Software 1978, 4, 250−269.
(31) Challacombe, M. Comput. Phys. Commun. 2000, 128, 93−107.
(32) Saravanan, C.; Shao, Y.; Baer, R.; Ross, P. N.; Head-Gordon, M. J. Comput. Chem. 2003, 24, 618−622.
(33) Strout, D. L.; Scuseria, G. E. J. Chem. Phys. 1995, 102, 8448.
(34) Maslen, P. E.; Ochsenfeld, C.; White, C. A.; Lee, M. S.; Head-Gordon, M. J. Phys. Chem. A 1998, 102, 2215−2222.
(35) Riplinger, C.; Pinski, P.; Becker, U.; Valeev, E. F.; Neese, F. J. Chem. Phys. 2016, 144, 024109.
(36) Manzer, S.; Horn, P. R.; Mardirossian, N.; Head-Gordon, M. J. Chem. Phys. 2015, 143, 024113.
(37) Khaliullin, R. Z.; Head-Gordon, M.; Bell, A. T. J. Chem. Phys. 2006, 124, 204105.
(38) Head-Gordon, M.; Lee, M.; Maslen, P.; Van Voorhis, T.; Gwaltney, S. In Modern Methods and Algorithms of Quantum Chemistry; John von Neumann Institute for Computing, 2000; Vol. 3, pp 593−638.
(39) Dunlap, B. J. Mol. Struct.: THEOCHEM 2000, 529, 37−40.
(40) Sodt, A.; Head-Gordon, M. J. Chem. Phys. 2008, 128, 104106.
(41) Kendall, R. A.; Dunning, T. H.; Harrison, R. J. J. Chem. Phys. 1992, 96, 6796.
(42) Weigend, F. Phys. Chem. Chem. Phys. 2002, 4, 4285−4291.
(43) Shao, Y.; Gan, Z.; Epifanovsky, E.; Gilbert, A. T.; Wormit, M.; Kussmann, J.; Lange, A. W.; Behn, A.; Deng, J.; Feng, X.; Ghosh, D.; Goldey, M.; Horn, P. R.; Jacobson, L. D.; Kaliman, I.; Khaliullin, R. Z.; Kuś, T.; Landau, A.; Liu, J.; Proynov, E. I.; Rhee, Y. M.; Richard, R. M.; Rohrdanz, M. A.; Steele, R. P.; Sundstrom, E. J.; Woodcock, H. L.; Zimmerman, P. M.; Zuev, D.; Albrecht, B.; Alguire, E.; Austin, B.; Beran, G. J. O.; Bernard, Y. A.; Berquist, E.; Brandhorst, K.; Bravaya, K. B.; Brown, S. T.; Casanova, D.; Chang, C.-M.; Chen, Y.; Chien, S. H.; Closser, K. D.; Crittenden, D. L.; Diedenhofen, M.; DiStasio, R. A.; Do, H.; Dutoi, A. D.; Edgar, R. G.; Fatehi, S.; Fusti-Molnar, L.; Ghysels, A.; Golubeva-Zadorozhnaya, A.; Gomes, J.; Hanson-Heine, M. W.; Harbach, P. H.; Hauser, A. W.; Hohenstein, E. G.; Holden, Z. C.; Jagau, T.-C.; Ji, H.; Kaduk, B.; Khistyaev, K.; Kim, J.; Kim, J.; King, R. A.; Klunzinger, P.; Kosenkov, D.; Kowalczyk, T.; Krauter, C. M.; Lao, K. U.; Laurent, A.; Lawler, K. V.; Levchenko, S. V.; Lin, C. Y.; Liu, F.; Livshits, E.; Lochan, R. C.; Luenser, A.; Manohar, P.; Manzer, S. F.; Mao, S.-P.; Mardirossian, N.; Marenich, A. V.; Maurer, S. A.; Mayhall, N. J.; Neuscamman, E.; Oana, C. M.; Olivares-Amaya, R.; O'Neill, D. P.; Parkhill, J. A.; Perrine, T. M.; Peverati, R.; Prociuk, A.; Rehn, D. R.; Rosta, E.; Russ, N. J.; Sharada, S. M.; Sharma, S.; Small, D. W.; Sodt, A.; Stein, T.; Stuck, D.; Su, Y.-C.; Thom, A. J.; Tsuchimochi, T.; Vanovschi, V.; Vogt, L.; Vydrov, O.; Wang, T.; Watson, M. A.; Wenzel, J.; White, A.; Williams, C. F.; Yang, J.; Yeganeh, S.; Yost, S. R.; You, Z.-Q.; Zhang, I. Y.; Zhang, X.; Zhao, Y.; Brooks, B. R.; Chan, G. K.; Chipman, D. M.; Cramer, C. J.; Goddard, W. A.; Gordon, M. S.; Hehre, W. J.; Klamt, A.; Schaefer, H. F.; Schmidt, M. W.; Sherrill, C. D.; Truhlar, D. G.; Warshel, A.; Xu, X.; Aspuru-Guzik, A.; Baer, R.; Bell, A. T.; Besley, N. A.; Chai, J.-D.; Dreuw, A.; Dunietz, B. D.; Furlani, T. R.; Gwaltney, S. R.; Hsu, C.-P.; Jung, Y.; Kong, J.; Lambrecht, D. S.; Liang, W.; Ochsenfeld, C.; Rassolov, V. A.; Slipchenko, L. V.; Subotnik, J. E.; Van Voorhis, T.; Herbert, J. M.; Krylov, A. I.; Gill, P. M.; Head-Gordon, M. Mol. Phys. 2015, 113, 184−215.


