Performance evaluation for petascale quantum simulation tools

    Stanimire Tomov1, Wenchang Lu2,3, Jerzy Bernholc2,3, Shirley Moore1, and Jack Dongarra1,3

1 University of Tennessee, Knoxville, TN
2 North Carolina State University, Raleigh, NC

    3 Oak Ridge National Laboratory, Oak Ridge, TN

ABSTRACT: This paper describes the performance evaluation and analysis of a set of open source petascale quantum simulation tools for nanotechnology applications. The tools of interest are based on the existing real-space multigrid (RMG) method. In this work we take a reference set of these tools and evaluate their performance with the help of performance evaluation libraries and tools such as TAU and PAPI. The goal is to develop an in-depth understanding of their performance on Teraflop leadership platforms, and moreover to identify possible bottlenecks and give suggestions for their removal. The measurements are being done on ORNL's Cray XT4 system (Jaguar), based on quad-core 2.1 GHz AMD Opteron processors. Profiling is being used to identify possible performance bottlenecks, and tracing is being used to determine the exact locations and causes of those bottlenecks. The results so far indicate that the methodology followed can be used to easily produce and analyze performance data, and that this ability has the potential to aid our overall efforts on developing efficient quantum simulation tools for petascale systems.

KEYWORDS: performance evaluation, petascale tools, quantum simulations, TAU, PAPI.

    1 Introduction

Computational science is firmly established as a pillar of scientific discovery promising unprecedented capability. In particular, computational advances in the area of nanoscale and molecular sciences are expected to enable the development of materials and systems with radically new properties, relevant to virtually every sector of the economy, including energy, telecommunications and computers, medicine, and areas of national interest such as homeland security. The advances that would enable this progress critically depend on the development of petascale-level quantum simulation tools and their adaptation to the currently available petascale computing systems. However, this tool development and adaptation is a highly non-trivial task that requires a truly interdisciplinary team, and cannot be accomplished by scientists working in a single discipline. This is due to the following challenges, which require expertise in several areas:

– Quantum simulation methods, while by now well-established, provide various "levels" of accuracy at very different cost. The most accurate methods are prohibitively expensive for large systems and scale poorly with system size (up to O(N^7)). Disciplinary expertise is required to develop multiscale methods that will provide sufficient accuracy at acceptable cost, while performing well for grand-challenge-size problems at the petascale level.

– Existing codes and algorithms perform well on current terascale systems, but need major performance tuning and adaptation for petascale systems (based, for example, on emerging multi/many-core and hybrid architectures). This requires substantial computer science expertise, use of advanced profiling and optimization tools, and additional development of these tools to adapt them to different petascale architectures, with different memory hierarchies, latencies, and bandwidths. The profiling may also identify algorithmic bottlenecks that inhibit petaflop performance.

– New or improved algorithms can greatly decrease time to solution and thus enhance the impact of petascale hardware. Very large problems often exhibit "slow down" of convergence, requiring "coarse-level" accelerators adapted to the particular algorithm. Algorithmic changes to decrease bandwidth or latency requirements may also be necessary. In time-dependent simulations, sophisticated variable time-stepping and implicit methods can greatly increase the "physical time" of the simulation, enabling the discovery of new phenomena.

This work is a step towards solving some of the computer science problems related to the adaptation of existing codes and algorithms to petascale systems based on multicore processors. In particular, we take a reference set of quantum simulation tools that we are currently developing [4, 5], and show the main steps in evaluating their performance, analyzing it to identify possible performance bottlenecks, and determining the exact locations and causes of those bottlenecks. Based on this, we also give recommendations for possible performance optimizations. We use state-of-the-art performance evaluation libraries and tools, including TAU (Tuning and Analysis Utilities [6]) and PAPI (Performance Application Programming Interface [1, 2]). The measurements are done on Jaguar, a Cray XT4 system at ORNL based on quad-core 2.1 GHz AMD Opteron processors. The results so far indicate that the main steps that we have followed (and described) can be viewed and used as a methodology not only to easily produce and analyze performance data, but also to aid the development of algorithms, and in particular petascale quantum simulation tools, that effectively use the underlying hardware.

    2 Performance evaluation

Here we describe the performance evaluation techniques that we found most useful for this study. We also give a performance evaluation for two methodologies that we have implemented so far in our codes. One is the global grid method, in which the wave functions are represented on real-space uniform grids [5]. The results show that the most time-consuming part of this method is the orthogonalization and subspace diagonalization. The code is massively parallel and reaches very good flop rates. Unfortunately, it scales as O(N^3) with system size, which becomes prohibitive for large system applications. The other is the optimally localized orbital method [4], in which the basis set is optimized variationally, and the method scales nearly as O(N) with system size. This O(N) method faces some computational challenges in parallel efficiency and strong scaling with the number of processors. In this contribution, we profile (below) and analyze (Section 3) the codes using different tools, find the bottlenecks, and optimize the performance (Section 4).

    2.1 Profiling

An efficient way to quickly become familiar with large software packages and identify possible performance bottlenecks is to use TAU's profiling capabilities. For example, we use it first to get familiar with the general code structure of the quantum simulation tools of interest (Section 2.2), second to obtain performance profiles (Section 2.3) revealing the main function candidates for performance optimization, and finally to profile in various hardware configuration scenarios in order to analyze the code and identify possible performance bottlenecks.

    2.2 Code structure

Code structure can be easily studied by generating call-path data profiles. For example, TAU is compiled with the -PROFILECALLPATH option, and the runtime variable TAU_CALLPATH_DEPTH can be set to limit the depth of the profiled call paths.

Fig. 1. Callpath data.

Figure 1 shows a view of the call-path data (of the optimally localized orbital method code) using the paraprof tool, obtained by selecting consecutively Windows, Thread, Statistics Table from the pull-down menu (if Call Graph is selected instead, one sees a graphical representation of the call path). Shown are the functions called from main, their inclusive execution times (the exclusive execution times are minimized for this snapshot), how many times the functions are called, and how many child functions are called from them.


Fig. 2. Callpath data.

In this case, function run is the most time consuming, and left-clicking on it reveals similar information for its children. This is shown in Figure 2, where from run we get to quench, and next to scf (called in an outer loop 20 times), get_new_rho (again called 20 times), and density_orbit_X_orbit (called 56,343 times for this run) as the most time consuming functions. Starting from here, and confirmed further by subsequent profiling and tracing, we can say that since most of the work is done in density_orbit_X_orbit (further denoted by DOxO), the most probable improvements in performance will come from optimizing it. The optimization, though, as expected, is coupled with the function's relation to load balancing and to patterns of computation and MPI communication, which is the subject of the finer performance studies below. The considerable time spent in MPI_Wait hints that we should look further for possible load imbalance or for different ways of organizing and mixing computation and communication.

    2.3 Performance Profiles

Fig. 3. Code profile.

Figure 3 shows a TAU profile using paraprof. At the top is a display of the most time consuming functions (from left to right), along with an indication of their execution time for the different threads of execution used for this run. At the bottom is a legend giving the correspondence of color to function name, and each function's execution time as a percentage of the total execution time (in this case taken for the mean). This indicates at a glance the main function candidates for performance optimization. Namely, as already determined, this is function DOxO. Moreover, it is evident that there is some load imbalance between two groups of execution threads, namely threads 0 to 55 and threads 56 to 63, or that the two groups have different algorithm-specific functionalities. In either case, the time spent in MPI_Wait seems to be excessive.

Similar profiles can easily be generated for various performance counters.


Fig. 4. Floating point instructions.

For example, Figure 4 shows profiles for the floating point instructions counter PAPI_FP_INS from PAPI. They show that the most time consuming function is not the one performing most of the flops (it is actually third in terms of flops performed), further underscoring the need to look for possible optimizations of it. Moreover, the flop counts in the second thread group are approximately two times lower, requiring a check for load imbalance or, in the case of different functionality, further analysis of how this two-way splitting influences scalability.
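In this study the hardware counters were collected through TAU's PAPI integration. For reference, the sketch below shows how the same PAPI_FP_INS counter can be read directly with PAPI's low-level C API around a region of interest; compute_kernel is a hypothetical stand-in for a hot routine such as DOxO, not code from the simulation tools.

```c
/* Minimal sketch: counting floating point instructions with PAPI's
 * low-level API around a region of interest. compute_kernel() is a
 * hypothetical stand-in for a hot routine (e.g. DOxO). */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

static void compute_kernel(void)
{
    double s = 0.0;
    for (int i = 1; i <= 1000000; i++)
        s += 1.0 / (double)i;              /* some floating point work */
    if (s < 0.0) printf("unreachable\n");  /* keep the loop from being optimized away */
}

int main(void)
{
    int event_set = PAPI_NULL;
    long long fp_ins = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_event(event_set, PAPI_FP_INS) != PAPI_OK)
        exit(1);

    PAPI_start(event_set);          /* start counting PAPI_FP_INS       */
    compute_kernel();               /* region of interest               */
    PAPI_stop(event_set, &fp_ins);  /* stop and read the counter value  */

    printf("PAPI_FP_INS in kernel: %lld\n", fp_ins);
    return 0;
}
```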

    2.4 Performance evaluation results

The results shown so far are for the purpose of illustration. Both codes were also profiled for larger problems and using larger numbers of processing elements (up to 1024 cores were enough for the purpose of this study). As mentioned at the beginning of this section, the most time-consuming part of the real-space uniform grid method is the orthogonalization and subspace diagonalization. Figure 5, top, shows the profile for the most time consuming functions on a run using 1024 cores.

    Fig. 5. Code profiles for large problems on 1024 cores.

Figure 5, bottom, shows the profile for the most time consuming functions of the optimally localized orbital method, again on 1024 cores. The first code runs at an overall 670 MFlop/s per core vs. 114 MFlop/s per core for the second code (in both cases using all 4 cores of each node). Both codes are based on domain decomposition and have good weak scalability. The optimally localized orbital method has some computational challenges in parallel efficiency and strong scaling with the number of processors. The basis set in this method is optimized variationally, which makes it "richer" in sparse operations and MPI communications (compared to the first code).

    3 Performance analysis

There are 3 main techniques that we found useful in analyzing the performance. These are tracing (and in particular comparing traces of various runs), scalability studies, and experimenting with multicore use, all of which are described briefly below.

    3.1 Tracing

Tracing, in general, is used to determine the exact locations and causes of bottlenecks. We used TAU to generate trace files and analyzed them using Jumpshot [8] and tools like KOJAK [7]. The codes that we analyze are well written, in the sense that communications are already blocked, asynchronous, and intermixed (and hence overlapped) with computation to a certain degree. We found that in our case, where domain decomposition guarantees weak scalability to a degree, to improve performance we have to concentrate mainly on the efficient use of the multicore processors within a node. Related to using the tracing tools and MPI, we discovered that it may help to increase the degree to which MPI_Irecvs are posted early. This is again related to the efficient use of multicores and MPI, and is further discussed in our recommendations for performance improvements. We also found it useful to compare traces of various runs, and thus study the effect of various code changes on performance. Using traces, one can also easily generate profile-type statistics for various parts of the code, which we found very helpful in understanding the performance.

    Fig. 6. Scalability using 4, 8, and 16 quad-core nodes.

    3.2 Scalability

We studied both strong and weak scalability of our codes. The results were briefly summarized in Subsection 2.4. Here we give a brief illustration of some strong scalability results using, as before, the small-size problems with the optimally localized orbital method. Figure 6 shows the inclusive (top) and exclusive (bottom) execution time comparison when using 16 quad-core nodes (in blue), 8 nodes (red), and 4 nodes (green). We note that the top 3 compute intensive functions from Figure 4 scale well (see Figure 6, bottom). This is important, since there was the concern of load imbalance for a group of threads, as explained earlier. This result shows that the balance is properly maintained and that the 2 groups of threads observed before have different functionality (and their load also gets proportionally split). The less than perfect overall scaling (see the inclusive time at the top) may be due to the fact that the measurement here is for strong scalability, i.e. the problem size has been kept fixed, so the cost of communication has not yet become small enough compared to computation, due to the surface-to-volume effects of the domain decomposition techniques that we employ. Improvements, though, are possible, as will be discussed further.

    3.3 Multicore use

To understand how efficiently the code uses multicore architectures, we perform measurements in different hardware configurations. Namely, we compare runs using 4, 2, and a single core of the quad-core nodes. In all cases we vary the number of nodes used so that the total number of execution threads (one per core) is 64. Figure 7 shows results comparing the inclusive (top) and exclusive (bottom) execution times.


Fig. 7. Comparison using 4, 2, and 1 cores.

Note that DOxO, the most time consuming function, uses the multicores efficiently, as the time for 2 cores (given in red) and 4 cores (in blue) is the same as that for a single core (green). The same is not true, though, for the other most time consuming functions. For example, comparing single-core and quad-core exclusive run times for the next 5 most time consuming functions (from Figure 3), we see that the quad-core based runs are correspondingly 23, 51, 58, 49, and 36% slower, which results in an overall slowdown of about 28% (as seen from the inclusive times for main).

These results are not surprising, as multicore use is almost always of concern. In our case, a general reason that can explain the slowdown is that the execution threads are run as separate MPI processes that do not take advantage of the fact that locally, within a node, we have shared memory and can avoid MPI communications. The local MPI communications (within a multicore node), even if invoked as non-blocking, end up being executed as a blocking copy, and therefore there is no overlap of communication and computation, contributing to the increase in MPI_Wait time. Besides the slowdown due to local (within a multicore node) communication, there is also more load on the communications coming to and from a node. These bottlenecks have to be addressed: copies should be avoided within the nodes, and the inter-node communications related to the cores of a single node should be blocked whenever possible to avoid the latencies associated with multiple instances of communication.
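One possible direction for making the implementation multicore-aware (see also the suggestions in Section 4) is to introduce a per-node communicator, so that intra-node exchanges can be handled separately from inter-node traffic. The sketch below assumes an MPI-3 library providing MPI_Comm_split_type, which may not be available on the software stack discussed here; on the XT4 of that era a similar split would have to be built by hand (e.g. from host names). It illustrates the idea and is not part of our codes.

```c
/* Sketch: split MPI_COMM_WORLD into per-node communicators so that
 * intra-node exchanges can be treated separately from inter-node ones
 * (e.g. replaced by shared-memory copies, or aggregated into fewer,
 * larger inter-node messages). Assumes an MPI-3 library. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm node_comm;
    int world_rank, node_rank, node_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* All ranks that can share memory (i.e. on the same node) end up
     * in the same node_comm. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* node_rank == 0 could act as the node's communication proxy:
     * aggregate the data of its node_size local ranks and perform the
     * inter-node MPI traffic on their behalf, avoiding intra-node copies. */
    printf("world rank %d is local rank %d of %d on its node\n",
           world_rank, node_rank, node_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```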

As mentioned above, the multicores are used efficiently by our most time consuming function, but not that well by the communication related functions. Looking at the compute intensive functions, namely dot_product_dot_orbit and theta_phi_new (from Figure 4), we see that they slow down with multicore use significantly more than even MPI_Wait: correspondingly 51 and 49% when comparing single vs. quad-core runs, as already mentioned above. Understanding this, and especially why the third compute intensive function, DOxO (also the most time consuming), is fine on multicores, requires further measurements and analysis, as done next.

Fig. 8. Performance using 4, 2, and 1 cores.

Figure 8 shows a performance comparison when using the 4, 2, and single-core regimes, i.e. a combination of the time measurements (as in Figure 3) and the flop measurements (as in Figure 4). The numbers give MFlop/s rates. They show, for example, that the code runs at an overall speed of about 110 MFlop/s per core when using all four cores per node and 152 MFlop/s when using a single core per node.


We note that the maximum performance is 8.4 GFlop/s per core and the main memory bandwidth is 10.6 GB/s. These are just two numbers to keep in mind when evaluating the maximum possible speedups, depending on the operations being performed. In our case, it appears that DOxO uses very irregular memory accesses, so its performance is very poor (about 53 MFlop/s, which is about 9 and 7 times less than the two most compute intensive functions, dot_product_dot_orbit and theta_phi_new, respectively). Moreover, the accesses are so irregular that the bottleneck is not in bandwidth (otherwise performance would have degraded with multicore use) but in latencies. Consideration should be given to whether the computation can be reorganized to have more locality of reference. The same should be done for the other two flop-rich functions; although they have better performance, their performance degrades drastically with the use of multiple cores, as shown in Figure 8, bottom. It must be determined whether this is due to a bandwidth bottleneck, in which case certain types of blocking may help.
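As an illustration of the kind of blocking just mentioned, the toy kernel below tiles a dense sweep so that a block of one operand stays in cache while it is reused across rows. The actual access pattern inside DOxO and the other flop-rich functions is irregular and code-specific, so this sketch only conveys the general idea, not the production kernel.

```c
/* Generic blocking illustration: the j loop is tiled so that a BLOCK-sized
 * piece of x[] stays in cache while it is reused for every row, improving
 * locality of a bandwidth-bound sweep. Toy dense example only. */
#include <stddef.h>

enum { BLOCK = 256 };   /* tile size; tune to the core's cache (assumed value) */

void blocked_mat_vec(int m, int n, const double *a /* m x n, row-major */,
                     const double *x, double *y)
{
    for (int i = 0; i < m; i++)
        y[i] = 0.0;

    for (int jb = 0; jb < n; jb += BLOCK) {            /* loop over tiles of x */
        int jend = (jb + BLOCK < n) ? jb + BLOCK : n;
        for (int i = 0; i < m; i++)                    /* reuse the cached tile */
            for (int j = jb; j < jend; j++)
                y[i] += a[(size_t)i * n + j] * x[j];
    }
}
```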

    4 Bottlenecks

Based on our performance analysis, the biggest performance increase for the current codes can come from more efficient use of the multicore nodes. We described the bottleneck in Subsection 3.3. Namely, the cores of a node are run as separate MPI processes, which needlessly increases the load on the memory bus shared between the cores. This causes, for example, extra copies (in an otherwise shared memory environment) and no overlap of communication and computation (locally). Avoiding additional load on a multicore node's memory bus is important because our type of computation involves some sparse linear algebra (especially in the first code), which is notorious for running much below the machine's peak, especially on multicore architectures.

Another example of overloading the multicore memory bus (which happens in our current codes) is posting MPI_Irecv late, in particular after communication data has already started to arrive. This also results in an extra copy, as MPI starts putting the data into temporary buffers and later copies it to the user-specified buffers.

When looking for performance optimization opportunities, it is also important to keep in mind what the limits for improvement roughly are, based on machine and algorithmic requirements. One can get close to machine peak performance only for operations with a high enough ratio of flops to data needed. For example, Level 3 BLAS can achieve it for large enough matrix sizes (approximately at least 200 on current architectures). Otherwise, in most cases, memory bandwidth and latencies limit the maximum performance. Jaguar, in particular, has quad-core Opterons at 2.1 GHz with a theoretical maximum performance of 8.4 GFlop/s per core (about 32 GFlop/s per quad-core node) and a memory bandwidth of 10.6 GB/s (shared between the 4 cores). With these characteristics, if an application requires, for example, streaming (copy), one can expect about 10 GB/s and one core will saturate the bus; a dot product runs at ≈ 1 GFlop/s (again, one core saturates the bus); an FFT runs at ≈ 0.7 GFlop/s on 2 cores and 1.3 GFlop/s on 4 cores; random sparse operations run at ≈ 0.035 GFlop/s on 2 cores and 0.052 GFlop/s on 4 cores; etc. The point is that if certain performance is not satisfactory, we may have to look for ways to change the algorithm itself (some suggestions are given below).
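As a rough sanity check of the dot-product figure above (our own back-of-the-envelope estimate, not a measurement): a double-precision dot product performs 2 flops (one multiply and one add) for every two 8-byte operands streamed from memory, i.e. 2 flops per 16 bytes. At the 10.6 GB/s node bandwidth this bounds the rate at about (10.6/16) x 2 ≈ 1.3 GFlop/s, consistent with the ≈ 1 GFlop/s quoted above and far below the 8.4 GFlop/s per-core peak, so the bus, not the core, is the limit.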

Here is a list of suggestions for performance improvement:

– Try some standard optimization techniques on the most compute intensive functions;

– Change the current all-MPI implementation to a multicore-aware implementation where MPI communications are performed only between nodes (and not within them);

– Try different strategies/patterns of intermixing communication and computation. For example, the current pattern in get_new_rho is to have a queue of 2 MPI_Irecvs (and corresponding MPI_Isends), with an MPI_Wait associated with the first MPI_Irecv of the queue, ensuring that the data needed has been received, followed by the computation associated with that data. This pattern ensures some overlapping of communication and computation, but it is worth investigating larger sets of MPI_Irecvs combined, for example, with MPI_Waitsome (see the sketch after this list). The idea is, first, to have enough MPI_Irecvs posted to avoid the case of data arriving (from some MPI_Isend) before a corresponding MPI_Irecv has been posted (in which case there would be a copy overhead), and second, to immediately start the computation associated with whichever communication completes first;

– Consider changing the algorithms if performance is still not satisfactory.
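To make the third suggestion concrete, the sketch below shows the post-all-receives-early / MPI_Waitsome pattern. The neighbor list, message sizes, and the process_block routine are hypothetical stand-ins for the actual data structures and computation in get_new_rho.

```c
/* Sketch of the pattern suggested above: post all receives up front (so no
 * message arrives before its MPI_Irecv exists), then use MPI_Waitsome to
 * process whichever transfers complete first. Counts, neighbors, and
 * process_block() are hypothetical stand-ins. */
#include <mpi.h>

#define NREQ  8      /* number of outstanding receives (assumed) */
#define COUNT 4096   /* doubles per message            (assumed) */

static void process_block(double *block)   /* hypothetical compute step */
{
    (void)block;   /* ... compute the contribution of this block ... */
}

void exchange_and_compute(MPI_Comm comm, const int neighbor[NREQ],
                          double recvbuf[NREQ][COUNT],
                          double sendbuf[NREQ][COUNT])
{
    MPI_Request recv_req[NREQ], send_req[NREQ];
    int indices[NREQ], ndone, remaining = NREQ;

    /* Post every receive before any data can arrive. */
    for (int i = 0; i < NREQ; i++)
        MPI_Irecv(recvbuf[i], COUNT, MPI_DOUBLE, neighbor[i], 0, comm,
                  &recv_req[i]);
    for (int i = 0; i < NREQ; i++)
        MPI_Isend(sendbuf[i], COUNT, MPI_DOUBLE, neighbor[i], 0, comm,
                  &send_req[i]);

    /* Compute on whichever blocks have arrived, in completion order. */
    while (remaining > 0) {
        MPI_Waitsome(NREQ, recv_req, &ndone, indices, MPI_STATUSES_IGNORE);
        for (int k = 0; k < ndone; k++)
            process_block(recvbuf[indices[k]]);
        remaining -= ndone;
    }
    MPI_Waitall(NREQ, send_req, MPI_STATUSES_IGNORE);
}
```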

Related to item one, we can give an example with function DOxO, which was the most time consuming function for most of the small runs also used as illustrations in this paper. We managed to accelerate it by approximately 2.6×, which brought about a 28% overall improvement (as DOxO was accounting for 29% of the total run time). The techniques used here are given in Figure 9. The candidates for this type of optimization were determined by the profiling described above.

Fig. 9. Code optimization.

We note that the opportunity for speedup here may be better for the second code, because it has more sparse operations. A difficulty is that there is not a single function to optimize; Figure 5 shows the profiles for the two codes as run on large problems of interest. Note that the first code has a single most time consuming function, but speedup there would most probably come from algorithmic innovations.

The optimizations related to items 2 and 3 are work in progress. Finally, for item 4, we consider, for example, certain new algorithms designed to avoid or minimize communication [3], and in general new linear algebra developments for multicores and emerging hybrid architectures.

Fig. 10. Hybrid GPU-accelerated Hessenberg reduction in double precision.

For example, related to accelerating the subspace diagonalization problem, Figure 10 shows the performance acceleration of the reduction to upper Hessenberg form using hybrid GPU-accelerated computation; the improvement is 16× over the current implementation, which is obviously a bottleneck.

    5 Conclusions

We profiled and analyzed two petascale quantum simulation tools for nanotechnology applications. We used different tools to help in understanding their performance on Teraflop leadership platforms. We identified bottlenecks and gave suggestions for their removal. The results so far indicate that the main steps that we have followed (and described) can be viewed and used as a methodology not only to easily produce and analyze performance data, but also to aid the development of algorithms, and in particular petascale quantum simulation tools, that effectively use the underlying hardware.

Acknowledgments. This work was supported by the U.S. National Science Foundation under contract #0749293. We used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

    References

1. S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, A portable programming interface for performance evaluation on modern processors, The International Journal of High Performance Computing Applications 14 (2000), 189–204.

2. Shirley Browne, Christine Deane, George Ho, and Philip Mucci, PAPI: A portable interface to hardware performance counters, (June 1999).

3. James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou, Communication-avoiding parallel and sequential QR factorizations, CoRR abs/0806.2159 (2008).

4. J.-L. Fattebert and J. Bernholc, Towards grid-based O(N) density-functional theory methods: Optimized nonorthogonal orbitals and multigrid acceleration, Phys. Rev. B 62 (2000), no. 3, 1713–1722.

5. Miroslav Hodak, Shuchun Wang, Wenchang Lu, and J. Bernholc, Implementation of ultrasoft pseudopotentials in large-scale grid-based electronic structure calculations, Physical Review B (Condensed Matter and Materials Physics) 76 (2007), no. 8, 085108.

6. Sameer S. Shende and Allen D. Malony, The TAU parallel performance system, Int. J. High Perform. Comput. Appl. 20 (2006), no. 2, 287–311.

7. F. Wolf and B. Mohr, KOJAK - a tool set for automatic performance analysis of parallel applications, Proc. of the European Conference on Parallel Computing (Euro-Par) (Klagenfurt, Austria), Lecture Notes in Computer Science, vol. 2790, Springer, August 2003, Demonstrations of Parallel and Distributed Computing, pp. 1301–1304.

8. Omer Zaki, Ewing Lusk, William Gropp, and Deborah Swider, Toward scalable performance visualization with Jumpshot, High Performance Computing Applications 13 (1999), no. 2, 277–288.


