
Report 2012

Intel European Exascale Labs


FOREWORD

2012 has been a year of dynamic development for the four Intel European Exascale Labs, with significant work in many areas relevant for Exascale R&D, like scalable algorithms, energy awareness of applications, resiliency concepts, performance characterization tools, and innovative architectural experiments.

The major target platform for the work in the Exascale Labs is the Intel® MIC (Many Integrated Cores) architecture; the first product of this architecture was announced at SC12 and is now available as Intel® Xeon Phi™.

The labs have increasingly grown into the role of “co-design” centers, feeding the requirements of high-end European HPC centers back to Intel’s HPC architects and discussing Intel’s concepts for future Exascale architectures with the European partners. This is in line with the strategic mission of the labs:

• Supporting our European partners on their roadmap to Exascale

• Providing technology feedback to Intel architects

Most of the work in the labs is presented at conferences and openly published, and the software is generally released under open source licenses. This open policy has contributed to the increased visibility of the labs as leading R&D players in the European HPC community.

Intel (through the Exascale labs) and several of our lab partners have contributed actively to the strategic research agenda of the new European Technology Platform for HPC (ETP4HPC), as well as to the European Exascale Software Initiative (EESI-2). The DEEP project (funded by the European Commission) is one of three European pilot Exascale projects, with a novel systems architecture based on Intel® Xeon Phi™. To support the European HPC and Exascale agenda, the labs have contributed to further proposals in the 2012 FP7 Exascale call.

Life science applications are increasingly using HPC resources and are expected to be a driver for Exascale technology, with high demands both for processing performance and data volume. The conference “HPC for Life Sciences”, organized by the Exascale labs in Brussels in May 2012, brought together leading life-science researchers and HPC experts.

In this annual report the Exascale labs present some examples and highlights of their 2012 R&D work:

• ExaCluster Lab Jülich: the progress of the DEEP project and its innovative software concept, and an overview of Scalasca, a performance tool which supports the highest scalability

• ExaScience Lab Leuven: the development of a typical proto-application for astrophysics (space weather) codes, and new communication-avoiding sparse matrix solvers

• Exascale Computing Research Lab Paris: two applications (RTM and Z-Code) with scalable algorithms exhibiting potential for Exascale

• Intel-BSC Exascale Lab Barcelona: performance tools to characterize applications and an automatic load-balancing technology.

Finally, I want to thank the teams in the Exascale Labs for their engagement and exciting R&D contributions in 2012, and our partners for their support and collaboration. I wish this report many interested readers. Enjoy reading; we appreciate your feedback.

Karl Solchenbach

Director Intel European Exascale Labs

INDEX

KEY EVENTS 2012

INTEL-BSC EXASCALE CENTER – BARCELONA, SPAIN

INTEL EXACLUSTER LAB – JÜLICH, GERMANY

INTEL EXASCIENCE LAB – LEUVEN, BELGIUM

INTEL EXASCALE COMPUTING RESEARCH LAB – PARIS, FRANCE

Automatic Load-balancing Hybrid MPI+OmpSs and its Interaction with Reduction Operations

High-precision Performance and Power Measurements with Low Overhead

The Scalability of the Cluster-Booster Concept

Scalasca – A Scalable Parallel Performance Measurement and Analysis Toolset

Helsim: A Particle-in-Cell Proto-application

Hiding Global Communication Latency in Krylov Solvers at Exascale

Innovative Programming Models for Depth Imaging Applications

Z-CODE: Towards an Efficient Implementation of Multiresolution Methods

European Exascale Labs


KEY EVENTS 2012

JOINT EXASCALE LAB WORKSHOPS
The Exascale labs met in two cross-lab workshops:

• In Jülich (February 2012), to discuss recent results with Steve Pawlowski and get his feedback. Steve also kicked off the new "Jülich Lecture" series, speaking on "Overcoming the barriers to Exascale through innovation".

• In Leuven (October 2012), to discuss trends for Exascale programming models and to coordinate the related work in the labs.

INTEL LEADERSHIP CONFERENCE ON HPC FOR LIFE SCIENCES
MAY 31 – JUNE 1, 2012, BRUSSELS
The two-day conference covered many domains, from systems biology, genomics, and modelling drugs for oncology to cardiovascular simulation. The experts discussed the state of the art, future needs for compute power, and finding “mature” models which are ready for co-design activities. The Intel European Exascale Labs will make Life Science one of their priority areas.

INTERNATIONAL SUPERCOMPUTING CONFERENCE (ISC’12)
JUNE 17-21, 2012, HAMBURG
The Intel European Exascale labs had a very visible presence at ISC’12 in Hamburg. Two demonstrations were presented at the Intel booth: the DEEP project and the QMC grand challenge application, which was run on the Curie system on over 76,000 cores, showing very good scalability and 38% efficiency. The BOF session “Preparing for Exascale” was run very successfully with great interaction.

TERATEC FORUM
JUNE 27-28, 2012, PARIS
The Teratec Forum in Paris is the leading HPC event in France, attracting over 1,000 visitors this year, with good attendance at both the Intel and the French Exascale lab (ECR) stands and interesting contacts. The ECR organized the workshop on Exascale and received very positive comments on this activity. This session had the highest registration and attendance level of all the workshops.

Participants of the “HPC for Life Sciences“ Conference

“MEET THE LABS” EVENT WITH INTEL HPC EXPERTS
NOVEMBER 8-9, 2012, SANTA CLARA
A delegation of the European Exascale labs met with Intel HPC experts at the Intel headquarters in Santa Clara. The labs gave an update on their work during 2012, with interesting discussions and feedback from Intel experts. A group from the Oregon Health and Science University (OHSU) also attended for the discussion on the life science applications.

SUPERCOMPUTING 2012 (SC12)
NOVEMBER 10-16, 2012, SALT LAKE CITY
At SC12 the main focus of the labs was showing the first boards of the DEEP system and organizing a BOF session on European Exascale activities.

Steve Pawlowski talking about Exascale in Jülich

Exascale labs demonstration at ISC’12

Prof. Thomas Lippert presenting at the Intel booth at ISC’12

Presentation at the Teratec Forum 2012

The Exascale session at the Teratec Forum 2012

“Meet the Labs“ event at the Intel HQ in Santa Clara

The new DEEP board shown at SC12


INTEL-BSC EXASCALE CENTER – BARCELONA, SPAIN

The Intel-BSC Exascale Center has been set up to perform research in topics considered by both Intel and BSC to be of key relevance on the way to Exascale. The aim of the center is to leverage and continue development of the BSC research infrastructures with the supervision and advice of Intel, seeking to maximize the applicability and impact of the insight and ideas deriving from such cooperation.

VISION AND BACKGROUND
The BSC vision towards Exascale, presented as a white paper to the first IESP meeting, envisages that systems with hundreds of thousands of nodes, each of them with hundreds of thousands of cores, will be used. Managing the power consumption of such a system, the variability in behavior across devices, the proper allocation and sharing of resources, and the overall complexity (in particular the memory structure) of the systems are huge challenges that have yet to be solved. We envisage that a very strong research effort has to be devoted to: mastering asynchrony to generate and manage the huge levels of parallelism required, as well as to tolerate huge latencies and variability; mastering malleability and very dynamic resource management to achieve high resource utilization; and mastering locality to maximize efficiency and minimize data transfers. We consider that these three directions of research will make the key contributions from the system software side that will complement the necessary technological developments and jointly address the most challenging problem: limiting the power consumption of such Exascale systems to “reasonable” levels. Three specific lines of work are extremely important to approach these general objectives.

We first consider that the programming model is the key level that can support and enable the necessary decoupling between application and system developers, in such a way that applications can start to be developed/prepared today for platforms whose exact characteristics are not yet known, while runtime implementers will fill the gap between the two levels. We envisage that a hybrid programming model is the most realistic approach to the Exascale target. At the lower level, a single address space (shared memory) model will run within a node or small cluster of nodes going up to hundreds or thousands of cores, not necessarily with hardware-supported coherence or even sharing the physical address space. Whatever the architecture of such nodes or supernodes, the programming models offered to the user will have to be the same. This level will provide most of the required malleability and flexibility mentioned above. At the coarse-grain level in the hybrid approach, either MPI or a PGAS model may be used. We consider that the target should be several hundreds of thousands of processes at this level (a number already reached by some high-end systems today, indicating that we expect that the major growth in size will be in the number of cores in a node).

In this line of thought, BSC has been developing and promoting the StarSs programming model concept, where programmers specify, through a couple of simple pragmas, the directionality of the data accesses performed by computation tasks, and the runtime dynamically determines how these accesses transform into dependences. The runtime can thus exploit huge levels of unstructured parallelism on one side, but by also having the information on the data access patterns it can manage memory and perform data movements as well, relieving the programmer of this cumbersome and error-prone task. The huge “lookahead” supported by the StarSs concept allows the runtime to “see” deep into the future of things that will have to be done and thus gives it the opportunity to implement optimizations that would otherwise be impossible. OmpSs is our implementation of the StarSs concepts in OpenMP and it is the infrastructure that integrates all of the BSC programming model efforts. It supports the execution of the same single-address-space program on architectures as varied as SMPs, Clusters and GPU-based nodes, as well as combinations of them.
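To make the directionality idea concrete, the following is a minimal sketch in C using OmpSs-style in/out clauses. The kernel names, array sizes and bodies are purely illustrative assumptions, so it should be read as an illustration of the concept rather than as BSC's actual code.

    /* Hypothetical kernels standing in for real computation. */
    void produce(double out[1024], const double in[1024]) {
        for (int i = 0; i < 1024; i++) out[i] = 2.0 * in[i];
    }
    void consume(double acc[1024], const double in[1024]) {
        for (int i = 0; i < 1024; i++) acc[i] += in[i];
    }

    void step(double a[1024], double b[1024], double c[1024])
    {
        /* The runtime builds the dependence graph from the declared
           inputs and outputs of each task. */
        #pragma omp task in(a) out(b)
        produce(b, a);

        #pragma omp task in(b) inout(c)   /* b -> c dependence inferred */
        consume(c, b);

        #pragma omp taskwait              /* wait for the task graph to drain */
    }

Because only data directions are declared, the same code can be scheduled on a varying number of threads, which is exactly the malleability exploited by the load-balancing work described later in this report.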


Second, performance tools are essential to understanding the way we use our systems today. Tools will be even more important in future systems, where variability in behavior between components and in workload characteristics will require a detailed understanding of microscopic behavior and its impact on global performance. We can say that in today’s practice, most of the application development and optimization is done without a quantified understanding of the parallel application behavior (what percentage of efficiency is lost due to load imbalance, serialization, actual data transfer time or synchronization?) or sequential behavior (what percentage of peak core performance is lost due to instruction mix, to cache misses at which level, to branch mispredictions, instruction dependences, etc.?). Parallelizing/optimizing applications is, in some sense, like flying blind in the mist. Different hypotheses have to be tried (fully implemented) without quantitative evidence that they address the most important problem in the application, and without a quantitative a priori estimation of the expected performance improvement to compare with the predictable implementation cost. Our vision is that it is extremely important to significantly increase the amount of data analysis capabilities and “intelligence” in our tools in order to gain real and useful insight into application and system behavior that will help us to work in the most cost-effective direction. BSC develops and maintains Extrae, an instrumentation package; Paraver, a browser for traced data; Dimemas, a simulator to rebuild the timing behavior of parallel programs; and various related utilities that make the whole BSC Tools an extremely powerful integrated performance-analysis environment.

Finally, efficiently executing applications delivering maximum functionality with minimal resource/energy consumption is the final target of computing systems. It is our vision that close interaction between application developers and system developers is key in order to maximize the quality and usability of the systems we design, by maximizing the flow of information (requirements, possibilities) in both directions between application and system developers. It is also our experience that this requires a lot of effort from both sides, as an interesting newly identified feature can neither be made immediately available in a system, nor can an application be instantaneously restructured to leverage the wonderful features that a new system characteristic may offer.

INTEL-BSC EXASCALE CENTER ACTIVITIES
The work has thus been grouped into three areas: programming models, performance tools and applications.

In the programming models area we focus on different extensions to the OmpSs environment, in particular: the support for multiple implementations of tasks, such that the runtime can schedule tasks to the fastest available device and maximize the utilization of the resources; the support of resiliency through checkpointing at fine granularity (task level); the support for reductions with scattered access patterns on large arrays; and the support of dynamic load balance by shifting cores across MPI processes within a node.

In the performance tools area we improve the BSC tools’ support for Intel® compilers and runtime, work on fine-grain (“instantaneous”) power measurement based on Sandy Bridge counters, and provide support for predictive studies of the impact of different architectural parameters on the performance achievable by future systems.

In the applications area we port several large applications to OmpSs and also work on communication-avoiding algorithms and how OmpSs can ease their programming.

Contact: Karl Solchenbach ([email protected])

Researchers from the Intel-BSC lab met the European Intel Exascale team in Barcelona.


INTEL EXACLUSTER LAB – JÜLICH, GERMANY

High Performance Computing (HPC) plays a pivotal role in science and industry. Increasing supercomputer performance to Exascale (10¹⁸ operations per second) levels will enable breakthrough discoveries not possible today, as well as significant improvements in products and services. Current HPC systems follow the Cluster approach: many thousands of identical compute nodes, each equipped with a moderate number of CPU cores, are connected by a high-performance network and interfaced with a storage system. HPC applications are highly parallel, consisting of up to hundreds of thousands of processes.

The ExaCluster Lab focuses on the critical issue on the way to Exascale: how to scale up Cluster systems, middleware and applications to deliver Exascale performance to end users reliably and at greatly increased power efficiency. It develops architectural concepts, prototype hardware and software for the next generations of Cluster systems, leading all the way to Exascale.

EXACLUSTER RESEARCH
In the ExaCluster Lab, Intel partners with two leading European HPC players. Jülich Supercomputing Center (JSC) operates the fastest European supercomputer, carries out world-class research in software tools and applications and has an unsurpassed track record in operating Top10 HPC systems. JSC is part of Forschungszentrum Jülich (FZJ), the largest German government lab, with focus on materials science, energy, health and environmental research. The second partner, ParTec GmbH, is a leading vendor of scalable communication libraries, cluster management software and highly efficient parallel file systems.

The research touches hardware and software topics: processor and system architecture for extreme scalability and power efficiency, highly efficient and robust communication and management middleware, scalable performance analysis tools, and top-tier applications from science and engineering.

As a centerpiece of the lab’s research, the DEEP project prototypes a new class of HPC Clusters based on Intel® Xeon Phi™, which pushes back the scalability limitations expressed by Amdahl’s law and enables application sections to be run at exactly the right level of parallelism. DEEP has received funding from the European Community’s Seventh Framework Programme under Grant Agreement n° 287530.

SCALABILITY
Today’s largest supercomputers deliver Petascale performance with hundreds of thousands of compute cores and power consumption in the tens of Megawatts. To reach Exascale, compute performance has to grow by a factor of more than 100, while power consumption has to stay in the range of 20 Megawatts. Moore’s law alone will not take us there by 2020 – significant improvements in power efficiency are needed, and Exascale systems will have tens if not hundreds of millions of compute cores.

Building, programming and operating systems of this enormous scale poses extremely difficult challenges. Networks have to scale up and connect the cores with very low latency and high bandwidth; the memory subsystem has to match the ever-increasing processor speed, while power consumption of all hardware components has to be pared down as far as possible. Communication and resource management systems must be significantly improved to support scalable applications and ensure efficient system usage. Finally, HPC applications have to be re-architected for extreme scalability, and extensions for today’s proven programming models have to be found.

RESILIENCY
Projections strongly indicate that Exascale systems will have to operate in the presence of regular, if not frequent, component failures. Since most HPC applications are tightly coupled across the cores running in parallel, a non-recoverable component failure puts the whole application at risk. To address this, the ExaCluster lab is developing new resiliency concepts that involve hardware, system and application software, with the objective of allowing applications to recover from such failures. This goes beyond conventional checkpointing, which does not scale up. The work leverages JSC’s insight into failure modes of large systems, Intel’s competency on the system side and ParTec’s extensive experience in Cluster management and communication software.

“DEEP” EXASCALE PATHFINDING
The Intel® Xeon Phi™ line of coprocessors provides extreme throughput performance (≈ 10¹² double-precision floating point operations per second) in a compact form factor with breakthrough energy efficiency (#1 on the 2012 Green500 list). The Dynamic Exascale Entry Platform (DEEP) pairs a highly scalable “Booster” Cluster of Intel® Xeon Phi™ nodes with a standard HPC Cluster made up of Intel® Xeon® processors. HPC applications combine sections with different levels of parallelism, with the least scalable part putting a hard limit on the achievable total parallelism and therefore performance. The heterogeneous DEEP architecture pushes these limits out by running each application section at its natural level of parallelism and per-core performance “sweet spot”. Sections with less parallelism run on the faster (per core) Cluster part of the system, while the highly parallel computational kernels can use the far wider parallelism of the Booster. To make the Booster work with highest efficiency, DEEP relies on an extremely high-performing 3D torus interconnect. The Cluster uses standard InfiniBand technology.

The DEEP system is programmed using a mix of MPI and the OmpSs tasking system developed by Barcelona Supercomputing Center, home of a sibling Intel Exascale Lab. ParTec’s proven ParaStation middleware provides scalable communication within and between the Booster and Cluster, as well as reliable system management functionality.
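The DEEP programming environment itself (ParaStation MPI plus the OmpSs offload extensions) is not reproduced here; the sketch below only illustrates the Cluster-Booster division of work using standard MPI dynamic process management. The executable name booster_kernel, the process count and the message layout are purely illustrative assumptions.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Cluster side: the less scalable application sections run here
           on the Xeon nodes. */
        MPI_Comm booster;               /* inter-communicator to the Booster part */
        const int booster_procs = 64;   /* illustrative value only */

        /* Hand the highly parallel kernel to the Booster (Xeon Phi) side.
           "booster_kernel" is a hypothetical executable. */
        MPI_Comm_spawn("booster_kernel", MPI_ARGV_NULL, booster_procs,
                       MPI_INFO_NULL, 0, MPI_COMM_WORLD, &booster,
                       MPI_ERRCODES_IGNORE);

        double params[8] = {0}, result = 0.0;
        if (rank == 0) {
            /* Send input to rank 0 of the remote (Booster) group and
               collect the reduced result when the kernel is done. */
            MPI_Send(params, 8, MPI_DOUBLE, 0, 0, booster);
            MPI_Recv(&result, 1, MPI_DOUBLE, 0, 0, booster, MPI_STATUS_IGNORE);
        }

        MPI_Comm_disconnect(&booster);
        MPI_Finalize();
        return 0;
    }

In the real system the offload decision, data marshalling and the Booster-side task parallelism are handled by the OmpSs runtime and ParaStation rather than by hand-written spawn calls.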

Work in the ExaCluster lab on pre-production Intel® Xeon Phi™ systems has validated their potential to execute extremely complex applications with high efficiency and excellent throughput performance. Within the DEEP project, a selection of highly relevant, complex HPC applications is being ported to Intel® Xeon Phi™ and to the DEEP architecture. The first fully functional DEEP systems will be available to the project partners around mid-2013.

Contact: Hans-Christian Hoppe ([email protected])


Michael Kauschke (Intel), Michael Richter (Intel), Hans-Christian Hoppe (Intel), Thomas Lippert (FZJ), Wolfgang Gürich (FZJ), Hugo Falter (ParTec), Bart Stukken (Intel), Norbert Eicker (FZJ), Jochen Kreutz (FZJ), Estela Suarez (FZJ), Ina Schmitz (ParTec), Thomas Moshny (ParTec)


INTEL EXASCIENCE LAB – LEUVEN, BELGIUM

The ExaScience Lab is a unique collaboration between Intel, Imec and the five Flemish universities and is staffed by over 25 scientists. It will be celebrating its 3rd anniversary in summer 2013.

The research focus of the lab is enabling high-performance software applications to scale toward Exascale computing. The current linchpin application is space weather simulation, representing a whole class of astrophysics and general hydrodynamics applications.

SPACE WEATHER
The ExaScience Lab is organized around a linchpin application: space weather simulation and prediction. The algorithms and parallelization strategies needed for such simulations are similar in nature to the ones that are expected to drive a host of applications on future Exascale systems, in particular in the area of particle physics, computational fluid dynamics, magnetohydrodynamics, etc.

SIMULATION TOOLKIT
The toolkit shields the application developer from the effects of variable and failing hardware. It provides numerical kernels optimized for power/energy efficiency and scalability.

ARCHITECTURAL SIMULATION FRAMEWORK
The architectural simulation framework allows studying the behavior of Exascale workloads on future Exascale systems. It enables the study of the detailed behavior of representative workloads on an accurate software model of Exascale hardware.

VISUALIZATION SOFTWARE
The space-weather simulation application is tightly linked with its visualization. This visualization enables real-time presentation and interaction, allowing immediate interpretation and in-depth time-space analysis of the simulation results. The visualization software operates in situ: on data in memory.


RESEARCH HIGHLIGHTS (SEE WWW.EXASCIENCE.COM FOR DETAILS)

Space Weather Simulation
A groundbreaking achievement has been the demonstrated ability of the particle-in-cell simulation software to study real space weather problems. Two first-principle simulations have been carried out:

• A higher-resolution description of a vast region of space on the night side of the Earth, extending for 80 Earth radii from the Earth away from the Sun. In this region we have identified the kinetic-fluid coupling responsible for energy transfer during space weather events.

• A full global Earth environment simulation displaying all the important features relevant to space weather events. This simulation is an absolute first, demonstrating that the implicit moment method central to the approach can indeed be used as a first-principle approach to do global simulations that bridge the micro-macro gap. Before, global simulations could not include kinetic effects. The new result is the first to realistically include electron and ion kinetic physics in global Earth environment simulations.

Pipelined GMRES solver
In the Generalized Minimal Residual method (GMRES), the global all-to-all communication required in each iteration for orthogonalization and normalization of the Krylov basis vectors is becoming a performance bottleneck on massively parallel machines.

The pipelined GMRES solver developed at the lab is a scalable version of such a GMRES algorithm. The method achieves good scalability by removing the expensive global synchronization points from standard GMRES. Impressive speedups have been observed in strong scaling experiments.

This pipelined GMRES solver has been incorporated in the well-known and widely used PETSc library, a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. Details are described in a paper in this report.
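For readers who want to try the solver, the fragment below is a minimal sketch of selecting a pipelined GMRES variant through the standard PETSc KSP interface. It assumes a recent PETSc build in which the pipelined solver is exposed as the KSPPGMRES type, and that the matrix A and the vectors b and x have already been assembled.

    #include <petscksp.h>

    /* Solve A x = b with a pipelined GMRES variant (sketch only). */
    PetscErrorCode solve_pipelined(Mat A, Vec b, Vec x)
    {
        KSP            ksp;
        PetscErrorCode ierr;

        ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
        ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
        /* Pipelined GMRES overlaps the global reductions needed for
           orthogonalization with local computation. */
        ierr = KSPSetType(ksp, KSPPGMRES); CHKERRQ(ierr);
        ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);  /* e.g. -ksp_monitor */
        ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
        ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
        return 0;
    }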

High-performance shared-memory parallel SpMV Library released
Sparse matrix-vector multiplication (SpMV) is an important computational kernel, but it is hard to execute efficiently even in the sequential case. The problems (namely low arithmetic intensity, inefficient cache use, and limited memory bandwidth) are magnified as the core count on shared-memory parallel architectures increases.

A shared-memory parallel SpMV software library was released, using new parallelization techniques that mitigate the effects of unpredictable memory accesses and high bandwidth requirements. It implements various strategies, both sequential and parallel, some of which compare well to, or exceed, the performance of current state-of-the-art software.
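The library's own data structures and interfaces are not shown in this report; purely as an illustration of the kernel being discussed, the following is a textbook CSR-format SpMV with OpenMP row parallelism, which makes the irregular accesses to x and the bandwidth pressure easy to see.

    #include <stddef.h>

    /* y = A * x for a matrix in CSR format (row_ptr has n+1 entries). */
    void spmv_csr(size_t n, const size_t *row_ptr, const size_t *col_idx,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* irregular, cache-unfriendly reads */
            y[i] = sum;
        }
    }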

Cobra
Reliability will be a major issue for future Exascale systems, and software solutions can reduce the resilience cost for hardware. As a proof of concept, we created Cobra, a novel design for a shared-memory work-stealing scheduler that realizes the notion of restartable task graphs and enables computations to survive hardware failures through software.

A comparison with the work-stealing scheduler of Threading Building Blocks on the PARSEC benchmark suite shows that the Cobra implementation incurs almost no performance overhead in the absence of failures, and low performance overhead in the presence of failures.

Shark
Shark – Scalable Hybrid Array Kit – is a high-level programming toolkit that offers n-dimensional distributed grids in the style of the Global Array Toolkit. Besides regular data distributions it supports irregular data distributions, where grids can be arbitrarily cut along coordinate planes. The data can be redistributed at run-time. Shark supports global (e.g. reductions), collective and one-sided communications. It is hybrid and can be configured to use only MPI, only threading (several threading options can be used: pthreads, OpenMP, TBB or Cobra), or a combination of MPI and threading. Shark is implemented as a library for C++11.

Sniper
Sniper is a next-generation parallel, high-speed and accurate x86-64 simulator. This multi-core simulator is based on the interval core model and the Graphite simulation infrastructure, allowing for fast and accurate simulation and for trading off simulation speed for accuracy to allow a range of flexible simulation options when exploring different multi-core architectures. Using this methodology, we are able to achieve good accuracy against hardware for up to 16-threaded applications.

Contact: Luc Provoost ([email protected])


The team of the ExaScience Lab in Leuven


INTEL EXASCALE COMPUTING RESEARCH LAB – PARIS, FRANCE

OVERVIEW
The French Exascale Computing Research (ECR) Lab was the first Exascale lab established by Intel in Europe, in 2010, as a shared effort between the partners CEA, Genci, the University of Versailles-Saint-Quentin-en-Yvelines, and Intel. The mission of ECR is to conduct R&D studies, in co-operation with European researchers, on applications that are critical for industries and HPC users from European universities. It focuses on software for Exascale in two main research areas:

• Application enabling co-design: collaborative research with industry and academia to prepare existing applications for Exascale through close partnership between application developers and the lab. The research addresses specific features, and concentrates on programming models for high scalability, data flows, and numerical performance.

• Software tools and middleware to characterize applications and optimize their performance on future Exascale machines: the work will allow developers to improve the scalability, performance, resource saturation and power consumption of their parallel applications. It will also help hardware designers and compiler builders to optimize their products.

ENABLING APPLICATION CO-DESIGN
Leveraging the capacity of new architectures for progress in science
The purpose of this research area is to analyze selected applications in their full complexity, assess their behavior on prototypes of future architectures, and work in close collaboration with the developers to increase application efficiency as well as current and future scalability. The methodology for performance evaluation developed at the lab is central to these activities. In 2012 the efforts were focused on Geosciences and Molecular Dynamics, working on higher efficiency on production HPC systems and preparing for increased scalability on the Intel® MIC architecture.

Our portfolio in Geosciences covers two major intertwined fields: seismic imaging for Oil & Gas applications, and seismology with open-source HPC research codes. Seismic applications are key for the energy vertical, an industry sector focused on reducing the time-to-solution while increasing the resolution and physical model complexity of the subsurface reconstruction. In our effort on seismic imaging, we have been working on different 3D finite difference kernels targeted at both traditional and MIC architectures. For seismology, we have started exploring shared-memory parallelization strategies for Galerkin finite element methods (for example using a parallel divide-and-conquer algorithm), which were presented at the TERATEC Forum 2012.
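The production seismic kernels themselves are not listed in this report; as a purely illustrative stand-in, the following is a simple 7-point 3D finite-difference update with OpenMP, which already exhibits the memory-access pattern and bandwidth pressure that the real kernels have to optimize for. The grid layout, coefficient and parallelization are assumptions, not the lab's code.

    #include <stddef.h>

    /* One sweep of a 7-point 3D finite-difference stencil: q = p + c * Laplacian(p). */
    void fd_step(int nx, int ny, int nz, const float *p, float *q, float c)
    {
        #define IDX(i, j, k) (((size_t)(i) * ny + (j)) * nz + (k))
        #pragma omp parallel for collapse(2) schedule(static)
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                for (int k = 1; k < nz - 1; k++)
                    q[IDX(i, j, k)] = p[IDX(i, j, k)]
                        + c * (p[IDX(i + 1, j, k)] + p[IDX(i - 1, j, k)]
                             + p[IDX(i, j + 1, k)] + p[IDX(i, j - 1, k)]
                             + p[IDX(i, j, k + 1)] + p[IDX(i, j, k - 1)]
                             - 6.0f * p[IDX(i, j, k)]);
        #undef IDX
    }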

The Molecular Dynamics efforts are driven by the POLARIS(MD) project. The code has been undergoing some major redevelopments in 2012, mostly on developing a threaded OpenMP version and exploiting vectorization opportunities as much as possible. This resulted in reducing the time to solution of the production version of the code by 60%. The scientific capacities of POLARIS(MD) are being enhanced thanks to the resulting increased efficiency and throughput.

Finally, the team collaborated with Institut Camille Jordan (CNRS & Université Lyon 1) on Z-code, a multi-resolution code for stiff reaction-diffusion problems, which shares many implementation challenges with adaptive mesh refinement (AMR) applications. The corresponding work, presented as a contribution in this report, paves the way to working with a broader class of applications in domains as diverse as biomedical engineering, combustion or plasma physics.

SOFTWARE TOOLS FOR APPLICATION CHARACTERIZATION AND PERFORMANCE OPTIMIZATION
Helping hardware, middleware, and application software developers
Understanding the interaction between applications, algorithms, and the underlying software stack and hardware is critical for achieving optimal performance, even more so in an Exascale environment. Therefore, the lab is working on several projects, which are all being validated with real-world applications. The results of these projects have been presented and published at various conferences and workshops.

• Programming models and runtime systems: the lab contributes to the development of MPC, a unified runtime environment for parallel applications on very large clusters of multi-core/multiprocessor NUMA nodes. MPC is designed to be scalable and to provide maximum performance to the application, while keeping the memory footprint of the runtime as low as possible. The latest MPC release, 2.4.1, was published in December 2012, offering support for pthreads, OpenMP 2.5, MPI 1.3 and hybrid programming with MPI and threads. Besides the de-facto standards MPI and OpenMP, the lab also investigates emerging programming models such as PGAS, Cilk, and CnC (see the article on RTM in this report). A minimal hybrid MPI+OpenMP example of the kind of program such a unified runtime executes is sketched after this list.

• A set of performance tools and a methodology was created to help users identify performance problems quickly and evaluate potential gains through optimizations, with special focus on the analysis of memory behavior and single-node optimization, and in collaboration with renowned performance tool development teams world-wide. The tools were presented at the 9th VI-HPS Tuning Workshop in April 2012, organized and hosted by UVSQ with support from ECR and TGCC, and at the 10th VI-HPS Tuning Workshop in Munich in October 2012, and will be presented at the upcoming 11th VI-HPS Tuning Workshop in April 2013, organized again by UVSQ as part of the French PRACE Advanced Training Center (PATC) activities.

• An application characterization framework, which extracts the hot code segments and necessary parameters from an application, performs systematic analysis of these segments, and derives optimization recommendations and performance predictions. This framework is designed to deliver critical and fine-grain information for tuning the overall performance of an application.
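As referenced in the first item above, here is a minimal hybrid MPI+OpenMP program of the kind a unified runtime such as MPC targets; it uses only standard MPI and OpenMP calls (no MPC-specific interfaces) and requests MPI_THREAD_FUNNELED because only the master thread communicates.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nprocs;

        /* Only the master thread calls MPI, so FUNNELED support is enough. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local = 0.0, global = 0.0;
        #pragma omp parallel reduction(+:local)
        {
            local += omp_get_thread_num() + 1.0;   /* per-thread partial result */
        }

        /* Combine the per-process partial sums across the cluster. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum over %d processes: %g\n", nprocs, global);

        MPI_Finalize();
        return 0;
    }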

Contact: Marie-Christine Sawley ([email protected])

Designing ... reviewing ... implementing ...


Automatic Load-balancing Hybrid MPI+OmpSs and its Interaction with Reduction Operations

M. Garcia¹٫², V. Lopez¹, J. Labarta¹٫²
¹ Barcelona Supercomputing Center (BSC), ² Technical University of Catalonia (UPC)

ABSTRACT
Load imbalance is one of the major causes of inefficiency in parallel programs and its impact will only become worse as we approach the Exascale. The task-based approach of the OmpSs programming model decouples computations from resources and thus enables the runtime to implement very dynamic and flexible resource-allocation policies. In hybrid MPI+OmpSs programs this gives the opportunity to dynamically balance the load between processes within a node. This paper describes experiences, implications and some results towards the implementation of an automatic load-balancing mechanism within the nodes of MPI+OmpSs programs. Important findings of the study relate to the interaction between dynamic load balancing and the implementation of sparse reductions on large vectors. Although not a global optimization, our approach comes for free for the application developer and can be extremely useful to complement coarse-grain user-level load-balancing techniques.

INTRODUCTION
Load imbalance is one of the major causes of inefficiency in parallel programs. Generally described as the overall inefficiency caused by an uneven distribution of work among threads or cores, it is widely acknowledged as a key bottleneck in today's codes and a major roadblock on the way to Exascale. On the other hand, the fact that there is no single widespread way of measuring it (some studies use the standard deviation of the load between processors, others the difference or ratio between the minimum and maximum load, ...) or that performance tools do not report it as a first-order metric is an indication of the still insufficient quantitative awareness of the importance of the problem.

OmpSs [1] is a programming model developed at the Barcelona Supercomputing Center that extends the OpenMP task directives with directionality clauses for data used within the task. Based on these clauses, dependences can be dynamically computed at runtime, thus supporting unstructured parallelism and an asynchronous execution model. By focusing on tasks, OmpSs supports a huge malleability of the applications. By just specifying tasks, the programmer leaves to the runtime the responsibility of mapping them to the available resources. This subtle difference with thread-based models enables the correct execution of an OmpSs program in an environment where the actual number of resources changes dynamically depending on availability or overall system workload. The runtime may use a variable number of threads along the execution of the process, with the only limitation being task granularity.

The malleability of OmpSs processes can be used to achieve load balance in MPI+OmpSs applications. When several MPI+OmpSs processes are run in one node and one of them reaches a blocking MPI operation, it can lend its core(s) to another process, helping it to finalize its execution as soon as possible and keeping all the resources in the node busy. When all the processes have reached the blocking MPI operation, the initial allocation of resources can be recovered. This mechanism is totally dynamic and can be used to improve the application behavior in cases where there is not only a static algorithmic load imbalance among processors, but also if the imbalance has a dynamic structure or even if it derives from temporary resource preemptions (i.e. to compensate for OS noise perturbations).

In [2] we presented an initial implementation of a Dynamic Load-Balancing library (DLB) for hybrid MPI+OpenMP programs, and the LeWI (Lend When Idle) policy. In [3] we reported on an initial implementation based on the MPI+SMPSs infrastructure. SMPSs is a precursor of OmpSs which, although not being as flexible in terms of task specification, does share the general concept and malleability. In this paper we describe some of the issues encountered in the original implementation of DLB, how we have improved it, and experiments carried out during this year that help us to better understand the potential and implications of the dynamic load-balancing technique. A particularly interesting insight has been gained on the interaction between the dynamic load-balancing technique and the implementation of reductions on large sparse vectors. This experience was the result of applying DLB to JusPIC, a particle-in-cell code from Jülich Supercomputing Center. An MPI+SMPSs version of the code had been developed by Dirk Broemmel from JSC and was used as a starting point in our experiments.

FIGHTING THE LINUX SCHEDULER
The original DLB/LeWI mechanism relied on instantiating many threads per process but putting to sleep those threads whose id was larger than the number of cores allocated to the process. This avoids oversubscribing resources, and the underlying Linux kernel should not introduce preemptions for time-sharing purposes. The runtimes of the different processes have access to a shared arena where each process notifies, when it enters a blocking call, how many cores it makes available to other processes. After notifying it, the threads of the blocking process go to sleep using the pthread_cond_wait call. The main threads of all processes check if there are extra cores available every time they enter the scheduling routine. If so, they reserve the desired number of cores and wake up some of their sleeping threads using the pthread_cond_signal call. This mechanism tries to hand over cores between processes, but has the problem that it totally relies on the underlying Linux kernel to sleep and wake up the threads using its default scheduler. Many nasty effects can occur, for example a thread being awakened before the sleeping one is actually suspended, a thread being awakened preempting a thread doing useful computation, etc.

In order to better understand the effects, we instrumented some simple benchmarks with information about where each task was running. Figure 1 shows some examples of the impact of these preemptions. In this trace, two MPI processes are running on the same node, each one with a different load and DLB enabled. The event displayed is cpuid, so each color in the trace represents the core a thread is running on. As we can see, some threads do migrate between cores, and we also observed that several threads end up time-sharing the same cores at the same time. Other Paraver views, showing the percentage of CPU that each thread was getting during the execution of each task, did corroborate the importance of the perturbation caused by preemptions.

Figure 1: Timeline where the color represents the core where each thread is running. Black represents a blocked thread.

Although the coordination between processes at the library level ensures that there is no significant overall resource overcommitment, the fact that the runtime decisions are not directly coordinated with the kernel scheduler results in poor global decisions. Even if a Linux* kernel is designed to try and separate processes onto different cores when the resources are not overcommitted, the time it takes will often be larger than the internal granularity of the parallel application. An alternative would be to coordinate the runtime library and the kernel scheduler by providing hand-off scheduling functionalities in the kernel API. Unfortunately, patching the kernel is not an ideal solution, as many production installations would have serious objections to using externally provided patches. Trying to push modifications into an official Linux* distribution is also an extremely long process.

We thus tried to use mechanisms already supported by Linux that would restrict its scheduling decisions and end up in a better coordination between the runtime library and the kernel. We tried different approaches where some of the processes were pinned to cores in order to somehow restrict the migration of threads. Our final approach pins every single thread of each process to a core, and we explicitly hand off a core from one application to another. By keeping track of the state and mapping of every thread in each node, the DLB library can wake up the desired thread.

Figure 2 shows an execution of the same microbenchmark application using this new optimization in DLB. For the sake of comparison, the trace is on the same time scale as Figure 1. In this case, every single thread runs on its pinned core and no other thread of the application uses the same core at the same time.

Figure 2: Thread-to-core mapping under DLB. [Please note that there are two different tones of purple, two of light red and one of dark red; although some of them seem the same color, they are not.]

In the SMPSs implementation, it is still only the main thread that checks for the availability of additional cores. This means that the granularity of the tasks executed by the main thread determines how fast a process realizes that it can use more cores. In order to improve the responsiveness, it would be interesting to include this check in the scheduling code of every thread.

A first version of the OmpSs runtime integrating some of the experiences from our SMPSs work is already in place. The porting work has focused on implementing the basic support within the OmpSs runtime of the mechanism to enable the dynamic change of the number of active threads per process, and the integration of the shared memory mechanism for communication between processes. Although very preliminary, the implementation aims at providing a general infrastructure that will be usable not only for MPI+OmpSs applications but also in multiprogrammed independent OmpSs workloads.
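The DLB library itself is not reproduced here; the fragment below is only a minimal sketch of the two ingredients described above, pinning every thread of a process to a fixed core with pthread_setaffinity_np and waking one specific sleeping thread through its own condition variable when a core is handed over. The structure and function names are illustrative, not DLB's API.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  wake;
        int             enabled;  /* 1 while this thread owns its core */
    } worker_state_t;

    /* Pin the calling thread to one fixed core (one thread per core). */
    static void pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* Called by a worker whose process has lent its core away. */
    static void sleep_until_reclaimed(worker_state_t *w)
    {
        pthread_mutex_lock(&w->lock);
        while (!w->enabled)
            pthread_cond_wait(&w->wake, &w->lock);
        pthread_mutex_unlock(&w->lock);
    }

    /* Called by the runtime of the process that receives the core.
       Because every thread is pinned, waking exactly this thread puts
       work back on exactly the core that was handed over. */
    static void hand_over_core(worker_state_t *w)
    {
        pthread_mutex_lock(&w->lock);
        w->enabled = 1;
        pthread_cond_signal(&w->wake);
        pthread_mutex_unlock(&w->lock);
    }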


REDUCTIONS AND LOAD BALANCE
Reductions play a very important role in the algorithmic structure within an MPI process in JusPIC, especially in the particle push phase. Every iteration of the loop over particles updates potentially very scattered values within a very large 3D array. The actual computation of the value to accumulate is relatively small, and a significant part of the time is thus taken by the updates. The distribution of particles between processes is very dynamic along the program execution, leading to potentially very important load imbalances.

Our experiments with JusPIC using the first implementation of DLB showed significant preemption effects. With our modified DLB library the preemption problems disappeared, but we were still not obtaining the desired performance compared to the original pure MPI execution. We describe in this section the studies performed in order to understand the effect of DLB on some of the techniques to parallelize reductions with indirect addressing on large arrays.

Reductions on large vectors
Two basic approaches can be used to parallelize scattered reductions on large vectors in a shared memory model. The first approach is to ensure the atomicity of the updates by encapsulating them within critical or atomic statements. This approach incurs a significant overhead if the computation is small and fine-grain locking of individual updates is done. The other approach is to expand (privatize) the reduction array such that the loop iterations are totally parallel. This approach has the problem of the overhead to allocate and initialize the private copies, as well as the need to perform the final reduction from the privatized copies into the original array using the previous atomicity-based mechanisms. In cases (as ours happens to be) where the size of the reduced data structure is large and the access pattern can be quite sparse, the overhead will be proportionally very high.

At the conceptual level, a pure MPI or a pure SMPSs version of a program implementing the reduction by privatizing the vector should be very similar. Nevertheless, MPI versions keep track of particles that move to other domains, precisely identifying the actual data that needs to be shared or communicated between cores. The blind replication and update of all global data in the typical shared memory approach results in a significant replication of updates that are essentially useless. Mimicking, at the shared memory level, the MPI practice of keeping track of what really needs to be updated in the global vector is very important to achieve efficient reductions. This is actually done in our reduction-related work as part of the Intel-BSC Exascale Center activities.

When combining DLB with the privatized shared-memory parallelization with SMPSs, one additional issue arises. As the number of threads that each process will use depends on the actual load imbalance and resource reallocations, we do not really know how many private buffers we need to allocate and initialize. An approach of allocating (and reducing into the global array) one private copy per instantiated thread would imply a lot of computational overhead. Figure 3 shows two iterations of the JusPIC code under DLB. We see in purple (push_particle_2d) the tasks computing the contribution of the particles to the privatized array, and in green (reduce) the accumulation of those contributions to the global array. Some processes do not perform any local contribution and reduction, while others start with 4 threads but finish with 8.

Figure 3: Two iterations of the timeline of the tasks being executed by every thread in a run with 16 MPI processes, each of them with up to 8 threads, on 8 nodes of 8 cores.

Malleable reductions on large vectors
Having identified the issues of the reduction on large vectors, we investigated potential approaches to reduce such overheads as much as possible. The technique used is as follows:

• Thread privatization of the vector. Only one instance per thread is allocated and initialized to the neutral value of the reduction operator before the loop with the reductions is entered for the first time. Besides the actual buffer, a scalar per thread is used to indicate whether the buffer has been updated (contains some values different from the neutral value of the reduction) or not.

• Every task finds out the core where it is running and uses the corresponding buffer. It also updates the scalar to tag that the buffer does contain some value. There is no need for mutual exclusion, as tasks are non-preemptively scheduled to threads by the runtime.

• As many reduction tasks as the maximum number of threads are instantiated. Dependences are set such that the reduction tasks wait for all computation tasks to finish.

• The reduction tasks check the scalar of their corresponding buffer, and if the buffer has not been updated the task terminates immediately. Thus only the tasks whose buffers contain some value incur the significant overhead of its reduction into the original vector.

• If the buffer has been modified, the reduction task does the update on the original vector and also resets the buffer to the neutral element. In this way, by not deallocating the buffers at the end of the loop, they can be reused every time the loop is entered (in the outer time-stepping loop) without needing further initialization traversals. A compact sketch of this buffer-per-thread scheme is given below.
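The following sketch condenses the scheme above into plain C; it uses ordinary functions instead of the actual SMPSs/OmpSs task annotations, and the structure and function names are illustrative only. In the real runtime the reduction tasks are serialized on the global vector through dependences, which the sketch only notes in a comment.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        double *buf;    /* privatized copy of the reduction vector      */
        int     dirty;  /* set once the buffer holds real contributions */
    } red_buf_t;

    /* One buffer per potential thread, initialized to the neutral element
       (0.0 for a sum); buffers are kept alive across time steps. */
    static red_buf_t *alloc_buffers(int max_threads, size_t n)
    {
        red_buf_t *r = malloc(max_threads * sizeof *r);
        for (int t = 0; t < max_threads; t++) {
            r[t].buf = calloc(n, sizeof(double));
            r[t].dirty = 0;
        }
        return r;
    }

    /* Computation task: scatter contributions into the buffer of the core
       it runs on; no locking is needed because tasks are scheduled
       non-preemptively to threads. */
    static void push_contributions(red_buf_t *r, int my_core,
                                   const size_t *idx, const double *val, size_t m)
    {
        r[my_core].dirty = 1;
        for (size_t p = 0; p < m; p++)
            r[my_core].buf[idx[p]] += val[p];
    }

    /* Reduction task (one per potential thread): skip untouched buffers,
       otherwise fold into the global vector and reset to the neutral value.
       In the task-based version these reductions carry an inout dependence
       on the global vector, so they do not race with each other. */
    static void reduce_buffer(red_buf_t *r, int t, double *global, size_t n)
    {
        if (!r[t].dirty)
            return;
        for (size_t i = 0; i < n; i++)
            global[i] += r[t].buf[i];
        memset(r[t].buf, 0, n * sizeof(double));
        r[t].dirty = 0;
    }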

A further optimization is to allocate one buffer less than the number of threads, leaving the main thread to directly use the original vector. The approach results in very significant gains compared to the original scheme, and it now competes with the MPI-based implementation. Some results are shown in the following table:

Total cores | MPI procs | SMPSs threads | DLB | Time (s)
64          | 64        | 0             | No  | 34.97
64          | 16        | 4             | No  | 37.54
64          | 16        | 4             | Yes | 35.73
64          | 32        | 2             | Yes | 32.09

The attempt to understand the poor behavior of the initial MPI+SMPSs implementations as compared to the pure MPI implementation for the reductions pattern has been extremely useful to understand how the programming model and its implementation should handle reductions on large structures. We do believe that the mechanism implemented manually in this work will have to be integrated in the programming model, thus achieving good performance in a dynamic resource allocation environment and with very simple syntax. We are starting work on support for reductions in OmpSs leveraging the User Defined Reductions mechanism included in the recent OpenMP 3.1 standard.

CONCLUSION
In this paper we have described the implementation issues, experiences and implications of a dynamic load-balancing runtime that tries to maximize the utilization of the cores within each node in hybrid MPI+OmpSs applications. The essential idea is to detect when a process enters blocking MPI calls and then lend its cores to other processes in the node that might have more tasks to execute.

The first version of the runtime and the experiments were implemented in SMPSs, a predecessor of the OmpSs model and runtime. A first prototype in OmpSs has been developed, although it still has to be extended to include all the features identified in the SMPSs experiences.

Although initially targeted at hybrid MPI+OmpSs applications, we envisage that the idea of coordinating different applications in a shared memory node will in the future be applied to general-purpose multiprogrammed workloads where each of the processes uses tasks to expose very dynamic parallelism profiles.

REFERENCES

[1] A. Duran, E. Ayguade, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. “OmpSs: a proposal for programming heterogeneous multi-core architectures”. Parallel Processing Letters, 21(2):173-193, 2011.

[2] M. Garcia et al. “LeWI: A Runtime Balancing Algorithm for Nested Parallelism”. ICPP 2009.

[3] M. Garcia, J. Corbalan, R. M. Badia, and J. Labarta. “A Dynamic Load-Balancing Approach with SMPSuperscalar and MPI”. Facing the Multicore-Challenge II, 2011.



High-precision Performance and Power Measurements with Low Overhead

H. Servat¹, J. Gimenez¹٫², J. Labarta¹٫²
¹ Barcelona Supercomputing Center (BSC), ² Technical University of Catalonia (UPC)

ABSTRACT
Understanding the correlation between power consumption and performance in real programs and current machines is an important element that will help us improve the efficiency of future systems. This paper describes the work being done as part of the Intel-BSC Exascale Lab to precisely and simultaneously characterize the instantaneous power consumption and performance of real programs. The technique combines instrumentation- and sampling-based acquisition of processor and power hardware counters to achieve very fine-grain measurements with minimal overhead.

INTRODUCTION
Precise measurement of programs in terms of both performance and power consumption is becoming an extremely important issue in order to understand how to optimize the overall efficiency we obtain from our computing resources. This is particularly relevant in the parallel programming context targeting Exascale performance levels, where power consumption is a limiting constraint.

Real large-scale production codes are complex pieces of software. Instrumenting and measuring such large applications may be quite cumbersome, requiring in many cases [1][2] special builds of the application where the compiler injects into the binary, for each (or selected) functions or loops, the monitoring hooks that will obtain the raw performance data. Tools to analyze unmodified production binaries typically inject the hooks through the LD_PRELOAD mechanism for dynamically linked libraries. An intermediate approach is to use dynamic instrumentation [3] or binary rewriting infrastructures [4], where the instrumentation is injected into an already-built binary, or even at application load time.

The LD_PRELOAD mechanism requires no modification of the production binaries. It allows the tools to provide instrumented versions of dynamically linked libraries, typically of the runtime (MPI, OpenMP, etc.). With this approach, performance data (timing, hardware counters, call stack) is only captured when the application invokes the runtime. Although potentially reflecting the deep underlying algorithmic and computational structure of the application, it has two potential drawbacks: it does not directly reflect the syntactic structure of the code; and the distance between two runtime calls can be quite large, thus making the measurements too coarse grained.

A mechanism was introduced in [5] to obtain very fine-grain instantaneous performance metrics with very low overhead. The approach relies on the repetitive structure of many applications. The technique simultaneously uses instrumentation and low-frequency sampling-based acquisition, and the collected data is then combined ("folded") to obtain a characterization of a representative instance of each distinct computation interval between calls to the programming-model runtime. The mechanism was originally applied to classic processor activity hardware counters such as cycles, instructions or cache misses.
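As a rough illustration of the folding step, the sketch below projects the sparse samples collected over many instances of the same instrumented region onto a single synthetic instance. The data layout and names are ours for illustration only, not those of the actual tool.

    #include <vector>

    // One low-frequency sample taken somewhere inside an instance of the
    // instrumented code region.
    struct Sample { double time; long long counter; };

    // Instrumented boundaries of one instance of that region, together
    // with the samples that happened to fall inside it.
    struct Instance {
        double t_begin, t_end;        // entry/exit timestamps
        long long c_begin;            // counter value read at entry
        std::vector<Sample> samples;
    };

    // A folded point: relative position inside a synthetic representative
    // instance (0..1) and the counter increment accumulated since entry.
    struct FoldedPoint { double phase; double progress; };

    // Fold the samples of all instances onto one representative instance.
    // The resulting cloud of points is what the fit of the cumulative
    // metric is computed from.
    std::vector<FoldedPoint> fold(const std::vector<Instance>& instances) {
        std::vector<FoldedPoint> folded;
        for (const Instance& inst : instances) {
            const double length = inst.t_end - inst.t_begin;
            for (const Sample& s : inst.samples)
                folded.push_back({ (s.time - inst.t_begin) / length,
                                   double(s.counter - inst.c_begin) });
        }
        return folded;
    }

Because many instances contribute samples at different relative positions, even a very low sampling frequency yields a densely populated representative instance.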

We are interested in applying the mechanism to the power-related counters available in Sandy Bridge processors through the RAPL mechanism [6]. These counters are also accessible through PAPI, but they have two important characteristics that differentiate them from regular processor counters:

• Time discretization: the values reported by RAPL are only updated at a 1 kHz frequency, which means that the incremental value returned between two reads can be 0 if they are separated by less than 1 ms.

• Power quantization: the measurement is only reported in increments of 15.2 µJ. This introduces additional noise in the measurement that further complicates the use of the folding technique.

These two characteristics introduce significant noise into the measurements, especially at high acquisition frequencies. Our objective is to report the instantaneous power consumption of a region of code with a precision better than the 1 ms and 15.2 µJ that constitute the limits of the measurement system. For that we need to process the captured data to significantly improve the signal-to-noise ratio.
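The impact of these two limits on a naive power estimate can be made explicit in a few lines. The constants are the values quoted above; the functions are an illustration, not the PAPI or RAPL reading interface.

    #include <cstdint>

    const double ENERGY_QUANTUM_J = 15.2e-6;  // energy counter tick: 15.2 uJ
    const double UPDATE_PERIOD_S  = 1.0e-3;   // counter refreshed at 1 kHz

    // Average power between two reads of the cumulative energy counter.
    // If dt_s is below UPDATE_PERIOD_S the increment may simply be zero,
    // and in general the result is only resolved in steps of quantum/dt.
    double average_power_w(std::uint64_t ticks_before,
                           std::uint64_t ticks_after,
                           double dt_s) {
        return (ticks_after - ticks_before) * ENERGY_QUANTUM_J / dt_s;
    }

    // Smallest power step resolvable for a given read interval:
    // about 15.2 mW at 1 ms between reads, or 3.8 mW at 4 ms.
    double power_resolution_w(double dt_s) {
        return ENERGY_QUANTUM_J / dt_s;
    }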

FOLDING POWER SAMPLES
To address the specific issues arising from the characteristics of the power counters, we have modified the data gathering and the folding algorithm in three different directions: reducing the number of counter reads, improving the outlier removal, and reducing the noise.

In the original folding description [5] the data generation process emitted the performance counter values at the beginning and at the end of the instrumented code region. The folding mechanism normalized the performance counters to minimize the impact of potential perturbations suffered by different instances of the code region being measured. We observed that measuring power at instrumentation points introduced significant noise, as these measurements could occur very close to the sampling points, making the discretization and quantization effects important. We removed the power counter acquisition at the end of the instrumented region in order to at least partially reduce this noise. We still read the power counter at the beginning of the region to have an initial reference. The folding can still be applied by using the absolute values from the beginning of the region instead of normalized values. Figure 1 shows the typical stepped distribution of folded samples (red crosses) due to the discretization of the power counter updates. For this figure, the sampling frequency was 250 Hz (4 ms between samples), which means that each code region instance (16.12 ms) will have between 4 and 5 samples. Observe that the width of the steps is in the order of 1 ms: the discretization unit of the acquisition instrument.

Based on the captured data we performed a first Kriging fit of the cumulative energy since the beginning of the iteration (green curve fit and blue counter rate in Figure 1). In a second step, a correction was made to compensate for the discretized counter updates. The approach looks collectively at all samples belonging to every instance of the code region. For each such instance, the average distance of its samples to the fitted curve is computed and a new set of folded points is obtained (purple crosses in Figure 1) by collectively shifting them toward the first fitted curve. All code region instances whose shifted samples fall far away from the fitted curve (beyond 2 standard deviations) are considered outliers and discarded. Finally, a second fit of the corrected folded samples is performed. Given that the power data was originally very noisy, we used in the first fit a large control parameter, whose effect is that of a low-pass filter: it makes the fit change relatively slowly and makes it less sensitive to noise.
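One possible reading of this correction, written out as code: the 2-sigma test is applied here to the per-instance shifts, which is our interpretation of the outlier criterion; the container layout is again illustrative and every instance is assumed to hold at least one sample.

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Folded sample: relative position in the instance and cumulative
    // energy since the instance entry.
    struct Folded { double phase; double energy; };

    // Shift the samples of each instance as a block towards the first
    // (heavily smoothed) fit, then discard the instances whose shift is an
    // outlier.  The returned samples feed the second fit.
    std::vector<std::vector<Folded>> correct_and_filter(
            std::vector<std::vector<Folded>> per_instance,
            const std::function<double(double)>& first_fit) {
        std::vector<double> shift(per_instance.size(), 0.0);
        for (std::size_t i = 0; i < per_instance.size(); ++i) {
            for (const Folded& p : per_instance[i])
                shift[i] += p.energy - first_fit(p.phase);
            shift[i] /= per_instance[i].size();     // mean distance to the fit
            for (Folded& p : per_instance[i])
                p.energy -= shift[i];               // collective shift
        }
        double mean = 0.0, var = 0.0;
        for (double s : shift) mean += s;
        mean /= shift.size();
        for (double s : shift) var += (s - mean) * (s - mean);
        const double sigma = std::sqrt(var / shift.size());

        std::vector<std::vector<Folded>> kept;
        for (std::size_t i = 0; i < per_instance.size(); ++i)
            if (std::fabs(shift[i] - mean) <= 2.0 * sigma)
                kept.push_back(std::move(per_instance[i]));
        return kept;
    }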

An example of the result is shown in Figure 2, where we observe how the power-derived metrics (package and DRAM power) do change between phases, although the method does not reach the precision it achieves when applied to the hardware counters of the core. Ongoing work will focus on increasing the precision by tightening the control parameter of the second Kriging fit (as the data has less variability at this stage) and through different ways of estimating the correction to apply to the original data.

Figure 1: Folded samples and fit before (red crosses) and after (purple crosses along the diagonal) the correction. The plot shows the PACKAGE_ENERGY:PACKAGE0 counter (normalized and as a rate in Mevents/s) over the normalized time of one 16.12 ms instance.

Figure 2: Instantaneous performance (top: MIPS and L1 D-cache miss rate) and power (bottom: package and DRAM, in Watts) for a code with 4 phases of different activity (noops, integer, floating point or memory intensive).




Figure 6 shows the instantaneous evolution of the DRAM power consumption and the Last Level Cache miss rate. The figure also shows the time span of the different routines called within the compute region. This information is reported by the tool and is obtained by also folding the call stack events. For the 8PPM case we observe for some routines a 3x larger power consumption than in the 1PPM case, while this ratio is only ~1.3x for other routines. We also see that the total elapsed time goes from 225.81 to 313.37 ms per iteration, 1PPM being the faster configuration. With the data processed by our folding method, we can report a lot of information on the different routines (their duration, performance, and power consumption) even if their duration is below the sampling period. For example, the duration and power of routine qleftright when run in the 1PPM configuration are 18 ms and 7 W respectively, while it takes 37.5 ms and 20 W when 8 processes share the socket. From these numbers we can see that a socket delivers 3.85x more performance at the expense of 2.85x more DRAM power when packing 8 processes in a node. The corresponding numbers for routine riemann are 6.7x the performance for a 1.3x DRAM power cost.
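For reference, the quoted ratios for qleftright follow directly from these measurements, assuming all 8 processes execute the routine concurrently on the socket:

    \[
      \frac{\mathrm{perf}_{8\mathrm{PPM}}}{\mathrm{perf}_{1\mathrm{PPM}}}
        = 8 \times \frac{18\,\mathrm{ms}}{37.5\,\mathrm{ms}} \approx 3.85,
      \qquad
      \frac{P_{8\mathrm{PPM}}}{P_{1\mathrm{PPM}}}
        = \frac{20\,\mathrm{W}}{7\,\mathrm{W}} \approx 2.85 .
    \]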

Packing more processes per node is expected to be more power efficient, as some of the static power costs are amortized over more workers, but the example shows how different routines may exhibit significant differences in such power-efficiency improvements. The experiment also hints that very synchronous parallelization approaches, where the same routine is simultaneously run on all cores, may not be the most energy-efficient way of programming our systems. A key advantage of our methodology is that it can identify these phenomena even if the duration of the routines is below the sampling period.

CONCLUSION
In this paper we have described techniques being developed to compute the very fine-grain instantaneous evolution of power and performance metrics. The measurements reported shed light on the interaction between performance and power and suggest new ways in which the system should be used to optimize power efficiency.

Although this is still ongoing work, we believe these techniques have a lot of potential, not only for power-related metrics, but for any environment where the basic acquisition system does not support very precise measurements and high-frequency acquisition rates.

REFERENCES

[1] S. Shende and A. D. Malony. "The TAU Parallel Performance System." International Journal of High Performance Computing Applications, 20(2):287-311, Summer 2006.

[2] A. Knüpfer, C. Rössel, D. an Mey, S. Biersdorff, K. Diethelm, D. Eschweiler, M. Gerndt, D. Lorenz, A. D. Malony, W. E. Nagel, Y. Oleynik, P. Saviankou, D. Schmidl, S. Shende, R. Tschüter, M. Wagner, B. Wesarg, and F. Wolf. "Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir." In Proc. of the Intl. Parallel Tools Workshop, Dresden, Germany, 2011.

[3] J. K. Hollingsworth, B. P. Miller, and J. Cargille. "Dynamic Program Instrumentation for Scalable Performance Tools." SHPCC, May 1994.

[4] V. J. Reddi et al. "PIN: A Binary Instrumentation Tool for Computer Architecture Research and Education." In Proc. of the Workshop on Computer Architecture Education, 2004.

[5] H. Servat, G. Llort, J. Gimenez, and J. Labarta. "Detailed Performance Analysis Using Coarse Grain Sampling." Euro-Par Workshops (Workshop on Productivity and Performance, PROPER), 2009, pp. 185-198.

[6] E. Rotem, A. Naveh, A. Ananthakrishnan, D. Rajwan, and E. Weissmann. "Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge." IEEE Micro, 32(2):20-27, 2012. doi:10.1109/MM.2012.12

Figure 3: SPEC benchmarks (437.leslie3d, 444.namd, and 481.wrf from top to bottom): instantaneous MIPS (black), DRAM power (blue) and package power (red).


Figure 6: HydroC DRAM power consumption and Last Level Cache miss rates when run with 1 process per node (top) or 8 processes per node (bottom).


SOME RESULTS
Even with lower precision for the power counters than for the core counters, the method can still be applied to real codes in order to show the correlation between performance and power metrics. We applied the method to some sequential codes from the SPEC, NAS and STREAM benchmarks and show the results in the following figures.

The sampling frequency used is very low (50 Hz, i.e. 20 ms between samples), as in gprof, to minimize the perturbation of the run. A low sampling frequency also reduces the relative impact of the quantization error, as each individual measurement reports a large energy. The number of samples within each instance of the iterative computation in the above codes ranges from just 9 for leslie3d up to 90 for is.C, or even 677 for namd. We can see how, overall, the power consumption has significantly less variability than the performance, both across applications and within each application. Some applications do show phases in the power metrics, especially the DRAM consumption, and although there is some correlation to the MIPS rate, the relationship is not one of direct proportionality.

Curiously, STREAM is the benchmark that consumes the most power (Figure 5), but it does so uniformly along the whole iteration, irrespective of the internal phases. The MIPS rate is different for each phase, as each has a different mix of additional instructions beyond the load/store instructions. Memory accesses constitute the bottleneck of this benchmark, keeping the DRAM power consumption at its maximum value.

We have also performed some experiments with parallel applications. We ran the HydroC PRACE benchmark with 8 MPI processes but with two different process-to-node mappings. In one case (1PPM for future reference) 8 nodes are used with only one process per node, and in the other case (8PPM) a single node is populated with all 8 processes.

Figure 4: NAS benchmarks (is.C, bt.C, and lu.B from top to bottom): instantaneous MIPS (black), DRAM power (blue) and package power (red).

Figure 5: STREAM benchmark: instantaneous MIPS (black), DRAM power (blue) and package power (red). The four internal phases (copy, scale, add and triad) can be identified in the MIPS rate.



The Scalability of the Cluster-Booster Concept
A critical assessment of the DEEP architecture

Damian Alvarez Mallon 1
Norbert Eicker 1
Maria Elena Innocenti 2
Giovanni Lapenta 2
Thomas Lippert 1
Estela Suarez 1

1 Jülich Supercomputing Center, Forschungszentrum Jülich
2 Katholieke Universiteit Leuven

ABSTRACT
Cluster computers dominate high-performance computing (HPC) today. The success of this architecture is based on the fact that it profits from the improvements provided by mainstream computing, well known under the label of Moore's law. But getting to Exascale within this decade might require additional endeavors beyond surfing this technology wave. In order to find possible directions for the future, we review Amdahl's and Gustafson's thoughts on scalability. Based on this analysis we propose an advanced architecture combining a Cluster with a so-called Booster element comprising accelerators interconnected by a high-performance fabric. We argue that this architecture provides significant advantages compared to today's accelerated clusters and might pave the way for clusters into the era of Exascale computing. The DEEP project is implementing this concept. Six applications from fields having the potential to exploit Exascale systems are being ported to DEEP. We analyze one application in detail and explore the consequences of the constraints of the DEEP system on its scalability.

1. INTRODUCTION
Today's HPC architectures are dominated by cluster computers. The prosperity of this concept stems from the ability to profit from the advances in mainstream computing during the last decades. The modular setup of clusters allows for the replacement of single components, improving the overall performance of such systems. For example, a new generation of CPUs providing more compute power might replace their predecessor without the necessity to change the high-speed fabric or the complex software stack at the same time. With Petascale systems in production today, the next goal in HPC is to reach Exascale by the end of the decade. Obviously, this target introduces new challenges. First of all there are technological problems like energy efficiency or resiliency to be overcome. Furthermore, it is questionable whether general-purpose CPUs will still be competitive, from an energy-efficiency point of view, with more specialized solutions like accelerators, namely GPGPUs¹. Since the way accelerators are employed in today's systems limits their scalability significantly, it will become a necessity to review the idea of the cluster architecture in HPC in order to prolong its success into the future.

To get a better idea of the direction in which clusters have to be developed, we re-investigated the requirements set by applications potentially capable of using Exascale systems. On the one hand there are highly scalable parallel kernels with simple communication patterns. On the other hand, less scalable kernels will limit the overall scalability of the application. Typically, the latter show much more complex communication patterns.

Based on this analysis we have proposed a novel HPC architecture combining a classical cluster system with a so-called Booster system comprising accelerator nodes and a high-speed interconnect with a less flexible but very scalable torus topology. The basic idea is to use the Booster for the highly scalable kernels of potential Exascale applications. The Cluster part of the system is reserved for the less scalable and more complex kernels. For a high-performance deployment of an Exascale application on the Cluster-Booster system it will become inevitable to make use of both parts concurrently. The Cluster-Booster concept is the basis of the EU Exascale project DEEP [2]. DEEP implements the Booster using Intel® Xeon Phi™ coprocessors [13] and the EXTOLL interconnect [15]. The first commercially available Intel® Xeon Phi™ product, developed under the code name Knights Corner (KNC), acts as the accelerator processor in the Booster nodes, while the EXTOLL fabric developed at the University of Heidelberg serves as the Booster interconnect. Enabling application developers to make use of the DEEP system will require a complex software stack to be designed and implemented during the three-year project. Six different applications are being ported to the system.

This paper is organized as follows: in the next section we explore the space of HPC architectures by analyzing results taken from the TOP500 list. Section 3 identifies the challenges arising from the goal to reach Exascale by the end of the decade. In section 4, general ideas concerning scalability in HPC are revised. Section 5 presents the DEEP project, including the idea of the combined Cluster-Booster architecture and DEEP's implementation of the Booster hardware and the software stack. Before we conclude, section 6 presents a Space Weather application that will run on DEEP and draws some conclusions concerning the scalability of the DEEP system from this example.

2. HPC ARCHITECTURES IN TOP500
Having collected almost 20 years of statistics, the TOP500 list² [1] is a powerful tool for identifying trends in HPC architectures.


It shows that, starting with the very first examples in the late 90s, today's clusters are the dominant architecture in HPC. Their stage was set by commodity processors that became sufficiently powerful with respect to floating-point performance (e.g. Compaq/DIGITAL™ Alpha, AMD Opteron™ or Intel® Xeon® processors), together with high-bandwidth, low-latency networks that enabled the connection of such commodity nodes (at that time Myrinet and Quadrics, later Gigabit Ethernet, now superseded by InfiniBand). Last but not least, only the availability of a complete open-source software stack enabled the available hardware to form the general-purpose supercomputers as we know them today.

Distributed-memory systems like clusters require a convenient programming paradigm. Even before clusters became available, with the advent of MPP systems, the message-passing interface (MPI) [3] became the predominant standard for programming the communication operations required by parallel applications. Therefore, the availability of free, high-quality implementations of the MPI standard was crucial. Without them, clusters would not have been usable for practical applications at that time. After all, the availability of source code enabled the community to implement the necessary adaptations required by every hardware innovation in this field.

Since then, cluster architectures have taken over more than 80% of the market, leaving the remainder to MPP systems³. This incredible success is caused by the ability to benefit from the enhancements of computer technology in general. Of course, this development is not mainly driven by a relatively small market like HPC. Instead, it is pushed by mainstream markets like enterprise servers, personal computing or gaming. Combined with the ever-increasing costs of technology development (designing a new generation of processors requires investments in the range of billions of Euros), competing HPC architectures were only able to survive in very special niche markets like the very high end of HPC.

The most crucial feature of HPC architectures is not the highest performance of a single component but the balance of all building blocks in the system. E.g. most HPC systems do not rely on the highest-performing CPU available at a given time. Instead, a compromise that is better balanced with the available memory and network bandwidths is used. In the end this will show higher energy efficiency or optimize the ratio of price vs. performance.

Another important result derived from the TOP500 list is the development of the available compute power. Combined with results of historical HPC systems, it shows that HPC systems gain a factor of 1000 in performance per decade. It is important to face the fact that this development outperforms "Moore's law", i.e. the observation that semiconductor technology doubles the number of transistors per unit area every 18 months [4]. Thus, in order to achieve this outcome, a significant increase of the systems' parallelism beyond the sheer accretion of transistors per chip was necessary in HPC. This is reflected in the observation that basically all current HPC architectures are massively parallel.

3. THE EXASCALE CHALLENGE

Running systems with multi-Petaflop compute power today⁴, the community is thinking about the next step, i.e. having Exascale systems (10¹⁸ floating point operations per second) by the end of the decade. An analysis done by a group of experts four years ago [5] showed that an Exascale system using an updated version of today's technology and concepts would run into severe trouble:

• Power consumption: a projection of today's architectures and machines onto the technology of 2018 predicts a power requirement of several 100 MW for an Exascale system, rendering the energy requirements completely unacceptable.

• Resiliency: projecting the trend of ever-increasing numbers of components in HPC systems to the end of the decade⁵ leads to millions of components. Combined with today's components' mean time to failure (MTTF), this would make Exascale systems practically unusable, since the whole system's MTTF will drop into the range of hours or even minutes.

• Memory hierarchies and I/O: the increasing gap between compute performance and the bandwidth of both memory and storage will require additional layers in the memory hierarchy of Exascale systems, like higher-level CPU caches or flash memory as disk caches.

• Concurrency: with ever-increasing levels of parallelism it becomes harder and harder for users to extract performance out of these systems. Additionally, it introduces new requirements on the scalability of Exascale applications, as we discuss in the next section.

Thus, in order to achieve the ambitious goal of Exascale by the end of the decade, one has to hunt for new concepts.

With these challenges in mind, the ability of clusters to reach Exascale has to be reviewed. In particular, the question has to be raised whether the central concept of clusters, the utilization of commodity CPUs, is competitive compared to proprietary developments for HPC.

The yardstick regarding competitiveness for the next years is set by IBM's efforts leading to the Blue Gene*/Q system. An analysis of preliminary results [6] creates reasonable doubt that commodity CPUs will be sufficient in the future. This is due to their limitations on energy efficiency and the superior price vs. performance ratio of the Blue Gene*/P technology. Both originate from the fact that commodity CPUs carry too much dead freight, required for their general-purpose objectives, compared to their floating-point abilities, which are the key capabilities in the field of HPC.


¹ General-Purpose Graphical Processing Units; in fact these are highly optimized floating-point units.
² This paper refers to the November 2011 version of this list.
³ These systems are represented today by IBM* Blue Gene* systems and the Cray XT* series. One might argue that the latter are cluster systems, too.
⁴ Examples are the K computer at RIKEN or the upcoming IBM Sequoia* system at LLNL and Blue Waters at NCSA.
⁵ JSC's Petaflop system JUQUEEN, based on Blue Gene*/Q technology, has 28672 nodes with 458752 cores.



Since commodity processors will not be sufficient, one has to strive for a new workhorse for HPC systems. Good candidates are GPGPUs, providing an order of magnitude higher floating-point performance than commodity CPUs today, at the cost of limited usability. They represent the current endpoint of a long development of accelerators that might be used as co-processors, enhancing the capabilities of general-purpose processors significantly.

Even though the different incarnations of accelerators are versatile⁶, they share some features leading to a superior energy efficiency compared to commodity CPUs. This includes the lack of complex hardware mechanisms for out-of-order operation and speculative execution. Instead, simultaneous multi-threading (SMT) is used to keep the execution units busy while waiting for data fetched from memory. Furthermore, wide vector registers are used for floating-point operations in order to increase the instruction-level parallelism (ILP). Since these mechanisms lead to more lightweight cores compared to the ones used in today's general-purpose multi-core CPUs, current accelerator devices share the attributes of many-core architectures.

In fact, 40% of the top 10 systems in the current TOP500 list are already equipped with accelerator cards. Nevertheless, the prevailing architecture of accelerated clusters⁷ suffers from severe limitations concerning balance and, thus, scalability. This is due to the fact that their accelerators are neither directly connected to the cluster node's main memory nor capable of directly sending or receiving data from their local memory. In the end, data is forced to be copied back and forth between the main memory and the memory directly attached to the accelerator, using up the scarce resource of bandwidth on the node's system bus. At the same time, the latency of communication between the actual compute elements, that is the accelerators, is significantly increased due to the required detour through main memory⁸.

A good way out of these dilemmas would be to have a cluster of accelerators. In this concept the node consists of an accelerator only, accompanied by some memory and a high-speed interconnect, and saves the commodity CPU. In particular, programs running on the accelerator cores shall be capable of initiating communication operations without the support of a commodity processor. A good example of this concept is the QPACE system [9], which was ranked #1 in the GREEN500 list [8] of the most energy-efficient HPC systems in the world in 2010.

Nevertheless, this concept has limitations, too. Besides the problem of finding accelerators that are capable of running autonomously, they might not be flexible enough to drive a high-speed interconnect efficiently. Furthermore, the gain introduced by the direct connection between the compute element and the interconnect fabric might be wasted by the fact that accelerators, as highly dedicated devices, suffer when running general-purpose codes.

Thus, a radically new concept is required for cluster systems to be competitive at Exascale.

4. CONSIDERATIONS ON SCALABILITY

Talking about Exascale, it is crucial to discuss the effects of parallelism on scalability. Amdahl's law [10] states that the scalability of a parallel program is inherently limited by the sequential part of this application.

Granted that the runtime of the parallel part of a program is given by pt on a single processor, it leaves a sequential part of (1−p)t. Executing this program on a computer with N processors, the parallel part's runtime reduces to pt/N. By definition, the sequential part will not benefit from additional processors. The speedup S is given by the ratio of the runtime on a serial machine T_S to the runtime on a parallel machine T_N, i.e.

S = T_S / T_N = ((1−p) + p) / ((1−p) + p/N) = 1 / ((1−p) + p/N).

In practice this speedup is limited by the sequential part of a given program. E.g. a program with a sequential part of just 1% will be limited in its speedup to 100, independent of the number of processors used. Nevertheless, the practical impact of Amdahl's law today might be constricted. As an example, the fact that some applications are able to scale on Blue Gene* systems with O(100000) processors shows that these seem to have a vanishing sequential part.

In fact, the behavior of such applications is better described by Gustafson's law [11]. While Amdahl assumed that using a larger machine will leave the problem untouched, in reality the problem size usually is scaled with the capabilities of the machine. E.g. if a scientist wants to solve a problem in molecular dynamics, a machine twice as capable is usually not used for solving the problem in half the time, but to tackle a problem twice as big in the same time, or to do a more detailed simulation which requires double the amount of operations.

Therefore, in Gustafson's law the amount of work done on a system N times more capable than the serial one is w = 1−p+pN. Overall this leads to a practical speedup of

S = ((1−p) + Np) / ((1−p) + p) = (1−p) + Np,

assuming that executing the sequential part will take the same amount of time on the parallel system as on the serial one. It is obvious that for a large number of processors the speedup is dominated by the Np term.

Of course, Gustafson postulates the possibility of implementing the parallel portion of a program in a scalable way. Unfortunately, in practice there are several caveats. E.g. any collective communication operation on a parallel system, like a global sum or a broadcast, will inherently introduce costs beyond O(1) and is therefore not scalable.

Additionally, it turns out to be expensive to build really scalable fabrics of high-speed interconnects offering full flexibility in communication patterns at reasonable prices⁹. This leads to the fact that the scalability of the parallel part of an application might be significantly restricted in reality.

Thus, a different viewpoint on the question of how to reach Exascale might be taken by looking at actual applications and their inherent scalability. Since HPC systems undoubtedly will still be massively parallel by the end of the decade, with an even higher degree of parallelism than today, application scalability will play a crucial role in exploiting the computational power of such systems. Analyzing JSC's application portfolio today, one finds basically two classes of applications.

1. Highly scalable codes using very regular communication patterns. These are the codes that are able to exploit JSC's Blue Gene*/P system today and the Blue Gene*/Q in the near future.

2. Less scalable but significantly more complex programs. Most often their codes require complicated communication patterns. These requirements constrain such applications to clusters.

A more detailed view on the second class reveals that among them are several applications that have highly scalable kernels, too, i.e. in principle they should be able to also exploit Blue Gene-type machines. Nevertheless, in analogy to the serial work in Amdahl's law, their scalability is limited by the least scalable kernel.

With respect to Exascale it will be highly desirable to run applications of the second class on such systems, too. This wish is reinforced by the expectation that this class will become more important in the future. In fact, due to the growing degree of parallelism in Exascale systems, even codes on the safe side of the first class today might be shifted towards the latter class soon. Furthermore, we expect simulation codes to become more complex, as additional aspects of a given scientific question will be included, limiting their scalability. Last but not least, problems completely out of range today due to their complexity might become feasible upon the availability of Exascale systems. Most probably they will belong to the second class, too.

Putting everything together, a future cluster architecture requires different ingredients to be flexible enough for running complex applications at Exascale.

5. THE DEEP ARCHITECTURE

5.1 The Cluster-Booster Concept
The concept of the DEEP architecture is sketched in Figure 1. It foresees a Cluster element comprising nodes (CN) connected by a highly flexible switched network, labeled as InfiniBand here. It is accompanied by a so-called Booster element built out of Booster nodes (BN). Each BN hosts an accelerator capable of autonomously booting and running its own operating system. The BNs are interconnected by a highly scalable torus network, sketched as red lines and named EXTOLL in the figure. Booster Interfaces (BI) connect the Booster and its network to the Cluster's fabric.

The Cluster-Booster architecture provides several advantages:

• The Cluster element allows running complex kernels. There is no restriction on running parallel kernels with complicated communication patterns or highly irregular codes, as there is on today's highly scalable MPP machines.

• The Booster element is able to run highly scalable kernels in the most energy-efficient way. The limitations of today's accelerated clusters are avoided by enabling the compute elements, i.e. the accelerators in the BNs, to communicate directly with remote BNs.

• The ratio between the amount of work to be executed on the commodity CPUs and on the accelerators cannot be expected to be fixed among different applications, or even between different kernels of a single application. The proposed architecture supports this fact by detaching the accelerators: they are no longer part of the physical CN and might be assigned to CNs in a dynamical way.

• At the same time, the dynamical assignment of CNs and BNs improves resiliency: faulty BNs do not affect CNs and vice versa.

Furthermore, the architecture supports the user in employing the high degree of parallelism of future machines. Today, the type of kernels to be offloaded onto accelerators is very limited due to the accelerators' missing ability to communicate efficiently with each other. In contrast, the DEEP architecture allows more complex kernels to be offloaded to the Booster. These kernels might even include communication, as long as the corresponding communication patterns are regular enough and do not swamp the torus topology.

In this context the Booster might be seen as a highly scalable system on its own. Thinking of the highly scalable codes that are able to exploit Blue Gene* or QPACE today, it should be possible to run them on the Booster alone.

Figure 1: Diagram of the DEEP architecture (Cluster nodes CN connected by InfiniBand, Booster nodes BN connected by the EXTOLL torus, and Booster Interfaces BI linking the two sides).


⁶ Early examples are dedicated devices from ClearSpeed, ranging over processors originally developed for gaming like the Cell Broadband Engine and the aforementioned GPGPUs, towards newer developments like Intel's Many Integrated Core (MIC) architecture.
⁷ We will distinguish accelerated clusters, i.e. classical clusters comprising nodes equipped with accelerators, from clusters of accelerators.
⁸ There are efforts to attenuate the problems resulting from that, e.g. NVIDIA's GPUDirect™ [7].
⁹ Full crossbar switches today are limited to several dozen ports; the trick of using a fat-tree topology introduces an overhead of O(p²) for p ports [12].

S = = =TS

TN

(1–p)+p

(1–p)+p/N

1

(1–p)+p/N



At the other extreme, BNs might be assigned to single CNs and used in the same fashion as in today's accelerated clusters. Both use cases, and anything in between, are possible without modification of the hardware concept; they are simply implemented by means of configuration at the level of system and application software.

To reflect all these features, the architecture was named the Dynamical Exascale Entry Platform (DEEP).

5.2 Coping with Amdahl's Law
The Cluster-Booster architecture aims to offload kernels with high scalability onto the Booster while leaving kernels with limited scalability on the Cluster element. Let us assume the highly idealized situation where the highly scalable code fraction is H, leaving a fraction of 1−H for the less scalable parts. Given a number h of cores on the Booster and a number f of cores on the Cluster, and assuming the Cluster's multi-core units to be a factor c faster than the Booster's many-core units, Amdahl's law predicts the speedup S of the DEEP Cluster-Booster architecture as:

S = 1 / ( (1−H)/(c·f) + H/h ).

For a 10 PFlop/s Booster with h = 500000 cores, a Cluster with f = 10000 cores, a difference in effective performance of a factor of c = 4 between the Cluster's multi-core and the Booster's many-core units, and a highly scalable code fraction of H = 0.95, the speedup S reaches a value of about 320000 compared with a single core of the Booster. This corresponds to a parallel efficiency of 64%. It is evident that the efficiency strongly depends on the value of H. Since H approaches 1 in a weak-scaling scenario, as advocated by Gustafson, one can be optimistic about evading the severe performance degradation that would otherwise be expected for O(K) concurrent kernels on the Booster. Note that the communication between the Cluster and the Booster was not taken into account in this consideration.
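Evaluating the expression above with these numbers reproduces the quoted figures:

    \[
      S = \frac{1}{\frac{1-H}{c\,f} + \frac{H}{h}}
        = \frac{1}{\frac{0.05}{4\times 10^{4}} + \frac{0.95}{5\times 10^{5}}}
        = \frac{1}{1.25\times 10^{-6} + 1.9\times 10^{-6}}
        \approx 3.2\times 10^{5},
      \qquad
      \frac{S}{h} \approx 0.64 .
    \]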

The optimal balance between h and f is specific to a given application. The DEEP architectural concept allows adjusting this balance dynamically. The recipe to achieve scalability for the DEEP grand-challenge applications is to maximize H in the breakdown of the code into kernels, while reducing data transfers to the bare minimum or hiding the Cluster-Booster communication behind computation.

5.3 The DEEP Hardware
So far we have described the Cluster-Booster architecture on an abstract level. Now we will concentrate on the actual implementation as planned within the DEEP project. The DEEP system consists of two principal parts: an Intel® Xeon® based Cluster and an attached Booster using Intel® Xeon Phi™ coprocessors. While the Cluster uses conventional components like Intel® Xeon® processors and Mellanox QDR InfiniBand HCAs and switches, the Booster exploits the first Intel® Xeon Phi™ product, formerly known under the code name Knights Corner (KNC). The Booster uses the EXTOLL interconnect technology designed at the University of Heidelberg as its communication network. The first DEEP prototype hardware will be available in the second half of 2013.

5.3.1 Booster Node Card (BNC)
The BNC is the physical realization of the Booster Nodes. It houses two independent BNs of the 3D torus. Each one comprises a KNC card and an EXTOLL ASIC currently under development at EXTOLL GmbH, a spin-off of the University of Heidelberg. The KNC card includes the actual processor and memory based on GDDR5 technology. It is connected to the EXTOLL ASIC via a ×16 PCIe gen2 interface. A baseboard management controller (BMC) allows for management and monitoring of the BNC's components. The KNC coprocessor has approx. 60 cores running an extension of the x86 instruction set [13]. The cores are much simpler than today's Intel® Xeon® CPUs: they operate in order and without speculative execution. Compute throughput is provided by vector units integrated with the cores, which can operate on 512 bits (8 double-precision numbers) in one instruction and offer advanced masking and memory transfer functionality. The cores have local caches and a shared L2 cache. The main memory is based on GDDR5, and we expect memory sizes on a par with the latest GPGPUs. At Supercomputing 2011, a double-precision DGEMM matrix multiply was shown on alpha silicon with >1 TeraFlop/s performance [14], and a system based on Intel® Xeon Phi™ leads the November 2012 Green500 list [8].

The EXTOLL architecture was designed from scratch for the needs of HPC. The technology specifically optimizes latency, message rate and scalability. To this end, the host software, the host interface, the network interface hardware and the actual network with its switch and link latencies were all optimized.

EXTOLL is a one-chip solution, realized either as an FPGA today or as an ASIC in the near future. Figure 2 shows the top-level block diagram of the EXTOLL device. The whole device can be divided into three main parts: the host interface, the network interface and the network.

The host interface houses a PCIe IP core, which is able to operate in root port mode. It is attached to the network interface using internal bridging hardware. The network interface features an on-chip network called HTAX, which connects the host interface and all communication engines and functional units of EXTOLL to each other.

There are three main communication engines available in EXTOLL. VELO (Virtualized Engine for Low Overhead) offers two-sided communication semantics (send/receive) with very low latency and very high message rates. It might be used for an efficient implementation of the handling of small to medium MPI messages. The second communication engine is RMA (Remote Memory Access), offering one-sided operations. Both of these units are virtualized and offer direct user-space access for hundreds of processes, as required for use by many-core processors. The last communication engine is the SMFU (Shared Memory Functional Unit) [16]. It is capable of tunneling arbitrary PCIe requests over the EXTOLL network. A typical use case is non-coherent, global address space programming using CPU loads and stores. Another application is the forwarding of device DMAs to the memory of a remote EXTOLL node.

The third component of EXTOLL is the actual network. It implements six links which can be interconnected arbitrarily. Therefore, a natural choice for the topology is a 3D torus. Beyond that, EXTOLL also offers a so-called "7th link". It might be used to attach further nodes to the torus, for example I/O nodes. The internal switch of EXTOLL is a non-blocking crossbar. Packets are delivered in order if deterministic routing is chosen. The routing is table based and not restricted to torus topologies.

5.3.2 Booster Interface Card (BIC)
The interface between the Booster and the Cluster is realized by Booster Interface Cards (BIC). They contain an ultra-low-voltage Intel® Xeon® CPU, the same baseboard management controller (BMC) as the BNC, a NIC for the EXTOLL network and an HCA for InfiniBand connectivity. The two latter are connected to the CPU via PCIe.

The BICs are responsible for booting and controlling the KNCs over the EXTOLL network. The BIC CPU maps a boot image at a certain memory address, and the KNC cards access this image through a PCIe window, with EXTOLL tunneling the PCIe communication using the SMFU. In addition, the BICs pass data between the InfiniBand and EXTOLL networks.

5.3.3 Interconnect Network and Topology
To provide unlimited scalability of the Booster interconnect and a good match with HPC applications that use spatial data partitioning, a 3D torus configuration has been chosen. EXTOLL is currently implemented on FPGAs, and the first DEEP evaluation boards will therefore use FPGAs (Altera Stratix V). This provides full functionality, but reduced performance is expected. For the DEEP system, the plan is to move to the ASIC implementation currently under development.

The DEEP Booster will consist of a total of 512 Booster Nodes arranged in an 8×8×8 topology. 32 BNs (i.e. 16 BNCs) will be housed in a chassis, accompanied by two BICs forming the interface to the Cluster of the DEEP system. Compared to the accumulated bandwidth of the 16 BNs connected to a single BIC, or of the 128 nodes in the Cluster, the bandwidth of a single BIC appears to be limited. Nevertheless, since the Cluster-Booster architecture allows for offloading complex kernels, the amount of data that has to be sent back and forth between the two parts is assumed to be relatively small. Furthermore, the ability of the BNs to communicate directly among each other keeps the majority of communication operations within the Booster and relieves the pressure on the BICs. Completion of this system is scheduled for the end of 2013, with smaller prototypes planned for mid-2013.

5.4 The Cluster-Booster Software Stack
In order to benefit from the Booster element of the DEEP architecture, the user has to identify the highly scalable kernels capable of being offloaded. In a way this is similar to the identification of kernels to be offloaded onto the GPGPUs of accelerated clusters today.

Figure 2: EXTOLL block diagram (host interface via HT3 or PCIe, network interface with the VELO, RMA and ATU units attached to the on-chip network, and the EXTOLL network switch with its link and network ports and the control and status unit).


Figure 3: High-level view of the Cluster-Booster software stack: less scalable kernels run on the DEEP Cluster and highly scalable kernels on the DEEP Booster, glued together by the OmpSs offload abstraction and resource management on top of ParaStation Global MPI, the ParaStation Cluster and Booster MPIs, the Cluster-Booster communication layer, and the low-level InfiniBand and EXTOLL communication layers; compilation uses the OmpSs compiler together with the Intel® compilers for Xeon® and MIC.




A Cluster-Booster software stack as sketched in Figure 3 will support the user in expressing the different levels of parallelism of the problem. The application's kernels of limited scalability remain on the Cluster part and make use of a standard software stack comprising standard compilers, MPI, etc. Highly scalable kernels are offloaded to the Booster with the help of a resource management layer. For that, a Cluster-Booster communication layer is required in order to bridge the gap between the more flexible Cluster interconnect and the highly scalable Booster network.

In order to implement the highly scalable kernels on the Booster, an additional software stack is required there. Since the Booster might be seen as a cluster of accelerators, it is no surprise to find a software stack similar to the one on the Cluster. Besides native compilers supporting the processors of the Booster nodes, a communication library is required. For simplicity, MPI is chosen here, too. Of course, in contrast to the Cluster MPI, the Booster MPI has to support the Booster interconnect.

The application itself is formed by both types of kernels, running on the Cluster and Booster parts of the system, glued together by the resource management layer.

5.5 Programming Model

In the Exascale era a concurrency of at least O(10⁹) is expected. Its management requires improvements to today's programming models. The most popular programming model currently used in scientific computing is based on message passing, using mainly MPI. However, MPI is starting to show some drawbacks on current Petascale machines that will become more important in future systems. Frequent synchronization points present in many applications will hinder scaling. Replication of data across processes and the increasing memory requirements for managing O(10⁶) connections will be showstoppers. Therefore, improvements in programming models are required. The present relies on hybrid programming models combining MPI with an intra-node parallel programming model (OpenMP, CUDA, OpenCL). Some of those programming models take advantage of accelerators. OmpSs is a proposal with means to use CPUs and accelerators, providing at the same time asynchronicity through taskification [22], allowing improved programmability and performance advantages. Other approaches propose more radical changes in design, getting rid of MPI altogether [23]. The nature of the DEEP architecture requires extra innovations in the programming model. Current paradigms cannot cope with it, since they provide a means to offload code from a CPU to an accelerator that is part of the same node. In DEEP the accelerators are decoupled from the nodes on the Cluster side and become part of the Booster. This feature provides more flexibility than in traditional GPGPU clusters, allowing a near-optimal use of CPU and accelerator cores.

Given the foreseen requirements of Exascale computing and the features of the DEEP architecture, the offload mechanism has to provide:

• An easy-to-use interface for offloading computation and data to the Booster,

• A certain degree of portability,

• Individual offloading, to enable applications to offload independent kernels or tasks to accelerators,

• Collective offloading, to enable applications to offload collaborative kernels or tasks to the Booster,

• Asynchronicity, to allow overlapping of different phases on the Cluster and on the Booster.

The first two items can easily be achieved using an annotation approach. The parameters specified in the annotations tell the runtime whether it should offload the function to the Booster, to an accelerator, or to another CPU core. This way, small changes in the runtime and the application code will enable portability without major rewriting of the code. The third item requires an offload mechanism similar to any other offload mechanism present in current clusters of accelerators. The only major difference is that the accelerator is not local to the cluster node, which provides more flexibility, but at the same time the data has to be moved through slower channels than the internal buses; this requires a larger granularity of the offloaded kernels to overcome the penalty for transferring the data.

The fourth item is a collective offload, where every process on the Cluster offloads its data to the Booster at roughly the same time, allowing the processes on the accelerators to communicate and cooperate. This is necessary, e.g., when the application decomposes the domain and some operations are more suitable for the Cluster and others for the Booster. Then, on both sides of the system, processes have to exchange data such as the boundaries of the local domains. An example of this is depicted in Figure 4.

OmpSs's model satisfies the last item individually on the Cluster and the Booster, since it already supports multi-core architectures and it is going to be ported to the MIC architecture. However, it is desirable that the offload mechanism does not block the execution on the Cluster while the Booster performs the required computations. For that, two mechanisms are possible: (1) an offloading task runtime approach, offloading individual tasks to the Booster and automatically building the dependency graph between tasks, as in the OmpSs model; and (2) an approach where the function to be offloaded returns a handler that the programmer can use for checking for completion later.
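A sketch of what such an annotated, non-blocking offload could look like at the source level is given below. The clauses are loosely modeled on OmpSs task annotations; the device(booster) clause and the function names are purely illustrative, not the interface actually implemented in DEEP.

    // Kernels of the application; the highly scalable one is meant to run
    // on the Booster, the other one stays on the Cluster.
    void particle_kernel(double *fields, double *particles,
                         double *moments, long np, int n);
    void field_solver(double *fields, double *moments, int n);

    void time_step(double *fields, double *particles,
                   double *moments, long np, int n) {
        // Task-style offload: the annotation marks the kernel as a task for
        // the Booster; in()/inout()/out() let the runtime build the
        // dependency graph and execute it asynchronously.
        #pragma omp task device(booster) in(fields[0:n]) \
                         inout(particles[0:np]) out(moments[0:n])
        particle_kernel(fields, particles, moments, np, n);

        // Meanwhile the less scalable kernel keeps running on the Cluster.
        field_solver(fields, moments, n);

        #pragma omp taskwait   // synchronize before the next time step
    }

In the handler-based alternative, the offload call would instead return an opaque completion object that the Cluster-side code tests or waits on explicitly.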

6. SCIENTIFIC APPLICATIONS
The number of today's scientific applications that will be able to run efficiently on the millions of cores of a future Exascale machine is very limited. A reason for this is the fact that even highly scalable applications often contain kernels with complex communication patterns. These limit the overall scalability of such codes. One of the main goals of DEEP is to enlarge the portfolio of Exascale applications by using the DEEP Cluster for the less scalable parts of the applications, while their highly scalable parts profit from the computing power of the Booster. To evaluate the DEEP concept, six HPC applications representative of future Exascale computing and data-handling requirements will be ported to the DEEP system in the coming three years. These applications, coming from the fields of Health and Biology, Climatology, Seismic Imaging, Computational Engineering, Space Weather, and Superconductivity, have a high impact on industry and society and present very different algorithmic structures. In this paper we focus on one of the selected applications: Space Weather. After a brief analysis we discuss how it can profit from the characteristics of the DEEP architecture.

The aim of Space Weather simulations is to emulate the magnetized plasma traveling from the Sun to the Earth. They are used to study the conditions on the Sun, in the solar wind, and in the Earth's magnetosphere and ionosphere that can disrupt space-borne and ground-based technology. iPIC3D [17] (written in C++ and MPI) is an implicit-moment-method particle-in-cell code [18, 19] that allows studying a large portion of the system while retaining the full detailed microphysics. The main challenge in this HPC application is its wide variety of spatial and temporal scales, due to the vast gap between the scales at which the plasma electrons and ions interact. Exascale computing would allow Space Weather researchers to simulate the whole Sun-Earth system in a single simulation run, something impossible with today's Petascale machines.

iPIC3D simulates the evolution of particles on a discrete grid, taking the surrounding electric and magnetic fields into account. The logical block operations performed in iPIC3D at each time step are shown in Figure 5, where the most communication-intensive block (B2) is marked in green and the most computationally expensive blocks (B4 and B5) in red. iPIC3D shows very good scalability. Most of the execution time is spent in the particle simulation, making this part of the code (blocks B4 and B5) a perfect candidate to be offloaded to the DEEP Booster. Furthermore, the only communication required in these blocks is point-to-point, nearest-neighbor communication, needed in B4 for the few particles that change subdomain, i.e. particles that move from one BN to a neighboring one. On the other hand, the field solver (block B2) has a small impact on the execution time but is more demanding for the network layer, as it requires global communication. Therefore it fits naturally on the DEEP Cluster. In this scenario, communication between the Cluster and the Booster is very limited: only the fields and moments (1% of the memory) need to be transferred from one side to the other, while all the particles (99% of the memory) stay on the Booster side.
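Schematically, the intended division of one iPIC3D time step between the two sides of the machine is sketched below. The function names stand for the blocks of Figure 5 and are placeholders, not the actual iPIC3D routines; the placement of B2, B4 and B5 follows the text, while that of B1 and B3 is our simplification.

    void compute_moments(void);                 // B1
    void solve_fields(void);                    // B2: global communication
    void interpolate_grid_to_particles(void);   // B3
    void move_particles(void);                  // B4: nearest-neighbour halos only
    void deposit_particles_to_grid(void);       // B5
    void exchange_fields_and_moments(void);     // ~1% of the memory crosses the BICs

    void ipic3d_time_step(void) {
        compute_moments();                  // B1, on the Cluster side here
        solve_fields();                     // B2 stays on the Cluster
        exchange_fields_and_moments();      // fields and moments to the Booster

        interpolate_grid_to_particles();    // B3, grouped with the particle phase
        move_particles();                   // B4, offloaded to the Booster
        deposit_particles_to_grid();        // B5, offloaded to the Booster
        exchange_fields_and_moments();      // moments back; particles never leave
    }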

Communication and computation can easily be overlapped in the Space Weather application, for instance by running in parallel several instances of the same scenario with different pseudo-random input data, as required for ensemble simulations.

Figure 4: Collective offload example of an application with a 2D domain decomposition (the data set is split into local data sets held by px × py MPI ranks on the Cluster and booster_px × booster_py MPI ranks on the Booster).


Figure 5: Block diagram showing the operations performed in iPIC3D at each time step (B1 moment equations, B2 field solver, B3 grid-to-particle, B4 particle mover, B5 particle-to-grid).



The approach chosen for iPIC3D in DEEP is presented in Figure 6 and consists of dividing blocks B2 and B4 into several steps, such that at every moment two concurrent operations, one on the Booster and one on the Cluster, are executed. Barriers are required at key points to synchronize the process. In this approach the two most expensive operations are done simultaneously: the Poisson solver on the Cluster and the particle pusher on the Booster. Estimates predict a similar time span for both the Booster (faster, but with many more operations to do) and the Cluster (slower, but with fewer operations to do).
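
The following minimal skeleton sketches this per-time-step choreography; all function names are hypothetical placeholders, not iPIC3D's actual routines, and the calls are simply serialized here, whereas in the DEEP scheme the Cluster and Booster phases of each pair run concurrently.

    // Sketch of one iPIC3D time step split between Cluster and Booster
    // as in Figure 6. Function names are illustrative placeholders only.
    #include <iostream>

    void maxwell_explicit_update()     { /* B2 part 1: fields, <1% of data     */ }
    void poisson_corrector()           { /* B2 part 2: linear solver, global comm */ }
    void collect_field_changes()       { /* B2 part 3: fields only, no comm    */ }
    void exchange_boundary_particles() { /* B4 part 1: few particles, p2p comm */ }
    void move_particles_in_fields()    { /* B4 part 2: 99% of memory, compute  */ }
    void barrier_and_transfer_fields() { /* Cluster <-> Booster synchronization */ }

    int main() {
        const int nsteps = 3;                     // illustrative
        for (int step = 0; step < nsteps; ++step) {
            // Conceptually the Cluster side and Booster side run concurrently;
            // here they are called in program order for readability.
            maxwell_explicit_update();            // Cluster
            exchange_boundary_particles();        // Booster
            poisson_corrector();                  // Cluster (global communication)
            move_particles_in_fields();           // Booster (bulk of the work)
            collect_field_changes();              // Cluster
            barrier_and_transfer_fields();        // synchronize both sides
        }
        std::cout << "done\n";
        return 0;
    }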

This method constitutes a simple modification of the standard iPIC3D and is based on solid experience, since offloading particles onto accelerators is commonplace (see footnote 10). No disadvantages or impact on the physics accuracy are expected, while the cost of the run is reduced essentially by the speed advantage of the Booster over the Cluster.

At this point a rough estimate of the benefits of the DEEP architecture can be made. On the JUDGE cluster at JSC, a Scalasca profiling run with 1024 processes gave the information shown in Table 1 below:

The data implies that B1 and B3 can be safely ignored, since together they account for ∼2% of the time. B5 takes 30% of the execution time, but since the communications in this phase use only 8%, most of that time the application is evidently preparing the structures to be communicated. In DEEP the ratio between Cluster nodes and Booster nodes is 1:4. The ratio of memory per node will most likely be 3:1. The ratio of peak compute power per node can be assumed to be 1:4 (∼250 GFlop/s for 2x Intel® Xeon® E5 processors vs. ∼1000 GFlop/s for KNC).

Bearing all the previous statements in mind, the following estimates can be made:

B4 will be executed on the Booster, which means a maximum theoretical speedup of 4×4=16 for the computational part. This is hardly achievable in practice. Assuming a 50% efficiency of the Booster relative to the Cluster, a more realistic number for this application when the vector units cannot be fully exploited, the speedup would be 8. This applies to 80% of the runtime of B4. The other 20% is spent on communication. EXTOLL's features might realistically allow speeding up those communications by a factor of 6 (see footnote 11), reducing the total time of B4 to ∼13.5% of its original time.

Even though B5 is not easily optimized on the Booster, the higher number of cores of a KNC compared to an Intel® Xeon® CPU can help to speed it up. KNC's cores are simpler, work at a lower frequency, and are therefore slower than standard cores (see footnote 12). Assuming a relative core speed of 1:4, the tasks required to put data in place to be communicated will be 4× slower. However, a Booster node has ∼4× the number of cores of a Cluster node, and it can manage twice the number of hardware threads. Furthermore, the Booster will have 4× more nodes, so every node will have less data to reorder. The speedup of these tasks can be naïvely rated at ∼8, leading to a total time of B5 of ∼20% of its original time.

On the one hand, B4 and B5 can thus be reduced to ∼13.5% and ∼20% of their original time, respectively. Moreover, B2 can be executed concurrently on the Cluster while B4 and the most expensive part of B5 are executed on the Booster, taking it off the critical path, so only B4 and B5 remain on the critical path. This gives a runtime reduction to 0.02 + 0.6 × 0.135 + 0.3 × 0.2 = ∼0.161 of the original. On the other hand, this result is only achievable because communication between Cluster and Booster is limited to a small amount of data. Furthermore, iPIC3D enables hiding the communication behind computation by introducing ensemble simulation. Nevertheless, this shows that there are applications capable of exploiting the potential of the DEEP architecture.

For comparison purposes, the runtime required by the application on a pure cluster architecture with the same total number of nodes as the DEEP architecture (Cluster nodes plus Booster nodes), assuming linear scaling, would be ∼1/5. However, the Booster part, including I/O, will consume roughly 2× more power than the Cluster part. This leads to a power consumption per Booster node of roughly half that of a Cluster node, allowing the application to run consuming 0.2 + 0.8 × 0.5 = ∼0.6 times the energy of a standard architecture, ignoring storage and switches.
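
The back-of-the-envelope arithmetic behind these numbers can be reproduced in a few lines. The sketch below simply recomputes the quoted values under the stated assumptions (Table 1 fractions, speedups of 8 and 6 for B4's compute and communication parts, B5 reduced to ∼20%, Booster nodes at roughly half the per-node power of Cluster nodes); it is a check of the arithmetic, not a performance model.

    // Reproduces the runtime and energy estimates discussed in the text.
    #include <cstdio>

    int main() {
        // Runtime fractions from Table 1 (B2 leaves the critical path).
        const double tB1B3 = 0.02, tB4 = 0.60, tB5 = 0.30;

        // B4: 80% compute sped up 8x on the Booster, 20% communication sped up 6x.
        const double b4_factor = 0.8 / 8.0 + 0.2 / 6.0;   // ~0.133 (text: ~13.5%)

        // B5: data reordering rated ~8x faster, giving ~20% of its original time.
        const double b5_factor = 0.20;

        const double runtime = tB1B3 + tB4 * b4_factor + tB5 * b5_factor;
        std::printf("runtime fraction: %.3f (text: ~0.161)\n", runtime);

        // Energy: 20% of the nodes are Cluster nodes at relative power 1.0,
        // 80% are Booster nodes at roughly half the per-node power.
        const double energy = 0.2 * 1.0 + 0.8 * 0.5;
        std::printf("energy fraction : %.2f (text: ~0.6)\n", energy);
        return 0;
    }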

7. CONCLUSIONS AND OUTLOOK
We presented a novel HPC architecture extending the concept of typical clusters and the newer development of accelerated clusters, i.e. cluster systems whose nodes are equipped with accelerators alongside commodity processors. Our concept comprises a Cluster that is accompanied by a so-called Booster consisting of accelerator nodes connected by a high-speed interconnect with torus topology. We argue that our concept provides the additional flexibility and scalability that is required to pave the way for cluster architectures into the future of Exascale. We report on the physical implementation of this concept as it is being realized in the DEEP project. Furthermore, both the hardware and software architectures of the DEEP system are presented. We used a Space Weather application to evaluate the Cluster-Booster concept on a theoretical level, showing important benefits in performance and energy efficiency.

8. ACKNOWLEDGMENTS
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement 287530.

9. REFERENCES
[1] http://www.top500.org
[2] http://www.deep-project.eu
[3] http://www.mpi-forum.org
[4] Gordon E. Moore, "Cramming more components onto integrated circuits", Electronics, vol. 38, no. 8, 1965, pp. 114-117.
[5] www.cse.nd.edu/Reports/2008/TR-2008-13.pdf
[6] http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super
[7] http://developer.nvidia.com/gpudirect
[8] http://www.green500.org
[9] H. Baier et al., "QPACE: power-efficient parallel architecture based on IBM PowerXCell 8i", Computer Science R&D 25 (2010), pp. 149-154. doi:10.1007/s00450-010-0122-4.
[10] Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS Conference Proceedings (30), 1967, pp. 483-485.
[11] John L. Gustafson, "Re-evaluating Amdahl's Law", Communications of the ACM 31(5), 1988, pp. 532-533.
[12] Charles Clos, "A Study of Non-blocking Switching Networks", The Bell System Technical Journal, 1953, vol. 32, no. 2, pp. 406-424.
[13] http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html
[14] http://newsroom.intel.com/servlet/JiveServlet/download/38-6968/Intel_SC11_presentation.pdf
[15] Mondrian Nüssle et al., "A resource optimized remote-memory-access architecture for low-latency communication", The 38th International Conference on Parallel Processing (ICPP-2009), September 22-25, Vienna, Austria.
[16] H. Fröning and H. Litz, "Efficient Hardware Support for the Partitioned Global Address Space", 10th Workshop on Communication Architecture for Clusters (CAC2010), co-located with the 24th International Parallel and Distributed Processing Symposium (IPDPS 2010), Atlanta, Georgia, 2010.
[17] S. Markidis, G. Lapenta and Rizwan-Uddin, "Multi-scale simulations of plasma with iPIC3D", Mathematics and Computers in Simulation, pp. 1509-1519, 2010.
[18] J. U. Brackbill and D. W. Forslund, "Simulation of low frequency, electromagnetic phenomena in plasmas", Journal of Computational Physics, 1982, p. 271.
[19] P. Ricci, G. Lapenta and J. U. Brackbill, "A simplified implicit Maxwell solver", Journal of Computational Physics (2002), p. 117.
[20] B. Marder, "A method for incorporating Gauss' law into electromagnetic PIC codes", J. Comput. Phys., vol. 68 (1987), p. 48.
[21] A. Bruce Langdon, "On enforcing Gauss' law in electromagnetic particle-in-cell codes", Computer Physics Communications, vol. 70, issue 3 (1992).
[22] A. Duran, E. Ayguade, R. M. Badia, J. Labarta, L. Martinell, X. Martorell and J. Planas, "OmpSs: A Proposal for Programming Heterogeneous Multi-Core Architectures", Parallel Processing Letters, vol. 21, issue 2 (2011), pp. 173-193.
[23] G. R. Gao, T. L. Sterling, R. Stevens, M. Hereld and W. Zhu, "ParalleX: A Study of A New Parallel Computation Model", in Proc. of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), Long Beach, California, USA.

Part   Execution time   Time on communication
B2     8%               5.5%
B4     60%              20%
B5     30%              8%

Table 1: Profiling of iPIC3D


10 Delayed Poisson correction is novel, but similar methods based on incomplete correction, e.g. the Marder [20] and Langdon [21] methods, exist.

11 ×1.5 better performance per link, and ×4 more links, since the data is distributed over ×4 more nodes.

12 The superior floating-point capabilities of KNC might not come into play for this task.

Cluster (needs only fields):
B2 (part 1): Maxwell explicit update • operates on less than 1% of the data • fast operation • near-neighbor communication
B2 (part 2): Poisson corrector • slow operation based on a linear solver • global communication needed
B2 (part 3): Collect changes • one floating-point operation on fields only (1% of memory) • no communication

Booster (needs both fields and particles):
B4 (part 1): Communicate particles between neighbouring Boosters • typically needed only for a very small fraction of particles
B4 (part 2): Move particles in the fields • operates on 99% of the memory • compute intensive • no communication needed in this phase (*)
B4 (part 1): Communicate particles between neighbouring Boosters • typically needed only for a very small fraction of particles

Barrier – Move fields from Cluster to Booster (B3)
Barrier – Move fields from Cluster to Booster (B5)

(*) Note that this is a strong restriction that can be achieved using only ghost cells and that requires some changes to the code: the moments need to be computed before the particles that change domain are communicated, the ghost moments must be communicated from the cluster, and particles must be hindered from roaming beyond the ghost region.

Figure 6: Scheme of the proposed method to adapt iPIC3D to the DEEP architecture



Scalasca – A Scalable Parallel Performance Measurement and Analysis Toolset

Bernd Mohr, Jülich Supercomputing Center, Forschungszentrum Jülich

ABSTRACT
Scalasca [1] is an open-source software toolset that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks, in particular those concerning communication and synchronization, and offers guidance in exploring their causes. Scalasca mainly targets scientific and engineering applications based on the programming interfaces MPI and OpenMP, including hybrid applications that combine the two. The tool has been specifically designed for use on large-scale systems such as IBM* Blue Gene* and Cray* XT, but is also well suited for small- and medium-scale HPC clusters.

INTRODUCTION
Making applications run efficiently at larger scales is often thwarted by excessive communication and synchronization overheads. Especially during simulations of irregular and dynamic domains, these overheads are often enlarged by wait states that appear in the wake of load or communication imbalance when processes fail to reach synchronization points simultaneously. Even small delays of single processes may spread wait states across the entire machine, and their accumulated duration can constitute a substantial fraction of the overall resource consumption. In particular, when trying to scale communication-intensive applications to large processor counts, such wait states can result in substantial performance degradation.

To address these challenges, Scalasca has been designed as a diagnostic tool to support application optimization on highly scalable systems. Although it also covers single-node performance via hardware-counter measurements, Scalasca mainly targets communication and synchronization issues, whose understanding is critical for scaling applications to performance levels in the petaflops range and beyond. A distinctive feature of Scalasca is its ability to identify wait states that occur, for example, as a result of unevenly distributed workloads.

FUNCTIONALITY
To evaluate the behavior of parallel programs, Scalasca takes performance measurements at runtime to be analyzed post-mortem (i.e., after program termination). The user of Scalasca can choose between two different analysis modes: (i) a performance overview on the call-path level via profiling (called runtime summarization in Scalasca terminology) or (ii) an in-depth study of application behavior via event tracing. In profiling mode, Scalasca generates aggregate performance metrics for individual function call paths, which are useful for identifying the most resource-intensive parts of the program and for assessing process-local performance via hardware-counter analysis. In tracing mode, Scalasca goes one step further and records individual performance-relevant events, allowing the automatic identification of call paths that exhibit wait states. This core feature is the reason why Scalasca can be classified as an automatic tool. As an alternative, the resulting traces can be visualized in a traditional time-line browser such as Vampir or Paraver to study the detailed interactions among different processes or threads.

Before any performance data can be collected, the target application must be instrumented, that is, probes must be inserted into the code which carry out the measurements. This can happen at different levels, including source code, object code, or library. Before running the instrumented executable on the parallel machine, the user can choose between generating a profile or an event trace. When tracing is enabled, each process generates a trace file containing records for its process-local events. To prevent traces from becoming too large or inaccurate as a result of measurement intrusion, it is generally recommended to optimize the instrumentation based on a previously generated profile report. After program termination, Scalasca loads the trace files into main memory and analyzes them in parallel, using as many cores as were used for the target application itself. During the analysis, Scalasca searches for wait states, classifies detected instances by category, and quantifies their significance. The result is a wait-state report similar in structure to the profile report but enriched with higher-level communication and synchronization inefficiency metrics. Both profile and wait-state reports contain performance metrics for every combination of function call path and process/thread and can be interactively examined in the provided analysis report explorer (Figure 1) along the dimensions of performance metric, call tree, and system. For very large systems, the list of nodes and processes in the system dimension becomes long, so as an alternative, the distribution of a value over the different processes and/or threads can be shown as a mapping onto a system hardware or application topology (Figure 2). If the application uses MPI Cartesian topologies, they are recorded automatically; in other cases the application needs to be instrumented to record the topology mapping. In addition, reports can be combined or manipulated to allow comparisons or aggregations, or to focus the analysis on specific extracts of a report. For example, the difference between two reports can be calculated to assess the effectiveness of an optimization, or a new report can be generated after eliminating uninteresting phases (e.g., initialization).

CALL-PATH PROFILING
Scalasca can efficiently calculate many execution performance metrics by accumulating statistics during measurement, avoiding the cost of storing them with events for later analysis. For example, elapsed times and hardware-counter metrics for source regions (e.g., routines or loops) can be determined immediately and the differences accumulated. Whereas trace storage requirements increase in proportion to the number of events (dependent on the measurement duration), summarized statistics for a call-path profile per thread have a fixed storage requirement (dependent on the number of threads and executed call paths).

In addition to call-path visit counts, execution times, and optional hardware-counter metrics, Scalasca profiles include various MPI statistics, such as the numbers of synchronization, communication and file I/O operations along with the associated number of bytes transferred. Each metric is broken down into collective versus point-to-point/individual, sends/writes versus receives/reads, and so on. Call-path execution times separate MPI message-passing and OpenMP multithreading costs from purely local computation, and break them down further into initialization/finalization, synchronization, communication, file I/O and thread management overheads (as appropriate). For measurements using OpenMP, additional thread idle time and limited parallelism metrics are derived, assuming a dedicated core for each thread.

Scalasca provides accumulated values of metrics for every combination of call path and thread. Which regions actually appear on a call path depends on which regions have been instrumented. When execution is complete, all locally executed call paths are combined into a global dynamic call tree for interactive exploration (as shown in the middle of Figure 1).

Figure 1. The screenshot shows the result of the wait-state analysis of a trace measurement of an execution of the sea ice module of the "Community Earth System Model" (CESM) on 4096 cores of a Blue Gene*/P system. The left pane of the result browser shows the different collected and computed metrics organized in a functional hierarchy. After selecting a metric (here Late Sender), the middle pane shows where it occurs in the source code and how its value is distributed over the call tree. To allow the user to find the higher (worse) values more quickly, Scalasca also uses color coding (see the color scale at the bottom), so one can quickly see that the worst Late Sender problem in this example occurs at an MPI_Waitall() call inside the ice_boundary::ice_haloupdate function. Again, after selecting this call path, one can see how the problem is distributed over the different nodes, processes or threads participating in this operation.




SCALABLE WAIT-STATE ANALYSIS
In message-passing applications, processes often require access to data provided by remote processes, making the progress of a receiving process dependent upon the progress of a sending process. If a rendezvous protocol is used, this relationship also applies in the opposite direction. Collective synchronization is similar in that its completion requires each participating process to have reached a certain point. As a consequence, a significant fraction of the time spent in communication and synchronization routines can often be attributed to the wait states that occur when processes fail to reach implicit or explicit synchronization points in a timely manner. Scalasca provides a diagnostic method that allows the localization of wait states by automatically searching event traces for characteristic patterns. One example is the Late Sender pattern, where the receiver of a message is blocked while waiting for the message to arrive, that is, the receive operation is entered by the destination process before the corresponding send operation has been entered by the source process. The waiting time lost as a consequence is the time difference between entering the send and the receive operations. Another example is the Wait at NxN pattern, which quantifies the waiting time due to the inherent synchronization in n-to-n operations such as MPI_Allreduce(). A full list of the wait-state types supported by Scalasca, including explanatory diagrams, can be found online in the Scalasca documentation.
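
The toy MPI program below (not part of Scalasca) produces exactly this situation when run with at least two ranks: rank 1 enters MPI_Recv well before rank 0 enters the matching MPI_Send, so a trace of the run would exhibit a Late Sender wait state roughly as long as the sender's artificial delay.

    // Minimal MPI example whose trace shows a classic "Late Sender" pattern.
    #include <mpi.h>
    #include <unistd.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double payload = 42.0;
        if (rank == 0) {
            sleep(2);   // imbalance: extra "work" delays the sender
            MPI_Send(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            // Receive is posted immediately; rank 1 blocks until the send starts.
            MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 waited in MPI_Recv because the sender was late\n");
        }

        MPI_Finalize();
        return 0;
    }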

To accomplish the search in a scalable way, Scalasca exploits both the distributed memory and the parallel processing capabilities

A delay is the original source of a wait state, that is, an interval that causes a process to arrive belatedly at a synchronization point, causing one or more other processes to wait. Besides simple computational overload, delays may include a variety of behaviors, such as serial operations or centralized coordination activities that are performed only by a designated process. The costs of a delay are the total amount of wait states it causes. Wait states can also themselves delay subsequent communication operations and produce further indirect wait states. This propagation effect not only adds to the total costs of the original delay, but also creates a potentially large temporal and spatial distance between a wait state and its original root cause. The root-cause analysis closes this gap by mapping the costs of a delay onto the call paths and processes where the delay occurs, offering a high degree of guidance in identifying promising targets for load or communication balancing. Together with the analysis of wait-state propagation effects, the delay costs enable a precise understanding of the root causes and the formation of wait states in parallel programs.

We also leverage Scalasca's parallel trace replay technique to isolate the critical path in a highly scalable way [3]. Instead of exposing the lengthy critical-path structure to the user in its entirety, we use the critical path to derive a set of compact performance indicators, which provide intuitive guidance about load-balance characteristics and quickly draw attention to potentially inefficient code regions. Our critical-path analysis produces a critical-path profile, which represents the time an activity spends on the critical path. In addition, we combine the critical-path profile with per-process time profiles to create a critical-path imbalance performance indicator. This critical-path imbalance corresponds to the time that is lost due to inefficient parallelization in comparison with a perfectly balanced program. As such, it provides similar guidance as prior profile-based load imbalance metrics (e.g., the difference of maximum and average aggregate workload per process), but the critical-path imbalance indicator can draw a more accurate picture. The critical path retains dynamic effects in the program execution, such as imbalance shifting between processes over time, which per-process profiles simply cannot capture. Because of this, purely profile-based imbalance metrics regularly underestimate the actual performance impact of a given load imbalance.

REFERENCES
[1] Markus Geimer, Felix Wolf, Brian J. N. Wylie, Bernd Mohr: A scalable tool architecture for diagnosing wait states in massively parallel applications. Parallel Computing, 35(7):375–388, July 2009.
[2] David Böhme, Markus Geimer, Felix Wolf, Lukas Arnold: Identifying the root causes of wait states in large-scale parallel applications. In Proc. of the 39th International Conference on Parallel Processing (ICPP), San Diego, CA, USA, pages 90–100, IEEE Computer Society, September 2010, Best Paper Award.
[3] David Böhme, Bronis R. de Supinski, Markus Geimer, Martin Schulz, Felix Wolf: Scalable Critical-Path Based Performance Analysis. In Proc. of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Shanghai, China, IEEE Computer Society, May 2012.

available on the target system. After the target application has terminated and the trace data has been flushed to disk, the trace analyzer is launched with one analysis process per (target) application process, and it loads the entire trace data into its distributed memory address space. While traversing the traces in parallel, the analyzer performs a replay of the application's original communication. During the replay, the analyzer identifies wait states in communication and synchronization operations by measuring temporal differences between local and remote events after their timestamps have been exchanged using an operation of similar type. Detected wait-state instances are classified and quantified according to their significance for every call path and system resource involved. Since trace processing capabilities (i.e., processors and memory) grow proportionally with the number of application processes, good scalability can be achieved even at previously intractable scales. Recent scalability improvements allowed Scalasca to complete trace analyses of runs with up to one million cores on a 16-rack IBM* Blue Gene*/Q system.

WORK IN PROGRESS: SCALABLE ROOT-CAUSE ANALYSIS
So far, Scalasca's trace analysis could identify wait states at MPI and OpenMP synchronization points. Wait states, which are intervals during which a process is idle while waiting for a delayed process, are a primary symptom of load imbalance in parallel programs. The new root-cause analysis [2] also identifies the root causes of these wait states and calculates the costs of delays in terms of the waiting time that they induce.

Figure 3. Screenshot showing the results of the root-cause analysis of the CESM sea ice module experiment. One can see the distribution of the delays over the application call tree (middle pane) and the application topology (right pane).


Figure 2. This screenshot shows the same experiment as in Figure 1, however using the application topology to show the mapping of performance metric values to processes instead of a list of nodes and processes. The example application internally uses a 2D grid topology of the Earth's surface to organize its calculations. Each grid element represents a piece of the Earth's surface which is mapped to the corresponding MPI rank. The color indicates the value of the metric measured for each MPI rank (here: the time lost because of Late Sender at the MPI_Waitall() call).



the grid from the particle positions and velocities.

Step 2. The electric field is adjusted to match the Poisson equation (divergence cleaning using a Poisson solver).

Step 3. The electric and magnetic fields for the next step are computed using the discretized Maxwell equations.

Step 4. The electric and magnetic fields are interpolated from the grid to the particle positions. The same interpolation as in Step 1 is used.

Step 5. The particle velocities and positions are advanced in time with Newton's equations, knowing the electric field at the particle positions.
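
The skeleton below sketches these five steps as a time-step loop; the function names are illustrative placeholders, not the actual Helsim or iPIC3D routines.

    // Skeleton of the explicit electromagnetic PIC time-step loop (Steps 1-5).
    #include <iostream>

    void deposit_charge_and_current() { /* Step 1: particles -> grid          */ }
    void divergence_cleaning()        { /* Step 2: Poisson solve on E         */ }
    void update_fields()              { /* Step 3: discretized Maxwell update */ }
    void interpolate_fields()         { /* Step 4: grid -> particle positions */ }
    void move_particles()             { /* Step 5: Newton's equations         */ }

    int main() {
        const int niter = 10;                      // illustrative iteration count
        for (int it = 0; it < niter; ++it) {
            deposit_charge_and_current();
            divergence_cleaning();
            update_fields();
            interpolate_fields();
            move_particles();
        }
        std::cout << "simulation loop finished\n";
        return 0;
    }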

Load imbalances often occur in parallel PIC simulations as particles move, even if the initial state is well balanced. Indeed, computational particle densities can vary by orders of magnitude in time and space. Since computations on particles (Steps 1 and 5) typically account for 90% of the execution time, particle density inhomogeneities can have a serious impact on the code's scalability. This is why some form of dynamic load balancing is needed [4, 5].

Helsim implements an explicit three-dimensional PIC simulation. It takes particular care in balancing the computational load for handling the particles and trading this off against communication. This allows Helsim to simulate experimental configurations with highly imbalanced particle distributions with ease. Moreover, Helsim includes in-situ visualization, where the visualization happens during the simulation. An example simulation result is shown in Figure 2. It displays the magnetic field lines in an experiment involving a ball of charged particles evolving over time.

In the rest of this article we highlight the architectural choices that differentiate Helsim from other PIC simulations. We then show performance results on clusters and list various other experiments enabled by the clean and modern codebase, such as running on the recently launched Intel® Xeon Phi™ coprocessor or analyzing performance with the Sniper hardware simulator [6].

2. HELSIM ARCHITECTURAL HIGHLIGHTS
In this section, we highlight some of the major characteristics of the Helsim proto-app.

2.1. Separate Particle and Grid Data Structures
Particle-in-Cell simulations need to represent particles on the one hand and various fields on the other (electric field, magnetic field, etc.). From a software engineering perspective, we therefore need to design the data structures that represent particles and fields.

One option, taken by many state-of-the-art PIC implementations, is to couple particles and fields into either a single data structure or into two data structures that are distributed similarly. This choice makes it easy to do the particle-to-field (Step 1) and field-to-particle (Step 5) operations. The

Figure 2: Magnetic Field Lines in a Helsim Simulation

Helsim: A Particle-in-Cell Proto-application

Roel Wuyts (1,2,3), Tom Haber (1,4), Arnaud Beck (1,3), Bruno De Fraine (1,5), Charlotte Herzeel (1,5), Pascal Costanza (1,6), Wim Heirman (1,7)

1 Intel ExaScience Lab, Leuven
2 imec
3 KU Leuven
4 Universiteit Hasselt
5 Vrije Universiteit Brussel
6 Intel, Belgium
7 ELIS Department, Ghent University

ABSTRACT
The Intel ExaScience Lab in Leuven, Belgium focuses on accurate space weather modeling, extremely scalable fault-tolerant simulation toolkits, in-situ visualization through virtual telescopes, and the architectural simulation of large-scale systems and workloads. This article describes recent advances achieved in the lab in a proto-application for Particle-in-Cell simulations called Helsim.

Proto-applications are representative of the kinds of applications end users might actually want to run. In this case we focus on an application that is of interest to astrophysicists studying space-weather phenomena, namely Particle-in-Cell simulations. The goal of the application is to provide a realistic, modern, relevant application with a manageable code base that needs to make trade-offs between computation and communication. The result, Helsim, is a 3D electromagnetic Particle-in-Cell simulation with in-situ visualization. It integrates advances achieved by different partners in the lab. More specifically, it showcases a novel approach for load-balancing particles and fields in Particle-in-Cell simulations, a library that offers n-dimensional distributed grids running on high-performance systems consisting of clusters of multi-core compute nodes, and an in-situ visualization approach.

1. INTRODUCTION
Space weather refers to conditions on the Sun, in interplanetary space and in the Earth's space environment. These conditions can influence the performance and reliability of space-borne and ground-based technological systems, and can endanger human life or health. Adverse conditions in the space environment have in fact caused disruptions of satellite operations, communications, navigation and electric power distribution grids in the past, leading to a variety of socio-economic losses [1]. Given these great impacts on society, space weather is attracting growing attention and was chosen as the linchpin application for the ExaScience Lab in Leuven.

Astrophysicists researching space weather are interested in the behavior of plasma: a highly energetic and highly conductive gas in which the atoms have been broken up into their nuclei and their freely moving electrons. One process of interest is the so-called magnetic reconnection phenomenon, in which the magnetic topology is rearranged and magnetic energy is converted into kinetic energy, thermal energy, and particle acceleration. Magnetic reconnection is one of the mechanisms responsible for solar flares on the Sun and for the aurora borealis (the northern lights) on Earth. Understanding it better is therefore of interest to improve the prediction of space weather phenomena.

Plasma behaviour can be described by fluid methods, such as magnetohydrodynamics (MHD), that capture the macroscopic evolution of the system at large scales and low frequencies. However, such models cannot describe the small-scale kinetic processes (wave-particle interactions) that can strongly affect the macroscopic behaviour of the system. This small-scale behavior can be described by kinetic approaches based on the Vlasov equation, self-consistently coupled with Maxwell's equations [2].

Solving the Vlasov equation directly is intractable for all but the smallest of simulations. It is therefore commonly solved using the Particle-in-Cell (hereafter PIC) method [3].

In the PIC method, individual macro-particles in a Lagrangian frame, which mimic the behavior of the distribution function, are tracked in continuous phase space.

Moments of the distribution function, such as charge densities for plasma physics simulations, are computed simultaneously on an Eulerian frame (fixed cells). The explicit electromagnetic Particle-in-Cell method has five main steps that are executed in a time-step loop that runs for a desired number of iterations (see Figure 1):

Step 1. The charge density and current are interpolated on



Figure 1: The 5 Main Steps in an Electromagnetic Particle-in-Cell Method (particle charge density and current deposition, divergence cleaning, electric and magnetic field update, electric and magnetic field interpolation, move particles)



executed after the simulation results have been computed. Such a step requires moving the massive amounts of simulation data between the two processes, typically via the file system, and is therefore cumbersome and inefficient. In Helsim, visualization is done in-situ while the simulation is running, gaining direct access to the data using the resources allocated by the simulation code. In this model, the simulation advances using the normal iterative processing, with regular hand-offs to service visualization requests. The visualization algorithm also leverages the same levels of parallelism as the simulation, namely distributed/shared memory parallelism and vector parallelism.

Since the visualization algorithms can have a different load than the simulation, load imbalances may occur. Rather than redistributing the data across the cluster for visualization purposes only, which would be extremely expensive in terms of energy and time, we perform load balancing only in the shared memory context. In the distributed memory context, we regain some performance by overlapping communication with computation as much as possible using specialized reduction algorithms.

2.4. Helsim Code Statistics
Helsim was built from scratch as a C++11 application, with the explicit goal of offering a clean implementation that astrophysicist researchers can understand, use, and maintain. Moreover, the implementation is relatively small. Figure 3 shows a breakdown of the 3404 lines of code that make up Helsim.

It can be seen that about 1000 lines are related to setting up the experiments (reading configuration files that describe the experiments, etc.) and that the actual implementation of the core of the simulation and visualization takes up about 2000 lines.

A small and relatively clean code base has many practical advantages for experiments. It made it easy to port Helsim to the Intel® Xeon Phi™ coprocessor, to use advanced performance analysis tools like the Sniper architectural simulator, and to experiment with various vectorization options.

3. HELSIM ON HPC CLUSTERS
Since Helsim is a proto-application for HPC systems, it should scale well on HPC clusters. In particular, for astrophysical experiments, it should exhibit good weak scaling, where the problem size is scaled together with the cluster. Figure 4 shows the results of one of our weak scaling experiments with Helsim on a cluster of 32 compute nodes, each featuring dual 6-core Intel® Xeon® 5660 (Westmere-EP) processors.

For this weak-scaling experiment we decided to grow the domain in one dimension, along the Z-axis. The number of cells for a run on C compute nodes is 64×64×(32×C). Every compute node is assigned one MPI process, and within this MPI process we use 8 threads. Each cell contains 1024 particles. We add 4 compute nodes at a time (hence the number of threads increments by 32). When perfect weak scaling is achieved, the lines should remain perfectly flat, indicating that the runtime remains constant as the problem size and the computational power grow. Apart from a notable jump in runtime, this is achieved.
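
The snippet below simply spells out this configuration, using only the numbers quoted above (64×64×(32×C) cells, 1024 particles per cell, one MPI rank with 8 threads per node); it is a reading aid, not part of Helsim.

    // Problem sizes for the weak-scaling experiment described in the text.
    #include <cstdio>

    int main() {
        for (int nodes = 4; nodes <= 32; nodes += 4) {       // 4 nodes added at a time
            const long long cells     = 64LL * 64 * (32 * nodes);
            const long long particles = cells * 1024;        // 1024 particles per cell
            const int       threads   = nodes * 8;           // 8 threads per MPI rank
            std::printf("%2d nodes: %3d threads, %10lld cells, %13lld particles\n",
                        nodes, threads, cells, particles);
        }
        return 0;
    }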

Figure 3: Breakdown of the 3404 lines of code of Helsim according to their function: Initialization 1098, Rebalancing 418, Visualization 246, Debugging/Timing 432, Main 426, PIC Step 196, Move Particles 116, Update Fields 57, Divergence Cleaning 341, Charge Deposition 74


reason is that the same process or thread will have access to both the particles and the fields for those particles. The particle-to-field and field-to-particle operations are therefore easy to implement and distribute. However, it also means that an imbalance in the distribution of particles in space is directly reflected as an imbalance in workload. For example, when many particles are clustered together, which happens when studying particular scenarios in space weather research, all the work is done by only a small fraction of the available compute nodes. This can be countered by load-balancing approaches, such as the one we implemented for a simpler electrostatic particle-in-cell simulation in Unified Parallel C [7].

An alternative option, and the one taken by Helsim, is to use a data structure for the particles that is decoupled from the fields. This data structure can then be distributed over the cluster independently of the distribution of the fields over the cluster. Making it possible to use different distributions for particles and fields enables Helsim to easily simulate experimental configurations with highly imbalanced particle distributions. These highly imbalanced particle distributions are dynamically distributed across the cluster according to load-balancing criteria, and hence the processing of the particles is carried out by all compute nodes of the cluster. This is complemented with a low-cost, lightweight mechanism to adjust the particle distributions at runtime and keep these properties as particles move throughout space.
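
A minimal sketch of this decoupling is shown below. The types are hypothetical and not Helsim's or Shark's actual classes; the point is only that the field block is sized by the spatial decomposition while the particle container is sized by the load balancer.

    // Decoupled particle and field containers (illustrative types only).
    #include <vector>
    #include <cstddef>

    struct Particle { double x[3], v[3], q; };

    // Field data: one local block of the global grid, distributed spatially.
    struct FieldBlock {
        std::size_t nx, ny, nz;        // local block dimensions
        std::vector<double> E, B;      // field components, 3 per grid point
    };

    // Particle data: a locally owned chunk whose size follows the load
    // balancer, independent of which FieldBlock covers those positions.
    struct ParticleChunk {
        std::vector<Particle> owned;
    };

    int main() {
        FieldBlock fields{32, 32, 32, {}, {}};
        fields.E.resize(3 * 32 * 32 * 32);
        fields.B.resize(3 * 32 * 32 * 32);

        ParticleChunk particles;
        particles.owned.resize(100000);  // chosen by load balancing, not geometry
        return 0;
    }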

The advantage of Helsim is that it can support balanced as well as imbalanced distributions of particles. The price paid for imbalanced distributions is that remote communication is needed in the grid-to-particle and particle-to-grid operations, because there is no guarantee that particles and fields are located on the same compute node. While this may seem like a serious drawback for balanced particle distributions, the alternative coupled solution deals very badly with imbalanced particle distributions (when the particles are processed by only one or a few nodes), or not at all (as soon as the total amount of particles for the few nodes that process them exceeds the memory capacity of those nodes). We also remind the reader that most of the time in Particle-in-Cell simulations is actually spent in the particle operations. Choosing the particle distribution as the primary criterion for load balancing therefore makes sense. The experiments section shows that the performance of Helsim is very good, with balanced as well as imbalanced particle distributions.

Helsim not only implements the Particle-in-Cell simulation, it also does in-situ visualization. In-situ visualization means that the visualization is done in parallel with the actual simulation. Unlike classic simulations that generate data that is afterwards processed on a separate visualization cluster, Helsim runs in either scripted or interactive mode and directly produces visual output. In interactive mode, the user can connect to a running simulation and control the camera and visualization options. In scripted mode, the camera and the visualization options are controlled by a Python script that is passed as an argument when the simulation is launched.

2.2. Shark Library
Helsim uses the Shark library to store all of its distributed data structures, including the particles and the various grids. Shark is a high-level library for programming distributed n-dimensional grids that is developed and used at the ExaScience Lab in Leuven. It allows building performant implementations of grid computations in a highly productive manner. Broadly speaking, Shark manages the bookkeeping and distribution of grid data structures, and offers specific computation and communication operations to work with the grid data.

In contrast to MPI programs, where the computation logic is fragmented across the participating processes, a Shark program represents a global view of the computation logic because it is written in terms of partitioned global data and collective operations.

The Shark runtime manages parallelism on three levels, which are common in today's multi-core cluster architectures: distributed memory parallelism using one-sided communication from MPI-2; shared memory parallelism using a thread scheduler such as OpenMP, direct pthreads, TBB, and others; and SIMD vector instructions using compiler auto-vectorization assisted with pragmas.

Thanks to Shark, Helsim can be configured to use only MPI, only shared memory parallelism, or a combination of both. The advantage for Helsim is that this choice has no effect on the code: Shark's abstraction facilities hide most, if not all, of the differences between these approaches.

Because Shark is implemented as a C++11 library, it integrates well with many existing toolchains and code bases. By making use of advanced C++11 features, the new standard that is quickly being adopted by C++ compilers, an extra level of convenience and safety is provided for programmers.

2.3. In-Situ Visualization
Nowadays, visualization is usually performed as a separate step,


Figure 4: Helsim Weak Scaling Results (runtime plotted against the total number of threads, from 32 to 256)



The jump is due to effects on the simulation when the domain becomes more and more oblong. We are collaborating with the astrophysicists in the ExaScience Lab to design an experiment that is better suited for weak scaling studies (while still being a valid Particle-in-Cell simulation) and does not show these effects of the shape of the domain.

We are also in the process of gaining access to PRACE's Tier-0 Curie cluster to repeat this experiment on more nodes.

4. VARIOUS HELSIM EXPERIMENTS
We also run Helsim on the recently launched Intel® Xeon Phi™ coprocessor. The same code base that runs on HPC clusters and on regular Intel® Xeon® nodes can simply be compiled with the appropriate flags to run on the Intel® Xeon Phi™. Running Helsim natively on the Xeon Phi™ made it easy to spot the parts of the code that are better suited for the Intel® Xeon Phi™ coprocessor (the processing of particles in particular) and those better suited for the regular Xeon processor (the field-related operations and in-situ visualization). With this information, we are now able to create a version of Helsim that offloads the particle-related operations to the Intel® Xeon Phi™ coprocessor, knowing that this should speed up the computation.

To gain more insight into optimization opportunities, we also used the Sniper architectural simulator [6] developed at the ExaScience Lab in Leuven. Using Sniper, we can construct cycle stacks that summarize where time is spent while executing the application. A cycle stack is a stacked bar showing the "lost" cycle opportunities. An example is shown in Figure 5, which plots Helsim's simulated performance over time. A number of phases are clearly visible, corresponding to different parts of the code with distinct performance characteristics. Within each time slice, the dark red "base" component represents the best possible performance. All red components together make up the time the cores are performing useful work, whereas the yellow, green and blue components denote stalls due to branch mispredictions, memory accesses and synchronization, respectively.

Tools such as Sniper help us to understand the implications of the various vectorization approaches we are developing, the differences when switching threading approaches in Shark, and the impact of load-balancing strategies.

5. SUMMARY
This paper describes the Helsim proto-application, a 3D electromagnetic Particle-in-Cell simulation with in-situ visualization. Its novel implementation focuses on software engineering aspects while being functionally complete. It is implemented in C++11, using the Shark library for its distributed data structures.

We feel that Helsim strikes the right balance between offering enough complexity to be representative of real applications and being small enough to serve as a case study for software/hardware co-design, tools, and various optimizations. Helsim, where no Particle-in-Cell has gone before.

ACKNOWLEDGEMENTS
This work is funded by Intel and by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT).

6. REFERENCES
[1] J. Kappenman. Geomagnetic Storms and Their Impacts on the U.S. Power Grid. Metatech Corporation, Technical Report Meta-R-319, prepared for Oak Ridge National Laboratory, January 2010.
[2] N.A. Krall and A.W. Trivelpiece. Principles of Plasma Physics. International Series in Pure and Applied Physics, McGraw-Hill, 1973.
[3] R. Hockney and J. Eastwood. Computer Simulation Using Particles. Taylor & Francis, 1988.
[4] R.D. Ferraro, P.C. Liewer and V.K. Decyk. Dynamic Load Balancing for a 2D Concurrent Plasma PIC Code, Journal of Computational Physics, 109, 329–341, 1993.
[5] S.J. Plimpton et al. A Load-Balancing Algorithm for a Parallel Electromagnetic Particle-in-Cell Code, Comp. Phys. Comm., 152, 227–241, 2003.
[6] T. E. Carlson, W. Heirman, L. Eeckhout. Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-Core Simulation. SC'11: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
[7] B. Verleye, P. Henri, R. Wuyts, G. Lapenta, K. Meerbergen. Implementation of a 2D electrostatic Particle-in-Cell algorithm in Unified Parallel C with dynamic load-balancing. Computers & Fluids, doi:10.1016/j.compfluid.2012.08.020, 2012.


Figure 5: Sniper results for a Helsim run

Hiding Global Communication Latency in Krylov Solvers at Exascale

P. Ghysels (1,2), T. J. Ashby (2,3), K. Meerbergen (4), W. Vanroose (1)

1 University of Antwerp
2 Intel ExaScience Lab, Leuven
3 Imec vzw, Leuven
4 KU Leuven

1 INTRODUCTION
Numerous academic as well as industrial applications rely on computational methods that spend most of their compute time in linear algebra operations. Example application areas include data mining, financial analysis, a broad range of physical simulations, computer vision and rendering. More specifically, these fields use techniques such as level sets, interior point methods, K-means, implicit time integration and so on.

In this work, we focus on the general linear algebra problem of solving Ax=b for the solution vector x with a given right-hand side b and sparse matrix A, where x, b and A are extremely large and are stored distributed over a very large number of nodes. We consider a physical problem on a regular 3D grid. The solution vector x can describe, for instance, temperature, pressure or density at the discrete points of this grid.

A vast number of techniques for solving linear systems have been presented in the literature, the most well-known probably being Gaussian elimination or LU decomposition. The problem with such dense matrix factorizations, however, is that they regard the system matrix A as dense and hence consume far too much memory. Sparse direct linear system solvers are very popular in industry and will remain so because of their direct nature, i.e., they do not suffer from a possible lack of convergence. Excellent codes exist that have shown good parallel efficiency, see, e.g., MUMPS, PARDISO, SuperLU. Unfortunately, this type of method severely suffers from fill-in of the sparse matrices, which leads to very high memory requirements. For this reason, direct methods are often impractical or even infeasible for linear systems arising from 3D PDEs.

Krylov solvers, instead of working on the individual elements of the matrix, only require the matrix-vector product routine, i.e., for a given vector v, compute w=Av. Starting from an initial guess x0 for the solution and with r0=b−Ax0 being the initial residual, a Krylov subspace

    Ki(A, r0) = {r0, Ar0, A²r0, …, A^(i−1)r0}        (1)

can be constructed iteratively. The Krylov solver will then try

to find the best possible approximation xi ∈ x0 + Ki(A, r0). Since the Krylov space expands in each iteration, the solution vector will converge to the exact solution.

A powerful concept typically used in iterative methods is orthogonalization. By making the vectors that span the Krylov space orthogonal to each other, it is much easier to determine the best possible approximation in that subspace. Additionally, for numerical stability, the vectors are usually normalized. However, the orthogonalization and normalization of vectors rely on the computation of dot products. The dot product of two vectors v and w is a single scalar α = vᵀw = Σᵢ vᵢ·wᵢ, where the sum runs over the N elements of the vectors. It is clear that in a parallel computing setting, where the elements of the vectors v and w are distributed over the compute nodes, this operation requires communication involving all nodes. First, each node computes its local contribution to the sum, then all contributions are combined using a reduction operation. The result is a single scalar α that has to be broadcast again over the entire machine. For large parallel machines, this reduction followed by a broadcast, called an all-reduce operation, quickly becomes a bottleneck for iterative linear solvers. Most Krylov solvers require at least two global communication phases per iteration, one for orthogonalization and one for normalization. Hence, a single iteration takes at least as long as twice the latency of a global all-reduce operation. The aim of the communication-hiding Krylov methods we present here is to make the methods latency tolerant by overlapping different operations and relaxing data dependencies. The global reductions are done asynchronously instead of blocking. The standard Krylov methods are rewritten such that the results of a reduction are only used when they become available, and other useful work can be done during the communication phase. A typical iteration in our communication-hiding Krylov methods performs a (possibly preconditioned) matrix-vector product to create a new basis vector and also orthogonalizes a previously created basis vector. The idea is similar to the pipe-lining of instructions in a microprocessor. Therefore we refer to the methods as pipe-lined Krylov solvers.
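
The distributed dot product and its all-reduce are easy to make concrete; the sketch below is a generic MPI illustration of the communication pattern just described, not code from any of the solvers discussed here.

    // Distributed dot product: local partial sums combined by a blocking
    // MPI_Allreduce (reduction followed by broadcast).
    #include <mpi.h>
    #include <vector>
    #include <cstddef>
    #include <cstdio>

    double dot(const std::vector<double>& v, const std::vector<double>& w,
               MPI_Comm comm) {
        double local = 0.0;
        for (std::size_t i = 0; i < v.size(); ++i)
            local += v[i] * w[i];                    // local contribution
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;                               // alpha = v^T w on every rank
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        std::vector<double> v(1000, 1.0), w(1000, 2.0);  // local slices of global vectors
        const double alpha = dot(v, w, MPI_COMM_WORLD);
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) std::printf("global dot product: %g\n", alpha);
        MPI_Finalize();
        return 0;
    }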

We focus on two of the most widely used Krylov methods: Conjugate Gradients (CG) and the Generalized Minimal Residual method (GMRES). The CG method is designed for symmetric positive definite matrices, while the GMRES method is applicable to general square, non-singular matrices. In [2], a pipe-lined GMRES method with general pipe-lining depth, p(𝓁)-GMRES, is presented in which the global communication latency is overlapped with 𝓁 matrix-vector products. For preconditioned CG we have recently devised a 1-step pipe-lined method in which the global reduction latency can be overlapped with the application of the matrix-vector product and the application of the preconditioner.





in time for the reduction: an average of 48 µs, several outliers above 200 µs and one outlier of 7205 µs. In [3], the influence of operating system noise on collective communication routines is quantified. It turns out that for large numbers of nodes, small imbalances between nodes can lead to very large increases in latency for collective communications such as global reductions. Clever implementations can reduce or even completely eliminate operating system noise. However, imbalances between nodes are also caused by unevenly distributed workloads or by variability in hardware. Due to the continued reduction in scale and voltage of processors, this hardware variability will become worse. Hardware faults will occur more frequently, and even hardware-based fault recovery (for instance based on error-correcting codes) will lead to imbalances.

The different steps of a single GMRES iteration on 4 nodes are schematically represented on a time line in Figure 3. In this example, the sparse matrix-vector product (blue) only needs communication between neighbors, and this communication is overlapped with local computations. The vector updates (AXPY operations, shown in green) do not require communication. The dot products (red) for the Gram-Schmidt orthogonalization and for the normalization, however, do require global communication, which in this example is done with a binomial reduction tree followed by a broadcast.

3 PIPE-LINING KRYLOV ITERATIONS
An algorithm that scales well on future Exascale machines should remove the data dependencies that lead to blocking global communication. When the reductions are done asynchronously, their latency can be overlapped with useful local work, and small local imbalances will not immediately have global effects.

As in the previous section, we focus on the GMRES algorithm. A standard GMRES algorithm has two global communication phases (the two reduction trees in Figure 3). In [2], we first present a variation of GMRES in which the two global reductions are combined in a single communication phase by merging the normalization phase with the orthogonalization step. This alone already greatly improves scalability.

Furthermore, to eliminate the scaling bottleneck caused by the remaining reduction, the iteration is altered to allow overlap of the reduction with other work, both computation and communication. In the 1-step pipe-lined method, p(1)-GMRES, the reduction is overlapped with the matrix-vector product. Figure 4 shows the p(1)-GMRES iteration schematically. Note how the sparse matrix-vector product (blue) overlaps the communication phase of the dot products (red). The figure shows the duration of a single iteration, but, due to the pipe-lining, several iterations are fused. In this figure, the matrix-vector product does not take long enough to overlap the complete reduction and broadcast step.
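
The overlap mechanism itself can be illustrated with the non-blocking collective MPI_Iallreduce from MPI-3. The sketch below is a schematic of the latency-hiding idea only, not the actual p(1)-GMRES algorithm, and it assumes an MPI-3 capable library; the local loop stands in for the matrix-vector product that runs while the reduction is in flight.

    // Start a non-blocking global reduction, overlap it with local work,
    // and only wait when the result is actually needed.
    #include <mpi.h>
    #include <vector>
    #include <cstddef>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        const std::size_t n = 100000;
        std::vector<double> v(n, 1.0), w(n, 0.0);

        // Local contribution to the dot product used for orthogonalization.
        double local = 0.0;
        for (std::size_t i = 0; i < n; ++i) local += v[i] * v[i];

        double global = 0.0;
        MPI_Request req;
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);        // reduction started, not finished

        // Overlap: local part of the next matrix-vector product (a trivial
        // 1D stencil stands in for the real sparse matvec here).
        for (std::size_t i = 1; i + 1 < n; ++i)
            w[i] = 2.0 * v[i] - v[i - 1] - v[i + 1];

        MPI_Wait(&req, MPI_STATUS_IGNORE);           // result needed from here on
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) std::printf("||v||^2 = %g\n", global);
        MPI_Finalize();
        return 0;
    }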

2 COMMUNICATION PATTERNS IN KRYLOV ITERATIONS
To discuss the communication patterns in Krylov methods, we use the GMRES method as an example. An iteration of GMRES consists of the following steps: 1) construct a new Krylov basis vector, 2) project this new vector onto the previous Krylov subspace and make the Krylov subspace orthonormal. These steps use three computational building blocks: the matrix-vector product, vector updates and dot products. The first building block, the matrix-vector product, only requires communication between neighboring compute nodes. The vector updates (scalar-vector multiplications and vector-vector additions) do not require communication between nodes. The projection building block, however, relying on dot products between vectors, requires global communication due to the all-reduce operation.

Suppose we are solving a linear system defined on a regular 3D grid. Different mappings from the physical quantities defined on the grid to a vector are possible. Likewise, the solution vector x and right-hand side b can be distributed over the available processors in many different ways. We shall assume that the distribution of the physical simulation domain to the compute nodes is such that neighboring points in the mesh are assigned to the same or to neighboring compute nodes. This is for instance the case when a low-dimensional mesh or torus network topology is used.

2.1 Communication pattern of the sparse matrix-vector multiplication
In many relevant applications, the matrix A is the discrete analog of a differential operator. For instance, a finite difference approximation to a partial differential equation (PDE) discretized on a 3D grid leads to a very sparse matrix A. When a finite difference (or other) stencil is used, the matrix will only have as many entries per row as there are points in the stencil. In many applications, the matrix A is not even constructed explicitly. Rather, a routine is implemented that returns the result of applying the matrix to a given vector, the so-called matrix-vector product. Performing the matrix-vector product is analogous to applying the stencil to the grid.

Consider a regular 3D grid with N=nx×ny×nz grid points and a system matrix corresponding to a 7-point stencil (each point is coupled to its 6 closest neighbors). On a cluster with a 3D mesh network topology consisting of P=px⋅py⋅pz compute nodes, the natural distribution of the grid over the nodes is a 3D distribution. Each compute node holds a local cube of size nx/px × ny/py × nz/pz. In order to evaluate the stencil at each point, the faces of the cubes have to be exchanged between neighboring nodes. Figure 1 shows the average wall-clock time for this face exchange step on 4760=20×14×17 nodes for different local cube sizes, from 10³ to 100³, as a function of the size of the cube faces. This was implemented with non-blocking MPI_Isend and MPI_Irecv commands. By using non-blocking communication for the stencil, the communication time can effectively be overlapped with local computations. First, the communication is started asynchronously. The stencil is then applied to the interior points of the local domain. One then waits until the communication has finished, if it has not already, and lastly the points on the boundaries of the local domain are updated. An implementation like this minimizes the communication overhead and typically leads to code that scales well to very large machines, provided there is enough local work to overlap the communication phase.
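The overlap pattern can be sketched as follows. This is a minimal illustration, not the benchmark code: the Cartesian communicator, the pre-packed face buffers and the two stencil routines (apply_stencil_interior, apply_stencil_boundary) are assumptions made for the sketch.

    #include <mpi.h>
    #include <vector>

    // Sketch of overlapping the face exchange with local stencil work.
    // Assumes a 3D Cartesian communicator and face buffers that are already
    // packed; apply_stencil_interior/apply_stencil_boundary stand in for the
    // application's stencil routines.
    void face_exchange_overlap(MPI_Comm cart,
                               std::vector<std::vector<double> >& send_faces,
                               std::vector<std::vector<double> >& recv_faces,
                               void (*apply_stencil_interior)(),
                               void (*apply_stencil_boundary)())
    {
        MPI_Request reqs[12];
        int nreq = 0;
        for (int dim = 0; dim < 3; ++dim) {
            int lo, hi;                              // - and + neighbors in this dimension
            MPI_Cart_shift(cart, dim, 1, &lo, &hi);
            // exchange with the + neighbor: send my + face, receive my + ghost layer
            MPI_Irecv(recv_faces[2*dim+1].data(), (int)recv_faces[2*dim+1].size(),
                      MPI_DOUBLE, hi, 2*dim,   cart, &reqs[nreq++]);
            MPI_Isend(send_faces[2*dim+1].data(), (int)send_faces[2*dim+1].size(),
                      MPI_DOUBLE, hi, 2*dim+1, cart, &reqs[nreq++]);
            // exchange with the - neighbor: send my - face, receive my - ghost layer
            MPI_Irecv(recv_faces[2*dim].data(),   (int)recv_faces[2*dim].size(),
                      MPI_DOUBLE, lo, 2*dim+1, cart, &reqs[nreq++]);
            MPI_Isend(send_faces[2*dim].data(),   (int)send_faces[2*dim].size(),
                      MPI_DOUBLE, lo, 2*dim,   cart, &reqs[nreq++]);
        }
        apply_stencil_interior();                    // local work hides the communication
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        apply_stencil_boundary();                    // ghost faces are now available
    }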

2.2 Communication pattern of the dot-product
To understand the communication pattern of the dot-product, we next benchmark the time it takes to do an all-reduce over the same machine with 4760 nodes. Each of the 4760 nodes holds one double-precision variable and calls the MPI_Allreduce command to get the sum of these values. Figure 2 shows timings for 1000 of these calls to MPI_Allreduce. The average time, excluding the 5% outliers, is shown in both Figure 1 and Figure 2.
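A micro-benchmark of this kind can be written in a few lines; the sketch below (our illustration, not the code used for Figure 2) times each call with MPI_Wtime and prints the result in microseconds on rank 0.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        double local = 1.0, global = 0.0;
        for (int i = 0; i < 1000; ++i) {
            MPI_Barrier(MPI_COMM_WORLD);          // align the nodes before timing
            double t0 = MPI_Wtime();
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            double t1 = MPI_Wtime();
            if (rank == 0)
                std::printf("%d %.1f\n", i, (t1 - t0) * 1e6);   // time in microseconds
        }
        MPI_Finalize();
        return 0;
    }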

From Figure 1 we see that an all-reduce on 4760 nodes for a single scalar takes about as long as the face exchange step for a local cube of dimension 30³. However, unlike the stencil communication, the all-reduce operation can typically not be overlapped by local work due to data dependencies. So during the ∼50µs it takes to perform the reduction, the nodes just sit idle. This clearly has a dramatic effect on the scalability of Krylov methods. With increasing parallelism, the local problem size will only get smaller, moving closer to the strong scaling limit and making the reduction relatively more expensive.

Furthermore, collective communication typically suffers from variability in the system. In Figure 2, notice the huge variation in time for the reduction.


[Figure 3: per-node time lines for p1–p4 with phases SpMV, local dot, reduction, broadcast, AXPY, norm reduction, broadcast, scale, grouped into Gram-Schmidt, H update and normalization.]

Figure 3: Schematic representation of a single iteration of standard GMRES on 4 nodes. Communication for the sparse matrix-vector product (dashed lines) is assumed to be among neighbors only, and its latency is completely overlapped by computations for the SpMV. There are two reductions per iteration: one for the orthogonalization and one for the normalization.

[Figure 1 plot: time (µsec) versus local cube face size (10² to 100²); series: face exchange, average MPI_Allreduce time, linear fit.]

Figure 1: Face exchange times for different problem sizes. Also shown is the time to do an all-reduce over 4760 nodes. On 4760 nodes a global all-reduce is more expensive than the face exchange with local cubes of dimension 30³.

[Figure 2 plot: time (µsec) per MPI_Allreduce call on a logarithmic scale, together with the corresponding average.]

Figure 2: Timings for 1000 MPI_Allreduce calls on 4760 nodes and the corresponding average time. Notice the huge variation in time for the reduction: an average of 48µs, several outliers above 200µs and one outlier of 7205µs.


In exact arithmetic, the resulting pipe-lined methods have convergence behavior identical to that of the methods on which they are based. However, the new methods are affected quite differently by rounding errors due to finite precision calculations. We have tested the methods on a large collection of test matrices, available online as the Matrix Market¹.

Figure 6 compares the convergence of the residual norm for standard GMRES and p(𝓁)-GMRES on the pde900 matrix. In Figure 6 (left), the monomial basis is used, while the figure on the right uses the Newton basis. A black dot denotes a breakdown (a square root has to be taken of a negative value). When such problems are detected, the GMRES method is restarted, which typically slows down the convergence. The greater the pipe-lining depth, the more frequent these breakdowns and the slower the convergence. However, these breakdowns can be avoided by selecting good shifts, see Figure 6 (right), and convergence is then similar to that of standard GMRES.

We compared the parallel performance of pipe-lined GMRES with standard GMRES on Carver, an IBM* iDataPlex* machine at NERSC. The full system has 1,202 compute nodes (9,984 processor cores) with a theoretical peak performance of 106 Tflops/sec. All nodes are interconnected by 4×QDR InfiniBand* technology, providing 32Gb/s of point-to-point bandwidth for high-performance message passing and I/O. Most nodes (1,120) have two quad-core Intel® Xeon® X5550 Nehalem 2.67GHz processors (eight cores/node), and since parallel jobs on Carver are limited to 64 nodes, only these Nehalem nodes were used.

The problem being solved is the 2D Poisson equation discretized on a regular grid (N=1024²) using a standard 5-point finite difference stencil. The algorithms are implemented in C++ using the Message Passing Interface (MPI).

The asynchronous communication of the local boundary and the ghost points is started using non-blocking MPI_Isend and MPI_Irecv, respectively. This communication can be overlapped with computation on the locally interior points. Afterward, a call to MPI_Waitall blocks until communication for the boundary is finished. Finally, the local boundary points can be computed. Communication for the matrix-vector product is overlapped with stencil computations in this manner.

To maximize the potential overlap with the non-blocking reduction, a progress thread should be used. Our tests are performed using MPICH2 version 1.5rc1², configured to use the nemesis channel with TCP and IP-over-InfiniBand (IPoIB) network interfaces. When a progress thread is used, one should configure MPICH2 with --enable-threads=runtime, set the environment variable MPICH_ASYNC_PROGRESS=1 and initialize the MPI library with MPI_Init_thread with MPI_THREAD_MULTIPLE as argument. A non-blocking reduction can then be performed by calling MPI_Iallreduce in combination with a matching MPI_Wait. Although non-blocking or asynchronous point-to-point operations have always been part of the MPI standard, non-blocking collective operations, such as barrier and all-reduce, have only recently been included in the MPI-3 standard, and the current implementations in most available MPI libraries are still experimental.
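The sequence of calls can be sketched as follows (an illustration of the API usage only; the dot-product buffers are placeholders and the overlapped local work is elided):

    #include <mpi.h>

    // Assumes MPICH2 built with --enable-threads=runtime and the job launched
    // with MPICH_ASYNC_PROGRESS=1 so that a progress thread advances the
    // non-blocking collective in the background.
    int main(int argc, char** argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        double local_dots[2] = {0.0, 0.0};    // e.g. local parts of two dot-products
        double global_dots[2];
        MPI_Request req;

        // start the reduction (MPI-3 non-blocking collective)
        MPI_Iallreduce(local_dots, global_dots, 2, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        // ... overlap: sparse matrix-vector product and other local work ...

        MPI_Wait(&req, MPI_STATUS_IGNORE);    // the reduced values are needed from here on
        MPI_Finalize();
        return 0;
    }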

In standard GMRES, a new Krylov basis vector, generated by the matrix-vector product, is made orthogonal to the previous basis vectors. In the next iteration it is then used again as input for the matrix-vector product. In the pipe-lined GMRES method, this strict data dependency is broken. Instead of waiting for the orthogonalization procedure, the new basis vector is used immediately as input for the next matrix-vector product. However, without intermediate orthogonalization, the subsequent Krylov basis vectors tend to converge towards the eigenvector belonging to the largest eigenvalue. Clearly, as these vectors become more and more aligned, they will no longer span the complete Krylov space (they do in exact arithmetic, but not in finite precision calculations) and convergence will slow down. It is therefore necessary to introduce a correction to the basis vectors. However, this correction is applied with a delay of 1 iteration in the case of p(1)-GMRES and a delay of 𝓁 iterations in the more general p(𝓁)-GMRES case. Compared to standard GMRES, this correction introduces additional floating point operations, but it does not introduce additional communication.

Figure 5 shows a schedule for p(2)-GMRES. In this scenario, a Krylov basis of dimension 4 is constructed, using 4 matrix-vector products and 4 global reductions. The first phase of computations starts by performing a matrix-vector product, after which each node computes the local part of the dot-products with the new basis vector. Then the global communication for the dot-products is started. Phase two continues with the next matrix-vector product, again computing local dot-products and starting their global communication. In phase 3, after a third matrix-vector product and more dot-products, the communication for the dot-products started at the end of phase 1 will finally have completed. Now the Krylov basis vector constructed in phase 1 can be made orthogonal. At this point, the correction can also be applied to the basis vector constructed earlier in phase 3. Note that during the first 𝓁=2 phases, no orthogonalization is performed, due to the delay caused by the global communication. This is the pipeline start-up. Likewise, the last two phases do not perform a new matrix-vector product (draining of the pipeline).

Since the correction to the basis vectors is only applied with a delay of 𝓁 steps, it is to be expected that the numerical stability of the pipe-lined methods will worsen with increasing 𝓁. This is indeed the case, as we observe from numerical tests. However, by introducing a shift in the matrix-vector product, i.e., by performing the matrix-vector product with the matrix (A−σiI) instead of just A, stability can be much improved, provided good choices for σi are available. For p(𝓁)-GMRES only 𝓁 shifts σi are needed, and the shift only has to be applied in the first 𝓁 iterations. Taking all shifts zero corresponds to the classical monomial basis. The shifts can be chosen as the roots of the (complex) Chebyshev polynomial that is minimal over the ellipse surrounding the spectrum of the matrix A, and the resulting Krylov basis will be called the Chebyshev basis.
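As a simple illustration, if the spectrum is assumed to be real and contained in an interval [λmin, λmax] (the degenerate case of the ellipse; this simplification is ours, not taken from the article), the 𝓁 shifts can be taken as the roots of the degree-𝓁 Chebyshev polynomial mapped onto that interval:

\[
\sigma_i = \frac{\lambda_{\min} + \lambda_{\max}}{2}
         + \frac{\lambda_{\max} - \lambda_{\min}}{2}
           \cos\!\left(\frac{(2i+1)\pi}{2\ell}\right),
\qquad i = 0, \dots, \ell - 1 .
\]

Estimates of λmin and λmax can, for instance, be obtained from the Ritz values of a few initial GMRES iterations.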

4 NUMERICAL RESULTS AND PERFORMANCE EVALUATION
In order to improve the scalability of the standard Krylov methods GMRES and CG, we relaxed data dependencies by reordering steps and by using different but mathematically equivalent formulations.


[Figure 4: per-node time lines for p1–p4 with phases reduction, broadcast, AXPY, correction, local dot, scale and H update.]

Figure 4: Schematic representation of an iteration of p(1)-GMRES on 4 nodes. Note that the reduction follows the local dot phase from the previous iteration. The redundant computations are represented by the correction block. Note that only a single global reduction is needed per iteration.

[Figure 5: schedule of a p(2)-GMRES cycle; rows show the Krylov dimensions 1–4, columns the computation phases 1–6, with SpMV + local dot entries in the first phases and orthogonalization entries appearing from phase 3 on.]

Figure 5: Pipelined(2) Krylov method. During computation phases 1 and 2, the pipeline is being filled with matrix-vector products and communication for the dot-products is started. In phase 3, the results of the dot-products started in phase 1 are available on each processor and the orthogonalization can be performed. During the last two phases, the pipeline is being drained: no more matrix-vector products are performed.

[Figure 6 plots: relative residual norm versus iterations (0–500) for standard GMRES and p(1)- to p(4)-GMRES; panel (a) without shifts, panel (b) with shifts.]

Figure 6: Convergence results for the different GMRES versions applied to the pde900 test matrix. The GMRES methods are restarted after 100 iterations. As the pipe-lining depth increases, breakdowns (black dots) occur more frequently and convergence slows down correspondingly. Left: without shifts. Right: with shifts.


¹ http://math.nist.gov/MatrixMarket/
² http://www.mcs.anl.gov/research/projects/mpich2/. When a progress thread is used, the MPI implementation must be thread safe, which turned out to be problematic for most of the currently available MPI implementations. We found that MPICH2 has good multi-threading support, but unfortunately it has no InfiniBand support.


Figure 7 (left) shows the average runtime per iteration for the different GMRES methods on Carver. On a single node, pipe-lining does not pay off. On 256 cores (32 nodes), p(4)-GMRES already achieves a 1.45× speedup compared to classical GMRES, which goes up to 1.84× on 512 cores (64 nodes). Figure 7 (right) shows the speedup over a single node.

To study the scaling behavior of the pipe-lined Krylov methods, we constructed an analytical performance model based on machine parameters such as network latency and bandwidth, peak floating point performance and the number of cores and nodes. Figure 8 shows the predicted runtime per iteration for GMRES and pipe-lined GMRES on a hypothetical Exascale machine. The machine parameters were taken from the Swimlane 1 extrapolation in the "Exascale Computing Technology Challenges" report from the DOE. Figure 8 (right) shows, for each of the methods on 2²⁰ nodes, the time spent in global communication, the matrix-vector product and local computations. Since the pipe-lined methods perform slightly more flops than the classical GMRES method, the blue part (local computations) grows slightly for the pipe-lined methods. However, for 𝓁≥3, the global communication latency (the red part) can be completely overlapped by local work.
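The detailed models are given in [2] and [1]; their qualitative behavior can be sketched with a latency-only estimate (a simplification of ours, not the model itself). Writing Twork for the local work per iteration (SpMV plus flops) and Tred ≈ 2 α log2 P for a reduction-plus-broadcast tree with per-hop latency α on P nodes,

\[
T_{\text{GMRES}} \;\approx\; T_{\text{work}} + 2\,T_{\text{red}},
\qquad
T_{p(\ell)} \;\approx\; \max\!\left(T_{\text{work}},\; \frac{T_{\text{red}}}{\ell}\right),
\]

so the reduction latency is completely hidden as soon as 𝓁·Twork ≥ Tred, which is consistent with the 𝓁≥3 threshold visible in Figure 8.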

In [1], a performance model for the pipe-lined GMRES algorithm on large clusters is developed that also takes the influence of the network architecture into account.

5 CONCLUSIONS
In this paper we have illustrated how to improve the scalability of the Krylov methods that lie at the heart of many large-scale simulations by hiding the latencies associated with global reductions.

The time spent in global reductions grows with the number of nodes in a system. Furthermore, global reductions are sensitive to system noise and variability, so they are already a serious performance bottleneck on current HPC hardware. By reorganizing the Krylov subspace algorithm and combining different global reductions, the number of synchronization points is reduced and the latencies associated with these reductions can be hidden behind other work.

The positive effects of replacing the blocking communication with asynchronous communication can already be measured on small systems. Extrapolating our measured data to Exascale systems predicts a significantly improved scalability of the algorithm on extreme-scale machines.

We expect that the strategy of hiding global reductions behind other useful work can be applied to many other algorithms and applications to improve the scalability.

REFERENCES
[1] T. Ashby, P. Ghysels, W. Heirman, and W. Vanroose. The impact of global communication latency at extreme scales on Krylov methods. Algorithms and Architectures for Parallel Processing, pages 428–442, 2012.
[2] P. Ghysels, T. J. Ashby, K. Meerbergen, and W. Vanroose. Hiding global communication latency in the GMRES algorithm on massively parallel machines. SIAM Journal on Scientific Computing, 2012.
[3] T. Hoefler, T. Schneider, and A. Lumsdaine. Characterizing the influence of system noise on large-scale applications by simulation. International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), Nov. 2010.


[Figure 7 plots: left, time per iteration (s) versus number of MPI processes (8 per node) for GMRES, p(1)–p(4) and the SpMV alone; right, speedup over one node versus number of MPI processes.]

Figure 7: Average runtime per iteration (left) and speedup over a single node (right) for a 2D Poisson problem on Carver (IBM* iDataPlex* at NERSC*) for the different GMRES variations described in the text. On a single node pipe-lining does not pay off, but on 256 cores p(4)-GMRES already achieves a 1.45× speedup compared to classical GMRES, rising to 1.84× on 512 cores.


[Figure 8 plots: left, predicted runtime (s) per iteration versus number of nodes (1 to 10⁶) for GMRES, p(1)–p(4) and the SpMV alone; right, predicted time breakdown into global all-reduce, matrix-vector product and local calculation for each GMRES variation on 2²⁰ nodes.]

Figure 8: Left: predicted runtime for a strong scaling experiment (N=2000³) using both classical and pipe-lined GMRES on a hypothetical Exascale machine. Right: predicted breakdown of the different GMRES methods into time spent in local computation, the matrix-vector product and the global all-reduce communication on 2²⁰ nodes with N=2000³. For 𝓁≥3 the global reduction latency can be completely hidden, which results in a 4.9× speedup for both p(3)- and p(4)-GMRES compared to normal GMRES with classical Gram-Schmidt.


Assuming an isotropic medium, we consider a finite-difference time-domain (FDTD) implementation. The equation governing the wave propagation in such a medium is the acoustic wave equation (1/c²) ∂²U/∂t² = ∆U,

where U designates the wave field pressure, c the velocity and t the time variable.

In a finite-difference approach, each value of the field is updated using a combination of its neighboring values. The number of neighbors in each direction defines the order in space of the stencil. Our implementation uses the centered explicit stencil shown in Figure 3. In the following, k denotes the half order of the stencil.
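A sketch of the corresponding update kernel is given below. The array layout, the coefficient array coef[0..k] and the per-cell factor c²Δt² are assumptions made for illustration; this is not the production kernel.

    // One FDTD time step with a centered stencil of half order k
    // (illustrative sketch; boundary handling is omitted).
    void fdtd_step(const float* u, const float* u_prev, float* u_next,
                   const float* vel2dt2,      // c^2 * dt^2 per cell
                   const float* coef, int k,  // coef[0..k]: stencil coefficients
                   int nx, int ny, int nz)
    {
        auto idx = [=](int i, int j, int l) { return (l * ny + j) * nx + i; };
        #pragma omp parallel for collapse(2)
        for (int l = k; l < nz - k; ++l)
            for (int j = k; j < ny - k; ++j)
                for (int i = k; i < nx - k; ++i) {
                    float lap = 3.0f * coef[0] * u[idx(i, j, l)];
                    for (int r = 1; r <= k; ++r)
                        lap += coef[r] * (u[idx(i - r, j, l)] + u[idx(i + r, j, l)]
                                        + u[idx(i, j - r, l)] + u[idx(i, j + r, l)]
                                        + u[idx(i, j, l - r)] + u[idx(i, j, l + r)]);
                    u_next[idx(i, j, l)] = 2.0f * u[idx(i, j, l)]
                                         - u_prev[idx(i, j, l)]
                                         + vel2dt2[idx(i, j, l)] * lap;
                }
    }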

2 PROGRAMMING THE INTEL® MANY INTEGRATED CORES ARCHITECTURE
The Intel® Many Integrated Core (MIC) evaluation version that we use for our experiments contains 60 cores with 4 hyper-threads each, for a total of 240 threads. It is connected via PCIe to a host with two Intel® Xeon® E5-2670 (codename Sandy Bridge-EP) sockets. MIC can be programmed through three major modes:

• In the native mode, the whole application runs directly on MIC and does not use the host.

• In the offload mode, the application runs first on the host and some parts are executed on MIC. Data are transferred via PCIe.

• In the symmetric mode, both MIC and the host are used simultaneously as two independent nodes.

For our FDTD implementation, we are mainly interested in the native and symmetric modes, since the offload implementation would induce an overhead due to the transfer of the whole wave field at each time step.

2.1 Native Implementation
For our native implementation, we considered half stencil orders varying from 1 to 8 and different numbers of threads on MIC. Figure 4 illustrates this experiment and highlights the impact of hyper-threading on the FDTD implementation in an isotropic medium for all chosen stencil orders.

Compared to a usual CPU implementation, we did not apply any changes to the FDTD kernels. We only used the appropriate compilation flag (-mmic) to be able to run on MIC.

2.2 Symmetric Implementation
The symmetric implementation involves both the host and MIC and consists of a hybrid MPI plus OpenMP implementation of the FDTD kernels. On the node level, we opt for 2 MPI sub-domains: one on the host and one on the MIC. The MPI communications involve only the layers of ghost cells, whose number is equal to half the stencil order. We only decompose along the Z direction, as the slices are contiguous in memory. Figure 5 illustrates our MPI decomposition.

As we are using a heterogeneous machine, where the frequency and the number of processing units differ between the host and MIC, we pay attention to our domain decomposition, as load imbalance can induce a drop in performance.

Figure 3: 3D centered stencil of order 8 (k=4).

[Figure 4 plot: performance (MCells/sec) versus number of threads on MIC (60–240) for half stencil orders k=1 to k=8.]

Figure 4: As we increase the order of the stencil, we increase the number of arithmetic operations needed to compute the Laplacian of the wave field. We notice that, in general, hyper-threading has a positive impact on the isotropic implementation.

Innovative Programming Models for Depth Imaging Applications

Asma Farjallah1,2

Marc Tchiboukdjian1,2

Bettina Krammer1,2

Henri Calandra3

and William Jalby1,2
1 University of Versailles Saint-Quentin-en-Yvelines, France
2 Exascale Computing Research, France
3 TOTAL Exploration & Production, Pau, France

ABSTRACT
Wave propagation simulations in the Earth's crust are of major interest for seismic surveys, as they build a synthetic image of the subsurface, helping companies locate deposits of crude oil and gas and thus facilitating the choice of drilling targets. As Exascale systems are expected in the 2018–2020 range, these applications need to be prepared for such systems. In this paper, we present new programming models applied to depth imaging applications. To leverage the Intel® Many Integrated Core architecture, we propose a hybrid MPI plus OpenMP implementation with adaptive domain decomposition. We also tested and evaluated the performance of an Intel® Concurrent Collections (CnC) implementation. As parallelism and communications are implicit in CnC, it is a productive way to program distributed memory machines.

Keywords. Seismic modeling, Reverse Time Migration, programming models, Many Integrated Core, Concurrent Collections.

1 SEISMIC IMAGING APPLICATIONS
As we move towards the Exascale level of performance, challenges at the hardware and software levels are becoming more acute, imposing constraints on both architects and developers. Rethinking the way we express parallelism, and consequently the way we develop applications in a relevant architectural context, is a first step in the co-design process.

Industry relies on depth imaging applications to model the subsurface and thus precisely locate oil and gas deposits. The accuracy of the image depends on the numerical methods implemented in the application. For geophysical imaging, one can numerically solve the wave equation using three distinct approaches: spectral methods, the strong formulation and the weak formulation [3].

1.1 Reverse Time Migration
Reverse Time Migration (RTM) [1] is a commonly used application, as it represents a good compromise between the quality of the rendered image and the time to solution. Figure 2 illustrates the different steps of RTM. During exploration campaigns, data is collected by sending vibrations from the surface into the earth. Discontinuities in the subsurface induce reflections of the wave, and receivers record the reflected waves. This data is used to build an image of the crust. The first step is the direct problem: it consists of solving the wave propagation equation for given velocity and density models. These models are validated beforehand using benchmarks such as the 2004 BP benchmark illustrated in Figure 1. The second step is the adjoint problem, which is a propagation backward in time of the signals collected by the receivers. A cross-correlation of the estimated and the recorded data adjusts the velocity and density models and results in a final image of the subsurface.

1.2 Seismic Modeling
Seismic modeling consists of building synthetic seismograms based on a model of the subsurface.


Figure 1: The 2004 BP velocity-analysis benchmark: a synthetic velocity model designed to study challenging sub-salt environments such as those encountered in the Gulf of Mexico and off-shore West Africa.

(1/c²) ∂²U/∂t² = ∆U

[Figure 2 diagram: 1. source signal from shots, 2. wave propagation, 3. receiver signal; the subsurface model m feeds the direct problem, which produces the modeled signal dcal(m); together with the observed data dobs it enters the imaging condition ƒ(dobs, dcal(m)), and the adjoint problem yields the subsurface image.]

Figure 2: The resolution is done in 2 steps: first the direct problem is solved to model the wave propagation, then the adjoint problem is solved to geometrically place the subsurface discontinuities and thus build an image.


In addition, the Intel® CnC runtime has several performance-oriented features such as zero-copy communication of items in shared memory, dynamic load balancing and automatic overlapping of communications with computation. Moreover, the Intel® CnC runtime allows parallelism experts to control the mapping of steps and items onto the nodes, through the use of tuners, to optimize load balancing and communication volume.

To evaluate the performance of the Intel® CnC runtime, we implemented the direct problem of RTM using a stencil of order 8 in space and compared the performance to a standard MPI implementation. The MPI implementation uses a 3D domain decomposition with a layer of 4 ghost cells, which are exchanged using point-to-point communications on a Cartesian communicator. Sub-domains are treated by blocks to improve cache usage. No overlapping of communications with computation is implemented. The CnC implementation is similar to the MPI implementation in an over-decomposed mode, i.e., the domain is decomposed into small pieces independently of the number of cores. A CnC step corresponds to the computation within one sub-domain for one time step. For an efficient execution, a large number of steps should be ready to execute at all times, so the sub-domain size should be small. However, the amount of computation within a step should be enough to amortize the CnC runtime overhead. For the experiments, we chose two sub-domain sizes: 80³ and 160³.

We perform a strong scaling study using a grid size of 1280×1440×1440 on up to 144 nodes of the CURIE¹ machine, whose nodes consist of two 8-core Sandy Bridge-EP sockets and are interconnected by InfiniBand QDR. The performance of the CnC program is comparable to that of the MPI implementation (see Figure 8).

The CnC implementation with larger sub-domains performs better but does not expose enough parallelism to scale above 600 cores. The CnC implementation with smaller sub-domains is able to scale further.

4 FUTURE WORK
In order to efficiently compute the imaging condition, the wave field needs to be stored frequently during the solve phase of the direct problem, generating huge volumes of data. This is a major challenge for the full RTM. Future work will therefore focus on experimenting with the full RTM. Our current symmetric and CnC implementations will be extended to include the computation of the adjoint problem and the cross-correlation. The symmetric implementation will target clusters of multiple-MIC nodes. For CnC, we will develop a new domain decomposition method to reduce the overhead when using small sub-domains and thus be able to scale further.

5 ACKNOWLEDGMENTS
We acknowledge PRACE for awarding us access to the resource CURIE, based in France at TGCC.

REFERENCES
[1] Edip Baysal. Reverse time migration. Geophysics, 48(11):1514, November 1983.
[2] Zoran Budimlic, Michael Burke, Vincent Cave, Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, and Sagnak Tesmer. Concurrent collections. Scientific Programming, 18(3,4):203–217, 2010.
[3] Jean Virieux, Henri Calandra, and R.É. Plessix. A review of the spectral, pseudo-spectral, finite-difference and finite-element modelling techniques for geophysical imaging. Geophysical Prospecting, 59(5):794–813, 2011.

[Figure 8 plot: parallel efficiency versus number of cores (8 to 2304) for CnC (80), CnC (160) and MPI.]

Figure 8: Performance of the CnC program compared to the MPI implementation on the direct problem of RTM. Efficiency is computed with reference to the best sequential program running replicated on one socket.

In this implementation, we use static load balancing with a parameter, denoted cost, equal to the ratio of the Z-direction sub-domain sizes on the CPU and on the MIC. For every run, we report timings for computation, MPI communications and the overhead due to synchronizations. Figure 6 shows the percentages of these timings relative to the main FDTD loop for three values of cost. It shows that a cost of 0.9 minimizes the impact of the overhead (1% of the time), which results in an increase in overall performance.
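As a minimal sketch (variable names are ours), the static split of the Z-slices between host and coprocessor follows directly from this definition of cost:

    #include <cmath>

    // cost = nz_cpu / nz_mic; derive the number of Z-slices per device.
    void split_z(int nz_total, double cost, int& nz_cpu, int& nz_mic)
    {
        nz_cpu = static_cast<int>(std::lround(nz_total * cost / (1.0 + cost)));
        nz_mic = nz_total - nz_cpu;
    }

For example, with cost = 0.9 the host receives roughly 47% of the slices and the MIC the remaining 53%.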

As mentioned previously, we only need to transfer k layers of ghost cells at each time step, where k is half the stencil order. As a consequence, we reduce the overhead due to data transfers over PCIe compared to the offload programming model, where the whole grid is transferred at each time step.

For computation, we use OpenMP threads. We deploy different numbers of threads on the CPU and on MIC. We also pay attention to the cache block sizes we choose on these two different parts of the machine.

Figure 7 shows the performance relative to a single Sandy Bridge socket. We notice that the two sockets scale perfectly, that MIC alone is slightly faster than the two sockets, and that the symmetric implementation reaches a speedup of 4.24, which is almost the sum of the two-socket Sandy Bridge performance and the performance of MIC in native mode.

3 PRODUCTIVE PROGRAMMING ON CLUSTERS WITH CONCURRENT COLLECTIONS
Concurrent Collections (CnC) [2] is a parallel programming model that is higher level than more conventional models such as MPI or OpenMP. A CnC program is composed of collections of items (small pieces of data) and steps (small pieces of computation operating on the items). In addition, the application programmer provides:

• data dependencies, i.e., the items consumed and produced by a step, and

• control dependencies, i.e., which steps become ready to execute after the completion of a step.

Parallelism and communications are implicit; it is the runtime environment's responsibility to schedule the computation in parallel and to communicate data between nodes while respecting data and control dependencies. Moreover, the execution of a CnC program is deterministic, which eases debugging. Thus, CnC has the potential to be a productive programming model.

Besides being a relatively high-level language, CnC is well suited to parallelism-oblivious programmers.

[Figure 6 bar chart: breakdown of time (%) into computation, communication and imbalance for cost values 0.5, 0.9 and 1.]

Figure 6: Percentages of the computation, the MPI communications and the overhead, relative to the time spent in the main loop of the isotropic implementation, for three values of cost.

[Figure 7 bar chart: speedup relative to one Sandy Bridge socket for 1 socket SNB, 2 sockets SNB, MIC, and MIC + 2 sockets SNB.]

Figure 7: Relative performance compared to a single Sandy Bridge socket.

¹ http://www-hpc.cea.fr/en/complexe/tgcc-curie.htm

[Figure 5 diagram: the domain is cut along Z into a CPU part and a MIC part, separated by layers of ghost cells.]

Figure 5: Domain decomposition for the symmetric implementation is done along the Z direction only.


Such stiff time scales cannot be approached by conventional explicit methods. In preceding works [4, 3], we focused on the treatment of stiff time scales and showed that an operator splitting method [10], together with dedicated solvers [6] for the sub-steps, efficiently addresses the temporal multiscale aspects.

In the spatial domain, reaction-diffusion problems can also give rise to solutions with very strong spatial gradients. This behavior can be readily illustrated by the combustion of a flammable gaseous mixture in a tube ignited from one side. As the flame front progresses to the other side, part of the solution varies very sharply in space (at the flame interface), while the states before and after the flame are quasi-uniform.

2 TACKLING MULTISCALE PROBLEMS: MULTIRESOLUTION METHODS
For multiscale systems, the numerical complexity derives from the high resolution required to capture the sharp spatial gradients and keep the truncation error under control. However, using a uniform high-resolution Cartesian grid is often not tractable for multiscale problems, because of the high computational intensity and memory footprint required.

One key idea to overcome this issue is to use a high-resolution grid where the solution exhibits strong gradients, and a coarser discretization in uniform regions. This idea is at the core of a number of numerical methods, such as Adaptive Mesh Refinement (AMR) (see for example [1]). Z-CODE relies on a multiresolution representation of the fields, which borrows from wavelet theory and its strong mathematical background [6]. The multiresolution procedure consists in breaking a given signal down into a “smooth” component (by locally averaging 2^d cells in dimension d into a single coarser value) and a “sharp” component (which can be understood as the residual between the actual signal and the smooth component). The sharp component, called the details, can be used to determine where further refinement is needed (typically, the grid is refined where the details exceed a given threshold). This procedure can then be carried out recursively, by further breaking down the details in the newly refined regions into low-frequency and high-frequency components. Each step of this refinement process creates a new level in the multiresolution grid.
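In one dimension, for example, the passage from level l to level l−1 can be written as a Haar-type average and detail (a textbook illustration; the actual prediction operator used in Z-CODE may be of higher order):

\[
s^{\,l-1}_{i} = \tfrac{1}{2}\left(s^{\,l}_{2i} + s^{\,l}_{2i+1}\right),
\qquad
d^{\,l}_{i} = s^{\,l}_{2i} - s^{\,l-1}_{i},
\]

and the grid is kept refined wherever |d^l_i| exceeds a prescribed threshold ε; in dimension d, 2^d children are averaged into one parent cell.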

Z-CODE provides a multiresolution implementation for reaction-diffusion systems, based on a recursive octree grid structure in 3D. A refined octree mesh is shown in Figure 3. This technique allows Z-CODE to significantly reduce the required number of cells, and therefore also the memory footprint and compute time. Figure 2 shows a plane cut of a 3D simulation of the so-called Belousov-Zhabotinsky reaction-diffusion system [2], which produces a spiral wave. In Z-CODE, the multiresolution grid is updated dynamically after each time step, to closely follow the evolution of the solution.

The current version of Z-CODE is written in C++, and relies on the Intel® Threading Building Blocks (TBB) [11] library to express shared memory parallelism. As a preliminary project at the Exascale Computing Research Lab, we have analyzed the behavior of the code on current Intel® multi-core and many-core architectures. We now discuss some challenges and prospects of relevance to Z-CODE and multiresolution methods in general.

Z-CODE: Towards an Efficient Implementation of Multiresolution Methods

T. Dumont1

V. Louvet1

T. Guillet2,3
1 Institut Camille Jordan, Université Lyon 1 & CNRS, Villeurbanne, France
2 Intel
3 Exascale Computing Research

Ischemic strokes, the occlusion of blood vessels in the brain, are a major global public health concern; yet the detailed biochemical mechanisms involved in the disease are not fully known. Understanding the processes involved can play an important role both in short-term emergency treatment and in the development of candidate drugs to help contain the stroke.

In the event of an ischemic stroke, the obstruction of the blood flow causes a deficiency in the energy supply to the cells, thereby modifying the local potentials. As ions can propagate in the inter-cellular medium, ionic channels far from the ischemic zone can be triggered, resulting in an expansion of the disorder in the brain. Drugs which are able to modulate the ionic channels may help control the disease; however, they have so far only proved effective on rodents and not on humans. It is theorized that this lack of effectiveness may be linked to the complexity of the chemical reaction network in the human brain. Numerical simulation of the ionic imbalance process may help build our understanding of brain chemistry, and is the driving problem for the development of the code Z-CODE.

1 FROM SIMULATIONS OF ISCHEMIC STROKES TO COMBUSTION: STIFF REACTION-DIFFUSION PROBLEMS
It turns out that the evolution of ionic species during ischemic strokes can be described by an electrodynamic model [5], at least during the first two hours of the stroke. There are two main mechanisms involved in the process. Locally within brain tissue, ionic concentrations evolve because of the activation of various ionic gates. This process can be modeled as a local chemical reaction process. In addition, the effects of ion imbalance influence neighboring brain matter regions, which is well described by a diffusion process. These two mechanisms are coupled and occur simultaneously, and the resulting system of equations is known as a reaction-diffusion problem. Figure 1 pictures the ionic imbalance in the brain at the onset of an ischemic stroke, as modeled in a reaction-diffusion simulation.

While the study of ischemic strokes was the initial driving problem for Z-CODE, reaction-diffusion systems actually apply to a number of other physical or engineering problems. In some combustion problems, flame propagation may also be described by reaction-diffusion equations coupling the local chemical reactions to their spatial propagation.

From a numerical point of view, these problems share a lot of interesting and challenging properties. For many scientific applications, the number of species involved in the reaction-diffusion process can be quite large: the ischemic stroke model relies on about 20 chemical species, while the simulation of complex chemical networks can involve hundreds of unknowns (for example in combustion or atmospheric chemistry applications). As a result, simulating these processes requires large amounts of memory and computational power.

A particular difficulty in the coupling of nonlinear reaction processes to diffusion is the highly multiscale behavior of the solution, both in space and time. For example, the chemical reactions for ischemic stroke modeling involve time scales spanning a ratio of up to 10⁸.


Figure 2: Spiral wave solution of the Belousov-Zhabotinsky reaction. Note the very sharp spatial gradients developing across the spiral shape, whereas regions away from the spiral are very smooth, illustrating the need for multiresolution. We used 10 grid levels, for which a uniform grid would contain 10⁹ cells, whereas the multiresolution mesh has only 1.2×10⁸, corresponding to an 88% compression.

Figure 1: Ionic imbalance in the brain at the onset of an ischemic stroke in a reaction-diffusion simulation.

Figure 3: Multiresolution mesh in 3D.


Recent architectures provide execution units capable of applying the same operation to multiple operands at once. The Intel® Sandy Bridge architecture introduced the Advanced Vector Extensions (AVX) instructions, which are able to operate on 8 single-precision floating-point numbers. The Intel® Xeon Phi™ coprocessor, codenamed Knights Corner, extends this capability to 16 single-precision operands.

To be able to benefit from vectorization, the computations should be organized so that they can be carried out on multiple operands at once. In the case of implicit stiff solvers, expressing data parallelism is intrinsically challenging: because of the extreme stability requirements for the numerical integrators, the processing of a cell is highly data-dependent, due for example to control flow in convergence tests. Because the reaction solver is the largest hotspot, we focused our first vectorization efforts on the computation of the reaction rates used to evolve the chemistry. These fine-grained vectorization optimizations yielded a 15–20% performance improvement of the reaction steps on a Sandy Bridge processor compared to non-vectorized code. On Knights Corner, partial vectorization improved performance by 50%, highlighting the importance of data parallelism on recent microprocessors.
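The flavor of these fine-grained optimizations can be illustrated with a schematic rate evaluation over a batch of cells (the rate law below is purely illustrative; the ionic reaction network of the stroke model is far more complex, and the vectorization hint could equally be the Intel-specific pragma available at the time):

    #include <cstddef>

    // Evaluate a simple mass-action rate k*a*b for a batch of cells.
    // A structure-of-arrays layout keeps a, b and rate contiguous so that the
    // compiler can map the loop onto AVX or Intel Xeon Phi vector units.
    void reaction_rates(const double* a, const double* b, double* rate,
                        double k, std::size_t n)
    {
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i)
            rate[i] = k * a[i] * b[i];
    }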

In spite of this, a lot of the power of the processor's vector unit remains untapped, because of the complex control flow of the implicit solver. As stiff ODE methods are a key ingredient in the numerical resolution of many challenging scientific problems, including chemistry, combustion, bioscience or plasma physics, improving the efficiency of such solvers can have a far-reaching impact. Exploring the development of vectorization-aware solvers for stiff ODEs, in collaboration with the numerical analysis community, could be a fruitful further development.

4 TOWARDS A DISTRIBUTED MEMORY IMPLEMENTATION
Adaptive mesh and multiresolution techniques are at the heart of many scientific problems in need of high resolution. Z-CODE, as a shared-memory code, is a first step towards a large-scale distributed memory application, but tackling extreme-scale reaction-diffusion problems such as combustion will require adapting the application's data structures and parallelism. While most adaptive mesh or multiresolution applications today rely on traditional message passing (e.g., MPI) with explicit user-controlled load balancing, a number of task-based execution models are emerging which could help express and leverage extreme parallelism toward Exascale.

In a follow-up work, we plan to focus on existing task-based execution models in shared and distributed memory, in particular relying on work stealing techniques. Such a project would focus on the implementation of proto-applications built around different existing programming models. Developing proto-applications allows for maximal flexibility and versatility, and relies on the sound knowledge of the application domain and the underlying numerical methods provided by the collaboration. Proto-applications are also a good way of sharing scientific results with the community, as illustrated by existing projects such as the Mantevo Project [8].

REFERENCES
[1] M. J. Berger and P. Colella. Local adaptive mesh refinement for shock hydrodynamics. J. Comput. Phys., 82, 1989.
[2] The Belousov–Zhabotinsky reaction. http://en.wikipedia.org/wiki/Belousov-Zhabotinsky_reaction
[3] S. Descombes and T. Dumont. Numerical simulation of a stroke: Computational problems and methodology. Progress in Biophysics and Molecular Biology, 97:40–53, 2008.
[4] S. Descombes, T. Dumont, V. Louvet, and M. Massot. On the local and global errors of splitting approximations of reaction-diffusion equations with high spatial gradients. Int. J. of Computer Mathematics, 84(6):749–765, 2007.
[5] M.-A. Dronne, J.-P. Boissel, and E. Grenier. A mathematical model of ion movements in gray matter during a stroke. J. of Theoretical Biology, 240(4):599–615, 2006.
[6] M. Duarte, M. Massot, S. Descombes, C. Tenaud, T. Dumont, V. Louvet, and F. Laurent. New resolution strategy for multi-scale reaction waves using time operator splitting, space adaptive multiresolution and dedicated high order implicit/explicit time integrators. Submitted to SIAM J. Sci. Comput., available on HAL (http://hal.archives-ouvertes.fr/hal-00457731), 2011.
[7] E. Hairer, S. P. Nørsett, and G. Wanner. Solving ordinary differential equations I. Nonstiff problems. Springer-Verlag, Berlin, second edition, 1993.
[8] The Mantevo Project. https://software.sandia.gov/mantevo/
[9] Space-filling curve. http://en.wikipedia.org/wiki/Space_filling_curve
[10] G. Strang. On the construction and comparison of difference schemes. SIAM J. Numer. Anal., 5:506–517, 1968.
[11] Intel Threading Building Blocks. http://threadingbuildingblocks.org/
[12] Z-order curve. http://en.wikipedia.org/wiki/Z-order_curve


3 TOWARDS AN EFFICIENT IMPLEMENTATION OF MULTIRESOLUTION METHODS
In order to achieve high performance on modern microprocessors, application developers can no longer rely on ever-increasing CPU clock speeds. Designing high-performance applications requires taking advantage of the CPU caches and expressing as much thread and data parallelism as possible. New parallel computing solutions, such as the Intel® Xeon Phi™ coprocessor, target applications which can express a high level of thread and data parallelism, and can provide better energy efficiency.

3.1 Data locality
Data movements to and from memory are already a common bottleneck in High Performance Computing applications today, and are expected to be a central issue for power consumption as we reach Exascale-class machines. As a result, making careful use of the CPU caches through locality-friendly algorithms and data structures is of growing importance.

Traversing the complex octree structure used in Z-CODE requires accessing some elements multiple times, most notably for the diffusion step, which consists of a series of sparse matrix-vector multiplications to compute the Laplacian operator on the adaptive mesh. Z-CODE addresses this problem by encoding and storing the tree nodes using a space-filling curve known as the Z-order curve [12, 9], which is depicted in Figure 4. This scheme ensures that cells at neighboring geometrical locations in the mesh will usually be processed in close sequence and stored in nearby locations in memory, providing both temporal and spatial locality.
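The encoding itself amounts to interleaving the bits of the three cell coordinates into a single Morton key, as in the sketch below (an illustration, not the encoding routine used inside Z-CODE):

    #include <cstdint>

    // Spread the lowest 21 bits of v so that two zero bits separate each bit.
    static std::uint64_t spread_bits(std::uint64_t v)
    {
        v &= 0x1fffff;                               // keep 21 bits
        v = (v | v << 32) & 0x1f00000000ffffULL;
        v = (v | v << 16) & 0x1f0000ff0000ffULL;
        v = (v | v <<  8) & 0x100f00f00f00f00fULL;
        v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
        v = (v | v <<  2) & 0x1249249249249249ULL;
        return v;
    }

    // Interleave (x, y, z) into a 64-bit Morton key; sorting cells by this key
    // yields the Z-order traversal illustrated in Figure 4.
    std::uint64_t morton3d(std::uint32_t x, std::uint32_t y, std::uint32_t z)
    {
        return spread_bits(x) | (spread_bits(y) << 1) | (spread_bits(z) << 2);
    }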

3.2 Efficient parallelism
Thread parallelism consists in exploiting the available cores and hardware threads of the CPU. With the increasing number of cores in multi-core processors, and the advent of many-core platforms, fully exploiting thread parallelism becomes of primary importance for application performance.

Because the reaction steps simulate the local evolution of the chemical reactions, the computation is embarrassingly parallel and each grid point can be updated concurrently and independently. However, as Z-CODE uses an implicit integration method (Radau5, derived from Runge-Kutta schemes [7]) for the reaction steps, the time required for the chemistry to converge strongly depends on the local values of the unknowns, making dynamic load balancing necessary for performance. In the present implementation of the code, the TBB runtime handles the scheduling of the computation tasks to the available threads, taking care of dynamic load balancing.
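A sketch of this pattern with TBB is shown below; advance_chemistry is a placeholder for the Radau5-based update of one cell, and the cell container is simplified to a flat vector.

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>
    #include <cstddef>

    // Embarrassingly parallel reaction step: each cell is advanced independently,
    // and the TBB scheduler (work stealing) balances the uneven per-cell cost
    // of the implicit solver across the available threads.
    void reaction_step(std::vector<double>& cells, double dt,
                       void (*advance_chemistry)(double&, double))
    {
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, cells.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    advance_chemistry(cells[i], dt);
            });
    }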

The most challenging algorithm, in terms of parallelism, is the mesh update step. During the mesh update, constraints are enforced on the mesh structure, which can, in some cases, trigger non-local mesh transformations. As the algorithm operates on a shared data structure, care must be taken to ensure consistency of the multiresolution tree without data race conditions.

The mesh update remains the least scalable part of Z-CODE, and will be the focus of future optimization efforts. Task-based parallelism, which allows the programmer to express opportunities for independent computation without diving into scheduling or load-balancing considerations, could provide the ideal framework for achieving fine-grain parallelism with very good load balancing, and is supported by TBB. Part of an upcoming effort will be to assess the potential of such algorithms to improve code scalability.

3.3 Data parallelism and vectorization
Another type of parallelism of growing importance on modern microprocessors is data-level parallelism, or vectorization.


Figure 4: Illustration of the Z-curve ordering of the mesh cells in a 2D refined quad-tree.


For more information visit www.Exascale-labs.eu

Intel ExaCluster Lab – Jülich
Forschungszentrum Jülich
Jülich Supercomputing Centre
Wilhelm-Johnen-Strasse
52425 Jülich
Germany

Intel Exascale Computing Research Lab – Paris
Université de Versailles Saint-Quentin-en-Yvelines
45, Avenue des Etats-Unis
78000 Versailles
France

Copyright © 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Core, Intel Xeon and Intel Xeon Phi are trademarks or registered trademarks of Intel Corporation in the United States and other countries.

This document and the information given are for the convenience of Intel’s customer base and are provided “AS IS” WITH NO WARRANTIES WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS. Receipt or possession of this document does not grant any license to any of the intellectual property described, displayed, or contained herein. Intel products are not intended for use in medical, life-saving, life-sustaining, critical control, or safety systems, or in nuclear facility applications.

*Other names and brands may be claimed as the property of others.

Intel ExaScience Lab – Leuven
IMEC
Kapeldreef 75
3001 Leuven
Belgium

Intel-BSC Exascale Lab – Barcelona
Barcelona Supercomputing Center
Nexus II Building
c/ Jordi Girona, 29
08034 Barcelona
Spain

The Intel European Exascale Labs are part of Intel Labs Europe (ILE), a network of Research & Development, Product, and Innovation Labs spanning the European region as well as a variety of Intel business units. ILE was formally established in early 2009 as a central means of coordinating activities across Intel’s diverse network of labs and to further strengthen Intel’s commitment to and alignment with European R&D. In addition to driving key technology innovations for Intel, ILE works closely with academic, industry, and government institutions to advance innovations and strengthen Europe’s technology leadership in the global community. For information visit www.intel.com/europe/labs.

