Tibidabo†: Making the Case for an ARM-Based HPC System

Nikola Rajovic a,b,∗, Alejandro Rico a,b, Nikola Puzovic a, Chris Adeniyi-Jones c, Alex Ramirez a,b

a Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
b Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya - BarcelonaTech, Barcelona, Spain
c ARM Ltd., Cambridge, United Kingdom

Abstract

It is widely accepted that future HPC systems will be limited by their power consumption. Current HPC systems are built from commodity server processors, designed over years to achieve maximum performance, with energy efficiency being an afterthought. In this paper we advocate a different approach: building HPC systems from low-power embedded and mobile technology parts, over time designed for maximum energy efficiency, which now show promise for competitive performance.

We introduce the architecture of Tibidabo, the first large-scale HPC cluster built from ARM multicore chips, and a detailed performance and energy efficiency evaluation. We present the lessons learned for the design and improvement in energy efficiency of future HPC systems based on such low-power cores. Based on our experience with the prototype, we perform simulations to show that a theoretical cluster of 16-core ARM Cortex-A15 chips would increase the energy efficiency of our cluster by 8.7x, reaching an energy efficiency of 1046 MFLOPS/W.

Keywords: high-performance computing, embedded processors, mobile processors, low power, cortex-a9, cortex-a15, energy efficiency

†Tibidabo is a mountain overlooking Barcelona
∗Corresponding author
Email addresses: [email protected] (Nikola Rajovic), [email protected] (Alejandro Rico), [email protected] (Nikola Puzovic), [email protected] (Chris Adeniyi-Jones), [email protected] (Alex Ramirez)

Preprint of the article accepted for publication in Future Generation Computer Systems, Elsevier

1. Introduction

In High Performance Computing (HPC), there is a continued need for higher computational performance. Scientific grand challenges, e.g. in engineering, geophysics and bioinformatics, as well as other types of compute-intensive applications, require increasing amounts of compute power. On the other hand, energy is increasingly becoming one of the most expensive resources, and it contributes substantially to the total cost of running a large supercomputing facility. In some cases, the total energy cost over a few years of operation can exceed the cost of acquiring the hardware infrastructure [1, 2, 3].

This trend is not limited to HPC systems; it also holds true for data centres in general. Energy efficiency is already a primary concern in the design of any computer system, and it is unanimously recognized that reaching the next milestone in supercomputer performance, i.e. one EFLOPS (exaFLOPS, 10^18 floating-point operations per second), will be strongly constrained by power. The energy efficiency of a system will define the maximum achievable performance.

In this paper, we take a first step towards HPC systems built from the low-power solutions used in embedded and mobile devices. Using CPUs from this domain is a challenge, however: these devices are crafted neither to exploit high ILP nor to provide high memory bandwidth. Most embedded CPUs lack a vector floating-point unit, and their software ecosystem is not tuned for HPC. What makes them particularly interesting are their size and power characteristics, which allow for higher packaging density and lower cost. In the following three subsections we further motivate our proposal from several important aspects.

1.1. Road to Exascale

To illustrate our point about the need for low-power processors, let us reverse engineer a theoretical Exaflop supercomputer that has a power budget of 20 MW [4]. We will build our system using cores of 16 GFLOPS (8 ops/cycle @ 2 GHz), assuming that single-thread performance will not improve much beyond the performance that we observe today. An Exaflop machine will require 62.5 million such cores, independently of how they are packaged together (multicore density, sockets per node). We also assume that only 30-40% of the total power will actually be spent on the cores, the rest going to power supply overhead, interconnect, storage, and memory. That leads to a power budget of 6 MW to 8 MW for 62.5 million cores, which is 0.10 W to 0.13 W per core. Current high-performance processors integrating this type of core require tens of watts at 2 GHz. However, ARM processors, designed for the embedded mobile market, consume less than 0.9 W at that frequency [5], and are thus worth exploring: even though they do not yet provide a sufficient level of performance, they have a promising roadmap ahead.
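Spelling out the arithmetic behind these figures:

\[
N_{\text{cores}} = \frac{10^{18}\,\text{FLOPS}}{16\times 10^{9}\,\text{FLOPS/core}} = 62.5\times 10^{6},
\qquad
P_{\text{core}} = \frac{(0.3\ldots 0.4)\times 20\,\text{MW}}{62.5\times 10^{6}} \approx 0.10\ldots 0.13\,\text{W}.
\]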

1.2. ARM Processors

There is already a significant trend towards using ARM processors in data servers and cloud computing environments [6, 7, 8, 9, 10]. Those workloads are constrained by the I/O and memory subsystems, not by CPU performance. Recently, ARM processors have also been taking significant steps towards increased double-precision floating-point performance, making them competitive with state-of-the-art server performance.

Previous generations of ARM application processors did not feature a floating-point unit capable of supporting the throughputs and latencies required for HPC.¹ The ARM Cortex-A9 has an optional VFPv3 floating-point unit [11] and/or a NEON single-instruction multiple-data (SIMD) floating-point unit [12]. The VFPv3 unit is pipelined and is capable of executing one double-precision ADD operation per cycle, or one MUL/FMA (Fused Multiply Accumulate) every two cycles. The NEON unit is a SIMD unit and supports only integer and single-precision floating-point operands, making it unattractive for HPC. With one double-precision floating-point arithmetic instruction per cycle (VFPv3), a 1 GHz Cortex-A9 therefore provides a peak of 1 GFLOPS. The more recent ARM Cortex-A15 [13] processor has a fully-pipelined double-precision floating-point unit, delivering 2 GFLOPS at 1 GHz (one FMA every cycle). The new ARMv8 instruction set, which is being implemented in next-generation ARM cores, namely the Cortex-A50 series [14], features a 64-bit address space and adds double precision to the NEON SIMD ISA, allowing for 4 operations per cycle per unit and leading to 4 GFLOPS at 1 GHz.

¹Cortex-A8, the processor generation prior to Cortex-A9, has a non-pipelined floating-point unit. In the best case it can deliver one floating-point ADD every ∼10 cycles; MUL and MAC have lower throughput.
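The peak figures quoted above all follow from the same relation between per-cycle floating-point throughput and clock frequency:

\[
\text{Peak} = \text{FLOPs/cycle} \times f:
\quad 1 \times 1\,\text{GHz} = 1\,\text{GFLOPS (Cortex-A9, VFPv3)},
\quad 2 \times 1\,\text{GHz} = 2\,\text{GFLOPS (Cortex-A15, pipelined FMA)},
\quad 4 \times 1\,\text{GHz} = 4\,\text{GFLOPS (ARMv8 double-precision NEON)}.
\]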

1.3. Bell’s Law

Our approach to an HPC system is novel because we argue for the use of mobile cores. We consider the improvements expected in mobile SoCs in the near future that would make them real candidates for HPC. As Bell's law states [15], a new computer class is usually based on lower-cost components, which continue to evolve at a roughly constant price but with increasing performance from Moore's law. This trend holds today: the class of computing systems on the rise in HPC today is that of systems with large numbers of closely-coupled small cores (BlueGene/Q and Xeon Phi systems). From the architectural point of view, our proposal fits into this computing class, and it has the potential for performance growth given the size and evolution of the mobile market.

1.4. Contributions

In this paper, we present Tibidabo, an experimental HPC cluster that we built using NVIDIA Tegra2 chips, each featuring a performance-optimized dual-core ARM Cortex-A9 processor. We use the PCIe support in Tegra2 to connect a 1 GbE NIC, and build a tree interconnect with 48-port 1 GbE switches.

We do not intend our first prototype to achieve an energy efficiency competitive with today's leaders. The purpose of this prototype is to be a proof of concept to demonstrate that building such energy-efficient clusters with mobile processors is possible, and to learn from the experience. On the software side, the goal is to deploy an HPC-ready software stack for ARM-based systems, and to serve as an early application development and tuning vehicle.

Detailed analysis of performance and power distribution points to a major problem when building HPC systems from low-power parts: the system integration glue takes more power than the microprocessor cores themselves. The main building block of our cluster, the Q7 board, is designed with embedded and mobile software development in mind, and is not particularly optimized for energy-efficient operation. Nevertheless, the energy efficiency of our cluster is 120 MFLOPS/W, still competitive with Intel Xeon X5660 and AMD Opteron 6128 based clusters,² but much lower than what could be anticipated from the performance and power figures of the Cortex-A9 processor.

²In the November 2012 edition of the Green500 list these systems are ranked 395th and 396th, respectively.

We use our performance analysis to model and simulate a potential HPC cluster built from ARM Cortex-A9 and Cortex-A15 chips with higher multicore density (number of cores per chip) and higher-bandwidth interconnects, and conclude that such a system would deliver competitive energy efficiency. The work presented here and the lessons we learned are a first step towards such a system, which will be built with the next generation of ARM cores implementing the ARMv8 architecture.

The contributions of this paper are:

• The design of the first HPC ARM-based cluster architecture, with a complete performance evaluation, energy efficiency evaluation, and comparison with state-of-the-art high-performance architectures.

• A power distribution estimation of our ARM cluster.

• Model-based performance and energy-efficiency projections of a theoretical HPC cluster with a higher multicore density and higher-performance ARM cores.

• Technology challenges and design guidelines based on our experience to make ARM-based clusters a competitive alternative for HPC.

The rest of this paper is organized as follows. Section 2 gives an architectural overview of our cluster prototype. We introduce our compute node together with a description of the compute SoC. In Section 3 we benchmark a single node in terms of computational power as well as energy efficiency, compared to a laptop Intel Core i7 platform. We also present performance, scalability and energy efficiency results for a set of HPC applications executing on our prototype. Section 4 deals with performance and energy efficiency projections for a theoretical ARM-based system including the desired features for HPC identified in the process of building our prototype. In Section 5 we provide an explanation of the technology challenges encountered while building our prototype and give a set of design guidelines for a future ARM-based HPC system. Other systems that are built with energy efficiency in mind are surveyed in Section 6. We conclude our paper in Section 7.

2. ARM Cluster Architecture

Figure 1: Components of the Tibidabo system: (a) Q7 module, (b) Q7 carrier board, (c) blade with 8 boards, (d) Tibidabo rack.

The compute chip in the Tibidabo cluster is the Nvidia Tegra2 SoC, with a dual-core ARM Cortex-A9 running at 1 GHz, implemented in TSMC's 40 nm LPG performance-optimized process. Tegra2 features a number of application-specific accelerators targeted at the mobile market, such as video and audio encoders/decoders and an image signal processor, but none of these can be used for general-purpose computation; they only contribute SoC area overhead. The GPU in Tegra2 does not support general programming models such as CUDA or OpenCL, so it cannot be used for HPC computation either. However, more advanced GPUs do support these programming models, and a variety of HPC systems use them to accelerate certain kinds of workloads.

Tegra2 is the central part of the Q7 module [16] (see Figure 1(a)). The module also integrates 1 GB of DDR2-667 memory, 16 GB of eMMC storage and a 100 MbE NIC (connected to Tegra2 through USB), and exposes PCIe connectivity to the carrier board. Using Q7 modules allows for an easy upgrade when next-generation SoCs become available, and reduces the cost of replacement in case of failure.

Each Tibidabo node is built using a Q7-compliant carrier board [17] (see Figure 1(b)). Each board hosts one Q7 module, integrates a 1 GbE NIC (connected to Tegra2 through PCIe) and a µSD card adapter, and exposes other connectors and related circuitry that are not required for our HPC cluster but are required for embedded software/hardware development (RS232, HDMI, USB, SATA, embedded keyboard controller, compass controller, etc.).

These boards are organized into blades (see Figure 1(c)); each blade hosts 8 nodes and a shared Power Supply Unit (PSU). In total, Tibidabo has 128 nodes and occupies 42 U of standard rack space: 32 U for compute blades, 4 U for interconnect switches and 2 U for the file server.

The interconnect has a tree topology and is built from 1 GbE 48-port switches, with 1 to 8 Gb/s link bandwidth aggregation between switches. Each node in the network is reachable within three hops.

The Linux kernel version 2.6.32.2 and a single Ubuntu 10.10 filesystem image are hosted on an NFSv4 server with 1 Gb/s of bandwidth. Each node has its own local scratch storage on a 16 GB µSD Class 4 memory card. Tibidabo relies on MPICH2 v1.4.1 as its MPI library; at the time of writing this was the only MPI distribution that worked reliably with the SLURM job manager on our cluster.

We use ATLAS 3.9.51 [18] as our linear algebra library. This library is chosen due to the lack of a hand-optimized algebra library for our platform and its ability to auto-tune to the underlying architecture. Applications that need an FFT library rely on FFTW v3.3.1 [19] for the same reasons.

3. Evaluation

In this section we present a performance and power evaluation of Tibidabo in two phases: first for a single compute chip in a node, and then for the whole cluster. We also provide a breakdown of single-node power consumption to understand the potential sources of inefficiency for HPC.

3.1. Methodology

For the measurement of energy efficiency (MFLOPS/W) and energy-to-solution (Joules) in single-core benchmarks, we used a Yokogawa WT230 power meter [20] with an effective sampling rate³ of 10 Hz, a basic precision of 0.1%, and RMS output values given as voltage/current pairs. We repeat our runs so that the acquisition interval is at least 10 minutes. The meter is connected to act as an AC supply bridge and to directly measure the power drawn from the AC line. We have developed a measurement daemon that integrates with the OS and triggers the power meter to start collecting samples when the benchmark starts and to stop when it finishes. The collected samples are then used to calculate the energy-to-solution and energy efficiency. To measure the energy efficiency of the whole cluster, the measurement daemon is integrated with the SLURM [21] job manager, and after the execution of a job, power measurement samples are included alongside the outputs of the job. In this case, the measurement point is the power distribution unit of the entire rack.
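The post-processing performed on the collected samples reduces to a simple numerical integration. The C sketch below is our own illustration of that computation, not the actual daemon; it assumes a fixed sample period matching the meter's effective 10 Hz rate, and the flop count and toy input values are hypothetical.

#include <stdio.h>

#define SAMPLE_PERIOD_S 0.1   /* effective meter output rate of 10 Hz */

/* Energy-to-solution: integrate instantaneous power (V * I) over the run. */
double energy_joules(const double *volts, const double *amps, int nsamples)
{
    double e = 0.0;
    for (int i = 0; i < nsamples; i++)
        e += volts[i] * amps[i] * SAMPLE_PERIOD_S;   /* P * dt */
    return e;
}

/* Energy efficiency in MFLOPS/W = (FLOP / runtime) / average power. */
double mflops_per_watt(double flops, double runtime_s, double energy_j)
{
    double avg_power_w = energy_j / runtime_s;
    return (flops / runtime_s) / avg_power_w / 1e6;
}

int main(void)
{
    /* Toy example: a constant 8.4 W drawn for 600 s while executing 60e9 FLOP. */
    double v[6000], a[6000];
    for (int i = 0; i < 6000; i++) { v[i] = 230.0; a[i] = 8.4 / 230.0; }
    double e = energy_joules(v, a, 6000);
    printf("energy = %.1f J, efficiency = %.1f MFLOPS/W\n",
           e, mflops_per_watt(60e9, 600.0, e));
    return 0;
}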

For single-node energy efficiency, we have measured a single Q7 board and compared the results against a power-optimized Intel Core i7 [22] laptop (Table 1), whose processor chip has a thermal design power of 35 W. Due to the different natures of the laptop and the development board, and in order to give a fair comparison in terms of energy efficiency, we measure only the power of the components that are necessary for executing the benchmarks, so all unused devices are disabled. On our Q7 board, we disable Ethernet during benchmark execution. On the Intel Core i7 platform, graphics output, sound card, touch-pad, Bluetooth, WiFi, and all USB devices are disabled, and the corresponding modules are unloaded from the kernel. The hard disk is spun down, and Ethernet is disabled during the execution of the benchmarks.

³Internal sampling frequencies are not known. This is the frequency at which the meter outputs new pairs of samples.

Table 1: Experimental platforms

                          ARM Platform                       Intel Platform
  SoC                     Tegra 2                            Intel Core i7-640M
  Architecture            ARM Cortex-A9 (ARMv7-a)            Nehalem
  Core count              Dual core                          Dual core
  Operating frequency     1 GHz                              2.8 GHz
  Cache size(s)           L1: 32 KB I, 32 KB D per core      L1: 32 KB I, 32 KB D per core
                          L2: 1 MB I/D shared                L2: 256 KB I/D per core
                                                             L3: 4 MB I/D shared
  RAM                     1 GB DDR2-667,                     8 GB DDR3-1066,
                          32-bit single channel,             64-bit dual channel,
                          2666.67 MB/s per channel           8533.33 MB/s per channel
  Compiler                GCC 4.6.2                          GCC 4.6.2
  OS                      Linux 2.6.36.2 (Ubuntu 10.10)      Linux 2.6.38.11 (Ubuntu 10.10)

Multithreading could not be disabled, but all experiments are single-threaded and we set their logical core affinity in all cases. On both platforms, benchmarks are compiled at the -O3 optimization level using the GCC 4.6.2 compiler.

3.2. Single node performance

We start with the evaluation of the performance and energy efficiency of a single node in our cluster, in order to have a meaningful comparison to other state-of-the-art compute node architectures.

In Figure 2 we evaluate the performance of the Cortex-A9 double-precision floating-point pipeline using in-house developed microbenchmarks. These benchmarks perform dense double-precision floating-point computation with accumulation on arrays of a given size (an input parameter), stressing the FPADD and FPMA instructions in a loop. We exploit data reuse by executing the same instruction multiple times on the same elements within one loop iteration. This way we reduce loop condition testing overheads and keep the floating-point pipeline as utilized as possible. The purpose is to evaluate whether the ARM Cortex-A9 pipeline is capable of achieving the peak performance of 1 FLOP per cycle. Our results show that the Cortex-A9 core achieves the theoretical peak double-precision floating-point performance when the microbenchmark working set fits in the L1 cache (32 KB).
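As an illustration, the C sketch below shows the structure of such a microbenchmark. It is our own reconstruction from the description above, not the authors' code; the array size is the input parameter, while the reuse factor and repetition count are assumed values.

#include <stdio.h>
#include <stdlib.h>

#define REUSE 8  /* times each element is reused per iteration (assumed value) */

/* Dense double-precision accumulation stressing FP add: each array element
 * is used REUSE times per loop iteration to keep the FP pipeline busy. */
double fpadd_kernel(const double *a, size_t n, int reps)
{
    double acc = 0.0;
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++) {
            double x = a[i];
            for (int k = 0; k < REUSE; k++)   /* data reuse within the iteration */
                acc += x;
        }
    return acc;
}

/* Same idea for multiply-accumulate (acc += x * c). */
double fpma_kernel(const double *a, size_t n, int reps, double c)
{
    double acc = 0.0;
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            for (int k = 0; k < REUSE; k++)
                acc += a[i] * c;
    return acc;
}

int main(int argc, char **argv)
{
    size_t n = (argc > 1) ? strtoul(argv[1], NULL, 10) : 4096;  /* working-set size */
    double *a = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++) a[i] = 1.0;
    /* Print the results so the compiler cannot optimize the kernels away. */
    printf("%f %f\n", fpadd_kernel(a, n, 100), fpma_kernel(a, n, 100, 1.000001));
    free(a);
    return 0;
}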

Next, in Table 2 we show the evaluation of CPU performance using the Dhrystone benchmark [23] and the SPEC CPU2006 benchmark suite [24]. Dhrystone is not an HPC benchmark, but it is a known reference for which there are official ARM Cortex-A9 results [5]. The purpose of this test is to check whether our platform achieves the reported performance. We have also evaluated the benchmark on a laptop Intel Core i7 processor to establish a basis for comparison in terms of performance and energy-to-solution. We use the same working set size on both platforms.

Figure 2: Performance of the double-precision microbenchmarks (FPADD and FPMA) on the Cortex-A9, for problem sizes ranging from 4K to 100M, with the L1 and L2 cache capacities marked. Peak performance is 1 GFLOPS at 1 GHz.

Our results show that the Tegra2 Cortex-A9 achieves the expected peak DMIPS. Our results also show that the Cortex-A9 is 7.8x slower than the Core i7. If we factor in the frequency difference, the Cortex-A9 has 2.79x lower performance per MHz.

The SPEC CPU2006 benchmarks cover a wide range of CPU-intensive workloads. Table 2 also shows the performance and energy-to-solution normalized to the Cortex-A9 and averaged across all the benchmarks in the CINT2006 (integer) and CFP2006 (floating-point) subsets of the SPEC CPU2006 suite. The Cortex-A9 is over 9x slower in both subsets. The per-benchmark results for these experiments can be found in a previous paper [25].

We also evaluate the effective memory bandwidth using the STREAM benchmark [26] (Table 3). In this case, the comparison is not just a core architecture comparison, because bandwidth depends mainly on the memory subsystem. However, bandwidth efficiency, which shows the achieved bandwidth out of the theoretical peak, shows to what extent the core, cache hierarchy and on-chip memory controller are able to exploit the off-chip memory bandwidth. We use the largest working set size that fits in the Q7 module memory on both platforms. Our results show that the DDR2-667 memory in our Q7 modules delivers a memory bandwidth of 1348 MB/s for copy and 720 MB/s for add; the Cortex-A9 chip thus achieves 51% and 27% bandwidth efficiency, respectively. Meanwhile, the DDR3-1066 in the Core i7 delivers around 7000 MB/s for both copy and add, which is 41% bandwidth efficiency considering the two memory channels available.
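In other words, bandwidth efficiency is simply the measured STREAM bandwidth divided by the theoretical peak of the memory interface; with the numbers above,

\[
\eta_{\text{copy}} = \frac{1348}{2666} \approx 51\%,\qquad
\eta_{\text{add}} = \frac{720}{2666} \approx 27\% \quad\text{(Cortex-A9)},\qquad
\eta \approx \frac{7000}{17066} \approx 41\% \quad\text{(Core i7, two channels)}.
\]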

Table 2: Dhrystone and SPEC CPU2006: Intel Core i7 and ARM Cortex-A9 performance and energy-to-solution comparison. SPEC CPU2006 results are normalized to Cortex-A9 and averaged across all benchmarks in the CINT2006 and CFP2006 subsets of the suite.

                     Dhrystone                                     CINT2006           CFP2006
  Platform           perf (DMIPS)   energy (J)   energy (norm)     perf    energy     perf     energy
  Intel Core i7      19246          116.8        1.056             9.071   1.185      9.4735   1.172
  ARM Cortex-A9      2466           110.8        1.0               1.0     1.0        1.0      1.0

Table 3: STREAM: Intel Core i7 and ARM Cortex-A9 memory bandwidth and bandwidth efficiency comparison.

                     Peak mem.      STREAM perf (MB/s)              energy (avg.)      efficiency (%)
  Platform           BW (MB/s)      copy   scale   add    triad    abs (J)   norm     copy    add
  Intel Core i7      17066          6912   6898    7005   6937     481.5     1.059    40.5    41.0
  ARM Cortex-A9      2666           1348   1321    720    662      454.8     1.0      50.6    27.0

Tables 2 and 3 also present the energy-to-solution required by each platform for the Dhrystone, SPEC CPU2006 and STREAM benchmarks. Energy-to-solution is shown as an absolute value (for Dhrystone and STREAM) and normalized to the Cortex-A9 platform. While it is true that the ARM Cortex-A9 platform draws much less power than the Core i7, it also requires a longer runtime, which results in a similar energy consumption; the Cortex-A9 platform is between 5% and 18% better. Given that the Core i7 platform is faster, that makes it superior in other metrics such as energy-delay. We analyze the sources of energy inefficiency behind these results in Section 3.4, and evaluate potential energy-efficiency improvements for ARM-based platforms in Section 4.

3.3. Cluster performance

Our single-node performance evaluation shows that the Cortex-A9 is ∼9 times slower than the Core i7 at their maximum operating frequencies, which means that our applications need to exploit a minimum of 9 parallel processors in order to achieve a competitive time-to-solution. More processing cores in the system mean a greater need for scalability. In this section we evaluate the performance, energy efficiency and scalability of the whole Tibidabo cluster.

Figure 3 shows the parallel speed-up achieved by the High-Performance Linpack benchmark (HPL) [27] and several other HPC applications. Following common practice, we perform a weak scalability test for HPL and a strong scalability test for the rest.⁴ We have considered several widely used MPI applications: GROMACS [28], a versatile package to perform molecular dynamics simulations; SPECFEM3D GLOBE [29], which simulates continental- and regional-scale seismic wave propagation; HYDRO, a 2D Eulerian code for hydrodynamics; and PEPC [30], an application that computes long-range Coulomb forces for a set of charged particles. All applications are compiled and executed out of the box, without any hand tuning of the respective source codes.

If an application could not execute on a single node due to large memory requirements, we calculated the speed-up with respect to the smallest number of nodes that can handle the problem. For example, PEPC with a reference input set requires at least 24 nodes, so we plot the results assuming that on 24 nodes the speed-up is 24.
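One way to state this convention: if n_0 is the smallest number of nodes able to hold the problem and T(n) is the time-to-solution on n nodes, the plotted speed-up is

\[
S(n) = n_0 \times \frac{T(n_0)}{T(n)},
\]

so that S(n_0) = n_0 by construction (e.g., S(24) = 24 for PEPC).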

We have executed SPECFEM3D and HYDRO with an input set that is able to fit into the memory of a single node, and they show good strong scaling up to the maximum available number of nodes in the cluster. In order to achieve good strong scaling with GROMACS, we have used two input sets, both of which can fit into the memory of two nodes. We have observed that the scaling of GROMACS improves when the input set size is increased. PEPC does not show optimal scalability because the input set that we can fit on our cluster is too small to show the strong scalability properties of the application [30].

HPL shows good weak scaling. In addition to HPL performance, we also measure power consumption, so that we can derive the MFLOPS/W metric used to rank HPC systems in the Green500 list. Our cluster achieves 120 MFLOPS/W (97 GFLOPS on 96 nodes, i.e. 51% HPL efficiency), competitive with AMD Opteron 6128 and Intel Xeon X5660-based clusters, but 19x lower than the most efficient GPU-accelerated systems, and 21x lower than Intel Xeon Phi (November 2012 Green500 #1). The reasons for the low HPL efficiency include the lack of architecture-specific tuning of the algebra library, and the lack of optimization in the MPI communication stack for ARM cores using Ethernet.

⁴Weak scalability refers to the capability of solving a larger problem size in the same amount of time using a larger number of nodes (the problem size is limited by the available memory in the system). Strong scalability, on the other hand, refers to the capability of solving a fixed problem size in less time while increasing the number of nodes.
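The 51% HPL efficiency and the 120 MFLOPS/W quoted above are mutually consistent, given the 1 GFLOPS peak per Cortex-A9 core and the average cluster power of roughly 808 W under HPL reported in Section 4.1:

\[
\frac{97\,\text{GFLOPS}}{96\ \text{nodes} \times 2\ \text{cores} \times 1\,\text{GFLOPS}} \approx 51\%,
\qquad
\frac{97{,}000\,\text{MFLOPS}}{\approx 808\,\text{W}} \approx 120\,\text{MFLOPS/W}.
\]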

Figure 3: Scalability of HPC applications on Tibidabo: speed-up versus number of nodes (4 to 96) for HP Linpack, PEPC, HYDRO, GROMACS (small and big inputs) and SPECFEM3D, compared against ideal scaling.

3.4. Single node power consumption breakdown

In this section we analyze the power of the various components of a compute node. The purpose is to identify the potential causes of inefficiency, on the hardware side, that led to the results in the previous section.

We were unable to take direct power measurements of the individual components on the Q7 card and the carrier board, so we take them from the provider specifications of each component. The CPU core power consumption is taken from the ARM website [31]. For the L2 cache power estimate, we use power models of the Cortex-A9's L2 implemented in 40 nm and account for long inactivity periods due to the 99% L1 cache hit rate of HPL (as observed through performance counter reads). The power consumption of the NICs is taken from the respective datasheets [32, 33]. For the DDR2 memory, we use Micron's spreadsheet tool [34] to estimate power consumption based on parameters such as bandwidth, memory interface width, and voltage.

Figure 4: Power consumption breakdown of the main components of a compute node while executing HPL: Core1 0.26 W, Core2 0.26 W, L2 cache 0.1 W, Memory 0.7 W, Eth1 0.9 W, Eth2 0.5 W, Other 5.68 W. The compute node power consumption while executing HPL is 8.4 W; this power is computed by measuring the total cluster power and dividing it by the number of nodes.

Figure 4 shows the average power breakdown of the major components of a compute node, as a share of the total compute node power, during an HPL run on the entire cluster. As can be seen, the total measured power of the compute node is significantly higher than the sum of the major parts. Other on-chip and on-board peripherals in the compute node are not used for computation, so they are assumed to be shut off when idle. However, the large unaccounted-for part (labeled as Other) amounts to more than 67% of the total power. That part of the power includes on-board low-dropout (LDO) voltage regulators, on-board multimedia devices with related circuitry, the corresponding share of the blade PSU losses, and on-chip power sinks. Figure 5 shows the Tegra2 chip die. The white outlined area marks the chip components that are used by HPC applications; this area is less than 35% of the total chip area. If the rest of the chip area is not properly power and clock gated, it leaks power even though it is not being used, thus also contributing to the Other part of the compute node power.

Figure 5: Tegra2 die. The area marked with the white border line contains the components actually used by HPC applications; it represents less than 35% of the total chip area. Source: www.anandtech.com

Although the estimations in this section are not exact, we actually overestimate the power consumption of some of the major components when taking the power from the multiple data sources. Therefore, our analysis shows that at most 16% of the power goes to the computation components: the cores (including the on-chip cache-coherent interconnect and L2 cache controller), the L2 cache, and the memory. The remaining 84% or more is then the overhead (or system glue) needed to interconnect those computation components with the other computation components in the system. The reasons for this significant power overhead are the small size of the compute chip (two cores) and the use of development boards targeted at embedded and mobile software development.
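Putting the Figure 4 numbers together makes both percentages explicit:

\[
P_{\text{accounted}} = 2\times 0.26 + 0.1 + 0.7 + 0.9 + 0.5 = 2.72\,\text{W},\qquad
P_{\text{Other}} = 8.4 - 2.72 = 5.68\,\text{W} \approx 68\%\ \text{of}\ 8.4\,\text{W},
\]
\[
P_{\text{compute}} = 2\times 0.26 + 0.1 + 0.7 = 1.32\,\text{W} \approx 16\%\ \text{of}\ 8.4\,\text{W}.
\]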

The conclusions from this analysis are twofold. HPC-ready carrier boards should be stripped of unnecessary peripherals to reduce area, cost and potential power sinks. At the same time, the computation chips should include a larger number of cores: fewer boards are then necessary to integrate the same number of cores, and the power overhead of a single compute chip is distributed among a larger number of cores. This way, the power overhead should not be a dominant part of the total power but just a small fraction.

4. Performance and energy efficiency projections

In this section we project what the performance and power consumption of our cluster would be if we could have set up an HPC-targeted system using the same low-power components. One of the limitations seen in the previous section is that having only two cores per chip leads to significant overheads to glue them together into a large system with a large number of cores. Also, although the Cortex-A9 is the leader in mobile computing for its high performance, it trades off some performance for power savings to improve battery life. The Cortex-A15 is the highest-performing processor in the ARM family and includes features more suitable for HPC. Therefore, in this section we evaluate cluster configurations with higher multicore density (more cores per chip) and we also project what the performance and energy efficiency would be if we used Cortex-A15 cores instead. To complete the study, we evaluate multiple frequency operating points to show how frequency affects performance and energy efficiency.

For our projections, we use an analytical power model and the DIMEMAS cluster simulator [35]. DIMEMAS performs high-level simulation of the execution of MPI applications on cluster systems. It uses a high-level model of the compute nodes, modeled as symmetric multi-processing (SMP) nodes, to predict the execution time of computation phases. At the same time, it simulates the cluster interconnect to account for MPI communication delays. The interconnect and computation node models accept configuration parameters such as interconnect bandwidth, interconnect latency, number of links, number of cores per computation node, core performance ratio, and memory bandwidth. DIMEMAS has been used to model the interconnect of the MareNostrum supercomputer with an accuracy within 5% [36], and its MPI communication model has been validated, showing an error below 10% for the NAS benchmarks [37].

The input to our simulations is a trace obtained from an HPL execution on Tibidabo. As an example, the PARAVER [38] visualizations of the input and output traces of a DIMEMAS simulation are shown in Figure 6. The chart shows the activity of the application threads (vertical axis) over time (horizontal axis). Figure 6(a) shows the visualization of the original execution on Tibidabo, and Figure 6(b) shows the visualization of the DIMEMAS simulation using a configuration that mimics the characteristics of our machine (including the interconnect characteristics) except for the CPU speed, which is, as an example, 4 times faster. As can be observed in the real execution, threads do not all start communication at the same time, and thus computation in some threads overlaps with communication in others. In the DIMEMAS simulation, where CPU speed is increased 4 times, computation phases (in grey) become shorter and all communication phases get closer in time. However, the application shows the same communication pattern, and communications take a similar time to that in the original execution. Due to the computation-bound nature of HPL, the resulting total execution time is largely shortened. However, the speedup is not close to 4x, as it is limited by communications, which are properly simulated to match the behavior of the interconnect in the real machine.

Figure 6: An example of a DIMEMAS simulation where each row presents the activity of a single processor: it is either in a computation phase (grey) or in MPI communication (black). (a) Part of the original HPL execution on Tibidabo; (b) DIMEMAS simulation with hypothetical 4x faster computation cores.

4.1. Cluster energy efficiency

Table 4 shows the parameters used to estimate the performance and energy efficiency of multiple cluster configurations. For this analysis, we use HPL to measure the performance ratio among the different core configurations. The reasons for using HPL are twofold: it makes heavy use of floating-point operations, as most HPC applications do, and it is the reference benchmark used to rank HPC machines on both the Top500 and Green500 lists.

To project the performance scaling of HPL for Cortex-A9 designs clocked at frequencies over 1 GHz, we execute HPL on one of our Tibidabo nodes using two cores at multiple frequencies from 456 MHz to 1 GHz. Then, we fit a polynomial trend line to the performance points to project the performance degradation beyond 1 GHz. Figure 7(a) shows a performance degradation below 10% at 1 GHz compared to perfect scaling from 456 MHz. The polynomial trend line projections to 1.4 GHz and 2.0 GHz show 14% and 25% performance loss over perfect scaling from 456 MHz, respectively. A polynomial trend line seems somewhat pessimistic if there are no fundamental architectural limitations, so we can use these projections as a lower bound on the performance of those configurations.

Table 4: Core architecture, performance and power model parameters, and results for performance and energy efficiency of clusters with 16 cores per node.

  Configuration #                1            2            3            4             5
  CPU input parameters
    Core architecture            Cortex-A9    Cortex-A9    Cortex-A9    Cortex-A15    Cortex-A15
    Frequency (GHz)              1.0          1.4          2.0          1.0           2.0
    Performance over A9@1GHz     1.0          1.2          1.5          2.6           4.5
    Power over A9@1GHz           1.0          1.1          1.8          1.54          3.8
  Per-node power figures for the 16-cores-per-chip configuration [W]
    CPU cores                    4.16         4.58         7.49         7.64          18.85
    L2 cache                     0.8          0.88         1.44         (integrated with cores)
    Memory                       5.6          5.6          5.6          5.6           5.6
    Ethernet NICs                1.4          1.4          1.4          1.4           1.4
  Aggregate power figures [W]
    Per node                     17.66        18.16        21.63        20.34         31.55
    Total cluster                211.92       217.87       259.54       244.06        378.58

For the performance of the Cortex-A15 configurations we perform the same experiment on a dual-core Cortex-A15 test chip clocked at 1 GHz [39]. HPL performance on the Cortex-A15 is 2.6 times higher than on our Cortex-A9 Tegra2 boards. To project the performance over 1 GHz we run HPL at frequencies ranging between 500 MHz and 1 GHz and fit a polynomial trend line to the results. The performance degradation at 2 GHz, compared to perfect scaling from 500 MHz, is projected to be 14%, so the performance ratio over the Cortex-A9 at 1 GHz is 4.5x. We note that these performance ratios of the Cortex-A15 over the Cortex-A9 are for HPL, which makes heavy use of floating-point code. The performance ratios of the Cortex-A15 over the Cortex-A9 for integer code are typically 1.5x at 1 GHz and 2.9x at 2 GHz (both compared to the Cortex-A9 at 1 GHz). This shows how, for a single compute node, the Cortex-A15 is better suited for HPC double-precision floating-point computation.
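For reference, the projected 4.5x ratio is consistent with scaling the measured 2.6x ratio at 1 GHz by the frequency increase and applying the projected 14% loss with respect to perfect scaling:

\[
2.6 \times \frac{2\,\text{GHz}}{1\,\text{GHz}} \times (1 - 0.14) \approx 4.5.
\]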

For the power projections at different clock frequencies, we use a power model for the Cortex-A9 based on 40 nm technology, as this is what many Cortex-A9 products use today, and for the Cortex-A15 based on 28 nm technology, as this is the process that will be used for most products produced in 2013. The power consumption in both cases is normalized to the power of a Cortex-A9 running at 1 GHz. Then, we introduce these power ratios into our analytical model to project the power consumption and energy efficiency of the different cluster configurations. In all our simulations, we assume the same total number of cores as in Tibidabo (192) and we vary the number of cores in each compute node.

Figure 7: HPL performance at multiple operating frequencies and projection to frequencies over 1 GHz: (a) dual-core Cortex-A9, measured from 456 MHz to 1 GHz and projected to 1.4 GHz and 2 GHz; (b) dual-core Cortex-A15, measured from 500 MHz to 1 GHz and projected to 2 GHz. Each panel shows performance relative to 1 GHz, together with perfect-scaling, linear-fit and polynomial-fit curves.

When we increase the number of cores per compute node, the number of nodes is reduced, thus reducing the integration overhead and the pressure on the interconnect (i.e. fewer boards, cables and switches). To model this effect, our analytical model is as follows:

From the power breakdown of a single node presented in Figure 4, we subtract the power corresponding to the CPUs and the memory subsystem (L2 + memory). The remaining power in the compute node is considered to be board overhead, and does not change with the number of cores. The board overhead is part of the power of a single node, to which we add the power of the cores, L2 cache and memory. For each configuration, the CPU core power is multiplied by the number of cores per node. As in Tibidabo, our projected cluster configurations are assumed to have 0.5 MB of L2 cache per core and 500 MB of RAM per core; this assumption allows for simple scaling to large numbers of cores. Therefore, the L2 cache power (0.1 W/MB) and the memory power (0.7 W/GB) are both multiplied by half the number of cores. The L2 cache power for the Cortex-A9 configurations is also scaled with frequency, for which we use the core power ratio. The L2 in the Cortex-A15 is part of the core macro, so the core power already includes the L2 power.

For both the Cortex-A9 and the Cortex-A15, the CPU macro power includes the L1 caches, the cache coherence unit and the L2 controller. Therefore, the increase in power due to a more complex L2 controller and cache coherence unit for a larger multicore is accounted for when that power is factored by the number of cores. The memory power is overestimated, so the increased power due to the increased complexity of a memory controller that scales to a higher number of cores is also accounted for, for the same reason. Furthermore, a Cortex-A9 system cannot address more than 4 GB of memory, so, strictly speaking, Cortex-A9 systems with more than 4 GB are not realistic. However, we include configurations with higher core counts per chip to show what the performance and energy efficiency would be if the Cortex-A9 included large physical address extensions, as the Cortex-A15 does, to address up to 1 TB of memory [40].

The power model is summarized in these equations:

\[
P_{pred} = \frac{n_{tc}}{n_{cpc}} \times \left( \frac{P_{over}}{n_{nin}} + P_{eth} + n_{cpc} \times \left( \frac{P_{mem}}{2} + pr \times \left( P_{A9_{1G}} + \frac{P_{L2\$}}{2} \right) \right) \right) \qquad (1)
\]

\[
P_{over} = P_{tot} - n_{nin} \times \left( P_{mem} + 2 \times P_{A9_{1G}} + P_{L2\$} + P_{eth} \right) \qquad (2)
\]

where P_pred represents the projected power of the simulated cluster. n_tc = 192 and n_nin = 96 are constants representing the total number of cores and the total number of nodes in Tibidabo, respectively. n_cpc is the number of cores per chip. P_over represents the total Tibidabo cluster power overhead (evaluated in Equation 2). pr is the power ratio derived from the core power models, normalized to the Cortex-A9 at 1 GHz. P_A9_1G, P_mem, P_L2$ and P_eth are constants defining, for Tibidabo, the power consumption of one core and the per-node memory, L2 cache and Ethernet power consumption. P_tot = 808 W is the average power consumption of Tibidabo while running HPL.

In our experiments, the total number of cores remains constant and is the same as in the Tibidabo cluster (n_tc = 192). We explore the total number of cores per chip (n_cpc) that, having one chip per node, determines the total number of nodes of the evaluated system. Table 4 shows the resulting total cluster power of the multiple configurations using 16 cores per chip and a breakdown of the power for the major components.
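As a cross-check of Equations (1) and (2), the following C sketch (our own, not part of the original study) evaluates the model for the five configurations of Table 4 with 16 cores per chip, using the constants from Figure 4 and the text (P_A9_1G = 0.26 W, P_mem = 0.7 W per node, P_L2$ = 0.1 W per node, P_eth = 1.4 W, P_tot = 808 W). It reproduces the per-node and total-cluster figures of Table 4 to within rounding.

#include <stdio.h>

/* Constants taken from the paper (Tibidabo, per node unless noted). */
#define N_TC   192      /* total number of cores                   */
#define N_NIN  96       /* total number of nodes in Tibidabo       */
#define P_A9   0.26     /* one Cortex-A9 core at 1 GHz [W]         */
#define P_MEM  0.70     /* memory per Tibidabo node (1 GB) [W]     */
#define P_L2   0.10     /* L2 cache per Tibidabo node (1 MB) [W]   */
#define P_ETH  1.40     /* Ethernet NICs per node [W]              */
#define P_TOT  808.0    /* average Tibidabo power under HPL [W]    */

int main(void)
{
    /* Equation (2): cluster-wide integration overhead */
    double p_over = P_TOT - N_NIN * (P_MEM + 2 * P_A9 + P_L2 + P_ETH);

    const char *name[] = { "A9@1GHz", "A9@1.4GHz", "A9@2GHz", "A15@1GHz", "A15@2GHz" };
    const double pr[]  = { 1.0, 1.1, 1.8, 1.54, 3.8 };   /* power ratio over A9@1GHz */
    int ncpc = 16;                                       /* cores per chip (one chip per node) */

    for (int i = 0; i < 5; i++) {
        /* Equation (1): projected per-node and total cluster power */
        double per_node = p_over / N_NIN + P_ETH
                        + ncpc * (P_MEM / 2 + pr[i] * (P_A9 + P_L2 / 2));
        double cluster  = (double)N_TC / ncpc * per_node;
        printf("%-10s per node: %6.2f W   cluster: %7.2f W\n",
               name[i], per_node, cluster);
    }
    return 0;
}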

For the performance projections of the multiple cluster configurations, we provide DIMEMAS with a CPU core performance ratio for each configuration and a varying number of processors per node. DIMEMAS produces a simulation of how the same 192-core⁵ application will behave based on the new core performance and multicore density, accounting for synchronization and communication delays. Figure 8 shows the results. In all simulations we keep a network bandwidth of 1 Gb/s (1 GbE) and a memory bandwidth of 1400 MB/s (from the peak bandwidth results using STREAM).

⁵Out of 128 nodes with a total of 256 processors, 4 nodes are used as login nodes and 28 are unstable. There are two major identified sources of instability: cooling issues and problems with the PCIe driver, which drops the network connection on the problematic nodes.

The results show that, as we increase the number of cores per node (at the same time reducing the total number of nodes), performance does not show further degradation with the 1 GbE interconnect until we reach the performance level of the Cortex-A15. None of the Cortex-A15 configurations reaches its maximum speed-up, due to interconnect limitations. The configuration with two Cortex-A15 cores at 1 GHz scales worst because the interconnect is the same as in Tibidabo. With a higher number of cores, we reach 96% of the ideal speed-up for the Cortex-A15 at 1 GHz. The further performance increase with the Cortex-A15 at 2 GHz exposes additional limitations due to interconnect communication, reaching 82% of the ideal speed-up with two cores and 91% with sixteen.

Figure 8: Projected speed-up for the evaluated cluster configurations (Cortex-A9 at 1, 1.4 and 2 GHz and Cortex-A15 at 1 and 2 GHz, with 2, 4, 8 and 16 cores per chip). The total number of MPI processes is constant across all experiments.

Increasing computation density potentially improves MPI communication, because more processes communicate on chip rather than over the network, and memory bandwidth is larger than interconnect bandwidth. Setting up a machine larger than Tibidabo, with faster mobile cores and a higher core count, will require a faster interconnect. In Section 4.2 we explore the interconnect requirements when using faster mobile cores.

Figure 9: Projected energy efficiency (MFLOPS/W) as a function of the number of cores per node (2, 4, 8 and 16), for Cortex-A9 at 1, 1.4 and 2 GHz and Cortex-A15 at 1 and 2 GHz.

The benefit of increased computation density (more cores per node) is actually the reduction of the integration overhead and the resulting improved energy efficiency of the system (Figure 9). The results show that, by increasing the computation density with Cortex-A9 cores running at 2 GHz, we can achieve an energy efficiency of 563 MFLOPS/W using 16 cores per node (a ∼4.7x improvement). The configuration with 16 Cortex-A15 cores per node has an energy efficiency of 1004 MFLOPS/W at 1 GHz and 1046 MFLOPS/W at 2 GHz (a ∼8.7x improvement).

Using these models, we project the energy efficiency of our cluster if it used higher-performance cores and included more cores per node. However, all other features remain the same, so inefficiencies due to the use of non-optimized development boards, the lack of software optimization, and the lack of vector double-precision floating-point execution units are accounted for in the model. Even with all these inefficiencies, our projections show that such a cluster would be competitive in terms of energy efficiency with the Sandy Bridge and GPU-accelerated systems in the Green500 list, which shows promise for future ARM-based platforms actually optimized for HPC.

Figure 10: Interconnection network impact for the cluster configuration with 16 cores per node: (a) sensitivity to network bandwidth (0.1 to 10 Gb/s); (b) sensitivity to latency (0 to 500 µs) with 1 GbE and 10 GbE networks. Both panels show relative performance for Cortex-A9 at 1, 1.4 and 2 GHz and Cortex-A15 at 1 and 2 GHz.

4.2. Interconnection network requirements

Cluster configurations with higher-performance cores and more cores per node put higher pressure on the interconnection network. The result of increasing the node computation power while maintaining the same network bandwidth is that the interconnect bandwidth-to-flops ratio decreases. This may lead to the network becoming a bottleneck. To evaluate this effect, we carry out DIMEMAS simulations of the evaluated cluster configurations using a range of network bandwidth (Figure 10(a)) and latency values (Figure 10(b)). The baseline for these results is the cluster configuration with Cortex-A9 at 1 GHz, 1 Gb/s of bandwidth and 50 µs of latency.

The results in Figure 10(a) show that a network bandwidth of 1 Gb/s is sufficient for the evaluated cluster configurations with Cortex-A9 cores and the same size as Tibidabo. The Cortex-A9 configurations show a negligible improvement with 10 Gb/s interconnects. On the other hand, configurations with Cortex-A15 do benefit from an increased interconnect bandwidth: the 1 GHz configuration reaches its maximum at 3 Gb/s and the 2 GHz configuration at 8 Gb/s.

The latency evaluation in Figure 10(b) shows the relative performance with network bandwidths of 1 Gb/s and 10 Gb/s for a range of latencies, normalized to the 50 µs baseline. An ideal zero latency does not show a significant improvement over 50 µs, and increasing the latency by a factor of ten only has a significant impact on the Cortex-A15 at 2 GHz configuration. Therefore, the latency of Tibidabo's Ethernet network, although larger than that of the specialized and custom networks used in supercomputing, is small enough for all the evaluated cluster configurations, which have the same size as Tibidabo.

5. Lessons learned and next steps

We have described the architecture of our Tegra2-based cluster, the first attempt to build an HPC system using ARM processors. Our performance and power evaluation shows that the ARM Cortex-A9 platform is competitive with a mobile Intel Nehalem Core i7-640M platform in terms of energy efficiency for a reference benchmark like SPEC CPU2006. We also demonstrated that, even without hand tuning, HPC applications scale well on our cluster.

However, building a supercomputer out of commodity off-the-shelf low-power components is a challenging task, because achieving a design that is balanced in terms of power is difficult. As an example, the total energy dissipated in the PCB voltage regulators is comparable to, or even higher than, that spent on the CPU cores. Although the core itself provides a theoretical peak energy efficiency of 2-4 GFLOPS/W, this design imbalance results in the measured HPL energy efficiency of 120 MFLOPS/W.

In order to achieve system balance, we identified two fundamental improvements to put into practice. The first is to make use of higher-end ARM multicore chips like the Cortex-A15, which provides an architecture more suitable for HPC while maintaining comparable single-core energy efficiency. The second is to increase the compute density by adding more cores to the chip. The recently announced ARM CoreLink CCN-504 cache coherent network [41, 42] scales up to 16 cores and is targeted at high-performance architectures such as the Cortex-A15 and next-generation 64-bit ARM processors. In a system that puts these design improvements together, the CPU core power is better balanced with that of other components such as the memory. Our projections based on ARM Cortex-A15 processors with higher multicore integration density show that such systems are a promising alternative to current designs built from high-performance parts. For example, a cluster of the same size as Tibidabo, based on 16-core ARM Cortex-A15 chips at 2 GHz, would provide 1046 MFLOPS/W.

A well-known technique to improve energy efficiency is the use of SIMD units. As an example, BlueGene/Q uses 256-bit-wide vectors for quad double-precision floating-point computation, and the Intel MIC architecture uses 512-bit-wide SIMD units. Both the Cortex-A9 and the Cortex-A15 implement the ARMv7-A architecture, which only supports single-precision SIMD computation. Most HPC applications require calculations in double precision, so they cannot exploit the current ARMv7 SIMD units. The ARMv8 architecture specification includes double-precision floating-point SIMD, so further energy efficiency improvements for HPC computation are expected from next-generation ARMv8 chips featuring those SIMD units.

In all of our experiments, we ran the benchmarks out of the box and did not hand-tune any of the codes. Libraries and compilers include architecture-dependent optimizations that, in the case of ARM processors, target mobile computing. This leads to two different scenarios: the optimizations of libraries used in HPC, such as ATLAS or MPI, are one step behind for ARM processors; and optimizations in compilers, operating systems and drivers target mobile computing, thus trading off performance for quality of service or battery life. We have put together an HPC-ready software stack for Tibidabo, but we have not yet put effort into optimizing its components for HPC computation. Further energy efficiency improvements are expected when critical components such as MPI communication functions are optimized for ARM-based platforms, or when the Linux kernel is stripped of the components not used by HPC applications.

As shown in Figure 5, the Tegra2 chip includes a number of application-specific accelerators that are not programmable using industry-standard programming models such as CUDA or OpenCL. If those accelerators were programmable and usable for HPC computation, that would reduce the integration overhead of Tibidabo. The use of SIMD or SIMT programmable accelerators is widely adopted in supercomputers, such as those including general-purpose programmable GPUs (GPGPUs). Although the effective performance of GPGPUs is between 40% and 60% of their peak, their efficient compute-targeted design gives them high energy efficiency. GPUs in mobile SoCs are starting to support general-purpose programming. One example is the Samsung Exynos5 chip [43], which includes two Cortex-A15 cores and an OpenCL-compatible ARM Mali T-604 GPU [44]. This design, apart from providing the improved energy efficiency of GPGPUs, has the advantage of placing the compute accelerator close to the general-purpose cores, thus reducing data transfer latencies. Such an on-chip programmable accelerator is an attractive feature to improve energy efficiency in an HPC system built from low-power components.
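For illustration, the following minimal host program (our own sketch, assuming an OpenCL 1.1-class driver such as the one provided for the Mali-T604, with error handling omitted for brevity) offloads a vector addition to the on-chip GPU through the standard OpenCL C API:

#include <CL/cl.h>
#include <stdio.h>

static const char *kSrc =
    "__kernel void vadd(__global const float *a,"
    "                   __global const float *b,"
    "                   __global float *c) {"
    "    int i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel vadd = clCreateKernel(prog, "vadd", &err);

    /* On an integrated GPU these buffers live in the same physical memory
     * as the CPU data, which keeps transfer latencies low. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, &err);

    clSetKernelArg(vadd, 0, sizeof(cl_mem), &da);
    clSetKernelArg(vadd, 1, sizeof(cl_mem), &db);
    clSetKernelArg(vadd, 2, sizeof(cl_mem), &dc);

    size_t global = N;
    clEnqueueNDRangeKernel(q, vadd, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

    printf("c[10] = %f\n", c[10]);   /* expected: 30.0 */

    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseKernel(vadd); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}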

Another important issue to keep in mind when designing this kind of system is that the memory bandwidth-to-FLOPS ratio must be maintained. Currently available ARM-based platforms use either memory technology that lags behind top-class standards (e.g., many platforms use DDR2 memory instead of DDR3) or memory technology targeting low power (e.g., LPDDR2). For a higher-performance node with a larger number of cores and double-precision floating-point SIMD units, current memory choices in ARM platforms may not provide enough bandwidth, so higher-performance memories must be adopted. Low-power ARM-based products including DDR3 have already been announced [45], and the recently announced DMC-520 memory controller [41] enables DDR3 and DDR4 memory for ARM processors. These upcoming technologies are good news for low-power HPC computing. Moreover, package-on-package memories, which reduce the distance between the compute cores and the memory and increase pin density, can be used to include several memory controllers and provide higher memory bandwidth.
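A back-of-the-envelope calculation illustrates the concern. The figures below are illustrative assumptions rather than measurements: a hypothetical 16-core Cortex-A15 node at 2 GHz sustaining 2 double-precision FLOPS per cycle per core, fed by either a single 32-bit LPDDR2-1066 channel or dual-channel 64-bit DDR3-1600.

\begin{align*}
P_{\text{peak}} &= 16~\text{cores} \times 2~\text{GHz} \times 2~\text{FLOPS/cycle} = 64~\text{GFLOPS},\\
\text{LPDDR2-1066 (32-bit):}\quad \frac{4.3~\text{GB/s}}{64~\text{GFLOPS}} &\approx 0.07~\text{bytes/FLOP},\\
\text{dual-channel DDR3-1600 (64-bit):}\quad \frac{25.6~\text{GB/s}}{64~\text{GFLOPS}} &\approx 0.4~\text{bytes/FLOP}.
\end{align*}

Under these assumptions, moving from the mobile-class memory to dual-channel DDR3 improves the bytes-per-FLOP ratio by roughly a factor of six, which is the kind of step argued for above.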

Finally, Tibidabo employs 1 Gbit Ethernet for the cluster interconnect. Our experiments show that 1 GbE is not a performance-limiting factor for a cluster of Tibidabo's size with Cortex-A9 processors at up to 2 GHz running compute-bound codes such as HPL. However, when using faster mobile cores such as the Cortex-A15, a 1 GbE interconnect starts becoming a bottleneck. Current ARM-based mobile chips include peripherals targeted at the mobile market and thus do not provide enough bandwidth or are not compatible with faster network technologies used in supercomputing, such as 10 GbE or InfiniBand. Nevertheless, the use of 1 GbE is extensive in supercomputing (32% of the systems in the November 2012 Top500 list use 1 GbE interconnects), and potential communication bottlenecks are in many cases addressable in software [46]. Therefore, although support for a high-performance network technology would be desirable for ARM-based HPC systems, using 1 GbE may not be a limitation as long as the communication libraries are optimized for Ethernet and the communication patterns in HPC applications are tuned with the network capabilities in mind.
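One example of such a software-side measure is overlapping communication with computation, in the spirit of the hybrid approach of [46]. The sketch below is our own simplified illustration rather than code from Tibidabo: it posts a non-blocking halo exchange for a 1-D domain decomposition and updates the interior points while the messages traverse the Ethernet network; the domain size and stencil are arbitrary.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024   /* illustrative local domain size */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *u   = calloc(N, sizeof(double));
    double *tmp = calloc(N, sizeof(double));
    u[rank % N] = 1.0;                       /* arbitrary initial data */

    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;
    double halo_left = 0.0, halo_right = 0.0;
    MPI_Request req[4];

    /* Post the halo exchange first ... */
    MPI_Irecv(&halo_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&halo_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[0],       1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[N - 1],   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* ... then update interior points, which need no remote data, so the
     * Ethernet transfer proceeds in the background. */
    for (int i = 1; i < N - 1; ++i)
        tmp[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* Only the two boundary points have to wait for the network. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    tmp[0]     = 0.5 * (halo_left + u[1]);
    tmp[N - 1] = 0.5 * (u[N - 2] + halo_right);

    if (rank == 0) printf("tmp[0] = %f\n", tmp[0]);

    free(u); free(tmp);
    MPI_Finalize();
    return 0;
}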

6. Related work

One of the first attempts to use low-power commodity processors in HPC systems was Green Destiny [47]. It relied on the Transmeta TM5600 processor and, although it looked like a candidate for a top platform in energy efficiency, a large-scale HPC system was never produced. Its computing-to-space ratio was also leading at the time.

MegaProto systems [48] were another approach in this direction. They were based on more advanced versions of Transmeta's processors, namely the TM5800 and TM8820. The system achieved good energy efficiency for the time, reaching up to 100 MFLOPS/W with 512 processors. Like its predecessor, MegaProto never made it into a commercial HPC product.

Roadrunner [49] topped the Top500 list in June 2008 and was the first system to break the petaflop barrier. It uses IBM PowerXCell 8i processors [50] together with dual-core AMD Opterons. The Cell/B.E. architecture emphasizes performance per watt by prioritizing bandwidth over latency, and favours peak computation capability over simplified programmability. In the June 2008 Green500 list, Roadrunner held third place with 437.43 MFLOPS/W, behind two smaller homogeneous Cell/B.E.-based clusters.

There has been a proposal to use the Intel Atom family of processors in clusters [51]. The platform was built and tested with a range of different types of workloads, but those target data centers rather than HPC. One of the main contributions of this work is determining the types of workloads for which Intel Atom can compete in terms of energy efficiency with a commodity Intel Core i7. A follow-up of this work [52] led to the conclusion that a cluster made homogeneously of low-power nodes (Intel Atom) is not suited for complex database workloads. The authors propose future research on heterogeneous cluster architectures that combine low-power nodes with high-performance ones.

The use of low-power processors for scale-out systems was assessed in a study by Stanley-Marbell and Caparros-Cabezas [53]. They carried out a comparative study of three different low-power architecture implementations: x86-64 (Intel Atom D510MO), Power Architecture e500 (Freescale P2020RDB) and ARM Cortex-A8 (TI DM3730, BeagleBoard xM), presenting performance, power and thermal analyses. One of their findings is that a single-core Cortex-A8 platform is suitable for energy-proportional computing, meaning very low idle power. However, it lacks sufficient computing resources to exploit coarse-grained task-level parallelism and be a more energy-efficient solution than the dual-core Intel Atom platform. They also concluded that a large fraction of the platforms' power consumption (up to 67% for the Cortex-A8 platform) could not be attributed to a specific component, despite the use of sophisticated techniques such as thermal imaging.

The AppleTV cluster [54, 55] is an effort to assess the performance of the ARM Cortex-A8 processor in a cluster environment running HPL. The authors built a small cluster of four nodes based on AppleTV devices with a 100 MbE network. They achieved 160.4 MFLOPS with an energy efficiency of 16 MFLOPS/W. They also compared the memory bandwidth against a BeagleBoard xM platform and explained the performance differences in terms of the different design decisions in the memory subsystems. In our system, we employ more recent low-power core architectures and show how improved floating-point units, memory subsystems and an increased number of cores can significantly improve overall performance and energy efficiency while still maintaining a small power footprint.

The BlueGene family of supercomputers has been around since 2004 in several generations [56, 57, 58]. BlueGene systems are composed of embedded cores integrated on an ASIC together with architecture-specific interconnect fabrics. BlueGene/L, the first such system, is based on the PowerPC 440, with a theoretical peak performance of 5.6 GFLOPS per chip. BlueGene/P increased the peak performance of the compute card to 13.6 GFLOPS by using a 4-core PowerPC 450. BlueGene/Q-based clusters are among the most power-efficient HPC machines today, delivering around 2.1 GFLOPS/W. A BlueGene/Q compute chip includes sixteen 4-way SMT in-order cores, each with a 256-bit-wide quad double-precision SIMD floating-point unit, delivering a total of 204.8 GFLOPS per chip on a power budget of around 55 W (3.7 GFLOPS/W).

The most energy-efficient machine in the November 2012 Green500 list is based on the Intel Xeon Phi coprocessor. It has a design similar to BlueGene/Q: 4-way SMT in-order cores with wide SIMD units, but it integrates more cores per chip (60) and its SIMD units are 512 bits wide. The use of a more recent process technology (22 nm instead of the 45 nm of BlueGene/Q) allows this larger integration and results in an energy efficiency of 2.5 GFLOPS/W for the number-one machine in the Green500 list.

There is considerable excitement about the use of low-power ARM processors in servers. Currently, the most notable commercially available approaches are those from Boston Ltd. [10], Penguin Computing [59] and EXXACT Corporation [60]. They offer solutions based on the Calxeda ECX-1000 SoC [6] with up to 48 server nodes (192 cores) and up to 4 GB of memory per server node (192 GB in total) in a 2U enclosure. HP went one step further with Project Moonshot [9], introducing the Redstone Development Server Platform [61], which offers a compute-integration option with up to 288 Calxeda SoCs in 4U.

The Calxeda ECX-1000 SoC is built for server workloads: it is a quad-core chip with Cortex-A9 cores running at 1.4 GHz, 4 MB of ECC-protected L2 cache, a 72-bit memory controller with ECC support, five 10 Gb lanes for connecting with other SoCs, support for 1 GbE and 10 GbE, and SATA 2.0 controllers supporting up to five SATA disks. Unlike ARM-based mobile SoCs, the ECX-1000 does not carry the power overhead of unnecessary on-chip resources and thus seems better suited for energy-efficient HPC. However, to the best of our knowledge, there are neither reported energy-efficiency numbers for HPL running in a cluster environment (only single-node executions) nor scalability tests of scientific applications for any of the aforementioned enclosures.

AppliedMicro announced an ARM server platform based on their own ARMv8 SoC design, the X-Gene [8]. No enclosures or benchmark reports have been announced yet, but we expect better performance than ARMv7-based enclosures due to an improved CPU core architecture and a three-level cache hierarchy.


7. Conclusions

In this paper we presented Tibidabo, the world's first ARM-based HPC cluster, for which we set up an HPC-ready software stack to execute HPC applications widely used in scientific research, such as SPECFEM3D and GROMACS. Tibidabo was built using commodity off-the-shelf components that are not designed for HPC. Nevertheless, our prototype cluster achieves 120 MFLOPS/W on HPL, competitive with AMD Opteron 6128 and Intel Xeon X5660-based systems. We identified a set of inefficiencies in our design that stem from the components targeting mobile computing. The main inefficiency is that the power taken by the components required to integrate small low-power dual-core processors offsets the high energy efficiency of the cores themselves. We performed a set of simulations to project the energy efficiency of our cluster had we been able to use chips featuring higher-performance ARM cores and integrating a larger number of them.

Based on these projections, a cluster configuration with 16-core Cortex-A15 chips would be competitive with Sandy Bridge-based homogeneous systems and GPU-accelerated heterogeneous systems in the Green500 list.

We also explained the major issues and how they should evolve or be improved for future clusters built from low-power ARM processors. These issues include, apart from the aforementioned integration overhead, the lack of optimized software, the use of mobile-targeted memories, the lack of double-precision floating-point SIMD units, and the lack of support for high-performance interconnects. Based on our recommendations, an HPC-ready ARM processor design should include a larger number of cores per chip (e.g., 16) and use a core microarchitecture suited for high performance, like the one in the Cortex-A15. It should also include double-precision floating-point SIMD units, support for multiple memory controllers servicing DDR3 or DDR4 memory modules, and probably support for a higher-performance network, such as InfiniBand, although Gigabit Ethernet may be sufficient for many HPC applications. On the software side, libraries, compilers, drivers and operating systems need tuning for high performance and architecture-dependent optimizations for ARM processor chips.

Recent announcements show an increasing interest in server-class low-power systems that may benefit HPC. The new 64-bit ARMv8 ISA improves several features that are important for HPC. First, using 64-bit addresses removes the 4 GB memory limitation per application. This allows more memory per node, so one process can compute more data locally, requiring less network communication. Also, ARMv8 increases the size of the general-purpose register file from 16 to 32 registers, which reduces register spilling and provides more room for compiler optimization. It also improves floating-point performance by extending the NEON instructions with fused multiply-add and multiply-subtract, and with cross-lane vector operations. More importantly, double-precision floating-point is now part of NEON. Altogether, this provides a theoretical peak double-precision floating-point performance of 4 FLOPS/cycle for a fully-pipelined SIMD unit. As an example, the ARM Cortex-A57, the highest-performance ARM implementation of the ARMv8 ISA, includes two NEON units, totalling 8 double-precision floating-point FLOPS/cycle, which is 4 times better than the ARM Cortex-A15 and equivalent to Intel implementations with one AVX unit.
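The arithmetic behind these per-cycle figures, assuming one fused multiply-add issued per fully-pipelined 128-bit NEON unit per cycle, is simply:

\begin{align*}
\text{per NEON unit:}\quad & \frac{128~\text{bits}}{64~\text{bits/lane}} \times 2~\text{FLOPS per FMA} = 4~\text{FLOPS/cycle},\\
\text{Cortex-A57 (two units):}\quad & 2 \times 4 = 8~\text{FLOPS/cycle}.
\end{align*}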

These encouraging industrial roadmaps, together with research initiatives such as the EU-funded Mont-Blanc project [62], may enable ARM-based platforms to fulfil the recommendations given in this paper in the near future.

8. Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments. In addition, the authors would like to thank Bernard Ortiz de Montellano and Paul M. Carpenter for their help in improving the quality of this paper.

This project and the research leading to these results have received funding from the European Union's Seventh Framework Programme [FP7/2007-2013] under grant agreement no. 288777. Part of this work is supported by the PRACE project (European Union funding under grants RI-261557 and RI-283493).

References

[1] D. Göddeke, D. Komatitsch, M. Geveler, D. Ribbrock, N. Rajovic, N. Puzovic, A. Ramirez, Energy efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster, Journal of Computational Physics 237 (2013) 132–150.

[2] HPCwire, New Mexico to Pull Plug on Encanto, Former Top 5 Supercomputer, http://www.hpcwire.com/hpcwire/2012-07-12/new_mexico_to_pull_plug_on_encanto_former_top_5_supercomputer.html, accessed: 5-May-2013 (July 2012).

[3] HPCwire, Requiem for Roadrunner, http://www.hpcwire.com/hpcwire/2013-04-01/requiem_for_roadrunner.html, accessed: 5-May-2013 (April 2013).

[4] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, K. Yelick, P. Kogge, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems, in: DARPA Technical Report, 2008.

[5] ARM Ltd., Cortex-A9 Processor, http://www.arm.com/products/processors/cortex-a/cortex-a9.php, accessed: 5-May-2013.

[6] Calxeda, EnergyCore processors, http://www.calxeda.com/technology/products/processors/, accessed: 5-May-2013.

[7] Marvell, Marvell Quad-Core ARMADA XP Series SoC, http://www.marvell.com/embedded-processors/armada-xp/assets/Marvell-ArmadaXP-SoC-product%20brief.pdf, accessed: 5-May-2013.

[8] AppliedMicro, AppliedMicro X-Gene, http://www.apm.com/products/x-gene/, accessed: 5-May-2013.

[9] HP, HP Labs research powers project Moonshot, HP's new architecture for extreme low-energy computing, http://www.hpl.hp.com/news/2011/oct-dec/moonshot.html, accessed: 15-May-2013.

[10] Boston Ltd., Boston Viridis - ARM Microservers, http://www.boston.co.uk/solutions/viridis/default.aspx, accessed: 5-May-2013.

[11] ARM Ltd., VFPv3 Floating Point Unit, http://www.arm.com/products/processors/technologies/vector-floating-point.php, accessed: 5-May-2013.

[12] ARM Ltd., The ARM NEON general-purpose SIMD engine, http://www.arm.com/products/processors/technologies/neon.php, accessed: 5-May-2013.

[13] J. Turley, Cortex-A15 "Eagle" flies the coop, Microprocessor Report 24 (11) (2010) 1–11.

[14] ARM Ltd., Cortex-A50 Series, http://www.arm.com/products/processors/cortex-a50/index.php, accessed: 5-May-2013.

[15] G. Bell, Bell's law for the birth and death of computer classes, Communications of the ACM 51 (1) (2008) 86–94.

[16] SECO, QuadMo747-X/T20, http://www.seco.com/en/item/quadmo747-x_t20, accessed: 5-May-2013.

[17] SECO, SECOCQ7-MXM, http://www.seco.com/en/item/secocq7-mxm, accessed: 5-May-2013.

[18] R. Whaley, J. Dongarra, Automatically tuned linear algebra software, in: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (CDROM), IEEE Computer Society, 1998, pp. 1–27.

[19] M. Frigo, S. G. Johnson, The design and implementation of FFTW3, Proceedings of the IEEE 93 (2) (2005) 216–231, special issue on "Program Generation, Optimization, and Platform Adaptation".

[20] Yokogawa, WT210/WT230 Digital Power Meters, http://tmi.yokogawa.com/products/digital-power-analyzers/digital-power-analyzers/wt210wt230-digital-power-meters/, accessed: 5-May-2013.

[21] A. Yoo, M. Jette, M. Grondona, SLURM: Simple Linux Utility for Resource Management, in: Job Scheduling Strategies for Parallel Processing, Springer, 2003, pp. 44–60.

[22] Intel, Intel Core i7-640M Processor, http://ark.intel.com/products/49666/Intel-Core-i7-640M-Processor-4M-Cache-2_80-GHz, accessed: 5-May-2013.

[23] R. Weicker, Dhrystone: a synthetic systems programming benchmark, Communications of the ACM 27 (10) (1984) 1013–1030.

[24] J. L. Henning, SPEC CPU2006 benchmark descriptions, SIGARCH Comput. Archit. News 34 (4) (2006) 1–17.

[25] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, A. Ramirez, The low power architecture approach towards exascale computing, Journal of Computational Science.

[26] J. D. McCalpin, Memory bandwidth and machine balance in current high performance computers, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (1995) 19–25.

[27] J. Dongarra, P. Luszczek, A. Petitet, The LINPACK Benchmark: past, present and future, Concurrency and Computation: Practice and Experience 15 (9) (2003) 803–820.

[28] H. Berendsen, D. van der Spoel, R. van Drunen, GROMACS: A message-passing parallel molecular dynamics implementation, Computer Physics Communications 91 (1) (1995) 43–56.

[29] D. Komatitsch, J. Tromp, Introduction to the spectral element method for three-dimensional seismic wave propagation, Geophysical Journal International 139 (3) (1999) 806–822.

[30] DEISA 2: Distributed European Infrastructure for Supercomputing Applications; Maintenance of the DEISA Benchmark Suite in the Second Year, available online at: www.deisa.eu.

[31] ARM Ltd., ARM Announces 2GHz Capable Cortex-A9 Dual Core Processor Implementation, http://www.arm.com/about/newsroom/25922.php, accessed: 5-May-2013.

[32] Intel Corporation, Intel 82574 GbE Controller Family, http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf, accessed: 5-May-2013.

[33] SMSC, LAN9514/LAN9514i: USB 2.0 Hub and 10/100 Ethernet Controller, http://www.smsc.com/media/Downloads_Public/Data_Sheets/9514.pdf, accessed: 5-May-2013.

[34] Micron, DDR2 SDRAM System-Power Calculator, http://www.micron.com/support/dram/power_calc.html, accessed: 5-May-2013.

[35] R. Badia, J. Labarta, J. Gimenez, F. Escale, DIMEMAS: Predicting MPI applications behavior in Grid environments, in: Workshop on Grid Applications and Programming Tools (GGF8), Vol. 86, 2003.

[36] A. Ramirez, O. Prat, J. Labarta, M. Valero, Performance Impact of the Interconnection Network on MareNostrum Applications, in: 1st Workshop on Interconnection Network Architectures: On-Chip, Multi-Chip, 2007.

[37] S. Girona, J. Labarta, R. M. Badia, Validation of Dimemas Communication Model for MPI Collective Operations, in: Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2000, pp. 39–46.

[38] V. Pillet, J. Labarta, T. Cortes, S. Girona, Paraver: A tool to visualize and analyze parallel code, WoTUG-18 (1995) 17–31.

[39] ARM Ltd., CoreTile Express A15×2 A7×3 Technical Reference Manual, http://infocenter.arm.com/help/topic/com.arm.doc.ddi0503c/DDI0503C_v2p_ca15_a7_tc2_trm.pdf, accessed: 14-May-2013.

[40] ARM Ltd., Virtualization Extensions and Large Physical Address Extensions, http://www.arm.com/products/processors/technologies/virtualization-extensions.php, accessed: 14-May-2013.

[41] ARM Ltd., CoreLink CCN-504 Cache Coherent Network, http://www.arm.com/products/system-ip/interconnect/corelink-ccn-504-cache-coherent-network.php, accessed: 14-May-2013.

[42] J. Byrne, ARM CoreLink Fabric Links 16 CPUs, Microprocessor Report (2012) 1–3.

[43] Samsung, Samsung Exynos5 Dual White Paper, http://www.samsung.com/global/business/semiconductor/minisite/Exynos/data/Enjoy_the_Ultimate_WQXGA_Solution_with_Exynos_5_Dual_WP.pdf, accessed: 14-May-2013.

[44] ARM Ltd., Mali Graphics Hardware - Mali-T604, http://www.arm.com/products/multimedia/mali-graphics-hardware/mali-t604.php, accessed: 14-May-2013.

[45] Calxeda, Calxeda Quad-Node EnergyCard, http://www.calxeda.com/technology/products/energycards/quadnode, accessed: 14-May-2013.

[46] V. Marjanovic, J. Labarta, E. Ayguade, M. Valero, Overlapping communication and computation by using a hybrid MPI/SMPSs approach, in: Proceedings of the 24th ACM International Conference on Supercomputing, ACM, 2010, pp. 5–16.

[47] M. Warren, E. Weigle, W. Feng, High-density computing: A 240-processor Beowulf in one cubic meter, in: Supercomputing, ACM/IEEE 2002 Conference, IEEE, 2002, pp. 61–61.

[48] H. Nakashima, H. Nakamura, M. Sato, T. Boku, S. Matsuoka, D. Takahashi, Y. Hotta, MegaProto: 1 TFLOPS/10 kW rack is feasible even with only commodity technology, in: Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, IEEE, 2005, pp. 28–28.

[49] K. Barker, K. Davis, A. Hoisie, D. Kerbyson, M. Lang, S. Pakin, J. Sancho, Entering the petaflop era: the architecture and performance of Roadrunner, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, 2008, p. 1.

[50] T. Chen, R. Raghavan, J. Dale, E. Iwata, Cell Broadband Engine Architecture and its first implementation: a performance view, IBM Journal of Research and Development 51 (5) (2007) 559–572.

[51] V. Vasudevan, D. Andersen, M. Kaminsky, L. Tan, J. Franklin, I. Moraru, Energy-efficient cluster computing with FAWN: Workloads and implications, in: Proceedings of the 1st International Conference on Energy-Efficient Computing and Networking, ACM, 2010, pp. 195–204.

[52] W. Lang, J. Patel, S. Shankar, Wimpy node clusters: What about non-wimpy workloads?, in: Proceedings of the Sixth International Workshop on Data Management on New Hardware, ACM, 2010, pp. 47–55.

[53] P. Stanley-Marbell, V. C. Cabezas, Performance, power, and thermal analysis of low-power processors for scale-out systems, in: Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on, IEEE, 2011, pp. 863–870.

[54] K. Fürlinger, C. Klausecker, D. Kranzlmüller, Towards energy efficient parallel computing on consumer electronic devices, in: Information and Communication on Technology for the Fight against Global Warming, Springer, 2011, pp. 1–9.

[55] K. Fürlinger, C. Klausecker, D. Kranzlmüller, The AppleTV-cluster: Towards energy efficient parallel computing on consumer electronic devices, Whitepaper, Ludwig-Maximilians-Universität.

[56] N. R. Adiga, G. Almasi, G. S. Almasi, Y. Aridor, R. Barik, D. Beece, R. Bellofatto, G. Bhanot, R. Bickford, M. Blumrich, et al., An overview of the BlueGene/L supercomputer, in: Supercomputing, ACM/IEEE 2002 Conference, IEEE, 2002.

[57] S. Alam, R. Barrett, M. Bast, M. R. Fahey, J. Kuehn, C. McCurdy, J. Rogers, P. Roth, R. Sankaran, J. S. Vetter, P. Worley, W. Yu, Early evaluation of IBM BlueGene/P, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, 2008.

[58] IBM Systems and Technology, IBM System Blue Gene/Q Data Sheet, http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12345usen/DCD12345USEN.PDF, accessed: 5-May-2013.

[59] Penguin Computing, ARM Servers - UDX1, http://www.penguincomputing.com/Products/RackmountedServers/ARM, accessed: 15-May-2013.

[60] EXXACT Corporation, ARM Microservers - Energy Efficient Hyperscale Computing, http://exxactcorp.com/index.php/solution/solu_list/59, accessed: 15-May-2013.

[61] HP, HP Project Moonshot and the Redstone Development Server Platform, http://h10032.www1.hp.com/ctg/Manual/c03442116.pdf, accessed: 15-May-2013.

[62] The Mont-Blanc project, http://montblanc-project.eu, accessed: 16-May-2013.
