
UPTEC IT 21001

Degree Project in Computer and Information Engineering
March 2, 2021

Hardware Architecture Impact on Manycore Programming Model

Erik Stubbfält

Civilingenjörsprogrammet i informationsteknologi

Master Programme in Computer and Information Engineering


Institutionen för informationsteknologi

Besöksadress: ITC, Polacksbacken, Lägerhyddsvägen 2

Postadress: Box 337, 751 05 Uppsala

Hemsida: http://www.it.uu.se

Abstract

Hardware Architecture Impact on Manycore Programming Model

Erik Stubbfält

This work investigates how certain processor architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is compared and contrasted to general-purpose multicore processors, highlighting differences in their memory systems and processor cores. A proof-of-concept implementation of the Concurrency Building Blocks (CBB) programming model is developed for x86-64 using MPI. Benchmark tests show how CBB on EMCA handles compute-intensive and memory-intensive scenarios, compared to a high-end x86-64 machine running the proof-of-concept implementation. EMCA shows its strengths in heavy computations, while x86-64 performs at its best with high degrees of data reuse. Both systems are able to utilize locality in their memory systems to achieve great performance benefits.

External supervisors: Lars Gelin & Anders Dahlberg, Ericsson
Subject reviewer: Stefanos Kaxiras
Examiner: Lars-Åke Nordén
ISSN 1401-5749, UPTEC IT 21001
Printed by: Ångströmlaboratoriet, Uppsala universitet


Sammanfattning

This project investigates how different processor architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is analyzed and compared with commercial multicore processors, and differences in their respective memory systems and processor cores are discussed. A prototype Concurrency Building Blocks (CBB) implementation for x86-64 is developed with the help of MPI. Benchmark tests show how CBB together with EMCA handles compute-intensive and memory-intensive scenarios, compared with a modern x86-64 system running the developed prototype. EMCA shows its strengths in heavy computations, and x86-64 performs best when data is reused to a high degree. Both systems use locality in their respective memory systems in a way that brings large performance benefits.


Contents

1 Introduction
2 Background
  2.1 Multicore and manycore processors
  2.2 Parallel computing
    2.2.1 Different types of parallelism
    2.2.2 Parallel programming models
  2.3 Memory systems
    2.3.1 Cache and scratchpad memory
  2.4 Memory models
  2.5 SIMD
  2.6 Prefetching
  2.7 Performance analysis tools
  2.8 The actor model
  2.9 Concurrency Building Blocks
  2.10 The baseband domain
3 Purpose, aims, and motivation
  3.1 Delimitations
4 Methodology
  4.1 Literature study
  4.2 Development
  4.3 Testing
5 Literature study
  5.1 Comparison of architectures
    5.1.1 Memory system
    5.1.2 Processor cores
    5.1.3 SIMD operations
    5.1.4 Memory models
  5.2 Related academic work
    5.2.1 The Art Of Processor Benchmarking: A BDTI White Paper
    5.2.2 A DSP Acceleration Framework For Software-Defined Radios On x86-64
    5.2.3 Friendly Fire: Understanding the Effects of Multiprocessor Prefetches
    5.2.4 Analysis of Scratchpad and Data-Cache Performance Using Statistical Methods
6 Selection of software framework
  6.1 MPI
    6.1.1 Why MPI?
    6.1.2 MPICH
    6.1.3 Open MPI
7 Selection of target platform
8 Evaluation methods
  8.1 Strong scaling and weak scaling
    8.1.1 Compute-intensive benchmark
    8.1.2 Memory-intensive benchmark without reuse
    8.1.3 Memory-intensive benchmark with reuse
    8.1.4 Benchmark tests in summary
  8.2 Collection of performance metrics
  8.3 Systems used for testing
9 Implementation of CBB actors using MPI
  9.1 Sending messages
  9.2 Receiving messages
10 Creating and running benchmark tests
  10.1 MPI for x86-64
  10.2 CBB for EMCA
11 Results and discussion
  11.1 Compute-intensive benchmark
    11.1.1 Was the test not compute-intensive enough for EMCA?
  11.2 Memory-intensive benchmark with no data reuse
  11.3 Memory-intensive benchmark with data reuse
  11.4 Discussion on software complexity and optimizations
12 Conclusions
13 Future work
  13.1 Implement a CBB transform with MPI for x86-64
  13.2 Expand benchmark tests to cover more scenarios
  13.3 Run benchmarks with hardware prefetching turned off
  13.4 Combine MPI processes with OpenMP threads
  13.5 Run the same code in an ARMv8 system


List of Figures

1 Memory hierarchy of a typical computer system [7].
2 Memory hierarchy and address space for a cache configuration (left) and a scratchpad configuration (right) [2, Figure 1].
3 Main artefacts of the CBB programming model.
4 Conceptual view of a multicore system implementing TSO [7, Figure 4.4 (b)]. Store instructions are issued to a FIFO store buffer before entering the memory system.
5 Categorization of DSP benchmarks from simple (bottom) to complex (top) [5, Figure 1]. The grey area shows examples of benchmarks that BDTI provides.
6 Processor topology of the x86-64 system used for testing.
7 The CBB application used for implementation.
8 Normalized execution times for the compute-intensive benchmark test with weak scaling.
9 Normalized execution times for the compute-intensive benchmark test with strong scaling.
10 Speedup for the compute-intensive benchmark test with strong scaling.
11 Speedup for the compute-intensive benchmark test with strong scaling and 64-bit floating-point addition. Only EMCA was tested.
12 Normalized execution times for the memory-intensive benchmark with no data reuse and weak scaling.
13 Normalized execution times for the memory-intensive benchmark with no data reuse and strong scaling.
14 Speedup for the memory-intensive benchmark with no data reuse and strong scaling.
15 Normalized execution times for the memory-intensive benchmark with data reuse and weak scaling.
16 Cache miss ratio in L1D for the memory-intensive benchmark with data reuse and weak scaling.
17 Normalized execution times for the memory-intensive benchmark with data reuse and strong scaling.
18 Speedup for the memory-intensive benchmark with data reuse and strong scaling.
19 Cache miss ratio for the memory-intensive benchmark with data reuse and strong scaling.

List of Tables

1 Flag synchronization program to motivate why memory models are needed [20, Table 3.1].
2 One possible execution of the program in Table 1 [20, Table 3.2].


1 Introduction

This work is centered around the connections between two areas within computer science, namely hardware architecture and parallel programming. How can a programming model, developed specifically for a certain processor type, be expanded and adapted to run on a completely different hardware architecture? This question, which is a general problem found in many areas of industry and research, is what this thesis revolves around.

The project is conducted in collaboration with the Baseband Infrastructure (BBI) department at Ericsson. They develop low-level software platforms and tools used in baseband software within the Ericsson Radio System product portfolio. This includes the Concurrency Building Blocks (CBB) programming model, which is designed to take full advantage of the Ericsson Many-Core Architecture (EMCA) hardware.

EMCA has a number of characteristics that set it apart from commercial off-the-shelf (COTS) designs like x86-64 and ARMv8. EMCA uses scratchpad memories and simple DSP cores instead of the coherent cache systems and out-of-order cores with simultaneous multithreading found in general-purpose hardware. These differences, and more, are investigated in a literature study with a special focus on how they might affect run-time performance.

MPI is used as a tool for developing a working CBB prototype that can run on both x86-64 and ARMv8. This choice is motivated by the many similarities between concepts used in CBB and concepts seen in MPI. Finally, a series of benchmark tests is run with CBB on EMCA and with the CBB prototype on a high-end x86-64 machine. These tests aim to investigate some compute-intensive and memory-intensive scenarios, which are both relevant for actual baseband software. Each test is run with a fixed problem size which is divided equally among the available workers, and also with a problem size that increases linearly with the number of workers. EMCA shows very good performance in the compute-intensive tests. The test (using 16-bit integer addition) is in fact deemed to not be compute-intensive enough to highlight the expected scaling behavior, and a modified benchmark (using 64-bit floating-point addition) is also tested. In the memory-intensive tests, it is shown that x86-64 performs at its best when the degree of data reuse is high and it can hold data in its L1D cache. In this scenario it shows better scaling behavior than EMCA. However, x86-64 takes a much larger performance hit than EMCA when the number of processes exceeds the number of available processor cores.

The rest of this report is structured as follows: Section 2 describes the necessary background theory on the problem at hand. Section 3 discusses the purpose, aims and motivation behind the project, along with some delimitations. Section 4 goes into the methodology used. The literature study is contained in Section 5, and Sections 6 and 7 describe the software and hardware used for development. The development of a CBB proof-of-concept and a series of benchmark tests is described in Sections 8, 9 and 10. Section 11 presents and discusses the benchmark results. Finally, Section 12 contains some conclusions and a summary of the contributions made, and Section 13 describes how this work could be continued in the future.

2 Background

To arrive at a concrete problem description, a bit of background theory on computer architecture and parallel programming is required. The following paragraphs provide details about some of the concepts that will be central later on.

2.1 Multicore and manycore processors

Traditionally, the key to making a computer program run faster was to increase the performance of the processor core that it was running on. The increase in single-core performance over the years was made possible by Moore's law [13, p. 17], which describes how the number of transistors that fit in an integrated circuit of a given size has doubled approximately every two years. However, this rate of progression started to level off in the late 2000s, mainly as a consequence of limits in power consumption.

To get more performance per watt, the solution was to add more processor cores and have them collaborate on running program code. This type of construction is commonly referred to as a multicore processor. In cases where the number of cores is especially high, the term manycore processor is often used.

2.2 Parallel computing

The performance observed when running a certain algorithm rarely scales perfectly with the number of processor cores added. Instead, the possible speedup for a fixed problem size is indicated by Amdahl's law [14], which can be formulated as

Speedup = 1 / ((1 - f) + f/s),


where f is the fraction of the program that can benefit from additional system resources (in this case processor cores) to get a speedup of s. Another way of looking at this is that s is the number of processor cores used. This means that there is a portion of the program that can be split up into parallel tasks. The fraction of the code that cannot be parallelized, and thus not benefit from more cores, is represented by 1 - f. Finding and exploiting parallelism, i.e. maximizing f, is crucial to getting the most out of modern multicore hardware in terms of overall performance. Scaling the number of processors used for a fixed problem size, as described by Amdahl's law, is often referred to as strong scaling [17].

There is also a possibility of increasing the size of the problem along with the number of processor cores. This is called weak scaling. The possible speedup for this type of scenario is given by Gustafson's law [12],

Speedup = (1 - f) + f · s,

where f and s have the same meaning as previously. We can see that there is no theoretical upper limit to the speedup that can be achieved with weak scaling, since the speedup increases linearly. In contrast, the non-parallelizable part 1 - f places a hard upper limit on the speedup for strong scaling even if s approaches infinity.
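To make the two laws concrete, the short C program below (an illustrative sketch, not part of the original thesis) evaluates both formulas for an assumed parallel fraction f = 0.95 and a few example core counts s.

#include <stdio.h>

/* Illustrative only: evaluate Amdahl's and Gustafson's laws for a
   parallel fraction f and s processor cores. */
static double amdahl(double f, double s)    { return 1.0 / ((1.0 - f) + f / s); }
static double gustafson(double f, double s) { return (1.0 - f) + f * s; }

int main(void) {
    const double f = 0.95;                      /* assumed parallel fraction */
    const double cores[] = { 4, 16, 64, 256 };
    for (int i = 0; i < 4; i++)
        printf("s = %3.0f: Amdahl %6.2fx, Gustafson %6.2fx\n",
               cores[i], amdahl(f, cores[i]), gustafson(f, cores[i]));
    return 0;   /* Amdahl saturates near 1/(1-f) = 20x, Gustafson grows linearly */
}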

2.2.1 Different types of parallelism

There are many ways to divide parallelism into subgroups. In the context of multicore processors, and especially in this project, two of the most important types are described below; a small illustrative code sketch follows the list.

• Data parallelism: The same set of operations is performed on many pieces of data, for example across items in a data vector, and the iterations are independent from one another. This means that the iterations can be split up beforehand and then be performed in parallel. This is referred to as data parallelism.

• Task parallelism: Different parts of a program are split up into tasks, which are sets of operations that are typically different from one another. A set of tasks can operate on the same or different data. If multiple tasks are completely independent of one another they can be run in parallel, which gives task parallelism.
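The following sketch makes the distinction concrete using OpenMP, which is not the programming model studied in this thesis and is used here purely for illustration: the loop expresses data parallelism, while the two sections express task parallelism over hypothetical, independent pieces of work.

#include <stdio.h>

#define N 1000000
static double a[N], b[N], c[N];

static void task_decode(void)  { /* placeholder for one independent task */ }
static void task_measure(void) { /* placeholder for another independent task */ }

int main(void) {
    /* Data parallelism: the same operation on many independent elements. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Task parallelism: different, independent pieces of work run side by side. */
    #pragma omp parallel sections
    {
        #pragma omp section
        task_decode();
        #pragma omp section
        task_measure();
    }

    printf("c[0] = %f\n", c[0]);
    return 0;
}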

2.2.2 Parallel programming models

When creating a parallel program, the programmer typically uses a parallel programming model. This is an abstraction of available hardware that gives parallel capabilities to an existing programming language, or in some cases introduces an entirely new programming language. Programming models can for example support features such as thread creation, message passing and synchronization primitives.

2.3 Memory systems

Ideally a processor core would be able to access data items to operate on without any delay, regardless of which piece of data it requests. In reality, memory with short access times is very expensive to manufacture and is also not perfectly scalable. This is why modern computer systems have a hierarchy of memory devices attached to them. Figure 1 shows an example of a memory hierarchy, and the technologies typically associated with each level.

Figure 1 Memory hierarchy of a typical computer system [7].

One of the main ideas behind memory hierarchies is to take advantage of the principle of locality, which states that programs tend to reuse data and instructions they have used recently [13, p. 45]. This means that access patterns both in data and in code can often be predicted to a certain extent. Temporal locality refers to a specific item being reused multiple times within a short time period, while spatial locality means that items with adjacent memory addresses are accessed close together in time.
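As a small illustration (not taken from the thesis), the two functions below sum the same row-major C matrix; the first traversal has good spatial locality, while the second touches a different cache line on almost every access.

#include <stddef.h>

#define N 1024
static double m[N][N];

/* Good spatial locality: C stores m row-major, so consecutive iterations
   touch adjacent addresses and reuse the cache line just fetched. */
double sum_row_major(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Poor spatial locality: consecutive iterations jump N * sizeof(double)
   bytes, so almost every access lands on a different cache line. */
double sum_column_major(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}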


Figure 2 Memory hierarchy and address space for a cache configuration (left) and a scratchpad configuration (right) [2, Figure 1].

2.3.1 Cache and scratchpad memory

Cache memories typically sit close to the processor core. These are high-speed on-chip memory modules that share the same address space as the underlying main memory. Data and instructions are automatically brought into the cache, and there is typically a cache coherency mechanism to ensure that all data items get updated accordingly in all levels of the cache system. All this is typically implemented in hardware, and is therefore invisible to the programmer.

Some processor designs have scratchpad memory, which has the same high-speed characteristics as a cache [2]. The transferring of data to and from a scratchpad memory is typically controlled in software, which makes it different from a cache where this is handled entirely in hardware. Scratchpad memory requires more effort from software developers, but is more predictable since it does not suffer from cache misses. Scratchpad memory consumes less energy per access than cache memory since it has its own address space, meaning that no address tag lookup mechanism is needed. Figure 2 shows a schematic view of the differences between cache and scratchpad memory.

Important to note is that there are many possible configurations of cache and scratchpad memory within a chip. They may for example be shared across multiple cores or private to a single core, and there may be multiple levels of cache or scratchpad memory where each level has different properties. It is also possible to utilize a scratchpad memory in conjunction with software constructs that make it behave similarly to a cache.


2.4 Memory models

The memory model is an abstract description of the memory ordering properties that a particular system has. Important to note here is that these properties are visible to software threads in a multicore system. Different memory models give different levels of freedom for compilers to do optimizations in the code.

Core 1                   Core 2                        Comments
S1: Store data = NEW;                                  /* Initially, data = 0 & flag ≠ SET */
S2: Store flag = SET;    L1: Load r1 = flag;           /* L1 & B1 may repeat many times */
                         B1: if (r1 ≠ SET) goto L1;
                         L2: Load r2 = data;

Table 1 Flag synchronization program to motivate why memory models are needed [20, Table 3.1].

To understand what a memory model is and why it is needed, look at Table 1. Here, core 2 spins in a loop while waiting for the flag variable to be SET by core 1. The question is: what value will core 2 observe when it loads the data variable in the end? Without knowing anything about the memory model of the system, this is impossible to answer. The memory model describes how instructions may be reordered at a local core.

Cycle   Core 1                  Core 2                Coherence state of data   Coherence state of flag
1       S2: Store flag = SET                          Read-only for C2          Read-write for C1
2                               L1: Load r1 = flag    Read-only for C2          Read-only for C2
3                               L2: Load r2 = data    Read-only for C2          Read-only for C2
4       S1: Store data = NEW                          Read-write for C1         Read-only for C2

Table 2 One possible execution of the program in Table 1 [20, Table 3.2].

One possible outcome of running the program is shown in Table 2. We can see that a store-store reordering has occurred at core 1, meaning that it has executed instruction S2 before instruction S1 (violating the program order). In this case core 2 would observe that the data variable has the “old” value of 0.

With knowledge about the memory model, the programmer can for example know where to insert memory barriers to ensure correctness in the program.
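As a concrete illustration (not taken from the thesis), the flag synchronization from Table 1 can be expressed with C11 atomics. When producer() and consumer() run on two different threads, the release/acquire pair rules out the store-store and load-load reorderings discussed above, so the consumer is guaranteed to observe the new value of data.

#include <stdatomic.h>
#include <stdio.h>

static int data = 0;
static atomic_int flag = 0;    /* 0 = not set, 1 = SET */

/* Runs on core 1 */
void producer(void) {
    data = 42;                                               /* S1 */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* S2: cannot be reordered before S1 */
}

/* Runs on core 2 */
void consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1)  /* L1 & B1 */
        ;                                                    /* spin until flag is SET */
    printf("data = %d\n", data);                             /* L2: guaranteed to print 42 */
}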

2.5 SIMD

Processor designs with Single Instruction Multiple Data (SIMD) features offer a way of performing the same calculation on multiple data items in parallel within the same processor pipeline [13, p. 10]. These features can be used to obtain data parallelism of a degree beyond the core count of a system, and are often implemented using wide vector registers and operations on these registers. Since this approach needs to fetch and execute fewer instructions than the number of data items, it is also potentially more power efficient than the conventional Multiple Instruction Multiple Data (MIMD) approach.
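For illustration, the sketch below adds two arrays of 16-bit integers (the same element type as the compute-intensive benchmark later in the report) using AVX2 intrinsics, processing 16 elements per instruction. It assumes an x86-64 CPU with AVX2 and that n is a multiple of 16; it is not code from the thesis.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: c[i] = a[i] + b[i] for 16-bit integers, 16 lanes at a time. */
void add_i16_avx2(const int16_t *a, const int16_t *b, int16_t *c, size_t n) {
    for (size_t i = 0; i < n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i vc = _mm256_add_epi16(va, vb);        /* 16 additions in one instruction */
        _mm256_storeu_si256((__m256i *)(c + i), vc);
    }
}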

2.6 Prefetching

Prefetching is a useful way of hiding memory access latency during program execution. It involves predicting which data and instructions will be used in the near future, and bringing them into a nearby cache. An ideal prefetching mechanism would accurately predict addresses, make the prefetches at the right time, and place the data in the right place (which might include choosing the right data to replace) [9]. Inaccurate prefetching may pollute the cache system, possibly evicting useful items. Many common prefetching schemes try to detect sequential access patterns (possibly with a constant stride), which can be fairly accurate. Another method is to, for each memory access, bring in a couple of adjacent items from memory instead of just the item referred to in the program. Prefetching can be implemented both in hardware and in software.
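A minimal sketch of software prefetching with the GCC/Clang builtin __builtin_prefetch is shown below; the prefetch distance of 16 elements is an arbitrary assumption that would need tuning on a real machine, and the example is not from the thesis.

#include <stddef.h>

/* Illustrative only: request a[i + 16] while summing a[i], hoping the data
   arrives in cache before it is needed. Arguments to __builtin_prefetch:
   address, read (0) or write (1), and expected temporal locality (0-3). */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);
        s += a[i];
    }
    return s;
}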

2.7 Performance analysis tools

Measuring how a computer system behaves while running a certain application can be done through an automated performance analysis tool. These tools can be divided into two broad categories: static analysis tools rely on source code insertions for collecting data, while dynamic analysis tools make binary-level alterations and procedure calls at run-time [27]. There are also hybrid tools that utilize both techniques.


Most tools use some kind of statistical sampling, where the program flow of the tested application is paused at regular intervals to run a data collection routine. This can for example provide information about the time spent in each function, and how many times each function has been called. Many tools also utilize a feature present in most modern processor chips, namely hardware performance counters. These are special-purpose registers that can be programmed to react whenever a specific event occurs, for example an L1 cache miss or a branch misprediction. This can provide very accurate metrics without inducing any significant overhead.

2.8 The actor model

The actor model is an abstract model for concurrent computation centered around primitives called actors [1]. They are independent entities which can do computations according to a pre-defined behavior. The actor model is built around the concept of asynchronous message passing, in that every actor has the ability to send and receive messages to and from other actors. These messages can be sent and received at any time without coordinating with the actor at the other end of the communication, hence the asynchrony. When receiving a message, an actor has the ability to:

1. Send a finite number of messages to one or many other actors.

2. Create a finite number of new actors.

3. Define what behavior to use when receiving the next message.

Actors have unique identification tags, which are used as “addresses” for all message passing. They can have internal state information which can be changed by local behavior, but there is no ability to directly change the state of other actors.

2.9 Concurrency Building Blocks

CBB is Ericsson's proprietary programming model for baseband functionality development, designed as a high-level domain-specific language (DSL). A CBB application is translated into C code for specific hardware platforms through a process called the CBB transform. This makes for great flexibility, since developers do not need to target one platform specifically when writing baseband software.

Figure 3 Main artefacts of the CBB programming model.

Application behavior is defined inside CBB behavior classes (CBCs), seen in the middle of Figure 3. The CBCs are based on the actor model, described in Section 2.8. When a CBC handles an incoming message it can initiate an activity, shown on the right of Figure 3. An activity is an arbitrarily complex directed acyclic graph (DAG) of calls to C functions, and can also contain synchronization primitives. Different message types can be mapped to different activities.

The simplest form of a CBC is an actor with a single serializing first-in first-out (FIFO) message queue. It is possible to define CBCs with different queue configurations, but those will not be focused on here.

At the top level, an application is structured inside a CBB structure class (CSC). This CSC can in itself contain instances of CBCs and other CSCs, forming a hierarchy of application components.

2.10 The baseband domain

In a cellular network, the term baseband is used to describe the functionality in between the radio unit and the core network. This is where functionalities from Layer 1 (the physical layer) and Layer 2 (the data link layer) of the OSI model [30] are found. The baseband domain also contains Radio Resource Management (RRM) and a couple of other features.

• Layer 1: Responsible for modulation and demodulation of data streams for downlink and uplink respectively. It also performs link measurements and other tasks. This layer is responsible for approximately 75% of the compute cycles within the baseband domain, since it does a lot of computationally intensive signal processing.

• Layer 2: Handles per-user packet queues and multiplexing of control data and user data, among other tasks. This layer has a lot of memory-intensive work, and produces around 5% of the compute cycles.

• RRM: The main task of RRM is to schedule data streams on available radio channel resources, which means solving bin-packing problems with a large number of candidates. This produces 15% of the compute cycles within the baseband domain.

3 Purpose, aims, and motivation

The broader purpose of this work is to investigate how hardware can affect software, and more specifically how certain hardware architectures affect the implementation and performance of a parallel programming model. CBB will be at the center of this investigation, and the result will be a proof-of-concept showing how it can be implemented on COTS hardware such as an x86-64 or ARMv8 chip. Differences between the selected architecture and EMCA will be analyzed, including how these differences manifest themselves in performance.

Adapting a programming model to run on new architectures is a general problem that exists in many parts of industry and research. If done successfully it can create entirely new use cases and products. There is also potential to learn how to utilize hardware features that have not previously been considered.

3.1 Delimitations

This work focuses on important aspects of adapting a programming model to new hardware, and not on the actual implementation. Therefore this project does not include a new, fully functional CBB implementation. It instead results in a prototype with sufficient functionality for running performance tests. Section 13 of the report, which describes future work, discusses some necessary steps to make a more complete implementation.

Not all of the aspects described in the comparative part of the literature study will be evaluated with performance tests, since this would require more time and resources than available. Instead, the evaluation focuses on a few of the most relevant metrics. These are described in Section 8.2.

The project is also not focused on comparing different programming models with each other. This is however a topic that is investigated in another ongoing project within the BBI department at Ericsson.

4 Methodology

The project is divided into three main parts, which will be described in the following sections.

4.1 Literature study

The first part of this work will consist of a literature study. The goal is to identify key features and characteristics, both of the programming model and available hardware, to analyze. Special emphasis will be put on key differences between different hardware architectures. The literature study can be found in Section 5.

4.2 Development

A proof-of-concept implementation of some parts of CBB on a new hardware architecture will be built. This is described in Section 9. Some of the tools used in this process are provided by the BBI department at Ericsson. This includes EMCA IDE, which is the Eclipse-based Integrated Development Environment (IDE) that is used internally at Ericsson to create CBB applications.

The literature study will be used as a basis when selecting which hardware platform, and which additional technologies, will be used during the implementation phase. This selection process is outlined in Sections 6 and 7.

4.3 Testing

A series of benchmark tests will be created using the previously created CBB prototype. The same set of tests will also be created using CBB for EMCA. This process is described in detail in Section 10.


A performance analysis tool will be used for gathering performance metrics from the targeted hardware platform. The most important requirement is to access hardware performance counters (see Section 2.7 for more details) for collecting cache performance metrics. The perf tool fits this requirement [18]. It is readily available on Linux systems, and is therefore the performance tool of choice for this project. Execution time will be measured in code, with built-in timing functions which are described in Section 8.2.
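For illustration, typical perf invocations for this kind of measurement might look as follows; the event names are common examples that vary between CPUs, and these are not necessarily the exact commands used in the project.

# Count cycles, instructions and L1 data-cache events for one run of ./app
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./app

# Sample where time is spent, then inspect the recorded profile
perf record ./app
perf report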

5 Literature study

The literature study is split up into three parts. Section 5.1 contains a comparison of the characteristics of three different hardware architectures. Section 2.9 describes the programming model used in this project. Section 5.2 summarizes academic work which may be valuable in later parts of the project.

5.1 Comparison of architectures

This section will compare EMCA to x86-64 and also to ARMv8, and highlight some of the key differences. Intel 64 (Intel's x86-64 implementation) and ARMv8-A (the general-purpose profile of ARMv8) will be used as reference for most of the comparisons.

5.1.1 Memory system

The memory system of EMCA is one of the key characteristics that set it apart from the typical commercial architecture. Each processor core has a private scratchpad memory for data, and also a private scratchpad memory for program instructions. These memory modules will be referred to as DSP data scratchpad and DSP instruction scratchpad throughout the rest of this report. There is some hardware support for loading program instructions into the DSP instruction scratchpad automatically, making it behave similarly to an instruction cache, but for the DSP data scratchpad all data handling has to be done in software.

There is also an on-chip memory module, the shared memory, that all cores can use. It has significantly larger capacity than the scratchpad memories, and it is designed to behave in a predictable way (for example by offering bandwidth guarantees for every access).


One of the main reasons for doing so much of the baseband software development in-house is the memory system of EMCA. Most software is designed to run on a cache-coherent memory model, which is not present in EMCA.

x86-64 designs like Intel's Sunny Cove cores (used within the Ice Lake processor family) have a three-tier cache hierarchy [8]. Each core has a split Level 1 (L1) cache, one for data (L1D) and one for instructions (L1I), and a unified Level 2 (L2) cache which has ∼5-10x the capacity of the combined L1. The architecture features a unified Level 3 (L3) cache which is shared among all cores. It is designed so that each core can use a certain amount of its capacity. Information about the cache coherency protocol used is not publicly available, but earlier Intel designs have been reported to use the MESIF (Modified, Exclusive, Shared, Invalid, Forward) protocol [13, p. 362], which is a snoop-based coherence protocol.

Contemporary ARMv8 designs feature a cache hierarchy of two or more levels [3]. Each core has its own L1D and L1I cache combined with a larger L2 cache, just like x86-64. The cache sizes vary between implementations. It is possible to extend the cache system with an external last-level cache (LLC) that can be shared among a cluster of processor cores, but this depends on the particular implementation. Details about the cache coherency protocol used by ARM are not publicly available.

5.1.2 Processor cores

EMCA is characterized as a manycore design, and it has a higher number of cores than many x86-64 or ARMv8 chips. However, most x86-64 chips and many ARMv8 chips support simultaneous multithreading (SMT), so that each processor core can issue multiple instructions from different software threads simultaneously [28]. This is achieved by duplicating some of the elements of the processor pipeline, and this technique gives the programmer access to a higher number of virtual cores and threads than the actual core count. EMCA does not support SMT.

The processor cores inside EMCA are characterized as Very Long Instruction Word (VLIW) processors. This means that they have wide pipelines with many functional units that can do calculations in parallel, and it is the compiler's job to find instruction-level parallelism (ILP) and put together bundles of instructions (i.e. instruction words) that can be issued simultaneously. The instruction bundles can vary in length depending on what instructions they contain. Each core has an in-order pipeline, as opposed to the out-of-order pipelines found in both x86-64 and ARMv8.

Since EMCA is developed for a certain set of calculations, namely digital signal processing (DSP) algorithms, its processor cores are optimized for this purpose. There is however nothing fundamentally different about how they execute each instruction compared to other architectures.

5.1.3 SIMD operations

The x86-64 architecture features a range of vector instruction sets, of which the latest generation is named Advanced Vector Extensions (AVX) and exists in a couple of different versions. AVX512, introduced by Intel in 2013 [22], features 512-bit wide vector registers which can be used for vector operations. All instructions perform operations on vectors with fixed lengths. AMD processors currently support only AVX2 (with 256 bits as its maximum vector length), while Intel has support for AVX512 in most of its current processors.

ARMv8 features the Scalable Vector Extension (SVE) [24]. As the name implies, the size of the vector registers used in this architecture is not fixed. It is instead an implementation choice, where the size can vary from 128 bits to 2048 bits (in 128-bit increments). Writing vectorized code for this architecture is done in the Vector-Length Agnostic (VLA) programming model, which consists of assembly instructions that automatically adapt to whatever vector registers are available at run-time. This means that there is no need to recompile code for different ARM chips to take advantage of vectorization, and also no need to write assembly intrinsics by hand. SVE was first announced in 2017, and details about the latest version (SVE2) were released in 2019 [19]. As of today, only higher-end ARM designs feature SVE. Most designs do however support the older NEON extension, utilizing fixed-size vector registers of up to 128 bits.

The Instruction Set Architecture (ISA) used with EMCA has support for SIMD instructions targeted at certain operations commonly used in DSP applications. One example is multiply-accumulate (MAC), which is accelerated in hardware. Similar instructions are available in AVX and SVE as well.

5.1.4 Memory models

The x86-64 architecture uses Total Store Order (TSO) as its memory model [20, p. 39]. There has been some debate about this statement, but most academic sources claim that it holds, and the details are not relevant enough to cover here. With TSO, each core has a FIFO store buffer that ensures that all store instructions from that core are issued in program order. The load instructions are however issued directly to the memory system, meaning that loads can bypass stores. This configuration is shown in Figure 4.

Figure 4 Conceptual view of a multicore system implementing TSO [7, Figure 4.4 (b)]. Store instructions are issued to a FIFO store buffer before entering the memory system.

ARM systems use a weakly consistent memory model [6] (also called relaxed consistency). This model makes no guarantees at all regarding the observable order of loads and stores. It can do all sorts of reordering: store-store, load-load, load-store and store-load. Writing parallel software for an ARM processor can therefore be more challenging than doing the same for an x86 processor, since weak consistency requires more effort to ensure program correctness (for example by inserting memory barriers/fences where order must be preserved). The upside is that more optimizations can be done both in software and in hardware, giving the weakly consistent system potential to run an instruction stream faster than a TSO system could. Two store instructions can for example be issued in reverse program order, which is not possible under TSO.

The memory model of EMCA does not guarantee a global ordering of instructions, although there are synchronization primitives for enforcing a global order when needed. Further details on its memory model are not publicly available.

5.2 Related academic work

This section summarizes earlier academic work in related areas which are useful within this project.

5.2.1 The Art Of Processor Benchmarking: A BDTI White Paper

Berkeley Design Technology, Inc. (BDTI) is one of the leading providers of benchmarking suites for DSP applications. This white paper [5] aims to explain the key factors that determine the relevance and quality of a benchmark test. It discusses how to distinguish good benchmarks from bad ones, and when to trust their results.

Figure 5 Categorization of DSP benchmarks from simple (bottom) to complex (top) [5, Figure 1]. The grey area shows examples of benchmarks that BDTI provides.

They argue that a trade-off has to be made between practicality and complexity. Figure 5 shows four different categories of signal processing benchmarks. Simple ones, based for example on additions and multiply-accumulate (MAC) operations, may be easy to design and run, but may not provide very meaningful results for the particular testing purpose. On the other side of the spectrum there are full applications that may provide useful results, but may be unnecessarily complex to implement across many hardware architectures. Somewhere, often in between the two extremes, there is a sweet spot that provides meaningful results without being too specific. A useful benchmark must however perform the same kind of work that will be used in the real-life scenario that the processor is tested for.

Another factor is optimization. The performance-critical sections of embedded signal processing applications are often hand-optimized, sometimes down to assembly level. Different processors support different types of optimizations (for example different SIMD operations), and allowing all these optimizations in a benchmark makes it more complex and hardware-specific but can also expose more of the available performance.

Many benchmarks focus on achieving maximum speed, but other metrics (such as memory use, energy efficiency and cost efficiency) might also be important factors when determining if a particular processor is suitable for the task. It can also be important to reason about the comparability of the results across multiple hardware architectures, instead of only looking at one processor in isolation.


5.2.2 A DSP Acceleration Framework For Software-Defined Radios On x86-64

This article [11] is concerned with the use of COTS devices for implementing baseband functions within Software-Defined Radios (SDR). The goal is to accelerate common DSP operations with the use of SIMD instructions available in modern x86-64 processors.

The OpenAirInterface (OAI), which is an open-source framework for deploying cellular network SDRs on x86 and ARM hardware, is used as a baseline. Some of its existing functions are using 128-bit vector instructions. The authors extend OAI with an acceleration and profiling framework using Intel's AVX512 instruction set. They implement a number of common algorithms, targeting massive multiple-input multiple-output (MIMO) use cases.

A speedup of up to 10x is observed for the DSP functions implemented, compared to the previous implementation within OAI. Most previous studies within the field have focused on application-specific processors and architectures. This study highlights some of the potential for using SIMD features in modern x86-64 processors for baseband applications.

5.2.3 Friendly Fire: Understanding the Effects of Multiprocessor Prefetches

Prefetching is an important feature of modern computer systems, and its effects are widely understood in single-core systems. This article [16] investigates side-effects that different prefetching schemes can cause in multicore systems with cache coherency, and when these can become harmful.

Four prefetching schemes are investigated: sequential prefetching, Content Directed Data Prefetching (CDDP), wrong path prefetching and exclusive prefetching. Measurements are done in a simulator implementing an out-of-order sequentially consistent system using the MOESI cache coherence protocol.

The result is a taxonomy of 29 different prefetch interactions and their effects in a multicore system. The harmful prefetch scenarios are categorized into three groups:

• Local conflicting prefetches: A prefetch in the local core forces an eviction of a useful cache line, which is referenced in the code before the prefetched cache line is.


• Remote harmful prefetches: A prefetch that causes a downgrade in a remote core followed by an upgrade in the same remote core, before the prefetched cache line is referenced locally. This upgrade will evict the cache line in the local core, making it useless.

• Harmful speculation: Prefetching a cache line speculatively, causing unnecessary coherence transactions in other cores. Can for example cause a remote harmful prefetch.

Performance measurements within the different prefetching schemes show that these prefetching effects can be harmful to performance. Some optimizations that can mitigate this effect are also briefly discussed.

5.2.4 Analysis of Scratchpad and Data-Cache Performance Using Statistical Methods

Choosing the right memory technology is important to get good performance and energy efficiency out of embedded systems. This study [15] compares how cache memory and scratchpad memory perform in different types of data-heavy application workloads. It is commonly believed that scratchpad memory is better for regular and predictable access patterns, while cache memory is preferable when access patterns are irregular.

The authors use a statistical model involving access probabilities, i.e. the probability that a certain data object is the next to be referenced in the code. They use this to calculate the optimal behavior of a scratchpad memory, and compare it to cache hit ratios. This is done both analytically and empirically. Matrix multiplication is used as an example of a workload with a regular access pattern. Applications involving trees, heaps, graphs and linked lists are seen as having irregular access patterns.

This work proves that scratchpad memory can always outperform cache memory, if an optimal mapping based on access probabilities is used. Increasing the cache associativity is shown not to improve the cache performance significantly.


6 Selection of software framework

6.1 MPI

Message Passing Interface (MPI) is a library standard formed by the MPI Forum, which has a large number of participants (including hardware vendors, research organizations and software developers) [4]. The first version of the standard specification emerged in the mid-1990s, and MPI has since then become the de-facto standard for implementing message passing programs for high-performance computing (HPC) applications. MPI 3.1, approved in June 2015, is the latest revision of the standard. MPI can create processes in a computer system, but not threads.

MPI is designed to primarily use the message-passing parallel programming model, where messages are passed by moving data from the address space of one process to the address space of another process. This is done through certain operations that both the sender and the receiver must participate in. One of the main goals of MPI is to offer portability, so that the software developer can use the same code in different systems with varying memory hierarchies and communication interconnects.

Since MPI is a specification rather than a library, there are many different implementations. Some examples are Open MPI, MPICH and Intel MPI. There are differences across implementations that might affect performance, but these differences are not focused on within this project.

6.1.1 Why MPI?

Choosing MPI as a basis for implementing CBB on a new hardware architecture is motivated by how well its concepts match the concepts found in the actor model and CBB. This includes the following (a minimal code sketch follows the list):

• Independent execution. An MPI process is an independent unit of computation with its own internal state, just like an actor. It can execute code completely independently of other processes. A process has its own address space in the computer system.

• Asynchronous message passing. An MPI process can send any number of messages to other processes, using ranks (equivalent to process IDs) as addresses. The operations for sending and receiving messages can be run asynchronously, just like with actors. An MPI process has one FIFO message queue by default.
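As a minimal illustration of this mapping (a sketch, not the thesis implementation), the program below starts two MPI processes: rank 0 plays the role of one actor and sends a message to rank 1, which does some work and replies with an acknowledgement. The message contents and tags are invented for the example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int payload = 42, ack = 0;                 /* hypothetical message contents */
    if (rank == 0) {                           /* "actor A" */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&ack, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("actor A received ack %d\n", ack);
    } else if (rank == 1) {                    /* "actor B" */
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        ack = 1;                               /* ...do some work on the payload here... */
        MPI_Send(&ack, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}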


6.1.2 MPICH

The MPI implementation used within this project is MPICH. It was initially developed along with the original MPI standard in 1992 [25]. It is a portable open-source implementation, one of the most widely used today. It supports the latest MPI standard and has good user documentation. The goals of the MPICH project, as stated on the project website, are:

• To provide an MPI implementation that efficiently supports different computation and communication platforms including commodity clusters, high-speed networks and proprietary high-end computing systems.

• To enable cutting-edge research in MPI through an easy-to-extend modular framework for other derived implementations.

MPICH has been used as a basis for many other MPI implementations including Intel MPI, Microsoft MPI and MVAPICH2 [29]. Since MPICH is designed to be portable it can run on x86-64, ARMv8 and also on a variety of other computer systems.
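For reference, building and launching a program with MPICH typically looks like the following; the source file name and process count are just placeholders.

mpicc -O2 cbb_prototype.c -o cbb_prototype    # compile and link with the MPI wrapper compiler
mpiexec -n 4 ./cbb_prototype                  # launch four MPI processes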

6.1.3 Open MPI

The initial choice for an MPI implementation to use within this project was Open MPI. It is one of the most commonly used implementations and it has excellent user documentation. Open MPI is an open-source implementation developed by a consortium of partners within academic research and the HPC industry [10]. Some of the goals of the Open MPI project are:

• To create a free, open source, peer-reviewed, production-quality complete MPI implementation.

• To directly involve the HPC community with external development and feedback (vendors, 3rd party researchers, users, etc.).

• To provide a stable platform for 3rd party research and commercial development.

Unfortunately there were problems with getting Open MPI to run properly on the x86-64 system used for testing (which is described in Section 8.3). Instead of spending time debugging these problems, Open MPI was replaced with MPICH, which worked without problems.

7 Selection of target platform

As seen in the architecture comparisons in Section 5.1, modern x86-64 and ARMv8 hardware have many similarities. They both incorporate high-performance out-of-order cores with multiple levels of cache, since they are both targeted at general-purpose computing where these features can be beneficial. The differences between EMCA and x86-64 are of the same nature as the differences between EMCA and ARMv8. The decision between x86-64 and ARMv8 is therefore not as significant as the overall question:

What happens when we run CBB applications on modern general-purpose hardware?

The choice of a target platform instead comes down to availability. Ericsson’s development environment is run on x86-64 servers, and accessing additional x86-64 hardware for testing within Ericsson has been less difficult than accessing ARMv8 hardware. Doing all development and testing on x86-64 hardware was therefore the natural choice.

As mentioned in Section 6.1.2, the CBB implementation created will work with both hardware platforms since MPICH programs can be compiled and run on both. Porting the CBB implementation from x86-64 to ARMv8 would simply mean moving the source code and recompiling it.

8 Evaluation methods

As described by BDTI [5] (see Section 5.2.1), there has to be a trade-off between practicality and complexity when designing benchmark tests. There is often a sweet spot of tests that provide meaningful results without being too specific. This is the goal of the tests used in this project, which are all described in the following sections.

8.1 Strong scaling and weak scaling

One of the fundamental goals of CBB and EMCA, as described in Section 2.9, is to enable massive parallelism. With this in mind, it is reasonable to test how the degree of parallelism in an application affects performance within the targeted hardware platform. A simple two-actor application will be used, following this basic structure:

1. Actor A and actor B get initiated.

2. Actor A sends a message to actor B, containing a small piece of data.

3. Actor B receives the message and performs some work (described in Sections 8.1.1, 8.1.2 and 8.1.3).

4. Actor B sends a message back to actor A, acknowledging that all calculations are completed.

5. Both actors terminate and the test is finished.

To test different degrees of parallelism, multiple instances of the two-actor application will be run. These instances are completely independent of each other, which means that the parallelism present in the software (namely the data parallelism) will scale perfectly. See Section 2.2 for more details on parallel computing. Each benchmark test will be run with four actors (two of actor A, two of actor B), and then the number of actors will be increased before running the test again. This process will be repeated until reaching 1024 actors, which is significantly more than the number of available processor cores in the computer systems used for testing (described in Section 8.3).

Two variations of each benchmark test will be evaluated:

1. Weak scaling: The size of the problem will increase along with the number of program instances.

2. Strong scaling: The size of the problem will be fixed, and split up among all instances of the two-actor application.

8.1.1 Compute-intensive benchmark

As described in Section 2.10, Layer 1 functionality is computationally intensive and is responsible for most of the compute cycles within the baseband domain. This will be simulated by letting actor B perform a large amount of computation after receiving data from actor A. The computation at actor B will consist of addition operations of the following form:

result = result + 10000;

Here, result will be an unsigned 16-bit integer. It will overflow many times during the test, so that the result will always be between 0 and 2^16. The addition operation will be repeated 1 million times by each actor B in the weak scaling scenario, and (1 million)/N times by each actor B in the strong scaling scenario with N program instances.
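As a minimal sketch (the function name and the iterations parameter are introduced here for illustration and are not part of the generated CBB code), the kernel run by actor B could look as follows:

#include <stdint.h>

uint16_t run_additions(long iterations)
{
    uint16_t result = 0;
    for (long i = 0; i < iterations; i++)
        result = result + 10000;   /* wraps around modulo 2^16, as intended */
    return result;
}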

8.1.2 Memory-intensive benchmark without reuse

The behavior of Layer 2 applications, as described in Section 2.10, is memory intensive. This will be simulated by letting actor B allocate and loop through a data vector after receiving its message from actor A, touching each element once. Two vector sizes will be tested: 1 MB and 128 kB. In the weak scaling scenario, each actor B will have its own data vector of this size. With strong scaling, each actor B will have a data vector sized (initial vector size)/N with N program instances. The data vectors will be dynamically allocated (on the heap).

8.1.3 Memory-intensive benchmark with reuse

This is a variation of the test described in Section 8.1.2. The only difference is that, in this test, actor B will loop through its own data 1000 times. This means that there will be data reuse, so that the application can make use of locality and caches.
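A minimal sketch covering both memory benchmarks (the function name and parameters are illustrative, not the actual benchmark code) could look as follows, with passes set to 1 for the test without reuse and to 1000 for the test with reuse:

#include <stdlib.h>

void run_memory_benchmark(size_t vector_bytes, int passes)
{
    char *vec = malloc(vector_bytes);          /* dynamically allocated, on the heap */
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < vector_bytes; i++)
            vec[i] = (char)i;                  /* touch each element once per pass */
    free(vec);
}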

8.1.4 Benchmark tests in summary

To cover all variations of the different benchmarks, 10 individual test cases will be created and run. These are:

1. Compute-intensive test with weak scaling.

2. Compute-intensive test with strong scaling.

3. Memory-intensive test with no data reuse and weak scaling, with a 1 MB vector.

4. Memory-intensive test with no data reuse and weak scaling, with a 128 kB vector.

5. Memory-intensive test with no data reuse and strong scaling, with a 1 MB vector.

6. Memory-intensive test with no data reuse and strong scaling, with a 128 kB vector.

7. Memory-intensive test with data reuse and weak scaling, with a 1 MB vector.

8. Memory-intensive test with data reuse and weak scaling, with a 128 kB vector.

9. Memory-intensive test with data reuse and strong scaling, with a 1 MB vector.

10. Memory-intensive test with data reuse and strong scaling, with a 128 kB vector.

8.2 Collection of performance metrics

At each individual step of the benchmark tests described above, these metrics will be recorded:

• Execution time: This indicates the overall performance, in terms of pure speed. Shorter execution times are better. When running on x86-64, this metric will be measured using the built-in MPI_Wtime function in MPI. A corresponding function available in EMCA systems will be used there. The timing methodology follows this scheme:

1. All actors in the system get initialized, and then synchronize using a barrier or similar.

2. One actor collects the current time.

3. The actors do their work.

4. All actors synchronize again.

5. One actor collects the current time again and subtracts the previously collected time.

With this method, all overhead associated with initializing and terminating the execution environment is excluded. The measured time instead only shows how long the actual work within the benchmark tests takes. A minimal sketch of this scheme is given at the end of this section.

• Cache misses in L1D: This shows us how the cache system and the hardware prefetcher perform. Lower numbers of cache misses are preferable. See Section 2.3 for more details. This metric is available in x86-64 systems but not in EMCA, which does not have caches. The perf tool will be used for this. The command used for running a test and measuring cache behavior is:

$ perf stat -e L1-dcache-loads,L1-dcache-load-misses

This shows the number of loads and misses in the L1D cache, and also the miss ratio in %.
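One possible form of the full command (assuming the benchmark binary bin/test and the mpiexec options introduced in Section 10.1; perf stat counts the launched command and its child processes) would be:

$ perf stat -e L1-dcache-loads,L1-dcache-load-misses mpiexec -n 64 -bind-to hwthread bin/test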

All steps of each benchmark test will be run three times, and then the average numbers produced from these three runs will be used as a result. This is to even out the effects of unpredictable factors that might produce noisy results.
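As a minimal sketch of the timing scheme described in the execution-time bullet above (the helper function and the work callback are illustrative, not the actual benchmark code):

#include <mpi.h>

static double timed_section(void (*work)(void))
{
    MPI_Barrier(MPI_COMM_WORLD);        /* 1. all actors synchronize            */
    double start = MPI_Wtime();         /* 2. read the clock                    */
    work();                             /* 3. the benchmark work runs here      */
    MPI_Barrier(MPI_COMM_WORLD);        /* 4. all actors synchronize again      */
    return MPI_Wtime() - start;         /* 5. elapsed time for the work only    */
}

The returned time would typically be reported by one rank only.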

8.3 Systems used for testing

The x86-64 system that will be used for running benchmarks is a high-end server machine with AMD processors built on their Zen 2 microarchitecture [23]. Some of its specifications are summarized below.

• 2 x AMD EPYC 7662 processors.

• 128 physical cores in total (64 per chip).

• 256 virtual cores in total (128 per chip) using SMT.

• 32 kB of private L1D cache per physical core.

• 512 kB of private L2 cache per physical core.

• 4 MB of L3 cache per physical core (16 MB shared across four cores in each core complex).

• The system is running Linux Ubuntu 20.04.1 LTS.

[Figure: hwloc topology output showing the machine (503 GB total) with two packages; each package contains 16 four-core complexes with 16 MB L3 each, every core has 512 kB L2, 32 kB L1d and 32 kB L1i caches and two hardware threads (PUs), and each package forms one NUMA node with 252 GB of memory.]

Figure 6 Processor topology of the x86-64 system used for testing.

Figure 6 shows the topology of the processor cores and cache hierarchy in the x86-64 system. This graphical representation was obtained with the hwloc command line tool in Linux.

The benchmarks will also be run on a recent iteration of Ericsson’s manycore hardware. As discussed in Section 5.1, it has a number of specialized DSP cores with private scratchpad memories for instructions and data, and also an on-chip shared memory that all cores can use. Detailed specifications of this processor are confidential and cannot be disclosed here.

9 Implementation of CBB actors using MPI

[Figure: a top-level CSC box labeled sc containing two connected CBC instances, dataActor:dataActor with ports dataP and ~out, and printerActor:printerActor with port in.]

Figure 7 The CBB application used for implementation.

A simple CBB application was created using EMCA IDE, consisting of just two CBCs. A graphical representation of the application is seen in Figure 7. Here, the outer box labeled sc represents the top-level CSC. There is one CBC instance named dataActor which is of the CBC type with the same name, and one instance printerActor of the type with the same name. The names of the CBCs originate from some of Ericsson’s user tutorial material, and are not representative of their behavior.

The out port of dataActor is connected to the in port of printerActor, meaning that they are aware of each other’s existence and can send messages to each other. The coloring of the ports in Figure 7 and the “∼” label on one of them symbolize the direction of the communication; certain message types can be sent from dataActor to printerActor, and other message types can be sent in the other direction. The ports named dataP, which connect dataActor to the edge of the CSC, will not be used.

The application was run through the CBB transform targeting EMCA, which generated a number of C code and header files. These files were then used as a basis for creating new C functions targeting the x86-64 platform, with the help of MPI calls. Details about all the MPI routines mentioned in the following sections can be found on the official MPICH documentation webpage [26].

9.1 Sending messages

CBB generates a “send” function for each port of each CBC. The contents of these functions were rewritten to do the following:

1. MPI_Isend is used to post a non-blocking send request. This call returns immediately without ensuring that the message has been delivered to its destination. The function gets arguments describing the message contents and what process to deliver it to.

2. MPI_Wait will then block code execution until the MPI_Isend operation has moved the message out of its send buffer, so that the buffer can be reused.

The reason for using these two MPI calls instead of MPI_Send, which is a single blocking call that performs the same task, is to enable an overlap between communication and computation. This could be accomplished by doing some calculations in between the two MPI calls.
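A minimal sketch of such a rewritten send body (the function name, the byte-oriented payload and the destination rank are illustrative, not Ericsson’s generated code):

#include <mpi.h>

void send_on_port(const void *msg, int msg_bytes, int dest_rank, int tag)
{
    MPI_Request req;
    /* Post a non-blocking send; this returns immediately. */
    MPI_Isend(msg, msg_bytes, MPI_BYTE, dest_rank, tag, MPI_COMM_WORLD, &req);
    /* Computation could be overlapped with the transfer here. */
    /* Block until the send buffer may safely be reused. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}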

9.2 Receiving messages

There is a generated “receive” function for each port of each CBC. These were rewritten to have the following behavior:

1. MPI_Probe is a blocking call that checks for incoming messages. When it detects a message, it will write some information to a status variable and return. The status information includes the tag of the message and also the ID of the source process.

2. MPI_Get_count is then used to determine how many bytes of data the message contains.

3. MPI_Recv is called last. This function is a blocking call, but will not cause a stall in execution since the previous calls have ensured that there is actually an incoming message.

Using these three MPI calls instead of only MPI_Recv allows for receiving messages without knowing all specifics; the MPI_Recv function requires arguments describing the tag, source and size of the incoming message, which we find out using MPI_Probe and MPI_Get_count. This enables the definition of message types with varying data contents.
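A minimal sketch of such a rewritten receive body (the function name and the dynamic buffer handling are illustrative, not Ericsson’s generated code):

#include <mpi.h>
#include <stdlib.h>

void receive_on_port(void **msg_out, int *bytes_out, int *src_out, int *tag_out)
{
    MPI_Status status;
    /* Block until a message from any source with any tag is available. */
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    /* Ask how many bytes the pending message carries. */
    MPI_Get_count(&status, MPI_BYTE, bytes_out);
    *msg_out = malloc(*bytes_out);
    /* The blocking receive cannot stall: the message is already there. */
    MPI_Recv(*msg_out, *bytes_out, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    *src_out = status.MPI_SOURCE;
    *tag_out = status.MPI_TAG;
}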

10 Creating and running benchmark tests

10.1 MPI for x86-64

With MPI, there is a need to initialize the execution environment and create the MPI processes corresponding to each CBC. This is done in an additional code file, test_cases.c, which is used to run the actual tests. Each MPI process runs its own copy of the code, which follows this basic structure (a minimal sketch is given after the list):

1. Initiate the MPI execution environment with MPI_Init and determine the process ID by calling MPI_Comm_rank.

2. Use the process ID to determine which actor type and instance it corresponds to, according to a lookup table or similar. The actor now knows whether it is a dataActor or a printerActor in this case.

3. Run code corresponding to the current test case. This part will differ depending on what kind of test is being run; see Section 8.1. The time measurement, as discussed in Section 8.2, is also a part of this step.

4. Terminate the MPI process and exit the execution environment with MPI_Finalize.
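A minimal sketch of this per-process structure (the rank-to-actor mapping and the two work functions are placeholders, not the actual contents of test_cases.c):

#include <mpi.h>

static void run_data_actor(int rank)    { (void)rank; /* benchmark-specific work */ }
static void run_printer_actor(int rank) { (void)rank; /* benchmark-specific work */ }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);                 /* step 1: start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* step 1: this process's ID */

    if (rank % 2 == 0)                      /* step 2: map rank to an actor type */
        run_data_actor(rank);               /* step 3: run the current test case */
    else
        run_printer_actor(rank);

    MPI_Finalize();                         /* step 4: leave the MPI environment */
    return 0;
}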

All necessary code files and headers are compiled using mpicc, which is a compiler command for MPICH that uses the default C compiler of the system (gcc in this case) along with the additional linkage needed by MPICH. make is used to produce a single binary for the complete application. To run a test case, a variation of the following command is used:

$ mpiexec -n X -bind-to hwthread bin/test

Here X is the total number of actors (MPI processes) that will be present in the system. Since the actors operate in pairs (with one doing work and the other one just sending messages), X must be an even number. The -bind-to hwthread option makes sure that every MPI process gets associated with one hardware thread (virtual core), which reduces the process management overhead. This is beneficial for performance and also makes the behavior more predictable. bin/test points to the binary generated by mpicc.
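For example, running 32 independent two-actor pairs (64 MPI processes in total) would look like this:

$ mpiexec -n 64 -bind-to hwthread bin/test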

10.2 CBB for EMCA

Creating the benchmark tests with CBB is a simpler process. The two-actor application seen in Figure 7 had already been generated using EMCA IDE and the CBB transform (as described in Section 9). The code for doing the actual work (as described in Sections 8.1.1, 8.1.2 and 8.1.3) is then added inside code files generated by the CBB transform. Further details about the structure of the code inside CBB will not be described here.

Since the memory allocation in the DSP data scratchpad and the shared memory has to be done manually on EMCA, and they do not share address spaces, this structure was used for allocating memory in the memory-intensive benchmark tests on EMCA:

if (vector_size < threshold_size) {
    // allocate space in DSP data scratchpad
} else {
    // allocate space in shared memory
}

Here, the threshold_size is used as a cross-over point between the two memory units. It is smaller than the actual size of the DSP data scratchpad memory, which leaves space that could be used for system-internal data.

11 Results and discussion

This section contains results collected when running the benchmark tests on EMCA and x86-64. To make the results comparable across architectures, all execution times have been normalized. This means that each individual execution time is divided by the first execution time in that series, so that every data series (every line in a graph) starts with 1.0. Hence any value above 1.0 means worse (slower) performance, and any value below 1.0 means better (faster) performance. This makes it possible to focus on the scaling properties in each benchmark test, instead of execution times in absolute terms. Actual execution times are discussed briefly but not shown in figures.

In the tests that involve strong scaling it is also relevant to look at the speedup, which is the inverse of the normalized execution time. This means that we divide the initial execution time by the current execution time. That shows us how many times faster the execution is compared to the first run (represented by 1.0 here as well). Thus, a lower number means worse performance and a higher number means better performance.
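Expressed as formulas (the symbols are introduced here for clarity and do not appear elsewhere in the report), with T_first being the execution time of the first run in a data series and T_N the execution time with N actors in total:

    normalized execution time = T_N / T_first
    speedup = T_first / T_N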

In Section 11.3, covering the memory-intensive benchmark with data reuse, cache miss ratios in x86-64 will also be presented. This is the only test that takes advantage of caches, which is why this metric is relevant here but not in the other tests.

Finally there will be a discussion about complexity and optimizations in the software, and how these factors affect the performance results. This discussion is found in Section 11.4.

11.1 Compute-intensive benchmark

[Line chart: execution time (normalized per data series) versus total number of actors/processes, for the x86 and EMCA series.]

Figure 8 Normalized execution times for the compute-intensive benchmark test with weak scaling.

Figure 8 shows how both systems perform in the compute-intensive test with weak scaling. We can see that the execution time in the EMCA system scales linearly with the number of actors created. This is true even when the number of actors gets significantly larger than the number of processor cores available.

The x86-64 system behaves very differently. It performs well compared to EMCA up until hitting 256 actors, which is also the number of virtual processor cores available in the system. After that we can see a big performance penalty. The execution time increases fairly linearly until 768 actors. At this point, the execution time is over 80 times longer than the baseline.

The normalized execution times for x86-64 finally taper off slightly, ending at around 75 for 1024 actors. This is an interesting phenomenon that could possibly be explained by the distribution of actors onto processor cores. Half of the actors spend most of their time performing computations and the other half spend their time waiting for a message response, so there could be a congestion of many actors doing computations in the same virtual core. In this case these have to wait for each other, causing a performance penalty. If the actors doing computations are more evenly distributed across the cores, they interfere less with each other, which could make for better performance even though the number of actors increases. This is a theory that could be confirmed with further experiments in the future. The phenomenon appears in all the x86-64 benchmark tests, but will not be discussed again in later sections.

[Line chart: execution time (normalized per data series) versus total number of actors/processes, for the x86 and EMCA series.]

Figure 9 Normalized execution times for the compute-intensive benchmark test with strong scaling.

In Figure 9 we can see the normalized execution times when testing the strong scaling property. The behavior of EMCA is indistinguishable from the previous test, with a linearly increasing normalized execution time ending at just below 60. Since the numbers are so similar in the compute-intensive benchmark both when the total amount of computation increases (weak scaling) and when it is fixed (strong scaling), one could assume that this kind of computation is “cheap” even when done many times. In that case the overhead cost of managing many actors in the system dominates, and it is this cost that we can see in Figure 8 and Figure 9. This theory is investigated further in Section 11.1.1.

The x86-64 system seems to have a similar behavior as before when the number of actors passes the number of virtual processor cores. The big difference is that the normalized execution time ends at around 160 in this test, which is close to double what we saw in the weak scaling test. This reveals a possible drawback of the methodology used. Since each series starts with two actors doing computations (and two more which send and wait for messages), the weak scaling actors do 1 million additions each. When looking at absolute execution time (not seen in the figures), this test takes 2.35 ms. The strong scaling actors start at (1 million)/2 = 0.5 million additions, taking 1.11 ms. With 1024 actors the weak scaling variant takes 153 ms while the strong scaling variant takes 174 ms, which are much more similar numbers. If both tests had started with only one actor doing computations (two actors in total), both scaling variants would have started at around 2 ms and we would have seen much more similar numbers for normalized execution time when the number of actors gets large.

[Line chart: speedup versus total number of actors/processes, for the x86 and EMCA series.]

Figure 10 Speedup for the compute-intensive benchmark test with strong scaling.

Finally, Figure 10 shows the speedup of the execution in the strong scaling variant. We can see that EMCA does not speed up at all. This matches the previous discussion well; computations are “cheap” and the overhead for creating actors dominates, so we get no benefit from splitting up the amount of computation among the processor cores.

The x86-64 data shows a different behavior. The amount of computation is actually visible in its performance, since there is an evident benefit in splitting it up among the cores. We get the best performance with 64 actors, producing a speedup of around 6. After that the overhead from creating and managing processes in the system takes over, and the speedup is back below 1.0 when we hit 256 actors. After that it only decreases further, which is not meaningful to include in the figure.

11.1.1 Was the test not compute-intensive enough for EMCA?

As discussed previously, the compute-intensive benchmark on EMCA gave us identical execution times with weak scaling and strong scaling. This was not expected, and a theory is that the test was not compute-intensive enough for the EMCA processor. To explore this theory, the test was modified to use 64-bit floating-point addition, which is a more computationally intensive operation than the previously used 16-bit integer addition. This test was run with 4, 16 and 128 actors.

[Line chart: speedup versus total number of actors/processes, for the EMCA series only.]

Figure 11 Speedup for the compute-intensive benchmark test with strong scaling and 64-bit floating-point addition. Only EMCA was tested.

The results from the modified compute-intensive benchmark test with floating-point addition on EMCA are shown in Figure 11. We can see that there was a performance benefit in doing strong scaling here, i.e. splitting up the floating-point calculations among actors. This shows that floating-point operations are indeed more computationally heavy on EMCA than what we tested previously, and it strengthens the theory that the previous test was not compute-intensive enough.

11.2 Memory-intensive benchmark with no data reuse

[Line chart: execution time (normalized per data series) versus total number of actors/processes, for the series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector and EMCA 128kB Vector.]

Figure 12 Normalized execution times for the memory-intensive benchmark with no data reuse and weak scaling.

Looking at Figure 12, we can see how the two systems perform when doing the memory-intensive benchmark with no data reuse and weak scaling. The EMCA system performs in a similar way with both the larger data vector and the smaller data vector, ending in normalized execution times of 73 and 100 respectively. When looking at actual execution times, we can see that the initial test run with the larger data vector has a 15-20 times longer execution time than the initial test run with the smaller vector. What we are seeing here are normalized execution times for allocating large vectors in the shared memory. It seems as though allocating and looping through large vectors takes longer to begin with, but scales better, than allocating and looping through small vectors.

An interesting point here is that there are cases where we try to allocate more memory in the shared memory than what is available. This did not seem to cause any trouble when running the test; all actors could allocate and write to their memory vectors at all times. A theory here is that the function used for dynamic memory allocation in EMCA does not return until there is available space in the shared memory. This should cause a congestion of actors waiting for memory allocation. This is however a theory that has not been confirmed, and it is not evident when looking at the normalized execution times.

The x86-64 system shows very different scaling properties for the two vector sizes, with similar behavior as EMCA for the larger vector and significantly worse scaling for the smaller vector. This can be explained by looking at the absolute execution times. With the larger vector, the initial execution time (with 4 actors in total) is 2.07 ms and the final execution time (with 1024 actors in total) is 179 ms. The corresponding times for the smaller vector are 0.3 ms and 172 ms respectively. Since the initial execution time is almost 7 times shorter with the smaller vector size, while both end up at very similar execution times, the normalized execution time for the smaller vector ends up in the 500-600 range. This is a drawback of showing normalized execution times instead of actual execution times. Showing actual execution times would however make it harder to compare architectures. In conclusion, allocating and writing data to a small vector is much faster than doing the same with a large vector, but when the number of processes and vectors gets very large the overhead from process management dominates and the execution times start to look more similar.

[Line chart: execution time (normalized per data series) versus total number of actors/processes, for the series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector and EMCA 128kB Vector.]

Figure 13 Normalized execution times for the memory-intensive benchmark with no data reuse and strong scaling.

When testing strong scaling, it is hard to evaluate EMCA by only looking at Figure 13 since the x86-64 numbers dominate. If we had a figure showing only the EMCA numbers, we would see that the normalized execution time with a 1 MB vector ends at 0.79 with 1024 actors, which means that this execution is even faster than when running 4 actors in total. This is a symptom of the difference in response time when dealing with the DSP data scratchpad compared to the shared memory. With the smaller vector the normalized execution time with 1024 actors is 10.1, which is because we deal less with shared memory to begin with. More on the DSP data scratchpad versus shared memory follows when discussing the speedup numbers below.

When looking at the x86-64 numbers, we can see a similar behavior as in the weak scaling test when the number of actors grows large. The gap between the two vector sizes is however even larger here, with the normalized execution times hitting a maximum of almost 1400. Like before, this is a result of the difference in initial execution times; 1.02 ms with the larger vector and 0.131 ms with the smaller vector. With 1024 actors, the execution times are 192 ms and 160 ms respectively. Again, they start with very different execution times because of the difference in vector sizes, but end up with very similar execution times when the process management overhead dominates.

[Line chart: speedup versus total number of actors/processes, for the series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector and EMCA 128kB Vector.]

Figure 14 Speedup for the memory-intensive benchmark with no data reuse and strong scaling.

As a continuation of the previous discussion we can now look at Figure 14, showing how the two systems benefit from splitting up the work across many actors. With EMCA, we can see something interesting when testing the large vector. The speedup increases dramatically when the number of actors increases, peaking at a 16x performance gain for 64 actors. This is because the 1 MB vector is split up among the actors so that each actor can put its chunk inside the DSP data scratchpad at some point, instead of the shared memory. As described in Section 10.2, all this has to be done in code instead of hardware. Since the benchmark is written in such a way that anything smaller than a threshold value is put in the DSP data scratchpad and anything larger is put in the shared memory, there is a point (somewhere between 32 and 64 actors in total) where we cross this threshold. The biggest jump in speedup is seen when going from 32 to 64 actors in total, so the results seem correct. After that point, the overhead from using many actors starts to show and the speedup tapers off.

With the smaller vector size tested on EMCA, the same effect is seen but on a much smaller scale. In the initial execution the data vectors are put in the shared memory, but very soon the vectors are instead put in the DSP data scratchpad. The maximum speedup observed is 4.6 with 8 actors in total. Then the overhead takes over and the speedup starts to decrease again.

With the x86-64 system, we see a similar speedup phenomenon as in EMCA for both the larger vector and the smaller vector. Here, the memory management is done with malloc and free and the memory system is abstracted away from the programmer. When examining the numbers more closely we can see that the test with the larger vector peaks at 32 actors in total (16 using memory), and at 16 actors in total (8 using memory) for the smaller vector. This translates to allocation of vectors of 64 kB and 16 kB respectively. It is not entirely apparent why the performance peaks at these particular points. The implementation of malloc and free can depend on both the compiler and the operating system, which is a factor that has not been investigated in this project. Also, there is no obvious caching benefit since we have no data reuse in this case.

11.3 Memory-intensive benchmark with data reuse

[Line chart: execution time (normalized per data series) versus total number of actors/processes, for the series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector and EMCA 128kB Vector.]

Figure 15 Normalized execution times for the memory-intensive benchmark with data reuse and weak scaling.

Figure 15 shows how the test systems behave when having a high degree of data reuse, with weak scaling. Both the larger vector and the smaller vector are larger than the threshold value used for the DSP data scratchpad of the EMCA system, meaning that all data is stored in shared memory in this entire test. We see a linear behavior with normalized execution times ending at just above 80 for both vector sizes. The conclusion that we can draw from this is that the allocation of shared memory behaves in a very predictable way.

The caches in the x86-64 system show their strengths in this benchmark test, with great scaling properties. The normalized execution times stay at just above 1 until we hit 256 actors, which is the same as the number of virtual cores in the system. More importantly, this means that the number of actors doing memory-intensive work is 128, which is the same as the number of physical cores. After that the normalized execution times increase, ending at around 5 for both vector sizes. The 1 MB vector is too large to fit in the L1D (32 kB) and L2 (512 kB), but it fits in the shared L3 cache (16 MB shared across four physical cores). The 128 kB data vector is also too large for the L1D cache but it fits inside the private L2 cache. The actual execution times show us that the test with the larger vector takes about 10x longer than with the smaller vector. This is assumed to be the difference in performance when comparing the L2 and the L3 caches in this particular system.

[Line chart: cache miss ratio versus total number of actors/processes, for the 1MB Vector and 128kB Vector series.]

Figure 16 Cache miss ratio in L1D for the memory-intensive benchmark with data reuse and weak scaling.

The cache miss ratios observed in L1D in this test, with weak scaling, are shown in Figure 16. Note the scaling of the y-axis; the observed miss ratios range from 0.37% to 1.68%, so the miss ratio stays very low at all times. Since the L1D in the system is 32 kB, none of the vectors will fit inside L1D at any time during this test. The low cache miss ratios are likely due to a very accurate hardware prefetcher, which can detect sequential access patterns and bring data into the L1D cache before it is needed. It is hard to analyze the fluctuations observed in cache miss ratios, since the numbers are so small. The overall conclusion is instead that the hardware prefetcher does a good job of keeping the miss ratios low in L1D. It could be interesting to collect this metric with the hardware prefetcher turned off. This is discussed in Section 13.3.

[Line chart: execution time (normalized per data series) versus total number of actors/processes, for the series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector and EMCA 128kB Vector.]

Figure 17 Normalized execution times for the memory-intensive benchmark with data reuse and strong scaling.

When looking at the strong scaling properties of this benchmark test in Figure 17, we can see that both the EMCA system and the x86-64 system benefit greatly from splitting up the memory-intensive work with data reuse across multiple processor cores (note the scaling of the y-axis here). With a large number of actors, we can see that EMCA outperforms the x86-64 system with better scaling properties.

The tests on x86-64 show a performance degradation when the number of actors passes the number of virtual cores. Just like what we have seen in previous tests, the normalization makes the test with the smaller vector size stand out as though it takes the biggest performance hit. This occurs when the data chunk used by each actor gets so small that looping through it does not take a significant amount of time. Instead, the process management overhead starts to show and the normalized execution time goes up. With the larger vector size, looping through the data (which is getting smaller and smaller) still dominates in the time measurements. Just like in previous tests, the test with the larger vector and the test with the smaller vector start at very different execution times with 4 actors in total (770 ms and 105 ms respectively) but end up in similar execution times with 1024 actors in total (181 ms and 170 ms respectively).

[Line chart: speedup versus total number of actors/processes, for the series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector and EMCA 128kB Vector.]

Figure 18 Speedup for the memory-intensive benchmark with data reuse and strong scaling.

When looking at the speedup numbers in Figure 18, the benefit of using locality in the memory system is obvious. With EMCA, the test with the larger vector size peaks at a speedup of 15.3 with 48 actors doing memory-intensive work (96 actors in total). The speedup with the smaller vector size gets to similar numbers, but declines more rapidly when the number of actors grows large. This is because the proportion of time used for looping through the vector gets very small, and instead the overhead costs start to be more visible.

In the x86-64 system, we can see that the test with the larger vector size achieves great speedup numbers. Initially, the vector will fit in the L2 cache but not the L1D cache. The vector should fit inside the 32 kB L1D cache when the number of actors using memory hits 1 MB / 32 kB = 32, so 64 actors in total. We expected to see super-linear speedup at that point, but we do not. It is possible that the distribution of actors onto processor cores prevents this from happening, if many actors doing memory-intensive work are placed in the same physical core (as discussed in Section 11.1). Instead, the numbers increase in a more or less linear way until hitting a speedup of 34 with 256 actors in total (128 using memory). This means that, when splitting up a large vector across actors, we get the maximum performance benefit when using all the available cores in the system. After that, the process management overhead starts to show and the speedup goes down quickly.

With the smaller vector, the data should fit in the L1D cache when the number of actors using memory gets to 128 kB / 32 kB = 4, so 8 actors in total. We can see that the speedup increases most rapidly in the beginning, which means that we get the most benefit from adding processor cores here. The best speedup achieved is 18.3 at 192 actors in total. Then the process management overhead starts to show, and after passing 256 actors the speedup declines rapidly.

[Line chart: cache miss ratio versus total number of actors/processes, for the 1MB Vector and 128kB Vector series.]

Figure 19 Cache miss ratio for the memory-intensive benchmark with data reuse and strong scaling.

Figure 19 shows the cache miss ratio observed when running this particular test case. Just like in Figure 16 we can see that the cache miss ratios stay very low; the highest observed miss ratio is 3.5% for the 128 kB vector with 8 actors in total, and the lowest observed is 0.65% with the 1 MB vector and 64 actors in total. The small peaks that are seen with small numbers of actors are likely due to the characteristics of the hardware prefetcher in the system, which is not investigated in detail here. The conclusion from Figure 19 is, just like in the previous test, that the hardware prefetcher helps to keep the cache misses in L1D fairly low at all times.

11.4 Discussion on software complexity and optimizations

There are some factors affecting the performance results which have not been investigated in detail. One important aspect is that there are different levels of software complexity and optimization in the two test systems. Since CBB was developed specifically to run on EMCA, its implementation is highly optimized for this hardware. This is one of the main benefits of co-designing software and hardware. A potential drawback of this approach is that it can be complex to migrate the software to another hardware platform. In this project MPI turned out to be a helpful tool to get a working prototype for new hardware platforms, but no time has been spent optimizing the code for the x86-64 test machine. Instead the MPI-based prototype produced results that are assumed to be “good enough” to reason about, which leads to useful conclusions. The performance achieved with MPI can be viewed as a “lower bound” on what is actually achievable on an x86-64 system.

In addition, it would be possible to optimize the benchmark-specific code for the different hardware platforms. This was briefly discussed in Section 5.2.1. An example would be to, in the memory-intensive tests with data reuse, change the access pattern to the data vectors. It would then be possible to loop through a chunk of the vector small enough to fit inside the DSP data scratchpad and the L1D cache respectively, before moving on to the next chunk. This would dramatically increase the use of locality in the software, and would likely result in higher performance on both EMCA and x86-64.
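A minimal sketch of such a blocked access pattern (the function name and parameters are illustrative; block_bytes would be chosen to fit the DSP data scratchpad or the L1D cache):

#include <stddef.h>

void run_blocked(char *vec, size_t vec_bytes, size_t block_bytes, int passes)
{
    for (size_t start = 0; start < vec_bytes; start += block_bytes) {
        size_t end = (start + block_bytes < vec_bytes) ? start + block_bytes : vec_bytes;
        for (int p = 0; p < passes; p++)          /* reuse one block before moving on */
            for (size_t i = start; i < end; i++)
                vec[i] = (char)i;
    }
}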

12 Conclusions

The objective of this project was to investigate connections between parallel programming models and the hardware that they run on. We saw that the EMCA processors have many differences when compared to commercial designs like x86-64 and ARMv8. Their memory systems and processor cores were described and analyzed, highlighting a number of their key characteristics.

The project has provided insights into how a programming model can be adapted to run on new hardware. The MPI library was suggested as a tool for making a CBB implementation which can be used on both x86-64 and ARMv8. A prototype supporting actors with message passing capabilities was developed.

Benchmarks targeting compute-intensive and memory-intensive scenarios were developed and tested on EMCA and a high-end x86-64 machine. In the compute-intensive tests we saw that EMCA had better scaling properties than the x86-64 system with very many actors, but worse scaling properties with few actors. A modified benchmark was tested on EMCA, strengthening the theory that EMCA is very good at doing heavy calculations and that the original test was not compute-intensive enough. The memory-intensive tests showed how both systems can utilize locality in their memory systems by putting data in a cache or a scratchpad memory. We also saw how the hardware prefetcher in the x86-64 system helps to keep cache miss ratios low. Overall, EMCA showed more linear scaling behavior than the x86-64 system. Having more actors than the number of processor cores gave an immediate performance penalty in the x86-64 system in all tests, something that was not as apparent on EMCA.

13 Future work

There are a number of aspects that did not fit within the scope of this project, but could be investigated in the future. Some examples are listed below.

13.1 Implement a CBB transform with MPI for x86-64

As previously stated, this project focused on creating a proof-of-concept model for CBB on x86-64. This could be expanded to become a full CBB implementation. The CBB transform could then take any CBB code and automatically generate the corresponding MPI code. It is difficult to estimate how comprehensive and time-consuming this work would be.

13.2 Expand benchmark tests to cover more scenarios

The benchmark tests used within this project could be expanded and modified to test more performance aspects. For example, it could be interesting to use a real baseband application with the kind of signal processing operations used in a cellular base station. Comparing the performance of such an application across multiple hardware architectures might be relevant for Ericsson.

13.3 Run benchmarks with hardware prefetching turned off

To dive deeper into the characteristics of the cache system in the x86-64 machine, it could be interesting to run the benchmark tests with the hardware prefetcher turned off. This would yield worse performance overall, but the results would then be easier to relate to the cache specifications.

13.4 Combine MPI processes with OpenMP threads

There could be a benefit of splitting up tasks inside an MPI process using threads. As mentioned in Section 6.1, MPI only supports the creation of processes. It could be interesting to make a CBB implementation using MPI for process creation and something like OpenMP for thread-level parallelism. This is an approach which is widely used and studied, for example by Rabenseifner et al. [21].
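A minimal sketch of the hybrid approach (not part of this project; the printed output is only for illustration):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* Request a threading level where only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Thread-level work inside one MPI process (one actor). */
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}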

13.5 Run the same code in an ARMv8 system

As mentioned in Section 6.1.2, MPICH is portable and can run on ARMv8-based systems as well. The code written in this project could easily be compiled and run on such a system, testing the same or other benchmarks. This would be an easy way to make additional comparisons between EMCA, x86-64 and also ARMv8.

References

[1] G. A. Agha, Actors: a model of concurrent computation in distributed systems. Cambridge, Mass: MIT Press, 1986.

[2] B. Anuradha and C. Vivekanandan, “Usage of scratchpad memory in embedded systems — State of art,” in 2012 Third International Conference on Computing, Communication and Networking Technologies (ICCCNT’12), 2012, pp. 1–5.

[3] ARM Holdings. (2020, Sep.) ARM Cortex-A Series Programmer’s Guide for ARMv8-A – Chapter 11. Caches. Accessed: 2020-09-17. [Online]. Available: https://developer.arm.com/documentation/den0024/a/caches

[4] Barney, Blaise. (2020) Message Passing Interface (MPI). Lawrence Livermore National Laboratory. Accessed: 2020-10-27. [Online]. Available: https://computing.llnl.gov/tutorials/mpi/

[5] Berkeley Design Technology, Inc. (2006) The Art of Processor Benchmarking: A BDTI White Paper. Accessed: 2020-10-06. [Online]. Available: https://www.bdti.com/MyBDTI/pubs/artofbenchmarking.pdf

[6] N. Chong and S. Ishtiaq, “Reasoning about the ARM Weakly Consistent Memory Model,” in Proceedings of the 2008 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness: Held in Conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’08), ser. MSPC ’08. New York, NY, USA: Association for Computing Machinery, 2008, p. 16–19. [Online]. Available: https://doi-org.ezproxy.its.uu.se/10.1145/1353522.1353528

[7] W. Chu. (2020, Jun.) Caching and Memory Hierarchy. Accessed: 2020-09-16. [Online]. Available: https://medium.com/@worawat.chu/caching-and-memory-hierarchy-fc7a9b9efcca

[8] I. Cutress. (2019, Aug.) Cache and TLB updates - The Ice Lake Benchmark Preview: Inside Intel’s 10nm. Accessed: 2020-09-17. [Online]. Available: https://www.anandtech.com/show/14664/testing-intel-ice-lake-10nm/2

[9] B. Falsafi and T. F. Wenisch, “A Primer on Hardware Prefetching,” Synthesis Lectures on Computer Architecture, vol. 9, no. 1, pp. 1–67, 2014. [Online]. Available: https://doi.org/10.2200/S00581ED1V01Y201405CAC028

[10] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, “Open MPI: Goals, concept, and design of a next generation MPI implementation,” in Proceedings, 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary, September 2004, pp. 97–104.

[11] G. Georgis, A. Thanos, M. Filo, and K. Nikitopoulos, “A DSP Acceleration Framework For Software-Defined Radios On X86_64,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1648–1652.

[12] J. L. Gustafson, “Reevaluating Amdahl’s Law,” Commun. ACM, vol. 31, no. 5, p. 532–533, May 1988. [Online]. Available: https://doi-org.ezproxy.its.uu.se/10.1145/42411.42415

[13] J. L. Hennessy and D. A. Patterson, Computer Architecture - A Quantitative Approach (5th Edition). Elsevier, 2012. [Online]. Available: https://app.knovel.com/hotlink/toc/id:kpCAAQAE11/computer-architecture/computer-architecture

[14] M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era,” Computer, vol. 41, no. 7, p. 33–38, Jul. 2008. [Online]. Available: https://doi.org/10.1109/MC.2008.209

[15] Javed Absar and F. Catthoor, “Analysis of scratch-pad and data-cache performance using statistical methods,” in Asia and South Pacific Conference on Design Automation, 2006., 2006, pp. 6 pp.–.

[16] N. D. E. Jerger, E. L. Hill, and M. H. Lipasti, “Friendly fire: understanding the effects of multiprocessor prefetches,” in 2006 IEEE International Symposium on Performance Analysis of Systems and Software, 2006, pp. 177–188.

[17] X. Li. (2018, Nov.) Scalability: strong and weak scaling. PDC Center for High Performance Computing – KTH Royal Institute of Technology. Accessed: 2020-09-17. [Online]. Available: https://www.kth.se/blogs/pdc/2018/11/scalability-strong-and-weak-scaling/

[18] Linux Kernel Organization, Inc. (2020, Sep.) perf(1) — Linux manual page. Accessed: 2020-11-19. [Online]. Available: https://man7.org/linux/man-pages/man1/perf.1.html

[19] B. Mann and N. Stephens. (2019, Apr.) New Technologies for the Arm A-Profile Architecture. Accessed: 2020-09-23. [Online]. Available: https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/new-technologies-for-the-arm-a-profile-architecture

[20] V. Nagarajan, D. J. Sorin, M. D. Hill, and D. A. Wood, “A Primer on Memory Consistency and Cache Coherence, Second Edition,” Synthesis Lectures on Computer Architecture, vol. 15, no. 1, pp. 1–294, 2020. [Online]. Available: https://doi.org/10.2200/S00962ED2V01Y201910CAC049

[21] R. Rabenseifner, G. Hager, and G. Jost, “Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes,” in 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2009, pp. 427–436.

[22] J. Reinders. (2013, Jul.) Intel AVX-512 Instructions. Accessed: 2020-09-23. [Online]. Available: https://software.intel.com/content/www/us/en/develop/articles/intel-avx-512-instructions.html

[23] T. Singh, S. Rangarajan, D. John, R. Schreiber, S. Oliver, R. Seahra, and A. Schaefer, “2.1 Zen 2: The AMD 7nm energy-efficient high-performance x86-64 microprocessor core,” in 2020 IEEE International Solid-State Circuits Conference - (ISSCC), 2020, pp. 42–44.

[24] N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico, and P. Walker, “The ARM Scalable Vector Extension,” IEEE Micro, vol. 37, no. 2, pp. 26–39, 2017.

[25] The MPICH Project. (2019, Nov.) MPICH Overview. Accessed: 2021-01-05. [Online]. Available: https://www.mpich.org/about/overview/

[26] The MPICH Project. (2019, Nov.) MPICH User Documentation. Accessed: 2021-01-05. [Online]. Available: https://www.mpich.org/static/docs/latest/

[27] J. Thiel. (2006) An Overview of Software Performance Analysis Tools and Techniques: From GProf to DTrace. Washington University in St. Louis. Accessed: 2020-10-06. [Online]. Available: https://www.cse.wustl.edu/~jain/cse567-06/ftp/sw_monitors1/

[28] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous Multithreading: Maximizing on-chip Parallelism,” SIGARCH Comput. Archit. News, vol. 23, no. 2, p. 392–403, May 1995. [Online]. Available: https://doi-org.ezproxy.its.uu.se/10.1145/225830.224449

[29] Wikipedia contributors. (2019) MPICH – Wikipedia. Accessed: 2020-11-20. [Online]. Available: https://en.wikipedia.org/wiki/MPICH

[30] Wikipedia contributors. (2020) OSI model – Wikipedia. Accessed: 2020-11-20. [Online]. Available: https://en.wikipedia.org/wiki/OSI_model
