
Technische Universität München
Fakultät für Elektrotechnik und Informationstechnik

Study on Embedded

Multi-Processor Systems-on-Chip

with Virtual Prototyping Technique

Bogdan Pricope

A thesis submitted for the degree of Master of Science.

September 11, 2008

Supervisors:

Dr. Jinan Lin

Dr. Xiaoning Nie

Dipl. Ing. Michael Meitinger

Prof. Andreas Herkersdorf

Page 2: MSc Thesis

Abstract

Processor performance has historically been driven by increasing clock frequency and advances in silicon process technology. However, power dissipation density is now the critical factor limiting performance increases, and performance growth has therefore slowed in recent years. It has become clear that future performance demands can only be met by new design solutions. Moreover, today's embedded applications are very different from those of the past, in terms of both application complexity and dataset sizes. Consequently, it is no longer feasible to meet the demands of embedded applications with single-core systems.

Multiprocessor system-on-chip (MPSoC) designs are a way to scale performance in accordance with Moore's law. There is a growing trend towards MPSoC-type architectures, where multiple processor cores reside on the same chip and share data through on-chip memory and an on-chip communication network.

However, high performance MPSoC architectures need high memory bandwidth. With the widening gap between processor and memory speeds, system performance has become increasingly dependent upon the effective use of the memory hierarchy. The integration of multiple processors on a single chip makes the problem even worse.

Caches, which store frequently used instructions and data in high speed memory close to the processor, are a means of increasing effective memory bandwidth. However, caches are expensive, especially in embedded systems, so cache design remains an important area of research.

An important concept for understanding how caches behave is the principle of locality. In this thesis, the locality of a stream of instructions is described using the reuse-distance model, which bases the probability of a cache hit on the instruction reuse-distance. The concept of Instruction Reuse is introduced as a reference for our measurements, in order to abstract our results from implementation details such as the application being executed or the cache configuration.

An ARM11 MPCore based multiprocessor system is modelled and simulated using virtual prototyping technology from VaST Systems, and the effect of Instruction Reuse on system performance and scalability is studied. We show that a low Instruction Reuse limits the performance and scalability of multiprocessor systems. Moreover, even doubling the memory bandwidth does not improve system scalability when Instruction Reuse is low. In Symmetric Multiprocessing mode, a shared Level 2 cache is shown to solve the MPSoC scalability problem. In Asymmetric Multiprocessing mode, however, the shared Level 2 cache may actually decrease system performance when Instruction Reuse is low.

Declaration of Originality

I hereby declare that the research documented in this thesis, and the thesis itself, is the result of my own work in the Communications Solutions business group at Infineon Technologies.

Bogdan Pricope

Acknowledgments

Hereby, I would like to express my gratitude to all those who made it possible for me to complete this thesis.

First and foremost, I wish to thank my supervisors from Infineon Technologies, Dr. Jinan Lin and Dr. Xiaoning Nie, for offering me this thesis topic and for their continuous support and guidance. I would also like to thank Mr. Stefan Maier and Mr. Thomas Niedermeier for the fruitful discussions which improved the quality of this thesis, and I thank all my colleagues from the Advanced Systems and Circuits department for making me feel at home.

Moreover, I would like to thank Dipl. Ing. Michael Meitinger and Prof. Herkersdorf from the Lehrstuhl für Integrierte Systeme at Technische Universität München for providing the initial support without which I would not have been able to commence this thesis. I especially want to thank Mr. Meitinger for his valuable support and discussions.

Last but not least, I wish to thank my family and my girlfriend for their continuous support during my studies.

Contents

Abstract
Declaration of Originality
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Objective

2 The Instruction Reuse Challenge
  2.1 Background
  2.2 Reuse-distance analysis
  2.3 Our approach

3 Virtual Prototyping Technology
  3.1 Virtual System Prototypes
  3.2 Virtual Platforms
  3.3 VaST Systems Tools

4 Experiment System Architecture
  4.1 Hardware Architecture
  4.2 Software Architecture

5 Simulation Results & Analysis
  5.1 Effect of Cache Size on Instruction Reuse
  5.2 The Low Instruction Reuse Problem
  5.3 The Shared Level 2 Cache
  5.4 The effect of Tightly Coupled Memory

6 Conclusion & Future Work

List of Figures

1.1 Exponentially increasing application complexity [7]
2.1 Symmetric multiprocessing (SMP)
2.2 Asymmetric Multiprocessing (AMP)
2.3 Memory Hierarchy
2.4 Cache block diagram
2.5 Instruction Reuse Distance Histogram
2.6 Instruction Reuse
2.7 Instruction Reuse increases, as the cache capacity increases from 4 to 7 instructions.
3.1 Virtual Platform
3.2 CoMET window
4.1 Experiment System Architecture
4.2 Arm11 MPCore block diagram
4.3 Memory Timing
4.4 Output image structure
4.5 Flowchart of the main() function
4.6 Basis for the OS instruction reuse distance benchmark
4.7 Benchmark OS instruction reuse-distance histogram
4.8 Test() function control flow graph
5.1 Instruction Reuse as cache size is varied for the TEST benchmark.
5.2 A low Instruction Reuse results in no performance improvement as the number of CPUs is increased. In other words, a low Instruction Reuse limits the scalability of a multiprocessor system.
5.3 High Instruction Reuse values enable a multiprocessor system to scale to a higher number of processors, and significant performance gains can be seen over a single-processor system.
5.4 Doubling the memory bandwidth increases system IPC but does not help to improve the scalability of the multiprocessor system when Instruction Reuse is low.
5.5 Doubling the Level 1 cache size or even the memory bandwidth may not improve the scalability of a multiprocessor system. However, increasing the cache size above a certain threshold value solves the scalability problem.

5.6 Instruction Reuse as cache size is varied for the modified histogram.
5.7 Effect of application instruction reuse-distance histogram
5.8 In SMP mode, the addition of a shared Level 2 cache increases system IPC and also improves the scalability of the multiprocessor system for low Instruction Reuse.
5.9 In AMP mode, the addition of a shared Level 2 cache slightly increases system IPC but does not improve the scalability of the multiprocessor system. In fact, for low Instruction Reuse, increasing the number of processors decreases system IPC.
5.10 The addition of a relatively small Level 2 shared cache (e.g. twice the size of the Level 1 cache) provides a significantly greater performance improvement than doubling the Level 1 cache size alone. However, increasing the Level 2 cache size without also increasing the number of CPUs does not bring any significant performance gain.
5.11 As opposed to SMP mode, in AMP mode increasing the Level 2 cache size considerably increases system IPC.

List of Tables

1.1 MPSoC cache configuration
2.1 Instruction Reuse Distance example
4.1 Experiment System Configuration
4.2 Cache and TLB Operation Functions
4.3 Typical memory sizes and access times
4.4 Memory Controller configuration
4.5 Memory Map
4.6 Page Table
4.7 Instruction Reuse Distance is modelled by loops.
4.8 Type and number of instructions executed.
5.1 Instruction Reuse as cache size is varied for the TEST benchmark
5.2 Instruction Reuse comparison.

Chapter 1

Introduction

1.1 Motivation

Embedded electronic systems are specialized to carry out specific tasks and are embedded in their environment. This is in contrast to personal computers or supercomputers, which are general purpose and interact with users. Embedded systems are much more prevalent than their general-purpose counterparts; for instance, 98% of all microprocessors manufactured in a given year are used within embedded systems [14, 21]. Embedded systems designers must meet strict time-to-market and productivity requirements. Thus, embedded system designs are generally constrained, because designers have to make trade-offs between design cost and design complexity.

However, the computational requirements of embedded applications are increasing exponentially. During the past 15 years, a variety of new protocols and standards have been introduced which feature rapidly increasing computational requirements. Figure 1.1 shows some of these trends for three classes of multimedia applications: video, cellular, and wireless LAN. Code size for these applications is also increasing, reflecting the trend that application complexity is growing along with computational requirements. This exponential trend creates demand for the increasing number of transistors that process scaling can integrate [18].

Advances in process technology have made it possible to roughly double the number of transistors per area every two years, according to Moore's law [22]. Thus, performance gains have been achieved by higher transistor integration densities and increased clock frequencies due to the smaller size of the transistors. However, the increase in performance came at the cost of increased power consumption and thus heat dissipation. The latter has been the most critical technical challenge in maintaining performance growth.

Figure 1.1: Exponentially increasing application complexity [7]

The total power consumption of a chip is the sum of two components: the active (or dynamic) power consumption and the static power consumption. The dynamic power dissipation density is proportional to the number of transistor devices per area (N), the activation factor of the device (α), the switched capacitance per device (C), the operating frequency (f), and the square of the supply voltage (V):

P_dynamic / Area ≈ N · α · C · f · V²

For older process technologies, the dynamic power consumption was dominant. As the dimensions of the transistors shrank by √2 every two years, the power dissipation density increased by more than √2, i.e. by over 40%. This was due to the increase in static power consumption, with gate leakage current being the dominant component.

In order to keep power dissipation at acceptable levels, designers have traded silicon area against power consumption. Thus, the performance of a single processor has increased only with the square root of its complexity [6]: each new processor architecture has required two to three times the silicon area while providing only a 20% improvement in performance [11]. These marginal performance gains called for a new approach to extracting more value from the same silicon area.

Multiprocessor systems-on-chip (MPSoCs) have emerged as a solution to scale performance by exploiting software parallelism. Nevertheless, MPSoCs confront system architects with new challenges: designing efficient memory hierarchies and system interconnects while maintaining the low power and cost constraints of embedded systems.

Cache Type    Size     Line size   No. of ways   No. of sets
Data          16 KB    32 B        4             128
Instruction   16 KB    32 B        4             128

Table 1.1: MPSoC cache configuration

For example, in [2], multiprocessor performance has been investigated for network protocol processing. The MPSoC platform is based on two 32-bit MIPS cores with the following features:

• 32-bit address path

• 64-bit data path to the caches

• Eight-stage pipeline

• Separate instruction and data caches of 16 KByte each

• Cache lines are virtually indexed, physically tagged

• Cache replacement policy is based on least recently used (LRU) strategy

• Processor clock frequency to bus clock frequency ratio is 2:1

The cache configuration is given in Table 1.1.

During the measurements, which were conducted in the course of studying software-based cache coherence, an interesting phenomenon was observed: for TCP/IP protocol processing on Linux, up to 70% of the total cycles are stall cycles due to instruction cache misses caused by the Linux operating system code [2].

As this example shows, using multiple processors does not necessarily increase system performance. The challenge is scalability: system performance must increase as additional processors are added to the system.

High instruction cache hit rates are key to achieving high performance. In contrast to data cache accesses, instruction cache accesses are serialized and cannot be overlapped. Instruction cache misses interrupt the flow of instructions through the processor and directly affect performance. To maximize instruction cache utilization and minimize stalls, application code should have high locality, i.e. few branches (high spatial locality), a repeating pattern when deciding whether to follow a branch (a low branch misprediction rate), and, most importantly, a working-set code footprint that fits in the processor's instruction cache.

Unfortunately, many applications, and especially operating system code, exhibit exactly the opposite behavior. As the example above showed, the low instruction locality of the operating system while running a TCP/IP processing application is responsible for a substantial number of idle processor cycles. The increasing gap between memory and processor speeds makes this worse, resulting in huge performance penalties.

Even if the processor clock frequency remains the same, doubling the core count requires doubling the memory bandwidth. Unfortunately, doubling the on-chip cache size does not halve the miss rate: empirically, the miss rate only goes down by a factor of the square root of two when the cache size is doubled [1]. As a result, for each new generation, the bandwidth in and out of the chip must increase exponentially. This poses a dilemma, as the pin bandwidth will not increase exponentially but rather linearly, according to ITRS (International Technology Roadmap for Semiconductors) predictions [19, 10].
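Written out, this empirical square-root rule is a power law in the cache capacity C (a compact restatement of the observation cited from [1], not an additional result):

miss rate(C) ∝ C^(-1/2), so that miss rate(2C) = miss rate(C) / √2 ≈ 0.71 · miss rate(C)

Doubling the cache thus removes only about 29% of the misses, while doubling the core count doubles the demand on the off-chip interface.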

1.2 Objective

Memory latency and bandwidth are two important metrics when designing a multiprocessor system. While latency is the (round-trip) time from issuing a read/write to memory until the data is returned, bandwidth is the amount of data transferred per unit of time. The proximity of multiple processors makes interprocessor communication much cheaper, but providing enough memory bandwidth for all processors to function becomes a serious problem: if the cache hierarchy is not designed appropriately, one can incur a significant number of off-chip references (accesses), which are very expensive from both performance and power perspectives.

With modern CPUs having 16 KB to 64 KB Level 1 instruction caches, operating system code is too large to reside in the cache. Current chip design trends for improving processor performance are leading to thread-parallel architectures, where multiple threads run simultaneously on a single chip via multiple on-chip processor cores (chip multiprocessors, or CMP) and/or multiple simultaneous threads per processor (SMT).

To be able to fit more cores on a single chip without overheating, and also to save time in hardware verification, chip designers are expected to use simple processor cores as building blocks. One such example is Sun's UltraSPARC T1, which uses up to eight cores on a single chip with four threads per core. The instruction cache size of these cores is not expected to grow: the UltraSPARC T1 features a 16 KB Level 1 instruction cache per core, the same size as in the first UltraSPARC chip introduced 10 years earlier. Moreover, SMT chips already operate on a reduced effective instruction cache size, since the instruction cache is shared among all simultaneous threads. In future processors, the combined effect of larger, shared Level 2 caches and small Level 1 instruction caches will make instruction cache stalls the key performance bottleneck.

Although much research has been devoted to caches, the topic remains important and there is still room for innovation. While uniprocessor architectures are well understood, this is not the case for embedded multiprocessor systems, and especially not for multithreaded chip multiprocessors (CMT), i.e. multiple on-chip processor cores with multiple simultaneous threads per core.

The goal and contribution of this thesis is to study and understand the effect of instruction locality on the performance and scalability of multiprocessor systems. The study includes the following parts:

1. Investigate the instruction locality issue of embedded multiprocessor systems:

• Design and write test programs with configurable instruction locality, which can be used for systems with different numbers of cores.

• Analyse the performance (e.g. IPC) of multiprocessor architectures for various instruction localities, cache sizes, and numbers of processors.

2. Explore performance/cost optimization possibilities:

• Analyse the tradeoff between using a shared Level 2 cache and TCM (Tightly Coupled Memory).

Chapter 2

The Instruction Reuse Challenge

The concept of Instruction Reuse provides the foundation of this thesis. First, an introduction to the main concepts used throughout this thesis is given. Then, reuse-distance analysis is presented and the proposed Instruction Reuse concept for measuring instruction locality is described in detail.

2.1 Background

2.1.1 MPSoC Classification

Depending on the combination of processors, memory and operating system, multiprocessor systems are divided into two major categories: symmetric multiprocessing and asymmetric multiprocessing.

Symmetric Multiprocessing (SMP)

SMP is a homogeneous topology: the processors share a common instruction set architecture (ISA) and have a common view of the rest of the system resources, including a shared memory architecture. In SMP mode, a single operating system runs on all processors, which access a single image of the operating system in memory. Figure 2.1 shows a diagram of an SMP system.

Figure 2.1: Symmetric multiprocessing (SMP)

The operating system (OS) is responsible for dynamically distributing tasks across the processors, managing the ordering of task completion, and controlling the sharing of all resources between the cores. Thus, processes or threads can be assigned and reassigned to different processors depending on processor load. Moreover, porting applications developed for single-processor systems to SMP systems is easy, and load balancing algorithms are efficient in making maximum use of the available processing power.

The major disadvantage of the SMP approach is that as the number of processors increases, the communication overhead becomes dominant and the shared memory cannot support the bandwidth demands of all processors. Each additional processor increases the time load balancing algorithms spend assessing load conditions, deciding task assignments and transferring tasks between processors. Moreover, the shared communication medium quickly becomes a bottleneck. As a result, SMP systems typically do not scale to more than about 8 processors.

Moreover, SMP behavior is non-deterministic: critical software functions cannot be guaranteed to execute within a certain response time, because execution time is highly dependent on the system's current state and load distribution. Without guaranteed response times, the SMP approach does not meet the needs of real-time systems.

The SMP approach also cannot be implemented in a heterogeneous system. The software depends on each processor having the same instruction set architecture and identical resources available to it, including the operating system it is running, so that tasks can be readily interchanged. Multiprocessor systems that have different processors to handle different types of tasks simply cannot run an SMP operating system, nor can SMP be constructed using different operating systems on each core.

Figure 2.2: Asymmetric Multiprocessing (AMP)

Asymmetric Multiprocessing (AMP)

Asymmetric multiprocessing differs from symmetric multiprocessing in allowing the use of heterogeneous processors and operating systems, in addition to the homogeneous environment supported by SMP. In AMP mode, different operating systems run on different processors from private local memories. These processors are specialized for certain tasks by having different instruction set architectures, and they communicate with each other through shared memory and message passing. Figure 2.2 shows a diagram of an AMP system.

The advantage of AMP is that the memory bandwidth available to each processor is increased and the latency of accesses to local memory is reduced. Moreover, the processors spend less time handshaking with each other. This enables designs to scale to much larger numbers of processors than SMP does.

AMP performs selective load balancing, allowing the designer to permanently assign some tasks to a fixed processor while allowing others to be load-balanced among many processors. This means that the application can be made deterministic in those areas where system response is critical.

The major disadvantage is that communication between processors is much more complex and requires more effort on the software side. It is up to the programmer to make sure the processors are being utilized to their maximum potential, to verify whether a processor can complete a certain task, and to make the processors communicate effectively to distribute tasks accordingly.

Moreover, since application partitioning and mapping is an NP-hard problem, designers cannot easily port their applications from earlier generations to an AMP system. They must decide which components need to be fixed and which can be distributed, and map them to processors accordingly.

Figure 2.3: Memory Hierarchy

2.1.2 The principle of locality

The principle of locality is an empirically observed phenomenon with numerous practical implications. The basic observation is that programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of its code [16].

An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past. The principle of locality also applies to data accesses, though not as strongly as to code accesses.

Two different types of locality have been observed:

• Temporal locality states that recently accessed items have a high probability of being accessed again in the near future.

• Spatial locality states that items whose addresses are near the address of a recently accessed item have a high probability of being accessed in the near future.

The principle of locality, together with the higher speed of smaller memories, led to hierarchies based on memories of different speeds and sizes [15].

Figure 2.3 shows a multilevel memory hierarchy, including typical sizes and access speeds. As we move farther away from the processor, each level becomes slower and larger. Since fast memory is expensive, a memory hierarchy is organized into several levels, each smaller, faster, and more expensive per byte than the next lower level. The goal is to provide a memory system with cost per byte almost as low as the cheapest level of memory and speed almost as fast as the fastest level.

Figure 2.4: Cache block diagram

The importance of the memory hierarchy has increased with advances in processor performance. Cache memories have become a major means of bridging the gap between main memory access time and the faster clock rates of current processors, and cache behavior has become one of the major factors affecting application performance. Since memory access times improve much more slowly than processor speed, performance is bound by the instruction and data cache misses that cause expensive main-memory accesses.

2.1.3 Caches

To hide the slowness of main memory, caches are used. Caches are fast but small memories between the processor and the main memory. In order to achieve high performance, data should be found in the cache most of the time; however, because of the limited capacity of the cache, cache misses occur.

A cache miss occurs when a word is not found in the cache by the processor. The word must be fetched and placed in the cache before execution continues. Because of spatial locality, multiple words, called a block or line, are moved at one time. Since the cache is much smaller than main memory, a key design decision is where cache blocks (lines) can be placed in the cache. The most popular scheme is set associative, where a set is a group of blocks in the cache.

Figure 2.4 shows the structure of a cache. A cache block is first mapped onto a set, and then the block can be placed anywhere within that set. Finding a block consists of first mapping the block address to the set, and then searching the set to find the block. The set is chosen by the address of the data:

(Block address) MOD (Number of sets in cache)
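As a concrete illustration, this mapping can be written as a small helper function. The following is a sketch, not code from the thesis; the function and parameter names are ours, and line size and set count are assumed to be powers of two:

#include <stdint.h>
#include <stdio.h>

/* Map a byte address to its cache set: (Block address) MOD (Number of sets). */
static uint32_t cache_set_index(uint32_t addr, uint32_t line_size,
                                uint32_t num_sets)
{
    uint32_t block_address = addr / line_size;  /* strip the byte offset */
    return block_address % num_sets;            /* select the set        */
}

int main(void)
{
    /* 16 KB, 4-way, 32-byte lines => 16384 / (4 * 32) = 128 sets (Table 1.1). */
    printf("set = %u\n", (unsigned)cache_set_index(0x00008020u, 32u, 128u));
    return 0;
}

For the 16 KB 4-way caches of Table 1.1 with 32-byte lines, this gives 16384 / (4 · 32) = 128 sets, matching the table.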

If there are n blocks in a set, the cache placement is called n-way set associative. The end points of set associativity have their own names:

• A direct-mapped cache has just one block per set; thus a block is always mapped to the same location.

• A fully associative cache has just one set; thus a block can be placed anywhere.

The replacement algorithm is the process used to select one of the blocks in a given set for occupation by a newly referenced block. The important schemes include LRU (Least Recently Used), random, and FIFO (First In First Out).

Most cache designs also assume demand fetch and write allocate. Demand fetch means that a block is fetched from memory into the cache only on a cache miss; write allocate is the policy where an entire block is fetched into the cache on a write to the block if it is absent from the cache.

One measure of the benefits of different cache configurations is the miss rate: the fraction of cache accesses that result in a miss, i.e. the number of accesses that miss divided by the total number of accesses.

To gain insight into the causes of high miss rates, the three Cs model sorts all misses into three simple categories [16]:

• Compulsory: The very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that would occur even with an infinite cache.

• Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses (in addition to compulsory misses) will occur because blocks are discarded and later retrieved.

• Conflict: If the block placement strategy is not fully associative, conflict misses (in addition to compulsory and capacity misses) will occur, because a block may be discarded and later retrieved when conflicting blocks map to its set.

To exploit the principle of locality, cache designs are adding more cache levels and dynamic configuration control. It is common in today's designs to have two or three levels of cache memory. As the memory hierarchy becomes deeper and more adaptive, its performance will increasingly depend on our ability to predict instruction and data locality.

2.2 Reuse-distance analysis

The reuse distance is a metric for the cache behavior of programs. A large reuse distance indicates a high probability of cache misses; a low reuse distance indicates good temporal locality and thus a high probability of cache hits.

Reuse-distance analysis predicts program locality by experimentally determining locality properties as a function of the data size of a program, allowing accurate locality analysis when the program's data size changes.

Prior work has established the effectiveness of reuse-distance analysis in predicting program locality over a wide range of data sizes. Ding et al. [9, 24] have proposed techniques to predict the reuse distance of memory references (the number of distinct memory locations accessed between two references to the same memory location) across all program inputs using a few profiling runs. They use curve fitting to predict reuse distance as a function of a program's data size. By quantifying reuse as a function of data size, the information obtained via a few profiled runs allows the prediction of reuse to be quite accurate over varied data sizes. Ding et al. have used reuse-distance predictions to accurately predict whole-program miss rates [24, 5].

The most obvious application of reuse distance is prefetching the memory operations that cause the most misses. Both hardware and software prefetching may issue many unnecessary prefetches; hardware could be constructed to use reuse-distance information to schedule prefetches dynamically for important instructions.

Knowledge from reuse-distance analysis can be used to reduce capacity misses. Since most cache misses are capacity misses, eliminating a capacity miss requires making the reuse distance smaller than the cache size. On the hardware level, this can be done by increasing the cache size; the probability of hitting the cache will then increase for references with long reuse distances.

Moreover, reuse-distance analysis may also be used in architectural optimization via compiler hints, to gain a more global view of the expected behavioral patterns of a program. On the compiler and algorithmic levels, the cache size cannot be changed, but the program or the algorithm can be changed so that fewer long reuses occur. On the compiler level, the most well-known techniques are loop tiling and loop fusion.

At the algorithmic level, one has more freedom to restructure the program than at the compiler level, and the programmer has a better understanding of the global program structure. Therefore, the programmer can decide to use different algorithms to decrease long reuse distances. However, it is often difficult to know exactly where in the code poor data locality occurs; instrumentation and visualization of the program can help the programmer pinpoint the hot spots.

2.2.1 Instruction Reuse Distance

In 1970, Mattson et al. studied stack algorithms in cache management and defined the concept of stack distance [17]. Instruction Reuse Distance (IRD) is the same as LRU stack distance, i.e. stack distance under the LRU (Least Recently Used) replacement policy.

Whenever a memory location is used multiple times during program execution (i.e. it is reused), cache hits may result if the corresponding instruction stays in the cache between the accesses to it. However, when the reuses are separated by accesses to many other distinct instructions, the probability that it remains in the cache between use and reuse is low.

By ordering the instruction memory accesses of a program execution by logical time, we obtain a program trace. In a sequential execution, the reuse distance is the number of distinct instructions executed between two consecutive executions of the same instruction (i.e. between use and reuse).

Instruction reuse distance measures the volume of intervening instructions, not the time between two executions. While time distance is unbounded in a long-running program, reuse distance is always bounded by the code footprint. Moreover, the reuse distance is a property of the trace and is independent of hardware parameters. Table 2.1 shows an example of how to compute the instruction reuse distance.

Time                   1    2    3    4    5    6    7    8    9
Memory Address         A1   A2   A3   A4   A2   A3   A4   A1   A1
Instruction            I1   I2   I3   I4   I2   I3   I4   I1   I1
Reuse-distance of I1   ∞                                  3    0
Reuse-distance of I2        ∞              2
Reuse-distance of I3             ∞              2
Reuse-distance of I4                  ∞              2

Table 2.1: Instruction Reuse Distance example

When an instruction has reuse distance d, exactly d distinct instructions were executed since its previous execution. If d is smaller than the number of instructions that fit in the cache, the referenced instruction will be found in a fully associative cache; conversely, if d is larger, the reference will result in a cache miss. If the reuse distance is zero, the referenced instruction will always result in a cache hit.
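The computation in Table 2.1 can be mechanized with an LRU stack: at each reuse, the depth of the instruction in the stack is exactly its reuse distance. The following self-contained C sketch (illustrative only, not part of the thesis toolchain) reproduces the table:

#include <stdio.h>

#define TRACE_LEN 9
#define MAX_INSTR 16

int main(void)
{
    /* Instruction trace of Table 2.1: I1 I2 I3 I4 I2 I3 I4 I1 I1 */
    const int trace[TRACE_LEN] = {1, 2, 3, 4, 2, 3, 4, 1, 1};
    int stack[MAX_INSTR];  /* most recently executed instruction at index 0 */
    int depth = 0;

    for (int t = 0; t < TRACE_LEN; t++) {
        int instr = trace[t];
        int dist = -1;     /* -1 encodes an infinite distance (first use) */

        for (int i = 0; i < depth; i++)
            if (stack[i] == instr) { dist = i; break; }

        /* Move (or push) the instruction to the top of the LRU stack. */
        if (dist == -1)
            for (int i = depth++; i > 0; i--) stack[i] = stack[i - 1];
        else
            for (int i = dist; i > 0; i--) stack[i] = stack[i - 1];
        stack[0] = instr;

        if (dist == -1) printf("t=%d  I%d  reuse distance: inf\n", t + 1, instr);
        else            printf("t=%d  I%d  reuse distance: %d\n", t + 1, instr, dist);
    }
    return 0;
}

Running it prints the ∞, ∞, ∞, ∞, 2, 2, 2, 3, 0 sequence of the table.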

The classification of a miss into compulsory, conflict or capacity misses is easily made using the reuse distance:

• A compulsory miss has an infinite reuse distance, since the block was not previously referenced.

• If the reuse distance is smaller than the cache size, it is a conflict miss, since the same reference would have been a hit in a fully associative cache.

• If the reuse distance is larger than the cache size, it is a capacity miss, since the reference also misses in a fully associative cache.
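This classification maps directly onto a small helper; a sketch, with a negative distance encoding infinity and the cache capacity measured in instructions:

#include <stdio.h>

/* Name the three Cs category of a cache miss, given its reuse distance d. */
static const char *classify_miss(long d, long capacity)
{
    if (d < 0)        return "compulsory"; /* never referenced before       */
    if (d < capacity) return "conflict";   /* fully associative would hit   */
    return "capacity";                     /* fully associative also misses */
}

int main(void)
{
    printf("%s %s %s\n",
           classify_miss(-1, 1024),    /* compulsory */
           classify_miss(100, 1024),   /* conflict   */
           classify_miss(5000, 1024)); /* capacity   */
    return 0;
}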

Figure 2.5: Instruction Reuse Distance Histogram

2.2.2 Instruction Reuse Distance Histograms

A reuse distance histogram summarizes the locality of a program execution and is important for cache performance prediction [13, 12, 23], reference affinity detection [25], and data reorganization [8].

The reuse distance histogram shows the distribution of reuse distances in an execution. Each bar shows the fraction of total instructions executed with a certain reuse distance: the X-axis is the instruction reuse distance, and the Y-axis is the fraction of total references.

An example of an instruction reuse distance histogram is shown in Figure 2.5. The fraction of references to the left of the Cache Capacity mark will hit in a fully associative cache of the capacity indicated by the dotted line. For set associative caches, reuse distance is still an important hint to cache behavior; the probability of a conflict miss was determined in [1].

Figure 2.6: Instruction Reuse

2.3 Our approach

Cache performance depends on program locality, which changes from program to program and, for the same program, from input to input. For example, different inputs may require the execution of different routines with diverse locality profiles. In addition, different programs usually have different instruction reuse-distance histograms. For these reasons, measurements of cache performance are application or program specific.

Moreover, using cache miss ratios as a reference for comparing the performance of different systems requires knowledge of implementation details such as cache line size or set associativity. Such knowledge is not available in the early stages of a design.

In order to abstract from implementation details and to obtain application/program independent results, we introduce the notion of Instruction Reuse as a method to measure instruction locality.

Definition:

Instruction Reuse is the projection of a particular program execution on a particular cache configuration.

As shown in Figure 2.6, different application-cache combinations can have the same projection, i.e. the same Instruction Reuse. Thus, using Instruction Reuse as a reference for measuring system performance, systems with different applications and/or cache configurations can be compared. This means that results based on Instruction Reuse values are not specific to a particular instruction reuse-distance histogram or cache configuration; it is the effect of the combination of these two components that we study.

Figure 2.7: Instruction Reuse increases, as the cache capacity increases from 4 to 7 instructions.

We measure Instruction Reuse as:

Instruction Reuse = (Number of instructions executed) / (Number of instructions loaded into the Level 1 cache)

The number of instructions executed depends only on the application. The number of instructions loaded into the Level 1 cache, on the other hand, depends both on the application (represented by its instruction reuse-distance histogram) and on the cache configuration.

For the example given in Table 2.1, the Instruction Reuse is 9/4, assuming the cache capacity is larger than 4 instructions, i.e. once loaded from memory, instructions I1...I4 can be reused from the cache.
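This example can be checked mechanically. The following sketch (ours, illustrative only) counts Level 1 loads for a fully associative LRU cache given a trace's reuse distances, where a load occurs exactly on a miss, i.e. when the distance is infinite (encoded as -1) or at least the cache capacity:

#include <stdio.h>

static double instruction_reuse(const long *dist, int n, long capacity)
{
    long loads = 0;
    for (int i = 0; i < n; i++)
        if (dist[i] < 0 || dist[i] >= capacity)
            loads++;                  /* instruction fetched into L1 */
    return (double)n / (double)loads; /* executed / loaded           */
}

int main(void)
{
    /* Reuse distances of the Table 2.1 trace. */
    const long dist[] = {-1, -1, -1, -1, 2, 2, 2, 3, 0};
    printf("Instruction Reuse = %.2f\n", instruction_reuse(dist, 9, 4));
    return 0;
}

For any capacity of at least 4 instructions this yields 4 loads (the compulsory misses) and an Instruction Reuse of 9/4 = 2.25.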

Different Instruction Reuse values can be obtained either by:

• fixing a particular application profile and varying the cache configuration (size, associativity, replacement policy), or

• by keeping a particular cache configuration fixed and varying the application profile.

By application profile we mean the instruction reuse-distance histogram. In this thesis, the former approach is used in order to obtain a large range of Instruction Reuse values.

Figure 2.7 illustrates this idea graphically. When the cache size increases, as indicated by the arrow, more instructions fit inside the cache and thus the number of instructions loaded into the Level 1 cache decreases. Since the total number of instructions executed by the processors is independent of the cache configuration and thus remains constant, the Instruction Reuse will increase.

To investigate the effect of instruction locality on the performance and scalability of multiprocessor systems, our approach consisted of the following steps:

1. Model a multiprocessor system-on-chip with a configurable number of CPUs using virtual prototyping technology.

2. Model the instruction reuse-distance histogram of a typical application, which will serve as a benchmark for comparison with future real-system measurements.

3. Measure the Instruction Reuse values resulting from the modelled reuse-distance histogram in combination with different Level 1 cache sizes.

4. Measure multiprocessor system performance and scalability for the measured Instruction Reuse values.

Before describing the experiment architecture, the virtual prototyping technology used for modelling both the hardware and software parts of the design is introduced.

Chapter 3

Virtual Prototyping Technology

3.1 Virtual System Prototypes

A Virtual System Prototype (VSP) is a model of a complete embedded system, including its software; the hardware platform component of the VSP is called a Virtual Prototype. Characteristics such as performance and power for a complex system cannot be represented and computed as a formal mathematical problem. The only realistic way of determining such characteristics is through simulation.

One option for this simulation is hardware acceleration and/or emulation. Unfortunately, in addition to providing only limited visibility into the inner workings of the system, the highest level of abstraction supported by these solutions is the register transfer level (RTL) representation. As a result, development and evaluation cannot commence until late in the design cycle, when the hardware portion of the design is largely completed. In turn, this limits the design team's ability to explore and evaluate the hardware architecture. In addition, FPGA implementations of processors are typically slow, executing software at around 1 MIPS, about 50 times slower than a virtual processor model of the same processor [??].

A VSP is a pure software model of the entire system: that is, the combination of the virtual prototype and the software that will run on it. Fully evaluating the characteristics of a complex system may require performing many hundreds of experiments on various system configurations. Furthermore, it is not unusual for a single simulation to require that 100 billion instructions be run to reproduce a problem or to compute a representative result. This represents less than one hour of simulation time using a high performance, timing-accurate VSP. By comparison, the same simulation would take 100 to 500 hours or more using a typical timing-accurate structural instruction set simulator model, and 100,000 hours or more using an RTL model.

System architects use VSPs to explore the optimum architecture, while software developers use the VSP to develop software before hardware is available. Overall, simulation speed is more important for software developers, whereas accuracy (in terms of processor and bus cycles) is more important for hardware architects. Nevertheless, software developers often also require high degrees of accuracy, for example in real-time critical inter-processor communication. In turn, system architects are moving towards software-driven architecture analysis and optimization strategies, in which real software loads are used as stimulus for the architectural exploration.

An effective and efficient VSP simulation system must have the following characteristics:

• Near-Silicon Speed: The solution must be fast enough that the real software applications written for the SoC may be run on the VSP, including the operating system (OS) and any target application that may run on top of the OS.

• Complete System: The solution must model and simulate the whole system (processors, buses, peripherals, external hardware).

• Cycle-Accurate: The solution must retain accuracy; i.e., the simulated hardware must have timing associated with it that reflects the real hardware, including asynchronous events and multiple clock domains.

• Model Library: For architecture design productivity and efficiency, the system should offer a portfolio of processor, bus, and peripheral models.

• High-Speed Modeling Method: There must be a proven modeling method, orders of magnitude faster than traditional RTL simulation, by which high-speed, system-level modules of custom hardware are modeled in the VSP.

• Binary Compatibility: The solution must be capable of using the same target images that will be executed by the real hardware; that is, binary compatibility between the simulated and actual processor. The solution must also provide the capability to use commercial debugging and development tools for those applications.

• Configurable: The solution must include run-time configurability for as many parameters as possible; i.e., no recompilation should be necessary in order to try experiments with different parameters, such as the cache size of the processor models.

• Visibility: The solution must make available data-mining statistics and events that occur in the hardware system. For example, the VSP must be able to track things like instruction counts, cache statistics (hits, misses, fetches) and bus transactions.

Figure 3.1: Virtual Platform

3.2 Virtual Platforms

Virtual Platforms, which contain the underlying models of the system, are the building blocks of a Virtual System Prototype. As shown in Figure 3.1, a simple VSP usually consists of a single Virtual Platform containing one or more virtual devices: Virtual Processor Models (that model the actual processor and execute the software embedded within the product), Virtual Memory Models, Peripheral Device Models (that emulate the functionality of the peripheral hardware devices), and interconnections.

Virtual Processor Models

A Virtual Processor Model (VPM) emulates the behavior of the physical processor running the software written and compiled for that processor. A Virtual Platform can contain one or more VPMs.

A VPM runs the actual target code that is designed for the physical processor. This means that target software can be developed, executed and debugged exactly as with a physical prototype. However, using a VPM provides greater control and flexibility than an actual processor.

With a VPM, internal processor resources such as the cache size or the number of TLB entries can be configured, which would be impossible with a physical processor. Thus, the performance of the target code on various configurations of the processor can be analyzed. Moreover, after porting the software to a new VPM, the performance and suitability of various processors can be compared.

In order to accurately simulate the effects of the physical processor, a VPM has to be instruction cycle, bus cycle, and register cycle accurate.

Peripheral Device Models

A Peripheral Device Model (PDM) emulates the behavior of a physical device in the hardware architecture, such as an interrupt controller or a clock generator. Peripheral Device Models can connect directly to other PDMs, or interface to Virtual Processor Models using interconnections such as bus connections or asynchronous connections.

Proprietary (pre-built) device models can be used within a VSP. However, some Peripheral Device Models are unique to each platform and must be developed to suit the architecture.

3.3 VaST Systems Tools

Figure 3.2: CoMET window

VaST Systems is an Electronic Design Automation company which builds and markets system-level design tools and intellectual property to support the engineering of virtual system prototypes. It was founded by Graham Hellestrand, a professor of computer science and engineering at the University of New South Wales, Australia.

CoMET from VaST Systems was used to implement our experimental virtual multiprocessor system and to evaluate its performance. Some of the features of CoMET are:

• It has high performance, typically 20-100 MIPS, depending on the complexity of the platform and the performance of the host PC. It is therefore possible to run real applications at near real-time speeds.

• The simulation technology is cycle accurate.

• A library of models of commercially available processors, bus architectures, and peripheral devices is provided.

• Target images may be specified for each processor core in the design, and third-party debuggers (such as Lauterbach T32) are supported, so that users may use their standard environment to debug software in the virtual environment.

• Virtual processor model parameters (such as cache size and processor frequency) may be specified at run-time, so no recompilation of the system model is necessary.

• Through its Metrix profiling tool, CoMET enables tracing of system events so that system performance may be evaluated.

The CoMET window is shown in Figure 3.2. A VSP is constructed by adding instances of component modules in a hierarchical structure. Module instances, nets and port connections are added or edited using the XML standard view. Target software code for the ARM processors can be created and compiled within the CoMET environment.

Metrix is a component of CoMET which provides non-intrusive performance monitoring capabilities for the entire VSP, including the virtual processor models, buses and peripheral device models. Metrix consists of three components:

• VPM Metrix, which provides access to the Virtual Processor Models (VPMs) in a Virtual System Prototype. VPMs have features, such as visibility, that are not available with the actual hardware: they can provide details of the instruction path, registers, memory, and cache usage while executing, and can issue reports that summarize such activity over a user-determined period.

• Bus Metrix, which allows triggering and monitoring of bus accesses.

• Net Metrix, which allows the monitoring of logic, 32-bit vector and clock nets defined within the module hierarchy.

The output of a Metrix VPM can look as follows:

VpmCtrl Counter 1

VpmCtrl 7733 Total Instructions Executed, Using

VpmCtrl 12965 Cycles, with

VpmCtrl 0 Inst Page Table Walks, and

VpmCtrl 0 Data Page Table Walks

VpmCtrl 0 Page Table Walks

VpmCtrl

VpmCtrl Data Read Access Counts

VpmCtrl 3336 Total

VpmCtrl 3304 - Cache Hits

VpmCtrl 19 - Cache Miss Allocate new line in cache

VpmCtrl 13 - Cache Miss no allocate, Uncached Region

VpmCtrl 0 - Cache Miss no allocate, Cache Disabled

VpmCtrl 0 - Cache Miss, All ways locked

VpmCtrl 0 - Cache Miss, Hit Pending Buffer

VpmCtrl 0 - TLB Abort - Read access denied

VpmCtrl

VpmCtrl Data Write Access Counts

VpmCtrl 1326 Total

VpmCtrl 1189 - Cache Hits

VpmCtrl 0 - Cache Miss allocate new line in cache

VpmCtrl 100 - Cache Miss no allocate, Uncached Region

VpmCtrl 0 - Cache Miss no allocate, Cache Disabled

VpmCtrl 37 - Cache Miss no allocate, No write allocate

VpmCtrl 0 - Cache Miss, All ways locked

VpmCtrl 0 - Cache Miss, Hit Pending Buffer

VpmCtrl 0 - Cache Hit with Write Through attribute

VpmCtrl 0 - TLB Abort - Write access denied

VpmCtrl

VpmCtrl Inst Access Counts

VpmCtrl 0 - Pipeline Stall Ticks, Cache Miss

VpmCtrl 0 - TLB Abort - Execute access denied

VpmCtrl

VpmCtrl Cache Counts

VpmCtrl 19 Data Cache Line Fill

VpmCtrl 0 Data Cache Write Back (sub line)

VpmCtrl 24 Instruction Cache Line Fill

VpmCtrl 0 Inst. Line Not Cacheable
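Such a report contains everything needed to evaluate the Instruction Reuse metric of Chapter 2. As an illustration (our sketch, not a CoMET feature), assuming 32-byte cache lines and 32-bit instructions so that each line fill loads eight instructions:

#include <stdio.h>

int main(void)
{
    const long instructions_executed = 7733; /* "Total Instructions Executed" */
    const long icache_line_fills     = 24;   /* "Instruction Cache Line Fill" */
    const long instr_per_line        = 32 / 4;

    double reuse = (double)instructions_executed
                 / (double)(icache_line_fills * instr_per_line);
    printf("Instruction Reuse ~ %.1f\n", reuse);
    return 0;
}

For the report above this gives 7733 / 192 ≈ 40.3 instructions executed per instruction loaded into the Level 1 cache.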

Chapter 4

Experiment System Architecture

The final purpose of this thesis is to model a workable multiprocessor system for evaluating the effect of Instruction Reuse on the performance and scalability of multiprocessor systems. Therefore, one of the primary goals for the chosen system architecture was to be easily configurable and scalable, and at the same time to gather the experience necessary for building a real system in the future.

This chapter describes the modelled virtual multiprocessor architecture. The processor, memory system, and buses are discussed in the Hardware Architecture section, while the target code running on the processors is described in the Software Architecture section. The whole system was modelled with the VaST CoMET virtual prototyping tools described in Chapter 3.

4.1 Hardware Architecture

Figure 4.1: Experiment System Architecture

Processor   No. of CPUs                      1 to 8
            Instruction Size                 32-bit

L1 Cache    Latency                          one CPU clock cycle
            Instruction Cache Size per CPU   4 KB to 512 KB
            Data Cache Size per CPU          4 KB to 512 KB
            Set Associativity                4-way
            Line Size                        32 Bytes
            Instruction Cache                virtually indexed, physically tagged
            Data Cache                       physically indexed, physically tagged

L2 Cache    Latency                          6 clock cycles
            Size                             16 KB to 1 MB
            Set Associativity                Direct Mapped
            Line Size                        32 Bytes

Memory      Latency                          28 CPU clock cycles
            CPU to MEM Frequency Ratio       2:1
            CPU to MEM Bandwidth Ratio       2:1

Table 4.1: Experiment System Configuration

A block diagram of the experiment system architecture is shown in Figure 4.1. It contains the following device modules, which simulate the functionality of the hardware devices indicated in parentheses:

• VaST Arm11MPCore (Arm11MPCore)

• VaST ARM L220 Cache Controller (Arm L220 Cache Controller)

• VaST ARM AXI PL300 Interconnect (ArmAxiPl300)

• VaST ARM AXI PL340 Memory Controller (ArmAxiPl340)

• VaST Gp Memory (GenericMemory)

• VaST StdBus AXI (AMBA 3 AXI Protocol)

• VaST StdBus AHB (AMBA 3 AHB Protocol)

• VaST StdBus APB (AMBA 3 APB Protocol)

4.1.1 Arm11 MPCore

Figure 4.2: Arm11 MPCore block diagram

The processor is the most important element in a multiprocessor system, because it influencesboth the hardware and software design. It determines the hardware interfaces, which connectthe processor to the rest of the system and it influences the choice of the operating system or


the structure and functionality of the applications running on it.

The Arm11MPCore was chosen for the following reasons:

• Can be configured to contain between one and four processors.

• Both data and instruction caches can be configured individually for each processor, with support for full data coherence.

• Ability for data to move between the processors' caches, permitting rapid data sharing without accesses to main memory.

• Either dual or single 64-bit AMBA 3 AXI bus interfaces providing high bandwidth.

• Support for both asymmetric multiprocessing (AMP) and symmetric multiprocessing (SMP) programming.

• Designed for low power by providing gate-level shutdown of unused resources and supporting the ability of each processor to go into standby, dormant or power-off energy management states.

The Snoop Control Unit

A block diagram of the Arm11MPCore processor is shown in Figure 4.2. The Snoop Control Unit (SCU) is a key component of the MPCore solution, as it interfaces up to four multiprocessing CPUs with each other and with an L2 memory system. Individual CPUs can be dynamically configured to operate in a symmetric (SMP mode) or asymmetric (AMP mode) manner, i.e. taking part in the L1 coherency or not. The SCU manages the coherent traffic between CPUs marked as being part of the coherent system and routes all non-coherent/instruction traffic between CPUs and L2 memory through its dual 64-bit AMBA AXI ports. In order to limit the number of requests to individual CPUs, the SCU contains a duplicate of all CPU L1 physical tag RAMs, so it sends a coherent request only to the CPUs that contain the target data line.

The MPCore implements a modified MESI write-invalidate protocol. MESI stands for the four possible states of a data line in a CPU cache:

• Modified: The data line is in one CPU cache only and has been written to.

• Exclusive: The data line is in one CPU cache only and has not been modified.

• Shared: The data line is in multiple CPUs' caches and has not been modified.

• Invalid: The cache line is empty, or the line may be present in another CPU's cache in the Modified state.


The MESI protocol has been modified in order to reduce the amount of coherency commands and data traffic on the bus. When a CPU reads a data line and allocates it into its cache, the MESI protocol states that the line should be in the Shared state, whether or not it is already in another CPU's cache, and then moves its state to Exclusive if the CPU requests it. In MPCore, if the data is not in any other CPU's cache, the data line is marked as Exclusive from the start, removing the need for an additional coherency command.

Another optimization is known as Direct Data Intervention (DDI), which consists of passing data directly from one CPU to another, without having to request the data from the Level 2 memory. If the data line in the source CPU was in the Modified state, the data is written to Level 2 anyway (the data line is then in the Shared state), but the destination CPU gets its data directly from the source CPU, without having to perform an additional request to Level 2 to get the updated data.

The last improvement, called Migratory Lines support, is based on additional logic capable of detecting that a line is moving across multiple CPUs. Instead of writing the dirty data back to Level 2 on each line migration, the data line is allocated into the destination CPU's Level 1 cache as dirty (Modified state). This prevents any useless writes to the Level 2 memory system until the data line ceases to be migratory and is brought back coherent with Level 2.

In our system, the ARM11MPCore Virtual Processor Model from VaST Systems was used; it provides all the properties of the real processor, while offering more flexibility in terms of performance analysis. With profiling, a cycle- and instruction-accurate trace of the application being executed can be obtained. Moreover, cache statistics are recorded, which is helpful in evaluating the system and locating problems.

The Level 1 Cache

Each MPCore CPU has separate instruction and data caches, which have the following features:

• The instruction and data caches can each be configured to sizes between 16KB and 64KB. The VaST Virtual Processor Model allowed for cache sizes of any power of 2.

• Both caches are 4-way set-associative.

• The cache line length is 8 words or 32 bytes.

• Each cache can be sized or disabled independently, using the CP15 system control coprocessor.

Cache operations are controlled through a dedicated coprocessor, CP15, integrated within the core. This coprocessor provides a standard mechanism for configuring the level one memory system. The CP15 registers can be accessed with the MRC and MCR assembler instructions. The assembler syntax for these instructions is:

MRC{cond} P15,<Opcode_1>,<Rd>,<CRn>,<CRm>,<Opcode_2>

MCR{cond} P15,<Opcode_1>,<Rd>,<CRn>,<CRm>,<Opcode_2>


Function Assembler Instruction

Instruction cache invalidate MCR p15, 0, R0, c7, c5, 0

Clean and invalidate cache MCR p15, 0, R0, c7, c14, 0

TLB Invalidate MCR p15, 0, R0, c8, c7, 0

Table 4.2: Cache and TLB Operation Functions
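
For illustration, the operations in Table 4.2 can be issued from C through GCC inline assembly, as supported by the ARM-ELF GCC toolchain used in this work (Section 4.2.1). The sketch below is a minimal example; the function names are illustrative only, while the MCR encodings are the ones listed in Table 4.2.

/* Minimal sketch: issuing CP15 cache and TLB maintenance operations
 * from C via GCC inline assembly. Function names are illustrative. */
static inline void icache_invalidate(void)
{
    unsigned int zero = 0;
    asm volatile("mcr p15, 0, %0, c7, c5, 0" : : "r"(zero));
}

static inline void dcache_clean_invalidate(void)
{
    unsigned int zero = 0;
    asm volatile("mcr p15, 0, %0, c7, c14, 0" : : "r"(zero));
}

static inline void tlb_invalidate(void)
{
    unsigned int zero = 0;
    asm volatile("mcr p15, 0, %0, c8, c7, 0" : : "r"(zero));
}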

CPU synchronization

Some additions to the ARMv6 architecture are implemented in the Arm11MPCore for multiprocessing support, such as the 64-bit non-bus-locking exclusive read and write instructions LDREXD and STREXD. Exclusive loads and stores are a way to implement interprocess communication in multiprocessor and shared-memory systems, where the load and store operations are not atomic.

Moreover, these instructions rely on the ability to tag a physical address as exclusive-access for a particular processor. This tag is later used to determine if an exclusive store to that address succeeds. The system guarantees that if the data that has been previously loaded has been modified by another CPU, the store fails and the load-store sequence must be retried.

• LDREX loads data from memory. If the physical address has the Shared TLB attribute,LDREX tags the physical address as exclusive access for the current processor, and clearsany exclusive access tag for this processor for any other physical address.

• STREX performs a conditional store to memory. The conditions are:

– If the physical address has the Shared TLB attribute, and the physical address is tagged as exclusive access for the executing processor, the store takes place, the tag is cleared, and the value 0 is returned in Rd.

– If the physical address has the Shared TLB attribute, and the physical address is not tagged as exclusive access for the executing processor, the store does not take place, and the value 1 is returned in Rd.

An example of how a synchronization semaphore can be implemented is provided below:

tryAgain:

ldrex r2, [r1] ; load semaphore and set exclusive

orr r0, r0, r2 ; update the semaphore

strex r2, r0, [r1] ; if still exclusive access then store

cmp r2, #0 ; did this succeed?

bne tryAgain ; no, try again
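
The same load-store sequence can also be wrapped in C with GCC inline assembly, which is how initialization code might use it. This is a hedged sketch; atomic_or() is a hypothetical helper, not a routine from the thesis code.

/* Hypothetical C wrapper around the LDREX/STREX sequence shown above.
 * Atomically ORs 'mask' into the semaphore word at 'addr'. */
static inline void atomic_or(volatile unsigned int *addr, unsigned int mask)
{
    unsigned int value, failed;
    do {
        asm volatile(
            "ldrex %0, [%2]\n"      /* load and tag exclusive access */
            "orr   %0, %0, %3\n"    /* update the semaphore value    */
            "strex %1, %0, [%2]\n"  /* store if still exclusive      */
            : "=&r"(value), "=&r"(failed)
            : "r"(addr), "r"(mask)
            : "cc", "memory");
    } while (failed);               /* STREX returned 1: retry       */
}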


4.1.2 Arm L220 Cache Controller

The addition of an on-chip secondary cache, also referred to as a Level 2 cache, is a recognized method of improving system performance when significant memory traffic is generated by the processor. By definition, a secondary cache assumes the presence of a Level 1 cache, closely coupled or internal to the CPU. Memory access is fastest to the Level 1 cache, followed closely by the Level 2 cache. Access to Level 3 memory or main memory is significantly slower.

Memory Type          Typical Size   Access Time
Processor registers  64 B           1 cycle
Level 1 Cache        32 KB          1-2 cycles
Level 2 Cache        128 KB         8 cycles
Off-chip memory      MB or GB       30-42 cycles

Table 4.3: Typical memory sizes and access times

The Cache Controller has the following features:

• Physically addressed and physically tagged

• Fixed line length of 32 bytes (eight words or 256 bits)

• Cache size can be configured from 16KB to 2MB

• Configurable set-associativity from Direct Mapped to 8-way associativity

• Configurable latency from 1-8 cycles

• Designed to work with 64-bit AXI master and slave interfaces

Unlike the Level 1 cache, the L220 Cache Controller is configured using memory-mapped registers. In our design, the Level 2 cache was configured as Direct Mapped with a latency of 8 cycles. The cache size was varied from 16 KB to 128 KB.

4.1.3 ARM AXI PL300 Interconnect

The PrimeCell AXI Configurable Interconnect (PL300) is a high performance interconnect modelthat provides connectivity between one or more AXI Masters and one or more AXI Slaves.

The PL300 supports a full multi-layer connection of all master and slave interfaces on all of the AXI channels. Multi-layer interconnect enables parallel access paths between multiple masters and slaves in a system, which increases data throughput and decreases latency.

Write data interleaving enables the interconnect to combine write data streams from different physical masters to a single slave. This is useful because you can combine write data from a


fast master with write data from a slow master, and consequently increase the throughput of data across the interconnect.

In any interconnect that is connected to a slave that reorders read or write signals, there is the potential for deadlock. To prevent this, the PL300 provides arbitration priority and three cyclic dependency schemes that enable the slave interface to accept or stall a new transaction address.

The following list highlights the functionality available:

• Compliant with the AMBA 3 AXI Protocol v1.0 Specification

• Multi-layer capability to allow multiple masters to access different slaves simultaneously

• Automatically connect to buses of varying data width (32, 64, 128 or 256 bits wide)

• Independently configurable number of Slave and Master Interfaces.

• Each slave interface has configurable: Read and Write transaction acceptance, arbitrationpriority and cyclic dependency scheme.

• Each master interface has configurable: Read or Combined Issuing capability, Write In-terleave depth, and Arbitration scheme.

• Supports read and write data interleaving for increased data throughput

4.1.4 ARM AXI PL340 Memory Controller

The PL340 memory controller is a high-performance, area-optimized SDRAM memory controllercompatible with the AMBA 3 AXI protocol.

The following list highlights the functionality available:

• Highly configurable via APB protocol register interface

• Multiple active read and write transactions via AXI protocol Slave Interface

• Timing accurate internal modeling of DRAM devices

• Automatically connects to AXI buses of 4, 8 or 16 bytes data width

• Support for Exclusive Access transactions

Before the PL340 memory controller can be used to access external memory, its internal configuration registers must be set up and the external memory must be initialized. Table 4.4 lists the main configuration values used to model the memory:


Symbol   Memory Cycles   Description
CAS      5               Column Address Strobe latency
T_RCD    2               RAS to CAS minimum delay
T_RP     2               Precharge to RAS delay
T_RAS    9               Row Address Strobe to Precharge delay

Table 4.4: Memory Controller configuration

Figure 4.3: Memory Timing

4.1.5 Memory

High-speed memory can increase the speed of the system dramatically. The VaST generic memory model is used to model memory blocks with configurable timing, such as ROM or RAM. The following list highlights the functionality available:

• Supports Read, Write, Fetch and Load access types

• Supports memory paging.

• Supports exclusive access.

• Connects to AHB, AHB Lite and APB bus protocols.

• Configurable memory width and size.

• Configurable burst read and burst write limit.

• Configurable first read/write delay and next read/write delay.

The memory timing can be configured for:

• InitialRead/InitialWrite delay: indicates the number of bus clock cycles inserted on initiating the first read/write burst to memory. In our system it is set to 1 clock cycle.

• FirstRead/FirstWrite delay: indicates the number of bus clock cycles on the data phase inserted for the first read/write of a memory width of data in a burst. In our system it is set to 1 clock cycle.


• NextRead/NextWrite delay: indicates the number of bus clock cycles inserted for each subsequent read/write of a memory width of data in a burst. In our system it is set to 1 clock cycle.

The burst read and burst write limit is set to 8 beats, and the memory timing is shown in Figure 4.3.

4.1.6 Buses

The VaST Standard Bus (StdBus) provides an interface to processors, peripheral devices and memory models, and represents the standard concept of address and data phases, along with their associated timing, in a bus transaction. The following bus protocols were used in our architecture: AXI, AHB and APB. These protocols provide a single interface definition for describing interfaces:

• between a master and the interconnect

• between a slave and the interconnect

• between a master and a slave.

In order to resolve bus access in a multi-master system, the following arbitration algorithms are used: First Come, Round Robin and Fixed Priority.

AXI Bus Protocol Support

The AMBA 3 AXI protocol supports several features that make it suitable for high-performance, high-frequency system designs. These include:

• separate address/control and data phases

• separate read and write data channels

• burst-based transactions with only start address issued

• out-of-order transaction completion

• ability to issue multiple outstanding addresses

The AXI protocol is burst-based. Every transaction has address and control information on the address channel that describes the nature of the data to be transferred. The data is transferred between master and slave using a write data channel to the slave or a read data channel to the master. In write transactions, in which all the data flows from the master to the slave, the AXI protocol has an additional write response channel to allow the slave to signal the completion of the write transaction to the master.

Out-of-order transaction completion means that transactions with the same ID tag are completed in order, but transactions with different ID tags can be completed out of order. Out-of-order transactions can improve system performance in two ways:


• The interconnect can enable transactions with fast-responding slaves to complete in advance of earlier transactions with slower slaves.

• Complex slaves can return read data out of order. For example, a data item for a later access might be available from an internal buffer before the data for an earlier access is available.

AHB Bus Protocol Support

The AMBA 3 AHB interface specification enables highly efficient interconnect between simpler peripherals in a single-frequency subsystem where the performance of AMBA 3 AXI is not required. The features include:

• separate address/control and data phases

• separate read and write data channels

The master starts a transfer by driving the address and control signals. These signals provide information about the address, direction and width of the transfer, and indicate if the transfer forms part of a burst. Transfers can be single, incrementing bursts, or wrapping bursts that wrap at the address boundaries.

APB Bus Protocol Support

The APB protocol is used when low-bandwidth transactions are necessary to access configuration registers in peripherals and to carry data traffic through low-bandwidth peripherals. It is used to isolate this data traffic from the high-performance AXI and AHB interconnects, and thus to reduce the power consumption of a design.


4.2 Software Architecture

The main aim of our software architecture is to model the instruction locality of a typical application. As explained in Section 2.2, an instruction reuse-distance histogram summarizes instruction locality information and, in combination with a configurable cache architecture, permits the measurement of Instruction Reuse values.

The software architecture consists of two parts: system initialization and the TEST application. The system initialization phase implements all the functionality required to configure the multiprocessor system. The TEST application contains assembler code that matches the modelled benchmark instruction reuse-distance histogram.

4.2.1 System Initialization

The target code is usually a stand-alone program, such as an operating system, which has access to I/O devices and/or a file system. Target applications compiled to run on top of a target operating system (OS), such as WinCE or Linux, would normally have access, via the OS, to the I/O devices and/or file system. However, using the VaST VPM, a target application can also run without the support of a target operating system.

The ARM-ELF GCC compiler and assembler are used to build the target executable. The architecture was set up with a default memory map. The GNU ARM-ELF GCC compiler supports linker directives embedded in the start-up files, which have been tailored to produce a binary executable that loads the image into SDRAM.

Memory Map

The device and memory configuration are defined in a platform configuration file. At startup, this configuration file is read and all memory regions are configured with the corresponding physical addresses.

The SDRAM memory is 64MB in size, starts at physical address 0x0, and is split into 64 pages. The TCM memory is modelled as 1MB in size and therefore fits in one page.

Memory   Start Address   Size    Memory Width   Page Size
SDRAM    0x0000 0000     64 MB   32 bit         1 MB
TCM      0x1000 0000     1 MB    64 bit         1 MB

Table 4.5: Memory Map


The Linker

The purpose of the ARM linker is to combine the contents of one or more object files (the output of a compiler or assembler) with selected parts of one or more object libraries, to produce an executable program. The ELF file format is used for the output image and specifies an executable binary image made up of several sections starting from virtual address 0x0. Moreover, the linker is used to create a virtual map of these sections by specifying the base virtual address of each section in the output image.

Figure 4.4: Output image structure.

Each ELF file is made up of one ELF header, followed by file data. The file data includes the following sections:

• Program Header: contains assembler directives to load the ELF image in SDRAM at address 0x0; it initializes the ARM11 MPCore and calls the main() function.

• TCM Code: contains the code that shall be copied to the TCM memory.

• Page Table: contains the modified Page Table code.

• .text: contains all other executable instruction code of the compiled program.

• .data: contains the initialized global and static variables and their values.

The Page Table

A Page Table is a data structure used to store the mapping between virtual addresses (used in the ELF image) and physical addresses (unique to TCM and SDRAM) and to set the attributes of each page. While the code and data residing in SDRAM memory are cacheable, TCM is used to hold critical code where the unpredictability of a cache is not desired. Therefore, TCM addresses are non-cacheable.

Output image section   Mapped to   Attribute
Program Header         SDRAM       Shared, cacheable
TCM Code               TCM         Shared, non-cacheable
Page Table             SDRAM       Shared, cacheable
.text                  SDRAM       Shared, cacheable
.data                  SDRAM       Shared, cacheable

Table 4.6: Page Table
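
To make the attribute encoding concrete, the following sketch builds first-level 1 MB section descriptors in the ARMv6 short-descriptor format for two of the mappings in Table 4.6, using the physical addresses from Table 4.5. The map_section() helper, the page table base address, and the access-permission choice are illustrative assumptions, not the actual thesis code.

#include <stdint.h>

/* Sketch of first-level page table entries (ARMv6 short-descriptor
 * format, 1 MB sections). Helper and table base are illustrative. */
#define SECTION        0x2u        /* descriptor type bits [1:0] = 10 */
#define BUFFERABLE     (1u << 2)   /* B bit                           */
#define CACHEABLE      (1u << 3)   /* C bit                           */
#define AP_FULL_ACCESS (3u << 10)  /* AP = 11: full read/write access */
#define SHARED         (1u << 16)  /* S bit: shared between CPUs      */

static uint32_t *const page_table = (uint32_t *)0x00004000; /* assumed */

/* Map the 1 MB page containing 'virt' onto 'phys' with 'attr'. */
static void map_section(uint32_t virt, uint32_t phys, uint32_t attr)
{
    page_table[virt >> 20] = (phys & 0xFFF00000u) | attr | SECTION;
}

void build_page_table(void)
{
    /* SDRAM code/data: shared and cacheable (see Table 4.6). */
    map_section(0x00000000u, 0x00000000u,
                SHARED | CACHEABLE | BUFFERABLE | AP_FULL_ACCESS);
    /* TCM: shared but non-cacheable, as cache behavior is not desired. */
    map_section(0x10000000u, 0x10000000u, SHARED | AP_FULL_ACCESS);
}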


The main() function

The purpose of the main() function is to:

1. configure the PL340 Memory Controller

2. copy TCM code to TCM memory (if applicable)

3. configure and initialize the L220 Cache Controller (if applicable)

4. set the Page Table

5. call the TEST application

Device configuration and initialization is different in a multiprocessor system than in a uniprocessor system, because some code needs to be executed by only one processor, while other code must be executed by all processors. Therefore, processor cooperation is essential. For example, steps 1-3 above need to be executed by only one processor, while step 4 by all processors. The control flow graph of the main() function is shown in Figure 4.5, and a sketch of this gating logic is given after the figure.

Figure 4.5: Flowchart main() function
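
The sketch below illustrates one way this cooperation can be coded: CPU 0 performs the one-time device initialization while the other CPUs wait on a flag, and every CPU then installs the page table. All helper routines and the init_done flag are hypothetical placeholders; reading the CPU ID via CP15 is shown as one possible implementation, not the actual thesis code.

/* Hedged sketch of the cooperation in main(). The helpers below are
 * placeholders for the configuration code described in this section. */
static void configure_pl340(void) { /* PL340 register setup elided   */ }
static void copy_tcm_code(void)   { /* copy the TCM Code section     */ }
static void configure_l220(void)  { /* L220 setup, see Section 4.1.2 */ }
static void set_page_table(void)  { /* install the Page Table        */ }
static int  TEST(void)            { return 0; /* benchmark elided */  }

static volatile int init_done = 0;

/* One possible way to identify the executing CPU on the ARM11 MPCore:
 * reading the CPU ID register via CP15 (c0, c0, 5). */
static inline unsigned int get_cpu_id(void)
{
    unsigned int id;
    asm volatile("mrc p15, 0, %0, c0, c0, 5" : "=r"(id));
    return id & 0x3u;
}

int main(void)
{
    if (get_cpu_id() == 0) {
        configure_pl340();      /* step 1: executed by one CPU only */
        copy_tcm_code();        /* step 2 */
        configure_l220();       /* step 3 */
        init_done = 1;          /* release the waiting CPUs         */
    } else {
        while (!init_done)      /* other CPUs spin until CPU 0 ends */
            ;
    }
    set_page_table();           /* step 4: executed by every CPU    */
    return TEST();              /* step 5: run the benchmark        */
}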


4.2.2 The Modelled OS Benchmark

Figure 4.6: Basis for the OS instruction reuse distance benchmark

In order to obtain results that are independent of specific implementation details, like the application being executed or the cache line size, the concept of Instruction Reuse is used as the reference for measuring system performance and scalability. This concept is introduced in Section 2.3.

Different Instruction Reuse values are obtained by changing either the application (represented by its instruction reuse-distance histogram) or the cache configuration. Therefore, the disadvantage of this technique is that benchmarks are required in order for designers to quickly estimate in what range of Instruction Reuse values their particular application-cache combination is situated.

The TEST application was modelled to serve as a benchmark for the instruction reuse-distance of an operating system. The significant number of processor stall cycles caused when workloads include the operating system motivates a thorough characterization of the effect of operating system references. Excluding the operating system's references has caused cache miss-rate estimates to be optimistic because:

• the working sets of system references are much larger than single process user workloads

• system code is less repetitive than user code

• interruption of user activity by system calls, or by other user processes, tends to evict portions of user code from the cache.

Due to the limited amount of time available for the thesis, it was decided not to port an operating system to the modelled experimental platform. Moreover, in [4, 3], the RDVIS tool is used to measure and visualize the reuse-distance histogram of data references. Unfortunately, we cannot use this tool to obtain the instruction reuse-distance histogram of a current operating system, as it is limited to visualizing only load/store instructions.


Figure 4.7: Benchmark OS instruction reuse-distance histogram

However, in [20] the instruction cache performance of the operating system is studied. The operating system used in their experiments is Alliant's Concentrix 3.0, which is based on Unix BSD 4.2.

Figure 4.6 shows the number of intervening operating system instruction words referenced between two consecutive calls to the same routine in the same operating system invocation. The data corresponds to the 10 most frequently invoked routines in the operating system and is the average of four workloads [20].

What is important to note is the shape of the histogram: the majority of routine invocations have a small number of intervening instructions, i.e. a small instruction reuse-distance.

Figure 4.6 differs from an instruction reuse-distance histogram because the reuse-distance (in number of instructions) of OS routines was measured, as opposed to the reuse-distance of individual instructions. However, routines can be viewed as very complex instructions. Therefore, Figure 4.6 provides a good basis for modelling a benchmark instruction reuse-distance histogram.

The modelled benchmark histogram is shown in Figure 4.7. The code modelling the benchmark instruction reuse-distance histogram of an operating system is contained in the TEST() function. The instruction reuse-distance values are given in KBytes for easier comparison with cache size. The conversion is possible because all instructions have a fixed length of 32 bits, or 4 Bytes.

Each individual instruction reuse-distance in the histogram is modelled as one basic block of instructions. A basic block is code that has one entry point (no code within it is the destination of a branch instruction), one exit point, and no jump instructions contained within it. The start


Time 1 2 3 4 5 6 7 8 9 . . .

Memory Address A1 A2 A3 A1 A2 A3 A1 A2 A3 . . .

Instruction I1 I2 I3 I1 I2 I3 I1 I2 I3 . . .

Instruction Reuse-distance ∞ ∞ ∞ 2 2 2 2 2 2 . . .

Table 4.7: Instruction Reuse Distance is modelled by loops.

of a basic block may be jumped to from more than one location. The end of a basic block may be a branch instruction or the statement before the destination of a branch instruction.

The instruction reuse-distance quantifies temporal locality, which is modelled by loops. When the destination of the branch at the end of a basic block is the start of the same basic block, a loop is created. All instructions in the loop have a reuse-distance given by the loop size (the number of instructions in the loop) minus 1. An example with a loop size of 3 is provided in Table 4.7.
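
A minimal sketch of one such basic block is shown below, written as GCC inline assembly in the style of the target code. The instruction and iteration counts are arbitrary examples; the GNU assembler .rept directive stands in for the generated straight-line register-to-register code of the TEST() function.

/* Sketch: one basic block whose loop body consists of 1023 ADDs plus
 * the loop bookkeeping, giving a reuse-distance of loop size minus 1.
 * The counts are illustrative, not the actual TEST() parameters. */
static void basic_block(unsigned int iterations)
{
    asm volatile(
        "mov   r4, %0\n"        /* loop counter                     */
        "1:\n"                  /* entry point of the basic block   */
        ".rept 1023\n"          /* straight-line register ops       */
        "add   r5, r5, #1\n"
        ".endr\n"
        "subs  r4, r4, #1\n"    /* loop bookkeeping                 */
        "bne   1b\n"            /* backward branch closing the loop */
        : : "r"(iterations) : "r4", "r5", "cc");
}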

The control flow graph of the TEST() function is shown in Figure 4.8. Between the first instruction beginning a basic block and the branch instruction ending a basic block, a number of instructions equal to the instruction reuse-distance is executed. Note that all instructions in the basic block have the same reuse-distance.

The number of times each basic block is executed is representative of the percentage of instructions executed with a particular instruction reuse-distance, i.e. the Y-axis of the instruction reuse-distance histogram.

In our modelling, the following assumptions and simplifications are made:

1. The ARM11 MPCore is used to model a multiprocessor system. However, we want to obtain measurements that are independent of the choice of processor. Therefore, the number of branch instructions is minimized in order to eliminate the effect of the processor pipeline. Thus, almost all instructions perform register-to-register operations and the ideal processor performance is modelled. This implies that a real application will perform worse than our benchmark.

2. The effect of data instructions is not taken into consideration. The code modelling instruction reuse-distance does not contain any load or store instructions. This is motivated by the work published in [2] and described in the multiprocessor system example given in the Motivation section. It was found that the misses to the instruction cache, not to the data cache, were causing the majority of processor stall cycles and were limiting system scalability.

3. Instruction reuse-distance is a measure of temporal locality only. Spatial locality depends not only on the cache implementation, such as cache block sizes and cache associativity, but also on the program implementation, such as data placement. Therefore, in our benchmark model, spatial locality is modelled as optimal in order to abstract from such implementation details. Because of this, the performance of a real system will be lower than for our benchmark.


Figure 4.8: TEST() function control flow graph

The table below shows the type and total number of instructions executed in modelling the benchmark instruction reuse-distance histogram.

Instruction     Number Executed   Percent Executed
MOV, ADD, CMP   15 346 441        99.92%
Branch          11 803            0.08%

Table 4.8: Type and number of instructions executed.


Chapter 5

Simulation Results & Analysis

In this chapter, the results of our simulation experiments are described. The concept of Instruction Reuse, described in Section 2.3, is used in order to evaluate the performance and scalability of a multiprocessor system.

For the purpose of our experiments, the TEST benchmark (Section 4.2.2) modelling the instruction reuse-distance histogram of an operating system was used. In our simulations, we vary only the cache size, since all other cache parameters are fixed for the ARM11 MPCore processor.

To evaluate the effects of Instruction Reuse on a multiprocessor system, we simulate the TEST benchmark both in the Symmetric Multiprocessing (SMP) configuration and in the Asymmetric Multiprocessing (AMP) configuration. In SMP mode, there is only one instance of the TEST benchmark code in SDRAM, which is run by all processors. In AMP mode, there are several copies of the TEST benchmark code in memory, and each processor executes a different copy. Here, the ideal behavior of both multiprocessing modes is modelled. In a real SMP system, applications from different memory regions may be executed, while in AMP mode, code might still be shared with other processors.

The total number of instructions executed, the processor clock cycles, and the instruction cache line fills are measured using Metrix. An example of Metrix output is given in Section 3.2.


5.1 Effect of Cache Size on Instruction Reuse

(a) TEST benchmark histogram (b) Instruction Reuse as a function of cache size

Figure 5.1: Instruction Reuse as cache size is varied for the TEST benchmark.

Table 5.1 shows the Instruction Reuse values as a function of cache size for the TEST benchmark (Figure 5.1(a)). The Instruction Reuse values are plotted in Figure 5.1(b) and are computed according to the definition given in Section 2.3. For convenience, the formula is shown again below:

\[
\text{Instruction Reuse} = \frac{\text{Number of instructions executed}}{\text{Number of instructions loaded in Level 1 cache}}
\]

Instructions Executed          15 358 244
Instructions Per Cache Line    8

Level 1 Cache Size [KB]   Disabled        4        8       16      32      64     128     256     512
Cache Line Fills                 -  1299285  1014593   722563  353735  291126  104394   33927    9372
Instruction Reuse                0     1.48     1.89     2.66    5.43    6.59   18.39   56.59  204.84

Table 5.1: Instruction Reuse as cache size is varied for the TEST benchmark
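
As a sanity check, the sketch below recomputes the Instruction Reuse values of Table 5.1 from the measured cache line fills. The only assumption is the fixed 32-bit instruction size from Section 4.2.2, which yields 8 instructions per 32-byte cache line.

#include <stdio.h>

/* Recompute the Instruction Reuse values of Table 5.1 from the
 * measured instruction cache line fills (numbers from the table). */
int main(void)
{
    const double instructions_executed = 15358244.0;
    const double instructions_per_line = 8.0; /* 32-byte line, 4-byte insn */
    const long line_fills[] = {1299285, 1014593, 722563, 353735,
                               291126, 104394, 33927, 9372};
    const int cache_kb[] = {4, 8, 16, 32, 64, 128, 256, 512};

    for (int i = 0; i < 8; i++) {
        /* Instructions loaded into L1 = line fills x instructions/line. */
        double reuse = instructions_executed /
                       (line_fills[i] * instructions_per_line);
        printf("%3d KB: Instruction Reuse = %6.2f\n", cache_kb[i], reuse);
    }
    return 0;
}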

When the Level 1 cache is disabled, all instructions are fetched from main memory. Therefore, the Instruction Reuse is zero. As expected, when the cache size is increased, the number of instructions fetched from main memory decreases, due to the principle of locality, and the Instruction Reuse increases accordingly.

For a 512 KB cache size, the TEST benchmark code fits completely in the cache and the only instructions loaded into the cache are compulsory cache misses. This corresponds to the maximum Instruction Reuse, because any further increase in the cache size will not decrease the number of instructions loaded from main memory.

If we recall the example given in the Motivation section, for TCP/IP protocol processing under Linux the Instruction Reuse was measured to be around 2 for a Level 1 cache size of 16 KB [2]. We note that the TEST benchmark shows a similar Instruction Reuse for the same cache size.


5.2 The Low Instruction Reuse Problem

In a single-processor system, the IPC is computed simply as the number of instructions executed by the processor divided by the number of clock cycles required to execute them. In a multiprocessor system, there are several processors operating in parallel, and thus the IPC is computed for the entire system as follows:

\[
\text{IPC} = \frac{\sum (\text{Number of instructions executed per CPU})}{\text{Maximum number of clock cycles}}
\]

Figures 5.2 and 5.3 show the effect of Instruction Reuse on the system IPC in both SMP and AMP modes, as the number of processors is increased from 1 to 8.

If the Instruction Reuse is low, our multiprocessor system does not scale to more than two processors. From Figure 5.2, it can be seen that while a second CPU slightly helps to improve the system IPC, the addition of a 3rd or 4th CPU does not result in any performance improvement. This is explained by the fact that when Instruction Reuse is low, a high number of processor cache line fills occurs, which in the worst case must be brought from main memory, consuming a total bandwidth of:

\[
\text{Total BW consumed} = \text{Number of CPUs} \cdot \frac{\text{Cache line size}}{\text{Memory Latency}}
\]

In our example, adding the second core compensates for the memory latency, and thus a performance improvement can still be measured. However, increasing the number of CPUs increases the total amount of bandwidth consumed beyond the limit the memory system can satisfy. Increasing the number of CPUs once the memory bandwidth limit has been reached results in no additional performance improvement.
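
To make the saturation effect concrete, the sketch below evaluates the formula with the parameters of Table 4.1 (32-byte cache lines, 28-cycle memory latency). The available-bandwidth figure is a hypothetical placeholder, chosen only to show where the consumed bandwidth crosses a fixed limit.

#include <stdio.h>

/* Evaluate the worst-case bandwidth formula above with the Table 4.1
 * parameters. The 'available' figure is a hypothetical placeholder. */
int main(void)
{
    const double line_size = 32.0;  /* bytes per cache line fill       */
    const double latency   = 28.0;  /* CPU clock cycles per line fill  */
    const double available = 2.0;   /* assumed sustainable bytes/cycle */

    for (int cpus = 1; cpus <= 8; cpus++) {
        double consumed = cpus * line_size / latency; /* bytes/cycle */
        printf("%d CPUs: %.2f bytes/cycle%s\n", cpus, consumed,
               consumed > available ? " (exceeds assumed limit)" : "");
    }
    return 0;
}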

This is an important result because it proves through simulation that a low instruction locality limits the performance and scalability of multiprocessor systems. For example, an Instruction Reuse of almost 2 was measured for the MPSoC example in the Motivation section, which explains the high number of stall cycles.

If the Instruction Reuse is high, the performance scales almost linearly with the number of processors. In Figure 5.3, for a single processor, the ideal IPC is reached at the maximum Instruction Reuse. As the number of CPUs increases, the IPC does not increase exactly linearly, due to contention at the main memory. Therefore, for the maximum Instruction Reuse and 8 processors, an IPC of 5.64 is obtained instead of the ideal IPC of 8.

In AMP mode, the results are almost the same as in SMP mode. This is due to the fact that, in both modes, the same contention at the SDRAM interface applies. Moreover, in AMP operation there is never contention for the same word in memory, while in SMP mode this type of contention has little effect on performance.

One obvious way to decrease the number of stall cycles spent fetching instructions from main memory is to increase the memory bandwidth. Instruction Reuse and memory bandwidth are two independent parameters. Next, we look at what influence memory bandwidth has on the scalability of a multiprocessor system.


(a) Symmetric Multiprocessing mode (b) Asymmetric Multiprocessing mode

Figure 5.2: A low Instruction Reuse results in no performance improvement as the number of CPUs is increased. In other words, a low Instruction Reuse limits the scalability of a multiprocessor system.

(a) Symmetric Multiprocessing mode (b) Asymmetric Multiprocessing mode

Figure 5.3: High Instruction Reuse values enable a multiprocessor system to scale to a higher number of processors, and significant performance gains can be seen over a single-processor system.


5.2.1 Effect of Memory Bandwidth

(a) BW_CPU = 2 · BW_MEM (b) BW_CPU = BW_MEM

Figure 5.4: Doubling the memory bandwidth increases the system IPC but does not help to improve the scalability of the multiprocessor system when Instruction Reuse is low.

Figure 5.4 shows the system IPC as a function of Instruction Reuse with the original (Figure 5.4(a)) and doubled (Figure 5.4(b)) memory bandwidth. Results are presented only for SMP mode, because there is almost no difference in AMP mode, as could be seen in the previous subsection.

As expected, doubling the memory bandwidth increases the system IPC over the whole range of Instruction Reuse values. However, Figure 5.4(b) shows an interesting result: if the Instruction Reuse is low, the doubled memory bandwidth does not improve the scalability of the multiprocessor system. The addition of the second CPU provides a higher performance improvement than in Figure 5.4(a), because main memory can return instructions two times faster. Nevertheless, the 3rd and 4th CPUs still do not increase the IPC significantly, due to the high number of instruction fetches, which cause many stall cycles.

With this information, it is certain that for the MPSoC example in the Motivation section, doubling the memory bandwidth will increase the system IPC but will bring no scalability improvement, because the Instruction Reuse was as low as 2.

Instruction Reuse is a function of both the application instruction reuse-distance histogram and the cache configuration. From the software side, Instruction Reuse can be increased by optimizing the code layout such that the spatial and temporal locality are maximized. From the hardware point of view, Instruction Reuse can be increased by increasing the cache size. In the following, we investigate the effect of varying the cache size as well as that of a different instruction reuse-distance histogram.


5.2.2 Effect of Level 1 Cache Size

(a) BW_CPU = 2 · BW_MEM (b) BW_CPU = BW_MEM

Figure 5.5: Doubling the Level 1 cache size or even the memory bandwidth may not improve the scalability of a multiprocessor system. However, increasing the cache size above a certain threshold value solves the scalability problem.

As could be seen so far, knowledge of the Instruction Reuse of a particular application-cache combination offers important information on the scalability of a multiprocessor system. However, it does not provide any information on how to design a multiprocessor system for improved performance or scalability.

For a fixed application profile, one method to improve Instruction Reuse is to increase the cache size. The relation between Instruction Reuse and cache size for the TEST benchmark is shown in Figure 5.1(b). In Figure 5.5, the system IPC is shown as a function of cache size with the original (Figure 5.5(a)) and doubled (Figure 5.5(b)) memory bandwidth for the TEST benchmark.

An interesting observation is that increasing the Level 1 cache size, or even doubling the memory bandwidth, may not necessarily improve the scalability of a multiprocessor system. This is due to the fact that the Instruction Reuse is still small enough to create a considerable number of instruction fetches to main memory, which imply processor stall cycles.

However, increasing the Level 1 cache size above a certain threshold value solves the scalability problem. As the cache size increases, the number of capacity cache misses decreases, resulting in fewer accesses to main memory, which decreases the number of processor stall cycles. The result is that the system IPC scales linearly with the number of CPUs. This is an important observation because it allows designers to choose an optimal value for the Level 1 cache.


5.2.3 Effect of Modified Application Profile

(a) Gaussian Instruction Reuse-Distance histogram. (b) Instruction Reuse as a function of cache size.

Figure 5.6: Instruction Reuse as cache size is varied for the modified histogram.

Instructions Executed              TEST benchmark 15 358 244
                                   Gaussian histogram 38 282 521

Level 1 Cache Size [KB]                4     8     16    32    64    128    256     512
Instruction Reuse (TEST benchmark)  1.48  1.89   2.66  5.43  6.59  18.39  56.59  204.84
Instruction Reuse (Gaussian)        1.05  1.25   1.41  2.81  5.98  13.36  27.06   52.14

Table 5.2: Instruction Reuse comparison.

The concept of Instruction Reuse was introduced as a measurement reference in order to abstract our results from a particular application. Using the TEST benchmark, we showed that a low Instruction Reuse limits the performance and scalability of a multiprocessor system. From the definition of Instruction Reuse, this result is independent of the TEST benchmark: whatever the application running on the processors may be, if in combination with the Level 1 cache a low Instruction Reuse is obtained, our results are valid.

In order to investigate the effect of a different application profile, a reuse-distance histogram was modelled to have the envelope of a Gaussian distribution. Many measurements of physical phenomena can be approximated by the Gaussian distribution. The use of the Gaussian distribution is justified by assuming that many small, independent events contribute additively to each experimental observation; by the central limit theorem, the sum will be Gaussian distributed.

The Gaussian instruction reuse-distance histogram is shown in Figure 5.6(a). It can be seen that the average instruction reuse-distance is about 32 KB, while the average instruction reuse-distance for the TEST benchmark is about 16 KB. As explained in Section 2.3, the cache capacity needs to be bigger than the average reuse-distance in order to achieve a high Instruction Reuse.


Therefore, higher cache sizes are required to obtain the same Instruction Reuse for the Gaussian histogram as for the TEST benchmark.

Table 5.2 shows the Instruction Reuse values as a function of cache size for the Gaussian reuse-distance histogram and the TEST benchmark; the values are also plotted in Figure 5.6(b). As expected, the Instruction Reuse curve for the Gaussian histogram lies below the one for the TEST benchmark.

In Figure 5.7, the main results presented so far can be compared for the TEST benchmark (left column) and the Gaussian histogram (right column). The comparison is given only for SMP mode, because the results are the same in AMP mode, as shown in Section 5.2.

The two instruction reuse-distance histograms are shown for comparison in Figures 5.7(a) and 5.7(b).

In Figure 5.7(d), the system IPC is plotted as a function of Instruction Reuse for the Gaussian histogram. The same trend as in Figure 5.7(c) can be observed: for low Instruction Reuse, increasing the number of CPUs to more than two provides no additional performance improvement.

Figure 5.7(f) shows the system IPC in relation to the cache size. As the Level 1 cache size increases, the Instruction Reuse increases as well, and if the Level 1 cache size exceeds a certain threshold value, the system IPC increases almost linearly with an increasing number of CPUs for both application profiles.

Due to the limited amount of time available, the effect of other application profiles could not be investigated. In order to show statistically that our results are independent of the application, a higher number of reuse-distance profiles would have to be simulated. However, the TEST benchmark (designed to model the profile of an operating system) and the Gaussian histogram show that our results are not influenced by the different application profiles.


(a) TEST Benchmark (b) Gaussian histogram

(c) TEST benchmark, BW_CPU = BW_MEM (d) Gaussian histogram, BW_CPU = BW_MEM

(e) TEST benchmark, BW_CPU = BW_MEM (f) Gaussian histogram, BW_CPU = BW_MEM

Figure 5.7: Effect of the application instruction reuse-distance histogram


5.3 The Shared Level 2 Cache

When the Instruction Reuse is low, significant memory traffic is generated by the processor. The addition of an on-chip Level 2 cache is a recognized method of exploiting instruction locality and improving system performance. In this section, a shared Level 2 cache is investigated as a solution to the low Instruction Reuse problem.

Figure 5.8 shows a comparison between the system IPC without a Level 2 cache and with a shared 128 KB Direct Mapped Level 2 cache, in Symmetric Multiprocessing mode and for BW_CPU = 2 · BW_MEM. As can be seen, not only is the system IPC considerably higher for all Instruction Reuse values, but the scalability is significantly improved for Instruction Reuse values higher than 3.

In SMP mode, the shared Level 2 cache transforms the compulsory cache misses of one processor into Level 2 cache hits for the other processors. Thus, an instruction needs to be fetched only once from main memory and can then be reused by all other processors, as long as the reuse-distance is smaller than the cache size.

In AMP mode (Figure 5.9), the situation is different. Because the processors execute instructions from different memory regions, the cache misses of one processor may evict instructions required by other processors from the shared Level 2 cache. This means that conflict misses are created. The number of conflict misses increases with an increasing number of CPUs and depends on the set-associativity of the shared Level 2 cache. The higher the set-associativity, the lower the probability of a conflict miss. A Direct Mapped cache has the highest probability of a conflict miss.

While the shared Level 2 cache offers the best-case scenario in SMP mode, it provides the worst case in AMP mode. As shown in Figure 5.9(b), when the number of CPUs increases to more than two, the low Instruction Reuse causes a high number of instruction fetches to main memory, which results in decreasing performance due to conflict misses at the Level 2 cache. When the Instruction Reuse is increased, the number of conflict misses decreases, but the performance and scalability of the multiprocessor system do not improve significantly and are comparable to those of the system without a Level 2 cache.

In a realistic scenario, the results will correspond to a weighted average between the two multiprocessing cases. One example is the multiprocessor system presented in the Motivation section: the operating system is Linux, so the system code runs in Symmetric Multiprocessing mode, while the code of the applications running on top of the operating system runs in Asymmetric Multiprocessing mode.


(a) SMP mode, no Level 2 cache (b) SMP mode, 128 KB shared Level 2 cache

Figure 5.8: In SMP mode, the addition of a shared Level 2 cache increases the system IPC and also improves the scalability of the multiprocessor system for low Instruction Reuse.

(a) AMP mode, no Level 2 cache (b) AMP mode, 128 KB shared Level 2 cache

Figure 5.9: In AMP mode, the addition of a shared Level 2 cache slightly increases the system IPC but does not improve the scalability of the multiprocessor system. In fact, for low Instruction Reuse, increasing the number of processors decreases the system IPC.


5.3.1 Level 1 Cache vs. Level 2 Cache

In Figures 5.10(a)-5.10(d), the system IPC as a function of Level 1 and Level 2 cache size is shown for an increasing number of processors in Symmetric Multiprocessing mode.

One of the most important observations is that the addition of a relatively small shared Level 2 cache (e.g. twice the size of the Level 1 cache) provides a significantly greater performance improvement than doubling the Level 1 cache size with no Level 2 cache. For example, increasing the Level 1 cache from 16 KB to 32 KB or even 64 KB results in a system IPC significantly below that obtained with a 16 KB Level 1 cache and a 32 KB shared Level 2 cache. This is due to the fact that the shared Level 2 cache transforms the compulsory cache misses of one processor into cache hits for the other CPUs, eliminating even the first-time fetches that would be required in the absence of a shared Level 2 cache.

Another important observation is that, in the absence of a Level 2 cache, the threshold Level 1 cache size beyond which the system IPC scales linearly with an increasing number of processors is 128 KB for the TEST benchmark. With the addition of a Level 2 cache, the threshold Level 1 cache size decreases to just 16 KB. An instruction needs to be fetched only once from main memory in order to be reused by all other processors from the Level 2 cache. Therefore, the low Instruction Reuse due to a smaller Level 1 cache is compensated by a high reuse of instructions at the Level 2 cache.

Finally, it can be seen that for a given number of processors, increasing the size of the shared Level 2 cache alone does not offer a significant performance gain. When the Level 2 cache size increases, a higher number of instructions can reside in the cache, but this does not imply that the reuse of instructions at the Level 2 cache increases. Only if the number of CPUs is increased will the reuse increase, because the additional processors will fetch instructions directly from the shared Level 2 cache.

In Asymmetric Multiprocessing mode, the size of the shared Level 2 cache plays an important role in system performance. In Figures 5.11(a)-5.11(d), the system IPC as a function of Level 1 and Level 2 cache size is shown for Asymmetric Multiprocessing.

The problem with a shared Level 2 cache in AMP mode is that instructions that one processor fetches into the Level 2 cache may be evicted by instruction fetches from a different processor, resulting in conflict misses. Increasing the Level 2 cache size decreases the number of conflict misses and thus increases the system IPC.


(a) SMP mode, #CPUs = 1 (b) SMP mode, #CPUs = 2

(c) SMP mode, #CPUs = 3 (d) SMP mode, #CPUs = 4

Figure 5.10: The addition of a relatively small shared Level 2 cache (e.g. twice the size of the Level 1 cache) provides a significantly greater performance improvement than doubling the Level 1 cache size alone. However, increasing the Level 2 cache size without also increasing the number of CPUs does not bring any significant performance gain.


(a) AMP mode, #CPUs = 1 (b) AMP mode, #CPUs = 2

(c) AMP mode, #CPUs = 3 (d) AMP mode, #CPUs = 4

Figure 5.11: As opposed to SMP mode, in AMP mode increasing the Level 2 cache size considerably increases the system IPC.


5.4 The Effect of Tightly Coupled Memory

Tightly Coupled Memory (TCM) is a type of on-chip memory used to hold critical code when deterministic memory behavior is required, such as in real-time systems. Because cache behavior is not deterministic, TCM has an address space that is non-cacheable. The words "Tightly Coupled" come from the fact that the TCM sits very close to the processor, with a latency of about 1-2 clock cycles, just like the Level 1 cache.

TCM could provide a solution to the multiprocessor scalability problem by storing code with low Instruction Reuse, i.e. the code which creates the largest amount of traffic to memory and thus increases the total bandwidth consumed.

The effect of placing low Instruction Reuse code in TCM could not be properly investigated because of hardware constraints in placing the TCM as close as possible to the ARM11 MPCore. These constraints are described below:

• The Arm11 MPCore processor was not designed to support TCM. It contains an advanced internal memory management system with support for snooping cache coherency and two AXI interfaces specifically designed to be connected to the ARM L220 Cache Controller. Therefore, the TCM can only be connected as a second slave to the MPCore via the PL300 Interconnect.

• The AXI PL300 Interconnect supports only AXI interfaces, while the on-chip memory model supports at most the AHB protocol. Therefore, a bridge is required in order to convert between the AXI and AHB protocols.

In this configuration, the latency to the TCM was measured to be 14 processor clock cycles. Given this high latency, compared to the typical TCM latency of 1-2 processor clock cycles, the effect of placing code in the modelled TCM would not be realistic.

Moreover, compared to the Level 2 cache latency of 6 processor clock cycles, the performance of the modelled TCM would be significantly lower than that obtained using the Level 2 cache.

For the reasons above, the effect of using TCM memory as an alternative to the Level 2 cache in order to increase system performance and/or scalability could not be investigated.


Chapter 6

Conclusion & Future Work

In this thesis, an ARM11 MPCore based multiprocessor system is modelled and simulated using virtual prototyping technology from VaST Systems. The purpose of this modelling is to study and understand the effect of instruction locality on the performance and scalability of multiprocessor systems, in preparation for a future MPSoC design.

The design of the entire system consists of two aspects: the hardware architecture and the software architecture. The hardware architecture was modelled using virtual models of ARM fabric components. The software architecture was designed to permit a configurable instruction locality to be modelled, and the concept of Instruction Reuse is introduced in order to evaluate the performance and scalability of a multiprocessor system independent of the target application being executed.

In order to obtain Instruction Reuse values, the instruction reuse-distance histogram of an operating system is modelled, based on previously published work [20]. The modelled histogram serves as a benchmark for comparison with future real measurements.

After implementation, the system is simulated and the system IPC is recorded using Metrix from VaST Systems. System performance is analyzed both in Symmetric Multiprocessing and Asymmetric Multiprocessing modes.

One of the main contributions of this thesis is that it proves, by means of simulation, that whatever the target application may be, if in combination with a specific cache configuration it results in a low Instruction Reuse, then increasing the number of processors above a fairly small number results in no additional performance increase. In other words, a low Instruction Reuse limits the scalability of a multiprocessor system.


The effects of doubling the memory bandwidth, of the cache configuration, and of a modified application profile are also investigated. Doubling the memory bandwidth increases the system IPC but does not help to improve the scalability of the multiprocessor system when Instruction Reuse is low. However, increasing the Level 1 cache size above a certain threshold value solves the scalability problem, and the system IPC then increases linearly with an increasing number of CPUs. Using a different application profile, the above-mentioned conclusions did not change.

It was also shown that in the absence of a shared cache there is no significant difference between the Symmetric Multiprocessing and Asymmetric Multiprocessing modes. However, the addition of a shared Level 2 cache introduces great differences between the two processing modes:

• In Symmetric Multiprocessing mode: the addition of a shared Level 2 cache increases the system IPC for all Instruction Reuse values and Level 1 cache sizes. Moreover, adding a shared Level 2 cache with double the size of the Level 1 cache provides a significantly greater performance improvement than doubling the Level 1 cache size alone. However, increasing the Level 2 cache size without also increasing the number of CPUs does not bring any significant performance gain.

• In Asymmetric Multiprocessing mode: the addition of a shared Level 2 cache may actually decrease system performance if the Instruction Reuse or the Level 1 cache size is small. Moreover, increasing the Level 2 cache size offers considerable performance improvement even when the number of CPUs is constant.

The conclusions above give meaningful insights into the factors that govern MPSoC performance and scalability.

The effect of Tightly Coupled Memory could not be investigated due to hardware architecturerestrictions on the placement of the TCM.

A few interesting issues that future work could focus on are:

• extending the analysis to include Data Reuse as well as Instruction Reuse.

• porting an actual operating system to the modelled hardware architecture, to investigate the effects of Instruction Reuse based on a real application.

• exploring the effect of placing code with low Instruction Reuse in TCM memory.


Bibliography

[1] J. Hennessy A. Agarwal and M. Horowitz. An analytical cache model. ACM Trans. Comput.Syst., 7(2):184–215, 1989.

[2] Mohamed A. Bamakhrama. Embedded multiprocessor system-on-chip for access networkprocessing. MSc. Thesis, Technische Universitt Mnchen, December 2007.

[3] Kristof Beyls and Erik D‘Hollander. Platform-independent cache optimization by pinpoint-ing low-locality reuse. In M. Bubak, G.D. van Albada, P.M.A. Sloot, and J.J. Dongarra,editors, Computational Science - ICCS 2004: 4th International Conference, Proceedings,Part III, volume 3038, pages 448–455, Krakow, 6 2004. Springer-Verlag Heidelberg.

[4] Kristof Beyls, Erik D'Hollander, and Frederik Vandeputte. RDVIS: A tool that visualizes the causes of low locality and hints program optimizations. In V. S. Sunderam et al., editors, Computational Science – ICCS 2005, 5th International Conference, volume 3515, pages 166–173, Atlanta, May 2005. Springer.

[5] Changpeng Fang, Steve Carr, Soner Önder, and Zhenlin Wang. Reuse-distance-based miss-rate prediction on a per instruction basis. In MSP '04: Proceedings of the 2004 Workshop on Memory System Performance, pages 60–68, 2004.

[6] Peter Claydon. Multicore gives more bang for the buck. EE Times, 15th October 2007.

[7] Abhijit Davare. Automated Mapping for Heterogeneous Multiprocessor Embedded Systems. PhD thesis, EECS Department, University of California, Berkeley, September 2007.

[8] C. Ding. Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse. PhD thesis, Dept. of Computer Science, Rice University, January 2000.

[9] Chen Ding and Yutao Zhong. Predicting whole-program locality through reuse distance analysis. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, pages 245–257, 2003.

[10] International Technology Roadmap for Semiconductors. http://www.itrs.net.

[11] D. Geer. Chip makers turn to multicore processors. Computer, 38:11–13, May 2005.

[12] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350–360, July 1991.

[13] M. D. Hill. Aspects of cache memory and instruction buffer performance. PhD thesis,University of California, Berkeley, November 1987.

[14] Anthony Massa and Michael Barr. Programming Embedded Systems. O'Reilly, Sebastopol, CA, October 2006.

[15] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, San Francisco, CA, third edition, 2004.

[16] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, CA, fourth edition, 2007.

[17] R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78–117, 1970.

[18] IEEE Design & Test staff. DAC, Moore's law still drive EDA. IEEE Des. Test, 20(3):99–100, 2003.

[19] P. Stenström. The paradigm shift to multi-cores: Opportunities and challenges. Appl. Comput. Math., 6(2):253–257, 2007.

[20] Josep Torrellas, Chun Xia, and Russell Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pages 360–369, 1995.

[21] Jim Turley. The two percent solution. http://www.embedded.com/story/OEG20021217S0039, December 2002.

[22] Intel website. http://www.intel.com/museum/archives/history_docs/mooreslaw.htm.

[23] Y. Zhong, C. Ding, and K. Kennedy. Reuse distance analysis for scientific programs. In Proceedings of the Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, March 2002.

[24] Y. Zhong, S. Dropsho, and C. Ding. Miss rate prediction across all program inputs. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, pages 91–101, September 2003.

[25] Y. Zhong, M. Orlovich, X. Shen, and C. Ding. Array regrouping and structure splitting using whole-program reference affinity. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2004.
