
TECHNICAL UNIVERSITY OF CATALONIA

School of Informatics

Master on Computer Architecture, Networks and Systems

Master Thesis

High Performance, Ultra-Low Power Streaming Systems

Student: José María Arnau

Advisor: Joan-Manuel Parcerisa

Advisor: Polychronis Xekalakis

Advisor: Antonio González

Date: September 20, 2011


Abstract

Smartphones are emerging as one of the fastest growing markets, with new devices and improvements in their operating systems appearing every few months. The design of a CPU/GPU for such mobile devices is challenging due to the users' demand for a truly mobile computing experience, including highly responsive user interfaces, uncompromised web browsing performance and visually compelling gaming experiences, and due to the power constraints imposed by the limited capacity of the battery. In recent years, the power demand of these mobile devices has increased much faster than battery technology has improved.

Our key ambition is to design a CPU/GPU for such a system, trying to minimize the power consumed while also achieving the highest performance possible. We first analyze commercial Android workloads and establish that the most demanding applications in terms of performance are, as expected, games. We show that because these systems are based on OpenGL ES, the vast majority of the time the CPU is idle. In fact, we find that the GPU is much more active than the CPU and that the major performance limitation for these systems is the use of memory by the GPU.

We thus focus on the GPU and more specifically on its memory behavior. We show that for most of the caches employed in these systems, traditional prefetchers provide significant benefits. The exception is the texture cache, for which the access patterns are irregular, especially for 3D games. We then demonstrate how we can alleviate this issue by using a decoupled access/execute-like architecture. We also show that an important part of the power consumed can be reduced by carefully moving data around and by orchestrating the accesses to the L2 cache. The end design is able to achieve performance similar to a more traditional many-warp system, while consuming only a fraction of its power. Our experimental results, using the latest version of Android and a commercial set of games, prove this claim. More specifically, our proposed system achieves 29% improvements over state-of-the-art prefetchers, while consuming 6% less power.

Keywords

Prefetching, GPU, Android, rasterization, smartphones.


Contents

1 Introduction 11

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2 Objectives and contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Related work 15

2.1 Rasterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2 Android . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2.1 Android Software Renderer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3 State of the art architectures for mobile devices . . . . . . . . . . . . . . . . . . . . . 21

2.3.1 Qualcomm Snapdragon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3.2 PowerVR chipsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.3.3 NVIDIA Tegra 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.4 Data Cache Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4.1 CPU prefetchers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4.2 GPU prefetchers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3 Problem statement: Memory Wall for Low Power GPUs 37

3.1 Hiding memory latency on a modern low power mobile GPU . . . . . . . . . . . . . 37

4 Proposal: Decoupled Access Execute Prefetching 41

4.1 Ultra-low power decoupled prefetcher . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


4.1.1 Baseline GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.1.2 Decoupled prefetcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.1.3 Decoupled prefetcher improvements . . . . . . . . . . . . . . . . . . . . . . . 45

5 Evaluation methodology 49

5.1 Simulation infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.1.1 GPU trace generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.1.2 Cycle accurate GPU simulator . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6 Experimental results 55

6.1 Workload characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.2 State of the art prefetchers performance . . . . . . . . . . . . . . . . . . . . . . . . . 59

6.3 Ultra-low power decoupled prefetcher performance . . . . . . . . . . . . . . . . . . . 61

7 Conclusions 67


List of Figures

1.1 Smartphone sales vs. desktop and notebook sales. Data obtained from [1]. . . . . . 12

1.2 Energy need vs. energy available in a standard size battery. Two days of battery life cannot be achieved with current batteries and the gap is getting bigger. Data obtained from [2]. . . . . . 12

2.1 Initial scene, intermediate results produced by the different stages of the rasterization process and final result. . . . . . 16

(a) 3D triangles plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

(b) 2D triangles plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

(c) Clipped 2D triangles plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

(d) Pixels after rasterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

(e) Visible pixels after Z-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

(f) Shaded and textured pixels after Pixel stage . . . . . . 16

2.2 Rasterization pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.3 Android architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.4 Qualcomm Snapdragon System on Chip. . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.5 PowerVR GPU architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.6 NVIDIA Tegra 2 architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.7 Ultra-low power GeForce architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.8 Stride prefetching table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.9 Markov prefetching. The left side of the figure shows the state of the correlation table after processing the miss address stream shown at the top of the figure. The right side illustrates the Markov transition graph that corresponds to the example miss address stream. . . . . . 28


2.10 Distance prefetching. The address delta stream corresponds to the sequence of addresses used in the example of figure 2.9. . . . . . 28

2.11 Global History Buffer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.12 Distance prefetcher implemented by using a Global History Buffer. The Head Pointer points to the last inserted address in the GHB. . . . . . 30

2.13 An overview of the baseline GPGPU architecture. . . . . . . . . . . . . . . . . . . . 31

2.14 An example of memory accesses with/without warp interleaving. . . . . . 32

(a) Accesses by warps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

(b) Accesses by a hardware prefetcher . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.15 Many-thread aware hardware prefetcher. . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.16 Throttling heuristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.17 Baseline architecture for texture mapping. . . . . . . . . . . . . . . . . . . . . . . . . 35

2.18 Texture cache prefetcher architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.1 Effectiveness of multithreading for hiding memory latency. As we increase the number of warps on each processor we obtain better performance. . . . . . 38

3.2 Power consumed by the GPU main register file for different configurations. . . . . . 38

4.1 Baseline GPU architecture (based on the ultra-low power GeForce GPU in the NVIDIA Tegra 2 chipset). . . . . . 42

4.2 Decoupled prefetcher architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.3 Improved decoupled prefetcher. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.1 GPU trace generation system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.2 GPU architecture modelled by the cycle accurate simulator. . . . . . . . . . . . . . . 52

6.1 CPU configuration for the experiments. . . . . . . . . . . . . . . . . . . . . . . . . . 56

6.2 CPI stacks for several Android applications. iCommando, Shooting Range 3D and PolyBreaker 3D are commercial games from the Android market. . . . . . 56

6.3 Misses per 1000 instructions for the different caches in the GPU. . . . . . . . . . . . 57

6.4 Texture and pixel cache analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


6.5 Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming processor when running the 2D game iCommando. In the Sequitur grammars, non-terminal symbols (rules) are represented by numbers and terminal symbols (strides) are represented by numbers in square brackets. After each rule we show the number of times the rule is applied to form the input sequence of strides. We only show the 5 most frequent rules of the grammar. . . . . . 58

6.6 Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming processor when running the 3D game PolyBreaker 3D. For each cache the figure shows the 5 most frequent rules of the grammar and the 5 most frequent strides. . . . . . 58

6.7 GPU configuration for the experiments. The baseline GPU architecture is the one illustrated in figure 5.2. . . . . . 60

6.8 Speedups for different state-of-the-art prefetchers. . . . . . . . . . . . . . . . . . . . . 60

6.9 Normalized power consumption for different state-of-the-art prefetchers. . . . . . . . 61

6.10 Ultra-low power decoupled prefetcher compared with state-of-the-art prefetchers. . . 62

6.11 Ultra-low power decoupled prefetcher compared with the distance prefetcher implemented with GHB. . . . . . 63

6.12 Decoupled prefetcher power consumption. . . . . . . . . . . . . . . . . . . . . . . . . 63

6.13 Normalized energy-delay product. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6.14 Prefetch queue size evaluation. The graph shows the speedup achieved by the decoupled prefetcher over the baseline GPU without prefetching for different sizes of the prefetch queue, for the game shooting. . . . . . 64


1 Introduction

1.1 Motivation

Mobile devices such as smartphones and tablets have become ubiquitous in the last few years. These general-purpose but battery-limited devices have experienced huge growth in both computing capabilities and market share. Regarding the user experience, making calls is just one of the many features offered by cell phones, since the user can browse the web, play high-definition videos or play complex 3D games. In regard to market share, the total number of smartphones sold in 2008 exceeded the total number of desktop PCs, and the gap is increasing each year [1]. Furthermore, the forecast for the coming years predicts that the smartphone market will exceed the laptop and desktop markets by 2012, as shown in figure 1.1.

The design of a CPU/GPU system for smartphones is very challenging due to user expectations of what these devices should do and the significant power limitations. On the one hand, users demand a truly mobile computing experience: highly responsive user interfaces, uncompromised web browsing performance, visually compelling online and offline gaming experiences... On the other hand, the power demand is increasing faster than battery improvements, as shown in figure 1.2. The combination of these two factors, the demand for complex applications and the power constraints, is putting great pressure on the CPU/GPU, which must provide high performance without breaking the small power budget of mobile devices.

The clock rate of mobile CPUs and GPUs has increased significantly in recent years. Nowadays, smartphones achieve clock rates between 1 GHz and 1.5 GHz. Furthermore, companies such as Qualcomm and NVIDIA have announced CPU/GPU chipsets with clock rates between 2 GHz and 2.5 GHz for 2012. Hence, smartphones are going to hit the memory wall, and the latency to access main memory is going to be one of the main performance-limiting factors.


Figure 1.1: Smartphone sales vs. desktop and notebook sales. Data obtained from [1].

Thus, the use of techniques to hide the memory latency will be necessary to provide high performance.

Figure 1.2: Energy need vs. energy available in a standard size battery. Two days of battery life cannot be achieved with current batteries and the gap is getting bigger. Data obtained from [2].


Prefetching is one of the main techniques for hiding memory latency. Although prefetching has been extensively studied in CPUs, as far as we know its use has not been evaluated in low-power mobile GPUs running graphics workloads.

1.2 Objectives and contributions

The first objective is to get a better understanding of the applications available for smartphones. We want to evaluate the behavior of the CPU/GPU when running these applications and we want to identify the most demanding workloads. We focus on Android [3], since it is one of the most popular platforms for mobile devices and it is open source.

Another objective is to propose a technique that increases the performance of smartphone Graphics Processing Units (GPUs) while keeping the power consumption within the limits of the small power budget. Since games are the most demanding applications for smartphones and memory is one of the main limiting factors in the GPU, as we describe in section 6.1, it is necessary to find a mechanism to hide the latency of main memory. We want to explore the use of prefetching in a low-power mobile GPU, evaluate the performance and power consumption of current state-of-the-art prefetchers and, if necessary, propose a new low-power prefetching technique specifically designed for mobile devices.

In this report we make the following contributions:

1. We perform a characterization of smartphone applications. More specifically, we characterize the behavior of multiple 2D and 3D games on the Android platform.

2. We develop a methodology to evaluate the performance and power consumption of mobile Graphics Processing Units. We propose a technique to identify the code executed by the GPU in the Android operating system. Furthermore, we develop a cycle-accurate GPU simulator which models a mobile GPU similar to the NVIDIA Tegra 2 [4]; the simulator includes performance and power statistics.

3. We evaluate the effectiveness of state-of-the-art CPU and GPU prefetchers in hiding the memory latency of smartphone graphics hardware.

4. We propose our ultra-low power decoupled prefetcher, which outperforms previous proposals when running graphics workloads on a low-power mobile GPU. Furthermore, the ultra-low power decoupled prefetcher provides performance improvements while reducing energy consumption.

1.3 Organization

The remainder of this report is organized as follows. In chapter 2 we provide basic background information on the rasterization process. Furthermore, we review the Android platform, some of the state-of-the-art architectures for mobile devices and the most effective prefetching techniques for CPUs and GPUs. In chapter 3 we describe the problem to solve, and in chapter 4 we explain our solution: the ultra-low power decoupled prefetcher. In chapter 5 we describe the evaluation methodology and present the GPU trace generation system and the cycle-accurate GPU simulator. In chapter 6 we show the experimental results; this chapter includes a workload characterization of several smartphone applications, a comparison of different state-of-the-art CPU and GPU prefetchers, and the performance and power results of the ultra-low power decoupled prefetcher. Finally, in chapter 7 we present the main conclusions of the report.


2 Related work

2.1 Rasterization

Rasterization is the process of taking an image described in a vector graphics format (polygons) and converting it into a raster image (pixels or fragments) for output on the screen [5]. Nowadays rasterization is the most popular technique for producing real-time 3D computer graphics. In comparison to other rendering techniques such as ray tracing [6], rasterization is exceptionally fast. Usually, computers include specialized graphics hardware to carry out the task of rasterizing 3D models onto a 2D plane for display on the screen.

In its most basic form, the rasterization algorithm takes as input a set of 3D polygons and renders them onto a 2D surface, usually a computer monitor. Polygons are described as a collection of 3D triangles, and these 3D triangles are represented by three vertices in 3D space. Basically, rasterizers take a stream of 3D vertices, transform them into corresponding 2-dimensional points on the viewer's screen and fill in the transformed 2-dimensional triangles as appropriate by processing the corresponding pixels.

The rasterization algorithm consists of different stages, each of which produces a partial result, as shown in figure 2.1. The rendering process starts with a vectorial description of a 3D scene (figure 2.1a). All the objects in the scene are described as a collection of triangles. In turn, triangles are defined by 3 vertices in 3D space. Several attributes are specified for each vertex: position, normal (for lighting computations), color, one or several texture coordinates (for texture mapping)... Therefore, all the 3D vertices with all the per-vertex information describe the scene and form the input for the first stage of the rasterization process.

The vertex stage is the first step in the rasterization algorithm. The input for this phase is the set of 3D vertices with all the per-vertex information (figure 2.1a).


Figure 2.1: Initial scene, intermediate results produced by the different stages of the rasterization process and final result. (a) 3D triangles; (b) 2D triangles; (c) Clipped 2D triangles; (d) Pixels; (e) Visible pixels; (f) Shaded and textured pixels.

Several operations are applied to each vertex in the vertex stage. First, vertices are transformed; the main transformations are translation, scaling and rotation. All the transformations are described by a transformation matrix, so transforming a vertex consists of multiplying its 3D coordinates by this transformation matrix. Second, vertices are lit according to the defined locations of light sources, reflectance and other surface properties. Finally, vertices are projected from 3D space onto


a 2D plane; this projection is done by multiplying each transformed vertex by a projection matrix. The result of the vertex stage is a set of 2D triangles, as shown in figure 2.1b.
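To make the two matrix products concrete, the following minimal C++ sketch (all names hypothetical, assuming a column-vector convention and a toy projection matrix) shows how a single vertex would pass through the transformation and projection steps of the vertex stage:

```cpp
#include <array>
#include <cstdio>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;

// Multiply a 4x4 matrix by a 4-component vector (column-vector convention).
static Vec4 mul(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            r[i] += m[i][j] * v[j];
    return r;
}

int main() {
    // Transformation matrix: translate by +2 along x.
    Mat4 model = {{{1,0,0,2}, {0,1,0,0}, {0,0,1,0}, {0,0,0,1}}};
    // Toy projection matrix: w receives -z, so dividing by w shrinks
    // distant geometry (a stand-in for a real perspective matrix).
    Mat4 proj = {{{1,0,0,0}, {0,1,0,0}, {0,0,1,0}, {0,0,-1,0}}};
    Vec4 vertex = {1.0f, 1.0f, -5.0f, 1.0f};

    Vec4 transformed = mul(model, vertex);     // transformation step
    Vec4 clip        = mul(proj, transformed); // projection step
    // Perspective divide yields the 2D point on the image plane.
    std::printf("2D point: (%f, %f)\n", clip[0] / clip[3], clip[1] / clip[3]);
    return 0;
}
```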

Once 3D vertices have been transformed to their corresponding 2D locations, some of these locations may be outside the viewing window, that is, the area on the screen to which pixels will actually be written. For instance, in figure 2.1b the vertices V3 and V5 are outside the screen. So the next stage of the rasterization process is clipping. Clipping is the task of truncating triangles so that they fit inside the viewing area. The most common technique is the Sutherland-Hodgman clipping algorithm [7]. After clipping, triangles are truncated so that all the vertices fit on the screen (figure 2.1c).
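As an illustration of the idea behind Sutherland-Hodgman clipping, the sketch below (illustrative names, simplified to 2D) clips a polygon against a single half-plane, here x >= 0; the full algorithm simply repeats this step for each edge of the viewing area:

```cpp
#include <cstdio>
#include <vector>

struct P { float x, y; };

// One Sutherland-Hodgman step: clip a polygon against the half-plane x >= 0.
std::vector<P> clipLeft(const std::vector<P>& in) {
    std::vector<P> out;
    for (size_t i = 0; i < in.size(); ++i) {
        P a = in[i], b = in[(i + 1) % in.size()];
        bool aIn = a.x >= 0, bIn = b.x >= 0;
        if (aIn) out.push_back(a);      // keep vertices inside the region
        if (aIn != bIn) {               // edge crosses the boundary:
            float t = a.x / (a.x - b.x); // emit the intersection point
            out.push_back({0.0f, a.y + t * (b.y - a.y)});
        }
    }
    return out;
}

int main() {
    std::vector<P> tri = {{-1, 0}, {2, 0}, {2, 2}}; // one vertex off-screen
    for (const P& p : clipLeft(tri)) std::printf("(%g, %g)\n", p.x, p.y);
    return 0;
}
```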

The next step of the rasterization process is to fill the 2D triangles that are now in the image plane; this stage is also known as raster conversion or scan conversion. Raster conversion consists of converting the vectorial 2D clipped triangles (figure 2.1c) into pixels (figure 2.1d). There are a number of algorithms to fill the pixels inside a triangle, the most popular of which is the scanline algorithm [8]. During raster conversion all the attributes of the 2D vertices (color, texture coordinates...) are interpolated across the triangle.

After raster conversion, the rasterization algorithm must ensure that pixels close to the viewer are not overwritten by pixels far away; this issue is known as the visibility problem. A Z-buffer [9] is the most popular solution. The Z-buffer is a 2D array, corresponding to the image plane, which stores a depth value for each pixel. Each time a pixel is drawn, the Z-buffer is updated with the pixel's depth value. Any new pixel must check its depth value against the Z-buffer value before it is drawn: closer pixels are drawn and farther pixels are discarded. This process of checking the depth value of each pixel against the value stored in the Z-buffer is called the depth test. Figure 2.1d shows an example input to the depth test and the corresponding output is shown in figure 2.1e.
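A minimal sketch of a Z-buffer and its depth test, assuming smaller depth values mean closer to the viewer (names are illustrative):

```cpp
#include <cstdio>
#include <limits>
#include <vector>

// One depth value per pixel of a WxH image plane, initialized to "infinitely far".
struct ZBuffer {
    int w, h;
    std::vector<float> depth;
    ZBuffer(int w_, int h_)
        : w(w_), h(h_), depth(w_ * h_, std::numeric_limits<float>::max()) {}

    // Depth test: keep the pixel only if it is closer than the stored value.
    bool testAndSet(int x, int y, float z) {
        float& stored = depth[y * w + x];
        if (z < stored) { stored = z; return true; } // visible: update Z-buffer
        return false;                                // occluded: discard pixel
    }
};

int main() {
    ZBuffer zbuf(640, 480);
    std::printf("%d\n", zbuf.testAndSet(10, 10, 0.5f)); // 1: drawn
    std::printf("%d\n", zbuf.testAndSet(10, 10, 0.9f)); // 0: behind, discarded
    return 0;
}
```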

Finally, all the pixels that pass the depth test (visible pixels) are processed in the last stage of the rasterization algorithm: the pixel stage. To compute a pixel's color, pixels are textured and shaded in the pixel stage. Let us briefly review how textures are applied. A texture map is a bitmap that is applied to a triangle to define its look. Each triangle vertex is associated with a texture and, for normal 2D textures, a texture coordinate (u, v) in addition to its position coordinate. Whenever a pixel on a triangle is rendered, the corresponding texel (or texture element) in the texture must be found. This is accomplished by interpolating the texture coordinates associated with the triangle's vertices, weighted by the pixel's on-screen distance from the vertices. Moreover, lighting computations are also performed in the pixel stage. The result of this stage is the final image with textures and per-pixel lighting (figure 2.1f). A diagram of the whole rasterization process is shown in figure 2.2.
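The following sketch, with hypothetical names and nearest-neighbour sampling for simplicity, illustrates how an interpolated (u, v) coordinate selects a texel:

```cpp
#include <cstdint>
#include <vector>

// A texture map: a bitmap of RGBA texels, sampled with normalized (u, v).
struct Texture {
    int w, h;
    std::vector<uint32_t> texels; // packed RGBA, row-major

    // Nearest-neighbour lookup; u and v are assumed to be in [0, 1].
    uint32_t sample(float u, float v) const {
        int tx = static_cast<int>(u * (w - 1));
        int ty = static_cast<int>(v * (h - 1));
        return texels[ty * w + tx];
    }
};

// The per-pixel (u, v) is obtained by interpolating the texture coordinates
// of the triangle's three vertices, here with barycentric weights w0, w1, w2.
inline float interpolate(float a, float b, float c, float w0, float w1, float w2) {
    return a * w0 + b * w1 + c * w2;
}
```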

2.2 Android

Android [3] is a software stack for mobile devices such as mobile telephones and tablet computers, developed by Google and the Open Handset Alliance. Android consists of a mobile operating system based on the Linux kernel, with middleware, libraries and APIs written in C, and application software running on an application framework which includes Java-compatible libraries. Android uses the Dalvik virtual machine [10] with just-in-time compilation to run compiled Java code. Android has a large community of developers writing applications that extend the functionality of the devices; these developers write primarily in Java.


Figure 2.2: Rasterization pipeline.


The architecture of Android is shown in figure 2.3. Regarding the applications, Android provides a set of core applications including an email client, an SMS program, a calendar, maps, a browser, contacts and others. All applications are written using the Java programming language.

Regarding the application framework, by providing an open development platform Android offers developers the ability to build rich applications. Developers are free to take advantage of the device hardware, access location information, run background services, set alarms... Developers have full access to the same framework APIs used by the core applications. The application architecture is designed to simplify the reuse of components: any application can publish its capabilities and any other application may then make use of those capabilities. This same mechanism allows components to be replaced by the user.

On the other hand, Android includes a set of C/C++ libraries used by various components of the Android system. These capabilities are exposed to developers through the Android application framework. These libraries include an implementation of the standard C system library (libc), media libraries to support playback and recording of many popular audio and video formats, a relational database engine (SQLite) and much other functionality. An implementation of the OpenGL ES API [11] is also provided as one of these libraries. This 3D library uses either hardware 3D acceleration (where available) or the included highly optimized 3D software rasterizer, as described in section 2.2.1.

Regarding the Android Runtime, Android includes a set of core libraries that provides most of the functionality available in the core libraries of the Java programming language.


Figure 2.3: Android architecture.

Every Android application runs in its own process, with its own instance of the Dalvik virtual machine. Dalvik has been written so that a device can run multiple VMs efficiently. The VM is register-based and runs classes compiled by a Java language compiler. The Dalvik VM relies on the Linux kernel for underlying functionality such as threading and low-level memory management.

Finally, Android relies on Linux version 2.6 for core system services such as security, memory management, process management, the network stack and the driver model. The kernel also acts as an abstraction layer between the hardware and the rest of the software stack.

2.2.1 Android Software Renderer

Android supports the rendering of 3D graphics by providing an implementation of the OpenGL ES API [11]. The rasterization process, described in section 2.1, can be done in hardware by using a specialized graphics accelerator or in software on the CPU. When Android runs on a mobile device provided with a Graphics Processing Unit (for instance, the NVIDIA Tegra 2 described in section 2.3.3), the GPU driver is employed and the rasterization is hardware accelerated. On the contrary, when Android is executed on a device without specialized graphics hardware, the Android software renderer performs the rasterization on the CPU. Software rendering is also employed when executing Android on top of an emulator such as QEMU, as we will see in the


section describing our simulation infrastructure.

The Android software renderer is a library that provides support for 3D graphics; it is an implementation of the OpenGL ES 1.0 API. The software renderer is a piece of software of special interest for several reasons. First, it performs the rasterization when executing Android on top of a simulator. Second, the Android software renderer, unlike GPU drivers, is open source. Since the source code of the software renderer is available, we can review it and modify it. This means that we can, for example, instrument the software renderer to collect interesting information about the rasterization process. For instance, we can count the number of vertices processed, the number of triangles, the number of pixels generated for each triangle or even the memory addresses of the pixels that are accessed in the color buffer or in the textures. Therefore, by instrumenting the Android software renderer we can generate traces of the OpenGL ES rendering commands and we can feed these traces to a cycle-accurate GPU simulator, as described in section 5.1.

In this section we briefly describe the structure of the Android software renderer, and in section 5.1.1 we describe the instrumentation of this library. The Android software renderer consists of two static libraries:

• libagl.a: This is the Android OpenGL library. This library provides all the functions in the OpenGL ES 1.0 API. It implements the vertex processing and clipping stages of the rasterization pipeline (figure 2.2).

• libpixelflinger.a: This library implements the raster conversion, depth test and pixel processing stages of the rasterization pipeline (figure 2.2).

The libagl.a library source code is located in the directory /frameworks/base/opengl/libagl of the Android distribution. This library implements all the functions in the OpenGL ES 1.0 API; these functions are called from the applications. Regarding the rasterization process, this library implements the vertex processing and clipping stages of the rasterization pipeline shown in figure 2.2. The remainder of the stages are implemented in the libpixelflinger.a library. So the libagl.a library has classes to handle vertices, triangles, lights, transformation matrices and everything else necessary for vertex processing.

In the OpenGL ES API we can identify two types of functions: functions to configure the rendering pipeline (set the number of lights, set transformation matrices...) and functions to render polygons. The rasterization process explained in section 2.1 is triggered when the application calls a function of the second type (a rendering function). There are only a few rendering functions in the OpenGL ES API. First, glDrawArrays and glDrawElements are employed to render 3D triangles. Second, the glDrawTex function is used to render textured 2D rectangles, usually in 2D games. Section 5.1.1 describes how these rendering functions are instrumented to generate GPU traces.
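For reference, a minimal OpenGL ES 1.0 fragment (assuming a current EGL context and default pipeline state) showing the kind of rendering call that triggers the rasterization pipeline inside these libraries:

```cpp
#include <GLES/gl.h>

// A single triangle, three vertices with (x, y, z) coordinates.
static const GLfloat triangle[] = {
     0.0f,  0.5f, 0.0f,
    -0.5f, -0.5f, 0.0f,
     0.5f, -0.5f, 0.0f,
};

void drawFrame() {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, triangle);
    glDrawArrays(GL_TRIANGLES, 0, 3); // rendering function: starts rasterization
    glDisableClientState(GL_VERTEX_ARRAY);
}
```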

The libpixelflinger.a library source code is located in the directory /system/core/libpixelflinger of the Android distribution. This library implements the raster conversion, depth test and pixel processing stages of the rasterization pipeline shown in figure 2.2. The functions of this library are called from the libagl.a library to render 2D clipped triangles (figure 2.1). Although this library


can also be used directly from the applications, developers usually employ the libagl.a library to do the rendering. The libpixelflinger.a library performs the pixel generation and processing: conversion from vectorial triangles to pixels, visibility determination by using a depth buffer [9], texture mapping, per-pixel lighting... This library employs the scanline algorithm [8] to rasterize the 2D triangles.

2.3 State of the art architectures for mobile devices

In this section we review the most popular CPU/GPU architectures for smartphones and tablets. Usually, these mobile devices are provided with a system on chip (SoC) [12] including the CPU, the GPU and other specialized hardware for functions such as audio and video encoding/decoding. We start the review with the Qualcomm Snapdragon family of chipsets, which we can find in many HTC and Samsung smartphones and in the Sony Xperia PLAY. Next we describe the PowerVR family of chipsets; some of the devices using this SoC are, for instance, the Apple iPhone 4, iPad and iPad 2. Finally, we review the NVIDIA Tegra 2 SoC, which is included in several smartphones and in the Samsung Galaxy Tab.

2.3.1 Qualcomm Snapdragon

Snapdragon is a family of mobile System on Chips by Qualcomm [13]; it is a platform for use in smartphones, tablets and smartbook devices. The CPU of the Snapdragon chipset, Scorpion, is Qualcomm's own design. It is very similar to the ARM Cortex-A8 core and is based on the ARM v7 instruction set. However, it has much higher performance for multimedia-related SIMD operations due to its advanced media processing engine.

Figure 2.4: Qualcomm Snapdragon System on Chip.


On the other hand, all Snapdragon processors contain the circuitry to encode and decode high-definition video. Regarding the GPU, all the Snapdragon chipsets include the Adreno GPU, the company's proprietary GPU technology. This low-power GPU provides hardware acceleration for different graphics APIs: OpenGL ES, Direct3D and OpenVG. Furthermore, the Adreno GPU is able to accelerate 3D user interfaces for Android and other mobile operating systems, and it provides full support for websites based on the Flash and WebGL frameworks. The Qualcomm Snapdragon chipsets also include circuitry for audio encoding/decoding, communications (3G modem) and GPS, as shown in figure 2.4.

Although Adreno is a very powerful and interesting GPU, there is no technical document describing the specifications of this piece of hardware. Information such as the number of processors, the size of the caches or the instruction set is not available at all.

2.3.2 PowerVR chipsets

PowerVR is a division of Imagination Technologies that develops hardware for 2D and 3D rendering [14]. PowerVR accelerators are not manufactured by PowerVR; instead, their integrated circuits and patents are licensed to other companies such as Texas Instruments, Samsung, Apple and many others. The PowerVR graphics accelerators are included in the System on Chips of many popular devices: the Apple iPhone 4 and iPad, the Nokia N900 or the Samsung Galaxy S.

The PowerVR chipset uses a method of 3D rendering known as tile-based deferred rendering [15] (often abbreviated as TBDR). As the application feeds triangles to the PowerVR GPU, it stores them in memory in a triangle strip or an indexed format. Unlike other architectures, polygon rendering is not performed until all polygon information has been collated for the current frame. Furthermore, the expensive operations of texturing and shading pixels are delayed, whenever possible, until the visible surface at a pixel is determined.

In order to perform the rendering, the display is split into rectangular sections in a grid pattern; each section is known as a tile. Associated with each tile is a list of triangles that visibly overlap that tile. Each tile is rendered in turn to produce the final image. Tiles are rendered using a process similar to ray-casting: rays are cast onto the triangles associated with the tile and a pixel is rendered from the triangle closest to the camera.

The architecture implementing the tile-based rendering algorithm is shown in figure 2.5. As the application feeds triangles to the GPU, the Tile Accelerator (TA) assigns these triangles to the corresponding overlapping tiles. So the TA creates a list of visible triangles for each tile; the list includes the triangles' coordinates and all the necessary information: active textures, render states...

Once all the polygons in the scene have been dispatched to the GPU and have been classified into the corresponding tiles, the rendering process starts. The Image Synthesis Processor (ISP) has the responsibility of determining which pixels in a tile are visible. Hidden Surface Removal (HSR) is performed on a tile-per-tile basis, with each tile's HSR results sent to the Texture and Shading Processor (TSP) for rasterization of visible pixels. The ISP processes all triangles affecting a tile one by one. Calculating the triangle equation and projecting a ray at each position in the triangle returns accurate depth information for all pixels. This depth information is then compared with the


values in the tile's depth buffer to determine whether these pixels are visible or not. The Texture and Shading Processor (TSP) in the PowerVR pipeline behaves much like a traditional shading and texturing engine.

Figure 2.5: PowerVR GPU architecture.

Tile-based rendering architectures have several advantages over traditional rasterization architectures. Since the scene is rasterized on a tile-per-tile basis and a tile is much smaller than the whole display, all the necessary information to process a tile (color buffer and depth buffer information, for instance) can be stored on-chip. Therefore, accesses to external off-chip memory are avoided to a large extent. Other advantages such as great cache efficiency and parallel processing of localized data are also important factors.

Regarding the drawbacks of tile-based rendering, although off-chip memory accesses can be avoided in many cases, the memory bandwidth increases in other places in the pipeline. For example, the triangle/tile sorting needs to be done, and creating the triangle lists increases bandwidth usage. So the memory requirements for this triangle/tile sorting are expensive, since the GPU has to capture the information of the whole 3D scene.

There is an ongoing debate on which architecture is the best suited for rasterization. As explained in [29], the performance of the different rendering architectures is clearly scene-dependent. This means that there will be three-dimensional scenes where a tiling architecture performs much better than a standard architecture, but the opposite is also true. Unfortunately, there is no academic study analyzing the advantages and disadvantages in terms of hardware implementation and memory bandwidth usage.

A more detailed review of the PowerVR GPU architecture is provided in [16].


2.3.3 NVIDIA Tegra 2

The NVIDIA Tegra 2 mobile processor is a multi-core System on Chip for mobile devices such as smartphones and tablets. The Tegra integrates two ARM Cortex-A9 processors, an ultra-low power GeForce GPU and specialized hardware for audio and video encoding/decoding (figure 2.6).

Figure 2.6: NVIDIA Tegra 2 architecture.

NVIDIA's ultra-low power GeForce GPU in the Tegra processor is derived from the desktop GeForce GPU architecture, but is specifically tailored to meet the growing demands of mobile applications. The ultra-low power GeForce GPU is highly customized and modified to deliver high-end graphics while consuming ultra-low power. The GeForce architecture is a fixed-function pipeline architecture that includes fully programmable pixel and vertex shaders, along with an advanced texture unit that supports high-quality Anisotropic Filtering. Figure 2.7 shows the graphics processing pipeline of the GeForce GPU in the Tegra mobile processor.

The GeForce GPU includes four programmable vertex processors and four programmable pixel processors for high-speed vertex and pixel processing. Although the GeForce GPU architecture is a pipelined architecture similar to traditional desktop graphics architectures, it includes several special features and customizations to significantly reduce power consumption and deliver increased performance and graphics quality.

One of these special features is the introduction of the Early-Z stage in the GPU pipeline, placed before the pixel shader stage. Modern GPUs use a Z-buffer (depth buffer) to track which pixels in a scene are visible to the eye and which do not need to be displayed because they are occluded by other pixels. The depth test for individual pixel data as defined in the OpenGL logical pipeline happens after the pixels are processed by the pixel processor. The problem with evaluating individual pixels after the pixel shading process is that pixels must traverse nearly the entire pipeline only to ultimately discover that some are occluded and will be discarded. So processing these


non-visible pixels involves a significant number of transactions between the GPU and shared memory in the case of mobile devices, which consumes significant amounts of power. By performing the depth test before the pixel processing stage, the GeForce architecture fetches depth, color and texture data only for the visible pixels that pass the Z-test. Therefore, the main benefit of Early-Z processing is that it reduces power consumption by reducing memory traffic between the GPU and off-chip system memory.
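A small sketch of the Early-Z idea, reusing the ZBuffer type from the sketch in section 2.1; shadeAndWrite is a hypothetical placeholder for the pixel stage:

```cpp
// (ZBuffer as sketched in section 2.1; shadeAndWrite stands in for the pixel stage.)
struct Fragment { int x, y; float z; };

void shadeAndWrite(const Fragment& f); // texture fetch + shading, defined elsewhere

void processFragmentEarlyZ(ZBuffer& zbuf, const Fragment& f) {
    // Early-Z: run the depth test BEFORE shading, so depth, color and texture
    // data are only fetched for fragments that are actually visible.
    if (!zbuf.testAndSet(f.x, f.y, f.z))
        return;        // occluded: no shading, no off-chip memory traffic
    shadeAndWrite(f);  // pixel stage runs for visible fragments only
}
```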

Figure 2.7: Ultra-low power GeForce architecture.

Another feature is the use of pixel and texture caches to reduce memory transactions. The traditional OpenGL GPU pipeline specifies that pixel information such as texture, depth or color is stored in system memory (or frame buffer memory). The pixel information is moved to and from memory during the pixel processing stage. This requires a significant number of off-chip system memory transactions, and thus consumes large amounts of power. The GeForce architecture implements on-chip pixel and texture caches to reduce the system memory transactions. The pixel cache is used to store on-chip depth and color values, which can be reused for all pixels that are accessed repeatedly. The texture cache is employed to store on-chip texture elements (texels).

Finally, the GeForce GPU implements several advanced power management techniques to reduce power consumption including, for instance, multiple levels of clock gating and dynamic voltage


and frequency scaling.

A more detailed description of the NVIDIA Tegra 2 architecture is provided in [4].

2.4 Data Cache Prefetching

While trends in both the underlying semiconductor technology and in microarchitecture have significantly increased processor clock rates, the major trend in main memory technology has been in the direction of higher densities, with memory access time decreasing much less than processor cycle times. These trends have increased main memory latencies when measured in processor clock cycles. To avoid performance losses due to this disparity of speed between the CPU and main memory, microprocessors rely on a hierarchy of cache memories. However, cache memories are not always effective due to limited cache capacity and limited associativity. In order to overcome these limitations of cache memories, data can be prefetched into the cache.

In this section we review different prefetching schemes for CPUs. We start with the simplest prefetching scheme, the stride prefetcher, and next we review the Markov prefetcher and the distance prefetcher. First we review the implementation of these prefetchers by using a table to record the necessary information, and later we show how these prefetching schemes can be implemented more effectively by using a Global History Buffer [37].

Several prefetching schemes have also been proposed for the GPU; this type of prefetcher is aware of the special characteristics of the GPU architecture. In this section we review two prefetching schemes targeting GPUs. First, we describe the many-thread aware prefetcher proposed in [36]. Next we review a prefetching scheme specifically designed for texture caches [33].

The “aggressiveness” of a prefetcher can be characterized by the prefetch degree. The degree of prefetching determines how many requests can be initiated by one prefetch trigger. Increasing the degree can be beneficial, if the prefetched lines are used by the application, or harmful, if the prefetched lines are evicted before being accessed by the application.

2.4.1 CPU prefetchers

Stride prefetcher

Conventional Stride Prefetching [31] uses a table to store stride-related local history information (figure 2.8). The program counter (PC) of a load instruction is employed to index the table. Each table entry stores the load's most recent stride (the difference between the two most recent load addresses), the last address (to allow computation of the next local stride), and state information describing the stability of the load's recent stride behavior. When a prefetch is triggered, addresses a+s, a+2s, ..., a+ds are prefetched, where a is the load's current target address, s is the detected stride and d is the prefetch degree, an implementation-dependent prefetch look-ahead distance.

When originally proposed, this method was applied to a single L1 cache, and all load PCs were applied to the stride prefetching table. However, using all load PCs results in relatively high demand on L1 and L2 cache ports. Later, Nesbit et al. [37] proposed to implement the stride prefetcher using only the PCs and addresses of the loads that miss in the cache.


Figure 2.8: Stride prefetching table.

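A minimal sketch of such a PC-indexed stride prefetcher, with illustrative names and a simple two-hit confirmation policy (real hardware would use a fixed-size table rather than a hash map):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct StrideEntry {
    uint64_t lastAddr = 0; // last address seen for this PC
    int64_t  stride   = 0; // most recent stride
    int      conf     = 0; // stability of the recent stride behavior
};

class StridePrefetcher {
    std::unordered_map<uint64_t, StrideEntry> table; // indexed by load PC
    int degree; // prefetch degree d
public:
    explicit StridePrefetcher(int d) : degree(d) {}

    // Called on each training load; returns the addresses to prefetch.
    std::vector<uint64_t> access(uint64_t pc, uint64_t addr) {
        std::vector<uint64_t> prefetches;
        StrideEntry& e = table[pc];
        int64_t s = static_cast<int64_t>(addr) - static_cast<int64_t>(e.lastAddr);
        if (s == e.stride) {
            // Stride confirmed: issue a+s, a+2s, ..., a+ds.
            if (++e.conf >= 2)
                for (int i = 1; i <= degree; ++i)
                    prefetches.push_back(addr + i * e.stride);
        } else {
            e.stride = s; // new stride candidate: restart training
            e.conf = 0;
        }
        e.lastAddr = addr;
        return prefetches;
    }
};
```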

Markov prefetcher

Markov Prefetching [34] is an example of a correlation prefetching method. Correlation prefetching uses a history table to record consecutive address pairs. When a cache miss occurs, the miss address indexes the correlation table (figure 2.9). Each entry in the Markov correlation table holds a list of addresses that have immediately followed the current miss address in the past. When a table entry is accessed, the members of its address list are prefetched, with the most recent miss address first. To update the table, the previous miss address is used to index the table and the current miss address is inserted in the address list. To insert the address, the current list of addresses is shifted to the right and the new address is inserted in the “most recent” position (the column labeled “1st” in figure 2.9).

Markov prefetching models the miss address stream as a Markov graph, a probabilistic state machine. Each node in the Markov graph is an address and the arcs between nodes are labeled with the probability that the arc's source node address will be immediately followed by the target node address. Each entry in the correlation table represents a node in an associated Markov graph, and its list of memory addresses represents the arcs with the highest probabilities. Thus, the table maintains only a very raw approximation to the actual Markov probabilities.
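A minimal software sketch of the Markov correlation table (illustrative names; a hardware table would have fixed size and associativity):

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

class MarkovPrefetcher {
    static constexpr size_t kListSize = 4; // addresses kept per table entry
    std::unordered_map<uint64_t, std::deque<uint64_t>> table;
    uint64_t prevMiss = 0;
    bool havePrev = false;
public:
    // Called on every cache miss; returns predicted successors (newest first).
    std::deque<uint64_t> miss(uint64_t addr) {
        // Train: record that `addr` immediately followed the previous miss.
        if (havePrev) {
            auto& list = table[prevMiss];
            list.push_front(addr);                    // "most recent" position
            if (list.size() > kListSize) list.pop_back(); // shift out oldest
        }
        prevMiss = addr;
        havePrev = true;
        // Predict: prefetch the addresses that followed `addr` in the past.
        auto it = table.find(addr);
        return it != table.end() ? it->second : std::deque<uint64_t>{};
    }
};
```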

Distance prefetcher

Distance prefetching [35] is a generalization of Markov prefetching. Originally, distance prefetching was proposed for prefetching TLB entries, but the method is easily adapted to prefetching cache lines.


Figure 2.9: Markov prefetching. The left side of the figure shows the state of the correlation table after processing the miss address stream shown at the top of the figure. The right side illustrates the Markov transition graph that corresponds to the example miss address stream.

This prefetching scheme uses the distance between two consecutive global miss addresses, an address delta, to index the correlation table. Each correlation table entry holds a list of deltas that have followed the entry's delta in the past. Figure 2.10 shows an example address delta stream and the state of the correlation table after processing the delta stream. When a cache miss occurs, the new delta is computed by subtracting the previous miss address from the current miss address. This delta is employed to access the table, and the list of deltas in the corresponding entry is used to generate the prefetch requests. The table is updated using the same mechanism that was explained for the Markov prefetcher.

Figure 2.10: Distance prefetching. The address delta stream corresponds to the sequence of addresses used in the example of figure 2.9.

Distance prefetching is considered a generalization of Markov prefetching because one delta correlation can represent many miss address correlations. On the other hand, unlike Markov prefetching, distance prefetching's predictions are not prefetch addresses: to calculate the prefetch addresses, the predicted deltas are added to the current miss address.
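A minimal sketch of the distance prefetcher along these lines (illustrative names; hardware would bound the table size):

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

class DistancePrefetcher {
    static constexpr size_t kListSize = 4;
    std::unordered_map<int64_t, std::deque<int64_t>> table; // delta -> past deltas
    uint64_t prevMiss = 0;
    int64_t prevDelta = 0;
    int seen = 0; // number of misses observed so far
public:
    std::vector<uint64_t> miss(uint64_t addr) {
        std::vector<uint64_t> prefetches;
        int64_t delta = static_cast<int64_t>(addr) - static_cast<int64_t>(prevMiss);
        if (seen >= 2) {
            // Train: the new delta followed the previous delta.
            auto& list = table[prevDelta];
            list.push_front(delta);
            if (list.size() > kListSize) list.pop_back();
        }
        if (seen >= 1) {
            // Predict: add each correlated delta to the current miss address.
            for (int64_t d : table[delta])
                prefetches.push_back(addr + d);
            prevDelta = delta;
        }
        prevMiss = addr;
        ++seen;
        return prefetches;
    }
};
```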


Global History Buffer

Prefetch tables store prefetch history inefficiently. First, table data can become stale and consequently reduce prefetch accuracy. Second, tables suffer from conflicts that occur when multiple access keys map to the same table entry. The main solution for reducing conflicts is to increase the number of table entries; however, this approach increases the table's memory requirements. Third, tables have a fixed amount of history per entry. Adding more prefetch history per entry creates new opportunities for effective prefetching, but the additional history also increases the table's memory requirements.

A new prefetching structure, the Global History Buffer, is proposed in [37]. This prefetching structure decouples table key matching from the storage of prefetch-related history information. The overall prefetching structure has two levels (figure 2.11):

• An Index Table (IT) that is accessed with a key as in conventional prefetch tables. The key may be a load instruction's PC, a cache miss address, or some combination. The entries in the Index Table contain pointers into the Global History Buffer.

• The Global History Buffer (GHB) is an n-entry FIFO table (implemented as a circular buffer) that holds the n most recent miss addresses. Each GHB entry stores a global miss address and a link pointer. Each pointer points to the previous miss address with the same Index Table key. The link pointers are used to chain the GHB entries into address lists. Hence, each address list is a time-ordered sequence of addresses that have the same Index Table key.

All the prefetchers reviewed in the previous section can be implemented by using a GHB instead of a table. Depending on the key that is used for indexing the Index Table, the stride, Markov and distance prefetchers can be implemented more effectively with a GHB. In this section we review the implementation of the distance prefetcher using the GHB approach.

Figure 2.12 illustrates how the GHB can prefetch by using a distance prefetching scheme. The “Deltas” box shown in the figure does not exist in GHB hardware, but is extracted by finding the difference between miss addresses in the GHB. As shown in the figure, prefetch addresses are generated by taking the miss address and accumulatively adding deltas; a valid prefetch address is created from each addition.
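A minimal sketch of the two-level GHB structure (illustrative names; the fixed capacity n and FIFO eviction are omitted for brevity). For the distance scheme the Index Table key would be the last address delta, and deltas are recovered by differencing consecutive addresses returned by history():

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Two-level GHB sketch: the Index Table maps a key to the newest GHB entry
// for that key; link pointers chain same-key entries back in time.
struct GHBEntry {
    uint64_t addr;
    int prev; // index of the older entry with the same key (-1 = none)
};

class GHB {
    std::vector<GHBEntry> buf;                   // FIFO of recent miss addresses
    std::unordered_map<int64_t, int> indexTable; // key -> newest entry index
public:
    // Record a miss address under `key` (for distance prefetching, the key
    // is the delta between this miss address and the previous one).
    void insert(int64_t key, uint64_t addr) {
        auto it = indexTable.find(key);
        int prev = (it == indexTable.end()) ? -1 : it->second;
        buf.push_back({addr, prev});
        indexTable[key] = static_cast<int>(buf.size()) - 1;
    }

    // Walk the linked list for `key`: a time-ordered (newest first) sequence
    // of miss addresses that shared this key.
    std::vector<uint64_t> history(int64_t key, int maxLen) const {
        std::vector<uint64_t> out;
        auto it = indexTable.find(key);
        for (int p = (it == indexTable.end()) ? -1 : it->second;
             p != -1 && static_cast<int>(out.size()) < maxLen; p = buf[p].prev)
            out.push_back(buf[p].addr);
        return out;
    }
};
```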

With the GHB approach, one can often get a better estimate of the actual Markov graph transition probabilities than with conventional correlation methods. In fact, the GHB allows a weighting of transition probabilities based on how recently they have occurred.

2.4.2 GPU prefetchers

Many-Thread Aware Prefetching Mechanisms

All the previous prefetching schemes were designed targeting CPU architectures, so they are not aware of the special characteristics of Graphics Processing Units. Lee et al. [36] propose a new prefetching scheme specifically designed for CUDA applications in a GPGPU environment.


Figure 2.11: Global History Buffer.

Figure 2.12: Distance prefetcher implemented by using a Global History Buffer. The Head Pointer points to the last inserted address in the GHB.

The baseline GPGPU architecture is shown in figure 2.13; the architecture follows NVIDIA's CUDA programming model [17].

In the CUDA model, each core is assigned a certain number of thread blocks, a group of threads that should be executed concurrently. Each thread block consists of several warps, which are much smaller groups of threads.


Figure 2.13: An overview of the baseline GPGPU architecture.

A warp is the smallest unit of hardware execution. A core executes instructions from a warp in an SIMT (Single-Instruction Multiple-Thread) fashion. In SIMT execution, a single instruction is fetched for each warp, and all the threads in the warp execute the same instruction in lockstep, except when there is control divergence. Threads and blocks are part of the CUDA programming model, but a warp is an aspect of the microarchitectural design.

The GPGPU architecture illustrated in figure 2.13 is similar to the state-of-the-art architecture of current NVIDIA GPUs. The basic design consists of several cores and an off-chip DRAM, with memory controllers located inside the chip. Each core has SIMD execution units, a software-managed cache (shared memory), a memory request queue (MRQ) and other units. The processor has an in-order scheduler; it executes instructions from one warp, switching to another warp if source operands are not ready. The MRQ is employed to store both demand requests (from the application) and prefetch requests (from the prefetching engine). Each new request is compared to the existing requests and, in case of a match, the requests are merged.

The prefetching scheme proposed in [36], the many-thread aware hardware prefetcher, has special features that make it more effective in a GPGPU environment. First, this prefetcher provides improved scalability. Current GPGPU applications exhibit largely regular memory access patterns, so traditional CPU prefetchers should work well. However, because the number of threads is often in the hundreds, traditional training mechanisms do not scale.

In the many-thread aware prefetcher the pattern detectors are trained on a per-warp basis,


similar to those in simultaneous multithreading architectures. This aspect is critical, since many requests from different warps can easily confuse the pattern detectors. An example is shown in figure 2.14. In this example a strong stride behavior exists within each warp, but due to warp interleaving, a hardware prefetcher only sees a random pattern. In order to prevent this problem, in the many-thread aware prefetcher stride information trained per warp is stored in a per-warp stride (PWS) table. So the many-thread aware prefetcher is based on the stride prefetcher described in figure 2.8: the PC of the miss address is employed to index the different tables employed by this prefetcher, and each entry of these tables contains stride information.

(a) Accesses by warps:

    PC     Warp ID   Addr   Delta
    0x08      1         0     -
    0x08      1       100    100
    0x08      1       200    100
    0x08      2        10     -
    0x08      2       110    100
    0x08      2       210    100
    0x08      3        20     -
    0x08      3       120    100
    0x08      3       220    100

(b) Accesses seen by a hardware prefetcher:

    PC     Warp ID   Addr   Delta
    0x08      1         0     -
    0x08      2        10     10
    0x08      1       100     90
    0x08      3        20    -80
    0x08      2       110     90
    0x08      3       120     10
    0x08      3       220    100
    0x08      1       200    -20
    0x08      2       210     10

Figure 2.14: An example of memory accesses with/without warp interleaving.

On the other hand, the many-thread aware prefetcher employs stride promotion. Since memory access patterns are fairly regular in GPGPU applications, when a few warps have the same access stride for a given PC, all warps will often have the same stride for that PC. Based on this observation, when at least three PWS entries for the same PC have the same stride, the prefetcher promotes the PC-stride combination to the global stride (GS) table. By promoting strides, yet-to-be-trained warps can use the entry in the GS table to issue prefetch requests immediately, without accessing the PWS table.

Another feature of the many-thread aware prefetcher is inter-thread prefetching (IP): each thread can issue prefetch requests for threads in other warps, instead of prefetching for itself. The key idea behind IP is that when an application exhibits a strided memory access pattern across threads at the same PC, one thread generates prefetch requests for another thread. This information is stored in a separate table called the IP table. The IP table is trained until three accesses from the same PC and different warps have the same stride. Thereafter, the prefetcher issues prefetch requests from the table entry.

Figure 2.15 shows the overall design of the many-thread aware prefetcher, which consists of the three tables discussed earlier: the PWS, GS and IP tables. The IP and GS tables are indexed in parallel with a PC address. When there are hits in both tables, the prefetcher gives a higher priority to the GS table, because strides within a warp are much more common than strides across warps. Furthermore, the GS table contains only promoted strides, which means an entry in the GS table has been trained for a longer period than the strides in the IP table. If there are no hits in any table, the PWS table is indexed in the next cycle. However, if any of the tables have a hit, the prefetcher generates a request.

Figure 2.15: Many-thread aware hardware prefetcher.
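The lookup priority just described can be sketched as follows. The table types and stub lookups are assumptions for illustration; the priority order (GS over IP, with the PWS table probed one cycle later only if both miss) follows the text above.

enum class Source { None, GS, IP, PWS };

struct Hit {
    bool valid = false;
    long stride = 0;
};

// Stub lookups standing in for the real hardware tables.
struct Tables {
    Hit lookupGS(unsigned long /*pc*/) { return {}; }
    Hit lookupIP(unsigned long /*pc*/) { return {}; }
    Hit lookupPWS(unsigned long /*pc*/, unsigned /*warp*/) { return {}; }
};

Source selectPrefetchSource(Tables& t, unsigned long pc, unsigned warp, long& stride_out) {
    // The GS and IP tables are probed in parallel with the PC.
    Hit gs = t.lookupGS(pc);
    Hit ip = t.lookupIP(pc);
    if (gs.valid) { stride_out = gs.stride; return Source::GS; }  // GS wins: longer trained
    if (ip.valid) { stride_out = ip.stride; return Source::IP; }
    // Only when both miss is the PWS table indexed, one cycle later.
    Hit pws = t.lookupPWS(pc, warp);
    if (pws.valid) { stride_out = pws.stride; return Source::PWS; }
    return Source::None;
}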

On the other hand, the many-thread aware prefetcher includes an adaptive prefetch throttling mechanism to control the aggressiveness of prefetching (the prefetch degree). Large prefetch degrees can reduce performance if the prefetched lines are useless (i.e., they are evicted before being used), so the prefetcher should be able to eliminate the instances of prefetching that yield negative effects while retaining the beneficial cases. Two metrics are employed to control the prefetch degree. The early eviction rate is the number of cache blocks evicted from the prefetch cache before their first use divided by the number of useful prefetches:

Metric(EarlyEviction) = \frac{\#EarlyEvictions}{\#UsefulPrefetches}

The second metric is the merge ratio. Memory requests can be merged at various levels in the hardware. As shown in figure 2.13, each core maintains its own memory request queue (MRQ), and new requests that match an existing MRQ request are merged with it. The merge ratio is the number of intra-core merges divided by the total number of requests:

Metric(Merge) = \frac{\#IntraCoreMerges}{\#TotalRequests}

The adaptive throttling mechanism maintains the early eviction rate and the merge ratio in each one of the cores, periodically updating them and using them to adjust the degree of throttling. The throttling degree varies from 0 (0%: keep all prefetches) to 5 (100%: no prefetch). The prefetcher adjusts this degree using the current values of the two metrics according to the heuristics in figure 2.16. The early eviction rate is considered high if it is greater than 0.02, low if it is less than 0.01, and medium otherwise. The merge ratio is considered high if it is greater than 15% and low otherwise.


Early Eviction Rate   Merge Ratio   Action
High                  -             No prefetch
Medium                -             Increase throttle (fewer prefetches)
Low                   High          Decrease throttle (more prefetches)
Low                   Low           No prefetch

Figure 2.16: Throttling heuristics.
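A minimal sketch of these heuristics, using the thresholds given in the text (0.01/0.02 for the early eviction rate, 15% for the merge ratio). The function name is illustrative; the degree encoding (0 keeps all prefetches, 5 drops them all) follows the description above.

#include <algorithm>

int updateThrottleDegree(int degree, double early_eviction_rate, double merge_ratio) {
    if (early_eviction_rate > 0.02)                                   // high: no prefetch
        return 5;
    if (early_eviction_rate >= 0.01)                                  // medium: fewer prefetches
        return std::min(degree + 1, 5);
    // Low early eviction rate:
    if (merge_ratio > 0.15)                                           // high merge: more prefetches
        return std::max(degree - 1, 0);
    return 5;                                                         // low merge: no prefetch
}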

Prefetching Architecture for Texture Caches

The prefetchers described in the previous sections were designed for general-purpose computing, and their performance benefits are especially significant for applications with regular memory access patterns. Even the GPGPU prefetcher, the many-thread aware prefetcher, was specifically designed for scientific applications on a CUDA-like architecture.

Igehy et al. [33] proposed a prefetching architecture for texture caches. This prefetcher was designed targeting a traditional GPU architecture, similar to the GeForce architecture described in section 2.3.3, and graphics workloads. The main objective of this prefetcher is to accelerate the process of applying textures to triangles (texture mapping).

Texture mapping has become ubiquitous in real-time graphics hardware. In its most basic form, texture mapping is a process by which a 2D image is mapped onto a projected screen-space triangle under perspective. This operation amounts to a linear transformation in 2D homogeneous coordinates. The transformation is typically done as a backward mapping: for each pixel on the screen, the corresponding coordinate in the texture map is calculated. The backward-mapped coordinate typically does not fall exactly onto a sample in the texture map, and the texture may be minified or magnified on the screen. Filtering is applied to minimize the effects of aliasing; ideally, the filtering should be efficient and amenable to hardware acceleration. Mip mapping [40] is the filtering technique most commonly implemented in graphics hardware.

Figure 2.17 shows the part of the graphics pipeline where texture mapping is performed. The rasterizer circuitry converts 2D triangles into pixels on the screen, and each one of these pixels is processed in a fragment processor. In order to apply texture mapping, the fragment processor has to fetch the corresponding texels (texture elements) from texture memory. Since textures are located in off-chip main memory, the fragment processor is provided with a texture cache to reduce the latency of memory accesses and the number of off-chip system memory transactions.

The prefetching architecture for texture caches proposed in [33] is illustrated in figure 2.18. The architecture processes fragments as follows. As each fragment is generated, each of its texel addresses is looked up in the cache tags. If a tag check reveals a miss, the cache tags are updated with the fragment's texel address immediately and the address is forwarded to the memory request FIFO. The cache addresses associated with the fragment are forwarded to the fragment FIFO, where they are stored along with all the other data needed to process the fragment: color, depth, filtering information, etc. As the request FIFO sends requests for missing cache blocks to the texture memory system, space is reserved in the reorder buffer to hold the returning memory blocks. This guarantee of space makes the architecture robust and deadlock-free in the presence of an out-of-order memory system.


Figure 2.17: Baseline architecture for texture mapping.

When a fragment reaches the head of the fragment FIFO, it can proceed only if all of its texels are present in the cache. Fragments that generated no misses can proceed immediately, but fragments that generated one or more misses must first wait for their corresponding cache blocks to return from memory into the reorder buffer. In order to guarantee that new cache blocks do not prematurely overwrite older cache blocks, new cache blocks are committed to the cache only when their corresponding fragment reaches the head of the fragment FIFO. Fragments that are removed from the head of the FIFO have their corresponding texels read from the cache and proceed onward to the rest of the texture pipeline.
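The per-fragment flow can be sketched as below. All names are illustrative, the real design uses hardware FIFOs and a reorder buffer rather than STL containers, and the fill routine here is a simplification: real fills are matched to the fragment that requested them through the reorder buffer.

#include <cstdint>
#include <deque>
#include <unordered_set>

struct Fragment {
    std::deque<uint64_t> texel_lines;  // cache lines this fragment needs
    int pending_misses = 0;            // misses still in flight
};

struct TexturePrefetchPipe {
    std::unordered_set<uint64_t> tags;   // texture cache tags
    std::deque<uint64_t> request_fifo;   // missing lines sent to memory
    std::deque<Fragment> fragment_fifo;  // fragments waiting for their texels

    void issue(Fragment f) {
        for (uint64_t line : f.texel_lines) {
            if (!tags.count(line)) {
                tags.insert(line);              // tags updated immediately on miss,
                request_fifo.push_back(line);   // so later fragments do not re-request
                ++f.pending_misses;
            }
        }
        fragment_fifo.push_back(f);             // wait here while misses are filled
    }

    // Simplified fill: credit one outstanding miss of the oldest waiting fragment.
    void fill(uint64_t /*line*/) {
        for (Fragment& f : fragment_fifo)
            if (f.pending_misses > 0) { --f.pending_misses; return; }
    }

    bool headReady() const {
        // The head fragment proceeds only once all of its texels are present.
        return !fragment_fifo.empty() && fragment_fifo.front().pending_misses == 0;
    }
};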

Figure 2.18: Texture cache prefetcher architecture.


One of the key parameters of this prefetching architecture is the size of the fragment FIFO. This FIFO primarily masks the latency of the memory system. If the system is not to stall on a cache miss, it must be able to continually service new fragments while previous fragments are waiting for texture cache misses to be filled. Thus, the fragment FIFO depth should at least match the latency of the memory system: for example, if one fragment enters the FIFO per cycle, the FIFO needs at least as many entries as the memory latency in cycles.

3 Problem statement: Memory Wall for Low Power GPUs

3.1 Hiding memory latency on a modern low power mobile GPU

The clock rate of mobile CPUs and GPUs has grown rapidly in recent years. Nowadays it is common to find smartphones and tablets with a CPU clock rate of 1 GHz, and this trend seems set to continue in the coming years. For example, the new Qualcomm Snapdragon S3 has a clock rate of 1.5 GHz [18], and the Qualcomm roadmap includes a mobile chipset with a clock rate between 2.0 GHz and 2.5 GHz by 2012 [13]. Hence, these mobile devices are going to hit the memory wall: due to the disparity of speed between the CPU/GPU and memory, their performance will be significantly affected by the latency of accesses to main memory. Thus, techniques to hide this latency are going to be necessary. Furthermore, these techniques must improve the behavior of the memory system without breaking the limited power budget of smartphones.

Basically, there are three main techniques for hiding memory latency: caches, multithreading and prefetching. Caches are a very effective technique for CPUs; in GPUs, however, caches focus on conserving bandwidth rather than reducing latency [32]. Graphics workloads usually exhibit irregular memory access patterns, so the typical hit rates of the caches in the GPU are not as high as in a CPU. Although the hit rates are far from perfect, caches can filter a significant percentage of the accesses to system memory, so they are a good mechanism to save memory bandwidth in a GPU, and mobile GPUs include several types of caches (section 2.3.3). However, caches are not an ideal solution for hiding memory latency on GPUs due to the special characteristics of graphics workloads.

On the other hand, multithreading is a very effective technique to keep all the GPU processors utilized, and state-of-the-art desktop GPUs support thousands of simultaneous threads [19]. Figure 3.1 shows the effectiveness of multithreading for hiding the memory latency in different Android games. As we increase the number of threads in each processor we obtain better performance. With 16 warps per processor the performance is very close to that of a system with perfect caches, so multithreading is able to hide all the memory latency if the number of available threads is large enough (which is the case for graphics workloads).

Figure 3.1: Effectiveness of multithreading for hiding memory latency. As we increase the number of warps on each processor we obtain better performance.

Figure 3.2: Power consumed by the GPU main register file for different configurations.

Although multithreading is very effective for hiding the memory latency, it is also a power-hungry technique. Due to the need for fast context switching, the GPU has to keep the architectural state of all the running threads in the register file. Since the number of threads is large (thousands), the size of the main register file becomes huge when applying aggressive multithreading. As we can see in figure 3.2, the power consumed by the main register file increases significantly as we increase the number of simultaneous threads. For 32 warps the power is close to 250 mW, so the GPU runs out of power budget (the power budget of a mobile System on Chip is between 1 and 2 Watts, including the CPU, the GPU and the specialized circuitry for video encoding/decoding). Therefore, although multithreading is an effective technique for hiding the latency of main memory, it is not well suited for a low power environment.

The last technique is prefetching. Prefetching has been studied in depth and several prefetching schemes have been proposed for both CPUs and GPUs, as we have seen in section 2.4. However, there are several issues with the previous proposals. First, the CPU prefetchers are effective for applications with regular memory access patterns, and none of them directly apply to GPUs [36]. Second, the GPGPU prefetchers described in section 2.4.2 are effective for scientific applications written in CUDA: they work well in heavily multithreaded systems, but they also require applications with regular memory access patterns, so they are not well suited for graphics workloads. The GPU prefetcher for texture caches described in section 2.4.2 is very effective for graphics applications. However, it was designed for a GPU with just one pixel processor and cannot be directly applied to a multicore GPU; doing so introduces several challenges, as we will describe in chapter 4.

In conclusion, we have observed the lack of a mechanism for hiding the main memory latencyin low power systems when executing graphics workloads.

4 Proposal: Decoupled Access Execute Prefetching

4.1 Ultra-low power decoupled prefetcher

In this chapter we present our ultra-low power decoupled prefetcher for Graphics Processing Units, a prefetching scheme designed for graphics workloads in low power environments. This section is organized as follows. First, we describe the baseline GPU architecture, which is similar to the ultra-low power GeForce GPU in the NVIDIA Tegra 2 chipset (section 2.3.3). Next, we present the first version of the decoupled prefetcher, which is based on the prefetching architecture presented in [33]. Finally, we present additional optimizations to improve performance and reduce power consumption.

4.1.1 Baseline GPU Architecture

The baseline GPU architecture is illustrated in figure 4.1; it is based on the ultra-low power GeForce in the NVIDIA Tegra 2 chipset. In this architecture pixels are generated and processed as follows. First, the rasterizer circuitry performs the scan conversion, or raster conversion, described in section 2.1. The rasterizer takes 2D triangles as input and generates the pixels to fill these triangles (figure 2.1d shows an example of raster conversion). All the generated pixels, or fragments in OpenGL terminology, are inserted into the fragment queue.

Figure 4.1: Baseline GPU architecture (based on the ultra-low power GeForce GPU in the NVIDIA Tegra 2 chipset).

After raster conversion, non-visible pixels are discarded by using the Z-buffer algorithm [9]. The hardware in the Early Depth Test stage performs the visibility determination. The depth value of each fragment read from the fragment queue is compared with the current value in the Z-buffer. If the fragment's depth value is smaller than the current value, the Z-buffer is updated and the fragment proceeds through the pipeline. Otherwise, the fragment is discarded. Hence, in order to perform the visibility determination, the hardware in the Depth Test stage has to access memory at most twice for each fragment: once to read the current depth value and, if the fragment passes the depth test, a second time to write the new depth value. The Depth Test stage employs a cache to optimize this process, so part of the Z-buffer is stored within this pixel cache.
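The test itself is simple; the following minimal sketch makes the two possible memory accesses explicit. The flat array is an assumption standing in for the pixel cache backed by the Z-buffer in memory.

#include <cstddef>
#include <vector>

// Returns true if the fragment is visible and should proceed down the pipeline.
bool earlyDepthTest(std::vector<float>& zbuffer, size_t zaddr, float frag_depth) {
    float current = zbuffer[zaddr];     // first access: read the current depth
    if (frag_depth < current) {
        zbuffer[zaddr] = frag_depth;    // second access: update the Z-buffer
        return true;
    }
    return false;                       // fragment discarded, no second access
}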

After the Depth Test stage the visible fragments are packed in groups of n fragments, or tiles; we have chosen 4 as the number of fragments in each tile. These tiles are inserted in the tile queue to be processed by the fragment processors. The four fragments within a tile will be processed in the same streaming processor. The scheduler in the Fragment stage reads tiles from the queue and decides in which processor each one of the tiles will be processed. There are 4 streaming processors in the Fragment stage, and the scheduler employs a round-robin policy to dispatch tiles to the processors.

The streaming processors perform several operations on each fragment, like texture mapping, blending or per-pixel lighting. These processors are programmable: the user can specify the sequence of instructions to be applied to each fragment. The streaming processors are in-order processors, and multithreading is employed to try to hide the memory latency. Each processor has 8 thread hardware contexts grouped in 2 warps. A warp is a group of threads that are scheduled together and executed in lockstep mode. Each tile is processed by one warp, and each one of the 4 threads in a warp processes one of the 4 fragments in the tile. There are also 4 SIMD execution units, or vector units, in each streaming processor, so at a given time just one of the warps is in execution. A streaming processor fetches and executes instructions from one warp until a cache miss is encountered; then the processor fetches instructions from the other warp to try to hide the latency of the memory access.
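A minimal sketch of this switch-on-miss policy for the two warp contexts follows; the structure names are illustrative, not taken from the actual hardware.

struct WarpContext {
    unsigned pc = 0;
    bool stalled_on_miss = false;  // waiting for a cache fill
};

struct WarpScheduler {
    WarpContext warps[2];
    int active = 0;  // warp currently issuing to the 4 SIMD lanes

    // Called each cycle: keep issuing from the active warp until it stalls on
    // a cache miss, then switch to the other warp to hide the miss latency.
    int pick() {
        if (warps[active].stalled_on_miss && !warps[1 - active].stalled_on_miss)
            active = 1 - active;
        return active;
    }
};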

Each streaming processor is provided with a pixel cache and a texture cache. The pixel cache is employed to store color values of pixels (cache lines from the color buffer) and the texture cache is used to store texture elements (cache lines from texture memory). Hence, in the whole architecture there are 10 caches: the L2 cache, the pixel cache employed in the Depth Test stage, and one pixel cache and one texture cache in each one of the 4 streaming processors. Prefetching can be applied to each one of these caches in order to improve performance.

4.1.2 Decoupled prefetcher

Traditional prefetchers are triggered on cache misses: whenever a cache miss occurs, the prefetching engine triggers one or more (depending on the degree of prefetching) cache line requests to the next level of the memory hierarchy, following a prediction scheme based on history information. However, in the GPU architecture previously described a more efficient approach can be employed. The information stored in the fragment queue allows us to compute which cache lines from the Z-buffer will be accessed in the Depth Test stage. In the same manner, the information stored in the tile queue allows us to compute which cache lines from the color buffer and the texture memory will be accessed in the fragment processing stage. Therefore, this information can be employed to preemptively prefetch the cache lines that will be accessed during the processing of each one of the fragments.

In the decoupled prefetching scheme, a prefetch request is sent to the corresponding texture/pixel cache for each cache line that will be requested in the future during the processing of the fragments. The cache controller handles prefetch requests as follows. First, the tags are checked to see if the target line of the prefetch request is already in the cache. In case of a hit, the prefetch request is disregarded. In case of a miss, the prefetch request is redirected to the next level of the memory hierarchy, and the cache is updated when the data is served by the next level.
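The controller behavior is a short piece of logic; a minimal sketch follows, using a set of tags as a stand-in for the real cache structures (all names are illustrative).

#include <unordered_set>

struct CacheSketch {
    std::unordered_set<unsigned long> tags;

    // Placeholder for sending the request to the next level of the hierarchy.
    void forwardToNextLevel(unsigned long /*line_tag*/) {}

    void handlePrefetch(unsigned long line_tag) {
        if (tags.count(line_tag))
            return;                     // hit: the prefetch request is disregarded
        forwardToNextLevel(line_tag);   // miss: redirect to the next level
        tags.insert(line_tag);          // the line is installed when the data returns
    }
};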

The architecture of our proposed prefetching scheme is illustrated in figure 4.2. As we can observe, the prefetching engine is decoupled from the caches and the streaming processors. The decoupled prefetcher works as follows. For each new fragment generated in the rasterizer the corresponding address in the Z-buffer is computed, the fragment is inserted in the fragment queue and the memory address is inserted in the prefetch queue. While fragments are waiting in the fragment queue to be processed in the Depth Test stage, the prefetch requests for the corresponding cache lines are sent to the pixel cache. The prefetch queue is traversed each cycle to try to send pending prefetch requests. By preemptively prefetching cache lines we expect that all the necessary depth values will be available in the pixel cache when the fragments are read from the fragment queue and processed in the Depth Test stage.

We can apply the same decoupled prefetching scheme to the pixel and texture caches in the streaming processors. Once the visible fragments are packed in tiles, we can compute which cache lines from the color buffer and from texture memory will be accessed during the processing of each tile, so we can prefetch all the necessary cache lines while the tiles are waiting in the tile queue. However, in this case the prefetching is more challenging because there are 4 streaming processors and 8 caches, so we have to decide into which cache we are going to prefetch the lines. Furthermore, we have to guarantee that each tile is processed in the streaming processor in which we have prefetched its data. To solve this issue we move the scheduling from the entry of the Fragment stage to the output of the Depth Test stage. When a new tile is created it is scheduled to a streaming processor by using a round-robin policy, and the ID of the processor is stored in the tile queue together with the rest of the tile information. All the necessary cache lines for the tile will be prefetched into the pixel and texture caches of the corresponding processor; the prefetch queue includes an additional field to identify the target cache of each prefetch request. When the tiles are read from the tile queue they are dispatched to the corresponding streaming processor.

Merging is employed to reduce the number of prefetch requests. For example, let us assume a cache line size of 64 bytes. If the four fragments within a tile will access the memory addresses 4, 8, 12 and 16 respectively, then just one prefetch request, to cache line 0, is inserted in the prefetch queue. The prefetch queue is clocked each cycle to try to send pending prefetch requests. This queue has two fields for each entry: the tag of the cache line to be prefetched and the ID of the target cache (the cache where the data will be prefetched).
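A minimal sketch of this intra-tile merging follows (the function name and line size parameter are illustrative): the four fragment addresses are reduced to unique cache-line requests before entering the prefetch queue, so with 64-byte lines the addresses 4, 8, 12 and 16 of the example all collapse into a single request for line 0.

#include <cstdint>
#include <set>
#include <vector>

std::vector<uint64_t> mergeTileRequests(const uint64_t addrs[4], uint64_t line_size = 64) {
    std::set<uint64_t> lines;
    for (int i = 0; i < 4; ++i)
        lines.insert(addrs[i] / line_size);   // address -> cache line index
    return {lines.begin(), lines.end()};      // one prefetch per unique line
}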

Our decoupled prefetcher is based on the prefetcher described in section 2.4.2, but our work differs in several ways. First, since we have moved the Depth Test stage before the Fragment processing stage, color values and texture elements are prefetched just for the visible pixels, which significantly reduces the number of prefetch requests. Second, our prefetcher works in a multiprocessor environment with multiple caches, whereas the texture cache prefetcher described in section 2.4.2 assumes just one cache and one streaming processor.

The size of the queues (the two prefetch queues, the fragment queue and the tile queue) is a key parameter in this prefetching scheme. If the queues are small, the prefetcher cannot prefetch early enough, so when the fragments are read from the queue the prefetch requests may still be in flight and the data not yet in the cache. Thus, as we reduce the size of these queues we increase the number of compulsory misses. On the other hand, if the queues are too big we increase the number of conflicts due to the limited associativity. For example, assuming 2-way associative caches, if three different cache lines from three different tiles are prefetched to the same cache and they are mapped to the same set, a conflict miss is produced: a cache line that will be accessed by a tile is evicted because a younger prefetch request is mapped to the same set.

Figure 4.2: Decoupled prefetcher architecture.

4.1.3 Decoupled prefetcher improvements

We can further improve the decoupled prefetcher by better utilizing the bandwidth to the L2 cache. When we implemented the decoupled prefetcher in our simulation infrastructure (section 5.1.2), we realized that we were often prefetching the same cache line into different caches. An example of this case is illustrated in figure 4.3. In the example there is a prefetch request to cache line A targeting the texture cache of processor 2 and another prefetch request to the same cache line but targeting the texture cache of processor 3. A prefetch request for line A will be sent to the texture cache of processor 2; if the line is not in the cache the request will be redirected to the L2 cache, and line A will be read from the L2 cache and stored in texture cache 2. Furthermore, another prefetch request for the same line will be sent to texture cache 3 and, in case of a miss, this prefetch will also be resent to the L2 cache. Therefore, two prefetch requests to the L2 cache for the same line are generated and the line is read from the L2 cache twice.

Figure 4.3: Improved decoupled prefetcher.


We can employ a more efficient approach to handle the case described in the previous example. Since cache line A is prefetched into texture cache 2, texture cache 3 can obtain this line from texture cache 2 instead of from the L2 cache. In this way we save bandwidth to the L2 cache and we reduce power, since the texture and pixel caches are much smaller than the L2 cache.

The improvement proposed in this section is implemented as follows. First, the prefetch queue includes an additional field, Source, with the ID of the cache to which the prefetch request will be redirected in case of a miss in the target cache. When a prefetch request is inserted in the prefetch queue, its tag is compared with the tags currently in the prefetch queue. In case of a match, the Source field of the new prefetch request is set to the Cache ID of the matching request; if there is no match, the Source field is set to the ID of the L2 cache. In case of multiple matches, we select the youngest matching request. When the prefetch request is issued to the corresponding cache, the information in the Source field is packed within the request. This information is employed by the cache controller in case of a cache miss to redirect the prefetch request to the corresponding texture/pixel cache instead of to the L2 cache.
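The insertion logic can be sketched as below. Names and the L2 identifier are illustrative; entries are appended in order, so the last matching entry found when scanning the queue is the youngest one, as the text requires.

#include <cstdint>
#include <deque>

constexpr int kL2CacheId = -1;   // assumed sentinel for "redirect to the L2"

struct PrefetchEntry {
    uint64_t line_tag;
    int target_cache;   // cache into which the line will be prefetched
    int source_cache;   // where a miss is redirected: the L2 or a peer cache
};

void enqueuePrefetch(std::deque<PrefetchEntry>& q, uint64_t line_tag, int target_cache) {
    int source = kL2CacheId;
    // Compare the new tag with the tags already in the queue; on multiple
    // matches keep the youngest, i.e. the last match in insertion order.
    for (const PrefetchEntry& e : q)
        if (e.line_tag == line_tag)
            source = e.target_cache;
    q.push_back({line_tag, target_cache, source});
}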

By introducing this improvement we save bandwidth to the L2 cache, since a significant percentage of the prefetch requests that were served by the L2 cache in the previous scheme will now be served by the pixel/texture caches. Furthermore, accessing a pixel/texture cache requires less energy because these caches are much smaller than the L2 cache, so we also expect to save power. The experimental results presented in section 6.3 support these claims.

Regarding the management of the requests in the cache controller, there is no prioritization of demand requests (requests from the application) over prefetch requests from the prefetch queue or remote prefetch requests from other caches; all the requests have the same priority. It might be beneficial to serve the demand requests first; however, since we have obtained important speedups without considering priorities, we have not explored this option.

The architecture of the improved decoupled prefetcher is shown in figure 4.3. The connection between texture caches 2 and 3 is highlighted to illustrate that cache line A is obtained from texture cache 2, not from the L2 cache.

5 Evaluation methodology

5.1 Simulation infrastructure

We have developed a simulation infrastructure in order to evaluate the performance of several prefetching techniques on a mobile GPU. Our infrastructure is divided into two main components: the trace generation system and the cycle-accurate GPU simulator. The trace generation system is able to intercept all the rendering commands (OpenGL ES commands) in Android and save all the necessary information for each command: number of vertices processed, number of triangles, number of fragments generated for each triangle, etc. This information is stored in a GPU trace, which is the input to the cycle-accurate GPU simulator. The simulator computes different GPU statistics such as the number of cycles, the IPC or the miss rates of the different caches.

We have employed several existing tools to develop our infrastructure. For example, we have used Android and QEMU for the trace generation tool. Furthermore, our GPU simulator is based on a previous GPU simulator, Qsilver [39].

5.1.1 GPU trace generation

The GPU trace generation system is illustrated in figure 5.1. We employ QEMU [20] to boot and run the Android [21] operating system. On top of Android we run some smartphone applications like the web browser, the audio player or games.

When Android is executed on top of an emulator, such as QEMU, the OpenGL ES commands are processed by the Android Software Renderer, as described in section 2.2.1. We have instrumented this library to collect all the necessary information for the cycle-accurate GPU simulator.


The instrumentation code has been inserted in the three rendering functions of the OpenGL ES API: glDrawArrays, glDrawElements and glDrawTex.

Figure 5.1: GPU trace generation system.

Whenever an application calls a rendering function from the OpenGL ES API, our instrumentation code starts to collect information about the rendering process. At the beginning of the rendering function some state information is collected:

• Lighting information: lighting enabled/disabled, number of lights...

• Texturing information: texturing enabled/disabled, number of active texture units...

• Array information: for each one of the OpenGL client arrays (vertex, color, normal and texture coordinates arrays) the following information is saved:

– Enabled/Disabled.

– Base address of the array.

– Size of each element of the array in bytes.

– Stride between elements.

• Rendering mode: points, lines, triangles, triangle strip, triangle fan...

On the other hand, as the rendering command is processed we save several pieces of information for each triangle and for each pixel:

• Triangle information: visibility (is this triangle discarded in the clipping stage?) and the list of all the pixels generated to fill the triangle.


• Pixel information:

– Visibility (Is this pixel discarded in the Depth Test?).

– Address of the pixel in the Z-buffer.

– Address of the pixel in the color buffer.

– Addresses of all the texture elements accessed to process this pixel.

At the end of the rendering function, all the collected information about the rendering command is ready to be dumped to the trace file. All the instrumentation code is executed inside the guest operating system (Android), so if we simply open a file and save the information about the rendering command, the file will be created in a virtual file system, since Android is executed inside a virtual machine. We would then have to transfer the file from the virtual file system to the file system of the host operating system in order to feed this trace file to the GPU simulator. The host system is the system in which QEMU is executed, in our case Linux.

On the other hand, we can create the trace file directly in the host system by using a different approach: we can signal the end of the rendering command in the Android Software Renderer in some manner, detect this signal in QEMU and save all the collected information to the trace file in the host file system. To signal the end of a rendering command we employ an interrupt with a special code, 0x99, and we have modified the code translation in QEMU accordingly. When an interrupt instruction with code 0x99 is found, this instruction is replaced by a call to a function that saves all the collected information to the trace file.
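On the guest side, the marker can be as simple as the following sketch (the function name is hypothetical): raising a software interrupt with vector 0x99 is the event that the modified QEMU translator intercepts to dump the collected trace data on the host.

static inline void signal_end_of_render_command() {
#if defined(__i386__) || defined(__x86_64__)
    __asm__ volatile("int $0x99");   // special code detected by the modified QEMU
#endif
}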

5.1.2 Cycle accurate GPU simulator

The cycle-accurate GPU simulator is able to read the information in the trace files created by the GPU trace generation system previously described, and to simulate the execution of the rendering commands on a state-of-the-art mobile GPU similar to the one inside the NVIDIA Tegra 2 chipset.

The architecture modelled by the GPU simulator is illustrated in figure 5.2. In this section we briefly describe each one of the components in this architecture. Furthermore, we describe the power model employed to obtain the energy required to process the rendering commands.

The first stage in the graphics pipeline is the Primitive processing stage. This stage fetches all the necessary information for each vertex: position, color, normal, texture coordinates, etc. The information stored in the GPU trace about the OpenGL client arrays (see section 5.1.1) is employed to issue the corresponding memory requests to the VBO (Vertex Buffer Object) cache. As the vertex data is fetched from memory, the vertices are inserted in the first vertex queue.

Vertices are transformed and shaded in the Vertex processing stage. This stage contains several streaming processors, each of which is able to process one vertex. A sequence of instructions, or vertex shader, is applied to each one of the vertices read from the first vertex queue. The vertex shader is obtained from the GPU trace. The instruction set employed is the OpenGL Architecture Review Board ISA for vertex programs [22]. The number of streaming processors is a parameter of the simulator and can be modified in the configuration file; the default value is 4. Once a vertex is processed in a streaming processor it is inserted in the second vertex queue.

Figure 5.2: GPU architecture modelled by the cycle accurate simulator.

The next stage in the graphics pipeline is the Primitive Assembly stage. Vertices are read from the second vertex queue and grouped into the corresponding triangles. Once the 3 vertices of a 2D triangle have been found, the triangle is clipped against the screen (as described in section 2.1). Finally, the clipped 2D triangles are inserted into the triangle queue.

Raster conversion is the next step in the rasterization process. The Rasterizer takes 2D triangles from the triangle queue and generates the pixels, or fragments in OpenGL terminology, to fill the triangles. The Rasterizer module in the simulator does not perform this raster conversion itself; it employs the information stored in the GPU trace. As described in section 5.1.1, the GPU trace stores the list of pixels generated by the Android Software Renderer's rasterizer for each triangle, so the simulator does not need to perform the raster conversion: this task was performed in the Android OpenGL driver and the output was saved into the GPU trace. The generated pixels are inserted into the fragment queue.

The Early Depth Test stage performs the visibility determination by applying the Z-buffer algorithm [9]. For each fragment read from the fragment queue, a memory request is issued to the pixel cache in order to obtain the depth value in the corresponding position of the Z-buffer. If the fragment is visible, another memory request is issued to update the depth value in the Z-buffer. Finally, visible fragments are packed in groups of 4 fragments, or tiles, and inserted in the tile queue. The information stored in the GPU trace for each fragment is also employed in this stage: it includes the address in the depth buffer, which is necessary to issue the memory requests to the pixel cache, and the fragment visibility (true if the pixel is visible, false if it must be discarded).

Finally, the tiles are processed in the Fragment processing stage, where the fragments within the tiles are textured and shaded. There are several streaming processors in the Fragment stage, and each one of these processors is able to process multiple tiles. The streaming processors apply a sequence of instructions, or fragment shader, to each one of the fragments. The fragment shader is obtained from the GPU trace, as well as the memory addresses that have to be requested to process each fragment. The instruction set employed is the OpenGL Architecture Review Board ISA for fragment programs [23]. A more detailed description of these streaming processors is provided in section 4.1.1.

The GPU simulator computes several statistics, for example the total number of cycles needed to process all the rendering commands in the trace file, the number of instructions, the IPC or the miss rates of each one of the caches.

In order to compute the cycles, the GPU simulator models a streaming processor as a very simple in-order processor. The pipeline has 5 stages: Instruction Fetch, Instruction Decode, Operands Fetch, Execution and Writeback. Only one instruction can be fetched, decoded and issued per cycle. However, the same instruction is issued to all the SIMD execution units, so it is executed in parallel n times (where n is the number of execution units) but with different data. There is no forwarding mechanism: each instruction waits until its source operands are available, and in case of a data dependency the pipeline is stalled. Regarding the latencies, each instruction spends a different number of cycles in the execution stage. We have obtained the latencies of each one of the instructions in the ISA from the Qsilver GPU simulator [39].

Regarding the power model, we employ CACTI [24] to compute the energy consumed by the caches, the queues between stages and the register files. Furthermore, we employ the power model of Qsilver [39] to obtain the energy consumed by the ALUs in the streaming processors. The dynamic energy consumed by the GPU is thus the sum of the dynamic energy consumed by the following components:


• The caches: the simulator provides the number of accesses to each cache and, by using CACTI, we compute the dynamic energy required per access. We obtain the total energy consumed in each one of the caches by multiplying the total number of accesses by the energy per access.

• The queues: as in the previous case, we multiply the total number of accesses to the queue (provided by the simulator) by the energy per access (computed with CACTI).

• The streaming processors: we account for the energy consumed by the main register file and the SIMD execution units:

– Main register file: we obtain the energy per access by using CACTI and we multiply this value by the total number of accesses to the main register file (obtained from the simulator statistics).

– ALUs: we employ the power model from Qsilver [39]. It is a simple power model in which each one of the instructions in the ISA has a fixed amount of energy assigned: the energy required to execute that instruction. If there are N different instruction types in the ISA, the energy consumed by the ALUs is defined by the following equation:

E_{ALUs} = \sum_{i=1}^{N} NumInstructions_i \times E_i

The number of executed instructions of each type is computed by the GPU simulator; the energy required to execute each instruction is obtained from Qsilver.

• The prefetchers: all the prefetching schemes employ one or several structures, like tables or queues (see section 2.4), that are accessed multiple times. As in the previous cases, the number of accesses is computed by using the GPU simulator and the energy per access is provided by CACTI.

Regarding the static energy, the simulator is able to compute the number of idle cycles for each one of the hardware structures (caches, register files, queues, etc.). By combining this information with the energy estimations provided by CACTI we can compute the total leakage power. A sketch of this accounting is shown below.
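The following is a minimal sketch of the energy accounting just described. All values are placeholders: the per-access and per-instruction energies come from CACTI and Qsilver respectively, and the access, idle-cycle and instruction counts from the simulator statistics.

#include <cstdint>
#include <vector>

struct Structure {
    uint64_t accesses;            // reads + writes observed by the simulator
    double energy_per_access_nj;  // dynamic energy per access, from CACTI
    uint64_t idle_cycles;         // for the static (leakage) component
    double leakage_nj_per_cycle;  // leakage energy per cycle, from CACTI
};

double totalEnergyNJ(const std::vector<Structure>& structures,
                     const std::vector<uint64_t>& instr_counts,   // per opcode
                     const std::vector<double>& instr_energy_nj)  // per opcode, from Qsilver
{
    double e = 0.0;
    for (const Structure& s : structures) {
        e += s.accesses * s.energy_per_access_nj;     // dynamic: N_acc * E_acc
        e += s.idle_cycles * s.leakage_nj_per_cycle;  // static: leakage while idle
    }
    for (size_t i = 0; i < instr_counts.size(); ++i)
        e += instr_counts[i] * instr_energy_nj[i];    // E_ALUs = sum_i N_i * E_i
    return e;
}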

When we present power numbers in chapter 6, these numbers include the power consumed by all the caches, all the queues and all the streaming processors. Furthermore, if a prefetching scheme is employed, the power results also include the power consumed by all the hardware structures used by the prefetcher.

6 Experimental results

In this section we present the experimental results obtained with the simulation infrastructure described in section 5.1. First, we analyze several Android applications from the Android Store and establish that the most demanding applications are, as expected, games. Furthermore, we show the potential benefits of improving the memory system by analyzing the behavior of the texture and pixel caches. Second, we present the performance and power results for the different state-of-the-art prefetchers. Finally, we compare these results with the performance and power consumption of our ultra-low power decoupled prefetcher.

6.1 Workload characterization

We have analyzed several Android applications from the Android Store in order to evaluate the behavior of the CPU and the GPU. We have included several common applications, such as the web browser and the audio player, and several 2D and 3D games. To obtain statistics about the CPU we have employed a full-system simulator, MARSSx86 [25]. MARSSx86 consists of QEMU, an emulator which is able to boot and run an OS, and PTLSim [26], a cycle-accurate simulator for the x86 instruction set. We have introduced several modifications to MARSSx86. First, we have modified PTLSim to compute CPI stacks [30]. Second, we have integrated our GPU trace generator and cycle-accurate simulator (section 5.1) into MARSSx86, so we can obtain information from both the CPU and the GPU. Since PTLSim, the cycle-accurate CPU simulator, only supports the x86 ISA, we have employed the x86 version of Android [27].

The CPU configuration employed for the experiments is described in figure 6.1. We have configured the simulator to model a very simple out-of-order processor with small caches in order to keep the power consumption within the small power budget of smartphones.


CPU configuration

2-issue out-of-order core, 4 functional units: load unit, store unit, integer ALU and FPU.

Two-level branch predictor

L1 Instruction cache: 64 bytes line size, 4-way associative, 16 KBytes, 2 cycles latency.

L1 Data cache: 64 bytes line size, 4-way associative, 16 KBytes, 2 cycles latency.

L2 cache: 64 bytes line size, 8-way associative, 256 KBytes, 12 cycles latency.

Figure 6.1: CPU configuration for the experiments.

Figure 6.2: CPI stacks for several Android applications. iCommando, Shooting Range 3D and PolyBreaker 3D are commercial games from the Android market.

The results of the CPU/GPU analysis are summarized in figure 6.2, which shows the CPI stacks for several Android applications: the Android app store, the audio player, the web browser and 3 commercial games. We have included in the CPI stacks the cycles that the CPU spends waiting for the GPU. As we can observe, the behavior of the games is different from that of the rest of the applications. For applications that are not games, the CPI is relatively small (between 1.5 and 2.5 cycles per instruction) and the main sources of pipeline stalls are branch mispredictions and L2 cache misses. On the other hand, for games the CPI is large (between 5.5 and 16 cycles per instruction) and the main source of stalls is the GPU. Hence, games are the most demanding applications and, furthermore, they are the only applications that stress the GPU. These characteristics make games the ideal applications for studying the memory behavior of a GPU.


Figure 6.3: Misses per 1000 instructions for the different caches in the GPU.

We have analyzed the memory behavior of several commercial games by using our GPU simulation infrastructure. The GPU configuration employed for the experiments is described in figure 6.7. First, we have evaluated the miss rates of the different caches in the GPU; the results are shown in figure 6.3. The L2 cache presents the biggest miss rates in all the games except iBowl, in which the texture cache turns out to be the most problematic cache. Although these miss rates may seem small, a significant performance speedup can still be achieved by improving the behavior of the caches.

Figure 6.4: Texture and pixel cache analysis.


We have performed several experiments to evaluate the potential benefits of improving the pixel and texture caches; the results are shown in figure 6.4. As we can observe, the use of perfect texture caches provides a speedup of 48% on average. Furthermore, by making the pixel caches perfect we get an average speedup of 65%. Hence, a significant speedup can be achieved by improving the behavior of the different caches in the GPU.

Prefetching is one of the techniques that can be employed to improve the behavior of the memory system. However, as we have seen in section 2.4, conventional prefetchers for CPUs and GPUs work well for applications with regular memory access patterns, whereas games, and graphics workloads in general, usually exhibit unpredictable memory access patterns [32]. In order to understand the memory behavior of our applications we have analyzed the strides of all the cache misses by using Sequitur [38]. Sequitur constructs the grammar that generates the sequence of strides recorded during the execution of an application. By analyzing these grammars we can identify patterns in the strides of the cache misses, for example: a cache miss with stride 1 is usually followed by a cache miss with stride 2. Hence, by analyzing the Sequitur grammars we can evaluate how easily the strides can be predicted.
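As an illustration of the recording step, the following sketch turns the sequence of miss addresses of one cache into a sequence of strides suitable for offline grammar analysis. The function name and the one-symbol-per-line output format are assumptions, not the actual format consumed by our Sequitur setup.

#include <cstdint>
#include <fstream>
#include <vector>

void dumpMissStrides(const std::vector<uint64_t>& miss_lines, const char* path) {
    std::ofstream out(path);
    for (size_t i = 1; i < miss_lines.size(); ++i) {
        int64_t stride = static_cast<int64_t>(miss_lines[i]) -
                         static_cast<int64_t>(miss_lines[i - 1]);
        out << stride << '\n';   // one terminal symbol of the stride sequence
    }
}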

iCommando (2D game)

Pixel cache
Rules                         Histogram
1   -> [1] [1]    (44238)     1   - 91.59%
4   -> 1 [1]      (36190)     3   -  1.03%
23  -> 4 [1]      (23880)     24  -  0.72%
45  -> 23 [1]     (20327)     25  -  0.61%
516 -> 45 [1]     (18393)     23  -  0.55%

Texture cache
Rules                         Histogram
146 -> [1] [1]    (35184)     1   - 70.82%
14  -> 146 [1]    (24802)     -1  -  5.75%
183 -> 14 [1]     (21983)     64  -  3.92%
405 -> 183 [1]    (17428)     65  -  3.37%
108 -> 405 [1]    (16203)     -65 -  2.52%

Figure 6.5: Analysis of the strides of the cache misses in the pixel and texture cache of one streaming processor when running the 2D game iCommando. In the Sequitur grammars, non-terminal symbols (rules) are represented by numbers and terminal symbols (strides) are represented by numbers in square brackets. After each rule we show the number of times the rule is applied to form the input sequence of strides. We only show the 5 most frequent rules of the grammar.

PolyBreaker 3D (3D game)

Pixel cache
Rules                         Histogram
111 -> [1] [1]    (43231)     1   - 63.77%
51  -> 111 [1]    (26247)     25  -  5.44%
43  -> 51 [1]     (18251)     24  -  3.84%
60  -> 43 [1]     (13584)     -1  -  3.59%
104 -> 60 [1]     (11964)     2   -  3.53%

Texture cache
Rules                         Histogram
145   -> [1] [1]    (20848)   1   - 48.39%
99    -> 145 [1]    (16144)   8   -  6.08%
1556  -> 99 [1]     (11941)   4   -  3.61%
7341  -> 1556 [1]   (11811)   -8  -  2.82%
10316 -> 7341 [1]   (11718)   -16 -  2.18%

Figure 6.6: Analysis of the strides of the cache misses in the pixel and texture cache of one streaming processor when running the 3D game PolyBreaker 3D. For each cache the figure shows the 5 most frequent rules of the grammar and the 5 most frequent strides.

Figure 6.5 shows the result of the stride analysis for iCommando, one of the 2D games. As we can see, stride 1 is the most common stride: 91.59% of the misses in the pixel cache and 70.82% of the misses in the texture cache have stride 1. The most frequently applied rules of the grammar also include stride 1. This means that most of the time, when there is a cache miss, the next cache miss will be in the next line. We have observed a similar behavior for all the 2D games we have evaluated. So for 2D games the memory access patterns are regular and the conventional prefetchers should work relatively well. This makes sense because a 2D game basically consists of a sequence of blitting operations [28], in which a matrix of pixels is copied into another matrix (the color buffer).

On the other hand, 3D games exhibit the memory behavior described in figure 6.6. In this case, the frequency of stride 1 is only 63% in the pixel cache and 48% in the texture cache, and other strides such as 4 or 8 are relatively common. Hence, for 3D games the strides of the cache misses are not as predictable as in the previous case, which makes the work of the prefetchers harder.

6.2 State of the art prefetchers performance

In this section we evaluate the performance and power consumption of different state-of-the-art CPU and GPU prefetchers. These are the configurations we have analyzed:

• Baseline - No prefetching: this is the baseline GPU architecture shown in figure 5.2 without any kind of prefetching. The parameters of the architecture are described in figure 6.7.

• Stride prefetcher (Table): in this configuration we have included the table-based stride prefetcher shown in figure 2.8 in each one of the caches of the GPU. The stride table has a size of 16 entries and the prefetch degree is set to 2.

• Distance prefetcher (GHB): this configuration employs a distance prefetcher implemented with a GHB (figure 2.12) in each one of the caches of the GPU. The index table has a size of 16 entries, the GHB has a size of 64 entries and the prefetch degree is set to 2.

• Many-thread aware prefetcher with throttling: this configuration employs the GPU prefetcher described in figure 2.15 in each one of the caches. Each one of the tables employed by this prefetcher (the PWS, GS and IP tables) has a size of 16 entries and the prefetch degree is dynamically adapted from 0 to 5.

• Perfect caches: All the caches are ideal, and have a hit rate of 100%.

Figure 6.8 shows the speedups of the different prefetching techniques. The stride prefetcher provides an average speedup of 1.31. The distance prefetcher with GHB and the many-thread aware prefetcher provide better performance than the stride prefetcher: the GHB prefetcher achieves a speedup of 2.27, slightly better than the speedup obtained with the many-thread aware prefetcher (2.19). Although the many-thread aware prefetcher has been designed specifically for GPUs, it does not provide better performance than the state-of-the-art CPU prefetcher. There are several reasons for this. First, the many-thread aware prefetcher has been designed targeting a GPU architecture similar to the NVIDIA Fermi [19], in which there are thousands of simultaneous threads in execution at the same time.


GPU configuration

Fragment processing stage 4 Streaming processors

Vertex processing stage 4 Streaming processors

Streaming processor 4 SIMD execution units, 1 pixel cache, 1 texture cache, 8 thread hardware contexts (2 warps, 4 threads in each warp)

Pixel cache 64 bytes per line, 2-way associative, 8 KBytes, 2 cycles latency

Texture cache 64 bytes per line, 2-way associative, 8 KBytes, 2 cycles latency

L2 cache 64 bytes per line, 8-way associative, 256 KBytes, 12 cycles latency

Figure 6.7: GPU configuration for the experiments. The baseline GPU architecture is the one illustrated in figure 5.2.

As we can see in figure 6.7, in our mobile GPU architecture there are only 8 thread hardware contexts per processor due to power constraints, whereas in the NVIDIA Fermi architecture there are 1024 simultaneous threads per streaming processor. The effectiveness of mechanisms like inter-thread prefetching or stride promotion (see section 2.4.2) is therefore significantly limited by the small number of in-flight threads. Furthermore, graphics workloads do not exhibit regular memory access patterns, whereas the many-thread aware prefetcher has been designed for scientific applications developed in CUDA with very regular access patterns. Nevertheless, the performance of this prefetcher is very close to that of the GHB prefetcher on average, and it outperforms the distance prefetcher with GHB in some of the games (ibowl, pocketracing, quake2 and shooting). On the other hand, the speedups achieved by these prefetchers are far from the speedup obtained by a system with perfect caches.

Figure 6.8: Speedups for different state-of-the-art prefetchers.

Regarding the power consumption, figure 6.9 shows the power of each one of the prefetchers normalized to the power of the baseline architecture without prefetching. As we can observe, the stride prefetcher is the prefetching scheme with the smallest power consumption (due to its simplicity): it only consumes 1.1% more than the baseline architecture on average. Once again, the behaviors of the GHB prefetcher and the many-thread aware prefetcher are very close. The distance prefetcher with GHB consumes 4.5% more than the baseline GPU architecture on average, whereas the many-thread aware prefetcher consumes 4.9% more than the baseline. In three of the benchmarks the GHB prefetcher consumes more power (angryfrogs, ibowl and tankrecon), but in the other 5 games the many-thread aware prefetcher requires more power.

Figure 6.9: Normalized power consumption for different state-of-the-art prefetchers.

In conclusion, the three state-of-the-art prefetchers provide significant speedups over the baseline GPU without prefetching, especially the GHB prefetcher and the many-thread aware prefetcher. However, all of them also require more power than the baseline GPU.

6.3 Ultra-low power decoupled prefetcher performance

In this section we evaluate the performance and power consumption of our ultra-low power decoupled prefetcher. We have included three additional configurations:

• Original decoupled prefetcher: this is the prefetching architecture for texture caches proposed by Igehy et al. in [33] and described in section 2.4.2. The original idea only works for systems with one processor, so the experiments for this prefetching scheme have been performed by using just one streaming processor instead of four. The size of the prefetch queues is 32 entries.

• Decoupled prefetcher: this configuration implements our decoupled prefetcher illustrated in figure 4.2. The size of the prefetch queues is 32 entries.

• Decoupled prefetcher with optimizations: this configuration implements our decoupled prefetcher with the optimizations described in section 4.1.3 to reduce the number of requests to the L2 cache. The size of the prefetch queues is also 32 entries.

Figure 6.10: Ultra-low power decoupled prefetcher compared with state-of-the-art prefetchers.

Figure 6.10 shows the performance improvement provided by our decoupled prefetcher. The original decoupled prefetcher causes a performance penalty with respect to the baseline GPU on average. However, this is not a fair comparison, because the original decoupled prefetching scheme can only be implemented with one processor while the baseline GPU includes 4 streaming processors. Nevertheless, it offers 86% of the performance of a system with 4 processors by using just one streaming processor, and it even outperforms the baseline GPU in some of the games (angryfrogs and pocketracing).

Regarding our decoupled prefetcher, it offers better performance than the state-of-the-art CPU and GPU prefetchers in all of the games, achieving a speedup of 2.63. By reducing the number of requests to the L2 cache (decoupled prefetcher with optimizations) the speedup is even better, 2.94 on average, and close to the speedup of a system with perfect caches (3.37).

Figure 6.11 shows the speedups of our decoupled prefetcher compared to the distance prefetcher with GHB, which, as we have seen in the previous section, is the state-of-the-art prefetcher that provides the best performance on our mobile GPU architecture. Our decoupled prefetcher achieves a 15% improvement on average over the GHB prefetcher; with the optimizations applied, it provides a 29% improvement over the GHB.

Regarding the power consumption, the results are presented in figure 6.12, which shows the power consumed by each prefetching scheme normalized to the power consumed by the baseline GPU architecture without prefetching. As we can see, the decoupled prefetcher consumes less power than the distance prefetcher with GHB and the many-thread aware prefetcher in all of the games. Furthermore, the optimizations to reduce the number of accesses to the L2 cache prove effective at reducing power.


Figure 6.11: Ultra-low power decoupled prefetcher compared with the distance prefetcher implemented with GHB.

Figure 6.12: Decoupled prefetcher power consumption.

The decoupled prefetcher with these optimizations consumes less power than the baseline GPU on average, and in all of the games except ibowl. It consumes about 6% less power than the state-of-the-art CPU and GPU prefetchers, and 1.1% less power than the baseline GPU architecture on average. Although the power savings may seem small, the optimized decoupled prefetcher delivers them while also providing significant performance improvements.

In the previous graphs we have reported different power savings for different speedups, but it is also interesting to consider both parameters, power and performance, at the same time. We have therefore computed the energy-delay product for each of the prefetching schemes and normalized the results by the energy-delay product of the baseline GPU without prefetching (figure 6.13).


Figure 6.13: Normalized energy-delay product.

As we can see, the advantage of the decoupled prefetcher is even bigger when the speedup and the energy savings are considered at the same time.
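As a rough consistency check (assuming, purely for illustration, that the average normalized power and the average speedup can be combined directly, whereas figure 6.13 averages the per-benchmark products), the normalized energy-delay product of the optimized decoupled prefetcher would be

\[
\mathit{EDP}_{\mathrm{norm}} \;=\; \frac{P\,t^{2}}{P_{0}\,t_{0}^{2}} \;=\; \frac{P_{\mathrm{norm}}}{S^{2}} \;\approx\; \frac{0.989}{2.94^{2}} \;\approx\; 0.11,
\]

where \(P_{\mathrm{norm}}\) is the power normalized to the baseline and \(S\) is the speedup, i.e., roughly an order of magnitude below the baseline.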

Figure 6.14: Prefetch queue size evaluation. The graph shows the speedup achieved by the decoupled prefetcher over the baseline GPU without prefetching for different sizes of the prefetch queue, for the game shooting.


Finally, we have analyzed the impact of the size of the decoupled prefetcher's prefetch queue on the performance improvements. Figure 6.14 shows the evolution of the speedup over the baseline GPU without prefetching in the 3D game shooting as we increase the size of the prefetch queue. If the prefetch queue is too small, the prefetcher cannot prefetch the necessary lines early enough, so the number of compulsory misses increases. If the prefetch queue is too big, the likelihood that a cache line prefetched for one pixel is replaced by another cache line prefetched for a younger pixel increases; therefore, the number of conflict misses grows with the size of the prefetch queue. As we can observe in figure 6.14, the best results are obtained for intermediate values of the prefetch queue size (from 64 to 512 entries).
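To make the queue's role concrete, below is a minimal sketch of such a fixed-capacity prefetch request queue (in C++; the class and method names are hypothetical, as the thesis does not prescribe an implementation). Its capacity is exactly the parameter swept in figure 6.14: the front-end that computes prefetch addresses pushes into the queue and stalls when it is full, which bounds how far prefetches can run ahead of the fragments that will consume them.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a bounded FIFO of cache-line prefetch requests.
// The address-generation front-end pushes; the cache controller pops
// and issues the prefetch to the L2.
class PrefetchQueue {
public:
    explicit PrefetchQueue(std::size_t capacity)
        : slots_(capacity), head_(0), tail_(0), count_(0) {}

    // Push the line address computed for an upcoming pixel.
    // Returns false when the queue is full: the front-end stalls,
    // limiting how far ahead prefetches can run.
    bool push(std::uint64_t line_addr) {
        if (count_ == slots_.size()) return false;
        slots_[tail_] = line_addr;
        tail_ = (tail_ + 1) % slots_.size();
        ++count_;
        return true;
    }

    // Pop the oldest pending request to issue it to the cache.
    // Returns false when there is nothing to issue.
    bool pop(std::uint64_t& line_addr) {
        if (count_ == 0) return false;
        line_addr = slots_[head_];
        head_ = (head_ + 1) % slots_.size();
        --count_;
        return true;
    }

private:
    std::vector<std::uint64_t> slots_;  // ring buffer of line addresses
    std::size_t head_, tail_, count_;
};

A small capacity keeps the front-end from running ahead, so compulsory misses remain; a very large capacity lets lines prefetched early be evicted by later prefetches before they are used, which matches the behavior observed in figure 6.14.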

In conclusion, the ultra-low power decoupled prefetcher outperforms the state-of-the-art CPU and GPU prefetchers. It is 2.94 times faster than the baseline GPU architecture and 1.29 times faster than the best state-of-the-art prefetcher on average. Furthermore, it provides these performance improvements without increasing the power consumption; in fact, it consumes 1.1% less power than the baseline GPU architecture without prefetching on average.


7 Conclusions

Games are the most demanding applications for smartphones. Graphics workloads make intensive use of the GPU while the CPU is idle most of the time. Due to the growing disparity of speed between the GPU cores and memory, one of the main performance-limiting factors of the GPU is the latency to access main memory. Multithreading is a commonly used technique to tolerate memory latency; however, we found that it does so by significantly increasing power consumption. Prefetching is also a very effective technique for hiding memory latency on a mobile GPU: we have shown that by using prefetchers we can achieve a speedup of 2.94 on average over a GPU without prefetching on a commercial set of games. Furthermore, this speedup is achieved without increasing energy consumption, which is of primary importance in a mobile GPU.

Despite the special characteristics of graphics workloads, state-of-the-art CPU and GPGPU prefetchers are an effective mechanism to improve the memory behavior of a mobile GPU. Just by using a simple stride prefetcher implemented with a table we get a speedup of 1.31 on average over a GPU without prefetching. The distance prefetcher implemented with GHB achieves a speedup of 2.27, whereas the state-of-the-art GPGPU prefetcher (the many-thread aware prefetcher) provides a speedup of 2.19. However, all these prefetchers produce a small increase in energy consumption. Moreover, their performance enhancements are far from the speedup achieved by a system with perfect caches (3.37), so there is a significant margin for improvement.

A decoupled access/execute prefetching architecture can be very effective at hiding memory latency. Our decoupled prefetcher achieves a speedup of 2.63 over a GPU without prefetching and 1.15 over the distance prefetcher with GHB, while using just 1.4% more power than the baseline GPU. Furthermore, we also show that performance can be improved and power can be reduced by carefully moving data around and by orchestrating the accesses to the L2 cache (section 4.1.3). By using these optimizations the speedup achieved is 2.94 over a GPU without prefetching and 1.29 over the GHB prefetcher, while power is reduced by 1.1% with respect to the baseline GPU.


Traditional CPU and GPGPU prefetchers make predictions using history information. These prefetchers are triggered on cache misses, and the only information they have available is the sequence of miss addresses; from this miss address stream they try to guess which cache lines will be requested next. In contrast, the decoupled prefetcher employs the information about the pixels to compute which lines are going to be requested, so the prefetch requests are not based on predictions. Moreover, the decoupled prefetcher has better knowledge of the whole system (number of processors, number of texture caches, number of pixel caches) and can employ this information to prefetch more effectively. For instance, if the prefetcher knows that a pixel is going to be processed in streaming processor 0, then all the data necessary to process that pixel will be prefetched into the texture and pixel caches of processor 0. Knowing exactly which cache lines are going to be requested in each of the processors gives the decoupled prefetcher a big advantage over the rest of the prefetchers.
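As an illustration of this last point (all names, sizes and the pixel-to-processor mapping below are hypothetical assumptions for the sketch, not the exact design of chapter 4), the per-pixel address computation and routing could look as follows:

#include <cstdint>
#include <vector>

// Hypothetical sketch: turning pixel information into exact cache-line
// prefetches routed to the processor that will shade the pixel.
// Illustrative assumptions: row-major textures, 4-byte texels,
// 64-byte cache lines, 4 processors, tile-interleaved pixel mapping.
struct Pixel {
    int x, y;     // screen position
    float u, v;   // interpolated texture coordinates in [0, 1)
};

struct Texture {
    std::uint64_t base;  // base address of the texture in memory
    int width, height;   // dimensions in texels
};

constexpr int kLineBytes  = 64;
constexpr int kTexelBytes = 4;
constexpr int kNumProcs   = 4;

// The exact cache line the texture unit will later request for this
// pixel: the same arithmetic, performed ahead of time, so the prefetch
// is a computation rather than a prediction.
std::uint64_t texel_line_addr(const Texture& t, const Pixel& p) {
    int tx = static_cast<int>(p.u * t.width)  % t.width;
    int ty = static_cast<int>(p.v * t.height) % t.height;
    std::uint64_t addr = t.base +
        (static_cast<std::uint64_t>(ty) * t.width + tx) * kTexelBytes;
    return addr & ~static_cast<std::uint64_t>(kLineBytes - 1);
}

// Static pixel-to-processor mapping (here: interleaved 16x16 tiles).
int owner_processor(const Pixel& p) {
    return ((p.x / 16) + (p.y / 16)) % kNumProcs;
}

// One prefetch queue per streaming processor: the line lands in the
// texture cache of the processor that owns the pixel.
void prefetch_pixel(const Texture& t, const Pixel& p,
                    std::vector<std::vector<std::uint64_t>>& queues) {
    queues[owner_processor(p)].push_back(texel_line_addr(t, p));
}

Under these assumptions every prefetched line is one that will actually be requested, and it is placed in the caches of the processor that will process the pixel, which is precisely the advantage described above.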

The prefetch request queue must be sized long enough to achieve timeliness of prefetching, which mostly depends on the memory latency. But it must also avoid excessive length, which could lead to late requests evicting yet-to-be-used, previously prefetched data due to cache conflicts. We have found that lengths between 32 and 512 entries are appropriate for our workloads.
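A back-of-the-envelope sizing argument (the concrete numbers are illustrative assumptions, not measured values): for prefetches to be timely, the queue must be able to hold at least the requests generated while one memory access is in flight,

\[
Q_{\min} \;\gtrsim\; \lambda \cdot L_{\mathrm{mem}},
\]

where \(\lambda\) is the rate at which the front-end generates prefetch requests (in requests per cycle) and \(L_{\mathrm{mem}}\) is the main-memory latency in cycles. For instance, \(\lambda = 0.5\) requests/cycle and \(L_{\mathrm{mem}} = 100\) cycles would already call for about 50 entries, which falls within the range found above.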


Bibliography

[1] http://assets.en.oreilly.com/1/event/39/Internet%20Trends%20Presentation.pdf.

[2] http://www.migsmobile.net/2010/01/12/evolution-of-mobile-device-uses-and-battery-life/.

[3] http://en.wikipedia.org/wiki/Android_%28operating_system%29.

[4] http://www.nvidia.com/content/PDF/tegra_white_papers/Bringing_High-End_Graphics_to_Handheld_Devices.pdf.

[5] http://en.wikipedia.org/wiki/Rasterisation.

[6] http://en.wikipedia.org/wiki/Ray_tracing_%28graphics%29.

[7] http://en.wikipedia.org/wiki/Sutherland-Hodgeman.

[8] http://en.wikipedia.org/wiki/Scanline_algorithm.

[9] http://en.wikipedia.org/wiki/Z_buffer.

[10] http://en.wikipedia.org/wiki/Dalvik_virtual_machine.

[11] http://www.khronos.org/opengles/.

[12] http://en.wikipedia.org/wiki/System-on-a-chip.

[13] http://en.wikipedia.org/wiki/Snapdragon_%28system_on_chip%29.

[14] http://en.wikipedia.org/wiki/PowerVR.

[15] http://en.wikipedia.org/wiki/Tiled_rendering.

[16] http://www.imgtec.com/factsheets/SDK/PowerVR%20Technology%20Overview.1.0.2e.External.pdf.

[17] NVIDIA Corporation. CUDA Programming Guide, V3.0.

[18] http://www.qualcomm.com/snapdragon/specs.

[19] NVIDIA Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.

[20] http://wiki.qemu.org/.


[21] http://developer.android.com/guide/basics/what-is-android.html.

[22] http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_program.txt.

[23] http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt.

[24] http://www.hpl.hp.com/research/cacti/.

[25] http://www.marss86.org/.

[26] http://www.ptlsim.org/.

[27] http://www.android-x86.org/.

[28] http://en.wikipedia.org/wiki/Bit_blit.

[29] Tomas Akenine-Moller and Jacob Strom. Graphics processing units for handhelds. Proceedings of the IEEE, 96:779–789, 2008.

[30] Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. A performance counter architecture for computing accurate CPI components. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-XII, pages 175–184, New York, NY, USA, 2006. ACM.

[31] John W. C. Fu, Janak H. Patel, and Bob L. Janssens. Stride directed prefetching in scalar processors. SIGMICRO Newsl., 23:102–110, December 1992.

[32] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), June 2011.

[33] Homan Igehy, Matthew Eldridge, and Kekoa Proudfoot. Prefetching in a texture cache architecture. In SIGGRAPH/Eurographics Workshop on Graphics Hardware, pages 133–142, 1998.

[34] Doug Joseph and Dirk Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252–263, 1997.

[35] Gokul B. Kandiraju and Anand Sivasubramaniam. Going the distance for TLB prefetching: an application-driven study. In Proceedings of the 29th Annual International Symposium on Computer Architecture, ISCA '02, pages 195–206, Washington, DC, USA, 2002. IEEE Computer Society.

[36] Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, and Richard Vuduc. Many-thread aware prefetching mechanisms for GPGPU applications. In IEEE/ACM International Symposium on Microarchitecture, pages 213–224, 2010.

[37] Kyle J. Nesbit and James E. Smith. Data cache prefetching using a global history buffer. IEEE Micro, 25(1):90–97, 2005.


[38] Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: a linear-time algorithm. J. Artif. Int. Res., 7:67–82, September 1997.

[39] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation framework for graphics architectures. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS '04, pages 85–94, New York, NY, USA, 2004. ACM.

[40] Lance Williams. Pyramidal parametrics. In Proceedings of the 10th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '83, pages 1–11, New York, NY, USA, 1983. ACM.
