A SEMINAR REPORT
ON
DYNAMIC CACHE MANAGEMENT TECHNIQUE
BY
ELEKWA JOHN OKORIE
ESUT/ 2007/88499
PRESENTED TO
THE DEPARTMENT OF COMPUTER ENGINEERING
FACULTY OF ENGINEERING
ENUGU STATE UNIVERSITY OF SCIENCE AND
TECHNOLOGY (ESUT), ENUGU
SUBMITTED
IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR
THE AWARD OF A BACHELOR OF ENGINEERING
(B.ENG) DEGREE IN COMPUTER ENGINEERING
SEPTEMBER, 2012
CERTIFICATION
I, ELEKWA JOHN OKORIE, with registration number ESUT/2007/88499, of the Department of Computer Engineering, Enugu State University of Science and Technology, Enugu, certify that this seminar was done by me.
--------------------------------- -------------------------
Signature Date
APPROVAL PAGE
This is to certify that this seminar topic on dynamic cache management technique was approved and carried out by ELEKWA JOHN OKORIE, with Reg. No. ESUT/2007/88499, under strict supervision.
---------------------------------          ---------------------------------
ENGR. ARIYO IBIYEMI                         MR. IKECHUKWU ONAH
(Seminar Supervisor)                        (Head of Department)

--------------------                        --------------------
DATE                                        DATE
DEDICATION
This report is dedicated to God Almighty for His love, favors and protection all this time; to my parents, Mr. Ugonna Alex Okorie and Mrs. Monica Elekwa; to all those who contributed in making sure that I keep moving ahead, from my wonderful lecturers to my great friends; and to my family for their love, care, prayers and support.
ACKNOWLEDGEMENT
Apart from my own efforts, the success of any seminar depends largely on the encouragement and guidance of many others. I take this opportunity to express my gratitude to the people who have been instrumental in the successful completion of this seminar.
I would like to show my greatest appreciation to Dr. Mac Vain Ezedo and his wife; I cannot thank them enough for their tremendous support and help. God bless you all.
I am also grateful for the constant guidance and support received from all my friends: Uche Ugwoda, Peter Obaro, Matin Ozioko, Oge Raphael, Stone and others.
Finally, I thank the lecturers under whose tutelage I was taught, most especially my supervisor, ENGR. ARIYO IBIYEMI, and my amiable HOD, ENGR. IKECHUKWU ONAH. God bless you all.
TABLE OF CONTENTS
Title page
Certification
Approval Page
Dedication
Acknowledgement
Abstract
Table of Contents

CHAPTER ONE
1.0 Introduction
1.1 Power Trends for Current Microprocessors

CHAPTER TWO
2.0 Working with the L0-Cache
2.1 Pipeline Micro Architecture
2.2 Branch Prediction & Confidence Estimation – A Brief Overview

CHAPTER THREE
3.0 What is a Dynamic Cache Management Technique
3.1 Basic Idea of the Dynamic Management Scheme
3.2 Dynamic Techniques for L0-Cache Management
3.3 Simple Method
3.4 Static Method
3.5 Dynamic Confidence Estimation Method
3.6 Restrictive Dynamic Confidence Estimation Method
3.7 Dynamic Distance Estimation Method
3.8 Comparison of Dynamic Techniques

CHAPTER FOUR
4.0 Conclusion

REFERENCES
LIST OF FIGURES
Fig 1: Memory Hierarchy
Fig 2: An Instruction Cache
Fig 3: Levels of Cache
Fig 4: Pipeline Micro Architecture (A)
Fig 5: Pipeline Micro Architecture (B)
CHAPTER ONE
1.0 INTRODUCTION
First of all, what is cache memory?
Cache memory is a fast memory that is used to hold the most recently accessed data. Cache is pronounced like the word "cash". Cache memory is the level of the computer memory hierarchy situated between the processor and main memory. It is a very fast memory that the processor can access much more quickly than main memory (RAM). Cache is relatively small and expensive. Its function is to keep a copy of the data and code (instructions) currently used by the CPU. By using cache memory, wait states are significantly reduced and the work of the processor becomes more effective.
As processor performance continues to grow, and high-performance, wide-issue processors exploit the available instruction-level parallelism, the memory hierarchy must continuously supply instructions and data to the datapath to keep the execution rate as high as possible. Very often, the memory hierarchy access latencies dominate the execution time of the program. The very high utilization of the instruction memory hierarchy entails high energy demands for the on-chip I-cache subsystem.
In order to reduce the effective energy dissipation per instruction access, the addition of a small, extra cache (the L0-cache) is proposed. It serves as the primary cache of the processor and is used to store the most frequently executed portions of the code and subsequently provide them to the pipeline. The approach seeks to manage the L0-cache in a manner that is sensitive to the frequency of access of the instructions executed. It can exploit the temporal behavior of the code and can make decisions on the fly, i.e., while the code executes.
In recent years, power dissipation has become one of the major design concerns for the microprocessor industry. The shrinking device sizes and the large number of devices packed in a chip die, coupled with high operating frequencies, have led to unacceptably high levels of power dissipation. The problem of wasted power caused by unnecessary activities in various parts of the CPU during code execution has traditionally been ignored in code optimization and architectural design. Higher frequencies and larger transistor counts more than offset the lower voltages and the smaller devices, and they result in larger power consumption in the newest members of a processor family.
Figure 1: Memory Hierarchy
Cache is much faster than main memory because it is implemented using SRAM (Static Random Access Memory). The problem with DRAM, which makes up main memory, is that its cells store information as charge on capacitors, which have to be constantly refreshed in order to preserve the stored information (because of leakage current). Whenever data is read from a cell, the cell is refreshed, and the DRAM cells must in any case be refreshed very frequently, typically every 4 to 16 ms; this slows down the entire process. SRAM, on the other hand, consists of flip-flops, which stay in their state as long as the power supply is on (a flip-flop is an electrical circuit composed of transistors and resistors). Because of this, SRAM need not be refreshed and is over 10 times faster than DRAM. Flip-flops, however, are implemented using more complex circuitry, which makes SRAM much larger and more expensive, limiting its use.
Level one cache memory (called L1 cache, for Level 1 cache) is directly integrated into the processor. It is subdivided into two parts:
The first part is the instruction cache, which contains instructions from RAM that are fetched (and decoded) as they pass into the pipeline.
The second part is the data cache, which contains data from RAM and data recently used during processor operations.
Figure 2 - An instruction cache
Level 1 cache can be accessed very rapidly; its access waiting time approaches that of the internal processor registers.
Level two cache memory (called L2 cache, for Level 2 cache) is located in the same package as the processor. The level two cache is an intermediary between the processor, with its internal cache, and the RAM. It can be accessed more rapidly than the RAM, but less rapidly than the level one cache.
Level three cache memory (called L3 cache, for Level 3 cache) is located on the motherboard.
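To see why this hierarchy pays off, the average time to fetch an item can be estimated from the hit rates and access times of the individual levels. The short Python sketch below is only an illustration; the latencies and hit rates in it are assumed example values, not measurements from this report.

    # Illustrative average memory access time (AMAT) with a two-level cache.
    # All latencies (ns) and hit rates are assumed example values.
    l1_time, l2_time, ram_time = 1.0, 5.0, 60.0   # access latencies
    l1_hit, l2_hit = 0.95, 0.90                   # hit rates of L1 and L2

    # AMAT = L1 time + L1 miss rate * (L2 time + L2 miss rate * RAM time)
    amat = l1_time + (1 - l1_hit) * (l2_time + (1 - l2_hit) * ram_time)
    print(f"Average access time: {amat:.2f} ns")  # about 1.55 ns, against 60 ns for RAM alone

Even with modest hit rates, most accesses are served by the fast levels, which is why the processor rarely has to wait for RAM.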
1.1 POWER TRENDS FOR CURRENT MICROPROCESSORS

Processor                Freq (MHz)   Power (W)
DEC 21164                433          32.5
DEC 21164 (high freq)    600          45
Pentium Pro              200          28.1
Pentium II               300          41.4
Very often, the memory hierarchy access latencies dominate the execution time of the program, and the very high utilization of the instruction memory hierarchy entails high energy demands on the on-chip I-cache subsystem. In order to reduce the effective energy dissipation per instruction access, the addition of an extra cache (the L0-cache) is proposed, which serves as the primary cache of the processor and is used to store the most frequently executed portions of the code.
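The effective energy dissipation per instruction access mentioned above can be pictured as a weighted sum: fetches served by the small L0-cache are cheap, while fetches that must go on to the larger I-cache cost more. The Python sketch below uses assumed per-access energies and an assumed L0 hit rate purely for illustration, and it simplifies by probing the L0-cache on every fetch.

    # Effective energy per instruction fetch with an L0-cache in front of the I-cache.
    # Per-access energies (arbitrary units) and the hit rate are assumed values.
    e_l0, e_icache = 1.0, 8.0     # a small L0-cache costs far less per access
    l0_hit_rate = 0.70            # assumed fraction of fetches served by the L0-cache

    # An L0 miss still pays the L0 probe plus the I-cache access.
    effective_energy = e_l0 + (1 - l0_hit_rate) * e_icache
    print(effective_energy)       # 3.4 units, against 8.0 for an I-cache-only fetch

The dynamic techniques of Chapter Three try to keep the L0 hit rate high, and to avoid useless L0 probes, so that this weighted sum stays as low as possible.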
CHAPTER TWO
2.0 WORKING WITH THE L0 CACHE
Some dynamic techniques are used to manage the L0-cache.
The problem that the dynamic techniques seek to solve is
how to select the basic blocks to be stored in the L0-cache
while the program is being executed. If a block is selected, the CPU will access the L0-cache first; otherwise, it will go directly to the I-cache and bypass the L0-cache. In the case of an L0-cache miss, the CPU is directed to fetch instructions from the I-cache and to transfer the instructions from the I-cache to the L0-cache.
Figure 3: Levels of Cache
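To make the fetch path concrete, the Python sketch below models one possible version of this flow. The names, the dict-based caches and the single-line miss handling are illustrative assumptions; a real L0-cache would also need a fixed size and a replacement policy, which are omitted here.

    # Sketch of the instruction fetch path with a managed L0-cache.
    # l0_cache and i_cache are assumed dict-like {address: instruction} stores.
    def fetch(address, use_l0, l0_cache, i_cache):
        if use_l0:                         # the dynamic scheme decided to use the L0-cache
            if address in l0_cache:
                return l0_cache[address]   # L0 hit: the cheapest possible fetch
            instr = i_cache[address]       # L0 miss: fall back to the I-cache...
            l0_cache[address] = instr      # ...and copy the instruction into the L0-cache
            return instr
        return i_cache[address]            # the scheme said "bypass": go straight to the I-cache

The techniques in Chapter Three differ only in how the use_l0 decision is made.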
2.1 PIPELINE MICRO ARCHITECTURE
Figure 4: Pipeline Micro Architecture (A)
Figure 5: Pipeline Micro Architecture (B)
Figure 4 shows the processor pipeline modeled here. The pipeline is typical of embedded processors such as the StrongARM. There are five stages in the pipeline: fetch, decode, execute, mem and writeback. There is no external branch predictor; all branches are predicted "untaken", and there is a two-cycle delay for "taken" branches. Instructions can be delivered to the pipeline from one of three sources: the line buffer, the I-cache and the DFC (decoded filter cache). There are three ways to determine where to fetch instructions from:
• Serial: the sources are accessed one by one in a fixed order;
• Parallel: all the sources are accessed in parallel;
• Predictive: the access order can be serial with a flexible order, or parallel, based on a prediction.
Serial access results in minimal power, because the most power-efficient source is always accessed first, but it also results in the highest performance degradation, because every miss in the first accessed source generates a bubble in the pipeline.
On the other hand, parallel access causes no performance degradation, but the I-cache is always accessed and there are no power savings in instruction fetch. Predictive access, if accurate, can have both the power efficiency of serial access and the low performance degradation of parallel access; it is therefore the approach adopted here. As shown in the pipeline figures above, a predictor decides which source to access first based on the current fetch address. Another function of the predictor is pipeline gating. Suppose a DFC hit is predicted for the next fetch at cycle N. The fetch stage is disabled at cycle N and the decoded instruction is sent from the DFC to latch 5. Then, at cycle N+1, the decode stage is disabled and the decoded instruction is sent from latch 5 to latch 2. If an instruction is fetched from the I-cache, the hit cache line is also sent to the line buffer. The line buffer can then provide instructions for subsequent fetches to the same line.
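A very small model of the predictive access order might look like the Python sketch below. The predictor here is simply a table that remembers which source served each fetch address the previous time; that policy, and the source names, are assumptions made for illustration.

    # Toy predictor for choosing the first fetch source (line buffer, DFC or I-cache).
    # last_source remembers where each address was found previously (assumed policy).
    last_source = {}

    def choose_source(address):
        # Predict the source that worked last time; default to the I-cache.
        return last_source.get(address, "icache")

    def record_outcome(address, source):
        # Train the predictor once the fetch has been resolved.
        last_source[address] = source

If the prediction is right, only the cheap source is accessed; if it is wrong, the other sources are probed afterwards, which is where the small performance penalty of predictive access comes from.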
2.2 BRANCH PREDICTION & CONFIDENCE ESTIMATION
– A BRIEF OVERVIEW
2.2.1 Branch Prediction
Branch prediction is an important technique to increase
parallelism in the CPU, by predicting the outcome of a
conditional branch instruction as soon as it is decoded.
Successful branch prediction mechanisms take advantage of
the non-random nature of branch behavior. Most branches
are either mostly taken or mostly not taken in the course of program execution.
The commonly used branch predictors are:
1. Bimodal branch predictor.
The bimodal branch predictor uses a counter for determining the prediction. Each time a branch is taken, the counter is incremented by one, and each time it falls through, it is decremented by one. The prediction is made by looking at the value of the counter: if it is less than a threshold value, the branch is predicted as not taken; otherwise, it is predicted as taken.
Figure 6
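The counter described above is usually realized as a small saturating counter kept per branch; the Python sketch below assumes a two-bit counter with the threshold in the middle of its range, purely for illustration.

    # Minimal bimodal predictor: one saturating counter per branch address.
    # A 2-bit counter is assumed; values 0-1 predict "not taken", 2-3 predict "taken".
    counters = {}

    def predict(branch_pc):
        return counters.get(branch_pc, 1) >= 2          # threshold in the middle of the range

    def update(branch_pc, taken):
        c = counters.get(branch_pc, 1)
        counters[branch_pc] = min(3, c + 1) if taken else max(0, c - 1)

Saturating at a small maximum keeps a long run of taken outcomes from building up unbounded history, so the predictor can still adapt after only a couple of not-taken outcomes.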
2. Global branch predictor
The global branch predictor considers the past behavior of other branches, as well as that of the current branch, in order to predict the behavior of the current branch.
2.2.2 Confidence Estimation
The relatively new concept of confidence estimation has been introduced to keep track of the quality of branch prediction. Confidence estimators are hardware mechanisms that are accessed in parallel with the branch predictors when a branch is decoded, and they are modified when the branch is resolved. They characterize a branch as 'high confidence' or 'low confidence' depending upon how accurately the branch predictor has been predicting that particular branch. If the branch predictor has predicted a branch correctly most of the time, the confidence estimator designates the prediction as 'high confidence'; otherwise, as 'low confidence'.
CHAPTER THREE
3.0 WHAT IS A DYNAMIC CACHE MANAGEMENT TECHNIQUE?
A dynamic cache management technique is a strategy for managing the cache memory of a high-performance processor at run time, while the code executes. The instruction memory hierarchy is the mechanism that supplies the instruction stream to the pipeline, and it accounts for a significant fraction of a chip's transistors; extrapolating from current trends, this is likely to remain true in the near future. Because this part of the chip is utilized so heavily, a dynamic management technique regulates how it is accessed, for example by deciding when the small L0-cache is used instead of the larger I-cache, so that the high utilization does not translate into unnecessarily high energy dissipation.
3.1 BASIC IDEA OF THE DYNAMIC MANAGEMENT
SCHEME
The dynamic scheme for the L0-cache should be able to select the most frequently executed basic blocks for placement in the L0-cache. It should also rely on existing mechanisms, without much hardware investment, if it is to be attractive for energy reduction.
Branch prediction, in conjunction with confidence estimation, provides a reliable solution to this problem.
Unusual Behavior of the Branches
A branch that was predicted 'taken' with 'high confidence' is expected to be taken during program execution. If it is not taken, it is assumed to be behaving 'unusually'.
The basic idea is that, if a branch behaves 'unusually', the dynamic scheme disables L0-cache access for the subsequent basic blocks. Under this scheme, only basic blocks that are executed frequently tend to make it to the L0-cache, hence avoiding cache pollution problems in the L0-cache.
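Putting the two mechanisms together, the 'unusual behavior' test reduces to a single check. The Python sketch below is a minimal illustration using the predictor and confidence labels outlined in Chapter Two; the function and argument names are chosen here, not taken from the report.

    # A branch behaves "unusually" when a high-confidence prediction turns out wrong.
    def behaves_unusually(predicted_taken, actually_taken, confidence_level):
        mispredicted = predicted_taken != actually_taken
        return confidence_level == "high" and mispredicted

    # When a branch behaves unusually, the dynamic scheme disables L0-cache access for
    # the following basic blocks, so only frequently executed blocks reach the L0-cache.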
3.2 DYNAMIC TECHNIQUES FOR L0-CACHE
MANAGEMENT
The dynamic techniques discussed in the following sections select the basic blocks to be placed in the L0-cache. There are five techniques for the management of the L0-cache:
1. Simple Method.
2. Static Method.
3. Dynamic Confidence Estimation Method.
4. Restrictive Dynamic Confidence Estimation Method.
5. Dynamic Distance Estimation Method.
Different dynamic techniques trade off energy reduction against performance degradation.
3.3 SIMPLE METHOD
The confidence estimation mechanism is not used in the simple method. The branch predictor can be used as a stand-alone mechanism to provide insight into which portions of the code are frequently executed and which are not. A mispredicted branch is assumed to drive the thread of execution to an infrequently executed part of the program.
The strategy used for selecting the basic blocks is as follows: if a branch is mispredicted, the machine will access the I-cache to fetch the instructions; if a branch is predicted correctly, the machine will access the L0-cache. On a misprediction, the pipeline is flushed and the machine starts fetching instructions from the correct address by accessing the I-cache. The energy dissipation and the execution time of the original configuration, which uses no L0-cache, are taken as unity, and everything is normalized with respect to them.
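In code form, the selection rule of the simple method is a single condition on the outcome of the last branch. The Python sketch below is only an illustration, with names chosen here:

    # Simple method: use the L0-cache only while branches are being predicted correctly.
    def fetch_from_l0(last_branch_mispredicted):
        # A misprediction sends the machine to the I-cache for the following basic block(s).
        return not last_branch_mispredicted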
3.4 STATIC METHOD
The selection criteria adopted for the basic blocks are:
If a 'high confidence' branch was predicted incorrectly, the I-cache is accessed for the subsequent basic blocks.
If more than n 'low confidence' branches have been decoded in a row, the I-cache is accessed.
The L0-cache is therefore bypassed when either of the two conditions is satisfied; in any other case the machine will access the L0-cache.
The first rule for accessing the I-cache is due to the fact that a mispredicted 'high confidence' branch behaves 'unusually' and drives the program to an infrequently executed portion of the code. The second rule is due to the fact that a series of 'low confidence' branches will also suffer from the same problem, since the probability that they are all predicted correctly is low.
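The two rules of the static method can be written compactly as below. The threshold n on consecutive 'low confidence' branches is left as a parameter, since the report does not fix a value; the Python sketch is illustrative only.

    # Static method: bypass the L0-cache after an "unusually" behaving branch or after
    # a long run of low-confidence branches. n is an assumed, tunable threshold.
    def access_l0(high_conf_branch_mispredicted, consecutive_low_conf_branches, n):
        bypass = high_conf_branch_mispredicted or consecutive_low_conf_branches > n
        return not bypass      # True means: fetch the next basic blocks from the L0-cache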
3.5 DYNAMIC CONFIDENCE ESTIMATION METHOD
The dynamic confidence estimation method is a dynamic version of the static method: the confidence of each branch is now estimated dynamically, at run time. The I-cache is accessed if a 'high confidence' branch is mispredicted, or if more than n successive 'low confidence' branches are encountered.
The dynamic confidence estimation mechanism is slightly better in terms of energy reduction than the simple or static methods. Since the confidence estimator can adapt dynamically to the temporal behavior of the code, it is more accurate in characterizing a branch and, therefore, in regulating access to the L0-cache.
3.6 RESTRICTIVE DYNAMIC CONFIDENCE ESTIMATION
METHOD
The methods described in the previous sections tend to place a large number of basic blocks in the L0-cache, thus degrading performance. The restrictive dynamic scheme is a more selective scheme in which only the really important basic blocks are selected for the L0-cache.
The selection mechanism is slightly modified as follows: the L0-cache is accessed only if a 'high confidence' branch is predicted correctly; the I-cache is accessed in any other case.
This method selects some of the most frequently
executed basic blocks, yet it misses some others. It has
much lower performance degradation, at the expense of
lower energy reduction. It is probably preferable in a
system where performance is more important than
energy.
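The restrictive variant inverts the default: the L0-cache is used only when there is positive evidence that execution is on a frequently taken path. A minimal Python sketch, with names chosen here for illustration:

    # Restrictive dynamic confidence estimation: the L0-cache is accessed only after a
    # correctly predicted high-confidence branch; every other case goes to the I-cache.
    def access_l0(confidence_level, prediction_was_correct):
        return confidence_level == "high" and prediction_was_correct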
3.7 DYNAMIC DISTANCE ESTIMATION METHOD
The dynamic distance estimation method is based on the fact that a mispredicted branch often triggers a series of successive mispredicted branches. The method works as follows: branches that follow shortly after a mispredicted branch are tagged as 'low confidence', otherwise they are tagged as 'high confidence', and the basic blocks after a 'low confidence' branch are fetched from the I-cache. The net effect is that a branch misprediction causes a series of fetches from the I-cache.
A counter is used to measure the distance of a branch from the previous mispredicted branch. This scheme is even more selective in storing instructions in the L0-cache than the restrictive dynamic confidence estimation method.
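A sketch of the distance-based rule is shown below in Python. The distance threshold is an assumed parameter; the report only states that a counter measures the distance from the last mispredicted branch.

    # Dynamic distance estimation: count branches since the last misprediction and treat
    # basic blocks close (in branches) to a misprediction as I-cache material.
    DISTANCE_THRESHOLD = 4                      # assumed value, left as a design knob
    branches_since_mispredict = DISTANCE_THRESHOLD

    def on_branch_resolved(mispredicted):
        global branches_since_mispredict
        branches_since_mispredict = 0 if mispredicted else branches_since_mispredict + 1

    def access_l0():
        # Far enough from the last misprediction: the code is assumed hot, use the L0-cache.
        return branches_since_mispredict >= DISTANCE_THRESHOLD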
3.8 COMPARISON OF DYNAMIC TECHNIQUES
The energy reduction and the delay increase are a function of the algorithm used for the regulation of L0-cache accesses, the size of the L0-cache, its block size and its associativity. For example, a larger block size gives a larger hit ratio in the L0-cache. This results in a smaller performance overhead and better energy efficiency, since the I-cache does not need to be accessed so often.
On the other hand, if the block size increase does not have a large impact on the hit ratio, the energy dissipation may go up, since a cache with a larger block size is less energy efficient than a cache of the same size with a smaller block size.
The static method and the dynamic confidence estimation method make the assumption that the less frequently executed basic blocks usually follow less predictable branches, i.e., branches that are mispredicted.
The simple method and the restrictive dynamic confidence estimation method address the problem from another angle. They make the assumption that the most frequently executed basic blocks usually follow highly predictable branches.
The dynamic distance estimation method is the most successful in reducing the performance overhead, but the least successful in energy reduction. This method has a stricter requirement for a basic block to be selected for the L0-cache than the original dynamic confidence estimation method.
A larger block size and higher associativity will have a beneficial effect on both energy and performance. The hit rate of a small cache is more sensitive to variations in the block size and the associativity.
CHAPTER FOUR
4.0 CONCLUSION
This report has presented methods for the dynamic selection of basic blocks for placement in the L0-cache. It explained the functionality of the branch prediction and confidence estimation mechanisms in high-performance processors. Finally, five different dynamic techniques were discussed for the selection of the basic blocks. These techniques try to capture the execution profile of the basic blocks by using the branch statistics that are gathered by the branch predictor.
The experimental evaluation reported in the literature demonstrates the applicability of the dynamic techniques for the management of the L0-cache. Different techniques can trade off energy against delay by regulating the way the L0-cache is accessed.
REFERENCES
[1] J. Diguet, S. Wuytack, F. Catthoor, and H. De Man, "Formalized methodology for data reuse exploration in hierarchical memory mappings," in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 30-35, Aug. 1997.
[2] J. Kin, M. Gupta, and W. Mangione-Smith, "The filter cache: An energy efficient memory structure," in Proceedings of the International Symposium on Microarchitecture, pp. 184-193, Dec. 1997.
[3] N. Bellas, I. Hajj, C. Polychronopoulos, and G. Stamoulis, "Architectural and compiler support for energy reduction in the memory hierarchy of high performance microprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 70-75, Aug. 1998.
[4] S. Manne, D. Grunwald, and A. Klauser, "Pipeline gating: Speculation control for energy reduction," in Proceedings of the International Symposium on Computer Architecture, pp. 132-141, 1998.
[5] SpeedShop User's Guide. Silicon Graphics, Inc., 1996.
[6] S. Wilton and N. Jouppi, "An enhanced access and cycle time model for on-chip caches," Tech. Rep. 93/5, DEC Western Research Laboratory, July 1994.
[7] N. Bellas, I. Hajj, and C. Polychronopoulos, "Using dynamic cache management techniques to reduce energy in a high-performance processor," Department of Electrical & Computer Engineering and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main Street, Urbana, IL 61801, USA.