A SEMINAR REPORT
ON
DYNAMIC CACHE MANAGEMENT TECHNIQUE
BY
ELEKWA JOHN OKORIE
ESUT/ 2007/88499
PRESENTED TO
THE DEPARTMENT OF COMPUTER ENGINEERING
FACULTY OF ENGINEERING
ENUGU STATE UNIVERSITY OF SCIENCE AND
TECHNOLOGY (ESUT), ENUGU
SUBMITTED
IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR
THE AWARD OF A BACHELOR OF ENGINEERING
(B.ENG) DEGREE IN COMPUTER ENGINEERING
SEPTEMBER, 2012
CERTIFICATION
I, ELEKWA JOHN OKORIE, with registration number ESUT/2007/88499, of the Department of Computer Engineering, Enugu State University of Science and Technology, Enugu, certify that this seminar was done by me.
--------------------------------- -------------------------
Signature Date
APPROVAL PAGE
This is to certify that this seminar topic on dynamic cache management technique was approved and carried out by ELEKWA JOHN OKORIE, with Reg. No. ESUT/2007/88499, under strict supervision.
---------------------------------          ---------------------------------
ENGR. ARIYO IBIYEMI                         MR. IKECHUKWU ONAH
(Seminar Supervisor)                        (Head of Department)

--------------------                        --------------------
DATE                                        DATE
DEDICATION
This report is dedicated to God Almighty for His love, favors and protection all this time; to my parents, Mr. Ugonna Alex Okorie and Mrs. Monica Elekwa; to all those who contributed in making sure that I keep moving ahead, from my wonderful lecturers to my great friends; and to my family for their love, care, prayers and support.
ACKNOWLEDGEMENT
Apart from my own efforts, the success of any seminar depends largely on the encouragement and guidance of many others. I take this opportunity to express my gratitude to the people who have been instrumental in the successful completion of this seminar.
I would like to show my greatest appreciation to Dr. Mac Vain Ezedo and his wife; I cannot thank them enough for their tremendous support and help. God bless you all.
I am also grateful for the constant guidance and support received from all my friends: Uche Ugwoda, Peter Obaro, Matin Ozioko, Oge Raphael, Stone and others.
Finally, I thank the lecturers under whose tutelage I was taught, most especially my supervisor, ENGR. ARIYO IBIYEMI, and my amiable HOD, ENGR. IKECHUKWU ONAH. God bless you all.
TABLE OF CONTENTS
Title page
Certification
Approval Page
Dedication
Acknowledgement
Abstract
Table of Contents

CHAPTER ONE
1.0 Introduction
1.1 Power Trends for Current Microprocessors

CHAPTER TWO
2.0 Working with the L0-Cache
2.1 Pipeline Micro Architecture
2.2 Branch Prediction & Confidence Estimation – A Brief Overview

CHAPTER THREE
3.0 What is a Dynamic Cache Management Technique
3.1 Basic Idea of the Dynamic Management Scheme
3.2 Dynamic Techniques for L0-Cache Management
3.3 Simple Method
3.4 Static Method
3.5 Dynamic Confidence Estimation Method
3.6 Restrictive Dynamic Confidence Estimation Method
3.7 Dynamic Distance Estimation Method
3.8 Comparison of Dynamic Techniques

CHAPTER FOUR
4.0 Conclusion

REFERENCES
LIST OF FIGURES
Fig 1: Memory Hierarchy
Fig 2: An Instruction Cache
Fig 3: Levels of Cache
Fig 4: Pipeline Micro Architecture (A)
Fig 5: Pipeline Micro Architecture (B)
CHAPTER ONE
1.0 INTRODUCTION
First of all, what is cache memory?
Cache memory is a fast memory that is used to hold the most recently accessed data. Cache is pronounced like the word "cash". Cache memory is the level of the computer memory hierarchy situated between the processor and main memory. It is a very fast memory that the processor can access much more quickly than main memory (RAM). Cache is relatively small and expensive. Its function is to keep a copy of the data and code (instructions) currently used by the CPU. By using cache memory, wait states are significantly reduced and the work of the processor becomes more effective.
As processor performance continues to grow, and high-performance, wide-issue processors exploit the available instruction-level parallelism, the memory hierarchy must continuously supply instructions and data to the datapath to keep the execution rate as high as possible. Very often, the memory hierarchy access latencies dominate the execution time of the program. The very high utilization of the instruction memory hierarchy entails high energy demands for the on-chip I-cache subsystem.
In order to reduce the effective energy dissipation per instruction access, the addition of a small, extra cache (the L0-cache) is proposed. It serves as the primary cache of the processor and is used to store the most frequently executed portions of the code and subsequently provide them to the pipeline. The approach seeks to manage the L0-cache in a manner that is sensitive to the frequency of access of the instructions executed. It can exploit the temporal behavior of the code and can make decisions on the fly, i.e., while the code executes.
In recent years, power dissipation has become one of the major design concerns for the microprocessor industry. The shrinking device sizes and the large number of devices packed in a chip die, coupled with high operating frequencies, have led to unacceptably high levels of power dissipation. The problem of wasted power caused by unnecessary activities in various parts of the CPU during code execution has traditionally been ignored in code optimization and architectural design. Higher frequencies and larger transistor counts more than offset the lower voltages and the smaller devices, and they result in larger power consumption in the newest members of a processor family.
Figure 1: Memory Hierarchy
Cache is much faster than main memory because it is implemented using SRAM (Static Random Access Memory). The problem with DRAM, which makes up main memory, is that its cells store information as charge on capacitors, which have to be constantly refreshed in order to preserve the stored information (because of leakage current). Whenever data is read from a cell, the cell is refreshed, and the DRAM cells must in any case be refreshed very frequently, typically every 4 to 16 ms; this slows down the entire process. SRAM, on the other hand, consists of flip-flops, which stay in their state as long as the power supply is on (a flip-flop is an electrical circuit composed of transistors and resistors). Because of this, SRAM need not be refreshed and is over 10 times faster than DRAM. Flip-flops, however, are implemented using more complex circuitry, which makes SRAM much larger and more expensive, limiting its use.
Level one cache memory (called L1 cache, for Level 1 cache) is directly integrated into the processor. It is subdivided into two parts:
The first part is the instruction cache, which contains instructions from RAM that are fetched (and decoded) as they pass into the pipeline.
The second part is the data cache, which contains data from RAM and data recently used during processor operations.
Figure 2 - An instruction cache
Level 1 cache can be accessed very rapidly; its access waiting time approaches that of the internal processor registers.
Level two cache memory (called L2 cache, for Level 2 cache) is located in the same package as the processor. The level two cache is an intermediary between the processor, with its internal cache, and the RAM. It can be accessed more rapidly than the RAM, but less rapidly than the level one cache.
Level three cache memory (called L3 cache, for Level 3 cache) is located on the motherboard.
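To see why this hierarchy pays off, the average time to fetch an item can be estimated from the hit rates and access times of the individual levels. The short Python sketch below is only an illustration; the latencies and hit rates in it are assumed example values, not measurements from this report.

    # Illustrative average memory access time (AMAT) with a two-level cache.
    # All latencies (ns) and hit rates are assumed example values.
    l1_time, l2_time, ram_time = 1.0, 5.0, 60.0   # access latencies
    l1_hit, l2_hit = 0.95, 0.90                   # hit rates of L1 and L2

    # AMAT = L1 time + L1 miss rate * (L2 time + L2 miss rate * RAM time)
    amat = l1_time + (1 - l1_hit) * (l2_time + (1 - l2_hit) * ram_time)
    print(f"Average access time: {amat:.2f} ns")  # about 1.55 ns, against 60 ns for RAM alone

Even with modest hit rates, most accesses are served by the fast levels, which is why the processor rarely has to wait for RAM.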
1.1 POWER TRENDS FOR CURRENT MICROPROCESSORS

Processor                Freq (MHz)   Power (W)
DEC 21164                433          32.5
DEC 21164 (high freq)    600          45
Pentium Pro              200          28.1
Pentium II               300          41.4
Very often, the memory hierarchy access latencies dominate the execution time of the program, and the very high utilization of the instruction memory hierarchy entails high energy demands on the on-chip I-cache subsystem. In order to reduce the effective energy dissipation per instruction access, the addition of an extra cache (the L0-cache) is proposed, which serves as the primary cache of the processor and is used to store the most frequently executed portions of the code.
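The effective energy dissipation per instruction access mentioned above can be pictured as a weighted sum: fetches served by the small L0-cache are cheap, while fetches that must go on to the larger I-cache cost more. The Python sketch below uses assumed per-access energies and an assumed L0 hit rate purely for illustration, and it simplifies by probing the L0-cache on every fetch.

    # Effective energy per instruction fetch with an L0-cache in front of the I-cache.
    # Per-access energies (arbitrary units) and the hit rate are assumed values.
    e_l0, e_icache = 1.0, 8.0     # a small L0-cache costs far less per access
    l0_hit_rate = 0.70            # assumed fraction of fetches served by the L0-cache

    # An L0 miss still pays the L0 probe plus the I-cache access.
    effective_energy = e_l0 + (1 - l0_hit_rate) * e_icache
    print(effective_energy)       # 3.4 units, against 8.0 for an I-cache-only fetch

The dynamic techniques of Chapter Three try to keep the L0 hit rate high, and to avoid useless L0 probes, so that this weighted sum stays as low as possible.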
CHAPTER TWO
2.0 WORKING WITH THE L0 CACHE
Some dynamic techniques are used to manage the L0-cache.
The problem that the dynamic techniques seek to solve is
how to select the basic blocks to be stored in the L0-cache
while the program is being executed. If a block is selected, the CPU will access the L0-cache first; otherwise, it will go directly to the I-cache and bypass the L0-cache. In the case of an L0-cache miss, the CPU is directed to fetch instructions from the I-cache and to transfer the instructions from the I-cache to the L0-cache.
Figure 3: Levels of Cache
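To make the fetch path concrete, the Python sketch below models one possible version of this flow. The names, the dict-based caches and the single-line miss handling are illustrative assumptions; a real L0-cache would also need a fixed size and a replacement policy, which are omitted here.

    # Sketch of the instruction fetch path with a managed L0-cache.
    # l0_cache and i_cache are assumed dict-like {address: instruction} stores.
    def fetch(address, use_l0, l0_cache, i_cache):
        if use_l0:                         # the dynamic scheme decided to use the L0-cache
            if address in l0_cache:
                return l0_cache[address]   # L0 hit: the cheapest possible fetch
            instr = i_cache[address]       # L0 miss: fall back to the I-cache...
            l0_cache[address] = instr      # ...and copy the instruction into the L0-cache
            return instr
        return i_cache[address]            # the scheme said "bypass": go straight to the I-cache

The techniques in Chapter Three differ only in how the use_l0 decision is made.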
2.1 PIPELINE MICRO ARCHITECTURE
Figure 4: Pipeline Micro Architecture (A)
Figure 5: Pipeline Micro Architecture (B)
Figure 4 shows the processor pipeline modeled here. The pipeline is typical of embedded processors such as the StrongARM. There are five stages in the pipeline: fetch, decode, execute, mem and writeback. There is no external branch predictor; all branches are predicted "untaken", and there is a two-cycle delay for "taken" branches. Instructions can be delivered to the pipeline from one of three sources: the line buffer, the I-cache and the DFC (decoded filter cache). There are three ways to determine where to fetch instructions from:
• Serial: the sources are accessed one by one in a fixed order;
• Parallel: all the sources are accessed in parallel;
• Predictive: the access order can be serial with a flexible order, or parallel, based on a prediction.
Serial access results in minimal power, because the most power-efficient source is always accessed first, but it also results in the highest performance degradation, because every miss in the first accessed source generates a bubble in the pipeline.
On the other hand, parallel access causes no performance degradation, but the I-cache is always accessed and there are no power savings in instruction fetch. Predictive access, if accurate, can have both the power efficiency of serial access and the low performance degradation of parallel access; it is therefore the approach adopted here. As shown in the pipeline figures above, a predictor decides which source to access first based on the current fetch address. Another function of the predictor is pipeline gating. Suppose a DFC hit is predicted for the next fetch at cycle N. The fetch stage is disabled at cycle N and the decoded instruction is sent from the DFC to latch 5. Then, at cycle N+1, the decode stage is disabled and the decoded instruction is sent from latch 5 to latch 2. If an instruction is fetched from the I-cache, the hit cache line is also sent to the line buffer. The line buffer can then provide instructions for subsequent fetches to the same line.
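A very small model of the predictive access order might look like the Python sketch below. The predictor here is simply a table that remembers which source served each fetch address the previous time; that policy, and the source names, are assumptions made for illustration.

    # Toy predictor for choosing the first fetch source (line buffer, DFC or I-cache).
    # last_source remembers where each address was found previously (assumed policy).
    last_source = {}

    def choose_source(address):
        # Predict the source that worked last time; default to the I-cache.
        return last_source.get(address, "icache")

    def record_outcome(address, source):
        # Train the predictor once the fetch has been resolved.
        last_source[address] = source

If the prediction is right, only the cheap source is accessed; if it is wrong, the other sources are probed afterwards, which is where the small performance penalty of predictive access comes from.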
2.2 BRANCH PREDICTION & CONFIDENCE ESTIMATION
– A BRIEF OVERVIEW
2.2.1 Branch Prediction
Branch prediction is an important technique to increase
parallelism in the CPU, by predicting the outcome of a
conditional branch instruction as soon as it is decoded.
Successful branch prediction mechanisms take advantage of
the non-random nature of branch behavior. Most branches
are either mostly taken or mostly not taken in the course of program execution.
The commonly used branch predictors are:
1. Bimodal branch predictor.
The bimodal branch predictor uses a counter for determining the prediction. Each time a branch is taken, the counter is incremented by one, and each time it falls through, it is decremented by one. The prediction is made by looking at the value of the counter: if it is less than a threshold value, the branch is predicted as not taken; otherwise, it is predicted as taken.
Figure 6
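The counter described above is usually realized as a small saturating counter kept per branch; the Python sketch below assumes a two-bit counter with the threshold in the middle of its range, purely for illustration.

    # Minimal bimodal predictor: one saturating counter per branch address.
    # A 2-bit counter is assumed; values 0-1 predict "not taken", 2-3 predict "taken".
    counters = {}

    def predict(branch_pc):
        return counters.get(branch_pc, 1) >= 2          # threshold in the middle of the range

    def update(branch_pc, taken):
        c = counters.get(branch_pc, 1)
        counters[branch_pc] = min(3, c + 1) if taken else max(0, c - 1)

Saturating at a small maximum keeps a long run of taken outcomes from building up unbounded history, so the predictor can still adapt after only a couple of not-taken outcomes.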
2. Global branch predictor
The global branch predictor considers the past behavior of other branches, as well as that of the current branch, in order to predict the behavior of the current branch.
2.2.2 Confidence Estimation
The relatively new concept of confidence estimation has been introduced to keep track of the quality of branch prediction. Confidence estimators are hardware mechanisms that are accessed in parallel with the branch predictors when a branch is decoded, and they are modified when the branch is resolved. They characterize a branch as 'high confidence' or 'low confidence' depending upon how accurately the branch predictor has been predicting that particular branch. If the branch predictor has predicted a branch correctly most of the time, the confidence estimator designates the prediction as 'high confidence'; otherwise, as 'low confidence'.
CHAPTER THREE
3.0 WHAT IS A DYNAMIC CACHE MANAGEMENT TECHNIQUE?
A dynamic cache management technique is a strategy for managing the cache memory of a high-performance processor at run time, while the code executes. The instruction memory hierarchy is the mechanism that supplies the instruction stream to the pipeline, and it accounts for a significant fraction of a chip's transistors; extrapolating from current trends, this is likely to remain true in the near future. Because this part of the chip is utilized so heavily, a dynamic management technique regulates how it is accessed, for example by deciding when the small L0-cache is used instead of the larger I-cache, so that the high utilization does not translate into unnecessarily high energy dissipation.
3.1 BASIC IDEA OF THE DYNAMIC MANAGEMENT
SCHEME
The dynamic scheme for the L0-cache should be able to select the most frequently executed basic blocks for placement in the L0-cache. It should also rely on existing mechanisms, without much hardware investment, if it is to be attractive for energy reduction.
Branch prediction, in conjunction with confidence estimation, provides a reliable solution to this problem.
Unusual Behavior of the Branches
A branch that was predicted 'taken' with 'high confidence' is expected to be taken during program execution. If it is not taken, it is assumed to be behaving 'unusually'.
The basic idea is that, if a branch behaves 'unusually', the dynamic scheme disables L0-cache access for the subsequent basic blocks. Under this scheme, only basic blocks that are executed frequently tend to make it to the L0-cache, hence avoiding cache pollution problems in the L0-cache.
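Putting the two mechanisms together, the 'unusual behavior' test reduces to a single check. The Python sketch below is a minimal illustration using the predictor and confidence labels outlined in Chapter Two; the function and argument names are chosen here, not taken from the report.

    # A branch behaves "unusually" when a high-confidence prediction turns out wrong.
    def behaves_unusually(predicted_taken, actually_taken, confidence_level):
        mispredicted = predicted_taken != actually_taken
        return confidence_level == "high" and mispredicted

    # When a branch behaves unusually, the dynamic scheme disables L0-cache access for
    # the following basic blocks, so only frequently executed blocks reach the L0-cache.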
3.2 DYNAMIC TECHNIQUES FOR L0-CACHE
MANAGEMENT
The dynamic techniques discussed in the following sections select the basic blocks to be placed in the L0-cache. There are five techniques for the management of the L0-cache:
1. Simple Method.
2. Static Method.
3. Dynamic Confidence Estimation Method.
4. Restrictive Dynamic Confidence Estimation Method.
5. Dynamic Distance Estimation Method.
Different dynamic techniques trade off energy reduction against performance degradation.
3.3 SIMPLE METHOD
The confidence estimation mechanism is not used in the simple method. The branch predictor can be used as a stand-alone mechanism to provide insight into which portions of the code are frequently executed and which are not. A mispredicted branch is assumed to drive the thread of execution to an infrequently executed part of the program.
The strategy used for selecting the basic blocks is as follows: if a branch is mispredicted, the machine will access the I-cache to fetch the instructions; if a branch is predicted correctly, the machine will access the L0-cache. On a misprediction, the pipeline is flushed and the machine starts fetching instructions from the correct address by accessing the I-cache. The energy dissipation and the execution time of the original configuration, which uses no L0-cache, are taken as unity, and everything is normalized with respect to them.
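In code form, the selection rule of the simple method is a single condition on the outcome of the last branch. The Python sketch below is only an illustration, with names chosen here:

    # Simple method: use the L0-cache only while branches are being predicted correctly.
    def fetch_from_l0(last_branch_mispredicted):
        # A misprediction sends the machine to the I-cache for the following basic block(s).
        return not last_branch_mispredicted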
3.4 STATIC METHOD
The selection criteria adopted for the basic blocks are:
If a 'high confidence' branch was predicted incorrectly, the I-cache is accessed for the subsequent basic blocks.
If more than n 'low confidence' branches have been decoded in a row, the I-cache is accessed.
The L0-cache is therefore bypassed when either of the two conditions is satisfied; in any other case the machine will access the L0-cache.
The first rule for accessing the I-cache is due to the fact that a mispredicted 'high confidence' branch behaves 'unusually' and drives the program to an infrequently executed portion of the code. The second rule is due to the fact that a series of 'low confidence' branches will also suffer from the same problem, since the probability that they are all predicted correctly is low.
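The two rules of the static method can be written compactly as below. The threshold n on consecutive 'low confidence' branches is left as a parameter, since the report does not fix a value; the Python sketch is illustrative only.

    # Static method: bypass the L0-cache after an "unusually" behaving branch or after
    # a long run of low-confidence branches. n is an assumed, tunable threshold.
    def access_l0(high_conf_branch_mispredicted, consecutive_low_conf_branches, n):
        bypass = high_conf_branch_mispredicted or consecutive_low_conf_branches > n
        return not bypass      # True means: fetch the next basic blocks from the L0-cache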
3.5 DYNAMIC CONFIDENCE ESTIMATION METHOD
The dynamic confidence estimation method is a dynamic version of the static method: the confidence of each branch is now estimated dynamically, at run time. The I-cache is accessed if a 'high confidence' branch is mispredicted, or if more than n successive 'low confidence' branches are encountered.
The dynamic confidence estimation mechanism is slightly better in terms of energy reduction than the simple or static methods. Since the confidence estimator can adapt dynamically to the temporal behavior of the code, it is more accurate in characterizing a branch and, therefore, in regulating access to the L0-cache.
3.6 RESTRICTIVE DYNAMIC CONFIDENCE ESTIMATION
METHOD
The methods described in the previous sections tend to place a large number of basic blocks in the L0-cache, thus degrading performance. The restrictive dynamic scheme is a more selective scheme in which only the really important basic blocks are selected for the L0-cache.
The selection mechanism is slightly modified as follows: the L0-cache is accessed only if a 'high confidence' branch is predicted correctly; the I-cache is accessed in any other case.
This method selects some of the most frequently
executed basic blocks, yet it misses some others. It has
much lower performance degradation, at the expense of
lower energy reduction. It is probably preferable in a
system where performance is more important than
energy.
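The restrictive variant inverts the default: the L0-cache is used only when there is positive evidence that execution is on a frequently taken path. A minimal Python sketch, with names chosen here for illustration:

    # Restrictive dynamic confidence estimation: the L0-cache is accessed only after a
    # correctly predicted high-confidence branch; every other case goes to the I-cache.
    def access_l0(confidence_level, prediction_was_correct):
        return confidence_level == "high" and prediction_was_correct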
3.7 DYNAMIC DISTANCE ESTIMATION METHOD
The dynamic distance estimation method is based on the fact that a mispredicted branch often triggers a series of successive mispredicted branches. The method works as follows: branches that follow shortly after a mispredicted branch are tagged as 'low confidence', otherwise they are tagged as 'high confidence', and the basic blocks after a 'low confidence' branch are fetched from the I-cache. The net effect is that a branch misprediction causes a series of fetches from the I-cache.
A counter is used to measure the distance of a branch from the previous mispredicted branch. This scheme is even more selective in storing instructions in the L0-cache than the restrictive dynamic confidence estimation method.
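A sketch of the distance-based rule is shown below in Python. The distance threshold is an assumed parameter; the report only states that a counter measures the distance from the last mispredicted branch.

    # Dynamic distance estimation: count branches since the last misprediction and treat
    # basic blocks close (in branches) to a misprediction as I-cache material.
    DISTANCE_THRESHOLD = 4                      # assumed value, left as a design knob
    branches_since_mispredict = DISTANCE_THRESHOLD

    def on_branch_resolved(mispredicted):
        global branches_since_mispredict
        branches_since_mispredict = 0 if mispredicted else branches_since_mispredict + 1

    def access_l0():
        # Far enough from the last misprediction: the code is assumed hot, use the L0-cache.
        return branches_since_mispredict >= DISTANCE_THRESHOLD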
3.8 COMPARISON OF DYNAMIC TECHNIQUES
The energy reduction and the delay increase are a function of the algorithm used for the regulation of L0-cache accesses, the size of the L0-cache, its block size and its associativity. For example, a larger block size gives a larger hit ratio in the L0-cache. This results in a smaller performance overhead and better energy efficiency, since the I-cache does not need to be accessed so often.
On the other hand, if the block size increase does not have a large impact on the hit ratio, the energy dissipation may go up, since a cache with a larger block size is less energy efficient than a cache of the same size with a smaller block size.
The static method and the dynamic confidence estimation method make the assumption that the less frequently executed basic blocks usually follow less predictable branches, i.e., branches that are mispredicted.
The simple method and the restrictive dynamic confidence estimation method address the problem from another angle. They make the assumption that the most frequently executed basic blocks usually follow highly predictable branches.
The dynamic distance estimation method is the most successful in reducing the performance overhead, but the least successful in energy reduction. This method has a stricter requirement for a basic block to be selected for the L0-cache than the original dynamic confidence estimation method.
A larger block size and higher associativity will have a beneficial effect on both energy and performance. The hit rate of a small cache is more sensitive to variations in the block size and the associativity.
CHAPTER FOUR
4.0 CONCLUSION
This report has presented methods for the dynamic selection of basic blocks for placement in the L0-cache. It explained the functionality of the branch prediction and confidence estimation mechanisms in high-performance processors. Finally, five different dynamic techniques were discussed for the selection of the basic blocks. These techniques try to capture the execution profile of the basic blocks by using the branch statistics that are gathered by the branch predictor.
The experimental evaluation reported in the literature demonstrates the applicability of the dynamic techniques for the management of the L0-cache. Different techniques can trade off energy against delay by regulating the way the L0-cache is accessed.
REFERENCES
[1] J. Diguet, S. Wuytack, F. Catthoor, and H. De Man, "Formalized methodology for data reuse exploration in hierarchical memory mappings," in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 30-35, Aug. 1997.
[2] J. Kin, M. Gupta, and W. Mangione-Smith, "The filter cache: An energy efficient memory structure," in Proceedings of the International Symposium on Microarchitecture, pp. 184-193, Dec. 1997.
[3] N. Bellas, I. Hajj, C. Polychronopoulos, and G. Stamoulis, "Architectural and compiler support for energy reduction in the memory hierarchy of high performance microprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 70-75, Aug. 1998.
[4] S. Manne, D. Grunwald, and A. Klauser, "Pipeline gating: Speculation control for energy reduction," in Proceedings of the International Symposium on Computer Architecture, pp. 132-141, 1998.
[5] SpeedShop User's Guide. Silicon Graphics, Inc., 1996.
[6] S. Wilton and N. Jouppi, "An enhanced access and cycle time model for on-chip caches," Tech. Rep. 93/5, DEC Western Research Laboratory, July 1994.
[7] N. Bellas, I. Hajj, and C. Polychronopoulos, "Using dynamic cache management techniques to reduce energy in a high-performance processor," Department of Electrical & Computer Engineering and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main Street, Urbana, IL 61801, USA.