UNIVERSITY OF CALIFORNIA
Irvine
Scalable Hardware Mechanisms for Superscalar Processors
A dissertation submitted in partial satisfaction of the
requirements for the degree Doctor of Philosophy
in Electrical and Computer Engineering
by
Steven Daniel Wallace
Committee in charge:
Professor Nader Bagherzadeh, Chair
Professor Nikil Dutt
Professor Fadi Kurdahi
1997
© 1997
STEVEN DANIEL WALLACE
ALL RIGHTS RESERVED
The dissertation of Steven Daniel Wallace is approved
and is acceptable in quality and form for
publication on microfilm:
Committee Chair
University of California, Irvine
1997
Dedication
To my parents, for their never-ending love and support.
Contents
List of Figures . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . ix
Table of Symbols . . . . . . . . . . . . . . . . . . . . x
Acknowledgements . . . . . . . . . . . . . . . . . . . . xii
Curriculum Vitae . . . . . . . . . . . . . . . . . . . . xiii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . xv
Chapter 1 Introduction . . . 1
  1.1 Contributions . . . 7
  1.2 Organization . . . 8
Chapter 2 Background . . . 9
  2.1 Instruction Fetch Problem . . . 9
    2.1.1 Fetching Limitation . . . 11
  2.2 Software Techniques . . . 13
  2.3 Dynamic Branch Prediction . . . 14
  2.4 Instruction Fetch Prediction . . . 19
  2.5 Multiple Block Fetching . . . 21
  2.6 Register Renaming . . . 24
    2.6.1 Recovery . . . 25
  2.7 Register File Complexity . . . 26
Chapter 3 Experimental Methodology . . . 28
  3.1 Simulation Tools . . . 29
  3.2 SPEC95 Benchmarks . . . 30
    3.2.1 Program Attributes . . . 30
  3.3 Machine Model . . . 32
    3.3.1 Default Configuration . . . 35
  3.4 Performance Metrics . . . 36
    3.4.1 Instructions Fetched Per Cycle (IFPC) . . . 36
    3.4.2 Branch Execution Penalty (BEP) . . . 37
    3.4.3 Effective Instruction Fetch Rate (IPC_f) . . . 37
    3.4.4 Instructions Per Cycle (IPC) . . . 38
Chapter 4 Instruction Fetching Mechanisms . . . 39
  4.1 Fetching Model . . . 39
  4.2 Hardware Techniques . . . 41
    4.2.1 Simple Cache . . . 41
    4.2.2 Extended Cache . . . 42
    4.2.3 Self-Aligned Cache . . . 43
    4.2.4 Prefetching . . . 44
    4.2.5 Dual Branch Target Buffer . . . 45
  4.3 Expected Instruction Fetch . . . 48
    4.3.1 Simple Cache . . . 48
    4.3.2 Extended Cache . . . 49
    4.3.3 Self-Aligned Cache . . . 50
    4.3.4 Prefetching . . . 50
    4.3.5 Dual Block Fetching . . . 52
    4.3.6 Evaluation . . . 53
  4.4 Results and Discussion . . . 60
Chapter 5 Multiple Branch and Block Prediction . . . 66
  5.1 Multiple Branch Prediction . . . 67
  5.2 Dual Block Prediction . . . 76
    5.2.1 Single Selection . . . 77
    5.2.2 Double Selection . . . 81
    5.2.3 Misprediction . . . 82
  5.3 Performance . . . 87
    5.3.1 Conditional Branch Accuracy . . . 89
    5.3.2 Block Information Type . . . 91
    5.3.3 Single vs. Double Selection . . . 91
    5.3.4 Target Arrays . . . 94
    5.3.5 Instruction Cache Configurations . . . 94
    5.3.6 Prefetching . . . 97
  5.4 Multiple Block Prediction . . . 102
  5.5 Cost Estimates . . . 105
    5.5.1 Storage . . . 105
    5.5.2 Timing . . . 109
Chapter 6 Scalable Register File . . . 117
  6.1 Register Renaming . . . 117
    6.1.1 Recovery . . . 118
    6.1.2 CAM/Table Hybrid . . . 119
    6.1.3 Intrablock Decoding . . . 121
  6.2 Register File Utilization . . . 123
  6.3 Dynamic Result Renaming . . . 124
    6.3.1 Deadlocks . . . 129
  6.4 Implementation . . . 130
    6.4.1 Source Operand Renaming . . . 130
    6.4.2 Destination Operand Renaming . . . 133
  6.5 Performance . . . 136
Chapter 7 Conclusion . . . 143
Chapter 8 Future Directions . . . 148
Bibliography . . . 151
List of Figures
1.1 Current Superscalar Cost and Performance Trends . . . 5
1.2 Superscalar Cost and Performance Goals . . . 5
2.1 Pipeline Stages of a Superscalar Processor . . . 10
2.2 Simple Fetching Example . . . 11
2.3 Pattern History Table and 2-bit Counter State Diagram . . . 15
2.4 2-Level Adaptive Branch Prediction . . . 16
2.5 Global History Adaptive Branch Prediction . . . 17
2.6 Per-Addr History Adaptive Branch Prediction . . . 18
2.7 Block Diagram Schematic of the NLS Architecture . . . 20
2.8 Multiple Global Adaptive Branch Prediction . . . 22
2.9 Branch Address Tree and Cache Mapping . . . 23
2.10 Two-block Ahead Branch Prediction . . . 23
2.11 Block Diagram of Renaming Logic . . . 25
4.1 Fetching Block Diagram . . . 40
4.2 Extended Fetching Example . . . 43
4.3 Self-Aligned Fetching Example . . . 44
4.4 Prefetch Example . . . 45
4.5 Block Diagram of Dual Branch Target Buffer Entry . . . 47
4.6 Dual Branch Target Buffer Example . . . 48
4.7 Prefetch Buffer State Diagram . . . 51
4.8 Expected Instruction Fetch without Prefetching . . . 54
4.9 Self-Aligned Expected Instruction Fetch with Prefetching (n � �) . . . 55
4.10 Self-Aligned Expected Instruction Fetch with Prefetching (n � �) . . . 56
4.11 Simple Expected Instruction Fetch with Prefetching . . . 57
4.12 Different Cache Techniques with Prefetching . . . 58
4.13 Different Cache Techniques for Dual Block Fetching with Prefetching . . . 59
5.1 Multiple Global Adaptive Branch Prediction Example . . . 69
5.2 Multiple Branch Prediction with Blocked PHT Example . . . 70
5.3 Block Diagram of a Multiple Branch Prediction Fetching Mechanism . . . 72
5.4 Branch Selection Logic . . . 74
5.5 Block Diagram for Dual Block Prediction . . . 79
5.6 Pipeline Stage Diagram for Dual Block Prediction . . . 80
5.7 Block Diagram for Dual Block Prediction Using Double Selection . . . 83
5.8 Pipeline Stage Diagram for Dual Block Prediction Using Double Selection . . . 84
5.9 Branch Misprediction Rate and Improvement . . . 90
5.10 Block Information Type Penalty and Performance . . . 92
5.11 Single and Double Selection Performance . . . 93
5.12 Branch Execution Penalties for Dual Block, Single Selection . . . 98
5.13 Branch Execution Penalties for Dual Block, Double Selection . . . 100
5.14 Predicting Multiple Blocks . . . 103
5.15 Effective Instruction Fetch for Different Block Prediction Capability . . . 104
5.16 Hardware Storage Cost of Prediction for Different Cache Sizes . . . 108
5.17 Hardware Storage Cost of Dual Block Prediction for Single and Double Selection . . . 110
5.18 Timing Chart for 8 KB Instruction Cache Using Dual Block Prediction with Single Selection . . . 112
5.19 Timing Chart for Pipelined 32 KB Instruction Cache Using Dual Block Prediction with Double Selection . . . 116
6.1 Block Diagram of Hybrid Renaming . . . 120
6.2 Dependence Distance for SDSP/SPARC . . . 122
6.3 Block Diagram of Scalable Register File . . . 126
6.4 Register File Performance Comparison for a 4-way Superscalar . . . 137
6.5 Register File Performance Comparison for an 8-way Superscalar . . . 138
6.6 BIPS and Cycle Time Performance Comparison for a 4-way Superscalar . . . 141
6.7 BIPS and Cycle Time Performance Comparison for an 8-way Superscalar . . . 142
List of Tables
3.1 Description of SPEC95 Applications . . . 31
3.2 Branch Attributes of SPEC95 Applications . . . 33
3.3 Functional Unit Quantity, Type, and Latency . . . 34
4.1 Expected Instruction Fetch . . . 53
4.2 Instructions Fetched per Cycle (n � �) . . . 61
4.3 Instructions Fetched per Cycle with Prefetching (n � �) . . . 62
4.4 Instructions Fetched per Cycle with Prefetching (n � �) . . . 63
4.5 IPB and IFPC for Dual Block Fetching with Prefetching . . . 65
5.1 Block Information Types and Prediction Sources . . . 73
5.2 Next Line Prediction Example Based on Starting Position . . . 75
5.3 Misprediction Penalties . . . 85
5.4 Bad Branch Recovery Entry . . . 86
5.5 Indirect and Immediate Misfetch Penalty Comparison for Different Target Array Configurations . . . 95
5.6 IPB and IPC_f for Different Cache Types . . . 96
5.7 BEP Distribution, IPB, and IPC_f for Dual Block, Single Selection . . . 99
5.8 BEP Distribution, IPB, and IPC_f for Dual Block, Double Selection . . . 101
5.9 Two-block Prediction with Prefetching for Different Decode Sizes . . . 102
5.10 Simplified Hardware Cost Estimates . . . 106
5.11 Access Time Estimates (ns) . . . 111
6.1 Bad Branch Penalty and Performance . . . 121
6.2 Average Register File Utilization per Cycle . . . 124
6.3 Read Operand Category Distribution (%) . . . 128
6.4 CAM Lookup Fields . . . 131
6.5 Mapping Table Fields . . . 132
6.6 Recovery List Fields . . . 134
Table of Symbols
Parameters
L     Number of logical registers
N     Order of superscalar – the maximum issue rate
P     Number of ports in a RAM cell
R     Number of physical registers
S     Maximum number of speculative instructions allowed in pipeline
b     Probability an instruction transfers control
m     Extended cache line size
n     Maximum number of instructions in a decode block
p     Size of prefetch buffer
q     Maximum number of instructions in a fetch block

Functions
E_i   Probability the starting address in the block is at position i
F     Expected instruction fetch per cycle
I_i   Probability exactly i instructions are fetched
L_i   Probability a control transfer occurs at position i in a block
P_i   Probability the prefetch buffer contains i instructions
c     Probability of a control transfer in a block
r     Expected block run length

Acronyms
ALU   Arithmetic logical unit
BAC   Branch address cache
BBR   Bad branch recovery
BEP   Branch execution penalty
BHR   Branch history register
BIPS  Billions of instructions per second
BIT   Block information type
BTB   Branch target buffer
CAM   Content addressable memory
DBTB  Dual branch target buffer
DS    Double selection
FIFO  First-in first-out
GHR   Global history register
IFPC  Instructions fetched per cycle
IPB   Instructions per block
IPC   Instructions per cycle
IPC_f Effective instruction fetching rate
IPFQ  Instructions per fetch request
IW    Instruction window
LRU   Least recently used
NLS   Next line and set
PC    Program counter
PHT   Pattern history table
RAM   Random access memory
RAS   Return address stack
RF    Register file
SMT   Simultaneous multithreading
SS    Single selection
ST    Select table
Acknowledgements
I give special thanks to Professor Nader Bagherzadeh, my advisor, for his guidance
and support.
I thank my dissertation committee members, Professor Nikil Dutt and Professor Fadi
Kurdahi, for reading and evaluating my dissertation work.
I thank all members of the computer architecture research group. I enjoyed working
with fellow graduate students from the Advanced Computer Architecture Laboratory and
the Fault-Tolerant Multicomputer Laboratory: Marcelo Moraes de Azevedo, Bill Brown,
Nirav Dagli, Manu Gulati, Joao Lacerda, Hung Liu, Nayla Nassif, Jesse Pan, Brian Park,
Mark Pontius, Simin Shoari, Koji Suginuma, and Honge Wang.
I also appreciate the financial support I received from the Chancellor’s Fellowship and a
research assistantship.
Curriculum Vitae
Steven Daniel Wallace
1969       Born in Burbank, California
1988       Graduate of Woodbridge High School
1992       B.S. in Electrical Engineering (magna cum laude),
           B.S. in Information and Computer Science (cum laude),
           Minor in Applied Mathematics,
           University of California, Irvine
1992       Chancellor’s Fellowship, University of California, Irvine
1993       M.S. in Engineering, University of California, Irvine
1993–1996  Research Assistant, Department of Electrical and Computer Engineering,
           University of California, Irvine
1996       Ph.D. in Electrical and Computer Engineering, with concentration in
           Computer Engineering, University of California, Irvine
           Dissertation: Scalable Hardware Mechanisms for Superscalar Processors
Publications
S. Wallace and N. Bagherzadeh, “Multiple Block and Branch Prediction,” Third International Symposium on High-Performance Computer Architecture, February 1997.

S. Wallace and N. Bagherzadeh, “A Scalable Register File Architecture for Dynamically Scheduled Processors,” 1996 Conference on Parallel Architectures and Compilation Techniques, October 1996.

S. Wallace and N. Bagherzadeh, “Instruction Fetching Mechanisms for Superscalar Microprocessors,” Euro-Par ’96, August 1996.

S. Wallace and N. Bagherzadeh, “Performance Issues of a Superscalar Microprocessor,” Microprocessors and Microsystems, May 1995.

S. Wallace and N. Bagherzadeh, “Performance Issues of a Superscalar Microprocessor,” 23rd International Conference on Parallel Processing, August 1994.

S. Wallace, “Performance Analysis of a Superscalar Architecture,” Master’s thesis, University of California, Irvine, September 1993.
Superscalar Design Papers
S. Wallace, N. Dagli, and N. Bagherzadeh, “Design and Implementation of a 100 MHz Centralized Instruction Window for a Superscalar Microprocessor,” 1995 International Conference on Computer Design, October 1995.

S. Wallace, N. Dagli, and N. Bagherzadeh, “Design and Implementation of a 100 MHz Reorder Buffer,” 37th Midwest Symposium on Circuits and Systems, August 1994.

J. Lenell, S. Wallace, and N. Bagherzadeh, “A 20 MHz CMOS Reorder Buffer for a Superscalar Microprocessor,” 4th NASA Symposium on VLSI Design, October 1992.
Physics Papers
K. Moe, M. Moe, and S. Wallace, “Drag Coefficients of Spheres in Free-molecular Flow,” 1996 AAS/AIAA Space Flight Mechanics Meeting, Austin, Texas, February 1996.

M. Moe, S. Wallace, and K. Moe, “Recommended Drag Coefficients for Aeronomic Satellites,” The Upper Mesosphere and Lower Thermosphere: A Review of Experiment and Theory, Geophysical Monograph 87, American Geophysical Union, pp. 349–356, 1995.

M. Moe, S. Wallace, and K. Moe, “Refinements in Determining Satellite Drag Coefficients: Method for Resolving Density Discrepancies,” Journal of Guidance, Control, and Dynamics, June 1993.

M. Moe, S. Wallace, and K. Moe, “Recommended Drag Coefficients for Aeronomic Satellites,” Chapman Conference, Asilomar, November 1992.
Abstract of the Dissertation
Scalable Hardware Mechanisms for Superscalar Processors
by
Steven Daniel Wallace
Doctor of Philosophy in Electrical and Computer Engineering
University of California, Irvine, 1997
Professor Nader Bagherzadeh, Chair
Superscalar processors fetch and execute multiple instructions per cycle. As more
instructions can be executed per cycle, an accurate and high bandwidth instruction fetching
mechanism becomes increasingly important to performance. This dissertation describes
and analyzes instruction fetching mechanisms using three different cache types: a simple
cache, an extended cache, and a self-aligned cache. A mathematical model is developed for
each cache technique, and performance is evaluated both in theory and in simulation using
the SPEC95 suite of benchmarks. In all of the techniques, the fetching performance is dramatically lower than ideal expectations. Prefetching can be used to increase performance. Nevertheless, single block fetching performance is fundamentally limited by control transfers. Thus, to overcome this limitation, multiple blocks must be fetched in a single cycle.
Accurate branch prediction and instruction fetch prediction are also critical to achieving high performance in a microprocessor. In order to achieve a high fetching rate for wide-issue superscalar processors, a scalable method to predict multiple branches per block of sequential instructions is presented. Its accuracy is equivalent to that of scalar two-level adaptive prediction. Also, to overcome the limitation imposed by control transfers, a scalable method to predict multiple blocks is introduced. Results demonstrate that a two-block, multiple branch prediction mechanism with a block width of eight instructions achieves an effective fetching rate of eight instructions per cycle.
A major obstacle in designing superscalar processors is the size and port requirement
of the register file. Multiple scalar register files can be used if results are renamed when
they are written to the register file. Consequently, a scalable register file architecture can
be implemented without performance degradation. Another benefit is that the cycle time of
the register file is significantly shortened, potentially producing a tremendous increase in
the speed of the processor.
Chapter 1
Introduction
The goal of a superscalar microprocessor is to execute multiple instructions per cycle. Instruction-level parallelism (ILP) available in programs can be exploited to realize this goal [19]. Depending on the type of programs and the assumptions used, researchers have shown that available parallelism ranges anywhere from 4 to 90 [22, 32, 28, 41]. The potential speedup from program parallelism may not be realized if the processor is unable to take full advantage of it. Instruction fetch efficiency, branch prediction, instruction and data caches, resource allocation, decode width, and issue width are hardware factors that determine the overall performance of a processor. Instruction fetch efficiency has been shown to be the greatest factor limiting speedup [43].
For example, an 8-way superscalar processor with simple fetching hardware and perfect branch prediction can expect to fetch fewer than four instructions per cycle. This alone accounts for over 50% of the loss in potential speedup, independent of any other performance issues. Thus, performance is severely reduced even if the ILP in the program and the execution pipeline would otherwise allow eight instructions to be executed per cycle.
The underlying problem in fetching instructions using a control flow architecture
is branches. To begin with, conditional branches create uncertainty in the flow of control,
which can cause severe performance penalties if not accurately predicted. Even with perfect
dynamic branch prediction, conditional and unconditional branches disrupt the sequential
addressing of instructions. The non-sequential accessing of instructions causes difficulty
with fetching instructions in hardware. As a result, the instruction fetcher restricts the
amount of concurrency available to the processor [37].
Branch prediction foretells the outcome of conditional branch instructions. It pre-
dicts the direction of a conditional branch, taken or not taken. Also, the target addresses
for indirect branches and return instructions need to be predicted because they are usually
determined late in the pipeline. Without branch prediction, when a branch instruction is
encountered, instruction fetching must stall until its direction is calculated before proceeding to the next instruction. Using branch prediction, however, a processor can continue fetching and speculatively execute instructions past this branch. If the prediction is incorrect, then all the speculative work must be nullified and instruction fetching restarted at
the correct address. The resulting pipeline bubbles are called the misprediction penalty. In
addition to predicting a branch’s direction, the target addresses of taken branches need to be
predicted through instruction fetch prediction. Instruction fetch prediction predicts which
instructions to fetch from the instruction cache when there is a branch [8].
After it has been determined which instructions to fetch, the instruction fetch mechanism reads the instructions from the instruction cache and delivers them to the decoder.
The instruction fetch mechanism may not be able to fetch all the desired instructions. This
limits the instruction fetch efficiency and overall performance. Potential parallelism from
ILP cannot be utilized when instructions are not delivered for decoding and execution at a
sufficient rate. A high performance fetching mechanism is required.
Extensive research in branch prediction and instruction fetch prediction has been carried out for scalar processors. Yeh introduced two-level adaptive branch prediction, which uses the history of previous branches to index into a Pattern History Table (PHT); he reports a 97% branch prediction accuracy [53]. Calder proposed next line and set (NLS) prediction, which predicts the next instruction cache line and set to fetch [8]. Both the PHT and NLS provide excellent branch prediction and instruction fetch prediction for scalar processors, but it is not clear how to scale these techniques to a superscalar processor.
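To make the scalar scheme concrete, the sketch below models a global-history two-level predictor: a branch history register indexes a PHT of 2-bit saturating counters. The class and parameter names are illustrative, not taken from Yeh's design.

```python
# Sketch of scalar two-level adaptive branch prediction (illustrative only):
# a global history register (GHR) of recent branch outcomes indexes a pattern
# history table (PHT) of 2-bit saturating counters.

class TwoLevelPredictor:
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.ghr = 0                          # global history register
        self.pht = [1] * (1 << history_bits)  # 2-bit counters, weakly not-taken

    def predict(self):
        # Counter values 2 and 3 mean "predict taken".
        return self.pht[self.ghr] >= 2

    def update(self, taken):
        ctr = self.pht[self.ghr]
        # Saturating increment/decrement of the selected 2-bit counter.
        self.pht[self.ghr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        # Shift the actual outcome into the history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.history_bits) - 1)

p = TwoLevelPredictor()
for _ in range(20):
    p.update(True)        # train on a strongly biased branch
print(p.predict())        # True: the all-taken pattern is now predicted taken
```

Because a single prediction is made per lookup, one such structure serves a scalar fetch unit; the scaling difficulty noted above is making several such predictions per cycle for one block.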
The register file is a design obstacle for superscalar microprocessors. If N instructions can be issued in a cycle, then a superscalar microprocessor’s register file needs 2N read ports and N write ports to handle the worst-case scenario. The area complexity of the register file grows proportionally to N² [10]. Therefore, a new architecture is needed to keep the number of ports per register file cell constant as N increases. In addition, the register requirements for high performance and exception handling can be quite high. Farkas et al. conclude that for best performance, 160 registers are needed for a four-way issue machine, and 256 registers are needed for an eight-way issue machine [15]. Therefore, it is desirable to reduce the register requirements while still maintaining performance.
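The quadratic growth can be illustrated with rough arithmetic. Assuming a RAM cell's area grows with the square of its port count (a common first-order approximation, not the exact cost model of [10]), a file of R registers with P ports occupies roughly R·P² cell-areas:

```python
# Back-of-the-envelope register file area model (an assumed first-order
# approximation): cell area ~ P**2 for P ports, so total area ~ R * P**2.

def ports(n_issue):
    # Worst case for an N-way issue machine: 2N read ports + N write ports.
    return 2 * n_issue + n_issue

def relative_area(regs, n_ports):
    # Area relative to a scalar register file: 32 registers, 2R/1W = 3 ports.
    scalar = 32 * 3 ** 2
    return regs * n_ports ** 2 / scalar

# Register counts from Farkas et al., cited above:
print(round(relative_area(160, ports(4))))   # 4-way, 160 regs: 80x scalar
print(round(relative_area(256, ports(8))))   # 8-way, 256 regs: 512x scalar
```

Under the same assumed model, a 356-register file with 16 read and 8 write ports (24 total) comes out to about 712 times the scalar file, consistent with the "over seven hundred times" estimate quoted below for the SMT register file.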
A major difficulty in the simultaneous multithreading (SMT) architecture, introduced
by Tullsen et al., was the size of the register file [40]. They supported eight threads on an
eight-way issue machine, using 356 total registers with 16 read ports and 8 write ports.
Compared to a standard 32-register, 2-read-port, 1-write-port register file of a scalar processor, the area of the SMT register file is estimated to be over seven hundred times larger.
To account for the size of the register file, they took two cycles to read registers instead of
one. This underscores the need for a mechanism to scale the register file, yet still have the
benefits of area and access time of a register file for a scalar processor.
Although current superscalar microprocessors attempt to execute more instructions per cycle than previous generations, the increase in performance has not been
substantial and the cost of implementation has greatly increased [38, 52, 1]. This trend
in cost and performance is displayed in Figure 1.1. The area of the processor increases
proportionally to the square of the order of superscalar. In addition, the performance has
dropped off far from ideal speedup. Although ideal speedup is not reasonable, a cost and
performance closer to ideal is desirable, as shown in Figure 1.2. The cost should increase
proportionally to the order of superscalar, N. In addition, the performance should be
able to continue to increase (given enough ILP) as N increases. It should not be limited by
instruction fetching.
This dissertation introduces instruction fetching mechanisms that are scalable in cost
and performance and do not limit instruction fetching. In addition, a scalable organization
of a register file for a superscalar microprocessor is described. The general cost and performance objectives of Figure 1.2 are followed for the hardware mechanisms introduced.
Although every aspect of the instruction fetching mechanisms and register files may not be
scalable, the major storage elements are designed to be scalable yet still retain excellent
performance. The following paragraphs summarize these hardware mechanisms.
In this dissertation, different types of instruction fetching mechanisms are described
and modeled. Three different instruction cache configurations are considered: a simple
cache type, an extended cache type, and a self-aligned cache type. A simple cache uses
a one-to-one mapping between the instruction cache line and the decoder. Unfortunately,
once a control transfer instruction is encountered, remaining instructions from the cache
line must be invalidated. In addition, if the target address of a control transfer instruction is
in the middle of a cache line, previous instructions in that line can not be used. An extended
cache uses a cache line size greater than the maximum number of instructions allowed by
the decoder. This reduces the chance of a limited block size from targets that jump into
[Figure: x-axis “order of superscalar (N)”, 1–10; y-axis “magnitude relative to scalar processor”, 2–20; curves labeled ideal, performance, and size.]
Figure 1.1: Current Superscalar Cost and Performance Trends
[Figure: x-axis “order of superscalar (N)”, 1–10; y-axis “magnitude relative to scalar processor”, 2–20; curves labeled performance, ideal, and size.]
Figure 1.2: Superscalar Cost and Performance Goals
the middle of a cache line. To completely solve the problem with target addresses, a self-aligned cache is presented. Unfortunately, all of these cache types are limited by control
transfers.
To approach the upper bound of single block fetching, a form of prefetching can
be used. Instead of limiting fetching to N instructions from a cache line, this number
is increased and extra instructions are put into a prefetch buffer. As a result, when less
than N instructions are retrieved from a cache line, extra instructions previously fetched
can provide the remaining instructions to deliver N instructions to the decoder. In order
to improve fetching beyond what prefetching can accomplish, two cache lines need to be
fetched per cycle. A new fetching mechanism, called the dual branch target buffer, can
predict the addresses for the next two lines. Hence, a two-block fetching mechanism can
increase fetching capability beyond the limitation of one-block fetching and satisfy the ever-increasing fetching demands of wide-issue superscalar processors.
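The buffering scheme described above can be sketched abstractly. In this minimal model (the function and names are assumptions, not the hardware design), the already-fetched sequential stream lives in a FIFO, and each cycle the decoder drains up to n instructions from it:

```python
# Illustrative sketch of single-block fetching with a prefetch buffer:
# each cycle a run of instructions is read from the cache line (cut short
# by any control transfer) and appended to a FIFO of not-yet-decoded
# instructions, from which the decoder takes up to n.

from collections import deque

def fetch_cycle(line_run, prefetch, n):
    """Deliver up to n instructions to the decoder.

    line_run: instructions read from the cache line this cycle.
    prefetch: FIFO of instructions fetched earlier but not yet decoded.
    On a taken control transfer the buffer would be flushed
    (prefetch.clear()) before fetching from the target address."""
    prefetch.extend(line_run)  # new run joins the tail of the stream
    return [prefetch.popleft() for _ in range(min(n, len(prefetch)))]

buf = deque()
print(len(fetch_cycle(list(range(6)), buf, 4)))  # 4: two extras are buffered
print(len(fetch_cycle(list(range(2)), buf, 4)))  # 4: short run topped up from buffer
```

The second cycle shows the payoff: a run truncated to two instructions still delivers a full decode block because the buffer supplies the shortfall.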
The theory behind the fetching techniques gives insight into fetching problems. For
that reason, a probabilistic model based on the probability of a control transfer is developed
for all fetching technique combinations. Given certain program characteristics and fetching
mechanism parameters, the expected performance can be calculated. To demonstrate the
accuracy of the fetching models, they are evaluated under several different conditions and
compared to simulations.
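As a taste of this style of model, suppose each instruction independently transfers control with probability b. Ignoring line alignment (which the full model accounts for, and which lowers simple-cache results further), the expected number of instructions delivered by a block of width n can be computed directly:

```python
# A minimal sketch of a probabilistic fetch model (a simplification of the
# models developed in Chapter 4; alignment effects are ignored here).
# Assume each instruction independently transfers control with probability b.

def expected_fetch(n, b):
    # i instructions delivered when position i (1-based) holds the first
    # control transfer in the block...
    e = sum(i * b * (1 - b) ** (i - 1) for i in range(1, n + 1))
    # ...and a full block of n delivered when the block has no transfer.
    e += n * (1 - b) ** n
    return e

# With roughly one control transfer per five instructions (b = 0.2),
# widening the decode block shows sharply diminishing returns:
for n in (4, 8, 16):
    print(n, round(expected_fetch(n, 0.2), 2))   # 2.95, 4.16, 4.86
```

The diminishing returns as n grows are exactly the single-block limitation argued above: no matter how wide the decoder, expected fetch saturates near 1/b instructions per cycle.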
Although the dual branch target buffer is able to fetch two blocks per cycle, its instruction fetch accuracy is not as good as that of scalar methods. In order to achieve extremely high fetching rates, it is necessary to accurately predict multiple branches per block. The accuracy of scalar two-level adaptive branch prediction can be retained by using a blocked pattern history table suitable for multiple branch prediction. The difficulty lies in retaining this accuracy while predicting multiple blocks per cycle. Essentially, the solution to this problem is to predict the prediction: the prediction for additional blocks uses the prediction made previously with the same branch history. Hence, high-performance instruction fetching is possible under realistic conditions.
Lastly, the problem of register file scalability is attacked from two directions. First,
by using a multiple-banked register file, it is possible to reduce the port requirement to that
of a scalar processor: 2 read ports and 1 write port. This is accomplished by dynamically
renaming registers at result write time instead of at decode time. Second, by improving
the utilization of the registers, the number of registers may be reduced. Both of these
factors can dramatically reduce the area and cycle time of the register file while maintaining
performance.
1.1 Contributions
This dissertation makes three major contributions:
1. Instruction fetch mechanisms, including three different types of instruction cache
configurations, prefetching, and two-block fetching, are evaluated. A theoretical
model is presented for each type that accurately determines the expected
instruction fetching performance. The limits of single-block fetching are clearly
shown. The potential benefit of a two-block fetch mechanism to break this barrier is
demonstrated by introducing a dual branch target buffer.
2. A method to provide scalable multiple branch and block prediction is introduced. The
multiple branch predictor retains the accuracy of a scalar branch predictor. Multiple
blocks can be predicted in parallel each cycle.
3. A scalable register file architecture is introduced. Multiple scalar register files are
used instead of a large multi-ported register file. The number of registers can be
reduced by increasing the utilization of registers. As a result, the area of the register
file is dramatically reduced. Also, the cycle time of the register file is shortened,
which can significantly increase the performance of a processor.
1.2 Organization
The remainder of this dissertation is organized into six chapters. Chapter 2 dis-
cusses background material related to this dissertation. Chapter 3 explains the experimen-
tal methodology used throughout the dissertation, including simulation tools, performance
metrics, benchmark programs, and benchmark characteristics. Chapter 4 describes differ-
ent instruction fetching mechanisms: three instruction cache techniques, prefetching, and
two block fetching. A theoretical model for each mechanism is presented and expected
fetching performance is compared against simulated results. Chapter 5 introduces multiple
branch per block prediction and multiple block prediction. Chapter 6 describes a scalable
register file architecture. Chapter 7 presents the conclusion of this dissertation. Finally,
Chapter 8 gives insight into future directions concerning instruction fetching mechanisms
and register file architectures.
Chapter 2
Background
This chapter provides background information and related work on instruction fetch-
ing and register files. Many mechanisms have been proposed to improve branch predic-
tion and instruction fetch prediction. These include two-level adaptive branch prediction,
branch target buffer, and next line and set prediction. This chapter describes the instruc-
tion fetch problem, instruction fetch limitation, dynamic branch prediction, multiple block
fetching, register renaming, and register file complexity.
2.1 Instruction Fetch Problem
Branch instructions create two basic problems for instruction fetching. First, a conditional
branch creates uncertainty about which direction will be taken. By the time a branch
is executed and found to be incorrectly predicted, a superscalar processor may have fetched
dozens of instructions which will have to be thrown away. Second, the target address of
a taken branch may not be known, so it also has to be predicted. In addition, a control
transfer disrupts the sequential accessing of instructions, so this requires a different line in
the instruction cache to be accessed.
For example, consider a superscalar architecture with six pipeline stages: instruction
fetch (IF), instruction decode and register rename (D/R), issue (IS), register read (RR),
execute (EX), and result commit (RC). These stages are depicted in Figure 2.1. If the target
address of a conditional branch is not known or incorrectly predicted, there is a one cycle
misfetch penalty. Furthermore, at least four more stages are required to detect an incorrectly
predicted conditional branch’s direction or an indirect branch’s address. This misprediction
penalty may be greater if a branch instruction spends more than one cycle in the issue stage
or instructions from a branch’s block have to be re-fetched.
Figure 2.1: Pipeline Stages of a Superscalar Processor
In addition to branch misprediction, control transfers can cause a significant loss in
fetching throughput. Figure 2.2 demonstrates how a straightforward superscalar fetching
technique would handle control transfers. To begin with, the first block of instructions
fetched discards two instructions after a taken branch. The branch transfers control to the
second block, but the starting position is not at the beginning of the block. As a result,
previous instructions in that block must be invalidated. Another control transfer is encoun-
tered in the second block and the remaining instructions are invalidated. Overall, only four
instructions out of a potential eight were fetched.
add
sub branch
call
lost lost
lost
Block 0
Block 1
Starting PC
lost
Figure 2.2: Simple Fetching Example
The simple fetching example demonstrates two fetching problems caused by control
transfers. First, a branch whose target address is not the beginning of a block results in
lost instructions. This is called a branch alignment problem. As will be discussed in
Chapter 4, the branch alignment problem may be completely solved in hardware. Second,
a control transfer instruction stops the sequential accessing of instructions in an instruction
cache line. As a result, a new instruction line must be read. Unlike the branch alignment
problem, this implies a fundamental limitation on the number of instructions that may be
fetched in one block.
2.1.1 Fetching Limitation
Control transfer instructions impose a limitation on instruction fetching. Let n be the
width of a block and b be the probability that an instruction transfers control. The expected
block run length, r(n, b), is

r(n, b) = n(1 - b)^n + \sum_{i=1}^{n} i (1 - b)^{i-1} b = \frac{1 - (1 - b)^n}{b}  (2.1)
Equation 2.1 represents the weighted sum of all events that could occur in a sequence of
n instructions. The first term is the case where there is no control transfer in a block. The
second term sums over every possible position of the first control transfer in the block. The limit
of r(n, b) as the block width increases is given by

\lim_{n \to \infty} r(n, b) = \frac{1}{b}  (2.2)
If a control transfer requires another cycle to reach the target address, then only one block
of instructions can be fetched in a cycle. Regardless of the type of software scheduling
or hardware techniques used to improve fetching, 1/b is the limit for the average number
of instructions fetched per cycle. Under these conditions, 1/b is the maximum average
number of instructions per cycle that can be executed on any single-threaded control-flow
architecture.
Here is an example to illustrate this fundamental fetching limitation. Suppose a pro-
gram executes a million instructions, and one hundred thousand of these instructions trans-
fer control. The probability of a control transfer instruction is therefore one tenth, and an
average of ten instructions fetched per cycle is the theoretical limit. Since each control
transfer instruction requires one cycle, to execute this program would require a minimum
of one hundred thousand cycles. Assuming no other performance problems, this program
can execute a maximum of ten instructions per cycle.
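As a sanity check, Equations 2.1 and 2.2 can be verified numerically. The sketch below (not part of the dissertation's tooling) computes the expected run length both by direct summation and by the closed form, using the b = 0.1 example above:

```python
def run_length_sum(n, b):
    """Equation 2.1 by direct weighted sum over all block outcomes."""
    no_transfer = n * (1 - b) ** n               # no control transfer in the block
    transfers = sum(i * (1 - b) ** (i - 1) * b   # first transfer at position i
                    for i in range(1, n + 1))
    return no_transfer + transfers

def run_length_closed(n, b):
    """Equation 2.1's closed form: (1 - (1 - b)^n) / b."""
    return (1 - (1 - b) ** n) / b

b = 0.1   # one control transfer per ten instructions, as in the example
for n in (4, 8, 16, 64):
    assert abs(run_length_sum(n, b) - run_length_closed(n, b)) < 1e-12
# As n grows, both forms approach the 1/b = 10 limit of Equation 2.2.
```

The two forms agree for every block width, and widening the block yields diminishing returns as the run length saturates at 1/b.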
As a result of this limitation, in order to average greater than 1/b instructions fetched
per cycle, multiple blocks of instructions must be fetched in one cycle. This requires spe-
cial hardware to predict multiple branches per cycle, which will be discussed in detail in
Chapter 5.
2.2 Software Techniques
Although only hardware techniques will be discussed, the potential benefit from software
techniques cannot be ignored. Using software techniques, the probability of a control
transfer instruction can be reduced. Loop unrolling is one method [3]. A relatively new
technique proposed by Calder and Grunwald is most promising [6]. By rearranging ba-
sic blocks, conditional branches become more likely not to be taken. This means that the
probability of a control transfer instruction is reduced because a not-taken branch is not a
control transfer. Nevertheless, software will only be able to make limited improvements.
As will be shown in this dissertation, hardware techniques can boost instruction fetching
performance after software improvements. Furthermore, unlike software techniques, hard-
ware techniques are able to address limitations created by control transfers.
Software techniques can be used to perform static branch prediction, which does
not vary during the execution of a program. One form of static branch prediction uses
compile-time heuristics [4, 24, 26]. Profile-based prediction is another method, which usually
performs better than compile-time heuristics [16, 26]. The most common static branch
prediction schemes include predicting backward branches as taken and forward
branches as not taken, encoding the most likely direction in the branch instruction, and using delay
slots. Static branch prediction can only reliably achieve around 70-80% accuracy. On the
other hand, dynamic branch prediction, which uses run-time information, can accurately
predict over 90% of dynamic branches.
2.3 Dynamic Branch Prediction
In order to achieve a high branch prediction accuracy, most modern microprocessors
use dynamic branch prediction [38, 52, 39, 30]. Dynamic branch prediction uses informa-
tion from previous execution of branches. Therefore, the prediction of a specific branch
may change depending on the run-time behavior.
The simplest form of dynamic branch prediction is a 1-bit predictor, which records
whether a branch was taken the last time it was executed. A 1-bit predictor may be
stored in the instruction cache. Alternatively, a pattern history table (PHT) may be used to
store 1-bit, 2-bit, or N-bit counters. A 2-bit up-down saturating counter has been shown to
be effective [27]. The 2-bit up-down saturating counter state transition diagram is shown
in Figure 2.3. When a branch is taken, the counter is incremented; when it is not taken, the
counter is decremented. The counter does not decrement past 0 or increment past 3. When
a branch is strongly predicted to be taken or not taken, two consecutive mispredictions are
required to change the prediction (it has a “second chance”). This has proven especially
effective for loop conditional branches.
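A minimal sketch of this counter (the class and method names are illustrative, not the dissertation's):

```python
class TwoBitCounter:
    """2-bit up-down saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state=2):
        self.state = state            # start weakly taken (an assumption)

    def predict(self):
        return self.state >= 2        # True means predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at 3
        else:
            self.state = max(0, self.state - 1)   # saturate at 0

c = TwoBitCounter(3)      # strongly taken, e.g. a loop branch
c.update(False)           # loop exit: one misprediction
assert c.predict()        # still predicts taken -- the "second chance"
c.update(False)           # a second consecutive not-taken outcome
assert not c.predict()    # only now does the prediction flip
```

The trailing assertions show why the scheme suits loop branches: a single loop exit does not disturb the taken prediction for the next loop entry.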
The PHT may be directly indexed by the PC, as shown in Figure 2.3. Unfortunately,
considerable interference results from this simple mapping. A more effective use of a PHT
is through branch-correlation and two-level adaptive prediction mechanisms [29, 55, 56]. These
use a k-bit branch history register to index into a 2^k-entry PHT, as shown in Figure 2.4. A
history register is updated after each branch. It is shifted to the left one position, and a 1 is
inserted for a taken branch and a 0 is inserted for a not taken branch.
The history register may be a global history register, as in Figure 2.5, or a per-addr
history register, as in Figure 2.6. A global history register (GHR) represents the last k
Figure 2.3: Pattern History Table and 2-bit Counter State Diagram
Figure 2.4: 2-Level Adaptive Branch Prediction
outcomes of the whole program, while a per-addr branch history register (BHR) represents
the last k outcomes for a specific branch. In addition, as shown in Figures 2.5 and 2.6
the PHT may be a single global table or multiple tables indexed by the branch address.
Furthermore, the BHR and PHT may use a per-set organization [55]. Although Yeh found these
schemes to be highly accurate, much of the PHT may be left unused, depending on
branch patterns. In order to increase utilization of the PHT and overall accuracy, McFarling
used the exclusive-or of the global history register and the branch address to index the
PHT [25].
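The XOR indexing can be sketched as follows; the table size K, the PC shift, and the initial counter values are illustrative assumptions, not McFarling's exact parameters:

```python
K = 12                                  # index width / history length (assumed)
pht = [2] * (1 << K)                    # 2-bit counters, initialized weakly taken
ghr = 0                                 # global history register

def predict(pc):
    """Return (PHT index, predicted-taken?) for this branch."""
    idx = (ghr ^ (pc >> 2)) & ((1 << K) - 1)    # XOR history with address bits
    return idx, pht[idx] >= 2

def update(idx, taken):
    """Train the counter and shift the outcome into the history register."""
    global ghr
    pht[idx] = min(3, pht[idx] + 1) if taken else max(0, pht[idx] - 1)
    ghr = ((ghr << 1) | int(taken)) & ((1 << K) - 1)
```

Because history and address bits are folded into a single index, distinct branches tend to map to distinct counters even when their histories collide, which is the utilization gain noted above.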
Figure 2.5: Global History Adaptive Branch Prediction
Figure 2.6: Per-Addr History Adaptive Branch Prediction
2.4 Instruction Fetch Prediction
The two-level adaptive branch prediction only provides the direction of a branch.
Another mechanism is required to predict the target of a branch. For taken conditional
branches and unconditional jumps, a branch target buffer (BTB) can be used to predict
the target address [24]. To predict return addresses, a BTB may be used, but a return
address stack (RAS) proves to be considerably more accurate [20]. A BTB may vary in its
associativity, and it is indexed using the current PC address. If the tag matches, then the
predicted target address is used. Otherwise, the next sequential PC is used. The BTB requires a
substantial amount of storage: each entry needs to store the tag of the branch address and the full target address.
A new technique by Seznec, however, can dramatically reduce the storage requirement by
using a pointer to a page instead of a full page address [34].
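The lookup logic can be sketched as a direct-mapped BTB combined with a RAS; the sizes, the dict representation, and the function names below are assumptions for illustration:

```python
BTB_SIZE = 512
btb = {}          # index -> (tag, target); a dict stands in for a direct-mapped array
ras = []          # return address stack

def predict_next(pc, is_call=False, is_return=False, fall_through=None):
    """Predict the next fetch address for the instruction at pc."""
    if is_return and ras:
        return ras.pop()                      # RAS beats the BTB for returns
    if is_call:
        ras.append(fall_through)              # push the address after the call
    idx, tag = (pc >> 2) % BTB_SIZE, pc >> 2
    entry = btb.get(idx)
    if entry and entry[0] == tag:             # tag match: use the stored target
        return entry[1]
    return fall_through                       # miss: fetch sequentially
```

For a return, the RAS supplies the target even when the BTB also holds an entry, which is where its accuracy advantage over a BTB comes from.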
Another innovative technique for instruction fetch prediction is the use of a next line
and set table (NLS) [7]. Calder observed that all that is immediately needed to access the
instruction cache is a line index and a set prediction. Therefore, instead of storing the complete
address like the BTB, an NLS entry records only the index of a line and a set prediction. It
also records 2-bit branch type information, which can represent an invalid entry, a return
instruction, a conditional branch, or any other type of branch. Figure 2.7 is a block diagram
of the NLS architecture. It is a decoupled branch architecture since it separates the branch
prediction (global two-level adaptive) from the target address prediction. The NLS is a
direct-mapped table, or can be stored in an instruction cache line. Since the NLS table
does not record a tag and has a small storage requirement for an instruction index, it can
store many times more entries than an associative BTB. Given a cost versus performance
comparison, an NLS architecture can be more effective than a BTB architecture [8].
Figure 2.7: Block Diagram Schematic of the NLS Architecture
2.5 Multiple Block Fetching
In order to achieve a high fetching rate, multiple branches must be predicted in a
single cycle. In addition, in order to fetch beyond control transfer limitations, multiple
blocks need to be fetched per cycle. A basic block is defined as the instructions between
branches, whether they are taken or not taken. This dissertation refers to a block simply as
a group of sequential instructions up to a predefined limit, n, or up to the end of a line. A
line of instructions refers to the group of instructions physically accessed in the instruction
cache. The size of a line may be greater than or equal to the block width n. Therefore,
mechanisms which predict multiple basic blocks may be predicting multiple branches in a
single cache line or across multiple cache lines.
A technique to predict multiple basic blocks was first introduced by Yeh and Patt [54].
Multiple branches can be accurately predicted using their global two-level adaptive branch
prediction. This is accomplished by indexing the PHT with the k-bit GHR. After the GHR
is shifted to the left once, the two remaining possibilities are simultaneously accessed, and
the prediction for the second branch is selected based on the result of the first prediction.
This mechanism is shown in Figure 2.8. The number of simultaneous accesses to the
PHT increases exponentially with the number of branch predictions. In order to predict
multiple target addresses and select among the possibilities, Yeh and Patt introduced a
branch address cache (BAC). Given the current PC, the address of each possible successor
basic block is stored in a BAC entry. As shown in Figure 2.9, this is a tree structure which
grows exponentially with the number of basic block predictions. As a result, the BAC
requires an enormous amount of storage for accurate prediction, and a large percentage is
wasted from paths that are not used.
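The PHT side of the scheme, predicting two branches from one history, can be sketched as below (2-bit counters assumed; the function and variable names are illustrative, not Yeh and Patt's):

```python
def predict_two(pht, ghr, k):
    """Predict two consecutive branches from one k-bit global history."""
    mask = (1 << k) - 1
    p1 = pht[ghr & mask] >= 2            # first branch prediction
    base = (ghr << 1) & mask             # GHR shifted left once
    cand_nt = pht[base | 0] >= 2         # second prediction if first is not taken
    cand_t = pht[base | 1] >= 2          # second prediction if first is taken
    p2 = cand_t if p1 else cand_nt       # first outcome selects between the two
    return p1, p2
```

Both candidate PHT entries are read simultaneously and the first prediction merely steers a multiplexer; the number of candidate entries doubles with each additional branch predicted, which is the exponential access growth noted above.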
Figure 2.8: Multiple Global Adaptive Branch Prediction
Seznec et al. [35] recently introduced an innovative way to fetch multiple (two) basic
blocks. Their idea is to always use the current instruction block information to predict the
block following the next instruction block, as shown in Figure 2.10. Its accuracy is almost
as good as that of single-block fetching, and it requires little additional storage. The major
drawback, as the authors explain, is that the prediction for the second block is dependent
on the prediction from the first block (the tag-matching is serialized). Chapter 5 introduces
a mechanism which is able to predict multiple blocks in parallel without such a dependency.
Another mechanism to effectively fetch multiple basic blocks is a trace buffer [50] or
a trace cache [33]. A trace cache dynamically builds a run of instructions based on a starting
address. If the current address and branch prediction match an entry in the trace cache, then
the instructions stored in the cache are used. Otherwise, a single block of instructions must
be retrieved from the instruction cache, and a new entry is built. Using a 4KB trace cache,
Figure 2.9: Branch Address Tree and Cache Mapping
Figure 2.10: Two-block Ahead Branch Prediction
the instruction fetching rate was significantly improved, even with a 30-40% trace cache
hit rate.
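A trace-cache lookup can be sketched roughly as below; the key structure and fill policy are simplifications of the published designs, not their exact mechanisms:

```python
trace_cache = {}   # (start_pc, predicted branch outcomes) -> instruction run

def fetch(start_pc, predicted_outcomes, icache_block):
    """Return a trace on a hit; otherwise fetch one block and start a new entry."""
    key = (start_pc, predicted_outcomes)
    if key in trace_cache:
        return trace_cache[key]          # hit: deliver the whole multi-block run
    block = icache_block(start_pc)       # miss: fall back to one instruction block
    trace_cache[key] = block             # begin building a new trace entry
    return block
```

Because the key includes the predicted branch outcomes, the same start address can hold several distinct traces, one per path through the branches.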
2.6 Register Renaming
Register renaming is an important tool that allows a superscalar processor to perform out-of-order
execution. One method to rename registers is to use a reorder buffer [36, 19]. It can
dynamically rename a logical (programmable) register to a unique tag identifier. The result is
written back into this buffer. If there are no exceptions or bad branches, the result eventu-
ally commits by writing the result to the register file.
Another register renaming technique maps logical registers into physical registers
(the index into the physical data array), as performed by the MIPS R10000 [52]. A block
diagram of the renaming process is shown in Figure 2.11. When an instruction is decoded,
a new physical register from a free list is allocated for its destination register and entered
into a mapping table. The old physical register for that register is entered into a recovery
list. The recovery list (also called the active list) maintains the in-order state of instructions
and can be used to undo the mappings in the event of a mispredicted branch or exception.
After an instruction completes and all previous instructions have completed, its register
is committed and the old value is discarded by freeing the old physical register contained
in the recovery list. During decoding, each source operand's logical register number is
used as an index into the mapping table to read the corresponding physical register. The
advantage of this renaming technique is that one data array is used to store both committed
registers (part of the state of the machine) and speculative registers (extra registers reserved
for speculative results until committed).
Figure 2.11: Block Diagram of Renaming Logic
2.6.1 Recovery
A major drawback with using a mapping table to rename logical registers to physical
registers is the large penalty required to recover from a branch misprediction or exception.
Recovery proceeds by reading the recovery list from the most recent entry until the mispre-
dicted branch or the instruction which caused an exception is encountered. After an entry is
read, the old physical register is used to replace the mapping of that entry’s logical register
number. After all appropriate entries from the recovery list have been read and re-mapped
into the mapping table, the mapping table will reflect the state of the machine after that
mispredicted branch.
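The mapping table, free list, and recovery list, together with the walk-back just described, can be sketched as follows (the structure and names are simplifications of the R10000-style scheme, not its implementation):

```python
class Renamer:
    def __init__(self, n_logical, n_physical):
        self.map = list(range(n_logical))                # logical -> physical
        self.free = list(range(n_logical, n_physical))   # free physical registers
        self.recovery = []                               # (logical, old_physical)

    def rename_dest(self, logical):
        """At decode: allocate a new physical register for a destination."""
        new = self.free.pop(0)
        self.recovery.append((logical, self.map[logical]))  # save old mapping
        self.map[logical] = new
        return new

    def commit_oldest(self):
        """In-order commit: free the old physical register of the oldest entry."""
        _, old = self.recovery.pop(0)
        self.free.append(old)

    def recover(self, depth):
        """Undo the newest `depth` mappings after a misprediction or exception."""
        for _ in range(depth):
            logical, old = self.recovery.pop()   # walk newest-first
            self.free.append(self.map[logical])  # reclaim the speculative register
            self.map[logical] = old              # restore the old mapping
```

Note that recovery cost grows with the number of entries walked, which is exactly the penalty the checkpointing described below tries to avoid.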
The large penalty required to recover the mapping table can be minimized by using
a checkpoint mechanism. The R10000 uses checkpointing for up to four branches, but not
for exceptions [52]. In order for checkpointing to be realistic in hardware, checkpoint stor-
age must be integrated into the basic cell of the mapping table to have direct access for a
single cycle recovery. Thus, a mapping table's cell size is greatly increased and is not scalable
with the number of branch checkpoints. With increased speculation from wide-issue
superscalar processors and larger mapping tables from SMT, using standard RAM cells becomes
imperative for speed and scalability. Chapter 6 will introduce a new hybrid register renaming
technique which significantly reduces this penalty yet still retains some of the benefits of a
mapping table.
2.7 Register File Complexity
The design of a register file for a superscalar processor has become an increasingly
difficult task to accomplish. With increasing issue rates, larger instruction windows, deeper
pipelines, and the advent of simultaneous multithreading, the pressure on the register file
to supply multiple values and store speculative results has dramatically increased. In order
to issue N instructions per cycle, �N registers need to be read for operands and N results
need to be written back. For instance, issuing eight instructions per cycle requires 16 read
ports and 8 write ports on a register file.
Unfortunately, the area of the register file increases proportional to the square of the
number of ports on a register file [10, 9]. Because of the implementation of a data cell in
a register file, each time a port is added, more hardware is required: a transistor, a wire to
access the cell, and a wire to read/write the cell. This increases both the cell’s length and
the cell’s width.
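The quadratic relation can be illustrated with a toy area model (purely illustrative; real layouts differ):

```python
def relative_area(read_ports, write_ports):
    """Cell area relative to a single-port cell: each port adds one wire in each
    dimension of the cell, so both cell dimensions grow with the port count."""
    ports = read_ports + write_ports
    return ports ** 2

scalar = relative_area(2, 1)    # 2-read/1-write scalar register file
wide8 = relative_area(16, 8)    # 16-read/8-write file for 8-way issue
assert wide8 // scalar == 64    # the 8-way cell is 64x the scalar cell's area
```

Under this simple model, the fully ported 8-way file pays a 64x per-cell area penalty over a scalar 2R/1W file, which is the motivation for the banked organization introduced earlier in this dissertation.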
In addition, the access time increases with the number of ports. For example, using
0.5 micron CMOS technology, the cycle times of a register file with 64 registers are estimated
to range from 2.8ns to 3.6ns as the issue width N increases [51]. The increased
cycle time must be taken into account in comparing the effective performance between
different values of N , as will be shown in Section 6.5 of Chapter 6.
Chapter 3
Experimental Methodology
This chapter presents the experimental methodology applied to the remainder of this
dissertation. The experiments presented are simulation based. A benchmark suite of pro-
grams, SPEC95, is executed via simulation, and results are gathered. The results are pre-
sented using metrics described in Section 3.4.
The SPEC95 programs are the industry standard for evaluating the performance of
computer systems. The programs are large in size, and they execute a large number of
dynamic instructions. These factors enable a realistic performance evaluation of a super-
scalar processor using the scalable hardware mechanisms presented in this dissertation.
The SPEC95 benchmark suite consists of two parts: an integer suite, SPECint95, and a
floating point suite, SPECfp95. A description and attributes of these benchmark programs
are given in Section 3.2. The SPECint95 and SPECfp95 suites were executed using the
SPARC instruction-set architecture [49]. In addition, Chapter 6 executes the SPECint95
using the Superscalar Digital Signal Processor (SDSP) instruction-set architecture [43].
The instruction set of the SDSP is very similar to MIPS [21]. Tools for both architectures
were used to facilitate simulation. These tools are described in Section 3.1.
3.1 Simulation Tools
A simulator was developed to analyze the performance of the different instruction
fetching and register file mechanisms. This was accomplished by using a front-end to pro-
vide a detailed dynamic instruction trace and a back-end to provide the simulation of the
specific mechanism. The front-end for the SPARC architecture used the Shade instruction-
set simulator [12]. The front-end for the SDSP architecture used the SDSP simulator de-
veloped in [42]. Both front-ends provide user-level traces only. The instruction trace was
generated during simulation run-time by actually executing the instructions. This provides
greater speed and flexibility than using a trace file.
After the program is compiled, the simulator’s front-end reads the program into its
memory and begins execution. The trace information is delivered to the back-end which
collects statistical information based on the machine model. After execution completes, the
statistical information is written to a file in its raw data format. Another program uses this
to display the statistical results, possibly from several different runs.
The SDSP front-end interprets SDSP code, while Shade uses dynamic compilation
to execute the program. When Shade compiles SPARC code, it also annotates it to record
specific run-time information. Additional decoding of each instruction is performed before
it is passed on to the back-end. As a result, a completed trace structure for an instruction
contains the original PC, the opcode, source register identifiers, destination identifier, and
the functional unit type.
The back-end of the simulator for full execution reads the stream of trace structures.
It performs its own decoding by building a dependency list between the source operands
and previous instructions in its instruction window. It also performs renaming of operands.
Instructions in its instruction window are scanned to find ready-to-run instructions. If the
required resource is available, then the instruction is marked as issued. After the required latency
of an instruction, the simulator frees the appropriate resource and marks the instruction as
completed. It can simulate different branch prediction methods and handle mispredicted
branches. It can also simulate instruction and data cache misses. The simulator can execute
with different parameters: decode size, instruction window size, different register renaming
options, maximum issue rate, maximum result write rate, line size, instruction cache size,
data cache size, BTB size, PHT size, various penalty options, functional unit specifications
including number and latency, and multiple data cache banks with outstanding request
queue.
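The front-end/back-end split described in this section can be summarized with a skeleton like the one below (the record fields and the `consume`/`stats` interface are assumptions, not the actual simulator's API):

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """One decoded instruction delivered by the front-end at run time."""
    pc: int
    opcode: str
    srcs: tuple
    dest: int
    unit: str

def run(front_end, back_end):
    """Drive the back-end machine model with a dynamically generated trace."""
    for t in front_end():        # front-end executes the program, yielding traces
        back_end.consume(t)      # back-end updates its machine model and statistics
    return back_end.stats()
```

Generating the trace on the fly, as the text notes, avoids the storage and I/O cost of a trace file while letting many back-end configurations consume the same front-end.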
3.2 SPEC95 Benchmarks
Each SPEC95 program was compiled using the SunPro compiler with standard op-
timizations (-O) for the SPARC architecture and compiled using the GNU CC compiler
with second-level optimizations (-O2) for the SDSP architecture. A list of each program,
application area, and description is given in Table 3.1.
3.2.1 Program Attributes
Table 3.2 lists the branch attributes for the first billion instructions of each SPEC95
application on the SPARC architecture. The first column lists the percentage of dynamic
instructions that transferred control, which includes taken conditional branches and any other
type of branch. The second column lists the percentage of any type of branch encountered.
The third column lists the percentage of taken conditional branches. The remaining five
Program Application Area Description
SPECint95
go Artificial intelligence Plays the game of “Go” against itself.
m88ksim CPU simulator Motorola 88000 chip simulator; runs test program
gcc Compiler A benchmark version of the GNU C compiler, version 2.5.3.
Only the “cc1” phase is executed, using pre-processed files.
compress Utility A compression program that uses adaptive
Lempel-Ziv coding. It compresses and decompresses
in-memory text data.
li Interpreter A LISP interpreter.
ijpeg Graphics JPEG compression and decompression.
perl Interpreter Manipulates strings (anagrams) and prime numbers in Perl.
vortex Database Single-user object-oriented database transaction benchmark.
SPECfp95
tomcatv Geometry Generates 2-dimensional, boundary fitted coordinate
systems around general geometric domains.
swim Meteorology Solves the system of shallow water equations
using finite difference approximations.
su2cor Quantum physics Calculates masses of elementary particles
in the framework of the Quark Gluon theory.
hydro2d Astrophysics Uses hydrodynamic Navier Stokes equations
to calculate galactical jets.
mgrid Electromagnetism A simplified multigrid solver computing a 3D potential field.
applu Mathematics Solves multiple, independent systems of a block tridiagonal
system using Gaussian elimination (without pivoting).
turb3d Aeronautics Simulates isotropic, homogeneous turbulence in a cube
with periodic boundary conditions.
apsi Meteorology Solves for the mesoscale and synoptic variations of
potential temperature, the mesoscale vertical velocity and
pressure and distribution of pollutants.
fpppp Quantum chemistry Calculates multi-electron integral derivatives.
wave5 Electromagnetism Solves Maxwell's equations and particle equations of motion on a Cartesian mesh.
Table 3.1: Description of SPEC95 Applications
columns are a distribution of branch types. The branch types are conditional branches
(CBR), immediate and indirect unconditional branches (UBr), call instruction (Call), and
return instruction (Ret).
3.3 Machine Model
To verify that the new register file architecture in Chapter 6 performs well, a reasonable
machine model was chosen that resembles commercial processors such as the PowerPC
604 [38], MIPS R10000 [52], and Sun UltraSPARC [39]. Table 3.3 lists the quantity,
type, and latency of the different functional units modeled. The quantity of functional units
for the 8-way superscalar architecture is twice that of the 4-way superscalar architecture.
The machine model parameters used in simulation are:
- instruction cache: 64 Kbyte, two-way set associative LRU, 16 byte line size, 2 banks, self-aligned fetching, 10 cycle miss penalty
- data cache: 64 Kbyte, two-way set associative LRU, 16 byte line size, 4 banks, 2 simultaneous accesses per cycle, lockup-free, write-around, write-through, 4 outstanding cache request capability, 10 cycle miss penalty
- branch prediction: 2K x 2-bit pattern history table indexed by the exclusive-or of the PC and global history register
- speculative execution: enabled
- interrupts: precise
- instruction window: centralized; 32 entries for 4-way, 64 entries for 8-way
Program %Control Transfer %Branches %CBr Taken %CBr %UBr %Call %Ret
gcc 13 21 48 76 10 7 7
compress 9 17 33 70 16 7 7
go 9 14 47 75 11 7 7
ijpeg 5 9 41 76 14 5 5
li 14 22 42 62 12 13 13
m88ksim 10 16 50 72 8 10 10
perl 13 19 50 65 16 10 9
vortex 12 18 54 76 4 10 10
applu 4 6 56 86 14 0 0
apsi 2 3 51 83 11 3 3
fpppp 1 2 60 79 14 4 4
hydro2d 9 12 72 90 6 2 2
mgrid 1 1 61 81 5 7 7
su2cor 12 20 47 70 15 7 7
swim 2 2 69 69 4 13 13
tomcatv 12 19 46 70 15 8 8
turb3d 7 8 68 64 16 10 10
wave5 7 8 63 50 20 15 15
Table 3.2: Branch Attributes of SPEC95 Applications
Table 3.3: Functional Unit Quantity, Type, and Latency
Quantity (4-way  8-way)   Type   Latency
4 8 ALU 1
2 4 Load unit 1
2 4 Store unit -
1 2 Integer multiply 2
1 2 Integer divide 10
4 8 FP add 3
1 2 FP multiply 3
1 2 FP divide 16
1 2 FP other 3
• register file: separate general purpose and floating point register files; logical
registers are mapped to physical registers
• recovery list: 32 entries for 4-way, 64 entries for 8-way
• store buffer: 16 entries
The instruction scheduling logic uses a single instruction window for all functional
units [19]. A reasonable size for the instruction window, 32 entries for 4-way and 64 entries
for 8-way, was chosen that would give good performance and produce a strong demand for
registers during issue and write back. A variable number of instructions, up to the de-
code width of 4 or 8, may be inserted into the instruction window, if entries are available.
Instructions are issued out-of-order using an oldest first algorithm. Store instructions are
issued in-order, and load instructions may be issued out-of-order in between store instruc-
tions.
Each cycle, a variable number of registers, up to the decode width, may be retired
from the recovery list. If entries are available, then they may be used by new instructions,
up to the decode width. The old physical register corresponding to the same destination
register is inserted into the recovery list.
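The interaction between the mapping table, the free physical registers, and the recovery list can be sketched as follows (a toy model with hypothetical register counts and names, not the simulated machine's implementation):

```python
# Hypothetical sketch: 8 logical registers, 16 physical registers.
free = list(range(8, 16))            # physical registers available for renaming
mapping = {r: r for r in range(8)}   # logical -> physical map
recovery = []                        # old physical registers awaiting retirement

def rename_dest(logical):
    """Map a destination register to a fresh physical register; the old
    physical register goes onto the recovery list."""
    phys = free.pop(0)
    recovery.append(mapping[logical])
    mapping[logical] = phys
    return phys

def retire(count):
    """Retire up to `count` recovery-list entries; the retired physical
    registers become available to new instructions."""
    for _ in range(min(count, len(recovery))):
        free.append(recovery.pop(0))
```

Renaming a destination and then retiring it frees the previously mapped physical register for reuse, which is the demand on registers the chosen window sizes are meant to stress.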
The pipeline stages of the processor modeled are instruction fetch, decode and re-
name, issue, register read, execute, and result commit, as shown in Figure 2.1. Conse-
quently, two levels of bypassing are required for back-to-back execution of dependent in-
structions. Also, instructions dependent on a load are optimistically issued, in expectation
of a cache hit. If a cache miss occurs, then the dependent instructions must be re-issued,
similar to the design in [48]. In addition, the simulator continues to fetch instructions from
the wrong path until an incorrectly predicted branch is resolved.
3.3.1 Default Configuration
In Chapter 4, only the instruction fetching mechanisms are simulated; therefore, neither
branch prediction nor execution was simulated. A perfect instruction cache was assumed,
and only the type of cache was considered. Each program was simulated for the first four
billion instructions.
In Chapter 5, branch prediction and instruction fetch prediction were simulated. The
branch prediction always used a PHT indexed by the exclusive-or of the GHR and the
block address. A perfect instruction cache was again assumed, and only the type of cache
and number of banks were considered. Each program was simulated for the first one billion
instructions.
Full execution was simulated in Chapter 6. Unless otherwise noted, the
base configuration of the simulator uses the defaults described in the section above. Only
the first fifty million instructions were simulated for full execution.
3.4 Performance Metrics
This section covers the performance metrics used in this dissertation. Performance
metrics are used to give an overall impression of the performance of a specific hardware
mechanism. Depending on the performance objective, one performance metric may be
better suited than another. In Chapter 4, the basic instruction fetching performance is ana-
lyzed by using the instructions fetched per cycle (IFPC) metric. In Chapter 5, the Branch
Execution Penalty (BEP) metric is used to indicate the penalty cost of executing multi-
ple branches and blocks, and the Effective Instruction Fetch Rate (IPC_f) metric gives
an overall fetching rate including all branch penalties. Finally, in Chapter 6, the instruc-
tions per cycle (IPC) metric is used to show the overall performance including fetching and
execution.
3.4.1 Instructions Fetched Per Cycle (IFPC)
The instructions fetched per cycle is the average number of instructions re-
turned to the decoder per fetch cycle. It would equal IPC if there were no branch
mispredictions, cache misses, or other stalls in execution. The IFPC represents the raw
fetching rate the instruction fetch mechanism can deliver assuming a perfect instruction
cache, branch prediction, and instruction fetch prediction.
3.4.2 Branch Execution Penalty (BEP)
The branch execution penalty is defined to be

BEP = Total Branch Penalty Cycles / Branches.   (3.1)
This gives us the average number of additional cycles required to execute a branch
instruction. The total branch penalty cycles include cycles from a branch misprediction,
misselection, a branch misfetch, an indirect branch misprediction, and a return address
misprediction (see Table 5.3 in Chapter 5). In addition, when fewer than the maximum
number of blocks are fetched using multiple block prediction, the additional cycles required
to fetch remaining blocks are considered part of the branch penalty. This includes bank
conflicts.
3.4.3 Effective Instruction Fetch Rate (IPC_f)
The effective instruction fetch rate is similar to IFPC, except that branch prediction
and instruction fetch prediction are now taken into consideration. The rest of the processor
execution is assumed to be ideal. The effective instruction fetch rate is computed as

IPC_f = Valid instructions / Fetch cycles,   (3.2)

where the number of fetch cycles is equal to

Total Branch Penalty Cycles + (Blocks fetched / Maximum blocks per cycle).   (3.3)
The number of blocks refers to the total number of valid blocks fetched and delivered
to the decoder. When fetching multiple blocks per cycle, the maximum blocks per cycle
refers to the number of blocks the fetcher is designed to fetch.
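Equations 3.2 and 3.3 combine as in the following sketch, assuming fetch cycles are the branch penalty cycles plus blocks fetched divided by the maximum blocks per cycle (the counts passed in are made-up illustrative inputs):

```python
def ipc_f(valid_instructions, branch_penalty_cycles, blocks_fetched,
          max_blocks_per_cycle):
    """Effective instruction fetch rate (Equations 3.2 and 3.3)."""
    fetch_cycles = branch_penalty_cycles + blocks_fetched / max_blocks_per_cycle
    return valid_instructions / fetch_cycles

# Example: 800 valid instructions, 50 penalty cycles, 100 blocks fetched
# on a machine designed to fetch 2 blocks per cycle.
rate = ipc_f(800, 50, 100, 2)
```

Note how bank conflicts and partial fetches enter only through the penalty-cycle term, so IPC_f degrades gracefully as penalties accumulate.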
3.4.4 Instructions Per Cycle (IPC)
The instructions per cycle is the total number of dynamic instructions executed di-
vided by the total execution cycles. Ideally, the IPC would be equal to N , the maximum
number of instructions a superscalar processor can issue in a cycle. Due to instruction
fetching limits, cache misses, data dependencies, and mispredictions, IPC performance is
usually much lower than the ideal rate of N.
Chapter 4
Instruction Fetching Mechanisms
Instruction fetch mechanisms govern how instructions are fetched
from memory and delivered to the decoder. Since this chapter focuses only on hardware in-
struction fetching mechanisms, other performance issues (such as branch prediction, cache,
execution, etc.) are not evaluated. The objective of this chapter is to describe, evaluate, and
provide solutions to the first step in a series of hurdles for exploiting high levels of ILP.*
First, the fetching model used throughout this chapter is described in Section 4.1.
Next, different hardware techniques used for instruction fetching are described in Sec-
tion 4.2. A mathematical model for each hardware technique is presented in Section 4.3.
Finally, Section 4.4 compares the expected instruction fetching performance with results
from simulating the SPEC95 benchmark suite.
4.1 Fetching Model
This section describes the fetching model used in the rest of the chapter. The cache
line size is defined to be the size of a row in the instruction cache. The terms ‘line’ and
‘row’ are used interchangeably. This determines the maximum number of instructions that
*Portions of this chapter were published in Euro-Par '96 [44].
can be accessed simultaneously in one cycle. Also, a block is defined to be a group of
sequential instructions. A block’s width is the maximum number of instructions allowable.
Figure 4.1 is a block diagram showing the different fetching steps. The instruction
cache reads the requested fetch block of width q and returns it to the instruction fetcher.
The instruction decoder receives a decode block of width n. If prefetching is applied, up to
q new instructions from the instruction fetcher go into the prefetch buffer FIFO queue and
n instructions come out; this implies q ≥ n in the diagram. Otherwise, if prefetching is not
used, the fetch and decode widths are equal, and the instruction fetcher delivers instructions
directly to the decoder.
[Figure: the instruction cache returns fetch blocks of width q to the instruction fetcher, which feeds the prefetch buffer FIFO; the instruction decoder receives decode blocks of width n, and the fetcher sends the next PC back to the cache.]
Figure 4.1: Fetching Block Diagram
The instruction fetcher is responsible for determining the new starting PC each cycle
and sending it to the instruction cache. It cooperates with a branch predictor or branch
target buffer, if employed. Calder and Grunwald [5] describe different techniques for fast
PC calculation. Whichever technique is used, the new PC must be determined in the same
cycle. Also, after the instruction fetcher receives the fetch block from the instruction cache,
it performs preliminary decoding to determine the instruction type (or uses prediction or
predecoding methods). Instructions after the first instruction that transfers control are
invalidated.
Johnson defines an instruction run to be the sequentially fetched instructions between
branches [19]. In this dissertation, an instruction run is further specified to be between in-
structions that transfer control. A control transfer instruction includes unconditional jumps
and calls, conditional branches that are taken, and any other instruction that transfers con-
trol, such as a trap. The run length is the number of instructions in a run. In addition, a
block run is defined to be the instructions from the start of the block to the end of the
block or the first instruction that transfers control. The block run length is the number of
instructions in a block run.
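For example, run lengths can be computed from a dynamic trace by marking which instructions transfer control (a small sketch, not the simulator's code):

```python
def run_lengths(transfers_control):
    """Split a dynamic instruction stream into runs, where each run ends
    with an instruction that transfers control (branch taken, jump, call,
    return, trap, etc.)."""
    runs, length = [], 0
    for ct in transfers_control:
        length += 1
        if ct:
            runs.append(length)
            length = 0
    if length:                  # trailing run with no terminating transfer
        runs.append(length)
    return runs
```

The average of these values is the average dynamic run length, 1/b in the notation of Section 4.3.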
4.2 Hardware Techniques
This section describes hardware techniques which perform instruction fetching. To
begin with, three cache types are described: a simple cache, an extended cache, and a
self-aligned cache. Next, prefetching is described. Finally, a new mechanism to fetch two
blocks per cycle, a dual branch target buffer, is introduced.
4.2.1 Simple Cache
A straightforward approach to fetch instructions from the instruction cache is to have
the line size equal the width of the fetch block. If the starting PC address is not the first
position in the corresponding row of the instruction cache, then the appropriate instructions
are invalidated and fewer than the fetch width are returned. As with all fetching techniques,
if there is an instruction that transfers control, instructions after it are invalidated.
Figure 2.2 showed an example for the simple fetching mechanism. In that example,
the second instruction in the first block was a taken branch, so the third and fourth in-
structions were invalidated. Also, only two instructions from the second block were valid.
Altogether, only four out of a potential eight instructions were used for instruction decoding
and execution, which illustrates the problem with this simple approach.
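The arithmetic of that example can be sketched directly (the function and parameter names are illustrative):

```python
def simple_fetch_valid(n, start_offset, transfer_pos=None):
    """Instructions a simple cache of line width n delivers in one cycle:
    those from the start offset up to the first taken control transfer
    (inclusive), or to the end of the line if there is none."""
    end = n if transfer_pos is None else transfer_pos + 1
    return max(0, end - start_offset)

# The example above: a taken branch at position 1 of the first block, and a
# target entering the second block at position 2.
total = simple_fetch_valid(4, 0, transfer_pos=1) + simple_fetch_valid(4, 2)
```

Only four of a potential eight instructions are delivered, exactly the loss the extended and self-aligned caches below are designed to reduce.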
4.2.2 Extended Cache
One way to reduce the chance that instructions will be lost to an unaligned target
address of a control transfer instruction is to extend the instruction cache line size beyond
the width of the fetch block. To avoid losing instructions on sequential reads that are not
block aligned, the instruction fetcher must be able to save the last n − 1 instructions in a
row and combine them with instructions that are read the next cycle. Only when there is a
control transfer into the last n − 1 instructions of a cache row are instructions lost to an
unaligned target address.
Figure 4.2 is an example of the extended cache fetching technique using n = 4 and
an extended cache line size of 8 instructions. The starting PC in this example is at the
third instruction in Line 0. Four instructions are returned to the instruction fetcher in Cycle
1. The last two instructions in Line 0 are saved for the next cycle. During Cycle 2, the
instruction fetcher combines two new instructions read from Line 1 with the two instructions
saved the previous cycle. There is no need to save any instructions this cycle because the
line can be re-read and still return four instructions.
[Figure: two 8-instruction lines; in Cycle 1 the fetch starts at the third instruction of Line 0 and the last two instructions of the line are saved; in Cycle 2 they are combined with the first two instructions read from Line 1.]
Figure 4.2: Extended Fetching Example
4.2.3 Self-Aligned Cache
The target alignment problem can be solved completely in hardware with a self-
aligned instruction cache. The instruction cache reads and concatenates two consecutive
rows within one cycle so as to always be able to return n instructions. To implement a
self-aligned cache, the hardware must either use a dual-port instruction cache, perform two
separate cache accesses in a single cycle, or split the instruction cache into two banks.
Using a two-way interleaved (i.e., two-bank) instruction cache is preferred for both space
and timing reasons [13, 17, 14].
Figure 4.3 is an example of the self-aligned cache fetching technique using n = 4.
Only the last two instructions in Line 0 are available for use because the starting PC is not
at the first position. Since the following line is read and available during the same cycle,
four instructions are returned by combining the two instructions from Line 0 and the first
two instructions from Line 1.
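A two-bank self-aligned read can be sketched as follows (the cache is modeled as a simple list of rows, and the instruction names are illustrative):

```python
def self_aligned_fetch(icache, pc, n):
    """Read the row containing pc and the following row (the two banks are
    accessed in parallel), concatenate them, and return n instructions
    starting at pc, regardless of alignment."""
    row, offset = divmod(pc, n)
    both = icache[row] + icache[row + 1]   # two consecutive rows per cycle
    return both[offset:offset + n]

# Figure 4.3's situation: the starting PC is two instructions before the
# end of Line 0, so the read spans the line boundary.
lines = [["sub", "lsl", "load", "add"], ["and", "or", "xor", "st"]]
```

Because consecutive rows live in different banks, both reads complete in one cycle without a dual-ported array.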
[Figure: the starting PC points into the middle of Line 0; the last two instructions of Line 0 and the first two instructions of Line 1 are concatenated to return four instructions in one cycle.]
Figure 4.3: Self-aligned Fetching Example
4.2.4 Prefetching
All of the above cache types can be used in conjunction with prefetching. Prefetching
helps improve fetching performance, but fetching is still limited because instructions after
a control transfer must be invalidated.
The fetch width q, where q ≥ n, is the number of instructions that are examined for a con-
trol transfer. Let p be the size of the prefetch buffer. After the instruction fetcher searches
up to q instructions for a control transfer, valid instructions are stored into a prefetch buf-
fer. Each cycle, the instruction decoder removes the oldest n instructions from the prefetch
buffer. In essence, the prefetch buffer enables an average performance closer to the larger
expected run length of q instructions compared to n instructions.
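The buffer's cycle-by-cycle behavior can be sketched as a toy model (assuming, for simplicity, that instructions overflowing a full buffer are dropped, the complication discussed for the simple cache in Section 4.3.4):

```python
from collections import deque

def run_prefetch(blocks, n, p):
    """Deliver up to n instructions per cycle from a prefetch buffer of
    size p. `blocks` holds the valid instructions (up to q) that the
    fetcher produces each cycle."""
    buf, delivered = deque(), []
    for blk in blocks:
        room = p + n - len(buf)    # n leave this cycle, so at most p remain
        buf.extend(blk[:room])     # overflow instructions are dropped
        out = [buf.popleft() for _ in range(min(n, len(buf)))]
        delivered.append(out)
    return delivered
```

Run on the scenario of Figure 4.4 (seven valid instructions, then two), it delivers four instructions in both cycles, which a bufferless fetcher could not do.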
Figure 4.4 shows an example using prefetching with n = 4, q = 8, and p = 4.
Starting with an empty prefetch buffer, there are seven valid instructions before the branch
(this example shows a complete block of q = 8 instructions returned by the instruction
cache to the instruction fetcher). Four instructions are used in this cycle, while the re-
maining three valid instructions are put in the prefetch buffer for later use. The next cycle,
a block of instructions starting with the target address of the branch is read. Only two in-
structions are valid because a call instruction was detected. As a result, the three instructions
from the buffer and the first add instruction are used, while the remaining call instruction
is put into the prefetch buffer.
[Figure: in Cycle 1, seven of eight fetched instructions are valid up to a branch; four are decoded and three enter the prefetch buffer. In Cycle 2, the target block contains a call, so the three buffered instructions plus the first new instruction are decoded and the call is buffered.]
Figure 4.4: Prefetch Example
4.2.5 Dual Branch Target Buffer
In this section the dual branch target buffer (DBTB) is introduced. It is based on the
original branch target buffer (BTB) design by Lee and Smith [24]. Unlike the previous
techniques mentioned thus far, the DBTB can bypass the limitation imposed by a control
transfer. The DBTB is similar to the Branch Address Cache introduced by Yeh et al. [54],
except that the DBTB does not grow exponentially. Conte et al. introduced the collapsing
buffer, which allows intra-block branches [13]. The DBTB can handle both intra-block and
inter-block branches.
The purpose of a BTB is to predict the target address of the next instruction given
the address of the current instruction. This idea is taken one step further. Given the current
PC, the DBTB predicts the starting address of the following two lines. Using the predicted
addresses for the next two lines, a dual-ported instruction cache is used to simultaneously
read them. Hence, the first line may have a control transfer without requiring another cycle
to fetch the subsequent line.
The DBTB is indexed by the starting address of the last row currently being accessed
in the instruction cache (i.e., the current PC). The entry read from the DBTB can be viewed
as two BTB entries, BTB1 and BTB2. The DBTB entry indexed may match both in BTB1
and BTB2, in one or the other, or none at all. This allows a single DBTB entry to be shared
between two different source PCs. Although physically they are one entry, logically they
are separate.
Figure 4.5 is a block diagram of a DBTB entry and shows how it is used in determin-
ing the following two rows’ PC starting address, PC1 and PC2. The tag of the current PC is
checked against the PC tag found in BTB1. If it matches, then the predicted PC1 found in
BTB1 is used. Otherwise, the prediction is to follow through to the next row of the instruc-
tion cache. If the value predicted for PC1 matches the value in BTB2, then the prediction
for PC2 in BTB2 is used; else, PC2 is predicted to be the next row after PC1. The exit posi-
tion in DBTB entry indicates where the control transfer (or follow through) is predicted to
occur. The DBTB entry also contains branch prediction information about all the potential
branches in the referenced line. It may contain no information at all, a 1-bit predictor, a
2-bit saturating predictor, or information for other branch prediction mechanisms.
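The selection logic described above can be sketched as follows (the field names, flat entry layout, and 16-byte line size are illustrative assumptions, not the dissertation's exact encoding):

```python
from dataclasses import dataclass

LINE = 16  # bytes per instruction cache line (assumed)

def line_of(pc):
    """Starting address of the cache line containing pc."""
    return pc & ~(LINE - 1)

@dataclass
class DBTBEntry:
    tag1: int   # BTB1: source PC tag
    pc1: int    # BTB1: predicted starting PC of the next line
    tag2: int   # BTB2: tag checked against the predicted PC1
    pc2: int    # BTB2: predicted starting PC of the line after that

def predict(entry, current_pc):
    """Predict the starting PCs of the following two lines."""
    if entry is not None and entry.tag1 == line_of(current_pc):
        pc1 = entry.pc1                   # BTB1 hit: predicted transfer
    else:
        pc1 = line_of(current_pc) + LINE  # miss: fall through
    if entry is not None and entry.tag2 == line_of(pc1):
        pc2 = entry.pc2                   # BTB2 hit on the predicted PC1
    else:
        pc2 = line_of(pc1) + LINE         # otherwise fall through again
    return pc1, pc2
```

A miss in either half simply predicts the fall-through line, so a single entry can serve two different source PCs, as the text notes.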
To save space, an alternative design of the DBTB would be to logically unify BTB1
and BTB2. Only one PC source can be valid, so only one PC tag is stored. In addition to the
space savings, the time it takes for PC2 to be ready is reduced, because the predicted PC1
does not need to be checked against the tagged PC1 in BTB2. This improvement may be
critical to a processor's cycle time. The drawback is that BTB2 must be invalidated to reflect
a follow-through prediction when BTB1 is updated, which can reduce prediction accuracy.
On the other hand, a BTB2 misprediction does not need to invalidate BTB1.
[Figure: a dual BTB entry holds, for each of BTB1 and BTB2, a PC tag, an exit position, and branch prediction information; the current PC's tag selects either the predicted PC1 or the fall-through line, and the resulting PC1 is compared against BTB2's tag to select PC2 or its fall-through line.]
Figure 4.5: Block Diagram of Dual Branch Target Buffer Entry
The DBTB has many different configurations, many similar to the traditional BTB.
Its options include the number of entries, associativity, branch prediction, and a one or two
tagged system. A DBTB can be used with a simple, extended, or self-aligned cache, and
with or without prefetching. Figure 4.6 is a fetching example without prefetching using the
DBTB. In the previous cycle, BTB1 predicted PC1 to be at Address 0, and BTB2 predicted
Line 0 to exit at position 1 to PC2 at Address 12. While Line 0 and Line 3 are being read,
PC2 is used to index into the DBTB to predict the next PC1 and PC2. Although Line 0 has
a jump, a full fetch block of four instructions is returned.
[Figure: Line 0 contains a jump at position 1 to Address 12 in Line 3; both lines are read in the same cycle, so a full fetch block of four instructions is returned despite the transfer.]
Figure 4.6: Dual Branch Target Buffer Example
4.3 Expected Instruction Fetch
A mathematical model for each type of fetching mechanism from the previous sec-
tion is presented in this section. The model allows the expected instruction fetching perfor-
mance to be calculated. In the next section, the expected performance from this model will
be compared with results from simulation.
4.3.1 Simple Cache
Let L_i be the probability a control transfer occurs at position i, and E_i be the proba-
bility the starting address in the block is at position i. Upon a control transfer, if the target
address is equally likely to enter any position in a block, then

E_1^simple(n, b) = 1 − ((n − 1)/n) c^simple(n, b),
E_i^simple(n, b) = c^simple(n, b)/n,   2 ≤ i ≤ n,   (4.1)

L_i^simple(n, b) = Σ_{j=1}^{i} b (1 − b)^{i−j} E_j^simple(n, b),   (4.2)

where c(n, b) is the probability of a control transfer in a block,

c^simple(n, b) = Σ_{i=1}^{n} L_i^simple(n, b) = n / (1/b + n − 1).   (4.3)

The total expected instructions fetched per cycle for simple fetching is

F^simple(n, b) = Σ_{i=1}^{n} E_i^simple(n, b) r(i, b) = c^simple(n, b)/b = n / (1 + b(n − 1)).   (4.4)
Equation 4.4 is the weighted sum of the expected number of instructions at each possible
starting position.
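Equations 4.3 and 4.4 are easy to check numerically; the following sketch reproduces the simple-cache columns of Table 4.1 to rounding:

```python
def c_simple(n, b):
    """Probability of a control transfer in a block (Eq. 4.3)."""
    return n / (1 / b + n - 1)

def F_simple(n, b):
    """Expected instructions fetched per cycle (Eq. 4.4)."""
    return n / (1 + b * (n - 1))

# Tabulate the b = 1/8 values for the block widths used in Table 4.1.
table = {n: (round(c_simple(n, 1/8), 3), round(F_simple(n, 1/8), 2))
         for n in (1, 2, 4, 8, 16, 32, 64)}
```

As the chapter observes, F^simple approaches 1/b = 8 only for impractically large n.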
4.3.2 Extended Cache
The probability the starting address in the block is at position i for the extended cache
is

E_1^extend(n, b, m) = 1 − ((m − 1)/m) c^extend(n, b, m),
E_i^extend(n, b, m) = c^extend(n, b, m)/m,   2 ≤ i ≤ m.   (4.5)

The probability of a control transfer in a block for the extended cache, given the
extended cache line size m, m ≥ n, is

c^extend(n, b, m) = ((m − n)/m) (1 − (1 − b)^n) + (n/m) · nb / (1 + b(n − 1)).   (4.6)

The expected instructions fetched per cycle is

F^extend(n, b, m) = ((m − n)/m) r(n, b) + (n/m) F^simple(n, b)
                  = ((m − n)/m) · (1 − (1 − b)^n)/b + (n/m) · n / (1 + b(n − 1)).   (4.7)
With the cache line size extended beyond the desired n instructions, if there is a control
transfer, n out of m times it is expected to transfer into the last n instructions of the block,
which behaves as the simple fetching case where fewer than n instructions are available.
The rest of the time, n instructions will be available.
4.3.3 Self-aligned Cache
The probability of a control transfer in a block for the self-aligned cache is

c^align(n, b) = 1 − (1 − b)^n.   (4.8)

The expected instructions fetched per cycle for the self-aligned cache is the expected block
run length of width n,

F^align(n, b) = r(n, b) = (1 − (1 − b)^n)/b,   (4.9)

because n instructions will always be read from the instruction cache.
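Equations 4.7 and 4.9 can be checked against Table 4.1 in the same spirit (a numerical sketch):

```python
def r(n, b):
    """Expected block run length of width n."""
    return (1 - (1 - b) ** n) / b

def F_simple(n, b):
    """Eq. 4.4."""
    return n / (1 + b * (n - 1))

def F_extend(n, b, m):
    """Eq. 4.7: a weighted mix of the full-width and simple cases."""
    return (m - n) / m * r(n, b) + n / m * F_simple(n, b)

def F_align(n, b):
    """Eq. 4.9: the self-aligned cache always reads n instructions."""
    return r(n, b)
```

With m = 2n the extended result is exactly the average of the simple and self-aligned results, matching the observation made about Figure 4.8 below.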
4.3.4 Prefetching
All three cache techniques can be used in combination with prefetching. The fetch
and decode widths are not equal with prefetching. As a result, q, the fetch width, may now
be substituted for n, the decode width, as a parameter to some of the equations previously
defined that did not use prefetching, as will be indicated.

Let I_i^type be the probability exactly i instructions are available up to and including a
control transfer instruction or the end of the block, where type is one of the three different
cache types: simple, extend, or align. The equations for the three types are:

I_i^simple = (1 − b)^{i−1} E_{q−i+1}^simple(q, b) + Σ_{j=1}^{q−i} b (1 − b)^{i−1} E_j^simple(q, b),   (4.10)

I_i^extend = (1 − b)^{i−1} E_{m−i+1}^extend(q, b, m) + (1 − b)^{i−1} b Σ_{j=1}^{m−i} E_j^extend(q, b, m),   1 ≤ i ≤ q − 1,
           = (1 − b)^{q−1} Σ_{j=1}^{m−q+1} E_j^extend(q, b, m),   i = q,
           = 0,   otherwise,   (4.11)

I_i^align = (1 − b)^{i−1} b,   1 ≤ i ≤ q − 1,
          = (1 − b)^{q−1},   i = q,
          = 0,   otherwise.   (4.12)
Let P_i be the probability the prefetch buffer contains i instructions. Figure 4.7 il-
lustrates the transition from one buffer state to another. It does not show all possible
transitions. The prefetch buffer increases in size when the number of new instructions
is greater than n. It will remain in the same state if exactly n new instructions are available.
It decreases in size when fewer than n new instructions are available. The zero and full
boundary states have additional possible transitions.
[Figure: a chain of buffer states P_0 through P_p; receiving more than n new instructions moves the state up, exactly n keeps it unchanged, and fewer than n moves it down, with additional transitions at the zero and full boundary states.]
Figure 4.7: Prefetch Buffer State Diagram
The probability the prefetch buffer is in state i is

P_i^type = Σ_{j+k≤n} P_j^type I_k^type,     i = 0,
         = Σ_{j+k=n+i} P_j^type I_k^type,   1 ≤ i ≤ p − 1,
         = Σ_{j+k≥n+p} P_j^type I_k^type,   i = p,   (4.13)

where 0 ≤ j ≤ p and 1 ≤ k ≤ q in each sum. Also, Σ_{i=0}^{p} P_i = 1. Equation 4.13 can
be expanded as a system of linear equations and solved for each P_i.
The total expected instruction fetch for each of the different cache types with pre-
fetching is

F_prefetch^type(p, q, n, b) = n − Σ_{i=0}^{n−1} (n − i) Σ_{j+k=i, 0≤j≤p, 1≤k≤q} P_j^type I_k^type.   (4.14)

Notice Equation 4.14 depends only on the n − 1 smallest prefetch buffer states, since if there
are n − 1 or more instructions in the prefetch buffer, n instructions are guaranteed for that
cycle.

A problem can arise with prefetching and the simple cache type. The prefetch buffer
can be full, and instructions from the fetch block go unused. If this happens, the starting
address of the next cycle will not be the first position, so q instructions will not be available.
Therefore, Equation 4.1 needs to be modified to include this effect, unless a hardware
solution similar to that of the extended cache is included: the hardware would need to save
instructions left over on a prefetch buffer overflow for the following cycle. If this is done,
Equation 4.10 is an accurate model.
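Rather than solving the linear system symbolically, the steady state of Equation 4.13 can also be found by iterating the chain; the sketch below does this for the self-aligned cache with illustrative parameters (n = 4, q = 8, p = 8):

```python
b, n, q, p = 1/8, 4, 8, 8

# I_k for the self-aligned cache (Eq. 4.12)
I = [0.0] * (q + 1)
for k in range(1, q):
    I[k] = (1 - b) ** (k - 1) * b
I[q] = (1 - b) ** (q - 1)

def step(P):
    """One transition of the buffer-state chain (Eq. 4.13): with j buffered
    and k newly fetched, n leave for the decoder and the rest stay,
    clamped to the [0, p] boundary states."""
    nxt = [0.0] * (p + 1)
    for j in range(p + 1):
        for k in range(1, q + 1):
            nxt[min(max(j + k - n, 0), p)] += P[j] * I[k]
    return nxt

P = [1.0] + [0.0] * p          # start with an empty buffer
for _ in range(200):           # iterate to the steady state
    P = step(P)

# Expected fetch (Eq. 4.14): n minus the expected shortfall
F = n - sum((n - i) * P[j] * I[k]
            for i in range(n)
            for j in range(p + 1)
            for k in range(1, q + 1) if j + k == i)
```

With these parameters the computed F lands just below the decode width of 4, consistent with the near-maximum fetch rates reported for the self-aligned case in Figure 4.9.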
4.3.5 Dual Block Fetching
Fetching two blocks per cycle (via the DBTB) with the simple, extended, or self-
aligned cache without prefetching is simply twice the expected value for half the block
size,

F^{dbtb-type}(n, b) = 2 F^type(n/2, b).   (4.15)

If prefetching is used with dual block fetching, the quantity I_k^type in Equation 4.13
and Equation 4.14 is replaced with

I_k^{dbtb-type}(q, b) = Σ_{j=0}^{k} I_j^type(q/2, b) I_{k−j}^type(q/2, b).   (4.16)
4.3.6 Evaluation
Table 4.1 lists the evaluation of the simple, extended, and self-aligned cache types
without prefetching for b = 1/8 and for different values of the decode block width n. The
value chosen for b, the probability of a control transfer, is typical of RISC architectures.
The probability of a control transfer in a block is listed as well as the expected instructions
fetched per cycle. For n = 64, the fetching rate is close to 1/b. Although this large fetching
width achieves excellent fetching performance, it may not be practical to implement in
hardware.
Table 4.1: Expected Instruction Fetch

n   c^simple(n, 1/8)   F^simple(n, 1/8)   c^extend(n, 1/8, 2n)   F^extend(n, 1/8, 2n)   c^align(n, 1/8)   F^align(n, 1/8)
1 .125 1.00 .125 1.00 .125 1.00
2 .222 1.78 .228 1.83 .234 1.88
4 .364 2.91 .389 3.11 .414 3.31
8 .533 4.26 .595 4.76 .656 5.25
16 .696 5.57 .789 6.32 .882 7.06
32 .821 6.56 .903 7.23 .986 7.89
64 .901 7.21 .951 7.61 1.00 8.00
Figure 4.8 shows the expected instruction fetch for the simple, extended, and self-
aligned cases without prefetching for b = 1/8. Although, ideally, a block size of n would
give a fetching rate of n instructions per cycle, the difference between this ideal and
the actual rate increases as n increases. Instead, the rate approaches 1/b (8 in this instance)
for each case. The disadvantage of the simple and extended cache techniques is the lower
rate at which they reach this limit; it takes a significantly larger value of n to reach the same
expected fetch performance. In the extended case with m = 2n, the value is the average of
the values for the align and simple cases for each n.
[Figure: expected fetch versus n for b = 1/8, with curves for the ideal, align, extend, and simple cases; the three cache curves flatten toward 8 as n grows while the ideal line grows as n.]
Figure 4.8: Expected Instruction Fetch without Prefetching
[Figure: expected fetch versus prefetch buffer size p for b = 1/8, n = 4, with one curve for each fetch width q = 4 through q = 8.]
Figure 4.9: Self-Aligned Expected Instruction Fetch with Prefetching (n = 4)
Figure 4.9 shows the expected instruction fetch for the self-aligned cache with pre-
fetching for b = 1/8 and n = 4, varying p and q. In this case, it takes very little increase
in q and p to come close to the maximum fetching rate; even at modest values of q and p,
the expected fetch is already over 3.95.
[Figure: expected fetch versus prefetch buffer size p for b = 1/8, n = 8, with one curve for each fetch width q = 8 through q = 16.]
Figure 4.10: Self-Aligned Expected Instruction Fetch with Prefetching (n = 8)
Figure 4.10 shows the expected instruction fetch for the self-aligned cache with
prefetching for b = 1/8 and n = 8, varying p and q. The value of the different curves
for each q is identical for p ≤ q − n. After that point, each curve branches out and
approaches its r(q, b) limit. To reach the ultimate limit of 1/b, both q and p need to increase.
[Figure: expected fetch versus prefetch buffer size p for the simple cache with b = 1/8, n = 8, with one curve for each fetch width q = 8 through q = 16.]
Figure 4.11: Simple Expected Instruction Fetch with Prefetching
Figure 4.11 shows the expected instruction fetch for the simple cache with prefetch-
ing for b = 1/8 and n = 8, varying p and q. Unlike the self-aligned case, each q curve
is distinct and lies above the previous one. Even without prefetching (p = 0), the
values are not identical, because the increase in the line size to q reduces the chance that an
unaligned target address will be unable to return n instructions.
[Figure: expected fetch versus prefetch buffer size p for the align, extend, and simple cache techniques with b = 1/8, n = 8, q = p + n.]
Figure 4.12: Different Cache Techniques with Prefetching
Figure 4.12 shows the expected instruction fetch for the simple cache, extended
cache, and self-aligned cache with prefetching for b = 1/8, n = 8, q = p + n, and
m = 2q (extended only), versus p. Similar to the cases without prefetching, the extended
cache's fetching performance is between those of the simple and self-aligned cache techniques.
[Figure: expected fetch versus prefetch buffer size p for dual block fetching with the align, extend, and simple cache techniques, b = 1/8, n = 16, q = p + n.]
Figure 4.13: Different Cache Techniques for Dual Block Fetching with Prefetching
Figure 4.13 shows the expected instruction fetch for the simple cache, extended
cache, and self-aligned cache for dual block fetching with prefetching. The parameters
are b � ���, n � ��, q � p� n, and m � �q (extended only) verses p. The plot shows that
a simple cache performs significantly less than the self-aligned and extended cache.
The plots presented show that prefetching can significantly increase expected fetch-
ing. As the fetch width, q, increases, the expected fetch rate reaches a higher plateau.
Unfortunately, with b = 1/8 and a decode width of eight, an extensive amount of hardware
– a fetch width of sixteen, a prefetch buffer size of thirty-two, and a self-aligned cache –
is required to reach almost 7 instructions fetched per cycle, still noticeably below the goal
of 8 instructions fetched per cycle. It is difficult to achieve a high fetching rate under those
conditions because the decode width is the same size as the 1/b limit. On the other hand,
if two blocks are fetched in a cycle with prefetching, a high rate close to 14 instructions
fetched per cycle can be achieved.
4.4 Results and Discussion
This section compares the expected instruction fetch with the actual performance
of simulations from the SPEC95 benchmark suite running on the SPARC architecture.
Programs ran until completion or the first four billion instructions.
Table 4.2 shows the predicted and observed instruction fetch count results of these programs using the three cache techniques without prefetching (n = 4). Table 4.3 and Table 4.4 show the predicted and observed instruction fetch count results using the three cache techniques with prefetching (n = 4, q = 8, p = 4; and n = 8, q = 16, p = 8, respectively). The first column in each table shows the value observed for 1/b, the average run length. The average dynamic run length of a program is the total number of instructions executed divided by the number of instructions that transferred control. The observed value of b for each program was used in its calculation of the expected fetch.
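The run-length calculation can be sketched as follows. This is a hypothetical illustration; the trace format with a `transfers_control` flag is assumed, not taken from the simulator:

```python
def average_run_length(trace):
    """Average dynamic run length (1/b): total instructions executed
    divided by the number of instructions that transferred control."""
    transfers = sum(1 for ins in trace if ins["transfers_control"])
    return len(trace) / transfers if transfers else float("inf")
```

For example, a trace with one control transfer every eight instructions yields 1/b = 8.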
A concern with the fetching model presented is that it assumes a uniform distribution of
run lengths, but observed program behavior shows the distribution is not uniform. It does,
however, generally follow the expected distribution. When the expected
fetch is calculated via a weighted sum, the outcome is reasonably accurate. As can be
seen in the tables, the difference between the predicted and observed fetch count is usually
within a few percent.
Table 4.2: Instructions Fetched per Cycle (n = 4)
Program 1/b simple extend align
pred obs pred obs pred obs
go 11.6 3.18 3.20 3.35 3.34 3.51 3.56
gcc 7.5 2.86 2.86 3.07 3.06 3.27 3.41
m88ksim 10.2 3.09 3.01 3.27 3.12 3.45 3.48
compress 9.93 3.07 3.31 3.25 3.43 3.44 3.59
li 6.9 2.78 2.75 2.99 3.10 3.21 3.31
ijpeg 21.5 3.51 3.51 3.62 3.59 3.73 3.73
perl 7.8 2.89 2.88 3.09 3.14 3.29 3.36
vortex 9.2 3.01 2.90 3.20 3.02 3.39 3.57
tomcatv 22.0 3.52 3.40 3.63 3.47 3.74 3.69
swim 114 3.90 3.86 3.92 3.93 3.95 3.96
su2cor 11.7 3.19 3.25 3.35 3.40 3.52 3.62
hydro2d 12.9 3.24 3.24 3.40 3.34 3.56 3.63
mgrid 79.6 3.85 3.68 3.89 3.81 3.93 3.85
applu 25.3 3.58 3.45 3.67 3.61 3.77 3.73
turb3d 14.6 3.32 3.37 3.46 3.46 3.61 3.69
apsi 54.9 3.79 3.76 3.84 3.81 3.89 3.87
fpppp 13.6 3.28 3.13 3.43 3.30 3.58 3.60
wave5 23.6 3.55 3.54 3.65 3.62 3.75 3.74
Table 4.3: Instructions Fetched per Cycle with Prefetching (n = 4)
Program 1/b simple extend align
pred obs pred obs pred obs
go 11.6 3.95 3.93 3.95 3.97 3.99 4.00
gcc 7.5 3.76 3.62 3.91 3.77 3.96 3.99
m88ksim 10.2 3.92 3.89 3.97 3.95 3.99 4.00
compress 9.93 3.91 3.92 3.97 3.98 3.99 3.99
li 6.9 3.69 3.71 3.87 3.87 3.94 3.98
ijpeg 21.5 3.99 3.96 4.00 3.96 4.00 4.00
perl 7.8 3.79 3.66 3.92 3.80 3.97 3.99
vortex 9.2 3.88 3.58 3.96 3.68 3.98 4.00
tomcatv 22.0 3.99 3.95 4.00 3.99 4.00 4.00
swim 114 4.00 4.00 4.00 4.00 4.00 4.00
su2cor 11.7 3.95 3.86 3.99 3.99 3.99 4.00
hydro2d 12.9 3.97 3.72 3.99 3.92 4.00 4.00
mgrid 79.6 4.00 4.00 4.00 4.00 4.00 4.00
applu 25.3 4.00 4.00 4.00 4.00 4.00 4.00
turb3d 14.6 3.98 3.69 3.99 3.77 4.00 3.99
apsi 54.9 4.00 4.00 4.00 4.00 4.00 4.00
fpppp 13.6 3.97 3.74 3.99 3.81 4.00 3.99
wave5 23.6 3.99 3.96 4.00 3.99 4.00 4.00
Table 4.4: Instructions Fetched per Cycle with Prefetching (n = 8)
Program 1/b simple extend align
pred obs pred obs pred obs
go 11.6 6.75 6.75 7.33 7.33 7.65 7.75
gcc 7.5 5.32 5.10 5.98 5.58 6.55 6.52
m88ksim 10.2 6.36 6.22 7.02 6.85 7.44 7.44
compress 9.93 6.27 7.19 6.94 7.45 7.38 7.64
li 6.9 5.03 4.89 5.66 5.86 6.22 6.49
ijpeg 21.5 7.79 7.14 7.93 7.51 7.97 7.89
perl 7.8 5.45 5.16 6.13 5.65 6.69 6.64
vortex 9.2 6.02 5.38 6.71 5.81 7.20 7.05
tomcatv 22.0 7.56 7.03 7.83 7.51 7.92 7.82
swim 114 8.00 7.95 8.00 7.98 8.00 7.99
su2cor 11.7 6.77 6.10 7.35 6.61 7.66 7.53
hydro2d 12.9 7.03 6.39 7.53 6.80 7.77 7.25
mgrid 79.6 8.00 7.97 8.00 7.99 8.00 8.00
applu 25.3 7.88 7.56 7.96 7.72 7.98 7.96
turb3d 14.6 7.30 5.87 7.69 6.29 7.85 6.92
apsi 54.9 7.99 7.93 8.00 7.98 8.00 8.00
fpppp 13.6 7.15 6.09 7.60 6.43 7.80 6.96
wave5 23.6 7.85 7.42 7.95 7.73 7.98 7.92
The expected and observed performance for dual block fetching without prefetching is exactly twice the values listed in Table 4.2 for n = 4. Table 4.5 lists the performance of SPEC95 for dual block fetching with prefetching (n = 8, q = 16, p = 8). The instructions fetched per cycle (IFPC) is listed as well as the instructions per fetch block (IPB). The results show that a close to ideal (n = 8) fetching rate is possible when a two-block fetching mechanism, such as the dual branch target buffer, is used with an extended or self-aligned cache and prefetching. In this case, the fetching hardware mechanism no longer restricts instruction fetching, opening up the possibility of exploiting instruction-level parallelism and sustaining a high instructions-per-cycle execution rate.
Using a 256-entry, direct-mapped, two-tagged DBTB, the miss rate was between 10%
and 20% for most of the SPEC95 benchmarks. Also, the miss rate for BTB2 was usually
slightly higher than that for BTB1. BTB1 and BTB2 each behaved similarly to a standard BTB.
Although perfect branch accuracy was assumed in Table 4.5 (to make a fair comparison to
the other data), it is important to realize that accurate branch prediction becomes critical
since more branches need to be predicted accurately per fetch block. Therefore, the next
chapter presents a mechanism to predict two blocks per cycle with a greater accuracy than
the dual branch target buffer.
The overall performance will be much lower than the fetching rates shown when
branch prediction, cache misses, execution, etc., of a real microprocessor are simulated.
In addition, the difference between the values will be much smaller. These facts do not
devalue the results presented. These results show the upper limit achievable using different
fetching mechanisms presented, both in theory and in simulation.
Table 4.5: IPB and IFPC for Dual Block Fetching with Prefetching
Program 1/b simple extend align
IPB IFPC IPB IFPC IPB IFPC
go 11.6 9.90 7.79 11.2 7.90 12.3 7.98
gcc 7.5 8.01 7.18 9.3 7.49 10.5 7.93
m88ksim 10.2 9.24 7.68 11.0 7.87 11.7 7.98
compress 9.93 9.98 7.78 10.5 7.86 11.8 7.95
li 6.9 7.64 7.37 9.8 7.74 10.4 7.91
ijpeg 21.5 12.0 7.89 12.8 7.91 13.6 8.00
perl 7.8 8.37 7.36 9.9 7.67 10.7 7.93
vortex 9.2 8.07 7.14 10.6 7.52 11.8 7.99
tomcatv 22.0 11.7 7.88 12.4 7.97 13.9 8.00
swim 114 15.0 7.99 15.3 8.00 15.8 8.00
su2cor 11.7 9.5 7.53 11.2 7.78 12.4 7.99
hydro2d 12.9 9.8 7.36 11.5 7.90 12.7 8.00
mgrid 79.6 15.5 8.00 15.6 8.00 15.7 8.00
applu 25.3 12.4 7.97 13.2 8.00 13.9 8.00
turb3d 14.6 10.6 7.38 11.9 7.53 12.5 7.94
apsi 54.9 14.0 7.98 14.7 8.00 15.1 8.00
fpppp 13.6 13.0 7.79 13.5 7.92 14.9 7.99
wave5 23.6 12.4 7.89 13.2 7.96 14.0 7.99
Chapter 5
Multiple Branch and Block Prediction
Multiple branches and multiple blocks must be predicted in a single cycle to achieve
a high instruction fetching rate. This chapter describes how to predict multiple branches in
a single block and how to predict multiple blocks per cycle.*
A block of instructions may contain multiple basic blocks because some of the con-
ditional branches encountered may be predicted not taken. A prediction mechanism that
can only predict basic blocks limits potential performance improvement, since it may only
be able to predict one line to read from the instruction cache instead of multiple lines from
the instruction cache. Hence, Section 5.1 introduces how to predict multiple branches in a
single block. This allows a block of sequential instructions to be read up to the first control
transfer.
As Chapter 4 concluded, multiple blocks of sequential instructions need to be fetched
each cycle to overcome the limitation of single-block fetching. As a result, an accurate
prediction mechanism is required to predict multiple blocks. Therefore, Section 5.2 intro-
duces a novel mechanism to predict two blocks per cycle, using a select table. In addition,
*Parts of this chapter appear in the Third International Symposium on High-Performance Computer Architecture [47].
Section 5.4 explains how the select table can be expanded to predict multiple blocks per cy-
cle. An important feature of this prediction mechanism is that multiple blocks are predicted
in parallel.
The performance of these new prediction mechanisms is studied in Section 5.3. The
accuracy of predicting multiple branches per block is shown to be as good as predicting
each branch one at a time. The effective fetching performance of predicting one block,
two blocks, and multiple blocks per cycle is presented. Different types of misprediction
resulting from dual block prediction are described, and distributions of the contribution of
each type to the average branch execution penalty are given.
Finally, Section 5.5 presents cost estimates of multiple branch and block prediction
in terms of hardware storage and timing requirements.
5.1 Multiple Branch Prediction
The multiple global adaptive branch prediction by Yeh and Patt discussed in Chap-
ter 2 retains the accuracy of their original single branch prediction. However, multiple
reads from the PHT are not necessary for predicting multiple branches in a single block.
Yeh’s original two-level adaptive branch prediction can easily be scaled to perform mul-
tiple branch prediction for a single block. All of his schemes involve finding pattern his-
tory information to predict a single branch using a 2-bit up/down saturating counter (see
Figure 2.4). A pattern history entry is expanded to contain information not for one branch
instruction, but for an entire block of potential branch instructions. For example, if eight
instructions per block are being fetched, a PHT entry will contain eight 2-bit counters, one
for each position in a block.
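A minimal sketch of such a blocked PHT, assuming a direct-mapped table and weakly-not-taken initialization (both details are illustrative, not specified here):

```python
class BlockedPHT:
    """Pattern history table whose entries each hold one 2-bit counter
    per instruction position in a fetch block, rather than one counter
    per branch instruction."""

    def __init__(self, num_entries=1024, block_width=8):
        self.num_entries = num_entries
        # initialize every counter to 01 (weakly not taken)
        self.table = [[1] * block_width for _ in range(num_entries)]

    def read(self, index):
        # a single read yields the counters for an entire block
        return self.table[index % self.num_entries]
```

A single `read` returns all eight counters, so one PHT access suffices per fetched block.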
One important difference is the updating of the global history register (GHR) or
branch history register (BHR). Instead of being updated after the prediction of each in-
dividual branch, it is updated after the prediction for the entire block. Updating the GHR
after each branch requires multiple reads from the PHT. In order to avoid this, the GHR
is updated after each block, possibly containing multiple branches. As a result, only a
single entry needs to be read from the PHT. For example, if three branches are predicted
not taken, not taken, and taken, then the GHR/BHR is shifted to the left three bits and a
“001” inserted. All of Yeh’s original variations may be expanded in this manner, except his
per-addr variation now becomes a per-block variation.
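The block-granularity GHR update can be sketched as follows (the register width is an assumed parameter):

```python
def update_ghr_block(ghr, outcomes, width=12):
    """Shift the GHR left once per branch predicted in the block,
    inserting 1 for taken and 0 for not taken, oldest first."""
    for taken in outcomes:
        ghr = ((ghr << 1) | int(taken)) & ((1 << width) - 1)
    return ghr

# Example from the text: branches predicted not taken, not taken,
# taken shift the GHR left three bits and insert "001".
```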
The difference between Yeh’s multiple global adaptive branch prediction and multi-
ple branch prediction using a blocked PHT can be highlighted by considering an example.
In this example, every other instruction in a block of eight instructions is a conditional
branch. Figure 5.1 shows how Yeh’s multiple global adaptive branch prediction predicts
these four branches. Starting with a GHR of “0001”, it reads 15 entries – which is difficult
to do – and selects four of those entries for prediction. The selection of the second branch
is based on the prediction of the first branch; the selection of the third branch is based on
the prediction of the first and second branches, etc. As a result, the complexity of this mul-
tiplexer selection grows exponentially with the number of branch predictions. In contrast,
the blocked PHT method in Figure 5.2 reads a block of eight sequential counters – which
is easy to read – and selects the appropriate counters based on the least significant bits of
the branch’s address. The blocked PHT can predict all branches in the block, up to and
including the first taken branch.
Figure 5.3 is a block diagram of a multiple branch prediction fetching mechanism.
While the instruction cache is reading the current block of instructions, the instruction
fetcher at a minimum must predict the index of the next line to retrieve from the instruction
Figure 5.1: Multiple Global Adaptive Branch Prediction Example
Figure 5.2: Multiple Branch Prediction with Blocked PHT Example
cache. The complete address may be determined during subsequent cycles. Therefore,
an efficient method to predict target addresses is to use an NLS table. The NLS table is
modified and expanded to be indexed by the instruction block address and contain target
lines for an entire block of instructions. Alternatively, a Branch Target Buffer (BTB) may
be used [24]. The BTB, however, is also modified to be indexed and checked against the
instruction block address and contain target addresses for an entire block of instructions.
The NLS or BTB may be viewed as n separate tables accessed in parallel, which predict the
target address for each of the n possible branch exit positions. The actual target address, if
any, is selected at a later time. An NLS or BTB which predicts targets for a whole block is
called a target array.
In addition, the branch type information is no longer contained in the NLS table, but
in a separate block instruction type (BIT) table. In superscalar fetch prediction, knowing
what type of instructions are in a block is the most critical piece of information. Each
BIT entry contains two bits of information for each instruction in a cache line. This BIT
information may be pre-decoded and contained in the instruction cache line. For a faster
access time, it can be stored in a separate array. Instead of storing BIT information for each
instruction in the cache, a separate direct-mapped BIT table can be used with fewer entries
than the number of lines in the instruction cache. In this case, the predictor initially assumes the BIT information is correct. After the line has been read from the instruction cache, the BIT information is verified and replaced, if necessary.
At a minimum, the BIT information for each instruction in a fetch block must contain two bits, indicating that an instruction is either not a branch, a return instruction, a conditional branch, or another type of branch. If this is expanded to three bits per instruction, it can contain additional information about conditional branches with targets adjacent to the current line, referred to as near-block targets. The offset into the line may be quickly
Figure 5.3: Block Diagram of a Multiple Branch Prediction Fetching Mechanism
added with a log2(n)-bit adder as soon as the branch offset is ready. As a result, near-block
target addresses do not need to be stored in the target array, and the size of the target array
can be reduced.
Given the starting position in the line fetched, BIT and PHT block information, the
instruction fetch control logic uses the instruction type information to find the first uncon-
ditional branch or conditional branch predicted to be taken based on its pattern history.
The next line to be fetched is then selected from a multiplexer whose input contains the
current line, previous line, following line, two lines after the current line, the top of the
return address stack (RAS), and the n possible targets from branches in a block. The BIT
codes and resulting prediction sources are summarized in Table 5.1. A schematic of the
logic required to select the branch position using the BIT and PHT information for a four
instruction block is shown in Figure 5.4 (except BIT type “011” replaces BIT type “111”
in Table 5.1 to simplify logic).
Table 5.1: Block Information Types and Prediction Sources
Instruction Type Prediction Source
0 0 0 Non-branch Fall-through PC
0 0 1 Return Return Stack
0 1 0 Other branches Always use Target Array
0 1 1 Conditional branch, Target Array entry or Fall-
long target through, depending on PHT
1 0 0 Cond. branch, prev line Current line - line size
1 0 1 Cond. branch, same line Current line
1 1 0 Cond. branch, next line Current line + line size
1 1 1 Cond. branch, next line+1 Current line + 2 * line size
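Table 5.1 can be read as a small decoding function. The sketch below is illustrative: `pht_taken` stands for the direction read from the blocked PHT for that position, and the source names are hypothetical labels, not hardware signal names:

```python
def prediction_source(bit_code, pht_taken):
    """Select the next-line prediction source from a 3-bit BIT code
    (Table 5.1) and the blocked-PHT direction for that position."""
    if bit_code == 0b000:
        return "fall-through"
    if bit_code == 0b001:
        return "return-stack"
    if bit_code == 0b010:
        return "target-array"          # other branches: always use it
    if not pht_taken:
        return "fall-through"          # conditional predicted not taken
    return {0b011: "target-array",     # conditional, long target
            0b100: "current-line - 1", # near-block targets:
            0b101: "current-line",     #   previous, same,
            0b110: "current-line + 1", #   next,
            0b111: "current-line + 2"}[bit_code]  # next + 1
```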
[Schematic: at each of the four block positions, the BIT bits and the PHT taken bit are combined, and a priority encoder selects the first exiting position or fall-through.]
Figure 5.4: Branch Selection Logic
The processor should keep track of the target address of each conditional branch that
is predicted not taken. In the event it is mispredicted, the correct block may be immedi-
ately fetched the following cycle after branch resolution. Otherwise, an additional cycle is
required to read the target address from the target array.
Table 5.2 is an example showing a line of instructions and the result of prediction.
The type of instruction, BIT information code, and PHT entry values are given. The starting
position corresponds to the beginning of a block. The exit position is where an instruction
transfers control. For each possible starting position, the exit position, next line select pre-
diction, target used for a misprediction, and the new prediction used after a misprediction
are shown. NLS(x) indicates that the target address for the exit position x is selected from
the NLS target array. For instance, if the starting position is 4, the exit position is 5 where a
conditional branch is predicted to be taken, and the NLS at position 5 is used for the target
address. If the branch is mispredicted, the return address stack is used as the target for the
next block. Since the pattern history indicates a “second chance” bit, the prediction will
not change the next time the branch is encountered.
Table 5.2: Next Line Prediction Example Based on Starting Position
Position in block 0 1 2 3 4 5 6 7
instruction type shift branch add jump sub branch move return
BIT value 000 100 000 010 000 011 000 001
PHT value XX 10 XX XX XX 11 XX XX
exit position 1 1 3 3 5 5 7 7
select prediction line-- line-- NLS(3) NLS(3) NLS(5) NLS(5) RAS RAS
target on misprediction NLS(3) NLS(3) N/A N/A RAS RAS N/A N/A
select replacement NLS(3) NLS(3) N/A N/A NLS(5) NLS(5) N/A N/A
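The exit positions in Table 5.2 follow from a simple scan for the first control transfer. This sketch assumes a 2-bit counter predicts taken when its high bit is set:

```python
def exit_position(start, bit_codes, counters):
    """Scan from `start` for the first instruction that transfers
    control: returns, unconditional branches, or conditional branches
    whose 2-bit counter predicts taken."""
    for pos in range(start, len(bit_codes)):
        code = bit_codes[pos]
        if code == 0b000:
            continue                      # not a branch
        if code in (0b001, 0b010):
            return pos                    # return or unconditional
        if counters[pos] >= 0b10:
            return pos                    # conditional predicted taken
    return len(bit_codes) - 1             # no transfer: whole block used

# The line from Table 5.2 (shift, branch, add, jump, sub, branch,
# move, return):
bits = [0b000, 0b100, 0b000, 0b010, 0b000, 0b011, 0b000, 0b001]
cnts = [None, 0b10, None, None, None, 0b11, None, None]
exits = [exit_position(s, bits, cnts) for s in range(8)]
# exits == [1, 1, 3, 3, 5, 5, 7, 7], matching the table
```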
5.2 Dual Block Prediction
Once an instruction which transfers control is encountered, no more instructions in a
block may be used. Another cycle is required to fetch from a different line in the instruction
cache. This is a barrier to fetching a large number of instructions in a single cycle. Hence,
what is needed is the capability to fetch multiple blocks in the same cycle. The problem is
determining which blocks to fetch each cycle.
Fetching two blocks per cycle requires predicting two lines per cycle. In order to
accomplish this prediction completely in parallel, only the address of the two lines currently
being fetched and any available branch history information may be used as a basis for
prediction. Using the PC from the last block currently being fetched, the first line can be
predicted using methods from the previous section. The difficulty arises in predicting the
following (second) line.
The underlying problem with predicting two lines to fetch is that the prediction for the
second line is dependent on the first. Hence, the PHT and BIT information for the second
line cannot be fetched until the first line has been predicted, and the new PC and GHR have
been determined. The solution to this problem is essentially to predict the prediction. The
end result of using the BIT and PHT for prediction is a multiplexer selector. Therefore,
because the BIT and PHT information for the second block prediction are not available, we
store the multiplexer selection bits of a previous prediction for that block into a select table
(ST). The select table is indexed by the exclusive-or of the GHR and the current PC block
address [25]. This index is the same as the index into the PHT for the prediction of the first
block. The select value read from the select table is used to directly control the multiplexer
for the second block prediction. A 3-bit selector can be used with a block width of four
(n = 4). Four bits are required for n = 8.
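The select-table index can be sketched as a gshare-style hash (the table size is an assumed parameter):

```python
def select_table_index(ghr, pc_block_addr, table_entries=4096):
    """Index the select table (and the first block's PHT) with the
    exclusive-or of the global history register and the current PC
    block address."""
    return (ghr ^ pc_block_addr) % table_entries
```

The same index is later used to verify the stored selector against the prediction recomputed from BIT and PHT information.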
5.2.1 Single Selection
Figure 5.5 is a block diagram of a dual block (two-block) prediction fetching mech-
anism. It has two multiplexors to select the next two lines to fetch. The first selection is
calculated from the PHT and BIT information. The second selection comes from the select
table. To accurately predict target addresses, a dual target array is used. It provides n target
addresses for the first block, and n target addresses for the second block. The address of the
second block currently being fetched is used as the index into both target arrays. Although
the NLS must have two target arrays, a BTB may use its tag to indicate the block number
(one or two).
Undesirable duplication of target addresses is inherent to the dual target array. A
branch’s target address could be stored in both target arrays. Also, it may be represented
in the second target array multiple times, since a branch may have multiple predecessor
blocks. This duplication, however, does not significantly reduce its accuracy compared to
a single target array.
The second multiplexer shown in Figure 5.5 is dependent on the output of the first
multiplexer. An addition to determine the fall-through address of the first prediction or
other near-block targets is required. Although the addition of a line index is relatively
small, if timing is critical, each of the n targets from the first target array and the RAS
can calculate the fall-through (and possibly near target(s)) indexes before the first block
selector is ready. The fall-through adder used as input for the second multiplexer can now
be replaced with a multiplexer which selects the correct pre-computed fall-through address
from the first target.
The RAS sends the top of its stack to the input of the first multiplexer. For the second
multiplexer, if the first block performs a call, the RAS input is bypassed with the address
after the exit address of the first block. If the first block performs a return, the RAS sends
the second address off the stack. Otherwise, the top of the stack is sent to the second
multiplexer. In addition, the target array should encode whether or not its target is a result
of a call, so that proper return bypassing can take place.
Figure 5.6 displays the pipeline stages involved in the dual block prediction. The first
stage is the prediction of the next two blocks (bX denotes block number X). The selector for
the first predicted block is computed from BIT and PHT information. The second block
is predicted by reading the select table. The second stage fetches the two blocks. It also
verifies the select prediction in the previous stage against prediction computed using the
PHT and BIT information. If the prediction is different, then a misselect has occurred. The
previous prediction is replaced with the new prediction in the select table, and the new block
is fetched. Also during the second stage, the predicted target address of the first block is
checked against the calculated branch offset or immediate branch from the previous block
(misfetch). The third stage checks for a misfetch of the second block.
From the pipeline diagram, two problems are observed. One problem is with the
updating of the GHR. The GHR can reflect the outcome of the first block prediction, but
for the second block prediction, there is no information about the number of conditional
branches predicted or their outcome. Therefore, the select table entry needs to contain
prediction information to update the GHR. This can be accomplished by using log2(n) bits
to represent the number of not taken branches and one bit to represent either a fall-through
case or a taken branch. GHR prediction may be avoided by assuming a limited number of
conditional branches have been predicted “not taken.” Multiple entries in the PHT and ST
Figure 5.5: Block Diagram for Dual Block Prediction
Figure 5.6: Pipeline Stage Diagram for Dual Block Prediction
are read, and the correct entries are chosen once the number of branches in the previous
block has been determined.
Near-block select prediction for the second block causes another problem. It does
not give information about the offset into the line. As a result, up to log2(n) extra bits are
needed to provide this information, or there may be enough time to calculate the line offset
after its source block has been read. To avoid this problem, targets can always be predicted
from the target array instead of using near-block targets. The GHR and position prediction
(if any) are verified at the same time as the select prediction.
5.2.2 Double Selection
The selection prediction can be used on the first block as well as the second block.
Selection prediction of both blocks is referred to as double selection. Figure 5.7 is a block
diagram of two-block prediction using double prediction. Double selection increases the
misselect penalty. However, the benefit is the removal of BIT storage altogether. The
instruction type is decoded after the line has been fetched. The select table is still indexed
by the exclusive-or of the GHR and starting address, but it is now a dual select table,
providing selectors for both multiplexors. Timing concerns regarding the calculation of the
selector bits for the first target no longer exist. The potential for timing problems from the
adders between the multiplexers is significantly reduced. Selector and GHR prediction bits
for both blocks are required, although the starting position prediction for the second block
is no longer needed.
Figure 5.8 is a pipeline diagram using double selection. The first stage predicts the
next two blocks from the dual select table. The second stage fetches the two blocks, and
verifies the first block’s select prediction and target address. The third stage verifies the
second block’s select prediction and target address.
5.2.3 Misprediction
The penalties for the different types of possible mispredictions are listed in Table 5.3.
It is assumed that it takes four cycles to resolve a branch after it has been fetched. For the
first block, if there are remaining instructions required to be re-fetched after a conditional
branch was mispredicted taken, then it will take an additional cycle. A misprediction on
the second block always requires another cycle. There is a one-cycle misselect or GHR
misprediction penalty when single selection is used for the second block.
With double selection, the first block has a one cycle penalty while the second block
takes two cycles for a misselection. Since a misselect is detected during or immediately
after the instructions have been fetched, instructions that would have been discarded on a
taken branch become valid, and no re-fetch cycle is needed. A misfetch takes one cycle for
the first block and two cycles for the second block to detect.
Since multiple blocks are being fetched using different cache lines, a multiple banked
instruction cache is required. With dual block fetching, two lines are fetched simultane-
ously, so they may map into the same cache bank. Should a conflict arise, the second line
is read the next cycle.
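The bank-conflict check can be sketched as follows (the line size and bank count are illustrative values, not figures from the dissertation):

```python
def same_bank(addr_a, addr_b, line_size=32, num_banks=8):
    """Two simultaneously fetched lines conflict when their line
    indexes map to the same bank of a multi-banked instruction cache;
    the second read then waits a cycle."""
    return (addr_a // line_size) % num_banks == \
           (addr_b // line_size) % num_banks
```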
In order to facilitate recovery from a mispredicted branch, each conditional branch is
assigned a bad branch recovery (BBR) entry, which provides information on how to update
branch prediction tables and provide a new target. The processor must create this entry and
keep track of it as the branch moves down the execution pipeline. For instance, a processor
Figure 5.7: Block Diagram for Dual Block Prediction Using Double Selection
Figure 5.8: Pipeline Stage Diagram for Dual Block Prediction Using Double Selection
Table 5.3: Misprediction Penalties
Misprediction Single Select Double Select
1st block 2nd block 1st block 2nd block
Conditional branch 4* 5 4* 5
Return 4 5 4 5
Misfetch indirect 4 5 4 5
Misfetch immediate 1 2 1 2
Misselect N/A 1 1 2
GHR N/A 1 1 2
BIT 1 1 N/A N/A
I-cache bank conflict 0 1 0 1
* Add one cycle if instructions remain and need to be re-fetched.
could use a table with a fixed number of BBR entries, and if there are more unresolved
branches than entries, the processor would stall. Alternatively, the processor could store
this information with each instruction in an instruction window. Some processors, such as
the SDSP, have unused storage in an instruction window entry for a branch instruction, and
this extra space is used to store recovery information [48].
Table 5.4 lists a description and sizes of the fields in a recovery entry. A recovery
entry is created after a conditional branch is predicted using BIT and PHT information.
When a prediction is made for a conditional branch, another prediction is made assuming
its original prediction is incorrect. If a branch is predicted not taken, then the alternate
target address is the branch’s target address. If it is predicted taken, then the alternate
address is the next control transfer or fall-through address in its block (see the example in
Table 5.2). The alternate target address is entered into the recovery entry. In addition, a
replacement selector and new GHR are generated.
Table 5.4: Bad Branch Recovery Entry
Bits Description
1 Block 1 or 2
1 Predicted taken or not taken
1 Second chance
8-12 PHT/ST index
2n PHT block (optional)
8-12 Corrected GHR
8-11 Replacement selector
10/30 Corrected i-cache index or full address
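A BBR entry could be represented as the following record. The field names are illustrative; the widths follow Table 5.4, though the values here are held as plain integers:

```python
from dataclasses import dataclass

@dataclass
class BadBranchRecovery:
    """One recovery entry per in-flight conditional branch (Table 5.4)."""
    block_two: bool            # 1 bit: block 1 or 2
    predicted_taken: bool      # 1 bit
    second_chance: bool        # 1 bit
    pht_st_index: int          # 8-12 bits: PHT/ST index
    pht_block: int             # 2n bits: original PHT block (optional)
    corrected_ghr: int         # 8-12 bits
    replacement_selector: int  # 8-11 bits
    corrected_target: int      # 10/30 bits: i-cache index or full address
```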
The PHT index is required from the recovery entry so that the counter of a pattern
history table entry can be correctly updated after a correct or incorrect branch prediction.
When a conditional branch is predicted, the counter in the PHT is not immediately updated.
The 2-bit counter is stored in the BBR entry, as a predicted taken or not taken bit and a
second chance bit. When a branch instruction commits (i.e., all previous instructions and
branches have successfully completed), the PHT index from the BBR is used to update the
PHT to reflect a correct prediction. On the other hand, when a branch is discovered to be
mispredicted, the counter in the PHT is updated to reflect an incorrect prediction (see the
state transition diagram of Figure 2.4). Since a PHT entry contains counters for an entire
block of instructions, updating a single branch requires using a read/modify/write cycle.
In order to avoid a read/modify/write cycle, the original PHT block information may be
optionally stored in the BBR entry so that a PHT update requires only one write cycle to a
PHT entry.
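The deferred update can be sketched as follows (Python; a minimal model assuming the standard 2-bit saturating counter of Figure 2.4, with illustrative names):

```python
def update_counter(counter, taken):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

def commit_branch(pht, entry, slot, actual_taken):
    """Update one counter at commit using the PHT index saved in the
    recovery entry.  If the original PHT block was also saved (the
    optional 2n-bit field of Table 5.4), the block can be rewritten
    with a single write; otherwise a read/modify/write is needed."""
    saved_block = entry.get("pht_block")
    block = list(saved_block if saved_block is not None
                 else pht[entry["pht_index"]])     # read cycle (if not saved)
    block[slot] = update_counter(block[slot], actual_taken)
    pht[entry["pht_index"]] = block                # one write cycle
```

Saving the PHT block in the BBR trades a few bits of storage per entry for the elimination of the read half of the read/modify/write.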
If the branch does not have a “second chance” when it is mispredicted, then the pre-
computed selector from the bad branch recovery entry is written into the select table.
If a misprediction occurs for the second block, then any remaining instructions from
the first block are fetched along with a new second block target retrieved from the recovery
entry. On the other hand, if the misprediction occurs for the first block, an extra cycle may
be required to fetch any remaining instructions from the previous block.
5.3 Performance
The objective of this section is to analyze the performance of the instruction fetch pre-
diction mechanisms presented thus far in this chapter. The performance was determined by
running the SPEC95 benchmark suite on the SPARC architecture. Each program ran for the
first one billion instructions. The performance for each suite, SPECint95 and SPECfp95, is
calculated by adding the results of each program in its respective suite.
To begin with, the conditional branch accuracy of predicting multiple branches in
a single block via a blocked PHT is compared against a non-blocked PHT of equal size.
Next, the impact of incorrect BIT information on performance is presented. Section 5.3.3
examines the performance of dual block prediction by comparing the performance of single
selection and double selection with different sized PHTs and STs. There are many options
for implementing a target array: BTB or NLS, number of entries, and near-block target pre-
diction. The performance and relationship of these options are studied. In Section 5.3.5, the
performance of single block fetching and two-block fetching is compared using different
types of instruction caches. In addition, a breakdown of the different misprediction penal-
ties is shown for two-block fetching for both single selection and double selection. Finally,
performance using two-block prefetching is compared using different decode widths.
All the results presented use a block width of eight (n = 8). Single selection is used
for dual block prediction unless otherwise noted. The results presented only use a global
adaptive branch prediction scheme using one global blocked pattern history table. The
default size of a select table is 1024 entries, which corresponds to a GHR length of 10 bits.
The size of the RAS is 32 entries. It was assumed the processor would always have bad
branch recovery entries available.
Instruction cache misses were not simulated, i.e., a perfect instruction cache was
assumed. All the results presented would have their performance lowered if instruction
cache misses were included. The objective of this section is to examine the performance of
instruction fetch and prediction mechanisms only. Therefore, cache misses and execution
stalls are not considered. The only considerations for the instruction cache were the line size
and bank conflicts. A line size equal to the block width was used, and the instruction cache
was split into eight banks.
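Under these assumptions, bank selection and conflict detection reduce to simple index arithmetic (a sketch; word-granularity addressing and the line-to-bank mapping are assumptions):

```python
def banks_touched(start, n=8, num_banks=8):
    """Banks needed to fetch one n-instruction block beginning at word
    address `start`, with a line size of n instructions.  An aligned
    block touches one bank; a block starting mid-line touches two."""
    first_line = start // n
    last_line = (start + n - 1) // n
    return {line % num_banks for line in range(first_line, last_line + 1)}

def bank_conflict(start1, start2, n=8, num_banks=8):
    """Two blocks fetched in the same cycle conflict if they map to a
    common bank; the second block then pays a one-cycle penalty
    (Table 5.3)."""
    return bool(banks_touched(start1, n, num_banks) &
                banks_touched(start2, n, num_banks))
```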
The default target array is a 256-entry NLS array. The set prediction was not simu-
lated. Therefore, the results presented for the NLS configuration are effectively those of a
direct-mapped, tag-less BTB. The performance of a real NLS is affected by the associativity of an
instruction cache, since it may incorrectly predict the set of the cache. Direct-mapped caches,
though, do not need set prediction. For a performance and cost comparison of an NLS
versus a BTB, please refer to [7]. Also, by default, near-block target prediction is not used.
The branch execution penalty (BEP) gives information regarding performance and
the interaction between many different types of penalties, as listed in Table 5.3. Never-
theless, all types of penalties are recorded, so that the contribution of each penalty type
towards the overall BEP can be shown. If multiple penalties overlap during fetching, only
the most significant penalty is recorded. For example, if a conditional branch is mispre-
dicted, it is irrelevant if there is a misselect on subsequent blocks. All of those instructions
will be invalidated once the branch is resolved. Overall performance is best understood
from the effective instruction fetch rate. One cannot directly compare a scalar BEP with
a superscalar BEP or a multi-block BEP since higher penalties are offset by an increased
number of instructions per successful fetch block.
Also, when fetching two blocks per cycle of potentially eight instructions each, up to
sixteen instructions may be returned in one cycle. Consequently, the effective instruction
fetching rate, IPC_f, can be greater than n. If an eight-issue processor is used, then extra
instructions returned can be buffered. This would correspond to a two-block prefetching
scheme with n = 8 and q = 16 as presented in Chapter 4. When the raw two block rate
is greater than n, the issue unit will usually receive, and average close to, n instructions
per request. Of course, a simpler configuration to satisfy issue unit constraints in such a
situation would be to use two blocks of four instructions each. This would still yield an
excellent fetching rate. By default, prefetching is not used with two-block fetching.
5.3.1 Conditional Branch Accuracy
To begin with, the conditional branch accuracy of a blocked PHT for multiple branch
prediction was evaluated. The branch history length varied from 6 to 12. The results were
compared to a per-address scalar PHT (GAp) with 8 PHTs (see Figure 2.5) to give it equal
size to a blocked PHT for n = 8. Figure 5.9 displays the branch misprediction rates and
the improvement over a scalar PHT. The difference in accuracy between the scalar and
blocked schemes across all variations was small, and the accuracy favored the blocked
PHT scheme for most programs. The accuracy of SPECint95 averaged 91.5% while the
accuracy of the SPECfp95 averaged 97.3%, using a GHR length of 10. In this case, the
blocked PHT had a better accuracy by a few hundredths of a percent for SPECfp95 and a
few tenths of a percent for SPECint95. The results also show that the accuracy of a blocked
PHT is more sensitive to small GHR lengths. Consequently, the accuracy may not be as
good as a scalar PHT.
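The blocked-PHT organization can be sketched as a minimal model (Python; the initialization values and method names are illustrative):

```python
class BlockedPHT:
    """One global pattern history table indexed by the GHR; each entry
    holds n 2-bit counters, one per instruction slot in a block, so a
    single read predicts every conditional branch in the block."""

    def __init__(self, ghr_bits=10, n=8):
        self.n = n
        self.mask = (1 << ghr_bits) - 1
        # Counters initialized to weakly taken (2); an assumption.
        self.table = [[2] * n for _ in range(1 << ghr_bits)]

    def predict_block(self, ghr):
        counters = self.table[ghr & self.mask]
        return [c >= 2 for c in counters]  # True = predict taken
```

With n = 8 and a 10-bit GHR this holds 2^10 entries of 16 bits each, the same total storage as the eight scalar per-address (GAp) PHTs it is compared against.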
[Figure: left axis misprediction improvement over scalar (%); right axis misprediction rate (%); x-axis branch history length 6-12; series: Int/FP improvement and Int/FP miss rate]
Figure 5.9: Branch Misprediction Rate and Improvement
5.3.2 Block Information Type
Correct instruction type information for a block is critical to making accurate predic-
tions. Incorrect BIT information can still result in a correct prediction, but this possibility
is reduced with larger block sizes. Different BIT table sizes were simulated to evaluate
its impact. Using single block fetching, Figure 5.10 shows the BEP contribution from
inaccurate BIT information. Also shown is the IPC_f. Small BIT tables result in
poor performance. Not until about 2048 entries does the percentage of BEP drop below
5%. Therefore, for smaller sized instruction caches, it may be more beneficial to store the
BIT information inside the instruction cache. Conversely, for larger caches a separate BIT
table would be more cost effective, because the one-cycle miss penalty of the BIT is much lower than an
instruction cache miss.
The results demonstrate that it is important to use a BIT table sufficiently large to
make the impact of inaccurate BIT information small or to guarantee accurate BIT informa-
tion. The rest of the results presented use two blocks and assume that BIT information is stored
in the instruction cache, or that a separate BIT table has as many entries as there are blocks
in the instruction cache, so that incorrect BIT information is never used.
5.3.3 Single vs. Double Selection
The performance of the select table depends on the branch history length and the
number of select tables used. Multiple select tables are indexed by the starting position
of the current address (the least significant bits). The correct target depends on the en-
tering position in a block, so multiple select tables help identify which target should be
selected. The least significant bits of the starting address determine which select table is
used.

[Figure: BEP (left axis) and IPC_f (right axis) vs. BIT block entries, 64-4096, for Int and FP]
Figure 5.10: Block Information Type Penalty and Performance

Figure 5.11 shows the performance of dual block prediction for single selection (SS)
and double selection (DS). The global history register length varies from 9 to 12. There
can be 1, 2, 4, or 8 STs. However, there are not multiple PHTs. The results demonstrate
that increasing the number of STs improves performance, as does increasing the branch
history length. The extra penalties from using double selection significantly reduced performance,
by roughly 10% in most cases. Hence, single selection is preferred. Double selection
significantly improves, though, with more STs.
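The table-selection step can be written out as a small sketch (Python; the precise bit layout is an assumption for illustration):

```python
def choose_select_table(start_addr, ghr, num_tables, ghr_bits, n=8):
    """With multiple select tables, the low bits of the block's starting
    position pick the table, and the global history register indexes
    within it.  Returns (table_number, index_within_table)."""
    position = start_addr % n            # entering position in the block
    table = position % num_tables        # low bits choose the ST
    index = ghr & ((1 << ghr_bits) - 1)  # GHR indexes the chosen ST
    return table, index
```

Because the correct target depends on the entering position, blocks entered at different positions consult different STs and so do not destructively share select entries.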
[Figure: IPC_f vs. branch history length / number of select tables (9/1 to 12/8); series: Int/SS, Int/DS, FP/SS, FP/DS]
Figure 5.11: Single and Double Selection Performance
5.3.4 Target Arrays
Target arrays can use a BTB or NLS. In addition, if a near-block target is used, this
will reduce the number of immediate targets used in the target array. Table 5.5 shows the
percentage of BEP due to indirect and immediate misfetches for SPECint95. The total BEP
and IPC_f are also reported. The number of block entries is varied for both NLS and a
4-way BTB using an LRU replacement algorithm. A BTB entry can be for the first or second
target, while an NLS entry has two separate targets. The data indicate that eight NLS
block entries are needed for performance comparable to one 4-way BTB entry, because the
BTB is 4-way associative while the NLS is direct-mapped. About 70% of the conditional
branches are near-block targets. As a result of using near-block encoding, the number of
BTB or NLS entries can be cut in half for about the same performance.
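Near-block encoding can be illustrated with a small sketch (Python; the reachable range of ±2 blocks is an assumption for illustration, not the thesis's exact encoding):

```python
def near_block_encode(branch_block, target_block, reach=2):
    """If the target lies within `reach` blocks of the branch's own block,
    encode it as a small signed block offset instead of consuming a
    BTB/NLS entry; otherwise the target array must hold it."""
    delta = target_block - branch_block
    if -reach <= delta <= reach:
        return delta       # near-block: no target-array entry needed
    return None            # far target: goes in the BTB/NLS
```

Since about 70% of conditional branches fall within the near range, roughly half the target-array entries can be saved for the same performance.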
5.3.5 Instruction Cache Configurations
The performance can be dramatically improved if a different type of instruction cache
configuration is used, as described in Chapter 4. To increase the number of instructions per
block (IPB), the cache line size can be extended to 16 instructions. For the highest possible
IPB, a self-aligned cache should be used. If a self-aligned cache is used though, the number
of banks should be doubled to offset the increase in bank conflicts, since up to four lines are
being simultaneously accessed to return two blocks. Although there are no bank conflicts
with single block fetching, the extended and self-aligned caches improve the instructions
fetched per block and overall fetching performance.
Table 5.5: Indirect and Immediate Misfetch Penalty Comparison for Different Target Array Configurations

Target  # block  near-    %BEP misfetch      BEP     IPC_f
Type    entries  block?   imm.   indirect
BTB 8 no 19.2 18.7 0.603 5.02
BTB 8 yes 10.6 16.3 0.520 5.40
BTB 16 no 12.6 15.1 0.523 5.32
BTB 16 yes 6.5 12.6 0.476 5.57
BTB 32 no 7.4 11.6 0.473 5.58
BTB 32 yes 3.6 9.6 0.446 5.73
BTB 64 no 4.0 9.6 0.447 5.72
BTB 64 yes 1.9 7.9 0.431 5.80
NLS 64 no 12.0 14.7 0.516 5.41
NLS 64 yes 6.7 13.1 0.480 5.54
NLS 128 no 8.3 12.3 0.481 5.53
NLS 128 yes 4.2 10.8 0.454 5.67
NLS 256 no 5.5 10.1 0.457 5.66
NLS 256 yes 2.7 8.7 0.438 5.77
NLS 512 no 3.8 9.2 0.444 5.74
NLS 512 yes 1.6 7.9 0.429 5.81
With the extended and self-aligned caches, when branch prediction is performed,
the values wrap around the PHT block. For instance, if the starting position of an eight-
instruction wide block is at address 7, the first instruction will use the last (eighth) counter
in a PHT block, and the second instruction will wrap-around the PHT block and use the
first counter in that PHT block. Also, the target arrays must be correspondingly extended or
self-aligned. The performance of these three cache types is compared using one and two
block fetching with single selection. The results are shown in Table 5.6, using 8 STs and a
branch history length of 10. Notably, the self-aligned cache achieves 10.9 IPC_f for
SPECfp95. It averages over 8 IPC_f for the entire SPEC95 suite. The high performance is
primarily due to the increase in IPB. Also, the starting address becomes more random, which
helps the multiple select tables to be used more effectively. The performance of the extended cache
type is between a normal and self-aligned cache. Compared to single block prediction,
dual block prediction has an effective fetching rate approximately 40% higher for integer
programs and 70% higher for floating point programs.
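The wrap-around counter indexing used with the extended and self-aligned caches can be written as (a minimal sketch):

```python
def pht_counter_slots(start_pos, count, n=8):
    """Counter slots used by `count` sequential instructions in a PHT
    block when fetching begins at position `start_pos`.  A block that
    starts at position 7 uses the last counter first, then wraps
    around to the first counter of the same PHT block."""
    return [(start_pos + i) % n for i in range(count)]
```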
Table 5.6: IPB and IPC_f for Different Cache Types

                                  SPECint95                 SPECfp95
cache         line                IPC_f                     IPC_f
type          size  banks  IPB    1 block  2 block   IPB    1 block  2 block
normal        8     8      5.01   3.96     5.66      5.81   5.48     9.43
extended      16    8      5.30   4.12     5.87      6.03   5.65     9.80
self-aligned  8     16     5.99   4.53     6.42      6.76   6.33     10.88
Using a self-aligned cache, 8 STs, and a branch history length of 10, Figure 5.12
shows the BEP of each program and the contribution of BEP by each type of misprediction
as described in Section 5.2.3. Also, Table 5.7 shows the BEP distribution for each block.
These are for single selection, while Figure 5.13 and Table 5.8 show the distribution for
double selection. The effective instruction fetching rate is proportional to the number of
instructions per block (IPB) and inversely proportional to the product of the average branch
execution penalty and total number of branches executed. As a result, a program with a
lower BEP may have a smaller IPC f because it executes more branches.
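That relationship can be stated as a simple model (a sketch consistent with the definitions used here, not an exact formula from the text; the cycle accounting is an approximation):

```python
def effective_fetch_rate(instructions, ipb, blocks_per_cycle,
                         avg_bep, branches):
    """Estimate IPC_f: base fetch cycles (blocks fetched divided by the
    blocks fetched per cycle) plus penalty cycles accumulated by the
    executed branches."""
    base_cycles = instructions / (ipb * blocks_per_cycle)
    penalty_cycles = avg_bep * branches
    return instructions / (base_cycles + penalty_cycles)
```

The model makes the trade-off visible: a program with a lower average BEP but more executed branches can still end up with a smaller IPC_f.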
The BEP distribution from those figures shows that the most significant BEP contribution
is from misprediction of conditional branches. Misselection is the next most significant
contribution. Target array mispredictions are also a significant factor in BEP. Some of the
floating point programs performed exceedingly well. On the other hand, some integer pro-
grams had a high BEP because of poor conditional branch prediction.
5.3.6 Prefetching
A prefetch buffer, as described in Chapter 4, can be used in conjunction with two-
block prediction and single selection. Table 5.9 shows the performance for different decode
(issue) sizes from 4 to 16. A global history register length of 12, one select table, and a 512-
entry NLS target array were used. The instructions per fetch request (IPFQ) and effective
instruction fetching performance are shown. The IPFQ is the average number of instruc-
tions returned to the decoder including penalties from misselection, misfetching, and bank
conflicts, but not penalties from branch prediction, indirect branches, and returns. This is
to demonstrate how well the instruction fetch mechanism including instruction fetch pre-
diction can deliver instructions to the decoder. Instructions fetched from the incorrect path
are the result of incorrect branch prediction, whose accuracy is equivalent to scalar prediction.
Table 5.9 shows that the IPFQ is relatively close to the decode size up to about 14. These
[Figure: stacked BEP per SPEC95 program, broken down by penalty type: mispredict, misselect, ghr, misfetch immediate, misfetch indirect, return, bank conflict]
Figure 5.12: Branch Execution Penalties for Dual Block, Single Selection
Table 5.7: BEP Distribution, IPB, and IPC_f for Dual Block, Single Selection
Program Block 1 Block 2 BEP IPB IPC f
cnd ind imm ret cnd sel ghr ind imm ret bnk
applu .054 .000 .000 .000 .068 .010 .008 .000 .000 .000 .009 .149 7.28 12.87
apsi .041 .000 .001 .006 .046 .034 .022 .001 .004 .007 .020 .183 7.68 14.10
fpppp .101 .001 .001 .000 .121 .059 .032 .009 .000 .000 .016 .340 7.71 14.19
hydro2d .005 .012 .006 .000 .007 .020 .002 .027 .009 .000 .003 .091 6.34 11.17
mgrid .103 .000 .000 .000 .121 .075 .007 .001 .001 .000 .009 .318 7.86 14.85
su2cor .022 .007 .032 .000 .026 .047 .009 .021 .060 .000 .015 .240 5.76 7.46
swim .029 .000 .000 .000 .032 .025 .002 .007 .000 .000 .015 .110 7.61 14.65
tomcatv .033 .002 .017 .000 .041 .029 .004 .016 .032 .000 .012 .185 5.92 8.37
turb3d .060 .003 .003 .000 .071 .069 .006 .011 .016 .000 .034 .272 6.21 9.56
wave5 .066 .002 .000 .007 .079 .067 .005 .017 .050 .009 .036 .337 6.46 9.31
CFP95 .037 .004 .013 .001 .044 .040 .007 .016 .031 .001 .016 .211 6.76 10.88
gcc .173 .020 .044 .003 .205 .066 .013 .056 .061 .004 .008 .653 5.61 4.40
compres .235 .000 .000 .000 .289 .068 .032 .000 .000 .000 .006 .631 6.43 5.43
go .348 .018 .016 .001 .409 .132 .034 .053 .030 .001 .011 1.052 6.43 4.40
ijpeg .152 .001 .002 .000 .185 .042 .007 .005 .002 .000 .001 .397 7.03 9.44
li .056 .002 .019 .003 .070 .021 .006 .014 .025 .004 .012 .232 5.35 6.88
m88ksim .055 .005 .011 .000 .066 .020 .005 .009 .009 .000 .009 .187 5.82 8.60
perl .063 .011 .048 .004 .076 .055 .006 .028 .070 .005 .029 .395 5.59 6.08
vortex .037 .010 .009 .004 .044 .051 .011 .030 .017 .005 .014 .232 5.80 7.77
CINT95 .123 .007 .018 .002 .149 .053 .014 .022 .026 .003 .013 .429 5.99 6.42
[Figure: stacked BEP per SPEC95 program, broken down by penalty type: mispredict, misselect, ghr, misfetch immediate, misfetch indirect, return, bank conflict]
Figure 5.13: Branch Execution Penalties for Dual Block, Double Selection
Table 5.8: BEP Distribution, IPB, and IPC_f for Dual Block, Double Selection
Program Block 1 Block 2 BEP IPB IPC f
cnd sel ghr ind imm ret cnd sel ghr ind imm ret bnk
applu .054 .012 .001 .000 .000 .000 .068 .019 .016 .000 .000 .000 .009 0.180 7.28 12.57
apsi .041 .039 .017 .000 .001 .006 .046 .068 .044 .001 .004 .007 .020 0.295 7.68 13.42
fpppp .101 .048 .033 .001 .001 .000 .121 .117 .065 .009 .000 .000 .016 0.512 7.71 13.63
hydro2d .005 .017 .001 .012 .006 .000 .007 .039 .003 .027 .009 .000 .003 0.131 6.34 10.62
mgrid .103 .056 .016 .000 .000 .000 .121 .149 .015 .001 .001 .000 .009 0.472 7.86 14.46
su2cor .022 .035 .005 .007 .032 .000 .026 .093 .019 .021 .060 .000 .015 0.336 5.76 6.54
swim .029 .005 .000 .000 .000 .000 .032 .050 .004 .007 .000 .000 .015 0.142 7.61 14.50
tomcatv .033 .018 .001 .002 .017 .000 .041 .057 .008 .016 .032 .000 .012 0.238 5.92 7.72
turb3d .060 .037 .010 .003 .003 .000 .071 .137 .011 .011 .016 .000 .034 0.393 6.21 8.67
wave5 .066 .046 .005 .002 .000 .007 .079 .135 .009 .017 .050 .009 .036 0.460 6.46 8.45
CFP95 .037 .028 .005 .004 .013 .001 .044 .080 .013 .016 .031 .001 .016 0.291 6.76 10.13
gcc .173 .046 .005 .020 .044 .003 .205 .133 .027 .056 .061 .004 .008 0.785 5.61 3.92
compres .235 .035 .013 .000 .000 .000 .289 .137 .064 .000 .000 .000 .006 0.778 6.43 4.78
go .348 .126 .024 .018 .016 .001 .409 .265 .068 .053 .030 .001 .011 1.369 6.43 3.68
ijpeg .152 .015 .003 .001 .002 .000 .185 .083 .014 .005 .002 .000 .001 0.463 7.03 8.95
li .056 .006 .000 .002 .019 .003 .070 .042 .012 .014 .025 .004 .012 0.265 5.35 6.55
m88ksim .055 .008 .000 .005 .011 .000 .066 .039 .010 .009 .009 .000 .009 0.220 5.82 8.23
perl .063 .024 .002 .011 .048 .004 .076 .111 .011 .028 .070 .005 .029 0.482 5.59 5.53
vortex .037 .033 .010 .010 .009 .004 .044 .102 .022 .030 .017 .005 .014 0.337 5.80 6.76
CINT95 .123 .033 .007 .007 .018 .002 .149 .107 .027 .022 .026 .003 .013 0.536 5.99 5.76
results indicate that the instruction fetch mechanism and fetch prediction can sustain an ad-
equate instruction fetching rate, but branch mispredictions restrict the effective instruction
fetching performance.
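The prefetch buffer's interaction with the decoder can be modeled as a small queue (Python; the capacity and drain policy are assumptions based on the Chapter 4 description):

```python
from collections import deque

class PrefetchBuffer:
    def __init__(self, capacity=16):
        self.q = deque()
        self.capacity = capacity

    def fill(self, fetched):
        """Buffer instructions returned by two-block fetch (up to 16 per
        cycle with n = 8); excess beyond capacity is not accepted."""
        for ins in fetched:
            if len(self.q) < self.capacity:
                self.q.append(ins)

    def issue(self, decode_width):
        """The decoder drains up to decode_width instructions per cycle."""
        out = []
        while self.q and len(out) < decode_width:
            out.append(self.q.popleft())
        return out
```

When the fetch unit returns more instructions than the decoder consumes, the surplus is held over, which is why IPFQ can track the decode size closely up to about 14.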
Table 5.9: Two-block Prediction with Prefetching for Different Decode Sizes
Decode Size
Suite Metric 4 6 8 10 12 14 16
Int IPFQ 3.90 5.73 7.75 8.86 9.88 10.5 10.7
IPC f 3.20 4.38 5.71 6.39 6.91 7.22 7.35
FP IPFQ 3.98 5.96 7.91 9.69 11.2 12.5 13.2
IPC f 3.84 5.64 7.39 8.91 10.3 11.2 11.8
5.4 Multiple Block Prediction
In addition to two-block prediction, multiple blocks can be predicted by a simple
extension of the two-block prediction. Given the current block address, the first block of
the next fetch cycle is predicted using BIT and PHT information (single selection) or using
a select table entry (double selection). The second and remaining blocks of the next fetch
are predicted using the select table. Instead of the select table entry providing select bits for
the first and/or second block, it provides select bits for all blocks. The relationship between
the current fetch cycle and predicting multiple blocks for the next fetch cycle is illustrated
in Figure 5.14.
When the second and remaining blocks are predicted from the select table, they
are all verified at the same time. This is done during the next cycle for single selection
(see Figure 5.6, Verify b2 select stage) or the cycle after that for double selection (see
Figure 5.8).

[Figure: diagram relating the current fetch cycle to the multiple blocks predicted for the next fetch cycle]
Figure 5.14: Predicting Multiple Blocks

In addition, the target array is expanded to provide targets for multiple blocks.
Also, another read/write port to the PHT and BIT tables is needed for each additional block
predicted.
The effective instruction fetching performance from a single block to four blocks
is shown in Figure 5.15. A global history register length of 12, one select table, and a
512-entry NLS target array were used. The floating point benchmarks showed remarkable
increases in fetching performance, achieving 16 IPC f with four-block fetching. On the
other hand, the improvement from the integer benchmarks was not as impressive after
two-block fetching, because poorer branch prediction accuracy inhibits its performance
potential. In addition, as more blocks are predicted per cycle, the accuracy of the selection
table decreases, eventually leading to negligible improvement.
[Figure: IPC_f vs. blocks predicted per cycle (1-4), for Int and FP]
Figure 5.15: Effective Instruction Fetch for Different Block Prediction Capability
5.5 Cost Estimates
This section presents cost estimates of multiple branch and block prediction in terms
of hardware storage and timing requirements. Using simplified hardware cost estimates, the
amount of hardware storage is evaluated for single block fetching and dual block fetching
with single and double selection. Also, using a timing model for a 0.5 µm CMOS technology,
timing estimates are given for each structure used in dual block prediction. Timing
charts show how the structures relate to the critical path. Single selection and double selec-
tion are compared based on hardware and timing requirements.
5.5.1 Storage
Simplified hardware cost estimates were developed to get an idea of the storage
requirements for the different pieces of multiple branch and block prediction. Given the
block width, history register length, number of PHTs, number of select tables, size of NLS,
and type of instruction cache, the total number of bits required can be estimated. Table 5.10
lists the parameters and equations which give simplified hardware cost estimates for the
PHT, ST, NLS, BIT, and BBR tables. Single block fetching requires the use of a PHT, NLS
target array, BIT table, and a BBR. Dual block prediction with single selection requires the
use of a PHT, ST, two NLS target arrays, BIT table, and a BBR. Dual block prediction with
double selection requires the use of a PHT, two STs, two NLS target arrays, and a BBR.
Multiple block prediction in excess of two blocks requires a corresponding number of STs
and NLS target arrays.
Table 5.10: Simplified Hardware Cost Estimates

Symbol  Description
n       block width
k       history register length
p       number of PHTs
s       number of Select Tables
t       number of NLS block entries
l       size of line index
a       cache associativity
b       number of BBR entries
i       number of BIT block entries

Table   Simple hardware cost estimate
PHT     p * 2^k * 2n
ST      s * 2^k * 2(log2(n) + 1)
NLS     t * n * (log2(n) + l + log2(a))
BIT     i * n * 2
BBR     b * (2k + l + log2(a) + 2(log2(n) + 1) + 3)

Here is an example of a hardware cost estimate using specific values for the parameters
in Table 5.10. Using a block width of 8, a 32 Kbyte direct-mapped instruction cache, a
10-bit GHR, 1 PHT, 1 ST, 256 NLS entries, 1024 BIT entries, and 8 BBR entries, the cost
estimates evaluate to:
- PHT: 16 Kbits
- ST: 8 Kbits
- NLS: 26 Kbits
- BIT: 16 Kbits
- BBR: 0.3 Kbits
- single block total: 58 Kbits
- dual block, single select total: 92 Kbits
- dual block, double select total: 84 Kbits
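These estimates can be reproduced programmatically (a sketch; the exact equation forms in Table 5.10 are reconstructed here from the worked example and are checked against the totals listed above):

```python
from math import log2

def cost_bits(n=8, k=10, p=1, s=1, t=256, l=10, a=1, b=8, i=1024):
    """Hardware cost in bits for each structure, per Table 5.10.
    Defaults match the 32 KB direct-mapped cache example (32-byte
    lines give a 10-bit line index, l = 10)."""
    sel = 2 * (log2(n) + 1)                          # select bits per block
    return dict(
        PHT=p * 2**k * 2 * n,
        ST=s * 2**k * sel,
        NLS=t * n * (log2(n) + l + log2(a)),
        BIT=i * n * 2,
        BBR=b * (2 * k + l + log2(a) + sel + 3),
    )

c = cost_bits()
single  = c["PHT"] + c["NLS"] + c["BIT"] + c["BBR"]       # ~58 Kbits
dual_ss = single + c["ST"] + c["NLS"]                     # ~92 Kbits
dual_ds = c["PHT"] + 2 * c["ST"] + 2 * c["NLS"] + c["BBR"]  # ~84 Kbits
```

Dual block with single selection adds an ST and a second NLS array to the single-block total; double selection doubles the STs but drops the BIT table.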
When single selection is used, a BIT table with as many entries as rows in
the instruction cache should be used to avoid any BIT mispredictions. As a result, the cost
of predicting the next block based on PHT and BIT information, as with single selection,
will increase when the size of the instruction cache increases. A total cost comparison for
different instruction cache sizes using single block prediction, dual block prediction with
single selection, and dual block prediction with double selection is shown in Figure 5.16.
A block width of 8, a 10-bit GHR, 1 PHT, 1 ST, 256 NLS entries, and 8 BBR entries were
used in evaluating these estimates. In comparing double and single selection, after a cache
size of 16 Kbytes (512 BIT entries), double selection requires a smaller bit storage cost
than single selection. In fact, the cost of BIT storage eventually makes single block fetch
prediction more expensive than dual block prediction with double selection.
[Figure: total bits vs. cache size (2 KB-128 KB); series: 1 block, 2 block SS, 2 block DS]
Figure 5.16: Hardware Storage Cost of Prediction for Different Cache Sizes
The cost difference between single and double selection is not solely dependent on
BIT storage. The most significant factor is the size of the select tables. Since double selec-
tion requires a select table twice as large as single selection, the cost of double selection can
easily become much larger than single selection. The total hardware storage cost is shown
in Figure 5.17, where the length of the GHR and the number of select tables is varied for
dual block prediction with single selection and double selection. A block width of 8, a 32
Kbyte direct-mapped instruction cache, 1 PHT, 256 NLS entries, 1024 BIT entries, and 8
BBR entries were used in evaluating these estimates. The graph demonstrates that the cost
of double selection is less than the cost of single selection for small history register lengths
and few select tables. After a history register length of 10, the cost of double selection is
always greater than single selection. For both single and double selection, though, the cost
of using large select tables becomes excessive. Implementing a prediction mechanism with
a cost greater than 200 Kbits is not reasonable for today’s technology.
In summary, double selection can provide cost savings over single selection with a
small select table or with a large instruction cache. The cost savings, however, are not
extremely significant, and in most cases the cost of double selection is greater. Given these
cost estimates and the performance results of Section 5.3.3, the performance loss from
double selection does not justify a small storage savings. As the next section will show,
though, double selection may prove invaluable in reducing the cycle time.
5.5.2 Timing
Double selection and single selection have different timing requirements. In order to
evaluate these requirements, the timing model of Wilson and Jouppi [51] is used to make
access and cycle time estimates of the different tables, caches, and logic required.

[Figure: total bits vs. branch history length / number of select tables (9/1 to 12/8), for single and double selection]
Figure 5.17: Hardware Storage Cost of Dual Block Prediction for Single and Double Selection

The technology and implementation parameters are identical to what is used in their report,
except the results are scaled for a 0.5 µm CMOS technology instead of a 0.8 µm CMOS
technology. When the access and cycle times of a tag-less table are estimated, the tag
side of their cache model is ignored and only the data side is considered. Select logic and
multiplexer delays were also estimated by applying the Horowitz approximation Wilson
and Jouppi used [18].
Table 5.11 displays the access time of direct-mapped caches, BIT table, PHT, ST,
NLS target array, and 4-way BTB target array for different sizes. The access time estimates
for the BTB are about 2 ns greater than the estimates for the NLS and are higher than
the access time for direct-mapped caches. The access time for the associative BTB is
greater because of the required tag matching. Consequently, in order to design a fetching
mechanism with a short cycle time, the NLS target array is preferred.
Table 5.11: Access Time Estimates (ns)
i-cache BIT PHT ST (DS) NLS 4-way BTB
size time entries time entries time entries time entries time entries time
8 KB 3.2 256 2.1 512 2.2 512 2.1 64 2.0 8 4.3
16 KB 3.8 512 2.1 1024 2.3 1024 2.2 128 2.2 16 4.4
32 KB 3.9 1024 2.3 2048 2.7 2048 2.6 256 2.3 32 4.5
64 KB 4.7 2048 2.7 4096 3.3 4096 3.1 512 2.5 64 4.6
Single Selection
Figure 5.18 is a timing chart for a direct-mapped 8 Kbyte instruction cache using dual
block prediction with single selection. The chart includes timing of a 256-entry NLS target
array, a 1024-entry select table, a 1024-entry blocked PHT, and a 256-entry BIT table. The
BIT table has enough entries to avoid any BIT mispredictions, since the instruction cache
has 256 rows with a block size of 32 bytes. All the structures use a block width of eight
and a single port, except the PHT and BIT, which are dual ported.
[Figure: timing chart; the structure access times and their dependencies are described in the text below]
Figure 5.18: Timing Chart for 8 KB Instruction Cache Using Dual Block Prediction with Single Selection
The access time of the cache is 3.2 ns, which is fast compared to a 4.7 ns access
time of a 64 Kbyte cache. The instruction cache requires a precharge time of about 1
ns, which makes the cycle time greater than the access time. This allows enough time
to discharge the word lines and precharge the bit lines. During this time, alignment of
instructions may take place as well as the prediction of the new PC addresses. The select
logic requires BIT and PHT block information. Hence, it may not begin computing until
both have completed reading. As shown in the chart, the PHT requires 2.3 ns to read its
data. The select logic, as shown in Figure 5.4, takes approximately 0.5 ns to complete.
After this time, the control logic is ready for the first multiplexer to select from the NLS
target array, RAS, or fall-through address (see Figure 5.5). Some of the inputs for the
second multiplexer are dependent on the output of the first multiplexer. As a result, an
additional 0.2 ns is required to complete the selection of the second target. Even with the
select logic, the prediction of both targets is completed with a comfortable margin of 0.7
ns. Only if the cycle time of the instruction cache significantly decreases or the access time
of the BIT or PHT increases would prediction using single selection become the critical
path. If this is the case, double selection may be the solution.
Double selection can avoid the extra delay required by selection logic. Unfortunately,
double selection always performs significantly worse in IPC than single selection. On the other
hand, if the selection logic required with single selection extends the cycle time of the
processor, the performance savings from a longer cycle time may justify the extra penalty
cycles of double selection. As shown in Figure 5.18, the selection logic may become part
of the critical path if the access time of the instruction cache decreases. This may be
accomplished with a cache size less than or equal to 2 Kbytes. It is unlikely, however, that
a designer with the high transistor budgets today’s technology provides would implement
a small instruction cache and use PHT and STs larger than the primary cache itself.
Double Selection
The potential benefit from double selection most likely may be exploited using a
pipelined instruction cache access which completes in two cycles, as used in the Intel
Pentium Pro [30]. For example, using a large primary instruction cache of 32 Kbytes,
the access time will increase to 3.9 ns and the cycle time will increase to 5.3 ns. The 25%
increase in cycle time compared to an 8 Kbyte cache may not justify the decrease in
instruction cache miss penalties. On the other hand, the cycle time may be cut in half
if the instruction cache access spans two cycles. In order to retain the same throughput of
instructions fetched per cycle, dual block prediction needs to complete in one cycle. Also,
four banks will be busy during one cycle instead of two, so this increases the possibility of
a bank conflict.
One possibility to keep bank conflicts under control is to use the sets of a set-associative
instruction cache in addition to interleaving the cache based on address to provide
multiple banks. For example, sixteen banks may be chosen from four sets, where each set
is interleaved four ways based on the address. When using next line and set prediction, the
set is predicted before the access of the cache line. Therefore, it is unnecessary to access
all sets and select the correct one after a tag comparison. Consequently, only the predicted
set needs to be accessed, leaving the remaining sets for use by other blocks to be fetched at
the same time. When an instruction cache miss occurs, the line is replaced into a set such
that it does not create a bank conflict with the other lines fetched when the miss was initiated. The next time the
line is accessed, it is likely that it will be accessed with the same lines as before. Hence,
the chance of a bank conflict is significantly reduced.
Using a shortened cycle time from a two-cycle instruction cache access and a single
cycle dual block prediction, the select logic from single selection becomes part of the
critical path. For example, a 32 Kbyte cache access completes in two cycles, with each
cycle lasting 2.7 ns. Referring back to the timing chart of Figure 5.18, the dual block
prediction completes after 3.5 ns. As a result, using single selection with a pipelined instruction
cache would require increasing the cycle time from 2.7 ns to 3.5 ns to meet the objective
of dual block prediction in a single cycle. Alternatively, if double selection is used, the 0.5 ns
from the selection logic no longer exists. In addition, the multiplexer selection bits become
ready 0.2 ns earlier because the select table is faster than the pattern history table. As a result, the dual
block prediction can be completed within 2.8 ns, which comes close to the 2.7 ns goal. The
timing chart for this pipelined 32 Kbyte instruction cache using dual block prediction with
double selection is shown in Figure 5.19.
The timing chart of Figure 5.19 shows four instruction cache accesses: two are initiated
during Cycle 0 and two are initiated during Cycle 1. The prediction for the lines
in Cycle 1 was completed in Cycle 0 by using the select table. Also shown in the chart
is the prediction for Cycle 2 made during Cycle 1. The dual block prediction determines
the cycle time of 2.8 ns. The access of the first two lines completes 1.1 ns into the second
cycle. Approximately 1.7 ns is available to perform instruction alignment and block
merging required by two-block fetching.
Indeed, double selection can be useful in reducing the cycle time when the prediction
is time-critical. As demonstrated by this example, double selection reduced the cycle time
by 20% over the cycle time required by single selection. An overall instructions per second
performance increase of about 10% is expected after considering a 10% loss in IPC.
Figure 5.19: Timing Chart for Pipelined 32 KB Instruction Cache Using Dual Block Prediction with Double Selection
Chapter 6
Scalable Register File
A large branch execution penalty may result from register renaming using a mapping
table. In Section 6.1, this chapter describes a hybrid renaming technique, combining the
advantages of a reorder buffer with those of a mapping table, to significantly reduce the recovery
time. Section 6.2 analyzes the utilization of a register file and discovers that most of the
physical registers are not being used most of the time. This leads into Section 6.3, which
describes dynamic result renaming. The implementation details of the hybrid renaming
mechanism and dynamic result renaming are described in Section 6.4. Lastly, Section 6.5
presents the performance of the scalable register file architecture.*
6.1 Register Renaming
Section 2.6 gave background information regarding register renaming. Registers can
be renamed from logical to physical registers using a mapping table. Unfortunately, a
mispredicted conditional branch or an exception may result in a large penalty to recover the
mapping table. Section 6.1.2 introduces a hybrid renaming technique to reduce this penalty.
*Parts of this chapter were published in the proceedings of the 1996 Conference on Parallel Architectures
and Compilation Techniques [46].
The hybrid renaming technique uses both content addressable memory (CAM), as used in
a reorder buffer, and random access memory (RAM), as used with a mapping table. The
number of ports for the CAM and RAM cells can be reduced by detecting the dependencies
within the decode block, as will be described in Section 6.1.3.
6.1.1 Recovery
A major performance penalty with using a mapping table and a recovery list is the
time it takes to recover from a mispredicted branch. Using the recovery list, the mapping
table can be recovered by undoing each entry in the list one at a time. The mapping table
can be updated in groups at a time, but the rate is limited by the number of read and write
ports. If P ports are available and M register mappings need to be recovered, then it will
take ⌈M/P⌉ cycles to recover.
Hence, the longer it takes for a branch to be found incorrectly predicted, the longer it
takes to recover the mapping table. In fact, this essentially doubles the branch misprediction
penalty, compared to a mechanism that can recover from a mispredicted branch in one
cycle.
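The port-limited recovery described above can be sketched as a minimal Python model; the dict-based mapping table and list-of-pairs recovery list are illustrative stand-ins for the hardware structures, not the actual RAM organization:

```python
def recovery_cycles(mappings_to_undo, ports):
    """Cycles to undo M register mappings with P read/write ports:
    ceil(M / P), since each cycle restores at most P entries."""
    return -(-mappings_to_undo // ports)  # ceiling division

def recover(mapping_table, recovery_list, ports):
    """Undo recovery-list entries newest-first, at most `ports` per cycle.
    Each entry is (logical_reg, old_physical_reg). Returns cycle count."""
    cycles = 0
    while recovery_list:
        for _ in range(min(ports, len(recovery_list))):
            reg, old_preg = recovery_list.pop()
            mapping_table[reg] = old_preg
        cycles += 1
    return cycles
```

Undoing newest-first guarantees the oldest (pre-branch) mapping is the one left in the table when several instances of the same logical register are rolled back.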
In addition, the ports of a RAM cell used by the mapping table grow proportional
to N , since more instructions are being renamed per cycle. Therefore, it is not ideally
scalable. On the other hand, since the number of logical registers remains fixed, N would
have to be large in order to cause real problems from a practical standpoint. Nevertheless,
it would be beneficial to reduce the number of ports of the mapping table’s RAM cell.
6.1.2 CAM/Table Hybrid
The advantage of a register renaming mechanism that uses CAM, such as a reorder
buffer [19], is a one cycle recovery time from a mispredicted branch or exception. This
is accomplished by simply invalidating appropriate entries relative to the branch. A one
cycle recovery time is desirable. However, CAM cells scale worse than RAM cells, as
used in a mapping table. To begin with, a CAM cell is more expensive and slower than a
normal RAM cell. In addition, the lookup array, which searches for the most recent register
instance, grows as the number of speculative registers increases. Therefore, it is desirable
to have the significant performance benefits of the CAM and the area benefits of RAM.
As a compromise, CAM is used to rename a limited number of speculative registers
(extra registers reserved for speculative results until committed), and a mapping table,
which is implemented using RAM, renames the remaining speculative registers and the
committed registers. Figure 6.1 is a block diagram of renaming hardware which uses both
CAM and RAM. The hardware which uses CAM is called CAM lookup. The CAM lookup
is a FIFO queue. When a new instance of a logical register is created, it is inserted into
the CAM lookup. Only instructions which have a result need to be entered into the CAM
lookup. When an instance exits the CAM lookup, it is entered into the appropriate mapping
table entry, and the old physical register is entered into the recovery list. When an
instruction commits its result, the old physical register in the recovery list is freed.
Source operands are first searched for matching entries in the CAM lookup list for
the most recent destination register. If a match is not found, then the register is looked up in
the mapping table. When a mispredicted branch is resolved, if its instruction tag identifier
indicates that subsequent instructions are still in the CAM lookup entries, then there is no
additional penalty. If it has been entered into the mapping table, then the recovery list is
used to undo the mappings. However, the penalty is not as severe since instances are first
entered into the CAM lookup. The number of entries to recover is reduced by the number
of entries in the CAM lookup. Consequently, there is a significant reduction in the average
misprediction or exception penalty.
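The interplay between the CAM FIFO, the mapping table, and the recovery list can be sketched as a toy Python model; the data layouts (tuples for CAM entries, a dict for the table) are illustrative assumptions, not the hardware organization:

```python
from collections import deque

class HybridRenamer:
    """Sketch of hybrid renaming: a small CAM FIFO holds the most recent
    register instances; older instances spill into a RAM mapping table,
    with the displaced mapping logged in a recovery list."""
    def __init__(self, cam_depth, initial_map):
        self.cam = deque()              # entries: (reg, itag, preg)
        self.cam_depth = cam_depth
        self.table = dict(initial_map)  # logical reg -> preg
        self.recovery = []              # (reg, old_preg) pairs

    def define(self, reg, itag, preg):
        """New instance of `reg`: insert into CAM; spill the oldest
        entry into the mapping table if the FIFO is full."""
        if len(self.cam) == self.cam_depth:
            old_reg, _, old_preg_cam = self.cam.popleft()
            self.recovery.append((old_reg, self.table[old_reg]))
            self.table[old_reg] = old_preg_cam
        self.cam.append((reg, itag, preg))

    def rename_source(self, reg):
        """Search the CAM for the most recent instance first; fall back
        to the mapping table on a miss."""
        for r, itag, preg in reversed(self.cam):
            if r == reg:
                return preg
        return self.table[reg]
```

A misprediction whose wrong-path instances are all still in the FIFO is undone simply by dropping those entries; only instances that have spilled into the table need recovery-list processing.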
Figure 6.1: Block Diagram of Hybrid Renaming
To compare the performance benefit from a full mapping table, full CAM lookup, and
hybrid, Table 6.1 lists the average misprediction penalty and instructions per cycle (IPC)
for CAM depths (number of entries divided by decode width) of 0, 2, 4, and 8. The misprediction
penalty includes all the pipeline bubbles in the various stages. A branch may wait
to be issued for several cycles. These cycles are also included, except when the pipeline
stalls due to data dependencies. In addition, an extra cycle may be included if instructions
have to be re-fetched from the mispredicted branch's block. A significant performance
improvement is observed with the CAM lookup because the average misprediction penalty
is reduced. After about half the total depth (CAM = 4), there is a marginal improvement
in performance compared to a full CAM lookup (CAM = 8). Therefore, the hybrid
CAM/table is a good compromise between cost and performance.
Table 6.1: Bad Branch Penalty and Performance
CAM=0 CAM=2 CAM=4 CAM=8
Arch Suite Issue Penalty IPC Penalty IPC Penalty IPC Penalty IPC
SDSP Int 4 7.7 2.63 6.4 2.70 5.8 2.74 5.5 2.75
SPARC Int 4 8.0 2.20 6.6 2.29 6.0 2.32 5.7 2.35
SPARC FP 4 8.1 1.58 6.8 1.61 6.1 1.63 5.8 1.64
SDSP Int 8 8.8 3.70 7.4 3.85 6.7 3.93 6.2 3.99
SPARC Int 8 10.0 2.70 8.4 2.84 7.5 2.91 7.0 2.96
SPARC FP 8 10.6 1.91 9.1 1.96 8.1 2.00 7.4 2.03
6.1.3 Intrablock Decoding
Many operands are dependent on a result in the same block or in the recent past.
This is shown in Figure 6.2. It shows the number of instructions between a source operand
and the creation of the register it is referencing. In a block of four instructions, each with
one operand (since about half are usually constant or not used), about 1.2 operands are
expected to be dependent within that block. If intrablock dependencies are detected, the
number of CAM ports required to search the lookup entries may be reduced. If there is not
enough time in the decode stage to determine the intrablock dependencies, then pre-decode
bits in the instruction block can be used. Each source operand in a block requires log₂N
bits to encode which of the previous N-1 instructions it is dependent on, or none at all.
In addition, if an operand is dependent on an instruction which comes before the starting
position of a block, the dependency information is ignored. When each line is brought into
the instruction cache, or after the first access, the line containing a block of instructions
is annotated with 2N log₂N bits (for two source operands) indicating if an instruction is
dependent on another one in the same block.
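The pre-decode annotation can be sketched as a minimal Python model, assuming a hypothetical encoding of 0 for "no intrablock dependence" and an explicit backward distance otherwise:

```python
import math

def predecode_bits(block):
    """Annotate a block of N instructions with, per source operand,
    a log2(N)-bit code: the distance back to the defining instruction
    in the same block, or 0 for no intrablock dependence.
    `block` is a list of (dest_reg, [src_regs]); a hypothetical layout."""
    n = len(block)
    bits_per_operand = math.ceil(math.log2(n))
    codes = []
    last_def = {}                        # reg -> index of latest definition
    for i, (dest, srcs) in enumerate(block):
        for s in srcs:
            if s in last_def:
                codes.append(i - last_def[s])   # distance, 1..N-1
            else:
                codes.append(0)                 # defined outside the block
        last_def[dest] = i
    total_bits = 2 * n * bits_per_operand       # two operands per instruction
    return codes, total_bits
```

Definitions must be recorded after a source is resolved, so an operand never matches its own instruction's destination.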
Cumulative percentage of source operands dependent on an instruction within a given distance:
Distance 0 1 2 4 8 16 32 64
SDSP 0.0% 46.3% 56.6% 66.6% 75.3% 84.9% 90.6% 93.3%
SPARC 0.0% 27.3% 40.9% 53.1% 66.8% 76.2% 82.6% 86.9%
Figure 6.2: Dependence Distance for SDSP/SPARC
After running simulations using intrablock detection, decoding four instructions requires
one less CAM port to search the lookup array for equivalent performance. Instead of
needing five or six CAM ports, the CAM lookup can now use four or five.
size is doubled to eight instructions, about four out of eight register operands are expected
to be intrablock dependent. Therefore, instead of doubling the number of CAM ports when
the decode size doubles, an increase of only one port is needed, to about five or six.
Referring back to Figure 6.1, the intrablock decoding now can be done before any
CAM lookups or table mappings. As N increases, the number of operands needed for CAM
and table lookup only increases slightly because it becomes more likely that operands will
be dependent within the same block. Hence, the hybrid CAM/table renaming scheme is
scalable.
6.2 Register File Utilization
A disadvantage of allocating a physical register at decode time is that physical registers
go unused until they receive their result value. As a result, a good portion of the
register file (RF) is wasted most of the time. The total register file utilization is defined to
be the ratio of the number of physical registers with a useful value to the total number
of physical registers. In addition, the speculative register file utilization is the ratio of the
number of physical registers with a useful speculative value to the total number of physical
registers reserved for speculative results (does not include committed registers). A value
is considered to be useful if it is needed to ensure proper execution of the machine. With
speculative execution and precise interrupts, this occurs from the time a register receives
its result until it is committed.
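These utilization metrics can be sketched in Python; the per-cycle trace of useful speculative register counts is a toy stand-in for the simulation data behind Table 6.2:

```python
from statistics import mean

def rf_utilization(per_cycle_useful_spec, spec_regs, committed_regs):
    """Compute average speculative and total RF utilization plus the
    90th-percentile useful speculative count from a per-cycle trace.
    Committed-state registers always count as useful, so they are added
    to every cycle's count for the total-utilization ratio."""
    total_regs = spec_regs + committed_regs
    spec_util = mean(per_cycle_useful_spec) / spec_regs
    total_util = mean(c + committed_regs for c in per_cycle_useful_spec) / total_regs
    ordered = sorted(per_cycle_useful_spec)
    p90 = ordered[int(0.9 * (len(ordered) - 1))]  # simple nearest-rank percentile
    return spec_util, total_util, p90
```

Because the committed state pins one physical register per logical register, total utilization has a floor, which is why the speculative ratio is the more revealing metric.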
Table 6.2 shows the average speculative and total register file utilization per cycle
for 4-way and 8-way superscalar processors. The mean, median, and 90th percentile of
Table 6.2: Average Register File Utilization per Cycle
Arch Suite Issue % spec % total mean median 90th %
SDSP Int 4 26.3 63.2 8.43 8 17
SPARC Int 4 16.0 84.0 5.11 4 12
SPARC FP 4 6.3 53.2 2.03 1 6
SDSP Int 8 24.6 49.7 15.73 15 32
SPARC Int 8 12.4 72.0 7.91 5 21
SPARC FP 8 4.8 54.4 2.80 1 9
the number of useful speculative registers are shown. Physical registers used to store the
state of the logical register file will always be active, so the total register file utilization is
not as meaningful as the speculative register file utilization. From the results presented,
it is observed that less than one quarter of the registers reserved for speculative execution
are used on the average. Less than half of the available speculative registers are used 90%
of the time. The floating point RF used in the SPARC SPECfp95 has an extremely low
utilization: less than 6%. Hence, the majority of speculative registers are going to waste
most of the time.
6.3 Dynamic Result Renaming
As has been shown, many physical registers in the register file have no value or
contain a useless value. Therefore, one way to reduce the size of the RF is to improve
its utilization. Physical registers allocated with no value can be virtually eliminated by
allocating at result write time instead of decode time. This is accomplished by splitting
the register file into multiple banks, each bank with two read ports and one write port, as
shown in Figure 6.3. Each bank maintains its own free list (see Figure 6.1), and old physical
registers are freed when an instruction commits. In addition, a bank is directly connected
to one result bus. When functional units arbitrate for result buses, a free register is needed
in each bank. The allocation of a register cannot be done at decode time, since it is not
known exactly on which functional unit and bus a result will eventually arrive. On the other
hand, by allocating the entry when results are written, multiple banks can be used with one
write port and have no conflicts with writing results into the same bank. As a result of
allocating physical registers at result write time, the size of each bank can remain constant
as the number of banks increases in proportion to the issue width.
Although allocating physical registers at write time creates no conflicts for the single
write port in a bank, the two read ports on one bank can cause contention with the instruction
window. For example, three ALUs could require three operands from a single bank.
With only two read ports, one ALU would not be able to issue its instruction. Even though
this event can happen, it is not a likely event for two reasons. First, not every instruction
issued requires two register operands. Some have one operand, while others require an
immediate value. Second, most instructions issued bypass one of their operands from the result
of an instruction completed in the previous cycle. Consequently, such a limited number of
read ports per bank has a very limited impact on performance.
Table 6.3 demonstrates this fact by showing the distribution of read operand types,
and the percentage of individual operand requests failed due to insufficient ports. If the
operand is a register, then it can originate from the first or second level of bypassing, the first
or second read port of a bank, or be an identical register read from the first or second read
port. On the other hand, if it is not a register operand, then it can be an immediate value, a
zero value, or no operand at all. Interestingly, although a significant percentage of register
operands came from the first read port, few required the second read port. Furthermore, a
Figure 6.3: Block Diagram of Scalable Register File
large percentage of registers are bypassed, especially at the first level. Since many operands
are bypassed, a traditional RF for the map-on-decode case could reduce the number of read ports
by about 50%. The number of write ports, however, must remain the same. Consequently,
although the size of its RF may be reduced, this does not lead to a scalable solution.
The allocation of registers for results is pipelined. Two cycles before the result will
be ready for writing, write arbitration takes place. The free list of each bank is searched
for an available register. Result buses are assigned to a bank with a free register in a round-
robin process. For example, consider a register file with four banks. In one cycle, three
results are assigned to the first three banks. In the next cycle, the first result bus is assigned
to the fourth bank, and the remaining results are assigned starting with the first bank. If a bank
has no free register, that bank is skipped in the assignment process.
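The round-robin assignment with bank skipping can be sketched as a minimal Python model; representing a stalled result as None is an illustrative convention, not the hardware signal:

```python
def assign_results_to_banks(free_lists, results, start_bank):
    """Round-robin assignment of result buses to RF banks, skipping banks
    with no free register. `free_lists` holds per-bank lists of free
    physical registers. Returns (assignments, next_start_bank), where an
    assignment of None means that result must stall."""
    n_banks = len(free_lists)
    assignments = []
    bank = start_bank
    for result in results:
        placed = None
        for _ in range(n_banks):            # probe each bank at most once
            if free_lists[bank]:
                placed = (result, bank, free_lists[bank].pop())
                bank = (bank + 1) % n_banks
                break
            bank = (bank + 1) % n_banks     # skip exhausted bank
        assignments.append(placed)
    return assignments, bank
```

Carrying the rotation pointer across cycles reproduces the example in the text: three results fill banks one through three, and the next cycle's first result lands in the fourth bank.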
If there should be more results than available banks, then the pipeline for the functional
unit which cannot write its results is stalled. Also, another instruction may be issued
to the functional unit before the instruction window is notified that the functional unit is
stalled. Consequently, two instructions with results may be waiting in the functional unit’s
pipeline. This does not create a problem since already two levels of bypassing exist. If
subsequent instructions require a stalled result, then the result continues to use the bypass
network until it is written to the register file.
In order to be able to allocate registers at result write time and be able to do register
renaming for out-of-order execution, two types of renaming must take place. Before an
instruction is placed into the instruction window, its destination operand is renamed to a
unique instruction tag (itag) and inserted into the mapping table. This contrasts to renaming
the register to a physical register since allocation has not taken place yet. After the physical
register has been allocated for the result, then the instruction window, mapping table, and
Table 6.3: Read Operand Category Distribution (%)
4 Issue 8 Issue
Architecture SDSP SPARC SDSP SPARC
SPEC95 Int Int FP Int Int FP
Bypass Level 1 25.1 19.7 19.4 19.7 13.8 21.5
Bypass Level 2 4.8 4.3 2.6 4.3 2.9 2.3
Read Port 1 13.2 18.1 20.4 18.1 9.6 18.8
Read Port 2 0.8 1.5 5.2 1.5 1.1 3.4
Identical 0.3 0.6 1.6 0.6 1.0 2.5
Zero Value 9.3 9.3 8.7 9.3 13.8 8.7
Imm. Value 32.7 32.7 38.8 32.7 54.5 38.9
No Operand 13.8 13.8 3.3 13.8 3.5 4.0
Failed Read 0.2 0.2 7.3 0.2 7.4 5.1
recovery list need to be notified of the physical register (preg). Using the result’s destination
register identifier, the mapping table is updated, if necessary. Matching itag entries in the
instruction window receive the preg. Entries in the instruction window are already matched
to mark their operands ready, so there is little additional cost involved beyond storage.
A problem exists with updating the preg in the recovery list. The recovery list may
contain an entry with an invalid old preg because its value is not ready. As a result, there
needs to be some way of finding that entry so it can be updated. The old preg in the recovery
list can be updated by matching the itag using CAM cells. A more efficient mechanism would
be to store the next itag in the recovery list. When the entry in the recovery list is marked
complete, the next itag is read, and the preg is written into the entry indexed by the next
itag.
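The next-itag forwarding can be sketched in Python; the dict-of-dicts recovery list is an illustrative stand-in for the itag-indexed RAM structure:

```python
def write_result_preg(recovery_list, itag, preg):
    """When the result for `itag` receives its physical register, mark its
    recovery-list entry complete and forward the preg into the entry of the
    next instance of the same logical register via the stored next-itag,
    avoiding any CAM search. Entries are dicts with keys 'old_preg',
    'next_itag', and 'completed'; the layout is a sketch."""
    entry = recovery_list[itag]
    entry['completed'] = True
    nxt = entry['next_itag']
    if nxt is not None:                  # a later instance exists
        recovery_list[nxt]['old_preg'] = preg
```

The next-itag pointer turns the update into a single indexed read followed by an indexed write, which is why it is cheaper than matching itags with CAM cells.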
The greatest cost in hardware using dynamic result renaming is the full multiplexer
network used for reading and writing registers. This cost, however, does not begin to
compare to the enormous time and space savings from using a register file with two read
ports and one write port. The bypass network is still required for the map-on-decode case. In
addition, the bypass network might be reduced by implementing suggestions by Ahuja et al. [2].
6.3.1 Deadlocks
The most critical aspect of using dynamic result renaming is avoiding deadlock
situations. If there are fewer speculative registers than entries in the recovery list, then it
is possible for all registers to be allocated with results still pending, creating a deadlock.
To guarantee a deadlock will not occur, two conditions must hold:
1. The oldest instruction must be able to issue its instruction.
2. The oldest instruction must be able to write its result.
It may occur that the oldest instruction is not able to issue if the functional
unit it requires is stalled and there are no free registers available. Therefore, if this
situation arises, the oldest instruction, and only the oldest instruction, is permitted to issue
to a functional unit whose results are stalled and latched into its two pipeline stages.
When it completes, the result is delivered to the register file and written using the old physical
register stored in the recovery list (thereby guaranteeing the second condition).
Moreover, if the oldest instruction is unable to issue or complete its instruction, then
the processor temporarily executes in a scalar manner. In addition, each functional unit
must have access to enough banks to hold at least as many registers as the number of entries in the
recovery list plus the number of logical registers (in most situations, the entire register file).
6.4 Implementation
This section describes implementation details for hybrid renaming with dynamic result
renaming. First, the structures and procedures used for renaming source operands are
explained. Then the structures and procedures used for renaming destination operands during
decode and result write time are explained.
6.4.1 Source Operand Renaming
Register renaming can be performed by using a mapping table. A CAM lookup can
also be used to perform hybrid renaming. In addition, intrablock detection of dependent
operands can be used. This section describes some of the implementation requirements for
a CAM lookup and a mapping table. Also, the steps involved in renaming source operands
and destination operands are described.
CAM Lookup
The CAM lookup is composed of a destination register identifier, an instruction tag
field, a physical register field, and a ready field. The name, number of bits required, type of
cell, and description of each field for the CAM lookup is given in Table 6.4. The number
of logical registers is represented by L; the number of speculative instructions allowed is
represented by S; and the number of physical registers is represented by R. For L = 32,
S = 64, and R = 96, 5 bits are required for the destination register, 6 bits are required for
the instruction tag, and 7 bits are required for the physical register number.
Table 6.4: CAM Lookup Fields
Name Bits Type Description
Reg log₂L CAM Logical destination register identifier
Itag log₂S RAM Unique instruction tag
PReg ⌈log₂R⌉ RAM Physical register number
Ready 1 RAM Indicates if PReg field is valid (result has completed)
Mapping Table
The mapping table is composed of an instruction tag field, a physical register field,
and a ready field. The name, number of bits required, and description of each field for the
mapping table is given in Table 6.5. These fields are the same as the CAM lookup, except
the logical register indexes into the table instead of being stored as CAM.
Table 6.5: Mapping Table Fields
Name Bits Description
Itag log₂S Unique instruction tag
PReg ⌈log₂R⌉ Physical register number
Ready 1 Indicates if PReg field is valid (result has completed)
Renaming Procedure
The steps performed in renaming a source operand are summarized as follows (refer
to Figure 6.1):
1. Intrablock Detection: Pre-decode bits are used to indicate which instruction within
the block an operand is dependent upon, if any. The instruction tag of the instruction it is dependent
upon is returned to the instruction window.
2. CAM Lookup: Remaining operands from intrablock detection are searched for the
most recent matching destination register identifier using content addressable memory.
If the register has its value ready, then the physical register is used for the
operand. Otherwise, the instruction tag is used for the operand.
3. Mapping table: Remaining operands from CAM lookup are indexed into the mapping
table. If the register has its value ready, the physical register is returned. Otherwise,
the instruction tag is returned to the instruction window.
The intrablock detection is optional. This reduces the number of required ports for
the CAM lookup. In addition, the CAM lookup is not required for renaming operands, but
the mapping table can be used by itself.
Instead of accessing the CAM lookup and then the mapping table in series, they may
be accessed in parallel. This requires 2N ports for the mapping table's RAM cells, which
would satisfy the worst case possible. A serial access, on the other hand, allows the number
of ports for the mapping table’s RAM cells to be significantly reduced. However, accessing
the CAM lookup and mapping table sequentially may not be possible in one cycle. Most
likely another cycle will be required to read the mapping table for the remaining operands.
The instruction window can tolerate an additional cycle delay in receiving the remaining
renamed operands with no observable impact on performance. This is possible because
the majority of instructions are waiting for a counterpart operand to become ready before
being issued.
6.4.2 Destination Operand Renaming
This section describes the procedure used in renaming a destination operand. The
steps for dynamic result renaming are also listed. The structure used for the recovery list
and free list are detailed.
Recovery List
When a destination register is renamed and entered into the mapping table, the old
mapping is recorded into the recovery list. Table 6.6 lists the name, number of bits, and
description for its three fields: old physical register, next instruction tag, and completed bit.
Table 6.6: Recovery List Fields
Name Bits Description
Old PReg ⌈log₂R⌉ Physical register number of previous instance of the same logical reg
Next Itag log₂S Instruction tag of the next instance of the same logical register
Completed 1 Indicates if instruction has completed
Operand Events
The sequence of events a destination operand experiences from decode time until it
commits its result are summarized as follows:
1. Rename the destination logical register identifier to a unique instruction tag. Instruction
tags are numbered sequentially.
2. Enter instruction tag and destination register identifier into CAM lookup.
3. Destination register exits CAM lookup and is entered into the mapping table. Its
logical register identifier is used to index into the mapping table. The entry is read
and used in the next two steps. The new instruction tag, physical register number,
and ready bit are written into that entry.
4. The old physical register from the mapping table entry is written into the recovery
list.
5. The old instruction tag read from the mapping table entry is used to index the recovery list and write the
new instruction tag into the Next Itag field.
6. A physical register is allocated when the result is ready. The physical register is written
to the appropriate entry in either the CAM lookup or the mapping table (explained
in the following section).
7. When the register commits its result, the old physical register number is read from
the recovery entry and then freed.
Dynamic Result Renaming
In order to perform dynamic result renaming to multiple banks of a register file, a
free list must be maintained for each bank. Only a single bit is required to indicate if a
register is allocated. As a result, the storage for this is not complicated. For example, a
RF with 96 registers using 8 banks would require 12 bits per free list of each bank. When
results are committed, the old physical registers are read from the recovery list. A mask
can be generated for each bank according to the registers to be freed. Then the old register
mask for each bank is ORed with the corresponding free list.
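The mask generation and OR step can be sketched with integer bitmasks; the preg-to-bank mapping (p // bank_size) is an illustrative assumption:

```python
def free_on_commit(free_masks, freed_pregs, bank_size):
    """Commit-time freeing: build a per-bank mask of registers to free
    (one bit per register) and OR it into each bank's free-list mask.
    Physical register p is assumed to live in bank p // bank_size at
    bit index p % bank_size. A 96-register file with 8 banks gives
    12 bits per bank mask, as in the text's example."""
    masks = [0] * len(free_masks)
    for preg in freed_pregs:
        masks[preg // bank_size] |= 1 << (preg % bank_size)
    return [fm | m for fm, m in zip(free_masks, masks)]
```

Because freeing is just a wide OR, an arbitrary number of registers can be returned to the free lists in one cycle with no per-register ports.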
The steps involved in renaming a destination register during result write time are
summarized as follows:
1. Two cycles before the result is ready, registers are allocated for all banks by searching
the appropriate free list.
2. If the allocation fails for the bank to which the result is writing, the write must stall.
Otherwise, the result writes to the assigned bank as scheduled.
3. The instruction tag of the result is used to determine if the destination operand has
moved from the CAM lookup to the mapping table.
4. The allocated physical register is written into the appropriate entry in either the CAM
lookup or the mapping table, depending on that determination. The corresponding
ready bit is also set.
5. If the destination operand has moved to the mapping table, the instruction tag of the
result is used to read the appropriate entry in the recovery list and mark the result as
completed. This entry provides the next instruction tag for that register.
6. The allocated physical register number is written to the recovery list entry pointed to
by the next instruction tag, provided that tag is valid.
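The stall rule in steps 1 and 2 can be sketched as follows. This toy Python model (the function name and the per-bank list representation are my assumptions) attempts an allocation in every bank and returns `None` when the target bank yields no register, signaling that the write must stall and retry.

```python
# Toy model of steps 1-2 of result-write renaming. In the dissertation's
# scheme the allocation is attempted two cycles before the result is ready;
# here that timing is abstracted away.

def schedule_write(free_lists, target_bank):
    # Step 1: try to allocate one register in each bank.
    allocated = {}
    for bank, regs in free_lists.items():
        if regs:
            allocated[bank] = regs.pop()
    # Step 2: stall if the target bank's allocation failed.
    if target_bank not in allocated:
        for bank, reg in allocated.items():
            free_lists[bank].append(reg)   # return unused allocations
        return None                        # caller retries next cycle
    # Return the allocations made in the non-target banks.
    for bank, reg in allocated.items():
        if bank != target_bank:
            free_lists[bank].append(reg)
    return allocated[target_bank]
```

A write targeting an empty bank stalls without disturbing the other banks' free lists; a write targeting a bank with a free register receives it.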
6.5 Performance
This section compares the performance of dynamic result renaming with renaming
during decoding. The performance is first compared using the instructions per cycle (IPC)
metric and then using the billions of instructions per second (BIPS) metric.
To begin with, the overall performance is compared using the IPC metric while the
number of speculative physical registers is varied. Figure 6.4 and Figure 6.5 show the IPC
for a 4-way and an 8-way issue processor, respectively. Mapping on decode (referred to as
the base case) does not vary with the number of speculative physical registers, because its
recovery list is fixed at 32 (or 64) entries. Result renaming, on the other hand, can use more
or fewer registers than this amount.
The performance difference from the base case is negligible, except for SPECint95
on the SPARC architecture, where a 5% decrease is observed. Increasing the number of
physical registers reduced this gap for the 8-way issue processor but did not help much for
the 4-way issue processor.
137
[Plot omitted: IPC versus speculative physical registers (8 to 48) for SDSP Int, SPARC Int,
and SPARC FP, each with its base case; IPC values range from roughly 1.49 to 2.63.]
Figure 6.4: Register File Performance Comparison for a 4-way Superscalar
[Plot omitted: IPC versus speculative physical registers (16 to 96) for SDSP Int, SPARC Int,
and SPARC FP, each with its base case; IPC values range from roughly 1.73 to 3.70.]
Figure 6.5: Register File Performance Comparison for an 8-way Superscalar
The reason a slight performance decrease is observed on the SPARC architecture
and not on the SDSP architecture is the difference in the number of logical integer registers.
The ratio of logical to speculative registers is 1:1 for the SDSP but 4.25:1 for the SPARC
(136 integer registers) on a 4-way issue processor; the ratio is cut in half when the decode
width and recovery list are doubled. When the ratio is large, speculative registers have
difficulty competing against logical registers within a bank. Logical registers can pool in
one particular bank, restricting its use; new registers then tend to be allocated from the
bank with the most free registers, and functional units stall because writing becomes limited.
This problem can be avoided by reducing the ratio and/or increasing the number of banks.
With no, or only a small, performance decrease, the number of speculative registers
can be reduced by 25% to 50%. As a result, the total and speculative utilization of the
register file increase. Sometimes performance even increases when the number of registers
is decreased; this is due not to the renaming but to a slight reduction in the branch
misprediction penalty.
In Section 2.7, it was pointed out that the complexity of the register file increases
its cycle time, and consequently the cycle time of the processor may have to increase. By
assuming the cycle time of the processor equals the cycle time of the register file, the
performance can be compared more accurately. The cycle time of the register file was
estimated using the timing model of Wilton and Jouppi [51], as described in Section 5.5.2.
The register file is similar to a direct-mapped cache without a tag. The number of ports in
the register file was taken into consideration by appropriately increasing the capacitance
of the word and bit lines.
Figure 6.6 and Figure 6.7 show the billions of instructions per second (BIPS) and the
cycle time for a 4-way and an 8-way processor, respectively. The BIPS metric reflects the
cycle time, since it is the IPC divided by the cycle time. As in Figure 6.4 and Figure 6.5,
the number of speculative registers is varied and the performance is compared to the base
case. The performance of a scalable register file is significantly higher than that of the base
case for both the SDSP and the SPARC. The increase in BIPS is approximately 25%, which
is outstanding. This can be entirely attributed to the reduction in the cycle time of the
register file. In fact, as the number of speculative registers is reduced, the performance
continues to increase, because the slight loss in IPC is outweighed by the improvement in
cycle time. The reduction in cycle time is greater for the SDSP than for the SPARC because
the SPARC's ratio of logical to speculative registers is much higher; the percentage reduction
in registers is therefore much smaller for the SPARC register file.
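The trade-off can be made concrete with a two-line calculation. Since BIPS is IPC divided by the cycle time in nanoseconds, a modest IPC loss is easily repaid by a shorter cycle. The numbers below are illustrative placeholders, not values read from the figures.

```python
def bips(ipc, cycle_time_ns):
    # 1 instruction per nanosecond = 1 billion instructions per second,
    # so BIPS = IPC / (cycle time in ns).
    return ipc / cycle_time_ns

base   = bips(2.63, 4.2)   # monolithic register file: higher IPC, slow cycle
scaled = bips(2.59, 3.1)   # banked file: ~2% IPC loss, much shorter cycle
```

Here the banked configuration wins despite its lower IPC, which is the shape of the result reported above.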
[Plot omitted: BIPS (0 to 1.0, left axis) and cycle time in ns (2.6 to 4.4, right axis) versus
speculative physical registers (8 to 48), showing SDSP Int, SPARC Int, and SPARC FP BIPS
with base cases, and SDSP and SPARC cycle times with base cases.]
Figure 6.6: BIPS and Cycle Time Performance Comparison for a 4-way Superscalar
[Plot omitted: BIPS (0 to 1.4, left axis) and cycle time in ns (2.6 to 4.4, right axis) versus
speculative physical registers (8 to 48), showing SDSP Int, SPARC Int, and SPARC FP BIPS
with base cases, and SDSP and SPARC cycle times with base cases.]
Figure 6.7: BIPS and Cycle Time Performance Comparison for an 8-way Superscalar
Chapter 7
Conclusion
This research into instruction fetching was motivated by work from my Master's
thesis [42]. That work studied different performance aspects of a superscalar microprocessor
and found that instruction fetching significantly hindered overall performance. It seemed
logical to investigate further to find the cause and a solution.
Interestingly, research into high-performance instruction fetching mechanisms
received little attention until recently, for several reasons. First, single- and double-issue
processors were adequately supplied by a simple fetching mechanism. Even with a
four-issue processor, delays from the execution stage reduce the demand on the instruction
fetcher, so the deficiency in the fetch mechanism was still not evident. Furthermore, branch
prediction was the primary focus, because penalties from misprediction overshadowed any
other fetching loss (instruction cache misses still have a significant impact). Today, a
two-level adaptive branch predictor provides excellent branch prediction, a next line and set
predictor produces good instruction fetch prediction, and instruction cache hit rates are
higher thanks to larger primary caches. As a result, the instruction fetcher of a wide-issue
processor with dynamic scheduling fails to supply instructions at an adequate rate, and the
performance loss directly attributable to the instruction fetching mechanism becomes
significant.
It is important to first understand the specific problems and limitations of a simple
fetching mechanism. To begin with, a control transfer can jump into the middle of a cache
line, reducing the number of potentially useful instructions. This effect can be mitigated
by using an extended cache line and eliminated entirely by using a self-aligned cache.
Furthermore, a control transfer disrupts the sequential accessing of instructions, limiting
the number of instructions that can be returned from a single cache line. This places an
upper bound on instruction fetching performance for a single block.
To reach this upper bound, the instruction fetcher must read more instructions from
the instruction cache than the decoder requires, allowing the remaining instructions to be
buffered for later use when a control transfer produces a short instruction run. Chapter 4
showed that prefetching can reach the upper bound in theory, but at an extreme cost in
hardware; with reasonable hardware, prefetching still significantly improves fetching
performance. To go beyond the limit imposed by single block fetching, at least two blocks
of instructions must be fetched per cycle. The results show that two-block fetching
dramatically improves instruction fetching performance.
In order to clearly identify the performance capability of a fetching mechanism, math-
ematical models were presented in Chapter 4. Given the design parameters and the proba-
bility of a control transfer, the expected instruction fetching performance can be calculated.
The models enable the production of graphs that clearly show the relationship between dif-
ferent fetching options without running hundreds of simulations. Also, they can be helpful
in the design of a new superscalar microprocessor to determine which technique will meet
its performance objective. In addition, the maximum performance of a specific fetching
mechanism can be evaluated for unlimited hardware resources.
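To give a flavor of the kind of closed-form estimate such models permit, the sketch below computes the expected number of instructions supplied by a single fetch block. The geometric run-length assumption here is mine for illustration; it is not necessarily the exact model of Chapter 4.

```python
def expected_fetch(line_size, p_transfer):
    """Expected instructions supplied per fetch from a block of line_size
    instructions, assuming each instruction independently ends the run
    (a taken control transfer) with probability p_transfer."""
    e = 0.0
    # Runs that end inside the block: k instructions with probability
    # p * (1 - p)^(k - 1).
    for k in range(1, line_size):
        e += k * p_transfer * (1 - p_transfer) ** (k - 1)
    # Runs that fill the whole block.
    e += line_size * (1 - p_transfer) ** (line_size - 1)
    return e
```

With no control transfers the block is fully used, and as the transfer probability grows the expected yield falls well below the block size, which is exactly the single-block fetch limit discussed above.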
Multiple conditional branches must be predicted in a single block if any of the
potential performance of the fetching mechanisms in Chapter 4 is to be realized. A scalable
mechanism to predict multiple branches in a block was presented in Chapter 5. It uses a
blocked PHT, which retains the accuracy of a scalar predictor. Furthermore, multiple
blocks must be accurately predicted per cycle to reach the performance potential of
two-block fetching. This is accomplished by predicting the prediction: a select table is
used to retrieve the previous prediction, so that two blocks can be accurately predicted
in parallel. The performance increase from two-block fetching dramatically outweighs the
prediction penalties.
Dual block prediction can be performed using single selection or double selection.
Single selection always outperforms double selection. On the other hand, double selection
can provide some cost savings, since it does not require a BIT table. Its most significant
benefit is that the selection bits are retrieved directly rather than computed. In most
instances, prediction using single selection completes with a comfortable timing margin,
but double selection can be useful in reducing the cycle time when the prediction is
time-critical. For example, when an instruction cache access is pipelined and completes in
two cycles, double selection may yield a lower cycle time than single selection, and this
increase in processor speed can outweigh its performance loss.
The results in Chapter 5 demonstrate that the instruction fetch mechanism with mul-
tiple block prediction and prefetching can sustain an adequate instruction fetching rate,
but branch mispredictions restrict the effective instruction fetching rate and overall perfor-
mance [23]. Unless the total penalty from branch misprediction is reduced in proportion
to the number of fetch cycles, linear speedup is impossible. With the same branch prediction
accuracy as a scalar predictor, the number of penalty cycles will at best equal the scalar
branch prediction penalties; usually it increases, due to the longer pipelines of wider-issue
processors. Consequently, it is imperative that the branch penalty not increase and, if
possible, that it be reduced.
The last problem this dissertation addressed was the scalability of the register file.
After designing a reorder buffer and instruction window, I realized the implementation
implications for the register file of a wide-issue superscalar microprocessor. Each additional
instruction issued per cycle requires two more read ports and one more write port. Since
the area is proportional to the square of the number of ports, and the cycle time increases
with more ports, a large register file can significantly increase the processor's cycle time
and area.
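The port-count argument can be made concrete with a rough calculation. The quadratic area model is stated in the text above; the function below simply instantiates it (the constant of proportionality is omitted, so the result is a relative area, not a physical one).

```python
def relative_regfile_area(issue_width):
    # Each issued instruction adds two read ports and one write port,
    # so ports grow linearly with issue width...
    ports = 2 * issue_width + issue_width
    # ...and area grows with the square of the number of ports.
    return ports ** 2

# An 8-way machine needs 24 ports versus 12 for a 4-way machine, so a
# single monolithic register file grows roughly 4x in area.
ratio = relative_regfile_area(8) / relative_regfile_area(4)
```

This quadratic growth is precisely what motivates splitting the file into multiple banks, each with a scalar port count.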
The MIPS R10000 uses a powerful register renaming technique that renames each
logical register to a physical register when instructions are decoded. Chapter 6 introduced
a new technique that renames registers during result writing instead of instruction decoding.
As a result, multiple scalar register files can be used, and the cost of the register file scales
with the number of instructions issued per cycle. Furthermore, register file utilization can
be increased by reducing the number of physical registers. Renaming at result write time
does create problems with register reads and potential deadlock, but simulation results
show the proposed technique has no significant performance drawbacks compared to
mapping during instruction decoding.
Another benefit of a scalable register file architecture is that its cycle time can be
close to that of a scalar register file. Consequently, a tremendous increase in instructions
executed per second is observed when the cycle time of the register file is taken into
account. In addition, performance can continue to increase as the number of physical
registers is decreased.
This dissertation described, modeled, and simulated different scalable instruction
fetching mechanisms. In order to increase fetching beyond the limit of single block fetch-
ing, fetching mechanisms were proposed which perform multiple branch and block predic-
tion. In addition, a scalable register file architecture was presented. All of these designs
strive to be scalable both in cost and performance.
Chapter 8
Future Directions
High-performance instruction fetching mechanisms were presented in this disser-
tation. Still, what changes could further improve performance? The largest performance
penalty in multiple block fetching comes from conditional branch misprediction.
Significant improvements over the accuracy of two-level adaptive branch prediction are
unlikely, since Chen et al. showed that it is already close to optimal [11]. On the other hand,
branch predictors other than a global two-level adaptive predictor should be studied to
determine their effectiveness at predicting multiple branches in a block.
The greatest area for improvement is the reduction of the branch misprediction pen-
alty. The complexity involved in fetching multiple blocks using multiple instruction
cache banks and a prefetch buffer may require an additional pipeline stage. This effectively
increases the misprediction penalty by one cycle. A trace cache can be used to avoid this
penalty. However, instead of using resources for a trace cache, additional buffers can be
used for wrong-path instruction fetching. With a large number of banks, unused banks are
available to fetch the first few blocks from the alternate path once a conditional branch is
encountered. Hence, when a branch is ready to execute, its alternate path is ready to be
decoded the next cycle, should it be mispredicted. This eliminates pipeline bubbles from
the instruction cache stage and any additional instruction alignment stages. In addition,
Pierce and Mudge showed that wrong-path instruction prefetching increases the instruction
cache hit rate and hides the latency of instruction cache misses [31].
The performance penalties from misselection, GHR misprediction, immediate mis-
fetches, and bank conflicts are mitigated by prefetching. Of these factors, misselection is
the largest factor reducing the performance of multiple block fetching. Therefore, addi-
tional research into different selection mechanisms is desirable. Selection accuracy might
be improved by using multiple predictors and choosing the best at run-time, similar to
choosing among multiple branch predictors. In addition, a cost savings in the selection
table can be made by eliminating GHR prediction. This can be accomplished by using a
different index for the select table. Instead of using the current GHR, the GHR from the
previous cycle can be used. Alternatively, the selection index need not be based on the
GHR. A selection history register could record the history of selection bits and be used as
an index into the select table. With either scheme, the savings in GHR misprediction
penalties might outweigh the loss in selection accuracy.
Multithreading is a technique which improves the parallelism available to the execu-
tion unit. It is unclear whether a sophisticated instruction fetching mechanism
used in a single-threaded superscalar processor is required in a multithreading processor.
Since multiple instruction streams are available, predicting multiple blocks per cycle is
not necessary. On the other hand, should a multithreaded machine begin executing in a
single-threaded fashion, then a multiple block predictor may be beneficial.
The scalable register file architecture presented in Chapter 6 may prove especially
valuable in implementing a multithreaded architecture. Further research is needed to verify
that this technique will not degrade performance in the presence of multiple threads.
Nevertheless, the large savings in area and time provide a significant improvement in cycle
time and
performance.
Furthermore, the utilization of the register file can be increased by mapping registers
to the data cache. Work done in [45] successfully mapped registers when using a reorder
buffer for renaming; this should also work well with a hybrid renaming technique and
dynamic result renaming. Mapping registers to the data cache is viable because only a
relatively small subset of registers is required for immediate use; registers not used recently,
or needed only in case of an exception, could be stored in the data cache. Consequently,
the number of physical registers could be reduced. This may be especially useful in deep
pipelines, where a large number of speculative registers is required.
Microprocessors will continue to grow in issue width, executing ever more instructions
per cycle. Continued research into instruction fetching, execution, and design issues must
take place in order to improve performance.
Bibliography
[1] Anant Agrawal. UltraSPARC: A 64-bit, high-performance SPARC processor. In
Proceedings of the Microprocessor Forum, October 1994.
[2] P. Ahuja, D. Clark, and A. Rogers. The performance impact of incomplete
bypassing in processor pipelines. In 28th Annual International Symposium on
Microarchitecture, November 1995.
[3] A. Aiken and Alex Nicolau. Optimal loop parallelization. In ACM SIGPLAN 1988
Conference on Programming Language Design and Implementation, pages 308–317,
Atlanta, Georgia, June 1988.
[4] T. Ball and J. Larus. Branch prediction for free. In 1993 SIGPLAN Conference on
Programming Language Design and Implementation, pages 300–313, June 1993.
[5] Brad Calder and Dirk Grunwald. Fast & accurate instruction fetch and branch pre-
diction. In 21st Annual International Symposium on Computer Architecture, pages
2–11, Chicago, Illinois, April 1994.
[6] Brad Calder and Dirk Grunwald. Reducing branch costs via branch alignment. In
Sixth International Conference on Architectural Support for Programming Languages
and Operating Systems, pages 242–251, October 1994.
[7] Brad Calder and Dirk Grunwald. Next cache line and set prediction. In 22nd Annual
International Symposium on Computer Architecture, pages 287–296, June 1995.
[8] Bradley Gene Calder. Hardware and Software Mechanisms for Instruction Fetch
Prediction. PhD thesis, University of Colorado, December 1995.
[9] Andrea Capitanio, Nikil Dutt, and Alexandru Nicolau. Design considerations for
limited connectivity VLIW architectures. TR 92-95, University of California, Irvine,
ICS Dept., 1992.
[10] Andrea Capitanio, Nikil Dutt, and Alexandru Nicolau. Partitioned register files
for VLIWs: A preliminary analysis of tradeoffs. In 25th Annual International
Symposium on Microarchitecture, pages 292–300, Portland, Oregon, December 1992.
[11] I-Cheng K. Chen, John T. Coffey, and Trevor N. Mudge. Analysis of branch pre-
diction via data compression. In Seventh International Conference on Architectural
Support for Programming Languages and Operating Systems, October 1996.
[12] Bob Cmelik and David Keppel. Shade: A fast instruction-set simulator for execution
profiling. In ACM SIGMETRICS, 1994.
[13] Thomas M. Conte, Kishore N. Menezes, Patrick M. Mills, and Burzin A. Patel.
Optimization of instruction fetch mechanisms for high issue rates. In 22nd Annual
International Symposium on Computer Architecture, pages 333–344, June 1995.
[14] Val Popescu et al. The Metaflow architecture. IEEE Micro, pages 10–13, 63–73,
June 1991.
[15] Keith I. Farkas, Norman P. Jouppi, and Paul Chow. Register file design considerations
in dynamically scheduled processors. In Second International Symposium on High-
Performance Computer Architecture, pages 40–51, February 1996.
[16] J. A. Fisher and S. M. Freudenberger. Predicting conditional branch directions from
previous runs of a program. In Fifth International Conference on Architectural
Support for Programming Languages and Operating Systems, pages 85–95, October
1992.
[17] G. F. Grohoski. Machine organization of the IBM RS/6000 processor. IBM Journal
of R&D, 34(1):37–58, January 1990.
[18] Mark A. Horowitz. Timing models for MOS circuits. TR SEL83-003, Integrated
Circuits Laboratory, Stanford University, 1983.
[19] Mike Johnson. Superscalar Microprocessor Design. Prentice Hall, Englewood
Cliffs, 1991.
[20] David R. Kaeli and Philip G. Emma. Branch history table prediction of moving tar-
get branches due to subroutine returns. In 18th Annual International Symposium on
Computer Architecture, pages 34–42, May 1991.
[21] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall, Englewood
Cliffs, NJ, 1992.
[22] D. J. Kuck, Y. Muraoka, and S. Chen. On the number of operations simultaneously
executable in fortran-like programs and their resulting speedup. IEEE Transactions
on Computers, C-21:1293–1310, December 1972.
[23] Monica S. Lam and Robert P. Wilson. Limits of control flow on parallelism. In 19th
Annual International Symposium on Computer Architecture, pages 46–57, 1992.
[24] Johnny K. F. Lee and Alan J. Smith. Branch prediction strategies and branch target
buffer design. IEEE Computer, pages 6–22, January 1984.
[25] Scott McFarling. Combining branch predictors. TN 36, DEC-WRL, June 1993.
[26] Scott McFarling and John Hennessy. Reducing the cost of branches. In 13th Annual
International Symposium of Computer Architecture, 1986.
[27] Ravi Nair. Optimal 2-bit branch predictors. IEEE Transactions on Computers,
44(5):698–702, May 1995.
[28] A. Nicolau and J. A. Fisher. Measuring the parallelism available for very long instruc-
tion word architectures. IEEE Transactions on Computers, C-33:968–976, November
1984.
[29] Shien-Tai Pan, Kimming So, and Joseph T. Rahmeh. Improving the accuracy of dy-
namic branch prediction using branch correlation. In Fifth International Conference
on Architectural Support for Programming Languages and Operating Systems, pages
76–84, Boston, Massachusetts, October 12–15, 1992.
[30] David B. Papworth. Tuning the pentium pro microarchitecture. IEEE Micro, pages
8–15, April 1996.
[31] Jim Pierce and Trevor Mudge. Wrong-path instruction prefetching. In 29th Annual
International Symposium on Microarchitecture, December 1996.
[32] E. M. Riseman and C. C. Foster. The inhibition of potential parallelism by conditional
jumps. IEEE Transactions on Computers, C-21:1405–1411, December 1972.
[33] Eric Rotenberg, Steve Bennett, and James E. Smith. Trace cache: a low latency
approach to high bandwidth instruction fetching. In 29th Annual International
Symposium on Microarchitecture, December 1996.
[34] Andre Seznec. Don’t use the page number, but a pointer to it. In 23rd Annual
International Symposium on Computer Architecture, pages 104–113, May 1996.
[35] Andre Seznec, Stephan Jourdan, Pascal Sainrat, and Pierre Michaud. Multiple-
block ahead branch predictors. In Seventh International Conference on Architectural
Support for Programming Languages and Operating Systems, October 1996.
[36] J. E. Smith and A. R. Pleszkun. Implementing precise interrupts in pipelined proces-
sors. IEEE Transactions on Computers, C-37:562–573, May 1988.
[37] M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on multiple instruction is-
sue. In Third International Conference on Architectural Support for Programming
Languages and Operating Systems, pages 290–302, April 1989.
[38] S. Peter Song, Marvin Denman, and Joe Chang. The PowerPC 604 RISC micropro-
cessor. IEEE Micro, pages 8–17, October 1994.
[39] Marc Tremblay and J. Michael O’Connor. UltraSparc I: A four-issue processor sup-
porting multimedia. IEEE Micro, pages 42–49, April 1996.
[40] Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo,
and Rebecca L. Stamm. Exploiting choice: Instruction fetch and issue on an im-
plementable simultaneous multithreading processor. In 23rd Annual International
Symposium on Computer Architecture, May 1996.
[41] David Wall. Limits of instruction-level parallelism. Technical Report 93/6, Digital
Equipment Corporation, November 1993.
[42] Steven Wallace. Performance analysis of a superscalar architecture. Master’s thesis,
University of California, Irvine, 1993.
[43] Steven Wallace and Nader Bagherzadeh. Performance issues of a superscalar micro-
processor. Microprocessors and Microsystems, 19(4):187–199, May 1995.
[44] Steven Wallace and Nader Bagherzadeh. Instruction fetching mechanisms for super-
scalar microprocessors. In Euro-Par ’96, August 1996.
[45] Steven Wallace and Nader Bagherzadeh. Resource efficient register file architectures.
Technical report, University of California, Irvine, ECE Department, December 1996.
[46] Steven Wallace and Nader Bagherzadeh. A scalable register file architecture for dy-
namically scheduled processors. In Proceedings of the 1996 Conference on Parallel
Architectures and Compilation Techniques, pages 179–184, October 1996.
[47] Steven Wallace and Nader Bagherzadeh. Multiple block and branch prediction.
In Third International Symposium on High-Performance Computer Architecture,
February 1997.
[48] Steven Wallace, Nirav Dagli, and Nader Bagherzadeh. Design and implementation of
a 100 MHz centralized instruction window for a superscalar microprocessor. In 1995
International Conference on Computer Design, October 1995.
[49] David L. Weaver and Tom Germond. The SPARC Architecture Manual, Version 9.
PTR Prentice Hall, Englewood Cliffs, NJ, 1994.
[50] Chih-Po Wen. Improving instruction supply efficiency in superscalar architectures
using instruction trace buffers. In Proceedings of the 1992 ACM/SIGAPP Symposium
on Applied Computing, pages 28–36, 1992.
[51] Steven J. E. Wilton and Norman P. Jouppi. An enhanced access and cycle time model
for on-chip caches. TR 93/5, Digital Equipment Corporation Western Research Lab,
July 1994.
[52] Kenneth C. Yeager. MIPS R10000 superscalar microprocessor. IEEE Micro, pages
28–40, April 1996.
[53] Tse-Yu Yeh. Two-Level Adaptive Branch Prediction and Instruction Fetch Mech-
anisms for High Performance Superscalar Processors. PhD thesis, University of
Michigan, 1993.
[54] Tse-Yu Yeh, Deborah T. Marr, and Yale N. Patt. Increasing the instruction fetch rate
via multiple branch prediction and a branch address cache. In 7th ACM International
Conference on Supercomputing, pages 67–76, Tokyo, Japan, July 1993.
[55] Tse-Yu Yeh and Yale N. Patt. Alternative implementations of two-level adap-
tive branch prediction. In 19th Annual International Symposium on Computer
Architecture, pages 124–134, Gold Coast, Australia, May 1992.
[56] Tse-Yu Yeh and Yale N. Patt. A comparison of dynamic branch predictors that use
two levels of branch history. In 20th Annual International Symposium on Computer
Architecture, pages 257–266, San Diego, California, May 1993.