
JORDAN UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF COMPUTER AND INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER ENGINEERING

DESIGN AND IMPLEMENTATION OF A DYNAMIC SCHEDULED SUPERSCALAR PROCESSOR BASED ON HARDWARE SPECULATION

B.Sc. Graduation Project

Submitted to the Department of Computer Engineering at Jordan University of Science and Technology in partial fulfillment of the requirements for the graduation project

Prepared By:

Rawad Haddad 990027004 Muawya Al-Otoom 990027038 Abdullah Khreesha 990027043

Supervised by:

Dr. Abdullah Batayneh

May 2003


Acknowledgment

To our advisor, Dr. Abdullah Batayneh, for supporting us technically and mentally; we hope we met your expectations. To Dr. Omar Aljarrah, our inspiring dean, and to all our respected professors who spent four years teaching us and preparing us for the future. To the university that gave us all the facilities we needed to achieve our goals, thank you. To our trusted friend Ghaith Matalkah, thank you for proposing that we take on this project, and for standing with us at every step of our work. To our colleague Ghassan bin Hommam, for your technical and mental support.

The Project Team

To the people who made me who I am, in every meaning this phrase can hold: my parents! My beloved mother, who gave all the love and support a mother can give, and my beloved father, who was always there to point me in the right direction. This project is for you before anyone else. To my brother Ghaith, my predecessor and mentor in computer engineering, thanks for your technical and brotherly support! To my best and dearest friends Oday and Remi, thank you for your support and the smiles you gave me when they were so badly needed. Thanks for the water and for driving me around! To my friend Samar, for your help with the documentation, for your belief in me, and for always standing behind me. To the people who weren’t there physically but helped by keeping my spirits high: Nart, Reem, Aseel, Mary, Hanin, and Majd.

Rawad

To my father, my greatest idol. To my mother, who never gave up loving me. To my brother Muaz, who never stops caring for me. To my sisters Do'aa, Demah and little Amomah. To all my uncles and aunts, especially Awni, who is always my symbol of success. To all my grandmothers and grandfathers. To all my friends Ahmed, Sadeq, Omar, Hamzeh, Hassan and Samir, who never miss an opportunity to make me happy. To all the people I love.

Muawya

To my mother and father. To all the professors who taught me. To my brother and sisters. To my grandmother. To all my friends. To all my teammates. To anyone who searches for the truth. To anyone who wants to develop technology for all people. To anyone I have wronged.

Abdullah




Abstract

The aim of this project is to build a simulator for a dynamically scheduled superscalar processor with multiple issue, based on the hardware speculation concept, using the Verilog HDL. Apart from the main goal of implementing the simulator, we have concentrated on many advanced methodologies in computer architecture, such as pre-fetching, speculating on the outcome of branches, resolving false dependences by register renaming, in-order commit (to preserve the state of the processor against interrupts and wrong branch predictions), out-of-order execution to maintain the execution rate in the presence of true data dependences, and many other topics related to dynamic scheduling and hardware speculation.

We have built all the sub-systems of the processor and integrated them together (using Verilog). We have also written a simulator in C++, on top of the Verilog simulator, to simplify the process of tracking the processor’s outcome. This frees the user from reading tedious timing diagrams and gives a visual view of the work.

Finally, we have also provided the needed utilities for the processor, including an assembler and a linker/loader written in C++. So, one can start from an assembly program written in the ISA of the processor (described later) and end with a visual view of the processor’s work.




Table Of Contents

Chapter 1: Introduction

1.1- Overview
1.2- Theory
1.3- Processor Specifications
1.4- Verilog HDL
1.5- Software Utilities

Chapter 2: I-Memory Subsystem

2.1- Introduction
2.2- Register PC
2.3- Instruction Cache
2.4- Instruction Memory
2.5- Branch Prediction Unit

Chapter 3: Decoder_Queue & Issue Unit

3.1- Decoder
3.2- Queue
3.3- Register renaming logic
3.4- The pool
3.5- Issue Unit
3.6- Structural Hazard Detector

Chapter 4: Reservation Stations & Functional Units

4.1- Reservation Stations
4.2- The Dispatch Logic
4.3- Functional Units
4.4- Load/Store Buffers
4.5- Data Memory Subsystem
4.6- The register Files
4.7- Reorder Buffer

Conclusion
References
Appendix: Synthesized circuits & Timing Diagrams




Chapter 1: Introduction:

1.1- Overview:

All processors since about 1985, including those in the embedded space, use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP) since the instructions can be evaluated in parallel. There are two largely separable approaches to exploiting ILP. They are:

• Dynamic approaches that depend on the hardware to locate the parallelism.

• Static approaches that rely much more on software.

In practice, this partitioning between dynamic and static, and between hardware-intensive and software-intensive, is not clean, and techniques from one camp are often used by the other. We have chosen to implement the dynamic scheduling approach in our graduation project. The dynamic, hardware-intensive approaches dominate the desktop and server markets and are used in a wide range of processors, including:

• The Pentium 3 and 4

• The Athlon

• The MIPS R10000/12000

• The Sun UltraSPARC 3

• The PowerPC 603, G3, and G4

• The Alpha 21264.

The main challenge that architects encounter in designing a dynamically scheduled processor is to exploit parallelism among instructions (instruction-level parallelism).

The amount of parallelism available within a basic block (a straight-line code sequence with no branches in except at the entry and no branches out except at the exit) is quite small. Since these instructions are likely to depend upon one another, the amount of overlap we can exploit within a basic block is likely to be much less than the average basic block size. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.

Although a dynamically scheduled processor cannot change the data flow, it tries to avoid stalling when dependences, which could generate hazards, are present. In contrast, static pipeline scheduling by the compiler tries to minimize stalls by separating dependent instructions so that they will not lead to hazards. Of course, compiler pipeline scheduling can also be used on code destined to run on a processor with a dynamically scheduled pipeline.

In dynamic scheduling techniques, the hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior. Dynamic scheduling offers several advantages:

• It enables handling some cases when dependences are unknown at compile time (e.g. because they may involve a memory reference).

• It simplifies the compiler.

• It also allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline.

As we will see, the advantages of dynamic scheduling are gained at the cost of a significant increase in hardware complexity.

The purpose of this project is to design a hypothetical superscalar processor with multiple issue, based on hardware speculation, and to implement most of the well-known techniques (branch prediction, register renaming, out-of-order execution, etc.) in our design.




1.2- Theory

1.2.1 Multiple Issue Processors:

In a traditional linear pipeline, ideal execution leads to a CPI of 1 (when there are no dependences at all, or when those that exist can be overcome by bypassing). Architects have long wanted to push the CPI below 1, and multiple-issue processors make this possible. Clearly, the CPI cannot be reduced below one if we issue only one instruction every clock cycle, so multiple issue is a must to achieve this. The goal of multiple-issue processors is to allow multiple instructions to issue in a clock cycle. Multiple-issue processors come in two basic flavors:

• Superscalar processors

• VLIW (very long instruction word) processors.

Superscalar processors issue varying numbers of instructions per clock and are either:

• Statically scheduled (using compiler techniques) or

• Dynamically scheduled (using techniques based on Tomasulo’s algorithm).

Statically scheduled processors use in-order execution, while dynamically scheduled processors use out-of-order execution. Dynamic scheduling is one method for improving performance in a multiple-issue processor. When applied to a superscalar processor, dynamic scheduling has the traditional benefit of boosting performance in the face of data hazards, but it also allows the processor to potentially overcome the issue restrictions. Put another way, although the hardware may not be able to initiate execution of more than one integer and one FP operation in a clock cycle, dynamic scheduling can eliminate this restriction at instruction issue, at least until the hardware runs out of reservation stations.
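The CPI argument made at the start of this subsection can be checked with a few lines of arithmetic. The sketch below is only an idealized model: the function name `ideal_cpi` is hypothetical, and it assumes that the full issue width is used every cycle with no stalls of any kind.

```python
def ideal_cpi(issue_width: int) -> float:
    """Best-case cycles per instruction when `issue_width` instructions
    issue every clock cycle and no stalls occur (idealized assumption)."""
    if issue_width < 1:
        raise ValueError("issue width must be at least 1")
    return 1.0 / issue_width


# A single-issue pipeline cannot go below CPI = 1; wider issue can.
for width in (1, 2, 4):
    print(f"issue width {width}: ideal CPI = {ideal_cpi(width):.2f}")
```

Real programs fall short of these figures because dependences, branches, and structural hazards prevent the full issue width from being used in every cycle.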




1.2.2- Hardware Based Speculation:

As we try to exploit more instruction-level parallelism, maintaining control dependences becomes an increasing burden. Branch prediction reduces the direct stalls attributable to branches, but for a processor executing multiple instructions per clock, just predicting branches accurately may not be sufficient to generate the desired amount of instruction-level parallelism. A wide-issue processor may need to execute a branch every clock cycle to maintain maximum performance. Hence, exploiting more parallelism requires that we overcome the limitation of control dependence. Even in a simple loop, a branch hazard can cost one stall cycle per iteration; in programs with more branches and more data-dependent branches, this penalty could be larger.

Overcoming control dependence is done by speculating on the outcome of branches and executing the program as if our guesses were correct. This mechanism represents a subtle, but important, extension over branch prediction with dynamic scheduling. In particular, with speculation, we fetch, issue, and execute instructions as if our branch predictions were always correct, whereas dynamic scheduling alone only fetches and issues such instructions. Of course, we need mechanisms to handle the situation where the speculation is incorrect.

Hardware-based speculation combines three key ideas:

• Dynamic branch prediction to choose which instructions to execute.

• Speculation to allow the execution of instructions before the control dependences are resolved (with the ability to undo the effects of an incorrectly speculated sequence).

• Dynamic scheduling to deal with the scheduling of different combinations of basic blocks. (In comparison, dynamic scheduling without speculation only partially overlaps basic blocks because it requires that a branch be resolved before actually executing any instructions in the successor basic block.)




Hardware-based speculation follows the predicted flow of data values to choose when to execute instructions. This method of executing programs is essentially data-flow execution: operations execute as soon as their operands are available.

Our goal in this project is to implement speculative execution based on Tomasulo’s algorithm. The ideas are applied only to integer operations, but they can easily be extended to support floating-point operations.

The hardware that implements Tomasulo’s algorithm can be extended to support speculation. To do so, we must separate the bypassing of results among instructions, which is needed to execute an instruction speculatively, from the actual completion of an instruction. By making this separation, we can allow an instruction to execute and to bypass its results to other instructions without allowing it to perform any updates that cannot be undone, until we know that the instruction is no longer speculative. Using the bypassed value is like performing a speculative register read, since we do not know whether the instruction providing the source register value is providing the correct result until that instruction is no longer speculative. When an instruction is no longer speculative, we allow it to update the register file or memory; we call this additional step in the instruction execution sequence instruction commit.

The key idea behind implementing speculation is to allow instructions to execute out of order but to force them to commit in order, and to prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits. In a simple single-issue five-stage pipeline, we could ensure that instructions committed in order, and only after any exceptions for that instruction had been detected, simply by moving writes to the end of the pipeline. When we add speculation, we need to separate the process of completing execution from instruction commit, since instructions may finish execution considerably before they are ready to commit. Adding this commit phase to the instruction execution sequence requires some changes to the sequence as well as an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed. This hardware buffer, which we call the Reorder Buffer, is also used to pass results among instructions that may be speculated.

The Reorder Buffer (ROB) provides additional registers in the same way as the reservation stations in Tomasulo’s algorithm extend the register set. The ROB holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits. Hence, the ROB is a source of operands for instructions, just as the reservation stations provide operands in Tomasulo’s algorithm. The key difference is that in Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file. With speculation, the register file is not updated until the instruction commits (and we know definitively that the instruction should execute).

Here are the four steps involved in instruction execution:

1. Issue: get an instruction from the instruction queue. If there is an empty reservation station and an empty slot in the ROB, rename the operands and issue the instruction to the reservation stations (RS’s). Update the control entries to indicate that the buffers are in use. The number of the ROB entry allocated for the result is also sent to the reservation station, so that the number can be used to tag the result when it is placed on the CDB. If either all reservation stations are full or the ROB is full, then instruction issue is stalled until both have available entries.

2. Execute: if one or more of the operands is not yet available, monitor the CDB while waiting for the register to be computed. This step checks for RAW hazards. When both operands are available at a reservation station, execute the operation. Instructions may take multiple clock cycles in this stage, and loads still require two steps in this stage. Stores need only have the base register available at this step, since execution for a store at this point is only effective address calculation.

3. Write result: when the result is available, write it to the Physical Register file and to the ROB. Send the tag of the destination register on the CDB to all the RS’s, so that any waiting instruction can fetch the result from the Physical Register file. Mark the reservation station as available. Special

4. Commit: there are three different sequences of actions at commit, depending on whether the committing instruction is a branch with an incorrect prediction, a store, or any other instruction (normal commit). The normal commit case occurs when an instruction reaches the head of the ROB and its result is present in the buffer; at this point, the processor updates the register with the result and removes the instruction from the ROB. Committing a store is similar, except that memory is updated rather than a result register. When a branch with an incorrect prediction reaches the head of the ROB, it indicates that the speculation was wrong. The ROB is flushed and execution is restarted at the correct successor of the branch. If the branch was correctly predicted, the branch is finished. Some machines call this commit phase “completion” or “graduation”.
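The interplay between out-of-order completion and in-order commit in the four steps above can be sketched in a few lines of Python. This is a simplified illustration, not the project's Verilog design: the class and method names (`RobEntry`, `ReorderBuffer`, `issue`, `write_result`, `commit`) are hypothetical, reservation stations and the CDB are not modeled, and results are kept only in the ROB entry rather than also in a physical register file.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    kind: str                      # "alu", "store", or "branch"
    dest: Optional[int] = None     # register number (alu) or address (store)
    value: Optional[int] = None    # result, filled in at write-result time
    ready: bool = False            # True once execution has finished
    mispredicted: bool = False     # only meaningful for branches


class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()     # head of the ROB is entries[0]

    def issue(self, kind: str, dest: Optional[int] = None) -> Optional[RobEntry]:
        """Allocate an entry in program order; None means ROB full (stall)."""
        if len(self.entries) == self.size:
            return None
        entry = RobEntry(kind, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry: RobEntry, value: Optional[int] = None,
                     mispredicted: bool = False) -> None:
        """Record a finished result; this may happen out of program order."""
        entry.value = value
        entry.mispredicted = mispredicted
        entry.ready = True

    def commit(self, regs: dict, mem: dict) -> Optional[str]:
        """Retire the head entry if it has finished; report what happened."""
        if not self.entries or not self.entries[0].ready:
            return None            # head not finished: nothing may commit
        head = self.entries.popleft()
        if head.kind == "branch":
            if head.mispredicted:
                self.entries.clear()   # squash all younger speculative work
                return "flush"
            return "branch"
        if head.kind == "store":
            mem[head.dest] = head.value
            return "store"
        regs[head.dest] = head.value
        return "normal"


# The younger instruction finishes first but cannot commit before the older one.
rob = ReorderBuffer(size=4)
regs, mem = {}, {}
first = rob.issue("alu", dest=1)
second = rob.issue("alu", dest=2)
rob.write_result(second, 20)
print(rob.commit(regs, mem))       # head (first) is still executing
rob.write_result(first, 10)
print(rob.commit(regs, mem), rob.commit(regs, mem), regs)
```

Because architectural state (`regs`, `mem`) is touched only in `commit`, flushing the buffer on a mispredicted branch is enough to discard every speculative result.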

We have tried to implement these ideas in our design. We added our own touches where we felt it necessary, but in general the overall framework of the design is very close to what is described in this section.




1.3 - Processor Specifications:

1.3.1- Instruction Set:

The ISA that we have chosen for our processor contains 16 instructions, which are enough to write programs that solve typical problems (this is called a Complete Instruction Set). Here we describe all 16 instructions and give the format of each one.

Register Arithmetic Operations:

add rd, rs, rt :

Puts the sum of the integers in register rs and register rt into register rd and checks for overflow.

Bits:   31-26   25-21  20-16  15-11  10-6   5-0
Field:  000000  rs     rt     rd     00000  100000

addi rt, rs, imm :

Puts the sum of the integer in register rs and the sign-extended immediate value imm into register rt and checks for overflow.

Bits:   31-26   25-21  20-16  15-0
Field:  001000  rs     rt     imm

sub rd, rs, rt :

Subtracts the integer in register rt from the integer in register rs, putting the result into register rd, and checks for overflow.

Bits:   31-26   25-21  20-16  15-11  10-6   5-0
Field:  000000  rs     rt     rd     00000  100010




mul rd, rs, rt :

Performs arithmetic multiplication of the lower 16 bits of register rt and the lower 16 bits of register rs, putting the result into register rd.

Bits:   31-26   25-21  20-16  15-11  10-6   5-0
Field:  000000  rs     rt     rd     00000  000011

div rd, rs, rt :

Performs integer arithmetic division of the 32-bit contents of register rs by the lower 16 bits of register rt, putting the quotient into the lower 16 bits of register rd and the remainder into the upper 16 bits.

Bits:   31-26   25-21  20-16  15-11  10-6   5-0
Field:  000000  rs     rt     rd     00000  000100
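The operand widths of mul and div differ from the usual 32-bit convention, so a small sketch may help. It is only an illustration under stated assumptions: operands are treated as unsigned, a quotient wider than 16 bits is truncated, and a zero divisor raises an error, none of which is specified in the text. The function names simply mirror the mnemonics.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul(rs: int, rt: int) -> int:
    """Multiply the lower 16 bits of rs and rt; the product fits in 32 bits."""
    return ((rs & MASK16) * (rt & MASK16)) & MASK32


def div(rs: int, rt: int) -> int:
    """Divide 32-bit rs by the lower 16 bits of rt.

    Returns the rd value: quotient in bits 15-0, remainder in bits 31-16.
    """
    divisor = rt & MASK16
    if divisor == 0:
        # Assumption: behavior on a zero divisor is not given in the text.
        raise ZeroDivisionError("zero divisor")
    quotient, remainder = divmod(rs & MASK32, divisor)
    # Assumption: a quotient that needs more than 16 bits is truncated.
    return ((remainder & MASK16) << 16) | (quotient & MASK16)


print(hex(mul(0x00010003, 0x00000004)))   # upper halves of operands ignored
print(hex(div(100, 7)))                   # remainder 2 above quotient 14
```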

Register Logic Operations:

and rd, rs, rt :

Puts the logical AND of the integers from registers rs and rt into register rd.

Bits:   31-26   25-21  20-16  15-11  10-6   5-0
Field:  000000  rs     rt     rd     00000  100100

or rd, rs, rt :

Puts the logical OR of the integers from registers rs and rt into register rd.

Bits:   31-26   25-21  20-16  15-11  10-6   5-0
Field:  000000  rs     rt     rd     00000  100101

xor rd, rs, rt :

Puts the logical XOR of the integers from registers rs and rt into register rd.

Bits:   31-26   25-21  20-16  15-11  10-6   5-0
Field:  000000  rs     rt     rd     00000  100110




andi rt, rs, imm :

Puts the logical AND of the value in register rs and the zero-extended immediate value imm into register rt.

Bits:   31-26   25-21  20-16  15-0
Field:  001100  rs     rt     imm

ori rt, rs, imm :

Puts the logical OR of the value in register rs and the zero-extended immediate value imm into register rt.

Bits:   31-26   25-21  20-16  15-0
Field:  001101  rs     rt     imm

xori rt, rs, imm :

Puts the logical XOR of the value in register rs and the zero-extended immediate value imm into register rt.

Bits:   31-26   25-21  20-16  15-0
Field:  001110  rs     rt     imm

Load and Store Operations:

lui rt, imm :

Loads the immediate value imm into the upper half-word of register rt. The lower bits of the register are set to 0.

Bits:   31-26   25-21  20-16  15-0
Field:  001111  00000  rt     imm

lw rt, Offset(rs) :

Loads the word (32 bits) at address = Offset(rs) = rs + Offset into register rt. The word must be aligned, meaning the address must be evenly divisible by four.

Bits:   31-26   25-21  20-16  15-0
Field:  100011  rs     rt     Offset




sw rt, Offset(rs) :

Stores the word from register rt at address = Offset(rs) = rs + Offset.

Bits:   31-26   25-21  20-16  15-0
Field:  101011  rs     rt     Offset

Branch Operations:

beq rs, rt, label : (label is translated to a 16-bit Offset by the assembler)

If the integer in register rs is equal to the integer in register rt, then increment the program counter (PC) by Offset * 4.

Bits:   31-26   25-21  20-16  15-0
Field:  000100  rs     rt     Offset

Jump Operation:

j target :

Unconditionally jumps to the instruction at Target.

Bits:   31-26   25-0
Field:  000010  Target
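The three instruction formats listed above can be packed into 32-bit words with simple shifts and masks. The following is a minimal sketch, not the project's assembler: the helper names (`encode_r`, `encode_i`, `encode_j`) and the two lookup tables are hypothetical, while the opcode and funct values are copied from the format tables in this section.

```python
# Opcode and funct values taken from the instruction formats in section 1.3.1.
FUNCT = {"add": 0b100000, "sub": 0b100010, "mul": 0b000011, "div": 0b000100,
         "and": 0b100100, "or": 0b100101, "xor": 0b100110}
OPCODE = {"addi": 0b001000, "andi": 0b001100, "ori": 0b001101,
          "xori": 0b001110, "lui": 0b001111, "lw": 0b100011,
          "sw": 0b101011, "beq": 0b000100, "j": 0b000010}


def _check_regs(*regs: int) -> None:
    for reg in regs:
        if not 0 <= reg < 32:
            raise ValueError("register numbers are 5 bits wide")


def encode_r(funct: int, rd: int, rs: int, rt: int) -> int:
    """R-type: opcode 000000 | rs | rt | rd | 00000 | funct."""
    _check_regs(rd, rs, rt)
    return (rs << 21) | (rt << 16) | (rd << 11) | (funct & 0x3F)


def encode_i(opcode: int, rt: int, rs: int, imm: int) -> int:
    """I-type: opcode | rs | rt | 16-bit immediate (two's complement)."""
    _check_regs(rt, rs)
    if not -0x8000 <= imm <= 0xFFFF:
        raise ValueError("immediate does not fit in 16 bits")
    return ((opcode & 0x3F) << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(target: int) -> int:
    """J-type: opcode 000010 | 26-bit target."""
    return (OPCODE["j"] << 26) | (target & 0x3FFFFFF)


print(f"{encode_r(FUNCT['add'], rd=3, rs=1, rt=2):032b}")   # add r3, r1, r2
print(f"{encode_i(OPCODE['addi'], rt=2, rs=1, imm=-1):08x}")  # addi r2, r1, -1
print(f"{encode_i(OPCODE['lw'], rt=5, rs=4, imm=8):08x}")     # lw r5, 8(r4)
```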

1.3.2- Functional Units:

In our processor, we have the following FU’s:

• 1 Add/Subtract unit (pipelined/5 stages)

• 1 Logical unit (1 stage)

• 1 Multiplier (pipelined/16 stages)

• 1 Divider (pipelined/16 stages)

• Load Unit (pipelined/5 stages)

• Store Unit (pipelined/2 stages)

• Branch Unit (pipelined/2 stages)




1.3.3- Reservation Stations:

For each functional unit, there is a 4-entry Reservation Station.

1.3.4- Physical Register File:

- 128 registers.
- Each register is 32 bits wide.

1.3.5- Logical Register File:

- 32 general-purpose registers.
- Each register is 32 bits wide.

These are the most prominent components; everything else is described in the following chapters.




1.4- Verilog HDL:

We used Verilog as the main implementation language of the processor. An HDL is considered to be at a higher level than ordinary programming languages, since its tools must understand your description of the hardware and then generate the appropriate hardware; this process is called hardware synthesis. The language can describe hardware at three levels of abstraction, and our processor is built on a mix of the three.

The first is the gate level. Here you have to know the gates as they will be synthesized. The benefit is that you can anticipate the synthesized logic for your hardware, but it is difficult and laborious for large amounts of code. We used it where appropriate, i.e. for units with no state or memory (decoders, etc.).

The second level is the data-flow level. Here you write code at a higher level than the gate level, and it is also synthesizable, but you still have to know a little about the gates that will be synthesized. We used it in components that are too lengthy to write at the gate level.

The third level is the behavioral level. It is not fully synthesizable, although we were able to synthesize part of it. It is easy to write, it is suitable for large amounts of code, and it lets you specify the delay of any logic as you wish, which is useful for simulating parts of the system with critical timing constraints. We used it heavily in modules that contain memory elements with decoding and searching capabilities.




1.5- Software Utilities:

Assembler: The assembler is the same assembler used in CPE 471 for the MIPS instruction set, but we have modified the MOT (Machine Operation Table) to meet the needs of our instruction set. For more information about the assembler, see www.geocities.com/cie_471/ and get its user manual.

Linker and Loader: This software consists of two integrated programs (Linkage Editor and Loader). It differs slightly from conventional linkers/loaders since it has to meet the requirements of the processor's memory system design: since there are two memories (Instruction and Data), it has to generate two files for the simulator.

Output Simulator: Since the simulator used to run the Verilog code cannot show what is happening inside the processor other than through timing diagrams, we decided to build a simulator that shows the processor state at each cycle. This helped us in debugging the system and locating errors, in addition to tracking processor bottlenecks and measuring the processor performance. First the Verilog simulator runs and prints the state of the machine at each clock cycle into several log files; then the output simulator reads these files and displays them to the user.




Chapter 2: I-Memory Subsystem

2.1- Introduction:

The Instruction Memory (I-Memory) Subsystem is the first stand-alone subsystem in our processor. Its job is to supply the superscalar processor with a high rate of fetched instructions, to be decoded, scheduled, and then executed.

The most important constraint on this subsystem is that it must meet the issue rate of the processor (in our case, 4 instructions per cycle). Consider the following scenarios:

• If it doesn’t meet it, the processor bandwidth will be under-utilized: the issue rate (departure rate) is larger than the fetch rate (arrival rate).

• If it exceeds it, queuing will occur: the issue rate (departure rate) is smaller than the fetch rate (arrival rate).

So, the unit of work in our I-Memory Subsystem is 4 Instructions (16 bytes).

Now we will explore this subsystem in more detail, considering each individual module.

2.2- Register PC

The PC is designed as a 24-bit register with two enables:

1- The first one comes from the Data_Ack signal. This signal is generated by the I-Cache to tell the following units (Decoder and Instruction Queue) that the instruction stream (4 instructions) is ready. It also enables the PC to write the new address, which it must do because fetching must proceed from the new address at that instant.

2- The second one comes from a signal called Structural_Hazard. This signal comes from a module called the Structural Hazard Detector, which decides whether issuing will occur (because there is space in the Reservation Stations) or the processor must stall to allow the instructions in the RS’s to proceed. In the latter case, fetching MUST stop to prevent the newly fetched instructions from overwriting the old ones. So, the enable of the PC should also depend on this signal.
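The two enable conditions described above can be modeled behaviorally in a few lines. This is only a sketch: the class name `ProgramCounter` and the method `clock` are hypothetical, and combining the two signals as "acknowledge AND NOT structural hazard" is an assumption about how the enables interact, not a statement of the synthesized circuit.

```python
class ProgramCounter:
    """Behavioral model of a 24-bit PC with a gated write enable."""

    MASK = (1 << 24) - 1

    def __init__(self, value: int = 0):
        self.value = value & self.MASK

    def clock(self, next_pc: int, data_ack: bool, structural_hazard: bool) -> int:
        """Advance one clock cycle and return the PC value afterwards."""
        # Assumption: the PC is written only when the cache has acknowledged
        # the fetch and no structural hazard forces the front end to stall.
        if data_ack and not structural_hazard:
            self.value = next_pc & self.MASK
        return self.value


pc = ProgramCounter()
print(pc.clock(16, data_ack=True, structural_hazard=False))  # PC advances
print(pc.clock(32, data_ack=True, structural_hazard=True))   # stall: PC held
```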

Refer to Figure 2.1 in the Appendix to see the synthesized circuit of the Register PC.

2.3- Instruction Cache:

Programs naturally tend to follow the principle of locality. To exploit this property in our design, we have chosen to simulate an Instruction Cache (explained here) and a Data Cache (explained later).

Inputs:

• Clock: System Clock

• Inst_read: signal from the next stages (Decoder-queue); if active, the cache will fetch instructions.

• d_mem_ready: used as an acknowledgment from the memory to the cache in case of a cache miss.

• Address: 24-bit address, from the Register PC.

• data_out0-data_out3: data from memory (in case of a cache miss).




Outputs:

• d_ack: output to the Decoder-Queue Logic, to indicate that the data is valid

• cache_miss: output to memory, to indicate cache miss.

• data_out0-data_out3: instructions fetched from cache, to the decoder logic.

We have decided to design the cache as a 2-way set associative cache, with a cache line of 4 instructions (each instruction is one 32-bit word, i.e. 4 bytes). Note that increasing the associativity reduces the miss rate, and a larger block size (4 words in our case) exploits spatial locality. This is crucial, especially in instruction caches, because programs naturally exhibit spatial locality in their instructions.

No decision needs to be taken regarding the write policy (write back or write through), because a program cannot write to the I-Cache. The replacement policy is simply to always replace the first cache line of the set. This is not very efficient, but it is rare to encounter a program that exposes this deficiency, so for the sake of simplicity we implemented it this way.

The D_Ack signal tells the following units that the cache has done its job, and also enables the PC to write the new address. The I-Cache starts working upon receiving the inst_read signal; this signal is valuable when the processor must stall (for any reason, such as a structural hazard).

In case of a cache hit, the entire job is done in one cycle, and D_Ack is activated in the next cycle (after the activation of inst_read). In case of a cache miss, the cache signals the I-Memory with the cache_miss signal. This signal enables the memory, which fetches the appropriate data and supplies it to the cache in one cycle; the cache, in turn, continues its work as usual in the next cycle (D_Ack, enable the PC). So, in case of a cache miss, the entire job is done in two cycles, i.e., the miss penalty is one cycle.
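The hit and miss timings stated above give a simple expression for the average number of cycles per fetch: hit time plus miss rate times miss penalty. The sketch below evaluates it; the function name `average_fetch_cycles` is hypothetical and the miss rates in the loop are illustrative values, not measurements from the project.

```python
def average_fetch_cycles(miss_rate: float,
                         hit_cycles: int = 1,
                         miss_penalty: int = 1) -> float:
    """Average cycles per fetch packet: hit time + miss rate * miss penalty.

    The defaults follow the text: one cycle on a hit, one extra on a miss.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty


# Illustrative miss rates only (assumed values, not project measurements).
for rate in (0.0, 0.05, 1.0):
    print(f"miss rate {rate:.2f}: {average_fetch_cycles(rate):.2f} cycles per fetch")
```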




The cache consists of 256 sets, each set has two ways, making 512 cache lines. Each cache line holds 4 instructions, and each instruction is 4 bytes (32 bits), which results in an 8 KB cache (a very common size). To address 256 cache sets, we need an 8-bit index, taken from bits 4-11 of the address. The least significant 2 bits are ignored (they address the 4 bytes in a word), and the next 2 bits are also ignored (they distinguish the 4 words in a cache line). Now: 24 – 8 – 2 – 2 = 12 bits are used for the tag.

Bits:   12-23  4-11   2-3             0-1
Field:  Tag    Index  Word Selection  Byte Selection
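The address breakdown above translates directly into shifts and masks, and a behavioral model of the lookup follows from it. The sketch below is an illustration rather than the project's Verilog: `split_address`, `ICache`, and the `memory_read_line` callback are hypothetical names, and filling an invalid way before falling back to "always replace the first line" is an assumption about how the second way gets populated.

```python
def split_address(addr: int):
    """Split a 24-bit byte address into (tag, index, word, byte)."""
    addr &= 0xFFFFFF
    byte = addr & 0x3              # bits 0-1: byte within a word
    word = (addr >> 2) & 0x3       # bits 2-3: word within a cache line
    index = (addr >> 4) & 0xFF     # bits 4-11: one of 256 sets
    tag = addr >> 12               # bits 12-23: 12-bit tag
    return tag, index, word, byte


class ICache:
    SETS, WAYS = 256, 2

    def __init__(self):
        # Each way holds (tag, line), or None while it is still invalid.
        self.sets = [[None] * self.WAYS for _ in range(self.SETS)]

    def fetch(self, addr: int, memory_read_line):
        """Return (hit, line), where line is a list of 4 instructions."""
        tag, index, _, _ = split_address(addr)
        ways = self.sets[index]
        for entry in ways:
            if entry is not None and entry[0] == tag:
                return True, entry[1]
        line = memory_read_line(addr & ~0xF)   # miss: read the aligned line
        # Assumption: use an invalid way if one exists; otherwise replace
        # the first line of the set, as described in the text.
        victim = ways.index(None) if None in ways else 0
        ways[victim] = (tag, line)
        return False, line


def read_line(base):
    """Hypothetical stand-in for the I-Memory: returns 4 dummy words."""
    return [base + 4 * i for i in range(4)]


cache = ICache()
for addr in (0x000100, 0x000104, 0x001100, 0x000100):
    hit, _ = cache.fetch(addr, read_line)
    print(f"{addr:#08x} -> {'hit' if hit else 'miss'}")
```

Addresses 0x000100 and 0x001100 share the same index but differ in tag, so they occupy the two ways of one set.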

2.4 - Instruction Memory

The I-Memory is simply the memory in which the program resides. Its size is 2^24 bytes = 16 MB. It holds only instructions (programs). There is a separate memory for data (we will talk about it later).

Inputs:

• Clock: System Clock

• Cache_miss: signal from the I-Cache; if active, the memory will fetch instructions.

• Address: 24-bit address, from the Register PC.

Outputs:

• d_mem_ready: used as an acknowledgment from the memory to the cache in case of a cache miss.

• data_out0-data_out3: instructions fetched from memory, to the cache.

The memory has a simple design, intended only to meet the project requirements. It starts working upon receiving the Cache_miss signal, takes one cycle, and upon finishing puts the data on the bus so that the cache can get it. It also activates a signal (d_mem_ready) to tell the cache that everything is done.

2.5 - Branch Prediction Unit

As the amount of ILP we attempt to exploit grows, control dependences rapidly become the limiting factor, especially for any processor that tries to issue more than one instruction per clock. There are two reasons for this. First, branches will arrive up to n times faster in an n-issue processor, and providing an instruction stream to the processor will probably require that we predict the outcome of branches. Second, Amdahl’s law reminds us that the relative impact of control stalls will be larger with the lower potential CPI in such machines.

The schemes used in branch prediction fall into two categories:

• Static: the action taken does not depend on the dynamic behavior of the branch (Branch-taken for example).

• Dynamic: Dynamically use the hardware to predict the outcome of a branch, the prediction will depend on the behavior of the branch at run time and will change if the branch changes its behavior during execution.

We have used a 2-bit predictor scheme. In a 2-bit scheme, a prediction must miss twice before it is changed.
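The "must miss twice" property can be illustrated with one common realization of a 2-bit predictor, a saturating counter. This is a sketch under that assumption: the class name `TwoBitPredictor` is hypothetical, and the exact state transitions used in the project may differ from a plain saturating counter.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state: int = 0):
        if not 0 <= state <= 3:
            raise ValueError("state must fit in 2 bits")
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        """Move one step toward the actual outcome, saturating at 0 and 3."""
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


# A loop branch: taken nine times, then not taken once, repeated twice.
predictor = TwoBitPredictor(state=3)
mispredictions = 0
for outcome in ([True] * 9 + [False]) * 2:
    if predictor.predict() != outcome:
        mispredictions += 1
    predictor.update(outcome)
print("mispredictions:", mispredictions)
```

Starting from the strongest "taken" state, a single loop exit moves the counter only one step, so the branch is still predicted taken when the loop is entered again.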




See the state diagram for a 2-bit predictor (from Computer Architecture: A Quantitative Approach, by Hennessy & Patterson).

To reduce the branch penalty, we need to know from what address to fetch by the end of the instruction fetch phase. This requirement means we must know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC should be. If the instruction is a branch and we know what the next PC should be, we can have a branch penalty of zero. A branch-prediction cache that stores the predicted address for the next instruction after a branch is called a branch-target buffer or branch-target cache.

Because we are predicting the next instruction address and will send it out before decoding the instruction, we must know whether the fetched instruction is predicted as a taken branch. If the PC of the fetched instruction matches a PC in the buffer, then the corresponding predicted PC is used as the next PC, and fetching begins immediately at the predicted PC. The entry must match the PC of this exact instruction; otherwise, the wrong PC would be sent out for instructions that are not branches, resulting in a slower processor.

We only need to store the predicted-taken branches in the branch-target buffer, since an untaken branch follows the same strategy (fetch the next sequential instruction) as a non-branch. Complications arise when we are using a 2-bit predictor, since this requires that we store information for both taken and untaken branches. One way to resolve this is to use both a target buffer and a prediction buffer, which is the solution used by several PowerPC processors. We assume that the buffer holds only PC-relative conditional branches, since this makes the target address a constant; it is not hard to extend the mechanism to work with indirect branches.

The design of a BPU is shown in Computer Architecture: A Quantitative Approach, by Hennessy & Patterson.

Let's take a closer look at our BPU.




Inputs:

• Clock: system clock.

• Inst_read: signal from the next stage (Decoder-queue); if active, the BPU starts working.

• Old_PC: 24-bit address; this PC is compared with the BPU entries to find any match.

Outputs:

• New_PC: if there is an entry in the BPU that matches the Old_PC input, the corresponding predicted address is output to update the PC. Otherwise, PC+16 is output (that is, the PC is simply incremented).

• Active: 4-bit output that indicates which of the instructions in the Fetch Packet are active and which are not. See below for more details.

• inst_mode: 8-bit output (2 bits for each instruction) that indicates the state of each instruction. See below for more details.

The BPU consists of 128 entries of 48 bits each. The structure of each entry is shown below:

Bits     Field
47       Valid/Invalid
24-46    Old PC
2-23     Predicted PC
0-1      2-bit predictor
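The entry layout above can be sketched as bit packing. This is only an illustration under stated assumptions: the helper names are hypothetical, the lookup is a linear search rather than the hardware's parallel compare, and a match is treated as "taken" when the 2-bit predictor is in its upper half (2 or 3), which the report does not state explicitly.

```python
OLD_PC_BITS = 23    # bits 24-46 of the entry
PRED_PC_BITS = 22   # bits 2-23 of the entry


def pack_entry(valid, old_pc, predicted_pc, counter):
    """Pack the four fields into one 48-bit integer (bit 47 = valid)."""
    assert 0 <= old_pc < (1 << OLD_PC_BITS)
    assert 0 <= predicted_pc < (1 << PRED_PC_BITS)
    assert 0 <= counter < 4
    return (int(valid) << 47) | (old_pc << 24) | (predicted_pc << 2) | counter


def unpack_entry(entry):
    """Split a 48-bit entry back into its fields."""
    return {
        "valid": bool((entry >> 47) & 1),
        "old_pc": (entry >> 24) & ((1 << OLD_PC_BITS) - 1),
        "predicted_pc": (entry >> 2) & ((1 << PRED_PC_BITS) - 1),
        "counter": entry & 0b11,
    }


def next_pc(entries, pc):
    """Predicted PC on a valid, predicted-taken match; otherwise pc + 16."""
    for e in map(unpack_entry, entries):
        if e["valid"] and e["old_pc"] == pc and e["counter"] >= 2:
            return e["predicted_pc"]
    return pc + 16


entry = pack_entry(True, 0x40, 0x100, 3)
hit = next_pc([entry], 0x40)    # matching PC: use the stored target
miss = next_pc([entry], 0x50)   # no match: fall through to pc + 16
```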




The 8-bit inst_mode signal carries information about each of the four instructions that make up the Fetch Packet. Each instruction takes two bits (representing 4 states), encoded as follows:

Combination   Meaning
00            Not found in BTB
01            Found, but not taken
10            Found, and taken
11            For future use

The 4-bit active signal indicates which of the 4 instructions are active and which are not. There are two reasons for this signal:

• An unaligned address requires invalidating the preceding instructions in the packet.

Example: If the required address is 6, the Issue packet will always be 4-7, and the instructions that were not requested are invalidated.

Address   Corresponding Active bit
4         0
5         0
6         1
7         1

This tells the next units that they must not process the instructions at addresses 4 and 5, because they are invalid instructions.

• Instructions that follow a predicted-taken branch must be inactivated too (they are not in the real program).

Example: If there is a predicted-taken branch at address 6, the Issue packet will always be 4-7, and the instruction that follows the branch is invalidated.

Address   Corresponding Active bit
4         1
5         1
6         1
7         0

This tells the next units that they must not process the instruction at address 7, because it is invalid (not in the program flow).

Refer to figure 2.2 to see a timing diagram for the I-Memory Subsystem. Finally, refer to figure 2.3 to see a full schematic diagram for the I-Memory Subsystem.
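The two rules for the active bits can be sketched as a small function. The sketch assumes addresses counted in instructions (as in the examples above, where a packet covers addresses 4-7) and uses a hypothetical function name; it is not the project's hardware description.

```python
PACKET_SIZE = 4  # instructions per packet


def active_bits(requested_addr, taken_branch_addr=None):
    """Return (packet base address, list of 4 active bits)."""
    base = requested_addr - (requested_addr % PACKET_SIZE)
    bits = []
    for addr in range(base, base + PACKET_SIZE):
        # Rule 1: instructions before the requested (unaligned) address are inactive.
        active = addr >= requested_addr
        # Rule 2: instructions after a predicted-taken branch are inactive.
        if taken_branch_addr is not None and addr > taken_branch_addr:
            active = False
        bits.append(1 if active else 0)
    return base, bits


unaligned = active_bits(6)                           # request starts at address 6
after_branch = active_bits(4, taken_branch_addr=6)   # taken branch at address 6
```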




Chapter 3: Decoder_Queue & Issue Unit

This unit is located directly after the I-Memory sub-system; it is composed of 4 instances of the Decoder module and 4 instances of the Queue module. The unit connects each of the 4 outputs of the I-Memory sub-system to the input of the corresponding decoder, and the output of each decoder to the input of the corresponding queue (see the figure below). The output of this unit is used directly by the Register_Renaming module.

3.1- Decoder:

This unit is responsible for decoding the instructions coming from the I-Memory sub-system and sending each one to its appropriate entry in the queue.

Inputs:

• Instruction read from the I-Memory sub-system (32-bit).
• Instruction PC (22-bit).
• Speculation information, including the speculation mode (2-bit) and the predicted PC (22-bit).
• Activation bit (1-bit) that indicates whether the instruction is active; an instruction becomes inactive because of an unaligned address or a branch to an unaligned one.

Output:

• Operation code numbered from 0-15 (4-bit); see the table.
• First source rs (5-bit).
• Second source rt (5-bit).
• Destination rd (5-bit).
• Immediate imm (16-bit).
• Renaming bits that select the renaming type, see the register renaming unit (3-bit).
• Validation bit for the immediate, set if there is an immediate (1-bit).
• All the speculation information passes through the decoder without any change.

Op-code   Instruction
0         ADD
1         ADDI
2         SUB
3         MUL
4         DIV
5         AND
6         OR
7         XOR
8         ANDI
9         ORI
A         XORI
B         LUI
C         LW
D         SW
E         BEQ
F         J

And here is a block diagram for the decoder:




Refer to figure 3.1 to see the synthesized circuit of the Decoder.

Refer to figure 3.2 to see a top level View of the Decoder. Refer to figure 3.3 to see a Timing diagram of the Decoder.

3.2- Queue:

This unit takes the output of the decoder and queues it until it reaches the register renaming unit and is issued to the reservation stations. The basic component of the queue is 4 instances of the Issue_Packet module, which is an 86-bit register; these instances are placed one after another to form a queue. The control of the queue depends on the structural hazard detection unit and the active bit. If the structural hazard signal is true, the queue latches the output of the decoder and moves forward; the active bit coming from the decoder then determines whether a valid instruction or a bubble is latched. So if the structural hazard signal is true and the active bits are zeros, the queue moves bubbles, which are ignored by the register renaming and issue units. In general, the queue's job is to reduce the cost of instruction fetching by holding fetched instructions while the reservation stations are busy, which is the idea of prefetching.

Input: The same as the output of the decoder:

• Operation code numbered from 0-15 (4-bit).
• First source rs (5-bit).
• Second source rt (5-bit).
• Destination rd (5-bit).
• Immediate imm (16-bit).
• Renaming bits for the renaming type, see the register renaming unit (3-bit).
• Validation bit for the immediate, set if there is an immediate (1-bit).
• Speculation information: predicted_pc (22-bit), speculation_mode (2-bit) and active (1-bit).

Output: The output will be the same as the input since the queue will not do any changes.

The schematic synthesis for the queue is shown in figure 3.4 and its timing in figure 3.5.

3.3- Register Renaming Unit:

Introduction:

Data dependency became a major problem for computer architects when pipelined computers first appeared. They tried to solve it by bypassing data and freezing the pipeline. Later, researchers started to think about executing instructions in parallel (ILP) and built an algorithm for running instructions in parallel based entirely on the data hazard problem; in other words, they built their design, known as the Data Flow Model, on top of the data hazard problem. The data flow model suggests that whenever an instruction's operands are ready, it should be issued to the functional unit. The model also assumes that there are no limits on resources, which no longer holds when the design is moved to implementation (the Structural Hazard Problem).

Several approaches appeared after the data flow model to break its principle (if ready then issue). The first is Data Prediction: if an instruction is waiting on, say, a square root result, a prediction of the result can be made and execution continues; if the prediction is wrong, execution is redone. The second is Data Reuse: after a long division, for example, both operands and the result can be stored in a small, fast memory and used again; replacement problems arise here and can be solved by replacement policy techniques. Unfortunately, these two models are still under research and have not reached implementation.

One may think that register renaming is an old technique for solving the false data dependency problem, and wonder why what Tomasulo said about register renaming in 1967 still matters. First, the purpose of renaming in the IBM 360/91 was to preserve program consistency rather than to remove false dependencies in order to increase the issue rate. Second, early implementations used partial rather than full renaming, meaning that renaming applied only to a specific set of registers designated for some instructions; a real and full implementation of register renaming did not appear until 1992 in the IBM ES/9000 mainframe family.
Concept:

The register renaming principle is based on a simple idea and simple assumptions. The assumptions are: first, there are both a logical register file and a physical register file; second, there is a pool of free physical registers; third, there is a mapping table that maps each renamed register to its alias (called the RAT, Register Alias Table). The logical register file is the one that can be used by the programmer; the physical one is the one actually written by the processor. The idea is to take an instruction's operands, look them up in the RAT and find their aliases; if they are ready, the instruction is dispatched to its functional unit. The destination register is then renamed to one of the free pool registers, added to the RAT and marked as not ready since it will be written soon; when the functional unit finishes writing it, it is marked as ready.

Now let us look at the 3 commonly known data dependency problems that arise when the processor executes instructions out of order, and see how they are solved:

RAW: an instruction wants to read a register that a previous instruction will write. This is the true dependency. It is handled because the incoming instruction looks up its operands in the RAT and finds that one of them is not ready, so it waits for that operand.

WAR: an instruction wants to write a register that a previous instruction will read, which is a false dependency. This problem is solved because the writing instruction renames its destination, and hence the read does not use the same physical register.

WAW: an instruction wants to write a register after a previous write instruction to the same register. This is also a false dependency, and it is solved because the two write operations rename the destination register to different physical registers.

Example: Assume the following code segment:

ADD R5, R3, R9
SUB R4, R5, R3
ADD R5, R2, R10
MUL R3, R1, R3

And the free register pool: P6 P11 P12 P15 P16 P17

And the Register Alias Table is:

Logical   Physical   Renamed
R1        P4         1
R2        P18        1
R3        P5         1
R4        P1         1
R5        P2         1
R9        P3         1
R10       P4         1

1- After renaming the 1st instruction it will be: ADD P6, P5, P3

And the pool: P11 P12 P15 P16 P17

And the RAT:

Logical   Physical   Renamed
R1        P4         1
R2        P18        1
R3        P5         1
R4        P1         1
R5        P6         1
R9        P3         1
R10       P4         1

2- After renaming the 2nd instruction it will be: SUB P11, P6, P5

And the pool: P12 P15 P16 P17

And the RAT:

Logical   Physical   Renamed
R1        P4         1
R2        P18        1
R3        P5         1
R4        P11        1
R5        P6         1
R9        P3         1
R10       P4         1

3- After renaming the 3rd instruction it will be: ADD P12, P18, P4

And the pool: P15 P16 P17

And the RAT:

Logical   Physical   Renamed
R1        P4         1
R2        P18        1
R3        P5         1
R4        P11        1
R5        P12        1
R9        P3         1
R10       P4         1

4- After renaming the 4th instruction it will be: MUL P15, P4, P5

And the pool: P16 P17

And the RAT:

Logical   Physical   Renamed
R1        P4         1
R2        P18        1
R3        P15        1
R4        P11        1
R5        P12        1
R9        P3         1
R10       P4         1
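The worked example can be reproduced with a short sketch. It models only the alias lookup and the free pool; the ready and valid bits of the RAT are left out, and the function and variable names are hypothetical rather than taken from the project's design.

```python
def rename(program, rat, free_pool):
    """Rename (op, rd, rs, rt) tuples; returns (renamed program, final RAT, remaining pool)."""
    rat = dict(rat)          # work on copies so the inputs stay unchanged
    pool = list(free_pool)
    renamed = []
    for op, rd, rs, rt in program:
        # Sources are looked up first, so they see the aliases of earlier writers.
        ps, pt = rat[rs], rat[rt]
        # The destination takes the next free physical register.
        pd = pool.pop(0)
        rat[rd] = pd
        renamed.append((op, pd, ps, pt))
    return renamed, rat, pool


program = [
    ("ADD", "R5", "R3", "R9"),
    ("SUB", "R4", "R5", "R3"),
    ("ADD", "R5", "R2", "R10"),
    ("MUL", "R3", "R1", "R3"),
]
rat = {"R1": "P4", "R2": "P18", "R3": "P5", "R4": "P1",
       "R5": "P2", "R9": "P3", "R10": "P4"}
pool = ["P6", "P11", "P12", "P15", "P16", "P17"]

renamed, final_rat, remaining = rename(program, rat, pool)
```

The two writes to R5 end up in different physical registers (P6 and P12), which is how the WAW dependency disappears, while SUB still reads P6 and so keeps its true dependency on the first ADD.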




Inputs:

- 4 inputs (5-bit each) for rs, one per instruction.
- 4 inputs (5-bit each) for rt, one per instruction.
- 4 inputs (5-bit each) for rd, one per instruction.
- 4 inputs (3-bit each) for the renaming type of each instruction; the renaming type is discussed below.
- 4 inputs (1-bit each) for the active bits; if the active bit is zero, there is no need to rename the instruction.
- 1 input for the clock.

Renaming Type:

Not all instructions need to rename two sources and a destination, so the renaming requirement is encoded in 3 bits. The table below gives the renaming type for each instruction:

Op-code   Instruction   Renaming Type
0         ADD           111
1         ADDI          110
2         SUB           111
3         MUL           111
4         DIV           111
5         AND           111
6         OR            111
7         XOR           111
8         ANDI          110
9         ORI           110
A         XORI          110
B         LUI           100
C         LW            110
D         SW            011
E         BEQ           011
F         J             000
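One plausible reading of the table, sketched below, is that the three bits mean (destination, first source, second source) from left to right. This bit order is an assumption inferred from the table (for example, SW and BEQ have two sources and no destination and are encoded 011); the report does not state it, and the names here are hypothetical.

```python
def decode_renaming_type(bits):
    """Split a 3-bit renaming type into (rename_rd, rename_rs, rename_rt).

    Assumed bit order: bit 2 = destination, bit 1 = rs, bit 0 = rt.
    """
    return bool(bits & 0b100), bool(bits & 0b010), bool(bits & 0b001)


# A few rows copied from the table above.
RENAMING_TYPE = {"ADD": 0b111, "ADDI": 0b110, "LUI": 0b100, "SW": 0b011, "J": 0b000}

addi_flags = decode_renaming_type(RENAMING_TYPE["ADDI"])  # destination and rs only
sw_flags = decode_renaming_type(RENAMING_TYPE["SW"])      # two sources, no destination
```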




Output:

• 4 outputs (7-bits per one) for rs source for each instruction.

• 4 outputs (7-bits per one) for rt source for each instruction.

• 4 outputs (1-bit per one) for pl_rs if it's renamed (physical) or not (logical) source for each instruction.

• 4 outputs (1-bit per one) for pl_rt if it's renamed (physical) or not (logical) source for each instruction.

• 4 outputs (1-bit per one) for rs_ready if its value is ready or not (still being crunched in some functional unit).

• 4 outputs (1-bit per one) for rt_ready if its value is ready or not (still being crunched in some functional unit).

• 4 outputs (7-bit per one) for destination register (no need for pl bit since destination always renamed if exist).

3.4- The Pool:

The free register pool consists of a 128-bit array; each bit indicates whether the matching physical register is busy or free.

Inputs:

• 4 inputs (5-bit each) for rd, one per instruction.

• 4 inputs (2-bit each) for the renaming type of each instruction; the renaming type is discussed in the register renaming unit.

• 4 inputs (1-bit each) for the active bits; if the active bit is zero, there is no need to rename the instruction.

• 1 input (1-bit) for the clock.

• 1 input (1-bit) to start renaming (taken from the structural hazard detection unit).




Output:

• 4 outputs (7-bit per one) for destination register (no need for pl bit since destination always renamed if exist).

Register Alias Table:

This table consists of 32 entries; each entry has the fields shown below:

Physical Alias   Valid Entry   Ready

The table is indexed by the logical register number. Physical alias is the physical register to which the logical register was last renamed; valid entry indicates whether the entry contains a valid renaming; and the ready bit indicates whether the renamed register is ready or still being computed in some functional unit.

To show a real test, the register renaming unit is not shown alone; it is integrated with the Decoder_Queue unit and the I-Memory Sub-system, and the renaming logic is shown running the following program:

ADD  R1, R0, R3
ADDI R3, R0, R5
SUB  R2, R3, R1
MUL  R4, R1, R3
DIV  R6, R4, R1
AND  R9, R6, R2
OR   R9, R2, R1
XOR  R6, R9, R6
ANDI R16, R9, FFFF
ORI  R10, R16, FFAA
XORI R12, R16, 5555
LUI  R9, FFFF
LW   R13, 8(R6)
SW   R12, 12(R7)
BEQ  R11, R12, FFF3
J    000004

The program is assembled and loaded into memory at address 0x000000; after assembly and loading it looks like:




00030820
20030005
00611022
00232003
00813004
00C24824
00414825
01263026
3130FFFF
360AFFAA
3A0C5555
3C09FFFF
8CCD0008
ACEC000C
116AFFF3
08000004

For the results, see figure 3.5. Note the branch to the 2nd instruction and its effect on the active bits. Also note that the renaming pool will be exhausted, since there is no commitment logic integrated; this can be seen in the overall run.

3.5- Issue Unit:

The purpose of this unit is to take the instructions output by the register renaming stage and forward each one to the appropriate reservation station. Appropriate here means the reservation station allocated for that type of instruction and a free entry in that reservation station; the free entries are found by reading the busy signal of each entry in each reservation station. This unit is completely combinational; it contains no state. It consists of 4 instances of a sub-unit that routes each of the 4 instructions to the appropriate one of 7 entry selection modules, which select empty entries in the reservation stations (see the figure below).




Each of these instances has one input (an instruction) and 7 outputs (one for each of the 7 entry selection units); each entry selection unit has 4 inputs from the 4 issue sub-units and 4 outputs for the appropriate empty entries. Note that each entry selection unit is different from the others, since it depends on the type of information stored in its reservation station.

Inputs:

• The output of the register renaming unit (the 4 queued instructions).

• 7 inputs (4 bits each) of busy signals read from the 7 reservation stations, to determine to which entry each instruction is issued.

Outputs:

• 7 outputs (each output width is equal to each reservation station input).
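The job of one entry selection unit can be sketched as follows. The sketch assumes that free entries are handed out in ascending order, which the report does not specify, and the function name is hypothetical.

```python
def select_entries(busy, requests):
    """Assign free reservation station entries to issuing instructions.

    busy:     the 4 busy bits of one reservation station (1 = occupied).
    requests: positions (0-3) in the issue packet that target this station.
    Returns a dict {packet position: entry index}.
    """
    free = [i for i, b in enumerate(busy) if b == 0]
    if len(requests) > len(free):
        # The structural hazard detector is expected to prevent this case.
        raise ValueError("structural hazard: not enough free entries")
    # Pair each requesting instruction with the next free entry.
    return dict(zip(requests, free))


assignment = select_entries([1, 0, 0, 1], [0, 2])
```

Here the instructions in packet positions 0 and 2 are placed in entries 1 and 2, the only free ones.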




• See Figure 3.6 for the full issue unit.
• See Figure 3.7 for a top level view of the full issue unit.
• See Figure 3.8 for a top level view of the issue unit for the Add.
• See Figure 3.9 for a top level view of the issue unit for the Branch.
• See Figure 3.9 for a top level view of the issue unit for the Divide.
• See Figure 3.10 for a top level view of the issue unit for the Logical Unit.
• See Figure 3.11 for a top level view of the issue unit for the Load Unit.
• See Figure 3.12 for a top level view of the issue unit for the Multiply Unit.
• See Figure 3.13 for a top level view of the Sub issue unit.
• See Figure 3.14 for a top level view of the issue unit for the Store Unit.




3.6- Structural Hazard Detector:

This unit is responsible for generating a single signal, which tells whether there is a structural hazard in the processor. This signal is very important because it tells the Issue unit about the state of the processor: if there is a structural hazard, issue should stop; otherwise, issue should continue. The design of this unit was one of the best experiences we had in this project; we did it fully at the gate level. We will explore it in more detail. First, let us take a look at the input/output diagram:

Input:

• Busy: 28-bit signal; every 4 bits come from a different RS, each bit from one entry in that RS.
• Active: the 4 active bits from the Queue.
• Op0-Op3: the 4 op codes from the Queue.

Output:

• Structural_Hazard: 1 bit that indicates whether there is a structural hazard.

Inside the Structural Hazard unit there are the following modules:




1- Four units of type RS_Selector: this module receives an op code and the corresponding active bit, and finds the RS that maps to this op code (the 4 units serve the 4 op codes in the Queue).

Refer to Figure 3.15 to see the synthesized circuit of the RS_Selector.

2- Seven units of type Counter_Issue: this unit receives 4 bits from the 4 RS_Selectors; each bit is 1 if the RS_Selector maps that instruction to this unit's RS (the 7 units serve the 7 RSs in the processor). The output is 3 bits indicating the number of op codes of the same type.

Refer to Figure 3.16 to see the synthesized circuit of the Counter_Issue.

Example: if there are 2 adds, one multiply and one divide in the issue packet, the Counter_Issue of the ADD outputs 010, the Counter_Issue of the MUL outputs 001, the Counter_Issue of the DIV outputs 001, and all the other counters output 000. Refer to Figure 3.17 to see the synthesized circuit of the RS_Selector.

3- Seven units of type Counter_RS: this unit receives 4 bits from the 4 entries of its RS (thus seven units are required) and outputs 3 bits indicating the number of empty entries in that RS.

Example: if the ADD RS has 3 empty entries, the Counter_RS of the ADD outputs 011, and so on. Refer to Figure 3.18 to see the synthesized circuit of the Counter_Issue. Refer to Figure 3.19 to see the synthesized circuit of the Counter_RS.

Now, the final step is to compare the 3 bits from each Counter_Issue (the number of instructions trying to enter a specific RS) with the corresponding 3 bits from the Counter_RS (the number of empty slots in that RS). The number of instructions trying to enter an RS must not exceed the number of empty slots in that RS, so each comparator compares these two 3-bit inputs: its output is one if the first is greater than the second (indicating a structural hazard in that RS) and zero if it is less than or equal (indicating that everything is OK). There must be 7 comparators, one for each RS. To see the full synthesized Structural Hazard circuit, refer to figures 3.20 and 3.21.
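The counting and comparing steps can be summarized in a behavioral sketch. It is not the gate-level design: the RS_Selector step is skipped by passing the reservation station type of each instruction directly, and the type names and function name are hypothetical.

```python
RS_TYPES = ["add", "logical", "mul", "div", "load", "store", "branch"]


def structural_hazard(packet, busy):
    """Return True if any reservation station cannot accept its instructions.

    packet: list of (rs_type, active) pairs for the 4 queued instructions.
    busy:   dict mapping each rs_type to its 4 busy bits (1 = occupied).
    """
    for rs in RS_TYPES:
        # Counter_Issue: active instructions that want to enter this RS.
        wanted = sum(1 for t, active in packet if active and t == rs)
        # Counter_RS: empty entries in this RS.
        free = sum(1 for b in busy[rs] if b == 0)
        # Comparator: hazard if more instructions than empty entries.
        if wanted > free:
            return True
    return False


packet = [("add", 1), ("add", 1), ("mul", 1), ("div", 1)]
all_free = {rs: [0, 0, 0, 0] for rs in RS_TYPES}

no_hazard = structural_hazard(packet, all_free)
one_add_slot = dict(all_free, add=[1, 1, 1, 0])   # only one empty ADD entry
hazard = structural_hazard(packet, one_add_slot)
```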




Chapter 4 Reservation Stations and Functional Units

4.1- Reservation Stations:

In a superscalar microprocessor many instructions execute at the same time, which requires the designer to take care of the dependencies among the instructions. The Write after Write (WAW) problem has been solved by the register renaming technique, but how can the designer prevent the Read after Write (RAW) problem from occurring? In addition, the functional units of the microprocessor are pipelined, and there is a limit on the number of instructions that can enter a functional unit at the same time if they become ready together. To overcome these two problems we use reservation stations, which are composed of registers placed after instruction issue and before the execution phase. Each reservation station contains a number of entries equal to the issue packet size, to handle the case where the issue packet contains instructions of the same type. Each entry contains the fields required to solve the above two problems and to ensure the complete execution and correct committing of the instructions. We have implemented it using D flip-flops.

See Figure 4.1 for the synthesized logic of a D flip-flop.

4.1.1 Types and fields of the reservation stations:

We have divided the instruction set of our microprocessor into seven types:

1. Addition/subtraction.
2. Logical.
3. Multiplication.
4. Division.
5. Load.
6. Store.
7. Branch.

All operations of the reservation station, i.e. writing to a reservation station register or reading from it, are synchronized with the main clock of the system. A reservation station also has a reset signal as an input, used to initialize or flush the reservation station entries at the beginning of the operation of the microprocessor or when an instruction is dispatched to the functional units.

Each of these reservation station types has a number of fields, as follows:

1. Addition/subtraction:

This reservation station has the following fields:

• Busy: indicates whether this entry of the reservation station is occupied.
• Add-sub: determines whether the instruction in this entry is an addition or a subtraction.
• Fields for the Rs register: a tag that determines whether the value of this register is taken from the physical or the logical register file, a tag that determines whether this register is ready to be read, and a field for the register number of Rs.
• Fields for the Rt register: the same as above, for Rt.
• Fields for the immediate: a tag that determines whether this instruction uses the immediate operand, and a field for the value of the immediate operand.
• Fields for the Rd register: the register number of the destination register.
• ROB fields: the reorder buffer address of this instruction.

See Figure 4.2 for the addition reservation station entry built from flip-flops. Also see Figure 4.2 for the timing diagram of the addition reservation station entry.

The following table summarizes the Addition reservation station entry fields:

Signal name   Number of bits   Description
Busy          1                Determines if this entry is occupied or not.
Add_sub       1                Determines whether the instruction is add or subtract.
R_Slp         1                Determines whether to take Rs from the logical or physical register file.
R_Sready      1                Determines whether Rs is ready or not.
R_Tlp         1                Determines whether to take Rt from the logical or physical register file.
R_Tready      1                Determines whether Rt is ready or not.
R_S           7                Contains the register number of Rs.
R_T           7                Contains the register number of Rt.
R_D           7                Contains the register number of Rd.
ROB           7                Contains the reorder buffer entry number for this instruction.
Immbit        1                Determines whether this instruction works with an immediate value or not.
Immediate     15               Contains the immediate value for this instruction, if it uses one.

2. Logical:

This reservation station has the following fields:

• Busy: indicates whether this entry of the reservation station is occupied.
• Fields for the Rs register: the same as the addition Rs fields.
• Fields for the Rt register: the same as the addition Rt fields.
• Fields for the Rd register: the same as the addition Rd fields.
• OP fields: distinguish which logical instruction this is.
• Fields for the immediate: the same as for the addition reservation station.
• ROB fields: the same as for addition.

The following table summarizes the logical reservation station entry fields:

Signal name   Number of bits   Description
OP_code       4                Contains the OP code for this instruction.
R_Sready      1                Determines whether Rs is ready or not.
R_Tlp         1                Determines whether to take Rt from the logical or physical register file.
Immbit        1                Determines whether this instruction works with an immediate value or not.
Immediate     15               Contains the immediate value for this instruction, if it uses one.

3. Multiplication:

• Busy: indicates whether this entry of the reservation station is occupied.
• Fields for the Rs register: the same as the addition Rs fields.
• Fields for the Rt register: the same as the addition Rt fields.
• Fields for the Rd register: the same as the addition Rd fields.
• Fields for the immediate: the same as for the addition reservation station.
• ROB fields: the same as for addition.

The following table summarizes the multiplication reservation station entry fields:

Signal name   Number of bits   Description
Busy          1                Determines if this entry is occupied or not.
R_Slp         1                Determines whether to take Rs from the logical or physical register file.









4. Division:

The same as the multiplication reservation station. The following table summarizes the division reservation station entry fields:

Signal name   Number of bits   Description













5. Load:

- Busy: indicates whether this entry of the reservation station is occupied.
- Fields for the Rs register: the same as the addition Rs fields.
- Fields for the Rd register: the same as the addition Rd fields.
- ROB fields: the same as for addition.
- The immediate field: contains the displacement value used for the load instruction.

The following table summarizes the load reservation station entry fields:

Signal name   Number of bits   Description
Immediate     15               Contains the displacement value for this instruction, if it uses one.

6. Store:

- Busy: indicates whether this entry of the reservation station is occupied.
- Fields for the Rs register: the same as the addition Rs fields.
- Fields for the Rt register: the same as the addition Rt fields.
- ROB fields: the same as for addition.
- The immediate field: contains the displacement value used for the store instruction.

The following table summarizes the Store reservation station entry fields:

Signal name   Number of bits   Description
Busy          1                Determines if this entry is occupied or not.
R_Slp         1                Determines whether to take Rs from the logical or physical register file.
Immediate     15               Contains the displacement value for this instruction, if it uses one.

7. Branch:

- Busy: indicates whether this entry of the reservation station is occupied.
- Fields for the Rs register: the same as the addition Rs fields.
- Fields for the Rt register: the same as the addition Rt fields.
- ROB fields: the same as for addition.
- The immediate field: contains the displacement value used for the branch instruction.
- The program counter value for this instruction.
- The predicted target address for this branch, if it exists.
- The mode for this branch, i.e. strongly taken, taken, not taken, strongly not taken.

The following table summarizes the Branch reservation station entry fields:

Signal name    Number of bits   Description
Immediate      15               Contains the immediate value for this instruction, if it uses one.
PC             22               The location of this instruction.
PC_Predicted   22               The predicted value for the target of this branch in the BPU.
Inst_mode      2                The mode for this instruction as found in the BPU.

The use of these reservation station types with their fields helps solve the above problems and improves the performance of our microprocessor.

4.2 The dispatch Logic:

Once all the operands of an instruction are ready, the instruction should be dispatched from the reservation station to the functional unit to be executed. In some cases two instructions of the same type become ready at the same time, but the functional unit cannot accept two instructions at once, so there must be a priority among these instructions. The dispatch logic ensures these requirements. When all the operands of a busy reservation station entry are ready, the dispatch logic passes the highest-priority instruction waiting in the reservation station to its associated functional unit and flushes the entry, since it is now empty. See Figures 4.3 to 4.8 for all the dispatch logic.
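The selection step can be sketched as a simple priority scan. The report does not say how priority is assigned, so the sketch assumes that the lowest-numbered ready entry wins; the field and function names are hypothetical.

```python
def dispatch(entries):
    """Pick one ready entry of a reservation station, free it, and return its index.

    entries: list of dicts with 'busy', 'rs_ready' and 'rt_ready' flags.
    Returns None when no entry is ready.
    Assumed priority: the lowest index among the ready entries.
    """
    for i, e in enumerate(entries):
        if e["busy"] and e["rs_ready"] and e["rt_ready"]:
            e["busy"] = False   # flush the entry once it has been dispatched
            return i
    return None


station = [
    {"busy": True,  "rs_ready": True,  "rt_ready": False},  # still waiting on Rt
    {"busy": True,  "rs_ready": True,  "rt_ready": True},   # ready
    {"busy": True,  "rs_ready": True,  "rt_ready": True},   # ready at the same time
    {"busy": False, "rs_ready": False, "rt_ready": False},  # empty entry
]

first = dispatch(station)    # entry 1 goes first
second = dispatch(station)   # entry 2 follows on the next call
third = dispatch(station)    # nothing else is ready
```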




See also Figure 4.17 for the timing diagram of one of the dispatch logic units.

4.3- The Functional Units:

After the dispatch logic has selected an instruction, the instruction is passed with its operands to its specific functional unit to be executed; the result is then written to the common data bus to be broadcast to the other reservation stations. Because of our partitioning of the instructions into types, we have seven different functional units. Each of these functional units is pipelined with a different depth to increase the throughput of instruction execution. See Figure 4.9 for the interactions between the dispatch logic and the functional units. The following sections specify the functional units used in our microprocessor and their properties.

4.3.1 Addition/subtraction.

This functional unit is responsible for executing the addition and subtraction instructions. It is a five-stage pipeline. It takes its inputs from the reservation station, from the logical or physical register file, and from the dispatch logic after the reservation station. It is activated by the valid signal from the dispatch logic and asserts a valid-result signal when the result appears at the output after five cycles; this result is written to the reorder buffer and broadcast on the common data bus. See Figure 4.10 for the adder functional unit, and Figure 4.18 for the timing diagram of the addition functional unit.

4.3.2 Logical.

This functional unit is responsible for executing the logical operations. Because the logical instructions are simple, this unit is a one-stage pipeline, so the output is valid one cycle after its corresponding input. As with the addition/subtraction unit, this unit takes its input from the dispatch logic and the logical and physical register files, is activated by its active signal from the dispatch logic, and has a valid-result output signal to indicate that the output is valid. It also writes its output to the reorder buffer and broadcasts the result on the common data bus. See Figure 4.11 for the logical functional unit, and Figure 4.21 for the timing diagram.

4.3.3 Multiplication.

This functional unit is responsible for executing the multiplication operations. Because multiplying two operands is time consuming, this unit is a sixteen-stage pipeline, so the output is valid sixteen cycles after its corresponding input. As with the addition/subtraction unit, this unit takes its input from the dispatch logic and the logical and physical register files, is activated by its active signal from the dispatch logic, and has a valid-result output signal to indicate that the output is valid. It also writes its output to the reorder buffer and broadcasts the result on the common data bus. See Figure 4.12 for the multiplication functional unit, and Figure 4.20 for the timing diagram of the multiplier functional unit.

4.3.4 Division.

This functional unit is responsible for executing the division operations; its specification is the same as that of the multiplication unit. See Figure 4.19 for the timing diagram of the division functional unit.

4.3.5 Load.

This functional unit executes the load instruction. It is a three-stage pipeline; a load may take more than three cycles because of a cache miss, or only two cycles. The unit works as follows: in the first stage it computes the effective address from which the word is to be loaded; in the second stage it consults the load/store buffer, and if it finds the address there, it reads the word from the buffer and asserts the result-valid signal; if the value is not found in the load/store buffer, then in the third stage the result is fetched from the cache, and if it is not in the cache, it is fetched from memory. The cache and the memory are discussed later. When the result of this unit is valid, it is written to the ROB and to the common data bus. See Figure 4.13 for the load functional unit, and Figure 4.23 for the timing diagram.

4.3.6 Store:

This unit is responsible for executing the store operation. It is a two-stage pipeline: in the first stage the effective address is calculated; in the second stage the value to be stored is written to the load/store buffer; finally, the effective address and the value to be stored are written to the ROB. See Figure 4.14 for the store functional unit, and Figure 4.22 for the timing diagram.

4.3.7 Branch:

This unit is responsible for executing the branch instruction. It is a three-stage pipeline. First the values of Rs and Rt are compared; then, if the prediction was wrong, the mode of the instruction is changed and the correct value is written to the PC. If the prediction was wrong, the Flush ROB signal is also activated; its purpose is to flush all the instructions fetched after the mispredicted branch. See Figure 4.15 for the branch functional unit, and Figure 4.24 for the timing diagram.

4.4 Load/Store buffer:

To increase the performance of load instructions we use a caching technique, the load/store buffer, which acts as a cache for the load instruction. This buffer interacts with the load and store functional units. When a value is to be stored in memory, it is first written to the load/store buffer. When a load instruction is later executed, the load unit searches the load/store buffer; if the value is found there, it is taken from the buffer without consulting the cache or memory system. The size of our load/store buffer is 128 bytes of data, so it helps exploit the principle of locality. The load/store buffer has two write ports, for the load and store units, and one read port for the load unit.




See Figure 4.25 for the timing diagram of the load/store buffer.
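The lookup order described for the load and store units (load/store buffer first, then the cache, then memory) can be illustrated with a small behavioural model. This is only a sketch in Python, not the project's hardware description: the class and method names are hypothetical, the three storage levels are modelled as plain dictionaries, and filling the cache on a miss and reading unwritten memory as 0 are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class LoadPath:
    """Toy model of the load/store lookup order (all names are hypothetical)."""
    ls_buffer: dict = field(default_factory=dict)  # address -> word, filled by stores
    cache: dict = field(default_factory=dict)      # address -> word
    memory: dict = field(default_factory=dict)     # address -> word

    def store(self, base, offset, value):
        address = base + offset          # stage 1: effective address
        self.ls_buffer[address] = value  # stage 2: write to the load/store buffer
        return address

    def load(self, base, offset):
        address = base + offset          # stage 1: effective address
        if address in self.ls_buffer:    # stage 2: consult the load/store buffer
            return self.ls_buffer[address], "ls_buffer"
        if address in self.cache:        # stage 3: consult the cache
            return self.cache[address], "cache"
        # Cache miss: fetch from memory (assumption: unwritten words read as 0,
        # and the fetched word is then placed in the cache).
        value = self.memory.get(address, 0)
        self.cache[address] = value
        return value, "memory"
```

In this model a load that follows a store to the same address is served from the load/store buffer in its second stage, which corresponds to the two-cycle case mentioned for the load unit.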

4.5 Data Memory Subsystem:

The Data Memory (D-Memory) Subsystem is the part of memory where data is stored and where the load and store instructions are carried out. It consists of two basic components, the data cache and the data memory. See Figure 4.27 for the timing diagram of the D-Memory subsystem.

4.5.1 The Data Cache:

To exploit the principle of locality and to reduce the miss rate, we use a two-way set-associative cache with a write-back scheme. The cache is 8 KB in size and is word addressable. When a word is requested from the cache, the cache first checks whether it holds that word; if so, the word is read from the cache, and this situation is called a hit. Otherwise the miss signal activates the data memory and the cache waits for the acknowledgment from the memory. When a word is to be written to the cache and the line to be written is full, one word of that line must first be written back to memory before the write to the cache is performed.

The cache has four write ports: a store instruction is executed when its operands are written to the ROB, and because a packet in our microprocessor contains four instructions, four write ports are needed. The cache has only one read port because loads are executed sequentially in the load unit.

The data cache consists of 1024 sets of two ways each, making 2048 cache lines. Each cache line holds one word (32 bits), which results in an 8 KB cache, a very common size. Addressing 1024 sets requires a 10-bit index, which occupies bits 4-13 of the address. In addition, the two least significant bits are ignored (they address the 4 bytes within a word), and two bits are used for dirty and valid. This leaves 24 - 10 - 2 - 2 = 10 bits for the tag:




Bits     Field
14-23    Tag
4-13     Index
2-3      Word Selection
0-1      Dirty and Valid
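The cache geometry and the field layout above can be checked with a few lines of Python. This is a sketch under stated assumptions: the constant and function names are hypothetical, and the fields are read directly from the layout table (bits 14-23, 4-13, 2-3 and 0-1) without reinterpreting them. Note that the expression 24 - 10 - 2 - 2 evaluates to 10, which agrees with the 10-bit tag field occupying bits 14-23.

```python
SETS = 1024          # cache sets, from the text
WAYS = 2             # two-way set associative
WORD_BYTES = 4       # one 32-bit word per cache line
ADDRESS_BITS = 24    # 2**24 addressable bytes (16 MB data memory)

cache_lines = SETS * WAYS                    # 2048 lines
cache_bytes = cache_lines * WORD_BYTES       # 8192 bytes = 8 KB
index_bits = SETS.bit_length() - 1           # log2(1024) = 10
tag_bits = ADDRESS_BITS - index_bits - 2 - 2 # remaining bits for the tag


def split_address(address):
    """Split a 24-bit address into the fields of the layout table."""
    return {
        "tag": (address >> 14) & 0x3FF,        # bits 14-23
        "index": (address >> 4) & 0x3FF,       # bits 4-13
        "word_select": (address >> 2) & 0x3,   # bits 2-3
        "low_bits": address & 0x3,             # bits 0-1
    }


if __name__ == "__main__":
    print(cache_lines, cache_bytes, index_bits, tag_bits)
    print(split_address(0xABCDEF))
```

Two addresses map to the same set when their `index` fields are equal; the `tag` field is then what distinguishes them within the two ways of that set.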

4.5.2 The Data Memory:

The memory is accessed when a miss occurs in the cache. When the requested value is ready, or the value to be written has been written, the memory activates the acknowledgment signal to the cache. The memory has the same read and write ports as the cache. As with the instruction memory, we use a data memory of size 2^24 bytes = 16 MB.

4.6 The Register Files:

Because the register renaming technique is used to solve the WAW problem, our microprocessor has two register files: the logical register file and the physical register file. The logical register file is the set of registers seen by the user or assembly programmer of the microprocessor. The final results of instructions that have a register as their destination operand are written to it sequentially in the commit phase. The logical register file is always smaller than the physical one; in MIPS there are 32 logical registers. The physical register file is the renamed one, from which instructions take their operands and to which they write their results in the execution phase. We have chosen to have 128 physical registers. See also Figure 4.26 for the timing diagram of the register file.
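How renaming removes a WAW dependence can be seen in a small Python sketch with 32 logical and 128 physical registers. The names (`RenameTable`, `rename`) are hypothetical, the initial identity mapping is an assumption, and returning physical registers to the free list at commit is omitted for brevity, so this is an illustration of the general technique rather than the renaming logic of this design.

```python
from collections import deque

NUM_LOGICAL = 32     # logical (architectural) registers, as in MIPS
NUM_PHYSICAL = 128   # physical registers chosen in the text


class RenameTable:
    def __init__(self):
        # Assumption: logical register i initially maps to physical register i.
        self.mapping = list(range(NUM_LOGICAL))
        self.free = deque(range(NUM_LOGICAL, NUM_PHYSICAL))

    def rename(self, rd, rs, rt):
        """Rename one instruction 'rd <- rs op rt'; returns (prd, prs, prt)."""
        # Sources are looked up before the destination is remapped, so an
        # instruction that reads and writes the same register sees the old value.
        prs, prt = self.mapping[rs], self.mapping[rt]
        if not self.free:
            raise RuntimeError("no free physical register; renaming must stall")
        prd = self.free.popleft()
        self.mapping[rd] = prd
        return prd, prs, prt
```

Two instructions that both write logical register 1 receive different physical destinations, so they can complete in either order without overwriting each other:

```python
table = RenameTable()
print(table.rename(1, 2, 3))  # first write to r1
print(table.rename(1, 1, 4))  # second write to r1, reading the first result
```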




So, as a conclusion, the number of ports in our storage units is as follows:

Storage Type              Read ports   Write ports
Data Memory               1            4
Data Cache                1            4
Load Store Buffer         1            2
Physical Register File    13           5
Logical Register File     13           5




4.7 Reorder Buffer:

Because instructions are dispatched out of order and the pipelines have different lengths, and because the processor state must be preserved against interrupts, the reorder buffer is needed to allow instructions to be committed in order. Put simply, the reorder buffer is a memory of 128 entries, enough to hold all the instructions that may use the entire register pool. The operations performed on this unit are:

1- Reserving an entry in the ROB: this is done by always announcing 4 entry numbers that are ready to be reserved.

2- Writing a result from any one of the seven functional units.

3- Committing the physical registers into the logical ones.

4- Flushing the required processor state if the prediction was false and the branch reaches the end of the reorder buffer.
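The four operations listed above can be sketched as a circular buffer in Python. This is a simplified behavioural model, not the project's implementation: the class and method names are hypothetical, results are kept directly in the entry instead of in a separate physical register file, and committing at most four instructions per call is an assumption based on the four-instruction packet.

```python
class ReorderBuffer:
    SIZE = 128    # entries, as stated in the text
    PACKET = 4    # entries reserved (and at most committed) per call

    def __init__(self):
        self.entries = [None] * self.SIZE
        self.head = 0    # oldest instruction still in the buffer
        self.tail = 0    # next entry to be reserved
        self.count = 0

    def reserve(self, dests):
        """Operation 1: reserve one entry per instruction of a packet.
        dests lists each instruction's logical destination (None if it has none)."""
        if len(dests) > self.PACKET or self.count + len(dests) > self.SIZE:
            return None  # not enough room: the caller has to stall
        ids = []
        for rd in dests:
            self.entries[self.tail] = {"rd": rd, "value": None,
                                       "finished": False, "flush": False}
            ids.append(self.tail)
            self.tail = (self.tail + 1) % self.SIZE
            self.count += 1
        return ids

    def write_result(self, entry_id, value, mispredicted=False):
        """Operation 2: a functional unit writes its result, in any order."""
        entry = self.entries[entry_id]
        entry["value"] = value
        entry["finished"] = True
        entry["flush"] = mispredicted

    def commit(self, logical_regs):
        """Operations 3 and 4: commit in order from the head; flush when a
        mispredicted branch reaches the head."""
        committed = 0
        while committed < self.PACKET and self.count > 0:
            entry = self.entries[self.head]
            if not entry["finished"]:
                break                      # oldest instruction not done yet
            self.head = (self.head + 1) % self.SIZE
            self.count -= 1
            committed += 1
            if entry["flush"]:
                self.tail = self.head      # discard every younger instruction
                self.count = 0
                break
            if entry["rd"] is not None:
                logical_regs[entry["rd"]] = entry["value"]
        return committed
```

The key property is that `write_result` may be called in any order, while `commit` only ever advances past the oldest entry once it has finished, so the logical registers are updated in program order.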

The reorder buffer contains 32 rows. Each row is 312 bits wide and can hold up to 4 instructions. Each row is divided as follows:

Bits         Name
[311]        Active3
[310]        Flush3
[309:308]    Type3
[307:276]    Data3
[275:254]    Address3
[253:249]    Rd3
[248:242]    rd_ren3
[241]        Finish3
[240:234]    ROB_entry3
[233]        Active2
[232]        Flush2
[231:230]    Type2
[229:198]    Data2
[197:176]    Address2
[175:171]    Rd2
[170:164]    rd_ren2
[163]        Finish2
[162:156]    ROB_entry2
[155]        Active1
[154]        Flush1
[153:152]    Type1
[151:120]    Data1
[119:98]     Address1
[97:93]      Rd1
[92:86]      rd_ren1
[85]         Finish1
[84:78]      ROB_entry1
[77]         Active0
[76]         Flush0
[75:74]      Type0
[73:42]      Data0
[41:20]      Address0
[19:15]      Rd0
[14:8]       rd_ren0
[7]          Finish0
[6:0]        ROB_entry0

See Figure 4.28 for the timing diagram for reserving entries in the reservation station. See also Figure 4.29 for the top-level module of the ROB.
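Each of the four instruction slots in a row uses the same 78-bit layout (1 + 1 + 2 + 32 + 22 + 5 + 7 + 1 + 7 bits), and 4 x 78 = 312 bits. The following Python sketch extracts the fields of one slot from a row held as an integer. The helper names are hypothetical; the field widths and bit positions are taken from the row layout given for the reorder buffer.

```python
SLOT_BITS = 78

# (name, width) pairs, ordered from the least significant bit of a slot upwards.
FIELDS = [
    ("ROB_entry", 7),
    ("Finish", 1),
    ("rd_ren", 7),
    ("Rd", 5),
    ("Address", 22),
    ("Data", 32),
    ("Type", 2),
    ("Flush", 1),
    ("Active", 1),
]


def unpack_slot(row, slot):
    """Return the fields of instruction slot 0..3 from a 312-bit row value."""
    word = (row >> (slot * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
    fields = {}
    shift = 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


if __name__ == "__main__":
    # Slot 3 with Active3 = 1 (bit 311), Data3 = 0xDEADBEEF (bits 307:276)
    # and ROB_entry3 = 5 (bits 240:234).
    row = (1 << 311) | (0xDEADBEEF << 276) | (5 << 234)
    print(unpack_slot(row, 3))
```

Because all four slots share one layout, the slot number only changes the starting offset (0, 78, 156 or 234), which is why a single field list is sufficient.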




Conclusion:

The main thing we have concluded from all of the above is that there is a difference between computer architecture and computer design. Architecture concentrates on quantitative approaches that increase computer performance, at a higher level of abstraction than implementation details. We think our project concentrated more on computer architecture: we spent a long time on decisions meant to enhance performance, such as increasing the issue rate, minimizing the miss penalty, and avoiding structural hazards. In addition, we decided to make the processor a regular one, i.e. one with equal fetch, issue, decode, rename, dispatch, and commit rates.

On the other hand, we did not forget the design side completely. We used techniques that model state machines and data memory systems, but we paid little attention to design problems such as glitches, metastability, setup and hold times, and races, since we are dealing with simulation code rather than synthesized code.

The main problem we faced in this project was its large size. Implementing a processor is very hard, so we understand why a company hires 500 people to build one. The main result of this project is that we succeeded in reducing the CPI below 1 by letting the processor execute instructions out of order.








Appendix: Synthesized Circuits & Timing Diagrams

Figure 2.1: Register PC.
Figure 2.2: The timing diagram for the Instruction Memory Subsystem.
Figure 2.3: The overall Instruction Memory Subsystem view.
Figure 3.1: The Decoder.
Figure 3.2: Top-level view of the Decoder.
Figure 3.3: Timing diagram for the Decoder.
Figure 3.4: A top-level view of the queue.
Figure 3.5: The timing diagram for the queue.
Figure 3.6: Full issue unit.
Figure 3.7: A top-level view of the full issue unit.
Figure 3.8: A top-level view of the issue unit for the add.
Figure 3.9: A top-level view of the issue unit for the branch.
Figure 3.9: A top-level view of the issue unit for the Divide.
Figure 3.10: A top-level view of the issue unit for the Logical Unit.
Figure 3.11: A top-level view of the issue unit for the Load Unit.
Figure 3.12: A top-level view of the issue unit for the Multiply Unit.
Figure 3.13: A top-level view of the Sub issue unit.
Figure 3.14: A top-level view of the issue unit of the Store Unit.
Figure 3.15: Synthesized circuit of the RS_Selector.
Figure 3.16: Synthesized circuit of the Counter_Issue.
Figure 3.17: Synthesized circuit of the RS_Selector.
Figure 3.18: Synthesized circuit of the Counter_RISsue.
Figure 3.19: Synthesized circuit of the Counter_RS.
Figure 3.20: Full Structural Hazard synthesized circuit.
Figure 3.21: Structural Hazard synthesized circuit.
Figure 4.1: A D flip-flop built from gates.
Figure 4.2: The addition reservation station built from D flip-flops.
Figure 4.3: The addition dispatch logic.
Figure 4.4: The branch dispatch logic.
Figure 4.5: The division and multiply dispatch logic.
Figure 4.6: The logical dispatch logic.
Figure 4.7: The load dispatch logic.
Figure 4.8: The store dispatch logic.
Figure 4.9: The interaction between the dispatch logic and the Functional Units.
Figure 4.10: The Adder Unit.
Figure 4.11: The Logical Unit.
Figure 4.12: The Multiplication Unit.
Figure 4.13: The Load Unit.
Figure 4.14: The Store Unit.
Figure 4.15: The Branch Unit.
Figure 4.16: The timing diagram for the reservation station entry; all the operations are synchronized with the clock.
Figure 4.17: The timing diagram for the branch dispatch logic; there is a priority among the 4 instructions.
Figure 4.18: Timing diagram for the adder.
Figure 4.19: Timing diagram for the divider.
Figure 4.20: Timing diagram for the multiplier.
Figure 4.21: Timing diagram for the logical functional unit.
Figure 4.22: Timing diagram for the store functional unit.
Figure 4.23: Timing diagram for the load functional unit.
Figure 4.24: Timing diagram for the branch functional unit.
Figure 4.25: Timing diagram for the load/store buffer.
Figure 4.26: Timing diagram for the register file.
Figure 4.27: Timing diagram for the Data Memory Subsystem.
Figure 4.28: Timing diagram for reserving entries in the reservation station.
Figure 4.29: The top-level module for the reorder buffer.


