Computer Science 152 - University of California

Computer Science 152
Report – Final
15 May 2003

Jack Kang
Benjamin Lee
David Lee
Lyle Takacs

Contents

Abstract
Division of Labor
Strategy

Superscalar Processor
    Superscalar Datapath [Design | Test]
    Instruction Cache Block [Design | Test]
    Issue Unit [Design | Test]
    Forwarding [Design | Test]

Branch Prediction
    Branch Translation Buffer [Design | Test]
    Branch History Table [Design | Test]
    Static Branch Prediction [Design]

General Testing
Results
Conclusion

Note: Appendices included in supplemental files. All hyperlinks reference these files.

Appendix I – Notebooks
    Jack Kang
    Benjamin Lee
    David Lee
    Lyle Takacs

Appendix II – Schematics
    Top Pipeline Schematic
    Bottom Pipeline Schematic
    Cache Block Schematic
    Decode Block Schematic
    Execute Block Schematic
    Instruction Cache Block Schematic
    Memory Block Schematic
    Superscalar Datapath (Lower Right) Schematic
    Superscalar Datapath (Upper Left) Schematic
    Superscalar Datapath (Lower Left) Schematic
    Superscalar Datapath (Upper Right) Schematic
    Superscalar Datapath (Overall) Schematic

Appendix III – Verilog

Appendix IV – Testing
    Issue Unit
    Forwarding
    Branch Prediction
    General

Abstract

The purpose of this lab is to enhance the 5-stage pipelined processor from previous lab work with a major sub-project and one or more minor sub-projects, as stated in the lab specifications. Our team selected a 2-way superscalar processor with two pipelines, one instruction stream, and one cache. The team also implemented dynamic branch prediction with a branch target buffer and a branch history table.

The performance gains achieved by a superscalar processor result from the ability to exploit instruction level parallelism and achieve a greater instruction throughput than would be possible in our previous implementations of our pipelined processor. The ability to execute instructions in parallel is limited by the ability to issue instructions in parallel. A significant design issue and functional component of our superscalar processor is an issue unit that checks dependencies between instructions and determines the number and sequence of instructions to be executed on the two pipelines. In addition to the implementation of a dual issue fetch/decode unit, the two pipelines must also allow for forwarding within each pipeline as well as forwarding between pipelines.

The number of delay slots increases with a superscalar implementation, compounding the effects of data and control hazards. This provides the motivation for our minor sub-project: branch prediction is intended to reduce the performance impact of the extra delay slots. The branch target buffer stores the target addresses of past branch instructions so that the target address is available to the PC in the fetch stage. The branch history table records whether past branches were taken so that the taken signal is available to the PC source mux in the fetch stage. Together, these buffers reduce the number of delay slots and the performance penalty due to control hazards.

Static branch prediction is also enabled as a result of a stream buffer supplying instructions to the issue unit. In the event that a branch is issued, the instructions in the buffer will continue to be supplied to the issue unit. In the event that a branch is taken, the stream buffer will be flushed. The entire scheme represents an implementation of static branch prediction such that a branch is always predicted not taken and the stream buffer is flushed when a branch is taken.

The final result of this lab is a five-stage two-way superscalar processor implemented as a Harvard architecture. The design of the pipeline stages was derived from our past implementations of these stages. These stages have been grouped into schematic blocks so that the final superscalar processor could be assembled in schematic. The new forwarding, control signals, and logic for dual issue were implemented in Verilog. The design process included dependency checking and forwarding as well as dynamic branch prediction.

Division of Labor

Jack Kang
Design and implementation of the dual issue unit, including the FIFO, instruction cache, and issuer.
General testing and debugging in simulation and on the board.

Benjamin Lee
Design and implementation of the branch history table and branch target buffer.
Datapath assembly.
General testing and debugging in simulation.

David Lee
Design and implementation of the dual issue unit, including the FIFO, instruction cache, and issuer.
General testing and debugging in simulation and on the board.

Lyle Takacs
Design and implementation of the forwarding unit.
Modifications to the monitor module.
Datapath assembly.
General testing and debugging in simulation.

Strategy

Part I – Superscalar Processor

A. Superscalar Datapath

Superscalar Datapath (Lower Right) Schematic
Superscalar Datapath (Upper Left) Schematic
Superscalar Datapath (Lower Left) Schematic
Superscalar Datapath (Upper Right) Schematic
Superscalar Datapath (Overall) Schematic

Design – Superscalar Datapath

The superscalar datapath requires the duplication of the execute stage in the pipeline. Specifically, the processor has one instruction fetch stage, one instruction decode stage, two execute stages, one memory stage, and one write-back stage. Although instructions can execute in parallel, only one memory stage exists, which means that only one of the two instructions issued in parallel may access memory. The corresponding stage in the other pipeline has no functionality but must still propagate its data and control signals through a “memory” stage in order to keep the same number of stages in each pipeline. Since only one pipeline has a memory stage, there is still only one DRAM and two caches (instruction and data), and no modifications were necessary for the memory interface.

The final high level schematic includes all five stages of the pipeline modularized into several larger inclusive blocks:

Decode Stage Verilog

Decode Stage

The decode stage module is the symbol corresponding to the Verilog instantiations of a FIFO queue and the issue unit. The issue unit checks the dependencies between instructions prior to issuing them to the pipelines (a detailed discussion of dual issue follows). The FIFO queue buffers instructions awaiting issue so that, whenever possible, the pipelines always have instructions to execute. The queue hides from the pipelines the fact that instructions may be stalled and/or swapped by the issue unit, and so provides a near-continuous stream of decoded instructions. The decode stage also interfaces with the RegFile, in order to decode the instruction and fetch the corresponding operands, and with the instruction cache block, in order to receive the instructions from memory. The instruction cache block has been modified to fetch enough instructions to support the issue unit (a detailed discussion of the modified instruction cache block follows).

Decode Block
Execute Block

Decode Block

In prior implementations of our five-stage pipelined processor, a portion of instruction execution occurred in the decode stage. Specifically, a branch was resolved and its address calculated in this stage, and a shift instruction was also executed in this stage. The superscalar implementation of our processor required separating the decoding portion of this stage (now in the Decode Stage block) from the executing portion (now in the Decode Block). The Decode Block contains an extender and a shifter that was originally used for branch calculations but is now used to support the shift operation in the next pipeline stage. It also contains forwarding muxes for the decode stage so that values can be forwarded and used for branch target calculations. The actual branch calculations now occur in the issue unit, and these forwarded values are passed into the issue unit accordingly.

Execute Block

The execute block encapsulates the ALU, the forwarding muxes for ALU operands, the SLT unit, and the Shifter unit. This block is included in both the top and bottom pipelines.

Memory Block Schematic

Memory Block

The memory block encapsulates memory-mapped I/O, the data cache, and the associated read and write logic.

Top Pipeline
Bottom Pipeline

Pipeline Top

The top pipeline contains the three remaining pipeline registers (ID/EX, EX/MEM, MEM/WB). Between the ID/EX and EX/MEM registers is the execute block. Between the EX/MEM and MEM/WB registers is a single mux that combines the results from the execute block into a single value to forward into the decode and execute stages.

Pipeline Bottom

The bottom pipeline contains the three remaining pipeline registers (ID/EX, EX/MEM, MEM/WB). Between the ID/EX and EX/MEM registers are the execute block and two muxes that take in the original register values and the next PC as seen by a jump register instruction. These muxes provide support for a jump register instruction followed by an instruction that uses $31 (e.g. jr loop, addiu $31, $31, 4).

The memory block is located between the EX/MEM and MEM/WB registers. It is also supported by a new multiplexor that takes the executed values from the top pipeline and uses them as input to memory. This allows an instruction in the top pipeline to compute a value destined for the register file while a store of the same value occurs in the bottom pipeline (e.g. addu $1, $2, $3, sw $1, 0($0)).

Forward Verilog
Forward Schematic

Forward

The forwarding unit is positioned between the two pipelines, taking inputs from various pipeline stages to check for dependencies (a detailed discussion of the forwarding unit follows). It also outputs the select signals to both pipelines to control the forwarding in various stages.

Branch Translation Buffer Verilog
Branch History Table Verilog

Branch Translation Buffer

The branch target buffer connects to the decode stage block to provide the PC with a branch address in the fetch stage before the actual address is calculated in the decode stage (a detailed discussion of the branch target buffer follows). The target buffer holds the last target address of the same branch. The buffer is direct mapped and replaces entries upon a conflict miss. Our team implemented 32-entry, 256-entry, and 2048-entry buffers. We decided against the 2048-entry buffer since such a large module would take a significant amount of time to map to the board.

Branch History Table

The branch history table connects to the decode stage to provide the taken signal to the PC source multiplexer in the fetch stage before the actual comparator generates the true taken signal in the decode stage (a detailed discussion of the branch history table follows). The history table is a 2-bit predictor and employs hysteresis to improve prediction accuracy. Our team implemented 32-entry, 256-entry, and 2048-entry tables. Again, we decided against the 2048-entry table since such a large module would take a significant amount of time to map to the board.

Register File

The regfile was changed considerably. There had to be two sets of Ra and Rb outputs, one for the top pipeline and one for the bottom pipeline. Correspondingly, there were two sets of inputs selecting which registers to read. During writes, if both pipelines tried writing to the same register, the bottom pipeline would take precedence, as the bottom pipeline contained the later instruction. Additionally, there was a special write input port for register $31 and an output port for register $31. This was necessary so that we could run the jal instruction in the decode stage and immediately write to register $31. Otherwise, the forwarding unit would have had to come through the issuer in order to take care of instructions immediately after the jal that modify register $31 (since the address is calculated in decode). Writing register $31 this early created a WAW hazard: if another instruction that modified register $31 was already in the pipeline, it would overwrite the value written by the jal. We fixed this by disabling pipeline writes to register $31 for 2 cycles after a jal.
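The write rules described for the register file can be illustrated with a small behavioral model. This is only a sketch in Python, not the team's Verilog: the class name, method names, and the exact cycle at which the 2-cycle lockout starts are assumptions made for illustration.

```python
class DualPortRegFile:
    """Behavioral sketch of a register file shared by two pipelines (hypothetical names)."""

    def __init__(self):
        self.regs = [0] * 32
        self.r31_lockout = 0  # cycles left in which pipeline writes to $31 are ignored

    def read(self, ra, rb):
        """One Ra/Rb read pair; each pipeline would have its own pair of ports."""
        return self.regs[ra], self.regs[rb]

    def clock(self, top_write=None, bottom_write=None, jal_link=None):
        """Apply one cycle of writes. Each write is a (reg, value) tuple or None.

        jal_link is the return address arriving on the dedicated $31 write port.
        """
        lockout_active = self.r31_lockout > 0
        # Top is applied first and bottom second, so the bottom (later) instruction
        # wins when both pipelines write the same register.
        for write in (top_write, bottom_write):
            if write is None:
                continue
            reg, value = write
            if reg == 0:
                continue  # $0 is hard-wired to zero in MIPS
            if reg == 31 and lockout_active:
                continue  # older in-flight writes to $31 must not clobber the jal link
            self.regs[reg] = value
        if lockout_active:
            self.r31_lockout -= 1
        if jal_link is not None:
            self.regs[31] = jal_link
            self.r31_lockout = 2  # assumed timing: lockout covers the next 2 cycles
```

For example, after `clock(jal_link=0x400)`, a pipeline write to register 31 in either of the next two cycles is dropped, while a write in the third cycle goes through.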

Testing – Datapath

Testing of the superscalar datapath was only possible after the issue unit was attached to the two pipelines. These two components were thoroughly tested with various small assembly files to test specific cases in conjunction with more comprehensive tests. A detailed discussion of datapath testing may be found in the part on general testing (III).

B. Instruction Cache Block

Instruction Cache Block Schematic
Instruction Cache Controller Verilog
Two Way Cache Verilog

Design – Instruction Cache Block

Because the superscalar processor must issue 2 instructions per cycle, the instruction cache was changed to have a 128-bit output, or 4 instructions per cycle. The cache line was striped across four 32-bit SRAM blocks so that, upon reading, all four instructions could be accessed simultaneously. The instruction cache fetches 4 instructions each time because this keeps the processor busy while the cache is fetching new instructions. It also allows the issue unit to look ahead in the instruction stream: if it sees that a branch or jump is coming and the delay slot is not in the current block, it can begin fetching the block that contains the delay slot.

The instruction cache is now direct-mapped, as opposed to the two-way set associative cache in Lab6, because the instruction stream is in general sequential, so a two-way set associative cache would not give us much benefit. Rather, it would complicate the cache, because we would need muxes that take in 256 bits and output 128 bits, which would slow down the cache.
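The striping and direct-mapped lookup described above can be sketched as follows. This is a behavioral Python illustration under stated assumptions: `NUM_LINES` is a hypothetical size (the report does not give the cache capacity), the names are invented for the example, and per-word valid bits are omitted.

```python
NUM_LINES = 64        # hypothetical; the report does not state the number of cache lines
WORDS_PER_LINE = 4    # 128-bit line = four 32-bit instructions


def split_address(addr):
    """Split a byte address into (tag, line index, word-within-line)."""
    word = (addr >> 2) & (WORDS_PER_LINE - 1)  # selects one of the four SRAM banks
    line = addr >> 4                           # 16 bytes per cache line
    return line // NUM_LINES, line % NUM_LINES, word


class DirectMappedICache:
    def __init__(self):
        # One bank per word position, so a whole line can be read in parallel.
        self.banks = [[0] * NUM_LINES for _ in range(WORDS_PER_LINE)]
        self.tags = [None] * NUM_LINES

    def fill_word(self, addr, instr):
        """Load a single instruction, as in a word-by-word cache fill."""
        tag, index, word = split_address(addr)
        self.tags[index] = tag
        self.banks[word][index] = instr

    def read_block(self, addr):
        """Return the four instructions of the line holding addr, or None on a miss."""
        tag, index, _ = split_address(addr)
        if self.tags[index] != tag:
            return None
        return [bank[index] for bank in self.banks]
```

Because each line has exactly one possible location, a read needs only one tag compare and no way-select mux on the 128-bit output, which is the simplification the direct-mapped choice buys.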

Testing – Instruction Cache Block

To test the instruction cache, we loaded up the instruction cache with blocks of instructions by telling the instruction cache to load each instruction individually. Then we tried to read from the cache by giving it an address and reading the block. To make sure that no two blocks overlapped, we loaded several blocks next to each other and read out the final block values.

C. Issue Unit

Issue Unit Verilog
FIFO Verilog

Design – Issue Unit

The issuer is broken into 2 pieces: a FIFO buffer and an issue unit. The FIFO can store up to 8 instructions (2 blocks of instructions), which it grabs from the instruction cache. It stores 2 blocks because if a branch or jump is the last instruction of a block, its delay slot must be in the next block. By having 2 blocks in the FIFO, we can be sure that when the branch or jump instruction reaches the top of the buffer, the delay slot will have been fetched. The top 2 instructions in the FIFO are visible to the issue unit at all times. The issue unit analyzes these 2 instructions and, upon deciding to issue zero, one, or two instructions to the rest of the datapath, removes the same number of instructions from the top of the queue.

The FIFO is implemented as 8 registers with 2 pointers. One pointer points to the current top entry in the FIFO. This and the next entry are the visible entries to the issue unit. The other pointer keeps track of the next empty slot. This slot, plus the 3 slots immediately after it, are the 4 slots that are filled when 128 bits are read from the cache. Finally, there is a counter that keeps track of the number of valid entries in the FIFO. When this drops to a certain level, then the FIFO unit will grab more instructions from the cache. The command to grab more instructions comes from the issue unit, which is also responsible for sending the proper pc to the instruction cache.
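A behavioral sketch of the FIFO just described may make the pointer and counter bookkeeping concrete. It is written in Python for illustration; the class and method names are hypothetical, and the refill threshold (refill whenever at least four slots are free) is an assumption, since the report only says "a certain level".

```python
class InstructionFifo:
    DEPTH = 8   # eight instruction registers
    BLOCK = 4   # one 128-bit cache read fills four slots

    def __init__(self):
        self.slots = [None] * self.DEPTH
        self.head = 0    # current top entry (first instruction visible to the issue unit)
        self.tail = 0    # next empty slot
        self.count = 0   # number of valid entries

    def wants_refill(self):
        """Assumed threshold: ask for a block whenever four slots are free."""
        return self.count <= self.DEPTH - self.BLOCK

    def push_block(self, block):
        """Write four instructions starting at the tail pointer."""
        if len(block) != self.BLOCK or not self.wants_refill():
            raise ValueError("need exactly four instructions and four free slots")
        for instr in block:
            self.slots[self.tail] = instr
            self.tail = (self.tail + 1) % self.DEPTH
        self.count += self.BLOCK

    def peek_pair(self):
        """Return the two entries visible to the issue unit (None if not valid)."""
        first = self.slots[self.head] if self.count >= 1 else None
        second = self.slots[(self.head + 1) % self.DEPTH] if self.count >= 2 else None
        return first, second

    def pop(self, n):
        """Remove zero, one, or two instructions from the top of the queue."""
        if not 0 <= n <= min(2, self.count):
            raise ValueError("can only remove 0, 1, or 2 valid entries")
        self.head = (self.head + n) % self.DEPTH
        self.count -= n
```

The wrap-around arithmetic on `head` and `tail` is what lets a new block land in slots 0 to 3 while the issue unit is still draining slots 4 to 7.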

The issue unit is basically a controller that grabs two instructions from the top of the FIFO and analyzes them. There are a few cases worth noting:

If there are no structural hazards or data forwarding hazards between the two instructions then we can issue 2 instructions, with the first instruction in the top pipeline and the second in the bottom pipeline.

Because we only have one memory stage, memory instructions (lw and sw) must be issued in the bottom pipeline, which has the memory stage. If we get two memory instructions in a row, we can only issue one at a time. If, however, the memory instruction comes with a 2nd instruction that has no dependency on the memory instruction, we can swap them so that the later instruction executes in the top pipeline and the memory instruction goes into the bottom pipeline. There was a small optimization done here in the case of two load instructions loading to the same register. If we saw that case, we would only issue the 2nd instruction and discard the first.

Branches and jumps must be issued in the top pipeline with their delay slot instruction in the bottom pipeline. This preserves the number of delay slots after a branch or jump instruction: no matter where the branch or jump is located in a block of instructions, the number of delay slots after it is always exactly one.

If there are any dependencies between 2 instructions, only one instruction is issued, otherwise 2 instructions are issued. A signal is sent back to the FIFO to remove one, two, or zero instructions and output the next pair of instructions.
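The pairing cases above can be summarized as a small decision function. The following Python sketch is a simplification under stated assumptions: the `Instr` record and `choose_issue` name are hypothetical, only read-after-write dependencies are checked, and special cases mentioned elsewhere in this report (two loads to the same register, a store paired with the instruction producing its data) are not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    kind: str                    # "alu", "mem" (lw/sw), or "ctrl" (branch/jump)
    dest: Optional[int] = None   # register written, if any
    srcs: Tuple[int, ...] = ()   # registers read


def depends_on(later, earlier):
    """True if `later` reads the register that `earlier` writes ($0 never counts)."""
    return earlier.dest not in (None, 0) and earlier.dest in later.srcs


def choose_issue(first, second):
    """Return (top, bottom, consumed) for the two instructions at the top of the FIFO."""
    if first is None:
        return None, None, 0
    if first.kind == "ctrl":
        # A branch or jump goes in the top pipeline with its delay slot in the bottom.
        if second is None:
            return None, None, 0   # wait until the delay slot has been fetched
        return first, second, 2
    # Memory instructions may only use the bottom pipeline.
    alone = (None, first, 1) if first.kind == "mem" else (first, None, 1)
    if second is None or second.kind == "ctrl":
        return alone               # the branch is issued next cycle with its delay slot
    if depends_on(second, first):
        return alone
    if first.kind == "mem" and second.kind == "mem":
        return alone               # only one memory stage
    if first.kind == "mem":
        return second, first, 2    # swap so the memory instruction lands in the bottom
    return first, second, 2
```

The `consumed` value corresponds to the signal sent back to the FIFO telling it how many instructions to remove.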

Testing – Issue Unit

In order to test the Issuer (issue unit and FIFO), a dummy instruction cache was created that simulated the instruction cache previously described. We could not use the real instruction cache because it required the whole processor to be assembled first, which would have made isolated testing of the issuer impossible. The fake cache allowed us to load our test code and check the output from the issuer without having to connect up the rest of the processor. The two most important things in testing the issuer are 1) testing that the issue unit and FIFO communicate properly with each other and 2) testing that all instruction hazards are handled before instructions are issued (i.e. that the correct number of instructions, zero, one, or two, is issued). The testing is done in the following order:

Tests and Expected Results

1) Test: Instructions that have no dependencies or structural hazards between them.
   Expected result: The issue unit should always issue 2 instructions/cycle.

2) Test: Instructions that only have data forwarding dependencies (i.e. the second instruction depends on the result of the first).
   Expected result: The issue unit should only issue 1 instruction/cycle.

3) Test: Memory instructions.
   Expected result: Memory instructions are always issued in the bottom pipeline. Where possible, the issue unit swaps the memory instruction with the instruction after it so that it can still issue 2 instructions. If there are 2 memory instructions, only the first one is issued. If a memory instruction is paired with a following instruction that writes to the same register, the memory instruction is not issued. Similarly, if 2 load instructions write the same register, the later load is allowed to run and the previous load is ignored.

4) Test: Jumps (j, jal, and jr). Jumps were placed in different parts of the block so that we could test that a jump is properly issued regardless of its position. All sorts of instructions were placed in the delay slot.
   Expected result: A jump must always be issued with its delay slot. If the delay slot is not present, no instruction can be issued. If the jump is in the second instruction slot, only the first instruction is issued, and on the next cycle the jump is issued with its delay slot. We also made sure that the issue unit outputs nops if we jump to a block that is not in the FIFO, and waits for the new block to be written in.

5) Test: Branches (beq, bne, bltz, bgtz). Branches were placed in different parts of the block so that we could test that a branch is properly issued regardless of its position. All sorts of instructions were placed in the delay slot.
   Expected result: Similar to a jump. A branch must always be issued with its delay slot. We also made sure that the issue unit outputs nops if we mispredict a branch, and waits for the new block to arrive. If the prediction was correct, it simply continues to issue.

Index of Issue Unit Tests

Tests used:
Part 1: noDependecy.s
Part 2: forwardDependency.s
Part 3: swlwNoDepend.s, swlwDepend1.s, swlwDepend2.s, swlwDepend3.s
Part 4: jump.s, simpleJump.s, harderJump.s
Part 5: simpleBranch.s, branch.s, beq.s, bgez.s, bltz.s, bne.s

D. Forwarding

Forward Verilog

Design – Forwarding

There are two types of hazards in our superscalar datapath: control hazards and data hazards. We avoided structural hazards by adding the necessary hardware. As in previous labs, this was a reasonable decision, as we have plenty of space available on the board. We avoid control hazards in the issue unit, which issues only instructions that have no control hazards and whose data hazards can be resolved by forwarding.

The forwarding unit checks for data hazards between stages in the pipeline and for data hazards between the two pipelines. The forwarding within “pipeline_top” and “pipeline_bottom” is essentially the same as the forwarding in the original five-stage pipeline. The jump register instruction needs a forwarded value if the return address register ($ra) was modified but not yet written into the register file. Thus, the PC to be written into $ra is propagated through the pipeline, and a multiplexor is added in the next PC logic to detect a data hazard and select the forwarded PC if necessary. The comparator logic in each pipeline also needs forwarded values in the event that a control instruction uses a register that has been modified but not yet written to the register file. Thus, a multiplexor chooses between the normal value, the value forwarded from the memory stage, and the value forwarded from the write-back stage. The ALU and shifter in each pipeline likewise need forwarded values in the event that these units use a register value that has been modified but not yet written to the register file. Again, a multiplexor chooses between the normal value, the value forwarded from the memory stage, and the value forwarded from the write-back stage. In summary, each pipeline must support forwarding the PC to the instruction fetch stage, forwarding memory or write-back data to the instruction decode stage, and forwarding memory or write-back data to the execute stage.

In addition to the forwarding for each pipeline, the superscalar processor must also account for data hazards between pipelines. For this reason, the comparator logic in one pipeline will also need the forwarded values from the other pipeline. Thus, the multiplexors in the decode stage have been expanded to include data forwarded from the memory or write-back stage of the other pipeline. The ALU and Shifter units in one pipeline will also need the forwarded values from the other pipeline. Thus, the multiplexors in the execute stage have been expanded to include data forwarded from the memory or write-back stage of the other pipeline. Finally, it is possible for the top pipeline to execute a value that needs to be stored into memory by a store instruction in the bottom pipeline (e.g. addu $1, $2, $3, sw $1, 0($0)). Thus, the forwarding unit passes the results of the ALU, SLT, and Shifter units from the top pipeline into a multiplexor which selects from these inputs or the regular value. The output of this multiplexor will then be the data stored into memory for that store word instruction.

The forwarding unit also handles forwarding priorities. In the event of multiple data dependencies, the instruction making the most recent modification to the register value should be forwarded. This essentially means the cases in the forwarding unit should check for a hazard in the memory stage first and the write-back stage second. Furthermore, when checking a particular stage, the forwarding unit should check for a hazard in the stage of the other pipeline first and the stage in the same pipeline second. The priority should be as follows:

1. Memory stage in the other pipeline
2. Memory stage in the same pipeline
3. Write-back stage in the other pipeline
4. Write-back stage in the same pipeline
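The priority order listed above amounts to scanning four candidate sources and taking the first one whose destination register matches. A minimal Python sketch follows; the function name, the dictionary layout, and the stage and pipeline labels are assumptions made for illustration and are not taken from the Verilog.

```python
def select_forward(src_reg, own_pipe, stage_writes):
    """Pick the forwarding source for src_reg as seen from pipeline own_pipe.

    stage_writes maps (stage, pipe) to (dest_reg, value) or None, where stage is
    "mem" or "wb" and pipe is "top" or "bottom". Returns ((stage, pipe), value),
    or None when the register file value should be used.
    """
    if src_reg == 0:
        return None  # $0 is constant in MIPS, so it is never forwarded
    other = "bottom" if own_pipe == "top" else "top"
    priority = [
        ("mem", other),     # 1. memory stage in the other pipeline
        ("mem", own_pipe),  # 2. memory stage in the same pipeline
        ("wb", other),      # 3. write-back stage in the other pipeline
        ("wb", own_pipe),   # 4. write-back stage in the same pipeline
    ]
    for key in priority:
        entry = stage_writes.get(key)
        if entry is not None and entry[0] == src_reg:
            return key, entry[1]
    return None
```

Checking the memory stage before the write-back stage is what guarantees that the most recent modification of a register is the one forwarded.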

There are several cases where a sequence of instructions will require a stall followed by the forwarding of data. The issue unit and the forwarding unit will be able to handle these cases independently since both control units monitor the same state of the processor.

Index of Forward Tests

Testing – Forwarding

The testing of the forwarding control units included sequences of instructions that tested the permutations of various types of instructions, including arithmetic/logical/compare/shift operations, control operations, and data operations. These were the same tests run on the original five-stage pipeline to test its forwarding. In addition to the single pipeline forwarding tests, we added new tests to check forwarding between pipelines. This amounted to generating instructions in sets of four and five that would be in the same buffer and would be issued in pairs. An example of such a test is <add $1, $2, $3>, <sll $0, $0, 0>, <sll $0, $0, 0>, <add $4, $1, $10>. Since the two add instructions have a data dependency and execute on consecutive cycles but on different pipelines (top and bottom), forwarding between the pipelines is needed. The cases below are the forwarding cases for a single pipelined datapath. Each of the forwarding cases (with the exception of Jr Program Counter and Arithmetic – Store) is duplicated to forward values to the other pipeline in the case of a superscalar processor.

Jr Program Counter
Pass register $31 through the pipeline such that the jump register instruction propagates through all five stages in the pipeline.
Pass register $31 from the decode stage into the execute stage such that operations can use $31 as an operand.

Arithmetic – Store
(Arithmetic -> Store) Forward Execute Result from Ex/Mem (top) to Memory (bottom)

Arithmetic – Arithmetic
(Arithmetic -> Arithmetic) Forward VAL from Ex/Mem to Ex stage
(Arithmetic -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
(Arithmetic -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage

Arithmetic – Logical
See Arithmetic – Arithmetic
(Arithmetic -> Logical) Forward VAL from Ex/Mem to Ex stage
(Arithmetic -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
(Arithmetic -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage
(Logical -> Arithmetic) Forward VAL from Ex/Mem to Ex stage
(Logical -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
(Logical -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage

Arithmetic – Shift
See Arithmetic – Arithmetic
(Arithmetic -> Shift) Forward VAL from Ex/Mem to Ex stage
(Arithmetic -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage
(Arithmetic -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage
(Shift -> Arithmetic) Forward VAL from Ex/Mem to Ex stage
(Shift -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
(Shift -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage

Arithmetic – Control
(Arithmetic -> Branch) Stall one cycle
(Arithmetic -> Nopx1 -> Branch) Forward VAL from Ex/Mem to Id stage
(Arithmetic -> Nopx2 -> Branch) Forward BusW from Mem/Wb to Id stage
(Branch -> Arithmetic) Delay slot executed

Arithmetic – Compare
See Arithmetic – Arithmetic
(Arithmetic -> Compare) Forward VAL from Ex/Mem to Ex stage
(Arithmetic -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage
(Arithmetic -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage
(Compare -> Arithmetic) Forward VAL from Ex/Mem to Ex stage
(Compare -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
(Compare -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage

Arithmetic – Data Transfer
(Arithmetic -> SW) Forward VAL from Ex/Mem to Ex stage
(Arithmetic -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage
(Arithmetic -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage
(LW -> Arithmetic) Stall one cycle
(LW -> Nopx1 -> Arithmetic) Forward BusW from Mem/Wb to Ex stage
(LW -> Nopx2 -> Arithmetic) Forward BusW from Mem/Wb to Id stage

Logical – Logical
See Arithmetic – Logical
(Logical -> Logical) Forward VAL from Ex/Mem to Ex stage
(Logical -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
(Logical -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage

Logical – Shift
See Arithmetic – Shift
(Logical -> Shift) Forward VAL from Ex/Mem to Ex stage
(Logical -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage
(Logical -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage
(Shift -> Logical) Forward VAL from Ex/Mem to Ex stage
(Shift -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
(Shift -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage

Logical – Control
See Arithmetic – Control
(Logical -> Branch) Stall one cycle
(Logical -> Nopx1 -> Branch) Forward VAL from Ex/Mem to Id stage
(Logical -> Nopx2 -> Branch) Forward BusW from Mem/Wb to Id stage
(Branch -> Logical) Delay slot executed

Logical – Compare
See Arithmetic – Compare
(Logical -> Compare) Forward VAL from Ex/Mem to Ex stage
(Logical -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage
(Logical -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage
(Compare -> Logical) Forward VAL from Ex/Mem to Ex stage
(Compare -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
(Compare -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage

Logical – Data Transfer
See Arithmetic – Data Transfer
(Logical -> SW) Forward VAL from Ex/Mem to Ex stage
(Logical -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage
(Logical -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage
(LW -> Logical) Stall one cycle
(LW -> Nopx1 -> Logical) Forward BusW from Mem/Wb to Ex stage
(LW -> Nopx2 -> Logical) Forward BusW from Mem/Wb to Id stage

Shift – Shift See Arithmetic – Shift

(Shift -> Shift) Forward VAL from Ex/Mem to Ex stage (Shift -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage (Shift -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage

Shift – Compare See Arithmetic – Compare

(Shift -> Compare) Forward VAL from Ex/Mem to Ex stage (Shift -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage

(Shift -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage

(Compare -> Shift) Forward VAL from Ex/Mem to Ex stage (Compare -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage (Compare -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage

Shift – Control See Arithmetic – Control

(Shift -> Branch) Stall one cycle (Shift -> Nopx1 -> Branch) Forward VAL from Ex/Mem to Id stage (Shift -> Nopx2 -> Branch) Forward BusW from Mem/Wb to Id stage (Branch -> Shift) Delay slot executed

Shift – Data Transfer See Arithmetic – Data Transfer

(Shift -> SW) Forward VAL from Ex/Mem to Ex stage (Shift -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage (Shift -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage

(LW -> Shift) Stall one cycle (LW -> Nopx1 -> Shift) Forward BusW from Mem/Wb to Ex stage (LW -> Nopx2 -> Shift) Forward BusW from Mem/Wb to Id stage

Compare – Compare See Arithmetic – Arithmetic

(Compare -> Compare) Forward VAL from Ex/Mem to Ex stage (Compare -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage (Compare -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage

Compare – Control See Arithmetic – Control

(Compare -> Branch) Stall one cycle (Compare -> Nopx1 -> Branch) Forward VAL from Ex/Mem to Id stage (Compare -> Nopx2 -> Branch) Forward BusW from Mem/Wb to Id stage (Branch -> Compare) Delay slot executed

Compare – Data Transfer
See Arithmetic – Data Transfer

(Compare -> SW) Forward VAL from Ex/Mem to Ex stage
(Compare -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage
(Compare -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage

(LW -> Compare) Stall one cycle
(LW -> Nopx1 -> Compare) Forward BusW from Mem/Wb to Ex stage
(LW -> Nopx2 -> Compare) Forward BusW from Mem/Wb to Id stage

Control – Control No dependencies

Control – Data Transfer No dependencies

Data Transfer – Data Transfer

(LW -> SW) Stall one cycle
(LW -> Nopx1 -> SW) Forward BusW from Mem/Wb to Ex stage
(LW -> Nopx2 -> SW) Forward BusW from Mem/Wb to Id stage
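The forwarding cases listed above follow a regular pattern, which can be summarized as a small lookup. The following Python sketch is only a behavioral restatement of the table, not the forwarding unit's Verilog; the function and table names are hypothetical, and it covers only the producer/consumer pairs that appear in the listing (anything else raises an error rather than guessing).

```python
# Behavioral restatement of the forwarding table; all names are illustrative.
ALU_PRODUCERS = {"Arithmetic", "Shift", "Compare"}     # result (VAL) is ready after Ex
EX_CONSUMERS = {"Arithmetic", "Shift", "Compare", "SW"}  # operand is needed in Ex

# Each tuple is indexed by the number of nops between producer and consumer.
ACTIONS = {
    ("alu", "ex"): ("Forward VAL from Ex/Mem to Ex stage",
                    "Forward BusW from Mem/Wb to Ex stage",
                    "Forward BusW from Mem/Wb to Id stage"),
    # Branches compare in the decode stage, so a back-to-back use must stall.
    ("alu", "id"): ("Stall one cycle",
                    "Forward VAL from Ex/Mem to Id stage",
                    "Forward BusW from Mem/Wb to Id stage"),
    # Load data only exists after the memory stage, so a back-to-back use must stall.
    ("load", "ex"): ("Stall one cycle",
                     "Forward BusW from Mem/Wb to Ex stage",
                     "Forward BusW from Mem/Wb to Id stage"),
}


def forwarding_action(producer, consumer, nops_between):
    """Return the listed action for `producer -> (nops) -> consumer`."""
    if nops_between not in (0, 1, 2):
        raise ValueError("the table only covers 0, 1, or 2 intervening nops")
    if producer in ALU_PRODUCERS:
        p = "alu"
    elif producer == "LW":
        p = "load"
    else:
        raise ValueError("producer not listed: %r" % producer)
    if consumer in EX_CONSUMERS:
        c = "ex"
    elif consumer == "Branch":
        c = "id"
    else:
        raise ValueError("consumer not listed: %r" % consumer)
    if (p, c) not in ACTIONS:
        raise ValueError("combination not listed in the table")
    return ACTIONS[(p, c)][nops_between]
```

For example, `forwarding_action("Shift", "Branch", 0)` gives the stall case, while `forwarding_action("Shift", "SW", 0)` gives the Ex/Mem to Ex forward.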

Part II – Branch Prediction

A. Branch Translation Buffer (BTB)

Branch Translation Buffer Verilog

Design – Branch Translation Buffer

The branch translation buffer is a 256-entry direct-mapped cache for branch target addresses. The 256-entry buffer is constructed from eight 32-entry translation buffers. The buffer takes in the PC of the first of four fetched instructions and the four 32-bit instructions, and outputs the predicted target PC, asserting a predictedHit signal if an entry’s tag matches the incoming PC. It also asserts a special flag if the fourth instruction of the set is a branch; the decode stage handles this case specially.

Given a set of four instructions, there can be at most two branch instructions. Furthermore, in the case where there are two branch instructions, one branch must be in the first two instructions and one branch must be in the last two instructions. Each 32-entry translation buffer is constructed from 32 btb-entry modules. Each btb-entry contains a 32-bit tag (used to match with the PC corresponding to the first of the four instructions) and two target addresses corresponding to the two possible branch instructions.

The 32-entry buffer contains 32 instances of the btb-entry module. Each module has a hit signal which is asserted when the incoming PC matches the tag of the corresponding entry and at least one of the four incoming instructions is a branch. The write enable for any particular entry is asserted only if the index (as determined by the lower 5 bits of the PC) corresponds to the entry number, the entry’s bank is enabled, and one of the four instructions is a branch. The output of the 32-entry buffer is a result of the outputs of all 32 entries going into a series of multiplexors. All 32 outputs are put into four 8-input multiplexors selected by PC [2:0]. The outputs of these multiplexors are put into one 4-input multiplexor selected by PC [4:3]. It is the output from this last multiplexor that is also the output of the 32-entry buffer.
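The two-level multiplexor selection and the write-enable condition described above can be modeled in a few lines. This is a minimal Python sketch rather than the Btb.v source; the function names are hypothetical, and it assumes that first-level mux k is fed entries 8k through 8k+7 in order, which the text does not state explicitly.

```python
def select_entry(outputs, pc):
    """Pick one of 32 entry outputs using the two mux levels described above."""
    if len(outputs) != 32:
        raise ValueError("a sub-buffer has exactly 32 entries")
    low = pc & 0b111          # PC[2:0] selects within each 8-input mux
    high = (pc >> 3) & 0b11   # PC[4:3] selects among the four mux outputs
    # Assumption: mux k receives entries 8k .. 8k+7, so mux k outputs entry 8k+low.
    first_level = [outputs[8 * k + low] for k in range(4)]
    return first_level[high]


def write_enable(entry_number, pc, bank_enabled, br):
    """Write enable for one btb-entry; `br` holds the four per-slot branch flags."""
    index = pc & 0b11111      # lower 5 bits of the PC
    return index == entry_number and bank_enabled and any(br)
```

Under that wiring assumption the mux tree is equivalent to indexing the 32 entries directly with PC[4:0], which is the same index used by the write enable.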

The 256-entry buffer is easily constructed from eight 32-entry buffers and multiplexors to select the outputs of these eight buffers. The wrapper around the 256-entry buffer takes each of the four fetched instructions and decodes it to determine whether it is a branch. The wrapper also takes the immediate field of each instruction and computes the branch target address. The ith “br” signal is asserted if the ith instruction is a branch. These four “br” signals are sent into the sub-buffers, which in turn pass them to the btb-entries.
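The wrapper's decode step can be illustrated with a short Python sketch. It is not taken from the team's Verilog: the function names are hypothetical, the set of opcodes treated as branches (beq, bne, and the REGIMM group containing bltz and bgez) is an assumption, and the target is computed with the standard MIPS rule (address of the delay slot plus the sign-extended immediate shifted left by two) on byte-addressed PCs.

```python
BEQ, BNE, REGIMM = 0b000100, 0b000101, 0b000001  # assumed branch opcodes


def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def decode_block(base_pc, instructions):
    """Return (br, targets) for a block of four 32-bit instructions.

    br[i] is True when instruction i is a branch; targets[i] is its
    computed target address, or None for non-branches.
    """
    br, targets = [], []
    for i, inst in enumerate(instructions):
        opcode = (inst >> 26) & 0x3F
        is_branch = opcode in (BEQ, BNE, REGIMM)
        pc = base_pc + 4 * i                      # address of instruction i
        offset = sign_extend16(inst & 0xFFFF) << 2
        target = (pc + 4 + offset) & 0xFFFFFFFF   # relative to the delay slot
        br.append(is_branch)
        targets.append(target if is_branch else None)
    return br, targets
```

For a block at 0x100 whose second instruction is `beq $0, $0, 3` (0x10000003), the sketch reports a branch in slot 2 with target 0x114.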

The 256-entry buffer was tested independently with test vectors. The inputs to the buffer were designed to simulate all possible branch locations in a set of four instructions. Our test vectors included tests for one branch in each of the four instruction slots and tests for two branches in various instruction slots. The functional tests verified that cache hits occurred correctly, that cache misses occurred correctly, and that the target addresses contained in the entries were replaced correctly.

Branch Translation Buffer TestBench

Testing – Branch Translation Buffer

The incremental testing for the branch translation buffer involved creating test vectors for sets of four instructions with a single beq instruction in various positions in the set (i.e. a beq instruction in the first, second, third, or fourth instruction slot). The other instructions in the set are nops (sll $0, $0, imm) with varying immediates in order to differentiate the nops. These tests were used to ensure the basic functionality of the branch translation buffer. The following are the test cases checked by the test vectors in the testbench (also noted in the testbench as comments).

1. Beq instruction in instruction slot 1
2. Beq instruction in instruction slot 2
3. Beq instruction in instruction slot 3
4. Beq instruction in instruction slot 4
5. Beq instruction in instruction slots 2 and 4, use top branch
6. Beq instruction in instruction slots 1 and 3, use bottom branch

Under the interface specified during the design phase, the issue unit would provide a signal to specify whether the top or bottom branch target address was to be predicted in the event that there were two branches in a set of four instructions.
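The six test cases above can be generated mechanically. The Python sketch below shows one way to build such four-instruction sets; it is not the team's testbench, the helper names are hypothetical, and the choice of tagging each nop with its slot number and using a branch offset of 4 is an illustration only. The encodings follow the standard MIPS formats for `beq` and `sll`.

```python
def nop(tag):
    """Encode `sll $0, $0, tag`; the 5-bit shift amount distinguishes the nops."""
    if not 0 <= tag < 32:
        raise ValueError("shift amount must fit in 5 bits")
    return tag << 6


def beq(rs, rt, offset):
    """Encode `beq rs, rt, offset` (offset in instructions, 16-bit signed)."""
    return (0b000100 << 26) | (rs << 21) | (rt << 16) | (offset & 0xFFFF)


def build_set(branch_slots, offset=4):
    """Four-instruction set with beq in the given 1-based slots, tagged nops elsewhere."""
    return [beq(0, 0, offset) if slot in branch_slots else nop(slot)
            for slot in range(1, 5)]


# One set per listed test case: single branches in slots 1-4, then the two-branch cases.
TEST_SETS = [build_set({1}), build_set({2}), build_set({3}), build_set({4}),
             build_set({2, 4}), build_set({1, 3})]
```

Each set can then be presented to the buffer together with a PC, and, for the two-branch sets, with the top/bottom select signal described above.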

B. Branch History Table

Branch History Table Verilog

Design – Branch History Table

The branch history table is a 256-entry direct-mapped cache for branch taken signals. The 256-entry history buffer is constructed from eight 32-entry history tables. The buffer takes in the PC of the first of four fetched instructions and a mispredict signal, which is asserted if the predicted taken signal doesn’t match the comparator’s taken signal in the decode stage, and outputs the predicted taken signal.

The 256-entry history table is constructed from eight 32-entry tables and multiplexors to select the outputs of these eight tables. Furthermore, the 32-entry tables are constructed from 32 bht-entries and multiplexors to select the outputs of these entries. Both sets of multiplexors are selected by bits in the PC.

The 32-entry table contains 32 instances of a bht-entry. The 32 entries are indexed by bits in the PC. Each entry implements a 2-bit dynamic branch prediction scheme where the prediction is changed only when a misprediction occurs twice. This team’s implementation involved creating a four-state FSM. The four states are described below:

Taken 1
The FSM resets to Taken 1. If the prediction was incorrect, the FSM transitions to Not Taken 1. If the prediction was correct, the FSM transitions to Taken 2. In this state, the prediction is that the branch will be taken.

Taken 2
The FSM transitions from Taken 1 to Taken 2 if a prediction was correct. If the prediction was incorrect, the FSM transitions to Taken 1. If the prediction was correct, the FSM stays in the same state. In this state, the prediction is that the branch will be taken.

Not Taken 1
The FSM transitions from Taken 1 to Not Taken 1 if a prediction was incorrect. If the prediction was incorrect, the FSM transitions to Not Taken 2. If the prediction was correct, the FSM transitions to Taken 1. In this state, the prediction is that the branch will not be taken.

Not Taken 2
The FSM transitions from Not Taken 1 to Not Taken 2 if a prediction was incorrect. If the prediction was incorrect, the FSM stays in the same state. If the prediction was correct, the FSM transitions to Not Taken 1. In this state, the prediction is that the branch will not be taken.
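The four state descriptions can be collected into a single transition table. The Python sketch below is a literal transcription of the transitions as written above, not of Bht.v; the names are hypothetical, and the `branch` argument reflects the testbench description, in which a deasserted branch signal leaves the FSM unchanged. Note that, as written, the Not Taken states move toward Taken on a correct prediction, whereas a conventional 2-bit scheme would move them toward Not Taken 2; keeping the table in one place makes it easy to compare against the Verilog.

```python
# state: (next state if the prediction was correct, next state if it was incorrect)
TRANSITIONS = {
    "Taken 1":     ("Taken 2",     "Not Taken 1"),
    "Taken 2":     ("Taken 2",     "Taken 1"),
    "Not Taken 1": ("Taken 1",     "Not Taken 2"),
    "Not Taken 2": ("Not Taken 1", "Not Taken 2"),
}
RESET_STATE = "Taken 1"


def predicts_taken(state):
    """Both Taken states predict taken; both Not Taken states predict not taken."""
    return state.startswith("Taken")


def step(state, mispredict, branch=True):
    """Advance one bht-entry by one update from the decode stage."""
    if not branch:
        return state  # non-branch instructions do not update the entry
    correct_next, incorrect_next = TRANSITIONS[state]
    return incorrect_next if mispredict else correct_next
```

Starting from Taken 2, two consecutive mispredictions are needed before `predicts_taken` flips: the first moves the entry to Taken 1 and the second to Not Taken 1.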

Branch History Table TestBench

Testing – Branch History Table

The 256-entry table was tested independently with test vectors. The inputs to the buffer were designed to simulate a sequence of predictions followed by signals from the decode stage signaling correct and incorrect branches. These tests were used to verify that the entries are updated correctly on a misprediction and that the state transitions are correct depending on correct/incorrect predictions.

The testbench for the branch history table checks the basic functionality and semantics of the four-state finite state machine. There is a series of three program counters, each with different “mispredict” and “branch” signals. A simulated “mispredict” signal from the decode stage informs the FSM whether or not the prediction was correct and causes a state change in the FSM. A simulated “branch” signal indicates whether or not the instruction at the given program counter is a branch. Two of the program counters were provided different mispredict signals over time. The third program counter was provided a deasserted branch signal, testing that the FSM would not change regardless of this program counter’s mispredict signal. All three instructions are interleaved to simulate repeating loops.

PC              Test Case
32’h00000002    A series of branches that mispredicts on a “random” basis
32’h00000004    A series of branches that mispredicts on an alternating basis
32’h00000008    A series of non-branch instructions

C. Static Branch Prediction

Stream Buffer Verilog

Static branch prediction is enabled by the stream buffer. The issue unit essentially takes the fetched instructions from the stream buffer and performs the data dependency analysis. If the issue unit issues a branch, the instructions following the branch remain in the buffer and will be issued unless the branch is taken. Thus, the stream buffer allows static branch prediction, always predicting the branch is not taken. If the branch is taken, the stream buffer will be flushed and the instructions immediately following the branch instruction would not be executed.
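The predict-not-taken behavior of the stream buffer can be illustrated with a small simulation. This Python sketch is a simplification and not the team's stream buffer: it issues one instruction per step (ignoring dual issue), assumes a buffer depth of 4, and all names are hypothetical. It does model the single delay slot required by the project specification.

```python
from collections import deque


def run(program, taken_targets, depth=4):
    """Simulate sequential issue with predict-not-taken and one delay slot.

    program: list of instruction names, indexed by position.
    taken_targets: maps the index of each taken branch to its target index.
    Returns (executed, flushed) lists of instruction names.
    """
    buffer = deque()          # indices of fetched, not yet issued instructions
    fetch_pc = 0
    executed, flushed = [], []

    def refill():
        nonlocal fetch_pc
        # Predict not taken: keep fetching the fall-through path.
        while len(buffer) < depth and fetch_pc < len(program):
            buffer.append(fetch_pc)
            fetch_pc += 1

    refill()
    while buffer:
        index = buffer.popleft()
        executed.append(program[index])
        refill()
        if index in taken_targets:
            if buffer:
                executed.append(program[buffer.popleft()])  # delay slot still executes
            flushed.extend(program[i] for i in buffer)       # wrong-path instructions
            buffer.clear()
            fetch_pc = taken_targets[index]                  # restart fetch at the target
            refill()
    return executed, flushed
```

With no taken branches the program executes in order and nothing is flushed; with a taken branch, only the delay slot survives from the buffered fall-through path.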

The alternative to this form of static branch prediction would be the addition of extra delay slots which is not an option for this project. The project specifications require a single delay slot after each branch instruction.

Part III - General Testing

Index of General Tests

The team followed a policy of incremental testing, testing new modules as they were implemented before integrating them into the final datapath. The branch prediction components were tested individually to ensure basic functionality of the direct-mapped tables and, in the case of the branch history table, the correct functioning of the finite state machine. The decode stage was tested independently. The FIFO queue buffering the fetched instructions was integrated with the issue unit and connected to a fake instruction cache. The fake instruction cache contained particular sequences of instructions that the issue unit would analyze before issuing to the pipeline. The tests involved observing the issued instructions to ensure that no pairs of instructions issued had data dependencies that could not be handled by the two pipelines.

After integrating the decode stage with the two pipelines, the first major test of the processor was the boot loader. The boot loader actually contains many corner cases requiring the pipelines to forward between themselves. After the processor executed the boot loader successfully, testing proceeded with the test files used for lab 5 to test forwarding and hazards. This same set of instructions would excite similar data hazards and test the forwarding logic between the pipelines. Additional tests were added to the original set of tests to ensure that certain cases that may not have occurred in a single pipeline would be tested for the superscalar pipelines. Lastly, the corner tests from lab 5 were run to ensure that those cases would still pass.

Although these general tests were comprehensive with regard to forwarding between the stages of the pipeline and forwarding between the two pipelines themselves, these tests failed to explicitly check the forwarding priorities. Ultimately, the various cases in the forwarding logic needed to be reordered such that each pipeline checks for a data hazard occurring in the other pipeline before checking for a data hazard occurring within the same pipeline. This is necessary since the instructions are essentially interleaved between the two pipelines. In the event of a data hazard, the most recent hazard would occur in the other pipeline instead of occurring in the same pipeline from the previous cycle. Refer to the design of the forwarding unit for details.
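The priority rule can be sketched as an ordered search over the in-flight destination registers. This Python fragment is illustrative only and is not the forwarding unit's logic: the names are hypothetical, and the way the two priorities combine (stage first, as noted in the Conclusion, then other pipeline before same pipeline within each stage) is an assumption.

```python
# Assumed combined priority: the memory stage beats write back, and within a
# stage the other pipeline is checked before the same pipeline.
PRIORITY = [
    ("other", "Ex/Mem"),
    ("same", "Ex/Mem"),
    ("other", "Mem/Wb"),
    ("same", "Mem/Wb"),
]


def pick_forward(src_reg, in_flight):
    """Return the (pipeline, stage) to forward from, or None to use the register file.

    in_flight maps (pipeline, stage) to the destination register number held
    in that pipeline register, or None if it writes no register.
    """
    if src_reg == 0:
        return None  # $0 always reads as zero and is never forwarded
    for source in PRIORITY:
        if in_flight.get(source) == src_reg:
            return source
    return None
```

When both pipelines hold a pending write to the same register in the same stage, the sketch selects the copy in the other pipeline, which matches the corrected ordering described above.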

On a general note regarding the test trace files, the nops used by our tests were <sll $0, $0, 14>. This differentiated the nops that we had entered into the code from the nops generated within the pipelines, which were <sll $0, $0, 0>.
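When reading a trace, the two kinds of nop can be told apart by decoding the shift amount. The Python sketch below is a hypothetical helper, not part of the team's test infrastructure; it relies only on the standard MIPS R-type layout for `sll`.

```python
def decode_sll_nop(inst):
    """Return the shift amount if `inst` is `sll $0, $0, shamt`, else None."""
    opcode = (inst >> 26) & 0x3F
    rs = (inst >> 21) & 0x1F
    rt = (inst >> 16) & 0x1F
    rd = (inst >> 11) & 0x1F
    shamt = (inst >> 6) & 0x1F
    funct = inst & 0x3F
    if opcode == 0 and funct == 0 and rs == rt == rd == 0:
        return shamt
    return None


def nop_origin(inst):
    """Classify an instruction word seen in a trace."""
    shamt = decode_sll_nop(inst)
    if shamt is None:
        return "not a nop"
    if shamt == 14:
        return "test nop"       # <sll $0, $0, 14>, written into the test code
    if shamt == 0:
        return "pipeline nop"   # <sll $0, $0, 0>, generated within the pipelines
    return "other nop"
```

The word 0x00000380 encodes <sll $0, $0, 14>, and the all-zero word encodes <sll $0, $0, 0>.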

Results

Our team was able to pass the dumb (corner2), dumber (prime), and quicksort (quicksort1) tests. In addition, our superscalar processor was able to pass all the previous tests, as well as our own tests, including but not limited to base, verify, and corner. All of these passed in simulation. Our clock cycle results for the more interesting tests are:

Dumb: 960 clock cycles
Dumber:
5th prime number: 1,124 clock cycles
80th prime number: 1,138,615 clock cycles

Quicksort1: 9,154 clock cycles. Quicksort 1 had 5,307 instructions, which means we had a CPI of 1.72 for a very sw/lw intensive program.

Our clock frequency is 14.358 MHz.

While the team did not have time to map it down to board, one try at synthesis revealed that a final push to board would have yielded numbers very similar to these:

Block Rams: 67 out of 160 (42%)
Slices: 6669 out of 19200 (35%)
Critical Path: 69.646 ns
Max Clock Frequency: 14.358 MHz
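The derived figures in this section follow directly from the reported counts. The short Python check below reproduces the arithmetic; the function names are illustrative, and the inputs are the numbers stated above.

```python
def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions


def max_frequency_mhz(critical_path_ns):
    """Maximum clock frequency implied by the critical path (1 / period)."""
    return 1000.0 / critical_path_ns


def utilization_percent(used, available):
    """Resource utilization rounded to a whole percent."""
    return round(100.0 * used / available)


quicksort_cpi = cpi(9154, 5307)          # about 1.72
f_max = max_frequency_mhz(69.646)        # about 14.358 MHz
block_ram = utilization_percent(67, 160)     # 42
slices = utilization_percent(6669, 19200)    # 35
```

The maximum clock frequency is simply the reciprocal of the 69.646 ns critical path, so the two figures carry the same information.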

Conclusion

The strength of this team’s implementation of the superscalar processor is modularity of the design. Compared to previous lab assignments that included all datapath components in a single schematic file, the size of the superscalar processor, with its two pipelines, required encapsulating the various components of our design into single schematic symbols when assembling the datapath. This design facilitates modular and incremental testing of the datapath components and also allows for a more compact design.

In general, designing the dual issue unit required the most time and effort, while implementing the forwarding for the superscalar pipelines was a relatively straightforward extension of the forwarding from lab 5. The dual issue unit required the consideration of many possible pairs of instructions that could be fetched and considered for issue. The forwarding unit required modifications to include forwarding paths between pipelines. The forwarding priorities also caused some problems that were ultimately resolved by forwarding from the other pipeline first and within the same pipeline second (in addition to forwarding from the memory stage first and the write back stage second). This subtlety was the cause of the correctness issues in running the TA’s tests, which were ultimately resolved by correcting the forwarding priorities.

The greatest difficulty encountered in this team’s work on the final project was the time constraint. Each member had put in a great deal of time on previous labs, thereby falling behind in their other classes. Work on the final project extended too far into the finals schedules of the members of the group, who had finals to prepare for. As a result, the design, implementation, verification and presentation of the project were completed in only one week. Given more time, we would have been able to put the design to the board, but given the time constraints, mapping the processor to hardware was not feasible. As it turned out, our team worked until Thursday, spent an additional 3-5 hours over the extra three days given, and resumed work on Monday night to resolve a correctness issue excited by Corner 2. The team was able to resolve all known correctness issues in simulation Tuesday afternoon after team members had completed their finals.

Given more time, the team would have tried to integrate the branch history table, the branch translation buffer, and the superscalar processor more effectively in order to implement the more optimal 2-bit dynamic branch prediction scheme over the static branch prediction scheme currently in operation.

The final result of this project is a two-way superscalar processor with five-stage pipelines and branch prediction. The team has verified full functionality in simulation for a variety of tests and programs with the exception of quicksort2. The team has not been able to map the processor to hardware due to time constraints.

Appendix I – Notebooks

Jack Kang

Total Hours: 98 hours over 9 days.

---------Thursday 5/8 12:00am

Goal: Think about how to implement superscalar

Have a decent idea, still some bugs to work out. 4 pages of work ready to be shown to the guys tomorrow.

Thursday 2:30 am

--------------------

Thursday 5/8 11:00am

Goal: Discuss design with partners

Ran into some problems with branch, pc, arbiter. Talked to kubi, we need to have fetch grab more than just two instructions per cycle. This is complicated and needs more working out...

Divided up the work, plan is to finish most things by Saturday for final testing.

Design is on paper.

Thursday 5/8 1pm

-----------------------

Thursday 5/8 1pm

Goal: Dual port the reg file!

Done and tested. Files in U:\newfilesforlab7

Thursday 5/8 2:20pm

-----------------------

Thursday 5/8 9:10pm

Goal: Design Issue Arbiter!

lots of thinking, lots of paper cases. We think we know how to do the buffer side of things. Will come in tomorrow to code, and to also do the issuing part of the state machine.

Friday 5/9 1:20am

--------------------

Friday 5/9 3:15pm

Goal: Finish coding special FIFO buffer part of issue arbiter

finished coding. Need to test. Breaking for dinner.

Friday 5/9 7:09pm

-------------------------

Friday 5/9 9:00pm

Goal: Finish testing FIFO. Begin issue unit.

-Finished issue unit with david. We have linked them together and have tested the easiest case of instructions with no dependencies or control or memory usage and it all works.
-We will be back tomorrow to test further, including memory and dependency. We have tests written up but it is hard to test due to the lack of a cache at the moment. We will be creating a new verilog module tomorrow to fake this.

Meanwhile, forwarding assumptions made:

addu $4, $5, $6
sw $4, 0($0)

should work because we will forward the $4 value from the top pipeline down to the bottom pipeline. (Ex->Mem)

Sat 5/10 2:30AM

-------------------

Sat 5/10 4pm

Goal: Finish Issue Unit

Jumps are giving us a very hard time. Offsets (if we don't jump to a mod 4) are a big issue. Also, the fifo has to know when to write in, and that part is difficult. Our prior implementation needs some work...

Integrated issue unit with fifo buffer. Fixed some bugs. Testing various cases to see if it issues correctly. We are doing this in steps, first doing just r-type, then r-type with dependencies, then sw/lw, and then sw/lw with dependencies, then finally control (branches and jump).

David and I are now splitting up the work, he is continuing to test branches and jumps while I create the PC for actually jumping.

We have almost completely tested the issue for non control instructions and believe that it all works. I am working on the jump/branch pc stuff, and am running into some problems. I am going home to sleep and coming back tomorrow to do that part.

Sun 5/11 4:00am

--------------

Sun 5/11 4:00pm

Goal: Fix up issue unit to work with jumps.

I think I figured out what to do last night at 5am. I wrote it down and will implement it. It should hopefully work.

Finished jumps! Had to make issue unit smarter, as it now tells the fifo when to load or not. Also, fifo had to be changed to be smarter when an invalidate signal came along.

Mon 5/12 1:45am

---------------

Mon 5/12 1:45am

Goal: Figure out how to do branches.

we've written in the branches code. We have begun testing, using waves, put together with a reg file. There are some problems that we need to fix tomorrow.

Mon 5/12 5:00am

----------

Mon 5/12 6:00pm

Goal: Finish doing branches

We have finished branches, also added in jal and jr. We are now putting everything together to begin testing.

We had to increase the regfile to always output the data that the jr is always pointing to. This will save us a cycle when we issue jr.

Also, we still have to look at break and stalls later on.

Mon 5/12 10:45pm

-------

Mon 5/12 10:45pm

Goal: Put everything together.

Lots of pins not connected in the schematic. This was a huge waste of time and the stuff should have been tested before it was given to us, especially since they told us it was tested.

the branch in the boot loader seems to be broken, it seems to not take it even though it should...

Coming back tomorrow at 11 to do this.

Tue 5/13 4:50am

-------------

Tue 5/13 11:00am

Goal: Debug stuff we put together

Problems with some timing, we can't get past boot loader. Fixed some bugs with the pointers in the fifo.

Stuff works except for break, and stalls.

Tue 5/13 4:00pm

-------------

Tue 5/13 9:30pm

Goal: Fix breaks and stalls

LOTS OF TIMING ISSUES! hard...we may have to rethink our design. We need to deal with instruction stalls and data stalls separately.

Wed 5/14 3:00am

--------------

Wed 5/14 11:45am

Goal: Fix data stalls vs. instruction stalls

write some stuff....

can we do a beq to a beq? w/ weird delay slots? + inst stall?!

beq
break
beq

Okay, verify and base work now. We still have to get corner to work.

There are problems with some branches going to the wrong place. This seems very peculiar as it has been working.....Specifically, the taken signal is not going high. This makes no sense...

MODELSIM is broken!...Tried to put to board during down time and it didn't work. Going home now.

The taken signal may have been an artifact of the broken modelsim.

Thu 5/15 4:00am

----------------

Thu 5/15 10:45am

Goal: Fix the processor to pass corner...

Time ran out, we do not do jal->jr correctly if $31 is modified too quickly within the instructions. Time to go do the presentation

Thu 5/15 1:30pm

--------------

Thu 5/15 4:20pm

Goal: Fix the jal-> jr case

Breaking for dinner.

Thu 5/15 5:00pm

-------------

Thu 5/15 8:00pm

Goal: FINISH FOR GOOD!

We can do base and corner.

final half ass has quicksort semi working, and one minor bug it seems in base. It is the closest we have to all tests working.

final_allbutQS has base working, and quicksort just not sorting in the correct order.

jr with a delay slot to something in its own block? Will that work??

We have problems in that we write in too many things when there is an offset. This is causing quicksort to not work.

final_imtired is the last state of what we have. Basically, we will write in an offset of 3 into block 1-4. Then we will have the next free entry be 3, rather than 4. This causes major problems!
final_imtired zip has a hacked version of fifo that was supposed to fix this by writing smarter but it didn't work on the first try.

Fri 5/16 7:10am

-------------------

Fri 5/16 9:30pm

Goal: fix the damn bugs

-changes what happens when we come back from an instruction stall to make it more restrictive
-changes how we hold the invalidate signal inside a stall
currently: corner almost works...
-added currentTopEntry hack to force it to 0 in certain cases.
these above this line are good for sure
-hold offset during numEntry case = 0, fixed branches where we would branch but hit a data stall during the branch
that should be good, but not fixed yet completely...
-hold invalidate during numEntry case =0

during numEntry case, why are we checking for branchstall2 and jumpstall2?????
I got rid of those, and now it works.

OH HEEEELLLLL YEA! quicksort works! 1:10am.

Even the worm tested, provided to us by Ray, works.

Fri 5/16 11pm

--------------

Mon 5/19 8:15pm

Goal: watch the race

There’s a bug…can’t seem to fix it. I don’t have any extra time to finish it right now, maybe tomorrow after my final I can fix it for whatever credit.

Mon 5/19 10:15pm

------------

Tue 5/20 10:30am

Goal: Fix the bug that was found yesterday

This is a forwarding bug that happens during stalls. Unfortunate, as I am not familiar with the forwarding unit. Found it, priority between pipelines was wrong. Additionally, we also have to forceload again during a stall if our first forceload gets cut off due to the stalling.

Tue 5/20 4:30pm

Benjamin Lee

Total Time: ~56.5 hours

Thu May 8 12:02:13 PDT 2003 (~1 hour)
Goals: Design review for the final project

Decided to implement superscalar with branch prediction
Two distinct components with [fetch, decode] and [execute, mem, write back].
Distributed work (Lyle and I are working on the [execute, mem, write back] section)

Thu May 8 13:05:21 PDT 2003

Thu May 8 15:23:10 PDT 2003 (~ 5 hours)
Goals: Complete the schematic implementation of the superscalar pipeline

Began building blocks for each pipeline stage
Completed the block implementation of the memory stage
Completed the block implementation of the top and bottom pipelines
Halfway through wiring up the new block components

Thu May 8 20:10:42 PDT 2003

Fri May 9 09:05:10 PDT 2003 (~2 hours)
Goals: Complete the wiring for the schematic implementation of the pipeline

Wiring completed. Began to write basic test vectors to test the later stages of the pipeline independently of the fetch and decode stage. We’ll attach the top and bottom pipelines to the earlier stages when Jack and David work out the dual-issue in the fetch stage.

Fri May 9 10:57:22 PDT 2003

Fri May 9 13:30:01 PDT 2003 (~ 12 hours)
Goals: Complete the testing for the dual pipelines.

Test vectors are complete with several basic tests. We may need to look at the forwarding again to account for the corner cases. There are significantly more corner cases with two instructions issued at the same time.

Began the implementation of the branch target buffer (BTB). Implemented a 2048 entry BTB. There are concerns about the size of the BTB when mapping to board. We may opt for a smaller BTB for our final implementation.

Sat May 10 01:30:22 PDT 2003

Sat May 10 09:31:21 PDT 2003 (~ 10 hours)
Goals: Complete testing for the BTB

The BTB had some syntax errors that Lyle and I fixed. The hit signal has been modified to include a condition for the branch. In order to get a hit, we must have a matching PC and tag, in addition to having a branch instruction in one of the four instructions in the buffer.

Began implementation of branch history table (BHT). The BHT is implemented with hysteresis (a four state FSM). The transitions are made on the clock cycle. The BHT takes the taken signal from the decode to make the appropriate transition in the FSM. NOTE: The predicted PC from the BHT will be available 2 mux delays (2 ns) after the PC changes. The PC is used as the selectors for the internal muxes. Therefore, the predicted PC cannot be available at the beginning of the clock cycle, but will be ready after 2ns.

Sat May 10 19:30:15 PDT 2003

Sun May 11 10:47:23 PDT 2003 (~2 hours)
Goals: Complete the debugging of the BHT.

The predicted PC from the BHT will be available 2 mux delays (2 ns) after the PC changes. The PC is used as the selectors for the internal muxes. Therefore, the predicted PC cannot be available at the beginning of the clock cycle, but will be ready after 2ns.

Sun May 11 12:43:21 PDT 2003

Sun May 11 13:42:17 PDT 2003 (~ 3 hours)
Goals: Check forwarding for the two pipelines in the superscalar implementation.

Added hardware and logic to handle forwarding of the reg $31. A code sequence that has a <jal loop; add $31, $31, 2> would need $31 forwarded to the execute stage.

Need to add logic for an <add $1, $2, $3; sw $1, 0($0)>. The ALU, SLT, Shift output from the EX/MEM pipeline register in the top pipeline need to be muxed in the lower pipeline (memory stage). The mux in the memory bottom stage takes: (1) MEM_RegB_bottom, (2) MEM_ALU_top, (3) MEM_SLT_top, (4) MEM_Shift_top. This will require added logic in the forwarding unit and the pipeline_top schematic.

New Files for Integration: (U:\final.zip)
Bht.v
Btb.v
Decode_block.sch
Execute_block.sch
M1x2.v
M32x2.v
M32x5.v
M32x8.v
Mem_block.sch
Pipeline_bottom.sch
Pipeline_top.sch
Superscalar.sch

Sun May 11 16:49:20 PDT 2003

Monday May 12 16:04:22 PDT 2003 (~2 hours)
Goals: Complete modifications to the forwarding unit in pipeline.

Finished modifications to the forwarding, top and bottom pipelines. These modifications include a multiplexor in the bottom pipeline to allow for forwarding of executed values from the top pipeline into the memory stage of the bottom pipeline to allow a store to execute.

Monday May 12 18:11:11 PDT 2003

Tuesday May 13 09:59:35 PDT 2003 (~3.5 hours)
Goals: Complete the integration of Jack/David’s issue unit with the superscalar pipelines.

Modified the forwarding unit to display the case and values being forwarded.

Tuesday May 13 13:32:22 PDT 2003

Tuesday May 13 15:10:21 PDT 2003 (~ 8 hours)
Goals: Begin drafting the report

Note: An error in the input to the branch history table. The BHT decTaken signal should actually be a mispredict signal, asserted only when the predTaken signal doesn’t match the taken signal from the comparator in the decode stage.

Tuesday May 13 23:01:11 PDT 2003

Wednesday May 14 10:02:52 PDT 2003 (~3 hours)
Goals: Complete power point slides for presentation

There are a few more slides than the eight Prof. Kubi requested, but we should be well under the time limit.
Completed the slides and have e-mailed the group.
Heading out to Berkeley to discuss and rehearse.
Completed the final version of the slides, pending performance data to be collected tomorrow.

Wednesday May 14 20:01:19 PDT 2003

Saturday May 17 11:14:23 PDT 2003 (~4 hours)
Goals: Integrate the branch prediction with the superscalar processor

Seems like there are fundamental issues in the interface between the processor and the branch prediction hardware.

Leaving to study for finals.

Saturday May 17 15:12:01 PDT 2003

David Lee

* Index ==============================================================
Estimated time spent in lab 115.5 hours

+ ==================================================================== Fri May 09 13:15:28 PST 2003

Goal: Finish making new instruction cache

new files
instboardRAM.v
instboardRAM_tf.tf
instdirectMappedCache.v
instdirectMappedCache_tf.tf
instcache_block.sch
instcache_controller.v
instcache_controller.sym
instdirectMappedCache.sym
IssueUnit.v
fifo.v
fifotest.v
issuetest.v

does addu forward to sw if they are issued at same time?

hopefully it should

Sat May 10 02:35:45 PST 2003
- ====================================================================
+ ==================================================================== Sat May 10 12:06:33 PST 2003

Goal: Finish testing and debugging issue unit and the fifo buffer

new files
fake_icache.v

fake_icache simulates the instruction cache and outputs blocks of instructions. Must put in your instructions manually into the if statements in the fake_icache.v.

In the fifo, the write signals were off by one cycle because it was using the curState to determine the values. I now set it to nextState so that it would output as soon as we want to write.

caused some glitching in the we signal. So changed back to curstate to rethink the problem.

rt register of branches assigned to rb in issue unit because they act more like the Rtype operand registers than a destination register. might have to do it for sw but not sure. the rw is set to 0.

Sun May 11 03:45:16 PST 2003
- ====================================================================
+ ==================================================================== Sun May 11 15:21:57 PST 2003

Goal: Finish testing issue unit with branching and implement PC stuff

for bgez we will stall one cycle if the previous instruction writes to register 1, because register 1 is used to determine whether it is bgez or bltz. Bltz uses 0 so it does not stall. This is no big deal. It just means we lose one cycle for bgez, but it is functionally correct. Otherwise we need to change the issue unit so that for bgez it doesn't check the rb == busy_rw or busy_lw_rw registers.

Issue unit works for branches too.

Need to figure out how to handle break.

for jal add case where the add depends on the jal, then we need to forward the jal calculated address to the add in the execute stage in the second pipeline.

figure out how to do branches
zzzZZZzzzZZZ
Mon May 12 04:57:22 PST 2003
- ====================================================================
+ ==================================================================== Mon May 12 18:31:18 PST 2003

Goal: get jal, branch, jr, break working and hopefully integrate tonite.

fixed the jal, branch, and jr, maybe. Only tested the branches. Not sure what to do with break. When do we stall?

okay we are going to put stuff together.

In the Decode Stage, the CLK and RESET signals were flipped in the symbol

MAKE SURE ALL PINS ARE CONNECTED IN SCHEMATIC AND SYMBOL!!! VERY IMPORTANT
WASTE OF 1 HR

Branch taken signal changes too quickly. Don't know why yet.

Tue May 13 04:44:11 PST 2003
- ====================================================================
+ ====================================================================
Tue May 13 11:03:59 PST 2003

***ignore***
In the fifo unit, in the logic for next free entry, the curstate was changing too quickly to idle before you add. If you move the delay into the if else statement it works, but it seems strange because it might be a violation of the hold time for the nextfreeentry register.

Ignore previous statement. It failed.
***ignore***

We added stalling to the nextstate and write enable logic so that we don't write or change state if we are stalling. Need to verify with Jack.
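A rough sketch of the stall gating described above, assuming a simple fifo with one state register. The class name, the depth, and the state names are hypothetical and are not taken from fifo.v.

```python
# Hypothetical cycle-level fifo model: a stall freezes both the state
# register and the write enable.
class FifoModel:
    def __init__(self, depth=4):
        self.depth = depth      # assumed capacity
        self.entries = []       # stored instructions
        self.state = "IDLE"     # assumed state encoding

    def clock(self, write_req, data, stall):
        """Advance one cycle and return the write enable actually applied."""
        if stall:
            return False        # hold state, suppress the write
        we = bool(write_req) and len(self.entries) < self.depth
        if we:
            self.entries.append(data)
        self.state = "WRITE" if we else "IDLE"
        return we


if __name__ == "__main__":
    fifo = FifoModel()
    print(fifo.clock(True, "add", stall=True), fifo.entries, fifo.state)
    print(fifo.clock(True, "add", stall=False), fifo.entries, fifo.state)
```

The point of the gating is that a write request arriving during a stall is neither lost nor applied. The request simply has to be presented again once the stall clears.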

We integrated all the pieces. It seems to be working for the most part, but there are lots of bugs, especially with the issue unit and the fifo talking with the rest of the pipeline. Main problems are branches and jumps.

Tomorrow we will fix all these bugs and hopefully integrate the branch prediction unit

Wed May 14 04:12:44 PST 2003
- ====================================================================
+ ====================================================================
Wed May 14 10:04:11 PST 2003

goal: must get everything working

Added new states to stall when there is a branch. The problem is that because the issue unit and the decode are done in the same stage, we have a very long critical path.

Problem 2: because we must wait to issue before we can read from the registers, we have trouble doing branching really quickly. Instead we must issue the branch, set a flag, and then on the next time around check the flag and know that the registers are ready for comparison. This is very inefficient but we don't have time to redesign.
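The issue-then-compare sequence in Problem 2 can be sketched as a two-cycle handshake. This is an illustrative model only: the class, the flag name branch_pending, and the use of an equality compare (as for beq) are assumptions.

```python
# Hypothetical model: a branch is issued in one cycle and resolved in the next.
class BranchIssue:
    def __init__(self):
        self.branch_pending = False  # the "flag" set when a branch is issued

    def step(self, is_branch, ra_val, rb_val):
        """Advance one cycle.

        Returns None while the branch is unresolved, otherwise the
        taken/not-taken result of an equality compare.
        """
        if self.branch_pending:
            # Second time around: register values are now valid.
            self.branch_pending = False
            return ra_val == rb_val
        if is_branch:
            # First time around: issue only, register values not read yet.
            self.branch_pending = True
        return None


if __name__ == "__main__":
    unit = BranchIssue()
    print(unit.step(True, 3, 3))  # issue cycle, nothing resolved yet
    print(unit.step(True, 3, 3))  # compare cycle
```

Every branch therefore costs at least one extra cycle in this model, which is the inefficiency the entry complains about.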

Problem 3: we have a forwarding problem with jal. Because of the way the datapath was originally designed, the jal cannot forward. It passes through different registers, so either we change the forwarding unit to take this into account, change the datapath (no time for this), or figure out some way to automatically write into the register file.

We decided to go for the register file change because it is simple and quick.

Seems to work fine.

Thu May 15 05:55:31 PST 2003
- ====================================================================
+ ====================================================================
Thu May 15 12:15:02 PST 2003

I need to sleep. I need lots of sleep. Cannot keep functioning like this.

In our implementation, jr acts more like a branch than a jump. It gets the register value after it is issued, so we must add new flags like branch and make jr act like a branch.

We stall unnecessarily on instruction stalls. This sucks because it ruins our speedup. We don't have much time to figure out how to fix it, but if we could, we would never stall unless the fifo buffer is empty. It would take quite a bit of redesigning or a terrible hack to get it to work.

all of our tests pass. Corner, verify, and base also work. Quicksort does not.

A change in quicksort causes quicksort to work but base to fail. We cannot keep fooling around with stalling. The biggest and only problem in our implementation is dealing with the stalls, because it ruins our issue timing.

This really bites. We can't finish right now. Jack and I are dead tired and cannot function much more.

We are going to sleep and maybe try only a few more hours tomorrow to get it working. We are so close.

Fri May 16 07:33:51 PST 2003
- ====================================================================
+ ====================================================================
Sat May 17 13:27:20 PST 2003

Yay, Jack got everything working. Now I must put in the branch prediction unit.

crap, the branch prediction unit does not interface properly with our issue unit. It depends too much on control from the issue unit. Also, we need to figure out how to bypass the fetch unit because we don't have enough control signals to switch between the blockpc and the predpc.

Okay, it doesn't look like the branch prediction unit is going to work yet. We had to add in new ports. Basically the branch prediction needs to look at the block and by itself check for branches and predict if it is going to branch.

We still have some sort of branch prediction because we always predict fall through, and if it is right we don't stall.
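A small sketch of what always predicting fall through costs. The function name and the one-cycle default penalty are assumptions, since the notes do not state the actual misprediction penalty.

```python
# Hypothetical cost model for a static "predict fall through" scheme.
def fall_through_stalls(branch_outcomes, penalty_cycles=1):
    """Count stall cycles when every branch is predicted not taken.

    branch_outcomes -- iterable of booleans, True if the branch was taken
    penalty_cycles  -- assumed stall cost of one wrong prediction
    """
    stalls = 0
    for taken in branch_outcomes:
        if taken:                 # prediction (fall through) was wrong
            stalls += penalty_cycles
    return stalls


if __name__ == "__main__":
    # Two not-taken branches cost nothing; the taken one pays the penalty.
    print(fall_through_stalls([False, False, True]))
```

Under this model the scheme only helps on branches that are not taken, so a loop whose back edge is usually taken would pay the penalty on almost every iteration.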

Sat May 17 15:15:15 PST 2003
- ====================================================================
+ ====================================================================
Sun May 18 21:52:22 PST 2003

Stupid jal WAW hazard.

The globalpc is missing an 'f' in the beginning. This did not affect anything because we recovered to the correct pc after we jump.

Can't fix jal WAW hazard

Mon May 18 02:04:56 PST 2003
- ====================================================================
+ ====================================================================
Mon May 18 11:24:31 PST 2003

Stupid jal WAW hazard again.

changed it so that if stalling, jaldont_1 still gets updated to jaldont in regfile

changed the stall from inststall to data_stall because only then do we actually stall the entire pipeline. Otherwise instructions after the jal will not get to commit their values.

All tests pass except Jack's jal test for the jal WAW. It mostly works but does not get to the last line. Don't know why, but I think it could be just the test code.

Mon May 18 13:37:33 PST 2003
- ====================================================================

Lyle Takacs

+==================================
Thu May 8 11:21:12 PDT 2003
work on creating blocks:
decode, execute, memory, pipeline

Top pipeline does not need any memory forwarding, but needs to pass data to bottom pipeline for memory stores
+==================================

+==================================
Fri May 9 10:48:23 PDT 2003
Build forwarding symbol block, 'duplicate' code for cross pipeline forwarding

Need forwarding to decode blocks
+==================================

+==================================
Sat May 10 11:01:47 PDT 2003
Put all blocks together into datapath, test

Still need PC, issue, inst cache, regfile

Done initial tests, now going to work on BHT and BTB with Ben

We went with 256 entries to refrain from thrashing too much and to minimize map time to the board.
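To illustrate why the entry count affects thrashing, here is a sketch of how a 256-entry direct-mapped table might be indexed. The bit positions are assumptions (word-aligned PC, low-order bits used as the index) and are not taken from btb.v or bht.v.

```python
# Hypothetical index/tag split for a 256-entry direct-mapped table.
ENTRIES = 256
INDEX_BITS = ENTRIES.bit_length() - 1  # 8 bits select one of 256 entries


def table_index(pc):
    """Drop the two byte-offset bits, then keep the low INDEX_BITS bits."""
    return (pc >> 2) & (ENTRIES - 1)


def table_tag(pc):
    """Remaining upper bits, used to tell aliasing branches apart."""
    return pc >> (2 + INDEX_BITS)


if __name__ == "__main__":
    for pc in (0x000, 0x3FC, 0x400):
        print(hex(pc), "-> index", table_index(pc), "tag", table_tag(pc))
```

In this scheme two branches 0x400 bytes apart share an entry and evict each other. A larger table spreads branches over more entries at the cost of more hardware to map.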

Done with BTB, but BHT needs a little more work and to be tested.
+==================================

+==================================
Mon May 12 10:25:54 PDT 2003
fix up monitor
clock cycle, more data info, better formatting

integrate issue unit - add regfile, cache, decode stage

test!
oops forgot to put in synthesis drams...
+==================================

+==================================
Tue May 13 11:12:43 PDT 2003
more monitor fixes
added more testing stuff, better stall handling

lots of testing! mostly all fixes with issue unit. Most detail in Jack and/or David's logs
+==================================

+==================================
Thu May 14 12:20:34
jal data should go through normal data path, not handled in wb stage
makes forwarding much more difficult

I don't think the mem input mux should be on right side of register
- it's ok, we only write into buffer so data doesn't need to be ready at posedge clk

I'm done... need to study for final tomorrow, but first need to finish take home final due tomorrow
rest of group cannot work past tonight so I hope they have good luck!
+==================================

Appendix II – Schematics

Top Pipeline Schematic

Bottom Pipeline Schematic

Cache Block Schematic

Decode Block Schematic

Execute Block Schematic

Instruction Cache Block Schematic

Memory Block Schematic

Superscalar Datapath Lower Right Schematic

Superscalar Datapath Upper Left Schematic

Superscalar Datapath Lower Left Schematic

Superscalar Datapath Upper Right Schematic

Superscalar Datapath Overall Schematic

Appendix III – Verilog

alu.v
arbiter.v
bht.v
branchCalc.v
btb.v
bts32.v
cache_controller.v
ClockDivider.v
comp.v
constants.v
controller.v
counter.v
CPU.v
debouncer.v
decode_stage.v
directMappedCache.v
extend.v
fifo.v
forward.v
inBoot.v
instcache_controller.v
instdirectMappedCache.v
IssueUnit.v
jconcat.v
lvlZeroBoot.v
m1x2.v
m32x2.v
m32x3.v
m32x5.v
m32x6.v
m32x8.v
m5x3.v
mem_write.v
memory_control.v
memoryio.v
monitor.v
pipeline_registers.v
rdBuffer.v
regfile.v
releaseUnit.v
shifter.v
sll2.v
SLT.v
ss_test.v
TwoWayCache.v
upedge_detector
wbBuffer.v

Appendix IV – Testing

Issue Unit Test Files

beq.s
bgez.s
bltz.s
bne.s
branch.s
forwardDependecy.s
forwardDependecy_dave.s
harderJump.s
jalWAW.s
jump.s
noDependecy.s
simpleBranch.s
simpleJump.s
swlwDepend.s
swlwDepend1.s
swlwDepend2.s
swlwDepend3.s
swlwNoDepend.s

fake_icache_beq.v
fake_icache_bgez.v
fake_icache_bltz.v
fake_icache_bne.v
fake_icache_dave_forwardDependency.v
fake_icache_harderJump.v
fake_icache_jump.v
fake_icache_nodependency.v
fake_icache_simpleBranch.v
fake_icache_simpleJump.v
fake_icache_swlsDepend1.v
fake_icache_swlsDepend2.v
fake_icache_swlwNodepend.v

Note: Trace files are very large. Open them with WordPad.

Forwarding Test Files
Forwarding Hazard Test Code
Forwarding Hazard Test Trace

Branch Prediction
Branch History Table Test-bench
Branch Translation Buffer Test-bench

General
Base Trace
Base I/O Output

Corner 1 Trace
Corner 1 I/O Output

Quicksort 1 Trace
Quicksort 1 I/O Output

Corner 2 Trace (aka Dumber)
Corner 2 I/O Output

Prime Trace (aka Dumb)
Prime I/O Output
